Optimization Theory, Methods, and Applications in Engineering 2013
Research Article, Open Access
Classification of Hospital Web Security Efficiency Using Data Envelopment Analysis and Support Vector Machine
Abstract
This study proposes a hybrid approach that combines data envelopment analysis (DEA) and support vector machine (SVM) for efficiency estimation and classification in web security. In the proposed framework, the factors and efficiency scores from the DEA models are integrated with SVM to learn patterns of web security performance and to provide further decision support. A numerical case study of hospital web security efficiency demonstrates the feasibility of this design.
1. Introduction
During the past decades, the Internet and the World Wide Web (WWW) have become prevalent platforms for information sharing and transformation. Consequently, web security management has become a major concern in for-profit as well as nonprofit organizations. Assurance of information systems security involves not only tangible costs but also intangible inputs, which makes it challenging to evaluate the performance of the investments. This section concisely introduces the basics of web security, data envelopment analysis (DEA), support vector machine (SVM), classification on web security, and the goals of this study.
1.1. Phishing and Web Security
Phishing [1–4] is a criminal activity employing both social engineering and technical subterfuge to acquire personal data such as usernames, passwords, and credit card numbers. Phishing has become a serious threat to information security and Internet privacy. An analysis of phishing attacks by the Financial Services Technology Consortium [3] produces a taxonomy consisting of six stages: planning, set up, attack, collection, fraud, and postattack.
The increasing popularity of web-based systems has resulted in phishing behaviors causing significant financial damage to both individuals and organizations. The statistics provided by the Anti-Phishing Working Group [4] show that the number of unique phishing websites detected during the fourth quarter of 2009 was 137,619 and that financial services and payment services were the most-targeted industry sectors. It is clear that financial gain is the main objective of phishing attacks. A survey by Gartner [5] reveals that between September 2007 and September 2008, more than 5 million people in the United States were affected by phishing attacks, with an average loss of US$ 351 per incident, and that the number of victims increased by 39.8 percent.
Protection against attacks and unauthorized access to sensitive information is vital on the Internet. Several technical anti-phishing solutions have been proposed on the server side or the client side. Server-side defenses employ Secure Sockets Layer certificates, user-selected site images, and other security indicators to help users verify the legitimacy of web sites, while client-side defenses equip web browsers with automatic phishing-detection features or add-ons (e.g., SpoofGuard) to warn users against suspected phishing sites [6]. In addition to the technical solutions, training users on anti-phishing techniques is a frequently recommended and widely used approach for countering phishing attacks. ISO and NIST security standards, which many companies are contractually obligated to follow, include security training as an important component of security compliance [7, 8]. These standards describe a three-level framework comprising awareness, training, and education. Users can be trained in several ways to understand how phishing attacks work [9–12].
1.2. Data Envelopment Analysis
Efficiency evaluation is a common issue in various domains and organizations and is critical to investment analysis and resource allocation. Data envelopment analysis [13–15] is a well-established efficiency evaluation technique and has been widely used in medical practice [16–19]. The DEA CCR ratio model developed by Charnes et al. [15] assesses the relative efficiency of decision-making units (DMUs) by maximizing the ratio of the weighted sum of outputs to that of inputs. Consider $n$ DMUs ($j = 1, \ldots, n$) that require assessment. Each DMU consumes $m$ inputs ($i = 1, \ldots, m$) and produces $s$ outputs ($r = 1, \ldots, s$), denoted by $x_{ij}$ and $y_{rj}$, respectively. The efficiency of $\mathrm{DMU}_k$ is computed as follows.
1.2.1. CCR Ratio Model
One has
$$\max\; h_k = \frac{\sum_{r=1}^{s} u_r y_{rk}}{\sum_{i=1}^{m} v_i x_{ik}} \quad \text{s.t.} \quad \frac{\sum_{r=1}^{s} u_r y_{rj}}{\sum_{i=1}^{m} v_i x_{ij}} \le 1,\; j = 1, \ldots, n; \quad u_r, v_i \ge \varepsilon > 0. \tag{1}$$
Based on the CCR ratio model, the objective function is maximized for every $\mathrm{DMU}_k$ individually. In the model, $x_{ik}$ and $y_{rk}$ are the $i$th input and $r$th output of $\mathrm{DMU}_k$; $u_r$, $v_i$ are the weights of the outputs and inputs, respectively; $\varepsilon$ is a small positive value which ensures that all weights are strictly positive. For computational convenience, the CCR ratio model is frequently transformed into a linear programming (LP) model by assuming [14] that
$$\sum_{i=1}^{m} v_i x_{ik} = 1. \tag{2}$$
Under (2), the objective reduces to the linear function $\sum_{r} u_r y_{rk}$, and each ratio constraint becomes $\sum_{r} u_r y_{rj} - \sum_{i} v_i x_{ij} \le 0$.
Notably, the solution space of the CCR LP model is smaller than that of the CCR ratio model due to the constraint (2); thus, the CCR LP model finds a local optimum of the ratio model, which comprises fractional terms [20].
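As an illustration of the LP form described above, the following is a minimal sketch of how the efficiency of one DMU might be computed with SciPy's `linprog`. The function name `ccr_efficiency`, the value of `eps`, and the toy data are assumptions made for this example; they are not taken from the case study.

```python
import numpy as np
from scipy.optimize import linprog


def ccr_efficiency(X, Y, k, eps=1e-6):
    """Efficiency of DMU k from the CCR LP (multiplier form).

    X has shape (m, n) and Y has shape (s, n); column j holds DMU j.
    """
    m, n = X.shape
    s = Y.shape[0]
    # Decision vector is [u_1..u_s, v_1..v_m]; linprog minimizes, so negate.
    c = np.concatenate([-Y[:, k], np.zeros(m)])
    # sum_r u_r*y_rj - sum_i v_i*x_ij <= 0 for every DMU j.
    A_ub = np.hstack([Y.T, -X.T])
    b_ub = np.zeros(n)
    # Normalization (2): sum_i v_i*x_ik = 1.
    A_eq = np.concatenate([np.zeros(s), X[:, k]]).reshape(1, -1)
    b_eq = np.array([1.0])
    bounds = [(eps, None)] * (s + m)  # u_r, v_i >= eps
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    if not res.success:
        raise RuntimeError(res.message)
    return -res.fun


# Hypothetical data: 2 inputs, 1 output, 4 DMUs (not from the case study).
X = np.array([[2.0, 4.0, 3.0, 5.0],
              [3.0, 1.0, 3.0, 4.0]])
Y = np.array([[1.0, 1.0, 1.0, 1.0]])
scores = [ccr_efficiency(X, Y, k) for k in range(X.shape[1])]
print([round(v, 4) for v in scores])
```

In this toy data set the first two DMUs lie on the efficient frontier, while the other two use more of both inputs for the same output and therefore score below 1.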
1.3. Support Vector Machine
Support vector machine (SVM) is a popular classifier and pattern recognition method based on statistical learning [21–23]. Suppose that training data $(\mathbf{x}_i, y_i)$, $i = 1, \ldots, l$, are given, where $\mathbf{x}_i \in \mathbb{R}^d$ are the input patterns and $y_i \in \{-1, +1\}$ are the related target values of the two-class pattern classification case. Then the standard linear support vector machine is as follows:
$$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \; \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{i=1}^{l}\xi_i \quad \text{s.t.} \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, l, \tag{3}$$
where $b$ is the location of the hyperplane relative to the origin. The regularization constant $C$ is the penalty parameter of the error term, which determines the tradeoff between the flatness of the linear function and the empirical error.
Hence, for such a generalized optimal separating hyperplane, the functional to be minimized comprises an extra term accounting for the cost of overlapping errors. In fact, the cost function (3) can be even more general, as given below:
$$J(\mathbf{w}, \boldsymbol{\xi}) = \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{i=1}^{l}\xi_i^{k},$$
subject to the same constraints. This is a convex programming problem that is usually solved only for $k = 1$ or $k = 2$, and such soft margin SVMs are dubbed L1 and L2 SVMs, respectively [23].
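To make the soft margin cost concrete, the sketch below evaluates it for a fixed hyperplane, using the smallest slack that satisfies each constraint, $\xi_i = \max(0, 1 - y_i(\mathbf{w}^T\mathbf{x}_i + b))$. The function name `soft_margin_objective` and the four sample points are hypothetical and serve only as an illustration.

```python
def soft_margin_objective(w, b, X, y, C=1.0, k=1):
    """Evaluate 0.5*||w||^2 + C*sum(xi_i**k) with the smallest feasible slacks."""
    margin_term = 0.5 * sum(wj * wj for wj in w)
    slack_term = 0.0
    for x_i, y_i in zip(X, y):
        score = sum(wj * xj for wj, xj in zip(w, x_i)) + b
        slack = max(0.0, 1.0 - y_i * score)  # zero when the margin is satisfied
        slack_term += slack ** k
    return margin_term + C * slack_term


# Hypothetical two-class data on a line, with hyperplane w = (1, 0), b = -1.
X = [(3.0, 0.0), (-1.0, 0.0), (1.5, 0.0), (1.0, 0.0)]
y = [+1, -1, +1, -1]
w, b = (1.0, 0.0), -1.0

obj_l1 = soft_margin_objective(w, b, X, y, C=1.0, k=1)  # L1 SVM cost
obj_l2 = soft_margin_objective(w, b, X, y, C=1.0, k=2)  # L2 SVM cost
print(obj_l1, obj_l2)
```

Only the last two points violate the margin here (slacks 0.5 and 1), so the two costs differ only in how those slacks are penalized.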
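For L1 SVMs ($k = 1$), the solution to the quadratic programming problem (3) is given by the saddle point of the primal Lagrangian shown below:
$$L_p(\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\alpha}, \boldsymbol{\beta}) = \frac{1}{2}\mathbf{w}^T\mathbf{w} + C\sum_{i=1}^{l}\xi_i - \sum_{i=1}^{l}\alpha_i\left[y_i(\mathbf{w}^T\mathbf{x}_i + b) - 1 + \xi_i\right] - \sum_{i=1}^{l}\beta_i\xi_i,$$
where $\alpha_i \ge 0$ and $\beta_i \ge 0$ are the Lagrange multipliers.
Due to the KKT conditions, a dual Lagrangian function has to be maximized as follows:
$$L_d(\boldsymbol{\alpha}) = \sum_{i=1}^{l}\alpha_i - \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l} y_i y_j \alpha_i \alpha_j \mathbf{x}_i^T\mathbf{x}_j \quad \text{s.t.} \quad 0 \le \alpha_i \le C, \quad \sum_{i=1}^{l}\alpha_i y_i = 0.$$
In learning a nonlinear classifier, we can define a kernel and the dual Lagrangian to be maximized as follows:
$$L_d(\boldsymbol{\alpha}) = \sum_{i=1}^{l}\alpha_i - \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l} y_i y_j \alpha_i \alpha_j K(\mathbf{x}_i, \mathbf{x}_j),$$
subject to the same constraints, where $K(\mathbf{x}_i, \mathbf{x}_j)$ is the kernel function, which implicitly maps the training vectors into a higher dimensional space. Popularly used kernel types include linear, polynomial, Gaussian (radial basis), and sigmoid [23].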
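The kernel types listed above can be written down directly. The sketch below uses the parameterization documented for LIBSVM (`gamma`, `coef0`, `degree`); the default parameter values and the sample points are assumptions chosen for illustration only.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u, v):
    return dot(u, v)


def polynomial_kernel(u, v, gamma=1.0, coef0=0.0, degree=3):
    return (gamma * dot(u, v) + coef0) ** degree


def rbf_kernel(u, v, gamma=1.0):
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(u, v, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * dot(u, v) + coef0)


def gram_matrix(kernel, points):
    """Matrix of K(x_i, x_j) values that enters the dual Lagrangian."""
    return [[kernel(p, q) for q in points] for p in points]


# Hypothetical sample points.
points = [(1.0, 2.0), (3.0, 4.0), (0.0, -1.0)]
K_rbf = gram_matrix(rbf_kernel, points)
print(K_rbf)
```

Replacing $\mathbf{x}_i^T\mathbf{x}_j$ with any of these functions changes the classifier without changing the form of the dual problem.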
1.4. Estimation and Classification on Web Security
Bose and Leung [2] investigate the anti-phishing preparedness of banks in Hong Kong by analyzing the websites of the registered Hong Kong banks. They compute the score for each bank by averaging the performance of the bank’s website in three aspects: accessibility, usability, and information content. Later on, Chen et al. [24] assess the severity of phishing attacks in terms of their risk levels and the potential loss in market value suffered by the targeted firms. They analyze 1030 phishing alerts released on a public database and financial data related to the targeted firms using a hybrid method that predicts the severity of the attack. Nishanth et al. [25] employ a two-stage soft computing approach for data imputation to assess the severity of phishing attacks, which involves the K-means algorithm and multilayer perceptron (MLP), probabilistic neural network (PNN), and decision trees (DT). Similar machine-learning techniques are employed by Lakshmi and Vijaya [26] for modelling the prediction task. The supervised learning algorithms, namely, multilayer perceptron, decision tree induction, and naïve Bayes classification, are used for exploring the results.
This study integrates DEA and SVM for web security efficiency estimation and classification for several key reasons. First, as medical informatics and security gain growing attention, a practical evaluation scheme is needed. We apply DEA to assess the relative efficiency of the hospitals as a pioneer study in the related field. Second, in addition to evaluating the current websites at one snapshot, further websites may need to be reviewed as a potential data set. An efficient and reliable classifier is essential to discriminate future data. Among the wide range of machine learning methods, SVM is relatively robust and convincing, so we integrate DEA and SVM to build the efficiency classification platform. Third, compared with related studies, this work emphasizes web security preparedness instead of the detection of potential web attacks. That is, we assess web security from a proactive view that is not limited to the technical aspect. The rest of this paper is organized as follows. Section 2 addresses the problem and methods. Section 3 presents the numerical case study of web security analysis in medical institutions. Finally, the concluding remarks are given in Section 4.
2. The Method
This section develops the hybrid DEA and SVM approach for efficiency classification. Consider $n$ DMUs ($j = 1, \ldots, n$) that require assessment. Each DMU consumes $m$ inputs ($i = 1, \ldots, m$) and produces $s$ outputs ($r = 1, \ldots, s$), denoted by $x_{ij}$ and $y_{rj}$, respectively. In the proposed framework, the factors and efficiency scores from DEA models are integrated with SVM to learn patterns of the DMUs’ performance and to provide further decision support. The procedure of the hybrid method is demonstrated in Figure 1.
Step 1 (efficiency evaluation). Based on (1) and (2), this study first evaluates the efficiency of the training data of the DMUs. The efficiency of $\mathrm{DMU}_k$ is defined as in (1).
Step 2 (efficiency tier analysis). This step iteratively separates the fully productive group ($T_t$, the DMUs with efficiency score 1 at iteration $t$) from the subproductive group (the DMUs remaining in SC). Using tier analysis [27], the DMUs are divided according to their efficiency scores. Then, the fully productive group is moved to the current tier, while the remaining DMUs are kept for further tier extraction. The algorithm is described as follows.
S1: set $t = 1$. Set SC = {all DMUs}. Set $T_t = \emptyset$.
S2: compute the efficiency scores of DMUs in SC.
S3: store the efficient DMUs with score 1 in $T_t$. Set SC = SC $-\, T_t$.
S4: determine if the extraction process continues. If yes, set $t = t + 1$ and go to S2; else set $T_{t+1}$ = SC and go to S5.
S5: for $i = 1$ to $t + 1$ output DMUs in $T_i$. end.
The procedure of this step is demonstrated in Figure 2. Each DMU will belong to one tier thereafter.
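The tier extraction loop of Step 2 can be sketched as follows. Two things here are assumptions made for illustration: the stopping rule (a fixed maximum number of tiers, since the text leaves the continuation test open) and the efficiency function, which uses the closed form of the CCR score for a single input and a single output, $(y/x) / \max_j (y_j/x_j)$, so that the example needs no LP solver.

```python
def single_ratio_efficiency(dmus, data):
    """CCR efficiency for one input and one output: (y/x) / max(y/x)."""
    ratios = {d: data[d][1] / data[d][0] for d in dmus}
    best = max(ratios.values())
    return {d: r / best for d, r in ratios.items()}


def tier_analysis(data, efficiency_fn, max_tiers, tol=1e-9):
    """Peel off the efficient DMUs tier by tier (assumed stop: max_tiers)."""
    remaining = list(data)  # SC = {all DMUs}
    tiers = []
    while remaining and len(tiers) < max_tiers - 1:
        scores = efficiency_fn(remaining, data)  # re-score only what is left
        efficient = [d for d in remaining if scores[d] >= 1.0 - tol]
        tiers.append(efficient)
        remaining = [d for d in remaining if d not in efficient]
    if remaining:
        tiers.append(remaining)  # the last tier keeps the rest
    return tiers


# Hypothetical DMUs as (input, output) pairs; not taken from the case study.
data = {"A": (2.0, 4.0), "B": (3.0, 6.0), "C": (4.0, 6.0),
        "D": (5.0, 5.0), "E": (4.0, 2.0)}
tiers = tier_analysis(data, single_ratio_efficiency, max_tiers=3)
print(tiers)
```

With several inputs and outputs, `efficiency_fn` would be replaced by a routine that solves the CCR LP for each remaining DMU.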
Step 3 (SVM learning). Here the classification schema is learned as $t_j = f(\mathbf{z}_j)$, where $t_j$ stands for the tier that $\mathrm{DMU}_j$ belongs to and $\mathbf{z}_j = (x_{1j}, \ldots, x_{mj}, y_{1j}, \ldots, y_{sj})$ is the vector combining the input and output factors of $\mathrm{DMU}_j$. Since there can be more than two tiers, this is a multiclass classification problem.
Step 4 (testing). The set of testing data will be used to validate the classification model.
In the next section, the case of hospital web security efficiency will be thoroughly studied for demonstrating the procedure developed above.
3. Case Study
This study investigates 91 medical institutes in Taiwan, among which 8 (8.79%) are medical centers, 45 (49.45%) are metropolitan hospitals, and 38 (41.76%) are local community hospitals. To assess the hospitals’ efficiencies in web security, 10 input factors in 3 categories (clarity of purpose, communication, and security framework) and 2 output factors (user satisfaction and progress of ISO 27001 accreditation) are defined in Table 1. Two professional users with web security expertise independently assess the web sites of these hospitals. All items are scored between 1 and 9 by predefined measures, where 9 means total consistency between the statement and the practice and 1 means the opposite. After observing the sample web sites, the reviewers score each site according to its level of conformity to the factors in Table 1. The scores from the two reviewers are averaged to give the input/output values for the DEA models. Most of the factors are nearly objective, except user satisfaction. Notably, the input variables are surrogate variables that represent the investments in web security.

Step 1 (efficiency evaluation). By (1) and (2), the efficiencies of the information security investment in hospitals are computed. The distributions of the results are summarized in Tables 2 and 3 and Figure 3.
 
The efficiency from the first round evaluation without tier analysis. 

Step 2 (efficiency tier analysis). In each iteration of the tier analysis algorithm in Section 2, DEA determines one fully productive group of hospitals and another group of subproductive hospitals. Then the fully efficient group is extracted as the current tier and the remaining hospitals proceed to the next iteration. Each DMU will belong to an efficiency tier thereafter. The members of the tiers are distributed as in Table 4 and Figure 4.

Steps 3 and 4 (SVM learning and testing). In this step, we use the LIBSVM library [28] to build the SVM classification models. Four types of kernel functions are tested: linear, polynomial, radial basis, and sigmoid. The testing accuracy is compared across different numbers of tiers and kernel function types. The results are shown in Table 5 and Figure 5.
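As a small illustration of the data preparation this step implies, the sketch below writes tier labels and factor vectors in LIBSVM's sparse text format, `<label> <index>:<value> ...`, with feature indices starting at 1. The helper name `to_libsvm_line` and the scores are hypothetical; the actual hospital data are not reproduced here.

```python
def to_libsvm_line(tier, features):
    """Format one DMU as '<label> <index>:<value> ...' with 1-based indices."""
    pairs = [f"{index}:{value:g}" for index, value in enumerate(features, start=1)]
    return " ".join([str(tier)] + pairs)


# Hypothetical DMUs: (tier label, [inputs..., outputs...]); values are made up.
dmus = [
    (1, [7.5, 8.0, 6.5, 9.0]),
    (2, [5.0, 6.5, 4.0, 3.5]),
]
lines = [to_libsvm_line(tier, features) for tier, features in dmus]
print("\n".join(lines))
```

If such lines were saved to a file, LIBSVM's `svm-train` could be run on it, with the `-t` option selecting the kernel type (0 linear, 1 polynomial, 2 radial basis, 3 sigmoid).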

The distribution of efficiencies in Figure 3 shows an unbalanced pattern in which most hospitals lie at the two extremes of the efficiency scale. After tier analysis, however, the distribution of tiers is nearly even, except that the second tier has the lowest number of hospitals.
From the results, the kernel functions with satisfactory prediction accuracy are linear (90.11% on average), radial basis (89.37% on average), and polynomial (87.18% on average), while the sigmoid function results in the lowest accuracy (55.31% on average). The linear, radial basis, and polynomial functions are therefore more appropriate kernel types for web security efficiency classification in this case. The pattern of tier distribution is possibly the reason why those three kernel types outperform the sigmoid function in SVM classification.
From the perspective of data tier refinement, 2 tiers give the highest accuracy (88.74% on average) while 4 tiers give the lowest (72.53% on average). The results show that fewer data tiers yield better classification accuracy, which is consistent with the general expectation that classification becomes harder as the number of classes grows.
4. Conclusions
This study proposes a hybrid data envelopment analysis and support vector machine approach for efficiency estimation and classification. For the feasibility of data collection, we use surrogate variables to construct the tangible and intangible input factors. For the output factors, we define an objective variable, ISO 27001 accreditation progress, and a subjective one, user satisfaction, which evaluate the efficiency from not only the technical perspective but also the users’ perception. In the proposed framework, the factors and efficiency scores from DEA models are integrated with SVM to learn patterns of expected web security performance. In the case study, linear and radial basis kernel functions show superior classification performance. Classification with fewer data tiers also obtains better accuracy in testing, which is consistent with the general expectation for multiclass problems. This design integrates performance estimation and pattern learning to provide decision support in medical information security.
Acknowledgments
The authors are indebted to the anonymous reviewers for their careful reading and suggestions to enhance the quality of this paper. This work is supported by the National Science Council, Taiwan (Grant nos. NSC 102-2410-H-259-039 and NSC 101-2221-E-259-030).
References
[1] M. Jakobsson and S. Myers, Phishing and Countermeasures: Understanding the Increasing Problem of Electronic Identity Theft, Wiley-Interscience, Hoboken, NJ, USA, 2007.
[2] I. Bose and A. C. M. Leung, “Assessing anti-phishing preparedness: a study of online banks in Hong Kong,” Decision Support Systems, vol. 45, no. 4, pp. 897–912, 2008.
[3] R. Wetzel, “Tackling phishing,” Business Communications Review, vol. 35, no. 2, p. 46, 2005.
[4] Anti-Phishing Working Group, Phishing Activity Trends Report, 2009, http://www.antiphishing.org/reports/apwg_report_Q4_2009.pdf.
[5] Gartner, “Gartner says number of phishing attacks on U.S. consumers increased 40 percent in 2008,” 2009, http://www.gartner.com/it/page.jsp?id=936913.
[6] N. Chou, R. Ledesma, Y. Teraguchi, and J. C. Mitchell, “Client-side defense against web-based identity theft,” in Proceedings of the Annual Network and Distributed System Security Symposium (NDSS '04), 2004.
[7] ISO, “ISO/IEC 27001:2005 - information technology - security techniques - information security management systems - requirements,” Tech. Rep., International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC), 2005.
[8] NIST, “NIST special publication 800-12: an introduction to computer security: the NIST handbook,” Tech. Rep., National Institute of Standards and Technology, 2004.
[9] eBay, “Spoof email tutorial,” 2010, http://pages.ebay.com/education/spooftutorial/.
[10] S. Srikwan and M. Jakobsson, “Using cartoons to teach internet security,” Tech. Rep., DIMACS, 2007.
[11] SonicWALL Phishing and Spam IQ Quiz, 2010, http://survey.mailfrontier.com/survey/quiztest.html.
[12] S. A. Robila and J. W. Ragucci, “Don't be a phish: steps in user education,” in Proceedings of the 11th Annual SIGCSE Conference on Innovation and Technology in Computer Science Education (ITiCSE '06), pp. 237–241, New York, NY, USA, June 2006.
[13] R. D. Banker, A. Charnes, and W. W. Cooper, “Some models for estimating technical and scale inefficiencies in data envelopment analysis,” Management Science, vol. 30, no. 9, pp. 1078–1092, 1984.
[14] A. Charnes and W. W. Cooper, “Programming with linear fractional functionals,” Naval Research Logistics Quarterly, vol. 9, pp. 181–186, 1962.
[15] A. Charnes, W. W. Cooper, and E. Rhodes, “Measuring the efficiency of decision making units,” European Journal of Operational Research, vol. 2, no. 6, pp. 429–444, 1978.
[16] J. Garcia-Lacalle and E. Martin, “Rural versus urban hospital performance in a “competitive” public health service,” Social Science and Medicine, vol. 71, no. 6, pp. 1131–1140, 2010.
[17] S. J. Chang, H. C. Hsiao, L. H. Huang, and H. Chang, “Taiwan quality indicator project and hospital productivity growth,” Omega, vol. 39, no. 1, pp. 14–22, 2011.
[18] M. Caballer-Tarazona, I. Moya-Clemente, D. Vivas-Consuelo, and I. Barrachina-Martínez, “A model to measure the efficiency of hospital performance,” Mathematical and Computer Modelling, vol. 52, no. 7-8, pp. 1095–1102, 2010.
[19] Y. Chen, J. Du, H. D. Sherman, and J. Zhu, “DEA model with shared resources and efficiency decomposition,” European Journal of Operational Research, vol. 207, no. 1, pp. 339–349, 2010.
[20] C. H. Huang and H. Y. Kao, “On solving the DEA CCR ratio model,” International Journal of Innovative Computing, Information and Control, vol. 4, no. 11, pp. 2765–2773, 2008.
[21] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, NY, USA, 1995.
[22] V. N. Vapnik, Statistical Learning Theory, John Wiley & Sons, New York, NY, USA, 1998.
[23] L. Wang, Ed., Support Vector Machines: Theory and Applications, Springer, Berlin, Germany, 2005.
[24] X. Chen, I. Bose, A. C. M. Leung, and C. Guo, “Assessing the severity of phishing attacks: a hybrid data mining approach,” Decision Support Systems, vol. 50, no. 4, pp. 662–672, 2011.
[25] K. J. Nishanth, V. Ravi, N. Ankaiah, and I. Bose, “Soft computing based imputation and hybrid data and text mining: the case of predicting the severity of phishing alerts,” Expert Systems with Applications, vol. 39, no. 12, pp. 10583–10589, 2012.
[26] V. S. Lakshmi and M. S. Vijaya, “Efficient prediction of phishing websites using supervised learning algorithms,” Procedia Engineering, vol. 30, pp. 798–805, 2012.
[27] H. K. Hong, S. H. Ha, C. K. Shin, S. C. Park, and S. H. Kim, “Evaluating the efficiency of system integration projects using data envelopment analysis (DEA) and machine learning,” Expert Systems with Applications, vol. 16, no. 3, pp. 283–296, 1999.
[28] C. C. Chang and C. J. Lin, “LIBSVM: a library for support vector machines,” 2003, http://www.csie.ntu.edu.tw/~cjlin/libsvm/.
Copyright
Copyright © 2013 Han-Ying Kao et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.