Abstract

This study proposes a hybrid data envelopment analysis (DEA) and support vector machine (SVM) approach for efficiency estimation and classification in web security. In the proposed framework, the factors and efficiency scores from DEA models are integrated with SVM to learn patterns of web security performance and provide further decision support. A numerical case study of hospital web security efficiency demonstrates the feasibility of this design.

1. Introduction

During the past decades, the Internet and the World Wide Web (WWW) have become prevalent platforms for information sharing and transformation. Consequently, web security management has become a major theme in for-profit as well as nonprofit organizations. Assurance of information systems security involves not only tangible costs but also intangible inputs, which makes it challenging to evaluate the performance of the investments. This section concisely introduces the basics of web security, data envelopment analysis (DEA), support vector machine (SVM), classification on web security, and the goals of this study.

1.1. Phishing and Web Security

Phishing [14] is a criminal activity employing both social engineering and technical subterfuge to acquire personal data such as usernames, passwords, and credit card numbers. Phishing has become a serious threat to information security and Internet privacy. An analysis of phishing attacks by the Financial Services Technology Consortium [3] produces a taxonomy consisting of six stages: planning, set up, attack, collection, fraud, and postattack.

The increasing popularity of web-based systems has resulted in phishing causing significant financial damage to both individuals and organizations. The statistics provided by the Anti-Phishing Working Group [4] show that the number of unique phishing websites detected during the fourth quarter of 2009 was 137,619 and that financial services and payment services were the most-targeted industry sectors. It is clear that financial gain is the main objective of phishing attacks. A survey by Gartner [5] reveals that between September 2007 and September 2008, more than 5 million people in the United States were affected by phishing attacks, with an average loss of US$351 per incident, and that the number of victims increased by 39.8 percent.

Protection against attacks and unauthorized access to sensitive information is vital on the Internet. Several technical antiphishing solutions have been proposed on the server side or the client side. Server-side defenses employ Secure Sockets Layer certificates, user-selected site images, and other security indicators to help users verify the legitimacy of web sites, while client-side defenses equip web browsers with automatic phishing-detection features or add-ons (e.g., SpoofGuard) to warn users against suspected phishing sites [6]. In addition to the technical solutions, training users on antiphishing techniques is a frequently recommended and widely used approach for countering phishing attacks. ISO and NIST security standards, which many companies are contractually obligated to follow, include security training as an important component of security compliance [7, 8]. These standards describe a three-level framework comprising awareness, training, and education. Users can be trained in several ways to understand how phishing attacks work [9–12].

1.2. Data Envelopment Analysis

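Efficiency evaluation is a common issue in various domains and organizations, which is critical to investment analysis and resource allocation. Data envelopment analysis [13–15] is a celebrated efficiency evaluation technique and has been widely used in medical practices [16–19]. The DEA CCR ratio model developed by Charnes et al. [15] assesses the relative efficiency of decision-making units (DMUs) by maximizing the ratio of the weighted sum of outputs to that of inputs. Consider $n$ DMUs ($j = 1, \dots, n$) that require assessment. Each DMU consumes $m$ inputs ($i = 1, \dots, m$) and produces $s$ outputs ($r = 1, \dots, s$), denoted by $x_{ij}$ and $y_{rj}$, respectively. The efficiency of $\mathrm{DMU}_k$ is computed as follows.

1.2.1. CCR Ratio Model

One has

$$
\begin{aligned}
\max\ & h_k = \frac{\sum_{r=1}^{s} u_r y_{rk}}{\sum_{i=1}^{m} v_i x_{ik}} \\
\text{s.t.}\ & \frac{\sum_{r=1}^{s} u_r y_{rj}}{\sum_{i=1}^{m} v_i x_{ij}} \le 1, \quad j = 1, \dots, n, \\
& u_r,\ v_i \ge \varepsilon > 0, \quad r = 1, \dots, s,\ i = 1, \dots, m.
\end{aligned}
\tag{1}
$$

Based on the CCR ratio model, the objective function $h_k$ is maximized for every $\mathrm{DMU}_k$ individually. In the model, $x_{ij}$ and $y_{rj}$ are the $i$th input and $r$th output of $\mathrm{DMU}_j$; $u_r$, $v_i$ are the weights of the outputs and inputs, respectively; $\varepsilon$ is a small positive value which ensures that all weights are nonnegative. For computational convenience, frequently the CCR ratio model is transformed into a linear programming (LP) model by assuming [14] that $\sum_{i=1}^{m} v_i x_{ik} = 1$:

$$
\begin{aligned}
\max\ & h_k = \sum_{r=1}^{s} u_r y_{rk} \\
\text{s.t.}\ & \sum_{i=1}^{m} v_i x_{ik} = 1, \\
& \sum_{r=1}^{s} u_r y_{rj} - \sum_{i=1}^{m} v_i x_{ij} \le 0, \quad j = 1, \dots, n, \\
& u_r,\ v_i \ge \varepsilon > 0, \quad r = 1, \dots, s,\ i = 1, \dots, m.
\end{aligned}
\tag{2}
$$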

Notably, the solution space of the CCR LP model is smaller than that of the CCR ratio model due to the constraint (2); thus, the CCR LP model finds the local optimum for the ratio model which comprises fractional terms [20].
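As a minimal illustration of the ratio model, consider the special case of a single input and a single output. In that case the weights cancel, and the CCR efficiency of each DMU reduces to its output/input ratio divided by the best ratio in the sample (ignoring the small constant on the weights). The sketch below covers this special case only; the function name and the toy data are hypothetical, and the general multi-factor model would require an LP solver.

```python
from typing import List, Sequence


def ccr_efficiency_single(inputs: Sequence[float],
                          outputs: Sequence[float]) -> List[float]:
    """CCR efficiency for the one-input, one-output special case.

    With one input and one output, the constraint ratio <= 1 for all DMUs
    caps the weight ratio at the reciprocal of the best output/input ratio,
    so each score equals (own ratio) / (best ratio in the sample).
    """
    if len(inputs) != len(outputs) or not inputs:
        raise ValueError("inputs and outputs must be non-empty and equal in length")
    if any(x <= 0 for x in inputs):
        raise ValueError("inputs must be strictly positive")
    ratios = [y / x for x, y in zip(inputs, outputs)]
    best = max(ratios)
    if best <= 0:
        raise ValueError("at least one DMU must have a positive output")
    return [r / best for r in ratios]


# Hypothetical toy data: three DMUs, one input and one output each.
scores = ccr_efficiency_single([2.0, 4.0, 5.0], [4.0, 4.0, 10.0])
print(scores)
```

In this toy sample the first and third DMUs share the best ratio and are fully efficient, while the second produces half as much output per unit of input.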

1.3. Support Vector Machine

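Support vector machine (SVM) is a popular classifier and pattern recognition method based on statistical learning [21–23]. Suppose that training data $(x_i, y_i)$, $i = 1, \dots, l$, are given, where $x_i \in \mathbb{R}^n$ are the input patterns and $y_i \in \{-1, +1\}$ are the related target values of the two-class pattern classification case. Then the standard linear support vector machine is as follows:

$$
\begin{aligned}
\min_{w,\, b,\, \xi}\ & \frac{1}{2} w^{T} w + C \sum_{i=1}^{l} \xi_i \\
\text{s.t.}\ & y_i \left( w^{T} x_i + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, l,
\end{aligned}
\tag{3}
$$

where $b$ is the location of the hyperplane relative to the origin. The regularization constant $C$ is the penalty parameter of the error term to determine the tradeoff between the flatness of linear functions and empirical error.

Hence, for such a generalized optimal separating hyperplane, the functional to be minimized comprises an extra term accounting for the cost of overlapping errors. In fact the cost function (3) can be even more general as given below:

$$
\min_{w,\, b,\, \xi}\ \frac{1}{2} w^{T} w + C \sum_{i=1}^{l} \xi_i^{k}
$$

subject to the same constraints. This is a convex programming problem that is usually solved only for $k = 1$ or $k = 2$, and such soft margin SVMs are dubbed L1 and L2 SVMs, respectively [23].

For L1 SVMs ($k = 1$), the solution to the quadratic programming problem (3) is given by the saddle point of the primal Lagrangian shown below:

$$
L_p(w, b, \xi, \alpha, \beta) = \frac{1}{2} w^{T} w + C \sum_{i=1}^{l} \xi_i - \sum_{i=1}^{l} \alpha_i \left[ y_i \left( w^{T} x_i + b \right) - 1 + \xi_i \right] - \sum_{i=1}^{l} \beta_i \xi_i,
$$

where $\alpha_i$ and $\beta_i$ are the Lagrange multipliers.

Due to the KKT conditions, a dual Lagrangian function has to be maximized as follows:

$$
\begin{aligned}
\max_{\alpha}\ & L_d(\alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} y_i y_j \alpha_i \alpha_j x_i^{T} x_j \\
\text{s.t.}\ & 0 \le \alpha_i \le C, \quad i = 1, \dots, l, \qquad \sum_{i=1}^{l} \alpha_i y_i = 0.
\end{aligned}
$$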
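The step from the primal Lagrangian to the dual can be sketched as follows, assuming the standard L1 soft-margin formulation with multipliers $\alpha_i \ge 0$ on the margin constraints and $\beta_i \ge 0$ on the constraints $\xi_i \ge 0$ (this notation is an assumption where the source symbols are not legible). Setting the derivatives of the primal Lagrangian $L_p$ with respect to the primal variables to zero gives

$$
\begin{aligned}
\frac{\partial L_p}{\partial w} = 0 \;&\Rightarrow\; w = \sum_{i=1}^{l} \alpha_i y_i x_i, \\
\frac{\partial L_p}{\partial b} = 0 \;&\Rightarrow\; \sum_{i=1}^{l} \alpha_i y_i = 0, \\
\frac{\partial L_p}{\partial \xi_i} = 0 \;&\Rightarrow\; \alpha_i + \beta_i = C, \quad i = 1, \dots, l.
\end{aligned}
$$

Substituting these conditions back into $L_p$ eliminates $w$, $b$, and $\xi$, leaving a function of $\alpha$ alone. Because $\beta_i \ge 0$ and $\alpha_i + \beta_i = C$, each multiplier is confined to the box $0 \le \alpha_i \le C$.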

In learning a nonlinear classifier, we can define a kernel and the dual Lagrangian to be maximized as follows:

$$
L_d(\alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} y_i y_j \alpha_i \alpha_j K(x_i, x_j),
$$

where $K(x_i, x_j)$ is the kernel function which maps the training vector into a higher dimensional space. Popularly used kernel types include linear, polynomial, Gaussian, radial basis, and sigmoid [23].
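To make the kernel types concrete, the sketch below writes four of them in a commonly used parameterization and evaluates the kernelized dual objective for a given multiplier vector. The parameter names (`gamma`, `coef0`, `degree`), the function names, and the toy data are assumptions chosen for illustration, not values taken from this study.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]
Kernel = Callable[[Vector, Vector], float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u: Vector, v: Vector) -> float:
    return dot(u, v)


def polynomial_kernel(u: Vector, v: Vector, gamma: float = 1.0,
                      coef0: float = 0.0, degree: int = 3) -> float:
    return (gamma * dot(u, v) + coef0) ** degree


def rbf_kernel(u: Vector, v: Vector, gamma: float = 1.0) -> float:
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(u: Vector, v: Vector, gamma: float = 1.0,
                   coef0: float = 0.0) -> float:
    return math.tanh(gamma * dot(u, v) + coef0)


def dual_objective(alpha: Sequence[float], y: Sequence[int],
                   X: Sequence[Vector], kernel: Kernel) -> float:
    """Sum of alpha_i minus half the kernel-weighted quadratic term."""
    l = len(alpha)
    quad = sum(y[i] * y[j] * alpha[i] * alpha[j] * kernel(X[i], X[j])
               for i in range(l) for j in range(l))
    return sum(alpha) - 0.5 * quad


# Hypothetical toy problem: two orthogonal points with opposite labels.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [1, -1]
alpha = [0.5, 0.5]  # satisfies sum(alpha_i * y_i) = 0
value = dual_objective(alpha, y, X, linear_kernel)
print(value)
```

Swapping `linear_kernel` for another kernel changes only the pairwise similarity term; the rest of the dual objective is unchanged.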

1.4. Estimation and Classification on Web Security

Bose and Leung [2] investigate the antiphishing preparedness of banks in Hong Kong by analyzing the websites of the registered Hong Kong banks. They compute the score for each bank by averaging the performance of the bank’s website in three aspects: accessibility, usability, and information content. Later, Chen et al. [24] assess the severity of phishing attacks in terms of their risk levels and the potential loss in market value suffered by the targeted firms. They analyze 1030 phishing alerts released on a public database and financial data related to the targeted firms using a hybrid method that predicts the severity of the attack. Nishanth et al. [25] employ a two-stage soft computing approach for data imputation to assess the severity of phishing attacks, which involves the K-means algorithm and multilayer perceptron (MLP), probabilistic neural network (PNN), and decision trees (DT). Similar machine-learning techniques are employed by Lakshmi and Vijaya [26] for modelling the prediction task. The supervised learning algorithms, namely, multilayer perceptron, decision tree induction, and naïve Bayes classification, are used for exploring the results.

This study integrates DEA and SVM for web efficiency estimation and classification for several key reasons. First, as medical informatics and security gain growing attention, a practical evaluation scheme is needed. We develop a DEA model to assess the relative efficiency of hospitals, as a pioneering study in this field. Second, in addition to evaluating the current websites at one snapshot, other websites may later be reviewed as potential data. An efficient and reliable classifier is essential to discriminate such future data. Among the wide range of machine learning methods, SVM is relatively robust and convincing, so we integrate DEA and SVM to build the efficiency classification platform. Third, compared with related studies, this work emphasizes web security preparedness rather than the detection of potential web attacks. That is, we assess web security from a proactive view that is not limited to technical aspects. The rest of this paper is organized as follows. Section 2 addresses the problem and methods. Section 3 presents the numerical case study of web security analysis in medical institutions. Finally, the concluding remarks are given in Section 4.

2. The Method

This section develops the hybrid DEA and SVM approaches for efficiency classification. Consider $n$ DMUs ($j = 1, \dots, n$) that require assessment. Each DMU consumes $m$ inputs ($i = 1, \dots, m$) and produces $s$ outputs ($r = 1, \dots, s$), denoted by $x_{ij}$ and $y_{rj}$, respectively. In the proposed framework, the factors and efficiency scores from DEA models are integrated with SVM in learning patterns of DMUs’ performance and provide further decision support. The procedure of the hybrid methods is demonstrated in Figure 1.

Step 1 (efficiency evaluation). Based on (1) and (2), this study first evaluates the efficiency of the training data of the DMUs. The efficiency of $\mathrm{DMU}_k$ is defined as in (1).

Step 2 (efficiency tier analysis). This step iteratively discriminates the fully productive group and the subproductive group. Using tier analysis [27], the DMUs are divided according to their efficiency scores. Then, the fully productive group is moved to the current tier, while the remaining DMUs are kept for further tier extraction. The algorithm is described as follows.
S1: set $t = 1$. Set SC = {all DMUs}. Set $T_t = \emptyset$.
S2: compute the efficiency scores of DMUs in SC.
S3: store the efficient DMUs with score 1 in $T_t$. Set SC = SC $\setminus$ $T_t$.
S4: determine if the extraction process continues. If yes, set $t = t + 1$ and go to S2; else set $T_{t+1}$ = SC and go to S5.
S5: for $i = 1$ to $t + 1$ output DMUs in $T_i$.
end.
The procedure of this step is demonstrated in Figure 2. Each DMU will belong to one tier thereafter.
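The tier extraction loop can be sketched in a few lines. The version below is a sketch under stated assumptions, not the authors' implementation: it assumes that extraction stops once a preset number of tiers is reached, that the DMUs still remaining then form the last tier, and, to stay self-contained, it scores DMUs with the one-input, one-output ratio form of CCR efficiency. The names `tier_analysis`, `ratio_efficiency`, `max_tiers`, and the toy data are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

DMU = Tuple[float, float]  # (input, output); hypothetical single-factor data


def ratio_efficiency(dmus: Sequence[DMU]) -> List[float]:
    """One-input, one-output CCR efficiency: own ratio over the best ratio."""
    ratios = [y / x for x, y in dmus]
    best = max(ratios)
    return [r / best for r in ratios]


def tier_analysis(dmus: Dict[str, DMU], max_tiers: int,
                  efficiency: Callable[[Sequence[DMU]], List[float]] = ratio_efficiency,
                  tol: float = 1e-9) -> List[List[str]]:
    """Peel off efficient DMUs tier by tier (assumed stopping rule: max_tiers)."""
    remaining = dict(dmus)            # candidate set, shrinks every round
    tiers: List[List[str]] = []
    while remaining and len(tiers) < max_tiers - 1:
        names = list(remaining)
        scores = efficiency([remaining[name] for name in names])
        efficient = [n for n, s in zip(names, scores) if s >= 1.0 - tol]
        tiers.append(efficient)       # efficient DMUs form the current tier
        for name in efficient:
            del remaining[name]       # the rest go on to the next round
    if remaining:
        tiers.append(list(remaining))  # leftover DMUs form the last tier
    return tiers


# Hypothetical toy data: five DMUs with positive inputs and outputs.
toy = {"A": (2.0, 4.0), "B": (4.0, 4.0), "C": (5.0, 10.0),
       "D": (4.0, 6.0), "E": (10.0, 5.0)}
three_tiers = tier_analysis(toy, max_tiers=3)
print(three_tiers)
```

Passing a different `efficiency` callable, for example one that solves the LP form of the CCR model, leaves the peeling logic unchanged.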

Step 3 (SVM learning). Here the classification schema is learned as $f(z_j) = c_j$, where $c_j$ stands for the tier that $\mathrm{DMU}_j$ belongs to and $z_j = (x_{1j}, \dots, x_{mj}, y_{1j}, \dots, y_{sj})$ is the vector combining the input and output factors of $\mathrm{DMU}_j$. Since there can be more than two tiers, this is a multiclass classification problem.

Step 4 (testing). The set of testing data will be used to validate the classification model.

In the next section, the case of hospital web security efficiency will be thoroughly studied for demonstrating the procedure developed above.

3. Case Study

This study investigates 91 medical institutes in Taiwan, among which 8 (8.79%) are medical centers, 45 (49.45%) are metropolitan hospitals, and 38 (41.76%) are local community hospitals. To assess the hospitals’ efficiencies in web security, 10 input factors in 3 categories (clarity of purpose, communication, and security framework) and 2 output factors (user satisfaction and progress of ISO 27001 accreditation) are defined in Table 1. Two professional users with web security expertise independently assess the web sites of these hospitals. All items are scored between 1 and 9 by the predefined measures, where 9 means total consistency between the statement and the practice while 1 stands for the opposite. After the reviewers observe the sample web sites, they give the scores according to the level of conformity to the factors in Table 1. The scores from the two reviewers are averaged to give the input/output values for the DEA models. Most of the factors are nearly objective except user satisfaction. Notably, the input variables serve as surrogate variables representing the investments in web security.

Step 1 (efficiency evaluation). By (1) and (2), the efficiencies of the information security investment in hospitals are computed. The distributions of the results are summarized in Tables 2 and 3 and Figure 3.

Step 2 (efficiency tier analysis). In each iteration of the tier analysis algorithm in Section 2, DEA determines one productive group of hospitals and another group of subproductive hospitals. Then the fully efficient group is extracted and the other group proceeds to the next iteration. Each DMU will belong to an efficiency tier thereafter. The members of the tiers are distributed as in Table 4 and Figure 4.

Steps 3 and 4 (SVM learning and testing). In this step, we use the utility LIBSVM [28] to build the SVM classification models. Four types of kernel functions are learned, including linear, polynomial, radial basis, and sigmoid. The accuracy in testing by different number of tiers and kernel function types is compared. The report is shown in Table 5 and Figure 5.
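As a rough illustration of how tier labels and factor scores could be handed to LIBSVM, the sketch below writes one training row in LIBSVM's sparse text format (`label index:value ...`, with indices starting at 1) and assembles an `svm-train` argument list in which the `-t` option selects the kernel (0 linear, 1 polynomial, 2 radial basis, 3 sigmoid). The helper names, file names, and sample scores are hypothetical, and the study does not state which LIBSVM interface or parameter settings were actually used.

```python
from typing import Dict, List, Sequence

# Kernel codes accepted by svm-train's -t option.
KERNEL_CODES: Dict[str, int] = {
    "linear": 0,
    "polynomial": 1,
    "radial basis": 2,
    "sigmoid": 3,
}


def to_libsvm_line(tier: int, features: Sequence[float]) -> str:
    """Format one DMU as '<label> 1:<v1> 2:<v2> ...' (1-based feature indices)."""
    parts = [str(tier)]
    parts += [f"{index}:{value:g}" for index, value in enumerate(features, start=1)]
    return " ".join(parts)


def train_command(kernel: str, train_file: str, model_file: str) -> List[str]:
    """Argument list for svm-train with the chosen kernel; other options default."""
    return ["svm-train", "-t", str(KERNEL_CODES[kernel]), train_file, model_file]


# Hypothetical DMU in tier 2 with three averaged factor scores.
line = to_libsvm_line(2, [7.5, 8.0, 3.0])
command = train_command("linear", "train.txt", "model.txt")
print(line)
print(" ".join(command))
```

The argument list could be passed to `subprocess.run` on a machine where the LIBSVM command-line tools are installed; that step is omitted here because it depends on the local setup.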

The distribution of efficiencies in Figure 3 shows an unbalanced pattern in which most hospitals lie at the two extremes of the efficiency scale. However, after tier analysis, the distribution of tiers is nearly even, except for the second tier, which has the fewest hospitals.

From the results, the kernel functions with satisfactory prediction accuracy are linear (90.11% on average), radial basis (89.37% on average), and polynomial (87.18% on average), while the sigmoid function results in the lowest accuracy (55.31% on average). The linear, radial basis, and polynomial functions are therefore more appropriate kernel types for web security efficiency classification in this case. The pattern of tier distribution is possibly the reason why those three kernel types outperform the sigmoid function in SVM classification.

From the perspective of data tier refinement, 2 tiers get the highest accuracy (average of 88.74%) while 4 tiers have the lowest rate (average of 72.53%). The results show that fewer data tiers obtain better accuracy in classification, which is consistent with the rule of thumb that classification becomes harder as the number of classes grows.

4. Conclusions

This study proposes a hybrid data envelopment analysis and support vector machine approach for efficiency estimation and classification. For the feasibility of data collection, we use surrogate variables to construct the tangible and intangible input factors. For the output factors, we define an objective variable, ISO 27001 accreditation progress, and a subjective one, user satisfaction, so that efficiency is evaluated not only from the technical perspective but also from the users’ perception. In the proposed framework, the factors and efficiency scores from DEA models are integrated with SVM for learning patterns of expected web security performance. In the case study, the linear and radial basis kernel functions show superior classification performance. Classification with fewer data tiers also obtains better accuracy in testing, which is consistent with the rule of thumb. This design integrates performance estimation and pattern learning to provide decision support in medical information security.

Acknowledgments

The authors are indebted to the anonymous reviewers for their careful reading and suggestions to enhance the quality of this paper. This work is supported by the National Science Council, Taiwan (Grant no. NSC 102-2410-H-259-039-, NSC 101-2221-E-259-030).