Abstract

In the past few years, the Internet of Things (IoT) has developed rapidly and been used extensively. However, its transmission security and privacy protection remain insufficient, which limits the development of IoT to a certain extent. Anonymous communication technology, as one technology for IoT information transmission, has emerged as an important means of securing healthcare data and better protecting users’ privacy. At the same time, the academic community has proposed a variety of attack techniques against anonymous communication systems that track senders and receivers or discover communications between two users. This paper presents the MSFA (Multiple System Fingerprint Attack) scheme for anonymous communication systems and describes its architecture, its implementation in the Tor environment, and the processing of the experimental data. A comparative analysis, based on the edit distance, of traces of visits to the same website shows that the longer the site traffic data, the greater the edit distance of the site access traffic and the larger its range.

1. Introduction

A great number of research achievements on anonymous communication systems have been made in the past 40 years. Ruei-Hau Hsu et al. [1] proposed network-covered and network-absent authenticated key exchange protocols for D2D communications to guarantee accountable group anonymity and end-to-end security to network operators. Anonymous communication technology is an effective technique for healthcare data privacy protection. Amin et al. [2] proposed the architecture of a patient monitoring system in a WMSN (wireless medical sensor network) and designed an anonymous mutual authentication protocol suitable for mobile users to provide secure access and privacy for patient data. Mingshan Xie et al. [3] proposed an anonymization protection algorithm that is suitable for data exchange in an incompletely open manner for the ego of data in the IoT. The anonymous dataset generated by the algorithm can effectively protect the sensitive information of IoT while ensuring the availability of the data. The EXCHANge protocol [4], a cryptoless over-the-air key establishment multiround protocol based on sender/receiver anonymity, was specifically conceived to secure IoT networks based on the IEEE 802.15.4 communication technology. Malicious network attackers or criminals rarely attack directly from their own computers. Before attacking the final target, they often connect to anonymous communication systems, such as Tor [5], JAP [6], Freenet [7], and I2P [8], to hide their identities. Tor and I2P have a large group of users, and both have published software versions for mobile ad-hoc networks. The Tor anonymous system hides the correspondence between input and output streams in a variety of ways, while the attacker’s goal is to identify this correspondence. Al-muhtadi et al. [9] made a preliminary evaluation of Misty clouds, a privacy-preserving platform for online user anonymity in the Social Internet of Things, indicating that the new algorithm was better than the existing Tor algorithm and could achieve the expected privacy goal within the expected performance cost. A network fingerprint attack can discover the communication relationship between sender and receiver in an anonymous communication network. Academia has carried out extensive research on the schemes and applications of fingerprint attacks.

Network traffic can be disturbed; for example, packets may be cached on a relay node for a period of time or be cut, recombined, retransmitted, or even lost. Web fingerprinting uses passive traffic analysis techniques that only require an attacker to configure a network environment similar to that of the regulator and to access the target site using the same encryption proxy technology. The actual address of the regulator’s communication side is identified by analyzing the generated traffic characteristics. Fingerprinting [10] combines a number of input sources. A fingerprint attack identifies whether the sender communicates with a particular recipient by collecting feature information of the sender, the receiver, or both. The feature information can be network traffic characteristics, routing information characteristics, or node information characteristics. When the receiver communicates with the sender, the fingerprint technique collects the feature information between them to form a fingerprint, which can then be used to determine the communication relationship between the sender and the receiver when they communicate again. In the existing fingerprint identification attacks [11–18], researchers use the packet size distribution, the sum of the packet sizes, the packet timing interval, etc. as the basic statistical features to characterize the web fingerprint feature set. Researchers have demonstrated the feasibility of website fingerprint attack methods and conducted further research on web fingerprinting. Cai et al. [11] proposed a website fingerprint attack that could successfully attack HTTPOS [19], a recently proposed scheme for defending against traffic analysis attacks.

Our contributions are as follows. Based on the CAI fingerprint attack prototype [11], we propose the MSFA fingerprint attack scheme and describe it in three aspects: the MSFA scheme architecture and module design, the implementation of the MSFA scheme in the Tor anonymous communication network, and the capture of the original traffic information and data processing.

2. CAI Fingerprint Attack Scheme

In this section, we outline the background of the CAI fingerprint attack scheme, the CAI fingerprint attack model, and the attack process and its features.

2.1. The Background of the CAI Fingerprint Attack Scheme

Cai et al. [11] proposed a website fingerprint attack that could successfully attack the HTTPOS [13] traffic analysis scheme. The CAI fingerprint attack is based on a simple network behavior model and can correctly predict the pages accessed by users more than half of the time under any defense model. It can also correctly identify whether the user accesses a specific site, with an experimental success rate of over 90%.

2.1.1. Web Page Tracking

Web pages contain multiple objects such as HTML files, images, and Flash content, and the browser sends a separate request for each object. By combining multiple TCP connections with pipelining, browsers load pages more quickly. The browser requests the page-related objects before loading the page. Note that the order of requests has an inherent stability, since an object can only be requested after the browser has received some page that references it. Some requests may be delayed due to the CPU load and packet reordering, so the order of requests and responses may differ each time the browser loads a page. Some requests may be omitted if there is a copy of the object in memory. The number of requests sent by the browser and the total number of packets returned by the server may vary with the size of a dynamic web page and of the objects it contains [11].

2.1.2. Damerau-Levenshtein Edit Distance

In information theory and computer science, the Damerau-Levenshtein distance [20–22] measures the distance between two strings, that is, two finite sequences of symbols. It is the minimum number of operations needed to convert one string into the other, where an operation is an insertion, a deletion, a replacement, or a swap of two adjacent characters. Damerau’s study [20] not only distinguished these four edits but also pointed out that they correspond to more than 80% of all spelling errors, although Damerau considered only misspellings that could be corrected with at most one edit operation. The difference between the Damerau-Levenshtein distance and the classic Levenshtein distance is that the former also allows the exchange of adjacent characters, whereas the latter permits only insert, delete, and replace operations. In other words, extending the Levenshtein distance with the exchange of adjacent characters yields a different metric, the Damerau-Levenshtein distance [21].
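
The definition above can be illustrated with a minimal C++ sketch. It assumes unit cost for all four operations and implements the restricted variant (often called optimal string alignment), in which no substring is edited more than once; the function name `damerau_levenshtein` is hypothetical and is not taken from the CAI or MSFA code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Restricted Damerau-Levenshtein (optimal string alignment) distance.
// Assumption: insertion, deletion, replacement, and adjacent transposition
// each cost 1; the costs used in the attack itself may differ.
int damerau_levenshtein(const std::string &a, const std::string &b) {
    const std::size_t m = a.size(), n = b.size();
    // d[i][j] = distance between the first i symbols of a and the first j of b
    std::vector<std::vector<int>> d(m + 1, std::vector<int>(n + 1, 0));
    for (std::size_t i = 0; i <= m; ++i) d[i][0] = static_cast<int>(i);
    for (std::size_t j = 0; j <= n; ++j) d[0][j] = static_cast<int>(j);
    for (std::size_t i = 1; i <= m; ++i) {
        for (std::size_t j = 1; j <= n; ++j) {
            const int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            d[i][j] = std::min({d[i - 1][j] + 1,           // deletion
                                d[i][j - 1] + 1,           // insertion
                                d[i - 1][j - 1] + cost});  // replacement
            if (i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1]) {
                d[i][j] = std::min(d[i][j], d[i - 2][j - 2] + 1);  // transposition
            }
        }
    }
    return d[m][n];
}
```

For example, converting "ab" into "ba" needs a single transposition under this definition, whereas the classic Levenshtein distance would need two edits.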

2.1.3. LIBSVM

LIBSVM is an integrated software library for support vector machines that supports multiclass classification. A support vector machine (SVM) is a technique that classifies data effectively. Users of LIBSVM do not need to understand the theory behind support vector machines; they only need to follow the basic procedure to obtain the corresponding result. A classification task for LIBSVM usually involves two separate datasets, a training set and a testing set. LIBSVM generates a model from the training data, and the model predicts the target values of the test set. The data in the test set provides only the attribute values of the test data.
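
As an illustration of how training and testing sets are usually handed to LIBSVM, the following sketch writes one sample in LIBSVM's sparse text format, `<label> <index>:<value> ...`, with 1-based feature indices. The helper name `to_libsvm_line` and the feature values are hypothetical; the paper does not specify the exact feature layout used for the fingerprint classifier.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Formats one sample as a line of LIBSVM's sparse text format:
//   <label> <index1>:<value1> <index2>:<value2> ...
// Feature indices are 1-based and zero-valued features are omitted.
// Assumption: the caller chooses the meaning of the label and the features.
std::string to_libsvm_line(int label, const std::vector<double> &features) {
    std::ostringstream out;
    out << label;
    for (std::size_t i = 0; i < features.size(); ++i) {
        if (features[i] != 0.0) {
            out << ' ' << (i + 1) << ':' << features[i];
        }
    }
    return out.str();
}
```

A training file is then one such line per sample with the known label, while a test file uses the same layout with the attribute values of the samples to be predicted.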

2.2. The Characteristics of CAI Fingerprint Attack

In the CAI fingerprint attack process (Figure 1), the client traffic information is captured through the client agent first and then the captured packet information is processed to obtain the packet length information. The edit distance is then calculated to generate the training and testing sets required for LIBSVM classification. The classification model is generated by LIBSVM for the prediction of the test set data to determine whether the user has visited the target website.

Firstly, Cai et al. proposed a new method to calculate the similarity of web accesses [11]. The order of the request and response packets reflects the size and importance of the objects referenced in a page, so packet ordering is very important for identifying web pages. The CAI fingerprint attack transforms traces into strings and compares the similarity of two traces using the edit distance. The Damerau-Levenshtein distance is a good metric for this purpose because it allows insertion, deletion, replacement, and interchange. Secondly, the CAI fingerprint attack scheme uses LIBSVM to build a classification model from the processed web packet data and to predict the pages accessed by the user.

3. MSFA Attack Scheme

In this section, we present the design of the MSFA fingerprint attack scheme.

3.1. MSFA Scheme Architecture

This subsection discusses the architecture of the MSFA fingerprint attack scheme, which includes the design goals of the MSFA scheme, the threat model, and the attack model.

3.1.1. MSFA Design Goal

The design of the MSFA attack scheme is based on the CAI fingerprint attack prototype, and a concrete implementation is proposed for different anonymous systems. The MSFA attack scheme is described below in three aspects:

(i) The database system of the MSFA attack scheme. The database system effectively saves the captured network traffic information, with a well-structured database design covering the site IP address, traffic information, site name, etc. The database system has the advantages of a simple structure and clear functions.

(ii) The improved MSFA attack scheme is based on the CAI fingerprint attack scheme, and experiments are performed on different categories of websites for data collection. The MSFA attack scheme is implemented in the Tor anonymous communication system environment.

(iii) Capture of the original traffic information. To obtain the fingerprint information of a website, it is necessary to capture the original traffic information of the website. A web page is composed of multiple objects, and the browser sends a separate request for each object. The browser can use multiple TCP connections to load the page faster. Before loading the page, the browser requests the objects related to the page, which generates network traffic. The packets in a trace can be roughly divided into two categories: request packets and response packets. When a user browses a page through the encrypted proxy, the user’s browser downloads all the relevant documents on the page, and each of them requires a separate TCP connection to return.

3.1.2. MSFA Scheme Threat Model

As shown in Figure 2, when Amy visits a website through the I2P network, she first needs to connect to the I2P network through a local connection. The I2P network uses the Kad algorithm to obtain information on the network nodes and accesses the destination server through the nodes of the I2P network. I2P uses multiple links to send and receive data; when the link for sending data and the link for receiving data differ, the number of nodes on the two links also differs. Ken accesses the site through the Tor network and reaches the destination server via three hops in the Tor network: the entry node, the intermediate node, and the exit node. Once the three-hop route has been set up, it does not change automatically unless it is changed manually, whereas I2P routing nodes have a validity period. The fingerprint attacker captures the traffic between the client and the entry node of the anonymous communication system, then analyzes the traffic characteristics, and finally breaks the anonymity provided by the system, which completes the attack.

3.1.3. MSFA Attack Model

Figure 3 shows the attack model of the MSFA scheme. When the user accesses the Internet through an ordinary browser or an anonymous communication system, the original traffic information can be captured by the Charles software or the Wireshark [23] software.

3.2. MSFA Scheme Module Design

This subsection describes the module design of the MSFA scheme (Figure 4), focusing on the traffic capture module, the packet information acquisition module, and the edit distance calculation module.

3.2.1. Traffic Capture Module Design

The experiment collects data from multiple target sites and organizes them according to the categories of the target sites, giving each site a unique ID. For data capture, the experiment uses Wireshark to capture the traffic of Tor and I2P sites.

3.2.2. Obtain the Packet Information Module Design

The module design for obtaining packet information is presented in Figure 5. In this experiment, we used an Ubuntu system to capture the site traffic and Wireshark to capture the data packets. The TCP/IP protocol suite defines the packets transmitted over the Internet, called IP datagrams. An IP datagram is independent of the hardware and consists of a header and data. The first 20 bytes of the header have a fixed length, while the remainder is an optional field whose length is variable. The source address and destination address in the header are both IP addresses.
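
The header layout described above can be made concrete with a short sketch that reads the header length and the two addresses from a raw IPv4 datagram. The struct `Ipv4Info` and the function `parse_ipv4_header` are hypothetical names used only for illustration; the field offsets follow the standard IPv4 header (version and header length in byte 0, total length in bytes 2 and 3, source address in bytes 12 to 15, destination address in bytes 16 to 19).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Fields extracted from an IPv4 header (hypothetical helper type).
struct Ipv4Info {
    bool valid = false;
    std::size_t header_len = 0;   // in bytes; 20 when no options are present
    std::uint16_t total_len = 0;  // header plus data, in bytes
    std::string src, dst;         // dotted-decimal addresses
};

// Converts four address bytes to dotted-decimal notation.
static std::string dotted(const std::uint8_t *p) {
    return std::to_string(p[0]) + "." + std::to_string(p[1]) + "." +
           std::to_string(p[2]) + "." + std::to_string(p[3]);
}

// Parses the fixed 20-byte part of an IPv4 header starting at data.
Ipv4Info parse_ipv4_header(const std::uint8_t *data, std::size_t len) {
    Ipv4Info info;
    if (data == nullptr || len < 20) return info;    // fixed part is 20 bytes
    if ((data[0] >> 4) != 4) return info;            // version field must be 4
    const std::size_t ihl = (data[0] & 0x0F) * 4u;   // header length in bytes
    if (ihl < 20 || ihl > len) return info;          // options extend past 20
    info.header_len = ihl;
    info.total_len = static_cast<std::uint16_t>((data[2] << 8) | data[3]);
    info.src = dotted(data + 12);
    info.dst = dotted(data + 16);
    info.valid = true;
    return info;
}
```

In a capture file, the IP header follows the link-layer header, so the caller is assumed to have skipped that part before calling the function.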

The original data saved by Wireshark is complex, whereas the experiment only needs the sizes and the number of request and response packets to calculate the Damerau-Levenshtein distance. Therefore, the original data needs to be processed. The experiment first groups the captured traffic data by IP address, which must belong to the routing nodes of Tor, to obtain the valuable packet information. The Filter class (Table 1) includes four attributes, namely, server_ips, client_ips, PROXYPORT_MIN, and PROXYPORT_MAX, and five methods, namely, Filter(), read_ips(), is_onload(), parse_one(), and is_monitoredtraffic(); its partial details are shown in Algorithm 1. During the conversion of a website traffic file (.cap), the lengths of the request and response packets are extracted and saved to a file in order. Request packets are recorded as negative values, while response packets are recorded as positive values. Algorithm 2 shows the algorithm for obtaining the packet information.

Algorithm 1: Partial definition of the Filter class.

#include <string>
#include <unordered_set>
#include <sys/types.h>  // u_char

using std::string;
using std::unordered_set;

// RET is the return type of parse_one in the original code; its definition is not shown here.
class Filter {
public:
    // Set of server IP addresses
    unordered_set<string> server_ips;
    // Set of client IP addresses
    unordered_set<string> client_ips;
    // Minimum port that needs to be processed
    int PROXYPORT_MIN;
    // Maximum port that needs to be processed
    int PROXYPORT_MAX;
public:
    // Constructor
    Filter(char *clientipfname, char *serveripfname, int portmin, int portmax);
    // Reads the IP addresses stored in file fname into set
    int read_ips(unordered_set<string> &set, char *fname);
    // Determines whether the file has been loaded
    bool is_onload(u_char *payload);
    // Determines whether the traffic is monitored traffic
    bool is_monitoredtraffic(char *src, unsigned int sport, char *dst, unsigned int dport);
    // Converts the captured data of one capture file
    RET parse_one(char *capfname, int proxy_port_min, int proxy_port_max, int remove_ack,
                  char *monitoredoutname, char *localoutname, char *c2stau, char *s2ctau, char *timeseq);
};

Algorithm 2: Obtaining the packet information.

Input: server_ips, client_ips, PROXYPORT_MIN, PROXYPORT_MAX
IF (source address is in client_ips && destination address is in server_ips
    && source port <= PROXYPORT_MAX
    && source port >= PROXYPORT_MIN
    && destination port <= PROXYPORT_MAX
    && destination port >= PROXYPORT_MIN)
    IF (whether it is interrupted)
    IF (destination port <= PROXYPORT_MAX && destination port >= PROXYPORT_MIN)
        packet length = packet length × (-1)   // request packets are recorded as negative
    Output packet length

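
The sign convention used when obtaining the packet information can be sketched in C++ as follows. This is a simplified illustration under stated assumptions, not the authors' Filter implementation: packets are assumed to be already decoded into a hypothetical `Packet` record, and only the port on the relay side is checked against the proxy port range.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical, already-decoded packet record.
struct Packet {
    std::string src, dst;
    unsigned sport, dport;
    int length;
};

// Hypothetical filter holding the same four attributes as the Filter class.
struct TraceFilter {
    std::unordered_set<std::string> client_ips, server_ips;
    unsigned port_min = 0, port_max = 0;

    bool in_range(unsigned port) const {
        return port >= port_min && port <= port_max;
    }

    // Returns the signed packet lengths of a trace: negative for requests
    // (client to server), positive for responses (server to client).
    // Packets that match neither direction are skipped.
    std::vector<int> signed_lengths(const std::vector<Packet> &packets) const {
        std::vector<int> trace;
        for (const Packet &p : packets) {
            const bool request = client_ips.count(p.src) > 0 &&
                                 server_ips.count(p.dst) > 0 && in_range(p.dport);
            const bool response = server_ips.count(p.src) > 0 &&
                                  client_ips.count(p.dst) > 0 && in_range(p.sport);
            if (request) {
                trace.push_back(-p.length);
            } else if (response) {
                trace.push_back(p.length);
            }
        }
        return trace;
    }
};
```

The resulting sequence of signed lengths is the string on which the edit distance is later computed.
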
3.2.3. Calculate the Editing Distance Module Design

After the packet information module has produced the packet lengths of each site access, the edit distance between two site visits must be calculated. The experiment uses the Damerau-Levenshtein distance, also referred to here as the edit distance, which is the minimum number of operations needed to convert one string into another. An operation is defined as an insertion, deletion, or replacement of a character, or a transposition of adjacent characters. To calculate the distances between the site visits, we store them in a matrix and, once all edit distance calculations are complete, output an edit distance matrix. The algorithm for calculating the edit distance is shown in Table 2.
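
A minimal sketch of how such an edit distance matrix could be filled is given below. The function `distance_matrix` and the type aliases are hypothetical; the distance function is passed in as a parameter and is assumed to be symmetric, so only the upper triangle is computed and mirrored.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

using Trace = std::vector<int>;                  // signed packet lengths
using Matrix = std::vector<std::vector<double>>;

// Fills a k-by-k matrix with dist(traces[i], traces[j]).
// Assumption: dist is symmetric and the distance of a trace to itself is 0.
Matrix distance_matrix(
    const std::vector<Trace> &traces,
    const std::function<double(const Trace &, const Trace &)> &dist) {
    const std::size_t k = traces.size();
    Matrix m(k, std::vector<double>(k, 0.0));
    for (std::size_t i = 0; i < k; ++i) {
        for (std::size_t j = i + 1; j < k; ++j) {
            m[i][j] = m[j][i] = dist(traces[i], traces[j]);
        }
    }
    return m;
}
```

With 10 visits to one site, the result is a 10-by-10 matrix whose entry in row i and column j is the distance between visit i and visit j.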

4. Implementation and Evaluation

4.1. Implementation of MSFA Scheme in the Tor Anonymous System Environment

The MSFA fingerprint attack scheme is tested under the Tor anonymous communication system.

4.1.1. Tor Anonymous System Installation and Configuration

In the experiment, we use a Linux system, download the corresponding Linux package from the Tor official website, and then unpack the software package. Before running the Tor Browser, a global VPN proxy needs to be installed on the computer; otherwise, the Tor Browser will not be able to establish a connection and run normally.

4.1.2. Get the Packet Information

After the .pcap file of a site visit has been captured, each .pcap file is converted into a .txt file that records the traffic information of the site.

4.2. MSFA Scheme Experimental Data Processing

The data processing of the MSFA scheme is discussed in detail, and the data of different kinds of websites are classified.

4.2.1. Calculate the Edit Distance

After fetching the packet length information required for the experiment, the next step is to further process the packet information and calculate the Damerau-Levenshtein edit distance. As in the CAI fingerprint attack [11], we normalize the edit distance to compensate for the variation in the length of the packet traces. If $d(t, t')$ denotes the Damerau-Levenshtein edit distance between traces $t$ and $t'$, the fingerprint attack normalizes the edit distance as $d_{\mathrm{norm}}(t, t') = d(t, t') / \min(|t|, |t'|)$, where $|t|$ represents the length of trace $t$ (its number of packets); that is, the classifier normalizes by the shorter of the two lengths. If $t$ and $t'$ differ greatly in length, the two traces probably come from different pages. In this case, dividing by $\min(|t|, |t'|)$ results in a larger normalized distance, which makes it a suitable normalization. The implementation of the calculation is shown in Algorithm 3.

Algorithm 3: Calculation of the normalized Damerau-Levenshtein edit distance.

#include <algorithm>
#include <vector>

// Assumption: str1 and str2, the two traces being compared, are members of
// Levenshtein; their declaration is not part of the original listing.
class Levenshtein {
public:
    std::vector<int> str1, str2;
    double DLdis(int ms, int ns);
};

// ms and ns are the lengths of str1 and str2.
double Levenshtein::DLdis(int ms, int ns)
{
    // Preprocessing
    int m = ms;
    int n = ns;
    // minlen takes the smaller of m and n (at least 1 to avoid division by zero)
    int minlen = m < n ? m : n;
    minlen = minlen == 0 ? 1 : minlen;
    int i, j;
    double subcost, transcost;
    // The cost of an insertion or a deletion is two
    const double idcost = 2;
    // Distance matrix: dis[i][j] is the cost of converting the first i symbols
    // of str1 into the first j symbols of str2 (row and column 0 are the empty prefix)
    std::vector<std::vector<double>> dis(m + 1, std::vector<double>(n + 1, -1));
    // Costs of the first column and the first row
    for (i = 0; i <= m; i++)
        dis[i][0] = i * idcost;
    for (j = 0; j <= n; j++)
        dis[0][j] = j * idcost;
    // Costs of the remaining cells
    for (i = 1; i <= m; i++)
    {
        for (j = 1; j <= n; j++)
        {
            // If the two symbols are equal, the cost is zero
            if (str1[i - 1] == str2[j - 1])
                subcost = transcost = 0;
            else
            {
                // Otherwise the replacement cost is two
                subcost = 2;
                // The exchange cost is 0.1
                transcost = 0.1;
            }
            // The minimum cost is the edit distance, which is stored in the matrix
            dis[i][j] = std::min({dis[i - 1][j] + idcost, dis[i][j - 1] + idcost,
                                  dis[i - 1][j - 1] + subcost});
            // Exchange of two adjacent symbols
            if (i > 1 && j > 1 && str1[i - 1] == str2[j - 2] && str1[i - 2] == str2[j - 1])
                dis[i][j] = std::min(dis[i][j], dis[i - 2][j - 2] + transcost);
        }
    }
    // The original listing shows no return statement; the normalization by the
    // shorter length follows the formula given in the text.
    return dis[m][n] / minlen;
}

4.2.2. Data Processing

From the collected sites, we sort and select a few for processing. The specific process is as follows:

First, msn.com is selected for the experiment. MSN was accessed through the Tor anonymous system at different times, with a total of 10 visits. The site traffic data was formed after Wireshark captured the access traffic and the Filter class processed the file. The data is shown in Table 3.

After obtaining the traffic information for the 10 visits to the msn.com website, we use Levenshtein_cantor_mpi to calculate the edit distance for these 10 traffic records, as shown in Table 4.

When calculating the edit distance, the strings formed from two traffic records of the site are compared. The smaller the edit distance is, the more similar the two records are.

By comparison, we find that the minimum edit distance is 0.069, between the fourth and the ninth visits to msn.com, and the maximum is 2.64, between the first and the fifth visits. We can therefore initially determine that the edit distance of accessing the MSN website ranges from 0 to 2.64. The visit with the smallest average edit distance to the other visits is then selected as the msn.com website fingerprint and stored in the fingerprint database. By comparison, the second visit has the minimum average edit distance to the other visits to msn.com, so the second visit is stored in the database as the fingerprint of msn.com.
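
The fingerprint selection rule described above (choose the visit with the smallest average edit distance to all other visits) can be sketched as follows. The function name `select_fingerprint` is hypothetical, and the input is assumed to be a square matrix of pairwise edit distances such as the one reported in Table 4.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Returns the 0-based index of the visit whose average distance to all other
// visits is smallest. Assumption: dist is a non-empty square matrix.
std::size_t select_fingerprint(const std::vector<std::vector<double>> &dist) {
    const std::size_t k = dist.size();
    std::size_t best = 0;
    double best_avg = std::numeric_limits<double>::infinity();
    for (std::size_t i = 0; i < k; ++i) {
        double sum = 0.0;
        for (std::size_t j = 0; j < k; ++j) {
            if (j != i) sum += dist[i][j];  // skip the distance to itself
        }
        const double avg = (k > 1) ? sum / static_cast<double>(k - 1) : 0.0;
        if (avg < best_avg) {
            best_avg = avg;
            best = i;
        }
    }
    return best;
}
```

Because the index is 0-based, a return value of 1 corresponds to the second visit.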

Next, Amazon is selected for the experiment. Amazon was accessed through the Tor anonymous system at different times, with a total of 10 visits. The site traffic data was formed after Wireshark captured the access traffic and the Filter class processed the file. The data is shown in Table 5.

After obtaining the traffic information for the 10 visits to the Amazon website, we use Levenshtein_cantor_mpi to calculate the edit distance for these 10 traffic records, as shown in Table 6.

By comparison, we find that the minimum edit distance is 0.034, between the fourth and the fifth visits to amazon.com, and the maximum is 11.81, between the first and the ninth visits. We can therefore initially determine that the edit distance of accessing the Amazon website ranges from 0 to 11.81. The visit with the smallest average edit distance to the other visits is then selected as the amazon.com website fingerprint and stored in the fingerprint database. By comparison, the 9th visit has the minimum average edit distance to the other visits to amazon.com, so the 9th visit is stored in the database as the fingerprint of amazon.com.

Finally, YouTube is selected for the experiment. YouTube was accessed through the Tor anonymous system at different times, with a total of 10 visits. The site traffic data was formed after Wireshark captured the access traffic and the Filter class processed the file. The data is shown in Table 7.

Levenshtein_cantor_mpi is used to calculate the edit distance for the 10 traffic records obtained from the 10 visits to the youtube.com website, as shown in Table 8.

By comparison, we find that the minimum edit distance is 0.27, between the 8th and the 7th visits to youtube.com, and the maximum is 8.44, between the 10th and the 9th visits. We can therefore initially determine that the edit distance of accessing the YouTube website ranges from 0 to 8.44. The visit with the smallest average edit distance to the other visits is then selected as the youtube.com website fingerprint and stored in the fingerprint database. By comparison, the 8th visit has the minimum average edit distance to the other visits to youtube.com, so the 8th visit is stored in the database as the fingerprint of youtube.com.

5. Conclusion

The widespread adoption of IoT services requires customized security and privacy levels to be guaranteed. Many IoT services and applications may expose sensitive and personal information that may be abused by attackers. Privacy protection is therefore a core requirement in any IoT ecosystem. The MSFA attack scheme proposed in this paper uses the edit distance to compare the similarity between two visits. Firstly, the differences between the traffic of different types of websites can be observed from the above data. Take Amazon, an e-commerce website, and YouTube, a video website, as examples. The edit distance of the Amazon traffic ranges from 0 to 11.81, while that of the YouTube traffic ranges from 0 to 8.44. Secondly, the length of the traffic data has an impact on the edit distance. In general, the longer the site traffic data, the greater the edit distance of the site access traffic and the larger its range.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work is supported by the following grants: the National Natural Science Foundation of China under Grant no. 61170273; the China Scholarship Council under Grant no. 3050. We thank the anonymous reviewers for their valuable comments and suggestions.