Abstract

Deep packet inspection (DPI) is widely used to detect abnormal traffic and suspicious activities in networks. With the growing popularity of secure hypertext transfer protocol (HyperText Transfer Protocol over Secure Socket Layer, HTTPS), inspecting encrypted traffic has become necessary. The traditional decrypt-and-then-re-encrypt method has the drawback of privacy leakage: decrypting encrypted packets for inspection violates the confidentiality goal of HTTPS. People are therefore faced with a dilemma between the middlebox’s ability to perform detection functions and the privacy of their communications. We propose OTEPI, a system that provides both of those properties simultaneously. The approach of OTEPI is to perform deep packet inspection directly on the encrypted traffic. Unlike machine and deep learning methods that can only classify traffic, OTEPI accurately identifies which detection rule was matched by the encrypted packet, which allows network managers to manage their networks at a finer granularity. OTEPI achieves this through a new protocol and new encryption schemes. Compared with previous works, our approach achieves rule encryption with oblivious transfer (OT), which allows our work to strike a better balance between communication traffic consumption and computational resource consumption. Our design of the oblivious transfer and the use of natural language processing (NLP) tools also make OTEPI outstanding in terms of computational consumption.

1. Introduction

Packet inspection and analysis are widely used to detect, mitigate, and prevent suspicious network activities. Real-time inspection of packet payloads and headers is essential to achieve these goals. The equipment deployed for these purposes is the middlebox, an intermediate device providing various services. Middleboxes are essential in today’s network infrastructure. The primary services provided by middleboxes include firewalls, intrusion detection, parental filtering, data leakage detection, forensic analysis, malware analysis, and others.

With the popularity of HTTPS, 87%–90% of current network traffic is encrypted by protocols such as SSL (Secure Sockets Layer) and TLS (Transport Layer Security) [1]. According to Google’s statistics, in November 2020, 81% to 98% of the traffic from the Chrome browser on different platforms used HTTPS [2]. Man-in-the-Middle (MitM) technology is one of the commonly used approaches to inspect encrypted traffic. A MitM middlebox establishes separate encrypted connections with both the client and the server. The middlebox decrypts the traffic in the connections, inspects the payloads according to the intrusion detection rules, and then re-encrypts the payloads. When a rule is matched, an alert is sent to the administrator, who can take actions such as disconnecting the session.

This decrypt-and-detect approach violates the security goal of HTTPS. A survey [3] from the USENIX Association shows that 75.8% of users have privacy concerns about MitM systems that decrypt encrypted traffic, and 83.2% of the respondents believe that third-party inspection should be announced in advance. MitM technology can provide either traffic inspection or communication privacy, but not both, due to the intrinsic conflict between the two. With the popularity of HTTPS, TLS/SSL encrypted traffic has gradually become the majority of network traffic. Performing traffic inspection while protecting the privacy of both parties has therefore attracted much research. We aim to implement traffic inspection that provides privacy protection and to propose a new method that improves the bandwidth and computational overhead of previous methods.

1.1. Related Works

Encrypted traffic detection technologies are classified into three categories: Searchable Encryption, Machine and Deep Learning, and Trusted Hardware. We give a brief survey as follows.

1.1.1. Searchable Encryption

Sherry et al. proposed BlindBox [4] in 2015, which is the first privacy-preserving deep packet inspection scheme using searchable encryption techniques. They adopted oblivious transfer (OT) [5–7] along with garbled circuits (GC) [8, 9] to inspect encrypted traffic without decrypting the payloads. The middlebox cannot access the plaintext in the traffic, and the client and the server cannot learn the content of the rules. While this method achieves the desired privacy protection, it requires a significant amount of computation and communication due to the use of garbled circuits [10]. In addition to the considerable overhead of the garbled circuit itself, a new garbled circuit has to be generated for every rule in each new session.

To address the performance bottleneck of BlindBox, Canard et al. proposed BlindIDS [11], a token-matching protocol that uses a Decryptable Searchable Encryption (DSE) tool based on a pairing-based public key algorithm. Compared with BlindBox, BlindIDS drastically reduces the overhead of the rule setup, moving some overhead to the middlebox detection phase.

Fan et al. introduced SPABox [12], which uses oblivious pseudorandom functions in the rule encryption. The middlebox performs two matching operations, namely, keyword matching and machine learning model matching. Like all searchable encryption methods, SPABox also adopts tokenization. The main difference between SPABox and the other methods is its adoption of machine learning, which takes a different approach to token matching.

Ning et al. proposed PrivDPI [13]. This method improves BlindBox by enhancing the setup phase to reduce bandwidth consumption. They introduced the idea of reusable encryption rules, which significantly reduces the bandwidth overhead in the case of multiple continuous sessions. However, the modular operations in token encryption increase the computational overhead compared with BlindBox.

1.1.2. Machine and Deep Learning

Machine learning technology is also widely used in encrypted traffic inspection. Yamada et al. [14] proposed a new anomaly detection technology that analyzes the packet sizes and the temporal characteristics of flows. The scheme proposed by Anderson et al. [15, 16] detects malicious programs mainly from TLS header information and DNS data. These works find that encrypted malware traffic has different characteristics from regular traffic.

Deep learning methods are also widely used in intrusion detection for network security. Ferrag et al. [17] analyzed the performance of seven deep learning models under different data sets using three metrics: accuracy, false alarm rate, and detection rate. Montieri et al. [18] proposed a scheme that allows traffic classification in anonymous browsers (e.g., Tor) employing a hierarchical approach. In detail, the proposed framework was designed with varying constraints, resulting in implementations with different degrees of complexity (in terms of classifiers, features, and reject options). Adept [19] is an attack detection and identification framework for identifying multi-stage distributed attacks on the Internet of Things (IoT). It is based on a hierarchical distributed framework, where local gateways monitor network traffic and generate alerts for any anomalous activity. A central security manager detects attacks by mining the aggregated alerts and identifies the corresponding attack stages using a comprehensive set of features via machine learning. Liu et al. [20] proposed a Flow Sequence Network (FS-Net) scheme that uses recurrent neural networks for encrypted traffic classification. FS-Net takes a multi-layer bi-GRU [21] encoder to learn the representation of the flow sequence and reconstructs the original sequence with a multi-layer bi-GRU decoder; the features learned from the encoder and decoder are combined for classification. Aceto et al. [22] proposed a traffic classifier called DISTILLER, a multi-modal multi-task deep learning method. DISTILLER addresses the performance limitations of existing traffic classifiers based on single-modal deep learning and provides an adequate design basis for sophisticated network management requiring the solution of different network visibility tasks. A new method for classifying end-to-end encrypted traffic using one-dimensional convolutional neural networks was proposed by Wang et al. [23] based on an analysis of traditional machine learning methods for classifying encrypted traffic. This strategy differs from the traditional divide-and-conquer strategy: the 1D-CNN classification strategy integrates feature design, feature extraction, and feature selection in a single framework, making it more likely to yield a globally optimal solution. A Tree-Shaped Deep Neural Network (TSDNN) and a Quantity Dependent Backpropagation (QDBP) algorithm were proposed by Chen et al. [24]. This model can memorize and learn from the minority class to perform malicious flow detection.

1.1.3. Trusted Hardware

Trusted hardware is a new technology that has been deployed for privacy-preserving deep packet inspection. Goltzsche et al. presented ENDBox [25], which uses Intel SGX and supports both local and remote verification. ENDBox and the SGX enclave are connected through a virtual network, and the traffic is decrypted and inspected inside the enclave. The enclaves at the two endpoints are directly connected to enable direct transmission of the traffic; otherwise, the enclave traffic would be garbled. The mcTLS protocol proposed by Naylor et al. [10] modifies the existing TLS protocol to allow the client, the middlebox, and the server to establish an authenticated and secure channel and to exchange read and write secret keys in addition to the session key.

1.1.4. Summary

Searchable encryption-based schemes can enhance privacy protection, but they incur a large overhead. All searchable encryption schemes require two data flows: the TLS session and the tokenized data. Most of the works performing traffic and malware classification leverage machine and deep learning approaches. However, machine and deep learning methods depend heavily on reliable training sets, and they can only classify the traffic; they cannot accurately identify the exact rule matched. Furthermore, for trusted hardware, the security of the hardware itself (e.g., Intel SGX) is still being actively studied. Moreover, it is less efficient for inspection, at least compared to the searchable encryption schemes discussed above. Therefore, improving the efficiency of the searchable encryption method is still a meaningful research direction.

1.2. Our Contributions

This paper proposes an encrypted packet inspection scheme based on oblivious transfer, namely, OTEPI (Encrypted Packet Inspection Based on Oblivious Transfer). OTEPI belongs to the category of searchable encryption-based schemes. OTEPI reduces the bandwidth consumption required for rule encryption without increasing the cost at the packet sender compared to BlindBox. We also adopt the idea of reusable encryption rules from PrivDPI. Though the bandwidth consumption is higher than that of PrivDPI, the computational cost at the packet sender is lower than that of PrivDPI. In general, for arbitrary types of data, the proposed scheme strikes a balance between the low computational resource consumption and high bandwidth consumption of BlindBox and the low bandwidth consumption and high computational resource consumption of PrivDPI. In particular, for plaintext data such as HTML web pages, our scheme optimizes the tokenization method, making OTEPI cheaper than both PrivDPI and BlindBox in terms of computational resource consumption.

Table 1 shows the specific characteristics of the proposed OTEPI method. OTEPI uses oblivious transfer to reduce the bandwidth overhead and avoids exponentiation operations to reduce the computing overhead. We also use NLP tools to segment the payloads, which reduces the number of tokens.

Our contributions are as follows:
(1) We designed a new rule encryption method based on oblivious transfer that protects the privacy of both the traffic and the rules and realizes the reuse of encryption rules. Compared to BlindBox, the rule encryption consumes much less bandwidth; the bandwidth required to encrypt 3000 rules is reduced from 50 GB in BlindBox to 82 MB.
(2) We introduced an NLP-based tokenization that reduces the number of tokens compared to the sliding window method; we generate only 10% to 20% of the tokens generated by BlindBox. With NLP, our token encryption is 1.7 times faster than BlindBox and 7.6 times faster than PrivDPI. For recurring packets, our token encryption is 3.5 times more efficient than BlindBox and 3.8 times more efficient than PrivDPI.
(3) We use sliding window tokenization for payloads unsuitable for NLP, such as images and audio. In this case, the token encryption of OTEPI is 2.6 times faster than that of PrivDPI. For recurring packets, OTEPI is more efficient than PrivDPI. Although it is not as efficient as BlindBox in encryption, OTEPI consumes less bandwidth than BlindBox.

1.3. Article Structure

We organize the paper as follows. Section 1 reviews the related work and presents the contributions of this paper. Section 2 describes the system architecture, threat model, and preliminaries. Section 3 details our scheme. Section 4 provides correctness and security analysis. Section 5 gives the performance evaluations. We conclude in Section 6.

2. Overview

We provide notations, the system architecture, and the threat model used in the paper.

2.1. Preliminaries

For a vector or 1-D array V, we use V[i] to denote the i-th element of V. For a matrix or 2-D array M, the entry in the i-th row and j-th column is denoted by M[i][j], and the i-th row vector of M is denoted by M[i]. For a bit string s, s[i] denotes the i-th bit of s. As in BlindBox and PrivDPI, we tokenize the network traffic into a series of tokens, and the lengths of rules and tokens are fixed to ℓ bits.

2.1.1. Oblivious Transfer

We define the 1-out-of-2 oblivious transfer protocol between two parties, a sender P1 and a receiver P2. P1 has two ℓ-bit strings m_0 and m_1. P2 has a bit b. When the protocol completes, P2 gets m_b without learning m_{1−b}, and P1 gains no information about b. The process is denoted as OT(P1(m_0, m_1), P2(b)) → m_b. We build the oblivious transfer on the Even–Goldreich–Lempel (EGL) OT protocol proposed by Even et al. [5].

In this paper, Enc_pk(·) represents public key encryption and Dec_sk(·) represents the corresponding private key decryption.
(1) For each OT, the parties P1 and P2 first share two ℓ-bit strings x_0 and x_1.
(2) P2 selects an ℓ-bit string k and, using its input bit b, computes v = x_b ⊕ Enc_pk1(k) to send to P1, where pk1 is P1's public key.
(3) P1 calculates m'_0 and m'_1 as follows and sends them to P2: m'_i = m_i ⊕ Dec_sk1(v ⊕ x_i) for i ∈ {0, 1}.
(4) P2, as the receiver of the oblivious transfer, computes m_b as follows: m_b = m'_b ⊕ k.
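For concreteness, the following is a minimal Python sketch of an EGL-style 1-out-of-2 OT in the spirit of the steps above. Textbook RSA with fixed toy primes stands in for the public key encryption Enc/Dec, XOR is used as the masking operation, and the parameter sizes are illustrative only; the paper's exact instantiation may differ.

```python
import secrets

L = 64                                     # bit length l of the transferred strings (assumed)

# Sender P1 holds two l-bit strings m0, m1; receiver P2 holds a choice bit b.
m0, m1 = secrets.randbits(L), secrets.randbits(L)
b = 1

# P1's textbook RSA key pair (toy Mersenne primes, for illustration only).
p, q = (1 << 61) - 1, (1 << 89) - 1
N, e = p * q, 65537
d = pow(e, -1, (p - 1) * (q - 1))

# (1) P1 and P2 share two random l-bit strings x0, x1.
x0, x1 = secrets.randbits(L), secrets.randbits(L)

# (2) P2 picks a random l-bit k, encrypts it under P1's public key, and
#     ties it to its choice bit: v = x_b XOR Enc(k).
k = secrets.randbits(L)
v = (x1 if b else x0) ^ pow(k, e, N)

# (3) P1 un-blinds v against both x0 and x1 and masks each message.
#     Only the branch matching b decrypts back to k; the other yields garbage.
mp0 = m0 ^ pow(v ^ x0, d, N)
mp1 = m1 ^ pow(v ^ x1, d, N)

# (4) P2 removes its own mask k from the chosen branch.
received = (mp1 if b else mp0) ^ k
assert received == (m1 if b else m0)       # P2 recovers m_b and nothing else
```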

2.2. The Architecture

Our solution has a similar architecture to BlindBox [4]. Its architecture is shown in Figure 1.

The system consists of four entities: the Rule Generator RG, the middlebox MB, the client C, and the server S. RG is a third-party agency that generates rules. MB monitors the traffic using the rules provided by RG. C is the party sending network traffic. S is the party that receives the traffic sent by C.

The client C encrypts tokens with a secret key shared by the client and the server in the setup phase. Moreover, MB encrypts the rules under the same key material through oblivious transfer. MB only needs to check whether the encrypted rules and the encrypted tokens are the same; there is no need to decrypt the payloads or tokens.

The goal of the system is that MB can detect the matching of rules in the traffic but cannot access the plaintext of the encrypted traffic, while the client and the server cannot access the rules from RG.

2.3. Threat Model

We assume that at least one of the client and the server in a session is honest and that MB is semi-honest (honest but curious). This assumption is the same as those in BlindBox [4] and PrivDPI [13]. Under this security assumption, there are two threats. The first comes from either C or S. One of C and S can be malicious, but the two entities will not be malicious simultaneously. The case in which both C and S are malicious is beyond the scope of our assumptions because they could deceive MB by collusion.

The second threat comes from MB. MB will not actively attack the encrypted payloads but will monitor and analyze the encrypted tokens to learn the content of the encrypted traffic.

3. Encrypted Traffic Inspection by Oblivious Transfer

This section introduces our oblivious transfer-based encrypted traffic inspection approach. The approach uses the same system architecture and threat model as BlindBox and PrivDPI. Unlike BlindBox and PrivDPI, we use OT alone to realize the secure multi-party computation [26] that encrypts the rules. Meanwhile, to address the excessive number of useless tokens generated by current tokenization methods, we introduce an NLP-based tokenization method, which significantly reduces the number of tokens.

Table 2 describes the variables used in our approach.

3.1. System Flow

Our solution includes the following phases:
(1) Setup: MB receives the rules and the rule validation information from RG.
(2) Rule preprocessing: MB interacts with C and S to establish a set of reusable obfuscated rules using oblivious transfer. This phase ensures that C and S will not learn the rules and that MB cannot learn the keys used by C and S.
(3) Packet tokenization and token encryption: C tokenizes the payloads, encrypts the tokens, and loads them into the traffic.
(4) Token inspection: MB inspects the tokens sent by C to search for matches of the rules obtained in phase (2).
(5) Token validation: S checks whether the tokens sent by C accord with the payloads of the TLS/SSL session.

3.2. Setup

The rule set from RG is denoted by R_1, …, R_n. In this phase, RG, C, and S set up the parameters used in the procedure. We assume that RG has the public key of MB and that C and S have the public key of RG. For each R_i, RG generates a rule verification pair (c_i, σ_i), where c_i is the ciphertext of R_i encrypted with the public key of MB and σ_i is the signature of c_i signed by RG. RG sends all rule verification pairs (c_i, σ_i) to MB. MB verifies each σ_i with the public key of RG and decrypts each c_i to obtain the rule set.

Next, C and S establish a session with a session key sk. C and S generate the key array A and the other shared parameters using the same method with sk as a random seed. The array A has ℓ entries, where each entry is a pair of ℓ-bit strings. A[j] denotes the j-th pair of A, and A[j][b] is the b-th bit string of the pair A[j] (b ∈ {0, 1}). A is used in token and rule encryption by OT; in the OT, A[j][b] serves as the stand-in for the bit value b at position j of the token. A has a total of 2ℓ² bits.

The random number shared by C, S, and MB is used as the seed to generate a random array salt. This array will be employed to obfuscate duplicate tokens against frequency-based attacks by MB. We give the whole process in Algorithm 1.

for i = 1 to n do
 RG encrypts R_i with MB’s public key, obtaining c_i.
 RG signs c_i with RG’s private key, obtaining σ_i.
 RG sends (c_i, σ_i) to MB.
end for
C and S use sk as a random seed to generate the key array A and the other shared parameters.
RG generates two random big prime numbers p and q.
N = p · q.
RG sends N and the shared random seed to the other parties over a secure channel.
MB decrypts and verifies each (c_i, σ_i) from RG.

ct_c(T) represents the encryption result of the c-th occurrence of the string T. f(·) denotes a one-way function; in this paper, we use the Rabin one-way function. RG is responsible for generating the parameters of the Rabin function, including two big prime numbers p and q and the modulus N = pq. RG sends N to the other parties and keeps p and q secret.
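As an illustration, the sketch below shows the role of the Rabin function; the primes here are toy values, whereas RG would generate large random primes and publish only N.

```python
import secrets

# Toy Rabin parameters: RG would pick large random primes and keep p, q secret.
p, q = (1 << 61) - 1, (1 << 89) - 1        # illustrative primes, both 3 (mod 4)
N = p * q                                  # only N is shared with the other parties

def f(x: int) -> int:
    """Rabin one-way function: squaring modulo N is easy, while inverting it
    without the factorization of N is as hard as factoring N."""
    return pow(x, 2, N)

print(f(secrets.randbits(64)))
```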

3.3. Rule Preprocessing

In this phase, MB obtains the rule set encrypted with keys derived from the array A by running oblivious transfers with C and S. The procedure of the rule preprocessing is shown in Figure 2.

The security requirement in the rule encryption procedure is that MB should not obtain the key array A and that C/S should not obtain any rule. The rule encryption process has the following steps (a)–(e).

RG processes the rule set as follows:
(a) Standardization of rule length: RG pads each R_i with 0's or computes its hash value such that the resulting length is ℓ. Each R_i is thus an ℓ-bit string.
C/S performs the following operations:
(b) Verification: C/S uses the public key of RG to check whether c_i and σ_i are matched.
(c) Generation of the key confusion vector: for each R_i, C/S generates a mask array M_i that has ℓ entries, where each entry M_i[j] is an ℓ-bit string. For any i, the following relation holds:
M_i[1] ⊕ M_i[2] ⊕ … ⊕ M_i[ℓ] = 0^ℓ.
(d) Then, C/S encrypts A using M_i. The result is an array B_i, where
B_i[j][b] = A[j][b] ⊕ M_i[j], for j = 1, …, ℓ and b ∈ {0, 1}.
Next, C/S and MB run the OT as follows.
(e) Rule encryption: for each bit R_i[j], MB and C/S run the OT protocol
OT(C/S(B_i[j][0], B_i[j][1]), MB(R_i[j])) → B_i[j][R_i[j]].

Through the OTs, MB gets all B_i[j][R_i[j]] from C/S. MB then computes the keys used to encrypt the rule set, namely, K. K is an array of n entries, where
K_i = B_i[1][R_i[1]] ⊕ B_i[2][R_i[2]] ⊕ … ⊕ B_i[ℓ][R_i[ℓ]].

The bit string K_i is the key used in encrypting R_i.

MB and S run the same procedure, and MB computes an alternative K. MB checks whether the two K's are the same; if the results are different, the procedure stops. Finally, MB encrypts each R_i with the key K_i and the one-way function f.

The whole process is shown in Figure 3. The mask M_i prevents MB from learning the content of A but still allows MB to encrypt the rule set with keys derived from A. We will prove the correctness in Section 4.2.
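The following is a minimal sketch, under the reconstruction above, of how MB could assemble the per-rule key from the ℓ oblivious transfers. The XOR combination and the zero-summing masks mirror steps (c)–(e); the `ot` stand-in would be replaced by the EGL OT of Section 2.1.1, and the exact formulas are those of the paper's equations.

```python
import secrets
from functools import reduce

L = 8                                        # a small l, for illustration only

# C/S side: key array A (one pair of l-bit strings per bit position) and a
# per-rule mask array M_i whose entries XOR to zero.
A = [(secrets.randbits(L), secrets.randbits(L)) for _ in range(L)]
M_i = [secrets.randbits(L) for _ in range(L - 1)]
M_i.append(reduce(lambda a, b: a ^ b, M_i))  # force M_i[0] ^ ... ^ M_i[l-1] = 0
B_i = [(A[j][0] ^ M_i[j], A[j][1] ^ M_i[j]) for j in range(L)]

def ot(pair, choice_bit):
    """Stand-in for one 1-out-of-2 OT run: MB learns only pair[choice_bit]."""
    return pair[choice_bit]

# MB side: one OT per bit of the (padded) rule R_i, then XOR of the outputs.
rule_bits = [1, 0, 1, 1, 0, 0, 1, 0]
K_i = reduce(lambda a, b: a ^ b, (ot(B_i[j], rule_bits[j]) for j in range(L)))

# Because the masks cancel, K_i equals the key the client derives from A.
assert K_i == reduce(lambda a, b: a ^ b, (A[j][rule_bits[j]] for j in range(L)))
```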

3.3.1. Obfuscating Repeated Tokens

We use an array salt to hide repeated tokens; salt is an array of ℓ-bit strings. MB and C/S use the same method to generate the same salt. A token that has c copies among the previous tokens is masked by salt[c]; thus, all encrypted versions of the same token are different. We set salt[0] to the all-zero string.

3.4. Packet Tokenization

We introduce natural language processing (NLP) into traffic tokenization. Many packets have text payloads of natural languages and program code. These texts are composed of words (keywords) and delimiters that represent the grammatical structure of the text. The inspection rules for these texts have the same properties, for example, in parental control systems and keyword censoring. The NLP-based tokenization segments the payload without generating tokens that violate these structural properties. It also supports languages with longer encodings, such as Chinese, Japanese, and Korean. C pads each token with 0's or computes its hash value, as RG does with rules, which ensures that both the tokens and the rules are ℓ-bit strings.
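The snippet below illustrates this kind of tokenization with the Python jieba package (the counterpart of the Cppjieba library used in our implementation). The stop-word list, the padding rule, and the use of a truncated SHA-256 digest for over-long tokens are assumptions of this sketch rather than the paper's exact choices.

```python
import hashlib
import jieba

L_BYTES = 8                                   # l = 64 bits

STOP_WORDS = {"a", "the", "of"}               # assumed, minimal stop-word list

def to_fixed_length(token: str) -> bytes:
    """Pad short tokens with 0's; hash over-long ones down to l bits."""
    raw = token.encode("utf-8")
    if len(raw) <= L_BYTES:
        return raw.ljust(L_BYTES, b"\x00")
    return hashlib.sha256(raw).digest()[:L_BYTES]

payload = "login.html?username=bob adult check"
tokens = [t for t in jieba.lcut(payload)
          if t.strip() and t.lower() not in STOP_WORDS]
fixed_tokens = [to_fixed_length(t) for t in tokens]
print(tokens)                                 # dictionary-based segmentation
```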

3.5. Token Encryption

We take a token as an ℓ-bit string. Given a token with content T such that there are c tokens with content T among the previous tokens, the client encrypts T using the entries of A selected by the bits of T, the one-way function f, and the mask salt[c].

For duplicated tokens, the encryption can be simplified. The client records the ciphertext of the last occurrence of T and the number of occurrences of T so far. For a token with content T that has already appeared, the client derives the new ciphertext from the recorded ciphertext, f, and salt[c]. The client then updates the recorded ciphertext and increases the occurrence counter of T by 1 (see Algorithms 2 and 3).
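A minimal sketch of this client-side procedure is given below. The concrete formulas — a first-occurrence ciphertext f(key ⊕ T) with key = A[1][T[1]] ⊕ … ⊕ A[ℓ][T[ℓ]], and a repeated-occurrence ciphertext f(previous ciphertext) ⊕ salt[c] — are assumptions consistent with the description above; the paper's equations give the exact rules.

```python
import secrets
from functools import reduce

L = 64
N = ((1 << 61) - 1) * ((1 << 89) - 1)        # toy Rabin modulus
f = lambda x: pow(x, 2, N)                    # Rabin one-way function

# Shared state; in the protocol, A and salt are derived from shared seeds.
A = [(secrets.randbits(L), secrets.randbits(L)) for _ in range(L)]
salt = [0] + [secrets.randbits(L) for _ in range(1023)]   # salt[0] is all-zero

last_ct, count = {}, {}                       # per-token client-side records

def encrypt_token(token: int) -> int:
    """Encrypt one l-bit token, handling first and repeated occurrences."""
    if token not in count:                    # first occurrence of this content
        bits = [(token >> j) & 1 for j in range(L)]
        key = reduce(lambda a, b: a ^ b, (A[j][bits[j]] for j in range(L)))
        ct = f(key ^ token)
    else:                                     # repeat: chain with f and the shared mask
        ct = f(last_ct[token]) ^ salt[count[token]]
    count[token] = count.get(token, 0) + 1
    last_ct[token] = ct
    return ct

stream = [0xCAFEBABE, 0xDEADBEEF, 0xCAFEBABE]
encrypted = [encrypt_token(t) for t in stream]
```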

Input: token series, A, salt
Output: encrypted token series
for each token T do
 if T occurs for the first time then
  compute the ciphertext of T from A, T, and f
  record the ciphertext of T
  set the occurrence counter of T to 1
 else
  compute the ciphertext of T from its recorded ciphertext, f, and salt
  update the recorded ciphertext and increase the counter of T by 1
 end if
end for

Input: encrypted token series, K, salt, and f
Output: whether an encrypted token matches a detection rule
for i = 1 to n do
 initialize the expected ciphertext I_i of rule R_i from K_i and f
end for
for each encrypted token do
 if the token equals some I_i then
  report a matching of R_i
  update I_i in the same way as the client updates a repeated token
  increase the occurrence counter of R_i by 1
 end if
end for
3.6. Token Inspection

To search for occurrences of the rules in packets, MB matches the encrypted rule set against the encrypted token sequence. To accord with the token encryption for duplicated tokens, MB initializes the expected ciphertext I_i of each rule R_i from K_i and f. When an encrypted token arrives, MB compares the token with each I_i. If the token matches I_i, MB updates I_i in the same way as the client updates a repeated token.

The counter of occurrences of R_i is increased by 1. MB also takes actions such as disconnecting the session or issuing a warning to the user or administrator. The whole process is shown in Algorithm 3.
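The sketch below mirrors the client-side sketch in Section 3.5: the expected ciphertext of rule R_i is assumed to start at f(K_i ⊕ R_i) and to be advanced with f and salt after every hit. The exact update rule is given by the paper's equations; this is only an illustration of the matching loop.

```python
def inspect(encrypted_tokens, rules, keys, f, salt):
    """Return per-rule hit counts; rules and keys are lists of l-bit integers."""
    expected = {i: f(keys[i] ^ rules[i]) for i in range(len(rules))}
    hits = {i: 0 for i in range(len(rules))}
    for ct in encrypted_tokens:
        for i in expected:
            if ct == expected[i]:                             # rule i matched
                hits[i] += 1
                expected[i] = f(expected[i]) ^ salt[hits[i]]  # expect the next repeat
                # here MB would raise an alert or disconnect the session
    return hits
```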

3.7. Token Validation

The receiver S runs the same tokenization and token encryption on the decrypted TLS/SSL traffic. S checks whether the encrypted token sequence received from C is consistent with the decrypted payloads. Any inconsistency implies that C is malicious.

3.8. Detecting the Malicious Middlebox

This section considers a stronger adversary that does not follow the protocol and applies chosen-plaintext attacks to the proposed system. We present a mechanism for the client to detect a middlebox that uses faked rules. For a given MB, RG generates rule verification information that the clients can use to verify the honesty of the middlebox, that is, to verify that the middlebox uses the unfeigned rules in the rule encryption phase. First, RG determines all the parameters of the OTs between the given MB and the clients, namely, the shared strings x_0 and x_1 and the blinding string k used in the OT for each rule bit, and stores them in arrays indexed by rule and bit position. The first message of the OT between MB and C for a rule bit is then fully determined by these parameters and the client's public key.

The rule verification information r_i for R_i is defined as a function of these first messages.

RG can compute each r_i, as it has all the parameters for computing each first message. Then, RG sends each r_i to the client. In the subsequent rule encryption phase between C and MB, the client computes r_i from the messages in the OTs with the same method as in equation (15). If the two values are equal, the client is sure that MB uses the unfeigned rule.

The above mechanism imposes a heavy burden on RG since RG must compute the verification information for all the sessions between any clients and servers. An improvement is to use a garbled circuit. RG builds a garbled circuit that computes each r_i for a given middlebox. As all the parameters for computing r_i except the client's public key are known before a session, the circuit's input is the client's public key. When a client starts a session, it asks RG for the garbled circuit corresponding to its MB and computes each r_i with its own public key as the input. In the rule encryption, the client computes r_i from both the garbled circuit and the messages from MB and checks whether the two are the same. In this scheme, RG only needs to build one garbled circuit per MB rather than compute verification information for every pair of client and server.

4. Security and Correctness

4.1. Security and Correctness Requirements

The security definition follows BlindBox, PrivDPI, and Song et al. [27]. Either of the two endpoints in our system may be malicious, but at least one of the two endpoints is honest. This requirement is also essential for any intrusion detection system [28], because when both ends are malicious, they can collude to cheat the middlebox.

There are some requirements implemented in methods such as BlindBox and PrivDPI. (A) MB can perform rule detection: MB can identify the substring of the payload matching a rule. (B) C/S cannot obtain the rules used by MB. This requirement prevents C/S from eluding detection. It also allows the rule supplier (RG) to keep the confidentiality of the rules, which are its pivotal assets.

In addition to the above two requirements, we also achieve the same security requirements as BlindBox and PrivDPI: (C) MB cannot decrypt the payloads. (D-i) MB cannot decrypt the encrypted token sequence. (D-ii) MB cannot analyze the frequency of plaintext occurrences from the sequence of encrypted tokens; MB can only learn the number of occurrences of the rules in the session and cannot learn the frequencies of other tokens.

4.2. Correctness

We prove that the requirement (A) is met.

4.2.1. Correctness Definition

Correctness is defined as follows. (i) Assume that a substring of the plaintext equals rule R_i. MB will identify the corresponding token and report a matching of R_i. (ii) When a token does not equal any rule, the probability of reporting a matching is negligibly small.

4.2.2. Correctness Guarantees

We first prove that MB computes the rule encryption keys correctly, that is, that the key K_i obtained by MB through the OTs equals the key the client derives from A for a token with content R_i. According to equation (8), we have the following:

K_i = B_i[1][R_i[1]] ⊕ … ⊕ B_i[ℓ][R_i[ℓ]] = (A[1][R_i[1]] ⊕ M_i[1]) ⊕ … ⊕ (A[ℓ][R_i[ℓ]] ⊕ M_i[ℓ]).

As M_i[1] ⊕ … ⊕ M_i[ℓ] = 0^ℓ, we have

K_i = A[1][R_i[1]] ⊕ … ⊕ A[ℓ][R_i[ℓ]].

Therefore, K_i is exactly the key the client uses to encrypt a token with content R_i.

Let the content of a token be R_i, and let it be the first occurrence of R_i in the token series. The client encrypts the token with the key A[1][R_i[1]] ⊕ … ⊕ A[ℓ][R_i[ℓ]] = K_i, which is the value from which MB initializes I_i, so MB detects the match. Now assume that a token with content R_i is the c-th occurrence of R_i in the token series and that MB detected the (c−1)-th occurrence. The client and MB then advance the ciphertext of R_i with the same rule, using f and the shared array salt, so the encrypted token of the c-th occurrence again equals I_i on the MB side and MB detects the match.

Therefore, correctness definition (i) holds. As the keys for encryption are random, the ciphertexts of two different tokens may coincide, which we call a collision. The probability that a token and a rule lead to a collision is 2^(−ℓ) (the first type of birthday attack). For ℓ = 64 or 128, this probability is 2^(−64) or 2^(−128), respectively. Correctness definition (ii) holds.

4.3. Security

We first show that when one of C and S is not honest, the honest MB can detect the case, and the honest S can detect a dishonest C. In the rule encryption stage, MB works out the encrypted rules along with both C and S. If the two results are not the same, MB knows that one of C and S is not honest. S holds the session key and decrypts the SSL/TLS traffic, so S can verify whether C sends the tokens in accord with the SSL/TLS traffic.

We then prove that requirements (B), (C), (D-i), and (D-ii) are met. Because MB does not have the session key to decrypt the payloads, requirement (C) is met.

For requirement (B), by the property of the oblivious transfer, C/S cannot learn the bits of each rule. We also hide the length of the rules. Requirement (B) is met.

We now consider requirements (D-i) and (D-ii). Since f is a one-way function and p and q are not known by MB, MB cannot recover the plaintexts of the encrypted tokens. The security definition (D-i) holds. For each rule bit R_i[j], MB obtains only B_i[j][R_i[j]] and learns nothing about the other string of the pair through the OT. As the mask entries M_i[j] are different random bit strings, MB cannot recover the entries of A from the strings it obtains.

As the same tokens are masked with different entries of salt, the resulting encrypted tokens are different, which prevents frequency-based attacks from MB. Likewise, an eavesdropping adversary cannot recover the plaintexts of the encrypted tokens or their frequencies. As the OTs between MB and C/S protect the rules from leaking, an eavesdropping adversary cannot detect the matching of a rule. The security definition (D-ii) holds.

In this approach, we have fulfilled the requirement (D).

5. Performance Evaluation

We conducted the experiments on a PC with an Intel(R) Core(TM) i5-6300U CPU with four cores at 2.20 GHz, running 64-bit Windows 10. We use OpenSSL-1.1.1a to implement encryption and message sending and Cppjieba to implement the natural-language-processing-based tokenization. The one-way function we use is the Rabin function. We employ RawCap-0.2.0 to monitor the traffic and Wireshark-3.4.5 to collect statistics on the traffic. Each test is conducted 1,000 times, and the reported running time is the average of the runs. The experiments use open-source rule sets and both real and random network traffic. The detection rules are randomly inserted into test packets to measure the accuracy of MB by checking whether the rules are matched correctly.

We compare OTEPI with BlindBox and PrivDPI. The three approaches have the same security and threat model and perform the same function. The existing machine learning and deep learning approaches only inspect the plaintext part of the traffic and perform security functions different from those of OTEPI.

5.1. Client (or Server)

The client/server’s main computation and communication overhead are in the token encryption step.

5.1.1. Tokens with Distinct Content

BlindBox performs AES encryption twice for one token. OTEPI performs bitwise XORs and a one-way function on ℓ-bit strings. PrivDPI requires one exponentiation, one multiplication, and one AES encryption. The comparison of the token encryption time is shown in Figure 4. The token encryption time is linear in the number of tokens in all three approaches. For the same number of tokens, OTEPI consumes twice as much time as BlindBox, and PrivDPI consumes 5.6 times as much time as BlindBox.

The token encryption is much faster than the rule encryption in both OTEPI and BlindBox. In the rule encryption, BlindBox needs to transmit and evaluate garbled circuits, and OTEPI needs to run OTs to transfer keys. In the token encryption, none of these operations is needed.

The NLP-based tokenization is more flexible than the fixed-length tokenization used in BlindBox and PrivDPI. For example, for the payload “login.html?username=bob,” NLP can yield “login” and “username=bob” instead of “login.ht.” Most NLP tools support dictionary-based segmentation, which is suitable for texts. The number of tokens is greatly reduced by discarding meaningless words such as “a” and “the.” NLP-based tokenization also transfers some computation of MB to the client side; an MB is usually heavily loaded, and it is desirable for the client to share the load.

The sliding window based tokenization leaks information about the payload length. Tokenization by NLP can hide the payload length since meaningless words are not recorded.

The running time of the tokenization by NLP is 2.3 to 2.8 times more than that of the sliding window. The time used for NLP tokenization is shown in Figure 5.

As NLP reduces the number of tokens, the time for token encryption and matching is reduced. In Figure 6, we compare the time of tokenization and token encryption of the client in BlindBox, PrivDPI, and OTEPI. It shows that by using NLP tokenization, OTEPI becomes the fastest in token encryption.

5.1.2. Tokens with Duplicate Content

When tokens repeat, the encryption time differs from that for tokens with unique content. The client (or server) uses the recorded encrypted tokens for acceleration in all three approaches. In BlindBox, an existing token is re-encrypted using a counter of its previous occurrences as a salt. The re-encrypting method in OTEPI can be found in equation (12). PrivDPI uses table lookup to compute the exponentiation and multiplication operations for duplicated tokens, such that only one AES operation is needed.

We evaluate the encryption time for the traffic with different percentages of repeated tokens. Repeated tokens are common in the real world. For example, when searching for recipes and travel brochures online, multiple queries will return similar results.

In Figure 7(a), we use 500 tokens, among which 10% to 100% are repeated tokens. When the repetition rate of the tokens is 100%, the encryption times of OTEPI and BlindBox are the same.

In Figure 7(b), we show the computational overhead of the server when encrypting an HTML web page accessed for the second time. As Figure 7(a) shows, for recurring tokens, our encryption is faster than BlindBox and PrivDPI: the running time of BlindBox is about 3.5x that of OTEPI, and that of PrivDPI is about 3.8x that of OTEPI.

5.2. Middlebox
5.2.1. First Session

For 3000 rules, the time required by MB for encrypting the rules and the communication overhead for obtaining these encrypted rules are shown in Table 3.

The high bandwidth consumption of BlindBox is due to the garbled circuits. In OTEPI, the bandwidth consumption is significantly reduced compared to BlindBox because of the low bandwidth consumption of OT in the rule setup. PrivDPI only needs to send a few group elements per rule, which incurs very low bandwidth.

A comparison of the rule encryption time of the three approaches is shown in Table 4. OTEPI has a high time consumption because each rule requires ℓ OTs (64 or 128). BlindBox requires one garbled circuit per rule, while PrivDPI only requires one exponentiation. In BlindBox, the transmission between MB and C/S is a garbled circuit of the AES encryption under the client-side key used to encrypt tokens. Using the garbled circuit, MB encrypts rule R_i; BlindBox then adds a random salt to compute the ciphertext of the rule for each of its occurrences. In OTEPI, the computation costs mainly come from the oblivious transfers. In both BlindBox and OTEPI, the setup of the encrypted rules has a high cost, so there is a huge gap in time.

5.2.2. Subsequent Sessions

We compare the bandwidth usage of OTEPI and BlindBox in the case of multiple sessions. We still use 3000 rules in the tests. The results are shown in Table 5.

In subsequent sessions, OTEPI consumes less bandwidth than BlindBox, which needs to generate a garbled circuit for each rule in each session. PrivDPI transmits the rule encryption parameters in the first session; it can reuse the obfuscated rules set up in the first session in subsequent sessions and only sends one group element in each subsequent session.

In terms of the bandwidth consumption of multiple sessions, though not as efficient as PrivDPI, OTEPI significantly reduces the bandwidth consumption of establishing encryption rules compared with BlindBox. Meanwhile, OTEPI also achieves reusable obfuscated rules, as PrivDPI does.

5.2.3. Accuracy of Tokenization

The accuracy of tokenization impacts the recognition accuracy of the system. OTEPI, BlindBox, and PrivDPI report a detection when a token matches a rule. In this set of experiments, rules from different rule sets are randomly inserted into the traffic, and the matching accuracy of MBs using different tokenization methods is tested.

We use three rule sets in the accuracy test: testfilter(cn) [29] is a pure Chinese rule set, parentfilter [30] is a parental filter rule set, and testfilter(cn-en) [31] is a rule set mixed with Chinese and English rules. As shown in Figure 8, for the parental-filtering rules, BlindBox has a higher accuracy rate than OTEPI. This is because parental-filtering rules are long, and the NLP tools divide a rule into several words; e.g., “zippyvideos” is divided into “zippy” and “videos,” which affects the accuracy. For testfilter(cn) and testfilter(cn-en), each Chinese character occupies 2-3 bytes under UTF-8 encoding. In BlindBox, rules shorter than the sliding window may be missed. As an example, the first token of the text “adult check” is “adult ch” under an 8-byte window, so the rule “adult” will be missed. The fixed-length tokenization is not as accurate as the NLP tokenization because sensitive words are often short. The famous anonymous website 4chan, for example, does not have any board with a name longer than four characters and uses shortened forms of multisyllabic words.

5.3. Summary

Compared with BlindBox, we significantly reduce the communication bandwidth for rule encryption from 50 GB to 82 MB. Although our bandwidth consumption is higher than that of PrivDPI, our rule encryption is faster than that of PrivDPI. Without NLP tokenization, our token encryption is 2.6 times faster than PrivDPI's. When NLP tokenization is used for HTML or other plaintext data, OTEPI achieves a 1.7x speedup over BlindBox and a 7.6x speedup over PrivDPI. In terms of accuracy, OTEPI has a higher recognition rate than BlindBox and PrivDPI for short rules and a slightly lower recognition rate than BlindBox for parental-filtering URL rules.

6. Discussion

Many directions can be developed in the future based on the scheme proposed in this paper. Advances in NLP technology that produce fewer and more accurate tokens can improve the accuracy and computational performance of OTEPI. The bandwidth overhead of oblivious transfer is still larger than that of PrivDPI when encrypting rules; finding or optimizing an oblivious transfer algorithm that saves more communication traffic can bring better bandwidth performance to OTEPI. OTEPI currently supports middleboxes for DPI filtering only. Machine learning approaches can also benefit from secure multi-party computation. We believe that the general blueprint OTEPI provides can extend machine learning approaches to process encrypted payloads.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the Commission of Development and Reform of Jilin Province No. 2019C053-10 and the Education Department of Jilin Province No. JJKH20190162KJ.