Abstract

Attackers increasingly distribute malicious links in a variety of ways to mislead users. They try to control victims’ machines or obtain their data remotely by gaining access to the private information that users exchange online. QR codes are two-dimensional barcodes that can encode various data types and can be read by digital devices such as smartphones. However, there is no approved protocol for QR code generation; therefore, QR codes are exposed to several attacks. Some security countermeasures against barcode-based attacks exist, but several of them are restricted to malicious link detection techniques that require knowledge of cryptographic methods. Therefore, this study aims to detect malicious links embedded in 1D (linear) and 2D (QR) codes. We propose a cybercrime attack based on barcode counterfeiting that can be used to perform online attacks. A dataset of 100000 malicious and benign URLs was created from several sources, and their lexical features were extracted. Analyses were conducted to illustrate how different features affect the scanning of online barcode content. Several artificial intelligence models were implemented, and a decision tree classifier was identified as the most suitable model for identifying malicious URLs. Based on these outcomes, we recommend a secure artificial intelligence barcode scanner (BarAI) that detects malicious barcode links with an accuracy of 90.243%.

1. Introduction

A QR code is a machine-readable code consisting of an array of black and white squares, typically used to store URLs or other information to be read by devices such as smartphones [1]. The data encoded in a QR code can be retrieved within a few seconds, thanks to the high speed at which the validity of the code received from the sensor is verified [2].

Due to the high price of tags and identification devices, some researchers have turned their attention to smartphone cameras as an alternative means of reading identification sources such as fingerprints and barcodes [3, 4].

When Denso Wave invented the QR code in 1994, the main objective was to enable quick scanning in automobile manufacturing [1]. QR codes are now widely used in much broader contexts, such as commercial tracking and mobile tagging. QR codes can also be used for collecting data, sensing, and reading parameters from different environments [2].

QR codes were confirmed as an international standard in 2000 [5]. The current standard version was published in 2015 [6]. QR codes can store various information types, for instance, numeric (0–9), alphanumeric (letters and numerals), and binary data (0 and 1), as well as Kanji characters (Japanese writing) [7]. Table 1 briefly describes the capacities of different data types used in QR codes.

Typically, a QR code image contains two regions: the encoding and function pattern regions [6, 7] (see Figure 1).

QR codes have come into extensive use because of the technological limitations of linear (one-dimensional, 1D) barcodes, as there has been an increasing demand for more information storage than a 1D barcode can provide [7].

QR codes have become widespread in several fields. They can be attached to any screen, poster, or product surface and are used effectively in education, transportation, product tracking, ticketing, SmartTags, book returning methods in libraries, payment transfer systems, and tourism promotion [2, 8–14].

Nowadays, QR codes are used as an environment-friendly step toward a sustainable marketing strategy in various sectors, such as education, fish farming, land management, and healthcare services. For instance, QR codes can improve healthcare services through effective patient identification management. Personal data can be associated with the QR codes on the patients’ wristbands [15, 16]. Healthcare providers can use a QR code scanner application on their smartphones to access patient information, medication, and medical reports [17–20]. QR codes also enable high-speed component scanning in factories [21].

In some cases, secure barcodes can be used in IoT apps to add security, privacy, and management layers, as a free alternative to RFID tags. Barcodes can be a bridge that connects IoT objects to cloud computing, where the cloud can handle big data operations and allow security factors in IoT development [4, 22].

There are various security threats linked with QR codes. Barcodes are unreadable without a reader device or app. Moreover, there is no approved protocol for QR code generation; therefore, QR codes are exposed to several attacks. Some security countermeasures against barcode-based attacks exist, but several of these solutions are restricted to malicious link detection techniques that require knowledge of cryptographic methods [7, 23, 24].

The main objective of this study is to detect malicious URLs embedded in barcodes (both 1D and QR codes). We propose a cybercrime attack based on barcode counterfeiting that can be used to perform online attacks: a malicious 1D barcode segment is pasted over a legitimate QR code image to deceive users. In addition, we conducted tests that show how different features affect barcode scanning. A dataset of 100000 malicious and benign URLs was created from several sources, and their lexical features were extracted. Furthermore, five classifiers were compared to select the most suitable one for detecting malicious URLs. The classifiers were as follows: naive Bayes (NB), support vector machine (SVM), logistic regression (LR), K-nearest neighbors (KNN), and decision tree J48 (DT).

1.1. Contributions

The contributions of this study are summarized as follows. (i) We explore a type of barcode-in-barcode attack based on QR code counterfeiting that can be used to perform online attacks. (ii) We conducted tests that show how different factors such as size and distance affect barcode scanning. (iii) We built an AI model to detect malicious URLs encoded in barcodes based on the URL lexical properties. (iv) We applied several AI classifiers and compared them. (v) We developed BarAI based on the best model against malicious QR code links and analyzed the comparison results.

1.2. Paper Structure

The structure of this paper is as follows. Section 2 reviews the literature on countermeasures against QR code attacks. Section 3 discusses the barcode injection attack, and Section 4 presents our materials and methods. Section 5 explores the experimental techniques and outcome evaluation. Section 6 discusses BarAI and the comparison results. Finally, Section 7 draws the concluding remarks and presents the topics for future work.

2. Literature Review

This section reviews the state-of-the-art research on the available countermeasures and solutions for protecting 2D barcodes. We first summarize the relevant cryptography and information security terms and algorithms and then discuss barcode security solutions.

The main security terminology involves three terms known as the CIA triad: confidentiality, integrity, and availability [7]. Confidentiality means protecting data from being accessed by unauthorized entities. It is commonly achieved by encrypting data so that only authorized users who have the key can decrypt and access the contents. Data integrity means assuring that data were delivered accurately and were not modified by unauthorized entities. Availability indicates that the information system should be available whenever it is needed. In addition, information security includes authentication, which aims to verify the identity of users or entities, and nonrepudiation, which ensures that an entity cannot deny having sent a message or signed a document [7].

Public-key (asymmetric) cryptography is a cryptographic method with two keys: public and private. It is extensively used in data encryption and authentication [7, 25]. In addition, a hash function takes QR code content as input and outputs a fixed-size value called a “hash.” A hash function is one-way: it is computationally hard to recover the original content from the hash value. With a secure and robust hash function, it is computationally infeasible to find two QR code contents with the same hash value.
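
As a minimal sketch of the fixed-size, content-sensitive behavior of a hash function, the following Python snippet hashes two QR code payloads with SHA-256 from the standard library. The payload strings are hypothetical examples, and SHA-256 is chosen here only for illustration; it is not necessarily the function used in the cited works.

```python
import hashlib

def qr_content_hash(content: str) -> str:
    """Return the SHA-256 digest (hex string) of a QR code payload."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()

# Hypothetical payloads that differ by a single character.
digest_a = qr_content_hash("https://example.com/pay?id=1")
digest_b = qr_content_hash("https://example.com/pay?id=2")

# The digest length is fixed (64 hex characters = 256 bits) regardless of input size.
print(len(digest_a), len(digest_b))
# A one-character change in the payload yields a different digest.
print(digest_a != digest_b)
```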

Symmetric-key encryption is a system that uses the same secret key for encryption and decryption. Symmetric-key algorithms are considered more straightforward and faster than asymmetric, but the key exchange should be established securely [7, 25].

A digital signature (DS) is a security method that uses public-key cryptography to provide authentication, nonrepudiation, and data integrity for the content. It computes the hash of the content and then signs it with the private key [7, 25]. Rivest–Shamir–Adleman (RSA) is a public-key cryptographic algorithm that uses a mathematical approach based on prime numbers. RSA is widely used for data encryption and digital signatures. In comparison, the elliptic curve digital signature algorithm (ECDSA) is a public-key cryptographic algorithm commonly used for digital signatures.
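
The hash-then-sign idea can be illustrated with a toy textbook RSA sketch in pure Python. The primes, exponent, and helper names below are assumptions chosen for readability; the key is far too small to be secure and no padding is applied, so this only shows the mechanics and is not how a production signature scheme is implemented.

```python
import hashlib

# Toy parameters (illustration only; real RSA keys use much larger primes).
p, q = 61, 53
n = p * q                    # public modulus
phi = (p - 1) * (q - 1)
e = 17                       # public exponent
d = pow(e, -1, phi)          # private exponent (modular inverse, Python 3.8+)

def digest_mod_n(message: str) -> int:
    """Hash the message and reduce it modulo n so that it fits the toy key."""
    h = hashlib.sha256(message.encode("utf-8")).digest()
    return int.from_bytes(h, "big") % n

def sign(message: str) -> int:
    """Sign the (reduced) hash with the private exponent."""
    return pow(digest_mod_n(message), d, n)

def verify(message: str, signature: int) -> bool:
    """Recover the hash with the public exponent and compare it."""
    return pow(signature, e, n) == digest_mod_n(message)

sig = sign("https://example.com")
print(verify("https://example.com", sig))
```

Because the modulus is tiny, different messages can collide after the reduction modulo n; a real deployment would rely on a vetted cryptographic library instead.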

Advanced encryption standard (AES) is a popular symmetric-key algorithm for encrypting electronic data, and it is considered a highly secure algorithm used for confidentiality [7, 25].

References [7, 26, 27] highlighted the gaps in and limitations of the available 2D barcode protection mechanisms. The researchers compared and evaluated 2D barcode security systems in terms of their cryptographic characteristics and security levels. They explored how various usability features affect QR code scanning and assessed several cryptographic techniques with respect to QR code usability. Some asymmetric solutions break QR code usability, whereas the elliptic curve digital signature algorithm (ECDSA) was recommended. The results also showed that symmetric methods were appropriate solutions.

Moreover, according to [28], the authors demonstrated a means of flooding the physical side of the IoT using QR codes by encoding irrelevant content or fake and phishing Web pages. In the experiments, the ECDSA was adopted to guarantee the benign usage of 2D barcodes with physical objects. Different key lengths and hash functions showed various time/space overheads.

Furthermore, in [29], the researchers highlighted QR code phishing attacks. They embedded a fake Google Web page inside QR codes and performed a phishing attack. Their results showed that it was possible to bypass the safe browsing service provided by Google. Consequently, the researchers proposed quick response code secure (QRCS), a comprehensive model that uses a client-server architecture and digital signatures. The QRCS model adopts the ECDSA with the hash function SHA2 or SHA3 (256 bits) to guarantee QR code generator authentication and data integrity. The analysis of the model demonstrated its flexibility of implementation and efficiency against barcode attacks.

Several studies have been conducted using secret hiding schemes based on hamming code and visual secret sharing schemes to protect QR code content and private information during online transactions [30–36]. The study described in [33] was related to computational security by supposing that the attacker technique was restricted to the QR code scanner.

Reference [37] proposed a stereographic scheme to encode message authentication codes and digital signatures to authenticate data inside QR codes. The main advantage of the proposed method was that any barcode reader application could decode the barcode content. Moreover, a universal message authentication code and ECDSA with a small key length (160 bits) were used in the experiments. The results showed that the performance of the proposed scheme was better than those of the existing methods.

3. Will You Trust This Barcode? A Barcode Injection Attack

A commercial (linear) or 1D barcode is represented by parallel vertical lines of varying widths and spacing. Commercial barcodes are widely used to encode particular identification values, such as product IDs [38] and prices. Figure 2 shows an example of a 1D barcode that is used to store a specific URL.

The data type and length vary according to the standard used, and popular linear (1D) barcodes include universal product code (UPC), European article number (EAN), code 128 [39], and postal numeric encoding technique codes [40]. Both UPC and EAN codes support numeric data with fixed sizes, while code 128 supports variable data lengths and allows the encoding of alphanumeric data (all ASCII characters) [39]. In addition, code 39 type enables the encoding of uppercase letters A–Z, numbers 0–9, and several special characters (spaces, ., $, and %), with variable lengths [38].
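
To make the character-set difference concrete, the sketch below checks whether a payload fits Code 39 or Code 128. It assumes the standard 43-character Code 39 set (uppercase letters, digits, and the symbols space, -, ., $, /, +, %) and treats Code 128 as covering the 128 ASCII characters; the function names are hypothetical.

```python
import string

# Standard Code 39 character set: 26 letters + 10 digits + 7 symbols = 43.
CODE39_CHARS = set(string.ascii_uppercase + string.digits + "-. $/+%")

def fits_code39(data: str) -> bool:
    """True if every character belongs to the standard Code 39 set."""
    return all(ch in CODE39_CHARS for ch in data)

def fits_code128(data: str) -> bool:
    """True if every character is a 7-bit ASCII character."""
    return all(ord(ch) < 128 for ch in data)

for payload in ("PRODUCT-123", "http://example.com"):
    print(payload, fits_code39(payload), fits_code128(payload))
```

Under these assumptions, a typical URL does not fit standard Code 39 (lowercase letters and the colon are missing), which is one reason a URL-carrying 1D barcode would more plausibly use Code 128.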

Even with their limited size, 1D barcodes can still be used maliciously, for example, to encode phishing URLs. Here, we explore an attack scenario in which a malicious 1D barcode is injected (pasted) over a legitimate QR code image to deceive users. We consider this attack, which hides a malicious 1D barcode inside a QR code, to be a new form of the barcode-in-barcode attack [41], which hides a malicious 2D barcode inside a QR code. In the barcode injection attack (BIA), an attacker can modify the height of the vertical lines that define the 1D barcode (compare Figures 2 and 3) and paste it over a legitimate QR code image. Note that the cropped 1D barcode image (see Figure 3) does not affect QR code readability because the QR code decoder treats it as noise. The error correction feature of the QR code can recover content damaged by noise; thus, both the 1D barcode and the QR code remain readable.

QR codes adapt to various environments by using the Reed-Solomon error correction method, which facilitates reading barcodes even if some data blocks have been damaged (e.g., by a 1D barcode pasted on the QR code). The Reed-Solomon process adds a group of redundant bits that are used to identify, locate, and correct errors [6, 7]. QR codes support four levels (percentages) of error correction capability for restoring damaged data [7].

Table 2 shows the error correction levels and their tolerances for possible image damage.

Figure 4 shows examples of a BIA (left) and a barcode-in-barcode attack (right). In the BIA, the two barcode types are both readable: the 1D barcode containing a URL and the QR code containing random data (this code could also be a URL in another example). Thus, when reading the same barcode in different scans, the same user could obtain two different contents. The inner barcode is visually better hidden in a BIA than in a barcode-in-barcode attack [41], because the 1D barcode is small and does not have particularly distinct segments.

The QR code generator can select the appropriate error correction level depending on the type and importance of the encoded data. For example, the high error correction level (30%) can be used with severely damaged industrial barcodes distributed in a dirty environment. The low level (7%) is preferred with QR codes displayed electronically [41, 42]. The medium level (15%) is the most frequently used level for QR codes [1]. Attackers can utilize error correction levels to perform BIA attacks, making them reliable and dangerous.
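
As a rough illustration of how an attacker might reason about these levels, the sketch below compares the area covered by a pasted overlay with the nominal tolerance of each level. The 7%, 15%, and 30% values come from the text; the 25% quartile level (Q) is taken from the QR code standard. Treating the covered area fraction as the damage fraction is a simplifying assumption: real Reed-Solomon correction works on codewords per block, and an overlay must also avoid the function patterns. All names and pixel sizes are hypothetical.

```python
from typing import List

# Nominal recovery capacity per error correction level.
EC_TOLERANCE = {"L": 0.07, "M": 0.15, "Q": 0.25, "H": 0.30}

def overlay_fraction(qr_side_px: int, overlay_w_px: int, overlay_h_px: int) -> float:
    """Fraction of the QR image area covered by a rectangular overlay."""
    return (overlay_w_px * overlay_h_px) / (qr_side_px ** 2)

def levels_tolerating(fraction: float) -> List[str]:
    """Levels whose nominal tolerance is at least the covered fraction."""
    return [level for level, tol in EC_TOLERANCE.items() if fraction <= tol]

# Hypothetical 1D strips pasted over a 500 x 500 pixel QR code.
thin_strip = overlay_fraction(500, 200, 50)    # 10000 / 250000
thick_strip = overlay_fraction(500, 250, 100)  # 25000 / 250000
print(thin_strip, levels_tolerating(thin_strip))
print(thick_strip, levels_tolerating(thick_strip))
```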

For non-expert users, it will be challenging to identify the inner hidden barcode visually. In contrast, a barcode-in-barcode attack (Figure 4, right) has a larger internal barcode size and distinct QR code finder patterns. According to [24, 41], some barcode readers can read both the outer and inner barcodes; that is, scanners can read several types of 1D and 2D barcodes.

When reading a barcode that contains a URL, almost all barcode scanners display the encoded URL content before redirecting to the Web page, and the user can decide whether to visit the URL [43, 44].

Thus, attackers can trick users by using a URL shortening service such as is.gd URL shortener [45], BL.INK [46], or Shorby [47]. These services reduce the number of characters required for URLs to 12 and display malicious URLs as short URLs to trick users.

3.1. Barcode Readability Range

The QR code readability range (RR) is defined as the range of distances within which a barcode is readable [26]. A BIA violates the reliability of the barcode and may expose readers to security risks. We employed the RR experiments described in [26] and measured the RR of the BIA content relative to the RR of the QR code.

Figure 5 shows RR for 300 × 300 and 500 × 500 pixel barcodes, where the X-axis represents the data size in bytes, and the Y-axis represents the distance between the scanning device and barcode image. “Max distance” means the maximum distance at which the legitimate QR code can be read. In contrast, “min distance” represents the minimum distance at which the legitimate QR code can be read and the maximum distance at which the BIA code can be read. “Attack min distance” represents the minimum distance at which the BIA code can be read. The BIA code will be readable between the BIA minimum distance and the minimum distance of the legitimate QR code. The scanning device will comprehensively cover the 1D barcode as an inner barcode.

For example, a QR code with a data size of 100 bytes is readable from a distance of 31–190 cm (RR = 159 cm) for a 500 × 500 pixel barcode, whereas a BIA code with the same data size is readable from a distance of 15–40 cm (RR = 25 cm). Thus, scanning the QR code without covering the whole QR code finder pattern will lead to retrieval of the BIA content, putting the user at risk. Similarly, for a 300 × 300 pixel barcode, RR for reading the legitimate QR code ranges from 40 to 190 cm, and that for reading the BIA code ranges from 10 to 20 cm. Thus, it can be concluded that increasing the image size facilitates the implementation of a BIA.
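
The RR values quoted above follow from subtracting the minimum from the maximum reading distance. The short sketch below recomputes them from the distances reported in the text for the 100-byte example; the dictionary layout and function name are illustrative assumptions.

```python
def readability_range(min_cm: float, max_cm: float) -> float:
    """RR is the width of the distance interval in which the code is readable."""
    return max_cm - min_cm

# (image side in pixels, code type) -> (min distance, max distance) in cm, from the text.
measurements = {
    (500, "QR"): (31, 190),
    (500, "BIA"): (15, 40),
    (300, "QR"): (40, 190),
    (300, "BIA"): (10, 20),
}

rr = {key: readability_range(lo, hi) for key, (lo, hi) in measurements.items()}
for (side, kind), value in rr.items():
    print(f"{side} x {side} {kind}: RR = {value} cm")
```

With these figures, the BIA range grows from 10 cm to 25 cm when the image size increases from 300 × 300 to 500 × 500 pixels, which is consistent with the conclusion that a larger image facilitates the attack.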

4. Materials and Methods

Since QR codes may include suspicious online content, there are four possible attack scenarios, as follows:
(i) Embedding of malicious links inside QR codes
(ii) Embedding of malicious links inside QR and BIA codes
(iii) Embedding of benign links inside QR codes and malicious links inside BIA codes
(iv) Use of several BIA codes inside QR codes and embedding with benign and malicious links to confuse users when reading the content of the same barcode

Figure 6 shows the methodology adopted in this study to address these attack scenarios.

4.1. Data Collection

One hundred thousand benign and malicious URLs were collected that might be embedded in QR and BIA codes from various environments. The dataset contained 50000 malicious URLs collected from the most recent phishing [48] and malware domains blacklists [49, 50]. Moreover, we collected 50000 benign URLs of secure websites [51, 52].

4.2. Features’ Extraction

To identify malicious URLs with reduced network delay, the link-based features of URLs were analyzed. Figure 7 presents the URL structure [53].

As shown in Figure 7, cybercriminals frequently exploit the second-level domain, generic top-level domain (gTLD), and path directory to conduct cybercrimes. The domain could be that of a popular website, such as blog, Instagram, and TikTok. The gTLD could be edu, gov, net, or org. Cybercriminals attempt to disguise their malicious links and bypass blacklists by using URL shortening services such as is.gd URL shortener [45], BL.INK [46], and Shorby [47]. Therefore, there is an increasing demand to retrieve the entire set of link-based features of URLs [48, 49, 54]. Correlation feature selection (CFS) [55] was used to assess which lexical URL properties can be adopted in this study. Under CFS, useful features are those that correlate positively with the URL class. All our link-based features have a positive correlation with the class label. Table 3 [54] shows our adopted URL lexical properties.
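
As a minimal sketch of lexical feature extraction, the snippet below derives a few link-based properties from a URL string using only the Python standard library, without any network request. The specific features and the example URL are hypothetical illustrations; they are not the feature set listed in Table 3.

```python
from urllib.parse import urlparse

def lexical_features(url: str) -> dict:
    """Compute simple lexical properties of a URL without fetching it."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "path_length": len(parsed.path),
        "num_dots": url.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "num_hyphens": url.count("-"),
        "has_at_symbol": "@" in url,
        # Last host label, used here as a stand-in for the top-level domain.
        "tld": host.rsplit(".", 1)[-1] if "." in host else "",
    }

example_url = "http://login-example.com/secure/update.php?id=42"
features = lexical_features(example_url)
for name, value in features.items():
    print(name, value)
```

Because such features are computed from the string alone, they can be evaluated on the scanning device itself, which is in line with the goal of reduced network delay.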

4.3. Artificial Intelligence (AI) Classifiers

In this section, we describe the AI classifiers that we used in this study.
(1) Naive Bayes (NB). The NB classifier is a machine learning probabilistic classifier that utilizes Bayes’ theorem and uses conditional independence assumptions between the properties [56].
(2) Support Vector Machine (SVM). The SVM classifier is a supervised machine learning classifier suitable for analyzing data for classification and regression. It aims to get the decision boundary that divides the data collection into two classes. The decision boundary decides if the instances are correctly classified or not [57].
(3) Logistic Regression (LR). The LR classifier is a machine learning classifier that depends on a probability function and is widely used for binary classification to perform predictive analysis [58].
(4) K-Nearest Neighbors (KNN). The KNN classifier is a straightforward machine learning classifier that performs instance-based learning and depends on similarity measurements. In the KNN approach, the instance is assigned to the class through the majority vote of its K neighbors. The KNN method is employed for both classification and regression problems [59].
(5) Decision Tree (DT). The DT classifier is a popular supervised machine learning classifier employed for data classification. It works as a decision support tool that uses a flowchart of hierarchical decisions. The paths from the roots to leaves and branches represent the classification rules that indicate class labels [60].
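
To illustrate the majority-vote rule described for KNN, the following pure-Python sketch classifies a query point by its K nearest training instances under Euclidean distance. The two-dimensional feature vectors (URL length, number of digits) and their labels are made-up toy data, and the function is a simplified sketch rather than the implementation used in the experiments.

```python
import math
from collections import Counter

def knn_predict(train, query, k=1):
    """train: list of (feature_vector, label) pairs; query: feature_vector."""
    # Sort training instances by Euclidean distance to the query and keep the k closest.
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    # Assign the class that receives the most votes among the neighbors.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy feature vectors: (URL length, number of digits).
train = [
    ((20, 0), "benign"),
    ((25, 1), "benign"),
    ((80, 12), "malicious"),
    ((95, 9), "malicious"),
]

print(knn_predict(train, (85, 10), k=1))
print(knn_predict(train, (22, 0), k=3))
```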

5. Experimental Techniques and Outcome Evaluation

This section shows the outcomes of the applied five AI classifiers (NB, SVM, LR, KNN, and DT). They were evaluated using 10-fold cross-validation [61] in which the set of instances inside the data collection was split into ten parts, among which nine were used for training and one for testing. The cross-validation process is repeated ten times to test all ten parts. We compared the results using the confusion matrix, a table design employed to visualize a classifier performance (Figure 8). It includes the following prediction quality measures: true positive (TP), true negative (TN), false positive (FP), and false negative (FN).
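
The 10-fold procedure described above can be sketched as an index split in which every part serves as the test set exactly once. The function below is a simplified illustration with hypothetical names; it does not shuffle or stratify the data, which a practical evaluation would normally do.

```python
def k_fold_indices(n_samples: int, k: int = 10):
    """Yield (train_idx, test_idx) pairs; each fold is the test set once."""
    # Distribute any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

# With 100000 instances, each iteration trains on 90000 and tests on 10000.
folds = list(k_fold_indices(100000, k=10))
for train_idx, test_idx in folds:
    print(len(train_idx), len(test_idx))
```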

In addition, we computed the accuracy, true positive rate (TPR), false positive rate (FPR), precision (P), recall (R), and F-measure (F-M), as expressed in (1)–(5) [61].
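
The numbered equations are not reproduced in this text. Assuming the standard definitions in terms of the confusion matrix entries, which is how these measures are usually expressed, they can be written as follows (the equation numbering of the original is not assumed here):

```latex
\begin{align*}
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN}, \\
\text{TPR} = R &= \frac{TP}{TP + FN}, \\
\text{FPR} &= \frac{FP}{FP + TN}, \\
P &= \frac{TP}{TP + FP}, \\
\text{F-M} &= \frac{2 \cdot P \cdot R}{P + R}.
\end{align*}
```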

The NB experiments yielded an accuracy of 73.928% with an error rate of 26.072%. Table 4 shows the detailed accuracy of the NB classifier.

Besides, the NB classifier successfully classified 48179 benign URLs and 25749 malicious URLs, as shown in Table 5.

As shown in Tables 4 and 5, the NB classifier obtained its best prediction results for benign URLs (0.964), whereas it recorded a much lower detection rate for malicious URLs (0.515). In other words, the NB classifier misclassified almost half of the malicious URLs as benign, which corresponds to a false positive rate of 0.485 for the benign class. In comparison, it rarely misclassified benign URLs as malicious, which corresponds to a low false positive rate (0.036) for the malicious class.
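
The reported NB rates can be rechecked from the two counts given in the text. The sketch below assumes 50000 URLs per class (as stated in the data collection description) and the standard metric definitions; variable and function names are illustrative.

```python
def binary_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Per-class rates from confusion matrix counts (standard definitions)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "tpr": recall,
        "fpr": fp / (fp + tn),
        "precision": precision,
        "f_measure": 2 * precision * recall / (precision + recall),
    }

# Correctly classified counts reported for NB; 50000 URLs per class is assumed.
benign_correct, malicious_correct, per_class = 48179, 25749, 50000
benign_missed = per_class - benign_correct        # benign flagged as malicious
malicious_missed = per_class - malicious_correct  # malicious passed as benign

benign = binary_metrics(tp=benign_correct, fn=benign_missed,
                        fp=malicious_missed, tn=malicious_correct)
malicious = binary_metrics(tp=malicious_correct, fn=malicious_missed,
                           fp=benign_missed, tn=benign_correct)
accuracy = (benign_correct + malicious_correct) / (2 * per_class)

print(round(benign["tpr"], 3), round(benign["fpr"], 3))
print(round(malicious["tpr"], 3), round(malicious["fpr"], 3))
print(accuracy)
```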

The SVM classifier yielded an accuracy of 84.671% for the weighted average of two classes with an error rate of 15.329%. Table 6 presents the detailed results for the SVM classifier.

Table 6 demonstrates that the SVM classifier obtained results close to those of the NB classifier for the benign URLs. In contrast, the SVM results are improved compared with the NB results for the malicious URLs. Table 7 shows that the SVM classifier assigned 84671 URLs correctly out of the entire dataset. These results still require enhancement to make this approach useful for malicious link detection.

The overall accuracy achieved using the LR classifier was 85.726%. The detailed results reveal enhancement compared with those of the NB and SVM classifiers for benign and malicious URL detection, as shown in Table 8.

The LR classifier was able to classify 85726 URLs correctly. In particular, it predicted 41193 malicious URLs correctly. Table 9 presents the detailed confusion matrix of the LR model.

The weighted average results of the KNN classifier, when k = 1, exhibit an accuracy of 89.614% with an error rate of 10.386%. Highly accurate detection was achieved for both the benign and malicious classes, with accuracies of 91.4% and 87.8%, respectively, as shown in Table 10.

Comparing the results in Tables 8 and 10 shows that the KNN classifier detected the benign class better than the LR classifier. The KNN classifier also detected malicious instances better than the LR approach, so it recognized both benign and malicious URLs with high accuracy. However, the main target of this study was to find the most suitable model for identifying malicious URLs. Table 11 presents the confusion matrix of the KNN classifier.

As shown in Table 11, the KNN classifier could predict 43920 of 50000 malicious URLs correctly. The last classifier we applied in our experiments was the DT classifier, which yielded an overall accuracy of 90.243%, and in particular, 90.5% and 90% for detecting benign and malicious URLs, respectively. The detailed information about DT results is shown in Table 12.

The DT results exhibit a slight enhancement over the accuracy of the KNN classifier for detecting malicious URLs. The DT method correctly classified 1094 more malicious URLs than the KNN classifier did. More details are shown in Table 13.

When comparing the classifiers, we look for the highest values of TP rate, precision, recall, and F-measure, whereas the FP rate should be minimized. On this basis, the results in Tables 4, 6, 8, 10, and 12 clearly show that the DT classifier recorded the best results for both classes and for the weighted average. It recorded more than 0.9 for TP rate, precision, recall, and F-measure and less than 0.1 for the FP rate.

6. Discussion

Based on the outcomes presented in Section 5, we utilized the DT classifier since it yielded the most accurate detection and prediction results for malicious links. BarAI was consequently implemented [62] based on the guidelines recommended by [24]. The following features describe our implementation.
(i) Self-Supporting. BarAI uses a DT classifier and does not require any external web service.
(ii) Inspect BIA. Besides the ability to detect QR code malicious links, our proposed approach can detect the malicious usage of the 1D barcode.
(iii) Camera-Only Privilege Functionality. This approach minimizes the levels of access to the camera (to scan the barcode image).
(iv) Interoperability in an Open Application Environment. No supplementary key management or cryptographic specifications are required.
(v) Applicable. Neither storing nor retrieving data is required (signatures or certificates); therefore, there is no size overhead.

Few cryptographic QR code applications offer both generation and scanning services [63]. The BarSec Droid app [64] provides various symmetric and asymmetric cryptographic algorithms to secure barcodes and uses the standard JavaScript Object Notation as the formal structure within QR codes [27]. BarSec Droid [64] decodes cryptographic barcodes only if its own generation app produced them. Other QR code cryptographic apps have no formal way of encoding cryptographic information inside QR codes; each app uses its own structure. Thus, to retrieve cryptographic QR code content, the user must have the same generation tool [65–73]. Some of these apps employ insecure encryption algorithms [65, 66]. We could not evaluate the strength of the remaining apps [68–73] because of missing cryptographic information, that is, the algorithm used or key length. These apps [65–73] use Base64 strings to represent ciphertexts, which incurs size overhead.
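
The size overhead of Base64 can be quantified directly: every 3 bytes of ciphertext become 4 output characters. The sketch below measures this ratio for a random byte string standing in for a ciphertext; the function name and the 48-byte length are arbitrary illustrative choices.

```python
import base64
import os

def base64_overhead(n_bytes: int) -> float:
    """Ratio of Base64-encoded length to raw length for n_bytes of data."""
    raw = os.urandom(n_bytes)          # stand-in for a ciphertext
    encoded = base64.b64encode(raw)
    return len(encoded) / n_bytes

# 48 raw bytes are encoded as 64 characters, an increase of about one third.
print(base64_overhead(48))
```

For a 1D barcode with very limited capacity, this one-third growth comes on top of the ciphertext or signature itself, which helps explain why such schemes are difficult to apply there.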

Table 14 summarizes the features of the QR code cryptographic apps in comparison with BarAI.

All apps [64–73] require a secure means of exchanging their keys, and they use encryption methods that incur size overhead. Hence, these apps cannot prevent BIAs, owing to the limited size of 1D barcodes. Although digital signature techniques cannot prevent BIAs for the same reason (size overhead), BarSec Droid can still check for BIAs using a particular web service [64].

All apps [64–73] can work in closed/controlled environments, except [64], which can also work in an open environment when using a digital signature or checking URLs. Users should be aware of the potential threats when using digital signature certificates, such as expired certificates, self-signed certificates, hostname mismatches, and certificate chain issues [74, 75].

BarAI represents a comprehensive solution to the limitations of cryptographic apps. It works in both open and closed environments, checks for suspicious online content in all barcode types, incurs no size overhead, and requires only least-privilege permissions.

7. Conclusions

This study demonstrates some QR code cybercrime attacks involving phishing and malware propagation. Our work explores barcode-counterfeiting attacks and describes experiments performed to determine the effects of size and distance on barcode reading. A dataset containing 100000 URLs categorized as benign or malicious was built, and the URL features were extracted for further analysis. In addition, five AI classifiers were applied: NB, SVM, LR, KNN, and DT. The outcomes showed that the DT classifier is the most suitable model for recognizing QR code malicious links. Based on that, BarAI was developed and proposed as a swift warning management tool that identifies QR code malicious links with an accuracy of 90.243%. The proposed tool is also under evaluation for possible use in improving agroecosystem sustainability through secure QR code technology for durable development.

Data Availability

The dataset used in the experiments is available at https://tinyurl.com/5779wdw2.

Conflicts of Interest

The authors declare that there are no conflicts of interest.

Acknowledgments

The authors acknowledge the Deanship of Scientific Research at King Faisal University for their support under grant number 17122016.