Abstract

Deoxyribonucleic acid (DNA) is one of the most useful biometrics and has been used effectively for recognizing persons. However, there is still a need for new approaches to verifying humans, especially after recent large-scale wars in which many people died or went missing. Such an approach should provide high personal verification performance. In this paper, a personal recognition approach based on artificial intelligence is proposed, called the artificial DNA algorithm for recognition (ADAR). It utilizes a unique identity for each person acquired from DNA nucleotides, and it can verify individuals efficiently. The ADAR has been designed and applied to four datasets, namely, the DNA classification (DC), sample DNA sequence (SDS), human DNA sequences (HDS), and DNA sequences (DS). For all datasets, both the false acceptance rate (FAR) and the false rejection rate (FRR) are 0%.

1. Introduction

With advanced science and technology, it is now possible to authenticate people in order to achieve high levels of security. Maintaining private data and meeting the increased demands for security have become important matters. Several methods use biometrics to confirm identity, such as fingerprint [1], palm print [2], iris print [3], and voice print [4]. Biometrics involves measuring an individual’s distinctive physical or behavioral trait [5]. The Greek word “bio” means life and “metric” means measuring; the two are combined to form the term “biometric” [6]. Several terms are associated with the word “biometrics”, such as verification, identification, classification, authentication, and recognition. These terms can be hard to distinguish, but their meanings have been clarified over years of use. Verification utilizes the one-to-one policy, where a user declares his/her identity so that the provided sample is compared with the stored information belonging to that same user. Then, a decision about accepting or rejecting the personal identity claim is provided [7]. Identification exploits the one-to-many policy. Here, the information provided by a user must be matched against the stored information of all users. So, there is no need to provide a user’s identity, and the decision either assigns an identity or refuses to declare one [8]. Classification refers to categorizing information into a certain group or set [9]. Authentication refers to the process of proving that something is genuine. In computer science, this term is typically associated with confirming a user’s identity [10]. Recognition is a general term, and it can refer to any of the previous biometric styles (verification, identification, classification, or authentication).
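The difference between the one-to-one and one-to-many policies can be sketched as follows. This is only an illustration: the function names, the `templates` dictionary, and the exact-equality matcher are hypothetical and do not come from any cited system.

```python
from typing import Callable, Dict, Optional

Matcher = Callable[[str, str], bool]


def exact_match(sample: str, template: str) -> bool:
    """Hypothetical matcher: treats two samples as matching only if identical."""
    return sample == template


def verify(claimed_id: str, sample: str,
           templates: Dict[str, str], match: Matcher = exact_match) -> bool:
    """One-to-one: compare the sample only with the claimed user's template."""
    template = templates.get(claimed_id)
    return template is not None and match(sample, template)


def identify(sample: str, templates: Dict[str, str],
             match: Matcher = exact_match) -> Optional[str]:
    """One-to-many: search all stored templates; no identity claim is needed."""
    for user_id, template in templates.items():
        if match(sample, template):
            return user_id
    return None  # refuse to declare an identity


if __name__ == "__main__":
    templates = {"alice": "AAAA", "bob": "CCCC"}
    print(verify("alice", "AAAA", templates))  # claim accepted
    print(verify("bob", "AAAA", templates))    # claim rejected
    print(identify("CCCC", templates))         # identity assigned
```

Verification touches a single stored record, whereas identification may have to scan every enrolled user.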

Deoxyribonucleic acid (DNA) can offer trustworthy personal verification. It is inherently digital and remains unchanged during a person’s lifetime and even after death [11]. DNA takes the form of a double helix: two connected strands that twist around one another to resemble a spiral ladder. Deoxyribose (a sugar) and phosphate are the main components of the backbone of each strand. Each sugar molecule in the DNA is attached to one of four bases (or nucleotides): adenine (A), cytosine (C), guanine (G), or thymine (T) [12]. The bases A, T, C, and G are the chemical units that connect the two strands together. Figure 1 shows a sample of DNA with its chemical components.

DNA bases pair up with each other, A with T and C with G, to form units called base pairs. Each base is also attached to a sugar molecule and a phosphate molecule. The sequence of these pairs differs from one person to another, which makes DNA unique to each individual. It can therefore be used for personal verification or any other recognition style.

DNA is known to be highly valuable for personal verification. However, a more effective DNA system is still required. This need has become evident in Iraq as a result of major conflicts and wars, in which many people died or went missing. Such a DNA verification system should be able to handle a huge number of samples and provide precise outcomes. This work presents a new system based on artificial intelligence that employs a unique DNA pattern of the nucleotides (A, T, C, and G) for verifying persons. The proposed approach is called the artificial DNA algorithm for recognition (ADAR). It provides high performance, facilitates searching for DNA verification samples, and its efficiency is demonstrated on four datasets.

The remaining sections are organized as follows: Section 2 presents the literature review. Section 3 describes the ADAR theory. Section 4 discusses the experimental work, and Section 5 provides the conclusion.

2. Literature Review

Many prior DNA studies can be highlighted. In 2005, Mitra presented a survey of the roles of different soft computing techniques, such as fuzzy sets, artificial neural networks (ANNs), evolutionary computation (EC), and support vector machines (SVMs), in classifying and recognizing the major patterns in DNA genomic sequences and protein architecture. The SVM classifier recorded the highest accuracy and lowest error compared to the other applied methods [14]. In 2009, Wei proposed a system for categorizing the DNA sequences of four types of bacteria. It consists of the following steps: extracting DNA sequence features, constructing the ANN model, and classifying data. After training the ANN model, the accuracies of classifying the four types of bacteria for lengthy and repetitive DNA sequences in the utilized dataset were 92.9%, 90.2%, 80.4%, and 41.7% [15]. In 2012, Khashei et al. presented a novel hybrid model integrating AI and fuzzy logic for the analysis of gene data. Comparative evaluations against conventional approaches such as artificial neural networks (ANN), linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), K-nearest neighbor (KNN), and support vector machines (SVM) demonstrated that the proposed model achieved enhanced classification accuracy. This suggests that the hybrid model is a promising alternative technique, particularly where data scarcity is a concern [16]. In 2017, Pashaei et al. concentrated on the human genome by considering splice site identification with random forest. The effectiveness of the employed classifiers was mostly influenced by the feature extraction and feature selection techniques used in DNA encoding. The feature selection methods removed extraneous information, whereas the feature extraction methods attempted to extract as much information from the DNA sequences as possible.
The applied random forest was examined as a means of feature selection and classification in the splice site domain [17]. In 2018, Pashaei and Aydin worked on Markovian encoding models and considered the recognition of splice sites for persons. A third-order Markov model with SVM (MM3-SVM) was proposed. It outperformed the best-known state-of-the-art methods [18]. In the same year, Kaniwa and Phuthego explained how genetics is affected by next-generation sequencing, which rapidly generates DNA and ribonucleic acid (RNA) sequences. The study was conducted in Madrid, Spain. It was based on the fundamental notion that the expansion of DNA sequence information made simple and affordable analysis possible [19]. In 2020, Sun et al. devised and implemented a novel multilayer deep neural network (DNN) for survival prediction in a genome-wide association study. This DNN survival model exhibited superior predictive accuracy compared to several existing models, while also successfully identifying clinically significant risk subgroups. The model employed an effective approach for capturing complex architectures among genetic variants. The model was evaluated on genome-wide association studies (GWAS) data from two large-scale randomized clinical trials involving over 7800 participants with age-related macular degeneration (AMD) [20]. In 2021, Alatrany et al. proposed a hybrid machine learning (ML) technique for the prediction of Alzheimer’s disease using genome sequences. The most important single-nucleotide polymorphisms linked to Alzheimer’s disease were chosen. Using data from a random forest, a deep learning (DL) model for predicting the illness was then provided. Simulation results with a convolutional neural network (CNN) and a multilayer perceptron (MLP) showed that the hybrid model was effective in predicting people who had Alzheimer’s disease [21].
In 2022, Manhal investigated the use of DNA to identify individuals. An efficient algorithm was used to find the distinctive DNA patterns. The unique personal DNA pattern (UPDP) was adopted for personal identification. Four databases were employed, and they all yielded low reported errors [22]. In the same year, Rukhsar et al. introduced DL analysis of RNA sequence gene expression data for cancer classification. Five different kinds of cancer data from the Mendeley archive were examined. The appropriate features were extracted and selected using DL. Eight DL algorithms were employed to accomplish classification in the final phase. The DL classifiers were evaluated using k-fold cross-validation and four different data splitting techniques. Among the evaluated classifiers, the CNN exhibited the highest overall performance [23]. Also in the same year, Hamed et al. provided a review on enhancing algorithms for pattern matching. This survey concentrated on biological sequences. It presented analyses of techniques, efficiency, and complexity. Furthermore, it offered comparisons between various matching algorithms [24]. In 2023, Ibrahim et al. proposed a novel fast pattern-matching technique for biological sequences, designed to speed up the search for DNA sequence patterns [25]. In the same year, Hamed et al. investigated the efficiency of optimizing classification with machine learning, focusing on pattern matching. The study suggested a new DNA sequence classification model that fuses a pattern-matching procedure with machine learning techniques [26].

This paper contributes to this body of work by proposing an artificial intelligence algorithm named the ADAR. The algorithm verifies persons according to their DNA nucleotide sequences.

3. Proposed Approach

The proposed approach is called the ADAR. Its construction starts with substantial numbers of DNA sequences. Each DNA sequence has a unique nucleotide pattern code. Each strand of DNA is viewed as a fundamental sequence of nucleotides (or bases). Figure 2 depicts a DNA sequencing sample of two strands with nucleotide arrangements.

Any sequencing arrangement in a single DNA strand consists of A, G, T, and C nucleotides. In this work, determining the identity of a person is considered after counting the number of repeated patterns of four nucleotides (quaternary nucleotides).

The ADAR algorithm counts the occurrences of every repeated four-nucleotide pattern and then determines the most frequently repeated pattern. An identity claim is made for a specific person. Verification therefore compares three factors: the pattern index, the maximum repetition, and the identity claim.

The full system of the ADAR works in two main phases: enrolment and verification. In the enrolment phase, DNA samples are received and processed for storage in the system. In the verification phase, an identity claim and a DNA sample are provided for testing. A flowchart for the proposed ADAR with the two phases is given in Figure 3.

For the enrolment phase, the ADAR system consists of the following layers: input layer, search layer, max layer, identity layer, and comparison layer. The verification phase involves the same layers plus an output layer, which is added at the end and provides the verification decision. The proposed ADAR layers for the two phases of enrolment and verification are demonstrated in Figure 4. They can be described as follows:

Input Layer: It receives a DNA sample D as a string of nucleotides (or bases).

Search Layer: It counts the occurrences of each repeated quaternary pattern X of nucleotides (the frequencies of quaternary patterns of nucleotides). It considers all possible patterns P(X), starting from “AAAA” and ending with “CCCC” (this covers 4^4 = 256 possibilities).

Max Layer: This layer collects the maximum frequencies of the most repeated quaternary patterns of nucleotides for all D samples. The following equation expresses the maximum operation:

$$M = \max_{i = 1, \ldots, 256} F(X_i)$$

where $M$ is the maximum frequency of the most repeated quaternary pattern of nucleotides, $\max$ is the maximum operation over all pattern frequencies $F(X_i)$, and $i = 1, \ldots, 256$ runs over the possibilities.

Identity Layer: During the enrolment phase, this layer stores the identities of the persons who provide their DNA sequences. During the verification phase, it matches the identity claim of the person to be verified against his/her stored information.

Comparison Layer: It assigns three factors to each DNA sequence provided by a person, to be used in verification comparisons. These factors are the pattern index, the maximum repetition, and the identity claim. This layer is crucial for ensuring reliable and accurate verification of individuals.

Output Layer: It provides the output verification decision according to all processing layers and the identity claim.
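The search and max layers described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, patterns are counted with an overlapping sliding window of step 1, the alphabet order "AGTC" is chosen only so that the enumeration starts at "AAAA" and ends at "CCCC", and ties are resolved in favor of the lowest pattern index. None of these details are specified in the text.

```python
from itertools import product

# Assumed ordering so that the enumeration runs from "AAAA" to "CCCC".
ALPHABET = "AGTC"
PATTERNS = ["".join(p) for p in product(ALPHABET, repeat=4)]  # 4**4 = 256


def search_layer(dna: str) -> dict:
    """Count every quaternary pattern using overlapping windows (assumption)."""
    freqs = dict.fromkeys(PATTERNS, 0)
    dna = dna.upper()
    for i in range(len(dna) - 3):
        window = dna[i:i + 4]
        if window in freqs:  # windows with non-ACGT symbols are skipped
            freqs[window] += 1
    return freqs


def max_layer(freqs: dict) -> tuple:
    """Return (pattern index, maximum frequency); ties pick the lowest index."""
    index = max(range(len(PATTERNS)), key=lambda i: freqs[PATTERNS[i]])
    return index, freqs[PATTERNS[index]]


if __name__ == "__main__":
    index, maximum = max_layer(search_layer("AAAAAGTC"))
    print(len(PATTERNS), PATTERNS[index], index, maximum)
```

For the toy string "AAAAAGTC", the windows are AAAA, AAAA, AAAG, AAGT, and AGTC, so the most repeated pattern is "AAAA" with a frequency of 2.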

The ADAR verification algorithm can be summarized as follows:

Step 1: Receive the DNA sample as a string of nucleotides.

Step 2: Count the occurrences of the repeated quaternary patterns of nucleotides.

Step 3: Collect the maximum frequency of the most repeated quaternary pattern of nucleotides in the DNA sample.

Step 4: Match the identity claim of the person who requires verification against his/her stored information.

Step 5: Compare the three factors (pattern index, maximum repetition, and identity claim).

Step 6: Provide the output verification decision according to all processing layers and the identity claim.
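The six steps can be put together in a short end-to-end sketch covering both enrolment and verification. The class and function names are hypothetical, and several details are assumptions because the text does not spell them out: overlapping windows, the "AGTC" enumeration order, lowest-index tie-breaking, and a comparison rule that accepts a claim only when both the pattern index and the maximum repetition equal the stored values.

```python
from collections import Counter
from itertools import product

ALPHABET = "AGTC"  # assumed order: enumeration runs from "AAAA" to "CCCC"
PATTERN_INDEX = {"".join(p): i
                 for i, p in enumerate(product(ALPHABET, repeat=4))}


def extract_factors(dna: str) -> tuple:
    """Steps 1-3: return (pattern index, maximum repetition) for a sample."""
    dna = dna.upper()
    counts = Counter(dna[i:i + 4] for i in range(len(dna) - 3))
    valid = {p: c for p, c in counts.items() if p in PATTERN_INDEX}
    if not valid:
        raise ValueError("sample contains no valid quaternary pattern")
    # Highest count first; ties resolved by the lowest pattern index (assumption).
    best = min(valid, key=lambda p: (-valid[p], PATTERN_INDEX[p]))
    return PATTERN_INDEX[best], valid[best]


class AdarSketch:
    """Hypothetical container for the enrolment and verification phases."""

    def __init__(self) -> None:
        self._store = {}  # identity -> (pattern index, maximum repetition)

    def enrol(self, identity: str, dna: str) -> None:
        self._store[identity] = extract_factors(dna)

    def verify(self, claimed_identity: str, dna: str) -> bool:
        stored = self._store.get(claimed_identity)  # step 4
        if stored is None:
            return False
        return stored == extract_factors(dna)       # steps 5 and 6


if __name__ == "__main__":
    system = AdarSketch()
    system.enrol("person_1", "AAAAAGTC")
    system.enrol("person_2", "GGGGGGTTTT")
    print(system.verify("person_1", "AAAAAGTC"))  # matching claim
    print(system.verify("person_2", "AAAAAGTC"))  # mismatching claim
```

Under this comparison rule, each verification costs one pass over the sample plus a dictionary lookup for the claimed identity.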

Parameters used for the ADAR analysis are given in Table 1.

4. Results and Discussion

4.1. Datasets Descriptions

Four datasets are employed in this paper: the DNA classification (DC) [28], sample DNA sequence (SDS) [29], human DNA sequences (HDS) [30], and DNA sequences (DS) [31]. Each of these datasets consists of many DNA sequences of nucleotides (A, G, T, and C). The DC dataset contains 106 samples, the SDS dataset 426 samples, the HDS dataset 4380 samples, and the DS dataset 11738 samples. All samples are used as strings of DNA nucleotide sequences.

In more detail, these datasets, given their total sizes, yield very large numbers of client and imposter possibilities, as shown in Table 2.

4.2. ADAR System

The proposed ADAR approach is built into a system. It is applied four times, once for each employed dataset. The ADAR is implemented in both phases of enrolment and verification. Simple yet effective graphical user interfaces (GUIs) are designed and provided. Figure 5 shows the first GUI, which has 5 essential buttons:

(1) Load dataset: it is responsible for loading the dataset and applying the enrolment phase.

(2) Input DNA pattern: it allows entering a DNA sequence for the verification phase, as demonstrated in Figure 6, which shows the window requesting a DNA sequence and an example of providing one.

(3) Input identity claim: it allows entering an identity claim for the verification phase, as illustrated in Figure 7, which shows the window requesting an identity claim and an example of providing one.

(4) Result: it performs the verification process and displays the result of accepting or rejecting the identity claim.

(5) End: it stops and closes the ADAR system. Otherwise, the system keeps running and can be used for further inputs.

As mentioned, the verification result is either acceptance or rejection of the identity claim. Figure 8 shows both possible verification results in the ADAR system, namely the output for rejecting the identity claim and the output for accepting it. A rejected identity claim is reported as an incorrect identity with a red-colored icon, and an accepted identity claim is reported as a correct identity with a blue-colored icon.

4.3. Results Discussion

To evaluate the generalization of the ADAR system, separate testing samples are held out and evaluated exhaustively in loops. This leads to intensive evaluations: 106 clients and 11130 imposters for the DC dataset; 426 clients and 181050 imposters for the SDS dataset; 4380 clients and 19180020 imposters for the HDS dataset; and 11738 clients and 137768906 imposters for the DS dataset.
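The reported trial counts are consistent with each of the n samples being tested once against its own identity (n client trials) and once against every other identity (n(n - 1) imposter trials). The short check below reproduces the figures; the function name and the interpretation of the protocol are assumptions made for illustration.

```python
# Dataset sizes as reported in Section 4.1.
DATASET_SIZES = {"DC": 106, "SDS": 426, "HDS": 4380, "DS": 11738}


def trial_counts(n_samples: int) -> tuple:
    """Return (client trials, imposter trials) under the assumed protocol."""
    clients = n_samples                      # each sample vs. its own identity
    imposters = n_samples * (n_samples - 1)  # each sample vs. all other identities
    return clients, imposters


if __name__ == "__main__":
    for name, n in DATASET_SIZES.items():
        clients, imposters = trial_counts(n)
        print(f"{name}: {clients} clients, {imposters} imposters")
```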

It can be concluded that the ADAR system was successfully constructed. Furthermore, very high verification performance is attained on each of the four datasets, with a false acceptance rate (FAR) of 0% and a false rejection rate (FRR) of 0%. It can also be highlighted that the artificial intelligence system in ADAR is user-friendly and easy to implement.

Additional metrics are also considered: precision, recall, loss, and F1-score. In addition, receiver operating characteristic (ROC) curves and confusion matrices are provided in Figures 9 and 10, respectively. For all the employed datasets, the following values are computed: Precision = 1, Recall = 1, Loss = 0, F1-score = 1, and area under the curve (AUC) = 1. This is expected, as the confusion matrices show zero false positive and zero false negative verifications.
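The relationship between the confusion matrix and the reported metrics can be made explicit with a small sketch. It uses the standard definitions of FAR, FRR, precision, recall, and F1-score, which the text does not state explicitly, so they are an assumption here; the function name is hypothetical, and the loss is omitted because its definition is not given. The example counts are the DC figures, with every client accepted and every imposter rejected.

```python
def verification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute standard verification metrics from confusion-matrix counts.

    tp: accepted clients, fn: rejected clients,
    fp: accepted imposters, tn: rejected imposters.
    """
    def ratio(num: int, den: int) -> float:
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    f1 = ratio(2 * precision * recall, precision + recall) if precision + recall else 0.0
    return {
        "FAR": ratio(fp, fp + tn),  # share of imposter trials wrongly accepted
        "FRR": ratio(fn, fn + tp),  # share of client trials wrongly rejected
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


if __name__ == "__main__":
    # DC dataset: 106 client trials and 11130 imposter trials, no errors.
    print(verification_metrics(tp=106, fp=0, tn=11130, fn=0))
```

With zero false positives and zero false negatives, FAR and FRR are 0 and precision, recall, and F1-score are all 1, which matches the values reported above.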

The time spent on an ADAR verification has been measured at around 0.23 seconds. This measurement was carried out on an HP laptop with an Intel Core i7 processor running at 2.70 GHz and 8 GB of main memory.

4.4. ADAR Limitations

The proposed ADAR approach still has limitations and challenges to be considered. Examples of these are as follows:

(i) It cannot be utilized for DNA samples that have no nucleotides (having different values instead).

(ii) It is designed for verification, so it requires adaptation for identification too.

(iii) It is not a machine learning technique; therefore, it is suggested to be developed in this direction.

4.5. Comparisons

Comparisons between the proposed ADAR approach and state-of-the-art studies are considered, as given in Table 3.

This table shows the performance of state-of-the-art studies conducted with the unique personal DNA pattern (UPDP) method. They use the same datasets but with the following numbers of samples: 106 for the DC, 426 for the SDS, 500 for the HDS, and 1000 for the DS. Manhal et al. [22, 32] focus on identification and report FARs of 2.07%, 1.41%, 0.26%, and 0.75% for the DC, SDS, HDS, and DS, respectively. Ahmad et al. [32, 33] work on verification and report FARs of 0.32%, 0.31%, 0%, and 0.16% for the DC, SDS, HDS, and DS, respectively. Verification with the ADAR achieves better performance still. Significantly, it accepts the full number of samples in every employed dataset: 106 for the DC, 426 for the SDS, 4380 for the HDS, and 11738 for the DS. The proposed system attains a FAR of 0% on each of the datasets. The FRR is reported as 0% for every method, recognition style (verification or identification), and dataset.

In summary, the proposed system, which uses the ADAR approach for verification, provides superior performance compared to previous state-of-the-art studies. It also accepts the full number of samples in every employed dataset, and it offers high reliability.

5. Conclusion

This paper presents a new artificial intelligence approach called the ADAR. It has been proposed for person verification by DNA nucleotides. The ADAR works in two main phases: enrolment and verification. During the enrolment phase, DNA samples are received and processed, and their unique information is stored. In the verification phase, a DNA sample and an identity claim are provided and processed, and their unique information is compared with the stored information to make a verification decision. The ADAR approach involves multiple layers: the input layer for receiving a DNA sample, the search layer for counting the frequencies of repeated quaternary patterns of nucleotides, the max layer for determining the maximum frequency among the repeated patterns, the identity layer for storing or matching the identity claims, the comparison layer for assigning comparison factors, and the output layer for providing the verification decision in the verification phase.

A system implementing the proposed ADAR is also presented in this study. Four datasets, namely, the DC, SDS, HDS, and DS, are employed. The ADAR achieves 0% FAR and 0% FRR on every employed dataset. Comparisons with state-of-the-art studies are also given: the ADAR approach outperforms previously proposed methods and, in addition, accepts the full number of DNA samples in every employed dataset. This indicates that the ADAR can deal with a huge number of DNA samples.

In the future, several directions can be pursued, such as extending the ADAR to identification and adapting it for machine learning.

Data Availability

The DNA classification (DC), sample DNA sequence (SDS), human DNA sequences (HDS), and DNA sequences (DS) data used to support the findings of this study are available from the corresponding author upon request.

Disclosure

The research was performed as part of the authors’ employment at Northern Technical University.

Conflicts of Interest

The authors declare that they have no conflicts of interest.