#### Abstract

Handwritten character recognition is a challenging research topic. Many works have been presented to recognize the letters of different languages, but the availability of Arabic handwritten character databases is limited. Motivated by this, we propose a convolution neural network for the classification of Arabic handwritten letters. Seven optimization algorithms are evaluated, and the best one is reported. Because few Arabic handwritten datasets are available, various data augmentation techniques are implemented to improve the robustness of the convolution neural network model. Dropout regularization is used to avoid overfitting. Moreover, the optimization algorithm and the data augmentation approaches are chosen carefully to achieve a good performance. The model has been trained on two Arabic handwritten character datasets, AHCD and Hijja. The proposed algorithm achieved high recognition accuracy of 98.48% and 91.24% on AHCD and Hijja, respectively, outperforming other state-of-the-art models.

#### 1. Introduction

Approximately a quarter of a billion people around the world speak and write the Arabic language [1]. Many historical books and documents written in Arabic represent a crucial data set for most Arabic countries [1, 2].

Recently, the area of Arabic handwritten characters recognition (AHCR) has received increased research attention [3–5]. It is a challenging topic of computer vision and pattern recognition [1]. This is due to the following:

(i) The difference between handwriting patterns [3].

(ii) The form similarity between Arabic letters [1, 3].

(iii) The diacritics of Arabic characters [6].

(iv) As shown in Figure 1, in the Arabic language the shape of each handwritten character depends on its position in the word. For example, in the word “أمراء” the character “Alif” is written in two different forms, “أ” and “ا”; in the Arabic language, each character has between two and four shapes. Table 1 shows the different shapes of the twenty-eight Arabic letters.

With the development of deep learning (DL), convolution neural networks (CNNs) have shown a significant capability to recognize handwritten characters of different languages [3, 7, 8]: Latin [9, 10], Chinese [11], Devanagari [12], Malayalam [11], etc.

Most researchers improved the CNN architecture to achieve good handwritten characters recognition performance [6, 13]. However, a neural network with excellent performance usually requires a good tuning of the CNN hyperparameters and a good choice of the applied optimization algorithm [14–16]. Also, a large amount of training data [17, 18] is required to achieve outstanding performance.

The main contributions of this research can be summarized as follows:

(i) Suggesting a CNN model for recognizing Arabic handwritten characters.

(ii) Tuning of different hyperparameters to improve the model performance.

(iii) Applying different optimization algorithms. Reporting the effectiveness of the best ones.

(iv) Presenting different data augmentation techniques. Reporting the influence of each method on the improvement of Arabic handwritten characters recognition.

(v) Mixing two different Arabic handwritten characters datasets for shape varying. Testing the impact of the presented data augmentation approaches on the mixed dataset.

The rest of this paper is organized as follows. In Section 2, we review the related work on Arabic handwritten character classification. In Sections 3 and 4, we describe the convolution neural network architecture and the model tuning hyperparameters. In Section 5, we give a detailed description of the optimization algorithms used. In Section 6, we describe the data augmentation techniques chosen in this study. In Section 7, we provide an overview of the experimental results showing the CNN performance. Section 8 presents the conclusion and possible future research directions.

#### 2. Related Work

In recent years, many studies have addressed the classification and recognition of letters, including Arabic handwritten characters. However, comparatively few approaches have been proposed for recognizing individual characters in the Arabic language. As a result, Arabic handwritten character recognition is less well studied than that of English, French, Chinese, Devanagari, Hangul, Malayalam, etc.

Impressive results were achieved in the classification of handwritten characters from different languages, using deep learning models and in particular the CNN.

El-Sawy et al. [6] gathered their own Arabic Handwritten Character dataset (AHCD) from 60 participants. AHCD consists of 16,800 characters. They achieved a classification accuracy of 88% by using a CNN model consisting of 2 convolutional layers. To improve the CNN performance, regularization and different optimization techniques were applied to the model. The testing accuracy was improved to 94.93%.

Altwaijry and Turaiki [13] presented a new Arabic handwritten letters dataset (named “Hijja”). It comprises 47,434 characters written by 591 participants. Their proposed CNN model was able to achieve 88% and 97% testing accuracy, using the Hijja and AHCD datasets, respectively.

Younis [19] designed a CNN model to recognize Arabic handwritten characters. The CNN consisted of three convolutional layers followed by one final fully connected layer. The model achieved an accuracy of 94.7% for the AHCD database and 94.8% for the AIA9K (Arabic alphabet’s dataset).

Latif et al. [20] designed a CNN to recognize a mix of handwriting of multiple languages: Persian, Devanagari, Eastern Arabic, Urdu, and Western Arabic. The input image is of size (28 × 28) pixels, followed by two convolutional layers, and then a max-pooling operation is applied to both convolution layers. The overall accuracy of the combined multilanguage database was 99.26%. The average accuracy is around 99% for each individual language.

Alrobah and Albahl [21] analyzed the Hijja dataset and found irregularities, such as some distorted letters, blurred symbols, and some blurry characters. They used the CNN model to extract the important features and SVM model for data classification. They achieved a testing accuracy of 96.3%.

Mudhsh et al. [22] designed the VGG net architecture for recognizing Arabic handwritten characters and digits. The model consists of 13 convolutional layers, 2 max-pooling layers, and 3 fully connected layers. Data augmentation and dropout methods were used to avoid the overfitting problem. The model was trained and evaluated by using two different datasets: the ADBase for the Arabic handwritten digits classification topic and HACDB for the Arabic handwritten characters classification task. The model achieved an accuracy of 99.66% and 97.32% for ADBase and HACDB, respectively.

Boufenar et al. [23] used the popular CNN architecture Alexnet. It consists of 5 convolutional layers, 3 max-pooling layers, and 3 fully connected layers. Experiments were conducted on two different databases, OIHACDB-40 and AHCD. Based on the good tuning of the CNN hyperparameters and by using dropout and minibatch techniques, a CNN accuracy of 100% and 99.98% for OIHACDB-40 and AHCD was achieved.

Mustapha et al. [24] proposed a Conditional Deep Convolutional Generative Adversarial Network (CDCGAN) for a guided generation of isolated handwritten Arabic characters. The CDCGAN was trained on the AHCD dataset. They achieved a 10% performance gap between real and generated handwritten Arabic characters.

Table 2 summarizes the literature reviewed for recognizing Arabic handwriting characters using the CNN models. From the previous literature, we notice that most CNN architectures have been trained by using adult Arabic handwriting letters “AHCD”. In addition, we observe that most researchers try to improve the performance through the good tuning of the CNN model hyperparameters.

#### 3. The Proposed Arabic Handwritten Characters Recognition System

As shown in Figure 2, the model that we proposed in this study is composed of three principal components: CNN proposed architecture, optimization algorithms, and data augmentation techniques.

In this paper, the proposed CNN model contains four convolution layers, two max-pooling operations, and an ANN model with three fully connected hidden layers used for the classification. To avoid overfitting and improve the model performance, various optimization techniques were used, such as dropout, minibatch training, and the choice of the activation function.

Figure 3 describes the proposed CNN model. Also, in this work, the recognition performance of Arabic handwritten letters was improved through the good choice of the optimization algorithm and by using different data augmentation techniques “geometric transformations, feature space augmentation, noise injection, and mixing images.”

#### 4. Convolution Neural Network Architecture

A CNN model [25–34] is a series of convolution layers followed by fully connected layers. Convolution layers allow the extraction of important features from the input data. Fully connected layers are used for the classification of data. The CNN input is the image to be classified; the output corresponds to the predicted class of the Arabic handwritten character.

##### 4.1. Input Data

The input data is an image $I$ of size $(m \times m \times c)$, where $m$ defines the width and the height of the image and $c$ denotes the space, or number of channels. The value of $c$ is 1 for a grayscale image and 3 for an RGB color image.

##### 4.2. Convolution Layer

The convolution layer consists of a convolution operation followed by a pooling operation.

###### 4.2.1. Convolution Operation

The basic concept of the classical convolution operation between an input image $I$ of dimension $(m \times m)$ and a filter $F$ of size $(n \times n)$ is defined as follows (see Figure 4):

$$C = I \otimes F. \tag{1}$$

Here, ⊗ denotes the convolution operation. $C$ is the convolution map of size $(a \times a)$, where $a = \frac{m - n + 2p}{s_c} + 1$. The stride $s_c$ denotes the number of pixels by which $F$ slides over $I$. The padding is $p$; often it is necessary to add a border of zeros around $I$ to preserve complete image information. Figure 4 is an example of the convolution operation between an input image of dimension (8 × 8) and a filter of size (3 × 3). Here, the convolution map is of size (6 × 6), with a stride $s_c = 1$ and a padding $p = 0$.
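The output-size rule and the sliding-window operation can be checked with a minimal pure-Python sketch. The function names are hypothetical and chosen for illustration only. As in most CNN libraries, the filter is applied without flipping (cross-correlation), which is an assumption about the operation the text calls convolution.

```python
def conv_output_size(m, n, stride=1, padding=0):
    """Side length of the convolution map: (m - n + 2p) / s + 1."""
    return (m - n + 2 * padding) // stride + 1


def convolve2d(image, kernel, stride=1, padding=0):
    """Slide a square kernel over a zero-padded square image (lists of lists)."""
    m, n = len(image), len(kernel)
    size = m + 2 * padding
    # Surround the image with `padding` rings of zeros.
    padded = [[0.0] * size for _ in range(size)]
    for r in range(m):
        for c in range(m):
            padded[r + padding][c + padding] = image[r][c]
    a = conv_output_size(m, n, stride, padding)
    out = [[0.0] * a for _ in range(a)]
    for i in range(a):
        for j in range(a):
            # Weighted sum of the patch currently covered by the kernel.
            out[i][j] = sum(
                padded[i * stride + u][j * stride + v] * kernel[u][v]
                for u in range(n)
                for v in range(n)
            )
    return out
```

With an (8 × 8) input, a (3 × 3) filter, stride 1, and no padding, `conv_output_size(8, 3)` gives 6, which matches the example of Figure 4.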

Generally, a nonlinear activation function is applied on the convolution map $C$. The commonly used activation functions are Sigmoid [34–36], Hyperbolic Tangent “Tanh” [35, 37], and Rectified Linear Unit “ReLU” [37, 38]:

$$C_a = f(C), \tag{2}$$

where $C_a$ is the convolution map after applying the nonlinear activation function $f$. Figure 5 shows the map $C_a$ when the ReLU activation function is applied on $C$.
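As an illustration, the three activation functions named above can be written in a few lines and applied elementwise to a convolution map. This is a sketch using only the standard library; `apply_activation` is a hypothetical helper name.

```python
import math


def sigmoid(x):
    """Squash x into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Squash x into the interval (-1, 1)."""
    return math.tanh(x)


def relu(x):
    """Keep positive values and replace negative values by 0."""
    return max(0.0, x)


def apply_activation(f, conv_map):
    """Apply the activation f to every entry of a 2-D convolution map."""
    return [[f(v) for v in row] for row in conv_map]
```

For example, `apply_activation(relu, [[-1.0, 2.0]])` returns `[[0.0, 2.0]]`.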

###### 4.2.2. Pooling Operation

The pooling operation is used to reduce the dimension of $C_a$, thus reducing the computational complexity of the network. During the pooling operation, a kernel $K$ of size $(k \times k)$ slides over $C_a$. The pooling stride $s_p$ denotes the number of patches by which $K$ slides over $C_a$. In our analysis, $s_p$ is set to 2. The pooling operation is expressed as

$$P = \phi_p(C_a), \tag{3}$$

where $P$ is the pooling map and $\phi_p$ is the pooling operation. The commonly used pooling operations are average-pooling, max-pooling, and min-pooling. Figure 6 describes the concept of average-pooling and max-pooling operations using a kernel $K$ and a stride $s_p = 2$.
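A minimal sketch of the pooling step follows. The (2 × 2) kernel used as the default is an assumption taken from the model configuration reported later in the paper, and `pool2d` and `mean` are hypothetical names.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def pool2d(feature_map, k=2, stride=2, op=max):
    """Reduce a square map by applying `op` to each (k x k) patch."""
    size = (len(feature_map) - k) // stride + 1
    out = []
    for i in range(size):
        row = []
        for j in range(size):
            # Collect the values covered by the kernel at this position.
            patch = [
                feature_map[i * stride + u][j * stride + v]
                for u in range(k)
                for v in range(k)
            ]
            row.append(op(patch))
        out.append(row)
    return out
```

Passing `op=max` gives max-pooling, `op=mean` gives average-pooling, and `op=min` gives min-pooling; a (4 × 4) map is reduced to (2 × 2) in each case.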

##### 4.3. Concatenation Operation

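The concatenation operation maps the set of the convoluted images into a vector called the concatenation vector $V$:

$$V = \left[ P^{(1)}, P^{(2)}, \ldots, P^{(q)} \right], \tag{4}$$

where $P^{(j)}$ is the output of the convolution layer for the $j$-th filter, flattened into a vector, and $q$ denotes the number of filters applied on the convoluted images.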

##### 4.4. Fully Connected Layer

The CNN classification operation is performed through the fully connected layer [39]. Its input is the concatenation vector $V$; the predicted class $\hat{y}$ is the output of the CNN classifier. The classification operation is performed through a series of $t$ fully connected hidden layers. Each fully connected hidden layer is a parallel collection of artificial neurons. Like synapses in the biological brain, the artificial neurons are connected through weights $W$. The model output of the $i$-th fully connected hidden layer is

$$H^{(i)} = f\left(Z^{(i)}\right), \tag{5}$$

where the weight sum vector is

$$Z^{(i)} = W^{(i)} H^{(i-1)} + B^{(i)}, \quad H^{(0)} = V. \tag{6}$$

Here, $f$ is a nonlinear activation function (sigmoid, Tanh, ReLU, etc.). The bias value $B^{(i)}$ defines the activation level of the artificial neurons.
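The concatenation step and one fully connected hidden layer can be sketched as follows. This is an illustrative pure-Python version, with hypothetical function names, in which each output neuron computes a weighted sum of its inputs plus a bias and then applies the activation.

```python
def flatten(feature_maps):
    """Concatenate a list of 2-D maps into a single 1-D vector."""
    return [v for fmap in feature_maps for row in fmap for v in row]


def relu(z):
    """Rectified linear unit."""
    return max(0.0, z)


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W x + b), one weight row per neuron."""
    outputs = []
    for w_row, b in zip(weights, biases):
        z = sum(w * x for w, x in zip(w_row, inputs)) + b  # weighted sum
        outputs.append(activation(z))
    return outputs
```

A stack of hidden layers is obtained by feeding the output of one `dense` call into the next.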

#### 5. CNN Learning Process

A trained CNN is a system capable of determining the exact class of a given input data. The training is achieved through an update of the layer’s parameters (filters, weights, and biases) based on the error between the CNN predicted class and the class label. The CNN learning process is an iterative process based on the feedforward propagation and backpropagation operations.

##### 5.1. Feedforward Propagation

For the CNN model, the feedforward equations can be derived from (1)–(5) and (6). The Softmax activation function [40, 41] is applied in the final layer to generate the predicted value $\hat{y}$ of the class of the input image $I$. For a multiclass model, the Softmax is expressed as follows:

$$\hat{y}_i = \frac{e^{z_i}}{\sum_{j=1}^{N} e^{z_j}}, \quad i = 1, \ldots, N,$$

where $N$ denotes the number of classes, $\hat{y}_i$ is the $i$-th coordinate of the output vector $\hat{y}$, and $z_i$ is the $i$-th artificial neural output of the final layer.
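A short sketch of the Softmax computation is given below. Subtracting the largest score before exponentiating is a common numerical-stability trick that does not change the result; it is an implementation choice, not something stated in the text.

```python
import math


def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    shift = max(scores)  # avoids overflow in exp for large scores
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

The predicted class is the index of the largest probability in the returned list.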

##### 5.2. Backpropagation

To update the CNN parameters and perform the learning process, a backpropagation optimization algorithm is developed to minimize a selected cost function $J$. In this analysis, the cross-entropy (CE) cost function [40] is used:

$$J = -\sum_{i=1}^{N} y_i \log \hat{y}_i,$$

where $y$ is the desired output (data label), $\hat{y}$ is the predicted output, and $N$ is the number of classes.
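For a one-hot label, the cross-entropy reduces to the negative logarithm of the probability assigned to the correct class. The sketch below illustrates this; the clipping constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import math


def cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy between a one-hot label and predicted probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))
```

A confident correct prediction gives a cost close to 0, while a confident wrong prediction gives a large cost.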

The most used optimization algorithm to solve classification problems is the gradient descent (GD). Various optimizers for the GD algorithm such as momentum, AdaGrad, RMSprop, Adam, AdaMax, and Nadam were used to improve the CNN performance.

###### 5.2.1. Gradient Descent [40, 42]

GD is the simplest form of gradient descent optimization algorithms. It is easy to implement and gives significant classification accuracy. The general update equation of the CNN parameters using the GD algorithm is

$$\theta_{t+1} = \theta_t - \alpha \nabla_{\theta} J(\theta_t),$$

where $\theta$ represents the parameters being updated (the filters $F$, the weights $W$, and the biases $B$), $\nabla_{\theta} J$ is the gradient of the cost function with respect to the parameter $\theta$, and $\alpha$ is the model learning rate. A too-large value of $\alpha$ may lead to the divergence of the GD algorithm and may cause the oscillation of the model performance. A too-small $\alpha$ slows the learning process to the point where it effectively stops.
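The effect of the learning rate can be seen on a toy problem. The sketch below minimizes a hypothetical one-dimensional cost, (θ − 3)², chosen only for illustration; it is not the CNN cost function.

```python
def toy_gradient(theta):
    """Gradient of the hypothetical cost (theta - 3)^2, minimized at theta = 3."""
    return 2.0 * (theta - 3.0)


def gradient_descent(grad, theta0, lr=0.1, steps=100):
    """Plain gradient descent: theta <- theta - lr * grad(theta)."""
    theta = theta0
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta
```

On this toy cost, a learning rate of 0.1 converges to the minimum, a rate of 1.1 makes the iterates diverge, and a rate of 0.00001 barely moves the parameter in 100 steps.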

###### 5.2.2. Gradient Descent with Momentum [43]

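The momentum hyperparameter $\gamma$ defines the fraction of the previous update that is carried over to the current one. This accelerates the descent along directions where the gradient keeps the same sign and damps oscillations as the model approaches the minimum of the cost function $J$. The update equations using the momentum GD algorithm are expressed as follows:

$$\mu_t = \gamma \mu_{t-1} + \alpha \nabla_{\theta} J(\theta_t), \qquad \theta_{t+1} = \theta_t - \mu_t,$$

where $\mu_t$ is the moment gained at the $t$-th iteration.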
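A minimal sketch of gradient descent with momentum on a toy one-dimensional cost is shown below. The cost (θ − 3)² and the default values (momentum 0.9, learning rate 0.1) are assumptions chosen for illustration.

```python
def toy_gradient(theta):
    """Gradient of the hypothetical cost (theta - 3)^2."""
    return 2.0 * (theta - 3.0)


def momentum_descent(grad, theta0, lr=0.1, momentum=0.9, steps=300):
    """Gradient descent where each update keeps a fraction of the previous one."""
    theta, velocity = theta0, 0.0
    for _ in range(steps):
        # Accumulate the decayed previous update plus the new gradient step.
        velocity = momentum * velocity + lr * grad(theta)
        theta = theta - velocity
    return theta
```

Setting `momentum=0.0` recovers plain gradient descent.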

###### 5.2.3. AdaGrad [44]

In this algorithm, the learning rate is a function of the gradient $g_t = \nabla_{\theta} J(\theta_t)$. It is defined as follows:

$$\alpha_t = \frac{\alpha}{\sqrt{G_t + \varepsilon}},$$

where

$$G_t = \sum_{\tau=1}^{t} g_\tau^2.$$

Here, $\varepsilon$ is a small smoothing value used to avoid the division by 0 and $G_t$ is the sum of the squares of the gradients up to iteration $t$.

While $G_t$ is small, $\alpha_t$ remains relatively large; as $G_t$ grows, $\alpha_t$ shrinks. The AdaGrad optimization algorithm thus changes the learning rate for each parameter at each iteration, taking the previous gradient updates into account. The parameter update equation using AdaGrad is expressed as follows:

$$\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{G_t + \varepsilon}}\, g_t.$$
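The AdaGrad rule can be sketched on a toy one-dimensional cost. The cost (θ − 3)² and the default hyperparameter values are assumptions for illustration; the variable `grad_sq_sum` holds the running sum of squared gradients.

```python
import math


def toy_gradient(theta):
    """Gradient of the hypothetical cost (theta - 3)^2."""
    return 2.0 * (theta - 3.0)


def adagrad(grad, theta0, lr=1.0, eps=1e-8, steps=200):
    """AdaGrad: scale each step by the root of the accumulated squared gradients."""
    theta, grad_sq_sum = theta0, 0.0
    for _ in range(steps):
        g = grad(theta)
        grad_sq_sum += g * g  # grows monotonically, so the step size shrinks
        theta -= lr / math.sqrt(grad_sq_sum + eps) * g
    return theta
```

Because the accumulated sum never decreases, the effective learning rate can only shrink over time.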

###### 5.2.4. AdaDelta [45]

The issue with AdaGrad is that after many iterations the learning rate becomes very small, which leads to slow convergence. To fix this problem, the AdaDelta algorithm replaces the sum of squared gradients with an exponentially decaying average:

$$E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma) g_t^2,$$

where $E[g^2]_t$ is the decaying average over past squared gradients and $\gamma$ is usually set to around 0.9.

###### 5.2.5. RMSprop [45, 46]

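In reality, RMSprop is identical to AdaDelta’s initial update vector, which we derived above:

$$E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma) g_t^2, \qquad \theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{E[g^2]_t + \varepsilon}}\, g_t,$$

where $g_t$ is the gradient at iteration $t$, $\gamma$ is the decay rate, $\alpha$ is the learning rate, and $\varepsilon$ is a small smoothing value.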
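A sketch of the RMSprop update on a toy one-dimensional cost follows. The cost (θ − 3)² and the default values are assumptions for illustration. With a constant learning rate the iterate does not settle exactly on the minimum but stays within a small band around it.

```python
import math


def toy_gradient(theta):
    """Gradient of the hypothetical cost (theta - 3)^2."""
    return 2.0 * (theta - 3.0)


def rmsprop(grad, theta0, lr=0.01, decay=0.9, eps=1e-8, steps=2000):
    """RMSprop: divide each step by a decaying average of squared gradients."""
    theta, avg_sq = theta0, 0.0
    for _ in range(steps):
        g = grad(theta)
        # Old gradients are forgotten exponentially, unlike in AdaGrad.
        avg_sq = decay * avg_sq + (1 - decay) * g * g
        theta -= lr / math.sqrt(avg_sq + eps) * g
    return theta
```

Unlike AdaGrad, the denominator can decrease again when recent gradients are small, so the learning rate does not vanish.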

###### 5.2.6. ADAM [17, 45, 46]

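This gradient descent optimizer algorithm computes the learning rate based on two vectors:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2,$$

where $m_t$ and $v_t$ are the first- and second-order moment vectors, $\beta_1$ and $\beta_2$ are the decay rates, and $g_t$ is the gradient at iteration $t$. The vectors $m_t$ and $v_t$ represent estimates of the mean and the (uncentered) variance of the previous gradients.

Because $m_t$ and $v_t$ are initialized to zero, they are biased toward zero during the first iterations, when they are still very small. To avoid this issue, a bias correction is applied to $m_t$ and $v_t$:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t},$$

where $\beta_1^t$ is $\beta_1$ to the power $t$ and $\beta_2^t$ is $\beta_2$ to the power $t$.

The Adam update equation is expressed as follows:

$$\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{\hat{v}_t} + \varepsilon}\, \hat{m}_t.$$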
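The Adam steps (moment estimates, bias correction, update) can be sketched on a toy one-dimensional cost. The cost (θ − 3)² is hypothetical, and the decay rates 0.9 and 0.999 are commonly used defaults rather than values reported in this paper.

```python
import math


def toy_gradient(theta):
    """Gradient of the hypothetical cost (theta - 3)^2."""
    return 2.0 * (theta - 3.0)


def adam(grad, theta0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Adam: momentum on the gradient plus per-parameter step scaling."""
    theta, m, v = theta0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g        # first moment (mean)
        v = beta2 * v + (1 - beta2) * g * g    # second moment (uncentered variance)
        m_hat = m / (1 - beta1 ** t)           # bias-corrected estimates
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta
```

On the first iteration the bias correction makes the step size equal to the learning rate, regardless of the gradient magnitude.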

###### 5.2.7. AdaMax [45, 47]

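The factor $v_t$ in the Adam algorithm scales the gradient inversely proportionally to the ℓ2 norm of the previous gradients (via the term $v_{t-1}$) and the current gradient $|g_t|^2$:

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) |g_t|^2.$$

The generalization of this update to the ℓp norm is as follows:

$$v_t = \beta_2^p v_{t-1} + (1 - \beta_2^p) |g_t|^p.$$

Norms with large $p$ tend to be numerically unstable, which is why the ℓ1 and ℓ2 norms are most common in practice. However, ℓ∞ generally also shows stable behavior. For this reason, AdaMax was proposed, and it was shown that $v_t$ with ℓ∞ converges to the more stable value $u_t$. Here,

$$u_t = \max\left(\beta_2 u_{t-1}, |g_t|\right), \qquad \theta_{t+1} = \theta_t - \frac{\alpha}{u_t}\, \hat{m}_t,$$

where $\hat{m}_t = m_t / (1 - \beta_1^t)$ is the bias-corrected first moment used in Adam.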
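A sketch of AdaMax on a toy one-dimensional cost is given below. The cost (θ − 3)² and the default values are assumptions for illustration, and the small `eps` in the denominator is an added guard against division by zero that is not part of the update rule described above.

```python
def toy_gradient(theta):
    """Gradient of the hypothetical cost (theta - 3)^2."""
    return 2.0 * (theta - 3.0)


def adamax(grad, theta0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """AdaMax: Adam with the second moment replaced by an infinity-norm term."""
    theta, m, u = theta0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        u = max(beta2 * u, abs(g))      # decayed running maximum of |gradient|
        m_hat = m / (1 - beta1 ** t)    # bias-corrected first moment
        theta -= lr * m_hat / (u + eps)
    return theta
```

Because the running maximum is not biased toward zero, no bias correction is applied to it.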

###### 5.2.8. Nadam [43]

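Nadam is a combination of Adam and Nesterov accelerated gradient (NAG), where the parameter update equation using NAG is defined as follows:

$$\mu_t = \gamma \mu_{t-1} + \alpha \nabla_{\theta} J(\theta_t - \gamma \mu_{t-1}), \qquad \theta_{t+1} = \theta_t - \mu_t.$$

Here, $\gamma$ is the momentum and $\mu_t$ is the moment gained at iteration $t$. The update equation using Nadam is expressed as follows:

$$\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{\hat{v}_t} + \varepsilon} \left( \beta_1 \hat{m}_t + \frac{(1 - \beta_1)\, g_t}{1 - \beta_1^t} \right),$$

where $g_t$ is the gradient at iteration $t$, and $\hat{m}_t$, $\hat{v}_t$, and $\beta_1$ are the bias-corrected moments and the decay rate of Adam.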
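The following sketch implements a commonly cited simplified form of the Nadam update on a toy one-dimensional cost. The cost (θ − 3)² and the default values are assumptions for illustration; library implementations may differ in detail, for example by scheduling the momentum term.

```python
import math


def toy_gradient(theta):
    """Gradient of the hypothetical cost (theta - 3)^2."""
    return 2.0 * (theta - 3.0)


def nadam(grad, theta0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Nadam: Adam whose momentum term includes a Nesterov look-ahead."""
    theta, m, v = theta0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        # Look-ahead: mix the corrected momentum with the current gradient.
        look_ahead = beta1 * m_hat + (1 - beta1) * g / (1 - beta1 ** t)
        theta -= lr * look_ahead / (math.sqrt(v_hat) + eps)
    return theta
```

The only difference from the Adam sketch is the `look_ahead` term, which gives the current gradient extra weight in the update.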

#### 6. Data Augmentation Techniques

Deep convolutional neural networks are heavily reliant on big data to achieve excellent performance and avoid the overfitting problem.

To solve the problem of insufficient data for Arabic handwritten characters, we present some basic data augmentation techniques that enhance the size and quality of training datasets.

The image augmentation approaches used in this study include geometric transformations, feature space augmentation, noise injection, and mixing images.

Data augmentation based on geometric transformations and feature space augmentation [17, 48] is often related to the application of rotation, flipping, shifting, and zooming.

##### 6.1. Rotation

The input data is rotated right or left on an axis between 1° and 359°. The rotation degree parameter has a significant impact on the safety of the dataset, that is, on whether the label is still valid after the transformation. For example, on digit identification tasks like MNIST, slight rotations such as 1° to 20° or −1° to −20° could be useful, but as the rotation degree increases, the CNN can no longer accurately distinguish between some digits.
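A rotation about the image center can be sketched with nearest-neighbor sampling, as below. This is an illustrative pure-Python version with a hypothetical function name; image libraries usually offer smoother interpolation.

```python
import math


def rotate(image, degrees, fill=0):
    """Rotate a 2-D image (list of lists) about its center, nearest neighbor."""
    h, w = len(image), len(image[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    angle = math.radians(degrees)
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    out = [[fill] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            # Inverse mapping: find the source pixel that lands on (r, c).
            y, x = r - cy, c - cx
            src_c = round(cos_a * x + sin_a * y + cx)
            src_r = round(-sin_a * x + cos_a * y + cy)
            if 0 <= src_r < h and 0 <= src_c < w:
                out[r][c] = image[src_r][src_c]
    return out
```

Pixels whose source falls outside the image are set to `fill`, which is why large rotations can remove parts of a character.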

##### 6.2. Flipping

The input image is flipped horizontally or vertically. This augmentation is one of the simplest to implement and has proven useful on some datasets such as ImageNet and CIFAR-10.

##### 6.3. Shifting

The input image is shifted right, left, up, or down. This transformation is a highly effective adjustment to prevent positional bias. Figure 7 shows an example of the shifting data augmentation technique using Arabic alphabet characters.
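Shifting can be sketched as moving every pixel by a fixed offset and filling the vacated border with a background value. The function name and the zero background are assumptions for illustration.

```python
def shift(image, dx=0, dy=0, fill=0):
    """Shift a 2-D image by dx columns (right) and dy rows (down)."""
    h, w = len(image), len(image[0])
    out = [[fill] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            new_r, new_c = r + dy, c + dx
            # Pixels pushed outside the frame are dropped.
            if 0 <= new_r < h and 0 <= new_c < w:
                out[new_r][new_c] = image[r][c]
    return out
```

Negative values of `dx` and `dy` shift the image left and up, respectively.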

##### 6.4. Zooming

The input image is zoomed, either by adding some pixels around the image or by applying random zooms to the image. The amount of zooming has an influence on the quality of the image; for example, if we apply a lot of zooming, we can lose some image pixels.

##### 6.5. Noise Injection

As can be seen in Arabic handwritten characters, natural noise is present in images. Noise makes recognition more difficult, and for this reason it is usually reduced by image preprocessing techniques. The purpose of noise reduction is to achieve a high classification accuracy, but it alters the character shape. The main datasets in this research topic consist of denoised images. The question we answer here is how the method can be made robust to any noise.

Adding noise [48, 49] to a convolution neural network during training helps the model learn more robust features, resulting in better performance and faster learning. We can add several types of noise when recognizing images, such as the following:

(i) Gaussian noise: injecting a matrix of random values drawn from a Gaussian distribution.

(ii) Salt-and-pepper noise: randomly changing a certain amount of the pixels to completely white (“salt”) or completely black (“pepper”).

(iii) Speckle noise: multiplicative noise, in which each pixel is perturbed in proportion to its own intensity.
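The three noise types can be sketched for images whose pixel values lie in [0, 1]. The function names, the noise levels, and the clipping to [0, 1] are assumptions for illustration. Speckle noise is implemented here in its usual multiplicative form, which is also an assumption about what was used in the experiments.

```python
import random


def _clip(p):
    """Keep a pixel value inside [0, 1]."""
    return min(1.0, max(0.0, p))


def add_gaussian_noise(image, sigma=0.05, rng=random):
    """Add zero-mean Gaussian noise to every pixel."""
    return [[_clip(p + rng.gauss(0.0, sigma)) for p in row] for row in image]


def add_salt_and_pepper(image, amount=0.05, rng=random):
    """Set a fraction `amount` of pixels to white (salt) or black (pepper)."""
    out = []
    for row in image:
        new_row = []
        for p in row:
            if rng.random() < amount:
                new_row.append(1.0 if rng.random() < 0.5 else 0.0)
            else:
                new_row.append(p)
        out.append(new_row)
    return out


def add_speckle(image, sigma=0.05, rng=random):
    """Multiplicative noise: each pixel is perturbed in proportion to its value."""
    return [[_clip(p + p * rng.gauss(0.0, sigma)) for p in row] for row in image]
```

Passing a seeded `random.Random` instance as `rng` makes the augmentation reproducible.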

Adding noise to the input data is the most commonly used approach, but during training, we can add random noise to other parts of the CNN model. Some examples include the following:

(i) Adding noise to the outputs of each layer

(ii) Adding noise to the gradients to update the model parameters

(iii) Adding noise to the target variables

##### 6.6. Mixing Image’s Databases

In this study, we augment the training dataset by mixing two different Arabic handwritten characters datasets, AHCD and Hijja. AHCD is a clean database, but Hijja is a dataset with very low-resolution images. It comprises many distorted letter images.

Then, we evaluate the influence of different mentioned data augmentation techniques (geometric transformations, feature space augmentation, and noise injection) on the recognition performance of the new mixing dataset.

#### 7. Experimental Results and Discussion

##### 7.1. Datasets

In this study, two datasets of Arabic handwritten characters were used: Arabic handwritten characters dataset “AHCD” and Hijja dataset.

AHCD [6] comprises 16,800 handwritten characters of size (32 × 32 × 1) pixels. It was written by 60 participants between the ages of 19 and 40 years, most of whom are right-handed. Each participant wrote the Arabic alphabet from “alef” to “yeh” 10 times. The dataset has 28 classes. It is divided into a training set of 13,440 characters and a testing set of 3,360 characters.

The Hijja dataset [13] consists of 47,434 Arabic characters of size (32 × 32 × 1) pixels. It was written by 591 school children ranging in age from 7 to 12 years. Collecting data from children is a very hard task. Malformed characters are characteristic of children’s handwriting; therefore, the dataset comprises repeated letters, missing letters, and many distorted or unclear characters. The dataset has 29 classes. It is divided into a training set of 37,933 characters and a testing set of 9,501 characters (80% for training and 20% for test).

Figure 8 shows a sample of AHCD and Hijja Arabic handwritten letters datasets.


##### 7.2. Experimental Environment and Performance Evaluation

In this study, the implementation and the evaluation of the CNN model are carried out in the Keras deep learning environment with a TensorFlow backend on Google Colab using a GPU accelerator.

We evaluate the performance of our proposed model via the following measures.

Accuracy ($A$) measures how many correct predictions the model made over the complete test dataset:

$$A = \frac{TP + TN}{TP + TN + FP + FN}.$$

Recall ($R$) is the fraction of images that are correctly classified over the total number of images that belong to the class:

$$R = \frac{TP}{TP + FN}.$$

Precision ($P$) is the fraction of images that are correctly classified over the total number of images classified as belonging to the class:

$$P = \frac{TP}{TP + FP}.$$

The $F_1$ measure is a combination of the Recall and Precision measures:

$$F_1 = \frac{2 \times P \times R}{P + R}.$$

Here, TP = true positive (is the total number of images that can be correctly labeled as belonging to a class x), FP = false positive (represents the total number of images that have been incorrectly labeled as belonging to a class x), FN = false negative (represents the total number of images that have been incorrectly labeled as not belonging to a class x), *TN* = true negative (represents the total number of images that have been correctly labeled as not belonging to a class x).
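These four measures can be computed directly from the per-class counts, as in the following sketch. The function name is hypothetical, and returning 0 when a denominator is 0 is an assumption made to keep the example robust.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, recall, precision, and F1 from the counts for one class."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f1": f1}
```

For example, with 8 true positives, 2 false positives, 2 false negatives, and 88 true negatives, the accuracy is 0.96 and the recall and precision are both 0.8.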

We also report the area under the ROC curve (AUC), defined as follows.

An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters:

(i) True-positive rate

(ii) False-positive rate

AUC stands for “area under the ROC curve.” That is, AUC measures the entire two-dimensional area underneath the entire ROC curve from (0, 0) to (1, 1).
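For a binary (one class versus the rest) problem, the ROC points and the AUC can be sketched as follows. The function names are hypothetical, the area is computed with the trapezoidal rule, and both classes are assumed to be present in `labels`.

```python
def roc_points(labels, scores):
    """Return (false-positive rate, true-positive rate) pairs for all thresholds."""
    positives = sum(labels)
    negatives = len(labels) - positives
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for rank, i in enumerate(order):
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after all samples sharing this score are counted.
        is_last = rank == len(order) - 1
        if is_last or scores[order[rank + 1]] != scores[i]:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under a curve given as sorted (x, y) points, trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

A classifier that ranks every positive sample above every negative one reaches an AUC of 1, while random scores give about 0.5.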

##### 7.3. Tuning of CNN Hyperparameters

The objective is to choose the model that best fits the AHCD and Hijja datasets. Many trial-and-error runs were performed while tuning the network configuration.

The best performance was achieved when the CNN model was constructed of four convolution layers followed by three fully connected hidden layers. The model starts with two convolution layers with 16 filters of size (3 × 3), followed by two convolution layers with 32 filters of size (3 × 3); each pair of convolution layers is followed by a max-pooling layer with a (2 × 2) kernel. Finally, three fully connected layers (dense layers) with a Softmax activation function on the output perform the prediction. ELU, a nonlinear activation function, was used to remove negative values by converting them into 0.001. The values of the weights and biases are updated by a backward propagation process to minimize the loss function.
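To see how the (32 × 32 × 1) input shrinks through these layers, the sketch below traces the feature-map shapes. The padding used by the convolution layers and the pooling stride are not stated here, so `padding` is left as a parameter and a pooling stride of 2 is assumed; the function names are hypothetical.

```python
def conv_out(size, kernel=3, stride=1, padding=0):
    """Spatial size after a convolution layer."""
    return (size - kernel + 2 * padding) // stride + 1


def pool_out(size, kernel=2, stride=2):
    """Spatial size after a pooling layer."""
    return (size - kernel) // stride + 1


def trace_shapes(input_size=32, padding=0):
    """Follow the feature-map shape through conv-conv-pool-conv-conv-pool."""
    layers = [("conv", 16), ("conv", 16), ("pool", None),
              ("conv", 32), ("conv", 32), ("pool", None)]
    size, channels = input_size, 1
    shapes = []
    for kind, filters in layers:
        if kind == "conv":
            size = conv_out(size, padding=padding)
            channels = filters
        else:
            size = pool_out(size)
        shapes.append((kind, size, size, channels))
    # Length of the vector passed to the fully connected layers.
    return shapes, size * size * channels
```

Without padding the dense layers would receive a vector of length 800, and with one pixel of zero padding per convolution a vector of length 2048.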

To reduce overfitting, dropout with a rate of 0.6 is added between the dense layers; it is applied to the outputs of the prior layer that are fed to the subsequent layer. The optimized parameters used to improve the CNN performance were as follows: the optimizer algorithm is Adam, the loss function is the cross-entropy, learning rate = 0.001, batch size = 16, and epochs = 40.

We compare our model to CNN-for-AHCD over both the Hijja dataset and the AHCD dataset. The code for CNN-for-AHCD is available online [31], which allows comparison of its performance over various datasets.

On the Hijja dataset, which has 29 classes, our model achieved an average overall test set accuracy of 88.46%, precision of 87.98%, recall of 88.46%, and an F1 score of 88.47%, while CNN-for-AHCD achieved an average overall test set accuracy of 80%, precision of 80.79%, recall of 80.47%, and an F1 score of 80.4%.

On the AHCD dataset, which has 28 classes, our model achieved an average overall test set accuracy of 96.66%, precision of 96.75%, recall of 96.67%, and an F1 score of 96.67%, while CNN-for-AHCD achieved an average overall test set accuracy of 93.84%, precision of 93.99%, recall of 93.84%, and an F1 score of 93.84%.

The detailed metrics are reported per character in Table 3.

We note that our model outperforms CNN-for-AHCD by a large margin on all metrics.

Figure 9 shows the testing result AUC of AHCD and Hijja dataset.


##### 7.4. Optimizer Algorithms

The objective is to choose the optimization algorithm that gives the best performance on the AHCD and Hijja datasets. In this context, we tested the influence of the following algorithms on the classification of handwritten Arabic characters:

(i) Adam

(ii) SGD

(iii) RMSprop

(iv) AdaGrad

(v) Nadam

(vi) Momentum

(vii) AdaMax

By using Nadam optimization algorithm, on the Hijja dataset, our model achieved an average overall test set accuracy of 88.57%, precision of 87.86%, recall of 87.98%, and an F1 score of 87.95%.

On the AHCD dataset, our model achieved an average overall test set accuracy of 96.73%, precision of 96.80%, recall of 96.73%, and an F1 score of 96.72%.

The detailed results of different optimizations algorithms are mentioned in Table 4.

##### 7.5. Results of Data Augmentation Techniques

Generally, the neural network performance is improved through the good tuning of the model hyperparameters. Such improvement in the CNN accuracy is linked to the availability of training dataset. However, the networks are heavily reliant on big data to avoid overfitting problem and perform well.

Data augmentation is the solution to the problem of limited data. The image augmentation techniques used and discussed in this study include geometric transformations and feature space augmentation (rotation, shifting, flipping, and zooming), noise injection, and mixing images from two different datasets.

For the geometric transformations and feature space augmentation, we carefully choose the amount of rotation, shifting, flipping, and zooming so that the model attains a good performance. For example, if we rotate the Latin handwritten number database (MNIST) by 180°, the network will not be able to accurately distinguish between the handwritten digits “6” and “9”. Likewise, on the AHCD and Hijja datasets, if rotating or flipping techniques are used, the network will be unable to distinguish between some handwritten Arabic characters. For example, as shown in Figure 10, with a rotation of 180°, the character Daal isolated (د) will be the same as the character Noon isolated (ن).


The detailed results of rotation, shifting, flipping, and zooming data augmentation techniques are mentioned in Table 5.

As shown in Table 5 and Figure 11, by using rotation and shifting augmentation approaches, our model achieved a testing accuracy of 98.48% and 91.24% on AHCD dataset and Hijja dataset, respectively. We achieved this accuracy through rotating the input image by 10° and shifting it just by one pixel.


Adding noise is a technique used to augment the training input data. Also in most of the cases, this is bound to increase the robustness of our network.

In this work we used three types of noise to augment our data:

(i) Gaussian noise

(ii) Salt-and-pepper noise

(iii) Speckle noise

The detailed results of the different types of noise injection are reported in Table 6. As shown, adding different types of noise improves the model accuracy, which demonstrates the robustness of our proposed architecture. We achieved good results when adding noise to the outputs of each layer.

The proposed idea in this study is to augment the training data by mixing the two datasets AHCD and Hijja and then to apply the previously mentioned data augmentation methods to the new mixed dataset. Our purpose in using the malformed handwritten characters provided by the Hijja dataset is to improve the accuracy of our method on noisy data.

The detailed results of data augmentation techniques on the mixed database are mentioned in Table 7. As shown, the model performance depends on the rate of using Arabic handwriting “Hijja” database. The children had trouble following the reference paper, which results in very low-resolution images comprising many unclear characters. Therefore mixing the datasets would certainly reduce performance.

#### 8. Conclusions and Possible Future Research Directions

In this paper, we proposed a convolution neural network (CNN) to recognize Arabic handwritten characters. We trained the model on two Arabic datasets, AHCD and Hijja. Through careful tuning of the network hyperparameters, we achieved an accuracy of 96.73% and 88.57% on AHCD and Hijja, respectively.

To improve the model performance, we have implemented different optimization algorithms. For both databases, we achieved an excellent performance by using Nadam optimizer.

To solve the problem of insufficient Arabic handwritten datasets, we have applied different data augmentation techniques. The augmentation approaches are based on geometric transformation, feature space augmentation, noise injection, and mixing of datasets.

By using rotation and shifting techniques, we achieved a good accuracy equal to 98.48% and 91.24% on AHCD and Hijja.

To improve the robustness of the CNN model and increase the number of training datasets, we added three types of noise (Gaussian noise, Salt-and-pepper, and Speckle noise).

Also in this work, we first augmented the database by mixing two Arabic handwritten characters datasets; then we tested the results of the previously mentioned data augmentation techniques on the new mixed dataset, where the first database “AHCD” comprises clear images with a very good resolution, but the second database “Hijja” has many distorted characters. The experiments show that the geometric transformations (rotation, shifting, and flipping), feature space augmentation, and noise injection always improve the network performance, but the rate of using the unclean database “Hijja” harms the model accuracy.

An interesting future direction is the cleaning and processing of Hijja dataset to eliminate the problem of low-resolution and unclear images and then the implementation of the proposed CNN network and data augmentation techniques on the new mixed and cleaned database.

In addition, we are interested in evaluating the result of other augmentation approaches, like adversarial training, neural style transfer, and generative adversarial networks on the recognition of Arabic handwritten characters dataset. We plan to incorporate our work into an application for children that teaches Arabic spelling.

#### Abbreviations

AHCR: | Arabic handwritten characters recognition |

DL: | Deep learning |

CNNs: | Convolution neural networks |

AHCD: | Arabic handwritten character dataset |

SVM: | Support vector machine |

ADBase: | Arabic digits database |

HACDB: | Handwritten Arabic characters database |

OIHACDB: | Offline handwritten Arabic character database |

CDCGAN: | Conditional deep convolutional generative adversarial network |

Tanh: | Hyperbolic tangent |

ReLU: | Rectified linear unit |

CE: | Cross-entropy |

GD: | Gradient descent |

NAG: | Nesterov accelerated gradient |

TP: | True positive |

FP: | False positive |

FN: | False Negative |

TN: | True negative |

AUC: | Area under curve |

ROC: | Receiver operating characteristic |

ELU: | Exponential linear unit |

#### Symbols

: | Image |

: | Width and height of the image |

: | Number of channels |

: | Filter |

: | Filter size |

⊗: | Convolution operation |

: | Convolution map |

: | Size of convolution map |

: | Stride |

: | Padding |

: | Nonlinear activation function |

: | Convolution map after applying |

: | Kernel |

: | Number of patches |

: | Pooling operation |

: | Pooling map |

: | Concatenation vector |

: | Output of the convolution layer |

: | Convoluted image |

: | Input of the fully connected hidden layer |

: | Output of the fully connected hidden layer |

: | Weight sum vector |

: | Bias |

: | Cost function |

: | Desired output |

: | Update of the filter |

: | Gradient |

α: | Model learning rate |

: | Momentum |

: | Moment gained at the iteration |

ε: | Smoothing value |

: | Sum of the squares of the gradient |

: | Decaying average |

: | Moments vector |

: | Decay rate |

: | Mean of the previous gradient |

: | Variance of the previous gradient. |

#### Data Availability

Previously reported AHCD data were used to support this study and are available at https://www.kaggle.com/mloey1/ahcd1. These prior studies (and datasets) are cited at relevant places within the text as [43].

#### Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this study.