#### Abstract

The electronic properties of graphene nanoflakes (GNFs) with embedded hexagonal boron nitride (hBN) domains are investigated by combined ab initio density functional theory calculations and machine-learning techniques. The energy gaps of the quasi-0D graphene-based systems, defined as the differences between LUMO and HOMO energies, depend not only on the sizes of the hBN domains relative to the size of the pristine graphene nanoflake but also on the position of the hBN domain. The range of the energy gaps for different configurations increases as the hBN domains get larger. We develop two artificial neural network (ANN) models able to reproduce the gap energies with high accuracies and investigate the tunability of the energy gap, by considering a set of GNFs with embedded rectangular hBN domains. In one ANN model, the input is in one-to-one correspondence with the atoms in the GNF, while in the second model the inputs account for basic structures in the GNF, allowing potential use in upscaled systems. We perform a statistical analysis over different configurations of ANNs to optimize the network structure. The trained ANNs provide a correlation between the atomic system configuration and the magnitude of the energy gaps, which may be regarded as an efficient tool for optimizing the design of nanostructured graphene-based materials for specific electronic properties.

#### 1. Introduction

The absence of an electronic gap in pristine graphene hinders many of the expected applications based on the field effect. Graphene nanopatterning is one way to tune the electronic and transport properties, and this can be achieved by reducing the dimensionality [1–4], by drilling periodic arrangements of holes [5, 6], by embedding hexagonal boron nitride (hBN) [7–12], or by combining any of these. Graphene nanoribbons (GNRs) and graphene nanoflakes (GNFs), typically passivated with monovalent species like hydrogen or halogen atoms, are two examples of quasi-1D and quasi-0D graphene systems, respectively, which attracted a lot of attention in the past few years. GNRs can have a metallic or semiconducting behavior depending on the lateral width and edge type: armchair or zigzag. In contrast to GNRs, where only the edge states may influence the electronic properties, in GNFs these are markedly influenced by both edge and corner states and, in general, by the different possible shapes [13, 14]. In addition, GNFs may be functionalized, which further extends the range of the electronic, optical and, magnetic properties.

GNFs can be produced by bottom-up approaches, where the synthesis takes place in solution by mechanical extrusion, using magnetic field alignment and thermal annealing [15, 16] or by top-down methods, using techniques like e-beam lithography [17], plasma etching [18], or a cationic surfactant-mediated exfoliation of graphite [19]. Besides the many applications envisioned for nanoelectronics and spintronics [20], more recently, novel applications also indicate the role of GNFs for biological recognition [21]. Therefore, the methods for an efficient investigation of multiple configurations of GNFs and related structures are highly demanded.

In the past few years, machine-learning (ML) techniques are gaining ground in the field of condensed matter. They have been developed to predict the band gaps in solids [22, 23], while they also provide new clues in crystal structure prediction [24, 25]. They can be used to bypass the Kohn-Sham equations by learning energy functionals via examples [26] or predicting DFT Hamiltonians [27]. The generic aim is to develop less expensive and faster methods to calculate the system’s properties. To this end, the methodology contained in PROPhet [28] provides a general framework for coupling machine learning and first-principles methods. ML techniques can also provide more insights about the physical properties of a system. The usefulness as a universal descriptor of grain boundary systems was pointed out [29], potentially indicating which building blocks map to particular physical properties. ML techniques can also achieve high accuracies, the prediction errors of molecular machine-learning models being below that of the hybrid DFT error [30]. High-throughput DFT calculations in connection with ML techniques, as well as some of the problems, challenges, and future perspectives are illustrated in a recent review [31].

Regarding graphene systems, ML techniques have been employed in several studies, e.g., for obtaining an accurate interatomic potential for graphene [32], for searching the most stable structures of doped boron atoms in graphene [33], for investigating the influence of GNF topology [34], and for predicting accuracy differences between different levels of theory [35], as well as for the prediction of interfacial thermal resistance between graphene and hBN [36].

In this paper, we investigate the electronic properties of hybrid graphene-hBN nanoflakes, using combined DFT and ML methods. We construct the distribution of gap energies using ab initio DFT calculations, as LUMO-HOMO differences, which depend on the size and position of the hBN domains within the GNF. Given the large number of possibilities of setting the hBN domains, extensive DFT calculations are typically required, with a significant computational cost. Instead, we develop artificial neural network (ANN) models able to reproduce the energy gaps with high accuracies, which significantly reduce the computational effort. We test our ANN models against reference gap values obtained by DFT and discuss the optimal conditions for the network structure.

#### 2. Model Systems and Computational Methods

We consider GNFs with embedded hBN domains, passivated with hydrogen, as indicated in Figure 1. The hBN domains are rectangular-shaped regions containing an equal number of boron and nitrogen atoms. In this way, the systems retain an intrinsic semiconducting behavior, without a net chemical doping. The embedded rectangular hBN is randomly positioned in the graphene nanoflake. The widths and heights of the rectangular hBN regions are extracted from a flat distribution so that the entire graphene nanoflake can be replaced by BN. The systems analyzed here have a total of 200 atoms, of which atoms are stemming from graphene/hBN and hydrogen atoms. For the investigation of the electronic properties, a number of 900 nonequivalent systems are generated.

The DFT calculations are performed using the SIESTA code [37] employing local density approximation (LDA) in the parametrization of Ceperley and Alder [38]. The strictly localized basis set allows a linear scaling of the computational time with the system size. The self-consistent solution of the Kohn-Sham equations was obtained using the standard double-*ζ* polarized basis set, a grid cutoff of 100 Ry, and the norm-conserving pseudopotentials of Troullier and Martins [39] with typical valence electron configurations for carbon, boron, and nitrogen. The supercells are cubic cells with a linear size of 50 Å, which provides enough empty space so that two neighboring nanoflake structures do not interact. Gamma point calculations are performed for the cluster-type systems. The gap energies are determined, being defined as the difference between the LUMO and HOMO energies, .

Based on the DFT results, we implement ANN models able to reproduce the gap energy for similar systems from a new set. The ANNs are standard fully connected backpropagation neural networks implemented using the FANN library [40], with three layers: one input layer, one hidden layer, and one output layer. In *Method 1*, we assign an input neuron to each atom of species C, B, or N, so that the number of input neurons is . *Method 2* accounts for the chemical neighborhood of a certain atomic species and its prevalence in the system. In this case, we use input neurons, where 4 of them account for the proportions of the four atomic species (C, B, N, and H) and 16 neurons are associated with the normalized counts of atom quadruplets , where , B, and N and are the three nearest neighbors of , with . Therefore, to obtain the normalized counts of atom quadruplets, one has to loop over all atoms (C, B, or N) and consider the number of times each particular configuration of nearest neighbors appears. The number of neurons in the hidden layer is varied, from 25 to 200 neurons, in order to find a close-to-optimal configuration. The output layer has a single neuron, , and the result maps the gap energy by a continuous function in the [0,1] interval, corresponding to a maximum gap energy . *Method 2* has the advantage that the input does not depend on the system size, allowing the same ANN to handle upscaled structures. The two proposed methods are not limited to rectangular shapes and may handle systems with irregular patterns.

For training, we employ the iRPROP algorithm of Igel and Husken [41], which is a variant of the standard RPROP algorithm introduced by Riedmiller and Braun [42]. The iRPROP algorithm is adaptative and there is no preset learning rate. The sigmoid activation function is used and the mean square error during training is set to . Since the ANNs are randomly initialized and the final weight configurations depend on the seeds, an ensemble of 1000 ANNs is trained on the same data set. Finally, a statistics regarding the accuracy obtained on the test data is performed.

The trained ANNs are tested on a set of 100 new examples and the predicted gaps are compared to the reference values obtained by DFT calculations. We use the coefficient of determination as a measure of how far the observed outcomes are replicated by the ANN model.

#### 3. Results and Discussion

GNFs are quasi-0D systems with a discrete energy spectrum, where the gap energy is typically influenced by their geometry, passivation, and nanopatterning. By embedding hBN in GNFs, which is a wide band-gap isomorph of graphene, it is expected that the gap energy has a strong variation. Particularly in finite systems, the position and shape of the embedded rectangular hBN domain, closer to the edges or at the center of the GNF, significantly influences .

We first investigate the variation of as a function of the hBN domain size, given by the BN fraction , where and are the number of boron and nitrogen atoms, respectively. As it is indicated in Figure 2, there is a rather wide dispersion of values, as there are multiple configurations with the same . Still, a clear trend is visible for : larger gaps may be obtained as the BN domain size increases, while smaller gaps are still present. A fit with a second degree polynomial function shows the statistical increase of the gap energy, as .

Next, we investigate the accuracies in predicting the energy gaps for the proposed ANN models. In *Method 1*, we start with an ANN configuration with three layers, with neurons in the input layer, neurons in the hidden layer, and output neuron. The ANN is trained on 800 examples and tested on a new set of 100 structures. The results are represented in Figure 3, where the predicted gap is plotted vs. the reference DFT gap. The coefficient of determination calculated for the training set yields a rather high value of 99.7%, which indicates a consistent convergence during training. Typically, for this ANN configuration, the threshold for the mean square error set to 10^{−5} is reached in ∼400 steps. Running the ANN on the test systems, one obtains values as high as 95%. However, as detailed in the following, the performance of the ANN relies on the converged configuration, which may depend on the ANN initialization.

**(a)**

**(b)**

In the second method, labeled *Method 2*, the ANN is trained to capture the local chemical neighborhood. For the same set of systems, there are 16 instances of atom quadruplets , with , B, and N and , B, N, and H. These are counted for each structure and normalized to , the total number of carbon, boron, and nitrogen atoms. Along with these 16 inputs, the fractions corresponding to each of the four atomic species are added, yielding a total number of input neurons. These extra input neurons improve the prediction behavior of the ANN as they emphasize the importance of the size of the hBN domains. The same training procedure and convergence criterion are employed as for *Method 1*. The convergence during training is poorer (), and the obtained accuracy is typically smaller () for *Method 2*, although they are comparable with the ones obtained for *Method 1*. However, *Method 2* is by construction scale invariant and this is potentially a significant advantage in investigating systems with different sizes.

The final ANN configuration following the training phase depends on the assigned random initial weights. Consequently, the accuracy of the output results obtained by running the test examples is subject to the initialization procedure. In order to see how robust are the obtained results, we construct histograms using an ensemble of 2000 trained ANNs. The results are shown in Figure 4 for the two methods. In *Method 1*, as the number of hidden neurons is varied, the distributions evolve from a rather widespread distribution of coefficients for to a more confined distribution around the high accuracy values; this is for a number of neurons in the hidden layer, (between 100 and 125 neurons). Increasing further does not improve the accuracy. Rather, as the ANN becomes larger, memory effects become important to the detriment of capturing the essential features of the structures. Moreover, by decreasing the mean square error to 10^{−6} when training ANNs with , they become overtrained and the coefficient does not improve either. Therefore, we conclude that optimal ANN configurations exist, with quite high maximal output accuracies (∼97%) and a relatively narrow band of ∼10%, where the coefficients of the most trained ANNs can be found.

**(a) Method 1**

**(b) Method 2**

Comparatively, by employing *Method 2*, the histograms follow the same trend, although the accuracy spread is larger. Still, the highest values can reach as high as ∼91%. This shows that by describing the local chemical environment and constructing a statistics reflecting the neighborhood of the different species, one can infer quite reasonably the electronic features of the GNFs, in particular the energy gaps. A direct comparison to *Method 1* is shown in Figure 5. Additionally, the distribution of coefficients for an intermediate model based on geometrical parameters of the rectangular hBN domains is also indicated. In this case, the four distances between the edges of the hBN rectangles and the edges of the GNF along with the two linear sizes of the hBN domains were taken as inputs, i.e., input neurons. However, this approach can be used as long as the geometric features of the samples can be easily identified, i.e., in this case the parameters describing the rectangular shapes. The distribution of the coefficients lies in between the ones corresponding to *Method 1* and *Method 2*, with a maximum at 94.1%, compared to the best results of 97.2% obtained with *Method 1* and 91.9% using *Method 2*. This also shows that by identifying the geometrical features in graphene-hBN systems, without taking into account a detailed representation of the species present in the structure, i.e., considering the hBN domain as a whole, reasonable accuracies may be achieved.

#### 4. Conclusions

The electronic properties of GNFs with embedded hBN domains were investigated using combined DFT and ML techniques. Using DFT calculations, we constructed the energy gap distribution for a set of systems with different rectangular hBN shapes. The collected data was used to train two types of ANNs. In *Method 1*, one input neuron is assigned to one atom of species C, B, or N, while in *Method 2* the prevalence of the chemical neighborhood and atomic species was taken into account. The trained ANNs provide a correlation between the different domain shapes and their sizes and locations within the GNFs, on one hand, and the magnitude of the energy gaps, on the other hand. *Method 1* shows the highest accuracies, while in *Method 2* smaller ANNs are not bound to a fixed system size and the accuracies are comparable. A statistical analysis reveals the optimal configurations of the three-layer ANNs, pointing out potential memory and overtraining effects in large networks. The approach based on ANNs is therefore a feasible route, providing a reduction of the computational effort, while retaining a high accuracy; therefore, it may be employed for optimizing the design and selecting candidates of nanostructured graphene-based materials for specific electronic properties.

#### Data Availability

The data used to support the findings of this study are included within the article.

#### Conflicts of Interest

The authors declare that they have no conflicts of interest.

#### Acknowledgments

This work was supported by the Romanian Ministry of Research and Innovation under the project PN 19060205/2019 and by the Romania-JINR cooperation project.