Abstract

Over the last several decades, researchers have made significant advances in interpreting sedimentary environments from grain-size analysis, but these interpretations have often depended on the subjective experience of the researcher and were usually combined with other methods. More recently, researchers have used a growing number of data mining and knowledge discovery methods to explore the potential relationships in sediment grain-size data. In this paper, we apply bipartite graph theory to construct a Sample/Grain-Size network model and then construct a Sample network model projected from this bipartite network. We then use the Mini Batch K-means algorithm, with the most appropriate reassignment ratio and mini batch = 25, to cluster the sediment samples, and we use four representative evaluation indices to verify the clustering result. Simulation results demonstrate that this algorithm divides the Sample network into three clusters of sedimentary categories: marine, fluvial, and lacustrine. Compared with the results that previous studies obtained from a variety of indices, the precision of the grain-size classification reaches 0.92254367, which shows that this method of analyzing the sedimentary environment by grain size is effective and accurate.

1. Introduction

Data mining, knowledge discovery, and machine learning algorithms have permeated research in many fields [1–4]. The complex network, as a significant method of data mining, focuses on discovering concealed information between things. Therefore, a great number of researchers from various fields, including mathematics, physics, biology, chemistry, and oceanology, have used complex networks to explore the potential relationships in data [5–9]. A complex network has several characteristic properties: self-similarity, self-organization, scale-freeness, the small-world property, community structure (clusters), and node centrality. Community structure is one of the most important of these because it can objectively reflect the potential relationships between nodes. A community is a group of nodes that are densely connected to each other but only sparsely connected to the nodes of other communities [10, 11].

Grain-size analysis is one of the basic tools for classifying sedimentary environments, and it can provide important clues to the provenance, transport history, and depositional conditions of sediments [12]. In general, the representative statistical parameters of grain-size analysis include the median, mode, mean, separation parameter, skewness, and kurtosis [13]. During the last few decades, two methods of computing grain-size parameters were developed: the graphical method and the moment method [12]. Blott and Pye (2011) showed that each of these two methods has advantages and disadvantages when computing the parameters of sediment grain-size samples. As most sediments are polymodal, the curve shape and statistical measures usually reflect only the relative magnitude and separation of the populations. A polymodal grain-size spectrum can be considered the result of the superposition of several unimodal components [14]. Many works have shown that different grain-size distributions are related to particular transport and deposition processes [15]. Three kinds of functions are commonly used to fit the grain-size distribution: the Normal, Lognormal, and Weibull functions [15]. Based on experimental results, Sun et al. [15] found that the Weibull function was appropriate for the mathematical description of the grain-size distribution of all kinds of sediments, while the Normal function was also acceptable for fluvial and lacustrine sediments. Although these methods, especially the Weibull function, perform well in fitting the grain-size distribution of sediments, they often rely on the subjective experience of the researcher, and definite criteria for environmental determination have not been given. Based on the data of borehole Lz908, Yi et al. analyzed the evolution of the sedimentary environment. Besides grain-size data, they also used magnetic susceptibility, tree pollen, radiocarbon dating, and optically stimulated luminescence (OSL) dating data [16, 17]. Can the same conclusion be obtained by using only grain-size data, which are relatively convenient and inexpensive to obtain?

In this paper, we introduce complex networks into the modeling of sediment grain-size data. Based on the theory of bipartite graphs [18], we construct the Sample/Grain-Size bipartite weighted network model, which can objectively reflect the association relationships between sediment samples and grain sizes. By projection, we then construct the Sample network model from the bipartite network. After repeated tests with tens of representative clustering algorithms, we selected the Mini Batch K-means algorithm [19], an optimization that combines the K-means algorithm [20] with the classical batch algorithm [21], to split the Sample nodes into their categories and find the relationships between the sedimentary environment and grain size. After 400 tests, we find the most appropriate parameters for the Mini Batch K-means algorithm. Finally, we use four evaluation indices (AMI, NMI, completeness, and precision) to verify the accuracy and efficiency of the clustering divisions.

2. Evaluation Functions

In the field of complex networks, researchers commonly use several representative performance evaluation indices, such as AMI, NMI, completeness, and precision, to verify the accuracy and efficiency of clustering divisions. For each index, a higher value indicates a better clustering division. Therefore, we also use these four evaluation indices to verify the clustering result of the sediment grain-size samples.

2.1. NMI and AMI

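Normalized Mutual Information (NMI) [22, 23] measures the information shared between two data distributions using information theory, in which entropy is defined as the information contained in a distribution [24]. For two label assignments $U$ and $V$ of the same $N$ objects, the mutual information (MI) is

$$\mathrm{MI}(U, V) = \sum_{i=1}^{|U|} \sum_{j=1}^{|V|} P(i, j) \log \frac{P(i, j)}{P(i)\, P'(j)},$$

where $P(i, j) = |U_i \cap V_j| / N$ represents the probability that an object picked at random falls into both classes $U_i$ and $V_j$. The two label assignments $U$ and $V$ have the corresponding entropies $H(U)$ and $H(V)$, defined as follows:

$$H(U) = -\sum_{i=1}^{|U|} P(i) \log P(i),$$

where $P(i) = |U_i| / N$ is the probability that an object picked at random from $U$ falls into class $U_i$. The entropy $H(V)$ is defined in the same way with $P'(j) = |V_j| / N$.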

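The NMI and the Adjusted Mutual Information (AMI) [25] are defined as

$$\mathrm{NMI}(U, V) = \frac{\mathrm{MI}(U, V)}{\operatorname{mean}\big(H(U), H(V)\big)}, \qquad \mathrm{AMI}(U, V) = \frac{\mathrm{MI}(U, V) - E[\mathrm{MI}]}{\operatorname{mean}\big(H(U), H(V)\big) - E[\mathrm{MI}]},$$

where $E[\mathrm{MI}]$ is the expected value of MI and $\operatorname{mean}(\cdot, \cdot)$ denotes the chosen average of the two entropies (for example, their geometric mean $\sqrt{H(U) H(V)}$). NMI lies in $[0, 1]$; AMI is at most 1 and is close to 0 for random label assignments.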
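The definitions above can be checked on a small example. The following is a minimal sketch, not the implementation used in this study: it assumes the geometric-mean normalization for NMI, uses hypothetical toy labels, and omits AMI because the expected value of MI requires a longer combinatorial computation.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy H = -sum_i P(i) log P(i) of one label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(u, v):
    """MI(U, V) = sum_ij P(i, j) log(P(i, j) / (P(i) P'(j)))."""
    n = len(u)
    count_u, count_v = Counter(u), Counter(v)
    count_uv = Counter(zip(u, v))
    return sum(
        (c / n) * math.log((c / n) / ((count_u[i] / n) * (count_v[j] / n)))
        for (i, j), c in count_uv.items()
    )


def nmi(u, v):
    """NMI normalized by the geometric mean of the entropies (assumed choice)."""
    h_u, h_v = entropy(u), entropy(v)
    if h_u == 0.0 or h_v == 0.0:
        # Degenerate case: at least one assignment has a single label.
        return 1.0 if h_u == h_v else 0.0
    return mutual_information(u, v) / math.sqrt(h_u * h_v)


# Hypothetical toy labels: a reference division and a clustering result.
reference = ["marine", "marine", "fluvial", "fluvial", "lacustrine", "lacustrine"]
clusters = [0, 0, 1, 1, 2, 2]
score = nmi(reference, clusters)
```

Because the toy clustering reproduces the reference division exactly, the score is 1 up to rounding; two statistically independent assignments give 0.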

2.2. Completeness

Taking the results of previous studies on the known grain-size samples as the standard partition, conditional entropy analysis is used to define some intuitive measures. A clustering is complete if all nodes of a given class are assigned to the same cluster [26, 27]. The completeness is formally given by

$$c = 1 - \frac{H(K \mid C)}{H(K)},$$

where $H(K)$ is the entropy of the cluster assignments and $H(K \mid C)$ is the conditional entropy of the cluster assignments given the classes.
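As an illustration, completeness can be computed directly from label counts. This is a minimal sketch under the standard definition $c = 1 - H(K \mid C)/H(K)$, with $C$ the reference classes and $K$ the cluster assignments; the function name and the toy labels are hypothetical.

```python
import math
from collections import Counter


def completeness(classes, clusters):
    """Completeness c = 1 - H(K|C) / H(K) for reference classes C and clusters K."""
    n = len(classes)
    n_class = Counter(classes)
    n_cluster = Counter(clusters)
    n_joint = Counter(zip(classes, clusters))

    # Entropy of the cluster assignments, H(K).
    h_k = -sum((m / n) * math.log(m / n) for m in n_cluster.values())
    if h_k == 0.0:
        return 1.0  # a single cluster is trivially complete

    # Conditional entropy of the clusters given the classes, H(K|C).
    h_k_given_c = -sum(
        (m / n) * math.log(m / n_class[c]) for (c, _k), m in n_joint.items()
    )
    return 1.0 - h_k_given_c / h_k


# Every class stays inside one cluster: complete.
full = completeness(["a", "a", "b", "b"], [0, 0, 1, 1])
# One class is split evenly over two clusters: not complete at all.
split = completeness(["a", "a", "a", "a"], [0, 0, 1, 1])
```

Splitting a class over several clusters lowers the score, while merging several classes into one cluster does not, which is why completeness is usually read together with other indices.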

2.3. Precision

Precision [28] ($P$) is the number of true positives ($T_p$) over the number of true positives plus the number of false positives ($F_p$). The precision is given by

$$P = \frac{T_p}{T_p + F_p}.$$
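For a single category, the counts of true and false positives follow from comparing predicted labels with reference labels. The sketch below is only an illustration with hypothetical labels; how the counts are aggregated over the three categories in this study is not specified here and is left open.

```python
def precision_for_class(reference, predicted, target):
    """Precision P = Tp / (Tp + Fp) for one target category."""
    pairs = list(zip(reference, predicted))
    true_pos = sum(1 for r, p in pairs if p == target and r == target)
    false_pos = sum(1 for r, p in pairs if p == target and r != target)
    if true_pos + false_pos == 0:
        raise ValueError("no sample was predicted as the target category")
    return true_pos / (true_pos + false_pos)


# Hypothetical reference and predicted categories for four samples.
reference = ["marine", "marine", "fluvial", "lacustrine"]
predicted = ["marine", "marine", "marine", "lacustrine"]
p_marine = precision_for_class(reference, predicted, "marine")
p_lacustrine = precision_for_class(reference, predicted, "lacustrine")
```

Here two of the three samples predicted as marine are marine in the reference, so the marine precision is 2/3, while the lacustrine precision is 1.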

3. Dataset of Sediment/Grain-Size

The sediment samples of this study came from borehole Lz908 (37°09′N, 118°58′E), which is located in the southern Bohai Sea, China (Figure 1). The borehole was drilled to 101.3 m below the surface in 2007, and the recovery rate reached 75%. Existing research shows that three transgressive layers have developed in this region since the late Pleistocene and that the thickness of the fluvial, lacustrine, and marine sediments in this basin reaches 2000–3000 m [16]. We extracted 2141 sediment samples for grain-size analysis from the borehole at 2 cm intervals. We measured the grain size after a thorough pretreatment at the First Institute of Oceanography, State Oceanic Administration, China. The measuring instrument was a Mastersizer 2000 laser particle analyzer produced by Malvern (UK); the measurement range was 0.02–2000 μm, and the repeated measuring error was less than 3%.

For every sediment sample, we calculate values over 51 Phi (φ) intervals (Table 1), each of which represents a grain-size class of a given magnitude. The data give the percentage that each grain-size class contributes to the total of the sample. Consequently, we construct a dataset as a 2141 × 51 matrix $X = (x_{ij})$, where $x_{ij}$ denotes the percentage composition of the $j$th grain size in the $i$th sample (Table 1).
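For readers unfamiliar with the Phi scale, the conversion between grain diameter and φ can be sketched as follows. The sketch assumes the standard Krumbein definition, φ = −log2(d / 1 mm); the function names are hypothetical, and the exact binning used for Table 1 is not reproduced.

```python
import math


def microns_to_phi(diameter_um):
    """Krumbein phi value of a grain diameter given in micrometres."""
    if diameter_um <= 0:
        raise ValueError("grain diameter must be positive")
    return -math.log2(diameter_um / 1000.0)  # reference diameter is 1 mm


def phi_to_microns(phi):
    """Inverse conversion: grain diameter in micrometres for a phi value."""
    return 1000.0 * 2.0 ** (-phi)


# Upper limit of the instrument range quoted above (2000 micrometres).
coarsest_phi = microns_to_phi(2000.0)
# Diameter that corresponds to a phi value of 7.00.
diameter_at_phi_7 = phi_to_microns(7.0)
```

Larger φ values correspond to finer grains, so a φ of 7.00 lies in the fine-silt range of a few micrometres.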

4. Construction of Sample/Grain-Size Bipartite Network

In this paper, we construct the Sample/Grain-Size network based on bipartite graph theory [18], in which a graph is denoted as $G = (V, E)$, where $V$ is the node set and $E$ represents the edge set. In bipartite graph theory, $V$ is divided into two disjoint subsets $(A, B)$, where $A$ is one class of nodes and $B$ represents the other class; each edge in $E$ denotes an association relationship between a node in the set $A$ and a node in the set $B$.

According to this theory, the construction process of Sample/Grain-Size bipartite weighted network model is as follows.

In this process, one class, $A$, is the sample nodes and the other class, $B$, denotes the grain-size nodes. As shown in Figure 2, the sample node numbered Lz04-076 includes several grain-size nodes, for example the one with the magnitude 7.25–7.00. If a sample includes a grain size, an edge exists between the sample node and the corresponding grain-size node. The weight of the edge denotes how much of that grain size the sample contains. Based on this rule, we construct the final Sample/Grain-Size bipartite weighted network model (Figure 3).

In this bipartite network, the grain-size nodes are colored green and correspond to the 51 grain-size classes of different magnitudes; the sample nodes are colored pink and correspond to the 2141 samples. This model clearly reflects the association relationships between the sample nodes and the grain-size nodes.

We construct a Sample network model by projecting the Sample/Grain-Size bipartite network model onto the sample nodes (Figure 4). The Sample network model has 2141 nodes and 44,198 edges; a node denotes a sample, and an edge indicates that two samples contain a grain size of the same magnitude. The weight of an edge shows how often the two samples share a grain size.
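The projection step can be illustrated on a toy bipartite network. The following is a minimal sketch with hypothetical sample names and percentages; it assumes that two samples are linked when they share at least one grain-size class and that the edge weight counts the shared classes, without any further thresholding that the full model may apply.

```python
from itertools import combinations


def project_samples(bipartite):
    """Project a sample -> {grain-size class: weight} mapping onto the samples.

    Returns {(sample_a, sample_b): number of shared grain-size classes}.
    """
    projected = {}
    for a, b in combinations(sorted(bipartite), 2):
        shared = set(bipartite[a]) & set(bipartite[b])
        if shared:
            projected[(a, b)] = len(shared)
    return projected


# Hypothetical bipartite network: percentages of grain-size classes per sample.
bipartite = {
    "S1": {"7.25-7.00": 12.0, "7.00-6.75": 8.0},
    "S2": {"7.00-6.75": 5.0, "6.75-6.50": 9.0},
    "S3": {"4.00-3.75": 20.0},
}
sample_network = project_samples(bipartite)
```

In this toy network only S1 and S2 share a grain-size class, so the projected Sample network contains a single edge of weight 1 and S3 remains isolated.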

5. Sediment Grain-Size Sample Analysis Based on Mini Batch K-Means

5.1. Idea of Sediment Grain-Size Data Analysis

In this paper, we cluster the Sample network model with the Mini Batch K-means algorithm. In every iteration, we randomly extract a mini batch of subsamples from the full sample set and update the cluster center nearest to each mini batch sample by a convex combination of the center and the sample. At the same time, we use per-center learning rates to speed up convergence. As the iterations proceed, we consider the algorithm converged when the clustering result does not change in successive iterations. In the end, the sample nodes are divided into several clusters.

5.2. Steps of Sediment Grain-Size Sample Data Analysis

Step 1. Randomly extract $b$ mini batch subsamples $M$ from the sediment sample dataset $X$ with 2141 samples and 51 properties.

Step 2. Randomly select $k$ samples as the initial clustering centers; save them into an array $C$ that stores the clustering centers, which will change as the algorithm runs.

Step 3. Select a sample $x$ from $M$; find the clustering central sample node nearest to $x$ by using the Euclidean distance; save the result in an array $d$. The Euclidean distance is as follows:

$$f(C, x) = \min_{c \in C} \sqrt{\sum_{j=1}^{51} (x_j - c_j)^2},$$

where $f(C, x)$ indicates the nearest Euclidean distance between the sample $x$ and the central nodes in $C$, and $d[x]$ stores the center that attains it. The $j$th property in the sample $x$ is $x_j$.

Step 4. Acquire the sample $x$ and its center $c = d[x]$; update the per-center counter $v[c]$:

$$v[c] \leftarrow v[c] + 1.$$

Step 5. Get the real-time per-center learning rate $\eta$, which speeds up the convergence of this algorithm:

$$\eta = \frac{1}{v[c]}.$$

Step 6. Take the gradient step:

$$c \leftarrow (1 - \eta)\, c + \eta\, x.$$

Step 7. If every sample in $M$ has been processed, all the samples of the mini batch have been assigned to a cluster; otherwise, return to Step 4.

Step 8. If the iteration time ≤ $t$, return to Step 1. The algorithm stops when the convergence condition is satisfied or the iteration time > $t$.
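The steps above can be sketched in a few lines of Python. This is an illustrative sketch rather than the code used for the experiments: it runs for a fixed number of iterations instead of testing convergence, it has no reassignment ratio, and the function names and the toy two-dimensional data are hypothetical.

```python
import random


def nearest_center(centers, x):
    """Index of the center with the smallest Euclidean distance to x."""
    return min(
        range(len(centers)),
        key=lambda j: sum((xi - ci) ** 2 for xi, ci in zip(x, centers[j])),
    )


def mini_batch_kmeans(data, k, batch_size, iterations, seed=0):
    rng = random.Random(seed)
    centers = [list(x) for x in rng.sample(data, k)]  # Step 2: initial centers
    counts = [0] * k  # per-center counters v
    for _ in range(iterations):  # Step 8: repeat up to t times
        batch = rng.sample(data, batch_size)  # Step 1: draw the mini batch
        cached = [nearest_center(centers, x) for x in batch]  # Step 3
        for x, j in zip(batch, cached):  # Steps 4-7
            counts[j] += 1  # Step 4: update the per-center counter
            eta = 1.0 / counts[j]  # Step 5: per-center learning rate
            centers[j] = [  # Step 6: convex combination of center and sample
                (1.0 - eta) * c + eta * xi for c, xi in zip(centers[j], x)
            ]
    labels = [nearest_center(centers, x) for x in data]
    return centers, labels


# Hypothetical toy data: three groups of ten two-dimensional points.
data = (
    [(0.1 * i, 0.0) for i in range(10)]
    + [(10.0 + 0.1 * i, 10.0) for i in range(10)]
    + [(20.0 + 0.1 * i, 0.0) for i in range(10)]
)
centers, labels = mini_batch_kmeans(data, k=3, batch_size=10, iterations=20)
```

Because every update is a convex combination, each center always stays inside the range spanned by the data. With random initialization, the quality of the final division still depends on the seed and on the mini batch size.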

Algorithms 1 and 2 show the pseudocode of Mini Batch K-means algorithm for sediment sample data processing.

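Algorithm 1: Mini Batch K-Means
Input: the dataset of grain size is X; the number of initial clusters k is 3; the iteration times is t;
   the mini batch is b.
Output: the set of clustering centers is C; the cluster label of every sample x is d[x].
Initialize each c ∈ C with an x picked randomly from X.
v ← 0;
for i = 1 to t do
   M ← b examples picked randomly from X;   //randomly extract mini batch sub-samples from X
   for x ∈ M do
     d[x] ← f(C, x);    //calculate and store the clustering central sample nearest to x
   end for
   for x ∈ M do
     c ← d[x];          //acquire the central sample
     v[c] ← v[c] + 1;   //update the per-center counter
     η ← 1/v[c];        //get the real-time per-center learning rate
     c ← (1 − η)c + ηx; //take gradient step
   end for
end for

Algorithm 2: ε-L1, an ε-Accurate Projection to L1 Ball
Input: ε tolerance, L1-ball radius λ, vector c ∈ R^m
if ‖c‖_1 ≤ λ then exit
upper ← ‖c‖_∞; lower ← 0; current ← ‖c‖_1
while current > λ(1 + ε) or current < λ do
   θ ← (upper + lower)/2
   current ← Σ_{c_i ≠ 0} max(0, |c_i| − θ)
   if current ≤ λ then upper ← θ else lower ← θ
end while
for i = 1 to m do
   c_i ← sign(c_i) · max(0, |c_i| − θ)
end for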
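The projection onto the L1 ball can likewise be sketched in Python. The sketch follows the bisection scheme of the ε-accurate L1 projection from the Mini Batch K-means literature; the function name and the default tolerance are assumptions, and the example vector is hypothetical.

```python
import math


def project_to_l1_ball(c, radius, tolerance=0.01):
    """Shrink vector c until its L1 norm lies in [radius, radius * (1 + tolerance)]."""
    if radius <= 0 or tolerance <= 0:
        raise ValueError("radius and tolerance must be positive")
    current = sum(abs(ci) for ci in c)
    if current <= radius:
        return list(c)  # already inside the ball: nothing to do

    upper = max(abs(ci) for ci in c)
    lower = 0.0
    theta = 0.0
    # Bisection on the soft-threshold theta.
    while current > radius * (1.0 + tolerance) or current < radius:
        theta = (upper + lower) / 2.0
        current = sum(max(0.0, abs(ci) - theta) for ci in c if ci != 0)
        if current <= radius:
            upper = theta
        else:
            lower = theta

    # Soft-threshold every component while keeping its sign.
    return [math.copysign(max(0.0, abs(ci) - theta), ci) for ci in c]


projected = project_to_l1_ball([3.0, -1.0, 0.0], radius=2.0)
projected_norm = sum(abs(ci) for ci in projected)
```

The soft-thresholding step drives small components to zero, which is what makes the resulting cluster centers sparse.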

6. Simulations and Analysis

6.1. Multi-Index Analysis of Clustering Results

In this paper, we use the four indices AMI, NMI, completeness, and precision to verify the clustering result of the sediment samples. We tune the two classical parameters of the Mini Batch K-means algorithm: mini batch and reassignment ratio. After 400 tests, we obtain the results in Table 2. The maximum value of each evaluation index is marked in bold: 0.40919072 (AMI), 0.41485376 (NMI), 0.44747697 (completeness), and 0.92254367 (precision).

6.2. Heatmap Analysis of Clustering Results

The following heatmaps reflect the accuracy and efficiency of the clustering division of the sediment sample data calculated by the Mini Batch K-means algorithm. In each figure, every square represents the score of one evaluation index for a given combination of mini batch and reassignment ratio. The color bar on the far right maps the colors to scores over the score range of the index, and the shade of every square represents the size of its value.

As shown in Figures 5 and 6, the AMI reaches its maximum value, 0.40919072, with the most appropriate reassignment ratio and mini batch = 25. The maximum value of NMI, 0.41485376, is obtained under the same parameters.

Based on Figures 7 and 8, the completeness and precision reach their maximum values, 0.44747697 and 0.92254367, respectively, with the most appropriate reassignment ratio and mini batch = 25.

Through clustering analysis, we can assign these samples to their actual sediment clusters. Among the four performance evaluation indices, precision is the most significant. The simulation results above show that the clustering result of the sediment grain-size samples calculated by the Mini Batch K-means algorithm with appropriate parameters (the most appropriate reassignment ratio and mini batch = 25) has a high precision: 0.92254367. The other three indices also reach their maximum values: AMI = 0.40919072, NMI = 0.41485376, and completeness = 0.44747697.

6.3. Network Characteristic Analysis of Clustering Results and Comparison with Other Studies

We calculate the clustering result of the sediment grain-size samples by using the Mini Batch K-means algorithm with the most appropriate parameters (the most appropriate reassignment ratio and mini batch = 25). The simulation results are shown in Table 3 and Figure 9.

According to Table 3 and Figure 9, the Mini Batch K-means algorithm divides the Sample network model into three clusters. Yi et al. divided the sedimentary environment of Lz908 using a variety of indices in representative studies [12, 16]. Compared with their results, the three clusters correspond to three sedimentary environments: marine, fluvial, and lacustrine. The samples in the green cluster can be assigned to the marine sediment category, those in the orange cluster to the fluvial category, and those in the blue cluster to the lacustrine category. The clustering division of this network achieves a high precision, 0.92254367, when the Mini Batch K-means algorithm is run with the most appropriate reassignment ratio and mini batch = 25. Furthermore, we find that most of the points that differ from previous studies are located at the junctions of different sediment types (Figure 10). These results show that this method of analyzing the sedimentary environment by grain size is effective and accurate.
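Comparing cluster labels with a reference division requires matching each cluster to a sedimentary category. The following is a minimal sketch of one common way to do this, a majority vote within each cluster; it is an assumption made for illustration, since the matching procedure is not spelled out here, and the labels are hypothetical.

```python
from collections import Counter, defaultdict


def map_clusters_to_classes(reference, clusters):
    """Assign to every cluster the reference category most frequent within it."""
    members = defaultdict(Counter)
    for ref, k in zip(reference, clusters):
        members[k][ref] += 1
    return {k: counts.most_common(1)[0][0] for k, counts in members.items()}


def agreement(reference, clusters):
    """Fraction of samples whose mapped cluster category equals the reference."""
    mapping = map_clusters_to_classes(reference, clusters)
    hits = sum(1 for ref, k in zip(reference, clusters) if mapping[k] == ref)
    return hits / len(reference)


# Hypothetical reference categories and cluster labels for eight samples.
reference = ["marine"] * 3 + ["fluvial"] * 2 + ["lacustrine"] * 3
clusters = [0, 0, 1, 1, 1, 2, 2, 2]
mapping = map_clusters_to_classes(reference, clusters)
share = agreement(reference, clusters)
```

In this toy example one marine sample falls into the cluster dominated by fluvial samples, so seven of the eight samples agree with the reference division.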

7. Conclusions

During the last several decades, researchers have made significant advances in the environmental interpretation of grain-size analyses, but definite criteria for environmental determination have not been given. Previous studies often relied heavily on the subjective experience of the researcher, usually combined grain-size analysis with other methods, and rarely used grain size alone for sedimentary environment analysis. Recently, complex networks have played an increasingly significant role in data mining and knowledge discovery because they can reveal the potential relationships and concealed information between things. In this paper, we use complex networks and bipartite graph theory to construct a Sample/Grain-Size network model and a Sample network model. Furthermore, we use the Mini Batch K-means algorithm to cluster the sediment grain-size samples. We use the representative evaluation indices AMI, NMI, completeness, and precision to verify the clustering results for the sample division. Simulation results show that this algorithm divides the Sample network into three clusters (marine, fluvial, and lacustrine), a division almost identical to that in the classical studies. At the same time, the evaluation indices reach their highest values when we use the most appropriate reassignment ratio and mini batch = 25. The results also indicate that the clustering is efficient; for example, the proportion of samples that receive the same classification as with the traditional method reaches 0.92254367, an excellent result obtained in a relatively convenient and inexpensive way.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

We acknowledge the funding support from the China National Key Research Project (2016YFC0402801) and the National Natural Science Foundation (41406072).