Special Issue: Mechanism, Cause, and Control of Water, Solutes, and Gas Migration Triggered by Mining Activities
Sedimentary Environment Analysis by Grain-Size Data Based on Mini Batch K-Means Algorithm
During the last several decades, researchers have made significant advances in interpreting sedimentary environments from grain-size analysis, but these advances have often depended on the subjective experience of the researcher and were usually combined with other methods. More recently, researchers have applied a growing number of data mining and knowledge discovery methods to explore the potential relationships in sediment grain-size data. In this paper, we apply bipartite graph theory to construct a Sample/Grain-Size network model and then construct a Sample network model projected from this bipartite network. We then use the Mini Batch K-means algorithm with the most appropriate parameters (the selected reassignment ratio and mini batch = 25) to cluster the sediment samples, and we use four representative evaluation indices to verify the clustering result. Simulation results demonstrate that this algorithm can divide the Sample network into three sedimentary clusters: marine, fluvial, and lacustrine. Measured against the results of previous studies, which were obtained from a variety of indices, the precision of the grain-size classification reaches 0.92254367, which shows that this method of analyzing the sedimentary environment by grain size is effective and accurate.
1. Introduction
Data mining, knowledge discovery, and machine learning algorithms have permeated research in many fields [1–4]. The complex network, as a significant method of data mining, focuses on discovering concealed information between things. Therefore, a great number of researchers from various fields, including mathematics, physics, biology, chemistry, and oceanology, have used complex networks to explore the potential relationships in data [5–9]. Complex networks have several characteristics: self-similarity, self-organization, scale-free degree distribution, the small-world property, community structure (clusters), and node centrality. The community structure is one of the most important traits because it can objectively reflect the potential relationships between nodes. A community is a group of nodes that are densely connected to each other but only sparsely connected to the nodes of other clusters [10, 11].
Grain-size analysis is one of the basic tools for classifying sedimentary environments and can provide important clues to provenance, transport history, and depositional conditions. In general, the representative statistical parameters of grain-size analysis are the median, mode, mean, separation parameter (sorting), skewness, and kurtosis. During the last few decades, two methods of computing grain-size parameters have been developed: the graphical method and the moment method. Blott and Pye (2011) showed that both methods have advantages and disadvantages when computing the parameters of sediment grain-size samples. As most sediments are polymodal, curve shape and statistical measures usually reflect only the relative magnitude and separation of the populations. A polymodal grain-size spectrum can be considered the result of the superposition of several unimodal components. Many works have shown that different grain-size distributions are related to particular transport and deposition processes. Three kinds of functions are commonly used to fit the grain-size distribution: the Normal, Lognormal, and Weibull functions. Based on experimental results, Sun et al. [15] found that the Weibull function was appropriate for the mathematical description of the grain-size distribution of all kinds of sediments, while the Normal function was also acceptable for fluvial and lacustrine sediments. Although these methods, especially the Weibull function, perform well in fitting sediment grain-size distributions, they often rely on the subjective experience of the researcher, and definite criteria for environmental determination have not been given. Based on the data of borehole Lz908, Yi et al. analyzed the evolution of the sedimentary environment. Besides grain-size data, they also used magnetic susceptibility, tree pollen, radiocarbon dating, and optically stimulated luminescence (OSL) dating [16, 17].
Can the same conclusion be obtained by using only grain-size data, which are relatively convenient and inexpensive to measure?
In this paper, we introduce complex networks into the modeling of sediment grain-size data. Based on the theory of bipartite graphs [18], we construct the Sample/Grain-Size bipartite weighted network model, which can objectively reflect the association relationships between sediment samples and grain sizes. By projection, we construct the Sample network model from the bipartite network. After repeated tests with tens of representative clustering algorithms, we selected the Mini Batch K-means algorithm [19], an optimization algorithm that combines the K-means algorithm [20] and the classical batch algorithm [21], to split the Sample nodes into their categories and find the relationships between the sedimentary environment and grain size. After 400 tests, we find the most appropriate parameters for the Mini Batch K-means algorithm. Finally, we use four evaluation indices (AMI, NMI, completeness, and precision) to verify the accuracy and efficiency of the clustering divisions.
2. Evaluation Functions
In the research field of complex networks, researchers often use several representative performance evaluation indices (AMI, NMI, completeness, and precision) to verify the accuracy and efficiency of clustering divisions. For each of these indices, a higher value indicates a better clustering division. Therefore, we also use these four evaluation indices to verify the clustering result of the sediment grain-size samples.
2.1. NMI and AMI
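Normalized Mutual Information (NMI) [22, 23] measures the information shared between two label assignments by means of information theory, in which entropy is defined as the information contained in a distribution. For two label assignments $U$ and $V$ of the same $N$ objects, the mutual information (MI) is

$$\mathrm{MI}(U, V) = \sum_{i=1}^{|U|} \sum_{j=1}^{|V|} P(i, j) \log \frac{P(i, j)}{P(i)\,P'(j)},$$

where $P(i, j) = |U_i \cap V_j| / N$ represents the probability that an object picked at random falls into both classes $U_i$ and $V_j$. The two label assignments $U$ and $V$ have the corresponding entropies $H(U)$ and $H(V)$, defined as follows:

$$H(U) = -\sum_{i=1}^{|U|} P(i) \log P(i),$$

where $P(i) = |U_i| / N$ is the probability that an object picked at random from $U$ falls into class $U_i$. $H(V)$ is defined in the same way, with $P'(j) = |V_j| / N$.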
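The NMI and the Adjusted Mutual Information (AMI) are defined as

$$\mathrm{NMI}(U, V) = \frac{\mathrm{MI}(U, V)}{\operatorname{mean}\big(H(U), H(V)\big)}, \qquad \mathrm{AMI}(U, V) = \frac{\mathrm{MI}(U, V) - E[\mathrm{MI}(U, V)]}{\operatorname{mean}\big(H(U), H(V)\big) - E[\mathrm{MI}(U, V)]},$$

where $E[\mathrm{MI}(U, V)]$ is the expected value of MI for random label assignments with the same class sizes, and $\operatorname{mean}(\cdot, \cdot)$ is a generalized mean of the two entropies (the arithmetic mean $(H(U) + H(V))/2$ is a common choice). NMI lies in $[0, 1]$; AMI is at most 1 and is close to 0, and possibly negative, for assignments that agree no better than chance.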
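As a rough illustration of these definitions, the following standard-library sketch computes MI, NMI, and AMI from two lists of labels. It assumes the arithmetic mean of the two entropies as the normalizer and the usual permutation (hypergeometric) model for the expected MI; the original study does not state which variant was used, and all function names here are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """H(U) = -sum_i P(i) log P(i), with P(i) estimated from label counts."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(u, v):
    """MI(U, V) = sum_ij P(i, j) log(P(i, j) / (P(i) P'(j)))."""
    n = len(u)
    size_u, size_v = Counter(u), Counter(v)
    joint = Counter(zip(u, v))
    return sum((nij / n) * math.log(nij * n / (size_u[i] * size_v[j]))
               for (i, j), nij in joint.items())


def expected_mi(u, v):
    """E[MI] under random assignments that keep the class sizes fixed."""
    n = len(u)
    lg = math.lgamma
    emi = 0.0
    for ai in Counter(u).values():
        for bj in Counter(v).values():
            for nij in range(max(1, ai + bj - n), min(ai, bj) + 1):
                # log of the hypergeometric probability of an overlap of nij
                log_p = (lg(ai + 1) + lg(bj + 1) + lg(n - ai + 1) + lg(n - bj + 1)
                         - lg(n + 1) - lg(nij + 1) - lg(ai - nij + 1)
                         - lg(bj - nij + 1) - lg(n - ai - bj + nij + 1))
                emi += (nij / n) * math.log(n * nij / (ai * bj)) * math.exp(log_p)
    return emi


def nmi(u, v):
    """NMI with the arithmetic mean of the entropies (an assumption)."""
    mean_h = (entropy(u) + entropy(v)) / 2
    return 1.0 if mean_h == 0 else mutual_information(u, v) / mean_h


def ami(u, v):
    """AMI with the same arithmetic-mean normalizer (an assumption)."""
    mean_h = (entropy(u) + entropy(v)) / 2
    emi = expected_mi(u, v)
    denom = mean_h - emi
    return 1.0 if denom == 0 else (mutual_information(u, v) - emi) / denom
```

For two identical partitions with permuted label names, such as `[0, 0, 1, 1]` and `[1, 1, 0, 0]`, both scores equal 1, while for the independent labelings `[0, 0, 1, 1]` and `[0, 1, 0, 1]` NMI is 0 and AMI is negative, which is why AMI is not bounded below by 0.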
Taking the partition of known grain-size samples from previous studies as the standard, conditional entropy analysis is used to define some intuitive measures. A clustering is complete if all nodes of a given class are assigned to the same cluster [26, 27]. The completeness is formally given by

$$c = 1 - \frac{H(K \mid C)}{H(K)},$$

where $C$ denotes the classes, $K$ denotes the cluster assignments, $H(K)$ is the entropy of the clusters, and $H(K \mid C)$ is the conditional entropy of the cluster assignments given the classes.
Precision ($P$) is the number of true positives ($T_p$) divided by the number of true positives plus the number of false positives ($F_p$). The precision is given by

$$P = \frac{T_p}{T_p + F_p}.$$
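The two measures can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: completeness follows the conditional-entropy definition of the V-measure paper [26], and precision is computed for one class at a time after the cluster labels have already been mapped to class labels (the text does not say how that mapping was done). Function names are hypothetical.

```python
import math
from collections import Counter


def completeness(classes, clusters):
    """c = 1 - H(K|C) / H(K); taken to be 1.0 when H(K) = 0."""
    n = len(classes)
    class_sizes = Counter(classes)
    cluster_sizes = Counter(clusters)
    joint = Counter(zip(classes, clusters))
    h_k = -sum((s / n) * math.log(s / n) for s in cluster_sizes.values())
    if h_k == 0:
        return 1.0
    # Conditional entropy of the cluster assignments given the classes.
    h_k_given_c = -sum((n_ck / n) * math.log(n_ck / class_sizes[c])
                       for (c, _k), n_ck in joint.items())
    return 1.0 - h_k_given_c / h_k


def precision(true_labels, predicted_labels, positive):
    """P = Tp / (Tp + Fp) for the class named by `positive`."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    return tp / (tp + fp) if tp + fp else 0.0
```

A clustering that places every sample in a single cluster is trivially complete, which is one reason completeness is usually read together with the other indices.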
3. Dataset of Sediment/Grain-Size
The sediment samples of this study came from borehole Lz908 (37°09′N, 118°58′E), which is located in the southern Bohai Sea, China (Figure 1). The borehole was drilled to 101.3 m below the surface in 2007, and the recovery rate reached 75%. Existing research results show that this region has developed three transgressive layers since the late Pleistocene and that the thickness of fluvial, lacustrine, and marine sediments reaches 2000–3000 m in this basin. We extracted 2141 grain-size samples from the borehole at 2 cm intervals. The grain size was measured after a thorough pretreatment at the First Institute of Oceanography, State Oceanic Administration, China. The measuring instrument was a Mastersizer 2000 laser particle analyzer produced by Malvern (UK); the measurement range was 0.02–2000 μm, and the repeated measuring error was less than 3%.
We calculate the Phi values of every sediment sample for 51 grain-size classes (Table 1), which represent the corresponding magnitudes of the various grain sizes. Each value gives the percentage that one grain-size class contributes to the whole sample. Consequently, we constructed a dataset in the form of a 2141 × 51 matrix $A = (a_{ij})$, where $a_{ij}$ denotes the percentage composition of the $j$th grain size in the $i$th sample (Table 1).
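For readers unfamiliar with the Phi scale, it is conventionally defined as $\varphi = -\log_2 d$, with $d$ the grain diameter in millimeters. The short sketch below assumes diameters given in micrometers (the unit of the instrument range quoted above) and shows how raw class amounts could be normalized into the row percentages of the matrix; the function names are illustrative and the study's own preprocessing may differ.

```python
import math


def diameter_um_to_phi(d_um):
    """Phi scale: phi = -log2(diameter in millimeters)."""
    return -math.log2(d_um / 1000.0)


def to_percentages(amounts):
    """Scale the raw amounts of one sample so that they sum to 100."""
    total = sum(amounts)
    return [100.0 * a / total for a in amounts]
```

Under this definition, coarser grains have smaller Phi values, so a 62.5 μm grain corresponds to Phi 4 and a 1000 μm grain to Phi 0.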
4. Construction of Sample/Grain-Size Bipartite Network
In this paper, we construct the Sample/Grain-Size network based on bipartite graph theory [18], in which the graph is denoted as $G = (V, E)$, where $V$ is the node set and $E$ represents the edge set. In bipartite graph theory, $V$ is divided into two disjoint subsets, $V = X \cup Y$ with $X \cap Y = \emptyset$, where $X$ is one class of nodes and $Y$ represents the other class; $E$ denotes the association relationships between a node in the set $X$ and a node in the set $Y$.
According to this theory, the construction process of Sample/Grain-Size bipartite weighted network model is as follows.
In this process, one class, $X$, is the sample nodes, and the other class, $Y$, denotes the grain-size nodes. As shown in Figure 2, the sample node numbered Lz04-076 includes several grain-size nodes with magnitudes of 7.25–7.00. If a sample node includes a grain-size node, an edge exists between this sample node and the corresponding grain-size node. The weight of the edge denotes the amount of that grain size included in the sample. Based on this rule, we construct the final Sample/Grain-Size bipartite weighted network model (Figure 3).
In this bipartite network, the grain-size nodes are colored green and correspond to the 51 grain-size classes with different magnitudes; the sample nodes are colored pink and correspond to the 2141 samples. This model clearly reflects the association relationships between the sample nodes and the grain-size nodes.
We construct a Sample network model projected from the Sample/Grain-Size bipartite network model (Figure 4). The Sample network model has 2141 nodes and 44,198 edges; a node denotes a sample, and an edge indicates that two samples contain a grain size with the same magnitude. The weight of an edge gives how often the two samples share the same grain size.
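The construction and projection described above can be sketched in plain Python. The sketch assumes that an edge exists wherever a sample has a non-zero percentage in a grain-size class and that the projected weight counts the grain-size classes two samples share; the function names and the toy matrix are illustrative and are not taken from the original study.

```python
from itertools import combinations


def build_bipartite(matrix):
    """Return {(sample_index, size_index): weight} for every non-zero entry."""
    return {(i, j): w
            for i, row in enumerate(matrix)
            for j, w in enumerate(row)
            if w > 0}


def project_onto_samples(edges):
    """Link two samples when they share at least one grain-size class.

    The weight of a projected edge is the number of shared classes.
    """
    neighbours = {}
    for (sample, size) in edges:
        neighbours.setdefault(sample, set()).add(size)
    projected = {}
    for a, b in combinations(sorted(neighbours), 2):
        shared = len(neighbours[a] & neighbours[b])
        if shared:
            projected[(a, b)] = shared
    return projected


# Toy data: 3 samples (rows) x 3 grain-size classes (columns), in percent.
matrix = [
    [50.0, 50.0, 0.0],
    [0.0, 40.0, 60.0],
    [100.0, 0.0, 0.0],
]
edges = build_bipartite(matrix)
sample_network = project_onto_samples(edges)
```

In the toy example, samples 0 and 1 share one class and samples 0 and 2 share one class, while samples 1 and 2 share none and are therefore not connected in the projected network.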
5. Sediment Grain-Size Sample Analysis Based on Mini Batch K-Means
5.1. Idea of Sediment Grain-Size Data Analysis
In this paper, we cluster the Sample network model with the Mini Batch K-means algorithm. In every iteration, we randomly extract a mini batch of subsamples from the full set of sediment samples and use each mini batch sample to update its nearest cluster center by a convex combination. At the same time, we use per-center learning rates to speed up convergence. As the iterations proceed, we check the convergence condition, which is met when the clustering result does not change in successive iterations. In the end, the sample nodes are divided into several clusters.
5.2. Steps of Sediment Grain-Size Sample Data Analysis
Step 1. Randomly extract $b$ mini batch subsamples $M$ from the sediment sample dataset $X$ with 2141 samples and 51 properties.
Step 2. Randomly select $k$ samples as the initial clustering centers; save them into an array $C$ storing the clustering centers, which will be changed as the algorithm runs.
Step 3. Select a sample $x$ from $M$; find the clustering center nearest to the sample by using the Euclidean distance; save the result in an array $d[x]$. The Euclidean distance is as follows:

$$f(C, x) = \min_{c \in C} \sqrt{\sum_{j=1}^{51} (x_j - c_j)^2},$$

where $f(C, x)$ indicates the nearest Euclidean distance between the sample $x$ and the central nodes in $C$, and $x_j$ is the $j$th property of the sample $x$.
Step 4. Acquire a sample $x$ and its cached center $c = d[x]$; update the per-center counter $v[c]$:

$$v[c] \leftarrow v[c] + 1.$$

Step 5. Get the real-time per-center learning rate $\eta$, which speeds up the convergence of this algorithm:

$$\eta = \frac{1}{v[c]}.$$

Step 6. Take the gradient step:

$$c \leftarrow (1 - \eta)\,c + \eta\,x.$$

Step 7. If every sample in $M$ has been processed, all the samples of the mini batch have been assigned to a cluster; otherwise, return to Step 4.
Step 8. If the iteration count is $\le t$, return to Step 1. The algorithm stops when the convergence condition is satisfied or the iteration count is $> t$.
Algorithms 1 and 2 show the pseudocode of Mini Batch K-means algorithm for sediment sample data processing.
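As a companion to the steps above, here is a minimal standard-library sketch of the mini-batch update loop. It is not the authors' Algorithms 1 and 2: it runs for a fixed number of iterations instead of testing convergence, it omits the reassignment-ratio mechanism, and the optional `init` argument and all names are hypothetical additions made for illustration.

```python
import random


def nearest(centers, x):
    """Index of the center closest to x (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda c: sum((xj - cj) ** 2 for xj, cj in zip(x, centers[c])))


def mini_batch_kmeans(X, k, batch_size, n_iter, seed=0, init=None):
    """Cluster the rows of X; returns (centers, labels)."""
    rng = random.Random(seed)
    # Step 2: initial centers are k random samples unless `init` is given.
    start = init if init is not None else rng.sample(X, k)
    centers = [list(c) for c in start]
    counts = [0] * k  # per-center counters v[c]
    for _ in range(n_iter):
        # Step 1: draw a random mini batch.
        batch = rng.sample(X, min(batch_size, len(X)))
        # Step 3: cache the nearest center of every sample in the batch.
        cached = [nearest(centers, x) for x in batch]
        for x, c in zip(batch, cached):
            counts[c] += 1          # Step 4: update the counter
            eta = 1.0 / counts[c]   # Step 5: per-center learning rate
            # Step 6: convex combination of the old center and the sample.
            centers[c] = [(1 - eta) * cj + eta * xj
                          for cj, xj in zip(centers[c], x)]
    labels = [nearest(centers, x) for x in X]
    return centers, labels
```

Because each update is a convex combination with rate $1/v[c]$, every center stays equal to the running mean of the samples that have been assigned to it so far, which is what lets the method work from small random batches.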
6. Simulations and Analysis
6.1. Multi-Index Analysis of Clustering Results
In this paper, we use the four indices AMI, NMI, completeness, and precision to verify the clustering result of the sediment samples. We tune two standard parameters of the Mini Batch K-means algorithm: the mini batch size and the reassignment ratio. After 400 repeated tests, we obtain the results in Table 2. The maximum value of each evaluation index is marked in bold: 0.40919072 (AMI), 0.41485376 (NMI), 0.44747697 (completeness), and 0.92254367 (precision).
6.2. Heatmap Analysis of Clustering Results
The following heatmaps show the accuracy of the clustering division of the sediment sample data computed by the Mini Batch K-means algorithm. In each figure, every square represents the score of one evaluation index for one combination of mini batch and reassignment ratio. The color bar on the right of each figure maps colors to scores, and every index has a maximum possible score of 1. The color gradation of each square represents the size of the value.
As shown in Figures 5 and 6, the AMI reaches its maximum value, 0.40919072, at the selected reassignment ratio and mini batch = 25. The maximum value of NMI, 0.41485376, is obtained under the same parameters.
Based on Figures 7 and 8, the completeness and precision reach their maximum values, 0.44747697 and 0.92254367, respectively, at the same reassignment ratio and mini batch = 25.
Through clustering analysis, we can assign these samples to their actual sediment clusters. Of the four performance evaluation indices, precision is the most significant. The simulation results above show that the clustering of the sediment grain-size samples computed by the Mini Batch K-means algorithm with appropriate parameters (the selected reassignment ratio and mini batch = 25) has a high precision of 0.92254367. The other three indices also reach their maximum values there: AMI = 0.40919072, NMI = 0.41485376, and completeness = 0.44747697.
6.3. Network Characteristic Analysis of Clustering Results and Comparison with Other Studies
We calculate the clustering result of the sediment grain-size samples by using the Mini Batch K-means algorithm with the most appropriate parameters (the selected reassignment ratio and mini batch = 25). The simulation results are given in Table 3 and Figure 9.
According to Table 3 and Figure 9, the Mini Batch K-means algorithm divides the Sample network model into three clusters. Yi et al. divided the sedimentary environment of Lz908 by means of a variety of indices in their representative papers [12, 16]. Compared with their results, the three clusters correspond to three sedimentary environments: marine, fluvial, and lacustrine. The samples in the green cluster can be assigned to the marine sediment category, those in the orange cluster to the fluvial category, and those in the blue cluster to the lacustrine category. The clustering division of this network achieves a high precision, 0.92254367, when the parameters of the Mini Batch K-means algorithm are set to the selected reassignment ratio and mini batch = 25. Furthermore, we find that most of the points that differ from previous studies are located at the junctions between different sediment types (Figure 10). These results show that this method of analyzing the sedimentary environment by using grain size is effective and accurate.
7. Conclusions
During the last several decades, researchers have made significant advances in the environmental interpretation of grain-size analyses, but definite criteria for environmental determination have not been given. Previous studies often relied heavily on the subjective experience of the researcher, usually combined grain-size analysis with other methods, and rarely used grain size alone for sedimentary environment analysis. Recently, complex networks have been playing an increasingly significant role in data mining and knowledge discovery because they can reveal the potential relationships and concealed information between things. In this paper, we use complex networks and bipartite graph theory to construct a Sample/Grain-Size network model and a Sample network model. Furthermore, we use the Mini Batch K-means algorithm to cluster the sediment grain-size samples. We use the representative evaluation indices AMI, NMI, completeness, and precision to verify the clustering results for the sample division. Simulation results show that this algorithm can divide the Sample network into three clusters (marine, fluvial, and lacustrine), a division that is almost identical to the one in the classical papers. The evaluation indices reach their highest values when we set the parameters to the selected reassignment ratio and mini batch = 25. The results also indicate that the clustering is effective: the proportion of samples that receive the same classification as with the traditional method is up to 0.92254367, an excellent result obtained in a relatively convenient and inexpensive way.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
We acknowledge the funding support from the China National Key Research Project (2016YFC0402801) and the National Natural Science Foundation (41406072).
References
[1] W. Wei, J. Cai, X. Hu et al., “A numerical study on fractal dimensions of current streamlines in two-dimensional and three-dimensional pore fractal models of porous media,” Fractals, vol. 23, no. 1, article 1540012, 2015.
[2] T. Cordier, P. Esling, F. Lejzerowicz et al., “Predicting the ecological quality status of marine environments from eDNA metabarcoding data using supervised machine learning,” Environmental Science & Technology, vol. 51, no. 16, pp. 9118–9126, 2017.
[3] Y. Jiang, Y. Gou, T. Zhang, K. Wang, and C. Hu, “A machine learning approach to Argo data analysis in a thermocline,” Sensors, vol. 17, no. 10, p. 2225, 2017.
[4] C. Zhou, K. Yin, Y. Cao et al., “Landslide susceptibility modeling applying machine learning methods: a case study from Longju in the Three Gorges Reservoir area, China,” Computers & Geosciences, vol. 112, pp. 23–37, 2018.
[5] J. Cai and B. Yu, “Prediction of maximum pore size of porous media based on fractal geometry,” Fractals, vol. 18, no. 4, pp. 417–423, 2010.
[6] P. Tahmasebi and A. Hezarkhani, “A hybrid neural networks-fuzzy logic-genetic algorithm for grade estimation,” Computers & Geosciences, vol. 42, pp. 18–27, 2012.
[7] M. Newman, “Community detection in networks: modularity optimization and maximum likelihood are equivalent,” 2016, http://arxiv.org/abs/1606.02319.
[8] J. Fu and J. Wu, “A deep stochastic model for detecting community in complex networks,” Journal of Statistical Physics, vol. 166, no. 2, pp. 230–243, 2017.
[9] F. Hu, Y. Zhu, Y. Shi, J. Cai, L. Chen, and S. Shen, “An algorithm Walktrap-SPM for detecting overlapping community structure,” International Journal of Modern Physics B, vol. 31, no. 15, article 1750121, 2017.
[10] M. E. J. Newman and M. Girvan, “Finding and evaluating community structure in networks,” Physical Review E, vol. 69, no. 2, article 026113, 2004.
[11] F. Hu, M. Wang, Y. Wang, Z. Hong, and Y. Zhu, “An algorithm J-SC of detecting communities in complex networks,” Physics Letters A, vol. 381, no. 42, pp. 3604–3612, 2017.
[12] L. Yi, C. Deng, X. Xu et al., “Paleo-megalake termination in the Quaternary: paleomagnetic and water-level evidence from south Bohai Sea, China,” Sedimentary Geology, vol. 319, pp. 1–12, 2015.
[13] G. M. Friedman, “Differences in size distributions of populations of particles among sands of various origins: addendum to IAS Presidential Address,” Sedimentology, vol. 26, no. 6, pp. 859–862, 1979.
[14] G. M. Ashley, “Interpretation of polymodal sediments,” The Journal of Geology, vol. 86, no. 4, pp. 411–421, 1978.
[15] D. Sun, J. Bloemendal, D. K. Rea et al., “Grain-size distribution function of polymodal sediments in hydraulic and aeolian environments, and numerical partitioning of the sedimentary components,” Sedimentary Geology, vol. 152, no. 3-4, pp. 263–277, 2002.
[16] L. Yi, H.-J. Yu, J. D. Ortiz et al., “Late Quaternary linkage of sedimentary records to three astronomical rhythms and the Asian monsoon, inferred from a coastal borehole in the south Bohai Sea, China,” Palaeogeography, Palaeoclimatology, Palaeoecology, vol. 329-330, pp. 101–117, 2012.
[17] L. Yi, H. Yu, J. D. Ortiz et al., “A reconstruction of late Pleistocene relative sea level in the south Bohai Sea, China, based on sediment grain-size analysis,” Sedimentary Geology, vol. 281, pp. 88–100, 2012.
[18] K. Fukuda and T. Matsui, “Finding all the perfect matchings in bipartite graphs,” Applied Mathematics Letters, vol. 7, no. 1, pp. 15–18, 1994.
[19] M. Capó, A. Pérez, and J. A. Lozano, “An efficient approximation to the K-means clustering for massive data,” Knowledge-Based Systems, vol. 117, pp. 56–69, 2017.
[20] J. A. Hartigan and M. A. Wong, “Algorithm AS 136: a k-means clustering algorithm,” Journal of the Royal Statistical Society Series C (Applied Statistics), vol. 28, no. 1, pp. 100–108, 1979.
[21] K. P. Papadaki and W. B. Powell, “An adaptive dynamic programming algorithm for a stochastic multiproduct batch dispatch problem,” Naval Research Logistics, vol. 50, no. 7, pp. 742–769, 2003.
[22] L. Ana and A. K. Jain, “Robust data clustering,” in 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings, Madison, WI, USA, June 2003.
[23] L. Danon, A. Diaz-Guilera, J. Duch, and A. Arenas, “Comparing community structure identification,” Journal of Statistical Mechanics: Theory and Experiment, vol. 2005, no. 9, article 09008, 2005.
[24] N. X. Vinh, J. Epps, and J. Bailey, “Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance,” Journal of Machine Learning Research, vol. 11, pp. 2837–2854, 2010.
[25] S. Romano, N. X. Vinh, J. Bailey, and K. Verspoor, “Adjusting for chance clustering comparison measures,” The Journal of Machine Learning Research, vol. 17, pp. 4635–4666, 2016.
[26] A. Rosenberg and J. Hirschberg, “V-measure: a conditional entropy-based external cluster evaluation measure,” in Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 410–420, Prague, 2007.
[27] T. Grinshpoun and A. Meisels, “Completeness and performance of the APO algorithm,” Journal of Artificial Intelligence Research, vol. 33, pp. 223–258, 2008.
[28] A. Biswas and B. Biswas, “Investigating community structure in perspective of ego network,” Expert Systems with Applications, vol. 42, no. 20, pp. 6913–6934, 2015.