Special Issue: Mechanism, Cause, and Control of Water, Solutes, and Gas Migration Triggered by Mining Activities
Research Article | Open Access
Qiao Su, Yanhui Zhu, Yalin Jia, Ping Li, Fang Hu, Xingyong Xu, "Sedimentary Environment Analysis by Grain-Size Data Based on Mini Batch K-Means Algorithm", Geofluids, vol. 2018, Article ID 8519695, 11 pages, 2018. https://doi.org/10.1155/2018/8519695
Sedimentary Environment Analysis by Grain-Size Data Based on Mini Batch K-Means Algorithm
Abstract
Over the last several decades, researchers have made significant advances in interpreting sedimentary environments from grain-size analysis, but these improvements have often depended on the subjective experience of the researcher and were usually combined with other methods. More recently, researchers have applied a growing number of data mining and knowledge discovery methods to explore potential relationships in sediment grain-size data. In this paper, we apply bipartite graph theory to construct a Sample/Grain-Size network model and then construct a Sample network model projected from this bipartite network. We then use the Mini Batch K-means algorithm with the most appropriate parameters (the selected reassignment ratio and mini batch = 25) to cluster the sediment samples, and we use four representative evaluation indices to verify the clustering result. Simulation results demonstrate that this algorithm divides the Sample network into three sedimentary clusters: marine, fluvial, and lacustrine. Measured against the results of previous studies obtained from a variety of indices, the precision of the grain-size classification reaches 0.92254367, which shows that this method of analyzing sedimentary environment by grain size is effective and accurate.
1. Introduction
Data mining, knowledge discovery, and machine learning algorithms have permeated research in many fields [1–4]. The complex network, an important data mining method, focuses on discovering concealed information between entities. Consequently, many researchers in fields including mathematics, physics, biology, chemistry, and oceanology have used complex networks to explore potential relationships in data [5–9]. Complex networks have several characteristic properties: self-similarity, self-organization, the scale-free property, the small-world property, community structure (clusters), and node centrality. Community structure is one of the most important because it objectively reflects the potential relationships between nodes. A community is a group of nodes that are densely connected to each other but only sparsely connected to nodes in other communities [10, 11].
Grain-size analysis is one of the basic tools for classifying sedimentary environments, and it can provide important clues to provenance, transport history, and depositional conditions [12]. In general, the representative statistical parameters of grain-size analysis are the median, mode, mean, separation parameter, skewness, and kurtosis [13]. During the last few decades, two methods of computing grain-size parameters were developed: the graphical method and the moment method [12]. Blott and Pye (2011) showed that each of these two methods has advantages and disadvantages when computing the parameters of sediment grain-size samples. As most sediments are polymodal, curve shape and statistical measures usually reflect only the relative magnitude and separation of populations. A polymodal grain-size spectrum can be considered the superposition of several unimodal components [14]. Many works have shown that different grain-size distributions are related to specific transport and deposition processes [15]. Three kinds of functions are commonly used to fit the grain-size distribution: the Normal function, the Lognormal function, and the Weibull function [15]. Based on experimental results, Sun et al. [15] found that the Weibull function was appropriate for the mathematical description of the grain-size distribution of all kinds of sediments, while the Normal function was also acceptable for fluvial and lacustrine sediments. Although these methods, especially the Weibull function, perform well in fitting the grain-size distribution of sediments, they often rely on the subjective experience of the researcher, and definite criteria for environmental determination have not been given. Based on the data of borehole Lz908, Yi et al. analyzed the evolution of the sedimentary environment. Besides grain-size data, they also used magnetic susceptibility, tree pollen, radiocarbon dating, and optically stimulated luminescence (OSL) dating [16, 17]. Can the same conclusion be obtained from grain-size data alone, which are relatively convenient and low-priced indices?
In this paper, we introduce complex networks into the modeling of sediment grain-size data. Based on bipartite graph theory [18], we construct the Sample/Grain-Size bipartite weighted network model, which objectively reflects the associations between sediment samples and grain sizes. By projection, we construct the Sample network model from the bipartite network. After repeated tests with tens of representative clustering algorithms, we selected the Mini Batch K-means algorithm [19], an optimization that combines the K-means algorithm [20] with the classical batch algorithm [21], to split the Sample nodes into their categories and find the relationships between sedimentary environment and grain size. After 400 tests, we identify the most appropriate parameters for the Mini Batch K-means algorithm. Finally, we use four evaluation indices (AMI, NMI, completeness, and precision) to verify the accuracy and efficiency of the clustering divisions.
2. Evaluation Functions
In complex network research, several representative performance evaluation indices (AMI, NMI, completeness, and precision) are commonly used to verify the accuracy and efficiency of clustering divisions. For each index, a higher value indicates a better clustering division. We therefore use these four indices to verify the clustering result for the sediment grain-size samples.
2.1. NMI and AMI
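Normalized Mutual Information (NMI) [22, 23] measures the information shared between two label assignments using information theory, in which entropy is defined as the amount of information contained in a distribution [24]. For two label assignments $U$ and $V$ of the same $N$ objects, the mutual information (MI) is
$$\mathrm{MI}(U, V) = \sum_{i=1}^{|U|} \sum_{j=1}^{|V|} P(i, j) \log \frac{P(i, j)}{P(i)\,P'(j)},$$
where $P(i, j) = |U_i \cap V_j| / N$ is the probability that an object picked at random falls into both classes $U_i$ and $V_j$. The two label assignments $U$ and $V$ have the corresponding entropies $H(U)$ and $H(V)$, defined as
$$H(U) = -\sum_{i=1}^{|U|} P(i) \log P(i),$$
where $P(i) = |U_i| / N$ is the probability that an object picked at random from $U$ falls into class $U_i$. $H(V)$ is defined analogously with $P'(j) = |V_j| / N$.
The NMI and the Adjusted Mutual Information (AMI) [25] are defined as
$$\mathrm{NMI}(U, V) = \frac{\mathrm{MI}(U, V)}{\sqrt{H(U)\,H(V)}}, \qquad \mathrm{AMI}(U, V) = \frac{\mathrm{MI}(U, V) - E[\mathrm{MI}]}{\max\bigl(H(U), H(V)\bigr) - E[\mathrm{MI}]},$$
where $E[\mathrm{MI}]$ is the expected value of MI. NMI takes values in $[0, 1]$; AMI is at most 1 and is close to 0 for random label assignments.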
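The entropy, MI, and NMI definitions in this section can be illustrated with a minimal standard-library sketch. The function names and toy labels are hypothetical, the geometric-mean normalization is an assumed choice, and AMI is left out because it additionally needs the expected MI under a random-labeling model.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy H(U) = -sum_i P(i) log P(i) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(u, v):
    """MI(U, V) = sum_ij P(i, j) log(P(i, j) / (P(i) P'(j)))."""
    n = len(u)
    count_u, count_v = Counter(u), Counter(v)
    joint = Counter(zip(u, v))
    return sum(
        (c / n) * math.log(c * n / (count_u[i] * count_v[j]))
        for (i, j), c in joint.items()
    )


def nmi(u, v):
    """NMI normalized by sqrt(H(U) H(V)) (an assumed normalization)."""
    h_u, h_v = entropy(u), entropy(v)
    if h_u == 0 and h_v == 0:
        return 1.0  # both assignments are a single group
    if h_u == 0 or h_v == 0:
        return 0.0  # one trivial assignment shares no information
    return mutual_information(u, v) / math.sqrt(h_u * h_v)


# Hypothetical toy labels: reference environments vs. cluster ids.
reference = ["marine", "marine", "fluvial", "fluvial", "lacustrine", "lacustrine"]
score_same = nmi(reference, [0, 0, 1, 1, 2, 2])   # same partition, renamed
score_mixed = nmi(reference, [0, 1, 0, 1, 0, 1])  # independent of the reference
```

Because the indices compare partitions rather than label names, a clustering that reproduces the reference grouping scores highly even if its cluster ids are arbitrary.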
2.2. Completeness
Conditional entropy analysis, based on the reference partition of known grain-size samples from previous studies, is used to define intuitive measures. A clustering satisfies completeness if all nodes of a given class are assigned to the same cluster [26, 27]. Completeness is formally given by
$$c = 1 - \frac{H(K \mid C)}{H(K)},$$
where $H(K)$ is the entropy of the cluster assignments and $H(K \mid C)$ is the conditional entropy of the cluster assignments given the classes.
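As a rough illustration, completeness can be computed directly from label counts. The sketch below uses only the standard library, follows the definition of Rosenberg and Hirschberg [26], and uses hypothetical toy labels rather than the Lz908 data.

```python
import math
from collections import Counter


def completeness(classes, clusters):
    """c = 1 - H(K|C) / H(K); taken as 1.0 when H(K) = 0."""
    n = len(classes)
    class_sizes = Counter(classes)
    cluster_sizes = Counter(clusters)
    joint = Counter(zip(classes, clusters))
    # Entropy of the cluster assignments, H(K).
    h_k = -sum((s / n) * math.log(s / n) for s in cluster_sizes.values())
    if h_k == 0:
        return 1.0
    # Conditional entropy H(K|C) = -sum_{c,k} (n_ck / n) log(n_ck / n_c).
    h_k_given_c = -sum(
        (n_ck / n) * math.log(n_ck / class_sizes[c])
        for (c, _k), n_ck in joint.items()
    )
    return 1.0 - h_k_given_c / h_k


# Class 1 is split across clusters 1 and 2, so completeness drops below 1.
split_score = completeness([0, 0, 1, 1], [0, 0, 1, 2])
# Every class is spread evenly over both clusters: completeness is 0.
even_score = completeness([0, 0, 1, 1], [0, 1, 0, 1])
```

Merging two classes into one cluster does not reduce completeness; only splitting a class across clusters does.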
2.3. Precision
Precision [28] ($P$) is the number of true positives ($T_p$) divided by the sum of $T_p$ and the number of false positives ($F_p$). The precision is given by
$$P = \frac{T_p}{T_p + F_p}.$$
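A minimal sketch of this ratio for one sedimentary category is shown below. It assumes the cluster ids have already been mapped to environment names, a step the text does not describe; the function name and labels are hypothetical.

```python
def precision(true_labels, predicted_labels, positive):
    """P = Tp / (Tp + Fp) for one category; 0.0 if the category is never predicted."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    return tp / (tp + fp) if (tp + fp) else 0.0


# Hypothetical reference and predicted environments for five samples.
truth = ["marine", "marine", "fluvial", "fluvial", "lacustrine"]
predicted = ["marine", "fluvial", "fluvial", "fluvial", "marine"]

p_marine = precision(truth, predicted, "marine")    # 1 correct of 2 predicted
p_fluvial = precision(truth, predicted, "fluvial")  # 2 correct of 3 predicted
```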
3. Dataset of Sediment/Grain-Size
The sediment samples of this study came from borehole Lz908 (37°09′N, 118°58′E), which is located in the southern Bohai Sea, China (Figure 1). The borehole was drilled to 101.3 m below the surface in 2007, and the recovery rate reached 75%. Existing research shows that this region has developed three transgressive layers since the late Pleistocene, and the thickness of fluvial, lacustrine, and marine sediments reached 2000–3000 m in this basin [16]. We extracted 2141 grain-size samples from the borehole at 2 cm intervals. We measured the grain size after thorough pretreatment at the First Institute of Oceanography, State Oceanic Administration, China. The measuring instrument was a Mastersizer 2000 laser particle analyzer produced by the UK company Malvern; the measurement range was 0.02–2000 μm, and the repeated measuring error was less than 3%.
We express every sediment sample on the Phi scale using 51 grain-size classes (Table 1), each of which represents one magnitude of grain size. Each value is the percentage that one grain-size class contributes to the total for that sample. Consequently, we constructed a dataset as a 2141 × 51 matrix $D = (d_{ij})$, where $d_{ij}$ denotes the percentage composition of the $j$th grain size in the $i$th sample (Table 1).
 
Table 1 note: The dataset contains 2141 sediment samples and 51 grain-size classes; for example, 0.17 is the percentage contributed by the grain-size class with magnitude 11.75–11.50 in sample Lz0101.
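For readers unfamiliar with the Phi scale, the sketch below converts a particle diameter to its Phi value using the standard Krumbein definition, phi = -log2(d / 1 mm). The 0.25-Phi class width is an assumption inferred from class labels such as 11.75–11.50, and the helper names are hypothetical.

```python
import math


def phi_from_micrometers(diameter_um):
    """Krumbein Phi value: phi = -log2(d / 1 mm), with d given in micrometers."""
    return -math.log2(diameter_um / 1000.0)


def size_class(phi, width=0.25):
    """Lower and upper Phi bounds of the class containing phi (width is assumed)."""
    lower = math.floor(phi / width) * width
    return lower, lower + width


coarse_limit = phi_from_micrometers(2000.0)  # upper end of the instrument range
fine_limit = phi_from_micrometers(0.02)      # lower end of the instrument range
example_class = size_class(7.1)
```

Larger Phi values correspond to finer grains, so the instrument's 0.02–2000 μm range runs from high Phi down to a Phi of -1.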
4. Construction of Sample/Grain-Size Bipartite Network
In this paper, we construct the Sample/Grain-Size network based on bipartite graph theory [18], in which a graph is denoted as $G = (V, E)$, where $V$ is the node set and $E$ is the edge set. In a bipartite graph, $V$ is divided into two disjoint subsets, $V = V_1 \cup V_2$, where $V_1$ is one class of nodes and $V_2$ is the other class; each edge in $E$ denotes an association between a node in $V_1$ and a node in $V_2$.
According to this theory, the Sample/Grain-Size bipartite weighted network model is constructed as follows.
In this process, one class, $V_1$, contains the sample nodes and the other class, $V_2$, contains the grain-size nodes. As shown in Figure 2, the sample node Lz04076 is linked to several grain-size nodes, for example the node with magnitude 7.25–7.00. If a sample contains a grain size, an edge exists between the sample node and the corresponding grain-size node. The weight of the edge denotes the quantity of that grain size contained in the sample. Based on this rule, we construct the final Sample/Grain-Size bipartite weighted network model (Figure 3).
In this bipartite network, grain-size nodes are colored green and correspond to the 51 grain-size classes of different magnitudes; sample nodes are colored pink and correspond to the 2141 samples. This model clearly reflects the associations between the sample nodes and the grain-size nodes.
We construct a Sample network model projected from the Sample/Grain-Size bipartite network model (Figure 4). The Sample network model has 2141 nodes and 44,198 edges; a node denotes a sample, and an edge indicates that the two samples contain a grain size of the same magnitude. The weight of an edge gives the frequency with which the two samples share the same grain size.
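The construction and projection described in this section can be sketched with plain dictionaries. The sample ids, class labels, and percentages below are hypothetical, and the projected edge weight is taken to be the number of grain-size classes two samples share, which is one reading of "frequency" and not necessarily the authors' exact weighting.

```python
from itertools import combinations

# Hypothetical miniature of the 2141 x 51 matrix: sample -> {size class: percentage}.
samples = {
    "S1": {"7.25-7.00": 3.1, "7.00-6.75": 2.4},
    "S2": {"7.00-6.75": 1.9, "6.75-6.50": 4.0},
    "S3": {"4.00-3.75": 5.2},
}


def bipartite_edges(samples):
    """Weighted edges (sample, size class) -> percentage, for nonzero entries."""
    return {
        (s, g): w
        for s, sizes in samples.items()
        for g, w in sizes.items()
        if w > 0
    }


def project_onto_samples(samples):
    """Link two samples if they share a size class; weight = number of shared classes."""
    present = {s: {g for g, w in sizes.items() if w > 0} for s, sizes in samples.items()}
    projected = {}
    for a, b in combinations(sorted(present), 2):
        shared = present[a] & present[b]
        if shared:
            projected[(a, b)] = len(shared)
    return projected


edges = bipartite_edges(samples)
sample_network = project_onto_samples(samples)
```

In this toy case only S1 and S2 share a class, so the projected Sample network has a single edge and S3 remains isolated.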
5. Sediment Grain-Size Sample Analysis Based on Mini Batch K-Means
5.1. Idea of Sediment Grain-Size Data Analysis
In this paper, we cluster the Sample network model with the Mini Batch K-means algorithm. In each iteration, we randomly draw a mini batch of subsamples from the full sample set and update the cluster centers with each mini batch sample using a convex combination. We use per-center learning rates to speed up convergence. As the iterations proceed, we check for convergence, which is reached when the clustering result no longer changes between successive iterations. In the end, the sample nodes are divided into several clusters.
5.2. Steps of Sediment Grain-Size Sample Data Analysis
Step 1. Randomly extract $b$ mini batch subsamples $M$ from the sediment sample dataset $X$, which has 2141 samples and 51 properties.
Step 2. Randomly select $k$ samples as the initial clustering centers; save them in an array $C$ of clustering centers, which will be updated as the algorithm runs.
Step 3. Select a sample $x$ from $M$; find the clustering center nearest to $x$ by Euclidean distance; save the result in an array $d[x]$. The Euclidean distance is
$$\mathrm{dist}(x, c) = \sqrt{\sum_{j=1}^{51} (x_j - c_j)^2},$$
where $d[x]$ stores the center in $C$ with the smallest Euclidean distance to the sample $x$, and $x_j$ is the $j$th property of the sample $x$.
Step 4. Take a sample $x$ and its cached center $c = d[x]$; update the per-center counter $v[c]$: $v[c] \leftarrow v[c] + 1$.
Step 5. Compute the current per-center learning rate $\eta = 1 / v[c]$, which speeds up the convergence of the algorithm.
Step 6. Take the gradient step: $c \leftarrow (1 - \eta)\,c + \eta\,x$.
Step 7. If every sample in $M$ has been processed, all samples in the mini batch have been assigned to a cluster; otherwise, return to Step 4.
Step 8. If the iteration count is at most $t$, return to Step 1. The algorithm stops when the convergence condition is satisfied or the iteration count exceeds $t$.
Algorithms 1 and 2 show the pseudocode of the Mini Batch K-means algorithm for sediment sample data processing.
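The steps above can be sketched in plain Python as follows. This is an illustrative reading of the procedure, not the authors' Algorithms 1 and 2: it runs a fixed number of iterations instead of testing a convergence condition, and the function names, seed, and toy data are hypothetical.

```python
import math
import random


def nearest_center(x, centers):
    """Index of the center with the smallest Euclidean distance to x."""
    return min(range(len(centers)), key=lambda c: math.dist(x, centers[c]))


def mini_batch_kmeans(data, k, batch_size, max_iter, seed=0):
    rng = random.Random(seed)
    centers = [list(x) for x in rng.sample(data, k)]  # Step 2: initial centers
    counts = [0] * k                                  # per-center counters
    for _ in range(max_iter):                         # Step 8: iteration limit
        batch = rng.sample(data, batch_size)          # Step 1: draw a mini batch
        # Step 3: cache the nearest center for every sample in the batch.
        assigned = [nearest_center(x, centers) for x in batch]
        for x, c in zip(batch, assigned):             # Steps 4 to 7
            counts[c] += 1
            eta = 1.0 / counts[c]                     # Step 5: learning rate
            # Step 6: convex combination of the old center and the sample.
            centers[c] = [(1 - eta) * cj + eta * xj for cj, xj in zip(centers[c], x)]
    labels = [nearest_center(x, centers) for x in data]
    return centers, labels


# Hypothetical two-group toy data standing in for the 2141 x 51 matrix.
toy = [(0.1 * i, 0.0) for i in range(10)] + [(10.0 + 0.1 * i, 5.0) for i in range(10)]
centers, labels = mini_batch_kmeans(toy, k=2, batch_size=5, max_iter=50)
```

Because each update is a convex combination, every center stays inside the bounding box of the data, and the shrinking learning rate makes each center the running mean of the samples assigned to it so far.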


6. Simulations and Analysis
6.1. Multi-Index Analysis of Clustering Results
In this paper, we use the four indices AMI, NMI, completeness, and precision to verify the clustering result for the sediment samples. We tune the two main parameters of the Mini Batch K-means algorithm: mini batch and reassignment ratio. After 400 tests, we obtain the results in Table 2. The maximum value of each evaluation index is marked in bold: 0.40919072 (AMI), 0.41485376 (NMI), 0.44747697 (completeness), and 0.92254367 (precision).
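A parameter sweep of this kind could be sketched with scikit-learn, assuming its `MiniBatchKMeans` parameters `batch_size` and `reassignment_ratio` correspond to the "mini batch" and "reassignment ratio" discussed here. The source does not name the implementation used, and the grid values and synthetic data below are hypothetical stand-ins, not the Lz908 dataset.

```python
from itertools import product

from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import normalized_mutual_info_score

# Synthetic stand-in for the grain-size matrix (not the Lz908 data).
X, y_true = make_blobs(n_samples=300, n_features=5, centers=3, random_state=0)

batch_sizes = [25, 50, 100]              # hypothetical grid values
reassignment_ratios = [0.01, 0.05, 0.1]  # hypothetical grid values

scores = {}
for batch_size, ratio in product(batch_sizes, reassignment_ratios):
    model = MiniBatchKMeans(
        n_clusters=3,
        batch_size=batch_size,
        reassignment_ratio=ratio,
        n_init=3,
        random_state=0,
    )
    labels = model.fit_predict(X)
    # Score each parameter pair against the reference labels.
    scores[(batch_size, ratio)] = normalized_mutual_info_score(y_true, labels)

best_params = max(scores, key=scores.get)
```

The same loop can record AMI, completeness, and precision per parameter pair, which yields one score grid per index of the kind a heatmap displays.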

6.2. Heatmap Analysis of Clustering Results
The following heatmaps objectively reflect the accuracy and efficiency of the clustering division of the sediment sample data computed by the Mini Batch K-means algorithm. In each figure, every square represents the score of one evaluation index for one combination of mini batch and reassignment ratio. The color bar on the right of each figure maps colors to scores, and the shade of each square represents the size of its value.
As shown in Figures 5 and 6, AMI reaches its maximum value, 0.40919072, at the selected reassignment ratio with mini batch = 25. NMI reaches its maximum value, 0.41485376, under the same parameters.
Based on Figures 7 and 8, completeness and precision reach their maximum values, 0.44747697 and 0.92254367, respectively, at the selected reassignment ratio with mini batch = 25.
Through clustering analysis, we can assign these samples to their actual sediment clusters. Of the four performance evaluation indices, precision is the most significant. The simulation results above show that the clustering of sediment grain-size samples by the Mini Batch K-means algorithm with appropriate parameters (the selected reassignment ratio and mini batch = 25) has high precision: 0.92254367. The other three indices also reach their maximum values: AMI = 0.40919072, NMI = 0.41485376, and completeness = 0.44747697.
6.3. Network Characteristic Analysis of Clustering Results and Comparison with Other Studies
We compute the clustering result for the sediment grain-size samples using the Mini Batch K-means algorithm with the most appropriate parameters (the selected reassignment ratio and mini batch = 25). The simulation results are shown in Table 3 and Figure 9.

According to Table 3 and Figure 9, the Mini Batch K-means algorithm divides the Sample network model into three clusters. Yi et al. divided the sedimentary environment of Lz908 using a variety of indices in representative studies [12, 16]. Compared with their results, the three clusters correspond to three sedimentary environments: marine, fluvial, and lacustrine. The green cluster contains the samples assigned to the marine sediment category, the orange cluster contains the samples assigned to the fluvial category, and the blue cluster contains the samples assigned to the lacustrine category. The clustering division of this network achieves a high precision, 0.92254367, when the Mini Batch K-means algorithm is run with the selected reassignment ratio and mini batch = 25. Furthermore, we find that most of the points that differ from previous studies are located at the junctions between different sediment types (Figure 10). These results show that this method of analyzing sedimentary environment by grain size is effective and accurate.
7. Conclusions
During the last several decades, researchers have made significant advances in the environmental interpretation of grain-size analyses, but definite criteria for environmental determination have not been given. Previous studies often relied heavily on the subjective experience of the researcher, usually combined grain-size analysis with other methods, and rarely used grain size alone for sedimentary environment analysis. Recently, complex networks have played an increasingly significant role in data mining and knowledge discovery because they can reveal potential relationships and concealed information between entities. In this paper, we use complex networks and bipartite graph theory to construct a Sample/Grain-Size network model and a Sample network model. We then use the Mini Batch K-means algorithm to cluster the sediment grain-size samples, and we use the representative evaluation indices AMI, NMI, completeness, and precision to verify the clustering results. Simulation results show that this algorithm divides the Sample network into three clusters (marine, fluvial, and lacustrine), a division that is almost identical to the one in the classical studies. The evaluation indices also reach their highest values when we use the selected reassignment ratio and mini batch = 25. The results further indicate that the clustering is efficient: the proportion of samples classified the same way as by the traditional method reaches 0.92254367, an excellent result obtained in a relatively convenient and low-priced way.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
We acknowledge the funding support from the China National Key Research Project (2016YFC0402801) and the National Natural Science Foundation (41406072).
References
[1] W. Wei, J. Cai, X. Hu et al., “A numerical study on fractal dimensions of current streamlines in two-dimensional and three-dimensional pore fractal models of porous media,” Fractals, vol. 23, no. 1, article 1540012, 2015.
[2] T. Cordier, P. Esling, F. Lejzerowicz et al., “Predicting the ecological quality status of marine environments from eDNA metabarcoding data using supervised machine learning,” Environmental Science & Technology, vol. 51, no. 16, pp. 9118–9126, 2017.
[3] Y. Jiang, Y. Gou, T. Zhang, K. Wang, and C. Hu, “A machine learning approach to Argo data analysis in a thermocline,” Sensors, vol. 17, no. 10, p. 2225, 2017.
[4] C. Zhou, K. Yin, Y. Cao et al., “Landslide susceptibility modeling applying machine learning methods: a case study from Longju in the Three Gorges Reservoir area, China,” Computers & Geosciences, vol. 112, pp. 23–37, 2018.
[5] J. Cai and B. Yu, “Prediction of maximum pore size of porous media based on fractal geometry,” Fractals, vol. 18, no. 4, pp. 417–423, 2010.
[6] P. Tahmasebi and A. Hezarkhani, “A hybrid neural networks-fuzzy logic-genetic algorithm for grade estimation,” Computers & Geosciences, vol. 42, pp. 18–27, 2012.
[7] M. Newman, “Community detection in networks: modularity optimization and maximum likelihood are equivalent,” 2016, http://arxiv.org/abs/1606.02319.
[8] J. Fu and J. Wu, “A deep stochastic model for detecting community in complex networks,” Journal of Statistical Physics, vol. 166, no. 2, pp. 230–243, 2017.
[9] F. Hu, Y. Zhu, Y. Shi, J. Cai, L. Chen, and S. Shen, “An algorithm Walktrap-SPM for detecting overlapping community structure,” International Journal of Modern Physics B, vol. 31, no. 15, article 1750121, 2017.
[10] M. E. J. Newman and M. Girvan, “Finding and evaluating community structure in networks,” Physical Review E, vol. 69, no. 2, article 026113, 2004.
[11] F. Hu, M. Wang, Y. Wang, Z. Hong, and Y. Zhu, “An algorithm JSC of detecting communities in complex networks,” Physics Letters A, vol. 381, no. 42, pp. 3604–3612, 2017.
[12] L. Yi, C. Deng, X. Xu et al., “Paleo-megalake termination in the Quaternary: paleomagnetic and water-level evidence from south Bohai Sea, China,” Sedimentary Geology, vol. 319, pp. 1–12, 2015.
[13] G. M. Friedman, “Differences in size distributions of populations of particles among sands of various origins: addendum to IAS Presidential Address,” Sedimentology, vol. 26, no. 6, pp. 859–862, 1979.
[14] G. M. Ashley, “Interpretation of polymodal sediments,” The Journal of Geology, vol. 86, no. 4, pp. 411–421, 1978.
[15] D. Sun, J. Bloemendal, D. K. Rea et al., “Grain-size distribution function of polymodal sediments in hydraulic and aeolian environments, and numerical partitioning of the sedimentary components,” Sedimentary Geology, vol. 152, no. 3–4, pp. 263–277, 2002.
[16] L. Yi, H.-J. Yu, J. D. Ortiz et al., “Late Quaternary linkage of sedimentary records to three astronomical rhythms and the Asian monsoon, inferred from a coastal borehole in the south Bohai Sea, China,” Palaeogeography, Palaeoclimatology, Palaeoecology, vol. 329–330, pp. 101–117, 2012.
[17] L. Yi, H. Yu, J. D. Ortiz et al., “A reconstruction of late Pleistocene relative sea level in the south Bohai Sea, China, based on sediment grain-size analysis,” Sedimentary Geology, vol. 281, pp. 88–100, 2012.
[18] K. Fukuda and T. Matsui, “Finding all the perfect matchings in bipartite graphs,” Applied Mathematics Letters, vol. 7, no. 1, pp. 15–18, 1994.
[19] M. Capó, A. Pérez, and J. A. Lozano, “An efficient approximation to the K-means clustering for massive data,” Knowledge-Based Systems, vol. 117, pp. 56–69, 2017.
[20] J. A. Hartigan and M. A. Wong, “Algorithm AS 136: a k-means clustering algorithm,” Journal of the Royal Statistical Society Series C (Applied Statistics), vol. 28, no. 1, pp. 100–108, 1979.
[21] K. P. Papadaki and W. B. Powell, “An adaptive dynamic programming algorithm for a stochastic multiproduct batch dispatch problem,” Naval Research Logistics, vol. 50, no. 7, pp. 742–769, 2003.
[22] L. Ana and A. K. Jain, “Robust data clustering,” in 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings, Madison, WI, USA, June 2003.
[23] L. Danon, A. Diaz-Guilera, J. Duch, and A. Arenas, “Comparing community structure identification,” Journal of Statistical Mechanics: Theory and Experiment, vol. 2005, no. 9, article 09008, 2005.
[24] N. X. Vinh, J. Epps, and J. Bailey, “Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance,” Journal of Machine Learning Research, vol. 11, pp. 2837–2854, 2010.
[25] S. Romano, N. X. Vinh, J. Bailey, and K. Verspoor, “Adjusting for chance clustering comparison measures,” The Journal of Machine Learning Research, vol. 17, pp. 4635–4666, 2016.
[26] A. Rosenberg and J. Hirschberg, “V-measure: a conditional entropy-based external cluster evaluation measure,” in Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 410–420, Prague, 2007.
[27] T. Grinshpoun and A. Meisels, “Completeness and performance of the APO algorithm,” Journal of Artificial Intelligence Research, vol. 33, pp. 223–258, 2008.
[28] A. Biswas and B. Biswas, “Investigating community structure in perspective of ego network,” Expert Systems with Applications, vol. 42, no. 20, pp. 6913–6934, 2015.
Copyright
Copyright © 2018 Qiao Su et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.