Geofluids

Volume 2018, Article ID 8519695, 11 pages

https://doi.org/10.1155/2018/8519695

## Sedimentary Environment Analysis by Grain-Size Data Based on Mini Batch K-Means Algorithm

^{1}Key Laboratory of Marine Sedimentology and Environmental Geology, First Institute of Oceanography, State Oceanic Administration, Qingdao 266061, China

^{2}College of Information Engineering, Hubei University of Chinese Medicine, Wuhan 430065, China

^{3}Department of Mathematics and Statistics, University of West Florida, Pensacola 32514, USA

Correspondence should be addressed to Fang Hu; fhu1@uwf.edu

Received 26 April 2018; Accepted 23 September 2018; Published 2 December 2018

Academic Editor: Umberta Tinivella

Copyright © 2018 Qiao Su et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

During the last several decades, researchers have made significant advances in interpreting sedimentary environments from grain-size analysis, but these advances have often depended on the subjective experience of the researcher and were usually combined with other methods. More recently, researchers have applied a growing number of data mining and knowledge discovery methods to explore the potential relationships in sediment grain-size data. In this paper, we apply bipartite graph theory to construct a *Sample/Grain-Size* network model and then construct a *Sample* network model projected from this bipartite network. We then use the Mini Batch K-means algorithm with the most appropriate parameters (*reassignment ratio * and *mini batch* = 25) to cluster the sediment samples, and we use four representative evaluation indices to verify the precision of the clustering result. Simulation results demonstrate that this algorithm can divide the *Sample* network into three sedimentary categorical clusters: *marine*, *fluvial*, and *lacustrine*. Measured against the results of previous studies obtained from a variety of indices, the precision of the experimental sediment grain-size classification reaches 0.92254367, which shows that this method of analyzing sedimentary environments by grain size is effective and accurate.

#### 1. Introduction

Data mining, knowledge discovery, and machine learning algorithms have permeated research in a wide range of fields [1–4]. The complex network, an important data mining method, focuses on discovering hidden information in the relationships between things. Accordingly, many researchers in fields including mathematics, physics, biology, chemistry, and oceanology have used complex networks to explore the potential relationships between data [5–9]. Complex networks exhibit several characteristic properties: self-similarity, self-organization, scale-free behavior, the small-world effect, community structure (clusters), and node centrality. Community structure is one of the most important of these because it can objectively reflect the potential relationships between nodes. A community is a group of nodes that are densely linked to one another but only sparsely linked to nodes in other communities [10, 11].

Grain-size analysis is one of the basic tools for classifying sedimentary environments and can provide important clues to provenance, transport history, and depositional conditions [12]. In general, the representative statistical parameters of grain-size analysis include the median, mode, mean, separation parameter, skewness, and kurtosis [13]. Over the last few decades, two methods for computing grain-size parameters have been developed: the graphical method and the moment method [12]. Blott and Pye (2011) showed that each of these two methods has advantages and disadvantages when computing parameters for different sediment grain-size samples. As most sediments are polymodal, curve shape and statistical measures usually reflect only the relative magnitude and separation of populations. A polymodal grain-size spectrum can be regarded as the superposition of several unimodal components [14]. Many studies have shown that different grain-size distributions are related to particular transport and deposition processes [15]. Three kinds of functions are commonly used to fit the grain-size distribution: the Normal, Lognormal, and Weibull functions [15]. Based on experimental results, Sun et al. [15] found that the Weibull function was appropriate for the mathematical description of the grain-size distribution of all kinds of sediments, while the Normal function was also acceptable for fluvial and lacustrine sediments. Although these methods, especially the Weibull function, perform well in fitting sediment grain-size distributions, they often rely on the subjective experience of the researcher, and definite criteria for environmental determination have not been given. Based on data from borehole Lz908, Yi et al. analyzed the evolution of the sedimentary environment. Besides grain-size data, they also used magnetic susceptibility, tree pollen, radiocarbon dating, and optically stimulated luminescence (OSL) dating [16, 17]. Can the same conclusion be obtained using only grain-size data, which are relatively convenient and inexpensive to collect?

In this paper, we introduce complex networks into the modeling of sediment grain-size data. Based on bipartite graph theory [18], we construct the *Sample/Grain-Size* weighted bipartite network model, which can objectively reflect the association relationships between sediment samples and grain sizes. By projection, we then construct the *Sample* network model from the bipartite network. After repeated testing of tens of representative clustering algorithms, we selected the Mini Batch K-means algorithm [19], an optimization that combines the K-means algorithm [20] with the classical batch algorithm [21], to split the *Sample* nodes into their categories and find the relationships between the sedimentary environment and grain size. After 400 tests, we identified the most appropriate parameters for the Mini Batch K-means algorithm. Finally, we use four evaluation indices (AMI, NMI, completeness, and precision) to verify the accuracy and efficiency of the clustering divisions.
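To make the clustering step concrete, the following is a minimal sketch of a basic mini-batch k-means update, written in plain Python. It is an illustration under stated assumptions, not the implementation used in this study: the function names, the `init_centers` argument, and the per-center step size of 1/count are choices made for the example, and the center-reassignment step governed by a reassignment ratio is omitted. The default `batch_size=25` mirrors the mini batch size quoted in the abstract.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_center(point, centers):
    """Index of the center closest to the point (ties go to the lowest index)."""
    return min(range(len(centers)), key=lambda k: squared_distance(point, centers[k]))


def mini_batch_kmeans(points, n_clusters, batch_size=25, n_iter=100,
                      seed=0, init_centers=None):
    """Cluster points with a basic mini-batch k-means update (illustrative only).

    Each iteration draws a random mini batch, assigns its points to the
    nearest center, and moves that center toward each assigned point with
    a step size of 1 / (number of points the center has absorbed so far).
    """
    rng = random.Random(seed)
    if init_centers is None:
        # Assumption: initialize from randomly chosen data points.
        init_centers = rng.sample(points, n_clusters)
    centers = [list(c) for c in init_centers]
    counts = [0] * n_clusters

    for _ in range(n_iter):
        batch = rng.sample(points, min(batch_size, len(points)))
        # Assign the whole batch first, then update the centers.
        assigned = [nearest_center(p, centers) for p in batch]
        for p, k in zip(batch, assigned):
            counts[k] += 1
            step = 1.0 / counts[k]
            centers[k] = [(1.0 - step) * c + step * x for c, x in zip(centers[k], p)]

    labels = [nearest_center(p, centers) for p in points]
    return labels, centers
```

Because each update touches only a small batch, the cost per iteration does not grow with the total number of samples, which is the main appeal of the mini-batch variant over full-batch K-means. The price is some noise in the final centers.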

#### 2. Evaluation Functions

In complex network research, several representative performance evaluation indices (AMI, NMI, completeness, and precision) are commonly used to verify the accuracy and efficiency of clustering divisions. It is generally accepted that the higher the value of an index, the better the clustering division. Therefore, we also use these four evaluation indices to verify the clustering result of the sediment grain-size samples.

##### 2.1. NMI and AMI

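Normalized Mutual Information (NMI) [22, 23] measures the information shared between two data distributions using information theory, in which entropy is defined as the information contained in a distribution [24]. The mutual information (MI) between two label assignments $U$ and $V$ of the same $N$ objects is

$$\mathrm{MI}(U, V) = \sum_{i=1}^{|U|} \sum_{j=1}^{|V|} P(i, j) \log \frac{P(i, j)}{P(i)\, P'(j)},$$

where $P(i, j) = |U_i \cap V_j| / N$ represents the probability that an object picked at random falls into both classes $U_i$ and $V_j$. The two label assignments $U$ and $V$ have the corresponding entropies $H(U)$ and $H(V)$, defined as follows:

$$H(U) = -\sum_{i=1}^{|U|} P(i) \log P(i),$$

where $P(i) = |U_i| / N$ is the probability that an object picked at random from $U$ falls into class $U_i$. The entropy $H(V)$ is defined in the same way with $P'(j) = |V_j| / N$.

The NMI and the Adjusted Mutual Information (AMI) [25] are defined as

$$\mathrm{NMI}(U, V) = \frac{\mathrm{MI}(U, V)}{\sqrt{H(U)\, H(V)}}, \qquad \mathrm{AMI}(U, V) = \frac{\mathrm{MI}(U, V) - E[\mathrm{MI}(U, V)]}{\max\bigl(H(U), H(V)\bigr) - E[\mathrm{MI}(U, V)]},$$

where $E[\mathrm{MI}(U, V)]$ is the expected value for MI. NMI takes values in $[0, 1]$; AMI is at most 1 and is close to 0 for independent label assignments.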
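As an illustration of these definitions, the sketch below computes entropy, mutual information, and NMI from two label lists using only the Python standard library. It assumes the geometric-mean normalization, MI divided by the square root of the product of the two entropies, which is one common convention and may differ from the one used in this study. The function names are hypothetical, and AMI is left out because its expected-MI term requires a longer combinatorial computation.

```python
from collections import Counter
from math import log, sqrt


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(labels_u, labels_v):
    """Mutual information between two label assignments of the same objects."""
    n = len(labels_u)
    count_u = Counter(labels_u)
    count_v = Counter(labels_v)
    joint = Counter(zip(labels_u, labels_v))
    # P(i, j) / (P(i) P'(j)) simplifies to n * n_ij / (|U_i| * |V_j|).
    return sum(
        (n_ij / n) * log(n * n_ij / (count_u[i] * count_v[j]))
        for (i, j), n_ij in joint.items()
    )


def nmi(labels_u, labels_v):
    """NMI with geometric-mean normalization (an assumed convention)."""
    h_u, h_v = entropy(labels_u), entropy(labels_v)
    if h_u == 0.0 or h_v == 0.0:
        # Degenerate case: at least one assignment has a single class.
        return 1.0 if h_u == h_v else 0.0
    return mutual_information(labels_u, labels_v) / sqrt(h_u * h_v)


# Usage: a clustering that matches the reference classes up to relabeling.
reference = ["marine", "marine", "fluvial", "fluvial", "lacustrine", "lacustrine"]
clusters = [2, 2, 0, 0, 1, 1]
score = nmi(reference, clusters)
```

Note that NMI is invariant to how the clusters are numbered, so no matching between cluster indices and class names is needed before scoring.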

##### 2.2. Completeness

Taking the partitioning established by previous studies of the known grain-size samples as the standard, conditional entropy analysis is used to define some intuitive measures. A clustering result satisfies completeness if all nodes of a given class are assigned to the same cluster [26, 27]. The completeness is formally given by

$$c = 1 - \frac{H(K \mid C)}{H(K)},$$

where $H(K)$ is the entropy of the clusters and $H(K \mid C)$ is the conditional entropy of the clusters given the class assignments.
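A small sketch may help to show how such a score behaves. It takes completeness in its usual conditional-entropy form, one minus the conditional entropy of the cluster labels given the classes, divided by the entropy of the cluster labels. The function name and the convention of returning 1.0 when there is only one cluster are assumptions made for the example.

```python
from collections import Counter
from math import log


def completeness(classes, clusters):
    """Completeness of a clustering with respect to reference classes.

    Equals 1.0 when every member of each class lands in a single cluster.
    """
    n = len(classes)
    class_sizes = Counter(classes)
    cluster_sizes = Counter(clusters)
    joint = Counter(zip(classes, clusters))

    # Entropy of the cluster assignment.
    h_k = -sum((s / n) * log(s / n) for s in cluster_sizes.values())
    if h_k == 0.0:
        # A single cluster trivially keeps every class together.
        return 1.0

    # Conditional entropy of the clusters given the classes.
    h_k_given_c = -sum(
        (n_ck / n) * log(n_ck / class_sizes[c])
        for (c, _k), n_ck in joint.items()
    )
    return 1.0 - h_k_given_c / h_k


# Usage: one class split evenly over two clusters gives the lowest score.
split_score = completeness(["marine"] * 4, [0, 0, 1, 1])
```

A consequence worth keeping in mind is that merging all samples into one cluster also scores 1.0, which is why completeness is reported alongside other indices rather than on its own.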

##### 2.3. Precision

Precision [28] ($P$) is the number of true positives ($T_p$) divided by the sum of $T_p$ and the number of false positives ($F_p$). The precision is given by

$$P = \frac{T_p}{T_p + F_p}.$$
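For a labeled sample set, this ratio can be computed per sedimentary category as in the sketch below. The function name, the example labels, and the choice to return 0.0 when a category is never predicted are assumptions made for illustration. How the per-category values are aggregated into a single reported precision is not specified in this section and is not attempted here.

```python
def precision(true_labels, predicted_labels, positive):
    """Precision for one category: T_p / (T_p + F_p)."""
    pairs = list(zip(true_labels, predicted_labels))
    # True positives: predicted as the category and actually in it.
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    # False positives: predicted as the category but actually in another.
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    if tp + fp == 0:
        return 0.0  # Assumed convention when the category is never predicted.
    return tp / (tp + fp)


# Usage with made-up labels (not data from the borehole).
truth = ["marine", "marine", "marine", "fluvial", "fluvial", "lacustrine"]
predicted = ["marine", "marine", "fluvial", "fluvial", "marine", "lacustrine"]
fluvial_precision = precision(truth, predicted, "fluvial")
```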

#### 3. Dataset of *Sediment/Grain-Size*

The sediment samples of this study came from borehole *Lz908* (37°09′N, 118°58′E), which is located in the southern Bohai Sea, China (Figure 1). The borehole was drilled to 101.3 m below the surface in 2007, and the recovery rate reached 75%. Existing research shows that this region has developed three transgressive layers since the late Pleistocene, and the thickness of *fluvial*, *lacustrine*, and *marine* sediments in this basin reaches 2000–3000 m [16]. We took 2141 grain-size samples from the borehole at 2 cm intervals. Grain size was measured after thorough pretreatment at the First Institute of Oceanography, State Oceanic Administration, China. The measuring instrument was a Mastersizer 2000 laser particle analyzer produced by Malvern (UK), with a measurement range of 0.02–2000 *μ*m and a repeated measuring error of less than 3%.