Abstract

Methods based on conventional water quality indices and UV-visible (UV-Vis) spectroscopy have been widely applied to water quality anomaly detection. This paper presents a qualitative analysis approach for detecting water contamination events caused by unknown pollutants. Fluorescence spectra were used as the water quality monitoring tool, and a detection method for unknown contaminants in water based on alternating trilinear decomposition (ATLD) is proposed to analyze the excitation and emission spectra of the samples. The Delaunay triangulation interpolation method was used to pretreat the three-dimensional fluorescence spectra in order to eliminate the effect of Rayleigh and Raman scattering; the ATLD algorithm was applied to establish a model of normal water samples, and the residual matrix was obtained as the difference between the measured matrix and the model matrix; the residual sum of squares computed from the residual matrix was compared with a threshold to discriminate test samples qualitatively, that is, to distinguish drinking water samples from organic pollutant samples. The results indicate that ATLD modeling of three-dimensional fluorescence spectra can provide a tool for the qualitative detection of unknown organic pollutants in water. The fluorescence-based method can complement methods based on conventional indices and UV-Vis spectroscopy.

1. Introduction

Water pollution is attracting more and more attention, and monitoring water quality is important for avoiding health risks to residents. Because of the deterioration of water resources, indiscriminate discharge of wastewater, chemical leakage, and similar events, it is necessary to develop a technique for rapid water quality analysis in the case of unknown contaminants.

Currently, water quality anomaly detection mainly depends on conventional water quality parameters. For example, Conde [1] developed regression models based on artificial neural networks (ANNs) and relevance vector machines (RVMs) from normal water quality parameters and generated a discriminant classifier to separate abnormal water data from normal water data online. However, obtaining these water quality parameters has various problems, such as long analysis time, insufficient sensitivity, the need for reagents, and the production of waste [2]. It therefore cannot satisfy the requirements of online, high-frequency water quality anomaly detection. Compared with conventional detection, water quality analysis based on spectra can measure samples directly, without other operations such as extraction or separation, and it is simple and rapid [3]. There have been many studies on water quality analysis based on UV-visible (UV-Vis) spectroscopy. Dürrenmatt and Gujer [4] analyzed UV-Vis data with a two-stage clustering method consisting of the self-organizing map algorithm and the Ward clustering method to distinguish industrial sewage from domestic sewage at a sewage treatment plant. Langergraber et al. [5, 6] continuously monitored water quality through ultraviolet spectroscopy and judged anomalies by comparing the three-dimensional spectrum, formed by the ultraviolet spectrum and the time axis, with the historical spectrum, and obtained good results. There are also some studies on anomaly detection of distribution water quality by analyzing ultraviolet spectra [7, 8]. Hou et al., for example, integrated principal component analysis (PCA) with the chi-square distribution to detect distribution water quality anomalies. However, the detection limit of UV-Vis spectra for some organic matter is still not low enough to meet the standard.

Three-dimensional fluorescence spectra have a lower detection limit for organic matter than UV-visible spectra [9]; they provide more complete spectral information and are characterized by high selectivity, high sensitivity, good reproducibility, small sample requirements, and no damage to the structure of the sample [10]. Three-dimensional fluorescence spectroscopy can therefore be considered as a tool for detecting water contamination events, complementary to conventional index methods and UV-Vis spectral methods. It can detect some substances that those methods cannot detect, and its detection limit may be lower. Three-dimensional fluorescence spectra can be used to analyze water samples quantitatively and are widely applied to measure the relative concentration of organic matter in water [11–13]. They are also applied to qualitative analysis of water, mainly in the field of classification. The analysis of natural organic matter (NOM) in the sea, rivers, or lakes is used to classify water samples from different sources. Baker [14] applied three-dimensional fluorescence spectroscopy to water samples from 10 sample sites in six rivers and demonstrated that tryptophan- and fulvic-like fluorescence intensity is associated with whether or not the rivers receive sewage treatment plant effluent. Pavelescu et al. [15] applied fluorescence EEM spectroscopy to samples from 12 groundwater sources (wells) located in different surroundings and showed that fluorescence indices can be used to discriminate water samples from different sources. Thus, studies on three-dimensional fluorescence spectra mainly address the quantification and classification of specific known organic matter in water. However, there are few studies on the detection of unknown pollutants in water supply systems.

The aim of this paper was to detect unknown pollutants qualitatively, for early warning when a sudden pollution event happens in a water supply system, by using three-dimensional fluorescence spectra to make up for the shortcomings of conventional methods and UV-Vis methods in qualitative water pollution detection. Fluorescence data were analyzed with a matrix feature extraction method based on alternating trilinear decomposition (ATLD) and the residual sum of squares, combined with a threshold, to judge unknown samples and to detect aqueous samples containing organic contaminants.

2. Methodology

The basic idea of the paper is to establish a model of normal water samples. If a test sample does not conform to the model, it is determined to be an anomalous sample. Before the model is established, the data need to be pretreated. Then, the ATLD algorithm is used to obtain the model. The model is applied to the test samples to obtain the model data, and the model matrix is compared with the measured matrix to obtain the residual matrix, which is the basis for judging whether the water samples are abnormal or not. The main process of the method is shown in Figure 1.

2.1. Pretreatment

Three-dimensional fluorescence spectra contain not only the fluorescence of the substance to be tested but also Rayleigh and Raman scattering. Since the scattering part does not meet the requirements of trilinear decomposition theory, it is not appropriate to analyze it with the decomposition model, so it is essential to eliminate the scattering from the three-dimensional fluorescence spectra. Some studies eliminated the scattering by subtracting the background of distilled water, which may still leave some residual scattering [16]; this is more likely to cause problems when the concentration of the sample is relatively low. This paper adopts the Delaunay triangulation interpolation method [17] to preprocess the raw data in order to eliminate the impact of Rayleigh scattering. Meanwhile, the impact of Raman scattering is eliminated by subtracting the background of the solvent.
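The scatter-removal step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure of [17]: points in a band around the Rayleigh lines are discarded and refilled by linear interpolation over a Delaunay triangulation of the remaining points, which is what `scipy.interpolate.griddata` with `method="linear"` performs. The function name `remove_rayleigh`, the 15 nm band half-width, the handling of the second-order line, and the nearest-neighbour fallback at the grid corners are all hypothetical choices.

```python
import numpy as np
from scipy.interpolate import griddata


def remove_rayleigh(eem, ex_nm, em_nm, half_width=15.0):
    """Replace Rayleigh-scatter bands in an EEM by Delaunay-based interpolation.

    eem has shape (len(ex_nm), len(em_nm)); half_width (nm) is an assumed value.
    """
    EX, EM = np.meshgrid(ex_nm, em_nm, indexing="ij")
    # First-order (em = ex) and second-order (em = 2 * ex) Rayleigh lines.
    mask = (np.abs(EM - EX) <= half_width) | (np.abs(EM - 2 * EX) <= half_width)

    points = np.column_stack([EX[~mask], EM[~mask]])
    values = eem[~mask]
    filled = eem.astype(float).copy()
    # method="linear" triangulates the kept points (Delaunay) and
    # interpolates linearly inside each triangle.
    filled[mask] = griddata(points, values, (EX[mask], EM[mask]), method="linear")

    # Masked points outside the convex hull come back as NaN;
    # fall back to the nearest kept value there (an assumption).
    hole = np.isnan(filled)
    if hole.any():
        filled[hole] = griddata(points, values, (EX[hole], EM[hole]), method="nearest")
    return filled, mask


# Synthetic check: a smooth plane plus a strong band on the first-order line.
ex = np.arange(200.0, 701.0, 10.0)
em = np.arange(200.0, 701.0, 10.0)
EX, EM = np.meshgrid(ex, em, indexing="ij")
smooth = 1.0 + 0.001 * EX + 0.002 * EM
raw = smooth + 100.0 * (np.abs(EM - EX) <= 15.0)
clean, scatter_mask = remove_rayleigh(raw, ex, em)
```

In this synthetic case the underlying surface is a plane, so the interpolated values inside the band coincide with the scatter-free surface; real spectra are only approximated.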

When the Raman spectrum of ultrapure water was measured at the excitation wavelength of 350 nm and the emission wavelength of 397 nm before the experiments, differences were found among the measurements. To eliminate these differences, Raman normalization [18] was applied to all of the three-dimensional fluorescence data; that is, each excitation-emission matrix (EEM) was divided by the Raman peak value of ultrapure water at the excitation/emission wavelengths 350/397 nm.
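As a small illustration of Raman normalization, the sketch below divides an EEM by the intensity of an ultrapure-water EEM at the grid point closest to 350/397 nm. The function name and the nearest-grid-point lookup are assumptions made for the example; on a 5 nm grid, 397 nm falls between two grid points, and the text does not say how the peak value was read.

```python
import numpy as np


def raman_normalize(eem, water_eem, ex_nm, em_nm, ex_ref=350.0, em_ref=397.0):
    """Divide an EEM by the water Raman peak intensity near ex_ref/em_ref."""
    # Nearest grid point to the reference wavelengths (an assumption).
    i = int(np.argmin(np.abs(np.asarray(ex_nm) - ex_ref)))
    j = int(np.argmin(np.abs(np.asarray(em_nm) - em_ref)))
    raman_peak = water_eem[i, j]
    if raman_peak <= 0:
        raise ValueError("Raman peak intensity must be positive")
    return eem / raman_peak


# Toy data on a 200-700 nm grid with 5 nm steps.
ex = np.arange(200, 701, 5)
em = np.arange(200, 701, 5)
water = np.zeros((ex.size, em.size))
water[30, 39] = 2.0  # ex = 350 nm, em = 395 nm (closest grid point to 397 nm)
sample = np.full((ex.size, em.size), 4.0)
normalized = raman_normalize(sample, water, ex, em)
```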

2.2. Feature Extracting Based on ATLD and Residual Sum of Squares
2.2.1. ATLD Algorithm

The alternating trilinear decomposition (ATLD) algorithm, put forward by Wu et al. [19] in 1998, is an improvement on the conventional PARAFAC algorithm. It combines alternating least squares theory, Moore-Penrose generalized inverse calculation based on singular value decomposition (SVD), and alternating iterations to improve the performance of the trilinear decomposition. Compared with PARAFAC, ATLD, which makes use of the generalized inverse and the extraction of matrix diagonal elements, is not sensitive to the number of components [20]; in addition, ATLD operates on slice matrices, which speeds up the calculation and therefore improves computational efficiency.

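Below is the trilinear model of a three-dimensional data matrix $\underline{\mathbf{X}}$, as Figure 2 shows:

$$x_{ijk} = \sum_{n=1}^{N} a_{in} b_{jn} c_{kn} + e_{ijk}, \tag{1}$$

where $i = 1, 2, \ldots, I$; $j = 1, 2, \ldots, J$; $k = 1, 2, \ldots, K$; $N$ represents the number of factors, and the factors are all the measurable components, including components which are useful for the fluorescence data and components which are disturbances for the fluorescence data; $x_{ijk}$ represents the element $(i, j, k)$ of the three-dimensional data matrix $\underline{\mathbf{X}}$ in size $I \times J \times K$. In this paper, the three-dimensional data matrix is just the three-dimensional fluorescence spectra matrix for $K$ samples after pretreatment. $a_{in}$ represents the element $(i, n)$ of the relative excitation matrix $\mathbf{A}$ in size $I \times N$; $b_{jn}$ represents the element $(j, n)$ of the relative emission matrix $\mathbf{B}$ in size $J \times N$; $c_{kn}$ represents the element $(k, n)$ of the relative concentration matrix $\mathbf{C}$ in size $K \times N$; $e_{ijk}$ represents the element $(i, j, k)$ of the three-dimensional residual matrix $\underline{\mathbf{E}}$ in size $I \times J \times K$.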
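To make the trilinear model concrete, the following NumPy sketch builds a noise-free array from three loading matrices, so that each element is the sum over factors of one excitation, one emission, and one concentration loading, and checks that the slice for one sample equals the excitation matrix times a diagonal matrix of that sample's concentrations times the transposed emission matrix. The sizes and random values are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K, N = 6, 7, 4, 2      # illustrative sizes: excitation, emission, samples, factors
A = rng.random((I, N))       # relative excitation matrix
B = rng.random((J, N))       # relative emission matrix
C = rng.random((K, N))       # relative concentration matrix

# x_ijk = sum_n a_in * b_jn * c_kn  (noise-free part of the model)
X = np.einsum("in,jn,kn->ijk", A, B, C)

# Equivalent slice-wise form: the k-th sample is A diag(c_k) B^T.
k = 1
X_k = A @ np.diag(C[k]) @ B.T
```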

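First, ATLD initializes matrices $\mathbf{A}$ and $\mathbf{B}$ randomly and alternately updates $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$ in the three separate directions of the three-dimensional data matrix, as follows:

$$\mathbf{a}_{(i)}^{T} = \operatorname{diagm}\left(\mathbf{B}^{+} \mathbf{X}_{i..} (\mathbf{C}^{T})^{+}\right), \tag{2}$$

$$\mathbf{b}_{(j)}^{T} = \operatorname{diagm}\left(\mathbf{C}^{+} \mathbf{X}_{.j.} (\mathbf{A}^{T})^{+}\right), \tag{3}$$

$$\mathbf{c}_{(k)}^{T} = \operatorname{diagm}\left(\mathbf{A}^{+} \mathbf{X}_{..k} (\mathbf{B}^{T})^{+}\right), \tag{4}$$

where $i = 1, 2, \ldots, I$; $j = 1, 2, \ldots, J$; $k = 1, 2, \ldots, K$; $\mathbf{X}_{i..}$ (size $J \times K$), $\mathbf{X}_{.j.}$ (size $K \times I$), and $\mathbf{X}_{..k}$ (size $I \times J$) are the slice matrices of $\underline{\mathbf{X}}$ along the three directions; $\operatorname{diagm}(\cdot)$ represents taking the diagonal elements of a matrix as a column vector; $\mathbf{A}^{+}$, $\mathbf{B}^{+}$, and $\mathbf{C}^{+}$, respectively, represent the Moore-Penrose generalized inverses of matrices $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$; $\mathbf{a}_{(i)}^{T}$, $\mathbf{b}_{(j)}^{T}$, and $\mathbf{c}_{(k)}^{T}$, respectively, represent the transposes of the $i$th row vector of the relative excitation matrix $\mathbf{A}$, the $j$th row vector of the relative emission matrix $\mathbf{B}$, and the $k$th row vector of the relative concentration matrix $\mathbf{C}$; matrices $\mathbf{A}$ and $\mathbf{B}$ are normalized to unit length column by column in every iteration.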

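Matrices $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$ are solved by alternately iterating formulas (2), (3), and (4) until the objective function converges. The rule of convergence is as follows:

$$\left| \frac{\mathrm{SSR}^{(m)} - \mathrm{SSR}^{(m-1)}}{\mathrm{SSR}^{(m-1)}} \right| \le \varepsilon, \tag{5}$$

where $m$ is the current iteration number and $\varepsilon$ is a small preset tolerance.

The residual sum of squares (SSR) is the loss function defined by ATLD, which is

$$\mathrm{SSR} = \sum_{i=1}^{I} \sum_{j=1}^{J} \sum_{k=1}^{K} e_{ijk}^{2}. \tag{6}$$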

As a result, ATLD decomposes the three-dimensional fluorescence spectrum matrix into three low-dimensional matrices, which are known as the relative excitation matrix $\mathbf{A}$, the relative emission matrix $\mathbf{B}$, and the relative concentration matrix $\mathbf{C}$.
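The iteration described in this section can be sketched in NumPy as below. It is a hedged illustration rather than the authors' implementation: each loading row is taken as the diagonal of a product of two pseudo-inverses and a data slice, the excitation and emission loadings are scaled to unit-length columns in every loop, and the loop stops when the relative change of the residual sum of squares is small. The function name `atld`, the update order inside the loop, and the values of `max_iter`, `tol`, and `seed` are assumptions.

```python
import numpy as np


def atld(X, n_factors, max_iter=500, tol=1e-8, seed=0):
    """ATLD-style decomposition of X (I x J x K) into A (I x N), B (J x N), C (K x N)."""
    I, J, K = X.shape
    rng = np.random.default_rng(seed)
    A = rng.random((I, n_factors))   # random start for A and B
    B = rng.random((J, n_factors))
    pinv = np.linalg.pinv            # Moore-Penrose inverse (SVD based)

    def update_c(A, B):
        # c_k = diag(A^+ X_{..k} (B^T)^+), computed for every k at once
        return np.einsum("ni,ijk,nj->kn", pinv(A), X, pinv(B))

    C = update_c(A, B)
    prev_ssr = None
    for _ in range(max_iter):
        # a_i = diag(B^+ X_{i..} (C^T)^+), then unit-length columns
        A = np.einsum("nj,ijk,nk->in", pinv(B), X, pinv(C))
        A /= np.linalg.norm(A, axis=0)
        # b_j = diag(C^+ X_{.j.} (A^T)^+), then unit-length columns
        B = np.einsum("nk,ijk,ni->jn", pinv(C), X, pinv(A))
        B /= np.linalg.norm(B, axis=0)
        C = update_c(A, B)

        residual = X - np.einsum("in,jn,kn->ijk", A, B, C)
        ssr = float(np.sum(residual ** 2))
        # Stop when the relative change of SSR is below the tolerance.
        if prev_ssr is not None and abs(ssr - prev_ssr) <= tol * max(prev_ssr, 1e-300):
            break
        prev_ssr = ssr
    return A, B, C, ssr


# Synthetic two-component data with a little noise (illustrative only).
ex = np.linspace(0.0, 1.0, 20)[:, None]
em = np.linspace(0.0, 1.0, 25)[:, None]
A0 = np.exp(-((ex - np.array([0.3, 0.7])) / 0.15) ** 2)
B0 = np.exp(-((em - np.array([0.4, 0.8])) / 0.15) ** 2)
C0 = np.random.default_rng(1).random((10, 2)) + 0.5
X0 = np.einsum("in,jn,kn->ijk", A0, B0, C0)
X = X0 + 1e-4 * np.random.default_rng(2).standard_normal(X0.shape)
A, B, C, ssr = atld(X, n_factors=2)
relative_ssr = ssr / np.sum(X ** 2)
```

Because the loadings are recovered only up to column order and scale, a comparison with known profiles would need to match columns first; the relative SSR is the simpler check of fit quality.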

2.2.2. Feature Extracting Based on Residual Sum of Squares

The model parameters obtained with the ATLD algorithm are the relative excitation matrix $\mathbf{A}$, the relative emission matrix $\mathbf{B}$, and the relative concentration matrix $\mathbf{C}$ of the background samples.

According to formula (4), if the background sample matrix $\mathbf{X}$, the relative excitation matrix $\mathbf{A}$, and the relative emission matrix $\mathbf{B}$ of the background samples are substituted into formula (4), the relative concentration $\mathbf{C}$ of each component of the sample can be obtained based on the ATLD model. Moreover, if $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$ are substituted into formula (1), the modeling value $\hat{\mathbf{X}}$ of the sample is obtained. As formula (7) shows, the residual matrix $\mathbf{E}$ is the difference between the actual value $\mathbf{X}$ and the modeling value $\hat{\mathbf{X}}$:

$$\mathbf{E} = \mathbf{X} - \hat{\mathbf{X}}. \tag{7}$$

The residual sum of squares is $\mathrm{SSR} = \sum_{i=1}^{I} \sum_{j=1}^{J} e_{ij}^{2}$, where $e_{ij}$ is the element $(i, j)$ of the matrix $\mathbf{E}$; that is, it is obtained by summing the squares of all the elements of the residual matrix. The residual sum of squares is defined as the basis of the qualitative discrimination.
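The feature extraction for a single sample can be sketched as follows, assuming the background loadings are already available. The relative concentrations are estimated in the manner of formula (4), the modeling value is rebuilt from the loadings, and the squared residuals are summed. The function name `residual_ssr` and the synthetic Gaussian profiles are hypothetical and serve only to show that a peak outside the background model raises the SSR.

```python
import numpy as np


def residual_ssr(X_test, A, B):
    """Return (SSR, residual matrix, concentrations) for one EEM given loadings A and B."""
    # Concentration estimate in the style of formula (4).
    c = np.diag(np.linalg.pinv(A) @ X_test @ np.linalg.pinv(B).T)
    X_hat = A @ np.diag(c) @ B.T     # modeling value
    E = X_test - X_hat               # residual matrix
    return float(np.sum(E ** 2)), E, c


# Background loadings: two Gaussian-shaped components (illustrative).
ex = np.linspace(0.0, 1.0, 20)
em = np.linspace(0.0, 1.0, 25)
A = np.exp(-((ex[:, None] - np.array([0.3, 0.7])) / 0.15) ** 2)
B = np.exp(-((em[:, None] - np.array([0.4, 0.8])) / 0.15) ** 2)

clean = A @ np.diag([1.0, 0.5]) @ B.T
# A contaminant peak located where the background model has no signal.
peak = np.outer(np.exp(-((ex - 0.9) / 0.05) ** 2), np.exp(-((em - 0.1) / 0.05) ** 2))
polluted = clean + peak

ssr_clean, _, c_clean = residual_ssr(clean, A, B)
ssr_polluted, _, _ = residual_ssr(polluted, A, B)
```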

2.3. Qualitative Discrimination Based on Threshold

Threshold-based methods are often used for image segmentation because they are simple to compute. The grayscale of the image is usually divided into several parts by one or several thresholds, and pixels belonging to the same part are considered to be the same object. In the same way, this paper separates test samples into two classes: normal drinking water samples and organic pollutant samples.

This paper aims to set a reasonable threshold on the object sequence used for qualitative discrimination. In other words, by applying a threshold to the residual sum of squares, a new unknown sample can be qualitatively identified. Setting the threshold is a critical problem. If the threshold is too large, polluted water samples cannot be detected; if the threshold is too small, drinking water may be mistakenly detected as polluted water.

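In mathematical analysis, the mean and standard deviation are often used to indicate the important characteristics of a data set. If the $n$ detection values are $x_{1}, x_{2}, \ldots, x_{n}$, then the mean is $\mu = \frac{1}{n} \sum_{i=1}^{n} x_{i}$, and the standard deviation is $\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_{i} - \mu)^{2}}$.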

Three times the standard deviation of the qualitative discrimination object sequence is commonly used to discriminate test samples qualitatively. Byer and Carlson [21] monitored 16,000 multiparameter drinking water samples online, and the data revealed that the parameters changed with time and flow. Supposing that the data follow a Gaussian distribution, the samples within three times the standard deviation ($\mu \pm 3\sigma$) accounted for 99.96%. In the research of Byer and Carlson, three times the standard deviation ($3\sigma$) is the threshold for detecting organic matter samples. For a test sample, if the detection value is within $\mu \pm 3\sigma$, it is determined to be a drinking water sample; otherwise, it is determined to be an organic pollutant sample.

In this paper, the criterion for setting the threshold is that the false positive rate of drinking water samples is under 10%, so two times the standard deviation of the residual sums of squares of 10 background drinking water samples is used as the threshold for qualitative discrimination. Specifically, the residual sums of squares $x_{i}$ of the 10 drinking water samples are first used to calculate the mean $\mu$; then, $x_{i}$ and $\mu$ are used together to calculate the standard deviation $\sigma$, and $2\sigma$ is defined as the threshold of qualitative discrimination.
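The threshold rule can be sketched in a few lines of Python. The mean and standard deviation come from the background residual sums of squares, the threshold is two standard deviations, and a test sample is flagged when its SSR exceeds the background mean by more than the threshold (the one-sided comparison described in Section 3.3). The function names, the background values, and the use of the population form of the standard deviation (`statistics.pstdev`) are assumptions for illustration.

```python
import statistics


def fit_threshold(background_ssr, k=2.0):
    """Return (mean, threshold) from background SSR values; threshold = k * sigma."""
    mu = statistics.mean(background_ssr)
    sigma = statistics.pstdev(background_ssr)  # 1/n normalisation (an assumption)
    return mu, k * sigma


def is_polluted(ssr, mu, threshold):
    """Flag a sample whose SSR exceeds the background mean by more than the threshold."""
    return (ssr - mu) > threshold


# Ten made-up background SSR values and three made-up test values.
background = [1.0, 1.2, 0.8, 1.1, 0.9, 1.0, 1.0, 1.2, 0.8, 1.0]
mu, threshold = fit_threshold(background)
flags = [is_polluted(s, mu, threshold) for s in [1.1, 1.5, 5.0]]
```

With these made-up numbers the mean is 1.0 and the threshold is about 0.27, so only the second and third test values are flagged.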

3. Results and Discussion

3.1. Sampling

One group of samples consisted of solutions of three kinds of organic matter with fluorescent characteristics at concentrations of 2 μg/L, 4 μg/L, 6 μg/L, 8 μg/L, 10 μg/L, 20 μg/L, 30 μg/L, and 40 μg/L, and the experiment was repeated three times. The other group of samples had concentrations of 1 μg/L, 3 μg/L, 5 μg/L, 7 μg/L, 15 μg/L, 25 μg/L, 35 μg/L, and 45 μg/L, and the experiment was repeated two times. In each experiment, ten drinking water samples were collected from the laboratory tap at intervals of half an hour as the background samples. Fluorescence spectra, in the form of excitation-emission matrices (EEMs), were recorded over an excitation wavelength range of 200 to 700 nm and an emission wavelength range of 200 to 700 nm. The scanning intervals for excitation and emission were both 5 nm.

3.2. Pretreatment of Spectra

Three-dimensional fluorescence spectra measured by the spectrometer contain Rayleigh and Raman scattering, as shown in Figure 3(a). The Delaunay triangulation interpolation method was used to remove the scattering from the raw data. To remove instrument noise, the EEM was subjected to Raman normalization after the scattering was removed; that is, it was divided by the Raman scattering value of water at λex = 350 nm, λem = 397 nm. The spectra before and after pretreatment are shown in Figure 3. It is clear that the three-dimensional fluorescence spectra had significant scattering before pretreatment and that the pretreatment removed the scattering while retaining the effective part of the fluorescence spectra.

3.3. Qualitative Determination Based on ATLD and Threshold

ATLD was applied to the background drinking water samples to establish the model. Because of the complexity of the composition of water samples, the number of factors in the ATLD model is hard to identify. However, there are two peaks in the spectra of drinking water after pretreatment, as shown in Figure 3(b); judging from the dissolved organic matter with fluorescent characteristics that is common in water, the two peaks may correspond to tyrosine-like protein organic matter and tryptophan-like protein organic matter. The drinking water can therefore be seen as a mixture of two kinds of organic matter without other fluorescent organic matter, and the number of factors is identified as $N = 2$. The relative excitation matrix A and relative emission matrix B of the water samples (Figure 4) and the relative concentration matrix C of each component can then be obtained. In Figure 4, factor 1 and factor 2 correspond to the two peaks in the original fluorescence spectra of the water samples, which may represent tyrosine-like protein organic matter and tryptophan-like protein organic matter.

The relative excitation matrix A, the relative emission matrix B, and the test sample matrix are used to estimate the relative concentration matrix C of each component of the test samples according to formula (4). According to the trilinear decomposition principle, combining the relative excitation matrix A and the relative emission matrix B with the estimated relative concentration matrix gives the modeling value of the spectral data of unknown test samples, as shown in Figure 5(a). Comparing Figures 5(a) and 5(b), the spectra of the water samples in Figure 5(b) have two fluorescence peaks, and the model established by ATLD in Figure 5(a) can characterize the fluorescence signals caused by the main organic matter in the drinking water.

Subtracting the modeling value from the measured value of the spectral matrix gives the residual matrix (Figure 5(c)), which indicates the fluorescence signal caused by instrument noise and by the remaining scattering that cannot be represented by the model. The ATLD model established with the drinking water can be used to explain its main organic matter composition and content.

For organic matter samples, the three-dimensional fluorescence spectra can be considered as the superposition of the drinking water spectra and the organic matter spectra. Figure 5(d) shows that the ATLD model based on the drinking water background can represent the drinking water part of the spectra of the organic matter solution. However, according to Figure 5(f), rhodamine B, which does not belong to the background water sample, cannot be explained by the model. This can be used as the basis for distinguishing background drinking water samples from organic pollutant samples.

From the residual matrices, the residual sums of squares of the background drinking water samples and the test samples are calculated and used as the target sequence for qualitative determination. The residual sums of squares for one group of experimental data are shown in Figure 6.

The mean and standard deviation of the residual sums of squares of the drinking water samples were calculated, and then the mean of the background water samples was subtracted from the residual sum of squares of each unknown test sample. If the difference is larger than the threshold, the sample is judged to be an organic pollutant sample; otherwise, it is judged to be a drinking water sample. The result of qualitative discrimination depends largely on the threshold selection. As mentioned in Section 2.3, the criterion for selecting the threshold is that the false positive rate of background drinking water samples is less than 10%, so two times the standard deviation of the drinking water samples is selected as the threshold. The experimental result is shown in Table 1.

The result of qualitative discrimination shows that the detection rates of organic matter samples are all more than 77.78%. Of the 126 organic matter samples in total, 107 are detected, so the detection rate over all five groups of experiments is 84.92%. Some contaminants with spectra similar to those of drinking water were not detected, because their spectral peaks overlap with those of drinking water. The detection rate in group 4 is not good. A possible reason is that the scattering was not removed completely in many water samples of group 4, so the model established for group 4 was not accurate enough, because the scattering part cannot be explained by the trilinear model.
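The overall detection rate quoted above can be checked with a short calculation; the helper name `detection_rate` is hypothetical, and only the two totals reported in the text are used.

```python
def detection_rate(detected, total):
    """Percentage of organic matter samples that were flagged as polluted."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * detected / total


overall = detection_rate(107, 126)   # totals reported for the five experiments
missed = 126 - 107                   # samples that were not detected
print(f"{overall:.2f}% detected, {missed} missed")
```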

4. Conclusion

This paper studied the qualitative detection of unknown contaminants in water rather than the qualitative detection of known contaminants. A method based on ATLD and a threshold was applied to three-dimensional fluorescence spectra in order to detect unknown organic pollutants with fluorescent characteristics in water. The ATLD algorithm was used to establish the model of normal water samples, and the residual matrix was obtained as the difference between the model matrix and the measured matrix. The residual sum of squares was calculated from the residual matrix and compared with the threshold to judge whether the test sample was an organic pollutant sample or a normal water sample. To verify the approach, experiments analyzing the spectra of water samples and organic contaminant samples were carried out.

The results show that feature extraction based on the ATLD model can be used to discriminate qualitatively whether test samples are polluted, provided that the pollutants have fluorescent characteristics. However, the detection rate of the qualitative discrimination method based on the ATLD model fluctuated. This may be because feature extraction based on ATLD and the residual sum of squares depends on the background samples; that is, the result of the method is related to the quality and quantity of those samples.

In general, the qualitative discrimination method based on ATLD and the residual sum of squares, combined with the threshold, can be used to detect unknown organic pollutants with fluorescent characteristics in drinking water, but the detection rate is still low, which is a problem worthy of further study. In addition, the proposed method does not handle overlapping peaks well, and this will also be a subject of further research.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was funded by the National Natural Science Foundation of China (no. 61573313) “Online water-quality anomaly detection, classification, and identification based on multi-source information fusion” and (no. U1509208) “Research on big data analysis and cloud service of urban drinking water network safety,” and the Key Technology Research and Development Program of Zhejiang Province (no. 2015C03G2010034) “Research on intelligent management and long-effective mechanism for river regulation and maintenance.”