One of the significant challenges in the food industry is the determination of the geographical origin, since products from different regions can lead to great variance in raw milk. Therefore, monitoring the origin of raw milk has become very relevant for producers and consumers worldwide. In this exploratory study, midinfrared spectroscopy combined with machine learning classification methods was investigated as a rapid and nondestructive method for the classification of milk according to its geographical origin. The curse of dimensionality makes some classification methods struggle to train efficient models. Thus, principal component analysis (PCA) has been applied to create a smaller set of features. The application of machine learning methods such as PLS-DA, PCA-LDA, SVM, and PCA-SVM demonstrates that the best results are obtained using PLS-DA, PCA-LDA, and PCA-SVM methods which show a correct classification rate (CCR) of 100% for PLS-DA and PCA-LDA and 94.95% for PCA-SVM, whereas the application of SVM without feature extraction gives a low CCR of 66.67%. These findings demonstrate that FT-MIR spectroscopy, combined with machine learning methods, is an efficient and suitable approach to classify the geographical origins of raw milk.

1. Introduction

Consumers are increasingly demanding guarantees concerning the quality and safety of food products, especially when they are of animal origin. Food authentication consists of checking that the product is consistent with the statements made on the label [1]. Falsification or willful mislabeling is usually used to reduce production costs [2]. In the milk industry especially, mislabeling can be used to confuse the industry and consumers about the origin of the milk, since products of different origins can have different qualities [1].

The European Union (EU) promotes dairy product quality programs designed to support farmers and safeguard their product names from abuse and imitation [3]. Particularly, the EU promotes two principal quality regimes that are based on the valuation of the geographical origins of foods, called Protected Designation of Origin (PDO) and Protected Geographical Indication (PGI), which respectively identify food products that are produced or closely associated with a given geographical area [4].

Food scientists support such programs by developing analytical techniques to enhance the capacity to identify the geographic origin of food [5]. These techniques are classified into two types: those based on targeted approaches and those based on untargeted approaches [6].

Targeted approaches are usually the most suitable for regulatory purposes, as they are very specific and sensitive [7]. These approaches are based on a screening that looks for a shortlist of predetermined chemical compounds [7]. Among these techniques, we can find the following: capillary electrophoresis (CE) [8], high-performance liquid chromatography (HPLC) [9], and gas chromatography (GC) [10, 11]. In addition to their advantages, these analytical methods have certain inconveniences as they are time-consuming and may require the use of expensive and polluting reagents. Additionally, these methods are not effective to cover the increased need for an analytical workflow that takes several hours [12]. While the nontargeted approaches allow the identification of the unknowns [13]. These nontargeted analyses are therefore increasingly gaining importance in the food sector, such as nontargeted metabolomics and spectroscopy [6]. This evolution is also explained by the increasing attention to very complex authentication issues such as geographical origin or agricultural techniques, thus increasing the need for innovative analytical approaches [6, 14]. Among these methods, we find spectroscopic methods [15], such as infrared spectroscopy, UV-Visible, Raman, and fluorescence spectroscopy [12, 16, 17]. Food analysis using spectroscopic techniques has become very common and widespread since these methods are extremely fast, inexpensive, nondestructive, and have no negative impact on the environment [18, 19]. The nontargeted analysis is a very difficult task, as it requires thorough processing of the generated data set. In order to make these data meaningful, multistep strategies using chemometric tools are needed before the eventual identification of a particular signal among a forest of interfering signals [20].

In order to authenticate the milk, several spectroscopic techniques have been used, such as near-infrared spectroscopy (NIR) [21], midinfrared spectroscopy (MIR) [2225], Raman spectroscopy [1, 26], and fluorescence spectroscopy [27, 28]. In these studies, quantitative chemometric approaches were used for the detection of adulteration, prediction of some quality parameters, and classification of milk according to species and origin.

To the best of our knowledge, there is only one study concerning the determination of the geographical origin of milk by means of MIR spectroscopy [25]. This study consists of studying the capacity of the fatty acid composition and the spectral information obtained by MIR spectroscopy for the discrimination between sheep milk coming from different geographical areas of Italy. In this study, the spectral results and the chemical composition (fatty acid) are processed by a genetic algorithm.

In this study, we develop several methods based on midinfrared spectroscopy, combined with machine learning, to classify the geographical origins of milk from 4 regions in Morocco. In detail, PLS-DA has been studied to simplify the feature extraction process while ensuring the precision and accuracy of the prediction. In addition, LDA and SVM were also applied to develop a set of classifiers with the spectral features selected by principal component analysis (PCA). The results provide a new insight and an attempt to apply MIR spectroscopy and machine learning methods in food and agri-food applications, and also to show the importance of using PCA as a preliminary method of feature reduction for building accurate, sensitive and specific classification models.

2. Materials and Methods

2.1. Sampling

This study was conducted on the raw milk of cows coming from various regions of Morocco (region 1: Mechra Bel Ksiri, region 2: Souk Larbaa, region 3: Maaziz, and region 4: Tiflet). The samples were collected during the year 2020. Samples are frozen immediately after collection in a cooler stacked with ice packs on their way to the freezer where they are stored until the day of analysis.

150 raw milk samples were selected to conduct this study; region 1 = 36, region 2 = 36, region 3 = 42, and region 4 = 36. In order to build a classification and prediction model, 100 samples are used for model training and 50 samples are used for model validation.

2.2. Spectral Acquisition

Fourier-transformed midinfrared spectra of cow's milk samples were recorded on a JASCO FTIR 460 PLUS spectrometer (PIKE Technologies, Madison, USA) in the spectral region between 600 and 4000 cm−1. The instrumental resolution was 4 cm−1, and each spectrum was composed of 3400 data points. Using a micropipette, each milk sample was placed on the crystalline surface of the ATR, which was cleaned for each analysis using the acetone solution, allowing both cleaning and drying of the ATR accessory. The spectra obtained were processed with the software (Spectra manager) in order to eliminate the effect of carbon dioxide in the three corresponding bands (at 2349 cm−1, 1388 cm−1, and 667 cm−1) as well as the effect of moisture (at 3756 cm−1, 3652 cm−1, and 1595 cm−1).

2.3. Data Analysis

In order to adequately process the spectral data obtained by midinfrared spectroscopy, multivariate data analysis (chemometrics) was used to explore the data and build classification models. The multivariate analysis became a major component of analytical chemistry. This is related to the necessity for computational approaches able to extract relevant information from increasing amounts of data provided by modern analytical instruments. Generally, these multivariate data analysis approaches concern the exploratory analysis of a single data matrix, such as principal component analysis (PCA), or the matching of an explanatory matrix to another descriptive matrix, as in regression methods such as PLS, or discriminant methods such as linear discriminant analysis (LDA), partial least squares discriminant analysis (PLS-DA), and support vector machine (SVM).

PCA is among the most commonly used methods in chemometrics since it allows to answer many objectives, such as visualization of data, detection of outliers, investigating the similarity between individuals, and the correlation between variables [29]. PCA is an unsupervised method for feature reduction that allows high-dimensional data to be projected into a new reduced-dimensional representation of the data that describes the variance of the data as much as possible with minimal reconstruction error [30]. This method provides a new set of variables, known as principal components. Each principal component represents a linear combination of the initial variables. All principal components are mutually orthogonal; therefore, there is no redundant information. The principal components together form an orthogonal database of datasets [31]. These databases can also be used as variables for other multivariate methods.

We can also find the supervised classification methods. These methods use the membership of the samples of different classes to build a model [32], such as PLS-DA, LDA, and SVM.

The PLS-DA uses the PLS method to explain and predict the membership of individuals to several classes, based on quantitative or qualitative explanatory variables [33]. PLS-DA is a relatively new technique in chemometrics that extends and merges the functionality of principal component analysis and multiple regression [12, 34, 35]. This technique consists of performing a decomposition of the two matrices, matrix of variables X and response (Y), under the condition that the factorial coordinates extracted from X should be correlated as much as possible with the factorial coordinates extracted from Y.

The main objective of linear discriminant analysis (LDA) is to be able to classify new individuals not belonging to the initial data. The idea is based on a method looking for a linear combination of the variables Xj that maximizes the similarity between the elements within the same group [36]. In other words, it is a question of finding a linear combination of Xj that maximizes the inertia or the intergroup variance and, therefore, the one that minimizes the intragroup variance. It consists in explaining and predicting the membership of an individual to a predefined class (group) from his/her characteristics measured by means of predictive variables. For large datasets such as in image recognition and spectral data, linear lines often do not allow a good separation of the groups because LDA is not the ideal method in cases where the explanatory variables are highly correlated. In such situations, it is necessary to regularize it in order to disrupt the correlation of the predictors to obtain better results. To overcome these limitations, there are other methods to extend the LDA in order to have a better classification, especially the combination of this method with other methods of variable reduction such PCA and genetic algorithm.

SVMs are a family of machine learning algorithms that solve classification, regression, and anomaly detection problems [37]. They are known for their strong theoretical guarantees, their great flexibility, and their simplicity of use, even without much knowledge of data mining. Its principle is simple: it aims at separating the data into classes using a boundary that is as simple as possible so that the distance between the different groups of data and the boundary that separates them is maximal [38]. This distance is also called “margin,” and SVMs are thus called “wide margin separators,” the “support vectors” being the data closest to the border. This notion of frontier assumes that the data are linearly separable, which is rarely the case. To overcome this, SVMs often rely on the use of “kernels” [39]. These mathematical functions allow to separate the data by projecting them in a feature space [32]. The technique of margin maximization allows us to guarantee better robustness to noise and therefore a more generalizable model.

These classification approaches struggle to build efficient models when the size of the spectral data set is very high, the so-called “curse of dimensionality,” since the redundancy of spectral variables affects the classification results of conventional machine learning models [4042]. This is particularly pertinent for algorithms that use distance calculations, such as LDA and SVM [41]. Feature extraction is the most crucial step to overcome the curse of dimensionality through the creation of a smaller set of features, so spectral features were extracted using PCA as a reduction variable method [40]. This is the case in this study, in which a preliminary treatment by PCA is used to extract synthetic variables used for the construction of classification models.

In classification issues, accuracy is commonly given as an evaluation metric. However, if there are more than two categories, accuracy on its own can be misleading and will not provide reliable information. Exploiting the confusion matrix obtained by the classification methods provides more information about their performance. The confusion matrix provides insight into the types of errors committed during estimation. As a result, it shows which points are correctly classified and which are misclassified.

In the present study, accuracy, sensitivity, and specificity parameters were used to compare the classification performance of different methods employed. Accuracy evaluates the efficiency of the algorithm by displaying the probability of the true value of the class target. For our purposes, accuracy is the number of correct predictions of geographical regions in relation to the total number of predictions. However, we denote sensitivity as the proportion of positive events that are well classified and specificity as the proportion of negative events that are well classified. These parameters are calculated according to the following formulas:where FN, FP, TN, and TP designate false negatives, false positives, true negatives, and true positives, respectively.

The flowchart of the main procedures applied to build the classification models is presented in Figure 1.

2.4. Software

PLS-DA models were built on the basis of the partial least squares algorithm using the NIPALS algorithm (nonlinear iterative partial least squares). PCA, PLS-DA, LDA, and SVM methods were applied using the Unscrambler software 10.4.

3. Results and Discussion

3.1. FT-MIR Spectra of Milk Samples

The spectra of the milk obtained (Figure 2) show the presence of spectral bands of interest. A broadband between 3700 and 3100 cm−1 corresponds to stretching of –ΟΗ and –NH in proteins, 3000–2800 cm−1 coincides with C–H stretching in fatty acids, 1.750–1.650 cm−1 corresponds to –C=O of fatty acids and esters, and a band between 1,660 and 1,446 cm−1 corresponds to –C=O and –NH of the Ι and ΙΙ amides of the proteins arising from various combinations of vibrations in the peptide linkages and secondary structure of the casein protein. Amide I vibration is mainly due to stretching of C=O bonds, and amide II vibration is due to deformation of N–H bonds and stretching of C–N bonds. The amide I vibration is measured in the region of 1660–1550 cm−1 and the amide II vibration in the region of 1550–1446 cm−1. Other small bands were observed in the spectral zone 1200–800 cm−1; this region corresponds to the stretching –C=O of polysaccharides and C=C of acids [43, 44].

These four groups of samples show slight differences in the band of 3000–2800 cm−1; 1,750–1,650 cm−1; and 1,660–1,446 cm−1, which mainly represents the absorption of substances, such as proteins, fatty acids, and esters.

However, it is still difficult to directly categorize the geographical origins owing to these minor differences. Consequently, it is necessary to study the classification model in order to help to identify the geographical origins of raw milk.

To further describe the data in a very small dimensional space, a PCA was first performed on the milk spectra to exploit the data set and to obtain information about the distribution and behavior of the samples regarding the measured variables that represent the wavenumber of the MIR spectral data.

3.2. Identification of the Geographic Origin of Raw Milk
3.2.1. Principal Component Analysis

PCA was applied to raw spectral data to visualize the samples in a well-reduced space in order to get a clear view of the data distribution. PCA shows that the first three components account for 97% (PC1: 65%, PC2:17%, and PC3:15%) of the total variability contained in datasets. From Figure 3, which represents the configuration of the samples on the first three PCs, we can distinguish a grouping of the samples according to their geographical origin, indicating that the samples belonging to each group had similar FTIR properties. However, the PCA plot shows that some milk samples from different regions exhibited high overlapping due to the same compositional properties. However, the first three PCs are not sufficient to distinguish between R1, R3, and R4. For this reason, more than 3 components are selected and used as feature variables for the construction of classification models such as PLS-DA, LDA, and SVM.

3.3. Discrimination Analysis

In order to build a model able to discriminate and predict the membership of milk according to their geographical origin. In this study, three different supervised classification algorithms, including PLS-DA, LDA, and SVM, were investigated to classify different geographical origins of raw milk. For the PLS-DA and LDA, feature extraction is used directly, since these methods cannot be applied directly on spectral data without feature extraction of latent variables. While the SVM is applied without and with feature extraction of latent variables by PCA, in order to demonstrate the importance of feature extraction variables in improving classification performance, finally, the classification models were compared.

(i) Partial Least Square Discrimination Analysis. PLS-DA analysis was applied over the entire MIR spectral region (4000 cm−1 and 600 cm−1) without spectral preprocessing and with spectral preprocessing. The choice of the optimal number of latent variables required for a good classification and prediction is made on the basis of the values obtained by the cross-validation procedure, namely, the root mean square error and R-square which are calculated on the basis of the leave one out algorithm [45]. In our case, 12 latent variables were selected.

Based on the results in Table 1 obtained by the PLS-DA method, we conclude that the PLS-DA method provides good results for the classification of milk according to their geographical origin, these performances are represented by the high value of the R-square and the low value of the RMSE for both calibration and cross-validation results. These results also show that the spectral data preprocessed by the detrend algorithm using polynomial degree 1 gives the lowest value of RMSE and the highest value of R-square.

(ii) Linear Discriminant Analysis. Subsequently, the PCA-LDA model was developed to classify and predict milk according to its geographical origin. LDA was applied on the first 5 components obtained by PCA using the statistical method of Mahalanobis. In our case, the application of the LDA analysis directly on the spectral data cannot be done because the number of spectral variables is higher than the number of samples.

Based on the results obtained by PCA-LDA, we can conclude that this method gives powerful results for the classification of milk. This classification ability is represented by the high values obtained for sensitivity, specificity, and classification correct rate % (CCR), as shown in Table 2. It was clear that all the built models provide high sensitivity (otherwise known as the true-positive rate), as demonstrated by the high value that reaches 100% for the three models. These models also show a high specificity, 99% for the model built on raw data, 100% for the model built on preprocessed data by detrend polynomial degree 1, and 98.5% for the model built on preprocessed data by detrend polynomial degree 2.

(iii) Support Vector Machine. The SVM method has been applied to the spectroscopic data with and without feature extraction, using the radial basis function method. The use of the SVM method without extraction of characteristics shows a CCR of 66.97, 69.68, and 67.97, respectively, for raw spectral data, spectral data corrected with detrend using a polynomial of orders 1 and 2, as shown in Table 3. According to the confusion matrix obtained by this method, we can observe that not all samples belong to their class for a certain group. This method shows an important specificity, i.e., a good capacity for the good classification of negative events. Therefore, low sensitivity has been observed for this method, i.e., low classification of positive events.

However, the application of the SVM method on the first 5 components obtained by PCA shows good results represented by the high values of CCR which reach 98.51, 98.49, and 99.00 using features extracted from spectral data without spectral preprocessing and spectral data preprocessed by detrend using polynomials of degree 1 and 2, respectively, as shown in Table 3. This method shows a sensitivity of 100%, which verifies the high capacity of this statistical approach for the prediction of correct events as well as correct events, and a high specificity ranging between 96.2% and 100%.

3.4. Validation of Classification Models

In order to evaluate the classification and predictive capacity of the constructed models, external validation of the models was carried out using different samples to those used for the construction of the models.

In the case of PLS-DA, the predicted y-value of a given sample is close to 1 (or greater than 0.5) and allocates the sample to a specific category, while a sample with a predicted y-value of less than 0.5 allocates it outside the category.

The results found by the external validation of the machine learning classification models (Table 4) show that the best results are provided by applying PLS-DA, PCA-LA, and then PCA-SVM which shows a strong classification capacity represented by the high values of CCR, specificity, and sensitivity which reaches 100% for PLS-DA and PCA-LDA and a CCR of 94.95 for PCA-SVM, while the application of the SVM method without feature extraction gives a CCR of 66.67% which is considered unsatisfactory for the corresponding classification of samples according to their membership. These results show the usefulness of using PCA as a first step for classification methods. Applying this method as a primary step for SVM analysis improves considerably the classification performance as shown by the results.

4. Conclusion

The present work constitutes an exploratory investigation on the use of vibrational spectroscopy for milk samples in Morocco. In which MIR spectroscopy combined with machine learning methods has been proposed to quickly identify the geographical origin of milk. After sample collection, the MIR spectra of raw milk can be rapidly acquired. Then, the obtained MIR spectral data are modeled using four classification methods (including PLS-DA, PCA-LDA, SVM, and PCA-SVM). Then, acceptable classification results are provided by these classification approaches. Compared with the PLS-DA (CCR = 100%), the PCA-LDA and PCA-SVM methods obtained more accurate and reliable classification results and were able to identify raw milk samples from different geographical origins with a CCR of 96% and 100% for the test set. In addition, satisfactory sensitivity and specificity were found reflecting the performance of these classification approaches. On the other hand, the application of the SVM method without feature extraction gives a low CCR of 65%. These results also prove that the use of PCA as a preliminary method for machine learning methods improves the classification performance by extracting features and reducing the data size.

The suggested exploratory approach based on MIR spectroscopy combined with machine learning methods proved to be an efficient strategy for the dairy industry to identify the geographical origin of raw milk. The above results confirm that the proposed method can be considered as a promising alternative for determining geographical origin.

Data Availability

The data are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest.


The authors would like to thank all the people who contributed to make this work.