Journal of Spectroscopy

Volume 2018, Article ID 7960314, 11 pages

https://doi.org/10.1155/2018/7960314

## A Comparison of Regression Tree Approaches to Modelling the Efficacy of Water Hyacinth Biocontrol Using Multitemporal Spectral Datasets

School of Agriculture, Earth and Environmental Sciences, University of KwaZulu-Natal, P/Bag X01 Scottsville, Pietermaritzburg 3209, South Africa

Correspondence should be addressed to Na’eem Hoosen Agjee; agjeen2@gmail.com

Received 25 July 2017; Revised 4 December 2017; Accepted 15 February 2018; Published 14 May 2018

Academic Editor: Pedro D. Vaz

Copyright © 2018 Na’eem Hoosen Agjee et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Water hyacinth (*Eichhornia crassipes*) is an exotic plant species that is effectively controlled by *Neochetina* spp. weevils. This study aimed to determine whether spectroscopic data can be used to predict insect-induced stress on water hyacinth plants. Single target regression trees (STRTs), multitarget regression trees (MTRTs), and random forest multitarget regression trees (RF-MTRTs) were used to predict feeding scar damage (FSD) and relative leaf chlorophyll content (RLCC) from hyperspectral canopy reflectance data. The results show that, in terms of the correlation coefficient, STRTs (training accuracy: 76%–97%; validation accuracy: 47%–86%) perform better than MTRTs (training accuracy: 74%–90%; validation accuracy: 45%–77%) for all infestation levels, but separate STRTs are difficult to interpret simultaneously. In contrast, MTRTs (size: 23–35 nodes) are much smaller and more interpretable than STRTs (size: 11–47 nodes) because they predict FSD and RLCC simultaneously. Importantly, RF-MTRTs (training accuracy: 95%–98%; validation accuracy: 55%–88%) yield better predictive performance than STRTs and MTRTs for all infestation levels. It is concluded that MTRTs are better suited to model interpretation, whereas RF-MTRTs offer improved predictive performance.

#### 1. Introduction

Water hyacinth (*Eichhornia crassipes*) is an exotic invasive plant species that occurs as mats on the surface of freshwater bodies [1]. Native to Brazil, water hyacinth has spread to most tropical and subtropical countries suitable for its development [2, 3]. Water hyacinth was first introduced into South African waters around the 1900s [2] and is currently classified as a category 1b invader according to South African legislation, requiring compulsory control. The resilience of this highly invasive exotic weed can be attributed to the prevalence of highly eutrophic waters and the absence of natural enemies [3–5]. Water hyacinth plants have been reported to hinder fishing activities, reduce water quality, impede water usage, and obstruct navigable waterways [6–9], thus placing severe strain on South Africa’s limited water resources. In response to this growing ecological concern, biocontrol programs have been initiated in an effort to alleviate the ecological impact imposed on freshwater ecosystems.

The release of biocontrol agents is recognized as an effective solution to sustainably control water hyacinth monocultures. *Neochetina eichhorniae* and *Neochetina bruchi* are two biocontrol agents currently being introduced into freshwater ecosystems throughout South Africa. Their use is warranted as many studies have demonstrated the efficacy of *N. eichhorniae* and *N. bruchi* weevils in reducing weed density, plant vigor, and reproductive potential [10, 11]. Adult weevils achieve this by feeding, which forms rectangular scars on the surface of the leaf [12, 13]. The weevils remove extensive proportions of epidermal tissue at the leaf surface and also feed on the photosynthetic layers below the leaf surface [13, 14]. Continuous damage negatively affects the functioning of the chloroplasts, subsequently reducing relative leaf chlorophyll content (RLCC) and the photosynthetic capacity of the leaves [15]. Consequently, feeding scar damage (FSD) and RLCC can be used as bioindicators of the morphological and physiological damage inflicted on water hyacinth plants. Over time, the combined effects of morphological and physiological stress result in increased leaf mortality, a reduction in plant biomass, and possible plant mortality [16–18]. The ability to quantify the damage inflicted by biocontrol, that is, RLCC and FSD, at variable infestation levels is essential for establishing the efficacy of biocontrol agents and for characterizing the health status of water hyacinth plants.

Currently, reconnaissance surveys are conducted by manually sampling water hyacinth plants periodically to ascertain the severity of the damage inflicted by biocontrol and the health status of water hyacinth plants. Recently, hyperspectral remote sensing technologies have emerged as a powerful tool to synoptically detect, monitor, and predict vegetation stress [19–22]. Laboratory-based spectroscopic studies can contribute towards exploring the operational potential of predicting different severities of biocontrol damage from remotely sensed data [23]. Hyperspectral data is captured at a high spectral resolution (10 nm), enabling the identification of key spectral regions or diagnostic features of the leaf optical properties that are related to the biochemical and/or biophysical status of the plants [24]. Importantly, spectral regions that represent responses to key physiological processes (chlorophyll content, chlorophyll fluorescence, carbon, and nitrogen) can be used to detect vegetation stress before its effects become visible [25]. Generally, changes in leaf reflectance in the visible region (350–700 nm) and near-infrared region (700–1000 nm) of the electromagnetic spectrum are indicative of vegetation stress [26]. The ability to relate key spectral regions or bands to biocontrol damage reference measurements would allow for the development of calibrated models to monitor and possibly predict previsual and visual biocontrol damage. Consequently, it is imperative to investigate state-of-the-art modelling techniques to determine whether they can produce high nowcasting and possibly predictive accuracies when dealing with high dimensional datasets.

Over the last decade, a suite of machine learning algorithms (e.g., artificial neural networks, support vector machines, and fuzzy logic) has emerged as an accurate alternative to conventional parametric linear modelling techniques. One such technique is single target regression trees (STRTs), which conduct binary recursive partitioning to produce a set of rules and a regression model that predicts a single response variable [27, 28]. Several studies have successfully demonstrated the utility of STRTs as a powerful tool for data prediction [29–31]. This study attempts to predict biocontrol damage on water hyacinth plants, which, to the authors’ knowledge, has not been explored before. STRTs offer numerous advantages as a potential operational tool for biocontrol damage monitoring and prediction. STRTs are efficient when dealing with high dimensional datasets and produce a descriptive model [32]. Importantly, STRTs do not rely on data distribution assumptions and the algorithm can map nonlinear relationships between features (i.e., bands) and response variables in complex data spaces [32]. However, a limitation of STRTs is that only one response variable can be predicted per training session. In addition, STRTs can lead to the construction of complex trees that do not generalize well from the training data, resulting in overfitting. To ascertain and understand the overall status of water hyacinth plants, environmental managers would have to construct an STRT for each response variable and then try to aggregate the output of the models. This process would be time consuming and inefficient. A more efficient approach would be to construct a model that simultaneously predicts multiple biocontrol parameters (i.e., responses) in one training session.

Multitarget regression trees (MTRTs) predict several numeric response variables simultaneously [29] and offer several advantages over STRTs. For example, MTRTs are smaller in size than STRTs and are faster to train, thus making them more efficient to implement [30]. Furthermore, MTRTs explain dependencies between different variables [30] and are more interpretable than several STRTs [33]. Several studies have explored and successfully demonstrated the utility of MTRTs to predict multiple response variables simultaneously [29, 30, 33–35]. However, to the authors’ knowledge, only Stojanova et al. [30] have used MTRTs and STRTs to predict vegetation height and canopy cover from remotely sensed data. Their results showed that the MTRTs performed significantly better than STRTs when predicting canopy cover. This highlights the operational potential of MTRTs to simultaneously predict RLCC and FSD from hyperspectral data. This study attempts to implement MTRTs not only to predict biocontrol damage but also to identify the most influential bands, which is important for understanding the relationship between influential bands and response variables. Although MTRTs construct easily interpretable models with good predictive performance, they are unstable. Small variations in the data might result in a completely different tree being generated. Unstable predictive models can be combined into an ensemble to improve predictive performance. Random forest multitarget regression trees (RF-MTRTs) are an ensemble of predictive models that, when combined, increase the predictive performance of their base models [30]. For example, Kocev et al. [34] reported an improvement in the predictive performance when implementing RF-MTRTs compared with MTRTs and attributed the improvement to the ensemble method. However, despite the advantage of improving the predictive performance, RF-MTRTs are not interpretable because hundreds of MTRTs are constructed in an ensemble. Consequently, depending on the goal of the application, either an interpretable model or a model that yields high predictive performance can be generated.

In light of the above, this study aimed to determine whether hyperspectral data can be applied to monitor and predict biocontrol measures on water hyacinth plants at variable infestation levels. More specifically, the objectives of this study were to (1) compare the interpretability of STRTs and MTRTs when predicting FSD and RLCC on water hyacinth plants at variable infestation levels and (2) compare the predictive performance of STRTs, MTRTs, and RF-MTRTs when predicting FSD and RLCC on water hyacinth plants at variable infestation levels.

#### 2. Materials and Methods

##### 2.1. Experimental Procedure

The experimental procedure implemented in this study was similar to that implemented by Agjee et al. [23]. However, in this study, three *Neochetina* spp. infestation levels, that is, low (two adult male weevils per plant), medium (four adult male weevils per plant), and high (six adult male weevils per plant), were considered in order to model biocontrol measures from plant spectral reflectance [6]. The three infestation levels were then used for all subsequent analyses.

##### 2.2. Leaf Variables

Leaf variables sampled included FSD and RLCC. Leaf variables were sampled on the two youngest and two oldest unfurled leaves on each plant [13]. FSD was determined by counting the number of weevil feeding scars on the adaxial leaf laminae of each of the leaves. Subsequently, the chlorophyll content of each of the leaves was measured using a SPAD-502 chlorophyll meter [36]. The SPAD-502 chlorophyll meter has a measurement area of 0.06 cm² and utilizes the 650 nm and 940 nm wavelengths to estimate relative chlorophyll content [37, 38]. Three measurements were recorded on each of the leaves by positioning the leaf over the receptor window and closing the measuring head. FSD and RLCC measurements were averaged for each plant.

##### 2.3. Canopy Reflectance Measurements

Canopy reflectance spectra were captured for the low, medium, and high infestation levels in the same manner as that employed by Agjee et al. [23] over five weeks of infestation. However, in this study, the reflectance spectra captured in each week were combined for each infestation level.

##### 2.4. Statistical Analysis

###### 2.4.1. Analysis of Variance

A one-way analysis of variance (ANOVA) was used to ascertain whether differences in FSD and RLCC occur between the variable infestation levels. ANOVAs were performed using TANAGRA version 1.4.50 [40].
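As an illustration of the test described above, the following minimal sketch computes the one-way ANOVA *F* statistic by hand. It is not the TANAGRA procedure used in the study; the function name `one_way_anova_f` and the three groups of FSD values are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k = len(groups)       # number of groups (infestation levels)
    n = len(all_values)   # total number of observations

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)

    df_between = k - 1
    df_within = n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical mean FSD counts per plant for three infestation levels.
low = [2.0, 3.0, 4.0]
medium = [5.0, 6.0, 7.0]
high = [8.0, 9.0, 10.0]

f_stat, df_b, df_w = one_way_anova_f([low, medium, high])
print(f_stat, df_b, df_w)
```

A large *F* value relative to the *F* distribution with the returned degrees of freedom would indicate that the response differs between infestation levels.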

##### 2.5. Machine Learning for Biocontrol Modelling

###### 2.5.1. Single Target Regression Trees

Individual STRTs were constructed to predict FSD and RLCC from canopy reflectance spectra for each infestation level. An STRT is a hierarchical structure that recursively partitions a set of training observations to produce a model that predicts a single response variable from unseen observations [41]. An STRT is comprised of a root node, branches, internal nodes, and leaves [34]. Initially, the algorithm begins at the root node, which contains all the training observations. Subsequently, the dataset is recursively partitioned into subsets at each internal node based on the predictor test. The heuristic function used for selecting the predictor test at each internal node is based on the intracluster variation summed over the subsets induced by the predictor test [29]. The intracluster variation is defined by

$$N \cdot \sum_{t=1}^{T} \mathrm{Var}[y_t],$$

where *N* is the number of examples in the cluster, *T* is the number of response variables, and $\mathrm{Var}[y_t]$ is the variance of response variable $y_t$ in the cluster.

The goal of the heuristic function is to guide the algorithm towards small trees with good predictive performance [29]. The partitioning process is terminated when a stopping criterion is met [29, 34]. In this study, the *F*-test stop criterion was used, where a node is split only when a statistical *F*-test indicates a significant reduction of variance inside the subsets. The *F*-test value was optimized over the following values: 0.001, 0.005, 0.01, 0.05, 0.1, and 0.125. On termination, the prediction value of the response variable is stored in each leaf. The predicted value is calculated as the mean value of the response variable for the observations that are stored in that leaf [30].
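The split selection described above can be sketched as follows. This is a simplified illustration rather than the CLUS implementation: it searches thresholds on a single band, uses the population variance (an assumption, since the variance estimator is not stated here), and omits the *F*-test stopping criterion. The function names and the reflectance and FSD values are hypothetical.

```python
from statistics import pvariance


def intracluster_variation(rows):
    """N * sum over targets of Var[y_t] for a cluster of target tuples."""
    n = len(rows)
    if n == 0:
        return 0.0
    n_targets = len(rows[0])
    return n * sum(pvariance([r[t] for r in rows]) for t in range(n_targets))


def best_threshold(x, targets):
    """Return (threshold, score) minimizing the variation summed over both subsets."""
    best = None
    values = sorted(set(x))
    # Candidate thresholds are midpoints between consecutive band values.
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        left = [t for xi, t in zip(x, targets) if xi <= thr]
        right = [t for xi, t in zip(x, targets) if xi > thr]
        score = intracluster_variation(left) + intracluster_variation(right)
        if best is None or score < best[1]:
            best = (thr, score)
    return best


# Hypothetical reflectance at one band and one response (FSD) per plant.
band = [0.10, 0.12, 0.30, 0.32]
damage = [(1.0,), (1.0,), (5.0,), (5.0,)]

threshold, score = best_threshold(band, damage)
print(threshold, score)
```

With a single response variable (*T* = 1) this corresponds to the STRT case; passing tuples with more than one element gives the multitarget case.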

In this study, STRTs were pruned using the M5 pruning method [42–44]. The M5 pruning method builds a multivariate linear model for each node using the observations in the node and the predictors tested in the subtree [42–44]. M5 then calculates the mean absolute deviation of the linear model which is then multiplied by a heuristic penalization factor [42–44]. The resulting error estimate is then compared with the error estimate for the subtree, and if the latter is larger, the subtree is pruned [42–44]. STRTs were constructed using the CLUS software [45].
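The pruning decision described above can be sketched in a few lines. The penalization factor (n + v)/(n − v), where n is the number of observations in the node and v the number of parameters in the linear model, is taken from the general description of M5 and is an assumption here, since the factor is not spelled out in this section. The function names and residual values are hypothetical, and fitting the linear model itself is left out.

```python
def penalized_error(abs_residuals, n_parameters):
    """Mean absolute deviation of a node's linear model times the penalty factor."""
    n = len(abs_residuals)
    if n <= n_parameters:
        # Too few observations to support the model: treat its error as unbounded.
        return float("inf")
    mad = sum(abs_residuals) / n
    # Assumed penalty factor (n + v) / (n - v).
    return mad * (n + n_parameters) / (n - n_parameters)


def should_prune(abs_residuals, n_parameters, subtree_error):
    """Prune when the subtree's error estimate exceeds the node model's estimate."""
    return subtree_error > penalized_error(abs_residuals, n_parameters)


# Hypothetical absolute residuals of the linear model fitted at one node.
residuals = [0.5, 0.5, 1.0, 1.0, 0.5, 0.5, 1.0, 1.0]
node_error = penalized_error(residuals, n_parameters=2)
print(node_error, should_prune(residuals, 2, subtree_error=1.5))
```

The penalty grows as the number of parameters approaches the number of observations, which discourages replacing a subtree with an overparameterized linear model.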

###### 2.5.2. Multitarget Regression Trees

An MTRT was constructed to simultaneously predict FSD and RLCC using canopy reflectance data for each infestation level. An MTRT is a hierarchy of clusters that produces a model to simultaneously predict several response variables from unseen observations [29]. Initially, the algorithm begins at the root node, which contains a set of training data. Subsequently, the training dataset is recursively partitioned into smaller subsets using a heuristic function that selects a predictor test at each node [29]. As with STRTs, the heuristic function used for selecting the predictor test at each internal node is based on the intracluster variation summed over the subsets induced by the predictor test [29]. The variance function is standardized so that the relative contribution of the different targets to the heuristic score is equal [33]. The partitioning process stops when the *F*-test stop criterion is met [35]. On termination, the response variables (i.e., FSD and RLCC) are calculated for each leaf. The predicted value for each response variable is calculated as the mean value of the response variable for the observations that are stored in the corresponding leaf [30]. MTRTs were pruned using the M5 pruning method [42–44]. In this study, MTRTs were constructed using the CLUS software [45].
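The standardization and the leaf prediction described above can be illustrated as follows. Dividing each target's variance by its variance over the full training set is one common way to equalize the contribution of targets measured on different scales; it is assumed here for illustration and may differ from the exact CLUS formulation. The function names and the (FSD, RLCC) pairs are hypothetical.

```python
from statistics import mean, pvariance


def standardized_variation(rows, root_variances):
    """N * sum over targets of Var[y_t] / Var_root[y_t] for one cluster."""
    n = len(rows)
    return n * sum(
        pvariance([r[t] for r in rows]) / root_variances[t]
        for t in range(len(root_variances))
    )


def leaf_prediction(rows):
    """Predict every target as the mean of that target over the leaf's observations."""
    return tuple(mean(column) for column in zip(*rows))


# Hypothetical (FSD, RLCC) pairs; the two targets are on different scales.
training = [(2.0, 40.0), (4.0, 38.0), (20.0, 30.0), (22.0, 28.0)]
root_variances = [pvariance(column) for column in zip(*training)]

# At the root each standardized variance equals 1, so both targets weigh equally.
root_score = standardized_variation(training, root_variances)
left_leaf = leaf_prediction(training[:2])
print(root_score, left_leaf)
```

Without the standardization, the target with the larger numeric range (FSD in this toy example) would dominate the choice of split.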

###### 2.5.3. Random Forest Ensemble of Multitarget Regression Trees

The random forest algorithm constructs an ensemble of individually grown MTRTs, with the prediction of the response variables (i.e., FSD and RLCC) based on an average of the predictions of all the regression trees in the forest [46–48]. At the outset, bootstrap aggregation (bagging) is employed to create new bootstrap samples [49, 50]. Subsequently, a single MTRT is built for each bootstrap sample and the tree is grown fully without pruning [33]. Random forest introduces randomness through bagging and by choosing a random subset of predictors at each splitting node. Because of this randomness, the correlation between individual MTRTs is reduced and the accuracy of the prediction is improved [48, 51]. The final prediction of each response variable is calculated by averaging the output predictions of the MTRT models in the ensemble [30]. The random forest multitarget analysis was implemented within the CLUS software [45]. In this study, the number of trees grown (*ntree*) per ensemble was 500. The default *mtry* value was used, which is given by the function *F* = log₂(number of predictors + 1).
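The three ingredients of the ensemble described above (bootstrap samples, a random subset of predictors at each split, and averaging of the per-tree predictions) can be sketched as follows. Tree induction itself is omitted, truncating the *mtry* value to an integer is an assumption, and the band count and per-tree predictions are hypothetical.

```python
import math
import random


def default_mtry(n_predictors):
    """mtry = log2(number of predictors + 1), truncated to an integer (assumed)."""
    return max(1, int(math.log2(n_predictors + 1)))


def bootstrap_indices(n, rng):
    """Draw n observation indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n) for _ in range(n)]


def random_band_subset(n_predictors, mtry, rng):
    """Choose mtry distinct candidate bands to evaluate at one split."""
    return rng.sample(range(n_predictors), mtry)


def average_predictions(per_tree_predictions):
    """Average each response variable over the trees in the ensemble."""
    n_trees = len(per_tree_predictions)
    return tuple(sum(column) / n_trees for column in zip(*per_tree_predictions))


rng = random.Random(42)
n_bands = 1023  # hypothetical number of spectral bands
mtry = default_mtry(n_bands)
sample = bootstrap_indices(20, rng)
candidates = random_band_subset(n_bands, mtry, rng)

# Hypothetical (FSD, RLCC) predictions from three trees for one plant.
ensemble_prediction = average_predictions([(10.0, 35.0), (14.0, 33.0), (12.0, 34.0)])
print(mtry, ensemble_prediction)
```

In a full implementation, one unpruned MTRT would be grown on each bootstrap sample, drawing a fresh band subset at every split, and the loop would be repeated for the 500 trees.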

###### 2.5.4. Evaluating Regression Trees

Model interpretability was evaluated by determining the size of each STRT and MTRT after pruning. The size of a regression tree was calculated as the total number of nodes (internal nodes and leaves) used to construct the tree [29]. Model size is important to note because the more complex the tree, the more bands are used and the more difficult the interpretation becomes. In addition, each STRT and MTRT model was inspected to determine key spectral regions and to identify the influential bands used as decision rules to construct the trees.
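The size measure and the band inspection described above can be illustrated on a toy tree. The nested-dictionary representation, the band wavelengths, and the leaf predictions are hypothetical and do not reflect the CLUS output format.

```python
def tree_size(node):
    """Count all nodes in the tree: internal nodes plus leaves."""
    if node is None:
        return 0
    return 1 + tree_size(node.get("left")) + tree_size(node.get("right"))


def bands_used(node):
    """Collect the bands that appear as decision rules in internal nodes."""
    if node is None or "band" not in node:
        return set()
    return {node["band"]} | bands_used(node.get("left")) | bands_used(node.get("right"))


# Hypothetical pruned tree: two internal nodes (band tests) and three leaves.
tree = {
    "band": 680,
    "threshold": 0.21,
    "left": {"prediction": (3.0, 39.0)},
    "right": {
        "band": 720,
        "threshold": 0.35,
        "left": {"prediction": (12.0, 34.0)},
        "right": {"prediction": (21.0, 29.0)},
    },
}

size = tree_size(tree)
bands = bands_used(tree)
print(size, sorted(bands))
```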

A 10-fold cross-validation was performed to validate the regression models constructed. The original dataset was partitioned into ten stratified subsamples, where each subsample was used in turn as a validation dataset while the remaining subsamples were used as the training dataset [52]. For each fold, a regression model was constructed from the training dataset and the error was computed using the validation dataset [52]. The final error is the average over the 10 folds, which provides a single error estimate.
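The cross-validation loop described above can be sketched as follows. For brevity the folds are formed by a plain shuffle rather than by stratification, which is a simplification of the procedure in the text. The stand-in model (predicting the training mean), the error measure, and the response values are hypothetical placeholders for a regression tree and its error.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(n, k, fit, error):
    """Average the validation error over k folds."""
    folds = k_fold_indices(n, k)
    errors = []
    for i, val_idx in enumerate(folds):
        # All folds except the i-th form the training set.
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        model = fit(train_idx)
        errors.append(error(model, val_idx))
    return sum(errors) / k


# Hypothetical response values; the "model" is simply the training mean.
y = [float(i % 7) for i in range(50)]


def fit_mean(train_idx):
    return sum(y[j] for j in train_idx) / len(train_idx)


def mean_abs_error(model, val_idx):
    return sum(abs(y[j] - model) for j in val_idx) / len(val_idx)


cv_error = cross_validate(len(y), 10, fit_mean, mean_abs_error)
print(cv_error)
```

Each observation is used for validation exactly once, so the averaged error reflects performance on data that the corresponding model did not see during training.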

As recommended by Stojanova et al. [30] and used in other studies [29, 34, 35], the predictive performance of the STRT, MTRT, and RF-MTRT models was evaluated by computing the Pearson correlation coefficient and the root mean square error (RMSE). The correlation coefficient indicates the direction and strength of a linear relationship between two random variables and was calculated using

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \, \sum_{i=1}^{n}(y_i - \bar{y})^2}},$$

where $x_i$ and $y_i$ are the *i*th observations of the variables *x* and *y*, $\bar{x}$ and $\bar{y}$ are their means, and *n* is the total number of pairs of *x-y* observations.

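The RMSE is a measure of the differences between the values predicted by the model and the values actually observed. The RMSE was calculated using

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2},$$

where $y_i$ is the observed value, $\hat{y}_i$ is the predicted value for the *i*th observation, and *n* is the number of observations.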
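Both evaluation measures can be computed directly from paired observed and predicted values, as in the following minimal sketch. The function names and the toy values are hypothetical; the example is chosen so that the predictions are perfectly correlated with the observations yet offset by one unit, which shows that the two measures capture different aspects of performance.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


# Hypothetical observed and predicted FSD values.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 3.0, 4.0, 5.0]

r = pearson_r(observed, predicted)
error = rmse(observed, predicted)
print(r, error)
```

A high correlation alone therefore does not guarantee small prediction errors, which is one reason to report both measures.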

#### 3. Results

##### 3.1. Biocontrol Damage of Variable Infestation Levels

The extent of biocontrol damage on water hyacinth plants for the three infestation levels over a period of five weeks is shown in Figure 1. It was observed that water hyacinth plants with a low infestation level were healthy after four weeks of infestation. Plants with medium and high infestation levels showed moderate and severe damage, respectively, after three weeks of infestation. Water hyacinth plants exposed to a high infestation level produced fewer new leaves and decreased in size. The bases of the petioles were severely eaten, with leaves showing signs of desiccation and eventually falling off the petiole.