Abstract

Ecologists are often faced with problem of small sample size, correlated and large number of predictors, and high noise-to-signal relationships. This necessitates excluding important variables from the model when applying standard multiple or multivariate regression analyses. In this paper, we present the results of applying PLS to explore relationships among biotic indicators of surface water quality and landscape conditions accounting for the above problems. Available field sampling and remotely sensed data sets for the Savannah Basin are used. We were able to develop models and compare results for the whole basin and for each ecoregion (Blue Ridge, Piedmont, and Coastal Plain) in spite of the data constraints. The amount of variability in surface water biota explained by each model reflects the scale, spatial location, and the composition of contributing landscape metrics. The landscape-biota model developed for the whole basin using PLS explains 43% and 80% of the variation in water biota and landscape data sets, respectively. Models developed for each of the three ecoregions indicate dominance of landscape variables which reflect the geophysical characteristics of that ecoregion.

1. Introduction

The primary objective of the US Environmental Protection Agency’s (EPA) Landscape Ecology research program is investigation of associations among indicators of water quality and landscapes. Statistically valid predictive models are an important means of expressing these associations. The analyses presented here represent an attempt to develop a statistical predictive model of biotic indicators of water quality based on associations with a selected suite of landscape indicators.

Investigation of associations among indicators of water quality and landscapes involves statistical analyses of fundamentally different data sets. Data on surface water conditions are generally obtained through field sampling programs and may include several different methods of data production, that is, on-site observation, chemical analysis of collected samples, and expert identification of biotic organisms. Each method is unique in its precision and variability. Field samples are representative of specific points or stream reaches. Field/analysis programs are expensive and labor intensive; consequently, the total number of sample sites is usually small. The data base may contain missing values due to the realities of field sampling: malfunctioning equipment, lost or destroyed samples, and invalidation of results due to poor quality control. Much of the data on watershed characteristics, or landscape data, are derived from remote sensing platforms, thereby permitting wall-to-wall coverage, although the data may be of lesser or more questionable quality than surface water sample data. The landscape indicators data sets may contain a very large number of variables, although many of these are not wholly independent (i.e., they may be collinear).

The characteristics of these data sets and the differences between point sample collection and remote sensing derivation present a challenge in selection of statistical methods for data analyses and model definition. Single-and multiple-regression analysis has frequently been used to relate water nutrient concentrations to selected landscape variables [13]. Regression analyses, however, are sensitive to missing values and dependence of explanatory variables (landscape variables). Reliable statistically significant results generally cannot be obtained unless the total number of samples greatly exceeds the number of variables. Canonical correlation is well suited to exploring the relationships among two or more distinct data sets to describe their association and connection to the physical environment [46]. However, canonical correlation is sensitive to collinearity in predictors and requires multinormal data sets when testing the significance level of the correlation. The ratio of the number of variables to sample size is critical in canonical correlation; a ratio of 0.025–0.05 at a minimum is recommended [79].

Partial least squares (PLS) analysis offers a number of advantages over the more traditionally used regression analyses. Through its extensive use in the field of chemometrics, PLS has been shown to produce significant results when the number of samples is small compared to the number of variables [1012]. It has been found to be useful both for providing accurate predictions and for interpreting relationships between data sets containing a high degree of collinearity [1315]. Additionally, the prediction error in PLS is smaller than in other multivariate methods [16, 17]. Although PLS is a primary statistical tool in chemometric studies, it has only occasionally been used in ecological studies for exploratory analyses in engineered revegetation studies [17, 18].

The advantages of PLS, described above, make it an attractive candidate statistical tool for development of landscape ecology models. In this paper, we present the results of applying PLS to exploration of the relationships among surface water biota and landscape conditions. Available-real-world data sets for the Savannah Basin and its three-component ecoregions are used. These data sets contain all of the limitations that hinder use of other multivariate statistics, for example, small number of sampling sites and large number of variables.

2. Methods and Materials

The water data used in this analysis were provided by EPA Region IV, Science and Ecosystem Support Division. As a Regional Environmental Monitoring and Assessment Program (REMAP) project, site selection and sampling were completed according to standard EMAP protocols. This included a random site selection process, with wadeable stream (generally first-to-third Strahler order) sites selected without consideration of watershed size, proximity to other sampling sites, ecoregion, or ease of access. Sample collection was completed one time during base flow conditions (generally late summer into fall), although selected sites may be visited a second time for quality assurance purposes. Site coordinates were checked with a global positioning system (GPS) unit and against topographic maps to verify the selected sampling location. During the field visit, observations were made and recorded on standardized forms for a number of physical traits related to stream physiology and habitat. Additional physical traits were measured, including temperature, conductivity, and dissolved oxygen. Macroinvertebrate samples were collected over a 100 m stream stretch above the water sampling point. Macroinvertebrate identification was completed in a biological laboratory following collection. Stream water samples were collected and filtered for subsequent laboratory analyses. All collected samples were sealed, labeled, and transported in coolers under chain-of-custody [19].

For each of the selected sites, the watershed support area was delineated and a suite of landscape variables was calculated [23]. Water biology and landscape variables (𝑛=86 sites) used in the analyses are described in Table 1. Table 1 also provides the abbreviations of variable names used in the figures and tables in this paper. Due to the great number of variables and the need for very short abbreviations for use in labeling figures, at each occurrence of a variable name throughout this text, the full variable name will be used with the abbreviation provided in parentheses.

2.1. Site Description

The Multi-Resolution Landscape Characterization Consortium (MRLC) land cover/land use data (June 15, 2011; http://www.epa.gov/nerlesd1/land-sci/savannah.htm) reveals distinctive spatial patterns within the Savannah River Basin. The headwaters of the Savannah River are located in the Blue Ridge mountains in which evergreen forests predominate. Below this lies a region of mixed and deciduous forest, agriculture dominated by pasture and hay fields, and several urban centers. Two large reservoirs are located on the main stem river. Below Augusta, Georgia, extensive row crop agriculture is evident, along with wetland areas. The city of Savannah is located near the outlet of the river to the Atlantic Ocean. The spatial patterns seen in the land cover correspond closely to the three ecoregions: Blue Ridge, Piedmont, and Coastal Plain. Two data sets were used (see Table 1): four variables for water biology and 26 variables for landscape condition.

2.2. Water Biology Variables

The four water biology variables (Table 1) used in this analysis were Algal Growth Potential Test (AGPT), macroinvertebrate habitat (HAB), macroinvertebrate species richness (RICH), and Ephemeroptera/Plecoptera/Trichoptera (EPT).

2.2.1. Algal Growth Potential Test

The Algal Growth Potential Test (AGPT) is a bioassay performed in the laboratory in which known amounts of nutrients (nitrogen and phosphorus) and a standard test alga are added to aliquots of filtered water collected from the site [21]. Its purpose is to provide an indication of the amount of nutrients biologically available to support algal growth, as opposed to analytical methodologies that measure the total amount of specific nutrients of which only a portion may be biologically available. The specific methodology used by EPA Region IV was based on the standard method but included a modification by [22] to speed the analytical process [19]. As a surrogate measurement of nutrient concentration, higher values indicate higher levels of nutrients.

2.2.2. Macroinvertebrate Habitat

Based on the Rapid Bioassessment Protocols [20] and modified by EPA Region IV to fit their specific ecoregions [19], the macroinvertebrate habitat (HAB) data was derived from visual observations at the sampling site of specific parameters categorized as primary, secondary, and tertiary parameters. Primary parameters characterize the stream habitat at a microscale; these parameters were bottom substrate, available cover, embeddedness, and flow regime. Secondary parameters characterize stream habitat at the macroscale; these parameters were channel alteration, bottom scouring/deposition, and sinuosity. The tertiary parameters of bank stability, bank vegetation, and streamside cover characterize the riparian zone composition and integrity [19]. From this parameter matrix, a single, weighted composition score was derived, with higher scores indicating better conditions for sustaining healthy macroinvertebrate populations.

2.2.3. Macroinvertebrate Species Richness

Macroinvertebrate species richness (RICH) is simply a count of the number of distinct taxa observed in a sample [20]. In this study, samples were collected from a 100 m stream segment above the water sample collection site. D-frame and A-frame dipnets were used to collect organisms from all substrate types within the stream reach [19]. Higher numbers indicate a greater diversity of taxa. The authors assigned ranges based on natural breaks in the data set: nonimpacted (greater than 26 taxa), slightly impacted (19–26 taxa), moderately impacted (11–18 taxa), and severely impacted (less than 11 taxa).

2.2.4. Ephemeroptera/Plecoptera/Trichoptera

The Ephemerop tera/Plecoptera/Trichoptera (EPT) variable is an index of three macroinvertebrate orders known to be sensitive to environmental impacts: Ephemeroptera (mayflies), Plecoptera (stoneflies), and Trichoptera (caddisflies). It is calculated as a percentage of the number of organisms in these three orders contained in a 100-organism sample [20]. The 100-organism samples used were a randomly selected subset of the sample collected for macroinvertebrate species richness, above. As with macroinvertebrate species richness, the authors assigned classifications based on natural breaks in the data set: nonimpacted (greater than 10 percent), slightly impacted (6–10 percent), moderately impacted (2–5 percent), and severely impacted (less than 2 percent).

2.3. Landscape Variables

All of the landscape variables used in this analysis were derived from available digital data sets in a geographic information system (GIS). The spatial data sets used were obtained from a variety of sources. The abbreviation, full name, and description of each of the landscape variables are given in Table 2. The primary data sets used to derive the 26 variables used in this analysis were: Multi-Resolution Land Characteristics (MRLC) Interagency Consortium land cover/land use [24], State Soil Geographic data base (STATSGO) soils [25], RF3 streams [26], USGS 8-digit HUCs, Georgia and South Carolina subbasins, Region IV sampling site locational data, 30-m and 100 m digital elevation models (DEMs) [27], and digital line graph (DLG) roads [28]. Slope was derived as percent rise from the 30-m DEM. Most of the landscape variables were calculated using the derived watershed above the sampling point as the base unit. The single exception in the variables used here is total roads located within 30 meters of a stream (r); for this variable, the base unit was the streams within the watershed, buffered out 30 meters on both sides.

The seven land cover variables were calculated from the MRLC cover classes. Percent crop (c) is the amount of land cover within each watershed identified in the MRLC data as “row crop” percent pasture (p) is the amount of land cover within each watershed identified in the MRLC data as “pasture or grassland” percent barren (b) is the amount of land cover within each watershed identified in the MRLC data as barren due to anthropogenic activities (e.g., “quarries, strip mines”), percent urban (u) is the total amount of land cover within each watershed identified in the MRLC data as “commercial”, “high-density residential”, and “low-density residential”, percent forest (f) is the total amount of land cover within each watershed identified in the MRLC data as “evergreen”, “deciduous”, and “mixed” forest, percent wetlands (q) is the total amount of land cover within each watershed identified in the MRLC data as “woody” and “herbaceous” wetlands, and percent water (w) is the amount of land cover within each watershed identified in the MRLC data as “water.”

Slopes (z) were considered to be all areas with greater than 3 percent rise slope, while mean slope (x) is the arithmetic mean of the 30 m slope pixels within the watershed and standard deviation of slope (s) is the first standard deviation of the total number of slope pixels within the watershed. The STATSGO K-factor was used to provide an estimation of slope erodibility; a K-factor greater than or equal to 0.4 was considered “highly erodible” (e) while a K-factor of greater than or equal to 0.2 but less than 0.4 was considered “moderately erodible.” Moderately erodible soils were used in overlays with land cover and slope data, but not as a variable by themselves.

Stream density (d) was calculated as the total length of RF3 stream vectors within the watershed divided by the total area of the watershed. Power lines, pipelines, and telephone lines (t) were calculated as the total length of these vectors from USGS TIGER files within the watershed divided by the total area of the watershed. Total roads within the watershed (r) are the total length of all USGS TIGER file road classes divided by the total watershed area and total roads within 30 meters of streams (v) is that subset of roads located within the buffered stream boundary, divided by the total stream length. The total roads within 30 meters of streams (v) also included railroads and sidings as these could produce an impact to streams equal to or greater than some passenger vehicle road classes. Railroads, however, were not included in the total roads within the watershed (r) variable.

The remaining eleven landscape variables were overlays of two or three of the land cover, slope, and soil erodibility variables. Five used total agriculture (MRLC data classified as “row crops” and “pasture or grassland”) in combination with slope (agriculture on slopes greater than 3 percent, (az)), soils (agriculture on highly erodible soils (ah) and agriculture on moderately erodible soils (am)) or both (agriculture on highly erodible soils on slopes greater than 3 percent (azh) and agriculture on moderately erodible soils on slopes greater than 3 percent (azm)). The subclassifications of agriculture overlayed with slopes and/or soils yielded another three variables (row crops on slopes greater than 3 percent (cz), row crops on slopes greater than 3 percent with moderately erodible soils (czm), and pastures or grasslands on slopes greater than 3 percent (pz)). Land cover classified as barren due to anthropogenic activities overlayed with slopes and soils accounted for two variables (barren on slopes with highly erodible soils (bzh) and barren on slopes with moderately erodible soils (bzm)). The last overlay variable used was slopes with moderately erodible soils (zm). Other possible overlay variables (e.g., row crops on slopes with highly erodible soils) were not used in this analysis primarily because they were nonexistent in the majority of the watersheds. For clarifications, landscape variables abbreviations in Tables 2 and 3 were used in figures.

3. Statistical Methodology

The PLS method is based on first computing a few relevant projections (latent variables), for example, linear combinations of the independent or predictor variable 𝑋 and then using these new variables in a regression equation for predicting the response 𝑌. In contrast, principal components analysis (PCA) uses only the predictors (𝑋). In PLS, both 𝑋- and 𝑌-matrices are decomposed into scores and weights matrices (𝑋=𝑇𝑃, where 𝑇𝑇=𝐼 is identity), then 𝑌 is estimated as 𝑌=𝑇𝐵𝑉 where 𝐵 is the regression coefficient and 𝑉 is linear weight. The matrix column “𝑇” is the latent vectors. Decomposition of the 𝑋- and 𝑌-matrices and forming linear combinations continue until the number of latent vectors is equal the number of variables in the 𝑋-matrix. PLS begins by the following.(1)Centering and scaling each of the response (𝑌) and predictor (𝑋) variables, Yo and Xo, respectively. (2)Constructing linear combinations of the predictors as 𝛿 (score)=𝑋𝑜𝜔 (weight). Scores are orthogonal. (3)Constructing linear combinations of the responses as 𝜇=𝑌𝑜𝜈. (4)Verifying the linear combination in (2) has maximum covariance (𝛿𝜇) with the response linear combination in (3); in addition constraints 𝜔𝑇𝜔=1 and 𝛿𝑇𝛿=1 should be met. (5)Predicting for both 𝑌𝑜 and 𝑋𝑜 by regression on 𝛿 (scores): 𝑋𝑜=𝛿𝐿𝑥,𝑌𝑜=𝛿𝐿𝑦,(1) where 𝐿𝑥(=(𝛿𝛿)1𝛿𝑋𝑜) and 𝐿𝑦(=(𝛿𝛿)1𝛿𝑌𝑜) are the 𝑋- and 𝑌-loadings, respectively. (6)The above steps are for constructing the first PLS factor.(7) Residuals for each X and Y are produced as: 𝑋1=𝑋𝑜𝑋𝑜,𝑌1=𝑌𝑜𝑌𝑜.(2)(8)The second factor is constructed by applying steps (1) through (5) to the residual (7); additional factors are constructed by repeating this process for each residual until the 𝑋 matrix becomes null.

In interpretation, the scores as well as weights (steps (2) and (3)) are computed and plotted in simple scatter plots (Figures 1 and 2). Weights are the contribution of each the predictors in 𝑋 to the PLS factor. Landscape metrics clustered near the origin indicate these provide little significant contribution to the predictive model. Clusters of variables with approximately equal weights indicate these variables may be collinear. The scores are the regression coefficients of the variables in 𝑋 and 𝑌 regressed upon the various variables in 𝛿 and represent how the different manifest variables are related to the scores 𝛿 (Figure 2). The scores are sometimes thought of as latent unobservable variables. Detailed discussions of PLS and other methods can be found in [12, 15, 16, 29, 30].

3.1. Validation

Validation of a prediction is always important for assessing the properties of the equation developed. Just testing the model on data already used for building the model is not enough and can lead to highly overoptimistic results [16]. Cross-validation, as used here, was accomplished by dividing the data into five groups, of which one group was left out (test data). The model was fitted on the remaining four groups (training data). The fitted models (𝑛 factor model, step (8)) were tested via cross validation using the test data sets, and the predicted values were compared with observed to calculate residuals. The sum of squares of these residuals for all models (null and 𝑛-factor models) was calculated giving PRESS (Predictive Residual Sum of Square), which can be used to define the optimum model and, hence, assess the predictive power of the model. A model with number of factors that minimizes PRESS is the optimum one to be chosen. However, several models may have PRESS values that are close and do not differ greatly from the absolute minimum, therefore, it is important to test whether these differences are significant. A statistical test (Hotelling’s 𝑇2) suggested [31] to test the significant differences between root means PRESS of models was used here. The final model was chosen based on the lowest significant PRESS value (Table 3).

3.2. Variable Influence on Projection (VIP)

VIP is also known as variable importance for projection [15]. VIP is calculated as VIP=𝑉𝑛𝑥,𝑉=𝑛𝑓𝑖=1𝜔2𝑛𝑖𝑟𝑦𝑖𝑛𝑓𝑖=1𝑟𝑦𝑖,𝜔𝑛𝑖=𝑋𝜔𝑖USS𝑋𝜔𝑖,(3)

𝑋𝜔𝑖 is the predictor (here it is landscape variables) weight per each model factor. For example, if the model has three significant factors, then there are three weights for each of the landscape variables. Each weight is normalized (𝜔𝑛𝑖) by dividing by the uncorrected sum of squares of predictor weights per factor, 𝑟𝑦𝑖 is the percent variability in the response variables (here, biota data) that is explained by each factor, 𝑛𝑓 is the number of significant factors in the model, and 𝑛𝑥 is the number of predictor variables. The values of the regression coefficients and the relative importance (VIP) of each predictor can be used to evaluate the contribution of each variable in the PLS model (Figure 3). Regression coefficient values indicate the contribution of each predictor (lines in Figure 3) for an individual response. The VIP value, as indicated in the above equations, is based on both response and predictor measures. Therefore, if the VIP for a predictor is small in value, it implies that variable has a relatively small contribution to the prediction and may be deleted from the PLS model. Variable with VIP values of less than 0.8 should be considered small contributors [15]. An improved model can be built by including variables with high VIP values and excluding others with low VIP. For the whole basin, we refined the preliminary model by removing 18 predictors which increased the amount of variability explained by the responses by 9% (Table 4) in the refined model.

The quality of the model developed here was determined by examining the residuals for both the biota and the landscape variables. An examination of any possible outliers using residuals and leverages was carried out to finalize the fitted PLS model. The above analyses were done on all data, and by ecoregion to demonstrate the utility of PLS in different geographical settings.

3.3. Predictive Capabilities Usage

Water quality data (response variables) are predominantly collected by manual methods at selected points. Often, permitting restrictions, cost of sampling, equipment malfunction, or other reasons may prohibit collection of a complete set of samples. Landscape variables (predictors), on the other hand, can generally be obtained for all sites. Use of satellite imagery provides nearly complete spatial coverage of the data used in computation of landscape variables. Low numbers of sites, collinearity in the landscape variables, missing values in water quality parameters, and low signal to noise ratios in relationships between landscape variables and biological data can all be overcome in describing relationships, quantifying variability, modeling, and prediction using PLS. We used SAS® [32] for all statistical analyses.

4. Results

The results for all PLS models are summarized in Table 4. Two models are presented for the whole Basin: a preliminary model in which all of the response and predictor variables are used and a refined model using a selected subset of variables. Table 4 also presents results of models for each of the three ecoregions.

4.1. Whole Basin

In the preliminary model, three factors are significant explaining 34% of the variability in the biota and 55% of the variability in the landscape data sets. Figure 1 is a plot of the landscape and biota scores for the first factor, indicating the strength of the relationship between the response and predictor variables in this factor (𝑟=0.64). Landscape metrics weights among the three significant factors (Figure 2) show that erodible soil, slope standard deviation, mean slope, agriculture on slopes, and pasture are heavily weighted in all three factors, while forest is heavily weighted only in factor 1, wetlands in factor 2, and crops in factor 3. Agriculture-related variables, including overlays with slopes and soils, are approximately equal weights within the PLS factor indicating collinearity.

Table 4 shows the predictor variables grouped by VIP value. Figure 3 depicts the regression coefficients for each response/predictor variable combination with the predictor variables listed in order of increasing VIP value. Landscape variables with regression coefficients close to zero and VIP <0.8 indicate little or insignificant associations between these landscape metrics and water biota in this study. Based on these low values for both regression coefficient and VIP, the following landscape variables are excluded from further analyses, including the PLS models for the individual ecoregions: barren on slopes with either highly or moderately erodible soils, water, stream density, transmission lines, and roads near streams. Several of the agriculture-/slope-/soils-related landscape variables have similar VIP values and factor weights (Figure 2), indicating approximately equal contribution to the model. Only those variables with high values for both VIP and regression coefficient are selected to create a final refined model for the whole basin with the strongest possible predictive capability; the nine predictor values selected are shown with shaded VIP bars in Figure 3.

The refined model (Table 4) has three significant factors with predictive ability that is more than twice that of the preliminary model. The three factors explain 43% and 80% of the variation in the biota (response) and landscape variables (predictors), respectively. The importance of the nine landscape variables is high (VIP ≥ 0.8). The agriculture-related variables contribute equally and minimally to the model, with VIPs close to 0.84. All of agriculture-related variables have a negative effect on EPT and HAB. The most significant contributors (VIP ≥ 1) are slope standard deviation, slope, forest, and erodible soils. Slope standard deviation was the most important variable (VIP = 1.5) and, along with forest, has a positive effect on EPT and HAB. Urban is also an important contributor, but ranks in between the above two groups. that is, urban contributes more than agriculture, but less than slope standard deviation, forest, and erodible soils. Like agriculture and erodible soils, urban negatively impacts biotic condition.

4.2. Ecoregions
4.2.1. Blue Ridge

Two water biota and six landscape variables (Table 4) for twenty sites are included in the model. There are three significant factors that account for 59% and 94% of the variability for the biota and landscape variables, respectively. The strength of the relationship between the two linear components for the first factor is moderate (𝑟=0.65) [33]. Slope standard deviation and roads are the most important variables (VIP > 1), followed by slopes with erodible soils (VIP 1; Table 4). Forest, slope, and pastures on slopes were ranked in the middle with VIPs greater than 0.8 but less than 1. Slope standard deviation and, to a lesser extent, forest, are positively correlated with EPT and HAB, while the remaining landscape variables are negatively correlated with the biota variables (Figure 2(a)). Normality of the response is met (𝑃>.13), and no outliers are found in the landscape metric data.

4.2.2. Piedmont

The PLS model for the Piedmont contains three water biota and nine landscape variables for 59 sites (Table 4), producing three significant factors explaining 42% of the variability in the biota and 65% of the variability in the landscape variables (Table 4). The strength of the relationship between the two linear compositions for the first factor is strong (𝑟=0.75) [33]. Slope features and forest are the most important variables (VIP > 1; Table 4). Slope features, forest, and wetlands (marginally) are positively correlated with EPT, whereas agriculture/slope/soils variables and urban are negatively correlated with EPT (Figure 2(b)).

Wetlands, slope standard deviation, and forest are positively correlated with HAB while urban and all agriculture-related variables are negatively correlated (Figure 2(b)). AGPT is heavily weighted and positively correlated with urban and agriculture and negatively correlated with forest (Figure 2(b)). Normality of the response variables is met (𝑃>.05), and no serious outliers in the landscape variables are found.

4.2.3. Coastal Plain

In spite of the scarcity of sampling sites (𝑛=7) in the Coastal Plain, a valid PLS model is constructed. Three water biota variables and seven landscape variables (Table 4) are included in the model. There are two significant factors that account for 66% and 86% of the variability for the biota and landscape variables, respectively. The strength of the relationship between the two linear compositions for the first factor is strong (𝑟=0.85). HAB and AGPT are positively correlated with erodible soils and agriculture on moderately erodible soils and negatively correlated with the remaining predictor variables. EPT is negatively correlated with agriculture on moderately erodible soils, erodible soils, and urban and positively correlated with the remaining variables (Figure 2(c)).

5. Discussion

PLS models for the whole basin and for each of the three ecoregions revealed different sets of variables of landscape variables that have relation with those of water quality. A number of variables are found to have little or no contribution to the predictive capability of the model and, therefore, can reasonably be excluded from refined analyses. Variables which are known to have a high degree of collinearity (specifically, the various overlays of agriculture/slopes/soils) are correctly identified in the analyses with similar weights, VIP values, and regression coefficients. This clustering permits further reduction of the number of variables in refined analyses. From an initial pool of 26 landscape variables, final models are produced with six to nine variables, all significant contributors to the predictive model.

On the ecological aspect, one may ask, do the PLS results identify meaningful associations between biotic and landscape indicators? With the exception of the Coastal Plain, this objective is successful. Macroinvertebrate indicators are positively correlated with natural land cover types (forest in the Blue Ridge and Piedmont, wetlands in the Piedmont) and negatively correlated with indicators of anthropogenic activities (agriculture, urban development, roads). As an indicator of nutrient enrichment, AGPT could be expected to be positively correlated with agriculture and erodible soils. Positive correlations are obtained in the models for the whole basin and for the Piedmont. In the Coastal Plain model, however, AGPT is positively correlated with agriculture on moderately erodible soils and with erodible soils, but is negatively correlated with agriculture on slopes. Also, EPT is negatively correlated with erodible soils and agriculture on moderately erodible soils as could be expected, but EPT is positively correlated with agriculture on slopes and with roads, which is contrary to what would be expected. Although the positive correlation is lower in magnitude than that of the negative, this unexpected relationship is possible due the collinearity between predictors. Soils in this ecoregion are generally of low erodibility, and the terrain is much flatter than the other two ecoregions, so possibly the detrimental effects of agricultural runoff are greatly lessened.

The model results also indicate slope is a significant predictor variable in the whole basin and in each of the ecoregions. In the Blue Ridge, slope variables receive the highest weightings and VIP values. This region is the upland headwaters of the Savannah River Basin and is characterized by hilly to mountainous terrain. Slope variables are also heavily weighted in the Piedmont, and standard deviation of slope produces the highest VIP in any of the ecoregion-specific models. The Piedmont is a transitional zone between the mountains of the Blue Ridge, and the flat terrain of the Coastal Plain and encompasses terrain varying from hilly to nearly flat. Slope is significant in the Coastal Plain model, but not as much as in the other ecoregions. Unlike the Blue Ridge and Piedmont, standard deviation of slope in the Coastal Plain is negatively correlated with HAB. This may be a function of the methodology used to score HAB which gives higher weights to areas with a variety of pool and riffle habitats. The Coastal Plain may lack this variety due to the lack of slope in this ecoregion.

An unexpected result is seen in the preliminary model for the whole basin. Forest is weighted heavily only on factor 1, wetlands only on factor 2, and row crops only on factor 3 (Figure 2). Forests are the dominant land cover type in the Blue Ridge, and row crops are dominant in the Coastal Plain. Wetlands are a small percentage of the total land cover in the Piedmont, but may play a critical role in water quality [23]. It appears that the linear combinations in the whole basin model factors may correspond to the characteristics which distinguish individual ecoregions. This result merits further investigation.

Species richness in 362 1-km2 grid squares in the Kevo Nature Reserve, Finland, was predicted using 227 vascular plant taxa and 27 environmental variables [17]. The resultant PLS model contained two factors which explained 40.3% of the variance in the single response variable. PLS was also used to relate riparian plant growth and survival to duration and frequency of flooding in a controlled experimental study [18]. The availability of remote sensing data for an area can be used to monitor vegetation indicator continuously over time and space with cost-effectiveness and ease of implementation more than those with field measurements. Schmidtlein [34] used transformed reflectance in 64 wavelength bands to predict averaged Ellenberg inidicator values (soil pH, soil fertility, and water supply) from 46 field sites using PLS regression. In each field site, all vascular plant species were also identified, and their cover was estimated. Predicted Ellenberg indicators for the study area were mapped showing the continuous environmental gradient that can be used to assess the floristic composition.

These studies used plant indicators as the response variable and a variety of predictor variables. Our approach using PLS regression differs from these studies in a number of ways: we use multiple response variables, our response variables are indicators of nutrients and macroinvertebrates, and our response data originated from ambient field sampling rather than from controlled experimental studies. These differences are encouraging in that they imply PLS may have utility in a broad range of ecological studies.

6. Conclusions

In both the preliminary and refined models for the whole basin, associations among water biota and landscape variables, largely conform to known ecological processes. Agriculture and urban variables, with their potential for nutrient runoff from fertilizer usage, are positively associated with AGPT measurements while forest is negatively associated with AGPT. Agriculture, urban, moderately eroded soils on slopes, and roads are negatively associated with HAB while wetlands, which filter and remove pollutants as well as slow runoff, are positively associated with HAB.

In each case the dominant landscape variable corresponds to a critical aspect of the ecoregion: forest in the evergreen forest-dominated Blue Ridge, wetland in the transitional Piedmont, and row crops in the agriculture-dominated Coastal Plain. For both the Blue Ridge and the Coastal Plain, the ecoregion-specific model yields improved results over the basin-wide model, despite the reduction in sample size. Only the Piedmont model fails to improve on the basin-wide model results, with 42% of the variability in the water biota data set and 65% of the variability in the landscape variables explained by three significant factors on a sample size of 59. The Piedmont is a transitional zone with pasture dominant in the upper region transitioning to row-crop-dominated agriculture in the lower region. Spatial variation across the ecoregion may at least partially explain the model results. In contrast, three significant factors in the Blue Ridge together explain 59% of the variability in the water biota data set and 94% of the variability in the landscape variables data set, based on a sample size of 20. Even with a very limited sample size of 7, the PLS model for the Coastal Plain yields two significant factors, together explaining 66% of the variability in the water biota data and 86% of the variability in the landscapes data.

Although further testing in different biogeophysical setting is needed, the results indicate PLS may prove to be a valuable statistical analysis tool for ecological studies. The data sets used in these analyses contain limitations typical of ecological studies: a small number of sampling sites, a large number of variables, missing values, low signal-to-noise ratio, differences in spatial extent, and different collection methodologies between the field-collection surface water samples and the remote sensing-derived landscape variables. The PLS methodology is less sensitive to these limitations than other statistical methods. The correlations among water biota variables and landscape variables provide much more information when they are all considered in multivariate regression than in univariate-multiple regressions. Univariate-multiple regression analyses with these data sets will not reveal a distinctive pattern of association due to a weak correlation. Summarizing information in the predictor variables by reduction into a few variables, that is, latent variables, conditioned on maximum covariance with the linear composition of the predictor variables, makes PLS more suitable in a multivariate context than other, more commonly used, multivariate methods.

Acknowledgments

The authors thank the effort of Dr. Chad Cross, Dr. Tormod Næs, Daniel Heggem, and the anonymous reviewers for their input and review. The contribution of the US Environmental Protection Agency, Region IV, Science and Ecosystem Support Division in the collection of the surface water data used in the statistical analyses presented here is gratefully acknowledged. The US Environmental Protection Agency (EPA), through its Office of Research and Development (ORD), funded and performed the research described here. It has been peer-reviewed by the EPA and approved for publication. Mention of trade names or commercial products does not constitute endorsement or recommendation for use.