## Statistical methods to analyze large spectral data sets

Spectral indices

Spectral indices are computed from ratios or differences of the reflectance at particular wavelengths. These wavelengths are related to certain performance traits such as biomass production, plant water content and photosynthesis (for further detail see the article on spectral reflectance indices).

Advantages: Cheap devices can be built to measure spectral indices (e.g. the GreenSeeker for NDVI). Spectral indices have a clear biological interpretation (green reflectance = biomass).

Disadvantages: Not all of the spectral information is used, and the wavelengths best suited for estimating grain yield (or any other trait) change with growth stage and environmental conditions.
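As a minimal illustration, NDVI is the normalized difference of near-infrared and red reflectance; the sketch below uses hypothetical reflectance values for a dense canopy and for bare soil.

```python
import numpy as np

def ndvi(red, nir):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    return (nir - red) / (nir + red)

# Hypothetical reflectance values (not measured data):
# a dense green canopy absorbs red light and reflects NIR strongly,
dense = ndvi(red=0.05, nir=0.50)
# while bare soil reflects red and NIR to a similar degree.
soil = ndvi(red=0.30, nir=0.35)

print(round(dense, 3), round(soil, 3))  # high NDVI for canopy, low for soil
```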

Multiple linear regression

Several wavelength ranges are regressed against the trait of interest, with each range receiving its own weight. The assumption is that there is no multicollinearity, which means that the ranges included in the regression are not correlated with each other.

Advantage: Several ranges can be regressed simultaneously against the trait of interest.

Disadvantage: Multiple regression models are prone to over-fitting as more wavelengths are included.
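A minimal numpy sketch of the idea, using synthetic reflectance data and assumed weights: each wavelength range gets its own regression coefficient, estimated by ordinary least squares.

```python
import numpy as np

# Synthetic example: 50 plots, 3 reflectance ranges, assumed true weights.
rng = np.random.default_rng(0)
n_plots, n_ranges = 50, 3
X = rng.normal(size=(n_plots, n_ranges))              # reflectance in 3 ranges
true_w = np.array([0.8, -0.3, 0.5])                   # assumed weights
y = X @ true_w + rng.normal(scale=0.05, size=n_plots) # trait + noise

# Each range receives a different weight (regression coefficient).
X1 = np.column_stack([np.ones(n_plots), X])           # add an intercept column
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(np.round(coef[1:], 2))                          # close to true_w
```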

Principal component analysis (PCA)

PCA converts a set of observations of possibly correlated spectral ranges into a set of linearly uncorrelated variables (principal components). Each successive component explains the largest possible remaining variance under the constraint that it is orthogonal to the preceding components. The score matrix consists of the n dominating principal components. The assumptions are linearity and orthogonality of the principal components.

Advantage: The first principal component explains the largest share of the variance in the spectral data and weights each wavelength range individually.

Disadvantages: It has to be decided which and how many components to use in the regression equation, and the result is not directly related to the trait of interest (e.g. grain yield).
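The decomposition can be sketched with synthetic spectra via the singular value decomposition; the data and correlation structure below are invented for illustration.

```python
import numpy as np

# Synthetic spectrum matrix: rows = plots, columns = wavelength ranges.
rng = np.random.default_rng(1)
spectra = rng.normal(size=(40, 10))
# Make two ranges strongly correlated so PCA has structure to find.
spectra[:, 1] = spectra[:, 0] + rng.normal(scale=0.1, size=40)

Xc = spectra - spectra.mean(axis=0)        # centre each wavelength range
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * s                             # principal-component scores
explained = s**2 / np.sum(s**2)            # variance explained per component

# PC1 explains the largest share, and components are mutually orthogonal.
print(np.round(explained[:3], 2))
```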

Partial least square regression (PLSR)

PLSR decomposes the variability of the spectrum matrix into a small number of factors that capture most of the information in the spectral variables to predict the trait of interest (Naes et al., 2004). In addition, PLSR takes the structure of both the predicted and the predictor variables into account. The assumption is linearity. The method is suited when n >> N and when there is multicollinearity among the values. The correlation between the predicted and predictor variables can be depicted by plotting the loading values (= regression coefficients). For prediction, the optimum number of factors should be selected based on the lowest root mean square error of prediction (RMSE). Furthermore, the relative RMSE can be estimated as RMSE divided by the standard error of the trait.

Advantage: Trait prediction with several ranges is possible by reducing them to a defined number of PLSR factors and regressing the wavelength weights against grain yield (or any other trait of interest).

Disadvantage: Over-fitting occurs when more than 10 PLSR factors are used.
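As a sketch of how the factors are extracted, the block below implements the univariate PLS1 (NIPALS) algorithm on synthetic, collinear data; the function and variable names are illustrative, not from any particular package.

```python
import numpy as np

def pls1(X, y, n_comp):
    """Univariate PLS (NIPALS): extract n_comp factors, return coefficients."""
    X, y = X.copy(), y.copy()
    W, P, Q = [], [], []
    for _ in range(n_comp):
        w = X.T @ y
        w /= np.linalg.norm(w)       # weight vector of this factor
        t = X @ w                    # factor scores
        p = X.T @ t / (t @ t)        # X loadings
        q = (y @ t) / (t @ t)        # y loading (regression coefficient)
        X -= np.outer(t, p)          # deflate X and y
        y -= t * q
        W.append(w); P.append(p); Q.append(q)
    W, P, Q = np.array(W).T, np.array(P).T, np.array(Q)
    # Express the factor regression as coefficients on the original X.
    return W @ np.linalg.inv(P.T @ W) @ Q

# Synthetic spectra with two collinear ranges (invented numbers).
rng = np.random.default_rng(2)
X = rng.normal(size=(60, 8))
X[:, 1] = X[:, 0] + rng.normal(scale=0.05, size=60)
y = 2.0 * X[:, 0] - 1.0 * X[:, 3] + rng.normal(scale=0.1, size=60)

beta = pls1(X - X.mean(0), y - y.mean(), n_comp=3)
rmse = np.sqrt(np.mean(((X - X.mean(0)) @ beta - (y - y.mean())) ** 2))
print(round(float(rmse), 3))         # near the noise level of 0.1
```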

Ridge regression

Multicollinearity is reduced by adding a ridge parameter to the diagonal of the X'X matrix before inversion. The ridge parameter determines how far the ridge regression solution deviates from the least squares solution. The assumption is linearity.

Advantage: Trait prediction with several ranges while preventing over-fitting.
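The closed form can be sketched directly: the ridge parameter lam is added to the diagonal of X'X, shrinking the coefficients toward zero. The data below are synthetic, with one near-duplicate range to provoke multicollinearity.

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge solution: (X'X + lam * I)^(-1) X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Synthetic data with a near-duplicate wavelength range (invented numbers).
rng = np.random.default_rng(3)
X = rng.normal(size=(30, 5))
X[:, 1] = X[:, 0] + rng.normal(scale=0.01, size=30)   # collinear pair
y = X[:, 0] + rng.normal(scale=0.1, size=30)

b_ols = ridge(X, y, lam=0.0)      # lam = 0 reduces to least squares
b_ridge = ridge(X, y, lam=1.0)    # lam > 0 shrinks the coefficients
print(np.linalg.norm(b_ols) > np.linalg.norm(b_ridge))  # prints True
```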

Support vector machines

A Gaussian kernel is used to handle data that are not linearly separable by implicitly mapping them into a feature space in which a linear threshold can be applied. No linearity assumption is required; non-linear relationships can be modelled.

Advantage: SVMs can be a useful tool in the case of non-regularity in the data, for example when the data are not regularly distributed or have an unknown distribution.

Disadvantage: Lack of transparency of the results; the fitted model is difficult to interpret.
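The Gaussian (RBF) kernel at the heart of the method can be sketched directly: it scores the similarity of two spectra, and the SVM works with these kernel values instead of an explicit non-linear feature map. The spectra and the gamma value below are hypothetical.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """Gaussian (RBF) kernel: exp(-gamma * ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

# Hypothetical reflectance spectra (invented numbers).
x1 = np.array([0.1, 0.5, 0.9])
x2 = np.array([0.1, 0.5, 0.9])
x3 = np.array([0.9, 0.1, 0.2])

print(rbf_kernel(x1, x2))         # identical spectra -> similarity 1.0
print(rbf_kernel(x1, x3) < 1.0)   # dissimilar spectra -> lower similarity
```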

**Model validation**

Cross-validation (CV)

The data set is split into a training set (e.g. 4/5 of the data) and a validation set (the remaining 1/5). Within the training set, the effect of each wavelength is estimated and then used to predict performance within the validation set. This procedure is repeated several times, and the average prediction ability of the model is estimated as the correlation between predicted and actual performance.

Advantage: The training/validation split does not depend on the number of iterations.

Disadvantage: Some observations may never be selected for the validation subsample, whereas others may be selected more than once. The procedure is time-consuming because many cycles of calculation are required.
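A minimal sketch of the repeated 4/5 vs. 1/5 split on synthetic data: a least-squares model is fitted on each training part, and prediction ability is the average correlation between predicted and actual values.

```python
import numpy as np

# Synthetic trait data: 100 plots, 4 reflectance ranges (invented numbers).
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, 0.5, -0.5, 0.2]) + rng.normal(scale=0.2, size=100)

cors = []
for rep in range(20):                       # repeat the random split
    idx = rng.permutation(len(y))
    train, val = idx[:80], idx[80:]         # 4/5 training, 1/5 validation
    coef, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    pred = X[val] @ coef                    # predict the held-out plots
    cors.append(np.corrcoef(pred, y[val])[0, 1])

# Prediction ability = mean correlation over the repeated splits.
print(round(float(np.mean(cors)), 2))
```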

Leave one out validation (LOO)

All genotypes except one are used for model calibration. The estimated effects of the spectral ranges are then used to predict the performance of the left-out genotype.

Advantage: Samples used for validation are similar, thus the sample variance is low.

Disadvantage: The error variance within the calibration-validation process is high, and the procedure has to be repeated more often than CV (once per genotype).
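The leave-one-out loop can be sketched in a few lines on synthetic data: each genotype is left out in turn, the model is calibrated on the rest, and the left-out genotype is predicted.

```python
import numpy as np

# Synthetic data: 25 genotypes, 3 spectral ranges (invented numbers).
rng = np.random.default_rng(5)
X = rng.normal(size=(25, 3))
y = X @ np.array([0.7, -0.4, 0.2]) + rng.normal(scale=0.1, size=25)

preds = np.empty_like(y)
for i in range(len(y)):
    mask = np.arange(len(y)) != i                     # all genotypes but one
    coef, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
    preds[i] = X[i] @ coef                            # predict the left-out one

r = np.corrcoef(preds, y)[0, 1]                       # prediction ability
print(round(float(r), 2))
```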

December 27th, 2012 | Topic: Crop Science, Plant breeding