#### Chemometrics in Analytical Spectroscopy

**By Mike J. Adams**, The Royal Society of Chemistry

**Copyright © 2004 The Royal Society of Chemistry**

All rights reserved.

ISBN: 978-0-85404-595-2

CHAPTER 1

*Descriptive Statistics*

**1 Introduction**

The mathematical manipulation of experimental data is a basic operation associated with all modern instrumental analytical techniques. Computerization is ubiquitous and the range of computer software available to spectroscopists can appear overwhelming. Whether the final result is the determination of the composition of a sample or the qualitative identification of some species present, it is necessary for analysts to appreciate how their data are obtained and how they can be subsequently modified and transformed to generate the information. A good starting point in this understanding is the study of the elements of statistics pertaining to measurement and errors. Whilst there is no shortage of excellent books on statistics and their applications in spectroscopic analysis, no apology is necessary here for the basics to be reviewed.

Even in those cases where an analysis is qualitative, quantitative measures are employed in the processes associated with signal acquisition, data extraction and data processing. The comparison of, say, a sample's infrared spectrum with a set of standard spectra contained in a pre-recorded database involves some quantitative measure of similarity to find and identify the best match. Differences in spectrometer performance, sample preparation methods, and the variability in sample composition due to impurities will all serve to make an exact match extremely unlikely. In quantitative analysis the variability in results may be even more evident. Within-laboratory tests amongst staff and inter-laboratory round-robin exercises often demonstrate the far from perfect nature of practical quantitative analysis. These experiments serve to confirm the need for analysts to appreciate the source of observed differences and to understand how such errors can be treated to obtain meaningful conclusions from the analysis.

Quantitative analytical measurements are always subject to some degree of error. No matter how much care is taken, or how stringent the precautions followed to minimize the effects of gross errors from sample contamination or systematic errors from poor instrument calibration, random errors will always exist. In practice this means that although a quantitative measure of any variable, be it mass, concentration, absorbance value, *etc.,* may be assumed to approximate the unknown true value, it is unlikely to be exactly equal to it. Repeated measurements of the same variable on similar samples will not only provide discrepancies between the observed results and the true value, but there will be differences between the measurements themselves. This variability can be ascribed to the presence of random errors associated with the measurement process, *e.g.* instrument generated noise, as well as the natural, random variation in any sample's characteristics and composition. As more samples are analysed, or more measurements are repeated, then a pattern to the inherent scatter of the data will emerge. Some values will be observed to be too high and some too low compared with the correct result, if this is known. In the absence of any bias or systematic error the results will be distributed evenly about the true value. If the analytical process and repeating measurement exercise could be undertaken indefinitely, then the true underlying distribution of the data about the correct or expected value would be obtained. In practice, of course, this complete exercise is not possible. It is necessary to hypothesize about the scatter of observed results and assume the presence of some underlying predictable and well-characterized parent distribution. The most common assumption is that the data are distributed *normally.*

**2 Normal Distribution**

The majority of statistical tests, and those most widely employed in analytical science, assume that observed data follow a normal distribution. The normal, sometimes referred to as *Gaussian,* distribution function is the most important distribution for continuous data because of its wide range of practical application. Most measurements of physical characteristics, with their associated random errors and natural variations, can be approximated by the normal distribution. The well-known shape of this function is illustrated in Figure 1.1. As shown, it is referred to as the normal probability curve. The mathematical model describing the normal distribution function with a single measured variable, *x,* is given by Equation 1.1.

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right] \qquad (1.1)$$

The height of the curve at some value of *x* is denoted by *f(x)*, while μ and σ are characteristic parameters of the function. The curve is symmetric about μ, the *mean* or average value, and the spread about this value is given by the *variance,* σ², or *standard deviation,* σ. It is common for the curve to be standardized so that the area enclosed is equal to unity, in which case *f(x)* provides the probability of observing a value within a specified range of *x* values. With reference to Figure 1.1, one-half of observed results can be expected to lie above the mean and one-half below μ. Whatever the values of μ and σ, about one result in three will be expected to be more than one standard deviation from the mean, about one in twenty will be more than two standard deviations from the mean, and fewer than one in 300 will be more than 3σ from μ.
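These tail fractions can be checked directly. The sketch below (standard-library Python only; the function names are illustrative, not from the text) implements the density of Equation 1.1 and uses the error function to compute the area beyond *k* standard deviations:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Normal density f(x), Equation 1.1."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Area under the normal curve to the left of x, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# Fraction of observations expected more than k standard deviations from the mean
for k in (1, 2, 3):
    tail = 2.0 * (1.0 - normal_cdf(k))
    print(f"beyond {k} sigma: {tail:.4f}")
```

The printed tail fractions (approximately 0.317, 0.046 and 0.003) agree with the "one in three", "one in twenty" and "fewer than one in 300" figures quoted above.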

Equation 1.1 describes the idealized distribution function, obtained from an infinite number of sample measurements, the so-called *parent population distribution.* In practice we are limited to some finite number, *n*, of samples taken from the population being examined, and the *statistics*, or estimates, of mean, variance, and standard deviation are then denoted by $\bar{x}$, s², and *s* respectively. The mathematical definitions for these parameters are given by Equations 1.2–1.4,

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad (1.2)$$

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 \qquad (1.3)$$

$$s = \sqrt{s^2} \qquad (1.4)$$

where the subscript *i* (*i* = 1 ... *n*) denotes the individual elements of the set of data.
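Equations 1.2–1.4 translate directly into code. A minimal Python sketch, using hypothetical replicate values rather than the data of Table 1.1:

```python
def mean(x):
    """Arithmetic mean, Equation 1.2."""
    return sum(x) / len(x)

def variance(x):
    """Sample variance with the n - 1 divisor, Equation 1.3."""
    xbar = mean(x)
    return sum((xi - xbar) ** 2 for xi in x) / (len(x) - 1)

def std_dev(x):
    """Sample standard deviation, Equation 1.4."""
    return variance(x) ** 0.5

# Hypothetical sodium results, mg/kg (illustrative, not from Table 1.1)
data = [10.9, 11.2, 10.8, 11.1, 11.0]
print(mean(data), variance(data), std_dev(data))
```

Note the *n* − 1 divisor in the variance: it is the appropriate choice when *s*² estimates the parent variance from a finite sample.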

A simple example serves to illustrate the use of these statistics in reducing data to key statistical values. Table 1.1 gives one day's typical laboratory results for 40 mineral water samples analysed for sodium content by flame photometry. In analytical science it is common practice for such a list of replicated analyses to be reduced to these descriptive statistics. Despite their widespread use and analysts' familiarity with these elementary statistics, care must be taken with their application and interpretation; in particular, with what underlying assumptions have been made. Table 1.2 presents a somewhat extreme but illustrative set of data. Chromium and nickel concentrations have been determined in waste water supplies from four different sources (A, B, C and D). In all cases the mean concentration and standard deviation for each element are similar, but careful examination of the original data shows major differences in the results and element distribution. These data will be examined in detail later, but the practical significance of reducing the original data to summary statistics is questionable and may serve only to hide rather than extract information. As a general rule, it is always a good idea to examine data carefully before and after any transformation or manipulation to check for absurdities and loss of information.

Although both variance and standard deviation attempt to describe the width of the distribution profile of the data about a mean value, the standard deviation is often favoured over variance in laboratory reports as *s* is expressed in the same units as the original measurements. Even so, the significance of a standard deviation value is not always immediately apparent from a single set of data. Obviously a large standard deviation indicates that the data are scattered widely about the mean value and, conversely, a small standard deviation is characteristic of a more tightly grouped set of data. The terms 'large' and 'small' as applied to standard deviation values are somewhat subjective, however, and from a single value for *s* it is not immediately apparent just how extensive the scatter of values is about the mean. Thus, although standard deviation values are useful for comparing sets of data, a further derived function, usually referred to as the *relative standard deviation*, RSD, or *coefficient of variation,* CV, is often used to express the distribution and spread of data:

$$\%\mathrm{CV},\ \%\mathrm{RSD} = \frac{100\,s}{\bar{x}} \qquad (1.5)$$
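As a quick check of Equation 1.5, Python's standard `statistics` module can be applied to a set of hypothetical replicate values (not the data of Table 1.1):

```python
import statistics

# Hypothetical replicate sodium results, mg/kg (illustrative values)
data = [10.9, 11.2, 10.8, 11.1, 11.0]

xbar = statistics.mean(data)
s = statistics.stdev(data)   # sample standard deviation, n - 1 divisor
rsd = 100 * s / xbar         # Equation 1.5
print(f"%RSD = {rsd:.2f}")
```

Because the %RSD is dimensionless, it allows the spread of data sets measured in different units, or at very different mean levels, to be compared directly.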

If sets or groups of data of equal size are taken from the parent population then the mean of each group will vary from group to group, and these mean values form the sampling distribution of $\bar{x}$. As an example, if the 40 analytical results (mean = 11.04 mg kg⁻¹) provided in Table 1.1 are divided into five groups, each of eight results, then the group mean values are 11.05, 11.41, 10.85, 10.85, and 11.04 mg kg⁻¹. The mean of these values is still 11.04 mg kg⁻¹, but the standard deviation of the group means is 0.23 compared with 0.78 mg kg⁻¹ for the original 40 observations. The group means are less widely scattered about the mean than the original data (Figure 1.2). The standard deviation of group mean values is referred to as the *standard error of the sample mean,* $\sigma_m$, and is calculated from

$$\sigma_m = \frac{\sigma_p}{\sqrt{n}} \qquad (1.6)$$

where $\sigma_p$ is the standard deviation of the parent population and *n* is the number of observations in each group. It is evident from Equation 1.6 that the more observations taken, the smaller the standard error of the mean and the more precise the estimate of the mean. This distribution of sampled mean values provides the basis for an important concept in statistics. If random samples of group size *n* are taken from a normal distribution then the distribution of the sample means will also be normal. Furthermore, and this is not intuitively obvious, even if the parent distribution is not normal, provided large sample sizes (*n* > 30) are taken, the sampling distribution of the group means will still approximate the normal curve. Statistical tests based on an assumed normal distribution can therefore be applied to essentially non-normal data. This result is known as the *central limit theorem* and serves to emphasize the importance and applicability of the normal distribution function in statistical data analysis, since the means of samples drawn from non-normal data can still be subjected to basic statistical analysis.
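The grouping exercise above can be sketched in code. The simulation below draws 40 hypothetical results from a normal parent with the parameters of the worked example (mean 11.04, standard deviation 0.78 mg kg⁻¹), splits them into five groups of eight, and compares the observed scatter of the group means with the prediction of Equation 1.6; the simulated values are not the data of Table 1.1.

```python
import random
import statistics

random.seed(0)

# Simulate 40 results from a hypothetical normal parent (mu = 11.04, sigma = 0.78)
results = [random.gauss(11.04, 0.78) for _ in range(40)]

# Split into 5 groups of 8 and compute each group mean
groups = [results[i:i + 8] for i in range(0, 40, 8)]
group_means = [statistics.mean(g) for g in groups]

predicted_se = 0.78 / 8 ** 0.5              # Equation 1.6: sigma_p / sqrt(n)
observed_se = statistics.stdev(group_means)  # scatter of the group means
print(predicted_se, observed_se)
```

With only five group means the observed standard error fluctuates from run to run, but it is consistently far smaller than the parent standard deviation, as Equation 1.6 predicts.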

**3 Significance Tests**

Having introduced the normal distribution and discussed its basic properties, we can move on to the common statistical tests for comparing sets of data. These methods and the calculations performed are referred to as *significance tests.* An important feature and use of the normal distribution function is that it enables areas under the curve, within any specified range, to be accurately calculated. The function in Equation 1.1 can be integrated numerically and the results are presented in statistical tables as areas under the normal curve. From these tables, approximately 68% of observations can be expected to lie in the region bounded by one standard deviation from the mean, 95% within μ ± 2σ, and more than 99% within μ ± 3σ.

Returning to the data presented in Table 1.1 for the analysis of the mineral water: if the parent population parameters, σ and $\mu_0$, are 0.82 and 10.8 mg kg⁻¹ respectively, can we answer the question of whether the analytical results given in Table 1.1 are likely to have come from a water sample with a mean sodium level similar to that providing the parent data? In statistical terminology, we wish to test the *null hypothesis* that the means of the sample and the suggested parent population are similar. This is generally written as

$$H_0: \bar{x} = \mu_0 \qquad (1.7)$$

*i.e.* there is no difference between $\bar{x}$ and $\mu_0$ other than that due to random variation. The lower the probability that the difference occurs by chance, the less likely it is that the null hypothesis is true. To decide whether to accept or reject the null hypothesis, we must declare a value for the chance of making the wrong decision. If we assume there is less than a 1 in 20 chance of the difference being due to random factors, the difference is *significant* at the 5% level (usually written as α = 5%). We are willing to accept a 5% risk of rejecting the conclusion that the observations are from the same source as the parent data if they are in fact similar.

The test statistic for such an analysis is denoted by *z* and is given by

$$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} \qquad (1.8)$$

$\bar{x}$ is 11.04 mg kg⁻¹, as determined above, and substituting into Equation 1.8 the values for $\mu_0$ and σ gives

$$z = \frac{11.04 - 10.80}{0.82/\sqrt{40}} = 1.85 \qquad (1.9)$$

The extreme regions of the normal curve containing 5% of the area are illustrated in Figure 1.3, and the boundary values can be obtained from statistical tables. The selected portion of the curve, dictated by our limit of significance, is referred to as the critical region. If the value of the test statistic falls within this region then the null hypothesis is rejected, and we conclude that the samples are unlikely to have come from the parent source. From statistical tables, 2.5% of the area lies below −1.96σ and 2.5% lies above +1.96σ. The calculated value for *z* of 1.85 does not exceed the tabulated *z*-value of 1.96, and the conclusion is that the mean sodium concentrations of the analysed samples and the known parent sample are not significantly different.
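The complete test of the worked example takes only a few lines of Python; the numerical values below are those quoted in Equations 1.8 and 1.9:

```python
import math

# Values from the worked example: sample mean, parent mean, parent sigma, n
xbar, mu0, sigma, n = 11.04, 10.80, 0.82, 40

z = (xbar - mu0) / (sigma / math.sqrt(n))   # Equation 1.8
print(f"z = {z:.2f}")

# Two-tailed test at the 5% level: critical value 1.96
significant = abs(z) > 1.96
print("reject H0" if significant else "fail to reject H0")
```

Since |z| = 1.85 falls short of the critical value 1.96, the program reports that H0 cannot be rejected, matching the conclusion above.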
