Managers-Net

Data Validation

In this Topic the following terms will be used:

population: the complete set of items under consideration.
sampling frame: a list of the population (e.g. telephone directory, electoral register, a consignment of goods).
sample: the items taken from the population as a representative set.
mean: the most commonly used measure of average (the sum of the values divided by the number of values).

Why is data validation so important?

In another topic, Data Collection, reference was made to validation. Data must be validated because a sample taken from a population reliably represents only those items which make up the sample. Because none of the other items in the population have been examined, a relatively small sample cannot be assumed to represent the mean of the population exactly.

If this is so, then how reliable are the results of the sample? The process of estimating the reliability is known as validation. We must accept that the results of a sample do not represent the population exactly, but at least we can assess the extent of the error due to sampling, i.e. the statistical error.

From this, it is clear that accepting results of a sample is dangerous unless we appreciate the size of the statistical error.

Estimating the statistical error

Without knowing much about statistics, it is evident that the bigger our sample size, the more accurate will be our estimate for the population. For example, in order to find the average or mean height of men in a certain country, if we measured the height of just one man taken at random we could not say with any certainty that this was a good estimate of the average man. We would need to take a much larger sample.

However, it is natural to assume that larger populations need a larger sample size, and smaller populations a correspondingly smaller sample size, to achieve the same "accuracy" or reliability. Natural, maybe, but not so! Provided that there is no bias at all, a sample of 1,000 men taken randomly from a population of, say, 100,000 men would produce practically the same reliability as 1,000 men out of 1,000,000. What matters is the size of the sample, provided, of course, that the population is large relative to the sample.
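The claim that population size hardly matters can be checked with a short calculation. The sketch below is an illustration only: it assumes simple random sampling without replacement and uses the standard finite population correction, which this Topic does not otherwise cover. The function name and the standard deviation of 7 cm are hypothetical values chosen for the example.

```python
import math

def standard_error_finite(sigma, n, population_size):
    """Standard error of the mean for a simple random sample drawn without
    replacement, including the finite population correction."""
    fpc = math.sqrt((population_size - n) / (population_size - 1))
    return sigma / math.sqrt(n) * fpc

sigma = 7.0  # assumed standard deviation of height in cm (illustrative only)
n = 1000     # sample size used in the text

se_small = standard_error_finite(sigma, n, 100_000)    # population of 100,000
se_large = standard_error_finite(sigma, n, 1_000_000)  # population of 1,000,000
se_infinite = sigma / math.sqrt(n)                     # ignoring population size

print(round(se_small, 4), round(se_large, 4), round(se_infinite, 4))
```

Under these assumptions the three values differ by well under 1%, which is why the sample size, not the population size, dominates the reliability.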

The main quantity usually considered when assessing the reliability of a mean is a measure of the spread of the sample items, such as the standard deviation.

Simply put, the standard deviation is a measure of the dispersion of all items in the sample. Standard deviation is most easily calculated using a scientific hand-calculator (one which includes the standard deviation facility) or a computer spreadsheet (such as Excel or Lotus 1-2-3). It is usually identified by the Greek letter sigma (σ).
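As an alternative to a calculator or spreadsheet, the standard deviation can also be computed with a few lines of Python. This is a minimal sketch using the standard library's statistics module; the height values are made up for illustration.

```python
import statistics

# Hypothetical sample of men's heights in cm (illustrative data only)
heights_cm = [172.0, 168.5, 181.2, 175.4, 169.9, 178.3, 174.1, 171.6]

mean_height = statistics.mean(heights_cm)

# stdev() divides by (n - 1): the usual choice when the data are a sample
sample_sd = statistics.stdev(heights_cm)

# pstdev() divides by n: appropriate when the data are the whole population
population_sd = statistics.pstdev(heights_cm)

print(round(mean_height, 2), round(sample_sd, 2), round(population_sd, 2))
```

Most calculators and spreadsheets offer both versions; for sampled data the (n - 1) version is normally the one to use.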

An estimate of the reliability of the mean of a sample as a representation of the mean of the whole population is made using the standard error of the mean.

Using this standard error, our estimate of the mean of the population is equal to the sample mean +/- one standard error.

Standard error

The standard error is identified, like the standard deviation, by the lower-case Greek letter sigma, but with the subscript "s.e.": σ_s.e.

Calculation of the standard error depends on the data being considered. For example:

for continuous data (i.e. data measured in "units"):
σ_s.e. = (standard deviation) / (square root of the sample size)
for discrete data (e.g. binomial):
σ_s.e. = square root of [(p x (1 - p)) / n]
where:
- p = probability of an event occurring (example: proportion of reject items in a sample)
- 1 - p = probability it does not occur (proportion of acceptable items in the sample)
- n = sample size
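For the continuous case, the formula above can be sketched as follows. The function name and the bolt length measurements are hypothetical and serve only to show the arithmetic.

```python
import math
import statistics

def standard_error_of_mean(values):
    """Standard error for continuous data:
    sample standard deviation divided by the square root of the sample size."""
    return statistics.stdev(values) / math.sqrt(len(values))

# Hypothetical bolt lengths in mm (illustrative data only)
bolt_lengths_mm = [50.1, 49.8, 50.3, 50.0, 49.9, 50.2, 49.7, 50.0, 50.1]

sample_mean = statistics.mean(bolt_lengths_mm)
se_mean = standard_error_of_mean(bolt_lengths_mm)

print(f"mean = {sample_mean:.3f} mm, standard error = {se_mean:.3f} mm")
```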

Example: A sample of 100 bolts is taken at random from a large consignment delivered from the manufacturer. The customer's inspector finds two reject bolts in the sample of 100. The inspector treats this as a binomial situation when calculating the mean and standard error, because there are only two possibilities, i.e. acceptable or reject.

Mean = 2 out of 100 (2%), i.e. a probability of rejects of 0.02 and a probability of good ones of 0.98.

So the standard error is square root of (0.02 x 0.98 / 100) = 0.014, i.e. 1.4%, or 1.4 rejects in every 100.

Interpretation: So, the estimated proportion of rejects in the rest of the batch, which was not tested, will be:

2 rejects +/- 1.4 rejects, i.e. between 0.6 and 3.4 rejects in every 100 in the batch or, more realistically, between 6 and 34 reject bolts in every 1,000 bolts in the batch.
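The bolt example can be reproduced in a few lines, which makes it easy to re-run with other sample sizes or reject counts. This is a sketch of the binomial formula given earlier, using a range of one standard error either side of the sample proportion; the variable names are illustrative.

```python
import math

n = 100        # sample size
rejects = 2    # reject bolts found in the sample

p = rejects / n                        # proportion of rejects in the sample
se = math.sqrt(p * (1 - p) / n)        # binomial standard error as a proportion

low = p - se                           # lower end of the one-standard-error range
high = p + se                          # upper end of the one-standard-error range

print(f"standard error: {se * 100:.1f} rejects per 100")
print(f"range per 100:   {low * 100:.1f} to {high * 100:.1f}")
print(f"range per 1,000: {low * 1000:.0f} to {high * 1000:.0f}")
```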

However, there is always a catch! This is the matter of how confident one should be in all of these calculations. The example used a single standard error, but statistically a range of one standard error either side of the sample mean gives only about 68% confidence that it contains the true value for the population. This level of confidence, and those in the following examples, are obtained from the statistical tables for the normal distribution found in the appendices of most serious books on statistical method.

To improve the confidence, we must take more standard errors in the calculation. Thus, using the most commonly accepted levels of confidence as examples:

the sample mean +/- two standard errors gives a confidence of about 95% in the result.
the sample mean +/- three standard errors gives a confidence of about 99.7%.
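Instead of looking these levels up in printed tables, they can be computed directly. The sketch below assumes the sampling error follows a normal distribution (the usual basis for such tables) and uses the standard library's NormalDist; the function name is illustrative.

```python
from statistics import NormalDist

def confidence_for_standard_errors(k):
    """Probability that a normally distributed estimate lies within
    k standard errors either side of the true value."""
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"+/- {k} standard error(s): {confidence_for_standard_errors(k) * 100:.1f}% confidence")
```

For small samples, or for proportions very close to 0 or 1, the normal assumption is only approximate, so the figures should be read as a guide.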

Suppose we are not happy with the error, i.e. if it is too large to tolerate, what then? The way to reduce the margin of error is to take a larger sample size, the size of which can be calculated by rearranging the formulae given previously.
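Rearranging the standard error formulae for the sample size might look like the sketch below. Here "margin" is the largest error one is prepared to tolerate and k is the number of standard errors used (2 for roughly 95% confidence); both names, and the example figures, are assumptions made for illustration.

```python
import math

def sample_size_for_mean(sigma, margin, k=2):
    """Continuous data: solve k * sigma / sqrt(n) <= margin for n."""
    return math.ceil((k * sigma / margin) ** 2)

def sample_size_for_proportion(p, margin, k=2):
    """Binomial data: solve k * sqrt(p * (1 - p) / n) <= margin for n."""
    return math.ceil(k ** 2 * p * (1 - p) / margin ** 2)

# Heights with an assumed standard deviation of 7 cm, to within +/- 1 cm
n_heights = sample_size_for_mean(sigma=7.0, margin=1.0)

# Bolts with about 2% rejects, to within +/- 1 percentage point
n_bolts = sample_size_for_proportion(p=0.02, margin=0.01)

print(n_heights, n_bolts)
```

Because the margin is squared in the denominator, halving the tolerated error requires roughly four times the sample size.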

To sum up.

When one undertakes sampling as a method of assessing a large population of data, the apparent error depends on two things:

the spread of the values in the population and sample, measured by the standard deviation, which together with the sample size produces the standard error, and
the level of confidence in the result that we require.

Applications

These fundamental procedures are used as a basis for Statistical Process Control and, more generally, in inspection processes and management reports. Most importantly, you should always be aware of how reliable a sample result is in terms of sampling error and confidence.

For more information, contact: Managers-Net.