Managers-Net

Data collection

What is/are data?

One definition of data is: "known facts or things used as a basis for inference or reckoning": - The OED.

Another is: "facts given from which others may be inferred": - Chambers Dictionary.

The term "data" is more commonly used as another word for "statistics", or numerical facts. The UK Prime Minister Disraeli is quoted as saying, "There are lies, damned lies and statistics". Indeed, statistical data can be presented to mean whatever you wish them to mean. ("Data" is a plural word, the singular being "datum". However, through American influence it is now acceptable to treat "data" as singular, writing "data is" rather than "data are".)

Data into knowledge - a recap on fundamentals

Data are facts, for example the number of items counted, or measurements of these items. To be of use, data must be transformed into knowledge so that inferences can be drawn from them, for example a decision as to whether or not a component is capable of carrying out its allotted function.

Forms of data

Data can be separated into three categories (variables):

discrete variables, which are numerical and can take only particular values, such as the number of workers in an organization (i.e. they are counted in whole units);
continuous variables, which are measurements of items in units such as metres, litres, seconds or volts (i.e. length, volume, time and other quantities measured on a continuous scale);
attribute variables, which are descriptive, e.g. a machine "on" or "off", or an employee absent or present.

Important: when dealing with any problem in which statistical methods are used, it is crucial to differentiate between the three types of data, because the distinctions usually dictate which form of analysis is appropriate.

The main phases in the collection of data using sampling methods are:

definition of the purpose or objective for collecting the data,
identification of the entire "population" from which the data are to be collected (e.g. by means of a sampling frame),
decisions on:
- method of collection, or how the data are to be collected
- sample size (i.e. how many readings to collect), and
validation of the results, this being a vital part of the collection/analysis process.

Note: whereas "population" once referred to people, the term is now used to describe the whole situation to be sampled.

Sampling

One important thing to bear in mind is that something in the system must be random: either the situation being sampled is itself random, or the sampling method contains a random element for picking the components of the sample.

The choice of sampling method depends on the type of data being sampled. The following describes three methods, all of which are covered in more detail in another Topic in this Web-site (see References, below).

Random sampling.

A common method is simple random sampling, or the lottery method. One of the most convenient ways is to allocate a number to every component of the population to be sampled and then draw as many numbers at random as the sample size requires. The ways of obtaining a random sample of numbers range from drawing numbers blindly "from a hat" (or the mechanized version of agitated balls being ejected from a drum) to the use of computer-generated numbers.
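The lottery method can be sketched in a few lines of Python using computer-generated numbers. This is a minimal illustration only: the function name, the population of 500 numbered items and the sample size of 20 are assumptions made for the example, not figures from this Topic.

```python
import random

def simple_random_sample(population_size, sample_size, seed=None):
    """Draw sample_size distinct item numbers from 1..population_size."""
    rng = random.Random(seed)  # a seed only makes the draw repeatable
    # random.sample draws without replacement, so no item is picked twice
    return sorted(rng.sample(range(1, population_size + 1), sample_size))

# Hypothetical example: 500 numbered items, 20 of them to be inspected
chosen = simple_random_sample(population_size=500, sample_size=20, seed=1)
print(chosen)
```

Every numbered item has the same chance of being drawn, which is the computer equivalent of pulling numbers from a hat.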

Systematic sampling.

Often known as the constant skip method, this form of sampling is based on taking every nth reading from a population whose order is assumed to be random. For example, a survey might take every 9th house in a street (numbers 3, 12, 21, 30, 39 and so on). Care must be taken to avoid bias: in the UK, where odd and even house numbers are usually on opposite sides of the street, taking every 10th house would mean that they were all on the same side of the road, and this might be significant.
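The constant skip idea, and the bias it can introduce, can be illustrated with a short Python sketch. The street of 40 houses and the function name are assumptions made for the example.

```python
def systematic_sample(population, start_index, skip):
    """Take every skip-th item, beginning at start_index (counting from 0)."""
    return population[start_index::skip]

houses = list(range(1, 41))  # a hypothetical street with houses numbered 1 to 40

every_ninth = systematic_sample(houses, start_index=2, skip=9)
print(every_ninth)  # [3, 12, 21, 30, 39]

# An even skip keeps returning numbers of the same parity,
# i.e. houses on one side of a UK street only
every_tenth = systematic_sample(houses, start_index=3, skip=10)
print(every_tenth)  # [4, 14, 24, 34]
```

In practice the starting point would normally be chosen at random, which supplies the random element the method needs.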

Stratified sampling.

In order to ensure that all groups in a population are properly represented, this method separates the population into strata and allocates proportional representation to each stratum. With people, the strata may be occupations, social classes, ages or income groups, for example. Once the strata have been defined, one of the other two methods may be used to select the sample within each stratum.
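Proportional representation can be sketched as follows. The three occupational strata, their sizes and the overall sample of 50 are invented for illustration, and simple random sampling is assumed within each stratum.

```python
import random

def proportional_allocation(strata_sizes, sample_size):
    """Share the sample between strata in proportion to their sizes."""
    total = sum(strata_sizes.values())
    # Rounding means the shares may not always add up exactly to sample_size
    return {name: round(sample_size * size / total)
            for name, size in strata_sizes.items()}

def stratified_sample(strata, sample_size, seed=None):
    """Draw a simple random sample of the allocated size from each stratum."""
    rng = random.Random(seed)
    sizes = {name: len(members) for name, members in strata.items()}
    allocation = proportional_allocation(sizes, sample_size)
    return {name: rng.sample(members, allocation[name])
            for name, members in strata.items()}

# Hypothetical workforce of 1000 people, identified by employee number
strata = {
    "operators": list(range(0, 600)),
    "clerical": list(range(600, 900)),
    "managers": list(range(900, 1000)),
}
allocation = proportional_allocation({k: len(v) for k, v in strata.items()}, 50)
print(allocation)  # {'operators': 30, 'clerical': 15, 'managers': 5}

sample = stratified_sample(strata, sample_size=50, seed=1)
```

Each stratum contributes the same fraction of its members (here 5%), so no group is over- or under-represented by chance.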

Other methods.

These include quota sampling, cluster sampling and multi-stage sampling.

Validation

A sample is of little use if it does not represent the whole population. Clearly, no sample can exactly reflect the result that would have been obtained had the whole population been surveyed, so the sample result will probably differ from the true situation. What is important is that we are aware of the probable statistical errors which inevitably arise because the whole population was not investigated. Provided that the population is relatively large, the magnitude of the statistical error depends not on the size of the population but on the size of the sample. The error can be calculated (dealt with elsewhere in this Managers-net Web-site) or, alternatively, the sample size can be calculated prior to data collection if we decide on the size of the error which we can tolerate. If the resulting error is too large, then a bigger sample must be taken, i.e. a further set of observations is added to the existing ones. At the very least, we can be aware of the statistical error to which our results are subject due to sampling, and use the data accordingly.
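As an illustration of how error and sample size are linked, the sketch below uses the widely taught normal approximation for a sampled proportion (attribute data) at 95% confidence. That choice of formula and confidence level is an assumption made for the example, and it is not necessarily the calculation presented elsewhere on this Web-site.

```python
import math

Z_95 = 1.96  # standard normal value for 95% confidence (assumed level)

def margin_of_error(p, n, z=Z_95):
    """Approximate error (+/-) on an observed proportion p from n readings."""
    return z * math.sqrt(p * (1 - p) / n)

def required_sample_size(p, tolerable_error, z=Z_95):
    """Readings needed so that the error does not exceed tolerable_error."""
    return math.ceil(z ** 2 * p * (1 - p) / tolerable_error ** 2)

# Hypothetical study: about half of the readings are expected to be "present"
print(margin_of_error(p=0.5, n=400))                     # roughly 0.049
print(required_sample_size(p=0.5, tolerable_error=0.05))
```

Note that the population size does not appear in either function, in line with the point above that, for a relatively large population, the error depends on the size of the sample only.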

Validation of data is covered in more detail in another topic on this Web-site.