From the Publisher
"…useful…for the theoretically inclined survey statistician…" (Technometrics, May 2007)
"This impressive piece of work can be thoroughly recommended to anyone" (Zentralblatt MATH Volume 1079)
Around the world a multitude of surveys are conducted every day, on a variety of subjects, and consequently surveys have become an accepted part of modern life. However, in recent years survey estimates have been increasingly affected by rising trends in nonresponse, with loss of accuracy as an undesirable result. Whilst it is possible to reduce nonresponse to some degree, it cannot be completely eliminated. Estimation techniques that account systematically for nonresponse and at the same time succeed in delivering acceptable accuracy are much needed.
Estimation in Surveys with Nonresponse provides an overview of these techniques, presenting the view of nonresponse as a normal (albeit undesirable) feature of a sample survey, one whose potentially harmful effects are to be minimised.
The accessible style of Estimation in Surveys with Nonresponse will make this an invaluable tool for survey methodologists in national statistics agencies and private survey agencies. Researchers, teachers, and students of statistics, social sciences and economics will benefit from the clear presentation and numerous examples.
2.1. THE SURVEY OBJECTIVE
The objective of a survey is to provide information about unknown characteristics, called parameters, of a finite collection of elements, called a population, such as a population of individuals, of households, or of enterprises. A typical survey involves many study variables and produces estimates of different types of parameters. Simple parameters are the total or the mean of a study variable, or the ratio of the totals of two study variables. Different types of elements are sometimes measured in the same survey, as when both individuals and households are observed.
Many surveys are conducted periodically, for example monthly or yearly. As a consequence, an important objective is to measure the change in the level of a variable between two survey occasions. The objectives of estimation of change and estimation of level often coexist in a survey, but they may require somewhat different techniques.
A survey usually originates in an expressed need for information about a social or economic issue, a need which existing data sources are incapable of filling. The first step in the planning process is to determine the survey objectives as clearly and unambiguously as possible. The next step, referred to as survey design, is to develop the methodology for the survey.
Survey design involves making decisions on a number of future survey operations. The data collection method must be decided on, a questionnaire must be designed, constructed and pretested, procedures must be set out for minimizing or controlling response errors, the sampling method must be decided on, interviewers must be selected and trained, techniques for handling nonresponse must be agreed on, and procedures for tabulation and analysis must be settled.
A survey will usually encounter various technical difficulties. No survey is perfect in all regards. The statistics that result from the survey are not error-free. The frame from which the sample is drawn is hardly ever perfect, so there will be coverage errors. There will be sampling error whenever observation is limited to a sample of elements, rather than to the entire population. No matter how carefully the survey is designed and conducted, some of the desired data will be missing, because of refusal to provide information or because contact cannot be established with a selected element. Since nonresponding elements may be systematically different (for example, have larger or smaller variable values, on average) from responding elements, there will be nonresponse error.
These three types of error - sampling error, nonresponse error and coverage error - are discussed at length in this book, especially the first two. It is true that a survey will usually also have other imperfections, such as measurement error and coding error. These errors are not discussed.
Subpopulations of interest are called domains. If the survey is required to give accurate information about many domains, a complete enumeration within these domains may become necessary, especially if they are small.
The survey planner is likely to first consider whether statistics derived from available administrative registers could satisfy the need for information. This avenue can be followed in countries well endowed with high-quality administrative registers. If not, a census (a complete enumeration of the population) may have to be conducted. If all domains of interest are at least moderately large, a sample survey may give statistics of sufficient accuracy.
These three different types of survey - based on administrative registers, census survey and sample survey - differ not only in the extent to which they can produce accurate information for domains, but also in other important respects. For example, sample surveys have the advantage of yielding diverse and timely data on specified variables, whereas statistics derived from administrative registers, although perhaps less expensive, may give information of limited relevance, because except in fortunate circumstances, available registers are not designed to meet specific information needs. On the other hand, a census might provide the desired information with great accuracy, but is very expensive to conduct. For a discussion of these issues, see Kish (1979).
Most of the issues raised in the following apply to all three types of survey. But most often, we will have in mind a sample survey. Therefore, the term 'survey' will usually refer to a 'sample survey'. We will now review some frequently used survey terminology.
A survey seeks to provide information about a target population. The delimitation of the target population must be clearly stated at the planning stage of the survey. The statistician's interest does not lie in publishing information about individual elements of the target population (such disclosure is often ruled out by law), but in measuring quantities (totals or functions of totals) for aggregates of elements, the whole population or domains. These targeted quantities are called parameters or parameters of interest. For example, three important objectives of a labour force survey (as conducted in most industrialized countries) are to obtain information about the number of unemployed, the number of employed and the unemployment rate. These are examples of parameters. The first two parameters are population totals. The third is a ratio of population totals, namely, the number of unemployed persons divided by the total number of persons in the labour force.
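To make the distinction between totals and functions of totals concrete, here is a minimal Python sketch. The eight-person population and its status labels are invented for illustration, and the sketch assumes the labour force consists of the employed plus the unemployed.

```python
# Hypothetical toy population: one labour-market status per person.
statuses = [
    "employed", "employed", "unemployed", "not_in_labour_force",
    "employed", "unemployed", "employed", "not_in_labour_force",
]

def count_total(values, category):
    """Population total of the 0/1 indicator 'element belongs to category'."""
    return sum(1 for value in values if value == category)

# Two parameters that are population totals.
employed_total = count_total(statuses, "employed")
unemployed_total = count_total(statuses, "unemployed")

# Assumption: labour force = employed + unemployed.
labour_force_total = employed_total + unemployed_total

# A parameter that is a ratio of two population totals.
unemployment_rate = unemployed_total / labour_force_total

print(employed_total, unemployed_total, round(unemployment_rate, 3))
```

In a real survey these quantities are unknown and must be estimated from a sample. The sketch only shows what the parameters are when the whole population is observed.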
Examples of other population parameters are population means - for example, mean household income - and regression coefficients, say, the regression coefficient of income (dependent variable) regressed on number of years of formal education (independent variable), for a population of individuals. We can estimate any of these parameters with the aid of data on a sample of elements.
The sample is a selection from the frame population. The frame population is a list or other device that identifies and represents all elements that could possibly be drawn. Ideally, the frame population represents exactly the set of physically existing elements that make up the target population. In reality, the frame population and the target population differ more or less, as we discuss in more detail later.
Sampling design is used as a generic term for the (usually probabilistic) rule that governs the sample selection. Commonly used sampling designs are: simple random sampling (SI), stratified simple random sampling (STSI), cluster sampling, two-stage sampling, and probability-proportional-to-size (πps) sampling, of which Poisson sampling is a special case. With the possible exception of SI, these designs require planning before sampling can be carried out. STSI requires a set of well-defined strata. Cluster sampling requires a decision on what clusters to use. Sampling in two or more stages requires a clear definition of the first-stage sampling units, the second-stage units, and so on.
Every sampling design involves two other important general concepts: inclusion probabilities and design weights. The inclusion probability of an element is the known probability with which it is selected under the given sampling design. The design weight of an element is computed as the inverse of its inclusion probability, assumed to be greater than zero for all elements. Examples of designs where the inclusion probabilities are equal for all elements are SI and STSI with proportional allocation. Many sampling designs used in practice do not give the same inclusion probability to all population elements. In STSI, the inclusion probabilities are equal within strata, but they can differ widely between strata.
The inclusion probability can never exceed one. Consequently, a design weight is greater than or equal to one. The inclusion probability (and the design weight) is equal to one for an element that is selected with certainty. Many business surveys include a number of elements (usually very large elements) that are 'certainty elements'. These form a subgroup often called a take-all stratum.
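The definitions above can be written compactly. The symbols are assumed notation, not taken from this excerpt: U is the population, s the sample, n and N the sample and population sizes, and n_h and N_h the corresponding sizes in stratum h.

```latex
\pi_k = \Pr(k \in s), \qquad d_k = \frac{1}{\pi_k}, \qquad
0 < \pi_k \le 1 \;\Longrightarrow\; d_k \ge 1 \quad \text{for all } k \in U

\text{SI: } \pi_k = \frac{n}{N}, \qquad
\text{STSI: } \pi_k = \frac{n_h}{N_h} \ \text{ for } k \text{ in stratum } h
```

Under proportional allocation the sampling rate n_h/N_h is the same in every stratum, which is why STSI with proportional allocation gives equal inclusion probabilities to all elements.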
A majority of the elements have inclusion probabilities strictly less than one. For example, in an STSI design, an element belonging to a stratum from which 200 elements are selected out of a total of 1600 has an inclusion probability equal to the sampling rate in the stratum, 200/1600 = 0.125, and its design weight is 1/0.125 = 8. One often-heard interpretation is that 'an element with a design weight equal to 8 represents itself and seven other (non-sampled, non-observed) population elements as well'. When it comes to estimation, the observed value for this element is given the weight 8. Another stratum in the same survey may have 100 sampled elements out of a total of 200. Each element in this stratum has the inclusion probability 100/200 = 0.5, and its design weight is then 1/0.5 = 2.
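The arithmetic in this example can be checked with a short Python sketch. The stratum labels "A" and "B", the function names and the two observed values are hypothetical, introduced only for illustration.

```python
def inclusion_probability(n_sampled, n_in_stratum):
    """Sampling rate in an STSI stratum: sampled elements over stratum size."""
    if not 0 < n_sampled <= n_in_stratum:
        raise ValueError("need 0 < n_sampled <= n_in_stratum")
    return n_sampled / n_in_stratum

def design_weight(n_sampled, n_in_stratum):
    """Design weight: the inverse of the inclusion probability."""
    return 1.0 / inclusion_probability(n_sampled, n_in_stratum)

# The two strata from the text: (sampled elements, stratum size).
strata = {"A": (200, 1600), "B": (100, 200)}
weights = {name: design_weight(n, size) for name, (n, size) in strata.items()}
print(weights)  # {'A': 8.0, 'B': 2.0}

# The weights of the sampled elements in a stratum add up to the stratum size.
weight_sum_a = strata["A"][0] * weights["A"]

# In estimation each observed value is multiplied by its design weight.
observed_values_a = [3.0, 5.0]  # hypothetical observations from stratum A
weighted_contribution = sum(weights["A"] * y for y in observed_values_a)
```

The weight sum of 1600 for the first stratum is the numerical counterpart of the interpretation that each sampled element represents itself and seven others.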
STSI is a widely used design. It is very well suited for surveys of individuals and households in countries that can rely on a frame in the form of a total population register (see Example 2.1). Such a register lists the country's population and contains a number of variables suitable for forming strata, such as age, sex and geographical area. It is often of interest to measure households as well as individuals in the same survey. One way to obtain a sample of households from the sample of individuals is to identify the households to which the selected individuals belong. Household variables such as household expenditure can be observed, and variables on individuals, some or all of those residing in an identified household, can also be observed. We can obtain statistics on households as well as statistics on individuals.
The reverse order of selection is also possible. Practical considerations may necessitate drawing first a sample of households, with a specified sampling design, then selecting some or all of the individuals in the selected households. Again, both household variables and variables on individuals can be measured in the same survey. The selection of households can, for example, proceed by drawing a stratified sample of city blocks from a city map and then enumerating all the households in the selected city blocks.
In business surveys, the distribution of many variables of interest is highly skewed. The 'industry giants' account for a major share of the total for typical study variables related to production and output. The largest elements (enterprises) must be given a high inclusion probability (probability one or very near to one). Many business surveys use coordinated sampling for small enterprises to distribute the response burden. This entails some control over the frequency with which an enterprise is asked to provide information over a designated period of time, say a year. A number of countries have (to some extent different) systems for coordinated sampling. Statistics Sweden, for example, uses the system referred to as SAMU, described in Atmer et al. (1975). Another early reference for coordinated sampling is Brewer et al. (1972).
Coordinated sampling techniques are based on the concept of permanent random numbers: a uniformly distributed random number is attached at birth to a statistical element (an enterprise), and it remains with that element for the duration of its life; in that sense, it is permanent. The permanent random numbers play a crucial role in realizing both the desired inclusion probabilities and the desired degree of coordination of samples.
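The following Python sketch illustrates the permanent-random-number idea in its simplest form. It is not the SAMU system or any agency's production method. It assumes Poisson-type selection, in which an element is selected when its number falls inside an interval assigned to the survey, and the names `register_births` and `prn_sample` are hypothetical.

```python
import random

def register_births(prns, new_ids, rng):
    """Attach a uniform random number to each new element; existing ones keep theirs."""
    for element_id in new_ids:
        if element_id not in prns:
            prns[element_id] = rng.random()
    return prns

def prn_sample(prns, start, fraction):
    """Select elements whose permanent number lies in [start, start + fraction).

    The interval wraps around past 1. Each element is selected with
    probability equal to `fraction`, which is its inclusion probability.
    """
    if not (0.0 <= start < 1.0 and 0.0 < fraction <= 1.0):
        raise ValueError("need 0 <= start < 1 and 0 < fraction <= 1")
    end = start + fraction
    return {e for e, u in prns.items() if start <= u < end or u < end - 1.0}

rng = random.Random(2005)  # fixed seed so the sketch is reproducible
prns = register_births({}, range(1000), rng)

# Non-overlapping intervals: no enterprise is asked to answer both surveys.
survey_a = prn_sample(prns, start=0.0, fraction=0.1)
survey_b = prn_sample(prns, start=0.1, fraction=0.1)

# Births receive new numbers; the numbers of existing elements do not change.
before = dict(prns)
register_births(prns, range(1000, 1050), rng)
survey_a_next = prn_sample(prns, start=0.0, fraction=0.1)
```

Because the numbers are permanent, reusing the same interval on a later occasion retains every earlier selection that is still in the frame, while shifting the interval rotates the response burden to other enterprises.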
2.2. SOURCES OF ERROR IN A SURVEY
In this section we discuss frames, sampling and nonresponse. Figure 2.1 gives the background.
We define the target population to be the set of elements that the survey aims to encompass at the point in time when the data are collected, by the completion of a questionnaire or in some other way. This point in time is called the reference time point for the target population. The sampling frame, on the other hand, is usually constructed at an earlier date, sometimes as much as 12 months earlier. This time point is referred to as the reference time point for the frame population. The lag between the two time points should be as short as possible, because the risk of coverage errors increases with the time lag. Three types of coverage error are commonly distinguished: undercoverage, overcoverage and duplicate listings. We comment on the first two of these in particular. As the name suggests, duplicate listings refer to errors occurring when a target population element is listed more than once in the frame.
Elements that are in the target population but not in the frame population constitute undercoverage. Especially in business surveys, a significant part of the undercoverage is made up of elements that are new to the target population and therefore absent from the frame population. These are commonly referred to as 'births'. Undercoverage may have other causes as well.
Elements that are in the frame population but not in the target population constitute overcoverage. Elements that have ceased to exist somewhere between the two reference time points can be a significant source of overcoverage. They are commonly referred to as 'deaths'.
It follows that undercoverage elements have zero probability of being selected for any sample drawn from the frame population. This is undesirable, because of the bias that can result if the study variable values for the undercoverage elements differ systematically from those of other population elements.
Bias from overcoverage can usually be avoided as long as it is possible to correctly identify the sample elements that belong to the overcoverage. But this is not always possible. For example, elements that are selected into the sample but become nonrespondents can often not be correctly classified as either 'in the target population' or 'in the overcoverage'. Biased estimates can be an undesirable consequence.
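The three coverage errors distinguished above can be expressed as simple set operations. This is a minimal Python sketch with invented element identifiers. In practice the target population is not fully known, so these sets cannot be computed directly.

```python
from collections import Counter

# Hypothetical identifiers. The frame is a list so that repeated entries stay visible.
target_population = {"e1", "e2", "e3", "e5"}
frame_listing = ["e1", "e2", "e2", "e4"]

frame_population = set(frame_listing)

# In the target population but missing from the frame (for example, births).
undercoverage = target_population - frame_population

# In the frame but no longer in the target population (for example, deaths).
overcoverage = frame_population - target_population

# Target population elements listed more than once in the frame.
listing_counts = Counter(frame_listing)
duplicates = {e for e, c in listing_counts.items() if c > 1 and e in target_population}

print(sorted(undercoverage), sorted(overcoverage), sorted(duplicates))
```

Elements in `undercoverage` can never be drawn from the frame, which is the zero selection probability noted above.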
Although attempts may be made to minimize the lag between the frame population reference time point and the target population reference time point, the time lag is often considerable. It may be a practical necessity. One reason may be slow updating of the frame. As time goes by, events occur that motivate a change or update of the frame information. An example is a change in a variable value for an element existing in the frame, as when updated information is received about the number of employees or the gross business income of an enterprise. Such changes are sometimes recorded only with considerable delay.
It follows that the values recorded in the frame, at a given point in time and for a specific frame variable, may refer to different points in time for different elements. Not all elements are necessarily updated at the same moment. This is not ideal, but it is a reality that has to be accepted.
Births and deaths are examples of events that need to be recorded. These events cause a change in the set of elements in the frame.
The frame population for a planned survey is sometimes created from a larger, more extensive collection or list of elements, each having recorded values for a number of variables. A frame population deemed appropriate for a particular survey may then be constructed from this larger collection, using some of the recorded variables to delimit the frame population. Imperfections in the recorded variable values, because of unequal reference times or other causes, may harm the effectiveness of the delimitation.
Imperfect frame variable values are undesirable for other reasons. For example, frame variables are often used before sampling to stratify the population and/or after sampling to poststratify the sample. Errors in the frame variables make these practices less efficient.
Excerpted from Estimation in Surveys with Nonresponse by C.-E. Särndal and S. Lundström. Copyright © 2005 by John Wiley & Sons, Ltd. Excerpted by permission.
All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.
Excerpts are provided by Dial-A-Book Inc. solely for the personal use of visitors to this web site.
Chapter 1: Introduction.
Chapter 2: The Survey and Its Imperfections.
2.1 The survey objective.
2.2 Sources of error in a survey.
Chapter 3: General Principles to Assist Estimation.
3.2 The importance of auxiliary information.
3.3 Desirable features of an auxiliary vector.
Chapter 4: The Use of Auxiliary Information under Ideal Conditions.
4.2 The Horvitz–Thompson estimator.
4.3 The generalized regression estimator.
4.4 Variance and variance estimation.
4.5 Examples of the generalized regression estimator.
Chapter 5: Introduction to Estimation in the Presence of Nonresponse.
5.1 General background.
5.2 Errors caused by sampling and nonresponse.
Appendix: Variance and mean squared error under nonresponse.
Chapter 6: Weighting of Data in the Presence of Nonresponse.
6.1 Traditional approaches to weighting.
6.2 Auxiliary vectors and auxiliary information.
6.3 The calibration approach: some terminology.
6.4 Point estimation under the calibration approach.
6.5 Calibration estimators for domains.
6.6 Comments on the calibration approach.
6.7 Alternative sets of calibrated weights.
6.8 Properties of the calibrated weights.
Chapter 7: Examples of Calibration Estimators.
7.1 Examples of familiar estimators for data with nonresponse.
7.2 The simplest auxiliary vector.
7.3 One-way classification.
7.4 A single quantitative auxiliary variable.
7.5 One-way classification combined with a quantitative variable.
7.6 Two-way classification.
7.7 A Monte Carlo simulation study.
Chapter 8: The Combined Use of Sample Information and Population Information.
8.1 Options for the combined use of information.
8.2 An example of calibration with information at both levels.
8.3 A Monte Carlo simulation study of alternative calibration procedures.
8.4 Two-step procedures in practice.
Chapter 9: Analysing the Bias due to Nonresponse.
9.1 Simple estimators and their nonresponse bias.
9.2 Finding an efficient grouping.
9.3 Further illustrations of the nonresponse bias.
9.4 A general expression for the bias of the calibration estimator.
9.5 Conditions for near-unbiasedness.
9.6 A review of concepts, terms and ideas.
Appendix: Proof of Proposition 9.1.
Chapter 10: Selecting the Most Relevant Auxiliary Information.
10.2 Guidelines for the construction of an auxiliary vector.
10.3 The prospects for near-zero bias with traditional estimators.
10.4 Further avenues towards a zero bias.
10.5 A further tool for reducing the bias.
10.6 The search for a powerful auxiliary vector.
10.7 Empirical illustrations of the indicators.
10.8 Literature review.
Chapter 11: Variance and Variance Estimation.
11.1 Variance estimation for the calibration estimator.
11.2 An estimator for ideal conditions.
11.3 A useful relationship.
11.4 Variance estimation for the two-step A and two-step B procedures.
11.5 A simulation study of the variance estimation technique.
11.6 Computational aspects of point and variance estimation.
Appendix: Properties of two-phase GREG estimator.
Chapter 12: Imputation.
12.1 What is imputation?
12.3 Multiple study variables.
12.4 The full imputation approach.
12.5 The combined approach.
12.6 The full weighting approach.
12.7 Imputation by statistical rules.
12.8 Imputation by expert judgement or historical data.
Chapter 13: Variance Estimation in the Presence of Imputation.
13.1 Issues in variance estimation under the full imputation approach.
13.2 An identity of combined and fully weighted approaches.
13.3 More on the risk of underestimating the variance.
13.4 A broader view of variance estimation for the combined approach.
13.5 Other issues arising with regard to item nonresponse.
13.6 Further comments on imputation.
Appendix: Proof of Proposition 13.1.
Chapter 14: Estimation Under Nonresponse and Frame Imperfections.
14.2 Estimation of the persister total.
14.3 Direct estimation of the target population total.
14.4 A case study.