APPLIED MULTIVARIATE ANALYSIS
Using Bayesian and Frequentist Methods of Inference
By S. JAMES PRESS
Dover Publications, Inc. Copyright © 2005 S. James Press
All rights reserved.
Multivariate analysis is that branch of statistics that is devoted to the study of random variables which are correlated with one another. If two random variables are correlated, knowledge of the behavior of one provides some knowledge about the behavior of the other. To specify the probability of occurrence of many events of practical interest, it is usually necessary to make probabilistic assertions about several correlated random variables simultaneously. For example, suppose the success of some social welfare program is being evaluated. Since program success is generally assessed in terms of several different but correlated measures, the problem must be handled by multivariate analysis.
The term "applied multivariate analysis," as used in this book, refers to the motivation for using the models developed. Thus, the models presented in Part II will be seen to be useful for studying phenomena which arise in real problems.
Bayesian statistics is a term applied to the body of inferential techniques that uses Bayes' theorem to combine observational data with personalistic or subjective beliefs. Multivariate Bayesian techniques were integrated into the text wherever it was deemed appropriate.
The essence of applied multivariate analysis involves the motivation to solve problems and arrive at numerical answers, or to generate strong degrees of belief about natural phenomena, or to provide results which can be used as the basis for decision making. Often an hypothesis is proposed, multivariate data is collected and examined to test the hypothesis, and if the hypothesis is rejected by the data, it is modified accordingly, and new data is gathered to test the new hypothesis. There is also the notion that, inherently, the problem of interest involves repeated observations on several correlated variables. Thus, the field not only embodies a collection of tools, techniques, and methods of thinking which may be brought to bear upon problems involving the treatment and interpretation of many correlated random variables simultaneously, but also involves the translation of the multidimensional techniques and models into numerical decision strategies or findings.
1.2 ORIENTATION OF TEXT
Until recently, application of results from multivariate analysis to real data and their integration with judgmental information required considerable effort, long hours, much hand computation, and an artistic flair for choosing significance levels. Two fairly new developments that have drastically altered this situation are the large scale availability of computers and the accelerated growth of the use of the Bayesian approach in statistics.
The development that may have had the greatest impact was the advent of high-speed large-storage-capacity digital computers. Although such computers have been around since the post World War Two era, it has only been in recent years that computer software development permitted large numbers of people to communicate with these machines without considerable advance preparation, that libraries of prepackaged programs were made available to users who did not want to develop their own computer routines to solve problems many other people had already solved, and that computer speed and storage capacity were sufficiently great to solve realistic problems that typically demand examination in many dimensions simultaneously.
The other major development that has greatly affected the applications of multivariate analysis is the rapid development of multivariate Bayesian results. That is, during roughly the same time period that computers were becoming impressively useful and accessible, the multivariate Bayesian approach was being successfully brought to bear on problems that had been difficult to treat from other points of view.
The above and other developments (such as the steadily growing mathematical sophistication of the social sciences) helped multivariate analysis to make a strong impact in the applied fields. Moreover, these applied fields have continued to press their demands that the gap be bridged between theory (mathematical statistics) and application.
For such reasons, this book provides discussions of the interpretation of the main results, sometimes presenting proofs in context or in Complements at the end of the chapter, and sometimes referring the more technically inclined reader to proofs in the literature. References are also given for additional applications of the various multivariate models discussed. As a result the bibliography, although certainly not exhaustive, was designed to be sufficiently broad in scope and deep in concept to serve the interests of a wide variety of readers.
For the subject "applied multivariate analysis" not to be a contradiction in terms, it is necessary that the reader become familiar with some of the computer programs available for applying multivariate procedures. The general characteristics of computer routines associated with the models discussed in this book are described in an appendix and are sometimes referred to in the text. The computer routines will prove most helpful to those users who fully understand the assumptions underlying the model for which the routine was developed.
Exercises are provided at the end of each chapter. Although some straightforward computational exercises are sometimes included for practice in the use of certain relations, most problems are not of this type; rather, they are intended to gauge the reader's overall grasp of the sense of the material in the chapter. That is, attention is often focused on questions such as: When should a particular kind of model be applied, what are the underlying assumptions, what other models might be used, why is one model better than another for a given problem, and so on?
1.3 GENESIS OF MULTIVARIATE MODELS
The models of multivariate analysis described in this book arose in real problems in many different disciplines. Correlation was used by F. Galton in the second half of the nineteenth century. Factor analysis was introduced in education and psychology by C. Spearman to explain human intelligence. R. A. Fisher introduced many notions of multivariate analysis (including that of intraclass correlation) to explain phenomena in genetics. Regression models were first well formulated by Gauss for multidimensional applications in astronomy. Many of the variants of the regression model were investigated and developed for applications in economics. Experimental design models were developed for use in agriculture. Latent structure models and multidimensional scaling methods were developed in education, psychology, and sociology, while control models have been pressed forward by chemists, economists, and engineers. Advances in multivariate stable distribution theory have been stimulated by problems in finance, and classification and discrimination models are finding increasing application in marketing, after having served well for many years in anthropology and taxonomy.
In summary, the models to be described have been drawn from many of the scientific disciplines, especially from biometrics, econometrics, psychometrics, sociometrics, education, and the subfields of business. It is expected that by exposing models drawn from diverse disciplines, the models will find new application in fields different from those in which they originally arose.
1.4 SAMPLING THEORY VERSUS BAYESIAN APPROACH
This book does not take a dogmatic position on the sampling theory versus the Bayesian approach toward solving problems. There is no claim that there is a right and a wrong way. Rather, it is believed that cogent arguments can be made for both approaches to inference and decision making, and each may involve some subjective or technical difficulties. In the sampling theory approach the analyst must use his prior beliefs relative to the trade-off between sample size and Type I and Type II errors. Moreover, he obtains confidence intervals applicable only to averages taken over many samples, rather than to the sample at hand. The Bayesian needs no Type I and II errors, but he does need to assess prior distributions, which may not always be an easy task. In many situations, in which the sampling theory approach is used for making inferences based upon a given sample, results are marginal in the sense that, for example, a statistic might be significant at the five percent level but not at the one percent level, and the analyst is not really strongly attached to either significance level. In such circumstances, calculating the entire posterior distribution with respect to a diffuse prior will often put the analyst in a much better position to make a decision about which he will feel confident.
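The "marginal" situation described above can be illustrated with a small sketch. It assumes the simplest possible setting, which is not taken from the text: independent Normal observations with a known standard deviation and a flat (diffuse) prior on the mean, so that the posterior for the mean is Normal, centered at the sample mean. The data values and the function name `posterior_for_mean` are hypothetical.

```python
import math
from statistics import NormalDist, fmean

def posterior_for_mean(data, sigma):
    """Posterior of a Normal mean under a flat prior, with sigma assumed known.

    With a diffuse prior the posterior is Normal(xbar, sigma^2 / n).
    """
    n = len(data)
    return NormalDist(mu=fmean(data), sigma=sigma / math.sqrt(n))

# Hypothetical sample of four observations, sigma assumed known and equal to 1.
data = [0.3, 1.9, 0.8, 1.4]
post = posterior_for_mean(data, sigma=1.0)

# Sampling theory summary: two-sided p-value for the hypothesis "mean = 0".
z = post.mean / post.stdev
two_sided_p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))

# Bayesian summary: posterior probability that the mean is positive.
prob_positive = 1.0 - post.cdf(0.0)

print(f"z = {z:.2f}, two-sided p = {two_sided_p:.4f}")
print(f"posterior P(mean > 0) = {prob_positive:.4f}")
```

Here the test statistic is significant at the five percent level but not at the one percent level, while the posterior distribution gives a direct statement about the mean itself. In this particular setting the two summaries are numerically linked (the posterior probability equals one minus half the two-sided p-value), which is one instance of a diffuse-prior procedure coinciding with a sampling theory procedure.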
It sometimes happens that a problem which is extremely difficult to analyze from a sampling theory viewpoint becomes simpler to study from a Bayesian viewpoint. Such is the case, for example, in both multivariate regression and Normal classification problems, for certain prior densities. In multivariate regression, a likelihood ratio approach to testing the coefficients for significance leads to an extremely complicated distribution which has been approximated in several ways. However, use of the Bayes approach (with diffuse priors) for testing requires the relatively simple application of the Student t-distribution (it is more complicated for natural conjugate priors). Similarly, there is not yet general agreement on the finite sample solution to the problem of classifying a vector into one of two multivariate Normal populations with unequal and unknown parameters. But a Bayesian will establish his basis for decision when he computes the posterior odds, which he accomplishes by inserting the observed data into the ratio of two Student t-densities. In the case of principal components and canonical correlations, the Bayesian result is complicated (it is expressed in terms of zonal polynomials), while the sampling theory result is relatively simple for the usual types of inferences desired. In the case of the square of the multiple correlation coefficient, both approaches yield complicated results. More generally, no matter which model is involved, and no matter which approach is simpler, if prior information is easily assessable, the Bayesian approach provides a formalism for combining judgmental information with observational data. If prior information is difficult to assess, Bayesian procedures based upon "vague" priors are sometimes used. Such procedures are sometimes equivalent to sampling theory procedures, but not always. In the latter case, a philosophical choice must be made.
The rationale for presenting multivariate analysis from both sampling theory and Bayesian viewpoints is to provide the reader with a broader base from which real problems may be studied.
The book is divided into two parts. Part I deals with foundations. As such, it provides a convenient summary of the main theoretical results that are needed in applied multivariate analysis. Since the study of many variables simultaneously is most efficiently and expeditiously handled in a vector and matrix context, Chapter 2 is devoted to a review of the pertinent matrix algebra and calculus required.
Chapter 3 begins with a brief discussion of multivariate distributions in general, and continues with a more detailed treatment of the multivariate Normal distribution in particular. In this book attention is directed largely to the class of continuous distributions that has densities. Thus, there will never be any need for measure theoretic arguments (exceptions for sets of Lebesgue measure zero are always to be understood, but will usually be omitted).
The multivariate Normal distribution is the principal one used for drawing inferences. In univariate analysis, the Normal distribution is fundamental largely because of its role as the limiting distribution for sums of independent and identically distributed random variables with finite second moments (central limit theorem). The multivariate analogue of the central limit theorem is the basis for the importance of the multivariate Normal distribution. This theorem, along with other asymptotic multidimensional results, is discussed in Chapter 4.
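For reference, the multivariate central limit theorem mentioned above can be stated compactly as follows. The notation (dimension $p$, observations $X_i$, mean vector $\mu$, covariance matrix $\Sigma$) is assumed here for illustration and may differ from that used in Chapter 4.

```latex
\[
X_1,\dots,X_n \ \text{i.i.d. } p\text{-vectors},\quad
E(X_i)=\mu,\quad \operatorname{Var}(X_i)=\Sigma \ \text{finite}
\;\Longrightarrow\;
\sqrt{n}\,\bigl(\bar{X}_n-\mu\bigr)\ \xrightarrow{\,d\,}\ N_p(0,\Sigma),
\qquad
\bar{X}_n=\frac{1}{n}\sum_{i=1}^{n}X_i .
\]
```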
Chapters 5 and 6 treat the Wishart distribution (the distribution of sample variances and covariances), various multivariate versions of the Student t-distribution, and other major multivariate distributions (such as Hotelling's T²-distribution) which will be needed in making inferences in multivariate models. The subject of stable distributions, which has become of interest in the study of securities markets, income variation, portfolio analysis, and other areas of finance and economics, is discussed from a multivariate standpoint in Chapter 6.
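As a brief illustration of why the Wishart distribution is described as the distribution of sample variances and covariances, the standard result for a Normal sample can be written as below. The symbols ($X_i$, $\bar{X}$, $A$, $W_p$) are notation assumed for this sketch, not necessarily that of Chapter 5.

```latex
\[
X_1,\dots,X_n \ \text{i.i.d. } N_p(\mu,\Sigma)
\;\Longrightarrow\;
A=\sum_{i=1}^{n}\bigl(X_i-\bar{X}\bigr)\bigl(X_i-\bar{X}\bigr)' \ \sim\ W_p(\Sigma,\,n-1),
\]
\[
\text{so that the sample covariance matrix } S=\frac{A}{n-1}
\text{ satisfies } (n-1)\,S \sim W_p(\Sigma,\,n-1).
\]
```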
Chapter 7 presents a brief treatment of elementary multivariate statistics, including estimation and testing using the multivariate Normal and Wishart distributions. Some attention is given to the fundamental work of Stein on the inadmissibility, in higher dimensions, of the sampling theory estimator of the mean vector in a multivariate Normal distribution.
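The inadmissibility result can be seen numerically in a small simulation. The sketch below is not taken from the book: it assumes a single observation from a p-variate Normal distribution with identity covariance, uses the James-Stein shrinkage estimator as the competing estimator, and picks a hypothetical true mean vector. The function names are illustrative.

```python
import random

def james_stein(x):
    """Shrink the observed p-vector x toward the origin (meaningful for p >= 3)."""
    p = len(x)
    norm_sq = sum(v * v for v in x)
    shrink = 1.0 - (p - 2) / norm_sq
    return [shrink * v for v in x]

def squared_error(estimate, mu):
    """Total squared error of an estimate of the mean vector mu."""
    return sum((e - m) ** 2 for e, m in zip(estimate, mu))

def compare_risks(mu, reps=2000, seed=0):
    """Monte Carlo estimate of the risk of the usual estimator and of James-Stein."""
    rng = random.Random(seed)
    usual_total = 0.0
    js_total = 0.0
    for _ in range(reps):
        # One observation X ~ N_p(mu, I); the usual estimator of mu is X itself.
        x = [rng.gauss(m, 1.0) for m in mu]
        usual_total += squared_error(x, mu)
        js_total += squared_error(james_stein(x), mu)
    return usual_total / reps, js_total / reps

# Hypothetical true mean vector in p = 10 dimensions.
usual_risk, js_risk = compare_risks(mu=[1.0] * 10)
print(f"usual estimator risk ~ {usual_risk:.2f}, James-Stein risk ~ {js_risk:.2f}")
```

The average total squared error of the shrinkage estimator comes out smaller than that of the usual estimator, which is the sense in which the usual estimator is inadmissible when the dimension is three or more.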
Part II treats various models that might be used singly or sequentially to analyze multivariate data. It begins with Chapter 8, which focuses on linear models. Results are provided for univariate regression models under a wide variety of underlying assumptions likely to be encountered in applications (such as unequal variances and serially correlated disturbances). Both multivariate regression and generalized multivariate regression models (different regressor matrices for each dependent variable) are discussed. One and two way layouts for the analysis of variance are presented (univariate and multivariate), as are the multivariate analysis of covariance and multivariate multiple comparison techniques for obtaining simultaneous confidence intervals for regression parameters. Both the Bayesian and sampling theory methods of analyzing linear models are exposed.
Study of the variance within observed data is simplified by the principal components model, discussed in Chapter 9. In this model, data interpretation is simplified by representing the data in a rotated coordinate system.
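The idea of representing the data in a rotated coordinate system can be sketched for two variables, where the rotation has a closed form. This is a minimal illustration with made-up data and hypothetical function names, not the treatment given in Chapter 9: the axes are rotated by the angle that makes the two new coordinates uncorrelated, and the first new coordinate carries the larger share of the variance.

```python
import math

def sample_cov(xs, ys):
    """Return the sample variances and covariance (sxx, syy, sxy) of two variables."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    return sxx, syy, sxy

def principal_axes_2d(xs, ys):
    """Rotate centered 2-D data onto its principal axes.

    Returns the rotation angle and the coordinates along the first and
    second principal axes.
    """
    sxx, syy, sxy = sample_cov(xs, ys)
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    u = [c * (x - mx) + s * (y - my) for x, y in zip(xs, ys)]
    v = [-s * (x - mx) + c * (y - my) for x, y in zip(xs, ys)]
    return theta, u, v

# Made-up observations on two correlated variables.
xs = [2.0, 4.0, 6.0, 8.0, 10.0]
ys = [1.0, 3.0, 2.0, 5.0, 4.0]

theta, u, v = principal_axes_2d(xs, ys)
var_u, var_v, cov_uv = sample_cov(u, v)
print(f"rotation angle = {math.degrees(theta):.1f} degrees")
print(f"variances along new axes: {var_u:.3f}, {var_v:.3f}; covariance: {cov_uv:.2e}")
```

The total variance is unchanged by the rotation; it is only redistributed so that the new coordinates are uncorrelated.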
Study of the correlation structure underlying the observed data variables is facilitated with the factor analysis and latent structure analysis models discussed in Chapter 10. The factor analysis model attempts to discover a few elemental factors that may have generated the data. The latent structure analysis model attempts to categorize the data, after the fact.
Chapter 11 treats the canonical correlations model, which is an attempt to correlate two or more groups of variates rather than pairs of random variables.
Chapter 12 discusses the problem of allocating scarce resources when the variables follow multivariate stable distributions (stable portfolio analysis).
In Chapter 13, Bayesian and sampling theory procedures are given for classifying data vectors (attribute profiles) into predesignated populations.
Chapter 14 discusses some adaptive multivariate control problems in the context of controlling a multivariate regression output.
The final chapter of the book (Chapter 15) is devoted to the new and important topic of structuring of multivariate populations. One area discussed involves the optimal grouping of multivariate observations into clusters. Another related problem is multidimensional scaling, or the locating of points in a multidimensional space based upon preference, or ordered data. These subjects are currently undergoing extensive investigation. Moreover, although many interesting results have already been obtained, rigorous statistical underpinning is still lacking.
Appendix A provides reference material on computer programs which will implement the models discussed in the text. Information is given, such as limitations on the input and output variables, location of the program, and references for additional program details. Several alternative programs (with different properties) are given for some of the models so that a program can be selected which best fits the problem at hand. Appendix B contains a variety of numerical tables useful for work in applied multivariate analysis, while Appendix C lists all references in the book by author.
Excerpted from APPLIED MULTIVARIATE ANALYSIS by S. JAMES PRESS. Copyright © 2005 S. James Press. Excerpted by permission of Dover Publications, Inc.