Regression Models for Categorical, Count, and Related Variables: An Applied Approach
by John P. Hoffmann

eBook

$65.00 




Overview

Social science and behavioral science students and researchers are often confronted with data that are categorical, count a phenomenon, or have been collected over time. Sociologists examining the likelihood of interracial marriage, political scientists studying voting behavior, criminologists counting the number of offenses people commit, health scientists studying the number of suicides across neighborhoods, and psychologists modeling mental health treatment success are all interested in outcomes that are not continuous. Instead, they must measure and analyze these events and phenomena in a discrete manner.
 
This book provides an introduction and overview of several statistical models designed for these types of outcomes—all presented with the assumption that the reader has only a good working knowledge of elementary algebra and has taken introductory statistics and linear regression analysis.
 
Numerous examples from the social sciences demonstrate the practical applications of these models. The chapters address logistic and probit models, including those designed for ordinal and nominal variables, regular and zero-inflated Poisson and negative binomial models, event history models, models for longitudinal data, multilevel models, and data reduction techniques such as principal components and factor analysis.
 
Each chapter discusses how to utilize the models and test their assumptions with the statistical software Stata, and also includes exercise sets so readers can practice using these techniques. Appendices show how to estimate the models in SAS, SPSS, and R; provide a review of regression assumptions using simulations; and discuss missing data.

A companion website includes downloadable versions of all the data sets used in the book.

Product Details

ISBN-13: 9780520965492
Publisher: University of California Press
Publication date: 08/16/2016
Sold by: Barnes & Noble
Format: eBook
Pages: 432
File size: 63 MB

About the Author

John P. Hoffmann is Professor of Sociology at Brigham Young University. Before arriving at BYU, he was a senior research scientist at the National Opinion Research Center (NORC), a nonprofit firm affiliated with the University of Chicago. He received a master’s in Justice Studies at American University and a doctorate in Criminal Justice at SUNY–Albany. He also received a master’s in Public Health with emphases in Epidemiology and Behavioral Sciences at Emory University’s Rollins School of Public Health. His research addresses drug use, juvenile delinquency, mental health, and the sociology of religion.

Read an Excerpt

Regression Models for Categorical, Count, and Related Variables

An Applied Approach


By John P. Hoffmann

UNIVERSITY OF CALIFORNIA PRESS

Copyright © 2016 The Regents of the University of California
All rights reserved.
ISBN: 978-0-520-96549-2



CHAPTER 1

Review of Linear Regression Models


As you should know, the linear regression model is normally characterized with the following equation:

$$y_i = \alpha + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \varepsilon_i \quad \{\text{or use } \beta_0 \text{ for } \alpha\}.$$

Consider this equation and try to answer the following questions:

• What does the yi represent? The β? The x (which often include subscripts i — do you remember why)? The εi?

• How do we judge the size and direction of the β?

• How do we decide which xs are important and which are not? What are some limitations in trying to make this decision?

• Given this equation, what is the difference between prediction and explanation?

• What is this model best suited for?

• What role does the mean of y play in linear regression models?

• Can the model provide causal explanations of social phenomena?

• What are some of its limitations for studying social phenomena and causal processes?


Researchers often use an estimation technique known as ordinary least squares (OLS) to estimate this regression model. OLS seeks to minimize the following:

$$\text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

The SSE is the sum of squared errors, with the observed y and the predicted y (y-hat) utilized in the equation. In an OLS regression model that includes only one explanatory variable, the slope (β1) is estimated with the following least squares equation:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

Notice that the variance of x appears in the denominator, whereas the numerator is part of the formula for the covariance (cov(x, y)). Given the slope, the intercept is simply

$$\hat{\alpha} = \bar{y} - \hat{\beta}_1 \bar{x}$$
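To see the arithmetic in action, here is a minimal Stata sketch (using the auto data shipped with Stata rather than a data set from the book) confirming that the slope equals cov(x, y)/var(x):

sysuse auto, clear                   // example data shipped with Stata
correlate mpg price, covariance      // displays var(mpg), var(price), cov(mpg, price)
display r(cov_12)/r(Var_1)           // slope = cov(x, y)/var(x)
regress price mpg                    // the coefficient on mpg matches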

Estimation is more complicated in a multiple OLS regression model. If you recall matrix notation, you may have seen this model represented as

$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}.$$

The letters are bolded to represent vectors and matrices, with Y representing a vector of values for the outcome variable, X indicating a matrix of explanatory variables, and β representing a vector of regression coefficients, including the intercept (β0) and slopes (βi). The OLS regression coefficients may be estimated with the following equation:

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}.$$

A vector of residuals is then given by

$$\boldsymbol{\varepsilon} = \mathbf{Y} - \hat{\mathbf{Y}}.$$

Often, the residuals are represented as e to distinguish them from the errors, ε. You should recall that residuals play an important role in linear regression analysis. Various types of residuals also have a key role throughout this book. Assuming a sample and that the model includes an intercept, some of the properties of the OLS residuals are (a) they sum to zero ($\sum_i \varepsilon_i = 0$), (b) they have a mean of zero ($E[\varepsilon] = 0$), and (c) they are uncorrelated with the predicted values of the outcome variable ($r(\varepsilon, \hat{y}) = 0$).
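The following minimal sketch (again using Stata's shipped auto data; the variable choices are arbitrary, not from the book) checks the residual properties and then reproduces the coefficients with the matrix formula in Mata:

sysuse auto, clear
regress price mpg weight
predict yhat, xb                     // predicted values
predict ehat, residuals              // residuals
summarize ehat                       // mean of residuals is zero (to machine precision)
correlate ehat yhat                  // residuals uncorrelated with predicted values

mata:
y = st_data(., "price")
X = (st_data(., ("mpg", "weight")), J(rows(y), 1, 1))  // append a column of ones for the intercept
b = invsym(X'*X) * (X'*y)            // (X'X)^{-1}X'Y
b'                                   // matches the regress coefficients (mpg, weight, constant)
end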

Analysts often wish to infer something about a target population from the sample. Thus, you may recall that the standard error (SE) of the slope is needed since, in conjunction with the slope, it allows estimation of the t-values and the p-values. These provide the basis for inference in linear regression modeling. The standard error of the slope in a simple OLS regression model is computed as

$$SE(\hat{\beta}_1) = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2/(n-2)}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$$

Assuming we have a multiple OLS regression model, as shown earlier, the standard error formula requires modification:

$$SE(\hat{\beta}_j) = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2/(n-k-1)}{\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)^2\,(1 - R_j^2)}}$$

where $R_j^2$ is the $R^2$ from regressing $x_j$ on the other explanatory variables.

Consider some of the components in this equation and how they might affect the standard errors. The matrix formulation of the standard errors is based on deriving the variance-covariance matrix of the OLS estimator. A simplified version of its computation is

$$\widehat{\operatorname{var}}(\hat{\boldsymbol{\beta}}) = \hat{\sigma}^2(\mathbf{X}'\mathbf{X})^{-1}, \qquad \hat{\sigma}^2 = \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n}$$

Note that the numerator in the right-hand-side equation is simply the SSE, since $(y_i - \hat{y}_i) = \varepsilon_i$, or $e_i$. The right-hand-side equation is called the residual variance or the mean squared error (MSE). You may recognize that it provides an estimate — albeit biased, but consistent — of the variance of the errors. The square roots of the diagonal elements of the variance–covariance matrix yield the standard errors of the regression coefficients. As reviewed subsequently, several of the assumptions of the OLS regression model are related to the accuracy of the standard errors and thus the inferences that can be made to the target population.
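In Stata, this matrix is saved after estimation and can be inspected directly; a brief sketch (continuing with the auto data used above):

quietly regress price mpg weight
estat vce                            // display the variance-covariance matrix of the estimates
display _se[mpg]                     // standard error of the mpg coefficient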

OLS results in the smallest value of the SSE, if some of the specific assumptions of the model discussed later are satisfied. If this is the case, the model is said to result in the best linear unbiased estimators (BLUE) (Weisberg, 2013). It is important to note that this says best linear, so we are concerned here with linear estimators (there are also nonlinear estimators). In any event, BLUE implies that the estimators, such as the slopes, from an OLS regression model are unbiased, efficient, and consistent. But what does it mean to say they have these qualities? Unbiasedness refers to whether the mean of the sampling distribution of a statistic equals the parameter it is meant to estimate in the population. For example, is the slope estimated from the sample a good estimate of an analogous slope in the population? Even though we rarely have more than one sample, simulation studies indicate that the mean of the sample slopes from the OLS regression model (if we could take many samples from a population), on average, equals the population slope (see Appendix B). Efficiency refers to how stable a statistic is from one sample to the next. A more efficient statistic has less variability from sample to sample; it is therefore, on average, more precise. Again, if some of the assumptions discussed later are satisfied, OLS-derived estimates are more efficient — they have a smaller sampling variance — than those that might be estimated using other techniques. Finally, consistency refers to whether the statistic converges to the population parameter as the sample size increases. Thus, it combines characteristics of both unbiasedness and efficiency.

A standard way to illustrate these qualities is with a target, such as a dartboard. As shown in figure 1.1, estimators from a statistical model can be imagined as trying to hit a target in the population known as a parameter. Estimators can be unbiased and efficient, biased but efficient, unbiased but inefficient, or neither. Hopefully, it is clear why having these properties with OLS regression models is valuable.

You may recall that we wish to assess not just the slopes and standard errors, but also whether the OLS regression model provides a good "fit" to the data. This is one way of asking whether the model does a good job of predicting the outcome variable. Given your knowledge of OLS regression, what are some ways we may judge whether the model is a "good fit"? Recall that we typically examine and evaluate the R2, adjusted R2, and root mean squared error (RMSE). How is the R2 value computed? Why do some analysts prefer the adjusted R2? What is the RMSE and why is it useful?
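As a refresher, and using standard definitions rather than anything reproduced from this excerpt: with $\text{SST} = \sum_{i=1}^{n}(y_i - \bar{y})^2$,

$$R^2 = 1 - \frac{\text{SSE}}{\text{SST}}, \qquad R^2_{\text{adj}} = 1 - \frac{\text{SSE}/(n-k-1)}{\text{SST}/(n-1)}, \qquad \text{RMSE} = \sqrt{\frac{\text{SSE}}{n-k-1}}$$

The adjusted $R^2$ penalizes the addition of explanatory variables that contribute little, and the RMSE is expressed in the units of the outcome variable.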


A BRIEF INTRODUCTION TO STATA

In this presentation, we use the statistical program Stata to estimate regression models (www.stata.com). Stata is a powerful and user-friendly program that has become quite popular in the social and behavioral sciences. It is more flexible and powerful than SPSS and, in my judgment, much more user-friendly than SAS or R, its major competitors. Stata's default layout consists of four windows: a command window where we type commands; a results window that shows output; a variables window that lists the variables in the data file; and a review window that keeps track of what we have entered in the command window. If we click on a line in the review window, it appears in the command window (so we don't have to retype commands). Similarly, if we click on a variable in the variables window, it appears in the command window, so we do not have to type variable names if we do not want to.

It is always a good idea to save the Stata commands and output by opening a log file. This can be done by clicking the brown icon in the upper left-hand corner (Windows) or the upper middle portion (Mac) of Stata or by typing the following in the command window:

log using "regression.log"  // the name is arbitrary


This saves a log file to the local drive listed at the bottom of the Stata screen. To suspend the log file, type log off in the command window; or to close it completely type log close.

It is also a good idea to learn to use .do files. These are similar to SPSS syntax files or R script files in that we write — and, importantly, save — commands in them and then ask Stata to execute the commands. Stata has a do-file editor that is simply a notepad-style screen for typing commands. The Stata icon that looks like a small pad of paper opens the editor. But we can also use Notepad++, TextEdit, WordPad, Vim, or any other text-editing program that allows us to save text files. I recommend that you use the extension .do when saving these files, though. In the do-file editor, clicking the run or do icon feeds the commands to Stata.
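For instance, a minimal do-file might look like the following (the file name and the particular commands are illustrative, not reproduced from the book):

* analysis.do -- a minimal example do-file
log using "regression.log", replace  // open a log, overwriting any earlier one
use gss.dta, clear                   // load the GSS extract used in this chapter
summarize sei                        // quick numerical check of the outcome variable
log close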


AN OLS REGRESSION MODEL IN STATA

We will now open a Stata data file and estimate an OLS regression model. This allows us to examine Stata's commands and output and provide guidance on how to test the assumptions of the model. A good source of additional instructions is the Regression with Stata web book found at http://www.ats.ucla.edu/stat/stata/webbooks/reg. Stata's help menu (e.g., type help regress in the command window) is also very useful.

To begin, open the GSS data file (gss.dta). This is a subset of data from the biennial General Social Survey (see www3.norc.org/GSS+Website). You may use Stata's drop-down menu to open the file. Review the content of the Variables window to become familiar with the file and its contents. A convenient command for determining the coding of variables in Stata is called codebook. For example, typing and entering codebook sei returns the label and some information about this variable, including its mean, standard deviation, and some percentiles. Other frequently used commands for examining data sets and variables include describe, table, tabulate, summarize, graph box (box plot), graph dotplot (dot plot), stem (stem-and-leaf plot), hist (histogram), and kdensity (kernel density plot) (see the Chapter Resources at the end of this chapter). Stata's help menu provides detailed descriptions of each. As shown later, several of these come in handy when we wish to examine residuals and predicted values from regression models.
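For example, the following commands (output omitted here) provide quick numerical summaries:

codebook sei                         // coding, mean, standard deviation, percentiles
summarize sei, detail                // fuller numerical summary, including skewness
tabulate female                      // frequency table for a categorical variable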

Before estimating an OLS regression model, let's check the distribution of sei with a kernel density graph (which is also called a smoothed histogram). The Stata command that appears below opens a new window that provides the graph in figure 1.2. If sei follows a normal distribution, it should look like a bell-shaped curve. Although it appears to be normally distributed until it hits about 50, it has a rather long tail that is suggestive of positive skewness. We investigate some implications of this skewness later.
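The excerpt does not reproduce the command itself; in its simplest form it is presumably:

kdensity sei                         // kernel density (smoothed histogram) of sei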

To estimate an OLS regression model in Stata, we may use the regress command. The Stata code in Example 1.1 estimates an OLS regression model that predicts sei based on sex (the variable is labeled female). The term beta that follows the comma requests that Stata furnish standardized regression coefficients, or beta weights, as part of the output. You may recall that beta weights are based on the following equation:

$$\beta^{\text{std}}_j = \hat{\beta}_j\left(\frac{s_{x_j}}{s_y}\right)$$

where $s_{x_j}$ and $s_y$ are the sample standard deviations of $x_j$ and $y$.


Whereas unstandardized regression coefficients (the Coef. column in Stata) are interpreted in the original units of the explanatory and outcome variables, beta weights are interpreted in terms of z-scores. Of course, the z-scores of the variables must be interpretable, which is not always the case (think of a categorical variable like female).
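The example box is not reproduced in this excerpt; based on the description above, the Example 1.1 command is presumably:

regress sei female, beta             // OLS regression of sei on female, with beta weights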

The results should look familiar. There is an analysis of variance (ANOVA) table in the top-left panel, some model fit statistics in the top-right panel, and a coefficients table in the bottom panel. For instance, the R2 for this model is 0.002, which could be computed from the ANOVA table using the regression (Model) sum of squares and the total sum of squares (SS(sei)): 2,002/1,002,219 = 0.002. Recall that the R2 is the squared value of the correlation between the predicted values and the observed values of the outcome variable. The F-value is computed as MSReg/MSResid or 2,002/360 = 5.56, with degrees of freedom equal to k and (n − k − 1). The adjusted R2 and the RMSE — two useful fit statistics — are also provided.
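These fit quantities are also stored by regress and can be recalled directly, for example:

display e(r2)                        // R-squared
display e(F)                         // model F-value
display e(rmse)                      // root mean squared error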

The output presents coefficients (including one for the intercept or constant), standard errors, t-values, p-values, and, as we requested, beta weights. Recall that the unstandardized regression coefficient for a binary variable like female is simply the difference in the expected means of the outcome variable for the two groups. Moreover, the intercept is the predicted mean for the reference group if the binary variable is coded as {0, 1}. Because female is coded as {0 = male and 1 = female}, the model predicts that mean sei among males is 48.79 and mean sei among females is 48.79 - 1.70 = 47.09. The p-value of 0.018 indicates that, assuming we were to draw many samples from the target population, we would expect to find a slope of -1.70 or one farther from zero about 18 times out of every 1,000 samples.

The beta weight is not useful in this situation because a one z-score shift in female makes little sense. Perhaps it will become more useful as we include other explanatory variables. In the next example, we add years of education, race/ethnicity (labeled nonwhite, with 0 = white and 1 = nonwhite), and parents' socioeconomic status (pasei) to the model.
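The excerpt omits Example 1.2 as well; assuming the years-of-education variable is named educ (a guess based on GSS naming conventions; the file's actual name may differ), the command would be something like:

regress sei female educ nonwhite pasei, beta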

The results shown in Example 1.2 suggest that one or more of the variables added to the model may explain the association between female and socioeconomic status (or does it? — note the sample sizes of the two models). And we now see that education, nonwhite, and parents' status are statistically significant predictors of socioeconomic status. Whether they are important predictors or have a causal impact is another matter, however.

The R2 increased from 0.002 to 0.353, which appears to be quite a jump. Stata's test command provides a multiple partial (nested) F-test to determine if the addition of these variables leads to a statistically significant increase in the R2. Simply type test and then list the additional explanatory variables that have been added to produce the second model. The result of this test with the three additional variables is an F-value of 406.4 (3, 2226 df) and a p-value of less than 0.0001. Given the different sample sizes, do you recommend using the nested F-test approach for comparing the models? How would you estimate the effect of female in this model?
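Concretely, after fitting the larger model, the nested test described above would be (again assuming the variable name educ):

test educ nonwhite pasei             // joint F-test for the three added variables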

Some interpretations from the model in Example 1.2 include the following:

• Adjusting for the effects of sex, race/ethnicity, and parents' sei, each 1-year increase in education is associated with a 3.72 unit increase in socioeconomic status.

• Adjusting for the effects of sex, race/ethnicity, and parents' sei, each one z-score increase in education is associated with a 0.556 z-score increase in socioeconomic status.

• Adjusting for the effects of sex, education, and race/ethnicity, each one-unit increase in parents' sei score is associated with a 0.077 unit increase in socioeconomic status.


It is useful to graph the results of regression models in some way. This provides a more informed view of the association between explanatory variables and the outcome variable than simply interpreting slope coefficients and considering p-values to judge effect sizes. For instance, figure 1.3 provides a visual depiction of the linear association between years of education and sei as predicted by the regression model. Stata's margins and marginsplot post-estimation commands are used to "adjust" the other variables by setting them at particular levels, including placing pasei at its mean. The vertical bars are 95% confidence intervals (CIs). What does the graph suggest?
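A sketch of the commands behind such a figure (the education variable name and the range shown are assumptions, not reproduced from the book):

quietly regress sei female educ nonwhite pasei
margins, at(educ=(8(2)20)) atmeans   // predicted sei across education, other variables at means
marginsplot                          // plot the predictions with 95% confidence intervals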

Although it should be obvious, note that the graph (by design) shows a linear association. This is because the OLS regression model assumes a linear, or straight line, relationship between education and socioeconomic status (although this assumption can be relaxed). If we know little about their association, then relying on a linear relationship seems reasonable. But it is important to keep in mind that many associations are not linear. Think about what this means given how popular linear regression is in many scientific disciplines.


(Continues...)

Excerpted from Regression Models for Categorical, Count, and Related Variables by John P. Hoffmann. Copyright © 2016 The Regents of the University of California. Excerpted by permission of UNIVERSITY OF CALIFORNIA PRESS.
All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.
Excerpts are provided by Dial-A-Book Inc. solely for the personal use of visitors to this web site.

Table of Contents

Preface
Acknowledgments

1. Review of Linear Regression Models
2. Categorical Data and Generalized Linear Models
3. Logistic and Probit Regression Models
4. Ordered Logistic and Probit Regression Models
5. Multinomial Logistic and Probit Regression Models
6. Poisson and Negative Binomial Regression Models
7. Event History Models
8. Regression Models for Longitudinal Data
9. Multilevel Regression Models
10. Principal Components and Factor Analysis

Appendix A: SAS, SPSS, and R Code for Examples in Chapters
Section 1: SAS Code
Section 2: SPSS Syntax
Section 3: R Code
Appendix B: Using Simulations to Examine Assumptions of OLS Regression
Appendix C: Working with Missing Data

References
Index