

January 16, 2003

BIVARIATE ANALYSIS: ESTIMATING ASSOCIATIONS

Carol S. Aneshensel
University of California, Los Angeles

I. OVERVIEW
   A. Determines whether two variables are empirically associated
   B. Key issue: Is the association in the form anticipated by theory?
   C. Lays the foundation for multivariate analysis

II. METHODS OF BIVARIATE ANALYSIS
   A. Criteria for Selection of a Method
   B. Proportions: Contingency Tables
      1. Test of conditional probabilities
      2. Used with 2 categorical variables
      3. No distributional assumptions
      4. Low statistical power
      5. Calculation of chi-squared (χ²) and its degrees of freedom
   C. Mean Differences: Analysis of Variance
      1. Test of mean differences relative to variance
      2. Independent variable is categorical
      3. Dependent variable is interval or ratio
      4. No assumptions about the form of the association
      5. Calculation of F and its degrees of freedom
   D. Correlation: Correlation Coefficients
      1. Test of linear association
      2. Two interval variables
      3. Linear association
      4. Calculation of t and its degrees of freedom


The first step in the analysis of a focal relationship is to determine whether there is an empirical association between its two component variables. This objective is accomplished by means of bivariate analysis. This analysis ascertains whether the values of the dependent variable tend to coincide with those of the independent variable. In most instances, the association between two variables is assessed with a bivariate statistical technique (see below for exceptions). The three most commonly used techniques are contingency tables, analysis of variance (ANOVA), and correlations. The basic bivariate analysis is then usually extended to a multivariate form to evaluate whether the association can be interpreted as a relationship.

Not just any association will do, however: we are interested in one particular association, the one predicted by theory. If we expect to find a linear association, but find instead a U-shaped one, then our theory is not supported even though the two variables are associated with one another. Thus, the object of explanatory analysis is to ascertain whether the independent variable is associated with the dependent variable in the manner predicted by theory.

In some instances, the association between two variables is assessed with a multivariate rather than a bivariate statistical technique. This situation arises when two or more variables are needed to express the functional form of the association. For example, the correlation coefficient estimates the linear association between two variables, but a nonlinear association requires a different approach, such as the parabola specified by the terms X and X². Although two analytic variables (X and X²) are used to operationalize the form of the association (a parabola), these variables pertain to one substantive theoretical variable (X). The two analytic variables are best thought of as one 2-part variable that reflects the nonlinear form of the association with the dependent variable. Thus, the analysis is bivariate even though a multivariate statistical technique is used.
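As a concrete illustration of this point, the brief sketch below (Python with NumPy; not part of the original text, and the data are simulated purely for illustration) shows how a U-shaped association can yield a near-zero correlation while the two analytic terms X and X² capture it.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 200)
y = 1.0 - 0.8 * x**2 + rng.normal(scale=1.0, size=200)   # U-shaped association

r = np.corrcoef(x, y)[0, 1]        # linear correlation overlooks the curvature
quad = np.polyfit(x, y, deg=2)     # fit y = b2*x^2 + b1*x + b0 (two analytic terms)
print(f"Pearson r = {r:.2f}")      # close to zero despite a strong association
print("quadratic fit coefficients:", np.round(quad, 2))
```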

Although this distinction may appear to be hair-splitting, it reduces confusion about the focal relationship when more than one term is used to operationalize the independent variable. For example, the categories of ethnicity might be converted into a set of dichotomous "dummy variables" indicating whether the person is (1) African American, (2) Latino, (3) Asian American, or (4) non-Latino White (or is in the excluded reference category of "Other"). A theoretical model containing one independent variable, ethnicity, now appears to involve four independent variables. The four "dummy variables," however, are in actuality one composite variable with five categories. This type of hybrid variable requires a multivariate statistical technique, such as regression, even though it represents a bivariate association.
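A minimal sketch of this coding step, assuming the pandas library and a hypothetical ethnicity column (the category labels simply follow the example in the text), would look like the following; the single composite variable becomes four 0/1 indicator columns, with "Other" as the omitted reference category.

```python
import pandas as pd

# Hypothetical data; the five category labels follow the example in the text.
df = pd.DataFrame({"ethnicity": ["African American", "Latino", "Asian American",
                                 "non-Latino White", "Other", "Latino"]})

dummies = pd.get_dummies(df["ethnicity"], prefix="eth")   # one column per category
dummies = dummies.drop(columns=["eth_Other"])             # omit the reference category
print(dummies.head())
```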

The importance of bivariate analysis is sometimes overlooked because it is superseded by multivariate analysis. This misperception is reinforced by scientific journals that report bivariate associations only in passing, if at all. This practice creates the misleading impression that analysis begins at the multivariate level. In reality, the multiple-variable model rests upon the foundation laid by the thorough analysis of the 2-variable model. The proper specification of the theoretical model at the bivariate level is essential to the quality of subsequent multivariate analysis.

Some forms of bivariate analysis require that variables be differentiated into independent or dependent types. For example, the analysis of group differences in means, either by t-test or ANOVA, treats the group variable as independent, which means that the procedure is asymmetrical: different values are obtained if the independent and dependent variables are inverted. In contrast, the Pearson correlation coefficient, the most widely used measure of bivariate association, yields identical values irrespective of which variable is treated as dependent, meaning that it is symmetrical: the same coefficient and probability level are obtained if the two variables are interchanged. Similarly, the chi-squared (χ²) test for independence between nominal variables yields the same value irrespective of whether the dependent variable appears in the rows or the columns of the contingency table. Although the test of statistical significance is unchanged, switching variables yields different expressions of the association because row and column percentages are not interchangeable. Unlike the correlation coefficient, where both the statistic and the test of statistical significance are symmetrical, only the probability level is symmetrical in the χ² technique.

Designating one variable as independent and the other variable as dependent is productive even when this differentiation is not required by the statistical method. The value of this designation lies in setting the stage for subsequent multivariate analysis where this differentiation is required by most statistical techniques.1 This designation is helpful in the bivariate analysis of the focal relationship because multivariate analysis ultimately seeks to determine whether the bivariate association is indicative of a state of dependency between the two variables. This approach makes more sense if the original association is conceptualized as a potential relationship.

METHODS OF BIVARIATE ANALYSIS

Selection of a Method

There is a multitude of statistical techniques for the assessment of bivariate associations. This profusion of techniques reflects a key consideration in the selection of a method of analysis: the measurement properties of the independent and dependent variables. For example, correlational techniques are the method of choice for the analysis of two interval variables (when the association is assumed to be linear), but are not suitable for the analysis of two categorical variables. Given that there are numerous possible combinations of measurement types, there are numerous analytic techniques. A second contributor to this proliferation is sample size: some methods are applicable only to large samples. Statistical techniques are also distinguished from one another on the basis of assumptions about the distributional properties of the variables. For example, there are different computational formulas for the simple t-test depending upon whether the variance of the dependent variable is assumed to be the same in the two groups being compared. In contrast, nonparametric techniques make no distributional assumptions.

1 Some multivariate techniques do not require that one variable be treated as dependent, for example, log-linear models, but this situation is an exception.

The sheer number of alternative methods can bewilder. The bulk of bivariate analysis in the social sciences, however, is conducted with three techniques: contingency table analysis of proportions, ANOVA assessment of mean differences between groups, and correlation coefficients. As illustrated in Figure 1, a key consideration in the selection of a technique is the level of measurement. The contingency table technique is used when both variables are nominal. Means are analyzed when the independent variable is nominal and the dependent variable is interval or ratio. Correlations are calculated when both variables are interval or ratio (and the association is assumed to be linear).

These three methods do not exhaust the possible combinations of independent and dependent variables, as indicated by the blank cells in Figure 1. Although there are alternative methods of analysis for these combinations, many researchers adapt one of the three methods shown in this figure. For instance, if the dependent variable is nominal and the independent variable is measured at a higher level, the independent variable is often collapsed into categorical form. This transformation permits the use of the familiar χ² test, but wastes valuable information about the inherent ordering of the interval variable.

Ordinal variables are a bit troublesome because they do not satisfy the assumptions of the interval methods of analysis, but the use of a nominal method of analysis, the practical alternative, is wasteful because it does not make use of the ordering information. Interval methods are often used for ordinal variables that approximate interval variables, that is, quasi-interval variables, but strictly speaking this practice is inappropriate. If the ordinal variable does not approximate an interval variable, then it can be treated as a nominal variable in a χ² test, although, once again, this practice wastes information.

Why not use a statistical technique intended specifically for the exact measurement properties of the independent and dependent variables instead of relaxing assumptions or losing power? Surely these techniques are more appropriate than the adaptations just described. What accounts for the popularity of bending methods or measures to permit the use of conventional methods of bivariate analysis? Quite simply, the conventional methods readily generalize into familiar methods of multivariate analysis. Correlations and ANOVA form the foundation for multiple linear regression. Similarly, logistic regression is based on the techniques used with contingency tables. These three methods of bivariate analysis are used frequently, then, because of their conceptual continuity with common forms of multivariate analysis.

The reader is referred to a standard statistics text for less frequently used methods of assessing bivariate associations. These methods are omitted here so that excessive attention to technique does not deflect attention from the logic of analysis. A discussion of the relative merits of various types of correlation coefficients, for example, would unnecessarily divert attention from the question of whether a linear model, assumed in correlational techniques, is appropriate based on theory: not theory as taught in a statistics class, but the substantive theory directing the research. According to the gospel of statistical theory, my dismissive treatment of technically correct methods of bivariate analysis is heretical. In defense of this stance, I note only that I am merely calling attention to a widespread practice in applied analysis.


In addition to level of measurement, the selection of a statistical procedure should also be based upon the type of association one expects to find. In most applications, this issue focuses on whether it is appropriate to use a linear model. In this context, linear means that there is a constant rate of change in the dependent variable across all values of the independent variable. This concept makes sense only where there are constant intervals on both variables, which means that, strictly speaking, linearity is relevant only when both variables are measured at the interval level.2 In other applications, the issue is specifying which groups are expected to differ from others. An example of this approach would be hypothesizing that the prevalence of depression is greater among women than men, as distinct from asserting that gender and depression are associated with one another (see below).

In practice, the expected functional form of an association is often overlooked in the selection of an analytic technique. We tend to become preoccupied with finding a procedure that fits the measurement characteristics of the independent and dependent variables. The validity of the entire analysis, however, depends upon the selection of an analytic technique that matches theory-based expectations about the form of the association. Unfortunately, theory is often mute on this topic. Nevertheless, it is incumbent upon the analyst to translate theory into the appropriate analytic model.

Methods: Proportions, Means and Correlations

In this section, the three most common methods of bivariate analysis are summarized briefly: χ² tests of proportions, ANOVA for mean differences, and correlation coefficients for linear associations. As noted above, the selection of a method is likely to be driven by the measurement characteristics of the independent and dependent variables. Contingency tables are appropriate for two nominal variables; tests of mean differences are used for an interval outcome and nominal independent variable; correlations are employed for linear associations between interval variables (see Figure 1). If both variables are interval, but the association is expected to be non-linear, then the correlational technique needs to be adapted to the expected form of the association using a multiple regression format.

Each of these forms of analysis can be performed using any major statistical software program. The emphasis of this presentation, therefore, is not on computations, but on interpretation. It is useful, however, to review the fundamentals of these methods of analysis to understand their proper use and interpretation.

2 There are a few common exceptions. For example, correlation coefficients are often calculated for ordinal variables that are quasi-interval. Also, dichotomous dependent variables are often treated as interval because there is only one interval.


It should be noted that the techniques discussed here are but a few of the many options available for bivariate analysis. Characteristics of one's data may make other approaches much more appropriate. The three techniques described here are highlighted because of their widespread use and because they form the basis for the most commonly used types of multivariate analysis. The anticipation of multivariate analysis makes it logical to conduct bivariate analysis that is consistent with the multivariate model.

Proportions: Contingency Tables. Two properties of the χ² analysis of a contingency table make it an especially appealing form of bivariate analysis. First, it is based on the lowest form of measurement, two nominal variables. The absence of level-of-measurement restrictions means that the technique may also be used with ordinal, interval or ratio data. Second, this technique does not require assumptions about the nature of the association; in particular, it does not assume a linear association. It is used to determine whether any association is present in the data, without specifying in advance the expected form of this association. This flexibility is the method's most appealing characteristic.

These characteristics, however, also establish the limitations of the method. Using χ² analysis with higher-order variables means that some data are transformed into a lower form of measurement, that is, converted to categorical form. This transformation leads to a loss of information and a concomitant loss of statistical power. Although other statistics for contingency table analysis take into consideration the ordinal quality of variables (e.g., Somers' d), these techniques are not as widely used as the simple yet less powerful χ². Furthermore, the χ² test only tells you that some association seems to be present, without regard to its theoretical relevance. The conclusion that an association is present is not nearly as meaningful, compelling, or satisfying as the conclusion that the expected association is present. The χ² test does not yield this information, although it is possible to adapt the method to this end.

The χ² test for independence is used to determine whether there is an association between two categorical variables. If the two variables are unrelated, then the distribution of one variable should be the same regardless of the value of the other variable. If instead the two variables are associated, then the distribution of the dependent variable should differ across the values of the independent variable. The χ² test for independence does not distinguish between independent and dependent variables. Treating one variable as independent is optional and does not alter the value of the test statistic.

The dependency between the variables could be stated in the reverse direction: the distribution of the independent variable differs across categories of the dependent variable. Although immaterial to the calculation of χ², this formulation is backwards in terms of the logic of cause and effect. It treats the dependent variable as fixed and assesses variation in the independent variable across these fixed values. However, a variable that depends upon another does not have fixed values; its values vary according to the influence of the independent variable. For example, the association between gender and depression is best stated as differences in the probability of being depressed between men and women, not whether the probability of being a woman differs between depressed and not depressed persons. The proper formulation of the association, then, is to examine variation in the distribution of the outcome variable across categories of its presumed cause.3

The χ² test for independence is illustrated in Figure 2. In this contingency table, the independent variable X appears in the rows (1...i) and the dependent variable Y appears in the columns (1...j). The analytic question is whether the distribution of Y varies across the categories of X.

3 Whether this distribution is calculated as row or column percentages is immaterial.


The overall distribution of Y is given at the bottom of the table. For example, the proportion in column 1 is p.1 = N.1/N; the proportion in column 2 is p.2 = N.2/N; and so on until p.j = N.j/N. This proportional distribution should be duplicated within each row of the table if Y is indeed independent of X. In other words, the distribution of subjects within row 1 should resemble the distribution of subjects within row 2, and so on through row i, the last row. This similarity means that the proportion of subjects in column j should be similar across all categories of X (1, 2, ... i), and similar to the overall proportion of subjects in column j (p.j = N.j/N). This equivalency should be manifest for all values of Y across all values of X.

The null hypothesis for the χ² test for independence essentially states that identical conditional probabilities are expected under the condition that Y does not depend upon X:

H0: p11 = p21 = . . . = pi1 = p.1 = N.1/N
    p12 = p22 = . . . = pi2 = p.2 = N.2/N
    . . .
    p1j = p2j = . . . = pij = p.j = N.j/N     (1)

The null hypothesis is evaluated with the χ² test statistic. Its definitional formula and degrees of freedom appear in Figure 2. A large χ² value relative to its degrees of freedom leads to rejection of the null hypothesis. The null hypothesis is rejected if any pij ≠ p.j, that is, if any proportion within the table deviates substantially from the marginal distributions of X and Y. This result means that the observed covariation between X and Y is unlikely to occur if in reality X and Y are independent.

The key to understanding this procedure is the calculation of expected values for the cells comprising the contingency table. These values are calculated assuming that column distributions are independent of row distributions (and vice versa). If so, then the overall marginal proportions for the rows and columns should be replicated within the body of the table.

The value expected under independence is compared to the observed value for each cell. If the assumption of independence is valid, there should be only a small difference between expected and observed values. If instead there is a large difference for one or more cells, the χ² statistic will be large relative to its degrees of freedom and the null hypothesis will be rejected.
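The arithmetic can be verified with any statistical package. The sketch below (Python with SciPy; not part of the original text, and the counts are made up for illustration) computes the expected cell counts under independence and the resulting χ² statistic.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical observed counts: rows are categories of X, columns are categories of Y.
observed = np.array([[30, 20, 10],
                     [25, 25, 10],
                     [15, 30, 35]])

chi2, p, dof, expected = chi2_contingency(observed)
# Each expected count is (row total * column total) / grand total.
print(np.round(expected, 1))
print(f"chi-squared = {chi2:.2f}, df = {dof}, p = {p:.4f}")
```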

The chief limitation of this procedure is the nonspecific nature of the null hypothesis. When we reject it, we know that one or more cell frequencies differ markedly from the expected values, but we do not know which cells are deviant. We know that an association exists, but do not know the form of the association. It might be the one predicted by theory, but then again it might be some other form of association. This limitation can be remedied if the expected relationship is more precisely specified prior to analysis.

To continue the example given earlier in this section, we might hypothesize that females are more likely to be depressed than males. This hypothesis is considerably more precise than the hypothesis that gender and depression are associated with one another. It could be operationalized as an odds ratio greater than 1.00 for depression for females relative to males. The odds ratio expresses the association between variables in a multiplicative form, meaning that a value of 1.00 is equivalent to independence. The odds ratio for the data reported in Table 1 is 1.80. This is the exact value for the sample (N = 1,393). If we wish to extend this finding to the population, we could calculate a confidence interval for the odds ratio4 and determine whether it includes the value of 1.00. The 95% confidence interval for this example is 1.25 to 2.60. The lower boundary is greater than 1, meaning that the odds of being depressed are significantly higher for women than men. Note that this conclusion is more precise than the conclusion made with χ², namely that depression is not independent of gender.

Note also that we are interested only in the possibility that females are at greater risk. A greater risk among males would disconfirm our theory. However, a male excess of depression would lead to rejection of the null hypothesis in the χ² procedure. Our more specific null hypothesis is rejected only if the entire confidence interval lies above 1; we fail to reject if the interval includes 1 or lies entirely below 1. When only an association is hypothesized, we fail to reject only when the confidence interval includes 1, not when it lies below 1. By specifying the nature of the association we have increased the power of the test.

4 Agresti (1990:54-6) gives the following formula for the 100(1 - α) percent confidence interval for log θ for large samples:

   log θ ± zα/2 σ(log θ),

where θ = n11n22/n12n21 is the sample value of the odds ratio, and

   σ(log θ) = [(1/n11) + (1/n22) + (1/n12) + (1/n21)]^1/2.

The confidence interval for θ is obtained by exponentiating (taking the antilog of) the endpoints of this interval. (Agresti, Alan. 1990. Categorical Data Analysis. New York: John Wiley & Sons.)
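The footnote's formula is easy to apply directly. The sketch below (Python; not from the original, with 2 x 2 counts invented for illustration rather than taken from Table 1, though chosen so the odds ratio comes out near the 1.80 reported in the text) computes the sample odds ratio and its large-sample 95% confidence interval.

```python
import math

def odds_ratio_ci(n11, n12, n21, n22, z=1.96):
    """Sample odds ratio and large-sample 95% CI on the log-odds scale (Agresti 1990)."""
    theta = (n11 * n22) / (n12 * n21)
    se_log = math.sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22)
    lower = math.exp(math.log(theta) - z * se_log)
    upper = math.exp(math.log(theta) + z * se_log)
    return theta, lower, upper

# Hypothetical counts: rows = female/male, columns = depressed/not depressed.
# These are NOT the actual Table 1 frequencies.
print(odds_ratio_ci(102, 686, 46, 557))
```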


The χ² test is useful for inferential purposes, but it is not informative about the magnitude of the association. Quantifying the association requires an additional calculation. Although there are numerous options from which to select, the odds ratio described earlier has gained considerable popularity in recent years because of its straightforward interpretation. Somers' d is useful as well because of its common use in the logistic regression model, which is a multivariate generalization of the χ² procedure.

For an example of crosstabular analysis, we can turn to a study estimating the extent to which midlife and older parents experience the divorce of their sons and daughters (Spitze, Logan, Deane, and Zerger 1994).5 The overall prevalence of this event was influenced by time insofar as the oldest adults have been at risk for the longest time, whereas their younger counterparts have had less time to have adult children marry and divorce. The bivariate analysis shown in Table 2 takes into account the age of the parent.

TABLE 2
Percent Distribution of Marital History of Children by Age of the Parent

                                              Age of the Parent
Percentage of all respondents who have had:   40-49   50-59   60-69   70+
  An adult child                                55.1    87.6    88.6   72.2
  An ever-married child                         19.5    70.5    83.1   74.4
  An ever-divorced child                         3.0    24.0    40.6   35.8
  A currently divorced/separated child           4.4    19.6    27.6   24.4
  A remarried child                              1.1     9.5    22.1   21.1
  N                                              365     275     308    246

Source: Spitze et al. (1994).

Although the experience of having an adult child is normative in all of these age groups, barely half of the youngest adults have experienced this life course transition. At older ages, about four in five have done so. The researchers attribute the slight dropoff in the oldest age group to the lower fertility of those who were of childbearing age prior to the baby boom period. These age trends are even more evident for having a married child. This event is relatively rare for the youngest group, but quite common among persons over the age of 50 years. Age is strongly associated with having children who have divorced, including those who are currently divorced, and with having children who have remarried.

5 Spitze, Glenna, Logan, John R., Deane, Glenn, and Zerger, Suzanne. (1994). Adult children's divorce and intergenerational relationships. Journal of Marriage and the Family, 56:279-293.

No tests of statistical significance are presented for the data in Table 2 in the original report. The differences between the youngest cohort and the other cohorts are quite large, however, and given the large sample size they are clearly statistically significant.

If χ² tests were provided, however, there would be five such tests, one for each dependent variable. At first glance, it may not be obvious that there are five dependent variables. This confusion arises because the table reports percentages rather than cell frequencies (compare with Figure 2). (Cell frequencies can be extracted from Table 2 by multiplying the percentage by the sample size, and dividing by 100.) Only the percent "yes" is given in Table 2 because the percent "no" is implicit given that each of these variables is a dichotomy. Thus, Table 2 represents the cross-tabulation of age (in 4 categories) by each of the following dependent variables: adult child/no; ever-married child/no; ever-divorced child/no; currently separated-divorced child/no; remarried child/no.
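To illustrate the extraction of cell frequencies and the corresponding test, the sketch below (Python with SciPy; not part of the original analysis) rebuilds approximate counts for one row of Table 2, the percentage with an ever-divorced child, and runs a χ² test on the resulting 2 x 4 table. Because the published percentages are rounded, the reconstructed counts are only approximate.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Percent with an ever-divorced child and group sizes, by age of parent (Table 2).
pct_yes = np.array([3.0, 24.0, 40.6, 35.8])    # age groups 40-49, 50-59, 60-69, 70+
n = np.array([365, 275, 308, 246])

yes = np.rint(pct_yes * n / 100)     # approximate "yes" cell frequencies
no = n - yes                         # the implicit "no" cells
table = np.vstack([yes, no])         # 2 x 4 contingency table

chi2, p, dof, _ = chi2_contingency(table)
print(f"chi-squared = {chi2:.1f}, df = {dof}, p = {p:.3g}")
```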

The question of linearity is relevant to this example because the independent variable, age, is quasi-interval and the dependent variables are all dichotomies. Two distinct nonlinear patterns are evident in these data. The first is an inverted U-shaped curve: a sharp increase, especially between the forties and fifties, followed by a decline among those who are seventy or older. As noted previously, the researchers attribute the decline to a pre-baby boom cohort effect. The first three dependent variables follow this pattern. The increase at earlier ages reflects the combined impact of life course considerations, such as the duration of risk for the event. These factors appear to be the primary consideration for the second pattern, an initial increase followed by a plateau, which describes the last two entries in the table.

Means: Analysis of Variance. The ANOVA procedure is similar in many respects to the χ² test for independence. In both techniques, the independent variable needs to be measured only at the nominal level. Also, the null hypothesis is structurally similar in the two procedures. In the case of χ², we test whether proportions are constant across categories of the independent variable and, therefore, equal to the overall marginal proportion. For ANOVA, we test whether means are constant across categories of the independent variable and, thus, equal to the grand mean. In both techniques, the null hypothesis is nonspecific: rejecting it is not informative about which categories of the independent variable differ.

The main practical difference between methods is that ANOVA requires an interval level of measurement for the dependent variable.6 This restriction means that the ANOVA technique is not as widely applicable as the χ² test, for which the dependent variable can be nominal. The limitation of ANOVA to interval dependent variables, however, is a trade-off for greater statistical power.

6 This assumption is violated when ANOVA is used for ordinal variables that approximate the interval level of measurement, but this procedure is, strictly speaking, incorrect.


The χ² test could be substituted for ANOVA by transforming an interval dependent variable into categorical form. This approach is undesirable because valuable information is lost when a range of values is collapsed into a single category. Additional information is lost because the ordering of values is immaterial to χ². The attendant decrease in statistical power makes χ² an unattractive alternative to ANOVA.

The ANOVA procedure is concerned with both central tendency and spread. The measure of central tendency is the mean. It is calculated for the dependent variable, both overall and within groups defined by the categories of the independent variable. Specifically, the null hypothesis is that the within-group mean is equal across groups and, therefore, equal to the grand mean:

H0: µ1 = µ2 = ... = µj = µ (2)

In this equation, µ is the grand mean, and j is the number of groups, which is the number of categories on the independent variable. The null hypothesis is rejected if any µj ≠ µ, that is, if any group mean differs from the grand mean.

The critical issue in ANOVA is not the absolute difference in means, however, but the difference in means relative to spread or the variance of the distribution.7 This feature is illustrated in Figure 3 for several hypothetical distributions. The first panel (a) displays large differences in means relative to the within-group variation, a pattern that yields a large value for F. This pattern would lead to a rejection of the null hypothesis.

The second panel (b) displays the same absolute mean differences as the top panel, but substantially larger within-group variation. Although the means differ from one another, there is a large amount of overlap among the distributions. The overlap is so extensive that the distribution with the lowest mean overlaps with the distribution with the highest mean. This pattern would produce a low F value, leading to failure to reject the null hypothesis. This conclusion is reached even though the absolute mean differences are the same as in panel a, which led to rejection of the null hypothesis. The difference in conclusions for the two sets of distributions arises from their spreads: the mean difference is large relative to the variance of the distribution in panel a, but relatively small in panel b.

The third panel (c) shows distributions with the same variances as the second panel (b), but with substantially larger mean differences. As in the previous case, the variances are large. In this instance, however, the mean differences between groups are also large. Similarly, in the last panel (d) the variances are small, but so are the mean differences between groups. We would fail to reject the null hypothesis despite the small variances because these variances are large relative to the mean differences. As these examples illustrate, it is not the absolute value of the mean differences that is crucial, but the mean difference relative to the variance.

7 ANOVA was originally developed in terms of a variance; specifically, the hypothesis that all of the group means are equal is equivalent to the hypothesis that the variance of the means is zero (Darlington 1974).


ANOVA is based on the decomposition of variation in the dependent variable into within- and between-group components, with the groups being categories of the independent variable. The calculations are based on the sum of squares, that is, deviation of observations from the mean, as shown in Figure 4.8 The test statistic for ANOVA is F, which is a ratio of variances. If differences between the means are due to sampling error, then the F ratio should be around 1.00. Large values of F (relative to its degrees of freedom) would lead to rejection of the null hypothesis.
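The decomposition can be reproduced with a few lines of code. The sketch below (Python with NumPy and SciPy; the data are simulated, not drawn from the studies cited here) computes F both with a library routine and directly from the between- and within-group sums of squares.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(1)
# Hypothetical dependent-variable scores for three groups (categories of X).
groups = [rng.normal(10, 3, 50), rng.normal(12, 3, 50), rng.normal(15, 3, 50)]

F, p = f_oneway(*groups)
print(f"F = {F:.2f}, p = {p:.4f}")

# The same F from the sum-of-squares decomposition.
all_y = np.concatenate(groups)
grand_mean = all_y.mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
df_between, df_within = len(groups) - 1, len(all_y) - len(groups)
print((ss_between / df_between) / (ss_within / df_within))   # matches F above
```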

Although ANOVA may be used when there are only 2 categories on the independent variable, it is customary to use a t test in this situation. The definitional formula for t is:9

   t = (M1 - M2) / [(s1²/n1) + (s2²/n2)]^1/2     (3)

t and F are equivalent when there are only two groups: t² = F. Thus, the choice of one method over the other is immaterial.
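A short sketch (Python with SciPy; the group means and spreads are invented, loosely echoing the CES-D example later in the text) applies formula (3) via the unequal-variance t test and confirms that, with the pooled-variance formula and only two groups, t² reproduces the ANOVA F.

```python
import numpy as np
from scipy.stats import ttest_ind, f_oneway

rng = np.random.default_rng(2)
men = rng.normal(10.2, 8.0, 600)      # hypothetical symptom scores
women = rng.normal(13.1, 9.0, 790)

t_welch, p = ttest_ind(women, men, equal_var=False)   # unequal-variance form, as in (3)
print(f"Welch t = {t_welch:.2f}, p = {p:.4f}")

# With two groups, the pooled-variance t and the ANOVA F satisfy t**2 = F.
t_pooled, _ = ttest_ind(women, men, equal_var=True)
F, _ = f_oneway(women, men)
print(f"t^2 = {t_pooled**2:.3f}, F = {F:.3f}")
```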

ANOVA is not informative about which means are unequal: it tests only whether all means are equal to one another and to the grand mean. Rejection of the null hypothesis signifies that the dependent variable is probably not uniformly distributed across all values of the independent variable, but does not reveal which cells deviate from expectations. Thus, an association appears to be present, but it is not known whether it is the one forecast by theory. As was the case with the χ² procedure, additional steps are required to test more specific hypotheses about the precise nature of the association. This may be done a priori or using a post hoc test (e.g., Scheffé). Specifying contrasts in advance (on the basis of theory) is preferable to examining all possible contrasts after the fact because the latter penalizes you for making multiple contrasts to reduce the risk of capitalizing on chance.

Depending upon one's hypothesis, it also may be desirable to test for a trend. Trend analysis is most relevant when the independent variable is interval and linearity is at issue. In this case, it may be desirable to partition the between-groups sum of squares into linear, quadratic, cubic, or higher-order trends. However, visual inspection of the data is usually the most informative approach for understanding the shape of the association. Simply plotting mean values is often more instructive than the results of sophisticated statistical tests.

8 There are numerous alternative specifications of the ANOVA model depending upon whether fixed or random effects are modeled, whether there are equal or unequal cell frequencies, etc. The user should consult a text on ANOVA for a full discussion of these technical concerns.

9 This formula for t assumes unequal variances between groups; a slightly different formula is used if one can assume equal variances.


Several core aspects of ANOVA are illustrated in Table 3, which shows group variation in levels of depressive symptoms. These data are from the survey of Toronto adults introduced earlier in this paper (Turner & Marino 1994; see Table 1). In addition to major depressive disorder, this study also assessed the occurrence of depressive symptoms during the previous week. This assessment was made with the Center for Epidemiologic Studies-Depression (CES-D) Scale, which is the summated total of 20 symptoms, each rated from (0) "0 days" through (3) "5-7 days."

The average symptom level varies significantly by gender, age, and marital status. The nature of this difference is clear for gender insofar as there are only two groups: the average is lower for men than women. For age and marital status, however, the differences are less clear because more than two groups are being compared.

Symptoms are most common among the youngest adults and thereafter decline with age (at least through age 55). It is tempting to conclude that the youngest and oldest groups differ, given that these scores are the most extreme. These large differences, however, may be offset by exceedingly large variances. As a result, we are limited to the nonspecific conclusion that depressive symptoms vary with age.

Table 3
Depression by Select Characteristics

Characteristic            CES-D† (Mean)      N     MDD‡ (%)
Gender
  Male                        10.21***     603      7.7***
  Female                      13.10        788     12.9
Age
  18-25                       15.14***     304     18.4***
  26-35                       10.92        470      9.8
  36-45                       11.09        393      7.2
  46-55                        9.15        224      4.7
Marital Status
  Married                      9.98***     673      6.6***
  Previously Married          14.22        171     11.5
  Never Married               13.70        547     15.8
Total                         11.79      1,391     10.6

Source: Turner and Marino (1994); Table 1.
† Depressive symptoms; Center for Epidemiologic Studies-Depression Scale.
‡ Major Depressive Disorder; Composite International Diagnostic Interview.
*** p < .001

The two unmarried groups have similar levels of symptoms, and both differ markedly from the currently married. In the absence of specific tests for pairs of means, however, we can only conclude that at least one of these means differs from the grand mean.

The far right column of Table 3 presents prevalence estimates for major depressive disorder. These data are shown here to emphasize the similarity between the analysis of means (ANOVA) and the analysis of proportions (χ2). The prevalence of major depression differs significantly by gender, age, and marital status. The nature of the gender difference is again clear, given that there are only two groups. However, the nature of the age and marital status associations is not specified for particular subgroups. Thus, we are limited to the conclusion that depression is associated with age and cannot conclude that depression declines with age, even though the prevalence of depression in the youngest age group is almost twice that of any other age group. Similarly, although it is tempting to conclude that the married are substantially less likely to be depressed than the previously married or the never married, this information is not given by the overall test, meaning that we can only conclude that the three groups do not have the same rate of depression.

Finally, a comment on linearity. Figure 5 graphs the association between depression and age from the data in Table 3. In this example, age has been collapsed into four categories, meaning that it is ordinal rather than interval. In reality, however, age is an interval variable, making it reasonable to ask whether its association with depression is linear. The problem in treating age as a quasi-interval variable is the first interval, ages 18-25, which is shorter (7 years) than the other intervals (10 years). The problem of unequal age intervals can be circumvented, however, by assigning each interval a value equal to its midpoint.
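As a small illustration of the midpoint recoding (Python; computed from the group means in Table 3 rather than the raw data, so it is only suggestive), one can correlate the age midpoints with the mean CES-D scores to summarize the overall downward trend; the plateau in the middle groups becomes visible only when the four means are plotted, as in Figure 5.

```python
import numpy as np

midpoints = np.array([21.5, 30.5, 40.5, 50.5])     # midpoints of 18-25, 26-35, 36-45, 46-55
mean_cesd = np.array([15.14, 10.92, 11.09, 9.15])  # group means from Table 3

r = np.corrcoef(midpoints, mean_cesd)[0, 1]
print(f"correlation of age midpoint with mean symptoms: r = {r:.2f}")
```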

The observed age trend for depressive symptoms [Figure 5(a)] is distinctly nonlinear. Symptoms are most common among the youngest age group and least common in the oldest age group, but do not follow a pattern of steady decline between these extremes. Instead, there is a plateau in average symptom levels between the two middle age groups.

The observed age trend for rates of major depressive disorder [Figure 5(b)] also is distinctly nonlinear, although the pattern differs somewhat from the pattern for average symptom levels. Like symptoms, disorder is most common for the youngest adults and least common for the oldest adults. Unlike symptoms, however, the decline with age is apparent across the two middle age groups. Despite the continuity of decline, the trend is nonlinear because the decline during the youngest period is noticeably steeper than thereafter.

In sum, although ANOVA does not assume linear associations, it is possible to ascertain whether this is the case when both variables are interval (or quasi-interval). The same is true for the analysis of proportions using the χ² test. We turn now to the correlation coefficient, which assumes the linear form.


Correlations: Linear Associations. Although there are several correlation coefficients, Pearson's r is by far the most widely used. This coefficient is used when both the independent and dependent variables are measured at the interval level.10 From the perspective of operationalizing a theory-based relationship, the most important aspect of this technique is the assumption that the association between the independent and dependent variables is linear.11 It is conventional to recommend inspection of a scatterplot to ensure that there are no gross departures from linearity. This approach is illustrated in Figure 6 for both linear and nonlinear associations.

Although the shape of an association is usually clear in textbook illustrations such as this one, it is more difficult to visualize associations from scatterplots in practice. The difficulty arises because large sample sizes generate too many data points, many of which overlap. Computer-generated scatterplots use symbols such as letters to signify the number of observations at a particular location, but it is difficult to mentally weigh the points in a plot according to these symbolic tallies. It is sometimes useful to select a small random subsample of one's data to circumvent this problem.
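One way to implement the subsampling tactic (a sketch in Python with Matplotlib; the data are simulated and the subsample size of 300 is an arbitrary choice) is simply to plot a random subset of cases:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
x = rng.normal(size=5000)                # hypothetical large-survey variables
y = 0.5 * x + rng.normal(size=5000)

idx = rng.choice(len(x), size=300, replace=False)   # small random subsample
plt.scatter(x[idx], y[idx], s=10, alpha=0.6)
plt.xlabel("X")
plt.ylabel("Y")
plt.title("Scatterplot of a random subsample (n = 300)")
plt.show()
```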

Another tactic for detecting nonlinearity is to collapse the independent variable and examine the distribution of means as one would in ANOVA. This technique is not as informative as the scatterplot, given that many distinct values are collapsed into categories and means, but it is helpful in detecting departures from linearity, especially in combination with a scatterplot. Yet another strategy entails collapsing both variables into categorical form and examining their cross-tabulation. This procedure sacrifices even more information than the previous approach, but it may be helpful, especially if extreme scores are of special interest.

The correlation coefficient r describes the association between two variables as the straight line that minimizes the deviation between observed (Y) and estimated (Ŷ) values of the dependent variable, as illustrated in Figure 7. This feature gives the method its name, the "least-squares method." The value of r measures the association between X and Y in terms of how closely the data points cluster around the least-squares line. The absolute value of r is large when the data points hover close to the least-squares line; when observations are more widely dispersed around this line, the absolute value of r is close to zero. The values of the correlation coefficient range from 1, which indicates perfect correspondence, through 0, which signifies a complete lack of correspondence, to -1, which connotes perfect inverse (i.e., negative) association (see Figure 6).

10 As we have seen repeatedly, however, ordinal variables that approximate the interval level of measurement are often used in practice for statistical techniques that require an interval level of measurement.

11 There are other requirements as well, including normal distributions and homoscedasticity. The reader is referred to a text on multiple regression for a thorough consideration of the correlational model and its assumptions.


The null hypothesis is once again that Y is independent of X, specifically, H0: ρ = 0, where ρ is the population correlation. The test statistic for generalization from the sample to the population is t, computed as shown in Figure 7. Although a two-tailed test may be used, the direction of the association is usually theoretically important, which makes the use of a one-tailed test appropriate.

It is important to note that this technique assumes that the association between X and Y is linear in form. If there is a nonlinear association, the value of r will be seriously misleading. This problem is illustrated in Figure 8. In this example, the data are better represented by a parabola than by a straight line. Correlational techniques for determining whether specific nonlinear trends are present entail a multivariate model. Here it suffices to note that these techniques are isomorphic to those described above for trends in ANOVA.

The slope of the least-squares line is of interest because it quantifies the magnitude of the association between the independent and dependent variables. Specifically, the slope is the change in Y produced by a one-unit increase in X. A steep slope (in either direction) indicates a strong relationship, whereas the line for a weak relationship is nearly horizontal. The slope is not given by r, a common misconception, but can be derived from r and information about the distributions of X and Y. Another indicator of the strength of the association is r², which is the proportion of the variance in the dependent variable that is accounted for by the independent variable.
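The quantities discussed in this paragraph are easy to compute. The sketch below (Python with SciPy; simulated data rather than the depression measures discussed next) obtains r, the t statistic described above, the slope recovered from r and the two standard deviations, and r².

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(4)
x = rng.normal(size=400)
y = 0.6 * x + rng.normal(scale=0.8, size=400)   # hypothetical linear association

r, p_two_sided = pearsonr(x, y)                 # halve the p value for a one-tailed test
n = len(x)
t = r * np.sqrt((n - 2) / (1 - r**2))           # test statistic for H0: rho = 0
slope = r * y.std(ddof=1) / x.std(ddof=1)       # slope derived from r and the two SDs
print(f"r = {r:.2f}, t = {t:.2f}, slope = {slope:.2f}, r^2 = {r**2:.2f}")
```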

For example, two measures of depression strongly covary: the correlation between the Child Depression Inventory (CDI)12 and the Stony Brook (SB) Child Psychiatric Checklist13 measure of depression is quite strong (r = .59; p < .001), but well below the perfect correspondence (r = 1.00) that would be expected if both measures were perfectly reliable and valid. The r² value is .35, meaning that about a third of the variance in the Stony Brook is shared with the CDI (and vice versa). Although this correspondence is strong, most of the variance in the two measures is not shared in common. The significance test for the correlation between the CDI and the SB (p < .001) indicates that it is extremely unlikely that this correlation would have been observed if in truth the two variables are not correlated with one another.

12 Kovacs, M. & Beck, A.T. (1977). An empirical-clinical approach toward a definition of childhood depression. In J.G. Schulterbrandt & A. Raskin (Eds.), Depression in Childhood: Diagnosis, Treatment, and Conceptual Models. New York: Raven, pp. 1-25.

13 Gadow, K.D. & Sprafkin, J. (1987). Stony Brook Child Psychiatric Checklist-3R. Stony Brook, NY. Unpublished manuscript.


The correlation coefficient is an appropriate indicator of the association between the CDI and the SB for two reasons: (1) both variables are quasi-interval, and (2) the association can be assumed to be linear. The latter point is particularly important, given that the functional form of associations is often overlooked. The two variables are similar measures of the same construct, which means that an increase in one measure should be matched by an increase in the other measure. Moreover, this correspondence should be evident across the full span of values for both variables. There is no reason to anticipate, for example, a plateau, or a threshold effect. Thus, the correlation coefficient is an appropriate choice.


The correlation between the CDI and the SB is considerably stronger than their correlations with measures of other constructs. This pattern is expected, given that two measures of the same construct should be more highly correlated than measures of different constructs.

However, adolescent depression was also assessed with two other measures, SB ratings made by the mother and by the father. These measures correlate with the CDI (.29 and .22, respectively), and with the adolescent's self-assessment on the SB (.28 and .21, respectively), but far below the correlation between the two adolescent measures (.59). Although these correlations are all statistically significant, the parental measures account for no more than 8.4 percent of the variance in the adolescent self-reports. Thus, as concluded earlier, parental ratings are not especially good measures of adolescent mood, despite the fact that such ratings have been standard practice in psychiatric epidemiology.

In sum, the few correlations reviewed here demonstrate the importance of considering both the statistical significance of an association and its substantive importance.

SUMMARY

Although the specifics of bivariate analysis are unique to each statistical technique, there is a functional similarity across these methods. As noted above, the null hypothesis for bivariate analysis states that the values on the two variables are independent of one another. The usual goal is to reject this hypothesis, to conclude that the variables are not independent of one another. In practice, this means that knowing the values on one variable is informative about the likely values on the second variable.