In this discussion, we'll focus on regression analysis.
Regression analysis -- when one is just thinking about two variables -- offers
us a way to "predict" or "explain" one variable (the dependent variable) with
another variable (the independent variable). So, for instance, to use an
example we've used before, one could predict undergraduate student grades
with class attendance. Undergraduate grades would be the Y variable (the
dependent variable, or the "left hand side" variable), and class attendance
would be the X variable (the independent variable, or the "right hand side"
variable).
In its simplest form, with only two variables -- one dependent (or response) variable
Y and one explanatory or independent variable X -- regression analysis essentially
plots a line. The formula for any line is:
Y = a + bX
X and Y are variables--they can change value from case to case in
your data set. That is, the X for observation 1 (X1) is
not necessarily the same as the X for observation 2 (X2).
a and b, on the other hand, are constants. They are the same
no matter what the values of X and Y are.
"a" is the intercept--it is the value of Y when X crosses the Y axis.
"b" is the slope--it is the change in Y associated with a one-unit change
in X.
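For instance, take a made-up line (not the dye example below): if Y = 2 + 3X, then an observation with X = 4 has a predicted Y of 2 + 3*4 = 14. The intercept tells us that when X = 0, Y = 2, and the slope tells us that each one-unit increase in X (say, from 4 to 5) raises the predicted Y by 3 (from 14 to 17).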
How does this relate to correlation? Well, regression is very much based
on correlation. Recall that Pearson's correlation is a measure of the strength
of the linear relationship between two variables. "Ordinary Least Squares"
regression analysis is a method that just calculates out a slope and an
intercept for a line that is plotted to fit the datapoints in the scatter
plot.
How does OLS (Ordinary Least Squares Regression) fit a line to data?
(Note--this is just background information; you don't actually need to know
this to do any of the calculations, or to figure out the slope or intercept).
OLS minimizes squared errors. What does this mean? What is an
"error"?
For each observation (or case), the error is the difference between the observed
Y in your dataset and the predicted Y that falls on the line.
Let's think about an example. Say you have a data set, such as one
we examined in lecture 10:
So, for example, the equation for the line that fits the data displayed in
the scatterplot above is:
Y (dye weight) = 368.35 + -.039 (X: time in minutes).
(We'll explain below how we calculated out the a and b).
368.35 is the intercept; it is the value of Y when X=0. Notice that this
works out algebraically, because if X=0, then -.039*X drops out of the
equation, leaving us with Y = 368.35.
-.039 is the slope; the negative sign indicates that the line is downward sloping
(or that an increase in X is associated with a decrease in Y). The slope
seems quite small in magnitude, but in part that's just because of the
scale of the Y variable. The slope tells us that for each one-unit change in X
(that is, for each additional minute), Y is expected to drop by .039.
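To make that arithmetic concrete, here is a minimal sketch in Python (not part of the original notes); the only numbers it uses are the intercept (368.35) and slope (-.039) from the fitted line quoted above:

    # Predicted dye weight from the line Y = 368.35 + -.039 * X  (X = time in minutes)
    def predicted_dye_weight(minutes):
        return 368.35 - 0.039 * minutes

    print(predicted_dye_weight(0))    # 368.35: the intercept (the value of Y when X = 0)
    print(predicted_dye_weight(100))  # 364.45: 100 minutes later, 100 * .039 = 3.9 units lower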
For the second data point in our set of data, what is the observed Y?
Click here for the answer.
But what is the predicted Y for the second data point? You can
calculate it out by plugging the X value for that data point into the equation for the line above.
Click here for the answer.
But there is some error; the actual observation doesn't fall
exactly on the line. "Error" in the world of OLS regression
doesn't mean mistake -- it just means that not all the points fall on the
line (in fact, in plenty of datasets, none of the points actually fall exactly
on the plotted line), and that there's some "error" in prediction.
Recall that the "residual" or "error" or (by common notation) "e"
is the observed Y - predicted Y. Calculate out the residual
--and click here for the answer. Note that the positive residual indicates that the
observed Y is larger than the predicted Y--in other words, that the
datapoint would be above the plotted line.
Note that "e", just like X and Y, can change from observation to observation--
in some ways "e" is just a variable, albeit one that is manufactured by
the line itself. a and b, however, are constants--they remain the same
across the entire data set, because there is one constant line, with
one slope and one intercept.
OLS picks the line that would minimize the sum of the squared errors
(imagine if you squared each residual, and then added up all the squared
residuals across all observations).
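As a rough illustration of what "minimizing the sum of squared errors" means, here is a short Python sketch. The x and y values are made up purely for illustration (they are not the lecture 10 dye data):

    import numpy as np

    # Made-up data, purely for illustration (not the lecture 10 dye data)
    x = np.array([0.0, 10.0, 20.0, 30.0, 40.0])
    y = np.array([368.0, 367.5, 367.8, 366.9, 366.5])

    def sum_of_squared_errors(a, b):
        predicted = a + b * x
        residuals = y - predicted        # e = observed Y - predicted Y
        return np.sum(residuals ** 2)

    # OLS chooses the (a, b) pair that makes this sum as small as possible.
    b_ols, a_ols = np.polyfit(x, y, 1)   # np.polyfit returns the slope first, then the intercept
    print(sum_of_squared_errors(a_ols, b_ols))          # the minimized sum of squared errors
    print(sum_of_squared_errors(a_ols, b_ols + 0.01))   # any other slope gives a larger sum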
One topic that we won't discuss here (there are other classes you are welcome to sit in on) is multiple regression. In multiple regression, you can "control for" other variables, while testing the effect of one variable on another.
OLS relies on a number of assumptions (whether you are using bivariate or multivariate regression):
That all said, OLS is pretty robust even in the face of mild to moderate
violations of these assumptions.
The formula for the slope for the bivariate regression analysis is
as follows:
b = sum of [(Xi - mean of X)(Yi - mean of Y)] / sum of (Xi - mean of X)^2
You can calculate out the intercept as follows:
a = (mean of Y) - (b)(mean of X)
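Here is a small Python sketch of those two formulas, again using made-up x and y values (not the dye data):

    import numpy as np

    # Made-up data, purely for illustration
    x = np.array([0.0, 10.0, 20.0, 30.0, 40.0])
    y = np.array([368.0, 367.5, 367.8, 366.9, 366.5])

    x_bar, y_bar = x.mean(), y.mean()
    b = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)  # slope
    a = y_bar - b * x_bar                                             # intercept: a = mean(Y) - b * mean(X)
    print(a, b)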
So, for the example above, can you use Excel to calculate out a slope and intercept? Click here for an Excel table that outlines the answer.
How can we do significance testing for slopes?
Given that, generally, the slope we calculate is a sample slope, we
can then think about significance testing. We need to find out the
standard error associated with our slope. This is the same exact reasoning
that we went through when we talked about the mean--we talked about
taking an infinite number of samples from our population, calculating a
mean from each one--and then the standard deviation of that hypothetical
distribution of an infinite number of means is the "standard
error".
To review, what does it mean when the standard error is large?
Click here for the answer.
What is the formula for the standard error of the slope? First, let's give the
formula for the variance of b, and then we can take the square root
to get the standard error:
variance of b = [ (sum of squared residuals) / (n - 2) ] / sum of (Xi - mean of X)^2
standard error of b = square root of the variance of b
A few points:
Once we have the standard error of b, we can calculate
t = b / (standard error of b)
and use a t-table, just as we've always done.
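Putting the last two formulas together, here is a Python sketch of the whole significance test for the slope. The data are made up for illustration, and scipy is used only to look up the t distribution in place of a printed t-table:

    import numpy as np
    from scipy import stats

    # Made-up data, purely for illustration
    x = np.array([0.0, 10.0, 20.0, 30.0, 40.0])
    y = np.array([368.0, 367.5, 367.8, 366.9, 366.5])
    n = len(x)

    # slope and intercept, as above
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    a = y.mean() - b * x.mean()

    # variance and standard error of b
    residuals = y - (a + b * x)
    s_squared = np.sum(residuals ** 2) / (n - 2)        # variance of the residuals
    var_b = s_squared / np.sum((x - x.mean()) ** 2)     # variance of b
    se_b = np.sqrt(var_b)                               # standard error of b

    # t statistic and two-tailed p-value, with n - 2 degrees of freedom
    t = b / se_b
    p_value = 2 * stats.t.sf(abs(t), df=n - 2)
    print(t, p_value)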
The "R-square" is used to represent the amount of variance in Y
explained by X.
In order to calculate out an R2, you go through the following
steps: first, calculate the total sum of squares (the sum of the squared
deviations of each Y from the mean of Y); second, calculate the sum of the
squared residuals around the regression line; finally,
R2 = 1 - (sum of squared residuals) / (total sum of squares).
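Those steps look like this in a short Python sketch (same made-up data as in the earlier sketches):

    import numpy as np

    # Made-up data, purely for illustration
    x = np.array([0.0, 10.0, 20.0, 30.0, 40.0])
    y = np.array([368.0, 367.5, 367.8, 366.9, 366.5])

    b, a = np.polyfit(x, y, 1)                     # fitted slope and intercept
    predicted = a + b * x

    ss_total = np.sum((y - y.mean()) ** 2)         # total variation in Y
    ss_residual = np.sum((y - predicted) ** 2)     # variation left over after fitting the line
    r_squared = 1 - ss_residual / ss_total
    print(r_squared)                               # share of the variance in Y explained by X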
Note that R2 is very useful in terms of measuring the
strength of linear relationships, or telling you how much of the
variance in Y is "explained" by X.
But, note the limitations of using R2:
For the data on how weight of dye changes over time, calculate out the variance of b, the t, and the R-square. Would you (at a 95% confidence level) reject the null hypothesis that X does not have an effect on Y?