Lecture 3
Lecture 3 focuses on some basic methods of data analysis.
I. Analyzing Data
Now, let's think about some common questions that can be answered using quantitative
data.
- Often, we might be interested in how many individuals fall into one category versus
another. Are individuals equally divided across categories--or are they more
likely to be in one category than another?
This general question might lead to the following specific questions:
- What percentage of college students favor legalization of marijuana?
- What percentage of individuals (of a certain sex, race, and locality) have a
particular DNA genotype?
- What is the relationship between two variables--does the category into which
individuals fall for one variable depend on the value that they get for another variable?
This general question might lead to the following specific questions:
- What is the relationship between sex and partisan affiliation? Does partisan affiliation
depend on sex (that is, are women more likely to be Democrats)?
- What is the relationship between the likelihood of being charged with a particular
crime and income?
(Note that neither of these questions taps into why these relationships may
exist--that is a matter of theory.)
- How can we describe our data--in a way that will help us to understand individual
variables, and help us to understand the collection
of individual cases that we have?
- Specifically--what is the range of values on a variable? What is the average?
The median (the value in the middle)?
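The first two kinds of questions come down to counting cases in categories. Here is a minimal sketch in Python, assuming made-up survey answers and made-up (sex, party) pairs. The data and the function names `category_percentages` and `share_within_group` are hypothetical and used only for illustration.

```python
from collections import Counter

# Hypothetical survey responses (illustrative only, not real data).
responses = ["favor", "oppose", "favor", "favor", "oppose",
             "favor", "unsure", "favor", "oppose", "favor"]

# Hypothetical (sex, party) pairs for a two-variable comparison.
cases = [("F", "Dem"), ("F", "Dem"), ("F", "Rep"),
         ("M", "Rep"), ("M", "Dem"), ("M", "Rep")]


def category_percentages(values):
    """Return the percentage of cases falling into each category."""
    counts = Counter(values)
    total = len(values)
    return {category: 100 * count / total for category, count in counts.items()}


def share_within_group(pairs, group, category):
    """Percentage of the cases in `group` that fall into `category`."""
    in_group = [cat for grp, cat in pairs if grp == group]
    return 100 * in_group.count(category) / len(in_group)


percentages = category_percentages(responses)
for category, pct in sorted(percentages.items()):
    print(f"{category}: {pct:.1f}%")

# Does the category on one variable depend on the other variable?
for sex in ("F", "M"):
    print(f"{sex}: {share_within_group(cases, sex, 'Dem'):.1f}% Dem")
```

If the percentages within each group differ a lot, that hints at a relationship between the two variables. It says nothing about why the relationship exists.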
II. Basic Data Analysis
- Here, we want to cover univariate statistics--that is, statistical measures
that focus on just one variable at a time.
One can think about measures of central tendency--measures that give an idea
of the value that is at the "center" of a distribution of a variable. There are three
commonly used measures of "central tendency": the mean (or average), the median,
and the mode.
- The
mean or average is calculated by adding all of the values up, and dividing by
the number of cases. Means tend to be very sensitive to outliers--for example,
the average income of the 1985-1995 graduates of the UNC Department of Geography
is very, very high--over $200,000 a year, I believe. This, however, is because
(1) there are very few graduates of the UNC Geography Department and (2) Michael
Jordan is one of those graduates.
A related term is
expected value,
which can be thought of as a long-run average of a variable.
- The
median is the value in the middle--approximately half of the cases fall below
the median value of a variable, and half of the cases fall above the median value
of a variable. Medians tend to be less sensitive to outliers than means are.
So, for instance, suppose you have a dataset with one variable (say, age) and 3 cases
in it: 62, 39, and 54. Sorted from lowest to highest, the values are 39, 54, and 62,
so the median is the one in the middle--it is 54. If you add a case (say, age=47),
and now have four cases for age, what is the median? There are two numbers
"in the middle": 47 and 54. The median is the midpoint of those two numbers: it is 50.5.
- The
mode is the category with the most cases in it.
- The
range is the lowest and the highest value that the variable takes on (across
cases--not the lowest possible value or highest possible value, but the lowest value
and highest value within the dataset).
- The
upper quartile of a variable is the value that 75% of the
case values fall under.
- The
lower quartile of a variable is the value that 25% of the
case values fall under.
- The
variance
of a variable is the average squared distance each
value is from the mean value of a variable. So, to calculate the variance for
a variable in a population, you
- (1) calculate the average value of the variable (add up all the values,
and divide by the number of cases)
- (2) calculate the distance each case value is from that average/mean
- (3) square each distance--so that all the values are positive
- (4) add up the squared distances, and divide by the number of cases
- The
standard deviation
of a variable is the square root of the variance.
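The measures above can be sketched in a few lines of Python. This is a rough illustration, not a definitive implementation. It uses the four ages from the median example (62, 39, 54, 47) and, for the mode, the ages from the second row of the legislator table in Section VII. The function name `population_variance` is my own. Note that software packages use several different rules for computing quartiles, so the quartile values from `statistics.quantiles` may not match a hand calculation exactly.

```python
import math
import statistics

ages = [62, 39, 54, 47]  # the four ages from the median example

mean_age = statistics.mean(ages)      # add up the values, divide by the number of cases
median_age = statistics.median(ages)  # midpoint of the two middle values (47 and 54)
value_range = (min(ages), max(ages))  # lowest and highest value in the dataset

# Mode: the value with the most cases (ages from the second table row).
mode_age = statistics.mode([47, 47, 56, 53, 76, 62])

# Quartiles: conventions differ across packages; this uses Python's default rule.
lower_quartile, _, upper_quartile = statistics.quantiles(ages, n=4)


def population_variance(values):
    """Average squared distance from the mean, dividing by the number of cases."""
    mean = sum(values) / len(values)                 # step 1
    squared = [(v - mean) ** 2 for v in values]      # steps 2 and 3
    return sum(squared) / len(values)                # step 4


variance = population_variance(ages)
standard_deviation = math.sqrt(variance)  # square root of the variance

print(mean_age, median_age, value_range, mode_age)
print(lower_quartile, upper_quartile)
print(variance, standard_deviation)
```

In this small example the mean and the median happen to be the same. Replace one age with a very large value and the mean moves a lot, while the median moves only a little.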
III. More on Variances and Standard Deviations
- A. Samples and Populations--Different Formulas
When calculating variances and standard deviations for populations,
you divide by N, or the size (# of cases) in the population.
However, when calculating variances and standard deviations for samples,
you divide by n-1, or one less than the size of the sample (# of cases in the sample). We'll talk
more about "degrees of freedom" in future lectures, but the basic idea is that
in a sample, you're estimating the mean--and so you're "using up" one piece of
information, and therefore divide not by n, but by n-1.
See
this attachment
for information on notation, and on calculating out means, variances, and standard deviations.
- B. Why are Variances so Useful?
Variances and standard deviations are very useful measures of the
spread of your data--taking more into account than, for instance, the simple
range of your data.
- C. Z-Scores--and an Example
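Often, we talk in terms of z-scores, or "standard deviation units". That
is, we describe how far a value is from the mean in standard deviations
rather than in the variable's original units. To see how this works,
consider the following data set, which has one variable (X) and 7 cases: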
- X1=5
- X2=7
- X3=9
- X4=6
- X5=4
- X6=8
- X7=3
What is the mean?
The mean is the total sum (42) divided by the number of cases (7)--that
is, this variable X has a mean of 6.
What is the squared distance each value is from the mean of 6?
- squared distance=square of (5-6)=1
- squared distance=square of (7-6)=1
- squared distance=square of (9-6)=9
- squared distance=square of (6-6)=0
- squared distance=square of (4-6)=4
- squared distance=square of (8-6)=4
- squared distance=square of (3-6)=9
What is the average squared distance? (Consider this a sample, so divide
by n-1 rather than N.)
The total sum of all the squared distances is 1 + 1 + 9 + 0 + 4 + 4 + 9=28
Divide 28 by 6 (that is, by n-1).
The average squared distance (or the variance of X in this sample) is 4.67.
What is the standard deviation? It is simply the square root of the variance--
the square root of 4.67 is 2.16.
We can think about the values in terms of standard deviation units. For
example, X3=9. That is, X3 is 3 units above the mean of 6. How many
standard deviation units is that? One standard deviation unit is 2.16
actual units--so 3 actual units would be 1.39 standard deviation units.
We can say that X3 is 1.39 standard deviation units above the mean.
Likewise, X1 is just a little below the mean--it is .46 standard deviation
units below the mean.
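The worked example above can be checked with a short Python sketch. It uses the same seven values of X and treats them as a sample (dividing by n-1). The helper name `z_score` is my own, and the population variance is included only for comparison with the sample figure.

```python
import math

x = [5, 7, 9, 6, 4, 8, 3]  # the seven cases X1..X7 from the example

n = len(x)
mean_x = sum(x) / n                                  # 42 / 7
squared_distances = [(v - mean_x) ** 2 for v in x]   # 1, 1, 9, 0, 4, 4, 9
sum_of_squares = sum(squared_distances)              # 28

sample_variance = sum_of_squares / (n - 1)           # sample: divide by n-1
population_variance = sum_of_squares / n             # population: divide by N
sample_sd = math.sqrt(sample_variance)


def z_score(value, mean, sd):
    """Distance from the mean, expressed in standard deviation units."""
    return (value - mean) / sd


z_scores = [z_score(v, mean_x, sample_sd) for v in x]

print(f"mean = {mean_x:.2f}")
print(f"sample variance = {sample_variance:.2f}, sample sd = {sample_sd:.2f}")
print(f"population variance = {population_variance:.2f}")
for i, (v, z) in enumerate(zip(x, z_scores), start=1):
    print(f"X{i} = {v}: z = {z:+.2f}")
```

A positive z-score means the case is above the mean, and a negative one means it is below. Dividing by n-1 rather than n gives a slightly larger variance, and the difference shrinks as the sample gets bigger.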
IV. What if we want to measure the RELATIONSHIP between two or more
variables?
- Let's move away from univariate statistics to multivariate statistics.
Measures of association refer to statistics describing the degree to
which two or more variables are associated with each other--how is change in
one associated with change in another. For instance, a
"correlation" is a measure
of association (specifically, a correlation measures the strength of a linear relationship
between two variables).
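To make the idea of a correlation concrete, here is a small Python sketch. It assumes the Pearson correlation coefficient is the measure meant, since that is the usual measure of the strength of a linear relationship. The data values and the function name `pearson_r` are made up for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: how strongly two variables move together linearly."""
    if len(xs) != len(ys):
        raise ValueError("both variables need the same number of cases")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of the products of each case's distances from the two means.
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared distances for each variable separately.
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_movement / math.sqrt(spread_x * spread_y)


# Hypothetical values for two variables measured on the same five cases.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
print(f"r = {r:.2f}")
```

The result always falls between -1 and +1. Values near +1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak one.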
V. Outliers
-
Outliers are cases that seem to be far away from other cases--sometimes
defined as more than two standard deviations from the mean of that variable.
Outliers are important because they can change the conclusions you draw. Some statistics
(such as means) are particularly sensitive to outliers. An outlier can suggest that
there is a relationship between two variables (e.g., whether one has a UNC Geography
degree and income) when none exists.
Outliers are a very complex topic. In sum, however, you should always look for outliers
in any data set. You should always try to explain why the outlier exists. And,
you should always determine whether the inclusion of the outlier changes your results
and conclusions--and be "up front" about that to your audience.
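One way to act on this advice is sketched below in Python. It flags cases more than two standard deviations from the mean (the rule of thumb mentioned above), then compares the mean and median with and without those cases. The income figures are hypothetical, and the function name `flag_outliers` and its `threshold` parameter are my own.

```python
import statistics

# Hypothetical incomes in thousands of dollars (illustrative only).
incomes = [30, 32, 35, 31, 33, 34, 29, 200]


def flag_outliers(values, threshold=2.0):
    """Return the values more than `threshold` sample SDs from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (divides by n-1)
    return [v for v in values if abs(v - mean) / sd > threshold]


outliers = flag_outliers(incomes)
without_outliers = [v for v in incomes if v not in outliers]

print("outliers:", outliers)
print("mean with / without:",
      statistics.mean(incomes), statistics.mean(without_outliers))
print("median with / without:",
      statistics.median(incomes), statistics.median(without_outliers))
```

Here the mean changes a great deal when the flagged case is dropped, while the median barely moves. Reporting both versions is one way of being "up front" with your audience. Keep in mind that a large outlier also inflates the standard deviation, so this rule can miss outliers in small datasets.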
VI. How can we display our data?
We can use bar charts that show frequencies--or
histograms (which are like bar charts, except that the bars touch one another, so all
the area is shaded--the frequencies are not shown in separate columns).
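A frequency chart can be roughed out as text before drawing it properly. The Python sketch below groups ages into decades and prints one row of `#` marks per decade. It uses only the first six ages from the legislator table in Section VII, and the function name `decade_counts` is my own.

```python
from collections import Counter

# First six ages from the legislator table (Adams through Barnes).
ages = [33, 52, 43, 64, 51, 69]


def decade_counts(values):
    """Count how many values fall into each decade (30 means the 30s, etc.)."""
    counts = Counter((v // 10) * 10 for v in values)
    return dict(sorted(counts.items()))


frequencies = decade_counts(ages)
for decade, count in frequencies.items():
    # One '#' per case, so the length of each row shows the frequency.
    print(f"{decade}s | {'#' * count}")
```

The same function works for a longer list of ages. Only the `ages` list needs to change.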
VII. Questions
- 1. Why might you not want to describe a variable with just one descriptive statistic--
be it the mean, the mode, or any such measure? Why might researchers opt for
something like a "five number summary", providing the median, range (highest value and
lowest value), upper quartile, and lower quartile?
- 2. Consider the following example of a dataset. The units of analysis are
legislators (in the Arizona State House in 2005). There are two variables in
the dataset: name and age.
Sketch out a histogram or bar chart displaying the frequencies of legislator ages. You may
find it useful to categorize age by decades (i.e., 20s, 30s, etc.).
Your bar chart might look something like
this.
Take the quiz on Blackboard. (You may find it easier to transfer the data into
Excel--the quiz asks you to calculate means, modes, medians, quartiles, etc.)
Name | Age | Name | Age | Name | Age | Name | Age | Name | Age | Name | Age
Adams | 33 | Aguirre | 52 | Allen | 43 | Alvarez | 64 | Anderson | 51 | Barnes | 69
Barto | 47 | Biggs | 47 | Boone | 56 | Bradley | 53 | Brown | 76 | Burges | 62
Burns | 32 | Cahill | 51 | Carpenter | 54 | Chase | 52 | Davis | 54 | Downing | 62
Farnsworth | 44 | Gallardo | 60 | Garcia | 60 | Gorman | 37 | Hershberger | 56 | Huffman | 36
Jones | 57 | Kirkpatrick | 55 | Knaperek | 50 | Konopnicki | 60 | Landrum | 39 | Lopes | 64
Lopez | 57 | Lujan | 40 | Mason | 61 | McClure | 63 | McComish | 62 | McLain | 60
Meza | 41 | Miranda | 49 | Murphy | 34 | Nelson | 69 | Nichols | 36 | O'Halleran | 59
Paton | 34 | Pearce | 58 | Pierce | 53 | Prezelski | 35 | Quelland | 57 | Reagan | 36
Rios | 56 | Robson | 50 | Rosati | 46 | Sinema | 29 | Smith | 64 | Stump | 34
Tom | 57 | Tully | 38 | Weiers | 52 | Yarbrough | 57 | Weiers | 57 | Wilson | 45