Lecture 3

Lecture 3 focuses on some basic methods of data analysis.

I. Analyzing Data

Below are some common questions that can be answered using quantitative data.

Often, we might be interested in how many individuals fall into one category versus another. Are individuals equally divided across categories--or are they more likely to be in one category than another?
This general question might lead to the following specific questions:
- What percentage of college students favor legalization of marijuana?
- What percentage of individuals (of a certain sex, race, and locality) have a particular DNA genotype?
What is the relationship between two variables--does the category into which individuals fall for one variable depend on the value that they get for another variable?
This general question might lead to the following specific questions:
- What is the relationship between sex and partisan affiliation? Does partisan affiliation depend on sex (that is, are women more likely to be Democrats?)
- What is the relationship between the likelihood of being charged with a particular crime and income?
  (Note that neither of these questions taps into why these relationships may exist--that is a matter of theory.)
How can we describe our data--in a way that will help us to understand individual variables, and help us to understand the collection of individual cases that we have?
- Specifically--what is the range of values on a variable? What is the average? What is the median (or the value in the middle)?

II. Basic Data Analysis

Here, we want to cover univariate statistics--that is, statistical measures that focus on just one variable at a time.

One can think about measures of central tendency--measures that give an idea of the value that is at the "center" of a distribution of a variable. There are three commonly used measures of "central tendency": the mean (or average), the median, and the mode.
- The mean or average is calculated by adding all of the values up, and dividing by the number of cases. Means tend to be very sensitive to outliers--for example, the average income of the 1985-1995 graduates of the UNC Department of Geography is very, very high--over $200,000 a year, I believe. This, however, is because (1) there are very few graduates of the UNC Geography Department and (2) Michael Jordan is one of those graduates.
  
  A related term is expected value, which can be thought of as a long-run average of a variable.
- The median is the value in the middle--approximately half of the cases fall below the median value of a variable, and half of the cases fall above the median value of a variable. Medians tend to be less sensitive to outliers than means are.
  
  So, for instance, suppose you have a dataset with one variable (say, age) and 3 cases in it: the three cases are 62, 39, and 54. The median value is the one in the middle once the cases are sorted--it is 54. If you add a case (say, age=47), and now have four cases for age, what is the median? There are two numbers "in the middle": 47 and 54. The median is the center of those two numbers: it is 50.5.
- The mode is the category with the most cases in it.
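The three measures above can be checked with a short sketch using Python's standard `statistics` module. The ages are the ones from the median example; the list of party labels is hypothetical and is included only to illustrate the mode.

```python
import statistics

# The three ages from the lecture example.
ages = [62, 39, 54]
mean_age = statistics.mean(ages)      # sum of the values / number of cases
median_age = statistics.median(ages)  # middle value once sorted: 39, 54, 62

# Adding a fourth case (age 47) leaves two values "in the middle" (47 and 54);
# the median is then halfway between them.
ages_with_fourth = ages + [47]
median_four = statistics.median(ages_with_fourth)

# Hypothetical categorical data (not from the lecture) to illustrate the mode.
party = ["Democrat", "Republican", "Democrat", "Independent", "Democrat"]
most_common_party = statistics.mode(party)

print(mean_age, median_age, median_four, most_common_party)
```

Note that the mode works for categories as well as numbers, while the mean and median only make sense for numeric values.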
The range is the lowest and the highest value that the variable takes on across cases (not the lowest possible value or highest possible value, but the lowest value and highest value within the dataset).
The upper quartile of a variable is the value that 75% of the case values fall under.
The lower quartile of a variable is the value that 25% of the case values fall under.
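As a rough illustration of the range and the quartiles, here is a sketch using the four ages from the median example. Software packages use several slightly different conventions for computing quartiles, so results can differ a little between tools; the `method="inclusive"` choice below is an assumption made for this example, not a rule from the lecture.

```python
import statistics

ages = [62, 39, 54, 47]

# The range: the lowest and highest values actually observed in the data.
lowest, highest = min(ages), max(ages)

# quantiles(..., n=4) returns the three cut points that split the data
# into quarters: lower quartile, median, upper quartile.
lower_q, middle, upper_q = statistics.quantiles(ages, n=4, method="inclusive")

print("range:", lowest, "to", highest)
print("lower quartile:", lower_q, "median:", middle, "upper quartile:", upper_q)
```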
The variance of a variable is the average squared distance of each value from the mean value of the variable. So, to calculate the variance for a variable in a population, you
- (1) calculate the average value of the variable (add up all the values, and divide by the number of cases)
- (2) calculate the distance each case value is from that average/mean
- (3) square that distance--so that all the values are positive
- (4) add up all the squared distances, and divide by the number of cases
The standard deviation of a variable is the square root of the variance.
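The steps for the population variance and standard deviation can be sketched in a few lines of Python. The example reuses the four ages from the median example and assumes, purely for illustration, that these four cases are the entire population.

```python
import math

ages = [39, 47, 54, 62]  # treated here as a complete population (an assumption)

# (1) the mean: add up all the values and divide by the number of cases
mean = sum(ages) / len(ages)

# (2) the distance of each case value from the mean
distances = [age - mean for age in ages]

# (3) square each distance so that all the values are positive
squared_distances = [d ** 2 for d in distances]

# (4) add up the squared distances and divide by the number of cases
variance = sum(squared_distances) / len(ages)

# The standard deviation is the square root of the variance.
std_dev = math.sqrt(variance)

print(mean, variance, std_dev)
```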

III. More on Variances and Standard Deviations

A. Samples and Populations--Different Formula

When calculating variances and standard deviations for populations, you divide by N, or the size (# of cases) in the population.

However, when calculating variances and standard deviations for samples, divide by n-1, or the size of the sample (# of cases in the sample) minus one. We'll talk more about "degrees of freedom" in future lectures, but the basic idea is that in a sample, you're estimating the mean--and so you're "using up" one piece of information, and therefore divide not by n, but by n-1.
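To see the difference between the two formulas, the sketch below applies both to the same four ages (39, 47, 54, 62) used earlier. Python's `statistics` module provides `pvariance`/`pstdev`, which divide by N, and `variance`/`stdev`, which divide by n-1. Which pair is appropriate depends on whether the cases are a whole population or a sample.

```python
import statistics

ages = [39, 47, 54, 62]

# Population formulas: divide the sum of squared distances by N.
pop_var = statistics.pvariance(ages)
pop_sd = statistics.pstdev(ages)

# Sample formulas: divide the sum of squared distances by n - 1.
samp_var = statistics.variance(ages)
samp_sd = statistics.stdev(ages)

print("population:", pop_var, pop_sd)
print("sample:    ", samp_var, samp_sd)
```

Dividing by the smaller number n-1 always gives a somewhat larger variance than dividing by N, and the gap shrinks as the number of cases grows.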

See this attachment for information on notation, and on calculating out means, variances, and standard deviations.
B. Why are Variances so Useful?

Variances and standard deviations are very useful measures of the spread of your data--taking more into account than, for instance, the simple range of your data.
C. Z-Scores--and an Example

Often, we talk in terms of z-scores, or "standard deviation units". Consider the following data set, which has one variable (X) and 7 cases:
- X1=5
- X2=7
- X3=9
- X4=6
- X5=4
- X6=8
- X7=3
What is the mean?

The mean is the total sum (42) divided by the number of cases (7)--that is, this variable X has a mean of 6.

What is the squared distance each value is from the mean of 6?
- squared distance=square of (5-6)=1
- squared distance=square of (7-6)=1
- squared distance=square of (9-6)=9
- squared distance=square of (6-6)=0
- squared distance=square of (4-6)=4
- squared distance=square of (8-6)=4
- squared distance=square of (3-6)=9
What is the average squared distance? (Consider this a sample, so divide by n-1 rather than N.)

The total sum of all the squared distances is 1 + 1 + 9 + 0 + 4 + 4 + 9=28

Divide 28 by 6 (that is, by n-1).

The average squared distance (or the variance of X in this sample) is 4.67.

What is the standard deviation? It is simply the square root of the variance-- the square root of 4.67 is 2.16.

We can think about the values in terms of standard deviation units. For example, X3=9. That is, X3 is 3 units above the mean of 6. How many standard deviation units is that? One standard deviation unit is 2.16 actual units--so 3 actual units would be 1.39 standard deviation units. We can say that X3 is 1.39 standard deviation units above the mean. Likewise, X1 is just a little below the mean--it is .46 standard deviation units below the mean.
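The worked example can be reproduced with a short Python sketch. It treats the seven cases as a sample, so it uses the n-1 formulas, and then converts every value into a z-score by dividing its distance from the mean by the standard deviation.

```python
import statistics

x = [5, 7, 9, 6, 4, 8, 3]  # X1 through X7 from the example

mean = statistics.mean(x)          # 42 / 7
variance = statistics.variance(x)  # sample variance: 28 / (7 - 1)
std_dev = statistics.stdev(x)      # square root of the sample variance

# z-score: how many standard deviation units a value is above (+)
# or below (-) the mean
z_scores = [(value - mean) / std_dev for value in x]

print("mean:", mean)
print("variance:", round(variance, 2))
print("standard deviation:", round(std_dev, 2))
for i, (value, z) in enumerate(zip(x, z_scores), start=1):
    print(f"X{i}={value}  z={z:+.2f}")
```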

IV. What if we want to measure the RELATIONSHIP between two or more variables?

Let's move away from univariate statistics to multivariate statistics.

Measures of association refer to statistics describing the degree to which two or more variables are associated with each other--how is change in one associated with change in another. For instance, a "correlation" is a measure of association (specifically, a correlation measures the strength of a linear relationship between two variables).
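As one illustration of a measure of association, here is a minimal sketch of the Pearson correlation coefficient, a common way to measure the strength of a linear relationship. The function name `pearson_r` and the data (hours studied and exam scores) are hypothetical and are not from the lecture.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two variables move together around their means.
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each variable varies on its own.
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return co_deviation / math.sqrt(ss_x * ss_y)

# Hypothetical data: five students' hours studied and exam scores.
hours_studied = [1, 2, 3, 4, 5]
exam_score = [55, 61, 70, 72, 85]

r = pearson_r(hours_studied, exam_score)
print(round(r, 2))
```

The result always falls between -1 and +1. Values near +1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak one or none.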

V. Outliers

Outliers are cases that seem to be far away from other cases--sometimes defined as more than two standard deviations from the mean of that variable.

Outliers are important because they can change the conclusions you draw. Some statistics (such as means) are particularly sensitive to outliers. An outlier can suggest that there is a relationship between two variables (e.g., between having a UNC Geography degree and income) when none exists.

Outliers are a very complex topic. In sum, however, you should always look for outliers in any data set. You should always try to explain why the outlier exists. And, you should always determine whether the inclusion of the outlier changes your results and conclusions--and be "up front" about that to your audience.
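The sketch below illustrates both points: flagging cases more than two standard deviations from the mean, and checking whether including the outlier changes the results. The helper `flag_outliers` and the income figures (in thousands of dollars) are hypothetical, and the two-standard-deviation cutoff is only one of several possible rules.

```python
import statistics

def flag_outliers(values, k=2.0):
    """Return the values lying more than k sample standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > k * sd]

# Hypothetical incomes in thousands of dollars; the last case is extreme.
incomes = [38, 42, 45, 47, 50, 52, 55, 58, 61, 5000]

flagged = flag_outliers(incomes)
without_outliers = [v for v in incomes if v not in flagged]

# Compare the results with and without the flagged cases.
mean_with = statistics.mean(incomes)
mean_without = statistics.mean(without_outliers)
median_with = statistics.median(incomes)
median_without = statistics.median(without_outliers)

print("flagged:", flagged)
print("mean with / without:", mean_with, round(mean_without, 1))
print("median with / without:", median_with, median_without)
```

In this made-up data the mean changes dramatically when the extreme case is dropped while the median barely moves. One caution: in very small data sets a single extreme case inflates the standard deviation so much that it may not cross the two-standard-deviation line, so it is still worth looking at the data directly.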

VI. How can we display our data?

histograms

VII. Questions

1. Why might you not want to describe a variable with just one descriptive statistic-- be it the mean, the mode, or any such measure? Why might researchers opt for something like a "five number summary", providing the median, range (highest value and lowest value), upper quartile, and lower quartile?

2. Consider the following example of a dataset. The units of analysis are legislators (in the Arizona State House in 2005). There are two variables in the dataset: name and age.

Sketch out a histogram or bar chart displaying the frequencies of legislator ages. You may find it useful to categorize age by decades (i.e., 20s, 30s, etc.).

Your bar chart might look something like this.

Take the quiz on Blackboard. (You may find it easier to transfer the data into Excel--the quiz asks you to calculate means, modes, medians, quartiles, etc.)

Name	Age	Name	Age	Name	Age	Name	Age	Name	Age	Name	Age
Adams	33	Aguirre	52	Allen	43	Alvarez	64	Anderson	51	Barnes	69
Barto	47	Biggs	47	Boone	56	Bradley	53	Brown	76	Burges	62
Burns	32	Cahill	51	Carpenter	54	Chase	52	Davis	54	Downing	62
Farnsworth	44	Gallardo	60	Garcia	60	Gorman	37	Hershberger	56	Huffman	36
Jones	57	Kirkpatrick	55	Knaperek	50	Konopnicki	60	Landrum	39	Lopes	64
Lopez	57	Lujan	40	Mason	61	McClure	63	McComish	62	McLain	60
Meza	41	Miranda	49	Murphy	34	Nelson	69	Nichols	36	O'Halleran	59
Paton	34	Pearce	58	Pierce	53	Prezelski	35	Quelland	57	Reagan	36
Rios	56	Robson	50	Rosati	46	Sinema	29	Smith	64	Stump	34
Tom	57	Tully	38	Weiers	52	Yarbrough	57	Weiers	57	Wilson	45
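For the histogram exercise, one possible approach is to group the ages by decade and count the cases in each group. The sketch below uses the 60 ages transcribed from the table above (names omitted) and prints a simple text bar chart. The layout is just one way to draw it.

```python
from collections import Counter

# The 60 legislator ages from the table, read row by row.
ages = [
    33, 52, 43, 64, 51, 69,
    47, 47, 56, 53, 76, 62,
    32, 51, 54, 52, 54, 62,
    44, 60, 60, 37, 56, 36,
    57, 55, 50, 60, 39, 64,
    57, 40, 61, 63, 62, 60,
    41, 49, 34, 69, 36, 59,
    34, 58, 53, 35, 57, 36,
    56, 50, 46, 29, 64, 34,
    57, 38, 52, 57, 57, 45,
]

# Group each age into its decade: 29 -> 20, 33 -> 30, and so on.
decade_counts = Counter((age // 10) * 10 for age in ages)

# Print one bar per decade, with one "#" per legislator.
for decade in sorted(decade_counts):
    count = decade_counts[decade]
    print(f"{decade}s | {'#' * count} ({count})")
```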
