Lecture 6
Lecture 6 focuses on distributions. There's a fair amount of review here.
I. Review--Types of Variables & Data
We've already talked about types of variables--categorical (or nominal), ordinal,
and interval. Let's review.
- Categorical or nominal variables are those which are categories--without
order. Therefore, "religion" could be a categorical variable within a
data set:
- Those who are Protestant coded as 1
- Those who are Jewish coded as 2
- Those who are Catholic coded as 3
- Those who are Muslim coded as 4
- Those who are "other" (not Protestant, Jewish, Catholic, or Muslim) coded as 5
Note--when you are creating variables and coding data, the categorization should be exhaustive--that
is, every datapoint must fit into some category. That is why, in the above example,
an "other" category is necessary.
The categories should also be mutually exclusive--that
is, you wouldn't want to use a coding scheme such as the following:
- Those who are Christian coded as 1
- Those who are Protestant coded as 2
- Those who are Jewish coded as 3
- Those who are Muslim coded as 4
- Those who are "other" (not Christian, not Protestant, not Jewish, and not Muslim) coded as 5
because you wouldn't be sure whether to code someone who is Protestant as a 1 or a 2.
Note there is no inherent order across the religions--the numbers are merely
used to classify the religions.
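The coding rules above can be sketched in a few lines of Python. This is only an illustration: the names `RELIGION_CODES`, `OTHER_CODE`, and `code_religion` are hypothetical, and the sample responses are made up. A dictionary lookup gives each response exactly one code (mutually exclusive), and the fallback to "other" guarantees that every response gets a code (exhaustive).

```python
# Hypothetical codebook for the "religion" variable described above.
RELIGION_CODES = {
    "Protestant": 1,
    "Jewish": 2,
    "Catholic": 3,
    "Muslim": 4,
}
OTHER_CODE = 5  # catch-all category that makes the scheme exhaustive


def code_religion(response: str) -> int:
    """Return the numeric code for a response; unlisted answers become 'other'."""
    return RELIGION_CODES.get(response.strip().title(), OTHER_CODE)


# Made-up survey responses, for illustration only.
responses = ["Catholic", "jewish", "Buddhist", "Protestant"]
coded = [code_religion(r) for r in responses]
print(coded)  # [3, 2, 5, 1]
```

The numbers are labels only: nothing in the code relies on 3 being "more" than 2.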
- Ordinal variables are variables which are categories, and which have a
certain order. However, one cannot say for sure that the distances between
levels or points on the scale are equal to each other. "Ordinal" just means "rank-ordered".
For example, one could code finishers in a race--first, second, third, and so on. But
the distance between the first and second place finisher isn't necessarily the same as the
distance between the third and fourth place finisher.
Likert scales, which are often used in survey research, are generally considered to be
ordinal scales. A Likert scale is a rank-ordered scale, generally used to measure attitudes.
For example, you could ask the following survey question:
- To what degree do you agree with the statement "statistics is fun!"
- 1 Strongly Agree
- 2 Agree
- 3 Neither Agree nor Disagree
- 4 Disagree
- 5 Strongly Disagree
This is an ordinal variable because there's no assumption that the distance
between 1 and 2 is the same as between (for instance) 2 and 3, 3 and 4, and so on.
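The race example above can be made concrete with a small Python sketch. The runner names and finishing times are invented for illustration. The ranks record only the order of finish, while the gaps between consecutive finishers differ widely, which is exactly the information an ordinal variable does not carry.

```python
# Made-up finishing times in seconds (hypothetical data).
finish_times = {"Ana": 612, "Ben": 613, "Cy": 640, "Dee": 642}

# Rank the runners by time: 1 = first place, 2 = second place, and so on.
ordered = sorted(finish_times, key=finish_times.get)
ranks = {runner: place for place, runner in enumerate(ordered, start=1)}

# Gap in seconds between each finisher and the next one.
gaps = [finish_times[b] - finish_times[a] for a, b in zip(ordered, ordered[1:])]

print(ranks)  # {'Ana': 1, 'Ben': 2, 'Cy': 3, 'Dee': 4}
print(gaps)   # [1, 27, 2]
```

The ranks are evenly spaced (1, 2, 3, 4), but the underlying gaps are not, so arithmetic on the ranks themselves should be treated with caution.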
- Interval / Ratio variables are those whose values are ordered--and where the
distance between any two adjacent values is the same as the distance between any other two
adjacent values.
For instance--one could code as a variable the number of children that a survey respondent
has. The difference between one child and two children is the same as the difference between
two children and three children, four children and five children, and so on. (Of course,
the effect of going from one child to two children may differ from the
effect of going from two children to three children, but the actual distance is the same.)
Continuous Versus Discrete Data
- Continuous data are data that can be broken down into smaller parts and still have meaning. That is, they can
take on any value in an allowed range.
- Discrete data are data that can't be broken down into smaller units--they have to be thought of in terms of
whole numbers. Categorical and ordinal variables are generally considered discrete
data.
Factors are variables that are used to classify other variables. Factors are usually
either categorical or ordinal variables, but can be interval-level. An example would be
a restaurant database--the variable "number of customers served" can be classified by another variable,
"day of the week". Day of the week could have values ranging from 1 to 7. (Is it a categorical,
ordinal, or interval variable? Why?) Number of customers served
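As a rough sketch of how a factor classifies another variable, the Python below groups "number of customers served" by "day of the week". The records are invented for illustration, and the variable names are hypothetical rather than taken from any real database.

```python
from collections import defaultdict

# Hypothetical records: (day of the week coded 1-7, number of customers served).
records = [(1, 120), (2, 95), (1, 130), (7, 210), (2, 105), (7, 190)]

# Use "day of the week" as the factor: collect the counts under each day.
customers_by_day = defaultdict(list)
for day, customers in records:
    customers_by_day[day].append(customers)

# Summarize the classified variable within each level of the factor.
average_by_day = {
    day: sum(values) / len(values)
    for day, values in sorted(customers_by_day.items())
}
print(average_by_day)  # {1: 125.0, 2: 100.0, 7: 200.0}
```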
II. Review--Ways to Present Data
One can present data in several ways:
- Through a frequency table (which is a table that,
for a particular variable, gives the number and percentage of cases
in a dataset that have each possible value),
- a bar chart (a chart
with vertical bars showing the frequencies for each possible value
of that variable),
- or a histogram (which looks like a bar chart--
but the vertical bars are contiguous, without space between them).
Click here for an excellent website which allows you to look at different histograms
and frequency tables--and create your own.
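A frequency table as described above can be built with the standard library. The sketch below assumes a made-up list of responses for "number of children"; `frequency_table` is just an illustrative name.

```python
from collections import Counter

# Hypothetical responses for one variable (number of children per respondent).
children = [0, 2, 1, 2, 3, 0, 2, 1, 2, 0]

counts = Counter(children)
total = len(children)

# One row per observed value: (value, number of cases, percentage of cases).
frequency_table = [
    (value, counts[value], 100 * counts[value] / total)
    for value in sorted(counts)
]

print(f"{'Value':>5} {'Count':>5} {'Percent':>8}")
for value, count, percent in frequency_table:
    print(f"{value:>5} {count:>5} {percent:>7.1f}%")
```

Plotting the "Count" column as vertical bars, one per value, gives the corresponding bar chart.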
When looking at the "College SAT" frequency table or histogram, ask yourself
the following questions:
- What is the modal category?
- Does the distribution have a single mode?
Note that there is also the term bimodal distribution. A bimodal
distribution is generally understood as a frequency distribution that has two
peaks--but (the way the term bimodal is generally used) the peaks do not have to
be the same height. Click here for a description and example of a bimodal distribution. Note
that the histogram shows two peaks--but the one on the right is not as high
as the one on the left.
- Given that, is this distribution a bimodal distribution? Why or why not?
- A multimodal distribution is a distribution with more than
two obvious peaks that the other observations tend to gather around. Can you
find any examples of multimodal distributions on this website?
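One crude way to explore these questions in code is to count the bars of a histogram that are taller than both of their neighbours. This is only a sketch under simplifying assumptions: the bar heights are invented, `count_peaks` is a hypothetical helper, and a strict local-maximum rule will miss flat-topped peaks and can be fooled by small bumps in noisy data.

```python
def count_peaks(frequencies):
    """Count bars strictly taller than both neighbours (edges compare to 0)."""
    padded = [0] + list(frequencies) + [0]
    return sum(
        1
        for left, mid, right in zip(padded, padded[1:], padded[2:])
        if mid > left and mid > right
    )


# Hypothetical histogram bar heights, listed from left to right.
bar_heights = [2, 9, 4, 1, 3, 6, 2]

modal_bin = bar_heights.index(max(bar_heights))  # position of the tallest bar
peaks = count_peaks(bar_heights)

print(modal_bin)  # 1
print(peaks)      # 2
```

Here the tallest bar marks the modal category, and the two peaks of unequal height match the general usage of "bimodal" described above.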
III. Questions
- An example above was of a restaurant database, with two variables: "number of customers
served" and "day of the week" (coded 1-7).
- Is the "number of customers served" variable continuous or discrete? Is it categorical,
ordinal, or interval? Why?
- Click here for the answer.
- Is the "day of the week" variable continuous or discrete? Is it categorical, ordinal,
or interval? Why?
- Click here for the answer.
- What sort of variable is a DNA profile?
- Click here for the answer.
- Would you imagine that height in a population of college-aged
female students is a unimodal variable?
- Click here for the answer.
- If one measures height of female college students, and then height
of male college students, what type of variable is "sex" being used as?
- Click here for the answer.
- This question is from Lacy (2006): Suppose you make five measurements of refractive index from each of 20 fragments
of glass from a single window pane. Is this set of measurements a population--
or a sample? Why?