I. Goals.
A. Uses of survey data.
B. Survey issues.
C. What to do with the data.
D. Survey exercise.
II. Uses of survey data. With a survey, you produce
the observation of interest by asking a question. You're still
not
manipulating anything, but your involvement is increasing.
A. Research: Find out attitudes, opinions or
behaviors.
This can be as they are now or as they change over time.
B. Practical: Find the same stuff, but put it to use.
For example, politicians tailor their message based on issues people
are
focusing on. Companies use product surveys to find out what
you're
buying and why you like what you do.
III. Survey issues.
A. Sampling: You start with a population. This is
a theoretical entity that corresponds to everyone you want to
generalize
to. It is almost impossible to specify a complete
population.
Instead, you use a frame, which is a list of members of the population.
If the population is Middle Tennessee State University students, it's
easy
to get a frame that contains everyone (a directory). For
something
like the US population, you might need several sources for your
frame.
Since your frame usually won't contain everyone, you have to be aware of selection biases (overrepresenting some element of the population). You want
a representative sample (one that accurately reflects the make-up of
the
population). Once you have a frame, you take a sample from it and
administer the survey to those members of the population. How do
you get the sample?
1. Nonprobability sampling: This means you can't say how
likely any member of the population was to be in the sample (you can't
assign a probability of membership to elements of the population).
+: Easy.
-: Worry about selection biases. You might choose people
to survey based on how you think they'll answer the questions, biasing
the results. Taking whoever is handy is also a source of bias (for example, phone polls leave out people without phones).
2 kinds:
a. Accidental: I treat some group to whom I have easy
access
as the sample. If I want Middle Tennessee State University
students,
and survey members of this class, that's an accidental sample.
Standing in front of the KUC and surveying the first 30 people who walk by is also accidental sampling.
b. Purposive: I want to know about a particular subset
of the population, so I only sample those elements. For example,
if I want to identify study habits of “A” students, that's the only
group
sampled. This is a biased sample, but it's not a problem because
that's who I'm interested in.
2. Probability sampling: For any element of the population,
you can specify exactly how likely that element is to be included in
the
sample.
+: Representative samples.
-: Hard.
3 kinds:
a. Simple random: Every element has an equal chance of
being in the sample. It's like throwing everyone's name in a hat
and then drawing them out until you get the number of people you
want.
It's the easiest of this type, but it doesn't guarantee a
representative
sample.
b. Stratified random sampling: If the population is not
homogeneous (for example, we have two distinct genders), then you
subdivide
into strata (subgroups) and sample from within each stratum
(group).
You can take equal numbers from each stratum, or you can do it
proportional
to the stratum's membership in the population. For example, if I
have 40 students, and 10 are male, then in a sample of four, I'd want
one
male and three females (so males are still 25%). I would randomly
sample one male from the males and three females from the
females.
To the extent that you do a good job of identifying strata, this is the
most representative kind of sampling.
c. Cluster sampling: Like simple random, but the elements
are actually clusters of individuals. For example, if I couldn't
get MTSU's student directory, I could randomly sample from the course
catalog,
and survey all of the members of the chosen courses (clusters of
students).
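The three kinds of probability sampling can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the 40-student frame, and the course names are all made up for the example, and the stratified version uses proportional allocation with simple rounding.

```python
import random

def simple_random_sample(frame, n, rng):
    """Every element of the frame has an equal chance of being drawn."""
    return rng.sample(frame, n)

def stratified_sample(frame, stratum_of, n, rng):
    """Proportional allocation: each stratum contributes in proportion to its size."""
    strata = {}
    for element in frame:
        strata.setdefault(stratum_of(element), []).append(element)
    sample = []
    for members in strata.values():
        # Rounding can make the total differ from n when the shares are not whole numbers.
        k = round(n * len(members) / len(frame))
        sample.extend(rng.sample(members, k))
    return sample

def cluster_sample(clusters, n_clusters, rng):
    """Randomly pick whole clusters, then survey everyone in the chosen clusters."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [person for name in chosen for person in clusters[name]]

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical frame: 40 students, 10 male and 30 female.
frame = [("M", i) for i in range(10)] + [("F", i) for i in range(30)]

srs = simple_random_sample(frame, 4, rng)
strat = stratified_sample(frame, lambda s: s[0], 4, rng)
males_in_strat = sum(1 for s in strat if s[0] == "M")  # 1 male, 3 females

# Hypothetical courses acting as clusters of students.
courses = {"PSY 101": ["Ann", "Bo"], "PSY 201": ["Cy", "Di", "Ed"], "PSY 301": ["Flo"]}
clus = cluster_sample(courses, 2, rng)
```

Note that the simple random sample of four may contain any number of males, while the stratified sample always contains exactly one. That is the sense in which stratifying protects representativeness.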
B. Methodology: How do you contact the sample?
1. Mail survey: Send it by mail.
+: Easy.
-: Potential response biases: If only some of the people
respond, they might differ from the people who didn't respond (like
they
might hold the most extreme views).
2. Interview: Face-to-face administration of the survey.
+: More responses.
-: Biases from interviewer (subtle or intentional), hard and
expensive.
3. Phone: Call them. Good compromise.
+: Response rate.
-: Still potential bias.
C. The questions.
1. Choosing what to ask: You need:
a. Very specific questions that go directly to what you want
to know.
b. To project possible answers to see whether responses to the questions can tell you what you want to know.
2. Decide on a format:
a. Closed: Multiple choice, true/false. Participants
have to select from several possibilities. Good by mail.
b. Open-ended: Essays and short answers. These are
more informative, but harder to score. Good for interviews.
3. Write a pool of questions (more than you need).
4. Revise: Try the questions on people with strong opinions to check for bias, then narrow the pool.
5. Pretest: Get a subset of the sample to take the
survey.
Interview them about the questions (what were they thinking, was it
clear).
6. Make instructions and a procedure for administering it.
You also need to be sensitive to the ordering of the questions:
a. Go from general to specific to avoid fixing participants’
responses.
b. If you have similar questions, mix the order so that you have
several orderings over the whole set of surveys.
c. Use filter questions to cut down on the work of respondents
(like “Do you own a car? If yes, then answer this set of car
questions...”).
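Two of the ordering ideas above (mixing the order of similar questions, and filter questions) are easy to sketch in Python. The function names, the question wording, and the answer format are all hypothetical; this shows the idea, not a survey tool.

```python
import random

def make_forms(general, similar_block, n_forms, seed=0):
    """Build survey forms: general questions first, then the similar
    questions in a freshly shuffled order on each form."""
    rng = random.Random(seed)
    forms = []
    for _ in range(n_forms):
        block = similar_block[:]  # copy so the master list is left untouched
        rng.shuffle(block)
        forms.append(general + block)
    return forms

def questions_to_ask(answers, filter_key, follow_ups):
    """Filter question: ask the follow-ups only if the filter was answered 'yes'."""
    return follow_ups if answers.get(filter_key) == "yes" else []

general = ["How do you usually get to campus?"]
similar = ["Rate campus parking.", "Rate campus shuttles.", "Rate campus bike racks."]
forms = make_forms(general, similar, n_forms=6)

car_questions = ["What kind of car?", "How many miles per week?"]
for_owner = questions_to_ask({"owns_car": "yes"}, "owns_car", car_questions)
for_non_owner = questions_to_ask({"owns_car": "no"}, "owns_car", car_questions)
```

Every form keeps the general question first and contains the same similar questions, only in a different order, so any order effect is spread over the whole set of surveys.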
D. Design: Time periods sampled.
1. Cross sectional: One time period. Tells you
opinion
at the time you did the survey for the population you sampled.
Attempts
to generalize beyond this are risky.
2. Successive independent samples: Give the same questions
at several time periods to different samples of people.
+: Easy way to assess changes with time.
-: Because each sample contains different people, changes could be due to sampling error rather than real change.
3. Longitudinal: Give the same questions at several time
periods to the same sample of people.
+: Outstanding data for assessing change over time.
-: Hard; mortality (people drop out, and if they drop out for reasons related to what you're studying, the results are biased).
IV. What to do with the data.
A. Descriptive statistics. Everything applies.
B. Chi square (contingency tables).
C. Confidence interval: I could make a statement
like:
“We can be 95% confident that the interval 4 < μ < 6 contains the
mean number of times MTSU students go to Nashville to shop.”
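Here is a rough Python sketch of where an interval like that comes from. The trip counts are invented for illustration, and the sketch assumes the sample is large enough for the normal approximation (with a small sample you would use a t critical value, which gives a slightly wider interval).

```python
import math
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(scores, confidence=0.95):
    """Normal-approximation confidence interval for a population mean."""
    n = len(scores)
    m = mean(scores)
    se = stdev(scores) / math.sqrt(n)               # standard error of the mean
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return m - z * se, m + z * se

# Hypothetical survey answers: shopping trips to Nashville reported by 30 students.
trips = [5, 3, 7, 4, 6, 5, 8, 2, 5, 4, 6, 5, 3, 7, 5,
         6, 4, 5, 6, 4, 5, 7, 3, 5, 6, 4, 5, 6, 5, 4]

low, high = mean_confidence_interval(trips)
print(f"95% CI for the mean: {low:.2f} to {high:.2f}")
```

The interval is centered on the sample mean and gets narrower as the sample gets larger or the answers get less spread out.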
D. Correlation: Assess the strength of the relationship
between two variables. As an example, I can look at the
relationship
between the amount of time you spend studying and your exam
score.
If it's strong, I might want to do an experiment to see if studying
more
causes better grades. If there's no relationship, then there's no
need to do additional research. Some correlation stuff:
1. Scatter plots: You should always start by looking at
a graph that plots one set of scores on the x-axis, and the other set
on
the y-axis. This is like making a frequency distribution for descriptive statistics: it lets you get a qualitative feel for the data. This can help make sense of the numbers later. What can you see?
Here are some idealized plots:
a. All points on a line, goes up from left to right = perfect
positive correlation.
b. All points on a line, goes down from left to right = perfect
negative correlation.
c. No pattern to the dots (uniformly distributed) = absolutely
no correlation.
d. Realistic data: A cloud of points that generally goes
up from left to right = positive correlation (if it goes down it's a
negative
correlation). The more clustered the points, the higher the
correlation.
2. What the numbers mean: Correlation ranges from -1 to 1. Perfect negative = -1. Perfect positive = 1. Absolutely no correlation = 0. What you'll actually find is something in between; the closer to 1 or -1, the stronger the correlation. Keep in mind that absolute strength matters, not the sign, so -.83 is stronger than .24.
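To make the -1 to 1 scale concrete, here is a small Python sketch of the usual Pearson correlation formula applied to idealized data like the plots described above. The function name pearson_r and the data sets are made up for the illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: shared variation divided by total variation."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5]
up = pearson_r(xs, [2, 4, 6, 8, 10])    # points on a rising line: r = 1
down = pearson_r(xs, [10, 8, 6, 4, 2])  # points on a falling line: r = -1
flat = pearson_r(xs, [3, 1, 4, 1, 3])   # no linear trend: r = 0

# Strength is the absolute value, so -.83 beats .24.
stronger = max([-0.83, 0.24], key=abs)
```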
3. Interpreting: Some general things:
a. CORRELATION DOES NOT IMPLY CAUSATION. Even if there
appears to be a very strong relationship between shoe size and IQ, we
wouldn't
say one causes the other. This interpretation will be VERY
tempting
as the semester progresses and we get into less obvious examples.
Do not fall for it.
b. r (a correlation coefficient) is not very interpretable as
it stands. But, r^2 (r-squared) is. It's the proportion of
variance in the Y-scores explained by variance in the X-scores.
In
other words, how much of the difference in the Y's is accounted for by
differences in the X's. The bigger this is, the more knowing a
person's
X score will tell you about their Y score. This leads us to:
c. Predicting a person's performance based on what other people
have done. If you make a scatter plot for the data above and draw
a line through the points so that the distance from each point to the
line
is as small as it can be, you've made the regression line. It's
the
line that best fits the points. If you know the equation for the
regression line and an X score, you can use that to predict a Y
score.
A concrete example of this would be college admissions. Based on
thousands of past students’ SAT scores and college GPAs, I can compute
the regression line for SAT and GPA. Then, if you apply to my
college
and I know your SAT score (X), I can predict your GPA (Y) using my
regression
line. That way, I can get some idea of how successful you'll be
in
college before you ever take a class.
Here's an example: I collected five SAT scores and five
GPAs.
They are:
| SAT | GPA |
| 900 | 2.4 |
| 1000 | 3.2 |
| 850 | 2.0 |
| 1200 | 3.7 |
| 950 | 2.8 |
First, compute the correlation: r = .95. This is a strong, positive correlation. Graph a scatter plot to see the details.
Then, compute r^2 = .91. This is very high. What it means
is that 91% of the differences in GPA can be explained by differences
in
SAT. We don't know how to explain the other 9% (if we had more
possible
factors, we could run them into the regression and find out, but that's
multiple regression, and it's a bit beyond us here). This is
still
correlation research, so we don't want to get carried away with causal
statements.
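If you want to check r and r^2 for the five SAT/GPA pairs yourself, a short Python sketch like this will do it. The variable names are just for illustration; the data are the five pairs from the table.

```python
import math

sat = [900, 1000, 850, 1200, 950]
gpa = [2.4, 3.2, 2.0, 3.7, 2.8]

mean_sat = sum(sat) / len(sat)  # 980
mean_gpa = sum(gpa) / len(gpa)  # 2.82

# Sums of cross-products and squared deviations from the means.
sxy = sum((x - mean_sat) * (y - mean_gpa) for x, y in zip(sat, gpa))
sxx = sum((x - mean_sat) ** 2 for x in sat)
syy = sum((y - mean_gpa) ** 2 for y in gpa)

r = sxy / math.sqrt(sxx * syy)
r_squared = r ** 2

print(f"r = {r:.2f}, r^2 = {r_squared:.2f}")  # r = 0.95, r^2 = 0.91
```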
Now for prediction: I compute the regression line. A line
always takes the form
Y = mX + b
m is the slope; we'll compute that based on the data. b is the
y-intercept (the place where the line crosses the y-axis). We'll
also compute that based on the data. X and Y define a point (in
this
case, the pair of scores corresponding to some individual's GPA and SAT
score). We know X (SAT). The question is, what is Y?
What GPA should this person get based on their SAT?
I did the math. Here's the regression line (the slope is small, so it needs several decimal places; rounding it further noticeably shifts the predictions):
Y = .004685X - 1.771
So, if I know a person's SAT score is 1050, I can plug in for X and compute, and I get a predicted GPA of 3.15. That person should probably be admitted, because it looks like they will be successful. With an SAT of 700, I get a predicted GPA of 1.51. That person will probably struggle, so I might want to pass on an admission so someone else can have the spot.
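The regression line for the five SAT/GPA pairs can be reproduced with the standard least-squares formulas. This is a minimal Python sketch; the names fit_line and predict are made up for the illustration. The slope is kept at full precision here, because it is so small that rounding it to three decimals before plugging in an SAT score would shift the predicted GPA by about a third of a grade point.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for Y = mX + b."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx  # forces the line through (mean X, mean Y)
    return slope, intercept

sat = [900, 1000, 850, 1200, 950]
gpa = [2.4, 3.2, 2.0, 3.7, 2.8]

slope, intercept = fit_line(sat, gpa)

def predict(sat_score):
    """Predicted GPA for a given SAT score, using the unrounded line."""
    return slope * sat_score + intercept

print(f"Y = {slope:.6f}X + ({intercept:.3f})")        # Y = 0.004685X + (-1.771)
print(f"SAT 1050 -> predicted GPA {predict(1050):.2f}")  # 3.15
print(f"SAT 700  -> predicted GPA {predict(700):.2f}")   # 1.51
```

Keep in mind that 700 is below the lowest SAT score in the data (850), so that prediction extends the line past where we have any observations.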