I. Goals:
A. Choosing an IV.
B. Simple experiments (single factor, between participants).
C. Problems in design.
D. Choosing a DV.
E. Analysis.
II. Choosing an IV.
A. At this point, you've turned your question into a hypothesis.
Now you need to choose an IV from that hypothesis. What will you
manipulate? The first step is to operationally define the “if” part
of the hypothesis. If you haven't been thinking clearly, then this
is the first place you get caught. Consider, for example, the following
hypothesis:
H: If people are in love, then they'll have a hard time concentrating.
What variable might we manipulate here? (You might think in love/not
in love.) What does it mean to be in love? What's the operational
definition? Until you have one, you'll have a hard time deciding
what to manipulate. In fact, what you manipulate will look entirely
different depending on your definition.
B. Construct validity comes into play here. Look at the
operational definitions we came up with in class. Some of these are
clearly better definitions of “in love” than others. To the extent
that a definition captures what is generally considered to be the important
aspects of a construct (like love), that definition has high construct
validity. Obviously, we need this to be high. If I define “in
love” as “a general positive feeling about someone” it seems better than
“being sexually aroused by someone.”
C. Once we have a good definition, then we can decide what to
manipulate, and choose the appropriate levels. Remember restriction
of range. We need to have enough levels to capture any relationship
that might exist between the IV and DV, and we need to have them spaced
far enough apart that any relationship that exists can be detected.
Two things:
1. If you have good operational definitions, this part is easy.
2. Even though we're trying to pretend that you can take each
of these steps independently of the others, it's smart to address restriction
of range in the context of the DV you've chosen. We'll pretend for
now that you don't, but that's just to give us time to get to choosing
a DV when it's more appropriate.
III. Simple experiments.
A. These will all be between-participants designs. This
means that each participant in our experiment will participate in one and
only one condition.
The designs we'll cover here are also all single factor designs.
This means that you're only manipulating one IV.
B. The 2-group, (after-only) design:
This is the simplest. Schematically:
2-group because you have two groups, after-only because you only observe
after the treatment.
Basic: One group gets a treatment, the other group gets nothing,
and then you measure the two groups, and look for a difference. If
a difference exists, you can say it's due to the treatment. (Note
the similarity to Mill's Method of Difference.)
The only effort to equate the groups prior to the experiment is through
random assignment. This ensures that each group will be approximately
equal, and that any differences that do exist will be due to chance and
not your own biases. Randomization is almost always your friend,
but it doesn't guarantee that the groups are equal, so you might want to
check to be sure that they are.
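Random assignment as described above can be sketched in a few lines of Python. This is only an illustration: the participant IDs, the function name randomly_assign, and the seed are made up for the example and are not part of any particular procedure.

```python
import random


def randomly_assign(participants, seed=None):
    """Shuffle the participants and split them into two groups of (nearly) equal size."""
    rng = random.Random(seed)       # a seed makes the assignment reproducible
    shuffled = list(participants)   # copy, so the caller's list is left alone
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical participant IDs 1 through 20.
participant_ids = list(range(1, 21))
treatment, control = randomly_assign(participant_ids, seed=42)

print("Treatment group:", sorted(treatment))
print("Control group:  ", sorted(control))
```

Because chance alone decides who goes where, the experimenter's biases can't creep in, but nothing here guarantees the groups are equal. That still has to be checked.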
You can do this design without an explicit control group. For
example, if I'm manipulating alcohol as no alcohol and one beer, my no
alcohol group might be non-alcoholic beer. This helps make the groups
more similar (both got a treatment, and that treatment seemed identical
from the standpoint of the participants). Also, all sorts of potential
confounds or problems are eliminated. For example, there might be
some effect of beer beyond its intoxicating effect (socially learned consequences
of drinking). The non-alcoholic beer group would also have these
same reactions. The only difference is then the alcohol.
You don't even need an implicit control group. If you predict
that more of a variable will impact participants one way, and less in another,
you can just have two levels that are both treatments, and compare those
treatments. This is particularly useful when you're working with
variables that don't have natural zero levels (like mood: is there
a truly neutral mood, or should you instead compare happy-sad?).
C. 2-group, pre-post design:
2-group because we have two groups, pre-post because we observe both
before and after treating.
Because randomization is not always your friend, you can use this design
to check whether or not the two groups were equal prior to running the
experiment. Basic: Give both groups the DV and get a measurement,
treat one group and do nothing to the other, then give them both the DV
again to get a measurement. Two ways to use these scores:
1. Just look at the pre-test to see if there's a difference between
the groups prior to treatment. If there's not, assume they were the
same and continue as in 2-group, after-only.
2. Subtract each participant's pre-test score from their post-test
score and analyze change scores. This is better because it statistically
equates the groups. Since each participant serves as their own baseline,
and all you're looking at is change, it doesn't matter how different the
two groups were prior to the treatment (but see ceiling and floor effects
in the discussion of DVs below). The treatment group should change
more, and that's what you're really interested in. In fact, since
you're throwing out all of the variability between people and just looking
at change, you should have more statistical power.
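A minimal sketch of the change-score approach, assuming each participant has one pre-test and one post-test score. All of the numbers are invented for illustration. Notice that the two groups start at very different levels, but only the amount of change is compared.

```python
from statistics import mean


def change_scores(pre, post):
    """Return post minus pre for each participant (scores paired by position)."""
    if len(pre) != len(post):
        raise ValueError("each participant needs both a pre and a post score")
    return [after - before for before, after in zip(pre, post)]


# Hypothetical scores: the groups differ a lot before treatment.
treatment_pre = [10, 12, 14, 11]
treatment_post = [15, 16, 19, 15]
control_pre = [20, 22, 21, 23]
control_post = [21, 22, 22, 24]

treatment_change = change_scores(treatment_pre, treatment_post)
control_change = change_scores(control_pre, control_post)

print("Mean change, treatment:", mean(treatment_change))
print("Mean change, control:  ", mean(control_change))
```

The change scores would then be analyzed the same way as the raw scores in the 2-group, after-only design.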
A couple of additional notes:
1. You've statistically controlled for differences, but you haven't
eliminated them. Any differences that were there are still there.
These might still come back to haunt you.
2. You introduce the problem of test-retest effects. Since
people have seen the DV prior to treatment, it might impact the treatment
in some way that will then show up in the post-test. It might take
the form of a practice effect (as in a study on math skills where just
practicing the test once boosts people's performance regardless of treatment)
or it might influence how people attend to the treatment (as in a study
of attitude formation where answering questions about your attitudes prior
to someone attempting to change those attitudes might make you firm up
your beliefs and make you less likely to be affected by the treatment).
D. 2-group, matched design:
Another approach to handling preexisting differences. Basic:
If some variable is really important to your topic area (as in you're studying
representation of spatial texts and spatial ability is an important individual
difference) then you screen participants on that variable, match up participants
who score the same on the variable, and randomly assign one member of each
pair to group one and the other member to group two. After that,
everything is the same as the simple 2-group, after-only. This way,
you know prior to experimentation that the two groups are equal (with respect
to the matched variable).
Some notes:
1. Nothing prevents you from combining matching with a pre-post design.
2. Suppose you don't get a set of pairs, but approximately continuous
variation. Then you have some alternatives to the participant-by-participant
matching described above.
a. ABBA: If you put the highest score in A, next in B,
next in A... (ABAB...), then for each pair the person in A is a little
better than the person in B. This introduces a potential bias.
Instead, put first in A, next in B, next in B, next in A... This
will approximately equate those slight differences.
3. Again: Differences between the groups are not eliminated.
The matching variable has been accounted for, but there are countless
other sources of variation that you haven't controlled. You can try
to match on multiple variables at once, but that's harder, and the more
you match the worse it becomes. This will help you appreciate how
randomization can be your friend.
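The ABBA ordering from note 2a can be sketched as follows. The screening scores and participant labels are hypothetical, and abba_assign is just a name chosen for this example.

```python
def abba_assign(scores):
    """Rank participants from highest to lowest score, then assign A, B, B, A, ..."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    pattern = "ABBA"
    group_a, group_b = [], []
    for rank, participant in enumerate(ranked):
        if pattern[rank % 4] == "A":
            group_a.append(participant)
        else:
            group_b.append(participant)
    return group_a, group_b


# Hypothetical spatial ability screening scores.
scores = {"p1": 98, "p2": 95, "p3": 91, "p4": 88,
          "p5": 84, "p6": 80, "p7": 77, "p8": 73}

group_a, group_b = abba_assign(scores)
total_a = sum(scores[p] for p in group_a)
total_b = sum(scores[p] for p in group_b)

print("Group A:", group_a, "total score:", total_a)
print("Group B:", group_b, "total score:", total_b)
```

With these made-up scores the two totals happen to come out identical. A strict ABAB ordering of the same scores would leave group A ahead (350 versus 336), which is the bias the ABBA pattern is meant to reduce.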
IV. Problems in design.
A. Threats to internal validity (whether changes in the DV can really
be attributed to the IV).
1. Is your manipulation effective? Sometimes, you're not
sure if the levels of the IV that you were supposed to have in your experiment
were actually present. For instance, if one group is supposed to
be in a sad mood, were they really in that mood? If the manipulation
was not effective, then the results of the experiment aren't very informative
(if they're not sad, then you're not measuring the effect of a sad mood).
What do we do in this situation? Build in a manipulation check.
Either have an extra condition or an extra dependent measure that can tell
us whether or not the manipulation was effective.
2. Do you have a control condition that is appropriate?
Remember, this is supposed to be a baseline for comparison, but sometimes
choosing the proper baseline can be difficult. If you've chosen badly,
your comparisons might not be valid.
3. Do you have any confounds? We already discussed the
two criteria for confounds: 1) they covary with the IV, and 2) they
could reasonably be expected to produce a change in the DV. If you
have confounds in your experiment then you don't know if changes were due
to the IV or the confound. Some notes on confounds:
a. Looking for them: This is hard. Before running
the experiment, ask yourself “could anything else reasonably be expected
to produce this effect?” List everything that comes to mind, however
trivial. After you brainstorm, take each item on your list and evaluate
it with respect to the two criteria. Anything that seems a threat
should be dealt with.
b. Fixing confounds:
1). Control them: If a variable doesn't covary with the
levels of the IV, then it's not an issue. So, if you're looking at
the effect of training on test performance and you're worried that time
of day of the test might also affect performance, then run everyone at
the same time. This makes time a constant, and it isn't a problem.
2). Counterbalance: If you have two groups (like experimental
and control) run half of each group at one time and half at the other.
This way, even though time of day varies (you have at least two times),
it doesn't covary with the IV (which would be the case if you ran all of
one group at one time and all of the other group at another time).
c. Using confounds: Sometimes, confounds can yield useful
information. Spotting confounds can lead you to discover new things
about the relationships between your IV and your DV. For example,
if time of day does affect test performance, that information has a lot
of important implications. Even if you don't discover the confound
until after the experiment, you've still learned something important.
In other words, you can sometimes profit from your mistakes.
d. When to control for confounds: Before the experiment!
e. How does experimental control affect confounds? The
more control you exercise, the less you need to worry. Confounds
are systematic sources of variation. To the extent that the only
differences between your groups are due to the IV (which is the case in
a perfect experiment) there will be no confounds. As you get sloppier,
the potential gets bigger.
4. Running the experiment:
a. Demand characteristics: When part of your experiment
suggests to participants what the hypothesis of the research might be or
what the experimenter wants them to do. This can lead to participants
performing in the way they think they're expected to perform, instead of
in the way they would perform naturally. Sometimes, you can have
demand characteristics that suggest something other than the real hypothesis,
but which will still influence behavior. You want to watch for these
as well.
b. Hawthorne effects: The classic story (which may not
be entirely accurate but will illustrate the situation suitably for our
purposes): Researchers were investigating the effect of light levels
on productivity of factory workers at Western Electric's Hawthorne plant.
First, they turned on more light. Productivity increased. Then
they turned down the light. Productivity increased. Even when
the workers were toiling away in semi-darkness, productivity increased.
The problem was that just knowing they were being observed was causing
the workers to change their behavior (who wants to look bad in the experiment?).
This is a problem you might have as well, so when observation can influence
behavior, the observation should be done as discreetly as possible.
The point of this section: You're working with humans, and they
will not act like machines. Knowing they're in an experiment can
impact what they do. You should be aware of that fact.
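Going back to the counterbalancing fix for confounds (3.b.2 above), here is a rough sketch of how a testing schedule might be built so that time of day varies but does not covary with the IV. The participant labels, the session names, and the counterbalance function are all assumptions made for the example.

```python
import random
from collections import Counter


def counterbalance(group_members, sessions, seed=None):
    """Within one group, spread members evenly across sessions in random order."""
    rng = random.Random(seed)
    shuffled = list(group_members)
    rng.shuffle(shuffled)
    return {person: sessions[i % len(sessions)]
            for i, person in enumerate(shuffled)}


sessions = ["morning", "afternoon"]
experimental = [f"E{i}" for i in range(1, 9)]   # hypothetical participants
control = [f"C{i}" for i in range(1, 9)]

schedule = {}
schedule.update(counterbalance(experimental, sessions, seed=1))
schedule.update(counterbalance(control, sessions, seed=2))

# Tally how many people from each group are tested in each session.
counts = Counter((person[0], session) for person, session in schedule.items())
for (group, session), n in sorted(counts.items()):
    print(group, session, n)
```

Each group ends up with half of its members in each session, so any time-of-day effect falls equally on both groups.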
B. Threats to external validity: Generalizability:
As you become more and more control oriented, your experimental task ceases
to resemble the real world phenomenon that you're trying to study.
You need to try and balance your need for control with your desire to make
general statements about the world. Control can be a two-edged sword.
Three issues here:
1. Are the participants representative of the population?
Any time you work with a sample, you have to worry about whether they're
really representative of the population. Random sampling procedures
(like we discussed with surveys) would help a lot, but they're rarely,
if ever, used in experimentation. Sometimes this is no problem, sometimes
it is a problem.
2. Are the variables you're using representative of the changes
that happen in the real world? If not, then your findings won't apply.
For example, manipulating background noise with a white-noise generator
might not relate well to the real kinds of background interruptions people
normally face.
3. Is the experimental setting representative? This is
related to ecological validity. The lab setting may not be representative
of the normal setting for what you want to study.
V. Choosing a DV.
A. Just like with the IV, you start with a good operational definition.
In particular, for the “then” part of the hypothesis (what did we mean
by “hard time concentrating?”). Some things to think about as you
choose a DV:
1. Scales of measurement: Ratio data is more informative
than ordinal data, and allows us to perform more interesting tests.
If you keep this in mind during the planning stages, you can shoot for
the highest quality of data.
2. Sensitivity: You need a DV that will be affected by
changes in the IV. If you did the Stroop experiment one word at a
time (instead of in groups), then naming time in seconds would be a bad
variable. It's not sensitive enough to capture the differences.
You don't want to find no effect in your experiment just because your DV
was too crude. One thing you can do is build in a condition that
should be very strongly affected, and use it like a manipulation check.
If there's a big difference between that condition and some other, then
you know the DV was sensitive to at least some changes.
3. Ceiling and floor effects:
a. Ceiling effects: When your task is too easy, and all
participants perform at or near perfect, you have a ceiling effect.
b. Floor effects: When the task is too hard and everyone
performs at the worst possible level.
Note the sensitivity issue here: You need variability in performance.
If nothing else, you want differences between the groups in your experiment.
So, you need people to perform around 80-90% of perfect in your experiment
to make room for the best participants to be better than the rest and the
worst participants to be worse.
4. Reliability: As defined before: How consistently you measure
whatever it is you're measuring. You want a dependent measure that will
consistently yield the same score in the presence of the same circumstances.
For example, if your questionnaire measures my achievement orientation
trait, then it should yield the same score every time you use it to measure
that trait. As you just saw in 3, variability is good, but too much
variability kills you. If your measure isn't consistent in the scores
it provides, you'll get so much variability that real differences will
be obscured.
5. Validity:
a. Construct validity: You need an IV that manipulates
the appropriate construct, and you need a DV that measures the appropriate
construct. The issues here are very similar to the ones raised when
choosing an IV.
b. Face validity: How well it seems to measure what you
want to measure (a superficial sort of measure). On the one hand,
this is important to get people to believe you. If your measure appears
to bear no relationship to what you're claiming to measure, nobody will
take it seriously. But, if it's too transparent, you might get into
some of those participant compliance issues.
VI. Analysis. For a two-group, between-participants
design, you will use an independent samples t-test for the analysis.
The computations are complex enough that it's worth letting a computer
do it for you. When you finish, here's a sample of how to write up
the results:
“The data were analyzed using an independent samples t-test. The independent variable was amount of love, and the conditions were in love and not in love. The dependent variable was concentration. The mean concentration scores for people in love and not in love were 2.00 (0.71) and 4.80 (0.45) respectively. With alpha = .05, the two population means were significantly different, t(8) = -7.57, estimated standard error = 0.37.”
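For anyone curious about what the computer is doing, here is a minimal sketch of the pooled-variance independent samples t-test using only Python's standard library. The raw scores are invented: they were chosen so that the group means and standard deviations come out near the ones in the sample write-up, and they are not the data behind it. The function name independent_t is also just for this example.

```python
from math import sqrt
from statistics import mean, variance


def independent_t(group1, group2):
    """Pooled-variance independent samples t-test.

    Returns the t statistic, the degrees of freedom, and the estimated
    standard error of the difference between the two means.
    """
    n1, n2 = len(group1), len(group2)
    df = n1 + n2 - 2
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / df
    se = sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (mean(group1) - mean(group2)) / se
    return t, df, se


# Hypothetical concentration scores (five participants per condition).
in_love = [1, 2, 2, 2, 3]
not_in_love = [4, 5, 5, 5, 5]

t, df, se = independent_t(in_love, not_in_love)
print(f"t({df}) = {t:.2f}, estimated standard error = {se:.2f}")
```

Small differences from the values in the write-up would reflect rounding and the fact that the scores here are made up. In practice a statistics package will also report the p-value, which this sketch does not compute.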
Where are we now?
We have a hypothesis.
We've chosen an IV.
We've chosen a design.
We've checked that design for problems (like confounds, etc.).
We've chosen a DV.
Now, we'll proceed to add some wrinkles to this basic design.