For more detail on basic statistical
concepts, see N&L Sections 10.1-10.7. In particular, you should be very comfortable with the following terms: that is, be able to explain what they mean in your own words, produce an example when given an experiment scenario, or compute the statistic.
- population vs sample
- sample mean
- confidence level of a statistic
- external and internal validity of an
experiment task
- independent and dependent experiment
variables
- nuisance variables
- normal distributions: what they are, how you get one, and how you tell if you have one
- variance and standard deviation, as
describing populations and samples (know how to calculate)
- null and experimental hypotheses
- degrees of freedom of a statistic
You should also be familiar with the t-test,
covered in the remainder of N&L Chapter 10 and below.
two basic kinds of experiments
In this class, you'll very likely be
doing one of two kinds of statistical tests:
1. Comparing two means: for example,
the effect on a sample of Design A vs Design B
2. Comparing a single mean with a reference
value: for example, the performance of your design
relative to a design requirement
The two experiments are conducted in a
similar way, with one main difference: when you're comparing
a single mean to a reference value, you don't have an
independent variable. That is, you're not varying anything.
When comparing two means, the two means are obtained
as a result of setting your independent variable to
two levels.
A different t-statistic is used to evaluate
the two cases (described below).
issues in experiment design
Randomization
To justify your assumption of normality
(and permit the application of the t-test), you need
to collect a random sample of your population.
(The population you are sampling must also be normally
distributed, of course). There are several things you
can do to ensure that your sample is representative
of the population at large. The exact list depends on
the specifics of your experiment, but this should give
you the idea in the context of your experiments:
- Selection of subjects: they
should be a representative sample of the population
of interest at large, in all parameters that might
matter for your particular experiment (e.g. age,
educational background, handedness, physical abilities,
skill with the interface, visual acuity, ...)
- Order of application of "treatments"
to your subjects: e.g., to avoid learning effects, don't always administer Design A and Design B in the same order (a short sketch of randomizing presentation order follows this list).
- Details of your design which might
skew results: e.g. ordering of commands in a menu
list (early items are likely to be encountered sooner)
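As an illustration of the ordering point above, here is a minimal Python sketch of randomizing the presentation order of the two designs for each subject. The subject IDs and design labels are made-up placeholders, not part of the course materials.

    import random

    # Hypothetical placeholders: two design treatments and some subject IDs.
    designs = ["Design A", "Design B"]
    subjects = ["S1", "S2", "S3", "S4", "S5", "S6"]

    # Give each subject an independently shuffled presentation order,
    # so that Design A is not always administered first.
    for subject in subjects:
        order = designs[:]       # copy, so each subject gets a fresh list
        random.shuffle(order)    # randomize the order in place
        print(subject, order)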
Individual
Differences
Often, the differences in performance
between individual subjects may be substantial relative
to the difference due to the experiment treatment (e.g.
which design). This acts as a "nuisance variable", since
if not accounted for, it will contribute to the "noise"
in the data and mask a real effect.
To address this common problem, the experiment
may be designed to use "paired comparisons", where the
treatments (let's say two, for a comparison-of-two-means
test) are assigned in pairs to each subject. Thus, each
subject is tested on both levels of the independent
variable. In the analysis phase, you can in effect treat
each subject as his own control: you can look at his
relative performance across the treatments, and
compare this relative performance with the other subjects.
In cases where between-subject variability is great
(as it often is), this is a valuable technique for increasing the experiment's power: a smaller difference in raw treatment means can still yield a significant result. Details of this method are given below.
basic t-test comparing two means
The t-test is a simple way of estimating the probability that two samples (i.e. sets of measurements) are part of the same population, or
part of two different populations. Each sample might
be measurements taken from one of two levels of an independent
variable - for example, "which design": Design A, or
Design B. The general question to be answered is: does
using the two designs result in a difference in the
measured performance parameter (i.e. dependent variable)?
If so, then participants using Design A effectively
represent a different population than participants using
Design B. If not, then statistically, those two sets
of measurements will look as if they all came from the
same population. You can use the t-test to see
which of these cases is probably true, to some level
of confidence.
Remember - the t-test is only valid for
normally distributed populations. It is important to
(a) randomize everything possible when collecting your
data, and (b) plot a histogram of your collected data
to make sure it is distributed at least roughly in a bell-shaped curve. If it is not, or if you have violated
randomization principles in data collection, then you
cannot assume that the t-statistic really means anything.
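For example, a minimal Python sketch of this histogram check, assuming your measurements are already collected in a list (the values shown are placeholders, and matplotlib is assumed to be available):

    import matplotlib.pyplot as plt

    # Hypothetical placeholder data: replace with your own measurements.
    measurements = [12.1, 10.4, 11.8, 13.0, 9.7, 11.2, 12.5, 10.9, 11.6, 12.8,
                    10.1, 11.4, 12.0, 10.6, 11.9]

    # Plot a histogram and eyeball it for a roughly bell-shaped distribution.
    plt.hist(measurements, bins=5)
    plt.xlabel("measured value (dependent variable)")
    plt.ylabel("count")
    plt.title("Histogram check for approximate normality")
    plt.show()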
The steps for computing the t statistic in this scenario, shown here, are also given in N&L
10.7 (with more readable equations) as well as in the
course notes.
1. Compute a combined variance for the two samples:

    s^2 = (SS1 + SS2) / (N1 + N2 - 2)

2. Compute the "standard error of difference":

    sed = sqrt(s^2 (1/N1 + 1/N2))

3. Compute the t statistic itself:

    t = (Xmean1 - Xmean2) / sed

4. Compute the t statistic's degrees of freedom:

    df = N1 + N2 - 2

5. Decide on the significance value you require for the result. A common value is p=.01 or .05 (e.g. p=.01 means accepting a 1% chance that you are incorrect in rejecting the null hypothesis).

6. Compare the t statistic to a table of computed t-values for a two-tailed t distribution, given df and p from steps 4 and 5. You can find such tables in the back of a stats textbook, or in many places online (for example, the MedCalc table referred to below).

7. If your computed value of t is higher than the value of t in the table for your df and, say, p=.05, then you can say that the difference between the two means is statistically significant, with p<.05. It is common, in fact, to locate the smallest value of p for which your computed t is significant, and state that as the result's significance.
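To make the steps concrete, here is a minimal Python sketch of the same computation, under the assumption that the two samples are already stored in lists (the numbers are placeholders). scipy is used only to look up the critical t value; a printed table works just as well.

    import math
    from scipy import stats

    # Hypothetical placeholder data: measurements under Design A and Design B.
    a = [10.2, 11.5, 9.8, 12.1, 10.7, 11.0]
    b = [12.4, 13.1, 11.9, 12.8, 13.5, 12.2]

    n1, n2 = len(a), len(b)
    mean1, mean2 = sum(a) / n1, sum(b) / n2

    # Sums of squares about each sample mean (SS1 and SS2).
    ss1 = sum((x - mean1) ** 2 for x in a)
    ss2 = sum((x - mean2) ** 2 for x in b)

    s2 = (ss1 + ss2) / (n1 + n2 - 2)            # step 1: combined variance
    sed = math.sqrt(s2 * (1 / n1 + 1 / n2))     # step 2: standard error of difference
    t = (mean1 - mean2) / sed                   # step 3: t statistic
    df = n1 + n2 - 2                            # step 4: degrees of freedom

    p = 0.05                                    # step 5: chosen significance level
    t_crit = stats.t.ppf(1 - p / 2, df)         # step 6: two-tailed critical value

    print(f"t = {t:.4f}, df = {df}, critical t = {t_crit:.4f}")
    if abs(t) > t_crit:                         # step 7: compare with the table value
        print("difference between the means is significant at p <", p)
    else:
        print("difference is not significant at p =", p)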
This formulation of the t-statistic
is designed to compare the difference between two sample
means. You are here using it to determine whether
a difference between 2 sample means is statistically
significant, REGARDLESS OF DIRECTION. I.e., either Xmean1
or Xmean2 might be larger - you just want to know if
statistically they are different. You can easily
enough determine which is larger by simply looking at
them. Essentially, statistical significance means it's
unlikely that these two means could have come from the
same population distribution - the two populations which
these samples represent must be different from each
other.
single-tailed and two-tailed t-test
One aspect of the distinction between
single-tailed and two-tailed t-tests is covered briefly
in Sections 10.7 and 10.8. The two-tailed t-test is the "standard" one. The critical value of t for a single-tailed hypothesis is smaller than that for a two-tailed hypothesis at the same p (it equals the two-tailed critical value at 2p), and thus this is an easier test to pass. There are some
cases where it is argued that this easier single-tailed
t-test is justified - most notably, where a significant
difference in means will be interesting only if in one
direction, e.g. Xmean1 > Xmean2, but not vice versa.
This argument is somewhat controversial, and we won't
be using the single-tailed t-test for this purpose in
this class. If you're not sure, use the two-tailed
test, which is conservative.
There is, however, another use of the
single-tailed t-test which is of particular relevance
to us, which is when you want to use a t-test to compare
one sample mean to a reference value - e.g. a design
requirement. N&L 10.8 describes one way of doing
this, based upon a single-tailed t-test; the method is described in the next section. The reason it's okay to do this is
that rather than looking to see if one sample mean is
far enough away from another sample mean (in either
direction), here you're seeing if one sample mean is
far enough away from a constant value (in one direction).
The null hypothesis states that there's no significant
difference between your sample mean and the reference
value; this you are trying to reject. Your experimental
hypothesis states that your sample mean is, e.g., less
than the reference value - it does not allow greater
than. The probability (i.e. 5% of the area under the
t-distribution curve, if you are using p=.05) can be
bunched all on one side of the distribution, rather
than divided half on each side as it is for a two-tailed
test.
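To see the difference numerically, here is a small Python sketch comparing single-tailed and two-tailed critical values (the df and p are arbitrary illustrative choices; scipy is assumed to be available):

    from scipy import stats

    p, df = 0.05, 19    # arbitrary illustrative values

    # Two-tailed: the probability p is split between both tails.
    t_two_tailed = stats.t.ppf(1 - p / 2, df)

    # Single-tailed: all of the probability p is bunched in one tail,
    # so the critical value is smaller and the test is easier to pass.
    t_one_tailed = stats.t.ppf(1 - p, df)

    print(f"two-tailed critical t (p={p}, df={df}): {t_two_tailed:.4f}")
    print(f"single-tailed critical t (p={p}, df={df}): {t_one_tailed:.4f}")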
t-test for comparing a mean to a reference value
Many 444 project experiments
involve comparing a single sample mean to a performance
requirement. N&L 10.8 describes one way of doing
this, which is reiterated and expanded upon here. For
purpose of discussion, let's say that your requirement
is that performance must be <= level R (i.e. R is
a maximum permissible value, not a minimum). Conceptually,
you need to make sure that no part of the confidence interval lies above the reference value.
(Note: in this case, you don't have two
levels of an independent variable; you aren't actually
varying anything, but comparing one design, for example,
to a fixed value).
First of all, there is no point in doing
the test at all if the sample mean is larger than R
(Xmean > R). Clearly, we will not find that Xmean
is significantly less than R in this case. If Xmean
<=R, then let's proceed.
Next, what's your confidence interval?
It's a range of values around Xmean, whose size is determined
by your desired significance p (also commonly called alpha) - e.g. 0.05 or 0.01; or, as usual, you can run the process backwards and determine what size the confidence interval would be if the constraint were that it lie entirely under R. If you require a
small p, then your confidence interval will be
larger - the test will be harder to satisfy. So, choose
p and proceed.
Here are the computational steps:
1. Compute the variance for your single sample:

    s^2 = SS / (N - 1)

2. Compute the "standard error of the mean":

    sem = sqrt(s^2 / N)

3. Compute the t statistic's degrees of freedom:

    df = N - 1

4. Decide on the significance value you require for the result. A common value is p=.01 or .05 (e.g. p=.01 means accepting a 1% chance that you are incorrect in rejecting the null hypothesis).

5. Find the critical value of the t statistic, t(p,df), using the single-tailed t distribution, based on df and p.

6. Compute the top and bottom of your confidence interval:

    Xmin = Xmean - t(p,df) x sem

    Xmax = Xmean + t(p,df) x sem

7. Compare Xmax (or Xmin) with the reference value as defined in the experiment and null hypothesis. In our example, if Xmax <= R, then our sample mean is below R, at significance p.
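A minimal Python sketch of this procedure, under the assumption that the requirement R and the sample measurements below are placeholders for your own values (scipy is used only to look up the single-tailed critical t):

    import math
    from scipy import stats

    # Hypothetical placeholders: requirement and measured sample values.
    R = 12.0
    x = [10.2, 11.5, 9.8, 12.1, 10.7, 11.0, 10.4, 11.8]

    n = len(x)
    xmean = sum(x) / n
    if xmean > R:
        raise SystemExit("Xmean > R: no point in running the test")

    ss = sum((xi - xmean) ** 2 for xi in x)
    s2 = ss / (n - 1)                   # step 1: sample variance
    sem = math.sqrt(s2 / n)             # step 2: standard error of the mean
    df = n - 1                          # step 3: degrees of freedom

    p = 0.05                            # step 4: chosen significance level
    t_crit = stats.t.ppf(1 - p, df)     # step 5: single-tailed critical value

    xmax = xmean + t_crit * sem         # step 6: top of the confidence interval
    xmin = xmean - t_crit * sem         #         and bottom, for completeness

    if xmax <= R:                       # step 7: compare with the reference value
        print(f"Xmean = {xmean:.3f} is below R = {R} at significance p = {p}")
    else:
        print("cannot conclude the requirement is met at this significance level")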
t-test for a paired comparison
The paired comparison t-test is used when
you suspect that individual differences between subjects
might be masking an otherwise significant effect. For
example, Subject 1 performs well in all cases, and Subject
2 performs poorly in all cases; both of them do better
with Design A than with Design B, but the standard deviation
of their combined performances is so large that this
individual trend does not show up strongly enough.
The solution is to conduct the t-test
on the difference between the individual's
performances on the two treatments, rather than on all
performances lumped together. In effect, the subject's
mean performance is subtracted from his individual scores
before they are "thrown into the pot". Details are as
follows:
1. Compute the difference between the two treatment measurements for each individual:

    Di = Xai - Xbi

    where Xai and Xbi are the measured responses for treatments A and B on subject i.

2. Compute the mean of the individual differences over all subjects:

    Dmean = summation(Di) / N

    where N = number of differences = number of subjects.

3. Compute the sum of squares of the differences:

    SSd = summation[(Di - Dmean)^2]

        = summation[(Di)^2] - [summation(Di)]^2 / N

4. Compute the standard deviation of the differences:

    sd = sqrt[SSd / (N - 1)]

5. Compute the "standard error of difference":

    sed = sd / sqrt(N)

6. Compute the t statistic itself:

    t = Dmean / sed

7. Compute the t statistic's degrees of freedom:

    df = N - 1

8. Decide on the significance value you require for the result. A common value is p=.01 or .05 (e.g. p=.01 means accepting a 1% chance that you are incorrect in rejecting the null hypothesis).

9. Compare the t statistic to a table of computed t-values for a two-tailed t distribution, given df and p from steps 7 and 8. You can find such tables in the back of a stats textbook, or in many places online (for example, the MedCalc table referred to below).

10. If your computed value of t is higher than the value of t in the table for your df and, say, p=.05, then you can say that the mean difference is statistically significant, with p<.05. It is common, in fact, to locate the smallest value of p for which your computed t is significant, and state that as the result's significance.
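A minimal Python sketch of the paired procedure, again with placeholder data (scipy.stats.ttest_rel would report the same t along with an exact p-value, if you prefer):

    import math
    from scipy import stats

    # Hypothetical placeholders: each subject measured under treatments A and B.
    xa = [10.2, 11.5, 9.8, 12.1, 10.7, 11.0]
    xb = [11.0, 12.3, 10.1, 12.9, 11.6, 11.8]

    d = [ai - bi for ai, bi in zip(xa, xb)]       # step 1: per-subject differences
    n = len(d)
    dmean = sum(d) / n                            # step 2: mean difference
    ssd = sum((di - dmean) ** 2 for di in d)      # step 3: sum of squares of differences
    sd = math.sqrt(ssd / (n - 1))                 # step 4: std deviation of differences
    sed = sd / math.sqrt(n)                       # step 5: standard error of difference
    t = dmean / sed                               # step 6: t statistic
    df = n - 1                                    # step 7: degrees of freedom

    p = 0.05                                      # step 8: chosen significance level
    t_crit = stats.t.ppf(1 - p / 2, df)           # step 9: two-tailed critical value

    print(f"t = {t:.4f}, df = {df}, critical t = {t_crit:.4f}")
    if abs(t) > t_crit:                           # step 10: compare with the table value
        print("mean difference is significant at p <", p)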
The final lecture on controlled experiments
ends with an example of an experiment design to compare
difference in performance for "natural" and "abstract"
icon types. A set of sample data (fictitious) and some sample calculations can be found in this Excel spreadsheet. Here, we'll step through these calculations.
In particular, the spreadsheet illustrates a way
to account for individual differences in your subjects.
The first worksheet, "data", shows the
sample data. The second worksheet, "1st pass" shows
one way of computing the analysis which is in fact not
a very good way because it doesn't account for individual
differences between subjects. The third worksheet, "2nd
pass", shows a better although more lengthy way of doing
the analysis.
Raw Data
There are two icon designs (natural and
abstract) which represent the two levels of the independent
variable. Performance time for each level for each subject
is listed; smaller is better.
This worksheet also shows plots of the
data in two ways, which help you decide if it is indeed
normally distributed. "Sorted Data" shows the two columns
of data plotted in a line after having been individual
sorted into ascending order. If normally distributed,
each of the two lines should rise steeply at first,
flatten somewhat then rise steeply again at the end.
Outliers are clearly evident in such a plot.
The second plot, Histogram, shows the
same data after they have been sorted into appropriately
sized bins. There need to be enough bins to demonstrate
a normal distribution (at least 5); and each bin should
have some values in it. It is difficult to do a histogram for very small samples (10 or fewer).
In this case, the data seem fairly well
distributed, so we will proceed to do a t-test.
1st Pass:
neglects individual differences
In the simplest analysis, the t-statistic
for testing for a difference between two means is computed
in a straightforward way, exactly as laid out above.
The data from the 1st worksheet is reproduced in the
shaded columns. The basic statistics (mean, sum of squares,
variance and standard deviation) are computed according
to the formulas given in the N&L handout for each
column.
Then, the t statistic is computed based
on the combined variance and standard error of difference
of the two samples. A value of "t" of 0.9525 is found,
with 38 degrees of freedom (40 samples on 20 subjects).
Assuming a desired significance level
of 0.05, we consult a two-tailed distribution found
in one of the online tables, for example, the MedCalc one. The critical
value of t for .05 and 38 df is 2.024. Thus, the
difference in effects appears to not be significant.
2nd Pass:
accounts for individual differences
Let's try again, considering that perhaps
large individual differences are inflating the sample's
standard deviation. The 2ndpass worksheet illustrates
an analysis in accordance with the Paired Comparisons
approach. Now we compute the difference in performance
for each subject, and analyze this number. The
t-test is different because there is only one set of
values, rather than two. We now find a t-statistic of
4.8425, which exceeds the critical value of 2.0930 (higher
than before, since now we have just 19 rather than 38
degrees of freedom). In this analysis, the difference due to treatment is indeed significant at p=.05; in fact, looking at the table, it's significant at p<.001. Accounting for individual differences made a big difference!
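For reference, both passes can also be reproduced with scipy's built-in tests; the arrays below are placeholders for the per-subject times in the spreadsheet, not the actual values:

    from scipy import stats

    # Placeholder arrays; substitute the per-subject times from the spreadsheet.
    natural = [3.1, 2.8, 4.0, 3.5, 2.9]
    abstract = [3.4, 3.0, 4.3, 3.9, 3.2]

    # 1st pass: unpaired (independent-samples) t-test, ignoring individual differences.
    t_unpaired, p_unpaired = stats.ttest_ind(natural, abstract)

    # 2nd pass: paired t-test on the per-subject differences.
    t_paired, p_paired = stats.ttest_rel(natural, abstract)

    print(f"unpaired: t = {t_unpaired:.4f}, p = {p_unpaired:.4f}")
    print(f"paired:   t = {t_paired:.4f}, p = {p_paired:.4f}")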
Karon MacLean,
Fall 2002