
Introduction to Statistics

Describing a Distribution

The average male American adult weighs 88.3 kg. This is a lot (actually, way too much). But how do male Austrians compare?

To answer this question, we first have to get an overview of the weight distribution of Austrian males. We take 5 males, record their weight, and get

\[weight = [84, 86, 88, 61, 82]\, kg.\]

Based on these data, we can now estimate the average weight, and how variable the weight is between different men. In doing so, we make use of the assumption that such data typically have a bell-shaped distribution: the most likely value is the average value, and values below and above the average are equally likely. Mathematically, such data can be described by the normal distribution, also called the Gaussian distribution (Fig. 11):

\[\label{eq:normal} f_{\mu,\sigma} (x) = \frac{1}{\sigma \sqrt{2 \pi}} e^{-( x - \mu )^2 /2 \sigma^2}\]
_images/norm_dist_alt.jpg

Figure 11: Normal Distribution: the unit on the horizontal axis is 1 \(\sigma\). This distribution is sometimes also referred to as z-distribution.

Here \(-\infty < x < \infty\), and \(f_{\mu,\sigma}\) is called the probability density function (PDF) of the normal distribution. \(\mu\) is the mean value, and \(\sigma\) is called the standard deviation and characterizes the variability of the data.
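
As a brief illustration of the formula above, the following Matlab lines evaluate and plot the PDF (using normpdf from the Statistics Toolbox); the values \(\mu = 80\) kg and \(\sigma = 10\) kg are just example values chosen for this sketch:

mu    = 80;                              % example mean [kg]
sigma = 10;                              % example standard deviation [kg]
x     = mu-4*sigma : 0.1 : mu+4*sigma;   % weights at which to evaluate the PDF
pdf   = normpdf(x, mu, sigma);           % same as 1/(sigma*sqrt(2*pi)) * exp(-(x-mu).^2/(2*sigma^2))
plot(x, pdf)
xlabel('Weight [kg]')
ylabel('Probability density')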

From our data, the best estimate for the mean weight of Austrian males is

\[\label{eq:sampleMean} \bar{w} = \frac{\sum\limits_{i=1}^{n}{w_i}}{n} = 80.2 \, kg\]

The best estimate we can get for the standard deviation is

\[\label{eq:sampleSD} s = \sqrt{ \frac{\sum{(w_i - \bar{w})^2}}{n-1} }\]

Note that we divide by \(n-1\) here! This is not quite intuitive: since the real mean is unknown, the sample mean is chosen such that the sum of squared deviations is minimized. This leads to an underestimation of the variability, which is compensated by dividing by \(n-1\) instead of \(n\).

The standard deviation as defined above is sometimes called sample standard deviation (because it is our best guess based on the sampled data) and is typically labelled with \(s\). In contrast, the population standard deviation is commonly denoted with \(\sigma\).

The variance is the square of the standard deviation.
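
In Matlab, these descriptive values can be computed directly; a minimal sketch for the weight data above (note that std divides by \(n-1\) by default):

weight = [84, 86, 88, 61, 82];   % sample data [kg]
w_mean = mean(weight)            % sample mean (80.2 kg)
w_std  = std(weight)             % sample standard deviation s, divides by n-1
w_var  = var(weight)             % variance, the square of the standard deviation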

Confidence Intervals

If we know the mean and the shape of the distribution, we can determine confidence intervals, i.e. the weight range about the mean that contains a given percentage of all data (Fig. 11). For a confidence level of \((1-\alpha) \cdot 100\%\) (e.g. 95% for \(\alpha = 0.05\)):

\[\label{eq:CI} CI = mean \pm \sigma * z_{1-\alpha/2}\]

So if we want to know the 95% confidence interval, we need the z-value corresponding to \(1 - 0.05/2 = 0.975\). The factor \(\alpha/2\) is due to the fact that data fall outside the confidence interval both below and above it (Fig. 12).

_images/two_sided.jpg

Figure 12: If 95% of the data are within the confidence limits, 2.5% of the data lie above, and 2.5% below those limits.
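
As a sketch of the equation above, assuming a distribution with known mean and standard deviation (the example values \(\mu = 80\) kg and \(\sigma = 10\) kg are made up for illustration), the 95% confidence interval can be obtained with norminv:

mu    = 80;                    % example mean [kg]
sigma = 10;                    % example standard deviation [kg]
alpha = 0.05;                  % significance level for a 95% interval
zval  = norminv(1 - alpha/2);  % z-value, 1.96
CI    = mu + [-1, 1] * zval * sigma   % range containing 95% of the data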

Standard Error

The more subjects we measure, the more accurate our estimate of the mean value of the population will be. The remaining standard deviation of the estimated mean is often referred to as the standard error, and is given by

\[\label{eq:standard_error} se = \frac{\sigma}{\sqrt{n}}\]

The equation for the confidence intervals for the mean value is the same as in the equation above, with the standard deviation \(\sigma\) replaced by the standard error \(se\).

If you know the population mean and standard deviation, you use the z-distribution and \(\sigma\). If you have to estimate the mean and the standard deviation from the data, you have to use \(s\) instead of \(\sigma\), and the normal distribution has to be replaced by the so-called t-distribution. The t-distribution also takes into consideration the additional uncertainty induced by estimating these parameters from only \(n\) subjects:

\[CI_{mean} = mean \pm se * t_{n-1, 1-\alpha/2}\]

where \(t_{n-1, 1-\alpha/2}\) is the corresponding value of the t-distribution with \(n-1\) degrees of freedom, for a significance level of \(\alpha\). Note that with \(n\) subjects we have \(n-1\) degrees of freedom, because one parameter (the mean) has to be estimated.

The required t-value can be obtained with the Matlab command icdf (for inverse cumulative distribution function). In our case, where we have 5 subjects and \(\alpha=0.05\):

>> tval =  icdf('t', 0.975, 4)
 tval = 2.7764
>> tval =  icdf('t', 0.975, 20)
 tval = 2.0860
>> zval = norminv(0.975)
 zval = 1.9600

For 5 subjects we get a value of 2.77, and for 21 subjects a value of 2.09, which is already pretty close to the corresponding value from the normal distribution, 1.96.
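
Putting the pieces together for our weight sample, the 95% confidence interval for the mean can be sketched as follows (using icdf as above; with only 5 subjects, the interval is correspondingly wide):

weight = [84, 86, 88, 61, 82];        % sample data [kg]
n      = length(weight);
alpha  = 0.05;
se     = std(weight) / sqrt(n);       % standard error of the mean
tval   = icdf('t', 1-alpha/2, n-1);   % t-value for n-1 = 4 degrees of freedom (2.7764)
CI     = mean(weight) + [-1, 1] * tval * se   % 95% confidence interval for the mean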

Statistical Data Analysis

Null Hypothesis

Important: First of all, you have to state explicitly which hypothesis you want to test. This has to be done before you do the experiments/analysis, and should be formulated such that the hypothesis explicitly contains the value zero or null.

In the example above, the hypothesis would be: We hypothesize that the average weight of male Austrians minus 88.3 is equal to zero.

Check Your Assumptions

In the example presented above, we have used the assumption that our data are normally distributed. Such assumptions also have to be checked - at least visually! If these assumptions don’t hold, you have to use statistical tests that don’t rely on these assumptions.

Hypothesis Tests

The most common statistical tests, and the only ones we are discussing here, are tests that evaluate one or two groups. We distinguish between three cases:

  1. Comparison of one group to a fixed value (as in the example above).

  2. Comparison between two related groups (e.g. before-after experiments, Fig. 13).

  3. Comparison between two independent groups (see below).

These tests always return a p-value, which gives the probability of obtaining data at least as extreme as the observed data if the null hypothesis were true:

  • If \(p < 0.05\), we speak about a significant difference.
  • If \(p < 0.001\), we speak about a highly significant difference.
  • Otherwise we state that there is no significant difference.

One-Sample T-Test

The first two of these cases (comparison of a group to a fixed value, or related data sets) are tested with the one-sample or paired-sample t-test. (The t-test is sometimes also called Student's t-test, because William Gosset, who developed the test, published it under the pen name "Student".)

_images/ttest_2.jpg

Figure 13: If every value in the dataset Data before has a corresponding value in the dataset Data after, a paired t-test can detect smaller differences. Note that this is equivalent to testing if the difference between the first and the second test is significantly different from zero.

>> weight = [84, 86, 88, 61, 82];
>> [h,p] = ttest(weight-88.3)
h =
    0
p =
    0.1739

\(h=0\) tells us that the null hypothesis cannot be rejected at the 5% significance level, and \(p=0.1739\) tells us that data this extreme would occur with a probability of about 17% if the null hypothesis were true. This p-value is typically quoted in the Results part of your report.
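
For the second case (two related groups), ttest can also be called with two paired samples, which is equivalent to a one-sample t-test on the differences. A sketch with hypothetical before/after values (the numbers are made up for illustration):

data_before = [81, 75, 92, 83, 78, 88];   % hypothetical weights before [kg]
data_after  = [79, 74, 90, 83, 76, 85];   % hypothetical weights after [kg]
[h, p] = ttest(data_before, data_after)        % paired t-test
[h, p] = ttest(data_before - data_after)       % equivalent: test the differences against zero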

T-Test for Independent Samples

Two independent groups are compared with the t-test for independent samples.

_images/ttest_1.jpg

Figure 14: ttest2 compares data sets from two different, independent groups, and provides a quantitative estimate of whether the data sets from the two groups are different.

For example, let us compare the weight of 10 random American males with the weight of 10 random Austrian males:

>> weight_USA = [ 104, 86, 86, 105, 52, 80, 67, 82, 99, 102];
>> weight_A = [66, 74, 83, 77, 86, 87, 67, 78, 49, 98];
>> [h,p] = ttest2(weight_USA, weight_A)
h =
    0
p =
    0.1756

Again, the p-value is relatively large, and the result is not significant. In order to get a significant result for a relatively small difference, we need a larger sample size (since randn generates random numbers, the exact values below will vary from run to run):

>> weight_USA = round(randn(1,100)*17 + 88.3);
>> weight_A = round(randn(1,100)*15 + 81);
>> [h,p] = ttest2(weight_USA, weight_A)
h =
     1
p =
    0.0022

Hypothesis Tests with Confidence Intervals

Instead of p-values, it has become recommended practice to define the significance of a test based on the confidence intervals of a tested parameter. If the null hypothesis is that the measured value is zero, the hypothesis can be accepted if the 95%-confidence interval includes zero. Otherwise, the hypothesis has to be rejected.
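
In Matlab, ttest returns this confidence interval as its third output. A sketch for the weight example above; since the 95% confidence interval of the mean difference contains zero, the null hypothesis cannot be rejected:

weight = [84, 86, 88, 61, 82];
[h, p, ci] = ttest(weight - 88.3)   % ci: 95% confidence interval of the mean difference
% ci contains zero, consistent with h = 0 and p = 0.17 found above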

One-sided T-test vs. Two-sided T-test

In most cases that you encounter, you will use a two-sided t-test. However, if the dataset in one group can only be larger (or only be smaller) than the dataset from the other group, you use a one-sided t-test.

_images/oneTwo_Sided_Ttests.jpg

Figure 15: If we do NOT know whether the dataset from one group is larger or smaller than the dataset from the other, we have to use two-sided t-tests. But if we know in advance that the values from the second group can only be larger than those from the first group, a one-sided t-test can be used. Note that this choice can determine whether a measured difference is significant or not!
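
In recent Matlab versions, a one-sided test can be requested with the 'Tail' parameter of ttest and ttest2 (older versions use a positional tail argument instead). A sketch, assuming we knew in advance that the American weights could only be larger than the Austrian ones:

% Alternative hypothesis: the mean of weight_USA is larger than the mean of weight_A
[h, p] = ttest2(weight_USA, weight_A, 'Tail', 'right')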

Exercises: Statistical analysis of data

Exercise 1: Analyze data

Let us assume that the weights of 100 randomly selected 15-year-old children in Tirol and in Vienna are:

tirol  = 7 * randn(100, 1) + 60;
vienna = 10* randn(100, 1) + 55;

Then the Tyrolian children try out a banana-diet: for one week, they eat only bananas for breakfast, and otherwise their normal food. After one week these 100 children weigh

tirol_after = tirol - 1.5 + 1.0 * randn(size(tirol));

  • Calculate mean, median, standard deviation (SD), and standard error for the first 10 Tyrolian children.
  • Calculate mean, median, standard deviation (SD), and standard error for all Tyrolian children.

Exercise 2: Plot data

  • Plot mean +/- SD , and mean +/- SE for the first 10 Tyrolian children
  • Plot mean +/- SD , and mean +/- SE for all Tyrolian children

Exercise 3: Compare groups

  • Find out if the difference between the children in Tirol and the children in Vienna is significant.
  • Find out if the weight of the Tyrolian children after the banana-diet is less than the weight before the diet. Note that these data are from the same children as before the diet, in the same order.