# Introduction to Statistics

## Describing a Distribution

The average adult American male weighs 88.3 kg. This is a lot (actually, way too much). But how do Austrian males compare?

To answer this question, we first need an overview of the weight distribution of Austrian males. We take 5 males, record their weights, and get

$weight = [84, 86, 88, 61, 82]\, kg.$

Based on these data, we can now estimate the average weight, and how much the weight varies between different men. In doing so, we make use of the assumption that such data often have a bell-shaped distribution: the most likely value is the average value, and it is equally likely to find a low value or a high value. Mathematically, we can describe such data with the normal distribution, also called the Gaussian distribution (Fig. 11):

$\label{eq:normal} f_{\mu,\sigma} (x) = \frac{1}{\sigma \sqrt{2 \pi}} e^{-( x - \mu )^2 /2 \sigma^2}$

where $$-\infty < x < \infty$$. $$f_{\mu,\sigma}$$ is called the probability density function (PDF) of the normal distribution. $$\mu$$ is the mean value, and $$\sigma$$ is called the standard deviation, and characterizes the variability of the data.
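To make the formula concrete, the density can be evaluated directly. The tutorial's own examples use Matlab; the following is only a self-contained sketch in plain Python, and `normal_pdf` is a hypothetical helper name, not a library function:

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    """Gaussian probability density f_{mu,sigma}(x)."""
    return exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

# The density peaks at x = mu, and is symmetric about the mean:
peak = normal_pdf(0, 0, 1)          # 1/sqrt(2*pi), about 0.399
symmetric = normal_pdf(1, 0, 1) == normal_pdf(-1, 0, 1)
```

The symmetry check mirrors the statement above that low and high values are equally likely around the mean.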

From our data, the best estimate for the mean value for the weight of the average Austrian male is

$\label{eq:sampleMean} \bar{w} = \frac{\sum\limits_{i=1}^{5}{w_i}}{n} = 80.2 \, kg$

The best estimate we can get for the standard deviation is

$\label{eq:sampleSD} s = \sqrt{ \frac{\sum{(w_i - \bar{w})^2}}{n-1} }$

Note that we divide by $$n-1$$ here! This is not quite intuitive, and is caused by the fact that the real mean is unknown: the sample mean is, by construction, the value that minimizes the sum of the squared deviations. This leads to an underestimation of the standard deviation, which can be compensated for by dividing by $$n-1$$ instead of $$n$$.

The standard deviation as defined above is sometimes called sample standard deviation (because it is our best guess based on the sampled data) and is typically labelled with $$s$$. In contrast, the population standard deviation is commonly denoted with $$\sigma$$.
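The two estimates above can be checked with a few lines of code. The sketch below uses only the Python standard library (the tutorial itself works in Matlab); note that `statistics.stdev` already divides by $$n-1$$, i.e. it computes the sample standard deviation $$s$$:

```python
from statistics import mean, stdev   # stdev divides by n-1 (sample SD)

weight = [84, 86, 88, 61, 82]
w_bar = mean(weight)   # sample mean, 80.2
s = stdev(weight)      # sample standard deviation, about 10.96
```

If the population standard deviation $$\sigma$$ (division by $$n$$) were wanted instead, `statistics.pstdev` would be the corresponding function.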

The variance is the square of the standard deviation.

## Confidence Intervals

If we know the mean and the shape of the distribution, we can determine the $$\alpha$$-% confidence interval, i.e. the weight range about the mean that contains $$\alpha$$-% of all data (Fig. 11):

$\label{eq:CI} CI = mean \pm \sigma \cdot z_{1-\alpha/2}$

So if we want to know the 95% confidence interval, we have to take the z-value corresponding to $$1 - 0.05/2 = 0.975$$. The factor $$\alpha/2$$ is due to the fact that values outside the confidence interval can lie both below and above it (Fig. 12).
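The equation above can be written out for our weight data. A small Python sketch (illustrative only; the tutorial uses Matlab): since the true $$\sigma$$ is unknown here, the sample estimate $$s$$ is plugged in as a stand-in, which is an assumption of this sketch:

```python
from statistics import mean, stdev

weight = [84, 86, 88, 61, 82]
z_975 = 1.96                           # z-value for 1 - 0.05/2 = 0.975
w_bar, s = mean(weight), stdev(weight)

# 95% interval for the data, using s as a stand-in for sigma
ci = (w_bar - z_975 * s, w_bar + z_975 * s)   # about (58.7, 101.7) kg
```

This is the range expected to contain about 95% of individual weights, not the uncertainty of the mean; that distinction is the topic of the next section.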

## Standard Error

The more subjects we measure, the more accurate our estimate of the mean value of our population will be. The remaining standard deviation of the mean is often referred to as the standard error, and is given by

$\label{eq:standard_error} se = \frac{\sigma}{\sqrt{n}}$

The equation for the confidence intervals for the mean value is the same as in the equation above, with the standard deviation $$\sigma$$ replaced by the standard error $$se$$.
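For our 5 weight measurements, the standard error works out as follows (again a plain-Python sketch of the formula above, not part of the Matlab session):

```python
from math import sqrt
from statistics import stdev

weight = [84, 86, 88, 61, 82]
# standard error of the mean: s / sqrt(n)
se = stdev(weight) / sqrt(len(weight))   # about 4.90 kg
```

Note that the standard error shrinks with $$\sqrt{n}$$: quadrupling the number of subjects halves the uncertainty of the mean.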

If you know the population mean and standard deviation, you use the z-distribution and $$\sigma$$. If you have to estimate the mean and the standard deviation from the data, you have to use $$s$$ instead of $$\sigma$$, and the normal distribution has to be replaced by the so-called t-distribution. The t-distribution also takes into consideration the error induced by estimating the mean value from $$n$$ subjects:

$CI_{mean} = mean \pm se \cdot t_{n-1, 1-\alpha/2}$

where $$t_{n-1, 1-\alpha/2}$$ is the value of the t-distribution with $$n-1$$ degrees of freedom at the $$1-\alpha/2$$ quantile. Note that with $$n$$ subjects we have $$n-1$$ degrees of freedom, because one parameter (the mean) has to be estimated.

The required t-value can be obtained with the Matlab command $$icdf$$ (for inverse cumulative distribution function). In our case, with 5 subjects and $$\alpha=0.05$$:

>> tval =  icdf('t', 0.975, 4)
tval = 2.7764
>> tval =  icdf('t', 0.975, 20)
tval = 2.0860
>> zval = norminv(0.975)
zval = 1.9600


For 5 subjects we get a value of 2.77, and for 21 subjects a value of 2.09, which is already pretty close to the corresponding value from the normal distribution, 1.96.
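Putting the pieces together, the 95% confidence interval for the mean weight of our 5 subjects can be computed by hand. The sketch below is plain Python rather than Matlab, and takes the critical value $$t_{4, 0.975} = 2.7764$$ from the $$icdf$$ output above:

```python
from math import sqrt
from statistics import mean, stdev

weight = [84, 86, 88, 61, 82]
n = len(weight)
t_crit = 2.7764                      # t_{4, 0.975}, from icdf('t', 0.975, 4)
se = stdev(weight) / sqrt(n)         # standard error of the mean

# 95% confidence interval for the mean
ci = (mean(weight) - t_crit * se, mean(weight) + t_crit * se)
# about (66.6, 93.8) kg
```

The interval is wide because $$n = 5$$ is small: both the standard error and the t-value are large.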

# Statistical Data Analysis

## Null-Hypothesis

Important: First of all, you have to state explicitly which hypothesis you want to test. This has to be done before you do the experiments/analysis, and should be formulated such that the hypothesis explicitly contains the value zero or null.

In the example above, the hypothesis would be: We hypothesize that the average weight of male Austrians minus 88.3 is equal to zero.

## Check Your Assumptions

In the example presented above, we have used the assumption that our data are normally distributed. Such assumptions also have to be checked - at least visually! If these assumptions don’t hold, you have to use statistical tests that don’t rely on these assumptions.

## Hypothesis Tests

The most common statistical tests, and the only ones we are discussing here, are tests that evaluate one or two groups. We distinguish between three cases:

1. Comparison of one group to a fixed value (as in the example above).

2. Comparison between two related groups (e.g. before-after experiments).
3. Comparison between two independent groups (see below).

These tests always return a p-value: the probability of obtaining data at least as extreme as the observed data, under the assumption that the null hypothesis is true:

• If $$p < 0.05$$, we speak about a significant difference.
• If $$p < 0.001$$, we speak about a highly significant difference.
• Otherwise we state that there is no significant difference.

### One-Sample T-Test

The first two of these cases (comparison of a group to a fixed value, or related data-sets) are tested with the one-sample or paired-sample t-test. (The t-test is sometimes also called *Student's t-test*, because its developer, William Gosset, published under the pen name "Student".)

>> weight = [84, 86, 88, 61, 82];
>> [h,p] = ttest(weight-88.3)
h =
0
p =
0.1739


$$h=0$$ tells us that the null hypothesis cannot be rejected, and $$p=0.1739$$ tells us that data as extreme as ours would occur in about 17% of experiments even if the null hypothesis were true. This p-value is typically quoted in the Results part of your report.
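The numbers behind Matlab's $$ttest$$ can be reproduced by hand. The sketch below (plain Python, for illustration only) computes the t-statistic of the differences and compares it against the critical value $$t_{4, 0.975} = 2.7764$$ obtained earlier:

```python
from math import sqrt
from statistics import mean, stdev

weight = [84, 86, 88, 61, 82]
diff = [w - 88.3 for w in weight]            # test against the fixed value 88.3
n = len(diff)

# t-statistic: mean difference divided by its standard error
t = mean(diff) / (stdev(diff) / sqrt(n))     # about -1.65

# |t| < t_{4, 0.975} = 2.7764, so the difference is not significant (h = 0)
significant = abs(t) > 2.7764
```

Converting this t-statistic into the exact p-value of 0.1739 requires the cumulative t-distribution, which in Matlab is available via $$cdf$$.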

### T-Test for Independent Samples

Two independent groups are compared with the t-test for independent samples.

For example, let us compare the weight of 10 random American males with the weight of 10 random Austrian males:

>> weight_USA = [ 104, 86, 86, 105, 52, 80, 67, 82, 99, 102];
>> weight_A = [66, 74, 83, 77, 86, 87, 67, 78, 49, 98];
>> [h,p] = ttest2(weight_USA, weight_A)
h =
0
p =
0.1756


Again, the p-value is relatively large, and the result is not significant. In order to get a significant result for a relatively small difference, we need a larger sample size:

>> weight_USA = round(randn(1,100)*17 + 88.3);
>> weight_A = round(randn(1,100)*15 + 81);
>> [h,p] = ttest2(weight_USA, weight_A)
h =
1
p =
0.0022


### Hypothesis Tests with Confidence Intervals

Instead of p-values, it has become recommended practice to assess the significance of a test based on the confidence interval of the tested parameter. If the null hypothesis is that the measured value is zero, the null hypothesis cannot be rejected if the 95% confidence interval includes zero. Otherwise, the null hypothesis is rejected.
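Applied to our first example, the confidence-interval approach gives the same verdict as the p-value. The sketch below (plain Python, illustrative) builds the 95% confidence interval of the mean difference from 88.3 kg, using $$t_{4, 0.975} = 2.7764$$ from earlier:

```python
from math import sqrt
from statistics import mean, stdev

weight = [84, 86, 88, 61, 82]
diff = [w - 88.3 for w in weight]
n = len(diff)
t_crit = 2.7764                               # t_{4, 0.975}

se = stdev(diff) / sqrt(n)
ci = (mean(diff) - t_crit * se, mean(diff) + t_crit * se)
# about (-21.7, 5.5) kg

# zero lies inside the 95% CI -> the null hypothesis cannot be rejected
contains_zero = ci[0] < 0 < ci[1]
```

This agrees with the earlier $$h=0$$, $$p=0.1739$$ result: a p-value above 0.05 and a 95% confidence interval containing zero are two views of the same decision.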

### One-sided T-test vs. Two-sided T-test

In most cases that you encounter, you will use a two-sided t-test. However, if the dataset in one group can only be larger (or only be smaller) than the dataset from the other group, you use a one-sided t-test.

# Exercises: Statistical analysis of data

## Exercise 1: Analyze data

Let us assume that the weights of 100 random 15-year-old children in Tirol and in Vienna are:

tirol  = 7 * randn(100, 1) + 60;
vienna = 10* randn(100, 1) + 55;


Then the Tyrolian children try out a banana-diet: for one week, they eat only bananas for breakfast, and otherwise their normal food. After one week these 100 children weigh

tirol_after = tirol - 1.5 + 1.0 * randn(length(tirol), 1);

• Calculate mean, median, standard deviation (SD), and standard error for the first 10 Tyrolian children.
• Calculate mean, median, standard deviation (SD), and standard error for all Tyrolian children.

## Exercise 2: Plot data

• Plot mean +/- SD , and mean +/- SE for the first 10 Tyrolian children
• Plot mean +/- SD , and mean +/- SE for all Tyrolian children

## Exercise 3: Compare groups

• Find out if the difference between the children in Tirol and the children in Vienna is significant.
• Find out if the weight of the Tyrolian children after the banana-diet is less than the weight before the diet. Note that these data are from the same children as before the diet, in the same order.