.. image:: ../Images/title_ANOVA.png
    :height: 100 px

Relation Between Several Variables
==================================

When we have two groups, we can ask the question: "Are they different?" The
answer is provided by hypothesis tests: by a *t-test* if the data are normally
distributed, or by a *Mann-Whitney test* otherwise. If we want to go one step
further and predict the value of one variable from another, we have to use the
technique of *linear regression*.

So what happens when we have more than two groups? To answer the question "Are
they different?" for more than two groups, we have to use the *Analysis of
Variance (ANOVA)* test for data where the residuals are normally distributed.
If this condition is not fulfilled, the *Kruskal-Wallis test* has to be used.

What should we do if we have paired data? If we have matched pairs for two
groups, and the differences are not normally distributed, we can use the
*Wilcoxon signed rank sum test*. The rank test for more than two groups of
matched data is the *Friedman test*.

It may be worth mentioning that Thom Baguley suggested the following: where
one-way repeated measures ANOVA is not appropriate, rank transformation
followed by ANOVA will provide a more robust test with greater statistical
power than the Friedman test.

An example for the application of the Friedman test: ten professional piano
players are blindfolded, and are asked to judge the quality of three different
pianos. Each player rates each piano on a scale of 1 to 10 (1 being the lowest
possible grade, and 10 the highest possible grade). The null hypothesis is
that all three pianos rate equally. To test the null hypothesis, the Friedman
test is used on the ratings of the ten piano players.

When moving from two to many variables, the correlation coefficient gets
replaced by the *correlation matrix*. And if we want to predict the value of
one variable from *many* other variables, linear regression has to be replaced
by *multilinear regression*, sometimes also referred to as *multiple linear
regression*.

However, watch out for the pitfalls that loom when you work with many
variables! Take for example the following hypothetical case: you make a survey
about the activity and life circumstances of a large range of people, covering
all the numbers that you can get your hands on. In this survey you find out
that a) rich people spend more time playing golf than poor people, and b) rich
people tend to have fewer children than poor people. This leads to a strong
negative correlation between playing golf and having children, and you may be
tempted to (falsely) draw the conclusion that playing golf reduces your
fertility, while in reality it is the higher income which causes both effects.
Kaplan (2009) nicely describes where those problems come from, and how best to
avoid them.

Two-way ANOVA
-------------

.. index:: ANOVA-two-way

Compared to one-way ANOVAs (see :ref:`one-way ANOVAs`), the analysis with
two-way ANOVAs has a new element: we can check not only whether each of the
factors is significant; we can also check whether the *interaction* of the
factors has a significant influence on the distribution of the data. Sticking
with the example above, if only women with treatment B get healthy, we have a
significant interaction effect between "gender" and "treatment".
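Such an interaction term can be tested with *statsmodels*. The following is
only a minimal sketch with made-up data and column names ("health", "gender",
"treatment"), not the worked example from the accompanying scripts::

    import pandas as pd
    from statsmodels.formula.api import ols
    from statsmodels.stats.anova import anova_lm

    # Hypothetical data: three replicates for each combination of
    # "gender" and "treatment" (all names and values are made up)
    df = pd.DataFrame({
        'health':    [4, 5, 4, 6, 5, 4, 8, 9, 9, 5, 4, 5],
        'gender':    ['m', 'm', 'm', 'f', 'f', 'f',
                      'f', 'f', 'f', 'm', 'm', 'm'],
        'treatment': ['A', 'A', 'A', 'A', 'A', 'A',
                      'B', 'B', 'B', 'B', 'B', 'B']})

    # "C(gender):C(treatment)" is the interaction term; the first two terms
    # test the main effects of the individual factors
    model = ols('health ~ C(gender) + C(treatment) + C(gender):C(treatment)',
                data=df).fit()
    print(anova_lm(model))

The resulting ANOVA table has the same structure as the one shown in the
example below.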
Example: two-way ANOVA
~~~~~~~~~~~~~~~~~~~~~~

|ipynb| `90_anovaTwoway.ipynb `_

|python| `anovaTwoway.py `_

::

                             df  sum_sq  mean_sq        F     PR(>F)
    C(fetus)                  2  324.00   162.00  2113.10   1.05e-27
    C(observer)               3    1.19     0.39     5.21  6.497e-03
    C(fetus):C(observer)      6    0.56     0.09     1.22   3.29e-01
    Residual                 24    1.84     0.07      NaN        NaN

Three-way ANOVA
---------------

When you have more than two factors, it is advisable to use *statistical
modeling* for the data analysis (see the chapter on Statistical Models).
However, as always with the analysis of statistical data, you should first
inspect the data visually. *seaborn* makes this quite simple::

    import matplotlib.pyplot as plt
    import seaborn as sns

    sns.set(style="whitegrid")

    # Load one of seaborn's example datasets
    df = sns.load_dataset("exercise")

    # One panel per diet, one line per kind of exercise
    sns.factorplot("time", "pulse", hue="kind", col="diet", data=df,
                   hue_order=["rest", "walking", "running"],
                   palette="YlGnBu_d", aspect=.75).despine(left=True)
    plt.show()

.. figure:: ../Images/ANOVA_3way.png
    :scale: 33 %

    *Three-way ANOVA*

Correlation Matrix
------------------

An elegant way to visualize the correlation between a large number of
variables is the *correlation matrix* (see the figure below). Using *seaborn*,
it can be generated as follows:

.. figure:: ../Images/many_pairwise_correlations.png
    :scale: 66 %

    *Visualization of the correlation matrix.*

::

    import numpy as np
    import seaborn as sns
    import matplotlib.pyplot as plt

    sns.set(style="darkgrid")

    # Generate 30 variables with 100 normally distributed samples each
    rs = np.random.RandomState(33)
    d = rs.normal(size=(100, 30))

    f, ax = plt.subplots(figsize=(9, 9))
    cmap = sns.diverging_palette(220, 10, as_cmap=True)
    sns.corrplot(d, annot=False, sig_stars=False,
                 diag_names=False, cmap=cmap, ax=ax)
    f.tight_layout()
    plt.show()

Multilinear Regression
----------------------

.. index:: regression-multilinear

If you have truly independent variables, *multilinear regression* is a
straightforward extension of simple linear regression.

Multiple Regression
~~~~~~~~~~~~~~~~~~~

Example of *multiple regression* with covariates (i.e. independent variables)
:math:`w_i` and :math:`x_i`: suppose that the data consist of 7 observations,
and that for each observed value to be predicted (:math:`y_i`) the two
covariates :math:`w_i` and :math:`x_i` were also observed. The model to be
considered is

.. math:: y_i = \beta_0 + \beta_1 w_i + \beta_2 x_i + \epsilon_i

This model can be written in matrix terms as

.. math::

    \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \\ y_7 \end{bmatrix} =
    \begin{bmatrix} 1 & w_1 & x_1 \\ 1 & w_2 & x_2 \\ 1 & w_3 & x_3 \\
                    1 & w_4 & x_4 \\ 1 & w_5 & x_5 \\ 1 & w_6 & x_6 \\
                    1 & w_7 & x_7 \end{bmatrix}
    \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix} +
    \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \epsilon_4 \\
                    \epsilon_5 \\ \epsilon_6 \\ \epsilon_7 \end{bmatrix}

However, you have to watch out: if your variables may be related to each
other, you have to proceed much more carefully. For example, you may want to
investigate how the prevalence of some disease correlates with age and with
income: if you do so, you have to keep in mind that age and income are most
likely correlated! For details, Kaplan (2009) gives a good introduction to
that topic. Also, check out the chapter on Modeling.

|ipynb| `91_mult_regress.ipynb `_

|python| `mult_regress.py `_

.. |ipynb| image:: ../Images/IPython.jpg
    :scale: 50 %

.. |python| image:: ../Images/python.jpg
    :scale: 50 %
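To connect the matrix formulation above with Python, here is a minimal sketch
with synthetic, made-up numbers: the arrays *w*, *x*, and *y* correspond to
:math:`w_i`, :math:`x_i`, and :math:`y_i` in the equations above, and the
parameters :math:`\beta_0, \beta_1, \beta_2` are estimated with numpy's
least-squares solver::

    import numpy as np

    # Synthetic example data for the 7 observations in the equations above
    w = np.array([1., 2., 3., 4., 5., 6., 7.])
    x = np.array([2., 1., 4., 3., 6., 5., 8.])
    y = np.array([3.1, 4.9, 7.2, 8.8, 11.1, 12.9, 15.2])

    # Design matrix with columns [1, w_i, x_i], as in the matrix equation
    M = np.column_stack((np.ones(len(w)), w, x))

    # Least-squares estimate of (beta_0, beta_1, beta_2)
    beta, residuals, rank, sing_vals = np.linalg.lstsq(M, y, rcond=None)
    print(beta)

The same fit can also be obtained with *statsmodels*, which in addition
provides confidence intervals and p-values for the estimated parameters.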