.. image:: ../Images/title_relations.png
    :height: 100 px

.. Relation Between Two Continuous Variables
.. =========================================

So far we have been dealing with problems where only one variable is measured.
Expressions or functions which only depend on one variable are sometimes
called *univariate*. If more than one variable is involved, we are dealing
with *multivariate* problems. In the simplest case, two variables are
involved, and we need a *bivariate* data analysis.

For two related variables, the *correlation* measures the association between
them. In contrast, a *linear regression* is used for the prediction of the
value of one variable from another.

For the correlation between many variables, you should look into "correlation
tables" (nicely implemented in *seaborn*). And an extension of linear
regression to more than two variables brings you into the realm of statistical
modeling.

Correlation
-----------

.. index:: correlation

Correlation Coefficient
~~~~~~~~~~~~~~~~~~~~~~~

.. index:: correlation coefficient

The *correlation coefficient* between two variables answers the question: "Are
the two variables related? I.e., if one variable changes, does the other also
change?" If the two variables are normally distributed, the standard measure
of determining the *correlation coefficient*, often ascribed to *Pearson*, is

.. math::

   \label{eq:pearson}
   r = \frac{\sum\limits_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum\limits_{i=1}^n (X_i - \bar{X})^2} \sqrt{\sum\limits_{i=1}^n (Y_i - \bar{Y})^2}}

With

.. math:: s_{xy} = \frac{\sum\limits_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{n-1}

and :math:`s_x, s_y` the sample standard deviations of the *x* and *y* values,
respectively, this can also be written as

.. math:: r = \frac{s_{xy}}{s_x \cdot s_y}.

Pearson's correlation coefficient, sometimes also referred to as *population
correlation coefficient* or *sample correlation*, can take any value from -1
to +1. Examples are given in the Figure below. Note that the formula for the
correlation coefficient is symmetrical between *x* and *y*.
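If the two data vectors are available as NumPy arrays, Pearson's :math:`r` can
be computed e.g. with *scipy.stats*. The following is only a minimal sketch,
with made-up data:

::

    import numpy as np
    from scipy import stats

    # Made-up, approximately linearly related data
    np.random.seed(0)
    x = np.arange(20.)
    y = 0.5*x + np.random.randn(20)

    # Pearson's correlation coefficient, with the corresponding p-value
    r, p = stats.pearsonr(x, y)

    # Equivalent: off-diagonal element of the correlation matrix
    r_alt = np.corrcoef(x, y)[0, 1]

    print('r = {0:.3f}, p = {1:.3g}'.format(r, p))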
Coefficient of determination
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. index:: coefficient of determination

In order to interpret :math:`r`, let me first define a few common terms.

Residuals
    Differences between observed values and predicted values.

| |image28|

*Best-fit linear regression line (red) and residuals (black).*

A data set has values :math:`y_i`, each of which has an associated modelled
value :math:`f_i` (also sometimes referred to as :math:`\hat{y}_i`). Here, the
values :math:`y_i` are called the *observed values*, and the modelled values
:math:`f_i` are sometimes called the *predicted values*.

In the following, :math:`\bar{y}` is the mean of the observed data:

.. math:: \bar{y}=\frac{1}{n}\sum_{i=1}^n y_i

where :math:`n` is the number of observations.

The "variability" of the data set is measured through different sums of
squares:

- :math:`SS_\text{tot}=\sum_i (y_i-\bar{y})^2`, the total sum of squares
  (proportional to the sample variance);

- :math:`SS_\text{mod}=\sum_i (f_i -\bar{y})^2`, the sum of squares of the
  model values, also called the explained sum of squares;

- :math:`SS_\text{res}=\sum_i (y_i - f_i)^2`, the sum of squares of residuals,
  also called the residual sum of squares.

The notations :math:`SS_{R}` and :math:`SS_{E}` should be avoided, since in
some texts their meaning is reversed to "Residual sum of squares" and
"Explained sum of squares", respectively.

.. image:: ../Images/Coefficient_of_Determination.png
    :scale: 50 %

*The better the linear regression (on the right) fits the data in comparison
to the simple average (on the left graph), the closer the value of* :math:`R^2`
*is to one. The areas of the blue squares represent the squared residuals with
respect to the linear regression. The areas of the red squares represent the
squared residuals with respect to the average value (from Wikipedia).*

With these expressions, the most general definition of the *coefficient of
determination*, :math:`R^2`, is

.. math::

   \label{eq:R2}
   R^2 \equiv 1 - {SS_{\rm res}\over SS_{\rm tot}}

Since

.. math:: SS_\text{tot} = SS_\text{mod} + SS_\text{res}

the equation above is equivalent to

.. math:: R^2 = \frac{SS_\text{mod}}{SS_\text{tot}}

For simple linear regression (i.e. line-fits), the *coefficient of
determination* :math:`R^2` is the square of the correlation coefficient
:math:`r`. It is easier to interpret than the correlation coefficient: values
of :math:`R^2` close to 1 are good, values close to 0 are poor. Note that for
general models it is common to write :math:`R^2`, whereas for simple linear
regression :math:`r^2` is used.

**Relation to unexplained variance**

In a general form, :math:`R^2` is related to the unexplained variance, since
the second term in its definition compares the unexplained variance (the
variance of the model's errors) with the total variance (of the data).

**Examples**

How large :math:`R^2` or :math:`\bar{R}^2` must be to be considered good
depends on the discipline. Values are usually expected to be larger in the
physical sciences than in biology or the social sciences. In finance or
marketing, it also depends on what is being modeled.

Caution: the sample correlation and :math:`R^2` are misleading if there is a
nonlinear relationship between the independent and dependent variables!

| |image26|

*Several sets of (x, y) points, with the correlation coefficient of x and y
for each set. Note that the correlation reflects the noisiness and direction
of a linear relationship (top row), but not the slope of that relationship
(middle), nor many aspects of nonlinear relationships (bottom). N.B.: the
figure in the center has a slope of 0, but in that case the correlation
coefficient is undefined, because the variance of Y is zero. (From: Wikipedia)*

Rank correlation
~~~~~~~~~~~~~~~~

.. index:: rank correlation

If the data distribution is not normal, a different approach is necessary. In
that case one can rank the set of subjects for each variable and compare the
orderings. There are two commonly used methods of calculating the rank
correlation:

- *Spearman's* :math:`\rho`, which is exactly the same as the Pearson
  correlation coefficient :math:`r` calculated on the ranks of the
  observations.

- *Kendall's* :math:`\tau`, which is also a rank correlation coefficient,
  measuring the association between two measured quantities. It is harder to
  calculate than Spearman's :math:`\rho`, but it has been argued that
  confidence intervals for Spearman's :math:`\rho` are less reliable and less
  interpretable than confidence intervals for Kendall's :math:`\tau`.
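Both rank correlation coefficients are implemented in *scipy.stats*. Again
only a minimal sketch, with made-up data:

::

    import numpy as np
    from scipy import stats

    # Made-up, monotonically (but not linearly) related data
    np.random.seed(0)
    x = np.arange(30.)
    y = x**2 + 5*np.random.randn(30)

    rho, p_rho = stats.spearmanr(x, y)    # Spearman's rho, with p-value
    tau, p_tau = stats.kendalltau(x, y)   # Kendall's tau, with p-value

    print('rho = {0:.3f}, tau = {1:.3f}'.format(rho, tau))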
Regression
----------

.. index:: regression

General linear regression model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. index:: regression-linear

We can use the method of *regression* when we want to predict the value of one
variable from the other.

| |image27|

*Linear regression. (From Wikipedia)*

When we search for the best-fit line to a given :math:`(x_i,y_i)` dataset, we
are looking for the parameters :math:`(k,d)` which minimize the sum of the
squared *residuals* :math:`\epsilon_i` in

.. math::

   \label{eq:simpleRegression}
   y_i = k * x_i + d + \epsilon_i

where :math:`k` is the *slope* or *inclination* of the line, and :math:`d` the
*intercept*. This is in fact just the one-dimensional example of the more
general technique, which is described in the next section. Note that, in
contrast to the correlation, this relationship between :math:`x` and :math:`y`
is no longer symmetrical: it is assumed that the :math:`x-`\ values are known
exactly, and that all the variability lies in the residuals.

Simple Regression
~~~~~~~~~~~~~~~~~

Example of *simple linear regression* with 7 observations. Suppose there are 7
data points :math:`\left\{ {{y_i},{x_i}} \right\}`, where :math:`i=1,2,…,7`.
The simple linear regression model is

.. math:: y_i = \beta_0 + \beta_1 x_i +\epsilon_i ,

where :math:`\beta_0` is the y-intercept and :math:`\beta_1` is the slope of
the regression line. This model can be represented in matrix form as

.. math::

   \begin{bmatrix}y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \\ y_7 \end{bmatrix}
   =
   \begin{bmatrix}1 & x_1 \\1 & x_2 \\1 & x_3 \\1 & x_4 \\1 & x_5 \\1 & x_6 \\ 1 & x_7 \end{bmatrix}
   \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}
   +
   \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \epsilon_4 \\ \epsilon_5 \\ \epsilon_6 \\ \epsilon_7 \end{bmatrix}

where the first column of ones in the design matrix represents the y-intercept
term, while the second column contains the x-values associated with the
y-values.

Design Matrix
~~~~~~~~~~~~~

.. index:: design matrix

Quadratic Fit
^^^^^^^^^^^^^

The equation for a quadratic fit to the given data is

.. math:: y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 +\epsilon_i ,

This can be rewritten in matrix form:

.. math::

   \begin{bmatrix}y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \\ y_7 \end{bmatrix}
   =
   \begin{bmatrix}1 & x_1 & x_1^2 \\1 & x_2 & x_2^2 \\1 & x_3 & x_3^2 \\1 & x_4 & x_4^2 \\1 & x_5 & x_5^2 \\1 & x_6 & x_6^2 \\ 1 & x_7 & x_7^2 \end{bmatrix}
   \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix}
   +
   \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \epsilon_4 \\ \epsilon_5 \\ \epsilon_6 \\ \epsilon_7 \end{bmatrix}

General Formulation
^^^^^^^^^^^^^^^^^^^

In general, this can be rewritten in matrix form as

.. math:: y = X\beta + \epsilon

where the matrix :math:`X` is the *design matrix*. :math:`y` is a vector of
dimension :math:`(n \times 1)` and is called the endogenous variable,
:math:`X` is a matrix of dimension :math:`(n \times k)` where each column is
an explanatory variable, and :math:`\epsilon` is the error term. :math:`\beta`
is a vector of dimension :math:`(k \times 1)` and contains the parameters we
want to estimate.
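To make the design-matrix formulation concrete, the quadratic model above can
be fitted with ordinary least squares in *numpy*. This is only a sketch, with
made-up data; the more convenient *statsmodels* approach is shown in the next
section:

::

    import numpy as np

    # Made-up data, approximately quadratic in x
    x = np.array([0., 1., 2., 3., 4., 5., 6.])
    y = np.array([1.1, 2.9, 7.2, 12.8, 21.1, 30.9, 43.2])

    # Design matrix with the columns [1, x, x^2]
    Xmat = np.column_stack((np.ones_like(x), x, x**2))

    # Least-squares estimate of beta = [beta_0, beta_1, beta_2]
    beta = np.linalg.lstsq(Xmat, y)[0]
    print(beta)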
Coding
~~~~~~

If you have vectors *x, y* containing your data, you can use *statsmodels* to
create a design matrix that also includes the *1's* for the offset:

::

    import statsmodels.api as sm

    Xmat = sm.add_constant(x)

The parameters are then easily found as

::

    # np.linalg.lstsq returns (solution, residuals, rank, singular values)
    params = np.linalg.lstsq(Xmat, y)[0]

However, you get a lot more information if you use the OLS-fit from
*statsmodels*:

::

    import numpy as np
    import statsmodels.api as sm

    # Generate artificial data
    nobs = 100
    X = np.random.random(nobs)
    X = sm.add_constant(X)
    beta = [5, 3.5]
    e = np.random.random(nobs)
    y = np.dot(X, beta) + e

    # Fit regression model
    results = sm.OLS(y, X).fit()

    # Inspect the results
    print(results.summary())

This yields the following results:

::

                                OLS Regression Results
    ==============================================================================
    Dep. Variable:                      y   R-squared:                       0.923
    Model:                            OLS   Adj. R-squared:                  0.922
    Method:                 Least Squares   F-statistic:                     1173.
    Date:                Fri, 04 Jul 2014   Prob (F-statistic):           2.45e-56
    Time:                        14:49:08   Log-Likelihood:                -15.390
    No. Observations:                 100   AIC:                             34.78
    Df Residuals:                      98   BIC:                             39.99
    Df Model:                           1
    ==============================================================================
                     coef    std err          t      P>|t|      [95.0% Conf. Int.]
    ------------------------------------------------------------------------------
    const          5.4410      0.059     92.685      0.000         5.324     5.557
    x1             3.5718      0.104     34.250      0.000         3.365     3.779
    ==============================================================================
    Omnibus:                       21.620   Durbin-Watson:                   2.302
    Prob(Omnibus):                  0.000   Jarque-Bera (JB):                5.798
    Skew:                           0.223   Prob(JB):                       0.0551
    Kurtosis:                       1.908   Cond. No.                         4.60
    ==============================================================================

The meaning of many of these parameters is described in the chapter on
"Statistical Models".

From the *results*, you can extract e.g. the model parameters, standard
errors, confidence intervals, and residuals:

::

    params = results.params
    std_err = results.bse
    ConfInt = results.conf_int()
    residuals = results.resid

Assumptions
~~~~~~~~~~~

To use the technique of linear regression, the following assumptions should be
fulfilled:

1. The *independent variables* (i.e. *x*) are exactly known.

2. Validity. Most importantly, the data you are analyzing should map to the
   research question you are trying to answer. This sounds obvious but is
   often overlooked or ignored, because it can be inconvenient. For example, a
   linear regression does not properly describe a quadratic curve.

3. Additivity and linearity. The most important mathematical assumption of the
   regression model is that its deterministic component is a linear function
   of the separate predictors.

4. Independence of errors.

5. Equal variance of errors.

6. Normality of errors.

|ipynb| `80_multivariate.ipynb `_

|python| `multivariate.py `_

| |image29|

*The sets in Anscombe's quartet have the same linear regression line, but are
themselves very different.*

| |image30|

*Regression, with confidence intervals for the mean, as well as for the
predicted data. The red dotted line shows the confidence interval for the
mean, and the green dotted line the confidence interval for predicted data.
(This can be compared to the standard error and the standard deviation for a
population.)*

Since to my knowledge there exists no program in the Python standard library
(or numpy, scipy) to calculate the confidence intervals for a regression line,
I include my corresponding program *fitLine.py*. The output of this program is
shown in the figure below. This program also shows how Python programs
intended for distribution should be documented.
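The confidence intervals for the mean and for the predicted data can also be
obtained directly from a fitted OLS model. The following sketch assumes a
reasonably recent version of *statsmodels* (which provides *get_prediction*),
and re-uses artificial data like in the example above:

::

    import numpy as np
    import statsmodels.api as sm

    # Artificial data and fit, as in the OLS example above
    np.random.seed(0)
    x = np.random.random(100)
    X = sm.add_constant(x)
    y = np.dot(X, [5, 3.5]) + np.random.random(100)
    results = sm.OLS(y, X).fit()

    # Evaluate the fit on a regular grid of x-values
    x_new = np.linspace(0, 1, 50)
    X_new = sm.add_constant(x_new)

    # "mean_ci_..." columns: confidence interval for the mean
    # "obs_ci_..." columns:  confidence interval for the predicted data
    pred = results.get_prediction(X_new).summary_frame(alpha=0.05)
    print(pred[['mean', 'mean_ci_lower', 'mean_ci_upper',
                'obs_ci_lower', 'obs_ci_upper']].head())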
Exercises
---------

#. **Correlation**

   Read in the data for the average yearly temperature at the Sonnblick, from
   *https://github.com/thomas-haslwanter/statsintro/blob/master/Data/data\_others/AvgTemp.xls*

   Calculate the Pearson and Spearman correlation, and Kendall's tau, for the
   temperature vs. year.

#. **Regression**

   For the same data, calculate the yearly increase in temperature, assuming a
   linear increase with time. Is this increase significant?

#. **Normality Check**

   For the data from the regression model, check if the model is okay by
   testing whether the residuals are normally distributed (e.g. by using the
   Kolmogorov-Smirnov test).

|python| `fitLine.py `_

.. [4] This section has been taken from Wikipedia

.. |image24| image:: ../Images/Sensitivity_Specificity.png
    :scale: 50 %

.. |image25| image:: ../Images/Sensitivity_Specificity_Example.jpg
    :scale: 50 %

.. |image26| image:: ../Images/Correlation_examples2.png
    :scale: 50 %

.. |image27| image:: ../Images/Linear_regression.png
    :scale: 50 %

.. |image28| image:: ../Images/residuals_linreg.png
    :scale: 50 %

.. |image29| image:: ../Images/Anscombes_quartet.png
    :scale: 50 %

.. |image30| image:: ../Images/regression_wLegend.png
    :scale: 50 %

.. |ipynb| image:: ../Images/IPython.jpg
    :scale: 50 %

.. |python| image:: ../Images/python.jpg
    :scale: 50 %