When we record data, we often obtain data that can be approximated nicely by a straight line. If this is the case, it tremendously simplifies our description of the dataset: we do not need to know every simgle datapoint any more, but only the two coefficients defining the best-fit line to the data points.

To obtain the best-fit line to noisy data, we want to find the model coefficients “k” and “d” that define our data-model “y=k*x + d”.

The following chapter describes how to obtain the model coefficients for simple linear models, how accurately we know these coefficients, and how this concept can be used to describe much more complex patterns, such as polynomials, circles, etc.

# Fully determined linear model¶

A linear model describes a linear relationship between a dependent variable and a number of independent variables. In the simplest case, there is one dependent variable $$y$$ and one independent variable $$x$$. We can write the linear model as

$\label{eq:line} y = mx+c$

where $$m$$ and $$c$$ are the two free parameters. This is a line in a plane, with $$m$$ the slope, and $$c$$ the y-intercept. If we have two different data points relating the dependent and independent variables, then these two free parameters are fully determined. For example, if $$P_1=(x_1 / y_1)$$ and $$P_2=(x_2 / y_2)$$, then we can calculate that

$m = \frac{{\Delta y}}{{\Delta x}} = \frac{{{y_2} - {y_1}}}{{{x_2} - {x_1}}}\,and\,c = \frac{{{y_1}{x_2} - {x_1}{y_2}}}{{{x_2} - {x_1}}}$

In the general case, there is one dependent variable $$y$$ and $$k$$ independent variables $$x_i$$. The linear model then becomes

$y = c + \sum\limits_{d = 1}^k {{m_d}{x_d}}$

where $$c, m_1, ..., m_k$$ are the $$k+1$$ free parameters. If we have $$k+1$$ different data points that lie exactly on an k-dimensional hyperplane, then the free parameters are fully determined.

## Reminder: Lines in a plane¶

A line can either be represented by its slope and y-intercept, as shown above. Alternatively, it can be characterised by a point on the line, $$\vec{x_1}$$ , and the line’s normal vector, $$\vec{n}$$ :

$\vec n \cdot \vec x = \vec n \cdot {\vec x_1}$

or

$\begin{split}ax + by = c,\,with\,\vec n = \left( {\begin{array}{{20}{c}} a\\ b \end{array}} \right)\end{split}$

# Overdetermined linear model¶

Mostly, we have more data points than free parameters. In this case, the model is overdetermined by the data, so that it is not possible to fit the model exactly to all of the data points. Instead, we try to find a best fit model, which is the model with the minimum fitting error. Often, the fitting error is taken to be the least squares estimator.

## Residual¶

The difference between a data value and the corresponding model value is referred to as a residual. If we have a data point where the dependent variable has the value $$y_i$$ and the independent variables have value $$x_{id}$$, then the residual $$e_i$$ for the general linear model is

${e_i} = {y_i} - \left( {c + \sum\limits_{d = 1}^k {{m_d}{x_{id}}} } \right)$

## Least squares estimators¶

If there are $$n$$ data points, the sum of the squared residuals is

$E({x_{id}},{y_i},c,{m_d}) = \sum\limits_{i = 1}^n {e_i^2} = \sum\limits_{i = 1}^n {{{\left( {{y_i} - \left( {c + \sum\limits_{d = 1}^k {{m_d}{x_{id}}} } \right)} \right)}^2}}$

The least squares estimators are the values of $$c$$ and $$m_1, ..., m_k$$ that minimise E. To determine the value of the least squares estimators $$\hat c$$ and $${\hat m_1},...,{\hat m_k}$$ , it is necessary to locate the minimum of E by finding where the following partial derivatives are zero:

$\begin{split}\begin{array}{l} 0 = {\left. {\frac{{\partial E}}{{\partial c}}} \right|_{\hat c,{{\hat m}_d}}} = - 2\sum\limits_{i = 1}^n {\left( {{y_i} - \left( {\hat c + \sum\limits_{d = 1}^k {{{\hat m}_d}{x_{id}}} } \right)} \right)} \\ 0 = {\left. {\frac{{\partial E}}{{\partial {m_p}}}} \right|_{\hat c,{{\hat m}_d}}} = - 2\sum\limits_{i = 1}^n {\left( {{y_i} - \left( {\hat c + \sum\limits_{d = 1}^k {{{\hat m}_d}{x_{id}}} } \right)} \right)} {x_{ip}}\,\,\,for\,all\,p \end{array}\end{split}$

This linear system of equations can then be solved to find the values of the least squares estimators.

## Reminder: Ordinary Least Squares¶

The method of ordinary least squares can be used to find an approximate solution to overdetermined systems. For the system $$\mathbf{A} \cdot \vec{p} = \vec{y}$$, the least squares formula is obtained from the problem

$\min_{p}\Vert \mathbf{A} \cdot \vec{p} - \vec{y} \Vert,$

the solution of which can be written with the normal equations,

$\vec{p} =(\mathbf{A}^{\mathrm{T}}\mathbf{A})^{-1}\mathbf{A}^{\mathrm{T}} \cdot \vec{y},$

where $$\mathrm{T}$$ indicates a matrix transpose, provided $$(\mathbf{A}^{\mathrm{T}}\mathbf{A})^{-1}$$ exists (that is, provided $$A$$ has full column rank). With this formula an approximate solution is found when no exact solution exists, and it gives an exact solution when one does exist.

# Finding least squares estimators with MATLAB¶

## Linear model with no intercept term¶

Let us start with the case where there is no intercept term (i.e., $$c$$ is zero). Our model is

$y \approx \sum\limits_{d = 1}^k {{m_d}{x_d}}$

X is an $$n * k$$ matrix made of the column vectors $$x_1, ... , x_k$$, $$m$$ is a $$k * 1$$ column vector containing the free parameters $$m_1, ... , m_k$$, and $$y$$ is an $$n * 1$$ column vector, then we can rewrite the model as

$X \cdot m \approx y$

The least squares estimator for $$m$$ can be obtained easily in Matlab by reshaping the equation:

m_estimator = X\y;


Note that the “\” is a backslash and not a normal divide. Alternatively, one can use the command regress:

m_estimator = regress(y, X);


## Linear model with intercept¶

Now let us address the general linear model with intercept term:

$y \approx c * \sum\limits_{d = 1}^k {{m_d}{x_d}}$

We rewrite this in matrix form in the following way:

$\label{eq:regressMatrix} \vec y \approx {\bf{M}} \cdot \vec p$

where

$\begin{split}\label{eq:regressionFit} {\bf{M}} = \left[ {\begin{array}{{20}{c}} {{x_{11}}}&{{x_{12}}}& \cdots &{{x_{1k}}}&1\\ {{x_{21}}}&{{x_{22}}}& \cdots &{{x_{2k}}}&1\\ \vdots & \vdots &{}& \vdots & \vdots \\ {{x_{n1}}}&{{x_{n2}}}& \cdots &{{x_{nk}}}&1 \end{array}} \right],\,\,p = \left[ {\begin{array}{{20}{c}} {{m_1}}\\ {{m_2}}\\ \vdots \\ {{m_k}}\\ c \end{array}} \right],\,\,y = \left[ {\begin{array}{{20}{c}} {{y_1}}\\ {{y_2}}\\ \vdots \\ {{y_n}} \end{array}} \right]\end{split}$

In Matlab, we can solve this in the same way as before, namely The matrix M is sometimes called Design Matrix. In Matlab, we can solve this in the same way as before:

M = [X ones(size(y))];
p_estimator = M\y;


Again, it is possible to use the function regress:

p_estimator = regress(y, M);


## Example 1: Fitting a line¶

For a line-fit, the equation for general linear models is reduced to

$\begin{split}\left[ {\begin{array}{{20}{c}} {{y_1}}\\ {{y_2}}\\ {{y_3}}\\ {...} \end{array}} \right] = \left[ {\begin{array}{{20}{c}} {{x_1}}&1\\ {{x_2}}&1\\ {{x_3}}&1\\ {...}&{...} \end{array}} \right] \cdot \left[ {\begin{array}{{20}{c}} k\\ d \end{array}} \right]\end{split}$
% Generate a noisy line, with an slope of 0.5
% and a y-intercept of -30
x = 1:100;
y = -30 + 0.5*(x + 5*randn(size(x)));

% Plot the line
plot(x,y);
line(xlim, [0 0], 'LineStyle', '--');

% Calculate the best-fit line
M = [x' ones(length(x),1)];
p_estimator = M\y';     % or "p_estimator = regress(y', M);"

slope = p_estimator(1)    % "m"
yintercept = p_estimator(2)     % "c"


An alternative way to fit a line would be to see the line as a $$1^{st}$$ order polynomial, and use the Matlab function polyfit:

p = polyfit(x,y,1);
y_fit = polyval(p, x);


## Example 2: Fitting a polynomial¶

We can use the same approach to fit a polynomial curve to the data. For example, for a quadratic relationship between x and y we have

$y = a*x^2 + b*x + c$

Written in matrix form, this gives

$\begin{split}\left[ {\begin{array}{{20}{c}} y_1\\ y_2\\ y_3\\ {...} \end{array}} \right] = \left[ {\begin{array}{{20}{c}} x_1^2 & x_1 & 1\\ x_2^2 & x_2 & 1\\ x_3^2 & x_3 & 1\\ {...} & {...} & {...} \end{array}} \right] \cdot \left[ {\begin{array}{{20}{c}} a\\ b\\ c \end{array}} \right]\end{split}$

With this we have brought the problem into the form $$\vec{y} \approx \bf{M} \cdot \vec{p}$$, and we can solve it in the same way as above.

## Example 3: Fitting a sine-wave¶

If we have a sinusoidal oscillation, where we know the frequency $$\omega$$, and where we want to know the amplitude, the phase delay, and the offset, our function can be written as

$x = offset \cdot 1 + amplitude \cdot \sin (\omega t + \delta )$

Note that here offset and amplitude appear as linear parameters - but the phase, $$\delta$$, does not. But this can be solved by expressing $$sin(\omega*t+\delta)$$ by the sum of a sine- and a cosine wave:

$\label{eq:sinFit} x = offset \cdot 1 + a \cdot \sin (\omega t) + b \cdot \cos (\omega t)$

From these parameters, we can find the amplitude and the phase:

$\begin{split}\begin{array}{l} amplitude = \sqrt {{a^2} + {b^2}} \\ \delta = {\tan ^{ - 1}}\left( {\frac{b}{a}} \right) \end{array}\end{split}$

Now all the parameters that we want to know appear in a linear (!!) relationship, and we can put together our matrix M as

$\begin{split}{\bf{M}} = \left[ {\begin{array}{{20}{c}} 1&{\sin (\omega \cdot {t_1})}&{\cos (\omega \cdot {t_1})}\\ 1&{\sin (\omega \cdot {t_2})}&{\cos (\omega \cdot {t_2})}\\ 1&{\sin (\omega \cdot {t_3})}&{\cos (\omega \cdot {t_3})}\\ {...}&{...}&{...} \end{array}} \right]\end{split}$

and $$\vec{p}$$, which is here given by $$\vec p = \left( {\begin{array}{*{20}{c}} {offset}\\ a\\ b \end{array}} \right)$$ can again be obtained from the solution of the linear model.

% Time
t = 0:0.1:8*pi;

% Set the parameters
freq = 0.5;
offset = 3;
delta = 45*pi/180;
amplitude = 2;

omega = 2*pi*freq;

% Simulate and plot noisy sine-wave:
y = offset + amplitude * sin(omega*t + delta) + randn(size(t));
plot(t,y)

% Fit the data
M = [ones(length(t),1) sin(omega*t') cos(omega*t')];
p = M\y';

% Extract the coefficients from the fit
found.offset = p(1);
found.amp = sqrt(p(2)^2 + p(3)^2);
found.delta = atan2(p(3), p(2))*180/pi;

% Superpose the original data with the fit
hold('on');
found.y = found.offset + found.amp*sin(omega*t + found.delta*pi/180)
plot(t, found.y, 'r')
legend('original', 'fit');
xlabel('Time [s]');
ylabel('Data');
shg


Sine-Fit

## Example 3: Fitting a circle¶

The same concept can surprisingly even be extended to find the best-fit circles to data. To do so, we rearrange the general equation for a circle:

$\begin{split}\begin{array}{c} {(x - {x_0})^2} + {(y - {y_0})^2} = {r^2}\\ {x^2} - 2x{x_0} + x_0^2 + {y^2} - 2y{y_0} + y_0^2 = {r^2}\\ 2x \cdot {x_0} + 2y \cdot {y_0} + 1 \cdot ({r^2} - x_0^2 - y_0^2) = {x^2} + {y^2} \end{array}\end{split}$

This gives us $${x^2} + {y^2} = {\bf{M}} \cdot \vec p$$ , with

$\begin{split}{\bf{M}} = \left[ {\begin{array}{{20}{c}} {2{x_1}}&{2{y_1}}&1\\ {2{x_2}}&{2{y_2}}&1\\ {...}&{...}&{...} \end{array}} \right]\end{split}$

Again, this is just a linear fit! From this we get

$\begin{split}\begin{array}{l} {x_0} = p(1)\\ {y_0} = p(2)\\ r = \sqrt {p(3) + x_0^2 + y_0^2} \end{array}\end{split}$

## Confidence Interval for Linear Regressions¶

Let us now generalize this concept of confidence intervals to our linear regression, and add some more information about the basic assumptions underlying the calculation of confidence intervals. In the previous section, we saw that it is possible to describe a linear trend via the least squares estimators. However, measured data always have errors superimposed on the basic trend in which we are interested. This error leads to uncertainty in the estimators.

To deduce something about the uncertainty in the estimators, we have to make assumptions about the noise in the data. Some common assumptions that are made are:

• we know the independent variables $$x_1, ... , x_k$$ exactly
• the residuals are roughly normally distributed
• the noise is only in the dependent variable y
• the residuals are independent of the values of $$x_1, ... , x_k$$

If any of these assumptions does not hold, then the confidence interval must be calculated differently.

The Matlab function regress can give you both the least squares estimators and confidence intervals for the parameters:

[B, BINT] = regress(y, X);


where B indicates the least squares estimators $$\hat m$$, and BINT the 95% confidence interval for $$\hat m$$. Remember, though, that Matlab calvculates confidence intervals as described above, so check the assumptions first.

To specify a different confidence level, use the form

[B, BINT] = regress(y, X, alpha);


Note that the confidence interval widens as the required confidence level increases.

## Relationship to hypothesis testing¶

A confidence interval for a parameter can be used directly to test the corresponding null hypothesis. With the confidence intervals described above, the null hypothesis is:

The parameter is not significantly different from zero.

If the confidence interval contains zero, then it is not possible to reject the null hypothesis at the used confidence level; in other words, there is then no statistical evidence to suggest that the parameter is significantly different from zero.

## Line of best fit with no intercept term¶

If we use regress to fit a linear model with no intercept term to data in Fig. [fig:lineThroughZero], we obtain $$B = 0.755$$ and $$BINT = [0.748 0.761]$$.

Sample noisy data with a line of best fit passing through the origin.

This means that the slope of the best fit line (i.e., $$k$$) is most likely to be $$0.755$$, and lies with 95% probability between $$0.748$$ and $$0.761$$.

## Line of best fit with intercept term¶

If we use regress to fit a linear model with intercept term to the data in Fig. [fig:lineWithOffset],

B is $$\left[ {\begin{array}{*{20}{c}} {0.751}\\ {20.228} \end{array}} \right]$$ and BINT is $$\left[ {\begin{array}{*{20}{c}} {0.739}&{0.764}\\ {19.483}&{20.974} \end{array}} \right]$$ .

Sample noisy data, where the line of best fit does not pass through the origin.

This means that the slope of the best fit line (corresponding to $$m$$) is most likely $$0.751$$, and lies with 95% probability between $$0.739$$ and $$0.764$$; and that the offset of the line (corresponding to $$c$$) is $$20.228$$, and lies with 95% probability between $$19.483$$ and $$20.974$$.

## Significance¶

If both confidence for the slope are larger than 0, we speak of a significant increase in the data. If also both 99.9% confidence intervals are above 0, we call this a highly significant increase. Similarly, if both confidence intervals are below zero, our data show a significant or a highly significant decrease.

On the other hand, if the confidence intervals overlap zero, we cannot claim that our data are in- or decreasing, even if the slope is positive or negative, respectively.

# Correlation Coefficient and Coefficient of Determination¶

## Correlation Coefficient¶

The correlation coefficient r is a measure of the linear correlation (or dependence) between two variables $$x$$ and $$y$$, giving a value between +1 and -1 inclusive, where 1 is total positive correlation, 0 is no correlation, and -1 is total negative correlation. For sample data $$x_i$$ and $$y_i$$, r indicates the change in $$x$$ multiplied by the change in $$y$$, normalized by the relative spread of $$x$$ and $$y$$:

$\label{eq:corrcoef} r = \frac{\sum ^n _{i=1}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum ^n _{i=1}(x_i - \bar{x})^2} \sqrt{\sum ^n _{i=1}(y_i - \bar{y})^2}}$

Note that the correlation coefficient is symmetric in $$x$$ and $$y$$.

For linear regression, the square of the correlation coefficient r equals the coefficient of determination.

## Coefficient of Determination¶

The coefficient of determination tells us how well the fitted data account for the raw data. It makes use of the fact that

$SST = SSR + SSE$

where SST* is the sum of the squares of the total deviation (Figure below, left), SSR the sum of the squares of the regression, and SSE the sum of the squares of the errors (Figure below, right):

$\begin{split}SST &= \sum\limits_i (y_i - \bar{y})^2 \\ SSR &= \sum\limits_i (\hat{y}_i - \bar{y})^2 \\ SSE &= \sum\limits_i (y_i - \hat{y}_i)^2\end{split}$

The better the linear regression (on the right) fits the data in comparison to the simple average (on the left graph), the closer the value of $$r^2$$ is to one. The areas of the blue squares represent the squared residuals with respect to the linear regression. The areas of the red squares represent the squared residuals with respect to the average value (from Wikipedia).

$$\bar{y}$$ is the average value, and $$\hat{y}$$ is the value of the fitted data that corresponds to a given $$x_i$$ value.

The coefficient of determination is defined as the amount of the sum of squares that can be explained by the regression:

$r^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$

As an example, if $$r^2 = 0.7$$, this would tell us that approximately seventy percent of the variation in the response can be explained by the line fit.

To illustrate a range of different fits, three data sets with different amounts of noise are plotted in the Figure below. All three data sets have roughly the same line of best fit, but the coefficients of determination differ.

Data with different levels of noise.

In Matlab, the correlation coefficient r can be found by using the function corr.

x = 1:100;
y = 10 + 0.5*x + randn(size(x));
plot(x,y,'.')
corr(x', y')


produces a value of approximately r = 0.997.

# The Next Level¶

## Fitting of nonlinear functions¶

So far, we have only considered a linear relationship between the data and the parameters. Note that even tasks such as fitting a sine-function with an offset and a phase can be expressed with linear relationships (see Example 2: Fitting a sine-wave). However, often, the relationship will be visibly nonlinear, and cannot be fit with a linear model. In this case, it is necessary to select a nonlinear model for the data, and then attempt to fit the parameters of the model.

For example, for the data in the figure below, the falling part of the curve might reflect some physical process that is naturally modelled with an exponential decay.

Plot of a variable that decays over time.

Since an exponential decay is fully defined by the starting value and time, the value of the asymptotic minimum, and the half-life, one way of fitting this curve would be to estimate these parameters independently.

Alternatively, one could use more sophisticated methods that attempt to estimate all of the model parameters simultaneously. Such methods can be found in Matlab in the Curve Fitting Toolbox or in the Optimization Toolbox. Often, though, the methods run more efficiently when the user provides a good estimate of the parameters as a starting point.

An important tip is to choose a model/function that is likely to fit the data well, and if you have a choice, to reduce the number of parameters that must be estimated.

### Example:¶

For fitting an exponential decay, you first have to determine where the decay starts. If the data are not too noisy, this can be automated by finding e.g. the first value clearly below the maximum. Clearly can be e.g. 10% below the maximum

tStart = min(find(x<max(x)*threshold));


While the quality of the fit depends on many factors, e.g. the quality of the data, the appropriateness of the model, and the starting values, the Matlab Curve Fitting Toolbox can sometimes make things remarkably easy. For example, to fit an exponential decay to an offset can be found with

% Generate dummy data
fitTime = (0:0.01:20)';
fitVal = 3 + 4*exp(-fitTime/5)+randn(size(fitTime));

% Make the fit
f = fittype('offset + amp*exp(-x/tau)');
options = fitoptions(f);
options.StartPoint = [1,1,1];
fitted = fit(fitTime,fitVal,f, options)

fitted =
General model:
fitted(x) = offset + amp*exp(-x/tau)
Coefficients (with 95% confidence bounds):
amp =       3.964  (3.79, 4.137)
offset =        3.01  (2.899, 3.122)
tau =       5.148  (4.565, 5.732)


# Exercises¶

## Exercise 1: Line Fits¶

• Read in the data from https://github.com/thomas-haslwanter/BioSignalAnalysis/blob/master/Data/co2.txt.
• Fit a line to the last 25 years of carbon emissions from “fossil fuels and cement”, using “polyfit”. Thereby, choose the values for the first datapoint to be (0/0).
• Fit a quadratic curve to the same data, using “polyfit”.
• Plot original data, line, and quadratic curve.

## Exercise 2: Confidence Intervals¶

Use the same data as in Exercise 1, but now also determine the 95% confidence intervals. Answer the following 2 questions:

• When you fit a line, is the slope of the line significantly rising?
• When you fit a quadratic curve, is the quadratic contribution significant?

Note: When the highest order term is determined, then all lower order terms are also included. If for instance we fit a fifth order polynomial, and only the cubic term is significant, then we would omit the higher order nonsignificant terms, but retain those terms of smaller order than the cubic.

## Exercise 3: Linear Circle Fit¶

All the data that you need for this and the next exercises are in https://github.com/thomas-haslwanter/BioSignalAnalysis/blob/master/Data/noisyStuff.mat. “noisyCircle” contains the x/y values of a noisy circle.

Write a function that takes these data and determines the center of the best-fit circle-center (x0/y0) and the radius r.

## Exercise 4: Nonlinear Exponential Decay¶

“noisyExp” contains the t/x values of a noisy function that exponentially decays to a constant offset.

Write a function that takes these data and calculates the best-fit Amplitude, Offset, and Decay-time for this decay.

## Exercise 5: Butterworth Filter¶

“noisySine” contains the t/x values of a noisy sine wave.

Smooth these data with a Savitzky-Golay filter, and superpose the filtered data with the original data. The MATLAB command butter provides the [b,a] coefficients of an IIR filter corresponding to Butterworth lowpass filter. Find the coefficients for a low-pass filter with a corner frequency of 0.5 Hz, and apply this IIR filter to the noisy sine wave. Compare the output with the output of the Savitzky-Golay filter.

## Exercise 6: Confidence Intervals¶

“noisyLine” contains the t/x values of a very noisy line.

A linear regression fit indicates that the data can be fit with a line (y = k*x+d), with k=0.1653.

Are the data significantly increasing?