When we record data, we often obtain measurements that can be approximated nicely by a straight line. If this is the case, it tremendously simplifies the description of the dataset: we no longer need to know every single data point, but only the two coefficients defining the best-fit line through the data points.

*To obtain the best-fit line to noisy data, we want to find the model coefficients “k” and “d” that define our
data-model “y=k*x + d”.*

The following chapter describes how to obtain the model coefficients for simple linear models, how accurately we know these coefficients, and how this concept can be used to describe much more complex patterns, such as polynomials, circles, etc.

A linear model describes a linear relationship between a dependent variable and a number of independent variables. In the simplest case, there is one dependent variable \(y\) and one independent variable \(x\). We can write the linear model as

\[\label{eq:line}
y = mx+c\]

where \(m\) and \(c\) are the two free parameters. This is a
line in a plane, with \(m\) the *slope*, and \(c\) the
*y-intercept*. If we have two different data points relating the
dependent and independent variables, then these two free parameters are
*fully determined*. For example, if \(P_1=(x_1 / y_1)\) and
\(P_2=(x_2 / y_2)\), then we can calculate that

\[m = \frac{\Delta y}{\Delta x} = \frac{y_2 - y_1}{x_2 - x_1} \quad \text{and} \quad c = \frac{y_1 x_2 - x_1 y_2}{x_2 - x_1}\]
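As a quick numerical check, these two-point formulas can be evaluated directly; here is a short sketch in Python (the worked examples later in this chapter use Matlab):

```python
def line_through_points(p1, p2):
    """Slope m and y-intercept c of the line through two points (x, y)."""
    (x1, y1), (x2, y2) = p1, p2
    m = (y2 - y1) / (x2 - x1)
    c = (y1 * x2 - x1 * y2) / (x2 - x1)
    return m, c

m, c = line_through_points((1, 3), (3, 7))  # m = 2.0, c = 1.0
```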

In the general case, there is one dependent variable \(y\) and \(k\) independent variables \(x_i\). The linear model then becomes

\[y = c + \sum\limits_{d = 1}^k {{m_d}{x_d}}\]

where \(c, m_1, ..., m_k\) are the \(k+1\) free parameters. If we have \(k+1\) different data points that lie exactly on a \(k\)-dimensional hyperplane, then the free parameters are fully determined.

A line can be represented by its slope and y-intercept, as shown above. Alternatively, it can be characterised by a point on the line, \(\vec{x_1}\), and the line’s normal vector, \(\vec{n}\):

\[\vec n \cdot \vec x = \vec n \cdot {\vec x_1}\]

or

\[ax + by = c,\quad \text{with} \quad \vec n = \left( {\begin{array}{c}
a\\
b
\end{array}} \right)\]

Usually, we have more data points than free parameters. In this case, the
model is *overdetermined* by the data, so that it is not possible to fit
the model exactly to all of the data points. Instead, we try to find a
best-fit model, i.e. the model with the minimum fitting error.
Often, the fitting error is taken to be the sum of the squared residuals,
which leads to the *least squares estimators*.

The difference between a data value and the corresponding model value is
referred to as a *residual*. If we have a data point where the dependent
variable has the value \(y_i\) and the independent variables have
value \(x_{id}\), then the residual \(e_i\) for the general
linear model is

\[{e_i} = {y_i} - \left( {c + \sum\limits_{d = 1}^k {{m_d}{x_{id}}} } \right)\]

If there are \(n\) data points, the sum of the squared residuals is

\[E({x_{id}},{y_i},c,{m_d}) = \sum\limits_{i = 1}^n {e_i^2} = \sum\limits_{i = 1}^n {{{\left( {{y_i} - \left( {c + \sum\limits_{d = 1}^k {{m_d}{x_{id}}} } \right)} \right)}^2}}\]

The *least squares estimators* are the values of \(c\) and
\(m_1, ..., m_k\) that minimise E. To determine the value of the
least squares estimators \(\hat c\) and
\({\hat m_1},...,{\hat m_k}\) , it is necessary to locate the
minimum of E by finding where the following partial derivatives are
zero:

\[\begin{split}\begin{array}{l}
0 = {\left. {\frac{{\partial E}}{{\partial c}}} \right|_{\hat c,{{\hat m}_d}}} = - 2\sum\limits_{i = 1}^n {\left( {{y_i} - \left( {\hat c + \sum\limits_{d = 1}^k {{{\hat m}_d}{x_{id}}} } \right)} \right)} \\
0 = {\left. {\frac{{\partial E}}{{\partial {m_p}}}} \right|_{\hat c,{{\hat m}_d}}} = - 2\sum\limits_{i = 1}^n {\left( {{y_i} - \left( {\hat c + \sum\limits_{d = 1}^k {{{\hat m}_d}{x_{id}}} } \right)} \right)} {x_{ip}}\quad \text{for all } p
\end{array}\end{split}\]

This linear system of equations can then be solved to find the values of the least squares estimators.

The method of ordinary least squares can be used to find an approximate solution to overdetermined systems. For the system \(\mathbf{A} \cdot \vec{p} = \vec{y}\), the least squares formula is obtained from the problem

\[\min_{p}\Vert \mathbf{A} \cdot \vec{p} - \vec{y} \Vert,\]

the solution of which can be written with the normal equations,

\[\vec{p} =(\mathbf{A}^{\mathrm{T}}\mathbf{A})^{-1}\mathbf{A}^{\mathrm{T}} \cdot \vec{y},\]

where \(\mathrm{T}\) indicates a matrix transpose, provided \((\mathbf{A}^{\mathrm{T}}\mathbf{A})^{-1}\) exists (that is, provided \(\mathbf{A}\) has full column rank). With this formula an approximate solution is found when no exact solution exists, and it gives an exact solution when one does exist.
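The normal equations can be checked numerically against a dedicated least squares solver. The following Python/NumPy sketch (with synthetic data) solves the same overdetermined system both ways; in practice, a solver such as `np.linalg.lstsq` is preferred over explicitly forming \(\mathbf{A}^{\mathrm{T}}\mathbf{A}\), for numerical stability:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 3))            # 50 equations, 3 unknowns: overdetermined
p_true = np.array([1.0, -2.0, 0.5])
y = A @ p_true + 0.01 * rng.normal(size=50)   # slightly noisy right-hand side

# Normal equations: p = (A^T A)^{-1} A^T y
p_normal = np.linalg.solve(A.T @ A, A.T @ y)

# Numerically preferable least squares solver
p_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)
```

Both routes give the same estimator up to rounding.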

Let us start with the case where there is no intercept term (i.e., \(c\) is zero). Our model is

\[y \approx \sum\limits_{d = 1}^k {{m_d}{x_d}}\]

If \(X\) is an \(n \times k\) matrix made of the column vectors \(x_1, ... , x_k\), \(m\) is a \(k \times 1\) column vector containing the free parameters \(m_1, ... , m_k\), and \(y\) is an \(n \times 1\) column vector, then we can rewrite the model as

\[X \cdot m \approx y\]

The least squares estimator for \(m\) can be obtained easily in Matlab by reshaping the equation:

```
m_estimator = X\y;
```

Note that the “\” is a backslash and not a normal divide.
Alternatively, one can use the command *regress*:

```
m_estimator = regress(y, X);
```

Now let us address the general linear model with intercept term:

\[y \approx c + \sum\limits_{d = 1}^k {{m_d}{x_d}}\]

We rewrite this in matrix form in the following way:

\[\label{eq:regressMatrix}
\vec y \approx {\bf{M}} \cdot \vec p\]

where

\[\begin{split}\label{eq:regressionFit}
{\bf{M}} = \left[ {\begin{array}{*{20}{c}}
{{x_{11}}}&{{x_{12}}}& \cdots &{{x_{1k}}}&1\\
{{x_{21}}}&{{x_{22}}}& \cdots &{{x_{2k}}}&1\\
\vdots & \vdots &{}& \vdots & \vdots \\
{{x_{n1}}}&{{x_{n2}}}& \cdots &{{x_{nk}}}&1
\end{array}} \right],\,\,\vec p = \left[ {\begin{array}{*{20}{c}}
{{m_1}}\\
{{m_2}}\\
\vdots \\
{{m_k}}\\
c
\end{array}} \right],\,\,\vec y = \left[ {\begin{array}{*{20}{c}}
{{y_1}}\\
{{y_2}}\\
\vdots \\
{{y_n}}
\end{array}} \right]\end{split}\]

The matrix **M** is sometimes called the *design matrix*. In Matlab, we can solve this in the same way as before:

```
M = [X ones(size(y))];
p_estimator = M\y;
```

Again, it is possible to use the function *regress*:

```
p_estimator = regress(y, M);
```
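As an illustrative aside, the same with-intercept fit can be written in Python/NumPy, where `np.linalg.lstsq` plays the role of Matlab's backslash (the data here are synthetic):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.arange(100.0)
y = -30 + 0.5 * x + rng.normal(size=x.size)   # noisy line

# Design matrix: the independent variable plus a column of ones for the intercept
M = np.column_stack([x, np.ones_like(x)])
(slope, intercept), *_ = np.linalg.lstsq(M, y, rcond=None)
```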

For a line-fit, the equation for general linear models is reduced to

\[\begin{split}\left[ {\begin{array}{*{20}{c}}
{{y_1}}\\
{{y_2}}\\
{{y_3}}\\
\vdots
\end{array}} \right] = \left[ {\begin{array}{*{20}{c}}
{{x_1}}&1\\
{{x_2}}&1\\
{{x_3}}&1\\
\vdots & \vdots
\end{array}} \right] \cdot \left[ {\begin{array}{*{20}{c}}
k\\
d
\end{array}} \right]\end{split}\]

```
% Generate a noisy line, with a slope of 0.5
% and a y-intercept of -30
x = 1:100;
y = -30 + 0.5*(x + 5*randn(size(x)));
% Plot the line
plot(x,y);
line(xlim, [0 0], 'LineStyle', '--');
% Calculate the best-fit line
M = [x' ones(length(x),1)];
p_estimator = M\y'; % or "p_estimator = regress(y', M);"
slope = p_estimator(1) % "m"
yintercept = p_estimator(2) % "c"
```

An alternative way to fit a line would be to see the line as a
\(1^{st}\) order polynomial, and use the Matlab function *polyfit*:

```
p = polyfit(x,y,1);
y_fit = polyval(p, x);
```
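NumPy provides an analogous pair of functions, `np.polyfit` and `np.polyval`; a minimal, noise-free sketch:

```python
import numpy as np

x = np.arange(1.0, 101.0)
y = 10 + 0.5 * x                 # exact line, for illustration

p = np.polyfit(x, y, 1)          # p[0] = slope, p[1] = y-intercept
y_fit = np.polyval(p, x)
```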

We can use the same approach to fit a polynomial curve to the data. For example, for a quadratic relationship between x and y we have

\[y = a{x^2} + bx + c\]

Written in matrix form, this gives

\[\begin{split}\left[ {\begin{array}{*{20}{c}}
y_1\\
y_2\\
y_3\\
\vdots
\end{array}} \right] = \left[ {\begin{array}{*{20}{c}}
x_1^2 & x_1 & 1\\
x_2^2 & x_2 & 1\\
x_3^2 & x_3 & 1\\
\vdots & \vdots & \vdots
\end{array}} \right] \cdot \left[ {\begin{array}{*{20}{c}}
a\\
b\\
c
\end{array}} \right]\end{split}\]

With this we have brought the problem into the form \(\vec{y} \approx \bf{M} \cdot \vec{p}\), and we can solve it in the same way as above.
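A sketch of this quadratic fit in Python/NumPy, with synthetic data (the true coefficients \(a=2\), \(b=-1\), \(c=0.5\) are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-3, 3, 200)
y = 2 * x**2 - x + 0.5 + 0.1 * rng.normal(size=x.size)

# Columns of M: x^2, x, and ones, matching the matrix form above
M = np.column_stack([x**2, x, np.ones_like(x)])
(a, b, c), *_ = np.linalg.lstsq(M, y, rcond=None)
```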

If we have a sinusoidal oscillation, where we know the frequency \(\omega\), and where we want to know the amplitude, the phase delay, and the offset, our function can be written as

\[x = \text{offset} \cdot 1 + \text{amplitude} \cdot \sin (\omega t + \delta )\]

Note that here *offset* and *amplitude* appear as linear parameters,
but the phase \(\delta\) does *not*. This can be solved by expressing
\(\sin(\omega t+\delta)\) as the sum of a sine and a cosine wave, using the
identity \(\sin (\omega t + \delta ) = \cos (\delta )\sin (\omega t) + \sin (\delta )\cos (\omega t)\):

\[\label{eq:sinFit}
x = offset \cdot 1 + a \cdot \sin (\omega t) + b \cdot \cos (\omega t)\]

From these parameters, we can find the amplitude and the phase:

\[\begin{split}\begin{array}{l}
amplitude = \sqrt {{a^2} + {b^2}} \\
\delta = {\tan ^{ - 1}}\left( {\frac{b}{a}} \right)
\end{array}\end{split}\]
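The reparametrisation rests on \(a = \text{amplitude} \cdot \cos\delta\) and \(b = \text{amplitude} \cdot \sin\delta\); a quick numerical check in Python/NumPy (the parameter values are illustrative):

```python
import numpy as np

amplitude, delta, omega = 2.0, np.deg2rad(45), 1.5   # arbitrary example values
t = np.linspace(0, 10, 500)

# Linear parameters of the reparametrised model
a = amplitude * np.cos(delta)
b = amplitude * np.sin(delta)

# Both forms of the model agree ...
lhs = amplitude * np.sin(omega * t + delta)
rhs = a * np.sin(omega * t) + b * np.cos(omega * t)

# ... and amplitude and phase are recovered from a and b
amp_back = np.hypot(a, b)          # sqrt(a^2 + b^2)
delta_back = np.arctan2(b, a)
```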

Now all the parameters that we want to know appear in a linear (!!) relationship, and we can put together our matrix M as

\[\begin{split}{\bf{M}} = \left[ {\begin{array}{*{20}{c}}
1&{\sin (\omega \cdot {t_1})}&{\cos (\omega \cdot {t_1})}\\
1&{\sin (\omega \cdot {t_2})}&{\cos (\omega \cdot {t_2})}\\
1&{\sin (\omega \cdot {t_3})}&{\cos (\omega \cdot {t_3})}\\
\vdots & \vdots & \vdots
\end{array}} \right]\end{split}\]

and \(\vec{p}\), which is here given by \(\vec p = \left( {\begin{array}{*{20}{c}} {offset}\\ a\\ b \end{array}} \right)\) can again be obtained from the solution of the linear model.

```
% Time
t = 0:0.1:8*pi;
% Set the parameters
freq = 0.5;
offset = 3;
delta = 45*pi/180;
amplitude = 2;
omega = 2*pi*freq;
% Simulate and plot noisy sine-wave:
y = offset + amplitude * sin(omega*t + delta) + randn(size(t));
plot(t,y)
% Fit the data
M = [ones(length(t),1) sin(omega*t') cos(omega*t')];
p = M\y';
% Extract the coefficients from the fit
found.offset = p(1);
found.amp = sqrt(p(2)^2 + p(3)^2);
found.delta = atan2(p(3), p(2))*180/pi;
% Superpose the original data with the fit
hold('on');
found.y = found.offset + found.amp*sin(omega*t + found.delta*pi/180);
plot(t, found.y, 'r')
legend('original', 'fit');
xlabel('Time [s]');
ylabel('Data');
shg
```

*Sine-Fit*

Surprisingly, the same concept can even be extended to find the best-fit circle to data. To do so, we rearrange the general equation for a circle:

\[\begin{split}\begin{array}{c}
{(x - {x_0})^2} + {(y - {y_0})^2} = {r^2}\\
{x^2} - 2x{x_0} + x_0^2 + {y^2} - 2y{y_0} + y_0^2 = {r^2}\\
2x \cdot {x_0} + 2y \cdot {y_0} + 1 \cdot ({r^2} - x_0^2 - y_0^2) = {x^2} + {y^2}
\end{array}\end{split}\]

This gives us \({x^2} + {y^2} = {\bf{M}} \cdot \vec p\) , with

\[\begin{split}{\bf{M}} = \left[ {\begin{array}{*{20}{c}}
{2{x_1}}&{2{y_1}}&1\\
{2{x_2}}&{2{y_2}}&1\\
\vdots & \vdots & \vdots
\end{array}} \right]\end{split}\]

Again, this is just a linear fit! From this we get

\[\begin{split}\begin{array}{l}
{x_0} = p(1)\\
{y_0} = p(2)\\
r = \sqrt {p(3) + x_0^2 + y_0^2}
\end{array}\end{split}\]
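A sketch of this circle fit in Python/NumPy; the center \((2, -1)\) and radius \(3\) of the synthetic data are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(3)
phi = np.linspace(0, 2 * np.pi, 100, endpoint=False)
x = 2 + 3 * np.cos(phi) + 0.01 * rng.normal(size=phi.size)
y = -1 + 3 * np.sin(phi) + 0.01 * rng.normal(size=phi.size)

# Linear system: M * p = x^2 + y^2
M = np.column_stack([2 * x, 2 * y, np.ones_like(x)])
p, *_ = np.linalg.lstsq(M, x**2 + y**2, rcond=None)

x0, y0 = p[0], p[1]
r = np.sqrt(p[2] + x0**2 + y0**2)
```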

Let us now generalize this concept of confidence intervals to our linear regression, and add some more information about the basic assumptions underlying the calculation of confidence intervals. In the previous section, we saw that it is possible to describe a linear trend via the least squares estimators. However, measured data always have errors superimposed on the basic trend in which we are interested. This error leads to uncertainty in the estimators.

To deduce something about the uncertainty in the estimators, we have to make assumptions about the noise in the data. Some common assumptions that are made are:

- we know the independent variables \(x_1, ... , x_k\) exactly
- the residuals are roughly normally distributed
- the noise is only in the dependent variable y
- the residuals are independent of the values of \(x_1, ... , x_k\)

If any of these assumptions does not hold, then the confidence interval must be calculated differently.

The Matlab function *regress* can give you both the least squares estimators
and confidence intervals for the parameters:

```
[B, BINT] = regress(y, X);
```

where *B* contains the least squares estimators \(\hat m\), and *BINT* the
95% confidence intervals for \(\hat m\). Remember, though, that Matlab calculates
the confidence intervals as described above, so check the assumptions first.

To specify a different confidence level, use the form

```
[B, BINT] = regress(y, X, alpha);
```

Note that the confidence interval widens as the required confidence level increases.

A confidence interval for a parameter can be used directly to test the corresponding null hypothesis. With the confidence intervals described above, the null hypothesis is:

*The parameter is equal to zero.*

If the confidence interval contains zero, then it is not possible to reject the null hypothesis at the corresponding confidence level; in other words, there is then no statistical evidence that the parameter differs from zero.
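As an illustrative sketch of this zero-slope test in Python, using SciPy's `linregress` in place of Matlab's *regress* (the data are synthetic):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = np.arange(100.0)
y = 10 + 0.5 * x + rng.normal(size=x.size)

res = stats.linregress(x, y)

# 95% confidence interval: slope +/- t_{0.975, n-2} * standard error
t_crit = stats.t.ppf(0.975, x.size - 2)
ci = (res.slope - t_crit * res.stderr, res.slope + t_crit * res.stderr)

# If the interval contains zero, the null hypothesis cannot be rejected
significant = not (ci[0] <= 0 <= ci[1])
```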

If we use *regress* to fit a linear model with no intercept term to data
in Fig. [fig:lineThroughZero], we obtain \(B = 0.755\) and
\(BINT = [0.748 0.761]\).

*Sample noisy data with a line of best fit passing through the origin.*

This means that the slope of the best fit line (i.e., \(k\)) is most likely to be \(0.755\), and lies with 95% probability between \(0.748\) and \(0.761\).

If we use *regress* to fit a linear model with intercept term to the
data in Fig. [fig:lineWithOffset],

B is \(\left[ {\begin{array}{*{20}{c}} {0.751}\\ {20.228} \end{array}} \right]\) and BINT is \(\left[ {\begin{array}{*{20}{c}} {0.739}&{0.764}\\ {19.483}&{20.974} \end{array}} \right]\) .

*Sample noisy data, where the line of best fit does not pass through the origin.*

This means that the slope of the best fit line (corresponding to \(m\)) is most likely \(0.751\), and lies with 95% probability between \(0.739\) and \(0.764\); and that the offset of the line (corresponding to \(c\)) is \(20.228\), and lies with 95% probability between \(19.483\) and \(20.974\).

If both limits of the confidence interval for the slope are larger than zero, we
speak of a *significant increase* in the data. If in addition both limits of the
99.9% confidence interval are above zero, we call this a *highly significant
increase*. Similarly, if both limits are below zero, the data show a
*significant* or a *highly significant decrease*.

On the other hand, if the confidence interval contains zero, we cannot claim that our data are increasing or decreasing, even if the slope is positive or negative, respectively.

The *correlation coefficient r* is a measure of the linear correlation
(or dependence) between two variables \(x\) and \(y\), giving a
value between +1 and -1 inclusive, where 1 is total positive
correlation, 0 is no correlation, and -1 is total negative correlation.
For sample data \(x_i\) and \(y_i\), *r* is the covariance of
\(x\) and \(y\), normalized by the product of their standard
deviations:

\[\label{eq:corrcoef}
r = \frac{\sum ^n _{i=1}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum ^n _{i=1}(x_i - \bar{x})^2} \sqrt{\sum ^n _{i=1}(y_i - \bar{y})^2}}\]

Note that the correlation coefficient is symmetric in \(x\) and \(y\).
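The defining formula can be evaluated directly and compared with NumPy's built-in `np.corrcoef`; a sketch with synthetic data that also confirms the symmetry in \(x\) and \(y\):

```python
import numpy as np

def corr(x, y):
    """Pearson correlation coefficient, computed from the definition."""
    xm, ym = x - x.mean(), y - y.mean()
    return (xm * ym).sum() / np.sqrt((xm**2).sum() * (ym**2).sum())

rng = np.random.default_rng(5)
x = rng.normal(size=200)
y = 0.7 * x + rng.normal(size=200)

r = corr(x, y)
```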

For linear regression, the square of the *correlation coefficient r*
equals the *coefficient of determination*.

The *coefficient of determination* tells us how well the fitted data account for
the raw data. It makes use of the fact that

\[SST = SSR + SSE\]

where *SST* is the *sum of the squares of the total deviation* (Figure below,
left), *SSR* the *sum of the squares of the
regression*, and *SSE* the *sum of the squares of the errors* (Figure below,
right):

\[\begin{split}SST &= \sum\limits_i (y_i - \bar{y})^2 \\
SSR &= \sum\limits_i (\hat{y}_i - \bar{y})^2 \\
SSE &= \sum\limits_i (y_i - \hat{y}_i)^2\end{split}\]

*The better the linear regression (on the right) fits the data in comparison to
the simple average (on the left graph), the closer the value of* \(r^2\) *is to one.
The areas of the blue squares represent the squared residuals with respect to the
linear regression. The areas of the red squares represent the squared residuals
with respect to the average value (from Wikipedia)*.

\(\bar{y}\) is the average value, and \(\hat{y}\) is the value of the fitted data that corresponds to a given \(x_i\) value.

The *coefficient of determination* is defined as the amount of the sum of squares
that can be explained by the regression:

\[r^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}\]

As an example, if \(r^2 = 0.7\), this would tell us that *approximately
seventy percent of the variation in the response can be explained by the line
fit*.
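The decomposition \(SST = SSR + SSE\) and the relation \(r^2 = SSR/SST\) can be verified numerically; a Python/NumPy sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(6)
x = np.linspace(0, 10, 100)
y = 1 + 2 * x + rng.normal(size=x.size)

# Fit a line and evaluate it at the data points
p = np.polyfit(x, y, 1)
y_hat = np.polyval(p, x)

sst = ((y - y.mean())**2).sum()      # total deviation
ssr = ((y_hat - y.mean())**2).sum()  # regression
sse = ((y - y_hat)**2).sum()         # errors (residuals)
r2 = ssr / sst
```

For this simple line fit, `r2` also equals the squared correlation coefficient of `x` and `y`.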

To illustrate a range of different fits, three data sets with different amounts of noise are plotted in the Figure below. All three data sets have roughly the same line of best fit, but the coefficients of determination differ.

*Data with different levels of noise.*

In Matlab, the correlation coefficient r can be found with the function
*corr* (from the Statistics Toolbox), or with the built-in function *corrcoef*.

```
x = 1:100;
y = 10 + 0.5*x + randn(size(x));
plot(x,y,'.')
corr(x', y')
```

produces a value of approximately *r = 0.997*.

So far, we have only considered a linear relationship between the data and the parameters. Note that even tasks such as fitting a sine-function with an offset and a phase can be expressed with linear relationships (see Example 2: Fitting a sine-wave). Often, however, the relationship is visibly nonlinear and cannot be fit with a linear model. In this case, it is necessary to select a nonlinear model for the data, and then to fit the parameters of that model.

For example, for the data in the figure below, the falling part of the curve might reflect some physical process that is naturally modelled with an exponential decay.

*Plot of a variable that decays over time.*

Since an exponential decay is fully defined by the starting value and time, the value of the asymptotic minimum, and the half-life, one way of fitting this curve would be to estimate these parameters independently.

Alternatively, one could use more sophisticated methods that attempt to
estimate all of the model parameters simultaneously. Such methods can be
found in Matlab in the *Curve Fitting Toolbox* or in the *Optimization
Toolbox*. Often, though, the methods run more efficiently when the user
provides a good estimate of the parameters as a starting point.

An important tip is to choose a model/function that is likely to fit the data well, and if you have a choice, to reduce the number of parameters that must be estimated.

For fitting an exponential decay, you first have to determine where the
decay starts. If the data are not too noisy, this can be automated by
finding, e.g., the first value clearly below the maximum; *clearly* can
mean, e.g., 10% below the maximum:

```
threshold = 0.9;   % "clearly below": 10% below the maximum
tStart = find(x < max(x)*threshold, 1);
```
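The same start-detection can be sketched in Python/NumPy; the threshold of 0.9 (i.e. 10% below the maximum) and the synthetic decay are illustrative choices:

```python
import numpy as np

t = np.linspace(0, 10, 1000)
x = np.where(t < 2, 4.0, 4.0 * np.exp(-(t - 2)))   # flat, then decaying from t = 2

threshold = 0.9                                # "clearly below" the maximum
t_start = np.argmax(x < x.max() * threshold)   # index of the first such sample
```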

While the quality of the fit depends on many factors, e.g. the quality
of the data, the appropriateness of the model, and the starting values,
the *Matlab Curve Fitting Toolbox* can sometimes make things remarkably
easy. For example, an exponential decay towards an offset can be fitted
with

```
% Generate dummy data
fitTime = (0:0.01:20)';
fitVal = 3 + 4*exp(-fitTime/5) + randn(size(fitTime));
% Make the fit
f = fittype('offset + amp*exp(-x/tau)');
options = fitoptions(f);
options.StartPoint = [1, 1, 1];
fitted = fit(fitTime, fitVal, f, options)
```

which produces output of the form

```
fitted =
     General model:
     fitted(x) = offset + amp*exp(-x/tau)
     Coefficients (with 95% confidence bounds):
       amp    = 3.964  (3.79, 4.137)
       offset = 3.01   (2.899, 3.122)
       tau    = 5.148  (4.565, 5.732)
```
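An equivalent nonlinear fit can be sketched in Python with SciPy's `curve_fit`; the noise level and starting values here are illustrative and, as with the toolbox, reasonable starting values help convergence:

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(7)
t = np.arange(0, 20, 0.01)
val = 3 + 4 * np.exp(-t / 5) + 0.5 * rng.normal(size=t.size)

def model(t, offset, amp, tau):
    return offset + amp * np.exp(-t / tau)

# Starting values matter for nonlinear fits
p_opt, p_cov = curve_fit(model, t, val, p0=(1.0, 1.0, 1.0))
offset, amp, tau = p_opt
```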

- Read in the data from
  *https://github.com/thomas-haslwanter/BioSignalAnalysis/blob/master/Data/co2.txt*.
- Fit a line to the last 25 years of carbon emissions from “fossil fuels and cement”, using “polyfit”. Thereby, choose the values for the first datapoint to be (0/0).
- Fit a quadratic curve to the same data, using “polyfit”.
- Plot the original data, the line, and the quadratic curve.

Use the same data as in Exercise 1, but now also determine the 95% confidence intervals. Answer the following 2 questions:

- When you fit a line, is the slope of the line significantly rising?
- When you fit a quadratic curve, is the quadratic contribution significant?

Note: When the highest order term is determined, then all lower order
terms are also included. If for instance we fit a fifth order
polynomial, and only the *cubic* term is significant, then we would
*omit the higher order nonsignificant terms, but retain those terms of
smaller order than the cubic*.

All the data that you need for this and the next exercises are in
*https://github.com/thomas-haslwanter/BioSignalAnalysis/blob/master/Data/noisyStuff.mat*.
“noisyCircle” contains the x/y values of a noisy circle.

Write a function that takes these data and determines the center (x0/y0) and the radius r of the best-fit circle.

“noisyExp” contains the t/x values of a noisy function that exponentially decays to a constant offset.

Write a function that takes these data and calculates the best-fit Amplitude, Offset, and Decay-time for this decay.

“noisySine” contains the t/x values of a noisy sine wave.

Smooth these data with a Savitzky-Golay filter, and superpose the filtered data with the original data. The Matlab command *butter* provides the [b,a] coefficients of an IIR filter corresponding to a Butterworth lowpass filter. Find the coefficients for a low-pass filter with a corner frequency of 0.5 Hz, and apply this IIR filter to the noisy sine wave. Compare the output with the output of the Savitzky-Golay filter.

“noisyLine” contains the t/x values of a very noisy line.

A linear regression fit indicates that the data can be fit with a line (y = k*x+d), with k=0.1653.

Are the data significantly increasing?