Using Ordinary Least Squares (OLS) to fit a straight line to noisy data

Ordinary Least Squares (OLS) can be used to fit a straight line to noisy data.

If $y$ is a vector of measured data, $\beta_{0}$ and $\beta_{1}$ are the true linear model parameters (intercept and gradient), and $\epsilon$ represents the error vector, the model can be expressed as:

$$y_{1} = \beta_{0} + \beta_{1}x_{1} + \epsilon_{1}$$

$$y_{2} = \beta_{0} + \beta_{1}x_{2} + \epsilon_{2}$$

$$\vdots$$

$$y_{n} = \beta_{0} + \beta_{1}x_{n} + \epsilon_{n}$$

This system of equations can be expressed as:

$$\begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \end{bmatrix} = \begin{bmatrix} 1 & x_{1} \\ 1 & x_{2} \\ \vdots & \vdots \\ 1 & x_{n} \end{bmatrix} \begin{bmatrix} \beta_{0} \\ \beta_{1} \\ \end{bmatrix} + \begin{bmatrix} \epsilon_{1} \\ \epsilon_{2} \\ \vdots \\ \epsilon_{n} \end{bmatrix}$$

Or, more compactly, in matrix form:

$$y=X \beta + \epsilon$$

For some estimate $\hat{\beta}$ of the model parameters, the residual vector and the sum of squared residuals are defined by:

$$\epsilon = y - X \hat \beta$$

$$\epsilon^T\epsilon = (y - X \hat \beta)^T(y - X \hat \beta)$$

Expanding the above equation, and noting that $y^T (X \hat \beta)$ is a scalar and is therefore equal to its own transpose $(X \hat \beta)^T y$, gives:

$$\epsilon^T\epsilon = y^Ty - y^T (X \hat \beta) - (X \hat \beta)^Ty + (X \hat \beta)^T(X \hat \beta)$$

$$\epsilon^T\epsilon = y^Ty - (X \hat \beta)^T y - (X \hat \beta)^Ty + (X \hat \beta)^T(X \hat \beta)$$

$$\epsilon^T\epsilon = y^Ty - 2(X \hat \beta)^T y + (X \hat \beta)^T(X \hat \beta)$$

$$\epsilon^T\epsilon = y^Ty - 2\hat \beta ^T X^T y + \hat \beta^T X^T X \hat \beta$$

To find the $\hat \beta$ which minimises the sum of squared errors, the above equation can be differentiated with respect to $\hat \beta$ and set equal to zero:

$$\frac{\partial [\epsilon^T\epsilon]}{\partial \hat \beta} = - 2 X^T y + 2 X^T X \hat \beta = 0$$
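This differentiation uses two standard matrix-calculus identities, stated here for completeness:

$$\frac{\partial (\hat \beta^T a)}{\partial \hat \beta} = a \qquad \text{and} \qquad \frac{\partial (\hat \beta^T A \hat \beta)}{\partial \hat \beta} = 2 A \hat \beta \ \text{ for symmetric } A$$

with $a = X^T y$ and $A = X^T X$ (which is symmetric).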

Therefore, provided $X^T X$ is invertible,

$$2 X^T X \hat \beta = 2 X^T y$$

$$\hat \beta = (X^T X)^{-1} X^T y$$

Python Implementation

Let's create a Python function to calculate $\hat \beta$ from $X$ and $y$:

In [3]:
import numpy as np

def ols(X, y):
    # beta_hat = (X^T X)^{-1} X^T y
    beta = np.dot(np.linalg.inv(np.dot(X.T, X)), np.dot(X.T, y))
    return beta

This function uses NumPy's linear algebra routines to carry out the matrix algebra.
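Note that explicitly inverting $X^T X$ can become numerically unstable when the columns of $X$ are nearly collinear. As a sketch of more robust alternatives (the function names here are illustrative, not part of the article's code), the normal equations can be solved directly, or NumPy's built-in least-squares solver can be used:

import numpy as np

def ols_solve(X, y):
    # solve the normal equations (X^T X) beta = X^T y without forming the inverse
    return np.linalg.solve(np.dot(X.T, X), np.dot(X.T, y))

def ols_lstsq(X, y):
    # np.linalg.lstsq minimises ||y - X beta||^2 using an SVD-based routine
    beta, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
    return beta

On well-conditioned problems either function returns the same $\hat \beta$ as ols().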

Test Data Generation

Now let's generate some noisy data to test the function in Python. The model is $y = 5.67 + 2.18x$ with normally distributed noise (mean = 0, standard deviation = 5.0).

In [4]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

N = 51

b0 = 5.67
b1 = 2.18

x = np.arange(0,N,1)
y = b0 + b1*x

# generate noise and add it, y = b0 + b1*x + e
mn = 0
sd = 5
np.random.seed(100)
e = np.random.normal(mn,sd,N)
yn = y + e

# set the default size of the plots
plt.rcParams['figure.figsize'] = (12,7)
# create ax object for plotting
fig, ax = plt.subplots()
ax.plot(x,y,'r-',label='y = '+str(b0)+' + '+str(b1)+'x')
ax.plot(x,yn,'b-',label='yn = y + e, mean = 0, sd = 5.0')

# add axes labels and title
ax.set_xlabel('x'); ax.set_ylabel('y'); ax.set_title('Linear Model with Noise'); ax.legend()
[Figure: 'Linear Model with Noise' — the model line y = 5.67 + 2.18x (red) and the noisy series yn (blue)]

Testing the Function

Before we can test the function, the vector x needs to be formatted into the matrix X by prepending a column of ones for the intercept:

In [5]:
print(x)
print(x.shape)
intercept = np.ones((N,1))
xr = x.reshape(-1,1) # convert 1 x N into N x 1
X = np.concatenate((intercept, xr), axis=1)
print(X[0:5])
print(X.shape)
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
 48 49 50]
(51,)
[[1. 0.]
 [1. 1.]
 [1. 2.]
 [1. 3.]
 [1. 4.]]
(51, 2)

Note that reshape is needed because transposing a 1D NumPy array has no effect. reshape(-1,1) reshapes x into a single-column array, using as many rows as required.
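Equivalently, NumPy's column_stack builds the same matrix in a single call (a minimal alternative to the reshape/concatenate approach above):

X = np.column_stack((np.ones(N), x))  # column of ones beside x; shape (51, 2)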

Now we can test the ols() function:

In [6]:
beta_hat = ols(X, yn)
print(beta_hat)
[6.97293124 2.11522372]

We now have our OLS estimates $\hat{\beta}_{0}$ and $\hat{\beta}_{1}$ of the intercept and gradient.

For comparison, numpy.polyfit can also be used to fit a linear model:

In [7]:
print(np.polyfit(x, yn, 1))
[2.11522372 6.97293124]

numpy.polyfit() returns polynomial coefficients with the highest order first; reversed, they match those calculated by our ols() function, giving us confidence that our method works as intended.
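This agreement can also be checked programmatically (a quick sketch using the variables defined above; np.allclose uses its default tolerances):

print(np.allclose(beta_hat, np.polyfit(x, yn, 1)[::-1]))  # expect True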

Finally, we can overlay the fitted line on the earlier figure for comparison.

In [14]:
yest = beta_hat[0] + beta_hat[1]*x

# set the default size of the plots
plt.rcParams['figure.figsize'] = (12,7)
# create ax object for plotting
fig, ax = plt.subplots()
ax.plot(x,y,'r-',label='y = '+str(b0)+' + '+str(b1)+'x')
ax.plot(x,yest,'g-',label='yest = '+"%.2f" % beta_hat[0]+' + '+"%.2f" % beta_hat[1]+'x')
ax.plot(x,yn,'b-',label='yn = y + e, mean = 0, sd = 5.0')

# add axes labels and title
ax.set_xlabel('x'); ax.set_ylabel('y'); ax.set_title('Linear Model with Noise'); ax.legend()
[Figure: 'Linear Model with Noise' — the true model line, the OLS fit yest, and the noisy data yn]

As we can see in the figure above, the linear estimate is a good fit to the noisy data and compares well with the actual model.
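One way to quantify the fit (a quick sketch using the variables defined above; the exact value depends on the random seed) is to compare the standard deviation of the residuals against the known noise level:

residuals = yn - np.dot(X, beta_hat)
print(residuals.std())  # should be close to the noise sd of 5.0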

A further article discusses how we can apply OLS to determine the coefficients for a non-linear polynomial model.
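As a preview (a sketch of the idea only; the further article's details may differ), the same ols() function handles a polynomial model once the design matrix gains extra columns:

Xp = np.column_stack((np.ones(N), x, x**2))  # add an x^2 column for a quadratic model
beta_p = ols(Xp, yn)  # now returns three coefficients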


