Using Maximum Likelihood Estimation (MLE) to determine a line of best fit to noisy data
This post contains a brief derivation of the MLE equations and a Python implementation that determines a line of best fit to some noisy data. The final section of the post then shows why using MLE results in the same linear coefficients as OLS when the noise is normally distributed.
As before, if we consider $y$ to be our vector of measured data, $\beta_{0}$ and $\beta_{1}$ the actual linear model parameters (intercept and gradient) and $\epsilon$ the error vector, then the model can be expressed as:
$$y_{1} = \beta_{0} + \beta_{1}x_{1} + \epsilon_{1}$$$$y_{2} = \beta_{0} + \beta_{1}x_{2} + \epsilon_{2}$$$$\vdots$$$$y_{n} = \beta_{0} + \beta_{1}x_{n} + \epsilon_{n}$$If the error vector $\epsilon$ is normally distributed $N(0,\sigma^{2})$ (as in the previous post), each measurement can be thought of as being sampled from its own distribution with mean $\mu_{i} = \beta_{0}+\beta_{1}x_{i}$ and constant variance $\sigma^{2}$.
If the probability of a single point is $N(y_{i}|\mu_{i},\sigma^{2})$, then the probability of all of the points occurring from the distributions defined by $\mu_{i} = \beta_{0}+\beta_{1}x_{i}$ and $\sigma^2$ is the product of these individual probabilities:
$$\prod_{i = 1}^{n} N(y_{i}|\mu_{i},\sigma^{2})$$This is also the likelihood of the normal distributions defined by $\mu_{i}$ and $\sigma^{2}$ being the distributions from which data points $y_{i}$ have been sampled:
$$L(\mu_{i}, \sigma^{2}|y_{i}) = \prod_{i = 1}^{n} N(y_{i}|\mu_{i},\sigma^{2})$$Now since the probability density function (pdf) for a normal distribution is:
$$P(y) = \frac{1}{\sqrt{2\pi \sigma ^{2}}}exp \left( {\frac{-\left ( y-\mu \right )^{2}}{2\sigma ^{2}}} \right)$$The likelihood can be defined as the product of the individual probabilities calculated for each data point:
$$L(\mu_{i}, \sigma^{2}|y_{i}) = \prod_{i = 1}^{n} \frac{1}{\sqrt{2\pi \sigma ^{2}}}exp \left( {\frac{-\left ( y_{i} -\mu_{i} \right )^{2}}{2\sigma ^{2}}} \right)$$$$L(\mu_{i}, \sigma^{2}|y_{i}) = \left( \frac{1}{\sqrt{2\pi \sigma ^{2}}} \right) ^n \prod_{i = 1}^{n} exp \left( {\frac{-\left ( y_{i} -\mu_{i} \right )^{2}}{2\sigma ^{2}}} \right)$$And since $e^{a}e^{b} = e^{a+b}$:
$$L(\mu_{i}, \sigma^{2}|y_{i}) = \left( \frac{1}{\sqrt{2\pi \sigma ^{2}}} \right) ^n exp \left( {\frac{-\sum_{i}^{n}\left ( y_{i} -\mu_{i} \right )^{2}}{2\sigma ^{2}}} \right )$$Now let's take the natural logarithm of each side, mainly to help simplify the equation by separating the products into sums:
$$ln \left( L(\mu_{i}, \sigma^{2}|y_{i}) \right) = ln \left( \left( \frac{1}{\sqrt{2\pi \sigma ^{2}}} \right) ^n exp \left( {\frac{-\sum_{i}^{n}\left ( y_{i} -\mu_{i} \right )^{2}}{2\sigma ^{2}}} \right ) \right) $$And since $ln(ab) = ln(a) + ln(b)$:
$$ln \left( L(\mu_{i}, \sigma^{2}|y_{i}) \right) = ln \left( \frac{1^n}{\left( 2\pi \sigma ^{2} \right)^{n/2}} \right) + ln \left( exp \left( {\frac{-\sum_{i}^{n}\left ( y_{i} -\mu_{i} \right )^{2}}{2\sigma ^{2}}} \right ) \right) $$And since $ln(a/b) = ln(a) - ln(b)$:
$$ln \left( L(\mu_{i}, \sigma^{2}|y_{i}) \right) = ln \left( 1^{n} \right) - ln \left( \left( 2\pi \sigma ^{2} \right)^{n/2} \right) + ln \left( exp \left( {\frac{-\sum_{i}^{n}\left ( y_{i} -\mu_{i} \right )^{2}}{2\sigma ^{2}}} \right ) \right) $$And since $ln(a^{b}) = b \, ln(a)$, $ln(ab) = ln(a) + ln(b)$, $ln(1) = 0$ and $ln(exp(a)) = a$:
$$ln \left( L(\mu_{i}, \sigma^{2}|y_{i}) \right) = - \frac{n}{2} ln \left( 2\pi \right) - \frac{n}{2} ln \left( \sigma ^{2} \right) - {\frac{\sum_{i}^{n}\left ( y_{i} -\mu_{i} \right )^{2}}{2\sigma ^{2}}} $$Before we go any further, let's see if we can use this log likelihood equation to fit a line to noisy data in Python.
Python Implementation
Let's generate the same noisy data we generated in the noisy linear OLS example. We can use the same seed (100) used in the previous post to generate the same random error vector.
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import math
N = 51
b0 = 5.67
b1 = 2.18
x = np.arange(0,N,1)
y = b0 + b1*x
# generate noise and add it, y = b0 + b1*x + e
mn = 0
sd = 5
np.random.seed(100)
e = np.random.normal(mn,sd,N)
yn = y + e
# set the default size of the plots
plt.rcParams['figure.figsize'] = (12,7)
# create ax object for plotting
fig, ax = plt.subplots()
ax.plot(x,y,'r-',label='y = '+str(b0)+' + '+str(b1)+'x')
ax.plot(x,yn,'b-',label='yn = y + e, mean = 0, sd = 5.0')
# add axes labels and title
ax.set_xlabel('x'); ax.set_ylabel('y'); ax.set_title('Linear Model with Noise'); ax.legend()
Now let's code a function that returns the log likelihood value, but before we do this let's go one step further. Since $\mu_{i} = \beta_{0}+\beta_{1}x_{i}$ for our two parameter linear regression, we can write the sum as a matrix product, $\sum_{i}^{n}\left ( y_{i} -\mu_{i} \right )^{2} = (y-X\beta)^{T}(y-X\beta)$, then:
$$ln \left( L(\mu_{i}, \sigma^{2}|y_{i}) \right) = - \frac{n}{2} ln \left( 2\pi \right) - \frac{n}{2} ln \left( \sigma ^{2} \right) - {\frac{(y-X\beta)^{T}(y-X\beta)}{2\sigma ^{2}}} $$We are trying to find the parameters which define the distribution from which the residuals are sampled, by maximising the (log) likelihood value. Incidentally, the residuals are the differences between the measured data and the values predicted by our model.
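As a quick check that the matrix form really does equal the elementwise sum of squared differences, here is a minimal sketch using the noisy data generated above and an arbitrary trial value of $\beta$ (the trial values are purely illustrative):
# check that (y - Xb)^T (y - Xb) matches the elementwise sum of squared differences
b_trial = np.array([1.0, 1.0])                                 # arbitrary trial beta0, beta1
X = np.concatenate((np.ones((N,1)), x.reshape(-1,1)), axis=1)  # design matrix [1, x]
mu = b_trial[0] + b_trial[1]*x                                 # per-point means
print(np.sum((yn - mu)**2))                                    # elementwise sum
print((yn - X.dot(b_trial)).T.dot(yn - X.dot(b_trial)))        # matrix form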
OK, let's code a Python function which takes the following as optimisation parameters (these are the values we want the optimisation routine to change):
- An estimate of the mean of the noise distribution (i.e. $\beta_{0}$ and $\beta_{1}$)
- An estimate of the variance of the noise distribution (i.e. $\sigma^{2}$)
Additionally, we will pass in the following as fixed arguments, which will not be changed:
- A vector of independent terms ($x$)
- A vector of measured data ($y$)
We will pack the optimisation parameters into an array and pass them to our function, setting the initial values to 1 for all three. Since we are going to use the scipy minimise routine to minimise this function, we need to return the negative log likelihood.
def nloglikelihood(params, x, y):
    beta0, beta1, var = params  # give the array elements a name to use later in the function
    n = len(y)
    ll1 = -(n/2)*np.log(2*math.pi)
    ll2 = -(n/2)*np.log(var)
    # create matrix X from vector x
    xr = x.reshape(-1,1)  # convert 1 x N into N x 1
    X = np.concatenate((np.ones((n,1)), xr), axis=1)
    # create the beta vector
    B = np.array([beta0, beta1])
    ll3 = -((y - X.dot(B)).T).dot(y - X.dot(B))/(2*var)
    return -(ll1 + ll2 + ll3)  # factor by -1 so we can minimise this function
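As a quick sanity check (assuming scipy.stats is available), the value returned by this function should match the negative sum of the individual normal log densities; a minimal sketch with arbitrary parameter values:
from scipy.stats import norm

params = [1.0, 1.0, 1.0]                 # arbitrary beta0, beta1, sigma^2
mu = params[0] + params[1]*x             # per-point means
# negative sum of the individual log densities vs the closed-form function
print(-np.sum(norm.logpdf(yn, loc=mu, scale=np.sqrt(params[2]))))
print(nloglikelihood(params, x, yn))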
Now we can use minimize to minimise the function. We will set initial estimates of $\beta_{0}$, $\beta_{1}$ and $\sigma^2$ to 1.0 and also add some bounds to keep the beta and variance estimates positive. Adding bounds means we need to specify an optimisation method that supports them, such as 'L-BFGS-B', in scipy.optimize.minimize().
from scipy.optimize import minimize
# add some bounds to beta0, beta1 and sigma^2
bnds = ((0, None), (0, None), (0, None))
# minimise the negative log likelihood function
res = minimize(nloglikelihood, [1,1,1], args=(x,yn), method='L-BFGS-B', bounds=bnds)
# print the estimates of beta0, beta1 and sigma^2
print('beta0 beta1 sigma^2:')
print(res.x)
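To see how the fit looks, we can plot the line defined by the MLE estimates over the noisy data (a quick sketch reusing the plotting setup from earlier):
# plot the MLE fit over the noisy data
b0_mle, b1_mle, var_mle = res.x
fig, ax = plt.subplots()
ax.plot(x, yn, 'b-', label='yn = y + e (noisy data)')
ax.plot(x, b0_mle + b1_mle*x, 'k--', label='MLE fit')
ax.set_xlabel('x'); ax.set_ylabel('y'); ax.set_title('MLE Fit to Noisy Data'); ax.legend()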
Why does MLE give the same parameters as OLS?
Now we can see that our estimates of $\beta_{0}$ and $\beta_{1}$ are the same as the estimates calculated using the OLS method in a previous post. Is this a coincidence? Let's see what happens if we take the MLE equation described above, partially differentiate it with respect to $\beta$ and set the result to 0 to find its maximum:
$$ln \left( L(\mu_{i}, \sigma^{2}|y_{i}) \right) = - \frac{n}{2} ln \left( 2\pi \right) - \frac{n}{2} ln \left( \sigma ^{2} \right) - {\frac{(y-X\beta)^{T}(y-X\beta)}{2\sigma ^{2}}} $$$$\frac{\partial \left( ln \left( L(\mu_{i}, \sigma^{2}|y_{i}) \right) \right)}{\partial \beta} = \frac{\partial}{\partial \beta} \left(- \frac{n}{2} ln \left( 2\pi \right) - \frac{n}{2} ln \left( \sigma ^{2} \right) - {\frac{(y-X\beta)^{T}(y-X\beta)}{2\sigma ^{2}}} \right) = 0 $$$$- \frac{1}{2\sigma ^{2}} \frac{\partial}{\partial \beta} \left({(y-X\beta)^{T}(y-X\beta)} \right) = 0 $$$$ \frac{\partial}{\partial \beta} \left( y^{T}y - y^{T}X\beta - \beta^{T}X^{T}y + \beta^{T}X^{T}X\beta \right) = 0 $$And since $y^{T}X\beta$ is a scalar and $y^{T}X\beta = \beta^{T}X^{T}y$:
$$ \frac{\partial}{\partial \beta} \left( y^{T}y - 2\beta^{T}X^{T}y + \beta^{T}X^{T}X\beta \right) = 0 $$$$ - 2X^{T}y + 2X^{T}X\beta = 0 $$$$ X^{T}X\beta = X^{T}y $$$$ \beta = \left( X^{T}X \right)^{-1} X^{T}y $$This is the same normal equation we derived in the OLS example.
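We can confirm this numerically on the same data; a minimal sketch of the closed-form solution, which should reproduce the $\beta_{0}$ and $\beta_{1}$ returned by the MLE fit to within the optimiser's tolerance:
# closed-form OLS solution, beta = (X^T X)^-1 X^T y, on the same noisy data
X = np.concatenate((np.ones((N,1)), x.reshape(-1,1)), axis=1)
beta_ols = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(yn)
print('OLS beta0, beta1:', beta_ols)
print('MLE beta0, beta1:', res.x[:2])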
Since our errors are normally distributed, it is no coincidence that maximising the likelihood as we did above gives the same estimate of $\beta$ as OLS: both methods lead to the same equation for $\beta$.