Proofs involving ordinary least squares


The purpose of this page is to provide supplementary materials for the Ordinary least squares article, offloading the detailed mathematics from the main article and improving its accessibility, while at the same time retaining the completeness of exposition.

Least squares estimator for β

Using matrix notation, the sum of squared residuals is given by

S(b) = (y-Xb)'(y-Xb) \,

where ' denotes the matrix transpose.

Since this is a quadratic expression, the vector which gives the global minimum may be found via matrix calculus by differentiating with respect to the vector b' and setting the result equal to zero:

 0 = \frac{dS}{db'}(\hat\beta) = \frac{d}{db'}\bigg(y'y - b'X'y - y'Xb + b'X'Xb\bigg)\bigg|_{b=\hat\beta} = -2X'y + 2X'X\hat\beta

By assumption matrix X has full column rank, and therefore X'X is invertible and the least squares estimator for β is given by

 \hat\beta = (X'X)^{-1}X'y \,
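As a quick sanity check of this closed-form expression, the following Python sketch (the data, and names such as X, y, and beta_hat, are purely illustrative) computes the estimator via the normal equations and compares it with NumPy's least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated design matrix with an intercept column and two regressors
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Closed-form OLS estimator: beta_hat = (X'X)^{-1} X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from NumPy's least-squares routine
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat)
print(np.allclose(beta_hat, beta_lstsq))  # True
```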

Unbiasedness and variance of \hat\beta

Plug y = Xβ + ε into the formula for \hat\beta and then use the law of iterated expectations:

\begin{align}
\operatorname{E}[\,\hat\beta\,] &= \operatorname{E}\Big[(X'X)^{-1}X'(X\beta+\varepsilon)\Big] \\
&= \beta + \operatorname{E}\Big[(X'X)^{-1}X'\varepsilon\Big] \\
&= \beta + \operatorname{E}\Big[\operatorname{E}\big[(X'X)^{-1}X'\varepsilon \,\big|\, X \big]\Big] \\
&= \beta + \operatorname{E}\Big[(X'X)^{-1}X'\operatorname{E}[\varepsilon\,|\,X]\Big] \\
&= \beta,
\end{align}


where E[ε|X] = 0 by the assumptions of the model.

For the variance, let  \sigma^2 I denote the covariance matrix of \varepsilon. Then,

\begin{align}
\operatorname{E}\big[\,(\hat\beta - \beta)(\hat\beta - \beta)'\,\big] &= \operatorname{E}\Big[ \big((X'X)^{-1}X'\varepsilon\big)\big((X'X)^{-1}X'\varepsilon\big)' \Big] \\
&= \operatorname{E}\Big[ (X'X)^{-1}X'\varepsilon\varepsilon'X(X'X)^{-1} \Big] \\
&= \operatorname{E}\Big[ (X'X)^{-1}X'\,\sigma^2 I\,X(X'X)^{-1} \Big] \\
&= \operatorname{E}\Big[ \sigma^2(X'X)^{-1}X'X(X'X)^{-1} \Big] \\
&= \sigma^2 (X'X)^{-1},
\end{align}


where we used the fact that \hat\beta - \beta is just an affine transformation of \varepsilon by the matrix (X'X)^{-1}X' (see the article on the multivariate normal distribution under the affine transformation section).
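Both results can be checked by a short Monte Carlo simulation. The sketch below (a minimal illustration, assuming a fixed design matrix and i.i.d. normal errors; all names are hypothetical) averages \hat\beta over many simulated error draws and compares its empirical covariance with σ²(X'X)^{-1}.

```python
import numpy as np

rng = np.random.default_rng(1)

n, sigma = 100, 0.5
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # fixed design
beta = np.array([1.0, 2.0])
XtX_inv = np.linalg.inv(X.T @ X)

# Draw many independent error vectors and re-estimate beta each time
reps = 20000
estimates = np.empty((reps, 2))
for r in range(reps):
    eps = rng.normal(scale=sigma, size=n)
    y = X @ beta + eps
    estimates[r] = XtX_inv @ X.T @ y

print(estimates.mean(axis=0))           # close to beta -> unbiasedness
print(np.cov(estimates, rowvar=False))  # close to sigma^2 (X'X)^{-1}
print(sigma**2 * XtX_inv)
```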

For a simple linear regression model, where \beta = [\beta_0,\beta_1]^T (\beta_0 is the y-intercept and \beta_1 is the slope), one obtains


\begin{align}
\sigma^2 (X'X)^{-1} &= \sigma^2 \left(\sum_{i=1}^n x_ix_i'\right)^{-1} \\
&= \sigma^2 \left(\sum_{i=1}^n (1,\,x_i)'\,(1,\,x_i) \right)^{-1} \\
&= \sigma^2 \left(\sum_{i=1}^n \begin{pmatrix} 1 & x_i \\ x_i & x_i^2 \end{pmatrix} \right)^{-1} \\
&= \sigma^2 \begin{pmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{pmatrix}^{-1} \\
&= \sigma^2 \cdot \frac{1}{n\sum x_i^2-\left(\sum x_i\right)^2}\begin{pmatrix} \sum x_i^2 & -\sum x_i \\ -\sum x_i & n \end{pmatrix} \\
&= \sigma^2 \cdot \frac{1}{n\sum_{i=1}^n (x_i - \bar{x})^2}\begin{pmatrix} \sum x_i^2 & -\sum x_i \\ -\sum x_i & n \end{pmatrix},
\end{align}


so that the variance of the estimated slope coefficient is

\operatorname{Var}(\hat\beta_1) = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2}.
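The simple-regression formula can be verified directly against the matrix expression; in the sketch below (hypothetical regressor values and an arbitrary σ²) the lower-right entry of σ²(X'X)^{-1} is compared with σ²/Σ(x_i − x̄)².

```python
import numpy as np

rng = np.random.default_rng(2)

x = rng.uniform(0, 10, size=50)            # arbitrary regressor values
sigma2 = 1.7                               # arbitrary error variance
X = np.column_stack([np.ones_like(x), x])  # design matrix rows (1, x_i)

cov_beta = sigma2 * np.linalg.inv(X.T @ X)
slope_var_matrix = cov_beta[1, 1]
slope_var_formula = sigma2 / np.sum((x - x.mean())**2)

print(np.isclose(slope_var_matrix, slope_var_formula))  # True
```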


Expected value of \hat\sigma^2

First we plug the expression for y into the estimator, and use the fact that X'M = MX = 0 (the matrix M projects onto the space orthogonal to X):

\hat\sigma^2 = \tfrac{1}{n}y'My = \tfrac{1}{n} (X\beta+\varepsilon)'M(X\beta+\varepsilon) = \tfrac{1}{n} \varepsilon'M\varepsilon


Now we can recognize ε'Mε as a 1×1 matrix, which is therefore equal to its own trace. This is useful because, by the cyclic property of the trace operator, tr(AB) = tr(BA), and we can use this to separate the disturbance ε from the matrix M, which is a function of the regressors X:

\operatorname{E}[\hat\sigma^2] = \tfrac{1}{n}\operatorname{E}\big[\operatorname{tr}(\varepsilon'M\varepsilon)\big] = \tfrac{1}{n}\operatorname{E}\big[\operatorname{tr}(M\varepsilon\varepsilon')\big] = \tfrac{1}{n}\operatorname{tr}\big(\operatorname{E}[M\varepsilon\varepsilon']\big)


Using the law of iterated expectations, this can be written as

\operatorname{E}[\hat\sigma^2] = \tfrac{1}{n}\operatorname{tr}\Big(\operatorname{E}\big[M\,\operatorname{E}[\varepsilon\varepsilon'\,|\,X]\big]\Big) = \tfrac{1}{n}\operatorname{tr}\big(\operatorname{E}[\sigma^2 M I]\big) = \tfrac{1}{n}\sigma^2\operatorname{E}\big[\operatorname{tr} M\big]


Recall that M = I − P, where P is the projection onto the linear space spanned by the columns of X. By the properties of a projection matrix, P has p = rank(X) eigenvalues equal to 1, and all its other eigenvalues are equal to 0. The trace of a matrix equals the sum of its eigenvalues, so tr(P) = p and tr(M) = n − p. Therefore,

\operatorname{E}\,\hat\sigma^2 = \frac{n-p}{n} \sigma^2

Note: in the section “Maximum likelihood approach” below we show that, under the additional assumption that the errors are normally distributed, the rescaled estimator n\hat\sigma^2/\sigma^2 follows a chi-squared distribution with n − p degrees of freedom, from which this formula for the expected value would follow immediately. However, the result derived in this section is valid regardless of the distribution of the errors, and is thus important in its own right.
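The shrinkage factor (n − p)/n can also be seen in a short simulation. The sketch below is only illustrative (normal errors are assumed for convenience, even though the result does not require normality, and all names are hypothetical); it averages \hat\sigma^2 over many samples generated from a fixed design matrix.

```python
import numpy as np

rng = np.random.default_rng(3)

n, p, sigma = 30, 3, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # rank p
beta = rng.normal(size=p)
P = X @ np.linalg.inv(X.T @ X) @ X.T     # projection onto col(X)
M = np.eye(n) - P                        # annihilator matrix

reps = 50000
sigma2_hat = np.empty(reps)
for r in range(reps):
    eps = rng.normal(scale=sigma, size=n)
    y = X @ beta + eps
    sigma2_hat[r] = (y @ M @ y) / n      # 1/n * y'My

print(sigma2_hat.mean())        # close to (n - p)/n * sigma^2
print((n - p) / n * sigma**2)   # = 0.9
```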

Consistency and asymptotic normality of \hat\beta

The estimator \hat\beta can be written as

\hat\beta = \Big(\tfrac{1}{n}X'X\Big)^{-1}\tfrac{1}{n}X'y = \beta + \Big(\tfrac{1}{n}X'X\Big)^{-1}\tfrac{1}{n}X'\varepsilon = \beta\; + \;\bigg(\frac{1}{n}\sum_{i=1}^n x_ix_i'\bigg)^{\!-1} \bigg(\frac{1}{n}\sum_{i=1}^n x_i\varepsilon_i\bigg)

We can use the law of large numbers to establish that

\frac{1}{n}\sum_{i=1}^n x_ix_i'\ \xrightarrow{p}\ \operatorname{E}[x_ix_i'] = Q_{xx}, \qquad \frac{1}{n}\sum_{i=1}^n x_i\varepsilon_i\ \xrightarrow{p}\ \operatorname{E}[x_i\varepsilon_i] = 0

By Slutsky's theorem and the continuous mapping theorem these results can be combined to establish consistency of the estimator \hat\beta:

\hat\beta\ \xrightarrow{p}\ \beta + Q_{xx}^{-1}\cdot 0 = \beta


The central limit theorem tells us that

\frac{1}{\sqrt{n}}\sum_{i=1}^n x_i\varepsilon_i\ \xrightarrow{d}\ \mathcal{N}\big(0,\, V\big),

where V = \operatorname{Var}[x_i\varepsilon_i] = \operatorname{E}[\,\varepsilon_i^2 x_ix_i'\,] = \operatorname{E}\big[\,\operatorname{E}[\varepsilon_i^2\,|\,x_i]\; x_ix_i'\,\big] = \sigma^2 Q_{xx}

Applying Slutsky's theorem again, we have

\sqrt{n}(\hat\beta-\beta) = \bigg(\frac{1}{n}\sum_{i=1}^n x_ix_i'\bigg)^{\!-1} \bigg(\frac{1}{\sqrt{n}}\sum_{i=1}^n x_i\varepsilon_i\bigg)\ \xrightarrow{d}\ Q_{xx}^{-1}\cdot\mathcal{N}\big(0,\, \sigma^2 Q_{xx}\big) = \mathcal{N}\big(0,\, \sigma^2 Q_{xx}^{-1}\big)
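This limiting distribution can be illustrated by simulation. In the sketch below (illustrative names; the regressors and errors are drawn independently so that E[x_iε_i] = 0, and Q_xx is known in closed form for this design), the empirical covariance of √n(β̂ − β) over many samples is compared with σ²Q_xx^{-1}.

```python
import numpy as np

rng = np.random.default_rng(4)

n, sigma = 500, 1.0
beta = np.array([1.0, -2.0])

reps = 5000
scaled_errors = np.empty((reps, 2))
for r in range(reps):
    X = np.column_stack([np.ones(n), rng.normal(size=n)])  # x_i = (1, z_i)'
    eps = rng.normal(scale=sigma, size=n)
    y = X @ beta + eps
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    scaled_errors[r] = np.sqrt(n) * (beta_hat - beta)

Q_xx = np.array([[1.0, 0.0], [0.0, 1.0]])   # E[x_i x_i'] when z_i ~ N(0, 1)
print(np.cov(scaled_errors, rowvar=False))  # close to sigma^2 * Q_xx^{-1}
print(sigma**2 * np.linalg.inv(Q_xx))
```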


Maximum likelihood approach

Maximum likelihood estimation is a generic technique for estimating the unknown parameters in a statistical model by constructing a log-likelihood function corresponding to the joint distribution of the data, then maximizing this function over all possible parameter values. In order to apply this method, we have to make an assumption about the distribution of y given X so that the log-likelihood function can be constructed. The connection of maximum likelihood estimation to OLS arises when this distribution is modeled as a multivariate normal.

Specifically, assume that the errors ε have a multivariate normal distribution with mean 0 and variance matrix σ^2 I. Then the distribution of y conditional on X is

y|X\ \sim\ \mathcal{N}(X\beta,\, \sigma^2I)

and the log-likelihood function of the data will be

\begin{align}
\mathcal{L}(\beta,\sigma^2\,|\,X) &= \ln\bigg( \frac{1}{(2\pi)^{n/2}(\sigma^2)^{n/2}}\, e^{ -\frac{1}{2}(y-X\beta)'(\sigma^2 I)^{-1}(y-X\beta) } \bigg) \\
&= -\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}(y-X\beta)'(y-X\beta)
\end{align}

Differentiating this expression with respect to β and σ^2 and setting the derivatives to zero yields the ML estimates of these parameters:

\begin{align}
& \frac{\partial\mathcal{L}}{\partial\beta'} = -\frac{1}{2\sigma^2}\Big(-2X'y + 2X'X\beta\Big) = 0 \quad\Rightarrow\quad \hat\beta = (X'X)^{-1}X'y \\
& \frac{\partial\mathcal{L}}{\partial\sigma^2} = -\frac{n}{2}\,\frac{1}{\sigma^2} + \frac{1}{2\sigma^4}(y-X\beta)'(y-X\beta) = 0 \quad\Rightarrow\quad \hat\sigma^2 = \frac{1}{n}(y-X\hat\beta)'(y-X\hat\beta) = \frac{1}{n}S(\hat\beta)
\end{align}

We can check that this is indeed a maximum by looking at the Hessian matrix of the log-likelihood function.
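As an additional check, the log-likelihood can be maximized numerically and compared with the closed-form expressions above. This is only a minimal sketch (it assumes SciPy is available; the data and parameter names are hypothetical, and σ² is optimized on the log scale to keep it positive).

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)

n = 80
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, 1.5])
y = X @ beta_true + rng.normal(scale=0.8, size=n)

def neg_log_likelihood(theta):
    # theta = (beta_0, beta_1, log sigma^2)
    beta, log_s2 = theta[:2], theta[2]
    resid = y - X @ beta
    return 0.5 * (n * np.log(2 * np.pi) + n * log_s2
                  + resid @ resid / np.exp(log_s2))

res = minimize(neg_log_likelihood, x0=np.zeros(3), method="BFGS")

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
sigma2_hat = np.sum((y - X @ beta_hat)**2) / n

print(res.x[:2], beta_hat)           # numerically maximized vs closed form
print(np.exp(res.x[2]), sigma2_hat)  # MLE of sigma^2 both ways
```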

Finite sample distribution

Since we have assumed in this section that the distribution of the error terms is known to be normal, it becomes possible to derive explicit expressions for the distributions of the estimators \hat\beta and \hat\sigma^2:

\hat\beta = (X'X)^{-1}X'y = (X'X)^{-1}X'(X\beta+\varepsilon) = \beta + (X'X)^{-1}X'\mathcal{N}(0,\sigma^2I)

so that by the affine transformation properties of multivariate normal distribution

\hat\beta|X\ \sim\ \mathcal{N}(\beta,\, \sigma^2(X'X)^{-1}).

Similarly the distribution of \hat\sigma^2 follows from

\begin{align}
\hat\sigma^2 &= \tfrac{1}{n}\big(y-X(X'X)^{-1}X'y\big)'\big(y-X(X'X)^{-1}X'y\big) \\
&= \tfrac{1}{n}(My)'My \\
&= \tfrac{1}{n}(X\beta+\varepsilon)'M(X\beta+\varepsilon) \\
&= \tfrac{1}{n}\varepsilon'M\varepsilon,
\end{align}

where M = I − X(X'X)^{-1}X' is the symmetric projection matrix onto the subspace orthogonal to X, so that MX = X'M = 0. We argued above that this matrix has rank n − p, and thus by the properties of the chi-squared distribution,

\tfrac{n}{\sigma^2}\hat\sigma^2\,\big|\,X = (\varepsilon/\sigma)'M(\varepsilon/\sigma)\ \sim\ \chi^2_{n-p}
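This distributional claim can be checked by simulation; the sketch below (illustrative names, normal errors, fixed design) verifies that the simulated values of nσ̂²/σ² have mean close to n − p and variance close to 2(n − p), the moments of a χ²_{n−p} distribution.

```python
import numpy as np

rng = np.random.default_rng(6)

n, p, sigma = 25, 4, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
M = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T   # annihilator of col(X)

reps = 50000
stats = np.empty(reps)
for r in range(reps):
    eps = rng.normal(scale=sigma, size=n)
    stats[r] = (eps @ M @ eps) / sigma**2          # = n * sigma_hat^2 / sigma^2

print(stats.mean(), n - p)          # chi^2_{n-p} has mean n - p
print(stats.var(), 2 * (n - p))     # and variance 2(n - p)
```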


Moreover, the estimators \hat\beta and \hat\sigma^2 turn out to be independent (conditional on X), a fact which is fundamental for the construction of the classical t- and F-tests. The independence can be seen as follows: the estimator \hat\beta represents the coefficients of the vector decomposition of \hat{y} = X\hat\beta = Py = X\beta + P\varepsilon in the basis of the columns of X, so \hat\beta is a function of P\varepsilon. At the same time, the estimator \hat\sigma^2 is the squared norm of the vector M\varepsilon divided by n, and thus is a function of M\varepsilon. Now, the random variables (Pε, Mε) are jointly normal as a linear transformation of ε, and they are also uncorrelated because PM = 0. By the properties of the multivariate normal distribution, this means that Pε and Mε are independent, and therefore the estimators \hat\beta and \hat\sigma^2 will be independent as well.
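The independence can also be illustrated numerically. In the sketch below (illustrative names; normal errors and a fixed design matrix are assumed), the sample correlation between each coefficient estimate and \hat\sigma^2 across simulated samples is close to zero, consistent with independence.

```python
import numpy as np

rng = np.random.default_rng(7)

n, sigma = 40, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # fixed design
beta = np.array([1.0, 0.5])
XtX_inv = np.linalg.inv(X.T @ X)

reps = 20000
beta_hats = np.empty((reps, 2))
sigma2_hats = np.empty(reps)
for r in range(reps):
    eps = rng.normal(scale=sigma, size=n)
    y = X @ beta + eps
    b = XtX_inv @ X.T @ y
    beta_hats[r] = b
    sigma2_hats[r] = np.sum((y - X @ b)**2) / n

# Correlation of each coefficient with sigma^2_hat should be near zero
for j in range(2):
    print(np.corrcoef(beta_hats[:, j], sigma2_hats)[0, 1])
```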