Jordan Bryan

Estimating signal and noise in linear models

Let \(\mathbf{Y} \in \mathbb{R}^n\), \(\mathbf{X} \in \mathbb{R}^{n \times p}\), and \(\boldsymbol{\beta} \in \mathbb{R}^p\), and consider the model

\[\begin{aligned} \mathbf{Y} | \boldsymbol{\beta} &\sim N(\mathbf{X} \boldsymbol{\beta}, \tau^2 \mathbf{I}) \\ \boldsymbol{\beta} &\sim N(0, \psi^2 \mathbf{I}) \end{aligned}\]

The parameters \(\psi^2\) and \(\tau^2\) represent, respectively, the variation due to “signal” – i.e. that coming from the linear relationship between \(\mathbf{Y}\) and \(\mathbf{X}\) – and the variation due to noise, which cannot be explained by \(\mathbf{X}\).

Under this model, the induced marginal density on \(\mathbf{Y}\) is

\[p(\mathbf{Y}) = \frac{1}{(2 \pi)^{n/2} |\mathbf{A}|^{1/2}} \exp \left\{-\frac{\mathbf{Y}^T \mathbf{A}^{-1} \mathbf{Y}}{2}\right\}\]

where the marginal covariance of \(\mathbf{Y}\) is

\[\mathbf{A} = \psi^2 \mathbf{X} \mathbf{X}^T + \tau^2 \mathbf{I}\]

As long as the rows of \(\mathbf{X}\) are not orthonormal – which would make \(\mathbf{X}\mathbf{X}^T\) equal to the identity, so that only the sum \(\psi^2 + \tau^2\) would be identified – variation due to \(\psi^2\) and variation due to \(\tau^2\) can be disambiguated. In fact, under certain conditions, maximum likelihood estimators for \(\psi^2\) and \(\tau^2\) are available, even when \(p > n\).
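As a quick sanity check, the marginal covariance \(\mathbf{A}\) can be compared against a Monte Carlo estimate obtained by sampling from the hierarchical model. This is a minimal numpy sketch; the dimensions, seed, and parameter values are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 3
psi2, tau2 = 2.0, 0.5  # arbitrary signal and noise variances
X = rng.standard_normal((n, p))

# Marginal covariance of Y: A = psi^2 X X^T + tau^2 I
A = psi2 * X @ X.T + tau2 * np.eye(n)

# Monte Carlo check: beta ~ N(0, psi^2 I), then Y | beta ~ N(X beta, tau^2 I)
m = 200_000
beta = np.sqrt(psi2) * rng.standard_normal((p, m))
Y = X @ beta + np.sqrt(tau2) * rng.standard_normal((n, m))

# Empirical covariance (the marginal mean of Y is exactly zero)
A_hat = Y @ Y.T / m
print(np.max(np.abs(A_hat - A)))  # small, shrinking as m grows
```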

The log-likelihood in terms of \(\psi^2\) and \(\tau^2\) is

\[\ell(\tau^2, \psi^2) = -\frac{1}{2} \left( n \log 2\pi + \log |\mathbf{A}| + \mathbf{Y}^T \mathbf{A}^{-1} \mathbf{Y} \right)\]

Differentiating the log-likelihood with respect to \(\psi^2\) and \(\tau^2\) relies on Jacobi’s Formula for the derivative of a matrix determinant as well as differentiation with respect to the matrix in a quadratic form. After simplifying the resulting expressions, the partial derivatives are

\[\frac{\partial \ell}{\partial \psi^2} = -\frac{1}{2} \left(\text{tr}(\mathbf{X}^T\mathbf{A}^{-1} \mathbf{X}) - \mathbf{Y}^T \mathbf{A}^{-1} \mathbf{X}\mathbf{X}^T \mathbf{A}^{-1} \mathbf{Y} \right)\]


\[\frac{\partial \ell}{\partial \tau^2} = -\frac{1}{2} \left(\text{tr}(\mathbf{A}^{-1}) - \mathbf{Y}^T\mathbf{A}^{-2} \mathbf{Y} \right)\]

Setting both of these equal to zero produces the system

\[\begin{aligned} \text{tr}(\mathbf{X}^T\mathbf{A}^{-1} \mathbf{X}) &= \mathbf{Y}^T \mathbf{A}^{-1} \mathbf{X}\mathbf{X}^T \mathbf{A}^{-1} \mathbf{Y} \\ \text{tr}(\mathbf{A}^{-1}) &= \mathbf{Y}^T \mathbf{A}^{-2} \mathbf{Y} \\ \end{aligned}\]
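The two score expressions can be verified numerically against finite differences of the log-likelihood. A sketch, with arbitrary data and parameter values chosen for the check:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 6, 4
X = rng.standard_normal((n, p))
Y = rng.standard_normal(n)

def loglik(tau2, psi2):
    """Marginal log-likelihood of Y under A = psi^2 X X^T + tau^2 I."""
    A = psi2 * X @ X.T + tau2 * np.eye(n)
    _, logdet = np.linalg.slogdet(A)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + Y @ np.linalg.solve(A, Y))

tau2, psi2 = 0.7, 1.3
A = psi2 * X @ X.T + tau2 * np.eye(n)
Ainv = np.linalg.inv(A)

# Analytic partial derivatives from the expressions above
d_psi2 = -0.5 * (np.trace(X.T @ Ainv @ X) - Y @ Ainv @ X @ X.T @ Ainv @ Y)
d_tau2 = -0.5 * (np.trace(Ainv) - Y @ Ainv @ Ainv @ Y)

# Central finite differences of the log-likelihood
h = 1e-6
fd_psi2 = (loglik(tau2, psi2 + h) - loglik(tau2, psi2 - h)) / (2 * h)
fd_tau2 = (loglik(tau2 + h, psi2) - loglik(tau2 - h, psi2)) / (2 * h)
print(abs(d_psi2 - fd_psi2), abs(d_tau2 - fd_tau2))  # both tiny
```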

Letting \(\mathbf{X}\mathbf{X}^T = \mathbf{Q} \mathbf{D} \mathbf{Q}^T\) be the orthogonal eigendecomposition of the symmetric matrix \(\mathbf{X}\mathbf{X}^T\), \(\mathbf{A}\) may be written as

\[\mathbf{A} = \mathbf{Q} (\tau^2 \mathbf{I} + \psi^2 \mathbf{D})\mathbf{Q}^T\]

from which it is clear that the eigenvalues of \(\mathbf{A}\) are

\[\tau^2 + \psi^2 s_i^2, ~~~~~ i = 1, \dots, n\]

where \(s_i\) is the \(i^\text{th}\) singular value of \(\mathbf{X}\). The system of equations giving the maximum likelihood estimators of \(\psi^2\) and \(\tau^2\) then becomes

\[\begin{aligned} \sum_{i=1}^n \frac{1}{\tau^2 + \psi^2 s_i^2} &= \sum_{i=1}^n \frac{||\mathbf{Q}_i^T \mathbf{Y}||_2^2}{(\tau^2 + \psi^2 s_i^2)^2} \\ \sum_{i=1}^n \frac{s_i^2}{\tau^2 + \psi^2 s_i^2} &= \sum_{i=1}^n \frac{s_i^2 ||\mathbf{Q}_i^T \mathbf{Y}||_2^2}{(\tau^2 + \psi^2 s_i^2)^2} \\ \end{aligned}\]
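In this eigenbasis the likelihood depends on the data only through \(d_i = s_i^2\) and \(\mathbf{Q}_i^T \mathbf{Y}\), so the maximum likelihood estimators can be found by a two-parameter numerical optimization even when no closed form is available. A sketch using scipy; the \(1/\sqrt{p}\) scaling of \(\mathbf{X}\) and the log-scale parameterization are choices made for this example, not part of the derivation:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n, p = 50, 100  # p > n is fine here
psi2_true, tau2_true = 1.5, 0.5
X = rng.standard_normal((n, p)) / np.sqrt(p)  # scale so X X^T has O(1) eigenvalues
beta = np.sqrt(psi2_true) * rng.standard_normal(p)
Y = X @ beta + np.sqrt(tau2_true) * rng.standard_normal(n)

# Eigendecomposition of X X^T: d_i = s_i^2, z_i = Q_i^T Y
d, Q = np.linalg.eigh(X @ X.T)
z2 = (Q.T @ Y) ** 2

def negloglik(theta):
    """Negative log-likelihood (up to a constant) in the eigenbasis of X X^T."""
    tau2, psi2 = np.exp(theta)   # optimize on the log scale to keep both positive
    ev = tau2 + psi2 * d         # eigenvalues of A
    return 0.5 * np.sum(np.log(ev) + z2 / ev)

res = minimize(negloglik, x0=np.log([1.0, 1.0]), method="Nelder-Mead")
tau2_hat, psi2_hat = np.exp(res.x)
print(tau2_hat, psi2_hat)
```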

If the columns of \(\mathbf{X}\) are orthonormal, then \(\mathbf{X}\mathbf{X}^T\) is the orthogonal projection matrix onto the column space of \(\mathbf{X}\) (denoted \(\mathbf{P}_{\mathbf{X}}\)) and its eigenvalues are equal to either \(0\) or \(1\).

Assuming \(n > p\), \(\mathbf{X}\) has \(n - p\) singular values equal to \(0\). Hence, \(\mathbf{A}\) has the eigenvalue \(\tau^2\) with multiplicity \(n - p\), and the corresponding eigenvectors form an orthonormal basis for the orthogonal complement \(\mathcal{C}(\mathbf{X})^\perp\). Thus, subtracting the second equation from the first in the system above yields an explicit solution for \(\hat{\tau}^2\):

\[\hat{\tau}^2 = \frac{\mathbf{Y}^T(\mathbf{I} - \mathbf{P}_{\mathbf{X}})\mathbf{Y}}{n - p}\]

Plugging this value for \(\tau^2\) into the second equation in the system yields

\[\hat{\psi}^2 = \frac{\mathbf{Y}^T \mathbf{P}_{\mathbf{X}} \mathbf{Y}}{p} - \frac{\mathbf{Y}^T(\mathbf{I} - \mathbf{P}_{\mathbf{X}})\mathbf{Y}}{n - p}\]
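These closed forms are easy to check by simulation: \(\hat{\tau}^2\) is the familiar residual variance estimator, and \(\hat{\psi}^2\) subtracts it from the average variation explained by \(\mathcal{C}(\mathbf{X})\). A sketch, with arbitrary dimensions and parameter values:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 50
psi2_true, tau2_true = 2.0, 1.0

# Orthonormal columns via QR, so that X X^T is the projection P_X
X, _ = np.linalg.qr(rng.standard_normal((n, p)))
beta = np.sqrt(psi2_true) * rng.standard_normal(p)
Y = X @ beta + np.sqrt(tau2_true) * rng.standard_normal(n)

P = X @ X.T  # orthogonal projection onto C(X)
tau2_hat = Y @ (np.eye(n) - P) @ Y / (n - p)
psi2_hat = Y @ P @ Y / p - tau2_hat
print(tau2_hat, psi2_hat)  # roughly (1.0, 2.0) up to sampling error
```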

If \(p = n\) and the columns of \(\mathbf{X}\) are still orthonormal, the system of equations does not yield a unique solution (the first and the second equations become identical when all singular values are equal to \(1\)). If \(p > n\) and \(\mathbf{X}\) has full row rank, then \(\mathbf{X}\mathbf{X}^T\) has no eigenvalues equal to \(0\), and the system will have a solution as long as the eigenvalues are not all equal.

This suggests a close connection between the distribution of the singular values of \(\mathbf{X}\) and the feasibility of disambiguating \(\psi^2\) and \(\tau^2\).