Ridge Regression#
Ridge Regression describes an \(L_2\)-penalized regression task.
Task (Ridge Regression)
Given a dataset of \(n\) observations with target vector \(\vvec{y}\in\mathbb{R}^n\) and the design matrix \(X\in\mathbb{R}^{n\times p}\), where \(X_{i\cdot}=\bm\phi(\vvec{x}_i)^\top\), and a regularization weight \(\lambda>0\).
Find the regression vector \(\bm\beta\) solving the following objective:
\[\min_{\bm\beta\in\mathbb{R}^p} RSS_{L_2}(\bm\beta),\quad\text{where } RSS_{L_2}(\bm\beta)=\lVert\vvec{y}-X\bm\beta\rVert^2+\lambda\lVert\bm\beta\rVert^2.\]
Return the predictor function \(f:\mathbb{R}^d\rightarrow\mathbb{R}\), \(f(\vvec{x})=\bm\phi(\vvec{x})^\top\bm\beta\).
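To make the task concrete, the following minimal Python sketch builds the involved objects for an illustrative univariate polynomial feature map (the function names and the choice of \(\bm\phi\) are purely illustrative and not part of the task definition):

```python
import numpy as np

def phi(x, degree=3):
    """Illustrative polynomial feature map phi: R -> R^(degree+1)."""
    return np.array([x**j for j in range(degree + 1)])

def design_matrix(xs, degree=3):
    """Stack the feature vectors phi(x_i)^T as the rows of the design matrix X."""
    return np.vstack([phi(x, degree) for x in xs])

def rss_l2(beta, X, y, lam):
    """The L2-penalized objective RSS_L2(beta) = ||y - X beta||^2 + lam * ||beta||^2."""
    residual = y - X @ beta
    return residual @ residual + lam * beta @ beta

def predictor(beta, degree=3):
    """Return the predictor f(x) = phi(x)^T beta."""
    return lambda x: phi(x, degree) @ beta
```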
Optimization#
First of all, we observe that the ridge regression objective is convex. The objective function \(RSS_{L_2}\) is convex, as it is a nonnegatively weighted sum of convex functions. The feasible set is \(\mathbb{R}^p\), which is convex as well. Hence, we have an unconstrained convex optimization problem, and every stationary point of the objective function is a global minimizer. We can compute the stationary points by setting the gradient to zero:
\[\nabla RSS_{L_2}(\bm\beta) = -2X^\top(\vvec{y}-X\bm\beta)+2\lambda\bm\beta = 0
\quad\Leftrightarrow\quad
(X^\top X+\lambda I)\bm\beta = X^\top\vvec{y}.\]
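The convexity argument can also be checked numerically: the Hessian of \(RSS_{L_2}\) is \(2(X^\top X+\lambda I)\), whose eigenvalues are at least \(2\lambda>0\). A small sketch on made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 3, 4, 0.1               # made-up sizes: fewer observations than parameters
X = rng.standard_normal((n, p))     # stand-in design matrix

# Hessian of RSS_L2: 2 * (X^T X + lam * I)
H = 2 * (X.T @ X + lam * np.eye(p))

# All eigenvalues are at least 2*lam > 0: the objective is strictly convex,
# so every stationary point is the global minimizer.
print(np.linalg.eigvalsh(H).min() > 0)   # True
```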
How does the regularization term change the set of solutions, compared to standard (unregularized) least squares regression?
Effect of the Regularization Weight#
The following lemma shows that the solution of Ridge Regression computed above is always unique (exactly one regression vector \(\bm\beta\) solves the ridge regression objective).
Lemma 6
For any matrix \(X\in\mathbb{R}^{n\times d}\), the matrix \(X^\top X+\lambda I\) is invertible for all \(\lambda>0\).
Proof. Let \(X=U\Sigma V^\top\) be the singular value decomposition of \(X\). Then, since \(U\) and \(V\) are orthogonal,
\[X^\top X+\lambda I = V\Sigma^\top U^\top U\Sigma V^\top + \lambda VV^\top = V(\Sigma^\top\Sigma+\lambda I)V^\top.\]
The matrix \(\Sigma^\top\Sigma +\lambda I\) is invertible, as it is a diagonal matrix where each value on the diagonal is at least as large as \(\lambda>0\). Hence, the matrix \(X^\top X+\lambda I\) is invertible and the inverse is \(V(\Sigma^\top\Sigma +\lambda I )^{-1}V^\top\).
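A quick numerical illustration of Lemma 6 on made-up data: even when \(X^\top X\) itself is singular, adding \(\lambda I\) renders the matrix invertible, and the inverse coincides with \(V(\Sigma^\top\Sigma+\lambda I)^{-1}V^\top\).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 3, 4, 0.1
X = rng.standard_normal((n, p))   # n < p, so X^T X is singular

# X^T X alone is not invertible (rank at most n < p) ...
print(np.linalg.matrix_rank(X.T @ X))          # 3

# ... but X^T X + lam*I is, and its inverse is V (S^T S + lam*I)^{-1} V^T.
U, s, Vt = np.linalg.svd(X, full_matrices=True)
d = np.concatenate([s**2, np.zeros(p - s.size)]) + lam   # diagonal of S^T S + lam*I
inv_via_svd = Vt.T @ np.diag(1.0 / d) @ Vt
inv_direct  = np.linalg.inv(X.T @ X + lam * np.eye(p))
print(np.allclose(inv_via_svd, inv_direct))    # True
```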
As a result, we can multiply Eq. (22) from the left with \((X^\top X+\lambda I)^{-1}\) and obtain the ridge regression solution.
Corollary 2
The solution to the Ridge Regression task is given by
\[\bm\beta_{L_2} = (X^\top X+\lambda I)^{-1}X^\top\vvec{y}.\]
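As a sanity check, the closed-form solution can be computed with a single linear solve; on made-up data we can verify that the gradient of the objective indeed vanishes there (a sketch, not part of the derivation):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 3, 4, 0.1
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Closed-form ridge solution beta = (X^T X + lam*I)^{-1} X^T y
beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# The gradient -2 X^T (y - X beta) + 2 lam beta vanishes at the solution,
# confirming that beta is the (unique) stationary point.
grad = -2 * X.T @ (y - X @ beta) + 2 * lam * beta
print(np.allclose(grad, 0))   # True
```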
So, for any regularization weight \(\lambda>0\) we have a unique regression solution, but for \(\lambda=0\) there can be infinitely many solutions. Yet, what kind of solution do we get for very small \(\lambda\)? Is there one solution among the infinitely many that stands out? We can indeed answer this question with “yes”, as the following theorem shows.
Theorem 16
Let \(X=U\Sigma V^\top\in\mathbb{R}^{n\times p}\) be the SVD of the design matrix of the Ridge Regression task. If only \(r<p\) singular values of \(X\) are nonzero (\(X\) has a rank of \(r\)), then the global minimizer \(\bm{\beta}_{L_2}\) converges for decreasing regularization weights \(\lambda\rightarrow 0\) to
\[\lim_{\lambda\rightarrow 0}\bm\beta_{L_2} = V\begin{pmatrix}\Sigma_r^{-1}U_r^\top\vvec{y}\\ \vvec{0}\end{pmatrix}.\]
\(\Sigma_r\) denotes here the matrix containing only the first \(r\) rows and columns of the singular value matrix \(\Sigma\), and \(U_r\) denotes the matrix containing the first \(r\) left singular vectors (the first \(r\) columns of \(U\)).
Proof. We substitute \(X\) with its SVD in Eq. (22), yielding the ridge regression solution
\[\bm\beta_{L_2} = (X^\top X+\lambda I)^{-1}X^\top\vvec{y} = V(\Sigma^\top\Sigma+\lambda I)^{-1}V^\top V\Sigma^\top U^\top\vvec{y} = V(\Sigma^\top\Sigma+\lambda I)^{-1}\Sigma^\top U^\top\vvec{y}.\]
We use the notation of Observation (Singular value matrix of rank r) to compute
\[(\Sigma^\top\Sigma+\lambda I)^{-1}\Sigma^\top = \begin{pmatrix}(\Sigma_r^2+\lambda I)^{-1}\Sigma_r & 0\\ 0 & 0\end{pmatrix}\xrightarrow{\lambda\rightarrow 0}\begin{pmatrix}\Sigma_r^{-1} & 0\\ 0 & 0\end{pmatrix}.\]
Hence, the lower \(p-r\) rows of \((\Sigma^\top\Sigma +\lambda I )^{-1}\Sigma^\top U^\top \vvec{y}\) are equal to zero, returning
\[\lim_{\lambda\rightarrow 0}\bm\beta_{L_2} = \lim_{\lambda\rightarrow 0} V(\Sigma^\top\Sigma+\lambda I)^{-1}\Sigma^\top U^\top\vvec{y} = V\begin{pmatrix}\Sigma_r^{-1}U_r^\top\vvec{y}\\ \vvec{0}\end{pmatrix}.\]
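The convergence can be observed numerically on made-up data: the limit \(V\begin{pmatrix}\Sigma_r^{-1}U_r^\top\vvec{y}\\ \vvec{0}\end{pmatrix}\) equals the Moore–Penrose pseudoinverse solution, and the ridge solutions approach it as \(\lambda\) shrinks (a sketch, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 3, 4                      # stand-in sizes with rank r = n < p
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Limit predicted by the theorem: the pseudoinverse (minimum-norm least-squares) solution.
beta_limit = np.linalg.pinv(X) @ y

# Ridge solutions for decreasing lambda approach this limit.
for lam in [1e-1, 1e-3, 1e-6]:
    beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    print(lam, np.linalg.norm(beta_ridge - beta_limit))  # distance shrinks towards 0
```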
Let’s look again at the plot of Example 17. The regression function that we learn has four parameters, but we have only three data points. Hence, we have \(p-r=1\), and our regression solution vectors are computed as \(\bm\beta=V\vvec{w}\), where \(\vvec{w}=\begin{pmatrix}\Sigma_r^{-1} U_r^\top \vvec{y}\\ w_4\end{pmatrix}\) and \(w_4\in\mathbb{R}\) is free. If \(w_4=0\), then the resulting \(\bm\beta\) is the one that ridge regression converges to when \(\lambda\rightarrow 0\). We now plot the resulting regression functions for various values of \(w_4\) and get the following:
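The data of Example 17 is not reproduced here, but the following sketch indicates how such a plot can be generated on stand-in data (three points, cubic polynomial features, hence \(p=4\) and \(r=3\)): every choice of \(w_4\) yields a function that fits the three points exactly, and \(w_4=0\) gives the solution that ridge regression approaches for \(\lambda\rightarrow 0\).

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in data (the actual data of Example 17 is not reproduced here):
# three observations, cubic polynomial features -> p = 4 parameters, r = 3.
xs = np.array([-1.0, 0.2, 1.0])
ys = np.array([0.5, -0.3, 1.0])
X = np.vander(xs, 4, increasing=True)    # rows phi(x_i)^T = (1, x_i, x_i^2, x_i^3)

U, s, Vt = np.linalg.svd(X, full_matrices=True)
V, r = Vt.T, s.size                      # r = 3, so p - r = 1 free direction

grid = np.linspace(-1.5, 1.5, 200)
Phi = np.vander(grid, 4, increasing=True)

for w4 in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    w = np.concatenate([(U[:, :r].T @ ys) / s, [w4]])   # (Sigma_r^{-1} U_r^T y, w4)
    beta = V @ w
    label = r"$w_4=0$ (ridge limit)" if w4 == 0 else rf"$w_4={w4}$"
    plt.plot(grid, Phi @ beta, label=label)

plt.scatter(xs, ys, color="black", zorder=3)  # every curve passes through the data
plt.legend()
plt.show()
```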