Minimizing the RSS

Assuming that we have now selected a function class, and that the model can be written as a linear function \(f(\vvec{x})=\bm\phi(\vvec{x})^\top\bm\beta\), we need to train the parameter vector \(\bm\beta\) to fit the dataset. For that, we need to define an objective function that is small if the model \(f(\vvec{x})\) is suitable. Since our goal is to approximate the target values, we can simply measure the distance between our prediction \(f(\vvec{x})\) and the target value \(y\).

[Figure: a set of datapoints approximated by an affine model (blue), with the residuals to the targets shown as red bars.]

The plot above shows a set of datapoints that are approximated by an affine model (blue). The red bars indicate the distances of the predictions to the targets \(y_i\), plotted on the vertical axis, that is, the absolute values \(\lvert y_i - f(\vvec{x}_i)\rvert\). However, the absolute value is not easy to optimize, since it is non-differentiable at zero. Instead, we can minimize the squared distances, which gives us a smooth objective function.
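
To see this concretely, the slope of the absolute loss jumps at zero, while the slope of the squared loss varies continuously through zero. A minimal NumPy sketch (the residual values below are made up purely for illustration):

```python
import numpy as np

# Illustrative residuals r = y - f(x) around zero.
r = np.array([-0.1, -0.01, 0.0, 0.01, 0.1])

# Away from zero, the slope of |r| is sign(r): it jumps from -1 to +1,
# so the absolute loss is not differentiable at r = 0.
print(np.sign(r))   # [-1. -1.  0.  1.  1.]

# The slope of r**2 is 2r: it passes continuously through zero.
print(2 * r)        # [-0.2  -0.02  0.    0.02  0.2 ]
```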

For a linear model \(f(\vvec{x})=\bm\phi(\vvec{x})^\top\bm\beta\), the squared approximation error of \(f\) to the target values \(y_i\) can be written compactly as

\[\begin{align*} RSS(\bm{\beta}) &= \sum_{i=1}^n(y_i-f(\vvec{x}_i))^2\\ &= \sum_{i=1}^n(y_i-\bm{\phi}(\vvec{x}_i)^\top\bm{\beta})^2\\ &= \sum_{i=1}^n(y_i-X_{i\cdot}\bm{\beta})^2\\ &=\lVert \vvec{y}-X\bm{\beta}\rVert^2. \end{align*}\]

The function \(RSS(\bm{\beta})\) is called the Residual Sum of Squares. In the step above we have used the matrix \(X\), which gathers the transformed feature vectors \(\bm{\phi}(\vvec{x}_i)^\top = X_{i\cdot}\) in its rows. The matrix \(X\) is called the design matrix. Likewise, we gather the target values in the vector \(\vvec{y}\):

\[\begin{align*} X&= \begin{pmatrix} -- & \bm{\phi}(\vvec{x}_1)^\top &--\\ &\vdots&\\ --& \bm{\phi}(\vvec{x}_n)^\top &-- \end{pmatrix} \in\mathbb{R}^{n\times p},& \vvec{y}&= \begin{pmatrix} y_1\\ \vdots\\ y_n \end{pmatrix} \in\mathbb{R}^n \end{align*}\]
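
As a small sketch of these definitions, assuming NumPy and the basis \(\bm\phi(x) = (1, x)^\top\) (so the model is affine; the data values and variable names below are only for illustration), the design matrix and the RSS can be computed as follows:

```python
import numpy as np

# Illustrative 1-d inputs and targets (made-up numbers).
x = np.array([0.0, 0.5, 1.0, 1.5])
y = np.array([1.1, 1.9, 3.2, 3.8])

# Basis phi(x) = (1, x)^T, so each row of X is phi(x_i)^T.
X = np.column_stack([np.ones_like(x), x])   # design matrix, shape (n, p) = (4, 2)

def rss(beta):
    """Residual Sum of Squares ||y - X beta||^2."""
    r = y - X @ beta
    return r @ r

print(rss(np.array([1.0, 2.0])))  # RSS for the parameter guess beta = (1, 2)
```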

We can now specify the linear regression task, using linear models with basis functions and the squared Euclidean distance to measure the fit of the model.

Task (Linear Regression with Basis Functions)

Given a dataset of \(n\) observations

\[\begin{equation*}\mathcal{D}=\left\{(\vvec{x}_i,y_i)\vert \vvec{x}_i\in\mathbb{R}^{d}, y_i\in\mathbb{R}, 1\leq i \leq n\right\}\end{equation*} \]

Choose a basis function \(\bm\phi:\mathbb{R}^d\rightarrow \mathbb{R}^p\), and create the design matrix \(X\in\mathbb{R}^{n\times p}\), where \(X_{i\cdot}=\bm\phi(\vvec{x}_i)^\top\).
Find the regression vector \(\bm\beta\) by solving the following objective:

\[\begin{align*} \min_{\bm\beta} \ RSS(\bm\beta) = \lVert \vvec{y}-X\bm\beta\rVert^2 &\ \text{s.t. } \bm\beta\in\mathbb{R}^p. \end{align*}\]

Return the predictor function \(f:\mathbb{R}^d\rightarrow\mathbb{R}\), \(f(\vvec{x})=\bm\phi(\vvec{x})^\top\bm\beta\).
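
A minimal end-to-end sketch of this task in NumPy, assuming an illustrative polynomial basis \(\bm\phi(x) = (1, x, x^2)^\top\) and synthetic toy data, and solving the objective with `np.linalg.lstsq` (which minimizes \(\lVert\vvec{y}-X\bm\beta\rVert^2\) directly):

```python
import numpy as np

rng = np.random.default_rng(0)

# Dataset D = {(x_i, y_i)}: toy 1-d data, generated only for illustration.
n = 50
x = rng.uniform(-1, 1, size=n)
y = 1.0 + 2.0 * x - 0.5 * x**2 + 0.1 * rng.standard_normal(n)

# Step 1: choose a basis function phi and build the design matrix X.
def phi(x):
    return np.stack([np.ones_like(x), x, x**2], axis=-1)   # shape (n, p), p = 3

X = phi(x)

# Step 2: find beta minimizing ||y - X beta||^2.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Step 3: return the predictor f(x) = phi(x)^T beta.
def f(x_new):
    return phi(np.atleast_1d(x_new)) @ beta

print(beta)    # fitted parameter vector
print(f(0.5))  # prediction at a new input
```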

Convexity of the RSS

The RSS is a convex optimization objective, as it is the composition of a convex function (the squared \(L_2\)-norm) with an affine function, which is again convex.

Theorem 13

The function \(RSS(\bm\beta)=\lVert \vvec{y}-X\bm{\beta}\rVert^2\) is convex.

Proof. The squared \(L_2\)-norm \(\lVert\cdot\rVert^2\) is a convex function.

The composition of the convex function \(\lVert\cdot\rVert^2\) with the affine function \(g(\bm{\beta})=\vvec{y}-X\bm{\beta}\), given by \(RSS(\bm{\beta})=\lVert g(\bm{\beta})\rVert^2\), is then also convex.

As a corollary, the linear regression optimization objective

\[\begin{align*} \min_{\bm{\beta}}&\ RSS(\bm{\beta})& \text{s.t. }\bm{\beta}\in\mathbb{R}^p \end{align*}\]

is convex, since the feasible set is the vector space \(\mathbb{R}^p\), which is convex. Hence, we have an unconstrained convex optimization problem with a smooth objective function, which means that all stationary points are global minimizers. Let’s try to find all stationary points.
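
Convexity can also be sanity-checked numerically: for any two parameter vectors and any \(t\in[0,1]\), the RSS of the convex combination is at most the convex combination of the RSS values. A small sketch with random data and parameters (for illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 4))   # random design matrix (illustrative)
y = rng.standard_normal(30)

def rss(beta):
    r = y - X @ beta
    return r @ r

# Jensen-style check: RSS(t*b1 + (1-t)*b2) <= t*RSS(b1) + (1-t)*RSS(b2).
for _ in range(1000):
    b1, b2 = rng.standard_normal(4), rng.standard_normal(4)
    t = rng.uniform()
    lhs = rss(t * b1 + (1 - t) * b2)
    rhs = t * rss(b1) + (1 - t) * rss(b2)
    assert lhs <= rhs + 1e-9, "convexity inequality violated"
print("convexity inequality held on all sampled points")
```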

Minimizers of the RSS

We compute the stationary points by setting the gradient to zero. Since \(RSS(\bm{\beta}) = \lVert \vvec{y}-X\bm{\beta}\rVert^2\) is a composition of the squared norm with an affine function, its gradient is computed by the chain rule, as discussed in Computing the Gradients.

\[\begin{align*} \nabla_{\bm{\beta}} RSS(\bm{\beta})= -2X^\top(\vvec{y}-X\bm{\beta}) =0 \quad \Leftrightarrow \quad X^\top X\bm{\beta} = X^\top \vvec{y} \tag{8} \end{align*}\]
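
The gradient expression \(-2X^\top(\vvec{y}-X\bm{\beta})\) can be verified numerically with finite differences; a small sketch using random data (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((20, 3))
y = rng.standard_normal(20)
beta = rng.standard_normal(3)

def rss(b):
    r = y - X @ b
    return r @ r

# Analytic gradient from the chain rule: -2 X^T (y - X beta).
grad = -2 * X.T @ (y - X @ beta)

# Central finite differences as an independent check.
eps = 1e-6
fd = np.array([
    (rss(beta + eps * e) - rss(beta - eps * e)) / (2 * eps)
    for e in np.eye(3)
])

print(np.max(np.abs(grad - fd)))  # should be tiny (up to floating-point error)
```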

According to the FONC and the convexity of the optimization objective, the global minimizers of the regression problem are given by the set of regression parameter vectors satisfying Equation (8) above:

\[\{\bm{\beta}\in\mathbb{R}^p\mid X^\top X\bm{\beta} =X^\top\vvec{y} \}.\]

If the matrix \(X^\top X\) is invertible, then there is exactly one minimizer. In this case, we can solve the equation for \(\bm{\beta}\) by multiplying both sides with \((X^\top X)^{-1}\):

\[\bm{\beta}= (X^\top X)^{-1}X^\top\vvec{y}.\]
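
Assuming \(X^\top X\) is invertible, this closed form can be evaluated directly; a sketch comparing it with a dedicated least-squares solver (random data for illustration; in practice one solves the linear system rather than forming the inverse explicitly):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((50, 3))   # random design matrix, full column rank with probability 1
y = rng.standard_normal(50)

# Solve the normal equations X^T X beta = X^T y (preferable to inverting X^T X).
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Compare with a dedicated least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_normal, beta_lstsq))  # True: both give the unique minimizer
```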

However, there might also be infinitely many global minimizers of \(RSS(\bm{\beta})\), namely when \(X^\top X\) is not invertible.
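
For example, if two columns of \(X\) are identical, then \(X^\top X\) is singular and every \(\bm\beta\) in an affine subspace attains the same minimal RSS. A small sketch of this situation, using the pseudoinverse for one particular minimizer (the construction is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.standard_normal(30)
y = rng.standard_normal(30)

# Design matrix with a duplicated column: X^T X is singular.
X = np.column_stack([np.ones_like(x), x, x])

def rss(b):
    r = y - X @ b
    return r @ r

# One particular minimizer (minimum-norm solution via the pseudoinverse).
beta0 = np.linalg.pinv(X) @ y

# Shifting weight between the duplicated columns does not change X beta,
# so beta0 + t * (0, 1, -1) is a global minimizer for every t.
beta1 = beta0 + 5.0 * np.array([0.0, 1.0, -1.0])

print(np.isclose(rss(beta0), rss(beta1)))  # True: two different global minimizers
```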
