This last section is meant to show you that while the theory of linear models is quite old, its ideas are very much at the forefront of modern Statistics.
We will now discuss fitting more general functions to data. We will allow infinite-dimensional models. We will still use least squares, so the problem becomes the following: find a function f that minimizes
\[\Phi(\pmb{y,X},f)=\sum_{i=1}^n (y_i-f(\pmb{x}_i))^2\] Of course in this generality we can always find a function f so that \(\Phi(\pmb{y,X},f)=0\). For example, if \(\pmb{X}\) is \(n\times 1\) with distinct values one can always find a polynomial of degree at most \(n-1\) such that \(p(x_i)=y_i\) for all i.
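For instance, here is a quick numerical check with some made-up data (seven points, so a polynomial of degree six is enough):

# made-up data: 7 points with distinct x values
set.seed(1)
x = 1:7
y = round(rnorm(7), 2)
# a polynomial of degree n-1 = 6 fits the n points exactly
fit = lm(y ~ poly(x, 6))
sum((y - fitted(fit))^2)   # essentially 0 (up to floating point error)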
One way to make this work is to restrict the set of functions over which to minimize, for example to polynomials of some fixed small degree.
A more general approach is to use instead a criterion of the form
\[\Phi(\pmb{y,X},f)=\sum_{i=1}^n (y_i-f(\pmb{x}_i))^2+\lambda\Psi(f)\]
where \(\Psi\) is some penalty function and \(\lambda\) a parameter that allows us to determine the amount of penalty. This idea is called regularization. Notice that if \(\lambda=0\) we are back to least squares, so this is a generalization of least squares.
The idea of regularization is useful even in standard linear regression. Sometimes it is a good idea to make sure the coefficients do not get too large, and so we might minimize
\[\Phi(\pmb{y,X},\pmb{\beta})=\sum_i (y_i-(\pmb{X\beta})_i)^2+ \lambda\sum_i \beta_i^2\]
This is known as ridge regression. The term \(\sum_i \beta_i^2\) is often referred to as the \(l^2\) penalty; using the \(l^1\) penalty \(\sum_i \vert\beta_i\vert\) instead leads to the lasso.
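As a quick illustration (a sketch with simulated data, not an analysis from the text), the ridge coefficients have the closed form \(\hat{\pmb{\beta}}_\lambda=(\pmb{X}'\pmb{X}+\lambda\pmb{I})^{-1}\pmb{X}'\pmb{y}\), and increasing \(\lambda\) shrinks them towards 0:

# ridge regression on simulated data: the coefficients shrink as lambda grows
set.seed(11)
n = 50; p = 5
X = matrix(rnorm(n*p), n, p)
beta = c(3, -2, 1, 0, 0)
y = c(X %*% beta + rnorm(n))
ridge = function(lambda) c(solve(t(X) %*% X + lambda*diag(p), t(X) %*% y))
round(cbind("lambda=0" = ridge(0), "lambda=10" = ridge(10), "lambda=100" = ridge(100)), 2)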
Let’s define the space of functions \(f:[0,1]\rightarrow \mathbb{R}\) that have a continuous second derivative and satisfy \(\int_0^1 (f''(x))^2 dx<R\) for some \(R>0\). Then we can minimize
\[\Phi(\pmb{y,X},f)=\sum_{i=1}^n (y_i-f(\pmb{x}_i))^2+ \lambda \int_0^1 (f''(x))^2 dx\] There is nothing special about the interval \([0,1]\); one can similarly define the space of functions on any interval, including \((-\infty,\infty)\).
At first it may seem very hard, maybe even impossible, to actually find such a minimum because we are optimizing over an infinite-dimensional space, but in some cases it can be done. For example, in this specific case it can be shown that the solution is always a cubic spline (more precisely, a natural cubic spline with knots at the observed \(x_i\))!
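In R this penalized criterion is what the smooth.spline function fits; here is a brief sketch on simulated data, with the smoothing parameter chosen automatically:

# cubic smoothing spline: least squares plus a penalty on the integrated (f'')^2
set.seed(111)
x = seq(0, 1, length = 101)
y = sin(2*pi*x) + rnorm(101, sd = 0.3)
fit = smooth.spline(x, y)   # smoothing parameter chosen by generalized cross-validation
plot(x, y)
lines(predict(fit), lwd = 2, col = "blue")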
In order to be able to find such a minimum it is necessary to use a function space with sufficient “structure”. A good choice is the class of so-called Reproducing Kernel Hilbert Spaces (RKHS), which we will now define:
A vector space is a set of elements with operations “addition” and “scalar multiplication” that follow the usual rules.
Let V be a vector space. A mapping \(\langle .,.\rangle :V^2\rightarrow \mathbb{R}\) is an inner product on V if for all \(x,y,z \in V\) and \(a \in \mathbb{R}\) we have \(\langle x,y\rangle =\langle y,x\rangle\), \(\langle ax+y,z\rangle =a\langle x,z\rangle +\langle y,z\rangle\), and \(\langle x,x\rangle \ge 0\) with \(\langle x,x\rangle =0\) if and only if \(x=0\).
A vector space with an inner product is called an inner product space.
Often we also write \(\langle x,x\rangle =||x||^2\) and then ||x|| is called the norm.
Such a norm induces a metric on the space via \(d(x,y)=||x-y||\). Any metric has to have the properties that \(d(x,y)\ge 0\) with \(d(x,y)=0\) if and only if \(x=y\), that \(d(x,y)=d(y,x)\), and that \(d(x,z)\le d(x,y)+d(y,z)\) (the triangle inequality).
Two standard examples of inner product spaces: \(\mathbb{R}^d\) with \(\langle x,y\rangle = \sum_{i=1}^d x_i y_i\), and
the space of continuous functions \(C[0,1]\) with
\[\langle f,g\rangle = \int_0^1 f(x)g(x)dx\]
Note that in an inner product space we have a version of the Pythagorean theorem: if x and y are such that
\[\langle x,y\rangle =0\]
they are said to be orthogonal, and then we have
\[ \begin{aligned} &||x+y||^2 = \\ &\langle x+y,x+y\rangle = \\ &\langle x,x\rangle +\langle x,y\rangle +\langle y,x\rangle +\langle y,y\rangle = \\ &||x||^2 +||y||^2 \end{aligned} \]
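A quick numerical check of this identity in \(\mathbb{R}^3\), with two arbitrarily chosen orthogonal vectors:

# Pythagorean theorem for two orthogonal vectors in R^3
x = c(1, 2, 0)
y = c(-2, 1, 3)
sum(x*y)                                 # inner product is 0, so x and y are orthogonal
c(sum((x + y)^2), sum(x^2) + sum(y^2))   # ||x+y||^2 equals ||x||^2 + ||y||^2 (both 19)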
A sequence \((x_n)\) in a metric space is called a Cauchy sequence if for all \(\epsilon>0\) there exists an N such that \(d(x_n,x_m)<\epsilon\) for all \(n, m > N\).
This definition of convergence has the advantage that one can check it without knowing what the limit of the sequence is.
Say \(x_n=\frac1n\), \(d(x_n,x_m)=|x_n-x_m|\). Then
\[\vert \frac1n-\frac1m \vert = \frac{\vert m-n\vert}{nm}<\frac{\max\{n,m\}}{mn}=\frac1{\min\{n,m\}}<\epsilon\] if \(n,m>N=1/\epsilon\)
A space is called complete if each Cauchy sequence converges.
It may seem at first that this is always true, but it is not:
Consider the space of continuous functions on [0,1] equipped with the norm \(||f||=\int_0^1 |f(x)|dx\). Now consider the sequence of continuous functions
\[f_n(x)=\min\{1,\max\{0,n(x-1/2)\}\}\]
which are 0 on \([0,1/2]\), equal to 1 on \([1/2+1/n,1]\) and rise linearly in between. For \(n,m>N\) the functions \(f_n\) and \(f_m\) differ only on an interval of length at most \(1/N\), so
\[||f_n-f_m||=\int_0^1 \vert f_n(x)-f_m(x)\vert dx\le \frac1{N}\]
and so \(f_n\) is a Cauchy sequence. But the only possible limit is the step function \(I_{[1/2,1]}(x)\), which is not continuous and therefore not in the space. Therefore we have the definition
A complete inner product space is called a Hilbert space.
The main idea here is that Hilbert spaces can be infinite dimensional but retain (almost) all the nice properties of Euclidean space \(\mathbb{R}^d\), such as the convergence of Cauchy sequences.
A standard example of an infinite-dimensional Hilbert space is \(l^2\), the space of all sequences \(\pmb{a}=(a_1,a_2,\dots)\) with \(\sum_{i=1}^\infty a_i^2<\infty\), equipped with the inner product \(\langle \pmb{a},\pmb{b}\rangle =\sum_{i=1}^\infty a_ib_i\). Notice that d-dimensional Euclidean space is a subspace of \(l^2\) (pad each vector with zeros).
Let \(f\in L^2[0,1]\), the space of square-integrable functions on [0,1] with inner product \(\langle f,g\rangle =\int_0^1 f(x)g(x)dx\), and let \(\left\{\phi_i \right\}\) be an orthonormal basis of \(L^2[0,1]\), which can be shown to always exist. Then we have
\[f(x)=\sum_{i=0}^\infty a_i \phi_i(x)\]
where \(a_i=\langle f,\phi_i\rangle\). By Parseval’s theorem (a generalization of the Pythagorean theorem) we have
\[||\pmb{a}||_{l^2} = ||f||_{L^2[0,1]}\] therefore we have an isomorphism from \(l^2\) to \(L^2[0,1]\), and in a certain sense the two spaces are the same! In fact it can be shown that every infinite-dimensional separable Hilbert space is isomorphic to \(l^2\).
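As a small numerical check of Parseval’s theorem, take \(f(x)=x\) and the cosine basis \(\phi_0(x)=1\), \(\phi_k(x)=\sqrt 2\cos(k\pi x)\), which is one orthonormal basis of \(L^2[0,1]\); both quantities below come out (essentially) equal to 1/3:

# Parseval's theorem, numerically: sum of squared Fourier coefficients of f(x) = x
# versus the squared L^2 norm of f
f = function(x) x
a = sapply(0:50, function(k) {
  phi = if(k == 0) function(x) rep(1, length(x)) else function(x) sqrt(2)*cos(k*pi*x)
  integrate(function(x) f(x)*phi(x), 0, 1)$value
})
c(sum(a^2), integrate(function(x) f(x)^2, 0, 1)$value)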
A linear functional L is a mapping from some vector space V into the real numbers such that for any \(a\in \mathbb{R}\) and \(x,y\in V\)
\[L(ax+y)=aL(x)+L(y)\]
A linear functional is called bounded if there exists M>0 such that
\[|L(f)|\le M||f||\] for all f. The mathematical theory that studies such objects is called functional analysis.
Let H be a Hilbert space and \(g\in H\) some element of H, then \(L(f) =\langle f,g\rangle\) defines a bounded linear functional.
proof
The linearity follows from the definition of the inner product. The boundedness is a consequence of the Cauchy-Schwarz inequality:
\[|L(f)| =\vert\langle f,g\rangle\vert \le ||f||\times||g|| = M||f||\]
with \(M=||g||<\infty\). Finally we have
(Riesz representation theorem)
Let L be a bounded linear functional on a Hilbert space H. Then there exists \(g\in H\) such that \(L(f) =\langle f,g\rangle\) for all \(f\in H\).
In other words, on a Hilbert space all bounded linear functionals are generated by elements of the space and the inner product.
A symmetric bivariate function \(K:H\times H\rightarrow \mathbb{R}\) is called positive (semi)-definite if for all \(n>0\) and elements \(x_1,\dots, x_n\in H\) the matrix with elements \(K(x_i, x_j)\) is positive (semi)-definite.
This definition is an obvious way to extend the idea of positive definite to infinite dimensional spaces.
For example, take \(K(x,y)=\langle x,y\rangle\). Then for any elements \(x_1,\dots,x_n\) and any \(\pmb{a}\in\mathbb{R}^n\)
\[\pmb{a}'K(\pmb{x}, \pmb{x})\pmb{a}=\sum_{i,j=1}^n a_ia_j\langle x_i,x_j\rangle = ||\sum_{i=1}^n a_ix_i||^2\ge 0\]
so K is positive semi-definite. On \(\mathbb{R}^d\) this is of course the usual case of a positive semi-definite matrix.
As another example, for any \(x,y\in \mathbb{R}^d\) let
\[K(x,y)=\exp\{-||x-y||^2/(2\sigma^2)\}\]
This is the well-known Gaussian kernel, which can also be shown to be positive semi-definite.
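We can check this numerically by verifying that all eigenvalues of the kernel matrix at a set of randomly chosen points are nonnegative (a small sketch with \(\sigma=1\)):

# Gaussian kernel matrix at 20 random points in R^3: all eigenvalues are >= 0
set.seed(2)
z = matrix(rnorm(20*3), 20, 3)
D2 = as.matrix(dist(z))^2                # squared Euclidean distances ||z_i - z_j||^2
Kmat = exp(-D2/2)                        # sigma = 1
min(eigen(Kmat, symmetric = TRUE, only.values = TRUE)$values)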
Let \(\mathcal{X}\) be some space, and \(\mathcal{H}\) a Hilbert space of functions from \(\mathcal{X}\) into \(\mathbb{R}\). Let K be a positive semi-definite kernel function on \(\mathcal{X}\times \mathcal{X}\). \(\mathcal{H}\) is called a reproducing kernel Hilbert space (RKHS) if for any \(x\in \mathcal{X}\) the function \(K(\cdot, x)\in \mathcal{H}\) and we have the relation
\[\langle f, K(\cdot,x) \rangle = f(x)\] for all \(f\in \mathcal{H}\).
In other words, inner products with the functions \(K(\cdot,x)\) are easy to find: one only needs to evaluate f at x. So computations and optimizations that involve only such inner products reduce to ordinary kernel evaluations. This is often referred to as the kernel trick.
How can we find such RKHS’s? Let’s say we have a space \(\mathcal{X}\) and a semi-definite kernel K on the Cartesian product \(\mathcal{X}\times \mathcal{X}\). Now consider the space \(\mathcal{H}\) of functions on \(\mathcal{X}\) defined by
\[f(\cdot)=\sum_{j=1}^n a_jK(\cdot, x_j)\]
for some integer \(n\ge 1\), points \(x_1,\dots,x_n \in \mathcal{X}\) and weights \(a_1,\dots,a_n\in \mathbb{R}\). It is easy to check that \(\mathcal{H}\) is a vector space under the usual function addition and scalar multiplication.
For any given pair \(f,g\in \mathcal{H}\), say \(f(\cdot)=\sum_{j=1}^n a_jK(\cdot, x_j)\) and \(g(\cdot)=\sum_{k=1}^m b_kK(\cdot, \bar{x}_k)\), we define the inner product
\[\langle f,g\rangle_{\mathcal{H}}=\sum_{j=1}^n\sum_{k=1}^m a_jb_kK(x_j,\bar{x}_k)\]
and with this definition we have
\[\langle f, K(\cdot,x) \rangle = \sum_{j=1}^n a_jK(x_j, x)=f(x)\] and one can verify that this indeed defines a proper inner product. Finally, completing this space (adding the limits of its Cauchy sequences) yields a Hilbert space, and we have
Given any positive semi-definite kernel function K, there is a unique Hilbert space \(\mathcal{H}\) in which the kernel satisfies the reproducing property. It is known as the RKHS associated with K.
proof omitted
\(L^2[0,1]\) is not an RKHS. If it were there would have to be a function \(R_x\) such that
\[\int_0^1 f(y)R_x(y) dy = f(x)\]
for all \(f\in L^2[0,1]\). It can be shown that no genuine function can do this; the only candidate is a “point mass at x” (informally, \(R_x(y)=\infty \times I_{\{x\}}(y)\)), which is not an element of \(L^2[0,1]\).
In a certain sense \(L^2[0,1]\) is too large a space to be an RKHS, so in order for a space of functions to be an RKHS we need to put further restrictions on them. A common one is
For some fixed \(n\ge 1\) define the space \(\mathcal{H}^n[0,1]\) as the space of all real-valued functions that are n-times differentiable with the nth derivative being square-integrable and
\[f(0)=f'(0)=\dots=f^{(n-1)}(0)=0\]
If we define the inner product
\[\langle f,g\rangle_{\mathcal{H}^n[0,1]} = \int_0^1 f^{(n)}(x)g^{(n)}(x)dx\]
then it can be shown that this is an RKHS. It is called the Sobolev space of order n.
Let \(\mathcal{H}^n[0,1]\) be the Sobolev space of order n; then its reproducing kernel is
\[K(x,y)=\int_0^1 \frac{(x-z)^{n-1}_{+}(y-z)^{n-1}_{+}}{(n-1)!(n-1)!}dz\]
proof (for n=1)
In this case the kernel becomes \(K(x,y)=\int_0^1 I_{[0,x]}(z)I_{[0,y]}(z)dz\). Now say \(x<y\); then \(K(x,y)=\int_0^1 I_{[0,x]}(z)I_{[0,y]}(z)dz=\int_0^x dz = x\), so in general \(K(x,y)=\min\{x,y\}\). With this kernel one can now verify all the properties of an RKHS.
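We can also check the reproducing property numerically for a function in the span of the kernel sections, using \(\langle f,g\rangle =\int_0^1 f'(z)g'(z)dz\) (the coefficients and points below are just hypothetical choices):

# check <f, K(., x0)> = f(x0) in the first-order Sobolev space with K(x, y) = min(x, y)
Kmin = function(s, t) pmin(s, t)
a  = c(1.5, -0.7, 2)      # hypothetical coefficients
xs = c(0.2, 0.5, 0.9)     # hypothetical points
f  = function(t) colSums(a * outer(xs, t, Kmin))   # f = sum_j a_j K(., x_j)
fp = function(z) colSums(a * outer(xs, z, function(u, v) as.numeric(v < u)))   # f'(z)
x0 = 0.37
z = seq(0, 1, length = 1e5)
inner = mean(fp(z) * as.numeric(z < x0))   # Riemann sum for int_0^1 f'(z) (d/dz)K(z, x0) dz
c(inner, f(x0))                            # the two values agree (up to grid error)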
Say we observe n samples of some unknown function, \(y_i=f^{*}(x_i)\), at known design points \(x_1,\dots,x_n\). There are two questions of interest:
is there a function f from some specified set that fits the points exactly, that is, \(y_i=f(x_i)\) for all i?
if there are many such functions, which is the best one?
The first question can be answered by writing down such a function, but the second one is a bit vague: what does “best” mean? If the set is an RKHS we can use the norm to “order” the functions, and we can write the following optimization problem:
choose \(\hat{f} \in \text{argmin} \left\{ ||f||_\mathcal{H}\right\}\) such that \(y_i=f(x_i)\) for all i.
An optimization problem of this type is called a convex program.
So, how can one solve such an optimization problem? Here is one answer:
Let \(\pmb{K}\) be the kernel matrix defined by the design points \(\{x_i\}\), with entries \(K_{ij}=K(x_i,x_j)/n\). The convex program above is solvable if and only if \(\pmb{y}\in \text{range}(\pmb{K})\), in which case the optimal solution can be written as
\[\hat{f}(\cdot)=\frac1{\sqrt n}\sum_{i=1}^n \hat{\alpha}_iK(\cdot, x_i)\] where \(\pmb{K}\hat{\alpha}=y/\sqrt n\)
proof omitted
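Here is a small numerical illustration of this result, using the first-order Sobolev kernel \(K(x,y)=\min\{x,y\}\) and simulated points (not the Lobster data used below):

# exact interpolation: solve K alpha = y/sqrt(n), then the fit passes through every point
set.seed(3)
x = sort(runif(8)); y = rnorm(8); n = length(x)
Kmat = outer(x, x, pmin)/n                    # kernel matrix with entries K(x_i, x_j)/n
alpha = solve(Kmat, y/sqrt(n))
fhat0 = function(t) colSums(alpha * outer(x, t, pmin))/sqrt(n)
round(fhat0(x) - y, 10)                       # all zeros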
In statistics we usually have observations that also include some random noise, so our model is
\[y_i=f^{*}(x_i)+\epsilon_i\]
In this case we no longer want a function that connects all the points, but one that smooths out the random noise. This means we now want to minimize some trade-off between the fit to the data and the Hilbert space norm. For example, we might require that the mean squared difference between the observed data and the fitted values stays below some tolerance, which leads to the optimization problem
\(\min \left\{ ||f||_\mathcal{H}: f\in\mathcal{H}\right\}\) such that \(\frac1{2n}\sum_{i=1}^n (y_i-f(x_i))^2<\delta^2\)
where \(\delta>0\) is some tolerance parameter. It is possible to reformulate such a problem as
\[\hat{f}=\text{arg min} \left\{\frac1{2n}\sum_{i=1}^n (y_i-f(x_i))^2+\lambda_n ||f||^2_\mathcal{H} \right\}\] Notice that we are now back to (penalized) least squares!
This method is known as kernel ridge regression.
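To see why only n coefficients \(\alpha_1,\dots,\alpha_n\) are needed, consider functions of the form \(f(\cdot)=\frac1{\sqrt n}\sum_{j=1}^n \alpha_jK(\cdot, x_j)\) (a sketch, using the kernel matrix normalization \(K_{ij}=K(x_i,x_j)/n\) from above). A short computation with the reproducing property shows
\[f(x_i)=\sqrt n(\pmb{K\alpha})_i \quad \text{and} \quad ||f||^2_\mathcal{H}=\pmb{\alpha}'\pmb{K\alpha}\]
so for such f the objective becomes
\[\frac12||\pmb{y}/\sqrt n-\pmb{K\alpha}||^2+\lambda_n\pmb{\alpha}'\pmb{K\alpha}\]
a finite-dimensional quadratic function of \(\pmb{\alpha}\) that can be minimized explicitly.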
For all \(\lambda_n >0\) the kernel ridge regression estimate can be written as
\[\hat{f}(\cdot) = \frac1{\sqrt n}\sum_{i=1}^n\hat{\alpha}_i K(\cdot, x_i)\] where
\[\pmb{\hat{\alpha}}=(\pmb{K}+\lambda_n\pmb{I})^{-1}\pmb{y}/\sqrt n\] proof omitted
Let’s use the first-order Sobolev space and kernel ridge regression to fit the Lobster data.
library(ggplot2)
# first-order Sobolev kernel K(x, y) = min(x, y), as derived above
K = function(x, y) pmin(x, y)
x = lobster$Time
y = lobster$Length
# kernel matrix with entries K(x_i, x_j)/n
K1 = outer(x, x, K)/length(x)
# kernel ridge regression fit evaluated at the points t, with penalty parameter l:
# alpha-hat = (K + lambda I)^(-1) y/sqrt(n) and f-hat(t) = sum_i alpha-hat_i K(t, x_i)/sqrt(n)
fhat = function(t, l) {
  n = nrow(K1)
  alphahat = c(solve(K1 + l*diag(n)) %*% cbind(y))/sqrt(n)
  s = t
  for(i in seq_along(t)) s[i] = sum(alphahat*K(t[i], x))
  s/sqrt(n)
}
# evaluate the fit on a grid for several values of lambda and plot
t = seq(min(x), max(x), length = 100)
df = data.frame(x = c(t, t, t),
                y = c(fhat(t, 0.1), fhat(t, 1), fhat(t, 10)),
                lambda = rep(c("0.1", "1", "10"), each = 100))
ggplot(data = lobster, aes(Time, Length)) +
  geom_point() +
  geom_line(aes(x, y, color = lambda), data = df)