Important: since this page is quite technical, the text is mainly copy-pasted from the thesis (see References). For more compact information you can look at the Neural Computation article.
Gaussian Process Representation
Let us first define some notation. In the following we will use GPs as priors with the prior kernel $K_0(\mathbf{x},\mathbf{x}')$: an arbitrary sample $\mathbf{f}$ from the GP at the spatial locations $[\mathbf{x}_1,\ldots,\mathbf{x}_N]^T$ has a Gaussian distribution with covariance matrix $\mathbf{K}_0$, where $[\mathbf{K}_0]_{ij} = K_0(\mathbf{x}_i,\mathbf{x}_j)$:

\[ p_0(\mathbf{f}) \propto \exp\!\left( -\tfrac{1}{2}\, \mathbf{f}^T \mathbf{K}_0^{-1} \mathbf{f} \right) \qquad (20) \]
Using $\mathbf{f}_{\mathcal{D}} = \{f(\mathbf{x}_1),\ldots,f(\mathbf{x}_N)\}$ for the random variables at the data positions, we compute the posterior distribution as

\[ p_{\mathrm{post}}(\mathbf{f}) = \frac{\int d\mathbf{f}_{\mathcal{D}}\; P(\mathcal{D}\,|\,\mathbf{f}_{\mathcal{D}})\; p_0(\mathbf{f},\mathbf{f}_{\mathcal{D}})}{\left\langle P(\mathcal{D}\,|\,\mathbf{f}_{\mathcal{D}}) \right\rangle_0} \qquad (21) \]
where $p_0(\mathbf{f},\mathbf{f}_{\mathcal{D}})$ is the joint Gaussian distribution of the random variables at the training and sample locations, and $\langle P(\mathcal{D}\,|\,\mathbf{f}_{\mathcal{D}})\rangle_0$ is the average of the likelihood with respect to the prior GP marginal $p_0(\mathbf{f}_{\mathcal{D}})$. Computing the predictive distribution combines the posterior with the likelihood of the data at $\mathbf{x}$:

\[ p(y\,|\,\mathbf{x},\mathcal{D}) = \int d f_{\mathbf{x}}\; P(y\,|\,f_{\mathbf{x}},\mathbf{x})\; p_{\mathrm{post}}(f_{\mathbf{x}}) \qquad (22) \]
The predictive distribution of eq. (22), as it is presented, might require the computation of a new integral each time a prediction on a novel input is required. The following lemma shows that simple but important predictive quantities, such as the posterior mean and the posterior covariance of the process at arbitrary inputs, can be expressed as a combination of a finite set of parameters which depend on the process at the training data only. Knowing these parameters eliminates the high-dimensional integrations when making predictions.
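As an illustration (not taken from the thesis), the sketch below evaluates the predictive distribution of eq. (22) for the analytically tractable Gaussian-likelihood (regression) case; the RBF kernel, the toy data and the noise variance are chosen arbitrarily for the example:

    import numpy as np

    def rbf_kernel(A, B, lengthscale=1.0):
        # K0(x, x') = exp(-|x - x'|^2 / (2 * lengthscale^2))
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / lengthscale**2)

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(20, 1))                   # training inputs
    y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(20)    # noisy targets
    noise_var = 0.1**2

    K = rbf_kernel(X, X)                 # prior covariance at the data positions
    Xstar = np.array([[0.5]])            # a novel input x
    k_star = rbf_kernel(X, Xstar)        # K0(x_i, x)
    k_ss = rbf_kernel(Xstar, Xstar)

    # For Gaussian noise the integrals in eqs. (21)-(22) are analytic:
    A = K + noise_var * np.eye(len(X))
    mean = k_star.T @ np.linalg.solve(A, y)                # posterior mean at x
    var = k_ss - k_star.T @ np.linalg.solve(A, k_star)     # posterior variance at x
    print(mean, var + noise_var)                           # predictive mean and variance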
Parametrisation of the posterior moments
Based on the rules for partial integration we provide a representation for the moments of the posterior process obtained using GP priors and a given data set $\mathcal{D}$. The property used here is that of Gaussian averages: for a differentiable scalar function $g(\mathbf{x})$ with $\mathbf{x}\in\mathbb{R}^d$, based on the partial integration rule [30], we have:

\[ \int d\mathbf{x}\; p_0(\mathbf{x})\, \mathbf{x}\, g(\mathbf{x}) \;=\; \boldsymbol{\mu} \int d\mathbf{x}\; p_0(\mathbf{x})\, g(\mathbf{x}) \;+\; \boldsymbol{\Sigma} \int d\mathbf{x}\; p_0(\mathbf{x})\, \nabla g(\mathbf{x}) \qquad (32) \]
where the function $g(\mathbf{x})$ and its derivatives grow slower than an exponential, to guarantee the existence of the integrals involved. We use the vector integral to have a more compact and more intuitive formulation, and $\nabla g(\mathbf{x})$ is the vector of derivatives. The vector $\boldsymbol{\mu}$ is the mean and the matrix $\boldsymbol{\Sigma}$ is the covariance of the normal distribution $p_0$.
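Eq. (32) can also be checked numerically; the following sketch (our own verification, not part of the thesis) compares the two sides by Monte Carlo for a two-dimensional Gaussian $p_0$ and a sigmoidal $g(\mathbf{x})$:

    import numpy as np

    rng = np.random.default_rng(1)
    mu = np.array([0.5, -1.0])                    # mean of p0
    Sigma = np.array([[1.0, 0.3], [0.3, 0.5]])    # covariance of p0

    def g(x):
        # a smooth scalar function that grows slower than an exponential
        return 1.0 / (1.0 + np.exp(-(x[:, 0] + 2 * x[:, 1])))

    def grad_g(x):
        s = g(x)
        return (s * (1 - s))[:, None] * np.array([1.0, 2.0])

    x = rng.multivariate_normal(mu, Sigma, size=1_000_000)

    lhs = (x * g(x)[:, None]).mean(0)                     # E[x g(x)]
    rhs = mu * g(x).mean(0) + Sigma @ grad_g(x).mean(0)   # mu E[g] + Sigma E[grad g]
    print(lhs, rhs)                                       # the two sides agree to MC error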
The context in which eq. (32) is useful is when $p_0(\mathbf{x})$ is a Gaussian and $g(\mathbf{x})$ is a likelihood. We want to compute the moments of the posterior [57,17]. For arbitrary likelihoods we can show that:
Lemma 3.1 (Parametrisation [19]) The result of the Bayesian update eq. (21) using a GP prior with mean function $\langle f_{\mathbf{x}}\rangle_0$ and kernel $K_0(\mathbf{x},\mathbf{x}')$ and data $\mathcal{D} = \{(\mathbf{x}_n, y_n)\,|\, n = 1,\ldots,N\}$ is a process with mean and kernel functions given by

\[ \langle f_{\mathbf{x}}\rangle_{\mathrm{post}} = \langle f_{\mathbf{x}}\rangle_0 + \sum_{i=1}^{N} K_0(\mathbf{x},\mathbf{x}_i)\, q(i), \qquad K_{\mathrm{post}}(\mathbf{x},\mathbf{x}') = K_0(\mathbf{x},\mathbf{x}') + \sum_{i,j=1}^{N} K_0(\mathbf{x},\mathbf{x}_i)\, R(ij)\, K_0(\mathbf{x}_j,\mathbf{x}') \qquad (33) \]

The parameters $q(i)$ and $R(ij)$ are given by

\[ q(i) = \frac{\partial \ln Z}{\partial \langle f(\mathbf{x}_i)\rangle_0}, \qquad R(ij) = \frac{\partial^2 \ln Z}{\partial \langle f(\mathbf{x}_i)\rangle_0\, \partial \langle f(\mathbf{x}_j)\rangle_0} \qquad (34) \]

where $\mathbf{f}_{\mathcal{D}} = [f(\mathbf{x}_1),\ldots,f(\mathbf{x}_N)]^T$, $Z = \langle P(\mathcal{D}\,|\,\mathbf{f}_{\mathcal{D}})\rangle_0 = \int d\mathbf{f}_{\mathcal{D}}\; p_0(\mathbf{f}_{\mathcal{D}})\, P(\mathcal{D}\,|\,\mathbf{f}_{\mathcal{D}})$ is a normalising constant, and the partial differentiations are with respect to the prior mean at $\mathbf{x}_i$.
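For a Gaussian likelihood with noise variance $\sigma^2$ (the case where the lemma is exact), $Z$ is itself a Gaussian in the prior means and eq. (34) gives, assuming a zero prior mean, $\mathbf{q} = (\mathbf{K}_0 + \sigma^2 I)^{-1}\mathbf{y}$ and $\mathbf{R} = -(\mathbf{K}_0 + \sigma^2 I)^{-1}$. The sketch below (our own illustration; the kernel and the toy data are arbitrary) assembles the posterior mean and kernel functions of eq. (33) from these parameters:

    import numpy as np

    def rbf_kernel(A, B, lengthscale=1.0):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / lengthscale**2)

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(15, 1))
    y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(15)
    noise_var = 0.1**2

    # Gaussian-likelihood case of eq. (34): derivatives of ln Z in closed form
    A = rbf_kernel(X, X) + noise_var * np.eye(len(X))
    q = np.linalg.solve(A, y)        # q(i), zero prior mean assumed
    R = -np.linalg.inv(A)            # R(ij)

    def posterior_mean(x):
        # eq. (33): <f_x>_post = sum_i K0(x, x_i) q(i)   (zero prior mean)
        return rbf_kernel(x, X) @ q

    def posterior_kernel(x, xp):
        # eq. (33): K_post(x, x') = K0(x, x') + sum_ij K0(x, x_i) R(ij) K0(x_j, x')
        return rbf_kernel(x, xp) + rbf_kernel(x, X) @ R @ rbf_kernel(X, xp)

    xs = np.array([[0.5]])
    print(posterior_mean(xs), posterior_kernel(xs, xs))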
It is important that both the posterior mean and the posterior kernel can be parametrised using the locations of the inputs (higher-order moments can be parametrised as well, but in the present framework this is irrelevant). Thus it is possible to represent (approximations to) posterior processes obtained from a given prior process. This representation allows a probabilistic treatment of different posterior quantities (averages/expectations).
The parametrisation lemma states -- under general conditions -- the existence of these coefficients. Although it provides a formula to actually calculate them, the integrals on which it depends cannot, in general, be computed analytically. The online scheme, presented in the 'Estimation' section, provides a means to obtain approximations to these parameters, and thus to the entire posterior process.
Parametrisation in the feature space
The parametrisation lemma provides us with the first two moments of the posterior process. Apart from Gaussian regression, where the results are exact, we can consider the moments of the posterior process as approximations. This approximation is written in a data-dependent coordinate system. We are using the feature space $\mathcal{F}$ and the projection $\phi_{\mathbf{x}}$ of the input $\mathbf{x}$ into $\mathcal{F}$. With the scalar product from eq. (19) replacing the kernel function $K_0$, the mean and covariance functions of the posterior process are (for a zero prior mean)

\[ \langle f_{\mathbf{x}}\rangle_{\mathrm{post}} = \sum_{i=1}^{N} q(i)\, \phi_{\mathbf{x}}^T \phi_{\mathbf{x}_i}, \qquad K_{\mathrm{post}}(\mathbf{x},\mathbf{x}') = \phi_{\mathbf{x}}^T \phi_{\mathbf{x}'} + \sum_{i,j=1}^{N} \phi_{\mathbf{x}}^T \phi_{\mathbf{x}_i}\, R(ij)\, \phi_{\mathbf{x}_j}^T \phi_{\mathbf{x}'} \qquad (47) \]
This shows that the mean function is expressed in the feature space as a scalar product between two quantities: the feature-space image of $\mathbf{x}$ and a ``mean vector'' $\boldsymbol{\mu}$, also a feature-space entity. A similar identification for the posterior covariance leads to a covariance matrix in the feature space that fully characterises the covariance function of the posterior process.
The conclusion is the following: there is a correspondence between the approximating posterior GP and a Gaussian distribution in the feature space $\mathcal{F}$ whose mean and covariance are expressed as

\[ \boldsymbol{\mu} = \Phi\,\mathbf{q}, \qquad \boldsymbol{\Sigma} = I_{\mathcal{F}} + \Phi\,\mathbf{R}\,\Phi^T \qquad (48) \]

where $\Phi = [\phi_{\mathbf{x}_1},\ldots,\phi_{\mathbf{x}_N}]$ is the concatenation of the feature vectors for all data.
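To make eq. (48) concrete, the sketch below (again our own illustration) uses an explicit, finite-dimensional feature map (second-order polynomial features, so that $K_0(\mathbf{x},\mathbf{x}') = \phi_{\mathbf{x}}^T\phi_{\mathbf{x}'}$) together with the Gaussian-regression values of $\mathbf{q}$ and $\mathbf{R}$, and checks that the feature-space pair $(\boldsymbol{\mu},\boldsymbol{\Sigma})$ reproduces the kernel-space posterior mean and covariance:

    import numpy as np

    def phi(X):
        # explicit feature map: phi_x = [1, x, x^2], so K0(x, x') = phi_x^T phi_x'
        return np.column_stack([np.ones(len(X)), X[:, 0], X[:, 0] ** 2])

    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(10, 1))
    y = 0.5 * X[:, 0] ** 2 - X[:, 0] + 0.1 * rng.standard_normal(10)
    noise_var = 0.1**2

    Phi = phi(X)                                  # feature vectors of all data (rows)
    K0 = Phi @ Phi.T                              # kernel matrix at the data
    A = K0 + noise_var * np.eye(len(X))
    q = np.linalg.solve(A, y)                     # q, R from the Gaussian-regression case
    R = -np.linalg.inv(A)

    # eq. (48): Gaussian in feature space
    mu = Phi.T @ q
    Sigma = np.eye(Phi.shape[1]) + Phi.T @ R @ Phi

    # posterior mean / covariance at a new input, computed two equivalent ways
    xs = np.array([[0.3]])
    k = phi(xs) @ Phi.T                                     # K0(x, x_i)
    mean_kernel = k @ q                                     # eq. (33)
    mean_feature = phi(xs) @ mu                             # eq. (48)
    cov_kernel = phi(xs) @ phi(xs).T + k @ R @ k.T          # eq. (33)
    cov_feature = phi(xs) @ Sigma @ phi(xs).T               # eq. (48)
    print(mean_kernel, mean_feature, cov_kernel, cov_feature)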
This result provides an interpretation of Bayesian GP inference as a family of Bayesian algorithms performed in a feature space, with the result projected back into the input space by expressing it in terms of scalar products. Notice two important additions to the kernel method framework given by the parametrisation lemma:
- Bayesian learning algorithms for the GP imply the ``estimation'' of a Gaussian distribution in the feature space (we will present one in the next Section).
- The parametrisation from eq. (33) provides a structure for the covariance of the posterior process.
The main difference between Bayesian GP learning and the non-Bayesian kernel method framework is that, in contrast to approaches based on the representer theorem for SVMs, which result in a single function, the parametrisation lemma gives a full probabilistic approximation: we are ``projecting'' the posterior covariance back into the input space.
An important observation is also that the parametrisation is data-dependent: both the mean and the covariance are expressed in a coordinate system whose axes are given by the input vectors, with $\mathbf{q}$ and $\mathbf{R}$ the coordinates of the mean and the covariance respectively. Using once more the equivalence to the generalised linear models from Section 2.2, the GP approximation to the posterior GP is a Gaussian approximation to the posterior distribution over the weights of that model.
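Spelling out this correspondence (the weight vector $\mathbf{w}$ below is our own notation, introduced only for illustration): writing the generalised linear model as $f(\mathbf{x}) = \mathbf{w}^T\phi_{\mathbf{x}}$ with prior $\mathbf{w}\sim N(\mathbf{0}, I_{\mathcal{F}})$, the approximating posterior over the weights is the feature-space Gaussian of eq. (48),

\[ p(\mathbf{w}\,|\,\mathcal{D}) \approx N(\boldsymbol{\mu}, \boldsymbol{\Sigma}), \qquad \boldsymbol{\mu} = \Phi\,\mathbf{q}, \quad \boldsymbol{\Sigma} = I_{\mathcal{F}} + \Phi\,\mathbf{R}\,\Phi^T , \]

so that $\langle f_{\mathbf{x}}\rangle_{\mathrm{post}} = \phi_{\mathbf{x}}^T\boldsymbol{\mu}$ and $K_{\mathrm{post}}(\mathbf{x},\mathbf{x}') = \phi_{\mathbf{x}}^T\,\boldsymbol{\Sigma}\,\phi_{\mathbf{x}'}$.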
A probabilistic treatment of the regression case has recently been proposed [88], where the probabilistic PCA method [89,67] is extended to the feature space. The PPCA in the kernel space uses a simpler approximation to the covariance, which (in the notation of eq. (48)) has the form

\[ \boldsymbol{\Sigma} = \sigma^2 I_{\mathcal{F}} + \Phi\,\mathbf{W}\,\Phi^T \qquad (49) \]

where the scalar $\sigma^2$ takes arbitrary values and $\mathbf{W}$ is a diagonal matrix of the size of the data. This is a special case of the parametrisation of the posterior GP in eq. (48). The simplification leads to sparseness, and is the result of an EM-like algorithm that minimises the KL distance between the empirical covariance in the feature space and the parametrised covariance of eq. (49).
In the following, the joint normal distribution in the feature space with the data-dependent parametrisation from eq. (48) will be used to derive sparsity in GPs.
Questions, comments, suggestions: contact Lehel Csató.