Important: since this material is quite technical, the text is mainly copy-pasted from the thesis (see References). For a more compact presentation, see the Neural Computation article.

Gaussian Process Representation

Let us first define some notation: in the following we will use GPs as priors with the prior kernel $K_0(\boldsymbol{x},\boldsymbol{x}')$. An arbitrary sample $\boldsymbol{f}$ from the GP at the spatial locations $\mathcal{X} = [\boldsymbol{x}_1,\ldots,\boldsymbol{x}_N]^T$ has a Gaussian distribution with covariance $\boldsymbol{K}_{\mathcal{X}}$:

$$ p_0(\boldsymbol{f}) \;\propto\; \exp\left\{ -\frac{1}{2}\, \boldsymbol{f}^T \boldsymbol{K}_{\mathcal{X}}^{-1}\, \boldsymbol{f} \right\}. \qquad (20)$$
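As a small illustration of eq. (20) -- a minimal sketch, not part of the thesis text; the squared-exponential kernel, the input grid and the jitter level are arbitrary choices of mine -- a sample $\boldsymbol{f}$ from the zero-mean prior can be drawn by factorising the covariance $\boldsymbol{K}_{\mathcal{X}}$:

```python
import numpy as np

def rbf_kernel(X, Y, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel K0(x, x') -- an illustrative choice."""
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

# N input locations X = [x_1, ..., x_N]^T in one dimension
X = np.linspace(-3, 3, 50)[:, None]
K = rbf_kernel(X, X)

# A sample f ~ N(0, K_X), i.e. the prior of eq. (20) with zero mean;
# the small jitter keeps the Cholesky factorisation numerically stable.
L = np.linalg.cholesky(K + 1e-8 * np.eye(len(X)))
f_sample = L @ np.random.randn(len(X))
```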

Using $\boldsymbol{f}_{\mathcal{D}} = [f(\boldsymbol{x}_1),\ldots,f(\boldsymbol{x}_N)]^T$ for the random variables at the data positions, we compute the posterior distribution as

$$ p_{post}(\boldsymbol{f}) \;=\; \frac{1}{\langle P(\mathcal{D}|\boldsymbol{f}_{\mathcal{D}}) \rangle_0} \int d\boldsymbol{f}_{\mathcal{D}}\; P(\mathcal{D}|\boldsymbol{f}_{\mathcal{D}})\; p_0(\boldsymbol{f},\boldsymbol{f}_{\mathcal{D}}), \qquad (21)$$

where $p_0(\boldsymbol{f},\boldsymbol{f}_{\mathcal{D}})$ is the joint Gaussian distribution of the random variables at the training and sample locations, and $\langle P(\mathcal{D}|\boldsymbol{f}_{\mathcal{D}}) \rangle_0$ is the average of the likelihood with respect to the prior GP marginal $p_0(\boldsymbol{f}_{\mathcal{D}})$. The predictive distribution at a new input $\boldsymbol{x}$ is obtained by combining the posterior with the likelihood:

$$ p(y) \;=\; \int d f_{\boldsymbol{x}}\; P(y|f_{\boldsymbol{x}},\boldsymbol{x})\; p_{post}(f_{\boldsymbol{x}}) \;=\; \frac{1}{\langle P(\mathcal{D}|\boldsymbol{f}_{\mathcal{D}}) \rangle_0} \int d f_{\boldsymbol{x}}\, d\boldsymbol{f}_{\mathcal{D}}\; P(y|f_{\boldsymbol{x}},\boldsymbol{x})\; P(\mathcal{D}|\boldsymbol{f}_{\mathcal{D}})\; p_0(f_{\boldsymbol{x}},\boldsymbol{f}_{\mathcal{D}}). \qquad (22)$$
The predictive distribution of eq. (22), as written, might require the computation of a new integral each time a prediction at a novel input is needed. The following lemma shows that simple but important predictive quantities, such as the posterior mean and the posterior covariance of the process at arbitrary inputs, can be expressed through a finite set of parameters that depend on the process at the training data only. Knowing these parameters eliminates the high-dimensional integrations when making predictions.
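Once the posterior marginal at a test input is summarised by its mean and variance -- which is exactly what the lemma below provides -- eq. (22) reduces to a one-dimensional integral. The following sketch approximates it with Gauss-Hermite quadrature; the helper name predictive_density, the Gaussian likelihood, the noise level and the test values are assumptions made purely for illustration, and the posterior marginal is taken to be (approximated by) a Gaussian.

```python
import numpy as np

def predictive_density(y, m_x, v_x, lik, n_quad=40):
    """Approximate p(y) = \int df_x P(y | f_x) p_post(f_x)  (eq. 22)
    when the posterior marginal p_post(f_x) is (approximately) N(m_x, v_x)."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_quad)
    f = m_x + np.sqrt(2.0 * v_x) * nodes          # change of variables
    return np.sum(weights * lik(y, f)) / np.sqrt(np.pi)

# Illustrative likelihood: Gaussian observation noise with variance 0.1
gauss_lik = lambda y, f: np.exp(-0.5 * (y - f)**2 / 0.1) / np.sqrt(2 * np.pi * 0.1)
p = predictive_density(y=0.3, m_x=0.0, v_x=0.5, lik=gauss_lik)
```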

Parametrisation of the posterior moments

Based on the rules for partial integration we provide a representation for the moments of the posterior process obtained using GP priors and a given data set $\mathcal{D}$. The property used in this chapter is that of Gaussian averages: for a differentiable scalar function $g(\boldsymbol{x})$ with $\boldsymbol{x} \in \mathbb{R}^d$, based on the partial integration rule [30], we have

$$ \int d\boldsymbol{x}\; p_0(\boldsymbol{x})\; \boldsymbol{x}\, g(\boldsymbol{x}) \;=\; \boldsymbol{\mu} \int d\boldsymbol{x}\; p_0(\boldsymbol{x})\; g(\boldsymbol{x}) \;+\; \boldsymbol{\Sigma} \int d\boldsymbol{x}\; p_0(\boldsymbol{x})\; \nabla g(\boldsymbol{x}), \qquad (32)$$

where the function $g(\boldsymbol{x})$ and its derivatives grow slower than an exponential, which guarantees the existence of the integrals involved. We use the vector integral to obtain a more compact and more intuitive formulation, and $\nabla g(\boldsymbol{x})$ is the vector of partial derivatives. The vector $\boldsymbol{\mu}$ is the mean and the matrix $\boldsymbol{\Sigma}$ is the covariance of the normal distribution $p_0$.
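A quick numerical sanity check of eq. (32) by Monte Carlo sampling (purely illustrative; the Gaussian parameters and the choice of $g$ as a logistic function of $x_1 + x_2$ are my own assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.3, -0.5])
A = rng.standard_normal((2, 2))
Sigma = A @ A.T + np.eye(2)            # a valid covariance matrix

g      = lambda x: 1.0 / (1.0 + np.exp(-(x[:, 0] + x[:, 1])))          # scalar g(x)
grad_g = lambda x: (g(x) * (1 - g(x)))[:, None] * np.ones((len(x), 2)) # its gradient

x = rng.multivariate_normal(mu, Sigma, size=200_000)

lhs = (x * g(x)[:, None]).mean(axis=0)                     # E[ x g(x) ]
rhs = mu * g(x).mean() + Sigma @ grad_g(x).mean(axis=0)    # mu E[g] + Sigma E[grad g]
# lhs and rhs agree up to Monte Carlo error, as eq. (32) predicts
```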
Eq. (32) is useful when $p_0(\boldsymbol{x})$ is a Gaussian, $g(\boldsymbol{x})$ is a likelihood, and we want to compute the moments of the posterior [57,17]. For arbitrary likelihoods we can show that

Lemma 3.1 (Parametrisation [19])   The result of the Bayesian update eq. (21) using a GP prior with mean function $\langle f_{\boldsymbol{x}} \rangle_0$ and kernel $K_0(\boldsymbol{x},\boldsymbol{x}')$ and data $\mathcal{D} = \{(\boldsymbol{x}_n, y_n)\,|\; n = 1,\ldots,N\}$ is a process with mean and kernel functions given by

$$\begin{split} \langle f_{\boldsymbol{x}} \rangle_{post} &= \langle f_{\boldsymbol{x}} \rangle_0 + \sum_{i=1}^{N} K_0(\boldsymbol{x},\boldsymbol{x}_i)\, q(i), \\ K_{post}(\boldsymbol{x},\boldsymbol{x}') &= K_0(\boldsymbol{x},\boldsymbol{x}') + \sum_{i,j=1}^{N} K_0(\boldsymbol{x},\boldsymbol{x}_i)\, R(ij)\, K_0(\boldsymbol{x}_j,\boldsymbol{x}'). \end{split} \qquad (33)$$
The parameters q(i) and R(ij) are given by
$$\begin{split} q(i) &= \frac{1}{Z}\int\! d\boldsymbol{f}_{\mathcal{D}}\; P(\mathcal{D}|\boldsymbol{f}_{\mathcal{D}})\; \frac{\partial\, p_0(\boldsymbol{f}_{\mathcal{D}})}{\partial\, \langle f_{\boldsymbol{x}_i} \rangle_0}, \\ R(ij) &= \frac{1}{Z}\int\! d\boldsymbol{f}_{\mathcal{D}}\; P(\mathcal{D}|\boldsymbol{f}_{\mathcal{D}})\; \frac{\partial^2 p_0(\boldsymbol{f}_{\mathcal{D}})}{\partial\, \langle f_{\boldsymbol{x}_i} \rangle_0\, \partial\, \langle f_{\boldsymbol{x}_j} \rangle_0} - q(i)\, q(j), \end{split} \qquad (34)$$

where $\boldsymbol{f}_{\mathcal{D}} = [f(\boldsymbol{x}_1),\ldots,f(\boldsymbol{x}_N)]^T$, $Z = \int d\boldsymbol{f}_{\mathcal{D}}\; p_0(\boldsymbol{f}_{\mathcal{D}})\, P(\mathcal{D}|\boldsymbol{f}_{\mathcal{D}})$ is a normalising constant, and the partial differentiations are with respect to the prior mean at $\boldsymbol{x}_i$ (respectively $\boldsymbol{x}_j$). Equivalently, $q(i) = \partial \ln Z / \partial \langle f_{\boldsymbol{x}_i} \rangle_0$ and $R(ij) = \partial^2 \ln Z / \partial \langle f_{\boldsymbol{x}_i} \rangle_0\, \partial \langle f_{\boldsymbol{x}_j} \rangle_0$.




It is important that both the posterior mean and the posterior kernel can be parametrised using the locations of the inputs (higher-order moments can be parametrised as well, but this is irrelevant in the present framework). Thus it is possible to represent (approximations to) posterior processes which are obtained from a given prior process. This representation allows a probabilistic treatment of the different posterior quantities (averages/expectations).
The parametrisation lemma states -- under general conditions -- the existence of the coefficients. Although it provides a formula to actually calculate them, the integrals on which it depends cannot, in general, be computed analytically. The online scheme, presented in the 'Estimation' section, provides a means to obtain approximations to these parameters, and thus to the entire posterior process.
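For a Gaussian (regression) likelihood with a zero prior mean the integrals can be done in closed form: $Z$ is Gaussian in the prior mean, so $\boldsymbol{q} = (\boldsymbol{K} + \sigma^2\boldsymbol{I})^{-1}\boldsymbol{y}$ and $\boldsymbol{R} = -(\boldsymbol{K} + \sigma^2\boldsymbol{I})^{-1}$. The sketch below (the RBF kernel, the toy data and the noise level are arbitrary choices of mine) uses these values in eq. (33) and checks the result against the standard GP-regression formulas:

```python
import numpy as np

def rbf(X, Y, ell=1.0):
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-0.5 * d2 / ell**2)

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, (20, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(20)
Xs = np.linspace(-3, 3, 5)[:, None]                  # test inputs
noise = 0.01

K = rbf(X, X)                                        # K0 at the training inputs
A = K + noise * np.eye(len(X))                       # covariance of y under the prior

# Gaussian likelihood, zero prior mean: ln Z = ln N(y; 0, K + sigma^2 I), so
# q = d ln Z / d<f>_0 = A^{-1} y   and   R = d^2 ln Z / d<f>_0^2 = -A^{-1}
q = np.linalg.solve(A, y)
R = -np.linalg.inv(A)

# Parametrisation lemma, eq. (33)
Ks, Kss = rbf(Xs, X), rbf(Xs, Xs)
mean_lemma = Ks @ q
cov_lemma  = Kss + Ks @ R @ Ks.T

# Standard GP-regression formulas for comparison
mean_std = Ks @ np.linalg.solve(A, y)
cov_std  = Kss - Ks @ np.linalg.solve(A, Ks.T)
assert np.allclose(mean_lemma, mean_std) and np.allclose(cov_lemma, cov_std)
```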

Parametrisation in the feature space

The parametrisation lemma provides the first two moments of the posterior process. Except for Gaussian regression, where the result is exact, these moments only define an approximation to the posterior process. This approximation is written in a data-dependent coordinate system. We use the feature space $\mathcal{F}$ and the projection $\boldsymbol{\phi}_{\boldsymbol{x}}$ of the input $\boldsymbol{x}$ into $\mathcal{F}$. With the scalar product from eq. (19) replacing the kernel function $K_0$ (and a zero prior mean, as in eq. (20)), the mean and covariance functions of the posterior process become

$$\begin{split} \langle f_{\boldsymbol{x}} \rangle_{post} &= \sum_{i=1}^{N} q(i)\; \boldsymbol{\phi}_{\boldsymbol{x}}^T \boldsymbol{\phi}_i, \\ K_{post}(\boldsymbol{x},\boldsymbol{x}') &= \boldsymbol{\phi}_{\boldsymbol{x}}^T \boldsymbol{\phi}_{\boldsymbol{x}'} + \sum_{i,j=1}^{N} \boldsymbol{\phi}_{\boldsymbol{x}}^T \boldsymbol{\phi}_i\; R(ij)\; \boldsymbol{\phi}_j^T \boldsymbol{\phi}_{\boldsymbol{x}'}. \end{split} \qquad (47)$$

This shows that the mean function is expressed in the feature space as a scalar product between two quantities: the feature-space image of $\boldsymbol{x}$ and a "mean vector" $\boldsymbol{\mu}_{\mathcal{F}}$, which is also a feature-space entity. A similar identification for the posterior covariance leads to a covariance matrix in the feature space that fully characterises the covariance function of the posterior process.

The conclusion is the following: the approximating posterior GP corresponds to a Gaussian distribution in the feature space $\mathcal{F}$ whose mean and covariance are expressed as

$$ \boldsymbol{\mu}_{\mathcal{F}} = \boldsymbol{\Phi}\, \boldsymbol{q}, \qquad\qquad \boldsymbol{\Sigma}_{\mathcal{F}} = \boldsymbol{I}_{\mathcal{F}} + \boldsymbol{\Phi}\, \boldsymbol{R}\, \boldsymbol{\Phi}^T, \qquad (48)$$

where $\boldsymbol{\Phi} = [\boldsymbol{\phi}_1,\ldots,\boldsymbol{\phi}_N]$ is the concatenation of the feature vectors of all data points, and $\boldsymbol{q}$ and $\boldsymbol{R}$ collect the parameters $q(i)$ and $R(ij)$.
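To make the correspondence concrete, the sketch below uses an explicit, finite-dimensional feature map (a simple polynomial map chosen only for illustration, with arbitrary values standing in for $\boldsymbol{q}$ and $\boldsymbol{R}$) and verifies that the feature-space expressions $\boldsymbol{\phi}_{\boldsymbol{x}}^T\boldsymbol{\mu}_{\mathcal{F}}$ and $\boldsymbol{\phi}_{\boldsymbol{x}}^T\boldsymbol{\Sigma}_{\mathcal{F}}\boldsymbol{\phi}_{\boldsymbol{x}'}$ reproduce the kernel-space parametrisation of eq. (33):

```python
import numpy as np

# Explicit finite-dimensional feature map (an illustrative choice):
# phi(x) = [1, x, x^2], so that K0(x, x') = phi(x)^T phi(x').
phi = lambda x: np.stack([np.ones_like(x), x, x**2], axis=-1)   # shape (n, 3)
k0  = lambda x, y: phi(x) @ phi(y).T

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, 6)                   # training inputs
q = rng.standard_normal(6)                  # stand-in coefficients q(i)
B = rng.standard_normal((6, 6))
R = -0.01 * (B @ B.T)                       # stand-in symmetric matrix R(ij)

Phi     = phi(X).T                          # columns are the feature vectors phi_i
mu_F    = Phi @ q                           # eq. (48): mean in feature space
Sigma_F = np.eye(3) + Phi @ R @ Phi.T       # eq. (48): covariance in feature space

xs = rng.uniform(-1, 1, 4)                  # arbitrary test inputs
mean_feat = phi(xs) @ mu_F                  # phi_x^T mu_F
mean_kern = k0(xs, X) @ q                   # sum_i K0(x, x_i) q(i)

cov_feat = phi(xs) @ Sigma_F @ phi(xs).T    # phi_x^T Sigma_F phi_x'
cov_kern = k0(xs, xs) + k0(xs, X) @ R @ k0(X, xs)

assert np.allclose(mean_feat, mean_kern) and np.allclose(cov_feat, cov_kern)
```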

This result interprets Bayesian GP inference as a family of Bayesian algorithms performed in a feature space, with the result projected back into the input space by expressing it in terms of scalar products. Notice two important additions to the kernel-method framework provided by the parametrisation lemma.

First, the main difference between Bayesian GP learning and the non-Bayesian kernel-method framework is that, in contrast to approaches based on the representer theorem for SVMs, which result in a single function, the parametrisation lemma gives a full probabilistic approximation: we are "projecting" the posterior covariance back into the input space.

Second, the parametrisation is data-dependent: both the mean and the covariance are expressed in a coordinate system whose axes are the feature-space images $\boldsymbol{\phi}_i$ of the inputs, with $\boldsymbol{q}$ and $\boldsymbol{R}$ providing the coordinates of the mean and the covariance, respectively. Using once more the equivalence to the generalised linear models from Section 2.2, the GP approximation to the posterior process is a Gaussian approximation to the posterior distribution over the weights of the corresponding generalised linear model in the feature space.

A probabilistic treatment of the regression case has recently been proposed [88], where the probabilistic PCA method [89,67] is extended to the feature space. The PPCA in the kernel space uses a simpler approximation to the covariance, which has the form

$$ \boldsymbol{\Sigma} = \sigma^2\, \boldsymbol{I} + \boldsymbol{\Phi}\, \boldsymbol{W}\, \boldsymbol{\Phi}^T, \qquad (49)$$

where $\sigma^2$ can take arbitrary values and $\boldsymbol{W}$ is a diagonal matrix whose size equals the number of data points. This is a special case of the parametrised posterior covariance of eq. (48). This simplification leads to sparseness, and it is the result of an EM-like algorithm that minimises the KL distance between the empirical covariance in the feature space, $\sum_{i=1}^{N} \boldsymbol{\phi}_i \boldsymbol{\phi}_i^T$, and the parametrised covariance of eq. (49).
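A brief sketch of the restricted covariance of eq. (49) (all numbers below are hypothetical): when most diagonal entries of $\boldsymbol{W}$ are zero, the corresponding feature vectors drop out of $\boldsymbol{\Sigma}$ entirely, which is the source of the sparseness mentioned above.

```python
import numpy as np

rng = np.random.default_rng(3)
N, dim_F = 8, 5
Phi = rng.standard_normal((dim_F, N))       # columns: feature vectors of the data
sigma2 = 0.1

w = np.zeros(N)
w[[1, 4]] = [0.7, 1.3]                      # only two non-zero entries in diag(W)
Sigma = sigma2 * np.eye(dim_F) + Phi @ np.diag(w) @ Phi.T    # eq. (49)

# Equivalent sparse form: only the columns with non-zero weight contribute
active = w != 0
Sigma_sparse = (sigma2 * np.eye(dim_F)
                + Phi[:, active] @ np.diag(w[active]) @ Phi[:, active].T)
assert np.allclose(Sigma, Sigma_sparse)
```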

In the following, the joint normal distribution in the feature space with the data-dependent parametrisation of eq. (48) will be used to derive the sparsity in GPs.
