Important: since this material is quite technical, the text is largely taken from the thesis (see References). For a more compact presentation see the Neural Computation article.

Sparsity in Gaussian Processes

Let us recall the parametrisation of the posterior moments (Lemma 3.1 in the GP representation), which states that, whatever the likelihood, given a finite data sample the moments of the posterior can be expressed as linear/bilinear (multilinear for higher-order posterior moments) combinations of the prior kernel function:
\begin{displaymath}\begin{split} \langle f_{\boldsymbol { x } } \rangle _{post} & = \sum_{i=1}^{N} \alpha_i \, K_0({\boldsymbol { x } },{\boldsymbol { x } }_i) \\ \mathrm{cov}(f_{\boldsymbol { x } },f_{{\boldsymbol { x } }'})_{post} & = K_0({\boldsymbol { x } },{\boldsymbol { x } }') + \sum_{i,j=1}^{N} C_{ij} \, K_0({\boldsymbol { x } },{\boldsymbol { x } }_i) K_0({\boldsymbol { x } }_j,{\boldsymbol { x } }'). \end{split}\end{displaymath} (33)
where $K_0({\boldsymbol { x } }_i,{\boldsymbol { x } }_j)$ is the prior kernel, $N$ is the number of input data points, and the coefficients $\alpha_i$ and $C_{ij}$ are determined by the data and the likelihood.
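To make the parametrisation concrete, here is a small NumPy sketch (function and variable names, and the RBF kernel, are illustrative choices, not taken from the thesis) that evaluates the two moments of Eq. (33) at a set of test inputs, given a prior kernel and the coefficients ${\boldsymbol { \alpha } }$ and ${\boldsymbol { C } }$:
\begin{verbatim}
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """An illustrative prior kernel K_0(x, x'); rows of A and B are inputs."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * sq / lengthscale**2)

def posterior_moments(X_train, alpha, C, X_test, kernel=rbf_kernel):
    """Eq. (33): posterior mean and covariance at X_test from alpha, C, kernel."""
    K_xt = kernel(X_test, X_train)    # K_0(x, x_i) for test inputs x and data x_i
    K_tt = kernel(X_test, X_test)     # prior covariance between the test inputs
    mean = K_xt @ alpha               # <f_x>_post = sum_i alpha_i K_0(x, x_i)
    cov = K_tt + K_xt @ C @ K_xt.T    # K_0(x,x') + sum_ij K_0(x,x_i) C_ij K_0(x_j,x')
    return mean, cov
\end{verbatim}
For standard GP regression with Gaussian noise of variance $\sigma^2$, for example, this parametrisation is recovered with ${\boldsymbol { \alpha } } = ({\boldsymbol { K } } + \sigma^2{\boldsymbol { I } })^{-1}{\boldsymbol { y } }$ and ${\boldsymbol { C } } = -({\boldsymbol { K } } + \sigma^2{\boldsymbol { I } })^{-1}$, where ${\boldsymbol { K } }$ is the kernel matrix of the training inputs.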
However, we want to save resources and approximate the large sums with a more compact representation of the approximate posterior GP. A sequential selection procedure that builds a set of ``basis vectors'', the $ \cal {BV}$ set, is provided.
The tool for obtaining the sparse representation is the KL-distance between two GPs derived from the same prior ${\cal{GP}}_0$ (details can be found in Section 3.2 of the Thesis). The KL-distance is computed using the equivalence of the GPs with normal distributions in the feature space induced by the kernel.

The Kullback-Leibler divergence

Let us denote the feature space induced by the kernel by $ \cal {F}$ and let the mean and covariance of the two GPs (again, Gaussian random variables in the feature space $ \cal {F}$) be $({\boldsymbol { \mu } }_1, {\boldsymbol { \Sigma } }_1)$ and $({\boldsymbol { \mu } }_2, {\boldsymbol { \Sigma } }_2)$ respectively. The KL-distance can then be written as:
\begin{displaymath} 2\,\mathrm{KL}({\cal{GP}}_1 \,\vert\, {\cal{GP}}_2) = ({\boldsymbol { \mu } }_2 - {\boldsymbol { \mu } }_1)^T{\boldsymbol { \Sigma } }_2^{-1}({\boldsymbol { \mu } }_2 - {\boldsymbol { \mu } }_1) + \mathrm{tr}\!\left({\boldsymbol { \Sigma } }_1{\boldsymbol { \Sigma } }_2^{-1} - {\boldsymbol { I } }_{\cal{F}}\right) - \ln\left\vert{\boldsymbol { \Sigma } }_1{\boldsymbol { \Sigma } }_2^{-1}\right\vert \end{displaymath} (66)
where the means and the covariances are expressed in the high-dimensional and unknown feature space $ \cal {F}$ .
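Although $ \cal {F}$ is in general unknown and possibly infinite-dimensional, Eq. (66) is simply the KL-distance between two Gaussians; for finite-dimensional means and covariances it can be transcribed directly. A minimal NumPy sketch (names are illustrative):
\begin{verbatim}
import numpy as np

def kl_gaussians(mu1, Sigma1, mu2, Sigma2):
    """Eq. (66): KL(N(mu1, Sigma1) || N(mu2, Sigma2)), finite dimensions."""
    d = mu2 - mu1
    Sigma2_inv = np.linalg.inv(Sigma2)
    M = Sigma1 @ Sigma2_inv
    sign, logdet = np.linalg.slogdet(M)   # ln|Sigma1 Sigma2^{-1}|
    two_kl = d @ Sigma2_inv @ d + np.trace(M - np.eye(len(d))) - logdet
    return 0.5 * two_kl
\end{verbatim}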
Let us assume that the two approximate posterior GPs share the same $ \cal {BV}$ set and denote the kernel Gram matrix of this set by ${\boldsymbol { K } }_B$. The two GPs are then completely specified by their common prior kernel and the distinct sets of parameters $({\boldsymbol { \alpha } }_1, {\boldsymbol { C } }_1)$ and $({\boldsymbol { \alpha } }_2, {\boldsymbol { C } }_2)$. This leads to the following expression for the KL-distance between the two GPs:
\begin{displaymath} 2\,\mathrm{KL}({\cal{GP}}_1 \,\vert\, {\cal{GP}}_2) = ({\boldsymbol { \alpha } }_1 - {\boldsymbol { \alpha } }_2)^T \left({\boldsymbol { C } }_2 + {\boldsymbol { K } }_B^{-1}\right)^{-1} ({\boldsymbol { \alpha } }_1 - {\boldsymbol { \alpha } }_2) + \mathrm{tr}\!\left[({\boldsymbol { C } }_1 - {\boldsymbol { C } }_2)\left({\boldsymbol { C } }_2 + {\boldsymbol { K } }_B^{-1}\right)^{-1}\right] + \ln\frac{\left\vert{\boldsymbol { C } }_2 + {\boldsymbol { K } }_B^{-1}\right\vert}{\left\vert{\boldsymbol { C } }_1 + {\boldsymbol { K } }_B^{-1}\right\vert} \end{displaymath} (74)
where all quantities have the dimension of the data (no feature space is involved).
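As a sanity check on Eq. (74), the sketch below (illustrative: it assumes a deliberately simple kernel whose feature map is known and three-dimensional, and randomly chosen parameters) computes the divergence once via Eq. (66) in the explicit feature space and once via Eq. (74) from ${\boldsymbol { \alpha } }_{1,2}$, ${\boldsymbol { C } }_{1,2}$ and ${\boldsymbol { K } }_B$ only; the two numbers coincide:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

# A kernel with a known 3-dimensional feature map, chosen only so that
# Eq. (66) is directly computable: K_0(x, x') = phi(x)^T phi(x').
def phi(x):
    return np.array([1.0, x, x**2])

X_bv = np.array([-1.0, 0.3, 1.2])               # an illustrative BV set
Phi = np.stack([phi(x) for x in X_bv], axis=1)  # columns are feature vectors
K_B = Phi.T @ Phi                               # Gram matrix of the BV set

def random_params():
    alpha = rng.normal(size=3)
    A = rng.normal(size=(3, 3))
    return alpha, 0.1 * A @ A.T                 # symmetric C, keeps Sigma positive definite

(alpha1, C1), (alpha2, C2) = random_params(), random_params()

def kl_feature_space(a1, C1_, a2, C2_):
    """Eq. (66) with mu = Phi alpha and Sigma = I + Phi C Phi^T."""
    mu1, mu2 = Phi @ a1, Phi @ a2
    S1 = np.eye(3) + Phi @ C1_ @ Phi.T
    S2 = np.eye(3) + Phi @ C2_ @ Phi.T
    d, S2inv = mu2 - mu1, np.linalg.inv(S2)
    M = S1 @ S2inv
    return 0.5 * (d @ S2inv @ d + np.trace(M - np.eye(3)) - np.linalg.slogdet(M)[1])

def kl_bv_space(a1, C1_, a2, C2_):
    """Eq. (74): the same divergence from finite-dimensional quantities only."""
    K_B_inv = np.linalg.inv(K_B)
    D2 = np.linalg.inv(C2_ + K_B_inv)
    da = a1 - a2
    ld1 = np.linalg.slogdet(C1_ + K_B_inv)[1]
    ld2 = np.linalg.slogdet(C2_ + K_B_inv)[1]
    return 0.5 * (da @ D2 @ da + np.trace((C1_ - C2_) @ D2) + ld2 - ld1)

print(kl_feature_space(alpha1, C1, alpha2, C2))  # same value from both routes
print(kl_bv_space(alpha1, C1, alpha2, C2))
\end{verbatim}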

KL-optimal projection

We introduce sparsity by assuming that the GP is the result of some arbitrary learning algorithm, not necessarily the online learning from the Online estimation section. This implies that at an arbitrary time $t+1$ we have a set of basis vectors for the GP and the parameters ${\boldsymbol { \alpha } }_{t+1}$ and ${\boldsymbol { C } }_{t+1}$. We then look for the ``closest'' GP $(\hat{{\boldsymbol { \alpha } }}_{t+1}, \hat{{\boldsymbol { C } }}_{t+1})$ in the KL-sense (which we know is non-negative) with a reduced $ \cal {BV}$ set from which the last element ${\boldsymbol { x } }_{t+1}$ has been removed.
This is a constrained optimisation problem with an analytically computable solution.
Rather than (re-)deriving the result of trimming the GP, the figure below provides an intuitive explanation of why sparsity works.
Let us focus on the posterior mean, which is a linear combination of all basis vectors. The $ \cal {BV}$ set has $t+1$ elements. We remove the single element of the $ \cal {BV}$ set whose removal gives the smallest KL-distance from the original approximation; intuitively, this distance is associated with the length of the residual vector $ \phi_{res}^{}$ (a small numerical sketch follows the figure).
\includegraphics[]{project.eps}
Figure 3.1: Visualisation of the projection. The last feature vector $ \phi_{t+1}^{}$ is projected onto the subspace spanned by $\{\phi_{1},\ldots,\phi_{t}\}$, resulting in the projection $ \hat{\phi}_{t+1}^{}$ and its orthogonal residual $ \phi_{res}^{}$.
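The projection in Figure 3.1 never requires the feature map explicitly: both the projection coefficients and the length of $ \phi_{res}^{}$ follow from kernel evaluations alone. A small illustrative sketch (the function names and the RBF kernel are assumptions, not taken from the thesis):
\begin{verbatim}
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """An illustrative prior kernel K_0; any positive definite kernel works."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * sq / lengthscale**2)

def projection_residual(X_bv, x_new, kernel=rbf_kernel):
    """Project phi(x_new) onto span{phi(x_1), ..., phi(x_t)} in kernel space.

    Rows of X_bv are the current basis vectors. Returns the projection
    coefficients and ||phi_res||^2, computed from the Gram matrix K_t and
    the kernel column k = [K_0(x_i, x_new)]_i only.
    """
    K_t = kernel(X_bv, X_bv)
    k = kernel(X_bv, x_new[None, :]).ravel()
    k_star = kernel(x_new[None, :], x_new[None, :])[0, 0]
    e_hat = np.linalg.solve(K_t, k)   # coefficients of the projection phi_hat
    res_sq = k_star - k @ e_hat       # squared length of the orthogonal residual
    return e_hat, res_sq
\end{verbatim}
A small residual means the new feature vector lies almost in the span of the existing $ \cal {BV}$ set, so removing (or never adding) it changes the approximated GP very little.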
Since the real-world data we want to approximate have a strong structure, the approximations perform well on many examples (see the Examples section).

Questions, comments, suggestions: contact Lehel Csató.