Important: since this material is quite technical, the text is largely taken from the thesis (see References). For a more compact presentation see the Neural Computation article.

Sparsity in Gaussian Processes

Let us recall the parametrisation of the posterior moments (Lemma 3.1 in the GP representation), which states that, whatever the likelihood, given a finite data sample the moments of the posterior can be expressed as linear/bilinear (multilinear for higher-order posterior moments) combinations of the prior kernel function:
\begin{displaymath}\begin{split} \langle f_{\boldsymbol { x } } \rangle _{post} & = \sum_{i=1}^{N} \alpha_i \, K_0({\boldsymbol { x } },{\boldsymbol { x } }_i) \\ \mathrm{cov}(f_{\boldsymbol { x } },f_{{\boldsymbol { x } }'})_{post} & = K_0({\boldsymbol { x } },{\boldsymbol { x } }') + \sum_{i,j=1}^{N} C_{ij} \, K_0({\boldsymbol { x } },{\boldsymbol { x } }_i) K_0({\boldsymbol { x } }_j,{\boldsymbol { x } }'). \end{split}\end{displaymath} (33)
where $K_0({\boldsymbol { x } }_i,{\boldsymbol { x } }_j)$ is the prior kernel, $N$ is the number of input data points, and the coefficients $\alpha_i$ and $C_{ij}$ are determined by the data and the likelihood.
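To make the parametrisation concrete, here is a small NumPy sketch (function and variable names, and the RBF kernel, are illustrative choices, not taken from the thesis) that evaluates the two moments of Eq. (33) at a set of test inputs, given a prior kernel and the coefficients ${\boldsymbol { \alpha } }$ and ${\boldsymbol { C } }$:
\begin{verbatim}
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """An illustrative prior kernel K_0(x, x'); rows of A and B are inputs."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * sq / lengthscale**2)

def posterior_moments(X_train, alpha, C, X_test, kernel=rbf_kernel):
    """Eq. (33): posterior mean and covariance at X_test from alpha, C, kernel."""
    K_xt = kernel(X_test, X_train)    # K_0(x, x_i) for test inputs x and data x_i
    K_tt = kernel(X_test, X_test)     # prior covariance between the test inputs
    mean = K_xt @ alpha               # <f_x>_post = sum_i alpha_i K_0(x, x_i)
    cov = K_tt + K_xt @ C @ K_xt.T    # K_0(x,x') + sum_ij K_0(x,x_i) C_ij K_0(x_j,x')
    return mean, cov
\end{verbatim}
For standard GP regression with Gaussian noise of variance $\sigma^2$, for example, this parametrisation is recovered with ${\boldsymbol { \alpha } } = ({\boldsymbol { K } } + \sigma^2{\boldsymbol { I } })^{-1}{\boldsymbol { y } }$ and ${\boldsymbol { C } } = -({\boldsymbol { K } } + \sigma^2{\boldsymbol { I } })^{-1}$, where ${\boldsymbol { K } }$ is the kernel matrix of the training inputs.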
However, we want to save resources and approximate the large sums with a more compact representation of the approximate posterior GP. A sequential selection procedure that builds a set of ``basis vectors'', the $ \cal {BV}$ set, is provided.
The tool for obtaining the sparse representation is the KL-distance between two GPs derived from the same prior ${\cal{GP}}_0$ (details can be found in Section 3.2 of the Thesis). The KL-distance is computed using the equivalence of the GPs with normal distributions in the feature space induced by the kernel.

The Kullback-Leibler divergence

Let us denote the feature space induced by the kernel by $ \cal {F}$ and let the mean and covariance of the two GPs (again, Gaussian random variables in the feature space $ \cal {F}$) be $({\boldsymbol { \mu } }_1, {\boldsymbol { \Sigma } }_1)$ and $({\boldsymbol { \mu } }_2, {\boldsymbol { \Sigma } }_2)$ respectively. The KL-distance can then be written as:
\begin{displaymath} 2\,\mathrm{KL}({\cal{GP}}_1 \,\vert\, {\cal{GP}}_2) = ({\boldsymbol { \mu } }_2 - {\boldsymbol { \mu } }_1)^T{\boldsymbol { \Sigma } }_2^{-1}({\boldsymbol { \mu } }_2 - {\boldsymbol { \mu } }_1) + \mathrm{tr}\!\left({\boldsymbol { \Sigma } }_1{\boldsymbol { \Sigma } }_2^{-1} - {\boldsymbol { I } }_{\cal{F}}\right) - \ln\left\vert{\boldsymbol { \Sigma } }_1{\boldsymbol { \Sigma } }_2^{-1}\right\vert \end{displaymath} (66)
where the means and the covariances are expressed in the high-dimensional and unknown feature space $ \cal {F}$ .
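Although $ \cal {F}$ is in general unknown and possibly infinite-dimensional, Eq. (66) is simply the KL-distance between two Gaussians; for finite-dimensional means and covariances it can be transcribed directly. A minimal NumPy sketch (names are illustrative):
\begin{verbatim}
import numpy as np

def kl_gaussians(mu1, Sigma1, mu2, Sigma2):
    """Eq. (66): KL(N(mu1, Sigma1) || N(mu2, Sigma2)), finite dimensions."""
    d = mu2 - mu1
    Sigma2_inv = np.linalg.inv(Sigma2)
    M = Sigma1 @ Sigma2_inv
    sign, logdet = np.linalg.slogdet(M)   # ln|Sigma1 Sigma2^{-1}|
    two_kl = d @ Sigma2_inv @ d + np.trace(M - np.eye(len(d))) - logdet
    return 0.5 * two_kl
\end{verbatim}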
Let us assume that the two approximate posterior GPs share the same $ \cal {BV}$ set and denote the kernel Gram matrix of this set by ${\boldsymbol { K } }_B$. The two GPs are then completely specified by their common prior kernel and the distinct sets of parameters $({\boldsymbol { \alpha } }_1, {\boldsymbol { C } }_1)$ and $({\boldsymbol { \alpha } }_2, {\boldsymbol { C } }_2)$. This leads to the following expression for the KL-distance between the two GPs:
\begin{displaymath} 2\,\mathrm{KL}({\cal{GP}}_1 \,\vert\, {\cal{GP}}_2) = ({\boldsymbol { \alpha } }_1 - {\boldsymbol { \alpha } }_2)^T \left({\boldsymbol { C } }_2 + {\boldsymbol { K } }_B^{-1}\right)^{-1} ({\boldsymbol { \alpha } }_1 - {\boldsymbol { \alpha } }_2) + \mathrm{tr}\!\left[({\boldsymbol { C } }_1 - {\boldsymbol { C } }_2)\left({\boldsymbol { C } }_2 + {\boldsymbol { K } }_B^{-1}\right)^{-1}\right] + \ln\frac{\left\vert{\boldsymbol { C } }_2 + {\boldsymbol { K } }_B^{-1}\right\vert}{\left\vert{\boldsymbol { C } }_1 + {\boldsymbol { K } }_B^{-1}\right\vert} \end{displaymath} (74)
where all quantities have the dimension of the data (no feature space is involved).
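As a sanity check on Eq. (74), the sketch below (illustrative: it assumes a deliberately simple kernel whose feature map is known and three-dimensional, and randomly chosen parameters) computes the divergence once via Eq. (66) in the explicit feature space and once via Eq. (74) from ${\boldsymbol { \alpha } }_{1,2}$, ${\boldsymbol { C } }_{1,2}$ and ${\boldsymbol { K } }_B$ only; the two numbers coincide:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

# A kernel with a known 3-dimensional feature map, chosen only so that
# Eq. (66) is directly computable: K_0(x, x') = phi(x)^T phi(x').
def phi(x):
    return np.array([1.0, x, x**2])

X_bv = np.array([-1.0, 0.3, 1.2])               # an illustrative BV set
Phi = np.stack([phi(x) for x in X_bv], axis=1)  # columns are feature vectors
K_B = Phi.T @ Phi                               # Gram matrix of the BV set

def random_params():
    alpha = rng.normal(size=3)
    A = rng.normal(size=(3, 3))
    return alpha, 0.1 * A @ A.T                 # symmetric C, keeps Sigma positive definite

(alpha1, C1), (alpha2, C2) = random_params(), random_params()

def kl_feature_space(a1, C1_, a2, C2_):
    """Eq. (66) with mu = Phi alpha and Sigma = I + Phi C Phi^T."""
    mu1, mu2 = Phi @ a1, Phi @ a2
    S1 = np.eye(3) + Phi @ C1_ @ Phi.T
    S2 = np.eye(3) + Phi @ C2_ @ Phi.T
    d, S2inv = mu2 - mu1, np.linalg.inv(S2)
    M = S1 @ S2inv
    return 0.5 * (d @ S2inv @ d + np.trace(M - np.eye(3)) - np.linalg.slogdet(M)[1])

def kl_bv_space(a1, C1_, a2, C2_):
    """Eq. (74): the same divergence from finite-dimensional quantities only."""
    K_B_inv = np.linalg.inv(K_B)
    D2 = np.linalg.inv(C2_ + K_B_inv)
    da = a1 - a2
    ld1 = np.linalg.slogdet(C1_ + K_B_inv)[1]
    ld2 = np.linalg.slogdet(C2_ + K_B_inv)[1]
    return 0.5 * (da @ D2 @ da + np.trace((C1_ - C2_) @ D2) + ld2 - ld1)

print(kl_feature_space(alpha1, C1, alpha2, C2))  # same value from both routes
print(kl_bv_space(alpha1, C1, alpha2, C2))
\end{verbatim}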

KL-optimal projection

We introduce sparsity by assuming that the GP is the result of some arbitrary learning algorithm, not necessarily the online learning from the Online estimation section. This implies that at an arbitrary time $t+1$ we have a set of basis vectors for the GP and the parameters ${\boldsymbol { \alpha } }_{t+1}$ and ${\boldsymbol { C } }_{t+1}$. We then look for the ``closest'' GP $(\hat{{\boldsymbol { \alpha } }}_{t+1}, \hat{{\boldsymbol { C } }}_{t+1})$ in the KL-sense (which we know is non-negative) with a reduced $ \cal {BV}$ set from which the last element ${\boldsymbol { x } }_{t+1}$ has been removed.
This is a constrained optimisation problem with an analytically computable solution.
Rather than (re-)deriving the result of trimming the GP, the figure below provides an intuitive explanation of why sparsity works.
Let us focus on the posterior mean, which is a linear combination of all basis vectors. The $ \cal {BV}$ set has $t+1$ elements. We remove the single element of the $ \cal {BV}$ set whose removal gives the smallest KL-distance from the original approximation; intuitively, this distance is associated with the length of the residual vector $ \phi_{res}^{}$ (a small numerical sketch follows the figure).
\includegraphics[]{project.eps}
Figure 3.1: Visualisation of the projection. The last feature vector $ \phi_{t+1}^{}$ is projected onto the subspace spanned by $\{\phi_{1},\ldots,\phi_{t}\}$, resulting in the projection $ \hat{\phi}_{t+1}^{}$ and its orthogonal residual $ \phi_{res}^{}$.
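The projection in Figure 3.1 never requires the feature map explicitly: both the projection coefficients and the length of $ \phi_{res}^{}$ follow from kernel evaluations alone. A small illustrative sketch (the function names and the RBF kernel are assumptions, not taken from the thesis):
\begin{verbatim}
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """An illustrative prior kernel K_0; any positive definite kernel works."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * sq / lengthscale**2)

def projection_residual(X_bv, x_new, kernel=rbf_kernel):
    """Project phi(x_new) onto span{phi(x_1), ..., phi(x_t)} in kernel space.

    Rows of X_bv are the current basis vectors. Returns the projection
    coefficients and ||phi_res||^2, computed from the Gram matrix K_t and
    the kernel column k = [K_0(x_i, x_new)]_i only.
    """
    K_t = kernel(X_bv, X_bv)
    k = kernel(X_bv, x_new[None, :]).ravel()
    k_star = kernel(x_new[None, :], x_new[None, :])[0, 0]
    e_hat = np.linalg.solve(K_t, k)   # coefficients of the projection phi_hat
    res_sq = k_star - k @ e_hat       # squared length of the orthogonal residual
    return e_hat, res_sq
\end{verbatim}
A small residual means the new feature vector lies almost in the span of the existing $ \cal {BV}$ set, so removing (or never adding) it changes the approximated GP very little.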
Since the real-world data we want to approximate have a strong structure, the approximations perform well on many examples (see the Examples section).

Questions, comments, suggestions: contact Lehel Csató.