

Applications


\begin{summary}
The sparse online algorithm for GPs and its extension based on expectation propagation are tested on a variety of problems.
\end{summary}

In the previous chapters learning algorithms for Gaussian processes were derived. The algorithms have gradually increasing complexity, each more complex algorithm building upon the simpler ones. Specifically, if we include all training examples in the $\cal{BV}$ set within the sparse online algorithm, then we recover the online algorithm of Chapter 2. Similarly, using a single sweep through the data in the sparse EP algorithm of Chapter 4 is equivalent to the sparse online algorithm described in Chapter 3. Furthermore, when the algorithms were defined, the likelihoods were left unspecified; it was only assumed that the averaged likelihood with respect to a one-dimensional Gaussian measure can be evaluated, exactly or approximately, since this average is needed to compute the update coefficients for the online learning of Chapter 2.

For these reasons the experiments are grouped into a single chapter. This gives more emphasis to the fact that sparse GP learning is applicable to a wide range of problems, and at the same time avoids the repetition of presenting the same problem separately for each algorithm. The aim of the experiments is to show the applicability of the sparse algorithms to various problems and the benefit of the approximation provided by sparsity. Thus, wherever possible, we present results for the plain online algorithm, the sparse online algorithm, and the sparse EP algorithm with repeated sweeps through the data.

Throughout the experiments two kernels are used: the polynomial (eq. 138) and the radial basis function or RBF (eq. 139) kernels:

$K_{POL}({\boldsymbol{x}},{\boldsymbol{x}}') = \left( 1 + \frac{{\boldsymbol{x}}^T{\boldsymbol{x}}'}{d\, l_K} \right)^{k}$     (138)
$K_{RBF}({\boldsymbol{x}},{\boldsymbol{x}}') = \exp\left( - \frac{\Vert{\boldsymbol{x}}-{\boldsymbol{x}}'\Vert^2}{2 d \sigma_K^2} \right)$     (139)

where $d$ is the dimension of the inputs, used only for an approximate normalisation when datasets of different dimensions are considered. The kernel parameters are the order $k$ and the length-scale $l_K$ of the polynomial in the first case, and the width $\sigma_K$ of the radial basis functions in the second. The denominators in both cases include the normalisation constant $d$ that makes the scaling of the kernels less dependent on the input dimensionality (independent if we assume that the data are normalised to unit variance). Both kernels are often used in practice. The polynomial kernels have a finite-dimensional feature space (discussed in Section 2.2) and are used for visualisation. The RBF kernels, originating in the area of radial basis function networks [7], have an infinite-dimensional feature space; they are a popular choice in the kernel learning community.
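As an illustration only, the two kernels can be written in a few lines of numpy; this is a sketch under our own naming conventions (kernel_poly and kernel_rbf are not part of the thesis), with the normalising $d$ read off from the shape of the inputs as described above.

\begin{verbatim}
import numpy as np

def kernel_poly(X, Y, k=5, l_K=1.0):
    """Polynomial kernel of eq. (138); X is (n, d), Y is (m, d)."""
    d = X.shape[1]                      # input dimension, used for normalisation
    return (1.0 + X @ Y.T / (d * l_K)) ** k

def kernel_rbf(X, Y, sigma_K2=1.0):
    """RBF kernel of eq. (139) with the same 1/d normalisation."""
    d = X.shape[1]
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-sq / (2.0 * d * sigma_K2))
\end{verbatim}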

The following sections detail the results of applying sparse GP learning to different problems. Section 5.1 presents the results for regression and Section 5.2 for classification. A non-parametric density estimation using the sparse GPs is sketched in Section 5.3. The final application considered is a data assimilation problem: inferring a global model of wind fields from satellite observations. The GP approach to this problem and the possible benefits of sparsity are detailed in Section 5.4.


Regression

Figure 5.1: Results of applying GPs to the noisy $\ensuremath{\mathrm{sinc}}$ function. The inputs were 1000 uniformly sampled points and the outputs were corrupted by white noise with variance $\sigma_n^2 = 0.01$. The figures show the true function (continuous line), the predictive mean of the GP (dash-dotted line), and the variance of the GP at the inputs (dashed lines) for the fifth-order polynomial (left) and the RBF kernel (right). The diamonds on both plots show the $\cal{BV}$ set with the original noisy outputs.
\includegraphics[]{sinc_POL_5_0.02.eps} \includegraphics[]{sinc_RBF_10_1_0.02.eps}
(a) (b)

Regression with Gaussian noise, i.e. a quadratic loss, is analytically tractable [102], as has been detailed in previous chapters. The likelihood for a given example $({\boldsymbol{x}},y)$ in this case is

$P(y|{\boldsymbol{x}},f_{\boldsymbol{x}}) \propto \exp\left( - \frac{(y - f_{\boldsymbol{x}})^2}{2\sigma_0^2} \right)$     (140)

with $f_{\boldsymbol{x}}$ the GP marginal at the input location ${\boldsymbol{x}}$ and $\sigma_0^2$ the variance of the Gaussian noise (assumed known here). It has been shown in Chapter 2 that online learning applied to the Gaussian likelihood in eq. (140) leads to an exact iterative computation of the posterior process, expressed by its parameters $({\boldsymbol{\alpha}},{\boldsymbol{C}})$. Using the kernel matrix ${\boldsymbol{K}}_N$ for all training data, the posterior GP parameters are:

${\boldsymbol{\alpha}} = \left( {\boldsymbol{K}}_N + \sigma_0^2 {\boldsymbol{I}} \right)^{-1} \underline{{\boldsymbol{y}}} \qquad\qquad {\boldsymbol{C}} = - \left( {\boldsymbol{K}}_N + \sigma_0^2 {\boldsymbol{I}} \right)^{-1}$

with $ \underline{{\boldsymbol { y } }}$ the concatenation of all output values. The parameters for the online recursions are

$q^{(t+1)} = \frac{y_{t+1} - {\boldsymbol{\alpha}}^T{\boldsymbol{k}}_{t+1}}{\sigma_0^2 + \sigma_{t+1}^2}$     and     $r^{(t+1)} = - \frac{1}{\sigma_0^2 + \sigma_{t+1}^2}$

where ${\boldsymbol{k}}_{t+1} = [K_0({\boldsymbol{x}}_1,{\boldsymbol{x}}_{t+1}),\ldots, K_0({\boldsymbol{x}}_t,{\boldsymbol{x}}_{t+1})]^T$ and $\sigma_{t+1}^2 = k_* + {\boldsymbol{k}}_{t+1}^T{\boldsymbol{C}}{\boldsymbol{k}}_{t+1}$, with $k_* = K_0({\boldsymbol{x}}_{t+1},{\boldsymbol{x}}_{t+1})$, is the variance of the GP marginal at ${\boldsymbol{x}}_{t+1}$.
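A minimal sketch of one such (non-sparse) online update, assuming the generic recursion of Chapter 2 in which $({\boldsymbol{\alpha}},{\boldsymbol{C}})$ grow by one element per example via ${\boldsymbol{s}}_{t+1} = {\boldsymbol{C}}_t{\boldsymbol{k}}_{t+1} + {\boldsymbol{e}}_{t+1}$; the function name and interface are ours, and the kernel argument can be one of the functions sketched above.

\begin{verbatim}
import numpy as np

def online_regression_update(alpha, C, X_seen, x_new, y_new, kernel, sigma0_2):
    """One online GP update for Gaussian-noise regression (no sparsity).
    alpha: (t,), C: (t, t), X_seen: (t, d) previously processed inputs."""
    x_new = np.atleast_2d(x_new)
    k = kernel(X_seen, x_new)[:, 0]               # k_{t+1}
    k_star = kernel(x_new, x_new)[0, 0]
    var = k_star + k @ C @ k                      # sigma_{t+1}^2
    q = (y_new - alpha @ k) / (sigma0_2 + var)    # first update coefficient
    r = -1.0 / (sigma0_2 + var)                   # second update coefficient
    s = np.append(C @ k, 1.0)                     # s = C_t k_{t+1} + e_{t+1}
    alpha = np.append(alpha, 0.0) + q * s
    C = np.pad(C, ((0, 1), (0, 1))) + r * np.outer(s, s)
    return alpha, C, np.vstack([X_seen, x_new])
\end{verbatim}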

The online algorithm from Chapter 2 uses parameters $({\boldsymbol{\alpha}},{\boldsymbol{C}})$ whose size grows with the training data. In Chapter 3 it has been shown that this representation is redundant: the dimensions of the GP parameters need not exceed the dimension of the feature space. Introducing sparsity for the case of regression keeps the GP representation non-redundant. To illustrate this, we first consider learning the one-dimensional noisy $\ensuremath{\mathrm{sinc}}$ function

$y = \mathrm{sinc}(x) + \nu = \frac{\sin(x)}{x} + \nu$     (141)

using the polynomial and the RBF kernels. In the experiments we considered independent Gaussian noise with variance $\sigma_n^2 = 0.01$. First we considered the polynomial kernels from eq. (138). These kernels, if the input space is one-dimensional, are particularly simple: the dimension of the feature space associated with them is $k+1$. This means that the number of inputs in the $\cal{BV}$ set can never be larger than $k+1$; the KL-projection method from the sparse online algorithm ensures this. The results in Fig. 5.1 show that there are 6 examples in the $\cal{BV}$ set, yet the GP contains information about all the data. The RBF kernel provides a better approximation in this case, but at the cost of twice as many elements in the $\cal{BV}$ set. The size of the $\cal{BV}$ set was set to 10 manually. For the $\ensuremath{\mathrm{sinc}}$ function and RBF kernels with $\sigma_K^2 = 1$, $\cal{BV}$ set sizes larger than 12 were not accepted: the score of each new input was smaller than the threshold for addition, which was set to $10^{-6}$ for numerical stability.

Figure 5.2: Results for the Friedman dataset using 250 training and 500 test data. The two lines show the average test errors after a single sweep (continuous line) and after four iterations (dashed line) through the data. The lines show test errors averaged over 50 runs, with error bars displaying the standard deviation over the 50 experiments. The horizontal dotted line indicates the performance of the SVM and RVM algorithms; see the text for details.
\includegraphics[]{fr_RBF_1_1_0.0001_250.eps}

The next example considered was the Friedman #1 dataset [23]. In this case the inputs are sampled uniformly from the 10-dimensional hypercube. The scalar output is obtained as

$y = 10\sin(\pi x_1 x_2) + 20 (x_3 - 0.5)^2 + 10 x_4 + 5 x_5$     (142)

and the variables $x_6,\ldots,x_{10}$ do not affect the output; they act only as noise. Fig. 5.2 shows the test errors for this dataset. From the clearly visible plateau as the $\cal{BV}$ set size increases we conclude that the sparse online GP learning procedure from Chapter 3, plotted with continuous lines, gives solutions as good as the online learning that includes all examples in the $\cal{BV}$ set. This suggests that the effective dimension of the data in the feature space defined by the kernel is approximately 120-130, and there is no need to use more $\cal{BV}$s than this effective dimensionality.
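For reference, data of this form can be generated in a few lines; this is a sketch using our own function name, and it follows eq. (142) literally (any additional output noise used in the original experiments is not specified here).

\begin{verbatim}
import numpy as np

def friedman1(n, rng):
    """Inputs uniform on the 10-dimensional unit hypercube, outputs from eq. (142)."""
    X = rng.uniform(size=(n, 10))
    y = (10 * np.sin(np.pi * X[:, 0] * X[:, 1]) + 20 * (X[:, 2] - 0.5) ** 2
         + 10 * X[:, 3] + 5 * X[:, 4])            # x_6 ... x_10 are ignored
    return X, y

rng = np.random.default_rng(0)
X_train, y_train = friedman1(250, rng)
X_test, y_test = friedman1(500, rng)
\end{verbatim}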

The results of the GP learning were compared to those obtained by Support Vector regression (SVM) and the Relevance Vector machine (RVM), as provided by [87]. The test error in the steady region (i.e. $\cal{BV}$ set size > 120) for our algorithm was $\approx 2.4$, which is smaller than that of the SVM (2.92) and the RVM (2.80). The number of parameters of the GP algorithm, however, is larger: it is quadratic in the size of the $\cal{BV}$ set, whilst both the SVM and the RVM scale linearly with the number of their ``support vectors'' (116 were used in the experiments) and ``relevance vectors'' (59 used) respectively.

On the other hand, as discussed in Section 3.7, in their standard form both algorithms use the kernel matrix with respect to all data; this is not the case for the sparse EP/GP algorithm.

The sparse EP algorithm is applied in the second sweep through the data. When the size of the $\cal{BV}$ set is close enough to the effective dimension of the data in the feature space, this second sweep does not give any improvement. This is expected since for this problem the posterior process is also a GP: no approximations are involved in the online learning. As a result, in the region where the $\cal{BV}$ set sizes exceed this effective dimension, the KL-projection in the sparse online algorithm is just a rewriting of the feature space image of the new input as a combination of inputs from the $\cal{BV}$ set, and the length of the residual is practically zero. If we are not close to this limit region, then the EP corrections do improve the performance of the algorithm, as is the case for the small $\cal{BV}$ set sizes in Fig. 5.2. For regression this improvement is not significant, staying within the empirical error bars, as is also shown for the next application: the Boston housing dataset.

The Boston housing dataset is a real dataset with 506 examples, each having 13 input dimensions and a single output. The 13 inputs describe different characteristics of houses and the task is to predict their price. In the experiments the dataset was split into training and test sets of sizes 481/25; we used 100 random splits for each $\cal{BV}$ set size, and the results are shown in Fig. 5.3. Comparing different $\cal{BV}$ set sizes, we see that there is no improvement beyond a certain $\cal{BV}$ set size.

Two series of experiments were performed: one using the dataset without pre-processing and one with the inputs normalised to zero mean and unit variance (the two sub-figures of Fig. 5.3). Comparing with the SVM and RVM methods, which achieved squared errors of approximately 8 [87], we conclude that the performance of the sparse GP algorithm is comparable to these other kernel algorithms. We mention, however, that the hyper-parameters $\sigma_0^2$ and $\sigma_K^2$ were set by trial and error; we did not use cross-validation or any other hyper-parameter selection method, since the aim of the thesis is to show the applicability of the sparse GP framework to real-world problems.

Figure 5.3: Results for the Boston dataset with normalised (left) and unnormalised (right) inputs. The results are obtained using RBF kernels with $ \sigma_{K}^{2}$ = 2 and $ \sigma_{K}^{2}$ = 2500 for the normalised and the unnormalised cases.
\includegraphics[]{bo_RBF_2_3_0.01_481_norm.eps} \includegraphics[]{bo_RBF_2500_1_5e-05_481.eps}
(a) (b)

We see that if the size of the $ \cal {BV}$ set is large enough, the expectation-propagation algorithm does not improve on the performance of the sparse online algorithm. Improvements when the data is re-used are visible only if the $ \cal {BV}$ set is small. This improvement is not visible when learning the unnormalised Boston dataset.

The main message of this section is that using sparsity in the GPs does not lead to a significant degradation in the performance of the algorithm. However, the differences between the sub-plots in Fig. 5.3 emphasise the necessity of developing methods to adjust the hyper-parameters. In the next section we present a case where the EP algorithm does improve on the results of the sparse online learning: classification using GPs.


Classification

A nontrivial application of the online GPs is the classification problem: the posterior process is non-Gaussian and we need approximations to obtain a GP from it. We use the probit model [49] where a binary label $y \in \{-1,1\}$ is assigned to an input ${\boldsymbol{x}} \in \mathbb{R}^m$ with the data likelihood

$P(y|f_{\boldsymbol{x}}) = \mathrm{Erf}\left( \frac{y\, f_{\boldsymbol{x}}}{\sigma_0} \right),$     (143)

where $\sigma_0$ is the noise variance and $\mathrm{Erf}(z)$ is the cumulative Gaussian distribution:

$\mathrm{Erf}(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} dt\; \exp\left( -\frac{t^2}{2} \right)$     (144)

The shape of this likelihood resembles a sigmoid; the main benefit of this choice is that its average with respect to a Gaussian measure is computable. We can thus compute the predictive distribution at a new example ${\boldsymbol{x}}$:

$p(y|{\boldsymbol{x}},{\boldsymbol{\alpha}},{\boldsymbol{C}}) = \langle P(y|f_{\boldsymbol{x}}) \rangle_t = \mathrm{Erf}\left( \frac{y\, \langle f_{\boldsymbol{x}} \rangle_t}{\sigma_{\boldsymbol{x}}} \right)$     (145)

where $\langle f_{\boldsymbol{x}} \rangle_t = {\boldsymbol{k}}_{\boldsymbol{x}}^T{\boldsymbol{\alpha}}$ is the mean of the GP at ${\boldsymbol{x}}$ and $\sigma_{\boldsymbol{x}}^2 = \sigma_0^2 + k_{*{\boldsymbol{x}}} + {\boldsymbol{k}}_{\boldsymbol{x}}^T{\boldsymbol{C}}{\boldsymbol{k}}_{\boldsymbol{x}}$. The predictive distribution of the new data point is thus a Gaussian average of another Gaussian. The result is obtained by changing the order of integration in the Bayesian predictive distribution for the likelihood of eq. (143) and back-substituting the definition of the error function.

Based on eq. (53), for a given input-output pair $({\boldsymbol{x}}, y)$ the update coefficients $q^{(t+1)}$ and $r^{(t+1)}$ are computed by differentiating the logarithm of the averaged likelihood from eq. (145) with respect to the mean at ${\boldsymbol{x}}_{t+1}$ [17]:

$q^{(t+1)} = \frac{y}{\sigma_{\boldsymbol{x}}} \, \frac{\mathrm{Erf}'}{\mathrm{Erf}}$                    $r^{(t+1)} = \frac{1}{\sigma_{\boldsymbol{x}}^2} \left\{ \frac{\mathrm{Erf}''}{\mathrm{Erf}} - \left( \frac{\mathrm{Erf}'}{\mathrm{Erf}} \right)^2 \right\}$     (146)

with $\mathrm{Erf}(z)$ evaluated at $z = \frac{y\,{\boldsymbol{\alpha}}_t^T{\boldsymbol{k}}_{\boldsymbol{x}}}{\sigma_{\boldsymbol{x}}}$, and $\mathrm{Erf}'$ and $\mathrm{Erf}''$ the first and second derivatives at $z$.
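Since $\mathrm{Erf}$ is the standard normal cumulative distribution, $\mathrm{Erf}'(z)$ is the normal density and $\mathrm{Erf}''(z) = -z\,\mathrm{Erf}'(z)$, so the coefficients of eq. (146) can be computed as in the following sketch (function name and interface are ours; scipy's norm provides the pdf and cdf).

\begin{verbatim}
import numpy as np
from scipy.stats import norm

def probit_update(y, mean, var, sigma0_2):
    """Update coefficients q, r of eq. (146) for the probit likelihood.
    mean = k_x^T alpha, var = k*_x + k_x^T C k_x (the GP marginal at x)."""
    sigma_x = np.sqrt(sigma0_2 + var)
    z = y * mean / sigma_x
    ratio = norm.pdf(z) / norm.cdf(z)              # Erf'/Erf
    q = y * ratio / sigma_x
    r = (-z * ratio - ratio ** 2) / sigma_x ** 2   # uses Erf''(z) = -z Erf'(z)
    return q, r
\end{verbatim}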

Figure 5.4: Results of the sparse algorithm for the Crab dataset: (a) the average test errors vs. different $ \cal {BV}$ set sizes averaged over 50 experiments for each $ \cal {BV}$ set size. The error bars show the standard deviation of the test errors for the 50 experiments. (b) Cross-section of the experiments for $ \cal {BV}$ set size 50: the upper subplot shows the evolution of test and training errors during the EP-iterations and on the lower subplot the average replacements of elements in the $ \cal {BV}$ set are displayed.
\includegraphics[]{crab_RBF_2_4_0.01_80.eps} \includegraphics[]{crab_RBF_2_4_0.01_80_frag.eps}
(a) (b)

The algorithm for classification was first tested on two small datasets. The small size of these datasets allowed us to benchmark the sparse EP algorithm against the full EP algorithm presented in Chapter 4. The first dataset is the Crab dataset [66]. This dataset has 6-dimensional inputs, measuring different characteristics of Leptograpsus crabs, and the task is to predict their gender. There are 200 examples split into 80/120 training/test sets. In the experiments we used the splits given in [99] and RBF kernels to obtain the results shown in Fig. 5.4.a (the polynomial kernels gave slightly larger test errors). As in the regression case, we see the long plateau for the sparse online algorithm (the thick continuous line labeled ``1 sweep'') as the $\cal{BV}$ set size increases. We also see that the online and the sparse online learning algorithms have a relatively large variation in their performance, as shown by the error bars.

The result of applying the full EP algorithm of [46] can be seen in Fig. 5.4.a when all 80 training examples are included in the $\cal{BV}$ set. We see that the average performance of the algorithm improved when a second and a third iteration through the data were made. Apart from the decrease in the test error, by the end of the third iteration the fluctuations caused by the ordering of the data also vanished.

As the dashed line in Fig. 5.4.a shows, if the size of the $\cal{BV}$ set is larger than 20, the sparse EP algorithm has the same behaviour as the full EP, reaching a test error of 2.5%, meaning 3 misclassified examples from the test set. Interestingly, this small test error is reached even though more than 80% of the inputs were removed, showing the benefit of combining the EP algorithm with the sparse online GP. The performance of the algorithm is in the range of the results of other state-of-the-art methods: 3 errors with the TAP mean field method, 2 using the naive mean field method [58], and 3 errors when applying the projection pursuit regression method [66].

Fig. 5.4.b shows a cross-section of the simulations for $\char93 \ensuremath{{\cal{BV}}}=50$. The top sub-figure plots the average test and training errors with the empirical deviations of the errors (the dashed lines). The additional computational cost of the algorithm, compared with the sparse online method, is the update of the projection matrix ${\boldsymbol{P}}$, requiring $\cal{O}(Nd)$ operations. This update is needed only if we are replacing an element of the $\cal{BV}$ set. The bottom part of Fig. 5.4.b shows the average number of changes in the $\cal{BV}$ set. Although there are no changes in the test or the training errors after the second iteration, the costly modification of the $\cal{BV}$ set is still performed relatively often, suggesting that the changes in the $\cal{BV}$ set after this stage are not important. This suggests that for larger datasets we could employ different strategies for speeding up the algorithm. One simple suggestion is to fix the elements of the $\cal{BV}$ set after the second iteration and to run the sparse EP algorithm in subsequent iterations by projecting the data onto the subspace of the $\cal{BV}$ elements. This later stage of the algorithm would be as fast as the sparse online algorithm since the storage of the matrix ${\boldsymbol{P}}$ is not required. A second strategy would be to choose the kernel parameters and the size of the $\cal{BV}$ set such that the number of replacements is low, the given number of basis vectors representing all training inputs; this is possible using appropriate pre-processing or by adapting the model parameters accordingly.

The selection of the parameters for the RBF kernel is not discussed. When running the different experiments we observed that the size of the $ \cal {BV}$ set and the optimal kernel parameters are strongly related, making the model selection issue an important future research topic.

Figure 5.5: The results of the sparse algorithm for the Sonar dataset. The different lines in the sub-plots are as in Fig. 5.4.
\includegraphics[]{son_RBF_0.25_1_0.0001_104.eps} \includegraphics[]{son_RBF_0.25_1_0.0001_104_frag.eps}
(a) (b)

A second classification experiment used the sonar data [29]. This dataset was originally used to test the classification performance of neural networks. It consists of 208 measurements of sonar signals bounced off a cylinder made of metal or off a rock, each measurement having 60 components. In the experiments we used the same split of the data as [29]. The data was preprocessed to unit variance in all components and we used RBF kernels with $\sigma_K = 0.25$ and noise $\sigma_0^2 = 0.0001$. The results for different $\cal{BV}$ set sizes are presented in Fig. 5.5. We can again see a plateau of almost constant test error starting at a $\cal{BV}$ set size of about 70. This plateau is not as clear as the one for the Crab dataset in Fig. 5.4; a drop in the test error is visible at the upper limit of the $\cal{BV}$ set sizes. The difference might be due to the higher dimensionality of the data: 60 as opposed to 6, whilst the sizes of the training sets are comparable.

For comparison with other algorithms we used the results reported in [58]. The test errors of the naive mean field algorithm and the TAP mean field algorithm were both 7.7%, the closest to the result of the EP algorithm, which was 6.7%. It is also important that the EP test error at $\char93 \ensuremath{{\cal{BV}}}=100$, i.e. when using almost all training data, becomes independent of the order of presentation: the error bars vanish.

However, using only half of the training data in the $\cal{BV}$ set, the performance of the sparse EP algorithm is 8.9±0.9%, still as good as the SVM or the best performing two-layered neural network, both of which are reported to have test errors of 9.6% [66,58].

A third dataset used in the tests is the USPS dataset of gray-scale handwritten digit images of size 16×16. The dataset consists of 7291 training and 2007 test patterns. In the first experiment we studied the problem of classifying the digit 4 against all other digits. Fig. 5.6.a plots the test errors of the algorithm for different $\cal{BV}$ set sizes and a fixed value of the hyper-parameter $\sigma_K^2 = 0.5$.

Figure 5.6: Results for the binary (a) and multi-class (b) classification; the lines show the average test errors of 10 experiments with different orderings of the data. The multi-class case is a combination of the 10 individual classifiers: the example ${\boldsymbol{x}}$ is assigned to the class with the highest $P(C_i|{\boldsymbol{x}})$. The error bars show the standard deviation of the test errors over the 10 independent orderings. The smaller number of $\cal{BV}$ set sizes on the multi-class plot (b) is due to computational limitations: the coefficients of the 10 individual classifiers are computed simultaneously.
\includegraphics[]{usps4_RBF_0.5_2_10.eps} \includegraphics[]{usps_all_RBF_0.5_2.eps}
(a) (b)

The USPS dataset has frequently been used to test the performance of kernel-based classification algorithms. We mention the kernel PCA method of  [73], the Nyström method of  [101] and the application of the kernel matching pursuit algorithm [94], all discussed in detail in Section 3.7.

The Nyström approach considers a subset of the training data onto which the inputs are projected; it is thus similar to the sparse GP algorithm. This approximation proved to give good results even when the size of the subset was as low as 256. The mean test error for this case was $\approx 1.7\%$ [101], so the results in Fig. 5.6.a are in the same range.

The PCA reduced-set method of [73] and the kernel matching pursuit algorithm give very similar results for the USPS dataset, sometimes as low as 1.4%. By performing additional model selection, the results of the sparse GP/EP algorithms could be improved; our aim in these experiments was to show that the algorithm produces results similar to other sparse algorithms in the family of kernel methods.

Figure 5.7: Combining rejection and multiclass classification: by increasing the rejection threshold, the percentage of the misclassified examples decreases and the total error does not increase significantly. Subplot (a) uses 150 basis vectors and subplot (b) uses 350.
\includegraphics[]{usps_reject_RBF_0.5_2_150.eps} \includegraphics[]{usps_reject_RBF_0.5_2_350.eps}
(a) (b)

The sparse EP algorithm was also tested on the more realistic problem of classifying all ten digits simultaneously. The ability to compute Bayesian predictive probabilities is absolutely essential in this case. We have trained 10 classifiers on the ten binary classification problems of separating a single digit from the rest. An unseen input was assigned to the class with the highest predictive probability given by eq. (145). Fig. 5.6.b summarises the results for the multi-class case for different $ \cal {BV}$ set sizes and RBF kernels (with the external noise variance $ \sigma_{0}^{2}$ = 0).

To reduce the computational cost we used the same $\cal{BV}$ set for all individual classifiers (only a single inverse of the Gram matrix was needed, and the storage cost is also smaller). This made the implementation of deleting a basis vector for the multi-class case less straightforward: for each input and each basis vector there are 10 individual scores. We implemented a ``minimax'' deletion rule: whenever a deletion was needed, the basis vector having the smallest maximum score among the 10 classification problems was deleted, i.e. the index $l$ of the deleted input was

$l = \arg\min_{i\in\ensuremath{{\cal{BV}}}} \; \max_{c\in\overline{0,9}} \; \varepsilon_{ci}$.     (147)
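A sketch of this deletion rule, assuming the per-classifier scores are stored as a matrix with one row per basis vector (the function name is ours).

\begin{verbatim}
import numpy as np

def minimax_delete(scores):
    """scores[i, c]: deletion score of basis vector i under binary classifier c.
    Returns the index of the basis vector with the smallest maximum score, eq. (147)."""
    return int(np.argmin(scores.max(axis=1)))
\end{verbatim}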

The results of the combined classification for 100 and 350 $\cal{BV}$s are shown in Fig. 5.6.b. The sparse online algorithm gave a test error of 5.4% and the sparse EP refinement further improved the result to 5.15%. Note that in obtaining the test errors we used the same 350 basis vectors for all binary classifiers. This matters especially when making predictions for an unseen digit: if we used the SVM approach and computed the 10 binary predictions, we would need to evaluate the kernel 2560 times, as opposed to 350 evaluations in the basis vector case. The results in Fig. 5.6.b are also close to the batch performance reported in [73], where the combination of the 10 independent classifiers obtained using kernel PCA attained a test error of 4.4%.

The benefit of the probabilistic treatment of classification is that we can reject some of the data: the examples for which the probability of belonging to any of the classes is smaller than a predefined threshold. In Fig. 5.7 we plot the test errors and the rejected inputs for different threshold values.
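A sketch of such a rejection rule, assuming the per-class predictive probabilities of eq. (145) are collected in a matrix (the interface and the value -1 used to mark rejection are our choices).

\begin{verbatim}
import numpy as np

def classify_with_reject(probs, threshold):
    """probs[n, c]: predictive probability of class c for test input n.
    Returns the predicted class per input, or -1 when the best class
    falls below the rejection threshold."""
    pred = probs.argmax(axis=1)
    pred[probs.max(axis=1) < threshold] = -1
    return pred
\end{verbatim}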

An immediate extension of the sparse classification algorithm would be to the case of active learning [75,9], described in Section 3.8.1. It is expected that this extension, at the cost of searching among the training data, would improve the convergence of the algorithm.

To sum up, in the classification case, contrary to regression, we obtained a significant improvement when using the sparse EP algorithm. The improvement was more pronounced in the low-dimensional case. An important benefit of the EP algorithm for classification was to make the results (almost) independent of the order of presentation, as seen in the figures showing the performance of the EP algorithm.


Density Estimation

Density estimation is an unsupervised learning problem: we have a set of $p$-dimensional data $\cal{D} = \{{\boldsymbol{x}}_i\}_{i=1}^N$ and our goal is to approximate the density that generated the data. This is an ill-posed problem, the weighted sum of delta-peaks being an extreme among the possible solutions [93]. In the one-dimensional case a common density estimation method is the histogram, and the Parzen-windowing technique can be thought of as an extension of the histogram method to the multi-dimensional case. In this case the kernel is $K_0({\boldsymbol{x}},{\boldsymbol{x}}') = h(|{\boldsymbol{x}}-{\boldsymbol{x}}'|^2)$ with the function $h$ chosen such that $\int d{\boldsymbol{x}}\, K_0({\boldsymbol{x}},{\boldsymbol{x}}') = 1$ for all ${\boldsymbol{x}}'$. The Parzen density estimate, based on $K_0$, is written as [21]:

$\hat{p}_{parzen}({\boldsymbol{x}}) = \frac{1}{N} \sum_{i=1}^{N} K_0({\boldsymbol{x}}_i,{\boldsymbol{x}})$     (148)

where the prefactor $1/N$ to the sum can be thought of as a coefficient $\alpha_i$ that weights the contribution of each input to the inferred distribution. The difficulty in using this model, as with other non-parametric techniques, is that it cannot be used for large data sizes. Even though there is no cost in computing $\alpha_i$, storing every training point and summing over all of them becomes infeasible, raising the need for simplifications.
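For concreteness, a minimal sketch of the Parzen estimate of eq. (148) with a normalised Gaussian kernel (the bandwidth $h$ and the function name are our own choices).

\begin{verbatim}
import numpy as np

def parzen_density(X_query, X_data, h=0.1):
    """Parzen estimate of eq. (148) with an isotropic Gaussian kernel of width h.
    X_query: (m, d) evaluation points, X_data: (N, d) training inputs."""
    d = X_data.shape[1]
    sq = np.sum((X_query[:, None, :] - X_data[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (2 * h ** 2)) / (2 * np.pi * h ** 2) ** (d / 2)
    return K.mean(axis=1)           # the 1/N weighting of each data point
\end{verbatim}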

Recent studies have addressed this problem using other kernel methods based on the idea of Support Vector Machines: [97] used SVMs for density estimation. Instead of using the fixed set of coefficients $\alpha_i = 1/N$, they allowed these coefficients to vary. An additional degree of freedom was added by allowing kernels with varying shapes to be included in eq. (148). The increase in complexity was compensated by corresponding regularisation penalties. The result was a sparse representation of the density function with only a few nonzero coefficients. This made the evaluation of the inferred density function faster, but to obtain the sparse set of solutions a linear system involving all data points had to be solved.

A different method, proposed by [28], used an orthogonal series expansion of the kernels in the Parzen estimate of eq. (148). This mapped the problem into a feature space determined by the orthogonal expansion. Sparsity is achieved by using kernel PCA [70] or the Nyström method for kernel machines [101]. The study presented empirical comparisons between the kernel PCA density estimation and the original Parzen method, showing that a significant reduction of the components in the sum is possible without a decay in the quality of the estimate.

In this section we apply the sparse GPs and the EP algorithm to obtain a Bayesian approach to density estimation. We use a random function f to model the distribution and the prior over the function f regularises the density estimation problem.

The prior over the functions is a GP; next we define the likelihood. A basic requirement for a density function is to be non-negative and normalised, thus we consider the likelihood model:

$h({\boldsymbol{x}}| f) = \frac{f_{\boldsymbol{x}}^2}{\int d{\boldsymbol{z}}\; f_{\boldsymbol{z}}^2}$     (149)

with $f_{\boldsymbol{x}}$ the value at ${\boldsymbol{x}}$ of a random function drawn from the prior process, and $h({\boldsymbol{x}}|f)$ the density at the input ${\boldsymbol{x}}$. An important difference from the previous cases is that the likelihood for a single output is conditioned on the whole process. We assume a compact domain for the inputs; this makes the normalising integral well defined. Another possibility would be to assume a generic prior density for the inputs instead of constraining the input domain, but this is left for future work.

In contrast to the previous applications of sequential GP learning presented in this chapter, this ``likelihood'' is not local: the density at a point depends on all other values of $f$, the realisation of the random process. This density model has been studied in [34] and [69], and other likelihoods for density estimation have also been considered, such as $p({\boldsymbol{x}}|f) \propto \exp(-f_{\boldsymbol{x}})$ in [50], but these earlier studies were concerned with finding the maximum a-posteriori value of the process.

We are using eq. (149) to represent the density, making the inference of the unknown probability distribution equivalent to learning the underlying process. We write the posterior for this process as

$p_{post}(f) = \frac{1}{Z}\, p_0(f)\, \frac{\prod_{i=1}^N f_i^2}{\left(\int d{\boldsymbol{z}}\; f_{\boldsymbol{z}}^2\right)^N}$     with     $Z = \int df\; p_0(f)\, \frac{\prod_{i=1}^N f_i^2}{\left(\int d{\boldsymbol{z}}\; f_{\boldsymbol{z}}^2\right)^N}$     (150)

where again, due to the denominator, we need to take into account all possible realisations of the function f, i.e. we have a functional integral.

As usual, we are interested in the predictive distribution of the density, which is given by eq. (149) with $f_{\boldsymbol{x}}$ the marginal of the posterior GP at ${\boldsymbol{x}}$. This yields a random variable for the density at ${\boldsymbol{x}}$; computing its mean gives the prediction for the density function:

$h({\boldsymbol{x}}|\cal{D}) = \frac{1}{Z} \int df\, p_0(f)\, \frac{f_{\boldsymbol{x}}^2 \prod_{i=1}^N f_i^2}{\left(\int d{\boldsymbol{z}}\; f_{\boldsymbol{z}}^2\right)^{N+1}}$     (151)

with the normalising constant $Z$ defined in eq. (150).

The normalisation in the previous equations makes the density model intractable. Essentially, it prevents the solutions from being expressible using the representer theorem of [40] or the representation lemma from Chapter 2. Next we transform the posterior into a form that makes the representation lemma applicable to the density estimation problem. For this we observe that the normalising term is independent of the input set and of the location at which the distribution is examined. To apply the parametrisation lemma, we absorb the normalising term into the prior to obtain a new prior for $f$. For this we use the Gamma-type integral

$\int_{0}^{\infty} d\lambda\; \lambda^{N}\, e^{-\frac{\lambda}{2}\int d{\boldsymbol{z}}\, f_{\boldsymbol{z}}^2} = \frac{N!\, 2^{N+1}}{\left(\int d{\boldsymbol{z}}\; f_{\boldsymbol{z}}^2\right)^{N+1}}$     (152)

where the integral with respect to fz is considered fixed. We can rewrite the predicted density by adding $ \lambda$ to the set of model parameters:

$p({\boldsymbol{x}}|\cal{D}) = \frac{1}{2^{N+1}N!\,Z} \int_{0}^{\infty} d\lambda\; \lambda^{N} \int df\, p_0(f)\, e^{-\frac{\lambda}{2}\int d{\boldsymbol{z}}\, f_{\boldsymbol{z}}^2}\; f_{\boldsymbol{x}}^2 \prod_{i=1}^{N} f_i^2 \;\propto\; \int_{0}^{\infty} d\lambda\; Z_\lambda\, \lambda^{N}\, E_\lambda\!\left[ f_{\boldsymbol{x}}^2 \prod_{i=1}^N f_i^2 \right]$     (153)

where an expectation over a new Gaussian prior has been introduced. The new GP is obtained by multiplying the initial GP with the exponential from eq. (152), and $Z_\lambda$ is the normalisation corresponding to the new ``effective GP''.

Next we derive the covariance of this ``effective GP''. For this we use the decomposition of the kernel $K_0$ in terms of an orthonormal set of eigenfunctions $\phi_i({\boldsymbol{x}})$. For this set of functions we have

$\int d{\boldsymbol{x}}\; \phi_i({\boldsymbol{x}})\, \phi_j({\boldsymbol{x}}) = \delta_{ij}$     (154)

such that $K_0({\boldsymbol{x}},{\boldsymbol{x}}') = \sum_i \sigma_i^2\, \phi_i({\boldsymbol{x}})\phi_i({\boldsymbol{x}}')$; the details of the different mappings into the feature space were presented in Section 2.1. We use $\sigma_i^2$ to denote the eigenvalues in order to avoid the confusion that would result from a double use of the symbol $\lambda$.

The decomposition of the kernel leads to the projection function from the input into the feature space:

$\phi({\boldsymbol{x}}) = \left[ \sigma_1\phi_1({\boldsymbol{x}}),\, \sigma_2\phi_2({\boldsymbol{x}}),\, \ldots \right]^T$     (155)

and we can write the random function as $f_{\boldsymbol{x}} = f({\boldsymbol{x}}) = {\boldsymbol{\xi}}^T\phi({\boldsymbol{x}})$, where ${\boldsymbol{\xi}} = [\xi_1,\xi_2,\ldots]^T$ is the vector of random variables in the feature space; here we use vector notation.

We want to express the normalisation integral using the feature space. The orthonormality of $ \phi_{i}^{}$(x) from eq. (154) implies that

$\int d{\boldsymbol{z}}\; f_{\boldsymbol{z}}^2 = {\boldsymbol{\xi}}^T{\boldsymbol{L}}{\boldsymbol{\xi}}$     (156)

where ${\boldsymbol{L}}$ is a diagonal matrix with $\sigma_i^2$ on the diagonal. The modified Gaussian distribution of the feature space variables is written as:

$p_\lambda({\boldsymbol{\xi}}) \propto \exp\left\{ -\frac{1}{2}\left[ {\boldsymbol{\xi}}^T{\boldsymbol{\xi}} + \lambda\, {\boldsymbol{\xi}}^T{\boldsymbol{L}}{\boldsymbol{\xi}} \right] \right\} \propto \exp\left\{ -\frac{1}{2}\, {\boldsymbol{\xi}}^T\left({\boldsymbol{I}}_{\ensuremath{\pmb{\cal{F}}}} + \lambda{\boldsymbol{L}}\right){\boldsymbol{\xi}} \right\}$     (157)

and this change implies that the kernel for the ``effective GP'' is

$\langle f({\boldsymbol{x}})\, f({\boldsymbol{x}}') \rangle_\lambda = \sum_i \phi_i({\boldsymbol{x}})\,\phi_i({\boldsymbol{x}}')\, \frac{\sigma_i^2}{1 + \lambda\sigma_i^2}$     (158)

where the initial projection into the feature space and the modified distribution of the random variables in that feature space were used. The final form of the GP kernel is found by observing that $K_\lambda$ has the same eigenfunctions but with eigenvalues $(\sigma_i^{-2} + \lambda)^{-1}$, thus

$\ensuremath{{\boldsymbol{\Sigma}}}_\lambda = \left( \ensuremath{{\boldsymbol{\Sigma}}}_0^{-1} + \lambda{\boldsymbol{I}} \right)^{-1} = \ensuremath{{\boldsymbol{\Sigma}}}_0 \left( {\boldsymbol{I}} + \lambda\ensuremath{{\boldsymbol{\Sigma}}}_0 \right)^{-1}.$     (159)
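Purely as an illustration of the eigenvalue shrinkage in eq. (159), the same transformation can be applied to a Gram matrix of the prior kernel evaluated on a grid of inputs; this sketch ignores the scaling between operator and matrix eigenvalues and is only meant to show the form of the transformation.

\begin{verbatim}
import numpy as np

def effective_gram(K0_grid, lam):
    """Shrink each eigenvalue s of a (symmetric) prior Gram matrix to s / (1 + lam*s),
    mirroring the operator identity of eq. (159) on a discretised domain."""
    s, U = np.linalg.eigh(K0_grid)
    return (U * (s / (1.0 + lam * s))) @ U.T
\end{verbatim}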

As a result of the previous transformations, the normalising term of the likelihood has been eliminated by adding $\lambda$ to the model parameters. This simplifies the likelihood, but for each $\lambda$ we have a different GP kernel, and to find the most probable density we would have to integrate over $\lambda$, which is not feasible in practice.

Our aim is to approximate the posterior p(x|$ \cal {D}$) from eq. (153) with a GP in such a way that we are able to apply the representation lemma from Chapter 2. For this we first eliminate the integral with respect to $ \lambda$ and then approximate the kernel in the effective Gaussian from eq. (159).

The integral over $\lambda$ is eliminated using a maximum a-posteriori (MAP) approximation. We compute the $\lambda_m$ that maximises the log-integrand in eq. (153). Setting its derivative to zero leads to

$\frac{N}{\lambda_m} = \frac{\int d{\boldsymbol{x}}\; E_{\lambda_m}\!\left[ f_{\boldsymbol{x}}^2 \prod_{i=1}^N f_i^2 \right]}{E_{\lambda_m}\!\left[ \prod_{i=1}^N f_i^2 \right]}.$     (160)

We replace the integral over $\lambda$ by the value of the integrand at $\lambda_m$ and, to simplify the notation, we drop the subscript in the following. The predictive density becomes

$p({\boldsymbol{x}}|\cal{D}) \approx \frac{\lambda}{N}\, \frac{E_\lambda\!\left[ f_{\boldsymbol{x}}^2 \prod_{i=1}^N f_i^2 \right]}{E_\lambda\!\left[ \prod_{i=1}^N f_i^2 \right]}.$     (161)

Using this approximation to the predictive density, we can employ the parametrisation lemma and the sequential algorithms, as in the previous cases, to infer a posterior process. We observe that in the numerator of eq. (161) we have a product of single-data likelihoods $f_i^2$, this time without the normalisation of eq. (149), and the denominator is the normalisation required by the posterior.

The application of the EP algorithm to this modified GP becomes straightforward. For a fixed $\lambda$ we use the ``effective GP'' from eq. (159) as the prior process. We iterate the EP learning from Chapter 4 and obtain a posterior process with the data-dependent parameters $({\boldsymbol{\alpha}},{\boldsymbol{C}})$. The parameters also depend on $\lambda$, thus the complete GP is characterised by the triplet, which we denote ${\ensuremath{{\cal{GP}}}}_\lambda({\boldsymbol{\alpha}},{\boldsymbol{C}})$. Changing the GP parameters changes $\lambda$, thus at the end of each EP iteration we recompute $\lambda$, this time keeping $({\boldsymbol{\alpha}},{\boldsymbol{C}})$ fixed. Using the posterior GP in eq. (160), the re-estimation equation for $\lambda$ is

$\frac{N}{\lambda^{new}} = \int d{\boldsymbol{x}}\; E_{{\ensuremath{{\cal{GP}}}}_\lambda}\!\left[ f_{\boldsymbol{x}}^2 \right]$     (162)

and we can iterate the EP algorithm using the new GP.

The remaining problem is finding the kernel that corresponds to the ``effective ${\ensuremath{{\cal{GP}}}}_\lambda$''. Finding $\ensuremath{{\boldsymbol{\Sigma}}}_\lambda$ for general kernels requires the eigenvalues of the kernel, or an inverse operation in the feature space. This is seldom tractable analytically; we mention different approaches to this inversion applied to density estimation. One is to find the inverse operator by numerically solving eq. (159), employed by [50], although this is extremely time-consuming. A different solution is to use the operator product expansion of [69]:

$\ensuremath{{\boldsymbol{\Sigma}}}_\lambda = \ensuremath{{\boldsymbol{\Sigma}}}_0 \left( {\boldsymbol{I}} - \lambda\ensuremath{{\boldsymbol{\Sigma}}}_0 + \lambda^2\,\ensuremath{{\boldsymbol{\Sigma}}}_0\cdot\ensuremath{{\boldsymbol{\Sigma}}}_0 - \ldots \right)$     (163)

to obtain $\ensuremath{{\boldsymbol { \Sigma } }}_\lambda$. Apart from being, again, computationally expensive, this expansion is unstable for large $ \lambda$ values.

Figure 5.8: The kernel K0(x,x$\scriptstyle \prime$) for x = 0 (continuous line) and x = 0.4 (dashed line).
\includegraphics[]{sinc_kernel.eps}

A simplification of the inversion problem is to choose the prior kernels in a convenient fashion so as to obtain tractable models. A choice of operator for one-dimensional inputs was $l^2\partial_x^2$, used by [34] to estimate densities. This operator implies Ornstein-Uhlenbeck kernels for ${\ensuremath{{\cal{GP}}}}_\lambda$:

$K_\lambda(x,x') = \frac{1}{2\sqrt{\lambda}\, l} \exp\left( -\frac{l}{\sqrt{\lambda}}\, |x-x'| \right).$     (164)

Since $ \lambda$ increases with the size of the data, the resulting density function, being a weighted sum of kernel functions, will be very rough.

A different simplification is obtained if we choose the prior operator $\ensuremath{{\boldsymbol{\Sigma}}}_0$ such that $\ensuremath{{\boldsymbol{\Sigma}}}_0\cdot\ensuremath{{\boldsymbol{\Sigma}}}_0 = \ensuremath{{\boldsymbol{\Sigma}}}_0$. This means that all eigenvalues are equal, $\sigma_i^2 = 1$, and the inverse from eq. (159) simplifies to

$\ensuremath{{\boldsymbol{\Sigma}}}_\lambda = \frac{1}{\lambda+1}\, \ensuremath{{\boldsymbol{\Sigma}}}_0.$

We use one-dimensional inputs from the interval $[0,1]$ and consider a finite-dimensional function space with elements

$f(x) = c_0 + \sqrt{2}\sum_{n=1}^{k_0}\left( a_n \sin 2\pi n x + b_n \cos 2\pi n x \right)$     (165)

where $a_n$, $b_n$, and $c_0$ are sampled randomly from a Gaussian distribution with zero mean and unit variance. It is easy to check that the $2k_0+1$ functions from eq. (165) are orthonormal. Since we know that the eigenvalues are all equal to 1, this kernel is idempotent, $\ensuremath{{\boldsymbol{\Sigma}}}\cdot\ensuremath{{\boldsymbol{\Sigma}}}=\ensuremath{{\boldsymbol{\Sigma}}}$, thus the inversion reduces to a division by a constant. We compute the kernel that generates these functions:

$K_0(x,x') = \langle f(x)\, f(x') \rangle = 1 + 2\sum_{n=1}^{k_0} \cos\!\left( 2\pi n (x-x') \right) = \frac{\sin\!\left( (2k_0+1)\pi(x-x') \right)}{\sin\!\left( \pi(x-x') \right)}$     (166)

where the average is taken over the set of random variables (an, bn) and c0. Fig. 5.8 shows the kernel for k0 = 5 and the experiments were performed using this kernel.
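The kernel can be evaluated directly from the expectation over the coefficients of eq. (165); the short sketch below (our own function name) reproduces the two curves of Fig. 5.8 for $k_0=5$.

\begin{verbatim}
import numpy as np

def k0_trig(x, x_prime, k0=5):
    """Kernel of eq. (166): <f(x) f(x')> = 1 + 2 sum_n cos(2 pi n (x - x'))."""
    n = np.arange(1, k0 + 1)
    return 1.0 + 2.0 * np.cos(2.0 * np.pi * np.outer(x - x_prime, n)).sum(axis=1)

x = np.linspace(0.0, 1.0, 400)
curve_at_0 = k0_trig(x, 0.0)      # continuous line in Fig. 5.8
curve_at_04 = k0_trig(x, 0.4)     # dashed line in Fig. 5.8
\end{verbatim}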

Next we have to solve eq. (162) to find $ \lambda_{m}^{}$. We assume that we are at the end of the first iteration, when $ \lambda$ = 0 on the RHS. In this case eq. (162) simplifies and we obtain

$\frac{N}{\lambda} = \int_0^1 d{\boldsymbol{x}} \left( \mu_{\boldsymbol{x}}^2 + \sigma_{\boldsymbol{x}}^2 \right) = 2k_0 + 1 + \sum_{ij}^{N} K_0({\boldsymbol{x}}_i,{\boldsymbol{x}}_j)\left( \alpha_i\alpha_j + C_{ij} \right)$     (167)

where we used the reproducing property of the kernel and expressed the predictive mean and covariance at x using the representation given by the parameters ($ \alpha$,C):

$\mu_{\boldsymbol{x}} = \sum_{i=1}^p \alpha_i K_0({\boldsymbol{x}}_i,{\boldsymbol{x}}) = {\boldsymbol{\alpha}}^T{\boldsymbol{k}}_{\boldsymbol{x}}$     and     $\sigma_{\boldsymbol{x}}^2 = K_0({\boldsymbol{x}},{\boldsymbol{x}}) + {\boldsymbol{k}}_{\boldsymbol{x}}^T{\boldsymbol{C}}{\boldsymbol{k}}_{\boldsymbol{x}}$     (168)

with p the size of the $ \cal {BV}$ set.

To apply online learning, we have to compute the average of the likelihood. Assuming that we have an approximation ${\ensuremath{{\cal{GP}}}}({\boldsymbol{\alpha}},{\boldsymbol{C}})$, we need to compute the average with respect to ${\ensuremath{{\mathcal{N}}}}({\boldsymbol{k}}_{t+1}^T{\boldsymbol{\alpha}},\, k_* + {\boldsymbol{k}}_{t+1}^T{\boldsymbol{C}}{\boldsymbol{k}}_{t+1}) = {\ensuremath{{\mathcal{N}}}}(\mu_{t+1},\sigma_{t+1}^2)$, the marginal GP at ${\boldsymbol{x}}_{t+1}$. The averaged likelihood is $\langle f_{t+1}^2 \rangle_{t+1} = \mu_{t+1}^2 + \sigma_{t+1}^2$, and the update coefficients are the first and second derivatives of the log-average (from eq. 53):

$q^{(t+1)} = \frac{2\mu_{t+1}}{\mu_{t+1}^2 + \sigma_{t+1}^2}$     and     $r^{(t+1)} = \frac{2\left( \sigma_{t+1}^2 - \mu_{t+1}^2 \right)}{\left( \mu_{t+1}^2 + \sigma_{t+1}^2 \right)^2}.$
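These coefficients follow from differentiating $\ln(\mu^2+\sigma^2)$ with respect to the mean; a short sketch (names ours):

\begin{verbatim}
def density_update(mean, var):
    """Update coefficients for the squared-GP density likelihood, <f^2> = mu^2 + sigma^2."""
    g = mean ** 2 + var
    return 2.0 * mean / g, 2.0 * (var - mean ** 2) / g ** 2   # q, r
\end{verbatim}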

We used an artificial dataset generated from a mixture of two Gaussians, plotted with dashed lines in Fig. 5.9.a. The size of the dataset in the simulations was 500 and we used $k_0 = 5$ for the kernel parameter. The learning algorithm was the sparse EP learning as presented in Chapter 4, with the $\cal{BV}$ set size not fixed in advance: we allowed the $\cal{BV}$ set to grow each time the score of a new input was above the predefined threshold, since for this toy example the dimension of the feature space is $2k_0+1$. The KL-criterion prevents the $\cal{BV}$ set from becoming larger than the dimension of the feature space; thus there is no need for pruning the GPs.

Figure 5.9: GP approximation to the probability density function. Sub-figure (a) shows the density function (dashed line), its approximation (continuous line), and the elements of the $ \cal {BV}$ set (the + signs on the X-axis). Sub-figure (b) gives the evolution of the L2 difference between the original and the approximating pdf, measured using evenly distributed grid points. The X-axis shows the number of the EP-iterations made.
\includegraphics[]{FSIN_6_500_20sweeps_GOOD.eps} \includegraphics[]{sinc_dens_plateau.eps}
(a) (b)

To summarise, this section outlined a possibility of using GPs for density estimation. The applicability of the model in its present form is restricted; we gave results only for one-dimensional density modelling. The essential step in obtaining this density model was the transformation of the problem such that the iterative GP algorithms became applicable. This was achieved by choosing special kernels corresponding to operators satisfying $\ensuremath{{\boldsymbol{\Sigma}}}\cdot\ensuremath{{\boldsymbol{\Sigma}}}=\ensuremath{{\boldsymbol{\Sigma}}}$, a condition needed to simplify the operator inversion in eq. (163). Further simulations could be made using higher-dimensional inputs for which such kernels are also computable; however, we believe that the basic properties of this model are highlighted by this toy example.

GP density estimation has certain benefits: contrary to mixture density modelling, it does not need a predefined number of mixture components. The corresponding quantity in this model, the $\cal{BV}$ set size, is determined automatically, based on the data and the hyper-parameters of the model. Another advantage is computational: we do not need to store the full kernel matrix, in contrast to the SVM approach [97] or the application of kernel PCA [28]. A drawback of the model, apart from the need to find proper hyper-parameters, is that the convergence time can be long, as shown in Fig. 5.9.b. More theoretical study is required to speed up the convergence and to make the model less dependent on the prior assumptions.


Estimating wind-fields from scatterometer data

In this section we consider the problem of data assimilation where we aim to build a global model based on spatially distributed observations [15]. GPs are well suited for this type of application, providing us with a convenient way of relating different observations.

The data was collected by the ERS-2 satellite [51]; the aim is to obtain an estimate of the wind fields which the scatterometer measures indirectly via radar backscatter from the ocean surface. There are significant difficulties in obtaining the wind direction from the scatterometer data, the first being the presence of symmetries in the model: wind fields with opposite directions result in almost equal measurements [22,48], making wind field retrieval a complex problem with multiple solutions. The inherent noise in the observations adds to the difficulty of the retrieval process.

Current methods of transforming the observed values (scatterometer data, denoted as a vector si at a given spatial location xi) into wind fields can be split into two phases: local wind vector retrieval and ambiguity removal [84] where one of the local solutions is selected as the true wind vector. Ambiguity removal often uses external information, such as a Numerical Weather Prediction (NWP) forecast of the expected wind field at the time of the scatterometer observations.

[47] have recently proposed a Bayesian framework for wind field retrieval combining a vector Gaussian process prior model with local forward (wind field to scatterometer) or inverse models. The backscatter is measured over 50×50 km cells over the ocean and the number of observations acquired can be several thousand, making conventional GPs hardly applicable to this problem.

In this section we build a sparse GP that scales down the computational needs of conventional GPs; this application was presented in [16].

Figure 5.10: The outputs of the mixture density network (MDN) when applied to raw scatterometer observations. Each arrow points in the direction of the respective wind field component from the mixture; longer arrows represent larger wind speeds. The solutions at the different spatial locations were obtained independently, using the outputs of the same MDN for the scatterometer data at each separate location.
\includegraphics[]{MDN.eps}


Processing Scatterometer Data

We use a mixture density network (MDN) [4] to model the conditional dependence of the local wind vector zi = (ui, vi) on the local scatterometer observations si:

$p_m({\boldsymbol{z}}_i|{\boldsymbol{s}}_i,{\boldsymbol{\omega}}) = \sum_{j=1}^{4} \beta_{ij}\, \phi({\boldsymbol{z}}_i|{\boldsymbol{c}}_{ij},\sigma_{ij})$     (169)

where ${\boldsymbol{\omega}}$ is the union of the MDN parameters: for each observation at location ${\boldsymbol{x}}_i$ we have the set of weightings $\beta_{ij}$ for the local Gaussians $\phi({\boldsymbol{c}}_{ij},\sigma_{ij}^2)$, where ${\boldsymbol{c}}_{ij}$ is the mean and $\sigma_{ij}^2$ is the variance. The parameters of the MDN are determined using an independent training set [22] and are considered known in this section.

The MDN with four Gaussian component densities captures the ambiguity of the inverse problem. An example of a 10×10 field produced by the MDN is shown in Fig. 5.10. We see that for the majority of the spatial locations there are symmetries in the MDN solution: the model prefers a wind speed, but the orientation of the resulting wind vector is uncertain.
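For illustration, eq. (169) can be evaluated for a single location as below, assuming isotropic bivariate components as suggested by the scalar $\sigma_{ij}$ (the interface is ours).

\begin{verbatim}
import numpy as np

def mdn_density(z, beta, c, sigma2):
    """Evaluate the 4-component mixture of eq. (169) at a wind vector z = (u, v).
    beta: (4,) mixing weights, c: (4, 2) component means, sigma2: (4,) variances."""
    diff2 = np.sum((c - z[None, :]) ** 2, axis=1)
    comp = np.exp(-diff2 / (2.0 * sigma2)) / (2.0 * np.pi * sigma2)
    return float(beta @ comp)
\end{verbatim}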

In order to have a global model from the N wind vectors obtained by local inversion using eq. (169), we combine them with a zero-mean vector GP [13,47]:

$q(\underline{{\boldsymbol{z}}}) \propto \left( \prod_{i}^{N} p({\boldsymbol{s}}_i|{\boldsymbol{z}}_i) \right) p_0(\underline{{\boldsymbol{z}}}|{\boldsymbol{W}}_0)$

and instead of using the direct likelihood p(si|zi), we apply Bayes' theorem a second time to obtain:

\begin{displaymath}
q(\underline{\boldsymbol{z}}) \;\propto\; \left( \prod_{i=1}^{N} \frac{p_m({\boldsymbol{z}}_i \vert {\boldsymbol{s}}_i, \omega)}{p_0({\boldsymbol{z}}_i \vert {\boldsymbol{W}}_{0i})} \right) p_0(\underline{\boldsymbol{z}} \vert {\boldsymbol{W}}_0)
\end{displaymath} (169)

where $\underline{\boldsymbol{z}} = [{\boldsymbol{z}}_1,\ldots,{\boldsymbol{z}}_N]^T$ is the concatenation of the local wind field components ${\boldsymbol{z}}_i$, each of which is a random variable associated with a spatial location ${\boldsymbol{x}}_i$. These locations specify the prior covariance matrix for the vector $\underline{\boldsymbol{z}}$, given by ${\boldsymbol{W}}_0 = \{{\boldsymbol{W}}_0({\boldsymbol{x}}_i,{\boldsymbol{x}}_j)\}_{i,j=1}^{N}$. The two-dimensional Gaussian $p_0$ is the prior GP marginalised at ${\boldsymbol{z}}_i$, with zero mean and covariance ${\boldsymbol{W}}_{0i}$. The prior GP was tuned carefully to represent features seen in real wind fields.

Since all quantities involved are Gaussian, we could in principle compute the resulting probabilities analytically, but this computation is practically intractable: the number of mixture components in q($\underline{\boldsymbol{z}}$) is $4^N$, which is prohibitively large even for moderate values of N. Instead, we apply the online approximation to obtain a jointly Gaussian approximation to the posterior at all data points.


Learning vector Gaussian processes

Due to the vector GP, the kernel function ${\boldsymbol{W}}_0({\boldsymbol{x}},{\boldsymbol{y}})$ is a 2×2 matrix specifying the pairwise cross-correlations between the wind field components at different spatial positions. The representation of the posterior GP therefore requires vector quantities: the marginal of the vector GP at a spatial location ${\boldsymbol{x}}$ is a bivariate Gaussian over ${\boldsymbol{z}}_{\boldsymbol{x}} \in \mathbb{R}^2$, with mean and covariance functions represented as

\begin{displaymath}
\begin{split}
\langle {\boldsymbol{z}}_{\boldsymbol{x}} \rangle &= \sum_{i=1}^{N} {\boldsymbol{W}}_0({\boldsymbol{x}},{\boldsymbol{x}}_i)\, {\boldsymbol{\alpha}}_z(i) \\
\mathrm{cov}({\boldsymbol{z}}_{\boldsymbol{x}},{\boldsymbol{z}}_{\boldsymbol{y}}) &= {\boldsymbol{W}}_0({\boldsymbol{x}},{\boldsymbol{y}}) + \sum_{i,j=1}^{N} {\boldsymbol{W}}_0({\boldsymbol{x}},{\boldsymbol{x}}_i)\, {\boldsymbol{C}}_z(ij)\, {\boldsymbol{W}}_0({\boldsymbol{x}}_j,{\boldsymbol{y}})
\end{split}
\end{displaymath} (170)

where ${\boldsymbol{\alpha}}_z(1),\ldots,{\boldsymbol{\alpha}}_z(N)$ and $\{{\boldsymbol{C}}_z(ij)\}_{i,j=1}^{N}$ are the parameters of the vector GP. For numerical convenience we represent the vectorial process by a scalar process with twice the number of observations, i.e. we set

\begin{displaymath}
\langle {\boldsymbol{z}}_{\boldsymbol{x}} \rangle = \begin{bmatrix} \langle f_{{\boldsymbol{x}}^u} \rangle \\ \langle f_{{\boldsymbol{x}}^v} \rangle \end{bmatrix}
\qquad\text{and}\qquad
{\boldsymbol{W}}_0({\boldsymbol{x}},{\boldsymbol{y}}) = \begin{bmatrix} K_0({\boldsymbol{x}}^u,{\boldsymbol{y}}^u) & K_0({\boldsymbol{x}}^u,{\boldsymbol{y}}^v) \\ K_0({\boldsymbol{x}}^v,{\boldsymbol{y}}^u) & K_0({\boldsymbol{x}}^v,{\boldsymbol{y}}^v) \end{bmatrix}
\end{displaymath} (171)

where $K_0({\boldsymbol{x}},{\boldsymbol{y}})$ is a scalar-valued function used purely for notation: since we always work with the vector quantities, the individual matrix elements never need to be evaluated, and $K_0$ merely serves to re-arrange the GP parameters into a more convenient form. Dropping the superscripts u and v, we can write

\begin{displaymath}
\begin{split}
\langle f_{\boldsymbol{x}} \rangle &= \sum_{i=1}^{2N} K_0({\boldsymbol{x}},{\boldsymbol{x}}_i)\, \alpha_i \\
\mathrm{cov}(f_{\boldsymbol{x}},f_{\boldsymbol{y}}) &= K_0({\boldsymbol{x}},{\boldsymbol{y}}) + \sum_{i,j=1}^{2N} K_0({\boldsymbol{x}},{\boldsymbol{x}}_i)\, C(ij)\, K_0({\boldsymbol{x}}_j,{\boldsymbol{y}})
\end{split}
\end{displaymath} (172)

where ${\boldsymbol{\alpha}} = [\alpha_1,\ldots,\alpha_{2N}]^T$ and ${\boldsymbol{C}} = \{C(ij)\}_{i,j=1}^{2N}$ are rearrangements of the parameters from eq. (171).
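To make the re-arrangement concrete, the sketch below assembles the 2N×2N Gram matrix of the scalar representation from 2×2 blocks and evaluates the mean and covariance at the training locations. The block kernel used here is only a placeholder (independent RBF kernels for the u and v components, with made-up hyperparameters); the actual wind-field prior ${\boldsymbol{W}}_0$ of [47] couples the two components. The ordering interleaves the u and v entries of each location, which is one possible convention.

\begin{verbatim}
import numpy as np

def w0_placeholder(x, y, amp=1.0, ell=1.0):
    """Stand-in 2x2 kernel: independent RBF kernels for the u and v
    components.  The true wind-field prior couples the components."""
    k = amp * np.exp(-0.5 * np.sum((x - y) ** 2) / ell ** 2)
    return k * np.eye(2)

def gram_2n(X, kern=w0_placeholder):
    """Assemble the 2N x 2N Gram matrix from 2x2 blocks of the vector
    kernel, interleaving the u and v entries of each location."""
    N = X.shape[0]
    K = np.zeros((2 * N, 2 * N))
    for i in range(N):
        for j in range(N):
            K[2*i:2*i+2, 2*j:2*j+2] = kern(X[i], X[j])
    return K

# Example: 100 nodes on a 10x10 lattice, as in the wind-field example.
g = np.arange(10.0)
X = np.array([[a, b] for a in g for b in g])
K = gram_2n(X)

# Mean and covariance of the scalar representation at the training locations:
alpha = np.zeros(2 * X.shape[0])      # GP parameters; all zero before learning
C = np.zeros((2 * X.shape[0],) * 2)
mean = K @ alpha
cov = K + K @ C @ K
print(K.shape, mean.shape, cov.shape)
\end{verbatim}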

Figure 5.11: Illustration of the elements used in the update eq. (175).
\includegraphics[]{schematic.eps}

For a new observation ${\boldsymbol{s}}_{t+1}$ we have a local ``likelihood'' from eq. (170):

\begin{displaymath}
\frac{p_m({\boldsymbol{z}}_{t+1} \vert {\boldsymbol{s}}_{t+1}, \omega)}{p_G({\boldsymbol{z}}_{t+1} \vert {\boldsymbol{W}}_{0,t+1})}
\end{displaymath}

and the update of the parameters ($ \alpha$,C) is based on the average of this local likelihood with respect to the marginal of the vectorial GP at xt + 1:

\begin{displaymath}
g(\langle {\boldsymbol{z}}_{t+1} \rangle) = \bigg\langle \frac{p_m({\boldsymbol{z}}_{t+1} \vert {\boldsymbol{s}}_{t+1}, \omega)}{p_G({\boldsymbol{z}}_{t+1} \vert {\boldsymbol{W}}_{0,t+1})} \bigg\rangle_{{\mathcal{N}}_{{\boldsymbol{x}}_{t+1}}({\boldsymbol{z}}_{t+1})}
\end{displaymath} (173)

where ${\mathcal{N}}_{{\boldsymbol{x}}_{t+1}}({\boldsymbol{z}}_{t+1})$ is the marginal of the vectorial GP at ${\boldsymbol{x}}_{t+1}$ and $\langle {\boldsymbol{z}}_{t+1} \rangle$ is the value of the mean function at the same position. The function $g(\langle {\boldsymbol{z}}_{t+1} \rangle)$ is easy to compute: it involves only two-dimensional Gaussian averages. To keep the flow of the presentation, the calculation of $g(\langle {\boldsymbol{z}}_{t+1} \rangle)$ and its derivatives is deferred to Appendix F. Using the derivatives of the log-average with respect to the prior mean vector at time t+1, we have the updates for the GP parameters ${\boldsymbol{\alpha}}$ and ${\boldsymbol{C}}$:

\begin{displaymath}
\begin{split}
{\boldsymbol{\alpha}}_{t+1} &= {\boldsymbol{\alpha}}_t + {\boldsymbol{v}}_{t+1}\, \frac{\partial \ln g(\langle {\boldsymbol{z}}_{t+1} \rangle)}{\partial \langle {\boldsymbol{z}}_{t+1} \rangle} \\
{\boldsymbol{C}}_{t+1} &= {\boldsymbol{C}}_t + {\boldsymbol{v}}_{t+1}\, \frac{\partial^2 \ln g(\langle {\boldsymbol{z}}_{t+1} \rangle)}{\partial \langle {\boldsymbol{z}}_{t+1} \rangle^2}\, {\boldsymbol{v}}^T_{t+1}
\end{split}
\qquad\text{with}\qquad {\boldsymbol{v}}_{t+1} = {\boldsymbol{C}}_t {\boldsymbol{K}}_0[t+1] + {\boldsymbol{I}}^2_{[t+1]}
\end{displaymath} (174)

where $\langle {\boldsymbol{z}}_{t+1} \rangle$ is a 2×1 vector, so the derivatives in eq. (175) are vector and matrix quantities. The matrices ${\boldsymbol{K}}_0[t+1]$ and ${\boldsymbol{I}}^2_{[t+1]}$ are shown in Fig. 5.11.
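As an illustration, a single online step of the update above could be implemented as in the sketch below, in the scalar 2N-representation. The first and second derivatives of $\ln g$ are assumed to be supplied by a routine based on Appendix F (not reproduced here), and the interleaved u/v ordering of the earlier sketch is assumed.

\begin{verbatim}
import numpy as np

def online_step(alpha, C, K, idx, dlng, d2lng):
    """One online update in the scalar 2N-representation.
    alpha: (2N,) mean parameters; C: (2N,2N) covariance parameters;
    K: (2N,2N) prior Gram matrix; idx: index of the processed location;
    dlng: (2,) and d2lng: (2,2) are the first and second derivatives of
    ln g with respect to the marginal mean (assumed given, cf. Appendix F)."""
    cols = [2 * idx, 2 * idx + 1]
    K0_t1 = K[:, cols]                      # the 2N x 2 slice K_0[t+1]
    I_t1 = np.zeros_like(K0_t1)
    I_t1[cols, :] = np.eye(2)               # the two unit columns I^2_[t+1]
    v = C @ K0_t1 + I_t1                    # v_{t+1}
    return alpha + v @ dlng, C + v @ d2lng @ v.T

# Tiny demonstration with invented derivative values:
N = 3
K = np.eye(2 * N)
alpha, C = np.zeros(2 * N), np.zeros((2 * N, 2 * N))
alpha, C = online_step(alpha, C, K, idx=1, dlng=np.array([0.3, -0.1]),
                       d2lng=-0.2 * np.eye(2))
print(alpha.round(2))
\end{verbatim}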

Figure 5.12: The NWP wind field estimate (a), the most frequent (b) and the second most frequent (c) online solutions, together with a poor solution (d). The assessment of good/bad solutions is based on the value of the relative weight from Section 5.4.3. The grey-scale background indicates the model confidence (Bayesian error-bars) in the prediction; darker shades correspond to smaller variance.
\includegraphics[]{NWP_all.eps} \includegraphics[]{Online_most_frequent.eps}
(a) (b)
\includegraphics[]{symm_sol.eps} \includegraphics[]{bad_sol_49_84.eps}
(c) (d)

Fig. 5.12 shows the results of the online algorithm applied to a sample wind field with 100 nodes on an equally spaced lattice. Sub-figure (a) shows the predictions from the Numerical Weather Prediction (NWP) model as a reference for our online GP predictions, shown in sub-figures (b)-(d). The processing order of the nodes appears to be important in this case: depending on the order, we obtain a solution agreeing with the NWP results, as in sub-figures (b) and (c); on other occasions the online GP produced a clearly suboptimal wind field structure, as in sub-figure (d).

However, we know that the posterior distribution of the wind field given the scatterometer observations is multi-modal, in general with two dominant and well-separated modes.

The variation of the predicted wind fields with the presentation order is a problem for the online solutions: since the preferred presentation order is not known in advance, the quality of each resulting wind field has to be assessed empirically; this assessment is presented in the next section.


Measuring the Relative Weight of the Approximation

An exact computation of the posterior process, as discussed previously, would lead to a multi-modal distribution over wind fields, i.e. to a mixture of GPs as the posterior rather than the single GP used in our approximation. If the individual components of the mixture are well separated, we may expect the online algorithm to track a mode with significant underlying probability mass and thus give a relevant prediction. However, which mode is tracked depends on the actual sequence of data points visited by the algorithm. To investigate the variation of the wind field prediction with the data sequence, we generated several random sequences and compared the outcomes using a simple approximation to the relative mass of the corresponding multivariate Gaussian component.

Figure 5.13: Illustration of the computation of the relative weights.
\includegraphics[]{rel_weight.eps}

To compare two posterior approximations obtained from different presentations of the same data, we assume that the posterior can be written as a sum of well-separated Gaussians, one of which is approximated by the online solution $(\hat{\underline{\boldsymbol{z}}},\hat{\boldsymbol{\Sigma}})$. At the online solution $\hat{\underline{\boldsymbol{z}}}$ we are at a local maximum of the pdf, meaning that the sum over the mixture components reduces to a single term. This means that

\begin{displaymath}
q(\hat{\underline{\boldsymbol{z}}}) \;\propto\; \gamma_l\, (2\pi)^{-2N/2}\, \vert\hat{\boldsymbol{\Sigma}}\vert^{-1/2}
\end{displaymath} (175)

with $q(\hat{\underline{\boldsymbol{z}}})$ from eq. (170) and $\gamma_l$ the weight of the mixture component to which the online algorithm has converged; we also assume that the local curvature is well approximated by $\hat{\boldsymbol{\Sigma}}$.

Given two different online solutions $(\hat{\underline{\boldsymbol{z}}}_1,\hat{\boldsymbol{\Sigma}}_1)$ and $(\hat{\underline{\boldsymbol{z}}}_2,\hat{\boldsymbol{\Sigma}}_2)$, we find from eq. (176) that the ratio of the two weights is given by

\begin{displaymath}
\frac{\gamma_1}{\gamma_2} = \frac{q(\hat{\underline{\boldsymbol{z}}}_1)\, \vert\hat{\boldsymbol{\Sigma}}_1\vert^{1/2}}{q(\hat{\underline{\boldsymbol{z}}}_2)\, \vert\hat{\boldsymbol{\Sigma}}_2\vert^{1/2}}
\end{displaymath} (176)

as illustrated in Fig. 5.13. This helps us to estimate the ``relative weight'' of the wind field solutions:

\begin{displaymath}
\log \gamma = \log q(\hat{\underline{\boldsymbol{z}}}) + \frac{\log \vert\hat{\boldsymbol{\Sigma}}\vert}{2}
\end{displaymath} (177)

which helps us to assess the quality of the approximation obtained. Results from multiple runs on a sample wind field confirm this expectation: the correct solution (Fig. 5.12.b) has both a large relative weight and a high frequency of occurrence across runs. For the wind field example shown in Fig. 5.12, 100 random permutations of the presentation order were used. The resulting good solutions, shown in sub-figures (b) and (c), always had a log relative weight larger than 90, whilst for the bad solutions the same quantity was $\approx$ 70.
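A minimal sketch of this comparison, assuming the log-density of the approximate posterior at each online solution and the corresponding posterior covariance are available; the numbers below are invented for illustration.

\begin{verbatim}
import numpy as np

def log_relative_weight(log_q_at_mode, Sigma):
    """Log relative weight: log q(z_hat) + 0.5 * log |Sigma_hat|.
    The Cholesky factor gives 0.5 * log|Sigma| as sum(log(diag(L)))."""
    L = np.linalg.cholesky(Sigma)
    return log_q_at_mode + np.sum(np.log(np.diag(L)))

# Comparing two online solutions obtained from different presentation orders
# (invented values of the log-density and covariance):
Sigma1 = 0.5 * np.eye(4)
Sigma2 = 0.4 * np.eye(4)
lg1 = log_relative_weight(-12.3, Sigma1)
lg2 = log_relative_weight(-35.8, Sigma2)
print(lg1 - lg2)   # log of the weight ratio; positive favours solution 1
\end{verbatim}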


Sparsity for vectorial GPs

Sparsity for the vectorial GP is based on the same minimisation of the KL-distance as for scalar GPs; the only difference from Chapter 3 is the vectorial nature of the process. Using the transformed notation of eq. (173), the vectorial GP means that removal and addition operations can only be performed in pairs: the $\cal{BV}$ set for the transformed GP always has an even number of basis vectors, since for each spatial location ${\boldsymbol{x}}_i$ it includes both ${\boldsymbol{x}}_i^u$ and ${\boldsymbol{x}}_i^v$, and removing ${\boldsymbol{x}}_i$ means removing both components. The pruning is very similar to the one-dimensional case, the difference being that the elements of the update are 2×2 matrices (${\boldsymbol{c}}^*$ and ${\boldsymbol{q}}^*$), p×2 matrices (${\boldsymbol{C}}^*$ and ${\boldsymbol{Q}}^*$), and the 2×1 vector ${\boldsymbol{\alpha}}^*$; the decomposition is the same as shown in Fig. 3.3. Using this decomposition, the KL-optimal updates are

\begin{displaymath}
\begin{split}
\hat{\boldsymbol{\alpha}} &= {\boldsymbol{\alpha}}^{(r)} - \left({\boldsymbol{C}}^* + {\boldsymbol{Q}}^*\right)\left({\boldsymbol{c}}^* + {\boldsymbol{q}}^*\right)^{-1}{\boldsymbol{\alpha}}^* \\
\hat{\boldsymbol{C}} &= {\boldsymbol{C}}^{(r)} + {\boldsymbol{Q}}^*\left({\boldsymbol{q}}^*\right)^{-1}\left({\boldsymbol{Q}}^*\right)^T - \left({\boldsymbol{C}}^* + {\boldsymbol{Q}}^*\right)\left({\boldsymbol{c}}^* + {\boldsymbol{q}}^*\right)^{-1}\left({\boldsymbol{C}}^* + {\boldsymbol{Q}}^*\right)^T
\end{split}
\end{displaymath} (178)

where ${\boldsymbol{Q}}$ is the inverse Gram matrix and $({\boldsymbol{\alpha}},{\boldsymbol{C}})$ are the GP parameters.
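A sketch of this pair-removal step in the 2N-representation is given below. It assumes the interleaved u/v ordering of the earlier sketches, reads the starred quantities off the 2×2 diagonal blocks and the corresponding columns, and additionally updates the inverse Gram matrix ${\boldsymbol{Q}}$ following the same pattern as in Chapter 3 (that last update is not shown in the equations above and is included here as an assumption).

\begin{verbatim}
import numpy as np

def remove_location(alpha, C, Q, i):
    """KL-optimal removal of location i (its u and v components) from the
    2N-representation.  Returns the reduced (alpha, C, Q)."""
    sl = [2 * i, 2 * i + 1]
    keep = [k for k in range(alpha.shape[0]) if k not in sl]

    a_star = alpha[sl]                     # alpha*        (2,)
    c_star = C[np.ix_(sl, sl)]             # c*            (2,2)
    q_star = Q[np.ix_(sl, sl)]             # q*            (2,2)
    C_star = C[np.ix_(keep, sl)]           # C*            (2N-2, 2)
    Q_star = Q[np.ix_(keep, sl)]           # Q*            (2N-2, 2)

    CQ = C_star + Q_star
    cq_inv = np.linalg.inv(c_star + q_star)
    q_inv = np.linalg.inv(q_star)

    alpha_new = alpha[keep] - CQ @ cq_inv @ a_star
    C_new = C[np.ix_(keep, keep)] + Q_star @ q_inv @ Q_star.T - CQ @ cq_inv @ CQ.T
    Q_new = Q[np.ix_(keep, keep)] - Q_star @ q_inv @ Q_star.T   # assumption, as in Ch. 3
    return alpha_new, C_new, Q_new

# Usage (given current alpha, C, Q): alpha, C, Q = remove_location(alpha, C, Q, i)
\end{verbatim}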

The loss incurred by removing a location ${\boldsymbol{x}}_i$ (i.e. the two virtual basis vectors ${\boldsymbol{x}}_i^u$ and ${\boldsymbol{x}}_i^v$) is measured by the approximate KL-distance from Chapter 3, leading to the score

\begin{displaymath}
\varepsilon_i = {\boldsymbol{\alpha}}_i^T \left({\boldsymbol{c}}^* + {\boldsymbol{q}}^*\right)^{-1} {\boldsymbol{\alpha}}_i
\end{displaymath} (179)

where the parameters are obtained from the 2N×2N matrices using the same decomposition as shown in Fig. 3.3.
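A sketch of this pairwise scoring is given below, again assuming the interleaved u/v ordering used in the earlier sketches; ${\boldsymbol{c}}^*$ and ${\boldsymbol{q}}^*$ are read off as the 2×2 diagonal blocks belonging to each location.

\begin{verbatim}
import numpy as np

def removal_scores(alpha, C, Q):
    """Removal score for every spatial location in the 2N-representation.
    alpha: (2N,) GP mean parameters; C: (2N,2N) GP covariance parameters;
    Q: (2N,2N) inverse Gram matrix.  Location i owns rows/columns 2i, 2i+1."""
    N = alpha.shape[0] // 2
    scores = np.empty(N)
    for i in range(N):
        sl = slice(2 * i, 2 * i + 2)
        block = C[sl, sl] + Q[sl, sl]                   # c* + q* for location i
        scores[i] = alpha[sl] @ np.linalg.solve(block, alpha[sl])
    return scores

# The location with the smallest score is the cheapest to remove:
# i_remove = int(np.argmin(removal_scores(alpha, C, Q)))
\end{verbatim}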

Figure 5.14: (a) The predicted wind field when 85% of the nodes have been removed (from Fig. 2.2). The prediction is based only on the basis vectors (circles); the model confidence is higher in these regions. (b) The difference between the full solution and the approximations, using the squared difference of the means (continuous line) and the KL-distance (dashed line) respectively.
\includegraphics[]{85_rem_13.eps} \includegraphics[]{KL_sq_13.eps}
(a) (b)

The results of the pruning are promising: Fig. 5.14 shows the resulting wind field when 85% of the spatial nodes are removed from the representation of eq. (173). On the right-hand side, the evolution of the KL-divergence and of the sum-squared error in the means between the full vector GP and the trimmed GP of eq. (179) is shown as a function of the number of deleted points. Whilst the approximation of the posterior variance degrades quickly, the predictive mean remains stable as nodes are deleted.

The sparse solution in Fig. 5.14 is the result of combining the non-sparse online algorithm with the KL-optimal removal, the two algorithms being performed one after the other. This means that we still have to store the full vector GP and the costly matrix inversion is still needed; the difference from other methods is that the inversion is performed sequentially.

Applying the sparse online algorithm without the EP re-learning steps led to a significant loss in performance; this is due to the more complex, multimodal ``likelihoods'' provided by the MDNs, which give rise to local symmetries in the parameter space. We compensated for this loss by using additional prior knowledge: the wind vector at each spatial location from an NWP model, included in the prior mean function. This leads to better performance since we start closer to the solution than with a simple zero-mean prior process. The performance of the resulting algorithm is comparable to that of the full online learning combined with removal; the result is shown in Fig. 5.14.b.

The EP algorithm from Chapter 4 has not yet been developed for the wind-field example; future work will involve testing the sparse EP algorithm on the wind-field problem. As seen in the classification case, the second and third iterations through the data improved the performance, and by performing multiple iterations over the spatial locations we expect a better approximation of the posterior GP in this case as well. Some improvement is also expected in estimating the relative weight of a solution from eq. (178), especially since an accurate estimate of the posterior covariance is essential here.

To conclude, we showed that wind-field approximation using sparse GPs is a promising research direction: a reduction of 70% of the basis points led to only a minor change in the resulting GP, as measured by the KL-distance. The implementation and testing of the sparse EP algorithm for this particular problem remains an interesting open question.

We estimated the wind fields by first processing the raw scatterometer data with the local inverse models into a mixture of Gaussians, as shown in Fig. 5.10; the GP learning algorithm then used the product of these local mixtures. Further investigations are aimed at implementing GPs with likelihoods defined directly on the scatterometer observations. Since the dependence of the wind field on the scatterometer observations is given by a complex forward model, approximations are needed to compute $g(\langle {\boldsymbol{z}} \rangle)$, and the results have to be benchmarked against the previous ones.

A modification of the model in a different direction is suggested by the fact that the mixtures tend to have two dominant modes, a consequence of the physics of the system. The same bimodality has also been found in empirical studies [48]. It would be interesting to extend the sparse GP approach to mixtures of GPs in order to incorporate this simple multi-modality of the posterior process in a principled way.


Summary

We presented results for various problems solved within the sparse GP framework, with the aim of showing the applicability of the algorithm to a large class of likelihoods. Sparse learning for the quadratic regression case showed performance comparable with full GP regression, tested on artificial and real-world datasets. The additional EP iterations were beneficial only when the $\cal{BV}$ set size was very small, and generally did not improve the test errors. This was not the case for the classification problems, where we showed both the applicability of sparse GP learning and the improvement in performance with multiple sweeps through the data. The classification algorithm was also tested on the relatively large USPS dataset with 7000 training examples; despite only a fraction of the data being retained, we obtained good performance. An interesting research direction is to extend the sparse GP framework to robust regression, i.e. to non-Gaussian noise distributions such as Laplace noise. However, we believe this would be beneficial only if a method for hyper-parameter selection could also be proposed.

The application of GPs to density estimation is, at present, very restrictive: we considered one-dimensional data and special kernels. For this case the EP steps proved to be very important: several iterations were needed to reach a reasonable solution. The solution is nevertheless attractive: we showed that it is possible to infer multi-modal distributions without specifying the number of peaks in advance.

A real-world application is the sparse GP inference of wind fields from scatterometer observations. Previous studies [13] have shown that this problem is not unimodal; our sparse GP tracks a single mode of the posterior distribution. As this is work in progress, we expect further positive results from the implementation and testing of the EP algorithm for the wind fields. A discussion of more general future research topics, such as model selection, can be found in the next chapter.