Conclusions and Further Research


The application of the family of Gaussian processes to real problems is impeded by the intractability of the posteriors or by the quadratic scaling of the algorithm with the size of the dataset. The contribution of this thesis is a methodology for applying GPs to large datasets via a representation of the posterior process with a parameter set of fixed size, eliminating the prohibitive scaling with growing data sizes. To achieve this independence from the data size, we needed a parametrisation of the posterior process in terms of a finite set of parameters.

The lemma, given in Chapter 2, transformed the problem of functional inference into the estimation of the GP parameters. The parametrisation of the approximating GP resembles the popular representer theorem of [40], which expresses the most probable (MAP) solution as a linear combination of the kernel functions at the data locations. In contrast to the representer theorem, in Lemma 2.3.1 we also considered the posterior kernel. The approach to modelling using GPs is thus fully probabilistic. In our opinion, the proposed method of inference via the estimation of both GP moments is advantageous over the commonly used MAP-based approximations that represent the state of the art in kernel methods. Apart from making it possible to assess the uncertainty of the predictions, the use of the approximated kernel in learning speeds up the online algorithm. This is similar to natural gradient methods for parametric models, where the covariance matrix is used to rescale the parameter space and provide an optimal search direction.
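
As an illustration, the parametrised form of the approximating posterior can be written as below (a sketch of the general form only; the notation used here, with coefficients $\alpha_i$, the matrix $C$ mentioned in the concluding remarks, data locations $x_i$ and prior kernel $K_0$, is assumed rather than quoted from Chapter 2):

% Sketch: posterior mean and kernel expressed through the prior kernel K_0,
% the coefficients alpha_i and the matrix C.
\begin{align*}
  \mu_{\rm post}(x)   &= \sum_i \alpha_i \, K_0(x, x_i), \\
  K_{\rm post}(x, x') &= K_0(x, x') + \sum_{i,j} K_0(x, x_i)\, C_{ij}\, K_0(x_j, x').
\end{align*}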

In Chapter 2 we used Bayesian online learning [54] to find the posterior GP. This is a sequential algorithm that considers a single example from the training set at a time. In this framework the restrictions on the likelihood are very mild: the only requirement is that its Gaussian average be differentiable. Online learning is thus applicable to a large class of likelihoods, including, for example, the non-differentiable step function for classification, for which other approximation methods usually fail.
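
For concreteness, the quantities driving a single online update are the first and second derivatives of the logarithm of the Gaussian-averaged likelihood with respect to the predictive mean at the new input; for regression with Gaussian noise of variance $\sigma_0^2$ they have the closed form below (the symbols $q^{(t+1)}$, $r^{(t+1)}$, $m$ and $v$ are this sketch's notation for the update coefficients and the current predictive mean and variance at the new input):

\begin{align*}
  q^{(t+1)} &= \frac{\partial}{\partial m}
     \ln \big\langle P(y_{t+1} \mid f_{t+1}) \big\rangle_{{\cal N}(m,\, v)}
     = \frac{y_{t+1} - m}{v + \sigma_0^2} \quad \text{(Gaussian noise)}, \\
  r^{(t+1)} &= \frac{\partial^2}{\partial m^2}
     \ln \big\langle P(y_{t+1} \mid f_{t+1}) \big\rangle_{{\cal N}(m,\, v)}
     = -\frac{1}{v + \sigma_0^2} \quad \text{(Gaussian noise)}.
\end{align*}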

The parametrisation lemma combined with Bayesian online learning provided a posterior GP that uses all data points, with the number of parameters scaling quadratically with the size of the data. Apart from being impractical, this parametrisation was also unsatisfactory: by learning we implicitly mean a compression of the information contained in the data into the set of parameters; with the number of parameters scaling quadratically with the size of the data, there is no compression.
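
To make this scaling concrete, the Python sketch below runs a single online sweep for regression with Gaussian noise, keeping the posterior in the parametrised form given above; every processed example enlarges the coefficient vector and the matrix, so memory grows quadratically with the number of examples. (The code and its names are illustrative only, assume the standard rank-one form of the update, and are not taken from the thesis.)

\begin{verbatim}
import numpy as np


def rbf(x, y, ell=1.0):
    """Squared-exponential (RBF) kernel for scalar or vector inputs."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-0.5 * np.sum(diff ** 2) / ell ** 2))


def online_gp_regression(X, y, noise_var=0.1, kernel=rbf):
    """One online sweep for GP regression with Gaussian noise.

    The posterior is kept in the parametrised form
        mean(x)   = sum_i alpha_i k(x_i, x)
        cov(x,x') = k(x, x') + k_x^T C k_x'
    and every example is added to the representation, so alpha grows
    linearly and C quadratically with the number of processed points.
    """
    alpha = np.zeros(0)
    C = np.zeros((0, 0))
    points = []
    for x_new, y_new in zip(X, y):
        k = np.array([kernel(xb, x_new) for xb in points])
        k_star = kernel(x_new, x_new)
        m = k @ alpha                  # current predictive mean at x_new
        v = k_star + k @ C @ k         # current predictive (latent) variance
        # Gaussian likelihood: derivatives of the log-averaged likelihood
        q = (y_new - m) / (v + noise_var)
        r = -1.0 / (v + noise_var)
        # rank-one update of the parameters, extended by the new point
        s = np.append(C @ k, 1.0)
        alpha = np.append(alpha, 0.0) + q * s
        C_ext = np.zeros((len(points) + 1, len(points) + 1))
        C_ext[:len(points), :len(points)] = C
        C = C_ext + r * np.outer(s, s)
        points.append(x_new)
    return np.array(points), alpha, C


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=50)
    y = np.sin(X) + 0.1 * rng.standard_normal(50)
    _, alpha, C = online_gp_regression(X, y)
    print(alpha.shape, C.shape)        # (50,) and (50, 50): quadratic growth
\end{verbatim}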

We derived a further approximation to the online solution, presented in Chapter 3. By a subsequent projection of the posterior onto a process with fewer parameters than the original approximation, only a fraction of the training data was used in representing the posterior process; this subset of the data is called the set of Basis Vectors. The key ingredient in obtaining the sparse approximation was our ability to compute the KL-divergence between two GPs, derived in Section 3.2. This divergence served both to obtain the sparse projection of the online GP and to estimate the error made by this projection. It must be mentioned that, although only a fraction of the training data is used for GP prediction, during learning we extracted information from the whole dataset.
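
For reference, when both processes are restricted to a finite set of $d$ evaluation points, the KL-divergence reduces to the standard expression between two multivariate Gaussians (shown here only as the finite-dimensional illustration; Section 3.2 derives the process-level quantity):

\begin{equation*}
  {\rm KL}\big( {\cal N}(\mu_0, \Sigma_0) \,\|\, {\cal N}(\mu_1, \Sigma_1) \big)
  = \frac{1}{2} \left[ {\rm tr}\big(\Sigma_1^{-1}\Sigma_0\big)
    + (\mu_1 - \mu_0)^{\top} \Sigma_1^{-1} (\mu_1 - \mu_0)
    - d + \ln \frac{\det \Sigma_1}{\det \Sigma_0} \right].
\end{equation*}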

This is similar to the result of Support Vector Machine learning, where a sparse set of support vectors is used for generating predictions. The learning strategy, however, is different. For the sparse algorithm we were able to eliminate the growth of the parameter set with the data and to reduce the computing time to linear. In the basic SVM, irrespective of the degree of sparseness in the final result, the memory required to obtain that sparse result still scales quadratically.

The essential result of combining the Bayesian online learning of Chapter 2 with sparsity is the sparse online algorithm presented in Section 3.6. The algorithm possesses the means to add new inputs to, and remove unwanted inputs from, the $\cal{BV}$ set, providing full flexibility in manipulating it. Additionally, we were able to find a score for each element of the $\cal{BV}$ set that measures the error that would result from its removal. This score can be obtained without actually removing the basis vector in question and without any matrix manipulation.

We can thus apply online GPs to arbitrarily large datasets. The online nature of the algorithm, with each input processed only once, is however a limitation for datasets of moderate size for which the full GP representation is nevertheless impractical: our aim was to obtain a more accurate result than that given by a single sweep through the data. A solution to this problem was proposed in Chapter 4. Building on recent improvements to online learning from both the statistical physics [53] and the algorithmic [46] viewpoints, we derived a sparse algorithm that iteratively approximates the batch solution. The algorithm was based on the ``expectation-propagation'' (EP) algorithm of [46], extended to give sparse solutions.

In Chapter 5 the sparse EP algorithm is applied to various problems. For regression with a quadratic loss (Gaussian noise), which is analytically tractable, the results confirmed that there is no significant loss of performance from applying the sparse solution in place of the full GP algorithm.

The EP iterations proved beneficial in classification, where the test error was consistently lower after subsequent EP steps. In addition to the improved performance, the fluctuations caused by the order in which the data are presented also diminished. We also showed that the sparse GP algorithm can be applied to various other likelihoods, such as those arising in density estimation or in the specific data assimilation problem for wind fields.

We believe that the general parametrisation of the posterior processes and the KL-divergences between them open new possibilities for applying kernel methods. They allow, without significant additional complexity, the representation of uncertainty, a feature that is at present missing from the family of kernel methods.

The cost of estimating uncertainties is the learning and storage of the matrix C that parametrises the posterior kernel. This cost is largely compensated by the full flexibility in choosing the size of the parameter set, which makes the sparse GP and EP algorithms highly competitive with other kernel algorithms in various application areas.


Further Research Directions

We envisage further research in several directions:

The issue of computational efficiency is important in practical applications. We saw that the sparse online algorithm, due to its greedy nature, had a running time that scaled linearly with the amount of data while the number of parameters was kept constant.

The sparse EP algorithm, which performs multiple iterations for each input, was more demanding, with quadratic and linear scaling of the computing time and of the size of the parameter set respectively. The worse scaling of the latter algorithm is due to the need to store additional information about each processed data point. We could reduce the computational overhead of the algorithm by first selecting the relevant $\cal{BV}$ set from the available data. This can be done either by random selection, or by examining the scores of the training data and sequentially including the points with the highest scores (a sketch of the latter strategy is given below). Random selection of the $\cal{BV}$ set would be feasible for low-dimensional data, where we can hope for a relatively even spread of the randomly selected points over the input space. The second strategy would ensure a better representation of the training data, at the additional cost of evaluating the scores after every inclusion step. We foresee that the increase in test error caused by fixing the $\cal{BV}$ set in advance would not be significant, while the computational time would be reduced by an order of magnitude.
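
As an illustration of the second strategy, the Python sketch below greedily grows a fixed-size $\cal{BV}$ set by repeatedly adding the training point with the largest kernel residual, i.e. the point least well represented by the span of the current set. This is hypothetical code: the inclusion score used in the thesis also involves the likelihood terms, so the purely geometric "novelty" criterion used here is only one simple choice.

\begin{verbatim}
import numpy as np


def greedy_bv_selection(X, n_bv, kernel, jitter=1e-8):
    """Greedily select n_bv basis-vector candidates from X.

    At each step the point with the largest kernel residual
        gamma_i = k(x_i, x_i) - k_i^T K_bb^{-1} k_i,
    i.e. its squared feature-space distance from the span of the points
    already chosen, is added to the set.
    """
    chosen = []
    k_diag = np.array([kernel(x, x) for x in X])
    for _ in range(min(n_bv, len(X))):
        if not chosen:
            resid = k_diag.copy()
        else:
            K_bb = np.array([[kernel(X[i], X[j]) for j in chosen] for i in chosen])
            K_bb += jitter * np.eye(len(chosen))
            K_bx = np.array([[kernel(X[i], x) for x in X] for i in chosen])
            coeff = np.linalg.solve(K_bb, K_bx)
            resid = k_diag - np.sum(K_bx * coeff, axis=0)
        resid[chosen] = -np.inf        # never select the same point twice
        chosen.append(int(np.argmax(resid)))
    return chosen
\end{verbatim}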

The second direction would be the application of sparse GPs to nonstandard likelihoods. A particularly interesting question is whether we could apply GP inference to time-varying systems in order to include simple dynamics, or consider GP treatments of ICA models.

However, we believe that the most important issue to address is hyperparameter adjustment, or model selection, within the sparse GP framework. Here the EP framework provides an approximation to the data likelihood built from the approximated likelihood terms. Since, in addition to sparse online learning, we perform multiple iterations over the GP parameters, we could adjust some of the hyperparameters of the model by gradient ascent on this approximate data likelihood. These hyperparameters could include the variance of the assumed noise in regression, or the steepness of the probit model in the classification case. However, changing the model parameters should be accompanied by corresponding changes in the GP parameters, or by retraining of the whole GP under the new model; the exact form of this procedure requires further study. A simple illustration of such a gradient step is sketched below.
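
The sketch below performs gradient ascent on the exact log marginal likelihood of a full GP regression model with respect to the noise variance. It is a hypothetical illustration rather than part of the proposed algorithm: in the sparse EP setting the exact marginal likelihood would be replaced by the EP approximation built from the approximated likelihood terms.

\begin{verbatim}
import numpy as np


def log_marginal_likelihood_and_grad(K, y, noise_var):
    """Exact GP-regression log marginal likelihood and its derivative
    with respect to the noise variance (full, non-sparse model)."""
    n = len(y)
    Ky = K + noise_var * np.eye(n)
    L = np.linalg.cholesky(Ky)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    lml = -0.5 * y @ alpha - 0.5 * logdet - 0.5 * n * np.log(2 * np.pi)
    Ky_inv = np.linalg.solve(L.T, np.linalg.solve(L, np.eye(n)))
    # d(lml)/d(noise_var) = 0.5 * tr((alpha alpha^T - Ky^{-1}) * dKy/d(noise_var)),
    # and dKy/d(noise_var) is the identity matrix.
    grad = 0.5 * (alpha @ alpha - np.trace(Ky_inv))
    return lml, grad


def adjust_noise_variance(K, y, noise_var=1.0, lr=1e-2, steps=100):
    """Gradient ascent in log(noise_var), which keeps the variance positive."""
    log_nv = np.log(noise_var)
    for _ in range(steps):
        lml, grad = log_marginal_likelihood_and_grad(K, y, np.exp(log_nv))
        log_nv += lr * grad * np.exp(log_nv)   # chain rule for the log parametrisation
        # (lml can be monitored to check that the ascent is making progress)
    return float(np.exp(log_nv))
\end{verbatim}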

An interesting research direction would be to address the selection of the parameters of the kernel itself, such as the width parameter of the RBF kernel or the order of a polynomial kernel. Equally, one could consider assessing the relevance of the individual input components, similarly to automatic relevance determination [42], for sparse Gaussian processes in which the size of the $\cal{BV}$ set is fixed. Simple improvements to the sparse EP algorithm are also planned for the future.
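
For reference, the commonly used RBF kernel with automatic relevance determination assigns a separate length-scale $\ell_d$ to each input dimension, and these length-scales can be adapted together with the other hyperparameters (the standard form, cf. [42], is shown only as an illustration):

\begin{equation*}
  K(x, x') = \sigma_f^2 \exp\!\left( -\frac{1}{2} \sum_{d=1}^{D}
             \frac{(x_d - x'_d)^2}{\ell_d^2} \right),
\end{equation*}

where a large $\ell_d$ signals that the $d$-th input component is irrelevant.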

Lastly, it would be interesting to extend the single GP latent variable model to mixtures of GPs. Using the sparse EP algorithm to estimate the individual GPs is feasible, though we might need an EM-like algorithm.