Doctor of Philosophy March 2002

Gaussian Processes - Iterative Sparse Approximations

Lehel Csató


Date: April 23, 2003


Key words and phrases: Gaussian processes, online learning, sparse approximations

Abstract:

In recent years there has been increased interest in applying non-parametric methods to real-world problems. Significant research has been devoted to Gaussian processes (GPs) due to their greater flexibility compared with parametric models. These methods use Bayesian learning, which generally leads to analytically intractable posteriors.

This thesis proposes a two-step solution to construct a probabilistic approximation to the posterior. In the first step we adapt Bayesian online learning to GPs: the final approximation to the posterior is the result of propagating the first and second moments of intermediate posteriors obtained by combining a new example with the previous approximation. The propagation of functional forms is solved by showing the existence of a parametrisation of the posterior moments that uses combinations of the kernel function at the training points, transforming the Bayesian online learning of functions into a parametric formulation. The drawback is the prohibitive quadratic scaling of the number of parameters with the size of the data, making the method inapplicable to large datasets.
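As an illustration of this parametrisation (a sketch using notation common in the sparse GP literature; the symbols below are not necessarily those adopted later in the thesis), after processing $t$ examples the approximate posterior mean and kernel can be written in terms of the prior kernel $K_0$ evaluated at the training inputs:
\[
  \langle f(x) \rangle_t \;=\; \sum_{i=1}^{t} \alpha_t(i)\, K_0(x, x_i), \qquad
  K_t(x, x') \;=\; K_0(x, x') \;+\; \sum_{i,j=1}^{t} K_0(x, x_i)\, C_t(i,j)\, K_0(x_j, x'),
\]
so that online learning reduces to updating the coefficients $\alpha_t \in \mathbb{R}^{t}$ and $C_t \in \mathbb{R}^{t \times t}$; the matrix $C_t$ is the source of the quadratic growth in the number of parameters.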

The second step solves the problem of the exploding parameter size and makes GPs applicable to arbitrarily large datasets. The approximation is based on a measure of distance between two GPs: the KL-divergence. This second approximation constrains the GP so that only a small subset of the whole training dataset is used to represent it. This subset is called the Basis Vector, or $\mathcal{BV}$, set and the resulting GP is a sparse approximation to the true posterior.
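For orientation (a sketch of the standard finite-dimensional Gaussian form, with means $\mu_0, \mu_1$ and covariances $\Sigma_0, \Sigma_1$; the thesis works with the corresponding quantity for GPs), the KL-divergence being minimised has the familiar expression
\[
  \mathrm{KL}\bigl(\mathcal{N}(\mu_0, \Sigma_0) \,\|\, \mathcal{N}(\mu_1, \Sigma_1)\bigr)
  \;=\; \tfrac{1}{2} \Bigl[ \operatorname{tr}\bigl(\Sigma_1^{-1} \Sigma_0\bigr)
  + (\mu_1 - \mu_0)^{\top} \Sigma_1^{-1} (\mu_1 - \mu_0)
  - d + \ln \tfrac{\det \Sigma_1}{\det \Sigma_0} \Bigr]
\]
(with $d$ the dimension), and the sparse GP is obtained by minimising it subject to the constraint that the approximation is supported only on the $\mathcal{BV}$ set.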

As this sparsity is based on KL-minimisation, it is probabilistic and independent of the way the posterior approximation from the first step is obtained. We combine the sparse approximation with an extension of the Bayesian online algorithm that allows multiple iterations over each input, thus approximating a batch solution.

The resulting sparse learning algorithm is generic: to address a different problem we only change the likelihood. The algorithm is applied to a variety of problems and we examine its performance on classical regression and classification tasks as well as on data assimilation and a simple density estimation problem.

Acknowledgements

First and foremost I am grateful to my supervisor Manfred Opper for his unceasing enthusiasm, the sparkling discussions, the quick understanding of sometimes chaotic ideas, and the thorough explanations he gave me whenever I needed them. Being a member of the NCRG was very stimulating and I profited a lot from the discussions during the three years I spent at Aston. I acknowledge the financial support of the NCRG, which made my study possible.

My fellow PhD students helped me adjust to life in England, in particular to the, to put it mildly, unfavourable local climate. The lively discussions in the PhD lab will not be forgotten. I thank my fellow course-mates, Mehdi Azzouzi, David Evans, Randa Herzallah, Lars Hjorth, Ragnar Lesch, Tony Schwaighofer, Renato Vicente, Wei Lee Woon and Sun Yi.

For their remote support I thank my family and my friends, without whom the completion of this thesis would have been much harder.

In preparing this thesis I was helped greatly by Dan Cornford and Wei Lee Woon, who gave helpful comments on the grammatical aspects of the thesis. Finally, special thanks to Vicky Bond for her prompt replies whenever they were needed.