Important: since this section is quite technical, some of the text is copied directly from the thesis (see References). For a more compact presentation, see the Neural Computation article.

Sparse OGP Classification

Binary classification is not analytically tractable. However, one can use the probit model [49], where a binary label $y \in \{-1, 1\}$ is assigned to an input $\boldsymbol{x} \in \mathbb{R}^m$ with the data likelihood

$P\left(y \vert f_{\boldsymbol{x}}\right) = \mathrm{Erf}\left(\frac{y\, f_{\boldsymbol{x}}}{\sigma_0}\right)$ (142)

where $\sigma_0^2$ is the noise variance and $\mathrm{Erf}(z)$ is the cumulative Gaussian distribution:

$\mathrm{Erf}(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} \mathrm{d}t\; \exp\left(-\frac{t^2}{2}\right)$ (143)
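As a concrete illustration (not part of the toolbox code), the probit likelihood of eq. (142) can be evaluated with the standard normal CDF. The sketch below assumes NumPy/SciPy and uses illustrative variable names:

import numpy as np
from scipy.stats import norm

def probit_likelihood(y, f_x, sigma0):
    # P(y | f_x) = Erf(y * f_x / sigma0), eq. (142); Erf is the
    # cumulative Gaussian of eq. (143), i.e. the standard normal CDF.
    return norm.cdf(y * f_x / sigma0)

# a latent value of +2 is classified confidently as y = +1
print(probit_likelihood(y=+1, f_x=2.0, sigma0=1.0))   # ~0.977
print(probit_likelihood(y=-1, f_x=2.0, sigma0=1.0))   # ~0.023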

The shape of this likelihood resembles a sigmoid; the main benefit of this choice is that its average with respect to a Gaussian measure is computable. We can therefore compute the predictive distribution at a new example $\boldsymbol{x}$:

$p(y \vert \boldsymbol{x}, \boldsymbol{\alpha}, \boldsymbol{C}) = \langle P(y \vert f_{\boldsymbol{x}}) \rangle_t = \mathrm{Erf}\left(\frac{y\, \langle f_{\boldsymbol{x}} \rangle_t}{\sigma_{\boldsymbol{x}}}\right)$ (144)

where $\langle f_{\boldsymbol{x}} \rangle_t = \boldsymbol{k}_{\boldsymbol{x}}^T \boldsymbol{\alpha}$ is the mean of the GP at $\boldsymbol{x}$ and $\sigma_{\boldsymbol{x}}^2 = \sigma_0^2 + k^*_{\boldsymbol{x}} + \boldsymbol{k}_{\boldsymbol{x}}^T \boldsymbol{C} \boldsymbol{k}_{\boldsymbol{x}}$. This is the predictive distribution for the new datum: the average of the cumulative Gaussian likelihood with respect to the Gaussian marginal at $\boldsymbol{x}$. The result is obtained by exchanging the order of integration between the Gaussian average and the integral defining $\mathrm{Erf}$ in eq. (143), and back-substituting the definition of the error function.
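To make eq. (144) concrete, the following sketch (a minimal illustration with assumed variable names, not the toolbox interface) computes the predictive class probability from the GP parameters $\boldsymbol{\alpha}$ and $\boldsymbol{C}$ and the kernel quantities at the new input:

import numpy as np
from scipy.stats import norm

def predictive_prob(y, alpha, C, k_x, k_star, sigma0):
    # alpha  : (n,) vector of GP coefficients
    # C      : (n, n) matrix parametrising the posterior covariance
    # k_x    : (n,) kernel values between x and the Basis Vectors
    # k_star : scalar prior kernel value k(x, x)
    # sigma0 : noise parameter of the probit model
    mean_fx = k_x @ alpha                          # <f_x>_t = k_x^T alpha
    var_x = sigma0**2 + k_star + k_x @ C @ k_x     # sigma_x^2
    return norm.cdf(y * mean_fx / np.sqrt(var_x))  # eq. (144)

# tiny example with two Basis Vectors (numbers are arbitrary)
alpha = np.array([0.8, -0.3])
C = np.array([[-0.2, 0.05], [0.05, -0.1]])
k_x = np.array([0.9, 0.4])
print(predictive_prob(+1, alpha, C, k_x, k_star=1.0, sigma0=1.0))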
Based on eq. (53), for a given input-output pair $(\boldsymbol{x}, y)$ the update coefficients $q^{(t+1)}$ and $r^{(t+1)}$ are computed by differentiating the logarithm of the averaged likelihood from eq. (144) with respect to the mean at $\boldsymbol{x}_{t+1}$ [17]:

$q^{(t+1)} = \frac{y}{\sigma_{\boldsymbol{x}}}\, \frac{\mathrm{Erf}'}{\mathrm{Erf}} \qquad\qquad r^{(t+1)} = \frac{1}{\sigma_{\boldsymbol{x}}^2} \left\{ \frac{\mathrm{Erf}''}{\mathrm{Erf}} - \left(\frac{\mathrm{Erf}'}{\mathrm{Erf}}\right)^2 \right\}$ (145)

with $\mathrm{Erf}(z)$ evaluated at $z = \frac{y\, \boldsymbol{\alpha}_t^T \boldsymbol{k}_{\boldsymbol{x}}}{\sigma_{\boldsymbol{x}}}$, and $\mathrm{Erf}'$ and $\mathrm{Erf}''$ its first and second derivatives at $z$.
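Since $\mathrm{Erf}$ is the cumulative Gaussian, $\mathrm{Erf}'(z)$ is the standard normal density $\phi(z)$ and $\mathrm{Erf}''(z) = -z\,\phi(z)$, so eq. (145) can be evaluated as in the following sketch (illustrative names, not the toolbox routines):

import numpy as np
from scipy.stats import norm

def probit_updates(y, mean_fx, var_x):
    # mean_fx : current GP mean at the new input, alpha_t^T k_x
    # var_x   : sigma_x^2 = sigma_0^2 + k*_x + k_x^T C k_x
    sigma_x = np.sqrt(var_x)
    z = y * mean_fx / sigma_x
    ratio = norm.pdf(z) / norm.cdf(z)      # Erf'/Erf evaluated at z
    q = (y / sigma_x) * ratio              # first update coefficient
    r = (-z * ratio - ratio**2) / var_x    # Erf''/Erf - (Erf'/Erf)^2, scaled
    return q, r

print(probit_updates(y=+1, mean_fx=0.5, var_x=1.5))

For strongly negative $z$ the ratio $\phi(z)/\mathrm{Erf}(z)$ is numerically delicate and is better computed via logarithms in practice.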
A more detailed description of the classification framework is given in the THESIS (see References), or in the Neural Computation article.
Direct link to the section discussing classification from the thesis.

Classification Demo

The program (demogp_class_gui) illustrates Sparse OGP inference for binary classification. The demonstration can be fully controlled using the buttons provided.
Classification GUI - initial stage
After adding data (using the Add.Pts button and the mouse), the GP learns the classification boundary, shown in the figure as a green line. Note that, since no hyperparameter learning has been performed, the number of Basis Vectors (black circles in the figure) is quite large.
Classification GUI - final stage
The result of a fair amount of clicking (i.e. data generation) and of pressing the Learning -- Hyp.opt button pair is shown below. Observe that, as a result of learning, the scaling factors changed so that the first coordinate (X-axis) became more important than the second, and the negative log-evidence decreased substantially.
In the next figures we show the effect of flipping the coordinates whilst keeping the GP and the labels of the inputs.
Classification GUI - flipped initial stage
The result of changing the coordinate order: the GP is, unsurprisingly, no longer able to capture the classification boundary.
Classification GUI - flipped final stage
After several learning (Learning -- Hyp.opt) steps, the new configuration is the mirror image of the earlier final result.

Questions, comments, suggestions: contact Lehel Csató.