Diagonalisation of matrix C

In this appendix we derive the online learning rules for a restricted GP in which only the diagonal elements of the matrix $\mathbf{C}$ parametrising the posterior kernel are nonzero, similarly to the parametrisation of the covariances in kernel space proposed by [88].

We implement this simplification by including the constraint in the learning rule: we project onto the subspace of GPs whose posterior kernel is specified using only the diagonal elements, i.e.

$$K_{post}(x,x') = K_0(x,x') + \sum_{i} K_0(x,x_i)\, C_{ii}\, K_0(x_i,x') \qquad (216)$$

and, if we use matrix notation with the design matrix $\underline{\boldsymbol{\Phi}} = [\boldsymbol{\phi}_1,\ldots,\boldsymbol{\phi}_N]$, we can write the posterior covariance matrix in the feature space specified by the design matrix and the diagonal matrix $\mathbf{C}$ as

$$\boldsymbol{\Sigma}_{post} = \mathbf{I}_H + \underline{\boldsymbol{\Phi}}\,\mathbf{C}\,\underline{\boldsymbol{\Phi}}^T \qquad (217)$$
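As a concrete illustration, the following is a minimal numerical sketch of the diagonally restricted posterior kernel of eq. (216); the squared-exponential prior kernel `k0` and all numerical values are hypothetical choices for illustration, not taken from the text:

```python
import numpy as np

def k0(x, xp, ell=1.0):
    """Hypothetical prior kernel K_0: squared exponential, illustration only."""
    return np.exp(-0.5 * np.sum((x - xp) ** 2) / ell ** 2)

def k_post(x, xp, basis, c_diag):
    """Eq. (216): K_post(x,x') = K_0(x,x') + sum_i K_0(x,x_i) C_ii K_0(x_i,x')."""
    return k0(x, xp) + sum(
        k0(x, xi) * cii * k0(xi, xp) for xi, cii in zip(basis, c_diag)
    )

# toy example: three basis vectors and arbitrary diagonal entries of C
basis = [np.array([0.0]), np.array([1.0]), np.array([2.0])]
c_diag = [-0.3, -0.1, -0.2]
print(k_post(np.array([0.5]), np.array([1.5]), basis, c_diag))
```

Storing and updating only the $N$ scalars $C_{ii}$ is the saving the diagonalisation hopes to achieve.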

In the online learning setup the KL-divergence between the true posterior and the projected one is minimised. Differentiating the KL-divergence from eq. (74) with respect to a diagonal element $C_{ii}$ leads to the expression

$$0 = \boldsymbol{\phi}_i^T\,\boldsymbol{\Sigma}_{t+1}\left[-\boldsymbol{\Sigma}_{t+1} + \boldsymbol{\Sigma}_{post}\right]\boldsymbol{\Sigma}_{t+1}\,\boldsymbol{\phi}_i$$

$$\boldsymbol{\Sigma}_{post} = \boldsymbol{\Sigma}_t - \boldsymbol{\Sigma}_t\,\boldsymbol{\phi}_{t+1}\, r^{(t+1)}\,\boldsymbol{\phi}_{t+1}^T\,\boldsymbol{\Sigma}_t$$

with $r^{(t+1)}$ the scalar coefficient obtained using the online learning rule and $\boldsymbol{\phi}_{t+1}$ the feature vector corresponding to the new datum.

We have $t+1$ equations for $t+1$ variables, but the system is not linear. Substituting the forms for the covariances $\boldsymbol{\Sigma}$ and using the matrix inversion lemma leads to the system of equations:


$$\mathrm{diag}\left[\left(\mathbf{K}_B^{-1} + \mathbf{C}_{t+1}\right)^{-1}\left(\mathbf{C}_{t+1} - \mathbf{C}_{post}\right)\left(\mathbf{K}_B^{-1} + \mathbf{C}_{t+1}\right)^{-1}\right] = \mathbf{0} \qquad (218)$$

$$\mathbf{C}_{post} = \mathbf{C}_t + r^{(t+1)}\left(\mathbf{C}_t\,\mathbf{k}_{t+1} + \mathbf{e}_{t+1}\right)\left(\mathbf{C}_t\,\mathbf{k}_{t+1} + \mathbf{e}_{t+1}\right)^T \qquad (219)$$
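For concreteness, a sketch (with hypothetical values) of the exact rank-one update of eq. (219) in numpy, showing that it produces a full matrix even when started from a diagonal $\mathbf{C}_t$:

```python
import numpy as np

def c_post(C_t, k_new, e_new, r):
    """Rank-one update of eq. (219): C_post = C_t + r (C_t k + e)(C_t k + e)^T."""
    s = C_t @ k_new + e_new
    return C_t + r * np.outer(s, s)

# toy example: C_t already extended with a zero row/column for the new element
C_t = np.diag([-0.3, -0.1, 0.0])
k_new = np.array([0.8, 0.5, 1.0])   # prior kernel products K_0(x_i, x_{t+1})
e_new = np.array([0.0, 0.0, 1.0])   # unit vector selecting the new element
print(c_post(C_t, k_new, e_new, r=-0.5))  # off-diagonal fill-in appears
```

The off-diagonal fill-in is exactly what the projection of eq. (218) must remove again at every step.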

We see that the projection in eq. (218) is taken with respect to the unknown matrix $\mathbf{C}_{t+1}$ itself, so there is no analytic solution. Using $\mathbf{C}_{post}$ from eq. (219), the condition for the solution is written as

$$\mathrm{diag}\left(\mathbf{K}_B^{-1} + \mathbf{C}_{t+1}\right)^{-1} = \mathrm{diag}\left(\mathbf{K}_B^{-1} + \mathbf{C}_{post}\right)^{-1}$$
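Since this condition couples all diagonal entries through a full matrix inverse, it can only be solved iteratively. Below is a minimal sketch of one possible scheme, a damped fixed-point step on the residual of eq. (218); the scheme and its parameters are our illustration, not the procedure of [88]:

```python
import numpy as np

def project_diagonal(K_B, C_post, n_iter=500, lr=0.1):
    """Iteratively seek a diagonal C_{t+1} satisfying eq. (218).

    Every sweep requires a full matrix inversion, so the diagonal
    restriction saves no computation."""
    K_B_inv = np.linalg.inv(K_B)          # already a full inversion
    d = np.diag(C_post).copy()            # start from the diagonal of C_post
    for _ in range(n_iter):
        A_inv = np.linalg.inv(K_B_inv + np.diag(d))   # full inverse per sweep
        residual = np.diag(A_inv @ (np.diag(d) - C_post) @ A_inv)
        d = d - lr * residual             # damped step towards stationarity
    return np.diag(d)
```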

We see that, to obtain the ``simplified'' solution, we still need to invert full matrices, and the posterior covariance is not diagonal either. As a consequence we would be required to perform iterative approximations, as also considered in [88]. From this we conclude that the diagonalisation of the parameter matrix $\mathbf{C}$ is not feasible, as it does not introduce any computational benefit, and we believe that keeping the $\mathcal{BV}$ set at a reasonable size is a better alternative than diagonalising a possibly larger $\mathcal{BV}$ set.