Diagonalisation of matrix C

In this appendix we derive the online learning rules for a restricted GP in which only the diagonal elements of the matrix $\mathbf{C}$ parametrising the posterior kernel are nonzero, similarly to the parametrisation of the covariances in kernel space proposed by [88].

We implement this simplification by including the constraint in the learning rule: we project onto the subspace of GPs whose posterior kernel is specified using only the diagonal elements, i.e.

$$K_{post}(x,x') = K_0(x,x') + \sum_{i} K_0(x,x_i)\, C_{ii}\, K_0(x_i,x') \qquad (216)$$

and, if we use matrix notation with the design matrix $\underline{\boldsymbol{\Phi}} = [\boldsymbol{\phi}_1,\ldots,\boldsymbol{\phi}_N]$, we can write the posterior covariance matrix in the feature space specified by the design matrix and the diagonal matrix $\mathbf{C}$ as

$$\boldsymbol{\Sigma}_{post} = \mathbf{I}_H + \underline{\boldsymbol{\Phi}}\,\mathbf{C}\,\underline{\boldsymbol{\Phi}}^T \qquad (217)$$
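As a concrete illustration, the following is a minimal numerical sketch of the diagonally restricted posterior kernel of eq. (216); the squared-exponential prior kernel `k0` and all numerical values are hypothetical choices for illustration, not taken from the text:

```python
import numpy as np

def k0(x, xp, ell=1.0):
    """Hypothetical prior kernel K_0: squared exponential, illustration only."""
    return np.exp(-0.5 * np.sum((x - xp) ** 2) / ell ** 2)

def k_post(x, xp, basis, c_diag):
    """Eq. (216): K_post(x,x') = K_0(x,x') + sum_i K_0(x,x_i) C_ii K_0(x_i,x')."""
    return k0(x, xp) + sum(
        k0(x, xi) * cii * k0(xi, xp) for xi, cii in zip(basis, c_diag)
    )

# toy example: three basis vectors and arbitrary diagonal entries of C
basis = [np.array([0.0]), np.array([1.0]), np.array([2.0])]
c_diag = [-0.3, -0.1, -0.2]
print(k_post(np.array([0.5]), np.array([1.5]), basis, c_diag))
```

Storing and updating only the $N$ scalars $C_{ii}$ is the saving the diagonalisation hopes to achieve.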

In the online learning setup the KL-divergence between the true posterior and the projected one is minimised. Differentiating the KL-divergence from eq. (74) with respect to a diagonal element $C_{ii}$ leads to the expression

$$0 = \boldsymbol{\phi}_i^T\,\boldsymbol{\Sigma}_{t+1}\left[-\boldsymbol{\Sigma}_{t+1} + \boldsymbol{\Sigma}_{post}\right]\boldsymbol{\Sigma}_{t+1}\,\boldsymbol{\phi}_i$$

$$\boldsymbol{\Sigma}_{post} = \boldsymbol{\Sigma}_t - \boldsymbol{\Sigma}_t\,\boldsymbol{\phi}_{t+1}\, r^{(t+1)}\,\boldsymbol{\phi}_{t+1}^T\,\boldsymbol{\Sigma}_t$$

with $r^{(t+1)}$ the scalar coefficient obtained using the online learning rule and $\boldsymbol{\phi}_{t+1}$ the feature vector corresponding to the new datum.

We have $t+1$ equations for $t+1$ variables, but the system is not linear. Substituting the forms for the covariances $\boldsymbol{\Sigma}$ and using the matrix inversion lemma leads to the system of equations:


$$\mathrm{diag}\left[\left(\mathbf{K}_B^{-1} + \mathbf{C}_{t+1}\right)^{-1}\left(\mathbf{C}_{t+1} - \mathbf{C}_{post}\right)\left(\mathbf{K}_B^{-1} + \mathbf{C}_{t+1}\right)^{-1}\right] = \mathbf{0} \qquad (218)$$

$$\mathbf{C}_{post} = \mathbf{C}_t + r^{(t+1)}\left(\mathbf{C}_t\,\mathbf{k}_{t+1} + \mathbf{e}_{t+1}\right)\left(\mathbf{C}_t\,\mathbf{k}_{t+1} + \mathbf{e}_{t+1}\right)^T \qquad (219)$$
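For concreteness, a sketch (with hypothetical values) of the exact rank-one update of eq. (219) in numpy, showing that it produces a full matrix even when started from a diagonal $\mathbf{C}_t$:

```python
import numpy as np

def c_post(C_t, k_new, e_new, r):
    """Rank-one update of eq. (219): C_post = C_t + r (C_t k + e)(C_t k + e)^T."""
    s = C_t @ k_new + e_new
    return C_t + r * np.outer(s, s)

# toy example: C_t already extended with a zero row/column for the new element
C_t = np.diag([-0.3, -0.1, 0.0])
k_new = np.array([0.8, 0.5, 1.0])   # prior kernel products K_0(x_i, x_{t+1})
e_new = np.array([0.0, 0.0, 1.0])   # unit vector selecting the new element
print(c_post(C_t, k_new, e_new, r=-0.5))  # off-diagonal fill-in appears
```

The off-diagonal fill-in is exactly what the projection of eq. (218) must remove again at every step.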

We see that the projection in eq. (218) is taken with respect to the unknown matrix $\mathbf{C}_{t+1}$ itself, so there is no analytic solution. Using $\mathbf{C}_{post}$ from eq. (219), the condition for the solution is written as

$$\mathrm{diag}\left(\mathbf{K}_B^{-1} + \mathbf{C}_{t+1}\right)^{-1} = \mathrm{diag}\left(\mathbf{K}_B^{-1} + \mathbf{C}_{post}\right)^{-1}$$
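Since this condition couples all diagonal entries through a full matrix inverse, it can only be solved iteratively. Below is a minimal sketch of one possible scheme, a damped fixed-point step on the residual of eq. (218); the scheme and its parameters are our illustration, not the procedure of [88]:

```python
import numpy as np

def project_diagonal(K_B, C_post, n_iter=500, lr=0.1):
    """Iteratively seek a diagonal C_{t+1} satisfying eq. (218).

    Every sweep requires a full matrix inversion, so the diagonal
    restriction saves no computation."""
    K_B_inv = np.linalg.inv(K_B)          # already a full inversion
    d = np.diag(C_post).copy()            # start from the diagonal of C_post
    for _ in range(n_iter):
        A_inv = np.linalg.inv(K_B_inv + np.diag(d))   # full inverse per sweep
        residual = np.diag(A_inv @ (np.diag(d) - C_post) @ A_inv)
        d = d - lr * residual             # damped step towards stationarity
    return np.diag(d)
```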

We see that, to obtain the ``simplified'' solution, we still need to invert full matrices, and the posterior covariance is not diagonal either. As a consequence we would be required to perform iterative approximations, as also considered in [88]. From this we conclude that the diagonalisation of the parameter matrix $\mathbf{C}$ is not feasible, as it does not introduce any computational benefit, and we believe that keeping the $\mathcal{BV}$ set at a reasonable size is a better alternative than diagonalising a possibly larger $\mathcal{BV}$ set.