User Guide
There are three main steps needed to train and use the different models: initialization, training and prediction.
Initialization
Possible models
There are currently 8 possible Gaussian Process models:
GP
GP corresponds to the original GP regression model; it necessarily uses a Gaussian likelihood.
GP(X_train, y_train, kernel; kwargs...)
VGP
VGP is a variational GP model: a multivariate Gaussian approximates the true posterior. No inducing point augmentation is involved, so it is well suited for small datasets (~10^3 samples).
VGP(X_train, y_train, kernel, likelihood, inference; kwargs...)
SVGP
SVGP is a variational GP model augmented with inducing points. The optimization is done on those points, allowing for stochastic updates and scalability to large datasets. The trade-off can be a slightly lower accuracy and the need to select the number and location of the inducing points (this problem is currently being worked on).
SVGP(kernel, likelihood, inference, Z; kwargs...)
Where Z is the position of the inducing points.
MCGP
MCGP is a GP model where the posterior is represented via a collection of samples.
MCGP(X_train, y_train, kernel, likelihood, inference; kwargs...)
OnlineSVGP
OnlineSVGP is an online variational GP model. It is based on the streaming method of Bui 17' and supports all likelihoods, even with multiple latent GPs.
OnlineSVGP(kernel, likelihood, inference, ind_point_algorithm; kwargs...)
MOVGP
MOVGP is a multi output variational GP model based on the principle f_output[i] = sum(A[i, j] * f_latent[j] for j in 1:n_latent). The number of latent GPs is free.
MOVGP(X_train, ys_train, kernel, likelihood/s, inference, n_latent; kwargs...)
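The mixing principle can be illustrated in plain Julia, without the package. This is only a sketch: the sizes, the matrix A and the latent values below are made-up numbers, and each latent GP is represented by its values at a few input points rather than by an actual GP.

```julia
# Hypothetical sizes and values, chosen only for illustration.
n_output, n_latent, n_points = 3, 2, 4

# A[i, j] weights the contribution of latent GP j to output i.
A = [1.0 0.0;
     0.5 0.5;
     0.0 2.0]

# Stand-in latent values: one row per latent GP, one column per input point.
f_latent = [0.1 0.2 0.3 0.4;
            1.0 2.0 3.0 4.0]

# Element-wise form of the principle: f_output[i] = sum_j A[i, j] * f_latent[j].
f_output = [sum(A[i, j] * f_latent[j, n] for j in 1:n_latent) for i in 1:n_output, n in 1:n_points]

# The same mixing written as a single matrix product.
f_output_matrix = A * f_latent
```

Because the number of rows of A (outputs) and the number of columns (latent GPs) are independent, fewer latent GPs than outputs can be used.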
MOSVGP
MOSVGP is the same as MOVGP but with inducing points: a multi output sparse variational GP model, based on Moreno-Muñoz 18'.
MOSVGP(kernel, likelihood/s, inference, n_latent, n_inducing_points; kwargs...)
VStP
VStP is a variational Student-T model where the prior is a multivariate Student-T distribution with scale K, mean μ₀ and degrees of freedom ν. The inference is done automatically by augmenting the prior as a scale mixture of inverse gamma.
VStP(X_train, y_train, kernel, likelihood, inference, ν; kwargs...)
Likelihood
GP can only have a Gaussian likelihood, while the other models have more choices. Here are the ones currently implemented:
Regression
For regression, four likelihoods are available:
- The classical GaussianLikelihood, for Gaussian noise.
- The StudentTLikelihood, assuming noise from a Student-T distribution (more robust to outliers).
- The LaplaceLikelihood, with noise from a Laplace distribution.
- The HeteroscedasticLikelihood (in development), where the noise is a function of the input: $Var(X) = λσ^{-1}(g(X))$ where g(X) is an additional Gaussian Process and σ is the logistic function.
Classification
For classification one can select among:
- The LogisticLikelihood: a Bernoulli likelihood with a logistic link.
- The BayesianSVM likelihood, based on the frequentist SVM and equivalent to using a hinge loss.
Event Likelihoods
For likelihoods such as Poisson or Negative Binomial, we approximate a parameter by σ(f). Two likelihoods are implemented:
- The PoissonLikelihood: a discrete Poisson process (one parameter per point) with the scale parameter defined as λσ(f).
- The NegBinomialLikelihood: the Negative Binomial likelihood where r is fixed and the success probability p is defined as σ(f).
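As a rough illustration of these two parametrizations, the following plain-Julia sketch maps latent values to a Poisson rate and to a Negative Binomial success probability. The latent values, λ and r are made-up numbers, and the code does not call the package.

```julia
# Logistic function σ(f) = 1 / (1 + exp(-f)); maps any real value into (0, 1).
logistic(f) = 1 / (1 + exp(-f))

# Hypothetical latent GP values at three input points.
f = [-2.0, 0.0, 3.0]

# Poisson: the parameter at each point is λ * σ(f), so it is bounded above by λ.
λ = 10.0                      # assumed value of λ
poisson_rate = λ .* logistic.(f)

# Negative Binomial: r is fixed and the success probability is p = σ(f).
r = 5                         # assumed fixed value of r
negbin_p = logistic.(f)
```

Passing the latent function through σ keeps both parameters in a valid range regardless of the sign of f.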
Multi-class classification
There are two available likelihoods for multi-class classification:
- The SoftMaxLikelihood, the most common approach. However, no analytical solution is possible.
- The LogisticSoftMaxLikelihood, a modified softmax where the exponential function is replaced by the logistic function. It yields a fully conjugate model (see the corresponding paper).
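The difference between the two link functions can be sketched in plain Julia. The function names and latent values below are illustrative only and are not the package's internals.

```julia
logistic(f) = 1 / (1 + exp(-f))

# Standard softmax: p_k = exp(f_k) / sum_j exp(f_j).
softmax(f) = exp.(f) ./ sum(exp.(f))

# Logistic-softmax: the exponential is replaced by the logistic function,
# p_k = σ(f_k) / sum_j σ(f_j).
logisticsoftmax(f) = logistic.(f) ./ sum(logistic.(f))

f = [1.0, -0.5, 2.0]          # hypothetical latent values, one per class
p_soft = softmax(f)
p_logsoft = logisticsoftmax(f)
```

Both vectors sum to one and rank the classes in the same order, but the logistic-softmax probabilities are less peaked because σ saturates for large latent values.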
More options
There are plans to make distributions from Distributions.jl work directly as likelihoods.
Inference
Inference can be done in various ways.
- AnalyticVI: Variational Inference with closed-form updates. For non-Gaussian likelihoods, this relies on an augmented version of the likelihoods. To use Stochastic Variational Inference, one can use AnalyticSVI with the size of the mini-batch as an argument.
- GibbsSampling: Gibbs Sampling of the true posterior. This also relies on an augmented version of the likelihoods and is only valid for the MCGP model at the moment.
The next two methods rely on numerical approximation of an integral. I therefore recommend using the classical Descent approach, as natural gradient updates are used anyway. ADAM seems to give erratic results.
- QuadratureVI: Variational Inference with gradients computed by estimating the expected log-likelihood via quadrature.
- MCIntegrationVI: Variational Inference with gradients computed by estimating the expected log-likelihood via Monte Carlo Integration.
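To make the two numerical estimates concrete, here is a small plain-Julia sketch that approximates an expected log-likelihood under a Gaussian q(f) = N(μ, σ²), once with a 3-point Gauss-Hermite rule and once with Monte Carlo samples. It is not the package's implementation: the helper names, the unit-noise Gaussian log-likelihood (chosen because its expectation is known in closed form) and all numbers are assumptions made for illustration.

```julia
using Random

# Illustrative log-likelihood: Gaussian with unit noise variance.
loglike(y, f) = -0.5 * log(2π) - 0.5 * (y - f)^2

# Quadrature estimate with a 3-point Gauss-Hermite rule
# (nodes and weights for the weight function exp(-x^2)),
# using the change of variables f = μ + sqrt(2) * σ * x.
function expec_quadrature(y, μ, σ)
    nodes = [-sqrt(1.5), 0.0, sqrt(1.5)]
    weights = [sqrt(π) / 6, 2 * sqrt(π) / 3, sqrt(π) / 6]
    total = sum(w * loglike(y, μ + sqrt(2) * σ * x) for (x, w) in zip(nodes, weights))
    return total / sqrt(π)
end

# Monte Carlo estimate: average over samples f ~ N(μ, σ²).
function expec_montecarlo(rng, y, μ, σ; n_samples=100_000)
    return sum(loglike(y, μ + σ * randn(rng)) for _ in 1:n_samples) / n_samples
end

y, μ, σ = 1.0, 0.3, 0.8
exact = -0.5 * log(2π) - 0.5 * ((y - μ)^2 + σ^2)   # closed form for this likelihood
quad = expec_quadrature(y, μ, σ)
mc = expec_montecarlo(MersenneTwister(42), y, μ, σ)
```

For this quadratic integrand the quadrature rule is exact, while the Monte Carlo estimate carries sampling noise that shrinks with the number of samples.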
[WIP]: AdvancedHMC.jl will be integrated at some point, although Gibbs sampling is generally preferable when available.
Compatibility table
Not all inference methods are implemented/valid for all likelihoods; here is the compatibility table between them.
Likelihood/Inference | AnalyticVI | GibbsSampling | QuadratureVI | MCIntegrationVI |
---|---|---|---|---|
GaussianLikelihood | ✔ (Analytic) | ✖ | ✖ | ✖ |
StudentTLikelihood | ✔ | ✔ | ✔ | ✖ |
LaplaceLikelihood | ✔ | ✔ | ✔ | ✖ |
HeteroscedasticLikelihood | ✔ | ✔ | (dev) | ✖ |
LogisticLikelihood | ✔ | ✔ | ✔ | ✖ |
BayesianSVM | ✔ | (dev) | ✖ | ✖ |
LogisticSoftMaxLikelihood | ✔ | ✔ | ✖ | (dev) |
SoftMaxLikelihood | ✖ | ✖ | ✖ | ✔ |
PoissonLikelihood | ✔ | ✔ | ✖ | ✖ |
NegBinomialLikelihood | ✔ | ✔ | ✖ | ✖ |
(dev) means that the feature is possible and may be developed and tested but is not available yet. All contributions or requests are very welcome!
Model/Inference | AnalyticVI | GibbsSampling | QuadratureVI | MCIntegrationVI |
---|---|---|---|---|
VGP | ✔ | ✖ | ✔ | ✔ |
SVGP | ✔ | ✖ | ✔ | ✔ |
MCGP | ✖ | ✔ | ✖ | ✖ |
OnlineSVGP | ✔ | ✖ | ✖ | ✖ |
MO(S)VGP | ✔ | ✖ | ✔ | ✔ |
VStP | ✔ | ✖ | ✔ | ✔ |
Note that for MO(S)VGP you can use a mix of different likelihoods.
Inducing Points
Neither SVGP nor MOSVGP takes data directly as input; they take inducing points instead. AGP.jl directly reexports the InducingPoints.jl package for you to use. For example, to use a k-means approach to select 100 points on your input data you can use:
Z = inducingpoints(KmeansAlg(100), X)
Z will always be an AbstractVector and be directly compatible with SVGP and MOSVGP.
For OnlineSVGP, since it cannot be assumed that you have data from the start, only an online inducing points selection algorithm can be used. The inducing points locations will be initialized with the first batch of data.
Additional Parameters
Hyperparameter optimization
One can optimize the kernel hyperparameters as well as the inducing points location by maximizing the ELBO. All derivatives are already hand-coded (no AD needed). One can select the optimization scheme via:
- The optimiser keyword, which can be nothing or false for no optimization, or an optimiser from the Flux.jl library (see the list in Optimisers).
- The Zoptimiser keyword, which is similar to optimiser and is used for optimizing the inducing points locations. It is set to nothing (no optimization) by default.
PriorMean
The mean keyword allows you to add different types of prior means:
- ZeroMean, a constant mean that cannot be optimized.
- ConstantMean, a constant mean that can be optimized.
- EmpiricalMean, a vector mean with a different value for each point.
- AffineMean, where μ₀ is given by X*w + b.
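As a quick illustration of what these means evaluate to, the sketch below computes a constant and an affine prior mean in plain Julia. The inputs, weights and bias are made-up values, and the code does not use the package's mean types.

```julia
# Hypothetical inputs: 3 points with 2 features each (one row per point).
X = [1.0 2.0;
     3.0 4.0;
     5.0 6.0]

# Constant mean: the same value c at every point.
c = 0.7                       # assumed constant
μ_constant = fill(c, size(X, 1))

# Affine mean: μ₀ = X*w + b, one value per input point.
w = [0.5, -1.0]               # assumed weights, one per feature
b = 2.0                       # assumed bias
μ₀ = X * w .+ b
```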
Training
Offline models
Training is straightforward after initializing the model, by running:
model, state = train!(model, X_train, y_train; iterations=100, callback=callbackfunction)
where the callback option is for running a function at every iteration. callbackfunction should be defined as
function callbackfunction(model, iter)
    # do things here...
end
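A callback with this signature could, for instance, record the iterations at which it fires. The sketch below is self-contained and does not call the package: the loop is only a stand-in for the training iterations, and mock_model is a placeholder rather than a real model.

```julia
# Storage captured by the callback.
logged_iterations = Int[]

# Same signature as above: called with the model and the iteration number.
function callbackfunction(model, iter)
    # Record every 10th iteration; a real callback could log the ELBO or a test error here.
    if iter % 10 == 0
        push!(logged_iterations, iter)
    end
    return nothing
end

# Stand-in for the training loop, only to show when the callback fires.
mock_model = nothing          # placeholder, not an AGP model
for iter in 1:100
    callbackfunction(mock_model, iter)
end
```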
The returned state will contain different variables such as some kernel matrices and local variables. You can reuse this state to save some computations when using prediction functions or computing the ELBO.
Note that passing X_train and y_train is optional for GP, VGP and MCGP.
Online models
We recommend looking at the tutorial on online Gaussian processes. One needs to pass a state around, i.e.
let state=nothing
    for (X_batch, y_batch) in eachbatch((X_train, y_train))
        model, state = train!(model, X_batch, y_batch, state; iterations=10)
    end
end
Prediction
Once the model has been trained it is finally possible to compute predictions. There are always three possibilities:
- predict_f(model, X_test; covf=true, fullcov=false): Compute the parameters (mean and covariance) of the latent normal distribution at each test point. If covf=false, return only the mean; if fullcov=true, return a covariance matrix instead of only the diagonal.
- predict_y(model, X_test): Compute the point estimate of the predictive likelihood for regression, or the label of the most likely class for classification.
- proba_y(model, X_test): Return the mean with the variance of each point for regression, or the predictive likelihood of obtaining the class y=1 for classification.
Miscellaneous
🚧 In construction – Should be developed in the near future 🚧
Saving/Loading models
Once a model has been trained it is possible to save its state in a file by using save_trained_model(filename, model); a partial version of the model will be saved in filename.
It is then possible to reload this file by using load_trained_model(filename). Note, however, that it will not be possible to train the model further: this function is only meant for making further predictions.
🚧 Pre-made callback functions 🚧
There is one (for now) premade function to return an MVHistory object and a callback function for the training of binary classification problems. The callback will store the ELBO and the variational parameters at every iteration included in iterpoints. If `X_test` and `y_test` are provided it will also store the test accuracy and the mean and median test loglikelihood.