Regression Using Machine Learning Algorithms

Machine Learning has been applied in earnest to multi-dimensional, nonlinear data regression since the 1980s. Many of these methods fall into the category of Bayesian Interpolation. Proponents such as MacKay, Williams, Rasmussen and Tipping have investigated Neural Network, Gaussian Process, Relevance Vector Machine and other methods. All these methods rely on training, and sometimes testing, with sets of known inputs and outputs to set pertinent algorithm parameters. The parameters characterize the training data and enable the algorithm to perform well on the test data.

The Relevance Vector Machine (RVM) method is attractive because it has a closed-form iterative training solution and provides good interpolation using a weighted linear combination over a sparse set of basis functions. The RVM is a form of Support Vector Machine (SVM) derivable from a Bayesian cost function, and it places no restriction on the nature of the basis functions that can be employed; the SVM, in contrast, requires symmetric, positive-definite basis functions called Mercer kernels. An appropriate choice of kernel can reduce the fit error substantially. Figures 4 and 5 of “Sparse Bayesian Learning and the Relevance Vector Machine”, Michael Tipping, Journal of Machine Learning Research 1 (2001), pp. 211–244, dramatically illustrate how important the right kernel choice is. Figure 4 of “Bayesian Inference: An Introduction to Principles and Practice in Machine Learning”, Michael Tipping, in O. Bousquet, U. von Luxburg, and G. Rätsch (Eds.), Advanced Lectures on Machine Learning, pp. 41–62, Springer, shows how added training data refines the interpolative estimate and the weight distributions.

The paper “Fast Marginal Likelihood Maximisation for Sparse Bayesian Models”, Michael Tipping and Anita Faul, in C. M. Bishop and B. J. Frey (Eds.), Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, Key West, FL, January 3–6, 2003, provides a ‘fast’ version of the iterative training algorithm. The derivation is analyzed in “Analysis of Sparse Bayesian Learning”, Anita Faul and Michael Tipping, in T. Dietterich, S. Becker, and Z. Ghahramani (Eds.), Advances in Neural Information Processing Systems 14, pp. 383–389, MIT Press, 2002.
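The heart of the fast algorithm is a per-basis decision rule built from ‘sparsity’ and ‘quality’ factors. The sketch below computes those factors by direct inversion of the model's marginal covariance, whereas the paper uses cheaper incremental updates; the function names and the tiny worked example are illustrative assumptions.

```python
import numpy as np

def sq_factors(Phi, t, active, alpha, sigma2):
    # Marginal covariance C of the current sparse model: sigma2*I + sum phi_m phi_m^T / alpha_m.
    N = len(t)
    C = sigma2 * np.eye(N)
    for m in active:
        C += np.outer(Phi[:, m], Phi[:, m]) / alpha[m]
    Cinv = np.linalg.inv(C)
    S = np.einsum('nm,nk,km->m', Phi, Cinv, Phi)   # S_m = phi_m^T C^-1 phi_m
    Q = Phi.T @ (Cinv @ t)                         # Q_m = phi_m^T C^-1 t
    s, q = S.copy(), Q.copy()
    for m in active:   # for in-model bases, remove their own contribution to C
        s[m] = alpha[m] * S[m] / (alpha[m] - S[m])
        q[m] = alpha[m] * Q[m] / (alpha[m] - S[m])
    return s, q

def decide(m, active, s, q):
    # theta_m > 0 means basis m belongs in the model with alpha_m = s_m^2 / theta_m.
    theta = q[m] ** 2 - s[m]
    if theta > 0:
        return ('re-estimate' if m in active else 'add'), s[m] ** 2 / theta
    return ('delete' if m in active else 'skip'), np.inf

# Tiny worked example: t is exactly 3 * phi_0, and phi_1 is orthogonal to t.
Phi = np.eye(4)[:, :2]                 # phi_0 = e_0, phi_1 = e_1
t = np.array([3.0, 0.0, 0.0, 0.0])
s, q = sq_factors(Phi, t, active=[], alpha=np.ones(2), sigma2=0.1)
print(decide(0, [], s, q))   # -> ('add', ...): phi_0 explains t
print(decide(1, [], s, q))   # -> ('skip', inf): phi_1 is uninformative
```

Iterating this add / re-estimate / delete rule over candidate bases is what makes the training ‘fast’: only one basis function is touched per step.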

The least evident expression used in these papers is:

    p(w | t, α, σ²) = p(t | w, σ²) p(w | α) / p(t | α, σ²)

where w is the weight vector, t the vector of training targets, α the vector of weight prior hyperparameters and σ² the noise variance.

However, it is not immediately obvious how this formula follows from Bayes’ Theorem.

Applying Bayes’ Theorem to the probability density of the weight vector w, conditioned on the training targets t, the weight prior hyperparameters α and the training-data noise variance σ², yields for the left-hand side (LHS):

    p(w | t, α, σ²) = p(t, α, σ² | w) p(w) / p(t, α, σ²)

Dividing the numerator and denominator of the right-hand side (RHS) by p(α, σ²), and noting that p(t, α, σ² | w) p(w) = p(t | w, α, σ²) p(w | α, σ²) p(α, σ²), we get:

    p(w | t, α, σ²) = p(t | w, α̶, σ²) p(w | α, σ̶²) / p(t | α, σ²)

where the struck-through variables are not applicable to their distributions: given the weights w, the targets t do not depend on α, and the prior over w does not depend on σ², an observation by Tristan Fletcher, another worker in the Machine Learning field.

Removing the superfluous dependencies results in:

    p(w | t, α, σ²) = p(t | w, σ²) p(w | α) / p(t | α, σ²)

which is equation (31) in Tipping’s paper and equation (5) in Tipping and Faul’s, the only mysterious expression in their respective accounts of the RVM.
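Because the likelihood and the weight prior are both Gaussian, the posterior in this expression is itself Gaussian with a closed-form mean and covariance, and the evidence term in the denominator does not depend on w. A brute-force check for a hypothetical one-weight model confirms this: normalising p(t | w, σ²) p(w | α) over a grid of w reproduces the closed-form moments. The data and parameter values below are arbitrary choices for illustration.

```python
import numpy as np

# One-weight model: t_n = w * phi_n + noise, with Gaussian prior w ~ N(0, 1/alpha).
rng = np.random.default_rng(1)
phi = rng.standard_normal(10)
alpha, sigma2 = 2.0, 0.5
t = 1.5 * phi + np.sqrt(sigma2) * rng.standard_normal(10)

# Closed-form Gaussian posterior implied by the expression above
# (Tipping's mu and Sigma, specialised to a single weight).
Sigma = 1.0 / (alpha + phi @ phi / sigma2)
mu = Sigma * (phi @ t) / sigma2

# Brute force: evaluate p(t | w, sigma2) * p(w | alpha) on a dense grid of w;
# the evidence p(t | alpha, sigma2) is a constant in w, so normalising suffices.
w = np.linspace(mu - 5.0, mu + 5.0, 20001)
resid = t[:, None] - phi[:, None] * w[None, :]             # shape (10, grid)
log_post = -0.5 * alpha * w ** 2 - 0.5 * (resid ** 2).sum(axis=0) / sigma2
p = np.exp(log_post - log_post.max())                       # unnormalised posterior
grid_mean = (w * p).sum() / p.sum()
grid_var = (w ** 2 * p).sum() / p.sum() - grid_mean ** 2
```

The grid mean and variance agree with mu and Sigma to within the grid resolution, as the derivation requires.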

Rasmussen and Quiñonero-Candela, in “Healing the Relevance Vector Machine through Augmentation”, Proceedings of the 22nd International Conference on Machine Learning, Bonn, Germany, 2005, take exception to the RVM’s sparsity when it does not produce a correct uncertainty estimate: the predictive variance can shrink, rather than grow, away from the data. Their corrective approach changes the interpolation little and restores a sensible uncertainty estimate, at the expense of kernel sparsity. They do not recommend their approach in practice when the uncertainty estimate is unimportant. Tipping anticipated this result in an appendix of the first paper cited above.

Bayesian methods are usually used to fit noisy data and extract, in some sense, the underlying system model. The noise variance may be small, but the approach does not support a value of exactly zero. This implies an irreducible residual when fitting deterministic, noise-free data.

Most Bayesian regression methods have related uses as classifiers.