A Framework for Distributed Inference with Emerging Hardware.
NB: A static version of this interactive article has been published on arXiv and can be cited as:
@article{Ortiz2021visualGBP, title = {A visual introduction to Gaussian Belief Propagation}, author = {Ortiz, Joseph and Evans, Talfan and Davison, Andrew J.}, journal={arXiv preprint arXiv:2107.02308}, year = {2021}, }
See our short video introduction to the article here.
In this article, we present a visual introduction to Gaussian Belief Propagation (GBP), a probabilistic inference algorithm that operates by passing messages between the nodes of arbitrarily structured factor graphs. A special case of loopy belief propagation, GBP updates rely only on local information and will converge independently of the message schedule. Our key argument is that, given recent trends in computing hardware, GBP has the right computational properties to act as a scalable distributed probabilistic inference framework for future machine learning systems.
Bayesian probability theory is the fundamental framework for dealing with uncertain data, and is at the core of practical systems in machine learning and robotics
Z. Ghahramani.
Nature, Vol 521(7553), pp. 452--459. Nature Publishing Group. 2015.
A.J. Davison.
arXiv preprint arXiv:1803.11288. 2018.
There are various possible algorithms for probabilistic inference, many of which take advantage of specific problem structure for fast performance.
Efficient inference on models represented by large, dynamic and highly inter-connected graphs, however, remains computationally challenging and is already a limiting factor in real embodied systems.
Inference on such graphs will only become more important as general purpose perception systems strive towards more heterogeneous and dynamic representations containing abstractions (such as objects or concepts) with continually changing relationships
N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, J. Dean.
Proceedings of the International Conference on Learning Representations (ICLR). 2017.
W. Fedus, B. Zoph, N. Shazeer.
arXiv preprint arXiv:2101.03961. 2021.
D. Lacey. 2019.
As we search for the best algorithms for probabilistic inference, we should recall the "Bitter Lesson" of machine learning (ML) research: "general methods that leverage computation are ultimately the most effective, and by a large margin"
R. Sutton.
2019.
Looking to the future, we anticipate a long-term trend towards a "hardware jungle" of parallel, heterogeneous and distributed computing devices
H. Sutter. 2011.
V. Sze, Y. Chen, T. Yang, J.S. Emer.
Proceedings of the IEEE, Vol 105(12), pp. 2295--2329. IEEE. 2017.
To scale arbitrarily and leverage all available computation in this "hardware jungle", our key hypothesis is that we need inference methods which can operate with distributed local processing / storage and message passing communication, without the need for a global view or coordination of the whole model
Fortunately, a well-known inference algorithm exists which has these properties: Loopy Belief Propagation
J. Pearl. Morgan Kaufmann. 1988.
F.R. Kschischang, B.J. Frey, H. Loeliger.
IEEE Transactions on Information Theory, Vol 47(2), pp. 498--519. IEEE. 2001.
K.P. Murphy, Y. Weiss, M.I. Jordan.
Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence. 1999.
J. Pearl. Morgan Kaufmann. 1988.
F. Dellaert, M. Kaess.
Foundations and Trends in Robotics, Vol 6(1-2), pp. 1--139. Now Publishers, Inc. 2017.
F. Dellaert.
Annual Review of Control, Robotics, and Autonomous Systems, Vol 4. Annual Reviews. 2021.
BP guarantees exact marginal computation with one sequence of forward-backward message passing in tree-structured graphs
C.M. Bishop.
Springer-Verlag New York, Inc. 2006.
K.P. Murphy, Y. Weiss, M.I. Jordan.
Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence. 1999.
F. Scarselli, M. Gori, A.C. Tsoi, M. Hagenbuchner, G. Monfardini.
IEEE Transactions on Neural Networks, Vol 20(1), pp. 61--80. IEEE. 2008.
M.M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, P. Vandergheynst.
IEEE Signal Processing Magazine, Vol 34(4), pp. 18--42. IEEE. 2017.
P.W. Battaglia, J.B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, others.
arXiv preprint arXiv:1806.01261. 2018.
R.J. McEliece, D.J.C. MacKay, J. Cheng.
IEEE Journal on Selected Areas in Communications, Vol 16(2), pp. 140--152. IEEE. 1998.
One issue that has precluded the use of general Loopy BP is that it lacks convergence guarantees. However, a second and perhaps more relevant issue given modern hardware trends, is that its computational properties have not fitted the dominant processing paradigms of recent decades, CPUs and GPUs. Consequently, other factor graph inference algorithms have been preferred which take advantage of global problem structure to operate much more rapidly and robustly than loopy BP on a CPU.
We believe that now is the right time to re-evaluate the properties of loopy BP given the rise of graph-like computing hardware and sparse models.
Indeed there has been a recent resurgence in interest in the algorithm
D. George, W. Lehrach, K. Kansky, M. Lazaro-Gredilla, C. Laan, B. Marthi, X. Lou, Z. Meng, Y. Liu, H. Wang, others.
Science, Vol 358(6368). American Association for the Advancement of Science. 2017.
V.G. Satorras, M. Welling.
International Conference on Artificial Intelligence and Statistics, pp. 685--693. 2021.
J. Kuck, S. Chakraborty, H. Tang, R. Luo, J. Song, A. Sabharwal, S. Ermon.
Neural Information Processing Systems (NeurIPS). 2020.
J. Ortiz, M. Pupilli, S. Leutenegger, A. Davison.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2020.
A. Opipari, C. Chen, S. Wang, J. Pavlasek, K. Desingh, O.C. Jenkins.
arXiv preprint arXiv:2101.05948. 2021.
J. Ortiz, M. Pupilli, S. Leutenegger, A. Davison.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2020.
A.J. Davison, J. Ortiz.
arXiv preprint arXiv:1910.14139. 2019.
J. Du, S. Ma, Y. Wu, S. Kar, J.M. Moura.
arXiv preprint arXiv:1801.06430. 2018.
D. Bickson. 2008.
In this article, we give an introduction to Gaussian Belief Propagation as a strong general purpose algorithmic and representational framework for large scale distributed inference. Throughout, we present a range of examples across 1D and 2D geometric estimation and image processing in the form of interactive simulations which allow the reader to experiment with and understand the properties of GBP. We hope that these simulations emphasize the key properties of GBP which we summarize below.
In the remainder of the article, we first introduce the GBP algorithm and show that it is intimately linked to solving a linear system of equations $A x = b$.
Inference is the problem of estimating statistical properties of unknown variables from known or observed quantities.
Bayesian inference proceeds by first defining a probabilistic model that relates the unknown variables to the observed data, and then applying Bayes' rule to obtain the posterior distribution over the unknowns given the observations.
Given the posterior, we can compute various properties of the unknown variables: for example, their most likely joint configuration, or the marginal posterior distribution for each individual variable.
These two calculations are known as maximum a posteriori (MAP) inference and marginal inference respectively. An important difference is that MAP inference produces a point estimate while marginal inference retains information about uncertainty.
The Hammersley-Clifford theorem tells us that any positive joint distribution can be represented as a product of factors, each of which depends only on a subset of the variables.
Factorized representations can be very convenient as they expose structure in a model.
Factor graphs are the natural visual representation of this factorization and can be very useful for reasoning about problems
F. Dellaert, M. Kaess.
Foundations and Trends in Robotics, Vol 6(1-2), pp. 1--139. Now Publishers, Inc. 2017.
In the factor graphs in this article, circles and squares represent variable and factor nodes respectively, with edges connecting each factor to the variables it depends on.
An example of a simple factor graph is shown in the diagram on the right.
By explicitly representing the factors as nodes in the graph, factor graphs clearly emphasize the conditional independence structure of the problem: the lack of a factor directly connecting two variables means they are conditionally independent given all the other variables.
Factor graphs can also be presented as energy-based models
Y. LeCun, S. Chopra, R. Hadsell, M. Ranzato, F. Huang.
Predicting structured data, Vol 1(0). 2006.
In energy-based models, finding the most likely variable configuration is equivalent to minimizing the negative log probability, or equivalently the sum of the factor energies: $\mathbf{X}_{\text{MAP}} = \arg\min_{\mathbf{X}} \sum_s E_s(\mathbf{x}_s)$, where each factor is written as $f_s(\mathbf{x}_s) \propto \exp(-E_s(\mathbf{x}_s))$.
Belief propagation (BP) is an algorithm for marginal inference, i.e. it computes the marginal posterior distribution for each variable from the set of factors that make up the joint posterior. BP is intimately linked to factor graphs by the following property: BP can be implemented as iterative message passing on the posterior factor graph. The algorithm operates by iteratively updating a node's locally stored belief by sending and receiving messages from neighbouring nodes. Each iteration consists of 3 phases:
Belief Update: The variable node beliefs are updated by taking a product of the incoming messages from all adjacent factors, each of which represents that factor's belief on the receiving node's variables.
Factor-to-variable message: To send a message to an adjacent variable node, a factor aggregates messages from all other adjacent variable nodes and marginalizes over all the other nodes' variables to produce a message that expresses the factor's belief over the receiving node's variables.
Variable-to-factor message: A variable-to-factor message tells the factor what the belief of the variable would be if the receiving factor node did not exist. This is computed by taking the product of the messages the variable node has received from all other factor nodes.
These 3 operations fully define the algorithm and their equations are presented below.
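For reference, the standard sum-product forms of these three operations are as follows (the notation here is ours: $x_i$ denotes a variable, $f_s$ a factor, $\mathbf{x}_s$ the variables adjacent to factor $s$, $N(\cdot)$ the neighbours of a node and $m$ the messages; sums become integrals for continuous variables):

$$
\begin{aligned}
\text{Belief update:} \quad & b_i(x_i) \propto \prod_{s \in N(i)} m_{s \to i}(x_i) \\
\text{Factor-to-variable:} \quad & m_{s \to i}(x_i) = \sum_{\mathbf{x}_s \setminus x_i} f_s(\mathbf{x}_s) \prod_{j \in N(s) \setminus i} m_{j \to s}(x_j) \\
\text{Variable-to-factor:} \quad & m_{i \to s}(x_i) = \prod_{s' \in N(i) \setminus s} m_{s' \to i}(x_i)
\end{aligned}
$$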
Belief propagation was originally developed for graphs that are trees
J. Pearl. Morgan Kaufmann. 1988.
K.P. Murphy, Y. Weiss, M.I. Jordan.
Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence. 1999.
As BP was originally developed for trees, its application to loopy graphs was at first empirical
K.P. Murphy, Y. Weiss, M.I. Jordan.
Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence. 1999.
J.S. Yedidia, W.T. Freeman, Y. Weiss, others.
NIPS, Vol 13, pp. 689--695. 2000.
Y. Weiss, W.T. Freeman.
Neural Information Processing Systems (NeurIPS). 2000.
M.J. Wainwright, M.I. Jordan.
Now Publishers Inc. 2008.
J.S. Yedidia, W.T. Freeman, Y. Weiss, others.
NIPS, Vol 13, pp. 689--695. 2000.
As the Bethe free energy is non-convex, loopy BP is not guaranteed to converge and even when it does it may converge to the wrong marginals.
Empirically, however, BP generally converges to the true marginals, although for very loopy graphs it can fail
K.P. Murphy, Y. Weiss, M.I. Jordan.
Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence. 1999.
M.J. Wainwright, M.I. Jordan.
Now Publishers Inc. 2008.
Most interesting problems have loopy structures and so for the remainder of the article we will use BP to refer to loopy BP. So far, although we have outlined the BP equations, we have not specified the form of the factors, messages or beliefs. From here, we focus on Gaussian belief propagation which is a special form of continuous BP for Gaussian models.
We are interested in Gaussian models in which all factors and therefore the joint posterior are univariate / multivariate Gaussian distributions. Gaussians are a convenient choice for a number of reasons: (1) they accurately represent the distribution for many real world events
E.T. Jaynes.
Cambridge University Press. 2003.
A Gaussian factor, or in general any Gaussian distribution, can be written in exponential form using either of two parameterizations: the moments form, with mean $\mu$ and covariance $\Sigma$, or the canonical (information) form, with information vector $\eta = \Sigma^{-1} \mu$ and precision matrix $\Lambda = \Sigma^{-1}$.
The canonical form is often preferred when performing inference, for two main reasons. Firstly, taking a product is simple in the canonical form, so it is easy to form the posterior from the factors. Secondly, the precision matrix is sparse and relates closely to the structure of the factor graph.
The precision matrix describes direct associations or conditional dependence between variables.
If entry $(i, j)$ of the precision matrix is zero, then variables $i$ and $j$ are conditionally independent given all the other variables; non-zero off-diagonal entries appear only where a factor directly connects the two variables, so the sparsity pattern of the precision matrix mirrors the structure of the factor graph.
On the other hand, the covariance matrix describes induced correlations between variables and is dense as long as the graph is one single connected component. Unlike the canonical form, the moments form is unable to represent unconstrained distributions, as can be seen by selecting the unanchored preset graph in which there is only relative positional information. We encourage the reader to check out the other preset graphs and edit the graph to explore Gaussian models and the relationship with the canonical form.
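As a small numerical sketch of the two parameterizations (using NumPy; the values and variable names are our own illustration), here is the conversion between the moments and canonical forms, and a product of Gaussian factors, which in the canonical form reduces to adding parameters:

```python
import numpy as np

# Moments form: mean and covariance.
mu = np.array([1.0, 2.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 0.5]])

# Canonical (information) form: precision Lam = Sigma^-1 and information eta = Lam @ mu.
Lam = np.linalg.inv(Sigma)
eta = Lam @ mu

# The product of Gaussian factors is a sum of their canonical parameters,
# which is why the canonical form is convenient for assembling the posterior.
Lam_prior = 0.1 * np.eye(2)     # a weak zero-mean prior factor on the same variables
eta_prior = np.zeros(2)
Lam_post = Lam + Lam_prior
eta_post = eta + eta_prior

# Convert back to moments form to read off the mean and covariance of the product.
Sigma_post = np.linalg.inv(Lam_post)
mu_post = Sigma_post @ eta_post
print(mu_post, Sigma_post)
```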
The joint distribution corresponding to a factor graph in which all factors are Gaussian can be represented as a single multivariate Gaussian distribution (since the energy terms are additive) in the canonical form:
Marginal inference computes the per-variable marginal posterior distributions.
In the moments form, the marginal distribution of an individual variable can simply be read off from the joint distribution: its mean is the corresponding entry of $\mu$ and its covariance the corresponding block of $\Sigma$.
We can therefore summarize inference in Gaussian models as solving the linear system of equations $A x = b$, where $A = \Lambda$ is the precision matrix and $b = \eta$ is the information vector; the solution $\mu = A^{-1} b$ gives the posterior (and MAP) mean, and the diagonal of $A^{-1}$ gives the marginal variances.
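As a rough numerical illustration of this summary (the matrix below is our own toy example, not from the article), the posterior mean and the per-variable marginal variances can be read off from the assembled precision matrix and information vector:

```python
import numpy as np

# Precision matrix (A = Lambda) and information vector (b = eta) assembled from the factors.
A = np.array([[ 2.0, -1.0,  0.0],
              [-1.0,  2.0, -1.0],
              [ 0.0, -1.0,  2.0]])
b = np.array([1.0, 0.0, 1.0])

mu = np.linalg.solve(A, b)          # posterior / MAP mean: solve A mu = b
Sigma = np.linalg.inv(A)            # full covariance
marginal_vars = np.diag(Sigma)      # per-variable marginal variances
print(mu, marginal_vars)
```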
Having introduced Gaussian models, we now discuss Gaussian Belief Propagation (GBP), a form of BP applied to Gaussian models. Due to the closure properties of Gaussians, the beliefs and messages are also Gaussians and GBP operates by storing and passing around information vectors and precision matrices.
Unlike general loopy BP, GBP is guaranteed to compute the exact marginal means on convergence, although the same is unfortunately not true for the variances which often converge to the true marginal variances, but are sometimes overconfident for very loopy graphs
Y. Weiss, W.T. Freeman.
Neural Information Processing Systems (NeurIPS). 2000.
D. Bickson. 2008.
J. Du, S. Ma, Y. Wu, S. Kar, J.M. Moura.
arXiv preprint arXiv:1801.06430. 2018.
Q. Su, Y. Wu.
IEEE Transactions on Signal Processing, Vol 63(5), pp. 1144--1155. IEEE. 2015.
K.P. Murphy, Y. Weiss, M.I. Jordan.
Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence. 1999.
Q. Su, Y. Wu.
IEEE Transactions on Signal Processing, Vol 63(5), pp. 1144--1155. IEEE. 2015.
K.P. Murphy. MIT Press. 2012.
The interactive figure below aims to build intuition for GBP by exploring the effect of individual messages. For easy visualization and interpretation of the beliefs, we examine 3 spatial estimation problems with increasing "loopiness": a chain, a loop and a grid. Click on a variable node to send messages to its adjacent variables and observe how neighbouring beliefs are updated. You will see that GBP converges to the true marginals regardless of the order in which messages are passed.
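To complement the interactive figure, the sketch below is a minimal synchronous GBP implementation on a toy chain of scalar variables (the problem, values and variable names are our own, not taken from the article's accompanying library); it checks the recovered marginal means and variances against a direct solve of the underlying linear system:

```python
import numpy as np

# Minimal synchronous GBP on a chain of N scalar variables.
# Each variable i has a Gaussian prior factor N(prior_means[i], sigma_p^2) and each
# neighbouring pair has a zero-mean smoothness factor on (x_{i+1} - x_i) with std sigma_s.
# All beliefs and messages are held in canonical form (eta, lam).
N, sigma_p, sigma_s = 5, 1.0, 0.5
prior_means = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
prior_lam = np.full(N, 1.0 / sigma_p**2)
prior_eta = prior_lam * prior_means

J = np.array([-1.0, 1.0])                 # Jacobian of the residual x_{i+1} - x_i
f_lam = np.outer(J, J) / sigma_s**2       # 2x2 precision of each smoothness factor
f_eta = np.zeros(2)                       # zero information vector (zero-mean residual)

msg_eta = np.zeros((N - 1, 2))            # factor-to-variable messages, one per factor end
msg_lam = np.zeros((N - 1, 2))

def beliefs():
    """Belief update: product (sum in canonical form) of prior and incoming messages."""
    eta, lam = prior_eta.copy(), prior_lam.copy()
    for f in range(N - 1):
        for k, v in enumerate((f, f + 1)):
            eta[v] += msg_eta[f, k]
            lam[v] += msg_lam[f, k]
    return eta, lam

for _ in range(30):                       # synchronous iterations (any schedule would do)
    eta_b, lam_b = beliefs()
    new_eta, new_lam = np.zeros_like(msg_eta), np.zeros_like(msg_lam)
    for f in range(N - 1):
        ends = (f, f + 1)
        for k, v in enumerate(ends):      # factor f sends a message to variable v
            o = 1 - k                     # index of the other variable within the factor
            ov = ends[o]
            # Variable-to-factor message from the other variable:
            # its belief with this factor's previous message removed.
            in_eta = eta_b[ov] - msg_eta[f, o]
            in_lam = lam_b[ov] - msg_lam[f, o]
            # Condition the factor on that message, then marginalise out the other variable.
            eta, lam = f_eta.copy(), f_lam.copy()
            eta[o] += in_eta
            lam[o, o] += in_lam
            new_lam[f, k] = lam[k, k] - lam[k, o] * lam[o, k] / lam[o, o]
            new_eta[f, k] = eta[k] - lam[k, o] * eta[o] / lam[o, o]
    msg_eta, msg_lam = new_eta, new_lam

eta_b, lam_b = beliefs()
print("GBP means:       ", eta_b / lam_b)
print("GBP variances:   ", 1.0 / lam_b)

# Compare with directly solving the full linear system Lambda mu = eta.
Lam = np.diag(prior_lam)
for f in range(N - 1):
    Lam[np.ix_([f, f + 1], [f, f + 1])] += f_lam
print("Direct means:    ", np.linalg.solve(Lam, prior_eta))
print("Direct variances:", np.diag(np.linalg.inv(Lam)))
```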
We have introduced Gaussian Belief Propagation in its basic form as a probabilistic inference algorithm for Gaussian estimation problems. However, to solve real practical problems with GBP, we often need a number of extra details and tricks which we discuss in this section.
Although requiring all factors to be Gaussian is a convenient assumption, most interesting problems involve non-linear relationships and / or non-Gaussian data distributions, both of which result in non-Gaussian factors. GBP can be extended to handle these problems by linearizing the non-linear relationships and using covariance scaling to handle non-Gaussian data distributions.
A factor is usually created from some observed data together with a measurement function that predicts the observation from the adjacent variables.
For linear measurement functions, the resulting factor is exactly Gaussian in the variables.
If the measurement function is non-linear, the resulting factor is non-Gaussian; however, a Gaussian approximation can be recovered by linearizing the function about the current variable estimates, as is standard in many non-linear estimation frameworks
J. Czarnowski, T. Laidlow, R. Clark, A.J. Davison.
IEEE Robotics and Automation Letters, Vol 5(2), pp. 721--728. IEEE. 2020.
M. Mukadam, J. Dong, X. Yan, F. Dellaert, B. Boots.
The International Journal of Robotics Research, Vol 37(11), pp. 1319--1340. SAGE Publications Sage UK: London, England. 2018.
A.J. Davison, J. Ortiz.
arXiv preprint arXiv:1910.14139. 2019.
After linearization, the posterior is a Gaussian approximation of the true posterior and inference is performed by successively solving linearized versions of the underlying non-linear problem (as in non-linear least squares optimization).
To see how this linearization works, consider a robot moving in a plane that measures the 2D distance and angle to a landmark also in the plane, with the factor linearized about the current estimates of the robot and landmark positions.
The accuracy of the approximate Gaussian factor depends on how linear the measurement function is in the region around the linearization point, so as the variable estimates change significantly the factor should be relinearized
A. Ranganathan, M. Kaess, F. Dellaert.
Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI). 2007.
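As a concrete sketch of this linearization (the range-bearing function, noise values and linearization point below are illustrative assumptions of ours), a non-linear measurement factor can be turned into an approximate Gaussian factor in canonical form using the Jacobian of the measurement function:

```python
import numpy as np

# State x = [robot_x, robot_y, landmark_x, landmark_y]; the measurement is range and bearing.
def h(x):
    dx, dy = x[2] - x[0], x[3] - x[1]
    return np.array([np.hypot(dx, dy), np.arctan2(dy, dx)])

def jacobian(x, eps=1e-6):
    # Numerical Jacobian of h at x (an analytic Jacobian would normally be used).
    J = np.zeros((2, 4))
    for i in range(4):
        xp = x.copy()
        xp[i] += eps
        J[:, i] = (h(xp) - h(x)) / eps
    return J

x0 = np.array([0.0, 0.0, 2.0, 1.0])       # current estimates (linearization point)
d = np.array([2.3, 0.5])                  # observed range and bearing
Sigma_n = np.diag([0.1**2, 0.05**2])      # measurement noise covariance

# Gaussian approximation of the non-linear factor, in canonical form.
J = jacobian(x0)
Lam_n = np.linalg.inv(Sigma_n)
Lam_f = J.T @ Lam_n @ J                        # factor precision
eta_f = J.T @ Lam_n @ (J @ x0 + d - h(x0))     # factor information vector
print(Lam_f.shape, eta_f)
```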
A second cause of non-Gaussian factors is non-Gaussian data distributions.
We usually model observed data as coming from a Gaussian distribution; real data, however, often contains outliers that are poorly described by a Gaussian and can severely bias the estimate
E.T. Jaynes.
Cambridge University Press. 2003.
P. Agarwal, G.D. Tipaldi, L. Spinello, C. Stachniss, W. Burgard.
Proceedings of the IEEE International Conference on Robotics and Automation (ICRA). 2012.
A.J. Davison, J. Ortiz.
arXiv preprint arXiv:1910.14139. 2019.
Robust data distributions (or M-estimators), such as the Huber energy, reduce the influence of outliers by replacing the usual quadratic energy with a penalty that grows more slowly (for Huber, linearly) for large residuals
P.J. Huber.
The Annals of Mathematical Statistics, Vol 35(1), pp. 73--101. 1964.
P.J. Huber.
Wiley-Interscience. 1981.
Robust energy functions can make GBP much more generally useful - for example, they are crucial for bundle adjustment
J. Ortiz, M. Pupilli, S. Leutenegger, A. Davison.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2020.
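One common way to realise a robust factor with GBP, often referred to as covariance scaling, is to rescale the factor's canonical parameters so that its quadratic energy matches the robust energy at the current residual. The sketch below (our own code; the Huber transition point k is an assumed parameter) shows this weight for the Huber energy:

```python
def huber_scale(r, k):
    """Weight w such that the quadratic energy 0.5 * w * r**2 equals the Huber
    energy at whitened residual magnitude r."""
    r = abs(r)
    if r <= k:
        return 1.0                          # inlier region: ordinary Gaussian factor
    return (2 * k * r - k**2) / r**2        # outlier region: down-weight the factor

# Example: a measurement whose whitened residual magnitude is 4 standard deviations.
w = huber_scale(4.0, k=1.345)
# The robustified factor simply scales the canonical parameters before message passing:
#   Lam_f_robust = w * Lam_f
#   eta_f_robust = w * eta_f
print(w)
```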
The interactive figure below gives intuition for the effect of robust factors for the task of 1D line fitting.
The variables we are estimating are the heights of the surface at a number of fixed, evenly spaced horizontal locations, while the noisy height measurements (some of which are outliers) create the data factors.
So far, we have assumed that all variable and factor nodes broadcast messages at each iteration in a synchronous fashion, where all nodes absorb and broadcast messages in parallel. In fact, this is far from a requirement and as GBP is entirely local, messages can be sent arbitrarily and asynchronously.
It turns out that the message schedule can have a dramatic effect on the rate of convergence. For example, swapping synchronous updates for random message passing tends to improve convergence, while a fixed "round-robin" schedule can do even better
D. Koller, N. Friedman.
MIT Press. 2009.
T. Evans, N. Burgess.
Neural Information Processing Systems (NeurIPS), Vol 32. 2019.
G. Elidan, I. McGraw, D. Koller.
In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI). 2006.
C. Sutton, A. McCallum.
In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI). 2007.
A. Ranganathan, M. Kaess, F. Dellaert.
Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI). 2007.
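As a rough illustration of what these schedules mean in code (our own sketch, independent of any particular GBP implementation), each schedule simply determines which factors send messages at each step, while the message updates themselves are unchanged:

```python
import random

def synchronous(n_factors):
    """All factors send on every iteration."""
    while True:
        yield list(range(n_factors))

def sweep(n_factors):
    """Factors send one at a time, left to right and then back again."""
    while True:
        for f in list(range(n_factors)) + list(reversed(range(n_factors))):
            yield [f]

def random_schedule(n_factors, seed=0):
    """A single randomly chosen factor sends at each step."""
    rng = random.Random(seed)
    while True:
        yield [rng.randrange(n_factors)]

schedule = sweep(6)
print([next(schedule) for _ in range(8)])   # [[0], [1], [2], [3], [4], [5], [5], [4]]
```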
In the figure below, we explore message scheduling using the 1D line fitting task once again. The underlying factor graph is a chain (no loops) and so will converge after one sweep of messages from left to right and back again. You can send messages through the graph using the preset schedules (synchronous, random or sweep) or create your own schedule by clicking on a variable node to send messages outwards.
Playing around with different schedules for surface estimation highlights two important properties of GBP.
First, GBP can converge with an arbitrary message passing schedule.
As a consequence, GBP can readily operate in systems with no global clock and varying local compute budgets such as on neuromorphic hardware or between a group of distributed devices
B. Micusik, G. Evangelidis.
arXiv preprint arXiv:2010.02153. 2020.
The second property is that GBP can achieve approximate local convergence without global convergence.
Due to the factorized structure of GBP, processing can be focused on the parts of the graph where it is currently most needed, enabling an attention-driven style of message passing
P.U. Diehl, J. Martel, J. Buhmann, M. Cook.
Frontiers in Computational Neuroscience, Vol 12, pp. 54. Frontiers. 2018.
In the figure below we explore attention-driven message passing for image denoising
Propagating information from one node to another with GBP takes the same number of iterations as the number of hops between the nodes. For nearby nodes in a local region, information can be communicated in a small number of iterations and consensus can be reached quickly, while for distant nodes, a global consensus can take many more iterations to be established. This is an inherent property of local algorithms and can be summarized as: low-frequency errors decay more slowly than high-frequency errors.
Regular grid structured graphs appear a lot in computer vision (e.g. image segmentation) and in discretized boundary value problems (e.g. solving for the temperature profile along a rod).
Accelerating convergence in such grid graphs has been well-studied in the field of Multigrid methods
W.L. Briggs, V.E. Henson, S.F. McCormick.
SIAM. 2000.
P.F. Felzenszwalb, D.P. Huttenlocher.
International Journal of Computer Vision, Vol 70(1), pp. 41--54. Springer. 2006.
Multigrid methods can only be applied to graphs with a grid-like structure where it is possible to build equivalent coarsened representations.
In general, most problems are more unstructured and it is not clear how to build a coarsened or abstracted representation of the original problem.
In the general case, we see two possible ways to build hierarchy into a model. First, a network could be trained to directly predict specific abstractions that form long range connections when included in the graph. Second, the graph could contain additional constraints that define a generic hierarchical structure (much like a neural network) and then the abstractions themselves are also inferred
D. George, W. Lehrach, K. Kansky, M. Lazaro-Gredilla, C. Laan, B. Marthi, X. Lou, Z. Meng, Y. Liu, H. Wang, others.
Science, Vol 358(6368). American Association for the Advancement of Science. 2017.
Solving real non-linear problems with GBP is done by iteratively solving linearized Gaussian versions of the true non-linear problem.
This general pattern of successively solving linearized problems underpins many different non-linear inference methods.
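A minimal scalar sketch of this outer loop is given below (equivalent to Gauss-Newton in this one-variable toy; the exact inner solve stands in for running GBP to convergence on the linearized factor graph, and all values are our own illustration):

```python
# Estimate a scalar x from a non-linear measurement d = h(x) + noise with h(x) = x**2,
# by repeatedly linearizing the factor and exactly solving the linearized Gaussian model.
d, sigma_n = 4.0, 0.1       # observation of x**2 and measurement noise std
x = 1.0                     # initial estimate / first linearization point

for _ in range(10):
    h, J = x**2, 2 * x                              # prediction and Jacobian at x
    lam = J * (1.0 / sigma_n**2) * J                # linearized factor precision
    eta = J * (1.0 / sigma_n**2) * (J * x + d - h)  # linearized factor information
    x_new = eta / lam                               # solve the one-variable Gaussian model
    if abs(x_new - x) < 1e-9:
        break
    x = x_new                                       # relinearize about the new estimate

print(x)   # approaches 2.0, whose square matches the observation
```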
There are efficient and mature libraries for solving such non-linear least-squares problems, such as Ceres, GTSAM and g2o
S. Agarwal, K. Mierle, Others. Ceres Solver.
F. Dellaert. 2012.
R. Kummerle, G. Grisetti, H. Strasdat, K. Konolige, W. Burgard.
2011 IEEE International Conference on Robotics and Automation, pp. 3607--3613. 2011.
GBP is just one of many possible algorithms that can be used to solve the linearized Gaussian model.
To place GBP amongst other methods, we present an overview of a number of related methods for MAP and marginal inference for Gaussian models in the table below.
As a reminder, inference in Gaussian models is equivalent to solving the linear system $A x = b$.
With so many different inference methods, choosing which method to use can be a challenge in itself.
Judging the speed of each method is complex and depends on both the sparsity structure of $A$ and the available hardware.
Other notable candidate methods that are local, probabilistic and iterative include Expectation Propagation (EP)
T.P. Minka.
arXiv preprint arXiv:1301.2294. 2013.
T.D. Barfoot.
arXiv preprint arXiv:2010.08022. 2020.
We envisage that ML systems of the future will be large scale, heterogeneous and distributed, and as such will require flexible and scalable probabilistic inference algorithms. In this article, we argued that Gaussian Belief Propagation is a strong candidate algorithm as it is local, probabilistic, iterative and asynchronous. Additionally, we showed 1) how GBP becomes much more generally applicable with prescriptions for handling non-linear factors and robust energy functions, 2) how GBP can operate in an attention-driven fashion and 3) how hierarchical structure can help convergence. We hope that this visual introduction will encourage more researchers and practitioners to look into GBP as an alternative to existing inference algorithms.
We see many exciting directions for future research around GBP and provide a GBP library notebook as a starting point for the interested reader.
Some directions we are most excited about are improving theoretical guarantees, using learned factors
J. Czarnowski, T. Laidlow, R. Clark, A.J. Davison.
IEEE Robotics and Automation Letters, Vol 5(2), pp. 721--728. IEEE. 2020.
M. Mukadam, J. Dong, X. Yan, F. Dellaert, B. Boots.
The International Journal of Robotics Research, Vol 37(11), pp. 1319--1340. SAGE Publications Sage UK: London, England. 2018.
A. Opipari, C. Chen, S. Wang, J. Pavlasek, K. Desingh, O.C. Jenkins.
arXiv preprint arXiv:2101.05948. 2021.
V.G. Satorras, M. Welling.
International Conference on Artificial Intelligence and Statistics, pp. 685--693. 2021.
J. Kuck, S. Chakraborty, H. Tang, R. Luo, J. Song, A. Sabharwal, S. Ermon.
Neural Information Processing Systems (NeurIPS). 2020.
We encourage the interested reader to explore our supplementary GBP playground. The playground consists of two interactive diagrams based around 2D pose graphs. In the first, you can construct your own pose graph, set the initialization and then choose how messages are passed through the graph. The second is a simulation of a robot exploring a 2D environment with landmarks.
We are grateful to many researchers with whom we have discussed some of the ideas in this paper, especially from the Dyson Robotics Lab and Robot Vision Group at Imperial College London. We would particularly like to thank Xiaofan Mu, Raluca Scona, Riku Murai, Edgar Sucar, Seth Nabarro, Tristan Laidlow, Nanfeng Liu, Shuaifeng Zhi, Kentaro Wada and Stefan Leutenegger.
This standard derivation of the belief propagation equations follows Yedidia et al.
J.S. Yedidia, W.T. Freeman, Y. Weiss, others.
NIPS, Vol 13, pp. 689--695. 2000.
We begin by writing the posterior as a product of the factors: $p(\mathbf{X}) \propto \prod_s f_s(\mathbf{x}_s)$, where $\mathbf{x}_s$ denotes the subset of variables adjacent to factor $s$.
We first consider the form of the free energy for a tree. For tree graphs, the joint distribution can be written exactly in terms of its factor and variable marginals as $p(\mathbf{X}) = \prod_s p_s(\mathbf{x}_s) \prod_i p_i(x_i)^{1 - d_i}$, where $d_i$ is the number of factors adjacent to variable $i$.
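Substituting this factorized form into the variational free energy, with beliefs $b_s$ and $b_i$ in place of the exact marginals, gives the Bethe free energy (which on loopy graphs is an approximation); in its standard form, with $d_i$ the degree of variable $i$:

$$
F_{\text{Bethe}} = \sum_s \sum_{\mathbf{x}_s} b_s(\mathbf{x}_s) \left[ \log b_s(\mathbf{x}_s) - \log f_s(\mathbf{x}_s) \right] \;-\; \sum_i (d_i - 1) \sum_{x_i} b_i(x_i) \log b_i(x_i)
$$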
Belief propagation can be derived via minimization of the Bethe free energy subject to two constraints.
The first is a marginalization constraint: for every factor $s$ and adjacent variable $i$, the factor belief must marginalize to the variable belief, $\sum_{\mathbf{x}_s \setminus x_i} b_s(\mathbf{x}_s) = b_i(x_i)$. The second is that all beliefs must be normalized.
We now choose the Lagrange multiplier to be the log of the variable-to-factor message, $\lambda_{si}(x_i) = \log m_{i \to s}(x_i) = \log \prod_{s' \in N(i) \setminus s} m_{s' \to i}(x_i)$.
Using the marginalization condition, we can derive an equation for the messages in terms of other messages and produce the factor-to-variable message equation: $m_{s \to i}(x_i) \propto \sum_{\mathbf{x}_s \setminus x_i} f_s(\mathbf{x}_s) \prod_{j \in N(s) \setminus i} m_{j \to s}(x_j)$.
This result tells us that the fixed-points of loopy belief propagation are local stationary points of the Bethe free energy and because the Bethe energy is bounded from below, BP always has a fixed point.
BP variants have been developed using more accurate or convex approximations of the free energy
J.S. Yedidia, W.T. Freeman, Y. Weiss, others.
NIPS, Vol 13, pp. 689--695. 2000.
M.J. Wainwright, M.I. Jordan.
Now Publishers Inc. 2008.
If you see mistakes or want to suggest changes, please create an issue on GitHub.
Diagrams and text are licensed under Creative Commons Attribution CC-BY 4.0 with the source available on GitHub, unless noted otherwise.
For attribution in academic contexts, please cite this work as
J. Ortiz, T. Evans, A. Davison. A visual introduction to Gaussian Belief Propagation, 2021.
BibTeX citation
@article{Ortiz2021visualGBP, title = {A visual introduction to Gaussian Belief Propagation}, author = {Ortiz, Joseph and Evans, Talfan and Davison, Andrew J.}, journal={arXiv preprint arXiv:2107.02308}, year = {2021}, }