Archive for neural network

generalizing Hamiltonian Monte Carlo with neural networks

Posted in Statistics on April 25, 2018 by xi'an

Daniel Levy, Matthew Hoffman, and Jascha Sohl-Dickstein pointed out to me a recent paper of theirs, submitted to and accepted by ICLR 2018, with the above title. This allowed me to discover the open handling of paper reviews at ICLR, which I find quite convincing, except for the absence of MathJax or another medium for rendering LaTeX formulas. And which provides a collection of comments besides mine. (Disclaimer: I was not involved in the processing of this paper for ICLR!)

“Ultimately our goal (and that of HMC) is to produce a proposal that mixes efficiently, not to simulate Hamiltonian dynamics accurately.”

The starting concept is the same as GANs (generative adversarial networks) discussed here a few weeks ago. Complemented by a new HMC that also uses deep neural networks to represent the HMC trajectory. (Also seen in earlier papers by e.g. Strathmann.) The novelty in the HMC seems to be a binary direction indicator on top of the velocity. The leapfrog integrator is also modified, with a location-scale generalisation for the velocity and a half-half location-scale move for the original target x. The functions appearing in these location-scale steps are learned by neural nets, towards minimising the lag-one auto-correlation, plus an extra penalty for not moving enough. Reflecting on the recent MCMC literature, and in particular on the presentations at BayesComp last month, judging from comments of participants, this inclusion of neural tools in the tuning of MCMC algorithms sounds like a steady trend in the community. I am slightly at a loss about the adaptive aspects of the trend with regard to the Markovianity of the outcome.
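To fix ideas, here is a minimal and purely illustrative Python sketch of what such a learned location-scale leapfrog step could look like; the functions S_v, Q_v, T_v, S_x, Q_x, T_x are placeholders of my own for the neural networks, and the whole construction is a schematic reading, not the authors' exact operator.

import numpy as np

# Purely illustrative: a leapfrog step where the usual updates are rescaled
# and translated by learned functions (placeholders S_v, Q_v, T_v for the
# velocity, S_x, Q_x, T_x for the position); eps is the step size and mask
# a 0/1 vector selecting which half of the coordinates moves first.
def learned_leapfrog(x, v, grad_U, eps, S_v, Q_v, T_v, S_x, Q_x, T_x, mask):
    g = grad_U(x)
    # half step on the velocity: location-scale version of v - eps/2 * grad_U(x)
    v = v * np.exp(0.5 * eps * S_v(x, g)) - 0.5 * eps * (g * np.exp(eps * Q_v(x, g)) + T_v(x, g))
    # full step on the position, one half of the coordinates at a time
    for m in (mask, 1 - mask):
        x = (1 - m) * x + m * (x * np.exp(eps * S_x(v)) + eps * (v * np.exp(eps * Q_x(v)) + T_x(v)))
    g = grad_U(x)
    # second half step on the velocity
    v = v * np.exp(0.5 * eps * S_v(x, g)) - 0.5 * eps * (g * np.exp(eps * Q_v(x, g)) + T_v(x, g))
    return x, v

In this form the map remains invertible and its log-Jacobian is simply the sum of the scaling terms over the updated coordinates, which is what makes the Metropolis-Hastings correction discussed below tractable.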

“To compute the Metropolis-Hastings acceptance probability for a deterministic transition, the operator must be invertible and have a tractable Jacobian.”

A remark (above) that seems to date back at least to Peter Green's reversible jump, duly mentioned in the paper. When reading about the performances of this new learning HMC, I could not see where the learning steps for the parameters of the leapfrog operators were accounted for, although the authors mention an identical number of gradient computations (which I take to mean the same thing). One evaluation of this method against earlier ones (Fig. 2) tracks successive values of the likelihood, which may be intuitive enough but does not necessarily certify convergence to the right region, since the posterior may concentrate away from the likelihood maximum.
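For the record, and in my own notation rather than the paper's, the acceptance probability alluded to in the above quote, for a deterministic and invertible move x ↦ f(x) made reversible by, e.g., the direction indicator, is of the generic form

\alpha(x) = \min\left\{ 1,\ \frac{\pi\big(f(x)\big)}{\pi(x)}\,\left|\det \frac{\partial f(x)}{\partial x}\right| \right\}

which reduces to the standard Metropolis-Hastings ratio when f is volume preserving, as is the case for the usual leapfrog integrator.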

ABCDE for approximate Bayesian conditional density estimation

Posted in Books, pictures, Statistics, Travel, University life on February 26, 2018 by xi'an

Another arXived paper I surprisingly (?) missed, by George Papamakarios and Iain Murray, on an ABCDE (my acronym!) substitute to ABC for generative models. The paper was reviewed [with reviews made available!] and accepted by NIPS 2016. (Most obviously, I was not one of the reviewers!)

“Conventional ABC algorithms such as the above suffer from three drawbacks. First, they only represent the parameter posterior as a set of (possibly weighted or correlated) samples [for which] it is not obvious how to perform some other computations using samples, such as combining posteriors from two separate analyses. Second, the parameter samples do not come from the correct Bayesian posterior (…) Third, as the ε-tolerance is reduced, it can become impractical to simulate the model enough times to match the observed data even once [when] simulations are expensive to perform”

The above criticisms are a wee bit overly harsh since, well…, Monte Carlo approximations remain a solution worth considering for all Bayesian purposes! And the approximation in ABC [replacing the data with a ball around it] is here traded for an approximation of the true posterior as a mixture, both requiring repeated [and likely expensive] simulations. The alternative consists in iteratively simulating from pseudo-predictives towards learning better pseudo-posteriors, which are then used as proposals at the next iteration, modulo an importance sampling correction. The approximation to the posterior chosen therein is a mixture density network, namely a mixture distribution whose parameters are produced by a neural network taking the simulated pseudo-observations as input. Which the authors claim [p.4] requires no tuning. (Still, there are several aspects to tune, from the number of components to the hyper-parameter λ [p.11, eqn (35)], to the structure of the neural network [20 tanh? 50 tanh?], to the number of iterations, to the amount of X checking. As usual in NIPS papers, it is difficult to assess how arbitrary the choices made in the experiments are. Unless one starts experimenting with the code provided.) All in all, I find the paper nonetheless exciting enough (!) to now start a summer student project on it at Dauphine, hoping to check the performances of ABCDE on different models, as well as to compare this ABC implementation with a synthetic likelihood version.
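For what it is worth, here is a schematic rendering of this iterative scheme in Python; sample_prior, prior_pdf, simulate and fit_mdn are hypothetical helpers to be supplied by the reader, and the loop reflects my reading of the procedure (importance weights included), not the authors' code.

import numpy as np

# Schematic ABCDE loop: simulate parameters from the current proposal,
# simulate pseudo-data, fit a mixture density network q(theta | x) with
# importance weights prior/proposal, and condition on the observed data
# to obtain the next proposal.
def abcde(x_obs, sample_prior, prior_pdf, simulate, fit_mdn, n_rounds=3, n_sims=1000):
    proposal_sample, proposal_pdf = sample_prior, prior_pdf
    for _ in range(n_rounds):
        thetas = np.array([proposal_sample() for _ in range(n_sims)])
        xs = np.array([simulate(t) for t in thetas])
        # weights correct for drawing from the proposal rather than the prior
        ws = np.array([prior_pdf(t) / proposal_pdf(t) for t in thetas])
        q = fit_mdn(thetas, xs, weights=ws)      # mixture density network
        posterior = q.condition(x_obs)           # mixture in theta given x_obs
        proposal_sample, proposal_pdf = posterior.sample, posterior.pdf
    return posterior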

 As an addendum, let me point out the very pertinent analysis of this paper by Dennis Prangle, 18 months ago!

machine learning à l’Académie, au Collège, et dans Le Monde

Posted in Books, Statistics, University life on January 5, 2018 by xi'an

A back-cover story in Le Monde “Sciences & Médecine” on Stéphane Mallat, professor at École Normale and recently elected to the (French) Academy of Sciences and to the Collège de France, where he holds a newly created Chair of Data Sciences. With works on wavelets, image compression, and neural networks, Stéphane Mallat will give his first lesson on Data Sciences at Collège de France, downtown Paris, on January 11. Entrance is free and open to everyone. (Collège de France is a unique institution, created by Guillaume Budé and supported by François Ier in 1530 to teach topics not taught (then) at the Sorbonne, as indicated by its motto Docet Omnia, including mathematics! Professors are nominated by the current faculty and the professor closest to statistics, prior to Stéphane Mallat, was Edmond Malinvaud.)

Nature snapshots [and snide shots]

Posted in Books, pictures, Statistics, Travel, University life on October 12, 2017 by xi'an

A very rich issue of Nature that I received [late] just before leaving for Warwick, with a series of reviews on quantum computing presenting machine learning as the most likely immediate application of this new type of computing. Also including irate letters and an embarrassed correction of an editorial published the week before, reflecting on the need (or lack thereof) to remove or augment statues of scientists whose methods were unethical, even when eventually producing long-lasting advances. (Like the 19th Century gynecologist J. Marion Sims experimenting on female slaves.) And a review of a book on the fascinating topic of Chinese typewriters. And this picture above of a flooded playground that looks like a piece of abstract art thanks to the muddy background.

“Quantum mechanics is well known to produce atypical patterns in data. Classical machine learning methods such as deep neural networks frequently have the feature that they can both recognize statistical patterns in data and produce data that possess the same statistical patterns: they recognize the patterns that they produce. This observation suggests the following hope. If small quantum information processors can produce statistical patterns that are computationally difficult for a classical computer to produce, then perhaps they can also recognize patterns that are equally difficult to recognize classically.” Jacob Biamonte et al., Nature, 14 Sept 2017

One of the review papers on quantum computing is about quantum machine learning. Although like Jon Snow I know nothing about this, I find it rather dull as it spends most of its space on explaining existing methods like PCA and support vector machines. Rather than exploring potential paradigm shifts offered by the exotic nature of quantum computing. Like moving to a Bayesian logic that mimics a whole posterior rather than produces estimates or model probabilities. And away from linear representations. (The paper mentions an O(√N) speedup for Bayesian inference in a table, but does not tell more, which may thus be only about MAP estimators for all I know.) I also disagree with the brave new World tone of the above quote or misunderstand its meaning. Since atypical and statistical cannot but clash, “universal deep quantum learners may recognize and classify patterns that classical computers cannot” does not have a proper meaning. The paper contains a vignette about quantum Boltzmann machines that finds a minimum entropy approximation to a four-state distribution, with comments that seem to indicate an ability to simulate from this system.

fast ε-free ABC

Posted in Books, Mountains, pictures, Running, Statistics, Travel, University life on June 8, 2017 by xi'an

Last Fall, George Papamakarios and Iain Murray from Edinburgh arXived an ABC paper on fast ε-free inference on simulation models with Bayesian conditional density estimation, a paper that I had missed. The idea there is to approximate the posterior density by maximising the likelihood associated with a parameterised family of distributions on θ, conditional on the associated x, the data being then the ABC reference table. The family chosen there is a mixture of K Gaussian components, whose parameters are estimated by a (Bayesian) neural network using x as input and θ as output. The parameter values are simulated from an adaptive proposal that aims at approximating the posterior better and better. As in population Monte Carlo, actually. Except for the neural network part, where I fail to understand why it brings a significant improvement over EM solutions. The overall difficulty with this approach is that I do not see a way out of the curse of dimensionality: when the dimension of θ increases, the approximation to the posterior distribution of θ deteriorates, even in the best of cases, as for any other non-parametric resolution. It would have been of (further) interest to see a comparison with a most rudimentary approach, namely the one we proposed based on empirical likelihoods.
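As a bare-bones illustration of the mixture density network construction (with made-up weight matrices W1, b1, W2, b2 and a scalar θ, so nothing more than a sketch), a single hidden layer maps x to the weights, means and variances of the K Gaussian components:

import numpy as np

# Toy mixture density network: x -> (weights, means, variances) of a
# K-component Gaussian mixture over a scalar theta. W1, b1, W2, b2 are
# placeholders, to be trained by maximising the log-density of the
# simulated (theta, x) pairs.
def mdn_params(x, W1, b1, W2, b2, K):
    h = np.tanh(W1 @ x + b1)                  # hidden layer (20 tanh? 50 tanh?)
    out = W2 @ h + b2                         # 3K outputs
    logits, means, log_vars = out[:K], out[K:2 * K], out[2 * K:]
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                  # softmax mixture weights
    return weights, means, np.exp(log_vars)

def mdn_log_density(theta, weights, means, variances):
    comps = weights * np.exp(-(theta - means) ** 2 / (2 * variances)) / np.sqrt(2 * np.pi * variances)
    return np.log(comps.sum())

Training then amounts to maximising the sum of mdn_log_density over the reference table of simulated (θ, x) pairs, reweighted when the parameters come from an adaptive proposal rather than from the prior.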

machine learning and the future of realism

Posted in Books, Kids, Statistics, University life on May 4, 2017 by xi'an

Giles and Cliff Hooker arXived a paper last week with this intriguing title. (Giles Hooker is an associate professor of statistics and biology at Cornell U, with an interesting blog on the notion of models, while Cliff Hooker is a professor of philosophy at Newcastle U, Australia.)

“Our conclusion is that simplicity is too complex”

The debate in this short paper is whether or not machine learning relates to a model. Or is it concerned with sheer (“naked”) prediction? And then does it pertain to science any longer?! While it sounds obvious at first, defining why science is more than prediction of effects given causes is much less obvious, although prediction sounds more pragmatic and engineer-like than scientific. (Furthermore, prediction has a somewhat negative flavour in French, being used as a synonym for divination and opposed to prévision.) In more philosophical terms, prediction offers no ontological feature. As for a machine learning structure like a neural network being scientific or a-scientific, its black-box nature makes it much more the latter than the former, in that it brings no explanation for the connection between input and output, between regressed and regressors. It further lacks the potential for universality of scientific models. For instance, as mentioned in the paper, Newton's law of gravitation applies to any pair of bodies with mass, while a neural network built on a series of observations could not be assessed or guaranteed outside the domain where those observations are taken. Plus, it would miss the simple inverse-square law established by Newton. Most fascinating questions, undoubtedly! Putting the stress on models from a totally different perspective from last week at the RSS.
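For the record, the universal law in question,

F = G\,\frac{m_1 m_2}{r^2},

holds for any pair of masses at any separation, a guarantee that no network trained on a finite range of configurations can offer.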

As for machine learning being a challenge to realism, I am none the wiser after reading the paper. Utilising machine learning tools to produce predictions of causes given effects does not seem to modify the structure of the World, and only marginally our understanding of it, since these tools bring no explanation per se. What would lead to anti-realism is the adoption of those tools as substitutes for scientific theories and models.

Bayesian program synthesis

Posted in Books, pictures, Statistics, University life on April 7, 2017 by xi'an

Last week, I (along with Jean-Michel Marin) got an email from a journalist working for Science & Vie, a French popular science magazine that published a special issue on Bayes' theorem a few years ago. (With the insane title of “the formula that deciphers the World!”) The reason for this call was the preparation of a paper on Gamalon, a new AI company that relies on (Bayesian) probabilistic programming to devise predictive tools. I then spent an hour skyping with him about Bayesian inference, probabilistic programming and machine learning, at a general level, since we had not previously heard of this company or of its central tool.

“the Gamalon BPS system learns from only a few examples, not millions. It can learn using a tablet processor, not hundreds of servers. It learns right away while we play with it, not over weeks or months. And it learns from just one person, not from thousands.”

Gamalon claims to do much better than deep learning at those tasks. Not that I have reasons to doubt that claim, quite the opposite, an obvious reason being that incorporating rules and probabilistic models in the predictor is going to help if these rules and models are even moderately realistic, another major one being that handling uncertainty and learning by Bayesian tools is usually a good idea (!), and yet another significant one being that David Blei is a member of their advisory committee. But it is hard to get a feeling for such claims when the only element in the open is the use of probabilistic programming, which is an advanced and efficient manner of conducting model building and updating and of handling (posterior) distributions as objects, but which does not enjoy higher predictive abilities by default. Unless I live with a restricted definition of what probabilistic programming stands for! In any case, the video provided by Gamalon and the presentation given by its CEO do not help in my understanding of the principles behind this massive gain in efficiency. Which makes sense given that the company would not want to give away their edge over the competition.
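To make the “distributions as objects” remark slightly more concrete, here is a toy Python sketch of a posterior handled as an object and updated one observation at a time; it is a hand-rolled illustration of the abstraction that probabilistic programming languages automate, obviously not a description of Gamalon's (undisclosed) system.

import numpy as np

# Toy "posterior as an object": a distribution over a Bernoulli parameter,
# represented on a grid, updated observation by observation, and queried
# for summaries.
class GridPosterior:
    def __init__(self, grid=None):
        self.grid = grid if grid is not None else np.linspace(0.001, 0.999, 999)
        self.weights = np.ones_like(self.grid) / len(self.grid)   # uniform prior

    def update(self, obs):                    # obs is 0 or 1
        lik = self.grid if obs == 1 else 1.0 - self.grid
        self.weights *= lik                   # Bayes update
        self.weights /= self.weights.sum()

    def mean(self):
        return float(np.sum(self.grid * self.weights))

post = GridPosterior()
for obs in [1, 0, 1, 1]:
    post.update(obs)
print(post.mean())   # posterior mean after four observations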

Incidentally, the video in this presentation comparing the predictive abilities of the four major astronomical explanations of the solar system is great. If not particularly connected with the difference between deep learning and Bayesian probabilistic programming.