likelihood-free approximate Gibbs sampling

Posted in Books, Statistics with tags , , , , , , , , on June 19, 2019 by xi'an

“Low-dimensional regression-based models are constructed for each of these conditional distributions using synthetic (simulated) parameter value and summary statistic pairs, which then permit approximate Gibbs update steps (…) synthetic datasets are not generated during each sampler iteration, thereby providing efficiencies for expensive simulator models, and only require sufficient synthetic datasets to adequately construct the full conditional models (…) Construction of the approximate conditional distributions can exploit known structures of the high-dimensional posterior, where available, to considerably reduce computational overheads”

Guilherme Souza Rodrigues, David Nott, and Scott Sisson have just arXived a paper on approximate Gibbs sampling. Since this comes a few days after we posted our own version, here are some of the differences I could spot in the paper:

1. Further references to earlier occurrences of Gibbs versions of ABC, esp. in cases when the likelihood function factorises into components and allows for summaries with lower dimensions. And even to ESP.
2. More an ABC version of Gibbs sampling that a Gibbs version of ABC in that approximations to the conditionals are first constructed and then used with no further corrections.
3. Inherently related to regression post-processing à la Beaumont et al.  (2002) in that the regression model is the start to designing an approximate full conditional, conditional on the “other” parameters and on the overall summary statistic. The construction of the approximation is far from automated. And may involve neural networks or other machine learning estimates.
4. As a consequence of the above, a preliminary ABC step to design the collection of approximate full conditionals using a single and all-purpose multidimensional summary statistic.
5. Once the approximations constructed, no further pseudo-data is generated.
6. Drawing from the approximate full conditionals is done exactly, possibly via a bootstrapped version.
7. Handling a highly complex g-and-k dynamic model with 13,140 unknown parameters, requiring a ten days simulation.

“In certain circumstances it can be seen that the likelihood-free approximate Gibbs sampler will exactly target the true partial posterior (…) In this case, then Algorithms 2 and 3 will be exact.”

Convergence and coherence are handled in the paper by setting the algorithm(s) as noisy Monte Carlo versions, à la Alquier et al., although the issue of incompatibility between the full conditionals is acknowledged, with the main reference being the finite state space analysis of Chen and Ip (2015). It thus remains unclear whether or not the Gibbs samplers that are implemented there do converge and if they do what is the significance of the stationary distribution.

selecting summary statistics [a tale of two distances]

Posted in Books, Statistics with tags , , , , , , , , , , , , , , on May 23, 2019 by xi'an As Jonathan Harrison came to give a seminar in Warwick [which I could not attend], it made me aware of his paper with Ruth Baker on the selection of summaries in ABC. The setting is an ABC-SMC algorithm and it relates with Fearnhead and Prangle (2012), Barnes et al. (2012), our own random forest approach, the neural network version of Papamakarios and Murray (2016), and others. The notion here is to seek the optimal weights of different summary statistics in the tolerance distance, towards a maximization of a distance (Hellinger) between prior and ABC posterior (Wasserstein also comes to mind!). A sort of dual of the least informative prior. Estimated by a k-nearest neighbour version [based on samples from the prior and from the ABC posterior] I had never seen before. I first did not get how this k-nearest neighbour distance could be optimised in the weights since the posterior sample was already generated and (SMC) weighted, but the ABC sample can be modified by changing the [tolerance] distance weights and the resulting Hellinger distance optimised this way. (There are two distances involved, in case the above description is too murky!)

“We successfully obtain an informative unbiased posterior.”

The paper spends a significant while in demonstrating that the k-nearest neighbour estimator converges and much less on the optimisation procedure itself, which seems like a real challenge to me when facing a large number of particles and a high enough dimension (in the number of statistics). (In the examples, the size of the summary is 1 (where does the weight matter?), 32, 96, 64, with 5 10⁴, 5 10⁴, 5 10³ and…10 particles, respectively.) The authors address the issue, though, albeit briefly, by mentioning that, for the same overall computation time, the adaptive weight ABC is indeed further from the prior than a regular ABC with uniform weights [rather than weighted by the precisions]. They also argue that down-weighting some components is akin to selecting a subset of summaries, but I beg to disagree with this statement as the weights are never exactly zero, as far as I can see, hence failing to fight the curse of dimensionality. Some LASSO version could implement this feature.

adaptive copulas for ABC

Posted in Statistics with tags , , , , , , , , on March 20, 2019 by xi'an A paper on ABC I read on my way back from Cambodia:  Yanzhi Chen and Michael Gutmann arXived an ABC [in Edinburgh] paper on learning the target via Gaussian copulas, to be presented at AISTATS this year (in Okinawa!). Linking post-processing (regression) ABC and sequential ABC. The drawback in the regression approach is that the correction often relies on an homogeneity assumption on the distribution of the noise or residual since this approach only applies a drift to the original simulated sample. Their method is based on two stages, a coarse-grained one where the posterior is approximated by ordinary linear regression ABC. And a fine-grained one, which uses the above coarse Gaussian version as a proposal and returns a Gaussian copula estimate of the posterior. This proposal is somewhat similar to the neural network approach of Papamakarios and Murray (2016). And to the Gaussian copula version of Li et al. (2017). The major difference being the presence of two stages. The new method is compared with other ABC proposals at a fixed simulation cost, which does not account for the construction costs, although they should be relatively negligible. To compare these ABC avatars, the authors use a symmetrised Kullback-Leibler divergence I had not met previously, requiring a massive numerical integration (although this is not an issue for the practical implementation of the method, which only calls for the construction of the neural network(s)). Note also that sequential ABC is only run for two iterations, and also that none of the importance sampling ABC versions of Fearnhead and Prangle (2012) and of Li and Fearnhead (2018) are considered, all versions relying on the same vector of summary statistics with a dimension much larger than the dimension of the parameter. Except in our MA(2) example, where regression does as well. I wonder at the impact of the dimension of the summary statistic on the performances of the neural network, i.e., whether or not it is able to manage the curse of dimensionality by ignoring all but essentially the data  statistics in the optimisation.

Bayesian intelligence in Warwick

Posted in pictures, Statistics, Travel, University life, Wines with tags , , , , , , , , , , , , on February 18, 2019 by xi'an This is an announcement for an exciting CRiSM Day in Warwick on 20 March 2019: with speakers

10:00-11:00 Xiao-Li Meng (Harvard): “Artificial Bayesian Monte Carlo Integration: A Practical Resolution to the Bayesian (Normalizing Constant) Paradox”

11:00-12:00 Julien Stoehr (Dauphine): “Gibbs sampling and ABC”

14:00-15:00 Arthur Ulysse Jacot-Guillarmod (École Polytechnique Fedérale de Lausanne): “Neural Tangent Kernel: Convergence and Generalization of Deep Neural Networks”

15:00-16:00 Antonietta Mira (Università della Svizzera italiana e Università degli studi dell’Insubria): “Bayesian identifications of the data intrinsic dimensions”

[whose abstracts are on the workshop webpage] and free attendance. The title for the workshop mentions Bayesian Intelligence: this obviously includes human intelligence and not just AI!

a pen for ABC

Posted in Books, pictures, Statistics, Travel, University life with tags , , , , , , , , , on February 13, 2019 by xi'an Among the flury of papers arXived around the ICML 2019 deadline, I read on my way back from Oxford a paper by Wiqvist et al. on learning summary statistics for ABC by neural nets. Pointing out at another recent paper by Jiang et al. (2017, Statistica Sinica) which constructed a neural network for predicting each component of the parameter vector based on the input (raw) data, as an automated non-parametric regression of sorts. Creel (2017) does the same but with summary statistics. The current paper builds up from Jiang et al. (2017), by adding the constraint that exchangeability and partial exchangeability features should be reflected by the neural net prediction function. With applications to Markovian models. Due to a factorisation theorem for d-block invariant models, the authors impose partial exchangeability for order d Markov models by combining two neural networks that end up satisfying this factorisation. The concept is exemplified for one-dimension g-and-k distributions, alpha-stable distributions, both of which are made of independent observations, and the AR(2) and MA(2) models, as in our 2012 ABC survey paper. Since the later is not Markovian the authors experiment with different orders and reach the conclusion that an order of 10 is most appropriate, although this may be impacted by being a ble to handle the true likelihood.

the \$1,547.02 book on neural networks

Posted in Statistics with tags , , , , , on February 7, 2019 by xi'an

information maximising neural networks summaries

Posted in pictures, Statistics with tags , , , , , , , , on February 6, 2019 by xi'an After missing the blood moon eclipse last night, I had a meeting today at the Paris observatory (IAP), where we discussed an ABC proposal made by Tom Charnock, Guilhem Lavaux, and Benjamin Wandelt from this institute.

“We introduce a simulation-based machine learning technique that trains artificial neural networks to find non-linear functionals of data that maximise Fisher information : information maximising neural networks.” T. Charnock et al., 2018
The paper is centred on the determination of “optimal” summary statistics. With the goal of finding “transformation which maps the data to compressed summaries whilst conserving Fisher information [of the original data]”. Which sounds like looking for an efficient summary and hence impossible in non-exponential cases. As seen from the description in (2.1), the assumed distribution of the summary is Normal, with mean μ(θ) and covariance matrix C(θ) that are implicit transforms of the parameter θ. In that respect, the approach looks similar to the synthetic likelihood proposal of Wood (2010). From which an unusual form of Fisher information can be derived, as μ(θ)’C(θ)⁻¹μ(θ)… A neural net is trained to optimise this information criterion at a given (so-called fiducial) value of θ, in terms of a set of summaries of the same dimension as the data. Which means the information contained in the whole data (likelihood) is not necessarily recovered, linking with this comment from Edward Ionides (in a set of lectures at Wharton).
“Even summary statistics derived by careful scientific or statistical reasoning have been found surprisingly uninformative compared to the whole data likelihood in both scientific investigations (Shrestha et al., 2011) and simulation experiments (Fasiolo et al., 2016)” E. Ionides, slides, 2017
The maximal Fisher information obtained in this manner is then used in a subsequent ABC step as the natural metric for the distance between the observed and simulated data. (Begging the question as to why being maximal is necessarily optimal.) Another question is about the choice of the fiducial parameter, which choice should be tested by for instance iterating the algorithm a few steps. But having to run simulations for a single value of the parameter is certainly a great selling point!