Archive for mixture of experts

Bayesian composite likelihood

Posted in Books, Statistics, University life on February 11, 2016 by xi'an

“…the pre-determined weights assigned to the different associations between observed and unobserved values represent strong a priori knowledge regarding the informativeness of clues. A poor choice of weights will inevitably result in a poor approximation to the “true” Bayesian posterior…”

Last Xmas, Alexis Roche arXived a paper on Bayesian inference via composite likelihood. I find the paper quite interesting in that [and only in that] it defends the innovative notion of writing a composite likelihood as a pool of opinions about some features of the data. Recall that each term in the composite likelihood is a marginal likelihood for some projection z=f(y) of the data y. As in ABC settings, it is however rare to derive closed-form expressions for those marginals. The composite likelihood is parameterised by powers of those components, each component being associated with an expert whose weight reflects its importance. The sum of the powers is constrained to be equal to one, even though I do not understand why the dimensions of the projections play no role in this constraint. Simplicity is advanced as an argument, which sounds rather weak… Even though this may be infeasible in any realistic problem, it would be more coherent to see the weights as producing the best Kullback approximation to the true posterior. Or to use a prior on the weights and estimate them along with the parameter θ. The former could be incorporated into the latter following the approach of Holmes & Walker (2013). While the ensuing discussion is most interesting, it falls short of connecting the different components in terms of the (joint) information they bring about the parameters, especially since the weights are assumed to be given rather than inferred, and even more so when they depend on θ. I also wonder why the variational Bayes interpretation is not exploited any further. And I see no clear way to exploit this perspective in an ABC environment.
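To fix notation (mine, not necessarily the paper's), the composite likelihood at stake is of the weighted-product form

\[
L_c(\theta \mid y) = \prod_{k=1}^{K} p(z_k \mid \theta)^{w_k},
\qquad z_k = f_k(y), \quad w_k \ge 0, \quad \sum_{k=1}^{K} w_k = 1,
\]

where each p(z_k|θ) is the marginal likelihood of the projection z_k and w_k is the power attached to the corresponding expert, the simplex constraint on the w_k's being the one I question above.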

ABC via regression density estimation

Posted in Statistics on December 19, 2012 by xi'an

Fan, Nott, and Sisson recently posted on arXiv a paper entitled Approximate Bayesian computation via regression density estimation. The theme of the paper is that one could take advantage of the joint simulation of the pair parameter/sample to derive a non-parametric estimate of the conditional distribution of the summary statistic given the parameter, i.e. the sampling distribution. While most if not all regular ABC algorithms implicitly or explicitly rely on some level of non-parametric estimation, from Beaumont et al.'s (2002) non-parametric regression to Blum and François's (2010), Fearnhead and Prangle's (2012), and Biau et al.'s (2012) direct derivations of non-parametric convergence rates for the kernel bandwidths, this paper centres on the idea of using those same simulations ABC relies upon to build an estimate of the sampling distribution, to be used afterwards as the likelihood in either Bayesian or frequentist inference. Rather than relying on traditional kernel estimates, the adopted approach merges mixtures of experts, namely normal regression mixtures with logit weights (Jordan and Jacobs, 1994), for the marginals, with a copula representation of the joint distribution (of the summary statistics).
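To make the generic idea more concrete, here is a minimal sketch (in Python, with NumPy) that replaces the authors' mixture-of-experts plus copula estimator with a crude kernel estimate of the conditional density of the summary statistic given the parameter; the prior, model, summary, and bandwidths below are placeholders of my own, not taken from the paper:

import numpy as np

# Toy sketch of the generic two-stage idea (not the authors' mixture-of-experts
# + copula estimator): simulate (theta, s) pairs from the prior and the model,
# then build a kernel estimate of the conditional density p(s | theta) to be
# reused afterwards as a surrogate likelihood.

def prior_sim(n, rng):
    return rng.normal(0.0, 10.0, size=n)             # vague normal prior (assumed)

def model_sim(theta, rng):
    return rng.normal(theta, 1.0, size=theta.shape)  # toy model: s | theta ~ N(theta, 1)

rng = np.random.default_rng(0)
n = 10_000
theta = prior_sim(n, rng)
s = model_sim(theta, rng)

def cond_density(s0, theta0, h_theta=0.5, h_s=0.5):
    """Nadaraya-Watson style estimate of p(s0 | theta0) from the simulated pairs."""
    w = np.exp(-0.5 * ((theta - theta0) / h_theta) ** 2)              # localising weights in theta
    k = np.exp(-0.5 * ((s - s0) / h_s) ** 2) / (h_s * np.sqrt(2 * np.pi))
    return np.sum(w * k) / np.sum(w)

# surrogate log-likelihood of the observed summary s_obs at a given theta
s_obs = 1.2
loglik = lambda th: np.log(cond_density(s_obs, th) + 1e-300)

The point of the exercise is only to exhibit the structure: one batch of joint simulations, then a surrogate likelihood that can be evaluated at any value of the parameter afterwards.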

So this is a new kid on the large block of ABC methods! In terms of computing time, it sounds roughly equivalent to regular ABC algorithms in that it relies on the joint simulation of the pair parameter/sample, plus the cost of a certain number of mixture and mixture-of-experts estimations. I have no intuition on how (computationally) greedy those estimations are. In their single illustration, the authors report density estimation in dimension 115, which is clearly impressive, but I did not see any indication of the respective computing times. In terms of inference and connection with the exact Bayesian posterior, I see a few potential caveats: first, the method provides an approximation of the conditional density of the summary statistics given the parameters, while the Bayesian approach requires the reverse conditioning, namely the parameters given the summary statistics. This could induce inefficiencies when the prior is vague and leads to a very wide spread for the values of the summary statistics. Using a neighbourhood of the observed statistics to restrict the range of the simulated statistics thus seems appropriate. (But this boils down to our more standard ABC, doesn't it?!) Second, the use of mixtures of experts assumes some linear connection between the parameters and the summary statistics: while this reflects Fearnhead and Prangle's (2012) strategy, it is not necessarily appropriate in settings where those parameters cannot be expressed directly as expectations of the summary statistics (see, e.g., the case of population genetics). Third, the approximation proposed by the paper is a plug-in estimate, whose variability and imprecision are not accounted for in the inference process. Maybe not a major issue, as other solutions also rely on plug-in estimates. And I note the estimation is done once and for all, compared with, e.g., our empirical likelihood solution that requires a (fast) optimisation for each new value of the parameters. Fourth, once the approximation is constructed, a new MCMC run is required, and since the (approximated) target is a black box, the calibration of the MCMC sampler may turn out to be rather delicate, as in the 115-dimensional example.
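For the fourth point, the kind of second-stage run I have in mind is a plain random-walk Metropolis targeting the plug-in posterior; the sketch below is my own toy version, with a one-dimensional parameter and a placeholder surrogate log-likelihood, far from the 115-dimensional setting of the paper, but it shows where the calibration of the proposal scale enters:

import numpy as np

# Minimal random-walk Metropolis sketch using a black-box (plug-in) surrogate
# log-likelihood, as would be needed once the regression density estimate is
# built; both the surrogate and the prior below are stand-ins, and the proposal
# scale would need careful tuning in a high-dimensional problem.

def surrogate_loglik(theta):
    return -0.5 * (theta - 1.2) ** 2   # stand-in for the estimated log p(s_obs | theta)

def log_prior(theta):
    return -0.5 * (theta / 10.0) ** 2  # vague normal prior (assumed), up to a constant

def rw_metropolis(n_iter=5000, scale=0.5, seed=1):
    rng = np.random.default_rng(seed)
    theta = 0.0
    logpost = surrogate_loglik(theta) + log_prior(theta)
    chain = np.empty(n_iter)
    for i in range(n_iter):
        prop = theta + scale * rng.normal()                     # random-walk proposal
        logpost_prop = surrogate_loglik(prop) + log_prior(prop)
        if np.log(rng.uniform()) < logpost_prop - logpost:      # Metropolis acceptance step
            theta, logpost = prop, logpost_prop
        chain[i] = theta
    return chain

chain = rw_metropolis()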