Archive for conditional density

finding our way in the dark

Posted in Books, pictures, Statistics on November 18, 2021 by xi'an

The paper Finding our Way in the Dark: Approximate MCMC for Approximate Bayesian Methods by Evgeny Levi and (my friend) Radu Craiu recently got published in Bayesian Analysis. The central motivation for their work is that both ABC and synthetic likelihood are costly methods when the data is large and does not allow for smaller summaries, that is, when summaries S of smaller dimension cannot be directly simulated. The idea is to try to estimate

h(\theta)=\mathbb{P}_\theta(d(S,S^\text{obs})\le\epsilon)

since this is the substitute for the likelihood used for ABC. (A related idea is to build an approximate and conditional [on θ] distribution on the distance, an idea with which Doc. Stoehr and I played a wee bit without getting anything definitely interesting!) This is a one-dimensional object, hence non-parametric estimates could be considered… For instance using k-nearest neighbour methods (which were already linked with ABC by Gérard Biau and co-authors). A random forest could also be used (?). Or neural nets. The method still requires a full simulation of new datasets, so I wonder at the gain unless the replacement of the naïve indicator with h(θ) brings a clear improvement to the approximation, and hence many fewer simulations. The ESS is definitely improved, which matters all the more since the CPU cost is higher. Could this be associated with the recourse to independent proposals?
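To fix ideas, here is a minimal Python sketch of what a Metropolis sampler driven by an estimated h(θ) could look like. This is not the authors' actual algorithm (which relies on k-nearest-neighbour estimates built from a reference table rather than fresh simulations at every iteration); the model, summary, and tuning constants are mere placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_summary(theta, n=50):
    # placeholder model: summary S = mean of n N(theta, 1) observations
    return rng.normal(theta, 1.0, size=n).mean()

def h_hat(theta, s_obs, eps, m=100):
    # crude Monte Carlo estimate of h(theta) = P_theta(d(S, S_obs) <= eps)
    sims = np.array([simulate_summary(theta) for _ in range(m)])
    return np.mean(np.abs(sims - s_obs) <= eps)

def abc_mh(s_obs, eps, n_iter=2000, prop_sd=0.5):
    theta = 0.0
    h_cur = max(h_hat(theta, s_obs, eps), 1e-12)   # guard against division by zero
    chain = []
    for _ in range(n_iter):
        prop = theta + rng.normal(0.0, prop_sd)    # random-walk proposal, flat prior assumed
        h_prop = h_hat(prop, s_obs, eps)
        if rng.uniform() < h_prop / h_cur:         # estimated h(.) replaces the ABC indicator
            theta, h_cur = prop, max(h_prop, 1e-12)
        chain.append(theta)
    return np.array(chain)

chain = abc_mh(s_obs=1.3, eps=0.1)
print(chain.mean(), chain.std())
```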

In a sense, Bayesian synthetic likelihood does not convey the same appeal, since it is a bit more of a tough cookie: approximating the mean and variance of the summaries is a multidimensional problem. (BSL is always more expensive!)
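For contrast, a bare-bones sketch of a synthetic likelihood evaluation, where the multidimensional nature of the approximation (a mean vector and a covariance matrix of simulated summaries) is apparent; the model and the three summaries are again placeholders.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)

def simulate_summaries(theta, m=200, n=50):
    # placeholder model: three summaries (mean, variance, median) of n N(theta, 1) draws
    data = rng.normal(theta, 1.0, size=(m, n))
    return np.column_stack([data.mean(1), data.var(1), np.median(data, 1)])

def synthetic_loglik(theta, s_obs, m=200):
    # Gaussian approximation of the summary distribution at theta
    sims = simulate_summaries(theta, m)
    mu = sims.mean(axis=0)                      # estimated mean vector
    Sigma = np.cov(sims, rowvar=False)          # estimated covariance matrix
    return multivariate_normal.logpdf(s_obs, mean=mu, cov=Sigma)

print(synthetic_loglik(0.0, s_obs=np.array([0.1, 1.05, 0.0])))
```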

As a side remark, the authors use two chains in parallel to simplify convergence proofs, as we did a while ago with AMIS!

ABC²DE

Posted in Books, Statistics on June 25, 2018 by xi'an

A recent arXival on a new version of ABC based on kernel estimators (but one could argue that all ABC versions are based on kernel estimators, one way or another). In this ABC-CDE version, Izbicki, Lee and Pospisil [from CMU, hence the picture!] argue that past attempts failed to exploit the full advantages of kernel methods, including the 2016 ABCDE method (from Edinburgh) briefly covered on this blog. (As an aside, CDE stands for conditional density estimation.) They also criticise these attempts for selecting summary statistics and hence failing to achieve sufficiency, which seems a non-issue to me, as already discussed numerous times on the ‘Og. One point of particular interest in the long list of drawbacks found in the paper is the inability to compare several estimates of the posterior density, since this is not directly ingrained in the Bayesian construct. Unless one moves to higher ground by calling for Bayesian non-parametrics within the ABC algorithm, a perspective which I am not aware has been pursued so far…

The main selling point of ABC-CDE is that the true focus is on estimating a conditional density at the observable x⁰ rather than everywhere: simulations from the reference table are rejected if the pseudo-observations are too far from x⁰ (which implies using a relevant distance and/or choosing adequate summary statistics), and a conditional density estimator is then built from this subsample (which makes me wonder at a double use of the data).
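As a hedged illustration of this two-step construction (with a generic Gaussian kernel density estimate standing in for the actual ABC-CDE estimator, and a toy normal-normal model):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(2)

# reference table: theta from the prior, pseudo-observation x from the model
N = 10_000
theta = rng.normal(0.0, 2.0, size=N)
x = rng.normal(theta, 1.0, size=N)

x_obs = 0.7
eps = np.quantile(np.abs(x - x_obs), 0.05)   # keep the 5% closest simulations
kept = theta[np.abs(x - x_obs) <= eps]

# conditional density estimate of theta at x_obs, built from the accepted subsample
post_hat = gaussian_kde(kept)
grid = np.linspace(-4.0, 4.0, 201)
print(grid[np.argmax(post_hat(grid))])       # rough posterior mode
```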

The specific density estimation approach adopted for this is called FlexCode and relates to an earlier if recent paper from Izbicki and Lee I did not read. As in many other density estimation approaches, they use an orthonormal basis (including wavelets) in low dimension to estimate the marginal of the posterior for one or a few components of the parameter θ, noticing that this posterior marginal is a weighted average of the terms in the basis, where the weights are the posterior expectations of the basis functions themselves. All fine! The next step is to compare [posterior] estimators through an integrated squared error loss that does not integrate the prior or posterior and does not tell much about the quality of the approximation for Bayesian inference, in my opinion. It is furthermore approximated by a doubly integrated [over parameter and pseudo-observation] squared error loss, using the ABC(ε) sample from the prior predictive. And the approximation error only depends on the regularity of the error, that is, the difference between the posterior and the approximated posterior. Which strikes me as odd, since the Monte Carlo error should take over but does not appear at all. I am thus unclear as to whether or not the convergence results are that relevant. (A difficulty with this paper is the strong dependence on the earlier one, as it keeps referencing one version or another of FlexCode. Without reading the original one, I spotted a mention made of the use of random forests for selecting summary statistics of interest, without detailing the difference with our own ABC random forest papers, for both model selection and estimation. For instance, the remark that “nuisance statistics do not affect the performance of FlexCode-RF much” reproduces what we observed with ABC-RF.)
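A toy rendering of the basis-expansion idea (not the actual FlexCode implementation): the coefficients of an orthonormal cosine basis are conditional expectations of the basis functions given x, here crudely estimated by a nearest-neighbour average at x⁰; the model, the rescaling interval, and the truncation level are all assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(3)

# reference table on a rescaled parameter t = (theta - a) / (b - a) in [0, 1]
N, a, b = 20_000, -6.0, 6.0
theta = rng.normal(0.0, 2.0, size=N)
x = rng.normal(theta, 1.0, size=N)
t = (theta - a) / (b - a)

def phi(j, s):
    # orthonormal cosine basis on [0, 1]
    return np.ones_like(s) if j == 0 else np.sqrt(2.0) * np.cos(np.pi * j * s)

x_obs, k, J = 0.7, 500, 15
near = np.argsort(np.abs(x - x_obs))[:k]            # k nearest pseudo-observations

grid = np.linspace(0.0, 1.0, 401)
dens = np.zeros_like(grid)
for j in range(J + 1):
    beta_j = phi(j, t[near]).mean()                 # estimate of E[phi_j(t) | x = x_obs]
    dens += beta_j * phi(j, grid)

dens = np.maximum(dens, 0.0) / (b - a)              # clip negatives, back to the theta scale
print(a + grid[np.argmax(dens)] * (b - a))          # rough posterior mode for theta
```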

The long experiment section always relates to the most standard rejection ABC algorithm, without accounting for the many alternatives produced in the literature (like Li and Fearnhead, 2018, which uses Beaumont et al.'s 2002 scheme, along with importance sampling improvements, or ours). In the case of the real cosmological data, used twice, I am uncertain of the comparison as I presume the truth is unknown. Furthermore, from having worked on similar data a dozen years ago, it is unclear why ABC is necessary in such a context (although I remember us running a test about ABC at the Paris astrophysics institute once).

fiducial simulation

Posted in Books, Kids, pictures, Statistics, Travel, University life on April 19, 2018 by xi'an

While reading Confidence, Likelihood, Probability, by Tore Schweder and Nils Hjort, in the train from Oxford to Warwick, I came upon this unexpected property shown by Lindqvist and Taraldsen (Biometrika, 2005) that, to simulate a sample y conditional on the realisation of a sufficient statistic, T(y)=t⁰, it is sufficient (!!!) to simulate the components of y as y=G(u,θ), with u a random variable with a fixed distribution, e.g., a U(0,1), and to solve in θ the fixed point equation T(y)=t⁰, assuming there exists a single solution. Brilliant (like an aurora borealis)! To borrow a simple example from the authors, take an exponential sample to be simulated given the sum statistic. As is well-known, the conditional distribution is then a (rescaled) Beta and the proposed algorithm ends up being a standard Beta generator. For the method to work in general, T(y) must factorise through a function of the u’s, a so-called pivotal condition which brings us back to my post title. If this condition does not hold, the authors once again brilliantly introduce a pseudo-prior distribution on the parameter θ to make it independent of the u’s conditional on T(y)=t⁰. And they discuss the choice of the Jeffreys prior as optimal in this setting, even when this prior is improper. While the setting is necessarily one of exponential families and of sufficient conditioning statistics, I find it amazing that this property is not better known [at least by me!]. And I wonder if there is an equivalent outside exponential families, for instance for simulating a t sample conditional on the average of this sample.
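Here is a tiny sketch of the recipe on that exponential example, assuming the inverse-cdf representation y=G(u,θ)=−log(u)/θ with u~U(0,1): solving Σᵢyᵢ=t⁰ in θ amounts to rescaling the −log(uᵢ)'s by their sum, which indeed returns the conditional distribution (a rescaled Dirichlet, hence Beta marginals).

```python
import numpy as np

rng = np.random.default_rng(4)

def exp_sample_given_sum(n, t_obs):
    # representation y_i = G(u_i, theta) = -log(u_i) / theta, with u_i ~ U(0, 1)
    u = rng.uniform(size=n)
    e = -np.log(u)
    # solve T(y) = sum_i y_i = t_obs in theta:  theta = sum_i e_i / t_obs
    theta = e.sum() / t_obs
    return e / theta                      # equivalently t_obs * e / e.sum()

y = exp_sample_given_sum(5, t_obs=10.0)
print(y, y.sum())                         # the components sum to t_obs exactly
```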

probabilities larger than one…

Posted in Statistics on November 9, 2017 by xi'an

recycling Gibbs auxiliaries [a reply]

Posted in Books, pictures, Statistics, University life on January 3, 2017 by xi'an

[Here is a reply sent to me by Luca Martino, Victor Elvira, and Gustau Camps-Valls, after my earlier comments on their paper.]

We provide our contribution to the discussion, reporting our experience with the application of Metropolis-within-Gibbs schemes. Since there are miscellaneous opinions in the literature, we want to point out the following considerations:

– according to our experience, the use of M>1 steps of the Metropolis-Hastings (MH) method for drawing from each full-conditional (with or without recycling) decreases the MSE of the estimation (see code Ex1-Ex2 and the related Figure 7(b) and Figure 8). If the corresponding full-conditional is very concentrated, one possible solution is to apply an adaptive or automatic MH for drawing from this full-conditional (it can require the use of M internal steps; see references in Section 3.2). [A schematic sketch of such an M-step scheme appears after the reply.]

– Fixing the number of evaluations of the posterior, the comparison between a longer Gibbs chain with a single MH step and a shorter Gibbs chain with M>1 MH steps for each full-conditional is required. Generally, there is no clear winner: which performs better depends on several aspects, namely the specific scenario, whether an adaptive MH is employed or not, and whether the recycling is applied or not (see Figure 10(a) and the corresponding code Ex2).

The previous considerations are supported/endorsed by several authors (see the references in Section 3.2). In order to highlight the range of controversial opinions about the MH-within-Gibbs implementation, we report one last observation:

– If it is possible to draw directly from the full-conditionals, this is of course the best scenario (this is our belief). Remarkably, as also reported in Chapter 1, page 393 of the book “Monte Carlo Statistical Methods” by C. Robert and G. Casella (2004), some authors have found that a “bad” choice of the proposal function in the MH step (i.e., different from the full-conditional, or a poor approximation of it) can improve the performance of the MH-within-Gibbs sampler. Namely, they assert that a more “precise” approximation of the full-conditional does not necessarily improve the overall performance. In our opinion, this is possibly due to the fact that the acceptance rate in the MH step (lower than 1) induces an “accidental” random scan of the components of the target pdf in the Gibbs sampler, which can improve the performance in some cases. In our work, for simplicity, we focus only on the deterministic scan. However, a random scan could also be considered.
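[As an editorial addendum, here is a schematic Python sketch, not the authors' code, of a Metropolis-within-Gibbs sampler with M internal MH steps per full-conditional, run on a placeholder bivariate Gaussian target; the recycle flag crudely mimics the recycling idea by keeping every internal state instead of only the last one.]

```python
import numpy as np

rng = np.random.default_rng(5)

def log_target(x, y, rho=0.9):
    # placeholder target: bivariate normal with correlation rho (up to a constant)
    return -(x**2 - 2.0 * rho * x * y + y**2) / (2.0 * (1.0 - rho**2))

def mh_step(coord, other, first, sd=0.5):
    # one MH step on the full-conditional of one coordinate (first=True for x, False for y)
    prop = coord + rng.normal(0.0, sd)
    cur = log_target(coord, other) if first else log_target(other, coord)
    new = log_target(prop, other) if first else log_target(other, prop)
    return prop if np.log(rng.uniform()) < new - cur else coord

def mwg(n_iter=5000, M=3, recycle=True):
    x, y = 0.0, 0.0
    out = []
    for _ in range(n_iter):
        for _ in range(M):                  # M internal MH steps for x | y
            x = mh_step(x, y, first=True)
            if recycle:
                out.append((x, y))          # recycling keeps every internal state
        for _ in range(M):                  # M internal MH steps for y | x
            y = mh_step(y, x, first=False)
            if recycle:
                out.append((x, y))
        if not recycle:
            out.append((x, y))              # otherwise keep only the final state
    return np.array(out)

samples = mwg()
print(samples.mean(axis=0), np.corrcoef(samples.T)[0, 1])
```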
