Quasi-Monte Carlo sampling

Posted in Books, Kids, Statistics, Travel, University life, Wines with tags , , , , , , , , , , , , on December 10, 2014 by xi'an

“The QMC algorithm forces us to write any simulation as an explicit function of uniform samples.” (p.8)

As posted a few days ago, Mathieu Gerber and Nicolas Chopin will read this afternoon a Paper to the Royal Statistical Society on their sequential quasi-Monte Carlo sampling paper.  Here are some comments on the paper that are preliminaries to my written discussion (to be sent before the slightly awkward deadline of Jan 2, 2015).

Quasi-Monte Carlo methods are definitely not popular within the (mainstream) statistical community, despite regular attempts by respected researchers like Art Owen and Pierre L’Écuyer to induce more use of those methods. It is thus to be hoped that the current attempt will be more successful, it being Read to the Royal Statistical Society being a major step towards a wide diffusion. I am looking forward to the collection of discussions that will result from the incoming afternoon (and bemoan once again having to miss it!).

“It is also the resampling step that makes the introduction of QMC into SMC sampling non-trivial.” (p.3)

At a mathematical level, the fact that randomised low discrepancy sequences produce both unbiased estimators and error rates of order

$\mathfrak{O}(N^{-1}\log(N)^{d-}) \text{ at cost } \mathfrak{O}(N\log(N))$

means that randomised quasi-Monte Carlo methods should always be used, instead of regular Monte Carlo methods! So why is it not always used?! The difficulty stands [I think] in expressing the Monte Carlo estimators in terms of a deterministic function of a fixed number of uniforms (and possibly of past simulated values). At least this is why I never attempted at crossing the Rubicon into the quasi-Monte Carlo realm… And maybe also why the step had to appear in connection with particle filters, which can be seen as dynamic importance sampling methods and hence enjoy a local iid-ness that relates better to quasi-Monte Carlo integrators than single-chain MCMC algorithms.  For instance, each resampling step in a particle filter consists in a repeated multinomial generation, hence should have been turned into quasi-Monte Carlo ages ago. (However, rather than the basic solution drafted in Table 2, lower variance solutions like systematic and residual sampling have been proposed in the particle literature and I wonder if any of these is a special form of quasi-Monte Carlo.) In the present setting, the authors move further and apply quasi-Monte Carlo to the particles themselves. However, they still assume the deterministic transform

$\mathbf{x}_t^n = \Gamma_t(\mathbf{x}_{t-1}^n,\mathbf{u}_{t}^n)$

which the q-block on which I stumbled each time I contemplated quasi-Monte Carlo… So the fundamental difficulty with the whole proposal is that the generation from the Markov proposal

$m_t(\tilde{\mathbf{x}}_{t-1}^n,\cdot)$

has to be of the above form. Is the strength of this assumption discussed anywhere in the paper? All baseline distributions there are normal. And in the case it does not easily apply, what would the gain bw in only using the second step (i.e., quasi-Monte Carlo-ing the multinomial simulation from the empirical cdf)? In a sequential setting with unknown parameters θ, the transform is modified each time θ is modified and I wonder at the impact on computing cost if the inverse cdf is not available analytically. And I presume simulating the θ’s cannot benefit from quasi-Monte Carlo improvements.

The paper obviously cannot get into every detail, obviously, but I would also welcome indications on the cost of deriving the Hilbert curve, in particular in connection with the dimension d as it has to separate all of the N particles, and on the stopping rule on m that means only Hm is used.

Another question stands with the multiplicity of low discrepancy sequences and their impact on the overall convergence. If Art Owen’s (1997) nested scrambling leads to the best rate, as implied by Theorem 7, why should we ever consider another choice?

In connection with Lemma 1 and the sequential quasi-Monte Carlo approximation of the evidence, I wonder at any possible Rao-Blackwellisation using all proposed moves rather than only those accepted. I mean, from a quasi-Monte Carlo viewpoint, is Rao-Blackwellisation easier and is it of any significant interest?

What are the computing costs and gains for forward and backward sampling? They are not discussed there. I also fail to understand the trick at the end of 4.2.1, using SQMC on a single vector instead of (t+1) of them. Again assuming inverse cdfs are available? Any connection with the Polson et al.’s particle learning literature?

Last questions: what is the (learning) effort for lazy me to move to SQMC? Any hope of stepping outside particle filtering?

not converging to London for an [extra]ordinary Read Paper

Posted in Books, Kids, pictures, Statistics, Travel, University life with tags , , , , , , , on November 21, 2014 by xi'an

On December 10, I will alas not travel to London to attend the Read Paper on sequential quasi-Monte Carlo presented by Mathieu Gerber and Nicolas Chopin to The Society, as I fly instead to Montréal for the NIPS workshops… I am quite sorry to miss this event, as this is a major paper which brings quasi-Monte Carlo methods into mainstream statistics. I will most certainly write a discussion and remind Og’s readers that contributed (800 words) discussions are welcome from everyone, the deadline for submission being January 02.

Relevant statistics for Bayesian model choice [hot off the press!]

Posted in Books, Statistics, University life with tags , , , , , , on October 30, 2014 by xi'an

Our paper about evaluating statistics used for ABC model choice has just appeared in Series B! It somewhat paradoxical that it comes out just a few days after we submitted our paper on using random forests for Bayesian model choice, thus bypassing the need for selecting those summary statistics by incorporating all statistics available and letting the trees automatically rank those statistics in term of their discriminating power. Nonetheless, this paper remains an exciting piece of work (!) as it addresses the more general and pressing question of the validity of running a Bayesian analysis with only part of the information contained in the data. Quite usefull in my (biased) opinion when considering the emergence of approximate inference already discussed on this ‘Og…

[As a trivial aside, I had first used fresh from the press(es) as the bracketted comment, before I realised the meaning was not necessarily the same in English and in French.]

Posted in Kids, R, Statistics, University life, Wines with tags , , , , on May 2, 2014 by xi'an

Great news! The RSS is setting a data analysis challenge this year, sponsored by the Young Statisticians Section and Research Section of the Royal Statistical Society: Details are available on the wordpress website of the Challenge. Registration is open and the Challenge goes live on Tuesday 6 May 2014 for an exciting 6 weeks competition. (A wee bit of an unfortunate timing for those of us considering submitting a paper to NIPS!) Truly terrific, I have been looking for this kind of event to happen for many years (without finding the momentum to set it rolling…)  and hope it will generate a lot of exciting activity and replicas in other societies.

Series B news

Posted in Books, Statistics, University life with tags , , , , , , , , , , on January 24, 2014 by xi'an

The Journal of the Royal Statistical Society, Series B, has a new cover, a new colour and a new co-editor. As can be seen from the above shots, the colour is now a greenish ochre, with a picture of pedestrians on a brick plaza as a background, not much related to statistical methodology as far as I can tell. More importantly, the new co-editor for the coming four years is Piotr Fryzlewicz, professor at the London School of Economics, who will share the burden with Ingrid van Keilegom professor from UCL (Louvain-la-Neuve) who is now starting her third year… My friend, colleague and successor as Series B editor Gareth Roberts is now retiring after four years of hard work towards making Series B one of the top journals in Statistics. Thanks Gareth and best wishes to Ingrid and Piotr!

changing focus is not an option!

Posted in Statistics, University life with tags , , , , , , on October 11, 2013 by xi'an

Here is a quote from Mervyn Stone’s discussion of the DIC paper in Series B

“The paper is rather economical with the ‘truth’. The truth of pt(Y) corresponds fixedly to the conditions of the experimental or observational set-up that ensures independent future replication Yrep or internal independence of y = (y1,…,yn) (not excluding an implicit concomitant x). For pt(Y) ≈ p(Y|θt), θt must parameterize a scientifically plausible family of alternative distributions of Y under those conditions and is therefore a necessary ‘focus’ if the ‘good [true] model’ idea is to be invoked: think of tossing a bent coin. Changing focus is not an option.”

that I found most amusing (and relevant)! Elías Moreno and I wrote our discussions from Newcastle-upon-Tyne  for Series B (and arXived them as well, with a wee bit of confusion when I listed the affiliations: I am not [yet] associated with la Universidad de Las Palmas de Gran Canaria..!).

a general framework for updating belief functions [reply from the authors]

Posted in Statistics, University life with tags , , , , , , on July 16, 2013 by xi'an

Thanks to Christian for the comments and feedback on our paper “A General Framework for Updating Belief Distributions“. We agree with Christian that starting with a summary statistic, or statistics, is an anchor for inference or learning, providing direction and guidance for models, avoiding the alternative vague notion of attempting to model a complete data set. The latter idea has dominated the Bayesian methodology for decades, but with the advent of large and complex data sets, this is becoming increasingly challenging, if not impossible.

However, in order to do work with statistics of interest, we need to find a framework in which this direct approach can be supported by a learning strategy when the formal use of Bayes theorem is not applicable. We achieve this in the paper for a general class of loss functions, which connect observations with a target of interest. A point raised by Christian is how arbitrary these loss functions are. We do not see this at all; for if a target has been properly identified then the most primitive construct between observations informing about a target and the target would come in the form of a loss function. One should always be able to assess the loss of ascertaining a value of $\theta$ as an action and providing the loss in the presence of observation x. The question to be discussed is whether loss functions are objective, as in the case of the median loss,

$l(\theta,x)=|\theta-x|$

or subjective, in the case of the choice between loss functions for estimating a location of a distribution; mean, median or mode? But our work is situated in the former position.

Previous work on loss functions, mostly in the classical literature, has spent a lot of space working out what are optimal loss functions for targets of interest. We are not really dealing with novel targets and so we can draw on the classic literature here. The work can be thought of as the Bayesian version of the M-estimator and associated ideas. In this respect we are dealing with two loss functions for updating belief distributions, one for the data, which we have just discussed, and one for the prior information, which, due to coherence principles, must be the Kullback-Leibler divergence. This raises the thorny issue of how to calibrate the two loss functions. We discuss this in the paper.

To then deal with the statistic problem, mentioned at the start of this discussion, we have found a nice way to proceed by using the loss function $l(\theta,x)=-\log f(x|\theta)$. How this loss function, combined with the use of the exponential family, can be used to estimate functionals of the type

$I=\int g(x)\,f_0(x)\, dx$

is provided in the Walker talk at Bayes 250 in London, titled “The Misspecified Bayesian”, since the “model” $f(x|\theta)$ is designed to be misspecified, a tool to estimate and learn about I only. The basic idea is to evaluate I by ensuring that we learn about the $\theta_0$ for which

$I=\int g(x)\,f(x|\theta_0)\, dx.$

This is the story of the background, we would now like to pick up in more detail on three important points that you raise in your post:

1. The arbitrariness in selecting the loss function.
2. The relative weighting of loss-to-data vs. loss-to-prior.
3. The selection of the loss in the M-free case.

In the absence of complete knowledge of the data generating mechanism, i.e. outside of M-closed,

1. We believe the statistician should weigh up the relative arbitrariness in selecting a loss function targeting the statistic of interest versus the arbitrariness of selecting a misspecified model, known not to be true, for the complete data generating mechanism. There is a wealth of literature on how to select optimal loss functions that target specific statistics, e.g. Hüber (2009) provides a comprehensive overview of how this should be done. As far as we are aware, we know of no formal procedures (that do not rely on loss functions) to select a false sampling distribution $f(x|\theta)$ for the whole of x; see Key, Pericchi and Smith (1999).
2. The relative weighting of loss-to-data vs. loss-to-prior. This is an interesting open problem. Our framework shows in the absence of M-closed or the use of self-information loss that the analyst must select this weighting. In our paper we suggest some default procedures. We have nowhere claimed these were “correct”. You raise concerns regards parameterisation and we agree with you that care is needed, but many of these issues equally hold for existing “Objective” or “Default” Bayes procedures, such as unit-information priors.
3. The selection of the loss in M-free. You say “….there is no optimal choice for the substitute to the loss function…”. We disagree. Our approach is to select an established loss function that directly targets the statistic of interest, and elicit prior beliefs directly on the unknown value of this statistic. There is no notion here of a a pseudo-likelihood or where this converges to.

Thank you again to Christian for his critical observations!