– You took too much!

– Maybe, but remember your sister is staying for two days.

– My sister…, as usual, she will take a big serving and leave half of it!

– Yes, but she will make sure to finish the bottle of wine!

Filed under: Kids, Travel Tagged: farmers' market, métro static

“The traditional approach to inferring N also contradicts fundamental ideas in Bayesian computation. Imagine we are trying to compute the posterior distribution for a parameter a in the presence of a nuisance parameter b. This is usually solved by exploring the joint posterior for a and b, and then only looking at the generated values of a. Nobody would suggest the wasteful alternative of using a discrete grid of possible a values and doing an entire Nested Sampling run for each, to get the marginal likelihood as a function of a.”

This criticism is fair when there is a huge number of possible values of N, even though I see no fundamental contradiction with my ideas about Bayesian computation. However, it is more debatable when there are only a few possible values for N, given that the exploration of the augmented space by an RJMCMC algorithm is often very inefficient, in particular when the proposed parameters are generated from the prior. All the more so when nested sampling is involved and simulations are run under the likelihood constraint! In the astronomy examples given in the paper, N never exceeds 15… Furthermore, by merging all N’s together, it is unclear how the evidences associated with the various values of N can be computed. At least, those are not reported in the paper.
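To make the link between the two strategies concrete, here is a minimal sketch, assuming one has a log-evidence for each value of N (the numbers in the usage lines are made up, and the function names are mine, not the paper's). Posterior probabilities of N follow from Bayes' rule; conversely, the frequencies of N in a trans-dimensional run only give the evidences up to a common constant, i.e. as ratios.

```python
import math

def posterior_over_models(log_evidences, log_priors):
    """Posterior probabilities p(N | y) from per-model log-evidences log Z_N."""
    log_joint = [lz + lp for lz, lp in zip(log_evidences, log_priors)]
    m = max(log_joint)  # log-sum-exp trick for numerical stability
    log_norm = m + math.log(sum(math.exp(v - m) for v in log_joint))
    return [math.exp(v - log_norm) for v in log_joint]

def evidence_ratios(posterior, log_priors, ref=0):
    """Ratios Z_N / Z_ref recovered from posterior probabilities of N."""
    odds = [p / math.exp(lp) for p, lp in zip(posterior, log_priors)]
    return [o / odds[ref] for o in odds]

# Hypothetical log-evidences for N = 1, 2, 3 under a uniform prior on N.
log_z = [-10.0, -8.0, -9.0]
log_prior = [math.log(1.0 / 3.0)] * 3
post = posterior_over_models(log_z, log_prior)
ratios = evidence_ratios(post, log_prior)
```

Only the ratios are identified in the second direction, which is one way of reading the remark that the individual evidences are not reported.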

The paper also fails to provide the likelihood function, so I do not completely understand where “label switching” occurs therein. My first impression is that this is not a mixture model. However, if the observed signal (from an exoplanetary system) is the sum of N signals corresponding to N planets, this makes more sense.

Filed under: Books, Statistics, Travel, University life Tagged: birth-and-death process, Chamonix, exoplanet, label switching, métro, nested sampling, Paris, RER B, reversible jump, Université Paris Dauphine

Filed under: Mountains, pictures Tagged: British Columbia, Canada, Helmcken Falls, ice climbing, Niagara Falls, Niagara-on-the-Lake, USA

- I cannot turn Bluetooth on again! My keyboard and mouse are no longer recognised or detected. No Bluetooth adapter is found by the system setting. Similarly, *sudo modprobe bluetooth* shows nothing. I have installed a new interface called Blueman but to no avail. The fix suggested on forums to run *rfkill unblock bluetooth* does not work either… Actually *rfkill list all* only returns the wireless device. Which is working fine.
- My webcam vanished as well. It was working fine before the weekend.
- Accessing some webpages, including all New York Times articles, now takes forever on Firefox! If less on Chrome.

Is this a curse of sorts?!

As an aside, I also found this week that I cannot update Adobe Reader from version 9 to version 11, as Adobe does not support Linux versions any more… Another bummer. If one wants to stick to Acrobat.

**Update [03/02]**

Thanks to Ingmar and Thomas, I got both my problems solved! The Bluetooth restarted after I shut down my *unplugged* computer, in connection with a USB over-current protection. And Thomas figured out my keyboard had a key to turn the webcam off and on, a key that I had pressed when trying to restart the Bluetooth device. Et voilà!

Filed under: Kids, Linux Tagged: Bluetooth, Kubuntu, Linux, Ubuntu 14.04

[“We mourn but we are not out”]

Filed under: Uncategorized Tagged: atheism, Bangladesh, blogging, Mukto-Mona

*[Here is a reply by Heiko Strathmann to my post of yesterday. Along with the slides of a talk in Oxford mentioned in the discussion.]*

Thanks for putting this up, and thanks for the discussion. Christian, as already exchanged via email, here are some answers to the points you make.

First of all, we don’t claim a free lunch, and we are honest about the limitations of the method (see negative examples). Rather, we make the point that we *can* achieve computational savings in certain situations, essentially by exploiting redundancy (what Michael called “tall” data in his note on subsampling & HMC), which leads to fast convergence of posterior statistics.

Dan is of course correct in noticing that if the posterior statistic does not converge nicely (i.e. all data counts), then truncation time is “mammoth”. It is also correct that it might be questionable to aim for an unbiased Bayesian method in the presence of such redundancies. However, these are the two extreme perspectives on the topic. The message that we want to get across is that there is a trade-off between these extremes. In particular, the GP examples illustrate this nicely as we are able to reduce MSE in a regime where posterior statistics have *not* yet stabilised, see e.g. figure 6.

“And the following paragraph is further confusing me as it seems to imply that convergence is not that important thanks to the de-biasing equation.”

To clarify, the paragraph refers to the *additional* convergence issues induced by alternative Markov transition kernels of mini-batch-based full posterior sampling methods by Welling, Bardenet, Dougal & co. For example, Firefly MC’s mixing time is increased by a factor of 1/q where q*N is the mini-batch size. Mixing of stochastic gradient Langevin gets worse over time. This is *not* true for our scheme as we can use standard transition kernels. It is still essential for the partial posterior Markov chains to converge (*if* MCMC is used). However, as this is a well studied problem, we omit the topic in our paper and refer to standard tools for diagnosis. All this is independent of the debiasing device.

**About MCMC convergence.**

Yesterday in Oxford, Pierre Jacob pointed out that if MCMC is used for estimating partial posterior statistics, the overall result is *not* unbiased. We had a nice discussion about how this bias could be addressed via a two-stage debiasing procedure: debiasing the MC estimates as described in the “Unbiased Monte Carlo” paper by Agapiou et al, and then plugging those into the path estimators, though it is not (yet) so clear how (and whether) this would work in our case.

In the current version of the paper, we do not address the bias present due to MCMC. We have a paragraph on this in section 3.2. Rather, we start from a premise that full posterior MCMC samples are a gold standard. Furthermore, the framework we study is not necessarily linked to MCMC – it could be that the posterior expectation is available in closed form, but simply costly in N. In this case, we can still unbiasedly estimate this posterior expectation – see GP regression.

“The choice of the tail rate is thus quite delicate to validate against the variance constraints (2) and (3).”

It is true that the choice is crucial in order to control the variance. However, provided that partial posterior expectations converge at a rate n^{-β} with n the size of a minibatch, computational complexity can be reduced to N^{1-α} (α<β) without variance exploding. There is a trade-off: the faster the posterior expectations converge, the more computation can be saved; β is in general unknown, but can be roughly estimated with the “direct approach” as we describe in the appendix.
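As a purely illustrative order of magnitude (the values of N and α below are assumptions chosen for the arithmetic, not figures from the paper, and they presuppose β > 1/2):

```latex
N = 10^{6},\quad \alpha = \tfrac{1}{2}
\quad\Longrightarrow\quad
N^{1-\alpha} = 10^{3},
\qquad
\frac{N^{1-\alpha}}{N} = N^{-\alpha} = 10^{-3}.
```

In this hypothetical setting, the data touched would be about a thousandth of what a full pass over the N datapoints requires.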

**About the “direct approach”**

It is true that for certain classes of models and φ functionals, the direct averaging of expectations for increasing data sizes yields good results (see log-normal example), and we state this. However, the GP regression experiments show that the direct averaging gives a larger MSE than with debiasing applied. This is exactly the trade-off mentioned earlier.

I also wonder what people think about the comparison to stochastic variational inference (GP for Big Data), as this hasn’t appeared in discussions yet. It is the comparison to “non-unbiased” schemes that Christian and Dan asked for.

Filed under: Statistics, University life Tagged: arXiv, bias vs. variance, big data, convergence assessment, de-biasing, Firefly MC, MCMC, Monte Carlo Statistical Methods, telescoping estimator, unbiased estimation

*“Data complexity is sub-linear in N, no bias is introduced, variance is finite.”*

Heiko Strathmann, Dino Sejdinovic and Mark Girolami arXived a few weeks ago a paper on the use of a telescoping estimator to achieve an unbiased estimator of a Bayes estimator relying on the entire dataset, while using only a small proportion of the dataset. The idea is that a sequence of estimators φ_{t}, converging to an unbiased estimator, can be turned into an unbiased estimator by a stopping rule T, in that (with φ_{0}=0)

$$\sum_{t=1}^{T} \frac{\varphi_t-\varphi_{t-1}}{\mathbb{P}(T\ge t)}$$

is indeed unbiased. In a “Big Data” framework, the components φ_{t} are MCMC versions of posterior expectations based on a proportion α_{t} of the data. And the stopping rule cannot exceed α_{t}=1. The authors further propose to replicate this unbiased estimator R times on R parallel processors. They further claim a reduction in the computing cost to

$$\mathcal{O}(N^{1-\alpha})$$

which means that a sub-linear cost can be achieved. However, the gain in computing time means higher variance than for the full MCMC solution:

“It is clear that running an MCMC chain on the full posterior, for any statistic, produces more accurate estimates than the debiasing approach, which by construction has an additional intrinsic source of variance. This means that if it is possible to produce even only a single MCMC sample (…), the resulting posterior expectation can be estimated with less expected error. It is therefore not instructive to compare approaches in that region.”
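To fix ideas, here is a minimal sketch of a randomly truncated telescoping estimator of the kind discussed above, assuming the standard form Σ_{t≤T} (φ_t − φ_{t−1}) / P(T ≥ t) with φ_0 = 0. The function names and the toy sequence are mine, and the partial estimators are a precomputed deterministic list for illustration, whereas in practice only φ_1,…,φ_T would be computed. The `exact_expectation` helper averages over all values of T to check that the expectation equals the last term of the sequence.

```python
import random

def survival_probs(pmf):
    """P(T >= t) for t = 1..len(pmf), where pmf[t-1] = P(T = t)."""
    tail, out = 1.0, []
    for p in pmf:
        out.append(tail)
        tail -= p
    return out

def telescoping_sum(phi, surv, T):
    """sum_{t<=T} (phi_t - phi_{t-1}) / P(T >= t), with phi_0 = 0."""
    total, prev = 0.0, 0.0
    for t in range(1, T + 1):
        total += (phi[t - 1] - prev) / surv[t - 1]
        prev = phi[t - 1]
    return total

def debiased_estimate(phi, pmf, rng=random):
    """One random draw of the truncated estimator."""
    T = rng.choices(range(1, len(pmf) + 1), weights=pmf)[0]
    return telescoping_sum(phi, survival_probs(pmf), T)

def exact_expectation(phi, pmf):
    """Expectation of the estimator, averaging over every value of T."""
    surv = survival_probs(pmf)
    return sum(p * telescoping_sum(phi, surv, T)
               for T, p in enumerate(pmf, start=1))

# Toy sequence phi_t = 1 - 2^{-t} and a truncated geometric stopping rule.
phi = [1.0 - 2.0 ** -t for t in range(1, 7)]
w = [2.0 ** -t for t in range(1, 7)]
pmf = [x / sum(w) for x in w]
```

Note that the largest value of T must keep a positive probability for the weights to be well defined, which is the constraint at play in the comparison with the full-data solution.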

I first got a “free lunch” impression when reading the paper, namely it sounded like using a random stopping rule was enough to overcome both bias and large-size jams. This is not the message of the paper, but I remain both intrigued by the possibilities the unbiasedness offers *and* bemused by the claims therein, for several reasons:

- the above estimator requires computing T MCMC (partial) estimators φ_{t} in parallel. All of those estimators have to be associated with Markov chains in a stationary regime and they all are associated with independent chains. While addressing the convergence of a single chain, the paper does not truly cover the *simultaneous* convergence assessment on a group of T parallel MCMC sequences. And the paragraph below is further confusing me as it seems to imply that convergence is not that important thanks to the de-biasing equation. In fact, further discussion with the authors (!) led me to understand this relates to the existing alternatives for handling large data, like Firefly Monte Carlo: convergence to the stationary distribution remains essential (and somewhat problematic) for all the partial estimators.

“If a Markov chain is, in line with above considerations, used for computing partial posterior expectations 𝔼_{π_t}[ϕ(θ)], it need not be induced by any form of approximation, noise injection, or state-space augmentation of the transition kernel. As a result, the notorious difficulties of ensuring acceptable mixing and problems of stickiness are conveniently side-stepped – which is in sharp contrast to all existing approaches.”

- the impact of the distribution of the stopping time T on the performance of the estimator is uncanny! Its tail should simply decrease more slowly than the squared difference between the partial estimators. This requirement is both hard to achieve [given that the variances of the (partial) MCMC estimators are hard to assess] and has negative consequences on the overall computing time. The choice of the tail rate is thus quite delicate to validate against the variance constraints (2) and (3).
- the stopping time T must have a positive probability of taking the largest possible value, corresponding to using the whole sample, in which case the approach does worse than directly using the original MCMC algorithm. This shows (as stated in the first quote above) that the approach cannot uniformly improve upon standard MCMC.
- the comparison in the (log-)normal toy example is difficult to calibrate. (And why differentiate a log-normal from a normal sample? And why are the tails extremely wide for 2²⁶ datapoints?) The number of likelihood evaluations is 5e-4 times smaller for the de-biased version, which means a humongous gain in computing time, but how does this partial exploration of the information contained in the data impact the final variability of the estimate? If I judge from Figure 4 (top) after 300 replications, one still observes a range of 0.1 at the end of the 300 iterations. If I could produce the Bayes estimate for the whole data, the variability would be of order 2e-4… If the goal is to find an estimate of the parameter of interest, with a predetermined error, this is fine. But then the comparison with the genuine Bayesian answer is not very meaningful.
- when averaging the unbiased estimators R times, it is unclear whether or not the *same* subset of n_{t} datapoints is used to compute the partial estimator φ_{t}. On the one hand, using different subsets should improve the connection with the genuine Bayes estimator. On the other hand, this may induce higher storage and computing costs.
- cases when the likelihood does not factorise are not always favourable to the de-biasing approach (Section 5.1) in that the availability of the joint density of a subset of the whole data may prove an issue. Take for instance a time series or a graphical network. Computing the joint density of a subset requires stringent conditions on the way the subset is selected.

Overall, I think I understand the purpose of the paper better now that I have read it a few times. The comparison is only relevant against other limited information solutions. Not against the full ~~Monty~~ MCMC. However, in the formal examples processed in the paper, a more direct approach would be to compute (in parallel) MCMC estimates for increasing portions of the data, add a dose of bootstrap to reduce bias, and check for stabilisation of the averages.
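A minimal sketch of this more direct approach, under loud simplifying assumptions: a conjugate normal model whose posterior mean is available in closed form stands in for the MCMC estimates, the bootstrap step is left out, and the data, sizes and tolerance are hypothetical choices of mine.

```python
import random

def posterior_mean(data, prior_var=100.0, noise_var=1.0):
    """Posterior mean of mu for x_i ~ N(mu, noise_var), mu ~ N(0, prior_var)."""
    n = len(data)
    return (sum(data) / noise_var) / (n / noise_var + 1.0 / prior_var)

def partial_posterior_means(data, sizes):
    """Posterior means based on the first n datapoints, for increasing n."""
    return [posterior_mean(data[:n]) for n in sizes]

def has_stabilised(values, tol):
    """True when the last two partial estimates differ by less than tol."""
    return len(values) >= 2 and abs(values[-1] - values[-2]) < tol

# Simulated data (assumed true mean 2.0) and doubling portions of the sample.
rng = random.Random(1)
data = [rng.gauss(2.0, 1.0) for _ in range(4096)]
sizes = [2 ** k for k in range(4, 13)]
estimates = partial_posterior_means(data, sizes)
```

Stopping as soon as `has_stabilised(estimates, tol)` holds gives a biased but cheap answer, which is the kind of limited-information solution the comparison should be run against.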

Filed under: Statistics, University life Tagged: arXiv, bag of little bootstraps, bias vs. variance, big data, convergence assessment, de-biasing, MCMC, Monte Carlo Statistical Methods, telescoping estimator, unbiased estimation

“From the Bayesian estimation point of view both the states and the static parameters are unknown (random) parameters of the system.” (p.20)

Bayesian filtering and smoothing is an introduction to the topic that essentially starts from ground zero. Chapter 1 motivates the use of filtering and smoothing through examples and highlights the naturally Bayesian approach to the problem(s). Two graphs illustrate the difference between filtering and smoothing by plotting for the same series of observations the successive confidence bands. The performances are obviously poorer with filtering, but the fact that those intervals are point-wise rather than joint, i.e., that the graphs do not provide a confidence band, deserves to be stressed. (The exercise section of that chapter is superfluous in that it suggests re-reading Kalman’s original paper and rephrases the Monty Hall paradox in a story unconnected with filtering!) Chapter 2 gives an introduction to Bayesian statistics in general, with a few pages on Bayesian computational methods. A first remark is that the above quote is both correct and mildly confusing in that the parameters can be consistently estimated, while the latent states cannot. A second remark is that justifying the MAP as associated with the 0-1 loss is incorrect in continuous settings. The third chapter deals with the batch updating of the posterior distribution, i.e., that the posterior at time t is the prior at time t+1. With applications to state-space systems including the Kalman filter. The fourth to sixth chapters concentrate on this Kalman filter and its extensions, and I find them somewhat unsatisfactory in that the collection of such filters is overwhelming for a neophyte. And no assessment of the estimation error when the model is misspecified appears at this stage. And, as usual, I find the unscented Kalman filter hard to fathom! The same feeling applies to the smoothing chapters, from Chapter 8 to Chapter 10. Which mimic the earlier ones.
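As an illustration of the "posterior at time t is the prior at time t+1" principle, here is a minimal sketch of a scalar Kalman filter, assuming a Gaussian random-walk state observed with Gaussian noise (the model, names and parameters are my own choices, not taken from the book).

```python
def kalman_step(mean, var, y, process_var, obs_var):
    """One predict/update cycle for x_t = x_{t-1} + N(0, q), y_t = x_t + N(0, r)."""
    # Predict: the previous posterior, inflated by the process noise, is the new prior.
    prior_mean, prior_var = mean, var + process_var
    # Update: combine this prior with the new observation y.
    gain = prior_var / (prior_var + obs_var)
    post_mean = prior_mean + gain * (y - prior_mean)
    post_var = (1.0 - gain) * prior_var
    return post_mean, post_var

def kalman_filter(ys, mean0, var0, process_var, obs_var):
    """Filtering means and variances after each observation in ys."""
    mean, var, out = mean0, var0, []
    for y in ys:
        mean, var = kalman_step(mean, var, y, process_var, obs_var)
        out.append((mean, var))
    return out

# With no process noise the state is static, and the recursion reproduces
# the batch conjugate-normal posterior after the last observation.
filtered = kalman_filter([1.0, 2.0, 3.0], mean0=0.0, var0=10.0,
                         process_var=0.0, obs_var=1.0)
```

The static case is a convenient sanity check since sequential and batch updating must then agree.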

“The degeneracy problem can be solved by a resampling procedure.” (p.123)

By comparison, the seventh chapter on particle filters appears too introductory from my biased perspective. For instance, the above motivation for resampling in sequential importance (re)sampling is not clear enough. As stated it sounds too much like a trick, not mentioning the fast decrease in the number of first generation ancestors as the number of generations grows. And thus the need for either increasing the number of particles fast enough or checking for quick-forgetting. Chapter 11 is the equivalent of the above for particle smoothing. I would have liked more details on the full posterior smoothing distribution, instead of the marginal posterior smoothing distribution at a given time t. And more of a discussion on the comparative merits of the different algorithms.
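A minimal sketch of the two phenomena at stake, with function names of my own: the effective sample size as a measure of weight degeneracy, and the loss of distinct first-generation ancestors under repeated multinomial resampling (run here with equal weights, an assumption that already suffices to exhibit the collapse).

```python
import random

def effective_sample_size(weights):
    """(sum w)^2 / sum w^2: n for uniform weights, 1 for a single non-zero weight."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)

def multinomial_resample(particles, weights, rng=random):
    """Draw len(particles) particles with probabilities proportional to weights."""
    return rng.choices(particles, weights=weights, k=len(particles))

def count_distinct_ancestors(n_particles, n_generations, rng=random):
    """Number of first-generation ancestors surviving repeated resampling."""
    ancestors = list(range(n_particles))
    for _ in range(n_generations):
        ancestors = rng.choices(ancestors, k=n_particles)
    return len(set(ancestors))

# Surviving ancestors among 100 particles after 50 resampling steps.
survivors = count_distinct_ancestors(100, 50, rng=random.Random(0))
```

Within a single run the number of surviving ancestors can only decrease from one generation to the next, which is why either more particles or a quickly forgetting model is needed.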

Chapter 12 is much longer than the other chapters as it caters to the much more realistic issue of parameter estimation. The chapter borrows at times from Cappé, Moulines and Rydén (2007), where I contributed to the Bayesian estimation chapter. This is actually the first time in Bayesian filtering and smoothing when MCMC is mentioned. Including reference to adaptive MCMC and HMC. The chapter also covers some EM versions. And pMCMC à la Andrieu et al. (2010). Although a picture like Fig. 12.2 seems to convey the message that this particle MCMC approach is actually quite inefficient.

“An important question (…) which of the numerous methods should I choose?”

The book ends with an Epilogue (Chapter 13). Suggesting to use (Monte Carlo) sampling only after all other methods have failed. Which implies assessing that those methods have indeed failed. Maybe the suggestion of running what seems like the most appropriate method first with synthetic data (rather than the real data) could be included. For one thing, it does not add much to the computing cost. All in all, and despite some criticisms voiced above, I find the book quite a handy and compact introduction to the field, albeit slightly terse for an undergraduate audience.

Filed under: Books, Statistics, Travel, University life Tagged: book review, CHANCE, EM algorithm, filtering, IMS Textbooks, Kalman filter, MAP estimators, particle filter, particle MCMC, plagiarism, Simo Särkkä, smoothing, The Monty Hall problem

Filed under: Books, Kids, pictures, Running Tagged: Charlie Hebdo

As a last presentation for the entire series, my student picked John Skilling’s Nested Sampling, not that it was in my list of “classics”, but he had worked on the paper in a summer project and was thus reasonably fluent with the topic. As he did a good enough job (!), here are his slides.

Some of the questions that came to me during the talk were on how to run nested sampling sequentially, both in the data and in the number of simulated points, and on incorporating more deterministic moves in order to remove some of the Monte Carlo variability. I was about to ask about (!) the Hamiltonian version of nested sampling but then he mentioned his last summer internship on this very topic! I also realised during that talk that the formula (for positive random variables)

$$\mathbb{E}[X] = \int_0^\infty \{1-F(x)\}\,\mathrm{d}x$$

does not require absolute continuity of the distribution F.
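A quick check of that remark, assuming the formula in question is the tail-integral identity E[X] = ∫₀^∞ (1 − F(x)) dx: for a purely discrete distribution on positive values (hence not absolutely continuous), 1 − F is a step function and the integral can be computed exactly. The values and probabilities below are arbitrary choices of mine.

```python
def mean_from_survival(values, probs):
    """Integrate 1 - F(x) over [0, inf) for a discrete law on positive values."""
    integral, left, survival = 0.0, 0.0, 1.0
    for x, p in sorted(zip(values, probs)):
        integral += survival * (x - left)  # 1 - F is constant on [left, x)
        left, survival = x, survival - p
    return integral

def mean_direct(values, probs):
    """Usual expectation sum_i x_i p_i."""
    return sum(x * p for x, p in zip(values, probs))

values, probs = [1.0, 2.0, 5.0], [0.5, 0.3, 0.2]
via_tail = mean_from_survival(values, probs)
via_sum = mean_direct(values, probs)
```

Both computations return the same expectation, without any density entering the picture.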

Filed under: Books, Kids, Statistics, University life Tagged: advanced Monte Carlo methods, classics, efficient importance sampling, evidence, Hamiltonian Monte Carlo, Monte Carlo Statistical Methods, nested sampling, seminar, slides, Université Paris Dauphine