**D**avid Frazier sent me a picture of another Xi’an restaurant he found near the campus of Monash University. If this CNN webpage on the ten best dishes in Xi’an is to be believed, this will be a must-go restaurant for my next visit to Melbourne! Especially when reading there that Xi’an claims to have xiaolongbao (soup dumplings) that are superior to those in Shanghai!!! (And when considering that I once went on a xiaolongbao rampage in downtown Melbourne.)

## Archive for Monash University

## Xi’an cuisine [Xi’an series]

Posted in Statistics with tags Biangbiang noodles, dumplings, jatp, Melbourne, Melbourne food scene, Monash University, Northern China, Shanghai, Xi'an, Xi'an cuisine, xiaolongbao, 小籠包 on August 26, 2017 by xi'an

## model misspecification in ABC

Posted in Statistics with tags ABC, all models are wrong, Australia, likelihood-free methods, Melbourne, Mission Beach, model mispecification, Monash University, statistical modelling on August 21, 2017 by xi'an

**W**ith David Frazier and Judith Rousseau, we just arXived a paper studying the impact of a misspecified model on the outcome of an ABC run. This is a question that naturally arises when using ABC, but that has not been directly covered in the literature apart from a recently arXived paper by James Ridgway [that was commented on the ‘Og earlier this month]. On the one hand, ABC can be seen as a robust method in that it focuses on the aspects of the assumed model that are translated by the [insufficient] summary statistics and their expectation. And nothing else. It is thus tolerant of departures from the hypothetical model that [almost] preserve those moments. On the other hand, ABC involves a degree of non-parametric estimation of the intractable likelihood, which may sound even more robust, except that the likelihood is estimated from pseudo-data simulated from the “wrong” model in case of misspecification.

In the paper, we examine how the pseudo-true value of the parameter [that is, the value of the parameter of the misspecified model that comes closest to the generating model in terms of Kullback-Leibler divergence] is asymptotically reached by some ABC algorithms like the ABC accept/reject approach and not by others like the popular linear regression [post-simulation] adjustment, which surprisingly concentrates posterior mass on a completely different pseudo-true value. Exploiting our recent assessment of ABC convergence for well-specified models, we show the above convergence result for a tolerance sequence that decreases to the minimum possible distance [between the true expectation and the misspecified expectation] at a slow enough rate. Or, equivalently, when the sequence of acceptance probabilities goes to zero at the proper speed. In the case of the regression correction, the pseudo-true value is shifted by a quantity that does not converge to zero, because of the misspecification in the expectation of the summary statistics. This is not immensely surprising but we hence get a very different picture when compared with the well-specified case, where regression corrections bring improvement to the asymptotic behaviour of the ABC estimators. This discrepancy between two versions of ABC can be exploited to seek misspecification diagnoses, e.g. through the acceptance rate versus the tolerance level, or via a comparison of the ABC approximations to the posterior expectations of quantities of interest, which should diverge at rate √n. In both cases, ABC reference tables/learning bases can be exploited to draw and calibrate a comparison with the well-specified case.
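To make the accept/reject part of the above more concrete, here is a minimal sketch on a toy problem of my own choosing, which is not necessarily an example from the paper. The assumed model is N(θ,1), the data is actually generated from a Normal with standard deviation 2, and the summaries are the sample mean and the sample variance. The function name `abc_accept_reject`, the Normal prior, and the quantile-based tolerance are all assumptions made for the illustration.

```python
import numpy as np

def abc_accept_reject(observed, n_sims=20000, keep_frac=0.01, prior_sd=5.0, seed=0):
    """Accept/reject ABC for theta under the assumed (possibly wrong) model N(theta, 1)."""
    rng = np.random.default_rng(seed)
    n = observed.size
    # Summary statistics of the observed data: mean and variance.
    s_obs = np.array([observed.mean(), observed.var()])
    # Draw parameters from an assumed N(0, prior_sd^2) prior.
    thetas = rng.normal(0.0, prior_sd, size=n_sims)
    # Simulate pseudo-data from the assumed model, one row per parameter draw.
    pseudo = rng.normal(thetas[:, None], 1.0, size=(n_sims, n))
    s_sim = np.column_stack([pseudo.mean(axis=1), pseudo.var(axis=1)])
    # Euclidean distance between simulated and observed summaries.
    dist = np.linalg.norm(s_sim - s_obs, axis=1)
    # Tolerance set as an empirical quantile of the simulated distances.
    tol = np.quantile(dist, keep_frac)
    return thetas[dist <= tol], tol

# Data generated with standard deviation 2, hence a misspecified N(theta, 1) model.
rng = np.random.default_rng(1)
observed = rng.normal(1.0, 2.0, size=100)
accepted, tol = abc_accept_reject(observed)
print(accepted.mean(), tol)
```

In this toy setting the simulated variance stays near 1 while the observed variance is near 4, so the tolerance cannot shrink much below the gap between the two, whatever the number of simulations. The accepted values of θ nonetheless concentrate around the observed mean, which is the pseudo-true value for this choice of summaries and distance.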

## two ABC postdocs at Monash

Posted in Statistics with tags ABC, approximate inference, Australia, Melbourne, Monash University, postdoctoral position, Victoria on April 4, 2017 by xi'an

**F**or students, postdocs and faculty working on approximate inference, ABC algorithms, and likelihood-free methods, this announcement of two postdoc positions at Monash University, Melbourne, Australia, to work with Gael Martin, David Frazier and Catherine Forbes should be of strong relevance and particular interest:

The Department of Econometrics and Business Statistics at Monash is looking to fill two postdoc positions in – one for 12 months and the other for 2 years. The positions will be funded (respectively) by the following ARC Discovery grants:

1. DP150101728: “Approximate Bayesian Computation in State Space Models”. (Chief Investigators: Professor Gael Martin and Associate Professor Catherine Forbes; International Partner Investigators: Professor Brendan McCabe and Professor Christian Robert).

2. DP170100729: “The Validation of Approximate Bayesian Computation: Theory and Practice”. (Chief Investigators: Professor Gael Martin and Dr David Frazier; International Partner Investigators: Professor Christian Robert and Professor Eric Renault).

The deadline for applications is April 28th, 2017, and the nominal starting date is July, 2017 (although there is some degree of flexibility on that front).

## warp-U bridge sampling

Posted in Books, Statistics, Travel, University life with tags bridge sampling, component of a mixture, EM algorithm, folded Markov chain, MCqMC 2016, Melbourne, Monash University, nested sampling, Stanford University, warped bridge sampling, Xiao-Li Meng on October 12, 2016 by xi'an

*[I wrote this set of comments right after MCqMC 2016 on a preliminary version of the paper so mileage may vary in terms of their relevance to the current version!]*

**I**n warp-U bridge sampling, newly arXived and first presented at MCqMC 2016, Xiao-Li Meng continues (in collaboration with Lazhi Wang) his exploration of bridge sampling techniques towards improving the estimation of normalising constants and ratios thereof. The bridge sampling estimator of Meng and Wong (1996) is a harmonic mean importance sampler that requires iterations as it depends on the ratio of interest. Given that the normalising constant of a density does not depend on the chosen parameterisation, in the sense that the Jacobian transform preserves this constant, there is a degree of freedom in the choice of the parameterisation. This is the idea behind warp transformations. The initial version of Meng and Schilling (2002) used location-scale transforms, while the warp-U solution goes for a multiple location-scale transform that can be seen as based on a location-scale mixture representation of the target, with K components. This approach can also be seen as a sort of artificial reversible jump algorithm when one model is fully known. A strategy Nicolas and I also proposed in our nested sampling Biometrika paper.
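As a reminder of why the bridge sampling estimator requires iterations, here is a minimal sketch of the fixed-point version of the Meng and Wong (1996) estimator of the ratio r = c₁/c₂, written in terms of the density ratios l = q₁/q₂ evaluated at draws from each distribution. The function name `iterative_bridge` and the toy pair of Normal densities are my own choices for the illustration, not taken from the warp-U paper.

```python
import numpy as np

def iterative_bridge(l1, l2, n_iter=50, r0=1.0):
    """Fixed-point bridge sampling estimate of r = c1 / c2.

    l1: ratios q1/q2 evaluated at draws from p1 = q1 / c1
    l2: ratios q1/q2 evaluated at draws from p2 = q2 / c2
    """
    n1, n2 = l1.size, l2.size
    s1, s2 = n1 / (n1 + n2), n2 / (n1 + n2)
    r = r0
    for _ in range(n_iter):
        # The bridge depends on the current value of r, hence the iterations.
        num = np.mean(l2 / (s1 * l2 + s2 * r))
        den = np.mean(1.0 / (s1 * l1 + s2 * r))
        r = num / den
    return r

rng = np.random.default_rng(0)
sd2 = 1.5

def q1(x):
    # Unnormalised standard Normal density, c1 = sqrt(2 * pi).
    return np.exp(-0.5 * x**2)

def q2(x):
    # Normalised N(0, sd2^2) density, c2 = 1.
    return np.exp(-0.5 * (x / sd2) ** 2) / (sd2 * np.sqrt(2 * np.pi))

x1 = rng.normal(0.0, 1.0, size=5000)   # draws from p1
x2 = rng.normal(0.0, sd2, size=5000)   # draws from p2
r_hat = iterative_bridge(q1(x1) / q2(x1), q1(x2) / q2(x2))
print(r_hat, np.sqrt(2 * np.pi))
```

The two densities overlap well in this toy case, which is exactly the situation the warp transforms try to create for less friendly targets.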

Once such a mixture approximation is obtained, each and every component of the mixture can be turned into the standard version of the location-scale family by the appropriate location-scale transform. Since the component index k is unknown for a given X, they call this transform a *random* transform, which I find somewhat more confusing than helpful. The conditional distribution of the index given the observable x is well-known for mixtures and it is used here to weight the component-wise location-scale transforms of the original distribution p into something that looks rather similar to the standard version of the location-scale family. If no mode has been forgotten by the mixture. The simulations from the original p are then rescaled by one of those transforms, whose index k is picked according to the conditional distribution. As explained later to me by XL, the *random[ness]* in the picture is due to the inclusion of a random ± sign. Still, in the notation introduced in (13), I do not get how the distribution Þ *[sorry for using different symbols, I cannot render a tilde on a p]* is defined since both ψ and W are random. Is it the marginal? In which case it would read as a weighted average of rescaled versions of p. I have the same problem with Theorem 1 in that I do not understand how one equates Þ with the joint distribution.
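The way I read the transform can be sketched as follows in dimension one, under the assumption of a Normal location-scale mixture approximation with known weights, means and scales. The function `warp_u` and its arguments are hypothetical names of mine, and this is only my understanding of the construction, not the authors' code.

```python
import numpy as np

def warp_u(x, weights, means, scales, rng):
    """Random location-scale transform driven by a Normal mixture approximation."""
    # Standardised residuals of every draw with respect to every component.
    z = (x[:, None] - means[None, :]) / scales[None, :]
    # Log of weight times Normal component density, up to a common constant.
    log_comp = np.log(weights)[None, :] - np.log(scales)[None, :] - 0.5 * z**2
    log_comp -= log_comp.max(axis=1, keepdims=True)
    # Conditional distribution of the component index given each draw.
    post = np.exp(log_comp)
    post /= post.sum(axis=1, keepdims=True)
    # Pick one index per draw by inverting the cumulative probabilities.
    u = rng.random(x.size)
    k = (post.cumsum(axis=1) < u[:, None]).sum(axis=1)
    k = np.minimum(k, weights.size - 1)
    # Random +/- sign, then the location-scale transform of the chosen component.
    sign = rng.choice([-1.0, 1.0], size=x.size)
    return sign * (x - means[k]) / scales[k]

rng = np.random.default_rng(0)
weights = np.array([0.3, 0.7])
means = np.array([-3.0, 4.0])
scales = np.array([1.0, 0.5])
# Here the target is exactly the mixture, the most favourable case.
labels = rng.choice(2, size=20000, p=weights)
x = rng.normal(means[labels], scales[labels])
x_warped = warp_u(x, weights, means, scales, rng)
print(x.std(), x_warped.mean(), x_warped.std())
```

When the target coincides with the mixture, as in this toy run, the warped sample is a standard Normal sample and the two modes have been folded into one. For a target that is only approximated by the mixture, the warped distribution is presumably close to the standard Normal, provided no mode has been missed.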

Equation (21) is much more illuminating (I find) than the previous explanation in that it exposes the fact that the principle is one of aiming at a new distribution for both the target and the importance function, with hopes that the fit will get better. It could have been better to avoid the notion of random transform, then, but this is mostly a matter of conveying the notion.

On more specific points (or minutiae), the unboundedness of the likelihood is rarely if ever a problem when using EM. An alternative to the multiple start EM proposal would then be to get sequential and estimate the mixture in a sequential manner, only adding a component when it seems worth it. See e.g. Chopin and Pelgrin (2004) and Chopin (2007). This could also help with the bias mentioned therein since only a (tiny?) fraction of the data would be used. And the number of components K has an impact on the accuracy of the approximation, as in not missing a mode, and on the computing time. However my suggestion would be to avoid estimating K as this must be immensely costly.

Section 6 obviously relates to my folded Markov interests. If I understand correctly, the paper argues that the transformed density Þ does not need to be computed when considering the folding-move-unfolding step as a single step rather than three steps. I fear the description between equations (30) and (31) is missing the move step over the transformed space. Also on a personal basis I still do not see how to add this approach to our folding methodology, even though the different transforms act as as many replicas of the original Markov chain.

## The one-hundred year old man who climbed out of the window and disappeared [book review]

Posted in Books with tags Arto Paasilinna, book review, Himalayas, Jonas Jonasson, Monash University, picaresque novel, the girl who saved the king of Sweden, The Long Walk, The one-hundred year old man who climbed out of the window and disappeared on September 11, 2016 by xi'an

**S**candinavian picaresque, in the spirit of the novels of Paasilinna, and following another book by Jonas Jonasson already commented on the ‘Og, The Girl who saved the King of Sweden, but not as funny, because of the heavy recourse to World history, the main (100 year old) character meeting a large collection of major historical figures. And crossing the Himalayas when escaping from a Russian Gulag, which reminded me of this fantastic if possibly apocryphal The Long Walk where a group of Polish prisoners was making it through the Gobi desert to reach India and freedom (or death). The story here is funny but not *that* funny and once it is over, there is not much to say about it, which is why I left it on a bookshare table in Monash. The current events are somewhat dull, in opposition to the 100 year life of Allan, and the police enquiry a tad too predictable. Plus the themes are somewhat comparable to The Girl who …, with atom bombs, cold war, brothers hating one another…

## MDL multiple hypothesis testing

Posted in Books, pictures, Statistics, Travel, University life with tags Australia, Bayesian tests of hypotheses, EM algorithm, minimal description length principle, mixtures of distributions, Monash University, Robert Menzies, seminar, statistical tests, Victoria on September 1, 2016 by xi'an

“This formulation reveals an interesting connection between multiple hypothesis testing and mixture modelling with the class labels corresponding to the accepted hypotheses in each test.”

**A**fter my seminar at Monash University last Friday, David Dowe pointed out to me the recent work by Enes Makalic and Daniel Schmidt on minimum description length (MDL) methods for multiple testing as somewhat related to our testing by mixture paper. Work which appeared in the proceedings of the *4th Workshop on Information Theoretic Methods in Science and Engineering (WITMSE-11)*, that took place in Helsinki, Finland, in 2011. Minimal encoding length approaches lead to choosing the model that enjoys the smallest coding length. This connects with, e.g., Rissanen’s approach. The extension in this paper consists in considering K hypotheses at once on a collection of m datasets (the *multiple* then bears on the datasets rather than on the hypotheses). And in associating a hypothesis index with each dataset. When the objective function is the sum of (generalised) penalised likelihoods [as in BIC], it leads to selecting the “minimal length” model for each dataset. But the authors introduce weights or probabilities for each of the K hypotheses, which indeed then amounts to a mixture-like representation on the exponentiated codelengths. An estimation by optimal coding that was first proposed by Chris Wallace in his book. This approach eliminates the model parameters at an earlier stage, e.g. by maximum likelihood estimation, to return a quantity that only depends on the model index and the data. *In fine*, the purpose of the method differs from ours in that the former aims at identifying an appropriate hypothesis for each group of observations, rather than ranking those hypotheses for the entire dataset by considering the posterior distribution of the weights in the latter. The mixture has somehow more of a substance in the first case, where separating the datasets into groups is part of the inference.
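To illustrate the difference between picking the minimal-length hypothesis for each dataset and the mixture-like representation on the exponentiated codelengths, here is a rough sketch. It takes as input an m×K table of codelengths (in nats, parameters already eliminated) and estimates the hypothesis weights by a plain EM recursion. Both the function names and the use of EM are assumptions of mine for the sake of the illustration and should not be read as the estimation procedure of Makalic and Schmidt.

```python
import numpy as np

def hard_assign(codelengths):
    """Index of the minimal-length hypothesis for each dataset (one per row)."""
    return codelengths.argmin(axis=1)

def em_weights(codelengths, n_iter=200):
    """EM estimate of hypothesis weights in a mixture of exponentiated codelengths."""
    m, K = codelengths.shape
    w = np.full(K, 1.0 / K)
    # Subtract the row minimum before exponentiating, for numerical stability.
    lik = np.exp(-(codelengths - codelengths.min(axis=1, keepdims=True)))
    for _ in range(n_iter):
        # E-step: probability of each hypothesis for each dataset.
        resp = w * lik
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: weights are the average responsibilities.
        w = resp.mean(axis=0)
    return w, resp

# Toy table: 30 datasets favour hypothesis 0, 10 datasets favour hypothesis 1.
codelengths = np.vstack([np.tile([1.0, 10.0], (30, 1)),
                         np.tile([10.0, 1.0], (10, 1))])
labels = hard_assign(codelengths)
w_hat, resp = em_weights(codelengths)
print(labels, w_hat)
```

With well-separated codelengths as in this toy table, the estimated weights are essentially the proportions of datasets allocated to each hypothesis, and the responsibilities reproduce the hard allocation.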