## multilevel linear models, Gibbs samplers, and multigrid decompositions

Posted in Books, Statistics, University life on October 22, 2021 by xi'an

A paper by Giacomo Zanella (formerly Warwick) and Gareth Roberts (Warwick) is about to appear in Bayesian Analysis and is (still) open for discussion. It examines in great detail the convergence properties of several Gibbs versions of the same hierarchical posterior for an ANOVA-type linear model. Although this may sound like an old-timer opinion, I find it good to have Gibbs sampling back on track! And to have further attention paid to diagnosing convergence! Also, even after all these years (!), it is always a surprise for me to (re-)realise that different versions of Gibbs sampling may hugely differ in convergence properties.

At first, intuitively, I thought the options (1,0) (c) and (0,1) (d) should perform similarly. But one is “more” hierarchical than the other. While the results exhibiting a theoretical ordering of these choices are impressive, I would suggest pursuing a random exploration of the various parameterisations in order to handle cases where an analytical ordering proves impossible. It would most likely produce a superior performance, as hinted at by Figure 4. (This alternative happens to be briefly mentioned in the Conclusion section.) The notion of choosing the optimal parameterisation at each step is indeed somewhat unrealistic in that the optimality zones exhibited in Figure 4 are unknown in models more general than the Gaussian ANOVA model. Especially with a high number of parameters, parameterisations, and recombinations in the model (Section 7).
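For a one-way Gaussian model, such a random exploration of parameterisations can be sketched as follows. This is a minimal illustration of mine, not the authors' algorithm: variances are assumed known, the prior on μ is flat, and the sampler flips a fair coin between the centred and non-centred updates at each iteration.

```python
import numpy as np

rng = np.random.default_rng(0)

# simulate one-way ANOVA data: y_ij ~ N(mu + a_i, s2), a_i ~ N(0, t2)
I, n, mu_true, s2, t2 = 20, 10, 2.0, 1.0, 1.0
a = rng.normal(0.0, np.sqrt(t2), I)
y = mu_true + a[:, None] + rng.normal(0.0, np.sqrt(s2), (I, n))
ybar = y.mean(axis=1)

T = 5000
mu, theta = 0.0, np.zeros(I)            # theta_i = mu + a_i (centred)
mus = np.empty(T)
prec = n / s2 + 1.0 / t2                # common conditional precision
for t in range(T):
    if rng.random() < 0.5:              # centred parameterisation
        theta = rng.normal((n * ybar / s2 + mu / t2) / prec,
                           1.0 / np.sqrt(prec))
        mu = rng.normal(theta.mean(), np.sqrt(t2 / I))   # flat prior on mu
    else:                               # non-centred parameterisation
        eta = rng.normal(n * (ybar - mu) / s2 / prec,    # eta_i = a_i
                         1.0 / np.sqrt(prec))
        mu = rng.normal((y - eta[:, None]).mean(), np.sqrt(s2 / (I * n)))
        theta = mu + eta
    mus[t] = mu

print(round(mus[1000:].mean(), 2))      # posterior mean of mu, near mu_true
```

A less naive version would adapt the coin probability to the observed autocorrelations, in the spirit of the random exploration suggested above.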

An idle question is about the extension to a more general hierarchical model where recentring is not feasible because of the non-linear nature of the parameters. Even though Gaussianity may not be such a restriction, in that other (if artificial) exponential families keeping the ANOVA structure should work as well.

Theorem 1 is quite impressive and wide-ranging. It also reminded (old) me of the interleaving properties and data augmentation versions of the early-day Gibbs. More to the point and to the current era, it offers more possibilities for coupling, parallelism, and accelerating convergence. And for fighting dimension curses.

“in this context, imposing identifiability always improves the convergence properties of the Gibbs Sampler”

Another idle thought of mine is to wonder whether or not there is a limited number of reparameterisations. I think that, by creating unidentifiable decompositions of (some) parameters, e.g., μ=μ¹+μ²+…, one can unrestrictedly multiply the number of parameterisations. Instead of imposing hard identifiability constraints as in Section 4.2, my intuition was that this de-identification would improve the mixing behaviour, but this somewhat clashes with the above (rigorous) statement from the authors. So I am proven wrong there!
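A toy check, consistent with the authors' statement and not taken from the paper: de-identifying a Normal mean as μ=μ¹+μ² with vague proper priors on both components makes the two-block Gibbs sampler nearly reducible along the unidentified direction, with lag-one autocorrelations close to one.

```python
import numpy as np

rng = np.random.default_rng(1)

# y_i ~ N(mu1 + mu2, 1) with vague priors mu_k ~ N(0, v): only the sum is identified
n, v = 50, 100.0
y = rng.normal(1.0, 1.0, n)
sy = y.sum()

T = 10_000
mu1, mu2 = 0.0, 0.0
trace = np.empty(T)
prec = n + 1.0 / v                      # conditional posterior precision
for t in range(T):
    # full conditional of mu1 given mu2, and symmetrically for mu2
    mu1 = rng.normal((sy - n * mu2) / prec, 1.0 / np.sqrt(prec))
    mu2 = rng.normal((sy - n * mu1) / prec, 1.0 / np.sqrt(prec))
    trace[t] = mu1

# lag-1 autocorrelation of mu1: close to one, i.e. near-reducible behaviour
rho = np.corrcoef(trace[:-1], trace[1:])[0, 1]
print(round(rho, 3))
```

The posterior correlation between μ¹ and μ² is -n/(n+1/v), so the vaguer the priors, the slower the chain, in line with “imposing identifiability always improves the convergence properties of the Gibbs Sampler”.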

Unless I missed something, I also wonder at different possible implementations of HMC depending on different parameterisations and whether or not the impact of parameterisation has been studied for HMC. (Which may be linked with Remark 2?)

## deduplication and population size estimation [discussion]

Posted in Books, Statistics on April 23, 2020 by xi'an

[Here is my discussion of the paper “A Unified Framework for De-Duplication and Population Size Estimation” by [my friends] Andrea Tancredi, Rebecca Steorts, and Brunero Liseo, to appear in the June 2020 issue of Bayesian Analysis. The deadline is 24 April. Discussions are to be submitted to BA as regular submissions.]

Congratulations to the authors for this paper, which expands the modelling of populations investigated by faulty surveys, a poor-data feature that applies to extreme cases like Syria casualties. And possibly COVID-19 victims.

The model considered in this paper, as given by (2.1), is a latent variable model which appears hyper-parameterised in the sense that it involves a large number of parameters and latent variables. First, this means it is essentially intractable outside a Bayesian resolution. Second, within the Bayesian perspective, it calls for identifiability and consistency questions, namely which fraction of the unknown entities is identifiable and which fraction can be consistently estimated, ultimately severing the dependence on the prior modelling. Personal experiences with capture-recapture models on social data like drug-addict populations showed me that prior choices often significantly drive posterior inference on the population size. Here, it seems that the generative distortion mechanism between the registry of individuals and the actual records is paramount.
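This prior sensitivity is easy to reproduce on a toy two-list capture-recapture example, a sketch of my own with made-up counts and uniform priors on the capture probabilities integrated out, not a computation from the paper: switching the prior on the population size N from uniform to 1/N visibly shifts the posterior mean.

```python
import numpy as np
from scipy.special import gammaln, betaln

# two-list capture-recapture: n1, n2 captures, m individuals seen on both lists
n1, n2, m = 30, 25, 5
r = n1 + n2 - m                       # distinct individuals observed

# integrated likelihood of N (uniform priors on the two capture probabilities)
N = np.arange(r, 5001)
logL = (gammaln(N + 1) - gammaln(N - r + 1)
        + betaln(n1 + 1, N - n1 + 1)
        + betaln(n2 + 1, N - n2 + 1))

def post_mean(logprior):
    # normalised posterior mean of N over the grid, computed stably in log scale
    logpost = logL + logprior
    w = np.exp(logpost - logpost.max())
    return (N * w).sum() / w.sum()

mean_unif = post_mean(np.zeros_like(N, dtype=float))   # uniform prior on N
mean_jeff = post_mean(-np.log(N))                      # prior proportional to 1/N
print(round(mean_unif), round(mean_jeff))              # the prior clearly matters
```

With a small overlap m, the integrated likelihood decays only polynomially in N, so the tail of the prior on N drives a non-negligible part of the answer.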

“We now investigate an alternative aspect of the uniform prior distribution of λ given N.”

Given the practical application stressed in the title, namely civil casualties in Syria, the interrogations take a more topical flavour as one wonders at the connection between the model and the actual data, between the prior modelling and the available prior information. This is however not the strategy adopted in the paper, which instead proposes a generic prior modelling that could be deemed non-informative. I find the property that conditioning on the list sizes eliminates the capture probabilities and the duplication rates quite amazing, reminding me indeed of similar properties for conjugate mixtures, although we found that property hard to exploit from a computational viewpoint. And that the hit-miss model provides computationally tractable marginal distributions for the cluster observations.

“Several records of the VDC data set represent unidentified victims and report only the date of death or do not have the first name and report only the relationship with the head of the family.”

This non-informative choice is however quite informative about the misreporting mechanism and does not address the issue that it presumably is misspecified. It indeed makes the assumption that individual label and type of record are jointly sufficient to explain the probability of misreporting the exact record. In practical cases, it seems more realistic that the probability of appearing in a list depends on the characteristics of an individual, hence is far from uniform, and unlikely to be independent from one list to the next. The same applies to the probability of being misreported. The alternative to the uniform allocation of individuals to lists found in (3.3) remains neutral as to the reasons why (some) individuals are missing from (some) lists. No informative input is indeed made here on how duplicates could appear or on how errors are made in registering individuals. Furthermore, given the high variability observed in inferring the number of actual deaths covered by the combination of the two lists, it would have been of interest to include a model comparison assessment, especially when contemplating the clash between the four posteriors in Figure 4.

The implementation of a manageable Gibbs sampler in such a convoluted model is quite impressive and one would welcome further comments from the authors on its convergence properties, since it is facing a high-dimensional space. Are there theoretical or numerical irreducibility issues, for instance created by the discrete nature of some latent variables, as in mixture models?

## deduplication and population size estimation [discussion opened]

Posted in Books, pictures, Running, Statistics, University life on March 27, 2020 by xi'an

A call (worth disseminating) for discussions on the paper “A Unified Framework for De-Duplication and Population Size Estimation” by [my friends] Andrea Tancredi, Rebecca Steorts, and Brunero Liseo, to appear in the June 2020 issue of Bayesian Analysis. The deadline is 24 April.

Data de-duplication is the process of detecting records in one or more datasets which refer to the same entity. In this paper we tackle the de-duplication process via a latent entity model, where the observed data are perturbed versions of a set of key variables drawn from a finite population of N different entities. The main novelty of our approach is to consider the population size N as an unknown model parameter. As a result, a salient feature of the proposed method is the capability of the model to account for the de-duplication uncertainty in the population size estimation. As by-products of our approach we illustrate the relationships between de-duplication problems and capture-recapture models and we obtain a more adequate prior distribution on the linkage structure. Moreover we propose a novel simulation algorithm for the posterior distribution of the matching configuration based on the marginalization of the key variables at population level. We apply our method to two synthetic data sets comprising German names. In addition we illustrate a real data application, where we match records from two lists which report information about people killed in the recent Syrian conflict.

## Colin Blyth (1922-2019)

Posted in Books, pictures, Statistics, University life on March 19, 2020 by xi'an

## logic (not logistic!) regression

Posted in Books, Statistics, University life on February 12, 2020 by xi'an

A Bayesian Analysis paper by Aliaksandr Hubin, Geir Storvik, and Florian Frommlet on Bayesian logic regression was open for discussion. Here are some hasty notes I made during our group discussion in Paris Dauphine (and later turned into a discussion submitted to Bayesian Analysis):

“Originally logic regression was introduced together with likelihood based model selection, where simulated annealing served as a strategy to obtain one “best” model.”

Indeed, logic regression is not to be confused with logistic regression! The central object of interest is a generalised linear model based on a vector of binary covariates, using some if not all possible logical combinations (trees) of said covariates (leaves). The GLM further uses rather standard indicators to signify whether or not some trees are included in the regression (and hence the model). The prior modelling on the model indices sounds rather simple (simplistic?!) in that it is only a function of the number of active trees, leading to an automated penalisation of larger trees and not accounting for a possible specificity of some covariates. For instance when dealing with imbalanced covariates (many more 1s than 0s, say).

A first question is thus how much of a novel model this is when compared with, say, an analysis of variance, since all covariates are dummy variables. How the number of trees is culled from the doubly exponential number of possible logical combinations remains obscure but, without this culling, the model is nothing but variable selection in GLMs, except for “enjoying” a massive number of variables. Note that there could be a connection with variable-length Markov chain models, but it is not exploited there.
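To make the covariate structure concrete, here is a minimal sketch of mine (not the authors' implementation): depth-one logic “trees” built from binary covariates and their negations, scored by raw agreement with a binary response. A genuine logic regression would embed such trees within a GLM and search much deeper combinations.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)

# three binary covariates; the response follows the logic tree x1 AND NOT x2,
# with 10% of the labels flipped at random
n = 500
X = rng.integers(0, 2, (n, 3)).astype(bool)
y = (X[:, 0] & ~X[:, 1]) ^ (rng.random(n) < 0.1)

# literals: each covariate and its negation
literals = {f"x{j+1}": X[:, j] for j in range(3)}
literals.update({f"~x{j+1}": ~X[:, j] for j in range(3)})

# candidate trees: single literals plus all pairwise ANDs and ORs
trees = dict(literals)
for (na, va), (nb, vb) in combinations(literals.items(), 2):
    trees[f"{na}&{nb}"] = va & vb
    trees[f"{na}|{nb}"] = va | vb

# score each tree by its agreement rate with the response
best = max(trees, key=lambda k: (trees[k] == y).mean())
print(best)
```

Even this two-leaf enumeration already produces dozens of candidate trees, which gives a feel for the doubly exponential explosion mentioned above once deeper combinations are allowed.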

“…using Jeffrey’s prior for model selection has been widely criticized for not being consistent once the true model coincides with the null model.”

A second point that strongly puzzles me in the paper is its loose handling of improper priors. It is well-known that improper priors are at worst fishy in model choice settings and at best avoided altogether, to wit the Lindley-Jeffreys paradox and friends. Not only does the paper adopt the notion of a same, improper, prior on the GLM scale parameter, which is a position adopted in some of the Bayesian literature, but it also seems to be using an improper prior on each set of parameters (further undifferentiated between models). Because the priors operate on different (sub)sets of parameters, I think this jeopardises the later discourse on the posterior probabilities of the different models, since they are not meaningful from a probabilistic viewpoint, with no joint distribution nor marginal density to serve as a reference. In some cases, p(y|M) may even become infinite. Referring to a “simple Jeffrey's” [sic] prior in this setting is therefore anything but simple, as Jeffreys (1939) himself shied away from using improper priors on the parameter of interest. I find it surprising that this fundamental and well-known difficulty with improper priors in hypothesis testing is not even alluded to in the paper. Its core setting thus seems to be flawed. Now, the numerical comparison between Jeffrey's [sic] prior and a regular g-prior exhibits close proximity and I thus wonder at the reason. Could it be that the culling and selection processes end up with the same number of variables and thus eliminate the impact of the prior? Or is it due to the recourse to a Laplace approximation of the marginal likelihood that completely escapes the lack of definition of the said marginal? Computing the normalising constant and repeating this computation while the algorithm is running ignores the central issue.
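The arbitrariness is easy to see numerically: with an improper prior ∝ c on a model-specific parameter, the “marginal likelihood” scales linearly with the arbitrary constant c, so the resulting “Bayes factor” can be set to anything. A toy illustration of the Lindley-Jeffreys issue, not a computation from the paper:

```python
import numpy as np

# single observation y ~ N(theta, 1); model M1 has an improper flat prior
# pi(theta) = c, while model M0 fixes theta = 0
y = 1.5
theta = np.linspace(-50, 50, 200_001)
dth = theta[1] - theta[0]

def marginal_M1(c):
    # "marginal likelihood" under the improper prior, by Riemann sum
    lik = np.exp(-0.5 * (y - theta) ** 2) / np.sqrt(2 * np.pi)
    return c * lik.sum() * dth

m0 = np.exp(-0.5 * y ** 2) / np.sqrt(2 * np.pi)
for c in (1.0, 10.0, 1000.0):
    # the "Bayes factor" of M1 against M0 is whatever c makes it
    print(c, marginal_M1(c) / m0)
```

Since c is not determined by the improper prior, neither is the posterior probability of M1, which is the core of the objection above.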

“…hereby, all states, including all possible models of maximum sized, will eventually be visited.”

Further, I found some confusion between principles and numerics. And, as usual, I bemoan the acronym inflation with the appearance of a GMJMCMC! Where G stands for genetic (algorithm), MJ for mode jumping, and MCMC for…, well, no surprise there! I was not aware of the mode jumping algorithm of Hubin and Storvik (2018), so cannot comment on the very starting point of the paper. A fundamental issue with Markov chains on discrete spaces is that the notion of neighbourhood becomes quite fishy and is highly dependent on the nature of the covariates. And the Markovian aspects are unclear because of the self-avoiding aspect of the algorithm. The novel algorithm is intricate and as such seems to require a superlative amount of calibration. Are all modes truly visited, really? (What are memetic algorithms?!)