Archive for NP-complete problem

unbiased product of expectations

Posted in Books, Statistics, University life with tags , , , , , , , , on August 5, 2019 by xi'an

m_biomet_106_2coverWhile I was not involved in any way, or even aware of this research, Anthony Lee, Simone Tiberi, and Giacomo Zanella have an incoming paper in Biometrika, and which was partly written while all three authors were at the University of Warwick. The purpose is to design an efficient manner to approximate the product of n unidimensional expectations (or integrals) all computed against the same reference density. Which is not a real constraint. A neat remark that motivates the method in the paper is that an improved estimator can be connected with the permanent of the n x N matrix A made of the values of the n functions computed at N different simulations from the reference density. And involves N!/ (N-n)! terms rather than N to the power n. Since it is NP-hard to compute, a manageable alternative uses random draws from constrained permutations that are reasonably easy to simulate. Especially since, given that the estimator recycles most of the particles, it requires a much smaller version of N. Essentially N=O(n) with this scenario, instead of O(n²) with the basic Monte Carlo solution, towards a similar variance.

This framework offers many applications in latent variable models, including pseudo-marginal MCMC, of course, but also for ABC since the ABC posterior based on getting each simulated observation close enough from the corresponding actual observation fits this pattern (albeit the dependence on the chosen ordering of the data is an issue that can make the example somewhat artificial).

the most probable cluster

Posted in Books, Statistics with tags , , , , , , on July 11, 2019 by xi'an

In the last issue of Bayesian Analysis, Lukasz Rajkowski studies the most likely (MAP) cluster associated with the Dirichlet process mixture model. Reminding me that most Bayesian estimates of the number of clusters are not consistent (when the sample size grows to infinity). I am always puzzled by this problem, as estimating the number of clusters sounds like an ill-posed problem, since it is growing with the number of observations, by definition of the Dirichlet process. For instance, the current paper establishes that the number of clusters intersecting a given compact set remains bounded. (The setup is one of a Normal Dirichlet process mixture with constant and known covariance matrix.)

Since the posterior probability of a given partition of {1,2,…,n} can be (formally) computed, the MAP estimate can be (formally) derived. I inserted formally in the previous sentence as the derivation of the exact MAP is an NP hard problem in the number n of observations. As an aside, I have trouble with the author’s argument that the convex hulls of the clusters should be disjoin: I do not see why they should when the mixture components are overlapping. (More generally, I fail to relate to notions like “bad clusters” or “overestimation of the number of clusters” or a “sensible choice” of the covariance matrix.) More globally, I am somewhat perplexed by the purpose of the paper and the relevance of the MAP estimate, even putting aside my generic criticisms of the MAP approach. No uncertainty is attached to the estimator, which thus appears as a form of penalised likelihood strategy rather than a genuinely Bayesian (Analysis) solution.

The first example in the paper is using data from a Uniform over (-1,1), concluding at a “misleading” partition by the MAP since it produces more than one cluster. I find this statement flabbergasting as the generative model is not the estimated model. To wit, the case of an exponential Exp(1) sample that cannot reach a maximum of the target function with a finite number of sample. Which brings me back full-circle to my general unease about clustering in that much more seems to be assumed about this notion than what the statistical model delivers.

Sudokus with minimum number of clues

Posted in Statistics, University life with tags , , on January 9, 2012 by xi'an

Yesterday, I spotted on a Spanish post on the minimal number of clues to solve a Sudoku in a unique way. The original paper was posted on arXiv on January 1, in the Data structure and algorithms category. The authors, Gary McGuire, Bastian Tugemann, and Gilles Civario from University College Dublin, have shown by an exhaustive search that no Sudoku with 16 clues exist (in the sense of having a unique solution). The actual computation actually tooks 7.1 million core hours on the Stokes machine, which is a cluster with 320 nodes, which represents about a year in real time! The Sudoku solver the authors used was Brian Turner’s open source solver. (Here is a review of solvers.) Very impressive…