Also, would you like me to enumerate several frequentist methods that are supposedly “significantly” more advanced than the “partition and combine” Bayesian algorithms?

First, frequentists take the same “partition and combine” approach for big data, for example using an “average” or “median” scheme to combine estimates from subsets.
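A minimal sketch of what I mean by the “average” or “median” combination scheme (a toy problem; all names and numbers here are my own, not from any particular paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "big data": 100k observations from N(mu, 1); the goal is to estimate mu.
mu_true = 2.0
data = rng.normal(mu_true, 1.0, size=100_000)

# Partition into K subsets, estimate on each subset, then combine.
K = 10
subset_estimates = [subset.mean() for subset in np.array_split(data, K)]

combined_average = np.mean(subset_estimates)   # the "average" scheme
combined_median = np.median(subset_estimates)  # the "median" scheme (more robust to bad subsets)
```

The median variant trades a little efficiency for robustness against a few corrupted subsets.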

Second, within-iteration parallel optimization does not always work well. For example, ADMM can require many more iterations to converge, which is likely to wipe out the savings from parallelization.

Finally, stochastic gradient descent algorithms… well, people now use them much like “partition and combine” MCMC: you run the algorithm for hours and pick the point in the trajectory with the smallest (or largest) objective value as your estimator. The performance relies heavily on how you tune the learning rate, and in most cases you don’t know whether the algorithm will eventually converge…
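That usage pattern, sketched on a toy least-squares problem (the learning rate and step count below are arbitrary choices of mine, which is exactly the tuning problem in question):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy objective: f(w) = mean((X w - y)^2), fit by single-sample SGD.
n, d = 1000, 5
X = rng.normal(size=(n, d))
w_star = rng.normal(size=d)
y = X @ w_star + 0.1 * rng.normal(size=n)

def loss(w):
    return np.mean((X @ w - y) ** 2)

w = np.zeros(d)
lr = 0.01                     # learning rate: performance hinges on this choice
best_w, best_loss = w.copy(), loss(w)

for step in range(5000):
    i = rng.integers(n)                      # one random data point
    grad = 2 * (X[i] @ w - y[i]) * X[i]      # stochastic gradient at that point
    w -= lr * grad
    # The "pick one point in the trajectory" estimator: keep the best iterate seen.
    if loss(w) < best_loss:
        best_w, best_loss = w.copy(), loss(w)
```

With a badly chosen `lr` the trajectory can oscillate or diverge, and nothing in the loop warns you; you only see it in the final loss.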

Oh, I forgot to mention that MCMC also gives you uncertainty quantification and confidence statements.

A larger sample size does not necessarily imply higher accuracy. In fact, the shrinking of the confidence interval actually increases the risk of losing robustness, because you know your models are wrong…
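A toy illustration of that point (entirely my own construction): fit a wrong model, and the confidence interval shrinks confidently around the wrong value as n grows.

```python
import numpy as np

rng = np.random.default_rng(2)

# Data are really Exponential(1): true median = ln 2 ≈ 0.693.
# Wrong model: Normal, under which the median equals the mean, so the
# model-based median estimate is the sample mean (which targets 1.0 instead).
true_median = np.log(2)

def ci_for_median(n):
    """95% CI for the median under the (wrong) normal model."""
    x = rng.exponential(1.0, size=n)
    est, se = x.mean(), x.std(ddof=1) / np.sqrt(n)
    return est - 1.96 * se, est + 1.96 * se

small = ci_for_median(100)
big = ci_for_median(100_000)
# The big-sample interval is far narrower, yet it concentrates around the
# wrong value (1.0) and excludes the true median entirely.
```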

Here’s a modern-day version of Cencov’s theorem:

http://m.pnas.org/content/108/25/10078.full.pdf

Nick

The uniqueness argument comes from the uniqueness of the Fisher-Rao metric, not the resulting measure itself.

Basically Fisher-Rao is the only metric that is consistent with the usual properties of Frequentist statistics, such as sufficiency and the like. Then, using the fact that a metric induces a unique measure (not guaranteed to be a probability measure, of course), you can argue that the Jeffreys’ prior is unique to a given likelihood function.
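To spell that out in symbols (these are the standard definitions, not anything specific to this thread): the Fisher-Rao metric is the Fisher information, and the Riemannian volume element it induces is exactly the Jeffreys prior.

```latex
g_{ij}(\theta)
  = \mathbb{E}_{x \sim p(\cdot \mid \theta)}\!\left[
      \partial_i \log p(x \mid \theta)\,
      \partial_j \log p(x \mid \theta)
    \right],
\qquad
\pi_J(\theta)\,d\theta \;\propto\; \sqrt{\det g(\theta)}\;d\theta .
```

The volume measure is unique given the metric, which is where the uniqueness of the prior comes from; it need not normalize to a probability measure.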

If you relax the conditions on Fisher-Rao (in particular, if you do mathematically suspect things like adding the Hessian of a prior density) then you no longer get a unique metric, and hence no unique Jeffreys prior.

“This criticism is receivable when there is a huge number of possible values of N, even though I see no fundamental contradiction with my ideas about Bayesian computation. However, it is more debatable when there are a few possible values for N, given that the exploration of the augmented space by a RJMCMC algorithm is often very inefficient, in particular when the proposed parameters are generated from the prior.”

Yes, I agree. This paragraph was aimed at astronomers, many of whom only know about the ‘different trial values of N’ approach.

“The more when nested sampling is involved and simulations are run under the likelihood constraint!”

I think it’s less. The DNS target distribution is usually easier to sample than the posterior: the posterior might be dominated by levels 50–70 (say), yet the trans-dimensional moves might be accepted often in level 30, where the likelihood constraint is lower.

It’s much more common that the phase transition occurs at a higher temperature, and then it only affects marginal likelihoods. I’d bet there are many wrong marginal likelihoods in the literature because of phase transitions, but I doubt there are many incorrect posterior distributions. One example of an incorrect posterior distribution is in this strange paper by Carlos Rodriguez, who thinks we should all use Jeffreys priors: http://arxiv.org/abs/0709.1067 For his non-Jeffreys prior, the only thing that failed was his MCMC run, which didn’t mix between the two phases.
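The failure-to-mix mode is easy to reproduce on a toy “two-phase” target (my own illustration, not the setup in that paper): a random-walk chain started in one well-separated mode simply never visits the other, and the resulting posterior looks perfectly converged while being completely wrong about the second phase.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy "two-phase" target: equal mixture of N(-10, 1) and N(+10, 1).
def log_target(x):
    return np.logaddexp(-0.5 * (x + 10) ** 2, -0.5 * (x - 10) ** 2)

# Random-walk Metropolis with a step size tuned for a single mode.
x, step = -10.0, 1.0
samples = []
for _ in range(20_000):
    prop = x + step * rng.normal()
    if np.log(rng.random()) < log_target(prop) - log_target(x):
        x = prop
    samples.append(x)

samples = np.array(samples)
frac_right_mode = np.mean(samples > 0)
# The chain cannot cross the ~10-sigma gap between modes, so the
# "posterior" it reports puts all of its mass on one phase.
```

Within the left mode every convergence diagnostic would look fine, which is exactly why these failures slip into the literature.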

I think John Skilling’s obtuse writing style is partly to blame for people’s lack of understanding of these problems. If you read his 2006 paper in Bayesian Analysis, it’s mostly about phase transitions, yet many papers since then use NS just because they feel like it / it sounds cool.
