Archive for optimisation

Le Monde puzzle [#1070]

Posted in Books, Kids, R, University life on October 11, 2018 by xi'an

Rewording Le Monde mathematical puzzle fifth competition problem

For the 3×3 tables below, what is the minimal number of steps needed to move from the left configuration to the right one, when a yellow token can only be moved to an empty location surrounded by two other tokens?

In the 4×4 table below, there are 6 green tokens. How many steps from left to right?

Painful and moderately mathematical, once more… For the first question, a brute force simulation of random valid moves of length less than 100 returns solutions in 4 steps:

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,]    1    1    1    0    0    1    0    0    1
[2,]    1    0    1    0    1    1    0    0    1
[3,]    0    0    1    1    1    1    0    0    1
[4,]    0    0    1    1    0    1    0    1    1
[5,]    0    0    1    0    0    1    1    1    1

But this is not an acceptable move because of the “other” constraint. Imposing this constraint leads to a solution in 9 steps, but is this the lowest bound?! It actually took me most of the weekend (apart from a long drive to and from a short half-marathon!) to figure out a better strategy than brute-force random exploration: the trick I eventually settled on is to start from the finishing (rightmost) configuration F of the grid and to list the configurations from which F can be reached in 1, 2, … steps. This is not exactly dynamic programming, but it keeps the running time under control, provided there is a solution associated with the starting (leftmost) configuration S. (Robin proceeded the reverse way, which in retrospect is presumably equivalent, if faster!) The 3×3 grid has 9 choose 5, i.e. 126, possible configurations with 5 tokens, which means the number of cases remains under control, and the same holds for the 4×4 grid with 6 tokens and its 8008 configurations. This led to a 9-step solution for n=3 and the proposed starting grid in yellow (an R sketch of this backward exploration is appended at the end of this post):

[1] 1 1 1 0 0 1 0 0 1
[1] 1 1 0 0 1 1 0 0 1
[1] 1 1 0 1 1 0 0 0 1
[1] 0 1 0 1 1 1 0 0 1
[1] 0 1 1 1 0 1 0 0 1
[1] 1 1 1 1 0 0 0 0 1
[1] 0 1 1 1 1 0 0 0 1
[1] 0 0 1 1 1 0 0 1 1
[1] 0 0 1 1 0 0 1 1 1
[1] 0 0 1 0 0 1 1 1 1

and a 19-step solution for n=4:

[1] 1 0 0 0 0 1 0 0 0 1 1 1 0 0 0 1
[1] 1 0 0 0 0 1 0 0 0 0 1 1 0 0 1 1
[1] 1 0 0 0 0 1 1 0 0 0 1 1 0 0 1 0
[1] 1 1 0 0 0 1 1 0 0 0 0 1 0 0 1 0
[1] 1 1 0 0 0 0 1 1 0 0 0 1 0 0 1 0
[1] 1 1 1 0 0 0 1 1 0 0 0 0 0 0 1 0
[1] 1 0 1 1 0 0 1 1 0 0 0 0 0 0 1 0
[1] 1 1 1 1 0 0 1 0 0 0 0 0 0 0 1 0
[1] 1 1 0 1 0 1 1 0 0 0 0 0 0 0 1 0
[1] 1 0 0 1 1 1 1 0 0 0 0 0 0 0 1 0
[1] 0 0 0 1 1 1 1 0 0 0 1 0 0 0 1 0
[1] 0 0 0 1 1 1 0 0 0 1 1 0 0 0 1 0
[1] 0 0 0 1 1 1 0 0 1 1 0 0 0 0 1 0
[1] 0 0 0 1 0 1 0 0 1 1 0 0 0 1 1 0
[1] 0 0 0 1 0 1 0 0 1 0 0 0 1 1 1 0
[1] 0 0 0 1 1 1 0 0 1 0 0 0 1 0 1 0
[1] 0 0 0 1 0 1 0 0 1 1 0 0 1 0 1 0
[1] 0 0 0 1 0 1 0 0 0 1 1 0 1 0 1 0
[1] 0 0 0 1 0 1 1 0 0 1 1 0 1 0 0 0

The first resolution takes less than a minute and the second one just a few minutes (or less than my short morning run!). Surprisingly, using this approach does not require much more work, which makes me wonder what solution the Le Monde journalists will propose. Given the (misguided) effort put into my resolution, it is no wonder this puzzle is worth a larger number of points.
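Here is a minimal R sketch of the backward exploration described above, for the 3×3 grid. Two assumptions are mine, not the puzzle's: the move rule is read as “the empty destination cell must have at least two occupied orthogonal neighbours other than the cell the token leaves”, and the start and target grids are taken from the first and last rows of the 9-step path printed above, so the returned count may differ from 9 if this reading of the rule is off.

n=3
neigh=function(i){ # orthogonal neighbours of cell i (row-major indexing, 1..n^2)
  r=(i-1)%/%n; c=(i-1)%%n
  cand=rbind(c(r-1,c),c(r+1,c),c(r,c-1),c(r,c+1))
  cand=cand[cand[,1]>=0&cand[,1]<n&cand[,2]>=0&cand[,2]<n,,drop=FALSE]
  cand[,1]*n+cand[,2]+1}
preds=function(x){ # grids from which x is reachable in a single move
  out=list()
  for (to in which(x==1)) for (from in which(x==0)){
    y=x; y[to]=0; y[from]=1
    if (sum(y[neigh(to)])-(from%in%neigh(to))>=2) # “two other tokens” rule
      out[[length(out)+1]]=y}
  out}
S=c(1,1,1,0,0,1,0,0,1)   # start (first row of the path above)
Fin=c(0,0,1,0,0,1,1,1,1) # target (last row of the path above)
layer=list(Fin); seen=paste(Fin,collapse=""); steps=0
while (!(paste(S,collapse="")%in%seen)&(length(layer)>0)){
  nxt=list()
  for (x in layer) for (y in preds(x)){
    key=paste(y,collapse="")
    if (!(key%in%seen)){ seen=c(seen,key); nxt[[length(nxt)+1]]=y}}
  layer=nxt; steps=steps+1}
if (paste(S,collapse="")%in%seen) steps else NA # minimal number of moves, if reachable

Changing n, neigh, S and Fin handles the 4×4 grid in exactly the same way.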

Le Monde puzzle [#1068]

Posted in Books, Kids, R on September 26, 2018 by xi'an

And here is the third Le Monde mathematical puzzle open for competition:

Consider for this puzzle only integers with no zero digits. Among these, an integer x=a¹a²a³… is refined if it is a multiple of its scion, the integer defined as x without its first digit, y=a²a³…. Find the largest refined integer x such that the sequence of the successive scions of x with more than one digit is entirely refined. An integer x=a¹a²a³… is distilled if it is a multiple of its grand-scion, the integer defined as x without its first two digits, z=a³…. Find the largest distilled integer x such that the sequence of the successive scions of x with more than two digits is entirely distilled.

Another puzzle amenable to an R resolution by low-tech exploration of possible integers, first by finding refined integers, with no solution between 10⁶ and 10⁸ [meaning there is no refined integer larger than 10⁶], and then by checking which refined integers have refined descendants, e.g.,

raf=NULL
for (x in (1e1+1):(1e6-1)){
  y=x%%(10^trunc(log(x,10))) # scion of x, i.e. x without its first digit
  if (y>0){
    if (x%%y==0) # refined: x is a multiple of its scion
      raf=c(raf,x)}}

that returns 95 refined integers. And then

for (i in length(raf):1){ # scan the refined integers from largest to smallest
  gason=raf[i]
  keep=(gason%in%raf)
  while (keep&(gason>100)){ # walk down the chain of scions
    gason=gason%%(10^trunc(log(gason,10))) # next scion
    keep=keep&(gason%in%raf)} # which must also be refined
  if (keep) break} # stop at the largest refined integer with a fully refined chain

that returns 95,625 as the largest refined integer with the proper descendance. Rather than finding all refined integers at once, it is actually faster to proceed one digit at a time and only extend integers whose descendants are all refined, as sketched below.
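A minimal sketch of this one-digit-at-a-time construction (my own rendering, not the code used for the post): since x=d·10ᵏ+y, with k-digit scion y, is refined if and only if y divides d·10ᵏ, integers with an entirely refined chain of scions can be grown by prepending one non-zero leading digit at a time.

level=NULL # two-digit seeds: no zero digit, multiple of their last digit (their scion)
for (x in 11:99)
  if (x%%10>0 && x%%(x%%10)==0) level=c(level,x)
best=max(level)
p=100 # 10^k where k is the number of digits of the current level
while (length(level)>0){
  nxt=NULL
  for (y in level) for (d in 1:9) # prepend a non-zero leading digit d
    if ((d*p)%%y==0) nxt=c(nxt,d*p+y) # d·10^k+y is refined iff y divides d·10^k
  if (length(nxt)>0) best=max(nxt)
  level=nxt; p=10*p}
best # should recover the 95,625 figure above

Every extension keeps the whole chain of scions refined by construction, so the deepest non-empty level contains the largest such integer.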

Similarly, running an exploration code up to 10⁹ produces 1748 distilled integers, with maximum 997,734,375, so it is unlikely that 10⁹ is the right upper bound, but among these the maximum with the right distilled descendants is 81,421,875, as derived by

rad=(1:99)[(1:99)%%10>0] # one- and two-digit integers with no zero digit
solz=topsol=0
for (dig in 2:12){
 for (x in ((10^dig+1):(10^(dig+1)-1))){ # x has dig+1 digits
  y=x%%(10^(dig-1)) # grand-scion of x, i.e. x without its first two digits
  if (y>0){
   if (x%%y==0){ # distilled: x is a multiple of its grand-scion
    if (min(as.integer(substring(x,seq(nchar(x)),seq(nchar(x)))))>0){ # no zero digit
     rad=c(rad,x)
     y=x%%(10^dig) # scion of x
     keep=(y%in%rad)
     while (keep&(y>1e3)){ # walk down the chain of scions
       y=y%%(10^trunc(log(y,10)))
       keep=keep&(y%in%rad)}
     if (keep) solz=x}}}}
 if (solz<10^dig) break # no solution with dig+1 digits, hence stop
 topsol=solz}

Le Monde puzzle [#1066]

Posted in Books, Kids, R on September 13, 2018 by xi'an

Recalling Le Monde mathematical puzzle first competition problem

For the X-shaped table below, what is the minimal number of lights that must be on (green) to reach the minimal and the maximal possible numbers of entries (P, as in pair, French for even) with an even number of neighbours with lights on? In the illustration below, there are 16 lights on (green) and 31 entries with an even number of on-neighbours.

As suggested last week, this was amenable to an R resolution by low-tech simulated annealing, although the number of configurations is not that large when accounting for symmetries. The R code is a wee bit long for full reproduction here, but it works by moving away from a random filling of the cross with 0's and 1's towards minimising or maximising the number of P's, this simulated annealing loop being inserted within another loop recording the minimal number of 1's in both cases. A first round produced 1 and 44 for the minimal and maximal numbers of P's, associated with at least 16 and 3 1's, respectively, but a longer run exhibited 45 P's for 6 1's crossing one of the diagonals of the X, made of two aligned diagonals of the outer 3×3 tables. (This was confirmed by both Amic and Robin in Dauphine!) A skeleton of the annealing loop is appended below. The next [trigonometry] puzzle is on!
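Here is a skeletal version of such a simulated annealing move, maximising the number of P's only (the outer loop tracking the minimal number of lights on is omitted). The geometry is my own reading of the puzzle rather than a given: the X is taken as the union of five 3×3 blocks of a 9×9 layout (the four corners plus the centre, hence 45 entries), and the neighbours of an entry are the adjacent cells, diagonals included, that belong to the X.

inX=function(i,j) ((i-1)%/%3+(j-1)%/%3)%%2==0 # corner and centre 3x3 blocks [assumed layout]
cells=which(outer(1:9,1:9,inX),arr.ind=TRUE) # the 45 (row,col) entries of the X
nbrs=lapply(1:nrow(cells),function(a) # neighbours within the X, diagonals included
  which(pmax(abs(cells[,1]-cells[a,1]),abs(cells[,2]-cells[a,2]))==1))
npair=function(light) # number of entries with an even number of on-neighbours
  sum(sapply(nbrs,function(nb) sum(light[nb])%%2==0))
set.seed(1)
light=rbinom(nrow(cells),1,.5) # random initial filling with 0's and 1's
cur=best=npair(light)
temp=1
for (t in 1:1e4){
  prop=light
  flip=sample(nrow(cells),1)
  prop[flip]=1-prop[flip] # flip a single light
  val=npair(prop)
  if ((val>=cur)|(runif(1)<exp((val-cur)/temp))){ # annealing acceptance step
    light=prop; cur=val; best=max(best,cur)}
  temp=.999*temp} # geometric cooling
best # largest number of P's found in this run

Minimising is handled the same way by flipping the sign of the criterion, and the outer loop of the post then records, for each value reached, the smallest number of lights on.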

optimal approximations for importance sampling

Posted in Mountains, pictures, Statistics, Travel on August 17, 2018 by xi'an

“…building such a zero variance estimator is most of the times not practical…”

As I was checking on optimal importance functions [while looking at Tofino inlet from my rental window], following a question on X validated, I came across this arXived note by Pantaleoni and Heitz, where they suggest using weighted sums of step functions to reach minimum variance. However, the difficulty with probability densities that are step functions is that they necessarily have a compact support, which makes them unsuitable for targeted integrands with non-compact support, and which makes the purpose of the note and the derivation of the optimal weights somewhat moot. The note points out its connection with the reference paper of Veach and Guibas (1995), as well as with He and Owen (2014), a follow-up to the other reference paper by Owen and Zhou (2000).
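As a reminder of the zero-variance benchmark alluded to in the quote (a textbook fact, not specific to the note): when estimating I=∫h(x)f(x)dx by importance sampling from g, the variance-minimising choice is g*(x)∝|h(x)|f(x), and the variance is exactly zero when h does not change sign, since the weighted integrand is then constant,

h(X)f(X)/g*(X) = ∫h(y)f(y)dy   almost surely when h≥0,

which is indeed impractical as it requires knowing the very integral one seeks.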

Bayesian gan [gan style]

Posted in Books, pictures, Statistics, University life on June 26, 2018 by xi'an

In their paper Bayesian GANs, arXived a year ago, Saatchi and Wilson consider a Bayesian version of generative adversarial networks, putting priors on both the model and the discriminator parameters. While the prospect seems somewhat remote from genuine statistical inference, if the following statement is representative

“GANs transform white noise through a deep neural network to generate candidate samples from a data distribution. A discriminator learns, in a supervised manner, how to tune its parameters so as to correctly classify whether a given sample has come from the generator or the true data distribution. Meanwhile, the generator updates its parameters so as to fool the discriminator. As long as the generator has sufficient capacity, it can approximate the cdf inverse-cdf composition required to sample from a data distribution of interest.”

I figure the concept can also apply to a standard statistical model, where x=G(z,θ) rephrases the distributional assumption x~F(x;θ) via a white noise z. This makes resorting to a prior distribution on θ more relevant in the sense of using potential prior information on θ (although the successes of probabilistic numerics show formal priors can be used on purely numerical grounds).
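For instance, with a hypothetical Gaussian model (my example, not the paper's), writing

x = G(z,θ) = μ + σz,   z~N(0,1)

rephrases x~N(μ,σ²) with θ=(μ,σ).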

The “posterior distribution” that is central to the notion of Bayesian GANs is however unorthodox in that the distribution is associated with the following conditional posteriors
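In my transcription of the paper's conditionals (hyperparameters and sample sizes omitted, zᵢ denoting the white noise inputs, θ the generator parameter and θ_d the discriminator parameter), they are of the form

p(θ|z,θ_d) ∝ [∏ᵢ D(G(zᵢ,θ),θ_d)] p(θ)   (1)

p(θ_d|z,x,θ) ∝ [∏ᵢ D(xᵢ,θ_d)] [∏ᵢ (1−D(G(zᵢ,θ),θ_d))] p(θ_d)   (2)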

where D(·,θ_d) is the “discriminator”, that is, in GAN lingo, the probability of being allocated to the “true” data generating mechanism rather than to the one associated with the generator G(·,θ). The generative conditional posterior (1) then aims at fooling the discriminator, i.e. favours generative parameter values that raise the probability of wrong allocation of the pseudo-data. The discriminative conditional posterior (2) is a standard Bayesian posterior based on the original sample and the generated sample. The authors then iteratively sample from these posteriors, effectively implementing a two-stage Gibbs sampler.

“By iteratively sampling from (1) and (2) at every step of an epoch one can, in the limit, obtain samples from the approximate posteriors over [both sets of parameters].”

What worries me about this approach is that it just cannot work, in the sense that (1) and (2) cannot be compatible conditional (posterior) distributions. There is no joint distribution for which (1) and (2) would be the conditionals, since the pseudo-data appears through D in (1) and through (1−D) in (2). This means that the convergence of the Gibbs sampler is at best towards a stationary σ-finite measure, and hence that the meaning of the resulting chain is delicate to ascertain… Am I missing any fundamental point?! [I checked the reviews on the NIPS webpage and could not spot this issue being raised.]

are profile likelihoods likelihoods?!

Posted in Books, Kids, Statistics, University life on March 27, 2018 by xi'an

A recent arXived paper by Oliver J. Maclaren is asking this very question. And argues for a positive answer. One of the invoked sources is Murray Aitkin's integrated likelihood book, which I criticised here and elsewhere. With the idea of the paper being that

“….there is an appropriate notion of integration over variables that takes likelihood functions to likelihood functions via maximization.”

Hmm…. The switch there is to replace addition with maximisation, probability with possibility, and… the profile likelihood with a marginal possibility under this new concept. I just do not see how adapting these concepts for the interpretation of the profile likelihood makes the latter more meaningful, since it still overwhelmingly does not result from a distribution density at an observed realisation of a random variable. This reminds me of a paper I refereed quite a long while ago where the authors were using Schwartz's theory of distributions to expand the notion of unbiasedness. With unclear consequences.

1500 nuances of gan [gan gan style]

Posted in Books, Statistics, University life on February 16, 2018 by xi'an

I recently realised that there is a currently very popular trend in machine learning called GAN [for generative adversarial networks] that strongly connects with ABC, at least in that it relies mostly on the availability of a generative model, i.e., a probability model whose realisations can be generated as x=G(ϵ;θ), to draw inference about θ [or predictions]. For instance, there was a GAN tutorial at NIPS 2016 by Ian Goodfellow and many talks on the topic at recent NIPS meetings, the 1500 in the title referring to the number of citations of the GAN paper by Goodfellow et al. (2014). (The name adversarial comes from opposing the true model to the generative model in the inference.)

If you remember Jeffreys‘s famous pique about classical tests being based on improbable events that did not happen, GAN, like ABC, is sort of the opposite in that it generates events until the one that was observed happens. More precisely, by generating pseudo-samples and moving the parameter θ until these samples get as confused as possible between the data generating (“true”) distribution and the generative one. (In its original incarnation, GAN is indeed an optimisation scheme in θ.) A basic presentation of GAN is that it constructs a function D(x,ϕ) that represents the probability that x came from the true model p rather than from the generative model, ϕ being the parameter of a neural network trained to this effect by maximising in ϕ (while minimising in θ) the two-term objective function

E[log D(x,ϕ)] + E[log(1−D(G(ϵ;θ),ϕ))]

where the first expectation is taken under the true model and the second one under the generative model.

“The discriminator tries to best distinguish samples away from the generator. The generator tries to produce samples that are indistinguishable by the discriminator.” Edward

One ABC perception of this technique is that the confusion rate

E[log(1−D(G(ϵ;θ),ϕ))]

is a form of distance between the data and the generative model. Which expectation can be approximated by repeated simulations from this generative model. Which suggests an extension from the optimisation approach to an ABCyesian version, by selecting the smallest distances across a range of θ's simulated from the prior.
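As a toy illustration of this ABCyesian version (entirely mine, not taken from any of the papers mentioned here), consider a hypothetical location model x=θ+ϵ and let a plain logistic regression stand in for the neural network discriminator D(·,ϕ):

set.seed(42)
x0=rnorm(100,mean=2) # observed sample, “true” θ equal to 2
confuse=function(theta,n=length(x0)){ # confusion rate for a given θ
  z=rnorm(n,mean=theta) # pseudo-sample from the generative model
  lab=rep(1:0,each=n)   # 1 for true data, 0 for generated data
  fit=glm(lab~c(x0,z),family=binomial) # logistic “discriminator” (may warn about separation)
  D=fitted(fit)[-(1:n)] # estimated probability of “true” for the generated points
  mean(log(1-D))}       # empirical version of E[log(1−D(G(ϵ;θ),ϕ))]
thetas=runif(1e3,-5,5)  # simulations from a flat prior
dists=sapply(thetas,confuse)
mean(thetas[rank(dists)<=50]) # keep the θ's with the smallest “distances”

The retained θ's concentrate near the true value, mimicking an ABC sample with the confusion rate acting as the distance.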

This notion relates to the use of classification tools for density ratio estimation, connecting for instance with Gutmann and Hyvärinen (2012), and ultimately with Geyer's 1992 normalising constant estimator.

Another link between ABC and networks also came out during that trip. Proposed by Bishop (1994), mixture density networks (MDN) are mixture representations of the posterior [with component parameters functions of the data] trained on the prior predictive through a neural network. These MDNs can be trained on the ABC learning table [based on a specific if redundant choice of summary statistics] and used as substitutes to the posterior distribution, which brings an interesting alternative to Simon Wood’s synthetic likelihood. In a paper I missed Papamakarios and Murray suggest replacing regular ABC with this version…