# Ergodicity of Combocontinuous Adaptive MCMC algorithms

This preprint by Jeff Rosenthal and Jinyoung Yang (currently available from Jeff's webpage) might also be called “Easily verifiable adaptive MCMC”. Jeff Rosenthal gave a tutorial on adaptive MCMC during MCMSki 2016 in which he mentioned this work. Adaptive MCMC is based on the idea that one can use the information gathered from sampling a distribution using MCMC to improve the efficiency of the sampling process.

If two conditions, diminishing adaptation and containment, are satisfied, an adaptive MCMC algorithm is valid in the sense of being asymptotically consistent. Diminishing adaptation means that two consecutive Markov kernels in the algorithm will be asymptotically equal. In other words, we either stop adaptation at some point or we know that the adaptation algorithm converges.
Containment means that the number of repeated applications of any of the used Markov kernels needed to get close to the target measure is bounded. Concretely, let $\gamma$ be a Markov kernel index and $P_\gamma^m(x,\cdot)$ be the distribution resulting from m-fold application of kernel $P_\gamma$ starting from $x$. In other words, start MCMC at point x with kernel $P_\gamma$, let it run for m iterations and consider the induced distribution for the last point. Let $\pi$ be the target distribution. Then containment requires that
$\{M_\epsilon(X_n, \Gamma_n)\}_{n=1}^\infty$ is bounded in probability for all $\epsilon > 0$. Here $M_\epsilon(x, \gamma) = \inf \{ m \geq 1 : \| P_\gamma^m(x,\cdot) - \pi(\cdot) \|_\textrm{TV} \leq \epsilon \}$ and $\|\cdot\|_\textrm{TV}$ is a worst case distance between distributions (total variation distance).
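
To make the quantity $M_\epsilon(x, \gamma)$ a bit more tangible, here is a minimal sketch for a single fixed kernel on a finite state space, where the kernel is just a transition matrix and total variation distance is half the $\ell_1$ distance. The two-state matrix `P`, its stationary distribution `pi` and the function names are my own toy assumptions, not anything taken from the paper.

```python
import numpy as np

def tv_distance(p, q):
    """Total variation distance between two probability vectors."""
    return 0.5 * np.abs(p - q).sum()

def m_epsilon(P, x, pi, eps, max_iter=10_000):
    """Smallest m >= 1 with ||P^m(x, .) - pi||_TV <= eps (None if never reached)."""
    dist = np.zeros(P.shape[0])
    dist[x] = 1.0                      # point mass at the starting state x
    for m in range(1, max_iter + 1):
        dist = dist @ P                # one more application of the kernel
        if tv_distance(dist, pi) <= eps:
            return m
    return None

# Toy kernel (an assumption for illustration): two states, stationary law (2/3, 1/3).
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
pi = np.array([2.0 / 3.0, 1.0 / 3.0])

steps_from_0 = m_epsilon(P, 0, pi, eps=0.01)
steps_from_1 = m_epsilon(P, 1, pi, eps=0.01)
```

Containment asks that this kind of number, evaluated along the random sequence of states and adapted kernels, does not blow up in probability.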

The paper is concerned with trying to find conditions for containment in adaptive MCMC that are more easily verified than those from earlier papers. First however it gives a kind of blueprint for adaptive algorithms that satisfy containment.

### A blueprint for consistent adaptive MCMC

Namely, let $\mathbb{R}^d$ be the support of the target distribution, $K \subseteq \mathbb{R}^d$ some large bounded region and $D > 0$ some large constant. The blueprint, Bounded Adaptive Metropolis, is the following:

Start the algorithm at some $X_0 \in K$ and fix a $d \times d$ covariance matrix $\Sigma_*$. At iteration n generate a proposal $Y_{n+1}$ by

(1)$Y_{n+1} \sim \mathcal{N}(X_n, \Sigma_*)~\textrm{if}~X_n \notin K$
(2)$Y_{n+1} \sim \mathcal{N}(X_n, \Sigma_{n+1})~\textrm{if}~X_n \in K$

Reject if $|Y_{n+1} - X_{n}| > D$, else accept with the usual Metropolis-Hastings acceptance probability. The $\Sigma_{n+1}$ can be chosen almost arbitrarily as long as the diminishing adaptation condition is met, so the mechanism for choosing it either is fixed asymptotically or converges.
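
The blueprint can be sketched in a few lines of Python. This is only my reading of it under explicit assumptions: $K$ is taken to be a ball of radius `k_radius` around the origin, $\Sigma_* = I$, and $\Sigma_{n+1}$ is a scaled running empirical covariance (the classic $2.38^2/d$ scaling plus a small jitter), whose updates shrink like $1/n$. None of these specific choices come from the paper. Because the proposal covariance depends on whether the current point is in $K$, the acceptance ratio includes the proposal densities in both directions.

```python
import numpy as np

def gauss_logpdf(y, x, sigma):
    """Log density of N(x, sigma) evaluated at y."""
    diff = y - x
    _, logdet = np.linalg.slogdet(sigma)
    quad = diff @ np.linalg.solve(sigma, diff)
    return -0.5 * (quad + logdet + diff.size * np.log(2 * np.pi))

def bounded_adaptive_metropolis(log_pi, x0, n_iter, k_radius=10.0, D=5.0, seed=0):
    rng = np.random.default_rng(seed)
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    d = x.size
    sigma_star = np.eye(d)               # fixed covariance used outside K (assumption)
    mean, cov = x.copy(), np.eye(d)      # running estimates that drive the adaptation
    chain = np.empty((n_iter, d))

    def proposal_cov(point, sigma_adapt):
        # K is assumed to be the ball of radius k_radius around the origin.
        inside_K = np.linalg.norm(point) <= k_radius
        return sigma_adapt if inside_K else sigma_star

    for n in range(n_iter):
        sigma_adapt = (2.38 ** 2 / d) * cov + 1e-6 * np.eye(d)
        sig_x = proposal_cov(x, sigma_adapt)
        y = rng.multivariate_normal(x, sig_x)
        if np.linalg.norm(y - x) <= D:   # proposals farther than D are rejected outright
            sig_y = proposal_cov(y, sigma_adapt)
            log_alpha = (log_pi(y) + gauss_logpdf(x, y, sig_y)
                         - log_pi(x) - gauss_logpdf(y, x, sig_x))
            if np.log(rng.uniform()) < log_alpha:
                x = y
        chain[n] = x
        # Recursive mean/covariance update; the weight 1/(n+2) makes changes diminish.
        w = 1.0 / (n + 2)
        delta = x - mean
        mean = mean + w * delta
        cov = (1 - w) * cov + w * np.outer(delta, x - mean)
    return chain

# Usage on a toy target (standard normal in two dimensions).
log_pi = lambda z: -0.5 * z @ z
chain = bounded_adaptive_metropolis(log_pi, np.zeros(2), n_iter=2000)
```

The hard cut at distance `D` is what gives condition (a) below for free, and using `sigma_star` outside the ball is a stand-in for the fixed kernel of condition (b).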

It would seem to me that we can actually change the distribution in (2) arbitrarily if we continue to meet diminishing adaptation. So for example we could use an independent Metropolis, adaptive Langevin or other sophisticated proposal inside K, so long as condition (e) in the paper is satisfied, i.e. the adaptive proposal distribution used in (2) is continuous in $X_n$. This leads us to the actual conditions for containment.

### General conditions for containment in adaptive MCMC

Let $\mathcal{X}$ be a general state space. For example in the Bounded Metropolis we had $\mathcal{X}=\mathbb{R}^d$. The conditions the authors give are (even more simplified by me):

(a)  The probability to move more than some finite distance D > 0 is zero: $Pr(|X_{n+1} - X_n| > D) = 0$
(b) Outside of K, the algorithm uses a fixed transition kernel P that never changes (and still respects that we can at most move D far away)
(c) The fixed kernel P is bounded above by $P(x, dy) \leq M \mu_*(dy)$ for finite constant M > 0 and all x that are outside K but no farther from it than D (call that set $K_D$) and all y that are between D and 2D distance from K (call that set $K_{2D} \backslash K_D$). Here $\mu_*$ is any distribution concentrated on $K_{2D} \backslash K_D$.
(d) The fixed kernel P is bounded below by $P^{n_0}(x, A) \geq \epsilon \nu_*(A)$ for some measure $\nu_*$ on $\mathcal{X}$, some $n_0 \in \mathbb{N}$ and some event A.
(e) Let $\gamma$ be the parameter adapted by the algorithm. The overall proposal densities $q_\gamma(x,y)$ (combining the proposal in and outside of K) are continuous in $\gamma$ for fixed (x,y) and combocontinuous in x. In practice, this means that the fixed proposal used outside K and the adaptive proposal used inside K are both continuous.

Here, conditions (a) and (b) are very easy to ensure even for someone who is not an expert on MCMC. Conditions (c) and (d) sound harder, but as mentioned above it seems to me that they are easy to ensure by just using a (truncated, i.e. respecting (a)) Gaussian random walk proposal outside of K. Finally, (e) seems to boil down to making the adaptive proposal continuous in both $\gamma$ and x.

The proofs use a generalization of piecewise continuous functions and a generalized version of Dini's theorem to prove convergence in total variation distance.

This paper seems to me to be a long way from Roberts & Rosenthal (2007, Journal of Applied Probability), which was the first paper I read on ergodicity conditions for adaptive MCMC. It truly makes checking containment much easier. My one concern is that the exposition could be clearer for people who are not MCMC researchers. Then again, this is a contribution paper rather than a tutorial.

# Une lèpre démocratique

This article in Le Monde (in French) covers Emmanuel Macron's speech in Lyon yesterday. I must admit that I know little more about Macron's program than what the article reports. One of the most interesting parts is that Macron is decidedly pro-European. What caught my eye though was his metaphor regarding the populist right wing's spreading of fear and its method of speaking almost exclusively to the population's anger. Macron called this the leprosy of democracy.

I couldn’t agree more and must say that I marvel at the strength of this image. While he was speaking about the French Front National, the exact same argument applies to the German AfD, as well as to the countries where egocentric white men have made voters believe they would act in the public interest, like the US, Hungary, and Poland.

# Operator Variational Inference

This NIPS 2016 paper by Ranganath et al. is concerned with Variational Inference using objective functions other than KL-divergence between a target density $\pi$ and a proposal density $q$. It’s called Operator VI as a fancy way to say that one is flexible in constructing how exactly the objective function uses $\pi, q$ and test functions from some family $\mathcal{F}$. I completely agree with the motivation: KL-Divergence in the form $\int q(x) \log \frac{q(x)}{\pi(x)} \mathrm{d}x$ indeed underestimates the variance of $\pi$ and approximates only one mode. Using KL the other way around, $\int \pi(x) \log \frac{\pi(x)}{q(x)} \mathrm{d}x$ takes all modes into account, but still tends to underestimate variance.

As a particular case, the authors suggest an objective using what they call the Langevin-Stein Operator, which does not make use of the proposal density $q$ at all but uses test functions exclusively. The only requirement is that we be able to draw samples from the proposal. The authors claim that assuming access to $q$ limits applicability of an objective/operator. This claim is not substantiated, however. The example they give in equation (10) is that it is not possible to find a Jacobian correction for a certain transformation of a standard normal random variable $\epsilon \sim \mathcal{N}(0,I)$ to a bimodal distribution. However, their method is not the only one to get bimodality by transforming a standard normal variable, and actually the Jacobian correction can be computed even for their suggested transformation! The problem they encounter really is that they throw away one dimension of $\epsilon$, which makes the transformation lose injectivity. However, by not throwing the variable away, we keep injectivity and it is possible to compute the density of the transformed variables. The reasons for not accessing the density $q$ I thus find rather unconvincing.

To compute expectations with respect to $q$, the authors suggest Monte Carlo sums, where every summand uses an evaluation of $\pi$ or its gradient. As that is the most computationally costly part in MCMC and SMC often times, I am very curious whether the method performs any better computationally than modern adaptive Monte Carlo methods.
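
As a rough illustration of what such a Monte Carlo sum looks like, here is a minimal sketch for the Langevin-Stein operator applied to a test function $f$, i.e. an estimate of $\mathbb{E}_q[\nabla \log \pi(z)^\top f(z) + \nabla \cdot f(z)]$, which vanishes when $q = \pi$ under the usual regularity conditions. The standard normal target, the single hand-picked test function $f(z) = z$ and all function names are my own assumptions. The paper optimizes over a whole family of test functions, which is not reproduced here.

```python
import numpy as np

def langevin_stein_estimate(grad_log_pi, f, div_f, samples):
    """Monte Carlo estimate of E_q[grad log pi(z) . f(z) + div f(z)] from samples of q."""
    g = grad_log_pi(samples)           # shape (n, d), one gradient evaluation per sample
    fz = f(samples)                    # shape (n, d)
    return np.mean(np.sum(g * fz, axis=1) + div_f(samples))

rng = np.random.default_rng(0)
d = 2
grad_log_pi = lambda z: -z                              # target: standard normal (assumption)
f = lambda z: z                                         # hypothetical test function
div_f = lambda z: np.full(z.shape[0], float(z.shape[1]))  # divergence of z -> z is d

# q equal to the target: the estimate should be close to zero.
matched = langevin_stein_estimate(grad_log_pi, f, div_f,
                                  rng.standard_normal((100_000, d)))
# q with twice the standard deviation: the exact value is d - 4d = -6 for d = 2.
too_wide = langevin_stein_estimate(grad_log_pi, f, div_f,
                                   2.0 * rng.standard_normal((100_000, d)))
```

Note that every sample costs one evaluation of $\nabla \log \pi$, which is exactly the part whose cost I am wondering about.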

# Anschlag am Breitscheidplatz

The attack yesterday afternoon took place at one of my favorite places in Berlin, the Kaiser-Wilhelm-Gedächtniskirche located at Breitscheidplatz. The church is a most beautiful symbol of starting from scratch after a devastating war. The historic, destroyed tower still exists and was complemented by a modernist church in the 50s (snapshot above).

Three hours before the attack I bought presents at Breitscheidplatz. Now let's hope the police will find the terrorist.

# Talk on Reproducing Kernel Hilbert Spaces in machine learning

Yesterday I gave a talk on Reproducing Kernel Hilbert Spaces (RKHSs) in machine learning, in the Uncertainty Quantification seminar organized by Tim Sullivan. In earlier meetings, Tim himself and Han Cheng Lie gave talks on Vladimir Bogachev's use of RKHSs in his book on Gaussian Measures, which does not seem to mention where the “Reproducing Kernel” part comes from. This is why I decided to start out with and concentrate on kernels. I pointed out the equivalence of a very simple classification algorithm using the dot product in an RKHS with the usage of KDEs for classification (at least for a certain class of positive definite kernels that are also densities).
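
One way to see this equivalence numerically is the following minimal sketch. It is not the code from my notebook. It assumes a Gaussian kernel normalized to be a density, equal class sizes and synthetic two-class data. The "simple classifier" scores a point $x$ by the RKHS dot product $\langle \phi(x), \mu_c \rangle$ with the class mean $\mu_c = \frac{1}{n_c}\sum_i \phi(x_i)$, which by the kernel trick is $\frac{1}{n_c}\sum_i k(x, x_i)$, i.e. a kernel density estimate.

```python
import numpy as np

def gauss_kernel(a, b, h):
    """k(a, b) = N(a - b; 0, h^2 I): positive definite and a density in its first argument."""
    d = a.shape[1]
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq / h ** 2) / (2 * np.pi * h ** 2) ** (d / 2)

def rkhs_scores(x, class_samples, h):
    """<phi(x), mu_c> per class, with mu_c the feature-space mean of the class samples."""
    return np.stack([gauss_kernel(x, s, h).mean(axis=1) for s in class_samples], axis=1)

def kde(x, samples, h):
    """Gaussian kernel density estimate, written out as a sum of bumps."""
    d = samples.shape[1]
    dens = np.zeros(x.shape[0])
    for s in samples:
        dens += np.exp(-0.5 * ((x - s) ** 2).sum(axis=1) / h ** 2)
    return dens / (len(samples) * (2 * np.pi * h ** 2) ** (d / 2))

# Synthetic data (an assumption): two well-separated Gaussian blobs in the plane.
rng = np.random.default_rng(0)
class0 = rng.normal(-2.0, 1.0, size=(200, 2))
class1 = rng.normal(2.0, 1.0, size=(200, 2))
points = np.array([[-2.0, -2.0], [2.0, 2.0], [0.5, -0.3]])
h = 0.5

scores = rkhs_scores(points, [class0, class1], h)
densities = np.stack([kde(points, class0, h), kde(points, class1, h)], axis=1)
labels_rkhs = scores.argmax(axis=1)     # classify by largest RKHS dot product
labels_kde = densities.argmax(axis=1)   # classify by largest estimated class density
```

With unequal class sizes or priors the two rules differ by a weighting of the densities, so the equal-size setup is doing some work here.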

You can take a look at my Jupyter Notebook online or download it from Github.

# A year in Paris

After one year my PostDoc in Paris is now over and tomorrow I’m starting at SFB 1114 at FU Berlin. As I already told one of my new colleagues, it’s been quite a thing for me not being Xian's office mate any more. One reason, obviously, is that with respect to work it’s a great luxury to be able to ask a senior researcher questions at almost any time. My second major reason is that the last time an office mate was so pleasant on a personal level was about six years ago when both Felix (then office mate) and I were new fathers and just generally got along very well. Christian told me last year how he was very thankful to Jim Berger for taking him as a PostDoc “and basically for no reason” (his words). Christian's publication record wasn’t great but Jim Berger didn’t care. My response was that that was exactly how I felt, very thankful – and for the same reasons.
It was rather easy to feel at home scientifically in Paris in general. Especially in domains with a strong math component it’s hard to find a better place when taking the union of all Paris universities. Dauphine had a nice atmosphere with many young researchers, especially probabilists. This also made me reflect more on Germany, and I can’t help but feel that the situation in the most prosperous country in the EU is much worse than in France. Rather it is comparable, especially with respect to young researchers, to some EU countries that are currently fighting low tax revenue. But there is hope. If the EU dissolves there will be fewer comparisons…

The hard part of this year was family life and commuting back to Berlin almost every weekend. It gave my wife the life of a single mother during weekdays. Another reason to be very thankful, being supported like that. I’m happy to have been here when my son entered elementary school and hope that was it in terms of long-term family separation.