This paper by Schölkopf et al. in Statistics and Computing (also on arXiv, which I refer to when giving page numbers) was a very rewarding read. Although I think the promise of “Kernel Probabilistic Programming” made in the abstract is not really kept, it served me as a concise review of ideas in modern (i.e. distribution-based) methods using positive definite (pd) kernels. The obvious connection of the pd kernel mean map with kernel density estimators (KDEs) is just one of the things that I think are important to realize in this domain.
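To make the KDE connection concrete, here is a minimal sketch (an illustration, not code from the paper). It assumes a one-dimensional Gaussian kernel normalized to integrate to one; the function names `gaussian_kernel`, `empirical_mean_map` and `kde` are made up for this example. With uniform weights 1/n, evaluating the empirical kernel mean map at a point gives exactly the KDE value at that point.

```python
import numpy as np


def gaussian_kernel(x, y, bandwidth):
    """Normalized 1-D Gaussian kernel: pd, and a density in x for fixed y."""
    z = (x - y) / bandwidth
    return np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2.0 * np.pi))


def empirical_mean_map(samples, weights, bandwidth):
    """Return the function t -> sum_i weights[i] * k(x_i, t)."""
    samples = np.asarray(samples, dtype=float)
    weights = np.asarray(weights, dtype=float)

    def mu_hat(t):
        t = np.atleast_1d(np.asarray(t, dtype=float))
        # Kernel matrix of shape (len(t), len(samples)) times the weight vector.
        return gaussian_kernel(t[:, None], samples[None, :], bandwidth) @ weights

    return mu_hat


def kde(samples, bandwidth):
    """Plain Gaussian kernel density estimator: average of kernels at the samples."""
    samples = np.asarray(samples, dtype=float)

    def density(t):
        t = np.atleast_1d(np.asarray(t, dtype=float))
        return gaussian_kernel(t[:, None], samples[None, :], bandwidth).mean(axis=1)

    return density
```

For an unnormalized kernel the two functions would differ only by a constant factor, so the correspondence is one of form, not something specific to the Gaussian case.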
Estimating integrals of functions in a certain RKHS becomes a rather straightforward exercise in the proposed framework. It is a bit unfortunate, however, that they assume independent samples from the distribution. Posing this as an importance sampling (IS) estimator, one might be able to apply some elementary results that lead to a generalization for dependent samples (they do talk about application to dependent samples, but I felt that was geared only towards causal inference). What might get in the way here is that the weights in their estimators are assumed to be independent of the actual samples used (which is not the case in IS).
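As a rough sketch of why integration is straightforward here: for f in the RKHS, the inner product of f with a weighted empirical mean map is just the weighted sum of f at the samples. The code below illustrates the IS reading suggested above, not the paper's estimator. It assumes self-normalized importance weights (nonnegative, summing to one), which depend on the samples and therefore violate the paper's independence assumption. All function names are hypothetical.

```python
import numpy as np


def rbf(x, z, bandwidth=1.0):
    """Unnormalized Gaussian (RBF) kernel."""
    return np.exp(-0.5 * ((x - z) / bandwidth) ** 2)


def normal_pdf(x, mean, std):
    """Density of a univariate normal distribution."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))


def rkhs_function(centers, coeffs, bandwidth=1.0):
    """f(x) = sum_j coeffs[j] * k(centers[j], x), an element of the RKHS of k."""
    centers = np.asarray(centers, dtype=float)
    coeffs = np.asarray(coeffs, dtype=float)

    def f(x):
        x = np.atleast_1d(np.asarray(x, dtype=float))
        return rbf(x[:, None], centers[None, :], bandwidth) @ coeffs

    return f


def weighted_expectation(f, samples, weights):
    """<f, sum_i w_i k(x_i, .)> = sum_i w_i f(x_i) by the reproducing property."""
    return float(np.dot(weights, f(samples)))


def self_normalized_weights(samples, target_pdf, proposal_pdf):
    """Self-normalized IS weights: nonnegative and summing to one."""
    w = target_pdf(samples) / proposal_pdf(samples)
    return w / w.sum()
```

With samples drawn from a wide normal proposal, a standard normal target and f(x) = exp(-x²/2), `weighted_expectation` should come out close to the exact value 1/√2.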
These difficulties with their estimator for functions of RVs notwithstanding, what struck me again and again is the close connection with importance sampling estimators (I’m a one-trick pony). Specifically, the empirical kernel mean map they use for a set of samples X and pd kernel k becomes an IS estimator under the constraint imposed in the paper (that ) and one additional assumption (). Also, when they use their method with sparse representations (in the paragraph “More general expansions sets”, where they make that assumption themselves), the way they expand the sparse representation is almost exactly Importance Resampling; funnily enough, they are not sure whether the method gives consistent estimates. I’m not sure what influence their slight variation of Importance Resampling might have, but using Importance Resampling directly results in consistent estimators, of course under some conditions on the weights and . Unfortunately, those conditions are incompatible with the assumption that the weights are independent of the samples.
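For reference, textbook Importance Resampling can be sketched as follows. This is the standard scheme, not the paper's variant for expanding sparse representations. It assumes nonnegative weights with a positive sum, and the name `importance_resample` is made up for this example.

```python
import numpy as np


def importance_resample(samples, weights, m, rng):
    """Draw m points from `samples` with probabilities proportional to `weights`.

    Returns the resampled points together with uniform weights 1/m, so that a
    weighted expansion is replaced by an equally weighted one.
    """
    samples = np.asarray(samples)
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0) or weights.sum() <= 0:
        raise ValueError("weights must be nonnegative with a positive sum")
    probs = weights / weights.sum()
    # Multinomial resampling with replacement.
    idx = rng.choice(len(samples), size=m, replace=True, p=probs)
    return samples[idx], np.full(m, 1.0 / m)
```

The resampled points can then be plugged into an equally weighted kernel mean map in place of the original weighted expansion.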
I would be really interested to see whether this IS connection is fruitful.
On a more critical note, the paper states that no parametric assumptions on the embedded distributions are made. While this is true, it masks the fact that different types of pd kernels used for the embedding will result in better or worse approximations of integrals in this framework when using finite numbers of samples. Also, the paper claims their method “does not require explicit density estimation as an intermediate step” while still using the kernel mean map, which includes KDEs as a special case. A minor thing is that Figure 1 needs error bars badly.