The attack yesterday afternoon took place at one of my favorite places in Berlin, Kaiser-Wilhelm-Gedächtniskirche located at Breitscheidplatz. The church is a most beautiful symbol of starting from scratch after a devastating war. The historic, destroyed tower still exists and was complemented by a modernist church in the 50s (snapshot above).
Three hours before the attack I bought presents at Breitscheidplatz. Now let’s hope the police will find the terrorist.
Yesterday I gave a talk on Reproducing Kernel Hilbert Spaces (RKHSs) in machine learning in the Uncertainty Quantification seminar organized by Tim Sullivan. In earlier meetings, Tim himself and Han Cheng Lie gave talks on Vladimir Bogachev’s use of RKHSs in his book on Gaussian Measures, which does not seem to mention where the “Reproducing Kernel” part comes from. That is why I decided to start out with and concentrate on kernels. I pointed out the equivalence of a very simple classification algorithm using the dot product in an RKHS with the usage of kernel density estimates (KDEs) for classification (at least for a certain class of positive definite kernels that are also densities).
You can take a look at my Jupyter Notebook online or download it from GitHub.
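The equivalence mentioned above can be sketched in a few lines. This is only an illustration and not taken from the notebook: it assumes the simple algorithm is the one that assigns a point to the class whose mean in feature space has the largest dot product with it, and it assumes a one-dimensional Gaussian kernel normalised to integrate to one. The function names, the toy data and the bandwidth are made up.

```python
import math


def gauss_kernel(x, y, bandwidth=1.0):
    """Gaussian kernel, normalised so that k(., y) is a probability density."""
    z = (x - y) / bandwidth
    return math.exp(-0.5 * z * z) / (bandwidth * math.sqrt(2.0 * math.pi))


def mean_embedding_score(x, sample, bandwidth=1.0):
    """Dot product <phi(x), mu_c> in the RKHS, where mu_c is the mean of the
    feature maps of the class sample. By the reproducing property this is
    simply the average of k(x, x_i)."""
    return sum(gauss_kernel(x, xi, bandwidth) for xi in sample) / len(sample)


def kde(x, sample, bandwidth=1.0):
    """Textbook Gaussian kernel density estimate evaluated at x."""
    total = sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample)
    return total / (len(sample) * bandwidth * math.sqrt(2.0 * math.pi))


def classify(x, class_samples, score):
    """Pick the label whose sample gives the highest score at x."""
    return max(class_samples, key=lambda label: score(x, class_samples[label]))


# Hypothetical toy data: two classes on the real line.
class_samples = {"a": [-2.0, -1.5, -1.0], "b": [1.0, 1.8, 2.5]}
```

Under these assumptions `mean_embedding_score` and `kde` compute the same number, so `classify(x, class_samples, mean_embedding_score)` and `classify(x, class_samples, kde)` agree for every `x`. For a kernel that is not a density the scores would differ by normalisation.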
After one year my PostDoc in Paris is now over and tomorrow I’m starting at SFB 1114 at FU Berlin. As I already told one of my new colleagues, it’s been quite a thing for me not being Xian’s office mate any more. One reason, obviously, is that with respect to work it’s a great luxury to be able to ask a senior researcher questions at almost any time. My second major reason is that the last time an office mate was so pleasant on a personal level was about six years ago, when both Felix (then office mate) and I were new fathers and just generally got along very well. Christian told me last year how he was very thankful to Jim Berger for taking him as a PostDoc “and basically for no reason” (his words). Christian’s publication record wasn’t great but Jim Berger didn’t care. My response was that that was exactly how I felt: very thankful, and for the same reasons.
It was rather easy to feel at home scientifically in Paris in general. Especially in domains with a strong math component it’s hard to find a better place when taking the union of all Paris universities. Dauphine had a nice atmosphere with many young researchers, especially probabilists. This also made me reflect more on Germany, and I can’t help but feel that the situation in the most prosperous country in the EU is much worse than in France. Rather, it is comparable, especially with respect to young researchers, to some EU countries that are currently fighting low tax revenue. But there is hope. If the EU dissolves there will be fewer comparisons…
The hard part of this year was family life and commuting back to Berlin almost every weekend. It gave my wife the life of a single mother during weekdays. Being supported like that is another reason to be very thankful. I’m happy to have been here when my son entered elementary school and hope that was it in terms of long-term family separation.
During the super nice International Conference on Monte Carlo techniques in the beginning of July in Paris at Université Descartes (photo), which featured many outstanding talks, one by Tong Zhang particularly caught my interest. He talked about several variants of Stochastic Gradient Descent (SGD) that basically use variance reduction techniques from Monte Carlo algorithms in order to improve the convergence rate versus vanilla SGD, even though some of the papers mentioned in the talk do not always point out the connection to Monte Carlo variance reduction techniques.
One of the first works in this line, Accelerating Stochastic Gradient Descent using Predictive Variance Reduction by Johnson and Zhang, suggests using control variates to lower the variance of the gradient estimate. Let $L_j(\theta_{t-1})$ be the loss for the parameter at step $t-1$ and the $j$th of $N$ data points; then the usual batch gradient descent update is $\theta_t = \theta_{t-1} - \frac{\eta_t}{N} \sum_{j=1}^{N} \nabla L_j(\theta_{t-1})$ with $\eta_t$ as step size.
In naive SGD one instead picks a data point index $j$ uniformly at random from $\{1, \dots, N\}$ and uses the update $\theta_t = \theta_{t-1} - \eta_t \nabla L_j(\theta_{t-1})$, usually with a decreasing step size $\eta_t$ to guarantee convergence. The expected update resulting from this Monte Carlo estimate of the batch gradient is exactly the batch procedure update. However, the variance of the estimate is very high, resulting in slow convergence of SGD after the first steps (even in minibatch variants).
The authors choose a well-known solution to this, namely the introduction of a control variate. Keeping a version of the parameter that is close to the optimum, say $\tilde\theta$, observe that $\nabla L_j(\tilde\theta) - \frac{1}{N} \sum_{i=1}^{N} \nabla L_i(\tilde\theta)$ has an expected value of 0 under uniform sampling of $j$ and is thus a possible control variate. The possible downside is that whenever $\tilde\theta$ is updated, one has to go over the complete dataset.
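To make the control variate idea concrete, here is a small sketch on a made-up one-dimensional least-squares problem. The data, the current parameter `theta` and the stored parameter `snapshot` are all assumptions chosen for illustration. Because the data point index is uniform over five points, expectations and variances can be computed exactly by averaging over all indices, without any sampling.

```python
from statistics import mean, pvariance

# Made-up toy data; per-point loss L_j(theta) = 0.5 * (x_j * theta - y_j) ** 2.
XS = [0.5, 1.0, 1.5, 2.0, 2.5]
YS = [1.1, 1.9, 3.2, 3.9, 5.1]


def grad(j, theta):
    """Gradient of the loss of data point j at theta."""
    return XS[j] * (XS[j] * theta - YS[j])


def full_grad(theta):
    """Batch gradient: average of the per-point gradients."""
    return mean([grad(j, theta) for j in range(len(XS))])


theta = 1.9      # current parameter (assumed)
snapshot = 2.0   # stored parameter close to the optimum (assumed)
indices = range(len(XS))

# Plain SGD gradient estimate for each possible index j.
plain = [grad(j, theta) for j in indices]

# Control variate: per-point gradient at the snapshot minus the batch
# gradient at the snapshot. Its average over j is zero.
control = [grad(j, snapshot) - full_grad(snapshot) for j in indices]

# Variance-reduced estimate: subtract the control variate.
reduced = [p - c for p, c in zip(plain, control)]
```

Here `mean(control)` is zero up to rounding, `mean(reduced)` matches `full_grad(theta)`, and `pvariance(reduced)` is smaller than `pvariance(plain)` for this toy data. The single call to `full_grad(snapshot)` is the pass over the complete dataset mentioned above.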
The contribution, apart from the novel combination of knowledge, is the proof that this improves convergence. This proof assumes smoothness and strong convexity of the overall loss function and convexity of the $L_j$ for the individual data points, and then shows that the proposed procedure (termed stochastic variance reduced gradient or SVRG) enjoys geometric convergence. The proof, however, uses a slightly odd version of the algorithm, namely one where $\tilde\theta$ is set to an iterate chosen uniformly at random from the iterates since the last update of $\tilde\theta$. Simply setting $\tilde\theta$ to the most recent iterate instead should intuitively improve convergence, but the authors could not report a result on that. Overall a very nice idea, and one that has been discussed quite a bit in later papers, among others by Simon Lacoste-Julien and Francis Bach.
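The overall procedure can be sketched as follows. This is a simplified reading of the algorithm on a made-up one-dimensional least-squares problem, not the authors' code. The step size, the number of inner steps, the function name `svrg` and the `snapshot_rule` argument are assumptions for illustration. The `"random"` rule mimics the variant used in the proof, while `"last"` is the intuitive variant that keeps the most recent iterate.

```python
import random

# Made-up toy data; per-point loss L_j(theta) = 0.5 * (x_j * theta - y_j) ** 2.
XS = [0.5, 1.0, 1.5, 2.0, 2.5]
YS = [1.1, 1.9, 3.2, 3.9, 5.1]

# Closed-form minimiser of the average loss, used only for checking.
THETA_STAR = sum(x * y for x, y in zip(XS, YS)) / sum(x * x for x in XS)


def grad(j, theta):
    """Gradient of the loss of data point j at theta."""
    return XS[j] * (XS[j] * theta - YS[j])


def svrg(theta0=0.0, step=0.05, epochs=30, inner_steps=10,
         snapshot_rule="last", seed=0):
    """Variance-reduced SGD with a constant step size (a sketch)."""
    rng = random.Random(seed)
    n = len(XS)
    snapshot = theta0
    for _ in range(epochs):
        # One full pass over the data whenever the snapshot changes.
        full = sum(grad(j, snapshot) for j in range(n)) / n
        theta = snapshot
        visited = []
        for _ in range(inner_steps):
            visited.append(theta)  # iterates before each inner update
            j = rng.randrange(n)   # uniform data point index
            # Stochastic gradient corrected by the zero-mean control variate.
            theta -= step * (grad(j, theta) - grad(j, snapshot) + full)
        if snapshot_rule == "last":
            snapshot = theta                # most recent iterate
        else:
            snapshot = rng.choice(visited)  # uniformly chosen inner iterate
    return snapshot
```

Calling `svrg(snapshot_rule="last")` and `svrg(snapshot_rule="random", epochs=200)` and comparing the results with `THETA_STAR` gives a feel for how the two snapshot rules behave on this toy problem. Note that the step size stays constant here, in contrast to the decreasing step size usually needed for naive SGD.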