Machine Learning and Differential Equations?


Once in a while I hear the question: “How can machine learning have any connection with differential equations?”

 

Gradient Descent; too big to be visible!

The majority of ML models rely on gradient-based optimization algorithms to adjust their tunable parameters. Gradient Descent (GD) and its extensions are the most basic algorithms used for training ML models, particularly neural networks. The connection between the training dynamics of the GD algorithm and differential equations is very well explained in these two blog posts about the Neural Tangent Kernel: Some Math behind Neural Tangent Kernel and Understanding the Neural Tangent Kernel. However, this post does not focus on gradient descent; rather, it delves into a fascinating application of its counterpart, Gradient Ascent!
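As a quick refresher (and purely for illustration: the quadratic loss and step size below are made up), the plain gradient descent update is simply a repeated step against the loss gradient:

```python
import numpy as np

# Made-up quadratic loss L(theta) = ||theta - target||^2, just for illustration.
target = np.array([3.0, -1.0])

def loss_grad(theta):
    return 2.0 * (theta - target)

theta = np.zeros(2)   # initial parameters
lr = 0.1              # learning rate (step size)

for _ in range(100):
    theta = theta - lr * loss_grad(theta)   # plain gradient descent update

print(theta)   # converges towards [3.0, -1.0]
```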

 

Gradient ascent, differential equations and generative modeling

A very interesting application of differential equations in generative modeling is in diffusion models (or, more accurately, Score-Based Generative Modeling with Stochastic Differential Equations). These models were introduced not very long ago in several prominent research papers:

  1. Deep Unsupervised Learning using Nonequilibrium Thermodynamics

  2. Denoising Diffusion Probabilistic Models

  3. Score-Based Generative Modeling through Stochastic Differential Equations

and they have played a prominent role in enabling high-quality, high-resolution generative modeling.

To put it in very simple words, generative modeling with diffusion models is like the gradient ascent method: take an initial guess and move it towards the maximum of a function (in this case, the probability density function).
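Here is a minimal sketch of that analogy, assuming a toy 2-D Gaussian density whose log-density is maximized at its mean; an arbitrary initial guess is pushed uphill by repeated gradient ascent steps:

```python
import numpy as np

# Toy 2-D Gaussian N(mu, I); its log-density is maximized at mu.
mu = np.array([1.0, 2.0])

def grad_log_p(x):
    # For N(mu, I): grad_x log p(x) = -(x - mu)
    return -(x - mu)

x = np.random.randn(2) * 5.0   # initial guess, possibly far from the mode
step = 0.1

for _ in range(200):
    x = x + step * grad_log_p(x)   # gradient *ascent* on log p(x)

print(x)   # ends up close to mu, the most probable point
```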

In this method, rather than learning the probability distribution itself, we learn the gradient of its logarithm (the score function). The probability density is a scalar function, and the gradient of its logarithm is a vector field. For a point in feature space, we do not learn the scalar value; instead, we learn the components of a vector that points towards the high-probability regions. Once this vector field has been learned, we can sample any random point in feature space and move it along the field towards the high-probability regions.
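To make the score-as-a-vector-field idea concrete, the following sketch (with a made-up equal-weight mixture of two unit-variance Gaussians) evaluates the analytic score at a couple of points; each returned vector points towards the nearby mode:

```python
import numpy as np

# Equal-weight mixture of two unit-variance Gaussians in 2-D (made-up example).
mus = np.array([[-3.0, 0.0], [3.0, 0.0]])

def score(x):
    # Component responsibilities at x (softmax of the per-component log-densities).
    log_w = -0.5 * np.sum((x - mus) ** 2, axis=1)
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    # For unit-variance components, grad log N(x; mu, I) = mu - x,
    # so the mixture score is the responsibility-weighted average of (mu_k - x).
    return (w[:, None] * (mus - x)).sum(axis=0)

# The score assigns a vector to every point of feature space,
# pointing towards the nearby high-probability region.
for pt in [np.array([-1.0, 1.0]), np.array([2.0, -1.0])]:
    print(pt, "->", score(pt))
```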

To train such a generative model, we start with samples that already belong to the high-probability regions of the distribution, so learning the vector field in those regions is not very difficult. However, these regions cover only a small portion of the feature space, and it is nearly impossible to learn the score function (the vector field) in regions with no samples. This was the main pitfall of score-based generative models until diffusion models solved it. In simple words, the solution is to populate the feature space with noisy copies of the training data, obtained by adding noise at multiple scales. This makes it possible to learn a close approximation of the vector field everywhere.
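A minimal sketch of this idea, assuming Gaussian perturbations and placeholder data: each clean sample is perturbed with a randomly chosen noise scale, and the corresponding regression target (as in denoising score matching) is simply the scaled negative noise, -(x_noisy - x) / sigma^2:

```python
import numpy as np

# Clean training samples (placeholder data) concentrated in one small region.
data = np.random.randn(1000, 2) * 0.1 + np.array([3.0, 3.0])

# A geometric ladder of noise scales; the largest ones spread the noisy
# samples over essentially the whole feature space.
sigmas = np.geomspace(0.01, 10.0, num=10)

def perturb(x, sigma):
    noise = np.random.randn(*x.shape) * sigma
    x_noisy = x + noise
    # For a Gaussian perturbation kernel, the score of x_noisy given x is
    # -(x_noisy - x) / sigma**2; this is the regression target used in
    # denoising score matching.
    target = -(x_noisy - x) / sigma ** 2
    return x_noisy, target

sigma = np.random.choice(sigmas)          # pick a random scale per batch
x_noisy, score_target = perturb(data, sigma)
print(x_noisy.shape, score_target.shape)  # (1000, 2) (1000, 2)
```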

At sample generation time, we usually start from pure noise and move it along the learned score function. This procedure is an iterative numerical scheme for solving a stochastic differential equation.
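As an illustration (not the exact sampler from any of the papers above), here is an annealed Langevin-style sketch in which score_model is a placeholder for the trained score network; starting from pure noise, the sample is moved along the score field while the noise level is annealed down:

```python
import numpy as np

def score_model(x, sigma):
    # Placeholder for a trained score network s_theta(x, sigma);
    # here it simply points towards the origin so the sketch runs.
    return -x / (1.0 + sigma ** 2)

sigmas = np.geomspace(10.0, 0.01, num=10)   # anneal from large to small noise
eps = 2e-5                                  # base step size
x = np.random.randn(2) * sigmas[0]          # start from (scaled) pure noise

# Annealed Langevin-style iteration: follow the learned score field,
# injecting a little fresh noise at every step, while lowering the noise level.
for sigma in sigmas:
    step = eps * (sigma / sigmas[-1]) ** 2
    for _ in range(100):
        z = np.random.randn(2)
        x = x + step * score_model(x, sigma) + np.sqrt(2.0 * step) * z

print(x)   # a generated sample
```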
