JSeam2/Neural-Ordinary-Differential-Equations
Sample implementation of Neural Ordinary Differential Equations
| field | value |
| --- | --- |
| repo name | JSeam2/Neural-Ordinary-Differential-Equations |
| repo link | https://github.com/JSeam2/Neural-Ordinary-Differential-Equations |
| homepage | |
| language | Jupyter Notebook |
| size (curr.) | 137 kB |
| stars (curr.) | 234 |
| created | 2018-06-21 |
| license | |
Neural Ordinary Differential Equations
Overview and Summary
In this repo I try to implement the findings of the paper. Here's a summary of what I think is the significant information.
Neural Ordinary Differential Equations introduces an interesting way of specifying a neural network. Instead of treating the neural network as a sequence of discrete states, the approach parameterizes the derivative of the hidden state with a neural network. Parameterizing the hidden state this way yields a continuous-depth model with a number of useful properties.
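To make the idea concrete, here is a minimal sketch (not the notebook's actual code) of the core construction, assuming PyTorch: a small network parameterizes dh/dt, and an ODE solver (here just a fixed-step RK4 integrator) plays the role of the forward pass.

```python
import torch
import torch.nn as nn

class ODEFunc(nn.Module):
    """Small network that parameterizes the derivative dh/dt = f(h, t)."""
    def __init__(self, dim=2, hidden=50):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, dim))

    def forward(self, t, h):
        return self.net(h)

def rk4_integrate(func, h0, t0=0.0, t1=1.0, steps=20):
    """Fixed-step RK4: the integration interval plays the role of 'depth'."""
    h, t = h0, t0
    dt = (t1 - t0) / steps
    for _ in range(steps):
        k1 = func(t, h)
        k2 = func(t + dt / 2, h + dt * k1 / 2)
        k3 = func(t + dt / 2, h + dt * k2 / 2)
        k4 = func(t + dt, h + dt * k3)
        h = h + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        t = t + dt
    return h

func = ODEFunc()
h0 = torch.randn(16, 2)        # batch of initial hidden states
h1 = rk4_integrate(func, h0)   # hidden state at t = 1, i.e. the "output"
```

The paper additionally backpropagates through the solver with the adjoint method instead of unrolling it, which is where the constant memory cost comes from.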
Useful Properties I Understand
- Memory Efficiency: The formulation gives a constant memory cost with respect to depth.
- Adaptive Computation: The solver's error tolerance can be tuned to trade accuracy for speed, which is useful for real-time or low-power applications (see the sketch after this list).
- Scalable and Invertible Normalizing Flows: The continuous transformation makes the change of variables formula easier to compute. The paper derives a new class of invertible density models that avoids the bottleneck of normalizing flows, allowing the model to be trained directly by maximum likelihood.
- Continuous Time-series Models: Able to model time-series data that arrive at arbitrary times, unlike RNNs.
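As a small illustration of the adaptive computation point (using SciPy's adaptive solver on toy dynamics, not the paper's model), loosening the solver tolerance reduces the number of function evaluations, i.e. the effective compute:

```python
import numpy as np
from scipy.integrate import solve_ivp

def f(t, h):
    """Toy dynamics standing in for a learned f(h, t)."""
    return np.tanh(h)

h0 = np.array([1.0, -0.5])
for tol in (1e-3, 1e-6, 1e-9):
    sol = solve_ivp(f, (0.0, 1.0), h0, rtol=tol, atol=tol)
    # nfev = number of function evaluations, a proxy for the model's "depth"
    print(f"tol={tol:.0e}  nfev={sol.nfev}  h(1)={sol.y[:, -1]}")
```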
What Are Normalizing Flows? I am confused.
I got stumped on this one for a while. My reaction irl the whole time -> ( ・◇・)?
If you’re a noob like me you would probably get stumped too. Here’s what I understand without the math.
Imagine you were at a conference and someone asked the presenter something really difficult, making the presenter go ( ・◇・)? The presenter doesn't want to cop out and look stupid, so he/she tries to answer the question anyway. The presenter simplifies the question and answers that simplified question instead. Not satisfactory, but it should get the point across. This is essentially the idea behind Variational Inference: the hard quantity being approximated is the posterior distribution, and the simplified answer stands in for it.
What happens if this explanation seems too hand-wavy or oversimplified? We need to find a slightly more detailed way of explaining the same thing. How should we tune the complexity of the explanation? This method of tuning the complexity is the idea behind Normalizing Flows. Using normalizing flows, we apply a sequence of invertible transformations (we can go back and forth without losing information) to transform that simple distribution (i.e. the simplified explanation) into something more complex that captures the idea we want to describe.
This paper essentially provides a continuous formulation of the normalizing flow concept. While elegant, the discrete formulation requires expensive terms, such as the log-determinant of the Jacobian in the change of variables formula. In the continuous formulation, the paper shows that only the trace of the Jacobian of the dynamics is needed, which is cheaper to compute. Check the maths in the paper for a clearer picture.
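For reference, here is how I read the relevant formulas from the paper: the discrete flow update needs a log-determinant of a Jacobian, while the continuous-time version (the instantaneous change of variables, Theorem 1 in the paper) only needs a trace, which is much cheaper in high dimensions.

```latex
% Discrete normalizing flow, z_1 = f(z_0):
\log p(z_1) = \log p(z_0) - \log \left| \det \frac{\partial f}{\partial z_0} \right|

% Continuous normalizing flow, dz/dt = f(z(t), t):
\frac{\partial \log p(z(t))}{\partial t} = - \operatorname{Tr}\left( \frac{\partial f}{\partial z(t)} \right)
```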
Useful links
- Quora explanation by Sam Wang (I used his analogy)
- Variational Inference by David M. Blei
- Variational Inference with Normalizing Flows Paper
- Normalizing Flows as suggested by zdhNarsil
Limitations
- Unstraightforward Minibatching: Minibatching is less straightforward than in standard networks, though it can still be achieved by concatenating the states of the batch elements together into one combined ODE (see the sketch after this list).
- Uniqueness: A unique solution only exists if the neural network has finite weights and Lipschitz nonlinearities like tanh or relu.
- Reversibility: The forward trajectory of the network is invertible in principle, but numerical error accumulates in the forward and reverse ODE solvers (this can be reduced at the cost of more computation). Information is also lost when multiple initial values map to the same final state; this is expected to be a problem if the system encodes optimization-like, convergent dynamics.
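Regarding the minibatching point above, here is a rough sketch (again with SciPy on toy dynamics, not the notebook's code) of the concatenation trick: the whole batch is flattened into one long state vector and solved as a single ODE, which also means every element shares the solver's step sizes and error control.

```python
import numpy as np
from scipy.integrate import solve_ivp

batch, dim = 4, 2

def f_single(t, h):
    """Toy per-example dynamics standing in for a learned f(h, t)."""
    return np.tanh(h)

def f_batched(t, flat_state):
    """Treat the whole minibatch as one ODE of dimension batch * dim."""
    h = flat_state.reshape(batch, dim)
    return f_single(t, h).reshape(-1)

h0 = np.random.randn(batch, dim)
sol = solve_ivp(f_batched, (0.0, 1.0), h0.reshape(-1), rtol=1e-6)
h1 = sol.y[:, -1].reshape(batch, dim)  # final states for every batch element
```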
Link to arXiv Paper
https://arxiv.org/abs/1806.07366
TODOS
Look into FFJORD, link to arXiv paper: https://arxiv.org/abs/1810.01367