Book
Since Feb 2022 I’ve been writing our textbook on Deep Learning with an Energy perspective. It will come in two versions: an electronic one with a dark background for screens (freely available) and a physical one with a white background for print (for purchase).
I finished writing the first 3 chapters and corresponding Jupyter Notebooks:
- Intro;
- Spiral;
- Ellipse.
Once the 4th chapter and notebook are done (end of Aug?), the draft will be submitted to the reviewers (Mikael Henaff and Yann LeCun). After merging their contributions (end of Sep?), a first draft of the book will be available to the public on this website.
Book format
The book is highly illustrated with the $\LaTeX$ packages TikZ and PGFPlots. The figures are generated numerically, with the computations done in Python using the PyTorch library. The outputs of these computations are stored as ASCII files and then read by $\LaTeX$, which visualises them. Moreover, most figures are also rendered in the Notebooks using the Matplotlib library.
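To give a concrete (and entirely hypothetical) flavour of this pipeline, the Python side can dump each curve as a plain-text table that PGFPlots then picks up; the file name and column layout below are made up for illustration.

```python
import torch

# Hypothetical sketch of the Python → ASCII → PGFPlots pipeline described
# above (file name and column layout are made up for illustration).
x = torch.linspace(-3, 3, 101)
y = torch.tanh(x)

with open('tanh.dat', 'w') as f:                 # plain ASCII, one sample per row
    f.write('x y\n')
    for xi, yi in zip(x.tolist(), y.tolist()):
        f.write(f'{xi:.6f} {yi:.6f}\n')

# On the LaTeX side, something like
#     \addplot table [x=x, y=y] {tanh.dat};
# inside a pgfplots axis environment reads the file back in.
```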
Why plot with $\LaTeX$?
Because I can control every single aspect of what is drawn.
If I define the hidden vector $\green{\vect{h}} \in \green{\mathcal{H}}$ in the book, I can have a pair of axes labelled $\green{h_1}$ and $\green{h_2}$ and the Cartesian plane labelled $\green{\mathcal{H}}$ without going (too) crazy.
All my maths macros, symbols, fonts, font sizes, and colours are controlled by one single stylesheet called maths-preamble.tex.
Why colours?
Because I think in colours. Hence, I write in colours. And if you’ve been my student, you already know that at the bottom left we’ll have a pink-bold-ex $\pink{\vect{x}}$ from which we may want to predict a blue-bold-why $\blue{\vect{y}}$ and there may be lurking an orange-bold-zed $\orange{\vect{z}}$.
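To give an idea of what such a stylesheet looks like, here is a hypothetical excerpt in the spirit of maths-preamble.tex (macro names and colour values below are placeholders, not the actual ones).

```latex
% Hypothetical excerpt in the spirit of maths-preamble.tex
% (macro names and colour values are placeholders).
\usepackage{amsmath, bm, xcolor}
\definecolor{mypink}{HTML}{F26DAE}
\definecolor{myblue}{HTML}{6EA8FE}
\definecolor{myorange}{HTML}{FFA94D}
\newcommand{\vect}[1]{\bm{#1}}                      % bold vectors
\newcommand{\pink}[1]{\textcolor{mypink}{#1}}       % observations x
\newcommand{\blue}[1]{\textcolor{myblue}{#1}}       % predictions  y
\newcommand{\orange}[1]{\textcolor{myorange}{#1}}   % latents      z
```

With definitions like these, $\pink{\vect{x}}$ renders as a pink-bold-ex everywhere, and tweaking a colour in one place propagates to the entire book.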
Illustration sneak peeks
To keep myself motivated and avoid going too crazy, I post the most painful drawings on Twitter, where my followers keep me sane by sending copious amounts of love ❤️. You can find a few of these tweets here.
I think I've just acquired the title of TikZ-ninja. pic.twitter.com/dq43bvjcFG
18 hrs writing the book in a row… Let's go home 😝😝😝
A small update, so I keep motivating myself to push forward 😅😅😅
Last update: a preview of the book's “maximum likelihood” section and its generating code.
Achievement of the day 🥳🥳🥳 Vectors and functions 💡💡💡
One giant leap for Alf, one small step forward for the book 🥲🥲🥲 #TeXLaTeX #EnergyBasedModel #DLbook pic.twitter.com/X3FU8Uijys
Just some free energy geometric construction. 🤓🤓🤓 pic.twitter.com/DsIevqzuv2
Negative gradient comparison for F∞ and Fᵦ. «The ellipse toy example» chapter is DONE. 🥳🥳🥳
A small glimpse from the book, achievement of the day 🤓🤓🤓
Another update from the book. 📖 When looking at a classifier, we can consider its energy as being the cross-entropy or its negative linear output (often called logits). The energy of a well-trained model will be low for compatible (x, y) pairs and high for incompatible ones. 📖📖📖 pic.twitter.com/HlfvXQvGWn
Maths operand order is often counterintuitive. We can use SVD to inspect 🔍 what a given linear transformation does. From the diagram below we can see how the lavender oriented circle with axes 𝒗₁ and 𝒗₂ gets morphed into the aqua oriented ellipse with axes 𝜎₁𝒖₁ and 𝜎₂𝒖₂. So, they are ‘stretchy rotations’. pic.twitter.com/0HpOwOPbpf
A neural net is a sandwich 🥪 of linear and non-linear layers. Last week we learnt about the geometric interpretation of linear transformations, and now we're appreciating a few activation functions' morphings.
Chapter 1 (2 and 3) completed! 🥳🥳🥳
Good night World 😴😴😴 pic.twitter.com/kLtw2yeG92
Suggestions and feedback are welcome! 😊😊😊 pic.twitter.com/d5NeKieE5m
🥳🥳🥳 https://t.co/JZeAHuuTnA pic.twitter.com/dgaUIw5bWN
Plenty of pain! 🥲🥲🥲 pic.twitter.com/5BBS5J59bC
A vector 𝒆 ∈ ℝᴷ can be thought of as a function 𝒆 : {1, …, 𝐾} ⊂ ℕ → ℝ, mapping all 𝐾 elements to a scalar value.
Similarly, a function 𝑒 : ℝᴷ → ℝ can be thought of as an infinite vector 𝑒 ∈ ℝ^ℝᴷ, having ℝᴷ elements. pic.twitter.com/ccZREDAal1
For super-cold 🥶 zero-temperature limit we have a single force pulling on the manifold per training sample.
For warmer temperatures ☀️😎 we pull on regions of the manifold.
For super-hot 🥵 settings we kill ☠️ all the latents 😥. pic.twitter.com/cFsGQ3FJFV
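For context, this is the standard free-energy construction used in energy-based models (the book's exact notation may differ): given an energy $E(\pink{\vect{x}}, \orange{\vect{z}})$ and an inverse temperature $\beta$,

$$F_\beta(\pink{\vect{x}}) = -\frac{1}{\beta} \log \int \exp\big({-\beta\, E(\pink{\vect{x}}, \orange{\vect{z}})}\big)\, \mathrm{d}\orange{\vect{z}}, \qquad F_\infty(\pink{\vect{x}}) = \lim_{\beta \to \infty} F_\beta(\pink{\vect{x}}) = \min_{\orange{\vect{z}}} E(\pink{\vect{x}}, \orange{\vect{z}}).$$

At zero temperature ($\beta \to \infty$) only the minimising latent contributes, hence a single pulling force per training sample; at finite temperature every latent is weighted by $\exp(-\beta E)$, so whole regions of the manifold get pulled.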
7.5k words, 1.2k lines of TikZ, 0.8k lines of Python.
I think I got this! 🥲🥲🥲 pic.twitter.com/5uwwrLcXPf
The two soft maxima and soft minima are compared to the minimum, average, and maximum of a real vector (of size 5). This is a fun plot because the y-axis does something funky 🤪🤪🤪 pic.twitter.com/tST48uxmL2
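For reference, the two usual constructions (the book's exact definitions may differ) for a vector $\vect{v} \in \mathbb{R}^K$ and a parameter $\beta > 0$ are

$$\frac{1}{\beta} \log \sum_{k=1}^{K} \exp(\beta v_k) \qquad \text{and} \qquad \sum_{k=1}^{K} v_k \, \frac{\exp(\beta v_k)}{\sum_{j=1}^{K} \exp(\beta v_j)};$$

both tend to $\max_k v_k$ as $\beta \to +\infty$, flipping the sign of $\beta$ gives the corresponding soft minima, and the second expression passes through the plain average at $\beta = 0$.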
A classifier 'moves' points around such that they can be separated by the output linear decision boundaries.
Usually one looks at how the net warps the decision boundaries around the data but I like to look at how the input is unwarped instead. 🤓 pic.twitter.com/M3ZGmUUZI6
For example, 𝒔 = 𝑾𝒓 = 𝑼𝚺𝑽ᵀ𝒓 can be more naturally represented by the following circuit. 🤓🤓🤓 pic.twitter.com/S6rdtBtzuy
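A quick PyTorch check of that factorisation (the shapes below are illustrative):

```python
import torch

# Any linear map W factors as U Σ Vᵀ: a rotation/reflection, an axis-aligned
# stretch by the singular values, and another rotation/reflection.
torch.manual_seed(0)
W = torch.randn(2, 2)
r = torch.randn(2)

U, S, Vh = torch.linalg.svd(W)          # W = U @ diag(S) @ Vh
s = U @ torch.diag(S) @ Vh @ r
print(torch.allclose(s, W @ r))         # True: s = W r = U Σ Vᵀ r
```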
Almost done with the intro chapter! 🥳🥳🥳 pic.twitter.com/9SAIfkKUWk
We've seen a linear and a bunch of non-linear transformations. But what can a stack of linear and non-linear layers do? Here we have two fully-connected nets doing their nety stuff on some random points. 😀😀😀 pic.twitter.com/otExi5h7bb
Last update: 26 Jul 2022.
Oct 2022 update
For the entire month of Aug and half of Sep I got stuck on implementing a working sparse coding algo for a low-dimensional toy example. Nothing worked for a long while, although I eventually managed to get the expected result (see tweets below). Then, I spent a couple of weeks on the new semester’s lectures, creating new content (slides below, video available soon) on back-propagation, which I had never taught at NYU before, a topic that will make it into the book. Anyhow, now I’m back to writing! 🤓
Zooming in a little, for some finer details. pic.twitter.com/i57E0rYwzH
Backpropagation ⏮ of the gradOutput through each of the network's modules allows us to compute the rate of change of the loss 📈 wrt the model's parameters.
To inspect 🧐 its value we can simply check the gradBias of any linear layer. pic.twitter.com/buysxDBGD7
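Here is a minimal PyTorch sanity check of that last claim (toy shapes, not the book's code): for a linear layer, the bias gradient is exactly the gradOutput summed over the batch.

```python
import torch

# Minimal sanity check (toy shapes): for a linear layer y = x Wᵀ + b, the
# gradient wrt the bias equals the gradOutput summed over the batch, so
# inspecting bias.grad reveals the gradient flowing into that layer.
torch.manual_seed(0)
lin = torch.nn.Linear(3, 2)
x = torch.randn(5, 3)
y = lin(x)

grad_output = torch.randn(5, 2)     # pretend this arrives from the loss
y.backward(grad_output)

print(torch.allclose(lin.bias.grad, grad_output.sum(0)))  # True
```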
Last update: 26 Sep 2022.
May 2023 update
Oh boy, this 4th chapter took me a while (mostly because I’ve also focussed on other things, including the Spring 2023 edition of the course)… but it’s done now! In these last few months I’ve written about undercomplete autoencoders (AE), denoising AE, variational AE, contractive AE, and generative adversarial nets. Thanks to Gabriel Peyré, I’ve developed a method to separate stationary sinks and sources for a dynamics field (which I may write an article about), and it’s now an integral part of the book’s explanations.
Moreover, I’ve been pushing out a few videos from the Fall 2022 edition of the course, which give a preview of the chapters I’ve been writing, e.g. neural net components, backpropagation (first time teaching it), energy-based classification, PyTorch training, K-means, and sparse coding (at least for now). Finally, over the Winter break, I taught 12-year-olds about the maths and computer science behind generative AI, and I’m considering using p5.js as a tool to teach programming to beginners.
What’s next? I’m sending this first draft, with its 4 chapters (Intro, Spiral, Ellipse, Generative) and companion Jupyter Notebooks, to Yann for a review. Meanwhile, I’ll be writing the Backprop chapter, possibly an article, and pushing a few more videos to YouTube. Once the review is completed, a first draft will appear on this website for the public.
A 2 → 100 → 100 → 1 → 100 → 100 → 2 hyperbolic tangent undercomplete autoencoder trying to recover a 1d manifold from 50 2d data points. 📖📖📖 pic.twitter.com/ImKbpPTavY
Let’s get some sections done! 🤓🤓🤓 pic.twitter.com/13bllkQ3wx
A variational autoencoder (VAE) limits the low-energy region by mapping the inputs to fuzzy bubbles. The hidden representation can be made uninformative by increasing the temperature during learning, which induces the bubbles to all be centred at the origin and have unit size. pic.twitter.com/qpa8ptsJDD
Done with the VAE chapter! 🥳🥳🥳
We have a caption now!
The contractive autoencoder section is completed. Epoch 0 vs. epoch 18k.
Let's end this year by starting to upload the first video of the NYU Deep Learning Fall 2022 edition! 🥳🥳🥳
Let's start the year by brushing up on the basics of neural nets: linear and non-linear transformations.
The first video of the «Classification, an Energy Perspective» saga shows two nets' data space transformations, introduces the data format, illustrates the predictor-decoder architecture, and explains how gradient descent is used for learning.
The second video of the «Classification, an Energy Perspective» saga teaches backprop, visualises the energy landscape, and explains how contrastive learning works. 🤓
The third and last video of the «Classification, an Energy Perspective» saga covers neural net 5-step training code in @PyTorch, the justification for gradient accumulation, the reproduction of the energy surface for different models, and ensembling for uncertainty estimation. https://t.co/oyEGlgyhTE pic.twitter.com/MaZsSSRg8U
In this lecture, we start with two examples of decoder-only latent-variable EBMs (𝐾-means and sparse coding), move to target-prop via amortised inference, to finally land on the autoencoder architecture. 🤓
I taught 4 hours of Deep Learning to a class of 7th graders. I didn't dumb it down at all. I just used the same analogies and explanations I use with the grown-ups. By the end I was in love with their young and fresh minds and total absolute attention. ❤️ https://t.co/CFP4Mkarwx pic.twitter.com/Ng0veJLftq
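For the curious, here is a rough PyTorch sketch of the bottleneck architecture in the first tweet above (data, training loop, and exact activation placement are my own illustrative guesses, not the book's code).

```python
import torch
from torch import nn

# Rough sketch of a 2 → 100 → 100 → 1 → 100 → 100 → 2 tanh undercomplete
# autoencoder (data, training loop, and activation placement are guesses).
torch.manual_seed(0)
model = nn.Sequential(
    nn.Linear(2, 100), nn.Tanh(),
    nn.Linear(100, 100), nn.Tanh(),
    nn.Linear(100, 1), nn.Tanh(),       # 1d bottleneck → 1d manifold
    nn.Linear(1, 100), nn.Tanh(),
    nn.Linear(100, 100), nn.Tanh(),
    nn.Linear(100, 2),
)

x = torch.randn(50, 2)                  # stand-in for the 50 2d training points
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)

for _ in range(5_000):
    optimiser.zero_grad()
    loss = nn.functional.mse_loss(model(x), x)  # reconstruction objective
    loss.backward()
    optimiser.step()
```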
Figures from chapter 4
Two sections to go and the first draft ships! 🥳🥳🥳
Yay! 🥳🥳🥳 pic.twitter.com/Lj30urRpZH
One section to go! 🥳🥳🥳 https://t.co/cpid936wDr pic.twitter.com/mTIDqkYSqm
Losses and generator gradients' norm.
Critic learnt energy. pic.twitter.com/7swifi5qNj
Videos from DLFL22
This is an incremental version based on DLSP21. Therefore, only new content will be uploaded.
Enjoy the view.https://t.co/TxaNhQgUbO pic.twitter.com/hVZYWEJMv8
In this episode, we're concerned with inference only. Forward and backwards. We introduce the cost and the energy. 🔋
Website: https://t.co/3yY8CMLiXz https://t.co/zrqH4CG0mr pic.twitter.com/MrSeV3u40S
Enjoy 🤓❤️🤗https://t.co/glH2iGydIJ pic.twitter.com/S33JxwdH83
This lecture alone was the reason DLFL22 has been pushed online. I hope you like it. ❤️https://t.co/5vVQRwLzxK pic.twitter.com/x0lQaT9hKz
Back to using @AdobeAE for the animations! 🥳 https://t.co/ATbVwuxmcC 🎥 pic.twitter.com/kWEF68cE9Q
Teaching Italian 7th graders
Last update: 16 May 2023.
Aug 2023 update
Of course, during the Summer it was unrealistic to expect anyone to review anything… Anyhow, I’ve just got back from O‘ahu (ICML23) and Maui (2 days before Lahaina burnt down) and finished the Backprop chapter, so the first draft will now have 5 chapters in total. Below you can see a few diagrams I’ve developed over these Summer months.
The new semester starts in two weeks, so I’ll be a bit busy with that. I need to plan a possible chapter on joint embedding methods and start working on PART II of the book: ‘geometric stuff’.
Speaking of books, I’ve just received my copy of The Little Book of Deep Learning by François Fleuret. I have to say it is really well made and I really like it. It’s a bit on the terse side, but I haven’t decided whether that’s a pro or a con.
Let's go fancy with inline diagrams!
Backprop, the key component behind training multi-layered deep nets, can sometimes be challenging to digest. What follows is an attempt to illustrate it, starting from the last linear layer's gradWeight and gradBias computation in a regression setup. 🤓🤓🤓 pic.twitter.com/PLeiRjJVYb
A neural net is made of simple building blocks.
«Weights sharing implies tied gradients accumulation.» Since it's not obvious to half of you and only a small fraction can prove it (link to the poll below), let me share this latest book section with y'all! 😀😀😀
The one-hot row routing and branching matrix G 🐢 is a peculiar object. When it's used in a left-multiplication, it acts as a selector and/or branching operator. When it's used in a right-multiplication, it acts as an accumulator via the paths that have previously branched out. pic.twitter.com/DEH7FOxhqr
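Here is a tiny PyTorch demonstration of «weights sharing implies tied gradients accumulation» (a toy example of mine, not the book's):

```python
import torch

# When the same weight is used in two places, autograd accumulates (sums)
# the gradient contributions from both uses.
torch.manual_seed(0)
w = torch.randn(3, 3, requires_grad=True)
x = torch.randn(3)

y = w @ (w @ x)            # the same w appears twice in the graph
y.sum().backward()

# Recompute the two per-use contributions separately and compare.
w1 = w.detach().clone().requires_grad_(True)   # outer use
w2 = w.detach().clone().requires_grad_(True)   # inner use
(w1 @ (w2 @ x)).sum().backward()
print(torch.allclose(w.grad, w1.grad + w2.grad))  # True
```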
LaTeX has no secrets from me, mhuahahaha! 🤪🤪🤪
(Writing the backprop chapter.) pic.twitter.com/bm1knKYCI7
Learning how the output gradient is backpropagated through these basic components helps us understand how each part contributes to the final model performance.
Below we see how the node & sum complementary modules behave. pic.twitter.com/tv5s5A1TFp
This also justifies the backward behaviour of the node module. pic.twitter.com/fA9dpYZiIp
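In autograd terms, the complementarity can be checked with a toy example (mine, not the book's): the branching node sums the incoming gradients on the way back, while the sum module copies the output gradient to each of its inputs.

```python
import torch

# A "node" (branching) module copies its input to several consumers, so its
# backward must *sum* the incoming gradients; a "sum" module adds its inputs,
# so its backward *copies* the output gradient to each input.
x = torch.tensor([1.0, 2.0], requires_grad=True)

a = 3 * x                  # first branch fed by the node
b = 5 * x                  # second branch fed by the node
y = a + b                  # sum module
y.sum().backward()

print(x.grad)              # tensor([8., 8.]): the 3 and the 5 accumulate
```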
Last update: 16 Aug 2023.