Book
Since Feb 2022 I’ve been writing our textbook on Deep Learning with an Energy perspective. It will come in two versions: an electronic one with a dark background for screens (freely available) and a physical one with a white background for print (for purchase).
I finished writing the first 3 chapters and corresponding Jupyter Notebooks:
- Intro;
- Spiral;
- Ellipse.
Once the 4th chapter and notebook are done (end of Aug?), the draft will be submitted to the reviewers (Mikael Henaff and Yann LeCun). After merging their contributions (end of Sep?), a first draft of the book will be available to the public on this website.
Book format
The book is highly illustrated using the $\LaTeX$ packages TikZ and PGFPlots. The figures are numerically generated, with the computations done in Python using the PyTorch library. The output of such computations is stored as ASCII files, which $\LaTeX$ then reads and visualises. Moreover, most figures are also rendered in the Notebooks using the Matplotlib library.
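In practice, the pipeline looks roughly like this minimal sketch (not the book’s actual code; the file name tanh.dat and the quantities plotted are made up for illustration):

```python
import numpy as np
import torch

# Compute something in PyTorch…
x = torch.linspace(-3, 3, 201)
y = torch.tanh(x)

# …and dump it as a plain-text (ASCII) two-column table. PGFPlots can then read
# it with something like \addplot table {tanh.dat}; inside an axis environment.
np.savetxt('tanh.dat', torch.stack((x, y), dim=1).numpy(), header='x y', comments='')
```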
Why plot with $\LaTeX$?
Because I can control every single aspect of what is drawn.
If I define the hidden vector $\green{\vect{h}} \in \green{\mathcal{H}}$ in the book, I can have a pair of axes labelled $\green{h_1}$ and $\green{h_2}$ and the Cartesian plane labelled $\green{\mathcal{H}}$ without going (too) crazy.
All my maths macros, symbols, fonts, font sizes, and colours are controlled by a single stylesheet called maths-preamble.tex.
Why colours?
Because I think in colours. Hence, I write in colours. And if you’ve been my student, you already know that at the bottom left we’ll have a pink-bold-ex $\pink{\vect{x}}$ from which we may want to predict a blue-bold-why $\blue{\vect{y}}$ and there may be lurking an orange-bold-zed $\orange{\vect{z}}$.
Illustration sneak peeks
To keep myself motivated and avoid going too crazy, I post the most painful drawings on Twitter, where my followers keep me sane by sending a copious amount of love ❤️. You can find a few of these tweets here.
I think I've just acquired the title of TikZ-ninja. pic.twitter.com/dq43bvjcFG
18 hrs writing the book in a row… Let's go home 😝😝😝
A small update, so I keep motivating myself to push forward 😅😅😅
Last update: a preview of the book's “maximum likelihood” section and generating code.
Achievement of the day 🥳🥳🥳 Vectors and functions 💡💡💡
One giant leap for Alf, one small step forward for the book 🥲🥲🥲 #TeXLaTeX #EnergyBasedModel #DLbook pic.twitter.com/X3FU8Uijys
Just some free energy geometric construction. 🤓🤓🤓 pic.twitter.com/DsIevqzuv2
Negative gradient comparison for F∞ and Fᵦ.
«The ellipse toy example» chapter is DONE. 🥳🥳🥳
A small glimpse from the book, achievement of the day 🤓🤓🤓
Another update from the book. 📖 When looking at a classifier, we can consider its energy as being the cross-entropy or its negative linear output (often called logits). The energy of a well-trained model will be low for compatible (x, y) and high for incompatible pairs. 📖📖📖 pic.twitter.com/HlfvXQvGWn
Maths operand order is often counterintuitive. We can use SVD to inspect 🔍 what a given linear transformation does. From the diagram below we can see how the lavender oriented circle with axes 𝒗₁ and 𝒗₂ gets morphed into the aqua oriented ellipse with axes 𝜎₁𝒖₁ and 𝜎₂𝒖₂. So, they are ‘stretchy rotations’. pic.twitter.com/0HpOwOPbpf
A neural net is a sandwich 🥪 of linear and non-linear layers. Last week we've learnt about the geometric interpretation of linear transformations, and now we're appreciating a few activation functions' morphings.
Chapter 1 (2 and 3) completed! 🥳🥳🥳
Good night World 😴😴😴 pic.twitter.com/kLtw2yeG92
Suggestions and feedback are welcome! 😊😊😊 pic.twitter.com/d5NeKieE5m
🥳🥳🥳 https://t.co/JZeAHuuTnA pic.twitter.com/dgaUIw5bWN
Plenty of pain! 🥲🥲🥲 pic.twitter.com/5BBS5J59bC
A vector 𝒆 ∈ ℝᴷ can be thought of as a function 𝒆 : {1, …, 𝐾} ⊂ ℕ → ℝ, mapping all 𝐾 elements to a scalar value.
Similarly, a function 𝑒 : ℝᴷ → ℝ can be thought of as an infinite vector 𝑒 ∈ ℝ^ℝᴷ, having ℝᴷ elements. pic.twitter.com/ccZREDAal1
For super-cold 🥶 zero-temperature limit we have a single force pulling on the manifold per training sample.
For warmer temperatures ☀️😎 we pull on regions of the manifold.
For super-hot 🥵 settings we kill ☠️ all the latents 😥. pic.twitter.com/cFsGQ3FJFV
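For context, the free energy these tweets play with is the usual latent-variable one (notation assumed from the $F_\infty$ / $F_\beta$ tweets above, not copied verbatim from the book), with $\beta$ the inverse temperature and $E(\vect{y}, \vect{z})$ the energy of an observation–latent pair:

$$F_\beta(\vect{y}) = -\frac{1}{\beta} \log \int_{\mathcal{Z}} \exp\bigl(-\beta\, E(\vect{y}, \vect{z})\bigr)\, \mathrm{d}\vect{z}, \qquad F_\infty(\vect{y}) = \min_{\vect{z}} E(\vect{y}, \vect{z}).$$

The zero-temperature limit $\beta \to \infty$ recovers $F_\infty$ (a single latent, hence a single force per sample), while lowering $\beta$ spreads the pull over a whole region of latents.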
7.5k words, 1.2k lines of TikZ, 0.8k lines of Python.
I think I got this! 🥲🥲🥲 pic.twitter.com/5uwwrLcXPf
The two soft maxima and soft minima are compared to the minimum, average, and maximum of a real vector (of size 5). This is a fun plot because the y-axis does something funky 🤪🤪🤪 pic.twitter.com/tST48uxmL2
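If you want to play with those quantities yourself, here is a throwaway snippet (my own assumption of what the two soft maxima are: the scaled log-sum-exp and the softmax-weighted average; the book’s definitions may differ):

```python
import torch

x = torch.tensor([1., 2., 3., 4., 5.])   # a real vector of size 5
beta = 1.0                                # inverse temperature (made up)

soft_max_lse = torch.logsumexp(beta * x, dim=0) / beta       # scaled log-sum-exp
soft_max_avg = (torch.softmax(beta * x, dim=0) * x).sum()    # softmax-weighted average

# The soft minima are obtained by negating the argument
soft_min_lse = -torch.logsumexp(-beta * x, dim=0) / beta
soft_min_avg = (torch.softmax(-beta * x, dim=0) * x).sum()

print(x.min().item(), x.mean().item(), x.max().item())
print(soft_min_lse.item(), soft_min_avg.item(), soft_max_lse.item(), soft_max_avg.item())
```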
A classifier 'moves' points around such that they can be separated by the output linear decision boundaries.
Usually one looks at how the net warps the decision boundaries around the data but I like to look at how the input is unwarped instead. 🤓 pic.twitter.com/M3ZGmUUZI6
For example, 𝒔 = 𝑾 𝒓 = 𝑼𝚺𝑽 ᵀ 𝒓 can be more naturally represented by the following circuit. 🤓🤓🤓 pic.twitter.com/S6rdtBtzuy
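And a quick numerical check of that factorisation (a sketch with a random 𝑾 and 𝒓, not the book’s code):

```python
import torch

W = torch.randn(2, 2)
r = torch.randn(2)

# SVD: W = U Σ Vᵀ (rotation/reflection, axis-aligned stretch, rotation/reflection)
U, S, Vh = torch.linalg.svd(W)

s_direct   = W @ r
s_factored = U @ torch.diag(S) @ Vh @ r   # read right to left: rotate, stretch, rotate

print(torch.allclose(s_direct, s_factored, atol=1e-6))  # True
```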
Almost done with the intro chapter! 🥳🥳🥳 pic.twitter.com/9SAIfkKUWk
We've seen a linear and a bunch of non-linear transformations. But what can a stack of linear and non-linear layers do? Here we have two fully-connected nets doing their nety stuff on some random points. 😀😀😀 pic.twitter.com/otExi5h7bb
Last update: 26 Jul 2022.
Oct 2022 update
For the entire month of Aug and half of Sep I was stuck implementing a working sparse coding algo for a low-dimensional toy example. Nothing worked for a long while, but I eventually managed to get the expected result (see tweets below). Then, I spent a couple of weeks on the new semester’s lectures, creating new content (slides below, video available soon) on back-propagation, a topic I’d never taught at NYU and which will make it into the book. Anyhow, now I’m back to writing! 🤓
Zooming in a little, for some finer details. pic.twitter.com/i57E0rYwzH
Backpropagation ⏮ of the gradOutput throughout each of the network's modules allows us to compute the rate of change of the loss 📈 wrt the model's parameters.
To inspect 🧐 its value we can simply check the gradBias of any linear layer. pic.twitter.com/buysxDBGD7
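In PyTorch terms, the trick looks like this (hypothetical tensors and loss; the point is only that a linear layer’s bias gradient equals the gradOutput summed over the batch):

```python
import torch
import torch.nn as nn

lin = nn.Linear(3, 2)
x = torch.randn(5, 3)

y = lin(x)
y.retain_grad()              # keep the gradOutput of the layer for inspection
loss = y.pow(2).sum()        # an arbitrary scalar loss
loss.backward()

# The bias gradient is the gradOutput accumulated over the batch dimension
print(torch.allclose(lin.bias.grad, y.grad.sum(0)))  # True
```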
Last update: 26 Sep 2022.
May 2023 update
Oh boy, this 4th chapter took me a while (mostly because I’ve also been focussing on other things, including the Spring 2023 edition of the course)… but it’s done now! In these last few months I’ve written about undercomplete autoencoders (AE), denoising AE, variational AE, contractive AE, and generative adversarial nets. Thanks to Gabriel Peyré, I’ve developed a method to separate stationary sinks and sources for a dynamics field (which I may write an article about), and it’s an integral part of the book’s explanations.
Moreover, I’ve been pushing a few videos from the Fall 2022 edition of the course, which give a preview of the chapters I’ve been writing, e.g. neural nets’ components, backpropagation (first time teaching it), energy-based classification, PyTorch training, K-means, and sparse coding (at least for now). Finally, over the Winter break, I’ve been teaching 12-year-olds about the maths and computer science behind generative AI, and I’m considering using p5.js as a tool to teach programming to beginners.
What’s next? I’m sending this first draft, with its 4 chapters (Intro, Spiral, Ellipse, Generative) and companion Jupyter Notebooks, to Yann for a review. Meanwhile, I’ll be writing the Backprop chapter, possibly an article, and pushing a few more videos to YouTube. Once the review is completed, a first draft will appear on this website for the public.
A 2 → 100 → 100 → 1 → 100 → 100 → 2 hyperbolic tangent undercomplete autoencoder trying to recover a 1d manifold from 50 2d data points. 📖📖📖 pic.twitter.com/ImKbpPTavY
Let’s get some sections done! 🤓🤓🤓 pic.twitter.com/13bllkQ3wx
A variational autoencoder (VAE) limits the low-energy region by mapping the inputs to fuzzy bubbles. The hidden representation can be made uninformative by increasing the temperature during learning, which induces the bubbles to be all centred at the origin and have unit size. pic.twitter.com/qpa8ptsJDD
Done with the VAE chapter! 🥳🥳🥳
We have a caption now! The contractive autoencoder section is completed. Epoch 0 vs. epoch 18k.
Let's end this year by starting to upload the first video of NYU Deep Learning Fall 2022 edition! 🥳🥳🥳
Let's start the year by brushing up on the basics of neural nets: linear and non-linear transformations.
The first video of the «Classification, an Energy Perspective» saga shows two nets' data space transformation, introduces the data format, illustrates the predictor-decoder architecture, and explains how gradient descent is used for learning.
The second video of the «Classification, an Energy Perspective» saga teaches backprop, visualises the energy landscape, and explains how contrastive learning works. 🤓
The third and last video of the «Classification, an Energy Perspective» saga covers neural net 5-step training code in @PyTorch, gradient accumulation justification, reproduction of the energy surface for different models, and ensembling uncertainty estimation. https://t.co/oyEGlgyhTE pic.twitter.com/MaZsSSRg8U
In this lecture, we start with two examples of decoder-only latent-variable EBM (𝐾-means and sparse coding), move to target-prop via amortised inference, and finally land on the autoencoder architecture. 🤓
I taught 4 hours of Deep Learning to a class of 7th graders. I didn’t dumb it down at all. I just used the same analogies and explanations I use with the grown-ups. By the end I was in love with their young and fresh minds and total absolute attention. ❤️ https://t.co/CFP4Mkarwx pic.twitter.com/Ng0veJLftq
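For reference, a minimal PyTorch sketch of such a tanh undercomplete autoencoder (architecture from the tweet above; data, optimiser, and training loop are placeholders, not the notebook’s code):

```python
import torch
import torch.nn as nn

# 2 → 100 → 100 → 1 → 100 → 100 → 2, hyperbolic tangent activations
class AE(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(2, 100), nn.Tanh(),
            nn.Linear(100, 100), nn.Tanh(),
            nn.Linear(100, 1),                     # 1d bottleneck
        )
        self.decoder = nn.Sequential(
            nn.Linear(1, 100), nn.Tanh(),
            nn.Linear(100, 100), nn.Tanh(),
            nn.Linear(100, 2),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

x = torch.randn(50, 2)                             # 50 2d data points (placeholder)
model = AE()
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)

for _ in range(1_000):
    optimiser.zero_grad()
    loss = nn.functional.mse_loss(model(x), x)     # reconstruction energy
    loss.backward()
    optimiser.step()
```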
Figures from chapter 4
Two sections to go and the first draft ships! 🥳🥳🥳
Yay! 🥳🥳🥳 pic.twitter.com/Lj30urRpZH
One section to go! 🥳🥳🥳 https://t.co/cpid936wDr pic.twitter.com/mTIDqkYSqm
Losses and generator gradients' norm.
Critic learnt energy. pic.twitter.com/7swifi5qNj
Videos from DLFL22
This is an incremental version based on DLSP21. Therefore, only new content will be uploaded.
Enjoy the view. https://t.co/TxaNhQgUbO pic.twitter.com/hVZYWEJMv8
In this episode, we're concerned with inference only. Forward and backwards. We introduce the cost and the energy. 🔋
Website: https://t.co/3yY8CMLiXz https://t.co/zrqH4CG0mr pic.twitter.com/MrSeV3u40S
Enjoy 🤓❤️🤗 https://t.co/glH2iGydIJ pic.twitter.com/S33JxwdH83
This lecture alone was the reason DLFL22 has been pushed online. I hope you like it. ❤️ https://t.co/5vVQRwLzxK pic.twitter.com/x0lQaT9hKz
Back to using @AdobeAE for the animations! 🥳 https://t.co/ATbVwuxmcC 🎥 pic.twitter.com/kWEF68cE9Q
Teaching Italian 7th graders
Last update: 16 May 2023.
Aug 2023 update
Of course, during the Summer it was unrealistic to expect anyone to review anything… Anyhow, I’ve just got back from O‘ahu (ICML23) and Maui (2 days before Lahaina burnt down) and finished the Backprop chapter, so the first draft will have 5 chapters in total as of right now. Below, you can see a few diagrams I’ve developed over these summer months.
The new semester starts in two weeks, so I’ll be a bit busy with that. I need to plan a possible chapter on joint embedding methods and start working on PART II of the book: ‘geometric stuff’.
Speaking of books, I’ve just received my copy of The Little Book of Deep Learning by François Fleuret. I have to say it is really well made, and I really like it. It’s a bit on the terse side, but I haven’t decided whether that’s a pro or a con.
Let's go fancy with inline diagrams!
Backprop, the key component behind training multi-layered deep nets, can sometimes be challenging to digest. What follows is an attempt to illustrate it, starting from the last linear layer's gradWeight and gradBias computation in a regression setup. 🤓🤓🤓 pic.twitter.com/PLeiRjJVYb
A neural net is made of simple building blocks.
«Weights sharing implies tied gradients accumulation.» Since it's not obvious for half of you and only a small fraction can prove it (link to the poll below), let me share this latest book section with y'all! 😀😀😀
The one-hot row routing and branching matrix G 🐢 is a peculiar object. When it's used in a left-multiplication, it acts as a selector and/or branching operator. When it's used in a right-multiplication, it acts as an accumulator via the paths that have previously branched out. pic.twitter.com/DEH7FOxhqr
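A quick way to convince yourself that «weights sharing implies tied gradients accumulation» (a toy numerical check, not the book’s proof):

```python
import torch
import torch.nn as nn

x = torch.randn(8, 4)

# Tied: the very same layer is applied twice
shared = nn.Linear(4, 4, bias=False)
shared(shared(x)).sum().backward()

# Untied: two copies of the same weights, each applied once
a = nn.Linear(4, 4, bias=False); a.weight.data = shared.weight.data.clone()
b = nn.Linear(4, 4, bias=False); b.weight.data = shared.weight.data.clone()
b(a(x)).sum().backward()

# The tied gradient is the accumulation (sum) of the two untied contributions
print(torch.allclose(shared.weight.grad, a.weight.grad + b.weight.grad))  # True
```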
LaTeX has no secrets to me, mhuahahaha! 🤪🤪🤪
(Writing the backprop chapter.) pic.twitter.com/bm1knKYCI7
Learning how the output gradient is backpropagated through these basic components helps us understand how each part contributes to the final model performance.
Below we see how the node & sum complementary modules behave. pic.twitter.com/tv5s5A1TFp
This also justifies the backward behaviour of the node module. pic.twitter.com/fA9dpYZiIp
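In autograd terms, the duality those tweets describe boils down to this (a minimal illustration, where the ‘node’ is the branching / fan-out module):

```python
import torch

x = torch.randn(3, requires_grad=True)

x1, x2 = x, x        # node module: forward copies its input to each branch
s = x1 + x2          # sum module: forward adds its inputs

s.sum().backward()

# Backward, the roles swap: the sum module copies the output gradient to each
# input, and the node module accumulates (sums) the gradients of its branches.
print(x.grad)        # tensor([2., 2., 2.])
```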
Jun 2025 update
Oh boy… it’s been two years since the last update… Let me tell you what’s happened since the last time I wrote something here.
Autumn 2023
We left off at draft v0.5.0, with 5 chapters completed (Backprop being the last one). During autumn 2023 I completed the 6th chapter (Signals, draft v0.6.0), and started working on recurrent nets.
«Chapter 6»
«Chapter 7»
Text & maths vs. diagram & caption. Yesterday I wrote two pages of maths, with upper bounds for the computation of a gradient, using the Cauchy-Schwarz inequality and other ‘tricks’. A or B and why? pic.twitter.com/LaZ9ruxZQe
A or B and why? pic.twitter.com/faz5nv4OQL
In this chapter, we'll introduce several geometric structures, over which functions are defined, and whose properties can be exploited to reduce computations and ease learning, giving rise to several architecture families we'll cover in this part of the book. pic.twitter.com/sTPeSZnlx5
*Recurrent neural nets* are characterised by the presence of *cyclic connections*. They have a *distributed hidden state* with *non-linear dynamics*. The network uses information from its previous state as part of its computation for the current state. pic.twitter.com/xpecYuHEsw
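Stripped to the bone, that recurrence is just the following (a sketch with made-up sizes, not the chapter’s code):

```python
import torch
import torch.nn as nn

x_seq = torch.randn(10, 2)                 # a sequence of 10 two-dimensional inputs
W_x, W_h = nn.Linear(2, 4), nn.Linear(4, 4)

h = torch.zeros(4)                         # initial hidden state
for x in x_seq:
    h = torch.tanh(W_x(x) + W_h(h))        # cyclic connection: h feeds back into itself
```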
They convey the same information in a very different form. 🤓🤓🤓 pic.twitter.com/bmdFIEKNFP
Today I drew a picture, which summarises two pages of equations.
Although the maths was necessary, the figure is what I see in my mind. pic.twitter.com/35e9lPudRz
My coworker Brian McFee publishes Digital Signals Theory, an introductory textbook for non-technical people.
Spring 2024
I’m promoted to full-time teaching faculty, with two courses a semester. More precisely, I’m co-teaching a (classical?) symbolic and statistical AI course (don’t ask) to 130 students with no teaching assistant. (In addition to my 80-student graduate Deep Learning course, for a grand total of 210 students.) Therefore, I start working 7 days a week, 12 hours a day.
We decide to split the duties across the semester: I’m in charge of the second ‘learning’ part.
Now the fun part. Students don’t come to class (it’s not necessary for solving the first part’s homework), my slides do not have text, students complain they cannot ‘read’ the slides on their own, my exam is about the knowledge covered in class, and I get a tonne of negative reviews on Rate My Professor. A few months later, I become the target of several angry, hateful students. I almost lose my job.
Book? What book? Who has time to focus on anything else?
Yet, I publish some of my lectures as NYU-AISP24.
Summer 2024
I interview Yann LeCun and Léon Bottou. I put together 3 blog posts on SN, Yann and Léon’s 1988 Simulateur de Neurones learning framework. Furthermore, I write a blog post about ‘visual requirements’ for my grad course.
I wrote two blog posts about SN, Léon Bottou and @ylecun's 1988 Simulateur de Neurones.
Dropping a new blog on «Visual prerequisites for learning deep learning». Nothing new. Just my recommendations, explicitly listed for former and future students’ benefit. https://t.co/qihKsZ9iNr pic.twitter.com/BICD7KIyha
Simulateur de Neurones (SN), one of the earliest deep learning frameworks, already had interactive and graphic capabilities ~30 years ago.
One is an English translation of the original paper, for which I've reproduced the figures. The other is a tutorial on how to run their code on Apple silicon. https://t.co/YEARKgePSK pic.twitter.com/7ZTdAZEZBz
In this blog post, you can learn more about a PyTorch ancestor, used to train the first convnet. https://t.co/xEjpMQIypD pic.twitter.com/XqWAearY2X
Autumn 2024
I’m still teaching two courses a semester, but one is an offering for alumni of my graduate Deep Learning course. Therefore, there’s minimal overhead and I can get back to writing.
I complete the History (7th) chapter (draft v0.7.0). I push a little more, and complete the RNN (8th) chapter (draft v0.8.0). Finally, I start the TikZ (9th) chapter, where I explain how I draw all my book’s figures.
I become the victim of a hate crime, the target of a psychopath who verbally threatens me. I fear for my safety and file a police report. In retaliation, the psychopath launches a defamation campaign, trying to destroy my reputation.
It's so funny… 😬 This past Spring semester I found myself forced to teach GOFAI… and now I am actually able to share my understanding and perspective in the historical chapter of my book. 🥲
By encoding memories as attractors in a dynamical system, one can retrieve them when presented with corrupted or partial stimuli. From a high-energy configuration, the system will spontaneously relax to a low-energy state.
I'm really having a blast at writing the historical section side notes! 🤩🤩🤩
Putting it all together, we have the following result. 🤓🤓🤓
One more chapter completed! 🥳🥳🥳
You asked me to show you my secrets… so here we go!
Thursday I tried to teach something I couldn't see clearly… oh man… what a drag… 😭😭😭
Daily TikZ show-off. 😁😁😁 pic.twitter.com/E3FCt0HsDF
In 1962, Hubel and Wiesel uncovered how neurons in cats' brains respond to specific visual stimuli, such as edges, lines, and movement.
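The attractor-memory tweet above describes what is classically implemented as a Hopfield network; here is a minimal sketch, assuming the standard binary Hebbian construction (sizes, seed, and corruption level are made up, and this is not the book’s code):

```python
import torch

torch.manual_seed(0)
N, P = 64, 3                                 # 64 binary neurons, 3 memories
patterns = torch.sign(torch.randn(P, N))     # ±1 memories

# Hebbian storage: each memory becomes an attractor of the dynamics
W = (patterns.T @ patterns) / N
W.fill_diagonal_(0)

# Corrupt one memory and let the system relax towards a low-energy state
s = patterns[0].clone()
s[:10] *= -1                                 # flip 10 of the 64 bits
for _ in range(10):
    s = torch.sign(W @ s)                    # synchronous update

print(torch.equal(s, patterns[0]))           # usually True: the memory is retrieved
```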
I guess knowledge is always a good thing. 😅 pic.twitter.com/P4Wagseuds
Can anyone guess what model we're talking about? 🤓 pic.twitter.com/slJpjd7VJN
After the 1969 Minsky & Papert book, we went through the first AI winter. Widrow kept working on neural nets but rebranded them as adaptive filters, which are now ubiquitous. pic.twitter.com/MD7eVuZZOq
𝚙𝚕𝚘𝚝_𝚠𝚎𝚒𝚐𝚑𝚝𝚜 allows us to inspect the dependencies of the hidden state wrt the input and the previous hidden representation. It also allows us to view the output linear combination of hidden units. https://t.co/tlADa7jd7D pic.twitter.com/xUEVqnlBhh
This one actually ended in a funny way 😅😅😅
Anyhow, posting this to share the gates' histogram overlay with the activation function to show the operation mode (biasing) of the soft switches. pic.twitter.com/DinSLTcAsb
Taking a small break from DL for writing an appendix on procedural graphics. I hope you'll find it useful! 😊😊😊 pic.twitter.com/UjQioTWbKB
I had to relearn how to see what I was talking about. 😩😩😩
And now that I can see, let me draw it, so I won't unsee it again! 🤓🤓🤓 pic.twitter.com/79x1mnWCux
The visual cortex processes info hierarchically: simple cells respond to basic features, and complex cells build location invariance. pic.twitter.com/KyKQekOCcz
Spring 2025
I’m given the opportunity to decide what my second course is. Therefore, I put together an undergraduate ‘Introduction to Deep Learning’ blackboard course. Mum gifts me a chalk holder. I have zero registered students one week prior to the beginning of the semester. The admin tells me they will likely have to cancel my course, and I’ll be assigned some other random stuff to teach.
We’re having fun in class. Students are easily amused by this silly prof.
For the second lecture, I spend roughly 4 hours tweaking my slides. I go to class and decide to give an introduction before turning on my laptop and the projector. An hour and a half later… the blackboard is a copy of the slides I had planned to use 😅
I think it's going well. At least we're having fun! 😁😁😁 pic.twitter.com/fgg8NR1HSQ
Tue morning: *prepares slides*
Tue class: *improv blackboard lecture*
Outcome: unexpectedly great lecture.
Thu morning: *prep handwritten notes*
Thu class: *executes blackboard lecture*
Students: 🤩🤩🤩🤩🤩🤩🤩🤩🤩 pic.twitter.com/pXgPVz8ajB
This is fun! 🤩 I get the hang of it and start crafting coloured blackboards live. Students are enthusiastic and hyped.
In today's episode, we review the concepts of loss ℒ(𝘄, 𝒟), per-sample loss L(𝘄, x, y), binary cross-entropy cost ℍ(y, ỹ) = y softplus(−s) + (1−y) softplus(s), ỹ = σ(𝘄ᵀ𝗳(x)).
Then, we minimised the loss by choosing convenient values for our weight vector 𝘄. pic.twitter.com/axI0Jje8JC
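The softplus form and the usual binary cross-entropy on ỹ = σ(s) are the same quantity, which a two-line check confirms (random logits and labels, purely for illustration):

```python
import torch
import torch.nn.functional as F

s = torch.randn(10)                              # logits s = 𝘄ᵀ𝗳(x)
y = torch.randint(0, 2, (10,)).float()           # binary targets

H_softplus = y * F.softplus(-s) + (1 - y) * F.softplus(s)      # as on the blackboard
H_bce = F.binary_cross_entropy_with_logits(s, y, reduction='none')

print(torch.allclose(H_softplus, H_bce))         # True
```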
I go teach in Santiago de Chile for Khipu 2025, and I get Yann to cover for me.
While I was away, teaching for @Khipu_AI, I got ‘someone’ to teach my blackboard undergrad course.
It turns out teaching (undergrad) is like riding a bike. Even though you're out of practice, you still know how to do it! 😀😀😀 pic.twitter.com/Uz8PzAJC4H
I get back to FPGA Verilog programming, Spice CMOS simulation, and digital electronics.
Getting my toes wet with FPGA prototyping. 🤓
Digital, by Helmut Neemann, allows you to design and simulate digital logic, and it's designed for educational purposes. It has a Verilog export feature that helps you understand how hardware description languages work. 🤓🤓🤓 https://t.co/kPOx30GvFw pic.twitter.com/d1WxA5VcMj
Alright, getting the hang of it! 🥲 Today we're playing with diode logic.
There are two always blocks:
• the first counts up to 13.5M, which takes 0.5 seconds with a clock of 27MHz;
• the second resets the LED configuration to 6'b111110 and every 0.5 s moves the 0 one step to the left. pic.twitter.com/jLrEy62tiU
I haven't seen a less intuitive GUI in a while… yet, it *is* functional. I guess the author really wants you to switch to the keyboard shortcuts rather than right-clicking your way through! 🥹
BTW, LTspice is free of charge! pic.twitter.com/CvwgDP6xQ5
This component only allows us to perform logic AND and OR. There is no NOT unless active components are used. https://t.co/lAmeqgqGAM pic.twitter.com/b30dIN2WH8
In class, I experiment a lot with the guided-discovery pedagogical technique and with having the students be the main actors, to the point that lecture 20 got completely derailed by a student, who kept steering the thread, prompted by his own curiosity. I was so ecstatic about the outcome (it was pure jazz) that I decided to publish the lecture to advertise the course to other students.
In this lecture from my new undergrad course, we review linear multiclass classification, leverage backprop and gradient descent to learn a linearly separable feature vector for the input, and observe the training dynamics in a 2D embedding space. 🤓 https://t.co/k4p0JwPtB7 pic.twitter.com/sCgnkiPenA
Finally, I create a new animation about training a neural network for classification, reviving code written 5 years ago.
Training of a 2 → 100 → 2 → 5 fully connected ReLU neural net via cross-entropy minimisation.
• it starts outputting small embeddings
• around epoch 300 learns an identity function
• takes 1700 epochs more to unwind the data manifold pic.twitter.com/gzCMnA5rb0
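For the curious, here is a sketch of the net behind that animation (my assumption: a ReLU after each hidden layer; data, labels, and hyper-parameters are placeholders):

```python
import torch
import torch.nn as nn

# 2 → 100 → 2 → 5 fully connected ReLU net
model = nn.Sequential(
    nn.Linear(2, 100), nn.ReLU(),
    nn.Linear(100, 2), nn.ReLU(),   # 2d embedding, the space being visualised
    nn.Linear(2, 5),                # 5 class scores
)

x = torch.randn(256, 2)                            # placeholder 2d inputs
t = torch.randint(0, 5, (256,))                    # placeholder class labels
optimiser = torch.optim.SGD(model.parameters(), lr=0.1)

for _ in range(2_000):
    optimiser.zero_grad()
    loss = nn.functional.cross_entropy(model(x), t)   # cross-entropy minimisation
    loss.backward()
    optimiser.step()
```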
For this course, I had the pleasure of having an unofficial assistant, Gabriele Pintus, who has been writing his Master’s thesis on JEPA models with me here at NYU. Thanks to him, the homework was spectacularly well made, and the students extremely happy.
Book? No time.
Summer 2025
Yann agrees to review the book in July, finally allowing me to release the book’s first draft. I complete and release the 9th chapter, TikZ, bumping the draft to v0.9.0. The next update should happen around the end of July, when I should be able to share the first draft of the book with you. Now, I’m getting started with the 10th chapter, Control.
Last update: 9 Jun 2025.