# Book

Since Feb 2022 I’ve been writing our textbook on Deep Learning with an Energy perspective. It will come in two versions: an electronic one with a dark background for screens (freely available) and a physical one with a white background for print (for purchase).

I finished writing the first 3 chapters and corresponding Jupyter Notebooks:

- Intro;
- Spiral;
- Ellipse.

Once the 4th chapter and notebook are done (end of Aug?), the draft will be submitted to the reviewers (Mikael Henaff and Yann LeCun).
After merging their contributions (end of Sep?), a first draft of the book will be available to the public on this website.

## Book format

The book is **highly** illustrated using $\LaTeX$’s packages Ti*k*Z and PGFPlots.
The figures are generated numerically, with the computations done in Python using the PyTorch library.
The outputs of these computations are stored as ASCII files, which are then read and visualised by $\LaTeX$.
Moreover, most figures are *also* rendered in the Notebooks using the Matplotlib library.
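
As a minimal sketch of this pipeline (the file name, data, and column names are made up for illustration), a computation can be dumped to an ASCII table that PGFPlots later reads:

```python
import numpy as np
import torch

# Compute a curve in PyTorch and dump it as an ASCII table.
x = torch.linspace(-1, 1, 100)
y = torch.tanh(3 * x)
np.savetxt('curve.dat', torch.stack((x, y), dim=1).numpy(), header='x y', comments='')

# On the LaTeX side, something like
#     \addplot table [x=x, y=y] {curve.dat};
# inside a pgfplots axis environment then draws the curve.
```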

### Why plot with $\LaTeX$?

Because I can control **every single aspect** of what is drawn.
If I define the *hidden vector* $\green{\vect{h}} \in \green{\mathcal{H}}$ in the book, I can have a pair of axes labelled $\green{h_1}$ and $\green{h_2}$ and the Cartesian plane labelled $\green{\mathcal{H}}$ without going (too) crazy.
All my maths macros, symbols, fonts, font sizes, and colours are controlled by **one single stylesheet** called `maths-preamble.tex`.

### Why colours?

Because I think in colours.
Hence, I write in colours.
And if you’ve been my student, you already know that at the bottom left we’ll have a *pink-bold-ex* $\pink{\vect{x}}$ from which we may want to predict a *blue-bold-why* $\blue{\vect{y}}$ and there may be lurking an *orange-bold-zed* $\orange{\vect{z}}$.
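
To give an idea of what that stylesheet looks like, here is a hypothetical excerpt (the macro names match the ones used above, but the colour values and packages are my own guesses, not the actual `maths-preamble.tex`):

```latex
% Hypothetical excerpt of maths-preamble.tex: colours and maths symbols in one place.
\usepackage{amsmath, bm, xcolor}
\definecolor{myPink}{HTML}{F92672}   % made-up colour values
\definecolor{myBlue}{HTML}{66D9EF}
\definecolor{myOrange}{HTML}{FD971F}
\definecolor{myGreen}{HTML}{A6E22E}

\newcommand{\vect}[1]{\bm{#1}}                % bold vectors
\newcommand{\pink}[1]{{\color{myPink}#1}}     % $\pink{\vect{x}}$, the input
\newcommand{\blue}[1]{{\color{myBlue}#1}}     % $\blue{\vect{y}}$, the target
\newcommand{\orange}[1]{{\color{myOrange}#1}} % $\orange{\vect{z}}$, the latent
\newcommand{\green}[1]{{\color{myGreen}#1}}   % $\green{\vect{h}}$, the hidden vector
```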

## Illustrations sneak peeks

To keep myself motivated and avoid going too crazy, I post the most painful drawings on Twitter, where my followers keep me sane by sending copious amounts of love ❤️. You can find a few of these tweets here.


I think I've just acquired the title of TikZ-ninja. pic.twitter.com/dq43bvjcFG

— Alfredo Canziani (@alfcnz) February 9, 2022

18 hrs writing the book in a row… Let's go home 😝😝😝

— Alfredo Canziani (@alfcnz) February 12, 2022

Good night World 😴😴😴 pic.twitter.com/kLtw2yeG92

A small update, so I keep motivating myself to push forward 😅😅😅

— Alfredo Canziani (@alfcnz) February 15, 2022

Suggestions and feedback are welcome! 😊😊😊 pic.twitter.com/d5NeKieE5m

Last update: a preview of the book's “maximum likelihood” section and generating code.

— Alfredo Canziani (@alfcnz) February 18, 2022

🥳🥳🥳 https://t.co/JZeAHuuTnA pic.twitter.com/dgaUIw5bWN

Achievement of the day 🥳🥳🥳

— Alfredo Canziani (@alfcnz) March 16, 2022

Plenty of pain! 🥲🥲🥲 pic.twitter.com/5BBS5J59bC

Vectors and functions 💡💡💡

— Alfredo Canziani (@alfcnz) March 18, 2022

A vector 𝒆 ∈ ℝᴷ can be thought of as a function 𝒆 : {1, …, 𝐾} ⊂ ℕ → ℝ, mapping all 𝐾 elements to a scalar value.

Similarly, a function 𝑒 : ℝᴷ → ℝ can be thought of as an infinite vector 𝑒 ∈ ℝ^ℝᴷ, having ℝᴷ elements. pic.twitter.com/ccZREDAal1
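
In code, the same duality looks roughly like this (a toy sketch, with 1-based indexing to match the notation above):

```python
import torch

K = 4
e = torch.randn(K)                       # a vector e ∈ ℝᴷ ...

def e_as_function(k: int) -> float:      # ... seen as a map {1, …, K} → ℝ
    return e[k - 1].item()

print([e_as_function(k) for k in range(1, K + 1)])

# Conversely, a function e : ℝᴷ → ℝ, sampled on finitely many points of its domain,
# behaves like a (very long) vector with one entry per sampled point.
f = lambda v: v.norm().item()
grid = torch.randn(1000, K)              # a finite stand-in for the uncountable domain ℝᴷ
f_as_vector = torch.tensor([f(v) for v in grid])
print(f_as_vector.shape)                 # torch.Size([1000])
```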

One giant leap for Alf, one small step forward for the book 🥲🥲🥲#TeXLaTeX #EnergyBasedModel #DLbook pic.twitter.com/X3FU8Uijys

— Alfredo Canziani (@alfcnz) March 22, 2022

Just some free energy geometric construction. 🤓🤓🤓 pic.twitter.com/DsIevqzuv2

— Alfredo Canziani (@alfcnz) April 4, 2022

Negative gradient comparison for F∞ and Fᵦ.

— Alfredo Canziani (@alfcnz) May 3, 2022

For super-cold 🥶 zero-temperature limit we have a single force pulling on the manifold per training sample.

For warmer temperatures ☀️😎 we pull on regions of the manifold.

For super-hot 🥵 settings we kill ☠️ all the latents 😥. pic.twitter.com/cFsGQ3FJFV
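
For readers who want to play with this, here is a toy sketch of the temperature effect, assuming the usual free energy $F_\beta = -\frac{1}{\beta}\log\sum_z e^{-\beta E(y,z)}$ over a discrete set of latents (the energy values are made up):

```python
import torch

E = torch.tensor([1.2, 0.3, 0.9, 2.0])   # made-up energies E(y, z) for one sample y, over latents z

def free_energy(E, beta):
    # F_beta = -(1/beta) · log Σ_z exp(-beta · E(y, z))
    return -torch.logsumexp(-beta * E, dim=0) / beta

for beta in (100.0, 1.0, 0.01):
    w = torch.softmax(-beta * E, dim=0)   # how strongly each latent pulls on the manifold
    print(f'beta={beta:6.2f}  F_beta={free_energy(E, beta).item():8.3f}  pull={w.numpy().round(2)}')

# beta → ∞ : the pull is one-hot and F → min_z E (a single force per training sample);
# moderate beta : several latents share the pull (a whole region of the manifold moves);
# beta → 0 : the pull is uniform, so the choice of latent no longer matters.
```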

«The ellipse toy example» chapter is DONE. 🥳🥳🥳

— Alfredo Canziani (@alfcnz) May 17, 2022

7.5k words, 1.2k lines of TikZ, 0.8k lines of Python.

I think I got this! 🥲🥲🥲 pic.twitter.com/5uwwrLcXPf

A small glimpse from the book, achievement of the day 🤓🤓🤓

— Alfredo Canziani (@alfcnz) June 1, 2022

The two soft maxima and soft minima are compared to the minimum, average, and maximum of a real vector (of size 5). This is a fun plot because the y-axis does something funky 🤪🤪🤪 pic.twitter.com/tST48uxmL2
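
A quick numerical sketch of what I mean, assuming the two soft-maximum flavours are the log-sum-exp and the Boltzmann (softargmax-weighted) operators (the vector's values are made up):

```python
import torch

x = torch.tensor([-2.0, -0.5, 0.0, 1.0, 3.0])   # a real vector of size 5
beta = 1.0

soft_max_lse   = torch.logsumexp(beta * x, 0) / beta      # log-sum-exp flavour (≥ hard max)
soft_max_boltz = (x * torch.softmax(beta * x, 0)).sum()   # Boltzmann-average flavour
soft_min_lse   = -torch.logsumexp(-beta * x, 0) / beta
soft_min_boltz = (x * torch.softmax(-beta * x, 0)).sum()

print(x.min().item(), x.mean().item(), x.max().item())    # hard minimum, average, hard maximum
print(soft_min_lse.item(), soft_min_boltz.item(),
      soft_max_boltz.item(), soft_max_lse.item())
# As beta → ∞ both soft maxima tend to the hard max; as beta → 0 the Boltzmann
# flavour tends to the average, while the log-sum-exp one drifts off to ±∞.
```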

Another update from the book. 📖

— Alfredo Canziani (@alfcnz) June 8, 2022

A classifier 'moves' points around such that they can be separated by the output linear decision boundaries.

Usually one looks at how the net warps the decision boundaries around the data but I like to look at how the input is unwarped instead. 🤓 pic.twitter.com/M3ZGmUUZI6

When looking at a classifier, we can consider its energy as being the cross-entropy or its negative linear output (often called logits). The energy of a well-trained model will be low for compatible (x, y) and high for incompatible pairs. 📖📖📖 pic.twitter.com/HlfvXQvGWn

— Alfredo Canziani (@alfcnz) June 10, 2022
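
In PyTorch terms, the two energies mentioned above relate like this (a minimal sketch with made-up logits):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.5, -1.0, 0.3])   # made-up linear outputs for a 3-class problem
y = 0                                      # the compatible (correct) class

energy_logit = -logits[y]                                              # negative linear output
energy_xent  = F.cross_entropy(logits.unsqueeze(0), torch.tensor([y])) # cross-entropy

# The two only differ by the log-partition (free-energy-like) term:
#     cross-entropy = -logits[y] + logsumexp(logits)
assert torch.isclose(energy_xent, energy_logit + torch.logsumexp(logits, 0))
```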

Maths operand order is often counterintuitive.

— Alfredo Canziani (@alfcnz) July 7, 2022

For example, 𝒔 = 𝑾𝒓 = 𝑼𝚺𝑽ᵀ𝒓 can be more naturally represented by the following circuit. 🤓🤓🤓 pic.twitter.com/S6rdtBtzuy

We can use SVD to inspect 🔍 what a given linear transformation does. From the diagram below we can see how the lavender oriented circle with axes 𝒗₁ and 𝒗₂ gets morphed into the aqua oriented ellipse with axes 𝜎₁𝒖₁ and 𝜎₂𝒖₂. So, they are ‘stretchy rotations’. pic.twitter.com/0HpOwOPbpf

— Alfredo Canziani (@alfcnz) July 8, 2022
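
Here is a small sketch of that inspection with a made-up transformation: the unit circle's axes (rows of Vᵀ) land on the ellipse's axes scaled by the singular values.

```python
import torch

W = torch.tensor([[1.5, 1.0],
                  [0.0, 0.5]])             # a made-up linear transformation

U, S, Vh = torch.linalg.svd(W)             # W = U · diag(S) · Vᵀ

t = torch.linspace(0, 2 * torch.pi, 100)
circle = torch.stack((t.cos(), t.sin()))   # points of the unit circle, as columns
ellipse = W @ circle                       # rotate (Vᵀ), stretch (diag(S)), rotate (U)

# The circle's axes v1, v2 map onto the ellipse's axes s1·u1, s2·u2: a 'stretchy rotation'.
assert torch.allclose(W @ Vh[0], S[0] * U[:, 0], atol=1e-6)
assert torch.allclose(W @ Vh[1], S[1] * U[:, 1], atol=1e-6)
```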

A neural net is a sandwich 🥪 of linear and non-linear layers. Last week we've learnt about the geometric interpretation of linear transformations, and now we're appreciating a few activation functions' morphings.

— Alfredo Canziani (@alfcnz) July 19, 2022

Almost done with the intro chapter! 🥳🥳🥳 pic.twitter.com/9SAIfkKUWk

Chapter 1 (2 and 3) completed! 🥳🥳🥳

— Alfredo Canziani (@alfcnz) July 22, 2022

We've seen a linear and a bunch of non-linear transformations. But what can a stack of linear and non-linear layers do? Here we have two fully-connected nets doing their nety stuff on some random points. 😀😀😀 pic.twitter.com/otExi5h7bb

Last update: 26 Jul 2022.

## Oct 2022 update

For the entire month of Aug and half of Sep I got stuck on implementing a working sparse coding algo for a low-dimensional toy example.
**Nothing** was working for a long while, although I eventually managed to get the expected result (see tweets below).
Then I spent a couple of weeks on the new semester’s lectures, creating new content (slides below, video available soon) on back-propagation, which I’d never taught at NYU before, a topic that will make it into the book.
Anyhow, now I’m back to writing! 🤓


Zooming in a little, for some finer details. pic.twitter.com/i57E0rYwzH

— Alfredo Canziani (@alfcnz) September 9, 2022

Backpropagation ⏮ of the gradOutput throughout each network's module allows us to compute the rate of change of the loss 📈 wrt the model's parameters.

— Alfredo Canziani (@alfcnz) September 26, 2022

To inspect 🧐 its value we can simply check the gradBias of any linear layer. pic.twitter.com/buysxDBGD7
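
A minimal sketch of that trick, checked against autograd (the shapes and the loss are made up):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lin = nn.Linear(3, 2)
x = torch.randn(5, 3)                 # a batch of 5 made-up samples
y = lin(x)
y.retain_grad()                       # keep dLoss/dy (the gradOutput) around for comparison

loss = y.pow(2).mean()                # any scalar loss will do for the check
loss.backward()

# Since y = x Wᵀ + b, the bias gradient is exactly the gradOutput summed over the batch,
# so it is a cheap probe of the gradient flowing out of the layer.
assert torch.allclose(lin.bias.grad, y.grad.sum(dim=0))
```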

Last update: 26 Sep 2022.

## May 2023 update

Oh boy, this 4th chapter took me a while (mostly because I’ve also been focussing on other things, including the Spring 2023 edition of the course)… but it’s done now!
In these last few months I’ve written about *undercomplete autoencoders* (AE), *denoising AE*, *variational AE*, *contractive AE*, and *generative adversarial nets*.
Thanks to Gabriel Peyré, I’ve developed a method to separate stationary sinks and sources of a dynamics field (which I may write an article about), and it’s now an integral part of the book’s explanations.

Moreover, I’ve been pushing a few videos from the Fall 2022 edition of the course, which give a preview of the chapters I’ve been writing, *e.g.* neural net components, backpropagation (first time teaching it), energy-based classification, PyTorch training, K-means, and sparse coding (at least for now).
Finally, over the Winter break, I’ve been teaching 12-year-olds about the maths and computer science behind generative AI, and I’m considering using p5.js as a tool to teach programming to beginners.

What’s next? I’m sending this first draft, with its 4 chapters (Intro, Spiral, Ellipse, Generative) and companion Jupyter Notebooks, to Yann for a review. Meanwhile, I’ll be writing the Backprop chapter, possibly an article, and pushing a few more videos to YouTube. Once the review is completed, a first draft will appear on this website for the public.


### Figures from chapter 4

A 2 → 100 → 100 → 1 → 100 → 100 → 2 hyperbolic tangent undercomplete autoencoder trying to recover a 1d manifold from 50 2d data points. 📖📖📖 pic.twitter.com/ImKbpPTavY

— Alfredo Canziani (@alfcnz) November 12, 2022
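
For reference, the architecture in that figure can be sketched in a few lines of PyTorch (the training loop details below are my own assumptions, not the notebook's):

```python
import torch
import torch.nn as nn

# 2 → 100 → 100 → 1 → 100 → 100 → 2, hyperbolic tangent activations.
autoencoder = nn.Sequential(
    nn.Linear(2, 100), nn.Tanh(),
    nn.Linear(100, 100), nn.Tanh(),
    nn.Linear(100, 1), nn.Tanh(),        # 1d bottleneck: the recovered manifold coordinate
    nn.Linear(1, 100), nn.Tanh(),
    nn.Linear(100, 100), nn.Tanh(),
    nn.Linear(100, 2),
)

x = torch.randn(50, 2)                   # 50 two-dimensional data points (made up here)
optimiser = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)
for _ in range(1_000):
    optimiser.zero_grad()
    loss = (autoencoder(x) - x).pow(2).mean()   # reconstruction energy
    loss.backward()
    optimiser.step()
```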

Let’s get some sections done! 🤓🤓🤓 pic.twitter.com/13bllkQ3wx

— Alfredo Canziani (@alfcnz) December 13, 2022

A variational autoencoder (VAE) limits the low-energy region by mapping the inputs to fuzzy bubbles. The hidden representation can be made uninformative by increasing the temperature during learning, which induces the bubbles to be all centred at the origin and have unit size. pic.twitter.com/qpa8ptsJDD

— Alfredo Canziani (@alfcnz) March 16, 2023
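
A minimal sketch of the fuzzy-bubble idea (dimensions and the temperature-like coefficient are made up): the encoder outputs a bubble's centre and size, sampling adds the fuzziness, and a KL-like term pulls every bubble towards the origin and towards unit size.

```python
import torch
import torch.nn as nn

enc = nn.Linear(2, 2 * 8)                # 8 bubble centres μ and 8 log-sizes
dec = nn.Linear(8, 2)

def vae_loss(x, beta=1.0):               # beta plays the role of the temperature
    mu, log_var = enc(x).chunk(2, dim=-1)
    z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()   # sample inside the fuzzy bubble
    reconstruction = (dec(z) - x).pow(2).mean()
    # KL-like term: pulls every bubble towards the origin with unit size.
    kl = 0.5 * (mu.pow(2) + log_var.exp() - log_var - 1).sum(-1).mean()
    return reconstruction + beta * kl

x = torch.randn(50, 2)
print(vae_loss(x).item())
```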

Done with the VAE chapter! 🥳🥳🥳

— Alfredo Canziani (@alfcnz) March 24, 2023

Two sections to go and the first draft ships! 🥳🥳🥳

Yay! 🥳🥳🥳 pic.twitter.com/Lj30urRpZH

We have a caption now! The contractive autoencoder section is completed.

— Alfredo Canziani (@alfcnz) April 18, 2023

One section to go! 🥳🥳🥳 https://t.co/cpid936wDr pic.twitter.com/mTIDqkYSqm

Epoch 0 vs. epoch 18k.

— Alfredo Canziani (@alfcnz) May 11, 2023

Losses and generator gradients' norm.

Critic learnt energy. pic.twitter.com/7swifi5qNj

### Videos from DLFL22

Let's end this year by starting to upload the first video of NYU Deep Learning Fall 2022 edition! 🥳🥳🥳

— Alfredo Canziani (@alfcnz) December 30, 2022

This is an incremental version based on DLSP21. Therefore, only new content will be uploaded.

Enjoy the view. https://t.co/TxaNhQgUbO pic.twitter.com/hVZYWEJMv8

Let's start the year by brushing up on the basics of neural nets: linear and non-linear transformations.

— Alfredo Canziani (@alfcnz) January 1, 2023

In this episode, we're concerned with inference only. Forward and backwards. We introduce the cost and the energy. 🔋

Website: https://t.co/3yY8CMLiXz https://t.co/zrqH4CG0mr pic.twitter.com/MrSeV3u40S

The first video of the «Classification, an Energy Perspective» saga shows two nets' data space transformation, introduces the data format, illustrates the predictor-decoder architecture, and explains how gradient descent is used for learning.

— Alfredo Canziani (@alfcnz) January 4, 2023

Enjoy 🤓❤️🤗https://t.co/glH2iGydIJ pic.twitter.com/S33JxwdH83

The second video of the «Classification, an Energy Perspective» saga teaches backprop, visualises the energy landscape, and explains how contrastive learning works. 🤓

— Alfredo Canziani (@alfcnz) January 9, 2023

This lecture alone was the reason DLFL22 has been pushed online. I hope you like it. ❤️ https://t.co/5vVQRwLzxK pic.twitter.com/x0lQaT9hKz

The third and last video of the «Classification, an Energy Perspective» saga covers neural net 5-step training code in @PyTorch, gradient accumulation justification, reproduction of the energy surface for different models, and ensembling for uncertainty estimation. https://t.co/oyEGlgyhTE pic.twitter.com/MaZsSSRg8U

— Alfredo Canziani (@alfcnz) February 21, 2023

In this lecture, we start with two examples of decoder-only latent-variable EBM (𝐾-means and sparse coding), move to target-prop via amortised inference, to finally land the autoencoder architecture. 🤓

— Alfredo Canziani (@alfcnz) February 28, 2023

Back to using @AdobeAE for the animations! 🥳 https://t.co/ATbVwuxmcC 🎥 pic.twitter.com/kWEF68cE9Q

### Teaching Italian 7th graders

I taught 4 hours of Deep Learning to a class of 7th graders. I didn’t dumb it down at all. I just used the same analogies and explanations I use with the grown-ups. By the end I was in love with their young and fresh minds and total absolute attention. ❤️ https://t.co/CFP4Mkarwx pic.twitter.com/Ng0veJLftq

— Alfredo Canziani (@alfcnz) January 18, 2023

Last update: 16 May 2023.

## Aug 2023 update

Of course, during the Summer it was unrealistic to expect anyone to review anything… Anyhow, I’ve just got back from O‘ahu (ICML23) and Maui (2 days before Lahaina burnt down) and finished the Backprop chapter, so the first draft will now have 5 chapters in total. Below you can see a few diagrams I’ve developed over these Summer months.

The new semester starts in two weeks, so I’ll be a bit busy with that. I need to plan a possible chapter on joint embedding methods and start working on PART II of the book: ‘geometric stuff’.

About books, I’ve just received my copy of *The Little Book of Deep Learning* by François Fleuret.
I have to say it is *really* well made and I *really* like it.
It’s a bit on the terse side, but I haven’t decided if it’s a pro or a con.


Let's go fancy with inline diagrams!

— Alfredo Canziani (@alfcnz) May 22, 2023

LaTeX has no secrets for me, mhuahahaha! 🤪🤪🤪

(Writing the backprop chapter.) pic.twitter.com/bm1knKYCI7

Backprop, the key component behind training multi-layered deep nets, can sometimes be challenging to digest. Here follows an attempt to illustrate it, starting from the last linear layer's gradWeight and gradBias computation in a regression setup. 🤓🤓🤓 pic.twitter.com/PLeiRjJVYb

— Alfredo Canziani (@alfcnz) May 26, 2023
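
A small sketch of that last-layer computation, checked against autograd (the shapes and targets are made up):

```python
import torch

torch.manual_seed(0)
h = torch.randn(8, 4)                    # input to the last linear layer (batch of 8)
W = torch.randn(2, 4, requires_grad=True)
b = torch.randn(2, requires_grad=True)
y = torch.randn(8, 2)                    # regression targets

y_hat = h @ W.T + b
loss = 0.5 * (y_hat - y).pow(2).sum()
loss.backward()

grad_output = y_hat.detach() - y                          # dLoss/dŷ, the gradOutput
assert torch.allclose(W.grad, grad_output.T @ h)          # gradWeight = gradOutputᵀ · input
assert torch.allclose(b.grad, grad_output.sum(dim=0))     # gradBias   = gradOutput summed over batch
```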

A neural net is made of simple building blocks.

— Alfredo Canziani (@alfcnz) June 9, 2023

Learning how the output gradient is backpropagated through these basic components helps us understand how each part contributes to the final model performance.

Below we see how the node & sum complementary modules behave. pic.twitter.com/tv5s5A1TFp

«Weights sharing implies tied gradients accumulation.» Since it's not obvious for half of you and only a small fraction can prove it (link to the poll below), let me share this latest book section with y'all! 😀😀😀

— Alfredo Canziani (@alfcnz) June 14, 2023

This also justifies the backward behaviour of the node module. pic.twitter.com/fA9dpYZiIp
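
The claim is easy to verify numerically (a toy sketch):

```python
import torch

w = torch.tensor(2.0, requires_grad=True)     # one parameter, used in two places
x1, x2 = torch.tensor(3.0), torch.tensor(5.0)

loss = w * x1 + w * x2                         # the same (shared) w appears twice
loss.backward()

print(w.grad)                                  # tensor(8.) = x1 + x2: the two gradients accumulate
```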

The one-hot row routing and branching matrix G 🐢 is a peculiar object. When it's used in a left-multiplication, it acts as a selector and/or branching operator. When it's used in a right-multiplication, it acts as an accumulator via the paths that have previously branched out. pic.twitter.com/DEH7FOxhqr

— Alfredo Canziani (@alfcnz) June 21, 2023
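
A toy sketch of both behaviours with a made-up G (rows 2 and 3 branch out of the same entry):

```python
import torch

G = torch.tensor([[1., 0., 0.],
                  [0., 0., 1.],
                  [0., 0., 1.]])   # one-hot rows; rows 2 and 3 both route entry 3

v = torch.tensor([10., 20., 30.])
print(G @ v)                       # left-multiplication: select / branch → tensor([10., 30., 30.])

u = torch.tensor([1., 1., 1.])     # e.g. gradients flowing back along the three branched paths
print(u @ G)                       # right-multiplication: accumulate → tensor([1., 0., 2.])
```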

Last update: 16 Aug 2023.