Ἀγεωμέτρητος μηδεὶς εἰσίτω.
οὐκ ἔστιν βασιλικὴ ἀτραπὸς εἰς τὴν γεωμετρίαν.

Structure and Interpretation

of Tensor Programs

The Wizard Book 2.0: Deep Learning and Deep Learning Systems
Trainnanochatby buildingteenygradfrom scratch: the bridge from microgradto tinygrad

Made with 🖤 by Jeffrey David Zhang, Waterloo Mathematics
Made possible by Lambda Labs Research Grant
First Edition (Draft).

You are viewing this on a mobile device, whereas SITP is best viewed on a desktop — the book includes various multimedia lecture videos, visualizers, any tufte-style sidenotes with many external hyperlinks to other resources.

Dedication

In loving memory of my father, my teacher, and my best friend Dr. Thomas Zhang ND., R.TCMP, R.Ac.
May you rest in pure land. We'll meet again dad.

We’ll Meet Again by Vera Lynn 1939. Cover by Johnny Cash 2002.

Myself, presenting an early outline of SITP at Toronto School of Foundation Modeling Season 1

Preface

This book is aspirationally titled The Structure and Interpretation of Tensor Programs, (abbreviated as SITP) as it's goal is to serve the same role for software 2.0 as The Structure and Interpretation of Computer Programs (abbreviated as SICP) did for software 1.0. Written by Harold Abelson and Gerald Sussman with Julie Sussman, SICP has reached consensus amongst many to be integral to the programmer canon, providing an introductory whirlwind tour on the essence of computation through a logically unbroken yet informal sequence from programming, to programming languagesSICP influenced other texts such as it's dual HtDP (introduced at Waterloo by Prabhakar Ragde) it's typed counterpart OCEB, and the recent addition of DCIC spawning from it's phylogenetic cousin PAPL.. The primary reason why this format is loved by many (and to be fair, equally disliked by a few), is because of the somewhat challenging learning curve for beginners, as it's not a book which simply teaches syntax such as the many "Learn X in Y Minute" booksWhich at this point LLM's adequately replace., but rather the semantics of a programming with programming languagesThe difficulty of teaching has always been semantics: Matthias Felleisen's TeachScheme!, Shriram Krishnamurthi's Standard Model of Programming Languages, Will Crichton's Profiling Programming Language Learning (something on SITP's roadmap)..

Before the success of large language modelsnotably the supervised finetuning and reinforcement learning from human feedback on top of a pretrainted transformer, the pedagogical return on investment in an introductory book on artificial intelligence following the same form as SICP was low, as readers would build their own pytorch from scratch "just" to classify MNIST or ImageNet. However now that deep learning systems are becoming as important if not more than the models themselves especially in the period of research in artificial intelligence dubbed the era of scaling, characterized by the heavy engineering of pouring internet-scale data into the weights of transformer neural networks with massively parallel and distributed compute., that return on investment is higher, as the frontier of deep learning systems increasingly becomes ever more out of reach from the grasp of the beginnerthe massively parallel processors now have dedicated hardware units evaluating matrix instructions called tensor cores, which in turn have precipitated the need for fusion compilers. Even language-runtime codesign/cooptimization like profile-guided optimizations are repeating themselves with languages such as torch.compile() and runtimes like vllm/sglang.. At least this is how I personally felt as a professional engineer transitioning to the world of domain specific tensor compilers, coming from domain specific cloud compilers and distributed infrastructure provisioners.

While I enjoyed reading existing deep learning canonsuch as [DFO20] for mathematics, [JWHTT23] for machine learning, [GBC16] for deep learning, [HKH22] for parallel programming, I couldn't help but imagine how delightful a SICP-style top-down-just-in-time reading experience would be. If product management and engineering transitions into vibecoding and training models respectively, then why aren't we teaching the parallel programming of deep nets to highschoolers and first year college students from the get-go? Curiosity got the best of me, and what resulted therein is the book you hold on your screens.

So in part one of the book, you will train your generalized linear models with numpy and then start developing teenygrad by implementing your own multidimensional array abstraction to train those models once again. Then, in part two of the book, you will train deep neural networks with pytorch following the age of research and then update the implementation of teenygrad to support gpu-accelerated "eager mode" evaluation for neural network primitives for both forward and backward passes. Finally, in part three which if you are more experienced, you may benefit in jumping straight to, to better understand what something like torch.compile() is doing for you of the book, you will update teenygrad for the last time for the age of scaling by developing a "graph mode" compilation and inference engine with tinygrad's RISCy IR, borrowing ideas from ThunderKitten's tile registers, MegaKernels, and Halide/TVM schedules. To continue deeping your knowledge, more resources are provided in the afterword.

The book provides a single resource with code, math, and expositioninspired by pedagogy such as Dive into Deep Learning (Zhang, Lipton, Li and Smola) and Distill (Carter and Olah) for deep learning systems such as pytorch and jax, while also embedding visualizers, explainers, and lectures from other open source educatorsfrom Andrej Karpathy, Grant Sanderson, Stephen Welch, Artem Kirsanov, and so on. to provide a rich multimodal dynamic documentas explored by Crichton with research in technical foundations of technical communication experience for the reader. While the SITP book and the teenygrad codebase is licensed under the MIT License, such embedded content remains the property of its original creators and licensors and is not claimed as original work of this project nor released under the MIT License.

If you empathize with some of my frustrations, you may benefit from the book too.
If you are looking for reading groups checkout the #teenygrad channel in
Good luck on your journey.
Are you ready to begin?

Tower of Babel, Genesis 11:1–9

Prologue

In some sense, the 21st century truly began only after the first 20 years past the second millenium, for it was not until the creation of ChatGPT where humanity traded in their so-called bicycles of mind for motorcycle upgrades. From 2020 to 2025, programmers (just like you) discovered The Scaling Laws, where pouring internet-scale data into the weights of transformer neural networks with massively parallel and distributed compute produces large language models which in turn enable communication between natural organisms and artificial machines through the means of natural language. This has always been a long standing dreamargubally started with Descartes, extended by La Mettrie's Man A Machine, initiated by Leibniz's Universal Calculus, and applied computationally with Wittgenstein's Tractatus Logico-Philosophicus and Philosophical Investigations. in the science of the mind we call artificial intelligence.

The story of artificial intelligence is tightly interconnected with computation, given that the field as we know it today started in earnest during the 20th century at the 1956 Dartmouth Summer Research Project on Artificial Intelligence. There, a group of resesearchers interested in the science of the mind went on to establish the field of "artificial intelligence" a rebranding of the science of mindtainted by the hermaneuticism of psychoanalysis with computational methodseffectively operationalizing the notion of computationalism rather than the correlative methods of neurosciencesee Could a Neuroscientist Understand a Microprocessor? (Jonas, Kording 2017) and the observational methods of psychology's behaviorism. Their reasoning, (roughly and reductively) consists of the following:

Since Gödel's Incompleteness Theorems state that there exist propositions unprovable, and the Church-Turing Thesiswhich SITP's spiritual predecessor SICP smuggles in through a footnote in chapter 4.1 The Metacircular Evaluator states that all representable languages implementable with computation have the same expressivity, if we ever wanted to physically realize non-biological artificial intelligence, the constructive stateful mathematics on Turing Machines implemented with von Neumann Architectures via electricity and semiconductors are the correct substrate to conduct and construct the science of the mind, as opposed to classical stateless mathematics.

Although united by the idea of using computation to mechanize the mind and thus serve as the basis of artificial intelligence, they were divided between which exact computational techniques and to employ. The two prominent onesothers include embodimentalists, dynamicalists, and self-organizers being the symbolists and the connectionists which in some sense are the primordial "software 1.0" and "software 2.0" we now know today, and in fact date back to Aristotle and Laplace, in which Baye's Theorem generalizes Aristotelian Logic as a corner case with the probability of some belief given evidence is 1.

With the way in which the field played out, the logical approach with symbolic techniques to artificial intelligence started out as the favorite school of thoughtthus known as "classical" AI or "good-old-fashioned" AI as opposed to the probabilistic approach largely due to the 1969 book Perceptrons by Marvin Minsky and Seymour Papert. The logical approach used logical tools from logicianslike Frege, Tarski, Brouwer, Gentzen, Curry-Howard, Martin Löf, Girard, and so on to create expert systems such as ELIZA. Overtime however, and largely due to the continually increasing capability of hardware, the probabilistic approach with machine learning techniques started to see some more success.

Ironically enough, although these people were united in the idea of using probabilistic learning methodsfollowing the same principle of modeling phenomena with mathematical models where the system minimizes free energy which experienced large success in the 19th century when physicists such as Helmholtz, Gibbs, and Boltzmann modeled energy, enthalpy, and entropy, they themselves were divided between which exact models $y = f_{θ} (x)$ to employ. The three prominent ones being gaussian processes, kernel machines, and neural networks. Theoretically, (kernel machines todo), but practically neural networks when made deep are able to learn representations...the true watershed moment was during 2012 when a neural network named AlexNet trained on a parallel graphics processor was released, scoring a loss on the ImageNet dataset 10.8% better than the next runner-up.

The 2012-2019 period in deep learning is now becoming known as the era of research, as diverse and various inductive biases were explored through the means of network architectures, resulting in different neural networks such as feedforward neural networks, convolutional neural networks, recurrent neural networks, long-short-term-memory neural networks, and so on, up until the scaling the attention mechanism and feedforward nets inside the transformer archicture started gaining dominance in the 2020-2025 period, now known as the the era of scaling, which brings us back to the present day.

Euclid Instructing His Pupils in The School of Athens, Raffaello Sanzio da Urbino 1509-1511.

You are viewing this on a mobile device, whereas SITP is best viewed on a desktop — the book includes various multimedia lecture videos, visualizers, any tufte-style sidenotes with many external hyperlinks to other resources.

I. Elements of Networks

Although separated by over 2000 years, the discipline of programming originally practiced in Silicon Valley shares the same dream of mathematics in Ancient Greece: using discrete and deterministic descriptions to model reality. Although mathematicians prove properties about mathematical objects with theorems whereas programmers evaluate the values of said objects with programs, what the two traditions share in common is the activity of modeling discrete objects with deterministic descriptions which in turn, are created for them by toolsmiths referred to as logicians and language implementors — the mathematician describes the world with a logic, and the programmer does so with a programming language.

However, for the programmer interested in learning the skills necessary for the statistical learning approach to artificial intelligence, they must make a transition to programming from discrete data usinand deterministic functions to that of continuous data $x$ and stochastic functions $f (x)$ . Although you will encounter many new concepts in your journey up ahead, there are also many similarities with the art of programming that you already know.

So in 1 Continuously Stochastic Data $x$ with numpy, we will first learn about programming data $x$ of software 2.0 by representing types continuously with coordinates, distributing truth stochastically with probability, and then recovering said distributions with maximum likelihood estimation. Then, in 1.2 Statistical Learning of Continuously Stochastic Functions with numpy, we will learn how to recover stochastic functions $y = f (x)$ by automatically evaluating their derivatives $f^{'} (x)$ with HIPS/autograd HIPS/autograd along with torch-autograd were the first numerical libraries to implement automatic differentiation in the ecosystems of productive programming languages like Python's numpy and Lua's torch, which ultimately inspired PyTorch. HIPS/autograd is used in part one of the book to emphasize the automatic differentiation abstraction was implemented before the heavy GPU acceleration PyTorch provided. For more information Soumith's PyTorch's design origins is an excellent read. and using it's direction of steepest descent to iteratively optimize the loss with stochastic gradient descent. From there, readers will unravel the mystery in how the internals of numpy and torch are implemented by building teenygrad from scratch. So in 1.3 Programming CPU Accelerated Basic Linear Algebra Subroutines with teenygrad.Tensor, teenygrad.Tensor is implemented in Python with strides that virtualize arbitrary n-dimensinal shapes to a 1-dimensional physical array. Then, the basic linear algebra operations on the Tensor are accelerated on the CPU by changing the host language from Python to Rust in order to translate programs into native machine code called assembly with the rustc compiler which enables us to analyze and tune the performance for the underlying processor's pipelines and memory hierarchiesBerkeley's opensource educational RISC-V cores sodor and BOOM cores along with Tenstorrent's RVV extension fork ocelot will be used to introduce instruction sets and microarchitecture, but x86 machines like Intel's Xeon and AMD's EPYC will be used in order to achieve server-class performance.

From there, part two of the SITP book takes you on the next step in the journey by increasing the expressivity of the models that are trained from generalized linear models to deep neural networks with torchIn part three, teenygrad will be updated once again for SITP's ultimate capstone project which modifies the eager mode interpreter implementation to a graph mode compiler in order to support the nanogpt and nanochat models. This final version of teenygrad will share 80% of the abstractions with the pure fusion compiler called tinygrad, providing a bridge to the frontier of deep learning systems.. and then evolving teenygrad's implementation to support the same neural network primitives, with GPU acceleration.

I. Table of Contents

WeThe following is a software2.0-adapted excerpt from The Structure and Interpretation of Computer Programs Chapter 1: Building Abstractions with Procedures: are about to study the idea of a computational process. Computational processes are abstract beings that inhabit computers. As they evolve, processes manipulate other abstract things called data. The evolution of a process is directed by a pattern of ~~rules called a program~~ probabilities called a model. People ~~create programs to direct processes~~ recover models to direct processes. In effect, we conjure the spirits of the computer with our spells.

A computational process is indeed much like a sorcerer's idea of a spirit. It cannot be seen or touched. It is not composed of matter at all. However, it is very real. It can perform intellectual work. It can answer questions. It can affect the world by disbursing money at a bank or by controlling a robot arm in a factory. The programs we use to conjure processes are like a sorcerer's spells. They are carefully composed from ~~symbolic~~ numerical expressions in arcane and esoteric programming languages that prescribe the tasks we want our processes to perform. A computational process, in a correctly working computer, ~~executes programs precisely and accurately~~ recovers models stochastically and noisily. Thus, like the sorcerer's apprentice, novice programmers must learn to understand and to anticipate the consequences of their conjuring. Even small errors (usually called bugs or glitches) in programs can have complex and unanticipated consequences.

1. Representing Data with Highly Dimensional Stochastic Distributions in `torch`

$↩$ Table of Contents

1.1 From Deductive Software 1.0 to Inductive Software 2.0

$↩$ Table of Contents

The first logician was a man born over 2000 years ago named Aristotle, as he discovered and created the formal structures and principles embedded within natural language that govern reasoning — referred to as a propositional logicAlthough there are many limitations with Aristotelian Logic serving as a sufficient basis for reasoning, proof assistants, and programming languages as we understand them today, many scholars trained in formal techniques have come to appreciate Aristotle with new respect given how far he went with so little to build from — when peak mathematics moved from Greece to the Middle East, Aristotle was known amongst the medeival Hindu-Arabic scholars as "The First Teacher"

Instead of trying to produce a programme to simulate the adult mind, why not rather try to produce one which simulates the child's? If this were then subjected to an appropriate course of education one would obtain the adult brain.

A computer program is said to learn from experience $E$ with respect to some class of tasks $T$ and performance measure $P$ , if its performance at tasks in $T$ , as measured by $P$ , improves with experience $E$ .

As programmers however, although we prefer to evaluate the values of objects rather than prove their properties, there turns out to be a lot in common between the two (in some sense, Aristotle was not only the first logician but also language implementor).

In this chapter 1 Statistical Inference of Continuously Stochastic Data with numpy we will cover continuously stochastic data $x$ and in the next chapter 1.2 Statistical Learning of Continuously Stochastic Functions with numpy, we will cover continuously stochastic functions $f (x)$ .. The discipline and profession of mathematics and programming are closely interconnected, and the transition from discretely deterministic descriptions of reality to continuously stochastic ones have already been made by our mathematical cousins a few hundred years ago during the Scientific Revolution.

In the next three chapters we will place our focus on (todo)

and in the next two, (todo)

in which you will be prepared to understand the final chapter:

1.7 Probability + Linear Algebra = High Dimensional Stochastic Distributions

1.2 Distributing Truth Stochastically with Probabilities

$↩$ Table of Contents

The first shift in mindset when transitioning from representing data in software 1.0 to that of software 2.0 is the stochastic nature of it.

That is, phenomena that is randomly determined. It wasn't until 1933 where Kolmogorov formalized the notion of probabilities with the following set-theoretic construction:

the set of all possible states (outcomes) of the experiment is the sample space $Ω$
the set of all possible subsets of the sample space is the event space $F \subseteq Ω$
a measure $P : F \to [0, 1]$ which maps all events to some real between $0$ and $1$

Together this triplet of two sets and one function $(Ω, F, P)$ defines a probability space. Focusing on the map $P$ , although it formalizes the notions of "chance", "likeliness", and "probability" as a numerical assignment between $0$ and $1$ on every event in the event space, there are two possible semantic interpretations of the value returned by this function when modeling real-world stochastic phenomena — namely that of the so-called frequentist and bayesian interpretation of probability.

The former interprets probabilities (the values returned by the probability measure) as an "objective" belief where the value returned is the count of the event in which an infinite sequence of repeated trials converges to. That is, $P (E) = n \to \infty lim \frac{count ( E )}{n}$

so for instance, the most trivial example is an experiment of rolling a single fair coin with a probability assignment of $P ({head}) = 0.5$ . The experiment can be repeated many times, and the frequency between the count of heads and the number of trials will converge to $0.5$ . As programmers, running a simulation of such experiment is quite simple:

import random

sample_space = {"🧑", "🤖"} # human head, robot tail
def experiment():
  outcome = random.choice(list(sample_space))
  return outcome

if __name__ == "__main__":
  n = 10_000_000
  target_e, target_e_count = {"🧑"}, 0
  for i in range(n):
    outcome = experiment()
    if outcome in target_e: target_e_count += 1

  prob_e = len(target_e) / len(sample_space)
  freq_e = target_e_count / n
  print(f"the expected probability of {target_e} is {prob_e}")
  print(f"the actual frequency of {target_e} in {n} trials is {freq_e}")

Although running the simulation results in slightly different numbers, they are all close to $0.5$ within a few precision points. The probability assignment of $0.5$ is something that can be "objectively" verified. To contrast, bayesian interpretation of probabilities are "subjective" beliefs where the value returned represents the internal uncertainty of the observer. For instance, consider an unfair die (todo..)

It's important to remember that neither interpretation is "the" correct one, and often times the availability of data implies which one to use. todo..returning to functional axiomaticization...probability is the weight of a set.

1.8 Recovering Probabilities from Data with Parameter Estimation via Maximum Likelihood

$↩$ Table of Contents

2. Learning Functions from Data with Optimization in `torch`

$↩$ Table of Contents

In chapter 1, you've initiated your transition from deductive software 1.0 to inductive software 2.0 by:

using random variables instead of deterministic variables by distributing the truth of types onto a weighted set of events and representing such types in high dimensional coordinate systems
recovering programs from data rather than specifying programs with instructions by estimating the parameters of distributions with maximum likelihood and maximum a posteriori estimation

And so now, the formulation of a computer program learning from experience $E$ with respect to some class of tasks $T$ and performance measure $P$ starts to make more sense with respect to learning distributions from data. In chapter 2, we extend the notion of experience $E$ and tasks $T$ to not just learn distributions $X$ from data, but that of learning functions $f : X \to Y$ . So instead of recover a single random variable $X$ from data like in chapter 1, we will recover two random variables $X$ and $Y$ to construct linear functions $f : X \to Y$ (todo, maybe rephrase. depending on how the link between ERM of squared error and MLE is presented).

More formally, given observations of input-outputWe stick to the domain neutral terminology of inputs and outputs. However the input-output pairs $(x^{(i)}, y^{(i)})$ are also referred to as predictors/response and features/labels pairs comprising some dataset $D = {(x^{(i)}, y^{(i)})}_{i = 1}^{n}$ , with function approximation, learn some mapping from input space to output space $f : X \to Y$ , and evaluate such learned mapping $\overset{y}{^} = f (x^{n + 1})$ in order to predict new outputs given unseen input well, referred to as generalization. When historical data of the expected output $y^{(i)}$ given some input $x^{(i)}$ is available, we refer to this regime as supervised learning, for the learning algorithm is being "supervised" with the correct answerAs opposed to unsupervised learning which we'll cover in part II.

However, roughly speakingOrdinal data can also exist in addition to numerical and categorical data(todo), there are two settings of supervised learning depending on the type of the output space $Y$ that the learned function $f$ must map to (with the input space $X$ usually being the vector space of $R^{d}$ ). If the phenomena being modeled admits quantitative outputs, then the function that needs to be learned takes the form $f : R^{d} \to R$ and is referred to as regressionThe origin of the name is instructive (Stigler, 1986). It comes from 19th century investigations into the relationship between the attributes of parents and their children. People who are taller (heavier, faster,...) than average tend to have children who are also taller than average, but not quite as tall., whereas if it admits qualitative outputs, then the function that needs to be learned takes the form $f : R^{d} \to {1, \dots, k}$ and is referred to as classification.

In the next two chapters of Chapter 2. Learning Functions From Data with pytorch will implement linear models for both cases of regression and classification with torch, and Chapter 3. Accelerating Basic Linear Algebra Subroutines with teenygrad.Tensor will re-implement them with our own deep learning framework teenygrad.

2.1 Recovering Probabilities from Data with Parameter Estimation via Maximum Likelihood

$↩$ Table of Contents

2.2 Predicting Scalars by Linearly Modeling Regression

$↩$ Table of Contents

Similarly to learning distributions $X$ from data, learning functions $f : X \to Y$ from data roughly follows the three step process of selecting a model for the task $T$ , a performance measure $P$ , and optimizing such measure on experience $E$ . Let's take a look at the linearly modeling regression with squared error loss, a problem which straddles tools from all three languages of probability theory, linear algebra, and calculus.

For the task of house price prediction, the input space $X = R^{d}$ is the size in square feet, and the output space $Y = R$ is the price, meaning that the function which needs to be recovered has the type of $f : R^{d} \to R$ . An assumption about the structure of the data needs to be made, referred to as the inductive bias. The simplest assumption to make is to assume that there exists some linear relationship between the data $x$ and parameters $w$ so that $f$ ends up being modeled as an inner product plus bias, parameterized as $f (x; w, b)$ : $f f (x; w, b) : R^{d} \to R : = (w_{1} x_{1} + w_{2} x_{2} + \dots + w_{n} x_{n}) + b = (i = 1 \sum d w_{i} x_{i}) + b = w^{⊤} x + b$

One small adjustment we can make to $f$ is to fold the final bias term $b$ into the vector $w$ as $w_{1}$ , increasing the total dimensionality so that $w \in R^{d + 1}$ , and adjusting $x$ accordingly so that $x_{1} := 1$ and $x \in R^{d + 1}$ . Then, the equation of our line with bias is simply parameterized as $f (x; w)$ $f f (x; w) : R^{d + 1} \to R : = w^{⊤} x$

Semantically speaking, each property of the input house $x_{i} \in R^{d}$ is weighted by some weight $w_{i}$ , adjusted accordingly during the learning process so that it accurately reflects the property's influence on the output price $y \in R$ when evaluating a prediction $\overset{y}{^}$ . We can evaluate $f$ on all $x^{(i)} \in D$ with a randomly intialized $w$ to ensure we've wired everything up correctly.

However, while specifying the equation of the linear regression model on ~~paper~~ screen and subsequently evaluating the computation manually might have been sufficient for pre-computational mathematicians such as Gauss and Legendre, the discipline of statistical learning is entirely predicated on computational methods. Let's continue to use PyTorch's core n-dimensional array datastructure with torch.Tensor, this time modeling a function $f : X \to Y$ rather than some distribution $X$ :

(todo: maybe change example to closure following torch.nn)

import torch
def f(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
  return torch.dot(x, w)

in which we can update the names of our torch.Tensor variables with the ranks of their vector spacesReferred to as Shape Suffixes, described by Noam Shazeer, an engineer from Google to clarify what dimensions these operations are evaluated on:

import torch
def f_shapesuffixed(x_D: torch.Tensor, w_D: torch.Tensor) -> torch.Tensor:
  return torch.dot(x_D, w_D)

and finally, actually enforce them via runtime typechecking with the jaxtyping package (try swapping the return statements below to trigger a type error with the return type):

import torch
from jaxtyping import Float, jaxtyped
from beartype import beartype

@jaxtyped(typechecker=beartype)
def f_typechecked(
  x_D: Float[torch.Tensor, "D"],
  w_D: Float[torch.Tensor, "D"]
) -> Float[torch.Tensor, ""]:
  # return X_ND
  return torch.dot(x_D, w_D)

Let's now sanity-check that f_typechecked is wired up correctly by evaluating it on all $x^{(i)} \in D$ with random parameter w_D. (todo: use actual housing data)

import torch
if __name__ == "__main__":
  X_ND = torch.tensor([
    [1500, 10, 3, 0.8],  # house 1
    [2100, 2,  4, 0.9],  # house 2
    [800,  50, 2, 0.3]   # house 3
  ], dtype=torch.float32)
  Y_N = torch.tensor([500000, 800000, 250000], dtype=torch.float32)

  w_D = torch.randn(4); print(f"random weight vector: {w_D}")
  for (xi_D,yi_1) in zip(X_ND,Y_N):
    yihat_1 = f_typechecked(xi_D,w_D)
    print(f"expected: ${yi_1}, actual: ${yihat_1:.2f}")

random weight vector: tensor([-0.3343,  0.0768,  2.7828, -0.1331])
expected: $500000.0, actual: $-492.46
expected: $800000.0, actual: $-690.89
expected: $250000.0, actual: $-258.08

Clearly, these outputs are unintelligible, and will only become intelligible with a "good" choice of parameter $w$ . Before we find such a parameter, let's make one more modification to our function $f$ to increase it's performance. Rather then evaluating f_typechecked on every input xi_D in the dataset zip(X_ND,Y_N) sequentially with a loop, we can evaluate all outputs with a single matrix-vector multiplication by updating $f$ 's function definition to the following

$f_{batched} f_{batched} (X; w) : R^{n \times d} \to R^{n} : = Xw = x_{1}^{⊤} w x_{2}^{⊤} w ⋮ x_{n}^{⊤} w$

and the corresponding PyTorch updated to

import torch
@jaxtyped(typechecker=beartype)
def f_batched(
  X_ND: Float[torch.Tensor, "N D"],
  w_D: Float[torch.Tensor, "D"]
) -> Float[torch.Tensor, ""]:
  return X_ND@w_D

Now that we've selected our model for the task $T$ , we can proceed with selecting our performance measure $P$ which the learner will improve on with experience $E$ .

2.3 Fitting Linear Regression by Directly Optimizing Squared Error Loss

$↩$ Table of Contents

After the inductive bias on the family of functions has been made, the learning algorithm must find the function $\hat{f}$ with a good fit. Since artificial learning algorithms don't have visual cortex like biological humans[], the notion of "good fit" needs to defined in a systematic fashion. This is done by selecting the parameter $θ \in R^{m}$ which maximizes the likelihood of the data $p (d; θ)$ . Returning to the linear regression inductive bias we've selected to model the house price data, we assume there exists noise $ϵ^{(i)}$ in both our model (epistemic uncertainty) and data (aleatoric uncertainty), so that $y^{(i)} = θ^{⊤} x^{(i)} + ϵ^{i}$ where $ϵ^{(i)} \sim N (μ, σ^{2})$

prices $y^{i}$ are normally distributed conditioned on seeing the features $x^{i}$ with the mean being the equation of the line $θ^{⊤} x^{(i)}$ where $y^{(i)} ∣ x^{(i)} \sim N (μ = θ^{⊤} x^{(i)}, σ^{2})$ , then we have that

$p (y^{(i)} ∣ x^{(i)}; θ) = N (θ^{⊤} x^{(i)}, σ^{2}) = \frac{1}{2 π σ ^{2}} exp (- \frac{( y ^{(i)} - θ ^{⊤} x ^{(i)} ) ^{2}}{2 σ ^{2}})$ TODO: generate prose/exposition by repaging lecture/text in ram

Returning to the linear regression model, we can solve this optimization with a direct method using normal equations. QR factorization, or SVD.

def fhatbatched(X_n: np.ndarray, m: float, b: float) -> np.ndarray: return X_n*m+b

if __name__ == "__main__":
  X, Y = np.array([1500, 2100, 800]), np.array([500000, 800000, 250000]) #  data

  X_b = np.column_stack((np.ones_like(X), X))              # [1, x]
  bhat, mhat = np.linalg.solve(X_b.T @ X_b, X_b.T @ Y)   # w = [b, m]

  yhats = fhatbatched(X, mhat, bhat) # yhat
  for y, yhat in zip(Y, yhats):
    print(f"expected: ${y:.0f}, actual: ${yhat:.2f}")

To summarize, we have selected and computed

an inductive bias with the family of linear functions $f (X; θ) : = X @ θ$
an inductive principle with the least squared loss $L (θ) : = \sum_{i = 0}^{n} (y^{(i)} - X @ θ)^{2}$
the parameters which minimze the empirical risk, denoted as $(\hat{θ}) = arg min L (θ)$

Together, the inductive bias describes the relationship between the input and output spaces, the inductive principle is the loss function that measures prediction accuracy, and the minimization of the empirical risk finds the parameters for the best predictor.

2.4 Predicting Categories by Linearly Modeling Classification

logistic regression, cross entropy loss, gradient descent

2.5 Fitting Logistic Regression by Iteratively Optimizing Cross Entropy Loss

2.6

linearity depends on that the scalars are in a field (addition is associative) "scalars" in a field --> we are dealing with floating point representations "vectors" we saw coordinate systems in 1.4 "matrices" and we saw linear transforms in 2.1 mathematically, a vector is a coordinate in some basis of a vector space a matrix is a linear function, the linear, in numerical linear algebra, is aspirational.

3. Accelerating Functions and Data with Basic Linear Algebra Subroutines in `teenygrad`

$↩$ Table of Contents

3.1 The Simplest `Tensor` Implementation with Nested `array`s

Now that we have an operational understanding of how multidimensional arrays are used in practice, let's now implement our very own np.array, which will serve as the foundation to the book's capstone project — the deep learning framework teenygrad.

Let's start with the simplest implementation idea for teenygrad.Tensor, which is a nested array implementation with Python's built-in list data structure.

For some $x \in R$ we will use a float, for $x \in R^{n}$ , we will use a list[float], and for arbitary $x \in R^{d_{0} \times \dots \times d_{n}}$ we will use a list[list[float]]. In fact, this is simply what we started doing in the previous chapter when implementing our linear regression model before switching to numpy's accelerated nd.array.

3.2 Virtualizing ND Logical Shapes to 1D Physical Storages with Strides

from typing import Self
import array, math
import teenygrad

class InterpretedTensor:
  @classmethod
  def arange(cls, end: int, requires_grad: bool=False) -> Self: return InterpretedTensor((end,), list(range(end)), requires_grad=requires_grad)
  @classmethod
  def zeros(cls, shape: tuple[int, ...]) -> Self:
    numel = math.prod(shape)
    tensor = InterpretedTensor((numel,), [0.0]*numel).reshape(shape)
    return tensor
  @classmethod
  def ones(cls, shape: tuple[int, ...]) -> Self:
    numel = math.prod(shape)
    tensor = InterpretedTensor((numel,), [1.0]*numel).reshape(shape)
    return tensor
  
  def __init__(self, shape: tuple[int, ...], storage: list[float], inputs: tuple[Self, ...]=(), requires_grad: bool=False) -> None:
    self.shape: tuple[int, ...] = shape
    self.stride: tuple[int, ...] = [math.prod(shape[i+1:]) for i in range(len(shape))] # row major, and math.prod([]) produces 1
    self.storage: list[float] = storage

    self.inputs: tuple[Self, ...] = inputs
    self._backward = lambda: None # callers override after init with self captured in closure
    self.grad: InterpretedTensor = InterpretedTensor.zeros(shape) if requires_grad else None # python can recursively type (no need for Box<_>) bc everything is a heap-allocated reference
  @property
  def numel(self): return math.prod(self.shape) # np (and thus jax) call this .size
  @property
  def ndim(self): return len(self.shape)
  @property
  def T(self) -> Self:
    assert self.ndim == 2
    m, n = self.shape
    t = InterpretedTensor((n, m), self.storage)
    t.stride = [self.stride[1], self.stride[0]]
    return t

  def reshape(self, shape: tuple[int, ...]) -> Self:
    self.shape = shape
    self.stride = [math.prod(shape[i+1:]) for i in range(len(shape))] # math.prod([]) produces 1
    return self
  
  def __repr__(self) -> str:
    return f"InterpretedTensor({self.chunk(self.storage, self.shape)})"
  @staticmethod
  def chunk(flat, shape):
    if len(shape) == 1: return flat[:shape[0]]
    size = len(flat) // shape[0]
    return [InterpretedTensor.chunk(flat[i*size:(i+1)*size], shape[1:]) for i in range(shape[0])]
  
  # backward f'(x)
  @staticmethod
  def topo(node: InterpretedTensor, seen: set[InterpretedTensor], output: list[InterpretedTensor]) -> None:
    if node in seen: return
    seen.add(node)
    for input in node.inputs: InterpretedTensor.topo(input, seen, output)
    output.append(node)

  def backward(self) -> None:
    seen, topologically_sorted_expression_graph = set(), []
    InterpretedTensor.topo(self, seen, topologically_sorted_expression_graph)

    self.grad = InterpretedTensor.ones(self.shape) # base case
    for tensor in reversed(topologically_sorted_expression_graph): tensor._backward()
  
  # forwards f(x)
  def __radd__(self, other: Self) -> Self: return self.__add__(other)
  def __add__(self, other: Self) -> Self:
    n, alpha = self.numel, 1
    x, y, z = array.array('f', self.storage), array.array('f', other.storage), array.array('f', [0.0]*(n))
    teenygrad.rs.cpu.saxpy(n, alpha, x, y) # y=axpy
    requires_grad = self.grad is not None or other.grad is not None
    output_tensor = InterpretedTensor(self.shape, list(y), (self, other), requires_grad=requires_grad)
    def _backward():
      self.grad += output_tensor.grad
      other.grad += output_tensor.grad
    output_tensor._backward = _backward
    return output_tensor

  def __rmul__(self, other: Self) -> Self: return  self.__mul__(other)
  def __mul__(self, other: Self) -> Self:
    n = self.numel
    x, y, z = array.array('f', self.storage), array.array('f', other.storage), array.array('f', [0.0]*n)
    teenygrad.rs.cpu.smul(n, x, y, z)
    requires_grad = self.grad is not None or other.grad is not None
    output_tensor = InterpretedTensor(self.shape, list(z), (self, other), requires_grad=requires_grad)
    def _backward():
      self.grad += output_tensor.grad * other
      other.grad += output_tensor.grad * self
    output_tensor._backward = _backward
    return output_tensor

  def __neg__(self) -> Self:
    n = self.numel
    x, y = array.array('f', self.storage), array.array('f', [0.0]*n)
    teenygrad.rs.cpu.saxpy(n, -1, x, y)
    requires_grad = self.grad is not None
    output_tensor = InterpretedTensor(self.shape, list(y), (self,), requires_grad=requires_grad)
    def _backward():
      self.grad += -output_tensor.grad
    output_tensor._backward = _backward
    return output_tensor

  def __sub__(self, other: Self) -> Self:
    n = self.numel
    x, y = array.array('f', other.storage), array.array('f', self.storage)
    teenygrad.rs.cpu.saxpy(n, -1, x, y)
    requires_grad = self.grad is not None or other.grad is not None
    output_tensor = InterpretedTensor(self.shape, list(y), (self, other), requires_grad=requires_grad)
    def _backward():
      self.grad += output_tensor.grad
      other.grad += -output_tensor.grad
    output_tensor._backward = _backward
    return output_tensor

  def tanh(self) -> Self:
    n = self.numel
    x, y = array.array('f', self.storage), array.array('f', [0.0]*n)
    teenygrad.rs.cpu.stanh(n, x, y)
    requires_grad = self.grad is not None
    output_tensor = InterpretedTensor(self.shape, list(y), (self,), requires_grad=requires_grad)
    def _backward():
      self.grad += output_tensor.grad * (InterpretedTensor.ones(self.shape) - output_tensor * output_tensor) # f(x) = tanh(x) ==> f'(x) = 1 - tanh(x)^2
    output_tensor._backward = _backward
    return output_tensor

  def __rmatmul__(self, other: Self) -> Self: return other.__matmul__(self) # GEMM does not commute: AB != BA
  def __matmul__(self, other: Self) -> Self:

3.3 Implementing Basic Linear Algebra Subroutines in Python: `AXPY`, `GEMV` and `GEMM`

3.4 Accelerating the `BLAS` with Processor Pipelines and Memory Hierarchies

/// SGEMM with the classic BLAS signature (row-major, no transposes):
/// C = alpha * A * B + beta * C
fn sgemm(
  m: usize, n: usize, k: usize,
  alpha: f32, a: &[f32], lda: usize,
  b: &[f32], ldb: usize,
  beta: f32, c: &mut [f32], ldc: usize) {
  assert!(m > 0 && n > 0 && k > 0, "mat dims must be non-zero");
  assert!(lda >= k && a.len() >= m * lda);
  assert!(ldb >= n && b.len() >= k * ldb);
  assert!(ldc >= n && c.len() >= m * ldc);

  for i in 0..m {
    for j in 0..n {
      let mut acc = 0.0f32;
      for p in 0..k { acc += a[i * lda + p] * b[p * ldb + j]; }
      let idx = i * ldc + j;
      c[idx] = alpha * acc + beta * c[idx];
    }
  }
}

fn main() {
  use std::time::Instant;

  for &n in &[16usize, 32, 64, 128, 256] {
    let (m, k) = (n, n);
    let (a, b, mut c) = (vec![1.0f32; m * k], vec![1.0f32; k * n], vec![0.0f32; m * n]);

    let t0 = Instant::now();
    sgemm(m, n, k, 1.0, &a, k, &b, n, 0.0, &mut c, n);
    let secs = t0.elapsed().as_secs_f64().max(std::f64::MIN_POSITIVE);
    let gflop = 2.0 * (m as f64) * (n as f64) * (k as f64) / 1e9;
    let gflops = gflop / secs;

    println!("m=n=k={n:4} | {:7.3} ms | {:6.2} GFLOP/s", secs * 1e3, gflops);
  }
}

*AMD Zen5 Core Die Shot by Fritzchens Fritz (Hover for annotation by Nemez)*

Thus far in our journey we've successfully made the transition from programming software 1.0 which involves specifying of discrete data structures — like sets, associations, lists, trees, graphs, — with the determinism of types to programming software 2.0 which involves reovering continous data structures — like vectors, matrices, and tensors — with the stochasticity of probabilities. If you showed the mathematical equations for your models, loss functions, and optimizers to our mathematical cousinsIn particular, the physicists whom realized describing reality as minimizing free-energy was fruitful, such as Helmholtz with energy, Gibbs with enthalpy, and Boltzmann with entropy. who also made the same transition, they would understand the mathematical equations and even algorithmsas until recently, algorithms were understood to be a sequence of instructions to be carried out by a human. i.e Egyptian Multiplication, Euclid's Greatest Common Divisor. But they wouldn't understand how the code we've programmed thus far with Python and Numpy is being automatically calculated by the mechanical machine we call computers. For that we need to transition from users of the numpy framework to implementors our own framework, which we'll call teenygrad. This requires us moving from the beautiful abstraction of mathematics to the assembly heart and soul of our machines. This book uses RustTo backtrack to familiarize yourself with Rust, take a look at the experimental version of The Rust Programming Language, by from Will Crichton and Shriram Krishnamurthi, originally written by Steve Klabnik, Carol Nichols, and Chris Krycho., but feel free to follow along with C/C++In that case, take a look at "The C Programming Language by Brian Kernighan and Dennis Ritchie. — what's fundamental to the purpose of accelerating said basic linear algebra subroutines on the CPU is is choosing a language that 1. forgoes the dynamic memory management overhead of garbage collection and 2. has a compiled implementation which allows us to analyze and tune the actual instructions which an underlying processor executes. Using Rust will allow us to start uncovering what is really happening under the hood of accelerated Tensors To sidetrack to familiarize yourself with systems programming in the context of software 1.0 , read the classic of Computer Systems A Programmer's Perspectivewith instruction set architectures, microarchitecture, memory hierarchies

instruction set architecture
computation: control path, data path
communication: memory, input/output

In the last chapter we developed teenygrad.Tensor by virtualizing physical 1D storage buffers into arbitrarily-shaped logical ND arrays by specifying an iteration space with strides, and started our journey in accelerating the basic linear algebra subroutines defined on the Tensor with Rust, which speeds up the throughput performance of SGEMM from 10MFLOP/S to 1GFLOP/S. Simply executing a processor's native code — referred to as assembly code — rather than Python's bytecode results in an increase of two orders of magnitude.

Now that the code that is being executed is native assembly, we can dive deeper into the architecture of the machine in order to reason and improve the performance of teenygrad's BLAS.

3.5 Accelerating Computation with Instruction Level Parallelism via Loop Unrolling

3.6 Accelerating Computation with Data Level Parallelism via Vector and Matrix Extensions

3.7 Accelerating Communication with Memory Hierarchies via Register and Cache Blocking

3.8 Accelerating Computation with Thread Level Parallelism

The Creation of Adam, Michelangelo 1508-1512.

You are viewing this on a mobile device, whereas SITP is best viewed on a desktop — the book includes various multimedia lecture videos, visualizers, any tufte-style sidenotes with many external hyperlinks to other resources.

II. Neural Networks

In part one of the book you have developed a solid foundation to the statistical learning approach towards artificial intelligence — it's amazing how close you are to that of deep learning. By now, you understand that fuzzy phenomena such as house prices or language sentiment are better described with continuous stochastic descriptions like linear regression and linear classification.

The functions of software 2.0 are continuous in nature because data is represented with higher dimensional coordinates rather than some discrete type so functions for the case of regression and classification have the form $f : R^{d} \to R$ and $f : R^{d} \to [0, 1]$ , respectively.
They are also stochastic in nature because the function bodies are assigned conditional distributions $f (x) := p_{θ} (Y = y ∣ X = x)$ . That is, rather than declaring a deterministic function that is computable by a Turing Machine with control flow and assignment, the functions are recovered via parameter estimation of $θ$ through iterative optimization.

In part two of the book, you will train a variety of deep neural networks with pytorch, which includes the same core multidimensional array data structure as numpy with torch.Tensor, but in addition provides three additional abstractions specifically for training deep nets, including 1. neural network primitives, 2. automatic differentiation, and 3. gpu acceleration. Because you have the necessary foundation in machine learning, in 2.1 Sequence Learning with Deep Neural Networks in pytorch we will train various sequence-based neural networksfollowing the 2012-2020 time period of what is colloquially known as the age of research,. and simply present the definition of the model's forward pass $\overset{y}{^} = f (x; θ)$ , the loss $L (y, \overset{y}{^}; θ)$ , and train the model using pytorch's support for optimization and automatic differentiation with optim.sgd and Tensor.backward() until the loss curves converge. After training a variety of deep neural networks with various inductive biases through pytorch, in 2.2 Programming GPU Accelerated Forward and Backward Passes with teenygrad.Tensor.backward() you will update the implementation of teenygrad in order to train those very same nets.

After part two, you will be ready for part three of the book, which follows the 2020-2025 time period of what is colloquially known amongst researchers as the age of scaling. This time period focused on scaling up the generality of the attention mechanism in the transformers network architecture with more data and compute following scaling laws, in which the software and hardware like pytorch and GPUs evolved to support. In part three, you will add midtraining and posttraining phases to your pretrained transformer to add assistant and reasoning behavior, and evolve teenygrad's implementation for the final time to support a graph mode fusion compiler, and inference engine.

II. Table of Contents

2.1 Learning Sequences with Deep Neural Networks in `pytorch`

$↩$ Table of Contents

(todo, some explanation) on learning sequences how part 2 will involve implementing the concepts presented in the 3b1b video

2.1.1 Learning Non-Linear Representations with Feedforward Neural Networks

Recall that the task of house price prediction and sentiment classification which can be modelled by functions of the form $f : R^{d} \to R$ and $f : R^{d} \to [0, 1]$ respectively. The simplest inductive bias was made, in a which a linear relationship was assumed to hold between the input and output spaces, and where the output was subsequently modeled as an inner product $<, >: R^{d} \times R^{d} \to R$ between an input vector and a weight vector. For the case of regression, we have $y = f (x; θ) := θ^{⊤} x$ , and for the case of classification, we have $y = f (x; θ) := σ (θ^{⊤} x)$ , todo:glm,exp. The key entry point into the function class of deep neural networks is that of logistic regression, because the log odds produced by the inner product (which are indeed affine) required a mapping into a valid probability via sigmoid function $σ : R \to [0, 1]$ , where $σ (x) := \frac{1}{1 + e ( - x )}$ which is in fact not linear nor affine.

The next natural question to ask then is whether the logistic regression model is considered a deep neural network? The answer is that technically yes, it can be considered a degenerative deep neural network with $0$ hidden layersIn the same way that a list can be considered a degenerative binary tree or graph. These so-called hidden layers automate the construction of the representation through learning, so that the model not only discovers the mapping from representation to output, but also the representation itselfIn the same way that for certain computations the positional representation of arabic numerals are more suitable compared to that of roman numerals, and polar coordinates over cartesian coordinates. The functions of deep neural networks will take the form of $f (x) := g \circ ϕ (x)$ with $g$ being a linear classifier on feature extractor $ϕ : R^{d_{1}} \to R^{d_{n}}$ , $ϕ (x) := h^{[l]} \circ h^{[l - 1]} \circ \dots h^{[1]} \circ h^{[0]} (x)$ where $l$ is the number of compositional layers, and each intermediate function $h^{[i]}$ has the form of $h^{[i]} : R^{d_{i}} \to R^{d_{i + 1}}$ . Each hidden layer successively and graduallyIn the same way of Grothendieck's preferred style of mathematics described in Récoltes et Semailles: I can illustrate the second approach with the same image of a nut to be opened. The first analogy that came to my mind is of immersing the nut in some softening liquid, and why not simply water? From time to time you rub so the liquid penetrates better, and otherwise you let time pass. The shell becomes more flexible through weeks and months—when the time is ripe, a touch of the hand is enough, and the shell opens like a perfectly ripened avocado! A different image came to me a few weeks ago. The unknown thing to be known appeared to me as some stretch of earth or hard marl, resisting penetration. One can go at it with pickaxes or crowbars or even jackhammers: this is the first approach, that of the “chisel” (with or without a hammer). The other is the sea. The sea advances insensibly and in silence, nothing seems to happen, nothing moves, the water is so far off you hardly hear it… yet it finally surrounds the resistant substance. lifts the complexity and abstraction of the data's representationChris Olah, cofounder of Anthropic and the lead of it's interpretability research wrote an excellent article on how software 2.0's representation learning loosely correspond to software 1.0's types in Neural Networks, Types, and Functional Programming.

Together, these two aspects of learning non-linear, representations form the essence of deep learning.

Let's now turn out attention to the function bodies of these $h^{[i]}$ 's with a deep neural netork of $1$ hidden layer, carrying out the task of price regression and sentiment classification so that $f$ has the form $f : R^{d} \to R$ . With the statistical learning foundation from part one, we will simply present the forward pass $f (x)$ , the loss function $L (d, θ)$ , and the backward pass $ℓ^{'} (y, \overset{y}{^}, θ)$ .

Forward Pass

The functions of deep neural networks will take the form of $f (x) := g \circ ϕ (x)$ with $g$ being a linear classifier on feature extractor

$ϕ ϕ (x) : R^{d_{1}} \to R^{d_{n}} := h^{[l]} \circ h^{[l - 1]} \circ \dots h^{[1]} \circ h^{[0]} (x)$ where $l$ is the number of compositional layers, and each intermediate function $h^{[i]}$ has the form of $h^{[i]} : R^{d_{i}} \to R^{d_{i + 1}}$ .

Loss Function

Backward Pass

TODO

figure/diagram/lecun circuits ->algebraic/symbolic equations->torch code
- first train net for price regression and classification
  - intuition of automating feature engineering
- XOR: playground.tensorflow
- change code below to XOR
- mention that 2.1 treats backward pass as a black box, which is the magic of the abstraction
- mention to readers that if they want to, they can read 2.1.3, 2.2, and then back to 2.1.4

TODO: generate prose/exposition by repaging lecture/text in ram

class MLP(nn.Module):
  """
  takes the previous block_size tokens, encodes them with a lookup table,
  concatenates the vectors and predicts the next token with an MLP.

  Reference:
  Bengio et al. 2003 https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf
  """

  def __init__(self, config):
    super().__init__()
    self.block_size = config.block_size
    self.vocab_size = config.vocab_size
    self.wte = nn.Embedding(config.vocab_size + 1, config.n_embd) # token embeddings table
    # +1 in the line above for a special <BLANK> token that gets inserted if encoding a token
    # before the beginning of the input sequence
    self.mlp = nn.Sequential(
      nn.Linear(self.block_size * config.n_embd, config.n_embd2),
      nn.Tanh(),
      nn.Linear(config.n_embd2, self.vocab_size)
    )

  def get_block_size(self):
    return self.block_size

  def forward(self, idx, targets=None):
    # gather the word embeddings of the previous 3 words
    embs = []
    for k in range(self.block_size):
      tok_emb = self.wte(idx) # token embeddings of shape (b, t, n_embd)
      idx = torch.roll(idx, 1, 1)
      idx[:, 0] = self.vocab_size # special <BLANK> token
      embs.append(tok_emb)

    # concat all of the embeddings together and pass through an MLP
    x = torch.cat(embs, -1) # (b, t, n_embd * block_size)
    logits = self.mlp(x)

    # if we are given some desired targets also calculate the loss
    loss = None
    if targets is not None:
      loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)

    return logits, loss

2.1.2 Learning Embeddings with Feedforward Neural Networks

2.1.3 Learning Sequences with Feedforward Neural Networks

class MLP(nn.Module):
  """
  takes the previous block_size tokens, encodes them with a lookup table,
  concatenates the vectors and predicts the next token with an MLP.

  Reference:
  Bengio et al. 2003 https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf
  """

  def __init__(self, config):
    super().__init__()
    self.block_size = config.block_size
    self.vocab_size = config.vocab_size
    self.wte = nn.Embedding(config.vocab_size + 1, config.n_embd) # token embeddings table
    # +1 in the line above for a special <BLANK> token that gets inserted if encoding a token
    # before the beginning of the input sequence
    self.mlp = nn.Sequential(
      nn.Linear(self.block_size * config.n_embd, config.n_embd2),
      nn.Tanh(),
      nn.Linear(config.n_embd2, self.vocab_size)
    )

  def get_block_size(self):
    return self.block_size

  def forward(self, idx, targets=None):
    # gather the word embeddings of the previous 3 words
    embs = []
    for k in range(self.block_size):
      tok_emb = self.wte(idx) # token embeddings of shape (b, t, n_embd)
      idx = torch.roll(idx, 1, 1)
      idx[:, 0] = self.vocab_size # special <BLANK> token
      embs.append(tok_emb)

    # concat all of the embeddings together and pass through an MLP
    x = torch.cat(embs, -1) # (b, t, n_embd * block_size)
    logits = self.mlp(x)

    # if we are given some desired targets also calculate the loss
    loss = None
    if targets is not None:
      loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)

    return logits, loss

2.1.4 Learning Sequences with Convolutional Neural Networks

# Near copy paste of the layers we have developed in Part 3

# -----------------------------------------------------------------------------------------------
class Linear:
  
  def __init__(self, fan_in, fan_out, bias=True):
    self.weight = torch.randn((fan_in, fan_out)) / fan_in**0.5 # note: kaiming init
    self.bias = torch.zeros(fan_out) if bias else None
  
  def __call__(self, x):
    self.out = x @ self.weight
    if self.bias is not None:
      self.out += self.bias
    return self.out
  
  def parameters(self):
    return [self.weight] + ([] if self.bias is None else [self.bias])

# -----------------------------------------------------------------------------------------------
class BatchNorm1d:
  
  def __init__(self, dim, eps=1e-5, momentum=0.1):
    self.eps = eps
    self.momentum = momentum
    self.training = True
    # parameters (trained with backprop)
    self.gamma = torch.ones(dim)
    self.beta = torch.zeros(dim)
    # buffers (trained with a running 'momentum update')
    self.running_mean = torch.zeros(dim)
    self.running_var = torch.ones(dim)
  
  def __call__(self, x):
    # calculate the forward pass
    if self.training:
      if x.ndim == 2:
        dim = 0
      elif x.ndim == 3:
        dim = (0,1)
      xmean = x.mean(dim, keepdim=True) # batch mean
      xvar = x.var(dim, keepdim=True) # batch variance
    else:
      xmean = self.running_mean
      xvar = self.running_var
    xhat = (x - xmean) / torch.sqrt(xvar + self.eps) # normalize to unit variance
    self.out = self.gamma * xhat + self.beta
    # update the buffers
    if self.training:
      with torch.no_grad():
        self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * xmean
        self.running_var = (1 - self.momentum) * self.running_var + self.momentum * xvar
    return self.out
  
  def parameters(self):
    return [self.gamma, self.beta]

# -----------------------------------------------------------------------------------------------
class Tanh:
  def __call__(self, x):
    self.out = torch.tanh(x)
    return self.out
  def parameters(self):
    return []

# -----------------------------------------------------------------------------------------------
class Embedding:
  
  def __init__(self, num_embeddings, embedding_dim):
    self.weight = torch.randn((num_embeddings, embedding_dim))
    
  def __call__(self, IX):
    self.out = self.weight[IX]
    return self.out
  
  def parameters(self):
    return [self.weight]

# -----------------------------------------------------------------------------------------------
class FlattenConsecutive:
  
  def __init__(self, n):
    self.n = n
    
  def __call__(self, x):
    B, T, C = x.shape
    x = x.view(B, T//self.n, C*self.n)
    if x.shape[1] == 1:
      x = x.squeeze(1)
    self.out = x
    return self.out
  
  def parameters(self):
    return []

# -----------------------------------------------------------------------------------------------
class Sequential:
  
  def __init__(self, layers):
    self.layers = layers
  
  def __call__(self, x):
    for layer in self.layers:
      x = layer(x)
    self.out = x
    return self.out
  
  def parameters(self):
    # get parameters of all layers and stretch them out into one list
    return [p for layer in self.layers for p in layer.parameters()]

# original network
# n_embd = 10 # the dimensionality of the character embedding vectors
# n_hidden = 300 # the number of neurons in the hidden layer of the MLP
# model = Sequential([
#   Embedding(vocab_size, n_embd),
#   FlattenConsecutive(8), Linear(n_embd * 8, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
#   Linear(n_hidden, vocab_size),
# ])

# hierarchical network
n_embd = 24 # the dimensionality of the character embedding vectors
n_hidden = 128 # the number of neurons in the hidden layer of the MLP
model = Sequential([
  Embedding(vocab_size, n_embd),
  FlattenConsecutive(2), Linear(n_embd * 2, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
  FlattenConsecutive(2), Linear(n_hidden*2, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
  FlattenConsecutive(2), Linear(n_hidden*2, n_hidden, bias=False), BatchNorm1d(n_hidden), Tanh(),
  Linear(n_hidden, vocab_size),
])

# parameter init
with torch.no_grad():
  model.layers[-1].weight *= 0.1 # last layer make less confident

parameters = model.parameters()
print(sum(p.nelement() for p in parameters)) # number of parameters in total
for p in parameters:
  p.requires_grad = True

2.1.5 Learning Sequences with Recurrent Neural Networks

class RNNCell(nn.Module):
  """
  the job of a 'Cell' is to:
  take input at current time step x_{t} and the hidden state at the
  previous time step h_{t-1} and return the resulting hidden state
  h_{t} at the current timestep
  """
  def __init__(self, config):
    super().__init__()
    self.xh_to_h = nn.Linear(config.n_embd + config.n_embd2, config.n_embd2)

  def forward(self, xt, hprev):
    xh = torch.cat([xt, hprev], dim=1)
    ht = F.tanh(self.xh_to_h(xh))
    return ht

class RNN(nn.Module):
  def __init__(self, config, cell_type):
    super().__init__()
    self.block_size = config.block_size
    self.vocab_size = config.vocab_size
    self.start = nn.Parameter(torch.zeros(1, config.n_embd2)) # the starting hidden state
    self.wte = nn.Embedding(config.vocab_size, config.n_embd) # token embeddings table
    if cell_type == 'rnn':
        self.cell = RNNCell(config)
    elif cell_type == 'gru':
        self.cell = GRUCell(config)
    self.lm_head = nn.Linear(config.n_embd2, self.vocab_size)

  def get_block_size(self):
    return self.block_size

  def forward(self, idx, targets=None):
    device = idx.device
    b, t = idx.size()

    # embed all the integers up front and all at once for efficiency
    emb = self.wte(idx) # (b, t, n_embd)

    # sequentially iterate over the inputs and update the RNN state each tick
    hprev = self.start.expand((b, -1)) # expand out the batch dimension
    hiddens = []
    for i in range(t):
      xt = emb[:, i, :] # (b, n_embd)
      ht = self.cell(xt, hprev) # (b, n_embd2)
      hprev = ht
      hiddens.append(ht)

    # decode the outputs
    hidden = torch.stack(hiddens, 1) # (b, t, n_embd2)
    logits = self.lm_head(hidden)

    # if we are given some desired targets also calculate the loss
    loss = None
    if targets is not None:
      loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)

    return logits, loss

2.1.6 Learning Sequences with Generative Pretrained Transformers

class NewGELU(nn.Module):
  """
  Implementation of the GELU activation function currently in Google BERT repo (identical to OpenAI GPT).
  Reference: Gaussian Error Linear Units (GELU) paper: https://arxiv.org/abs/1606.08415
  """
  def forward(self, x):
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3.0))))

class CausalSelfAttention(nn.Module):
  """
  A vanilla multi-head masked self-attention layer with a projection at the end.
  It is possible to use torch.nn.MultiheadAttention here but I am including an
  explicit implementation here to show that there is nothing too scary here.
  """

  def __init__(self, config):
    super().__init__()
    assert config.n_embd % config.n_head == 0
    # key, query, value projections for all heads, but in a batch
    self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)
    # output projection
    self.c_proj = nn.Linear(config.n_embd, config.n_embd)
    # causal mask to ensure that attention is only applied to the left in the input sequence
    self.register_buffer("bias", torch.tril(torch.ones(config.block_size, config.block_size))
                                  .view(1, 1, config.block_size, config.block_size))
    self.n_head = config.n_head
    self.n_embd = config.n_embd

  def forward(self, x):
    B, T, C = x.size() # batch size, sequence length, embedding dimensionality (n_embd)

    # calculate query, key, values for all heads in batch and move head forward to be the batch dim
    q, k ,v  = self.c_attn(x).split(self.n_embd, dim=2)
    k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
    q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
    v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)

    # causal self-attention; Self-attend: (B, nh, T, hs) x (B, nh, hs, T) -> (B, nh, T, T)
    att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
    att = att.masked_fill(self.bias[:,:,:T,:T] == 0, float('-inf'))
    att = F.softmax(att, dim=-1)
    y = att @ v # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)
    y = y.transpose(1, 2).contiguous().view(B, T, C) # re-assemble all head outputs side by side

    # output projection
    y = self.c_proj(y)
    return y

class Block(nn.Module):
  """ an unassuming Transformer block """

  def __init__(self, config):
    super().__init__()
    self.ln_1 = nn.LayerNorm(config.n_embd)
    self.attn = CausalSelfAttention(config)
    self.ln_2 = nn.LayerNorm(config.n_embd)
    self.mlp = nn.ModuleDict(dict(
        c_fc    = nn.Linear(config.n_embd, 4 * config.n_embd),
        c_proj  = nn.Linear(4 * config.n_embd, config.n_embd),
        act     = NewGELU(),
    ))
    m = self.mlp
    self.mlpf = lambda x: m.c_proj(m.act(m.c_fc(x))) # MLP forward

  def forward(self, x):
    x = x + self.attn(self.ln_1(x))
    x = x + self.mlpf(self.ln_2(x))
    return x

class Transformer(nn.Module):
  """ Transformer Language Model, exactly as seen in GPT-2 """

  def __init__(self, config):
    super().__init__()
    self.block_size = config.block_size

    self.transformer = nn.ModuleDict(dict(
        wte = nn.Embedding(config.vocab_size, config.n_embd),
        wpe = nn.Embedding(config.block_size, config.n_embd),
        h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
        ln_f = nn.LayerNorm(config.n_embd),
    ))
    self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)

    # report number of parameters (note we don't count the decoder parameters in lm_head)
    n_params = sum(p.numel() for p in self.transformer.parameters())
    print("number of parameters: %.2fM" % (n_params/1e6,))

  def get_block_size(self):
    return self.block_size

  def forward(self, idx, targets=None):
    device = idx.device
    b, t = idx.size()
    assert t <= self.block_size, f"Cannot forward sequence of length {t}, block size is only {self.block_size}"
    pos = torch.arange(0, t, dtype=torch.long, device=device).unsqueeze(0) # shape (1, t)

    # forward the GPT model itself
    tok_emb = self.transformer.wte(idx) # token embeddings of shape (b, t, n_embd)
    pos_emb = self.transformer.wpe(pos) # position embeddings of shape (1, t, n_embd)
    x = tok_emb + pos_emb
    for block in self.transformer.h:
        x = block(x)
    x = self.transformer.ln_f(x)
    logits = self.lm_head(x)

    # if we are given some desired targets also calculate the loss
    loss = None
    if targets is not None:
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)

    return logits, loss

2.2 Programming GPU Accelerated Automatic Differentation with `teenygrad.Tensor.backward()`

$↩$ Table of Contents

2.2.1 Iterative Optimization with Stochastic Gradient Descent `optim.sgd`

2.2.2 Symbolic, Numerical, and Automatic Differentiation

2.2.3 Forward Mode Automatic Differentiation with `Tensor.forward()`

2.2.4 Reverse Mode Automatic Differentiation with `Tensor.backward()`

Consider the function $f : R, R \to R$ where $f : = exp (x_{1}) * sin^{2} x_{2}$ , and translate it to it's computational counterpart in python with one-dimensional Tensors:

import picograd as pg

def f(x1: pg.Tensor, x2: pg.Tensor) -> pg.Tensor:
  a = pg.exp(x1)
  b = pg.sin(x2)
  c = b**2
  d = a*c
  return d

Figure 1. Python source for the function $f : R, R \to R$ where $f : = exp (x_{1}) * sin^{2} x_{2}$

Here we've broken up the function to render the subexpressions more clearly. But this isn't necessary — automatic differentiation will work if the function was expressed in one line. In part one, the development of picograd followed that of numpy — an array programming language similar to Matlab but embedded in the host language of Python, that could evaluate functions of the form $f (x)$ where Tensor objects stored their values $f (x)$ with the value: field and the function types that produced their values with Op. For instance, evaluating the specified function f from above with 9 and 10

if __name__ == "__main__":
  print(f(9, 10))

populates the Tensor.value fields. In part one of the book we verified this with a REPL-interface, but we can also represent the entire expression being evaluated with a graph of vertices and edges where the vertices are Tensors (along with their Ops and values) and the edges are their data dependencies:

Here you can see that even if the function was specified in one line, the graph of the expression always parses into Tensor vertices, and data dependency edges. You may have noticed the Tensor.grad fields, which supposedly store the values of derivatives $f^{'} (x)$ . The question now remains in how to populate these fields.

Taking a step back to differential calculus, deriving the derivative of $f : = exp (x_{1}) * sin^{2} x_{2}$ involves the application of the chain rule where $f (x) := g (h (x)) ⟹ f^{'} (x) = g^{'} (h (x)) h^{'} (x)$ . Evaluating the derivative of the function with respect to its inputs $f^{'} (x_{1})$ and $f^{'} (x_{2})$ results in

$f^{'} (x_{1}) = \frac{\partial}{\partial x _{1}} [exp (x_{1}) * sin^{2} x_{2}] = \frac{\partial}{\partial x _{1}} [exp (x_{1})] * sin^{2} x_{2} = exp (x_{1}) (1) * (sin^{2} x_{2})$ $f^{'} (x_{2}) = \frac{\partial}{\partial x _{2}} [exp (x_{1}) * sin^{2} x_{2}] = exp (x_{1}) * \frac{\partial}{\partial x _{2}} [sin^{2} x_{2}] = exp (x_{1}) * 2 sin x_{2} cos x_{2}$

symbolic and numeric differentiattion symbolic differentiation has performance issues since a large unrolled expression must be constructed in order to differentiate[^0], whereas numerical differentiation has correctness issues since evaluating finite differences requires evaluating functions to a precision point resulting in numerical instability. (trace through EXAMPLE for both. talking nets widrow)

To populate the Tensor.grad fields, the simplest idea would be to literally translate the manual derivation of the derivative into code. The translation from math to code involves a design decision: should we evaluate from outputs to inputs (symbolically outside-in, graphically right-to-left) or from inputs to outputs (symbolically inside-out, graphically left-to-right)? Although the former order seems more natural with symbolic expressions, there's nothing illegal about the latter.

import picograd as pg

def f(x1: pg.Tensor, x2: pg.Tensor) -> pg.Tensor:
  a = pg.exp(x1)
  b = pg.sin(x2)
  c = b**2
  d = a*c
  return d

# dict[f(x), f'(x)] of local derivatives (adjoints)
dd_da, dd_dc = [c, a] # d(a,c):=a*c ==> d'(a)=c, d'(c)=a
da_dx1 = pg.exp(x1) # a(x1):=exp(x1) ==> a'(x1)=exp(x1)
dc_db = 2*b # c(b):=b^2 ==> c'(b)=2b
db_dx2 = pg.cos(x2) # b(x2):=sin(x2) ==> b'(x2)=cos(x2)

# outputs to inputs: outside-in symbolically, right-to-left graphically
dd_dd = pg.Tensor(1) # base case
dd_da, dd_dc = [dd_dd*dd_da, dd_dd*dd_dc]
dd_dx1 = dd_da*da_dx1 # DONE for the x1->d path

dd_db = dd_dc*dc_db
dd_dx1 = dd_db*db_dx2 # DONE for x2->path

# inputs to outputs: inside-out symbolically, left-to-right graphically
dx1_dx1, dx2_dx2 = [pg.Tensor(1), pg.Tensor(1)] # base case
da_dx1 = da_dx1*dx1_dx1
dd_dx1 = dd_da*da_dx1 # DONE for the x1->d path

db_dx2 = db_dx2*dx2_dx2
dc_dx2 = dc_dc*db_dx2
dd_dx2 = dd_dc*dc_dx_2 # DONE for the x2->d path

Do you notice any difference in the number of evaluations between the two orders?

The outputs-to-input ordering takes 6 arithmetic operations (including the destructuring), whereas the input-to-output ordering take 7 arithmetic operations. This is because the former can reuse dd_dd as a dynamic programming solution to a subproblem for the two inputs, whereas the latter cannot. And taking a step back, we only want to reuse the output because the shape of the function is of $f : R^{2} \to R$ . Alternatively, if $f$ had type $f : R \to R^{2}$ , then the input-to-output ordering would be able to reuse results. This distinction is referred to as "forward-mode" vs "reverse-mode", and reflects the fact that for some function $f : R^{n} \to R^{m}$ the time complexity of forward-mode differentiation is proportional to $n$ , whereas that of forward-mode differentiation is proportional to $m$ . If the expression graph fans-in so that $n > m$ , reverse-mode is preferred. If the expression graph fans-out so that $m > n$ , forward-mode is preferred. However, if we take a step with a graph-theory lens, we can see that the derivative is the sum of paths, where each path is a product of local derivatives from the input source to the output sink. From a combinatorics perspective, we are calculating all the possible (ors) ways (ands) on how the inputs perturb the output. That is:

$\frac{\partial f}{\partial x _{i}} = \sum\prod \frac{\partial v _{j}}{\partial v _{k}}$

and as long as the operations along this path are associative — $(A B) C = A (BC)$ — then we can choose the order in how we perform these path products to minimize the number of operations. Finding the optimal ordering is an NP-hard problem because ____. For instance, if the expression graph is diamond-shaped, evaluating the derivative with forward-mode for the left-half and reverse-mode for the right-half would be more performant. In practice, we use reverse-mode as a heuristic, since most of the functions that are differentiated (so they can be optimized) in the field of machine learning are neural networks of the form $f : R^{n} \to R$

How can we generalize this into an algorithm?
All we need are 1. mappings from $f (x) \to f^{'} (x)$ and 2. a topological sort

For the derivative rules, the same way that optimizing compilers implement an optimization "manually" once which then gets reused many times, the authors of deep learning frameworks also implement derivatives manually which then become reused many times through automatic differentiation. In theory, we can differentiate any expression with f'(x) with only a few derivative rules for addition and multiplication, but in practice most frameworks provide sugar for complex derivatives.

For topological sort, we can simply reversed the ordering produced by a depth-first-search:

def toposort(self):
  order: list[Op] = []
  visited: set[Op] = set()

  def dfs(node: Op) -> None:
    if node in visited: return
    visited.add(node)
    for src in node.src: dfs(src)
    order.append(node)

  dfs(self)
  return order

class Tensor():
  def backward():
    for t in reversed(topo):
      t.backward()

We will now use this idea to modify the interpretation of our deep learning framework to not only evaluate $f (x)$ , but $f^{'} (x)$ as well. This is done by dynamically overloading the operators at runtime[^0] to trace the expression graph

chain_rules = PatternMatcher([
  (Pattern(OpCode.MATMUL, name="input"), lambda output_grad, input: (_____,)),
  (Pattern(OpCode.MATVEC, name="input"), lambda output_grad, input: (_____,)),
  (Pattern(OpCode.RECIPROCAL, name="input"), lambda output_grad, input: (-output_grad * input * input,)),
  (Pattern(OpCode.SIN, name="input"), lambda output_grad, input: ((math.pi/2 - input.src[0]).sin() * output_grad,)),
  (Pattern(OpCode.LOG2, name="input"), lambda output_grad, input: (output_grad / (input.src[0] * math.log(2)),)),
  (Pattern(OpCode.EXP2, name="input"), lambda output_grad, input: (input * output_grad * math.log(2),)),
  (Pattern(OpCode.SQRT, name="input"), lambda output_grad, input: (output_grad / (input*2),)),
  (Pattern(OpCode.ADD), lambda output_grad: (1.0*output_grad, 1.0*output_grad)),
  (Pattern(OpCode.MUL, name="input"), lambda output_grad, input: (input.src[1]*output_grad, input.src[0]*output_grad)),
])

class Tensor:
  def _forward(self, f:Callable, *other:Tensor) -> Tensor: #extra_args=(), **kwargs)
    out_tensor = evaluator.eval_uop([self, other], out_uop)

  def backward(self, grad:Tensor|None=None) -> Tensor:
    """
    backward performs by collecting tensors, computing gradients with automatic differentiation, and updating said tensors.
    """
    # 1. collect all tensors that requires grad by topologically sorting the graph of uops and filter
    all_uops = self.uop.toposort()
    tensors_require_grad: list[Tensor] = [t for tref in all_tensors if (t:=tref()) is not None and t.uop in all_uops and t.requires_grad]
    uops_require_grad = [t.uop for t in tensors_require_grad]
    assert grad is not None or self.shape == tuple(), "when no gradient is provided, backward must be called on a scalar tensor"
    if not (self.is_floating_point() and all(t.is_floating_point() for t in tensors_require_grad)): raise RuntimeError("only float Tensors have gradient")
    
    # 2. compute the gradient with a map of tensors to partials
    if grad is None: grad = Tensor(1.0, dtype=self.dtype, device=self.device, requires_grad=False) # base case is 1.0
    tens2grads = Tensor._automatically_differentiate(self.uop, grad.uop, set(uops_require_grad)) # skipping materializing zerod grads for now
    grads = [Tensor(g, device=t.device) for t,g in zip(tens2grads.keys, tens2grads.values)] # initialize tensor grads on device
    
    # 3. update the tensors that require grad with the gradient's partials
    for t,g in zip(tensors_require_grad, grads):
      assert g.shape == t.shape, f"grad shape must match tensor shape, {g.shape!r} != {t.shape!r}"
      t.grad = g if t.grad is None else (t.grad + g) # accumulate if t.grad exists
    return self

  @staticmethod
  def _automatically_differentiate(root:Op, root_grad:Op, targets:set[Op]) -> dict[Op, Op]:
    """
    _differentiate backpropagates partials on a topologically sorted expression graph with the chain rule
    and produces the gradient in the form of a map of ops to their partials (which, in turn, are ops)
    """
    tens2grads = {root: root_grad}

    # 1. topological sort
    in_target_path: dict[Op, bool] = {}
    for u in root.toposort(): in_target_path[u] = any(x in targets or in_target_path[x] for x in u.src)
    dfs = list(root.toposort()) # lambda node: node.op not in {OpCode.DETACH, OpCode.ASSIGN} and in_target_path[node])) # don't flow through DETACH/ASSIGN or anything not in target path

    # 2. backpropagation with the chain rule
    for tensor in reversed(dfs):
      if tensor not in tens2grads: continue

      local_grads: tuple[Op|None, ...]|None = cast(tuple[Op, ...]|None, chain_rules.rewrite(tensor, ctx=tens2grads[tensor]))
      if local_grads is None: raise RuntimeError(f"failed to compute gradient for {tensor.op}\n\nin {str(tensor)[0:1000]}...")
      assert len(local_grads) == len(tensor.src), f"got {len(local_grads)} gradient, expected {len(tensor.src)}"

      for tensor,local_grad in zip(tensor.src, local_grads): # <--------------------- MOOOSE: why are we accumulating inside ad()? don't we do it in backward()??
        if local_grad is None: continue
        if tensor in tens2grads: tens2grads[tensor] = tens2grads[tensor] + local_grad # accumulate if tensor exists
        else: tens2grads[tensor] = local_grad # o/w initialize

To implement automatic differentiation with Tensor.backward(), there is a design decision to be made — the choice of implementing it dynamically or just-in-time[^3], similar to the decision of how to implement types for general programming languages[^4]. This stands in contrast to the alternative of performing a just-in-time, source-to-source transformation.

Let's now move onto automatically differentiating the functions of neural networks, specifically the FFN language model from earlier. (johnson/ryan adams ordering) n^2 vs n^3

2.2.5 Programming Massively Parallel Processors

https://arxiv.org/pdf/1410.0759

https://arxiv.org/pdf/1804.06826 https://arxiv.org/pdf/2512.02189v1 https://girl.surgery/bad_paper https://www.arxiv.org/pdf/2512.07004

2.2.6 Accelerating `GEMM` from `CUDA Rust` to `PTX`

2.2.7 Accelerating `GEMM` on GPU with Data Reuse

2.2.8 Accelerating `GEMM` on GPU with Scheduling

2.2.9 Accelerating `GEMM` on GPU with Tensor Cores

Saint George and the Dragon, Raffaello Sanzio da Urbino 1505.

You are viewing this on a mobile device, whereas SITP is best viewed on a desktop — the book includes various multimedia lecture videos, visualizers, any tufte-style sidenotes with many external hyperlinks to other resources.

III. Scaling Networks

Part 3 covers the age of scaling by building a distributed tensor compiler using tinygrad IR

Epilogue

To continue deepening your knowledge on deep learning and it's systems, the following resources are a good next step. You might find this book complementary to your reading, since the three streams outlined below were woven into a single narrative for the book. Once you feel comfortable, you should graduate towards contributing to larger deep learning systems. The SITP book obviously sets you up with tinygrad, which itself is a good bridge to pytorch and jax, but at the end of the day the world is your oyster.

Good luck on your journey.
I'll see you at work.

Appendix

Contents

Acknowledgements

Errata

Jashanpreet Singh

References

Amanzhol Salykov. (2025, January 12). Advanced Matrix Multiplication Optimization on NVIDIA GPUs. Salykova. https://salykova.github.io/sgemm-gpu

Ansel, J., Yang, E., He, H., Gimelshein, N., Jain, A., Voznesensky, M., Bao, B., Bell, P., Berard, D., Evgeni Burovski, Chauhan, G., Anjali Chourdia, Constable, W., Alban Desmaison, DeVito, Z., Ellison, E., Feng, W., Gong, J., Gschwind, M., & Hirsh, B. (2024). PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation. ACM, ASPLOS’2024. https://doi.org/10.1145/3620665.3640366

Bach, F. (2024). Learning Theory from First Principles. MIT Press. https://www.di.ens.fr/~fbach/ltfp_book.pdf

Boehm, S. (2022, December 31). How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog. Siboehm.com. https://siboehm.com/articles/22/CUDA-MMM

Bright, P., Edelman, A., & Johnson, S. G. (2025). Matrix Calculus (for Machine Learning and Beyond). ArXiv.org. https://arxiv.org/abs/2501.14787

Chan, S. H. (2021). Introduction to Probability for Data Science. Michigan Publishing. https://probability4datascience.com/

Chen, T., Moreau, T., Jiang, Z., Zheng, L., Yan, E., Cowan, M., Shen, H., Wang, L., Hu, Y., Ceze, L., Guestrin, C., & Krishnamurthy, A. (2018). TVM: An Automated End-to-End Optimizing Compiler for Deep Learning. https://doi.org/10.48550/arxiv.1802.04799

Dao, T. (2023, July 17). FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. ArXiv.org. https://arxiv.org/abs/2307.08691

Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022, June 23). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. ArXiv.org. https://doi.org/10.48550/arXiv.2205.14135

Darve, E., & Wootters, M. (2021). Numerical Linear Algebra with Julia. SIAM. https://ericdarve.github.io/NLA

Dongarra, J., J. Du Croz, Sven Hammarling, & Duff, I. S. (1990). A Set of Level 3 Basic Linear Algebra Subprograms. ACM Transactions on Mathematical Software, 16(1), 1–17. https://doi.org/10.1145/77626.79170

Dongarra, J., J. Du Croz, Sven Hammarling, & Hanson, R. J. (1988a). Algorithm 656: An Extended Set of Basic Linear Algebra Subprograms: Model Implementation and Test Programs. ACM Transactions on Mathematical Software, 14(1), 18–32. https://doi.org/10.1145/42288.42292

Dongarra, J., J. Du Croz, Sven Hammarling, & Hanson, R. W. (1988b). An Extended Set of FORTRAN Basic Linear Algebra Subprograms. ACM Transactions on Mathematical Software, 14(1), 1–17. https://doi.org/10.1145/42288.42291

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. The MIT Press. https://www.deeplearningbook.org/

Gordić, A. (2025, October 29). Inside NVIDIA GPUs: Anatomy of high performance matmul kernels - Aleksa Gordić. Aleksagordic.com. https://www.aleksagordic.com/blog/matmul

Goto, K., & Geijn, R. A. van de. (2008). Anatomy of High-Performance Matrix Multiplication. ACM Transactions on Mathematical Software, 34(3), 1–25. https://doi.org/10.1145/1356052.1356053

Goto, K., & Van De Geijn, R. (2008). High-Performance Implementation of the Level-3 BLAS. ACM Transactions on Mathematical Software, 35(1), 1–14. https://doi.org/10.1145/1377603.1377607

Güneş Baydin, A., Pearlmutter, B., Siskind, J., Baydin, G., Radul, A., & Mark, J. (2018). Automatic Differentiation in Machine Learning: a Survey. Journal of Machine Learning Research, 18, 1–43. https://www.jmlr.org/papers/volume18/17-468/17-468.pdf

Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical learning, Second Edition: Data mining, inference, and Prediction (2nd ed.). Springer. https://hastie.su.domains/ElemStatLearn/

Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N. J., Kern, R., Picus, M., Hoyer, S., van Kerkwijk, M. H., Brett, M., Haldane, A., del Río, J. F., Wiebe, M., Peterson, P., & Gérard-Marchant, P. (2020). Array Programming with numpy. Nature, 585(7825), 357–362. https://doi.org/10.1038/s41586-020-2649-2

Hennessy, J. L., Patterson, D. A., & Christos Kozyrakis. (2025). Computer Architecture. Morgan Kaufmann.

Hwu, W.-M. W., Kirk, D. B., & Hajj, I. E. (2022). Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann.

James, G., Witten, D., Hastie, T., Tibshirani, R., & Taylor, J. (2023). An Introduction to Statistical Learning. Springer. https://www.statlearning.com/

Jurafsky, D., & H. Martin, J. (2026). Speech and Language Processing. Stanford.edu. https://web.stanford.edu/~jurafsky/slp3/

Klein, P. N. (2013). Coding the Matrix : Linear Algebra Through Applications to Computer Science. Newtonian Press. https://codingthematrix.com/

Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Cody Hao Yu, Gonzalez, J. E., Zhang, H., & Stoica, I. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. https://doi.org/10.1145/3600006.3613165

Lambert, N. (2026). RLHF Book. Rlhfbook.com. https://rlhfbook.com/

Lawson, C. L., Hanson, R. J., Kincaid, D. R., & Krogh, F. T. (1979). Basic Linear Algebra Subprograms for Fortran Usage. ACM Transactions on Mathematical Software, 5(3), 308–323. https://doi.org/10.1145/355841.355847

Minsky, M., Papert, S., & Léon Bottou. (2017). Perceptrons : An Introduction to Computational Geometry. The MIT Press.

Mitchell, T. M. (1997). Machine learning. Mcgraw-Hill. https://www.cs.cmu.edu/~tom/files/MachineLearningTomMitchell.pdf

Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press.

Nakatsukasa, Y. (n.d.). Numerical Linear Algebra. Retrieved March 17, 2026, from https://courses.maths.ox.ac.uk/pluginfile.php/105965/mod_resource/content/35/NLA_lecture_notes.pdf

Ng, A., & Ma, T. (2023). CS229 Lecture Notes. https://cs229.stanford.edu/main_notes.pdf

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., & Bai, J. (2019). PyTorch: An Imperative Style, High-Performance Deep Learning Library. Neural Information Processing Systems. https://papers.nips.cc/paper_files/paper/2019/hash/bdbca288fee7f92f2bfa9f7012727740-Abstract.html

Ragan-Kelley, J., Barnes, C., Adams, A., Paris, S., Durand, F., & Amarasinghe, S. (2013). Halide. Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation. https://doi.org/10.1145/2491956.2462176

Raschka, S. (2024). Build a Large Language Model (From Scratch). Manning.

Raschka, S. (2026). Build a Reasoning Model (From Scratch). Simon and Manning.

Roberts, D. A., Yaida, S., & Hanin, B. (2022). The Principles of Deep Learning Theory: An Effective Theory Approach to Understanding Neural Networks. Cambridge University Press. https://deeplearningtheory.com/

seb-v. (2025, January 20). Optimizing Matrix Multiplication on RDNA3: 50 TFlops and 60% Faster Than rocBLAS. Seb-v. https://seb-v.github.io/optimization/update/2025/01/20/Fast-GPU-Matrix-multiplication.html

Shah, J., Bikshandi, G., Zhang, Y., Thakkar, V., Ramani, P., & Dao, T. (2024, July 24). FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision. ArXiv.org. https://arxiv.org/abs/2407.08608

Shalizi, C. R. (n.d.). Advanced Data Analysis from an Elementary Point of View. https://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/ADAfaEPoV.pdf

Shalizi, C. R. (2015). Modern Regression Lecture Notes. Cmu.edu. https://www.stat.cmu.edu/~cshalizi/mreg/15/

Shankhdhar, P. (2024, November 29). Outperforming cuBLAS on H100: a Worklog. Substack.com. https://cudaforfun.substack.com/p/outperforming-cublas-on-h100-a-worklog

Smith, T. M., Geijn, R. van de, Smelyanskiy, M., Hammond, J. R., & Zee, F. G. V. (2014, May 1). Anatomy of High-Performance Many-Threaded Matrix Multiplication. IEEE Xplore. https://doi.org/10.1109/IPDPS.2014.110

Spector, B. F., Arora, S., Singhal, A., Fu, D. Y., & Ré, C. (2024, October 27). ThunderKittens: Simple, Fast, and Adorable AI Kernels. ArXiv.org. https://arxiv.org/abs/2410.20399

Spector, B., Singhal, A., Arora, S., & Re, C. (2024, May 12). GPUs Go Brrr. Stanford.edu. https://hazyresearch.stanford.edu/blog/2024-05-12-tk

Strang, G. (2023). Introduction to Linear Algebra. Wellesley-Cambridge Press. https://math.mit.edu/~gs/linearalgebra/

Sutton, R. S., & Barto, A. (2018). Reinforcement learning: An introduction (2nd ed.). The MIT Press. http://incompleteideas.net/book/the-book-2nd.html

Tillet, P., Kung, H.-T., & Cox, D. G. (2019). Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations. https://doi.org/10.1145/3315508.3329973

Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra. Society For Industrial And Applied Mathematics.

Valiant, L. (2014). Probably Approximately Correct: Nature’s Algorithms for Learning and Prospering in a Complex World. Basic Books, A Member Of The Perseus Books Group.

Zheng, L., Yin, L., Xie, Z., Sun, C., Huang, J., Yu, C. H., Cao, S., Kozyrakis, C., Stoica, I., Gonzalez, J. E., Barrett, C., & Sheng, Y. (2023). SGLang: Efficient Execution of Structured Language Model Programs. ArXiv.org. https://arxiv.org/abs/2312.07104

Zadouri, T., Hoehnerbach, M., Shah, J., Liu, T., Thakkar, V., & Dao, T. (2026). FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling. ArXiv.org. https://arxiv.org/abs/2603.05451

Keyboard shortcuts

SITP