The Creation of Adam by Michelangelo. 1508-1512.
2. Neural Networks
In part two of the book we change both (1) the goal, from learning price/sentiment to learning sequences, and (2) the function class, from generalized linear models to deep neural networks, using torch to study the foundations of machine learning's "age of research" from 2012 to 2020.
We build on teenygrad's numpy-like capability developed in part one and abstract two more tasks away from the research scientist: `optim.sgd` and `Tensor.backward()`, which provide iterative optimization via differentiation, and cuBLAS-like kernels, which provide acceleration on manycore processors such as GPUs. We will then use these to train language models with different inductive biases (invariances) such as RNNs, LSTMs, BERTs, and GPTs, culminating in nanogpt.
This will prepare us for part three of the book, where we modify the language implementation of the deep learning framework to support distributed compilation in order to run both the training and inference of nanochat.
Contents
- 2.1 Learning Sequences in `torch`
- 2.2 Accelerating `cuBLAS` Kernels with `teenygrad`
- 2.3 Assistants with Sequence Learning with Different Inductive Biases in `teenygrad`
- 2.4 Reasoners
1.5 Learning Sentiment Iteratively with Classification
We'll now tackle our first learning problem in the domain of language with sentiment analysis — classifying whether a given piece of text has a positive or negative connotation — by modifying
- our model to support discrete categorical targets
- our optimization method to an iterative algorithm, namely gradient descent
Before diving into the model changes that support discrete categorical targets, we will take a closer look at the optimization method of gradient descent, because there are two reasons to move from direct to iterative methods: the first is memory bottlenecks, and the second is non-linear function classes.
1.5.1 Linear Regression and the Limitations of Direct Methods
The first reason we want to switch from direct to iterative optimization methods is that direct methods are memory-bound on materializing (and then factoring) the matrix $X^\top X$ in the closed-form solution $\hat{w} = (X^\top X)^{-1} X^\top y$.
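As a concrete illustration, here is a minimal sketch (on synthetic data, not the book's dataset) of the direct solve, showing that the $n \times n$ matrix $X^\top X$ must be materialized in memory before it can be factored:

```python
# Direct method for linear regression: the n x n matrix X^T X must be
# materialized, so memory grows quadratically in the number of features n.
import torch

m, n = 10_000, 1_000                      # m examples, n features (illustrative sizes)
X = torch.randn(m, n)
y = torch.randn(m)

XtX = X.T @ X                             # (n, n) -- the matrix we must materialize
w_hat = torch.linalg.solve(XtX, X.T @ y)  # w = (X^T X)^{-1} X^T y

print(XtX.shape, XtX.element_size() * XtX.nelement() / 1e6, "MB")
```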
1.5.2 Inductive Bias and Principle with Logistic Regression and Cross Entropy
The second reason we want to switch from direct to iterative optimization methods is that even if the number of dimensions is small enough that we are not memory-bound, direct methods simply will not work for other function classes besides linear regression. Consider ___.
Inductive Bias
In order to train a model which learns the sentiment of text, we will collect data $d = \{(x^i, y^i)\}_{i=1}^{N}$ where the input space consists of feature vectors $x \in \mathbb{R}^n$ and the output space is $\{0, 1\}$, with $0$ and $1$ encoding the negative and positive cases of binary classification[^1] (we will expand the number of categorical responses in the next section with multi-class classification). Then the function of interest we'd like to recover from the data is of the form $f: \mathbb{R}^n \to \{0, 1\}$. Recall that machine learning consists of selecting a family of functions $\mathcal{F}$ as the inductive bias, a loss function $\ell$ as the inductive principle, and estimating the parameters $\theta$ by empirically minimizing the risk.
For the inputs, each piece of text is represented as a feature vector $x \in \mathbb{R}^n$.
For the model class $\mathcal{F}$, we will continue to use a weighted sum of the input vector $x \in \mathbb{R}^n$ with weights $w \in \mathbb{R}^n$ so that $z = w^\top x$, but add the function $\sigma(z) = \frac{1}{1+e^{-z}}$, referred to as the logistic or sigmoid[^7] function, so that the output is a valid probability. We'll interpret $w^\top x$ as the log odds, $\sigma(w^\top x)$ as $p(y=1 \mid x)$, and the complement $1 - \sigma(w^\top x)$ as $p(y=0 \mid x)$. However, the current type of $f$ is $\mathbb{R}^n \to (0, 1)$, whereas the task is to assign a negative ($y=0$) or positive ($y=1$) sentiment to the provided input feature vector $x$. To ameliorate this we will add a final decision boundary so that $\hat{y} = \mathbb{1}[\sigma(w^\top x) \geq 0.5]$.
(FIGURE.MODEL ARCHITECTURE)
We can vectorize the evaluation of $f$ on each $x^i$ with a data matrix $X \in \mathbb{R}^{m \times n}$ whose rows are the examples, so that $\hat{y} = \sigma(Xw)$:
import torch
def f(xi: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
"""Compute sigmoid(w^T x) for a single example xi."""
logits = torch.matmul(w, xi)
return torch.sigmoid(logits)
if __name__ == "__main__":
w = torch.randn(3)
X = torch.randn(5, 3)
for xi in X: # does this work?
yi_hat = f(xi, w)
print(yi_hat.item())
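Since the loop above evaluates one example at a time, here is a short sketch of the vectorized form described in the text, assuming the same `w` and `X` as in the block above:

```python
# Vectorized evaluation: one matrix-vector product instead of a Python loop.
y_hat = torch.sigmoid(X @ w)   # (5,) -- one probability per row of X
print(y_hat)
```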
For example, let's trace through the evaluation of our model with the following input example:
Inductive Principle
For the loss function $\ell$, we will use cross entropy, which for a binary target $y \in \{0, 1\}$ and predicted probability $\hat{p} = \sigma(w^\top x)$ is $\ell(y, \hat{p}) = -\left[y \log \hat{p} + (1-y)\log(1-\hat{p})\right]$.
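As a sketch of this loss in code (the tensors here are illustrative, not the book's dataset):

```python
# Binary cross entropy: the manual formula vs torch's built-in functional.
import torch
import torch.nn.functional as F

y = torch.tensor([1., 0., 1.])          # 0/1 targets
p_hat = torch.tensor([0.9, 0.2, 0.6])   # model outputs sigmoid(w^T x) in (0, 1)

manual = -(y * p_hat.log() + (1 - y) * (1 - p_hat).log()).mean()
builtin = F.binary_cross_entropy(p_hat, y)
print(manual.item(), builtin.item())    # the two agree
```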
1.5.3 Iterative Optimization via Gradient Descent
For estimating the parameters $w$ that minimize the loss $\ell$, we will optimize iteratively using gradient descent, for the two reasons already mentioned: with the linear regression model class we are memory-bottlenecked on materializing $X^\top X$, and with other classes of functions (which is currently the case) we don't have ___.
Gradient descent, simply put, is an optimization method that uses differential calculus, namely, the fact that the gradient provides hints (the direction of steepest descent) on how to iteratively move the parameters closer to an optimum. The gradient descent algorithm can be expressed in a one-line update rule so that the goal of $\min_\theta \mathcal{L}(\theta)$ is implemented by:

$$\theta^{(t+1)} := \theta^{(t)} - \alpha \nabla_\theta \mathcal{L}(\theta^{(t)})$$

where $\alpha$ is referred to as the learning rate or step size. We now dive deeper into differential calculus and generalize the derivative to higher dimensional spaces[^8] to better understand what's happening under the hood.
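A minimal sketch of this update rule applied to logistic regression, with synthetic data and an illustrative learning rate:

```python
import torch

X = torch.randn(100, 3)                          # synthetic features
y = (torch.rand(100) > 0.5).float()              # synthetic 0/1 targets
w = torch.zeros(3, requires_grad=True)
alpha = 0.1                                      # learning rate / step size

for step in range(100):
    p_hat = torch.sigmoid(X @ w)                 # forward pass
    loss = -(y * p_hat.log() + (1 - y) * (1 - p_hat).log()).mean()
    loss.backward()                              # populates w.grad with the gradient
    with torch.no_grad():
        w -= alpha * w.grad                      # theta^(t+1) := theta^t - alpha * grad(L)
    w.grad = None                                # reset for the next step
print(loss.item())
```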
1.5.4 Multinomial Logistic Regression
1.5.5 Generalized Linear Models
2.1 Learning Representations, Learning Sequences
2.1.1 XOR Learning with Feedforward Neural Network
2.1.2 Sentiment Learning with FNNs
In part 1 of the book we trained generalized linear models of the form ___. In part 2 we modify and increase the expressivity of the function class by including non-linearities $\phi$. The feedforward neural network, simply put, is a series of linear and non-linear layers of the form

$$f(x) = W_L\,\phi\big(W_{L-1}\,\phi(\cdots\,\phi(W_1 x))\big)$$

where the $W_l$ are weight matrices, $\phi$ is an elementwise nonlinearity, and $L$ is the number of layers. Conceptually, the linear layers perform linear transformations that rotate, reflect, shear, and scale space, whereas the nonlinear transformations squash and twist space.
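A minimal sketch of such a network with two layers and tanh as the elementwise nonlinearity (the sizes are illustrative):

```python
import torch

n, d, k = 3, 16, 1                      # input dim, hidden dim, output dim
W1 = torch.randn(d, n)
W2 = torch.randn(k, d)

def f(x: torch.Tensor) -> torch.Tensor:
    h = torch.tanh(W1 @ x)              # linear transform, then squash with tanh
    return W2 @ h                       # final linear layer

x = torch.randn(n)
print(f(x))
```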
We will now use the same model of the feedforward neural network to accomplish two other goals. Namely, representation learning, and sequence learning.
2.1.3 Representation Learning with FNNs
2.1.4 Language Modeling with FNNs
A sequence model, simply put, is the conditional probability distribution $p(x_{t+1} \mid x_t)$ of an output token given an input token. A sequence of tokens can be a sentence of words in the domain of language, a series of pixels in the domain of vision, or a stream of waves in the domain of audio.

Since we are modeling language as a stochastic phenomenon, we use the formal language of probability theory, where a probability space $(\Omega, \mathcal{E}, P)$ is a measurable space $(\Omega, \mathcal{E})$ equipped with a measure $P$. In the domain of language, the sample space $\Omega$ is the set of all tokens, modelling a vocabulary, and the event space $\mathcal{E}$ is the set of all token combinations, which models a language. The measure $P$ assigns a weight to a particular token combination (a sentence, really) as an event, relative to the set of all possible token combinations (sentences) that make up the entire event space. Once we use a random variable $X: \Omega \to \mathbb{R}$ to map events to $\mathbb{R}$, we can forget about the probability space and focus our attention on language models, which are joint probability distributions $p(x_1, \ldots, x_T)$ over all sequences of tokens.
Language modeling with n-grams.
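As a warm-up, here is a sketch of a count-based bigram (2-gram) model on a toy corpus (the corpus and whitespace tokenization are illustrative, not the book's dataset):

```python
import torch

corpus = "the cat sat on the mat . the cat ate .".split()
vocab = sorted(set(corpus))
stoi = {w: i for i, w in enumerate(vocab)}

# count co-occurrences of adjacent tokens
counts = torch.zeros(len(vocab), len(vocab))
for a, b in zip(corpus, corpus[1:]):
    counts[stoi[a], stoi[b]] += 1

# row-normalize to get p(next token | current token)
probs = counts / counts.sum(1, keepdim=True).clamp(min=1)
print(probs[stoi["the"]])               # distribution over the next token after "the"
```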
(1. EXPLAIN MODEL).
# FFN MODEL f: R^n -> R
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    """
    model: Neural Language Models (Bengio et al. 2003)
    key:
    b: batch size, t: sequence length
    v: vocabulary size, e: dimension of embedding, d: dimension of model
    """
    def __init__(self, cfg):
        super().__init__()
        b, t, v, e, d = cfg.b, cfg.t, cfg.v, cfg.e, cfg.d
        self.t, self.v = t, v
        self.wte = nn.Embedding(v+1, e) # token embeddings table (+1 for <BLANK>)
        self.l1 = nn.Linear(t*e, d, bias=False)
        self.l2 = nn.Linear(d, d, bias=False)
        self.l3 = nn.Linear(d, v, bias=False)

    def forward(self, i, targets=None):
        embs = [] # gather the word embeddings of the previous t words
        for k in range(self.t):
            tok_emb = self.wte(i) # token embeddings of shape (b, t, e)
            i = torch.roll(i, 1, 1)
            i[:, 0] = self.v # special <BLANK> token
            embs.append(tok_emb)
        # concat all of the embeddings together and pass through an MLP
        x = torch.cat(embs, -1) # (b, t, e * t)
        x = self.l1(x).tanh()
        x = self.l2(x).tanh()
        x = self.l3(x)
        yhat = x # logits of shape (b, t, v)
        # if we are given some desired targets also calculate the loss
        loss = None
        if targets is not None:
            loss = F.cross_entropy(yhat.view(-1, yhat.size(-1)), targets.view(-1), ignore_index=-1)
        return yhat, loss
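A short usage sketch, assuming the MLP class above and illustrative hyperparameters:

```python
from types import SimpleNamespace

cfg = SimpleNamespace(b=4, t=3, v=27, e=10, d=200)
model = MLP(cfg)
i = torch.randint(0, cfg.v, (cfg.b, cfg.t))   # (b, t) integer token contexts
logits, loss = model(i)
print(logits.shape)                           # (4, 3, 27); loss is None without targets
```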
(2. EXPLAIN DATASET).
# FFN DATA d={(x^i,y^i)}
import random
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

def build_dataset(t):
    words = open('./data/names.txt', 'r').read().splitlines()
    v = sorted(list(set(''.join(words))))
    encode = { c:i+1 for i,c in enumerate(v) }
    encode['.'] = 0
    decode = { i:c for c,i in encode.items() }

    def gen_dataset(words, t):
        X, Y = [], []
        for w in words:
            context = [0] * t
            for c in w + '.':
                X.append(context)
                Y.append(encode[c])
                # print(''.join(decode[i] for i in context), '-->', decode[encode[c]])
                context = context[1:] + [encode[c]] # slide the context window forward
        X, Y = torch.tensor(X), torch.tensor(Y) # X:(N,t) Y:(N,)
        return X, Y

    random.seed(42)
    random.shuffle(words)
    n1, n2 = int(0.8*len(words)), int(0.9*len(words))
    Xtraining, Ytraining = gen_dataset(words[:n1], t)
    Xdev, Ydev = gen_dataset(words[n1:n2], t)
    Xte, Yte = gen_dataset(words[n2:], t)
    return Xtraining, Ytraining
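A short usage sketch, assuming ./data/names.txt is present as in the listing above:

```python
Xtr, Ytr = build_dataset(3)
print(Xtr.shape, Ytr.shape, Xtr.dtype)   # (N, 3) integer contexts, (N,) next-character targets
```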
(3. EXPLAIN TRAINING LOOP).
# FFN TRAINING LOOP: theta^(t+1) := theta^t - alpha*grad(L)
if __name__ == "__main__":
    from types import SimpleNamespace
    b, t, v, e, d = 32, 3, 27, 10, 200 # init hyperparameters
    X, Y = build_dataset(t) # init data
    model = MLP(SimpleNamespace(b=b, t=t, v=v, e=e, d=d)) # init model (the embedding table lives in model.wte)
    params = list(model.parameters())
    N, losses, steps = X.shape[0], [], [] # train
    for step in range(200000):
        i_b = torch.randint(0, N, (b,)) # sample a minibatch
        X_b, Y_b = X[i_b], Y[i_b]
        logits, _ = model(X_b) # 0. embed + 1. forward -> (b, t, v)
        loss = F.cross_entropy(logits[:, -1, :], Y_b) # loss on the next-token prediction
        for p in params: p.grad = None
        loss.backward() # 2. backward
        for p in params: p.data += -0.01 * p.grad # 3. update
        # optimizer.step()?
        steps.append(step)
        losses.append(loss.log10().item())
        if step % 10000 == 0: print(f"step: {step}/{200000}, loss {loss.item()}")
    plt.plot(steps, losses)
    plt.show()
In the next two chapters, 2.2 and 2.3, we will implement two features in picograd, the two primary tasks that pytorch abstracts away from research scientists: the backward pass via automatic differentiation, and the device acceleration of the specified forward pass.
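To preview the first of these, here is a minimal sketch of reverse-mode automatic differentiation on scalars (an illustration of the idea, not picograd's actual implementation): each operation records a closure that propagates gradients backward through the computation graph via the chain rule.

```python
class Value:
    """A scalar that remembers how it was computed, so gradients can flow backward."""
    def __init__(self, data, parents=()):
        self.data, self.grad = data, 0.0
        self._parents, self._backward = parents, lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def backward():
            self.grad += out.grad                 # d(x+y)/dx = 1
            other.grad += out.grad                # d(x+y)/dy = 1
        out._backward = backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def backward():
            self.grad += other.data * out.grad    # d(xy)/dx = y
            other.grad += self.data * out.grad    # d(xy)/dy = x
        out._backward = backward
        return out

    def backward(self):
        # topologically sort the graph, then apply the chain rule in reverse
        topo, seen = [], set()
        def visit(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents: visit(p)
                topo.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(topo): v._backward()

x, w = Value(3.0), Value(2.0)
loss = x * w + w                                  # loss = xw + w
loss.backward()
print(w.grad)                                     # dloss/dw = x + 1 = 4.0
```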
2.3 Accelerating cuBLAS Kernels
- https://arxiv.org/pdf/1410.0759
- https://arxiv.org/pdf/1804.06826
- https://arxiv.org/pdf/2512.02189v1
- https://girl.surgery/bad_paper
- https://www.arxiv.org/pdf/2512.07004
2.3.1 PDP-11 Problem: Throughput-Oriented Manycore Processors
// device code (compiled to PTX by cuda_builder)
#[allow(improper_ctypes_definitions)]
#[kernel]
pub unsafe fn main_gpu() {
    println!("of Tensor Programs!");
}

// host code
use cust::prelude::*;
use std::error::Error;

fn main() -> Result<(), Box<dyn Error>> {
    let _ctx = cust::quick_init()?; // Initialize the CUDA Driver API. `_ctx` must be kept alive until the end.
    let module = Module::from_ptx(PTX, &[])?; // Create a module from the PTX code compiled by `cuda_builder`.
    let stream = Stream::new(StreamFlags::NON_BLOCKING, None)?; // Create a stream, which is like a thread for dispatching GPU calls.
    let kernel = module.get_function("main_gpu")?; // Look up the compiled kernel by name.
    unsafe { launch!(kernel<<<1, 1, 0, stream>>>())?; } // Launch it with a 1x1 grid/block configuration.
    stream.synchronize()?;
    Ok(())
}
2.3.2 Accelerating GEMM with CUDA(RS)
2.3.3 Accelerating GEMM with Data Reuse
2.3.4 Accelerating GEMM with Scheduling
2.3.5 Accelerating GEMM with Tensor Cores
2.4 Learning Sequences with Different Inductive Biases
Sequence learning...
2.4.1 CNN: Convolutional Neural Networks
2.4.2 RNN: Recurrent Neural Networks
2.4.3 BERT: Bidirectional Encoder Representations from Transformers
2.4.4 GPT: Generative Pretrained Transformers
#!/usr/bin/env python3
import os, argparse, contextlib
from typing import Optional, Union
with contextlib.suppress(ImportError): import tiktoken
from tinygrad import Tensor, TinyJit, Device, GlobalCounters, Variable, dtypes
from tinygrad.uop.ops import UOp
from tinygrad.helpers import Timing, DEBUG, JIT, getenv, fetch, colored, trange
from tinygrad.nn import Embedding, Linear, LayerNorm
from tinygrad.nn.state import gguf_load, torch_load, load_state_dict, get_state_dict
from extra.bench_log import BenchEvent, WallTimeEvent
MAX_CONTEXT = getenv("MAX_CONTEXT", 128)
HALF = getenv("HALF")
class Attention:
def __init__(self, dim, n_heads):
self.c_attn = Linear(dim, 3*dim, bias=True)
self.c_proj = Linear(dim, dim, bias=True)
self.n_heads = n_heads
self.dim = dim
self.head_dim = dim // n_heads
def __call__(self, x:Tensor, start_pos:Variable, mask:Optional[Tensor]) -> Tensor:
if mask is not None or start_pos.val == 0:
# no symbolic shape qkv when consuming prompts
start_pos = start_pos.val
if HALF: x = x.half()
xqkv = self.c_attn(x).reshape(None, None, 3, self.n_heads, self.head_dim)
xq, xk, xv = [xqkv[:, :, i, :, :] for i in range(3)]
bsz, seqlen, _, _ = xq.shape
# create kv cache
if not hasattr(self, "cache_kv"):
self.cache_kv = Tensor.zeros(2, bsz, MAX_CONTEXT, self.n_heads, self.head_dim, dtype=x.dtype).contiguous().realize()
# update the cache
self.cache_kv[:, :, start_pos:start_pos+seqlen, :, :].assign(Tensor.stack(xk, xv)).realize()
if start_pos > 0:
keys = self.cache_kv[0][:, :start_pos+seqlen, :, :]
values = self.cache_kv[1][:, :start_pos+seqlen, :, :]
else:
keys = xk
values = xv
xq, keys, values = xq.transpose(1, 2), keys.transpose(1, 2), values.transpose(1, 2)
return self.c_proj(xq.scaled_dot_product_attention(keys, values, mask).transpose(1, 2).reshape(bsz, seqlen, self.dim))
class FeedForward:
def __init__(self, dim, hidden_dim):
self.c_fc = Linear(dim, hidden_dim, bias=True)
self.c_proj = Linear(hidden_dim, dim, bias=True)
def __call__(self, x:Tensor) -> Tensor:
return self.c_proj(self.c_fc(x).gelu())
class TransformerBlock:
def __init__(self, dim, n_heads, norm_eps):
self.attn = Attention(dim, n_heads)
self.mlp = FeedForward(dim, 4*dim)
self.ln_1 = LayerNorm(dim, norm_eps)
self.ln_2 = LayerNorm(dim, norm_eps)
def __call__(self, x:Tensor, start_pos:Variable, mask:Optional[Tensor]):
h = x + self.attn(self.ln_1(x), start_pos, mask).float()
return (h + self.mlp(self.ln_2(h))).contiguous()
class Transformer:
def __init__(self, dim, n_heads, n_layers, norm_eps, vocab_size, max_seq_len=1024):
self.vocab_size = vocab_size
self.wte = Embedding(vocab_size, dim)
self.wpe = Embedding(max_seq_len, dim)
self.h = [TransformerBlock(dim, n_heads, norm_eps) for _ in range(n_layers)]
self.ln_f = LayerNorm(dim, norm_eps)
self.lm_head = Linear(dim, vocab_size, bias=False)
self.forward_jit = TinyJit(self.forward)
def forward(self, tokens:Union[Tensor,UOp], start_pos:Variable, temperature:float=0.0):
if not hasattr(self, 'allpos'): self.allpos = Tensor.arange(0, MAX_CONTEXT).reshape(1, -1).realize()
if isinstance(tokens, UOp):
seqlen = 1
tok_emb = self.wte.weight.shrink(((tokens, tokens+1), None))
else:
seqlen = tokens.shape[1]
tok_emb = self.wte(tokens)
# not symbolic when consuming the prompt
selected_pos = (0, seqlen) if start_pos.val == 0 else (start_pos, start_pos+1)
pos_emb = self.wpe(self.allpos.shrink((None, selected_pos)))
h = tok_emb + pos_emb
if HALF: h = h.half()
mask = Tensor.full((1, 1, seqlen, start_pos.val+seqlen), float("-inf"), dtype=h.dtype).triu(start_pos.val+1) if seqlen > 1 else None
for hi in self.h: h = hi(h, start_pos, mask)
logits = self.lm_head(self.ln_f(h))
if logits.shape[1] == 0:
# special case for empty prompt
logits = Tensor.ones((logits.shape[0], self.vocab_size), dtype=logits.dtype, device=logits.device)
else:
logits = logits[:, -1, :]
if temperature < 1e-6:
ret = logits.argmax(-1)
else:
ret = (logits / temperature).softmax().multinomial()
return ret.flatten().realize()
def __call__(self, tokens:Union[Tensor,UOp], start_pos:Variable, temperature:float=0.0) -> Tensor:
forward = (self.forward_jit if JIT and (isinstance(tokens, UOp) or tokens.shape[1] == 1) else self.forward)
return forward(tokens, start_pos, temperature)
---
- If you are familiar with functional programming, this is the mathematical equivalent of currying via closure.