The Creation of Adam by Michelangelo. 1508-1512.
2. Neural Networks
In part two of the book we change both (1) the goal, from learning price/sentiment to learning sequences, and (2) the function class, from generalized linear models to deep neural networks, using torch to study the foundations of machine learning's "age of research" from 2012 to 2020.
We build on teenygrad's numpy-like capability developed in part one and abstract two more tasks away from the research scientist: `optim.sgd` and `Tensor.backward()`, which provide iterative optimization via differentiation, and cuBLAS-like kernels, which provide acceleration on manycore processors such as GPUs. We will then use these to train language models with different inductive biases (invariances) such as RNNs, LSTMs, BERTs, and GPTs, culminating in nanogpt.
This will prepare us for part three of the book, where we modify the language implementation of the deep learning framework to support distributed compilation in order to run both the training and inference of nanochat.
Contents
- 2.1 Learning Sequences in `torch`
- 2.2 Accelerating `cuBLAS` Kernels with `teenygrad`
- 2.3 Assistants with Sequence Learning with Different Inductive Biases in `teenygrad`
- 2.4 Reasoners
1.5 Learning Sentiment Iteratively with Classification
We'll now tackle our first learning problem in the domain of language with sentiment analysis — classifying whether a given piece of text has a positive or negative connotation — by modifying
- our model to support discrete categorical targets
- our optimization method to an iterative algorithm, namely gradient descent
Before diving into the model changes that support discrete categorical targets, we will take a closer look at the optimization method of gradient descent, because there are two reasons to move from direct to iterative methods: the first is memory bottlenecks, and the second is non-linear function classes.
1.5.1 Linear Regression and the Limitations of Direct Methods
The first reason we want to switch from direct to iterative optimization methods is that direct methods are memory-bound on materializing (and then factoring) the matrix $X^\top X$ in the closed-form solution $\hat{w} = (X^\top X)^{-1} X^\top y$.
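As a concrete illustration, here is a minimal sketch (on synthetic data, not the book's dataset) of the direct solve, showing that the $n \times n$ matrix $X^\top X$ must be materialized in memory before it can be factored:

```python
# Direct method for linear regression: the n x n matrix X^T X must be
# materialized, so memory grows quadratically in the number of features n.
import torch

m, n = 10_000, 1_000                      # m examples, n features (illustrative sizes)
X = torch.randn(m, n)
y = torch.randn(m)

XtX = X.T @ X                             # (n, n) -- the matrix we must materialize
w_hat = torch.linalg.solve(XtX, X.T @ y)  # w = (X^T X)^{-1} X^T y

print(XtX.shape, XtX.element_size() * XtX.nelement() / 1e6, "MB")
```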
1.5.2 Inductive Bias and Principle with Logistic Regression and Cross Entropy
The second reason we want to switch from direct to iterative optimization methods is that even if the number of dimensions is small enough that we are not memory-bound, direct methods simply will not work for other function classes besides linear regression. Consider ___.
Inductive Bias
In order to train a model which learns the sentiment of text, we will collect data $d = \{(x^i, y^i)\}_{i=1}^{N}$ where the input space consists of feature vectors $x \in \mathbb{R}^n$ and the output space is $\{0, 1\}$, with $0$ and $1$ encoding the negative and positive cases of binary classification[^1] (we will expand the number of categorical responses in the next section with multi-class classification). Then the function of interest we'd like to recover from the data is of the form $f: \mathbb{R}^n \to \{0, 1\}$. Recall that machine learning consists of selecting a family of functions $\mathcal{F}$ as the inductive bias, a loss function $\ell$ as the inductive principle, and estimating the parameters $\theta$ by empirically minimizing the risk.
For the inputs, each piece of text is represented as a feature vector $x \in \mathbb{R}^n$.
For the model class $\mathcal{F}$, we will continue to use a weighted sum of the input vector $x \in \mathbb{R}^n$ with weights $w \in \mathbb{R}^n$ so that $z = w^\top x$, but add the function $\sigma(z) = \frac{1}{1+e^{-z}}$, referred to as the logistic or sigmoid[^7] function, so that the output is a valid probability. We'll interpret $w^\top x$ as the log odds, $\sigma(w^\top x)$ as $p(y=1 \mid x)$, and the complement $1 - \sigma(w^\top x)$ as $p(y=0 \mid x)$. However, the current type of $f$ is $\mathbb{R}^n \to (0, 1)$, whereas the task is to assign a negative ($y=0$) or positive ($y=1$) sentiment to the provided input feature vector $x$. To ameliorate this we will add a final decision boundary so that $\hat{y} = \mathbb{1}[\sigma(w^\top x) \geq 0.5]$.
(FIGURE.MODEL ARCHITECTURE)
We can vectorize the evaluation of $f$ on each $x^i$ with a data matrix $X \in \mathbb{R}^{m \times n}$ whose rows are the examples, so that $\hat{y} = \sigma(Xw)$:
import torch
def f(xi: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
"""Compute sigmoid(w^T x) for a single example xi."""
logits = torch.matmul(w, xi)
return torch.sigmoid(logits)
if __name__ == "__main__":
w = torch.randn(3)
X = torch.randn(5, 3)
for xi in X: # does this work?
yi_hat = f(xi, w)
print(yi_hat.item())
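Since the loop above evaluates one example at a time, here is a short sketch of the vectorized form described in the text, assuming the same `w` and `X` as in the block above:

```python
# Vectorized evaluation: one matrix-vector product instead of a Python loop.
y_hat = torch.sigmoid(X @ w)   # (5,) -- one probability per row of X
print(y_hat)
```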
For example, let's trace through the evaluation of our model with the following input example:
Inductive Principle
For the loss function $\ell$, we will use cross entropy, which for a binary target $y \in \{0, 1\}$ and predicted probability $\hat{p} = \sigma(w^\top x)$ is $\ell(y, \hat{p}) = -\left[y \log \hat{p} + (1-y)\log(1-\hat{p})\right]$.
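As a sketch of this loss in code (the tensors here are illustrative, not the book's dataset):

```python
# Binary cross entropy: the manual formula vs torch's built-in functional.
import torch
import torch.nn.functional as F

y = torch.tensor([1., 0., 1.])          # 0/1 targets
p_hat = torch.tensor([0.9, 0.2, 0.6])   # model outputs sigmoid(w^T x) in (0, 1)

manual = -(y * p_hat.log() + (1 - y) * (1 - p_hat).log()).mean()
builtin = F.binary_cross_entropy(p_hat, y)
print(manual.item(), builtin.item())    # the two agree
```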
1.5.3 Iterative Optimization via Gradient Descent
For estimating the parameters $w$ that minimize the loss $\ell$, we will optimize iteratively using gradient descent, for the two reasons already mentioned: with the linear regression model class we are memory-bottlenecked on materializing $X^\top X$, and with other classes of functions (which is currently the case) we don't have ___.
Gradient descent, simply put, is an optimization method that uses differential calculus, namely, the fact that the gradient provides hints (the direction of steepest descent) on how to iteratively move the parameters closer to an optimum. The gradient descent algorithm can be expressed in a one-line update rule so that the goal of $\min_\theta \mathcal{L}(\theta)$ is implemented by:

$$\theta^{(t+1)} := \theta^{(t)} - \alpha \nabla_\theta \mathcal{L}(\theta^{(t)})$$

where $\alpha$ is referred to as the learning rate or step size. We now dive deeper into differential calculus and generalize the derivative to higher dimensional spaces[^8] to better understand what's happening under the hood.
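A minimal sketch of this update rule applied to logistic regression, with synthetic data and an illustrative learning rate:

```python
import torch

X = torch.randn(100, 3)                          # synthetic features
y = (torch.rand(100) > 0.5).float()              # synthetic 0/1 targets
w = torch.zeros(3, requires_grad=True)
alpha = 0.1                                      # learning rate / step size

for step in range(100):
    p_hat = torch.sigmoid(X @ w)                 # forward pass
    loss = -(y * p_hat.log() + (1 - y) * (1 - p_hat).log()).mean()
    loss.backward()                              # populates w.grad with the gradient
    with torch.no_grad():
        w -= alpha * w.grad                      # theta^(t+1) := theta^t - alpha * grad(L)
    w.grad = None                                # reset for the next step
print(loss.item())
```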
1.5.4 Multinomial Logistic Regression
1.5.5 Generalized Linear Models
2.1 Learning Representations, Learning Sequences
2.1.1 XOR Learning with Feedforward Neural Network
2.1.2 Sentiment Learning with FNNs
In part 1 of the book we trained generalized linear models of the form ___. In part 2 we modify and increase the expressivity of the function class by including non-linearities $\phi$. The feedforward neural network, simply put, is a series of linear and non-linear layers of the form

$$f(x) = W_L\,\phi\big(W_{L-1}\,\phi(\cdots\,\phi(W_1 x))\big)$$

where the $W_l$ are weight matrices, $\phi$ is an elementwise nonlinearity, and $L$ is the number of layers. Conceptually, the linear layers perform linear transformations that rotate, reflect, shear, and scale space, whereas the nonlinear transformations squash and twist space.
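A minimal sketch of such a network with two layers and tanh as the elementwise nonlinearity (the sizes are illustrative):

```python
import torch

n, d, k = 3, 16, 1                      # input dim, hidden dim, output dim
W1 = torch.randn(d, n)
W2 = torch.randn(k, d)

def f(x: torch.Tensor) -> torch.Tensor:
    h = torch.tanh(W1 @ x)              # linear transform, then squash with tanh
    return W2 @ h                       # final linear layer

x = torch.randn(n)
print(f(x))
```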
We will now use the same model of the feedforward neural network to accomplish two other goals. Namely, representation learning, and sequence learning.
2.1.3 Representation Learning with FNNs
2.1.4 Language Modeling with FNNs
A sequence model, simply put, is the conditional probability distribution $p(x_{t+1} \mid x_t)$ of an output token given an input token. A sequence of tokens can be a sentence of words in the domain of language, a series of pixels in the domain of vision, or a stream of waves in the domain of audio.

Since we are modeling language as a stochastic phenomenon, we use the formal language of probability theory, where a probability space $(\Omega, \mathcal{E}, P)$ is a measurable space $(\Omega, \mathcal{E})$ equipped with a measure $P$. In the domain of language, the sample space $\Omega$ is the set of all tokens, modelling a vocabulary, and the event space $\mathcal{E}$ is the set of all token combinations, which models a language. The measure $P$ assigns a weight to a particular token combination (a sentence, really) as an event, relative to the set of all possible token combinations (sentences) that make up the entire event space. Once we use a random variable $X: \Omega \to \mathbb{R}$ to map events to $\mathbb{R}$, we can forget about the probability space and focus our attention on language models, which are joint probability distributions $p(x_1, \ldots, x_T)$ over all sequences of tokens.
Language modeling with n-grams.
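As a warm-up, here is a sketch of a count-based bigram (2-gram) model on a toy corpus (the corpus and whitespace tokenization are illustrative, not the book's dataset):

```python
import torch

corpus = "the cat sat on the mat . the cat ate .".split()
vocab = sorted(set(corpus))
stoi = {w: i for i, w in enumerate(vocab)}

# count co-occurrences of adjacent tokens
counts = torch.zeros(len(vocab), len(vocab))
for a, b in zip(corpus, corpus[1:]):
    counts[stoi[a], stoi[b]] += 1

# row-normalize to get p(next token | current token)
probs = counts / counts.sum(1, keepdim=True).clamp(min=1)
print(probs[stoi["the"]])               # distribution over the next token after "the"
```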
(1. EXPLAIN MODEL).
# FFN MODEL f: R^n -> R
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    """
    model: Neural Language Models (Bengio et al. 2003)
    key:
    b: batch size, t: sequence length
    v: vocabulary size, e: dimension of embedding, d: dimension of model
    """
    def __init__(self, cfg):
        super().__init__()
        b, t, v, e, d = cfg.b, cfg.t, cfg.v, cfg.e, cfg.d
        self.t, self.v = t, v
        self.wte = nn.Embedding(v+1, e) # token embeddings table (+1 for <BLANK>)
        self.l1 = nn.Linear(t*e, d, bias=False)
        self.l2 = nn.Linear(d, d, bias=False)
        self.l3 = nn.Linear(d, v, bias=False)

    def forward(self, i, targets=None):
        embs = [] # gather the word embeddings of the previous t words
        for k in range(self.t):
            tok_emb = self.wte(i) # token embeddings of shape (b, t, e)
            i = torch.roll(i, 1, 1)
            i[:, 0] = self.v # special <BLANK> token
            embs.append(tok_emb)
        # concat all of the embeddings together and pass through an MLP
        x = torch.cat(embs, -1) # (b, t, e * t)
        x = self.l1(x).tanh()
        x = self.l2(x).tanh()
        x = self.l3(x)
        yhat = x # logits of shape (b, t, v)
        # if we are given some desired targets also calculate the loss
        loss = None
        if targets is not None:
            loss = F.cross_entropy(yhat.view(-1, yhat.size(-1)), targets.view(-1), ignore_index=-1)
        return yhat, loss
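A short usage sketch, assuming the MLP class above and illustrative hyperparameters:

```python
from types import SimpleNamespace

cfg = SimpleNamespace(b=4, t=3, v=27, e=10, d=200)
model = MLP(cfg)
i = torch.randint(0, cfg.v, (cfg.b, cfg.t))   # (b, t) integer token contexts
logits, loss = model(i)
print(logits.shape)                           # (4, 3, 27); loss is None without targets
```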
(2. EXPLAIN DATASET).
# FFN DATA d={(x^i,y^i)}
import random
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

def build_dataset(t):
    words = open('./data/names.txt', 'r').read().splitlines()
    v = sorted(list(set(''.join(words))))
    encode = { c:i+1 for i,c in enumerate(v) }
    encode['.'] = 0
    decode = { i:c for c,i in encode.items() }

    def gen_dataset(words, t):
        X, Y = [], []
        for w in words:
            context = [0] * t
            for c in w + '.':
                X.append(context)
                Y.append(encode[c])
                # print(''.join(decode[i] for i in context), '-->', decode[encode[c]])
                context = context[1:] + [encode[c]] # slide the context window forward
        X, Y = torch.tensor(X), torch.tensor(Y) # X:(N,t) Y:(N,)
        return X, Y

    random.seed(42)
    random.shuffle(words)
    n1, n2 = int(0.8*len(words)), int(0.9*len(words))
    Xtraining, Ytraining = gen_dataset(words[:n1], t)
    Xdev, Ydev = gen_dataset(words[n1:n2], t)
    Xte, Yte = gen_dataset(words[n2:], t)
    return Xtraining, Ytraining
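A short usage sketch, assuming ./data/names.txt is present as in the listing above:

```python
Xtr, Ytr = build_dataset(3)
print(Xtr.shape, Ytr.shape, Xtr.dtype)   # (N, 3) integer contexts, (N,) next-character targets
```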
(3. EXPLAIN TRAINING LOOP).
# FFN TRAINING LOOP: theta^(t+1) := theta^t - alpha*grad(L)
if __name__ == "__main__":
    from types import SimpleNamespace
    b, t, v, e, d = 32, 3, 27, 10, 200 # init hyperparameters
    X, Y = build_dataset(t) # init data
    model = MLP(SimpleNamespace(b=b, t=t, v=v, e=e, d=d)) # init model (the embedding table lives in model.wte)
    params = list(model.parameters())
    N, losses, steps = X.shape[0], [], [] # train
    for step in range(200000):
        i_b = torch.randint(0, N, (b,)) # sample a minibatch
        X_b, Y_b = X[i_b], Y[i_b]
        logits, _ = model(X_b) # 0. embed + 1. forward -> (b, t, v)
        loss = F.cross_entropy(logits[:, -1, :], Y_b) # loss on the next-token prediction
        for p in params: p.grad = None
        loss.backward() # 2. backward
        for p in params: p.data += -0.01 * p.grad # 3. update
        # optimizer.step()?
        steps.append(step)
        losses.append(loss.log10().item())
        if step % 10000 == 0: print(f"step: {step}/{200000}, loss {loss.item()}")
    plt.plot(steps, losses)
    plt.show()
In the next two chapters, 2.2 and 2.3, we will implement two features in picograd, the two primary tasks that pytorch abstracts away from research scientists: the backward pass via automatic differentiation, and the device acceleration of the specified forward pass.
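To preview the first of these, here is a minimal sketch of reverse-mode automatic differentiation on scalars (an illustration of the idea, not picograd's actual implementation): each operation records a closure that propagates gradients backward through the computation graph via the chain rule.

```python
class Value:
    """A scalar that remembers how it was computed, so gradients can flow backward."""
    def __init__(self, data, parents=()):
        self.data, self.grad = data, 0.0
        self._parents, self._backward = parents, lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def backward():
            self.grad += out.grad                 # d(x+y)/dx = 1
            other.grad += out.grad                # d(x+y)/dy = 1
        out._backward = backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def backward():
            self.grad += other.data * out.grad    # d(xy)/dx = y
            other.grad += self.data * out.grad    # d(xy)/dy = x
        out._backward = backward
        return out

    def backward(self):
        # topologically sort the graph, then apply the chain rule in reverse
        topo, seen = [], set()
        def visit(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents: visit(p)
                topo.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(topo): v._backward()

x, w = Value(3.0), Value(2.0)
loss = x * w + w                                  # loss = xw + w
loss.backward()
print(w.grad)                                     # dloss/dw = x + 1 = 4.0
```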
2.3 Accelerating cuBLAS Kernels
- https://arxiv.org/pdf/1410.0759
- https://arxiv.org/pdf/1804.06826
- https://arxiv.org/pdf/2512.02189v1
- https://girl.surgery/bad_paper
- https://www.arxiv.org/pdf/2512.07004
2.3.1 PDP-11 Problem: Throughput-Oriented Manycore Processors
// device code (compiled to PTX by cuda_builder)
#[allow(improper_ctypes_definitions)]
#[kernel]
pub unsafe fn main_gpu() {
    println!("of Tensor Programs!");
}

// host code
use cust::prelude::*;
use std::error::Error;

fn main() -> Result<(), Box<dyn Error>> {
    let _ctx = cust::quick_init()?; // Initialize the CUDA Driver API. `_ctx` must be kept alive until the end.
    let module = Module::from_ptx(PTX, &[])?; // Create a module from the PTX code compiled by `cuda_builder`.
    let stream = Stream::new(StreamFlags::NON_BLOCKING, None)?; // Create a stream, which is like a thread for dispatching GPU calls.
    let kernel = module.get_function("main_gpu")?; // Look up the compiled kernel by name.
    unsafe { launch!(kernel<<<1, 1, 0, stream>>>())?; } // Launch it with a 1x1 grid/block configuration.
    stream.synchronize()?;
    Ok(())
}
2.3.2 Accelerating GEMM with CUDA(RS)
2.3.3 Accelerating GEMM with Data Reuse
2.3.4 Accelerating GEMM with Scheduling
2.3.5 Accelerating GEMM with Tensor Cores
2.4 Learning Sequences with Different Inductive Biases
Sequence learning...
2.4.1 CNN: Convolutional Neural Networks
2.4.2 RNN: Recurrent Neural Networks
2.4.3 BERT: Bidirectional Encoder Representations from Transformers
2.4.4 GPT: Generative Pretrained Transformers
#!/usr/bin/env python3
import os, argparse, contextlib
from typing import Optional, Union
with contextlib.suppress(ImportError): import tiktoken
from tinygrad import Tensor, TinyJit, Device, GlobalCounters, Variable, dtypes
from tinygrad.uop.ops import UOp
from tinygrad.helpers import Timing, DEBUG, JIT, getenv, fetch, colored, trange
from tinygrad.nn import Embedding, Linear, LayerNorm
from tinygrad.nn.state import gguf_load, torch_load, load_state_dict, get_state_dict
from extra.bench_log import BenchEvent, WallTimeEvent
MAX_CONTEXT = getenv("MAX_CONTEXT", 128)
HALF = getenv("HALF")
class Attention:
def __init__(self, dim, n_heads):
self.c_attn = Linear(dim, 3*dim, bias=True)
self.c_proj = Linear(dim, dim, bias=True)
self.n_heads = n_heads
self.dim = dim
self.head_dim = dim // n_heads
def __call__(self, x:Tensor, start_pos:Variable, mask:Optional[Tensor]) -> Tensor:
if mask is not None or start_pos.val == 0:
# no symbolic shape qkv when consuming prompts
start_pos = start_pos.val
if HALF: x = x.half()
xqkv = self.c_attn(x).reshape(None, None, 3, self.n_heads, self.head_dim)
xq, xk, xv = [xqkv[:, :, i, :, :] for i in range(3)]
bsz, seqlen, _, _ = xq.shape
# create kv cache
if not hasattr(self, "cache_kv"):
self.cache_kv = Tensor.zeros(2, bsz, MAX_CONTEXT, self.n_heads, self.head_dim, dtype=x.dtype).contiguous().realize()
# update the cache
self.cache_kv[:, :, start_pos:start_pos+seqlen, :, :].assign(Tensor.stack(xk, xv)).realize()
if start_pos > 0:
keys = self.cache_kv[0][:, :start_pos+seqlen, :, :]
values = self.cache_kv[1][:, :start_pos+seqlen, :, :]
else:
keys = xk
values = xv
xq, keys, values = xq.transpose(1, 2), keys.transpose(1, 2), values.transpose(1, 2)
return self.c_proj(xq.scaled_dot_product_attention(keys, values, mask).transpose(1, 2).reshape(bsz, seqlen, self.dim))
class FeedForward:
def __init__(self, dim, hidden_dim):
self.c_fc = Linear(dim, hidden_dim, bias=True)
self.c_proj = Linear(hidden_dim, dim, bias=True)
def __call__(self, x:Tensor) -> Tensor:
return self.c_proj(self.c_fc(x).gelu())
class TransformerBlock:
def __init__(self, dim, n_heads, norm_eps):
self.attn = Attention(dim, n_heads)
self.mlp = FeedForward(dim, 4*dim)
self.ln_1 = LayerNorm(dim, norm_eps)
self.ln_2 = LayerNorm(dim, norm_eps)
def __call__(self, x:Tensor, start_pos:Variable, mask:Optional[Tensor]):
h = x + self.attn(self.ln_1(x), start_pos, mask).float()
return (h + self.mlp(self.ln_2(h))).contiguous()
class Transformer:
def __init__(self, dim, n_heads, n_layers, norm_eps, vocab_size, max_seq_len=1024):
self.vocab_size = vocab_size
self.wte = Embedding(vocab_size, dim)
self.wpe = Embedding(max_seq_len, dim)
self.h = [TransformerBlock(dim, n_heads, norm_eps) for _ in range(n_layers)]
self.ln_f = LayerNorm(dim, norm_eps)
self.lm_head = Linear(dim, vocab_size, bias=False)
self.forward_jit = TinyJit(self.forward)
def forward(self, tokens:Union[Tensor,UOp], start_pos:Variable, temperature:float=0.0):
if not hasattr(self, 'allpos'): self.allpos = Tensor.arange(0, MAX_CONTEXT).reshape(1, -1).realize()
if isinstance(tokens, UOp):
seqlen = 1
tok_emb = self.wte.weight.shrink(((tokens, tokens+1), None))
else:
seqlen = tokens.shape[1]
tok_emb = self.wte(tokens)
# not symbolic when consuming the prompt
selected_pos = (0, seqlen) if start_pos.val == 0 else (start_pos, start_pos+1)
pos_emb = self.wpe(self.allpos.shrink((None, selected_pos)))
h = tok_emb + pos_emb
if HALF: h = h.half()
mask = Tensor.full((1, 1, seqlen, start_pos.val+seqlen), float("-inf"), dtype=h.dtype).triu(start_pos.val+1) if seqlen > 1 else None
for hi in self.h: h = hi(h, start_pos, mask)
logits = self.lm_head(self.ln_f(h))
if logits.shape[1] == 0:
# special case for empty prompt
logits = Tensor.ones((logits.shape[0], self.vocab_size), dtype=logits.dtype, device=logits.device)
else:
logits = logits[:, -1, :]
if temperature < 1e-6:
ret = logits.argmax(-1)
else:
ret = (logits / temperature).softmax().multinomial()
return ret.flatten().realize()
def __call__(self, tokens:Union[Tensor,UOp], start_pos:Variable, temperature:float=0.0) -> Tensor:
forward = (self.forward_jit if JIT and (isinstance(tokens, UOp) or tokens.shape[1] == 1) else self.forward)
return forward(tokens, start_pos, temperature)
---
- If you are familiar with functional programming, this is the mathematical equivalent of currying via closure.