Euclid Instructing His Pupils in The School of Athens, Raffaello Sanzio da Urbino 1509-1511.

You are viewing this on a mobile device, whereas SITP is best viewed on a desktop — the book includes various multimedia lecture videos, visualizers, any tufte-style sidenotes with many external hyperlinks to other resources.

I. Elements of Networks

Although separated by over 2000 years, the discipline of programming originally practiced in Silicon Valley shares the same dream of mathematics in Ancient Greece: using discrete and deterministic descriptions to model reality. Although mathematicians prove properties about mathematical objects with theorems whereas programmers evaluate the values of said objects with programs, what the two traditions share in common is the activity of modeling discrete objects with deterministic descriptions which in turn, are created for them by toolsmiths referred to as logicians and language implementors — the mathematician describes the world with a logic, and the programmer does so with a programming language.

However, for the programmer interested in learning the skills necessary for the statistical learning approach to artificial intelligence, they must make a transition to programming from discrete data usinand deterministic functions to that of continuous data $x$ and stochastic functions $f (x)$ . Although you will encounter many new concepts in your journey up ahead, there are also many similarities with the art of programming that you already know.

So in 1 Continuously Stochastic Data $x$ with numpy, we will first learn about programming data $x$ of software 2.0 by representing types continuously with coordinates, distributing truth stochastically with probability, and then recovering said distributions with maximum likelihood estimation. Then, in 1.2 Statistical Learning of Continuously Stochastic Functions with numpy, we will learn how to recover stochastic functions $y = f (x)$ by automatically evaluating their derivatives $f^{'} (x)$ with HIPS/autograd HIPS/autograd along with torch-autograd were the first numerical libraries to implement automatic differentiation in the ecosystems of productive programming languages like Python's numpy and Lua's torch, which ultimately inspired PyTorch. HIPS/autograd is used in part one of the book to emphasize the automatic differentiation abstraction was implemented before the heavy GPU acceleration PyTorch provided. For more information Soumith's PyTorch's design origins is an excellent read. and using it's direction of steepest descent to iteratively optimize the loss with stochastic gradient descent. From there, readers will unravel the mystery in how the internals of numpy and torch are implemented by building teenygrad from scratch. So in 1.3 Programming CPU Accelerated Basic Linear Algebra Subroutines with teenygrad.Tensor, teenygrad.Tensor is implemented in Python with strides that virtualize arbitrary n-dimensinal shapes to a 1-dimensional physical array. Then, the basic linear algebra operations on the Tensor are accelerated on the CPU by changing the host language from Python to Rust in order to translate programs into native machine code called assembly with the rustc compiler which enables us to analyze and tune the performance for the underlying processor's pipelines and memory hierarchiesBerkeley's opensource educational RISC-V cores sodor and BOOM cores along with Tenstorrent's RVV extension fork ocelot will be used to introduce instruction sets and microarchitecture, but x86 machines like Intel's Xeon and AMD's EPYC will be used in order to achieve server-class performance.

From there, part two of the SITP book takes you on the next step in the journey by increasing the expressivity of the models that are trained from generalized linear models to deep neural networks with torchIn part three, teenygrad will be updated once again for SITP's ultimate capstone project which modifies the eager mode interpreter implementation to a graph mode compiler in order to support the nanogpt and nanochat models. This final version of teenygrad will share 80% of the abstractions with the pure fusion compiler called tinygrad, providing a bridge to the frontier of deep learning systems.. and then evolving teenygrad's implementation to support the same neural network primitives, with GPU acceleration.

I. Table of Contents

WeThe following is a software2.0-adapted excerpt from The Structure and Interpretation of Computer Programs Chapter 1: Building Abstractions with Procedures: are about to study the idea of a computational process. Computational processes are abstract beings that inhabit computers. As they evolve, processes manipulate other abstract things called data. The evolution of a process is directed by a pattern of ~~rules called a program~~ probabilities called a model. People ~~create programs to direct processes~~ recover models to direct processes. In effect, we conjure the spirits of the computer with our spells.

A computational process is indeed much like a sorcerer's idea of a spirit. It cannot be seen or touched. It is not composed of matter at all. However, it is very real. It can perform intellectual work. It can answer questions. It can affect the world by disbursing money at a bank or by controlling a robot arm in a factory. The programs we use to conjure processes are like a sorcerer's spells. They are carefully composed from ~~symbolic~~ numerical expressions in arcane and esoteric programming languages that prescribe the tasks we want our processes to perform. A computational process, in a correctly working computer, ~~executes programs precisely and accurately~~ recovers models stochastically and noisily. Thus, like the sorcerer's apprentice, novice programmers must learn to understand and to anticipate the consequences of their conjuring. Even small errors (usually called bugs or glitches) in programs can have complex and unanticipated consequences.

1. Representing Data with Highly Dimensional Stochastic Distributions in `torch`

$↩$ Table of Contents

1.1 From Deductive Software 1.0 to Inductive Software 2.0

$↩$ Table of Contents

The first logician was a man born over 2000 years ago named Aristotle, as he discovered and created the formal structures and principles embedded within natural language that govern reasoning — referred to as a propositional logicAlthough there are many limitations with Aristotelian Logic serving as a sufficient basis for reasoning, proof assistants, and programming languages as we understand them today, many scholars trained in formal techniques have come to appreciate Aristotle with new respect given how far he went with so little to build from — when peak mathematics moved from Greece to the Middle East, Aristotle was known amongst the medeival Hindu-Arabic scholars as "The First Teacher"

Instead of trying to produce a programme to simulate the adult mind, why not rather try to produce one which simulates the child's? If this were then subjected to an appropriate course of education one would obtain the adult brain.

A computer program is said to learn from experience $E$ with respect to some class of tasks $T$ and performance measure $P$ , if its performance at tasks in $T$ , as measured by $P$ , improves with experience $E$ .

As programmers however, although we prefer to evaluate the values of objects rather than prove their properties, there turns out to be a lot in common between the two (in some sense, Aristotle was not only the first logician but also language implementor).

In this chapter 1 Statistical Inference of Continuously Stochastic Data with numpy we will cover continuously stochastic data $x$ and in the next chapter 1.2 Statistical Learning of Continuously Stochastic Functions with numpy, we will cover continuously stochastic functions $f (x)$ .. The discipline and profession of mathematics and programming are closely interconnected, and the transition from discretely deterministic descriptions of reality to continuously stochastic ones have already been made by our mathematical cousins a few hundred years ago during the Scientific Revolution.

In the next three chapters we will place our focus on (todo)

and in the next two, (todo)

in which you will be prepared to understand the final chapter:

1.7 Probability + Linear Algebra = High Dimensional Stochastic Distributions

1.2 Distributing Truth Stochastically with Probabilities

$↩$ Table of Contents

The first shift in mindset when transitioning from representing data in software 1.0 to that of software 2.0 is the stochastic nature of it.

That is, phenomena that is randomly determined. It wasn't until 1933 where Kolmogorov formalized the notion of probabilities with the following set-theoretic construction:

the set of all possible states (outcomes) of the experiment is the sample space $Ω$
the set of all possible subsets of the sample space is the event space $F \subseteq Ω$
a measure $P : F \to [0, 1]$ which maps all events to some real between $0$ and $1$

Together this triplet of two sets and one function $(Ω, F, P)$ defines a probability space. Focusing on the map $P$ , although it formalizes the notions of "chance", "likeliness", and "probability" as a numerical assignment between $0$ and $1$ on every event in the event space, there are two possible semantic interpretations of the value returned by this function when modeling real-world stochastic phenomena — namely that of the so-called frequentist and bayesian interpretation of probability.

The former interprets probabilities (the values returned by the probability measure) as an "objective" belief where the value returned is the count of the event in which an infinite sequence of repeated trials converges to. That is, $P (E) = n \to \infty lim \frac{count ( E )}{n}$

so for instance, the most trivial example is an experiment of rolling a single fair coin with a probability assignment of $P ({head}) = 0.5$ . The experiment can be repeated many times, and the frequency between the count of heads and the number of trials will converge to $0.5$ . As programmers, running a simulation of such experiment is quite simple:

import random

sample_space = {"🧑", "🤖"} # human head, robot tail
def experiment():
  outcome = random.choice(list(sample_space))
  return outcome

if __name__ == "__main__":
  n = 10_000_000
  target_e, target_e_count = {"🧑"}, 0
  for i in range(n):
    outcome = experiment()
    if outcome in target_e: target_e_count += 1

  prob_e = len(target_e) / len(sample_space)
  freq_e = target_e_count / n
  print(f"the expected probability of {target_e} is {prob_e}")
  print(f"the actual frequency of {target_e} in {n} trials is {freq_e}")

Although running the simulation results in slightly different numbers, they are all close to $0.5$ within a few precision points. The probability assignment of $0.5$ is something that can be "objectively" verified. To contrast, bayesian interpretation of probabilities are "subjective" beliefs where the value returned represents the internal uncertainty of the observer. For instance, consider an unfair die (todo..)

It's important to remember that neither interpretation is "the" correct one, and often times the availability of data implies which one to use. todo..returning to functional axiomaticization...probability is the weight of a set.

1.8 Recovering Probabilities from Data with Parameter Estimation via Maximum Likelihood

$↩$ Table of Contents

2. Learning Functions from Data with Optimization in `torch`

$↩$ Table of Contents

In chapter 1, you've initiated your transition from deductive software 1.0 to inductive software 2.0 by:

using random variables instead of deterministic variables by distributing the truth of types onto a weighted set of events and representing such types in high dimensional coordinate systems
recovering programs from data rather than specifying programs with instructions by estimating the parameters of distributions with maximum likelihood and maximum a posteriori estimation

And so now, the formulation of a computer program learning from experience $E$ with respect to some class of tasks $T$ and performance measure $P$ starts to make more sense with respect to learning distributions from data. In chapter 2, we extend the notion of experience $E$ and tasks $T$ to not just learn distributions $X$ from data, but that of learning functions $f : X \to Y$ . So instead of recover a single random variable $X$ from data like in chapter 1, we will recover two random variables $X$ and $Y$ to construct linear functions $f : X \to Y$ (todo, maybe rephrase. depending on how the link between ERM of squared error and MLE is presented).

More formally, given observations of input-outputWe stick to the domain neutral terminology of inputs and outputs. However the input-output pairs $(x^{(i)}, y^{(i)})$ are also referred to as predictors/response and features/labels pairs comprising some dataset $D = {(x^{(i)}, y^{(i)})}_{i = 1}^{n}$ , with function approximation, learn some mapping from input space to output space $f : X \to Y$ , and evaluate such learned mapping $\overset{y}{^} = f (x^{n + 1})$ in order to predict new outputs given unseen input well, referred to as generalization. When historical data of the expected output $y^{(i)}$ given some input $x^{(i)}$ is available, we refer to this regime as supervised learning, for the learning algorithm is being "supervised" with the correct answerAs opposed to unsupervised learning which we'll cover in part II.

However, roughly speakingOrdinal data can also exist in addition to numerical and categorical data(todo), there are two settings of supervised learning depending on the type of the output space $Y$ that the learned function $f$ must map to (with the input space $X$ usually being the vector space of $R^{d}$ ). If the phenomena being modeled admits quantitative outputs, then the function that needs to be learned takes the form $f : R^{d} \to R$ and is referred to as regressionThe origin of the name is instructive (Stigler, 1986). It comes from 19th century investigations into the relationship between the attributes of parents and their children. People who are taller (heavier, faster,...) than average tend to have children who are also taller than average, but not quite as tall., whereas if it admits qualitative outputs, then the function that needs to be learned takes the form $f : R^{d} \to {1, \dots, k}$ and is referred to as classification.

In the next two chapters of Chapter 2. Learning Functions From Data with pytorch will implement linear models for both cases of regression and classification with torch, and Chapter 3. Accelerating Basic Linear Algebra Subroutines with teenygrad.Tensor will re-implement them with our own deep learning framework teenygrad.

2.1 Recovering Probabilities from Data with Parameter Estimation via Maximum Likelihood

$↩$ Table of Contents

2.2 Predicting Scalars by Linearly Modeling Regression

$↩$ Table of Contents

Similarly to learning distributions $X$ from data, learning functions $f : X \to Y$ from data roughly follows the three step process of selecting a model for the task $T$ , a performance measure $P$ , and optimizing such measure on experience $E$ . Let's take a look at the linearly modeling regression with squared error loss, a problem which straddles tools from all three languages of probability theory, linear algebra, and calculus.

For the task of house price prediction, the input space $X = R^{d}$ is the size in square feet, and the output space $Y = R$ is the price, meaning that the function which needs to be recovered has the type of $f : R^{d} \to R$ . An assumption about the structure of the data needs to be made, referred to as the inductive bias. The simplest assumption to make is to assume that there exists some linear relationship between the data $x$ and parameters $w$ so that $f$ ends up being modeled as an inner product plus bias, parameterized as $f (x; w, b)$ : $f f (x; w, b) : R^{d} \to R : = (w_{1} x_{1} + w_{2} x_{2} + \dots + w_{n} x_{n}) + b = (i = 1 \sum d w_{i} x_{i}) + b = w^{⊤} x + b$

One small adjustment we can make to $f$ is to fold the final bias term $b$ into the vector $w$ as $w_{1}$ , increasing the total dimensionality so that $w \in R^{d + 1}$ , and adjusting $x$ accordingly so that $x_{1} := 1$ and $x \in R^{d + 1}$ . Then, the equation of our line with bias is simply parameterized as $f (x; w)$ $f f (x; w) : R^{d + 1} \to R : = w^{⊤} x$

Semantically speaking, each property of the input house $x_{i} \in R^{d}$ is weighted by some weight $w_{i}$ , adjusted accordingly during the learning process so that it accurately reflects the property's influence on the output price $y \in R$ when evaluating a prediction $\overset{y}{^}$ . We can evaluate $f$ on all $x^{(i)} \in D$ with a randomly intialized $w$ to ensure we've wired everything up correctly.

However, while specifying the equation of the linear regression model on ~~paper~~ screen and subsequently evaluating the computation manually might have been sufficient for pre-computational mathematicians such as Gauss and Legendre, the discipline of statistical learning is entirely predicated on computational methods. Let's continue to use PyTorch's core n-dimensional array datastructure with torch.Tensor, this time modeling a function $f : X \to Y$ rather than some distribution $X$ :

(todo: maybe change example to closure following torch.nn)

import torch
def f(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
  return torch.dot(x, w)

in which we can update the names of our torch.Tensor variables with the ranks of their vector spacesReferred to as Shape Suffixes, described by Noam Shazeer, an engineer from Google to clarify what dimensions these operations are evaluated on:

import torch
def f_shapesuffixed(x_D: torch.Tensor, w_D: torch.Tensor) -> torch.Tensor:
  return torch.dot(x_D, w_D)

and finally, actually enforce them via runtime typechecking with the jaxtyping package (try swapping the return statements below to trigger a type error with the return type):

import torch
from jaxtyping import Float, jaxtyped
from beartype import beartype

@jaxtyped(typechecker=beartype)
def f_typechecked(
  x_D: Float[torch.Tensor, "D"],
  w_D: Float[torch.Tensor, "D"]
) -> Float[torch.Tensor, ""]:
  # return X_ND
  return torch.dot(x_D, w_D)

Let's now sanity-check that f_typechecked is wired up correctly by evaluating it on all $x^{(i)} \in D$ with random parameter w_D. (todo: use actual housing data)

import torch
if __name__ == "__main__":
  X_ND = torch.tensor([
    [1500, 10, 3, 0.8],  # house 1
    [2100, 2,  4, 0.9],  # house 2
    [800,  50, 2, 0.3]   # house 3
  ], dtype=torch.float32)
  Y_N = torch.tensor([500000, 800000, 250000], dtype=torch.float32)

  w_D = torch.randn(4); print(f"random weight vector: {w_D}")
  for (xi_D,yi_1) in zip(X_ND,Y_N):
    yihat_1 = f_typechecked(xi_D,w_D)
    print(f"expected: ${yi_1}, actual: ${yihat_1:.2f}")

random weight vector: tensor([-0.3343,  0.0768,  2.7828, -0.1331])
expected: $500000.0, actual: $-492.46
expected: $800000.0, actual: $-690.89
expected: $250000.0, actual: $-258.08

Clearly, these outputs are unintelligible, and will only become intelligible with a "good" choice of parameter $w$ . Before we find such a parameter, let's make one more modification to our function $f$ to increase it's performance. Rather then evaluating f_typechecked on every input xi_D in the dataset zip(X_ND,Y_N) sequentially with a loop, we can evaluate all outputs with a single matrix-vector multiplication by updating $f$ 's function definition to the following

$f_{batched} f_{batched} (X; w) : R^{n \times d} \to R^{n} : = Xw = x_{1}^{⊤} w x_{2}^{⊤} w ⋮ x_{n}^{⊤} w$

and the corresponding PyTorch updated to

import torch
@jaxtyped(typechecker=beartype)
def f_batched(
  X_ND: Float[torch.Tensor, "N D"],
  w_D: Float[torch.Tensor, "D"]
) -> Float[torch.Tensor, ""]:
  return X_ND@w_D

Now that we've selected our model for the task $T$ , we can proceed with selecting our performance measure $P$ which the learner will improve on with experience $E$ .

2.3 Fitting Linear Regression by Directly Optimizing Squared Error Loss

$↩$ Table of Contents

After the inductive bias on the family of functions has been made, the learning algorithm must find the function $\hat{f}$ with a good fit. Since artificial learning algorithms don't have visual cortex like biological humans[], the notion of "good fit" needs to defined in a systematic fashion. This is done by selecting the parameter $θ \in R^{m}$ which maximizes the likelihood of the data $p (d; θ)$ . Returning to the linear regression inductive bias we've selected to model the house price data, we assume there exists noise $ϵ^{(i)}$ in both our model (epistemic uncertainty) and data (aleatoric uncertainty), so that $y^{(i)} = θ^{⊤} x^{(i)} + ϵ^{i}$ where $ϵ^{(i)} \sim N (μ, σ^{2})$

prices $y^{i}$ are normally distributed conditioned on seeing the features $x^{i}$ with the mean being the equation of the line $θ^{⊤} x^{(i)}$ where $y^{(i)} ∣ x^{(i)} \sim N (μ = θ^{⊤} x^{(i)}, σ^{2})$ , then we have that

$p (y^{(i)} ∣ x^{(i)}; θ) = N (θ^{⊤} x^{(i)}, σ^{2}) = \frac{1}{2 π σ ^{2}} exp (- \frac{( y ^{(i)} - θ ^{⊤} x ^{(i)} ) ^{2}}{2 σ ^{2}})$ TODO: generate prose/exposition by repaging lecture/text in ram

Returning to the linear regression model, we can solve this optimization with a direct method using normal equations. QR factorization, or SVD.

def fhatbatched(X_n: np.ndarray, m: float, b: float) -> np.ndarray: return X_n*m+b

if __name__ == "__main__":
  X, Y = np.array([1500, 2100, 800]), np.array([500000, 800000, 250000]) #  data

  X_b = np.column_stack((np.ones_like(X), X))              # [1, x]
  bhat, mhat = np.linalg.solve(X_b.T @ X_b, X_b.T @ Y)   # w = [b, m]

  yhats = fhatbatched(X, mhat, bhat) # yhat
  for y, yhat in zip(Y, yhats):
    print(f"expected: ${y:.0f}, actual: ${yhat:.2f}")

To summarize, we have selected and computed

an inductive bias with the family of linear functions $f (X; θ) : = X @ θ$
an inductive principle with the least squared loss $L (θ) : = \sum_{i = 0}^{n} (y^{(i)} - X @ θ)^{2}$
the parameters which minimze the empirical risk, denoted as $(\hat{θ}) = arg min L (θ)$

Together, the inductive bias describes the relationship between the input and output spaces, the inductive principle is the loss function that measures prediction accuracy, and the minimization of the empirical risk finds the parameters for the best predictor.

2.4 Predicting Categories by Linearly Modeling Classification

logistic regression, cross entropy loss, gradient descent

2.5 Fitting Logistic Regression by Iteratively Optimizing Cross Entropy Loss

2.6

linearity depends on that the scalars are in a field (addition is associative) "scalars" in a field --> we are dealing with floating point representations "vectors" we saw coordinate systems in 1.4 "matrices" and we saw linear transforms in 2.1 mathematically, a vector is a coordinate in some basis of a vector space a matrix is a linear function, the linear, in numerical linear algebra, is aspirational.

3. Accelerating Functions and Data with Basic Linear Algebra Subroutines in `teenygrad`

$↩$ Table of Contents

3.1 The Simplest `Tensor` Implementation with Nested `array`s

Now that we have an operational understanding of how multidimensional arrays are used in practice, let's now implement our very own np.array, which will serve as the foundation to the book's capstone project — the deep learning framework teenygrad.

Let's start with the simplest implementation idea for teenygrad.Tensor, which is a nested array implementation with Python's built-in list data structure.

For some $x \in R$ we will use a float, for $x \in R^{n}$ , we will use a list[float], and for arbitary $x \in R^{d_{0} \times \dots \times d_{n}}$ we will use a list[list[float]]. In fact, this is simply what we started doing in the previous chapter when implementing our linear regression model before switching to numpy's accelerated nd.array.

3.2 Virtualizing ND Logical Shapes to 1D Physical Storages with Strides

from typing import Self
import array, math
import teenygrad

class InterpretedTensor:
  @classmethod
  def arange(cls, end: int, requires_grad: bool=False) -> Self: return InterpretedTensor((end,), list(range(end)), requires_grad=requires_grad)
  @classmethod
  def zeros(cls, shape: tuple[int, ...]) -> Self:
    numel = math.prod(shape)
    tensor = InterpretedTensor((numel,), [0.0]*numel).reshape(shape)
    return tensor
  @classmethod
  def ones(cls, shape: tuple[int, ...]) -> Self:
    numel = math.prod(shape)
    tensor = InterpretedTensor((numel,), [1.0]*numel).reshape(shape)
    return tensor
  
  def __init__(self, shape: tuple[int, ...], storage: list[float], inputs: tuple[Self, ...]=(), requires_grad: bool=False) -> None:
    self.shape: tuple[int, ...] = shape
    self.stride: tuple[int, ...] = [math.prod(shape[i+1:]) for i in range(len(shape))] # row major, and math.prod([]) produces 1
    self.storage: list[float] = storage

    self.inputs: tuple[Self, ...] = inputs
    self._backward = lambda: None # callers override after init with self captured in closure
    self.grad: InterpretedTensor = InterpretedTensor.zeros(shape) if requires_grad else None # python can recursively type (no need for Box<_>) bc everything is a heap-allocated reference
  @property
  def numel(self): return math.prod(self.shape) # np (and thus jax) call this .size
  @property
  def ndim(self): return len(self.shape)
  @property
  def T(self) -> Self:
    assert self.ndim == 2
    m, n = self.shape
    t = InterpretedTensor((n, m), self.storage)
    t.stride = [self.stride[1], self.stride[0]]
    return t

  def reshape(self, shape: tuple[int, ...]) -> Self:
    self.shape = shape
    self.stride = [math.prod(shape[i+1:]) for i in range(len(shape))] # math.prod([]) produces 1
    return self
  
  def __repr__(self) -> str:
    return f"InterpretedTensor({self.chunk(self.storage, self.shape)})"
  @staticmethod
  def chunk(flat, shape):
    if len(shape) == 1: return flat[:shape[0]]
    size = len(flat) // shape[0]
    return [InterpretedTensor.chunk(flat[i*size:(i+1)*size], shape[1:]) for i in range(shape[0])]
  
  # backward f'(x)
  @staticmethod
  def topo(node: InterpretedTensor, seen: set[InterpretedTensor], output: list[InterpretedTensor]) -> None:
    if node in seen: return
    seen.add(node)
    for input in node.inputs: InterpretedTensor.topo(input, seen, output)
    output.append(node)

  def backward(self) -> None:
    seen, topologically_sorted_expression_graph = set(), []
    InterpretedTensor.topo(self, seen, topologically_sorted_expression_graph)

    self.grad = InterpretedTensor.ones(self.shape) # base case
    for tensor in reversed(topologically_sorted_expression_graph): tensor._backward()
  
  # forwards f(x)
  def __radd__(self, other: Self) -> Self: return self.__add__(other)
  def __add__(self, other: Self) -> Self:
    n, alpha = self.numel, 1
    x, y, z = array.array('f', self.storage), array.array('f', other.storage), array.array('f', [0.0]*(n))
    teenygrad.rs.cpu.saxpy(n, alpha, x, y) # y=axpy
    requires_grad = self.grad is not None or other.grad is not None
    output_tensor = InterpretedTensor(self.shape, list(y), (self, other), requires_grad=requires_grad)
    def _backward():
      self.grad += output_tensor.grad
      other.grad += output_tensor.grad
    output_tensor._backward = _backward
    return output_tensor

  def __rmul__(self, other: Self) -> Self: return  self.__mul__(other)
  def __mul__(self, other: Self) -> Self:
    n = self.numel
    x, y, z = array.array('f', self.storage), array.array('f', other.storage), array.array('f', [0.0]*n)
    teenygrad.rs.cpu.smul(n, x, y, z)
    requires_grad = self.grad is not None or other.grad is not None
    output_tensor = InterpretedTensor(self.shape, list(z), (self, other), requires_grad=requires_grad)
    def _backward():
      self.grad += output_tensor.grad * other
      other.grad += output_tensor.grad * self
    output_tensor._backward = _backward
    return output_tensor

  def __neg__(self) -> Self:
    n = self.numel
    x, y = array.array('f', self.storage), array.array('f', [0.0]*n)
    teenygrad.rs.cpu.saxpy(n, -1, x, y)
    requires_grad = self.grad is not None
    output_tensor = InterpretedTensor(self.shape, list(y), (self,), requires_grad=requires_grad)
    def _backward():
      self.grad += -output_tensor.grad
    output_tensor._backward = _backward
    return output_tensor

  def __sub__(self, other: Self) -> Self:
    n = self.numel
    x, y = array.array('f', other.storage), array.array('f', self.storage)
    teenygrad.rs.cpu.saxpy(n, -1, x, y)
    requires_grad = self.grad is not None or other.grad is not None
    output_tensor = InterpretedTensor(self.shape, list(y), (self, other), requires_grad=requires_grad)
    def _backward():
      self.grad += output_tensor.grad
      other.grad += -output_tensor.grad
    output_tensor._backward = _backward
    return output_tensor

  def tanh(self) -> Self:
    n = self.numel
    x, y = array.array('f', self.storage), array.array('f', [0.0]*n)
    teenygrad.rs.cpu.stanh(n, x, y)
    requires_grad = self.grad is not None
    output_tensor = InterpretedTensor(self.shape, list(y), (self,), requires_grad=requires_grad)
    def _backward():
      self.grad += output_tensor.grad * (InterpretedTensor.ones(self.shape) - output_tensor * output_tensor) # f(x) = tanh(x) ==> f'(x) = 1 - tanh(x)^2
    output_tensor._backward = _backward
    return output_tensor

  def __rmatmul__(self, other: Self) -> Self: return other.__matmul__(self) # GEMM does not commute: AB != BA
  def __matmul__(self, other: Self) -> Self:

3.3 Implementing Basic Linear Algebra Subroutines in Python: `AXPY`, `GEMV` and `GEMM`

3.4 Accelerating the `BLAS` with Processor Pipelines and Memory Hierarchies

/// SGEMM with the classic BLAS signature (row-major, no transposes):
/// C = alpha * A * B + beta * C
fn sgemm(
  m: usize, n: usize, k: usize,
  alpha: f32, a: &[f32], lda: usize,
  b: &[f32], ldb: usize,
  beta: f32, c: &mut [f32], ldc: usize) {
  assert!(m > 0 && n > 0 && k > 0, "mat dims must be non-zero");
  assert!(lda >= k && a.len() >= m * lda);
  assert!(ldb >= n && b.len() >= k * ldb);
  assert!(ldc >= n && c.len() >= m * ldc);

  for i in 0..m {
    for j in 0..n {
      let mut acc = 0.0f32;
      for p in 0..k { acc += a[i * lda + p] * b[p * ldb + j]; }
      let idx = i * ldc + j;
      c[idx] = alpha * acc + beta * c[idx];
    }
  }
}

fn main() {
  use std::time::Instant;

  for &n in &[16usize, 32, 64, 128, 256] {
    let (m, k) = (n, n);
    let (a, b, mut c) = (vec![1.0f32; m * k], vec![1.0f32; k * n], vec![0.0f32; m * n]);

    let t0 = Instant::now();
    sgemm(m, n, k, 1.0, &a, k, &b, n, 0.0, &mut c, n);
    let secs = t0.elapsed().as_secs_f64().max(std::f64::MIN_POSITIVE);
    let gflop = 2.0 * (m as f64) * (n as f64) * (k as f64) / 1e9;
    let gflops = gflop / secs;

    println!("m=n=k={n:4} | {:7.3} ms | {:6.2} GFLOP/s", secs * 1e3, gflops);
  }
}

*AMD Zen5 Core Die Shot by Fritzchens Fritz (Hover for annotation by Nemez)*

Thus far in our journey we've successfully made the transition from programming software 1.0 which involves specifying of discrete data structures — like sets, associations, lists, trees, graphs, — with the determinism of types to programming software 2.0 which involves reovering continous data structures — like vectors, matrices, and tensors — with the stochasticity of probabilities. If you showed the mathematical equations for your models, loss functions, and optimizers to our mathematical cousinsIn particular, the physicists whom realized describing reality as minimizing free-energy was fruitful, such as Helmholtz with energy, Gibbs with enthalpy, and Boltzmann with entropy. who also made the same transition, they would understand the mathematical equations and even algorithmsas until recently, algorithms were understood to be a sequence of instructions to be carried out by a human. i.e Egyptian Multiplication, Euclid's Greatest Common Divisor. But they wouldn't understand how the code we've programmed thus far with Python and Numpy is being automatically calculated by the mechanical machine we call computers. For that we need to transition from users of the numpy framework to implementors our own framework, which we'll call teenygrad. This requires us moving from the beautiful abstraction of mathematics to the assembly heart and soul of our machines. This book uses RustTo backtrack to familiarize yourself with Rust, take a look at the experimental version of The Rust Programming Language, by from Will Crichton and Shriram Krishnamurthi, originally written by Steve Klabnik, Carol Nichols, and Chris Krycho., but feel free to follow along with C/C++In that case, take a look at "The C Programming Language by Brian Kernighan and Dennis Ritchie. — what's fundamental to the purpose of accelerating said basic linear algebra subroutines on the CPU is is choosing a language that 1. forgoes the dynamic memory management overhead of garbage collection and 2. has a compiled implementation which allows us to analyze and tune the actual instructions which an underlying processor executes. Using Rust will allow us to start uncovering what is really happening under the hood of accelerated Tensors To sidetrack to familiarize yourself with systems programming in the context of software 1.0 , read the classic of Computer Systems A Programmer's Perspectivewith instruction set architectures, microarchitecture, memory hierarchies

instruction set architecture
computation: control path, data path
communication: memory, input/output

In the last chapter we developed teenygrad.Tensor by virtualizing physical 1D storage buffers into arbitrarily-shaped logical ND arrays by specifying an iteration space with strides, and started our journey in accelerating the basic linear algebra subroutines defined on the Tensor with Rust, which speeds up the throughput performance of SGEMM from 10MFLOP/S to 1GFLOP/S. Simply executing a processor's native code — referred to as assembly code — rather than Python's bytecode results in an increase of two orders of magnitude.

Now that the code that is being executed is native assembly, we can dive deeper into the architecture of the machine in order to reason and improve the performance of teenygrad's BLAS.

SITP

I. Elements of Networks

I. Table of Contents

1. Representing Data with Highly Dimensional Stochastic Distributions in `torch`

1.1 From Deductive Software 1.0 to Inductive Software 2.0

1.2 Distributing Truth Stochastically with Probabilities

1.3 The Algebra of Probabilities: Sum Rule, Product Rule, and Bayes Rule

1.4 Random Variables and their Distributions, Expectations, and Variance

1.5 Representing Types with Coordinates of Higher Dimensionality

1.6 The Linear Algebra of Coordinates: Matrices

1.7 Probability + Linear Algebra = High Dimensional Stochastic Distributions

1.8 Recovering Probabilities from Data with Parameter Estimation via Maximum Likelihood

2. Learning Functions from Data with Optimization in `torch`

2.1 Recovering Probabilities from Data with Parameter Estimation via Maximum Likelihood

2.2 Predicting Scalars by Linearly Modeling Regression

2.3 Fitting Linear Regression by Directly Optimizing Squared Error Loss

2.4 Predicting Categories by Linearly Modeling Classification

2.5 Fitting Logistic Regression by Iteratively Optimizing Cross Entropy Loss

2.6

3. Accelerating Functions and Data with Basic Linear Algebra Subroutines in `teenygrad`

3.1 The Simplest `Tensor` Implementation with Nested `array`s

3.2 Virtualizing ND Logical Shapes to 1D Physical Storages with Strides

3.3 Implementing Basic Linear Algebra Subroutines in Python: `AXPY`, `GEMV` and `GEMM`

3.4 Accelerating the `BLAS` with Processor Pipelines and Memory Hierarchies

3.5 Accelerating Computation with Instruction Level Parallelism via Loop Unrolling

3.6 Accelerating Computation with Data Level Parallelism via Vector and Matrix Extensions

3.7 Accelerating Communication with Memory Hierarchies via Register and Cache Blocking

3.8 Accelerating Computation with Thread Level Parallelism

Keyboard shortcuts

SITP