Stanford CS336: Language Modeling from Scratch

Author

Yuyang Zhang


Lecture Website: Link

Lecture Video: YouTube


This is a collection of notes and code for the Stanford CS336: Language Modeling from Scratch course. The course teaches students how to build a “large” language model (LLM) from scratch, covering the full stack from the bottom up.

The course is designed to provide a comprehensive understanding of how LLMs work and how to build them from scratch.

There are 17 lectures and 5 assignments in total in this course.

Warning

It might take 200+ hours to finish the course, including the lectures and assignments. But hang in there, it will be worth it!!!

Lectures

Lecture 01: Overview & Tokenization

Summary of Lecture 01

Byte Pair Encoding (BPE) is a simple yet effective tokenization algorithm that adaptively merges frequent character pairs to create a compact vocabulary, balancing compression efficiency and reversibility for language modeling.

Tokenization

Tokenization is the process of converting raw text into a sequence of tokens (usually represented by integer indices), which are the basic units of meaning in a language model.

There are different levels of tokenization, such as:

  • Character-level: Each character is treated as a separate token.
  • Word-level: Each word is treated as a separate token.
  • Subword-level: Words are split into smaller units, such as prefixes, suffixes, or common subwords, to handle rare or out-of-vocabulary words more effectively.
  • Byte-level: Each byte of the text is treated as a separate token, which can handle any character in the Unicode standard.
Figure 1: An example of tokenization; an interactive demo of the tokenization process can be found here

Before diving into the details of each tokenization algorithm, let’s first define a framework for tokenization in Python code: a Tokenizer is a class that implements the encode and decode methods:

  • encode(string) method takes a string of text and converts it into a sequence of integer indices, which are the tokens.
  • decode(indices) method takes a sequence of integer indices and converts it back into a string of text.
class Tokenizer:
    def __init__(self, config):
        self.config = config  # Tokenizer-specific configuration (e.g. vocabulary, merge rules)

    def encode(self, string: str) -> list[int]:
        pass

    def decode(self, indices: list[int]) -> str:
        pass

The Tokenizer generates a list of tokens from the input text; the number of distinct tokens it can produce is called the vocabulary size.

One useful quantity for measuring a tokenization algorithm is the compression ratio, defined as the ratio of the number of bytes of the input text to the number of tokens. A higher compression ratio indicates that the tokenization algorithm represents the input text more efficiently. The compression ratio can be calculated as:

def get_compression_ratio(string: str, indices: list[int]) -> float:
    # Calculate the compression ratio of the tokenization algorithm
    return len(bytes(string, 'utf-8')) / len(indices)

Now, let’s discuss those four tokenization algorithms in detail:

Character-Based Tokenization

A Unicode string is a sequence of Unicode code points, which can be represented as a sequence of integers. Each character in the string is treated as a separate token, for example:

ord('a')   # 97 
chr(97)   # 'a'

ord('🌍') # 127757
chr(127757) # '🌍'

We can implement a simple character-based tokenizer as follows:

class CharacterTokenizer(Tokenizer):
    def encode(self, string: str) -> list[int]:
        return [ord(c) for c in string]
    def decode(self, indices: list[int]) -> str:
        return ''.join(chr(i) for i in indices)

One of the main drawbacks of character-based tokenization is that it can lead to a large vocabulary size (there are approximately 150K Unicode characters), which can make the model more difficult to train and less efficient in terms of memory usage. Moreover, some characters are rarely used in text (e.g. ‘🌍’), which is an inefficient use of the vocabulary.

Byte-Based Tokenization

Unicode strings can be encoded as a sequence of bytes using UTF-8 encoding. Each byte can be represented by integers in the range of 0 to 255. Some Unicode characters are represented by one byte, others take multiple bytes:

bytes('a', 'utf-8')  # b'a'
bytes('🌍', 'utf-8')  # b'\xf0\x9f\x8c\x8d'

We can implement a simple byte-based tokenizer as follows:

class ByteTokenizer(Tokenizer):
    def encode(self, string: str) -> list[int]:
        return list(bytes(string, 'utf-8'))
    def decode(self, indices: list[int]) -> str:
        return bytes(indices).decode('utf-8')

As we might expect, the compression ratio is 1, which is terrible.
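As a quick check using the get_compression_ratio helper defined earlier (the sample string is arbitrary):

tokenizer = ByteTokenizer(config=None)
string = "Hello, 🌍!"
indices = tokenizer.encode(string)
get_compression_ratio(string, indices)  # 1.0: one token per byte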

The vocabulary size is 256, which is a nice property because it is fixed and covers all possible bytes, which means we will not get out-of-vocabulary (OOV) tokens. However, the main drawback is that it can lead to long sequences, since each character (even multi-byte Unicode characters) is split into multiple tokens. This can make training less efficient and increase the computational cost for language models, given that the context length of a Transformer model is limited.

Word-Based Tokenization

Another common approach is to tokenize the text into words, where each word is treated as a separate token. We can use regular expressions to split the text into words.

For example, the GPT-2 pre-tokenization regular expression is:

GPT2_TOKENIZER_REGEX = r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
segments = regex.findall(GPT2_TOKENIZER_REGEX, string)  # Requires the third-party `regex` package, since `re` does not support \p{...}
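On an arbitrary sample string, this produces word-like segments that keep their leading space, something like:

import regex

string = "Hello, world! It's 2024."
regex.findall(GPT2_TOKENIZER_REGEX, string)
# ['Hello', ',', ' world', '!', ' It', "'s", ' 2024', '.']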

After we get the segments, we can assign each segment a unique integer index, which is its token. However, there are several problems with word-level tokenization:

  • The number of words is huge.
  • Many words are rare and the model won’t learn much about them.
  • The vocabulary size is not fixed, which can lead to out-of-vocabulary (OOV) tokens. New words we haven’t seen during training will be treated as UNK tokens, which can lead to a loss of information and context.

Byte Pair Encoding (BPE)

The basic idea of BPE is to iteratively merge the most frequent pairs of characters or subwords in the text until a desired vocabulary size is reached. This allows for a more efficient representation of the text, as it can capture common subwords and reduce the overall vocabulary size. The intuition behind BPE is that common sequences of characters are represented by a single token, rare sequences are represented by many tokens.

The GPT-2 paper used word-based tokenization to break up the text into initial segments and run the original BPE algorithm on each segment.

Here are the steps to implement BPE:

  1. Initialize: Start with a base vocabulary, typically the 256 byte values 0–255.
  2. Count Pairs: Count all adjacent character pairs in the text.
  3. Merge: Find the most frequent pair and merge it into a new token.
  4. Update: Replace all occurrences of the merged pair in the text with the new token.
  5. Repeat: Repeat steps 2-4 until the desired vocabulary size is reached.
from dataclasses import dataclass

@dataclass(frozen=True)
class BPETokenizerParams:
    vocab: dict[int, bytes]   # Mapping of token indices to byte sequences
    merges: dict[tuple[int, int], int]  # Mapping of merged pairs to their new token index

def merge(indices: list[int], pair: tuple[int, int], new_index: int) -> list[int]:
    new_indices = []
    i = 0
    while i < len(indices):
        if i + 1 < len(indices) and indices[i] == pair[0] and indices[i + 1] == pair[1]:
            new_indices.append(new_index)  # Replace the pair with the new index
            i += 2  # Skip the next index since it's part of the pair
        else:
            new_indices.append(indices[i])  # Keep the current index
            i += 1  # Move to the next index
    return new_indices

class BPETokenizer(Tokenizer):
    def __init__(self, params: BPETokenizerParams):
        self.params = params 
    
    def encode(self, string: str) -> list[int]:
        indices = list(map(int, string.encode('utf-8'))) # Convert string to byte indices
        for pair, new_index in self.params.merges.items():
            indices = merge(indices, pair, new_index)  
        return indices
    
    def decode(self, indices: list[int]) -> str: 
        byte_list = list(map(self.params.vocab.get, indices))  # Convert indices to bytes
        string = b''.join(byte_list)  # Join bytes into a single byte string
        return string.decode('utf-8')  # Decode byte string to UTF-8

To train a BPE tokenizer, we need to count the frequency of adjacent pairs in the text and merge them iteratively. The BPETokenizerParams class holds the vocabulary and merge rules, which can be generated from a training corpus.

from collections import defaultdict

def train_bpe_tokenizer(string: str, num_merges: int) -> BPETokenizerParams:
    # Initialize vocabulary with the 256 single-byte tokens
    vocab: dict[int, bytes] = {x: bytes([x]) for x in range(256)}  # index -> bytes
    merges: dict[tuple[int, int], int] = {}  # (index1, index2) -> merged index

    # Convert string to byte indices
    indices = list(map(int, string.encode('utf-8')))

    for i in range(num_merges):
        # Count adjacent pairs
        counts = defaultdict(int)
        for index1, index2 in zip(indices[:-1], indices[1:]):
            counts[(index1, index2)] += 1

        # Find the most frequent pair
        most_frequent_pair = max(counts, key=counts.get)

        # Create a new token for the merged pair
        new_index = 256 + i
        merges[most_frequent_pair] = new_index
        vocab[new_index] = vocab[most_frequent_pair[0]] + vocab[most_frequent_pair[1]]  # Merge the byte sequences

        # Merge the pair in the indices
        indices = merge(indices, most_frequent_pair, new_index)

    return BPETokenizerParams(vocab=vocab, merges=merges)

We can save the BPETokenizerParams to a file and load it later to use the tokenizer. The BPE algorithm is efficient in terms of compression ratio, as it can adaptively merge frequent subwords to form a compact vocabulary, while still being reversible.
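As a quick sanity check (on an arbitrary toy string), we can train a few merges and verify that encoding followed by decoding recovers the original text:

params = train_bpe_tokenizer("the cat in the hat sat on the mat", num_merges=5)
tokenizer = BPETokenizer(params)
indices = tokenizer.encode("the cat sat")
assert tokenizer.decode(indices) == "the cat sat"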

Note

This implementation of BPE is a simplified version, and there are many optimizations and variations that can be applied to improve the efficiency and performance of the algorithm. For example, we can use a priority queue to efficiently find the most frequent pair. And we can also use more advanced techniques to speed up the merging process, such as caching and batching.

Further Reading

For those who are more interested in BPE, I highly recommend this video by Andrej Karpathy, which provides a detailed explanation of the BPE algorithm.

Summary of Lecture 01

In this lecture, we learned about tokenization, the process of converting raw text into a sequence of tokens. We discussed four different tokenization algorithms: character-based, byte-based, word-based, and byte pair encoding (BPE). Each algorithm has its own advantages and disadvantages. We also learned how to measure the efficiency of a tokenization algorithm using the compression ratio. We will implement a more efficient version of the BPE algorithm in Assignment 01, which will be used to build a transformer model from scratch.

Below is a summary of the four tokenization algorithms:

| Tokenization Type | Description | Compression Ratio | Vocabulary Size | Pros | Cons |
|---|---|---|---|---|---|
| Character-Based | Each character is a token (Unicode code point). | Low | Large (all characters) | Simple; handles any language | Inefficient; large vocab; rare chars waste tokens |
| Byte-Based | Each byte from UTF-8 encoding is a token (0–255). | ~1 | Fixed (256) | Fixed vocab; language-agnostic | Long sequences for Unicode; inefficient for modeling semantics |
| Word-Based | Words (or regex-matched units) are tokens. | Medium | Large and dynamic | Intuitive; better compression than char/byte | Poor generalization; large vocab; OOV issues with UNK |
| Byte Pair Encoding (BPE) | Iteratively merges frequent subword pairs to form subword tokens. | High | Medium (tunable) | Efficient; balances granularity and vocab size | Merge rules are corpus-dependent; needs initial segmentation |

Table 1: Summary of the four tokenization algorithms

Lecture 02: Pytorch, Resource Accounting

As a warm-up, here is a back-of-the-envelope capacity calculation: an H100 has 80 GB (80e9 bytes) of memory, and naive training stores roughly 4 bytes for each parameter, 4 bytes for its gradient, and 4 + 4 bytes for the optimizer state (e.g. Adam's two moments), i.e. 16 bytes per parameter. On a node of 8 H100s, the largest model that fits is therefore about 4e10 (40B) parameters:

h100_bytes = 80e9
bytes_per_parameter = 4 + 4 + (4 + 4)  # Parameters, gradients, optimizer state
num_parameters = (h100_bytes * 8) / bytes_per_parameter  # ~4e10

Tensors are the basic building block for storing all of this: parameters, gradients, optimizer state, and the data itself. For example:

x = torch.tensor([
    [1, 2, 3],
])

Memory Accounting

Let’s introduce the three most common data types used in deep learning: float32, float16, and bfloat16.

float32 is the most common (default) data type used in deep learning, which uses 4 bytes per element.

Figure 2: The representation of float32 in memory
Float Representation

The exponent is used to represent the dynamic range of the number; float32 uses 8 bits for the exponent, which allows for a dynamic range of approximately 1.18e-38 to 3.4e+38. The fraction is used to represent the precision of the number, also known as its resolution; float32 uses 23 bits for the fraction, which allows for a precision of approximately 7 decimal digits. A float32 value is decoded using the following formula: \[ \begin{aligned} \text{value} &= (-1)^{b_{31}} \times 2^{(b_{30}b_{29} \dots b_{23})_2 - 127} \times (1.b_{22}b_{21} \dots b_{0})_2 \\ &= (-1)^{\text{sign}} \times 2^{(E - 127)} \times \left( 1 + \sum_{i=1}^{23} b_{23 - i} \cdot 2^{-i} \right) \end{aligned} \] where sign is the sign bit, E is the exponent value, and the fraction bits form the mantissa. The exponent is biased by 127, which means the stored exponent is the actual exponent plus 127. This allows both positive and negative exponents to be represented.
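As a concrete check of this formula (the example value 1.5 is arbitrary), we can unpack the bits of a float32 with the standard struct module:

import struct

bits = struct.unpack(">I", struct.pack(">f", 1.5))[0]  # Raw 32-bit pattern of 1.5
sign = bits >> 31               # 0
exponent = (bits >> 23) & 0xFF  # 127, so the actual exponent is 127 - 127 = 0
fraction = bits & 0x7FFFFF      # 0x400000, i.e. mantissa 1 + 0.5
# value = (-1)**0 * 2**(127 - 127) * 1.5 = 1.5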

Memory is determined by the number of values in the tensor and the data type of the tensor. The memory usage can be calculated as: \[ \text{Memory} = \text{Number of Values} \times \text{Bytes per Value} \tag{1}\] which can be calculated as:

def get_memory_usage(x: torch.Tensor):
    return x.numel() * x.element_size()

float32 takes 4 bytes per element, which is a lot of memory for an LLM. We can use float16 to cut the memory usage in half, at 2 bytes per element. This is a common practice in deep learning to save memory and speed up training.
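For example (with an arbitrary tensor shape), using the get_memory_usage helper from above:

x = torch.zeros(4, 8)                             # float32 by default
get_memory_usage(x)                               # 4 * 8 * 4 = 128 bytes
x_fp16 = torch.zeros(4, 8, dtype=torch.float16)
get_memory_usage(x_fp16)                          # 4 * 8 * 2 = 64 bytes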

Figure 3: The representation of float16 in memory

The float16 cuts down the memory usage by half, but it has a smaller dynamic range and precision compared to float32. This can lead to numerical instability and loss of precision in some cases, especially when dealing with very large or very small numbers.

x = torch.tensor([1e-8], dtype=torch.float16)  
assert x == 0

When underflow or overflow happens, the value is rounded to 0 or infinity, which can cause instability in the training process.

Another data type is bfloat16, a 16-bit floating point format that has the same dynamic range as float32 but less precision. It uses 8 bits for the exponent and 7 bits for the fraction, which allows for a dynamic range of approximately 1.18e-38 to 3.4e+38, but with a precision of only about 2–3 decimal digits.

Figure 4: The representation of bfloat16 in memory
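We can inspect the dynamic range and resolution of each dtype directly with torch.finfo (output abbreviated):

torch.finfo(torch.float32)   # min ≈ -3.40e+38, max ≈ 3.40e+38, eps ≈ 1.19e-07
torch.finfo(torch.float16)   # min ≈ -65504,    max ≈ 65504,    eps ≈ 9.77e-04
torch.finfo(torch.bfloat16)  # min ≈ -3.39e+38, max ≈ 3.39e+38, eps ≈ 7.81e-03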

In conclusion, the choice of data type can have a significant impact on the memory usage and performance of the model. float32 is the most common data type used in deep learning, but it is memory-intensive. float16 and bfloat16 are commonly used to reduce memory usage, but they can lead to numerical instability and loss of precision in some cases. One common practice is to combine the two: use float16 or bfloat16 for the model parameters, activations, and gradients, while keeping the optimizer state (and a master copy of the weights) in float32. This is known as mixed precision training (Micikevicius et al. 2018), which can save memory and speed up training without sacrificing too much performance.

Compute Accounting

By default, all tensors are stored in CPU memory, which is not efficient for training deep learning models. We can move a tensor to GPU memory by calling the to method on the tensor, which moves it to the specified device.

x = torch.tensor([1, 2, 3])
x = x.to('cuda')  # Move tensor to GPU

num_gpus = torch.cuda.device_count() # Check the number of GPUs available
for i in range(num_gpus):
    properties = torch.cuda.get_device_properties(i)  # @inspect properties

y = x.to('cuda:0')  # Move tensor to GPU 0

One way to check the GPU memory usage in PyTorch is to use the torch.cuda.memory_allocated() and torch.cuda.memory_reserved() functions. The memory_allocated() function returns the total memory allocated by tensors, while memory_reserved() returns the total memory reserved by the CUDA caching allocator.
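A small sketch of how these can be used (only meaningful when a GPU is available):

if torch.cuda.is_available():
    x = torch.zeros(1024, 1024, device="cuda")  # Allocate a tensor on the GPU
    print(torch.cuda.memory_allocated())        # Bytes currently allocated by tensors
    print(torch.cuda.memory_reserved())         # Bytes reserved by the caching allocator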

PyTorch tensors are pointers to the memory allocated on the CPU or GPU, together with metadata such as the shape, data type, and device. There are several common operations that can be performed on tensors:

x = torch.tensor([[1., 2, 3], [4, 5, 6]]) 
y = x[0] # Get the first row of the tensor
z = x[:, 1] # Get the second column of the tensor
w = x[0, 1] # Get the element at the first row and second column

y = x.view(3, 2) # Reshape the tensor to 3 rows and 2 columns
z = x.reshape(3, 2) # Reshape the tensor to 3 rows and 2 columns (copies if not contiguous)
w = x.transpose(0, 1) # Transpose the tensor 

One of the most important operations in deep learning is matrix multiplication, which can be performed using the torch.matmul() function. In addition, the einops library provides a powerful way to manipulate tensors, allowing more complex operations to be expressed concisely.

Now, let’s talk about compute accounting, i.e. the number of floating point operations (FLOPs) required to perform a given operation. Note that FLOPs is a count of operations; dividing it by the time taken gives FLOP/s, a measure of hardware speed.

Note

The FLOPs is a measure of the computational complexity of an operation, and it is often used to compare the performance of different algorithms. The FLOP/s (FLOPs per second) is a measure of the performance of a hardware, which is the number of floating point operations that can be performed in one second. It is often used to compare the performance of different hardware, such as CPUs and GPUs.

Suppose we have a Linear layer with batch size B, input dimension D, and output dimension K. The number of floating point operations required to perform the forward pass of the linear layer can be calculated as:

FLOPs = 2 * B * D * K

where the factor of 2 accounts for the multiplication and the addition performed for each (i, j, k) triple.

Besides matrix multiplication, element-wise operations such as activation functions (e.g., ReLU, Sigmoid) also contribute to the FLOPs. For example, applying a ReLU activation after the linear layer adds B * K additional floating point operations, since the function is applied to each element of the output tensor.

Addition of two \(m \times n\) matrices requires \(mn\) FLOPs, where \(m\) is the number of rows and \(n\) is the number of columns.

FLOPs measure the computational complexity of an algorithm, but mapping FLOPs to wall clock time is not straightforward, since it depends on the hardware and the implementation. However, we can use the following formula to estimate the wall clock time:

Wall Clock Time = FLOPs / FLOP/s

One useful criterion is MFU (Model FLOPs Utilization), defined as the ratio of the FLOP/s actually achieved by the model to the peak FLOP/s promised by the hardware. Usually, an MFU of >= 0.5 is quite good (and it will be higher if matmuls dominate). The MFU can be calculated as:

MFU = actual_flop_per_sec / promised_flop_per_sec
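As a toy illustration (the step time and the assumed H100 peak of roughly 989e12 dense BF16 FLOP/s are illustrative numbers, not from the lecture):

def mfu(actual_flops: float, elapsed_seconds: float, promised_flop_per_sec: float) -> float:
    actual_flop_per_sec = actual_flops / elapsed_seconds
    return actual_flop_per_sec / promised_flop_per_sec

flops = 2 * 1024 * 4096 * 4096   # FLOPs for one (B x D) @ (D x K) matmul
mfu(flops, 1e-4, 989e12)         # ~0.35 with these assumed numbers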

Einops

The lecture also introduces several einops operations (a short sketch follows the list):

  • einsum is used to perform Einstein summation notation, which is a powerful way to express complex tensor operations in a concise way.
  • rearrange is used to change the shape of the tensor without changing the data, it is similar to view or reshape in PyTorch.
  • reduce is used to reduce the tensor along a certain dimension, such as summing or averaging the values.
  • repeat is used to repeat the tensor along a certain dimension, which is useful for broadcasting operations.
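Below is a minimal sketch of these four operations, assuming a recent version of the einops package is installed (einsum requires einops ≥ 0.6); the shapes and patterns are arbitrary examples:

import torch
from einops import einsum, rearrange, reduce, repeat

x = torch.randn(2, 3, 4)                    # (batch, seq, dim)
w = torch.randn(4, 5)                       # (dim, hidden)

y = einsum(x, w, "b s d, d k -> b s k")     # Batched matrix multiplication
f = rearrange(x, "b s d -> (b s) d")        # Flatten batch and sequence dimensions
m = reduce(x, "b s d -> b d", "mean")       # Average over the sequence dimension
r = repeat(m, "b d -> b s d", s=3)          # Broadcast back along a sequence dimension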

Counting FLOPs for Operations

A floating-point operation (FLOP) is a basic operation like addition (x + y) or multiplication (x * y).

  • FLOPs: floating-point operations (measure of computation done)
  • FLOP/s: floating-point operations per second (also written as FLOPS), which is used to measure the speed of hardware.

For example, the FLOPs for the matrix multiplication of two matrices with dimensions m x n and n x p is 2 * m * n * p, where the 2 accounts for the multiplication and addition operations.

For the Linear module above, the number of FLOPs for the forward pass can therefore be written as:

FLOPs = 2 x (number of data points B) x (number of parameters D x K)

Models

Model Parameters Initialization

Model parameters are usually initialized from a random distribution, such as a normal or uniform distribution. For example:

w = nn.Parameter(torch.randn(input_dim, out_dim))

However, the standard deviation of each output element scales as sqrt(input_dim), so we need to scale the initialization by 1/sqrt(input_dim) to ensure that the variance of the output is neither too large nor too small. This is known as Xavier (or Glorot) initialization, which can be implemented as follows:

w = nn.Parameter(torch.randn(input_dim, out_dim) * (1 / math.sqrt(input_dim)))

To be safer, we can truncate the normal distribution to avoid outliers, using the torch.nn.init.trunc_normal_ function. The truncation is controlled by the mean, std, a, and b parameters, where a and b are the lower and upper bounds of the truncation.

w = nn.Parameter(torch.nn.init.trunc_normal_(torch.empty(input_dim, out_dim), mean=0, std=1, a=-2, b=2))

Random Seed

To ensure the reproducibility of the results, we need to set the random seed for the random number generator. In PyTorch, we can set the random seed using the torch.manual_seed() function, which will set the seed for all the random number generators in PyTorch.

torch.manual_seed(42)  # Set the random seed to 42

Data Loading

In language modeling, the data is a sequence of integer token indices. It is convenient to serialize them as numpy arrays and load them back:

orig_data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=np.int32)
orig_data.tofile("data.npy")

If we don’t want to load the entire data into memory, we can use the np.memmap function to create a memory-mapped array, which allows us to access the data on disk as if it were in memory. This is useful for large datasets that cannot fit into memory.

data = np.memmap("data.npy", dtype=np.int32, mode="r")
print(data[0:10])  # Access the first 10 elements of the data

We can then load random training batches using the following helper:

def get_batch(
    data: np.array,
    batch_size: int,
    sequence_length: int,
    device: str = "cpu"
) -> torch.Tensor:
    # Randomly select a start index for each sequence in the batch
    start_indices = torch.randint(len(data) - sequence_length, (batch_size,))
    # Stack the sequences into a (batch_size, sequence_length) tensor
    batch = torch.stack([torch.tensor(data[i:i + sequence_length]) for i in start_indices])
    return batch.to(device)
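For example, with the memory-mapped array from above:

batch = get_batch(data, batch_size=2, sequence_length=4)
batch.shape  # torch.Size([2, 4])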

By default, CPU tensors live in pageable memory. We can explicitly pin the memory using the pin_memory() method on the tensor. Pinned (page-locked) memory cannot be swapped out by the OS, which enables faster and asynchronous transfers to the GPU and can improve the performance of the data loading process.

if torch.cuda.is_available():
    batch = batch.pin_memory()  # Pin the memory to avoid paging
Figure 5: Illustration of the difference between pinned and pageable memory. The GPU cannot access data directly from pageable host memory, so when a data transfer from pageable host memory to device memory is invoked, the CUDA driver must first allocate a temporary page-locked, or “pinned”, host array, copy the host data to the pinned array, and then transfer the data from the pinned array to device memory (Image Source: NVIDIA)
Pinned Memory vs. Paged Memory
  • Paged Memory: Default system memory that can be swapped out to disk by the OS. Slower transfer to GPU, and does not support asynchronous transfer.
  • Pinned Memory: Also called page-locked or fixed memory. It is locked in RAM and cannot be swapped out. Enables faster transfer to GPU and supports asynchronous transfer.

This allows us to copy a batch from CPU to GPU asynchronously, which can improve the performance of the data loading process. We use the to() method with non_blocking=True to copy the tensor to GPU memory:

if torch.cuda.is_available():
    batch = batch.to("cuda", non_blocking=True)  # Move the batch to GPU

This allows us to do two things in parallel (not done here):

  • Fetch the next batch of data into CPU memory.
  • Process the current batch on the GPU.
Note

This is why when we load the data using the DataLoader in PyTorch, we can set the pin_memory=True to enable the pinned memory, which will automatically pin the memory for the tensors in the data loader. This can improve the performance of the data loading process, especially when using GPUs.

Optimizer

The lecture introduces several common optimizers in PyTorch:

  • SGD: Stochastic Gradient Descent, the most basic optimizer, which updates the parameters using the gradient of the loss function with a learning rate.
  • Momentum: SGD + exponential averaging of gradients.
  • Adaptive optimizers:
    • Adam: Adaptive Moment Estimation, which uses the first and second moments of the gradients to adaptively adjust the learning rate for each parameter.
    • Adagrad: Adaptive Gradient, which adapts the learning rate based on the historical gradients.
    • RMSprop: Root Mean Square Propagation, which adapts the learning rate based on the historical squared gradients.

An implementation of the SGD optimizer in PyTorch looks as follows:

class SGD(torch.optim.Optimizer):
    def __init__(self, params: Iterable[nn.Parameter], lr: float = 0.01):
        super(SGD, self).__init__(params, dict(lr=lr))
    def step(self):
        for group in self.param_groups:
            lr = group["lr"]
            for p in group["params"]:
                grad = p.grad.data
                p.data -= lr * grad

Also the AdaGrad optimizer can be implemented as follows:

class AdaGrad(torch.optim.Optimizer):
    def __init__(self, params: Iterable[nn.Parameter], lr: float = 0.01):
        super(AdaGrad, self).__init__(params, dict(lr=lr))
    def step(self):
        for group in self.param_groups:
            lr = group["lr"]
            for p in group["params"]:
                # Optimizer state
                state = self.state[p]
                grad = p.grad.data
                # Get squared gradients g2 = sum_{i<t} g_i^2
                g2 = state.get("g2", torch.zeros_like(grad))
                # Update optimizer state
                g2 += torch.square(grad)
                state["g2"] = g2
                # Update parameters
                p.data -= lr * grad / torch.sqrt(g2 + 1e-5)

The usage of the optimizer is as follows:

model = MyModel()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()  # e.g. mean squared error for this toy example

x_in = torch.randn(32, 100)  # Input tensor
y_out = model(x_in)  # Forward pass
target = torch.randn(32, 10)  # Target tensor
loss = loss_fn(y_out, target)  # Compute loss

loss.backward()  # Backward pass
optimizer.step()  # Update parameters
optimizer.zero_grad(set_to_none=True)  # Zero gradients for the next iteration (set_to_none=True is more memory efficient)

Training Loop
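A minimal sketch of a full training loop, reusing the model, optimizer, and loss_fn from the usage example above (the synthetic data and step count are illustrative, not from the lecture):

x = torch.randn(128, 100)      # Toy inputs
target = torch.randn(128, 10)  # Toy targets

for step in range(100):
    loss = loss_fn(model(x), target)       # Forward pass
    loss.backward()                        # Backward pass (accumulate gradients)
    optimizer.step()                       # Update parameters
    optimizer.zero_grad(set_to_none=True)  # Reset gradients for the next step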

Checkpointing
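A minimal checkpointing sketch: periodically save the model and optimizer state dicts so training can resume from where it left off (the function names and path here are illustrative):

def save_checkpoint(model: nn.Module, optimizer: torch.optim.Optimizer, path: str) -> None:
    torch.save({"model": model.state_dict(), "optimizer": optimizer.state_dict()}, path)

def load_checkpoint(model: nn.Module, optimizer: torch.optim.Optimizer, path: str) -> None:
    checkpoint = torch.load(path)
    model.load_state_dict(checkpoint["model"])
    optimizer.load_state_dict(checkpoint["optimizer"])

save_checkpoint(model, optimizer, "checkpoint.pt")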

Mixed Precision Training

Choice of data type (float32, bfloat16, fp8) have tradeoffs.

Higher precision: more accurate/stable, more memory, more compute

Lower precision: less accurate/stable, less memory, less compute

We can use mixed precision training to balance these tradeoffs, which is a common practice in deep learning to save memory and speed up training without sacrificing too much performance. The idea is to do most of the computation in float16 or bfloat16 while keeping master weights and the optimizer state in float32. This can be done using the torch.cuda.amp module in PyTorch, which provides automatic mixed precision training.

scaler = torch.cuda.amp.GradScaler()  # Create a GradScaler for mixed precision training
for data, target in dataloader:
    optimizer.zero_grad(set_to_none=True)  # Zero gradients for the next iteration
    with torch.cuda.amp.autocast():  # Enable autocasting for mixed precision training
        output = model(data)  # Forward pass
        loss = loss_fn(output, target)  # Compute loss
    scaler.scale(loss).backward()  # Backward pass with scaled loss
    scaler.step(optimizer)  # Update parameters with scaled gradients
    scaler.update()  # Update the scaler for the next iteration

Summary of Lecture 02

Lecture 03: Architectures & Hyperparameters

Lecture 04: Mixture of Experts

Lecture 05 & 06: GPU, Kernels, Triton

Lecture 07 & 08: Parallelism

Lecture 09 & 11: Scaling Laws

Lecture 10: Inference

Lecture 12: Evaluation

Lecture 13 & 14: Data

Lecture 15, 16 & 17: Alignment: SFT/ RLHF

Assignments

Assignment 01: Basics

Preparation

Clone Assignment Repository

First, we need to clone the assignment repository from GitHub:

git clone https://github.com/stanford-cs336/assignment1-basics.git
cd assignment1-basics

Download Dataset

Then, we need to download the datasets; the datasets used are TinyStories and OpenWebText:

mkdir -p data
cd data

wget https://huggingface.co/datasets/roneneldan/TinyStories/resolve/main/TinyStoriesV2-GPT4-train.txt
wget https://huggingface.co/datasets/roneneldan/TinyStories/resolve/main/TinyStoriesV2-GPT4-valid.txt

wget https://huggingface.co/datasets/stanford-cs336/owt-sample/resolve/main/owt_train.txt.gz
gunzip owt_train.txt.gz
wget https://huggingface.co/datasets/stanford-cs336/owt-sample/resolve/main/owt_valid.txt.gz
gunzip owt_valid.txt.gz

cd ..

After downloading the dataset, we can check the size of the dataset:

du -sh ./data             
# 13G    ./data

There are 13GB of data in total, which is a small dataset for training a language model.

Install Dependencies

To run the code and install the dependencies, we can simply run the following command:

uv run python

It will automatically create a virtual environment and install the dependencies.

Byte-Pair Encoding (BPE) Tokenizer

The Unicode Standard

ord() # convert a single Unicode character into its integer representation
chr() # convert a single integer representation into its Unicode character

The Unicode character chr(0) is '\x00', the null character, while repr(chr(0)) returns "'\\x00'", the string representation of the null character. When printed, it outputs nothing, since the null character is not printable. So, adding chr(0) to a string does not visibly change the printed output, but it does add a null character to the string.

"this is a test" + chr(0) + "string" # 'this is a test\x00string'
print("this is a test" + chr(0) + "string") # # this is a teststring

Unicode Encoding

Unicode Encoding is the process of converting a Unicode character into a sequence of bytes. The most common encoding is UTF-8, which uses 1 to 4 bytes to represent a Unicode character.

Unicode Standard VS. Unicode Encoding

Unicode Standard is a character encoding standard that defines a unique number for every character, no matter the platform, program, or language. Unicode Encoding is the actual implementation of the Unicode Standard in a specific encoding format, such as UTF-8 or UTF-16. It will convert the Unicode character into a sequence of bytes.

To encode the Unicode string into UTF-8, we can use the encode() method in Python to convert the string into a byte sequence, and then use the decode() method to convert the byte sequence back into a string. The encode() method takes an optional argument that specifies the encoding format, which is UTF-8 by default.

original_string = "Hello World! 🌍 你好, 世界!"
utf8_encode = original_string.encode("utf-8")
utf8_bytes = list(utf8_encode) # Get the byte values of the encoded string
original_string = utf8_encode.decode("utf-8")
len(original_string) # 22
len(utf8_bytes) # 33

UTF-8 is variable-length and uses 1 byte for ASCII (common in English), making it more compact for most real-world text. UTF-16 uses 2 or 4 bytes per character, and UTF-32 always uses 4 bytes, even for ASCII.

s = "hello"
print(len(s.encode("utf-8")))   # 5 bytes
print(len(s.encode("utf-16")))  # 12 bytes (includes BOM)
print(len(s.encode("utf-32")))  # 20 bytes (includes BOM)
UTF-8 produces a byte range of 0–255, making the vocabulary size small and fixed (ideal for BPE training).
•   UTF-16/32 would increase the vocabulary size dramatically (≥65,536 for UTF-16 and ~4 billion for UTF-32), requiring larger models and memory.

bytes() creates an immutable sequence of bytes — that is, a sequence of integers in the range 0–255.

s = "hello 🌍"
b = bytes(s, "utf-8")
print(b)  # b'hello \xf0\x9f\x8c\x8d'
b = s.encode("utf-8")

b = bytes([104, 101, 108, 108, 111])  # equivalent to 'hello'
print(b)  # b'hello'

b = bytes(5)
print(b)  # b'\x00\x00\x00\x00\x00'

decode_utf8_bytes_to_str_wrong("🌍".encode("utf-8"))
# Raises: UnicodeDecodeError

The function is incorrect because it decodes each byte individually, but multi-byte UTF-8 characters (like ‘🌍’) must be decoded as a whole sequence, not byte-by-byte.
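For reference, a sketch of what such a byte-by-byte decoder looks like (based on the description above; not necessarily the exact code from the assignment handout):

def decode_utf8_bytes_to_str_wrong(bytestring: bytes) -> str:
    # Decodes each byte on its own, which fails for multi-byte UTF-8 characters
    return "".join(bytes([b]).decode("utf-8") for b in bytestring)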

bytes([0xc0, 0x20]).decode("utf-8")  # An example of a two-byte sequence that is not valid UTF-8: raises UnicodeDecodeError

BPE Tokenizer Training

The main steps are:

  1. Vocabulary Initialization
  2. Pre-Tokenization
  3. Compute BPE merges
  4. Handle Special Tokens

Assignment 02: System

In this assignment, we will profile and benchmark the model we built in Assignment 01, and optimize the attention mechanism in Triton by implementing FlashAttention-2 (Dao 2023).

References

Dao, Tri. 2023. “FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning.” July 17, 2023. https://doi.org/10.48550/arXiv.2307.08691.
Micikevicius, Paulius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, et al. 2018. “Mixed Precision Training.” February 15, 2018. https://doi.org/10.48550/arXiv.1710.03740.