# Transformer-based language models

<a target="_blank" href="https://colab.research.google.com/github/jaspock/me/blob/main/docs/materials/transformers/assets/notebooks/lmgpt.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
<a href="http://dlsi.ua.es/~japerez/"><img src="https://img.shields.io/badge/Universitat-d'Alacant-5b7c99" style="margin-left:10px"></a>

Notebook and code written by Juan Antonio Pérez in 2023–2024.

This notebook uses the decoder-like transformer of our previous notebook to train and test a ridiculously simple language model. The size of the dataset will prevent the model from learning anything useful, but it will be enough to illustrate the basic principles of sequence generation with transformers. 

It is assumed that you are already familiar with the basics of PyTorch. This notebook complements a [learning guide](https://dlsi.ua.es/~japerez/materials/transformers/intro/) based on studying the math behind the models by reading the book "[Speech and Language Processing](https://web.stanford.edu/~jurafsky/slp3/)" (3rd edition) by Jurafsky and Martin. It is part of a series of notebooks which are supposed to be incrementally studied, so make sure you follow the right order. If your learning is being supervised by a teacher, follow the additional instructions that you may have received. Although you may use a GPU environment to execute the code, the computational requirements for the default settings are so low that you can probably run it on CPU.

In [None]:
%%capture
%pip install torch

## Mini-batch preparation

The `make_batch` function is a generator that yields batches of input and output sequences. Each input sequence is a slice of the corpus, and the corresponding output sequence is the same slice shifted one token to the right, so that the target at every position is the next token in the corpus. The generator is infinite, so it can be used in a `for` loop to iterate over batches for as long as needed. The `yield` keyword differs from `return` in that it does not terminate the function: it hands a value to the caller and pauses execution until the next value is requested. This allows the function to keep its state between requests and to produce an unlimited number of batches. Note that the instruction after the `yield` keyword (a `pass` instruction in this case) is the first to be executed when the generator resumes.
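
The pause-and-resume behaviour of `yield` can be seen in isolation with a toy generator. This is only an illustrative sketch: `counter_batches` is a hypothetical function that is not part of the notebook, and it yields plain lists of integers instead of tensors.

```python
def counter_batches(batch_size):
    """Toy infinite generator: yields consecutive blocks of integers."""
    start = 0  # local state that survives between requests
    while True:
        yield list(range(start, start + batch_size))
        # execution resumes here when the next value is requested
        start += batch_size

gen = counter_batches(3)  # calling the function only creates the generator object
first = next(gen)         # runs the body up to the first yield
second = next(gen)        # resumes right after the yield
print(first, second)      # [0, 1, 2] [3, 4, 5]
```

A `for` loop over `gen` would call `next` implicitly on every iteration, which is why a loop over an infinite generator needs an explicit `break`.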

The second parameter in `word_index.get` is the default value to return if the key is not found in the dictionary. In this case, we return the index of the special token `[UNK]`, which represents unknown words. This is a simple way to handle out-of-vocabulary words. Ideally, the training corpus should also contain out-of-vocabulary words so that a useful representation can be learned for the token `[UNK]`. In this notebook, however, the vocabulary includes all the words in the corpus.
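
As a quick illustration of this fallback, the sketch below maps a sentence to indices with a tiny hand-made vocabulary. The dictionary `toy_index` and the sentence are made up for the example and are not the ones used later in the notebook.

```python
# hypothetical toy vocabulary, with the two special tokens first
toy_index = {'[PAD]': 0, '[UNK]': 1, 'to': 2, 'be': 3, 'or': 4, 'not': 5}

tokens = 'to be or not to robot'.split()
# 'robot' is not a key, so get() falls back to the index of [UNK]
ids = [toy_index.get(t, toy_index['[UNK]']) for t in tokens]
print(ids)  # [2, 3, 4, 5, 2, 1]
```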

The inner loop of the generator creates a batch of `batch_size` sequences. The start position in the corpus is chosen randomly, and the end index is `max_len` tokens after the start index. If the end index is greater than the length of the corpus, the remaining tokens are taken from the beginning of the corpus. 
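
The wrap-around logic can be checked by hand on a short list. The sketch below uses the same slicing scheme as described above, but with a fixed start index instead of a random one so that the result is predictable. The helper name `window_with_wraparound` and the toy data are assumptions made for this example.

```python
def window_with_wraparound(token_indices, start_index, max_len):
    """Return (input, output) windows of max_len tokens, wrapping at the end."""
    n_tokens = len(token_indices)
    end_index = start_index + max_len
    input_seq = token_indices[start_index:end_index]
    if end_index > n_tokens:
        # take the missing tokens from the beginning of the corpus
        input_seq = input_seq + token_indices[:end_index - n_tokens]
    # the output is the input shifted by one token, plus the token that follows
    output_seq = input_seq[1:] + [token_indices[end_index % n_tokens]]
    return input_seq, output_seq

toy = [10, 11, 12, 13, 14, 15]
inp, out = window_with_wraparound(toy, start_index=4, max_len=4)
print(inp)  # [14, 15, 10, 11]
print(out)  # [15, 10, 11, 12]
```

Each element of `out` is the token that follows the element of `inp` at the same position, which is exactly what the model is trained to predict.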

To avoid having to pad the sequences, we assume that the training corpus is long enough to completely fill each *line* of the batch. If the corpus is shorter than `max_len` tokens, an `AssertionError` is raised.

In [None]:
import torch
import random

def make_batch(tokenized_corpus, word_index, max_len, batch_size, device):

    token_indices = [word_index.get(token, word_index['[UNK]']) for token in tokenized_corpus]
    n_tokens = len(token_indices)  # number of tokens in the corpus
    assert n_tokens >= max_len, f'Short corpus ({n_tokens} tokens), must be at least {max_len} tokens long'

    while True:
        input_batch, output_batch = [], []
        
        for _ in range(batch_size):
            start_index = random.randint(0, n_tokens - 1)  # random start
            end_index = start_index + max_len
            input_seq = token_indices[start_index:end_index]
            if end_index > n_tokens:
                input_seq += token_indices[:end_index - n_tokens]
            
            # output is input shifted one token to the right:
            output_seq = input_seq[1:] + [token_indices[end_index % n_tokens]]

            input_batch.append(input_seq)
            output_batch.append(output_seq)

        yield torch.LongTensor(input_batch).to(device), torch.LongTensor(output_batch).to(device)
        pass  # execution resumes here when the next batch is requested

## Import our transformer code

We load the `DecoderTransformer` class implemented in the previous notebook. If we are running this on the cloud, we download the parent notebook file from GitHub. If we are running it locally, we assume that the file is in the same directory as this notebook. The seed is also set to a fixed value to ensure reproducibility.

In [None]:
%%capture
import os
colab = bool(os.getenv("COLAB_RELEASE_TAG"))  # running in Google Colab?
if not os.path.isfile('transformer.ipynb') and colab:
    %pip install wget
    !wget https://raw.githubusercontent.com/jaspock/me/main/docs/materials/transformers/assets/notebooks/transformer.ipynb

%pip install nbformat
%run './transformer.ipynb'

set_seed(42)

## Corpus preprocessing

Our model will be trained with a corpus contained in a single file. In our case, we will download the Tiny Shakespeare dataset made of works by William Shakespeare. 

The preprocessing of the corpus follows the same steps as in the previous notebook. The only difference is the addition of a few special tokens to the vocabulary: `[PAD]` for padding and `[UNK]` for unknown words. `[PAD]` is normally used to fill sequences shorter than `max_len`, but it is not needed here. `[UNK]` represents out-of-vocabulary words. In this notebook, however, the vocabulary includes all the words in the corpus, so a good representation for `[UNK]` will not be learned. It is still used at inference time to handle out-of-vocabulary words in the prompt. Note also that when subword tokenization is used, unknown tokens are usually much less frequent.
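
The following sketch shows how the special tokens reserve the first indices of the vocabulary and what happens to an unknown word in an encode/decode round trip. The names `toy_word_index` and `toy_index_word` and the toy corpus are made up for this illustration, and the words are sorted here only to make the indices predictable.

```python
toy_corpus = 'the king and the queen'
special = ['[PAD]', '[UNK]']             # reserved indices 0 and 1
words = sorted(set(toy_corpus.split()))  # sorted for a deterministic order

toy_word_index = {tok: i for i, tok in enumerate(special + words)}
toy_index_word = {i: tok for tok, i in toy_word_index.items()}

sentence = 'the queen and the robot'.split()
encoded = [toy_word_index.get(w, toy_word_index['[UNK]']) for w in sentence]
decoded = [toy_index_word[i] for i in encoded]
print(encoded)  # [5, 4, 2, 5, 1]
print(decoded)  # ['the', 'queen', 'and', 'the', '[UNK]']
```

The word `robot` cannot be recovered after decoding: every out-of-vocabulary word collapses into the same `[UNK]` index.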

Different tasks may require different special tokens. For example, a multilingual model may need a special token to indicate the language of the input sequence.

In [None]:
# download Tiny Shakespeare dataset:
import urllib.request
url = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'
chars = 10000  # number of characters to keep
corpus = urllib.request.urlopen(url).read().decode("utf-8")[:chars]
print(corpus[:100])

word_list = sorted(set(corpus.split()))  # sorted so that indices are the same in every run
word_index = {'[PAD]': 0, '[UNK]': 1}
special_tokens = len(word_index) 
for i, w in enumerate(word_list):
    word_index[w] = i + special_tokens
index_word = {i: w for i, w in enumerate(word_index)}
vocab_size = len(word_index)
print(f"vocab_size = {vocab_size}")

## Model training

Hopefully, having studied the other notebooks, once you reach this point, you will realize that everything sounds familiar and understandable.

In [None]:
n_layer = 2
n_head = 2
n_embd =  64
embd_pdrop = 0.1
resid_pdrop = 0.1
attn_pdrop = 0.1
batch_size = 4
max_len = 32
training_steps = 1000
eval_steps = 100
lr = 0.001

import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim import lr_scheduler
import math

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = DecoderTransformer(n_embd=n_embd, n_head=n_head, n_layer=n_layer, vocab_size=vocab_size,  
                max_len=max_len, embd_pdrop=embd_pdrop, attn_pdrop=attn_pdrop, resid_pdrop=resid_pdrop)
model.to(device)

criterion = nn.CrossEntropyLoss(ignore_index=0)  # not needed here since we are not padding inputs
optimizer = optim.Adam(model.parameters(), lr=lr)
scheduler = lr_scheduler.LinearLR(optimizer, start_factor=1.0, end_factor=0.5, total_iters=training_steps)

model.train()
tokenized_corpus = corpus.split()
step = 0

for inputs, outputs in make_batch(tokenized_corpus, word_index, max_len, batch_size, device):
    optimizer.zero_grad()
    logits = model(inputs)
    loss = criterion(logits.view(-1,logits.size(-1)), outputs.view(-1)) 
    if step % eval_steps == 0:
        print(f'Step [{step}/{training_steps}], loss: {loss.item():.4f}, perplexity: {math.exp(loss.item()):.2f}')
    loss.backward()
    optimizer.step()
    scheduler.step()
    step = step + 1
    if (step==training_steps):
        break

print(f'Step [{step}/{training_steps}], loss: {loss.item():.4f}, perplexity: {math.exp(loss.item()):.2f}')

## Model evaluation

The `generate_text` function auto-regressively continues a given prompt until the sequence reaches `max_len` tokens. It starts by tokenizing the prompt and converting it to a one-sample mini-batch. Then, it iteratively predicts the next token by selecting the index with the highest score in the output vector corresponding to the last token in the sequence (greedy decoding). The resulting index is appended to the input sequence, and the process is repeated until the desired length is reached. Finally, the prompt and the predicted tokens are returned as a single string of words.
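
The loop just described can be sketched without PyTorch by replacing the transformer with a trivial scoring function. Everything in this example is hypothetical: `toy_model` is a stand-in that simply favours the successor of the last token, and `greedy_continue` only mirrors the control flow of greedy decoding.

```python
def toy_model(ids, vocab_size):
    """Hypothetical stand-in for the transformer: one score per vocabulary
    entry for the token that follows `ids`."""
    scores = [0.0] * vocab_size
    scores[(ids[-1] + 1) % vocab_size] = 1.0  # favour the successor of the last token
    return scores

def greedy_continue(ids, vocab_size, max_len):
    ids = list(ids)  # do not modify the caller's list
    while len(ids) < max_len:
        scores = toy_model(ids, vocab_size)
        # argmax: index of the highest score
        predicted_id = max(range(vocab_size), key=scores.__getitem__)
        ids.append(predicted_id)  # the prediction becomes part of the next input
    return ids

continued = greedy_continue([3, 4], vocab_size=6, max_len=7)
print(continued)  # [3, 4, 5, 0, 1, 2, 3]
```

Because the highest-scoring token is always chosen, the same prompt always produces the same continuation.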

Due to the intentionally small size of the training corpus, the model will probably copy excerpts from the training corpus verbatim.

📘 *Documentation:* [`torch.cat`](https://pytorch.org/docs/stable/generated/torch.cat.html), [`torch.Tensor.item`](https://pytorch.org/docs/stable/generated/torch.Tensor.item.html), [`torch.argmax`](https://pytorch.org/docs/stable/generated/torch.argmax.html), [`torch.Tensor.view`](https://pytorch.org/docs/stable/generated/torch.Tensor.view.html)

In [None]:

def generate_text(model, prompt, word_index, index_word, max_len, device):
    words = prompt.split()
    input_ids = [word_index.get(word, word_index['[UNK]']) for word in words]
    input = torch.LongTensor(input_ids).view(1, -1).to(device)  # add batch dimension

    with torch.no_grad():
        for _ in range(max_len - len(input_ids)):
            output = model(input)
            last_token_logits = output[0, -1, :]
            predicted_id = torch.argmax(last_token_logits, dim=-1).item()
            input = torch.cat([input, torch.LongTensor([predicted_id]).view(1,-1).to(device)], dim=1)
            predicted_word = index_word[predicted_id]
            words.append(predicted_word)

    return ' '.join(words)

model.eval()
prompt = "O God, that robot is out of control! I tell you, friends, "
generated_text = generate_text(model, prompt, word_index, index_word, max_len, device)
print(generated_text)


## Exercises

If your learning path is supervised by a teacher, they may have provided you with additional instructions on how to proceed with the exercises.

✎ Use SentencePiece to tokenize the data.

✎ Use the [`torch.topk`](https://pytorch.org/docs/stable/generated/torch.topk.html) function to implement sampling instead of greedy decoding.

✎ Implement your own versions of top-$k$ and top-$p$ (nucleus) sampling.

✎ Use a mini-batch of prompts at inference time to generate multiple texts in parallel.

✎ Compare the original pre-norm implementation of the transformer with the post-norm implementation under this task.
