Overview
This post contains my notes on the book *Hands-On Large Language Models: Language Understanding and Generation*.
You can find the book on Amazon
I’ll be adding my notes to this post as I read through the book. The notes will be organized by chapter and will include key concepts, code examples, and any additional insights I find useful.
Chapter 1: An Introduction to Large Language Models
Chapter 1 introduces the reader to the recent history of Large Language Models (LLMs). The diagrams in this chapter
are particularly useful for understanding the evolution of LLMs and how they relate to other AI technologies. It’s a
great segue from the machine learning and neural networks covered in the previous book I’m reading in parallel.
Chapter 2: Tokens and Embeddings
Chapter 2 introduces the concept of tokens and embeddings.
Tokens are the basic units of text that LLMs use to process and generate language. The chapter covers a number of LLM
tokenizers, including BERT (cased and uncased), GPT-2, FLAN-T5, StarCoder2, and a few others. It provides details on
how each tokenizer works and how they differ from one another.
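To get a feel for how the tokenizers differ, a quick comparison like the one below can help. This is a minimal sketch using the Hugging Face `transformers` library; the checkpoint names and the sample sentence are my own choices, not the book’s exact code.

```python
from transformers import AutoTokenizer

text = "Large language models tokenize text differently."

# Compare how a few different tokenizers split the same sentence.
for name in ["bert-base-uncased", "bert-base-cased", "gpt2"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    print(name, tokenizer.tokenize(text))
```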
Token embeddings are numerical representations of tokens that capture their semantic meaning. Embeddings can be used to
represent sentences, paragraphs, or even entire documents. Further, embeddings can be used in Recommendation Systems. The
chapter covers a song recommendation system that uses embeddings to recommend songs based on a song input by the user.
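The core idea behind the recommender is to treat playlists as “sentences” of song IDs, train word2vec-style embeddings on them, and recommend the nearest neighbors of a given song. A rough sketch of that idea follows; the playlist data and song IDs are invented, and this is not the book’s exact pipeline.

```python
from gensim.models import Word2Vec

# Toy playlists: each playlist is a "sentence" and each song ID is a "word".
playlists = [
    ["song_a", "song_b", "song_c"],
    ["song_b", "song_c", "song_d"],
    ["song_a", "song_d", "song_e"],
]

model = Word2Vec(sentences=playlists, vector_size=32, window=5, min_count=1, seed=42)

# Recommend songs whose embeddings are closest to the input song.
print(model.wv.most_similar("song_b", topn=3))
```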
Chapter 3: Looking Inside Large Language Models
Note: This chapter contains a number of useful diagrams that I’ve described in my own representation. However,
the diagrams are not reproduced in their entirety. Please refer to the book for the complete diagrams and explanations.
Chapter 3 takes a deeper dive into the architecture of LLMs. We start out with a view into the Inputs and Outputs of
Trained Transformer LLMs. This might be an overly simplified view, but it helps to understand the basic flow of data
through an LLM.

The transformer generates a single output token at a time, using the previous tokens as context. This is known as an
autoregressive model.
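As a concrete illustration, here is a rough sketch of that loop using GPT-2 via `transformers`; the model choice and prompt are my own, not the book’s code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("The transformer generates", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(5):
        logits = model(input_ids).logits                     # [batch, seq_len, vocab_size]
        probs = torch.softmax(logits[:, -1, :], dim=-1)      # distribution over the next token
        next_id = torch.multinomial(probs, num_samples=1)    # sample one token
        input_ids = torch.cat([input_ids, next_id], dim=-1)  # append it and repeat

print(tokenizer.decode(input_ids[0]))
```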
Diving a little deeper, we learn about the Transformer architecture. It’s composed of a Tokenizer, a stack of Transformer
blocks, and an LM Head.

Going further, the tokenizer breaks the input text into tokens drawn from its token vocabulary. The stack of transformer
blocks operates on token embeddings created from that vocabulary. The LM head is a neural network layer that produces
token probabilities for each token in the vocabulary.

Greedy decoding is when the model selects the token with the highest probability at each step.
Multiple input tokens can be processed in parallel, and the number of tokens that can be processed at once is
referred to as the context size. Keep in mind that embeddings are not the same as tokens; they are numerical
representations of tokens that capture their semantic meaning.
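A small sketch of greedy decoding and context size using the high-level `generate()` API; GPT-2 is just a convenient stand-in here.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

print(model.config.n_positions)  # 1024: GPT-2's context size (max tokens it can attend to)

inputs = tokenizer("Greedy decoding always picks", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=10, do_sample=False)  # greedy decoding
print(tokenizer.decode(output_ids[0]))
```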

Keep in mind that only the last token in the sequence is used to generate the next token. Even so, the results of the
parallel processing streams can be cached to improve efficiency.
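In the `transformers` library this corresponds to key/value caching (`use_cache` / `past_key_values`). A minimal sketch, with my own prompt and model choice:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("Caching speeds up", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(input_ids, use_cache=True)
    past = out.past_key_values                                   # cached keys/values per layer

    next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy next token
    out = model(next_id, past_key_values=past, use_cache=True)   # only the new token is processed
```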
Digging even deeper, we learn about the Transformer blocks. Each block consists of a self-attention mechanism and a feed-forward
neural network.

The feed-forward neural network is the source of learned information that enables the model to generate coherent text.
Attention is a key mechanism in LLMs that allows the model to focus on specific parts of the input sequence when
generating text.

Looking more closely at attention, we get to the core of how LLMs work.
Note: The book mentions projection matrices which are shown in the diagram below. However, it doesn’t
explain them in detail. If you’re interested in understanding projection matrices, a good resource that I found helpful
is this article.
The content appears to be a summarization of the original paper on self-attention, *Attention Is All You Need*.
Another good resource is *The Illustrated Transformer*.

Using the queries and keys, the model calculates relevance scores for each token in the input sequence. The scores are
normalized and then used to weight the values, producing the output vectors.
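A small numeric sketch of scaled dot-product attention, the computation at the heart of self-attention; random tensors stand in for the projected token embeddings, and this is the standard formulation from *Attention Is All You Need*, not code from the book.

```python
import torch
import torch.nn.functional as F

seq_len, d_k = 4, 8
Q = torch.randn(seq_len, d_k)  # queries (projected token embeddings)
K = torch.randn(seq_len, d_k)  # keys
V = torch.randn(seq_len, d_k)  # values

scores = Q @ K.T / d_k ** 0.5        # relevance of every token to every other token
weights = F.softmax(scores, dim=-1)  # normalize scores into attention weights
output = weights @ V                 # weighted sum of values -> output vectors
print(output.shape)                  # torch.Size([4, 8])
```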
Attention is a powerful mechanism that allows the model to weigh the importance of different tokens in the input sequence
when generating text. Newer LLMs support a more efficient attention mechanism called sparse attention, which can be
strided or fixed. These mechanisms use fewer input tokens as context for self-attention.
Additionally, there are other attention mechanisms such as:
- Grouped Query Attention (GQA)
- Multi-Head Attention
- Flash Attention
Chapter 4: Text Classification
The goal of text classification is to assign a label to a piece of text based on its content. Classification can be
used for a variety of tasks, such as:
- sentiment analysis
- topic classification
- spam detection
- intent detection
- language detection
Techniques:
- Text Classification with Representation Models
- Text Classification with Generative Models
Text Classification with Representation Models
How it works:
- Base models are fine-tuned for specific tasks, such as classification or generating embeddings.

- The models are fed inputs, and outputs specific to the task are generated (see the sketch below).
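For example, a task-specific (fine-tuned) representation model can be used through the `pipelines` API; the checkpoint below is a common sentiment model, not one the book prescribes.

```python
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",  # a fine-tuned representation model
)
print(classifier("I absolutely loved this book!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```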

The book suggests several models that work well for text classification.
When looking to generate embeddings, the MTEB Leaderboard is a good
resource for finding models.
To evaluate the performance of classification models when labeled data is available, we can use metrics such as
accuracy, precision, recall, and F1 score.
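With scikit-learn this is straightforward; the labels and predictions below are toy values.

```python
from sklearn.metrics import classification_report

y_true = [1, 0, 1, 1, 0]  # toy ground-truth labels
y_pred = [1, 0, 0, 1, 0]  # toy model predictions
print(classification_report(y_true, y_pred))  # precision, recall, F1, and accuracy
```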
Zero-shot classification is a technique that allows us to classify text without any labeled data.
Using the cosine similarity function, we can compare the embeddings of the input text to the embeddings of the labels.
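A minimal sketch of that idea, assuming a sentence-transformers model and label descriptions of my own choosing:

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")

labels = ["a positive movie review", "a negative movie review"]
text = "The plot was dull and the acting was worse."

label_embeddings = model.encode(labels)
text_embedding = model.encode([text])

scores = cosine_similarity(text_embedding, label_embeddings)[0]
print(labels[scores.argmax()])  # the label whose embedding is closest to the text
```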
Text Classification with Generative Models
Prompt engineering is the process of designing prompts that can effectively elicit the desired response from a generative
model.
The T5 model is similar to the original Transformer architecture, using an encoder-decoder structure.
OpenAI’s GPT model training process is published here: https://openai.com/index/chatgpt/
Chapter 5: Text Clustering and Topic Modeling
Text clustering is the process of grouping similar pieces of text together based on their content, yielding clusters of
semantically similar text.
Text clustering can be used for topic modeling, which is the process of identifying the main topics in a collection of text.
The book example uses ArXiv papers as the text corpus.
Common Pipeline for Text Clustering (sketched below):
- Convert input documents -> embeddings w/ embedding model
- Reduce dimensionality w/ dimensionality reduction model
- Find groups of documents w/ cluster model
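A rough sketch of that pipeline using the sentence-transformers, umap-learn, and hdbscan libraries; the documents and parameter values are toy choices.

```python
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN

docs = [
    "a paper about transformer architectures",
    "a study of attention mechanisms",
    "scaling laws for language models",
    "a survey of clustering algorithms",
    "notes on density-based clustering",
    "evaluating topic models",
]

embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(docs)  # documents -> embeddings
reduced = UMAP(n_components=2, n_neighbors=3, random_state=42).fit_transform(embeddings)  # reduce dims
labels = HDBSCAN(min_cluster_size=2).fit_predict(reduced)          # group similar documents
print(labels)  # cluster id per document (-1 marks noise)
```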
Dimensionality Reduction
There are well-known methods for dimensionality reduction, including:
- Principal Component Analysis (PCA)
- Uniform Manifold Approximation and Projection (UMAP)
Clustering Algorithms
Examples of clustering algorithms include:
- Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN)
Visualization of clusters can be done using tools like Matplotlib.
BERTopic: Modular Topic Modeling Framework
- Follow the same procedure in text clustering to generate clusters
- Model the distribution over words (bag of words): use the frequency of words in each cluster to identify topics
- Use class-based term frequency inverse document frequency (c-TF-IDF) to identify words that are unique to each cluster
A full pipeline for topic modeling using BERTopic:

| Clustering | Topic representation | Reranking |
|---|---|---|
| SBERT -> UMAP -> HDBSCAN | count vectorization -> c-TF-IDF | representation model |
| embed docs -> reduce dimensionality -> cluster docs | tokenize words -> weight words | fine-tune representation |
BERTopic components can be swapped in and out like Lego blocks to build custom pipelines (see the sketch below).
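A sketch of how those pieces snap together in BERTopic; the component choices (MiniLM embeddings, UMAP, HDBSCAN, CountVectorizer) are common defaults rather than anything prescribed by the book.

```python
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer

topic_model = BERTopic(
    embedding_model=SentenceTransformer("all-MiniLM-L6-v2"),  # embed documents
    umap_model=UMAP(n_components=5, random_state=42),         # reduce dimensionality
    hdbscan_model=HDBSCAN(min_cluster_size=10),               # cluster documents
    vectorizer_model=CountVectorizer(stop_words="english"),   # tokenize/count words per cluster
)

# docs would be a list of strings, e.g. ArXiv abstracts as in the book's example:
# topics, probs = topic_model.fit_transform(docs)
```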
Chapter 6: Prompt Engineering
Basics of using a text generation model
- Select a model, considering:
- open source vs. proprietary
- output control
- Choose an open source or proprietary model
Suggestion: start with a small foundation model
Load the model
Control the output
- set do_sample=True to use temperature and top_p
- Tune temperature and top_p for the use case (see the sketch below)
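A minimal sketch of loading a small model and controlling its output; GPT-2 is just an easy-to-run stand-in, and the sampling values are starting points rather than recommendations.

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

output = generator(
    "Prompt engineering is",
    max_new_tokens=40,
    do_sample=True,   # sampling must be enabled for temperature and top_p to take effect
    temperature=0.7,  # lower -> more deterministic, higher -> more diverse
    top_p=0.9,        # nucleus sampling threshold
)
print(output[0]["generated_text"])
```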
Intro to prompt engineering
Ingredients of a good prompt:
- When no instructions are given, the model will try to predict the next word based on the input text.
- Two components of basic instructions:
- Task description
- Input text (data)
- Extending the prompt with an output indicator allows for more specific output (see the sketch below)
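A small example of those ingredients put together; the wording is my own.

```python
# Task description, input text (data), and an output indicator in one basic prompt.
basic_prompt = (
    "Classify the sentiment of the review below as positive or negative.\n"  # task description
    "Review: The battery died after two days.\n"                             # input text (data)
    "Sentiment:"                                                             # output indicator
)
```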
Use cases for instruction based prompts:
- Supervised classification
- Search
- Summarization
- Code generation
- Named entity recognition
Techniques for improving prompts:
- Specificity
- Hallucination mitigation
- Order
Complex prompt components (sketched after this list):
- Persona
- Instruction
- Context
- Format
- Audience
- Tone
- Data
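A sketch of a complex prompt that strings those components together; the content of each line is my own illustration.

```python
report_text = "The nightly ETL job failed at 02:00 due to an expired credential."

complex_prompt = (
    "You are a senior data engineer.\n"                           # persona
    "Summarize the incident report below for a status update.\n"  # instruction
    "The report concerns our internal data platform.\n"           # context
    "Answer in three bullet points.\n"                            # format
    "The audience is non-technical managers.\n"                   # audience
    "Keep the tone calm and factual.\n\n"                         # tone
    f"Incident report: {report_text}"                             # data
)
```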
In-context learning:
- Zero-shot learning
- One-shot learning
- Few-shot learning (see the example prompt below)
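A small few-shot example (the reviews are invented): the labeled examples teach the model the task and the answer format in-context.

```python
few_shot_prompt = """Classify the sentiment of each review as positive or negative.

Review: The food was amazing.
Sentiment: positive

Review: I waited an hour and the order was wrong.
Sentiment: negative

Review: The staff were friendly and helpful.
Sentiment:"""
```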
Chain prompting:
- Break the task into smaller sub-tasks and use the output of one prompt as the input to the next prompt (a sketch follows this list).
- Useful for:
- Response validation
- Parallel prompts
- Writing stories
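A minimal sketch of a two-step chain, where the first prompt’s output becomes the second prompt’s input; GPT-2 is only a small stand-in (an instruction-tuned model would do far better), and the `generate` helper is my own wrapper.

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

def generate(prompt: str) -> str:
    """One call to the model; any chat model or hosted API could be swapped in here."""
    result = generator(prompt, max_new_tokens=60, do_sample=True, return_full_text=False)
    return result[0]["generated_text"]

outline = generate("Write a three-bullet outline for a story about a lighthouse keeper.")
story = generate(f"Write a short story that follows this outline:\n{outline}")
print(story)
```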
Reasoning with Generative Models
Chain of thought:
- Prompt the model to think step by step (see the sketch after this section)
Self-consistency:
- using the same prompt multiple times to generate multiple responses
- works best with temperature and top_p sampling
Tree of thought:
- useful when needing to explore multiple paths to a solution
- ask the model to mimic multiple agents working together to solve a problem
- question each other until they reach a consensus
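A tiny sketch tying chain-of-thought and self-consistency together: the prompt asks for step-by-step reasoning, several sampled answers are collected (hard-coded stand-ins below), and the most common final answer wins. The arithmetic problem is my own example.

```python
from collections import Counter

cot_prompt = (
    "Q: A pack has 12 pencils. I buy 3 packs and give away 7 pencils. "
    "How many pencils do I have left?\n"
    "A: Let's think step by step."
)

# In practice these final answers would come from sampling the model several times
# with the prompt above (temperature/top_p sampling enabled).
sampled_final_answers = ["29", "29", "28", "29", "29"]
consensus = Counter(sampled_final_answers).most_common(1)[0][0]
print(consensus)  # "29" (3 * 12 - 7)
```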
Output Verification
- Useful for:
- Structured output
- Valid output
- Ethics
- Accuracy
Techniques:
- Provide examples of valid output
Grammar: constrained sampling to restrict the model’s output to a predefined format
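One simple way to verify structured output, separate from grammar-constrained sampling, is to ask the model for JSON and validate it against a schema after the fact. A sketch using pydantic; the schema and the example response are invented.

```python
import json
from pydantic import BaseModel

class Review(BaseModel):
    sentiment: str
    confidence: float

# Imagine this string came back from a model prompted to answer in JSON.
model_response = '{"sentiment": "negative", "confidence": 0.92}'
review = Review(**json.loads(model_response))  # raises a ValidationError if the output is invalid
print(review)
```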
Taxonomy
- Accuracy: A metric used to evaluate the performance of classification models, measuring the proportion of correct predictions.
- Attention: A mechanism that allows models to focus on specific parts of the input sequence, improving context understanding.
- Autoregressive Models: Models that generate text by predicting the next token in a sequence based on the previous tokens.
- Bag of Words (BoW): A simple representation of text that ignores grammar and word order but keeps track of word frequency.
- BERT (Bidirectional Encoder Representations from Transformers): A pre-trained model that uses a Transformer architecture to understand the context of words in a sentence.
- Byte Tokens: A tokenization scheme that represents text as a sequence of bytes, allowing for a more compact representation.
- Character Tokens: A tokenization scheme where each token represents a single character, useful for languages with complex morphology.
- Context Size: The number of tokens the model can consider at once when generating text, affecting its ability to maintain coherence.
- Embeddings: Numerical representations of words or tokens that capture their semantic meaning and relationships.
- F1 Score: A metric that combines precision and recall to evaluate the performance of classification models.
- Feed-Forward Neural Network: A type of neural network where connections between nodes do not form cycles, used in Transformer blocks.
- Flash Attention: An efficient attention mechanism that reduces memory usage and speeds up computation in LLMs.
- GPT (Generative Pre-trained Transformer): A type of LLM that is pre-trained on a large corpus of text and can generate coherent text based on a given prompt.
- Greedy Decoding: A text generation strategy where the model selects the token with the highest probability at each step.
- Grouped Query Attention (GQA): An attention mechanism that shares key and value heads across groups of query heads, improving efficiency.
- Inverse Document Frequency (IDF): A measure of how important a word is to a document within a collection of documents.
- Large Language Models (LLMs): A type of AI model that is trained on large datasets to understand and generate human language.
- LM Head: The final layer of an LLM that generates the output tokens based on the processed input.
- Multi-Head Attention: An attention mechanism that allows the model to focus on different parts of the input sequence simultaneously.
- Output vectors: The numerical representations of the output tokens generated by the LLM.
- Parallel Processing: The ability to process multiple input tokens simultaneously, improving efficiency.
- Precision: A metric used to evaluate the performance of classification models, measuring the proportion of positive predictions that are correct.
- Recall: A metric used to evaluate the performance of classification models, measuring the ability to identify all relevant instances.
- Self-Attention: A mechanism that allows the model to weigh the importance of different tokens in the input sequence when generating text.
- Sparse Attention: An efficient attention mechanism that uses fewer input tokens as context for self-attention, reducing computational complexity.
- Subword Tokens: A tokenization scheme where tokens can represent parts of words, allowing for better handling of rare or unknown words.
- T5 model: Text-To-Text Transfer Transformer, a model that converts all NLP tasks into a text-to-text format.
- Temperature: A parameter that controls the randomness of the model’s output, with higher values leading to more diverse text.
- Tokenization: The process of breaking down text into smaller units (tokens) for processing by LLMs.
- Token Embedding: The process of converting tokens into numerical vectors that capture their semantic meaning.
- Token Probabilities: The likelihood of each token in the vocabulary being the next token in a sequence, used for text generation.
- Top-p Sampling (Nucleus Sampling): A text generation strategy that selects tokens from the smallest set whose cumulative probability exceeds a threshold p, allowing for more diverse outputs.
- Transformer: A neural network architecture that uses self-attention mechanisms to process sequences of data, widely used in LLMs.
- Transformer Blocks: The building blocks of the Transformer architecture, consisting of layers of attention and feed-forward neural networks.
- Trained Transformer LLMs: LLMs that have been trained on large datasets using the Transformer architecture, enabling them to understand and generate human language effectively.
- Word Tokens: A tokenization scheme where each token represents a whole word.
- word2vec: A technique that uses neural networks to learn word embeddings, capturing semantic relationships between words.
References: