by John M Costa, III

Book Notes: Hands On Large Language Models: Language Understanding and Generation

Overview

This post contains my notes on the book *Hands-On Large Language Models: Language Understanding and Generation*.

You can find the book on Amazon

I’ll be adding my notes to this post as I read through the book. The notes will be organized by chapter and will include key concepts, code examples, and any additional insights I find useful.

Chapter 1: An Introduction to Large Language Models

Chapter 1 introduces the reader to the recent history of Large Language Models (LLMs). The diagrams in this chapter are particularly useful for understanding the evolution of LLMs and how they relate to other AI technologies. It’s a great segue from the machine learning and neural networks covered in the previous book I’m reading in parallel.

Chapter 2: Tokens and Embeddings

Chapter 2 introduces the concept of tokens and embeddings.

Tokens are the basic units of text that LLMs use to process and generate language. The chapter covers a number of LLM tokenizers, including BERT (cased and uncased), GPT-2, FLAN-T5, StarCoder2, and a few others. It provides details on how each tokenizer works and how they differ from one another.
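As a quick illustration of how these tokenizers differ, here is a small sketch (my own, not from the book) that loads a few of them via Hugging Face’s AutoTokenizer and prints how each one splits the same sentence. The model identifiers are the Hub names I believe correspond to the tokenizers mentioned, so adjust them if they differ from the book’s choices.

```python
from transformers import AutoTokenizer

text = "Have the bards who preceded me left any theme unsung?"

# Hub identifiers assumed to match the tokenizers discussed in the chapter
for name in ["bert-base-cased", "bert-base-uncased", "gpt2", "google/flan-t5-small"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    print(f"{name}: {tokenizer.tokenize(text)}")
```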

Token embeddings are numerical representations of tokens that capture their semantic meaning. Embeddings can be used to represent sentences, paragraphs, or even entire documents. Further, embeddings can be used in Recommendation Systems. The chapter covers a song recommendation system that uses embeddings to recommend songs based on a song input by the user.
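Here is a minimal sketch of generating embeddings and comparing them by similarity, assuming the sentence-transformers package and the all-MiniLM-L6-v2 model (an illustrative choice, not necessarily the model used in the book).

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The cat sat on the mat.",
    "A feline rested on the rug.",
    "Stocks fell sharply today.",
]
embeddings = model.encode(sentences)  # one vector per sentence

# Semantically similar sentences should score higher than unrelated ones.
print(cosine_similarity([embeddings[0]], embeddings[1:]))
```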

Chapter 3: Looking Inside Large Language Models

Note: This chapter contains a number of useful diagrams that I’ve described in my own representation. However, the diagrams are not reproduced in their entirety. Please refer to the book for the complete diagrams and explanations.

Chapter 3 takes a deeper dive into the architecture of LLMs. We start out with a view into the Inputs and Outputs of Trained Transformer LLMs. This might be an overly simplified view, but it helps to understand the basic flow of data through an LLM.

transformer.highlevel.drawio.png

The transformer generates a single output token at a time, using the previous tokens as context. This is known as an autoregressive model.

Diving a little deeper, we learn about the Transformer architecture. It’s composed of a Tokenizer, a stack of Transformer blocks, and an LM Head.

transformer.components.drawio.png

Going further, the tokenizer breaks the input text into tokens drawn from its token vocabulary. The stack of transformer blocks operates on token embeddings tied to that vocabulary, and the LM head is a neural network layer that produces a probability for each token in the vocabulary.

transformer.forwardpass.drawio.png

Greedy decoding is when the model selects the token with the highest probability at each step.
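A rough sketch of greedy, autoregressive decoding with a small causal model (GPT-2 here, purely as an illustration): at each step the logits for the last position are taken, the highest-probability token is appended to the context, and the loop repeats.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(5):
        logits = model(input_ids).logits              # shape: (1, seq_len, vocab_size)
        next_id = logits[0, -1].argmax()              # greedy: pick the most probable token
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=-1)

print(tokenizer.decode(input_ids[0]))
```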

Multiple input tokens can be processed in parallel, and the number of tokens that can be processed at once is referred to as the context size. Keep in mind that embeddings are not the same as tokens; they are numerical representations of tokens that capture their semantic meaning.

transformer-Page-2.processing-stream.drawio.png

Keep in mind that only the last token in the sequence is used to generate the next token. Even so, the processing streams for the earlier tokens run in parallel and their results can be cached to improve efficiency.

Digging even deeper, we learn about the Transformer blocks. Each block consists of a self-attention mechanism and a feed-forward neural network.

transformer-Page-3.drawio.png

The feed-forward neural network is the source of learned information that enables the model to generate coherent text.

Attention is a key mechanism in LLMs that allows the model to focus on specific parts of the input sequence when generating text.

transformer-Page-3.simple-self-attention.drawio.png

Taking a closer look at attention, we see that we are getting to the core of how LLMs work.

Note: The book mentions projection matrices, which are shown in the diagram below. However, it doesn’t explain them in detail. If you’re interested in understanding projection matrices, a good resource that I found helpful is this article. The content appears to be a summarization of the original paper on self-attention, Attention Is All You Need. Another good resource is The Illustrated Transformer.

transformer-Page-3.relevance-scoring.drawio.png

Using the queries and keys, the model calculates relevance scores for each token in the input sequence. The scores are then multiplied by the values to produce the output vectors.
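To make the queries/keys/values flow concrete, here is a toy NumPy sketch of scaled dot-product self-attention. The projection matrices are random placeholders standing in for learned weights, so this only illustrates the shapes and the math, not a trained model.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_model = 4, 8
x = np.random.randn(seq_len, d_model)            # token embeddings (toy values)

# Random stand-ins for the learned query/key/value projection matrices
W_q, W_k, W_v = (np.random.randn(d_model, d_model) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

scores = softmax(Q @ K.T / np.sqrt(d_model))     # relevance of each token to every other token
output = scores @ V                              # relevance-weighted sum of value vectors
print(output.shape)                              # (4, 8)
```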

Attention is a powerful mechanism that allows the model to weigh the importance of different tokens in the input sequence when generating text. Newer LLMs use a more efficient attention mechanism called sparse attention, which can be strided or fixed; these mechanisms use fewer input tokens as context for self-attention.

Additionally, there are other attention mechanisms such as:

  • Grouped Query Attention (GQA)
  • Multi-Head Attention
  • Flash Attention

Chapter 4: Text Classification

The goal of text classification is to assign a label to a piece of text based on its content. Classification can be used for a variety of tasks, such as:

  • sentiment analysis
  • topic classification
  • spam detection
  • intent detection
  • detecting language

Techniques:

  • Text Classification with Representation Models
  • Text Classification with Generative Models

Text Classification with Representation Models

How it works:

  1. Base models are fine-tuned for specific tasks, such as classification or generating embeddings.

transformer-Page-4.fine-tuning.drawio.png

  2. The fine-tuned models are fed inputs and generate outputs specific to the task.

transformer-Page-4.fine-tuning.drawio.png

The chapter offers some suggestions for models that are well suited to text classification.

When looking to generate embeddings, the MTEB Leaderboard is a good resource for finding models.

To evaluate the performance of classification models that have labeled data, we can use metrics such as accuracy, precision, recall, and F1 score.
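A quick sketch of computing those metrics with scikit-learn, assuming you already have true labels and model predictions (the label arrays below are made up for illustration):

```python
from sklearn.metrics import accuracy_score, classification_report

y_true = [1, 0, 1, 1, 0, 1]      # labeled data
y_pred = [1, 0, 0, 1, 0, 1]      # model predictions

print(accuracy_score(y_true, y_pred))
print(classification_report(y_true, y_pred))     # precision, recall, F1 per class
```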

Zero-shot classification is a technique that allows us to classify text without any labeled data.

Using the cosine similarity function, we can compare the embeddings of the input text to the embeddings of the labels.
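A minimal sketch of that idea, assuming sentence-transformers and an illustrative model: embed the input text and a short description of each label, then assign the label whose embedding is closest by cosine similarity.

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")

text = "The battery died after only two hours of use."
labels = ["a negative review", "a positive review"]

text_emb = model.encode([text])
label_embs = model.encode(labels)

scores = cosine_similarity(text_emb, label_embs)[0]
print(labels[scores.argmax()])   # expected: "a negative review"
```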

Text Classification with Generative Models

Prompt engineering is the process of designing prompts that can effectively elicit the desired response from a generative model.

The T5 model is similar to the original Transformer architecture, using an encoder-decoder structure.
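As a sketch of classification with a generative, prompt-driven model, the snippet below uses the transformers pipeline with google/flan-t5-small (a T5-style encoder-decoder model); the prompt wording is my own, not the book’s.

```python
from transformers import pipeline

generator = pipeline("text2text-generation", model="google/flan-t5-small")

prompt = (
    "Is the following movie review positive or negative?\n"
    "Review: The plot dragged and the acting was wooden.\n"
    "Answer:"
)
print(generator(prompt)[0]["generated_text"])
```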

OpenAI’s GPT model training process is published here: https://openai.com/index/chatgpt/

Chapter 5: Text Clustering and Topic Modeling

Text clustering is the process of grouping similar pieces of text together based on their content, yielding clusters of semantically similar text.

Text clustering can be used for topic modeling, which is the process of identifying the main topics in a collection of text.

The book example uses ArXiv papers as the text corpus.

Common Pipeline for Text Clustering:

  1. Convert input documents -> embeddings w/ embedding model
  2. Reduce dimensionality w/ dimensionality reduction model
  3. Find groups of documents w/ cluster model (see the sketch after this list)
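A minimal sketch of this three-step pipeline, assuming the sentence-transformers, umap-learn, and hdbscan packages; the model and parameter choices are illustrative rather than the book’s exact settings.

```python
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN

docs = ["..."]  # replace with your corpus, e.g. ArXiv abstracts

# 1. Convert input documents to embeddings
embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(docs)

# 2. Reduce dimensionality of the embeddings
reduced = UMAP(n_components=5, metric="cosine").fit_transform(embeddings)

# 3. Find groups of documents with a clustering model
labels = HDBSCAN(min_cluster_size=15).fit(reduced).labels_
```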

Dimensionality Reduction

There are well-known methods for dimensionality reduction, including:

  • Principal Component Analysis (PCA)
  • Uniform Manifold Approximation and Projection (UMAP)

Clustering Algorithms

One example of a clustering algorithm is:

  • Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN)

Visualization of clusters can be done using tools like Matplotlib.

BERTopic: Modular Topic Modeling Framework

  1. Follow the same procedure as in text clustering to generate clusters
  2. Model the distribution over words (bag of words): use the frequency of words in each cluster to identify topics
  3. Use class-based term frequency inverse document frequency (c-TF-IDF) to identify words that are unique to each cluster

A full pipeline for topic modeling using BERTopic:

  • Clustering: SBERT -> UMAP -> HDBSCAN (embed docs -> reduce dimensionality -> cluster docs)
  • Topic representation: count vectorization -> c-TF-IDF (tokenize words -> weight words)
  • Reranking: representation model (fine-tune the representation)

BERTopic components can be used like Lego blocks to build custom pipelines.
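A sketch of wiring those components together with the bertopic package; the individual models mirror the pipeline above, but the specific choices here are illustrative assumptions rather than the book’s exact configuration.

```python
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer

docs = ["..."]  # replace with your corpus

topic_model = BERTopic(
    embedding_model=SentenceTransformer("all-MiniLM-L6-v2"),  # embed docs
    umap_model=UMAP(n_components=5, metric="cosine"),         # reduce dimensionality
    hdbscan_model=HDBSCAN(min_cluster_size=15),               # cluster docs
    vectorizer_model=CountVectorizer(stop_words="english"),   # tokenize words (bag of words)
)
topics, probs = topic_model.fit_transform(docs)
print(topic_model.get_topic_info())
```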

Chapter 6: Prompt Engineering

Basics of using a text generation model

  1. Choose a model, considering:
    • open source vs. proprietary
    • output control
    • suggestion: start with a small foundational model
  2. Load the model
  3. Control the output
    • set do_sample=True to use temperature and top_p
  4. Tune temperature and top_p for the use case (see the sketch after this list)
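A sketch of those steps with the transformers pipeline; the model name (microsoft/Phi-3-mini-4k-instruct) and the parameter values are assumptions for illustration, not prescriptions from these notes.

```python
from transformers import pipeline

# Load a small foundational model (model choice is an illustrative assumption)
generator = pipeline("text-generation", model="microsoft/Phi-3-mini-4k-instruct")

# Control the output via sampling parameters
output = generator(
    "Write a haiku about autumn.",
    max_new_tokens=50,
    do_sample=True,     # enables sampling so temperature and top_p take effect
    temperature=0.7,    # higher values -> more random output
    top_p=0.9,          # nucleus sampling threshold
)
print(output[0]["generated_text"])
```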

Intro to prompt engineering

Ingredients of a good prompt:

  • When no instructions are given, the model will try to predict the next word based on the input text.
  • Two components of basic instructions:
    1. Task description
    2. Input text (data)
  • Extending the prompt with an output indicator allows for more specific output

Use cases for instruction based prompts:

  • Supervised classification
  • Search
  • Summarization
  • Code generation
  • Named entity recognition

Techniques for improving prompts:

  • Specificity
  • Hallucination mitigation
  • Order

Complex prompt components:

  • Persona
  • Instruction
  • Context
  • Format
  • Audience
  • Tone
  • Data

In-context learning:

  • Zero-shot learning
  • One-shot learning
  • Few-shot learning

Chain prompting:

  • Break the task into smaller sub-tasks and use the output of one prompt as the input to the next prompt (see the sketch after this list).
  • Useful for:
    • Response validation
    • Parallel prompts
    • Writing stories
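A minimal sketch of chain prompting, using a small text-generation pipeline purely as a stand-in for whatever model you use; the prompts and the `ask` helper are hypothetical.

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

def ask(prompt: str) -> str:
    # return_full_text=False keeps only the newly generated continuation
    return generator(prompt, max_new_tokens=60, return_full_text=False)[0]["generated_text"]

# Step 1: generate a title; Step 2: feed that title into the next prompt.
title = ask("Suggest a title for a short story about a lighthouse keeper.\nTitle:")
outline = ask(f"Write a three-point outline for a story titled {title}.\nOutline:")
print(outline)
```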

Reasoning with Generative Models

Chain of thought:

  • Prompt the model to think step-by-step

Self-consistency:

  • using the same prompt multiple times to generate multiple responses
  • works best with temperature and top_p sampling

Tree of thought:

  • useful when needing to explore multiple paths to a solution
  • ask the model to mimic multiple agents working together to solve a problem
  • question each other until they reach a consensus

Output Verification

  • Useful for:
    • Structured output
    • Valid output
    • ethics
    • accuracy

Techniques:

  • Provide examples of valid output

Grammar: constrained sampling

  • Use packages such as:
    • Guidance
    • Guardrails
    • LMQL

Taxonomy

  • Accuracy: A metric used to evaluate the performance of classification models, measuring the proportion of correct predictions.
  • Attention: A mechanism that allows models to focus on specific parts of the input sequence, improving context understanding.
  • Autoregressive Models: Models that generate text by predicting the next token in a sequence based on the previous tokens.
  • Bag of Words (BoW): A simple representation of text that ignores grammar and word order but keeps track of word frequency.
  • BERT (Bidirectional Encoder Representations from Transformers): A pre-trained model that uses a Transformer architecture to understand the context of words in a sentence.
  • Byte Tokens: A tokenization scheme that represents text as a sequence of bytes, allowing for a more compact representation.
  • Character Tokens: A tokenization scheme where each token represents a single character, useful for languages with complex morphology.
  • Context Size: The number of tokens the model can consider at once when generating text, affecting its ability to maintain coherence.
  • Embeddings: Numerical representations of words or tokens that capture their semantic meaning and relationships.
  • F1 Score: A metric that combines precision and recall to evaluate the performance of classification models.
  • Feed-Forward Neural Network: A type of neural network where connections between nodes do not form cycles, used in Transformer blocks.
  • Flash Attention: An efficient attention mechanism that reduces memory usage and speeds up computation in LLMs.
  • GPT (Generative Pre-trained Transformer): A type of LLM that is pre-trained on a large corpus of text and can generate coherent text based on a given prompt.
  • Greedy Decoding: A text generation strategy where the model selects the token with the highest probability at each step.
  • Grouped Query Attention (GQA): An attention mechanism that uses a single set of keys and values for multiple queries, improving efficiency.
  • Inverse Document Frequency (IDF): A measure of how important a word is to a document in a collection of documents.
  • Large Language Models (LLMs): A type of AI model that is trained on large datasets to understand and generate human language.
  • LM Head: The final layer of an LLM that generates the output tokens based on the processed input.
  • Multi-Head Attention: An attention mechanism that allows the model to focus on different parts of the input sequence simultaneously.
  • Output vectors: The numerical representations of the output tokens generated by the LLM.
  • Parallel Processing: The ability to process multiple input tokens simultaneously, improving efficiency.
  • Precision: A metric used to evaluate the performance of classification models, measuring the proportion of positive predictions that are correct.
  • Recall: A metric used to evaluate the performance of classification models, measuring the ability to identify all relevant instances.
  • Self-Attention: A mechanism that allows the model to weigh the importance of different tokens in the input sequence when generating text.
  • Sparse Attention: An efficient attention mechanism that uses fewer input tokens as context for self-attention, reducing computational complexity.
  • Subword Tokens: A tokenization scheme where tokens can represent parts of words, allowing for better handling of rare or unknown words.
  • T5 model: Text-To-Text Transfer Transformer, a model that converts all NLP tasks into a text-to-text format.
  • Temperature: A parameter that controls the randomness of the model’s output, with higher values leading to more diverse text.
  • Tokenization: The process of breaking down text into smaller units (tokens) for processing by LLMs.
  • Token Embedding: The process of converting tokens into numerical vectors that capture their semantic meaning.
  • Token Probabilities: The likelihood of each token in the vocabulary being the next token in a sequence, used for text generation.
  • Top-p Sampling (Nucleus Sampling): A text generation strategy that selects tokens from the smallest set whose cumulative probability exceeds a threshold p, allowing for more diverse outputs.
  • Transformer: A neural network architecture that uses self-attention mechanisms to process sequences of data, widely used in LLMs.
  • Transformer Blocks: The building blocks of the Transformer architecture, consisting of layers of attention and feed-forward neural networks.
  • Trained Transformer LLMs: LLMs that have been trained on large datasets using the Transformer architecture, enabling them to understand and generate human language effectively.
  • Word Tokens: A tokenization scheme where each token represents a whole word.
  • word2vec: A technique that uses neural networks to learn word embeddings, capturing semantic relationships between words.
