Large Language Models and Generative AI Handbook

A large language model (LLM) is a type of machine learning model trained on large amounts of data with deep learning algorithms to understand and generate language. It can write text that reads as if it were written by a human, answer questions, and translate text from one language to another.

Transformer Architecture

  • What is the transformer architecture? - The transformer architecture is a deep neural network architecture introduced in the paper "Attention Is All You Need" by Vaswani et al.
  • How does the transformer architecture work? - The transformer architecture uses self-attention to compute representations of input sequences (see the sketch after this list).
  • What are some applications of the transformer architecture? - The transformer architecture has been used for a wide range of applications, including machine translation, text classification, and text generation.
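
A minimal sketch of single-head scaled dot-product attention in NumPy; the projection matrices and the toy input are illustrative placeholders, not taken from any specific model.

```python
import numpy as np

def scaled_dot_product_attention(X, W_q, W_k, W_v):
    # Project the input sequence into queries, keys, and values.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    # Attention weights: row-wise softmax of scaled query-key similarities.
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output position is a weighted average of the value vectors.
    return weights @ V

# Toy example: a sequence of 4 tokens with 8-dimensional embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(scaled_dot_product_attention(X, W_q, W_k, W_v).shape)  # (4, 8)
```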

Transformer and NN Basics

  • GPT-2 implementation from scratch by Andrej Karpathy. Video

  • GPT (generative pre-trained transformer) created by Andrej Karpathy for educational purposes. Video

  • llama3 from scratch

Stages of building an LLM

Figure: stages of building an LLM (image courtesy of the original source).

Building an LLM starts with the pre-training stage, where the model, being an auto-regressive language model, learns to predict the next token. Pre-training is done on an internet-scale dataset.
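
As a sketch of this next-token objective (assuming PyTorch; the random logits stand in for a real model's output):

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len = 1000, 16
tokens = torch.randint(vocab_size, (1, seq_len))  # one tokenized document
logits = torch.randn(1, seq_len, vocab_size)      # model(tokens) in practice

# Predict token t+1 from positions <= t: shift logits and targets by one.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),  # predictions for positions 0..L-2
    tokens[:, 1:].reshape(-1),               # targets are the next tokens
)
print(loss.item())
```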

Scaling law

Scaling laws identify the relationship among model parameters, dataset size, and compute budget. There is a power-law relationship, meaning one quantity is proportional to the other raised to a constant power:

  • Test loss vs. compute budget
  • Test loss vs. dataset size (number of tokens)
  • Test loss vs. number of model parameters

Refer to Chinchilla scaling law: Training Compute-Optimal Large Language Models
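
The Chinchilla paper fits a parametric loss of the form \( L(N, D) = E + A/N^{\alpha} + B/D^{\beta} \), where N is the number of parameters and D the number of training tokens. A small sketch using the constants reported in that paper (treat the numbers as the paper's fit, not universal values):

```python
# Chinchilla parametric loss, L(N, D) = E + A / N**alpha + B / D**beta,
# with constants as reported by Hoffmann et al. (2022).
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def chinchilla_loss(n_params, n_tokens):
    # Predicted pre-training loss for a model with n_params parameters
    # trained on n_tokens tokens.
    return E + A / n_params**alpha + B / n_tokens**beta

# Doubling the data at a fixed model size lowers the predicted loss.
print(chinchilla_loss(70e9, 1.4e12))  # roughly Chinchilla's own configuration
print(chinchilla_loss(70e9, 2.8e12))
```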

The following scaling law was followed during pre-training of BloombergGPT. Figure: BloombergGPT scaling policy.

Prompt optimization for the LLM application layer

TextGrad

Figure: TextGrad overview.

TextGrad offers a PyTorch-style API designed for automatic text-based optimization. It can automatically transform a simple prompt (e.g., one that classifies data) into a more sophisticated prompt suited to the application. TextGrad does this by backpropagating text feedback from the output of a language model to all earlier components in the pipeline. TextGrad uses language models to (1) evaluate outputs, (2) criticize them, and (3) update the inputs.
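
A minimal usage sketch adapted from the TextGrad project's README; treat the exact class and engine names as assumptions, since the API may change across versions:

```python
import textgrad as tg

tg.set_backward_engine("gpt-4o")  # LLM that produces the textual "gradients"

model = tg.BlackboxLLM("gpt-4o")
question = tg.Variable(
    "If it takes 1 hour to dry 25 shirts in the sun, how long for 30 shirts?",
    role_description="question to the LLM",
    requires_grad=False,
)
answer = model(question)
answer.set_role_description("concise and accurate answer to the question")

# A natural-language "loss": another LLM call that critiques the answer.
loss_fn = tg.TextLoss("Evaluate the answer; be logical and very critical.")
loss = loss_fn(answer)
loss.backward()  # backpropagate text feedback to the answer variable

optimizer = tg.TGD(parameters=[answer])
optimizer.step()  # rewrite the answer using the accumulated feedback
```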

Recursively criticizes and improves

Metaprompt

DSPy

Anthropic's Prompt Engineering Interactive Tutorial

Google Sheets-based interactive exercises for experimenting with different prompts

AI agents

LLM Finetuning

OpenAI's best practices for finetuning

This guide focuses on GPT-3, but its recommendations are applicable to full finetuning in general. It covers how finetuning works, how to prepare training data, and more.

LlamaFactory

LlamaFactory provides a unified and efficient fine-tuning framework for a wide range of large language models (LLMs). By integrating various efficient training methods and supporting over 100 LLMs, LlamaFactory allows users to easily adapt these models to different downstream tasks.

Supervised Fine-Tuning

  • What is fine-tuning? - Fine-tuning is the process of adapting a pre-trained language model to a specific task or domain.
  • How does fine-tuning work? - Fine-tuning involves training the pre-trained language model on a small amount of task-specific data.
  • What are some applications of fine-tuning? - Fine-tuning can be used for a wide range of applications, including sentiment analysis, named entity recognition, and text classification.

Fine-tuning examples:
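
A minimal supervised fine-tuning sketch, assuming the Hugging Face transformers and datasets libraries; the model name, data file, and hyperparameters are illustrative placeholders:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "gpt2"  # placeholder; any causal LM checkpoint works similarly
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Task-specific text data, one example per line in a local file.
dataset = load_dataset("text", data_files={"train": "train.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sft-out", num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=dataset,
    # mlm=False gives the standard next-token (causal LM) objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```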

RLHF: Reinforcement Learning from Human Feedback

Reward model (RM): the RM is used to model human feedback. The reward model is also a language model, except that its last layer is a linear layer that outputs a scalar reward value. Given two completions \(y_{i}\) and \(y_{j}\) for the same prompt, the objective is to model the probability \(p_{ij}\), the confidence that \(y_{i}\) is better than \(y_{j}\):

\( p_{ij} = \sigma\big(r_{\theta}(x, y_{i}) - r_{\theta}(x, y_{j})\big) \), and the reward model is trained to minimize \( -\log p_{ij} \) over human preference pairs (the standard Bradley-Terry formulation).
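
In code, this pairwise objective is a logistic loss on reward differences; a minimal PyTorch sketch with toy reward values:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_preferred, r_rejected):
    # Scalar rewards for the human-preferred completion y_i and the
    # rejected completion y_j; minimizing -log sigmoid(r_i - r_j)
    # pushes p_ij toward 1.
    return -F.logsigmoid(r_preferred - r_rejected).mean()

# Toy batch of reward pairs (in practice, outputs of the reward model head).
r_i = torch.tensor([1.2, 0.3, 2.0])
r_j = torch.tensor([0.4, 0.9, 1.5])
print(reward_model_loss(r_i, r_j))
```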

Proximal policy optimization (PPO): next, we apply reinforcement learning to train a policy (the language model's parameters) to generate text with higher reward according to the reward model. We sample many prompts and use the language model to generate completions for them. The objective is to maximize the expected reward, i.e., the summation of completion rewards weighted by the probability of each completion.

Proximal policy optimization (PPO) is used to compute the gradient. An iterative algorithm such as gradient ascent is used to solve the optimization objective.

Steps of policy-gradient training:

  • Initialize the parameters of the language model from the supervised fine-tuned (SFT) model
  • Sample prompts from the dataset and generate completions with the current policy
  • Calculate the reward for each prompt and completion with the reward model
  • Calculate gradients and update the parameters of the language model

Regularization: vanilla policy gradients can over-optimize against the reward model. A per-token KL-divergence penalty from the SFT distribution is added as regularization, to keep some of the variation of the SFT model.

\( R(x, y) = r_{\theta}(x, y) - \beta \, \log \frac{\pi^{\mathrm{RL}}(y \mid x)}{\pi^{\mathrm{SFT}}(y \mid x)} \)
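
A schematic of one policy-gradient step with this penalized reward; policy_model, sft_model, reward_model, sample_prompts, and optimizer are hypothetical placeholders, and the update shown is REINFORCE-style (real PPO additionally clips the policy ratio):

```python
import torch

beta = 0.02  # KL penalty coefficient (illustrative value)

def rlhf_step(batch_size):
    prompts = sample_prompts(batch_size)  # hypothetical helper
    # 1) Generate completions and per-token log-probs with the current policy.
    completions, logprobs = policy_model.generate_with_logprobs(prompts)
    # 2) Score prompt/completion pairs with the frozen reward model,
    #    and get the frozen SFT model's log-probs for the KL penalty.
    with torch.no_grad():
        rewards = reward_model(prompts, completions)
        sft_logprobs = sft_model.logprobs(prompts, completions)
    # 3) Penalized reward: subtract the summed per-token KL estimate.
    kl = (logprobs.detach() - sft_logprobs).sum(dim=-1)
    total_reward = rewards - beta * kl
    # 4) Maximize expected reward: weight sequence log-prob by the reward.
    loss = -(total_reward * logprobs.sum(dim=-1)).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```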

Mixture of Experts

LangChain handbooks

Retrieval augmented generation (RAG)

Figure: overview from a survey on retrieval-augmented generation.

Online course

Evaluation of LLM

Knowledge representation and Hallucination

  • Hallucination refers to mistakes in the generated text that are statistically plausible but factually incorrect or nonsensical
  • Stanford lecture

Additional case study

Interview resource

โ€ข ๐——๐—ผ ๐—ป๐—ผ๐˜ ๐—ท๐˜‚๐˜€๐˜ ๐—ณ๐—ผ๐—ฐ๐˜‚๐˜€ ๐—ผ๐—ป ๐˜„๐—ต๐—ฎ๐˜ ๐—ฏ๐˜‚๐˜ ๐—ฎ๐—น๐˜€๐—ผ ๐˜„๐—ต๐˜† โ€“ I have seen many people focus on what they have done in a project but miss out on why. Gen AI is very similar to Machine Learning; itโ€™s process-driven & not outcome-driven. If someone says that linear regression gave 90% accuracy in their project, you canโ€™t challenge that as every data set is different. Similarly, in Gen AI, itโ€™s about the process.

  - Why did you choose a vector DB?
  - Why did you do fine-tuning vs. RAG?
  - Why did you do a hybrid search & not just a keyword-based search?
  - Why did you choose an ANN algorithm?

โ€ข ๐—จ๐—ป๐—ฑ๐—ฒ๐—ฟ๐˜€๐˜๐—ฎ๐—ป๐—ฑ๐—ถ๐—ป๐—ด ๐—ถ๐—ป๐˜๐—ฒ๐—ฟ๐—ป๐—ฎ๐—น ๐˜„๐—ผ๐—ฟ๐—ธ๐—ถ๐—ป๐—ด๐˜€ ๐—ผ๐—ณ ๐—Ÿ๐—Ÿ๐— ๐˜€ โ€“ It is important to understand how LLMs work, including the attention mechanism, and the latest research around internal components like tokenizers, position encoding, etc.

โ€ข ๐—˜๐˜ƒ๐—ฎ๐—น๐˜‚๐—ฎ๐˜๐—ถ๐—ผ๐—ป ๐—ถ๐˜€ ๐—ฎ๐—น๐—น ๐˜†๐—ผ๐˜‚ ๐—ป๐—ฒ๐—ฒ๐—ฑ โ€“ I have seen a lot of cases where people have production LLM use cases without evaluation; they donโ€™t know the evaluation numbers. Evaluation is not just about RAGAS numbers but also about latency, throughput, time to first token, and cost per query.

โ€ข ๐——๐—ผ๐—ฐ๐˜‚๐—บ๐—ฒ๐—ป๐˜ ๐—ฑ๐—ถ๐—ด๐—ถ๐˜๐—ถ๐˜‡๐—ฎ๐˜๐—ถ๐—ผ๐—ป & ๐—ฝ๐—ฟ๐—ฒ-๐—ฝ๐—ฟ๐—ผ๐—ฐ๐—ฒ๐˜€๐˜€๐—ถ๐—ป๐—ด โ€“ A lot of people miss out on this, but it is the most important step in any Gen AI project. Many donโ€™t know how to handle large tables, charts, and graphs.

โ€ข ๐—™๐—ถ๐—ป๐—ฒ-๐˜๐˜‚๐—ป๐—ถ๐—ป๐—ด ๐—Ÿ๐—Ÿ๐— ๐˜€ ๐—ถ๐˜€ ๐—ฎ๐—ป ๐—ถ๐—บ๐—ฝ๐—ผ๐—ฟ๐˜๐—ฎ๐—ป๐˜ ๐˜€๐—ธ๐—ถ๐—น๐—น โ€“ While talking with a few senior folks from big tech giants, the market trend is moving towards fine-tuning language models. Itโ€™s not just about LoRA and QLORA, which are important to learn theoretically, but also:

  - How many data points did you use to tune the LLM?
  - How did you create the data?
  - GPU size estimation?
  - Parallelism techniques.

โ€ข ๐—Ÿ๐—Ÿ๐—  ๐˜€๐˜†๐˜€๐˜๐—ฒ๐—บ ๐—ฑ๐—ฒ๐˜€๐—ถ๐—ด๐—ป โ€“ This is supercritical. LLM engineering is just part of the entire system. Focus not only on LLM models/agents but also on observability and monitoring, scaling, cost, and latency optimization.

Byte Pair Encoding (BPE)

Byte Pair Encoding (BPE) and SentencePiece are both subword tokenization methods used in NLP. BPE is a data-driven method that iteratively replaces the most frequent pair of adjacent symbols in a sequence with a single new symbol. Subword tokenization with BPE helps to effectively tackle the problem of out-of-vocabulary words. SentencePiece is an unsupervised text tokenizer and detokenizer, mainly for neural-network-based text generation systems, where the vocabulary size is predetermined before neural model training.
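
The core BPE merge loop fits in a few lines; this sketch follows the classic example from Sennrich et al. (2016), with a toy vocabulary of space-separated symbols:

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    # Count adjacent symbol pairs across the (word -> frequency) vocabulary.
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge_pair(pair, vocab):
    # Rewrite every word, fusing occurrences of `pair` into one symbol.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq
            for word, freq in vocab.items()}

# Toy corpus: words are space-separated symbols ending with </w>.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}
for _ in range(5):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    print(best)  # the merge learned at this step, e.g. ('e', 's')
```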

Positional embedding

Positional embeddings are added to the word embeddings once, before the first layer. Each position t in the sequence gets a distinct embedding: even embedding dimensions use a sine function and odd dimensions use a cosine function, with a different frequency for each dimension pair.
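
A sketch of the sinusoidal encoding from "Attention Is All You Need" in NumPy; seq_len and d_model are free parameters here, and d_model is assumed even:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE[t, 2i] = sin(t / 10000**(2i / d_model)); PE[t, 2i+1] = cos(same angle)
    positions = np.arange(seq_len)[:, None]    # shape (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]   # even dimension indices 2i
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions: cosine
    return pe

# Added (not concatenated) to the token embeddings before the first layer.
print(sinusoidal_positional_encoding(seq_len=50, d_model=64).shape)  # (50, 64)
```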

Dataset generation

Written on May 6, 2023