Large Language Models and Generative AI Handbook

A large language model (LLM) is a type of machine learning model that works with natural language: it can generate text that reads as if written by a human, answer questions, and translate text from one language to another. LLMs are trained on large amounts of data with deep learning algorithms to model language. 😎

Transformer Architecture

  • What is the transformer architecture? - The transformer architecture is a deep neural network architecture introduced in the paper “Attention Is All You Need” by Vaswani et al.
  • How does the transformer architecture work? - The transformer architecture uses self-attention to compute representations of input sequences (a minimal sketch follows this list).
  • What are some applications of the transformer architecture? - The transformer architecture has been used for a wide range of applications, including machine translation, text classification, and text generation.
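To make the self-attention step concrete, here is a minimal sketch of single-head scaled dot-product self-attention in PyTorch. The function name, tensor shapes, and random weights are illustrative, not taken from any particular implementation.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Minimal single-head scaled dot-product self-attention.

    x: (seq_len, d_model) input embeddings
    w_q, w_k, w_v: (d_model, d_head) projection matrices
    """
    q = x @ w_q                                 # queries
    k = x @ w_k                                 # keys
    v = x @ w_v                                 # values
    d_head = q.size(-1)
    scores = q @ k.T / d_head ** 0.5            # pairwise attention scores
    weights = F.softmax(scores, dim=-1)         # each row sums to 1
    return weights @ v                          # weighted sum of values

# Toy usage with random weights (illustrative only).
torch.manual_seed(0)
d_model, d_head, seq_len = 16, 8, 5
x = torch.randn(seq_len, d_model)
out = self_attention(x, torch.randn(d_model, d_head),
                     torch.randn(d_model, d_head),
                     torch.randn(d_model, d_head))
print(out.shape)  # torch.Size([5, 8])
```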

Transformer and NN Basics

  • GPT-2 implementation from scratch by Andrej Karpathy. Video

  • GPT (generative pre-trained transformer) created by Andrej Karpathy for educational purposes. Video

  • llama3 from scratch

Stages of building an LLM

(Figure: stages of building an LLM. Image courtesy of the original source.)

Building an LLM starts with the pre-training stage, where the model, as an auto-regressive language model, learns to predict the next token given the preceding tokens. Pre-training is done on an internet-scale dataset.
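A minimal sketch of this next-token-prediction objective, assuming a toy embedding-plus-linear model in place of a real transformer; the vocabulary size, batch, and data are made up for illustration.

```python
import torch
import torch.nn as nn

# Toy auto-regressive setup: predict token t+1 from tokens <= t.
vocab_size, d_model, seq_len, batch = 100, 32, 16, 4
embed = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size)

tokens = torch.randint(0, vocab_size, (batch, seq_len))   # stand-in for a text batch
inputs, targets = tokens[:, :-1], tokens[:, 1:]           # shift targets by one position

hidden = embed(inputs)                 # a real LLM would apply transformer blocks here
logits = lm_head(hidden)               # (batch, seq_len - 1, vocab_size)

# Cross-entropy over the vocabulary at every position = next-token prediction loss.
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1)
)
loss.backward()                        # gradients for an optimizer step
print(loss.item())
```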

Scaling law

Scaling laws describe the relationship among model parameters, dataset size, and compute budget. The relationship is a power law, meaning one quantity is proportional to the other raised to a constant power:

  • Test loss vs. compute budget
  • Test loss vs. dataset size (number of tokens)
  • Test loss vs. number of model parameters

Refer to the Chinchilla scaling law paper: Training Compute-Optimal Large Language Models
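The Chinchilla paper fits test loss with a parametric power law in the number of parameters N and training tokens D. The sketch below uses that functional form; the constants are placeholder values for illustration, not the paper's fitted numbers.

```python
def scaling_law_loss(n_params, n_tokens, E=1.7, A=400.0, B=400.0, alpha=0.34, beta=0.28):
    """Parametric power-law form L(N, D) = E + A / N**alpha + B / D**beta.

    The constants here are illustrative placeholders, not the paper's fitted values.
    """
    return E + A / n_params**alpha + B / n_tokens**beta

# Loss improves as either parameters or tokens grow (toy numbers).
print(scaling_law_loss(1e9, 2e10))    # 1B parameters, 20B tokens
print(scaling_law_loss(7e9, 1.4e11))  # 7B parameters, 140B tokens
```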

The following scaling law was followed during the pre-training of BloombergGPT. (Figure: BloombergGPT scaling policy.)

Fine-tuning Framework for LLMs

llamafactory

LlamaFactory provides a unified and efficient fine-tuning framework for a wide range of large language models (LLMs). By integrating various efficient training methods and supporting over 100 LLMs, LlamaFactory allows users to easily adapt these models to different downstream tasks.

Supervised Fine-Tuning

  • What is fine-tuning? - Fine-tuning is the process of adapting a pre-trained language model to a specific task or domain.
  • How does fine-tuning work? - Fine-tuning involves training the pre-trained language model on a small amount of task-specific data.
  • What are some applications of fine-tuning? - Fine-tuning can be used for a wide range of applications, including sentiment analysis, named entity recognition, and text classification.

Fine-tuning examples:
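As a concrete example, supervised fine-tuning typically computes the loss only on the response tokens and masks out the prompt. The sketch below shows that masking in plain PyTorch; the token IDs and logits are toy stand-ins for a real tokenizer and model.

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100   # positions with this label are excluded from the loss

# Toy example: a prompt followed by a target response (IDs are made up).
prompt_ids   = torch.tensor([11, 42, 7])        # e.g. the instruction tokens
response_ids = torch.tensor([88, 23, 5, 2])     # e.g. the answer tokens

input_ids = torch.cat([prompt_ids, response_ids])
labels = input_ids.clone()
labels[: len(prompt_ids)] = IGNORE_INDEX        # do not train on the prompt tokens

# Stand-in for model logits: (seq_len, vocab_size); a real model produces these.
vocab_size = 100
logits = torch.randn(len(input_ids), vocab_size, requires_grad=True)

# Shift so position t predicts token t+1, as in causal language modeling.
loss = F.cross_entropy(logits[:-1], labels[1:], ignore_index=IGNORE_INDEX)
loss.backward()
print(loss.item())
```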

RLHF: Reinforcement Learning from Human Feedback

Reward model (RM): The RM is used to model human feedback. The reward model is also a language model, except that the last layer is a linear layer that outputs a scalar reward value. Given two completions \(y_{i}\) and \(y_{j}\), the objective is to model the probability \(p_{ij}\), which denotes the confidence that \(y_{i}\) is better than \(y_{j}\):

\[ p_{ij} = \sigma\big(r(x, y_{i}) - r(x, y_{j})\big) \]

where \(r(x, y)\) is the scalar reward for completion \(y\) given prompt \(x\), and \(\sigma\) is the sigmoid function. The RM is trained to maximize \(\log p_{ij}\) over human-labeled preference pairs.
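A minimal sketch of this pairwise objective: the loss \(-\log \sigma(r_{i} - r_{j})\) pushes the reward of the preferred completion above the rejected one. The reward values below are placeholders for the outputs of a real reward model.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(reward_chosen, reward_rejected):
    """-log sigma(r(x, y_i) - r(x, y_j)) for y_i preferred over y_j."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Placeholder scores a reward model might assign to a batch of completion pairs.
r_chosen   = torch.tensor([1.2, 0.3, 2.0], requires_grad=True)
r_rejected = torch.tensor([0.4, 0.9, -0.5], requires_grad=True)

loss = pairwise_reward_loss(r_chosen, r_rejected)
loss.backward()   # gradients flow back into the reward model's parameters
print(loss.item())
```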

Proximal policy optimization (PPO): Next, we apply reinforcement learning to train the policy (the language model parameters) to generate text that scores higher under the reward model. Many prompts are sampled, and the language model generates completions for them. The objective is to maximize the expected reward, i.e., the sum of completion rewards weighted by the probability of each completion.

Proximal policy optimization (PPO) is used to compute the gradient, and an iterative algorithm such as gradient ascent is used to solve the optimization objective.

Steps of policy-gradient training:

  • Initialize the parameters of the language model from the supervised fine-tuned (SFT) model
  • Sample prompts from the dataset and generate completions for them using the current policy
  • Calculate the reward for each prompt and completion using the reward model
  • Calculate gradients and update the parameters of the language model

Regularization: The vanilla policy gradient can over-optimize against the reward model. A per-token KL-divergence penalty with respect to the SFT distribution is added as regularization, which keeps the generated text close to the SFT model's distribution:

\[ R(x, y) = r_{\phi}(x, y) - \beta \log \frac{\pi_{\theta}(y \mid x)}{\pi_{\mathrm{SFT}}(y \mid x)} \]

where \(r_{\phi}\) is the reward model, \(\pi_{\theta}\) is the current policy, \(\pi_{\mathrm{SFT}}\) is the SFT model, and \(\beta\) controls the strength of the KL penalty.
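Putting these pieces together, here is a heavily simplified REINFORCE-style sketch (not full PPO) of a single policy-gradient step using the KL-penalized reward; all tensors are toy stand-ins for actual model outputs, and the coefficient beta is an illustrative value.

```python
import torch

torch.manual_seed(0)
beta = 0.1   # KL penalty coefficient (illustrative value)

# Toy per-token log-probabilities of one sampled completion under the current
# policy and under the frozen SFT reference model (stand-ins for model outputs).
logp_policy = torch.randn(6, requires_grad=True)   # needs gradients for the update
logp_ref    = torch.randn(6)                       # reference model is frozen

reward_rm = torch.tensor(1.5)                      # scalar score from the reward model

# KL-penalized reward: discourage drifting too far from the SFT distribution.
kl_per_token = (logp_policy - logp_ref).detach()   # treated as part of the reward signal
total_reward = reward_rm - beta * kl_per_token.sum()

# REINFORCE-style objective: increase log-probability of high-reward completions.
loss = -(total_reward * logp_policy.sum())
loss.backward()
print(loss.item(), logp_policy.grad)
```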

Mixture of experts

LangChain handbooks

Retrieval augmented generation (RAG)

rag-survey

Online course

Evaluation of LLMs

Knowledge representation and Hallucination

  • Hallucination refers to mistakes in the generated text that are statistically plausible but are in fact incorrect or nonsensical
  • Stanford lecture

Application

Byte Pair Encoding (BPE)

Byte Pair Encoding (BPE) and SentencePiece are both subword tokenization methods used in NLP. BPE is a data-driven method that iteratively replaces the most frequent pair of adjacent symbols (originally bytes) with a single new symbol. Subword tokenization with BPE helps to handle out-of-vocabulary words effectively. SentencePiece is an unsupervised text tokenizer and detokenizer, mainly for neural-network-based text generation systems, where the vocabulary size is predetermined before training the neural model.
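A minimal sketch of BPE training in Python: count adjacent symbol pairs across a toy corpus and merge the most frequent pair, repeating for a fixed number of merges. The corpus and number of merges are made up for illustration.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus (word -> frequency)."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: words split into characters, with frequencies.
corpus = {tuple("lower"): 2, tuple("lowest"): 1, tuple("low"): 5}
for _ in range(3):                         # perform 3 merges (illustrative)
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(corpus, pair)
    print("merged", pair, "->", corpus)
```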

Positional embedding

Positional embeddings are added to the word embeddings once, before the first layer. Each position t in the sequence gets a different embedding: in the sinusoidal scheme, even embedding dimensions use a sine function of the position and odd dimensions use a cosine function.
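A minimal sketch of the sinusoidal positional encoding from “Attention Is All You Need”, where even embedding dimensions use sine and odd dimensions use cosine of position-dependent frequencies; the sequence length and model dimension are illustrative.

```python
import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000**(2i/d_model)); PE[pos, 2i+1] = cos(same angle)."""
    positions = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    dims = torch.arange(0, d_model, 2, dtype=torch.float32)               # even dimension indices
    angles = positions / torch.pow(10000.0, dims / d_model)               # (seq_len, d_model/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = torch.cos(angles)   # odd dimensions use cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=4, d_model=8)
print(pe.shape)   # torch.Size([4, 8]); added to token embeddings before the first layer
```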

Dataset generation

Written on May 6, 2023