This is the second article of my 75 Days Of Generative AI series, where I build, learn, and try to unravel the mysteries of Generative AI and LLMs for myself and the world! For the basics of LLM architecture, check out my previous article.
Anyone who has used ChatGPT or a similar AI-based tool will vouch that it seems almost magical how it can answer whatever questions or instructions are thrown at it. Turning a raw language model into such a useful assistant, however, involves multiple steps. I will try to cover my understanding of these parts of the journey in today's and tomorrow's articles.
These steps include:
Pre-training models
Supervised Fine-Tuning
Preference Alignment
Evaluation of model - Next Article
Quantization - Next Article
Pre-training LLM Models
Pre-training is a very long and costly process. It's good to have some understanding of it, but it is not mandatory for someone trying to build something using existing models. Why? The computational power and dataset required for this step are out of reach for most individuals. For example, TinyLlama, a model with only 1.1B parameters, was pre-trained on 16 A100-40G GPUs running for around 90 days!
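To make "pre-training" less abstract, here is a minimal sketch of the objective it optimizes, next-token prediction, using a small off-the-shelf GPT-2 checkpoint purely as a stand-in (the model name and example text are illustrative assumptions, not an actual pre-training setup):

```python
# Minimal sketch of the pre-training objective: next-token prediction.
# Real pre-training runs this loss over trillions of tokens on large GPU clusters.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative stand-in checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

batch = tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
# Passing labels=input_ids makes the model compute the causal LM
# (next-token prediction) cross-entropy loss internally.
outputs = model(**batch, labels=batch["input_ids"])
print(f"Next-token prediction loss: {outputs.loss.item():.3f}")
```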
Supervised Fine-Tuning
Pre-trained language models, while powerful, are initially limited to next-token prediction, which doesn't directly make them helpful assistants. Supervised Fine-Tuning (SFT) addresses this limitation by adapting these models to follow specific instructions and perform various tasks.
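To see what SFT data looks like, here is a hypothetical instruction/response pair formatted into a single training string; the Alpaca-style template and the example text are purely illustrative assumptions:

```python
# A hypothetical SFT example: an instruction/response pair formatted into a
# single training string. The exact template varies by model.
sample = {
    "instruction": "Summarize the following sentence in five words.",
    "input": "Large language models are trained on vast amounts of text data.",
    "output": "LLMs learn from massive text.",
}

prompt = (
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
).format(**sample)

print(prompt)  # the model is fine-tuned to predict the tokens of the response
```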
You could do full fine-tuning by training all the parameters in the model; it tends to produce slightly better results, but it is far less efficient. The two most commonly used parameter-efficient techniques are:
LoRA (Low-Rank Adaptation):
Instead of changing all the model's parameters, LoRA only updates a small number of them.
This makes fine-tuning faster and uses less memory.
It's like teaching a smart student a new skill without having to reteach everything they already know.
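Here is a minimal LoRA sketch using the Hugging Face PEFT library; the base model ("facebook/opt-125m") and hyperparameters are placeholder assumptions rather than a recommended recipe:

```python
# A minimal LoRA sketch: attach small trainable adapter matrices to a frozen
# base model, so only a tiny fraction of parameters is updated during fine-tuning.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")  # illustrative base model

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling factor applied to the update
    target_modules=["q_proj", "v_proj"],  # attention projections that receive adapters
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # prints how few weights are actually trainable
```

The trained adapters can later be merged back into the base model or shipped as a small separate file.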
QLoRA (Quantized LoRA):
QLoRA is a version of LoRA that uses even less memory.
It does this by using simpler numbers (quantization - more about it soon) to represent the model's knowledge.
This allows fine-tuning of very large models on regular computers, not just big servers.
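The sketch below extends the LoRA example to QLoRA: the base model is loaded in 4-bit precision via bitsandbytes and LoRA adapters are trained on top. It assumes a CUDA GPU with the bitsandbytes and accelerate packages installed, and the model name and settings are again illustrative:

```python
# A minimal QLoRA sketch: quantize the frozen base model to 4-bit and train
# LoRA adapters on top of it (requires a CUDA GPU with bitsandbytes installed).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store base weights in 4-bit NF4 format
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # run the actual math in bfloat16
)

base_model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",                    # illustrative model; QLoRA shines on much larger ones
    quantization_config=bnb_config,
    device_map="auto",
)
base_model = prepare_model_for_kbit_training(base_model)

lora_config = LoraConfig(
    r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```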

Preference Alignment
Reinforcement Learning from Human Feedback (RLHF) is an advanced step in the development of Large Language Models (LLMs) that follows Supervised Fine-Tuning (SFT). Its primary purpose is to align the LLM's outputs more closely with human expectations and values.
For example, in their groundbreaking paper on GPT-3, the authors not only demonstrated the model's impressive capabilities but also took a responsible approach by addressing its broader societal impacts. They included a crucial section on fairness, bias, and representation, highlighting several concerning issues.
The following techniques and components are commonly used for preference alignment:
Preference Datasets:
These are special collections of model answers.
Each answer is ranked or compared against the others, for example one response marked as "chosen" and another as "rejected."
They're harder to make than regular instruction datasets because you need to compare and rank multiple answers.
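For illustration, a single entry in such a dataset often looks like the hypothetical example below (the prompt and responses are made up):

```python
# A hypothetical preference-dataset entry: one prompt with a preferred
# ("chosen") and a less-preferred ("rejected") model response.
preference_example = {
    "prompt": "Explain photosynthesis to a 10-year-old.",
    "chosen": (
        "Plants make their own food using sunlight. They take in water and air, "
        "and with the sun's energy they turn them into sugar to grow."
    ),
    "rejected": (
        "Photosynthesis is the process by which chloroplasts utilize photons to "
        "drive the Calvin cycle and carbon fixation."
    ),
}
```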
Proximal Policy Optimization (PPO):
This is a method to teach the model which answers humans prefer.
It uses a "reward model" to guess how much humans would like an answer.
The main model is then updated to give better answers, but not too differently from how it answered before.
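Under the hood, PPO's defining trick is a "clipped" objective that stops the updated model from drifting too far from its previous behavior. Here is a minimal sketch of that loss on dummy numbers, not a full RLHF training loop:

```python
# Minimal sketch of PPO's clipped objective on dummy numbers.
# In real RLHF, log_probs come from the current policy, old_log_probs from a
# snapshot of it, and advantages are derived from the reward model's scores.
import torch

log_probs = torch.tensor([-1.0, -0.5, -2.0])      # current policy log-probs for sampled tokens
old_log_probs = torch.tensor([-1.1, -0.7, -1.8])  # log-probs before the update
advantages = torch.tensor([0.5, 1.0, -0.3])       # how much better than expected each sample was
clip_eps = 0.2

ratio = torch.exp(log_probs - old_log_probs)                   # how much the policy changed
clipped_ratio = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)  # keep the change small
ppo_loss = -torch.min(ratio * advantages, clipped_ratio * advantages).mean()
print(f"PPO clipped loss: {ppo_loss.item():.4f}")
```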
Direct Preference Optimization (DPO):
This is a simpler way to do what PPO does.
Instead of guessing rewards, it directly learns which answers are preferred.
It's easier to set up and more stable than PPO.
DPO treats the problem like sorting answers into "preferred" or "not preferred" categories.
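At its core, DPO optimizes a simple classification-style loss over (chosen, rejected) pairs. The sketch below computes that loss on dummy log-probabilities; in practice these come from the policy model and a frozen reference (SFT) model:

```python
# Minimal sketch of the DPO loss on dummy log-probabilities.
# In practice these values are summed over the tokens of each response,
# computed with the policy model and a frozen reference (SFT) model.
import torch
import torch.nn.functional as F

policy_chosen_logp = torch.tensor([-12.0])    # policy log-prob of the preferred answer
policy_rejected_logp = torch.tensor([-15.0])  # policy log-prob of the rejected answer
ref_chosen_logp = torch.tensor([-13.0])       # same quantities under the reference model
ref_rejected_logp = torch.tensor([-14.0])
beta = 0.1                                    # how strongly to penalize drifting from the reference

# Implicit rewards: how much the policy favors each answer relative to the reference.
chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)

# Binary-classification-style loss: push the chosen reward above the rejected one.
dpo_loss = -F.logsigmoid(chosen_reward - rejected_reward).mean()
print(f"DPO loss: {dpo_loss.item():.4f}")
```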

In summary, this article has covered three crucial steps in the development of effective Large Language Models (LLMs):
Pre-training: The initial, resource-intensive phase that builds the foundation of an LLM's knowledge.
Supervised Fine-Tuning (SFT): The process of adapting pre-trained models to follow instructions and perform specific tasks, including efficient techniques like LoRA and QLoRA.
Preference Alignment: The use of Reinforcement Learning from Human Feedback (RLHF) to align model outputs with human expectations, utilizing methods such as Preference Datasets, Proximal Policy Optimization (PPO), and Direct Preference Optimization (DPO).
These steps transform raw language models into more useful and aligned AI assistants. We've explored how each stage contributes to the overall capability and reliability of LLMs, from building foundational knowledge to refining responses based on human preferences.
In the next article, we'll delve into two more critical aspects of LLM development:
Evaluation of models: We'll discuss various methods and metrics used to assess the performance and capabilities of fine-tuned models, helping to ensure they meet desired standards.
Quantization: We'll explore this technique for reducing model size and increasing inference speed, making LLMs more accessible and efficient for deployment.
Stay Tuned!