Welcome to day 21 of our #75DaysofGenerativeAI series. So far, we have been developing models in notebooks and using them for prediction. However, production-grade applications require the “ops” side as well, so that we have a dedicated, repeatable pipeline to build and manage our LLMs.
What is LLMOps?
We've all been there - excitedly experimenting with LLMs like ChatGPT in a Jupyter Notebook, only to hit a wall when trying to scale to production. But what if you could streamline the process? Enter LLMOps, a subset of MLOps dedicated to operationalizing LLMs from training to maintenance. By leveraging innovative tools and methodologies, LLMOps makes it easier to adopt Generative AI at scale. In this series, we'll dive deeper into the world of LLMOps and explore its potential to transform the AI landscape.
The LLM Lifecycle: A Focus on Fine-Tuning
Let's dive into the LLM lifecycle, with a focus on fine-tuning. Why fine-tuning? Well, it's uncommon for organizations to train LLMs entirely from scratch. Instead, we start with an already trained foundation model and train it on a more specific, smaller dataset to create a custom model. Imagine taking a pre-trained LLM and fine-tuning it on your company's dataset.
You deploy this custom model, prompts start flowing in, and the corresponding completions flow back out. From that point on, it's crucial to monitor the model and retrain it when needed so that its performance stays consistent. This is especially important for AI systems driven by LLMs.
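To make this concrete, here is a minimal fine-tuning sketch using Hugging Face transformers with LoRA adapters from peft. The base model name, dataset file, and hyperparameters are illustrative assumptions, not a recommended recipe:

```python
# A minimal LoRA fine-tuning sketch using Hugging Face transformers + peft.
# The base model, dataset file, and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base_model = "meta-llama/Llama-2-7b-hf"   # assumed foundation model
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model)

# Attach lightweight LoRA adapters instead of updating all of the model's weights.
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"))

# Assumed: a small, company-specific dataset with a "text" column.
dataset = load_dataset("json", data_files="company_data.jsonl")["train"]
dataset = dataset.map(lambda x: tokenizer(x["text"], truncation=True, max_length=512),
                      batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="custom-model", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("custom-model")  # the custom model you then deploy and monitor
```

In practice you would wrap this with evaluation, experiment tracking, and checkpointing, which is exactly where the "ops" in LLMOps comes in.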
A Glimpse into Various Aspects of the Lifecycle
LLMOps facilitates the practical application of LLMs by incorporating prompt management, LLM chaining, monitoring, and observability techniques not typically found in conventional MLOps. Let's break down these concepts:
Prompt Management
Prompts are the primary means for people to interact with LLMs. Anyone who has crafted a prompt knows that refining it is an iterative task that takes several attempts to reach a satisfactory outcome. LLMOps tools typically offer features to version prompts and track their outputs over time, which makes it easier to gauge the model's overall efficacy. Certain platforms and tools also facilitate prompt evaluations across multiple LLMs, so you can quickly find the best-performing LLM for your prompt.
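As a rough illustration, here is a minimal sketch of prompt version tracking, assuming a simple local JSONL log; the field names, log path, and scores are illustrative rather than taken from any particular tool:

```python
# A minimal sketch of prompt versioning using a local JSONL log.
# Field names and the log path are illustrative assumptions.
import datetime
import hashlib
import json

LOG_PATH = "prompt_versions.jsonl"

def log_prompt_run(prompt_template: str, model: str, output: str, score: float | None = None):
    """Record one prompt/output pair so prompt versions can be compared over time."""
    record = {
        "prompt_hash": hashlib.sha256(prompt_template.encode()).hexdigest()[:12],
        "prompt_template": prompt_template,
        "model": model,
        "output": output,
        "score": score,  # optional human or automated rating
        "timestamp": datetime.datetime.utcnow().isoformat(),
    }
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(record) + "\n")

# Usage: log the same prompt against two different LLMs to compare them later.
log_prompt_run("Summarize this ticket: {ticket}", model="gpt-4o-mini", output="...", score=0.8)
log_prompt_run("Summarize this ticket: {ticket}", model="llama3", output="...", score=0.6)
```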
LLM Chaining
LLM chaining links multiple LLM calls in sequence to deliver a distinct application feature. In this workflow, the output from one LLM call serves as the input for the subsequent LLM call, culminating in the final result.
This pattern introduces a fresh way to design AI applications by breaking complex tasks into smaller steps. For instance, rather than employing a single extensive prompt to write a short story, you can break it into shorter prompts for specific sub-tasks and receive more accurate results.
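Here is a minimal chaining sketch. The call_llm helper is hypothetical, a stand-in for whichever LLM API you actually use, and the prompts are illustrative:

```python
# A minimal LLM chaining sketch: each call's output feeds the next call.
# `call_llm` is a hypothetical placeholder for your LLM provider's API.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire this up to your LLM provider")

def write_short_story(theme: str) -> str:
    # Step 1: produce an outline for the theme.
    outline = call_llm(f"Write a three-point outline for a short story about {theme}.")
    # Step 2: the outline (output of step 1) becomes the input of step 2.
    draft = call_llm(f"Expand this outline into a 300-word short story:\n{outline}")
    # Step 3: a final call polishes the draft into the result.
    return call_llm(f"Edit this story for clarity and tone:\n{draft}")
```

Frameworks such as LangChain formalize this pattern, but the core idea is simply feeding one completion into the next prompt.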
Monitoring and Observability
Imagine deploying an LLM, only to find out weeks later that its performance has degraded significantly. This can happen for various reasons, such as changes in user behavior, data drift, or even issues with the model itself. To avoid such scenarios, it's vital to have a monitoring system in place that can detect potential issues in real time.
So, what data points should we capture to ensure our LLM is performing optimally? Here are some key metrics to track (a minimal logging sketch follows this list):
Prompts: What are the inputs to our model?
Prompt tokens/length: How long are the prompts, and what's the average token length?
Completions: What are the outputs of our model?
Completion tokens/length: How long are the completions, and what's the average token length?
Unique identifier for the conversation: How can we track conversations and identify potential issues?
Latency: How long does it take for our model to respond?
Custom metadata: What additional information can we capture to provide context?
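Putting these together, here is a minimal sketch of a per-request monitoring record, assuming you log one entry per LLM call and ship it to your observability backend; the field names, token counting, and metadata are illustrative assumptions:

```python
# A minimal per-call monitoring record capturing the metrics listed above.
# Field names, the token-count proxy, and the metadata are illustrative.
import json
import time
import uuid
from dataclasses import asdict, dataclass, field

@dataclass
class LLMCallRecord:
    conversation_id: str        # unique identifier for the conversation
    prompt: str
    completion: str
    prompt_tokens: int
    completion_tokens: int
    latency_ms: float
    metadata: dict = field(default_factory=dict)  # custom metadata (user, feature, model version, ...)

def timed_llm_call(prompt: str, call_llm, conversation_id: str | None = None) -> LLMCallRecord:
    """Wrap a single LLM call and capture the metrics listed above."""
    start = time.time()
    completion = call_llm(prompt)
    record = LLMCallRecord(
        conversation_id=conversation_id or str(uuid.uuid4()),
        prompt=prompt,
        completion=completion,
        prompt_tokens=len(prompt.split()),        # rough proxy; use a real tokenizer in practice
        completion_tokens=len(completion.split()),
        latency_ms=(time.time() - start) * 1000,
        metadata={"model": "custom-model-v1"},    # assumed metadata
    )
    print(json.dumps(asdict(record)))             # replace with your logging/observability backend
    return record
```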
Differences between MLOps and LLMOps
If you've made it this far, it's evident that LLMOps is the MLOps equivalent for LLMs. By now, you understand that LLMOps is critical to managing LLMs, especially the fine-tuned LLMs you've trained yourself. The basic differences between the two come down to the LLM-specific practices covered above: prompt management, LLM chaining, and prompt/completion-level monitoring and observability.
LLMOps platforms
Several platforms enable you to follow the well-known LLMOps practices described above. Below are a few, split into paid and free/open-source options.
Paid
Google Cloud Vertex AI: Pricing is based on usage of Google Cloud services
Databricks: Supported as part of the Databricks Machine Learning module
Valohai: Automates everything from data extraction to model deployment; supports Kubernetes clusters and version management
Free/OSS
Ollama: One of the most widely known platforms for running LLMs locally; supports Windows, macOS, and Linux (a short usage sketch follows this list)
OpenLLM (BentoML): Optimizes model serving with low latency and high throughput, supports Kubernetes for cloud deployment, integrates with open-source models
Kubeflow: Although not specialized for LLM workflows, it enables end-to-end ML workflows on Kubernetes with tools for data preprocessing, training, serving, and monitoring
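To give a taste of the free/OSS side, here is a quick local test with the Ollama Python client, assuming Ollama is installed, the server is running, and a model (here llama3) has already been pulled with `ollama pull llama3`:

```python
# A quick local sanity check using the Ollama Python client.
# Assumes the Ollama server is running and "llama3" has been pulled.
import ollama

response = ollama.chat(
    model="llama3",
    messages=[{"role": "user", "content": "Explain LLMOps in one sentence."}],
)
print(response["message"]["content"])
```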
Conclusion
This article covers just the basics of LLMOps and introduces you to what it is. In the real world, building and managing an LLMOps pipeline is much more complex and takes a lot of practice to get right. In upcoming articles, we will leverage one of the LLMOps platforms to develop a production-grade LLM-based application. So stay tuned!