Designing and developing a scalable, affordable, efficient, and well-managed infrastructure for the machine learning lifecycle is essential to the success of any Gen AI-based solution. This blog suggests expanding MLOps to include LLMOps on the AWS platform, focusing on the adjustments and enhancements needed to make an existing MLOps framework suitable for LLM projects.
Sample Solution Architecture
LLMs are usually trained on massive amounts of text data to learn how to generate the next token. To adapt a model to a downstream task such as text classification, first prepare the relevant text and any labels or annotations. ML services such as Amazon Textract and Amazon Comprehend can extract entities from images or documents, or classify documents, to prepare data for training or fine-tuning LLMs. If you need custom logic, you can use an AWS Glue job. Finally, you can use Amazon SageMaker Ground Truth for human annotation or active learning workflows that classify or annotate text data.
An average LLM dataset consists of a few hundred GB, or hundreds of millions of text tokens. SageMaker-managed clusters of ml.p4d.24xlarge instances provide several options for storing and loading datasets:
On-node NVMe SSD
ml.p4d.24xlarge instances have 8 TB of NVMe storage, available at /tmp and, if you use SageMaker File mode, at /opt/ml/input/data/. If you want the simplicity and speed of local reads, copy the data to the NVMe SSD. The copy can be performed by SageMaker File mode or by your own code, for example multi-processed Boto3 or the AWS CLI for S3.
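The "multi-processed Boto3" copy mentioned above could look roughly like the sketch below. It is a minimal illustration rather than a tuned implementation: the bucket, prefix, destination directory, and worker count are placeholders, and the helper names are made up for this example.

```python
import os
from concurrent.futures import ProcessPoolExecutor
from functools import partial


def local_path_for(key: str, prefix: str, dest_dir: str) -> str:
    """Map an S3 key under `prefix` to a file path under `dest_dir`."""
    relative = key[len(prefix):].lstrip("/")
    return os.path.join(dest_dir, relative)


def list_keys(bucket: str, prefix: str) -> list:
    """List all object keys under a prefix, skipping folder markers."""
    import boto3  # imported lazily so the pure helpers work without it

    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if not obj["Key"].endswith("/"):
                keys.append(obj["Key"])
    return keys


def download_one(key: str, bucket: str, prefix: str, dest_dir: str) -> str:
    """Download one object; each worker process creates its own client."""
    import boto3

    target = local_path_for(key, prefix, dest_dir)
    os.makedirs(os.path.dirname(target), exist_ok=True)
    boto3.client("s3").download_file(bucket, key, target)
    return target


def copy_prefix_to_nvme(bucket: str, prefix: str,
                        dest_dir: str = "/tmp/dataset", workers: int = 8) -> list:
    """Copy every object under `prefix` to local storage in parallel."""
    keys = list_keys(bucket, prefix)
    worker = partial(download_one, bucket=bucket, prefix=prefix, dest_dir=dest_dir)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(worker, keys))
```

A training script would call `copy_prefix_to_nvme("example-bucket", "train/")` once per node before the first epoch; the bucket name here is hypothetical.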
FSx for Lustre
On-node NVMe SSDs are limited in size, and they must be populated from S3 at each job or warm cluster creation. To scale to larger datasets while preserving low-latency random access, you can use FSx for Lustre. Lustre is an open-source parallel file system that is widely used in HPC to achieve high IOPS.
SageMaker FastFile Mode
FastFile Mode (FFM) is a SageMaker-only feature that presents remote S3 objects on SageMaker-managed compute instances under a POSIX-compliant interface, using FUSE, and streams them only when they are read. FFM reads result in S3 calls that stream the remote files block by block.
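Because FFM exposes S3 objects as ordinary files, training code can read them with standard file APIs. The sketch below is one way to do that, assuming the data channel contains newline-delimited text files; the function name and the 8 MB buffer size are illustrative choices, not SageMaker requirements. It opens one file at a time and reads it front to back.

```python
import os
from typing import Iterator


def iter_lines_sequentially(channel_dir: str,
                            buffer_bytes: int = 8 * 1024 * 1024) -> Iterator[str]:
    """Yield lines from every file in a channel directory, one file at a time."""
    for name in sorted(os.listdir(channel_dir)):
        path = os.path.join(channel_dir, name)
        if not os.path.isfile(path):
            continue
        # Only one file is open at any moment, and it is read sequentially.
        with open(path, "r", encoding="utf-8", buffering=buffer_bytes) as handle:
            for line in handle:
                yield line.rstrip("\n")
```

On a training instance, `channel_dir` would be a channel directory under /opt/ml/input/data/.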
Self-managed data loading
You can also implement your own custom data loading logic, using proprietary or open-source code. Self-managed data loading can be used to implement custom error-handling logic, facilitate migrations by reusing previously developed code, or gain more control over sharding and underlying performance. Libraries that can be used for self-managed data loading include WebDataset and TorchData (torchdata.datapipes). Custom data loading code can also be created by combining the AWS Python SDK Boto3 with Torch Dataset classes. Custom data loading classes can also be combined creatively with SageMaker Training heterogeneous clusters, so that the CPU and GPU balance can be adjusted precisely to a particular workload.
Data loading best practices for S3
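As an illustration of the control over sharding that self-managed loading gives you, the sketch below assigns dataset shards to workers in a round-robin fashion. The function name and the shard URIs are hypothetical; a real loader would then stream only its own shards, for example with Boto3 inside a Torch Dataset.

```python
from typing import List, Sequence


def shards_for_worker(shards: Sequence[str], rank: int, world_size: int) -> List[str]:
    """Return the shards that the worker with this rank should read."""
    if world_size <= 0 or not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    # Round-robin assignment: worker r reads shards r, r + world_size, ...
    return list(shards[rank::world_size])


# Hypothetical shard names, used only to show the assignment.
example_shards = [f"s3://example-bucket/train/shard-{i:04d}.tar" for i in range(10)]
first_worker = shards_for_worker(example_shards, rank=0, world_size=4)
```

Every shard is read by exactly one worker, so no record is duplicated within an epoch.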
- Attempt to read and write from several S3 prefixes and buckets. For instance, divide checkpoints and training data among several prefixes.
- To monitor request rates, look at S3 metrics in Amazon CloudWatch.
- Limit the quantity of PUT/GET operations occurring at the same time.
- Reduce the number of processes that use S3 concurrently. For example, if each node needs to checkpoint to S3, checkpointing hierarchically (first within the node, then from the node to S3) can reduce PUT traffic by a factor of eight.
- Rather than utilizing an S3 GET for each training record, read several training records from a single file or S3 GET.
- When SageMaker FFM is used in conjunction with Amazon S3, SageMaker FFM calls S3 to retrieve files one chunk at a time. The recommendation is to read files sequentially and to restrict the number of files opened in parallel to reduce the amount of S3 traffic that FFM creates.
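The arithmetic behind the factor of eight, and the advice to spread traffic across prefixes, can be sketched as follows. The sketch assumes that without hierarchy every GPU process writes its own checkpoint object, and that an ml.p4d.24xlarge node has 8 GPUs; the key layout is a hypothetical example, not a SageMaker convention.

```python
def put_requests_per_checkpoint(num_nodes: int, gpus_per_node: int,
                                hierarchical: bool) -> int:
    """Count S3 PUT requests issued for one checkpoint of the whole cluster."""
    if hierarchical:
        # GPU processes first gather within the node, then the node writes once.
        return num_nodes
    # Assumption: every GPU process writes its own object to S3.
    return num_nodes * gpus_per_node


def checkpoint_key(step: int, node_rank: int, num_prefixes: int) -> str:
    """Spread checkpoint objects over several prefixes (hypothetical layout)."""
    prefix_id = node_rank % num_prefixes
    return f"checkpoints-{prefix_id}/step-{step:08d}/node-{node_rank:04d}.pt"


flat = put_requests_per_checkpoint(num_nodes=16, gpus_per_node=8, hierarchical=False)
tiered = put_requests_per_checkpoint(num_nodes=16, gpus_per_node=8, hierarchical=True)
reduction_factor = flat // tiered
```

With 16 nodes of 8 GPUs each, the flat scheme issues 128 PUT requests per checkpoint and the hierarchical scheme issues 16.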
LLMs have dozens to hundreds of billions of parameters, which makes them too large to fit on a single GPU. LLM practitioners have created several open-source libraries, such as FSDP, DeepSpeed, and Megatron, to help with the distributed computation of LLM training. The SageMaker distributed training libraries are optimized for the AWS Cloud and offer a simpler developer experience. You have two options for distributed training of an LLM on SageMaker: the SageMaker distributed libraries, or self-managed distributed training.
SageMaker distributed libraries
SageMaker Training offers several proprietary extensions to scale TensorFlow and PyTorch training code. LLM training is often conducted with 3D parallelism:
- Data parallelism splits and feeds the training mini-batches to multiple identical replicas of the model to increase processing speed.
- Pipeline parallelism assigns different layers of the model to different GPUs or even instances, to scale model size beyond a single GPU and a single server.
- Tensor parallelism splits a single layer across multiple GPUs, usually within the same server, to scale individual layers to sizes exceeding a single GPU.
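To make the three dimensions concrete, the sketch below maps a global process rank to its position in a data x pipeline x tensor grid. Placing the tensor dimension innermost, so that consecutive ranks (typically on the same server) share a tensor-parallel group, is one common layout and an assumption here, not a description of what any particular SageMaker library does.

```python
from typing import NamedTuple


class ParallelCoords(NamedTuple):
    data: int      # which model replica this rank belongs to
    pipeline: int  # which pipeline stage (group of layers) it holds
    tensor: int    # which slice of each layer it holds


def rank_to_coords(rank: int, dp: int, pp: int, tp: int) -> ParallelCoords:
    """Map a global rank to (data, pipeline, tensor) coordinates."""
    world_size = dp * pp * tp
    if not 0 <= rank < world_size:
        raise ValueError(f"rank must be in [0, {world_size})")
    tensor = rank % tp                # innermost: neighbors share a layer
    pipeline = (rank // tp) % pp      # next: consecutive stages
    data = rank // (tp * pp)          # outermost: independent replicas
    return ParallelCoords(data, pipeline, tensor)


# Example grid: 2 replicas x 2 pipeline stages x 2 tensor slices = 8 GPUs.
grid = [rank_to_coords(r, dp=2, pp=2, tp=2) for r in range(8)]
```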
To manage distributed training yourself, you have two options for writing custom code:
- AWS Deep Learning Container (DLC): AWS develops and maintains DLCs, providing AWS-optimized Docker-based environments for open-source ML frameworks. SageMaker Training has a unique integration that allows you to pull and run AWS DLCs with external, user-defined entry points. For LLM training, the AWS DLCs for TensorFlow, PyTorch, Hugging Face, and MXNet are particularly relevant. Using a framework DLC allows you to use framework-native parallelism, such as PyTorch Distributed, without having to develop and manage your own Docker images. Additionally, AWS DLCs feature an MPI integration, which allows you to launch parallel code easily.
- Custom SageMaker-compatible Docker image: bring your own image, either starting from scratch or extending an existing DLC image. When using a custom image for LLM training on SageMaker, it is particularly important to verify that (1) your image contains EFA with appropriate settings, and (2) your image contains the NVIDIA NCCL communication library, enabled with GPUDirect RDMA.
Prompt engineering is comparatively simple to use, but it can only give the model basic instructions. Furthermore, every LLM limits the number of tokens that can be passed to it at inference time, and lengthy prompts increase the TCO of a paid service. The next logical step is to make an LLM more domain-specific by fine-tuning it on a carefully selected dataset.
There are two options, and the choice depends on the availability of labeled data and compute:
- Parameter-efficient fine-tuning (PEFT)
- Full fine-tuning
Full fine-tuning updates all of the weights and biases of the chosen model, which requires thousands of examples and a lot of processing power. PEFT is a more economical choice because it can be carried out with tens of examples and much less processing power. Popular PEFT methods include:
- Low-Rank Adaptation (LoRA): freezes the pre-trained model weights and trains small low-rank adapter matrices added to the transformer layers, which minimizes the number of trainable parameters for downstream tasks.
- QLoRA: extends LoRA by quantizing the original weight values from high-resolution data types like Float32 to lower-resolution data types like int4, to reduce memory usage.
- Prompt tuning: adds a soft prompt to the embedding layer at the start of the transformer layers and trains only the additional prompt tokens while keeping the pre-trained LLM frozen. This technique is known as prompt-tuning or prefix-tuning.
- LLaMA-Adapter: a modified version of prefix tuning in which soft prompts are added at the N top-most transformer layers, and the parameters close to the attention mechanism are initialized to zero instead of at random, to prevent "corruption" of LLaMA's original knowledge.
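A back-of-the-envelope calculation shows why LoRA and QLoRA are economical. For a weight matrix of shape d_out x d_in, LoRA trains two matrices of shapes d_out x r and r x d_in instead of the full matrix. The sketch below uses illustrative sizes (a 4096 x 4096 layer, rank 8, a 7-billion-parameter model) that are assumptions for the example, and the memory estimate ignores the small overhead of quantization constants.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Trainable parameters when the whole weight matrix is fine-tuned."""
    return d_in * d_out


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters of the two low-rank LoRA matrices."""
    return rank * (d_in + d_out)


def weight_memory_bytes(num_params: int, bits_per_param: int) -> int:
    """Memory needed to store the weights at a given precision."""
    return num_params * bits_per_param // 8


full = full_params(4096, 4096)                      # 16,777,216
lora = lora_trainable_params(4096, 4096, rank=8)    # 65,536
trainable_fraction = lora / full                    # 1/256 of the layer

fp32_bytes = weight_memory_bytes(7_000_000_000, 32)  # 28 GB
int4_bytes = weight_memory_bytes(7_000_000_000, 4)   # 3.5 GB
```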
RAG is the solution when the LLM must produce a response that uses extra context found in proprietary data. Using semantic search, this method retrieves relevant context quickly to enhance the LLM's responses and produce more accurate outcomes. It usually entails setting up embeddings and vector stores.
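The retrieval step can be sketched with a tiny in-memory vector store and cosine similarity. This is only a toy illustration: the three-dimensional vectors are made-up stand-ins for real embeddings, the function names are hypothetical, and a production system would use an embedding model and a managed vector store.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: Sequence[float],
          store: List[Tuple[str, Sequence[float]]], k: int = 2) -> List[str]:
    """Return the k stored texts whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


def build_prompt(question: str, contexts: List[str]) -> str:
    """Prepend the retrieved context to the user question."""
    joined = "\n".join(f"- {c}" for c in contexts)
    return f"Use the context below to answer.\nContext:\n{joined}\nQuestion: {question}"


# Made-up vectors standing in for real embeddings.
toy_store = [
    ("doc a", [1.0, 0.0, 0.0]),
    ("doc b", [0.0, 1.0, 0.0]),
    ("doc c", [0.9, 0.1, 0.0]),
]
retrieved = top_k([1.0, 0.0, 0.0], toy_store, k=2)
prompt = build_prompt("What does doc a say?", retrieved)
```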
There are multiple vector store options available in AWS, such as Amazon OpenSearch Serverless Vector Store, Amazon Kendra, and pgvector in Amazon RDS for PostgreSQL or Amazon Aurora PostgreSQL. One advantage of Amazon Kendra is that it can be used for both semantic search and embedding creation, so you don't need to rely on another foundation model for embeddings. This greatly simplifies and modularizes the design of your Gen AI application. It's also important to remember that RAG is more affordable and more flexible than fine-tuning. RAG and fine-tuning are often used together.
If you are using LLMs for traditional ML problems like sentiment analysis or general classification, you can still use traditional ML performance metrics like accuracy, precision, and recall. For other uses, various metrics are available to assess the LLM's performance, depending on the task:
- BLEU, a machine translation metric that evaluates the n-gram precision of the generated output against reference translations.
- METEOR, a machine translation metric derived from the harmonic mean of unigram precision and recall, with recall weighted higher than precision.
- ROUGE, a summarization metric that takes into account the precision, recall, and F1-score of overlapping sequences.
- Perplexity, a text generation metric that gauges how well the trained model has learned the distribution of the text.
- BERTScore, which measures embedding similarity in text generation.
- CIDEr, which calculates how close a generated image caption is to the reference captions.
- SPICE, which prioritizes capturing information about objects, attributes, and relationships (semantic propositions) when evaluating a caption generated for an image.
- Using larger models with more parameters to assess the output produced by smaller language models.
- Common benchmarks that cover a variety of applications, including question answering, natural language generation, summarization, machine translation, and more. Well-known benchmarks include Mostly Basic Python Problems (MBPP), HumanEval, GLUE, SuperGLUE, MMLU, LAMBADA, and BIG-bench. Additionally, there is Human in the Loop (HITL) evaluation, where human evaluators provide input on the quality of the text produced by the model.
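Two of the metrics above are simple enough to sketch from scratch. The code below computes clipped n-gram precision against a single reference, which is a building block of BLEU (full BLEU also combines several n-gram orders and applies a brevity penalty), and perplexity from per-token probabilities. It is a teaching sketch under those simplifications; real evaluations would normally use an established library.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def ngrams(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """All contiguous n-grams of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_ngram_precision(candidate: Sequence[str],
                            reference: Sequence[str], n: int = 1) -> float:
    """Fraction of candidate n-grams found in the reference, with clipping."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Each n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / total


def perplexity(token_probs: Sequence[float]) -> float:
    """exp of the average negative log-probability the model gave each token."""
    if not token_probs:
        raise ValueError("token_probs must not be empty")
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


candidate = "the the the the the the the".split()
reference = "the cat is on the mat".split()
precision = clipped_ngram_precision(candidate, reference, n=1)  # 2 of 7 credited
```

A model that assigns probability 0.25 to every token has a perplexity of 4, meaning it is as uncertain as a uniform choice among four tokens.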
Traditional hosting solutions used for smaller models lack the optimisation functionality needed to host LLMs with the best possible throughput and inference latency. SageMaker supports LLM inference container images with Deep Java Library (DJL) Serving. Model parallelism and inference optimisation libraries supported by SageMaker include:
- DeepSpeed, an open-source inference optimisation library
- Hugging Face Accelerate, a library for model parallel inference
- FasterTransformer, an open-source library from NVIDIA that accelerates transformer-based neural network inference
Moreover, the Hugging Face LLM Deep Learning Container (DLC) for hosting LLMs on SageMaker enables high-performance text generation through tensor parallelism, dynamic batching, and model quantization.
In production, prompts, generated responses, RAG performance, data quality, model quality, infrastructure utilization, endpoint latency, and throughput all need to be closely watched to detect any drift or eventual deterioration of quality in the generated responses. It is also crucial to compare these measurements to a baseline over time.
Context, relevance to prompt, repetitiveness, repeatability, readability, token size, injection, refusal, sentiment, toxicity, and response hallucinations are a few of the suggested metrics to track prompts and responses. Monitoring chunk size, generated embeddings, embedding speed, and semantic search results is recommended for RAG. To enable prompt remediation through retraining, each of these factors needs to be watched carefully over time.
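A few of the prompt and response metrics above can be approximated with simple heuristics, as in the sketch below. Everything here is an illustrative assumption: whitespace splitting is only a rough proxy for a model's tokenizer, the refusal phrases are a hypothetical list, and metrics such as toxicity or hallucination need dedicated models or human review.

```python
from typing import Dict, Union

# Hypothetical phrases used to flag a likely refusal.
REFUSAL_MARKERS = ("i cannot", "i can't", "i am unable")


def repeated_bigram_ratio(text: str) -> float:
    """Share of word bigrams that repeat an earlier bigram (0.0 = no repetition)."""
    tokens = text.lower().split()
    bigrams = list(zip(tokens, tokens[1:]))
    if not bigrams:
        return 0.0
    return 1 - len(set(bigrams)) / len(bigrams)


def response_metrics(prompt: str, response: str) -> Dict[str, Union[int, float, bool]]:
    """Collect simple per-request metrics that can be logged and tracked over time."""
    lowered = response.lower()
    return {
        "prompt_tokens": len(prompt.split()),        # whitespace proxy
        "response_tokens": len(response.split()),    # whitespace proxy
        "repeated_bigram_ratio": repeated_bigram_ratio(response),
        "refusal": any(marker in lowered for marker in REFUSAL_MARKERS),
    }


sample = response_metrics("What is S3?", "I cannot answer that.")
```

Logging such values per request makes it possible to plot them against a baseline and alert when they drift.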
A shadow testing or A/B testing pipeline can also be used to compare various prompts, embeddings, and/or LLMs over time and select the most appropriate one for each use case. These features are included in SageMaker by default and, to maximize their potential, can also be coupled with open-source tools like MLflow.
In the many cases where automated monitoring is not possible, or where no relevant benchmark is available for your specific domain, you can use SageMaker's human-in-the-loop capabilities through Amazon Augmented AI (A2I) to implement human review of LLM predictions in production.