The Future of AI Infrastructure - Part 1
On training, inference, and where the bottlenecks lie
Nvidia recently overtook Microsoft to become the world's most valuable company. A closer look reveals that Nvidia’s data center revenue reached a record high of $22.6 billion in Q1 2024, representing a 23% increase from the previous quarter and an astonishing 427% growth compared to the same period last year. If this growth trajectory continues, Nvidia’s data center revenue could exceed $100 billion by the end of this year.
This tremendous spend on compute and the rapidly evolving infrastructure landscape in this new era of AI have been fascinating to watch. In light of this, I have taken some time to reflect on current market and technology trends, seeking to identify opportunities for startups to capitalize on these developments.
For today’s part 1, I will mainly discuss:
Training and Inference in the Era of Large-Scale Models
Surging Demand in Training Compute
Heterogeneous Characteristics in AI Inference (the two phases of decoder-only LLM inference, AlphaFold performance on CPU vs. GPU)
The Memory Wall
AI Demands Transformation on Data Center Network and Edge Computing
Enjoy, and ping me if you want to grab a coffee to chat further!
Training and Inference in the Era of Large-Scale Models
Surging Demand in Training Compute
Generative AI applications began to emerge around 2021. They mark a revolutionary break from traditional AI systems, which focus on pattern recognition and prediction; Nvidia's CEO Jensen Huang has compared this new era to the "iPhone moment" of AI [ref].
What sets these emerging generative AI models apart is their sheer number of parameters. The relationship between model scale and performance has revealed remarkable scaling laws: as the number of parameters grows, model capabilities increase predictably and substantially. Since 2010, the amount of training compute used for machine learning models has grown by a factor of 10 billion, significantly exceeding a naive extrapolation of Moore's Law [ref].
As a result, a growing disparity exists between the compute that generative AI applications require and the infrastructure available to deliver it. As shown in Fig. 1, the compute used to train notable AI systems has increased exponentially over the past decade, exceeding 10 billion PetaFLOPs (PFLOPs) for models such as GPT-4 [ref]. In the transformer era, training compute has grown at roughly 750x every 2 years [ref]. In contrast, the top Graphics Processing Units (GPUs) used in ML research have improved at a much slower rate, roughly 2x in FLOP performance every ~2 years [ref]. This points to a significant surge in demand for both GPU quantity and GPU performance as even larger models are trained in the near future (see the quick calculation after Fig. 1).
Fig. 1 The amount of compute measured in PetaFLOPs (PFLOPs) needed to train State of the Art (SOTA) models, for different AI/ML models, along with the different scaling of Transformer models (750x/2yrs) [ref]
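To make the implied gap concrete, here is a quick back-of-envelope sketch in Python using only the growth rates cited above (the figures are the referenced estimates, not new data):

```python
# Back-of-envelope: implied growth in the number of GPUs needed to keep up,
# using the scaling figures cited above (illustrative only).
compute_growth_per_2yrs = 750    # SOTA transformer training compute: ~750x every 2 years
gpu_perf_growth_per_2yrs = 2     # per-GPU FLOP performance: ~2x every 2 years

# If per-GPU performance only doubles, the rest of the gap has to be closed
# by scaling out to more GPUs (ignoring efficiency losses at scale).
implied_gpu_count_growth = compute_growth_per_2yrs / gpu_perf_growth_per_2yrs
print(f"Implied growth in GPU count: ~{implied_gpu_count_growth:.0f}x every 2 years")
# -> ~375x every 2 years, before accounting for communication and utilization overheads
```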
AI Inference Presents Heterogeneous Computational Characteristics
In general, GPUs have been positioned as the platform of choice for AI inference, since massive parallelism is often indispensable. However, GPUs are not always fully utilized in the pipeline. In fact, inference for the common decoder-only LLMs (such as the GPT series) makes inefficient use of GPU resources because it consists of two distinct phases: a prefill phase (i.e. prompt computation) followed by a decode phase (i.e. token generation).
In the prefill phase, the LLM processes the input tokens to compute the intermediate states, which are used to generate the "first" new token. Each new token depends on all the previous tokens, but because the full extent of the input is known, at a high level this is a matrix-matrix operation that parallelizes well and effectively saturates the GPU. In the decode phase, the LLM generates output tokens one at a time and is limited by GPU memory bandwidth: each step is a matrix-vector operation that underutilizes the GPU's compute capability compared to the prefill phase [ref]. The discrepancy between these two phases results in low overall GPU utilization, which translates into much higher costs when serving LLMs to users.
Fig. 2 An example of the LLM inference process and its two phases. The prefill (prompt) phase is compute intensive, while the decode (token generation) phase is memory intensive. [ref]
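To make the two phases concrete, here is a minimal single-head attention sketch in PyTorch (an illustration of the shapes involved, not any production serving stack; the causal mask is omitted for brevity): prefill is one big matrix-matrix product over all prompt tokens, while each decode step is a matrix-vector-sized product against a growing KV cache.

```python
import torch

d = 64                                   # head dimension (illustrative)
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))

def attention(q, K, V):
    # q: (t, d) queries; K, V: (T, d) cached keys/values
    scores = q @ K.T / d ** 0.5
    return torch.softmax(scores, dim=-1) @ V

# --- Prefill: all prompt tokens at once (matrix-matrix, compute-bound) ---
prompt = torch.randn(512, d)             # 512 prompt-token embeddings
K_cache, V_cache = prompt @ W_k, prompt @ W_v
hidden = attention(prompt @ W_q, K_cache, V_cache)   # (512, d) in one large matmul

# --- Decode: one token at a time against the KV cache (memory-bound) ---
x = hidden[-1:]                          # last hidden state stands in for the new token
for _ in range(16):                      # generate 16 tokens
    K_cache = torch.cat([K_cache, x @ W_k])          # cache grows every step
    V_cache = torch.cat([V_cache, x @ W_v])
    x = attention(x @ W_q, K_cache, V_cache)         # (1, d): matrix-vector-sized work
```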
Additionally, when it comes to AI inference in end-to-end pipelines, there are certain use cases where Central Processing Units (CPUs), which handle a much wider range of compute characteristics, deliver better overall performance than GPUs. These include recommendation systems, whose training and inference require large memory for embedding layers, classical machine learning algorithms that are difficult to parallelize on GPUs, and so on [ref].
One interesting example comes from computational biology: DeepMind's AlphaFold. Protein folding is considered a holy grail problem in biology; the task entails predicting the 3D structure of a protein from its amino acid sequence. The protein folding pipeline using AlphaFold consists of two parts:
Preprocessing, which includes database search (heavy on file I/O) and multiple sequence alignment, and
Model inference, which runs a transformer-based deep learning model.
The deep learning component of AlphaFold is essential for its prediction accuracy, but it accounts for only a small fraction of the total execution time of the end-to-end protein folding pipeline. In one throughput experiment, on a set of proteins of length less than a thousand, AlphaFold2 on a single Intel CPU achieved 9.5x higher performance than GPU-based FastFold on an A100 GCP instance. This is mainly because preprocessing consumes the majority of the time, and it runs on only 6 cores on the GCP instance compared to 56 cores on the CPU socket [ref]. As such, AI inference in real-world applications requires a holistic, end-to-end view of the overall application, since these pipelines comprise deep learning and non-deep-learning compute with varied computational characteristics.
Fig. 3 The performance of the four platforms for a set of C. elegans proteins of length less than 1000. All experiments use bfloat16 for model inference [ref]
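In practice, the lesson is simply to profile the whole pipeline rather than the model in isolation. Here is a hedged sketch of what that might look like; `run_msa_search` and `run_structure_module` are hypothetical stand-ins for the preprocessing and inference stages, not the actual AlphaFold API:

```python
import time

def profile_folding_pipeline(sequence, run_msa_search, run_structure_module):
    """Time each stage of a two-stage folding pipeline end to end.

    run_msa_search and run_structure_module are hypothetical callables standing
    in for preprocessing (database search + MSA) and transformer inference.
    """
    t0 = time.perf_counter()
    features = run_msa_search(sequence)            # CPU- and IO-heavy stage
    t1 = time.perf_counter()
    structure = run_structure_module(features)     # deep learning stage
    t2 = time.perf_counter()

    print(f"preprocessing:   {t1 - t0:8.1f} s")
    print(f"model inference: {t2 - t1:8.1f} s")
    return structure
```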
The Memory Wall
The surging need for compute power has been the main driver of GPU performance improvements; however, people often overlook the critical challenges of memory and communication. These challenges are commonly referred to as the memory wall problem, a term coined by William Wulf and Sally McKee in 1995. The memory wall involves both the limited capacity and the limited bandwidth of memory transfer.
The problem spans several levels of data transfer: between compute logic and on-chip memory, between compute logic and DRAM, and across processors on different sockets. In all these cases, the capacity and speed of data transfer have lagged significantly behind hardware compute capabilities [ref]. This means that we are overspending on under-memoried compute engines for many AI workloads [ref].
Fig. 4 The scaling of the bandwidth of different generations of interconnections & memory, as well as the peak FLOPs [ref]
In particular, our ability to use the maximum available compute FLOPs has been worse in inference than in training [ref], due to the memory bandwidth limitation in inference. As discussed above, the decode step of LLM inference is memory-bound because of its autoregressive nature: the speed at which data (weights, keys, values, activations) is transferred from memory to the GPU dominates latency, not how fast the computation itself happens [ref].
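A rough, illustrative calculation shows why this matters; the numbers below are assumptions (a 70B-parameter model in 16-bit weights on a single accelerator with ~2 TB/s of HBM bandwidth), not a benchmark, and real deployments shard models across GPUs and batch requests:

```python
# Lower bound on per-token decode latency at batch size 1:
# every generated token requires streaming all model weights from memory.
params = 70e9                      # assumed model size (parameters)
bytes_per_param = 2                # fp16/bf16 weights
weight_bytes = params * bytes_per_param        # ~140 GB of weights
hbm_bandwidth = 2e12               # assumed ~2 TB/s of HBM bandwidth

min_latency_per_token = weight_bytes / hbm_bandwidth
print(f"~{min_latency_per_token * 1e3:.0f} ms per token just to read the weights")
# -> ~70 ms/token (~14 tokens/s), no matter how many FLOPs the GPU can deliver
```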
To keep innovating and break through the memory wall, many research and startup efforts are rethinking the design of algorithms and even hardware accelerators to train and deploy AI more efficiently. AI hardware accelerators will be discussed in detail in part 2. On the algorithm front, one popular technique is FlashAttention, introduced by Dao et al. in 2022 [ref], which reduces the memory footprint and memory-IO cost of the attention mechanism in Transformers. Another promising open-source project is vLLM [ref] for fast LLM inference and serving, which uses PagedAttention to manage attention keys and values efficiently. vLLM redefined the state of the art in LLM serving, delivering up to 24x higher throughput than HuggingFace Transformers without requiring any model architecture changes.
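For a flavor of how little code this takes on the serving side, here is a minimal vLLM sketch (the model name and sampling settings are placeholders; check the vLLM documentation for the current API):

```python
from vllm import LLM, SamplingParams

# Placeholder model; any supported HuggingFace-style causal LM can be used.
llm = LLM(model="facebook/opt-125m")
sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

outputs = llm.generate(["The memory wall in AI systems refers to"], sampling)
for out in outputs:
    print(out.outputs[0].text)
```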
AI Demands Transformation on Data Center Network and Edge Computing
The requirements and characteristics of AI workloads vary significantly depending on the specific task or application. At the 2023 OCP Global Summit, Dan Rabinovitsj, VP of Infrastructure at Meta, presented the diverse requirements of some of Meta's AI applications, including training and inference workloads for Ranking & Recommendations (R&R) and Large Language Models (LLMs). Designing a single system architecture that can efficiently serve all types of AI workloads remains a significant challenge, particularly as novel models and parallelism techniques continue to emerge and place unanticipated demands on AI systems. For example, as shown in Fig. 5, LLM training has significant requirements in network bandwidth, compute, and memory capacity (the total amount of memory, typically RAM). In contrast, LLM inference at the prefill step demands high compute and memory capacity, while network latency and memory bandwidth (the rate at which data can be read from or written to memory) become more of a bottleneck during the decoding stage.
Fig. 5 Diverse AI workloads presented by Dan Rabinovitsj, VP of Infrastructure at Meta, during 2023 OCP Global Summit [ref] (R&R: Ranking & Recommendation, LLM: Large Language Model)
The complexity and size of an AI application, in terms of the number of parameters, dictate the number of accelerators needed to run it, which in turn dictates the size and type of interconnect and the network design options required to connect those accelerators in a cluster. Because AI workloads are highly heterogeneous and require orders of magnitude more GPUs, with far more data moving between them than between CPUs in the past, the way these accelerated nodes are connected in large clusters can differ significantly from the traditional frontend network in a data center, which mostly connects general-purpose servers. As a result, a new backend network has evolved, which serves as the backbone of AI applications and handles data movement between GPUs [ref]. Since it typically takes about 2 years to build a data center, designing the AI network with future workload characteristics in mind is critical, and it demands more attention to heterogeneity, network efficiency, and energy consumption than ever before.
The rapid growth of these requirements is driving a significant paradigm shift, along with exciting opportunities. For AI training workloads, the data center backend network presents growth potential for both scale-up (up to a few hundred GPUs connected in one cluster) and scale-out (up to tens or hundreds of thousands of GPUs connected) networking that can achieve low loss, high throughput, high scalability, and low network latency.
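To see why the backend network is under so much pressure, consider a rough data-parallel training example (the figures are assumptions for illustration, not measurements): every training step requires an all-reduce of the full gradient across GPUs.

```python
# Rough per-GPU backend-network traffic for one data-parallel training step.
params = 70e9                    # assumed model size (parameters)
bytes_per_grad = 2               # fp16/bf16 gradients
step_time_s = 2.0                # assumed time per training step

# A ring all-reduce moves roughly 2x the gradient volume per GPU.
traffic_per_step = 2 * params * bytes_per_grad       # ~280 GB per step
required_bw = traffic_per_step / step_time_s         # bytes/s per GPU
print(f"~{required_bw / 1e9:.0f} GB/s (~{required_bw * 8 / 1e9:.0f} Gb/s) sustained per GPU")
# -> on the order of 140 GB/s (~1,120 Gb/s), before overlap or gradient compression
```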
For AI inference workloads, depending on the latency requirement, a lightweight version of the model is sometimes deployed to reduce inference time and cost at an acceptable loss of accuracy. Many inference applications are also expected to run at the edge or on edge devices. With the last decade's proliferation of IoT devices and mobile chip technology, edge computing solutions that can efficiently run AI models locally have many opportunities. There are particular opportunities for startups to develop lightweight AI models and specialized hardware accelerators that enhance the performance and energy efficiency of edge devices deploying AI, enabling smarter, autonomous operation in diverse environments.
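As one concrete, hedged example of the "lightweight model" idea, PyTorch's dynamic quantization converts linear-layer weights to int8 in a few lines; the model below is a toy stand-in for an edge workload, and the accuracy impact should always be validated on the target task:

```python
import torch
import torch.nn as nn

# Toy model standing in for an edge inference workload.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).eval()

# Dynamic quantization: weights stored as int8, activations quantized on the fly.
# This typically shrinks linear layers ~4x and speeds up CPU inference.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(quantized(x).shape)    # same interface, smaller and faster linear layers
```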