Amazon Web Services (AWS) recently unveiled Project Rainier, a massive compute cluster powered by hundreds of thousands of its custom Trainium2 chips.
AWS is building the system to support the work of AI developer Anthropic PBC. Amazon, the parent company of AWS, has invested $8 billion in the OpenAI competitor since September last year. A few weeks ago, Anthropic announced that it will collaborate with AWS on the development of future Trainium chips.
The Trainium2 chip features eight so-called NeuronCores, each comprising four computing modules. One of these modules is the GPSIMD engine, which is optimized specifically for running custom AI operations. These operations are specialized, low-level pieces of code that machine learning teams write to squeeze more performance out of their neural networks.
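To give a sense of what such a custom operation looks like in practice, the sketch below defines a fused bias-add-plus-GELU operation using PyTorch's generic custom-function interface. It is only an illustration of the kind of hand-written operator that engines like GPSIMD are built to run: the class name `FusedBiasGelu` is hypothetical, and the actual Trainium programming interfaces come from AWS's Neuron SDK, which is not shown here.

```python
import torch

# Illustrative only: a generic PyTorch custom operation of the kind ML teams
# hand-write and then map onto an accelerator's custom-op engine. This is NOT
# the AWS Neuron API used to target Trainium's GPSIMD engine.
class FusedBiasGelu(torch.autograd.Function):
    """Fuses a bias add and a GELU activation into one custom operation."""

    @staticmethod
    def forward(ctx, x, bias):
        y = x + bias
        ctx.save_for_backward(y)
        return torch.nn.functional.gelu(y)

    @staticmethod
    def backward(ctx, grad_out):
        (y,) = ctx.saved_tensors
        # Recompute the GELU derivative with respect to (x + bias).
        with torch.enable_grad():
            y = y.detach().requires_grad_(True)
            out = torch.nn.functional.gelu(y)
            (grad_y,) = torch.autograd.grad(out, y, grad_out)
        return grad_y, grad_y.sum(dim=0)  # gradients w.r.t. x and bias

# Minimal usage check.
x = torch.randn(8, 128, requires_grad=True)
bias = torch.randn(128, requires_grad=True)
out = FusedBiasGelu.apply(x, bias)
out.sum().backward()
```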
The eight NeuronCores are supported by 96 GB of high-bandwidth memory (HBM), which operates significantly faster than other types of RAM. The Trainium2 chip can transfer data between the HBM pool and the NeuronCores at speeds of up to 2.8 terabytes per second. The faster data reaches the chip's processing units, the sooner computations can begin.
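As a back-of-envelope illustration of what that bandwidth figure means, the snippet below estimates how long a single full pass over the 96 GB memory pool would take at the quoted 2.8 TB/s. Real-world memory utilization is lower, so this is only an idealized lower bound based on the numbers above.

```python
# Back-of-envelope, using the figures quoted in the article: 96 GB of HBM
# read at 2.8 TB/s of memory bandwidth.
hbm_capacity_gb = 96
bandwidth_tb_per_s = 2.8

seconds_per_full_read = hbm_capacity_gb / (bandwidth_tb_per_s * 1000)
print(f"One full pass over HBM: ~{seconds_per_full_read * 1000:.1f} ms")
# -> roughly 34 ms per complete sweep of the 96 GB pool, in the ideal case
```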
Hundreds of thousands of Trainium2 chips in Project Rainier are assembled into what are called Trn2 UltraServers. Those AWS-developed servers were announced alongside the compute cluster today. Each machine houses 64 Trainium2 chips and delivers an aggregate 332 petaflops when running sparse FP8 operations, a type of computation that AI models use to process data.
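Dividing the headline number across the server's chips gives a rough sense of per-chip throughput. The calculation below is purely illustrative and uses only the figures quoted above.

```python
# Spreading the UltraServer's headline sparse FP8 figure across its chips.
ultraserver_pflops = 332
chips_per_server = 64

per_chip_pflops = ultraserver_pflops / chips_per_server
print(f"~{per_chip_pflops:.1f} petaflops of sparse FP8 compute per Trainium2 chip")
# -> ~5.2 petaflops per chip, before any real-world efficiency losses
```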
Rather than deploying Project Rainier's servers in a single data center, as is typical for AI clusters, AWS has opted to distribute them across multiple facilities. This strategy simplifies logistical tasks, such as securing the necessary power supply for the cluster.
Distributing hardware across multiple facilities offers clear benefits, but it also comes at a cost, most notably increased latency. The greater the distance between servers in the cluster, the longer it takes for data to travel between them. Because the servers in an AI cluster frequently exchange information with one another, this added latency can significantly slow down processing.
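A rough way to quantify that penalty: signals in optical fiber travel at roughly two-thirds the speed of light, so every kilometer of separation adds about five microseconds of one-way delay. The sketch below works through a few distances; the fiber-speed rule of thumb is a general assumption, not an AWS figure.

```python
# Rough one-way latency added by physical distance between facilities.
# Assumes signals travel through optical fiber at about two-thirds the
# speed of light (~200,000 km/s), a standard rule of thumb.
fiber_speed_km_per_s = 200_000

for distance_km in (1, 10, 100):
    one_way_ms = distance_km / fiber_speed_km_per_s * 1000
    print(f"{distance_km:>4} km apart -> ~{one_way_ms:.3f} ms added per hop, one way")
# Every extra round trip between facilities multiplies this delay across the
# millions of synchronization steps in a large training run.
```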
AWS has addressed this limitation with an internally developed technology called Elastic Fabric Adapter, or EFA. The network interface accelerates the flow of data between the company's AI chips.
Transferring information between two different servers involves numerous computational operations, some of which are performed by the servers' operating systems. AWS's Elastic Fabric Adapter bypasses the operating system, allowing network traffic to reach its destination more quickly.
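For a sense of the overhead that operating-system bypass avoids, the sketch below times a message round trip over ordinary TCP sockets on a single machine, where every send and receive passes through the kernel's network stack. It illustrates the conventional, kernel-mediated path only; it does not use or measure EFA itself.

```python
import socket
import threading
import time

# Illustration only: per-message cost of the ordinary, kernel-mediated socket
# path that OS-bypass networking is designed to avoid.
def echo_server(sock):
    conn, _ = sock.accept()
    with conn:
        while True:
            data = conn.recv(64)
            if not data:
                break
            conn.sendall(data)

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))          # let the OS pick a free port
server.listen(1)
threading.Thread(target=echo_server, args=(server,), daemon=True).start()

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(server.getsockname())
client.sendall(b"x")                   # warm up the connection
client.recv(64)

rounds = 10_000
start = time.perf_counter()
for _ in range(rounds):
    client.sendall(b"x")
    client.recv(64)
elapsed = time.perf_counter() - start
print(f"~{elapsed / rounds * 1e6:.1f} microseconds per kernel-mediated round trip")
client.close()
```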
The adapter manages traffic with the help of libfabric, an open-source networking framework that is used not only for AI workloads but also for other demanding applications, such as scientific simulations.
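On an instance where the EFA software is installed, libfabric's bundled fi_info command-line utility can confirm that the EFA provider is visible. The small wrapper below simply shells out to that tool; the "-p efa" flag for selecting the provider follows AWS's EFA setup documentation, and the check is only a convenience sketch.

```python
import shutil
import subprocess

# Quick sanity check that libfabric sees the EFA provider on this machine.
if shutil.which("fi_info") is None:
    print("libfabric's fi_info utility is not installed on this machine")
else:
    # "-p efa" restricts the listing to the EFA provider.
    result = subprocess.run(["fi_info", "-p", "efa"], capture_output=True, text=True)
    print(result.stdout or result.stderr)
```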
AWS expects to complete Project Rainier's construction by next year. Once operational, the system will become one of the world's largest compute clusters for training AI models. AWS claims it will deliver more than five times the performance of the systems Anthropic has used so far to develop its language models.
AWS announced Project Rainier approximately a year after revealing another large-scale AI cluster initiative.
That system, named Project Ceiba, uses Nvidia chips instead of Trainium2 processors. AWS initially planned to equip the supercomputer with 16,384 Nvidia GH200 chips, but this past March it switched to a configuration of 20,736 Blackwell B200 chips that is expected to deliver a sixfold performance increase.
Project Ceiba will support Nvidia's internal engineering efforts. The chipmaker plans to use the system for initiatives in areas such as language model research, biology, and autonomous driving.