Artificial intelligence (AI) bellwether NVIDIA recently announced that Colossus, the world’s largest supercomputer cluster, which is being used to train xAI’s Grok family of large language models (LLMs), relies on NVIDIA’s 800-Gbit/sec Spectrum SN5600 Ethernet switch, as well as other products in the company’s Spectrum-X Ethernet networking platform, for its Remote Direct Memory Access (RDMA) network. NVIDIA says the platform “is designed to deliver superior performance to multi-tenant, hyperscale AI factories using standards-based Ethernet.”
Colossus currently comprises 100,000 NVIDIA Hopper GPUs (graphics processing units) and is in the process of doubling to include 200,000 Hoppers. Colossus is located in Memphis, TN.
“The supporting facility and state-of-the-art supercomputer was built by xAI and NVIDIA in just 122 days, instead of the typical timeframe for systems of this size that can take many months to years,” NVIDIA said when announcing the Ethernet platform’s role in the cluster.
“Colossus is the most powerful training system in the world,” said Elon Musk on X. “Nice work by xAI team, NVIDIA and our many partners/suppliers.”
NVIDIA further reported that across all three tiers of the network fabric, the system has experienced zero application latency degradation or packet loss due to flow collisions. “It has maintained 95% data throughput enabled by Spectrum-X congestion control,” NVIDIA stated. “This level of performance cannot be achieved at scale with standard Ethernet, which creates thousands of flow collisions while delivering only 60% data throughput.”
“AI is becoming mission-critical and requires increased performance, security, scalability and cost-efficiency,” said Gilad Shainer, senior vice president of networking at NVIDIA. “The NVIDIA Spectrum-X Ethernet networking platform is designed to provide innovators such as xAI with faster processing, analysis and execution of AI workloads, and in turn accelerates the development, deployment, and time-to-market of AI solutions.”
A spokesperson for xAI said it “has built the world’s largest, most-powerful supercomputer. NVIDIA’s Hopper GPUs and Spectrum-X allow us to push the boundaries of training AI models at a massive scale, creating a super-accelerated and optimized AI factory based on the Ethernet standard.”
The Spectrum SN5600 supports port speeds of up to 800 Gbit/sec and is based on the Spectrum-4 switch ASIC. xAI is pairing the SN5600 switch with NVIDIA BlueField-3 SuperNICs.
NVIDIA concluded its announcement by saying that Spectrum-X Ethernet networking for AI “brings advanced features that deliver highly effective and scalable bandwidth with low latency and short tail latency, previously exclusive to InfiniBand. These features include adaptive routing with NVIDIA Direct Path Placement technology, congestion control, as well as enhanced AI fabric visibility and performance isolation—all key requirements for multi-tenant generative AI clouds and large enterprise environments.”