The Importance of Retimers in Large Scale Generative AI Systems

Marketing Artimar
Apr 25, 2024

Microchip META-DX2L PHY retimer enables scale-out for OCP-OAI 2.0.

AI Revolution: Transforming Innovation with Advanced Capabilities

Artificial Intelligence (AI) has become a hot topic over the past year, moving beyond industry buzz to become a critical component of innovation. It is a major point of discussion in the tech world, and whether through news articles or product launches, its presence is hard to miss. AI's capabilities, from understanding natural language to recognizing complex patterns in data, are expanding at an unprecedented rate. This surge in AI development and deployment is largely due to advances in AI processing hardware.

In this blog post, we look at how the standardization efforts behind Open Accelerator Infrastructure (OAI) 2.0, directed by the Open Compute Project (OCP), are making it more efficient and cost-effective to build AI servers. You'll learn why hardware standardization matters and why retimers are essential for enabling large-scale systems. We will explore the key role that Microchip's META-DX2L retimer plays in this evolution, and how these technologies come together to streamline AI hardware design.

The OAI initiative plays a pivotal role in streamlining AI hardware development, with an emphasis on making it more accessible, modular and cost-effective. By standardizing critical components like servers, motherboards, expansion modules and tray chassis, OAI is paving the way for a more interoperable ecosystem where components from various vendors can be mixed and matched. This approach not only accelerates the pace of innovation but also broadens access to cutting-edge AI technology. Figure 1 shows the building blocks of an example OAI system.

Figure 1: OAI System showing the building blocks (Source: OCP, OAI Spec)

In the field of AI, connecting multiple Graphics Processing Units (GPUs) and AI servers into a unified computing network is essential for carrying out intricate and intensive AI workloads. Connecting GPUs within the same server enables direct, rapid communication between them over high-bandwidth, low-latency links. Extending GPU connections across servers through expansion modules enables the creation of expansive GPU clusters. This method is critical for managing AI workloads that surpass the capabilities of a single server, allowing computational resources to scale across multiple locations. Retimers are essential components in the expansion modules: they maintain signal integrity over the longer distances of server-to-server connections so that data transfer remains reliable and accurate, which is vital for the performance of AI applications. OAI has specified expansion modules with a scale-out interface that uses QSFP-DD or OSFP connectors and retimers to support training large-scale machine learning models and processing large datasets in real time across multiple AI servers.

The increasing demands of AI applications for higher data throughput and processing capability have driven SerDes data rates from 56Gbps to 112Gbps. At 112Gbps, signal integrity challenges are exacerbated, and a PHY retimer is required to ensure that AI servers can communicate reliably.
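
To make the jump from 56Gbps to 112Gbps concrete, here is a minimal sketch of the per-lane arithmetic. It assumes PAM4 signaling (2 bits per symbol) at both rates and uses the nominal bit rates quoted above, ignoring FEC and encoding overhead; the function names are illustrative only. Doubling the bit rate doubles the symbol rate and the Nyquist frequency, and channel loss generally grows with frequency, which is one reason signal integrity becomes harder at 112Gbps.

```python
import math


def pam_symbol_rate_gbaud(bit_rate_gbps: float, levels: int = 4) -> float:
    """Symbol rate in GBaud for a PAM-N signal carrying log2(N) bits per symbol."""
    bits_per_symbol = math.log2(levels)
    return bit_rate_gbps / bits_per_symbol


def nyquist_ghz(symbol_rate_gbaud: float) -> float:
    """Nyquist frequency in GHz, which is half the symbol rate."""
    return symbol_rate_gbaud / 2.0


# Nominal per-lane rates from the text; PAM4 at both rates is an assumption.
for rate_gbps in (56.0, 112.0):
    baud = pam_symbol_rate_gbaud(rate_gbps)
    print(f"{rate_gbps:.0f} Gbps PAM4: {baud:.0f} GBaud, Nyquist {nyquist_ghz(baud):.0f} GHz")
```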

At Microchip, we're working closely with OCP, providing key components for AI server design. Our diverse portfolio of devices includes PCIe switches, TPM security modules, timing products and, importantly, our 112G META-DX2L retimer. The META-DX2L retimer is a fundamental component of the scale-out interface. As part of the expansion module, it maintains signal integrity as data moves from one server's GPUs, which sit on the Universal Base Board (UBB), across the expansion module and cable to a switch on another server. Figure 2 shows the block diagram of the OAI UBB 2.0 expansion module.

Figure 2: UBB 2.0 expansion (EXP) module block diagram

Our retimer features 32 × 112Gbps Long Reach PAM4 SerDes, supporting proprietary rates of up to 112G. The META-DX2L retimer not only meets but exceeds the OAI 2.0 channel model requirement of 30dB, offering unmatched performance and reliability. The exceptional performance, small footprint and low-latency retiming capabilities of the META-DX2L enable high-performance interconnect over cables, across boards and when scaling up to larger AI systems.
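
For a sense of scale, the two figures above can be turned into rough numbers. This sketch assumes the 30dB channel figure is an insertion loss expressed in voltage terms (20·log10), and it sums the nominal rate of all 32 SerDes without modeling how lanes are divided between the host and line sides of the device. Both are simplifying assumptions rather than statements from the OAI specification or the device documentation.

```python
def db_to_voltage_ratio(loss_db: float) -> float:
    """Fraction of the transmitted amplitude that remains after loss_db of loss."""
    return 10 ** (-loss_db / 20.0)


def aggregate_gbps(lanes: int, rate_gbps: float) -> float:
    """Raw sum of the nominal per-lane rates across all SerDes lanes."""
    return lanes * rate_gbps


remaining = db_to_voltage_ratio(30.0)
print(f"After 30 dB of loss, about {remaining * 100:.1f}% of the amplitude remains")
print(f"32 lanes x 112 Gbps = {aggregate_gbps(32, 112.0) / 1000:.3f} Tbps raw SerDes capacity")
```

A signal attenuated to a few percent of its original amplitude leaves little margin for noise and crosstalk, which is why the retimer recovers and retransmits the data instead of passing the degraded signal along.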

In developing the OAI 2.0 specifications, the OCP-OAI group recognized the need to add the SPI protocol, alongside I2C and MDIO, to the host interface choices for PHY retimers. The SPI interface provides quicker image downloads and faster bootup. SPI bus signal integrity was validated before it was included in the UBB 2.0 and EXP 2.0 specifications. Microchip was one of the contributors in validating the improved performance of the SPI bus, which gave the OAI group confidence in incorporating the SPI bus into their specifications.
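
The following back-of-the-envelope sketch illustrates why the choice of host interface affects image download time. Every number in it is an assumption made for illustration: the 512 KiB image size is hypothetical, the clock rates (400 kHz I2C, 2.5 MHz MDIO, 20 MHz SPI) are commonly seen values and not figures from the OAI specifications or the META-DX2L documentation, and the per-byte clock counts ignore addressing and command overhead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostBus:
    name: str
    clock_hz: float
    clocks_per_payload_byte: float  # bus clocks spent per byte of image data


def download_seconds(bus: HostBus, image_bytes: int) -> float:
    """Idealized transfer time: payload bytes times clocks per byte, over clock rate."""
    return image_bytes * bus.clocks_per_payload_byte / bus.clock_hz


# All values below are illustrative assumptions, not specification figures.
I2C = HostBus("I2C", 400e3, 9.0)     # 8 data bits plus 1 ACK bit per byte
MDIO = HostBus("MDIO", 2.5e6, 32.0)  # 64-clock frame (with preamble) carries 16 data bits
SPI = HostBus("SPI", 20e6, 8.0)      # 8 clocks per byte, no per-byte handshake

IMAGE_BYTES = 512 * 1024  # hypothetical firmware image size

for bus in (I2C, MDIO, SPI):
    print(f"{bus.name:>4}: {download_seconds(bus, IMAGE_BYTES):.2f} s")
```

Under these assumptions SPI finishes in a fraction of a second while the other two take several seconds; actual figures depend on the device, the image size and the clock rates the platform supports.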

The journey of AI hardware standardization is ongoing, and we remain a dedicated participant in this evolution. Our work with the META-DX2L retimer and contributions to the OAI 2.0 standard are just the beginning.

For more information about the META-DX2L and other devices in the META-DX family, visit the META-DX family page.

Azadeh Farzin, Apr 16, 2024

Tags/Keywords: AI-ML

Reposted from: https://www.microchip.com/en-us/about/media-center/blog/2024/importance-of-retimers-in-large-scale-generative-ai-systems?utm_campaign=metadx2l-retimer&utm_source=instagram.com&utm_medium=Post&utm_bu=CBU