Securing Programmable Networks

Shawn Singh | September 15, 2025

Teams that manage network infrastructure seek to stay ahead of the fast pace with which their organization deploys new applications and services that rely on the network for secure, fast, reliable, and scalable access to the internal, Internet, third party, and cloud hosted resources they need. To accomplish this, advanced network programmability methods are adopted to efficiently and dynamically deploy network devices and services. Amid this constant drive to avoid becoming a bottleneck, network teams also need to ensure that the additional programming interfaces, protocols, and tools used are enabled securely. As advanced network programmability is often adopted after an initial build of network infrastructure, organizations should avoid assuming that existing security controls will be sufficient to prevent exposure to new vulnerabilities.

As advanced network programmability is employed, it is important to review and address the additional security risks that result from the increased exposure that these features introduce. While major network device vendors provide security guides to harden devices against exposure, these guides may not sufficiently focus on the additional considerations and protections required when enabling and using programmability features. To address this gap, outlined below is a reference guide for organizations to use to further secure their networks.

Programmability in Networks

Modern network devices offer different levels of programmability. Advanced network programmability is the ability to interact with a network device using machine to machine level communication such as Application Programming Interfaces (APIs), custom applications developed using major languages like Python, scripting, and other programmatic methods to enable automated device configuration, monitoring, and dynamic network services. While not a new method of building and managing networks, use of advanced network programmability remains a major shift in the way that organizations provision and manage the many routers, switches, firewalls, and other devices that comprise the different networks they operate. This shift in network operations is due to the industry’s continued movement from producing closed box network devices that require use of proprietary command line interfaces (CLIs) and legacy standard management protocols like Simple Network Management (SNMP), to more open programmable devices that allow shell access to the network operating system (NOS), interactions using modern open standard APIs, the ability to host and run custom developed applications, as well as use of multi-vendor and agentless automation tools like Ansible. Self-hosted and Cloud Service Provider (CSP) based networks of different trust and risk levels where these devices play a critical role include Internet exposed perimeter networks, third party exposed Business to Business (B2B) networks, and massively scalable Datacenter networks.

Network Programmability Security Standard

Many years of experience, as well as recent vulnerability research and testing as it relates to the additional exposure of management interfaces used by machine level network programmability features is the core input that informs this standard. This Network Programmability Security Standard (NPSS) is vendor agnostic and may be used as a supplement to existing standards, or as the base to developing a new standard for any network infrastructure environment. To maintain future relevance and applicability that coincides with evolving vulnerabilities and attack vectors, the NPSS should be considered a living list that can be adapted, customized, and expanded by organizations as their requirements dictate.

NPSS 1 – Disable Unused APIs

A vendor may support multiple APIs for programmability, but not all APIs are created equal. There is some overlap in the network configuration and operational state resources that can be read and modified with stateless RESTCONF requests and stateful NETCONF requests, but vendors may choose to make more resources available using one API over the other for various technical reasons (such as the data models used to represent network data and security features supported by each). Compared to gNMI, there is much less overlap with RESTCONF and NETCONF as gNMI was designed for the primary purpose of streaming network device telemetry data that provides real time network state and performance statistics through client subscriptions instead of just being a configuration management interface. Vendor created REST APIs may additionally offer access to resources not supported by the standards organizations behind RESTCONF (IETF), NETCONF (IETF), and gNMI (OpenConfig). As each of these protocols increases the attack surface of the network device by using additional TCP ports and may improve the efficiency and speed of attacks if compromised, organizations should carefully evaluate their programmability needs and goals against the features available with each of these protocols and only enable the protocols required.

NPSS 2 – Change Default API Ports

Ports used by popular programmability methods include:

REST API: TCP 80, 443
RESTCONF: TCP 443 (HTTPS)
NETCONF: TCP 830 (XML over SSH)
gNMI: TCP 6030
Ansible: TCP 22 (SSH)

Network device platforms typically provide the ability to change these ports to another value. These ports should be changed to reduce the likelihood that an attacker will successfully establish a connection to the device using their knowledge of the default ports.

NPSS 3 – Restrict Shell Access

Access to the Linux shell that underpins many network devices is one of the most vulnerable attack vectors available. However, not every software or human administrator requires shell access. In fact, most daily operations for managing a network do not require shell access. If an organization finds themselves requiring shell access to implement a function or retrieve data not currently accessible through other programmability interfaces, this should trigger an action to either develop software that can instead retrieve the data via an existing API, a custom application, or make a feature request with the device vendor. Shell access can and should be disabled as a default standard and only enabled for the limited time necessary to make it unnecessary to have again (such as deploying an application hosted by the network device).

NPSS 4 – Use Encrypted Transport

There is no need to support unencrypted transport for modern network management and programmability traffic. If organizations still maintain legacy network devices and software that does not support encryption this is a risk to that organization that needs to trigger remediation actions that may include disabling services, replacing devices, and re-designing the network. Investing in and continuing to operate vulnerable network hardware and software is a risk prolonged by decisions, not a necessity.

NPSS 5 – Implement Daily NOS Vulnerability Check, Test, Patch Cycle

The traditional approach of weekly, quarterly, and annual efforts to review vendor security notices and schedule software patching is insufficient at protecting an organization from a security breach. Today and for the foreseeable future, organizations must always be on guard and be actively checking for vulnerabilities and solutions to minimize exploitation in their environment. Unique to teams managing network infrastructure that consists of devices manufactured by vendors is the lack of access in most cases to the NOS software codebase. There are several interfaces that vendors provide to interact with the NOS, but a device administrator cannot review the NOS codebase to independently find and patch vulnerabilities. Many network device vulnerabilities are reported by device vendors and only they can develop and provide an officially supported fix for these vulnerabilities.

NPSS 6 – Install Trusted Certificate on Network Devices

Network devices and administrators are not immune from the risk of bad actors inside or outside of their organization that seek to gain access to steal data, control, or destroy resources. Compromising the network is a very powerful way to accomplish these goals. Despite all the efforts made to secure network infrastructure, security is never guaranteed. Even if not publicly accessible, device web management interfaces that do not use a certificate issued by an organization’s trusted certificate authority (CA) are at risk of being exploited by a man-in-the-middle (MITM) attack where device credentials are captured and then used for malicious activity. At minimum, organizations should build their own internal Publick Key Infrastructure (PKI) or use a trusted external service to securely generate and issue certificates that can be deployed to devices to maintain secure communications with management interfaces.

NPSS 7 – Restrict Device Access to Secured Management Networks Only

A secure management network is more than just another multilayered wall of defense. By confining human and application programmatic access to network devices to be allowed from this network only, the attack vector to a successful exploit significantly shrinks. In practice, some organizations instead choose to promote mobility and flexibility for their administrators by allowing access from any internal network host and rely on Authentication, Authorization, and Accounting (AAA) services to protect against malicious access. However, while the AAA component of this strategy may achieve some level of conformity to the traditional and effective rule of least privilege, it is insufficient and a security risk due to its lack of regard for the security value of limiting points of entry. In other words, it is not secure or necessary to allow access from anywhere, even if this is restricted to the internal network. Mobility for network administrators can be achieved in a secure fashion by enabling access to jump boxes within the secured management network that can be used as a proxy for accessing the network infrastructure instead of doing so directly. This strategy has the security advantage of operating a central control, visibility, and policy enforcement point instead of the risk introduced by points of entry created by many administrator laptops, machines located at home, or internal hosts scattered throughout the organization. When access is required from a source outside of the internal network, a secure IPSec VPN infrastructure should be used to terminate these connections into a secure DMZ network from which authorized administrators can access the jump boxes used to manage the network infrastructure.

NPSS 8 – Use Secure API Key / Token Options

APIs that can be configured to enforce the use of keys or tokens for subsequent authentication and authorization of request transactions after the initial first level / user authentication and authorization process should be used. API keys and tokens are more than just another form of authentication and authorization; they are time limited, can be manually revoked as part of an API call, and negate the need to expose administrator device credentials to the applications that execute multiple API calls.

NPSS 9 – Replace Administrator Passwords with SSH Keys for Authentication

SSH keys are more secure than passwords for administrator and application authentication for access to network devices. While SSH keys and passwords may be captured and eventually decoded using the right techniques, tools, and computing power, the separation of public and private keys and the larger number of bits used to generate these keys are two major factors as to their greatly enhanced security compared to passwords. When supported by an API and vendor’s platform SSH keys should be used instead of passwords and can be combined with remote AAA services for greater security.

NPSS 10 – Use Dedicated Development and Test Environment

Although organizations will remain dependent on network device vendors to identify many vulnerabilities and provide patches, they should dedicate the time and other resources required to build and maintain a test and development environment for the purpose of testing and evaluating NOS software and planned changes made via network programmability methods. With Artificial Intelligence (AI) technologies, this environment could be in the form of a digital twin for network infrastructure.

Supplementing vendor published security advisories, security vulnerabilities may be identifiable by using the vulnerability scanning and penetration testing tools. While there are many free and paid options, having a set of tools and standard tests that can be run will go a long way to ensure vulnerabilities are identified and addressed prior to promoting any changes to production environments.

AI INFRA Summit Wrap-Up from Santa Clara

Shawn Singh | September 11, 2025

This week’s AI INFRA Summit in Santa Clara was quite immersive. I specifically focused on sessions curated for hardware, systems, and Data Center topics, but walked away with more knowledge than I could process in three days. Suffice it to say, I’ll be occupied with this for a bit longer. Here’s a short list of themes that continue to fire neurons in my brain. Let’s talk if you have a similar reaction.

Security above all else

No exceptions.

“The Network Is the Computer”

If this doesn’t resonate yet, keep reading.

Interconnection

No one winner (InfiniBand, Ethernet, NVLink, Ultra Accelerator Link, Ultra Ethernet) to cover all cases (scale up, scale out, scale in, scale across).
Read more: https://ualinkconsortium.org/

Transport

TCP, RDMA / RoCE are great, so is SRD.
Read more: https://ieeexplore.ieee.org/document/9167399

10p10u for Massive AI Clusters in AWS

10p (tens of petabits per second of throughput) in less than 10u (<10 microseconds).
Read more: https://www.aboutamazon.com/news/aws/aws-infrastructure-generative-ai

Software Defined Memory

Scalability requires deconstructed compute.
Read more: https://kove.com/

Great job funding your new Data Center build.

Good luck getting power.
Need to plan 3+ years out for power.

Co-Packaged Optics (CPO) vs. Linear Pluggable Optics (LPO)

Exponential power efficiency. Don’t scale without it.
Read more: https://blog.apnic.net/2025/05/07/co-packaged-optics-a-deep-dive/

Copper isn’t dead (yet)

Consider Co-Packaged Copper (CPC).
Read more: https://blog.samtec.com/post/co-packaged-interconnects/

The Neocloud

GPUaaS.
Read more: https://drivenets.com/resources/education-center/what-are-neocloud-providers/

Liquid Cooling Isn’t Coming

It’s been here. Use it.
Read more: https://datacenters.lbl.gov/liquid-cooling

Photonics

Possibly the biggest enabler to groundbreaking computing and interconnection scale.
Read more: https://lightmatter.co/resource/hot-interconnects-2025-nick-harris-ceo-founder/

Reducing Waste with AI Observability

Eliminating idle time is a must.
Read more: https://www.virtana.com/platform/aifo/

Be Open

Read more: https://www.opencompute.org/w/index.php?title=Main_Page

The Quantum Networking Mechanics Driving Extreme Computing

Shawn Singh | September 7, 2025

The technologies deployed for HPC and AI enable an immense level of highly scalable computing power that is accelerating progress in solving the world’s greatest challenges. However, as powerful as these widely deployed technologies are, they are not the peak of our capabilities to building the computing power required to overcome challenges in critical areas like information security, medical research, scientific discovery, natural disaster forecasting, safe autonomous transportation, sustainable energy production, and self-healing materials manufacturing. Rather, these classical computing components will be utilized in conjunction with quantum computing to enable exponentially greater performance that on-going hardware and software innovations alone will never be able to provide. By using quantum bits (qubits) that can simultaneously take on multiple states to encode more data for more powerful parallel computations, improving the performance, scale, accuracy, and applications of computing with quantum computers enables a fundamental shift in computing that is unattainable by the binary based bit scheme used in classical computing.

A current challenge to harnessing quantum compute power is how to efficiently enable and scale distributed quantum computing. Per researchers at Cisco Quantum Labs, a quantum processor is currently limited to processing tens or hundreds of qubits which is far below the millions of qubits required to be truly useful [1]. Furthermore, interconnecting quantum computers to route qubits using existing technologies like Ethernet and InfiniBand is not physically feasible. Instead, another fundamental shift is occurring to develop the components needed for a quantum network. As we approach a point in time when quantum networks are available outside of research labs, building knowledge now about the mechanics of transporting qubits will enable organizations to be strategically ready to leverage the combined strengths of AI, HPC, and quantum computing for solving the most complex problems.

Quantum Cargo

At the physical layer, data can be encoded in different forms, the selection of which is impacted by properties such as computational speed, transmission distance, stability, and ease of manipulation. Whereas classical computing uses alternating voltage levels to represent bit values of 0 and 1, quantum computing uses the unique properties of subatomic particles to encode data. According to quantum physics, these properties include the smallest units of energy exhibited by subatomic particles (their quanta), a single packet of which is known as a quantum. Quantum bits (qubits) may be in the form of superconducting materials, trapped ion particles, small single electron semiconductors, and photons [2]. As particles of light, photons are well suited for transmission using existing fiber infrastructure and are being used in various quantum networking development projects.

The power of qubits resides in the quantum property of superposition. Superposition means that a qubit can simultaneously straddle multiple states (values) such as 0 and 1 as opposed to 0 or 1 in classical computing [3]. Decoding a qubit requires it to be measured (read) and this action is said to collapse the qubit’s superposition into a resulting value. Superposition allows qubits to store exponentially more data than classical bits and is core to the power of quantum computing. To demonstrate this, Figure 1 [4] shows a comparison of the quantity of classical bits and quantum bits required to process the same amount of data.

Fig .1. Comparison of qubits and bits from [4]

Transporting Qubits

While qubits comprised of photons are transmittable via existing fiber, different computing and transmission media that can work with the quantum properties of these subatomic particles are required. One of these key properties is quantum entanglement, and the role of a quantum network is the sharing of entangled qubits between quantum computers that seek to communicate. When two qubits are entangled, they are inextricably linked such that the state of one qubit impacts the state of the other regardless of the distance between them [5]. Quantum entanglement of these particles can be created with lasers or a chip and induced by splitting a photon, bringing particles together to interact and yield an entangled pair, or causing an atom to shed excess energy by emitting two photons at once [5, 6, 7].

Once entangled qubits are distributed amongst machines, communication occurs by measuring one qubit in the pair with a qubit containing data from the sender’s memory that it wants to transmit to the receiver. This measurement changes the qubit’s state and results in an instantaneous state change in the corresponding remote qubit in the entangled pair at the receiver. Exchanging information in this way is called quantum teleportation given that the sender communicates with the receiver without having to transmit anything to trigger the state change of the remote qubit [3]. Since this state change collapses the entanglement, the sender needs to separately communicate the result of its measurement using classical binary data and networking to the receiver so that it can determine the quantum data that resulted in this state change. Figure 2 [8] from the MIT Technology Review article “Explainer: What is quantum communication” provides a visual description of how this process works:

Fig .2. Quantum Teleportation from MIT Technology Review [8]

This also demonstrates the ease of which eavesdropping is detected as any attempt to copy the state of an entangled qubit would collapse the entanglement and alert both sides of the communication. Furthermore, this also creates a challenge for enabling distributed quantum computing over a network. Classical routers and switches that copy and propagate data between ingress and egress buffers would break the quantum entanglement. Since entanglement is required to exchange quantum data between quantum computers, quantum repeaters are used to perform entanglement swapping along the path to reliably extend the distance between which qubits can be entangled [9]. These repeaters use specialized quantum memories to store and transfer qubits [10].

Networking Innovations

A quantum network of interconnected repeaters has limited scale. To enable distributed quantum computing over networks like what exists today for HPC and AI, new protocols and quantum capable chips, switches, and other hardware needs to be developed. To that end, Cisco Quantum Labs is developing a quantum switch (the Cisco Quantum Enabled Switch) that integrates quantum repeaters, quantum memory, and can provide non-blocking switching of entangled photons as well as the capability to employ routing and switching over quantum and classical networks [11]. In collaboration with the University of California Santa Barbara, Cisco Quantum Labs has also developed a prototype quantum network entanglement chip that generates over 200 million entangled photon pairs per second, uses only 1mW of power, does not require extreme cooling, and uses the same 1550 nm wavelength as existing fiber infrastructure [12]. Other private, academic, and government organizations contributing to the development of quantum networks include:

National Institute of Standards and Technology (NIST): researching use of non-linear optics for quantum network interconnections, operates the NIST Gaithersburg Quantum Network (NG-QNet) testbed for research and development.
AWS Center for Quantum Computing (AWS CQN) and Harvard University: deployment of a multi-node 35 km long quantum network in Boston for research and development.
The University of Chicago and Argonne National Laboratory: deployment of 124 miles of fiber that demonstrates the improving distance of quantum transmissions.

Large scale quantum Data Center networks and a quantum Internet may be available in the future. Watching the development of quantum networking as it enables the next evolution of distributed computing will position organizations to benefit from its capabilities as they become commercially available.

References

[1] H. Shapourian, E. Kaur, T. Sewell, J. Zhao, M. Kilzer, R. Kompella, R. Nejabati. “Quantum Data Center Infrastructures: A Scalable Architectural Design Perspective.” Arxiv. Accessed: Sep. 6, 2025. [Online]. Available: https://arxiv.org/abs/2501.05598

[2] J. Schneider, I. Smalley, “What is quantum computing.” IBM. Accessed: Sep. 6, 2025. [Online]. Available: https://www.ibm.com/think/topics/quantum-computing

[3] K. S. Vasantrao and A. Saxena, "Bits to Qubits: An Overview of Quantum Computing," 2025 International Conference on Intelligent Control, Computing and Communications (IC3), Mathura, India, 2025, pp. 120-124, doi: 10.1109/IC363308.2025.10956309.

[4] A. Yadav and R. Gangarde, "Quantum Computing and Cryptography: Addressing Emerging Threats," 2024 International Conference on Intelligent Systems and Advanced Applications (ICISAA), Pune, India, 2024, pp. 1-5, doi: 10.1109/ICISAA62385.2024.10828839.

[5] J. Schneider, I. Smalley, “What is a qubit?” IBM. Accessed: Sep. 6, 2025. [Online]. Available: https://www.ibm.com/think/topics/qubit

[6] C. Orzel, “How Do You Create Quantum Entanglement?” Forbes. Accessed: Sep. 6, 2025. [Online]. Available: https://www.forbes.com/sites/chadorzel/2017/02/28/how-do-you-create-quantum-entanglement/

[7] “Quantum Computing Explained.” NIST. Accessed: Sep. 6, 2025. [Online]. Available: https://www.nist.gov/quantum-information-science/quantum-computing-explained

[8] M. Giles, “Explainer: What is quantum communication?” MIT Technology Review. Sep. 6, 2025. [Online]. Available: https://www.technologyreview.com/2019/02/14/103409/what-is-quantum-communications/

[9] M. Johnson-Groh, “What is a Quantum Network?” Symmetry. Accessed: Sep. 6, 2025. [Online]. Available: https://www.symmetrymagazine.org/article/what-is-a-quantum-network?language_content_entity=und

[10] D. Levonian, “An Illustrated Introduction to Quantum Networks and Quantum Repeaters.” AWS. Accessed: Sep. 6, 2025. [Online]. Available: https://aws.amazon.com/blogs/quantum-computing/an-illustrated-introduction-to-quantum-networks-and-quantum-repeaters/

[11] R. Nejabati, H. Shapourian, P. Zhao, E. Kaur, M. Kilzer, L. Dela Chiesa, R. Kompella, “Quantum Networks of the Future, A Cisco Vision.” Cisco. Accessed: Sep. 6, 2025. [Online]. Available: https://research-strapi-s3.s3.us-east-2.amazonaws.com/QN_of_Future_2024_Release1_4_4a00a63028.pdf

[12] R. Kompella, R. Nejabati, “Building the Fabric of Quantum Networking: Inside Cisco’s Quantum Data Center vision.” Cisco. Accessed: Sep. 6, 2025. [Online]. Available: https://outshift.cisco.com/blog/cisco-quantum-data-center-vision

Speeding+ with InfiniBand

Shawn Singh | August 28, 2025

While 400 Gb/s and 800 Gb/s link speeds are the current interconnection standard for large scale AI/ML, HPC, and storage systems, the industry is expected to debut 1600 Gb/s in 2026 [1]. Yet, as the ceiling for bandwidth demand will likely remain fluid, the longevity of 1600 Gb/s as a future standard may not be long at all. To keep pace with the ever-increasing capacity of large-scale computing systems, the InfiniBand Trade Association (IBTA), maintainers of the InfiniBand interconnection standards, incorporates the development of 3200 Gbps and 9600 Gbps link speeds by 2030 in their current roadmap [2]. Ethernet, the other widely deployed interconnection technology for a much broader set of use cases and with a larger user base, will either outpace, keep pace, or trail InfiniBand. In years past, InfiniBand pioneer Mellanox (acquired by NVIDIA) claimed the title of being first to 40 Gb/s, 56 Gb/s, 100 Gb/s, and 200 Gb/s [3]. Regardless of which technology enables the next level of speed first, it is incumbent upon infrastructure teams to be versed in the architecture of both Ethernet and InfiniBand as they will coexist in large scale network infrastructures that support frontend networks, and the higher speed and lower latency backend networks required for AI (see Exploring Networks for AI). With the advent of Remote Direct Memory Access over Ethernet Version 2 (RoCEv2), Ethernet is now closer to providing InfiniBand-like low latency performance, but InfiniBand’s purpose, architecture, and other characteristics are unique enough to warrant its continued consideration for deployment alongside Ethernet, but dedicated for the specific AI/ML and HPC workloads that can leverage its capabilities.

To provide a bridge to network design with InfiniBand, outlined below is a summary of essential concepts and resources to evolve design playbooks as new InfiniBand (and Ethernet) features are released.

Layered Architecture Differences

InfiniBand is not a replacement for Ethernet. While Ethernet is designed to work within the TCP/IP layered architecture where independently developed protocols at each layer (application, transport, network, data link, physical) implement different functions to facilitate communication, InfiniBand is a distinct layered communications architecture with protocols specifically designed for InfiniBand that span each layer. Using the TCP/IP stack, the delivery of data between applications residing on different hosts is dependent on multiple levels of encapsulation implemented by different processes that add headers to the payload (the data being delivered) containing socket details (TCP or UDP ports to deliver data to right application), IP address details (to deliver data to the right host where the receiving application resides), and MAC address details (to deliver data to the next hop in the network along the path to the delivery destination) before transmission onto the physical network as a series of bits. The function of each of these processes are implemented by the sending application’s host TCP/IP software that depend on the host’s CPU for allocating resources (processing time, memory, buffers, data copies) to execute these functions and deliver the packaged (encapsulated) data to the host’s network interface card (NIC) for transmission onto the network. Similarly, the InfiniBand architecture consists of five layers (upper, transport, network, link, physical) but the components involved at each layer are different. At the upper layer, instead of common TCP/IP based applications like HTTP, FTP, and SSH, examples of applications that use InfiniBand include Message Passing Interface (MPI, used by HPC systems), storage systems like the Hadoop Distributed File System (HDFS), and real time data applications such as those used by electronic trading firms [4].

Direct Message Based Memory Transfers

Developers of InfiniBand based applications do not use socket level programming to establish connection oriented or connectionless data transfer sessions through TCP or UDP ports, the host’s CPU, or Ethernet NIC. Instead, the InfiniBand specification defines a set of verbs that is used to construct an API between applications and the InfiniBand transport layer, which then uses RDMA to deliver messages directly between the memory space of two remote hosts. The InfiniBand transport layer encompasses TCP like functions such as establishing connections, congestion control, error detection, error recovery, and in order packet delivery. Using verbs, data from applications in user space bypasses kernel space (and the host’s CPU) and is sent directly to a Host Channel Adapter (HCA). An HCA is a physical NIC installed in the host but used for communication on the InfiniBand network and is directly connected to an InfiniBand switch. With RDMA and HCAs, the InfiniBand transport layer is implemented completely in hardware and results in significantly lower message latency compared to TCP/IP based transmissions (1 microsecond or less is achievable). This is one of the most distinguishing features of InfiniBand. While kernel bypass and TCP offload NICs exist for TCP/IP networks, with pre-arranged and secure access between their memory address spaces, being able to optionally transfer data directly between the memory spaces of remote systems without the active involvement of one side (either sender or receiver) during transmission is another high-performance feature that distinguishes InfiniBand [4]. Using this “memory semantics” method of data simulates active local access to a remote system’s memory space.

Software Defined Networking

Another distinguishing feature of the InfiniBand architecture is its employment of Software Defined Networking (SDN). InfiniBand uses a Subnet Manager (SM) to manage control plane functions such as allocating link layer addresses, building Linear Forwarding Tables (LFT) to move and load balance data across all available paths, defining Quality of Service (QoS) policies, and monitoring the network fabric. The SM is software that may be deployed on a fabric connected server, switch, or dedicated appliance. To provision and monitor the InfiniBand network, the SM communicates to Subnet Manager Agents (SMAs) that are running on each member of the fabric during its initial discovery process. Compared to a TCP/IP based network, the use of the SM makes InfiniBand a plug and play network.

Link Layer Lossless Networking

At the link layer, InfiniBand uses an addressing scheme that consists of a Globally Unique Identifier (GUID) that is assigned by the manufacturer of the HCA (like an Ethernet MAC address), a logical address called the Local Identifier (LID) that is assigned to all endpoints (such as an HCA) located on the same subnet, and a Global Identifier (GID) that identifies the subnet address. Packet forwarding at the link layer is based on destination LIDs. While InfiniBand is designed to support large flat networks with up to 48,000 nodes per subnet, routing between different subnets is facilitated by network layer InfiniBand routers. Flow control, QoS, and data integrity mechanisms using two Cyclical Redundancy Checks (CRCs) are also employed at the link layer. Key to InfiniBand’s lossless networking is its use of credit-based flow control. With credit-based flow control, a receiver grants the ability to send data by issuing credits to the sender. Data is only transmitted if there are sufficient credits to ensure the receiver does not drop data.

Low Latency and High Speed

Physically, InfiniBand links are defined by their widths and incur a latency of approximately 5 nanoseconds per meter over copper or fiber media [5]. Link widths (denoted as 1X, 4X, or 12X) indicate the quantity of parallel transmit and receive wires (for copper) or fibers within a cable and the signaling rate (speed) of each pair of wires or fibers [6]. Each pair of transmit and receive wires or fibers is referred to as a lane. For example, 1X cables of the single data rate (SDR) generation are comprised of 4 wires or fibers (2 lanes) and operates at a speed of 2.5 Gb/s per lane and result in 5 Gb/s of full duplex transmission but is reduced to 4 Gb/s of raw data bandwidth after accounting for transmission overhead. On the other end of the speed scale, a 4X link of the extreme data rate (XDR) generation uses a 250 Gb/s signaling rate and effective bandwidth of 800 Gb/s.

Additional Resources

The InfiniBand architecture standard is currently over 2000 pages in length and is accessible to members of the IBTA. Although a review of the standard will provide a wealth of details not covered in this article, listed below are online resources from the industry that provide an in depth look at each of the InfiniBand layers. Use these resources to enhance your network design playbooks for deciding if the distinct features of InfiniBand should be included in your next project.

References

[1] “IEEE P802.3dj 200 Gb/s, 400 Gb/s, 800 Gb/s, and 1.6 Tb/s Ethernet Task Force.” IEEE802. Accessed: Aug. 28, 2025. [Online]. Available: https://www.ieee802.org/3/dj/index.html

[2] “InfiniBand Roadmap.” InfiniBand Trade Association. Accessed: August 28, 2025. [Online]. Available: https://www.infinibandta.org/infiniband-roadmap/

[3] “Introducing 200G HDR InfiniBand Solutions.” Nvidia. Accessed: Aug. 28, 2025. [Online]. Available: https://network.nvidia.com/pdf/whitepapers/WP_Introducing_200G_HDR_InfiniBand_Solutions.pdf

[4] P. MacArthur, Q. Liu, R. D. Russell, F. Mizero, M. Veeraraghavan and J. M. Dennis, "An Integrated Tutorial on InfiniBand, Verbs, and MPI," in IEEE Communications Surveys & Tutorials, vol. 19, no. 4, pp. 2894-2926, Fourthquarter 2017, doi: 10.1109/COMST.2017.2746083.

[5] “Cable Latency.” NVIDIA DGX SuperPOD: Cabling Data Centers Design Guide. Accessed: Aug. 28, 2025. [Online]. Available: https://docs.nvidia.com/dgx-superpod/design-guide-cabling-data-centers/latest/cable-latency.html

[6] “InfiniBand Cables Primer Overview” NVIDIA DGX SuperPOD: Cabling Data Centers Design Guide. Accessed: Aug. 28, 2025. [Online]. Available: https://docs.nvidia.com/dgx-superpod/design-guide-cabling-data-centers/latest/infiniband-overview.html

Exploring Network Design for AI

Shawn Singh | August 20, 2025

Your next project might require building infrastructure for Artificial Intelligence (AI) workloads. To get started, I’ve surveyed various resources that highlight the network features that large scale AI deployments require and briefly summarized them below. While this is a large topic that is only partially covered by this article, if you may be embarking on a network design project for AI or would like to explore how a network design for AI differs from typical Data Center networks, this article is intended to be a starting point for your journey. Links to key references for detailed explanations are included below.

AI Workloads

These are the users of the network and an understanding of how they function is an essential input to the network design. AI workloads incorporate all the processes involved in completing tasks to train AI models, and are distributed across multiple systems, consume data from many sources (user inputs, web content, databases, output from AI cluster members), process different data types (text, video, images, audio), and provide output in near real time. Common AI workload types include machine learning to make predictions from existing data, deep learning for highly complex tasks like speech and image recognition, natural language processing for understanding and generating human language, and generative AI for creating new content [1]. There are other types of workloads, but their common requirement is the ability to receive and send a large quantity of data at speeds exceeding 100 – 400 Gb/s across distributed systems using a low latency network to facilitate real-time processing and output to the applications using AI.

AI Compute

The heavy computing power required by AI systems requires the extremely high performance, parallel, and accelerated processing that Graphical Processing Units (GPUs) and Tensor Processing Units (TPUs)provide. While not originally created for AI workloads, GPUs have proven to be well suited to and continue to be developed for the parallel processing requirements of machine learning and other AI workloads that cannot be handled by CPUs. In contrast, Google developed the TPU as an Application Specific Integrated Circuit (ASIC) in 2015 as a purpose-built accelerator for AI workloads [2], [3]. A tensor is a multidimensional data structure used in ML frameworks [4] like TensorFlow and PyTorch. Both GPUs and TPUs require high speed networking to send and receive data and use speeds exceeding 100 – 400 Gb/s.

Traffic Flows

Reference architectures from networking manufacturers place special emphasis on managing elephant flows, mice flows, and minimizing tail latencies for packets traversing the network that carry AI workflow messages [5] – [7]. Elephant flows are high bandwidth flows. For AI training workloads, elephant flows are few in quantity but are long in duration and represent the bulk of traffic that consumes network bandwidth and switching queues. In contrast, mice flows are low bandwidth flows. In typical enterprise and Data Center networks, mechanisms to balance the traffic load across all available links between sources and destinations are employed and may be based on flows or packets. In this context, flows are IP packets, and each flow is uniquely identified by source and destination IP address and transport protocol port number pairs.

When flow-based load balancing is employed, load balancing across all available links is more effective as the number of unique flows increases. In a network that is comprised of a few elephant flows, there is a greater chance that these high bandwidth flows will utilize only a small number of links which will eventually lead to congestion, delay, and packet loss. Flow based load balancing will not work for AI training workloads given their use of elephant flows. In contrast, per packet or packet spraying load balancing would improve link utilization but result in out of order packet delivery. For TCP flows, while TCP will re-order data to ensure it is delivered in proper sequence this adds latency and is counter to the low latency communication AI workloads require. For UDP flows, re-ordering data will need to be completed by the receiving application, and this adds latency and complexity to the overall process (hardware solutions may be available to mitigate this latency impact). In either scenario, the negative impact of tail latency is consistently highlighted as a risk that the network needs to prevent. Tail latency occurs when some AI workflow messages are significantly delayed well beyond the average latency experienced by other messages in the workload [8]. The messages that experience tail latency slow down the entire workload process for all GPUs or TPUs involved. A network infrastructure dedicated for AI workloads needs to implement mechanisms to eliminate the congestion, packet loss, and tail latency for these traffic flows (see the lossless networking section below).

Data Rates

The data rates required are well beyond the 10 Gb/s server Network Interface Card (NIC) connections, and multi-40 and 100 Gb/s switch interconnects that are predominant in other Data Center networks. In a network design for AI workloads, be ready to deploy 100 or 400 Gb/s server NIC connections, and multi-400 Gb/s, or multi-800 Gb/s switch interconnections. Furthermore, a standard for 1.6 Terabit Ethernet is in the works by the IEEE [10].

Interconnection Architectures

While Ethernet remains the dominant interconnection architecture for facilitating data transmission across networked components for the widest variety of use cases, the InfiniBand architecture is widely deployed for high performance computing (HPC) and AI environments due to its specific focus on direct, high bandwidth, and low latency communications between processors distributed across networked servers. One of InfiniBand’s most powerful features is its support for Remote Direct Memory Access (RMDA). RDMA allows processes running on different servers to directly access the memory running other servers without involving the server’s CPU or kernel [9]. As a result, data transfer between systems using RDMA occur at significantly lower latency. Originally developed for use over InfiniBand networks, there is now a version of RDMA that operates on Ethernet networks. Known as RoCEv2 (Remote Direct Memory Access over Converged Ethernet Version 2), Ethernet is now an attractive alternative to InfiniBand for interconnecting HPC and AI nodes. Given its wider support by equipment manufacturers and network teams, Ethernet will likely become the interconnection architecture of choice for HPC and AI workloads. To continue to improve Ethernet data rates, latency, traffic load balancing, congestion management, and other features needed to optimize AI and HPC workflows, the industry has formed the Ultra Ethernet Consortium (UEC) to develop new Ethernet enhancing protocols and released it first specification document on June 11, 2025 [11]. Tracking the progress of the UEC and incorporating the protocols developed will be key to designing future proof networks for AI and HPC workloads.

Low Latency Messaging

For AI training clusters, the current UEC specification outlines a one-way latency target of 2 – 10 microseconds for an unloaded network transporting RMDA messages [11]. RDMA and RoCEv2 remain the standard low latency messaging protocols employed for AI workloads, and both require lossless networks to enable high bandwidth, low latency, and congestion free data delivery between nodes distributed across the network fabric. While RoCEv2 and the wider availability of Ethernet hardware from multiple manufacturers makes them more attractive than InfiniBand for networks that support AI workloads, designing, deploying, and managing a lossless network using Ethernet requires the use of additional mechanisms that need to be configured, fine-tuned, and work in coordination on applications, server NICs, and switches using the network. The UEC has recognized that a simpler protocol that does not require a lossless network is required for the future and are working towards developing the Ultra Ethernet Transport (UET) protocol to replace RoCEv2. Alternative protocols like the Tesla Transport Protocol over Ethernet (TTPoE) that aim to address the drawbacks of TCP and RoCE by implementing all transport functions in hardware have been made public and may be suitable for some deployments [12]. Note that Telsa announced it is joining the UEC to share TTPoE and contribute to the development of a publicly available standard.

Network Blocks

Data Center networks are comprised of different functional blocks to provide shared services for many applications, dedicated services for specific applications, connectivity to systems within the same Data Center, and connectivity to systems outside of the Data Center. When building a network for AI, the following network blocks are deployed and are represented in the UE specification:

Frontend Network

The frontend network is not designed for carrying the bulk of AI workload traffic. This network is the familiar Data Center network designed for providing access to shared services, interconnecting Data Centers, the Internet, and other outside entities. Servers hosting AI systems typically have dedicated NICs to the frontend network. A spine-leaf-border switching topology that carries north-south traffic (for external connectivity) and east-west (for intra-Data Center connectivity) traffic flows is typically used here. This network is built to tolerate packet loss and millisecond level latency.

Backend Scale-Out Network

A separate set of server NICs dedicated for GPU and TPU network connectivity comprise the backend scale-out network. This network is optimized for high throughput (100 Gb/s, 400 Gb/s, 800 Gb/s, 1.6 Tb/s), low latency (10 microseconds or less), lossless connectivity. A spine-leaf switching topology is also used here but traffic flows are mostly east-west for communication across thousands of GPU and TPU nodes. Interconnecting GPUs and TPUs across a switched network fabric is the primary purpose of the backend scale-out network.

Scale-Up Network

Several GPUs and TPUs within the same server may need to directly communicate at the lowest possible latencies depending on the AI workload involved. The scale-up network facilitates this communication using a single switch or no switch at all. Using a single switch or short-range high-speed interconnects like AMD XGMI, NVIDIA NVLINK, and Intel Xe link provides the lowest latency (less than 1 microsecond).

Rail Optimized Topology

While not referenced in the UE specification, a rail optimized topology may be deployed within the spine-leaf topology and is outlined in reference architectures from the industry. In this topology, communication between accelerators (GPUs, TPUs) never crosses more than a single switching hop so long as each accelerator is connected to the same “rail” [13]. In this context, a rail is a leaf switch that a set of accelerators physically connect to. For example, if there is a pool of 100 accelerators distributed across Data Center racks you could create 10 rails (identified as rails 0 – 9) that consist of 10 accelerators each. Each rail could be hosted by their own leaf switch or multiple rails could be attached to a single switch. Without a rail optimized topology traffic between accelerators connected to different leaf switches would need to traverse two switching hops (spine, destination leaf). Since the quantity of ports per leaf switch will be less than the total quantity of accelerators deployed in a Data Center, the spine-leaf path is necessary to scale out connectivity, but latency for a subset of accelerator traffic can be further reduced by also deploying the rail optimized topology.

Lossless Ethernet Networking

As part of a modular communications architecture, Ethernet does not inherently include mechanisms to prevent or recover from packet loss and is instead designed to work in conjunction with higher level protocols like TCP and features employed by some applications and hardware that implement these functions when there are outages, overflowing buffers, or other issues impacting packet delivery along the network path. For AI workloads that may require the simultaneous execution of processes across thousands of distributed accelerators, dropped packets result in extraordinary delays and complete failures that may require all processes to restart. When building an Ethernet network to support these AI workloads, a lossless environment is achievable by using RoCEv2 and reviving some techniques that may have been disbanded in non-AI/HPC Data Center networks.

Non-Blocking Topology

A must have for lossless networking, the standard spine-leaf topology and its variants, along with selecting switches with appropriately sized resources (ports, ASIC performance, buffer architecture, memory, etc.), and the bandwidth of interconnecting switch links to achieve a 1:1 oversubscription ratio (ensuring the bandwidth of all host links on each leaf switch is equal to the bandwidth of all uplinks to the spine switch) are all required to build a network without bottlenecks. Given the use of elephant flows by AI workloads it is critical to ensure the switching fabric does not create an environment of contention where flows are competing for network resources.

DCQCN

Data Center Quantized Congestion Notification (DCQCN) is a combination of Priority-Based Flow Control (PFC) and Explicit Congestion Notification (ECN) [14] that may not be deployed in other networks due to the administrative configuration overhead on the network devices as well as server NICs, performance, and the comparatively inexpensive but sometimes short-term value of simply adding more bandwidth to congested network paths. PFC and ECN are deployed together as they each serve distinct purposes (preventing packet loss vs. preventing congestion).

PFC is a hop-by-hop mechanism used to slow down traffic from a sender when a receiver is unable to accept incoming traffic at the rate it is being delivered [14]. When PFC is enabled, a receiver (which may be an end host or switch along the path) can use a Pause Frame to request that a sender (which is the directly connected sender of data and may not be the originating source of data) stop sending traffic for a period of time. PFC operation needs to be enabled on all switches and NICs connected to the network to function. This helps avoid packet drops at the receiver end but adds latency for the duration of the pause interval and may eventually lead to packet drops if switching buffers overflow while data transmission is paused. Contrary to its name, PFC does not consider individual flows when pausing transmission and instead impacts all data (all flows) traversing an interface.

Unlike PFC, ECN works on an end-to-end basis to allow the originating data source to be informed by the end receiver that the rate of transmission needs to be slowed to prevent congestion [14]. Like PFC, all devices in the path of senders and receivers (including all intermediary switches) must be ECN enabled. To prevent congestion, the receiver will mark packets back to the sender as ECN to signal to the receiver to reduce its sending rate until congestion is cleared.

In a lossless Ethernet network using DCQCN to prevent packet loss, the NICs of all connecting hosts and all switches should be configured to use ECN to first prevent congestion by reducing data transmission rates without impacting the flows for other traffic, then activate PFC as a last resort to prevent packet drops [14].

Power Efficiency

Thousands of distributed accelerators will certainly require a substantial amount of power. With NICs providing connectivity for these accelerators through multiple interfaces per card, the power required by pluggable optics is tremendous. To address this power surge, Linear Pluggable Optics (LPO) should be considered when designing the cabling infrastructure for a large-scale AI Data Center network. In 2024, the Linear Pluggable Optics Multi-Source Agreement (LPO MSA) was founded to define a standard to use in the manufacturing of LPOs for compatibility across network devices [15]. The key differentiator with LPO vs. a regular pluggable optic is the omission of the Digital Signal Processor (DSP) in the optic. Instead, the signal modulation, digital compensation, forward error correction, distortion corrections (linearization), timing signal recovery, and diagnostics functions of the DSP are done by the DSP already embedded within the switches at each end of the link [16]. LPO reduces power consumption (by almost 50% for an 400G optic), latency, and costs (by about 20-40% for a 400G optic) for large scale distributed AI connectivity [17].

Network Platforms for AI Workloads

Outside of designing and manufacturing your own switches, there are a plethora of platform options available from industry manufacturers. Aside from the usual selection criteria related to hardware capacity, performance, and cost, keeping abreast of the ongoing developments of the UEC will be key in future proofing your network as network hardware platforms and operating systems evolve to optimize AI workloads. To help sort through the large pool of enterprise and Data Center grade platform options, listed below is a snapshot of high-performance switching and routing platforms aimed at supporting the AI workloads within the backend networks discussed in this article. Each platform is comprised of a group of switches and / or routers in the series and the maximum port speeds of each device in the series are indicated in parentheses. Note that ports may be broken out into different speeds (see the specification pages for details).

Nvidia

- Quantum-X800 InfiniBand switches (800 Gb/s)

- Quantum-2 InfiniBand switches (400 Gb/s)

- Quantum QM8700 InfiniBand switches (200 Gb/s)

- Specturm-4 Ethernet switches (200 Gb/s, 800 Gb/s)

Cisco

- Silicon One Ethernet switches (800 Gb/s) and routers (1.6 Tb/s)

- Nexus 9000 Ethernet switches (800 Gb/s)

Arista

- 7060X6 Ethernet switches (800 Gb/s)

- 7060X Ethernet switches (400 Gb/s, 800 Gb/s)

- 7388X5 Ethernet switches (200 Gb/s, 400 Gb/s)

- 7800R4 Ethernet switches (400 Gb/s, 800 Gb/s)

- 7700R4 Ethernet switches (800 Gb/s)

Juniper

- QFX Ethernet switches (100 Gb/s, 400 Gb/s, 800 Gb/s)

- PTX 10002-36QDD Ethernet routers (800 Gb/s)

- PTX modular Ethernet routers (800 Gb/s)

References

[1] “6 Types of AI Workloads, Challenges and Critical Best Practices.” Cloudian. Accessed: Aug. 20, 2025. [Online.] Available: https://cloudian.com/guides/ai-infrastructure/6-types-of-ai-workloads-challenges-and-critical-best-practices/

[2] M. McHugh-Johnson. “Ask a Techspert: What’s the difference between a CPU, GPU, and TPU?” Google - The Keyword. Accessed: Aug. 20, 2025. [Online.] Available: https://blog.google/technology/ai/difference-cpu-gpu-tpu-trillium/

[3] C. Gartenberg. “TPU transformation: A look back at 10 years of our AI-specialized chips.” Google Cloud Blog. Accessed: Aug. 20, 2025. [Online.] Available: https://cloud.google.com/transform/ai-specialized-chips-tpu-history-gen-ai?e=13802955

[4] “Introduction to Tensors.” TensorFlow. Accessed: Aug. 20, 2025. [Online.] Available: https://www.tensorflow.org/guide/tensor

[5] “Networking the AI Data Center.” Juniper. Accessed: Aug. 20, 2025. [Online.] Available: https://www.juniper.net/content/dam/www/assets/white-papers/us/en/networking-the-ai-data-center.pdf

[6] “Networking for the Era of AI: The Network Defines the Data Center.” Nvidia. Accessed: Aug. 20, 2025. [Online.] Available: https://nvdam.widen.net/s/bvpmlkbgzt/networking-overall-whitepaper-networking-for-ai-2911204

[7] “Cisco Data Center Networking Blueprint for AI/ML Applications.” Cisco. Accessed: Aug. 20, 2025. [Online.] Available: https://www.cisco.com/c/en/us/td/docs/dcn/whitepapers/cisco-data-center-networking-blueprint-for-ai-ml-applications.html

[8] S. Verma. “Network Impairments in AI Data Centers.” Calnex. Accessed: Aug. 20, 2025. [Online.] Available: https://calnexsol.com/understanding-tail-latency-network-impairments-in-ai-data-centers/

[9] “Everything You Wanted to Know About RDMA But Were Too Proud to Ask.” SNIA. Accessed: Aug. 20, 2025. [Online.] Available: https://www.snia.org/educational-library/everything-you-wanted-know-about-rdma-were-too-proud-ask-2025

[10] “IEEE P802.3dj 200 Gb/s, 400 Gb/s, 800 Gb/s, and 1.6 Tb/s Ethernet Task Force.” IEEE802. Accessed: Aug. 20, 2025. [Online.] Available: https://www.ieee802.org/3/dj/index.html

[11] “Ultra Ethernet Specification v1.0.” UltraEthernet. Accessed: Aug. 20, 2025. [Online.] Available: https://ultraethernet.org/wp-content/uploads/sites/20/2025/06/UE-Specification-6.11.25.pdf

[12] E. Quinnell. “Tesla Transport Protocol over Ethernet (TTPoE).” Hot Chips. Accessed: Aug. 20, 2025. [Online.] Available: https://hc2024.hotchips.org/assets/program/conference/day2/17_HC2024_Tesla_TTPoE_v5.pdf

[13] “Ethernet in the Age of AI: Adapting to New Networking Challenges.” SNIA. Accessed: Aug. 20, 2025. [Online.] Available: https://www.snia.org/educational-library/ethernet-age-ai-adapting-new-networking-challenges-2024

[14] “Data Center Quantized Congestion Notification (DCQCN).” Juniper. Accessed: Aug. 20, 2025. [Online.] Available: https://www.juniper.net/documentation/us/en/software/junos/traffic-mgmt-qfx/topics/topic-map/cos-qfx-series-DCQCN.html

[15] “Twelve Industry Leaders Collaborate To Define Specifications For Linear Pluggable Optics.” LPO MSA. Accessed: Aug. 20, 2025. [Online.] Available: https://www.lpo-msa.org/news/twelve-industry-leaders-collaborate-to-define-specifications-for-linea

[16] “AI Networking”. Arista. Accessed: Aug. 20, 2025. [Online.] Available: https://www.arista.com/assets/data/pdf/Whitepapers/AI-Network-WP.pdf

[17] “What is LPO Optical Transceiver Module?” FiberMall. Accessed: Aug. 20, 2025. [Online.] Available: https://www.fibermall.com/blog/what-is-lpo-optical-module.htm?srsltid=AfmBOoo0pvNL-d5f3trPpz2qJ5FmZAIwSDiniY6G-kuDjcUfWkv3cuRY#The_Advantages_of_LPO

Page updated

Report abuse