More than 1.35 billion people use Facebook regularly and rely on seamless, “always on” site performance. Behind the scenes, many advanced sub-systems and pieces of infrastructure work together to make such a real-time experience possible.
Facebook’s production network is itself a large distributed system, with specialized tiers and technologies for different tasks: edge, backbone, and data centres. It has to constantly scale and evolve, rapidly adapting to application needs.
Traffic from Facebook to the Internet, known as “machine to user” traffic, is large and ever-increasing, as more people connect and as new products and services are created. However, this type of traffic is only the tip of the iceberg. What happens inside the Facebook data centres – “machine to machine” traffic – is several orders of magnitude larger than what goes out to the Internet.
The back-end service tiers and applications are distributed and logically interconnected. They rely on extensive real-time “cooperation” with each other to deliver a fast and seamless experience on the front end, customized for each person using the apps and site. Even with continuous optimization of internal application efficiency, machine-to-machine traffic keeps growing exponentially, with volume doubling in less than a year.
The ability to move fast and support rapid growth is at the core of Facebook’s infrastructure design philosophy. At the same time, Facebook strives to keep the networking infrastructure simple enough that small, highly efficient teams of engineers can manage it. The goal is to make deploying and operating Facebook networks easier and faster over time, despite the scale and exponential growth.
The Traditional Approach: Clusters
The previous data centre networks were built using clusters. A cluster is a large unit of deployment, involving hundreds of server cabinets with top of rack (TOR) switches aggregated on a set of large, high-radix cluster switches. More than three years ago, Facebook developed a reliable layer3 “four-post” architecture, offering 3+1 cluster switch redundancy and 10x the capacity of the previous cluster designs. But as effective as it was in early data centre builds, the cluster-focused architecture has its limitations.
First, the size of a cluster is limited by the port density of the cluster switch. To build the biggest clusters, Facebook needed the biggest networking devices, which were available only from a limited set of vendors. Additionally, the need for so many ports in a box is orthogonal to the desire to provide the highest-bandwidth infrastructure possible: when interface speeds evolve, the largest port densities are slow to follow. Operationally, the bigger bleeding-edge boxes are not better either. They have proprietary internal architectures that require extensive platform-specific hardware and software knowledge to operate and troubleshoot. And with large areas of the data centre depending on just a few boxes, the impact of hardware and software failures can be significant.
Even more difficult is maintaining an optimal long-term balance between cluster size, rack bandwidth, and bandwidth out of the cluster. The whole concept of a “cluster” was born from a networking limitation – it was dictated by the need to position a large amount of compute resources (server racks) within an area of high network performance supported by the internal capacity of the large cluster switches. Traditionally, inter-cluster connectivity is oversubscribed, with much less bandwidth available between the clusters than within them. This assumes, and then dictates, that most intra-application communication occurs inside the cluster. But a typical Facebook data centre contains many clusters, and machine-to-machine traffic grows between them, not just within them. Allocating more ports to accommodate inter-cluster traffic takes away from the cluster sizes. With rapid and dynamic growth, this balancing act never ends.
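The trade-off can be made concrete with a quick calculation. The sketch below is purely illustrative – the rack counts and port allocations are hypothetical, not Facebook’s actual figures:

```python
def oversubscription(downlink_gbps: float, uplink_gbps: float) -> float:
    """Ratio of bandwidth a tier accepts from below to what it can send
    upward; a result of 1.0 means non-blocking (1:1)."""
    return downlink_gbps / uplink_gbps

# Hypothetical cluster: 200 racks at 160G each feeding the cluster
# switches, but only 8 Tbps of switch ports facing other clusters.
ratio = oversubscription(downlink_gbps=200 * 160, uplink_gbps=8_000)
print(f"inter-cluster oversubscription: {ratio:.0f}:1")  # 4:1
```

Reclaiming cluster-switch ports to lower this ratio shrinks the maximum cluster size, which is exactly the never-ending balancing act described above.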
Facebook Fabric Network
The next-generation data centre network design challenged Facebook to make the entire data centre building one high-performance network, instead of a hierarchically oversubscribed system of clusters. Facebook also wanted a clear and easy path for rapid network deployment and performance scalability, without ripping out or customizing massive previous infrastructures every time more capacity is needed.
To achieve this, Facebook took a disaggregated approach: instead of large devices and clusters, the network was broken up into small identical units – server pods – with uniform high-performance connectivity between all pods in the data centre.
There is nothing particularly special about a pod – it’s just like a layer3 micro-cluster. The pod is not defined by any hard physical properties; it is simply a standard “unit of network” in Facebook’s new fabric. Each pod is served by a set of four devices called fabric switches, maintaining the advantages of the 3+1 four-post architecture for server rack TOR uplinks, and scalable beyond that if needed. Each TOR currently has 4 x 40G uplinks, providing 160G of total bandwidth capacity for a rack of 10G-connected servers.
Each pod has only 48 server racks, and this form factor is always the same for all pods. It’s an efficient building block that fits nicely into various data centre floor plans, and it requires only basic mid-size switches to aggregate the TORs. The smaller port density of the fabric switches makes their internal architecture very simple, modular, and robust, and there are several easy-to-find options available from multiple sources.
Another notable difference is how the pods are connected together to form a data centre network. For each downlink port to a TOR, an equal amount of uplink capacity is reserved on the pod’s fabric switches, which allows the network performance to scale up to statistically non-blocking.
To implement building-wide connectivity, Facebook created four independent “planes” of spine switches, each scalable up to 48 independent devices within a plane. Each fabric switch of each pod connects to every spine switch within its local plane. Together, pods and planes form a modular network topology capable of accommodating hundreds of thousands of 10G-connected servers, scaling to multi-petabit bisection bandwidth, and covering the data centre buildings with non-oversubscribed rack-to-rack performance.
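The figures above support a back-of-the-envelope scale check. This sketch uses only numbers stated in the text; the derived values are simple arithmetic:

```python
# Fabric scale, using the figures from the text.
link_gbps = 40               # all switch-to-switch links are 40G
tor_uplinks = 4              # 4 x 40G uplinks per TOR
racks_per_pod = 48
fabric_switches_per_pod = 4
planes = 4
spines_per_plane = 48        # maximum devices per spine plane

rack_gbps = tor_uplinks * link_gbps              # 160G per rack
pod_downlink_gbps = racks_per_pod * rack_gbps    # 7,680G into one pod
pod_uplink_gbps = pod_downlink_gbps              # 1:1 reservation -> non-blocking
per_fabric_switch_uplink = pod_uplink_gbps / fabric_switches_per_pod
max_spines = planes * spines_per_plane           # building-wide spine count

print(rack_gbps, pod_downlink_gbps, per_fabric_switch_uplink, max_spines)
```

Each pod therefore needs 1.92 Tbps of uplink per fabric switch, spread across the up-to-48 spine switches in that switch’s plane – which is why mid-size, modest-port-density hardware suffices for every role.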
For external connectivity, Facebook equipped the fabric with a flexible number of edge pods, each capable of providing up to 7.68 Tbps to the backbone and to back-end inter-building fabrics on the data centre sites, and scalable to 100G and higher port speeds within the same device form factors.
This highly modular design allows capacity to be scaled quickly in any dimension, within a simple and uniform framework. When Facebook needs more compute capacity, it adds server pods. When it needs more intra-fabric network capacity, it adds spine switches on all planes. When it needs more extra-fabric connectivity, it adds edge pods or scales uplinks on the existing edge switches.
Facebook took a “top down” approach: thinking in terms of the overall network first, and then translating the necessary actions to individual topology elements and devices.
Facebook was able to build the fabric using standard BGP4 as the only routing protocol. To keep things simple, only the minimum necessary protocol features were used. This made it possible to leverage the performance and scalability of a distributed control plane for convergence, while offering tight, granular control over routing propagation and ensuring compatibility with a broad range of existing systems and software. At the same time, Facebook developed a centralized BGP controller that can override any routing path on the fabric by pure software decision. Facebook calls this flexible hybrid approach “distributed control, centralized override.”
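The hybrid can be sketched in a few lines. The data structures and names below are illustrative only – they are not Facebook’s controller API – but they capture the precedence rule:

```python
# "Distributed control, centralized override" in miniature: BGP fills a
# per-prefix ECMP next-hop set, and a centralized controller may replace
# it outright with a software-chosen path.
bgp_rib = {
    "10.1.0.0/24": ["spine-1", "spine-2", "spine-3", "spine-4"],
}
controller_overrides = {}  # prefix -> next hops forced by the controller

def next_hops(prefix):
    # A controller-injected route always wins over the BGP-learned one.
    return controller_overrides.get(prefix) or bgp_rib.get(prefix, [])

# Normally traffic spreads over the BGP-learned ECMP set...
assert next_hops("10.1.0.0/24") == ["spine-1", "spine-2", "spine-3", "spine-4"]

# ...until software steers the prefix away from a misbehaving spine.
controller_overrides["10.1.0.0/24"] = ["spine-1", "spine-2", "spine-4"]
assert next_hops("10.1.0.0/24") == ["spine-1", "spine-2", "spine-4"]
```

The distributed BGP mesh handles fast convergence on its own; the centralized layer is only consulted when operators or automation want to override it.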
The network is all layer3 – from TOR uplinks to the edge. And like all Facebook networks, it is dual-stack, natively supporting both IPv4 and IPv6. The routing is designed to minimize the use of RIB and FIB resources, allowing Facebook to leverage merchant silicon and keep the requirements on switches as basic as possible.
For most traffic, the fabric makes heavy use of equal-cost multi-path (ECMP) routing with flow-based hashing. There is a very large number of diverse concurrent flows in a Facebook data centre, and statistically the load distribution across all fabric links is almost ideal. To prevent occasional “elephant flows” from taking over and degrading an end-to-end path, the network is multi-speed – 40G links between all switches, while servers connect to the TORs on 10G ports. Facebook also has server-side means to “hash away” from and route around trouble spots if they occur.
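Flow-based ECMP hashing works roughly as follows. The CRC32 hash here is an arbitrary stand-in for whatever hash function the switch silicon actually uses, and the function name is invented for illustration:

```python
import zlib

def pick_uplink(src_ip, dst_ip, src_port, dst_port, proto, n_links=4):
    """Hash a flow's 5-tuple onto one of n_links equal-cost uplinks.
    Every packet of a given flow hits the same link (keeping TCP packets
    in order), while distinct flows spread statistically across links."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return zlib.crc32(key) % n_links

# The same flow is always pinned to one uplink...
link = pick_uplink("10.1.2.3", "10.4.5.6", 51512, 443, "tcp")
assert link == pick_uplink("10.1.2.3", "10.4.5.6", 51512, 443, "tcp")
# ...which is also why one "elephant flow" can load up a single link,
# motivating the 40G-switch / 10G-server speed gap described above.
```

Because each flow stays on one link, a single 10G-limited server flow can consume at most a quarter of any 40G inter-switch link, leaving headroom for the many other flows hashed onto it.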
Despite the large scale of hundreds of thousands of fibre strands, fabric’s physical and cabling infrastructure is far less complex than the logical network topology drawings might suggest. Facebook’s teams developed a third-generation data centre building design for fabric networks that shortens cabling lengths and enables rapid deployment.
From a server rack or data hall MDF (main distribution frame) standpoint, there is almost no change – the TORs simply connect to their four independent aggregation points, just as with clusters before. For spine and edge devices, Facebook designed special independent locations at the centre of the building, called BDF (building distribution frame) rooms. The BDFs are constructed and pre-equipped with fabric early in the building turn-up process. The data halls are then identically attached to the BDFs as they get built, which drastically reduces network deployment time.
The massive fibre runs from the fabric switches in the data hall MDFs to the spine switches in the BDFs are actually simple, identical “straight line” trunks. All fabric complexity is localized within the BDFs, where it is very manageable. Facebook considers each spine plane, with its corresponding trunks and pathways, one failure domain that can safely be taken out of service at any time without production impact. To further optimize fibre lengths, backbone devices are positioned in the MPOE rooms directly above the fabric BDFs, allowing short vertical trunks in a simple and physically redundant topology.
Furthermore, all fabric spine planes in the BDFs are identical clones by design, and cabling is localized within each independent spine plane. Port layouts are visual and repetitive, and all port maps are automatically generated and validated by software.
All this makes fabric deployment and cabling a smooth, efficient, and virtually error-free job and is a great example of how networking requirements can positively influence building design.
In dealing with some of the largest-scale networks in the world, the Facebook network engineering team has learned to embrace the “keep it simple, stupid” principle.
Facebook’s new fabric was no exception to this approach. Despite the large scale and complex-looking topology, it is a very modular system with lots of repetitive elements. It is easy to automate and deploy, and it is simpler to operate than a smaller collection of customized clusters.
Fabric offers a multitude of equal paths between any two points on the network, making individual circuits and devices unimportant – such a network can survive multiple simultaneous component failures with no impact. Smaller and simpler devices also mean easier troubleshooting. The automation that fabric required Facebook to create and improve made it faster to deploy than previous data centre networks, despite the increased number of boxes and links.
Facebook’s modular design and component sizing allow the same mid-size switch hardware platforms to be used for all roles in the network – fabric switches, spine switches, and edges – making them simple “Lego-style” building blocks that can be procured from multiple sources.
With smaller device port densities and minimized FIB and control-plane needs, what started as Facebook’s first all-around 40G network will be quickly upgradable to 100G and beyond in the not-so-distant future, while leveraging the same infrastructure and fibre plant. The first iteration of the fabric, in the Altoona data centre, has already achieved a 10x increase in intra-building network capacity compared with the equivalent cluster design, and can easily grow to over 50x within the same port speeds.