The famous story of the Tacoma Narrows Bridge is a vivid illustration of what happens when engineers fail to adequately plan for how their work will perform under real-world conditions. The construction of the bridge in 1940 in the state of Washington brought together some of the country’s best bridge architects, designers, and builders. But just four months after completion, it dramatically collapsed into Puget Sound due to unanticipated aeroelastic flutter caused by strong winds that engineers had failed to account for.
Civil engineers have largely learned the lessons of the Tacoma Narrows Bridge. But we still see modern software engineering teams separate feature development from operations and stability. Despite movements like DevOps, it’s still very common today to see software engineers write an application, hand the code off to a QA team for testing, and then hand it off again to an operations team for deployment and debugging. This works well when QA and operations have the product knowledge required to fix problems, but if they don’t, fixing and debugging have to work their way back to the software engineers, which can significantly delay fixes. The challenge software teams face today is how to build a mindset in which operations and stability are also part of the software development lifecycle.
The Centralized Reporting, Embedded Locality Model
At Facebook, we have chosen a different organizational approach. By embedding production engineering teams within software engineering teams, we are able to work more effectively, more creatively, and with greater organizational cohesion. This “Centralized Reporting, Embedded Locality” model ensures that Facebook is organizationally set up to provide the best user experience and access to the platform at any hour.
Facebook has more than two billion users, an average of 1.4 billion of whom use its products every day. Production engineers support the intricate, large-scale system that delivers photos, videos, messages, and other types of content to that community. The Centralized Reporting, Embedded Locality structure creates a collaborative culture between the software engineering and operations teams to ensure Facebook’s infrastructure is healthy.
This model relies on four principles to be effective:
- Co-location. Every production engineering team is co-located with the software engineering team it works with. Both groups physically sit together, socialize, and attend each other’s all-hands meetings so that everyone has a clear understanding of the product and infrastructure roadmap. When an issue arises, they can quickly gather in a room to examine and solve it. This keeps the team abreast of development, making each stakeholder more knowledgeable about the individual piece of software being built and how that software might interact with the servers and services. The shared knowledge of product vision and product group incentives is crucial for this type of structure to thrive. The team accomplishes this through automation, new tools, performance analysis, and hardcore systems debugging.
- Unified org structure. We have one single engineering business unit, encompassing both development and production. From an org chart perspective, production engineering is not a business unit separate from engineering; it’s a part of the engineering discipline. This makes the engineering leader of the company accountable for operations. We believe the engineering and operations departments should not have to go very far up the org chart before they meet. Ultimately, operations is the responsibility of that engineering lead, and their assessment and ability to move up and take on more responsibility are directly tied to whether operations is successful. At the same time, operations’ success is now fundamentally intertwined with the engineering leader and their team. To implement this structure successfully, incentives need to be realigned as well. Operations needs to be incentivized to enable faster development and to strike a balance between stability and building feature-rich products and services. No one facet should win over another. To ensure team members understand and adjust to their responsibilities, it is important to create a centralized reporting structure for production engineering. This creates a mechanism for responding to conflicts, such as when engineering leads look at a project through a “features only” lens and assume operations is solely responsible for the work they don’t want to do. When these situations arise, software engineering and production engineering leadership need to come together and solve the problem collectively, using their centralized reporting structure.
- Team parity. Software engineers and production engineers are on a level playing field. One major lesson Facebook has learned in building the production engineering model is how much this matters, because perception is reality. Everyone is an equal partner and has the same clout when they are building a system. Production engineers build software to solve operational problems; software engineers also take responsibility for running their services. Parity means that software engineers are on call for the services they build rather than depending on a traditional operations team to respond to alerts. In most cases, production engineers and software engineers share on-call rotations.
- A holistic mindset. One unified team understands how the code is built and how it responds to the environment, whether the software is being optimized for user traffic or is failing. Production engineers’ work ranges from hardware design to UI tools. From backend services like Facebook’s data warehouses to front-end services like News Feed, to infrastructure components like caching, load balancing, and deployment systems, the team keeps Facebook running. Production engineers build a variety of systems that solve unique problems that others may not be focused on solving. For example, production engineers designed and built a Faraday cage for wireless mobile phone testing, as well as Augmented Traffic Control (ATC), which allows mobile developers to simulate real-world network environments. Engineers can use ATC to test their applications under varying conditions, such as severely impaired networks or emulated mobile carrier speeds and technologies like 3G and LTE. Another example of technology built by production engineers is Facebook Auto Remediation (FBAR), an automated system that handles repetitive issues so that engineers can focus on solving and preventing the larger, more complex site disruptions. In these examples, the team is on call 100 percent of the time because the production engineers who built and deployed these systems also own them.
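To make the auto-remediation idea concrete, here is a minimal Python sketch of the general pattern: known, repetitive alerts are routed to automated handlers, and anything unrecognized is escalated to whoever is on call. This is not FBAR's actual design or API. Every name here (`Alert`, `restart_service`, `clear_temp_files`, `REMEDIATIONS`, `handle`) is hypothetical and invented for illustration, and the handlers only return a description of what they would do.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Alert:
    """A hypothetical alert: what went wrong and where."""
    kind: str
    host: str


# Hypothetical handlers. A real system would act on the host;
# these only describe the action they stand in for.
def restart_service(alert: Alert) -> str:
    return f"restarted service on {alert.host}"


def clear_temp_files(alert: Alert) -> str:
    return f"cleared temp files on {alert.host}"


# Assumed mapping from repetitive alert kinds to automated fixes.
REMEDIATIONS: Dict[str, Callable[[Alert], str]] = {
    "service_down": restart_service,
    "disk_full": clear_temp_files,
}


def handle(alert: Alert, escalated: Optional[List[Alert]] = None) -> str:
    """Run the automated fix if one exists; otherwise escalate."""
    handler = REMEDIATIONS.get(alert.kind)
    if handler is None:
        # No known fix: record the alert for the on-call engineer.
        if escalated is not None:
            escalated.append(alert)
        return f"escalated {alert.kind} on {alert.host} to on-call"
    return handler(alert)


if __name__ == "__main__":
    queue: List[Alert] = []
    print(handle(Alert("disk_full", "web01"), queue))
    print(handle(Alert("kernel_panic", "db07"), queue))
```

The point of the sketch is the division of labor: the lookup table absorbs the routine cases, and the escalation list keeps human attention for problems nobody has automated yet.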
Benefits: More Diverse Thinking, Collaborative Culture, Holistic Response
With both software and operations engineers on equal footing, collaborating as part of a single team with one unified vision, and sharing the same physical workspace, the Centralized Reporting, Embedded Locality model pays off in several important ways.
Perhaps most important, we see a higher diversity of thought across the stack, because engineers share knowledge of both product vision and operational issues. Since they are working side by side as part of the same team, software engineers benefit from the constant vigilance of the production engineers and also get a window into operational issues right as they are being identified and addressed.
Production engineers can also understand the prioritization constraints of the software teams. When problems surface in production, software engineers and production engineers huddle together, shoulder to shoulder, to fix them. When a software engineer is on call, the production engineer is sitting right there, so the software engineer can just lean over and ask for help on a particular problem or get guidance on ways to be more effective in the future.
In this way, the “Centralized Reporting, Embedded Locality” model strengthens internal communications and removes old preconceptions. The result is a powerful, collaborative and unified engineering culture, which ultimately benefits Facebook’s user experience.
Applying the Centralized Reporting, Embedded Locality Model
Companies need to build engineering teams that will be sought out for their knowledge and problem-solving abilities. Organizations cannot just rebrand their backend departments. They need to change the culture of their software team and their underlying approach to the work done by the development teams. Organizational support is crucial.
By putting operations on equal footing with software engineers, this model helps the production engineering team build a strong reputation for effectiveness and creative solutions, and avoid hierarchical, siloed thinking. As other organizations look to implement this model, one of the biggest lessons for building a cohesive team is to hold open, honest conversations regarding career development, the nature of future work, and how change affects what people do. When applicable, make it clear to teams when they should automate themselves out of current roles to take on bigger, more difficult challenges. All of these elements help to build a powerful engineering culture, which will ultimately benefit your company’s user experience.