Wobbly Bridges and Sturdy Software
Next time you visit London, make sure you walk across the Millennium Bridge. A landmark in its own right, the bridge connects St. Paul’s Cathedral on the northern end with the Tate Modern and Shakespeare’s Globe Theatre at its southern end. It’s a suspension bridge, engineered for pedestrians to enjoy an unobstructed view of some of London’s most celebrated sights.
Yet, as an engineer, I enjoy it for a completely different reason: the story behind it. It goes like this: In the late 1990s, London was looking for a way to celebrate the new millennium. Some of the world’s best architects put forward their ideas, and the vision of a lateral suspension bridge ultimately won the design competition. Construction began in late 1998, and a much-anticipated opening date was set for spring 2000.
The bridge opened on June 10, 2000, two months behind its original deadline and £2 million over budget (roughly US$2.65 million), for a total cost of £18 million (roughly US$24 million). Media around the world covered the opening. It was closed two days later, on June 12, for health and safety reasons.
The Runtime Issues
On that opening day, an estimated 100,000 people walked across the bridge, with about 2,000 people present at any given time. Within minutes, it became clear that something was not quite right. The bridge was gently swinging left and right, exhibiting greater-than-expected lateral movement. The locals quickly nicknamed it the “Wobbly Bridge,” as it swayed up to 7 cm (2.75 inches) in each direction. In their characteristic manner, many Londoners enjoyed the ride, treating it more like a funfair than a bridge. Public health officials disagreed and temporarily shut the bridge down.
For the next two days, attempts were made to limit the number of people on the bridge, but it became clear that even a comparatively small number of people caused the bridge to sway dangerously, despite its being originally designed to withstand the weight of around 5,000 people. The bridge would not reopen to the public until Feb. 22, 2002.
It was a runtime problem. What some of the leading designers didn’t predict was the interaction between the structure and the people using it. The natural swaying motion of people’s gaits caused small sideways oscillations in the bridge. In turn, these oscillations led people to synchronize their steps with the sway (have you ever tried walking on a train that is swinging left and right?), which increased the amplitude of the oscillations and reinforced the effect. This is a textbook example of the “positive feedback” phenomenon. The solution involved retrofitting 37 dampers to dissipate that energy, at an extra cost of £5 million (US$6.63 million) and nearly two years of additional work.
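For readers who think in code, here is a toy sketch of that feedback loop. It is not a structural model of the bridge: the function name, the linear update rule and every number in it are assumptions chosen purely for illustration. It only shows that a small sway grows when the energy walkers feed in outpaces what the dampers remove, and dies out otherwise.

```python
def simulate_sway(steps, feedback_gain, damping, initial_amplitude_cm=0.1):
    """Return the sway amplitude after each step of a toy feedback loop.

    All parameters are illustrative assumptions, not measured values.
    """
    amplitude = initial_amplitude_cm
    history = [amplitude]
    for _ in range(steps):
        # Walkers who fall into step with the sway add energy in proportion
        # to the current amplitude; dampers remove energy in proportion to it.
        amplitude *= 1.0 + feedback_gain - damping
        history.append(amplitude)
    return history


# Feedback stronger than damping: the sway keeps growing.
weakly_damped = simulate_sway(steps=50, feedback_gain=0.10, damping=0.02)

# Damping stronger than feedback: the same initial sway fades away.
well_damped = simulate_sway(steps=50, feedback_gain=0.10, damping=0.15)

print(f"weakly damped, final amplitude: {weakly_damped[-1]:.3f} cm")
print(f"well damped, final amplitude:   {well_damped[-1]:.3f} cm")
```

The only thing that differs between the two runs is the damping value, which is the software equivalent of bolting on those dampers.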
Every failure is a learning opportunity. A wise person learns from other people’s mistakes. From the perspective of a software engineer, I’ve got three points to make.
Lesson #1: The Importance of Testing
To ensure the fixes worked as expected, the engineers arranged for 2,000 volunteers (they really seem to have been into the millennium symbolism!) to walk across the bridge while the sway was carefully measured. It worked like a charm.
Alas, that tells us they opened a bridge to the public without having first tested it in real life. With hindsight, we know that even a relatively small number of test subjects strolling up and down the bridge would have detected a problem.
I don’t know about you, but the next time I build a bridge, I will definitely arrange for more thorough testing before I invite the media!
Lesson #2: Memories Are Long
Twenty years later, the locals still affectionately refer to the structure as the “wobbly bridge” (sometimes “wibbly-wobbly”). The internet is rife with videos of that wobbling. All of that comes from an era a full seven years before the launch of the first iPhone. Imagine how many YouTube videos, TikTok challenges and social media posts there would be had this happened today.
The bottom line is this: People will remember your failures long after they’ve been fixed.
Lesson #3: It’s Hard to Predict Emergent Properties of a System
While I’m not sure what the best practices for structural engineering are nowadays, I do know that we tend to have it easier in the field of software engineering. Most of the time, testing is easy and/or cheap enough that there is no excuse for releasing any untested code. This is what we teach students before they get their computer science degrees.
That said, if this story tells us anything, it’s that even when all the tests pass at build time (bridge design), there is still a chance of catastrophic failure at runtime. It’s hard to predict the interaction between components. This often goes under the umbrella of the emergent properties of a system, and it’s one area where chaos engineering excels.
Chaos engineering is the practice of experimenting on a system to gain confidence that it will withstand turbulent conditions. It’s about taking a real system and seeing how it behaves when bad things happen.
It’s an extra layer of testing that can help detect behavior that would go unnoticed by unit, integration and even end-to-end tests. It boils down to conducting experiments to confirm or disprove your assumptions about the system, especially when the things you know might go wrong actually do go wrong.
Just like bridges, software systems have built-in redundancy to accommodate component failure. The difference is that, while it’s risky and expensive to cut off a 20-ton cable to confirm that the remaining cables will bear the load, it’s something software engineers do on a daily basis. We have the equivalent of spare bridges, staged bridges, mini bridges and everything else to experiment with in the virtual world.
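To make the cable-cutting idea concrete, here is a minimal sketch of what such an experiment can look like. Everything in it is hypothetical: `Replica`, `LoadBalancer`, `success_rate` and `run_experiment` are names made up for this illustration, and the whole “system” lives in a single process. The shape is what matters: measure a steady state, inject a failure, measure again, then roll the failure back.

```python
import random


class Replica:
    """A hypothetical service instance that can be switched off."""

    def __init__(self, name):
        self.name = name
        self.healthy = True

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


class LoadBalancer:
    """Tries replicas in random order until one of them answers."""

    def __init__(self, replicas):
        self.replicas = replicas

    def handle(self, request):
        for replica in random.sample(self.replicas, len(self.replicas)):
            try:
                return replica.handle(request)
            except ConnectionError:
                continue  # fall through to the next replica
        raise RuntimeError("no healthy replicas left")


def success_rate(service, requests=100):
    """Fraction of requests the service answers successfully."""
    ok = 0
    for i in range(requests):
        try:
            service.handle(f"request-{i}")
            ok += 1
        except RuntimeError:
            pass
    return ok / requests


def run_experiment(service, victim):
    """Measure the steady state, cut one 'cable', and measure again."""
    baseline = success_rate(service)
    victim.healthy = False  # inject the failure
    try:
        during_failure = success_rate(service)
    finally:
        victim.healthy = True  # always roll the failure back
    return baseline, during_failure


replicas = [Replica(f"replica-{i}") for i in range(3)]
service = LoadBalancer(replicas)
baseline, during_failure = run_experiment(service, victim=replicas[0])
print(f"baseline: {baseline:.0%}, with one replica down: {during_failure:.0%}")
```

If the second number drops below the first, the redundancy you assumed is not really there, and you found out without inviting the media.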
If you’re not doing chaos engineering in 2022, there’s never been a better time to start. If you’d like to learn more about this, tune in for my keynote presentation on wobbly bridges and the need to build sturdy software at Chaos Carnival 2022 in January 2022.
If you’re ready to dive deeper into chaos engineering straight away, check out my book “Chaos Engineering: Site reliability through controlled disruption” (Manning).