Avoiding ‘Success Disasters’: Ticketmaster vs. Taylor Swift
When tickets for Taylor Swift’s Eras Tour first went on sale in November 2022, red-hot demand from fans, compounded by attacks from bots, caused Ticketmaster to melt under the pressure. It couldn’t handle the load: 15% of interactions across the site experienced issues.
Some Swifties spent the entire day in the presale queue, only to be unexpectedly removed. Others eventually reached the front, placed tickets in their cart, and were then ejected and forced to restart the process.
Swift herself said it was “excruciating” to watch fans struggling to buy tickets, and that she had been repeatedly assured that Ticketmaster could handle the demand. Alongside angry fans, an irate entertainer and a rash of negative news coverage, Ticketmaster’s parent company, Live Nation Entertainment, has also faced a U.S. Senate antitrust hearing.
This past July, Ticketmaster experienced another “success disaster,” when its site crashed from overwhelming demand during presale for Swift’s 2024 shows in France.
Ticketmaster hasn’t publicly offered a technical postmortem, though some information can be gleaned from public statements. The story offers some clues about what created the company’s “success disaster,” and some lessons for making organizations’ systems resilient enough to handle huge spikes in customer demand.
The Bots Arms Race
Part of what makes this so interesting is that cloud-based elastic compute was expected to get rid of these capacity-based “success disasters.” However, while a technology like serverless allows you to deal with a 10x or even 100x increase in load, doing so has to be paid for. So if it isn’t legitimate traffic, you don’t necessarily want to scale to support it.
What makes this even more challenging is that the nature of legitimate traffic has changed. Tech-savvy users may choose to run “intelligent” agents (i.e., bots) to help them shop, win eBay auctions and so on — including buying tickets the users can later resell at a high markup. But there are growing calls to make this type of activity illegal, not least from Joe Berchtold, president and CFO of Live Nation Entertainment.
“You can imagine that if you are doing a Black Friday sale, you are not just waiting on individual users with mobile phones or laptops accessing your site,” said Spencer Kimball, CEO at Cockroach Labs. “Now it’s about any agents that may have been launched on behalf of users, so you are not tied to the same scale anymore. It could easily be 10x as much, and you really need to plan for this.”
Ideally, you’d filter out the bot traffic, but this is technically challenging. “Dealing with bots has always been an arms race,” Laura Nolan, principal software engineer at Stanza Systems, told The New Stack. “You can limit by IP address but the way the modern internet works with network address translation, you can run into a lot of problems doing that.
“Fingerprinting TLS connections with JA3 and JA3S hashes worked for a while, but now bot software is varying the TLS parameters so that doesn’t work well anymore. CAPTCHAs aren’t as effective as they used to be because we have AI systems that can solve them, so it becomes very difficult.”
Ticketmaster’s solution was its Verified Fan system, which provides codes to confirmed users as part of the checkout process. So why didn’t it prevent the crashes?
According to the company’s statement, over 3.5 million people pre-registered for the Taylor Swift Verified Fan presale in 2022, with around 1.5 million of them sent codes to join the on-sale. According to Ticketmaster, around 40% of invited fans typically show up and buy tickets, with most purchasing an average of three. In this case, however, the impressively high demand resulted in 14 million fans hitting the site.
In written testimony ahead of a grilling by U.S. senators, Berchtold stated that as well as fans, the firm was also “hit with three times the amount of bot traffic than we had ever experienced, and for the first time in 400 Verified Fan on sales they came after our Verified Fan access code servers. While the bots failed to penetrate our systems or acquire any tickets, the attack required us to slow down and even pause our sales. This is what led to a terrible consumer experience that we deeply regret.”
Reading between the lines, it sounds as if they had too many requests hitting the verification system simultaneously and it therefore stopped functioning.
With the benefit of hindsight, a proportion of Ticketmaster’s issues seem fairly obvious. Speaking on the Decoder podcast, Dean Budnick, co-author of the book “Ticket Masters: The Rise of the Concert Industry and How the Public Got Scalped,” pointed out that rather than selling all tickets at once, “it would’ve been different if they had spread those out over the course of a week.”
Berchtold has also acknowledged this, but as any experienced software professional will know, sometimes business wishes override sound technical decisions.
How Do You Load Test Legacy Systems?
Was the architecture a factor? Ticketmaster’s architecture, as of 2019, was based on a monolith running on an emulated VAX. However, in his statement, Berchtold said that his company has “created an entirely new, modern computing architecture,” at least parts of which are running in AWS.
However, building a system to avoid a “success disaster” demands certain criteria be met.
For a system with an older core, “you have to build your system in layers to protect and prioritize what is going into it since horizontal scaling probably isn’t going to work,” Nolan said. “So it’s the infrastructure around it, like the trusted fan system that prioritizes and gatekeepers it, where they probably need to be looking.”
Kimball suggested that if you are not already running load tests to see how your system behaves for your likely high watermark, doing so would be advisable. “If you assume a number such as a 100x scale-up for a Black Friday or Boxing Day event, a load test provides a means for you to verify whether your system can actually support it,” he said.
He also recommended pushing the system to the point where it can no longer handle the load, so you can see what happens and then look at possible mitigations.
“You might not be able to scale up your database if it is monolithic past a certain level,” Kimball said. “But what happens when you push the thing too far? Does the whole system fall over? Or can you slow down the request/response speed by putting people in a queue that sits in front of your service?”
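A load test along these lines can be sketched in a few dozen lines of Python. Everything here is a stand-in: the stubbed service, its assumed capacity of 1,000 requests/second, and the failure behavior past that watermark are all hypothetical, chosen only to show the shape of a ramp-up test that deliberately pushes past the breaking point.

```python
import random

# Hypothetical stand-in for the system under test: it responds quickly
# below an assumed capacity, then degrades sharply past that watermark.
CAPACITY = 1_000  # requests/second the stub can handle (an assumption)

def send_request(current_load: int) -> tuple[bool, float]:
    """Return (succeeded, latency_in_seconds) for one simulated request."""
    if current_load <= CAPACITY:
        return True, random.uniform(0.01, 0.05)
    overload = current_load / CAPACITY
    # Past the watermark: latency balloons and errors start to appear.
    return random.random() > 0.5, random.uniform(0.05, 0.05 * overload)

def load_test(steps: list[int], requests_per_step: int = 200) -> dict[int, dict]:
    """Ramp load up in steps, recording error rate and p95 latency per step."""
    results = {}
    for load in steps:
        outcomes = [send_request(load) for _ in range(requests_per_step)]
        latencies = sorted(latency for _, latency in outcomes)
        errors = sum(1 for ok, _ in outcomes if not ok)
        results[load] = {
            "error_rate": errors / requests_per_step,
            "p95_latency": latencies[int(0.95 * len(latencies))],
        }
    return results

# Ramp from baseline traffic to 100x, as for a Black Friday-style event.
report = load_test([100, 1_000, 10_000, 100_000])
for load, stats in report.items():
    print(f"{load:>7} rps: errors={stats['error_rate']:.0%}, "
          f"p95={stats['p95_latency'] * 1000:.0f}ms")
```

In a real test, `send_request` would hit a staging environment via a tool like a load generator; the point of the structure is the stepped ramp, which shows not just whether the system copes at the target scale but how it fails beyond it.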
Once you’ve gathered data for this, a good approach for a distributed system is to identify your biggest bottleneck and address it, then repeat the test until you’re able to meet the scale you need. The reason for working this way — as opposed to fixing multiple bottlenecks between each test run — is that the next main bottleneck is often in an entirely different component of the system.
For monolithic applications, however, it can pay dividends to pick the quickest-to-fix issue from a list of the top five or so bottlenecks, rather than always attacking the single biggest one, so that the application gets tuned as quickly as possible. Tuning is never really finished, so it also pays to agree on performance targets upfront, before you start.
Techniques such as graceful degradation are also effective. Netflix, for example, is designed with both fallbacks, such as showing an unpersonalized or stale cached list of movies if the personalized list can’t be shown, and the removal of non-critical features.
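The fallback pattern is simple to express in code. This is a minimal sketch, not Netflix’s actual implementation: the service functions, cache contents and error types are all hypothetical.

```python
# A stale cached, unpersonalized list kept as a fallback (hypothetical data).
stale_cache: dict[str, list[str]] = {"top_movies": ["A", "B", "C"]}

def get_personalized_list(user_id: str) -> list[str]:
    """Stand-in for a personalization service that is currently overloaded."""
    raise TimeoutError("personalization service overloaded")

def get_home_rows(user_id: str) -> list[str]:
    """Serve the personalized list if possible; otherwise degrade gracefully
    to stale, unpersonalized content rather than failing the whole page."""
    try:
        return get_personalized_list(user_id)
    except (TimeoutError, ConnectionError):
        return stale_cache["top_movies"]

print(get_home_rows("user-42"))  # ['A', 'B', 'C']
```

The key design choice is that the degraded response is still a valid response: the user sees something slightly worse instead of an error page.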
A step beyond this might be to go entirely serverless, or perhaps use serverless as a way to augment capacity.
In our shopping example, you can imagine using the legacy system until its capacity is exhausted, then sending new traffic to the serverless implementation, which gives you elastic scaling, but at an additional cost.
“You need a very scalable database to do this, but there are serverless databases now, such as the one Cockroach Labs provides, which give you extraordinary real-time scaling of your database up and down, without needing to add nodes and rebalance,” Kimball said.
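A sketch of that overflow pattern might look like the following. The router, the backend names and the capacity figure are all hypothetical; in practice this decision would live in a load balancer or API gateway rather than application code.

```python
# Hypothetical router: keep sending traffic to the legacy system until its
# capacity is exhausted, then spill the overflow to an elastic serverless tier.
class OverflowRouter:
    def __init__(self, legacy_capacity: int):
        self.legacy_capacity = legacy_capacity  # assumed fixed capacity
        self.legacy_in_flight = 0

    def route(self) -> str:
        """Pick the backend for the next request."""
        if self.legacy_in_flight < self.legacy_capacity:
            self.legacy_in_flight += 1
            return "legacy"
        return "serverless"  # elastic scaling, but billed per request

    def release(self, backend: str) -> None:
        """Call when a request completes, to free its legacy slot."""
        if backend == "legacy":
            self.legacy_in_flight -= 1

router = OverflowRouter(legacy_capacity=2)
print([router.route() for _ in range(3)])  # ['legacy', 'legacy', 'serverless']
```

The trade-off Kimball describes shows up directly here: every request routed to `"serverless"` scales elastically but carries a per-request cost, so the fixed-capacity tier is always used first.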
In Ticketmaster’s case, it sounds like they may have accepted unbounded numbers of incoming requests. Load shedding at the load balancer can be a good way to guard against this. It is also worth setting limits for individual services to give you defense in depth. Netflix’s concurrency-limits tool shows one way to do this for Java, and there are good commercial tools, including Stanza’s, which allow quite sophisticated approaches.
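A per-service concurrency limit can be sketched with a non-blocking semaphore. This is an illustrative pattern in the spirit of tools like Netflix’s concurrency-limits, not their implementation, and the limit of 100 is an arbitrary assumption.

```python
import threading

class ConcurrencyLimiter:
    """Cap in-flight requests for a single service and shed the excess,
    rather than letting an unbounded backlog build up."""
    def __init__(self, limit: int):
        self._slots = threading.BoundedSemaphore(limit)

    def try_acquire(self) -> bool:
        # Non-blocking: if no slot is free, the caller should shed the request.
        return self._slots.acquire(blocking=False)

    def release(self) -> None:
        self._slots.release()

limiter = ConcurrencyLimiter(limit=100)  # per-service limit (an assumption)

def handle_request() -> str:
    if not limiter.try_acquire():
        return "503 Service Unavailable"  # fail fast instead of queueing
    try:
        return "200 OK"  # ... real work would happen here ...
    finally:
        limiter.release()

print(handle_request())  # 200 OK
```

Putting a limiter like this inside each service, in addition to shedding at the load balancer, is what provides the defense in depth: a surge that gets past one layer is still bounded at the next.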
If possible, it’s a good idea to prioritize what traffic you serve. So, for example, Ticketmaster could perhaps have prioritized fans in the checkout process over those searching for tickets. If you can’t do this — in Ticketmaster’s case it may have been that the traffic was too heterogeneous to allow it — it is better to put a limit in place.
“If you have 1 million queries per second of capacity available and 1.2 million QPS coming in, you are better off serving 1 million QPS and dropping the other 200,000, rather than allowing all that traffic through and having your system grind to a halt,” Nolan told us.
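One standard way to enforce that kind of cap is a token bucket. This is a generic sketch of the technique, with illustrative rate and burst numbers, rather than anything Ticketmaster or Stanza specifically uses.

```python
import time

class TokenBucket:
    """Admit at most `rate` requests per second (plus a burst allowance)
    and drop the rest, so the system keeps serving at capacity instead
    of grinding to a halt under overload."""
    def __init__(self, rate: float, burst: float):
        self.rate = rate          # sustained requests per second
        self.capacity = burst     # maximum burst size
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time, up to the burst cap.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True   # serve this request
        return False      # shed it (e.g., return HTTP 429)

# With capacity for 2 immediate requests and 1/second after that,
# a burst of 3 back-to-back requests sheds the third.
bucket = TokenBucket(rate=1.0, burst=2)
print([bucket.allow() for _ in range(3)])  # [True, True, False]
```

In Nolan’s example, a bucket sized for 1 million QPS would serve that million and shed the extra 200,000, each rejection taking microseconds instead of tying up a worker.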
It is also preferable to fail fast. As Nolan wrote in an InfoQ article on cascading failures, “It’s better to get a fast failure and retry to a different instance of the service, or serve an error or a degraded experience than wait until the request deadline is up (or indefinitely, if there’s no request deadline set).
“Allowing this to happen can lead to slowness that spreads through an entire microservice architecture, and it can be tricky to find the underlying cause when every service has ground to a halt.”
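Fail-fast-and-retry can be sketched as a deadline per call plus failover to the next instance. The instances and their simulated latencies below are hypothetical; a real client would enforce the deadline with a network timeout rather than a lookup table.

```python
# Hypothetical instances: "slow" would blow past the deadline, "healthy" won't.
SIMULATED_LATENCY = {"slow": 10.0, "healthy": 0.02}  # seconds (assumed)

def call_instance(name: str, deadline_s: float) -> str:
    """Stand-in for an RPC with a deadline attached."""
    if SIMULATED_LATENCY[name] > deadline_s:
        raise TimeoutError(f"{name} exceeded the {deadline_s}s deadline")
    return f"response from {name}"

def call_with_failover(instances: list[str], deadline_s: float = 0.5) -> str:
    """Enforce a deadline on each instance and fail over to the next,
    rather than waiting indefinitely on a struggling service."""
    last_error: Exception | None = None
    for name in instances:
        try:
            return call_instance(name, deadline_s)
        except TimeoutError as err:
            last_error = err  # fast failure: move straight on
    raise last_error or RuntimeError("no instances configured")

print(call_with_failover(["slow", "healthy"]))  # response from healthy
```

The deadline is what stops slowness from propagating: the caller spends at most `deadline_s` per attempt instead of inheriting the struggling instance’s latency.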
Finally, in the case of Ticketmaster and Taylor Swift, running a game day and doing a red team/blue team exercise might have been a good idea.
“Having someone thinking about how a system could break goes a lot further than someone thinking about how a system is properly bulletproof,” Kimball said. “You’d think these two things would yield the same outcomes, but they don’t.”
How Do You Learn Incident Response?
Another aspect of all of this is knowing how to respond when the worst does happen. As an industry, we are strangely poor at dealing with significant and novel distributed systems failures. “We don’t have a good way to train people on how to respond to new kinds of complex system failures,” Nolan said, “because we don’t know that much about it.”
The current state of the art is the Incident Command System, which comes out of a school of thought called High-Reliability Organization theory. A characteristic of these organizations is that they have ways of flattening the organization and pushing decisions down to practitioners in a crisis.
This approach is used by U.S. wildfire crews and other emergency response units, and it was brought into the technology industry relatively recently, around the 2010s.
The general idea is that you declare an incident, choose an incident commander, then have one or more subject matter experts working on the issue. The incident commander isn’t hands-on; they are in a coordination role and are key to making the approach work.
“They are supposed to make sure that the right people get called into the incident, so they need to know the organization structure, how to page people, how to escalate to executives if they need to, and so on,” Nolan said. “They should also make sure that the technical people don’t lose sight of the big picture, that people aren’t working at cross purposes and that the communication is good.”
This still leaves the problem of when you are called into an incident as a subject matter expert. While some incidents look familiar — a denial-of-service (DoS) attack, say — some are novel and require a lot of improvisation and quick thinking. It may be hard to figure out what is going on, and finding a resolution can be complex.
“You are typically scrambling together Bash scripts and whatever else you can put together to fix it,” Nolan said.
Adding to this, with a major incident like that at Ticketmaster, you know that while you’re trying to solve the problem, The New York Times is writing a front-page story on why your website is down. How we handle resolving the incident with the additional stress this entails is something that we all just have to learn on the job.
Your best defense is to design and test your systems for resiliency, so your organization can avoid having what should be its best days turn into success disasters.