The first message on Microsoft’s LinkedIn was sent the day the service launched, on May 5, 2003. The LinkedIn messaging platform now stores 17 years’ worth of messages (created with 17 years of different product features), and the number of messages sent keeps going up. Volume has quadrupled over the last four years, and in the first week of April messages were up 14% compared to the previous year; the end of March saw weekly jumps in messages sent between co-workers (up 21%) and in users sending each other content from their LinkedIn newsfeeds in messages (up 14%).
Originally those messages looked very like email; now they look more like chat, with threading, group conversations, emoji and no subject lines. The code powering that messaging system has been updated, getting ever more complicated, but the common infrastructure that powers all the different LinkedIn messaging experiences hadn’t changed as much over the years.
“While very little code still survives from 2003, a lot of the infrastructure was still the same,” Manny Lavery, the engineering manager for the messaging platform infrastructure team, told The New Stack.
The original infrastructure was a single Oracle database, running in a single data center, with two tables powering two services: one for storing the messages, which included all the business logic for messages for multiple LinkedIn products, and one responsible for the different ways the messages could be delivered — push notifications, different email formats, and tracking they were received.
Over time, LinkedIn added new data centers and switched to distributed data storage with sharded architecture using its own NoSQL JSON database, Espresso. Scaling out brought its own complexities though; different mailboxes were divided between shards with a Personal Data Routing service keeping track of whose inbox was on which shard and bi-directional replication putting a copy of each shard into a different data center in case of availability issues.
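The routing idea described above can be sketched in a few lines. This is a minimal illustration only: `RoutingTable`, `assign` and `shard_for` are hypothetical names, and the hash-based assignment is an assumption made for the example, not a description of how LinkedIn's Personal Data Routing service or Espresso actually work.

```python
from __future__ import annotations

import hashlib


class RoutingTable:
    """Hypothetical stand-in for a routing service: member id -> shard id."""

    def __init__(self, num_shards: int) -> None:
        if num_shards <= 0:
            raise ValueError("num_shards must be positive")
        self.num_shards = num_shards
        self._assignments: dict[str, int] = {}

    def assign(self, member_id: str) -> int:
        """Pick a shard for a mailbox (assumed: by hashing) and remember it."""
        if member_id not in self._assignments:
            digest = hashlib.sha256(member_id.encode("utf-8")).digest()
            shard = int.from_bytes(digest[:8], "big") % self.num_shards
            self._assignments[member_id] = shard
        return self._assignments[member_id]

    def shard_for(self, member_id: str) -> int:
        """Look up which shard holds a member's mailbox."""
        return self._assignments[member_id]


routing = RoutingTable(num_shards=4)
shard = routing.assign("member-123")
```

Every read or write for a mailbox first asks the table where that mailbox lives, which is the extra hop (and extra bookkeeping) that scaling out introduced.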
There was a major product redesign in 2016 and LinkedIn kept adding more features, eventually changing the name to Messaging. But it was still built on a data architecture designed for email-style conversations between two people, using a codebase that was over ten years old, which made it hard to shard the data correctly for scaling, with critical business logic still embedded in code in the message store.
Untangling the Mess
The infrastructure team is involved when any of the LinkedIn product teams want to add features to messaging — sometimes writing the code for them or working with them on development — and was responsible for maintaining all that new code in production. But while they owned the code, different product teams owned the business logic, the reasoning behind it and the decisions about making any changes to it.
The messaging platform supports five different LinkedIn lines of business: the consumer web site, the “talent solutions” for HR and hiring teams which offer one kind of email messaging, the sales solutions which offer a different kind of email, LinkedIn’s marketing solutions and the premium service which has premium messaging options (plus a few other internal clients). All those products have their own business rules for how to handle their specific features in messaging.
“All of those business use cases existed in a single codebase that the messenger team was responsible for the long term maintenance of,” Lavery told us. “As the company expanded, as we added more lines of business over 17 years, the use cases got more complex, and they got more numerous, which was really difficult for our engineers. No one could fully understand every business use case that happened in the platform.” Before the redesign, there were around sixty different pieces of custom business logic.
The infrastructure design made writing new code harder and slower than it needed to be, with developers getting overly cautious about the simplest of changes because more and more of the team’s time was being spent just keeping things running. “The maintenance costs consumed most of the engineering time.”
Concerned that a new architecture could become just as much of a maintenance burden over time, the team didn’t just change the data architecture to focus on conversations and content rather than individual messages. The new platform is also designed as a plug-in infrastructure, with service ownership distributed between the infrastructure and product teams.
“We started to break up the content so that what could be shared was shared and what didn’t have to be shared was then isolated to a single service, and we designed our system around ownership of responsibility, and then backed up those services of responsibility with databases that provided that data.”
“We made sure that we could operate as a platform, agnostic to the business logic that is being executed upon us. From our perspective, the messages are just content and once you make that decision it’s easy to say ‘I’m just going to get the content and store it for you, and I’m going to retrieve it to you in the exact same condition you gave it’, and then they can render it as they want.”
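One way to picture a platform that is "agnostic to the business logic" is a store that treats content as opaque bytes and lets each product register its own rules. The sketch below is an assumption-laden illustration: `MessagePlatform`, `register`, `send` and `fetch` are hypothetical names, and the real plugin interface is not described in the interview.

```python
from __future__ import annotations

from typing import Callable

# Hypothetical plugin shape: receives the opaque content, returns True to accept it.
Plugin = Callable[[bytes], bool]


class MessagePlatform:
    """Stores content byte-for-byte; product rules live in registered plugins."""

    def __init__(self) -> None:
        self._plugins: dict[str, list[Plugin]] = {}
        self._store: dict[int, bytes] = {}
        self._next_id = 1

    def register(self, product: str, plugin: Plugin) -> None:
        """Attach a product-owned rule to a line of business."""
        self._plugins.setdefault(product, []).append(plugin)

    def send(self, product: str, content: bytes) -> int | None:
        """Run the product's own rules, then store the content unchanged."""
        for plugin in self._plugins.get(product, []):
            if not plugin(content):
                return None  # rejected by the product's rule, nothing stored
        message_id = self._next_id
        self._next_id += 1
        self._store[message_id] = content
        return message_id

    def fetch(self, message_id: int) -> bytes:
        """Return exactly the bytes that were stored."""
        return self._store[message_id]


platform = MessagePlatform()
platform.register("sales", lambda content: len(content) <= 10)
stored_id = platform.send("sales", b"hello")
```

The platform never inspects the content itself, so a product team can change or remove its rule without touching the storage code.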
The fundamental unit of messaging is no longer the whole inbox but conversations, which makes it easier to deliver fast search and retrieval times because conversations and messages can be broken up and spread across the database. Conversations are stored separately from the messages in them, with references back to the messages, to avoid “hot key” problems with very active database records that could slow the system down.
“Now, when members are fetching their inbox they aren’t fetching a list of messages; they’re fetching a list of conversations and if they access one of those conversations we can then access the messages within it,” Lavery explained. Initially, only the most recent conversations are retrieved, and only the first few messages in those conversations, because that’s what users are most likely to be looking for. And those recent conversations are all stored in the same part of the database, for speed.
“Now you’ve reduced the search problem from ‘how do I find data within billions of conversations’ to ‘I only need to find the top ten conversations with one person and for each of those conversations I only need to find the first couple of messages.’ Now your data retrieval problem is very small, and it’s the exact same data size retrieval problem, no matter how big their conversation list gets.”
That list can be large: a new LinkedIn member might only have a few hundred messages. Others have 750,000 or a million messages. “If you’re a recruiter who’s been working on the platform for over a decade you’re sending probably several hundred emails a day, five or six days a week. No matter how big your inbox is, your retrieval time should be the same.”
As users read back through a conversation, more messages get retrieved. “Espresso was built to do that pagination style retrieval and because it’s NoSQL, because there are only the references we have to store into and there are no joins, that database retrieval becomes very quick.”
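The conversation-first retrieval and the page-by-page loading described above might look roughly like this. It is a sketch under stated assumptions: the `Inbox` and `Conversation` classes, the integer cursor and the in-memory dictionaries are all hypothetical, and a real store would use an index rather than sorting in memory.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Conversation:
    conversation_id: str
    last_activity: int  # assumed: a timestamp, larger means more recent
    message_ids: list[str] = field(default_factory=list)  # references only, oldest first


class Inbox:
    def __init__(self) -> None:
        self.conversations: dict[str, Conversation] = {}
        self.messages: dict[str, str] = {}  # message bodies are stored separately

    def add_message(self, conversation_id: str, message_id: str,
                    content: str, timestamp: int) -> None:
        conv = self.conversations.setdefault(
            conversation_id, Conversation(conversation_id, timestamp))
        conv.message_ids.append(message_id)
        conv.last_activity = max(conv.last_activity, timestamp)
        self.messages[message_id] = content

    def recent_conversations(self, limit: int = 10) -> list[Conversation]:
        """Return only the most recently active conversations."""
        ordered = sorted(self.conversations.values(),
                         key=lambda c: c.last_activity, reverse=True)
        return ordered[:limit]

    def page_of_messages(self, conversation_id: str, page_size: int = 2,
                         before: int | None = None) -> tuple[list[str], int | None]:
        """Return one page of messages (newest page first) plus a cursor for the next."""
        ids = self.conversations[conversation_id].message_ids
        end = len(ids) if before is None else before
        start = max(0, end - page_size)
        page = [self.messages[m] for m in ids[start:end]]
        next_cursor = start if start > 0 else None  # None means nothing older remains
        return page, next_cursor


inbox = Inbox()
for i in range(5):
    inbox.add_message("c1", f"m{i}", f"msg {i}", timestamp=i)
first_page, cursor = inbox.page_of_messages("c1", page_size=2)
```

Each call touches at most `limit` conversations or `page_size` messages, so the amount of data fetched stays the same however long the history grows.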
The new design does take up substantially more storage space because it has multiple tables and multiple indexes to make the retrieval faster. But the improvements in performance and reliability more than make up for that, Lavery maintains.
“Engineering time is expensive. Even though the new system does have some additional costs in terms of storage, that’s relatively low when compared to the amount of maintenance and engineering time you spend coming up with technical solutions to very hard problems.”
Owning the Problem
The same kind of decisions determined the platform services, keeping the numbers deliberately low rather than opting for a full microservices architecture as a way of getting an effective team structure for the longer term.
There are fewer than a dozen services for the messaging platform, some small but others fairly large, sized so each can be owned by a senior technical leader. “We disseminated leadership across a number of engineers, and that really allowed us during development to accelerate very quickly, because we could bring in engineers who weren’t very familiar with the platform as a whole and organize them around these technical leaders, and execute very quickly.” Lavery can give developers more responsibility without one or two experts on the team becoming bottlenecks.
That helped when the initial deployment plan for the platform didn’t work out. The team had planned to have old messages stay in the existing system, and new messages get created in the new system as they moved members to it. Then they realized they’d accidentally created a distributed system with eventual consistency, because messages had to be replicated to the old system for members who hadn’t been migrated to the new service.
“We started seeing really poor member experiences, waiting for those messages to show up and making sure that everything was synchronized.” That meant rolling back the 1% of members who had been migrated on to the new system and bringing in extra engineers from across the company to rework the deployment plan.
Usually adding more staff to a project slows it down even more while they get up to speed, but because each service was owned by a technical lead, they were able to mentor the extra engineers, get back on track and ramp up the rollout to every two weeks.
“How many other systems have you seen that took three-plus years to build ramp without issue, every couple of weeks? Within two months, that now serves hundreds of millions of messages created a week, serves hundreds of millions of members and has billions of messages stored in its database.”
The service ownership approach is working for building out new features too. Product teams still work with the infrastructure team, but all the custom business logic was migrated out of the message store and encapsulated into the plugins for their features, and the different service owners can work with them when they need help. “I think if we had gone with say, two dozen services, it would have been much more difficult today to make sure we have those technical leaders available to help,” Lavery suggested.
The new architecture means product teams can experiment and test out new ideas without committing to months of development. “If your use case works within the plugin-style architecture that we’ve already built and you don’t need an additional change, you should be able to build and get your plugin production-ready within a month,” he said. “If it doesn’t work out, we could easily pull it out and if it does, we can double down.”
This is also speeding up work on some long-requested features like being able to edit and delete messages, something that was, though not technically impossible, prohibitively hard to build on the old system. Sending a message to two people used to create three records, one for the sender and one for each recipient, which would have meant consistency problems.
“If I wanted to edit that message, I would have to edit it three times, and make sure that all three of those edits were successful.” Now the message is a record that all three people get a reference to.
“That reference tells us where that message is, it also includes things like have you read it, is it set for delete. That makes edit and delete substantially easier, because I’m just editing one record. I just have to make sure that that edit is successful, and we have systems to make sure that that edit is carried around the world as quickly as possible, so that all our data centers are updated.”
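The difference between copying a message per participant and sharing one record can be shown with a small model. This is only a sketch: `MessageStore`, `MessageRef` and their fields are hypothetical names chosen to mirror the description above (a reference that says where the message is, whether it has been read and whether it is set for delete).

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Message:
    message_id: str
    body: str


@dataclass
class MessageRef:
    """Per-member pointer to a shared message, with per-member state."""
    message_id: str
    read: bool = False
    deleted: bool = False


class MessageStore:
    def __init__(self) -> None:
        self.messages: dict[str, Message] = {}
        self.refs: dict[str, dict[str, MessageRef]] = {}  # member -> message id -> ref

    def send(self, message_id: str, body: str, participants: list[str]) -> None:
        """Store one record and hand every participant a reference to it."""
        self.messages[message_id] = Message(message_id, body)
        for member in participants:
            self.refs.setdefault(member, {})[message_id] = MessageRef(message_id)

    def edit(self, message_id: str, new_body: str) -> None:
        """A single write; every reference sees the new body."""
        self.messages[message_id].body = new_body

    def read(self, member: str, message_id: str) -> str:
        ref = self.refs[member][message_id]
        ref.read = True  # read state belongs to the member, not the message
        return self.messages[ref.message_id].body


store = MessageStore()
store.send("m1", "hi", ["sender", "recipient-a", "recipient-b"])
store.edit("m1", "hello")
```

With three copies, an edit would need three successful writes; here there is one record to change, and only the lightweight references differ per member.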
Another in-demand feature is already starting to roll out, letting people in the same group send a message request to start a conversation with someone who isn’t already a LinkedIn connection. “That was one of our biggest complaints we got for the longest period of time: why do I have to be connected in order to have a conversation with another member?”
The design principles the messaging team came up with help them deliver features progressively. LinkedIn messages are encrypted and they’re reviewed to see if they’re spam before they’re ever created in the message database. But rather than blocking spam, LinkedIn shows users a notification that a message might be spam, letting them ignore or read it. Now if one member confirms that a message is spam later on, it can be asynchronously removed across the service.
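The flag-first, remove-later flow can be sketched as a store plus a removal queue. Everything here is an assumption for illustration: `SpamAwareStore`, the `looks_like_spam` callback and the in-process queue are hypothetical, and the real classifier and asynchronous machinery are not described in the article.

```python
from __future__ import annotations

from collections import deque
from typing import Callable


class SpamAwareStore:
    def __init__(self, looks_like_spam: Callable[[str], bool]) -> None:
        self._looks_like_spam = looks_like_spam  # hypothetical classifier callback
        self.messages: dict[str, dict] = {}
        self._removal_queue: deque[str] = deque()

    def create(self, message_id: str, body: str) -> None:
        """Check before storing, but flag rather than block."""
        self.messages[message_id] = {
            "body": body,
            "possible_spam": self._looks_like_spam(body),
        }

    def confirm_spam(self, message_id: str) -> None:
        """A member confirms spam; removal is queued and happens later."""
        self._removal_queue.append(message_id)

    def process_removals(self) -> int:
        """Drain the queue (stands in for a background worker); returns count removed."""
        removed = 0
        while self._removal_queue:
            message_id = self._removal_queue.popleft()
            if self.messages.pop(message_id, None) is not None:
                removed += 1
        return removed


spam_store = SpamAwareStore(lambda body: "free money" in body.lower())
spam_store.create("m1", "FREE MONEY now")
spam_store.confirm_spam("m1")
```

Because the confirmation only enqueues work, the member's action returns immediately and the cleanup can run separately.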
Because the system is extensible, the LinkedIn trust team was able to add anti-sexual harassment checking in the same way as spam checking and that’s now being extended for other forms of harassment, as well as notifying users about issues like account takeovers and password resets — something the old system wasn’t flexible enough to discover.
Taking this more structured approach to designing the platforms means users will get new functionality sooner, and it’s better for the career development of the engineers working on the project.
“Fires do suck a lot of the oxygen out of the room so if you have a system that’s constantly having those types of problems, people tend to fall back on a couple of engineers who are known to be able to solve those types of problems, but it’s not a good career option for those engineers to be constantly solving problems, and it’s not healthy for a team to be constantly solving those problems.”
“All the engineers I’ve ever met are problem solvers, but they also have something they want to build or something that they want to achieve. If you’re in a state where you can actually focus on those types of problems, the overall morale of the team is just increased.”