Like a number of Internet-scale companies, LinkedIn has had to develop its own combination of tools and technologies to meet specific requirements. We see this continually with companies such as Airbnb, Facebook and Google, which now operate at a scale far greater than most. Starting with 2,500 users in its first week, LinkedIn grew to more than 13 million members by 2007, reaching a point that necessitated a change in how it stored, accessed and organized data.
Today, the company has more than 300 million users on a network that, to reach its potential, needs to interconnect them through deeper search capabilities. Without those advancements, the LinkedIn network would lack the capability to connect people, help them find jobs and let them participate, be it through groups, blogging or otherwise.
That growth necessitated a transformation of LinkedIn's data infrastructure. Through that process, LinkedIn developed Galene, a search architecture that indexes every update to a user profile the moment it happens. Along the way, LinkedIn built a cadre of open source technologies that informed, and in some cases directly fed into, its search architecture.
LinkedIn’s story is a familiar one to SaaS companies that got their start before the advent of such innovations as RESTful APIs, NoSQL databases and modern, flexible programming languages. Today, we face a transition to an infrastructure that must adapt to the demands of the application, heralding the arrival of containers, microservices and a new set of complexities that come with distributed architectures.
With these new complexities comes a drive to open source projects that attract developers to build out the technologies needed to scale. LinkedIn is one of the earlier companies to take this approach, after realizing the limitations of technologies built more for the corporate enterprise than for web-scale operation. This approach has given LinkedIn the capability to build out a search product that, as LinkedIn Principal Staff Engineer Sriram Sankar wrote in a blog post, is designed to “find people, jobs, companies, groups and other professional content.”
“Our goal is to provide deeply personalized search results based on each member’s identity and relationships.”
In an interview earlier this year, Igor Perisic, vice president of engineering at LinkedIn, explained how the company transformed its data architecture and subsequently developed a package of open source technologies that have helped it transform its search capabilities with Galene.
This period also saw the sudden availability of massive compute capability and storage, which in turn influenced developer tools, platforms and databases. The maturity of these technologies is reflected in the many service providers that have used Amazon Web Services to build out in a manner that an earlier generation of companies simply did not have the luxury of doing.
Technologies such as Hadoop, Cassandra and MongoDB have made it possible for new entrepreneurs to make a fraction of the investment that earlier ones had to make, said Lew Tucker, vice president and CTO of Cloud Computing at Cisco and vice-chairman of the OpenStack Foundation. With that has come a surge in open source contributions from web-scale companies, and a new group of commercial software providers that have built their businesses on open source projects. Docker is a good example of a company that fits that bill.
LinkedIn’s examples show how some of the largest tech companies are open sourcing technologies to build developer communities, which in turn leads to platforms that are based on contributions from the community. Through that process, a thread emerges: better efficiencies, more automated technologies and a sharper group of developers working for the company.
“Web companies benefit in that they attract developers when contributing open source projects,” Tucker said in a phone interview with The New Stack. “That makes it easier to hire people.”
The Open Road
LinkedIn’s advancements in search can be traced back to 2007 when engineers struggled with a slow and unwieldy search algorithm. The company wanted to offer members a search capability that would immediately make available new updates to a user’s profile. In response, LinkedIn developed Zoie, an open source, real-time indexing and search system built on Apache Lucene. LinkedIn open-sourced Zoie in the summer of 2008.
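The core idea behind a real-time indexing system like Zoie is that a profile update becomes searchable the moment it is written, rather than waiting for a periodic batch rebuild. The sketch below is a toy in-memory illustration of that idea, not Zoie's actual API or Lucene's: indexing a document replaces its old postings immediately, so searches reflect the latest update.

```python
from collections import defaultdict

class RealtimeIndex:
    """Toy in-memory inverted index, illustrative only (not Zoie or Lucene)."""

    def __init__(self):
        self.postings = defaultdict(set)   # term -> set of doc ids
        self.docs = {}                     # doc id -> last indexed text

    def index(self, doc_id, text):
        # Remove stale postings so an updated profile replaces the old one.
        if doc_id in self.docs:
            for term in self.docs[doc_id].lower().split():
                self.postings[term].discard(doc_id)
        self.docs[doc_id] = text
        for term in text.lower().split():
            self.postings[term].add(doc_id)   # searchable immediately

    def search(self, query):
        # AND-match all query terms.
        terms = query.lower().split()
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result

idx = RealtimeIndex()
idx.index(1, "engineer hadoop brazil")
idx.index(2, "data scientist new york")
idx.index(1, "engineer lucene brazil")  # profile update replaces old terms
print(sorted(idx.search("engineer brazil")))  # -> [1]
print(sorted(idx.search("hadoop")))           # -> [], stale term is gone
```

A production system like Zoie layers this kind of freshness on top of Lucene's on-disk segments; the point here is only the contract: a write is visible to the very next search.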
LinkedIn also found that memory and CPU costs increased with the new capabilities the search advancements brought. Reliability had also surfaced as an issue, resulting in occasional downtime for members with large networks. There were questions, too, about what an optimal graph partition would look like: should the data be sharded randomly, or should people who communicate with each other be placed on the same shard? On top of all this, its engineers dealt with an extremely dynamic, rather than static, environment with many time-consuming updates. They lacked the ability to recreate partitions quickly, as Hadoop had yet to come into the picture. Eventually, LinkedIn’s teams from search, network and analytics settled on random sharding, hashing it on Riak, which worked sufficiently well.
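Random sharding of the kind described above typically means hashing a member's id to pick a partition, so placement is uniform and independent of who communicates with whom. The following is a minimal sketch of that scheme under generic assumptions (16 shards, MD5 as the hash), not LinkedIn's actual partitioning code:

```python
import hashlib

NUM_SHARDS = 16

def shard_for(member_id: int, num_shards: int = NUM_SHARDS) -> int:
    """Map a member to a shard by hashing the id (random sharding).

    Hashing makes placement deterministic for lookups, yet statistically
    uniform across shards regardless of the social graph's structure.
    """
    digest = hashlib.md5(str(member_id).encode()).hexdigest()
    return int(digest, 16) % num_shards

# Every lookup for the same member lands on the same shard.
assert shard_for(12345) == shard_for(12345)

# The spread across shards is roughly even for a range of ids.
counts = [0] * NUM_SHARDS
for member_id in range(10_000):
    counts[shard_for(member_id)] += 1
print(min(counts), max(counts))  # close to 10_000 / 16 each
```

The trade-off the LinkedIn teams weighed is visible here: random sharding balances load well, but a query touching one member's whole network must fan out to every shard, whereas graph-aware partitioning would localize it at the cost of hot spots and costly repartitioning.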
LinkedIn’s foray into open source development with Zoie in 2008 continued with a gradual migration toward other open source tools such as Hadoop. The company needed Hadoop to replace Oracle, which it had relied on for data stores. Though Oracle was robust, its cost proved considerable enough to push the company toward more open source initiatives.
In 2009, LinkedIn open sourced Voldemort, a distributed key-value storage system. In 2011, the company open sourced Kafka, a high-volume, low-latency messaging system for managing real-time data feeds. In 2012, it open sourced Sensei, a distributed, elastic, real-time, semi-structured database.
In 2013 came Samza, a distributed stream processing framework initially developed to help standardize varied data streams in real time, rather than offline, where processing requires more cycles and doesn’t scale as well. All these projects came about as ways to adapt to the growing scale of data, and to revamp an architecture that was no longer sufficiently agile. LinkedIn also developed Espresso, a horizontally scalable, indexed, timeline-consistent NoSQL data store that will eventually replace the company’s legacy Oracle databases.
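The stream-standardization job Samza was built for can be pictured as a transform applied to each message as it arrives, instead of batch-processing the whole log offline. The sketch below illustrates that per-message model with invented field names (`memberId`, `title`); it is not Samza's API, which in reality handles partitioned Kafka topics, checkpointing and fault tolerance:

```python
from typing import Callable, Dict, Iterator

def standardize(event: Dict) -> Dict:
    """Toy per-message transform: normalize field names and casing,
    the kind of standardization a stream job might do in real time."""
    return {
        "member_id": int(event.get("memberId", event.get("member_id", -1))),
        "title": event.get("title", "").strip().lower(),
    }

def process_stream(events: Iterator[Dict],
                   transform: Callable[[Dict], Dict]) -> Iterator[Dict]:
    # Each message is handled as it arrives; nothing waits for a batch.
    for event in events:
        yield transform(event)

incoming = [
    {"memberId": "42", "title": "  Sr. Engineer "},
    {"member_id": 7, "title": "Data Scientist"},
]
results = list(process_stream(iter(incoming), standardize))
print(results)
```

Running the same transform continuously over a feed, rather than in nightly batch jobs, is what makes the standardized data available downstream with low latency.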
The pre-Galene architecture also included Bobo, Cleo, Krati and Norbert, said Sankar in the blog post he wrote about Galene. These components have also been open sourced.
The architecture still lacked depth in a number of ways. Indexes were difficult to rebuild; when they did need rebuilding, say after a data corruption, it often meant putting several people on the project to get it done. Updates were cumbersome, requiring the whole entity to be rebuilt when only a portion of it needed revamping. Scoring was inflexible, making it a cumbersome task to insert machine learning scores. Lucene had limits on what it could do: offline relevance, query rewriting, reranking, blending and experimentation were all impossible with it.
There were also issues with the open source components themselves. The more that was open sourced, the more fractured and complex it became to pull everything together across multiple organizations. That experience taught the team not to break up Galene. Instead, according to Sankar, Galene became a single unified framework with a single identity.
Last year, the company launched Galene as its new search architecture to replace Lucene. With Lucene, there “originally were memory leaks. You had to shut down the machine in order to persist the index and to do the Lucene optimization,” said Perisic. “[Galene] revamps the entire backend stack of search.”
Galene is representative of LinkedIn’s shift from reactionary, ad-hoc approaches to a more holistic methodology as a way to keep pace with scale. In the post he wrote about Galene, Sankar explained why LinkedIn expanded beyond Lucene, and the technical details behind the shift:
As we grew, we evolved the search stack by adding layers on top of Lucene. Our approach to scaling the system was reactive, often narrowly focused, and led to stacking new components to our architecture, each to solve a particular problem without thinking holistically about the overall system needs. This incremental evolution eventually hit a wall requiring us to spend a lot of time keeping systems running, and performing scalability hacks to stretch the limits of the system.
The new search architecture behind Galene allows members and companies to quickly retrieve results for complex search queries like “Engineers with Hadoop experience in Brazil” or “Data science jobs in New York in companies where my connections have worked.”
So what does the upcoming year hold for LinkedIn? According to Perisic, the overarching data science and development themes the company is focusing on include data standardization, how to optimize diversity, adding more languages, streamlining the rankers, and focusing on the ability to express models in similar fashion across platforms.
Feature image via Flickr Creative Commons.