What’s What with Open Source Twitter
It took him a while, but Elon Musk, owner and CEO of Twitter, finally kept his promise to open source some of Twitter’s code. This came even though he’d fired Twitter’s open source staff. Color me surprised. Without his developers, I didn’t think Musk could do it. But on March 31, Twitter released the code for its recommendation algorithm, which determines which tweets appear on your timeline.
With the bombastic title The Algorithm, the recommendation code is on GitHub. Twitter’s also open sourced some of the Machine Learning (ML) modules behind the recommendation code. Together, these describe the services and jobs used for constructing your home timeline.
The recommendation code is written primarily in Scala. This is a high-level language that uses both object-oriented and functional programming techniques. Other languages represented in the code are Java and Python. The overall recommendation code is licensed under the GNU Affero General Public License (AGPL3).
The ML modules are written in Python. They’ve only been run on Linux and have been optimized for NVIDIA GPUs. They’re licensed under the AGPL3 and a variant of the BSD 3-Clause License, the one used by TorchRec, the PyTorch domain library for recommendation systems.
The recommendation pipeline is made up of three main stages:
- Fetch the best Tweets from different recommendation sources in a process called candidate sourcing.
- Rank each Tweet using an ML model.
- Apply heuristics and filters, such as filtering out Tweets from users you’ve blocked, NSFW content, and Tweets you’ve already seen.
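The three stages above can be sketched in a few lines of Python. This is a toy illustration only, not Twitter's code: the `Tweet` fields, the function names, and the stand-in `model` are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tweet:
    # Hypothetical fields, chosen only for this sketch.
    tweet_id: int
    author: str
    relevance: float  # stand-in feature that the toy ranker reads
    nsfw: bool = False


def source_candidates(sources):
    """Stage 1: pool candidates from every source, dropping duplicate ids."""
    seen, pooled = set(), []
    for source in sources:
        for tweet in source:
            if tweet.tweet_id not in seen:
                seen.add(tweet.tweet_id)
                pooled.append(tweet)
    return pooled


def rank(candidates, model):
    """Stage 2: order candidates by the model's score, highest first."""
    return sorted(candidates, key=model, reverse=True)


def apply_filters(ranked, blocked_authors, seen_ids):
    """Stage 3: drop blocked authors, NSFW content, and already-seen Tweets."""
    return [
        t for t in ranked
        if t.author not in blocked_authors
        and not t.nsfw
        and t.tweet_id not in seen_ids
    ]


def build_timeline(sources, model, blocked_authors, seen_ids, limit=10):
    """Run the three stages in order and keep the top `limit` Tweets."""
    candidates = source_candidates(sources)
    ranked = rank(candidates, model)
    return apply_filters(ranked, blocked_authors, seen_ids)[:limit]
```

Here `model` is any function that maps a Tweet to a number. In the real system that role is played by a trained ML model, which is exactly the part you can't reconstruct from code alone.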
The service that is responsible for constructing and serving your timeline is called Home Mixer. Home Mixer is built on Product Mixer, Twitter’s custom Scala framework for building content feeds.
On average, candidates come 50% from people you follow (In-Network) and 50% from people you don’t follow (Out-of-Network). For In-Network Tweets, Twitter ranks candidates with a logistic regression model whose most important component is Real Graph, a model that predicts the likelihood of engagement between two users. The higher the Real Graph score between you and the author of a Tweet, the more of their Tweets you’ll see. For Out-of-Network Tweets, the algorithm relies on graph traversals via GraphJet, a real-time graph processing engine.
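To make the idea of an engagement score concrete, here is a minimal sketch of a logistic scorer and a half-and-half mix of two ranked lists. The feature names, weights, and both function names are invented for illustration. The released code doesn't tell you what the real ones are, which is rather the point.

```python
import math


def engagement_probability(features, weights, bias=0.0):
    """Logistic regression: squash a weighted sum of features into (0, 1)."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def mix_half_and_half(in_network, out_of_network, size):
    """Interleave two already-ranked lists so each supplies about half.

    zip() stops at the shorter list, so this toy version assumes both
    sources have enough candidates.
    """
    mixed = []
    for followed, not_followed in zip(in_network, out_of_network):
        mixed.extend([followed, not_followed])
    return mixed[:size]


# Hypothetical weights: replies count for more than likes in this toy model.
weights = {"likes": 0.2, "replies": 0.8}
close_friend = engagement_probability({"likes": 5, "replies": 3}, weights, bias=-2.0)
stranger = engagement_probability({"likes": 0, "replies": 0}, weights, bias=-2.0)
```

With these made-up numbers, `close_friend` comes out higher than `stranger`, so the friend's Tweets would rank higher. Change the weights and the ranking changes with them, which is why the code without the trained model tells you so little.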
This sounds impressive. It’s not really. As a former Twitter executive told Fortune, “In order to open source the algorithm, you need to open source the training set, which is impossible for Twitter to do. Every effort in open-sourcing the algorithm without the data is completely dishonest.”
This isn’t just a matter of sensational bits and pieces in the code, such as identifying Twitter users as Republicans or Democrats. Even Musk admitted, “Our initial release of the so-called algorithm is going to be quite embarrassing, and people are going to find a lot of mistakes.”
The trouble isn’t the mistakes. It’s that there’s not enough there, there, to make any real sense of the programs. As Jonathan Mayer, an assistant professor of computer science at Princeton, observed on Y Combinator’s Hacker News, while “the documentation gives a decent high-level overview of how Tweet recommendation works … the underlying policies and models are almost entirely missing … Without those, we can’t evaluate the behavior and possible effects of ‘the algorithm.’”
Some people argue that the models were left out to keep spammers from working out how to game the recommendation algorithm. I would argue that spam should be filtered out before it reaches a recommendation algorithm. Yes, some would make it through, but hey, that’s how the recommendation game works. Just ask anyone who works in the eternal duel between Search Engine Optimization (SEO) and Google PageRank.
And, besides, open source means “open.” The advantages of open source development outweigh its dangers, or we’ve all been coding things wrong for the last 20 years or so.
Twitter claims, “The goal of our open source endeavor is to provide full transparency to you, our users, about how our systems work.” To me, this looks more like open source washing than a real look at Twitter’s code.