Earlier this week, an international business holiday was almost declared as Slack unexpectedly went offline for a few hours. Organizations have become so dependent on the team chat messaging service that many wondered if they could get any work at all done during this downtime. Only four years old, the company serves eight million users from 500,000 different organizations. So, even with an outage here and there, Slack knows a thing or two about rapid scalability and long-term reliability.
Earlier this year we caught up with Julia Grace, who is Slack’s senior director of infrastructure engineering. Signing on in 2015, she was quickly promoted to the senior role, building her engineering team to 75 people in the two years since signing on. Prior to Slack, she co-founded Tindie, a marketplace for maker goods, and spent time at IBM Research. We spoke with her about rapid scalability, the cloud versus in-house computing, diversity and other topics.
What was your general area of specialization when you studied computer science in college?
I’ve always been really fascinated by the intersection of humans and computers. And so I always find it really interesting: how do we build systems. I went to graduate school during the rise of Facebook, and so there was this really interesting challenge of building a huge, huge scale application, but then looking at usage and understanding patterns across those applications and how they can facilitate communities and conversation and communication.
When I was an undergraduate, even though I knew math and computer science were my calling, the other two areas that I spent a lot of time in were sociology and social psychology. I took a lot of classes in both, and that was a subject that came very naturally. How do people communicate? How do you understand power dynamics and groups and all these very fascinating attributes of how people communicate?
I’m on the East Coast. When you do computer science as a graduate student, many roads lead to California. So I get on I-40, I head west, I show up at IBM Research. We did so much really fascinating work. For example, we did an early paper and system for crawling and mining the data in MySpace, which was the social network at the time. We were collecting and understanding sentiment, basically doing natural language processing. Building the systems to mine MySpace was incredibly difficult because the activity levels in MySpace at the time were just huge.
One of the early projects that I worked on was mining the data to build Top Ten music lists that were more representative of the music that was being listened to. And this was really important, because around this time, like 2007, all of those Billboard top ten lists were populated by CD sales. That was a declining market because people were buying instant streams. And so there wasn’t any way to figure it out. So I got really interested in starting to build some of these systems, working with some of the most brilliant minds in machine learning and in natural language processing. So I had always thought you can have a huge impact on society if you have a knowledge of computing, of technology, but also if you have a knowledge of how society communicates and collaborates and works.
And so, I love people, I love working with people, I love growing people, hence being in a manager role now. And so combining that with my knowledge of systems was where I found I could have the most impact.
Later, you co-founded Tindie and set up the engineering department. How did you attack that problem?
I co-founded Tindie with Emile Petrone. He then had the CEO role and I had the CTO role. I not only had to build the engineering team from scratch, but I also had to, alongside my co-founder, build a business. And so, not only were there interesting hiring challenges, but I like to say that I got my MBA on Sand Hill Road, because having to pitch and understand the nuance of the business was so incredibly important, understanding customer acquisition, understanding marketing.
One of the wonderful things that happened was, as I learned how to talk about the business and the impact, I could then use that same pitch, for lack of a better word, when I talked with engineers. Because, ideally — and this is very true at Slack as well — we want people who are excited about the mission in the company. We want people who deeply believe in the product and how the product can really change the way people work. We’re at work more than we’re at home, that’s a lot of time.
And so when I was at Tindie, it was all about enabling the next generation of hardware entrepreneurs to buy and sell what they made in the marketplace. We built the marketplace. And so what I learned at Tindie, and I’ve carried it with me to Slack, is that I am very much a mission-driven person. We used Slack at Tindie, and it completely changed the way that we worked, and this was long before I thought of working at Slack. I was like “This product is amazing!” I built on the platform.
When we sold Tindie, I was deciding what to do next. I looked at the home screen on my phone, looking at the apps that I love, and Slack is right there in the middle.
So you just introduced yourself there?
You got it. I had known a few folks here through some of my mentors, and I first went to my mentors and said “What do you think about Slack? I only know it from a product perspective, what do you think about it from a cultural perspective, from a company perspective?” And I heard nothing but glowing reviews. So I read every single article that I could about Slack and I listened to every single podcast that Stewart Butterfield [Slack CEO] was on, because I wanted to understand the business, I wanted to hear what Stewart and our other executives said about the business, where it was going. I did a lot of research and everything that I found only reaffirmed that this would be a great place for me to be, so I got an introduction.
There was a job opening, I interviewed, and they liked me as much as I liked them. They said to me, “We don’t have anyone for you to manage but we’re gonna hire some people, and so by the time you start, there’ll be people to manage.” And I said “oh, that sounds great, sign me up!” And then a few months later I joined the company, and it’s been an incredible ride ever since.
Now, how would you explain what your role is at the company in relation to the actual infrastructure?
My organization, we build and scale all of the back-end systems that make Slack work. So if we do our jobs correctly, Slack is seamless. Its performance, you don’t notice. I often compare it to infrastructure in the real world. If you’re driving down a road that doesn’t have potholes, that’s free of debris, you don’t think “This is a great road to drive on!” You’re free to think about other things. And so the challenge of infrastructure is, just like in the real world, if the road has potholes, if there’s debris, if it’s not marked well, you become frustrated. You notice: why don’t they fix these potholes? If we do our job right, you never think about us, the infrastructure just works. If we do our job poorly, we’re on the front page of every news publication.
You’re the one who gets the pressure when there’s an issue or a slowdown of the site…
Exactly! We’re under the gun, but what we love about our job is that we’re able to build the systems that let the mobile engineers, the designers, the product managers, the engineers working on the product build at a faster cadence. They can build more reliably, they can build the features that you know and love, while we’re behind the scenes making sure the foundation of the house is solid and can scale with the user growth.
Slack is an amazing story in terms of scaling up to meet global demand. Do you have any rules of thumb for scalability?
A lot of what we’ve done now is ensuring that Slack is just as fast and just as seamless in the United States as it is in, for example, Hong Kong, or in Japan, or anywhere in the Asia Pacific. A big part of that is caching data around the world, which improves start-up times, channel-switching times, and the time it takes to find someone when you open the quick switcher or search for them. We ensure all of that is really fast by storing data closer and closer to the users.
In the early days, we didn’t do that because most Slack users were small teams in the United States. So the infrastructure for small teams in the United States was rock solid. I mean, our founders came out of Flickr. They scaled high-performing websites. But the user base began to diversify. It wasn’t only international, it was large enterprises. For example, Oracle right now has 38,000 people on Slack. That’s very different than when I was running Tindie with my small team in the U.S. and Canada.
To support these companies with these tens of thousands of users, we’ve made very big changes behind the scenes, especially in relation to how much data we send and receive. People keep Slack open for something like ten hours a day. They’re not actively using the product for ten hours a day, but we’re sending, you know, messages like: Joab’s online, he’s offline, new channels have been created, someone has added an emoji reaction to your message.
So that’s a really large volume of data, and we used to send a huge firehose of data to all of the clients, to your iPhone, to your desktop device. But as we’ve grown and scaled, we’ve had to transition to a publish-subscribe model because the firehose of data was so large that it would overwhelm the client. So a big change over the past few months has been going to publish-subscribe.
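For readers unfamiliar with the two delivery models, the difference can be sketched in a few lines of Python. This is a toy, in-memory illustration only, not Slack’s implementation: the `Broker` class, its method names, the client names and the `presence` topic are all hypothetical.

```python
class Broker:
    """Toy in-memory message broker (hypothetical, for illustration only)."""

    def __init__(self):
        self.subscriptions = {}  # topic -> set of subscribed client ids
        self.inboxes = {}        # client id -> list of (topic, event) pairs

    def connect(self, client_id):
        # Register a client with an empty inbox.
        self.inboxes.setdefault(client_id, [])

    def subscribe(self, client_id, topic):
        # Record that this client wants events for this topic.
        self.subscriptions.setdefault(topic, set()).add(client_id)

    def publish_firehose(self, topic, event):
        # Firehose: every connected client receives every event.
        for inbox in self.inboxes.values():
            inbox.append((topic, event))

    def publish(self, topic, event):
        # Publish-subscribe: only clients subscribed to the topic receive it.
        for client_id in self.subscriptions.get(topic, set()):
            self.inboxes[client_id].append((topic, event))


firehose = Broker()
pubsub = Broker()
for broker in (firehose, pubsub):
    broker.connect("phone")
    broker.connect("desktop")

# Only the desktop client asks for presence updates.
pubsub.subscribe("desktop", "presence")

firehose.publish_firehose("presence", "a user came online")
pubsub.publish("presence", "a user came online")

# Events received by the phone under each model.
print(len(firehose.inboxes["phone"]), len(pubsub.inboxes["phone"]))  # prints: 1 0
```

Under the firehose model the phone receives an event it never asked for; under publish-subscribe it receives nothing until it subscribes, which is what keeps the volume manageable for each client.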
Is there a cultural challenge in moving, like you said, from a firehose-based architecture to a pub-sub-based architecture? Is there a different sort of mindset that comes with being highly scalable?
In the early days of Slack, we were the heaviest users of the product. And so the way that we did a lot of the testing was that we released these features to ourselves. Now, as the business grows (new people are joining every few weeks), we’re still not the size of the 38,000 people that are at Oracle. And so we’ve had to build a lot of internal tooling for load testing and for testing features at really, really large scales, because, and this is a good problem to have, we’re no longer the largest user.
And so we’ve had to then shift the mentality from “well we can release it to ourselves first,” to understanding how we can model and better understand the extremely high usage patterns of some of these huge corporations, to ensure that the infrastructure we build will work seamlessly, will work quickly, will deliver the results that we want at those huge scales.
Do you have any rules of thumb when it comes to buy vs. build? I know, especially with scalable architectures, there are always questions of whether to use a cloud service or do it in-house. Is there any sort of advice that you offer along those lines?
The cloud landscape is rapidly changing. I’ve been incredibly impressed with the diversification and the scale. The implication of that is that all of my rules of thumb are constantly changing because we are in a world with rapid, rapid change and rapid growth. The companies that built out their own data centers, 10-15 years ago, used a calculus that would be fundamentally different today. You see a lot of companies migrating from their own data centers to cloud, and then you see some companies migrating, cloud-first companies, migrating to their own data centers.
And so it really depends on the type of business. In the data center case, it depends upon the volume and where the data is going, and then there’s a huge cost factor there. Sometimes it might make sense from a cost perspective, sometimes it might not. And so one of the things that we really pride ourselves on at Slack is the ability to iterate quickly and adjust to that rapidly changing environment. The decisions that we made four years ago, sometimes those are the correct decisions, and sometimes we’ve had to revise and do things differently, again because so many interesting things are happening in the cloud now.
Are you looking at microservices or DevOps? I would imagine you have a pretty agile shop to begin with.
Yeah, so Slack, we just turned four years old. On one hand that’s a really young company, but on the other hand, things like Kubernetes and containerization weren’t mature enough at that time. I’m seeing a trend across the industry of operators becoming more like engineers. So operators are moving up the stacks, and developers are moving down the stacks. They also need to know how to deploy, monitor, alert on the services that they build.
And so we see that a lot at Slack, where teams have operators and engineers, because the skill sets are really coming together nicely. While we’re not an embedded model now, we’re definitely thinking about that.
So again, I think part of what’s so fun is this is a really diverse, changing landscape. So we have to adjust frequently, do that with diligence and care, so that we’re providing a service that is highly available, where companies can run their whole organization on Slack.
What is the importance of diversity in infrastructure?
Building engineering teams and being an engineer is much more than just writing code — it’s also fostering and creating inclusive, empowering cultures that get the best out of people with a wide range of diverse backgrounds. No one demographic has a monopoly on good engineering talent and ideas. So to get great infrastructure, you need people with differing backgrounds, approaches and thought processes. Only by bringing all these different experience sets and perspectives together can you be sure you’re building something truly excellent.