When it comes to the practice of DevOps, few companies have been on the cutting edge as long as Etsy, which once famously asserted that its new engineer hires would commit code to production after the first day of work. Don’t be fooled by the crafty look-and-feel of Etsy. Last year, the e-commerce giant brought in $2.39 billion from over 24 million buyers around the world. The company even hired Rasmus Lerdorf, the creator of PHP, which is the company’s language-of-choice.
We recently interviewed Etsy’s new chief technology officer, John Allspaw, at the company’s Brooklyn headquarters about its development pipeline, its use of machine learning and what it looks for in potential new hires.
TNS: Could you describe Etsy’s current architecture?
JA: The architecture is relatively straightforward. At a high level, there’s not too much that’s overly complicated, and I think we prefer it that way: PHP, Linux, Apache, and MySQL. We use Elasticsearch and other Lucene-based tools for search, and we’ve modified Solr and Elasticsearch in various ways. Our “big data” stack is Hadoop and Vertica.
Our jobs have been written in Scala. For a long time, they were written with JRuby Cascading, and I believe now they’re entirely Scala. We’re a CentOS platform and are a Chef shop for configuration management, and we’ve grouped together configuration management and some provisioning stuff.
There are some more, rather interesting, smaller parts of infrastructure for what you would call “data science” and “machine learning” that use C and have used Scala as well. It’s possible that we have Fortran in the stack because, as it turns out (and I wasn’t aware of this), a lot of machine-learning approaches are very much about matrix math.
We want to exploit all of the advantages that come from having a small number of well-known tools. When you have a small number of well-known tools, you can then focus on the product. We don’t want to be doing engineering for engineering’s sake, and much in the same way that you wouldn’t want a carpenter to be overly concerned about what kind of hammer he has — at some point, it doesn’t matter.
Halting the Proliferation of Tools
TNS: But how do you stop the proliferation of tools?
JA: One of the things that we’ve done for a number of years is to recognize that if you’re going to solve a problem, and you think you need to solve it in a new and novel way that can’t be solved with the technology stack or the patterns that we already have — then that’s fine, it’s just that we need to be explicit about the operational costs of doing so. Introducing something new and different can bear a huge long-term cost to the organization.
TNS: So it’s not just the case of it being free, it’s also the case of you’ve got to support it, and then other people have to learn bits of it…
JA: So we do something called “architecture reviews.” I’m a developer; I’m working on this thing; I’ve run into a problem; and I’ve tried the various ways that we typically have tried to solve similar problems; and if those don’t work sufficiently, what I want to do is I want to ask the greater population of engineers at the company, “Hey, I’m working on this problem. I think the only way to solve it is by using this brand-new thing that nobody knows about. Please talk me out of this.”
I want somebody to raise their hand and say “After you’ve described this problem, I know exactly what you’re doing because check this out — six months ago, I had a very similar thing and I finally figured it out. You could just use the thing that I’ve got!”
There’s no reason to re-invent the wheel.
Some people could think of that as an example of top-down limitations on autonomy or creativity. I would say that it’s easy, and rather cynical, to say that. Think about all of the freedom in the product that you will have if you don’t need to consider this new stuff. We want to hire engineers who are psyched about the house that they’re going to build, not psyched that they’re going to use, you know, the new “John Deere Nail Gun 3000 XT”.
TNS: Could you tell us about the deployment process for new features?
JA: We want to optimize to pulling code to production [and make it] simple enough for you to have a huge amount of confidence in the deployment. So when things don’t go the way that you expect them to, you know that you’re talking actually about a production issue and not a deployment issue.
So if [a new engineer] can’t do it on the first day or, at the very least, the first week that they start, then that says our deployment process is too difficult. Or our onboarding process is too difficult.
You Build It, You Own It
TNS: How do you minimize the differences between development and production?
JA: I think that’s always going to be an unsolved problem. Actually, the only place currently where we use containers [is in] the development environment. All of the infrastructural components are represented for the Etsy stack, and for developing on the API or the Web.
We stopped trying to replicate production data in development a very long time ago. Because we’ve got over 36 million unique listings and a lot of languages, a lot of currencies, and so there’s no scenario where we are going to deterministically generate that diversity in data. So instead, what we do is — I think other large companies do this — is provide a sort of read-only proxy.
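The read-only proxy idea can be pictured, very roughly, as a wrapper that forwards reads to real data and refuses anything that would change it. The Python sketch below is an illustration under assumptions, not Etsy’s implementation, which the interview does not describe: the `ReadOnlyProxy` class, the naive SELECT-prefix check, the `listings` schema and the in-memory SQLite backend are all hypothetical stand-ins.

```python
import sqlite3

# Hypothetical illustration only: none of these names come from Etsy.
READ_PREFIXES = ("select",)


class ReadOnlyProxy:
    """Forwards SELECT statements to a backing connection and rejects the rest."""

    def __init__(self, connection):
        self._connection = connection

    def query(self, sql, params=()):
        # Naive guard: only statements that start with SELECT are passed through.
        if not sql.lstrip().lower().startswith(READ_PREFIXES):
            raise PermissionError("read-only proxy: write statements are not allowed")
        return self._connection.execute(sql, params).fetchall()


def make_demo_backend():
    """Builds an in-memory stand-in for a production database (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE listings (id INTEGER PRIMARY KEY, title TEXT, currency TEXT)"
    )
    conn.executemany(
        "INSERT INTO listings (title, currency) VALUES (?, ?)",
        [("wool pea coat", "USD"), ("shaving soap", "EUR")],
    )
    conn.commit()
    return conn
```

A developer-side caller would go through `proxy.query("SELECT ...")` and see real-looking rows, while an accidental `DELETE` or `UPDATE` raises `PermissionError` instead of touching the data. A real proxy would need a far stricter check than a string prefix.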
TNS: The engineers are responsible for their own bits when they are in production?
JA: Oh, yeah! You build it, you own it. If you write the code, you have been the one who has been thinking about, trying to anticipate, all of the ways that it should work, and all of the ways that it might not work. Just in the same way that Etsy sellers are very happy to sign their name to the things that they sell on the site, engineers need to have the same [pride of ownership], to say “This is my code; I take responsibility for it, whether it’s successful or not successful.”
If you’re responsible for pressing that big button that says “Deploy to Production,” there’s a palpable sense of responsibility. The next thing that will occur to you is “How will I know if it’s broken? Is it possible for it to be broken and I not know?” All of a sudden you’re much more attuned to what monitoring looks like, what alerting looks like, what metrics collection looks like.
TNS: At the same time, I imagine that you’ve abstracted a lot of the supporting infrastructure away from the engineer. They don’t have to worry about the particular configuration of the supporting stack?
JA: Yes and no. And I think it really is a common expectation, that abstracting away. The difference is, are you abstracting away so that you truly can say “I don’t have to worry about this”? Or are you abstracting away because you’re aware of those guts, but want to focus your attention right now in this area? That is what we’re looking for.
Post-mortem debriefings every day are littered with the artifacts of people insisting, the second before an outage, that “I don’t have to care about that.”
If “abstracting away” is nothing for you but a euphemism for “Not my job,” “I don’t care about that,” or “I’m not interested in that,” I think Etsy might not be the place for you. Because when things break, when things don’t behave the way they’re expected to, you can’t hold up your arms and say “Not my problem.” That’s what I would call “covering your ass” engineering, and it may work at other companies, but it doesn’t work here.
And the ironic part is that we find, in reality, engineers are more than willing to want to know. I’ve never heard an engineer not wanting to know more about networking. I’ve never heard an engineer wanting to say “You know what, I don’t want to care about database schema design.” And so if the reality is that people do care, then it’s kind of a waste of time to pretend that we’re abstracting away. Because you’re going to not care up until the absolute second you do, and when you do, that’s all you want to care about.
Engineering vs. Development
That’s the difference between being a programmer or software developer and an engineer. We want to hire engineers; we don’t want to hire software developers.
Engineering, as a discipline and as an activity, is multi-disciplinary. It’s just messy. And that’s actually the best part of engineering. It’s not about everyone knowing everything. It’s about paying attention to the shared, mutual understanding. As an engineer who starts day one, I am [not] the best one to know how network protocols at Etsy work, and I’m going to be encouraged to seek out the experts in those domains until I do. And maybe something will break, and then I’m going to learn something new.
And likewise, network engineers at Etsy are the experts at knowing and figuring out whether they have a sufficient understanding of the application mechanics, and it’s on me as CTO to create the conditions where people can blur those lines and say “I want to learn a lot about databases!” or “I want to learn a lot about networking.” I don’t have to be an expert in all of the esoterica of the OSI networking model. But I need to continually expect that part of my job is exploring the boundary (what’s the precipice of my knowledge) and keep going.
TNS: What do you guys do with machine learning? Is it still in the dev phase…?
JA: Oh, no, no! My most favorite example of how we’ve been using machine learning [is for search].
So again, at Etsy, we’ve got over 36 million things! To some extent, all of them have a pretty significant amount of uniqueness to them. There’re no SKUs. We hear from buyers and sellers [that] one of the things that makes them love Etsy is they say they can find things there that they can’t find anywhere else.
So that’s fine, that’s great. But think about it for a second. If the thing that you found doesn’t exist anywhere else, how did you know it existed [when] you went to Etsy to find it? And where did that come from? How do I characterize what you’re looking for? I have to do some inferred guessing, and I have to make inferences and use heuristics about what your taste is.
So we have to pop up a level and look at things like, the population of people who wear wool pea coats [also] really like natural shaving products. I’m not smart enough to tell you about all the math. But the data science team has published a couple of pretty well-known papers on this particular problem of how to provide recommendations for such a long tail.
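The “people who like this also like that” inference can be illustrated with simple co-occurrence counting. The Python sketch below is a toy under assumptions and is not the method from the papers Allspaw mentions: the function names `build_cooccurrence` and `recommend`, and the sample baskets, are invented for illustration.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_cooccurrence(baskets):
    """Counts how often each pair of items appears together in one basket."""
    counts = defaultdict(Counter)
    for basket in baskets:
        # set() ensures a repeated item in one basket is only counted once.
        for a, b in permutations(set(basket), 2):
            counts[a][b] += 1
    return counts


def recommend(counts, item, top_n=3):
    """Returns the items most often seen alongside `item`.

    Ties are broken alphabetically so the result is deterministic.
    """
    neighbours = counts.get(item, {})
    ranked = sorted(neighbours.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:top_n]]


# Invented sample data: each list is the set of items one shopper favorited.
SAMPLE_BASKETS = [
    ["wool pea coat", "shaving soap"],
    ["wool pea coat", "shaving soap", "beard oil"],
    ["wool pea coat", "ceramic mug"],
]
```

With the sample data, `recommend(build_cooccurrence(SAMPLE_BASKETS), "wool pea coat")` ranks `"shaving soap"` first because it co-occurs twice. Raw counts like these favor popular items, which is part of why a long tail of unique listings is a harder problem than this sketch suggests.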
TNS: So is there like a talent pool that you drew from to kind of build up this expertise? Is it coming from the statistics side, or where?
JA: I think, in the end, we pull from the same talent pool that our peer companies do. Again, we don’t want to hire a statistician who’s only interested in statistics. Given the choice between a software engineer who has demonstrated incredible abilities in algorithms, or an engineer or data scientist who has demonstrated both some fluency and, more importantly, curiosity on the financial and legal implications of the work that they’re doing, we will always choose the latter. Always. Because, again, that’s the difference between hiring programmers and hiring engineers.
You can’t look at the news that relates to software development and not see the word “automation.” We’re in a world where very important dialogue is happening around how much judgment is imbued in software. You could say this about self-driving cars; you could say this about drones. Remember, we’re writing code — whether it’s in e-commerce or a car or payments processing system — that has opinions.
As we’re writing the code, we’re going to try to anticipate scenarios that the code will find itself in when we’re not around. And so therefore, we have to encode decision-making in software. And so where do those decisions come from? Well, the thing that we don’t generally pay attention to is that the way that we code those decisions — and there are decisions that the code is going to run into that we hadn’t even planned for…
For example, I think there are scenarios where the safer thing to do is to go through a yellow light. I think there are situations where the safest thing to do is to break the speed limit. So I’m going to have to manifest those opinions into code.
The promise of things like machine learning is that it paints a picture of the future where software will be able to make all of these decisions and that humans’ judgment will always be faulty.
I think that there’s a huge miss there in that perspective. This is the reason we’ve said that we want to take a human-centered approach to engineering at Etsy. We can pretend as much as we want that the algorithms will make everything great and perfect, but that’s only when we view humans as frail, faulty, error-prone, cause of accidents, that sort of thing, and it’s simply false. The fact of the matter is, accidents and failures happen way less often than success. And so to pull back the high-level view, at Etsy, this is not just sort of an empty thing. We want to view engineering as engineering, and not be clouded by rose-colored algorithmic glasses.
So I’m saying that the more evolved perspective, the one that we’re going to take here, is that instead of asking questions about “why did something fail,” we want to ask why something succeeded, which is really easy to skip over.
Transcription services from Mara Kruk.
Images from Etsy’s Brooklyn headquarters.