Walmart Labs CTO: How OpenStack Can Prevent Cloud Lock-in
When it comes to deployments of OpenStack, few, if any, are larger than Walmart’s. Next week, at the OpenStack Summit 2016 in Austin, Texas, retail giant Walmart will detail how the company scales a single OpenStack region to 400 compute nodes that can host over 30,000 virtual machines. Since last year’s summit, the company has released its own OpenStack-based OneOps application lifecycle platform to seamlessly move workloads between private clouds and public ones. We caught up with Jeremy King, chief technology officer of Walmart Labs, to find out more about the DevOps processes inside of Walmart as well as how the company is using open source and cloud-native technology to stay nimble in the ever-evolving e-commerce market.
TNS: How did you come to Walmart?
King: This coming summer will mark my five years. Walmart Labs was created as part of the acquisition of Kosmix. I joined very shortly after that, and for me, it was the signal that Walmart actually was going to get into the e-commerce game. Prior to that, Walmart hadn’t really invested very heavily in the e-commerce side.
It’s really been a transformation. Prior to Walmart Labs, Walmart would largely outsource to third parties, and we used proprietary software vendors. But we decided that we were going to [offer] millions of items to hundreds of millions of customers and we were going to integrate it with our stores, our five thousand pick-up locations and all of our distribution centers. There is no software in the world that you can buy to implement that.
After five years, we’ve acquired fifteen of these companies and hired 2,000-plus folks in the organization. It really transformed into a Silicon Valley startup. We use a ton of open source. We’re big OpenStack users. We want to contribute back to open source as much as possible and as a result, we’re getting lots of great folks who have bounced all over the Valley and who love working on these big problems that we have here.
TNS: That’s quite a daunting task to set up a huge e-commerce operation. How did you manage it?
King: I spent some years previously at another big e-commerce company and did somewhat the same reboot: a new stack, starting from scratch. The code name [for the Walmart project] was Pangea, and effectively what we wanted to do was modernize the stack. The engineers literally hadn’t touched the stack in thirteen years and so it was really old technology.
Of course, the engineers were great but obviously were frustrated with how slowly they could move. So, we decided to reboot it completely. In doing so at that time, we said, “Hey, it’s going to be completely open source. It’s going to be cloud-native from the beginning.” As a result, we were able to attract some very talented people.
When you come up to folks and say, “Hey, we’re going to re-platform the world’s largest retailer,” people are just, “Yeah. That’s something interesting you’ve got.” So, I was able to attract some really great talent to do that.
It took about eighteen months to build not only the base core platform but also four or five big parts of the infrastructure of Pangea: the search engine, the whole platform server, the cloud-native infrastructure and all that.
TNS: From an earlier interview with Walmart, we learned about the concept of making the developer responsible for the operation of the app. “You deploy it, you own it.” Could you talk a bit more about the DevOps process at Walmart?
King: Let me tell you a little bit about my philosophy. When I worked at the other e-commerce company, it was way before the open source revolution had come on. So, they had built a platform that was very scalable and very innovative but, to some extent, constrictive, if you know what I mean. Every architectural decision was put through a gauntlet, and we were completely paranoid about scalability and the back end. As a result, we had a great website, and we had a nice scalable infrastructure. But, when we hired people, it took them months to learn the system, and it would be really hard to add new capabilities.
From that company, I went to a little start-up, about a hundred engineers, and it was the exact opposite. It was absolute chaos, where engineers could work on any platform they wanted. With a hundred engineers, we used about 30 different platforms of different systems. But the engineers were pretty happy because they could work on the latest, greatest stuff. But, I was trying to build a “four nines” capability and in doing so, I was replicating high availability in five platforms and I got pretty tired of that.
With Walmart Labs, I wanted both of those worlds. That’s really what OneOps is trying to get to. I absolutely hate this word governance, but to some extent, there is an inventory of the systems that we have.
OneOps is not just a cloud management tool. It’s an application life-cycle management tool and what that means is that not only do [Walmart developers] have control of the ability to do deployments, but it also gives me stats. Who is using it? What version do they own? When was the last time it was deployed? How many transactions? How many machines is this running on? Which cloud is this running on?
As a result, I have an inventory of every piece of the system I’m using, whether it’s a database, NoSQL database, a caching service, Ruby or Java or Node.js systems. I know what version they are running on. I know who’s using it, and that allowed me to run a template.
It’s DevOps culture with some control if you know what I mean. It’s a beautiful thing.
TNS: Walmart released the OneOps platform a few months back. How has the reception been?
King: It’s been great. It seems like we are doing demos for folks every week, mostly with friends from other companies and then also for vendors that are interested in building their plugins into that. We are very much committed to keeping it current. We continue to invest in the different connectors that we’re building, whether it be to a database server or to a new cloud. We’ve made connectors to pretty much any OpenStack cloud provider. We are working on the Azure connector now.
We have a lot of users internally. We did thirty thousand deployments in November last year and typically November is the freeze period for retailers because they don’t want to get involved in anything crazy before Black Friday. So, this is the kind of control we’ve been able to manage: as a result of having these capabilities, we have so much visibility that we felt safe enough to deploy even the week before Black Friday.
TNS: What role does OpenStack play in Walmart operations?
King: It’s absolutely a critical piece of it. As I mentioned, we’ve been on OpenStack now for a long time. We cut our teeth on it three-plus years ago. As you probably know, it wasn’t very stable in the very beginning. We had a lot of problems initially, but we never turned back, and we’re extremely happy with it. In the last two years, we haven’t had any site outages as a result of OpenStack.
So, we’re very committed to all of our infrastructure running on OpenStack now. We were at one point the biggest deployment in the world, but I know a number of companies have started doing it quite a bit, so I don’t know if we still hold that title, but we’re pretty big.
TNS: What advantages does OpenStack offer?
King: Flexibility was a big piece of it. So many of the cloud providers were moving to OpenStack as well. So, if I want to make a deployment internally, it is just the same process as when I want to deploy to another cloud provider that has an OpenStack endpoint. For example, we use an OpenStack endpoint with Rackspace, and we’ve fully integrated with a couple of other companies that have OpenStack endpoints.
TNS: What are the dangers of lock-in with cloud providers? Is this an issue that Walmart has grappled with?
King: We are definitely cloud users, and we definitely were worried about lock-in from the very beginning. When we built OneOps initially, we weren’t really thinking about being cloud agnostic, but what we were thinking about was moving from a cloud environment in development, to a cloud environment in staging, and to a cloud environment in production. But what we’ve found is that, as a result, it is definitely easy to move from those internal cloud capabilities to the external cloud.
I think there’s a lot of concern about this, the “Hotel California” problem: Once you move all your data and your apps into the cloud, you absolutely cannot get out. It’s a real concern. So, as we built OneOps, we’ve been able to stay very cloud agnostic and move between not only internal cloud but external cloud for any computer.
The other thing that OneOps has got, which is part of the reason we love it so much, is that engineers don’t actually know it is based on OpenStack. They don’t know that it’s Rackspace on one side and Azure on the other side or an internal cloud, because it’s just configuration management, and effectively we can control the load centrally.
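The idea King describes, engineers deploying without knowing which cloud sits underneath, can be illustrated with a minimal Python sketch. This is not OneOps code and does not reflect its actual design or API; every name here (`Cloud`, `FakeCloud`, `PLACEMENT`, `deploy`) is hypothetical, and the "clouds" only fabricate instance IDs rather than call any real provider.

```python
from dataclasses import dataclass
from typing import Protocol


class Cloud(Protocol):
    """Anything that can start instances of an application (hypothetical)."""

    name: str

    def launch(self, app: str, count: int) -> list[str]: ...


@dataclass
class FakeCloud:
    """Stand-in for a provider endpoint; it only makes up instance IDs."""

    name: str

    def launch(self, app: str, count: int) -> list[str]:
        return [f"{self.name}/{app}-{i}" for i in range(count)]


# Central configuration (assumed): environment name -> target cloud.
# Operators change this mapping; application engineers never see it.
PLACEMENT: dict[str, Cloud] = {
    "dev": FakeCloud("internal-openstack"),
    "prod": FakeCloud("external-openstack"),
}


def deploy(app: str, env: str, count: int = 2) -> list[str]:
    """Deploy by environment name only; the cloud is looked up centrally."""
    return PLACEMENT[env].launch(app, count)
```

Calling `deploy("cart", "dev")` and `deploy("cart", "prod")` uses the same code path; only the centrally held `PLACEMENT` mapping decides where the instances land, which is roughly the separation King attributes to configuration management.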