Artifactory at Scale at IBM Cloud
JFrog sponsored this post, which was written independently for The New Stack.
“Two years ago, every team was doing their own thing. We’ve gotten a heck of a lot more consistency out of the environment, and that’s enabled us to go faster,” Jason McGee, vice president and chief technology officer of the IBM Cloud Platform, told those attending the swampUP 2019 user conference last week in San Francisco.
IBM Cloud is a full-stack platform designed for the enterprise. The teams are building more than 100 services, including Kubernetes, serverless, networking, databases, analytics and artificial intelligence. They employ more than 60 data centers around the globe.
To implement the repository, the company reorganized its teams. There are tribes that manage a collection of related services, such as containers, infrastructure compute and databases. Within tribes, there are squads of developers that own some subset of those services. The squads work independently, yet own the entire lifecycle of their projects end to end.
“We can’t manage this at scale unless we have some standard architectural principles,” McGee said, and for the company that means running everything on Kubernetes.
“Kubernetes is still a pretty young thing. There’s still a lot of work ahead of us around the adoption of Kubernetes as a platform. But there are a lot of people being really successful running applications at scale on Kubernetes,” he said. The technology has “helped us put some commonality around how we run DevOps, how we do security and compliance, how we do deployment to use Kubernetes as an abstraction layer across that diverse environment.”
Individual services have “pretty amazing” velocity, he said. There are between 200 and 500 updates a week into that production environment, so it has to be up and stable and scalable at all times.
IBM chose Artifactory, run internally as a service, as its central repository. It had to be highly available, highly performant and up 24/7 all over the world. It also needed to be self-service and automated.
IBM runs the Artifactory cluster on large-memory, multi-core bare-metal servers. It’s backed by a MySQL database and an object storage service, and it’s all hosted on IBM’s own cloud, which poses some interesting challenges when something breaks, he said. A squad of 12 manages it.
The repository is part of a broader set of DevOps capabilities that include Jenkins and Travis as a service and some of its own tools.
It runs four instances: two staging environments and two production environments. Each team picks a primary location, Dallas or London, as the read/write location for its artifacts, with a read-only replica at the other location. In case of an outage, teams can switch to the other location and artifacts are still available.
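The primary/replica arrangement described here can be sketched in a few lines of Python. This is only an illustration of the routing rule, not IBM's or JFrog's implementation: the names `TeamRepo` and `route`, and the idea of passing in a set of locations that are down, are assumptions made for the example.

```python
from dataclasses import dataclass

# The two production locations named in the article.
LOCATIONS = ("dallas", "london")


@dataclass
class TeamRepo:
    """Hypothetical record of a team's repository and its chosen primary."""
    name: str
    primary: str  # the read/write location picked by the team

    @property
    def replica(self) -> str:
        # The read-only copy lives at whichever location is not the primary.
        return next(loc for loc in LOCATIONS if loc != self.primary)


def route(repo: TeamRepo, operation: str, down=frozenset()) -> str:
    """Return the location that should serve the operation.

    Reads fall back to the replica when the primary is down.
    Writes can only go to the primary, so they fail during a primary outage.
    """
    if operation == "write":
        if repo.primary in down:
            raise RuntimeError(f"{repo.primary} is down; {repo.name} is read-only")
        return repo.primary
    if operation == "read":
        for loc in (repo.primary, repo.replica):
            if loc not in down:
                return loc
        raise RuntimeError(f"no location available for {repo.name}")
    raise ValueError(f"unknown operation: {operation}")
```

With a repository whose primary is Dallas, `route(repo, "read", down={"dallas"})` returns `"london"`, while a write during the same outage raises an error.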
Among the lessons learned is that self-service is paramount. Automation allows teams to create their own repositories. Within five minutes, a new one can be available worldwide.
“It’s faster than if you had to call somebody and get them to do it for you. This self-service onboarding is the thing that allows us to have that ramp of consumption,” he said.
Predictable and non-disruptive upgrades of the Artifactory environment itself also are vital, he said. Again, automation helps implement upgrades in a way that’s largely invisible to development teams.
And there’s the people issue, such as people sharing their entire API keys on Slack when they have a problem, which poses compliance issues. A bot written for Slack helps thwart that behavior.
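The article does not describe how the bot works. As a rough sketch of the general idea, a chat bot could scan each message for text that looks like a credential and redact it. The patterns and the function names `contains_secret` and `redact` below are hypothetical, and a real bot would also need the Slack API to read and remove messages, which is left out here.

```python
import re

# Hypothetical patterns, chosen for illustration only:
# 1) a "key = value" style assignment for common credential names
# 2) any unbroken run of 40 or more letters and digits
SECRET_PATTERNS = [
    re.compile(r"(?i)\b(?:api[_-]?key|token|password)\b\s*[:=]\s*\S+"),
    re.compile(r"\b[A-Za-z0-9]{40,}\b"),
]


def contains_secret(message: str) -> bool:
    """Return True if any pattern matches somewhere in the message."""
    return any(p.search(message) for p in SECRET_PATTERNS)


def redact(message: str) -> str:
    """Replace every match with a placeholder so the text can be reposted."""
    for pattern in SECRET_PATTERNS:
        message = pattern.sub("[REDACTED]", message)
    return message
```

Simple patterns like these will miss some secrets and flag some harmless strings, so they are a starting point rather than a complete check.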
“When you scale any technology, you have to think about how people are going to consume it and the behavior they use to consume it and how you can help them be successful,” McGee said.
He lists the single-instance MySQL database among its challenges. It really needs a horizontally scalable HA database so that element of the architecture can handle changes in demand in a way that’s consistent with other layers of the architecture, he said.
And the London and Dallas environments are not interchangeable. It’s hard to switch back and forth.
“If Dallas goes down and that’s my single read/write location, I can still pull artifacts from London, but I can’t update them because my primary location is down,” he said. “We want to use the metric system to be able to read and write to either location to keep everything in sync. We want users to not have to even think about these things. And if we want to expand to a third location, [we want this] to be transparent to the users.”
Scale will continue to be a focus, with plans to apply AI and machine learning to keep the technology ahead of the growth curve.
And while it uses JFrog Xray for automatic vulnerability scanning of container images before they go into the registry, it would like to do vulnerability analysis earlier in the development process.
“For those of you doing continuous compliance, you know there are some pretty stringent rules about how long [it takes] to apply a fix to your environment. So the earlier we know about a problem, the earlier we can start the fix. So adding things like Xray into the architecture giving us that visibility is really important.”