A New Relic Tale about Migrating to Amazon Web Services

New Relic sponsored this podcast.
A legacy data center migration to a cloud native platform is never easy, but besides the enormous scaling opportunities and other benefits associated with the shift, DevOps teams will almost certainly learn a lot along the way.
In this episode of The New Stack Makers podcast, Wendy Shepperd, general vice president of engineering at New Relic, describes the challenges of migrating New Relic’s telemetry platform to a cloud native environment on Amazon Web Services (AWS). In the episode, hosted by TNS founder and publisher Alex Williams, Shepperd discusses key lessons learned from New Relic’s shift to AWS, as well as the implications for observability following the move.
“Here at New Relic, what I found is different: one is just the massive scale of the telemetry data platform and the migration that we’re doing there. And also, things have evolved quite a bit — so Kubernetes isn’t new anymore, microservices aren’t new anymore,” said Shepperd. “So, it’s less about migrating to those things and more about shifting the load out of our on-prem data centers into the public cloud.”
In some ways, platform migration challenges remain unchanged. Shepperd said one of the “key learnings” she has retained dating back to her mainframe days is that any platform migration is “always going to have unexpected challenges.”
“It typically takes a significant amount of planning, investment, alignment across the business, and a bit of certainly pioneering and going into the unknown. And so, from a leadership perspective, you really need to upfront get down your mission and clarify how you’re going to get there, and have some type of framework and great program approach to have checkpoints and to iterate along the way,” said Shepperd. “Like any major complex software development, project or migration, you really need to iterate and learn as you go and just expect there to be surprises — don’t plan for the happy path, plan for there to be surprises.”
More specific to New Relic’s needs, Shepperd described how, when New Relic began planning its migration over a year ago, it was apparent that the company had to more than double its capacity when making the shift. At the time, New Relic had only a single Kafka cluster, which was “already pushing the limits of running as a large cluster,” said Shepperd. One of the key changes that New Relic then made was to implement a cellular architecture for its cloud deployment, instead of relying on “one giant Kafka cluster,” she said.
Shepperd described a cellular architecture as a type of computer architecture that’s prominent in parallel computing, one “based on the idea that massive scale requires power parallelization, and requires components to be isolated from each other to enable that parallelization.”
“It’s almost like the ‘Matrix,’ right? Because in our architecture, we deploy Kubernetes to manage all the components within that architecture,” said Shepperd. “So, you’ve got containers within cells.”
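To make the idea of isolated cells concrete, here is a minimal sketch in Python. It is not New Relic’s implementation: the `Cell` class, the `pick_cell` function, the account IDs and the hash-based routing policy are all assumptions made for illustration. It only shows the general shape of the pattern, in which each cell holds its own state and a stable rule decides which cell receives a given account’s data.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Cell:
    """One isolated unit of capacity (hypothetical model, not New Relic's)."""
    name: str
    records: list = field(default_factory=list)

    def ingest(self, account_id: str, payload: dict) -> None:
        # Each cell keeps its own state; nothing is shared across cells.
        self.records.append((account_id, payload))


def pick_cell(account_id: str, cells: list) -> Cell:
    """Map an account to one cell with a stable hash (an assumed policy)."""
    digest = hashlib.sha256(account_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(cells)
    return cells[index]


# Ten cells, matching the count mentioned in the article; IDs are made up.
cells = [Cell(f"cell-{i}") for i in range(10)]
for account in ("acct-1", "acct-2", "acct-1"):
    pick_cell(account, cells).ingest(account, {"metric": "cpu", "value": 0.5})
```

Because the hash is stable, the same account always lands in the same cell, so a problem in one cell stays confined to the accounts routed there.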
One of the main lessons learned was how to “observe what’s happening across multiple cells,” said Shepperd. New Relic originally had a single cell deployed, a number that has since grown to 10 and is expected to surpass 100 this year. “How do you observe cross-platform health? How do you identify the health of the cluster as a whole?” said Shepperd. “So those are some of the challenges in observing and understanding what’s going on in each cell and understanding what that means in total.”
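The question of what per-cell health “means in total” can be sketched as a simple rollup. The example below is illustrative only: the `CellHealth` fields, the thresholds and the sample numbers are assumptions, not metrics or values reported by New Relic. It classifies each cell on its own and then summarizes the platform as a whole.

```python
from dataclasses import dataclass


@dataclass
class CellHealth:
    name: str
    error_rate: float   # fraction of failed requests, 0.0 to 1.0
    lag_seconds: float  # how far ingest is behind real time


def cell_status(cell, max_error_rate=0.01, max_lag_seconds=60.0):
    """Classify a single cell against assumed thresholds."""
    if cell.error_rate > max_error_rate or cell.lag_seconds > max_lag_seconds:
        return "unhealthy"
    return "healthy"


def platform_summary(cells):
    """Roll per-cell status up into one platform-wide view."""
    unhealthy = [c.name for c in cells if cell_status(c) == "unhealthy"]
    healthy_count = len(cells) - len(unhealthy)
    return {
        "total_cells": len(cells),
        "unhealthy_cells": unhealthy,
        "healthy_fraction": healthy_count / len(cells) if cells else 1.0,
    }


# Made-up sample data for four cells.
sample = [
    CellHealth("cell-0", error_rate=0.001, lag_seconds=5.0),
    CellHealth("cell-1", error_rate=0.050, lag_seconds=4.0),
    CellHealth("cell-2", error_rate=0.002, lag_seconds=120.0),
    CellHealth("cell-3", error_rate=0.000, lag_seconds=1.0),
]
summary = platform_summary(sample)
```

Keeping the names of the unhealthy cells alongside the overall fraction lets an operator move from the platform-wide view straight to the cells that need attention.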