The Role of the Embedded Tester on DevOps Development Teams
As DevOps strives for continuous delivery and integration, continuous testing becomes a bigger part of the whole transformation toward rapid stability. Before the agile-driven practices of DevOps and extreme programming, testing teams used to receive the code at the end of the development process, tossed over the proverbial wall. Now, developers are not only performing their own tests, but testers are also being included on development and infrastructure teams.
The end goal is no longer for testers to check code after it’s written, but rather to play a more strategic role in an organization looking for speed through automation and responsiveness to production data.
Today we talk with two of these embedded testers to learn how the role works at two digitally driven companies of very different sizes, Facebook and Moo, as they look to strengthen software in a way they consider distinct from the popular practice of site reliability engineering.
The Option of an Embedded Test Engineer at Moo
Abby Bangser joined online print and design company Moo and its 100-person tech department about two months ago as a senior test engineer on the platform team, an unusual role unto itself. Moo embraces the DevOps ethos of “product not project,” with a test engineer, a product owner, and an agile coach active on each tech team.
Bangser told The New Stack that test engineers are embedded within tech teams in order to break down silos. She was quick to say that her job isn’t about testing for the developers, but rather to help the team identify what quality means for each service, by identifying suitable requirements and leveraging testing tools to support them.
“Our engineers are in charge of leveraging tools [for] their own monitoring, writing their own service-level alerts, running their own pipelines including through to production deployment and monitoring,” she said.
Bangser sees her role on the platform team, which she says combines platform engineering, operations, and testing, as a good alternative to a site reliability engineer on smaller development teams. She says the role focuses on infrastructure and provisioning, shared resources, and the observability of these services.
She said, “Coming from a software development testing background, I’m a bit of a different profile than most platform team members. It’s a profile that’s going to benefit from being exposed more to deployment and infrastructure work. I believe that software testers can [bring the] ability to identify the use case and the value of the feature.”
“Software testers identify the use case and value of a feature.”
— Abby Bangser, MOO
Bangser added that her role involves a lot of requirement and user story analysis, which she says also helps support developer experience.
“You focus on user experience. I hear devs talk about all the time that ‘they don’t need UX because it’s an API.’ But if you are building software, you have users, and as a platform team we absolutely are aware of our users — and they [the users] are the dev teams,” she said.
She says the test engineer role helps the whole team focus on the impact of any change on the end users, which means making sure documentation, training, feature prioritization, and update communication all stay front of mind.
What if Not Everyone’s On-Call?
The New Stack has talked a lot about on-call rotations as an important part of a zero-downtime DevOps culture. But what happens if not everybody is on call? At Moo, software engineers aren’t, at least not yet. Right now, outside of events like Black Friday and Cyber Monday sales, only the folks who work in platform and operations are on call. However, everyone on the engineering team needs to understand the thought process behind on-call escalation, because software engineers are often the ones raising alerts.
“Since they aren’t the ones that will be woken up, they don’t necessarily have experience in setting the threshold,” Bangser explained.
MOO “soak tests” incident escalation so everyone understands what should go to on-call.
She says that Moo runs its own sort of soak test (running a test for a longer probationary period to validate stability) on any alert that a software team wants to put into the PagerDuty incident management tool.
“We leave that alert running in the Slack channel of the team who created the alert for a certain period of time, which allows them to tune it to the right levels, which, in the case of an event, we should take action on.”
Bangser added that they do this fine-tuning and coaching in collaboration with the on-call support.
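To make the idea concrete, here is a minimal sketch of how an alert could be routed during a soak period. It is not Moo’s implementation: the `Alert` class, the `route` function, the channel name, and the 14-day soak length are all hypothetical, and promoting an alert purely on elapsed time is an assumption. The article suggests the real decision also involves tuning with the on-call staff.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Alert:
    name: str
    team_channel: str  # hypothetical chat channel of the team that owns the alert
    created_at: datetime
    # Assumed soak length; the article does not say how long Moo waits.
    soak_period: timedelta = timedelta(days=14)


def route(alert: Alert, now: datetime) -> str:
    """Send a firing alert to the owning team while it soaks, to on-call afterwards."""
    if now - alert.created_at < alert.soak_period:
        return f"chat:{alert.team_channel}"
    return "pager:on-call"


created = datetime(2024, 1, 1)
alert = Alert("checkout-latency-high", "#team-checkout", created)

print(route(alert, created + timedelta(days=3)))   # still soaking
print(route(alert, created + timedelta(days=30)))  # soak period over
```

While the alert soaks, a noisy threshold only bothers the team that wrote it, which gives them room to tune it before anyone is woken up.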
The DevOps Infrastructure for Facebook’s Heavy Loads
Evan Snyder has been a production engineer at Facebook for four years now, the last two of which have been spent in the testing infrastructure group. She says a production engineer is sort of a DevOps role, like a site reliability engineer, but, like test engineers at Moo, is more integrated with the dev team, even writing roadmaps together.
While the developers are responsible for writing their own tests, Snyder told The New Stack that “Teams don’t have to reinvent the wheel to run and track their tests at the right times. The Testing Infrastructure team builds shared systems for test selection, execution, triage, and reporting to support the full lifecycle of tests once they are written.”
Snyder said, “A DevOps approach allows us to ship high-quality code, so running the right tests at the right time is essential as it helps us ensure things are safe before we deploy. We have a large suite of tests that make realistic requests [to] our applications, verifying that functionalities behave as expected.”
Facebook uses a mix of continuous unit, integration and end-to-end testing.
At Facebook, each product team has three types of tests that are running continuously on any change that will be committed to Facebook’s master repository:
- Unit tests: very small, targeted tests that are cheap to run, so teams write a lot of them early on
- Integration tests: tests that cover larger chunks of dependent code
- End-to-end tests: tests that are more expensive and take longer to run
Snyder said this is all done with the goal of signaling any anomalies as early as possible to developers:
“Either your test is broken by this change or something is broken in the master [repo] so please come and see what’s going on.”
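One way to picture early signaling is a runner that works through the tiers from cheapest to most expensive and reports the first failure it finds. The sketch below is illustrative only and does not describe Facebook’s tooling: `run_tiers`, the tier names, and the toy tests are hypothetical, and running the tiers strictly in order is an assumption, since the article only says all three run continuously.

```python
from typing import Callable, Dict, List, Optional, Tuple

Test = Callable[[], bool]  # a test returns True when it passes

# Assumed ordering: cheapest tier first, so failures surface as early as possible.
TIER_ORDER = ["unit", "integration", "end_to_end"]


def run_tiers(suites: Dict[str, List[Tuple[str, Test]]]) -> Optional[Tuple[str, str]]:
    """Run tiers cheapest-first; return (tier, test name) of the first failure, or None."""
    for tier in TIER_ORDER:
        for name, test in suites.get(tier, []):
            if not test():
                return tier, name
    return None


suites = {
    "unit": [("adds_numbers", lambda: 1 + 1 == 2)],
    "integration": [("parses_then_sums", lambda: sum(map(int, "1,2".split(","))) == 3)],
    "end_to_end": [("placeholder_flow", lambda: False)],  # deliberately failing toy test
}

failure = run_tiers(suites)
print(failure)
```

A failing unit test would stop the run before any end-to-end test is started, so the developer hears about the problem without paying for the expensive tiers.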
Unsurprisingly, the Facebook suite of apps, including Messenger, WhatsApp, and Instagram, has a heavier load to bear than almost any service out there, which means any sort of break can impact a large number of people, and therefore necessitates test coverage at scale.
Automated Testing for the Plethora of Devices
Because Facebook and its other apps are so widespread, the company has a greater need than most to test across environments, such as the different versions of an app for specific operating systems: iOS, Android, and KaiOS. This is why Facebook created a resource pool for tests.
One World is an internal tool at Facebook providing libraries and infrastructure to expose runtimes to services and engineers, through a common API. “Productionized” examples of runtimes include emulators, simulators, web browsers, or devices. To illustrate the scale One World operates at, Facebook runs millions of tests daily through these existing deployments.
One World is used when a product developer would like to test the build of an Android app but doesn’t want to set up a complicated Android emulator, or it can be used when someone wants to try something on Windows but doesn’t have a Windows machine, so they can use a virtual machine to test. Facebook even has a physical lab for testing devices like mobile phones.
Everything they are working with needs to focus on answering:
- How can I set it up to be ready to run a test?
- How can that test communicate with the client?
- How can I clean it all up afterwards to be sure the health of the resource is good?
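Those three questions map naturally onto a set-up, use, and clean-up lifecycle. The following is a rough sketch of that pattern using a Python context manager. It is not the One World API: `FakeEmulator`, `leased`, and the command string are hypothetical stand-ins invented for illustration.

```python
from contextlib import contextmanager
from typing import Iterator, List


class FakeEmulator:
    """Stand-in for a runtime such as an emulator; not a real One World class."""

    def __init__(self) -> None:
        self.log: List[str] = []

    def set_up(self) -> None:
        # Question 1: get the resource ready to run a test.
        self.log.append("set_up")

    def send(self, command: str) -> str:
        # Question 2: let the test communicate with the client.
        self.log.append(f"send:{command}")
        return f"ok:{command}"

    def clean_up(self) -> None:
        # Question 3: restore the resource so the next test gets a healthy one.
        self.log.append("clean_up")


@contextmanager
def leased(resource: FakeEmulator) -> Iterator[FakeEmulator]:
    """Set the resource up, hand it to the test, and always clean it up afterwards."""
    resource.set_up()
    try:
        yield resource
    finally:
        resource.clean_up()


emulator = FakeEmulator()
with leased(emulator) as client:
    reply = client.send("tap login_button")

print(reply)
print(emulator.log)
```

Putting the clean-up in a `finally` clause means it runs even when the test raises, which is the point of the third question: a crashed test should not leave a broken device in the pool.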
One World includes hardware management, a mobile device lab, performance testing, functionality testing, and emulated correctness testing — does clicking this do that? Snyder says that performance testing has to happen on the actual hardware, which makes it more challenging.
“The goal of the production engineering team at Facebook is resource efficiency and rational testing strategy.” — Evan Snyder, Facebook
Developers learn from production engineers how to evaluate the trade-offs and the cost — time and capacity resources — of executing and running tests.
Snyder said that “A goal of all of our teams is to use resources as efficiently as possible — hardware and computing. If you are executing every single test on a bunch of servers, if there are a lot of pull requests coming at once, can we batch them up together?”
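As a rough illustration of the batching idea Snyder raises, the sketch below tests a group of pending changes together and only splits the group when the combined run fails. This is not Facebook’s system: `find_bad_changes`, the `pr-N` names, and the fake `passes` check are hypothetical, and the approach assumes a batch fails exactly when it contains at least one bad change.

```python
from typing import Callable, List, Sequence


def find_bad_changes(
    changes: Sequence[str],
    passes: Callable[[Sequence[str]], bool],
) -> List[str]:
    """Test changes as one batch; on failure, split in half and recurse to find culprits."""
    if not changes or passes(changes):
        return []
    if len(changes) == 1:
        return [changes[0]]
    mid = len(changes) // 2
    return find_bad_changes(changes[:mid], passes) + find_bad_changes(changes[mid:], passes)


run_log: List[int] = []  # size of every batch that was actually tested


def passes(batch: Sequence[str]) -> bool:
    """Fake test run: pretend that only the hypothetical change pr-5 breaks the build."""
    run_log.append(len(batch))
    return "pr-5" not in batch


changes = [f"pr-{i}" for i in range(1, 9)]
bad = find_bad_changes(changes, passes)

print(bad)
print(len(run_log), "test runs for", len(changes), "changes")
```

The saving is largest when most changes are good: a clean batch of any size costs a single run, while a batch with one bad change costs a handful of extra runs to isolate it.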
Facebook is working to “do more with less at every layer of testing. Infinitely running tests won’t scale forever.”
PagerDuty is a sponsor of The New Stack.
Feature image via Pixabay.