Engineering Management During Wartime
“The war came to our house very unexpectedly in the morning when we were going to work. In a few weeks, the war became the new normal for us.”
This is the heartbreaking way Oleg Mykolaichenko kicked off his memorable DevOpsDays Ukraine talk. The lifelong Kyiv resident has grown from help desk support to junior DevOps engineer and on through senior DevOps roles. Now he manages a distributed team as the head of infrastructure at large-scale machine learning security company SQUAD, “so I deeply know how incidents look like from different perspectives, companies and businesses,” he said.
Although an invasion isn’t something most engineering contingency plans consider, Mykolaichenko reckons SQUAD was about 80% prepared for this worst incident ever, with careful business continuity planning and incident management. There’s plenty to be learned no matter what your crisis. Resilience is possible with proper planning.
When the Worst Incident Occurs
“There is no time anymore. You can’t change anything. You can’t add more documentation. You can’t build new processes. You can’t add more data into runbooks. You can’t create meetings for knowledge sharing. You can’t test any back-ups. No way, you don’t have time. Because the biggest and the worst incident already happened,” Mykolaichenko said of waking to the sound of Russia bombing his country on February 24, 2022.
About a thousand employees worldwide quickly pivoted and rallied around a single instruction from SQUAD’s director of engineering: “Until you reach a place where you feel safe, don’t worry about work.”
Anticipating the possibility of attack, SQUAD had already provided staff with a list of probable danger zones and company shelters ahead of time. Soon after, the company opened offices in Romania and Poland and gave staff details of how to pay taxes and follow local laws. For those who remained in Ukraine, SQUAD more formally created a huddle of volunteers and worked to fulfill the needs of teammates who’d enlisted in the Ukrainian military.
The SQUAD team wrote an open source bot to ping employees daily to:
- Check if they are safe.
- Check availability to work and commitment level.
- Check internet connection.
This way, they can quickly know, at the individual level, if each employee is OK, and, at the company level, if there is any ability to make engineering progress or not. Even during times of crisis, it’s essential to keep stakeholders and customers as up-to-date as possible.
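The talk does not show the bot's code, so the following is only a minimal Python sketch of how the three daily answers might roll up from the individual level to the company level. The `CheckIn` schema and the `summarize` function are hypothetical names, not part of SQUAD's bot.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class CheckIn:
    """One employee's answers to the daily ping (hypothetical schema)."""
    name: str
    is_safe: bool
    can_work: bool
    has_internet: bool


def summarize(checkins: Iterable[CheckIn]) -> dict:
    """Roll individual answers up into a company-level view."""
    checkins = list(checkins)
    # Assumption: someone can make engineering progress only if they are
    # safe, available to work, and online.
    available = [c.name for c in checkins
                 if c.is_safe and c.can_work and c.has_internet]
    # Anyone who is not safe is surfaced first; work status is secondary.
    needs_help = [c.name for c in checkins if not c.is_safe]
    return {
        "responded": len(checkins),
        "needs_help": needs_help,
        "available": available,
    }
```

Calling `summarize` on a day's responses gives a team lead both lists at once: who needs help right now, and who is in a position to work.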
“Trust is key here. You should never push your engineers because your engineers are under stress and they will do their best when they can keep themselves safe and [help] their families keep to their safe places,” Mykolaichenko said. If something does go wrong during wartime, it’s going to be so much worse than anything that would be in your incident management system.
The role of team lead quickly included taking care of team mental health. This started out by nagging some engineers to focus on personal safety above much simpler routine tasks. “Between two tasks, the human will choose the clearest task.” When it’s unclear, Mykolaichenko says, the engineer will choose work. SQUAD leadership had to better communicate when to go, where to go, and what to bring.
He also felt his role pivoted to helping his team discover ways of “contributing into the big victory, so you must share your vision how every team member can contribute into the big Ukrainian victory.”
This includes contributing a meaningful part of their salaries and volunteering. As SQUAD is a machine learning company, several engineers are working with a gaming company that has a huge number of 3D models of enemy military tanks and equipment to help identify enemy infiltration more quickly.
How to Create a Business Continuity Plan
While priorities may change, stakeholders need to agree on the most important ones right when the incident occurs. SQUAD had to maintain more than 50 critical applications in its data processing pipeline, with petabytes of data.
“Define the most critical parts, so you will know you are always fighting in the right place,” Mykolaichenko said. “You don’t need to waste time on non-critical alerts and noisy warnings.”
Once you know who is dealing with what incidents, continue progress on existing projects, measure available resources, and see if there is a chance to have at least some progress, he recommended.
A team lead is also in charge of ensuring business continuity, which Mykolaichenko says includes:
- Continuously sync with internal and external stakeholders.
- Keep track of metrics, including time to acknowledge and time to resolve.
- Be ready to step in and work hands-on with the most important jobs.
- Create a new formula for estimating and measuring the project through the lens of the new reality.
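As an illustration of the metrics item above, here is a minimal Python sketch that computes mean time to acknowledge and mean time to resolve from incident timestamps. The `Incident` record and both function names are assumptions made for this example; the talk does not describe how SQUAD stores or calculates these numbers.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass(frozen=True)
class Incident:
    """Timestamps for one incident (hypothetical record layout)."""
    opened: datetime
    acknowledged: datetime
    resolved: datetime


def mean_time_to_acknowledge(incidents) -> timedelta:
    """Average delay between an alert firing and a human acknowledging it."""
    seconds = [(i.acknowledged - i.opened).total_seconds() for i in incidents]
    return timedelta(seconds=mean(seconds))  # raises StatisticsError if empty


def mean_time_to_resolve(incidents) -> timedelta:
    """Average delay between an alert firing and the incident being resolved."""
    seconds = [(i.resolved - i.opened).total_seconds() for i in incidents]
    return timedelta(seconds=mean(seconds))
```

Tracking both values over time shows whether a reduced or displaced team is still reacting to incidents at an acceptable pace.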
Document, clarify and share responsibility for all systems, services and processes across at least two team members, Mykolaichenko recommends. This matters even more for team leads, who are the perfect hit-by-a-bus example: train your replacement in your ways and processes.
Once settled into the new work-war balance, SQUAD had to double-check time zones and available working time slots, building a new rotation schedule as Infrastructure as Code with Terraform, which was integrated into the aforementioned status bot.
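SQUAD's schedule was built with Terraform, which the talk does not show. The Python sketch below only illustrates the underlying check, namely who is inside their local working hours at a given moment. The `Engineer` record, its default hours, and the use of fixed UTC offsets (which ignore daylight saving time) are simplifying assumptions for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Engineer:
    """A team member's location and working hours (hypothetical record)."""
    name: str
    utc_offset_hours: int   # fixed offset; a simplification that ignores DST
    work_start: int = 9     # local hour, inclusive
    work_end: int = 18      # local hour, exclusive


def on_call_candidates(team, now_utc: datetime) -> list:
    """Return names of engineers whose local time is within working hours.

    `now_utc` must be a timezone-aware datetime.
    """
    names = []
    for engineer in team:
        local_tz = timezone(timedelta(hours=engineer.utc_offset_hours))
        local = now_utc.astimezone(local_tz)
        if engineer.work_start <= local.hour < engineer.work_end:
            names.append(engineer.name)
    return names
```

Running this check for every hour of the day is one way to spot slots where nobody is available before the rotation goes live.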
“Involving developers into a production environment is always a positive change,” he said, encouraging that not just ops but devs be on-call to spread out responsibility and increase team resiliency.
How to Set up a Disaster Recovery Plan
“You need to know what to do if the worst-case scenario happens,” Mykolaichenko said. A disaster recovery plan shouldn’t be overly detailed but should include all the mission-critical what-ifs (a database goes down or is deleted, the data is corrupted, or a Kubernetes cluster, load balancer, DNS provider or the whole internet disappears), with the required procedure for every production-ready system.
A disaster recovery plan should simply clarify potential issues and the vectors for solving them. For SQUAD this took a couple of meetings to really review and dig into architectural dependencies, working through these steps:
- Identify critical operations. What interruptions would impact your ability to operate?
- Evaluate disaster scenarios. Determine your priorities, recovery objectives and timelines.
- Create a communication plan. Assign roles to specific people and departments.
- Develop a data backup and recovery plan. How to fix, how to contain, and how to monitor for further intrusion.
- Test your plan. Look for gaps and weaknesses, and add any steps needed to increase efficiency.
Runbooks for Alerting
Runbooks are another prerequisite for proper contingency planning, Mykolaichenko said, paired with detailed explanations, including:
- Detailed steps and commands of how to deal with alerts.
- Where to find more details, documentation, data, architecture schemas, and Grafana dashboards.
- How to connect.
- How to debug or fix.
- How to confirm and report something is fixed.
- All runbooks are stored as code.
This enables anyone, anywhere to step up and help deal with any alerts. And then if any incident is at risk of repeating, a ticket, he said, must be created for automating or permanently fixing the issue.
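One possible reading of "runbooks stored as code" is a structured record per alert that can be checked automatically for missing sections. The Python sketch below assumes that reading. The `Runbook` fields mirror the list above, but the class, the field names and `missing_sections` are hypothetical and not taken from SQUAD's repositories.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Runbook:
    """One alert's runbook; field names are hypothetical."""
    alert: str
    steps: tuple = ()          # detailed steps and commands
    references: tuple = ()     # docs, architecture schemas, dashboards
    how_to_connect: str = ""
    how_to_debug: str = ""
    how_to_confirm: str = ""   # how to confirm and report the fix


def missing_sections(runbook: Runbook) -> list:
    """Names of sections left empty, in declaration order."""
    return [f.name for f in fields(runbook) if not getattr(runbook, f.name)]
```

A check like `missing_sections` could run in code review or CI so that an incomplete runbook is caught before the alert it covers ever fires.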
Templatize Your Own Postmortem
When failures happen, postmortems have to follow but, especially in crisis mode, Mykolaichenko recommends keeping them as simple as possible. Create your own postmortem template, he said, specifically recommending against using the ones from “giants” like Google or open source ones, “because usually they are too big and you will waste time pasting all of the required roles and lines to follow the approaches that you probably don’t need. Even in crisis, try to focus on things that matter.”
Focus on issues that directly affect the business, he said, looking to answer measurable questions like:
- How many users were affected?
- Can you quantify the monetary loss?
- Which servers were affected?
Then, Mykolaichenko says, define alert levels to decide how urgent a fix is. Once you adopt this series of postmortem practices, he says, your team will be able to extend, automate and go through postmortems to find more root causes and dig deeper. Returning to the original safety-first directive, he says his team will examine postmortems more thoroughly when the war is over.
How to Build a ‘Self-Driving’ Team
Especially in times of uncertainty, the role of a team lead is to build a self-sufficient team. Your team should know what to do when you’re offline (because it’s more important than ever to disconnect), on medical leave or dealing with another crisis, which includes how to:
- Perform processes without you (sprint kickoff, daily standup, closing implementation pass, reporting).
- Take tickets from the backlog.
- Change priorities when necessary.
- Report to management.
Once you’ve facilitated a self-driving team, he contends, you will be needed for more complex and interesting high-level issues, like helping to grow the next team leaders.
“Building a self-driving team is the most challenging part of a team lead’s responsibilities because there’s always an antipattern that appears in real life,” Mykolaichenko warns. He calls this antipattern: Treating your team like a child.
It’s also important, he said, to design your applications with self-healing approaches to architect for failure in order to minimize issues in user-facing environments.
“Fortunately, the DevOps world created a huge amount of approaches on how to build stable and reliable software,” Mykolaichenko said, including circuit breakers, automatic retries, and Advanced Message Queuing Protocol (AMQP). Every failed request should automatically perform a retry, he said. Then, if the retry fails, you can exponentially increase the delay between retries or send the request to a delayed queue. If this remote queue is unavailable, he said, you can always save to shared storage to try again in a few days.
“Fallbacks can prevent downtime, can prevent losing of data,” he also said. Kubernetes’ ephemeral nature and scheduler are particularly good at killing a failed pod and bringing up a new one.
He also recommends implementing rate limits for all customer-facing endpoints and doing anything you can to remove single points of failure. He continued with further recommendations:
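The retry-then-fallback chain described here can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not SQUAD's implementation: `send_with_fallbacks`, its parameters and the injected `sleep` function are hypothetical, and the delayed queue and shared storage are represented only as plain callables.

```python
import time
from typing import Callable, Sequence


def send_with_fallbacks(
    payload,
    send: Callable,
    fallbacks: Sequence[Callable] = (),
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try `send` with exponential backoff, then each fallback in order.

    Returns the name of the callable that accepted the payload.
    """
    for attempt in range(max_attempts):
        try:
            send(payload)
            return send.__name__
        except Exception:  # real code should catch narrower, retryable errors
            if attempt < max_attempts - 1:
                # Delay doubles after each failure: 0.5s, 1s, 2s, ...
                sleep(base_delay * 2 ** attempt)
    # Primary path exhausted: e.g. a delayed queue first, then shared storage.
    for fallback in fallbacks:
        try:
            fallback(payload)
            return fallback.__name__
        except Exception:
            continue
    raise RuntimeError("all delivery paths failed")
```

Passing `sleep` as a parameter keeps the function testable without real waiting. In production, adding random jitter to each delay is a common refinement so that many clients do not retry in lockstep.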
Plan for failure:
- Identify service limits
- Use self-throttling
- Consider alternative resource types
- Introduce universal instrumentation
- Collect event-centric diagnostics
- Give everyone visibility
- Reroute and unblock
- Automate known solutions
- Notify a human
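Rate limiting and self-throttling, both recommended above, are often implemented with a token bucket. The talk does not name an algorithm, so the Python sketch below is one common choice rather than SQUAD's approach. The `TokenBucket` class and its parameters are illustrative, and this version is not thread-safe.

```python
import time


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` per second."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock          # injectable so tests can control time
        self._tokens = capacity      # start with a full bucket
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and spend tokens if the request fits, else False."""
        now = self._clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

A customer-facing endpoint would typically keep one bucket per client and reject or delay requests when `allow()` returns False. The same class can throttle a service's own outbound calls to a dependency with known service limits.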
“Nothing can be so hard as managing incidents, if you are inside another bigger incident,” Mykolaichenko said, wrapping up his talk with a dose of perspective.
“You always have a chance to make your infrastructure, applications, pipelines, workflow, workloads and all processes as strong as Ukraine now or as Ukrainian organizers who are setting up this conference during the war.”
Directly Support a Ukraine under Siege
In case the recent increase in cross-country bombardments of Ukrainian civilians, like in Vinnytsia, isn’t reminder enough, the Russian invasion of Ukraine has entered its fifth month. Help is still desperately needed in all forms. DevOpsDays Ukraine was a virtual event held this past May that raised more than $100,000 for the following charities:
- Voices of Children — to provide psychological support for child victims of war.
- Monstrov Corporation Foundation — to provide humanitarian and medical aid to those in crisis.
- Insight — to support LGBTQI communities in Ukraine.
- Kolo Foundation — to organize humanitarian aid and coordinate civilian evacuations.
- Razom — to support short- and long-term projects that foster democracy and prosperity in Ukraine.
- Happy Paw — to provide assistance for homeless cats and dogs.
You can support these essential causes directly or by purchasing a ticket to DevOpsDays Ukraine, which will give you access to the 17 sessions from global DevOps leadership who volunteered their time to raise funds for the above charities on the ground in this war zone.
Kyiv-based SQUAD is hiring.