# JWST: Day 2 Operations of the Most Expensive SRE Project
We are more than two months in, and the results from the James Webb Space Telescope (JWST) are priceless. OK, $10 billion is a hefty price tag, but the site reliability engineering (SRE) lessons — often of what not to do — continue to float in from a million miles away.
After the response to The New Stack’s article on overcoming (or at least planning for) the 344 single points of failure NASA faced with JWST, we caught up with Robert Barron, SRE architect at IBM, to learn how the telescope project is going now that it’s onto Day 2 operations.
JWST’s Bumps Along the Way
So, how is the most advanced, most expensive telescope that has zero redundancy or repairability doing? “Spectacularly well,” says Barron. “All of the results that are coming in have been up to spec or beyond specification,” by both scientific and engineering measurements.
One of the key telemetry readings for any telescope is onboard fuel, because when the fuel runs out, the telescope becomes space junk. For the JWST, the design parameter for fuel was five years, with a hope of lasting 10.
“What happened was, the launch was precise enough that they saved a lot of fuel en route,” Barron said, meaning NASA didn’t have to make any corrections. “Now they are thinking 10 years is going to be the baseline, and then maybe 15 to 20 years. Nearly doubled its life, just because the launch was very, very precise, more precise than they had planned for.”
Barron likened this to Star Trek’s Scotty always saying the Starship Enterprise couldn’t take any more wear and tear, but Captain Kirk demanding a miracle anyway.
“When engineers do this, they always keep a secret buffer … so if something unexpected happens, you’ve got cover,” he said. “If your manager says five years, you add a buffer for at least 12 years. To be precise within 100 meters, design for within 50 meters.
“So, even if you make a mistake, you’ll still be within your parameters.”
This isn’t unheard of at NASA. The Voyager I space probe launched in September 1977 with enough fuel for five to seven years — enough to reach Jupiter and Saturn. It’s still going 45 years later, having gone on to Uranus, Neptune and out into interstellar space.
Since the JWST gets its power from solar panels, it has an edge, but there are other obstacles in the way of the universe’s largest telescope (that we know of) — like meteors. NASA knew the telescope’s solar shields would be hit eventually.
For thousands of years, Barron said, scientists have been studying the size, shape and regularity of meteor showers to establish patterns. This allows NASA to predict the frequency and magnitude of hits, which were factored into the size of JWST’s 69-feet-by-46-feet sunshield.
Scientists can turn off cameras and change the direction the telescope faces to mitigate risk when they know it’s going to be orbiting in an area with more showers from one direction.
Back in May, when it was still being set up, the JWST got hit by a meteor double the size that NASA engineers had accounted for. It knocked out one of the 18 mirrors that make up the telescope’s iconic honeycomb shape. The hit was larger than the NASA team had predicted, but still within the parameters of their emergency buffer.
This collision seems to have reduced the accuracy of some data collected, but, even with this small setback, the overall results from the telescope so far have gone beyond any expectations.
“From what I can understand, the preparations have paid off. Discovery after discovery. Spectacular picture after spectacular picture. And very few issues,” Barron observed.
All kinds of space probes also have a “safe mode,” which basically stops operations, followed by a lot of testing. The Hubble Telescope famously spent a lot of its Day One/Day Two in safe mode, Barron said, as the constant change of temperatures, as it moved from sun to shade, saw metal flexing and other errors.
“As far as I know Webb has not gone into safe mode except to test recovery,” Barron said. “They spent so much time testing, so much time designing reliability into the system, and it’s actually working.”
Graceful Degradation of Service
The JWST’s latest issue, which was detected in late August and reported in late September, is that one of the adjustment wheels in one camera mode is a bit “sticky” and reports more friction than expected during operations.
How much of a problem is this? In a follow-up interview, Barron reminded The New Stack that “SREs differentiate between different kinds of monitoring or observability signals.”
The four golden signals of latency, traffic, errors and saturation are typically deemed the most important SRE considerations. These metrics represent both current problems, such as something working slowly (latency) or not working at all (errors), and potential problems, such as how business-critical something is (traffic) and how close to capacity something is (saturation).
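As a rough illustration of that split between current and potential problems, here is a minimal Python sketch. The `Signal` enum, the two groupings and the `triage` function are hypothetical names invented for this example, and the grouping simply mirrors the description above; it is not taken from any NASA or IBM tooling.

```python
from enum import Enum


class Signal(Enum):
    """The four golden signals (names chosen for this sketch)."""
    LATENCY = "latency"
    TRAFFIC = "traffic"
    ERRORS = "errors"
    SATURATION = "saturation"


# Assumption: grouping follows the article's description, not a formal standard.
CURRENT_PROBLEMS = {Signal.LATENCY, Signal.ERRORS}
POTENTIAL_PROBLEMS = {Signal.TRAFFIC, Signal.SATURATION}


def triage(breached: set) -> str:
    """Return a coarse response for the set of signals over their thresholds."""
    if breached & CURRENT_PROBLEMS:
        return "act now"       # users are already affected
    if breached & POTENTIAL_PROBLEMS:
        return "investigate"   # nothing is broken yet, but headroom is shrinking
    return "ok"
```

Under these assumptions, a breach of saturation alone calls for investigation rather than an emergency response, while any latency or error breach takes priority.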
“Webb is looking at things much further away but we are using the accumulated knowledge of the last centuries to know when and where.”
— Robert Barron, SRE architect, IBM
JWST’s headline-grabbing sticky wheel, Barron said, is a “saturation kind of signal, which means that Webb is working harder than expected to adjust the camera. Once the camera is adjusted the images are fine.”
While there is no problem right now, he continued, “NASA’s SREs are cautious and have stopped using this specific configuration — and are spending time investigating why more friction than expected is needed to adjust the camera.” The telescope is operating as usual, with its 16 other cameras and scientific instruments.
This follows the earthbound SRE concept of graceful degradation of service where, instead of risking an uncontrolled failure, you purposefully deactivate a component and offer a more limited replacement, so you can perform deeper investigations without risking any unexpected, client-facing problems.
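A minimal Python sketch of that idea follows. The `Instrument` class and the mode names are hypothetical and exist only for illustration; the point is that one suspect mode is deliberately taken out of service while everything else keeps working.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Instrument:
    """Hypothetical instrument with several observing modes."""
    modes: set[str]
    disabled: set[str] = field(default_factory=set)

    def disable(self, mode: str) -> None:
        # Deliberately take one mode out of service so it can be investigated.
        if mode in self.modes:
            self.disabled.add(mode)

    def available_modes(self) -> set[str]:
        return self.modes - self.disabled

    def observe(self, mode: str) -> str:
        if mode in self.available_modes():
            return f"observing in {mode}"
        # Controlled, explicit refusal instead of an uncontrolled failure.
        return f"{mode} unavailable; try one of {sorted(self.available_modes())}"


# Usage: disable the suspect mode, keep serving the rest.
instrument = Instrument(modes={"imaging", "spectroscopy_a", "spectroscopy_b"})
instrument.disable("spectroscopy_a")
print(instrument.observe("imaging"))
print(instrument.observe("spectroscopy_a"))
```

The design choice worth noting is that the degraded path is an ordinary, tested code path: callers get a clear answer and a list of alternatives, not an exception from deep inside the system.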
By contrast, when the Hubble Space Telescope launched in 1990, its main mirror had been ground to the wrong shape, compromising the resulting images from Day One. NASA’s team had to do a full degradation of service until astronauts could fly up to do on-site repairs.
Unlike with Hubble, no human will be able to fly out to fix the one-shot JWST. Therefore, Barron said, “Webb’s repairs will have to be remote, which is why its graceful degradation of service is so vital for continued successful operations.”
Why the JWST Is Bad Business
Because the JWST is so unique and cutting-edge, not all of its SRE lessons will apply to earthbound engineers tasked with keeping an enterprise running smoothly.
“It is a technological marvel,” Barron acknowledged. “But, from a business perspective, it’s absolutely the stuff you shouldn’t do — all your eggs in one basket.”
JWST wouldn’t work as a business project: It was over budget and years late. It would be much better to do an agile, fail-fast solution, Barron said.
“If the SRE team wants to get to perfection, to the reliability of Webb, the application will never go into production because the expectations of the system change so fast in the real world,” he said. “If you aim for perfection, you’ll just be too slow.”
Even this type of long-term commitment is only possible because space telescopes aren’t part of any space race. But there is a striking contrast between NASA’s still-not-launched, Moon-orbiting Artemis I Space Launch System (SLS) and its commercial competitors SpaceX, Blue Origin and Virgin Galactic.
Indeed, the race to the Moon of the ’60s and ’70s was a competition of steps: “The first to do this, first to do that. And a lot of these ‘I’m the first to do’ can be quite meaningless in the long run,” Barron said. By comparison, the current commercial space race is “absolutely going the agile way.”
He pointed to Falcon 9, SpaceX’s attempt at a reusable rocket — which, at the time of publication, had gone through 176 launches.
“Each rocket must have a number of versions of the Falcon Starship being developed,” Barron said. Elon Musk “is not saying, ‘We’re going to launch Starship.’ [He’s saying,] ‘We will do many along the way.’” Thus far, SpaceX has seen 42 Falcon launches.
Musk, Barron said, is communicating that there will be 20 different milestones to meet before SpaceX can get to the Moon, and another 40 before it can take humans there: “Musk has limitless money, but also it’s low-risk because he’s not beholden to anyone.”
Barron also observed that Musk is managing expectations and pitching failure as a positive learning opportunity.
In contrast, NASA’s Artemis, as a purely scientific endeavor, will likely only have a single launch, which has been postponed several times at this writing. Artemis, Barron said, is very similar to JWST in that it’s a monolithic solution produced through Waterfall-style project management, and one that’s very late and very over budget.
“NASA, the way they’re doing it is, we’re going to launch SLS and it’s going to succeed, and then the next time we’re going to the Moon,” he said. This monolith cannot fail because “a failure on NASA’s part is you wasted billions of dollars.”
The cost of a monolithic scientific advancement is simply greater than commercial success, because, as Barron put it, “building half a Webb would’ve given us a 10th of the value.”
“An SRE does have a responsibility for the business that runs on that platform, a sense of pride and ownership. Not just for the devs that are writing the code but the people who are running it. And reliability.”
— Robert Barron, SRE architect, IBM
But JWST is not the SRE way. Musk’s way is. “To fail fast. To do experiments. Not to be afraid if the experiment fails because it won’t be as expensive,” Barron said. Even if some SpaceX launches fail, the successes far outweigh the failures.
This makes NASA significantly more cautious than SpaceX. Back in December, the launch of the JWST was delayed because one of three redundancies in the ground system had failed.
But just a few weeks before, NASA launched a comparatively more affordable communication satellite with the same rate of failure. Said Barron: “Some manager and SRE signed off: ‘I’m aware of the risk. It’s a calculated risk. I’m willing to launch. It’s not worth the effort to halt.’”
Both the JWST and Artemis are not taking any chances. This past August, the latter’s first launch was postponed because it also had a problem with redundancy, this time with sensors. The Artemis team could’ve taken the risk and it probably would’ve been fine, but that’s not the NASA way.
On the other hand, Barron said, “SREs never want to be in that position where you can’t take the risk. You always want to manage the risk. And if this doesn’t happen, what is my fallback? And it should never be that everything fails and the rocket explodes.”
Chaos Engineering Grounded in Modularity
Both architecture and experiments must be modular enough to allow for individual pieces to fail, without taking down the whole system. “I can’t inject something that I know will cause so much damage,” Barron said. “I need to be as small as possible so I can manage it all the time.”
If you had two of these telescopes, you could test in production, but all 344 single points of failure had to be tested well before launch. Typically, SREs at least have virtual environments and duplicate systems to continue their pseudo-chaos engineering.
The JWST’s on-the-ground, pre-launch chaos engineering relied on the infrastructure being truly componentized, stretching and testing each piece to its limits. Barron told The New Stack about how NASA faced failure early on when trying to open the foil sunshields: “It’s not good that they tore, but on the other hand it’s fantastic that they tore on the ground so they could re-engineer them.”
“You need to be able to fail. The system as a whole should succeed, but every component should be able to fail. Part of that is that you should be able to manage the failure if something bigger happens. That’s chaos engineering.”
— Robert Barron, SRE architect, IBM
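One way to picture that principle is a fault-injection experiment that fails exactly one component at a time and checks whether the system as a whole still succeeds. The Python sketch below is a toy model under stated assumptions: the component names and the `fallbacks` map are hypothetical, and a real chaos experiment would inject faults into running services instead of a dictionary.

```python
from __future__ import annotations


def system_ok(failed: set[str], fallbacks: dict[str, bool]) -> bool:
    """The system survives if every failed component has a working fallback."""
    return all(fallbacks[name] for name in failed)


def run_experiments(fallbacks: dict[str, bool]) -> dict[str, bool]:
    """Fail one component at a time (smallest blast radius) and record the outcome."""
    return {name: system_ok({name}, fallbacks) for name in fallbacks}


# Hypothetical system: two components have fallbacks, one does not.
fallbacks = {"cache": True, "queue": True, "database": False}
results = run_experiments(fallbacks)

# Components whose individual failure takes the whole system down.
single_points_of_failure = [name for name, ok in results.items() if not ok]
print(single_points_of_failure)
```

Keeping each experiment to a single component is what makes the failure manageable: the outcome points directly at the piece that lacks a fallback.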
In order to provide redundancy across the red planet, there are two NASA Mars rovers. When something fails in one, NASA has the option to replicate the failure on the other in a more controlled environment to try and solve it.
“Ten years without a tune-up — pretty good for any vehicle!” Barron said, reflecting on the rovers. “And that’s another big difference between NASA engineers and earthbound SREs — getting to that level of reliability is almost always a waste of time.”
Indeed, in pursuit of five nines of availability, that last 0.0001% will always be the most expensive. “Better to accept that there are a handful of manual tasks that you’ll never automate completely than to break your budget and miss your target dates,” he said.
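The cost of each extra nine is easier to see as a yearly downtime budget. The short Python sketch below does that arithmetic; the function name is made up for this example, and it assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability_pct: float) -> float:
    """Allowed downtime per year for a given availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


for pct in (99.0, 99.9, 99.99, 99.999):
    budget = downtime_minutes_per_year(pct)
    print(f"{pct}% availability allows about {budget:.1f} minutes of downtime per year")
```

Each additional nine cuts the budget by a factor of 10: three nines allows roughly 8.8 hours of downtime a year, while five nines leaves only a little over five minutes.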
SREs Think Business First
Barron applies his lessons from space to his role as an SRE architect for IBM, working on the hybrid cloud platform in the 1,200-person office of the company’s chief information officer.
The office works to optimize IBM’s internal cloud and internal development systems and processes for 250,000 IBM users in over 170 countries. Hybrid, in this case, involves systems written in the 1970s on old mainframes, all the way up to modernized and new platforms, powered by Kubernetes, Red Hat OpenShift, and artificial intelligence.
“In the last two years, we’ve reduced our legacy data center footprint by nearly half and modernized more than 1,900 applications,” Barron said.
He added, “We’re in the process of migrating our internal IT workloads from 74 different legacy data centers around the world into a set of public and private cloud environments with high power usage effectiveness.”
Barron’s goal is to help people become better SREs and to develop in an SRE manner — failing fast, with collaboration, reliability and redundancy, through coaching, design work and code reviews, as well as developing architecture for new solutions.
That includes leading the transition to ChatOps, which aims to have all technical work done from Slack in a centralized environment, instead of at 10 different terminals. This, Barron said, allows transparency between the technical work and the business people, with over 300,000 Slack users. “If you can see them in the same Slack channel, you can work together.”
As we head to the universe and beyond, centralized communication and transparency are key.
Robert Barron will be giving a talk, “Over $9 Billion of SRE Lessons,” at SREcon22 EMEA, October 25-27 in Amsterdam or online.