Beyond ChatGPT: Tailored AI in Test Automation

While solutions like ChatGPT can do cool things in the realm of software testing, they’re not the right tools for serious test authoring and maintenance.
Dec 7th, 2023 10:00am by
Image by Esther Merbt from Pixabay.

It’s safe to say that, by this point, virtually every developer and Quality Assurance (QA) engineer who has an internet connection has experimented with generative AI technology like ChatGPT.

It’s probably also safe to say that most of these folks have been impressed by what generative AI software can do in the realm of code generation and software testing.

ChatGPT and similar services are impressively adept at producing test cases, for example, using virtually any language or automation framework that you ask them to work with.

But does that mean it’s time to surrender to the robots by handing responsibility for software testing to generative AI tools?

I’m here to tell you it’s not.

While solutions like ChatGPT can do cool things in the realm of software testing, they’re not the right tools for serious test authoring and maintenance.

Allow me to explain by discussing where generative AI excels in the context of software testing, then walking through reasons why you typically shouldn’t use it to write tests.

The Appeal of GenAI for Software Testing

The main capability generative AI tools bring to software testing is that they can automatically produce scripts to execute automated tests.

This is a big deal because the ability to automate tests — as opposed to running them manually, an approach that takes much more time and yields less consistent testing results — is critical for testing at scale, especially for businesses that want to be able to build, test and deliver software updates on a frequent basis.

Yet writing the tests that power automated testing has traditionally been a lot of work.

In fact, test authoring was often the biggest pain point in QA. A recent survey conducted by my company, Kobiton, shows that nearly half of QA teams spend at least nine hours writing a single test case.

Eight percent of organizations spend 40 or more hours on that task. Given that a single application might require dozens or even hundreds of tests, generating the tests to power automated testing can be a monumental task.
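
To put those figures in perspective, the arithmetic can be sketched in a few lines of Python. The nine-hour figure is the survey's lower bound quoted above; the suite sizes of 24 and 100 tests and the 40-hour work week are illustrative assumptions, not survey data.

```python
def authoring_hours(num_tests: int, hours_per_test: float) -> float:
    """Total hours to hand-write a suite, assuming every test costs the same."""
    return num_tests * hours_per_test


# Nine hours per test comes from the survey; the suite sizes are hypothetical.
for suite_size in (24, 100):
    total = authoring_hours(suite_size, 9)
    # Dividing by 40 assumes a 40-hour work week for one engineer.
    print(f"{suite_size} tests x 9 h = {total:.0f} hours (~{total / 40:.1f} work weeks)")
```

Under these assumptions, even a modest suite costs one engineer more than a month of authoring time, which is why automatic generation is so attractive.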

Recent advancements in multimodal models, such as GPT-4's ability to accept images as input, pave the way for generating test scripts directly from app screenshots, a task that might soon take seconds, not days. This capability could change how QA engineers approach test automation using tools like Selenium and TestNG.

While GPT impressively automated basic test cases for Kobiton, its aptitude for handling more intricate, domain-specific user interactions was lacking.

This highlights the crucial role of domain expertise in ensuring comprehensive coverage, and it underscores the limits of current AI in grasping the nuances of complex testing scenarios.

Still, as much as some folks would like to say that GenAI doesn’t reliably produce good code, that usually isn’t the case when it comes to generating automated software tests.

Why ChatGPT Might Not Be the Best Tool for Software Test Generation

But just because ChatGPT and similar tools can save so much time by automatically generating tests doesn’t mean they’re the right solution for every test automation need. On the contrary, if you rely on public generative AI services to produce tests, you face two major risks.

Lack of Domain Expertise

One risk is that, although the test scripts produced by genAI typically execute well, there is no guarantee that they test the right things.

ChatGPT’s limitation is clear: it lacks the domain expertise to discern which specific app features to prioritize for testing, potentially overlooking critical test cases.

The fact that generative AI tools can automatically look at screenshots and identify visual elements is very cool. But because they have no contextual knowledge about what the app does or which features are most important to users, they cannot determine what is most critical to test.

As a result, without human oversight to correct for potential AI bias, you might end up with tests that run well and take you seconds to generate, but that offer little value because they don’t test the right things. In turn, you have to run some tests manually because your automated tests don’t offer adequate coverage.
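
The gap between a test that runs and a test that checks the right thing can be shown with a minimal sketch. Everything here is hypothetical: the `checkout_total` function, the 10% loyalty discount rule and both tests are invented for illustration and do not come from any real app or tool.

```python
def checkout_total(prices, loyalty_member):
    """Correct rule (hypothetical): loyalty members get 10% off the order."""
    subtotal = sum(prices)
    return round(subtotal * 0.9, 2) if loyalty_member else round(subtotal, 2)


def checkout_total_buggy(prices, loyalty_member):
    """Regression: the loyalty discount is silently ignored."""
    return round(sum(prices), 2)


def shallow_test(total_fn):
    """Generic check that needs no knowledge of the business rule."""
    total = total_fn([20.0, 30.0], loyalty_member=True)
    return isinstance(total, float) and total > 0


def domain_test(total_fn):
    """Check written by someone who knows the discount matters to users."""
    return total_fn([20.0, 30.0], loyalty_member=True) == 45.0


for fn in (checkout_total, checkout_total_buggy):
    print(fn.__name__, "shallow:", shallow_test(fn), "domain:", domain_test(fn))
```

The shallow test passes for both implementations, so it would never flag the regression. Only the test that encodes the business rule tells the two apart.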

Lack of Maintainability

It’s easy to ask ChatGPT to generate tests. It’s much harder — and, in many cases, impossible — to ask it to update an existing test due to changes in the app you need to test. A ChatGPT prompt such as the following isn’t likely to get you very far: “Here’s a test you wrote eight months ago. I added a new UI feature to my app and now would like you to update the test.”

You can, of course, simply generate new tests from scratch every time your app changes and you need to update your tests. But the problem there is that you lose test consistency, as well as visibility into the historic state of your tests.

ChatGPT tends to style tests differently each time it produces one; indeed, being able to generate original content in response to similar requests is part of what makes generative AI so powerful in general.

But getting different results for similar queries is a bad thing in the context of software testing, where it’s better to have a standing set of tests that evolve over time, rather than tests that you regenerate from scratch repeatedly.

A Better Approach to Automated Test Generation

The limitations of public generative AI tools for test generation don’t mean that QA teams need to settle for producing tests manually. Instead, they should take advantage of AI tools that were designed specifically for generating tests — as opposed to generic genAI services like ChatGPT.

Tools created specifically for the test automation domain can generate consistent tests. They can also update tests over time, rather than regenerating them for each new application release. In this way, these solutions provide the benefits of fast, low-effort test generation, without the drawbacks of a generic genAI service like ChatGPT.

Conclusion: A Healthy Approach to GenAI for Testing

ChatGPT and similar tools can write tests very quickly, and the quality of the tests is usually surprisingly good. But if you think beyond the challenge of generating tests themselves, you realize that public genAI services fall short. They lack the domain expertise to know what you actually need to test, and they have little ability to update or maintain tests over time in a consistent way.

While today’s generative AI tools like ChatGPT might not dominate software testing, they are stepping stones toward more sophisticated AI applications. I expect that most QA teams will turn to domain-specific tools that leverage AI to generate and maintain tests in ways that a general-purpose tool like ChatGPT will never match.