Elevating the Conversation with LLM Assistants

Here is some CSV data that created a problem for Steampipe’s CSV plugin. Can you see what’s wrong?
Last week several of us failed to see, at first glance, what was wrong. Excel was happy with the data, and so was a CSV validator, but when Steampipe loaded it by way of the CSV plugin the headers got mangled.
1 2 3 4 |
date,notes,category,amt,account,sub_cat,notes 20-Dec-2021,b/f,transfer,"-₹350,012.00",8970,, 20-Dec-2021,b/f,transfer,"-₹213,189.40",2681,, 20-Dec-2021,b/f,transfer,"-₹200,000.00",5595,, |
Recognizing Low-Level Patterns
When I showed the data to SourceGraph Cody and GitHub Copilot Chat, they raised various questions that had also occurred to me, including: Inconsistent quoting? Unicode issue? All were red herrings.
ChatGPT-4 had a better idea.
“The CSV file you provided seems to be structured correctly at a first glance, with values properly separated by commas and text values enclosed in quotation marks as needed.
However, the CSV file has an extra column header “notes” which does not have any corresponding values in the rows. The header row should have a 1:1 correspondence with the data in the rows, so that each piece of data falls under a specific column.”
That still wasn’t quite right. Adding values to the second notes column won’t help in this case. Steampipe has to create a Postgres schema from the header row, and a schema can’t have two columns with the same name. The solution was to remove the duplicate notes column.
When I revised the prompt to include “Why can’t I import it into Postgres?” Cody figured it out.
“The header row should define the schema for the table, with each header representing a column name. Having two columns named “notes” will cause issues when Postgres tries to create the table schema.”
Copilot still didn’t get there. I’m sure there’s a way to write a prompt that would elicit the right answer, and iterating on that prompt is a good way to use an LLM as a rubber duck. We should always try to surface and reflect on the assumptions that blind us to error, and these tools for thought can help us do that.
But as Chris Wiggins reminds us in How Data and AI Happened, what we now call AI is rooted — by way of machine learning — in an older discipline we don’t hear much about nowadays: pattern recognition. That’s something our brains do really well; in Fluid Concepts and Creative Analogies, Douglas Hofstadter places it at the core of human intelligence. Although he’s now terrified by the prospect that humans will soon be eclipsed, I think we can draw a useful distinction between low-level and high-level pattern recognition. We want machines to do more of the former, so we can do more of the latter. The pattern two-columns-with-the-same-name is an easy thing to miss when looking at a sample of problematic data that also includes confounding variables like Unicode characters and possible quoting issues. Once we notice that pattern, though, and knowing we’re in a Postgres context, we can easily spot the problem.
High-Level Patterns
Here’s how Christopher Alexander’s A Pattern Language summarizes a pattern called FOUR-STORY LIMIT: “Therefore, in any urban area, no matter how dense, keep the majority of buildings four stories high or less.”
And here’s how Ward Cunningham’s Portland Pattern Repository summarizes the Singleton Pattern: “Therefore, let the class create and manage the single instance of itself, the Singleton. Wherever in the system you need access to this single instance, query the class.”
In a similar vein, here’s a pattern I’ve found useful in the realm of SQL: “Decompose a complex query into a pipeline of Common Table Expressions (CTEs), each naming a step that’s easy to read, test, and modify.”
I arrived at this pattern when I returned to SQL after a long absence, dug into Postgres, and realized that CTEs — which hadn’t been part of my prior experience — could help me reason about queries that would otherwise seem like opaque black boxes. As I interact with other SQL returnees I find that this effect is often a revelation to them too, and often cite the pattern when we’re trying to debug queries.
It’s been a challenge for me, though, to apply the pattern to a given example. Here’s one that came up in the Steampipe community last week.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 |
-- Find dangling NS records select r.name as hostname, type, record as resource_record, z.title as zone, a.title as account from aws_route53_zone as z, aws_route53_record as r, aws_account as a -- Split multi-record entries left join jsonb_array_elements_text(records) as record on true where r.zone_id = z.id and z.account_id = a.account_id and (type = 'NS') and z.private_zone = false -- Convert records to lowercase and add trailing . if missing and regexp_replace((LOWER(record)), '([^\.])$', '\1.') like any ( -- Limit search to records pointing at zones we can check select concat('%', name) as search from aws_route53_zone ) -- convert records to lowercase and add trailing . if missing and regexp_replace((LOWER(record)), '([^\.])$', '\1.') not in ( select distinct(r.name) from aws_route53_zone as z, aws_route53_record as r where r.zone_id = z.id and r.name is not null ) order by r.name; |
The author of the query, Chris Cooper, asked for feedback. Typically in this situation, I’ll suggest the pattern as a way to make the query easier to read, understand, test, and modify. Also typically, though, it won’t be easy for me to replicate the author’s AWS setup in order to propose a rewrite that I’m confident will work. Often I’ll wind up suggesting a stylistic improvement that I can’t illustrate concretely. But this time, I enlisted the help of my three assistants.
Copilot: Proposed minor changes, but left the structure intact.
Cody: Proposed a CTE pipeline.
ChatGPT-4: Proposed a CTE pipeline.
Here’s the original alongside two CTE-oriented rewrites.
Although Cody’s approach is viable and more concise, I liked how the ChatGPT version breaks out the phases and puts the regex heavy lifting at the beginning of the pipeline. But would it work? I asked Chris Cooper who confirmed that, though he preferred to name things a bit differently, the strategy was sound and “significantly easier to follow.”
That would already have been an impressive result if I had named the pattern and asked the assistants to apply it. But in this case, I didn’t lead the witnesses! They arrived at CTE-based solutions without a CTE-flavored prompt. That feels all the more remarkable because, in my experience, the pattern isn’t the most common or widely acknowledged and so may not be the likeliest prediction. Of course, I could be wrong about that. Maybe it is the likeliest given the corpus of queries that the assistants were trained on. Maybe the refiners of the model influenced the outcome. And maybe in the case of Cody, which sees many of my queries, I was implicitly leading the witness. It’s hard, maybe impossible, to know for sure.
The need for this pattern arises often in the Steampipe community. If I can consistently give query authors working code that illustrates the pattern, will they adopt it with the help of their own assistants? We’ll see how this plays out going forward.
The How and Why of a Lateral Cross Join
A few days later, Chris Cooper came back with a question about another query. This one was already written as a CTE pipeline — his question was about how to feed a parameter into it. But it presented an opportunity to dig into a SQL construct that I’d never used and only vaguely understood: the lateral cross join. At the core of each of his CTE phases was a query like this.
1 2 3 4 5 6 7 8 9 10 11 12 |
SELECT v.ip as ip, 'ec2' as source, title as title, instance_id as id, region as region, account_id FROM aws_ec2_instance CROSS JOIN LATERAL ( VALUES (public_ip_address), (private_ip_address) ) as v(ip) right join target on target.ip = v.ip |
I asked my assistants to explain it, and all three did so very nicely.
I’m an example-driven learner, though, and explanations like these don’t really come alive for me until I can see them in action. So I asked ChatGPT for sample data, which it provided in a very realistic way. That enabled me to write gist that helps me (and perhaps others) visualize how the lateral cross join synthesizes a single ip column from a variety of different flavors of IP address.
Again I asked Chris for feedback, and here was his reply: “It successfully describes the what but not necessarily the why. There are several approaches that all solve the problem, but the reason the cross-join lateral is used here is to perform the expansion while only making a single query to the AWS table. Other methods would require iterating through the table once for each different column that you wanted to expose as IP.”
For good measure, I asked my assistants to show alternative approaches that do require multiple scans of the table. Now I’ve got a good handle on both the how and why of Chris’ query strategy. More importantly, I should be able to use this approach to convert other SQL explanations into examples that I can run and interact with. I’ll codify these as patterns with names like use-lateral-cross-join-to-form-a-column-from-multiple-rows-with-one-table-scan. And then I’ll look for ways to incorporate this kind of pattern language into my dialogues with the LLM assistants.
In The Chess Master and the Computer, Garry Kasparov famously wrote:
“The winner was revealed to be not a grandmaster with a state-of-the-art PC but a pair of amateur American chess players using three computers at the same time. Their skill at manipulating and “coaching” their computers to look very deeply into positions effectively counteracted the superior chess understanding of their grandmaster opponents and the greater computational power of other participants. Weak human + machine + better process was superior to a strong computer alone and, more remarkably, superior to a strong human + machine + inferior process.”
I’m just an average developer, a “weak human” in Kasparov’s formulation. My LLM assistants can handle many simple coding tasks for me, but the most powerful results will come from a “better process” for collaborating with them. They are already better than me at recognizing low-level patterns. I’m hoping we can elevate the conversation to include high-level patterns that we collaboratively propose and discuss.