Beware ChatGPT: a Language Model in the Shape of Shakespeare

“ChatGPT is a large language model.” However often those words are repeated, they are mostly ignored, because we would rather believe that true artificial intelligence exists. It would be exciting to think we are at the forefront of friendly droids and safe self-driving cars.
But just because a six-year-old child can repeat things, we are not normally persuaded that they deserve to be listened to. A large language model masters the context, but not the meaning, of the text it represents.
It’s a stochastic parrot. But the performance of these caged birds is astonishing. We can see just how good ChatGPT is at writing ‘x’ in the style of ‘y,’ and how many teachers are complaining about the number of synthetic essays their pupils are handing in. On examination, there is very little accuracy in the text, but the shape is good.
You have probably heard about transformers by now, and that they play a leading role in deep learning models. Understanding language models is still largely an academic pursuit, even though many of the tools are freely available.
I’m just going to look at a very lo-fi way of simulating the basics of exploring text structure: nowhere near the fidelity of current models, but requiring no special knowledge either. So let us start with some Shakespeare.
‘Beware the Ides of March’
This short line has a distinct feel due to its familiarity. The structure isn’t particularly Shakespearean perhaps, and the words are very specific to the play it comes from (Julius Caesar).
Our aim with the short code project below will be to produce some nonsense text that is nevertheless in the shape of Shakespeare, and good enough to show that the trick of extruding structure is not really intelligence.
The word order in the soothsayer’s warning to Caesar might not seem important, but we can fashion a weighting simply by noting word-order statistics. First, we will look at preceding words, so that we can record the words that follow on. We can then play this forward to produce our fabricated Shakespeare.
We will look at two words in order and try to “calculate” the third. This means we want to collect enough examples of which words most often follow on from the previous two. So if we consider “Beware the Ides,” we note that “Ides” follows one word after “the,” and two words after “Beware.” I am arbitrarily using a depth of two words, but we could do the same with three or more. The reason we look at not only the next word, but also the word after that, is to make sure we capture a little of the sentence structure.
This means that as we move forward creating our nonsense text, we can ask “what should be the next word based on previous examples?”
Beware → +1 ‘the’, +2 ‘Ides’
the → +1 ‘Ides’, +2 ‘of’
Ides → +1 ‘of’, +2 ‘March.’
of → +1 ‘March.’
March. → (nothing follows)
Where I write “+1” and “+2” I mean one position ahead or two positions ahead, and I refer to this as “depth” here.
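The bookkeeping behind that table is simple enough to sketch in a few lines. Here is an illustrative version in Python (the project itself is in C#, and the names below are my own invention, not the project’s):

```python
from collections import defaultdict

line = "Beware the Ides of March."
tokens = line.split()  # punctuation stays attached to its word

# table[token] collects (depth, follower) observations
table = defaultdict(list)
for i, token in enumerate(tokens):
    if i + 1 < len(tokens):
        table[token].append((1, tokens[i + 1]))  # depth 1: the very next word
    if i + 2 < len(tokens):
        table[token].append((2, tokens[i + 2]))  # depth 2: the word after that

for token, followers in table.items():
    print(token, followers)
```

Running this reproduces the table above: “Beware” records ‘the’ at depth 1 and ‘Ides’ at depth 2, while “March.” records nothing, since nothing follows it.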
First, Prepare a Corpus
The sonnets are the easier form to manipulate, as they are just a succession of sentences, so I chose to learn from these. I found the first 100 sonnets on the internet and used simple regular expressions to strip out the newline characters and the gaps between numbered sonnets, in order to make a single text file. Shakespeare is not famous for repeating himself; he uses a very wide spectrum of language even in this small corpus. Punctuation is also vital to the meter, so we keep it as part of the structure. We hope enough plagiarism will give us original results.
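As a rough illustration of that cleanup step (the article’s project uses C# and its own expressions, so the patterns here are my assumptions), something like this flattens the numbered sonnets into one stream of text:

```python
import re

# Two sonnet openings with the numbering style typical of online copies
raw = """1

From fairest creatures we desire increase,
That thereby beauty's rose might never die,

2

When forty winters shall besiege thy brow,"""

# Drop the sonnet numbers, which sit on lines of their own
text = re.sub(r"^\s*\d+\s*$", " ", raw, flags=re.MULTILINE)
# Collapse newlines and runs of whitespace into single spaces
text = re.sub(r"\s+", " ", text).strip()
print(text)
```

The result is one long line of tokens, with the punctuation left in place.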
Now, C# isn’t necessarily the right language for text processing, but it is easy to manage once written in Visual Studio. Here is the project. You will see the corpus file that the project consumes.
What Is a Token?
While our food and drink here are words, we deal with them mechanically, with minimal reference to grammar. This is because we want the structure to be moulded by Will, and not so much by what we know about English. So we treat “The” as a different entity from “the” because of where it appears in a sentence. Similarly, we treat “love,” as a token separate from “love” (without the comma), in spite of our sense that the comma is separate from the word. This is why I use the term “token” rather than simply talking about words.
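Under these rules, tokenising is nothing more cunning than splitting on whitespace, so that case and attached punctuation survive. A minimal Python sketch:

```python
line = "The rose looks fair, but fairer we it deem"  # Sonnet 54
tokens = line.split()

print(tokens)
# "The" (capitalised) and "fair," (comma attached) are tokens
# in their own right, distinct from "the" and "fair" elsewhere.
```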
Forward Reference
This is a term used entirely within my very simplistic model: we record the next token at depth 1, and the token after that at depth 2.
Remember, we are not particularly interested in what Shakespeare actually wrote, only in what best matches his structural bias.
When running the code, you can see an example of all the forward references for “The”. Looking at some of them below:
'The' - {bounteous One X 1} {largess Two X 1} {lovely One X 1} {gaze Two X 1} {eyes One X 1} {(fore Two X 1} {world One X 2} {will Two X 2} {age One X 1} {to Two X 1} {perfect One X 1} {ceremony Two X 1} {painful One X 1} {warrior Two X 1} {dear One X 1} {respose Two X 1} {one One X 2} {by Two X 1} {sad One X 1} {account Two X 1} {region One X 1} {of Two X 7} ..
You can read this as
- there is one example of the token “bounteous” following on directly from “The”, i.e. at depth 1. This will only be considered when “The” is the second of the two words we are working from.
- there are two examples of the word “will” following two words after “The”, i.e. at depth 2. This will only be considered when “The” is the first of the two words.
The references are initially recorded in pairs (one depth-1 and one depth-2 entry), but this pairing breaks down once a token repeats. The most common here is “of” at depth 2, with seven occurrences. You can use a little regex to check all of these against the corpus.
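If you want to try that check yourself, here is one way it might look in Python, run over a toy stand-in string rather than the real corpus (the pattern finds “The”, exactly one intervening token, then “of”):

```python
import re

# A toy stand-in for the cleaned corpus
corpus = "The gaze of the world. The age of gold. The one by one."

# 'The', exactly one intervening token, then 'of'
matches = re.findall(r"\bThe\s+\S+\s+of\b", corpus)
print(len(matches), matches)  # 2 ['The gaze of', 'The age of']
```

Run against the real corpus file, the same pattern should turn up the seven depth-2 occurrences of “of” after “The”.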
Here is the code from the project that reads the corpus text, then adds a depth-2 reference for the word two back and a depth-1 reference for the word one back, before shifting along one word and repeating until the text is consumed.
string[]? tokens = BackEnd.FileServices.ReadCorpus();
..
for (int i = 2; i < tokens?.Length; i++)
{
    string currentToken = tokens[i];
    ForwardReference.Add(tokens[i - 2], ForwardReference.Depth.Two, currentToken);
    ForwardReference.Add(tokens[i - 1], ForwardReference.Depth.One, currentToken);
}
Making a Sonnet
We start with any two words that appear in the corpus and let the system select the next word. We then use the new word as the second word and the old second word as the first, and continue in that fashion for as long as we feel able.
In my first run, I added these constraints:
- No repeated words, unless they are two letters or less
- A 5:2 bias towards depth-1 words over depth-2 words
- Stopping after 25 words, or sooner if we run out of rope
The above are controlled by the const values in the code.
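To make the constraints concrete, here is one way the biased draw might be sketched in Python. The 5:2 ratio and the short-word repeat exemption come from the list above, but the function and all its names are my own, not the project’s:

```python
import random

DEPTH_ONE_WEIGHT = 5  # bias towards words seen directly after the previous word
DEPTH_TWO_WEIGHT = 2

def pick_next(candidates, used_words):
    """candidates: list of (token, depth, count) observations."""
    pool = []
    for token, depth, count in candidates:
        # No repeated words, unless they are two letters or less
        if token in used_words and len(token) > 2:
            continue
        weight = count * (DEPTH_ONE_WEIGHT if depth == 1 else DEPTH_TWO_WEIGHT)
        pool.extend([token] * weight)
    return random.choice(pool) if pool else None  # None: we ran out of rope

# After "the world", say the observations were:
candidates = [("of", 2, 7), ("will", 2, 2), ("doth", 1, 1)]
print(pick_next(candidates, used_words={"the", "world"}))
```

When every candidate has already been used, the function returns None, which is the “run out of rope” stopping condition.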
Without good feedback mechanisms, we will not get anywhere close to the ChatGPT texts, but what will we see? Here is the code with the main loop, ready with the first two words:
..
MakeSonnetWord ms = new MakeSonnetWord("A", "rose");
newwords.Add(ms.firstword);
newwords.Add(ms.secondword);
for (short i = 0; i < MAXWORDS; i++)
{
    ForwardReference fr = ms.EvaluteBestReference(newwords);
    newwords.Add(fr.referencetoken);
    ms = new MakeSonnetWord(ms.secondword, fr.referencetoken);
}
Console.Write($"\nNew sonnet!\n\n {string.Join(" ", newwords)}");
Starting with “A rose”:
A rose might never die, But that I am not to the world of thy sweet self dost thou art so fair as a look, of my
This is quite pleasing, although we clearly fall into a whirlpool of nonsense. Let’s now add some noise in the form of random numbers, so that we get different results on repeated runs:
A rose in the world of thy this self to my love to his And you in their of my But I am thee for
This is also quite nice, but it clearly needs more adjustment, otherwise it will unerringly collapse. Adjustments can be made anywhere a const is defined. Actual AI uses many micro-adjustments.
One last example with the same start words:
A rose in the world of my love to my self and all in my love, to my But that I in my verse And for
And this is quite enough tortured doggerel. It feels like a monkey really is typing Shakespeare, but it does show how little is needed to capture the structure — without in any way approaching something useful.
I leave you with an effort appropriate for the new Bard, using the two start words “The intelligence,”:
The intelligence, of thy love And for my self and all my love, But thou art the world of my love's breast, where-through plead that
OK, stop now.