
Beware ChatGPT: a Language Model in the Shape of Shakespeare

David Eastman doth attempt to re-create ChatGPT using a few programming techniques and finds he can approximate the shape of Shakespeare.
Feb 15th, 2023 6:48am by David Eastman
Featured image via Shutterstock.

“ChatGPT is a large language model.” However often those words are repeated, they are mostly ignored, because we would prefer true artificial intelligence to exist. It would be exciting to believe we are at the forefront of friendly droids and safe self-driving cars.

But just because a 6-year-old child can repeat things, we are not normally persuaded that they deserve to be listened to. A large language model masters the context, but not the meaning, of the text it represents.

It’s a stochastic parrot. But the performance of these caged birds is astonishing. We can see just how good ChatGPT is at writing ‘x’ in the style of ‘y,’ and how many teachers are complaining about the number of synthetic essays their pupils are handing in. On examination, there is very little accuracy in the text, but the shape is good.

You have probably heard about transformers by now, and that they have played a lead role in deep learning models. Understanding language models is still an academic pursuit, even though many of the tools are available.

I’m just going to look at a very lo-fi way of simulating the underlying basics of exploring text structure — without getting anywhere near the fidelity of current models, but without using any special knowledge either. So let us start with some Shakespeare.

‘Beware the Ides of March’

This short line has a distinct feel due to its familiarity. The structure isn’t particularly Shakespearean perhaps, and the words are very specific to the play it comes from (Julius Caesar).

Our aim with the short code project below will be to produce some nonsense text that is nevertheless in the shape of Shakespeare, and good enough to show that the trick of extruding structure is not really intelligence.

The word order in the soothsayer’s warning line to Caesar might not be important, but we can fashion certain weightings simply by noting word-order statistics. First, we will look at preceding words so that we can record the words that follow on. We can then use this to play forward and produce our fabricated Shakespeare.

We will look at two words in order and try to “calculate” the third. This means we want to collect enough examples of which words most often follow on from the previous word. So if we consider “Beware the Ides,” we note that “Ides” follows one word after “the,” and two words after “Beware.” I am arbitrarily using a depth of two words, but could do the same with three words or more. The reason we look at not only the next word, but the word after that, is to make sure we capture a little of the sentence structure.

This means that as we move forward creating our nonsense text, we can ask “what should be the next word based on previous examples?”

Beware +1 ‘the’, +2 ‘Ides’
the +1 ‘Ides’, +2 ‘of’
Ides +1 ‘of’, +2 ‘March.’
of +1 ‘March.’
March.

Where I write “+1” and “+2” I mean one word ahead or two words ahead, and I refer to this as “depth” here.
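As a quick illustration (my own sketch, not the project’s code), a single loop over the tokens can print that table directly:

using System;
using System.Collections.Generic;

class DepthDemo
{
    static void Main()
    {
        var tokens = "Beware the Ides of March.".Split(' ');

        // For each token, note what follows one token on (depth 1)
        // and two tokens on (depth 2).
        for (int i = 0; i < tokens.Length; i++)
        {
            var parts = new List<string> { tokens[i] };
            if (i + 1 < tokens.Length) parts.Add($"+1 ‘{tokens[i + 1]}’");
            if (i + 2 < tokens.Length) parts.Add($"+2 ‘{tokens[i + 2]}’");
            Console.WriteLine(string.Join(" ", parts));
        }
    }
}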

First, Prepare a Corpus

The sonnets are an easier form to manipulate, as they are just a succession of sentences, so I chose to use these to learn from. I found the first 100 sonnets on the internet and used simple regex expressions to strip out the newline characters and the spaces between numbered sonnets, in order to make a single text file. Shakespeare is not famous for repeating himself; he uses a very wide spectrum of language in even this small corpus. Punctuation is also vital to the meter, and we can keep it as part of the structure. So we hope enough plagiarism will give us original results.

Now, C# isn’t necessarily the right language to use for text processing, but it is easy to manage once written in Visual Studio. Here is the project. You will see the corpus file that the project consumes.
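The exact expressions aren’t shown here, but assuming the raw file has each sonnet’s number sitting on a line of its own, the cleanup might look something like this (the file names are mine):

using System.IO;
using System.Text.RegularExpressions;

class CorpusPrep
{
    static void Main()
    {
        var raw = File.ReadAllText("sonnets_raw.txt");

        // Drop the sonnet numbers, assumed here to sit on their own lines.
        var text = Regex.Replace(raw, @"^\s*\d+\s*$", " ", RegexOptions.Multiline);

        // Collapse newlines and runs of whitespace into single spaces.
        text = Regex.Replace(text, @"\s+", " ").Trim();

        File.WriteAllText("sonnets.txt", text);
    }
}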

What Is a Token?

While our food and drink here are words, we deal with them mechanically and with minimal reference to grammar. This is because we want the structure to be moulded by Will, but not so much by what we know about English. So we treat “The” as a different entity from “the” because of where it appears in a sentence. Similarly, we treat “love,” as a token separate from “love” (without the comma), in spite of our sense that the comma is separate from the word. This is why I use the term “token” when I’m just talking about a word.
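In code, that reduces tokenizing to a bare split on spaces; this little helper (my naming, not the project’s) is reused in the sketches below:

using System;

static class Corpus
{
    // Whitespace is the only delimiter: punctuation stays attached and
    // case is preserved, so "The", "the" and "love," are all distinct.
    public static string[] Tokenize(string text) =>
        text.Split(' ', StringSplitOptions.RemoveEmptyEntries);
}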

Forward Reference

The term is entirely from my very simplistic model: we look at the next token as depth 1, and the token after that as depth 2.

Remember, we are not particularly interested in what Shakespeare actually did write, as opposed to what best matches his structural bias.

When running the code, you can see one example of all the forward references for “The”. Looking at some of them below:


You can read this as:

  • there is one example of the token “bounteous” following on directly from “The”, i.e. at depth 1. This will only be considered when “The” is the second word of our pair.
  • there are two examples of the word “will” following on two words after “The”, i.e. at depth 2. This will only be considered when “The” is the first word of our pair.

The references are initially recorded in pairs, but this breaks down when a token repeats; the most common is the depth-2 “of”. You can use a little regex to check all of these against the corpus.

The code in the project reads the corpus text, then adds references for the first and then the second word following each token, shifting forward one word and repeating the process until the text is consumed.
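A minimal sketch of that pass, assuming the references are stored as (depth, follower) pairs keyed by token, and reusing the Tokenize helper above (the project’s own listing differs in detail):

using System.Collections.Generic;
using System.IO;

// token -> every (depth, follower) pair recorded for it
var forwardRefs = new Dictionary<string, List<(int Depth, string Follower)>>();

var words = Corpus.Tokenize(File.ReadAllText("sonnets.txt"));

for (int i = 0; i < words.Length; i++)
{
    if (!forwardRefs.TryGetValue(words[i], out var refs))
        forwardRefs[words[i]] = refs = new List<(int, string)>();

    // Record the word one on (depth 1) and two on (depth 2), then
    // shift forward one word and repeat until the text is consumed.
    if (i + 1 < words.Length) refs.Add((1, words[i + 1]));
    if (i + 2 < words.Length) refs.Add((2, words[i + 2]));
}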

Making a Sonnet

We start with any two words that are part of the corpus and let the system select the next word. We then use the new word as the second word, the old second as the first word, and continue in that fashion for as long as we feel able.

In my first run, I added these constraints:

  • No repeated words unless they are two letters or less
  • A bias towards the depth 1 words as opposed to depth 2 words by 5:2
  • Stopping after 25 words, or less if we run out of rope

The above are controlled by const declarations in the code.

Without good feedback mechanisms, we will not get anywhere close to the ChatGPT texts, but what will we see? The main loop of the code starts ready with the first two words.
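A sketch of such a loop, carrying on from the forwardRefs dictionary above (the constant names are mine, standing in for the consts in the project):

using System;
using System.Collections.Generic;
using System.Linq;

const int MaxWords = 25;   // stop after 25 words
const int Depth1Bias = 5;  // favour depth-1 followers over
const int Depth2Bias = 2;  // depth-2 followers by 5:2

var output = new List<string> { "A", "rose" };

while (output.Count < MaxWords)
{
    string first = output[^2], second = output[^1];

    // Candidates are depth-2 followers of the first word and depth-1
    // followers of the second, each repeated by its bias weight.
    var candidates = new List<string>();
    if (forwardRefs.TryGetValue(first, out var f1))
        candidates.AddRange(f1.Where(r => r.Depth == 2)
            .SelectMany(r => Enumerable.Repeat(r.Follower, Depth2Bias)));
    if (forwardRefs.TryGetValue(second, out var f2))
        candidates.AddRange(f2.Where(r => r.Depth == 1)
            .SelectMany(r => Enumerable.Repeat(r.Follower, Depth1Bias)));

    // No repeated words unless they are two letters or less.
    candidates.RemoveAll(c => c.Length > 2 && output.Contains(c));
    if (candidates.Count == 0) break;   // we ran out of rope

    // Deterministic for now: take the most heavily weighted candidate.
    output.Add(candidates.GroupBy(c => c)
        .OrderByDescending(g => g.Count())
        .First().Key);
}

Console.WriteLine(string.Join(" ", output));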


Starting with “A rose”:


This is quite pleasing, although we clearly fall into a whirlpool of nonsense. Let’s now add some noise in the form of random numbers so we can get different results after repeating:
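One way in, again as a sketch rather than the project’s listing, is to swap the deterministic pick at the end of the loop for a random draw over the weighted candidate list:

// The bias weights still skew the odds, because heavily weighted
// candidates appear in the list multiple times.
var rng = new Random();
output.Add(candidates[rng.Next(candidates.Count)]);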


This is also quite nice, but it clearly needs more adjustments, otherwise it will unerringly collapse. Adjustments can be made anywhere a const is defined. Actual AI involves a great many such micro-adjustments.

One last example with the same start words:


And this is quite enough tortured doggerel. It feels like a monkey really is typing Shakespeare, but it does show how little is needed to capture the structure — without in any way approaching something useful.

I leave you with an effort appropriate for the new Bard, using the two start words “The intelligence,”:


OK, stop now.
