Why did GPT-4 Get a 2 on AP English?
A technical understanding of GPT-4’s AP performance
GPT-4, arguably the most powerful AI language model available, scored a 2 on both AP English Language and Composition and AP English Literature and Composition. English is the language GPT-4 writes most fluently, yet these are its worst AP results among the scores released. Why is this?
The answer lies in what Generative Pre-trained Transformer 4, nicknamed GPT-4, actually is: a multimodal language model.
A language model is the backbone of Natural Language Processing (NLP), the branch of AI focused on getting computers to understand human language. We use the term "understanding" loosely here; it does not mean quite what understanding means for a human.
NLP understanding is commonly broken down into five categories:
- Lexical or Morphological Analysis (the structure and vocabulary used in the input text)
- Syntax Analysis or Parsing (the grammar of the input text)
- Semantic Analysis (figuring out the intended meaning of the input text, factoring in every word)
- Discourse Integration (the context of the input text: how it relates to previous inputs)
- Pragmatic Analysis (knowledge from other sources, not from the input text)
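To make the first two categories concrete, here is a toy sketch in Python of lexical analysis: splitting text into word tokens and tallying the vocabulary. This is an illustration of the idea, not how GPT-4 actually tokenizes text (real models use learned subword tokenizers, and the function name here is invented for the example).

```python
import re
from collections import Counter

def lexical_analysis(text):
    """Toy lexical analysis: split text into lowercase word tokens
    and count vocabulary usage (the first of the five categories)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return tokens, Counter(tokens)

tokens, vocab = lexical_analysis("The sea was angry that day, the sea roared.")
print(tokens[:4])    # ['the', 'sea', 'was', 'angry']
print(vocab["sea"])  # 2
```

Note that this step sees only surface forms: it can tell you that "sea" appears twice, but nothing about what the sea might symbolize.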
Notice that none of these five categories addresses rhetorical devices or figurative language. This is where GPT-4 falls short: it cannot "understand" symbols and draw conclusions from them that connect to larger themes in a piece of writing.
When humans analyze text, we look for specific word choices and what they imply in the context of the story. GPT struggles to make that connection between a word and the meaning of the phrase around it because it tends to treat every word literally.
Furthermore, on exams such as the English APs, there is no logical, step-by-step method for answering questions. On math-focused exams, the question is straightforward and has no hidden meanings, so the progression is logical and the steps are predefined.
For most lower-level writing classes (perhaps high school freshman/sophomore and below), GPT's analytical capabilities will suffice, because such classes typically do not focus on the specific details that help you understand the text as a whole. They ask for shallower, more literal analyses, which is perfect for GPT.
Another thing to note is that when you input text into GPT, the underlying NLP system does not actually "understand" what you are saying; rather, it connects your input to the most probable output, informed by the five categories listed above.
This means that when GPT reads an essay or a passage and is asked for a specific response, it draws on the statistical patterns it learned from its vast training data and assembles the continuation that has the highest probability of being the expected response.
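The "most probable continuation" idea can be sketched with a toy bigram model: count which word most often follows each word, then always emit the most frequent continuation. This is a deliberately crude caricature (GPT-4 uses a large neural network over subword tokens, not raw bigram counts), but the principle of choosing by probability rather than by understanding is the same.

```python
from collections import defaultdict, Counter

def train_bigrams(corpus):
    """Count, for each word, which words follow it and how often."""
    counts = defaultdict(Counter)
    words = corpus.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def most_probable_next(counts, word):
    """Return the single most frequent continuation of `word`."""
    return counts[word].most_common(1)[0][0]

model = train_bigrams("the green light means hope the green light fades")
print(most_probable_next(model, "green"))  # light
```

The model will reliably complete "green" with "light", but it has no idea that the green light might symbolize hope; it only knows the pairing is frequent.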
So even if GPT technically captures the gist of the intended response, it ultimately has to assemble that response from learned factual associations. This causes problems because factual information alone cannot connect the dots between a symbolic word and what it stands for.
Also, when GPT outputs a response, it often feels "AI-generated", and many online detectors can flag text that has been written or rewritten by GPT, Quillbot, or similar tools. This comes down to how GPT structures its sentences, which traces back to the NLP categories.
GPT tries to structure a response that fits these categories perfectly, which gives the writing machine-like features, as judged by metrics such as the perplexity and burstiness of the text.
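"Burstiness" roughly means variation in sentence length and rhythm: human prose mixes short and long sentences, while model output tends to be more uniform. Below is a minimal sketch of one plausible burstiness proxy (the coefficient of variation of sentence lengths); it is an illustrative heuristic, not the actual algorithm of any particular detector.

```python
import re
from statistics import mean, pstdev

def burstiness(text):
    """Variation in sentence length, as a rough burstiness proxy:
    standard deviation of sentence word counts divided by their mean.
    Higher values mean more mixed short/long sentences."""
    lengths = [len(s.split()) for s in re.split(r"[.!?]+", text) if s.strip()]
    return pstdev(lengths) / mean(lengths)

human = "It rained. The storm lasted for three long days and the river rose over its banks. We left."
uniform = "The rain fell on the town. The storm lasted for three days. The river rose over its banks."
print(burstiness(human) > burstiness(uniform))  # True
```

Real detectors combine signals like this with perplexity (how predictable the text is to a language model), but the intuition is the same: very even, very predictable prose reads as machine-written.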
By adhering to a set of guidelines that leave no room for figurative language and reasoning, GPT loses its voice. It cannot exhibit a unique writing style; its prose is pattern-like and proceeds in strict logical order, coming across as bland and monotone, simply answering the question with shallow analysis.
Passing the English APs without the ability to articulate a deeper understanding, by connecting specific details to the text's overall themes, is extremely difficult. That is why GPT-4 and its competitors have scored so low.