Your Large Language Model – it’s as Dumb as a Rock

© Jason Flaks - Initially generated by DALL-E and edited by Jason Flaks

Unless you’ve been living under a rock lately, you likely think we’re entering some sort of AI-pocalypse. The sky is falling and the bots have come calling. There are endless reports of ChatGPT acing college-level exams, becoming self-aware, and even trying to break up people’s marriages! The way OpenAI and their ChatGPT product have been depicted, it’s a miracle we haven’t all unplugged our devices and shattered our screens. It seems like a sensible way to stop the AI overlords from taking control of our lives.

But never fear! I am here to tell you that large language models (LLMs) and their various compatriots are as dumb as the rocks we all might be tempted to smash them with. Well, ok, they are smart in some ways. But don’t fret—these models are not conscious, sentient, or intelligent at all. Here’s why.

Some Like it Bot: What’s an LLM?

Large Language Models (LLMs) actually do something quite simple. They take a given sequence of words and predict the next likely word to follow. Do that recursively, and add in a little extra noise each time you make a prediction to ensure your results are non-deterministic, and voila! You have yourself a “generative AI” product like ChatGPT.
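To make that loop concrete, here is a toy sketch of "predict the next word, add a little noise, repeat." It is nowhere near a real LLM: it only counts which word followed which in a tiny made-up corpus, and the names `train_bigrams`, `generate`, and `temperature` are illustrative assumptions rather than anything from ChatGPT.

```python
import random
from collections import Counter, defaultdict

def train_bigrams(text):
    """Count which word follows which in the training text."""
    words = text.split()
    follows = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows

def generate(follows, prompt, n_words, temperature=1.0, seed=None):
    """Extend the prompt one word at a time, sampling from the counts."""
    rng = random.Random(seed)
    output = prompt.split()
    for _ in range(n_words):
        candidates = follows.get(output[-1])
        if not candidates:
            break  # nothing ever followed this word in the training text
        words = list(candidates)
        # Temperature reshapes the counts: values below 1 favor the most
        # frequent next word, values above 1 flatten the choice (more noise).
        weights = [candidates[w] ** (1.0 / temperature) for w in words]
        output.append(rng.choices(words, weights=weights)[0])
    return " ".join(output)

# Hypothetical training text, chosen only for illustration.
corpus = "the cat sat on the mat and the cat ate the fish"
model = train_bigrams(corpus)
print(generate(model, "the", n_words=5, seed=0))
```

Every word it emits was seen following the previous word in the corpus. The sampling step is the only reason two runs can differ.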

But what if we take the description of LLMs above and restate it a little more succinctly:

LLMs estimate an unknown word based on extending a known sequence of words.

It may sound fancy—revolutionary, even—but the truth is it’s actually old school. Like, really, really old school—it’s almost the exact definition of extrapolation, a common mathematical technique that’s existed since the time of Archimedes! If you take a step back, Large Language Models are nothing more than a fancy extrapolation algorithm.  Last I checked nobody thinks their standard polynomial extrapolation algorithm is conscious or intelligent. So why exactly do so many believe LLMs are?

Hear Ye, Hear Ye: What’s in an Audio Sample

Sometimes it’s easier to explain a complex topic by comparison. Let’s take a look at one of the most common human languages in existence—music.  Below are a few hundred samples from Bob Dylan’s “Like a Rolling Stone.” 


If I were to take those samples and feed them into an algorithm and then recursively extrapolate out for a few thousand samples, I’d have generated some additional audio content. But there is a lot more information encoded in that generated audio content than just the few thousand samples used to create it.

At the lowest level:

  • Pitch
  • Intensity
  • Timbre

At a higher level:

  • Melody
  • Harmony
  • Rhythm

And at an even higher level:

  • Genre
  • Tempo

So by simply extrapolating samples of audio, we generated all sorts of complex higher-level features of auditory or musical information. But pump the brakes! Did I just create AI Mozart? I don’t think so. It’s more like AI Muzak.
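Here is a minimal sketch of that kind of sample-by-sample extrapolation. The algorithm is my assumption for illustration, not a specific one from the experiment above: a two-tap linear predictor fit by least squares, run on a synthetic 440 Hz tone standing in for the Dylan samples. The names `fit_ar2` and `extrapolate` are hypothetical.

```python
import math

def fit_ar2(samples):
    """Least-squares fit of x[n] ~ a*x[n-1] + b*x[n-2]."""
    s11 = s12 = s22 = r1 = r2 = 0.0
    for n in range(2, len(samples)):
        x1, x2, y = samples[n - 1], samples[n - 2], samples[n]
        s11 += x1 * x1
        s12 += x1 * x2
        s22 += x2 * x2
        r1 += x1 * y
        r2 += x2 * y
    # Solve the 2x2 normal equations with Cramer's rule.
    det = s11 * s22 - s12 * s12
    a = (r1 * s22 - r2 * s12) / det
    b = (r2 * s11 - r1 * s12) / det
    return a, b

def extrapolate(samples, n_new):
    """Recursively predict n_new samples, feeding each prediction back in."""
    a, b = fit_ar2(samples)
    out = list(samples)
    for _ in range(n_new):
        out.append(a * out[-1] + b * out[-2])
    return out[len(samples):]

sample_rate = 8000   # assumed sample rate in Hz
freq = 440.0         # assumed test tone (concert A)
known = [math.sin(2 * math.pi * freq * n / sample_rate) for n in range(200)]
generated = extrapolate(known, 1000)
```

Nothing in the code knows what pitch is, yet the generated samples continue the same 440 Hz tone. The higher-level feature comes along for free with the low-level prediction.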

An AI of Many Words: What’s Next? 

It turns out that predicting the next word in a sequence of words will also generate more than just a few lines of text. There’s a lot of information encoded in those lines,  including the structure of how humans speak and write, as well as general information and knowledge we’ve previously logged. Here’s just a small sample of things encoded in a sequence of words:

  • Vocabulary
  • Grammar/Part of Speech (PoS) tagging
  • Coreference resolution (pronoun dereferencing)
  • Named entity detection
  • Text categorization
  • Question and answering
  • Abstract summarization
  • Knowledge base

All of the information above can, in theory, be extracted by simply predicting the next word, much in the same way predicting the next musical sample gives us melody, harmony, rhythm, and more.   And just like our music extrapolation algorithm didn’t produce the next Mozart, ChatGPT isn’t going to create the next Shakespeare (or the next horror movie villain, for that matter).

LLMs: Lacking Little Minds? 

Large Language Models aren’t the harbinger of digital doom, but that doesn’t mean they don’t have some inherent value. As an early adopter of this technology, I know it has a place in this time. It’s integral to the work we do at Xembly, where I’m the co-founder and CTO. However, once you understand that LLMs are just glorified extrapolation algorithms, you gain a better understanding of the limitations of the technology and how best to use it. 

Five Alive: How to Use LLMs So They Don’t Take Over the World


LLMs have huge potential. Just like any other tool, though, in order to extrapolate the most value, you have to use them properly. Here are five areas to consider as you incorporate LLMs into your life and work. 

  • Information must be encoded in text
  • Extrapolation error with distance
  • Must be prompted
  • Limited short-term memory
  • Fixed in time with no long-term memory

Let’s dig a little deeper.

Information Must Be Encoded in Text

Yann LeCun probably said it best:

Humans are multi-modal input devices and many of the things we observe and are aware of that drive our behavior aren’t verbal (and hence not encoded in text). An example we contend with at Xembly is the prediction of action items from a meeting. It turns out that the statement “I’ll update the row in the spreadsheet” may or may not be a future commitment to do work. Language is nuanced, influenced by other real-time inputs like body language and hundreds of other human expressions. It’s entirely possible in this example that the task was completed in real-time during the meeting, and the spoken words weren’t an indication of future work at all.

Extrapolation Error with Distance

Like all extrapolation algorithms, the further you get away from your source signal (or prompt in the case of LLMs), the more likely you will experience errors. Sometimes a single prediction, such as one that negates an otherwise affirmative statement or assigns an incorrect gendered pronoun, can cause downstream errors in future predictions. These tiny errors can often lead to convincingly good responses that are factually inaccurate. In some cases, you may find LLMs return highly confident answers that are completely incorrect. These types of errors are referred to as hallucinations.

But both of these examples are really just forms of extrapolation error. The errors will be more pronounced when you make long predictions. This is especially true for content largely unseen by the underlying language model (for example, when trying to do long-form summarization of novel content).
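A back-of-the-envelope sketch shows why length matters. It rests on a loud simplifying assumption: every predicted word is independently correct with the same fixed probability. Real LLM errors are neither independent nor constant, so treat the numbers as intuition only; `chance_all_correct` is a hypothetical helper.

```python
def chance_all_correct(per_word_accuracy, n_words):
    """Probability that n_words consecutive predictions are all correct,
    assuming independent errors at a fixed rate (a simplification)."""
    return per_word_accuracy ** n_words

for n in (10, 100, 1000):
    print(n, round(chance_all_correct(0.999, n), 3))
```

Even at an assumed 99.9% per-word accuracy, the chance of an error-free run falls from about 99% at 10 words to roughly 37% at 1000 words.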

Must Be Prompted

Simply put, if you don’t provide input text an LLM will do nothing. So if you are expecting ChatGPT to act as a sage and give you unsolicited advice, you’ll be waiting a long time. Many of the features Xembly customers rave about are based on our product providing unsolicited guidance. Large Language Models are no help to us here.

Limited Short-Term Memory

LLMs generally only operate on a limited window of text. In the case of ChatGPT, that window is roughly 3000 words. What that means is that new information not already incorporated in the initial LLM training data can very quickly fall out of memory. This is especially problematic for long conversations where new corporate lingo may be introduced at the start of a conversation and never mentioned again. Once whatever buzzword is used falls out of the context window it will no longer contribute to any future prediction, which can be problematic when trying to summarize a conversation.
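The effect can be sketched in a few lines. This is an illustration under assumptions: real models count tokens rather than words, and "Bluebird" is a made-up piece of corporate lingo.

```python
def visible_context(conversation_words, window_size):
    """Return only the most recent words that fit in the window."""
    if window_size <= 0:
        return []
    return conversation_words[-window_size:]

# The buzzword is defined once at the start, then never mentioned again.
conversation = ["Project", "Bluebird", "means", "the", "Q3", "launch"]
conversation += ["filler"] * 3000  # the rest of a long meeting

context = visible_context(conversation, window_size=3000)
print("Bluebird" in context)  # False: the definition has fallen out of view
```

Anything that slides out of the window simply does not exist as far as the next prediction is concerned.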

Fixed in Time with no Long-term Memory

Every conversation you have with ChatGPT only exists for that session. Once you close that browser or exit your current conversation, there is no memory of what was said. That means you cannot depend on new words being understood in future conversations unless you reintroduce them within a new context window. So, if you introduce an LLM to anything it hasn’t heard before in a given session, you may find it uses that word correctly in subsequent responses. But if you enter a new session and have any hopes that the word will be used without introducing it in a new prompt, brace yourself—you will be disappointed.

To Use an LLM or Not to Use an LLM

It’s a big question. LLMs are exceedingly powerful, and you should strongly consider using them as part of your NLP stack. I’ve found the greatest value of many of these LLMs is that they potentially replace all the bespoke language models folks have been making for some time. You may not need these custom entity models, intent models, abstract summarization models, etc. It’s quite possible that LLMs can accomplish all of these things at similar or better accuracy, while possibly greatly reducing time to market for products that rely on this type of technology.

There are many items in the LLM plus column, but if you are hoping to have a thought-provoking intelligent conversation with ChatGPT,  I suggest you walk outside and consult your nearest rock. You just might have a more engaging conversation!

The Annotator’s Dilemma: When Humans Teach Machines to Fail

What does a machine learning model trained via supervised learning and a lion raised in captivity have in common? … They’re both likely to die in the wild!

Now that might sound like a joke aimed at getting PETA to boycott my blog, but this is no laughing matter. Captive-bred lions are more likely to die in the wild and so are machine learning models trained with human-annotated data.

According to a 2008 National Geographic article, captive-bred predators often die in the wild because they never learn the natural behaviors necessary for success. This is because their human captors never teach the animals the necessary survival skills (e.g., hunting) or inadvertently teach behaviors that are detrimental to their survival (e.g., no fear of humans). It’s not for lack of trying, but for a variety of reasons it is impractical or impossible to expose a captive predator to an environment that completely mirrors their ultimate home.

Why Humans Teach Machines to Fail

Not surprisingly, when humans teach machines, they fail for many of the same reasons. In both supervised and semi-supervised machine learning we train a model using human-annotated data. Unfortunately, human cognitive and sensory capabilities can introduce a variety of consequences that often lead us to teach machines the wrong thing or fail to fully expose them to the environment they will find in the wild. While there are likely many areas that impact the quality of human annotations, I’d like to cover five that I believe have the greatest impact on success.

Missing Fundamental and the Transposed Letter Effect (Priming)

Have you ever had someone explain an interesting factoid that sticks with you for life? One such example in my life is the “Missing Fundamental” effect, first introduced to me by my freshman year Music Theory professor. The question he posed was “how are you able to hear the low A on a piano when the vast majority of audio equipment cannot reproduce the corresponding fundamental frequency?” It turns out the low A on a piano has a fundamental frequency of 27.5 Hz and most run-of-the-mill consumer audio equipment is incapable of producing a frequency that low with any measurable gain. Yet we can hear the low A on a piano recording even with those crappy speakers. The reason is the “Missing Fundamental” effect. In essence, the human brain can infer the missing fundamental frequency from upper harmonics.
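The arithmetic behind that inference can be sketched simply: the upper harmonics are all whole-number multiples of the fundamental, so their greatest common divisor points back at it. This is purely an illustration of the numbers, not a claim about how the auditory system actually works; the harmonic list and the helper names are assumptions.

```python
from fractions import Fraction
from functools import reduce
from math import gcd

def fraction_gcd(x, y):
    """Greatest common divisor of two exact fractions."""
    return Fraction(
        gcd(x.numerator * y.denominator, y.numerator * x.denominator),
        x.denominator * y.denominator,
    )

def implied_fundamental(harmonics_hz):
    """Largest frequency that every listed harmonic is a whole multiple of."""
    return reduce(fraction_gcd, (Fraction(str(h)) for h in harmonics_hz))

# Harmonics 2 through 5 of the piano's low A; the 27.5 Hz fundamental
# itself is absent, as it would be on small speakers.
heard = [55.0, 82.5, 110.0, 137.5]
print(float(implied_fundamental(heard)))  # 27.5
```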

A similar concept is the “Transposed Letter” effect. I’m sure you’ve seen the meme. You know, those images with scrambled letters that tell you you’re a genius if you can read them. Your ability to read those sentences is due to the transposed letter effect and is related to priming. Basically, even if the letters in words printed on the page are jumbled, reading them can still activate the same region of the brain as the original word.

You might be asking what any of this has to do with annotating data and teaching machines. The problem arises when you realize we humans can correctly identify something even when all the data is not actually present to do so. If we annotate data this way we are assuming the machine has the same capabilities and that might not be so. Teaching a machine that “can you raed this” and “can you read this” are the same thing may have unintended consequences.

Selection Bias

If you gave me millions of images and asked me to find and label all the images with tomatoes, I am probably going to quickly scan for anything red and circular. Why? Because that’s my initial vision of a tomato and scanning the images that way would likely speed up the process of going through them all. But that is me imparting my bias of what a tomato looks like into the annotated data set. And it’s exactly how a machine never learns that the Green Zebra and Cherokee Purple varieties are indeed tomatoes!

Multisensory Integration

Humans often make use of multiple senses simultaneously to improve our perception of the world. For example, it has been well documented that speech perception, especially in noisy environments, is greatly enhanced when the listener can leverage visual and auditory cues. However, the vast majority of commercial machine learning models are single modality (reading text, scanning images, scanning audio). So, if my ability to understand a noisy speech signal is only possible due to a corresponding video it may be dangerous to try and teach a machine what was said since the machine likely does not have access to the same raw data.

Response Bias

I am ashamed to admit this, but every time I get an election ballot, I feel an almost compulsive need to select a candidate for every position, even when I have little or no knowledge of the office the candidates are running for, the background of the competing candidates, or their policy positions. Usually, I arbitrarily select a candidate based on their party affiliation or what college they went to, which is probably only slightly better than selecting the first name on the ballot. My need to select a candidate even though I have no basis for doing so is likely a form of response bias. The problem with response bias is it generally leads to inaccurate data. If your annotators suffer from response bias, you are likely teaching the machine with inaccurate data.

Zoning Out

Have you ever driven somewhere only to get to your destination and have no recollection of how you got there? If so, like me, you have experienced zoning out. With repetitive tasks we tend to start with an implied speed versus accuracy tradeoff. Over time, as the task gets boring, we start to zone out or get distracted, but we maintain the same speed, which ultimately leads to errors. Annotating data is a highly repetitive task and therefore has a high probability of generating these types of errors. And when we use error-ridden annotated data to teach our machines, we likely teach them the wrong thing.

How to be a Better Teacher

While the problems above might seem daunting there are things we can do to help minimize the effects of human behavior on our ability to accurately teach machines.

Provide a Common Context

The missing fundamental and multisensory integration problems are both issues with context. In each of these cases either historical or current context allows us humans to discern something another species (a.k.a. the machine) may not be able to comprehend. The solution to this problem is to make sure humans teach the machine with a shared context. The easiest way to fix this problem is to limit the annotator to the same modality the machine will operate with. If the task at hand is to teach a machine to recognize speech from audio, then don’t provide the annotator access to any associated video content. If the task is to identify sarcasm in written text don’t provide the annotator with audio recordings of the text being spoken. This will ensure the annotator teaches the machine with mutually accessible data.

Beyond tooling, you can also train your annotators to try and interpret data from multiple perspectives to ensure their previous experiences don’t cause brain activations that the machine might not benefit from. For example, it is very easy to read text in your head with your own internal inflections that might change the meaning. After all, the slightest change in inflection can turn a benign comment into a sarcastic insult. However, if you train annotators to step back and try to read the text with multiple inflections, you might avoid this problem.

Introduce Randomness

While it might be tempting to let users search around for items they think will help teach a machine, doing so can increase the likelihood of selection bias. There may be good reasons to allow users to search to speed up data collection of certain classes, but it is also important to make sure a sizeable portion of your data is randomly selected. Make sure you set up different jobs for your annotators and ensure some proportion of your labeling effort is from randomly selected examples.
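One way to enforce that proportion is to build each labeling batch from two sources. The sketch below is a hypothetical helper, not a feature of any particular labeling tool; `build_labeling_batch` and the default 50/50 split are assumptions you would tune for your own data.

```python
import random

def build_labeling_batch(pool, searched, batch_size, random_fraction=0.5, seed=None):
    """Mix randomly drawn items with annotator-searched items.

    pool:     every unlabeled item ID
    searched: item IDs an annotator found by searching (possibly biased)
    """
    rng = random.Random(seed)
    n_random = round(batch_size * random_fraction)
    random_part = rng.sample(pool, min(n_random, len(pool)))
    already_chosen = set(random_part)
    # Fill the remainder from the searched items, skipping duplicates.
    remaining = batch_size - len(random_part)
    searched_part = [item for item in searched if item not in already_chosen][:remaining]
    return random_part + searched_part

pool = list(range(1000))       # hypothetical image IDs
searched = list(range(50))     # e.g. results of a "red and round" search
batch = build_labeling_batch(pool, searched, batch_size=20, seed=0)
```

The random half is what gives the Green Zebra tomatoes a chance of showing up in front of an annotator at all.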

Reduce Cognitive Load

While we may not be able to prevent boredom and zoning out, we can reduce complexity in our labeling tools. By reducing the cognitive load we are more likely to minimize mistakes when people get distracted. Some ways to reduce cognitive load include limiting labeling tasks to single step processes (i.e., only label one thing at a time) and providing clear and concise instructions that remove ambiguity.

Be Unsure

Last but not least, allow people to be unsure. If you force people to put things in one of N buckets they will. By giving people the option of being “unsure” you minimize how often you’ll get inaccurate data due to people’s compulsion to provide an answer even if no correct answer is obvious.
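In practice that means adding the extra bucket to the label set and deciding what to do with it downstream. The sketch below assumes one possible policy, which is mine and not a standard: ignore "unsure" votes, and withhold an item from training when nothing is left or the remaining votes tie. The label names and `consolidate` are hypothetical.

```python
from collections import Counter

UNSURE = "unsure"
LABELS = ("tomato", "not_tomato", UNSURE)

def consolidate(votes):
    """Majority label for one item, or None if annotators gave no clear answer."""
    counts = Counter(v for v in votes if v != UNSURE)
    ranked = counts.most_common(2)
    if not ranked:
        return None  # everyone was unsure
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # tie between the top two labels
    return ranked[0][0]

print(consolidate(["tomato", "tomato", UNSURE]))  # tomato
print(consolidate([UNSURE, UNSURE]))              # None
```

Items that come back as `None` can be routed to a second pass or an expert reviewer instead of teaching the model a guess.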

Final Thoughts

No teacher wants to see their students fail. So it’s important to remember, whether training a lion or a machine learning model, that different species likely learn in different ways. If we cater our teaching towards our students, we just might find our machine learning models fat and happy long after we sent them off into the wild.

*I’d like to thank Shane Walker and Phil Lipari who inspired this post and have helped me successfully teach many machine learning models to survive in the wild.

Voting is Just a Precision and Recall Optimization Problem

It’s hard to avoid the constant bickering about the results of our last election. Should mail-in voting be legal? Do we need stricter voter identification laws? Was there fraud in the last election? Did it impact the results? These are just a fraction of the questions circulating around elections and voter integrity these days. Sadly, these questions appear to be highly politicized and it’s unclear if anybody is really interested in asking what an optimal election system looks like.

In a truly fair and accurate representative democracy, a vote not counted is just as costly as one inaccurately counted. More precisely, a single mother with no childcare who doesn’t vote because of 4-hour lines is just as damaging to the system as a vote for a Republican candidate that is intentionally or accidentally recorded for the opposing Democratic candidate.

Therefore, we can conclude an optimal election system really involves optimizing on both axes. How do we make sure everyone who wants to vote gets to vote? And how do we ensure every vote is counted accurately? When viewed this way one can’t help but see the parallels to optimizing a machine learning classifier for precision (when we count votes for a given candidate how often did we get it right) and recall (of all possible votes for that candidate how many did we find).

Back the Truck Up! What is Precision and Recall Anyway

Precision and Recall are two metrics often used to measure the accuracy of a classifier. You might ask “why not just measure accuracy?” and that would be a valid question. Accuracy, defined as everything we classified correctly divided by everything we evaluated, suffers from what is commonly known as the imbalanced class problem.

Suppose we have a classifier (a.k.a. laws and regulations) that can take a known set of voters who intend to vote “democrat” and “not democrat” (actual / input) and then outputs the recorded vote (predicted / output).

Let’s assume we evaluate 100 intended voters/votes, 97 of which intend to not vote for the democratic candidate and let’s build the dumbest classifier ever known. We are just going to count every vote as “not democrat”, regardless of whether the ballot was marked for the democratic candidate or not.

N (number of votes) = 100

                         | Predicted: Democrat | Predicted: Not a Democrat | Total
Actual: Democrat         | TP = 0              | FN = 3                    | TOTAL DEMOCRATS = 3
Actual: Not a Democrat   | FP = 0              | TN = 97                   | TOTAL NOT DEMOCRATS = 97
Total                    | POSITIVES = 0       | NEGATIVES = 100           |

To make our calculations a little easier we can take those numbers and drop them into a table that compares inputs to outputs, also known as a confusion matrix. To simplify some of our future calculations we can further define some of the cells in the table above:

  • True Positives (TP): Correctly captured an intended vote for the democrats as a vote for the democrats (0)
  • True Negatives (TN): Correctly captured a vote NOT intended for the democrats as a vote not for the democrats (97)
  • False Positives (FP): Incorrectly captured a vote NOT intended for the democrats as a vote for the democrats (0)
  • False Negatives (FN): Incorrectly captured an intended vote for the democrats as a vote not for the democrats (3)

Now we can slightly relabel our accuracy equation and calculate our accuracy with our naïve classifier and the associated values from the table above:

Accuracy = (TP + TN) / (TP + TN + FP + FN) = (0 + 97) / (0 + 97 + 0 + 3) = 97%

97% Accuracy! We just created the world’s stupidest classifier and achieved 97% accuracy! And therein lies the rub. The second I expose this classifier to the real world with a more balanced set of inputs across classes we will quickly see our accuracy plummet. Hence, we need a better set of metrics. Ladies and gentlemen, I am delighted to introduce …

  • Precision: Of the votes recorded (predicted) for the Democrats, how many were correct. Precision = TP / (TP + FP)

  • Recall: Of all possible votes for the Democrats, how many did we find. Recall = TP / (TP + FN)

What becomes blatantly clear from evaluating these two metrics is that our classifier, which appeared to have great accuracy, is terrible. None of the intended votes for the democrats were correctly captured and of all possible intended votes for the democrats, we found none of them. It’s worth noting that the example I’ve presented here is for a binary classifier (democrat, not democrat) but these metrics can easily be adapted to multi-class systems that more accurately reflect our actual candidate choices in the United States.
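For anyone who wants to check the numbers, here is a small sketch that computes all three metrics from the four confusion-matrix cells of the naïve classifier. One assumption to flag: with zero votes recorded for the Democrats, precision is technically 0/0, and the hypothetical `safe_divide` helper follows the common convention of reporting that as 0.

```python
def safe_divide(numerator, denominator):
    """Return 0.0 instead of failing when the denominator is zero (a convention)."""
    return numerator / denominator if denominator else 0.0

def metrics(tp, tn, fp, fn):
    """Accuracy, precision and recall from the four confusion-matrix cells."""
    return {
        "accuracy": safe_divide(tp + tn, tp + tn + fp + fn),
        "precision": safe_divide(tp, tp + fp),
        "recall": safe_divide(tp, tp + fn),
    }

# The "count everything as not democrat" classifier from the table.
naive = metrics(tp=0, tn=97, fp=0, fn=3)
print(naive)  # {'accuracy': 0.97, 'precision': 0.0, 'recall': 0.0}
```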

There’s No Such Thing as 100% Precision and Recall

Gödel’s incompleteness theorem, which loosely states that every non-trivial formal system is either incomplete or inconsistent, likely applies to machine learning and artificial intelligence systems. In other words, since machine learning algorithms are built around our known formal mathematical systems there will be some truths they can never describe. A consequence of that belief and something I preach to everyone I work with is that there is really no such thing as 100% precision and recall. No matter how great your model is and what your test metrics tell you, there will always be edge cases.

So if 100% precision and recall is all but impossible, what do we do? When developing products around machine learning classifiers, we often ask ourselves what is most important to the customer: precision, recall, or both. For example, if I create a facial recognition system that notifies the police if you are a wanted criminal, we probably want to err on the side of precision because arresting innocent individuals would be intolerable. But in other cases, like flagging inappropriate images on a social network for human review, we might want to err on the side of recall, so we capture most if not all images and allow humans to further refine the set.

It turns out that very often precision and recall can be traded off. Most classifiers emit a confidence score of sorts (often a softmax output), and by just varying the threshold on that output we can trade off precision for recall and vice versa. Another way to think about this is, if I require my classifier to be very confident in its output before I accept the result, I can tip the results in favor of precision. If I loosen my confidence threshold, I can tip it back in favor of recall.
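The tradeoff is easy to see on a handful of scored examples. The scores and labels below are invented for illustration, and `precision_recall_at` is a hypothetical helper rather than a library call.

```python
def precision_recall_at(threshold, scored_examples):
    """Precision and recall when scores >= threshold count as positive."""
    tp = fp = fn = 0
    for score, is_positive in scored_examples:
        predicted_positive = score >= threshold
        if predicted_positive and is_positive:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif is_positive:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# (confidence score, true label) pairs, made up for this sketch.
examples = [
    (0.95, True), (0.90, True), (0.80, False), (0.70, True),
    (0.60, False), (0.40, True), (0.30, False), (0.10, False),
]

for threshold in (0.85, 0.5, 0.2):
    precision, recall = precision_recall_at(threshold, examples)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

On this toy data the strict 0.85 threshold gives perfect precision but finds only half of the positives, while the loose 0.2 threshold finds all of them at the cost of precision.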

And how might this apply in voting? Well, if I structure my laws and regulations such that every voter must vote in person with 6 forms of ID and the vote is tallied in front of the voter by a 10-person bipartisan evaluation team who must all agree … we will likely have very high precision. After all, we’ve greatly increased the confidence in the vote outcome. But at what expense? We will also likely slow down the voting process and create massive lines which will significantly increase the number of people who might have intended to vote but don’t actually do so, hence decreasing recall.

Remind me again what the hell this has to do with Voting

The conservative-leaning Heritage Foundation makes the following statement on their website:

“It is incumbent upon state governments to safeguard the electoral process and ensure that every voter’s right to cast a ballot is protected.”

I strongly subscribe to that statement and I believe it is critical to the success of any representative democracy. But ensuring that every voter’s right to cast a ballot is protected requires not only that we accurately record the captured votes, but also that every voter who intends to vote is unhindered in doing so.

Maybe we need to move entirely to in-person voting while simultaneously allocating sufficient funds for more polling stations, government-mandated paid time off, and government-provided childcare. Or maybe we need all mail-in ballots but some new process or technology to ensure the accuracy of the votes. Ultimately, I don’t pretend to know the right answer, or if we even have a problem to begin with. What I do know is that if we wish to improve our election systems we must first start with data on where we stand today and then tweak our laws and regulations to simultaneously optimize for precision and recall.

So, the next time a politician proposes changes to our election system, ask … no, demand that they provide data on the current system and how their proposed changes will impact precision and recall. Because only when we optimize for both these metrics can we stop worrying about making America great again and start working on making America even greater!