Technical Debt in the World of Generative AI: Lessons from a Pioneering Google Paper


Chances are, you’ve already heard of Google’s landmark paper, “Attention Is All You Need.” This 2017 publication, which introduced the transformer architecture, is widely regarded as the fundamental enabler of modern generative AI. However, two years earlier, another team at Google published a far less celebrated paper that deserves equal recognition: “Hidden Technical Debt in Machine Learning Systems.”

I know it might sound crazy to put these two papers on the same pedestal, but hear me out. While the transformer architecture revolutionized what AI can do, the technical debt paper nailed the organizational and engineering nightmares we’d face when deploying these systems at scale. What those authors couldn’t possibly predict was how the transformer revolution would democratize machine learning, turning a niche concern into everyone’s problem overnight. Like it or not, virtually every company has become an AI company, and they’re all now discovering the joys of ML technical debt the hard way.

Let’s dive into the “Hidden Technical Debt” paper and see how its warnings apply to our current generative AI fever dream. I’ll focus on five areas where this paper was eerily prescient.


Entanglement

The CACE Principle

Sculley and team introduced the CACE principle (Changing Anything Changes Everything), noting that ML systems have strong coupling between components. If you’ve worked with generative AI, you’re probably nodding vigorously right now.

Change one word in your prompt? Completely different output. Add a seemingly innocent instruction? Watch your AI go off the rails in creative new ways. The relationship between prompt changes and output effects can be entirely unpredictable. And yet, as the paper warns, “all the maintenance issues of traditional code apply,” except your “code” is now natural language prompts—without compilers, linters, or any formal semantics whatsoever. Fun times.

Prompt Chaining: Entanglement Hell

The original paper describes correction cascades, a close relative of entanglement and a prime source of technical debt: one model takes another model’s output as input to learn corrections, creating a cascading dependency. A similar pattern in generative AI applications is prompt chaining, where the output of one prompt is fed into another downstream.

When one prompt feeds into another, they become entangled exactly as the paper describes. You’ve created a tightly coupled set of dependencies where changing anything in an upstream prompt risks system-wide failure.

It is therefore imperative that daisy-chained prompts can be independently isolated, monitored, and tested.
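Here’s a minimal sketch of what that isolation can look like. The `summarize` and `classify_sentiment` steps, and the `call_llm` helper, are hypothetical placeholders for whatever your chain actually does:

```python
# A minimal sketch: each step of a prompt chain is its own function with a
# plain-Python input and output, so it can be isolated, monitored, and tested.
# `call_llm` is a hypothetical wrapper around whatever model API you use.

def call_llm(prompt: str) -> str:
    """Stand-in for your model client; replace with a real API call."""
    raise NotImplementedError

def summarize(article_text: str) -> str:
    """Step 1: condense the raw article into a short summary."""
    prompt = f"Summarize the following article in three sentences:\n\n{article_text}"
    return call_llm(prompt)

def classify_sentiment(summary: str) -> str:
    """Step 2: classify the summary produced upstream."""
    prompt = (
        "Classify the sentiment of this text as positive, negative, or neutral. "
        f"Respond with a single word.\n\nText: {summary}"
    )
    return call_llm(prompt).strip().lower()

def pipeline(article_text: str) -> str:
    """The full chain; each link above can still be exercised on its own."""
    return classify_sentiment(summarize(article_text))
```

Because each step takes and returns a plain string, you can test `classify_sentiment` against a fixed set of labeled summaries without ever invoking the upstream step.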

Data Dependencies: Unstable Data

Sculley’s paper suggests that data dependencies, like traditional software dependencies, carry a high risk of introducing technical debt. This is equally true in generative AI applications.

Suppose your application uses a prompt on top of a generative AI model to read data from a CSV file. Your prompt might explain the structure of the data in each column. But what happens if that data changes beneath you? What happens if a column you defined as an integer value between 0 and 100 suddenly becomes a float between 0.0 and 1.0?

This is where you’ve introduced a data dependency in your prompt, and you must build systems to monitor and protect against these changes.
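As a hedged example, here’s one way such a guard might look. The column name (`score`), the 0–100 integer range, and the file path are assumptions for illustration:

```python
import csv

# A minimal sketch: validate the assumptions your prompt makes about the data
# before the data is interpolated into the prompt. The "score" column and the
# 0-100 integer range are illustrative assumptions.

def validate_rows(path: str) -> list[str]:
    """Return a list of violations; empty means the file still matches the
    contract the prompt was written against."""
    violations = []
    with open(path, newline="") as f:
        for i, row in enumerate(csv.DictReader(f), start=1):
            raw = row.get("score", "")
            try:
                value = int(raw)            # prompt assumes an integer...
            except ValueError:
                violations.append(f"row {i}: score {raw!r} is not an integer")
                continue
            if not 0 <= value <= 100:       # ...between 0 and 100
                violations.append(f"row {i}: score {value} is outside 0-100")
    return violations


if __name__ == "__main__":
    problems = validate_rows("inputs.csv")  # hypothetical input file
    if problems:
        # Alert instead of silently feeding drifted data into the prompt.
        raise SystemExit("data dependency violated:\n" + "\n".join(problems))
```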

Feedback Loops: Your AI Might Be Eating Its Own Tail

The paper warns about “analysis debt” that emerges when systems lack monitoring for system-level behaviors like feedback loops. With generative AI, these loops are easier than ever to create—and harder than ever to spot.

Here’s a common scenario: your model creates content, this content gets published, scraped, and then used as training data in an underlying generative AI model. Congratulations—you’ve created a digital ouroboros, a snake eating its own tail. These feedback loops can amplify biases, perpetuate inaccuracies, and generally make your system less reliable over time, all while looking perfectly normal on the surface.

The paper also describes “hidden” feedback loops that are more difficult to identify. One possible example is a generative AI–based chatbot: if generative AI drives your dialogue engine, you may bias user input and responses over time, inadvertently nudging users toward words and phrasing they never intended to use. Feedback like this can cause long-term reliability issues.

Since most companies building generative AI applications don’t own the underlying model, it may be difficult to avoid these feedback loops entirely. Still, you must do everything in your power to avoid them or minimize their impact.

ML System Anti-Patterns: Dead Prompt Paths

One of my favorite anti-patterns the paper identifies is “dead experimental code paths”: code that was added during experimentation, then forgotten, but still runs in production. In prompt engineering, we’ve all created the equivalent: prompt instructions that no longer do anything useful but remain because nobody is brave enough to remove them.

These “dead prompt paths” aren’t just clutter. They consume valuable context window space, add computational overhead, and occasionally interact with other prompt elements in surprising ways. As the paper warns, such paths can be costly in terms of both complexity and computational resources. With generative AI, where every token costs both compute and cold, hard cash, this waste is even more painful.

Dealing with Changes in the External World

Many ML systems and generative AI applications are subject to the whims of the external world. And news flash: the external world is chaotic and can change on a dime.

There are dozens of ways changes in the external world can impact your generative AI application. Take function calling, for example: what if the underlying API signature changes, or the JSON structure of the API response shifts? Subtle changes like these can have outsized effects.
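One way to blunt the impact is to validate the API response against the structure your function-calling setup expects before handing it back to the model. The sketch below assumes a weather API whose field names and types (`temp_c` as a float, `conditions` as a string) are purely illustrative:

```python
# A minimal sketch: check that the external API still returns the shape your
# function-calling setup expects. The field names and types here are
# assumptions for illustration.

EXPECTED_FIELDS = {"temp_c": float, "conditions": str}

def validate_weather_response(payload: dict) -> dict:
    """Raise loudly if the upstream API's JSON structure has drifted."""
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in payload:
            raise KeyError(f"weather API drift: missing field {field!r}")
        if not isinstance(payload[field], expected_type):
            raise TypeError(
                f"weather API drift: {field!r} is "
                f"{type(payload[field]).__name__}, expected {expected_type.__name__}"
            )
    return payload
```

Failing loudly here turns a silent prompt-quality regression into an alert you can act on.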

But that’s just one of many external factors. The underlying model can be upgraded, human language and norms shift over time, and the external data sources you depend on can change without warning.

The world around your application is unstable, and you have to be prepared for it to shift. Your application must be robust to these changes.


Preventing the Generative AI Debt Spiral

We’ve just described several areas likely to introduce technical debt into your generative AI application. So what are you to do? In many ways, the solution isn’t that different from how technical debt is managed in traditional software—except that it may require leveling up your skills in the basics of ML testing.

There are three primary things you can do to tackle technical debt: monitor, test, and refactor.

Monitor, Monitor, Monitor

We discussed data dependencies, changes in the external world, and feedback loops. These all require monitoring. While you can try to make your systems as robust as possible, it’s honestly impossible to catch every edge case. It is therefore imperative that you bake effective monitoring into your application on day one.

If you have a dependency on a data value with a specific type or range, monitor that dependency independently and trigger an alert when it changes. Detect a sudden change to a public API signature or an update to your underlying third-party generative AI model? Do the same.

Beyond monitoring for deterministic dependency changes, you should also monitor for probabilistic shifts, also sometimes referred to as model drift. The world around you can go through statistically significant changes that might cause the performance of your prompts and underlying models to decay over time. You may need to continuously label new data and test your prompts and models to effectively monitor and detect probabilistic shifts.
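As one hedged illustration, suppose a prompt classifies support tickets. You can compare this week’s distribution of predicted labels against a baseline week and flag a statistically significant shift with a chi-square test; the label set and significance threshold below are assumptions:

```python
from collections import Counter
from scipy.stats import chi2_contingency

# A minimal sketch of drift monitoring: compare the distribution of predicted
# labels in a recent window against a baseline window. The labels and the
# alpha threshold are illustrative assumptions; the test also assumes every
# label appears at least once across the two windows.

LABELS = ["billing", "bug", "feature_request", "other"]

def label_counts(predictions: list[str]) -> list[int]:
    counts = Counter(predictions)
    return [counts.get(label, 0) for label in LABELS]

def drift_detected(baseline: list[str], current: list[str], alpha: float = 0.01) -> bool:
    """Return True if the label distribution has shifted significantly."""
    table = [label_counts(baseline), label_counts(current)]
    _, p_value, _, _ = chi2_contingency(table)
    return p_value < alpha
```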

Think Like an ML Engineer: Label, Test, Measure

Every prompt in your system needs to be tested just like any other machine learning model. That means having ground-truth labeled data. For example, if you’re using a prompt to classify text, start by having humans label a set of example inputs with the correct outputs.

For classification prompts, measure performance with traditional ML metrics such as precision, recall, and F1 score. Using a prompt for ranking? Then evaluate it with ranking metrics like precision@k or normalized discounted cumulative gain (NDCG). The key point is simple: your prompt sits on top of a machine learning model, and therefore it must be tested with the same rigor as one.
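A minimal sketch of that evaluation loop, assuming a hypothetical `classify_with_prompt` wrapper around your prompt and model, and a small hand-labeled set:

```python
from sklearn.metrics import precision_recall_fscore_support

# A minimal sketch: score a classification prompt against ground-truth labels
# exactly as you would any other model. `classify_with_prompt` is a
# hypothetical wrapper around your prompt plus model call.

def classify_with_prompt(text: str) -> str:
    """Stand-in for: format the prompt, call the model, parse the label."""
    raise NotImplementedError

labeled_examples = [
    ("The checkout page crashes when I click pay", "bug"),
    ("Please add dark mode", "feature_request"),
    # ... more human-labeled examples
]

y_true = [label for _, label in labeled_examples]
y_pred = [classify_with_prompt(text) for text, _ in labeled_examples]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```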

If you have chained prompts, ensure each prompt can be tested independently against labeled data so you can verify each component of your system in isolation.

Keep Your Prompts Clean: Refactor Early and Often

Always be prepared to refactor your prompts and your system. Continuously look for dead prompt language. Be aggressive about removing unnecessary prompt text, but make sure you test your results.
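One hedged way to find dead prompt language is an ablation test: score the prompt with and without a suspect instruction against the same labeled data. The `evaluate_prompt` helper and the instructions below are hypothetical:

```python
# A minimal sketch of an ablation test for dead prompt language: evaluate the
# prompt with and without a suspect instruction on the same labeled set.
# `evaluate_prompt` is a hypothetical helper that runs the prompt over labeled
# examples and returns an aggregate metric such as F1.

def evaluate_prompt(prompt_template: str) -> float:
    """Stand-in for: run the prompt over labeled examples, return F1."""
    raise NotImplementedError

BASE_PROMPT = (
    "Classify the support ticket as billing, bug, feature_request, or other.\n"
    "Respond with a single word.\n"
)
SUSPECT_INSTRUCTION = "Always answer politely.\n"  # does this still matter?

with_instruction = evaluate_prompt(BASE_PROMPT + SUSPECT_INSTRUCTION)
without_instruction = evaluate_prompt(BASE_PROMPT)

# If removing the instruction doesn't move the metric, it's a candidate for
# deletion; keep the shorter, cheaper prompt.
print(f"with: {with_instruction:.2f}  without: {without_instruction:.2f}")
```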


Final Thoughts

As we rush headlong into our generative AI future, we would be wise to dust off this older paper and take its warnings seriously. The authors couldn’t have known about transformers, prompt engineering, or ChatGPT, but they absolutely nailed the second-order effects these technologies would create.

Technical debt doesn’t care whether your model has 7 billion or 7 trillion parameters, whether you’re training your own model or using an off-the-shelf LLM. It accumulates all the same, and the bill always comes due. Let’s build our generative AI systems with that reality firmly in mind.