What Is AI Model Drift?

In our recent post on model retraining, we touch on an unfortunate but unavoidable fact of machine learning (ML) models: They often have remarkably short lifespans.

This fact is old news for ML engineers. But the problem comes as a shock to business leaders who are investing millions in the AI capabilities of their organizations.

Models typically perform well in the lab where factors are tightly controlled. However, once they start taking in real-world data, their performance suffers. Certain models may start out delivering great inferences only to have the quality fade in the weeks following deployment. Other AI models may fail right away.

In either case, model drift is incredibly common. A study published in Scientific Reports notes that 91% of ML models degrade over time.

What is going on? How can a technology relied upon by medicine, finance, defense, and other sectors just stop working? More importantly, what can be done about it?

Let’s dive deeper into the world of data science and explore the most important—but least talked about—problem in the field of AI: model drift.

[Figure: Model drift occurs when incoming data in production shifts out of distribution with the data used in an AI model's training.]

AI Model Drift—The Day 3 Problem Defined

Model drift describes the tendency for an ML model’s predictions to become less and less effective over time. It is also known as model decay or model degradation. At Striveworks, we call it “The Day 3 Problem” in reference to a standard machine learning operations (MLOps) cycle: build a model on Day 1, deploy it on Day 2, and watch its performance fall off on Day 3.

The Day 3 Problem crops up regularly at any organization that has more than a handful of AI models in production. Yet, it is often overlooked.

“Temporal model degradation [is] a virtually unknown, yet critically important, property of machine learning models, essential for our understanding of AI and its applications,” write Daniel Vela and his fellow researchers in their Scientific Reports article, “Temporal Quality Degradation in AI Models.”

“AI models do not remain static, even if they achieve high accuracy when initially deployed, and even when their data comes from seemingly stable processes.”

Key Takeaways

  • Model drift is the tendency for ML models to fail in real-world applications.
  • It is also known as “model decay,” “model degradation,” or “the Day 3 Problem.”
  • Although incredibly common, model drift is “virtually unknown.”

Why Do ML Models Stop Working in Production?

As surprising as it is, AI models that evaluate well—or that even work perfectly well in production today—will stop working at some point in the coming days or weeks. 

Why? The answer is simple: Things change.

“Models fail because the world is dynamic,” says Jim Rebesco, co-founder and CEO of Striveworks. “The statistical phrase is ‘non-stationary,’ which means that the data being put into a model in production is different from the data it was trained on.”

ML models are created by training an algorithm on historical data. They ingest thousands or even millions of data points—images, rows of numbers, strings of text—to identify patterns. In production, these models excel at matching new data to similar examples that exist in their training data. But in the real world, models regularly encounter situations that appear different from a set of specially curated data points—and even slight differences can lead to bad outcomes.
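To make this concrete, here is a minimal, self-contained sketch of the pattern described above. It uses NumPy and scikit-learn (tooling choices of ours, not anything named in this post) to train a simple classifier on one data distribution and then score it on data that has drifted away from that distribution:

```python
# Minimal sketch: a classifier trained on one distribution loses accuracy
# when the incoming data shifts. NumPy and scikit-learn are illustrative choices.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

def make_data(n, shift=0.0):
    """Two classes separated along both features; `shift` moves class 1."""
    x0 = rng.normal(loc=0.0, scale=1.0, size=(n, 2))          # class 0
    x1 = rng.normal(loc=2.0 + shift, scale=1.0, size=(n, 2))  # class 1
    X = np.vstack([x0, x1])
    y = np.array([0] * n + [1] * n)
    return X, y

# Train on historical ("Day 1") data
X_train, y_train = make_data(1_000)
model = LogisticRegression().fit(X_train, y_train)

# Production data that still matches the training distribution
X_prod, y_prod = make_data(1_000, shift=0.0)
print("accuracy, no drift:  ", model.score(X_prod, y_prod))

# The world changes: class 1 drifts toward class 0
X_drift, y_drift = make_data(1_000, shift=-2.5)
print("accuracy, with drift:", model.score(X_drift, y_drift))
```

Nothing about the model changes between the two scores; only the incoming data does, which is exactly the non-stationarity Rebesco describes.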

“Take a predictive maintenance model trained on a particular engine,” says Rebesco. “Maybe we originally deployed this model in the summer and trained it on production data from the same period of time. Now, it’s winter, and you’ve got thermal contraction in the parts, or the lubricating oil is more viscous, and the data coming off the engine and going into the model looks a lot different than what it was trained on.”

Engines are one example. But model drift isn’t confined to any single AI model—or even a type of model. It’s a fundamental property of ML. It happens to predictive maintenance models drawing on structured data, but it also happens to computer vision models looking for airplanes on airfields as the seasons shift from summer to fall. It happens when a voice-to-text model trained on American accents is used to transcribe a meeting with Scottish investors. It happens when a large language model trained on pre-2020s data gets asked to define “rizz.”

Even popular, widely used AI models aren’t immune from model drift. In 2023, a paper from Stanford researchers showed that, over a few months, OpenAI’s flagship GPT-4 dropped in accuracy by 95.2 percentage points for certain problems.

In every case, the result is a machine learning model that produces inaccurate predictions from real-world data, rendering it useless or even harmful.

Key Takeaways

  • Model drift happens because the world is always changing.
  • It happens to every type of ML model.
  • It is an unavoidable, fundamental property of ML.

What Causes Model Drift?

Because the world is always changing in unpredictable ways, many different factors can spur a misalignment between a model’s production data and its training data. The following table shows the most common causes of model drift.

| Cause of Drift | Definition | Example |
| --- | --- | --- |
| Natural adaptations | Data changes in response to outputs from an AI model. | A financial trading model sells a stock because other models are selling it, driving down its price. |
| Adversarial adaptations | An agent’s behavior changes in order to evade an AI model. | An enemy air force attaches tires to its airplanes to trick models trying to detect them. |
| Use case differences | An AI model that functions well in one context produces poor results in another. | A model tracking US-China relations fails to interpret US-Japan relations correctly. |
| Time sensitivities | An AI model trained on data from earlier time periods misses new contextual changes. | A model that understands US-Japan relations in the 1940s fails to produce useful insights for US-Japan relations today. |
| Chaotic interference | A change in an upstream AI model’s settings introduces inaccuracies into a downstream model. | A change to an embedding model causes a text classification model using its outputs to label everything incorrectly. |
| AI aging | The process by which random variations in training can contribute to accelerated degradation. | An effective object detection model simply starts to waver more and more in accuracy over time. |

Key Takeaways

  • Model drift can result from various factors, including natural, human-produced, and technology-produced changes in model operating conditions.

How Severe of a Problem Is AI Model Drift?

Model drift directly undermines an ML model’s performance. However, how urgent a problem that is can vary widely depending on how you use the model. In some cases, a drifted model may still be “good enough.”

For example, consider a streaming service’s recommendation engine. If it suggests an unexpected TV show because its model has misunderstood your taste, management may not need to panic about a Day 3 Problem. The model’s output is relatively trivial. So what if you don’t want to watch Suits? You aren’t about to cancel your subscription because Netflix queued it up for you. 

In other cases, a model’s output is only useful for a particular window of time. Once that window passes, the model’s prediction is inconsequential—whether it was good or bad. This is the case with self-driving cars. If your model thinks a tree is a person, it only matters for the amount of time it takes for the car to navigate past it. A one-off bad inference in this case doesn’t matter that much. But if the drift is substantial enough that the AI model always identifies trees as people, then the problem is much more severe.

Of course, a vast range of outcomes exists between those scenarios—where model drift can spell catastrophe. A financial system executing high-speed trades can rapidly lose millions of dollars if its model drifts. Likewise, drift occurring with a model used in cancer screenings can result in a misidentified tumor—with life-or-death consequences. AI models used in defense and intelligence applications that can no longer distinguish between friendly and adversarial aircraft become immediately unusable in combat. 

In all these scenarios, a failing model should trigger an immediate red alert. All too often, model drift happens for long periods of time before a human notices the problem. By the time a person can intervene, organizations may have made huge business or operational decisions based on flawed insights. 

Key Takeaways

  • Not all model drift is important or urgent—but the problem is often very severe.
  • High-stakes industries such as financial services, medicine, and defense are especially vulnerable.
  • Drifted models can deliver bad inferences for a long time before anyone notices.

How Can You Tell When an AI Model Is No Longer Functioning Properly?

Often, data scientists have good heuristics that suggest when a model has drifted. Experience dealing with models in production gives them a sense that a model isn’t working the way it should. 

However, this sense isn’t quantifiable, and it doesn’t scale when a data team has hundreds or thousands of AI models in production. In these situations, data scientists need more exacting ways of determining model drift.

At Striveworks, we use two statistical methods to determine whether drift is occurring: the Kolmogorov-Smirnov test and the Cramér-von Mises test. Both are common statistical measures for determining whether a dataset is “out of distribution” with a model’s training data.

Kolmogorov-Smirnov Test

The Kolmogorov-Smirnov test compares the distribution of a dataset against a reference distribution. In its one-sample form, the reference is a theoretical distribution; in the two-sample form used for drift detection, it is the empirical distribution of another dataset, such as a model’s training data. The test looks for the maximum difference between the two cumulative distribution functions (CDFs). Because that maximum typically falls near the center of the distribution, the test is more focused there and less sensitive to extremes at the tails. It’s a simple and versatile test that is non-parametric (i.e., it doesn’t require any assumptions about the underlying data), which makes it useful across a wide range of data distributions.
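As a concrete illustration, here is a minimal sketch of a two-sample Kolmogorov-Smirnov check. It uses SciPy (an assumed tooling choice, not something prescribed by this post) to compare one numeric feature from a model's training data against the same feature observed in production:

```python
# Minimal sketch: two-sample Kolmogorov-Smirnov drift check with SciPy.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Stand-ins for one numeric feature from training data and from production.
train_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)
prod_feature = rng.normal(loc=0.4, scale=1.2, size=1_000)  # drifted

result = ks_2samp(train_feature, prod_feature)
print(f"KS statistic = {result.statistic:.3f}, p-value = {result.pvalue:.4f}")

# A small p-value suggests the production data is out of distribution
# with the training data; the threshold below is an arbitrary example.
if result.pvalue < 0.01:
    print("Possible drift detected on this feature.")
```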

Cramér-von Mises Test

Like the Kolmogorov-Smirnov test, the Cramér-von Mises test assesses the goodness of fit between two data distributions. However, instead of looking only at the single largest difference between the CDFs, it aggregates the squared differences between them across the whole range of values. Because it weighs the entire distribution (the center and the tails), it can capture deviations in the tails that the Kolmogorov-Smirnov test may miss.
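A matching sketch for the Cramér-von Mises check, again using SciPy's two-sample implementation (available in SciPy 1.7 and later), this time with drift concentrated in the tail of the production data:

```python
# Minimal sketch: two-sample Cramér-von Mises drift check with SciPy (>= 1.7).
import numpy as np
from scipy.stats import cramervonmises_2samp

rng = np.random.default_rng(0)

# Same setup as above: a training-data feature vs. its production counterpart.
train_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)
prod_feature = rng.normal(loc=0.0, scale=1.0, size=1_000)
prod_feature[:50] += 6.0  # drift concentrated in the tail

result = cramervonmises_2samp(train_feature, prod_feature)
print(f"CvM statistic = {result.statistic:.3f}, p-value = {result.pvalue:.4f}")

# As before, the 0.01 cutoff is only an example, not a recommendation.
if result.pvalue < 0.01:
    print("Possible drift detected on this feature.")
```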

Both the Kolmogorov-Smirnov test and the Cramér-von Mises test are valuable—if different—ways to quantify if production data is out of distribution with a model’s training data. For a broader understanding of data distribution and drift detection, it makes sense to use both of them.

Key Takeaways

  • Data experts can usually tell when an AI model is drifting, but statistical measurements are needed to standardize drift detection at scale.
  • The Kolmogorov-Smirnov and Cramér-von Mises tests are two different, complementary options for quantifying model drift.

What Can Be Done About AI Model Drift?

Unfortunately, data scientists and machine learning engineers can do little to prevent model drift. The build-deploy-fail cycle that creates a Day 3 Problem persists—even with the massive expansion of AI capabilities in recent years.

But the news isn’t all bad. Even though data teams can’t stop model drift from happening, they can take steps to reduce its effects and extend the productive uptime of models. At Striveworks, we refer to this process as model remediation.

Once drift is detected through automated monitoring that tests incoming data using the Kolmogorov-Smirnov and/or Cramér-von Mises methods detailed above, models become candidates for remediation. Model remediation involves confirming that an AI model has drifted and then initiating a rapid retraining process to update the model and return it to production. Unlike removing a failing model from production and training an entirely new one to replace it, remediation happens much more quickly. It typically leverages a baseline model and fine-tunes it with appropriate data to restore performance in hours—not the days, weeks, or months frequently needed to build a new model from scratch. 
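To tie the pieces together, here is a hypothetical sketch of what an automated drift-check-and-remediation trigger might look like in Python. The helper names (fine_tune_baseline, check_and_remediate, and so on) are placeholders for illustration, not Striveworks APIs or any specific product's interface:

```python
# Hypothetical sketch of an automated drift check that flags a model for
# remediation. The remediation helper below is a placeholder stub, not a
# Striveworks API or any real library's interface.
from scipy.stats import cramervonmises_2samp, ks_2samp

P_VALUE_THRESHOLD = 0.01  # example threshold only, not a recommendation

def feature_has_drifted(train_values, prod_values):
    """Flag a feature if either test says it is out of distribution."""
    ks = ks_2samp(train_values, prod_values)
    cvm = cramervonmises_2samp(train_values, prod_values)
    return min(ks.pvalue, cvm.pvalue) < P_VALUE_THRESHOLD

def fine_tune_baseline(model_id, drifted_features):
    """Placeholder: stand-in for fine-tuning a baseline model on fresh data."""
    print(f"Fine-tuning a baseline for {model_id}; drifted features: {drifted_features}")

def check_and_remediate(model_id, training_features, production_features):
    """Each *_features argument maps feature names to arrays of values."""
    drifted = [
        name
        for name, train_values in training_features.items()
        if feature_has_drifted(train_values, production_features[name])
    ]
    if drifted:
        # Remediation, as described above: start from a baseline model,
        # fine-tune it on recent in-distribution data, and redeploy.
        fine_tune_baseline(model_id, drifted)
```

In a real pipeline, the retraining step would pull recent labeled data, re-evaluate the model, and redeploy it, but the shape of the loop is the same: detect drift statistically, then update from a strong baseline rather than rebuilding from scratch.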

We’ll explore model remediation in more detail in an upcoming post. In the meantime, learn more about the model retraining step of remediation in our recent blog post “Why, When, and How to Retrain Machine Learning Models.”

Key Takeaways

  • Model drift cannot be prevented, but its effects can be reduced.
  • The process of resolving the effects of model drift is called model remediation.
  • Remediation is much faster than training and deploying a wholly new model.
  • By remediating ML models, data teams can maximize models’ effective time in production.

Understanding Model Drift Is Essential for Effective AI

Model drift is an inevitable fact for organizations using machine learning. Because the world is always changing, machine learning models in the real world soon begin to ingest data that looks different from their training data. When this data falls out of distribution, it can wreak havoc on model performance—especially in applications that are critically important, like medicine and defense. 

Fortunately, there are solutions to fix the problems that come with model drift. Model remediation quickly retrains struggling AI models to restore their performance and return them to production. By detecting drift quickly and immediately starting the model remediation process, data teams can reduce the effects of model drift and keep their models performing in production over the long haul. 

Frequently Asked Questions

  • What is AI model drift?

Model drift—also known as model decay, model degradation, or The Day 3 Problem—occurs when the performance of a machine learning model deteriorates over time due to changes in the underlying data or the environment.

  • Why does AI model drift happen?

Model drift happens because the real-world data that models encounter in production can differ significantly from the data they were trained on. This non-stationarity in data may stem from natural changes, adversarial actions, time sensitivities, and other factors.

  • How can we detect AI model drift?

Model drift can be detected through continuous monitoring of model performance metrics, comparing predictions with actual outcomes, conducting periodic evaluations against updated datasets, and running statistical tests (such as the Kolmogorov-Smirnov and Cramér-von Mises tests) to check whether production data remains in distribution with a model’s training data.

  • What are the consequences of ignoring AI model drift?

Ignoring model drift can lead to inaccurate predictions, misguided business decisions, operational failures, and—in high-stakes scenarios—catastrophic consequences.

  • Can AI model drift be prevented?

While model drift cannot be entirely prevented, its impact can be mitigated through model remediation—the process of rapidly retraining models using data that has a distribution more closely aligned with production data.