The Reason Your AI Program Fails to Sustain Value
When you think of data scientists, what comes to mind? If you’re like most people, you probably picture a team of math PhDs training models and conducting experiments—churning through millions of GPU hours to build the next GPT.
The reality is that most data scientists spend only a fraction of their time on that kind of work.
Despite being in high demand across sectors, data scientists devote relatively few of their work hours to traditional data science.
Anaconda’s 2022 State of Data Science report shows the reality: What we think of as “classical” data science makes up only 34% of data scientists’ time. The rest goes to grunt work—data preparation, data cleansing, deployment, and similar tasks that eat up hours of brainpower. Increasingly, this work happens after models deploy—in the phase of the AI life cycle known as post-production.
Data from “State of Data Science Report 2022.” Anaconda, April 5, 2023.1
To no one’s surprise, this isn’t a great use of resources. Everyone understands how important it is to speed up model development and management, automating the mundane and repetitive tasks to enable AI to scale.
But far fewer executives understand how critical it is to apply this approach to models after they’ve gone live, especially now that the generative AI sea change is shifting the value center of the entire machine learning operations (MLOps) life cycle.
The Changing Tide of Artificial Intelligence
We have entered a new era for artificial intelligence. After a number of hype cycles and years upon years of spiky progress, confidence in machine learning is now sparking real interest outside the lab. Infrastructure has become more standardized and turnkey, foundation model backbones are usable off the shelf, and algorithms can now efficiently train massive models like the ones driving the generative AI boom of the past several quarters.
Enterprises are pumping incredible energy and focus into bringing AI into their organizations, and leaders in their fields are—by and large—succeeding. Data science teams are building models and deploying them with success.
What’s the problem?
Increasingly, data science teams are consumed by the “care and feeding” of models. Many organizations have invested a lot of thought and resources into the “how” and “when” of training and deploying models. These same organizations have given much less attention to the “how” and “when” of managing models after they are deployed. This gap is now creating immense drag on both the realized ROI of production AI and the ongoing productivity of data science teams. It explains why one study found that only 10% of AI initiatives were generating significant ROI.
The root of the issue is model degradation. Over time, the performance of models in production fades. When it does, it harms your AI projects twice over. First, it interrupts the value your models provide to a capability your business needs to function. At the same time, it pulls your data scientists’ attention away from building new models and toward troubleshooting existing ones.
Model degradation is unavoidable, but it’s not unforeseeable. With appropriate attention and resources, machine learning teams can rapidly catch it and mitigate its effects. Yet, most teams have focused so heavily on the preproduction and production phases of the machine learning life cycle that they are critically unprepared for the morass of middling performance that’s waiting for them after they deploy.
Model Failure (and What to Do About It)
Why do models degrade over time? Drift.
Drift is, in effect, a growing mismatch between the data a model sees in production and the data it was trained on. As operating conditions change, real-world data shifts outside the scope of a model’s training, causing the model to deliver outputs that stray off target. Drift comes in many varieties—data drift, concept drift, and change management failures among them—but each one results in a series of wrong inferences. Left alone, drifted models produce useless findings that provide no benefit to your decision-making process—and may even harm it. (For a detailed explanation of drift and clear strategies for reducing its impact, see our white paper Model Drift and the Day 3 Problem.)
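To make the mechanics concrete, here is a minimal sketch of one common way to quantify data drift: comparing a production feature’s distribution against its training baseline with a two-sample Kolmogorov-Smirnov test. The simulated data and the significance threshold are illustrative assumptions, not a prescription.

```python
# A minimal data drift check: compare a production feature's distribution
# against the training baseline with a two-sample Kolmogorov-Smirnov test.
# The simulated values and 0.05 threshold are illustrative assumptions.
import numpy as np
from scipy.stats import ks_2samp

def detect_data_drift(train_values: np.ndarray,
                      prod_values: np.ndarray,
                      alpha: float = 0.05) -> bool:
    """Return True if the production distribution has likely drifted."""
    result = ks_2samp(train_values, prod_values)
    return result.pvalue < alpha  # low p-value: the distributions differ

# Example: a simulated shift in a single input feature
rng = np.random.default_rng(0)
train = rng.normal(loc=0.0, scale=1.0, size=10_000)  # training baseline
prod = rng.normal(loc=0.4, scale=1.2, size=1_000)    # shifted production data
print(detect_data_drift(train, prod))                # True
```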
Data science teams have long struggled with their models drifting. But as organizations fast-track AI adoption, those without a plan to handle drift and remediate their models face a bigger challenge: At some point, their mission-critical models will degrade, draining the value of their AI initiatives and the productivity of their data scientists.
Identify Drift, Remediate Models, Sustain Value
This is a wake-up call to enterprise leaders. Greater reliance on AI creates greater risk if that AI fails. Launching more models into production doesn’t fix this problem. Consequently, the biggest opportunity for finding value from AI isn’t in training and deployment, but in model post-production. Instead of “Let’s deploy more models,” enterprise leaders need to orient their AI programs on “Let’s keep our production models performing as long as possible.”
There are three core activities during the post-production phase: monitoring, evaluation, and retraining. To keep models effective over the long haul, each of them needs attention and resources.
Monitoring
Consistent monitoring is how you discover models that are drifting. Tools that automatically track drift are your first line of defense, and monitoring models in production is the foundational step that makes sustained AI value possible. To borrow the classic business axiom: What gets measured gets managed.
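As one illustration of what automated tracking can look like, the sketch below wires a standard drift metric, the population stability index (PSI), into a recurring check. The bucket count, thresholds, and alert hook are assumptions for illustration, not any particular tool’s API.

```python
# Sketch of a recurring monitoring check using the population stability
# index (PSI). Bucket count, thresholds, and the alert hook are assumptions.
import numpy as np

def psi(baseline: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population stability index between baseline and current samples."""
    edges = np.quantile(baseline, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf     # capture out-of-range values
    b = np.histogram(baseline, edges)[0] / len(baseline)
    c = np.histogram(current, edges)[0] / len(current)
    b, c = np.clip(b, 1e-6, None), np.clip(c, 1e-6, None)  # avoid log(0)
    return float(np.sum((c - b) * np.log(c / b)))

def alert(message: str) -> None:
    print(message)  # stand-in for a real paging or ticketing integration

def check_model(baseline: np.ndarray, current: np.ndarray) -> None:
    """Run on a schedule (e.g., daily) for every model in production."""
    score = psi(baseline, current)
    if score > 0.25:       # common rule of thumb: > 0.25 means major shift
        alert(f"Major drift detected (PSI={score:.2f})")
    elif score > 0.10:     # 0.10-0.25 suggests a moderate, watchable shift
        alert(f"Moderate drift detected (PSI={score:.2f})")
```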
Evaluation
After drift is detected, data teams need context to determine next steps. How severe is the drift? How long has it been going on? Do the outputs matter forever or just for a moment? Looking at historical data alongside recent model outputs gives data science teams the information they need to determine whether a data pipeline has fundamentally changed or whether something as banal as dirt on a camera triggered a false alarm.
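A sketch of that triage step, under assumed window sizes and thresholds: scoring recent production windows against the training baseline shows both how severe the drift is and how long it has persisted, which helps separate a one-off false alarm from a genuine pipeline change.

```python
# Sketch of drift triage: score recent production windows against the
# training baseline to gauge severity and duration. The window size and
# the 0.1 KS-statistic threshold are illustrative assumptions.
import numpy as np
from scipy.stats import ks_2samp

def triage_drift(baseline: np.ndarray,
                 production_log: np.ndarray,
                 window: int = 1_000) -> dict:
    """Summarize drift severity and duration over recent windows."""
    stats = [ks_2samp(baseline, production_log[i:i + window]).statistic
             for i in range(0, len(production_log) - window + 1, window)]
    drifted = [s > 0.1 for s in stats]
    # Count consecutive drifted windows at the end of the log: one bad
    # window may be dirt on a camera; several in a row suggest the data
    # pipeline has fundamentally changed.
    duration = 0
    for flag in reversed(drifted):
        if not flag:
            break
        duration += 1
    return {"latest_severity": stats[-1], "windows_drifted": duration}
```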
Retraining
Retraining is the repair step for broken models. Feeding new, more appropriate data to a model tunes it to handle real-world data correctly. Perhaps obviously, a model’s own production data is the gold standard for retraining. By capturing model outputs, your team has a ready-made dataset for annotation that’s uniquely appropriate for the data your models see in production.
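The sketch below illustrates that loop under simple assumptions: each production inference is logged as a candidate training example, human reviewers fill in corrected labels, and the model is retrained on the verified rows. The in-memory store and model class are hypothetical stand-ins, not a reference implementation.

```python
# Sketch of closing the retraining loop: capture production inferences,
# let reviewers correct the labels, then retrain on the verified rows.
# The in-memory store and LogisticRegression model are stand-ins.
import pandas as pd
from sklearn.linear_model import LogisticRegression

def log_prediction(store: list, features: dict, prediction: int) -> None:
    """Capture each production inference as a candidate training example."""
    store.append({**features, "model_output": prediction, "label": None})

def retrain(store: list) -> LogisticRegression:
    """Retrain on the production examples that reviewers have labeled."""
    df = pd.DataFrame(store)
    labeled = df[df["label"].notna()]            # human-verified rows only
    X = labeled.drop(columns=["model_output", "label"])
    y = labeled["label"].astype(int)
    model = LogisticRegression()
    model.fit(X, y)    # the new model is tuned to current conditions
    return model
```

Because the captured examples come straight from production, the retrained model is tuned to exactly the conditions that caused the original to drift.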
Yet, the most important step for maintaining AI value happens even before models get involved. It’s when enterprise leaders understand that their shiny new models won’t perform forever, so they build a plan to remediate them when they start to drift.
More and more organizations are starting to understand this phase of the AI life cycle and put resources into post-production. Unfortunately, many more are stuck in an outdated paradigm—and their AI adoption suffers because of it.
Maintain a Plan: Plan to Maintain
The good news is that model drift isn’t insurmountable. Models are just statistical representations of datasets. They can be remediated with new data, and AI can sustain its value. But decision-makers need to understand a few critical points to ensure that they can overcome drift and keep their AI programs effective.
1. Drift happens. The world is complex and always changing, so there’s no way to prevent model drift over the long term. Know that it is going to happen, and build a business case around AI deployment that accounts for it.
2. Regular maintenance is the key to happiness. What do your teeth, cars, and AI have in common? Small, frequent maintenance helps you avoid the really big, expensive problems. No amount of exquisite tuning and training will build a model that never breaks, and it’s difficult, if not impossible, to anticipate every way a model may drift in production. Catch drift early and respond to it rapidly in post-production—the monitoring, evaluation, and retraining phase of the AI life cycle.
3. Startups do things that don’t scale. Can you? Organizations running one or two models can perhaps handle drift detection and remediation on a case-by-case basis. But the process doesn’t work when an enterprise has 20 models in production—let alone 20,000.
4. Remediation is a team sport. While we throw around shorthand like “good” or “bad” or “performant” or “non-performant” models, it’s important to remember that model performance only makes sense in the context of your business goals. A model with 80% accuracy and recall may be perfectly acceptable for serving an ad but completely unsuited for medical diagnosis.
Similarly, your remediation plans—when to retrain, whether to take an old model offline or keep it in production, how much pre-production test and evaluation is enough, et cetera—all require buy-in and understanding beyond your data science team. Your business stakeholders need to weigh in as well.
If you have yet to make a plan for model remediation, you’re in good company. Most enterprises are still working out how to introduce AI effectively. But the companies that have considered remediation are the ones set up to scale their AI adoption.
Fortunately, no one needs to reinvent the wheel. Enterprise leaders don’t need to slam on the brakes and rethink their entire AI programs. Rather, they need to update their expectations around models in production—recognizing that early success fades, but fast action can keep their AI models delivering useful insights and their programs scaling for years to come.
Jim Rebesco is part of the team at Striveworks, a machine learning operations company.
1. “State of Data Science Report 2022.” Anaconda, April 5, 2023. https://www.anaconda.com/resources/whitepapers/state-of-data-science-report-2022.