Navigating AI/ML Uncertainty: Insights Borrowed from Election Polling
Though not immediately obvious, data scientists who train and evaluate artificial intelligence or machine learning (AI/ML) models face many problems analogous to those confronted by statisticians who design and interpret election polls. Consider the following two common scenarios:
Scenario 1
A well-known polling agency releases the results of a recent poll. But, when the votes are counted, there are surprises: candidate A wins when candidate B was leading in the polls, or one candidate wins narrowly when the polls had predicted a wide margin of victory.
While it may feel like this first scenario is common, in practice, reputable polls tend to be right far more often than they are wrong (Silver 2018).
Scenario 2
A data scientist collects data to train a model. The data is properly partitioned into training, validation, and testing datasets. Several models are trained, and the validation metrics are used to select the best model. But, when the model is put into production, its performance on real production data is significantly worse than its performance on the test dataset.
Though we may not have quantitative data on how often this occurs, nearly every data scientist who has deployed a model has had to grapple with this challenge to some degree.
On the surface, these two scenarios may appear unrelated, or related only because the so-called experts (the data scientists, statisticians, etc.) “got it wrong.” However, as we will see, both scenarios are symptoms of the same underlying statistical challenge. For the remainder of this discussion, we will consider a number of problems encountered when polling to predict election outcomes. For each problem, we will discuss a technique that pollsters could employ to improve their predictions and then examine the analogies for model evaluation. The goal throughout is to find more and better ways to accurately assess the quality of AI/ML models.
Sampling to Understand Population-Level Statistics
The process of polling aims to understand the feelings or intent of an entire population (say, every registered voter or every likely voter). Since it is usually infeasible to ask every registered voter their opinion on an issue, favored candidate, etc., a relatively small number of people (a sample of the population) is typically questioned. The art and science of polling lies in designing strategies for getting a good sample of the population and then inferring population-level preferences, intents, etc., from that sample.
The AI/ML model evaluation analog of sampling a population is the creation of a training and testing dataset. As we will see, nearly every common problem we will examine here related to AI/ML model building and evaluation is some variation of poorly performed sampling.
Sample Size
One of the most intuitive contributors to effective polling is sample size, or the number of respondents to a survey or poll. It is generally understood that large samples lead to more accurate polling results than smaller samples. While that intuition is mostly correct, there is a bit of nuance worth discussing. Rather than accuracy, the most significant impact of sample size is on uncertainty, which is often reported as a poll’s estimated “margin of error.”
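To make that relationship concrete, here is a minimal sketch (in Python) of the textbook normal-approximation margin of error for an estimated proportion; the sample sizes are purely illustrative.

```python
import math

def margin_of_error(p_hat: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for an estimated proportion p_hat
    from a simple random sample of size n (normal approximation)."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

# The margin shrinks roughly like 1/sqrt(n): quadrupling the sample size
# only halves the uncertainty.
for n in (100, 400, 1600):
    print(n, round(margin_of_error(0.5, n), 3))
# 100  -> ~0.098 (about +/- 10 points)
# 400  -> ~0.049
# 1600 -> ~0.025
```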
Consider the following illustration. Suppose that we conduct a poll of three likely voters (a sample size of three) for candidates A and B. Further, suppose that we can sample those three voters uniformly at random from among all likely voters, meaning every likely voter is equally likely to be selected for the poll, and that every selected voter responds honestly. Note that sampling uniformly at random is important for this illustration; as we will discuss in later sections, it is also very hard to achieve in practice. The first symptom of our uncertainty is a lack of precision. With a sample size of three, only four “poll outcomes” are possible: A gets 0% support, A gets 33% support, A gets 67% support, or A gets 100% support (in all cases, B gets the rest of the support). In any of those scenarios, one candidate has an apparently large lead.
Thinking beyond the lack of precision, what do these numbers actually tell us? Are there any conclusions we can confidently draw? For the moment, let’s assume the poll showed a 33/67 split between the candidates (candidate A getting 33%). We know with certainty that at least one voter favors A; we therefore know that the population statistic we are estimating (the percentage of the vote candidate A would get if all voters were to vote) is greater than zero. Can we confidently say that candidate B would win if the election were held today? They are certainly leading in the only poll taken.
In Figure 1, we illustrate the probability of candidate A receiving at most one “vote” in this example poll of three people as a function of their overall support (the percentage of voters who would choose them). The downward trend should make sense: The more support they have in the population as a whole, the less likely it is that they receive few votes in the poll. Two points of note are called out in the figure. If candidate A’s actual population-level support is approximately one in three (as the poll perhaps suggests), then we’d expect them to receive at most one vote in the poll approximately 75% of the time; the remaining 25% of the time, they would have come out ahead in this poll. While a 25% chance may not sound extremely likely, it is nonetheless substantial. We might call that a 25% chance of “being surprised.” That relatively high chance of being surprised reflects the uncertainty in this poll. On the flip side, if candidate A’s support among the entire population were closer to 67%, there would still be an approximately 25% chance that our example poll would show candidate A receiving at most one vote. About the most we can conclude from our small poll is that it is more likely than not that candidate B enjoys at least as much support among the entire population as candidate A. That is uncertainty; in this case, our survey (despite its uniformly random sampling, honest answers, and full response rate) lacks confident predictive power because of the extremely small sample size.
Figure 1: A plot showing the (approximate) probability of candidate A receiving at most one vote in the poll of three people as a function of the population level of support they have among voters. Note: This assumes that the number of voters in the population is significantly larger than three.
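The curve in Figure 1 follows directly from the binomial distribution. The sketch below assumes each of the three respondents independently supports candidate A with probability equal to A’s population-level support; it reproduces the two points called out in the figure.

```python
from math import comb

def prob_at_most_one_vote(p: float, n: int = 3) -> float:
    """Probability that candidate A receives at most one vote in a poll of n
    respondents, assuming each respondent independently supports A with
    probability p (A's true population-level support)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in (0, 1))

print(round(prob_at_most_one_vote(1 / 3), 2))  # ~0.74: behind in this poll ~75% of the time
print(round(prob_at_most_one_vote(2 / 3), 2))  # ~0.26: still a ~25% chance of appearing behind
```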
The analog to sample size in model evaluation is the size of a test dataset. Having a test dataset that is too small causes the same types of problems for model evaluation: lack of precision and too much uncertainty. The source of the uncertainty is the fact that the population (all the data that the model may see in production) is potentially so much larger than the test set that it is somewhere between hard and impossible to know how well the test data represents the real data that the model will later encounter.
The remedy for small sample size (small test dataset) is simple: Get more data. How one goes about getting that data matters (as we will discuss in more detail later), but there are several options to consider. If the test dataset is artificially small, one could simply rebalance/re-split their data and retrain the model (to ensure that the new model wasn’t exposed to a data leak). A dataset might be artificially small if one had a large amount of data to partition between training and testing and simply chose to put nearly all of the data in the training set. If the test dataset is not artificially small, then new data would need to be sourced. That data could be sourced while the model is in production (e.g., in the process of monitoring the model as it encounters real data) or curated before a model is put into production (if the risks of using an uncertain model in a production environment are too high).
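As a concrete sketch of the re-splitting option, the snippet below uses scikit-learn’s train_test_split on a synthetic stand-in dataset; the 20% test fraction and the stratification by label are illustrative choices, not universal rules.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real dataset.
X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)

# Re-split with a more generous test fraction (here 20%), stratified by label
# so the class balance of the test set mirrors the full dataset. Any model
# evaluated on the new test set must be retrained from scratch so it never
# saw these examples during training (avoiding a data leak).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)
print(len(X_train), len(X_test))  # 4000 1000
```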
Selection Bias
In the example we shared above, our assumptions were that we could sample potential voters in such a way that every voter was equally likely to be questioned (uniformly at random), that respondents would be honest, and that everyone questioned would, in fact, respond. In reality, though, not everyone is eager to take a survey or answer the questions of a pollster. If you receive a phone call from an unknown number, how likely are you to answer? And, if you answer, how likely is it that you’ll want to talk to the person on the other end about your political views? These are examples of potential sources of selection bias: sampling the population in a way that is not uniformly random, so that some voters are more likely than others to be questioned or to respond.
What impact does selection bias have? The root problem is that one particular segment of the population will be overrepresented in the data. For example, it used to be the case that most polling was done by phone. Before many people had cellular phones (or smartphones), pollsters might only call landline numbers (partly to avoid oversampling the population that would own a cellular phone and partly due to laws against automated dialing of cellular numbers). At the time, most people still had landline phones, so the practice was reasonable. Today, however, if polling were only done via phone calls to landlines, it would be impossible to reach many people, and the demographic of people with a landline phone skews older, so the poll would overrepresent the views of older people. This wouldn’t be a problem if younger and older people’s political views were, on average, largely similar, but that is often not the case. Numerous other practices can also skew responses (and thereby generate selection bias), such as offering a small cash reward for responding (which skews toward those for whom the reward matters most) or running the survey online (which skews toward those who would see it, filtering by age and by the particular websites one frequents).
What can be done to mitigate the effects of selection bias? For pollsters, this can be particularly challenging: How do you get information from people who won’t respond to questions? But, at a minimum, one should be aware of how their polling may create a selection bias (e.g., only surveying a certain portion of the population) so they can design better polling strategies. One could also consider techniques such as weighting responses so that responses of people in more challenging-to-reach groups get higher weight—though this can have its own negative outcomes by increasing the impact of a few individuals (echoing the small sample size problems).
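As a minimal sketch of the weighting idea, with entirely hypothetical numbers, the snippet below reweights responses so that each group counts in proportion to its known population share rather than its share of respondents. Note how the small rural sample receives a large weight, which is exactly how a few individuals can gain outsized influence.

```python
# Hypothetical post-stratification: weight each respondent group so that its
# share of the weighted sample matches its known share of the population.
population_share = {"urban": 0.60, "rural": 0.40}   # e.g., from census data
sample_share     = {"urban": 0.80, "rural": 0.20}   # who actually responded
support_for_A    = {"urban": 0.55, "rural": 0.35}   # observed support by group

# Unweighted estimate overcounts the easier-to-reach urban respondents.
unweighted = sum(sample_share[g] * support_for_A[g] for g in sample_share)

# Weight = population share / sample share; the small rural sample gets a
# large weight, which is how a handful of respondents can gain outsized sway.
weights = {g: population_share[g] / sample_share[g] for g in sample_share}
weighted = sum(sample_share[g] * weights[g] * support_for_A[g] for g in sample_share)

print(f"unweighted estimate of A's support: {unweighted:.2f}")  # 0.51
print(f"weighted estimate of A's support:   {weighted:.2f}")    # 0.47
```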
The analog of selection bias in model evaluation is not sampling data (for training/testing) sufficiently similar to the real production data. Frequently, this occurs because it is either hard to find data similar to the production data or it is simply much easier to gather less representative data. For example, suppose you wanted to train a model to recognize vehicles in overhead imagery. Since these objects are usually small (in commercial satellite imagery, for example), it could be hard to identify the make of a vehicle seen “in the wild.” So, to simplify, you collect imagery over dealerships and use your knowledge of which dealership the image is collected over to inform your image annotations. This likely gives you reasonably accurate training data. However, your model may also learn to recognize characteristics of dealerships (if they exist) and likely only sees vehicles in ideal situations (clean, new, and clearly displayed). But, when you apply your model to a Walmart parking lot, you discover that not every vehicle is in “like new” condition, there are no dealership characteristics for your model to exploit, etc. Your model may fall flat. You didn’t foresee this since the testing data was not representative of how you wanted to use the model.
What can model builders do? The most impactful thing a model builder can do is to ensure that at least their testing data (if not their training data) reflects as accurately as possible the data that the model will see when it is in use. In our example above, that means ensuring that the test data includes imagery over non-dealership parking lots. Sometimes the most relevant data is just not available—for example, if your goal is to have imagery of military vehicles in combat situations (instead of simply “parked” at a military base). If those vehicles have not been in combat situations because the country has not been in armed conflict, then there may not exist any relevant imagery at the start of conflict. In this case, rapid remediation is crucial—monitoring live performance of the best models you have so far, collecting new live data to use for future training, and retraining models (with the new data) when/if their performance is unacceptably low.
Unequal Weight or Importance of Individuals
In the United States, presidential elections are not determined by a national popular vote; instead, they are determined by the electoral college. Roughly, each state is allocated electoral votes equal to that state’s total representation in Congress. States award their electoral votes to candidates according to their own state laws, which, in most cases, means that the popular vote winner in that state receives all of the electoral votes allocated to the state. Since every state has two senators and the seats in the House of Representatives are divided among the states roughly proportionally to population, the number of voters per electoral vote awarded (i.e., the number of voters in a state divided by the number of electoral votes that state awards) varies quite a bit from state to state. The consequence of these facts is that reputable polling in a presidential race must consider states individually; predicting a national popular vote is not terribly useful.
Just as not all votes in a presidential election carry the same weight, it may be the case that not every datum in a test dataset should carry the same weight. Data points in a test dataset may not be equally important for at least two reasons. First, a model may be broader than what is needed for a narrow application. For example, consider the vehicle make/model classifier that we described earlier. That model may have also been trained to identify buses, semi-trucks, sanitation vehicles, etc. If our use case is counting vehicles in a Walmart parking lot for the purpose of estimating foot traffic in the store, then we may not care much about the performance of the model on sanitation vehicles, which we wouldn’t expect to see in those parking lots. And, if we did see a sanitation vehicle at a Walmart, it is more likely that it is present to collect garbage than that the driver needed to shop in the store. A second reason our test data points may not deserve equal weight is that they may not equally reflect how a model performs on the intended production data. For example, if our test dataset contained both data from dealerships and Walmart parking lots, and our intent was to use the model on images of Walmart parking lots, then we should be much more concerned about the performance on those images in our test data than on the images over the dealership. It would be nice if the model worked in both scenarios, but what is crucial for application performance is that the model works where it is intended to work.
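One way to make that prioritization explicit is to weight the evaluation metric itself. The sketch below, with hypothetical class names and weights, computes an importance-weighted accuracy in which sanitation vehicles barely count toward the parking-lot use case.

```python
import numpy as np

# Hypothetical per-class importance weights for the parking-lot use case:
# passenger vehicles matter a lot, sanitation vehicles barely at all.
class_weights = {"sedan": 1.0, "suv": 1.0, "pickup": 1.0, "sanitation": 0.05}

def weighted_accuracy(y_true, y_pred, weights):
    """Accuracy in which each test example counts in proportion to how much
    its (true) class matters for the intended application."""
    w = np.array([weights[c] for c in y_true], dtype=float)
    correct = np.array([t == p for t, p in zip(y_true, y_pred)], dtype=float)
    return float((w * correct).sum() / w.sum())

y_true = ["sedan", "suv", "sanitation", "pickup"]
y_pred = ["sedan", "suv", "sedan",      "pickup"]
print(weighted_accuracy(y_true, y_pred, class_weights))  # ~0.98 (vs. 0.75 unweighted)
```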
Regional Distribution
When the popular vote does matter (e.g., for a statewide office or to award a state’s electoral votes), the goal is to understand the opinions, preferences, etc., of the entire population of that state. It is not sufficient to merely have a large sample of the population of that state. We previously mentioned that our ideal sample was one in which every likely voter (or, everyone in the statistical “population”) was equally likely to be surveyed. Part of being equally likely to be surveyed includes a broad regional sampling. In theory, if we could easily reach into a bag and draw out a random voter, then we wouldn’t need to explicitly regionally sample—we’d get regional representation, on average, proportionate to the population of that region. But, in practice, it is hard to reach into a bag of all eligible voters and pull out the name and contact information of any one of them with equal probability. Therefore, it is easy to inadvertently skew polling toward overrepresented regions where it may be easier to conduct a poll or where polls may have higher response rates, etc. To mitigate this issue, pollsters can increase their efforts to survey in regions that they know are underrepresented in their polling. For example, using census data, pollsters may know that a certain percentage of a state’s population lives in cities and that the remaining percentage of the population lives in rural areas. If their responses are disproportionately high in cities, they can focus more effort in polling rural areas where, due to the relatively small sample size, uncertainty is highest.
Just as polling can differ from region to region, model performance can similarly vary. Consider a model that detects naval ships, commercial ships, and smaller (private) pleasure craft in overhead imagery. It is unlikely that the model is equally performant on the three classes of ships. As a consequence, the overall performance of the model on imagery of a naval base may vary significantly from the performance of that model over a commercial port or a marina. A good test set would include representation of each of these kinds of locations (assuming that performance on each type of ship is equally important—see the previous section).
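Beyond composing the test set well, it helps to report metrics per location type rather than as a single aggregate number. Here is a minimal sketch with made-up results and a hypothetical location_type tag.

```python
import pandas as pd

# Hypothetical evaluation results, one row per test example, tagged with the
# type of location the image came from.
results = pd.DataFrame({
    "location_type": ["naval_base", "naval_base", "commercial_port",
                      "commercial_port", "marina", "marina"],
    "correct": [True, True, True, False, False, False],
})

# Report the metric per slice rather than only in aggregate: the single
# overall number (50%) hides the fact that performance at marinas is far worse.
print(results.groupby("location_type")["correct"].mean())
print(results["correct"].mean())  # 0.5
```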
Shifting Opinions
Polling is never done. There are two reasons for this. First, over longer time periods, candidates change; sometimes, those candidates even change during the election cycle. In the 2024 US presidential race, for example, President Biden suspended his campaign and Vice President Harris became the new Democratic candidate. Polling for Biden could not be directly substituted for polling for Harris, even though polls of the two candidates were likely highly correlated. And, at least one candidate usually changes from election cycle to election cycle, so polling from previous election cycles has little (if any) meaning for the current cycle. The second reason that polling is never done is that people’s opinions, perceptions, enthusiasm, intents, etc., change over time. Preference changes rarely go so far as to result in crossing party lines (e.g., a Democrat voting for a Republican), but people’s enthusiasm for “their party’s candidate” can vary quite a bit, and that has a direct impact on who shows up to vote.
Just as opinions, preferences, and enthusiasm shift, the data that a model sees can change over time, or sometimes suddenly. We use the term data drift for such changes in the characteristics of the data. Consider our example of identifying vehicle make/models from overhead imagery. Imagery collected in the spring or summer (on, for example, clear, sunny days) could look significantly different from imagery of the same location in winter, when vehicles may be frosted over or partially covered with snow. Those changes are relatively slow. Other changes can occur quickly, as in the case of storm damage or flooding. For example, parking lots in Asheville, NC, probably looked significantly different on September 28, 2024, than they did on September 26, after the remnants of Hurricane Helene caused massive flooding in that city. Similarly, armed conflict can quickly change the landscape of cities, roads, etc.
Just as polling is never finished, model evaluation (for a model being actively used) should never be finished either. This ongoing evaluation can take many forms, but there are at least two common approaches. The first, which can be automated, is monitoring data for drift. There are many strategies that could be employed to monitor for drift; Striveworks’ Chariot platform includes automated drift detection for models trained and deployed in Chariot. These detection strategies can alert people when drift has likely occurred and, therefore, when model performance is likely to degrade. Once drift is detected, a model builder/maintainer must collect data that reflects the new data characteristics. Frequently, that can be done by collecting recent, live production data. A second method for continuous model evaluation is to periodically spot check random samples of current data. It is possible for model performance to degrade even before an automatic drift detection algorithm recognizes a statistically significant difference in the data. Besides catching drift early, spot checking can also help identify other issues that may have been missed during the initial testing and evaluation, e.g., an unknown bias in the initial test dataset.
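As one simple illustration of automated drift monitoring (a generic approach, not a description of how any particular platform implements it), the sketch below applies a two-sample Kolmogorov-Smirnov test to compare a reference window of values, such as a model input feature or a confidence score, against the most recent window.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Reference window: e.g., feature values (or model confidence scores) observed
# shortly after deployment. Current window: the most recent production data.
reference = rng.normal(loc=0.0, scale=1.0, size=2_000)
current = rng.normal(loc=0.4, scale=1.0, size=2_000)  # the distribution has shifted

# Two-sample Kolmogorov-Smirnov test: a small p-value suggests the two windows
# were not drawn from the same distribution, i.e., drift has likely occurred.
stat, p_value = ks_2samp(reference, current)
if p_value < 0.01:
    print(f"Possible drift detected (KS statistic={stat:.3f}, p-value={p_value:.2e})")
```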
Do Polls Influence Behavior?
One potential complicating factor of polling (and of broadcast media) is the question of whether or not poll results influence people’s opinions or actions. There are two associated concerns: The first is that polls discourage people from voting if they feel like it is hopeless for their preferred candidate, especially if they are a “red” voter in a “blue” state or vice versa. The second, often called the bandwagon effect, is that some voters (perhaps those who are “on the fence” or who don’t have a strong preference) may feel inclined to support the apparent winner (who doesn’t want to be on the winning team?). The concern is not so much which candidate wins but whether or not the polling affected the outcome of the race; if polling can impact the outcome, is it being conducted and analyzed in a statistically grounded, methodologically sound manner? Or is the poll (or the reported results) intentionally biased for the purpose of promoting one candidate or the other? Unfortunately, discerning which polls are “legitimate” can be hard and, for most people, requires trust in the reputation of the institution conducting the poll. While it may intuitively feel like polls could influence behavior, the question of whether they actually do remains open (Koerth 2019).
Do model testing results impact a model’s performance? The obvious answer is: No, they don’t. Or, at least they shouldn’t. But, there are a number of ways in which test results could influence outcomes that might not be so obvious.
First, consider the scenario in which a model user is trying to decide which of two models to use for a specific application. Either model could serve in the application, but which is better? Most likely, that model user is going to compare the scores of these models on some shared or benchmark dataset and then choose whichever had better metrics. That choice of which model to use is a direct impact of test results; assuming that the two models do not perform identically, that choice directly impacts the application where the model is being used (for better or worse). In this case, it is certain that test metrics impacted the application outcome. What is more uncertain is whether or not the application outcomes could have been better with the other model. Just as polling results have a margin of error, there is inherent (and hard to measure) uncertainty in model performance (e.g., I expect my model’s performance on production data to be “test metric ± uncertainty,” but I don’t know how big that uncertainty is). Hopefully, the shared test dataset was representative of the “real” data that the application would consume, so that the test metric is as meaningful as possible.
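One way to quantify at least the sampling component of that uncertainty is to bootstrap the test metric: resample the test set with replacement many times and recompute the score. A minimal sketch with synthetic per-example results follows; note that this captures only test-set sampling variance and says nothing about any mismatch between the test data and production data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-example correctness (1 = correct, 0 = wrong) on a hypothetical test set.
correct = rng.binomial(1, p=0.82, size=500)

# Bootstrap: resample the test set with replacement many times and recompute
# the metric to get a rough interval around the single reported number.
boot_scores = [rng.choice(correct, size=correct.size, replace=True).mean()
               for _ in range(2_000)]
lo, hi = np.percentile(boot_scores, [2.5, 97.5])
print(f"accuracy = {correct.mean():.3f}, 95% bootstrap CI = [{lo:.3f}, {hi:.3f}]")
```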
Another scenario where test metrics could influence outcomes is if prediction thresholds are chosen using that test data. The number of false positives or negatives is (at least partly) a function of the threshold at which one considers a prediction usable. It is tempting to use testing data to determine what that ideal threshold should be—after all, one isn’t changing the model itself based on the testing data, merely changing how we interpret the model output based on test data. While one could argue the merits/drawbacks of this approach, the fact is that choosing a threshold in this fashion will inflate scores on the test data. That is not necessarily bad—if the test data is sufficiently reflective of the real production data that the model will see. But, it could easily become bad if the test data doesn’t reflect the real data well.
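A safer pattern, sketched below with synthetic scores and labels, is to select the operating threshold on a separate validation split and then report the metric at that fixed threshold on the untouched test set. The threshold grid and the F1 objective are illustrative choices.

```python
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(1)

# Synthetic scores and labels for a validation split and a test split. The
# scores loosely track the labels but overlap, as real model outputs do.
val_labels = rng.integers(0, 2, 1_000)
val_scores = 0.35 * val_labels + 0.65 * rng.random(1_000)
test_labels = rng.integers(0, 2, 1_000)
test_scores = 0.35 * test_labels + 0.65 * rng.random(1_000)

# Pick the operating threshold on the validation split...
thresholds = np.linspace(0.05, 0.95, 19)
best_t = max(thresholds, key=lambda t: f1_score(val_labels, (val_scores >= t).astype(int)))

# ...then report the metric at that fixed threshold on the untouched test split,
# so the reported number is not inflated by tuning on the test data itself.
test_f1 = f1_score(test_labels, (test_scores >= best_t).astype(int))
print(f"threshold={best_t:.2f}, test F1={test_f1:.3f}")
```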
Conclusion
Model evaluation is much more than simply a number on some standard test dataset. The real power of a test set comes from its ability to accurately predict a model’s performance on real production data. Test datasets need to be built from a sufficiently large sample of data to provide adequate precision and to minimize uncertainty. But sample size alone is not enough to guarantee meaningful results from a test dataset. The test dataset needs to represent the full breadth of data on which the model is intended to work well, and it should reflect the current state of a changing world; it cannot be static or allowed to age too much. This implies that model evaluation is a continuous process that should never end.
References
Silver, Nate. “The Polls Are All Right,” FiveThirtyEight, last modified May 30, 2018, https://fivethirtyeight.com/features/the-polls-are-all-right/.
Koerth, Maggie. “Does Knowing Whom Others Might Vote For Change Whom You’ll Vote For?,” FiveThirtyEight, last modified December 5, 2019, https://fivethirtyeight.com/features/does-knowing-whom-others-might-vote-for-change-whom-youll-vote-for/.