Machine learning benchmarks like ImageNet, COCO, and the LLM Leaderboard usually target a single metric, such as accuracy for classification tasks or mean average precision for object detection. But for real-world problems, judging performance by a single metric is usually not a good idea, and it can even be misleading. Consider a fraud detection model: if only 0.1% of transactions are fraudulent, then a machine learning model that predicts that no transaction is fraudulent will be 99.9% accurate, yet it is completely useless.
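To make the arithmetic concrete, here is a minimal, self-contained sketch in plain Python (no Valor required) of that accuracy paradox, using hypothetical transaction counts: a classifier that never flags fraud scores 99.9% accuracy while catching zero fraudulent transactions.

# Hypothetical transaction stream: 100,000 transactions, 0.1% of them fraudulent.
n_total = 100_000
n_fraud = 100  # 0.1% of 100,000

true_labels = [1] * n_fraud + [0] * (n_total - n_fraud)  # 1 = fraud, 0 = legitimate
predictions = [0] * n_total                              # a "model" that never flags fraud

accuracy = sum(p == t for p, t in zip(predictions, true_labels)) / n_total
fraud_recall = sum(p == 1 and t == 1 for p, t in zip(predictions, true_labels)) / n_fraud

print(f"accuracy:     {accuracy:.3%}")      # 99.900%
print(f"fraud recall: {fraud_recall:.0%}")  # 0%: every fraudulent transaction is missed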
Even a host of other metrics, such as class-wise precision and recall, confusion matrices, or receiver operating characteristic (ROC) curves, will not give a complete picture. What these metrics lack is an understanding of performance bias: when a model performs worse on a particular segment of the data than on the data as a whole. The history of machine learning offers plenty of examples of performance bias, many of them newsworthy: AI models biased against people of color in healthcare, lending, and facial recognition, and LLMs shown to exhibit geographic bias.
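As a toy illustration of performance bias, again in plain Python with made-up results: a model's overall accuracy can look respectable while one segment of the data fares much worse.

# Hypothetical per-example results: (segment the example belongs to, was the prediction correct?)
results = (
    [("A", True)] * 90 + [("A", False)] * 10
    + [("B", True)] * 60 + [("B", False)] * 40
)

def accuracy(rows):
    return sum(correct for _, correct in rows) / len(rows)

print(f"overall accuracy:   {accuracy(results):.0%}")                              # 75%
print(f"segment A accuracy: {accuracy([r for r in results if r[0] == 'A']):.0%}")  # 90%
print(f"segment B accuracy: {accuracy([r for r in results if r[0] == 'B']):.0%}")  # 60%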
Striveworks now has an open-source tool, Valor, for understanding these types of bias. This model evaluation service exposes performance bias by defining subsets of data through filters on the attributes of Valor objects and on any arbitrary metadata attached to them. It has first-class support for filtering on datetimes, standard data types, geospatial (GeoJSON) metadata, and geometric attributes such as area.
Below, we explore how machine learning teams can use Valor to gauge these sorts of model biases.
Valor is an open-source model evaluation service created to assist machine learning practitioners and teams in understanding and comparing model performance. It’s designed to fit into a modern MLOps tech stack; in particular, Valor is the model evaluation service for the Striveworks end-to-end MLOps platform.
Valor does the following: it manages datasets, models, ground truths, and predictions; computes evaluation metrics over them; and lets users filter those evaluations by metadata and attributes to surface performance on specific segments of the data.
Valor runs as a back-end service that users interact with via a Python client. For detailed information on setting up and using Valor, see the official documentation.
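As a quick orientation, and as an assumption on our part rather than a definitive setup guide, a session typically starts by pointing the Python client at a running Valor API instance. The host URL below is a placeholder, and the connect helper is the one described in the getting-started documentation; check the docs for the exact call and deployment options.

# Minimal sketch (assumed setup): point the Valor Python client at a running back end.
# "http://localhost:8000" is a placeholder URL; see the official docs for details.
from valor import connect

connect("http://localhost:8000")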
Valor identifies model performance bias through its robust metadata and attribute filtering.
To represent datasets, models, predictions, and ground truth data, the Valor Python client provides a handful of fundamental classes, including Dataset, Model, Datum, Annotation, Label, GroundTruth, and Prediction.
Using Valor, the basic workflow is as follows:
from valor import Annotation, Dataset, Datum, GroundTruth, Model, Prediction

# Create a dataset and attach ground truth annotations to its datums.
dataset = Dataset.create("dataset name")
dataset.add_groundtruth(GroundTruth(datum=Datum(...), annotations=[Annotation(...), ...]))

# Create a model and attach its predictions over the same dataset.
model = Model.create("model name")
model.add_prediction(dataset, Prediction(datum=Datum(...), annotations=[Annotation(...), ...]))

# Run an evaluation (here, a classification evaluation) of the model against the dataset.
model.evaluate_classification(dataset)
One of Valor's key strengths is that all of the above objects can have arbitrary metadata associated with them. Users can filter on metadata and attributes (such as class label or bounding-box size) to define subsets of data and then use those subsets for evaluation. This provides a means of quantifying model performance on different segments of the data.
Based on this metadata and these attributes, Valor users can pass several types of filters to evaluations.
Dates and times can be added as metadata using Python's datetime module. For example:
from datetime import datetime, time
from valor import Datum

Datum(
    uid=<UID>,
    metadata={"date": datetime(year=2024, month=2, day=12), "time": time(hour=17, minute=49, second=25)}
)
Then, if we want to evaluate the performance of an object detection model on images taken during the day, we would do something like:
model.evaluate_detection(
    datasets=dataset,
    filter_by=[Datum.metadata["time"] >= time(hour=8), Datum.metadata["time"] <= time(hour=17)]
)
Or, to know how a classification model performs on data collected since the start of 2023, we would do:
model.evaluate_classification(
    datasets=dataset,
    filter_by=[Datum.metadata["date"] >= datetime(year=2023, month=1, day=1)]
)
Valor also supports the standard data types (int, float, str, bool) as metadata values, along with filtering on them.
For example, demographic information may be attached as:
Datum(uid=<UID>, metadata={"sex": "Female", "age": 62, "race": "Pacific Islander", "hispanic_origin": False})
Then, to evaluate how a model performs on all female- and Hispanic-identifying people under the age of 50:
model.evaluate_classification(
    datasets=dataset,
    filter_by=[
        Datum.metadata["sex"] == "Female",
        Datum.metadata["age"] < 50,
        Datum.metadata["hispanic_origin"] == True,
    ]
)
Metadata can be attached to objects besides datums. For example, suppose we’re evaluating an object detection model for a self-driving vehicle, and we want to know how well the model performs on pedestrians in the road versus not in the road. In this case, we can attach a boolean metadata field to every person-bounding-box annotation and use this to filter object detection evaluation:
dataset.add_groundtruth(
    GroundTruth(
        datum=Datum(...),
        annotations=[
            Annotation(
                task_type=TaskType.OBJECT_DETECTION,
                bounding_box=person_bbox,
                labels=[Label(key="class", value="person")],
                metadata={"in_road": True}
            ),
            ...
        ]
    )
)

model.evaluate_detection(dataset, filter_by=[Annotation.metadata["in_road"] == True])
We explore this particular example in end-to-end detail in one of our sample notebooks.
Valor supports GeoJSON dicts as metadata, which can then be filtered with geometric operations, such as checking whether a point lies inside a region or whether two regions intersect. For example, suppose every piece of data has an associated collection location. We can add this as metadata to the datum:
Datum(uid=<UID>, metadata={"location": {"type": "Point", "coordinates": [-97.7431, 30.2672]}})
Now, if we want to see how a model performs on data that was collected from a certain city, we can do the following (where city_geojson is a GeoJSON dict specifying the city):
model.evaluate_classification(
    datasets=dataset,
    filter_by=[Datum.metadata["location"].inside(city_geojson)]
)
Finally, for geometric tasks (such as object detection and segmentation), we can filter annotations by geometric properties, such as area. For example, to restrict an evaluation of a segmentation model to regions covering an area of less than 100,000 square pixels, we can use:
model_seg.evaluate_detection(
    dataset,
    filter_by=[Annotation.raster.area < 100000]
)
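Filters of different kinds can also be listed together in a single filter_by argument. Here's a sketch that simply reuses the metadata fields from the examples above (the daytime window and the hypothetical in_road annotation flag), assuming, as those examples suggest, that datum-level and annotation-level filters can be combined in one call:

# Evaluate detections only on daytime imagery and only on pedestrians
# annotated as being in the road (fields introduced in the examples above).
model.evaluate_detection(
    datasets=dataset,
    filter_by=[
        Datum.metadata["time"] >= time(hour=8),
        Datum.metadata["time"] <= time(hour=17),
        Annotation.metadata["in_road"] == True,
    ]
)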
Valor is a game changer for understanding model performance bias. By filtering model evaluations on metadata and attributes, machine learning practitioners gain deep insight into how their models perform on whole datasets and, crucially, on different segments within a single dataset. That insight is essential to understanding how a model will actually perform in the real world.
We encourage you to experiment with Valor and let us know how you use it to evaluate your ML models. Check out the Valor GitHub repository to start using it in your machine learning workflows today, and read Valor’s official documentation to learn more.
Looking for more information about the Valor evaluation service? Read our other blog posts:
Striveworks Introduces Valor, the Open-Source Tool to Evaluate Models
Eric Korman Explains Valor and Its Step Change for Model Evaluation