The goal of machine learning is to predict the future, based on data from the past. 

It’s more important than ever to make predictions that match reality, but as the world changes around us, so does the data that is used to generate new predictions. 

Machine learning models fail silently: they will make predictions even if the incoming data looks nothing like the data they were trained on. They will make predictions for scenarios and situations they were never trained for. Those predictions will be inaccurate and, worst of all, confidently incorrect.

And these incorrect predictions will influence business decisions, impacting both dollars and human lives.

Using Ground Truth to Calculate Model Performance  

The most straightforward way to measure model success in production is to compare predictions with reality, using performance metrics such as accuracy, precision, recall, and F1. These metrics show how the model is holding up in the production environment and when it needs retraining (or tuning). They also help data scientists and model owners quantify the gap between predicted values and ground truth.
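As a minimal sketch of what comparing predictions with ground truth looks like for a binary classifier, the snippet below derives accuracy, precision, recall, and F1 from confusion counts. The function name, the dataclass, and the sample labels are illustrative assumptions, not part of any particular product.

```python
from dataclasses import dataclass


@dataclass
class BinaryMetrics:
    accuracy: float
    precision: float
    recall: float
    f1: float


def binary_metrics(predictions, ground_truth, positive=1):
    """Compare predicted labels with ground truth labels for a binary task."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have the same length")
    if not predictions:
        raise ValueError("at least one labeled prediction is required")

    pairs = list(zip(predictions, ground_truth))
    tp = sum(1 for p, t in pairs if p == positive and t == positive)
    fp = sum(1 for p, t in pairs if p == positive and t != positive)
    fn = sum(1 for p, t in pairs if p != positive and t == positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Each ratio is defined as 0.0 when its denominator is empty.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return BinaryMetrics(accuracy, precision, recall, f1)


# Made-up labels, for illustration only.
predicted = [1, 0, 1, 1, 0, 0, 1, 0]
actual = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(predicted, actual)
print(metrics)
```

Which of these metrics matters most depends on the use case: in fraud detection, for example, a missed fraud (low recall) and a false alarm (low precision) usually carry very different costs.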

Machine learning teams rely on ground truth to test the predictions their algorithms make against the real world. No ML model guarantees 100% accuracy, but the goal is to get as close to it as possible. Each business area responsible for oversight of ML models in production sets its own tolerance for performance metrics. This threshold is carefully defined based on many factors, including the potential impact on revenue (both positive and negative).

Depending on the use case and the nature of the data, model performance can start degrading as soon as the model is deployed to production, and without a monitoring solution in place that degradation goes unnoticed. If, after weeks in production, the accuracy of that model has dropped to 70%, it could be well below what business leadership considers acceptable, causing a measurable impact on business KPIs such as revenue. With an AI Performance solution like Arthur that provides continuous model monitoring, it’s easy to detect when model accuracy falls below the tolerance threshold and to correct for it.
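To make the idea of a tolerance threshold concrete, here is a small sketch that flags the time windows in which accuracy falls below a business-defined tolerance. The weekly figures, the 0.85 tolerance, and the function name are hypothetical values chosen for illustration; a monitoring platform would perform an equivalent check continuously.

```python
def windows_below_tolerance(weekly_accuracy, tolerance):
    """Return the (week, accuracy) pairs whose accuracy is below the tolerance."""
    return [(week, acc) for week, acc in weekly_accuracy if acc < tolerance]


# Hypothetical weekly accuracy values and a hypothetical business tolerance.
weekly_accuracy = [("week 1", 0.91), ("week 2", 0.88), ("week 3", 0.79), ("week 4", 0.70)]
TOLERANCE = 0.85

breaches = windows_below_tolerance(weekly_accuracy, TOLERANCE)
for week, acc in breaches:
    print(f"{week}: accuracy {acc:.2f} is below the tolerance of {TOLERANCE:.2f}")
```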

Measuring performance is straightforward when the ground truth, the known real-world outcome, is available for every prediction. However, every model is limited by the quality of the ground truth used to train and test it, and by the timing and availability of ground truth data in production.

Ground Truth Challenges 

1. Accounting for time series and seasonal variability

Delayed ground truth is common whenever time passes between the model’s prediction and the moment the ground truth becomes available. One example is the financial services industry, where customers have up to 3 months after a suspect transaction to flag it as fraudulent. That creates a lag of up to 90 days; problems don’t surface in real time when the original transaction occurs.

2. Definitions for ground truth vary across the organization 

Supervised learning requires a large volume of diverse data with corresponding correct ground truth labels. Enterprise datasets are siloed in systems across the organization and often complex in nature. These systems are often not interoperable. Ground truth consistency suffers when there are missing, inconsistent or edge case annotations.     

3. Computer vision and NLP models require human-in-the-loop labeling

When you are working with model types like computer vision (CV) or natural language processing (NLP), ground truth labels are not readily available without manual annotation, which is labor-intensive. With NLP, you cannot always rely on the literal meaning of the words; someone must infer the customer’s intent or satisfaction.

Solving for Ground Truth 

While some ground truth challenges must be addressed earlier in the ML lifecycle, at the organizational level or during model development, delayed ground truth can be handled with Arthur’s performance technology.

In some use cases, the ground truth will be available seconds after the prediction; in others, it could take months or years. Arthur’s platform allows for ground truth data to be updated at any moment, regardless of when the inference was recorded and with no data duplication. Performance metrics are then recalculated on the fly. 

Ground truth data can be updated individually (for every inference, as the data becomes available), or in bulk. With full support through the SDK and API, there are many different options to automate ground truth updates. Very simple scripts can be used to retrieve data from log files, databases or other sources and leverage the Arthur API to update ground truth data.
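The pattern behind late-arriving ground truth can be sketched without any particular platform: predictions are stored under an inference ID, labels are attached to that ID whenever they arrive, and metrics are recomputed over whatever is labeled so far. The class below is a toy in-memory stand-in written for this illustration; its names are hypothetical and it is not the Arthur SDK or API.

```python
class InferenceLog:
    """Toy in-memory store keyed by inference ID (hypothetical, not a real API)."""

    def __init__(self):
        self._predictions = {}
        self._ground_truth = {}

    def record_prediction(self, inference_id, predicted_label):
        self._predictions[inference_id] = predicted_label

    def update_ground_truth(self, inference_id, actual_label):
        """Attach (or overwrite) the label for one inference."""
        if inference_id not in self._predictions:
            raise KeyError(f"unknown inference id: {inference_id}")
        self._ground_truth[inference_id] = actual_label

    def bulk_update_ground_truth(self, labels):
        """Attach labels for many inferences at once from an {id: label} mapping."""
        for inference_id, actual_label in labels.items():
            self.update_ground_truth(inference_id, actual_label)

    def accuracy(self):
        """Accuracy over labeled inferences only; None until a label arrives."""
        labeled = [i for i in self._predictions if i in self._ground_truth]
        if not labeled:
            return None
        correct = sum(1 for i in labeled if self._predictions[i] == self._ground_truth[i])
        return correct / len(labeled)


log = InferenceLog()
for inference_id, label in [("tx-1", "ok"), ("tx-2", "fraud"), ("tx-3", "ok"), ("tx-4", "ok")]:
    log.record_prediction(inference_id, label)

before = log.accuracy()                # no ground truth has arrived yet
log.update_ground_truth("tx-1", "ok")  # a single late label
after_one = log.accuracy()
log.bulk_update_ground_truth({"tx-2": "fraud", "tx-3": "fraud", "tx-4": "ok"})
after_bulk = log.accuracy()
```

Because labels are keyed by inference ID rather than appended as new rows, the prediction itself is never duplicated, and the metric simply reflects more of the data each time it is recomputed.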

Working with Models When Ground Truth Is Delayed

In most production applications, there is a lag between prediction time and ground truth collection time, which significantly limits the ability to remediate model issues quickly. Labeling teams or services can help shorten this lag, but they will not remove it completely. Instead of monitoring only output-based metrics, Arthur can monitor the model’s inputs using data drift metrics, with automated data drift thresholding.

Arthur automatically creates relevant thresholds for detecting data drift, which shortens the time to value when optimizing ML models. Other ML observability and monitoring solutions rely on users to manually define thresholds for each attribute, which is slow and labor-intensive.
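The article does not describe how Arthur computes its thresholds, so the following is only one plausible way to automate them, under stated assumptions. It uses the Population Stability Index (PSI) as the drift metric and sets the threshold at a high quantile of the PSI scores obtained by resampling the reference data against itself, so the threshold reflects ordinary sampling noise for that attribute. The function names, bin edges, batch size, and quantile are all illustrative choices.

```python
import math
import random


def psi(reference, current, edges):
    """Population Stability Index between two samples over shared bin edges."""
    def proportions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            # Bin index = number of edges the value exceeds.
            counts[sum(1 for e in edges if v > e)] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    ref, cur = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


def auto_threshold(reference, edges, batch_size, rounds=200, quantile=0.99, seed=0):
    """Derive a drift threshold from the PSI of drift-free resamples of the reference data."""
    rng = random.Random(seed)
    scores = sorted(
        psi(reference, rng.choices(reference, k=batch_size), edges)
        for _ in range(rounds)
    )
    return scores[min(int(quantile * rounds), rounds - 1)]


# Synthetic data for illustration: the production batch is shifted by one standard deviation.
rng = random.Random(42)
reference = [rng.gauss(0.0, 1.0) for _ in range(2000)]
shifted = [rng.gauss(1.0, 1.0) for _ in range(500)]
edges = [-1.5, -0.5, 0.5, 1.5]

threshold = auto_threshold(reference, edges, batch_size=len(shifted))
drift_score = psi(reference, shifted, edges)
drift_detected = drift_score > threshold
print(f"threshold={threshold:.4f} drift_score={drift_score:.4f} drift_detected={drift_detected}")
```

Deriving the threshold from the reference data itself means no per-attribute value has to be chosen by hand, and the check needs no ground truth at all, which is what makes input monitoring useful while labels are still pending.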

“It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.” - Sherlock Holmes

Remember, without ground truth data, the value of your predictive algorithms can be called into question. Using Arthur with ground truth to calculate model performance, or with automated data drift thresholding to compensate for delayed ground truth, will build greater trust in model predictions and drive better business outcomes for everyone.