Example results of the Detection Model, where the ants are detected in bounding boxes and labelled with a prediction score. [1]
When thinking of how to measure a tracking system’s accuracy, you might ask yourself: how often is it correct? If a model can fairly accurately detect the ants in each frame, it seems reasonable to conclude it is performing well. However, this intuition can be misleading.
A tracking system can be locally accurate at each step and still produce incorrect trajectories overall. In other words, we can’t confirm that a tracking system works well just by looking at its per-frame performance: the system still needs to match which ant is which across frames to build trajectories for the entire video, and that matching introduces more room for error.
So, how does the concept of accuracy change when we consider overall global accuracy instead of localized accuracy? Let’s start with localized accuracy. In each frame of an input video, a tracking system identifies ants and attempts to match them to the ants in the previous frame. If the ants move only a little and remain separated, this matching is pretty straightforward, and the system can correctly associate almost every ant with its previous position, achieving high accuracy on a frame-by-frame basis.
However, the problem is that tracking is not just a collection of independent decisions. Each matching step must build on the previous one to form a continuous trajectory over time. This means there are no isolated errors; instead, every error propagates.
Consider the case where two ants pass close to each other, as shown in the image below. In a single frame, the model might misassign their identities and swap one for the other. This error might seem minor, since it lasts only one frame and the positions are nearly identical. But once the swap occurs, the next frame will match its ants against these wrong identities, the frame after that against the one before, and so on. Even if the model never makes another mistake for the rest of the video, all future positions of those ants will still be attributed to the incorrect identities. From that point forward, the trajectories will be fundamentally wrong, even while being internally consistent.
The trajectories of two ants colliding where the system swaps their IDs, leading to incorrect labelling for the rest of their trajectory. [2]
This is our main problem. To have high global accuracy, a tracking system requires long-term consistency in addition to local correctness. For instance, an ant-tracking system that is 99% accurate at each step can still produce trajectories that are significantly wrong over time if the remaining 1% of errors disrupt the ant identities. This is what makes tracking different from other computer vision and machine learning tasks: small mistakes do not average out. Instead, they accumulate and affect the final result.
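To get a feel for how quickly per-frame errors compound, here is a rough back-of-the-envelope calculation. It makes two simplifying assumptions (not claims from the tracking literature): identity errors are independent across frames, and any single error corrupts the trajectory from that point on.

```python
per_frame_accuracy = 0.99   # "99% accurate at each step"
fps = 30                    # assumed frame rate
frames = fps * 60           # one minute of video

# Probability that a trajectory survives the whole minute
# without a single identity error, under the independence assumption.
prob_clean = per_frame_accuracy ** frames
print(f"{prob_clean:.2e}")
```

Even at 99% per-frame accuracy, the chance of an error-free one-minute trajectory is vanishingly small, which is why the errors below matter so much.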
Achieving high global accuracy is especially difficult with harvester ants because they are visually similar to one another. Unlike trying to track humans, who may have different clothes or heights, ants of the same colony look practically identical to a computer. Thus, the system must rely on motion cues and proximity instead of appearance.
So, how do we evaluate these tracking systems?
There are numerous established metrics, but here are three I found particularly interesting:
1. MOTA (Multiple Object Tracking Accuracy)
MOTA is a widely used metric that combines three types of error into one score:
- False Positives: Predicting an ant when there isn’t one
- False Negatives: Missing an ant that is actually there
- Identity Switches: When the tracker swaps the IDs of two ants (number of swaps)
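The combination is a simple formula: MOTA subtracts the rate of all three error types from a perfect score of 1. A minimal sketch (the error counts here are made-up illustration values, not results from our system):

```python
def mota(false_negatives: int, false_positives: int,
         id_switches: int, num_ground_truth: int) -> float:
    # MOTA = 1 - (FN + FP + IDSW) / GT, where GT is the total number
    # of ground-truth objects over all frames. Note it can go negative
    # when the errors outnumber the ground-truth objects.
    return 1.0 - (false_negatives + false_positives + id_switches) / num_ground_truth

# e.g. 4 missed ants, 2 phantom detections, 1 ID swap over 100 ground-truth ants
print(mota(4, 2, 1, 100))  # 0.93
```

Notice that a single ID switch costs the same as a single missed detection, which is exactly why MOTA alone can look good even when identities are unreliable.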
2. IDF1 (Identification F1-Score)
IDF1 is a better measurement of global consistency. It measures how long a tracker correctly identifies a specific object (like an ant!) over the entire duration of the video. Unlike MOTA, it focuses more on identity than detection, penalizing the system more heavily for ID swaps.
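Concretely, IDF1 is an F1-score computed over identity assignments: IDTP counts detections matched to the correct identity, IDFP counts detections assigned to the wrong (or a spurious) identity, and IDFN counts ground-truth detections whose identity was missed. A minimal sketch with illustrative counts:

```python
def idf1(idtp: int, idfp: int, idfn: int) -> float:
    # IDF1 = 2 * IDTP / (2 * IDTP + IDFP + IDFN)
    # The harmonic mean of identity precision and identity recall.
    return 2 * idtp / (2 * idtp + idfp + idfn)

# e.g. 80 correctly-identified detections, 20 wrong-identity detections,
# 20 ground-truth detections whose identity was never recovered
print(idf1(80, 20, 20))  # 0.8
```

Because a single early ID swap turns every later detection of that ant into an IDFP/IDFN pair, IDF1 drops sharply for exactly the propagating errors described above.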
3. HOTA (Higher Order Tracking Accuracy)
HOTA is the newest standard in the field of tracking systems. It was designed to address the trade-offs between MOTA and IDF1, and it combines three sub-metrics:
- Detection Accuracy: How well the tracker finds the ants.
- Association Accuracy: How well the tracker keeps the ants’ IDs consistent.
- Localization Accuracy: How precisely the bounding boxes match the ant’s actual body.
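The way HOTA combines these is worth seeing. At each localization threshold, it takes the geometric mean of detection accuracy and association accuracy, then averages across thresholds; a tracker can’t score well by excelling at one while failing the other. A simplified sketch (the real computation in the HOTA paper derives DetA and AssA from matched detections, which is omitted here):

```python
import math

def hota_at_alpha(det_a: float, ass_a: float) -> float:
    # At a single localization threshold alpha, HOTA is the geometric
    # mean of detection accuracy (DetA) and association accuracy (AssA).
    return math.sqrt(det_a * ass_a)

def hota(scores_per_alpha: list[tuple[float, float]]) -> float:
    # The final score averages HOTA over a range of localization
    # thresholds, folding localization accuracy into the result.
    return sum(hota_at_alpha(d, a) for d, a in scores_per_alpha) / len(scores_per_alpha)

# A tracker with strong detection (0.81) but weak association (0.25)
# is pulled well below its detection score:
print(hota_at_alpha(0.81, 0.25))  # 0.45
```

The geometric mean is the key design choice: it rewards balance, so a detector-only improvement cannot hide association failures.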
By using HOTA, we would be able to see exactly where a tracking pipeline is failing. In the case of ant tracking, we could answer the questions: Is the detector missing ants in the shadows? Or is the association process getting confused when more than one ant crowds together?
All of these metrics can be implemented using Python libraries, including motmetrics and TrackEval.
Ultimately, each of these metrics provides a different perspective on what it means for a tracking system to be “accurate”. Applying one of them to our ant tracking system would help us better evaluate the tracker’s performance and give insight into how to improve it.
However, to actually compute any of these scores, we would need ground truth: a record of where each individual ant is and which identity it should have in every frame of the video. This type of labelled data is difficult to create, and it is not something we currently have. So it would be worth thinking more about whether these labels are feasible to create, so that we can compute tracking accuracy for our system.
Further Reading:
Chu, P., Wang, J., Qu, Q., Huang, H., & Yu, N. (2022). “Multi-object Tracking by Detection and Query: an efficient end-to-end manner”. arXiv.
https://arxiv.org/html/2411.06197v1.
Luiten, J., Osep, A., Dendorfer, P., Howell, P., Leal-TaixĂ©, L., & Leibe, B. (2021). “HOTA: A Higher Order Metric for Evaluating Multi-object Tracking.” International Journal of Computer Vision (IJCV) / PMC. https://pmc.ncbi.nlm.nih.gov/articles/PMC7881978/.
Mendez, M. “Understanding Object Tracking Metrics.” Miguel Mendez AI. https://miguel-mendez-ai.com/2024/08/25/mot-tracking-metrics.
Sharma, S. (2023). “Introduction to Tracker KPI.” Medium (Digital Engineering @ Centific). https://medium.com/digital-engineering-centific/introduction-to-tracker-kpi-6aed380dd688.
The Datature Team. (2025). “A Comprehensive Guide to Object Tracking Algorithms in 2025.” Datature Blog. https://datature.io/blog/a-comprehensive-guide-to-object-tracking-algorithms-in-2025.
Media Credits:
[1]: Image by the Author
[2]: Image by the Author