Understanding Object Tracking Metrics

After a long time, I have finally sat down to write this blog post on tracking metrics. It builds on my last post about tracking by detection and explores how we measure tracking performance.

In this article, I’ll provide an introduction to tracking metrics, starting from the basic principles and breaking down the key differences between various metrics. I’ll focus on three popular metrics: MOTA, IDF1, and HOTA, which are widely used in the Multi-Object Tracking (MOT) community. Understanding these is crucial because the choice of metric can significantly impact how we interpret a tracker’s performance.

Let’s get started!

The basics

Hungarian algorithm

The Hungarian algorithm plays a crucial role in tracking metrics. It is primarily used to:

Optimize bipartite matching between detections and ground truth objects per frame
Assign tracks to ground truth trajectories across the entire sequence

Let’s focus on the first point for now. The algorithm matches predicted tracks to ground truth objects in each frame, maximizing the total IoU of the matched pairs. This results in:

TP: True Positives (matches with IoU above threshold)
FP: False Positives (unmatched predictions)
FN: False Negatives (unmatched ground truth objects)

While a detailed explanation of the algorithm is beyond the scope of this post, understanding its basic function helps in grasping how these metrics work. For a more in-depth explanation of the Hungarian algorithm, check out this excellent tutorial.
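To make the per-frame step concrete, here is a minimal Python sketch. The box format `(x1, y1, x2, y2)`, the helper names `iou` and `match_frame`, and the 0.5 threshold are assumptions made for illustration. To stay dependency-free it brute-forces the assignment over permutations instead of running the Hungarian algorithm, which only works for a handful of boxes; in practice `scipy.optimize.linear_sum_assignment` solves the same assignment problem efficiently.

```python
from itertools import permutations


def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_frame(gt_boxes, pred_boxes, threshold=0.5):
    """Return (matches, TP, FP, FN) for one frame.

    Brute-force stand-in for the Hungarian algorithm: tries every one-to-one
    assignment and keeps the one with the highest total IoU, counting only
    pairs whose IoU reaches the threshold.
    """
    n_gt, n_pred = len(gt_boxes), len(pred_boxes)
    scores = [[iou(g, p) for p in pred_boxes] for g in gt_boxes]
    # Pad with None so every GT object can also stay unmatched.
    slots = list(range(n_pred)) + [None] * n_gt
    best_total, best_matches = -1.0, []
    for perm in set(permutations(slots, n_gt)):
        pairs = [(g, p) for g, p in enumerate(perm)
                 if p is not None and scores[g][p] >= threshold]
        total = sum(scores[g][p] for g, p in pairs)
        if total > best_total:
            best_total, best_matches = total, pairs
    tp = len(best_matches)
    return best_matches, tp, n_pred - tp, n_gt - tp


gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
pred = [(1, 1, 11, 11), (50, 50, 60, 60), (20, 20, 30, 31)]
matches, tp, fp, fn = match_frame(gt, pred)
print(matches, tp, fp, fn)  # [(0, 0), (1, 2)] 2 1 0
```

Both ground truth boxes find a partner, and the prediction far away from everything ends up as the single FP.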

DetA

The Detection Accuracy (DetA) measures how well a tracker localizes objects in each frame, typically using Intersection over Union (IoU) thresholds. It essentially quantifies the spatial accuracy of detections.

Figure 1. IoU diagram from jonathanluiten

So once we have the TP, FP, and FN, we can compute the DetA as:

\begin{matrix} (1) & DetA = \frac{TP}{TP + FP + FN} \end{matrix}
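As a quick worked example of equation (1), the sketch below plugs in made-up counts accumulated over a sequence. The function name `det_a` and the numbers are illustrative assumptions.

```python
def det_a(tp, fp, fn):
    """Detection accuracy as in equation (1): TP / (TP + FP + FN)."""
    total = tp + fp + fn
    return tp / total if total else 0.0


# Hypothetical counts accumulated over a whole sequence.
print(det_a(tp=80, fp=10, fn=10))  # 0.8
```

Note that FPs and FNs weigh the same: trading ten missed objects for ten spurious detections leaves DetA unchanged.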

AssA

The Association Accuracy (AssA), on the other hand, evaluates how accurately a tracker maintains object identities across frames. It focuses on the temporal consistency of ID assignments, measuring how well the tracker links detections of the same object over time. See, for example, the image below, extracted from the HOTA paper [3]:

Figure 2. Different association results example (from HOTA [3])

We can observe different tracking results (A, B, C) for a single ground truth object (GT):

A: Detects the object 50% of the time with consistent identity
B: Detects the object 70% of the time, but assigns two different identities
C: Detects the object 100% of the time, but assigns up to four different identities

Which result is best? This is what the Association Accuracy (AssA) metric aims to determine. Different tracking metrics like MOTA, IDF1, and HOTA approach this question in various ways, each with its own methodology and emphasis on detection accuracy versus identity consistency.

MOTA (Multiple Object Tracking Accuracy)

MOTA introduces the concept of identity tracking to object detection metrics. It incorporates identity switches (IDSW), which occur when a single ground truth (GT) object is assigned to different track predictions over time.

The computation of MOTA involves temporal dependency, penalizing track assignment changes between consecutive frames. An IDSW is counted when a GT target $i$ matches track $j$ in the current frame but was matched to a different track $k$ ( $k \neq j$ ) in the previous frame.

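In practice, the Hungarian matching is modified to favor keeping each GT object on the track it was previously matched to, which minimizes identity switches. In the TrackEval code this is done with a simple trick: a large bonus is added to the score of the previous pairing, so it wins whenever it is still a valid match:

score = IoU(GT, pred)
if pred == previous_assigned_id(GT):
    score = score + 1000  # large bonus: keep the previous pairing when possible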

The MOTA metric is computed across all frames as:

\begin{matrix} (2) & MOTA = 1 - \frac{\sum_{t} (F N_{t} + F P_{t} + I D S W_{t})}{\sum_{t} G T_{t}} \end{matrix}

Where $t$ is the frame index, FN are False Negatives, FP are False Positives, IDSW are Identity Switches, and GT is the number of ground truth objects.
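The following sketch shows how equation (2) could be computed once the per-frame matching is done. The input layout (one dict per frame with `matches`, `num_gt` and `num_pred`) is an assumption made for this example, not TrackEval's data format. It compares each match against the last track the GT object was matched to, which coincides with the previous frame when the object is matched in every frame.

```python
def mota(frames):
    """Compute (MOTA, IDSW) from per-frame results.

    Each frame is a dict with:
      "matches":  {gt_id: track_id} for the TPs of that frame,
      "num_gt":   number of ground truth objects in the frame,
      "num_pred": number of predictions in the frame.
    Assumes the sequence contains at least one ground truth object.
    """
    last_track = {}  # gt_id -> track it was last matched to
    fn = fp = idsw = num_gt = 0
    for frame in frames:
        matches = frame["matches"]
        fn += frame["num_gt"] - len(matches)
        fp += frame["num_pred"] - len(matches)
        num_gt += frame["num_gt"]
        for gt_id, track_id in matches.items():
            if gt_id in last_track and last_track[gt_id] != track_id:
                idsw += 1
            last_track[gt_id] = track_id
    return 1.0 - (fn + fp + idsw) / num_gt, idsw


# One GT object over four frames: the track ID changes once,
# and frame 3 contains one extra (unmatched) prediction.
frames = [
    {"matches": {0: 1}, "num_gt": 1, "num_pred": 1},
    {"matches": {0: 1}, "num_gt": 1, "num_pred": 1},
    {"matches": {0: 2}, "num_gt": 1, "num_pred": 2},
    {"matches": {0: 2}, "num_gt": 1, "num_pred": 1},
]
print(mota(frames))  # (0.5, 1)
```

The switch to track 2 costs exactly one error no matter how many frames the new ID persists.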

While MOTA’s simplicity is appealing, it has some limitations:

It only considers the previous frame for IDSW, so each switch is penalized only once, regardless of how long the incorrect assignment persists.
It can be dominated by FP and FN in crowded scenes, making IDSW less impactful.
The IoU threshold is fixed, so better or worse localization accuracy is not reflected in the metric.

IDF1

IDF1 addresses some of MOTA’s limitations by focusing on how long the tracker correctly identifies an object, rather than just counting errors. It’s based on the concept of Identification Precision (IDP) and Identification Recall (IDR).

It computes the assignment between prediction and ground truth objects across the entire video, rather than frame by frame.

The metric is simple:

\begin{matrix} (3) & IDF1 = \frac{2 * IDTP}{2 * IDTP + IDFP + IDFN} \end{matrix}

Where:

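IDTP (ID True Positive): Detections of a predicted trajectory that overlap the ground truth trajectory it was globally matched to
IDFP (ID False Positive): Predicted detections that are not IDTPs, including every detection of an unmatched predicted trajectory
IDFN (ID False Negative): Ground truth detections that are not covered by the matched predicted trajectory, including every detection of an untracked ground truth trajectory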

The global assignment is computed using the Hungarian algorithm. It picks the one-to-one pairing between predicted and ground truth trajectories that maximizes IDTP, and therefore IDF1, for the whole video. This is easier to understand by looking at the image from the HOTA paper:

Figure 3. IDF1 metric diagram
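The global matching can be sketched in a few lines of Python. The input format is an assumption: each frame is a list of `(gt_id, pred_id)` pairs that already overlap spatially, with `None` marking a missing side. As a dependency-free stand-in for the Hungarian algorithm, the sketch brute-forces the trajectory assignment, which is only viable for a few trajectories. The two example sequences use made-up numbers in the spirit of outputs A and C from Figure 2.

```python
from itertools import permutations


def idf1(frames):
    """IDF1 from per-frame (gt_id, pred_id) pairs; None marks a missing side."""
    overlap = {}  # (gt_id, pred_id) -> number of frames where they coincide
    gt_total = pred_total = 0
    gt_ids, pred_ids = set(), set()
    for frame in frames:
        for gt_id, pred_id in frame:
            if gt_id is not None:
                gt_total += 1
                gt_ids.add(gt_id)
            if pred_id is not None:
                pred_total += 1
                pred_ids.add(pred_id)
            if gt_id is not None and pred_id is not None:
                overlap[(gt_id, pred_id)] = overlap.get((gt_id, pred_id), 0) + 1
    gt_ids = sorted(gt_ids)
    # Pad with None so a GT trajectory may stay unmatched.
    slots = sorted(pred_ids) + [None] * len(gt_ids)
    # Brute-force stand-in for the Hungarian algorithm: pick the one-to-one
    # trajectory assignment with the most shared detections (IDTP).
    idtp = max(
        sum(overlap.get((g, p), 0) for g, p in zip(gt_ids, perm))
        for perm in set(permutations(slots, len(gt_ids)))
    )
    idfp, idfn = pred_total - idtp, gt_total - idtp
    denom = 2 * idtp + idfp + idfn
    return 2 * idtp / denom if denom else 0.0


# Hypothetical 10-frame sequences with a single GT object "g".
frames_a = [[("g", 1)]] * 5 + [[("g", None)]] * 5  # 50% detected, one ID
frames_c = [[("g", t)] for t in (1, 1, 1, 2, 2, 2, 3, 3, 4, 4)]  # 100%, four IDs
print(round(idf1(frames_a), 3))  # 0.667
print(round(idf1(frames_c), 3))  # 0.3
```

Output C detects the object in every frame, yet only its longest fragment counts as IDTP, so it scores well below A.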

The main problem I see with IDF1 is that it relies on a single best one-to-one matching between predicted and ground truth trajectories for the entire sequence, which can oversimplify complex tracking scenarios:

Imagine a corner kick in football. A tracker might correctly follow Player A running into the box, lose them in a cluster, and then mistakenly pick up Player B after the ball is cleared. IDF1 might treat this as one partially correct track for either Player A or B, ignoring that it’s correct for different players at different times.

This simplification can misrepresent a tracker’s performance in complex situations like crowded football plays, where player interactions and occlusions are frequent.

Key advantages of IDF1:

It’s more sensitive to long-term tracking consistency.
It balances precision and recall of identity predictions.
It’s less affected by the number of objects in the scene than MOTA.

However, IDF1 also has limitations:

IDF1 can decrease when detection improves: a tracker that simply avoids FPs can score better than one that detects more objects (A vs C in Figure 2).
The IoU threshold is fixed, so better or worse localization accuracy is not reflected in the metric.

More limitations are presented in the HOTA paper. I recommend reading it because it is very well explained and intuitive.

HOTA (Higher Order Tracking Accuracy)

HOTA is a more recent metric designed to address the limitations of both MOTA and IDF1. It aims to provide a balanced assessment of detection and association performance. HOTA can be broken down into DetA (Detection Accuracy) and AssA (Association Accuracy), allowing separate analyses of these aspects.

The core HOTA formula is:

\begin{matrix} (4) & {HOTA}_{α} = \sqrt{{DetA}_{α} \cdot {AssA}_{α}} \end{matrix}

In this formula, the $α$ term represents the different Intersection over Union (IoU) thresholds used to compute the metric. A True Positive (TP) is only considered when the match IoU score is above the given $α$ threshold. The metric uses 19 different $α$ values, ranging from 0.05 to 0.95 in increments of 0.05.
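The final HOTA score reported in the paper is the mean of the per-threshold scores over those 19 values. A small sketch of that last step follows; the function name `hota_from_per_alpha` and the dict-based inputs are assumptions for illustration.

```python
# The 19 localization thresholds: 0.05, 0.10, ..., 0.95.
alphas = [round(0.05 * i, 2) for i in range(1, 20)]


def hota_from_per_alpha(det_a, ass_a):
    """Average sqrt(DetA_alpha * AssA_alpha) over all thresholds.

    det_a and ass_a map each alpha to the corresponding score.
    """
    per_alpha = [(det_a[a] * ass_a[a]) ** 0.5 for a in alphas]
    return sum(per_alpha) / len(per_alpha)


# Hypothetical scores that happen to be identical at every threshold.
det_a = {a: 0.64 for a in alphas}
ass_a = {a: 0.36 for a in alphas}
print(len(alphas), round(hota_from_per_alpha(det_a, ass_a), 2))  # 19 0.48
```

With real trackers DetA and AssA usually drop as the threshold grows, so the average rewards tight localization as well as correct matching.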

HOTA uses global alignment (high-order association) between predicted and ground truth detections, similar to IDF1, but also incorporates localization accuracy. This means that HOTA evaluates both the ability to detect objects accurately and to maintain correct associations over time.

The HOTA algorithm can be summarized in the following steps:

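for each α:
    for each frame:
        matching between gt and preds (Hungarian algorithm), keeping pairs with IoU above α
        obtain TP, FP and FN from the matching
    compute DetA from the TP, FP and FN of all frames
    compute Ass-IoU across the entire video for each TP and average them to get AssA
    HOTA_α = sqrt(DetA · AssA)
average HOTA_α over all α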

Figure 4. HOTA metric diagram

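In the original paper, Ass-IoU is the association score of a single true positive (TP) match c in the current frame. It has the same form as DetA, but it is computed across the entire sequence over the detections that share the ground truth ID or the predicted ID of c: TPA(c) counts the TPs with the same pair of IDs, FNA(c) counts the ground truth detections of that object that were missed or matched to another track, and FPA(c) counts the detections of that track that were unmatched or matched to another object, so that Ass-IoU(c) = TPA(c) / (TPA(c) + FNA(c) + FPA(c)). The AssA metric can then be defined as follows: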

\begin{matrix} (5) & AssA = \frac{1}{| TP |} \sum_{c \in TP} \text{Ass-IoU}(c) \end{matrix}
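Equations (4) and (5) can be sketched for a single threshold as follows. The input format is an assumption: each frame is a list of `(gt_id, pred_id)` pairs already matched at the chosen α, where `(g, None)` is a FN and `(None, p)` is a FP. The example sequences are made-up numbers in the spirit of outputs A and C from Figure 2.

```python
from collections import Counter


def hota_alpha(frames):
    """Return (DetA, AssA, HOTA) for one threshold."""
    tp_pairs = Counter()    # (gt_id, pred_id) -> number of TPs with both IDs
    gt_count = Counter()    # gt_id -> number of GT detections
    pred_count = Counter()  # pred_id -> number of predicted detections
    for frame in frames:
        for g, p in frame:
            if g is not None:
                gt_count[g] += 1
            if p is not None:
                pred_count[p] += 1
            if g is not None and p is not None:
                tp_pairs[(g, p)] += 1
    tp = sum(tp_pairs.values())
    fn = sum(gt_count.values()) - tp
    fp = sum(pred_count.values()) - tp
    if tp == 0:
        return 0.0, 0.0, 0.0
    det_a = tp / (tp + fp + fn)
    # Every TP with the same (g, p) shares the same Ass-IoU, with
    # TPA = tpa, FNA = gt_count[g] - tpa and FPA = pred_count[p] - tpa.
    ass_a = sum(
        tpa * tpa / (gt_count[g] + pred_count[p] - tpa)
        for (g, p), tpa in tp_pairs.items()
    ) / tp
    return det_a, ass_a, (det_a * ass_a) ** 0.5


# Hypothetical 10-frame sequences with a single GT object "g".
frames_a = [[("g", 1)]] * 5 + [[("g", None)]] * 5  # 50% detected, one ID
frames_c = [[("g", t)] for t in (1, 1, 1, 2, 2, 2, 3, 3, 4, 4)]  # 100%, four IDs
print([round(v, 2) for v in hota_alpha(frames_a)])  # [0.5, 0.5, 0.5]
print([round(v, 2) for v in hota_alpha(frames_c)])  # [1.0, 0.26, 0.51]
```

With these numbers A wins on association and C wins on detection, and the geometric mean places them almost level.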

HOTA drawbacks:

Not Ideal for Online Tracking: HOTA’s association score depends on future associations across the entire video, making it less suitable for evaluating online tracking where future data isn’t available.
Doesn’t Account for Fragmentation: HOTA does not penalize fragmented tracking results, as it is designed to focus on long-term global tracking, which may not align with all application needs.

If you want to learn more about HOTA, I recommend reading the blog post by Jonathon Luiten. He is one of the authors of the HOTA paper, and his post is an excellent resource for learning how to use the metric to compare different trackers.

How do these metrics compare to each other?

We have examined how MOTA, IDF1, and HOTA function. Each metric has its own strengths and limitations. While HOTA is generally recommended for most applications, the choice of metric ultimately depends on your specific tracking scenario. The HOTA paper provides an excellent comparison that effectively captures the differences between these metrics:

Figure 5. Metric comparison

Having already introduced the left side of the image, let’s now focus on the right side, which displays metrics for each tracker output. The leftmost metric, DetA, exclusively evaluates detection quality. It yields the best results when the tracker accurately detects objects, regardless of their track ID. On the opposite end, we have AssA (derived from the HOTA definition). This metric prioritizes track ID consistency, which is why output A performs best in this category. The authors demonstrate how HOTA positions itself in the middle, striking a balance between detection quality and association accuracy.

The most suitable metric depends on your specific application. For instance:

If you’re developing a simple camera system to count people in a room, you might prioritize detection quality (DetA).
In a criminal tracking system where maintaining consistent track IDs is crucial, you should focus on AssA.
For most applications, such as sports tracking systems, you’ll need to balance both aspects. In these scenarios, HOTA emerges as the optimal choice, providing a comprehensive evaluation of tracker performance.

Conclusion

In this post, we’ve explored three key metrics used in Multi-Object Tracking: MOTA, IDF1, and HOTA. Each metric offers unique insights into tracking performance, with its own strengths and limitations. MOTA provides a straightforward measure but may be overly simplistic in complex scenarios. IDF1 focuses on long-term consistency but may not fully capture detection improvements. HOTA, which attempts to balance detection and association accuracy, has emerged as the standard metric used today for benchmarking tracking algorithms.

References

[1] Milan, A., Leal-Taixé, L., Reid, I., Roth, S., & Schindler, K. (2016). MOT16: A benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831.
[2] Ristani, E., Solera, F., Zou, R., Cucchiara, R., & Tomasi, C. (2016, October). Performance measures and a data set for multi-target, multi-camera tracking. In European conference on computer vision (pp. 17-35). Cham: Springer International Publishing.
[3] Luiten, J., Osep, A., Dendorfer, P., Torr, P., Geiger, A., Leal-Taixé, L., & Leibe, B. (2021). HOTA: A higher order metric for evaluating multi-object tracking. International Journal of Computer Vision, 129, 548-578.

Miguel Méndez
