
Thursday, June 23, 2016

The Art of Tracking: Build Your Own Multi-target Tracker!



Figure 1: Player-tracking overlay on a televised soccer match.


If you watch sports channels often, you've probably noticed that when they show a soccer game, little circles sometimes surround the players and trace the paths they've taken across the field. Ever wonder how they do it? Today I'll show you the secrets behind these fancy tracking algorithms! The same algorithms are used in ant tracking and fly tracking (Ctrax from Caltech). And if you have followed our blog, you'll know that we are working on a bee tracking project; we'll use the same kinds of algorithms to track the trajectories of the bees. Multi-target tracking algorithms have a wide range of applications, and after you've read through this post, you'll have the knowledge you need to build your own simple multi-target tracker!


Overview
When there is only one target to track, the problem you face is really just detection: as long as you can detect the object in each frame, you can connect the detected coordinates across frames and construct the object's complete track. However, when there are multiple objects in the frame, and the number of objects may change during the video, the problem suddenly becomes much more complicated.


You may wonder why this is the case: why couldn't we just detect all the objects and record the coordinates of all the detections? If you do, you'll soon realize that you can't tell which object is which, so you can't construct the tracks of the individual objects. With just two detections in the first frame and two in the second, each detection in the first frame could plausibly correspond to either detection in the second, and that's before we even consider the possibility that one of the objects has left the frame in between. To accurately track multiple targets at once, we need to solve both the problem of detection and the problem of data association. We'll talk about them one by one.


Detections
There are two main categories of detection methods: detection by motion and detection by features. The former detects objects moving relative to a background, and the latter detects objects in the frame based on their visual features. The method we'll discuss here, the "Mixture of Gaussians" model, falls in the "detection by motion" category. It is relatively easy to understand, and it produces reasonably good results.


Figure 2: A Gaussian function (a.k.a. the normal distribution)


If you still remember your high school statistics, you'll recall that a Gaussian function is the probability density function of the normal distribution: it describes the relative likelihood that a random variable takes on different real values. The normal distribution is used heavily across the sciences, and because of its bell-shaped curve it is often referred to as a "bell curve".


Now that we know what a Gaussian is, a mixture of Gaussians is relatively straightforward to understand: it is simply a probability density function formed by the weighted superposition of several Gaussian functions with different means and potentially different standard deviations (see Figure 3).
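To make the idea concrete, here is a minimal NumPy sketch of a mixture-of-Gaussians density. The weights, means, and standard deviations below are arbitrary illustration values, not anything from a real tracker:

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Probability density of a single normal distribution at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def mixture_pdf(x, weights, means, stds):
    """Weighted superposition of several Gaussians (weights should sum to 1)."""
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds))

# Example: a bimodal mixture of two Gaussians, like the curve in Figure 3.
x = np.linspace(-5.0, 10.0, 500)
density = mixture_pdf(x, weights=[0.6, 0.4], means=[0.0, 5.0], stds=[1.0, 2.0])
```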


Figure 3: A mixture of Gaussians (a superposition of several Gaussian functions)


The mixture of Gaussians model of motion detection models each pixel as a mixture of Gaussians rather than as a single distribution. Based on the variances and the persistence of the different Gaussians in the mixture, we decide which Gaussians represent the background and which represent moving objects. The background is usually relatively static, so pixels representing the background have low variance. Pixels representing moving objects, on the other hand, have higher variance because their color/intensity changes over time (assuming the objects have different colors than the background). And if the background contains repetitive motion (such as swaying leaves), the mixture of Gaussians model has no problem incorporating it, because it can hold a multimodal distribution for the background.


For a more thorough explanation of the mixture of Gaussians model, read "Adaptive background mixture models for real-time tracking" by Chris Stauffer and W.E.L. Grimson. For a demonstration of the mixture of Gaussians in action, check out this video!
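If you want to try this yourself, OpenCV ships a mixture-of-Gaussians background subtractor (MOG2, a refinement of the Stauffer-Grimson approach). Below is a rough sketch of how detections could be pulled out of a video with it; the video file name, the blob-area threshold, and the subtractor parameters are placeholder values you would tune for your own footage, and the findContours call assumes OpenCV 4:

```python
import cv2
import numpy as np

# OpenCV's mixture-of-Gaussians background subtractor.
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16,
                                                detectShadows=False)

cap = cv2.VideoCapture("bees.mp4")  # placeholder video file
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Pixels classified as foreground (moving objects) come back as 255.
    fg_mask = subtractor.apply(frame)
    # Remove speckle noise, then treat each remaining blob as one detection.
    fg_mask = cv2.morphologyEx(fg_mask, cv2.MORPH_OPEN, np.ones((3, 3), np.uint8))
    contours, _ = cv2.findContours(fg_mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    detections = []
    for c in contours:
        if cv2.contourArea(c) > 50:  # ignore tiny blobs
            x, y, w, h = cv2.boundingRect(c)
            detections.append((x + w / 2.0, y + h / 2.0))  # blob centroid
cap.release()
```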


Data Association
After detection, we need a way to maintain the identity of each object across frames. This involves associating the most probable new detections with existing tracks, and creating new tracks when none of the existing tracks associates well with a new detection. We'll explore a method called Global Nearest Neighbor (GNN) matching, which produces reasonably good results.


Figure 4 below shows the difficulty we face here: when there are many detections, we need a way to pick out the right detection for each track. GNN matching tackles this problem by assigning each possible association a "cost" that reflects how likely that association is to be correct. GNN matching then chooses the assignments so that the sum of their costs is minimized.


Figure 4: A graphical explanation of the data association problem (http://www.cse.psu.edu/~rtc12/CSE598C/datassocPart1.pdf)


First of all, GNN requires predictions of the objects' locations based on their previous detections. These predictions are illustrated in Figure 4 by the red triangle and rectangle. Usually, these predictions are produced by linear estimators such as the Kalman filter, or Monte Carlo estimators such as the particle filter. We won't go in depth on these two algorithms here; just keep in mind that they can produce predictions based on previous detections and their errors. (For more on these topics, their respective Wikipedia pages (Kalman Filter and Particle Filter) are good places to start!)
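As a rough illustration, here is a minimal constant-velocity Kalman filter for a single track, written in plain NumPy. The state is (x, y, vx, vy), and the process and measurement noise values are arbitrary placeholders rather than anything from this post:

```python
import numpy as np

# A minimal constant-velocity Kalman filter for one track.
# State vector: [x, y, vx, vy]; detections only give us (x, y).
dt = 1.0  # time between frames
F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1,  0],
              [0, 0, 0,  1]], dtype=float)   # state transition: position += velocity * dt
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)    # measurement model: we observe position only
Q = np.eye(4) * 0.01                         # process noise (placeholder value)
R = np.eye(2) * 1.0                          # measurement noise (placeholder value)

def predict(x, P):
    """Predict the next state and its covariance (the red shapes in Figure 4)."""
    return F @ x, F @ P @ F.T + Q

def update(x, P, z):
    """Correct the prediction with an associated detection z = [x_meas, y_meas]."""
    innovation = z - H @ x
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)           # Kalman gain
    x = x + K @ innovation
    P = (np.eye(4) - K @ H) @ P
    return x, P

# Per frame: x, P = predict(x, P); associate detections; then update with the match.
```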


Once we have the predictions, we can calculate the association costs. The reason we calculate costs against the predictions instead of the previous detections is that the predictions better reflect the motion of the object: we want to select the detections that best fit the existing tracks. As you can see in Figure 5, without prediction you might think the cyan dot on the far left is the best association for the previous track, but with prediction it is obvious that the cyan dot on the right is the best fit given the object's motion.

Figure 5: The Importance of Prediction

To calculate costs, we can use the Euclidean distance between each predicted position and each new detection:

cost(i, j) = || p_i − d_j || = sqrt((x_i − x_j)^2 + (y_i − y_j)^2)

where p_i = (x_i, y_i) is the predicted position of track i and d_j = (x_j, y_j) is the j-th new detection. This is not a perfect cost function, since it doesn't take the prediction errors into account, but it gives reasonable assignments as long as the objects are not very close to each other.
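In code, the whole table of costs between every track and every detection can be built in one step. Here is a small sketch of that cost matrix under the Euclidean-distance choice above:

```python
import numpy as np

def cost_matrix(predictions, detections):
    """Euclidean distance between every predicted track position and every detection.

    predictions: (num_tracks, 2) array of predicted (x, y) positions
    detections:  (num_detections, 2) array of detected (x, y) positions
    returns:     (num_tracks, num_detections) matrix of costs
    """
    diff = predictions[:, None, :] - detections[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))
```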


The next step is to find the assignment of detections to tracks such that the global sum of the costs is minimized. It turns out that this assignment problem is very well studied in computer science, and the standard algorithm for solving it is the Hungarian algorithm, which runs in polynomial time. For a detailed and easy-to-understand breakdown of the algorithm, see these excellent slides by Bob Collins at Penn State University.
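You don't have to implement the Hungarian algorithm yourself: SciPy's linear_sum_assignment solves the same minimum-cost assignment problem. Below is a small sketch of how it could be wired together with the Euclidean costs above; the coordinates and the distance gate (MAX_DIST) are made-up illustration values:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # solves the assignment problem

predictions = np.array([[10.0, 12.0], [40.0, 35.0]])              # predicted track positions
detections = np.array([[38.0, 36.0], [11.0, 13.0], [70.0, 5.0]])  # new detections

# Euclidean cost between every prediction and every detection.
costs = np.sqrt(((predictions[:, None, :] - detections[None, :, :]) ** 2).sum(axis=2))
track_idx, det_idx = linear_sum_assignment(costs)  # minimizes the total cost

MAX_DIST = 20.0  # reject matches that are too far apart (likely a new or lost object)
for t, d in zip(track_idx, det_idx):
    if costs[t, d] < MAX_DIST:
        print(f"track {t} -> detection {d} (cost {costs[t, d]:.1f})")
# Detections left unmatched (here detection 2) would start new tracks.
```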


Further Reading
So now we’ve covered the most basic steps for building your own multi-target tracker. If you want to learn more about multi-target tracking, here are some of the best papers I found during my research this summer. Enjoy!


Papers on Detection


Herbert Bay, Andreas Ess, Tinne Tuytelaars, Luc Van Gool, "SURF: Speeded Up Robust Features", Computer Vision and Image Understanding (CVIU), Vol. 110, No. 3, pp. 346--359, 2008 (http://www.vision.ee.ethz.ch/~surf/eccv06.pdf)


Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi, You Only Look Once: Unified, Real-Time Object Detection, arXiv:1506.02640 (http://arxiv.org/abs/1506.02640)


Stauffer, Chris, and W. Eric L. Grimson. "Adaptive background mixture models for real-time tracking." Computer Vision and Pattern Recognition, 1999. IEEE Computer Society Conference on. Vol. 2. IEEE, 1999. (http://www.ai.mit.edu/projects/vsam/Publications/stauffer_cvpr98_track.pdf)


Papers on Tracking


Chao Ma, Jia-Bin Huang, Xiaokang Yang and Ming-Hsuan Yang, Hierarchical Convolutional Features for Visual Tracking, ICCV 2015 (http://www.cv-foundation.org/openaccess/content_iccv_2015/papers/Ma_Hierarchical_Convolutional_Features_ICCV_2015_paper.pdf)


Anton Milan, Seyed Hamid Rezatofighi, Anthony Dick, Konrad Schindler, Ian Reid, “Online Multi-target Tracking using Recurrent Neural Networks” arXiv:1604.03635 (https://arxiv.org/abs/1604.03635)


Books on Tracking

Blackman, S. S. "Multi-Target Tracking with Radar Applications." Artech House, Dedham (1986): 449.
