
TrackNet: A Deep Learning Network for Tracking High-speed and Tiny Objects in Sports Applications

Yu-Chuan Huang, I-No Liao, Ching-Hsuan Chen, Tsı-Uı Ik∗, Wen-Chih Peng

Department of Computer Science, College of Computer Science
National Chiao Tung University
1001 University Road, Hsinchu City 30010, Taiwan

∗Email: [email protected]

Abstract—Ball trajectory data are among the most fundamental and useful information for evaluating players' performance and analyzing game strategies. Although vision-based object tracking techniques have been developed to analyze sports competition videos, it is still challenging to recognize and position a high-speed, tiny ball accurately. In this paper, we develop a deep learning network, called TrackNet, to track the tennis ball in broadcast videos in which the ball images are small, blurry, and sometimes accompanied by afterimage tracks or even invisible. The proposed heatmap-based deep learning network is trained not only to recognize the ball image from a single frame but also to learn flying patterns from consecutive frames. TrackNet takes images of size 640 × 360 and generates a detection heatmap from either a single frame or several consecutive frames to position the ball, and it can achieve high precision even on public domain videos. The network is evaluated on the video of the men's singles final at the 2017 Summer Universiade, which is available on YouTube. The precision, recall, and F1-measure of TrackNet reach 99.7%, 97.3%, and 98.5%, respectively. To prevent overfitting, 9 additional videos are partially labeled together with a subset from the previous dataset to implement 10-fold cross validation, and the precision, recall, and F1-measure are 95.3%, 75.7%, and 84.3%, respectively. A conventional image processing algorithm is also implemented for comparison with TrackNet. Our experiments indicate that TrackNet outperforms the conventional method by a large margin and achieves exceptional ball tracking performance. The dataset and demo video are available at https://nol.cs.nctu.edu.tw/ndo3je6av9/.

Index Terms—Deep Learning, neural networks, tiny object tracking, heatmap, tennis, badminton

I. INTRODUCTION

Video, considered as the log of a visual sensor, contains a large amount of information. Information extraction from videos has become a hot research topic in the areas of image processing and deep learning. In sports analysis and athlete training, videos are helpful for post-game review and tactical analysis. In professional sports, high-end cameras have been used to record high-resolution, high-frame-rate videos, which are combined with image processing for referee assistance or data collection. However, this solution requires enormous resources and is not affordable for individuals or amateurs. Developing a low-cost solution for data acquisition from broadcast videos will be significant for massive sports data collection.

Ball trajectory data are among the most fundamental and useful information for game analysis. However, in sports such as tennis, badminton, and baseball, the ball is not only small but may also fly as fast as several hundred kilometers per hour, resulting in tiny and blurry images. That makes ball tracking more challenging than in other sports. In this paper, we design a heatmap-based deep learning network, called TrackNet, to precisely position the ball in tennis and badminton on broadcast videos or videos recorded by consumer devices such as smartphones. TrackNet overcomes the issues of blurry and remnant images and can even detect an occluded ball by learning its trajectory patterns. The proposed network can be applied to other ball-based sports and help both amateurs and professional teams collect data with a moderate budget.

Conventional image recognition is usually based on the object's appearance features such as shape, color, and size, or on statistical features such as HOG and SIFT. Due to the relatively long shutter time of consumer or prosumer cameras, images of high-speed objects are prone to afterimage or blur issues, resulting in poor image recognition accuracy. The performance of ball tracking can be improved by pairing candidates from frame to frame according to trajectory models to find the most probable one [1]. In addition, a classical image processing technique for improving image quality is to fuse multiple low-quality images. Based on the above observations, instead of using rule-based techniques, we propose to adopt a deep learning network that recognizes the shape of the ball and learns the trajectory patterns from multiple consecutive frames to solve the mentioned issues.

arXiv:1907.03698v1 [cs.LG] 8 Jul 2019

Object classification and detection are two of the earliest research topics in deep learning. VGG-16 [2] is one of the most popular networks for feature map encoding. To detect and classify multiple objects in an image, the R-CNN family [3] [4] [5] examines the picture in two stages. It first selects many areas that may contain objects of interest, called Regions of Interest (RoIs), and then applies object detection and classification techniques to these regions. However, its performance cannot fulfill the needs of real-time applications. To speed up detection, the YOLO family [6] develops a one-stage end-to-end approach that detects objects in a limited search space, significantly reducing the computing time. The streamlined version, Tiny YOLO, can even run on the Raspberry Pi. In contrast to the block-based algorithms, Fully Convolutional Networks (FCN) perform pixel-wise classification. To compensate for the size reduction of the feature map during the encoding process, upsampling and DeconvNet [7] are often used to decode the feature map, generating a data array of the original size.

In this paper, a deep learning network, called TrackNet, is proposed to realize precise trajectory tracking. First, VGG-16 is adopted to generate the feature map. Unlike other deep learning networks, TrackNet can take multiple consecutive frames as input. In this way, TrackNet learns not only the features of the ball but also the characteristics of ball trajectories to enhance its capability of object recognition and positioning. Since images are downsampled and encoded by pooling layers, the network follows the upsampling mechanism of FCN to generate the heatmap for object detection. Finally, the position of the target object is calculated from the heatmap generated by the deep learning network. To match the characteristics of tennis and badminton games, our calculation and evaluation are based on the assumption that there is at most one ball on the court.

To evaluate the proposed network, we have labeled 20,844 frames from the broadcast of the men's singles final at the 2017 Summer Universiade. To assess the performance of the proposed consecutive input frames technique, both single-frame and multiple-frame versions of TrackNet are implemented. Along with the conventional image recognition algorithm [1], a comprehensive comparison among different models is performed. Experiments indicate that the proposed TrackNet outperforms the conventional image recognition algorithm and effectively locates the fast-moving tennis ball in broadcast sports competition videos. Moreover, to prevent the overfitting that happens frequently in deep learning solutions, additional data from 9 tennis games on different courts, including grass, red clay, and hard courts, are added to the training dataset. Additionally, to explore the model's extensibility, badminton tracking by TrackNet is evaluated. We have labeled 18,242 frames from the video of the 2018 Indonesia Open Final - TAI Tzu Ying vs CHEN YuFei. Although the shuttlecock in badminton travels much faster than the tennis ball, our experimental results exhibit decent performance.

The critical contribution of TrackNet comes from its capability of precisely tracking fast-moving and tiny objects by learning the dynamic behavior of the trajectory. In the tennis tracking application, 10-fold cross validation results in 95.3% precision, 75.7% recall, and 84.3% F1-measure. Such capability shows great potential for expanding the variety of computer vision applications. The rest of the paper is organized as follows. Section II provides an introduction to the related research and the convolutional neural network. Section III introduces the datasets used in this paper. Section IV elaborates on the proposed deep learning network and Gaussian heatmap techniques. Section V provides experimental results and performance evaluation. Finally, Section VI concludes this paper.


II. RELATED WORKS

In recent years, the analysis of player performance and game tactics based on the trajectory data of balls and players has received more and more attention [8] [9] [10] [11]. Many tracking algorithms and systems have been developed to compute and collect the trajectory data. Current commercial solutions mainly rely on high-resolution, high-frame-rate video, resulting in high hardware investment. For example, the Hawk-Eye system [12] has been extensively used in professional competitions to calculate ball trajectories and assist the referee in clarifying controversial calls through 3D visual depictions. Nonetheless, the system has to deploy high-end cameras with dedicated operators at selected locations and angles. The expense is too high for non-professional teams.

Positioning the ball in sports competition videos has been studied for years. However, since the ball is relatively small, it is prone to be confused with objects of similar color or shape, causing false positives. Furthermore, due to the high moving speed of the ball, the resulting image is usually blurry, inducing false negatives. By exploring the trajectory pattern from consecutive frames, ball positioning can be effectively improved. In addition, the flight trajectory itself carries important information and is the subject of many pieces of research [13]. Enlightening studies include combining multiple cameras with 3D technology for tennis ball detection [14], tracking the tennis ball with a particle filter in low-quality footage [15], and adopting a two-layer data association approach to calculate the most likely ball trajectory from the results of failed detections in frame-by-frame image processing [16].

The success of deep learning techniques in image classification [2] [17] encourages more researchers to adopt these methods to solve various problems such as object detection and interception [5] [6] [18], computer games, network security, activity recognition [19] [20], text and image semantic analysis, and smart stores. The backbone of such a deep learning network is a large, structured convolutional neural network (CNN) trained with a large amount of labeled data. The most common operations of CNNs include convolution, rectifier, pooling/down-sampling, and deconvolution/up-sampling. A softmax layer is usually used as the output layer. For example, the widely used VGG-16 [2] mainly consists of convolutional, maximum pooling, and ReLU layers. Conceptually, front-end layers learn to identify simple geometric features, and back-end layers are trained to identify object features.

In CNNs, each layer is a W × H × D data array, where W, H, and D denote the width, height, and depth of the data array, respectively. The convolution operation slides a filter with a kernel of size w × h × D across the W × H range, with the stride parameter s being set to 1 in many applications. To avoid information loss near the boundary or to maintain the size of the output data array, columns and rows of the data array can be padded with zeros by setting the padding parameter p. Figure 1 depicts the relevant parameters of the convolution operation. Let W′ and H′ denote the width and height of the next layer. Then,

$$W' = \frac{W + 2p - w}{s} + 1 \quad \text{and} \quad H' = \frac{H + 2p - h}{s} + 1.$$

Fig. 1. Convolution operation in deep learning networks.
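The output-size relation above can be checked with a short Python sketch. The function name `conv_output_size` and the example values are illustrative assumptions, not settings taken from this paper. The integer division is also an assumption: it covers strides that do not divide the span evenly, a case the formula in the text does not address.

```python
def conv_output_size(size: int, kernel: int, padding: int = 0, stride: int = 1) -> int:
    """Output length along one axis: (size + 2*padding - kernel) / stride + 1.

    Integer (floor) division is an assumption for strides that do not
    divide the span evenly; the text only states the exact-division case.
    """
    if stride < 1:
        raise ValueError("stride must be at least 1")
    span = size + 2 * padding - kernel
    if span < 0:
        raise ValueError("kernel is larger than the padded input")
    return span // stride + 1


# Illustrative values: a 640 x 360 array, a 3 x 3 kernel, padding 1, stride 1.
# With these values the spatial size is preserved.
out_width = conv_output_size(640, kernel=3, padding=1, stride=1)
out_height = conv_output_size(360, kernel=3, padding=1, stride=1)
print(out_width, out_height)
```

With a 3 × 3 kernel and stride 1, a padding of 1 keeps W′ = W and H′ = H, which is the "maintain the size of the output data array" case mentioned in the text.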

Since the convolution operation is linear and cannot effectively capture nonlinear behaviors, an activation function called the rectifier is introduced. The Rectified Linear Unit (ReLU) is the most commonly used activation function in deep learning models. If the input value is negative, the function returns 0; otherwise, the function returns the input value. ReLU can be expressed as f(x) = max(0, x). Maximum pooling provides down-sampling and feature fusion: each block of data is represented only by its largest value, so the data size is reduced after pooling. On the other hand, to achieve pixel-by-pixel classification, up-sampling is necessary to reconstruct an output with the same size as the original image [21] [22]. In up-sampling, samples are duplicated to expand the data size. Batch normalization is a widely used technique to speed up the training process. Each W × H data array is independently standardized into a normal distribution.
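The three operations described above (ReLU, 2 × 2 maximum pooling, and up-sampling by duplication) can be sketched with NumPy on a single W × H slice. The function names and the 4 × 4 example array are hypothetical. This is a minimal illustration of the operations, not the layers used in TrackNet.

```python
import numpy as np


def relu(x: np.ndarray) -> np.ndarray:
    """f(x) = max(0, x), applied element-wise."""
    return np.maximum(0, x)


def max_pool_2x2(x: np.ndarray) -> np.ndarray:
    """Represent every non-overlapping 2 x 2 block by its largest value."""
    height, width = x.shape
    if height % 2 or width % 2:
        raise ValueError("height and width must be even")
    blocks = x.reshape(height // 2, 2, width // 2, 2)
    return blocks.max(axis=(1, 3))


def upsample_2x2(x: np.ndarray) -> np.ndarray:
    """Duplicate every sample into a 2 x 2 block."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)


feature = np.array(
    [[1, -2, 3, 0],
     [4, 5, -6, 7],
     [-1, 0, 2, 2],
     [3, -3, 1, 8]],
    dtype=float,
)
activated = relu(feature)        # negative entries become 0
pooled = max_pool_2x2(activated) # 4 x 4 -> 2 x 2
restored = upsample_2x2(pooled)  # 2 x 2 -> 4 x 4, values duplicated
print(pooled)
print(restored.shape)
```

Up-sampling restores the array size but not the values discarded by pooling, which is why the decoder also needs trained convolution layers.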

Backward propagation is commonly used in training neural networks to learn the filter coefficients. First, forward propagation is performed to obtain a preliminary prediction. Then, the prediction is compared with the ground truth to evaluate a loss function. Finally, the weights of the model, i.e., the filter coefficients, are updated according to the loss by the gradient descent method. The chain rule is adopted to calculate the gradient of the loss function layer by layer. The process is repeated until a certain number of repetitions is reached or the loss falls below an acceptable threshold. The design of the loss function is an important factor that affects the training efficiency and the performance of the network. Commonly used loss functions include Root Mean Square Error (RMSE) and cross-entropy.

In this paper, we propose a deep learning network named TrackNet to detect the ball in tennis and badminton on broadcast sports competition videos. By training with consecutive input frames, TrackNet can not only recognize the ball but also learn its trajectory pattern. A heatmap, which is ideally a Gaussian distribution centered on the ball image, is then generated by TrackNet to indicate the position of the ball. The idea of exploiting heatmaps for object detection has been adopted in many studies [23] [24].

To compare and evaluate the performance of TrackNet, we implement Archana's algorithm [1], which uses conventional image processing techniques to detect the tennis ball. Archana's algorithm first smooths each frame with a median filter to remove noise. After a background model is calculated, background subtraction is performed to obtain the foreground. Then, the difference between frames is examined by a logical AND operation to identify fast-moving foreground objects. Those objects are compared with the shape, size, and aspect ratio of the tennis ball and selected by applying dilation and erosion to generate candidates. To filter out wrong candidates, in our implementation, a fully-connected neural network is trained to classify candidates into positive and negative categories. The candidate with the highest probability in the positive category is selected, indicating the position of the ball.

TABLE I
SEGMENTS OF LABEL FILES.

...
0008.jpg, 2, 727, 447, 0
0009.jpg, 1, 735, 457, 0
0010.jpg, 1, 722, 433, 1
0011.jpg, 1, 707, 403, 0
...
0029.jpg, 1, 555, 220, 0
0030.jpg, 1, 550, 218, 2
0031.jpg, 1, 547, 206, 0
...
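The frame-differencing step of the baseline can be illustrated with a small NumPy sketch. It assumes one common reading of "difference between frames by logical AND": a pixel counts as fast-moving only if it differs from both the previous and the following frame. The function names, the threshold of 25, and the 5 × 5 synthetic frames are hypothetical. The median filter, background model, and morphological steps are omitted.

```python
import numpy as np


def moving_mask(previous, current, following, threshold=25):
    """Three-frame differencing combined with a logical AND (assumed variant)."""
    prev_i = previous.astype(np.int16)   # widen to avoid uint8 wrap-around
    cur_i = current.astype(np.int16)
    next_i = following.astype(np.int16)
    changed_since_previous = np.abs(cur_i - prev_i) > threshold
    changed_before_following = np.abs(next_i - cur_i) > threshold
    return np.logical_and(changed_since_previous, changed_before_following)


def frame_with_ball(column, shape=(5, 5), row=2):
    """Synthetic grayscale frame with one bright pixel standing in for the ball."""
    frame = np.zeros(shape, dtype=np.uint8)
    frame[row, column] = 255
    return frame


# The bright pixel moves one column to the right per frame.
mask = moving_mask(frame_with_ball(1), frame_with_ball(2), frame_with_ball(3))
print(np.argwhere(mask))
```

Only the ball position in the middle frame survives the AND, because the positions in the neighboring frames each appear in just one of the two difference images.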

III. DATASET

Our first dataset is from the broadcast video of the tennis men's singles final at the 2017 Summer Universiade. The resolution, frame rate, and video length are 1280 × 720, 30 fps, and 75 minutes, respectively. After screening out unrelated frames, 81 game-related clips are segmented, each of which records a complete play, from the serve to the score. There are 20,844 frames in total. Each frame has the following attributes: "Frame Name", "Visibility Class", "X", "Y", and "Trajectory Pattern". Table I shows excerpts of the label files.

"Frame Name" is the name of the frame file. "Visibility Class", VC for short, indicates the visibility of the ball in each frame. The possible values are 0, 1, 2, and 3. VC = 0 implies the ball is not within the frame. VC = 1 implies the ball can be easily identified. VC = 2 implies the ball is in the frame but cannot be easily identified. For example, as shown in Figure 2, the ball in 0079.jpg is hardly visible since the color of the tennis ball is similar to the text "Taipei" on the court. However, with the help of the neighboring frames, 0078.jpg and 0080.jpg, the unclear ball position in 0079.jpg can be labeled. Figure 2 (d), (e), and (f) illustrate the labeling results. VC = 3 implies the ball is occluded by other objects. For example, as shown in Figure 3, the ball in 0139.jpg is occluded by the player. Similarly, based on the information from the neighboring frames, 0138.jpg and 0140.jpg, the ball position in 0139.jpg can be estimated. Figure 3 (d), (e), and (f) illustrate the labeling results. In the dataset, the numbers of frames with VC = 0, 1, 2, and 3 are 659, 18035, 2143, and 7, respectively.

Fig. 2. The ball image is hardly visible.

Fig. 3. The ball is occluded by the player.

"X" and "Y" indicate the coordinates of the tennis ball in pixel coordinates. Due to the high moving speed, tennis ball images in the broadcast video may be blurry and may even show an afterimage trace. In such cases, "X" and "Y" are taken as the latest position of the ball's trace. For example, as shown in Figure 4, the ball is flying from Player1 to Player2 with a prolonged trace, and the red dot indicates the labeled coordinate.

Fig. 4. An example of the prolonged tennis trace.

"Trajectory Pattern" indicates the type of ball movement, classified into three categories: flying, hit, and bouncing. They are labeled by 0, 1, and 2, respectively. Figure 5 is an example of striking a ball. The ball is flying at 0021.jpg and 0022.jpg. At 0023.jpg, the ball is labeled as hit. Figure 6 shows a bouncing case. The ball has not reached the ground at 0007.jpg and 0008.jpg. At 0009.jpg, the ball hits the ground and is labeled as bouncing.

Fig. 5. A hit case: (a) and (b) are labeled as flying, and (c) is labeled as hit.

Fig. 6. A bouncing case: (a) and (b) are labeled as flying, and (c) is labeled as bouncing.
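The five label attributes described in this section can be read with a short Python sketch. The class names `Visibility`, `Trajectory`, and `Label`, the enum member names, and the function `parse_label` are hypothetical. Only the field order and the numeric codes come from the text and Table I.

```python
from dataclasses import dataclass
from enum import IntEnum


class Visibility(IntEnum):
    """Visibility Class (VC) codes as described in the text."""
    OUT_OF_FRAME = 0
    EASILY_IDENTIFIED = 1
    HARD_TO_IDENTIFY = 2
    OCCLUDED = 3


class Trajectory(IntEnum):
    """Trajectory Pattern codes as described in the text."""
    FLYING = 0
    HIT = 1
    BOUNCING = 2


@dataclass(frozen=True)
class Label:
    frame_name: str
    visibility: Visibility
    x: int
    y: int
    trajectory: Trajectory


def parse_label(line: str) -> Label:
    """Parse one comma-separated row in the layout shown in Table I."""
    name, vc, x, y, pattern = [field.strip() for field in line.split(",")]
    return Label(name, Visibility(int(vc)), int(x), int(y), Trajectory(int(pattern)))


# Three rows copied from Table I.
rows = [
    "0009.jpg, 1, 735, 457, 0",
    "0010.jpg, 1, 722, 433, 1",
    "0030.jpg, 1, 550, 218, 2",
]
labels = [parse_label(row) for row in rows]
for label in labels:
    print(label.frame_name, label.visibility.name, (label.x, label.y), label.trajectory.name)
```

Using enums makes an out-of-range code fail loudly with a `ValueError` instead of silently passing through as an integer.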

To enrich the variety of the training dataset, an additional 16,118 frames are collected. These frames come from 9 videos recorded at different tennis courts, including grass, red clay, and hard courts. By learning diverse scenarios, the deep learning model is expected to recognize the tennis ball on various courts, which increases the robustness of the model. Further details are presented in Section V.

In addition to tennis, to explore the versatility of the proposed TrackNet in tracking high-speed and tiny objects, a trial run on a badminton match video is performed. Tracking the shuttlecock in badminton is more challenging than tracking the tennis ball since the shuttlecock travels much faster. The fastest serve according to the official records of the Association of Tennis Professionals is John Isner's 253 kilometers per hour at the 2016 Davis Cup. On the other hand, the fastest badminton hit in competition is Lee Chong Wei's 417 kilometers per hour smash at the 2017 Japan Open according to Guinness World Records, which is over 1.6 times as fast as the tennis record. Besides, in professional competitions, the speed of the shuttlecock is frequently over 300 kilometers per hour. The faster the object moves, the more difficult it is to track. Hence, it is expected that the performance will degrade for badminton compared with tennis.

Our badminton dataset comes from a video of the 2018 Indonesia Open Final - TAI Tzu Ying vs CHEN YuFei. The resolution is 1280 × 720 and the frame rate is 30 fps. Similarly, unrelated frames such as commercials or highlight replays are screened out. The resulting total number of frames is 18,242. We label each frame with the following attributes: "Frame Name", "Visibility Class", "X", and "Y".

In the badminton dataset, "Visibility Class" is classified into two categories, VC = 0 and VC = 1. VC = 0 means the ball is not in the frame and VC = 1 means the ball is in the frame. Unlike our tennis dataset, we do not use the VC = 2 and VC = 3 categories since the shuttlecock moves so fast that blurry images occur very frequently. Therefore, in the badminton dataset, VC = 1 covers every frame in which the ball is within the frame, whether it is clearly visible or hardly visible.

"X" and "Y" indicate the coordinates of the shuttlecock. Similar to tennis, "X" and "Y" are defined by the latest position of the ball's trace, considering its moving direction, if the image is prolonged. In badminton video, prolonged traces occur often and sometimes the position of the ball can hardly be identified. An example of how we label the prolonged images is shown in Figure 7.

Fig. 7. An example of the prolonged badminton trace.

IV. TRACKNET

Fig. 8. An example of the detection heatmap.

TrackNet is composed of a convolutional neural network (CNN) followed by a deconvolutional neural network (DeconvNet) [7]. It takes consecutive frames to generate a heatmap indicating the position of the object. The number of input frames is a network parameter. With one input frame, the network behaves as a conventional CNN. TrackNet with more than one input frame can improve moving object detection by learning the trajectory pattern. For the purpose of evaluation, two networks are implemented: one with single-frame input, and the other with three consecutive frames as input.

Fig. 9. The architecture of the proposed TrackNet.

TrackNet utilizes the heatmap-based CNN, which has proved useful in several applications [23] [24]. TrackNet is trained to generate a probability-like detection heatmap having the same resolution as the input frames. The ground truth of the heatmap is an amplified 2D Gaussian distribution located at the center of the tennis ball. The coordinates of the ball are available in the labeled dataset, and the variance of the Gaussian distribution is chosen according to the diameter of tennis ball images. Let (x0, y0) be the ball center; the heatmap function is expressed as

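$$G(x, y) = \left\lfloor \left( \frac{1}{2\pi\sigma^2} e^{-\frac{(x - x_0)^2 + (y - y_0)^2}{2\sigma^2}} \right) \left( 2\pi\sigma^2 \cdot 255 \right) \right\rfloor,$$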

where the first part is a Gaussian distribution centered at (x0, y0) with variance σ², and the second part scales the value to the range [0, 255]. σ² = 10 is used in our implementation since the average ball radius is about 5 pixels, roughly corresponding to the region of G(x, y) ≥ 128. Figure 8 is a visualized heatmap function of a tennis ball.
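A ground-truth heatmap of this form can be generated with a few lines of NumPy. The function name `gaussian_heatmap` and the example ball center are hypothetical. The sketch uses the algebraically equivalent form ⌊255 · exp(−d²/(2σ²))⌋, since the two factors of 2πσ² cancel. That simplification is a choice made here to avoid floating-point round-off at the peak, not something stated in the paper.

```python
import numpy as np


def gaussian_heatmap(width, height, x0, y0, variance=10.0):
    """Ground-truth heatmap G(x, y) with values in [0, 255].

    Uses floor(255 * exp(-d^2 / (2 * variance))), the simplified form of
    the expression in the text (the 2*pi*variance factors cancel).
    """
    ys, xs = np.mgrid[0:height, 0:width]
    squared_distance = (xs - x0) ** 2 + (ys - y0) ** 2
    values = 255.0 * np.exp(-squared_distance / (2.0 * variance))
    return np.floor(values).astype(np.uint8)


# Hypothetical ball center in a 640 x 360 frame; the array is indexed [y, x].
heatmap = gaussian_heatmap(640, 360, x0=320, y0=180)
print(heatmap.shape, heatmap.max())
print(heatmap[180, 318:323])
```

Under this formula with σ² = 10, values of at least 128 occur within roughly 3.7 pixels of the center, since 255 · exp(−d²/20) ≥ 128 requires d² ≤ 20 · ln(255/128) ≈ 13.8.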

The implementation details of TrackNet are illustrated in Figure 9 and Table II. The input of the proposed network can be any number of consecutive frames. The first 13 layers follow the design of the first 13 layers of VGG-16 [2] for object classification. Layers 14-24 follow DeconvNet [7] for semantic segmentation. To realize pixel-wise prediction, upsampling is applied to recover the information lost in the maximum pooling layers. Equal numbers of upsampling layers and maximum pooling layers are implemented.

TABLE II
NETWORK PARAMETERS OF TRACKNET.

| Layer | Filter Size | Depth | Padding | Stride | Activation |
|---|---|---|---|---|---|
| Conv1 | 3 × 3 | 64 | 2 | 1 | ReLU+BN |
| Conv2 | 3 × 3 | 64 | 2 | 1 | ReLU+BN |
| Pool1 | 2 × 2 max pooling, stride = 2 | | | | |
| Conv3 | 3 × 3 | 128 | 2 | 1 | ReLU+BN |
| Conv4 | 3 × 3 | 128 | 2 | 1 | ReLU+BN |
| Pool2 | 2 × 2 max pooling, stride = 2 | | | | |
| Conv5 | 3 × 3 | 256 | 2 | 1 | ReLU+BN |
| Conv6 | 3 × 3 | 256 | 2 | 1 | ReLU+BN |
| Conv7 | 3 × 3 | 256 | 2 | 1 | ReLU+BN |
| Pool3 | 2 × 2 max pooling, stride = 2 | | | | |
| Conv8 | 3 × 3 | 512 | 2 | 1 | ReLU+BN |
| Conv9 | 3 × 3 | 512 | 2 | 1 | ReLU+BN |
| Conv10 | 3 × 3 | 512 | 2 | 1 | ReLU+BN |
| UpS1 | 2 × 2 upsampling | | | | |
| Conv11 | 3 × 3 | 512 | 2 | 1 | ReLU+BN |
| Conv12 | 3 × 3 | 512 | 2 | 1 | ReLU+BN |
| Conv13 | 3 × 3 | 512 | 2 | 1 | ReLU+BN |
| UpS2 | 2 × 2 upsampling | | | | |
| Conv14 | 3 × 3 | 128 | 2 | 1 | ReLU+BN |
| Conv15 | 3 × 3 | 128 | 2 | 1 | ReLU+BN |
| UpS3 | 2 × 2 upsampling | | | | |
| Conv16 | 3 × 3 | 64 | 2 | 1 | ReLU+BN |
| Conv17 | 3 × 3 | 64 | 2 | 1 | ReLU+BN |
| Conv18 | 3 × 3 | 256 | 2 | 1 | ReLU+BN |
| Softmax | | | | | |

The final black-white binary detection heatmap is not directly available at the output of the deep learning network. The network outputs a detection heatmap that has continuous values within the range [0, 255] for each pixel. Let L(i, j, k) denote the data array with coordinates within (0, 0) ≤ (i, j) ≤ (639, 359) and depth within 0 ≤ k ≤ 255. The softmax layer calculates the probability distribution over depth k among the 256 possible grayscale values. Let P(i, j, k) denote the probability of depth k at (i, j). The softmax function is given by

$$P(i, j, k) = \frac{e^{L(i, j, k)}}{\sum_{l=0}^{255} e^{L(i, j, l)}}.$$

Based on the probability given by the softmax layer at each pixel, the depth k with the highest probability is selected as the heatmap value of the pixel. For each pixel, let

$$h(i, j) = \arg\max_{k} P(i, j, k)$$

denote the softmax layer output at (i, j), indicating the selected grayscale value at (i, j). Once the complete continuous detection heatmap is generated, the coordinate of the ball is determined in two steps. The first step converts the heatmap pixel by pixel into a black-white binary heatmap using a threshold t. If a pixel has a value larger than or equal to t, the pixel is set to 255; otherwise, the pixel is set to 0. Based on the previous discussion regarding the mean radius of a tennis ball, the threshold t is set to 128. The second step applies the Hough Gradient Method [25] to find the circle in the black-white binary detection heatmap. If exactly one circle is identified, the center of the circle is returned. In all other cases, no ball is considered detected.
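The softmax, arg max, and thresholding steps above can be sketched in NumPy as follows. All function names and the tiny 6 × 8 example are hypothetical. One deliberate substitution should be noted: the last step returns the centroid of the above-threshold pixels as a simple stand-in for the Hough Gradient Method used in the paper, and it does not implement the "exactly one circle" rule.

```python
import numpy as np


def softmax_depth(logits):
    """P(i, j, k): softmax over the last (depth) axis, shifted for stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exponent = np.exp(shifted)
    return exponent / exponent.sum(axis=-1, keepdims=True)


def heatmap_from_logits(logits):
    """h(i, j) = argmax_k P(i, j, k), a grayscale value in [0, 255]."""
    return softmax_depth(logits).argmax(axis=-1).astype(np.uint8)


def binarize(heatmap, t=128):
    """Pixels with value >= t become 255, all others 0."""
    return np.where(heatmap >= t, 255, 0).astype(np.uint8)


def blob_centroid(binary):
    """Stand-in for circle detection: mean (x, y) of the white pixels.

    NOT the Hough Gradient Method of the paper; returns None if no pixel is set.
    """
    ys, xs = np.nonzero(binary)
    if xs.size == 0:
        return None
    return float(xs.mean()), float(ys.mean())


# Tiny synthetic example: a 2 x 2 bright blob plus one dim pixel below t.
target = np.zeros((6, 8), dtype=np.int64)
target[2:4, 3:5] = 200
target[0, 0] = 100
logits = np.zeros((6, 8, 256))
rows, cols = np.indices(target.shape)
logits[rows, cols, target] = 10.0  # make the target depth the most likely one

heatmap = heatmap_from_logits(logits)
binary = binarize(heatmap)
center = blob_centroid(binary)
print(center)
```

In a fuller implementation the centroid step would be replaced by a circle detector so that frames with zero or several circles can be rejected, as the text describes.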

During the training phase, the cross-entropy function is used to calculate the loss based on P(i, j, k). The corresponding ground truth function, denoted by Q(i, j, k), is given by

$$Q(i, j, k) = \begin{cases} 1, & \text{if } G(i, j) = k; \\ 0, & \text{otherwise.} \end{cases}$$

Let H_Q(P) denote the loss function. Then,

$$H_Q(P) = -\sum_{i, j, k} Q(i, j, k) \log P(i, j, k).$$
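Because Q(i, j, k) is one-hot along the depth axis, the sum over k keeps a single term per pixel, namely the log-probability assigned to the ground-truth grayscale value G(i, j). The NumPy sketch below illustrates this. The function name, the small `eps` guard against log(0), and the 2 × 2 example are assumptions made for illustration.

```python
import numpy as np


def heatmap_cross_entropy(probabilities, ground_truth, eps=1e-12):
    """H_Q(P) = -sum_{i,j} log P(i, j, G(i, j)) for a one-hot Q.

    probabilities: array of shape (H, W, 256) with softmax outputs.
    ground_truth:  integer array of shape (H, W) with values in [0, 255].
    eps is an assumed numerical guard, not part of the formula in the text.
    """
    rows, cols = np.indices(ground_truth.shape)
    picked = probabilities[rows, cols, ground_truth]
    return float(-np.log(picked + eps).sum())


ground_truth = np.array([[0, 255], [128, 7]], dtype=np.int64)

# A maximally uncertain prediction: every grayscale value equally likely.
uniform = np.full((2, 2, 256), 1.0 / 256)
uniform_loss = heatmap_cross_entropy(uniform, ground_truth)

# A perfect prediction: all probability mass on the ground-truth value.
perfect = np.zeros((2, 2, 256))
rows, cols = np.indices(ground_truth.shape)
perfect[rows, cols, ground_truth] = 1.0
perfect_loss = heatmap_cross_entropy(perfect, ground_truth)

print(uniform_loss, perfect_loss)
```

The uniform prediction costs ln(256) per pixel, while the perfect prediction costs essentially zero.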

V. EXPERIMENTS

The experiment setup is as follows. The tennis dataset described in Section III is used to evaluate the performance of Archana's algorithm, a conventional image processing technique, and the proposed TrackNet. The dataset contains 20,844 frames and is randomly divided into a training set and a test set: 70% of the frames form the training set and 30% form the test set. To speed up training, all frames are resized from 1280 × 720 to 640 × 360. To optimize the weights of the network, the Adadelta optimizer [26] is applied. Table III summarizes other key parameters. Among these parameters, the number of epochs is one of the most critical factors in model training. Underfitting happens if it is too small, while overfitting happens if it is too large. For TrackNet, the loss versus the number of epochs is shown in Figure 10. Based on the simulation, we select 500 epochs as our optimal value to prevent both underfitting and overfitting.

TABLE III
KEY PARAMETERS USED IN MODEL TRAINING.

| Parameter | Setting |
|---|---|
| Learning rate | 1.0 |
| Batch size | 2 |
| Steps per epoch | 200 |
| Epochs | 500 |
| Initial weights | random uniform |
| Range of initial weights | [−0.05, 0.05] |

Fig. 10. The loss curve of TrackNet model training.

To compare the performance of TrackNet with one input frame and with three consecutive input frames, two versions of TrackNet are implemented. For convenience, TrackNet that takes one input frame is named Model I and TrackNet that takes three consecutive input frames is named Model II. For Model II, three consecutive frames are used to detect the ball coordinate in the last frame. During the training phase, three consecutive frames are considered a training sequence if the last frame belongs to the training set. Likewise, three consecutive frames are considered a test sequence if the last frame belongs to the test set. Note that the TrackNet framework is scalable: any number of consecutive input frames is allowed.

TABLE IV
PERFORMANCE SUMMARY.

| Model | VC | TP | FP | TN | FN | Total |
|---|---|---|---|---|---|---|
| Archana's | VC0 | - | 201 | 9 | - | 210 |
| Archana's | VC1 | 4046 | 334 | - | 947 | 5327 |
| Archana's | VC2 | 418 | 29 | - | 214 | 661 |
| Archana's | VC3 | 0 | 1 | - | 6 | 7 |
| TrackNet Model I | VC0 | - | 1 | 195 | - | 196 |
| TrackNet Model I | VC1 | 4933 | 221 | - | 241 | 5395 |
| TrackNet Model I | VC2 | 497 | 20 | - | 139 | 656 |
| TrackNet Model I | VC3 | 0 | 0 | - | 7 | 7 |
| TrackNet Model II | VC0 | - | 0 | 210 | - | 210 |
| TrackNet Model II | VC1 | 5223 | 3 | - | 101 | 5327 |
| TrackNet Model II | VC2 | 565 | 3 | - | 93 | 661 |
| TrackNet Model II | VC3 | 2 | 0 | - | 5 | 7 |
| TrackNet Model II' | VC0 | - | 4 | 206 | - | 210 |
| TrackNet Model II' | VC1 | 5234 | 6 | - | 87 | 5327 |
| TrackNet Model II' | VC2 | 598 | 7 | - | 56 | 661 |
| TrackNet Model II' | VC3 | 1 | 2 | - | 4 | 7 |

TABLE V
ACCURACY METRICS OF DIFFERENT MODELS.

| Model | Precision | Recall | F1-measure |
|---|---|---|---|
| Archana's [1] | 92.5% | 74.5% | 82.5% |
| TrackNet Model I | 95.7% | 89.6% | 92.5% |
| TrackNet Model II | 99.8% | 96.6% | 98.2% |
| TrackNet Model II' | 99.7% | 97.3% | 98.5% |
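The rule for forming Model II sequences (a sequence belongs to the split of its last frame) can be sketched as follows. The function names and the tiny synthetic frames are hypothetical. Concatenating the frames along the channel axis is an assumption about how consecutive frames are combined into one input, since the text does not spell this out. The sketch also ignores clip boundaries, which a real data loader would need to respect.

```python
import numpy as np


def stack_sequence(frames, last_index, num_frames=3):
    """Join num_frames consecutive frames ending at last_index.

    Channel-axis concatenation is an assumed input layout: three RGB
    frames of shape (H, W, 3) become one array of shape (H, W, 9).
    """
    first = last_index - num_frames + 1
    if first < 0 or last_index >= len(frames):
        raise IndexError("not enough frames for this sequence")
    return np.concatenate(frames[first:last_index + 1], axis=-1)


def sequence_last_indices(split_indices, num_frames=3):
    """Frames of a split that can serve as the last frame of a sequence."""
    return [i for i in sorted(split_indices) if i >= num_frames - 1]


# Five tiny synthetic RGB frames; frame i is filled with the value i.
frames = [np.full((4, 6, 3), i, dtype=np.uint8) for i in range(5)]

# Suppose frames 0, 2, and 4 were drawn into the training set.
train_last = sequence_last_indices({0, 2, 4})
sample = stack_sequence(frames, train_last[-1])
print(train_last, sample.shape)
```

Frame 0 is dropped because it has no two preceding frames, which is one way the single-frame and three-frame models can end up with different set sizes.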

To define a proper specification for prediction error, the size of the tennis ball is investigated. The diameter of tennis ball images in the video ranges from 2 to 12 pixels and the mean diameter is around 5 pixels. Since a prediction error within one ball size does not mislead trajectory identification, we set the positioning error (PE) specification to 5 pixels to indicate whether a ball is accurately detected. Detections with PE larger than 5 pixels count as false predictions. PE is defined as the Euclidean distance between the model prediction and the ground truth. Figure 11 shows the PE distribution of the TrackNet models. The x-axis represents PE in pixels and the y-axis is the percentage of occurrence. x = 0 stands for perfect detection, x = 1 means PE lies in 0 < PE ≤ 1, x = 2 means PE lies in 1 < PE ≤ 2, and so on. Note that the occurrence percentages of PE > 5 for Model I and Model II are 4.3% and 0.1%, respectively. That is, 95.7% and 99.9% of the detections of Model I and Model II fulfill the specification.
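The positioning error and the binning used for the distribution plot can be written down directly. The function names and the example coordinates are hypothetical. Only the Euclidean definition, the 5-pixel tolerance, and the bin rule (x = 0 for a perfect hit, x = n for n − 1 < PE ≤ n) come from the text.

```python
import math


def positioning_error(predicted, truth):
    """Euclidean distance in pixels between prediction and ground truth."""
    return math.hypot(predicted[0] - truth[0], predicted[1] - truth[1])


def is_accurate(predicted, truth, tolerance=5.0):
    """A detection fulfills the specification if PE <= tolerance."""
    return positioning_error(predicted, truth) <= tolerance


def histogram_bin(pe):
    """Bin index on the x-axis: 0 for PE == 0, n for n - 1 < PE <= n."""
    return math.ceil(pe)


truth = (735, 457)        # a labeled (X, Y) pair from Table I
predicted = (738, 461)    # hypothetical model output, 3 px right and 4 px down
pe = positioning_error(predicted, truth)
print(pe, is_accurate(predicted, truth), histogram_bin(pe))
```

A prediction exactly 5 pixels away still counts as accurate, since only PE strictly larger than 5 pixels is treated as a false prediction.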

Fig. 11. The distribution of the positioning error.

Archana's algorithm [1], an image processing technique developed by Archana and Geetha, is implemented for comparison. The prediction details of Archana's algorithm, TrackNet Model I, and TrackNet Model II are shown in Table IV, where TP, FP, TN, and FN stand for true positive, false positive, true negative, and false negative, respectively. The numbers are grouped by "Visibility Class" (VC). A false positive in VC1, VC2, or VC3 is a prediction with PE larger than 5 pixels. A false negative in VC1, VC2, or VC3 means that no ball is detected, or more than one ball is detected, when there is actually one ball in the frame. Note that since TrackNet Model I and TrackNet Model II use different numbers of input frames, the sizes of their training and test sets differ. Archana's algorithm and TrackNet Model II' follow the same training set and test set as TrackNet Model II.

It is observed that, compared to Archana's algorithm, both TrackNet Model I and TrackNet Model II significantly reduce false positives and false negatives, resulting in an increase of both true positives and true negatives. The comparison demonstrates the superior object detection capability of deep learning networks over conventional image processing algorithms. In addition, TrackNet Model II performs even better than TrackNet Model I, showing that training TrackNet with consecutive input frames can further improve its dynamic object tracking ability, especially for small objects. Moreover, TrackNet Model II occasionally positions occluded balls correctly: 2 out of 7 occluded balls are precisely detected. This finding indicates that consecutive frames provide critical information for the network to learn the trajectory patterns of the object of interest. By extracting information from neighboring frames, TrackNet Model II enhances its tracking precision not only on normal objects but also on blurry or occluded objects.

TABLE VI
ACCURACY ANALYSIS OF BADMINTON TRACKING.

| Model | Precision | Recall | F1-measure |
|---|---|---|---|
| TrackNet-Tennis | 75.8% | 22.9% | 35.2% |
| TrackNet-Badminton | 85.0% | 57.7% | 68.7% |

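The overall performance in terms of precision, recall, and F1-measure is summarized in Table V. These three metrics are defined by

$$\text{Precision} = \frac{\text{\# of True Positive}}{\text{\# of True Positive} + \text{False Positive}},$$

$$\text{Recall} = \frac{\text{\# of True Positive}}{\text{\# of VC1} + \text{VC2} + \text{VC3}}, \text{ and}$$

$$\text{F1-measure} = \frac{2(\text{Precision} \times \text{Recall})}{\text{Precision} + \text{Recall}}.$$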

Archana’s algorithm reaches 92.5% precision, 74.5% recall, and 82.5% F1-measure. With the help of a powerful deep learning network, TrackNet Model I outperforms Archana’s algorithm and reaches 95.7% precision, 89.6% recall, and 92.5% F1-measure. By learning to extract trajectory information from neighboring frames, TrackNet Model II further improves the performance and achieves 99.8% precision, 96.6% recall, and 98.2% F1-measure.
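As a quick consistency check, the three metrics can be written as small Python functions and the reported F1-measures recomputed from the reported precision and recall. The function names are hypothetical, and the inputs are the rounded percentages quoted above, so agreement is only expected to within rounding.

```python
def precision(true_positive, false_positive):
    """Fraction of detections that are correct."""
    return true_positive / (true_positive + false_positive)


def recall(true_positive, frames_with_ball):
    """Fraction of frames containing a ball (VC1 + VC2 + VC3) that are detected."""
    return true_positive / frames_with_ball


def f1_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


# (precision, recall, reported F1) as quoted in the text, written as fractions.
reported = {
    "Archana's algorithm": (0.925, 0.745, 0.825),
    "TrackNet Model I": (0.957, 0.896, 0.925),
    "TrackNet Model II": (0.998, 0.966, 0.982),
}

for name, (p, r, f1) in reported.items():
    print(f"{name}: computed F1 = {f1_measure(p, r):.4f}, reported F1 = {f1:.3f}")
```

Each recomputed value agrees with the reported F1-measure to within 0.001, which is the precision of the published figures.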

To reduce the overfitting that frequently occurs in deep learning solutions, another 16,118 frames are added to the training set. These 16,118 frames are collected from an additional 9 videos recorded at different tennis courts, such as grass, red clay, and hard courts. The model trained on the enriched training set is named TrackNet Model II’. TrackNet Model II’ follows the same training logic as TrackNet Model II; the only difference is the variety of the training set. The prediction details are shown in Table IV. As expected, the performance of TrackNet Model II’ is similar to that of TrackNet Model II on the same test set, as shown in Table V. TrackNet Model II’ achieves 99.7% precision, 97.3% recall, and 98.5% F1-measure. Furthermore, 10-fold cross validation is applied to TrackNet Model II’ for a more robust and comprehensive analysis. With 10-fold cross validation, TrackNet Model II’ reaches 95.3% precision, 75.7% recall, and 84.3% F1-measure.
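For readers unfamiliar with the procedure, a generic 10-fold split can be sketched as follows. This is a minimal illustration, not the authors' implementation: the text does not say how frames were assigned to folds, so the shuffled, frame-level assignment, the seed, and the name `k_fold_splits` are all assumptions.

```python
import random


def k_fold_splits(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        yield train, test


# Every sample is held out exactly once across the 10 folds.
for fold_number, (train, test) in enumerate(k_fold_splits(25), start=1):
    print(fold_number, len(train), len(test))
```

A model would be trained on `train` and evaluated on `test` for each fold, and the per-fold metrics would then be combined.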

In addition to tennis, we also apply the proposed TrackNet to the badminton dataset introduced in Section III. The badminton dataset contains 18,242 frames with a resolution of 1280×720. As with the tennis dataset, all frames are resized from 1280×720 to 640×360 to speed up training. The dataset is randomly divided into a training set containing 70% of the frames and a test set containing the remaining 30%. For the badminton dataset, the training parameters, including learning rate, batch size, number of epochs, etc., are set to the same values used for the tennis dataset, as shown in Table III.
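The preparation described above can be sketched with the Python standard library. The names `scale_label` and `split_frames`, the seed, and the rounding of the 70% cut are assumptions for illustration. Rescaling the labelled coordinates along with the frames is also an assumption, since this section does not say how labels are handled after resizing.

```python
import random

SRC_SIZE = (1280, 720)  # original frame width and height
DST_SIZE = (640, 360)   # resized frame width and height


def scale_label(x, y, src=SRC_SIZE, dst=DST_SIZE):
    """Map a labelled position from the original frame to the resized frame."""
    return x * dst[0] / src[0], y * dst[1] / src[1]


def split_frames(n_frames, train_fraction=0.7, seed=0):
    """Randomly divide frame indices into a training set and a test set."""
    ids = list(range(n_frames))
    random.Random(seed).shuffle(ids)
    cut = int(n_frames * train_fraction)
    return ids[:cut], ids[cut:]


train_ids, test_ids = split_frames(18242)
print(len(train_ids), len(test_ids))
print(scale_label(1280, 720))
```

Fixing the seed keeps the split reproducible between runs.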

Before evaluating TrackNet on badminton, we define what counts as a correct detection by analyzing the dimensions of shuttlecock images in the video. Unlike a tennis ball, a shuttlecock is not spherical, so its image size varies more. We define the diameter of a shuttlecock image as the average of its largest length and width. The image size has two extreme cases: one occurs when the shuttlecock moves toward the camera at the backcourt, and the other when it moves laterally at the frontcourt. In our dataset, such cases result in a large variation in image diameter, ranging from 3 to 24 pixels. Since the mean diameter is around 7.5 pixels, we set the PE specification to 7.5 pixels to determine whether a shuttlecock image is accurately detected. Detections with a PE larger than 7.5 pixels are counted as incorrect predictions. Compared with tennis, which has a PE specification of 5 pixels, the PE specification for badminton appears relaxed. The main reason is that shuttlecock images are larger than tennis ball images in the video because the badminton court is smaller than the tennis court. Therefore, the camera can frame the entire court with a narrower field of view, resulting in larger images of the shuttlecock and players.
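The diameter definition and the resulting PE specification amount to simple averaging, sketched below. The sample extents are hypothetical values chosen only so that the diameters span 3 to 24 pixels with a mean of 7.5; they are not measurements from the dataset, and the function names are illustrative.

```python
def diameter(length, width):
    """Diameter of a shuttlecock image: average of its largest length and width."""
    return (length + width) / 2


def pe_specification(extents):
    """Mean diameter over a set of (length, width) measurements in pixels."""
    diameters = [diameter(length, width) for length, width in extents]
    return sum(diameters) / len(diameters)


# Hypothetical (length, width) pairs in pixels, for illustration only.
sample_extents = [(3, 3), (4, 4), (5, 4), (5, 5), (6, 5), (6, 6), (8, 8), (24, 24)]

tolerance = pe_specification(sample_extents)
print(tolerance)
```

A detection would then be accepted when its PE does not exceed `tolerance`.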

To evaluate the badminton tracking ability of TrackNet, we first adopt the idea of transfer learning and directly apply the TrackNet model trained on the tennis dataset to badminton trajectory recognition. We name this model TrackNet-Tennis; it is trained on the tennis dataset using three consecutive input frames. As shown in Table VI, for badminton tracking, TrackNet-Tennis achieves precision, recall, and F1-measure of only 75.8%, 22.9%, and 35.2%, respectively. Although the precision seems acceptable, the recall is too low to be useful. Such a low recall is due to a large number of false negatives, implying that the shuttlecock is not recognized in many circumstances. The main cause of this poor performance is the fundamental difference in characteristics between tennis and badminton, including velocity, trajectory, shape, etc. To verify the feasibility of the TrackNet framework for badminton tracking, we train another model, named TrackNet-Badminton, on the badminton dataset using three consecutive input frames. As shown in Table VI, TrackNet-Badminton reaches precision, recall, and F1-measure of 85.0%, 57.7%, and 68.7%, respectively. As expected, TrackNet-Badminton is able to learn the features of the shuttlecock, leading to a significant performance improvement.

Furthermore, when we compare tennis and badminton tracking performance using the TrackNet framework, tennis tracking outperforms badminton tracking by a noticeable margin. This is because a shuttlecock travels much faster than a tennis ball, resulting in much blurrier object images in badminton videos. As elaborated in Section III, the fastest recorded shuttlecock moves at 417 kilometers per hour, while the fastest recorded tennis ball moves at 253 kilometers per hour. Such an enormous increase in velocity degrades performance, especially recall, because of the high number of false negatives. At high speed, the shuttlecock travels a long distance within only a few frames, and the dynamics of such trajectories become hard for the model to recognize. In addition to its absolute speed, a shuttlecock exhibits a much higher variation in traveling speed than a tennis ball. For example, in badminton, a drop shot and a smash differ significantly in velocity. Such extreme scenarios commonly occur during a badminton match, making it hard for the model to fit both perfectly. Nonetheless, although the performance in tracking badminton is not as strong as in tennis, a precision of 85.0% is accurate enough to correctly depict the trajectories in the game. Future research will address identifying the trajectories of extremely fast objects and learning the distinct patterns caused by significant speed variation.
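To put the two record speeds in perspective, the distance covered between consecutive frames can be estimated with a short calculation. The frame rate of 30 frames per second is an assumption made for illustration, since this section does not state the frame rate of the videos, and the function name is hypothetical.

```python
def metres_per_frame(speed_kmh, fps=30.0):
    """Distance travelled between two consecutive frames at a constant speed."""
    metres_per_second = speed_kmh * 1000 / 3600
    return metres_per_second / fps


shuttlecock = metres_per_frame(417)  # fastest recorded shuttlecock
tennis_ball = metres_per_frame(253)  # fastest recorded tennis ball

print(f"shuttlecock: {shuttlecock:.2f} m per frame")
print(f"tennis ball: {tennis_ball:.2f} m per frame")
print(f"ratio: {shuttlecock / tennis_ball:.2f}")
```

Under this assumption, a shuttlecock at record speed covers nearly 4 metres between frames, about 1.65 times as far as a tennis ball at its record speed. These are peak values rather than typical in-game speeds.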

VI. CONCLUSION

In this paper, we proposed TrackNet, a heatmap-based deep learning network comprising both convolutional and deconvolutional neural networks. TrackNet is able to precisely position the coordinates of high-speed and tiny objects such as tennis balls and shuttlecocks. With TrackNet, accurate predictions can be achieved on broadcast sports videos without high frame rates or high resolution, significantly reducing the cost of recording and processing high-specification videos. To enhance TrackNet’s capability of identifying trajectory patterns of fast-moving objects, we designed a scalable input that allows TrackNet to be fed with multiple consecutive input frames. By evaluating both a conventional image processing algorithm and the proposed TrackNet on a real tennis video dataset, we demonstrated that TrackNet achieves explainable and exceptional prediction performance by applying the concept of consecutive input frames to a deep neural network. Moreover, for even faster objects such as shuttlecocks, TrackNet achieves decent tracking capability according to our experimental results, showing promising extensibility to related applications.

ACKNOWLEDGEMENT

This work of T.-U. Ik was supported in part by the Ministry of Science and Technology, Taiwan, under grants MOST 107-2627-H-009-001 and MOST 105-2221-E-009-102-MY3. This work was financially supported by the Center for Open Intelligent Connectivity from The Featured Areas Research Center Program within the framework of the Higher Education Sprout Project by the Ministry of Education (MOE), Taiwan.

REFERENCES

[1] M. Archana and M. K. Geetha, “Object detection and tracking based on trajectory in broadcast tennis video,” Procedia Computer Science, vol. 58, pp. 225–232, 2015.

[2] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.

[3] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014), 23-28 June 2014, pp. 580–587.

[4] R. Girshick, “Fast R-CNN,” in International Conference on Computer Vision (ICCV 2015), 11-18 December 2015, pp. 1440–1448.

[5] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems, 2015, pp. 91–99.

[6] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.

[7] H. Noh, S. Hong, and B. Han, “Learning deconvolution network for semantic segmentation,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1520–1528.

[8] H.-T. Chen, W.-J. Tsai, S.-Y. Lee, and J.-Y. Yu, “Ball tracking and 3D trajectory approximation with applications to tactics analysis from single-camera volleyball sequences,” Multimedia Tools and Applications, vol. 60, no. 3, pp. 641–667, October 2012.

[9] X. Wang, V. Ablavsky, H. B. Shitrit, and P. Fua, “Take your eyes off the ball: Improving ball-tracking by focusing on team play,” Computer Vision and Image Understanding, vol. 119, pp. 102–115, February 2014.

[10] T.-S. Fu, H.-T. Chen, C.-L. Chou, W.-J. Tsai, and S.-Y. Lee, “Screen-strategy analysis in broadcast basketball video using player tracking,” in Proceedings of the 2011 IEEE Visual Communications and Image Processing (VCIP), 6-9 November 2011.

[11] H. Myint, P. Wong, L. Dooley, and A. Hopgood, “Tracking a table tennis ball for umpiring purposes,” in Proceedings of the 14th IAPR International Conference on Machine Vision Applications (MVA 2015), 18-22 May 2015, pp. 170–173.

[12] “Hawk-eye,” https://en.wikipedia.org/wiki/Hawk-Eye.

[13] X. Yu, C.-H. Sim, J. R. Wang, and L. F. Cheong, “A trajectory-based ball detection and tracking algorithm in broadcast tennis video,” in 2004 International Conference on Image Processing (ICIP 2004), vol. 2. Singapore: IEEE, 24-27 October 2004, pp. 1049–1052.

[14] V. Reno, N. Mosca, M. Nitti, C. Guaragnella, T. D’Orazio, and E. Stella, “Real-time tracking of a tennis ball by combining 3D data and domain knowledge,” in Technology and Innovation in Sports, Health and Wellbeing (TISHW), International Conference on. IEEE, 2016, pp. 1–7.

[15] F. Yan, W. Christmas, and J. Kittler, “A tennis ball tracking algorithm for automatic annotation of tennis match,” in Proceedings of the British Machine Vision Conference (BMVC 2005), vol. 2. Durham, England: BMVA, 5-8 September 2005, pp. 619–628.

[16] X. Zhou, L. Xie, Q. Huang, S. J. Cox, and Y. Zhang, “Tennis ball tracking using a two-layered data association approach,” IEEE Transactions on Multimedia, vol. 17, no. 2, pp. 145–156, 2015.

[17] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.

[18] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.

[19] W. Jiang and Z. Yin, “Human activity recognition using wearable sensors by deep convolutional neural networks,” in Proceedings of the 23rd ACM International Conference on Multimedia. ACM, 2015, pp. 1307–1310.

[20] Y. Chen and Y. Xue, “A deep learning approach to human activity recognition based on single accelerometer,” in 2015 IEEE International Conference on Systems, Man, and Cybernetics (SMC). IEEE, 2015, pp. 1488–1492.

[21] V. Badrinarayanan, A. Handa, and R. Cipolla, “SegNet: A deep convolutional encoder-decoder architecture for robust semantic pixel-wise labelling,” arXiv preprint arXiv:1505.07293, 2015.

[22] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.

[23] V. Belagiannis and A. Zisserman, “Recurrent human pose estimation,” in 2017 12th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2017). IEEE, 2017, pp. 468–475.

[24] T. Pfister, J. Charles, and A. Zisserman, “Flowing convnets for human pose estimation in videos,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1913–1921.

[25] “Hough gradient method,” https://goo.gl/gZTQRm.

[26] M. D. Zeiler, “ADADELTA: an adaptive learning rate method,” arXiv preprint, vol. abs/1212.5701, 2012. [Online]. Available: http://arxiv.org/abs/1212.5701
