
Real-Time Camera Pose Estimation for Sports Fields

Leonardo Citraro∗1, Pablo Márquez-Neila∗2, Stefano Savarè1, Vivek Jayaram3, Charles Dubout3, Félix Renaut3, Andrés Hasfura3, Horesh Ben Shitrit3, and Pascal Fua1

1Computer Vision Laboratory, École Polytechnique Fédérale de Lausanne
2ARTORG Center for Computer Aided Surgery, University of Bern

3Second Spectrum Inc.

Abstract

Given an image sequence featuring a portion of a sports field filmed by a moving and uncalibrated camera, such as that of a smartphone, our goal is to compute automatically and in real-time the focal length and extrinsic camera parameters for each image in the sequence, without using a priori knowledge of the position and orientation of the camera. To this end, we propose a novel framework that combines accurate localization and robust identification of specific keypoints in the image by using a fully-convolutional deep architecture. Our algorithm exploits both the field lines and the players' image locations, assuming their ground-plane positions to be given, to achieve accuracy and robustness beyond the current state of the art. We demonstrate its effectiveness on challenging soccer, basketball, and volleyball benchmark datasets.

1. Introduction

Accurate camera registration is a prerequisite for many applications such as augmented reality or 3D reconstruction. It is now a commercial reality in well-textured environments and when additional sensors can be used to supplement the camera. However, sports arenas such as the one depicted in Fig. 1 pose special challenges. The presence of the well-marked lines helps, but they provide highly repetitive patterns and very little texture. Furthermore, the players often occlude the landmarks that could be used for disambiguation. Finally, challenging lighting conditions are prevalent outdoors and not uncommon indoors, as shown in the figure.

As a result, traditional keypoint-based methods typically fail in such scenarios [22]. An alternative is to explicitly use edge and line information [12, 15, 24], but these methods tend to be slow, to rely on sensitive parameterizations based on vanishing points, or to require prior knowledge of the camera position. With the advent of deep learning, direct regression from the image to the camera pose has become an option [16], but it often fails to deliver accurate results.

∗Contributed equally.

Figure 1: Camera Pose Estimation from Correspondences. Even though the lighting is bad, we can reliably detect preselected intersections of court lines. This yields 2D-to-3D correspondences that we can use to compute the focal length and camera extrinsic parameters. If the position of the players is known, we can detect them and use them to instantiate additional 3D-to-2D correspondences.

In this paper, we propose a novel framework that combines accurate localization and robust identification of specific keypoints in the image by using a fully-convolutional deep architecture [21]. In practice, these keypoints are taken to be intersections between ground lines, and the network leverages the fact that they do not overlap to drastically reduce inference time. These keypoints can then be used to directly compute the homography from the image plane to the ground plane, along with the camera focal length and extrinsic parameters, from single images. When using video sequences, the unique identities of the keypoints make it easy to impose temporal consistency and improve robustness. Finally, when the ground location of the players is known, it can be used to further improve the accuracy and robustness of the estimates. To demonstrate this, we use a commercial system [23] that uses a set of fixed cameras to compute these locations. This enables us to use the players' feet as additional landmarks to increase robustness to narrow fields of view and a lack of visible lines.

We will show that our method outperforms the state-of-the-art methods for soccer scenarios [12, 15, 24], which are the only ones for which there are published results. In addition, as publicly available datasets on this subject are rare, we will introduce and demonstrate the effectiveness of our system in challenging basketball, volleyball, and soccer scenarios that feature difficult lighting conditions, motion blur, and narrow fields of view. We will make the basketball and volleyball datasets publicly available.

In short, our contribution is a fast, robust, and generic framework that can handle much more challenging situations than existing approaches. We leverage the fact that keypoints on a plane do not overlap to drastically reduce inference time, thus enabling the detection of a high number of interest points. In addition, we exploit the positions of the players to further increase robustness and accuracy in images lacking visible features. Our method easily operates at 20-30 frames per second on a desktop computer with an Nvidia TITAN X (Pascal) GPU.

2. Related Work

As the dimensions of sports fields are known and 3D models are available, a naive approach to camera pose estimation would be to look for projections of specific parts of the models in each image, establish 3D-to-2D correspondences, and use them to compute the camera parameters. Unfortunately, because the patterns in sports arenas and fields are repetitive, occlusions are frequent, and lighting is often poor, correspondences established using traditional methods such as SIFT [19], SURF [2], or BRIEF [3] are unreliable, which makes this approach prone to failure. More specialized methods have therefore been proposed to overcome these difficulties by leveraging the specificities of sports fields without resorting to additional sensors.

In the case of soccer, the field is large and the lines delimiting it are widely separated. As a result, in many camera views, too few of them are visible for reliable camera registration. In [5], this problem is mitigated by using a two-point method. The approach is effective but also very restrictive because it requires prior knowledge of the position and rotation axis of the camera. In [18], the lines of the field are used to compute a homography, while in [1] a mathematical characterization of the central circle is used to overcome the shortage of features. Similarly, in [9], points, lines, and ellipses are exploited to localize hockey rinks. This can be effective for specific views but lacks generality. In the results section, we will show that using the players' positions is most beneficial when only a few of the court lines are visible in the image.

In [6, 24], homography estimation relies on a dictionary of precomputed features of synthetic edge images and corresponding poses. For a given input image, they first perform a nearest-neighbor search to find the most similar one in the database. Then, the candidate homography is refined using image features. In [24], temporal consistency and smoothness of the estimates are also enforced over time. The limiting factor in these approaches is the variability of the potential poses. Both methods use the fact that the camera is in a fixed position to reduce the size of the dictionary, which would otherwise be very large. In our approach, we impose no constraints on the position and orientation of the camera, and the variability of the poses does not affect performance.

In [12], the problem is approached differently. Estimating the homography relating the image plane to a soccer field is treated as branch-and-bound inference in a Markov random field (MRF) whose energy is minimized when the image and a generative model agree. The image is first segmented using a deep network to find lines, circles, and grassy areas, and finally vanishing points. The vanishing points are used to constrain the search for the homography matrix and speed up the energy minimization. The limiting factor in this method is the estimation of the vanishing points, a computation known to be error-prone when the perspective distortion is severe. Our approach involves no such computation.

In [15], a framework that minimizes an inferred registration error is proposed for accurate localization of the field. Two deep networks are used. The first regresses an estimate of the homography characterizing the field. The second estimates the registration error between the current frame and the model projected with that estimate. Through differentiation in the second network, the homography is refined to minimize the registration error. The error estimation and the refinement are then repeated until convergence. This method has the potential to produce accurate poses; however, it requires a relatively good initial estimate from the first network to converge. In addition, at each iteration the model of the court needs to be warped and forward-passed through the second network, making this approach slow.

In the experiments section, we will compare our results to those of a traditional descriptor-based approach [19], a newer end-to-end regression network [16], and the approaches of [12, 15, 24], as they have demonstrated good results.


Figure 2: Semantic and Player Keypoints. The red dots denote semantic keypoints. The blue crosses represent players' locations that become our player keypoints when they are available.

3. Approach

Given an image sequence featuring a portion of a sports field filmed by a moving and uncalibrated camera, such as that of a smartphone, our goal is to compute in real-time the focal length and extrinsic camera parameters for each image in the sequence, without using a priori knowledge of the camera position and orientation.

Our approach relies on two information sources that are more dependable and almost always available in images of sports fields. The primary one comprises the lines painted on the ground, their intersections, and the corners they define, such as those depicted by red dots in Fig. 2. We assign a unique identity to each one of them and refer to them as semantic keypoints. The secondary, optional source comes from the players. We detect the projections of their centers of mass on the ground in the images using a multi-camera setup and refer to them as player keypoints. Since exploiting the identities of the players would require the difficult task of tracking jersey numbers from a single view, we treat the player keypoints as points that do not have a specific identity. The player keypoints are represented by the blue crosses in Fig. 2.

To overcome the issues related to difficult lighting conditions and poorly textured scenes, we rely on a fully-convolutional U-Net architecture [21] that combines global and local information for simultaneous description and accurate localization of both kinds of keypoints. This architecture has proved very effective for image segmentation, and we will show that it is just as effective for our purposes. First, we use the semantic keypoints localized in the image to compute an initial estimate of the homography that maps the image plane to the field. Then, we use the player keypoints to refine the estimate. The homography is then decomposed into intrinsic and extrinsic parameters.

Figure 3: Our approach. We use a U-Net to detect the semantic and player keypoints. They are then used to compute the homographies from the ground to the image planes in individual images (non-maximum suppression, homography estimation, refinement, intrinsics estimation, homography decomposition). Finally, the camera parameters are inferred from these homographies and refined by imposing temporal consistency and non-linear least-squares minimization.

Finally, we use a particle filter to enforce robustness over time.

Fig. 3 summarizes our approach. We now formalize it and describe its individual components in more detail.

3.1. Formalization

Let $\{I^t\}_{t=1}^{T}$ be a sequence of $T$ images of a sports field taken with a possibly uncalibrated camera. The camera pose at time $t$ is represented by a $3 \times 4$ transformation matrix $M^t = [R^t \mid \mathbf{t}^t]$, where $R^t$ is a rotation matrix and $\mathbf{t}^t$ a translation vector. $M^t$ is parameterized by 6 extrinsic parameters. Similarly, the camera internal calibration is given by a $3 \times 3$ matrix $K^t$ parameterized by 5 intrinsic parameters. Formally, our problem is to find the state vector $X^t = [K^t, M^t]$ for each image of the sequence.

We assume complete knowledge of the position of the field lines. In particular, we know the world coordinates $Z_S$ of a set of semantic keypoints on the sports field that have been manually selected once and for all, such as those depicted by red dots in Fig. 2. Optionally, we can also assume knowledge of the positions of the players on the sports field at each time step $t$, that is, the world coordinates $Z_P^t$ of the projections of their centers of gravity onto the ground plane. In practice, the players' positions in world coordinates are computed using a multi-camera system [23] assumed to be synchronized with the mobile camera. The estimated position can be slightly imprecise when the players jump; however, the resulting error is small enough to be neglected. The player keypoints are shown as blue crosses in Fig. 2.

Figure 4: Detecting keypoints. Example image with superimposed predictions. The crosses indicate the true projections of the interest points and the red dots the 2D locations found by our network. The gray/white patches represent the score returned by the U-Net. Even though some interest points are occluded, the network localizes them accurately.

Given an image $I^t$ of the sequence, our method estimates the 2D image locations $z_S^t$ of the semantic keypoints and $z_P^t$ of the players' centers of gravity. We then match them to their known 3D locations $Z_S$ and $Z_P^t$. From the resulting 3D-to-2D correspondences we compute the homography $H^t$ between the ground plane and the image, which is then further decomposed into $K^t$ and $M^t$ as described in Appendix A.

A typical basketball court or soccer field is roughly symmetric with respect to its long and short axes. As a result, it can be hard to distinguish views taken from one corner from those taken from the opposite one, as the keypoints will look the same. However, if we know on which side of the field the camera is, this ambiguity disappears and we can assign a unique identity to all keypoints, which is what we do. More specifically, during training we swap the identities of the symmetric points when the camera moves to the other side of the court. By doing so, we maintain the same identity for similar feature points and provide coherent inputs to our network. If the court has logos that are not symmetric, two networks have to be trained, one for each side of the playing field.

3.2. Detecting Keypoints

To detect the keypoints and compute their 2D locations in individual images, we train a single U-Net deep network [21] that jointly locates semantic and player keypoints in the image. We configure the network to perform pixel-wise classification with a standard multi-class cross-entropy loss. The network produces a volume that encodes the identity and presence of a keypoint, so that the pixels within a distance $b$ of it are assigned the corresponding class.

Output Volume. We dimension our network to take as input an RGB image $I \in \mathbb{R}^{H \times W \times 3}$ and return a volume $V = \{V_0, \ldots, V_{J+1}\}$ composed of $J+2$ channels, where $J$ is the number of semantic keypoints. These have a unique identity; therefore, channels $V_j$ with $j$ in the range $\{1, \ldots, J\}$ are used to encode their locations. The player keypoints, on the other hand, are all assigned the same identity, so we define a single channel $V_{J+1}$ to encode their locations. To assign a class also to locations where there is no keypoint, we take $V_0$ to be the background channel.

Let $z_{S|j}$ be the projected $j$-th semantic keypoint and $V_j^*$ its associated ground-truth channel; we set to 1 the pixels that are within a distance $b$ of the interest point and to 0 elsewhere. In the same manner, we set the pixels of $V_{J+1}^*$ to 1 at locations where there is a player keypoint and to 0 elsewhere. If two keypoints are within $2b$ of each other, the pixels are assigned the class of the closest one. Finally, we set the background channel $V_0^*$ so that for any location $p \in \mathbb{R}^2$ it satisfies $\sum_i V_i^*(p) = 1$.
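This construction is straightforward to implement. Below is a minimal sketch, assuming keypoints are given as (x, y) pixel coordinates; the function and parameter names are ours, not the paper's. It also computes the class-imbalance weights described in the next paragraph as the ratio of total pixels to per-class pixels.

```python
import numpy as np

def build_gt_volume(height, width, semantic_kps, player_kps, b=8):
    """semantic_kps: list of J (x, y) projections (None if off-screen);
    player_kps: list of (x, y); returns (V*, class weights)."""
    J = len(semantic_kps)
    vol = np.zeros((height, width, J + 2), dtype=np.float32)
    ys, xs = np.mgrid[0:height, 0:width]

    # Each pixel takes the class of the closest keypoint within radius b,
    # which also resolves ties when two keypoints are within 2b of each other.
    best_dist = np.full((height, width), np.inf)
    best_cls = np.zeros((height, width), dtype=np.int64)  # 0 = background
    points = [(j + 1, p) for j, p in enumerate(semantic_kps) if p is not None]
    points += [(J + 1, p) for p in player_kps]
    for cls, (x, y) in points:
        d = np.hypot(xs - x, ys - y)
        mask = (d <= b) & (d < best_dist)
        best_dist[mask] = d[mask]
        best_cls[mask] = cls

    # One-hot encode so that sum_i V_i(p) = 1 everywhere.
    vol[ys, xs, best_cls] = 1.0

    # Class-imbalance weights: total pixel count over per-class pixel count.
    counts = np.maximum(vol.sum(axis=(0, 1)), 1.0)
    class_weights = (height * width) / counts
    return vol, class_weights
```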

Training. The output of our network is a volume $V \in [0,1]^{H \times W \times (J+2)}$ that encodes class probabilities for each pixel in the image, that is, the probability of a pixel belonging to one of the $J+2$ classes defining the keypoints' identities and the background. This is achieved using a pixel-wise softmax layer. During training, the ground-truth keypoint locations $Z_S$ and $Z_P^t$ are projected into a given image $I^t$ using the associated ground-truth homography. These projections are then used to create the ground-truth output volume $V^* \in \{0,1\}^{H \times W \times (J+2)}$ as described in the previous paragraph. In addition to the volume $V^*$, we create weights to compensate for class imbalance. For a given class, the weight is the ratio of the total number of pixels in the image to the number of pixels that belong to that class.

As will be discussed in Section 4.1, our training data comprise sequences of varying lengths taken from different viewpoints, with annotations for both the semantic keypoints and the players' locations. At every training iteration, we choose our minibatches by taking into consideration the frequency of a viewpoint. In other words, images from short sequences are more likely to be chosen than images from long ones. This tends to make the distribution of viewpoints more even.

Finally, to increase global context awareness, we use an augmentation method named SpatialDropout [25] during training to force the network to use the information surrounding a keypoint to infer its position. At every training iteration, we randomly create boxes of different sizes and zero out the pixels of the input image that are within them. The number of boxes, as well as their sizes and positions, are drawn from uniform distributions. As a result, and as shown in Fig. 4, keypoints can be correctly detected and localized even when they are occluded by a player.
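As a rough illustration of this augmentation, here is a sketch that zeroes out randomly placed boxes; the box-count and size ranges below are illustrative assumptions, not the authors' values.

```python
import numpy as np

def occlude_random_boxes(image, rng, max_boxes=8, max_size=64):
    """Zero out a random number of axis-aligned boxes in `image` (H, W, 3)."""
    img = image.copy()
    h, w = img.shape[:2]
    for _ in range(rng.integers(1, max_boxes + 1)):
        bh = rng.integers(4, max_size)   # box height, drawn uniformly
        bw = rng.integers(4, max_size)   # box width
        y = rng.integers(0, max(1, h - bh))
        x = rng.integers(0, max(1, w - bw))
        img[y:y + bh, x:x + bw] = 0      # occlude: force reliance on context
    return img

# Usage: augmented = occlude_random_boxes(img, np.random.default_rng(0))
```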

Inference. At run-time, we leverage the fact that keypoints defined on a plane do not overlap in projection to drastically reduce inference time. The background channel $V_0$ encodes all the information required to locate a keypoint: keypoints appear as local minima of the background probability. We therefore perform non-minimum suppression on this channel only, and then assess the identity of each detection by looking for the index of the maximum in the corresponding column of the volume. This enables us to handle many interest points in real-time.
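A minimal sketch of this inference step, assuming `volume` is the (H, W, J+2) softmax output. The 0.25 response threshold follows Section 4.2; the suppression window size is an assumption of ours.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def extract_keypoints(volume, threshold=0.25, window=15):
    """Returns a list of (x, y, class_id) detections."""
    # Keypoints are local minima of the background channel, i.e. local
    # maxima of 1 - V_0.
    response = 1.0 - volume[..., 0]
    is_peak = (response == maximum_filter(response, size=window))
    ys, xs = np.nonzero(is_peak & (response > threshold))
    keypoints = []
    for y, x in zip(ys, xs):
        # Identity = argmax over the non-background channels at this pixel.
        class_id = int(np.argmax(volume[y, x, 1:])) + 1
        keypoints.append((x, y, class_id))
    return keypoints
```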

3.3. Estimating Intrinsic and Extrinsic Parameters

Having detected the semantic keypoint locations $z_S^t$, that is, markings on the ground, and the player keypoints $z_P^t$, that is, the projections of the players' centers of gravity in image $I^t$, we can now exploit them to recover the camera parameters. Since the camera focal length is not known a priori, the camera extrinsics cannot be computed directly. To this end, we first compute a homography $H^t$ from the image plane to the field and then estimate the intrinsic and extrinsic parameters $K^t$ and $M^t$ from $H^t$ as described in Appendix A.

As discussed at the end of Section 3.1, the locations of the semantic keypoints can readily be used to estimate a homography because they are assigned unique identities that translate into 3D-to-2D correspondences between 3D points on the playing field and 2D image locations.

By contrast, exploiting the player keypoints, each of which can be the projection of one of many 3D locations, requires establishing the correspondences. Doing so by brute-force search would be computationally infeasible in real-time, and even more sophisticated methods that leverage a priori knowledge about the camera position [7, 20] can be slow. Instead, we use a simple yet effective two-step approach. Given image $I^t$, we use the semantic keypoints to compute a first estimate of the homography, $H_0^t$. This allows us to back-project the detected player locations $z_P^t$ from the image plane to world coordinates and to associate the back-projected points to the closest ground-truth positions $Z_P^t$. Finally, we use the newly established player correspondences together with the already known semantic ones to estimate a new homography, $H_1^t$.

This approach enables us to use player data to produce a more accurate mapping, which translates into better estimates of the focal length and the pose.
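A sketch of this two-step association, under the assumption that the homography maps ground-plane coordinates to the image; the OpenCV calls are our implementation choice, and the 1 m gating distance is an illustrative assumption.

```python
import numpy as np
import cv2

def refine_with_players(world_sem, img_sem, world_players, img_players,
                        max_dist=1.0):
    """world_*: (N, 2) ground-plane coords in meters; img_*: (N, 2) pixels."""
    # Step 0: initial homography H0 from the semantic keypoints alone.
    H0, _ = cv2.findHomography(world_sem, img_sem, cv2.RANSAC, 3.0)

    # Step 1: back-project detected players to the ground plane with H0^-1
    # and associate each one to the nearest known player position.
    back = cv2.perspectiveTransform(
        img_players.reshape(-1, 1, 2).astype(np.float64), np.linalg.inv(H0)
    ).reshape(-1, 2)
    matched_world, matched_img = [], []
    for bp, ip in zip(back, img_players):
        d = np.linalg.norm(world_players - bp, axis=1)
        if d.min() < max_dist:                       # accept close matches only
            matched_world.append(world_players[d.argmin()])
            matched_img.append(ip)

    # Step 2: re-estimate using semantic plus player correspondences.
    if matched_img:
        all_world = np.vstack([world_sem] + matched_world)
        all_img = np.vstack([img_sem] + matched_img)
    else:
        all_world, all_img = world_sem, img_sem
    H1, _ = cv2.findHomography(all_world, all_img, cv2.RANSAC, 3.0)
    return H1
```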

3.4. Enforcing Temporal Consistency

Using the approach described above, keypoints can be found independently in individual images and used to compute a homography and derive camera parameters for each. However, in a video sequence acquired with a moving camera, this fails to exploit the fact that the camera motion may be shaky but is not arbitrary. To do so, we rely on a particle-filtering approach known as condensation [14] to enforce temporal consistency on the pose $M^t$, with the intrinsics $K^t$ being updated at each iteration to allow the focal length to change.

The idea underlying the condensation algorithm is to numerically approximate and refine the probability density function $p(M^t \mid z^t)$ over time. A set of $N$ random poses called particles $s_n^t$ with associated weights $\pi_n^t$ approximates the posterior distribution

$$p(M^t \mid z^t) = \sum_{n=1}^{N} \pi_n^t \, \delta(M^t - s_n^t), \qquad (1)$$

where $\delta(\cdot)$ is the Dirac delta function. At every iteration, the particles are generated, transformed, and discarded based on their weights $\pi_n^t$. These weights are at the heart of the procedure: the larger the weight of a particle, the more likely it is to be retained. They should therefore be chosen so that particles that correspond to likely extrinsic parameters, that is, parameters that yield low re-projection errors, are assigned a high weight.

To this end, we use the extrinsic parameters associated with the particles to project the ground-truth 3D points and compute the mean error distance from the projections to the estimated positions $z^t$. For the semantic keypoints, we compute the distance $\xi_{S|n}^t$ to the corresponding predicted 2D locations. For the player keypoints, whose identity is unknown, we search for the detection closest to the projection and use it to compute the error $\xi_{P|n}^t$. We assume a Gaussian model for both error components $\xi_{S|n}^t$ and $\xi_{P|n}^t$. Therefore, we take the weight of the $n$-th particle to be

$$\pi_n^t = \alpha \exp\left[-\left(\frac{\xi_{S|n}^t}{\sqrt{2}\,\sigma_S}\right)^{2}\right] + (1-\alpha)\,\exp\left[-\left(\frac{\xi_{P|n}^t}{\sqrt{2}\,\sigma_P}\right)^{2}\right], \qquad (2)$$

where $\sigma_S$ and $\sigma_P$ control the importance of a particle based on its error, and $\alpha$ balances the two contributions. Intuitively, if the error for a given particle is close to zero, the associated weight will be close to one. The new state is then taken to be the expected value of the posterior, $E[p(M^t \mid z^t)] \approx \sum_{n=1}^{N} \pi_n^t s_n^t$. We describe the whole framework in more detail in Appendix B.
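A minimal sketch of the weighting in Eq. (2) and the expected-value output, assuming each particle is a 6-vector of extrinsic parameters and that the per-particle mean reprojection errors have already been computed; the default values follow Section 4.2.

```python
import numpy as np

def particle_weights(errs_sem, errs_pl, sigma_s=2.0, sigma_p=2.0, alpha=0.5):
    """errs_*: (N,) mean reprojection errors per particle; returns weights
    following Eq. (2), normalized to sum to one for the resampling step."""
    w = (alpha * np.exp(-(errs_sem / (np.sqrt(2.0) * sigma_s)) ** 2)
         + (1.0 - alpha) * np.exp(-(errs_pl / (np.sqrt(2.0) * sigma_p)) ** 2))
    return w / w.sum()

def expected_pose(particles, weights):
    """Filter output E[p(M^t | z^t)] ~= sum_n pi_n s_n, with each particle a
    (6,) vector of extrinsic parameters (a parameterization we assume here)."""
    return (weights[:, None] * particles).sum(axis=0)
```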

4. Experiments

4.1. Datasets, Metrics, and Baselines

In this section, we introduce the datasets we tested our method on, the metrics we used to assess performance, and the baselines against which we compared ourselves.


Datasets. We tested our approach on the following datasets:

• Basketball. We filmed an amateur basketball match on our campus using smartphones while moving around the field. At the same time, 8 fixed and calibrated cameras were filming it, and we used their output to estimate the players' positions on the ground. The sequences contain a variable number of people, ranging from 10 to 13, who are either running, walking, or standing. For each smartphone image, we estimated ground-truth poses using a semi-automated tool. It tracks interest points from image to image under the supervision of an operator. When the system loses track, the operator can click on points of interest and restart the process. In practice, this is much faster than doing everything manually. In this manner, it took us about 60 hours to compute homographies for all of the 50127 images forming 28 distinct sequences, some of which feature difficult lighting conditions and foreign objects, such as gymnastic mats and other pieces of equipment, occluding parts of the field. Manually annotating every frame would have taken at least six weeks. We used 12 sequences for training and 16 for testing. This dataset will be made publicly available.

• Volleyball. These volleyball sequences were filmed using broadcast cameras and are publicly available [13], along with the corresponding players' positions. We again used the semi-automated tool described above to compute ground-truth poses and intrinsic parameters that change over time in 12987 images coming from four different matches, and will also make them publicly available. The images include players, referees, and coaches, but only the players, six per team, were tracked. We used two sequences for training and two for testing.

• Soccer MLS. We filmed a Major League Soccer match using one moving smartphone and 10 fixed cameras to estimate the positions of the players, as for the Basketball dataset. All the players and the three referees were tracked in this dataset, for a total of 25 people. The focal length is constant within a sequence but different for each one. We then used our semi-automated tool to compute ground-truth poses for 14160 images divided into 20 sequences from different locations around the field. We used 10 sequences for training and 10 for testing.

• Soccer World Cup. This is a publicly available dataset used in [12, 24]. It comprises 395 images acquired by the broadcast cameras at the 2014 World Cup in Brazil. The images have an associated homography and are not in sequence. Since the players' positions are not provided, we extracted them manually in every image in order to demonstrate their usefulness. We extracted the positions of all the players and referees visible in the images.

Evaluation Metrics. We use five metrics to evaluate the recovered camera parameters: intersection over union (IoU), reprojection error, angular error in degrees, translation error in meters, and relative focal length error. We report the mean, the median, and the area under the curve (AUC) for each of them.

To compute the reprojection error, we first project a grid of points defining the playing surface onto the image and then average their distances from their true locations. Only the points visible in the image are taken into account. For independence from the image size, we normalize the resulting values by the image height.

The IoU is taken to be the area shared between the ground-truth model of the court and the re-projected one, divided by the area of the union of the two. The IoU is one if the two coincide exactly and zero if they do not overlap at all. In this work and in [12], the entire sports-field template is considered. In [24], the IoU is computed using only the area of the court that is visible in the image. We discourage the use of this version, as it has an important flaw: since the metric simply compares areas, cropping both the ground truth and the model removes the parts that are not aligned, which are precisely the ones that contribute negatively to the score. It is therefore easy to obtain perfect scores even if the estimate is far from correct. We give more explanations in Appendix C.
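For concreteness, here is one common way to compute the whole-template IoU, comparing the ground-truth template with its re-projection through $H_{gt}^{-1} H_{est}$ in the model frame; the use of shapely and the exact protocol are assumptions on our part, not the paper's specification.

```python
import numpy as np
import cv2
from shapely.geometry import Polygon

def whole_template_iou(H_est, H_gt, template_corners):
    """template_corners: (N, 2) outline of the full field template in world
    coordinates; both homographies map the ground plane to the image."""
    pts = template_corners.reshape(-1, 1, 2).astype(np.float64)
    # Map the template into the image with the estimate, then back to the
    # model frame with the ground truth; a perfect estimate gives identity.
    M = np.linalg.inv(H_gt) @ H_est
    reproj = Polygon(cv2.perspectiveTransform(pts, M).reshape(-1, 2))
    gt = Polygon(template_corners)
    union = reproj.union(gt).area
    return reproj.intersection(gt).area / union if union > 0 else 0.0
```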

We compute the angular error as $\arccos\left[(\mathrm{Tr}(R_{gt}^{\top} R_{est}) - 1)/2\right]$ and the translation error as $\|t_{gt} - t_{est}\|_2$. The relative focal length error is defined as $|f_{gt} - f_{est}|/f_{gt}$. Finally, the AUC is computed by sorting the errors in ascending order and considering only the values lower than a threshold. For the IoU we take a threshold of 1, for the reprojection error 0.1, for the angular error 10°, for the translation error 2.5 m, and for the relative focal length error 0.1.

Baselines and Variants. We compare our method against the following approaches.

• SIFT [19]: We use the OpenCV implementation of SIFT to locate and match interest points between an image and a set of reference images. We manually select reference images from the training set in such a way as to cover all viewpoints. Given a query image, we attempt to match it against each reference image in turn. We use the two-nearest-neighbor heuristic [19] with a distance ratio of 0.8 to reject keypoints without a reliable correspondence. The reference image that features the largest number of correspondences is used in conjunction with RANSAC [8] to compute a homography.

• PoseNet [16]: Direct regression from the query image to translation and quaternion vectors. We use a ResNet-50 [11] pretrained on ImageNet. We replace the last average-pooling and fully-connected layers with a ReLU activation followed by a 1 × 1 convolutional layer to reduce the number of features of the activation volume from 2048 to 512, and then a fully-connected layer to output the 7-element vector. Instead of feeding the image itself, we feed the accumulator of the Hough transform, which produces better results overall. In addition, we normalize the values of the translation and quaternion vectors using the mean of the training distribution. Finally, we set the balance parameter β of the loss to 1e−3 and train the network for 50000 iterations with Adam and a learning rate of 1e−4.

• Branch and Bound [12]: A branch-and-bound approach using lines and circles as cues to estimate a homography.

• Synthetic dictionary [24]: Nearest-neighbor search over a precomputed dictionary of synthetic edge images.

• Learned Errors [15]: Homography refinement through iterative minimization of an inferred registration error.

The last three approaches are presented in more detail in Section 2. We also compare against the following variants of our own approach.

• OURS: Our complete method using semantic keypoints, player positions, refinement stages, SpatialDropout, and the particle filter.

• OURS w/o Players: Our method without using the players.

• OURS w/o P.Filter: Our method without the particle filter, that is, without temporal consistency.

• OURS w/o S.Dropout: Our method without SpatialDropout [25] for increased global context awareness.

4.2. Implementation Details

For all our experiments, we train a U-Net architecture [21] from scratch using SpatialDropout [25], a pixel-wise softmax as the last layer, a cross-entropy loss, and ReLU activations. For Basketball, Volleyball, and Soccer MLS, the network has 4 downsampling steps and 32 filters in the first layer; for Soccer World Cup, 5 and 48 respectively. We optimize the parameters of the network using Adam [17] with a learning rate of 1e−4. We resize the images while maintaining the original aspect ratio: the height of the input images is 256 pixels for Basketball and Volleyball, 360 pixels for Soccer MLS, and 400 pixels for Soccer World Cup. At every iteration, a patch of size 224 × 224 is randomly cropped from the image and fed into the network. We use a batch size of 4. In non-maximum suppression, a local maximum is considered a keypoint if its response is higher than 0.25; we discard the rest. For all the experiments, we use 300 particles for the filter, with σS and σP both set to 2 and α = 0.5. We run all the experiments on an Intel Xeon CPU E5-2680 2.50GHz and an Nvidia TITAN X (Pascal) GPU.

4.3. Qualitative and Quantitative Results

Fig. 5 demonstrates the effectiveness of our method in establishing correspondences: it outperforms SIFT by a large margin. Fig. 6 depicts some qualitative results for basketball, soccer, and volleyball. The registration in these cases is accurate to the point where the projected model lines match the lines of the court almost perfectly. By contrast, we present some failure cases in Fig. 7. They are usually caused by a lack of visible lines, clutter, and narrow fields of view. We now turn to quantifying these successes and failures.

Comparison to the Baselines. We report comparative results on our benchmark datasets in Fig. 8 and Tab. 1. The results for Soccer World Cup are shown in Tab. 2. As the Soccer World Cup dataset does not include intrinsic and extrinsic parameters, we report IoU and reprojection error only. The code for Synthetic dictionary and Learned Errors is not publicly available; we therefore report the published results for Soccer World Cup. Note that in Tab. 1 we do not report the relative focal length errors for PoseNet, as this method only produces rotation and translation matrices. To compute the other metrics for PoseNet, we used the ground-truth intrinsics.

OURS does best overall, with OURS w/o Players a close second. Interestingly, OURS w/o Players does slightly better than OURS on Volleyball. We take this to mean that, in this case, the keypoints are dense enough for a precise estimate, so using the players, whose locations cannot be detected very accurately, does not help. By contrast, for Soccer World Cup and Soccer MLS, where the field is larger and the keypoints fewer, using the players is key to top performance. As shown by the medians in Tab. 1 and by Fig. 8, our method is extremely precise. For most of the basketball and volleyball images, the reprojection error of our method is less than 5 pixels on a full-HD image (1080 × 1920); in soccer, it is less than 7 pixels. The translation error is less than 20 cm for most images of our basketball dataset, about 50 cm for volleyball, and less than 3 m for soccer.

Figure 5: Robust Keypoint Detection. (left) Putative correspondences drawn from the image to the model of the court using the SIFT approach described in Section 4.1 and (right) by our method, where red and blue dots are semantic and generic keypoints respectively. Incorrect correspondences are shown as red lines, correct ones in green. Even though the lighting is poor, our method (right) can reliably establish many correct correspondences, whereas those found using SIFT [19] (left) are mostly wrong. The other baseline methods do not appear in this figure as they are not keypoint-based.

Figure 6: Qualitative results. 3D field lines projected and overlaid on the images according to the recovered camera registration.

Figure 7: Failure cases. In the basketball dataset (a, b), narrow viewpoints and clutter caused by foreign objects, players, and the hoop's frame are the main reasons for imprecise localization of keypoints, leading to inaccurate poses. In (c), the inaccurate mapping is due to the shortage of visible lines. In (d), the network failed to locate keypoints correctly, most likely because this viewpoint is far from the distribution of the training set we used.

Ablation Study. In Tab. 1, we also report performance numbers for OURS w/o P.Filter and OURS w/o S.Dropout. They are consistently worse than those of our full approach, thus confirming the importance of the particle filter and of dropout.

A question that arises in practice is where to position the keypoints. As the network uses both local and global information, keypoints can be placed anywhere on the field; however, their position affects how precisely they can be localized in the image. Fig. 9 depicts three potential configurations. We trained and tested our network using each one in turn. In Table 3, we report the corresponding average distance between projected ground-truth points and the detections, along with the proportion of inliers. The average distance is computed using the detections that are within 5 pixels of the closest corresponding ground-truth point.

Figure 8: Quantitative results for Basketball, Volleyball, Soccer MLS, and Soccer World Cup (one column per dataset; curves for OURS, OURS w/o Players, SIFT, and PoseNet). Top row: cumulative distribution of normalized reprojection errors. Bottom row: cumulative distribution of one-minus intersection-over-union (1-IoU). For the basketball and volleyball datasets, the player data do not improve the accuracy of the estimation; by contrast, for both soccer datasets the players are key to top performance.

Table 1: Quantitative results for the Basketball, Volleyball, and Soccer MLS datasets. Each metric is reported as Mean / Median / AUC (AUC thresholds: 1 for 1-IoU, 0.1 for the normalized reprojection error, 10° for the angular error, 2.5 m for the translation error, and 0.1 for the relative focal length error).

Basketball

| Method | fps | IoU | Norm. reproj. error | Angular error (°) | Translation error (m) | Rel. focal length error |
|---|---|---|---|---|---|---|
| OURS | 22 | 0.966 / 0.980 / 0.966 | 0.013 / 0.003 / 0.940 | 1.681 / 0.565 / 0.926 | 0.330 / 0.189 / 0.906 | 0.012 / 0.011 / 0.880 |
| OURS w/o Players | 26 | 0.962 / 0.977 / 0.962 | 0.012 / 0.004 / 0.930 | 1.179 / 0.610 / 0.917 | 0.294 / 0.194 / 0.898 | 0.013 / 0.012 / 0.867 |
| OURS w/o P.Filter | 33 | 0.927 / 0.978 / 0.927 | 0.041 / 0.004 / 0.895 | 6.962 / 0.590 / 0.884 | 1.865 / 0.196 / 0.865 | 0.012 / 0.011 / 0.879 |
| OURS w/o S.Dropout | 22 | 0.962 / 0.979 / 0.962 | 0.018 / 0.004 / 0.921 | 1.107 / 0.563 / 0.915 | 0.291 / 0.180 / 0.898 | 0.010 / 0.006 / 0.904 |
| SIFT [19] | 0.6 | 0.463 / 0.462 / 0.463 | 0.308 / 0.239 / 0.336 | 59.137 / 28.489 / 0.345 | 7.389 / 4.591 / 0.311 | 0.408 / 0.357 / 0.273 |
| PoseNet [16] | 19 | 0.739 / 0.755 / 0.738 | 0.385 / 0.284 / 0.046 | 5.709 / 3.743 / 0.559 | 3.319 / 2.645 / 0.226 | - |

Volleyball

| Method | fps | IoU | Norm. reproj. error | Angular error (°) | Translation error (m) | Rel. focal length error |
|---|---|---|---|---|---|---|
| OURS | 32 | 0.978 / 0.982 / 0.977 | 0.005 / 0.003 / 0.960 | 0.444 / 0.282 / 0.965 | 0.702 / 0.519 / 0.726 | 0.023 / 0.015 / 0.776 |
| OURS w/o Players | 38 | 0.987 / 0.990 / 0.987 | 0.003 / 0.002 / 0.973 | 0.284 / 0.177 / 0.976 | 0.662 / 0.491 / 0.742 | 0.021 / 0.015 / 0.788 |
| OURS w/o P.Filter | 43 | 0.976 / 0.979 / 0.976 | 0.004 / 0.004 / 0.957 | 0.424 / 0.339 / 0.957 | 0.711 / 0.525 / 0.721 | 0.023 / 0.016 / 0.773 |
| OURS w/o S.Dropout | 32 | 0.976 / 0.981 / 0.976 | 0.005 / 0.003 / 0.957 | 0.462 / 0.311 / 0.961 | 0.863 / 0.604 / 0.671 | 0.028 / 0.020 / 0.722 |
| SIFT [19] | 2 | 0.923 / 0.957 / 0.913 | 0.024 / 0.008 / 0.770 | 1.587 / 0.513 / 0.838 | 2.792 / 0.800 / 0.506 | 0.081 / 0.026 / 0.613 |
| PoseNet [16] | 19 | 0.861 / 0.877 / 0.859 | 0.027 / 0.024 / 0.731 | 0.615 / 0.568 / 0.937 | 0.556 / 0.499 / 0.777 | - |

Soccer MLS

| Method | fps | IoU | Norm. reproj. error | Angular error (°) | Translation error (m) | Rel. focal length error |
|---|---|---|---|---|---|---|
| OURS | 21 | 0.949 / 0.974 / 0.948 | 0.021 / 0.006 / 0.898 | 3.973 / 0.927 / 0.864 | 4.895 / 2.668 / 0.192 | 0.035 / 0.008 / 0.833 |
| OURS w/o Players | 25 | 0.885 / 0.966 / 0.885 | 0.055 / 0.008 / 0.769 | 7.985 / 1.265 / 0.754 | 9.756 / 2.833 / 0.181 | 0.112 / 0.011 / 0.727 |
| OURS w/o P.Filter | 31 | 0.923 / 0.970 / 0.923 | 0.036 / 0.008 / 0.858 | 6.022 / 1.006 / 0.841 | 9.767 / 2.672 / 0.181 | 0.038 / 0.009 / 0.827 |
| OURS w/o S.Dropout | 21 | 0.943 / 0.976 / 0.942 | 0.030 / 0.006 / 0.881 | 7.070 / 0.898 / 0.848 | 4.813 / 2.509 / 0.190 | 0.046 / 0.007 / 0.836 |
| SIFT [19] | 0.8 | 0.809 / 0.944 / 0.804 | 0.137 / 0.010 / 0.680 | 12.553 / 1.104 / 0.732 | 15.223 / 3.059 / 0.146 | 0.128 / 0.013 / 0.709 |
| PoseNet [16] | 19 | 0.822 / 0.848 / 0.822 | 0.211 / 0.154 / 0.118 | 11.249 / 1.835 / 0.730 | 10.661 / 4.070 / 0.093 | - |


The two configurations with keypoints located at corners and line intersections are more precise and perform very similarly. The third configuration, with regularly spaced keypoints that do not match any specific image feature, does less well but still yields reasonable precision. This confirms our network's ability to account for the context around the keypoints.


Table 2: Quantitative results for the Soccer World Cup dataset. Each metric is reported as Mean / Median / AUC.

| Method | fps | IoU | Norm. reproj. error |
|---|---|---|---|
| OURS | 8 | 0.939 / 0.955 / 0.934 | 0.007 / 0.005 / 0.926 |
| OURS w/o Players | 9 | 0.905 / 0.918 / 0.901 | 0.018 / 0.012 / 0.820 |
| SIFT [19] | 1.6 | 0.170 / 0.011 / 0.168 | 0.591 / 0.479 / 0.01 |
| PoseNet [16] | 19 | 0.528 / 0.559 / 0.525 | 0.849 / 0.878 / 0.00 |
| Branch and Bound [12] | 2.3 | 0.83 / - / - | - |
| Synthetic Dictionary [24] | 5 | 0.914* / 0.927* / - | - |
| Learned Errors [15] | - | 0.898 / 0.929 / - | - |

*Computed using the visible part of the field in the image. See Appendix C for more explanations.

Table 3: Testing different keypoint configurations. We consider a detection to be an inlier if its distance to the closest corresponding ground truth is less than 5 pixels in an image of 256 × 455 pixels. The mean distance is computed between the projected ground-truth keypoints and the inliers.

| Configuration | Mean Distance (mean ± std.) | Inlier Proportion (mean ± std.) |
|---|---|---|
| Keypoints on corners 1 (red) | 1.15 ± 0.88 | 0.78 ± 0.22 |
| Keypoints on corners 2 (blue) | 1.08 ± 0.85 | 0.79 ± 0.21 |
| Keypoints on a grid (white) | 1.66 ± 1.20 | 0.69 ± 0.27 |

Figure 9: Different keypoint configurations (Basketball). Red and blue dots depict two different configurations of semantic keypoints located at line intersections. The white ones are equally spaced and represent a third configuration.

5. Conclusion

We have developed a new camera-registration framework that combines accurate localization and robust identification of specific keypoints in the image by using a fully-convolutional deep architecture. It derives its robustness and accuracy from being able to jointly exploit the information provided by the field lines and the players' 2D locations, while also enforcing temporal consistency.

Future work will focus on detecting not only the 2D location of the projection of the players' center of gravity, as we currently do, but also their joints, so that we can reconstruct their 3D pose. In this way, we will be able to simultaneously achieve camera registration and 3D pose estimation. This has tremendous potential for augmenting the images and developing real-time tools that could be used to explain to viewers what the action is.

6. Acknowledgments

This work was supported in part by an Innosuisse grant funding the collaboration between Second Spectrum and EPFL.

Appendix

A. From Homography to Camera Parameters

Let us consider a $w \times h$ image $I$ along with the homography $H$ between the ground plane and its image plane, which is to be decomposed into $K$ and $M = [R \mid t]$, a $3 \times 3$ matrix of intrinsic parameters and a $3 \times 4$ matrix of extrinsic parameters, as defined in Section 3.1. In this section, we outline how to derive $M$ and $K$ from $H$. For a full treatment, we refer the interested reader to [10].

Intrinsic parameters. In practice, the principal point of modern cameras is located close to the center of the image and there is no skew. We can therefore write

$$K = \begin{bmatrix} f & 0 & w/2 \\ 0 & f & h/2 \\ 0 & 0 & 1 \end{bmatrix}, \qquad (3)$$

where $f$ is the initially unknown focal length and the only parameter to be estimated. It can be shown that, knowing $H$, two linear constraints on the intrinsic parameters can be solved for the unknown $f$, yielding two solutions of the form

$$f_1 = \frac{g_1(\mathbf{h}_1, \mathbf{h}_2, w, h)}{h_7 \cdot h_8}, \qquad f_2 = \frac{g_2(\mathbf{h}_1, \mathbf{h}_2, w, h)}{(h_7 + h_8)\cdot(h_7 - h_8)}, \qquad (4)$$

where $\mathbf{h}_1$ and $\mathbf{h}_2$ are the first two columns of $H$, $h_7$ and $h_8$ are the first two elements of the third row of $H$, and $g_1$, $g_2$ are algebraic functions. $f_1$ and $f_2$ are only defined when the denominators are non-zero, and the closer to zero they are, the lower the precision. In practice, we compare the values of these denominators and use the following heuristic:

$$f = \begin{cases} f_1 & \text{if } |h_7 \cdot h_8| > |(h_7 + h_8)\cdot(h_7 - h_8)|, \\ f_2 & \text{otherwise.} \end{cases} \qquad (5)$$
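The paper does not spell out $g_1$ and $g_2$. Below is a sketch under the standard absolute-conic constraints for a camera with principal point $(w/2, h/2)$, which produce exactly the denominators of Eq. (4) and the heuristic of Eq. (5); treat the explicit numerators as our derivation, not the authors' code.

```python
import numpy as np

def focal_from_homography(H, w, h):
    """H maps the ground plane to the image; (w, h) is the image size.
    Entries are taken row-major, H = [[h1, h2, h3], [h4, h5, h6], [h7, h8, h9]],
    so h7 and h8 are the first two elements of the third row, as in Eq. (4)."""
    u, v = w / 2.0, h / 2.0
    h1, h2, _ = H[0]
    h4, h5, _ = H[1]
    h7, h8, _ = H[2]
    # Orthogonality of the first two columns under the absolute conic.
    num1 = -((h1 - u * h7) * (h2 - u * h8) + (h4 - v * h7) * (h5 - v * h8))
    den1 = h7 * h8
    # Equal norm of the first two columns.
    num2 = ((h2 - u * h8) ** 2 + (h5 - v * h8) ** 2
            - (h1 - u * h7) ** 2 - (h4 - v * h7) ** 2)
    den2 = (h7 + h8) * (h7 - h8)
    # Heuristic of Eq. (5): use the better-conditioned denominator.
    f_sq = num1 / den1 if abs(den1) > abs(den2) else num2 / den2
    return float(np.sqrt(max(f_sq, 0.0)))
```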


Extrinsic parameters. To extract the rotation and translation matrices $R$ and $t$ from $H$, we first define the $3 \times 3$ matrix $B = [\mathbf{b}_1, \mathbf{b}_2, \mathbf{b}_3]$ and a scale factor $\lambda$ to write $H$ as $\lambda K B$. $\lambda$ can be computed as $(\|K^{-1}\mathbf{h}_1\| + \|K^{-1}\mathbf{h}_2\|)/2$. Then, assuming that the x-axis and y-axis define the ground plane, we obtain a first estimate of the rotation and translation as $R = [\mathbf{b}_1, \mathbf{b}_2, \mathbf{b}_1 \times \mathbf{b}_2]$ and $t = \mathbf{b}_3$. We orthogonalize the rotation using the singular value decomposition $R = U \Sigma V^{\top}$, setting $R = U V^{\top}$. Finally, we refine the pose $[R, t]$ on $H$ by non-linear least-squares minimization.
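A minimal sketch of this decomposition (sign disambiguation and the final non-linear refinement are omitted):

```python
import numpy as np

def decompose_homography(H, K):
    """Recover [R | t] from a ground-to-image homography H and intrinsics K."""
    B = np.linalg.inv(K) @ H
    lam = (np.linalg.norm(B[:, 0]) + np.linalg.norm(B[:, 1])) / 2.0
    b1, b2, b3 = (B / lam).T                   # columns of B after scaling
    R = np.column_stack([b1, b2, np.cross(b1, b2)])
    U, _, Vt = np.linalg.svd(R)                # orthogonalize: nearest rotation
    return U @ Vt, b3                          # R, t
```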

B. Complete Framework

Recall from Section 3.2 that at each discrete time step $t$, we estimate the 2D locations of our keypoints $z^t$, which are noisy and sometimes plain wrong. As we have seen in Appendix A, they can be used to estimate the intrinsic and extrinsic parameters $M_d^t$ and $K^t$ for single frames. The intrinsic parameters computed from a single frame are sensitive to noise and depend on the accuracy of $H^t$. For this reason, at every time step $t$ we estimate their values by considering the past $k$ frames: we perform outlier rejection over the past $k$ estimates of the intrinsics and then compute the median. This increases robustness and precision while admitting smooth variations of the parameters over time. If the parameters are known to be constant over time, $k$ can be set so as to consider all past estimates. Once the intrinsics are computed, we obtain the new robust pose $M^t$ from the filter and minimize the error in the least-squares sense using all the detected keypoints.
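A sketch of this smoothed intrinsics estimate; the paper does not specify the outlier-rejection rule, so the MAD-based 3-sigma test below is an assumption of ours.

```python
import numpy as np

def smoothed_focal(past_focals, k=30):
    """Median of the last k focal-length estimates after MAD outlier rejection."""
    f = np.asarray(past_focals[-k:], dtype=np.float64)
    med = np.median(f)
    mad = np.median(np.abs(f - med)) + 1e-9        # robust scale estimate
    inliers = f[np.abs(f - med) < 3.0 * 1.4826 * mad]
    return float(np.median(inliers)) if inliers.size else float(med)
```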

This particle filter is robust but can still fail if the camera moves very suddenly. To detect such events and re-initialize the filter, we keep track of the number of 3D model points whose reprojections fall within a given distance threshold for both the pose computed from point correspondences, $M_d^t$, and the filtered pose, $M^t$. When the count for $M_d^t$ is higher, we re-initialize the filter. The pseudo-code shown in Algorithm 1 summarizes these steps.

Algorithm 1: Complete framework pseudo-code.

procedure IntrinsicsAndExtrinsicsEstimation
  for t = 1 to T do                                    ▷ iterate over time
    ----- Single-frame estimation -----
    z^t = {z^t_S, z^t_P} ← detect keypoints in I^t
    H^t ← robust estimation using (Z_S, z^t_S)
    H^t_r ← refinement using players (H^t, Z^t, z^t)
    K^t ← intrinsics estimation from H^t_r
    K^t_m ← moving median with outlier rejection over K^{t:t-k}
    M^t_d ← homography decomposition (K^t_m, H^t_r)
    ----- Particle filtering -----
    for n = 1 to N do                                  ▷ iterate over the particles
      {s^t_n, π^t_n} ← sampling with replacement from {s^{t-1}_n, π^{t-1}_n}
      s^t_n ← s^t_n + w_n, with w_n ∼ N(0, Σ)          ▷ add randomness
      π^t_n ← g(s^t_n, K^t_m, Z_S, z^t_S, Z^t_P, z^t_P) ▷ weight computation
    end for
    M^t ← Σ_{n=1}^{N} π^t_n s^t_n                      ▷ expected value as filter output
    M^t ← Levenberg–Marquardt refinement of M^t on H^t_r
    ----- Filter re-initialization -----
    if no. inliers(K^t_m, M^t_d, z^t) > no. inliers(K^t_m, M^t, z^t) then
      for n = 1 to N do
        {s^t_n, π^t_n} ← {M^t_d + w_n, 1/N}, with w_n ∼ N(0, Σ)
      end for
    end if
  end for

C. Intersection-Over-Union (IoU) of the Visible Area

In [24], the intersection-over-union metric is computed using only the area of the court that is visible in the image. This area is shown in gray in Figure 10. After superimposing the projected model (red frame) on the ground-truth one (blue frame), the area of the court that is not visible in the image is removed and therefore not taken into account in the computation of the IoU. It can be shown that the IoU of the gray area gives a perfect score while in reality the estimate is far from correct. The worst-case scenario is a viewpoint that leads to an image containing only the grassy area of the playing field. In that case, as long as the projected model covers the ground-truth one, this metric gives a perfect score. For this reason, we discourage the use of this version of IoU.

Figure 10: Example failure case of the intersection-over-union metric that only uses the visible part of the court in the image. The ground-truth model is shown in blue, the re-projected one in red, and the gray area is the projected image plane. This version of the IoU would give a perfect score, while the one that takes the whole template into account would give around 0.6.


References

[1] Alvarez, L., Caselles, V.: Homography Estimation Using One Ellipse Correspondence and Minimal Additional Information. In: International Conference on Image Processing, pp. 4842–4846 (2014)
[2] Bay, H., Ess, A., Tuytelaars, T., Van Gool, L.: SURF: Speeded Up Robust Features. Computer Vision and Image Understanding 10(3), 346–359 (2008)
[3] Calonder, M., Lepetit, V., Strecha, C., Fua, P.: BRIEF: Binary Robust Independent Elementary Features. In: European Conference on Computer Vision, pp. 778–792 (2010)
[4] Cao, Z., Simon, T., Wei, S., Sheikh, Y.: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields. In: Conference on Computer Vision and Pattern Recognition, pp. 1302–1310 (2017)
[5] Chen, J., Zhu, F., Little, J.J.: A Two-Point Method for PTZ Camera Calibration in Sports. In: IEEE Winter Conference on Applications of Computer Vision, pp. 287–295 (2018)
[6] Chen, J., Little, J.J.: Sports Camera Calibration via Synthetic Data. In: Conference on Computer Vision and Pattern Recognition (Workshops) (2019)
[7] David, P., Dementhon, D., Duraiswami, R., Samet, H.: SoftPOSIT: Simultaneous Pose and Correspondence Determination. International Journal of Computer Vision 59(3), 259–284 (2004)
[8] Fischler, M., Bolles, R.: Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. Communications of the ACM 24(6), 381–395 (1981)
[9] Gupta, A., Little, J.J., Woodham, R.: Using Line and Ellipse Features for Rectification of Broadcast Hockey Video. In: Canadian Conference on Computer and Robot Vision, pp. 32–39 (2011)
[10] Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press (2000)
[11] He, K., Zhang, X., Ren, S., Sun, J.: Deep Residual Learning for Image Recognition. In: Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
[12] Homayounfar, N., Fidler, S., Urtasun, R.: Sports Field Localization via Deep Structured Models. In: Conference on Computer Vision and Pattern Recognition, pp. 4012–4020 (2017)
[13] Ibrahim, M.S., Muralidharan, S., Deng, Z., Vahdat, A., Mori, G.: A Hierarchical Deep Temporal Model for Group Activity Recognition. In: Conference on Computer Vision and Pattern Recognition, pp. 1971–1980 (2016)
[14] Isard, M., Blake, A.: Condensation – Conditional Density Propagation for Visual Tracking. International Journal of Computer Vision 1, 5–28 (1998)
[15] Jiang, W., Higuera, J.C.G., Angles, B., Sun, W., Javan, M., Yi, K.M.: Optimizing Through Learned Errors for Accurate Sports Field Registration. In: IEEE Winter Conference on Applications of Computer Vision (2019)
[16] Kendall, A., Grimes, M., Cipolla, R.: PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization. In: International Conference on Computer Vision, pp. 2938–2946 (2015)
[17] Kingma, D., Ba, J.: Adam: A Method for Stochastic Optimization. In: International Conference on Learning Representations (2015)
[18] Liu, S., Chen, J., Chang, C., Ai, Y.: A New Accurate and Fast Homography Computation Algorithm for Sports and Traffic Video Analysis. IEEE Transactions on Circuits and Systems for Video Technology, pp. 2993–3006 (2018)
[19] Lowe, D.: Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision 20(2), 91–110 (2004)
[20] Moreno-Noguer, F., Lepetit, V., Fua, P.: Pose Priors for Simultaneously Solving Alignment and Correspondence. In: European Conference on Computer Vision, pp. 405–418 (2008)
[21] Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional Networks for Biomedical Image Segmentation. In: Conference on Medical Image Computing and Computer Assisted Intervention, pp. 234–241 (2015)
[22] Sattler, T., Maddern, W., Toft, C., Torii, A., Hammarstrand, L., Stenborg, E., Safari, D., Okutomi, M., Pollefeys, M., Sivic, J., Kahl, F., Pajdla, T.: Benchmarking 6DOF Outdoor Visual Localization in Changing Conditions. In: Conference on Computer Vision and Pattern Recognition (2018)
[23] Second Spectrum: http://www.secondspectrum.com/ (2015)
[24] Sharma, R.A., Bhat, B., Gandhi, V., Jawahar, C.V.: Automated Top View Registration of Broadcast Football Videos. In: IEEE Winter Conference on Applications of Computer Vision, pp. 305–313 (2018)
[25] Tompson, J., Goroshin, R., Jain, A., LeCun, Y., Bregler, C.: Efficient Object Localization Using Convolutional Networks. In: Conference on Computer Vision and Pattern Recognition, pp. 648–656 (2015)