Soccer Jersey Number Recognition Using Convolutional Neural Networks

Sebastian Gerke
Fraunhofer HHI
Einsteinufer 37, 10587 Berlin, Germany
sebastian.gerke@hhi.fraunhofer.de
Karsten Müller
Fraunhofer HHI
Einsteinufer 37, 10587 Berlin, Germany
karsten.mueller@hhi.fraunhofer.de
Ralf Schäfer
Fraunhofer HHI
Einsteinufer 37, 10587 Berlin, Germany
ralf.schaefer@hhi.fraunhofer.de
Abstract
In this paper, a deep convolutional neural network based approach to the problem of automatically recognizing jersey numbers from soccer videos is presented. It is meant as a tool for subsequent automatic player identification approaches that utilize jersey numbers together with knowledge about teams and the jersey numbers of their players. Two different jersey number vector encoding schemes are presented and compared to each other. The first treats every number as a separate class, while the second one treats each digit as a class. Additionally, the semi-automatic process for the annotation of a jersey number dataset consisting of 8281 jersey numbers is described. The best recognition rate of 0.83 was achieved by the proposed approach with data augmentation and dropout regularization, compared to 0.4 for a more traditional histogram of oriented gradients (HOG) and support vector machine (SVM) based approach.
1. Introduction
Soccer is one of the most popular sports in the world.
In recent years, interest in automatic soccer analysis tools has grown significantly. Soccer analysis results can be used for new ways of storytelling on TV, for match preparations or for the generation of statistics. One of the fundamental analysis tasks is the identification of players, which is needed to associate actions and statistics with actual players. However, identifying players in broadcast soccer videos automatically (and even manually) is challenging. It is particularly difficult for the overview camera due to the low resolution per player, which makes face recognition impossible; jersey numbers are often hard to read as well, especially at standard definition resolutions. Only with the rise of widely available HD content in recent years has jersey number recognition become feasible.
Figure 1. Examples of the player detector output that is used to create the dataset presented here. The upper half of the actual player bounding box is shown.
2. Related Work
Existing approaches for automatic player identification
in broadcast soccer videos can be categorized in two groups:
one group performs face recognition on closeup shots (not overview shots) in various types of sports videos, while the other approaches rely on jersey number recognition. For
the latter group, no approach is known to operate on soccer
overview shots. They either operate on other sports where
the resolution per player is higher (e.g. in basketball [5, 9]),
or they perform on closeup shots, where jersey numbers are
better readable [1] and face recognition is feasible [2].
In [9] basketball players are detected using a deformable
parts model (DPM), after which an exact localization of the
jersey number is performed. Then, normalization, followed
by thresholding and calculating the correlation between the
digits and digit templates is applied. In [2], player identi-
fication is performed in overview shots by employing SIFT
features for face recognition.
All approaches have a quite sophisticated, hand-
engineered image processing pipeline in common. They
often perform explicit localization of jersey numbers, fol-
lowed by digit segmentation. In contrast to these ap-
proaches, a deep learning approach is proposed here. It
does not rely on explicit localization of jersey number re-
gions and no explicit segmentation is performed. Rather, a
deep convolutional neural network is trained which handles
the complete pixel-to-jersey number recognition process.
3. Dataset Generation
Although there have been a few existing approaches for
jersey number recognition, these approaches usually rely on
video content where the image area of a single jersey num-
ber is relatively large, as they either originate from medium
and close-up shots in soccer video broadcasts or from sports
where the main camera has a narrower camera angle, e.g.
in basketball broadcasts. Therefore, a ground truth dataset
consisting of 10,000 cropped images (from 65 different soc-
cer videos) containing soccer players and labelled by their
jersey number (if visible) was created using manual and au-
tomatic labelling cooperatively. The workflow of the anno-
tation process is depicted in figure 2.
3.1. Semi-automatic Labelling
First, an automatic player detection based on histogram
of oriented gradients (HOG) [4] together with a linear sup-
port vector machine (SVM) was performed on 65 different
soccer videos, similar to what is described in [6]. For each
video, 100 random overview frames were selected and the
player detector was applied in a high-precision setting in
order to increase the probability for a true positive. This
resulted in approx. 70,000 cropped images of players.
Then, a small subset of 2,300 of these cropped player images was labelled according to whether a jersey number is visible or not. Presenting only this binary decision to the human annotators makes the task simpler and therefore faster than annotating whole numbers. A linear SVM classifier on HOG features was trained on this binary classification task, in order to increase the fraction of cropped players presented to human annotators in which a number is visible and readable. This classifier was applied to the 70,000 cropped player images, of which the highest ranked 10,000 images were used for manual jersey number labelling. This step is crucial, as in most of the 70,000 images a number is not even visible, which would otherwise yield a very sparsely annotated dataset.
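To make this pre-ranking step concrete, the following is a minimal sketch of how such a visibility classifier could look with scikit-image and scikit-learn. It is not the authors' original code; the names labelled_crops and all_crops as well as the HOG parameters are illustrative assumptions.

# Sketch of the pre-ranking step: a linear SVM trained on HOG features of the
# 2,300 hand-labelled crops scores all 70,000 detections, and the 10,000
# highest-ranked crops are forwarded to manual jersey number annotation.
import numpy as np
from skimage.color import rgb2gray
from skimage.feature import hog
from sklearn.svm import LinearSVC

def hog_features(crop):
    # HOG descriptor of the greyscale crop (cell/block sizes are illustrative)
    return hog(rgb2gray(crop), orientations=9,
               pixels_per_cell=(8, 8), cells_per_block=(2, 2))

# labelled_crops: 2,300 (image, number_visible) pairs annotated by hand
X = np.array([hog_features(img) for img, _ in labelled_crops])
y = np.array([int(visible) for _, visible in labelled_crops])
visibility_clf = LinearSVC(C=1.0).fit(X, y)

# all_crops: the approx. 70,000 automatically detected player crops
scores = visibility_clf.decision_function(
    np.array([hog_features(img) for img in all_crops]))
top_10k = np.argsort(scores)[::-1][:10000]   # indices sent to the annotators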
These 10,000 images are then manually labelled. Volunteers were asked to either assign the visible number (basically from 1 to 44, excluding a few numbers that are not present within the whole dataset), or they could indicate why it was not possible to assign a number. The possible reasons are not visible, not readable, multiple players and box error. Not visible is supposed to be assigned to images where the number is not visible at all, while not readable is supposed to be assigned to images where the number is either only partly visible due to the player's pose or not readable, e.g. due to motion blur or illumination. Multiple players refers to images that contain more than one player such that it is not obvious which player the annotation refers to. Box error is supposed to be assigned to images where a player is not correctly detected, the bounding box being too small or too big. While this relatively fine-grained annotation of error cases might be useful for future research, the error classes are currently not used.
After annotating each of the 10,000 images, a validation step was performed to reduce the number of false annotations. To this end, all images annotated with the same jersey number are shown together, so that false annotations are easily visible and can be corrected.
Comparing the ratio of images with visible numbers in the small initial subset of 2,300 images and in the set of 10,000 images selected by the aforementioned ranking shows that the pre-ranking increases this ratio significantly. In the small subset, 1,010 out of 2,300 images have visible numbers (ratio 0.43), while about 8,000 out of 10,000 images (ratio 0.8) have visible (and readable) numbers in the pre-ranked dataset. This means that by using this pre-processing step, the effort for obtaining 8,000 labelled samples was reduced by almost 50%.
For experimenting with automatic jersey number classi-
fiers, the dataset described above is split into a training and
a test corpus. It is split by video, i.e. all images from a video
are either in the training or in the test set, in order to avoid
unrealistic scenarios where classification relies on training
samples from the same video. After splitting, the training
corpus consists of 5759 images and the test corpus consists
of 2520 images.
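A split of this kind can be obtained, for instance, by grouping samples by their source video. The following is a minimal sketch using scikit-learn; the arrays images, labels and video_ids are hypothetical NumPy arrays, and the split ratio is illustrative rather than the exact ratio used in the paper.

# Sketch of a video-wise train/test split: all crops from one video end up
# either in the training set or in the test set, never in both.
from sklearn.model_selection import GroupShuffleSplit

splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(splitter.split(images, labels, groups=video_ids))

train_images, train_labels = images[train_idx], labels[train_idx]
test_images, test_labels = images[test_idx], labels[test_idx]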
3.2. Dataset Properties
The number distribution is shown in figure 3. It shows
that numbers are not equally distributed, but rather imbal-
anced. While there are e.g. 600 samples for number 10 (the
most frequent number), there are only 7 samples for number
41. This could actually make training a classifier a challeng-
ing task. In comparison to a similar dataset, the Street View House Number dataset (SVHN) [10], where digits between 0 and 9 are annotated, the ratio between the most frequent and the rarest label is much larger: it is 86 for the dataset presented here and 3 for the SVHN dataset.
Figure 2. Workflow of the semi-automatic jersey number dataset annotation process. Blue arrows denote manual annotation steps.
Figure 3. Jersey number distribution within the complete (training + test) dataset.
Figure 4. Distribution of first digit of jersey numbers within the complete (training + test) dataset.
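The imbalance ratio follows directly from the per-class counts, e.g. roughly 600 samples for number 10 versus 7 samples for number 41. A small worked check, assuming a hypothetical list jersey_number_labels of all groundtruth numbers:

# Imbalance ratio between the most frequent and the rarest class.
from collections import Counter

counts = Counter(jersey_number_labels)
most_frequent = max(counts.values())      # approx. 600 (number 10)
rarest = min(counts.values())             # approx. 7 (number 41)
print(most_frequent / rarest)             # approx. 86 for SJN, about 3 for SVHN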
3.3. Comparison to other datasets
Figure 5. Distribution of second digit of jersey numbers within the complete (training + test) dataset.
Table 1 gives an overview of key dataset characteristics of the presented dataset in comparison to similar computer vision datasets. It consists of the soccer jersey num-
ber dataset presented here (SJN), the MNIST database of
handwritten digits dataset [8], the Street View house num-
ber dataset (SVHN) [10] and the traffic sign recognition
dataset (TS) [12]. As can be seen, the dataset presented
here consists of more classes than the MNIST or SVHN
dataset, as jersey numbers are classified as whole numbers,
not a sequence of digits. This means there are fewer positive
samples per class than for these datasets, even if the dataset
would be perfectly balanced. However, in section 4 we
present an alternative coding scheme that separately models
digits and yields a more even distribution of classes. Addi-
tionally, the resolution for other datasets is usually smaller,
but the 64 ×128 resolution is the actual bounding box size
of the whole player. For jersey number recognition, it is suf-
ficient to only consider the upper half of the bounding box.
Within the upper half of the bounding box, the precise loca-
tion of the jersey number is not annotated. That makes this
task harder than other datasets, where the number (or traffic
sign) locations are annotated manually and therefore more
precise. Similar to the SVHN and TS dataset, the dataset
consists of RGB color images, whereas the MNIST dataset
is a grey-scale dataset. However, the most significant dif-
ference is the actual size of the dataset. The presented SJN
dataset is by far the smallest among those four, which could
Dataset      Classes   Resolution     Training   Test
MNIST [8]    10        28 × 28 × 1    60,000     10,000
SVHN [10]    10        32 × 32 × 3    73,257     26,032
TS [12]      43        32 × 32 × 3    39,209     12,630
SJN          36        64 × 128 × 3   5,760      2,521
Table 1. Comparison with other similar datasets. The image resolution and the number of channels, as well as training and test set sizes are given.
make approaches that rely on large datasets less promising. Also, given the smaller dataset size and the larger problem size (number of classes), results on the presented dataset (in terms of accuracy) are expected to be not as good as reported results on the other datasets mentioned here.
4. Classification Problem
In this work, two different methods for posing (jersey) number recognition as a classification problem are evaluated. The first approach is to model every occurring jersey number as a separate class. In our case, this results in a 40-class classification problem, as not all one- or two-digit numbers appear in the dataset. That means that the classifier c(x) assigns exactly one class (number) y to each input sample image x:

c(x) = y,   y ∈ {1, 2, 3, ..., 40}    (1)
Alternatively, one could treat the problem as a two-label classification problem, with one label for each digit: one for the most significant digit of a one- or two-digit number, and one for the least significant digit:

c(x) = (y_1, y_2),   y_1 ∈ {10, 11, 12, 13, 14},   y_2 ∈ {0, ..., 9}    (2)
where the continuous labels 10-14 stand for the first digit,
i.e. 10 represents single digit numbers, 11 represents num-
bers whose first (most significant) digit is 1, etc. For a neu-
ral network, categorical labels are usually encoded by bi-
nary vectors whose dimensionality is equal to the number
of different labels. That means that for the classification
problem as described in equation 1, labels are converted to
a 40-dimensional vector with exactly one element (that of the groundtruth label) set to one and all other elements set to zero:

y_bin = [0_0, ..., 0_{y-1}, 1_y, 0_{y+1}, ..., 0_n]^T    (3)
The output of the neural network classifier then needs to be converted back to a class label y_predicted by choosing the maximum element of the resulting vector y' (which contains real-valued entries):

y_predicted = argmax_{i ∈ {1, 2, ..., 40}} y'_i    (4)
Figure 6. Training samples obtained by applying random scaling
and cropping of the original sample.
For the second case, the binary vector contains two non-zero elements for each groundtruth label, one for each digit, with numbers smaller than ten having an imaginary leading 0 as their first digit:

y_bin = [0_0, ..., 0_{y_2-1}, 1_{y_2}, 0_{y_2+1}, ..., 0, 1_{y_1}, 0, ..., 0_n]^T    (5)
Converting network predictions back to numbers is then a combination of the maximum element of the first 10 elements of y' and the maximum of the subsequent 5 elements:

y_predicted = (argmax_{i ∈ {0, 1, ..., 9}} y'_i, argmax_{j ∈ {10, 11, ..., 14}} y'_j)    (6)
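To make the two target encodings concrete, the following sketch (using NumPy, not the authors' original code) converts a jersey number into the holistic vector of equation 3 and the digit-wise vector of equation 5, and decodes a real-valued network output again according to equations 4 and 6. The mapping from jersey numbers to the 40 holistic class indices is assumed to be given as a lookup table and is omitted here.

# Sketch of the holistic and digit-wise label encodings and decodings.
import numpy as np

N_HOLISTIC = 40   # one class per occurring jersey number (equation 1)

def encode_holistic(class_index):
    # equation 3: one-hot vector with a single non-zero element
    y = np.zeros(N_HOLISTIC)
    y[class_index] = 1.0
    return y

def decode_holistic(y_pred):
    # equation 4: index of the maximum network output
    return int(np.argmax(y_pred))

def encode_digitwise(number):
    # equation 5: 15-dimensional vector; positions 0-9 encode the least
    # significant digit, positions 10-14 the most significant digit
    # (position 10 marks single-digit numbers)
    y = np.zeros(15)
    y[number % 10] = 1.0            # least significant digit y2
    y[10 + number // 10] = 1.0      # most significant digit y1
    return y

def decode_digitwise(y_pred):
    # equation 6: separate argmax over the first 10 and the last 5 elements
    second_digit = int(np.argmax(y_pred[:10]))
    first_digit = int(np.argmax(y_pred[10:15]))   # 0 means single-digit number
    return first_digit * 10 + second_digit

assert decode_digitwise(encode_digitwise(23)) == 23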
Both approaches have their advantages and disadvantages: As can be seen in figure 3, treating all numbers as separate classes leads to a very imbalanced dataset. Given the dataset, it is even conceptually impossible to recognize two-digit numbers that do not occur, i.e. all numbers > 45. When using two separate digit classification problems, it would be possible to model jersey numbers up to 49 that have never been seen as a whole, since each digit has been observed at each position (first and second digit of a number) in the training set. However, it might be difficult for an algorithm to separate the first and second digit of the number when no explicit localization or segmentation has been performed. Additional factors such as slight perspective changes might make separating the digits even more difficult. Therefore, it might be more appropriate to model numbers holistically.
5. Data augmentation
As the soccer jersey number dataset is quite small, data augmentation is expected to play a key role for good recognition results. Here, we apply data augmentation to increase the number of training samples from 5,760 samples to approx. 56,000 training samples. As the jersey numbers are not centered within a certain region of the image, a classifier is supposed to be translation invariant. In order to improve this invariance, multiple variants of an existing training sample are generated, each cropping a different 40 × 40 patch, shifted within a certain range, from the upper half of a bounding box (64 × 128). Additionally, as the size of the actual region of the jersey number within the player's bounding box is not known, differently scaled samples (the scale factor is randomly chosen between 0.9 and 1.1) are generated for augmentation, as shown in the example in figure 6.
Figure 7. Output of the first convolutional layer for a sample image.
As described later, runs that operate on color and
grayscale images are tested. For the grayscale runs, ad-
ditional data augmentation by inverting all training sam-
ples was performed, yielding a training dataset of approx.
108,000 samples.
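A minimal sketch of this augmentation (random scaling between 0.9 and 1.1, random 40 × 40 crops from the upper half of the 64 × 128 box, and intensity inversion for the greyscale runs) could look as follows; the number of variants per image and the exact sampling procedure are assumptions based on the description above, not the original implementation.

# Sketch of the spatial and inversion data augmentation (illustrative only).
import numpy as np
from skimage.transform import rescale

def augment(player_box, rng, n_variants=10):
    # player_box: 128 x 64 greyscale player crop with values in [0, 1]
    upper_half = player_box[:64, :]          # jersey numbers sit in the upper half
    variants = []
    for _ in range(n_variants):
        scale = rng.uniform(0.9, 1.1)        # random scaling factor
        scaled = rescale(upper_half, scale)  # roughly 58x58 to 70x70 pixels
        y0 = rng.integers(0, scaled.shape[0] - 40 + 1)   # random translation
        x0 = rng.integers(0, scaled.shape[1] - 40 + 1)   # via shifted cropping
        variants.append(scaled[y0:y0 + 40, x0:x0 + 40])
    return variants

def invert(patch):
    # additional augmentation for the greyscale runs: intensity inversion
    return 1.0 - patch

rng = np.random.default_rng(0)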
6. Deep Convolutional Neural Network
As a baseline, a HOG based radial basis function (RBF)
kernel SVM classifier was used, similar to [10]. However in
[10], a linear SVM was used. HOG features are calculated
only for the upper half of player bounding boxes to reduce
the influence of irrelevant image parts. On these features,
an RBF kernel based SVM is trained. Using this baseline,
an accuracy of 0.404 was obtained.
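Such a baseline could be sketched as follows with scikit-learn, assuming HOG descriptors have already been extracted from the upper halves of the player bounding boxes; the kernel and regularization parameters shown are illustrative and not the values used in the paper.

# Sketch of the HOG + RBF-kernel SVM baseline (illustrative parameters).
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# train_hog / test_hog: HOG descriptors of the upper bounding box halves
# train_numbers / test_numbers: groundtruth jersey numbers
baseline = SVC(kernel="rbf", C=10.0, gamma="scale")
baseline.fit(train_hog, train_numbers)

predictions = baseline.predict(test_hog)
print(accuracy_score(test_numbers, predictions))   # around 0.40 in the paper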
Additionally, a convolutional neural network was trained
to recognize numbers. The Keras [3] Python library for
deep neural networks was used throughout the following ex-
periments. Its architecture is inspired by models for generic
image classification (similar to a model for the CIFAR-
10 [7] dataset) and recognizing house numbers in street
view images (using the street view house number dataset).
The base architecture consists of three convolutional lay-
ers, each followed by a max-pooling layer and a rectified
linear unit (ReLU). Then, there are three fully connected
hidden layers with optional dropout [11] layers and finally
a softmax loss layer follows.
Figure 8. Output of the first convolutional layer for a sample image.
The network architecture consists of three convolutional layers and three subsequent fully connected layers. It has been trained and tested and is de-
scribed in detail in section 7. Without any further data aug-
mentation and parameter tuning, the accuracy obtained was
approx. 0.60, which is already better than the more classical
HOG+SVM based approach.
The detailed network architecture is as follows: Three
convolutional layers (with 16×5×5/30×7×7/50×3×3
parameters), each with rectified linear units (ReLU) as their
activation function, followed by a max-pooling layer. Then,
three fully connected layers with ReLU activation follow.
Table 2 gives the details of the network architecture which
holds for all runs. Only data augmentation, dropout pa-
rameters and color space vary between runs. The convo-
lutional stride is always set to one pixel, while pooling size
and stride is two pixels for the first convolutional layer and
three pixels for the remaining convolutional layers. In com-
parison to the network architecture in [11] for the SVHN
dataset, they used more filter channels ((96, 128, 256) in-
stead of (16, 30, 50) used here) for the convolutional layers.
The two fully connected layers in their work each have 2048
units, while in this work, only 34 units are used. The rea-
son for reducing the number of units is mainly the lack of a
large dataset. The SVHN dataset is two orders of magnitude
larger (as an extended training corpus of the SVHN dataset
was used) than the jersey number dataset used here.
Figure 9 shows sample classification results using the best-performing recognizer (ConvNet grey aug inv.) for different categories, namely 2, 3, 4, 6, 8, 10, 13, 15, 16, 21, 20 and 25. Figure 7 depicts the 16 learned convolution filters in the first layer. It shows that mainly edge filters have been learned. Figure 8 shows the 16 responses for the sample image for these filters.
Stage                1            2            3            4        5     6
Layer type           conv + max   conv + max   conv + max   full     full  full (output)
# channels           16           30           50           34       34    45/15
Filter size          5 × 5        7 × 7        3 × 3        -        -     -
Conv. strides        1 × 1        1 × 1        1 × 1        -        -     -
Pooling size         2 × 2        3 × 3        3 × 3        -        -     -
Pooling strides      2 × 2        3 × 3        3 × 3        -        -     -
Spatial input size   40 × 40      20 × 20      6 × 6        2 × 2    -     -
Table 2. Deep convolutional network architecture.
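As an illustration, a network following Table 2 could be written as the tf.keras sketch below. It uses the current Keras API rather than the 2015 version of [3], assumes 'same' padding so that the spatial sizes match Table 2 (40 to 20 to 6 to 2), shows the holistic 45-way output, and places a dropout rate of 0.2 after the first fully connected layer as discussed in section 7. The optimizer and the exact dropout placement are assumptions; the sketch approximates the described architecture rather than reproducing the authors' original code.

# Approximate tf.keras sketch of the Table 2 architecture (holistic output).
from tensorflow.keras import layers, models

def build_model(n_outputs=45, dropout_rate=0.2):
    model = models.Sequential([
        layers.Input(shape=(40, 40, 1)),                       # 40x40 grey patches
        layers.Conv2D(16, 5, padding="same", activation="relu"),
        layers.MaxPooling2D(pool_size=2, strides=2),           # 40x40 -> 20x20
        layers.Conv2D(30, 7, padding="same", activation="relu"),
        layers.MaxPooling2D(pool_size=3, strides=3),           # 20x20 -> 6x6
        layers.Conv2D(50, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(pool_size=3, strides=3),           # 6x6 -> 2x2
        layers.Flatten(),
        layers.Dense(34, activation="relu"),
        layers.Dropout(dropout_rate),                          # dropout after first FC layer
        layers.Dense(34, activation="relu"),
        layers.Dense(n_outputs, activation="softmax"),         # 45-way holistic output
    ])
    model.compile(optimizer="sgd",                             # optimizer is an assumption
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model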
Figure 9. Sample classification results using the best configuration. Each column shows random results for the classes 2, 3, 4, 6, 8, 10, 13,
15, 16, 21, 20 and 25.
Run Accuracy
HOG 0.40
ConvNet 0.61
ConvNet Dropout 0.71
ConvNet grey Dropout 0.72
ConvNet inv Dropout 0.76
ConvNet inv grey Dropout 0.72
ConvNet augmented grey digit-wise 0.62
ConvNet augmented 0.68
ConvNet augmented Dropout 0.71
ConvNet augmented grey 0.73
ConvNet augmented grey inv. 0.82
ConvNet augmented grey inv. Dropout 0.83
Table 3. Results for different approaches and settings for jersey
number recognition.
7. Experimental Results
In table 3, all results in terms of accuracy are given.
There, ConvNet denotes the baseline neural network run,
while HOG denotes the run consisting of HOG features together with a support vector machine (SVM). If the run description contains the grey keyword, training and testing are performed on greyscale images rather than on the RGB color images used in the standard case. Augmented stands for the spatial data augmentation described earlier in section 5, inv. stands for data augmentation by inverting images, and Dropout for those networks with dropout layers after each fully connected layer.
During this optimization, dropout parameters were chosen carefully. When adding higher dropout ratios (around 0.5) to all fully connected layers, the obtained accuracy was below the case where moderate dropout ratios (around 0.2) were used. Also, adding dropout only to the first fully connected layer gave better results than adding dropout to all layers. It is assumed that the loss of information caused by dropping many activations in the network leads to sub-optimal results. However, overfitting was reduced and the train and test loss did not diverge, which they did when not using dropout at all.
Data augmentation by applying spatial transformations (scaling and translation) as well as color (or greyscale) inversion results in an increased accuracy of up to 0.83. More experiments are needed to check whether additional data augmentation could further improve performance.
Figure 10. Confusion matrix of the best configuration. Misclassifications mostly appear where the predicted label shares at least one digit with the groundtruth label.
While for some configurations, utilizing full RGB color information seems to yield slightly better results than a similar network operating on greyscale images, we think that using greyscale has some advantages in terms of expected generalizability. Whenever using color information yields better results, this might be due to some correlation between jersey colors and jersey numbers. For example, some rarely occurring jersey number might appear only in a single team. While this would help if all teams are known at training time, this correlation does not help when applying jersey number recognition to new, unknown data. Using dropout for regularization did not always improve results when the other network parameters remained constant. It was not tested whether increasing the network's capacity by adding more layers or adding connections would benefit from dropout.
Modelling jersey number recognition as two separate classification problems (the digit-wise run) for the first and second digit did not work as well as the holistic approach. The best digit-wise approach, on augmented grayscale images, performed worse (accuracy of 0.62) than most holistic approaches.
Interestingly, although the dataset is quite small, the ac-
curacy reached by using deep convolutional networks out-
performs that of the more traditional HOG+SVM approach
by a large margin (0.83 vs 0.40). This at first sight seems
to be counter-intuitive, as the promise of deep learning ap-
proaches is actually to make use of larger datasets.
For a closer analysis, confusion matrices are used, which
contain correctly classified entries at the main diagonal,
while wrongly classified entries occur at other positions.
When looking at the confusion matrices for both the best holistic and the best digit-wise networks in figures 10 and 11, it is apparent that mainly classes that share one digit are confused. These are all confusions that lie in the diagonal decimal blocks (adjacent to the true positive diagonal), i.e. where the first digit is recognized correctly but the second one is misclassified. The lines parallel to the diagonal, shifted by ten classes, represent misclassifications where the last digit was correctly identified but the first one was not.
Figure 11. Confusion matrix of the digit-wise classifier. It did not improve on the misclassifications from wrong digit order in comparison to the one-class-per-jersey-number configuration.
In contrast to the previous assumption, modelling the
two digits separately did not circumvent these misclassifi-
cations. Rather, the classification results as a whole became
worse and the same misclassification errors were notice-
able, apparently even more noticeable than in the holistic
case.
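The fraction of confusions that share a digit with the groundtruth number can be quantified directly from the predictions; a small sketch (with hypothetical arrays y_true and y_pred of groundtruth and predicted jersey numbers) is given below.

# Sketch: how many misclassifications share a digit with the groundtruth number?
from sklearn.metrics import confusion_matrix

def digits(n):
    return {n % 10, n // 10} if n >= 10 else {n}

errors = [(t, p) for t, p in zip(y_true, y_pred) if t != p]
shared = sum(1 for t, p in errors if digits(t) & digits(p))
print(shared / max(len(errors), 1))     # fraction of errors sharing a digit

cm = confusion_matrix(y_true, y_pred)   # rows: groundtruth, columns: prediction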
8. Conclusion
In this paper, a dataset consisting of 8,281 annotated soccer player images is presented, together with a convolutional neural network based approach for jersey number recognition. The problem of jersey number recognition, which involves one- or two-digit numbers in all known team sports, was posed as two different classification problems: a holistic approach with one class per number, and a digit-wise approach that models each digit at each position within a number separately. The experimental evaluations showed that the holistic approach performed better throughout the experiments. Another interesting finding
was that deep learning approaches yield quite good results
even with smaller datasets like the one presented here. By
utilizing data augmentation, the training set size can be in-
creased significantly. Applying dropout for regularization
improved results especially for those runs where no data
augmentation was performed.
In the future, it would be interesting to analyze more
network architectures, especially if applying dropout would
allow for deeper network architectures. Another promising
direction could be the use of spatial transformer networks as
well as more data augmentation techniques. For example,
additional rotation or perspective distortion could improve
invariance to slightly different player poses.
References
[1] E. Andrade, E. Khan, J. Woods, and M. Ghanbari. Player
identification in interactive sport scenes using region space
analysis prior information and number recognition. In In-
ternational Conference on Visual Information Engineering
(VIE 2003). Ideas, Applications, Experience, pages 57–60.
IEE, 2003.
[2] L. Ballan, M. Bertini, A. D. Bimbo, and W. Nunziati. Soc-
cer Players Identification Based on Visual Local Features. In
Proceedings of the 6th ACM international conference on Im-
age and video retrieval, pages 258 – 265, Amsterdam, The
Netherlands, 2007. ACM.
[3] F. Chollet. Keras: Theano-based deep learning library.
https://github.com/fchollet/keras, 2015.
[4] N. Dalal and B. Triggs. Histograms of Oriented Gradi-
ents for Human Detection. In 2005 IEEE Computer Soci-
ety Conference on Computer Vision and Pattern Recognition
(CVPR’05), pages 886–893. IEEE, 2005.
[5] D. Delannay, N. Danhier, and C. De Vleeschouwer. Detec-
tion and recognition of sports(wo)men from multiple views.
In 2009 Third ACM/IEEE International Conference on Dis-
tributed Smart Cameras (ICDSC), pages 1–7. IEEE, Aug.
2009.
[6] S. Gerke, S. Singh, A. Linnemann, and P. Ndjiki-Nya. Unsu-
pervised color classifier training for soccer player detection.
In Visual Communications and Image Processing (VCIP).,
2013.
[7] A. Krizhevsky. Learning Multiple Layers of Features from
Tiny Images. Technical report, 2009.
[8] Y. LeCun, C. Cortes, and C. J. Burges. The MNIST database of handwritten digits, 1998.
[9] C.-W. Lu, C.-Y. Lin, C.-Y. Hsu, M.-F. Weng, L.-W. Kang,
and H.-Y. M. Liao. Identification and Tracking of Players
in Sport Videos. In Proceedings of the Fifth International
Conference on Internet Multimedia Computing and Service
- ICIMCS ’13, page 113, New York, New York, USA, 2013.
ACM Press.
[10] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y.
Ng. Reading digits in natural images with unsupervised fea-
ture learning. In NIPS workshop on deep learning and unsu-
pervised feature learning, number 2, page 5. Granada, Spain,
2011.
[11] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research (JMLR), 15:1929–1958, 2014.
[12] J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel. The
German Traffic Sign Recognition Benchmark: A multi-class
classification competition. In IEEE International Joint Con-
ference on Neural Networks, pages 1453–1460, 2011.
April 8, 2009Groups at MIT and NYU have collected a dataset of millions of tiny colour images from the web. It is, in principle, an excellent dataset for unsupervised training of deep generative models, but previous researchers who have tried this have found it di cult to learn a good set of lters from the images. We show how to train a multi-layer generative model that learns to extract meaningful features which resemble those found in the human visual cortex. Using a novel parallelization algorithm to distribute the work among multiple machines connected on a network, we show how training such a model can be done in reasonable time. A second problematic aspect of the tiny images dataset is that there are no reliable class labels which makes it hard to use for object recognition experiments. We created two sets of reliable labels. The CIFAR-10 set has 6000 examples of each of 10 classes and the CIFAR-100 set has 600 examples of each of 100 non-overlapping classes. Using these labels, we show that object recognition is signi cantly