Soccer Jersey Number Recognition Using Convolutional Neural Networks

December 2015

DOI:10.1109/ICCVW.2015.100

Conference: The IEEE International Conference on Computer Vision (ICCV) Workshops, 2015, pp. 17-24

Authors:

Sebastian Gerke

TerraLoupe GmbH

Karsten Mueller

ltao.info

Soccer Jersey Number Recognition Using Convolutional Neural Networks

Sebastian Gerke

Fraunhofer HHI

Einsteinufer 37, 10587 Berlin, Germany

sebastian.gerke@hhi.fraunhofer.de

Karsten Müller

Fraunhofer HHI

Einsteinufer 37, 10587 Berlin, Germany

karsten.mueller@hhi.fraunhofer.de

Ralf Schäfer

Fraunhofer HHI

Einsteinufer 37, 10587 Berlin, Germany

ralf.schaefer@hhi.fraunhofer.de

Abstract

In this paper, a deep convolutional neural network based approach to the problem of automatically recognizing jersey numbers from soccer videos is presented. It is meant as a tool for subsequent automatic player identification approaches that utilize jersey numbers together with knowledge about teams and the jersey numbers of their players. Two different jersey number vector encoding schemes are presented and compared to each other. The first treats every number as a separate class, while the second one treats each digit as a class. Additionally, the semi-automatic process for the annotation of a jersey number dataset consisting of 8281 jersey numbers is described. The best recognition rate of 0.83 was achieved by the proposed approach with data augmentation and without using dropout, compared to 0.4 for a more traditional histogram of oriented gradients (HOG) and support vector machine (SVM) based approach.

1. Introduction

Soccer is one of the most popular sports in the world. In recent years, interest in automatic soccer analysis tools has grown significantly. Soccer analysis results can be used for new ways of storytelling on TV, for match preparation or for the generation of statistics. One of the fundamental analyses is the identification of players, which allows actions and statistics to be associated with actual players. However, identifying players in broadcast soccer videos automatically (and even manually) is challenging. It is especially difficult for the overview camera due to the low resolution per player: face recognition is impossible and jersey numbers are often hard to read, especially at standard definition resolutions. Only with the rise of widely available HD content in recent years has jersey number recognition become feasible.

Figure 1. Examples of the player detector output that is used to create the dataset presented here. The upper half of the actual player bounding box is shown.

2. Related Work

Existing approaches for automatic player identification in broadcast soccer videos can be categorized into two groups: one performs face recognition on closeup shots (not overview shots) in various types of sports videos, while the other relies on jersey number recognition. For the latter group, no approach is known to operate on soccer overview shots. These approaches either operate on other sports where the resolution per player is higher (e.g. in basketball [5, 9]), or they operate on closeup shots, where jersey numbers are more readable [1] and face recognition is feasible [2].

In [9], basketball players are detected using a deformable parts model (DPM), after which an exact localization of the jersey number is performed. Then, normalization is applied, followed by thresholding and calculating the correlation between the digits and digit templates. In [2], player identification is performed in closeup shots by employing SIFT features for face recognition.

All of these approaches have a quite sophisticated, hand-engineered image processing pipeline in common. They often perform explicit localization of jersey numbers, followed by digit segmentation. In contrast to these approaches, a deep learning approach is proposed here. It does not rely on explicit localization of jersey number regions, and no explicit segmentation is performed. Rather, a deep convolutional neural network is trained which handles the complete pixel-to-jersey-number recognition process.

3. Dataset Generation

Although there have been a few existing approaches for jersey number recognition, these approaches usually rely on video content where the image area of a single jersey number is relatively large, as the images either originate from medium and close-up shots in soccer video broadcasts or from sports where the main camera has a narrower camera angle, e.g. in basketball broadcasts. Therefore, a ground truth dataset consisting of 10,000 cropped images (from 65 different soccer videos) containing soccer players and labelled by their jersey number (if visible) was created using manual and automatic labelling cooperatively. The workflow of the annotation process is depicted in figure 2.

3.1. Semi-automatic Labelling

First, an automatic player detection based on histograms of oriented gradients (HOG) [4] together with a linear support vector machine (SVM) was performed on 65 different soccer videos, similar to what is described in [6]. For each video, 100 random overview frames were selected and the player detector was applied in a high-precision setting in order to increase the probability of a true positive. This resulted in approx. 70,000 cropped images of players.

Then, for a small subset of 2300 of these cropped player images, it was labelled whether the jersey number is visible or not. Presenting a binary classification to the human annotators makes the task simpler and therefore faster than annotating whole numbers. A linear SVM classifier on HOG features was trained on this classification task in order to increase the share of cropped players presented to human annotators in which a number is visible and readable. This classifier was applied to the 70,000 cropped player images, of which the highest ranked 10,000 images were used for manual jersey number labelling. This step is crucial, as in most of the 70,000 images a number is not even visible, which would yield a very sparsely annotated dataset.

These 10,000 images were then manually labelled. Volunteers were asked to either assign the visible number (basically from 1 to 44, excluding a few numbers that are not present within the whole dataset), or to indicate why it was not possible to assign a number. The reason could be one of not visible, not readable, multiple players and box error. Not visible is supposed to be assigned to images where the number is not visible at all, while not readable is supposed to be assigned to images where the number is either only partly visible due to the player's pose or not readable, e.g. due to motion blur or illumination. Multiple players refers to images that contain more than one player, where it is not obvious which player the annotation refers to. Box error is supposed to be assigned to images where a player is not correctly detected, the box being too small or too big. While the relatively fine-grained annotation of error cases might be useful for future research, the error classes are currently not used.

After annotating each of the 10,000 images, a validation step was performed to reduce the number of false annotations. For this purpose, all images of a jersey number are shown together. False annotations are then easily visible and are corrected.

Comparing the ratio of images that contain visible numbers in the small initial subset of 2300 images with that in the set of 10,000 images selected by the aforementioned ranking shows that the ratio of images with visible numbers could be increased significantly. In the small subset, 1010 out of 2300 images have visible numbers (ratio 0.43), while about 8,000 out of 10,000 images (ratio 0.8) have visible (and readable) numbers in the pre-ranked dataset. That means that by using this pre-processing step, the effort for obtaining 8,000 labelled samples was reduced by almost 50%.
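The effort estimate can be checked with a few lines of arithmetic. The following is only a sketch: it assumes that annotation effort is proportional to the number of images an annotator has to inspect, and the function and variable names are illustrative rather than taken from the paper.

```python
# Visible-number ratios reported in the text.
ratio_unranked = 0.43    # 1010 of 2300 images in the initial subset
ratio_ranked = 0.80      # about 8,000 of the 10,000 top-ranked images
target_labelled = 8000   # labelled samples to obtain


def images_to_inspect(target, visible_ratio):
    """Expected number of images to inspect to collect `target` usable samples."""
    return target / visible_ratio


unranked = images_to_inspect(target_labelled, ratio_unranked)
ranked = images_to_inspect(target_labelled, ratio_ranked)
saving = 1.0 - ranked / unranked  # relative reduction in inspected images

print(f"without ranking: {unranked:.0f} images")
print(f"with ranking:    {ranked:.0f} images")
print(f"relative saving: {saving:.1%}")
```

Under this assumption the saving comes out a little below one half, which is in line with the "almost 50%" stated in the text.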

For experimenting with automatic jersey number classifiers, the dataset described above is split into a training and a test corpus. It is split by video, i.e. all images from a video are either in the training or in the test set, in order to avoid unrealistic scenarios where classification relies on training samples from the same video. After splitting, the training corpus consists of 5759 images and the test corpus consists of 2520 images.
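A video-wise split of this kind could be sketched as follows. This is a minimal illustration, not the procedure used for the dataset: the `split_by_video` helper, the test fraction of 0.3, the random seed and the toy sample identifiers are all assumptions.

```python
import random
from collections import defaultdict


def split_by_video(samples, test_fraction=0.3, seed=0):
    """Split (video_id, image_id) pairs so that no video appears in both sets."""
    by_video = defaultdict(list)
    for video_id, image_id in samples:
        by_video[video_id].append(image_id)

    # Shuffle whole videos, not single images, and reserve some for testing.
    video_ids = sorted(by_video)
    random.Random(seed).shuffle(video_ids)
    n_test = max(1, round(len(video_ids) * test_fraction))
    test_videos = set(video_ids[:n_test])

    train = [(v, i) for v in video_ids if v not in test_videos for i in by_video[v]]
    test = [(v, i) for v in video_ids if v in test_videos for i in by_video[v]]
    return train, test


# Toy data: 65 hypothetical videos with a varying number of labelled crops each.
samples = [(f"video_{v:02d}", f"crop_{k}") for v in range(65) for k in range(5 + v % 7)]
train, test = split_by_video(samples)
print(len(train), len(test))
```

Because the split is made over videos, the resulting image counts depend on how many crops each video contributes, so the image-level ratio only approximately follows the requested fraction.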

3.2. Dataset Properties

The number distribution is shown in figure 3. It shows that numbers are not equally distributed, but rather imbalanced. While there are e.g. 600 samples for number 10 (the most frequent number), there are only 7 samples for number 41. This could make training a classifier a challenging task. In comparison to a similar dataset, the Street View House Number dataset (SVHN) [10], where digits between 0 and 9 are annotated, the ratio between the most frequent and the rarest label is much larger: it is 86 for the dataset presented here and 3 for the SVHN dataset.

[Figure 2 diagram: 65 videos → player detector → 70,000 detections → manual label on 2300 detections (number visible / not visible) + SVM classifier → 10,000 top-ranked detections → manual number annotation → 10,000 labelled detections]

Figure 2. Workflow of the semi-automatic jersey number dataset annotation process. Blue arrows denote manual annotation steps.

Figure 3. Jersey number distribution within the complete (training + test) dataset.

Figure 4. Distribution of first digit of jersey numbers within the complete (training + test) dataset.

3.3. Comparison to other datasets

Table 1 gives an overview of key dataset characteristics of the presented dataset in comparison to similar computer vision datasets. It lists the soccer jersey number dataset presented here (SJN), the MNIST database of handwritten digits [8], the Street View House Number dataset (SVHN) [10] and the traffic sign recognition dataset (TS) [12]. As can be seen, the dataset presented here consists of more classes than the MNIST or SVHN dataset, as jersey numbers are classified as whole numbers, not as a sequence of digits. This means there are fewer positive samples per class than for these datasets, even if the dataset were perfectly balanced. However, in section 4 we present an alternative coding scheme that separately models digits and yields a more even distribution of classes. Additionally, the resolution of the other datasets is smaller, but the 64 × 128 resolution is the actual bounding box size of the whole player. For jersey number recognition, it is sufficient to only consider the upper half of the bounding box. Within the upper half of the bounding box, the precise location of the jersey number is not annotated. That makes this task harder than for other datasets, where the number (or traffic sign) locations are annotated manually and are therefore more precise. Similar to the SVHN and TS datasets, the dataset consists of RGB color images, whereas the MNIST dataset is a grey-scale dataset. However, the most significant difference is the actual size of the dataset. The presented SJN dataset is by far the smallest among those four, which could make approaches that rely on large datasets less promising. Also, given the smaller dataset size and larger problem size (number of classes), results on the presented dataset (in terms of accuracy) are expected to be not as good as reported results on the other datasets mentioned here.

Figure 5. Distribution of second digit of jersey numbers within the complete (training + test) dataset.

Dataset     Classes  Resolution     Training  Test
MNIST [8]   10       28 × 28 × 1    60,000    10,000
SVHN [10]   10       32 × 32 × 3    73,257    26,032
TS [12]     43       32 × 32 × 3    39,209    12,630
SJN         36       64 × 128 × 3   5,760     2,521

Table 1. Comparison with other similar datasets. The image resolution and the number of channels, as well as training and test set sizes are given.

4. Classiﬁcation Problem

In this work, two different methods for posing (jersey) number recognition as a classification problem are evaluated. The first approach is to model each occurring jersey number as a separate class. In our case, this means a 40-class classification problem, as not all one- or two-digit numbers appear in the dataset. That means that the classifier $c(x)$ assigns exactly one class (number) $y$ to each input sample image $x$:

$$c(x) = y, \quad y \in \{1, 2, 3, \ldots, 40\} \qquad (1)$$

Alternatively, one could treat the problem as a two-label classification problem, with one label for each digit: one for the most significant digit of a one- or two-digit number, and one for the least significant digit:

$$c(x) = (y_1, y_2), \quad y_1 \in \{10, 11, 12, 13, 14\}, \quad y_2 \in \{0, \ldots, 9\} \qquad (2)$$

where the consecutive labels 10-14 stand for the first digit, i.e. 10 represents single-digit numbers, 11 represents numbers whose first (most significant) digit is 1, etc. For a neural network, categorical labels are usually encoded by binary vectors whose dimensionality is equal to the number of different labels. That means that for the classification problem as described in equation 1, labels are converted to a 40-dimensional vector with exactly one dimension (that of the groundtruth label) set to one and all other elements set to zero:

$$y_{\mathrm{bin}} = [0_0, \ldots, 0_{y-1}, 1_y, 0_{y+1}, \ldots, 0_n]^T \qquad (3)$$

The output of the neural network classifier then needs to be converted back to a class label $y_{\mathrm{predicted}}$ by choosing the maximum element of the resulting vector $y'$ (that contains real-valued entries):

$$y_{\mathrm{predicted}} = \operatorname*{argmax}_{i \in \{1, 2, \ldots, 40\}} y'_i \qquad (4)$$

Figure 6. Training samples obtained by applying random scaling and cropping of the original sample.

For the second case, the binary vector consists of two non-zero elements for groundtruth labels, one for each digit, with numbers smaller than ten having an imaginary 0 as their first digit:

$$y_{\mathrm{bin}} = [0_0, \ldots, 0_{y_2-1}, 1_{y_2}, 0_{y_2+1}, \ldots, 0, 1_{y_1}, 0, \ldots, 0_n]^T \qquad (5)$$

Converting network predictions back to numbers is then a combination of the maximum element of the first 10 elements of $y'$ and the maximum of the subsequent 5 elements:

$$y_{\mathrm{predicted}} = \Big( \operatorname*{argmax}_{i \in \{0, 1, 2, \ldots, 9\}} y'_i, \; \operatorname*{argmax}_{j \in \{10, 11, \ldots, 14\}} y'_j \Big) \qquad (6)$$
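The two label encodings and their decoding steps can be illustrated with a short sketch. It is not the authors' implementation: the helper names are made up, and the class list simply uses the numbers 1 to 40 as a stand-in for the jersey numbers that actually occur in the dataset.

```python
# Assumption: stand-in for the set of jersey numbers present in the data.
CLASSES = list(range(1, 41))


def encode_holistic(number):
    """One-hot vector with one entry per jersey number (cf. equation 3)."""
    vec = [0] * len(CLASSES)
    vec[CLASSES.index(number)] = 1
    return vec


def decode_holistic(scores):
    """Pick the class with the largest network output (cf. equation 4)."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return CLASSES[best]


def encode_digitwise(number):
    """15-dim vector: entries 0-9 mark the last digit, 10-14 the first digit (cf. equation 5)."""
    first, last = divmod(number, 10)  # first == 0 for single-digit numbers
    vec = [0] * 15
    vec[last] = 1
    vec[10 + first] = 1
    return vec


def decode_digitwise(scores):
    """Combine the argmax over each block of the output (cf. equation 6)."""
    last = max(range(0, 10), key=lambda i: scores[i])
    first = max(range(10, 15), key=lambda j: scores[j]) - 10
    return 10 * first + last


print(encode_digitwise(27))
print(decode_digitwise(encode_digitwise(27)))
print(decode_holistic(encode_holistic(10)))
```

In the digit-wise scheme a number such as 7 activates entry 7 (last digit) and entry 10 (no leading digit), which corresponds to the "imaginary 0" mentioned in the text.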

Both approaches have their advantages and disadvantages: As can be seen in figure 3, treating all numbers as separate classes results in a very imbalanced dataset. Given the dataset, it is even conceptually impossible to recognize two-digit numbers that do not occur, i.e. all numbers >45. When applying two separate classification problems, it would be possible to model unseen jersey numbers up to 49, i.e. numbers for which each digit has been seen at its respective position (first or second digit of the number) in the training set. However, it might be difficult for an algorithm to separate the first and second digit of the number when no explicit localization or segmentation has been performed. Additional factors such as slight perspective changes might make separating the digits even more difficult. Therefore, it might be more appropriate to model numbers holistically.

5. Data augmentation

As the soccer jersey number dataset is quite small, data augmentation is expected to play a key role for good recognition results. Here, we apply data augmentation to increase the number of training samples from 5,760 samples to approx. 56,000 training samples. As the jersey numbers are not centered within a certain region of the image, a classifier is supposed to be translation invariant. In order to improve this invariance, multiple variants of an existing training sample are generated, each cropping a different 40 × 40 patch, shifted within a certain range, from the upper half of a bounding box (64 × 128). Additionally, as the size of the actual region of the jersey number within the player's bounding box is not known, differently scaled samples (the scale factor is randomly chosen between 0.9 and 1.1) are generated for augmentation, as shown in the example in figure 6.

Figure 7. Output of the first convolutional layer for a sample image.

As described later, runs that operate on color and on grayscale images are tested. For the grayscale runs, additional data augmentation by inverting all training samples was performed, yielding a training dataset of approx. 108,000 samples.
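The augmentation steps described in this section (random scaling, shifted 40 × 40 crops and greyscale inversion) could look roughly like the following sketch. It uses plain Python lists instead of an image library to stay self-contained. Several details are assumptions not given in the text: nearest-neighbour rescaling, a maximum shift of 8 pixels around the image centre, ten variants per sample, and 8-bit grey values.

```python
import random

CROP = 40  # side length of the patches fed to the network


def rescale(img, factor):
    """Nearest-neighbour rescaling of a 2-D list of grey values."""
    h, w = len(img), len(img[0])
    new_h, new_w = round(h * factor), round(w * factor)
    return [[img[min(h - 1, int(r / factor))][min(w - 1, int(c / factor))]
             for c in range(new_w)] for r in range(new_h)]


def random_crop(img, rng, max_shift=8):
    """Cut a CROP x CROP patch whose position is shifted around the image centre."""
    h, w = len(img), len(img[0])
    top = (h - CROP) // 2 + rng.randint(-max_shift, max_shift)
    left = (w - CROP) // 2 + rng.randint(-max_shift, max_shift)
    top = min(max(top, 0), h - CROP)      # keep the patch inside the image
    left = min(max(left, 0), w - CROP)
    return [row[left:left + CROP] for row in img[top:top + CROP]]


def invert(img):
    """Greyscale inversion, assuming values in the range 0..255."""
    return [[255 - v for v in row] for row in img]


def augment(upper_half, n_variants=10, seed=0):
    """Return scaled and shifted patches, each followed by its inverted copy."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_variants):
        scaled = rescale(upper_half, rng.uniform(0.9, 1.1))
        patch = random_crop(scaled, rng)
        out.append(patch)
        out.append(invert(patch))  # inversion was applied for the grayscale runs
    return out


# Dummy 64 x 64 upper half of a 64 x 128 player bounding box.
upper_half = [[(r * 64 + c) % 256 for c in range(64)] for r in range(64)]
patches = augment(upper_half)
print(len(patches), len(patches[0]), len(patches[0][0]))
```

With ten spatial variants per sample plus their inverted copies, a training set of a few thousand images grows by a factor of about twenty, which is the order of magnitude reported above.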

6. Deep Convolutional Neural Network

As a baseline, a HOG based radial basis function (RBF) kernel SVM classifier was used, similar to [10], although a linear SVM was used there. HOG features are calculated only for the upper half of player bounding boxes to reduce the influence of irrelevant image parts. On these features, an RBF kernel based SVM is trained. Using this baseline, an accuracy of 0.404 was obtained.

Additionally, a convolutional neural network was trained to recognize numbers. The Keras [3] Python library for deep neural networks was used throughout the following experiments. The network architecture is inspired by models for generic image classification (similar to a model for the CIFAR-10 [7] dataset) and for recognizing house numbers in street view images (using the Street View House Number dataset). The base architecture consists of three convolutional layers, each followed by a max-pooling layer and a rectified linear unit (ReLU). Then, there are three fully connected hidden layers with optional dropout [11] layers, and finally a softmax loss layer follows. The network has been trained and tested and is described in detail in section 7. Without any further data augmentation and parameter tuning, the accuracy obtained was approx. 0.60, which is already better than the more classical HOG+SVM based approach.

Figure 8. Output of the first convolutional layer for a sample image.

The detailed network architecture is as follows: three convolutional layers (with 16 × 5 × 5 / 30 × 7 × 7 / 50 × 3 × 3 parameters), each with rectified linear units (ReLU) as their activation function and followed by a max-pooling layer. Then, three fully connected layers with ReLU activation follow. Table 2 gives the details of the network architecture, which holds for all runs. Only data augmentation, dropout parameters and color space vary between runs. The convolutional stride is always set to one pixel, while pooling size and stride are two pixels for the first convolutional layer and three pixels for the remaining convolutional layers. In comparison, the network architecture in [11] for the SVHN dataset uses more filter channels ((96, 128, 256) instead of the (16, 30, 50) used here) for the convolutional layers. The two fully connected layers in that work each have 2048 units, while in this work only 34 units are used. The reason for reducing the number of units is mainly the lack of a large dataset. The SVHN dataset is two orders of magnitude larger (as an extended training corpus of the SVHN dataset was used) than the jersey number dataset used here.
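To make the layer dimensions concrete, the following sketch traces the feature-map size through the network and counts trainable parameters. It is a back-of-the-envelope check rather than the original Keras model. It assumes zero-padded convolutions that preserve the spatial size, an RGB input, bias terms in every layer, and 45 output units (the holistic variant); none of these details are stated explicitly in the text.

```python
def trace_network(input_size=40, in_channels=3, n_outputs=45):
    """Trace feature-map side lengths and count parameters for the described layers."""
    # (output channels, filter size, pooling size = pooling stride) per conv stage
    conv_stages = [(16, 5, 2), (30, 7, 3), (50, 3, 3)]
    fc_units = [34, 34, n_outputs]

    size, channels = input_size, in_channels
    sizes, n_params = [size], 0
    for out_channels, k, pool in conv_stages:
        n_params += k * k * channels * out_channels + out_channels  # weights + biases
        # Assumption: zero-padded stride-1 convolution keeps the side length,
        # so only the pooling step shrinks the feature map.
        size //= pool
        channels = out_channels
        sizes.append(size)

    features = size * size * channels
    flat_features = features
    for units in fc_units:
        n_params += features * units + units
        features = units
    return sizes, flat_features, n_params


sizes, flat_features, n_params = trace_network()
print("side length entering stages 1-4:", sizes)
print("inputs to first fully connected layer:", flat_features)
print("trainable parameters:", n_params)
```

Under these assumptions the side lengths come out as 40, 20, 6 and 2, and the resulting parameter count is in the tens of thousands, which illustrates how small the network is compared with architectures designed for much larger datasets.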

Figure 9 shows sample classification results using the best-performing recognizer (ConvNet grey aug inv.) for different categories, namely 2, 3, 4, 6, 8, 10, 13, 15, 16, 21, 20 and 25. Figure 7 depicts the 16 learned convolution filters in the first layer. It shows that mainly edge filters have been learned, with some filters . Figure 8 shows the 16 responses for the sample image for these filters.

Stage               1           2           3           4      5     6
Layer type          conv + max  conv + max  conv + max  full   full  full (output)
# channels          16          30          50          34     34    45/15
Filter size         5 × 5       7 × 7       3 × 3       -      -     -
Conv. strides       1 × 1       1 × 1       1 × 1       -      -     -
Pooling size        2 × 2       3 × 3       3 × 3       -      -     -
Pooling stride      2 × 2       3 × 3       3 × 3       -      -     -
Spatial input size  40 × 40     20 × 20     6 × 6       2 × 2  -     -

Table 2. Deep convolutional network architecture.

Figure 9. Sample classification results using the best configuration. Each column shows random results for the classes 2, 3, 4, 6, 8, 10, 13, 15, 16, 21, 20 and 25.

Run                                   Accuracy
HOG                                   0.40
ConvNet                               0.61
ConvNet Dropout                       0.71
ConvNet grey Dropout                  0.72
ConvNet inv Dropout                   0.76
ConvNet inv grey Dropout              0.72
ConvNet augmented grey digit-wise     0.62
ConvNet augmented                     0.68
ConvNet augmented Dropout             0.71
ConvNet augmented grey                0.73
ConvNet augmented grey inv.           0.82
ConvNet augmented grey inv. Dropout   0.83

Table 3. Results for different approaches and settings for jersey number recognition.

7. Experimental Results

In table 3, all results in terms of accuracy are given. There, ConvNet denotes the baseline neural network run, while HOG denotes the run consisting of HOG features together with a support vector machine (SVM). If the run description contains the grey keyword, training and testing are performed on greyscale images rather than on RGB color images as in the standard case. The keyword augmented stands for spatial data augmentation as described earlier in section 5. Inv. stands for data augmentation by inverting images and Dropout for those networks with dropout layers after each fully connected layer.

During this optimization, dropout parameters were chosen carefully. When adding higher dropout ratios (around 0.5) to all fully connected layers, the obtained accuracy was below that obtained with moderate dropout ratios (around 0.2). Also, adding dropout only to the first fully connected layer gave better results than adding dropout to all layers. It is assumed that the loss of information caused by dropping many activations in the network leads to sub-optimal results. However, overfitting was reduced and the train and test loss did not diverge, which they did when not using dropout at all.

Data augmentation by applying spatial transformations (scaling and translation) as well as color (or greyscale) inversion results in an increased accuracy of up to 0.83. More experiments are necessary to check whether additional data augmentation can further improve performance.

Figure 10. Confusion matrix of the best configuration. Misclassifications mostly appear where the predicted label shares at least one digit with the groundtruth label.

While for some configurations, utilizing full RGB color information seems to yield slightly better results than a similar network operating on greyscale images, we think that using greyscale has some advantages in terms of expected generalizability. Whenever using color information yields better results, this might be due to some correlation between jersey colors and jersey numbers. For example, some rarely occurring jersey number might appear only in a single team. While this would help if all teams are known at training time, this correlation does not help when applying jersey number recognition to new, unknown data. Using dropout for regularization did not always improve results when the other network parameters remained constant. It was not tested whether increasing the network's capacity by adding more layers or adding connections would benefit from dropout.

Modelling jersey number recognition as two separate classification problems for the first and second digit (the digit-wise run) did not work as well as the holistic approach. The best digit-wise approach, on augmented grayscale images, performed worse (accuracy of 0.62) than most holistic approaches.

Interestingly, although the dataset is quite small, the accuracy reached by using deep convolutional networks outperforms that of the more traditional HOG+SVM approach by a large margin (0.83 vs 0.40). At first sight, this seems counter-intuitive, as the promise of deep learning approaches is actually to make use of larger datasets.

For a closer analysis, confusion matrices are used, which contain correctly classified entries on the main diagonal, while wrongly classified entries occur at other positions. When looking at the confusion matrices for both the best holistic and the best digit-wise networks in figures 10 and 11, it is apparent that mainly classes that share one digit are confused. These are, first, the confusions in the diagonal decimal blocks (adjacent to the true positive diagonal), i.e. where the first digit is recognized correctly but the second one is misclassified. Second, the lines parallel to the diagonal, shifted by ten classes, represent misclassifications where the last digit was correctly identified but the first one was not.

Figure 11. Confusion matrix of the digit-wise classifier. It did not improve misclassifications from wrong digit order in comparison to the one class per jersey number configuration.
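The two error patterns described here can be separated programmatically. The sketch below groups predictions by which digit position they share with the ground truth. The `error_type` helper and the example pairs are purely illustrative and are not results from the paper.

```python
from collections import Counter


def error_type(true_number, predicted_number):
    """Classify a prediction by which digit position it shares with the ground truth."""
    if true_number == predicted_number:
        return "correct"
    t_first, t_last = divmod(true_number, 10)
    p_first, p_last = divmod(predicted_number, 10)
    if t_first == p_first:
        return "first digit correct"   # falls inside the same decimal block
    if t_last == p_last:
        return "last digit correct"    # lies on a line shifted by a multiple of ten
    return "no shared digit position"


# Hypothetical (ground truth, prediction) pairs, for illustration only.
pairs = [(10, 10), (13, 18), (21, 11), (25, 25), (6, 8), (16, 23)]
summary = Counter(error_type(t, p) for t, p in pairs)
for kind, count in summary.items():
    print(f"{kind}: {count}")
```

Single-digit numbers are treated as having a leading 0, so a confusion between 6 and 8 counts as "first digit correct", in line with the block structure of the confusion matrix.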

In contrast to the previous assumption, modelling the two digits separately did not circumvent these misclassifications. Rather, the classification results as a whole became worse and the same misclassification errors were noticeable, apparently even more so than in the holistic case.

8. Conclusion

In this paper, a dataset consisting of 8281 annotated soccer player images is presented, together with a convolutional neural network based approach for jersey number recognition. The problem of jersey number recognition, which involves one- or two-digit numbers for all known team sports, was posed as two different classification problems: one holistic approach with one class per number, and one digit-wise approach that models each digit at each position within a number separately. By conducting experimental evaluations, it was shown that the holistic approach performed better throughout the experiments. Another interesting finding was that deep learning approaches yield quite good results even with smaller datasets like the one presented here. By utilizing data augmentation, the training set size can be increased significantly. Applying dropout for regularization improved results especially for those runs where no data augmentation was performed.

In the future, it would be interesting to analyze more network architectures, especially whether applying dropout would allow for deeper network architectures. Another promising direction could be the use of spatial transformer networks as well as more data augmentation techniques. For example, additional rotation or perspective distortion could improve invariance to slightly different player poses.

References

[1] E. Andrade, E. Khan, J. Woods, and M. Ghanbari. Player identification in interactive sport scenes using region space analysis prior information and number recognition. In International Conference on Visual Information Engineering (VIE 2003). Ideas, Applications, Experience, pages 57–60. IEE, 2003.

[2] L. Ballan, M. Bertini, A. D. Bimbo, and W. Nunziati. Soccer Players Identification Based on Visual Local Features. In Proceedings of the 6th ACM international conference on Image and video retrieval, pages 258–265, Amsterdam, The Netherlands, 2007. ACM.

[3] F. Chollet. Keras: Theano-based deep learning library. https://github.com/fchollet/keras, 2015.

[4] N. Dalal and B. Triggs. Histograms of Oriented Gradients for Human Detection. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), pages 886–893. IEEE, 2005.

[5] D. Delannay, N. Danhier, and C. De Vleeschouwer. Detection and recognition of sports(wo)men from multiple views. In 2009 Third ACM/IEEE International Conference on Distributed Smart Cameras (ICDSC), pages 1–7. IEEE, Aug. 2009.

[6] S. Gerke, S. Singh, A. Linnemann, and P. Ndjiki-Nya. Unsupervised color classifier training for soccer player detection. In Visual Communications and Image Processing (VCIP), 2013.

[7] A. Krizhevsky. Learning Multiple Layers of Features from Tiny Images. Technical report, 2009.

[8] Y. LeCun, C. Cortes, and C. J. Burges. The MNIST database of handwritten digits, 1998.

[9] C.-W. Lu, C.-Y. Lin, C.-Y. Hsu, M.-F. Weng, L.-W. Kang, and H.-Y. M. Liao. Identification and Tracking of Players in Sport Videos. In Proceedings of the Fifth International Conference on Internet Multimedia Computing and Service - ICIMCS '13, page 113, New York, New York, USA, 2013. ACM Press.

[10] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, number 2, page 5. Granada, Spain, 2011.

[11] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research (JMLR), 15:1929–1958, 2014.

[12] J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel. The German Traffic Sign Recognition Benchmark: A multi-class classification competition. In IEEE International Joint Conference on Neural Networks, pages 1453–1460, 2011.

Generalized Jersey Number Recognition Using Multi-task Learning With Orientation-guided Weight Refinement

Preprint

Full-text available

Jun 2024

Jersey number recognition (JNR) has always been an important task in sports analytics. Improving recognition accuracy remains an ongoing challenge because images are subject to blurring, occlusion, deformity, and low resolution. Recent research has addressed these problems using number localization and optical character recognition. Some approaches apply player identification schemes to image sequences, ignoring the impact of human body rotation angles on jersey digit identification. Accurately predicting the number of jersey digits by using a multi-task scheme to recognize each individual digit enables more robust results. Based on the above considerations, this paper proposes a multi-task learning method called the angle-digit refine scheme (ADRS), which combines human body orientation angles and digit number clues to recognize athletic jersey numbers. Based on our experimental results, our approach increases inference information, significantly improving prediction accuracy. Compared to state-of-the-art methods, which can only handle a single type of sport, the proposed method produces a more diverse and practical JNR application. The incorporation of diverse types of team sports such as soccer, football, basketball, volleyball, and baseball into our dataset contributes greatly to generalized JNR in sports analytics. Our accuracy achieves 64.07% on Top-1 and 89.97% on Top-2, with corresponding F1 scores of 67.46% and 90.64%, respectively.

Domain-guided Masked Autoencoders for Unique Player Identification

Article

Full-text available

May 2024

A General Framework for Jersey Number Recognition in Sports Video

Preprint

Full-text available

May 2024

Jersey number recognition is an important task in sports video analysis, partly due to its importance for long-term player tracking. It can be viewed as a variant of scene text recognition. However, there is a lack of published attempts to apply scene text recognition models on jersey number data. Here we introduce a novel public jersey number recognition dataset for hockey and study how scene text recognition methods can be adapted to this problem. We address issues of occlusions and assess the degree to which training on one sport (hockey) can be generalized to another (soccer). For the latter, we also consider how jersey number recognition at the single-image level can be aggregated across frames to yield tracklet-level jersey number labels. We demonstrate high performance on image- and tracklet-level tasks, achieving 91.4% accuracy for hockey images and 87.4% for soccer tracklets. Code, models, and data are available at https://github.com/mkoshkina/jersey-number-pipeline.

Jersey Number Recognition using Keyframe Identification from Low-Resolution Broadcast Videos

Conference Paper

Oct 2023

Individual Locating of Soccer Players from a Single Moving View

Article

Sep 2023
SENSORS-BASEL

Positional data in team sports is key in evaluating the players’ individual and collective performances. When the sole source of data is a broadcast-like video of the game, an efficient video tracking method is required to generate this data. This article describes a framework that extracts individual soccer player positions on the field. It is based on two main components. Because the camera view in broadcast-like videos of team sport games moves to follow the action, a sport field registration method estimates the homography between the pitch and the frame space. Our method estimates the positions of key points sampled on the pitch using an encoder–decoder architecture. The attention mechanisms of the encoder, based on a vision transformer, capture characteristic pitch features globally in the frames. A multiple person tracker generates tracklets in the frame space by associating, with bipartite matching, the player detections between the current and the previous frames using Intersection-Over-Union and distance criteria. Tracklets are then iteratively merged with appearance criteria using a re-identification model. This model is fine-tuned in a self-supervised way on the player thumbnails of the video sample to specifically recognize the fine identification details of each player. Projecting the player positions in the frames through the homographies yields the real position of the players on the pitch at every moment of the video. We experimentally evaluate our sport field registration method and our 2D player tracker on public datasets. We demonstrate that they both outperform previous works for most metrics. Our 2D player tracker was also awarded first place at the SoccerNet tracking challenge in 2022 and 2023.

Optimizing Long-Term Robot Tracking with Multi-Platform Sensor Fusion

Conference Paper

Jan 2024

Convolutional Neural Networks

Chapter

Mar 2024

This chapter introduces convolutional neural networks (CNNs) and describes how they can be used in the context of sports analytics. CNNs are suitable for end-to-end learning on images or similarly structured data. CNNs can efficiently learn features of images based on pixel values and, for example, extract suitable features for a classification task. In this context, the models benefit from parameter sharing in the convolutional layers and exhibit translation equivariance and invariance properties. CNNs are thus suited for learning features from positional data of team sports, provided that the data is put into an appropriate structure.

Energy-Motion Features Aggregation Network for Players’ Fine-Grained Action Analysis in Soccer Videos

Article

Feb 2024

Rich and complex events in sports have led to the development of a wide variety of techniques for interpreting the content of sports videos in terms of players’ actions, poses, gait, performance, etc. This is due to the requirements of coaches, trainers and players who expect to analyze actions in top sports events, as well as sports fans who practice to imitate professional playing skills, e.g., dribbling, shooting, etc. However, this poses two key challenges for the automated sports analysis community. Firstly, there are extremely limited public sports datasets. Secondly, recent advances in interpretations of sports activities, e.g., soccer, are predominantly made through analyzing coarse-grained contents. Players’ fine-grained skills analysis still remains under-explored. To alleviate these problems, this paper (a) collects a dataset of highlight videos of soccer players, including two coarse-grained action types of soccer players and six fine-grained actions of players. Detailed annotations are provided for the collected dataset, in terms of action classes, bounding boxes, segmentation maps, and body keypoints of soccer players, and positions of a soccer ball in a game. (b) leverages the understanding of complex highlight videos by proposing an energy-motion features aggregation network (EMA-Net) to fully exploit the energy-based representation of soccer players’ movements in video sequences and the explicit motion dynamics of soccer players in videos for soccer players’ fine-grained action analysis. Experimental results and ablation studies validate the proposed approach in recognizing soccer players’ actions using the collected soccer highlight video datasets.

Convolutional Neural Networks

Chapter

Oct 2023

Tracking and Identification of Ice Hockey Players

Chapter

Sep 2023

Due to the rapid movement of players, ice hockey is a high-speed sport that poses significant challenges for player tracking. In this paper, we present a comprehensive framework for player identification and tracking in ice hockey games, utilising deep neural networks trained on actual gameplay data. Player detection, identification, and tracking are the three main components of our architecture. The player detection component detects individuals in an image sequence using a region proposal technique. The player identification component makes use of a text detector model that performs character recognition on regions containing text detected by a scene text recognition model, enabling us to resolve ambiguities caused by players from the same squad having similar appearances. After identifying the players, a visual multi-object tracking model is used to track their movements throughout the game. Experiments conducted with data collected from actual ice hockey games demonstrate the viability of our proposed framework for tracking and identifying players in real-world settings. Our framework achieves an average precision (AP) of 67.3 and a Multiple Object Tracking Accuracy (MOTA) of 80.2 for player detection and tracking, respectively. In addition, our team identification and player number identification accuracy is 82.39% and 87.19%, respectively. Overall, our framework is a significant advancement in the field of player tracking and identification in ice hockey, utilising cutting-edge deep learning techniques to achieve high accuracy and robustness in the face of complex and fast-paced gameplay. Our framework has the potential to be applied in a variety of applications, including sports analysis, player tracking, and team performance evaluation. Further enhancements can be made to address the challenges posed by complex and cluttered environments and enhance the system’s precision.

Identification and tracking of players in sport videos

Conference Paper

Aug 2013

In this paper, we propose a novel framework to automatically perform player tracking and identification for sport videos filmed by a single pan-tilt-zoom camera from the court view. The proposed scheme is separated into three parts. The first part is to detect players by a deformable part model. The second part is to recognize jersey numbers by gradient differences and optical character recognition. The final part applies particle filters to track players. Experimental results demonstrate the efficacy of the proposed algorithm and the feasibility for sports video analysis.

Unsupervised color classifier training for soccer player detection

Conference Paper

Nov 2013

Player detection in sports video is a challenging task: In contrast to typical surveillance applications, a pan-tilt-zoom camera model is used. Therefore, simple background learning approaches cannot be used. Furthermore, camera motion causes severe motion blur, making gradient based approaches less robust than in settings where the camera is static. The contribution of this paper is a sequence adaptive approach that utilizes color information in an unsupervised manner to improve detection accuracy. Therefore, different color features, namely color histograms, color spatiograms and a color and edge directivity descriptor are evaluated. It is shown that the proposed color adaptive approach improves detection accuracy. In terms of maximum F1 score, an improvement from 0.79 to 0.81 is reached using block-wise HSV histograms. The average number of false positives per image (FPPI) at two fixed recall levels decreased by approximately 23%.

Detection and recognition of sports(wo)men from multiple views

Conference Paper

Oct 2009

The methods presented in this paper aim at detecting and recognizing players on a sport-field, based on a distributed set of loosely synchronized cameras. Detection assumes player verticality, and sums the cumulative projection of the multiple views' foreground activity masks on a set of planes that are parallel to the ground plane. After summation, large projection values indicate the position of the player on the ground plane. This position is used as an anchor for the player bounding box projected in each one of the views. Within this bounding box, the regions provided by mean-shift segmentation are sorted out based on contextual features, e.g. relative size and position, to select the ones that are likely to correspond to a digit. Normalization and classification of the selected regions then provides the number and identity of the player. Since the player number can only be read when it faces towards the camera, graph-based tracking is considered to propagate the identity of a player along its trajectory.

Histograms of Oriented Gradients for Human Detection

Conference Paper

Jul 2005
IEEE Comput Soc Conf Comput Vis Pattern Recogn

We study the question of feature sets for robust visual object recognition, adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of Histograms of Oriented Gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.

Learning multiple layers of features from tiny images

Article

Jan 2009

Dropout: A Simple Way to Prevent Neural Networks from Overfitting

Article

Jun 2014
J MACH LEARN RES

Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem. The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. During training, dropout samples from an exponential number of different "thinned" networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights. This significantly reduces overfitting and gives major improvements over other regularization methods. We show that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets. © 2014 Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever and Ruslan Salakhutdinov.

Player identification in interactive sport scenes using region space analysis prior information and number recognition

Conference Paper

Jan 2003

E.L. Andrade

This paper proposes using a novel region space technique to track sport persons for the purpose of extracting their shirt numbers and use this to provide augmented information to the viewer. The region adjacency graph and picture trees are used to perform a search for an object using prior knowledge from a scene description. Once the candidate object has been extracted the sub-space is examined for alphanumeric characters, which are then characterized by optical character recognition. Rogue candidates may be removed based on the recognition histograms with improved robustness using temporal analysis. The recognized sport person is accentuated using graphical overlays from a database.

Reading Digits in Natural Images with Unsupervised Feature Learning

Article

Jan 2011

Detecting and reading text from natural images is a hard computer vision task that is central to a variety of emerging applications. Related problems like document character recognition have been widely studied by computer vision and machine learning researchers and are virtually solved for practical applications like reading handwritten digits. Reliably recognizing characters in more complex scenes like photographs, however, is far more difficult: the best existing methods lag well behind human performance on the same tasks. In this paper we attack the problem of recognizing digits in a real application using unsupervised feature learning methods: reading house numbers from street level photos. To this end, we introduce a new benchmark dataset for research use containing over 600,000 labeled digits cropped from Street View images. We then demonstrate the difficulty of recognizing these digits when the problem is approached with hand-designed features. Finally, we employ variants of two recently proposed unsupervised feature learning methods and find that they are convincingly superior on our benchmarks.

Learning Multiple Layers of Features from Tiny Images

Article

May 2012

Alex Krizhevsky

April 8, 2009. Groups at MIT and NYU have collected a dataset of millions of tiny colour images from the web. It is, in principle, an excellent dataset for unsupervised training of deep generative models, but previous researchers who have tried this have found it difficult to learn a good set of filters from the images. We show how to train a multi-layer generative model that learns to extract meaningful features which resemble those found in the human visual cortex. Using a novel parallelization algorithm to distribute the work among multiple machines connected on a network, we show how training such a model can be done in reasonable time. A second problematic aspect of the tiny images dataset is that there are no reliable class labels which makes it hard to use for object recognition experiments. We created two sets of reliable labels. The CIFAR-10 set has 6000 examples of each of 10 classes and the CIFAR-100 set has 600 examples of each of 100 non-overlapping classes. Using these labels, we show that object recognition is significantly

The MNIST database of handwritten digits

Article
