TORSO-21 Dataset:
Typical Objects in RoboCup Soccer 2021
Marc Bestmann, Timon Engelke, Niklas Fiedler, Jasper Güldenstein,
Jan Gutsche, Jonas Hagge, and Florian Vahl
(all authors contributed equally)
Hamburg Bit-Bots, Department of Informatics, Universität Hamburg,
Vogt-Kölln-Straße 30, 22527 Hamburg, Germany
{bestmann, 7engelke, 5fiedler, 5guelden, 7gutsche, 5hagge,
7vahl}@informatik.uni-hamburg.de
https://robocup.informatik.uni-hamburg.de
Abstract. We present a dataset specifically designed to be used as a
benchmark to compare vision systems in the RoboCup Humanoid Soc-
cer domain. The dataset is composed of a collection of images taken in
various real-world locations as well as a collection of simulated images. It
enables comparing vision approaches with a meaningful and expressive
metric. The contributions of this paper consist of providing a compre-
hensive and annotated dataset, an overview of the recent approaches to
vision in RoboCup, methods to generate vision training data in a simu-
lated environment, and an approach to increase the variety of a dataset
by automatically selecting a diverse set of images from a larger pool. Ad-
ditionally, we provide a baseline of YOLOv4 and YOLOv4-tiny on this
dataset.
Keywords: Computer Vision · Vision Dataset · Deep Learning.
1 Introduction
In recent years, similar to other domains, the approaches for computer vision
in the RoboCup soccer domain have moved almost completely to deep learning based
methods [2]. Still, a quantitative comparison between the different approaches
is difficult, as most approaches are evaluated on custom-made datasets.
The reported performance is therefore not only related to
the detection quality but also to the specific challenge posed by the dataset used.
In particular, if the images stem from only a single location or lack natural light,
they can hardly be an indicator of actual performance in a competition.
Outside of the RoboCup domain, this problem is addressed by creating stan-
dardized datasets for various challenges in computer vision [7,9,20,32]. These
datasets are used as a benchmark when comparing existing approaches with
each other, allowing a quantitative evaluation (e. g. [4]). Tsipras et al. investi-
gated how well results of evaluations with the ImageNet dataset [7] reflect the
performance of approaches in their actual tasks [31]. They observed that in some
cases, the scores achieved in the ImageNet challenge poorly reflect real-world ca-
pabilities. In the RoboCup domain, participants are challenged with restricted
hardware capabilities resulting in computer vision approaches specifically de-
signed for the given environment (e. g. [28]). Thus, existing datasets are even
less applicable to evaluate the vision pipelines designed for RoboCup soccer.
We propose a standardized dataset for the RoboCup Humanoid Soccer do-
main consisting of images of the Humanoid League (HSL) as well as the Standard
Platform League (SPL). We provide two image collections. The first one con-
sists of images from various real-world locations, recorded by different robots. It
includes annotations for the ball, goalposts, robots, lines, field edge, and three
types of line intersections. The second collection is generated in the Webots
simulator [22] which is used for the official RoboCup Virtual Humanoid Soc-
cer Competition1. Additionally to the labels of the first collection, labels for
the complete goal, crossbar, segmentation images for all classes, depth images,
6D poses for all labels, as well as the camera location in the field of play, are
provided. For both collections, we give a baseline using YOLOv4 [4].
Most of the existing popular image datasets are only designed to compare
image classification approaches. In RoboCup Soccer, object localization, as well
as segmentation, are also commonly used (see Table 1).
While the creation and sharing of datasets were already facilitated by the
ImageTagger platform [13], it did not help increase the comparability of vision
pipelines since teams use different parts of the available images. Furthermore,
many teams published the datasets that they used in their publications (see
Table 1). Still, none of these papers have compared their work directly to others.
While this lack of reuse of existing datasets could simply result from teams not
knowing about their existence, since they are often only mentioned briefly as
a side note in the publications, this explanation is improbable. In our case, we chose
to create a new dataset for our latest vision pipeline publication [14] because the
other datasets did not include the required object classes. Another issue is a lack
of variety in some sets, e. g. only including the NAO robot or being recorded in
just one location. Furthermore, the label type of the dataset may also limit its
uses, e. g. a classification set is not usable for bounding box based approaches.
The remainder of this paper is structured as follows: Our methods of image
collection and annotation are presented in Section 2 and Section 3 respectively.
We evaluate and discuss the proposed dataset in Section 4 followed by a conclu-
sion of our work in Section 5.
2 Image Collection
The dataset presented in this work is composed of images recorded in the
real world as well as in simulation using the Webots simulator. In the following,
we describe the methods of image collection and also our method to reduce the
number of similar images for greater variety in the dataset.
1https://humanoid.robocup.org/hl-2021/v-hsc/ (last accessed: 2021/06/14)
Table 1. Comparison of approaches to vision in RoboCup Humanoid Soccer leagues.
Detection types are abbreviated as follows: classification (C), bounding box (B), seg-
mentation (S), keypoints (K). Detection classes are abbreviated as follows: ball (B),
goal (G), goalpost (P), field (F), robot (R), obstacles (O), lines (L), line intersec-
tions (I). The ◦ sign means the data is publicly available, but the specific dataset is
not specified. (X) means it is partially publicly available. The sources are as follows:
ImageTagger (IT), SPQR NAO image dataset (N), self created (S), not specified (?).
The Locations are competition (C), lab (L), and not specified (?).
Year | Paper | League     | Detection Type | Classes         | # Images                  | Synthetic | Source           | Public | Location
-----|-------|------------|----------------|-----------------|---------------------------|-----------|------------------|--------|---------
2016 | [1]   | SPL        | C              | B,G,R           | 6,843                     | ×         | N                | X      | ?
2016 | [10]  | HSL        | B,C,K          | R               | 1,500                     | ×         | ?                | ?      | ?
2016 | [25]  | HSL-K      | S              | B               | 1,160                     | ×         | S                | ×      | ?
2017 | [24]  | HSL-A      | K              | B,I,P,O,R       | 2,400                     | ×         | S, YouTube       | ×      | C,L
2017 | [6]   | SPL        | C              | R               | 6,843                     | ×         | N                | X      | ?
2017 | [17]  | SPL        | C              | B,F,P,R         | 100,000                   | X         | S                |        | ?
2017 | [21]  | SPL        | C              | B               | 16,000                    | ×         | S                | ×      | ?
2017 | [18]  | HSL        | S              | R               | 4,000                     | ×         | S, N             | X      | ?
2018 | [27]  | SPL        | S              | B,R,P,L         | syn: 5,000, real: 570     | (X)       | S                | ×      | C,L
2018 | [12]  | SPL        | C              | B               | 40,756                    | ×         | S                | ×      | C,L
2018 | [15]  | HSL-T      | B,S            | B,F             | ?                         | ×         | S                | ?      | ?
2018 | [8]   | HSL-K      | S              | B               | 1,000                     | ×         | IT               | X      | C
2018 | [11]  | HSL-A      | K              | B,P,R           | 3,000                     | ×         | (IT)             | (◦)    | ?
2018 | [26]  | HSL-K      | S              | B               | 35,327                    | ×         | IT               | X      | C,L
2019 | [19]  | HSL-A      | K              | B               | 4,562                     | ×         | S                | X      | C,L
2019 | [30]  | HSL-K      | B              | B               | 1,000                     | ×         | S                | ×      | C,L
2019 | [16]  | HSL-K      | C              | B,P             | ?                         | ×         | Rhoban Tagger    |        | C
2019 | [23]  | SPL        | B              | R               | syn: 28,000, real: 7,000  | (X)       | IT, SimRobot     | X      | C,L
2019 | [3]   | HSL-K      | B              | B,P             | 1,423                     | ×         | S, IT            | X      | C
2019 | [14]  | HSL-K      | B,S            | B,F,L,O,P,R     | ?                         | ×         | IT               | X      | C
2019 | [28]  | SPL        | B              | B,I,P,R         | syn: 6,250, real: 710     | (X)       | Unreal Engine, S | X      | C,L
2021 | [5]   | SPL        | B              | B,R,I           | ?                         | (X)       | N, S             | X      | ?
2021 | [29]  | SPL        | B,S            | B,P,R,L,I       | syn: 6,250, real: 3,000   | (X)       | Unreal Engine, S | X      | C,L
2021 | Ours  | HSL-K, SPL | B,S,K          | B,P,F,R,L,I,(G) | syn: 10,000, real: 10,464 | (X)       | IT, Webots       | X      | C,L
2.1 Reality
To create a diverse dataset, we collected images from multiple sources. First, from
our own recordings during different RoboCup competitions and in our lab. Second, we
investigated the data other teams uploaded publicly to the ImageTagger. Finally,
we asked other teams to provide images, especially from additional locations and
for situations that were not already represented in the existing images. While
this provided a large set of images, most of them had to be excluded to prevent
biasing the final dataset. First, the number of images from the SPL was limited,
as these only include the NAO robot, which could have easily led to an over-
representation of the robot model. Many imagesets were excluded to limit one
of the following biases: the ball is always in the image center, the camera is
stationary, there are no robots on the field, other robots are not moving, or
the camera always points onto the field. Generally, the selection focus was on
including images that were recorded by a robot on a field, rather than images
that were recorded by humans from the side of the field. Using images recorded
by different teams is also crucial to include different camera and lens types,
locations, and perspectives.
2.2 Simulation
As the RoboCup world championship in the HSL is held virtually in 2021, we
deemed a data collection recorded in a simulated environment necessary. Diverse
data can be generated as required. We chose the Webots simulator because
it is used for the official competition. The official KidSize environment of the
competition including the background, ball, goals, and turf as well as lighting
conditions was used. During data generation, we used six robot models (including
our own) which were published for this year’s competition in the league.
Four setups are used per matchup of these robots. These vary by the robot
team marker color and the team from whose perspective the image is captured.
Scenes were generated in four scenarios per setup. In the first one, images are
taken at camera positions uniformly distributed over the field. To prevent a bias
of always having the ball included, we created a second scenario without a ball,
but the same distribution of camera positions. Similar to the previous two, we
also include two scenarios with the camera position normally distributed around
a target on the field with and without a ball present. These last two scenarios
imitate a robot that is contesting the ball.
We generated 100 images for each of the presented scenarios, resulting in a
total of (6 choose 2) · 2 · 2 · 4 · 100 = 24,000 images. The data is split into an 85%
training set and a 15% test set.
For each image, a new scene is generated. The scenes are set up randomly by
first sampling a target position. The ball is placed at the target position or out
of sight depending on the scenario. Then the field players are placed by sampling
from a normal distribution around the target position, since multiple robots often
stand grouped together. To prevent robots from standing inside of each
other, we resample a robot's position in the case of a collision. The heading of
each robot is sampled from a normal distribution around facing the ball for the
images where the ball is at the target position and from a uniform distribution
otherwise. We assume each team to have a goalie, which stands at a random
position on the goal line with its heading sampled from a normal distribution
whose mean corresponds to looking towards the field. The postures of the
robots with our own robot model are each sampled from a set of 260 postures.
These were recorded while the robot was performing one of six typical actions.
The sampling is weighted by the estimated probability of an action occurring in
a game (walking: 50%, standing: 20%, kicking: 10%, standup: 10%, falling: 5%,
fallen: 5%). We chose these samplings to cover the majority of situations a robot
could realistically face in a match.
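The following Python sketch illustrates this scene-sampling logic. The function names, field dimensions, distribution widths, and collision radius are illustrative assumptions rather than the exact values used by our generation scripts, and the goalie placement is omitted for brevity.

```python
import random
import numpy as np

# Action probabilities from the text above; field size and collision radius are assumptions.
ACTION_WEIGHTS = {"walking": 0.5, "standing": 0.2, "kicking": 0.1,
                  "standup": 0.1, "falling": 0.05, "fallen": 0.05}
FIELD_X, FIELD_Y = 9.0, 6.0   # assumed KidSize field dimensions in meters
MIN_ROBOT_DISTANCE = 0.3      # assumed collision radius in meters

def sample_scene(with_ball, num_players=4, pos_sigma=1.5):
    """Sample one synthetic scene: target position, ball, and grouped field players."""
    target = np.random.uniform([-FIELD_X / 2, -FIELD_Y / 2], [FIELD_X / 2, FIELD_Y / 2])
    ball = target if with_ball else None  # ball at the target position or out of sight

    robots = []
    while len(robots) < num_players:
        pos = np.random.normal(target, pos_sigma, size=2)
        # resample if the robot would collide with an already placed one
        if any(np.linalg.norm(pos - r["pos"]) < MIN_ROBOT_DISTANCE for r in robots):
            continue
        if with_ball:
            # heading normally distributed around facing the ball
            heading = np.arctan2(*(target - pos)[::-1]) + np.random.normal(0.0, 0.3)
        else:
            heading = np.random.uniform(-np.pi, np.pi)
        action = random.choices(list(ACTION_WEIGHTS), weights=list(ACTION_WEIGHTS.values()))[0]
        robots.append({"pos": pos, "heading": heading, "action": action})
    # goalie placement on the goal line is omitted in this sketch
    return {"target": target, "ball": ball, "robots": robots}
```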
The camera position is sampled in the field either from a uniform distribu-
tion or from a normal distribution around the target position depending on the
scenario. The camera floats freely in space instead of being mounted on a robot.
We chose to do this to be able to simulate various robot sizes. Thus, edge cases,
such as the robot looking at its own shoulder, are not included in this collection. The
camera is generally oriented towards the target position. To avoid a bias towards
the ball being in the center of the image, the camera orientation is offset so that the
target position is evenly distributed in the image space.
Since the robot sizes (and thereby camera heights) and fields of view (FOVs)
are very different within the league, we decided to also model this in the dataset. We
collected these parameters from the robot specifications from the last RoboCup
competition. On this basis, we calculated the mean FOV and height as well as their
standard deviations (FOV: µ ≈ 89°, σ = 28.1°; height: µ = 0.64 m, σ = 0.12 m)
for the HSL-KidSize. Based on this, for each image, we sample an FOV and
a camera height from a normal distribution with the mean and
standard deviation of the league. If the sampled FOV or height is smaller or larger
than the extreme values used by any team (FOV: 60° to 180°, height: 0.45 m to 0.95 m),
we resample for a new value.
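A minimal sketch of this per-image sampling with rejection of out-of-range values is given below; the function name is ours, and the mean FOV is approximated as 89° since its decimal is truncated in the extracted text above.

```python
import numpy as np

# League statistics and extremes reported above (HSL-KidSize); the mean FOV is approximate.
FOV_MEAN, FOV_STD = 89.0, 28.1        # degrees
HEIGHT_MEAN, HEIGHT_STD = 0.64, 0.12  # meters
FOV_RANGE = (60.0, 180.0)
HEIGHT_RANGE = (0.45, 0.95)

def sample_truncated_normal(mean, std, low, high):
    """Resample from a normal distribution until the value lies within the league extremes."""
    while True:
        value = np.random.normal(mean, std)
        if low <= value <= high:
            return value

fov = sample_truncated_normal(FOV_MEAN, FOV_STD, *FOV_RANGE)
camera_height = sample_truncated_normal(HEIGHT_MEAN, HEIGHT_STD, *HEIGHT_RANGE)
```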
2.3 Avoiding Similarity
How well a dataset represents a domain is not only related to its size but also to the
diversity of its images. In the Pascal VOC dataset [9], special attention
was put on removing exact and near-duplicate images from the set of images to
reduce redundancy and unnecessary labeling work. This approach worked well
on their raw data taken from a photo-sharing website. However, the RoboCup
domain poses additional challenges, as the images are typically recorded in a
sequence. Therefore, most images are similar to the one before, since the robots
only move slowly or are even standing (especially prevalent with goalkeepers).
While a naive approach of taking every nth image can address this problem, it
can also remove short events which rarely occur in the dataset. Additionally, the
robots typically have some version of capture bias such as continuously tracking
the ball or looking back to positions where they expect objects to be. Finally,
the position of the robot on the field is not evenly distributed. Positions like the
goal area, the center circle, and the sides of the field where the robots walk in
are more commonly represented than for example the corners.
To avoid these issues, we used unsupervised machine learning to train a
variational autoencoder. It was trained on a dataset consisting of low-resolution
images (128 × 112 pixels) from the various imagesets we decided to include. The
autoencoder has 3,416,987 trainable parameters and is based upon the conv-vae2
GitHub repository using the original model architecture. We trained it to
represent the images of this domain in a 300-dimensional latent space. To prune
2https://github.com/noctrog/conv-vae (last accessed: 2021/06/14)
[Figure 1, left panel: histogram of the reconstruction error (x-axis) against the number of images (y-axis), split into successful and failed reconstructions.]
Fig. 1. Distribution of the reconstruction error from the variational autoencoder on
the unfiltered dataset (left) and exemplary images with distance to a reference point
in latent space (right). D describes the Euclidean distance in the latent space of the
N-th distant neighbor of the reference image.
similar images, we used this latent space representation to remove images in
close proximity to a given image. Neighbors within a given Euclidean distance
were determined using a k-d tree. During the greedy sampling process, we start
with the set E containing all the unfiltered images and a k-d tree representing
the latent space relations. An image is randomly selected from E and all its
close neighbors, including the image itself, are removed from E, while the sampled
image itself is added to our filtered set O. We repeat this process until E is
empty and O contains our filtered imageset. This algorithm is based on the
assumption that the variational autoencoder can represent a given image in its
latent space. This may not be the case for edge cases. Therefore, we check the
reconstruction performance of the autoencoder on a given image by comparing
the original image against the decoder output and calculating the mean squared
error between them. Outliers with an error of more than 1.64 σ (which
corresponds to 10% of the dataset) are added to O regardless of their latent space
distance to other images. The error distribution is shown in Figure 1. Since a high
error implies that a situation is not represented in our existing dataset frequently
enough to be encoded into the latent space, such an image is assumed to be sufficiently
distinct from the other images in the set. To filter our real-world dataset, we used
44,366 images as input to this selection algorithm and reduced it to 10,464 images.
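The following sketch outlines this greedy filtering, assuming the latent vectors and per-image reconstruction errors have already been computed with the autoencoder. It uses SciPy's k-d tree; all names are ours, and the actual filtering script in the dataset repository may differ.

```python
import numpy as np
from scipy.spatial import cKDTree

def filter_similar(latents, recon_errors, radius, error_threshold):
    """Greedy diversity filtering in latent space.

    latents:         (N, 300) latent vectors from the variational autoencoder
    recon_errors:    (N,) mean squared reconstruction error per image
    radius:          Euclidean latent distance below which images count as near-duplicates
    error_threshold: errors above this (e.g. the 1.64-sigma quantile) are always kept
    """
    tree = cKDTree(latents)
    keep = set(np.flatnonzero(recon_errors > error_threshold))  # poorly reconstructed outliers
    remaining = set(range(len(latents))) - keep                 # the set E of unfiltered images
    rng = np.random.default_rng()
    while remaining:
        idx = rng.choice(list(remaining))                # pick a random image from E
        neighbors = tree.query_ball_point(latents[idx], r=radius)
        remaining -= set(neighbors) | {idx}              # drop it and all close neighbors from E
        keep.add(idx)                                    # keep the sampled image itself in O
    return sorted(keep)
```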
3 Image Annotation
In the following, we define the label types we used in the dataset. Additionally, we
explain the process of image annotation for both the images gathered in the real
world and in simulation. We provide labels for the classes ball, robot, goalpost,
field area, lines, T-, L-, and X-line intersections. Only features that are relevant
for a robot in a RoboCup Soccer game were labeled. Thus, no balls or robots
outside of the current field and no other fields are labeled. Parts of the recording
robot, e. g. its feet, are not labeled. Additionally, each label might be marked
as concealed or blurred. Concealed means that the object is partially covered
by another object that is in front of it. Labels of objects that are truncated
because they are located at the border of the image are not marked as concealed.
The exception to this are the line crossings; they are marked as concealed if they are not
entirely visible. A blurred annotation is affected by either motion or camera blur,
resulting in a significantly changed appearance of the object, e. g. a ball might
appear oval rather than circular. A concealed or blurred object is significantly
harder to detect. For example, this information could be used in the calculation
of the loss function to specifically focus on also detecting blurred and concealed
objects. It could also be used to focus on them less since a team might have no
issues with motion blur because they use a different camera setup.
To avoid ambiguities, we define each object class in detail:
Ball: The ball is represented as a bounding box. It is possible to compute a near
pixel-precise ellipse from the bounding box [26]. In some images, multiple balls
are present on the field of play. We label all of them even though this would not
occur in a regular game.
Robot: We define robot labels as a bounding box. Unlike for the ball, an accurate
shape of the robot cannot easily be derived from just a bounding box,
because the form of a robot is much harder to define. However, to keep labeling
feasible, we compromise by using bounding boxes.
Goalpost: The label for goalposts is a four-point polygon. This allows tilted
goalposts to be described accurately. Because the polygon encompasses the goalpost
tightly, this method allows the computation of a segmentation image, the middle
point, and the lowest point, which is required for the projection of the goalpost
position from image space into Cartesian space. Only the goalposts on the goal
line are labeled, excluding other parts of the goal.
Field Area: The field area is relevant as everything outside of the field provides
no useful information for the robot. We define it with a series of connected lines
whose ends are connected to the right and left borders of the image, assuming the
end of the field area is visible there. A segmentation is computed from the area
between the lines and the bottom of the image (a sketch of this computation is given
after the class definitions below).
Lines: We offer a segmentation image for lines as the ground truth because there
is no other option to annotate lines with sufficient precision, as their width in
image space is highly variable.
Field Features: We define the T-Intersections, L-Intersections, and X-Intersections
(including penalty mark and center point) of lines as field features. For this fea-
ture, we only define a single point in the center of the intersection.
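As referenced in the field area definition above, the following sketch closes the annotated boundary polyline along the bottom image corners and fills it to obtain the segmentation. The function name and point format are our own assumptions, not the exact implementation used for the dataset.

```python
import cv2
import numpy as np

def field_area_mask(image_shape, boundary_points):
    """Field segmentation from the annotated boundary polyline.

    boundary_points: (x, y) points ordered from the left to the right image border.
    The polygon is closed along the bottom image corners and filled.
    """
    height, width = image_shape
    polygon = np.array(list(boundary_points) + [(width - 1, height - 1), (0, height - 1)],
                       dtype=np.int32)
    mask = np.zeros((height, width), dtype=np.uint8)
    cv2.fillPoly(mask, [polygon], 255)
    return mask
```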
To create labels for the real-world images, we used the ImageTagger. It pro-
vides all the necessary labeling types we used, other than for the annotation of
lines. For these, we created a specialized tool. First, it allows the user to specify
smoothing and adaptive threshold parameters for a given image. Based on this,
a proposed segmentation mask is generated which can then be corrected manu-
ally. In the last step, all balls, robots, goalposts, and the area above the field are
excluded from the line segmentation using the existing labels for these classes.
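The sketch below illustrates how such a line segmentation proposal can be generated with smoothing and adaptive thresholding; the concrete OpenCV parameters and helper names are illustrative assumptions, and the proposal is still corrected manually in the actual tool.

```python
import cv2

def propose_line_mask(image_bgr, blur_kernel=5, block_size=31, c=-10):
    """Initial line segmentation proposal via smoothing and adaptive thresholding."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    smoothed = cv2.GaussianBlur(gray, (blur_kernel, blur_kernel), 0)
    # keep pixels that are brighter than their local neighbourhood (white field lines)
    return cv2.adaptiveThreshold(smoothed, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                 cv2.THRESH_BINARY, block_size, c)

def remove_other_labels(mask, boxes, field_mask):
    """Blank out balls, robots, and goalposts (bounding boxes) and everything outside the field area."""
    for x, y, w, h in boxes:
        mask[y:y + h, x:x + w] = 0
    mask[field_mask == 0] = 0
    return mask
```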
In general, features are only labeled when they are detectable by a human con-
sidering only the respective image (i. e. no context information from a previous
image should be necessary). Some imagesets were already labeled for some of
the classes. The rest of the labels were created by us. Additionally, we manually
verified all labels in the set.
One of the main advantages of training data generation in simulated environ-
ments is the availability of ground truth data. Webots offers the functionality
to generate bounding boxes and segmentation images for the objects present
in a scene. Each individual object has a distinct color in the segmentation im-
age. Furthermore, we provide bounding box annotations for the classes ball and
robot, four-point polygons for goalposts and the goal top bar, and a single image
coordinate for T-, L-, and X-intersections. Since the bounding boxes provided
by Webots were inaccurate in some cases, we computed the minimum enclosing
rectangle from the segmentation images. The bounding boxes for robots and the
ball are calculated from this. For goalposts, we used rotated bounding boxes
to account for the fact that they may be tilted, yielding annotations similar to the
four-point polygons used for manual annotation of real-world images. Line intersec-
tions were annotated by projecting their known positions into the image space
of the camera. To detect whether they are visible, we verify in the segmentation
image that lines are visible close to the intersection point. If the intersection is
occluded, we still include it in the annotations, but mark it as “not in image”.
The remaining classes (i.e. goal, field area, and lines) are only provided in the
segmentation images.
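A sketch of how such annotations can be derived from the per-object segmentation colors is given below. The helper names, the color handling, and the visibility radius are assumptions, not the exact scripts used to generate the dataset.

```python
import cv2
import numpy as np

def box_from_segmentation(seg_image, object_color):
    """Axis-aligned bounding box (x, y, w, h) of all pixels with the given object color."""
    mask = cv2.inRange(seg_image, np.array(object_color), np.array(object_color))
    points = cv2.findNonZero(mask)
    if points is None:
        return None  # object not visible in this image
    return cv2.boundingRect(points)

def rotated_box_from_segmentation(seg_image, object_color):
    """Rotated rectangle as a four-point polygon, e.g. for tilted goalposts."""
    mask = cv2.inRange(seg_image, np.array(object_color), np.array(object_color))
    points = cv2.findNonZero(mask)
    if points is None:
        return None
    rect = cv2.minAreaRect(points)          # ((cx, cy), (w, h), angle)
    return cv2.boxPoints(rect).astype(int)  # four corner points

def intersection_visible(seg_image, point, line_color, radius=10):
    """Check whether line pixels are present close to a projected intersection point."""
    x, y = int(point[0]), int(point[1])
    patch = seg_image[max(0, y - radius):y + radius, max(0, x - radius):x + radius]
    mask = cv2.inRange(patch, np.array(line_color), np.array(line_color))
    return cv2.countNonZero(mask) > 0
```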
4 Evaluation
To evaluate the dataset, we performed a statistical analysis of the data and assessed
the performance of YOLOv4 on it. We focus our evaluation on the real-world
images since the images from simulation were generated as described in Sec-
tion 2.2. The real-world dataset contains 10,464 images and 101,432 annotations
in eight different classes. In Table 2 we showcase metadata about the images and
annotations present in the collection. Figure 2 shows exemplary annotations on
Table 2. Detailed distribution of annotations (left) and general statistics (right) of
the real-world collection.

Annotation Type | Total   | Annotations per Image (0 / 1 / 2 / 3 / 4 / 5+)
Ball            | 10,959  | 46.6% / 49.6% / 2.9% / 0.9% / 0.0% / 0.0%
Goalpost        | 12,780  | 46.8% / 31.4% / 21.6% / 0.3% / 0.0% / 0.0%
Robot           | 14,383  | 64.2% / 18.6% / 7.6% / 4.3% / 2.1% / 3.1%
L-Intersection  | 15,458  | 48.6% / 19.3% / 22.4% / 5.6% / 3.1% / 1.1%
T-Intersection  | 13,479  | 46.1% / 35.1% / 12.7% / 3.5% / 1.8% / 0.9%
X-Intersection  | 13,445  | 59.0% / 24.0% / 9.6% / 4.2% / 2.3% / 0.9%
Field Area      | 10,464  | segmentation mask
Lines           | 10,464  | segmentation mask
Sum             | 101,432 |

General statistics: Ball Types: 6; Locations: 12; Camera Types: 5; On Robot: 76%;
During Game: 59%; Natural Light: 38%; League: SPL 25%, HSL 75%;
Perspective: Field 93%, Goalie 4%, From Outside 3%.
Fig. 2. Examples from the dataset. The first row shows images from the real-world collection;
the second row shows one image of the simulation collection with the corresponding
segmentation and depth images.
Fig. 3. Visualization of the position density of the respective annotations (ball, goalpost,
robot, field area, T-intersection, L-intersection, X-intersection, and lines) in the image
space over all images of the real-world collection.
images from the dataset. Also, we investigated the positions of annotations in
image space. This was done by plotting the heatmaps shown in Figure 3. Many
of the patterns evident in the heatmaps are caused by the typical positions of
the robots on the field and, especially prominently, by their head behavior, as they
are often programmed to look directly at the ball.
Based on metrics used in related work, we decided to present detection re-
sults using the mean average precision (mAP) and intersection over union (IoU)
metrics. The IoU metric compares how well pixels of the ground truth and the
detection overlap. Since the ball is round, but the labels are rectangular, we com-
puted an ellipse in the bounding box as the ground truth for the IoU. Similarly,
the line intersections are labeled as a single coordinate, but since the intersec-
tion itself is larger, we assumed a bounding box with a height and width of 5% of
the image height and width, respectively. In the case of a true negative, we set the
value of the IoU to 1. With the IoU, pixel-precise detection methods can achieve
Table 3. Mean average precision of YOLOv4 and YOLOv4-tiny on this dataset. For
the intersections, we used a bounding box of 5% of the image size. The mAP values for
the goalpost and crossbar are calculated from a bounding box that fully encompasses
the polygon. The reported values are the IoU, the mAP with an IoU threshold of 50%,
and the mAP with an IoU threshold of 75%. The floating point operations (FLOPs)
required by YOLOv4 and YOLOv4-tiny per sample are 127 billion and 6.79 billion, respectively.
Environment | Approach    | Metric   | Ball  | Goalpost | Robot | T-Int. | L-Int. | X-Int. | Crossbar
------------|-------------|----------|-------|----------|-------|--------|--------|--------|---------
Real World  | YOLOv4 [4]  | IoU      | 91.1% | 70.0%    | 91.7% | 77.3%  | 79.2%  | 83.6%  | -
Real World  | YOLOv4 [4]  | mAP(50%) | 98.8% | 91.9%    | 96.0% | 95.1%  | 94.4%  | 93.5%  | -
Real World  | YOLOv4 [4]  | mAP(75%) | 89.7% | 54.9%    | 72.7% | 23.6%  | 23.8%  | 23.1%  | -
Real World  | YOLOv4-tiny | IoU      | 89.2% | 69.9%    | 89.3% | 75.5%  | 75.8%  | 82.2%  | -
Real World  | YOLOv4-tiny | mAP(50%) | 97.5% | 89.6%    | 91.4% | 89.8%  | 88.8%  | 92.6%  | -
Real World  | YOLOv4-tiny | mAP(75%) | 80.0% | 42.9%    | 47.7% | 43.3%  | 39.7%  | 38.9%  | -
Simulation  | YOLOv4      | IoU      | 88.5% | 51.2%    | 87.2% | 70.5%  | 69.3%  | 78.9%  | 58.1%
Simulation  | YOLOv4      | mAP(50%) | 92.1% | 94.2%    | 93.7% | 97.9%  | 97.2%  | 98.6%  | 89.5%
Simulation  | YOLOv4      | mAP(75%) | 84.6% | 76.4%    | 82.7% | 86.7%  | 87.1%  | 91.1%  | 66.2%
Simulation  | YOLOv4-tiny | IoU      | 85.1% | 51.4%    | 82.5% | 63.1%  | 60.9%  | 73.2%  | 58.9%
Simulation  | YOLOv4-tiny | mAP(50%) | 80.4% | 91.0%    | 83.5% | 89.8%  | 85.8%  | 91.7%  | 91.2%
Simulation  | YOLOv4-tiny | mAP(75%) | 59.5% | 64.8%    | 57.8% | 55.4%  | 50.6%  | 63.4%  | 59.2%
higher scores than bounding box based approaches. The mAP metric classifies a
detection as true positive if the ground truth and predicted bounding box have
an IoU of at least e. g. 75%. It also represents how many of the individual objects
were correctly found, especially when pixel-precise detection is less important.
We present exemplary results of a YOLOv4 on the dataset in Table 3.
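For reference, the sketch below shows how the IoU variants described above can be computed with binary masks (ellipse inscribed in the ball box as ground truth, 5% boxes for intersections, IoU of 1 for true negatives). The helper names are ours, and the actual evaluation scripts in the dataset repository may differ.

```python
import numpy as np

def box_mask(shape, box):
    """Binary mask of an axis-aligned box given as (x, y, w, h)."""
    mask = np.zeros(shape, dtype=bool)
    x, y, w, h = box
    mask[max(0, y):y + h, max(0, x):x + w] = True
    return mask

def ellipse_mask(shape, box):
    """Binary mask of the ellipse inscribed in a bounding box (ball ground truth)."""
    x, y, w, h = box
    cx, cy = x + w / 2.0, y + h / 2.0
    yy, xx = np.mgrid[:shape[0], :shape[1]]
    return ((xx - cx) / (w / 2.0)) ** 2 + ((yy - cy) / (h / 2.0)) ** 2 <= 1.0

def iou(mask_a, mask_b):
    """Intersection over union of two binary masks; defined as 1.0 if both are empty."""
    union = np.logical_or(mask_a, mask_b).sum()
    if union == 0:
        return 1.0  # true negative
    return np.logical_and(mask_a, mask_b).sum() / union

def intersection_box(point, image_shape, fraction=0.05):
    """Assumed box around a labeled line intersection point (5% of image height/width)."""
    h, w = image_shape
    bw, bh = int(w * fraction), int(h * fraction)
    return (int(point[0] - bw / 2), int(point[1] - bh / 2), bw, bh)
```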
We would like to note that the dataset does not include images of the HSL
AdultSize league. This is caused by the lack of available images and robot models
for simulation. However, we expect the dataset to still be usable as a benchmark,
as the HSL KidSize and AdultSize leagues are visually very similar from a robot’s
perspective.
5 Conclusion
Efforts to share training data between teams have eased the transition to machine
learning based approaches and were a good starting point for new teams. How-
ever, as we have shown in Table 1, many of the existing and new approaches were
hard to compare quantitatively to each other as there was no common bench-
mark available. This work closes this gap by providing a benchmark dataset that
is specific to the RoboCup Humanoid Soccer domain. Additional contributions
of this paper are a system for vision training data generation in a simulated en-
vironment and an approach to increase the variety of a dataset by automatically
selecting a diverse set of images from a larger pool.
The quality of the dataset is limited by the availability of images. Therefore,
we hope that more teams start recording images on their robots during games
and publish them so that future datasets can profit from this. Future datasets
could include image sequences to allow detection of robots’ actions, e.g. a kick,
and include images of outdoor fields with real grass. This dataset could also be
used as a qualification metric for future competitions.
The dataset and tools used to create it are available at
https://github.com/bit-bots/TORSO_21_dataset.
References
1. Albani, D., Youssef, A., Suriani, V., Nardi, D., Bloisi, D.D.: A deep learning ap-
proach for object recognition with nao soccer robots. In: Robot World Cup. pp.
392–403. Springer (2016)
2. Asada, M., von Stryk, O.: Scientific and technological challenges in robocup. An-
nual Review of Control, Robotics, and Autonomous Systems 3, 441–471 (2020)
3. Barry, D., Shah, M., Keijsers, M., Khan, H., Hopman, B.: Xyolo: A model for real-
time object detection in humanoid soccer on low-end hardware. In: International
Conference on Image and Vision Computing New Zealand (IVCNZ). pp. 1–6. IEEE
(2019)
4. Bochkovskiy, A., Wang, C.Y., Liao, H.Y.M.: Yolov4: Optimal speed and accuracy
of object detection. arXiv preprint arXiv:2004.10934 (2020)
5. Cruz, N., Leiva, F., Ruiz-del Solar, J.: Deep learning applied to humanoid soccer
robotics: playing without using any color information. Autonomous Robots pp.
1–16 (2021)
6. Cruz, N., Lobos-Tsunekawa, K., Ruiz-del Solar, J.: Using convolutional neural
networks in robots with limited computational resources: detecting nao robots
while playing soccer. In: Robot World Cup. pp. 19–30. Springer (2017)
7. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale
hierarchical image database. In: IEEE conference on computer vision and pattern
recognition. pp. 248–255. IEEE (2009)
8. van Dijk, S.G., Scheunemann, M.M.: Deep learning for semantic segmentation on
minimal hardware. In: Robot World Cup. pp. 349–361. Springer (2018)
9. Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal
visual object classes (voc) challenge. International journal of computer vision 88(2),
303–338 (2010)
10. Farazi, H., Behnke, S.: Real-time visual tracking and identification for a team
of homogeneous humanoid robots. In: Robot World Cup. pp. 230–242. Springer
(2016)
11. Farazi, H., Ficht, G., Allgeuer, P., Pavlichenko, D., Rodriguez, D., Brandenburger,
A., Hosseini, M., Behnke, S.: Nimbro robots winning robocup 2018 humanoid adult-
size soccer competitions. In: Robot World Cup. pp. 436–449. Springer (2018)
12. Felbinger, G.C., Göttsch, P., Loth, P., Peters, L., Wege, F.: Designing convolutional
neural networks using a genetic approach for ball detection. In: Robot World Cup.
pp. 150–161. Springer (2018)
13. Fiedler, N., Bestmann, M., Hendrich, N.: Imagetagger: An open source online plat-
form for collaborative image labeling. In: Robot World Cup: XXII. pp. 162–169.
Springer (2018)
14. Fiedler, N., Brandt, H., Gutsche, J., Vahl, F., Hagge, J., Bestmann, M.: An open
source vision pipeline approach for robocup humanoid soccer. In: Robot World
Cup: XXIII. pp. 376–386. Springer (2019)
15. Gabel, A., Heuer, T., Schiering, I., Gerndt, R.: Jetson, where is the ball? Using
neural networks for ball detection at robocup 2017. In: Robot World Cup. pp.
181–192. Springer (2018)
16. Gondry, L., Hofer, L., Laborde-Zubieta, P., Ly, O., Mathé, L., Passault, G., Pirrone,
A., Skuric, A.: Rhoban football club: Robocup humanoid kidsize 2019 champion
team paper. In: Robot World Cup. pp. 491–503. Springer (2019)
17. Hess, T., Mundt, M., Weis, T., Ramesh, V.: Large-scale stochastic scene generation
and semantic annotation for deep convolutional neural network training in the
robocup spl. In: Robot World Cup. pp. 33–44. Springer (2017)
18. Javadi, M., Azar, S.M., Azami, S., Ghidary, S.S., Sadeghnejad, S., Baltes, J.: Hu-
manoid robot detection using deep learning: A speed-accuracy tradeoff. In: Robot
World Cup. pp. 338–349. Springer (2017)
19. Kukleva, A., Khan, M.A., Farazi, H., Behnke, S.: Utilizing temporal information
in deep convolutional network for efficient soccer ball detection and tracking. In:
Robot World Cup. pp. 112–125. Springer (2019)
20. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P.,
Zitnick, C.L.: Microsoft coco: Common objects in context. In: European conference
on computer vision. pp. 740–755. Springer (2014)
21. Menashe, J., Kelle, J., Genter, K., Hanna, J., Liebman, E., Narvekar, S., Zhang,
R., Stone, P.: Fast and precise black and white ball detection for robocup soccer.
In: Robot World Cup. pp. 45–58. Springer (2017)
22. Michel, O.: Cyberbotics ltd. webots™: professional mobile robot simulation. Inter-
national Journal of Advanced Robotic Systems 1(1), 5 (2004)
23. Poppinga, B., Laue, T.: Jet-net: real-time object detection for mobile robots. In:
Robot World Cup. pp. 227–240. Springer (2019)
24. Schnekenburger, F., Scharffenberg, M., Wülker, M., Hochberg, U., Dorer, K.: De-
tection and localization of features on a soccer field with feedforward fully con-
volutional neural networks (fcnn) for the adult-size humanoid robot sweaty. In:
Proceedings of the 12th Workshop on Humanoid Soccer Robots, IEEE-RAS Inter-
national Conference on Humanoid Robots, Birmingham. sn (2017)
25. Speck, D., Barros, P., Weber, C., Wermter, S.: Ball localization for robocup soccer
using convolutional neural networks. In: Robot World Cup. pp. 19–30. Springer
(2016)
26. Speck, D., Bestmann, M., Barros, P.: Towards real-time ball localization using
cnns. In: Robot World Cup. pp. 337–348. Springer (2018)
27. Szemenyei, M., Estivill-Castro, V.: Real-time scene understanding using deep neu-
ral networks for robocup spl. In: Robot World Cup. pp. 96–108. Springer (2018)
28. Szemenyei, M., Estivill-Castro, V.: Robo: Robust, fully neural object detection for
robot soccer. In: Robot World Cup. pp. 309–322. Springer (2019)
29. Szemenyei, M., Estivill-Castro, V.: Fully neural object detection solutions for robot
soccer. Neural Computing and Applications pp. 1–14 (04 2021)
30. Teimouri, M., Delavaran, M.H., Rezaei, M.: A real-time ball detection approach
using convolutional neural networks. In: Robot World Cup. pp. 323–336. Springer
(2019)
31. Tsipras, D., Santurkar, S., Engstrom, L., Ilyas, A., Madry, A.: From imagenet
to image classification: Contextualizing progress on benchmarks. In: International
Conference on Machine Learning. pp. 9625–9635. PMLR (2020)
32. Xiao, H., Rasul, K., Vollgraf, R.: Fashion-mnist: a novel image dataset for bench-
marking machine learning algorithms. arXiv preprint arXiv:1708.07747 (2017)
Acknowledgments. Thanks to all individuals and teams that provided data and
labels or helped to develop and host the ImageTagger.
This research was partially funded by the Ministry of Science, Research and Equalities
of Hamburg as well as the German Research Foundation (DFG) and the National
Science Foundation of China (NSFC) in project Crossmodal Learning, TRR-169.