ISPRS Journal of Photogrammetry and Remote Sensing 182 (2021) 112–121
Received 26 May 2021; Received in revised form 4 October 2021; Accepted 4 October 2021; Available online 25 October 2021
0924-2716/© 2021 The Author(s). Published by Elsevier B.V. on behalf of International Society for Photogrammetry and Remote Sensing, Inc. (ISPRS). This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
DOI: https://doi.org/10.1016/j.isprsjprs.2021.10.002
Digital taxonomist: Identifying plant species in community scientists' photographs
Riccardo de Lutio a,*, Yihang She a, Stefano D'Aronco a, Stefania Russo a, Philipp Brun b, Jan D. Wegner a,c, Konrad Schindler a

a EcoVision Lab, Photogrammetry and Remote Sensing, ETH Zürich, Switzerland
b Land Change Science, Dynamic Macroecology, WSL, Switzerland
c Institute for Computational Science, University of Zurich, Switzerland
ARTICLE INFO

Keywords:
Species recognition
Community science
Hierarchical classification
Multimodal learning
ABSTRACT

Automatic identification of plant specimens from amateur photographs could improve species range maps, thus supporting ecosystems research as well as conservation efforts. However, classifying plant specimens based on image data alone is challenging: some species exhibit large variations in visual appearance, while at the same time different species are often visually similar; additionally, species observations follow a highly imbalanced, long-tailed distribution due to differences in abundance as well as observer biases. On the other hand, most species observations are accompanied by side information about the spatial, temporal and ecological context. Moreover, biological species are not an unordered list of classes but embedded in a hierarchical taxonomic structure. We propose a multimodal deep learning model that takes into account these additional cues in a unified framework. Our Digital Taxonomist is able to identify plant species in photographs better than a classifier trained on the image content alone; the performance gain is over 6 percentage points in terms of accuracy.
1. Introduction
Biodiversity describes the diversity of life in terms of species
numbers, similarity, abundance, and distribution across spatial scales
(Barrotta and Gronda, 2020; Gaston and Spicer, 2004). Biodiversity is
essential to human well-being but rapidly deteriorating worldwide in
response to anthropogenic pressure (Díaz et al., 2019). To effectively
conserve biodiversity, its spatio-temporal distribution needs to be well
understood, which requires efficient monitoring schemes. Scientific
surveys conducted at regional or country scales are, however, costly in
terms of time and financial resources, as highly skilled professionals
need to repeatedly examine extensive geographical areas and carefully
document the encountered species.
One viable way to complement professional biodiversity monitoring
is the community science approach. The community science paradigm
aims at involving the general public in scientific observations and in-
vestigations, and is particularly useful in cases where the experiment is
characterized by a large spatial and/or temporal scale (Silvertown,
2009). The community science approach has a long history in
biodiversity monitoring (Dickinson et al., 2010). For example, volun-
teers have participated in the annual Christmas Bird Counts of the Na-
tional Audubon Society in the USA since 1900 (Butcher and Niven,
2007).
With the rise of smartphones and other portable electronic devices,
community science in biodiversity monitoring has grown. Over the past
decade, a multitude of smartphone apps have been released, allowing
community scientists to conveniently report observations of plants and
animals, and to upload images to online databases. Among the most
popular of these apps is the iNaturalist (iNaturalist, 2021) initiative,
with over 3 million users and more than 36 million valid observations¹
distributed across the globe.
Although data gathered with community science is extremely valu-
able, it poses a number of challenges that need to be solved before it can
be exploited effectively. One major issue is data quality, i.e., it is
generally difcult to ensure that the collected data is correct and
consistent. The main reasons are that community science data (either in
the form of images or simple species presence observation) (i) are
collected by non-experts with varying training, expertise and skills, for
* Corresponding author.
E-mail address: riccardo.delutio@geod.baug.ethz.ch (R. de Lutio).
1
A valid observation is an observation that has a date, a location, media evidence (image or sound), and has not been voted captive/cultivated.
Contents lists available at ScienceDirect
ISPRS Journal of Photogrammetry and Remote Sensing
journal homepage: www.elsevier.com/locate/isprsjprs
https://doi.org/10.1016/j.isprsjprs.2021.10.002
Received 26 May 2021; Received in revised form 4 October 2021; Accepted 4 October 2021
ISPRS Journal of Photogrammetry and Remote Sensing 182 (2021) 112–121
113
instance, community scientists will on average not be able to name rare
species as well as specialists; (ii) often exhibit signicant biases due to
geographical variations in sampling effort, observation methods and
traditions, as well as regional differences in infrastructure and
accessibility.
In the context of biodiversity and species distribution mapping,
Machine Learning (ML) can provide several tools for mitigating at least
some of these limitations. For instance, species recognition for data
collected in the field can be automated to some extent to support the
community scientist. This can be done on-device to assist the user
during data collection, or in a second step to assist the experts in
verifying the user-supplied labels. In recent years, computer vision has
made great progress, mostly due to the rise of statistical ML. In fact, the
application that spearheaded this development was the classification of
image content into human-defined (semantic) categories (Deng et al.,
2009). It is thus natural to ask whether ML can also assist community
scientists to classify their photographs into taxonomic species, helping
them to correctly identify what they have observed; thus paving the way
towards more accurate and larger-scale species distribution maps. Visual
species recognition has been studied fairly extensively in recent years,
with different image sources ranging from carefully collected zoological
or botanical collections to uncontrolled outdoor and camera trap data
(Khosla et al., 2011; Welinder et al., 2010; Beery et al., 2020). In this
paper, we specically focus on the case of recognising plant species in
data collected via community science applications such as iNaturalist
(iNaturalist, 2021) or Info Flora (Info Flora, 2021). Properties that
distinguish this specific scenario from other image classification tasks
include: (i) Species observation numbers show an imbalanced distribu-
tion, as some species are naturally rare or harder to find and document
than others (and perhaps also less attractive to photograph), such that
they are rarely observed and only a few samples are available to train an
ML model; (ii) Side information is often readily available, e.g., the
location and time when the image was taken are usually known, and in
turn can be linked to further information like terrain maps, satellite
images, etc.; (iii) Biological species are related to each other in a hier-
archical manner, i.e., through a taxonomic tree,² and one can leverage
these relations during both training and inference. In particular, one
may assume that, at any level of the hierarchy, species in the same group
are, on average, more similar than species in distinct groups (see
Fig. A1).
In this study, we develop an ML model for classifying community
science photographs. Our focus is on how to best exploit side information
that comes with the actual photograph, to improve species recognition.
By side information, we mean the locations and time points of the ob-
servations, as well as associated environmental variables and optical
satellite imagery. Location and time are usually uploaded together with
the images.³ Our model is inspired by other works such as (Chu et al.,
2019; Mac Aodha et al., 2019); however, there are a few key differences:
(i) we make use of additional metadata (altitude and Sentinel-2), (ii) we
train the model following a late fusion strategy and (iii) we make use of
the marginalisation loss (Kumar and Zheng, 2017).
Many environmental variables are publicly available, as are remote
sensing images, e.g., the Sentinel-2 satellite data repositories (Coperni-
cus open access hub, 2021). Moreover, we include the taxonomic hier-
archy to improve model performance at inference time. Hierarchically
structured class labels can be beneficial in two different ways: on the one
hand, the hierarchy can be used as a regularisation of the model, which
has been shown to improve the classification of rare classes (Turkoglu
et al., 2021); on the other hand, the hierarchy can also be used at
inference time to provide a prediction (at a coarser level) for species not
present in the list of the output classes. We investigate different strate-
gies to exploit the side information and empirically compare them. We
find that a model combining the community science images, spatio-
temporal context, hierarchical labels and remote sensing images
trained in a joint manner with a late fusion strategy performs the best.
We validate the proposed method on a subset of the iNaturalist cata-
logue, comprising 56,608 observations of 977 distinct plant species
collected across the territory of Switzerland.
2. Related work
2.1. Context-based modelling
Research has shown that the location context is important for
modeling the distribution of species, and therefore can especially benefit
fine-grained classification tasks. In (Wittich et al., 2018) the authors
adopt a nearest neighbour approach to predict the possible species that a
person could encounter at certain locations given the previously recor-
ded nearby observations. Although the paper acknowledges the fact that
such information can be used to help and speed up species recognition,
they do not combine their method with any image-based classification
model. In (Berg et al., 2014) the location and time where a photo was
taken are used to define a prior distribution over bird species occur-
rences. An adaptive kernel density estimation is employed to construct
that distribution, which is then combined with probabilistic output from
a Support Vector Machine (SVM). Although the proposed method is
effective when using spatial and temporal metadata to improve classifi-
cation, the usage of SVM severely limited the overall performance.
Novel, deep learning-based methods can achieve higher accuracies on
the same dataset without spatio-temporal priors (Foret et al., 2021).
With the fast advancement of deep learning, researchers have developed
ways to utilise the location context with Convolutional Neural Networks
(CNNs). In (Tang et al., 2015) the authors investigate how to encode the
image's GPS coordinates to increase prediction accuracy. The encoding is
then concatenated with the image representation from the CNN before
the nal (linear) classier. The paper also investigates the impact of
further map features, e.g., precipitation maps, alongside simple GPS
coordinates. (Chu et al., 2019 and Mac Aodha et al., 2019) are two
studies that combine deep learning and geographical information to
improve species recognition accuracy. In (Chu et al., 2019) the authors
propose a refinement network that merges the prediction from a CNN
with a secondary network that receives as input the location where the
image was taken. The weights of the CNN are kept frozen while
training the refinement module. As a second option, the paper proposes
a method where the location-aware network can alter the feature
extraction inside the CNN, based on the picture's location. This second
technique, however, did not lead to a substantial improvement. (Mac
Aodha et al., 2019) propose a slightly different solution for the same
problem; here, the network responsible for extracting the
geographical prior is trained separately. The difficulty in this setting
is that the dataset consists exclusively of positive labels, i.e., it contains
no information on where the context speaks against a certain species label.
To overcome this, the authors propose a joint embedding loss able to
deal with presence-only datasets. The difference between the two ap-
proaches is that in the former (Chu et al., 2019) the geographical
network is trained to improve the image-based prediction coming from
the CNN, but cannot make a meaningful prediction on its own, i.e.,
without the CNN; whereas in the latter work (Mac Aodha et al., 2019)
the geographical network is trained separately and can also be evaluated
without an image, effectively producing a species distribution map.
2.2. Hierarchical labels
Complementary to location context, structure among the species
labels helps the classification task by sharing features among related (i.
e., nearby) classes. In (Srivastava and Salakhutdinov, 2013), the output
classes are organised in a hierarchical structure, and features are
transferred between related classes to inject the a priori hierarchy into
the deep neural network classifier.

² Namely, a sub-tree of the general hierarchy of (from top to bottom) kingdom, phylum, class, order, family, genus and species (Stace, 1991).
³ These parameters constitute sensitive personal information, but community scientists are usually willing to disclose them to geo-locate their observations.

(Yan et al., 2015) was another early
work that tackled hierarchical classification in the context of visual
recognition. The proposed method is limited to a 2-level hierarchy, and
it is composed of two classifiers: a coarser one, which separates more
easily distinguishable classes, and a finer one that resolves the more
difficult cases. (Xiao et al., 2014), and more recently (Roy et al., 2018)
analysed the use of hierarchical labels for visual recognition for the
specic case of incremental learning. A hierarchical classier for
clothing recognition was proposed in (Kumar and Zheng, 2017). The
model predicts a label hierarchy instead of a single label for the input
object, by analyzing detection errors. The method exhibits good gener-
alization capabilities also for novel clothing products that were not seen
during training. In the past years, researchers have explored different
ways to inject knowledge about hierarchical labels into neural networks.
The authors of (Chen et al., 2018) propose a framework to predict the
category scores at each hierarchy (tree) level in a top-down manner,
with a multi-head network where each branch is responsible for a
different level. Recently, (Dhall et al., 2020) have investigated and
compared a number of strategies and loss functions to integrate hier-
archical semantic structure into a CNN, including per-level classifiers,
hierarchical softmax, and a marginalisation loss. The marginalisation
loss summarizes the hierarchical information in a bottom-up manner
and, although being one of the simplest approaches, emerged as one of
the most effective. In (Turkoglu et al., 2021) the authors investigate the
task of classifying agricultural crops from a sequence of satellite images,
where the crop labels also exhibit a hierarchical structure (e.g., wheat is
more similar to other cereals than to, say, orchards). They propose a
convolutional recurrent architecture, where increasing depth in the
spatial/convolutional dimension corresponds to a finer hierarchy level,
thus deriving higher-level features for finer classification from coarser
lower-level features. The layout is specific to the recurrent setup and it is
unclear how to adapt it to conventional CNNs without disrupting the
feature extraction backbone.
As a general comment, we note that methods designed for hierar-
chical labels tend to use custom architectures and cannot easily be
combined with well-known, pre-trained high-performance backbones.
3. Methodology
We now outline our proposed model for plant species classification.
The model can be understood as composed of two branches: the first
branch infers a probability distribution over plant species, by looking
exclusively at the input image; the second branch infers another species
distribution only from the auxiliary information, which is then com-
bined with the image-based prediction to obtain a refined posterior
distribution. The entire two-branch network is supervised jointly with a
hierarchical loss that leverages the structure of the taxonomy.
3.1. Inference from image
Given an image I that depicts a certain plant specimen, we can use a
CNN to infer its species y. The network outputs a probability distribution
p(y|I; θ) over all C possible species, where θ are the learnable parameters
(convolution weights). To lighten the notation we drop θ when it is clear
from the context, and simply write p(y|I). In our implementation we use
the popular ResNet architecture (He et al., 2016), although other net-
works could also be employed. Our ResNet is pre-trained on ImageNet
(Deng et al., 2009), a setting that has become common practice to speed
up training and boost performance with limited data.
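To make the setup concrete, the image branch can be sketched in a few lines of PyTorch. This is a minimal sketch under our own assumptions (torchvision, a ResNet-50 backbone, 977 output classes); it is not the authors' released code.

```python
# Minimal sketch of the image branch p(y|I; theta).
# Assumptions: PyTorch + torchvision; ResNet-50 backbone; C = 977 species.
import torch
import torch.nn as nn
from torchvision import models

NUM_SPECIES = 977  # C, number of species after filtering the dataset

class ImageBranch(nn.Module):
    def __init__(self, num_classes: int = NUM_SPECIES):
        super().__init__()
        # ImageNet pre-training, as described in the text.
        backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        # Replace the 1000-way ImageNet head with a species classifier.
        backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)
        self.backbone = backbone

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        return self.backbone(image)  # unnormalised class scores

model = ImageBranch()
logits = model(torch.randn(1, 3, 224, 224))  # one 224 x 224 RGB crop
p_y_given_I = logits.softmax(dim=-1)         # probability over 977 species
```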
Fig. 1. Overview of our model.
Fig. 2. Different Hypericum species, in order:
H. androsaemum, H. calycinum, H. hirsutum and H. perforatum.
These species are visually similar but have different
geographical distribution ranges. For such groups of species,
additional spatio-temporal information can help to improve
classification accuracy. For each species we visualise the
probability score learned by our Location Encoder (left), the
location of the training samples (red dots) and a sample
image from our training set (right). (For interpretation of the
references to color in this figure legend, the reader is referred
to the web version of this article.)
3.2. Inference from spatio-temporal context
As explained above, community science observations are often
accompanied by auxiliary information, in particular spatio-temporal
context, i.e., where and when the photo was taken. We denote that
spatio-temporal context by the vector x. The spatial information in-
cludes longitude (x), latitude (y), and altitude (z), while the day of the
year t represents the temporal information.⁴ This information is typi-
cally included in the images' metadata, except for the altitude, which
can be easily derived from the location given a Digital Elevation Model
(DEM). The spatio-temporal context of an observation has been shown
to be a useful cue for classifying species observations (see Section 2.1
and Fig. 2), which is not surprising, as the probability of observing a
certain species varies greatly across space and time.
Several methods have been proposed to merge such auxiliary infor-
mation into the classification, for instance see (Chu et al., 2019; Mac
Aodha et al., 2019). We will briefly describe the different strategies and
highlight their pros and cons:

Early Fusion: In this case the image I and auxiliary information x are together fed into a model which shall directly predict p(y|I, x; θ, ϕ). That model is trained by minimizing a suitable loss function such as the cross-entropy between the predicted and true labels. The advantage of such an approach is that it does not impose any independence assumptions and the model can, in principle, leverage any statistical relation between y and the inputs (including correlations between I and x). However, this generality comes at a price: (i) at inference time the complete auxiliary information x must be fed to the model to obtain a reliable prediction, and (ii) if the training data is scarce, processing the two sources I and x together increases the risk of over-fitting to spurious correlations.

Separate Training: This approach, exemplified by (Mac Aodha et al., 2019), takes the opposite route and employs two completely separate networks: one "main" network processes only the image to obtain p(y|I; θ), the second "auxiliary" one processes only the side information to obtain p(y|x; ϕ). The two networks are trained separately and produce separate scores that are only merged at inference time. This corresponds to the assumption that I and x are independent, such that p(y|I, x) ∝ p(y|I)·p(y|x). The main advantage of this approach is a much reduced danger of over-fitting, as visual information and context are decorrelated. A further advantage is that one can use additional datasets without images to train the spatio-temporal prior. On the other hand, training that prior without supporting image information can also be difficult, particularly in the common situation with presence-only annotations (Mac Aodha et al., 2019). Finally, any real correlations between x and I will be lost, by construction.⁵

Late Fusion: This approach, employed for instance as one of the methods in (Chu et al., 2019), constitutes a compromise between early fusion and separate training. Separate branches are maintained for I and x, but their scores are not only combined during inference but also during training, with a joint loss function on the combined prediction p(y|I, x). The risk of over-fitting remains low compared to early fusion, as the model admits correlations between visual and auxiliary cues only globally, but not between individual variables: p(y|x) acts as a spatio-temporally varying rescaling of the image-based class scores p(y|I), and vice versa. At the same time, presence-only observations do not challenge the training of the spatio-temporal prior, as the loss is computed only after including the visual information.
All the aforementioned methods are legitimate design choices;
whether to prefer one or the other depends on the particular problem as
well as the available data. In the experiments section, we empirically
compare their performance for plant species classification. In terms of
network architecture, for separate training and late fusion, the auxiliary
information is first embedded into a C-dimensional vector with a fully-
connected network (FCN_context), with C the number of classes (see
Fig. 1). The FCN_context, with parameters ϕ, has a sigmoid as its last layer,
such that its output represents a presence/absence probability per class.
Note that the sigmoid (rather than a softmax over C classes) is chosen to
reflect that, at a given place and time, multiple species can be present
with high probability.
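A minimal sketch of such a context branch, together with one way to realise the late-fusion combination (element-wise rescaling of the image scores by the context probabilities, followed by renormalisation), is given below. The hidden-layer sizes and helper names are our own assumptions; the paper only specifies a fully-connected network with a sigmoid output.

```python
# Sketch of FCN_context and the late-fusion combination (assumed details).
import torch
import torch.nn as nn

class ContextBranch(nn.Module):
    """Maps the context vector x = (lon, lat, alt, t1, t2) to per-class
    presence probabilities p(y|x; phi) in [0, 1]."""
    def __init__(self, in_dim: int = 5, num_classes: int = 977, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Sigmoid rather than softmax: several species can be present
        # at the same place and time with high probability.
        return torch.sigmoid(self.net(x))

def late_fusion(p_image: torch.Tensor, p_context: torch.Tensor) -> torch.Tensor:
    """p(y|x) rescales the image-based scores p(y|I); renormalise to get p(y|I, x)."""
    joint = p_image * p_context
    return joint / joint.sum(dim=-1, keepdim=True)
```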
3.3. Inference using auxiliary Sentinel-2 images
Finally, given that we know the location where a specific species
observation was made, we can extract additional context information
from remotely-sensed sources, to potentially improve species identifi-
cation performance. To illustrate this, we add a Sentinel-2 image of the
region around x as further auxiliary data. Sentinel-2 was chosen for its
potential to supplement meaningful information about the local
ecosystem: it provides complete coverage of the region of interest
(Switzerland). We choose to only use the 4 bands with the highest spatial
resolution (10 m GSD) across the visible and infrared spectrum (ranging
from 0.5 to 1.0 μm). These are commonly used to derive vegetation
information and have been shown to be sufficient to derive further
vegetation parameters (Lang et al., 2019).
The satellite data S is fed into the model in a similar fashion as the
location context. The only difference is that the embedding of the raw
data into the C-dimensional vector p(y|S; ψ) is a convolutional encoder
with parameters ψ (rather than a fully-connected network), to account
for the nature of image data. In our implementation we use a ResNet-50.
As before, the embedded satellite imagery is combined with the other
inputs according to the late fusion strategy and all three branches are
trained jointly, via the merged score p(y|I, x, S).
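One practical detail: an ImageNet-pretrained ResNet-50 expects 3-channel input, while the Sentinel-2 patches have 4 bands. A possible adaptation is sketched below; the paper does not state how the extra channel is handled, so initialising the N-IR filters from the mean of the RGB filters is our assumption.

```python
# Sketch of a 4-band Sentinel-2 encoder branch (assumed adaptation).
import torch
import torch.nn as nn
from torchvision import models

def make_sentinel_branch(num_classes: int = 977) -> nn.Module:
    net = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    old_conv = net.conv1  # Conv2d(3, 64, kernel_size=7, stride=2, padding=3)
    net.conv1 = nn.Conv2d(4, old_conv.out_channels, kernel_size=7,
                          stride=2, padding=3, bias=False)
    with torch.no_grad():
        net.conv1.weight[:, :3] = old_conv.weight                        # reuse RGB filters
        net.conv1.weight[:, 3:] = old_conv.weight.mean(1, keepdim=True)  # init N-IR filters
    net.fc = nn.Linear(net.fc.in_features, num_classes)  # embedding for p(y|S; psi)
    return net
```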
3.4. Integration of taxonomic hierarchy
Hierarchical labels derived from plant taxonomy are another source
of non-visual a priori information about plant species. The taxonomic
hierarchy endows the output space with additional structure that may
help to correctly classify plant species, especially if the training data is
heavily imbalanced. Attempts to use the hierarchy rest on the assump-
tion that closely related species in the tree have higher visual similarity
than more distant ones.⁶ On the one hand, the hierarchical grouping
(for instance, of many rare species into a common genus) gives rare
species statistical strength, as confusing them with each other becomes
cheaper than confusing them with some frequently observed species
from a different genus. On the other hand, the grouping also benefits the
fine-grained species classification, as it favours feature sharing between
adjacent classes that, by themselves, have too few samples to learn a
good representation (Srivastava and Salakhutdinov, 2013). The taxo-
nomic levels we use are, from the bottom to the top of the hierarchy:
species, genus, family, order, class and phylum.

⁴ Thus assuming the distribution is seasonally varying but stationary over a few years.
⁵ Such patterns are likely to exist. Examples include location-specific shadows or time-dependent snow cover.
⁶ In expectation, not necessarily in every instance.
To integrate hierarchical labels, we adopt the marginalisation loss
proposed in (Kumar and Zheng, 2017). As shown in Fig. 3, the output of
the classifier is the probability distribution over all species. Margin-
alising over all species within each genus thus yields the probability
distribution over genera. This procedure can then be repeated to derive
the distribution over families, etc.:

p(y_i^l) = Σ_{j ∈ K_i} p(y_j^{l+1})    (1)

where p(y_i^l) is the predicted probability for the i-th label at hierarchy
level l, and p(y_j^{l+1}) is the probability of class j at the next finer hier-
archy level l + 1. With K_i we denote the set of child classes of parent
class i. Based on the distribution p(y^l) derived at level l, we can compute
a cross-entropy loss ℒ_l for each individual level. The marginalisation
loss is then simply the sum of all these intermediate losses:

ℒ_mar = Σ_l ℒ_l.    (2)
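A possible implementation of Eqs. (1) and (2) is sketched below, assuming PyTorch and a precomputed child-index table per taxonomic level; the function and variable names are ours, not the authors'.

```python
# Sketch of the marginalisation loss (Eqs. (1) and (2)); assumed data layout:
# `children[parent]` lists the indices of that parent's child classes at
# the next finer level, built once from the taxonomic tree.
import torch
import torch.nn.functional as F

def marginalise(p_fine: torch.Tensor, children: list) -> torch.Tensor:
    """Eq. (1): a parent's probability is the sum of its children's probabilities."""
    return torch.stack([p_fine[:, idx].sum(dim=1) for idx in children], dim=1)

def marginalisation_loss(p_species: torch.Tensor,
                         targets_per_level: list,
                         children_per_level: list) -> torch.Tensor:
    """Eq. (2): sum of cross-entropy losses over all taxonomic levels.
    targets_per_level[0] holds species labels, [1] genus labels, and so on."""
    p = p_species
    loss = F.nll_loss(torch.log(p + 1e-9), targets_per_level[0])
    for children, targets in zip(children_per_level, targets_per_level[1:]):
        p = marginalise(p, children)  # species -> genus -> family -> ...
        loss = loss + F.nll_loss(torch.log(p + 1e-9), targets)
    return loss
```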
3.5. Data preprocessing
All community science images were resized to 256 × 256 pixels
and then centre-cropped to 224 × 224. The images used for training
were additionally augmented by random rotations, random horizontal
flips and color-jitter, which are all standard methods to help mitigate the
risk of over-fitting. Furthermore, all images were normalized according
to the mean and standard deviation of the training set.
We encode the observation time, measured as day of year t, into
(t1, t2) using the sine-cosine mapping (Mac Aodha et al., 2019), Eq. (3).
In this way December 31st and January 1st are mapped close to each
other, correctly accounting for the cyclic nature of the variable.

t1 = sin(2πt/365),  t2 = cos(2πt/365)    (3)
Regarding the location coordinates, we rescale longitude, latitude
and altitude separately to fit into the interval [−1, 1] and denote the
triple of normalised coordinates as our geo-location (x, y, z).
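Putting Eq. (3) and the rescaling together, the context encoding could look as follows; the normalisation bounds (rough geographic and elevation extremes of Switzerland) are illustrative assumptions, as the paper does not state the exact constants.

```python
# Sketch of the context encoding; the lon/lat/alt ranges are assumptions.
import numpy as np

def encode_context(lon, lat, alt, day_of_year,
                   lon_rng=(5.9, 10.5),       # approx. longitude range of Switzerland
                   lat_rng=(45.8, 47.8),      # approx. latitude range
                   alt_rng=(193.0, 4634.0)):  # approx. elevation range in metres
    def rescale(v, lo, hi):  # map a value to [-1, 1]
        return 2.0 * (v - lo) / (hi - lo) - 1.0
    t1 = np.sin(2 * np.pi * day_of_year / 365)  # Eq. (3): cyclic time encoding
    t2 = np.cos(2 * np.pi * day_of_year / 365)
    return np.array([rescale(lon, *lon_rng), rescale(lat, *lat_rng),
                     rescale(alt, *alt_rng), t1, t2], dtype=np.float32)
```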
Finally, the Sentinel-2 images are extracted from a cloud-free mosaic
of images taken in 2020. As previously indicated, we only use the four
spectral bands with a 10 m spatial resolution (R, G, B and N-IR), since
they are often sufficient to derive vegetation parameters (Lang et al.,
2019). From this mosaic, we extract patches of 256 × 256 pixels to
ensure enough context (ca. 1.3 km around the sample location, see
Table A.2 for a comparison of the performance with different sized
patches).
3.6. Balanced sampling
We used a balanced sampling strategy, where the sampling weight
W_i of each image is inversely proportional to the number of images
N_{y_i} of the corresponding class y_i:

W_i = 1 / N_{y_i}.    (4)
This strategy will oversample the rare species from the tail of the
distribution and undersample the frequent species from the head of the
distribution, so as to mitigate the impact of the imbalance on the clas-
sifier. It should be noted that the effect cannot be completely removed:
even when sampled with higher frequency, the few images of a rarely
observed species will inevitably carry less information than the many
example images of an abundant species. As a result, neither of the two
approaches has a clear advantage, making this a design choice. In fact,
using the balanced sampling strategy, compared to the conventional
training method, improves the per-class accuracy while decreasing the
overall accuracy (see Table A.1). Although the differences in performance
are small, we decide to prioritise the per-class accuracy, since we believe
it is more important for our application, and thus choose to use the
balanced sampling approach.

Fig. 3. The idea of the marginalisation loss is to simultaneously apply a cross-entropy loss at all levels of the taxonomic hierarchy. As the output of the classifier is the probability distribution over all species, marginalising over all species within each genus yields the probability distribution over genera. This procedure can then be repeated to derive the distributions at all higher levels. The marginalisation loss is simply the sum of the intermediate losses computed at each level.

Fig. 4. Sample distribution for each species in the training dataset. Note the logarithmic scale of the y-axis. Both diagrams share the same scale.
Fig. 4 shows the number of training images per species in the training
set before and after balanced sampling.
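Eq. (4) maps directly onto PyTorch's WeightedRandomSampler; the sketch below uses a random placeholder label array, since the real class indices come from the dataset.

```python
# Sketch of balanced sampling (Eq. (4)) with a placeholder label array.
import numpy as np
import torch
from torch.utils.data import WeightedRandomSampler

labels = np.random.randint(0, 977, size=56608)  # hypothetical class index per image
counts = np.bincount(labels, minlength=977)     # N_y: number of images per class
weights = 1.0 / counts[labels]                  # W_i = 1 / N_{y_i}
sampler = WeightedRandomSampler(torch.as_tensor(weights, dtype=torch.double),
                                num_samples=len(labels), replacement=True)
# loader = DataLoader(train_dataset, batch_size=32, sampler=sampler)
```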
We employ stochastic gradient descent (SGD) (Bottou, 2012)
to optimise the parameters of our model. We set two different learning
rates, a smaller one of 5 × 10⁻⁵ for the pre-trained convolutional layers of
the CNN, and a larger one of 2 × 10⁻³ for the fully connected layers. These
learning rates are further reduced when a plateau is reached. The batch
size is fixed to 32, and all models were trained for 100 epochs. We use a
cross-entropy loss for baselines that ignore the label hierarchy, and the
marginalisation loss (Eq. (2)) for hierarchically structured labels.
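The optimiser setup described above could look as follows in PyTorch; the momentum and plateau-scheduler hyperparameters are our assumptions, as the paper does not specify them.

```python
# Sketch of the two-learning-rate SGD setup (momentum/scheduler values assumed).
import torch
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.fc = torch.nn.Linear(model.fc.in_features, 977)  # new classifier head

backbone_params = [p for n, p in model.named_parameters() if not n.startswith("fc.")]
optimizer = torch.optim.SGD([
    {"params": backbone_params, "lr": 5e-5},        # pre-trained convolutional layers
    {"params": model.fc.parameters(), "lr": 2e-3},  # freshly initialised layers
], momentum=0.9)                                    # momentum value is an assumption
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1, patience=5)
# After each epoch: scheduler.step(validation_loss)  # reduces both learning rates
```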
All results (unless stated otherwise) are computed with 5-fold cross-
validation, stratified to ensure uniform class distribution across all folds.
4. Dataset
From the iNaturalist database (iNaturalist, 2021), we have down-
loaded all images⁷ of plants that are located in Switzerland and labeled
as "Research Grade". The latter constitutes the highest level of data
quality, where observations meet five criteria: (1) they must include a
date, (2) a spatial geo-reference, (3) a picture (or sound, but we only
focus on images in this work), (4) the subject must be a naturally living
organism (not captive or cultivated), and (5) at least 2 identifiers should
agree on a taxon, out of a minimum of 3 identifiers.
As shown in Table 1, a total of 60,781 images were downloaded (see
Fig. 6), which represented 2,374 species. However, as seen in Fig. 5, the
dataset is highly imbalanced and follows a long-tail distribution. We
discard all species with <10 images in order to ensure reliability and
statistical significance of the experimental results. After this filtering we
are left with 56,608 images representing 977 species. We also generated
a dataset of unseen species for further experiments (see Section 5.4).
These are observations of species that have fewer than 10 but more than
5 images. For each of those species, we select 5 images at random.
Besides the images, the dataset also contains non-visual information,
including the additional data that we use in our model, i.e., longitude,
latitude, day of the year and hierarchical labels. To obtain altitude we
extract the height value corresponding to the given geo-location from
the swissALTI3D DEM of the Swiss national mapping agency (Swisstopo,
2021).
5. Experimental results
5.1. Model performance
We have conducted experiments with the following models to
empirically determine their performance gain: (1) Baseline, which
corresponds to a standard ResNet50; (2) Baseline + Location Context,
where we add the location encoder to the baseline in a late fusion setup;
(3) Baseline + Hierarchical Labels, where we add the marginalisation
loss to the baseline; and (4) Proposed Model, which leverages both the
location context and the hierarchical labels. Here, we again use the late
fusion strategy, which empirically achieved the best performance (see
Table 3).
Table 1
Overview of our dataset.

Description  Images  Species
Overall      60,781  2,374
Selected     56,608  977
Unseen       1,650   330
Fig. 5. Sample distribution for all "Research Grade" iNaturalist observations in Switzerland. Note the logarithmic scale of the y-axis.
Fig. 6. Example images from our dataset.
Table 3
Comparison of different training strategies. Note that the top-k accuracies indicate the average species-specific metrics.

Model                         Accuracy (%)  Top-1 (%)  Top-3 (%)  Top-5 (%)
Separate Training             76.41         65.29      82.09      86.58
Joint Training: Early Fusion  73.65         64.84      79.60      83.87
Joint Training: Late Fusion   79.12         69.76      84.86      88.95
⁷ As of November 5th, 2020.
As seen in Table 2, adding either the location context or the hierar-
chical labels to the baseline model significantly improves the results, for
all metrics. Note that we compute the top-k accuracies as the average
species-specific metrics in order to give the same weight to all the spe-
cies in the evaluation. Thus the overall accuracy is higher than the top-1
hit-rate due to the imbalanced nature of our dataset, which is preserved
in our stratified cross-validation. Furthermore, improvements from
location context and hierarchical labels are largely orthogonal, as ex-
pected, since they leverage different types of information. These results
indicate a clear benefit of complementing visual cues from community
science images with additional sources of information. For a more
detailed ablation study of the exact contributions of every component in
our location context, see Section 5.5. Visual inspection of misclassified
images confirms that location context helps in the case of visually
similar species that occur in different geographical regions (see Fig. 7),
whereas hierarchical labels help to classify species with few images.
Finally, as seen in Fig. 8, our proposed model improves over the
baseline for all four ranges of species counts, and the margin of
improvement is largest for the tail species of the dataset with between
10 and 50 images. This is very useful, since rare species are
more commonly misidentified by community science and are particu-
larly important for conservation purposes.
5.2. Training strategies
Table 3 compares the three different training strategies described in
Section 3.2. The separate training strategy has the advantage that one
can use the image classifier and get reasonable predictions even when
metadata is missing, whereas the joint training strategies should always
perform better, at the cost of being less flexible, as metadata is
mandatory. Under ideal circumstances, one would also expect the early
fusion strategy to perform best, as it is not subject to any factorisation
constraints on p(y|I, x) and can leverage the complete correlation
structure. In practice, however, we observe the worst performance, see
Table 3. It appears that the increased model capacity leads to over-
fitting. The late fusion training strategy, with its restricted interaction
between image and context cues, emerges as the best compromise with
clearly superior performance. Separate training does bring a noticeable
improvement over the baseline but does not reach the late fusion
approach. Likely this is, at least in part, due to the presence-only labels
hampering the learning of the prior p(y|x).
5.3. Evaluation at different hierarchy levels
When using the taxonomic hierarchy during training in conjunction
with the marginalisation loss, we can predict labels at different hierarchy
levels at inference time. If taxonomic distance indeed correlates with
similar visual features and ecological requirements (see Fig. A1), then
the predictions at higher levels should be increasingly more correct.
That is, even if a specimen is assigned the wrong species label it might be
assigned the correct genus label, as it is more likely to be confused with a
similar species from the same genus.⁸
We have evaluated our model at all taxonomic levels that we use, see
Table 5. Indeed, the performance is better for the higher levels (cf.
Table 4). Furthermore, higher up in the hierarchy, fewer classes are
poorly represented; the long-tail distribution is less extreme.

Fig. 7. Misclassification example: both images of Phyteuma orbiculare (Left) are misclassified as Phyteuma hemisphaericum (Right) by our baseline model. When including the location context, our proposed model correctly classifies the image with the green frame, whereas the image with the red frame is still misclassified. The green and red arrows indicate the locations of the two images on the left. The underlying maps are the species distribution maps downloaded from Info Flora (Info Flora, 2021). This highlights the importance of including the location information to distinguish visually similar species that have different geographical ranges. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 8. Improvement in Mean Accuracy over the baseline for species with different numbers of images in the dataset.

Table 2
Ablation study of our proposed model. Note that the top-k accuracies denote the average species-specific metrics in order to give the same weight to all the species in the evaluation.

Model                           Accuracy (%)  Top-1 (%)  Top-3 (%)  Top-5 (%)
Baseline                        73.48         62.48      79.04      83.97
Baseline + Location Context     76.99         67.47      82.50      87.01
Baseline + Hierarchical Labels  76.30         65.49      82.20      86.81
Proposed Model                  79.12         69.76      84.86      88.95

⁸ Note that also the chance level increases, as there are fewer possible labels.
5.4. Experiments with unseen species
Given the hierarchical labels, it is also possible to classify new species
which the classifier has not seen at all during training. While the
assigned species label will necessarily always be wrong, one would hope
that the predictions at coarser taxonomy levels are often sensible. For
this experiment, we picked 330 species that were initially discarded
from our dataset for having <10 images, but for which at least 5 images
are available, cf. the "Unseen" row in Table 1. The corresponding re-
sults in Table 6 confirm our intuition: while there is of course a signif-
icant performance drop compared to the trained species, it is still
possible to classify unseen species into the right Genus, Family or Order
with reasonable performance, well above chance level (the probability
of success of a classifier that always predicts the most common class).
This capability can be extremely useful in the context of community
science, where the coarser labels can be used to refer examples to the
right expert for classification or to detect gaps in the taxonomy lists
offered to users.
5.5. Contextual information and Sentinel-2 images ablation study
To investigate the contributions of different types of contextual in-
formation, and the potential benefit of adding satellite imagery, we
perform extensive ablation studies.
In Table 7 we show the impact of the different types of contextual
information (altitude, geo-coordinates, day of the year) on the evaluated
metrics. As can be seen, they all contribute to some extent, with the
altitude being one of the most important. Considering the high altitude
variability of the Swiss landscape, it was to be expected that the altitude
could carry the most valuable information. When the full context is
combined, the performance metrics show a further increase, meaning
that the additional data carry orthogonal information.
Finally, Table 8 displays the performance achieved with the inte-
gration of Sentinel-2 imagery. Overall, its impact turns out to be
small. When naively adding the Sentinel-2 branch, performance even
drops slightly, apparently due to over-fitting. By adding standard drop-
out regularisation (Srivastava et al., 2014) on the last fully-connected
layer, we were able to remedy this behaviour and achieve a mild (but
still statistically significant) performance gain. To ensure that the
difference is actually caused by the satellite imagery and not the drop-
out, we add an additional baseline where the model without the
Sentinel-2 branch is trained with drop-out. Interestingly, this even
degraded the performance.
While it is promising that the much-enriched context information
from the satellite image brings an improvement over the simple geo-
location, that gain is relatively modest, at least with our implementa-
tion. Further research, beyond the scope of the present paper, will be
needed to clarify the potential of satellite (or airborne) data as auxiliary
information.
6. Conclusion
In this work, we have demonstrated that easily accessible side in-
formation can bring rather large performance gains when classifying
community science photographs. We have focused on the spatio-
temporal context of the observations, and have shown how it can
rene the classication model by providing relevant prior knowledge
regarding the distribution and occurrence of species observations. We
have also briey touched on extended radiometric context from optical
satellite imagery, a direction where we see quite some potential for
further research. Moreover, we have veried that exploiting the hier-
archical structure of biological taxonomy not only improves the species
recognition performance, but also enables more reliable predictions at
coarser taxonomy levels, and even coarse classication of species not
seen at all during the classier training.
In terms of practical community science applications, our model is
also a step towards a viable scheme for verifying user-supplied labels.
For instance, the proposed method could provide hints to the commu-
nity scientist when labelling the species, or it could facilitate the
review and validation by experts, marking specific observations where
the model disagrees with the label provided by the community scientist.
Of course, these suggestions would need to be followed with care in
practice to avoid creating a confirmation bias of the model. We hope
that, ultimately, a larger number of correct species observations will
contribute to better species distribution models, to inform biodiversity
research and conservation initiatives, particularly for rare species.
Declaration of Competing Interest
The authors declare that they have no known competing financial
interests or personal relationships that could have appeared to influence
the work reported in this paper.
Table 4
Number of classes at each hierarchical level.

Level   Species  Genus  Family  Order  Class  Phylum
Number  977      489    121     50     8      3
Table 6
Accuracy (%) on unseen species.

Evaluation Set    Species  Genus  Family  Order  Class  Phylum
5-fold Cross-Val  79.02    83.39  87.26   88.54  97.24  99.89
Unseen Species    –        24.27  41.86   50.23  85.60  96.00
Table 7
Ablation study of spatio-temporal context. Note that the top-k accuracies indicate the average species-specific metrics.

Model                             Accuracy (%)  Top-1 (%)  Top-3 (%)  Top-5 (%)
Baseline                          73.48         62.48      79.04      83.97
Baseline + Altitude               75.40         65.11      81.00      85.80
Baseline + Geo-coordinates        75.07         64.86      80.63      85.45
Baseline + Day of the year        75.51         65.00      80.91      85.51
Baseline + Full Location Context  76.99         67.47      82.50      87.01
Table 8
Results adding the Sentinel-2 mosaic. Note that the top-k accuracies indicate the average species-specific metrics.

Model                                Accuracy (%)  Top-1 (%)  Top-3 (%)  Top-5 (%)
Proposed model                       79.12         69.76      84.86      88.95
Proposed model with Dropout          78.02         67.77      83.39      87.52
Proposed model + Sen-2               78.59         68.29      84.43      88.60
Proposed model + Sen-2 with Dropout  79.73         70.32      85.52      89.33
Table 5
Results at different hierarchical levels. Note that top-3 and top-5 accuracy at phylum level are meaningless, since there are only 3 possible phyla. Note that the top-k accuracies indicate the average species-specific metrics.

Metric (%)  Species  Genus  Family  Order  Class  Phylum
Accuracy    79.02    83.39  87.26   88.54  97.24  99.89
Top-1       69.50    73.23  75.84   78.53  88.52  86.63
Top-3       84.49    85.76  87.81   89.34  98.01  100.0
Top-5       88.57    89.37  91.45   92.95  99.51  100.0
Appendix A. Ablation studies

Tables A.1 and A.2.

Appendix B. Confusion matrix

Fig. A1.

Fig. A1. Confusion matrix for our proposed model. The species in the rows and columns have been ordered based on their taxonomy. The hierarchy between the species is made apparent by the dendrograms. The block-like structure along the diagonal indicates that species that are close in terms of their taxonomy are misclassified for each other more often than unrelated species.
Table A.1
Ablation study of balanced sampling. Note that the top-k accuracies denote the average species-specific metrics.

Model                                  Accuracy (%)  Top-1 (%)  Top-3 (%)  Top-5 (%)
Baseline + No Balanced Sampling        75.57         62.08      80.90      86.18
Baseline + Balanced Sampling           73.48         62.48      79.04      83.97
Proposed Model + No Balanced Sampling  80.05         69.23      86.15      90.29
Proposed Model + Balanced Sampling     79.12         69.76      84.86      88.95
Table A.2
Comparison of different extents for Sentinel-2 images. Note that the top-k accuracies indicate the average species-specific metrics.

Model                      Accuracy (%)  Top-1 (%)  Top-3 (%)  Top-5 (%)
No Sent-2                  79.12         69.76      84.86      88.95
Small Sent-2 (128 × 128)   79.50         69.56      84.76      88.92
Normal Sent-2 (256 × 256)  79.73         70.32      85.52      89.33
Large Sent-2 (512 × 512)   79.16         69.90      84.94      88.91
References

Barrotta, P., Gronda, R., 2020. What is the meaning of biodiversity? In: Controversies and Interdisciplinarity: Beyond Disciplinary Fragmentation for a New Knowledge Model. John Benjamins Publishing Company, pp. 115–131.
Beery, S., Cole, E., Gjoka, A., 2020. The iWildCam 2020 competition dataset. In: Proceedings, IEEE Conference on Computer Vision and Pattern Recognition Workshops.
Berg, T., Liu, J., Woo Lee, S., Alexander, M.L., Jacobs, D.W., Belhumeur, P.N., 2014. Birdsnap: Large-scale fine-grained visual categorization of birds. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2011–2018.
Bottou, L., 2012. Stochastic gradient descent tricks. In: Neural Networks: Tricks of the Trade, second ed. Springer, Berlin Heidelberg, pp. 421–436.
Butcher, G., Niven, D., 2007. Combining data from the Christmas Bird Count and the Breeding Bird Survey to determine the continental status and trends of North America birds. Tech. rep., National Audubon Society.
Chen, T., Wu, W., Gao, Y., Dong, L., Luo, X., Lin, L., 2018. Fine-grained representation learning and recognition by exploiting hierarchical semantic embedding. In: Proceedings of the ACM International Conference on Multimedia, pp. 2023–2031.
Chu, G., Potetz, B., Wang, W., Howard, A., Song, Y., Brucher, F., Leung, T., Adam, H., 2019. Geo-aware networks for fine-grained recognition. In: Proceedings, IEEE International Conference on Computer Vision Workshops, pp. 247–254.
Copernicus open access hub. https://scihub.copernicus.eu (last accessed on 26.05.2021).
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L., 2009. ImageNet: A large-scale hierarchical image database. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
Dhall, A., Makarova, A., Ganea, O., Pavllo, D., Greeff, M., Krause, A., 2020. Hierarchical image classification using entailment cone embeddings. In: Proceedings, IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 836–837.
Díaz, S., Settele, J., Brondízio, E., Ngo, H., Guèze, M., Agard, J., Arneth, A., Balvanera, P., Brauman, K., Butchart, S., Chan, K., Garibaldi, L., Ichii, K., Liu, J., Subramanian, S., Midgley, G., Miloslavich, P., Molnár, Z., Obura, D., Pfaff, A., Polasky, S., Purvis, A., Razzaque, J., Reyers, B., Chowdhury, R., Shin, Y., Visseren-Hamakers, I., Willis, K., Zayas, C., 2019. Summary for policymakers of the global assessment report on biodiversity and ecosystem services. Tech. rep., Intergovernmental Science-Policy Platform on Biodiversity and Ecosystem Services.
Dickinson, J.L., Zuckerberg, B., Bonter, D.N., 2010. Citizen science as an ecological research tool: challenges and benefits. Ann. Rev. Ecol. Evol. Systematics 41, 149–172.
Foret, P., Kleiner, A., Mobahi, H., Neyshabur, B., 2021. Sharpness-aware minimization for efficiently improving generalization. In: Proceedings of the International Conference on Learning Representations.
Gaston, K.J., Spicer, J.I., 2004. Biodiversity: An Introduction, second ed. Blackwell Publishing.
He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
iNaturalist. https://www.inaturalist.org (last accessed on 26.05.2021).
Info Flora. https://www.infoflora.ch (last accessed on 26.05.2021).
Khosla, A., Jayadevaprakash, N., Yao, B., Fei-Fei, L., 2011. Novel dataset for fine-grained image categorization. In: First Workshop on Fine-Grained Visual Categorization at the IEEE Conference on Computer Vision and Pattern Recognition.
Kumar, S., Zheng, R., 2017. Hierarchical category detector for clothing recognition from visual data. In: Proceedings, IEEE International Conference on Computer Vision Workshops, pp. 2306–2312.
Lang, N., Schindler, K., Wegner, J.D., 2019. Country-wide high-resolution vegetation height mapping with Sentinel-2. Remote Sens. Environ. 233, 111347.
Mac Aodha, O., Cole, E., Perona, P., 2019. Presence-only geographical priors for fine-grained image classification. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 9596–9606.
Roy, D., Panda, P., Roy, K., 2018. Tree-CNN: A hierarchical deep convolutional neural network for incremental learning. arXiv:1802.05800.
Silvertown, J., 2009. A new dawn for citizen science. Trends Ecol. Evol. 24 (9), 467–471.
Srivastava, N., Salakhutdinov, R., 2013. Discriminative transfer learning with tree-based priors. In: Proceedings, Advances in Neural Information Processing Systems.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R., 2014. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15 (56), 1929–1958.
Stace, C.A., 1991. Plant Taxonomy and Biosystematics. Cambridge University Press.
Swisstopo. https://www.swisstopo.admin.ch/en/geodata/height/alti3d.html (last accessed on 26.05.2021).
Tang, K., Paluri, M., Fei-Fei, L., Fergus, R., Bourdev, L., 2015. Improving image classification with location context. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1008–1016.
Turkoglu, M.O., D'Aronco, S., Perich, G., Liebisch, F., Streit, C., Schindler, K., Wegner, J.D., 2021. Crop mapping from image time series: deep learning with multi-scale label hierarchies. arXiv:2102.08820.
Welinder, P., Branson, S., Mita, T., Wah, C., Schroff, F., Belongie, S., Perona, P., 2010. Caltech-UCSD Birds 200. Tech. rep., California Institute of Technology.
Wittich, H.C., Seeland, M., Wäldchen, J., Rzanny, M., Mäder, P., 2018. Recommending plant taxa for supporting on-site species identification. BMC Bioinformatics 19 (1), 1–17.
Xiao, T., Zhang, J., Yang, K., Peng, Y., Zhang, Z., 2014. Error-driven incremental learning in deep convolutional neural network for large-scale image classification. In: Proceedings of the ACM International Conference on Multimedia, pp. 177–186.
Yan, Z., Zhang, H., Piramuthu, R., Jagadeesh, V., DeCoste, D., Di, W., Yu, Y., 2015. HD-CNN: hierarchical deep convolutional neural networks for large scale visual recognition. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2740–2748.
R. de Lutio et al.
... Moreover, sophisticated procedures of spatial bias correction are computationally demanding when applied to many species 16 . Driven by their success in data science 17 , deep neural networks (DNNs) have become an increasingly popular alternative to model biodiversity from observational data [18][19][20][21][22][23][24] . Compared to SDMs, DNNbased modeling frameworks offer interesting new perspectives. ...
... If spatial variations in sampling intensity are similarly represented in the occurrence records of a group of species, they have a smaller effect on the relative observation probability of one species versus another, than on habitat suitability scores derived from contrasting records of individual species against random background points 16 . Lower susceptibility to sampling bias, in turn, reduces the need to sacrifice observations through thinning and allows harnessing information from the full dataset, for example, the explicit consideration of seasonal effects on observed biodiversity 19,20 . Immediate ways to cope with taxonomic reporting bias, on the other hand, are not offered by DNNs. ...
... Maps of potentially dominant species may be useful for forest rangers to identify locally competitive species 56 , but comparable observation probabilities are also advantageous in other contexts. They can, for example, be used to quality control citizen science observations or to improve image-based plant species identification 19,20 . The third perspective emerges from projecting climate-change impact on phenology and potential dominance. ...
Article
Full-text available
In the age of big data, scientific progress is fundamentally limited by our capacity to extract critical information. Here, we map fine-grained spatiotemporal distributions for thousands of species, using deep neural networks (DNNs) and ubiquitous citizen science data. Based on 6.7 M observations, we jointly model the distributions of 2477 plant species and species aggregates across Switzerland with an ensemble of DNNs built with different cost functions. We find that, compared to commonly-used approaches, multispecies DNNs predict species distributions and especially community composition more accurately. Moreover, their design allows investigation of understudied aspects of ecology. Including seasonal variations of observation probability explicitly allows approximating flowering phenology; reweighting predictions to mirror cover-abundance allows mapping potentially canopy-dominant tree species nationwide; and projecting DNNs into the future allows assessing how distributions, phenology, and dominance may change. Given their skill and their versatility, multispecies DNNs can refine our understanding of the distribution of plants and well-sampled taxa in general.
... The crowd sourced data is collected by nonexperts with varying equipment, expertise and skill and often show significant biases due to (1) the geographical variation in sampling effort (e.g. affected by population density and accessibility) and (2) citizens/non-experts are more likely to miss rare species and sample 'eye catching' species, resulting in long tailed distribution (de Lutio et al., 2021;Jones, 2020) with many observations of common species and few/no observations of rare species. These problems are mitigated to some extent by automatically filtering and relying on data quality checks by a network of experts. ...
... Many applications allow multiple input images with different viewpoints (flowers, leaf, whole plant, etc.) . Others include geo-location of the image, and from the position various environmental conditions such as climate and terrain can be inferred (de Lutio et al., 2021;Terry et al., 2020). ...
... Finally, recent studies suggest to take the hierarchical nature of taxonomy into account, possibly using taxonomic knowledge to infer family-level information of species unknown to the algorithm (de Lutio et al., 2021;Seeland et al., 2019). All this ancillary data is improving the detection algorithm, to the point that it can compete with human experts in simple settings (de Lutio et al., 2021;Jones, 2020;Mahecha et al., 2021;Wäldchen et al., 2018). ...
Article
Full-text available
Large‐scale biodiversity monitoring is essential for assessing biodiversity trends, yet traditional surveying methods are limited in the spatial/temporal scale they can cover. Recent technological developments have led to computer vision‐based species identification tools, such as the Pl@ntNet application. Increasing accuracy of such algorithms presents an opportunity of integrating computer vision into larger monitoring schemes and could lead to automating ground‐based evidence provision related to agri‐environmental measures (e.g. flower strips, field margins). However, images from surveys or farmer declarations do not live up to the standards of current applications. In order to integrate these automated methods into biodiversity monitoring, more generalized models are needed. We create a dataset using 500 manually delineated images of vegetation patches in European grasslands taken during the Land Use/cover Area Survey (LUCAS) grassland module. We train the Faster R‐CNN model to detect and extract individual flower objects. Using this model, we extract the abundance of flowers in an image, analyse their colour distribution, and use the Pl@ntNet application to identify the species of the individual flowers detected. The best model reaches precision and recall of 0.89/0.61 and predicts 1377 flowers on the 100 test images distributed between 10 colours. Using Pl@ntNet, only 52 flowers were identified with a certainty score above 0.5 due to the limitations in image size and quality. Of these flowers, 30% were correctly automatically identified at the species level and 42% at the genus level. The results show that we can automatically extract valuable information on floral abundances, colours, and sizes from images of vegetation patches, though in most cases better images are needed for species identification. Despite limitations with image quality, integrating this workflow into large‐scale monitoring could speed up the sampling process and allow for better spatial and temporal data on floral diversity and abundance.
... Terry et al. [15] developed a multi-input neural network model that fuses contextual metadata and images to identify ladybird species in the British Isles, UK, demonstrating that deep learning models can effectively use contextual information to improve the top-1 accuracy of multi-input models from 48.2% to 57.3%. de Lutio et al. [16] utilized the spatial, temporal, and ecological contexts attached to most plant species' observation information to construct a digital taxonomist that improved accuracy from 73.48% to 79.12% compared to a model trained using only images. Mou et al. [17] used animals' visual features, for example, the color of a bird's feathers or the color of an animal's fur, to improve the recognition accuracy of a contrastive language-image pre-trained (CLIP) model on multiple animal datasets. ...
... Since dates have a cyclical feature, the end of one year and the start of the next year should be close to each other. Therefore, we use sine-cosine mapping [16] to encode the date metadata captured by the camera trap as (d1, d2) according to Equations (5) and (6). With this cyclical encoding, December 31st and January 1st are mapped to be near each other. ...
... In this study, we present a novel method that fuses image and temporal metadata for recognizing wildlife in camera trap images. Our experimental results on different camera trap datasets demonstrate that leveraging temporal metadata can improve overall wildlife recognition performance, which is similar to the findings of Terry et al. [15] and de Lutio et al. [16] who utilized contextual data to enhance the recognition performance of citizen science images. ...
Article
Full-text available
Camera traps play an important role in biodiversity monitoring. An increasing number of studies have been conducted to automatically recognize wildlife in camera trap images through deep learning. However, wildlife recognition by camera trap images alone is often limited by the size and quality of the dataset. To address the above issues, we propose the Temporal-SE-ResNet50 network, which aims to improve wildlife recognition accuracy by exploiting the temporal information attached to camera trap images. First, we constructed the SE-ResNet50 network to extract image features. Second, we obtained temporal metadata from camera trap images, and after cyclical encoding, we used a residual multilayer perceptron (MLP) network to obtain temporal features. Finally, the image features and temporal features were fused in wildlife identification by a dynamic MLP module. The experimental results on the Camdeboo dataset show that the accuracy of wildlife recognition after fusing the image and temporal information is about 93.10%, which is an improvement of 0.53%, 0.94%, 1.35%, 2.93%, and 5.98%, respectively, compared with the ResNet50, VGG19, ShuffleNetV2-2.0x, MobileNetV3-L, and ConvNeXt-B models. Furthermore, we demonstrate the effectiveness of the proposed method on different national park camera trap datasets. Our method provides a new idea for fusing animal domain knowledge to further improve the accuracy of wildlife recognition, which can better serve wildlife conservation and ecological research.
... In addition, taxonomic experts have used additional information from images, such as where and when the images were captured, to assist in species recognition. Previous studies [27][28][29][30] have demonstrated that incorporating geolocation information into species-recognition models can help improve recognition performance. Liu et al. [31] proposed the use of geographical distribution knowledge in textual form to improve zero-shot species recognition accuracy. ...
... Species observations in community science often come with location coordinates, such as longitude and latitude, and utilizing prior knowledge of the location can help with species recognition. To simplify the calculation and help the model learn better, we fitted latitude and longitude to range [−1, 1], respectively, with reference to the settings of Mac Aodha et al. [27] and de Lutio et al. [29]. Then, we concatenated them into g(latitude, longitude), and mapped this onto the location output S according to Equation (3). ...
Article
Full-text available
Automatic recognition of species is important for the conservation and management of biodiversity. However, since closely related species are visually similar, it is difficult to distinguish them by images alone. In addition, traditional species-recognition models are limited by the size of the dataset and face the problem of poor generalization ability. Visual-language models such as Contrastive Language-Image Pretraining (CLIP), obtained by training on large-scale datasets, have excellent visual representation learning ability and demonstrated promising few-shot transfer ability in a variety of few-shot species recognition tasks. However, limited by the dataset on which CLIP is trained, the performance of CLIP is poor when used directly for few-shot species recognition. To improve the performance of CLIP for few-shot species recognition, we proposed a few-shot species-recognition method incorporating geolocation information. First, we utilized the powerful feature extraction capability of CLIP to extract image features and text features. Second, a geographic feature extraction module was constructed to provide additional contextual information by converting structured geographic location information into geographic feature representations. Then, a multimodal feature fusion module was constructed to deeply interact geographic features with image features to obtain enhanced image features through residual connection. Finally, the similarity between the enhanced image features and text features was calculated and the species recognition results were obtained. Extensive experiments on the iNaturalist 2021 dataset show that our proposed method can significantly improve the performance of CLIP’s few-shot species recognition. Under ViT-L/14 and 16-shot training species samples, compared to Linear probe CLIP, our method achieved a performance improvement of 6.22% (mammals), 13.77% (reptiles), and 16.82% (amphibians). Our work provides powerful evidence for integrating geolocation information into species-recognition models based on visual-language models.
... In early fusion, all input modalities are combined into a single tensor before being processed by the network (Lang et al., 2021;Teng et al., 2023). Late fusion derives an independent prediction or representation for each modality, only merging them at the end (Aodha et al., 2019;de Lutio et al., 2021;Sastry et al., 2023). SatCLIP presents a special case of this, where the embeddings produced by SINR are aligned with the embeddings produced by a satellite embedder to serve as satellite image-free embeddings in a downstream task. ...
Article
Full-text available
We propose a deep learning approach for high-resolution species distribution modelling (SDM) at large scale combining point-wise, crowd-sourced species observation data and environmental data with Sentinel-2 satellite imagery. What makes this task challenging is the great variety of controlling factors for species distribution, such as habitat conditions, human intervention, competition, disturbances, and evolutionary history. Experts either incorporate these factors into complex mechanistic models based on presence-absence data collected in field campaigns or train machine learning models to learn the relationship between environmental data and presence-only species occurrence. We extend the latter approach here and learn deep SDMs end-to-end based on point-wise, crowd-sourced presence-only data in combination with satellite imagery. Our method, dubbed Sat-SINR, jointly models the spatial distributions of 5.6k plant species across Europe and increases the spatial resolution by a factor of 100 compared to the current state of the art. We exhaustively test and ablate multiple variations of combining geo-referenced point data with satellite imagery and show that our deep learning-based SDM method consistently shows an improvement of up to 3 percentage points across three metrics. We make all code publicly available at https://github.com/ecovision-uzh/sat-sinr.
... Moreover, using a whole plant image is insufficient, as different organs vary in scale, and capturing all their details in a single image is impractical (Wäldchen et al., 2018). In response to this limitation, very recent studies have delved into the application of multimodal learning techniques (de Lutio et al., 2021;Liu et al., 2016;Nhan et al., 2020;Salve et al., 2018;Hoang Trong et al., 2020;Wang et al., 2022;Zhou et al., 2021), which integrate diverse data sources to provide a comprehensive representation of phenomena. Particularly, Nhan et al. (2020) illustrate that leveraging images from multiple plant organs outperforms reliance on a single organ, in line with botanical insights. ...
Preprint
Full-text available
Plant classification is vital for ecological conservation and agricultural productivity, enhancing our understanding of plant growth dynamics and aiding species preservation. The advent of deep learning (DL) techniques has revolutionized this field by enabling autonomous feature extraction, significantly reducing the dependence on manual expertise. However, conventional DL models often rely solely on single data sources, failing to capture the full biological diversity of plant species comprehensively. Recent research has turned to multi-modal learning to overcome this limitation by integrating multiple data types, which enriches the representation of plant characteristics. This shift introduces the challenge of determining the optimal point for modality fusion. In this paper, we introduce a pioneering multimodal DL-based approach for plant classification with automatic modality fusion. Utilizing the multimodal fusion architecture search, our method integrates images from multiple plant organs-flowers, leaves, fruits, and stems-into a cohesive model. Our method achieves 83.48% accuracy on 956 classes of the PlantCLEF2015 dataset, surpassing state-of-the-art methods. It outperforms late fusion by 11.07% and is more robust to missing modalities. We validate our model against established benchmarks using standard performance metrics and McNemar's test, further underscoring its superiority.
... The minimum goal of the PhotoMon Project is to archive photos that provide a coarse (one photo per season) documentation of ecosystem reference states at various locations in the park. However, some elements of the landscape are most effectively observed using more frequent photos (e.g., presence and abundance of ephemeral wildflowers [de Lutio et al. 2021;Crimmins and Crimmins 2008]; plant phenology [Barve et al. 2019]; etc.), and the more ambitious goals of the project are to provide higher-resolution archives of five or ten photos per site per season. Additionally, Pinery Park has an active interpretive program, and a further Project goal is to provide visitors with opportunities for park stewardship. ...
Article
Full-text available
Photo-point monitoring through repeat photography allows assessment of long-term ecosystem changes, and photos may be collected using citizen science methods. Such efforts can generate large photo collections, but are susceptible to varying participation and data quality. To date, there have been few assessments of the success of citizen science projects using repeat photography methods in meeting their objectives. We report on the success of the PhotoMon Project, a photo-point monitoring program at Pinery Provincial Park, Canada, at meeting its primary goals of affordably collecting seasonal reference photographs of significant ecosystems within the park, while providing a stewardship opportunity for park visitors. We investigated how the quantity of submitted photos varied over time (quantity), and how closely those photos matched the suite of criteria of the PhotoMon Project (quality). Photo submissions occurred year-round and at all sites, although a low proportion of park visitors participated in the program. Photo quantity varied among sites and seasonally, reaching a low during the winter, but with proportional participation in the project lowest in summer. Photo quality was consistent year-round, with most photos meeting most program criteria. Common issues with photo quality included photo lighting and orientation. We conclude that the program met its scientific goal of compiling seasonal reference photos, but that comparatively few park visitors engage in the program. We suggest changes to increase visitor motivation to participate, but recognize that these may compromise the program’s current affordability and ease of management.
Article
Full-text available
Simple Summary Species recognition is a key part of understanding biodiversity and can help us to better conserve and manage biodiversity. Traditional species recognition methods require large amounts of image data to train the recognition model, but obtaining image data of rare and endangered species is a challenge. However, Contrastive Language–Image Pre-training (CLIP), a generalized artificial intelligence model, can perform classification by calculating the similarity between images and text without the need for training data. Taking advantage of this and considering the unique geographic distribution pattern of species, we propose a CLIP-based species recognition method that can recognize species based on geographic distribution knowledge. This study is the first to combine geographic distribution knowledge with species recognition, which can lead to more effective recognition of rare and endangered species. Abstract Species recognition is a crucial part of understanding the abundance and distribution of various organisms and is important for biodiversity conservation and management. Traditional vision-based deep learning-driven species recognition requires large amounts of well-labeled, high-quality image data, the collection of which is challenging for rare and endangered species. In addition, recognition methods designed based on specific species have poor generalization ability and are difficult to adapt to new species recognition scenarios. To address these issues, zero-shot species recognition based on Contrastive Language–Image Pre-training (CLIP) has become a research hotspot. However, previous studies have primarily utilized visual descriptive information and taxonomic information of species to improve zero-shot recognition performance, and the use of geographic distribution characteristics of species to improve zero-shot recognition performance has not been explored. To fill this gap, we proposed a CLIP-driven zero-shot species recognition method that incorporates knowledge of the geographic distribution of species. First, we designed three prompts based on the species geographic distribution statistical data. Then, the latitude and longitude coordinate information attached to each image in the species dataset was converted into addresses, and they were integrated together to form the geographical distribution knowledge of each species. Finally, species recognition results were derived by calculating the similarity after acquiring features by the trained CLIP image encoder and text encoder. We conducted extensive experiments on multiple species datasets from the iNaturalist 2021 dataset, where the zero-shot recognition accuracies of mammals, mollusks, reptiles, amphibians, birds, and insects were 44.96%, 15.27%, 17.51%, 9.47%, 28.35%, and 7.03%, an improvement of 2.07%, 0.48%, 0.35%, 1.12%, 1.64%, and 0.61%, respectively, as compared to CLIP with default prompt. The experimental results show that the fusion of geographic distribution statistical data can effectively improve the performance of zero-shot species recognition, which provides a new way to utilize species domain knowledge.
Preprint
Full-text available
In the age of big data, scientific progress is fundamentally limited by our capacity to extract critical information. We show that recasting multispecies distribution modeling as a ranking problem allows analyzing ubiquitous citizen-science observations with unprecedented efficiency. Based on 6.7M observations, we jointly modeled the distributions of 2477 plant species and species aggregates across Switzerland, using deep neural networks (DNNs). Compared to commonly-used approaches, multispecies DNNs predicted species distributions and especially community composition more accurately. Moreover, their setup allowed investigating understudied aspects of ecology: including seasonal variations of observation probability explicitly allowed approximating flowering phenology, especially for small, herbaceous species; reweighting predictions to mirror cover-abundance allowed mapping potentially canopy-dominant tree species nationwide; and projecting DNNs into the future allowed assessing how distributions, phenology, and dominance may change. Given their skill and their versatility, multispecies DNNs can refine our understanding of the distribution of plants and well-sampled taxa in general.
Preprint
Full-text available
In the age of big data, scientific progress is fundamentally limited by our capacity to extract critical information. We show that recasting multispecies distribution modeling as a ranking problem allows analyzing ubiquitous citizen-science observations with unprecedented efficiency. Based on 6.7M observations, we jointly modeled the distributions of 2477 plant species and species aggregates across Switzerland, using deep neural networks (DNNs). Compared to commonly-used approaches, multispecies DNNs predicted species distributions and especially community composition more accurately. Moreover, their setup allowed investigating understudied aspects of ecology: including seasonal variations of observation probability explicitly allowed approximating flowering phenology, especially for small, herbaceous species; reweighting predictions to mirror cover-abundance allowed mapping potentially canopy-dominant tree species nationwide; and projecting DNNs into the future allowed assessing how distributions, phenology, and dominance may change. Given their skill and their versatility, multispecies DNNs can refine our understanding of the distribution of plants and well-sampled taxa in general.
Article
Full-text available
The aim of this paper is to map agricultural crops by classifying satellite image time series. Domain experts in agriculture work with crop type labels that are organised in a hierarchical tree structure, where coarse classes (like orchards) are subdivided into finer ones (like apples, pears, vines, etc.). We develop a crop classification method that exploits this expert knowledge and significantly improves the mapping of rare crop types. The three-level label hierarchy is encoded in a convolutional, recurrent neural network (convRNN), such that for each pixel the model predicts three labels at different level of granularity. This end-to-end trainable, hierarchical network architecture allows the model to learn joint feature representations of rare classes (e.g., apples, pears) at a coarser level (e.g., orchard), thereby boosting classification performance at the fine-grained level. Additionally, labelling at different granularity also makes it possible to adjust the output according to the classification scores; as coarser labels with high confidence are sometimes more useful for agricultural practice than fine-grained but very uncertain labels. We validate the proposed method on a new, large dataset that we make public. ZueriCrop covers an area of 50 km × 48 km in the Swiss cantons of Zurich and Thurgau with a total of 116′000 individual fields spanning 48 crop classes, and 28,000 (multi-temporal) image patches from Sentinel-2. We compare our proposed hierarchical convRNN model with several baselines, including methods designed for imbalanced class distributions. The hierarchical approach performs superior by at least 9.9 percentage points in F1-score.
Article
Full-text available
Background: Predicting a list of plant taxa most likely to be observed at a given geographical location and time is useful for many scenarios in biodiversity informatics. Since efficient plant species identification is impeded mainly by the large number of possible candidate species, providing a shortlist of likely candidates can help significantly expedite the task. Whereas species distribution models heavily rely on geo-referenced occurrence data, such information still remains largely unused for plant taxa identification tools. Results: In this paper, we conduct a study on the feasibility of computing a ranked shortlist of plant taxa likely to be encountered by an observer in the field. We use the territory of Germany as case study with a total of 7.62M records of freely available plant presence-absence data and occurrence records for 2.7k plant taxa. We systematically study achievable recommendation quality based on two types of source data: binary presence-absence data and individual occurrence records. Furthermore, we study strategies for aggregating records into a taxa recommendation based on location and date of an observation. Conclusion: We evaluate recommendations using 28k geo-referenced and taxa-labeled plant images hosted on the Flickr website as an independent test dataset. Relying on location information from presence-absence data alone results in an average recall of 82%. However, we find that occurrence records are complementary to presence-absence data and using both in combination yields considerably higher recall of 96% along with improved ranking metrics. Ultimately, by reducing the list of candidate taxa by an average of 62%, a spatio-temporal prior can substantially expedite the overall identification problem.
Article
Full-text available
In recent years, Convolutional Neural Networks (CNNs) have shown remarkable performance in many computer vision tasks such as object recognition and detection. However, complex training issues, such as `catastrophic forgetting' and hyper-parameter tuning, make incremental learning in CNNs a difficult challenge. In this paper, we propose a hierarchical deep neural network, with CNNs at multiple levels, and a corresponding training method for incremental learning. The network grows in a tree-like manner to accommodate the new classes of data without losing the ability to identify the previously trained classes. The proposed network was tested on CIFAR-100 and reported 60.46% accuracy and 20% reduction in training effort as compared to retraining final layers of a deep network. The network organizes the incoming classes of data into feature-driven super-classes and improves upon existing hierarchical CNN models by adding the capability of self-growth.
Conference Paper
Appearance information alone is often not sufficient to accurately differentiate between fine-grained visual categories. Human experts make use of additional cues such as where, and when, a given image was taken in order to inform their final decision. This contextual information is readily available in many online image collections but has been underutilized by existing image classifiers that focus solely on making predictions based on the image contents. We propose an efficient spatio-temporal prior, that when conditioned on a geographical location and time, estimates the probability that a given object category occurs at that location. Our prior is trained from presence-only observation data and jointly models object categories, their spatio-temporal distributions, and photographer biases. Experiments performed on multiple challenging image classification datasets show that combining our prior with the predictions from image classifiers results in a large improvement in final classification performance.
Article
Sentinel-2 multi-spectral images collected over periods of several months were used to estimate vegetation height for Gabon and Switzerland. A deep convolutional neural network (CNN) was trained to extract suitable spectral and textural features from reflectance images and to regress per-pixel vegetation height. In Gabon, reference heights for training and validation were derived from airborne LiDAR measurements. In Switzerland, reference heights were taken from an existing canopy height model derived via photogrammetric surface reconstruction. The resulting maps have a mean absolute error (MAE) of 1.7 m in Switzerland and 4.3 m in Gabon (a root mean square error (RMSE) of 3.4 m and 5.6 m, respectively), and correctly estimate vegetation heights up to >50 m. They also show good qualitative agreement with existing vegetation height maps. Our work demonstrates that, given a moderate amount of reference data (i.e., 2000 km² in Gabon and ≈5800 km² in Switzerland), high-resolution vegetation height maps with 10 m ground sampling distance (GSD) can be derived at country scale from Sentinel-2 imagery.
Conference Paper
Object categories inherently form a hierarchy with different levels of concept abstraction, especially for fine-grained categories. For example, birds (Aves) can be categorized according to a four-level hierarchy of order, family, genus, and species. This hierarchy encodes rich correlations among various categories across different levels, which can effectively regularize the semantic space and thus make prediction less ambiguous. However, previous studies of fine-grained image recognition primarily focus on categories of one certain level and usually overlook this correlation information. In this work, we investigate simultaneously predicting categories of different levels in the hierarchy and integrating this structured correlation information into the deep neural network by developing a novel Hierarchical Semantic Embedding (HSE) framework. Specifically, the HSE framework sequentially predicts the category score vector of each level in the hierarchy, from highest to lowest. At each level, it incorporates the predicted score vector of the higher level as prior knowledge to learn finer-grained feature representation. During training, the predicted score vector of the higher level is also employed to regularize label prediction by using it as soft targets of corresponding sub-categories. To evaluate the proposed framework, we organize the 200 bird species of the Caltech-UCSD birds dataset with the four-level category hierarchy and construct a large-scale butterfly dataset that also covers four level categories. Extensive experiments on these two and the newly-released VegFru datasets demonstrate the superiority of our HSE framework over the baseline methods and existing competitors.