Digital taxonomist: Identifying plant species in community scientists’ photographs

December 2021
ISPRS Journal of Photogrammetry and Remote Sensing 182(9):112-121

DOI:10.1016/j.isprsjprs.2021.10.002

License
CC BY 4.0

Authors:

Riccardo de Lutio

ETH Zurich

Stefania Russo

ETH Zurich

Automatic identification of plant specimens from amateur photographs could improve species range maps, thus supporting ecosystems research as well as conservation efforts. However, classifying plant specimens based on image data alone is challenging: some species exhibit large variations in visual appearance, while at the same time different species are often visually similar; additionally, species observations follow a highly imbalanced, long-tailed distribution due to differences in abundance as well as observer biases. On the other hand, most species observations are accompanied by side information about the spatial, temporal and ecological context. Moreover, biological species are not an unordered list of classes but embedded in a hierarchical taxonomic structure. We propose a multimodal deep learning model that takes into account these additional cues in a unified framework. Our Digital Taxonomist is able to identify plant species in photographs better than a classifier trained on the image content alone; the performance gain is over 6 percentage points in terms of accuracy.

Sample distribution for all "Research Grade" iNaturalist observations in Switzerland. Note the logarithmic scale of the y-axis.

Example images from our dataset.

Results adding Sentinel-2 mosaic. Note that the top-k accuracies indicate the average species-specific metrics.

Year: 2021

de Lutio, Riccardo; She, Yihang; D’Aronco, Stefano; Russo, Stefania; Brun, Philipp; Wegner, Jan D; Schindler, Konrad

DOI: https://doi.org/10.1016/j.isprsjprs.2021.10.002

Posted at the Zurich Open Repository and Archive, University of Zurich

ZORA URL: https://doi.org/10.5167/uzh-208557

Journal Article

Published Version

The following work is licensed under a Creative Commons: Attribution 4.0 International (CC BY 4.0)

License.

Originally published at:

de Lutio, Riccardo; She, Yihang; D’Aronco, Stefano; Russo, Stefania; Brun, Philipp; Wegner, Jan D; Schindler, Konrad (2021). Digital taxonomist: Identifying plant species in community scientists’ photographs. ISPRS Journal of Photogrammetry and Remote Sensing, 182:112-121.

DOI: https://doi.org/10.1016/j.isprsjprs.2021.10.002

ISPRS Journal of Photogrammetry and Remote Sensing 182 (2021) 112–121

Available online 25 October 2021

open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

Riccardo de Lutio, Yihang She, Stefano D’Aronco, Stefania Russo, Philipp Brun, Jan D. Wegner, Konrad Schindler

EcoVision Lab, Photogrammetry and Remote Sensing, ETH Zürich, Switzerland

Land Change Science, Dynamic Macroecology, WSL, Switzerland

Institute for Computational Science, University of Zurich, Switzerland

ARTICLE INFO

Keywords:

Species recognition

Community science

Hierarchical classication

Multimodal learning

1. Introduction

Biodiversity describes the diversity of life in terms of species’

numbers, similarity, abundance, and distribution across spatial scales

(Barrotta and Gronda, 2020; Gaston and Spicer, 2004). Biodiversity is

essential to human well-being but rapidly deteriorating worldwide in

response to anthropogenic pressure (Díaz et al., 2019). To effectively

conserve biodiversity, its spatio-temporal distribution needs to be well

understood, which requires efficient monitoring schemes. Scientific

surveys conducted at regional or country scales are, however, costly in

terms of time and financial resources, as highly skilled professionals

need to repeatedly examine extensive geographical areas and carefully

document the encountered species.

One viable way to complement professional biodiversity monitoring

is the community science approach. The community science paradigm

aims at involving the general public in scientific observations and

investigations, and is particularly useful in cases where the experiment is

characterized by a large spatial and/or temporal scale (Silvertown,

2009). The community science approach has a long history in

biodiversity monitoring (Dickinson et al., 2010). For example, volun-

teers have participated in the annual Christmas Bird Counts of the Na-

tional Audubon Society in the USA since 1900 (Butcher and Niven,

2007).

With the rise of smartphones and other portable electronic devices,

community science in biodiversity monitoring has grown. Over the past

decade, a multitude of smartphone apps have been released, allowing

community scientists to conveniently report observations of plants and

animals, and to upload images to online databases. Among the most

popular of these apps is the iNaturalist (iNaturalist, 2021) initiative,

with over 3 million users and more than 36 million valid observations

distributed across the globe.

Although data gathered with community science is extremely valu-

able, it poses a number of challenges that need to be solved before it can

be exploited effectively. One major issue is data quality, i.e., it is

generally difcult to ensure that the collected data is correct and

consistent. The main reasons are that community science data (either in

the form of images or simple species presence observation) (i) are

collected by non-experts with varying training, expertise and skills, for

* Corresponding author.

E-mail address: riccardo.delutio@geod.baug.ethz.ch (R. de Lutio).

A valid observation is an observation that has a date, a location, media evidence (image or sound), and has not been voted captive/cultivated.

Received 26 May 2021; Received in revised form 4 October 2021; Accepted 4 October 2021

instance, community scientists will on average not be able to name rare

species as well as specialists; (ii) often exhibit significant biases due to

geographical variations in sampling effort, observation methods and

traditions, as well as regional differences in infrastructure and

accessibility.

In the context of biodiversity and species distribution mapping,

Machine Learning (ML) can provide several tools for mitigating at least

some of these limitations. For instance, the species recognition for the

data collected on the eld can be automatized to some extent to help the

community scientist. This can either be done on-device to assist the user

during data collection, as well as in a second step to assist the experts in

verifying the user-supplied labels. In recent years, computer vision has

made great progress, mostly due to the rise of statistical ML. In fact, the

application that spearheaded this development was the classification of

image content into human-defined (semantic) categories (Deng et al.,

2009). It is thus natural to ask whether ML can also assist community

scientists to classify their photographs into taxonomic species, helping

them to correctly identify what they have observed; thus paving the way

towards more accurate and larger-scale species distribution maps. Visual

species recognition has been studied fairly extensively in recent years,

with different image sources ranging from carefully collected zoological

or botanical collections to uncontrolled outdoor and camera trap data

(Khosla et al., 2011; Welinder et al., 2010; Beery et al., 2020). In this

paper, we specically focus on the case of recognising plant species in

data collected via community science applications such as iNaturalist

(iNaturalist, 2021) or Info Flora (Info Flora, 2021). Properties that

distinguish this specic scenario from other image classication tasks

include: (i) Species observation numbers show an imbalanced distribu-

tion, as some species are naturally rare or harder to nd and document

than others (and perhaps also less attractive to photograph), such that

they are rarely observed and only a few samples are available to train an

ML model; (ii) Side information is often readily available, e.g., the

location and time when the image was taken are usually known, and in

turn can be linked to further information like terrain maps, satellite

images, etc.; (iii) Biological species are related to each other in a

hierarchical manner, i.e., through a taxonomic tree, and one can leverage

these relations during both training and inference. In particular, one

may assume that, at any level of the hierarchy, species in the same group

are, on average, more similar than species in distinct groups (see

Fig. A1).

In this study, we develop an ML model for classifying community

science photographs. Our focus is on how to best exploit side information

that comes with the actual photograph, to improve species recognition.

By side information, we mean the locations and time points of the ob-

servations, as well as associated environmental variables and optical

satellite imagery. Location and time are usually uploaded together with

the images. Our model is inspired by other works such as (Chu et al.,

2019; Mac Aodha et al., 2019); however, there are a few key differences:

(i) we make use of additional metadata (altitude and Sentinel-2), (ii) we

train the model following a late fusion strategy and (iii) we make use of

the marginalisation loss (Kumar and Zheng, 2017).

Many environmental variables are publicly available, as are remote

sensing images, e.g., the Sentinel-2 satellite data repositories (Coperni-

cus open access hub, 2021). Moreover, we include the taxonomic hier-

archy to improve model performance at inference time. Hierarchically

structured class labels can be beneficial in two different ways: on the one

hand, the hierarchy can be used as a regularisation of the model, which

has been shown to improve the classification of rare classes (Turkoglu

et al., 2021); on the other hand, the hierarchy can also be used at

inference time to provide a prediction (at a coarser level) for species not

present in the list of the output classes. We investigate different strate-

gies to exploit the side information and empirically compare them. We

find that a model combining the community science images, spatio-temporal

context, hierarchical labels and remote sensing images

trained in a joint manner with a late fusion strategy performs the best.

We validate the proposed method on a subset of the iNaturalist cata-

logue, with 56,608 observations of 977 distinct plant species, which

includes observations of plant species across the territory of Switzerland.

2. Related work

2.1. Context-based modelling

Research has shown that the location context is important for

modeling the distribution of species, and therefore can especially benefit

fine-grained classification tasks. In (Wittich et al., 2018) the authors

adopt a nearest neighbour approach to predict the possible species that a

person could encounter at certain locations given the previously recor-

ded nearby observations. Although the paper acknowledges the fact that

such information can be used to help and speed up species recognition,

they do not combine their method with any image-based classification

model. In (Berg et al., 2014) the location and time where a photo was

taken are used to dene a prior distribution over bird species occur-

rences. An adaptive kernel density estimation is employed to construct

that distribution, which is then combined with probabilistic output from

a Support Vector Machine (SVM). Although the proposed method is

effective when using spatial and temporal metadata to improve

classification, the usage of SVM severely limited the overall performance.

Novel, deep learning-based methods can achieve higher accuracies on

the same dataset without spatio-temporal priors (Foret et al., 2021).

With the fast advancement of deep learning, researchers have developed

ways to utilise the location context with Convolutional Neural Networks

(CNNs). In (Tang et al., 2015) the authors investigate how to encode the

image’s GPS coordinate to increase prediction accuracy. The encoding is

then concatenated with the image representation from the CNN before

the nal (linear) classier. The paper also investigates the impact of

further map features, e.g., precipitation maps, alongside simple GPS

coordinates. Chu et al. (2019) and Mac Aodha et al. (2019) are two

studies that combine deep learning and geographical information to

improve species recognition accuracy. In (Chu et al., 2019) the authors

propose a renement network that merges the prediction from a CNN

with a secondary network that receives as input the location where the

image was taken. The weights of the CNN network are kept frozen while

training the renement module. As a second option, the paper proposes

a method where the location-aware network can alter the feature

extraction inside the CNN, based on the picture’s location. This second

technique, however, did not lead to a substantial improvement. Mac

Aodha et al. (2019) propose a slightly different solution for the same

problem: in this case, the network responsible for extracting the

geographical prior is in fact trained separately. The problem in this case

is that the dataset consists exclusively of positive labels, i.e., it contains

no information where the context speaks against a certain species label.

To overcome this, the authors propose a joint embedding loss able to

deal with presence-only datasets. The difference between the two ap-

proaches is that in the former (Chu et al., 2019) the geographical

network is trained to improve the image-based prediction coming from

the CNN, but cannot make a meaningful prediction on its own, i.e.,

without the CNN; whereas in the latter work (Mac Aodha et al., 2019)

the geographical network is trained separately and can also be evaluated

without an image, effectively producing a species distribution map.

2.2. Hierarchical labels

Complementary to location context, structure among the species

labels helps the classification task by sharing features among related

(i.e., nearby) classes. In (Srivastava and Salakhutdinov, 2013), the output

Namely, a sub-tree of the general hierarchy of (from top to bottom)

kingdom, phylum, class, order, family, genus and species (Stace, 1991).

These parameters constitute sensitive personal information, but community

scientists are usually willing to disclose them to geo-locate their observations.

classes are organised in a hierarchical structure, and features are

transferred between related classes to inject the a priori hierarchy into

the deep neural network classifier. (Yan et al., 2015) was another early

work that tackled hierarchical classification in the context of visual

recognition. The proposed method is limited to a 2-level hierarchy, and

it is composed of two classifiers: a coarser one, which separates more

easily distinguishable classes, and a finer one that resolves the more

difficult cases. (Xiao et al., 2014), and more recently (Roy et al., 2018)

analysed the use of hierarchical labels for visual recognition for the

specic case of incremental learning. A hierarchical classier for

clothing recognition was proposed in (Kumar and Zheng, 2017). The

model predicts a label hierarchy instead of a single label for the input

object, by analyzing detection errors. The method exhibits good gener-

alization capabilities also for novel clothing products that were not seen

during training. In the past years, researchers have explored different

ways to inject knowledge about hierarchical labels into neural networks.

The authors of (Chen et al., 2018) propose a framework to predict the

category scores at each hierarchy (tree) level in a top-down manner,

with a multi-head network where each branch is responsible for a

different level. Recently, (Dhall et al., 2020) have investigated and

compared a number of strategies and loss functions to integrate

hierarchical semantic structure into a CNN, including per-level classifiers,

hierarchical softmax, and a marginalisation loss. The marginalisation

loss summarizes the hierarchical information in a bottom-up manner

and, although being one of the simplest approaches, emerged as one of

the most effective. In (Turkoglu et al., 2021) the authors investigate the

task of classifying agricultural crops from a sequence of satellite images,

where the crop labels also exhibit a hierarchical structure (e.g., wheat is

more similar to other cereals than to, say, orchards). They propose a

convolutional recurrent architecture, where increasing depth in the

spatial/convolutional dimension corresponds to a finer hierarchy level,

thus deriving higher-level features for finer classification from coarser

lower-level features. The layout is specific to the recurrent setup and it is

unclear how to adapt it to conventional CNNs without disrupting the

feature extraction backbone.

As a general comment, we note that methods designed for hierar-

chical labels tend to use custom architectures and cannot easily be

combined with well-known, pre-trained high-performance backbones.

3. Methodology

We now outline our proposed model for plant species classification.

The model can be understood as composed of two branches: the first

branch infers a probability distribution over plant species, by looking

exclusively at the input image; the second branch infers another species

distribution only from the auxiliary information, which is then

combined with the image-based prediction to obtain a refined posterior

distribution. The entire two-branch network is supervised jointly with a

hierarchical loss that leverages the structure of the taxonomy.

3.1. Inference from image

Given an image I that depicts a certain plant specimen, we can use a

CNN to infer its species y. The network outputs a probability distribution

p(y|I;θ) over all C possible species, where θ are the learnable parameters

(convolution weights). To lighten the notation we drop θ when it is clear

from the context, and simply write p(y|I). In our implementation we use

the popular ResNet architecture (He et al., 2016), although other net-

works could also be employed. Our ResNet is pre-trained on ImageNet

(Deng et al., 2009), a setting that has become common practice to speed

up training and boost performance with limited data.

Fig. 1. Overview of our model.

Fig. 2. Different Hypericum species, in order H. androsaemum, H. calycinum, H. hirsutum and H. perforatum. The present species are visually similar but have different geographical distribution ranges. For such groups of species additional spatio-temporal information can help to improve classification accuracy. For each species we visualise the probability score learned by our Location Encoder (left), the location of the training samples (red dots) and a sample image from our training set (right). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

3.2. Inference from spatio-temporal context

As explained above, community science observations are often

accompanied by auxiliary information, in particular spatio-temporal

context, i.e., where and when the photo was taken. We denote that

spatio-temporal context by the vector x. The spatial information in-

cludes longitude (x), latitude (y), and altitude (z), while the day of the

year t represents the temporal information. This information is typically

included in the images’ metadata, except for the altitude, which

can be easily derived from the location given a Digital Elevation Model

(DEM). The spatio-temporal context of an observation has been shown

to be a useful cue for classifying species observations (see Section 2.1

and Fig. 2) – which is not surprising, as the probability of observing a

certain species varies greatly across space and time.

Several methods have been proposed to merge such auxiliary information

into the classification, for instance see (Chu et al., 2019; Mac

Aodha et al., 2019). We will briefly describe the different strategies and

highlight their pros and cons:

Early Fusion: In this case the image I and auxiliary information x are together fed into a model which shall directly predict p(y|I,x;θ,ϕ). That model is trained by minimizing a suitable loss function such as the cross-entropy between the predicted and true labels. The advantage of such an approach is that it does not impose any independence assumptions and the model can, in principle, leverage any statistical relation between y and the inputs (including correlations between I and x). However, this generality comes at a price: (i) at inference time the complete auxiliary information x must be fed to the model to obtain a reliable prediction, and (ii) if the training data is scarce, processing the two sources I and x together increases the risk of over-fitting to spurious correlations.

Separate Training: This approach, exemplified by (Mac Aodha et al., 2019), takes the opposite route and employs two completely separate networks: one “main” network processes only the image to obtain p(y|I;θ), the second “auxiliary” one processes only the side information to obtain p(y|x;ϕ). The two networks are trained separately and produce separate scores that are only merged at inference time. This corresponds to the assumption that I and x are independent, such that p(y|I,x) ∝ p(y|I)⋅p(y|x). The main advantage of this approach is a much reduced danger of over-fitting, as visual information and context are decorrelated. A further advantage is that one can use additional datasets without images to train the spatio-temporal prior. On the other hand, training that prior without supporting image information can also be difficult, particularly in the common situation with presence-only annotations (Mac Aodha et al., 2019). Finally, any real correlations between x and I will be lost, by construction.

Late Fusion: This approach, employed for instance as one of the methods in (Chu et al., 2019), constitutes a compromise between early fusion and the separate training. Separate branches are maintained for I and x. But their scores are not only combined during inference but also during training, with a joint loss function on the combined prediction p(y|I,x). The risk of over-fitting remains low compared to early fusion, as the model admits correlations between visual and auxiliary cues only “globally”, but not between individual variables: p(y|x) acts as a spatio-temporally varying rescaling of the image-based class scores p(y|I), and vice versa. At the same time, presence-only observations do not challenge the training of the spatio-temporal prior, as the loss is computed only after including the visual information.

All the aforementioned methods are legitimate design choices,

whether to prefer one or the other depends on the particular problem as

well as the available data. In the experiment section, we empirically

compare their performance for plant species classication. In terms of

network architecture, for separate training and late fusion, the auxiliary

information is rst embedded into a C-dimensional vector with a fully-

connected network (FCNcontext), with C the number of classes (see

Fig. 1). The FCNcontext, with parameters ϕ, has as last layer a sigmoid,

such that its output represents a presence/absence probability per class.

Note that the sigmoid (rather than a softmax over C classes) is chosen to

reect that, at a given place and time, multiple species can be present

with high probability.
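The late fusion strategy with a sigmoid context branch can be illustrated with a minimal sketch. It assumes that the merged score is the element-wise product of the two branch outputs, renormalised over the C classes, and that the joint loss is a cross-entropy on that merged score; the function names and toy numbers are hypothetical and not taken from the authors' implementation.

```python
import math


def softmax(logits):
    """Image branch: turn raw scores into a distribution p(y|I)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(v):
    """Context branch: independent presence probability per class."""
    return 1.0 / (1.0 + math.exp(-v))


def fuse(image_logits, context_logits):
    """Assumed merge rule: p(y|I,x) proportional to p(y|I) * p(y|x)."""
    p_img = softmax(image_logits)
    p_ctx = [sigmoid(v) for v in context_logits]
    joint = [a * b for a, b in zip(p_img, p_ctx)]
    z = sum(joint)  # renormalise so the merged scores sum to one
    return [j / z for j in joint]


def cross_entropy(probs, target):
    """Joint loss, computed only on the merged prediction."""
    return -math.log(probs[target])


# Toy case: classes 0 and 1 look alike, class 2 looks different.
image_logits = [2.0, 1.9, -1.0]
# The context says class 0 is unlikely at this place and time.
context_logits = [-3.0, 2.0, 0.0]

p_image_only = softmax(image_logits)
p_fused = fuse(image_logits, context_logits)
loss = cross_entropy(p_fused, target=1)
```

In this toy case the image branch alone slightly prefers class 0, while the fused score prefers class 1, which mirrors the rescaling role of p(y|x) described in the text.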

3.3. Inference using auxiliary Sentinel-2 images

Finally, given that we know the location where a specic species

observation was made, we can extract additional context information

from remotely-sensed sources, to potentially improve species

identification performance. To illustrate this, we add a Sentinel-2 image of the

region around x as further auxiliary data. Sentinel-2 was chosen for its

potential to supplement meaningful information about the local

ecosystem: it provides complete coverage of the region of interest

(Switzerland). We choose to only use the 4 bands with the highest spatial

resolution (10 m GSD) across the visible and infrared spectrum (ranging

from 0.5 to 1.0 μm). These are commonly used to derive vegetation

information and have been shown to be sufficient to derive further

vegetation parameters (Lang et al., 2019).

The satellite data S is fed into the model in a similar fashion as the

location context. The only difference is that the embedding of the raw

data into the C-dimensional vector p(y|S) is computed by a convolutional encoder

with its own parameters (rather than a fully-connected network), to account

for the nature of image data. In our implementation we use a ResNet-50.

As before, the embedded satellite imagery is combined with the other

inputs according to the late fusion strategy and all three branches are

trained jointly, via the merged score p(y|I,x,S).

3.4. Integration of taxonomic hierarchy

Hierarchical labels derived from plant taxonomy are another source

of non-visual a priori information about plant species. The taxonomic

hierarchy endows the output space with additional structure that may

help to correctly classify plant species, especially if the training data is

heavily imbalanced. Attempts to use the hierarchy rest on the assump-

tion that closely related species in the tree have higher visual similarity

than more distant ones. On the one hand, the hierarchical grouping

(for instance, of many rare species into a common genus) gives rare

species statistical strength, as confusing them with each other becomes

cheaper than confusing them with some frequently observed species

from a different genus. On the other hand, the grouping also benefits the

Thus assuming the distribution is seasonally varying but stationary over a

few years.

Such patterns are likely to exist. Examples include location-specic shadows

or time-dependent snow cover.

In expectation, not necessarily in every instance

fine-grained species classification, as it favours feature sharing between

adjacent classes that, by themselves, have too few samples to learn a

good representation (Srivastava and Salakhutdinov, 2013). The taxo-

nomic levels we use are, from the bottom to the top of the hierarchy:

species, genus, family, order, class and phylum.

To integrate hierarchical labels, we adopt the marginalisation loss

proposed in (Kumar and Zheng, 2017). As shown in Fig. 3, the output of

the classier is the probability distribution over all species. Margin-

alising over all species within each genus thus yields the probability

distribution over genera. This procedure can then be repeated to derive

the distribution over families, etc.:

p(y_i^l) = \sum_{j \in K_i} p(y_j^{l+1})    (1)

where p(y_i^l) is the predicted probability for the i-th label at hierarchy level l, and p(y_j^{l+1}) is the probability of class j at the next-finer hierarchy level l+1. With K_i we denote the set of child classes of parent class i. Based on the distribution p(y^l) derived at level l, we can compute a cross-entropy loss \mathcal{L}_l for each individual level. The marginalisation loss is then simply the sum of all these intermediate losses:

\mathcal{L}_{mar} = \sum_l \mathcal{L}_l.    (2)
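Eqs. (1) and (2) can be made concrete with a small sketch. It assumes the hierarchy is given as one child-to-parent index list per level (a hypothetical encoding chosen for illustration) and uses a toy tree of four species, two genera and one family; it is not the authors' code.

```python
import math


def marginalise(child_probs, parent_of):
    """Eq. (1): sum the probabilities of all child classes of each parent."""
    parent_probs = [0.0] * (max(parent_of) + 1)
    for child, prob in enumerate(child_probs):
        parent_probs[parent_of[child]] += prob
    return parent_probs


def marginalisation_loss(species_probs, species_label, parent_maps):
    """Eq. (2): sum of the cross-entropy losses of all hierarchy levels."""
    probs, label = species_probs, species_label
    loss = -math.log(probs[label])  # species-level term
    for parent_of in parent_maps:   # walk up: genus, family, ...
        probs = marginalise(probs, parent_of)
        label = parent_of[label]
        loss += -math.log(probs[label])
    return loss


# Toy hierarchy (hypothetical): species 0,1 -> genus 0; species 2,3 -> genus 1.
species_to_genus = [0, 0, 1, 1]
genus_to_family = [0, 0]

species_probs = [0.1, 0.6, 0.2, 0.1]  # classifier output over species
genus_probs = marginalise(species_probs, species_to_genus)
loss = marginalisation_loss(
    species_probs, 1, [species_to_genus, genus_to_family]
)
```

Because the coarser distributions are derived from the species output rather than predicted by separate heads, the sketch needs no extra parameters per level, which is what makes this loss easy to attach to a standard backbone.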

3.5. Data preprocessing

All community science images were resized to the size of 256 × 256

and then centre-cropped to 224 × 224. The images used for training

were additionally augmented by random rotations, random horizontal

flips and color-jitter, which are all standard methods to help mitigate the

risk of over-fitting. Furthermore, all images were normalized according

to the mean and standard deviation of the training set.

We encode the observation time, measured as day of year $t$, into $(t_1, t_2)$ using the sine-cosine mapping (Mac Aodha et al., 2019), Eq. 3. In this way December 31st and January 1st are mapped close to each other, correctly accounting for the cyclic nature of the variable.

$$\begin{cases} t_1 = \sin\left(\dfrac{2\pi t}{365}\right) \\[4pt] t_2 = \cos\left(\dfrac{2\pi t}{365}\right) \end{cases} \tag{3}$$
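The effect of the cyclic encoding in Eq. 3 can be checked numerically. The following sketch (function and variable names are illustrative) encodes three days of the year and compares Euclidean distances in the encoded space.

```python
import math


def encode_day_of_year(t, period=365):
    """Map day of year t onto the unit circle, as in Eq. 3."""
    angle = 2.0 * math.pi * t / period
    return math.sin(angle), math.cos(angle)


def distance(a, b):
    """Euclidean distance between two encoded dates."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


jan_1 = encode_day_of_year(1)
jul_1 = encode_day_of_year(182)
dec_31 = encode_day_of_year(365)

gap_new_year = distance(dec_31, jan_1)   # one day apart across the year boundary
gap_half_year = distance(jan_1, jul_1)   # roughly half a year apart
print(gap_new_year, gap_half_year)
```

With the raw day number, December 31st and January 1st would differ by 364; after the mapping they are nearly identical, while dates half a year apart end up on opposite sides of the circle.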

Regarding the location coordinates, we rescale longitude, latitude and altitude separately to fit into the interval $[-1, 1]$ and denote the triple of normalised coordinates as our geo-location $(x, y, z)$.
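One simple way to obtain such a rescaling is a linear min-max mapping per coordinate, sketched below. This is an assumption: the text does not specify the exact mapping, and the coordinate bounds used here are rough illustrative values for Switzerland rather than the ones used in the experiments.

```python
def rescale(value, lower, upper):
    """Linearly map value from [lower, upper] to [-1, 1]."""
    return 2.0 * (value - lower) / (upper - lower) - 1.0


# Illustrative bounds only; the paper does not list the actual ranges.
LON_RANGE = (5.9, 10.5)       # degrees east
LAT_RANGE = (45.8, 47.9)      # degrees north
ALT_RANGE = (190.0, 4640.0)   # metres above sea level


def geo_location(lon, lat, alt):
    """Return the normalised triple (x, y, z)."""
    return (
        rescale(lon, *LON_RANGE),
        rescale(lat, *LAT_RANGE),
        rescale(alt, *ALT_RANGE),
    )


x, y, z = geo_location(8.5, 47.4, 400.0)
print(x, y, z)
```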

Finally, the Sentinel-2 images are extracted from a cloud-free mosaic of images taken in 2020. As previously indicated, we only use the four spectral bands with a 10 m spatial resolution (R, G, B and N-IR), since they are often sufficient to derive vegetation parameters (Lang et al., 2019). From this mosaic, we extract patches of 256 × 256 pixels to ensure enough context (ca. 1.3 km around the sample location; see Table A.2 for a comparison of the performance with different patch sizes).
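The quoted extent follows directly from the patch size and the 10 m ground sampling distance. As a quick check, assuming the patch is centred on the sample location (the helper name is illustrative):

```python
def patch_half_extent_km(pixels, resolution_m=10.0):
    """Distance from the patch centre to its edge, in kilometres."""
    return pixels * resolution_m / 2.0 / 1000.0


# Patch sizes compared in Table A.2.
half_extents = {pixels: patch_half_extent_km(pixels) for pixels in (128, 256, 512)}
print(half_extents)
```

For the 256 × 256 patch this gives 1.28 km from the centre to each edge, i.e. the "ca. 1.3 km" mentioned above.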

3.6. Balanced sampling

We used a balanced sampling strategy, where the sampling weight $W_i$ of each image is inversely proportional to the number of images $N_{y_i}$ of the corresponding class $y_i$:

$$W_i = \frac{1}{N_{y_i}} \tag{4}$$

This strategy will oversample the rare species from the tail of the distribution and undersample the frequent species from the head of the distribution, so as to mitigate the impact of the imbalance on the classifier. It should be noted that the effect cannot be completely removed: even when sampled with higher frequency, the few images of a rarely observed species will inevitably carry less information than the many example images of an abundant species. As a result, neither of the two approaches has a clear advantage, making this a mere design choice. In fact, using the balanced sampling strategy, compared to the conventional training method, improves the per-class accuracy while decreasing the overall accuracy (see Table A.1). Although the differences in performance are small, we decide to prioritise the per-class accuracy since we believe it is more important for our application and thus choose to use the balanced sampling approach.

Fig. 3. The idea of the marginalisation loss is to simultaneously apply a cross-entropy loss at all levels of the taxonomic hierarchy. As the output of the classifier is the probability distribution over all species, marginalising over all species within each genus yields the probability distribution over genera. This procedure can then be repeated to derive the distributions at all higher levels. The marginalisation loss is simply the sum of the intermediate losses computed at each level.

Fig. 4. Sample distribution for each species in the training dataset. Note the logarithmic scale of the y-axis. Both diagrams share the same scale.

Fig. 4 shows the number of training images per species in the training

set before and after balanced sampling.
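To make the effect of the weights in Eq. 4 concrete, the sketch below draws samples with replacement from a strongly imbalanced toy label list. The two class names, their counts and the use of `random.choices` are illustrative assumptions, not the sampler used in the experiments.

```python
import random
from collections import Counter


def balanced_weights(labels):
    """Eq. 4: weight of each image is 1 / (number of images of its class)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


# Toy dataset: 90 images of a common species, 10 of a rare one.
labels = ["common"] * 90 + ["rare"] * 10
weights = balanced_weights(labels)

rng = random.Random(0)
drawn = rng.choices(labels, weights=weights, k=10_000)
share_rare = drawn.count("rare") / len(drawn)
print(share_rare)
```

Because the weights of each class sum to one, both classes are drawn about equally often, although the rare class contributes only 10 distinct images that are repeated far more frequently.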

We employ stochastic gradient descent (SGD) (Bottou, 2012) to optimise the parameters of our model. We set two different learning rates, a smaller one of $5 \cdot 10^{-5}$ for the pre-trained convolutional layers of the CNN, and a larger one of $2 \cdot 10^{-3}$ for the fully connected layers. These learning rates are further reduced when a plateau is reached. The batch size is fixed to 32, and all models were trained for 100 epochs. We use a cross-entropy loss for baselines that ignore the label hierarchy, and the marginalisation loss (Eq. 2) for hierarchically structured labels.
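The combination of per-group learning rates and a plateau-based reduction can be sketched independently of any deep learning framework. In the example below only the two initial learning rates come from the text; the class name, the reduction factor, the patience and the monitored loss values are assumptions chosen for illustration.

```python
class PlateauSchedule:
    """Shrink all group learning rates when the monitored loss stops improving."""

    def __init__(self, learning_rates, factor=0.1, patience=5):
        self.learning_rates = dict(learning_rates)
        self.factor = factor        # assumed value, not stated in the paper
        self.patience = patience    # assumed value, not stated in the paper
        self.best = float("inf")
        self.stale_epochs = 0

    def step(self, validation_loss):
        """Update the schedule with the loss of the epoch that just finished."""
        if validation_loss < self.best:
            self.best = validation_loss
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
            if self.stale_epochs > self.patience:
                for name in self.learning_rates:
                    self.learning_rates[name] *= self.factor
                self.stale_epochs = 0
        return self.learning_rates


# Learning rates from the text: pre-trained conv layers vs. fully connected layers.
schedule = PlateauSchedule({"conv_layers": 5e-5, "fc_layers": 2e-3}, patience=2)
for loss in [1.0, 0.8, 0.8, 0.8, 0.8]:   # made-up validation losses
    rates = schedule.step(loss)
print(rates)
```

After three epochs without improvement (more than the patience of two), both learning rates are multiplied by the factor, so the ratio between the two parameter groups is preserved.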

All results (unless stated otherwise) are computed with 5-fold cross-validation, stratified to ensure a uniform class distribution across all folds.

4. Dataset

From the iNaturalist database (iNaturalist, 2021), we have downloaded all images of plants that are located in Switzerland and labeled as “Research Grade”. The latter constitutes the highest level of data quality, where observations meet five criteria: (1) they must include a date, (2) a spatial geo-reference, (3) a picture (or sound, but we only focus on images in this work), (4) the subject must be a naturally living organism (not captive or cultivated), and (5) at least 2 identifiers should agree on a taxon, out of a minimum of 3 identifiers.

As shown in Table 1, a total of 60,781 images were downloaded (see Fig. 6), representing 2,374 species. However, as seen in Fig. 5, the dataset is highly imbalanced and follows a long-tail distribution. We discard all species with <10 images in order to ensure reliability and statistical significance of the experimental results. After this filtering we are left with 56,608 images representing 977 species. We also generated a dataset of unseen species for further experiments (see Section 5.4). These are observations of species that have fewer than 10 but more than 5 images. For each of those species, we select 5 images at random.
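The filtering rule can be written down compactly. The sketch below follows the thresholds as worded in this section (at least 10 images for the selected set; more than 5 but fewer than 10 for the unseen set, with 5 images drawn at random). The function name, the `(image_id, species)` input format and the toy observations are hypothetical.

```python
import random
from collections import defaultdict


def split_by_frequency(observations, min_images=10, unseen_images=5, seed=0):
    """Split species into the selected set and the unseen set by image count."""
    by_species = defaultdict(list)
    for image_id, species in observations:
        by_species[species].append(image_id)

    rng = random.Random(seed)
    selected, unseen = {}, {}
    for species, images in by_species.items():
        if len(images) >= min_images:
            selected[species] = images
        elif len(images) > unseen_images:
            # Keep exactly 5 randomly chosen images per unseen species.
            unseen[species] = rng.sample(images, unseen_images)
        # Species with 5 or fewer images are dropped entirely.
    return selected, unseen


# Hypothetical toy observations: 12, 7 and 3 images for three species.
observations = (
    [(f"a{i}", "species_a") for i in range(12)]
    + [(f"b{i}", "species_b") for i in range(7)]
    + [(f"c{i}", "species_c") for i in range(3)]
)
selected, unseen = split_by_frequency(observations)
print(sorted(selected), sorted(unseen))
```

Five images for each of the 330 unseen species give the 1,650 images listed in Table 1.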

Besides the images, the dataset also contains non-visual information,

including the additional data that we use in our model, i.e., longitude,

latitude, day of the year and hierarchical labels. To obtain altitude we

extract the height value corresponding to the given geo-location from

the swissALTI3D DEM of the Swiss national mapping agency (Swisstopo,

2021).

5. Experimental results

5.1. Model performance

We have conducted experiments with the following models to empirically determine their performance gain: (1) Baseline, which corresponds to a standard ResNet50; (2) Baseline + Location Context, where we add the location encoder to the baseline in a late fusion setup; (3) Baseline + Hierarchical Labels, where we add the marginalisation loss to the baseline; and (4) Proposed Model, which leverages both the location context and the hierarchical labels. Here, we again use the late fusion strategy, which empirically achieved the best performance (see Table 3).

Table 1
Overview of our dataset.

| Description | Images | Species |
| --- | --- | --- |
| Overall | 60,781 | 2,374 |
| Selected | 56,608 | 977 |
| Unseen | 1,650 | 330 |

Fig. 5. Sample distribution for all “Research Grade” iNaturalist observations in

Switzerland. Note the logarithmic scale of the y-axis.

Fig. 6. Example images from our dataset.

Table 3
Comparison of different training strategies. Note that the top-k accuracies indicate the average species-specific metrics.

| Model | Accuracy (%) | Top-1 (%) | Top-3 (%) | Top-5 (%) |
| --- | --- | --- | --- | --- |
| Separate Training | 76.41 | 65.29 | 82.09 | 86.58 |
| Joint Training: Early Fusion | 73.65 | 64.84 | 79.60 | 83.87 |
| Joint Training: Late Fusion | 79.12 | 69.76 | 84.86 | 88.95 |

As of November 5th, 2020.


As seen in Table 2, adding either the location context or the hierarchical labels to the baseline model significantly improves the results, for all metrics. Note that we compute the top-k accuracies as the average species-specific metrics in order to give the same weight to all the species in the evaluation. Thus the overall accuracy is higher than the top-1 hit-rate due to the imbalanced nature of our dataset, which is preserved in our stratified cross-validation. Furthermore, improvements from location context and hierarchical labels are largely orthogonal, as expected, since they leverage different types of information. These results indicate a clear benefit of complementing visual cues from community science images with additional sources of information. For a more detailed ablation study of the exact contributions of every component in our location context see Section 5.5. Visual inspection of misclassified images confirms that location context helps in the case of visually similar species that occur in different geographical regions (see Fig. 7), whereas hierarchical labels help to classify species with few images.
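The difference between the overall accuracy and the species-averaged top-k accuracies can be reproduced on a toy example. The function names and the data below are illustrative; each prediction is represented as a list of class names ranked by decreasing score.

```python
from collections import defaultdict


def overall_accuracy(true_labels, ranked_predictions):
    """Fraction of all images whose top-ranked prediction is correct."""
    hits = sum(t == ranked[0] for t, ranked in zip(true_labels, ranked_predictions))
    return hits / len(true_labels)


def mean_per_class_top_k(true_labels, ranked_predictions, k=1):
    """Top-k accuracy computed per species, then averaged over species."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, ranked in zip(true_labels, ranked_predictions):
        totals[t] += 1
        hits[t] += t in ranked[:k]
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy example: 8 images of a common species, 2 of a rare one.
true_labels = ["common"] * 8 + ["rare"] * 2
ranked = [["common", "rare"]] * 8 + [["common", "rare"], ["rare", "common"]]

accuracy = overall_accuracy(true_labels, ranked)
top1 = mean_per_class_top_k(true_labels, ranked, k=1)
top2 = mean_per_class_top_k(true_labels, ranked, k=2)
print(accuracy, top1, top2)
```

The single error on the rare species costs 10 percentage points of overall accuracy but 25 points of the species-averaged top-1 score, which is why the latter is lower on imbalanced data.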

Finally, as seen in Fig. 8, our proposed model improves over the baseline for all four ranges of species counts, and the margin of improvement is largest for the tail species of the dataset, which have between 10 and 50 images. This is very useful since rare species are more commonly misidentified in community science and are particularly important for conservation purposes.

5.2. Training strategies

Table 3 compares the three different training strategies described in Section 3.2. The separate training strategy has the advantage that one can use the image classifier and get reasonable predictions even when metadata is missing, whereas the joint training strategies should always perform better, at the cost of being less flexible, as metadata is mandatory. Under ideal circumstances, one would also expect the early fusion strategy to perform best, as it is not subject to any factorisation constraints on $p(y|I,x)$ and can leverage the complete correlation structure. In practice, however, we observe the worst performance, see Table 3. It appears that the increased model capacity leads to overfitting. The late fusion training strategy, with its restricted interaction between image and context cues, emerges as the best compromise with clearly superior performance. Separate training does bring a noticeable improvement over the baseline but does not reach the late fusion approach. Likely this is, at least in part, due to the presence-only labels hampering the learning of the prior $p(y|x)$.
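To illustrate what a factorisation of $p(y|I,x)$ into an image term and a location prior can look like, the sketch below multiplies the two distributions and renormalises, in the spirit of Mac Aodha et al. (2019). This is an assumption for illustration only: the exact combination rule of Section 3.2 is not restated here, and the function name and all probability values are made up.

```python
def fuse(image_probs, location_prior):
    """Combine p(y|I) and p(y|x) by an element-wise product, then renormalise."""
    joint = [p_image * p_prior for p_image, p_prior in zip(image_probs, location_prior)]
    total = sum(joint)
    return [value / total for value in joint]


# Made-up scores for three visually similar species.
image_probs = [0.5, 0.4, 0.1]      # image classifier slightly prefers class 0
location_prior = [0.1, 0.8, 0.1]   # class 1 is far more plausible at this location

posterior = fuse(image_probs, location_prior)
best_image_only = image_probs.index(max(image_probs))
best_fused = posterior.index(max(posterior))
print(posterior, best_image_only, best_fused)
```

Here the location prior overturns a narrow image-only decision, which is the situation shown for visually similar species with different ranges.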

5.3. Evaluation at different hierarchy levels

When using the taxonomic hierarchy during training in conjunction

with the marginalisation loss, we can predict at inference time labels at

different hierarchy levels. If taxonomic distance indeed correlates with

similar visual features and ecological requirements (see Fig. A1), then

the predictions at higher levels should be increasingly more correct. I.e.,

even if a specimen is assigned the wrong species label it might be

assigned the correct genus label, as it is more likely to be confused with a

similar species from the same genus.
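The following sketch shows how a wrong species prediction can still yield the correct genus once the species probabilities are summed per genus. The species names are taken from the figures of this paper, but the probabilities and the helper function are invented for illustration.

```python
def predict_at_level(species_probs, species_to_group):
    """Sum species probabilities per group and return the most likely group."""
    group_probs = {}
    for species, prob in species_probs.items():
        group = species_to_group[species]
        group_probs[group] = group_probs.get(group, 0.0) + prob
    return max(group_probs, key=group_probs.get), group_probs


# Invented scores for an image whose true label is Phyteuma orbiculare.
species_probs = {
    "Phyteuma orbiculare": 0.35,
    "Phyteuma hemisphaericum": 0.40,
    "Hypericum perforatum": 0.25,
}
# The genus is the first word of the binomial species name.
species_to_genus = {name: name.split()[0] for name in species_probs}

predicted_species = max(species_probs, key=species_probs.get)
predicted_genus, genus_probs = predict_at_level(species_probs, species_to_genus)
print(predicted_species, predicted_genus, genus_probs)
```

The species-level prediction is wrong, yet the genus-level prediction is correct and carries a much higher probability than any single species.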

We have evaluated our model at all taxonomic levels that we use, see Table 5. Indeed, the performance is better for the higher levels (c.f. Table 4). Furthermore, higher up in the hierarchy, fewer classes are poorly represented; the long-tail distribution is less extreme.

Fig. 7. Misclassification example: both images of Phyteuma orbiculare (left) are misclassified as Phyteuma hemisphaericum (right) by our baseline model. When including the location context, our proposed model correctly classifies the image with the green frame, whereas the image with the red frame is still misclassified. The green and red arrows indicate the locations of the respective left two images. The underlying maps are the species distribution maps downloaded from Info Flora (Info Flora, 2021). This highlights the importance of including the location information to distinguish visually similar species that have different geographical ranges. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 8. Improvement in Mean Accuracy over the baseline for species with different numbers of images in the dataset.

Table 2
Ablation study of our proposed model. Note that the top-k accuracies denote the average species-specific metrics in order to give the same weight to all the species in the evaluation.

| Model | Accuracy (%) | Top-1 (%) | Top-3 (%) | Top-5 (%) |
| --- | --- | --- | --- | --- |
| Baseline | 73.48 | 62.48 | 79.04 | 83.97 |
| Baseline + Location Context | 76.99 | 67.47 | 82.50 | 87.01 |
| Baseline + Hierarchical Labels | 76.30 | 65.49 | 82.20 | 86.81 |
| Proposed Model | 79.12 | 69.76 | 84.86 | 88.95 |

Note that the chance level also increases, as there are fewer possible labels.

5.4. Experiments with unseen species

Given the hierarchical labels, it is also possible to classify new species which the classifier has not seen at all during training. While the assigned species label will necessarily always be wrong, one would hope that the predictions at coarser taxonomy levels are often sensible. For this experiment, we picked 330 species that were initially discarded from our dataset for having <10 images, but for which at least 5 images are available, c.f. the “Unseen” row in Table 1. The corresponding results in Table 6 confirm our intuition: while there is of course a significant performance drop compared to the trained species, it is still possible to classify unseen species into the right genus, family or order with reasonable performance (24.27%, 41.86% and 50.23% accuracy, respectively), well above chance level (the probability of success of a classifier that always predicts the most common class). This capability can be extremely useful in the context of community science, where the coarser labels can be used to refer examples to the right expert for classification or to detect gaps in the taxonomy lists offered to users.
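The chance level as defined above, i.e. the accuracy of a classifier that always predicts the most common class, is easy to compute from the label counts. The function name, the genus names and the counts in this sketch are hypothetical.

```python
from collections import Counter


def chance_level(labels):
    """Accuracy of a classifier that always predicts the most common class."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


# Hypothetical genus labels of ten test images.
genus_labels = ["Carex"] * 6 + ["Salix"] * 3 + ["Viola"] * 1
baseline = chance_level(genus_labels)
print(baseline)
```

Since coarser levels have fewer and larger classes, this baseline rises towards the top of the hierarchy, so accuracies at different levels should each be compared against their own chance level.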

5.5. Contextual information and Sentinel-2 images ablation study

To investigate the contributions of different types of contextual information, and the potential benefit of adding satellite imagery, we perform extensive ablation studies.

In Table 7 we show the impact of the different contextual information (altitude, geo-coordinates, day of the year) on the evaluated metrics. As can be seen, they all contribute to some extent, with the altitude being one of the most important. Considering the high altitude variability of the Swiss landscape, it was rather expected that the altitude could carry the most valuable information. When the full context is combined, the performance metrics show a further increase, meaning that the additional data carry orthogonal information.

Finally, Table 8 displays the performance achieved with the integration of Sentinel-2 imagery. Overall, its impact turns out to be small. When naively adding the Sentinel-2 branch, performance even drops slightly, apparently due to overfitting. By adding standard dropout regularisation (Srivastava et al., 2014) on the last fully-connected layer, we were able to remedy this behaviour and achieve a mild (but still statistically significant) performance gain. To ensure that the difference is actually caused by the satellite imagery and not the dropout, we add an additional baseline where the model without the Sentinel-2 branch is trained with dropout. Interestingly, this even degraded the performance.

While it is promising that the much-enriched context information from the satellite image brings an improvement over the simple geo-location, that gain is relatively modest, at least with our implementation. Further research, beyond the scope of the present paper, will be needed to clarify the potential of satellite (or airborne) data as auxiliary information.

6. Conclusion

In this work, we have demonstrated that easily accessible side information can bring rather large performance gains when classifying community science photographs. We have focused on the spatio-temporal context of the observations, and have shown how it can refine the classification model by providing relevant prior knowledge regarding the distribution and occurrence of species observations. We have also briefly touched on extended radiometric context from optical satellite imagery, a direction where we see quite some potential for further research. Moreover, we have verified that exploiting the hierarchical structure of biological taxonomy not only improves the species recognition performance, but also enables more reliable predictions at coarser taxonomy levels, and even coarse classification of species not seen at all during the classifier training.

In terms of practical community science applications, our model is also a step towards a viable scheme for verifying user-supplied labels. For instance, the proposed method could provide hints to the community scientist when labelling the species, or it could facilitate review and validation by experts by marking specific observations where the model disagrees with the label provided by the community scientist. Of course, these suggestions would need to be followed with care in practice to avoid creating a confirmation bias of the model. We hope that, ultimately, a larger number of correct species observations will contribute to better species distribution models, to inform biodiversity research and conservation initiatives, particularly for rare species.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Table 4
Number of classes at each hierarchical level.

| Level | Species | Genus | Family | Order | Class | Phylum |
| --- | --- | --- | --- | --- | --- | --- |
| Number | 977 | 489 | 121 | 50 | 8 | 3 |

Table 6
Accuracy (%) on unseen species.

| Evaluation Set | Species | Genus | Family | Order | Class | Phylum |
| --- | --- | --- | --- | --- | --- | --- |
| 5-fold Cross-Val | 79.02 | 83.39 | 87.26 | 88.54 | 97.24 | 99.89 |
| Unseen Species | – | 24.27 | 41.86 | 50.23 | 85.60 | 96.00 |

Table 7
Ablation study of spatio-temporal context. Note that the top-k accuracies indicate the average species-specific metrics.

| Model | Accuracy (%) | Top-1 (%) | Top-3 (%) | Top-5 (%) |
| --- | --- | --- | --- | --- |
| Baseline | 73.48 | 62.48 | 79.04 | 83.97 |
| Baseline + Altitude | 75.40 | 65.11 | 81.00 | 85.80 |
| Baseline + Geo-coordinates | 75.07 | 64.86 | 80.63 | 85.45 |
| Baseline + Day of the year | 75.51 | 65.00 | 80.91 | 85.51 |
| Baseline + Full Location Context | 76.99 | 67.47 | 82.50 | 87.01 |

Table 8
Results adding Sentinel-2 mosaic. Note that the top-k accuracies indicate the average species-specific metrics.

| Model | Accuracy (%) | Top-1 (%) | Top-3 (%) | Top-5 (%) |
| --- | --- | --- | --- | --- |
| Proposed model | 79.12 | 69.76 | 84.86 | 88.95 |
| Proposed model with Dropout | 78.02 | 67.77 | 83.39 | 87.52 |
| Proposed model + Sen-2 | 78.59 | 68.29 | 84.43 | 88.60 |
| Proposed model + Sen-2 with Dropout | 79.73 | 70.32 | 85.52 | 89.33 |

Table 5
Results at different hierarchical levels. Note that top-3 and top-5 accuracy at phylum level are meaningless, since there are only 3 possible phyla. Note that the top-k accuracies indicate the average species-specific metrics.

| Metric (%) | Species | Genus | Family | Order | Class | Phylum |
| --- | --- | --- | --- | --- | --- | --- |
| Accuracy | 79.02 | 83.39 | 87.26 | 88.54 | 97.24 | 99.89 |
| Top-1 | 69.50 | 73.23 | 75.84 | 78.53 | 88.52 | 86.63 |
| Top-3 | 84.49 | 85.76 | 87.81 | 89.34 | 98.01 | 100.0 |
| Top-5 | 88.57 | 89.37 | 91.45 | 92.95 | 99.51 | 100.0 |


Appendix A. Ablation studies

Table A.1
Ablation study of balanced sampling. Note that the top-k accuracies denote the average species-specific metrics.

| Model | Accuracy (%) | Top-1 (%) | Top-3 (%) | Top-5 (%) |
| --- | --- | --- | --- | --- |
| Baseline + No Balanced Sampling | 75.57 | 62.08 | 80.9 | 86.18 |
| Baseline + Balanced Sampling | 73.48 | 62.48 | 79.04 | 83.97 |
| Proposed Model + No Balanced Sampling | 80.05 | 69.23 | 86.15 | 90.29 |
| Proposed Model + Balanced Sampling | 79.12 | 69.76 | 84.86 | 88.95 |

Table A.2
Comparison of different extents for Sentinel-2 images. Note that the top-k accuracies indicate the average species-specific metrics.

| Model | Accuracy (%) | Top-1 (%) | Top-3 (%) | Top-5 (%) |
| --- | --- | --- | --- | --- |
| No Sent-2 | 79.12 | 69.76 | 84.86 | 88.95 |
| Small Sent-2 (128 × 128) | 79.5 | 69.56 | 84.76 | 88.92 |
| Normal Sent-2 (256 × 256) | 79.73 | 70.32 | 85.52 | 89.33 |
| Large Sent-2 (512 × 512) | 79.16 | 69.9 | 84.94 | 88.91 |

Appendix B. Confusion matrix

Fig. A1. Confusion matrix for our proposed model. The species in the rows and columns have been ordered based on their taxonomy. The hierarchy between the species is made apparent by the dendrograms. The block-like structure along the diagonal indicates that species that are close in terms of their taxonomy are misclassified for each other more often than unrelated species.


References

Barrotta, P., Gronda, R., 2020. Controversies and Interdisciplinarity: Beyond disciplinary fragmentation for a new knowledge model. In: Ch. What is the Meaning of Biodiversity?, 16. John Benjamins Publishing Company, pp. 115–131.

Beery, S., Cole, E., Gjoka, A., 2020. The iWildCam 2020 competition dataset. In: Proceedings, IEEE Conference on Computer Vision and Pattern Recognition Workshops.

Berg, T., Liu, J., Woo Lee, S., Alexander, M.L., Jacobs, D.W., Belhumeur, P.N., 2014. Birdsnap: Large-scale fine-grained visual categorization of birds. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2011–2018.

Bottou, L., 2012. Neural Networks: Tricks of the Trade, second ed. Berlin Heidelberg: Springer, pp. 421–436 (Ch. Stochastic Gradient Descent Tricks).

Butcher, G., Niven, D., 2007. Combining data from the Christmas Bird Count and the Breeding Bird Survey to determine the continental status and trends of North America birds. Tech. rep. National Audubon Society.

Chen, T., Wu, W., Gao, Y., Dong, L., Luo, X., Lin, L., 2018. Fine-grained representation learning and recognition by exploiting hierarchical semantic embedding. In: Proceedings of the ACM International Conference on Multimedia, pp. 2023–2031.

Chu, G., Potetz, B., Wang, W., Howard, A., Song, Y., Brucher, F., Leung, T., Adam, H., 2019. Geo-aware networks for fine-grained recognition. In: Proceedings, IEEE International Conference on Computer Vision Workshops, pp. 247–254.

Copernicus open access hub. https://scihub.copernicus.eu (last accessed on 26.05.2021).

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L., 2009. ImageNet: A large-scale hierarchical image database. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.

Dhall, A., Makarova, A., Ganea, O., Pavllo, D., Greeff, M., Krause, A., 2020. Hierarchical image classification using entailment cone embeddings. In: Proceedings, IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 836–837.

Díaz, S., Settele, J., Brondízio, E., Ngo, H., Guèze, M., Agard, J., Arneth, A., Balvanera, P., Brauman, K., Butchart, S., Chan, K., Garibaldi, L., Ichii, K., Liu, J., Subramanian, S., Midgley, G., Miloslavich, P., Molnár, Z., Obura, D., Pfaff, A., Polasky, S., Purvis, A., Razzaque, J., Reyers, B., Chowdhury, R., Shin, Y., Visseren-Hamakers, I., Willis, K., Zayas, C., 2019. Summary for policymakers of the global assessment report on biodiversity and ecosystem services. Tech. rep. Intergovernmental Science-Policy Platform on Biodiversity and Ecosystem Services.

Dickinson, J.L., Zuckerberg, B., Bonter, D.N., 2010. Citizen science as an ecological research tool: challenges and benefits. Ann. Rev. Ecol. Evol. Systematics 41, 149–172.

Foret, P., Kleiner, A., Mobahi, H., Neyshabur, B., 2021. Sharpness-aware minimization for efficiently improving generalization. In: Proceedings of the International Conference on Learning Representations.

Gaston, K.J., Spicer, J.I., 2004. Biodiversity: An introduction, second ed. Blackwell Publishing.

He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.

iNaturalist. https://www.inaturalist.org (last accessed on 26.05.2021).

Info Flora. https://www.infoflora.ch (last accessed on 26.05.2021).

Khosla, A., Jayadevaprakash, N., Yao, B., Fei-Fei, L., 2011. Novel dataset for fine-grained image categorization. In: First Workshop on Fine-Grained Visual Categorization at the IEEE Conference on Computer Vision and Pattern Recognition.

Kumar, S., Zheng, R., 2017. Hierarchical category detector for clothing recognition from visual data. In: Proceedings, IEEE International Conference on Computer Vision Workshops, pp. 2306–2312.

Lang, N., Schindler, K., Wegner, J.D., 2019. Country-wide high-resolution vegetation height mapping with Sentinel-2. Remote Sens. Environ. 233, 111347.

Mac Aodha, O., Cole, E., Perona, P., 2019. Presence-only geographical priors for fine-grained image classification. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 9596–9606.

Roy, D., Panda, P., Roy, K., 2018. Tree-CNN: A hierarchical deep convolutional neural network for incremental learning. arXiv:1802.05800.

Silvertown, J., 2009. A new dawn for citizen science. Trends Ecol. Evol. 24 (9), 467–471.

Srivastava, N., Salakhutdinov, R., 2013. Discriminative transfer learning with tree-based priors. In: Proceedings, Advances in Neural Information Processing Systems.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R., 2014. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15 (56), 1929–1958.

Stace, C.A., 1991. Plant Taxonomy and Biosystematics. Cambridge University Press.

Swisstopo. https://www.swisstopo.admin.ch/en/geodata/height/alti3d.html (last accessed on 26.05.2021).

Tang, K., Paluri, M., Fei-Fei, L., Fergus, R., Bourdev, L., 2015. Improving image classification with location context. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1008–1016.

Turkoglu, M.O., D’Aronco, S., Perich, G., Liebisch, F., Streit, C., Schindler, K., Wegner, J.D., 2021. Crop mapping from image time series: deep learning with multi-scale label hierarchies. arXiv:2102.08820.

Welinder, P., Branson, S., Mita, T., Wah, C., Schroff, F., Belongie, S., Perona, P., 2010. Caltech-UCSD Birds 200. Tech. rep. California Institute of Technology.

Wittich, H.C., Seeland, M., Wäldchen, J., Rzanny, M., Mäder, P., 2018. Recommending plant taxa for supporting on-site species identification. BMC Bioinformatics 19 (1), 1–17.

Xiao, T., Zhang, J., Yang, K., Peng, Y., Zhang, Z., 2014. Error-driven incremental learning in deep convolutional neural network for large-scale image classification. In: Proceedings of the ACM International Conference on Multimedia, pp. 177–186.

Yan, Z., Zhang, H., Piramuthu, R., Jagadeesh, V., DeCoste, D., Di, W., Yu, Y., 2015. HD-CNN: hierarchical deep convolutional neural networks for large scale visual recognition. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2740–2748.

R. de Lutio et al.

Multispecies deep learning using citizen science data produces more informative plant community models

Article

Full-text available

May 2024

In the age of big data, scientific progress is fundamentally limited by our capacity to extract critical information. Here, we map fine-grained spatiotemporal distributions for thousands of species, using deep neural networks (DNNs) and ubiquitous citizen science data. Based on 6.7 M observations, we jointly model the distributions of 2477 plant species and species aggregates across Switzerland with an ensemble of DNNs built with different cost functions. We find that, compared to commonly-used approaches, multispecies DNNs predict species distributions and especially community composition more accurately. Moreover, their design allows investigation of understudied aspects of ecology. Including seasonal variations of observation probability explicitly allows approximating flowering phenology; reweighting predictions to mirror cover-abundance allows mapping potentially canopy-dominant tree species nationwide; and projecting DNNs into the future allows assessing how distributions, phenology, and dominance may change. Given their skill and their versatility, multispecies DNNs can refine our understanding of the distribution of plants and well-sampled taxa in general.

Detecting flowers on imagery with computer vision to improve continental scale grassland biodiversity surveying

Article

Full-text available

May 2024

Large‐scale biodiversity monitoring is essential for assessing biodiversity trends, yet traditional surveying methods are limited in the spatial/temporal scale they can cover. Recent technological developments have led to computer vision‐based species identification tools, such as the Pl@ntNet application. Increasing accuracy of such algorithms presents an opportunity of integrating computer vision into larger monitoring schemes and could lead to automating ground‐based evidence provision related to agri‐environmental measures (e.g. flower strips, field margins). However, images from surveys or farmer declarations do not live up to the standards of current applications. In order to integrate these automated methods into biodiversity monitoring, more generalized models are needed. We create a dataset using 500 manually delineated images of vegetation patches in European grasslands taken during the Land Use/cover Area Survey (LUCAS) grassland module. We train the Faster R‐CNN model to detect and extract individual flower objects. Using this model, we extract the abundance of flowers in an image, analyse their colour distribution, and use the Pl@ntNet application to identify the species of the individual flowers detected. The best model reaches precision and recall of 0.89/0.61 and predicts 1377 flowers on the 100 test images distributed between 10 colours. Using Pl@ntNet, only 52 flowers were identified with a certainty score above 0.5 due to the limitations in image size and quality. Of these flowers, 30% were correctly automatically identified at the species level and 42% at the genus level. The results show that we can automatically extract valuable information on floral abundances, colours, and sizes from images of vegetation patches, though in most cases better images are needed for species identification. 
Despite limitations with image quality, integrating this workflow into large‐scale monitoring could speed up the sampling process and allow for better spatial and temporal data on floral diversity and abundance.

Improved Wildlife Recognition through Fusing Camera Trap Images and Temporal Metadata

Article

Full-text available

Feb 2024
Diversity

Camera traps play an important role in biodiversity monitoring. An increasing number of studies have been conducted to automatically recognize wildlife in camera trap images through deep learning. However, wildlife recognition by camera trap images alone is often limited by the size and quality of the dataset. To address the above issues, we propose the Temporal-SE-ResNet50 network, which aims to improve wildlife recognition accuracy by exploiting the temporal information attached to camera trap images. First, we constructed the SE-ResNet50 network to extract image features. Second, we obtained temporal metadata from camera trap images, and after cyclical encoding, we used a residual multilayer perceptron (MLP) network to obtain temporal features. Finally, the image features and temporal features were fused in wildlife identification by a dynamic MLP module. The experimental results on the Camdeboo dataset show that the accuracy of wildlife recognition after fusing the image and temporal information is about 93.10%, which is an improvement of 0.53%, 0.94%, 1.35%, 2.93%, and 5.98%, respectively, compared with the ResNet50, VGG19, ShuffleNetV2-2.0x, MobileNetV3-L, and ConvNeXt-B models. Furthermore, we demonstrate the effectiveness of the proposed method on different national park camera trap datasets. Our method provides a new idea for fusing animal domain knowledge to further improve the accuracy of wildlife recognition, which can better serve wildlife conservation and ecological research.

CLIP-Driven Few-Shot Species-Recognition Method for Integrating Geographic Information

Article

Full-text available

Jun 2024

Automatic recognition of species is important for the conservation and management of biodiversity. However, since closely related species are visually similar, it is difficult to distinguish them by images alone. In addition, traditional species-recognition models are limited by the size of the dataset and face the problem of poor generalization ability. Visual-language models such as Contrastive Language-Image Pretraining (CLIP), obtained by training on large-scale datasets, have excellent visual representation learning ability and demonstrated promising few-shot transfer ability in a variety of few-shot species recognition tasks. However, limited by the dataset on which CLIP is trained, the performance of CLIP is poor when used directly for few-shot species recognition. To improve the performance of CLIP for few-shot species recognition, we proposed a few-shot species-recognition method incorporating geolocation information. First, we utilized the powerful feature extraction capability of CLIP to extract image features and text features. Second, a geographic feature extraction module was constructed to provide additional contextual information by converting structured geographic location information into geographic feature representations. Then, a multimodal feature fusion module was constructed to deeply interact geographic features with image features to obtain enhanced image features through residual connection. Finally, the similarity between the enhanced image features and text features was calculated and the species recognition results were obtained. Extensive experiments on the iNaturalist 2021 dataset show that our proposed method can significantly improve the performance of CLIP’s few-shot species recognition. Under ViT-L/14 and 16-shot training species samples, compared to Linear probe CLIP, our method achieved a performance improvement of 6.22% (mammals), 13.77% (reptiles), and 16.82% (amphibians). 
Our work provides powerful evidence for integrating geolocation information into species-recognition models based on visual-language models.
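
The pipeline described in this abstract (fuse geographic features into image features through a residual connection, then score against text features) can be illustrated with a minimal sketch. This is not the authors' implementation: the learned fusion module is replaced here by a simple weighted addition, and the names `fuse_residual`, `classify` and the weight `alpha` are hypothetical. Features are assumed to be plain lists of floats of equal length.

```python
import math


def l2_normalise(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def fuse_residual(image_feat, geo_feat, alpha=0.5):
    """Assumed stand-in for the fusion module: image feature plus a weighted
    geographic feature, i.e. a residual connection around the image feature."""
    return [i + alpha * g for i, g in zip(image_feat, geo_feat)]


def classify(image_feat, geo_feat, text_feats, alpha=0.5):
    """Return (best class index, cosine similarities) between the enhanced
    image feature and one text feature per class."""
    enhanced = l2_normalise(fuse_residual(image_feat, geo_feat, alpha))
    scores = [
        sum(a * b for a, b in zip(enhanced, l2_normalise(t)))
        for t in text_feats
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Hypothetical toy features: the image alone is ambiguous between two classes,
# the geographic feature tips the decision towards the second one.
image_feat = [1.0, 1.0]
geo_feat = [0.0, 1.0]
text_feats = [[1.0, 0.0], [0.0, 1.0]]
best, scores = classify(image_feat, geo_feat, text_feats)
```

With `alpha=0` the sketch reduces to plain image-text similarity, which makes the contribution of the geographic term easy to isolate.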

Sat-SINR: High-Resolution Species Distribution Models Through Satellite Imagery

Article

Full-text available

Jun 2024

We propose a deep learning approach for high-resolution species distribution modelling (SDM) at large scale combining point-wise, crowd-sourced species observation data and environmental data with Sentinel-2 satellite imagery. What makes this task challenging is the great variety of controlling factors for species distribution, such as habitat conditions, human intervention, competition, disturbances, and evolutionary history. Experts either incorporate these factors into complex mechanistic models based on presence-absence data collected in field campaigns or train machine learning models to learn the relationship between environmental data and presence-only species occurrence. We extend the latter approach here and learn deep SDMs end-to-end based on point-wise, crowd-sourced presence-only data in combination with satellite imagery. Our method, dubbed Sat-SINR, jointly models the spatial distributions of 5.6k plant species across Europe and increases the spatial resolution by a factor of 100 compared to the current state of the art. We exhaustively test and ablate multiple variations of combining geo-referenced point data with satellite imagery and show that our deep learning-based SDM method consistently shows an improvement of up to 3 percentage points across three metrics. We make all code publicly available at https://github.com/ecovision-uzh/sat-sinr.

Automatic Fused Multimodal Deep Learning for Plant Identification

Preprint

Full-text available

Jun 2024

Plant classification is vital for ecological conservation and agricultural productivity, enhancing our understanding of plant growth dynamics and aiding species preservation. The advent of deep learning (DL) techniques has revolutionized this field by enabling autonomous feature extraction, significantly reducing the dependence on manual expertise. However, conventional DL models often rely solely on single data sources, failing to capture the full biological diversity of plant species comprehensively. Recent research has turned to multi-modal learning to overcome this limitation by integrating multiple data types, which enriches the representation of plant characteristics. This shift introduces the challenge of determining the optimal point for modality fusion. In this paper, we introduce a pioneering multimodal DL-based approach for plant classification with automatic modality fusion. Utilizing the multimodal fusion architecture search, our method integrates images from multiple plant organs (flowers, leaves, fruits, and stems) into a cohesive model. Our method achieves 83.48% accuracy on 956 classes of the PlantCLEF2015 dataset, surpassing state-of-the-art methods. It outperforms late fusion by 11.07% and is more robust to missing modalities. We validate our model against established benchmarks using standard performance metrics and McNemar's test, further underscoring its superiority.

Snap Decisions: Assessing Participation and Data Quality in a Citizen Science Program Using Repeat Photography

Article

Full-text available

Nov 2023

Photo-point monitoring through repeat photography allows assessment of long-term ecosystem changes, and photos may be collected using citizen science methods. Such efforts can generate large photo collections, but are susceptible to varying participation and data quality. To date, there have been few assessments of the success of citizen science projects using repeat photography methods in meeting their objectives. We report on the success of the PhotoMon Project, a photo-point monitoring program at Pinery Provincial Park, Canada, at meeting its primary goals of affordably collecting seasonal reference photographs of significant ecosystems within the park, while providing a stewardship opportunity for park visitors. We investigated how the quantity of submitted photos varied over time (quantity), and how closely those photos matched the suite of criteria of the PhotoMon Project (quality). Photo submissions occurred year-round and at all sites, although a low proportion of park visitors participated in the program. Photo quantity varied among sites and seasonally, reaching a low during the winter, but with proportional participation in the project lowest in summer. Photo quality was consistent year-round, with most photos meeting most program criteria. Common issues with photo quality included photo lighting and orientation. We conclude that the program met its scientific goal of compiling seasonal reference photos, but that comparatively few park visitors engage in the program. We suggest changes to increase visitor motivation to participate, but recognize that these may compromise the program’s current affordability and ease of management.

Utilizing Geographical Distribution Statistical Data to Improve Zero-Shot Species Recognition

Article

Full-text available

Jun 2024

Simple Summary: Species recognition is a key part of understanding biodiversity and can help us to better conserve and manage biodiversity. Traditional species recognition methods require large amounts of image data to train the recognition model, but obtaining image data of rare and endangered species is a challenge. However, Contrastive Language–Image Pre-training (CLIP), a generalized artificial intelligence model, can perform classification by calculating the similarity between images and text without the need for training data. Taking advantage of this and considering the unique geographic distribution pattern of species, we propose a CLIP-based species recognition method that can recognize species based on geographic distribution knowledge. This study is the first to combine geographic distribution knowledge with species recognition, which can lead to more effective recognition of rare and endangered species.

Abstract: Species recognition is a crucial part of understanding the abundance and distribution of various organisms and is important for biodiversity conservation and management. Traditional vision-based deep learning-driven species recognition requires large amounts of well-labeled, high-quality image data, the collection of which is challenging for rare and endangered species. In addition, recognition methods designed based on specific species have poor generalization ability and are difficult to adapt to new species recognition scenarios. To address these issues, zero-shot species recognition based on Contrastive Language–Image Pre-training (CLIP) has become a research hotspot. However, previous studies have primarily utilized visual descriptive information and taxonomic information of species to improve zero-shot recognition performance, and the use of geographic distribution characteristics of species to improve zero-shot recognition performance has not been explored. To fill this gap, we proposed a CLIP-driven zero-shot species recognition method that incorporates knowledge of the geographic distribution of species. First, we designed three prompts based on the species geographic distribution statistical data. Then, the latitude and longitude coordinate information attached to each image in the species dataset was converted into addresses, and they were integrated together to form the geographical distribution knowledge of each species. Finally, species recognition results were derived by calculating the similarity after acquiring features by the trained CLIP image encoder and text encoder. We conducted extensive experiments on multiple species datasets from the iNaturalist 2021 dataset, where the zero-shot recognition accuracies of mammals, mollusks, reptiles, amphibians, birds, and insects were 44.96%, 15.27%, 17.51%, 9.47%, 28.35%, and 7.03%, an improvement of 2.07%, 0.48%, 0.35%, 1.12%, 1.64%, and 0.61%, respectively, as compared to CLIP with default prompt. The experimental results show that the fusion of geographic distribution statistical data can effectively improve the performance of zero-shot species recognition, which provides a new way to utilize species domain knowledge.

Rank-based deep learning from citizen-science data to model plant communities

Preprint

Full-text available

Apr 2023

In the age of big data, scientific progress is fundamentally limited by our capacity to extract critical information. We show that recasting multispecies distribution modeling as a ranking problem allows analyzing ubiquitous citizen-science observations with unprecedented efficiency. Based on 6.7M observations, we jointly modeled the distributions of 2477 plant species and species aggregates across Switzerland, using deep neural networks (DNNs). Compared to commonly-used approaches, multispecies DNNs predicted species distributions and especially community composition more accurately. Moreover, their setup allowed investigating understudied aspects of ecology: including seasonal variations of observation probability explicitly allowed approximating flowering phenology, especially for small, herbaceous species; reweighting predictions to mirror cover-abundance allowed mapping potentially canopy-dominant tree species nationwide; and projecting DNNs into the future allowed assessing how distributions, phenology, and dominance may change. Given their skill and their versatility, multispecies DNNs can refine our understanding of the distribution of plants and well-sampled taxa in general.

Crop mapping from image time series: Deep learning with multi-scale label hierarchies

Article

Full-text available

Oct 2021
REMOTE SENS ENVIRON

The aim of this paper is to map agricultural crops by classifying satellite image time series. Domain experts in agriculture work with crop type labels that are organised in a hierarchical tree structure, where coarse classes (like orchards) are subdivided into finer ones (like apples, pears, vines, etc.). We develop a crop classification method that exploits this expert knowledge and significantly improves the mapping of rare crop types. The three-level label hierarchy is encoded in a convolutional, recurrent neural network (convRNN), such that for each pixel the model predicts three labels at different levels of granularity. This end-to-end trainable, hierarchical network architecture allows the model to learn joint feature representations of rare classes (e.g., apples, pears) at a coarser level (e.g., orchard), thereby boosting classification performance at the fine-grained level. Additionally, labelling at different granularity also makes it possible to adjust the output according to the classification scores, as coarser labels with high confidence are sometimes more useful for agricultural practice than fine-grained but very uncertain labels. We validate the proposed method on a new, large dataset that we make public. ZueriCrop covers an area of 50 km × 48 km in the Swiss cantons of Zurich and Thurgau with a total of 116,000 individual fields spanning 48 crop classes, and 28,000 (multi-temporal) image patches from Sentinel-2. We compare our proposed hierarchical convRNN model with several baselines, including methods designed for imbalanced class distributions. The hierarchical approach outperforms the baselines by at least 9.9 percentage points in F1-score.

Hierarchical Image Classification using Entailment Cone Embeddings

Conference Paper

Full-text available

Jun 2020

Recommending plant taxa for supporting on-site species identification

Article

Full-text available

May 2018
BMC BIOINFORMATICS

Background: Predicting a list of plant taxa most likely to be observed at a given geographical location and time is useful for many scenarios in biodiversity informatics. Since efficient plant species identification is impeded mainly by the large number of possible candidate species, providing a shortlist of likely candidates can help significantly expedite the task. Whereas species distribution models heavily rely on geo-referenced occurrence data, such information still remains largely unused for plant taxa identification tools. Results: In this paper, we conduct a study on the feasibility of computing a ranked shortlist of plant taxa likely to be encountered by an observer in the field. We use the territory of Germany as case study with a total of 7.62M records of freely available plant presence-absence data and occurrence records for 2.7k plant taxa. We systematically study achievable recommendation quality based on two types of source data: binary presence-absence data and individual occurrence records. Furthermore, we study strategies for aggregating records into a taxa recommendation based on location and date of an observation. Conclusion: We evaluate recommendations using 28k geo-referenced and taxa-labeled plant images hosted on the Flickr website as an independent test dataset. Relying on location information from presence-absence data alone results in an average recall of 82%. However, we find that occurrence records are complementary to presence-absence data and using both in combination yields considerably higher recall of 96% along with improved ranking metrics. Ultimately, by reducing the list of candidate taxa by an average of 62%, a spatio-temporal prior can substantially expedite the overall identification problem.

Tree-CNN: A Hierarchical Deep Convolutional Neural Network for Incremental Learning

Article

Full-text available

Feb 2018
NEURAL NETWORKS

In recent years, Convolutional Neural Networks (CNNs) have shown remarkable performance in many computer vision tasks such as object recognition and detection. However, complex training issues, such as 'catastrophic forgetting' and hyper-parameter tuning, make incremental learning in CNNs a difficult challenge. In this paper, we propose a hierarchical deep neural network, with CNNs at multiple levels, and a corresponding training method for incremental learning. The network grows in a tree-like manner to accommodate the new classes of data without losing the ability to identify the previously trained classes. The proposed network was tested on CIFAR-100 and reported 60.46% accuracy and 20% reduction in training effort as compared to retraining final layers of a deep network. The network organizes the incoming classes of data into feature-driven super-classes and improves upon existing hierarchical CNN models by adding the capability of self-growth.

Presence-Only Geographical Priors for Fine-Grained Image Classification

Conference Paper

Nov 2019

Appearance information alone is often not sufficient to accurately differentiate between fine-grained visual categories. Human experts make use of additional cues such as where, and when, a given image was taken in order to inform their final decision. This contextual information is readily available in many online image collections but has been underutilized by existing image classifiers that focus solely on making predictions based on the image contents. We propose an efficient spatio-temporal prior, that when conditioned on a geographical location and time, estimates the probability that a given object category occurs at that location. Our prior is trained from presence-only observation data and jointly models object categories, their spatio-temporal distributions, and photographer biases. Experiments performed on multiple challenging image classification datasets show that combining our prior with the predictions from image classifiers results in a large improvement in final classification performance.
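
One common way to combine a spatio-temporal prior with image classifier predictions, as described in this abstract, is to multiply the two per-class probability vectors and renormalise. The sketch below assumes that combination rule and treats both inputs as already computed; the function name `combine_with_prior` and the example numbers are hypothetical and not taken from the paper.

```python
def combine_with_prior(image_probs, prior_probs, eps=1e-12):
    """Multiply per-class image probabilities by a location/time prior
    and renormalise so the result sums to one."""
    if len(image_probs) != len(prior_probs):
        raise ValueError("both inputs must cover the same classes")
    joint = [p * q for p, q in zip(image_probs, prior_probs)]
    total = sum(joint)
    if total < eps:
        # The prior rules out every class the classifier considers plausible;
        # fall back to the image-only prediction.
        return list(image_probs)
    return [j / total for j in joint]


# Hypothetical example: two visually similar species. The image classifier
# slightly prefers the first, the prior strongly favours the second.
image_probs = [0.55, 0.45]
prior_probs = [0.10, 0.90]
posterior = combine_with_prior(image_probs, prior_probs)
```

In this toy case the prior overturns the image-only decision, which is the behaviour the abstract motivates for fine-grained categories that look alike but occur in different places.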

Geo-Aware Networks for Fine-Grained Recognition

Conference Paper

Oct 2019

Country-wide high-resolution vegetation height mapping with Sentinel-2

Article

Nov 2019
REMOTE SENS ENVIRON

Sentinel-2 multi-spectral images collected over periods of several months were used to estimate vegetation height for Gabon and Switzerland. A deep convolutional neural network (CNN) was trained to extract suitable spectral and textural features from reflectance images and to regress per-pixel vegetation height. In Gabon, reference heights for training and validation were derived from airborne LiDAR measurements. In Switzerland, reference heights were taken from an existing canopy height model derived via photogrammetric surface reconstruction. The resulting maps have a mean absolute error (MAE) of 1.7 m in Switzerland and 4.3 m in Gabon (a root mean square error (RMSE) of 3.4 m and 5.6 m, respectively), and correctly estimate vegetation heights up to >50 m. They also show good qualitative agreement with existing vegetation height maps. Our work demonstrates that, given a moderate amount of reference data (i.e., 2000 km² in Gabon and ≈5800 km² in Switzerland), high-resolution vegetation height maps with 10 m ground sampling distance (GSD) can be derived at country scale from Sentinel-2 imagery.

Fine-Grained Representation Learning and Recognition by Exploiting Hierarchical Semantic Embedding

Conference Paper

Oct 2018

Object categories inherently form a hierarchy with different levels of concept abstraction, especially for fine-grained categories. For example, birds (Aves) can be categorized according to a four-level hierarchy of order, family, genus, and species. This hierarchy encodes rich correlations among various categories across different levels, which can effectively regularize the semantic space and thus make prediction less ambiguous. However, previous studies of fine-grained image recognition primarily focus on categories of one certain level and usually overlook this correlation information. In this work, we investigate simultaneously predicting categories of different levels in the hierarchy and integrating this structured correlation information into the deep neural network by developing a novel Hierarchical Semantic Embedding (HSE) framework. Specifically, the HSE framework sequentially predicts the category score vector of each level in the hierarchy, from highest to lowest. At each level, it incorporates the predicted score vector of the higher level as prior knowledge to learn finer-grained feature representation. During training, the predicted score vector of the higher level is also employed to regularize label prediction by using it as soft targets of corresponding sub-categories. To evaluate the proposed framework, we organize the 200 bird species of the Caltech-UCSD birds dataset with the four-level category hierarchy and construct a large-scale butterfly dataset that also covers four level categories. Extensive experiments on these two and the newly-released VegFru datasets demonstrate the superiority of our HSE framework over the baseline methods and existing competitors.

Hierarchical Category Detector for Clothing Recognition from Visual Data

Conference Paper

Oct 2017

Plant Taxonomy and Biosystematics.

Article

Nov 1981
