Model-R: A Framework for Scalable and
Reproducible Ecological Niche Modeling
Andrea Sánchez-Tapia1, Marinez Ferreira de Siqueira1, Rafael Oliveira Lima1,
Felipe Sodré M. Barros2, Guilherme M. Gall3, Luiz M. R. Gadelha Jr.3,
Luís Alexandre E. da Silva1, and Carla Osthoff3
1 Botanic Garden of Rio de Janeiro, Rio de Janeiro, Brazil
{andreasancheztapia, marinez, rafael, estevao}@jbrj.gov.br
2 International Institute for Sustainability, Rio de Janeiro, Brazil
f.barros@iis-rio.org
3 National Laboratory for Scientific Computing, Petrópolis, Brazil
{gmgall, lgadelha, osthoff}@lncc.br
Abstract. Spatial analysis tools and synthesis of results are key to iden-
tifying the best solutions in biodiversity conservation. The importance
of process automation is associated with increased efficiency and perfor-
mance both in the data pre-processing phase and in the post-analysis
of the results generated by the packages and modeling programs. The
Model-R framework was developed with the main objective of unifying
pre-existing ecological niche modeling tools into a common framework
and building a web interface that automates steps of the modeling process
and occurrence data retrieval. The web interface includes RJabot, a func-
tionality that allows for searching and retrieving occurrence data from
Jabot, the main reference on botanical collections management system in
Brazil. It returns data in a suitable format to be consumed by other com-
ponents of the framework. Currently, the tools are multi-projection; they can
thus be applied to different sets of temporal and spatial data. Model-
R is also multi-algorithm, with seven algorithms available for modeling:
BIOCLIM, Mahalanobis distance, Maxent, GLM, RandomForest, SVM,
and DOMAIN. The algorithms as well as the entire modeling process may
be parametrized using command-line tools or through the web interface.
We hope that the use of this application, not only by modeling special-
ists but also as a tool for policy makers, will be a significant contribution
to the continuous development of biodiversity conservation analysis. The
Model-R web interface can be installed locally or on a server. A software
container is provided to automate the installation.
Keywords: species distribution modeling, ecological niche modeling,
science gateways, scalability, provenance
1 Introduction
Ecological Niche Modeling (ENM) has been widely used for over a decade [1] [2]
[3] [4]. In recent years ENM approaches have become an essential tool for species
conservation, ecology, and evolution studies, as well as for systematic conservation
and restoration planning [5]. These models use species occurrence data and pre-
dictor variables that are combined to form statistical and theoretical models
resulting in projections in the geographic space that represent the potential ge-
ographic distribution of a species [6]. The environmental suitability maps [7]
generated by the models indicate how similar a particular area is to the area
where the species occurs, thus identifying potential areas for occupation by
the species, based on the selected predictor variables.
Ecological niche modeling comprises several stages, which require knowledge
of many concepts and techniques related to various fields of biology, such as
biodiversity, biogeography, as well as climate and data processing tools, before,
during and after obtaining the model [8][5]. The biotic data processing step con-
sists of obtaining, evaluating and preparing the points of presence and, in some
cases, of absence of the species to be modeled. In this process, it is fundamental
to perform data cleaning with the removal of inaccurate or unreliable data. In
the step of treatment and choice of environmental layers, one obtains and selects
the layers to be used in the analysis. Traditionally, it is necessary to use
Geographic Information System (GIS) tools for clipping, adjusting the resolution,
and cropping the raster layers to the modeling extent, which requires reasonable
knowledge of those tools. This task can be even more time-consuming when deal-
ing with a large dataset. The use of specific data types by the algorithms, and
their various forms of parametrization, requires a reasonable knowledge of pro-
gramming for their full use. The importance of process automation is associated
with increased efficiency and performance both in the data pre-processing phase
and in the post-analysis of the results generated by the packages and modeling
programs, which is the main objective of this work. The elimination of external
tools for data acquisition and preparation, as well as their standardization, re-
duces the possibility of errors, confers reproducibility and improves the speed of
the modeling process, making the whole process more efficient.
The modeling process consists of many steps, as described in [8], some of
which consume considerable time to be performed by traditional means. A re-
source available for tackling this problem is the R statistical environment, which
features various possibilities of automation but does require some knowledge of
programming for obtaining the desired outcomes in this process. The main ob-
jective of this work was to package modeling procedures as R functions and to
create an application (Model-R) that allows, either via command-line or through
a web interface, to perform ecological niche modeling, overcoming the most com-
mon barriers and providing approaches for data entry steps, data cleaning, choice
of predictor variables, parametrization of algorithms, and post-analysis as well
as the retrieval of the results. A list of acronyms and variable definitions is
presented in Table 1.
2 Model-R Framework
The Model-R framework for ecological niche modeling consists of a set of ecological
niche modeling functions (dismo.mod, final.models, ensemble), functions
for retrieving species occurrence records (RJabot and Rgbif), and a graphical
user interface. It allows researchers to use their own data. The framework is
divided into a frontend and a backend; part of the functionality is exposed
through a web interface that abstracts and automates the main steps involved
in the ecological niche modeling process. This is a dynamic process, and our
goal is for this interface to evolve and incorporate more and more aspects of
the framework. All these components were implemented in R and are described
in the subsections that follow.
2.1 Frontend
The main focus of the web application for Model-R is the development of an in-
terface for the modeling process, allowing users without programming knowledge
to perform the steps of the modeling process consistently, avoiding the concern
with script coding and concentrating on the data and its processing workflow.
To do so, we adapted the modeling functions into a Shiny application [9]. The
Shiny package [9] is a web application framework for R, allowing the creation of
interactive applications that can be accessed from devices with internet access,
such as computers, tablets, and mobile phones. Thus, the application provides a
graphical interface where users can easily choose biotic and abiotic data, perform
the data cleaning on occurrence records, choose algorithms and their parameters.
They can also download the results, as well as the script of the modeling process,
which allows its execution without the use of the Model-R web application. The
use of the script as a stand-alone application allows for more precise adjustment
of the parameters or adjustments that were not possible in the web interface.
To make the application and process user-friendly, we separated the features by
steps, following the modeling process described in [8].
The following steps of the modeling process are available in the application:
biotic data entry; data cleaning; choice of abiotic variables; definition of the
geographic extent; choice of the algorithms and their parameters; visualization of
results; and downloading the resulting data.
Biotic data entry. This stage represents the entry of biotic data in the
system. A modeling project can be created using the "Create Project" feature,
which allows for keeping track of the modeling experiments performed. Creating a
project allows one to assign a name and thus organize and store the information
generated. Biotic data can be given as input to the application in three ways:
queries to the GBIF database, queries to the Jabot database, and uploading CSV
files. CSV files allow for uploading occurrence records not present in GBIF and
Jabot from other databases after conversion to this format. The RJabot package
makes the query to the Jabot database (Figure 1). These records are given by
species name, latitude, and longitude. At the end of the biotic data entry step,
a map with the occurrence records is displayed. In July 2017, GBIF contained
approximately 10 million species occurrence records for Brazil, 70% of which
were published by its Brazilian node, the Brazilian Biodiversity Information
System (SiBBr) [10].
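As an illustration of this input format, a minimal occurrence table can be assembled and checked in a few lines of R before upload. The column names and the species used here are illustrative examples, not the application's exact schema:

```r
# Illustrative occurrence table; column names and species are
# hypothetical examples, not the application's exact schema.
occ <- data.frame(
  species = c("Euterpe edulis", "Euterpe edulis"),
  lat     = c(-22.97, -23.55),
  lon     = c(-43.22, -46.63)
)
# Basic sanity checks on coordinates before uploading
stopifnot(all(occ$lat >= -90 & occ$lat <= 90))
stopifnot(all(occ$lon >= -180 & occ$lon <= 180))
write.csv(occ, "occurrences.csv", row.names = FALSE)
```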
Fig. 1. Output of getOccurrence showing occurrence points obtained from Jabot.
Data cleaning. This step allows cleaning the biotic data entered into the
application. It has two features: "Eliminate duplicate data" and "Delete
occurrence point". "Eliminate duplicate data" removes occurrence records
entered into the application that have the same values for latitude and
longitude. "Delete occurrence point" eliminates points whose locations were
evaluated by the user as erroneous and that will not be used in modeling.
Using the interface, the user clicks the "Delete duplicate" button or selects
the suspicious points to be eliminated. Afterwards, the user can also save the
final biotic dataset resulting from the data cleaning process.
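The duplicate-removal step corresponds to keeping one record per unique latitude/longitude pair, which can be sketched in base R as follows (data are synthetic):

```r
# Toy occurrence set with one exact coordinate duplicate
occ <- data.frame(
  species = rep("Euterpe edulis", 3),
  lat = c(-22.97, -22.97, -23.55),
  lon = c(-43.22, -43.22, -46.63)
)
# "Eliminate duplicate data": keep the first record of each
# latitude/longitude pair
cleaned <- occ[!duplicated(occ[, c("lat", "lon")]), ]
nrow(cleaned)  # 2 records remain
```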
Abiotic data entry. This step is responsible for the entry of abiotic variables
and the definition of the geographic extents of the modeling process and its
projection. The first step is to set the spatial extent of the modeling process
(i.e., the extent within which the modeling will be done, also understood as the
study area). The extents can be defined directly on the map, which displays the
occurrence points selected in the previous steps. Regarding spatial projection,
the application allows users to define a different extent onto which to project
the ENM, in another region. This can be useful, for instance, for checking the
ability of a species to become invasive in the given region. It is also possible
to define a projection in time, for instance, to the past (Pleistocene/Holocene)
or to the future (2050 and 2070), using the WorldClim dataset4 and Bio-ORACLE
variables [11]. Independently of the spatial and temporal projection chosen, the
user may define the spatial resolution (i.e., pixel size) of the abiotic dataset.
During development, we used resolutions of 10 arc minutes (for WorldClim
variables) and 9.2 km (for Bio-ORACLE variables), for storage space and
processing speed reasons. Database technologies that optimize storage and
processing speed are already under study, so that the application can also
support other resolutions, such as 30 arc seconds and 2.5 and 5 arc minutes.
The map, the occurrence points, and the geographic extents are displayed
using the Leaflet package [12], which allows for zooming and interacting with
the map. The application is configured to work with WorldClim and Bio-ORACLE
to retrieve abiotic data and allows other variables to be added manually to
the application.
Once the abiotic variables are defined, the Model-R application displays the
variables considering the extent defined by the user, together with a table and
charts containing the correlation values between them (see Figure 2, step 4),
allowing the user to identify correlated variables. Strongly correlated variables
can impair the prediction performance and statistics of the modeling process [13] [14].
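This correlation check can also be reproduced outside the interface by computing a correlation matrix over predictor values sampled at the raster cells. The sketch below uses synthetic values and a commonly used |r| > 0.7 cutoff:

```r
# rows = raster cells, columns = predictor variables (synthetic data)
set.seed(1)
vals <- cbind(bio1 = rnorm(500), bio12 = rnorm(500), bio15 = rnorm(500))
vals[, "bio12"] <- 0.95 * vals[, "bio1"] + rnorm(500, sd = 0.1)
cm <- cor(vals)
# flag variable pairs whose absolute correlation exceeds the cutoff
flagged <- which(abs(cm) > 0.7 & upper.tri(cm), arr.ind = TRUE)
flagged  # the induced bio1-bio12 pair is reported
```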
Fig. 2. Modeling steps in the web interface of Model-R.
4 http://www.worldclim.org
3 Modeling process and backend
The next step in the web application, modeling process, is the core of the species
distribution modeling workflow and was implemented as a three-step procedure,
wrapped in R functions, called dismo.mod() (in reference to the dismo package
[15] from which it draws the main structure and functions), final.models()
and ensemble().
dismo.mod() takes the input data, partitions it by cross-validation, fits the models
for each partition, and writes the evaluation statistics tables, using the
evaluate() function in the dismo package, with some modifications, such as
the calculation of TSS for each partition. It writes the raw models, i.e. the
continuous outputs, in raster and image formats. Writing to the hard disk
allows keeping the main memory uncluttered. The structure of the function
draws both on the dismo [15] and the biomod2 [16] tutorials.
final.model() joins the fitted models for each partition into a final model per
species per algorithm. It can select the best partitions according to their
TSS or AUC values; the default is selecting by TSS > 0.7, but this can be
changed by the user. The function also allows choosing which algorithms
will be processed (otherwise, it reads all algorithms available from the
statistics table) and using a mask to crop the models to a subset of the fitting
geographic area. Finally, it cuts the continuous models by the threshold that
maximizes the TSS of each model and averages these models.
ensemble() computes the average of the final models to obtain an ensemble
model per species, retaining only the algorithms and partitions that
were selected previously. It can also focus on the areas where the algorithms
exhibited consensus. The default consensus level is 0.5, which corresponds to a
weighted majority rule ensemble that reduces variability between the algorithms,
so that the final model only retains areas predicted by at least half
of the algorithms [17].
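On binary per-algorithm models, the default 0.5 consensus amounts to averaging the predictions and keeping the cells predicted by at least half of the algorithms, as in the following toy sketch over four pixels:

```r
# Binary final models from three algorithms (toy values, 4 pixels)
bioclim_bin <- c(1, 0, 1, 1)
maxent_bin  <- c(1, 0, 0, 1)
svm_bin     <- c(1, 1, 0, 1)
mean_model  <- (bioclim_bin + maxent_bin + svm_bin) / 3
# weighted majority rule: keep pixels predicted by >= half the algorithms
consensus   <- as.numeric(mean_model >= 0.5)
consensus   # 1 0 0 1
```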
The application interface runs this framework in the background, but the
user can adjust the following parameters:
Partition number. The number of times the model will be generated for each
selected algorithm and, consequently, the number of times the k-fold partitioning
will be performed (dividing the total data set into k mutually exclusive
subsets of the same size, using k - 1 of them for parameter estimation and
algorithm training and the remaining subset to evaluate the accuracy of the model).
Number of pseudo-absences. Number of points sampled randomly in the
background for use in the modeling process.
Modeling algorithms. Seven algorithms are available: BIOCLIM, Mahalanobis
distance, Maxent, and DOMAIN, available in the dismo package [15], plus
GLM (stats), RandomForest (randomForest), and SVM (kernlab). BIOCLIM,
Mahalanobis, and DOMAIN are based on simple statistical techniques, such as
environmental distance. GLM is based on regression techniques. Lastly, Maxent,
RandomForest, and SVM are based on machine learning techniques.
Buffer. Whether a buffer should be applied during the sampling of pseudo-absences.
This is an inclusive buffer: it calculates the distances between the occurrence
points and uses the maximum or the mean geographic distance between the
occurrences of the species as the radius within which pseudo-absences will be
generated.
Project on another extension. The application reprojects the model onto extents
(spatial or temporal) different from the one on which the model was created.
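The k-fold partitioning described for the Partition number parameter can be sketched in base R as follows (the modeling and evaluation calls are placeholders):

```r
k <- 3                      # number of partitions
n <- 30                     # number of occurrence records
set.seed(7)
# assign each record to one of k mutually exclusive, equal-sized groups
group <- sample(rep(1:k, length.out = n))
for (i in 1:k) {
  train <- which(group != i)   # k-1 groups train the algorithm
  test  <- which(group == i)   # the remaining group evaluates it
  # fit_model(train); evaluate(test)  -- placeholders for the real calls
}
```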
At the end of the execution, k continuous models, k binary models, and one
ensemble model are generated for each species and algorithm, as displayed in
Figure 2 (step 5). The values obtained from the validation process are stored as
a table, and their values are presented in Figure 2 (step 6). A brief description
of each variable is presented in Table 1.
Table 1. Description of variables generated by the modeling process.
Variable name Description
Sp Species name
Part Partition number
Algorithm Modeling algorithm employed
AUC Computed Area Under Curve
TSS True Skill Statistic = (sensitivity + specificity) - 1
Kappa Cohen’s Kappa coefficient
No Omission Threshold where there is no omission
Prevalence Prevalence
Sensitivity Sensitivity
TSSth Threshold that maximizes (sensitivity + specificity)
Np Number of presences
Na Number of Absences
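The TSS entry in the table follows the standard definition; given the counts of a confusion matrix at a fixed threshold, it can be computed directly:

```r
# TSS = sensitivity + specificity - 1, from confusion-matrix counts:
# tp/fn = presences predicted present/absent; tn/fp analogous for absences
tss <- function(tp, fn, fp, tn) {
  sensitivity <- tp / (tp + fn)
  specificity <- tn / (tn + fp)
  sensitivity + specificity - 1
}
tss(tp = 40, fn = 10, fp = 5, tn = 45)  # 0.8 + 0.9 - 1 = 0.7
```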
4 Reproducibility
Provenance information [18] is given by the documentation of the conception and
execution of computational processes, including the activities that were executed
and the respective data sets consumed and produced by them. Applications of
provenance include reproducibility of computational processes, sharing and reuse
of knowledge, data quality evaluation and attribution of scientific results [19].
Reproducibility is one of the important features of Model-R. The inclusion of
this feature is motivated by many academic journals recommending that authors
of computational studies should also provide the required data sets, tools, and
workflows used to generate the results [20] [21] so that reviewers and readers
could better validate them. For each modeling project specified and executed
in Model-R, the following information is available for download: the R script,
illustrated in Figure 2 (step 6) that allows for reproducing the steps that were
performed to produce results of the modeling process and to re-execute the mod-
eling process without using the web interface of Model-R; a CSV file containing
the resulting variables from the modeling process; the occurrence records used
after data cleaning; the raster files in the GeoTIFF format generated by the
application; a raster file in the GeoTIFF format with an ensemble of the models
generated; raster files in the GeoTIFF format with the projection of the model
into another geographic extent. These are only generated when the "Project
into another extension" option is selected.
5 Case Study and Evaluation
A case study was performed with woody plants of the Brazilian Atlantic Forest
and is described next.
Species occurrence data. The original plant names database (3,952 plant
names and 171,144 original records) was compiled from SpeciesLink5 and
NeoTropTree6 (list of species with number of records in Appendix 1) and corrected
according to the Catalog of Plants and Fungi of Brazil (CPFB)7, using the R
package flora [22], which is based on the List's IPT database. The CPFB publishes
the official List of the Brazilian Flora, meeting Target 1 of the Global Strategy
for Plant Conservation. The catalog recognized and checked 3,910 names.
The 42 names that were not found in the CPFB were searched for in The Plant
List8 (TPL) and then in the Taxonomic Name Resolution Service9 (TNRS), as
implemented in the R packages Taxonstand [23] and taxize [24] [25]. The
information from the CPFB, TPL, and TNRS was cross-checked, and when there
were conflicts, the names from the CPFB were given priority. For each species,
the complete occurrence data was treated for (1) records that fell outside the
Brazilian borders, (2) duplicated records, and (3) non-duplicated records that
fell in the same 1-km pixel. Only species with at least 100 unique occurrences
(after deleting duplicates within each pixel) were kept, and of these, to
overcome bias from species with only marginal occurrence in the Atlantic
Rainforest, only species with more than 50% of their occurrences in the Atlantic
Rainforest were considered. After all these procedures, a sub-sample of the 96
species (35,672 presence records) with the largest numbers of samples was chosen
to compose the woody plants case study (Figure 3, left).
Environmental data. As environmental predictors, 28 variables with a spatial
resolution of 1 km2 were compiled and organized. Those variables were
summarized by PCA axes, of which the first ten (about 95% of the data
variation) were used to run the models. The aspect variable was decomposed
into its sine and cosine to be used as variables.
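These two transformations can be sketched with base R on synthetic data: prcomp for the PCA summary and sine/cosine components for the circular aspect variable (measured in degrees):

```r
# PCA summary of (synthetic) environmental predictors
set.seed(3)
env <- data.frame(bio1 = rnorm(200), bio4 = rnorm(200), bio12 = rnorm(200))
pca <- prcomp(env, scale. = TRUE)
scores <- pca$x[, 1:2]               # leading axes used as predictors

# aspect is circular (degrees), so it is decomposed before modeling
aspect_deg <- runif(200, 0, 360)
aspect_sin <- sin(aspect_deg * pi / 180)
aspect_cos <- cos(aspect_deg * pi / 180)
```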
5 http://splink.cria.org.br/
6 http://prof.icb.ufmg.br/treeatlan/
7 http://floradobrasil.jbrj.gov.br/
8 http://www.theplantlist.org/
9 http://tnrs.iplantcollaborative.org/
Environmental Niche Modeling. Environmental niche models were built
for each species, using the dismo.mod, final.model, and ensemble functions.
A three-fold cross-validation procedure was performed. Random pseudo-absence
points (nback = 2 × n) were drawn within a maximum-distance buffer
(the radius of the buffer is the maximal geographic distance between the
occurrence points) and divided into three groups, for training and testing
purposes. For each partition (k = 3) and algorithm, a model was built, and its
performance was tested by calculating the True Skill Statistic [26]. The authors
of [26] found that TSS scores were largely unaffected by prevalence, and values
from 0.6 to 1.0 were considered a good adjustment of model accuracy. Because of
that, only models with TSS > 0.7 were retained. Selected partitions were cut by
the threshold that maximizes their TSS, and the resulting binary models were
averaged to generate a model per algorithm. The scale in these final models is
equivalent to the number of partitions that predict the species presence (it
goes from 0 to n/n in 1/n intervals, where n is the number of selected models).
The ensemble model (i.e., joining models from different algorithms) was obtained
by averaging the final models for each algorithm. A species potential richness
map was generated by summing the binary final models, cut by the average
threshold that maximizes the TSS values (Figure 3, right).
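The richness map computation reduces to a cell-wise sum of the species' binary final models: each cell then holds the number of species predicted present there. A toy sketch with three species over five cells:

```r
# binary final models (1 = predicted presence) for three species
sp1 <- c(1, 0, 1, 1, 0)
sp2 <- c(1, 1, 0, 1, 0)
sp3 <- c(0, 1, 0, 1, 1)
richness <- sp1 + sp2 + sp3   # cell-wise sum across species
richness  # 2 2 1 3 1
```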
Fig. 3. Map with original occurrence records (left) and richness map generated by
analyzing Model-R output data (right).
Performance and Parallelization. The dismo.mod() function, on which the
modeling process of Model-R is based, was entirely sequential in its first version:
models for all species of interest were generated one after another. To improve
performance, parallel processing was employed. Now, if n cores are available,
models for n species can be generated simultaneously. The snowfall [27] R
package provided support for the parallelization. It provides functions for
parallel collection processing; sfLapply, for instance, is the parallel version of
the standard lapply, which applies a function to every element of a list,
producing a new list with the results.
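A minimal sketch of this one-worker-per-species strategy with snowfall follows; model_species stands in for the actual modeling routine, and the snowfall package is assumed to be installed:

```r
library(snowfall)

# placeholder for the real per-species modeling routine
model_species <- function(sp) paste("modeled", sp)

sfInit(parallel = TRUE, cpus = 2)      # start a 2-worker cluster
sfExport("model_species")              # make the function visible to workers
species_list <- c("sp1", "sp2", "sp3", "sp4")
results <- sfLapply(species_list, model_species)  # species processed in parallel
sfStop()
unlist(results)
```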
The effects of the parallelization on performance can be seen in Figure 4.
Each point in the plot is the arithmetic mean of the time elapsed to do three
executions of dismo.mod(). Models for 96 species were generated, varying the
number of cores from 1 to 64. The algorithms used were RandomForest, SVM,
and Maxent. The variability in execution time for 96 species can be explained
in part by the parallelization strategy used, i.e. one thread per species. The
total time that it takes to apply all the modeling algorithms can be significantly
different from one species to another.
Fig. 4. Parallelization effects on performance: average execution time (min) as
the number of cores varies from 1 to 64.
The creation of separate functions for each of the modeling algorithms that
dismo.mod() can fit was another important optimization. In its first version,
all algorithms generated models in the context of a single function. The
memory allocated to the variables used by one algorithm was never released,
even after the corresponding algorithm had finished its work. R is a programming
language with a garbage collector [28], meaning it releases memory when an
object is no longer used: it tracks the references to each object and removes
the object once it is no longer reachable. Since dismo.mod() was keeping at
least one reference to the variables used by all selected algorithms for the
whole runtime of the function, a lot of memory was being occupied unnecessarily.
The separation did not make the modeling process faster, but it allowed the
generation of more models per node because of the smaller memory footprint.
The generation of models for a single species used approximately 5 GB of
resident memory, a metric that comes close to the actual memory budget of a
process [29]. The version with separate functions for each modeling algorithm
uses half of this memory.
6 Related Work
As parameters for comparison, two related services in this area were considered:
the Biomodelos portal [30], developed by the Humboldt Institute in Colombia,
and the Virtual Biodiversity e-Laboratory (BioVel) [31], an initiative supported
by the European Union. These two examples were chosen because they repre-
sent two distinct efforts from the standpoint of the internal and external target
audience of the system.
BioVel provides, via a web interface, a service that allows the management of
scientific workflows [32] for biodiversity. Several pre-defined activities can be
composed to form these workflows; for example, the following features were
developed and are available in BioVel: geographic and temporal selection of
occurrences; data cleaning; taxonomic name resolution; and ecological niche
modeling algorithms (openModeller) [33]. Such activities can be composed freely
into complex scientific workflows for performing various analyses on biodiversity.
This flexibility of the service and the range of applications available in its
catalog generate a plurality of results that can be difficult to assess regarding
quality and suitability, since the service is freely accessible and does not have
a systematic methodology for the qualification of models whereby experts can
criticize, comment on, and change the generated results.
The Biomodelos portal [30] is intended for species distribution modeling,
which is carried out and published on the website by the Humboldt Institute
modeling team. The most interesting feature of Biomodelos, absent from similar
portals, is the existence of a network of taxonomists, who are also users of the
portal and who evaluate each species distribution published on the website. The
taxonomy experts have access to metadata about how the species distribution
models were executed and can assign a grade to the generated model, add notes
to it, or geographically edit the distribution map (excluding, for example,
records in areas with known absences). Thus, species distributions published in
Biomodelos are accompanied by information to support the decision maker in
assessing their quality and fitness for use.
Other initiatives based on R include SDM [34] and Wallace [35]; initiatives
based on scientific workflow management systems include Kepler [36] and
VisTrails [37] [38], as well as systems running in cloud computing environments
[39] [40] [41]. They have some similarities with our application, as well as some
striking differences, especially in terms of functionality, such as the lack of
scalability in the implementation of the models and the absence of provenance
recording.
7 Conclusion
In this work, we presented Model-R, a framework for species distribution mod-
eling. It abstracts away cumbersome steps usually involved in this type of mod-
eling, such as data acquisition and cleaning, by providing a productive web
interface that allows for customizing key steps of the process, including the
pre-processing of biotic and abiotic data and the post-analysis of the results.
The RJabot package, for instance, allows for easy retrieval of species occurrence
records from the Rio de Janeiro Botanical Garden Herbarium (RB) [42], one of
the most complete sources of information about the Brazilian flora. The scalable
execution of the modeling process is enabled through the use of parallel pro-
gramming libraries available for the R environment. Having separate functions
per algorithm also presents an opportunity for further exploration of parallelism.
Currently only parallelism by species is used. All models for a given species are
generated by the same core even if more than one algorithm is used. Paral-
lelism by algorithm is feasible as well. Model-R also enables reproducibility of
the modeling process by providing the data sets generated and scripts in R that
allow for reproducing the steps used to generate them. The application supports
applying the modeling process to different sets of temporal or spatial data. Max-
ent, RandomForest, and all the algorithms supported by the dismo package are
supported by Model-R, and their parameters can be customized through its web
interface. We expect the application to become a valuable tool for scientists work-
ing with analysis and synthesis of biodiversity data, and for decision-makers in
biodiversity conservation.
As future work, we plan to better automate the generation of raster files con-
taining abiotic data by using GIS tools, such as PostGIS. These are currently
generated manually for some pre-defined resolutions and copied to the Model-R
application server. We also plan to further improve the scalability of the
application by adapting it to run on petascale computing resources of the Brazilian
National System for High-Performance Computing10 [43] using the Swift [44]
parallel scripting system, which gathers provenance information [45] [46]. Addi-
tionally, we are working on porting the modeling scripts to Big Data platforms.
In particular, we are adapting them to the Spark platform [47] using its R in-
terface [48].
Model-R is available as open-source software on Github11. To facilitate its
installation, we also built a software container that is available on Docker Hub12.
This software container is synchronized to the Github repository, i.e. any update
10 http://sdumont.lncc.br
11 https://github.com/Model-R/Model-R
12 https://hub.docker.com/r/modelr/shinyapp/
to the source code on Github triggers the production of an updated software
container.
Acknowledgments
This work has been supported by CNPq, SiBBr, and FAPERJ.
The original publication is available at www.springerlink.com:
Sánchez-Tapia, A., de Siqueira, M., Lima, R., Barros, F., Gall, G., Gadelha, L.,
and da Silva, L. (2017). Model-R: A Framework for Scalable and Reproducible
Ecological Niche Modeling. High Performance Computing - Fourth Latin American
Conference, CARLA 2017. Communications in Computer and Information
Science, vol. 796. Springer, 2017.
http://link.springer.com/chapter/10.1007/978-3-319-73353-1_15
References
1. Araújo, M.B., Williams, P.H.: Selecting areas for species persistence using occurrence data. Biological Conservation 96(3) (dec 2000) 331–345
2. Engler, R., Guisan, A., Rechsteiner, L.: An improved approach for predicting the
distribution of rare and endangered species from occurrence and pseudo-absence
data. Journal of Applied Ecology 41(2) (apr 2004) 263–274
3. Ortega-Huerta, M.A., Peterson, A.T.: Modelling spatial patterns of biodiversity for
conservation prioritization in North-eastern Mexico. Diversity and Distributions
10(1) (jan 2004) 39–54
4. Chen, Y.: Conservation biogeography of the snake family Colubridae of China.
North-Western Journal of Zoology 5(2) (2009) 251–262
5. Peterson, A.T., Soberón, J., Pearson, R.G., Anderson, R.P., Martínez-Meyer, E., Nakamura, M., Araújo, M.B.: Ecological Niches and Geographic Distributions. Princeton University Press (2011)
6. Anderson, R.P., Lew, D., Peterson, A.: Evaluating predictive models of species’
distributions: criteria for selecting optimal models. Ecological Modelling 162(3)
(apr 2003) 211–232
7. Sillero, N.: What does ecological modelling model? A proposed classification of
ecological niche models based on their underlying methods. Ecological Modelling
222(8) (apr 2011) 1343–1346
8. Santana, F., de Siqueira, M., Saraiva, A., Correa, P.: A reference business process
for ecological niche modelling. Ecological Informatics 3(1) (jan 2008) 75–86
9. Chang, W.: shiny: Web Application Framework for R. https://cran.r-project.org/web/packages/shiny (2016)
10. Gadelha, L., Guimarães, P., Moura, A.M., Drucker, D.P., Dalcin, E., Gall, G., Tavares, J., Palazzi, D., Poltosi, M., Porto, F., Moura, F., Leo, W.V.: SiBBr: Uma Infraestrutura para Coleta, Integração e Análise de Dados sobre a Biodiversidade Brasileira. In: VIII Brazilian e-Science Workshop (BRESCI 2014). Proc. XXXIV Congress of the Brazilian Computer Society. (2014)
11. Tyberghein, L., Verbruggen, H., Pauly, K., Troupin, C., Mineur, F., De Clerck,
O.: Bio-ORACLE: a global environmental dataset for marine species distribution
modelling. Global Ecology and Biogeography (2012)
12. Agafonkin, V.: Leaflet - a JavaScript library for interactive maps.
http://leafletjs.com/ (2016)
13. Guisan, A., Zimmermann, N.E.: Predictive habitat distribution models in ecology.
Ecological Modelling 135(2-3) (dec 2000) 147–186
14. Lomba, A., Pellissier, L., Randin, C., Vicente, J., Moreira, F., Honrado, J., Guisan,
A.: Overcoming the rare species modelling paradox: A novel hierarchical framework
applied to an Iberian endemic plant. Biological Conservation 143(11) (nov 2010)
2647–2657
15. Hijmans, R.J., Elith, J.: dismo: Species Distribution Modeling. https://cran.r-project.org/web/packages/dismo (2016)
16. Thuiller, W., Lafourcade, B., Engler, R., Araújo, M.B.: BIOMOD - A platform for ensemble forecasting of species distributions. Ecography 32(3) (2009) 369–373
17. Araújo, M.B., Whittaker, R.J., Ladle, R.J., Erhard, M.: Reducing uncertainty in projections of extinction risk from climate change: Uncertainty in Species' Range Shift Projections. Global Ecology and Biogeography 14(6) (jun 2005) 529–538
18. Freire, J., Koop, D., Santos, E., Silva, C.: Provenance for Computational Tasks:
A Survey. Computing in Science & Engineering 10(3) (may 2008) 11–21
19. Gadelha, L., Mattoso, M.: Applying Provenance to Protect Attribution in Distributed Computational Scientific Experiments. In Ludäscher, B., Plale, B., eds.: Provenance and Annotation of Data and Processes. Volume 8628 of Lecture Notes in Computer Science. Springer (2015) 139–151
20. Sandve, G.K., Nekrutenko, A., Taylor, J., Hovig, E.: Ten Simple Rules for Repro-
ducible Computational Research. PLoS Computational Biology 9(10) (oct 2013)
e1003285
21. Wilson, G., Aruliah, D.A., Brown, C.T., Chue Hong, N.P., Davis, M., Guy, R.T.,
Haddock, S.H.D., Huff, K.D., Mitchell, I.M., Plumbley, M.D., Waugh, B., White,
E.P., Wilson, P.: Best practices for scientific computing. PLoS biology 12(1) (jan
2014) e1001745
22. Carvalho, G.: flora: Tools for Interacting with the Brazilian Flora 2020.
https://cran.r-project.org/web/packages/flora/index.html (2016)
23. Cayuela, L., Oksanen, J.: Taxonstand: Taxonomic Standardization of Plant Species Names. https://cran.r-project.org/web/packages/Taxonstand (2016)
24. Chamberlain, S.A., Szöcs, E.: taxize: taxonomic search and retrieval in R. F1000Research 2 (jan 2013) 191
25. Chamberlain, S., Szoecs, E., Foster, Z., Boettiger, C., Ram, K., Bartomeus, I.,
Baumgartner, J., O’Donnell, J.: taxize: Taxonomic Information from Around the
Web. https://cran.r-project.org/web/packages/taxize (2016)
26. Allouche, O., Tsoar, A., Kadmon, R.: Assessing the accuracy of species distribution
models: prevalence, kappa and the true skill statistic (TSS). Journal of Applied
Ecology 43(6) (sep 2006) 1223–1232
27. Knaus, J.: snowfall: Easier cluster computing (based on snow). https://cran.r-project.org/web/packages/snowfall (2016)
28. Wickham, H.: Advanced R. Chapman and Hall/CRC (2014)
29. Simmonds, C.: Mastering Embedded Linux Programming. Packt (2015)
30. Biomodelos: Instituto Alexander von Humboldt.
http://biomodelos.humboldt.org.co (2016)
31. Vicario, S., Hardisty, A., Haitas, N.: BioVeL: Biodiversity Virtual e-Laboratory.
EMBnet.journal 17(2) (sep 2011) 5
32. Liu, J., Pacitti, E., Valduriez, P., Mattoso, M.: A Survey of Data-Intensive Scientific
Workflow Management. Journal of Grid Computing 13(4) (mar 2015) 457–493
33. Souza Muñoz, M.E., Giovanni, R., Siqueira, M.F., Sutton, T., Brewer, P., Pereira, R.S., Canhos, D.A.L., Canhos, V.P.: openModeller: a generic approach to species' potential distribution modelling. GeoInformatica 15(1) (aug 2009) 111–135
34. Naimi, B., Araújo, M.B.: sdm: a reproducible and extensible R platform for species distribution modelling. Ecography 39(4) (feb 2016) 368–375
35. Kass, J., Anderson, R.P., Aiello-Lammens, M., Muscarella, B., Vilela, B.: Wallace (beta v0.1): Harnessing Digital Biodiversity Data for Predictive Modeling, Fueled by R. http://devpost.com/software/wallace-beta-v0-1-harnessing-digital-biodiversity-data-for-predictive-modeling-fueled-by-r (2016)
36. Pennington, D.D., Higgins, D., Peterson, A.T., Jones, M.B., Ludäscher, B., Bowers, S.: Ecological Niche Modeling Using the Kepler Workflow System. In Taylor, I.J., Deelman, E., Gannon, D.B., Shields, M., eds.: Workflows for e-Science. Springer London (jan 2007) 91–108
37. Talbert, C., Talbert, M., Morisette, J., Koop, D.: Data Management Challenges
in Species Distribution Modeling. IEEE Bulletin of the Technical Committee on
Data Engineering 36(4) (2013) 31–40
38. Morisette, J.T., Jarnevich, C.S., Holcombe, T.R., Talbert, C.B., Ignizio, D., Tal-
bert, M.K., Silva, C., Koop, D., Swanson, A., Young, N.E.: VisTrails SAHM:
visualization and workflow management for species habitat modeling. Ecography
36(2) (feb 2013) 129–135
39. Candela, L., Castelli, D., Coro, G., Pagano, P., Sinibaldi, F.: Species distribution
modeling in the cloud. Concurrency and Computation: Practice and Experience
28(4) (jul 2016) 1056–1079
40. Candela, L., Castelli, D., Coro, G., Lelii, L., Mangiacrapa, F., Marioli, V., Pagano,
P.: An Infrastructure-oriented Approach for supporting Biodiversity Research.
Ecological Informatics (aug 2014)
41. Amaral, R., Badia, R.M., Blanquer, I., Braga-Neto, R., Candela, L., Castelli, D.,
Flann, C., De Giovanni, R., Gray, W.A., Jones, A., Lezzi, D., Pagano, P., Perez-
Canhos, V., Quevedo, F., Rafanell, R., Rebello, V., Sousa-Baena, M.S., Torres, E.:
Supporting biodiversity studies with the EUBrazilOpenBio Hybrid Data Infras-
tructure. Concurrency and Computation: Practice and Experience 27(2) (2015)
376–394
42. Forzza, R., Mynssen, C., Tamaio, N., Barros, C., Franco, L., Pereira, M.: As coleções do herbário. In: 200 anos do Jardim Botânico do Rio de Janeiro. Jardim Botânico do Rio de Janeiro, Rio de Janeiro (2008)
43. Mondelli, M.L., Galheigo, M., Medeiros, V., Bastos, B.F., Gomes, A.T.A., Vasconcelos, A.T.R., Gadelha Jr., L.M.R.: Integrating Scientific Workflows with Scientific Gateways: A Bioinformatics Experiment in the Brazilian National High-Performance Computing Network. In: X Brazilian e-Science Workshop. Anais do XXXVI Congresso da Sociedade Brasileira de Computação, SBC (2016) 277–284
44. Wilde, M., Hategan, M., Wozniak, J.M., Clifford, B., Katz, D.S., Foster, I.: Swift:
A language for distributed parallel scripting. Parallel Computing 37(9) (sep 2011)
633–652
45. Gadelha, L.M.R., Wilde, M., Mattoso, M., Foster, I.: Exploring provenance in high
performance scientific computing. In: Proc. of the 1st Annual Workshop on High
Performance Computing meets Databases - HPCDB ’11, ACM Press (nov 2011)
17–20
46. Mondelli, M.L., de Souza, M.T., Ocaña, K., de Vasconcelos, A.T.R., Gadelha Jr, L.M.R.: HPSW-Prof: A Provenance-Based Framework for Profiling High Performance Scientific Workflows. In: Proc. of Satellite Events of the 31st Brazilian Symposium on Databases (SBBD 2016), SBC (2016) 117–122
47. Armbrust, M., Das, T., Davidson, A., Ghodsi, A., Or, A., Rosen, J., Stoica, I.,
Wendell, P., Xin, R., Zaharia, M.: Scaling Spark in the Real World: Performance
and Usability. Proceedings of the VLDB Endowment 8(12) (2015) 1840–1843
48. Venkataraman, S., Stoica, I., Zaharia, M., Yang, Z., Liu, D., Liang, E., Falaki, H.,
Meng, X., Xin, R., Ghodsi, A., Franklin, M.: SparkR: Scaling R Programs with
Spark. In: Proceedings of the 2016 International Conference on Management of
Data - SIGMOD ’16, New York, New York, USA, ACM Press (2016) 1099–1104
... Currently, there are several packages available in R that can help in producing useful data in the reproducibility processes of the ENM experiments (Cobos, Peterson, Barve, & Osorio-Olvera, 2019;de Andrade, Velazco, & De Marco Júnior, 2020;Golding et al., 2018;Kass et al., 2018;Qiao et al., 2016;Sánchez-Tapia et al., 2018). A framework for scalable and reproducible ENM Model-R (Sánchez-Tapia et al., 2018) was developed with the objective of unifying and automating preprocessing, processing, and postprocessing steps, as well to maintain all this information for reproducibility uses. ...
... Currently, there are several packages available in R that can help in producing useful data in the reproducibility processes of the ENM experiments (Cobos, Peterson, Barve, & Osorio-Olvera, 2019;de Andrade, Velazco, & De Marco Júnior, 2020;Golding et al., 2018;Kass et al., 2018;Qiao et al., 2016;Sánchez-Tapia et al., 2018). A framework for scalable and reproducible ENM Model-R (Sánchez-Tapia et al., 2018) was developed with the objective of unifying and automating preprocessing, processing, and postprocessing steps, as well to maintain all this information for reproducibility uses. This tool includes packages related to retrieving and cleaning data, multi-projection tools that can be applied to different temporal and spatial datasets, and postprocessing tools linked to the generated models. ...
Article
Full-text available
The unprecedented size of the human population, along with its associated economic activities, has an ever‐increasing impact on global environments. Across the world, countries are concerned about the growing resource consumption and the capacity of ecosystems to provide resources. To effectively conserve biodiversity, it is essential to make indicators and knowledge openly available to decision‐makers in ways that they can effectively use them. The development and deployment of tools and techniques to generate these indicators require having access to trustworthy data from biological collections, field surveys and automated sensors, molecular data, and historic academic literature. The transformation of these raw data into synthesized information that is fit for use requires going through many refinement steps. The methodologies and techniques applied to manage and analyze these data constitute an area usually called biodiversity informatics. Biodiversity data follow a life cycle consisting of planning, collection, certification, description, preservation, discovery, integration, and analysis. Researchers, whether producers or consumers of biodiversity data, will likely perform activities related to at least one of these steps. This article explores each stage of the life cycle of biodiversity data, discussing its methodologies, tools, and challenges. This article is categorized under: • Algorithmic Development > Biological Data Mining Abstract Biodiversity data has a life‐cycle that involves planning, gathering, quality improvement, documentation through metadata, preservation through publication, discovery and integration, and analysis. This article describes the practices, tools, and challenges in each step of this life‐cycle.
... After the compilation, the Eremanthus database was edited to remove unreliable records, keeping only data suitable for use in research, eliminating problems with misidentification, inaccuracy, records outside raster boundaries, more than one datapoint per pixel and duplicated records (Giannini et al. 2012). This process was carried out manually and with the clean functions ("clean_dupl", "clean_nas", "clean_uni") of the "modleR" 0.0.0.9000 (Sánchez-Tapia et al. 2018) package in RStudio 1.3.1056 (RStudio Team 2020) with R 3.6.3 ...
Article
Full-text available
Characterized as one of the largest biodiversity hot spots, the Cerrado ecoregion is home to a wide variety of endemic species. Several threats such as agricultural expansion and habitat fragmentation put the species of the Cerrado ecosystems and biodiversity at risk. Thus, this study analysed the genus Eremanthus, which has abundant species in the Cerrado and suffers from intense anthropogenic pressure due to overexploitation, mainly as a material utilized for the construction of fences and the extraction of essential oils. Environmental suitability was estimated for the genus for the present and future (2070), in order to characterize the importance of the climate in the species distribution and to analyse the conservation status of the genus. The Species Distribution Modelling and Gap Analysis showed that the areas of environmental suitability are limited and are found in a matrix composed of a high presence of anthropic activity, which can intensify the loss of species habitat and increase the vulnerability of the group. The studied species were classified as Endangered and Vulnerable according to IUCN criteria, presenting very reduced areas of environmental suitability projected in the future and a low percentage of species in protected areas, that may influence possible species extinctions in the genus. Thus, this study provides insights to assist in conservation planning and reinforces the importance of protecting the biodiversity of the Cerrado.
... After the compilation, the Eremanthus database was edited to remove unreliable records, keeping only data suitable for use in research, eliminating problems with misidentification, inaccuracy, records outside raster boundaries, more than one datapoint per pixel and duplicated records (Giannini et al. 2012). This process was carried out manually and with the clean functions ("clean_dupl", "clean_nas", "clean_uni") of the "modleR" 0.0.0.9000 (Sánchez-Tapia et al. 2018) package in RStudio 1.3.1056 (RStudio Team 2020) with R 3.6.3 ...
Preprint
Full-text available
Characterized as one of the largest biodiversity hotspots, the Cerrado ecoregion houses a wide variety of endemic species. Several threats, such as agricultural expansion and habitat fragmentation, put the species of the Cerrado ecosystems and biodiversity at risk. The genus Eremanthus is frequent in the Cerrado and suffers from intense anthropogenic pressure due to overexploitation mainly for the construction of fences and extraction of essential oil. Environmental suitability, of the Mid-Holocene, present and future (2070), were estimated for the genus in order to characterize the importance of the climate in the species distribution and to analyse the conservation status. The Species Distribution Modelling showed that most species of Eremanthus presented similarities between the environmental suitability of the present and the Mid-Holocene, enabling the identification of areas of environmental stability in OSL areas of campos rupestres. The species of the genus were classified as Endangered and Vulnerable according to IUCN criteria, presenting very reduced areas of environmental suitability projected in the future and a low percentage of species in Protected Areas, that may influence possible extinctions of species in the genus. The approaches in this study provide consistent subsidies to assist in conservation planning.
... BIOCLIM and Mahalanobis Distance, Busby 1991), some require presence and background data (Maxent, Phillips et al. 2006), and many require presence and (pseudo)absence data (e.g. GLM, GAM, and Random Forest, Breiman 2001;Sánchez-Tapia et al. 2018). ...
Article
Different models are available to estimate species’ niche and distribution. Mechanistic and correlative models have different underlying conceptual bases, thus generating different estimates of a species’ niche and geographic extent. Hybrid models, which combining correlative and mechanistic approaches, are considered a promising strategy, however, no synthesis in the literature assessed their applicability for terrestrial vertebrates to allow best-choice model considering their strengths and trade-offs. Here, we provide a systematic review of studies that compared or integrated correlative and mechanistic models to estimate species’ niche for terrestrial vertebrates under climate change. Our goal was to understand their conceptual, methodological, and performance differences, and the applicability of each approach. The studies we reviewed directly compared mechanistic and correlative predictions in terms of accuracy or estimated suitable area, however, without any quantitative analysis to support comparisons. Contrastingly, many studies suggest that instead of comparing approaches, mechanistic and correlative methods should be integrated (hybrid models). However, we stress that the best approach is highly context-dependent. Indeed, the quality and effectiveness of the prediction depends on the study's objective, methodological design, and which type of species’ niche and geographic distribution estimated are more appropriate to answer the study's issue. This article is protected by copyright. All rights reserved
Article
Full-text available
Species Distribution Models have emerged in recent decades as a powerful tool for biodiversity research, as they allow, for example, assessing the status of species in their whole distributions. These models have been particularly helpful in demonstrating that amphibian populations are at high risk, with many declining and others already extinct. Here, we assessed the potential distribution of the helmeted water toad (Calyptocephalella gayi), an endemic Chilean amphibian considered a living fossil and cataloged as vulnerable, which requires significant conservation efforts. We modeled the species' potential distribution using the Maximum Entropy (Maxent) approach and determined the overlap with national protected areas. In addition, we conducted a geospatial risk analysis to estimate the threat level to which this toad is being subjected. The potential distribution of C. gayi ranges from 28 S to 44 S, mainly explained by altitude, mean diurnal temperature range, slope, and distance to water bodies. Protected areas only cover 3.55% of the species' potential geographic distribution, which is of concern, considering that the geospatial risk analysis showed that 60.61% of C. gayi's distribution is subjected to extreme and high risks. We discuss how these results are relevant to focusing and directing efficient protection and conservation efforts for this species shortly. K E Y W O R D S amphibian conservation, Calyptocephalella gayi, endemism, geospatial risk analysis
Article
Full-text available
Integrated Environmental Assessment systems and ecosystem models study the links between anthropogenic and climatic pressures on marine ecosystems and help understand how to manage the effects of the unsustainable exploitation of ocean resources. However, these models have long implementation times, data and model interoperability issues and require heterogeneous competencies. Therefore, they would benefit from simplification, automatisation, and enhanced integrability of the underlying models. Artificial Intelligence can help overcome several limitations by speeding up the modelling of crucial functional parts, e.g. estimating the environmental conditions fostering a species’ persistence and proliferation in an area (the species’ ecological niche) and, consequently, its geographical distribution. This paper presents a full-automatic workflow to estimate species’ distributions through statistical and machine learning models. It embeds four ecological niche models with complementary approaches, i.e. Artificial Neural Networks, Maximum Entropy, Support Vector Machines, and AquaMaps. It automatically estimates the optimal model parametrisations and decision thresholds to distinguish between suitable- and unsuitable-habitat locations and combines the models within one ensemble model. Finally, it combines several ensemble models to produce a species richness map (biodiversity index). The software is open-source, Open Science compliant, and available as a Web Processing Service-standardised cloud computing service that enhances efficiency, integrability, cross-domain reusability, and experimental reproduction and repetition. 
We first assess workflow stability and sensitivity and then demonstrate effectiveness by producing a biodiversity index for the Mediterranean based on ∼\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\sim $$\end{document}1500 species data. Moreover, we predict the spread of the invasive Siganus rivulatus in the Mediterranean and its current and future overlap with the native Sarpa salpa under different climate change scenarios.
Article
Full-text available
Palms species (Arecaceae) are abundant in tropical forests and influence ecosystems in important ways. Moreover, they are a relevant feature in the Atlantic Forest biodiversity hotspot. In this study, we seek to better understand the distribution of palm richness in Rio de Janeiro state, Brazil, with the aim to support conservation decisions and actions. Maps for 15 palm species were generated through species distribution modeling and then stacked into a palm richness map, which was further combined with current land-use and protected area maps to generate a realistic portrayal of the current situation of Arecaceae in the state. Our results revealed an increasing inland-to-coast pattern of richness that matches the biogeographical subdivision of the Atlantic Forest. Considering the land-use information, the palm species potential distribution is drastically reduced, especially for some species which already have a restricted distribution in the state. We also identified the most relevant protected areas for the conservation of palms in the state and those which might have been overlooked in floristic inventories, thus requiring more detailed investigation. Moreover, we point out those species with few points, for which species distribution models could not be built, and argue that they are the ones more likely to be threatened by habitat loss and should be the focus of specimen collection and recording. Finally, we draw attention to a large medium-richness remnant located between two protected areas which probably functions as a connection between them and should be a priority area for conservation.
Article
Full-text available
The COVID-19 pandemic has led to reduced anthropogenic pressure on ecosystems in several world areas, but resulting ecosystem responses in these areas have not been investigated. This paper presents an approach to make quick assessments of potential habitat changes in 2020 of eight marine species of commercial importance in the Adriatic Sea. Measurements from floating probes are interpolated through an advection-equation based model. The resulting distributions are then combined with species observations through an ecological niche model to estimate habitat distributions in the past years (2015–2018) at 0.1° spatial resolution. Habitat patterns over 2019 and 2020 are then extracted and explained in terms of specific environmental parameter changes. These changes are finally assessed for their potential dependency on climate change patterns and anthropogenic pressure change due to the pandemic. Our results demonstrate that the combined effect of climate change and the pandemic could have heterogeneous effects on habitat distributions: three species (Squilla mantis, Engraulis encrasicolus, and Solea solea) did not show significant niche distribution change; habitat suitability positively changed for Sepia officinalis, but negatively for Parapenaeus longirostris, due to increased temperature and decreasing dissolved oxygen (in the Adriatic) generally correlated with climate change; the combination of these trends with an average decrease in chlorophyll, probably due to the pandemic, extended the habitat distributions of Merluccius merluccius and Mullus barbatus but reduced Sardina pilchardus distribution. Although our results are based on approximated data and reliable at a macroscopic level, we present a very early insight of modifications that will possibly be observed years after the end of the pandemic when complete data will be available. 
Our approach is entirely based on Findable, Accessible, Interoperable, and Reusable (FAIR) data and is general enough to be used for other species and areas.
Article
Full-text available
Polyploidy is defined as the presence of more than two complete chromosome sets in an organism and has frequently occurred throughout the history of angiosperms. Polyploidization is a process that typically results in instant speciation. Using Psidium cattleyanum, a natural polyploid complex with several cytotypes, we aim to test two hypotheses regarding speciation in polyploids: polyploidization promotes (1) interruption of gene flow and (2) intraspecific niche divergence. We analyzed 12 natural populations of P. cattleyanum, integrating population genetics data, accessed by microsatellite markers, and climatic niche analysis, using environmental niche modeling, to provide insights about polyploid speciation. We found strong genetic structure in populations and cytotypes and low environmental niche similarity between cytotypes. Genetic diversity declines with increasing ploidy levels which is probably associated with asexual reproduction. Our results corroborate that polyploidy is generating a reproductive barrier and is associated with niche divergence among cytotypes. Therefore, we infer future divergent lineages between cytotypes of P. cattleyanum and confirm the role of polyploidy as an evolutionary step in speciation in this group. Additionally, this study provides new information for the discussion about how polyploidy affects the genetic diversity of taxa and ecological niches.
Book
Full-text available
This book provides a first synthetic view of an emerging area of ecology and biogeography, linking individual- and population-level processes to geographic distributions and biodiversity patterns. Problems in evolutionary ecology, macroecology, and biogeography are illuminated by this integrative view. The book focuses on correlative approaches known as ecological niche modeling, species distribution modeling, or habitat suitability modeling, which use associations between known occurrences of species and environmental variables to identify environmental conditions under which populations can be maintained. The spatial distribution of environments suitable for the species can then be estimated: a potential distribution for the species. This approach has broad applicability to ecology, evolution, biogeography, and conservation biology, as well as to understanding the geographic potential of invasive species and infectious diseases, and the biological implications of climate change. The book lays out conceptual foundations and general principles for understanding and interpreting species distributions with respect to geography and environment. Focus is on development of niche models. While serving as a guide for students and researchers, the book also provides a theoretical framework to support future progress in the field.
Conference Paper
Full-text available
Scientific experiments usually demand high performance computing (HPC) and involve the execution of a flow of activities, so they can be modeled as scientific workflows. Scientific Workflow Management Systems (SWfMS) provide ways of defining and executing these experiments in HPC environments and they produce detailed information about the workflow composition and execution. However, analyzing this information is not always trivial. This paper presents a profiling framework called HPSW-Prof 1 that aims to provide the user with a set of features for the statistical treatment and manipulation of provenance information obtained from scientific experiments executed with SWfMS Swift. Through the HPSW-Prof, data analysis can become a transparent process since it also offers a visualization layer that supports users for better accessing and manipulating their results.
Conference Paper
Full-text available
Bioinformatics experiments are rapidly and constantly evolving due improvements in sequencing technologies. These experiments usually demand high performance computation and produce huge quantities of data. They also require different programs to be executed in a certain order, allowing the experiments to be modeled as workflows. However, users do not always have the infrastructure needed to perform these experiments. Our contribution is the integration of scientific workflow management systems and grid-enabled scientific gateways, providing the user with a transparent way to run these workflows in geographically distributed computing resources. The availability of the work-flow through the gateway allows for a better usability of these experiments.
Conference Paper
Full-text available
R is a popular statistical programming language with a number of extensions that support data processing and machine learning tasks. However, interactive data analysis in R is usually limited as the R runtime is single threaded and can only process data sets that fit in a single machine's memory. We present SparkR, an R package that provides a frontend to Apache Spark and uses Spark's distributed computation engine to enable large scale data analysis from the R shell. We describe the main design goals of SparkR, discuss how the high-level DataFrame API enables scalable computation and present some of the key details of our implementation.
Article
Full-text available
sdm is an object-oriented, reproducible and extensible, platform for species distribution modelling. It uses individual species and community-based approaches, enabling ensembles of models to be fitted and evaluated, to project species potential distributions in space and time. It provides a standardized and unified structure for handling species distributions data and modelling techniques, and supports markedly different modelling approaches, including correlative, process-based (mechanistic), agent-based, and cellular automata. The object-oriented design of software is such that scientists can modify existing methods, extend the framework by developing new methods or modelling procedures, and share them to be reproduced by other scientists. sdm can handle spatial and temporal data for single or multiple species and uses high performance computing solutions to speed up modelling and simulations. The framework is implemented in R, providing a flexible and easy-to-use GUI interface.
Article
The areal distribution of the snake family Colubridae in China was analyzed quantitatively with the aims of determining zoogeographic regions, areas of endemism, priority areas for conservation, and important environmental factors. A presence/absence data matrix of 141 Colubridae species was analyzed with two-way indicator species analysis (TWINSPAN) for regionalization, parsimony analysis of endemicity for areas of endemism, and linear programming for priority-area selection. Ecological niche modeling was integrated into priority-area selection in order to protect species' potential suitable habitats; the BIOCLIM true/false model was used because of its conservative predictions. Results indicated nine major zoogeographical regions based on Colubridae species, some of which had been documented by previous zoogeographical regionalizations of East Asia. Four endemic areas were identified by parsimony analysis of endemicity: one in Yunnan, one in Taiwan, and two in Tibet. The optimal set of priority areas comprised thirty-five grid cells based on the species' suitable habitat ranges.
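The BIOCLIM true/false model is conservative because it is a rectilinear envelope: a site is scored suitable only when every environmental variable falls within the range observed at the known presences, so it never extrapolates beyond the training values. A minimal sketch (Python rather than the authors' tooling; the variable choices and sample values are hypothetical):

```python
def bioclim_true_false(presence_env):
    """Fit a rectilinear envelope from environmental values at presence
    records; each record is a tuple of variables (e.g. temperature,
    precipitation). Returns a True/False suitability predictor."""
    n_vars = len(presence_env[0])
    lows = [min(rec[i] for rec in presence_env) for i in range(n_vars)]
    highs = [max(rec[i] for rec in presence_env) for i in range(n_vars)]
    def predict(site):
        # Suitable only if every variable is inside the observed range.
        return all(lo <= v <= hi for v, lo, hi in zip(site, lows, highs))
    return predict

# Hypothetical presence records: (mean temperature in C, annual precipitation in mm)
presences = [(18.2, 1100.0), (21.5, 1450.0), (19.9, 1300.0)]
suitable = bioclim_true_false(presences)

print(suitable((20.0, 1200.0)))  # True: inside the envelope on both axes
print(suitable((25.0, 1200.0)))  # False: temperature above the observed max
```

Because the prediction can only be true inside the observed hyper-rectangle, the model systematically underpredicts rather than overpredicts suitable area, which is why it suits a protection-oriented priority-area analysis.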
Article
Nowadays, more and more computer-based scientific experiments must handle massive amounts of data. Their data processing consists of multiple computational steps with dependencies among them, and a data-intensive scientific workflow is a useful model of such a process. Since the sequential execution of data-intensive scientific workflows may take a long time, Scientific Workflow Management Systems (SWfMSs) should enable their parallel execution and exploit resources distributed across different infrastructures such as grids and clouds. This paper provides a survey of data-intensive scientific workflow management in SWfMSs and of their parallelization techniques. Based on a SWfMS functional architecture, we give a comparative analysis of the existing solutions. Finally, we identify research issues for improving the execution of data-intensive scientific workflows in a multisite cloud.
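The core parallelization technique surveyed here — running independent workflow steps concurrently while respecting their dependencies — can be illustrated with a topological scheduler over a task DAG. This is a generic standard-library Python sketch, not any SWfMS's actual engine; the task names and dependency graph are hypothetical, and it assumes the graph is acyclic:

```python
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

def run_workflow(tasks, deps, max_workers=4):
    """Execute `tasks` (name -> callable) respecting `deps`
    (name -> set of prerequisite names); independent tasks run in
    parallel. Returns task names in completion order."""
    remaining = {name: set(deps.get(name, ())) for name in tasks}
    done, order = set(), []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        running = {}
        while remaining or running:
            # Submit every task whose prerequisites are all satisfied.
            ready = [n for n, d in remaining.items() if d <= done]
            for name in ready:
                del remaining[name]
                running[pool.submit(tasks[name])] = name
            finished, _ = wait(running, return_when=FIRST_COMPLETED)
            for fut in finished:
                name = running.pop(fut)
                fut.result()  # propagate task errors
                done.add(name)
                order.append(name)
    return order

# Hypothetical four-step pipeline: B and C depend on A, D on both B and C.
tasks = {n: (lambda n=n: n) for n in "ABCD"}
deps = {"B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
order = run_workflow(tasks, deps)
print(order[0], order[-1])  # A runs first, D last; B and C overlap in between
```

Real SWfMSs add to this skeleton exactly the concerns the survey covers: data movement between steps, placement of tasks across sites, and fault tolerance when a step fails mid-run.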
Article
Apache Spark is one of the most widely used open source processing engines for big data, with rich language-integrated APIs and a wide range of libraries. Over the past two years, our group has worked to deploy Spark to a wide range of organizations through consulting relationships as well as our hosted service, Databricks. We describe the main challenges and requirements that appeared in taking Spark to a wide set of users, and usability and performance improvements we have made to the engine in response.
Article
The aim of the BioVeL project is to provide a seamlessly connected informatics environment that makes it easier for biodiversity scientists to carry out in-silico analysis of relevant biodiversity data and to pursue in-silico experimentation based on composing and executing sequences of complex digital data manipulations and modelling tasks. In BioVeL, scientists and technologists will work together to meet the needs and demands of in-silico or ‘e-Science’ research and to create a production-quality informatics infrastructure that enables the pipelining of data and analysis into efficient integrated workflows. Workflows represent a way of speeding up scientific advance when that advance is based on the manipulation of digital data.