Publications

Sort by

Direction

Title contains

Teams

Persons

Author last name contains

INSaFLU: an automated open web-based bioinformatics suite “from-reads” for influenza whole-genome-sequencing-based surveillance

InSaFLU

Abstract (Expand)

Background A new era of flu surveillance has already started based on the genetic characterization and exploration of influenza virus evolution at whole-genome scale. Although this has been prioritized …d by national and international health authorities, the demanded technological transition to whole-genome sequencing (WGS)-based flu surveillance has been particularly delayed by the lack of bioinformatics infrastructures and/or expertise to deal with primary next-generation sequencing (NGS) data. Results We developed and implemented INSaFLU (“INSide the FLU”), which is the first influenza-oriented bioinformatics free web-based suite that deals with primary NGS data (reads) towards the automatic generation of the output data that are actually the core first-line “genetic requests” for effective and timely influenza laboratory surveillance (e.g., type and sub-type, gene and whole-genome consensus sequences, variants’ annotation, alignments and phylogenetic trees). By handling NGS data collected from any amplicon-based schema, the implemented pipeline enables any laboratory to perform multi-step software intensive analyses in a user-friendly manner without previous advanced training in bioinformatics. INSaFLU gives access to user-restricted sample databases and projects management, being a transparent and flexible tool specifically designed to automatically update project outputs as more samples are uploaded. Data integration is thus cumulative and scalable, fitting the need for a continuous epidemiological surveillance during the flu epidemics. Multiple outputs are provided in nomenclature-stable and standardized formats that can be explored in situ or through multiple compatible downstream applications for fine-tuned data analysis. This platform additionally flags samples as “putative mixed infections” if the population admixture enrolls influenza viruses with clearly distinct genetic backgrounds, and enriches the traditional “consensus-based” influenza genetic characterization with relevant data on influenza sub-population diversification through a depth analysis of intra-patient minor variants. This dual approach is expected to strengthen our ability not only to detect the emergence of antigenic and drug resistance variants but also to decode alternative pathways of influenza evolution and to unveil intricate routes of transmission. Conclusions In summary, INSaFLU supplies public health laboratories and influenza researchers with an open “one size fits all” framework, potentiating the operationalization of a harmonized multi-country WGS-based surveillance for influenza virus.

Authors: Vítor Borges, Miguel Pinheiro, Pedro Pechirra, Raquel Guiomar, João Paulo Gomes

Date Published: 1st Dec 2018

Publication Type: InProceedings

DOI: 10.1186/s13073-018-0555-0

Citation: Genome Med 10(1)

Created: 8th Apr 2020 at 11:56, Last updated: 16th Jan 2023 at 13:34

BioExcel Building Blocks, a software library for interoperable biomolecular simulation workflows

BioBB Building Blocks

(Show All)

Abstract (Expand)

In the recent years, the improvement of software and hardware performance has made biomolecular simulations a mature tool for the study of biological processes. Simulation length and the size and …

Authors: Pau Andrio, Adam Hospital, Javier Conejero, Luis Jordá, Marc Del Pino, Laia Codo, Stian Soiland-Reyes, Carole Goble, Daniele Lezzi, Rosa M. Badia, Modesto Orozco, Josep Ll. Gelpi

Date Published: 1st Dec 2019

Publication Type: Journal

DOI: 10.1038/s41597-019-0177-4

Citation: Sci Data 6(1),169

Created: 14th Jun 2021 at 11:27, Last updated: 16th Jan 2023 at 13:34

Making Canonical Workflow Building Blocks interoperable across workflow languages

BioBB Building Blocks

(Show All)

Abstract (Expand)

We here introduce the concept of Canonical Workflow Building Blocks (CWBB), a methodology of describing and wrapping computational tools, in order for them to be utilized in a reproducible manner from …

Authors: Stian Soiland-Reyes, Genís Bayarri, Pau Andrio, Robin Long, Douglas Lowe, Ania Niewielska, Adam Hospital

Date Published: 7th Mar 2021

Publication Type: Journal

DOI: 10.1162/dint_a_00135

Citation:

Created: 14th Jun 2021 at 11:40, Last updated: 16th Jan 2023 at 13:34

Methods Included: Standardizing Computational Reuse and Portability with the Common Workflow Language

Common Workflow Language (CWL) community

(Show All)

Abstract (Expand)

A widely used standard for portable multilingual data analysis pipelines would enable considerable benefits to scholarly publication reuse, research/industry collaboration, regulatory cost control, and …

Authors: Michael R. Crusoe, Sanne Abeln, Alexandru Iosup, Peter Amstutz, John Chilton, Nebojša Tijanić, Hervé Ménager, Stian Soiland-Reyes, Carole Goble

Date Published: 14th May 2021

Publication Type: Unpublished

Citation: arXiv 2105.07028 [cs.DC]

Created: 5th Jul 2021 at 12:58, Last updated: 16th Jan 2023 at 13:34

MicroPIPE: validating an end-to-end workflow for high-quality complete bacterial genome construction.

QCIF Bioinformatics

Abstract (Expand)

BACKGROUND: Oxford Nanopore Technology (ONT) long-read sequencing has become a popular platform for microbial researchers due to the accessibility and affordability of its devices. However, easy and …

Authors: V. Murigneux, L. W. Roberts, B. M. Forde, M. D. Phan, N. T. K. Nhu, A. D. Irwin, P. N. A. Harris, D. L. Paterson, M. A. Schembri, D. M. Whiley, S. A. Beatson

Date Published: 25th Jun 2021

Publication Type: Journal

PubMed ID: 34172000

Citation: BMC Genomics. 2021 Jun 25;22(1):474. doi: 10.1186/s12864-021-07767-z.

Created: 9th Aug 2021 at 00:26, Last updated: 16th Jan 2023 at 13:34

Perspectives on automated composition of workflows in the life sciences

FAIR Computational Workflows

(Show All)

Abstract (Expand)

Scientific data analyses often combine several computational tools in automated pipelines, or workflows. Thousands of such workflows have been used in the life sciences, though their composition has …

Authors: Anna-Lena Lamprecht, Magnus Palmblad, Jon Ison, Veit Schwämmle, Mohammad Sadnan Al Manir, Ilkay Altintas, Christopher J. O. Baker, Ammar Ben Hadj Amor, Salvador Capella-Gutierrez, Paulos Charonyktakis, Michael R. Crusoe, Yolanda Gil, Carole Goble, Timothy J. Griffin, Paul Groth, Hans Ienasescu, Pratik Jagtap, Matúš Kalaš, Vedran Kasalica, Alireza Khanteymoori, Tobias Kuhn, Hailiang Mei, Hervé Ménager, Steffen Möller, Robin A. Richardson, Vincent Robert, Stian Soiland-Reyes, Robert Stevens, Szoke Szaniszlo, Suzan Verberne, Aswin Verhoeven, Katherine Wolstencroft

Date Published: 2021

Publication Type: Journal

DOI: 10.12688/f1000research.54159.1

Citation: F1000Res 10:897

Created: 1st Dec 2021 at 21:35, Last updated: 16th Jan 2023 at 13:34

FAIR Computational Workflows

(Show All)

Abstract (Expand)

Computational workflows describe the complex multi-step methods that are used for data collection, data preparation, analytics, predictive modelling, and simulation that lead to new data products. They …

Authors: Carole Goble, Sarah Cohen-Boulakia, Stian Soiland-Reyes, Daniel Garijo, Yolanda Gil, Michael R. Crusoe, Kristian Peters, Daniel Schober

Date Published: 2020

Publication Type: Journal

DOI: 10.1162/dint_a_00033

Citation: Data Intellegence 2(1-2):108-121

Created: 1st Dec 2021 at 21:43, Last updated: 16th Jan 2023 at 13:34

FAIR Computational Workflows

Testing

Abstract

Not specified

Authors: Carole Goble, Sarah Cohen-Boulakia, Stian Soiland-Reyes, Daniel Garijo, Yolanda Gil, Michael R. Crusoe, Kristian Peters, Daniel Schober

Date Published: 2020

Publication Type: Journal

DOI: 10.1162/dint_a_00033

Citation: Data Intellegence 2(1-2):108-121

Created: 2nd Dec 2021 at 10:16, Last updated: 16th Jan 2023 at 13:34

Landscape Analysis for the Specimen Data Refinery

Specimen Data Refinery

(Show All)

Abstract (Expand)

This report reviews the current state-of-the-art applied approaches on automated tools, services and workflows for extracting information from images of natural history specimens and their labels. We …

Authors: Stephanie Walton, Laurence Livermore, Olaf Bánki, Robert W. N. Cubey, Robyn Drinkwater, Markus Englund, Carole Goble, Quentin Groom, Christopher Kermorvant, Isabel Rey, Celia M Santos, Ben Scott, Alan Williams, Zhengzhe Wu

Date Published: 14th Aug 2020

Publication Type: Journal

DOI: 10.3897/rio.6.e57602

Citation: Walton S, Livermore L, Bánki O, Cubey RWN, Drinkwater R, Englund M, Goble C, Groom Q, Kermorvant C, Rey I, Santos CM, Scott B, Williams AR, Wu Z (2020) Landscape Analysis for the Specimen Data Refinery. Research Ideas and Outcomes 6: e57602. https://doi.org/10.3897/rio.6.e57602

Created: 8th Dec 2021 at 16:58, Last updated: 16th Jan 2023 at 13:34

Perspectives on automated composition of workflows in the life sciences

Testing

Abstract

Not specified

Date Published: 2021

Publication Type: Journal

DOI: 10.12688/f1000research.54159.1

Citation: F1000Res 10:897

Created: 8th Dec 2021 at 17:07, Last updated: 16th Jan 2023 at 13:34

Ten simple rules for making a software tool workflow-ready

Specimen Data Refinery, FAIR Computational Workflows

(Show All)

Abstract (Expand)

Workflows have become a core part of computational scientific analysis in recent years. Automated computational workflows multiply the power of researchers, potentially turning “hand-cranked” data …

Authors: Paul Brack, Peter Crowther, Stian Soiland-Reyes, Stuart Owen, Douglas Lowe, Alan R. Williams, Quentin Groom, Mathias Dillen, Frederik Coppens, Björn Grüning, Ignacio Eguinoa, Philip Ewels, Carole Goble

Date Published: 24th Mar 2022

Publication Type: Journal

DOI: 10.1371/journal.pcbi.1009823

Citation: PLoS Comput Biol 18(3):e1009823

Created: 25th Apr 2022 at 11:43, Last updated: 16th Jan 2023 at 13:34

The Specimen Data Refinery: A Canonical Workflow Framework and FAIR Digital Object Approach to Speeding up Digital Mobilisation of Natural History Collections

Specimen Data Refinery, FAIR Computational Workflows

(Show All)

Abstract (Expand)

A key limiting factor in organising and using information from physical specimens curated in natural science collections is making that information computable, with institutional digitization tending … to focus more on imaging the specimens themselves than on efficiently capturing computable data about them. Label data are traditionally manually transcribed today with high cost and low throughput, rendering such a task constrained for many collection-holding institutions at current funding levels. We show how computer vision, optical character recognition, handwriting recognition, named entity recognition and language translation technologies can be implemented into canonical workflow component libraries with findable, accessible, interoperable, and reusable (FAIR) characteristics. These libraries are being developed in a cloud- based workflow platform—the ‘Specimen Data Refinery’ (SDR)—founded on Galaxy workflow engine, Common Workflow Language, Research Object Crates (RO-Crate) and WorkflowHub technologies. The SDR can be applied to specimens’ labels and other artefacts, offering the prospect of greatly accelerated and more accurate data capture in computable form. Two kinds of FAIR Digital Objects (FDO) are created by packaging outputs of SDR workflows and workflow components as digital objects with metadata, a persistent identifier, and a specific type definition. The first kind of FDO are computable Digital Specimen (DS) objects that can be consumed/produced by workflows, and other applications. A single DS is the input data structure submitted to a workflow that is modified by each workflow component in turn to produce a refined DS at the end. The Specimen Data Refinery provides a library of such components that can be used individually, or in series. To cofunction, each library component describes the fields it requires from the DS and the fields it will in turn populate or enrich. The second kind of FDO, RO-Crates gather and archive the diverse set of digital and real-world resources, configurations, and actions (the provenance) contributing to a unit of research work, allowing that work to be faithfully recorded and reproduced. Here we describe the Specimen Data Refinery with its motivating requirements, focusing on what is essential in the creation of canonical workflow component libraries and its conformance with the requirements of an emerging FDO Core Specification being developed by the FDO Forum.

Authors: Alex Hardisty, Paul Brack, Carole Goble, Laurence Livermore, Ben Scott, Quentin Groom, Stuart Owen, Stian Soiland-Reyes

Date Published: 7th Mar 2022

Publication Type: Journal

DOI: 10.1162/dint_a_00134

Citation: Data Intelligence:1-19

Created: 25th Apr 2022 at 11:45, Last updated: 16th Jan 2023 at 13:34

A Community Roadmap for Scientific Workflows Research and Development

FAIR Computational Workflows

(Show All)

Abstract (Expand)

Preprint: https://arxiv.org/abs/2110.02168 The landscape of workflow systems for scientific applications is notoriously convoluted with hundreds of seemingly equivalent workflow systems, many isolated …

Authors: Rafael Ferreira da Silva, Henri Casanova, Kyle Chard, Ilkay Altintas, Rosa M Badia, Bartosz Balis, Taina Coleman, Frederik Coppens, Frank Di Natale, Bjoern Enders, Thomas Fahringer, Rosa Filgueira, Grigori Fursin, Daniel Garijo, Carole Goble, Dorran Howell, Shantenu Jha, Daniel S. Katz, Daniel Laney, Ulf Leser, Maciej Malawski, Kshitij Mehta, Loic Pottier, Jonathan Ozik, J. Luc Peterson, Lavanya Ramakrishnan, Stian Soiland-Reyes, Douglas Thain, Matthew Wolf

Date Published: 1st Nov 2021

Publication Type: Journal

DOI: 10.1109/WORKS54523.2021.00016

Citation: 2021 IEEE Workshop on Workflows in Support of Large-Scale Science (WORKS),pp.81-90,IEEE

Created: 25th Apr 2022 at 11:49, Last updated: 16th Jan 2023 at 13:34

MGnify: the microbiome analysis resource in 2020

MGnify

Abstract (Expand)

MGnify (http://www.ebi.ac.uk/metagenomics) provides a free to use platform for the assembly, analysis and archiving of microbiome data derived from sequencing microbial populations that are present in …

Authors: Alex L Mitchell, Alexandre Almeida, Martin Beracochea, Miguel Boland, Josephine Burgin, Guy Cochrane, Michael R Crusoe, Varsha Kale, Simon C Potter, Lorna J Richardson, Ekaterina Sakharova, Maxim Scheremetjew, Anton Korobeynikov, Alex Shlemov, Olga Kunyavskaya, Alla Lapidus, Robert D Finn

Date Published: 7th Nov 2019

Publication Type: Journal

DOI: 10.1093/nar/gkz1035

Citation: Nucleic Acids Research,gkz1035

Created: 7th Jun 2022 at 08:56, Last updated: 16th Jan 2023 at 13:34

Collection of wing images for conservation of honey bees (Apis mellifera) biodiversity in Europe

Apis-wings

Abstract (Expand)

Identification of honey bee (Apis mellifera) from various parts of the world is essential for protection of their biodiversity. The identification can be based on wing measurements which is inexpensive …

Authors: Andrzej Oleksa, Eliza Căuia, Adrian Siceanu, Zlatko Puškadija, Marin Kovačić, M. Alice Pinto, Pedro João Rodrigues, Fani Hatjina, Leonidas Charistos, Maria Bouga, Janez Prešern, Irfan Kandemir, Slađan Rašić, Szilvia Kusza, Adam Tofilski

Date Published: 1st Oct 2022

Publication Type: Journal

DOI: 10.5281/zenodo.7244070

Citation:

Created: 28th Feb 2023 at 12:24, Last updated: 28th Feb 2023 at 14:26

Dataset: Computer software for identification of honey bee subspecies and evolutionary lineages

Apis-wings

Abstract (Expand)

Coordinates of 19 landmarks from honey bee (Apis mellifera) worker wings. They represent 1832 workers, 187 colonies, 25 subspecies and four evolutionary lineages. The material was obtained from the …

Authors: Anna Nawrocka, Irfan Kandemir, Stefan Fuchs, Adam Tofilski

Date Published: 1st Apr 2018

Publication Type: Journal

DOI: 10.5281/zenodo.7567336

Citation:

Created: 28th Feb 2023 at 14:25, Last updated: 28th Feb 2023 at 14:27

Ten simple rules and a template for creating workflows-as-applications

FAIR Computational Workflows

Abstract

Not specified

Authors: Michael J. Roach, N. Tessa Pierce-Ward, Radoslaw Suchecki, Vijini Mallawaarachchi, Bhavya Papudeshi, Scott A. Handley, C. Titus Brown, Nathan S. Watson-Haigh, Robert A. Edwards

Date Published: 15th Dec 2022

Publication Type: Journal

DOI: 10.1371/journal.pcbi.1010705

Citation: PLoS Comput Biol 18(12):e1010705

Created: 7th Apr 2023 at 15:22, Last updated: 7th Apr 2023 at 15:26

MetaGT: A pipeline for de novo assembly of metatranscriptomes with the aid of metagenomic data

HoloFood at MGnify

Abstract (Expand)

While metagenome sequencing may provide insights on the genome sequences and composition of microbial communities, metatranscriptome analysis can be useful for studying the functional activity of a …

Authors: Daria Shafranskaya, Varsha Kale, Rob Finn, Alla L. Lapidus, Anton Korobeynikov, Andrey D. Prjibelski

Date Published: 28th Oct 2022

Publication Type: Journal

DOI: 10.3389/fmicb.2022.981458

Citation: Front. Microbiol. 13,981458

Created: 12th Apr 2023 at 10:54, Last updated: 12th Apr 2023 at 10:56

Automatic, Efficient and Scalable Provenance Registration for FAIR HPC Workflows

Workflows and Distributed Computing

(Show All)

Abstract (Expand)

Provenance registration is becoming more and more important, as we increase the size and number of experiments performed using computers. In particular, when provenance is recorded in HPC environments, …

Authors: Raul Sirvent, Javier Conejero, Francesc Lordan, Jorge Ejarque, Laura Rodriguez-Navas, Jose M. Fernandez, Salvador Capella-Gutierrez, Rosa M. Badia

Date Published: 1st Nov 2022

Publication Type: Proceedings

DOI: 10.1109/WORKS56498.2022.00006

Citation: 2022 IEEE/ACM Workshop on Workflows in Support of Large-Scale Science (WORKS),pp.1-9,IEEE

Created: 2nd Aug 2023 at 15:37, Last updated: 2nd Aug 2023 at 15:41

DSCrank: A Method for Selection and Ranking of Datasets

yPublish - Bioinfo tools

Abstract (Expand)

Considerable efforts have been made to build the Web of Data. One of the main challenges has to do with how to identify the most related datasets to connect to. Another challenge is to publish a local …

Authors: Yasmmin Cortes Martins, Fábio Faria da Mota, Maria Cláudia Cavalcanti

Date Published: 2016

Publication Type: Journal

DOI: 10.1007/978-3-319-49157-8_29

Citation: Metadata and Semantics Research 672:333-344,Springer International Publishing

Created: 23rd Oct 2023 at 14:59, Last updated: 23rd Oct 2023 at 15:04

EpiCurator: an immunoinformatic workflow to predict and prioritize SARS-CoV-2 epitopes

yPublish - Bioinfo tools

Abstract (Expand)

The ongoing coronavirus 2019 (COVID-19) pandemic, triggered by the emerging SARS-CoV-2 virus, represents a global public health challenge. Therefore, the development of effective vaccines is an urgent …

Authors: Cristina S. Ferreira, Yasmmin C. Martins, Rangel Celso Souza, Ana Tereza R. Vasconcelos

Date Published: 2021

Publication Type: Journal

DOI: 10.7717/peerj.12548

Citation: PeerJ 9:e12548

Created: 23rd Oct 2023 at 15:04, Last updated: 23rd Oct 2023 at 15:06

OntoPPI: Towards Data Formalization on the Prediction of Protein Interactions

yPublish - Bioinfo tools

Abstract (Expand)

The Linking Open Data (LOD) cloud is a global data space for publishing and linking structured data on the Web. The idea is to facilitate the integration, exchange, and processing of data. The LOD cloud …

Authors: Yasmmin Cortes Martins, Maria Cláudia Cavalcanti, Luis Willian Pacheco Arge, Artur Ziviani, Ana Tereza Ribeiro de Vasconcelos

Date Published: 2019

Publication Type: Journal

DOI: 10.1007/978-3-030-36599-8_23

Citation: Metadata and Semantic Research 1057:260-271,Springer International Publishing

Created: 23rd Oct 2023 at 15:09, Last updated: 23rd Oct 2023 at 15:12

Large-Scale Protein Interactions Prediction by Multiple Evidence Analysis Associated With an In-Silico Curation Strategy

yPublish - Bioinfo tools

Abstract (Expand)

Predicting the physical or functional associations through protein-protein interactions (PPIs) represents an integral approach for inferring novel protein functions and discovering new drug targets …

Authors: Yasmmin Côrtes Martins, Artur Ziviani, Marisa Fabiana Nicolás, Ana Tereza Ribeiro de Vasconcelos

Date Published: 6th Sep 2021

Publication Type: Journal

DOI: 10.3389/fbinf.2021.731345

Citation: Front. Bioinform. 1,731345

Created: 23rd Oct 2023 at 15:13, Last updated: 23rd Oct 2023 at 15:16

PPIntegrator: semantic integrative system for protein–protein interaction and application for host–pathogen datasets

yPublish - Bioinfo tools

Abstract (Expand)

Semantic web standards have shown importance in the last 20 years in promoting data formalization and interlinking between the existing knowledge graphs. In this context, several ontologies and data …

Authors: Yasmmin Côrtes Martins, Artur Ziviani, Maiana de Oliveira Cerqueira e Costa, Maria Cláudia Reis Cavalcanti, Marisa Fabiana Nicolás, Ana Tereza Ribeiro de Vasconcelos

Date Published: 2023

Publication Type: Journal

DOI: 10.1093/bioadv/vbad067

Citation: Bioinformatics Advances 3(1),vbad067

Created: 23rd Oct 2023 at 15:18, Last updated: 23rd Oct 2023 at 15:21

Analysis of Protein-Protein Interactions networks and cross-species transfer learning comparison for seven organisms

yPublish - Bioinfo tools

Abstract (Expand)

Motivation Protein-protein interactions (PPIs) can be used for a plenty of applications like inferring protein functions or even helping the drug discovery process. For human specie, there is a lot of … validated information and functional annotations for the proteins in its interactome. In other species, the known interactome is much smaller compared with human and there are many proteins with few or no annotations by specialists. Understanding the interactome of other species helps to trace evolutionary characteristics, compare important biological processes and also build interactomes for new organisms according to other organisms more related with it instead of relying just to the human interactome. Results In this study, we evaluate the performance of PredPrIn workflow in predicting interactome for seven organisms in terms of scalability and precision showing that PredPrIn gets over than 70% of precision and it takes less than three days even on the largest datasets. We made a transfer learning analysis predicting an organism interactome from each other organism, we then showed an implication regarding to their evolutionary relation in the number of ortholog proteins shared between these organisms. We also present an analysis of functional enrichment showing the proportion of shared annotations between positive and false interactions predicted and extraction of topological features of each organism interactome such as proteins acting as hubs and bridge between modules. From each organism, one of the most frequent biological processes was selected and the proteins and pairs present in it were compared in terms of quantity in the interactome available in HINT database for that organism and the one predicted by PredPrIn. In this comparison we showed that we covered those proteins and pairs covered in HINT and also enriched these processes for almost all organisms. Conclusions In this work, we have proved the efficiency of PredPrIn workflow for protein interaction prediction for seven different organisms using scalability, performance and transfer learning analyses. We have also made cross-species interactome comparisons showing the most frequent biological processes for each organism as well as the topological features of each organism interactome showing the consistency with hypothesis about biological networks. Finally, we described the enrichment made by PredPrIn in selected biological processes showing that its prediction was important to enhance information about these organisms interactomes.

Author: Yasmmin C Martins

Date Published: 7th Jun 2023

Publication Type: Journal

DOI: 10.1101/2023.06.05.543725

Citation: biorxiv;2023.06.05.543725v1,[Preprint]

Created: 23rd Oct 2023 at 15:23, Last updated: 23rd Oct 2023 at 15:24

The impact of non-lineage defining mutations in the structural stability for variants of concern of SARS-CoV-2

yPublish - Bioinfo tools

Abstract (Expand)

Motivation The identification of the most important mutations, that lead to a structural and functional change in a highly transmissible virus variants, is essential to understand the impacts and the …

Authors: Yasmmin Martins, Ronaldo Francisco da Silva

Date Published: 22nd Jun 2023

Publication Type: Journal

DOI: 10.1101/2023.06.22.546079

Citation: biorxiv;2023.06.22.546079v1,[Preprint]

Created: 23rd Oct 2023 at 15:25, Last updated: 23rd Oct 2023 at 15:28

survInTime - Exploring surveillance methods and data analysis on Brazilian respiratory syndrome dataset and community mobility changes

yPublish - Bioinfo tools

Abstract (Expand)

Background The covid-19 pandemic brought negative impacts in almost every country in the world. These impacts were observed mainly in the public health sphere, with a rapid raise and spread of the … disease and failed attempts to restrain it while there was no treatment. However, in developing countries, the impacts were severe in other aspects such as the intensification of social inequality, poverty and food insecurity. Specifically in Brazil, the miscommunication among the government layers conducted the control measures to a complete chaos in a country of continental dimensions. Brazil made an effort to register granular informative data about the case reports and their outcomes, while this data is available and can be consumed freely, there are issues concerning the integrity and inconsistencies between the real number of cases and the number of notifications in this dataset. Results We projected and implemented four types of analysis to explore the Brazilian public dataset of Severe Acute Respiratory Syndrome (srag dataset) notifications and the google dataset of community mobility change (mobility dataset). These analysis provides some diagnosis of data integration issues and strategies to integrate data and experimentation of surveillance analysis. The first type of analysis aims at describing and exploring the data contained in both datasets, starting by assessing the data quality concerning missing data, then summarizing the patterns found in this datasets. The Second type concerns an statistical experiment to estimate the cases from mobility patterns organized in periods of time. We also developed, as the third analysis type, an algorithm to help the understanding of the disease waves by detecting them and compare the time periods across the cities. Lastly, we build time series datasets considering deaths, overall cases and residential mobility change in regular time periods and used as features to group cities with similar behavior. Conclusion The exploratory data analysis showed the under representation of covid-19 cases in many small cities in Brazil that were absent in the srag dataset or with a number of cases very low than real projections. We also assessed the availability of data for the Brazilian cities in the mobility dataset in each state, finding out that not all the states were represented and the best coverage occurred in Rio de Janeiro state. We compared the capacity of place categories mobility change combination on estimating the number of cases measuring the errors and identifying the best components in mobility that could affect the cases. In order to target specific strategies for groups of cities, we compared strategies to cluster cities that obtained similar outcomes behavior along the time, highlighting the divergence on handling the disease.

Authors: Yasmmin Côrtes Martins, Ronaldo Francisco da Silva

Date Published: 27th Sep 2023

Publication Type: Journal

DOI: 10.1101/2023.09.26.559599

Citation: biorxiv;2023.09.26.559599v1,[Preprint]

Created: 23rd Oct 2023 at 15:30, Last updated: 23rd Oct 2023 at 15:32

Multi-task analysis of gene expression data on cancer public datasets

yPublish - Bioinfo tools

Abstract (Expand)

Background There is an availability of omics and often multi-omics cancer datasets on public databases such as Gene Expression Omnibus (GEO), International Cancer Genome Consortium and The Cancer Genome … Atlas Program. Most of these databases provide at least the gene expression data for the samples contained in the project. Multi-omics has been an advantageous strategy to leverage personalized medicine, but few works explore strategies to extract knowledge relying only on gene expression level for decisions on tasks such as disease outcome prediction and drug response simulation. The models and information acquired on projects based only on expression data could provide decision making background for future projects that have other level of omics data such as DNA methylation or miRNAs. Results We extended previous methodologies to predict disease outcome from the combination of protein interaction networks and gene expression profiling by proposing an automated pipeline to perform the graph feature encoding and further patient networks outcome classification derived from RNA-Seq. We integrated biological networks from protein interactions and gene expression profiling to assess patient specificity combining the treatment/control ratio with the patient normalized counts of the deferentially expressed genes. We also tackled the disease outcome prediction from the gene set enrichment perspective, combining gene expression with pathway gene sets information as features source for this task. We also explored the drug response outcome perspective of the cancer disease still evaluating the relationship among gene expression profiling with single sample gene set enrichment analysis (ssGSEA), proposing a workflow to perform drug response screening according to the patient enriched pathways. Conclusion We showed the importance of the patient network modeling for the clinical task of disease outcome prediction using graph kernel matrices strategy and showed how ssGSEA improved the prediction only using transcriptomic data combined with pathway scores. We also demonstrated a detailed screening analysis showing the impact of pathway-based gene sets and normalization types for the drug response simulation. We deployed two fully automatized Screening workflows following the FAIR principles for the disease outcome prediction and drug response simulation tasks.

Author: Yasmmin Martins

Date Published: 28th Sep 2023

Publication Type: Journal

DOI: 10.1101/2023.09.27.23296213

Citation: medrxiv;2023.09.27.23296213v1,[Preprint]

Created: 23rd Oct 2023 at 15:33, Last updated: 23rd Oct 2023 at 15:34

Framework para a Construção de Redes Filogenéticas em Ambiente de Computação de Alto Desempenho

HP2NET - Framework for construction of phylogenetic networks on High Performance Computing (HPC) environment

Abstract (Expand)

No presente artigo é apresentado uma avaliação de desempenho de um Framework de Redes Filogenéticas no ambiente do supercomputador Santos Dumont. O trabalho reforça os benefícios de paralelizar o …

Authors: Rafael Terra, Kary Ocaña, Carla Osthoff, Lucas Cruz, Philippe Navaux, Diego Carvalho

Date Published: 19th Oct 2022

Publication Type: InProceedings

DOI: 10.5753/wscad.2022.226366

Citation: Anais do XXIII Simpósio em Sistemas Computacionais de Alto Desempenho (WSCAD 2022),pp.73-84,Sociedade Brasileira de Computação

Created: 9th Jan 2024 at 12:54, Last updated: 9th Jan 2024 at 12:57

Gerência e Análises de Workflows aplicados a Redes Filogenéticas de Genomas de Dengue no Brasil

HP2NET - Framework for construction of phylogenetic networks on High Performance Computing (HPC) environment

Abstract (Expand)

Processos evolutivos e dispersão de genomas de Dengue no Brasil são relevantes na direção do impacto e vigilância endemo-epidêmico e social de arboviroses emergentes. Árvores e redes filogenéticas …

Authors: Rafael Terra, Micaella Coelho, Lucas Cruz, Marco Garcia-Zapata, Luiz Gadelha, Carla Osthoff, Diego Carvalho, Kary Ocaña

Date Published: 18th Jul 2021

Publication Type: InProceedings

DOI: 10.5753/bresci.2021.15788

Citation: Anais do XV Brazilian e-Science Workshop (BRESCI 2021),pp.49-56,Sociedade Brasileira de Computação

Created: 9th Jan 2024 at 12:55, Last updated: 9th Jan 2024 at 12:56

Framework para execução de workflows de redes filogenéticas em ambientes de computação de alto desempenho

HP2NET - Framework for construction of phylogenetic networks on High Performance Computing (HPC) environment

Abstract (Expand)

In the last years, the development of technologies, such as next-generation sequencing and high-performance computing allowed the execution of Bioinformatics experiments of high complexity and … computationally intensives. Different Bioinformatics fields need to use high-performance computing platforms to take advantage of the parallelism and tasks distribution, through specialized technologies of scientific workflows management systems. One of the Bioinformatics fields that need high-performance computing is phylogeny, a field that expresses the evolutive relations between genes and organisms, establishing which of them are most related evolutively. The phylogeny is used in several approaches, such as in the species classification; in the discovery of individuals’ kinship; in the identification of pathogens origins, and even in conservation biology. A way of representing these phylogenetic relations is using phylogenetic networks. However, the construction of these networks uses computationally intensive algorithms that require the constant manipulation of different input data. This work aims the development of a framework for construction of explicit phylogenetic networks, modeling a scientific workflow that adds different methods for the construction of the networks and the required input data treatment. The framework was developed to allow the use of multiple flows from the workflow in an automated, parallel, and distributed manner in a single execution and also to be executable in high- performance computing environments, constituting a challenging task, once the tools used are not developed focused in this environment. To orchestrate the workflow tasks, the scalable parallel programing library Parsl was used, allowing to do optimizations in the workflow’s tasks execution, performing better management of the resources. Two versions of the framework were developed, called Single Partition and Multi Partition, differing in the manner in which the resources are used. In tests performed, there was an improvement in the execution time of about five times when compared to the sequential execution of a flow without the optimizations. The framework was validated using public data of Dengue virus genomes, which were processed, annotated, and executed in the framework using the Santos Dumont supercomputer. The construction of the genomes’ explicit phylogenetic networks indicates that the framework is a functional, efficient, and easy to use tool.

Authors: Rafael Terra, Kary Ocaña, Carla Osthoff, Diego Carvalho

Date Published: 18th Feb 2022

Publication Type: Master's Thesis

Citation: TERRA, R. S. Framework para execução de workflows de redes filogenéticas em ambientes de computação de alto desempenho. 2022. 71 f. Tese. (Programa de Pós-Graduação em Modelagem Computacional) - Laboratório Nacional de Computação Científica, Petrópolis, 2022.

Created: 9th Jan 2024 at 13:16

MGnify Genomes: A Resource for Biome-specific Microbial Genome Catalogues

MGnify

(Show All)

Abstract

Not specified

Authors: Tatiana A. Gurbich, Alexandre Almeida, Martin Beracochea, Tony Burdett, Josephine Burgin, Guy Cochrane, Shriya Raj, Lorna Richardson, Alexander B. Rogers, Ekaterina Sakharova, Gustavo A. Salazar, Robert D. Finn

Date Published: 1st Jul 2023

Publication Type: Journal

DOI: 10.1016/j.jmb.2023.168016

Citation: Journal of Molecular Biology 435(14):168016

Created: 23rd May 2024 at 12:15, Last updated: 23rd May 2024 at 12:16

MOLGENIS VIP: an open-source and modular pipeline for high-throughput and integrated DNA variant analysis

MOLGENIS

Abstract (Expand)

Abstract In silico variant interpretation pipelines have become an integral part of genetics research and genome diagnostics. However, challenges remain for automated variant interpretation and candidate …

Authors: W.T.K. Maassen, L.F. Johansson, B. Charbon, D. Hendriksen, S. van den Hoek, M.K. Slofstra, R. Mulder, M.T. Meems-Veldhuis, R. Sietsma, H.H. Lemmink, C.C. van Diemen, M.E. van Gijn, M.A. Swertz, K.J. van der Velde

Date Published: 15th Apr 2024

Publication Type: Unpublished

DOI: 10.1101/2024.04.11.24305656

Citation: medrxiv;2024.04.11.24305656v2,[Preprint]

Created: 12th Jun 2024 at 10:50

Laserfarm – A high-throughput workflow for generating geospatial data products of ecosystem structure from airborne laser scanning point clouds

Laserfarm applications to European demonstration sites

Abstract (Expand)

Quantifying ecosystem structure is of key importance for ecology, conservation, restoration, and biodiversity monitoring because the diversity, geographic distribution and abundance of animals, plants … and other organisms is tightly linked to the physical structure of vegetation and associated microclimates. Light Detection And Ranging (LiDAR) — an active remote sensing technique — can provide detailed and high resolution information on ecosystem structure because the laser pulse emitted from the sensor and its subsequent return signal from the vegetation (leaves, branches, stems) delivers three-dimensional point clouds from which metrics of vegetation structure (e.g. ecosystem height, cover, and structural complexity) can be derived. However, processing 3D LiDAR point clouds into geospatial data products of ecosystem structure remains challenging across broad spatial extents due to the large volume of national or regional point cloud datasets (typically multiple terabytes consisting of hundreds of billions of points). Here, we present a high-throughput workflow called ‘Laserfarm’ enabling the efficient, scalable and distributed processing of multi-terabyte LiDAR point clouds from national and regional airborne laser scanning (ALS) surveys into geospatial data products of ecosystem structure. Laserfarm is a free and open-source, end-to-end workflow which contains modular pipelines for the re-tiling, normalization, feature extraction and rasterization of point cloud information from ALS and other LiDAR surveys. The workflow is designed with horizontal scalability and can be deployed with distributed computing on different infrastructures, e.g. a cluster of virtual machines. We demonstrate the Laserfarm workflow by processing a country-wide multi-terabyte ALS dataset of the Netherlands (covering ∼34,000 km2 with ∼700 billion points and ∼ 16 TB uncompressed LiDAR point clouds) into 25 raster layers at 10 m resolution capturing ecosystem height, cover and structural complexity at a national extent. The Laserfarm workflow, implemented in Python and available as Jupyter Notebooks, is applicable to other LiDAR datasets and enables users to execute automated pipelines for generating consistent and reproducible geospatial data products of ecosystems structure from massive amounts of LiDAR point clouds on distributed computing infrastructures, including cloud computing environments. We provide information on workflow performance (including total CPU times, total wall-time estimates and average CPU times for single files and LiDAR metrics) and discuss how the Laserfarm workflow can be scaled to other LiDAR datasets and computing environments, including remote cloud infrastructures. The Laserfarm workflow allows a broad user community to process massive amounts of LiDAR point clouds for mapping vegetation structure, e.g. for applications in ecology, biodiversity monitoring and ecosystem restoration.

Authors: W. Daniel Kissling, Yifang Shi, Zsófia Koma, Christiaan Meijer, Ou Ku, Francesco Nattino, Arie C. Seijmonsbergen, Meiert W. Grootes

Date Published: 1st Dec 2022

Publication Type: Journal

DOI: 10.1016/j.ecoinf.2022.101836

Citation: Ecological Informatics 72:101836

Created: 7th Feb 2025 at 08:39, Last updated: 24th Apr 2025 at 16:02

Country-wide data of ecosystem structure from the third Dutch airborne laser scanning survey

Laserfarm applications to European demonstration sites

Abstract (Expand)

The third Dutch national airborne laser scanning flight campaign (AHN3, Actueel Hoogtebestand Nederland) conducted between 2014 and 2019 during the leaf-off season (October–April) across the whole … Netherlands provides a free and open-access, country-wide dataset with ∼700 billion points and a point density of ∼10(–20) points/m2. The AHN3 point cloud was obtained with Light Detection And Ranging (LiDAR) technology and contains for each point the x, y, z coordinates and additional characteristics (e.g. return number, intensity value, scan angle rank and GPS time). Moreover, the point cloud has been pre-processed by ‘Rijkswaterstraat’ (the executive agency of the Dutch Ministry of Infrastructure and Water Management), comes with a Digital Terrain Model (DTM) and a Digital Surface Model (DSM), and is delivered with a pre-classification of each point into one of six classes (0: Never Classified, 1: Unclassified, 2: Ground, 6: Building, 9: Water, 26: Reserved [bridges etc.]). However, no detailed information on vegetation structure is available from the AHN3 point cloud. We processed the AHN3 point cloud (∼16 TB uncompressed data volume) into 10 m resolution raster layers of ecosystem structure at a national extent, using a novel high-throughput workflow called ‘Laserfarm’ and a cluster of virtual machines with fast central processing units, high memory nodes and associated big data storage for managing the large amount of files. The raster layers (available as GeoTIFF files) capture 25 LiDAR metrics of vegetation structure, including ecosystem height (e.g. 95th percentiles of normalized z), ecosystem cover (e.g. pulse penetration ratio, canopy cover, and density of vegetation points within defined height layers), and ecosystem structural complexity (e.g. skewness and variability of vertical vegetation point distribution). The raster layers make use of the Dutch projected coordinate system (EPSG:28992 Amersfoort / RD New), are each ∼1 GB in size, and can be readily used by ecologists in a geographic information system (GIS) or analytical open-source software such as R and Python. Even though the class ‘1: Unclassified’ mainly includes vegetation points, other objects such as cars, fences, and boats can also be present in this class, introducing potential biases in the derived data products. We therefore validated the raster layers of ecosystem structure using >180,000 hand-labelled LiDAR points in 100 randomly selected sample plots (10 m × 10 m each) across the Netherlands. Besides vegetation, objects such as boats, fences, and cars were identified in the sampled plots. However, the misclassification rate of vegetation points (i.e. non-vegetation points that were assumed to be vegetation) was low (∼0.05) and the accuracy of the 25 LiDAR metrics derived from the AHN3 point cloud was high (∼90%). To minimize existing inaccuracies in this country-wide data product (e.g. ships on water bodies, chimneys on roofs, or cars on roads that might be incorrectly used as vegetation points), we provide an additional mask that captures water bodies, buildings and roads generated from the Dutch cadaster dataset. This newly generated country-wide ecosystem structure data product provides new opportunities for ecology and biodiversity science, e.g. for mapping the 3D vegetation structure of a variety of ecosystems or for modelling biodiversity, species distributions, abundance and ecological niches of animals and their habitats.

Authors: W. Daniel Kissling, Yifang Shi, Zsófia Koma, Christiaan Meijer, Ou Ku, Francesco Nattino, Arie C. Seijmonsbergen, Meiert W. Grootes

Date Published: 1st Feb 2023

Publication Type: Journal

DOI: 10.1016/j.dib.2022.108798

Citation: Data in Brief 46:108798

Created: 7th Feb 2025 at 08:41, Last updated: 24th Apr 2025 at 15:57

Modern Approaches to the Monitoring of Biоdiversity (MAMBO)

Laserfarm applications to European demonstration sites

Abstract (Expand)

EU policies, such as the EU biodiversity strategy 2030 and the Birds and Habitats Directives, demand unbiased, integrated and regularly updated biodiversity and ecosystem service data. However, efforts …

Authors: Toke Høye, Tom August, Mario V Balzan, Koos Biesmeijer, Pierre Bonnet, Tom Breeze, Christophe Dominik, France Gerard, Alexis Joly, Vincent Kalkman, W. Daniel Kissling, Teodor Metodiev, Jesper Moeslund, Simon Potts, David Roy, Oliver Schweiger, Deepa Senapathi, Josef Settele, Pavel Stoev, Dan Stowell

Date Published: 7th Dec 2023

Publication Type: Journal

DOI: 10.3897/rio.9.e116951

Citation: RIO 9,e116951

Created: 7th Feb 2025 at 08:43, Last updated: 24th Apr 2025 at 15:55

Towards consistently measuring and monitoring habitat condition with airborne laser scanning and unmanned aerial vehicles

Laserfarm applications to European demonstration sites

(Show All)

Abstract (Expand)

Indicators of habitat condition are essential for tracking conservation progress, but measuring biotic, abiotic and landscape characteristics at fine resolution over large spatial extents remains …

Authors: W. Daniel Kissling, Yifang Shi, Jinhu Wang, Agata Walicka, Charles George, Jesper E. Moeslund, France Gerard

Date Published: 1st Dec 2024

Publication Type: Journal

DOI: 10.1016/j.ecolind.2024.112970

Citation: Ecological Indicators 169:112970

Created: 7th Feb 2025 at 08:44, Last updated: 24th Apr 2025 at 15:53

Data of vegetation structure metrics retrieved from airborne laser scanning surveys for European demonstration sites

Laserfarm applications to European demonstration sites

Abstract (Expand)

This dataset provides a standardized collection of rasterized Light Detection And Ranging (LiDAR) metrics in GeoTIFF format, derived from country-wide airborne laser scanning (ALS) data across seven … demonstration sites in five European countries: Mols Bjerge National Park (Denmark), Reserve Naturelle Nationale du Bagnas (France), Oostvaardersplassen (Netherlands), Salisbury Plain (United Kingdom), Knepp Estate (United Kingdom), Monks Wood (United Kingdom), and the island of Comino (Malta). The sites range in areal size from 0.08 km2 to 54 km2 and include habitat types such as forests, broadleaf and conifer woodlands, small plantations, dry and wet grasslands, marshes, reedbeds, arable fields, farmland, scrublands and mediterranean garigue. A total of 35 LiDAR metrics were calculated, of which 28 represent vegetation structural attributes. These include vegetation height (seven metrics), vegetation cover (fourteen metrics), and vegetation vertical variability (seven metrics). Additionally, seven metrics describe point density (one metric), eigenvalues (three metrics), and normal vectors (three metrics). The rasterized LiDAR metrics have a spatial resolution of 10 m, with coverage and extent defined by shapefiles corresponding to each demonstration site. The raw ALS point clouds were clipped to the site boundaries and processed with the 'Laserfarm' workflow, a standardized computational workflow that includes modular pipelines for re-tiling, normalization, feature extraction, and rasterization. Laserfarm employs the feature extraction module of the open-source ‘Laserchicken’ software to compute the LiDAR metrics. The workflow was implemented using the IT services of the Dutch national facility for information and communication technology, SURF. The clipped LiDAR point clouds are available through a public repository, except for the LiDAR point clouds from Comino, Malta, which are not publicly available. The 35 rasterized LiDAR metrics (GeoTIFF files, 10 m resolution) from all sites, including Comino, as well as the corresponding site boundary shapefiles (geospatial vector format), are provided in a Zenodo repository. Additionally, the Jupyter Notebooks with Python code for executing the Laserfarm workflow are available to facilitate reproducibility and further computational applications. Users should note that the rasterized LiDAR metrics may contain zero or NA values, particularly over water surfaces, with the pulse penetration ratio metric potentially indicating false high vegetation cover over water. Users may reclassify or mask areas with zero values accordingly. Some pixels exhibit abnormal vegetation height values, which can be filtered before analysis. Certain striping patterns, likely resulting from overlapping flight lines and increased point density, are present in some metrics, though their overall impact appears minimal. This dataset enables diverse applications, including canopy height measurements, mapping of hedgerows, treelines, and forest patches, as well as characterizing vegetation density, vertical stratification, and habitat openness. It supports landscape-scale habitat analysis and contributes to the standardization of vegetation metrics from ALS data for site-specific ecological monitoring (e.g., Natura 2000). Moreover, the dataset demonstrates the automated execution of LiDAR data processing workflows, which is crucial for establishing a transnational and multi-site biodiversity and ecosystem observation network.

Authors: W. Daniel Kissling, Wessel Mulder, Jinhu Wang, Yifang Shi

Date Published: 1st Jun 2025

Publication Type: Journal

DOI: 10.1016/j.dib.2025.111548

Citation: Data in Brief 60:111548

Created: 24th Apr 2025 at 15:32, Last updated: 24th Apr 2025 at 15:37

Laserchicken—A tool for distributed feature calculation from massive LiDAR point cloud datasets

Laserfarm applications to European demonstration sites

Abstract (Expand)

Point cloud datasets provided by LiDAR have become an integral part in many research fields including archaeology, forestry, and ecology. Facilitated by technological advances, the volume of these …

Authors: C. Meijer, M.W. Grootes, Z. Koma, Y. Dzigan, R. Gonçalves, B. Andela, G. van den Oord, E. Ranguelova, N. Renaud, W.D. Kissling

Date Published: 1st Jul 2020

Publication Type: Journal

DOI: 10.1016/j.softx.2020.100626

Citation: SoftwareX 12:100626

Created: 24th Apr 2025 at 15:48, Last updated: 24th Apr 2025 at 15:49