dalmolingroup/euryale: A pipeline for taxonomic classification and functional annotation of metagenomic reads, based on MEDUSA

Introduction

dalmolingroup/euryale is a pipeline for taxonomic classification and functional annotation of metagenomic reads, based on MEDUSA.

The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers, which makes installation trivial and results highly reproducible. The Nextflow DSL2 implementation of this pipeline uses one container per process, which makes it much easier to maintain and update software dependencies. Where possible, these processes have been submitted to and installed from nf-core/modules in order to make them available to all nf-core pipelines, and to everyone within the Nextflow community!

Pipeline summary

Pre-processing

Read QC (FastQC)
Read trimming and merging (fastp)
(optionally) Host read removal (Bowtie2)
Duplicated sequence removal (fastx_collapser)
Present QC and other data (MultiQC)

Assembly

(optionally) Read assembly (MEGAHIT)

Taxonomic classification

Sequence classification (Kaiju)
Sequence classification (Kraken2)
Visualization (Krona)

Functional annotation

Sequence alignment (DIAMOND)
Map alignment matches to functional database (annotate)

Quick Start

Install Nextflow (>=22.10.1)
Install any of Docker, Singularity (you can follow this tutorial), Podman, Shifter or Charliecloud for full pipeline reproducibility. You can use Conda both to install Nextflow itself and to manage software within pipelines, but please only use it within pipelines as a last resort; see docs.
Download the pipeline and test it on a minimal dataset with a single command:

nextflow run dalmolingroup/euryale -profile test,YOURPROFILE --outdir <OUTDIR>

Note that some form of configuration will be needed so that Nextflow knows how to fetch the required software. This is usually done in the form of a config profile (YOURPROFILE in the example command above). You can chain multiple config profiles in a comma-separated string.

The pipeline comes with config profiles called docker, singularity, podman, shifter, charliecloud and conda which instruct the pipeline to use the named tool for software management. For example, -profile test,docker.

Please check nf-core/configs to see if a custom config file to run nf-core pipelines already exists for your Institute. If so, you can simply use -profile <institute> in your command. This will enable either docker or singularity and set the appropriate execution settings for your local compute environment.

If you are using singularity, please use the nf-core download command to download images first, before running the pipeline. Setting the NXF_SINGULARITY_CACHEDIR or singularity.cacheDir Nextflow options enables you to store and re-use the images from a central location for future pipeline runs.

If you are using conda, it is highly recommended to use the NXF_CONDA_CACHEDIR or conda.cacheDir settings to store the environments in a central location for future pipeline runs.
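
As a minimal sketch of the two cache settings mentioned above, the environment variables can be exported before launching the pipeline. The `CACHE_ROOT` variable and its default path are hypothetical choices for illustration, not something the pipeline defines.

```shell
# Assumption: CACHE_ROOT is a hypothetical directory of your choosing,
# ideally on storage shared by everyone who runs the pipeline.
CACHE_ROOT="${CACHE_ROOT:-$HOME/nextflow_cache}"

# Environment variables named in the text above.
export NXF_SINGULARITY_CACHEDIR="$CACHE_ROOT/singularity"
export NXF_CONDA_CACHEDIR="$CACHE_ROOT/conda"

# Create the directories up front so the first run can write to them.
mkdir -p "$NXF_SINGULARITY_CACHEDIR" "$NXF_CONDA_CACHEDIR"
```

Placing these lines in a shell startup file or a job-submission script should make later runs reuse the same images and environments.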

Start running your own analysis!

nextflow run dalmolingroup/euryale \
  --input samplesheet.csv \
  --outdir <OUTDIR> \
  --kaiju_db kaiju_reference \
  --diamond_db diamond_db \
  --reference_fasta diamond_fasta \
  --host_fasta host_reference_fasta \
  --id_mapping id_mapping_file \
  -profile <docker/singularity/podman/shifter/charliecloud/conda/institute>
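
Since the Quick Start notes that several config profiles can be chained in a comma-separated string, a small helper can build that string in wrapper scripts. This is only a sketch: `join_profiles` is a hypothetical helper name, and the final line prints the command rather than running it.

```shell
# join_profiles NAME... : join profile names with commas, the form -profile expects.
# Hypothetical helper, not part of the pipeline.
join_profiles() {
  old_ifs=$IFS
  IFS=,
  # "$*" joins all arguments using the first character of IFS.
  printf '%s\n' "$*"
  IFS=$old_ifs
}

PROFILES=$(join_profiles test docker)

# Print the resulting command for inspection; "results" is a placeholder output directory.
echo "nextflow run dalmolingroup/euryale -profile $PROFILES --outdir results"
```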

Databases and references

A question that pops up a lot is: Since Euryale requires a lot of reference parameters, where can I find these references?

Below we provide a short list of places where you can find these databases. But, of course, we're not limited to these references: Euryale should be able to process your own databases, should you want to build them yourself.

Alignment

For the alignment, you can provide either --diamond_db with a pre-built DIAMOND database or --reference_fasta with a reference FASTA file. For the reference FASTA, Euryale expects something like NCBI-nr by default, but similarly formatted reference databases should also work.
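
The choice between the two alignment parameters can be sketched as a small shell helper. It assumes that a pre-built DIAMOND database carries the `.dmnd` extension written by `diamond makedb` and treats anything else as a reference FASTA; `alignment_args` and the file names are hypothetical.

```shell
# alignment_args FILE : print the alignment parameter matching the file type.
# Assumption: pre-built DIAMOND databases end in .dmnd; everything else is
# treated as a reference FASTA.
alignment_args() {
  case "$1" in
    *.dmnd) printf '%s %s\n' "--diamond_db" "$1" ;;
    *)      printf '%s %s\n' "--reference_fasta" "$1" ;;
  esac
}

alignment_args nr.dmnd       # hypothetical pre-built database
alignment_args nr.fasta.gz   # hypothetical reference FASTA
```

The printed pair can then be appended to the nextflow run command shown in the Quick Start.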

Taxonomic classification

In its current version, Euryale doesn't build a reference taxonomic database, but pre-built ones are supported.

If you're using Kaiju (the default), provide a reference database to --kaiju_db as a .tar.gz file like the ones available on the official Kaiju website. We have extensively tested Euryale with the 2021 version of the nr database, and it should work as expected.
If you're using Kraken2 (by supplying --run_kraken2), provide --kraken2_db with something like the pre-built .tar.gz databases distributed by the Kraken2 developers.
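
The two classifier options above map to different parameters, which the following sketch makes explicit. `classifier_args` is a hypothetical helper, and the archive names in the comments are placeholders. Whether Kaiju still runs when --run_kraken2 is set is not stated here, so the helper only prints the parameters named in the text.

```shell
# classifier_args TOOL DB : print the parameters for the chosen classifier.
# Hypothetical helper; TOOL is "kaiju" or "kraken2", DB is a pre-built .tar.gz database.
classifier_args() {
  case "$1" in
    kaiju)   printf '%s %s\n' "--kaiju_db" "$2" ;;
    kraken2) printf '%s %s %s\n' "--run_kraken2" "--kraken2_db" "$2" ;;
    *)       echo "unknown classifier: $1" >&2; return 1 ;;
  esac
}

# Example calls (placeholder archive names):
#   classifier_args kaiju kaiju_db_nr.tar.gz
#   classifier_args kraken2 kraken2_db.tar.gz
```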

Functional annotation

We expect an ID mapping reference to be used within annotate. Since NCBI-nr is already the expected default alignment reference, the ID mapping data file provided by UniProt should work well when passed to --id_mapping.

Host reference

If your metagenomic reads come from a known host's microbiome, you can also provide the host's genome FASTA to the --host_fasta parameter to enable our decontamination subworkflow. Ensembl provides easy-to-download genomes that can be used for this purpose. Alternatively, you can provide a pre-built Bowtie2 database directory to the --bowtie2_db parameter.
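
Because the host reference can be either a FASTA file or a pre-built index directory, a wrapper script might pick the parameter by checking what the path points to. This is a sketch under that assumption; `host_args` is a hypothetical helper and the paths in the comments are placeholders.

```shell
# host_args PATH : print the host-removal parameter that fits PATH.
# A directory is taken to be a pre-built Bowtie2 database, a regular file
# to be the host genome FASTA. Hypothetical helper, not part of the pipeline.
host_args() {
  if [ -d "$1" ]; then
    printf '%s %s\n' "--bowtie2_db" "$1"
  elif [ -f "$1" ]; then
    printf '%s %s\n' "--host_fasta" "$1"
  else
    echo "host reference not found: $1" >&2
    return 1
  fi
}

# Example calls (placeholder paths):
#   host_args /refs/host_genome.fa.gz
#   host_args /refs/host_bowtie2_index/
```

Failing early on a missing path avoids launching the pipeline with a reference that does not exist.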

Documentation

The dalmolingroup/euryale documentation is split into the following pages:

Usage

- An overview of how the pipeline works, how to run it and a description of all of the different command-line flags.

Output

- An overview of the different results produced by the pipeline and how to interpret them.

Credits

dalmolingroup/euryale was originally written by João Cavalcante.

We thank the following people for their extensive assistance in the development of this pipeline:

Diego Morais (for developing the original MEDUSA pipeline)

Citations

Morais DAA, Cavalcante JVF, Monteiro SS, Pasquali MAB and Dalmolin RJS (2022) MEDUSA: A Pipeline for Sensitive Taxonomic Classification and Flexible Functional Annotation of Metagenomic Shotgun Sequences. Front. Genet. 13:814437. doi: 10.3389/fgene.2022.814437

This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

