Research Object Crate for gene2phylo

Original URL: https://workflowhub.eu/workflows/793/ro_crate?version=1

# gene2phylo

**gene2phylo** is a Snakemake pipeline for batch phylogenetic analysis of a given set of input genes.

## Contents

- [Setup](#setup)
- [Example data](#example-data)
- [Input](#input)
- [Output](#output)
- [Running your own data](#running-your-own-data)
- [Getting help](#getting-help)
- [Citations](#citations)

## Setup

The pipeline is written in Snakemake and uses conda to install the necessary tools.

It is *strongly recommended* to install conda using Mambaforge. See details here: https://snakemake.readthedocs.io/en/stable/getting_started/installation.html

Once conda is installed, you can clone the GitHub repo and set up the base conda environment.

```
# get github repo
git clone https://github.com/o-william-white/gene2phylo

# change dir
cd gene2phylo

# setup conda env
conda env create -n snakemake -f workflow/envs/conda_env.yaml
```


## Example data

Before you run your own data, it is recommended to run the example datasets provided. This confirms there are no user-specific issues with the setup, and it also installs all the dependencies. The example data includes mitochondrial and ribosomal genes from 25 different butterfly species.

To run the example data, use the code below. The first time you run the pipeline, it will take some time to install each of the conda environments, so it is a good time to take a tea break :).

```
conda activate snakemake

snakemake \
   --cores 4 \
   --use-conda
```

## Input

Snakemake requires a `config.yaml` to define input parameters.

For the example data provided, the config file is located at `config/config.yaml` and it looks like this:

```
# name of input directory containing genes
input_dir: .test

# realign (True or False)
realign: True

# missing data threshold for alignment (0.0 - 1.0), only required if realign == True
missing_threshold: 0.5

# alignment trimming method to use (gblocks or clipkit), only required if realign == True
alignment_trim: gblocks

# name of outgroup sample (optional)
# use "NA" if there is no obvious outgroup
# if more than one outgroup use a comma separated list i.e. "sampleA,sampleB"
outgroup: Eurema_blanda

# plot dimensions (cm)
plot_height: 20
plot_width: 20
```
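The constraints stated in the config comments can be checked before launching a run. The following is a minimal sketch in Python, not part of gene2phylo itself: the function name `validate_config` is hypothetical, and it assumes the config has already been loaded into a dictionary (as Snakemake does for a `configfile`). The allowed values are taken only from the comments in the example config above.

```python
def validate_config(cfg):
    """Return a list of problems found in a gene2phylo-style config dict.

    Hypothetical helper: the rules mirror the comments in the example
    config.yaml and are not the pipeline's own validation.
    """
    problems = []

    def is_number(value):
        # bool is a subclass of int in Python, so exclude it explicitly
        return isinstance(value, (int, float)) and not isinstance(value, bool)

    input_dir = cfg.get("input_dir")
    if not isinstance(input_dir, str) or not input_dir:
        problems.append("input_dir must be a non-empty string")

    realign = cfg.get("realign")
    if not isinstance(realign, bool):
        problems.append("realign must be True or False")

    # These two keys are only required when realign is True
    if realign is True:
        threshold = cfg.get("missing_threshold")
        if not is_number(threshold) or not 0.0 <= threshold <= 1.0:
            problems.append("missing_threshold must be between 0.0 and 1.0")
        if cfg.get("alignment_trim") not in ("gblocks", "clipkit"):
            problems.append("alignment_trim must be gblocks or clipkit")

    # outgroup is optional; when given it is a name, a comma separated list, or "NA"
    outgroup = cfg.get("outgroup")
    if outgroup is not None and (not isinstance(outgroup, str) or not outgroup.strip()):
        problems.append('outgroup must be a sample name, a comma separated list, or "NA"')

    for key in ("plot_height", "plot_width"):
        value = cfg.get(key)
        if not is_number(value) or value <= 0:
            problems.append(f"{key} must be a positive number (cm)")

    return problems


example = {
    "input_dir": ".test",
    "realign": True,
    "missing_threshold": 0.5,
    "alignment_trim": "gblocks",
    "outgroup": "Eurema_blanda",
    "plot_height": 20,
    "plot_width": 20,
}
print(validate_config(example))  # prints []
```

Returning a list of messages rather than raising on the first problem lets all issues in a config be reported in one pass.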

## Output

All output files are saved to the `results` directory. Below is a table summarising all of the output files generated by the pipeline.

| Directory                 | Description               |
|---------------------------|---------------------------|
| mafft                     | Optional: Mafft aligned fasta files of all genes |
| mafft_filtered            | Optional: Mafft aligned fasta files after the removal of sequences based on a missing data threshold |
| alignment_trim            | Optional: Ambiguous parts of alignment removed using either gblocks or clipkit |
| iqtree                    | Iqtree phylogenetic analysis for each gene |
| iqtree_plots              | Plots of Iqtree phylogenetic tree for each gene |
| concatenate_alignments    | Partitioned alignment of all genes |
| iqtree_partitioned        | Iqtree partitioned phylogenetic analysis |
| iqtree_partitioned_plot   | Plot of Iqtree partitioned tree |
| astral                    | Astral phylogenetic analysis of all gene trees |
| astral_plot               | Plot of Astral tree |
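After a run, the table above can serve as a checklist. The sketch below, in Python, reports which of the listed directories are absent. It rests on two assumptions that the README does not state outright: that the names in the table are literal subdirectories of `results`, and that the three directories marked "Optional" are produced only when `realign` is True. The function names are hypothetical.

```python
from pathlib import Path

# Directory names copied from the output table. Treating them as literal
# subdirectory names of results/ is an assumption.
OPTIONAL_DIRS = ["mafft", "mafft_filtered", "alignment_trim"]
CORE_DIRS = [
    "iqtree",
    "iqtree_plots",
    "concatenate_alignments",
    "iqtree_partitioned",
    "iqtree_partitioned_plot",
    "astral",
    "astral_plot",
]


def missing_outputs(present, realign):
    """Return the expected directory names that are not in `present`.

    `present` is a set of directory names; the optional directories are
    only expected when realign is True (an assumption, see above).
    """
    expected = (OPTIONAL_DIRS if realign else []) + CORE_DIRS
    return [name for name in expected if name not in present]


def list_result_dirs(results_dir="results"):
    """Return the names of the subdirectories found in results_dir."""
    root = Path(results_dir)
    if not root.is_dir():
        return set()
    return {path.name for path in root.iterdir() if path.is_dir()}


if __name__ == "__main__":
    # An empty list means every expected directory was found
    print(missing_outputs(list_result_dirs("results"), realign=True))
```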

## Running your own data

For the pipeline to function properly, the input gene alignments must:

- be in a single directory
- have filenames ending in ".fasta"
- be named after the aligned gene (e.g. "cox1.fasta" or "28S.fasta")
- share identical sample names across alignments (e.g. all genes from sample A share the same name)

Please see the data in the `.test/` directory as an example.

Then you need to generate your own config.yaml file, using the example template provided.

## Getting help

If you have any questions, please do get in touch in the issues or by email: o.william.white@gmail.com
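Before getting in touch, it may be worth confirming that your input directory meets the requirements listed under "Running your own data", since inconsistent sample names are easy to miss by eye. The Python sketch below is not part of the pipeline and its function names are hypothetical. It assumes the sample name is the first word after `>` on each FASTA header line. A sample reported as absent from a gene may be genuinely missing data or a misspelt name; the sketch cannot tell which, so treat its output as a prompt for manual inspection.

```python
from pathlib import Path


def sample_names(fasta_text):
    """Return the set of sequence names in FASTA text (first word after '>')."""
    names = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def absent_samples(names_by_gene):
    """Map each sample name to the sorted list of genes in which it does not appear.

    Samples present in every gene are left out of the result.
    """
    all_names = set().union(*names_by_gene.values())
    report = {}
    for name in sorted(all_names):
        absent = sorted(gene for gene, names in names_by_gene.items() if name not in names)
        if absent:
            report[name] = absent
    return report


def scan_input_dir(input_dir):
    """Read every *.fasta file in input_dir, keyed by gene name (the file stem)."""
    return {
        path.stem: sample_names(path.read_text())
        for path in sorted(Path(input_dir).glob("*.fasta"))
    }


if __name__ == "__main__":
    # ".test" is the example input directory shipped with the repo
    for sample, genes in absent_samples(scan_input_dir(".test")).items():
        print(f"{sample}: not found in {', '.join(genes)}")
```

A name that appears in only one alignment, next to a similar name that appears in all the others, is the usual sign of a typo.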

## Citations

If you use the pipeline, please cite our bioRxiv preprint: https://doi.org/10.1101/2023.08.11.552985

Since the pipeline is a wrapper for several other bioinformatic tools, we also ask that you cite the tools used by the pipeline:

- Gblocks (default) https://doi.org/10.1093/oxfordjournals.molbev.a026334
- Clipkit (optional) https://doi.org/10.1371/journal.pbio.3001007
- Mafft (optional) https://doi.org/10.1093/molbev/mst010
- Iqtree https://doi.org/10.1093/molbev/msu300
- Ete3 https://doi.org/10.1093/molbev/msw046
- Ggtree https://doi.org/10.1111/2041-210X.12628
- Astral https://doi.org/10.1186/s12859-018-2129-y

Author
License
MIT

Contents

Main Workflow: gene2phylo
Size: 929 bytes