The V-pipe workflow can be customized through the configuration file config.yaml or config.json or, for backward compatibility with the legacy INI-style format used in V-pipe v1.x/2.x, vpipe.config. This configuration file is a text file written using a basic structure composed of sections, properties and values. When using the YAML or JSON format, use these languages’ associative arrays (dictionaries) in two levels, for sections and properties. When using the older INI format, sections are expected in square brackets, and properties are followed by their corresponding values.
Furthermore, it is possible to specify additional options on the command line, using Snakemake’s --configfile to pass additional YAML/JSON configuration files, and/or Snakemake’s --config to pass sections and properties in YAML flow style/JSON syntax.
The order of precedence is:
command line options (--config, --configfile) >> default configuration file (config/config.yaml or config.yaml) >> legacy configuration INI (vpipe.config) >> virus-specific base config (virus_base_config) >> default values
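The precedence order above can be pictured as a layered merge, where each higher layer overrides individual properties of the layers below it. The following is a minimal sketch of that idea and not V-pipe's actual implementation: the helper names, the property name threads and all values are assumptions chosen for illustration.

```python
from functools import reduce


def merge_sections(low, high):
    """Return a copy of `low` in which properties from `high` take precedence."""
    merged = {section: dict(props) for section, props in low.items()}
    for section, props in high.items():
        merged.setdefault(section, {}).update(props)
    return merged


def resolve_config(layers):
    """Merge configuration layers listed from lowest to highest precedence."""
    return reduce(merge_sections, layers, {})


# Hypothetical layers (illustrative values only), lowest precedence first.
defaults = {"general": {"aligner": "ngshmmalign", "threads": 1}}
virus_base = {"general": {"aligner": "bwa"}}
legacy_ini = {"general": {"threads": 2}}
config_yaml = {"general": {"threads": 4}, "input": {"samples_file": "samples.tsv"}}
command_line = {"general": {"aligner": "minimap"}}

config = resolve_config([defaults, virus_base, legacy_ini, config_yaml, command_line])
print(config)
```

Merging per property rather than per section means that a command line override of a single property leaves the other properties of that section untouched.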
Example: we suggest providing as input a tabular file specifying sample unique identifiers (e.g., patient identifiers) and dates for the different sequencing runs related to the same patient. The name of this file (here, samples.tsv) can be provided by specifying the section as input and the property as samples_file, as follows:
input:
  samples_file: samples.tsv
In this document, we provide a comprehensive list of all user-configurable options, stratified by sections.
This section of the configuration provides general options that control the overall behavior of the pipeline.
We provide virus-specific base configuration files which contain handy defaults for, e.g., HIV and SARS-CoV-2. Check the git repository’s config subdirectory to learn about them.
hiv
sars-cov-2
By default, trimming and clipping of reads is performed by PRINSEQ 1, a versatile raw read processor for short reads with many customization options, which we use mostly for Illumina short-read sequencing.
Some other sequencing platforms, e.g., Oxford Nanopore Technologies, are not compatible with this software and usually perform quality control during the fast5 basecalling and demultiplexing anyway, e.g., with Guppy. Use skip to avoid preprocessing such already quality-trimmed FASTQ files.
Schmieder, R. and Edwards, R. Quality control and preprocessing of metagenomic datasets. Bioinformatics. 2011. ↩
skip
There are four options for mapping reads: ngshmmalign, BWA MEM (bwa) 1, Bowtie 2 (bowtie) 2, or minimap2 (minimap) 3. To use a different aligner than the default, indicate which aligner you want to use by setting the property aligner.
Note: Some virus-specific base configurations specified in virus_base_config might change this option’s default to a more appropriate aligner for that virus, e.g., depending on its usual diversity and mutation rate.
You are still free to override that default in your configuration should the need arise.
minimap
There are two options available for trimming primers, either using iVar trim (ivar) 1 or Samtools ampliconclip (samtools) 2. iVar trim is used by default. If you prefer to use Samtools ampliconclip, then indicate so in the configuration file as in the example.
samtools
There are two options available for calling single nucleotide variants, either using ShoRAH (shorah) 1 or LoFreq (lofreq) 2. ShoRAH is used by default. If you prefer to use LoFreq, then indicate so in the configuration file as in the example.
Zagordi, O. et al. ShoRAH: estimating the genetic diversity of a mixed sample from next-generation sequencing data. BMC Bioinformatics. 2011. ↩
Wilm, A. et al. LoFreq: A sequence-quality aware, ultra-sensitive variant caller for uncovering cell-population heterogeneity from high-throughput sequencing datasets. Nucleic Acids Res. 2012. ↩
lofreq
There are three options available for haplotype reconstruction, namely SAVAGE 1, HaploClique 2 or PredictHaplo 3. SAVAGE is used by default. If you wish to use HaploClique, then indicate it in the configuration file as in the example.
Baaijens, J. A. et al., De novo assembly of viral quasispecies using overlap graphs. Genome Res. 2017. ↩
Töpfer, A. et al. Viral quasispecies assembly via maximal clique finding. PLOS Computational Biology. 2014. ↩
Prabhakaran, S. et al. HIV haplotype inference using a propagating dirichlet process mixture model. IEEE/ACM transactions on computational biology and bioinformatics 11.1. 2013. ↩
haploclique
This option should be used to specify the default number of threads for all multi-threaded rules. That is, unless the number of threads is specified for a given rule, this value is used as its default.
Value must be greater than or equal to 1
4
Sets the algorithm to be used when computing checksums for uploadable data.
sha256
Some steps of V-pipe produce temporary files, e.g., decompressed intermediate files, i.e., files which aren’t kept long-term but are deleted after all steps that needed them have finished. By default, these files are written in the output data directory. This option makes it possible to write them in a different directory instead. Use this option to, e.g., leverage a faster cluster-local storage or avoid wasting backup space on a snapshotted storage. You might want to consult the documentation provided by your HPC.
temp
/cluster/scratch
Specify whether TSV files such as coverage and base counts should be 1-based (i.e., the first base pair position is called 1), following standard practice in biology and most text formats such as VCF and GFF, or 0-based (i.e., the first base pair position is called 0), as in several Python tools such as pysam and in the BED format.
By default, V-pipe uses 1-based TSV files (the position column starts with 1), but this option changes that behaviour.
0
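To make the difference between the two conventions concrete, here is a small sketch showing how the same bases are labelled under each setting. The function and argument names are hypothetical and only illustrate the offset; they are not taken from V-pipe.

```python
def position_label(index, tsv_based=1):
    """Convert a 0-based array index into the position written to a TSV file.

    `tsv_based` is a hypothetical argument standing in for the option
    described above: 1 for 1-based output, 0 for 0-based output.
    """
    if tsv_based not in (0, 1):
        raise ValueError("tsv_based must be 0 or 1")
    return index + tsv_based


reference = "ACGT"
one_based = [(position_label(i, 1), base) for i, base in enumerate(reference)]
zero_based = [(position_label(i, 0), base) for i, base in enumerate(reference)]

print(one_based)   # first base is labelled 1 (VCF/GFF style)
print(zero_based)  # first base is labelled 0 (pysam/BED style)
```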
Character to use when assembling the two levels (e.g., sample and date) into a column title to be used in a report TSV file.
E.g., with this sample file
patient1 20100113
patient1 20110202
patient2 20081130
the coverage TSV file’s columns will be called patient1/20100113, patient1/20110202 and patient2/20081130.
-
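As an illustration of the effect of this character, the sketch below assembles column titles from the two levels of the sample file shown above, first with / as in the titles listed above and then with - as in the example value. The helper name is an assumption made for this example.

```python
def column_title(sample, date, separator="/"):
    """Join the two sample levels into one TSV column title (hypothetical helper)."""
    return separator.join((sample, date))


rows = [
    ("patient1", "20100113"),
    ("patient1", "20110202"),
    ("patient2", "20081130"),
]

titles_slash = [column_title(sample, date) for sample, date in rows]
titles_dash = [column_title(sample, date, separator="-") for sample, date in rows]

print(titles_slash)
print(titles_dash)
```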
Properties in this section of the configuration control the input of the pipeline.
The input file for the workflow will be searched in this directory.
V-pipe expects the input samples to be organized in a two-level directory hierarchy (e.g., sample and date). Within the second level, the subdirectory raw_data holds the sequencing data in FASTQ format (optionally compressed with GZip). For example:
📁samples
├──📁patient1
│  ├──📁20100113
│  │  └──📁raw_data
│  │     ├──🧬patient1_20100113_R1.fastq
│  │     └──🧬patient1_20100113_R2.fastq
│  └──📁20110202
│     └──📁raw_data
│        ├──🧬patient1_20100202_R1.fastq
│        └──🧬patient1_20100202_R2.fastq
└──📁patient2
   └──📁20081130
      └──📁raw_data
         ├──🧬patient2_20081130_R1.fastq.gz
         └──🧬patient2_20081130_R2.fastq.gz
tests/data/hiv/
tests/data/sars-cov-2/
Indicate whether the input sequencing reads correspond to paired-end reads.
Paired-end reads need to be in split files with _R1 and _R2 suffixes:
📁raw_data
├──🧬patient2_20081130_R1.fastq.gz
└──🧬patient2_20081130_R2.fastq.gz
False
V-pipe expects paired-end reads to be in files that end in _R1 and _R2 exactly right before the file extension, e.g., _R1.fastq.gz, because this is how the workflow finds and recognizes them.
But Illumina’s bcl2fastq demultiplexer might introduce additional strings, e.g., _R2_001.fastq.gz or, depending on its mismatches settings, e.g., _R2_001_MM_1.fastq.gz. Use this option to specify anything which should go between the _R1 and _R2 endings and the file extension.
_001
_001_MM_1
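The role of this suffix can be sketched with a regular expression that expects the mate marker, then the configured suffix, then the file extension. This is only an illustration of the naming rule described above, not the code V-pipe uses; the function names and the fixed .fastq.gz extension are assumptions.

```python
import re


def read_file_pattern(fastq_suffix="", extension=".fastq.gz"):
    """Build a pattern matching '_R1'/'_R2' + suffix + extension at the end of a name."""
    return re.compile(
        r"_R(?P<mate>[12])" + re.escape(fastq_suffix) + re.escape(extension) + r"$"
    )


def mate_of(filename, fastq_suffix=""):
    """Return 1 or 2 for a recognized read file, or None if the name does not match."""
    match = read_file_pattern(fastq_suffix).search(filename)
    return int(match.group("mate")) if match else None


print(mate_of("patient2_20081130_R1.fastq.gz"))            # recognized without suffix
print(mate_of("sample_R2_001.fastq.gz"))                   # not recognized without suffix
print(mate_of("sample_R2_001.fastq.gz", "_001"))           # recognized with suffix _001
print(mate_of("sample_R2_001_MM_1.fastq.gz", "_001_MM_1")) # recognized with suffix _001_MM_1
```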
File containing sample unique identifiers and dates as tab-separated values, e.g.,
patient1 20100113
patient1 20110202
patient2 20081130
Here, we have two samples from patient 1 and one sample from patient 2. By default, V-pipe searches for a file named samples.tsv; if this file does not exist, a list of samples is built by globbing the contents of the datadir directory.
Optionally, the samples file can contain a third column specifying the read length. This is particularly useful when samples are sequenced using protocols with different read lengths.
Optionally, a fourth column can contain the short name of a protocol (e.g., v3) that is detailed in the file specified in input => protocols_file. This is useful if protocol details such as primers change over time, e.g., to adapt to new variants with SNVs breaking primer binding affinity.
Standardized Snakemake workflows place their tables inside the config/ subdirectory, but using this option you can specify alternate locations, e.g., the current working directory (as done in legacy V-pipe v1.x/2.x).
samples.tsv
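The column layout described above (two mandatory columns, an optional read length and an optional protocol short name) can be read with a few lines of Python. This is a sketch under stated assumptions rather than V-pipe's own parser: the function name, the dictionary keys and the fallback read length of 100 are chosen for illustration.

```python
import csv
import io


def read_samples(handle, default_read_length=100):
    """Parse a samples TSV with optional third (read length) and fourth (protocol) columns."""
    samples = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        sample, date = row[0], row[1]
        # Fall back to the default when the third column is absent or empty.
        read_length = int(row[2]) if len(row) > 2 and row[2] else default_read_length
        protocol = row[3] if len(row) > 3 and row[3] else None
        samples.append(
            {"sample": sample, "date": date, "read_length": read_length, "protocol": protocol}
        )
    return samples


table = "patient1\t20100113\nsample_a\t20211108\t250\tv3\n"
samples = read_samples(io.StringIO(table))
for entry in samples:
    print(entry)
```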
When different samples have been processed with different library protocols, this file specifies a look-up table with per-protocol specifics (primers BED and FASTA files), e.g.:
v41:
  name: SARS-CoV-2 ARTIC V4.1
  inserts_bedfile: references/primers/v41/SARS-CoV-2.insert.bed
  primers_bedfile: references/primers/v41/SARS-CoV-2.primer.bed
  primers_file: references/primers/v41/SARS-CoV-2.tsv
  primers_fasta: references/primers/v41/SARS-CoV-2.primer.fasta
v4:
  name: SARS-CoV-2 ARTIC V4
  inserts_bedfile: references/primers/v4/SARS-CoV-2.insert.bed
  primers_bedfile: references/primers/v4/SARS-CoV-2.primer.bed
  primers_file: references/primers/v4/SARS-CoV-2.tsv
  primers_fasta: references/primers/v4/ARTIC_v4.fasta
v3:
  name: SARS-CoV-2 ARTIC V3
  inserts_bedfile: references/primers/v3/nCoV-2019.insert.bed
  primers_bedfile: references/primers/v3/nCoV-2019.primer.bed
  primers_file: references/primers/v3/nCoV-2019.tsv
  primers_fasta: references/primers/v3/ARTIC_v3.fasta
The short name can then be referenced in the samples TSV table file:
sample_a 20211108 250 v3
sample_b 20220214 250 v4
Note: The virus-specific base configuration specified in general => virus_base_config will most likely change this option’s default.
You are still free to override that default in your configuration should the need arise.
resources/sars-cov-2/primers.yaml
Default for those samples whose read length isn’t specified explicitly in the optional third column of the samples.tsv
table.
100
A BED file with the primer positions used to trim the alignment output.
Note: The virus-specific base configuration specified in general => virus_base_config will most likely change this option’s default.
You are still free to override that default in your configuration should the need arise.
Note: individual samples can override this using the 4th column in the samples TSV table file and the protocols YAML look-up table.
resources/sars-cov-2/primers/v3/nCoV-2019.primer.bed
A BED file with the insert positions of the multiplex PCR output, to use with amplicon-based analysis.
Note: The virus-specific base configuration specified in general => virus_base_config will most likely change this option’s default.
You are still free to override that default in your configuration should the need arise.
Note: individual samples can override this using the 4th column in the samples TSV table file and the protocols YAML look-up table.
resources/sars-cov-2/primers/v3/nCoV-2019.primer.bed
Using this parameter, the user can specify the read-length threshold that should be applied during quality trimming, expressed as a fraction of the read length (0 < trim_percent_cutoff < 1).
Value must be greater than or equal to 0 and less than or equal to 1
0.9
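As a worked example, the sketch below turns the cutoff into an absolute minimum length, assuming the threshold is simply the read length multiplied by the cutoff and rounded down. That formula and the function name are assumptions made for illustration; the source only states that the threshold is relative to the read length.

```python
def min_length_after_trimming(read_length, trim_percent_cutoff=0.9):
    """Return the assumed minimum read length (in bp) kept after quality trimming."""
    if not 0 <= trim_percent_cutoff <= 1:
        raise ValueError("trim_percent_cutoff must be between 0 and 1")
    return int(read_length * trim_percent_cutoff)


# With the example value 0.9, reads of 250 bp must keep at least 225 bp.
print(min_length_after_trimming(250))
# A lower cutoff is more permissive.
print(min_length_after_trimming(100, trim_percent_cutoff=0.8))
```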
Reference sequence to use for the alignment step
Note: The virus-specific base configuration specified in general => virus_base_config will most likely change this option’s default to a reference for that virus.
You are still free to override that default in your configuration should the need arise.
resources/hiv/HXB2.fasta
resources/sars-cov-2/NC_045512.2.fasta
A directory containing gff files that can be optionally used to annotate the reference genome in the visualization, e.g., with genes, mature products, protein domains, regions of interests, etc.
Note: The virus-specific base configuration specified in general => virus_base_config will most likely change this option’s default.
You are still free to override that default in your configuration should the need arise.
resources/hiv/gffs/
resources/sars-cov-2/gffs/
An associative array providing a user-friendly name to display for each annotation .gff file in the gff_directory.
Note: The virus-specific base configuration specified in general => virus_base_config will most likely change this option’s default.
You are still free to override that default in your configuration should the need arise.
resources/hiv/metainfo.yaml
resources/sars-cov-2/metainfo.yaml
The specific annotation GFF file that provides the positions of the genes along the genome, for reports that mention specific genes such as frameshift-deletions-check.
Note: The virus-specific base configuration specified in general => virus_base_config will most likely change this option’s default.
You are still free to override that default in your configuration should the need arise.
Note: if not set, V-pipe will try auto-selecting a .gff file from the gff_directory.
resources/hiv/gffs/GCF_000864765.1_ViralProj15476_genomic.gff
resources/sars-cov-2/gffs/Genes_NC_045512.2.GFF3
A table with primers to display on the visualization
Note: The virus-specific base configuration specified in general => virus_base_config will most likely change this option’s default.
You are still free to override that default in your configuration should the need arise.
Note: individual samples can override this using the 4th column in the samples TSV table file and the protocols YAML look-up table.
resources/sars-cov-2/primers/v3/nCoV-2019.tsv
Directory holding a list of COJAC YAML definitions of variants of concern that will be used for search of variant signatures
Note: The virus-specific base configuration specified in general => virus_base_config will most likely change this option’s default.
You are still free to override that default in your configuration should the need arise.
resources/sars-cov-2/voc/
A FASTA file with sequences of interest
Note: These sequences are used, together with the consensus sequence, to build a phylogenetic tree.
resources/sars-cov-2/phylogeny/selected_covid_sequences.fasta
Properties in this section of the configuration control the output of the pipeline.
The workflow will write its output files into this directory. This will follow the same structure as for the input.
For each sample, V-pipe produces several output files that are located in the corresponding sample-specific directory. First, the alignment file and consensus sequences are located in the alignments and references subdirectories, respectively. Second, output files containing SNVs and viral haplotypes are located in the variants subdirectories.
Using the same example as in the input section, the output files for the two patient samples will be located in the following subdirectories:
📁results
├──📁patient1
│  ├──📁20100113
│  │  ├──📁alignments
│  │  │  └──REF_aln.bam
│  │  ├──📁references
│  │  │  ├──consensus.bcftools.fasta
│  │  │  ├──ref_ambig.fasta
│  │  │  └──ref_majority.fasta
│  │  └──📁variants
│  │     ├──📁SNVs
│  │     │  └──snvs.vcf
│  │     └──📁global
│  │        └──contigs_stage_c.fasta
│  └──📁20110202
│     ├──📁alignments
│     │  └──REF_aln.bam
│     ├──📁references
│     │  ├──consensus.bcftools.fasta
│     │  ├──ref_ambig.fasta
│     │  └──ref_majority.fasta
│     └──📁variants
│        ├──📁SNVs
│        │  └──snvs.vcf
│        └──📁global
│           └──contigs_stage_c.fasta
└──📁patient2
   ├──📁alignments
   │  └──REF_aln.bam
   ├──📁references
   │  ├──consensus.bcftools.fasta
   │  ├──ref_ambig.fasta
   │  └──ref_majority.fasta
   └──📁variants
      ├──📁SNVs
      │  └──snvs.vcf
      └──📁global
         └──contigs_stage_c.fasta
By default, the output is written to the results subdirectory. If you prefer to use the same samples/ subdirectory as the input (as used to be done in legacy V-pipe v1.x/2.x), you can use this option to specify alternate target locations.
samples
In addition, V-pipe can optionally generate a few cohort-wide results, such as a current cohort consensus fasta file, or a TSV file containing the frequencies of all minor alleles that differ from the consensus among analyzed samples.
By default, these output files are located at the base of the output datadir, outside of the two-level per-sample structure:
results
├──minority_variants.tsv
├──cohort_consensus.fasta
├──patient1
│  ├──20100113
│  │  ├──alignments
…
If you prefer instead, e.g., to have such cohort-wide results written in a subdirectory of the working directory at the same level as the datadirs, you can use this option to specify an alternate subdirectory relative to the datadir property. (Use a .. prefix if you want your cohort-wide results to be in a directory at the same level as samples/ and results/. See the example below to recreate the variants/ directory used by legacy V-pipe v1.x/2.x.)
../variants
V-pipe can produce several outputs to assess the quality of the output of its steps, e.g., checking whether a sample’s consensus sequence generated by bcftools results in frameshifting indels and writing a report in the sample’s …/references/frameshift_deletions_check.tsv. Such reports can be useful when submitting sequences to GISAID.
This option turns on such QA features.
True
This option indicates that the samples come from PCR amplification and that the primers should be trimmed from amplicons in the alignment file. The trimmed reads are written to each sample’s …/variants/SNVs/REF_aln_trim.bam.
Using this option requires either specifying a primers bed file in input
=> protocols_file
, or using a 4 column input samples TSV file and specify a protocol look-up YAML file in input
=> protocols_file
.
True
This option selects whether the SNV caller step should be executed and its output written to each sample’s …/variants/SNVs/snvs.csv
.
True
This option activates local haplotype reconstruction (only available when using ShoRAH).
True
This option turns on global haplotype reconstruction.
True
This option selects whether to generate HTML visualization of the SNVs in each sample’s …/visualization/index.html
.
True
This option turns on the computation of diversity measures in each sample.
True
This option turns on dehumanization of the raw reads (i.e. removal of host’s reads) and generates the file dehuman.cram
. This is useful to prepare raw reads for upload on public databases such as, e.g. ENA (European Nucleotide Archive).
This only applies to the upload and does not affect the main workflow.
True
This option can be used for assistance in incremental upload of data. See section upload
for an example.
True
The path to the different software packages can be specified using this section.
It is especially useful when dependencies, such as VICUNA, are not obtained via conda, and when the software packages are not in the PATH.
Note: we strongly recommend using conda environments, by adding the --use-conda flag to the V-pipe execution command, e.g., ./vpipe --use-conda. If you prefer to use your own installations, this section allows you to specify the location of the executable files.
bwa: /path/to/bwa
haploclique: /path/to/haploclique
Due to a special license, VICUNA is not available from bioconda and must be installed from its original website.
Use this option to specify where you have installed its executable.
We use software PRINSEQ 1 for quality control. By default, we use options -ns_max_n 4 -min_qual_mean 30 -trim_qual_left 30 -trim_qual_right 30 -trim_qual_window 10
, which indicates to trim reads using a sliding window with size 10 bp, and trim bases if their quality scores are less than 30. Additionally, reads are filtered out if the average quality score is below 30 and if they contain more than 4 N’s. The user can choose to overwrite the default settings or use additional parameters by using the property extra
. E.g., if many reads are filtered out in this step, the user can choose to lower the quality threshold as indicated in the example.
Please do not modify PRINSEQ options -out_format
, -out_good
, nor -min_len
. Instead of using -min_len
to define threshold on the read length after trimming, use input
=> trim_percent_cutoff
.
Schmieder, R. and Edwards, R. Quality control and preprocessing of metagenomic datasets. Bioinformatics. 2011. ↩
-ns_max_n 4 -min_qual_mean 20 -trim_qual_left 20 -trim_qual_right 20 -trim_qual_window 10
NOTE The conda environment for this rule doesn’t work properly. The package on the bioconda channel, mvicuna, is slightly different from VICUNA and it has different command-line arguments. Moreover, VICUNA and mvicuna are no longer maintained. In the future, this rule will be deprecated.
NOTE Obtaining an initial reference de novo is implemented for more than one sample.
This option is useful for debugging purposes.
True
Pass additional options to run ngshmmalign
V-pipe uses option -R <path/to/initial_reference>
, thus option -r arg
is not allowed. Also, instead of passing -l
via the property extra
, set leave_msa_temp
to True
. Lastly, please do not modify options -o arg
, -w arg
, -t arg
, and -N arg
. These are already managed by V-pipe.
Panel of diverse references against which to align reads as a QA step
Note: The virus-specific base configuration specified in general => virus_base_config will most likely change this option’s default.
You are still free to override that default in your configuration should the need arise.
resources/hiv/5-Virus-Mix.fasta
This rule takes all reads previously aligned by hmm_align. Therefore, resources should be allocated accordingly.
With property extra
, users can pass additional options to run BWA MEM. For more details on BWA MEM configurable options refer to the software documentation.
Indicate if qualities are Phred+33 (default) or Phred+64 (--phred64
).
--phred64
Specify Bowtie 2 presets.
Pass additional options to run Bowtie 2. V-pipe handles the input and output files, as well as the reference sequence. Thus, do not modify these options.
For more details on Bowtie 2 configurable options refer to the software documentation.
Specify minimap2 preset options. See minimap2’s documentation for details about each of the presets.
map-ont
With property extra
, users can pass additional options to run minimap2. For more details on minimap2 configurable options refer to the software documentation.
Minimum read depth for reporting variants per locus.
Read count below which ambiguous base ’n’ is reported.
Minimum phred quality score for a base to be included.
Minimum frequency for an ambiguous nucleotide.
Value must be greater or equal to 0
and lesser or equal to 1
Value must be greater or equal to 0
and lesser or equal to 1
Minimum read depth for reporting variants per locus.
50
Output a numpy array file containing frequencies of all bases, including gaps and also the most abundant base across samples.
True
Construct intervals based on overlapping windows of the read alignment. By default, regions with high coverage are built based on the position-wise read depth.
True
Minimum read depth. A region spanning the reference genome is returned if coverage
is set to 0.
0
Indicate whether to apply a more liberal shifting on intervals’ right-endpoint.
False
Indicate whether to use the cohort-consensus sequence from the analyzed samples (output from the minor_variants rule, located in the cohort-wide output results/cohort_consensus.fasta) or, by setting this option to False, the reference sequence.
False
Hyperparameter used for instantiating a new cluster.
Ignore SNVs adjacent to indels.
Value must be greater or equal to 0
and lesser or equal to 1
Omit windows with coverage less than this value.
50
ShoRAH performs local haplotype reconstruction on windows of the read alignment. The overlap between these windows is defined by the window shifts. By default, it is set to 3, i.e., apart from flanking regions each position is covered by 3 windows.
Indicate whether to move files produced in previous/interrupted runs to subdirectory named old
True
Indicate whether to use the cohort-consensus sequence from the analyzed samples (output from the minor_variants rule, located in the cohort-wide output results/cohort_consensus.fasta) or, by setting this option to False, the reference sequence.
False
Pass additional options to run lofreq call
NOTE This rule only works in Linux.
Size of the batches of reads to be processed by SAVAGE. It is recommended that 500 < coverage/split
< 1000.
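The recommendation above can be checked numerically. The sketch below lists the integer split values that satisfy 500 < coverage/split < 1000 for a given coverage; the function name is hypothetical and coverage is assumed to be a single integer read depth.

```python
def recommended_splits(coverage):
    """Return the integer split values for which 500 < coverage / split < 1000."""
    # Any split above coverage // 500 would push coverage / split below 500.
    return [
        split
        for split in range(1, coverage // 500 + 1)
        if 500 < coverage / split < 1000
    ]


print(recommended_splits(10000))  # candidate splits for a read depth of 10000
print(recommended_splits(600))    # a single batch is enough
print(recommended_splits(400))    # coverage too low to meet the recommendation
```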
If set to True
(default) a predefined set of parameter values is used for drawing edges between reads in the read graph.
Singletons are defined as proposed haplotypes which are supported by a single read. If this property is set to True
, singletons are discarded.
If set to True
(default) probability of the overhangs is ignored.
Sets a threshold to limit the size of cliques.
Indicates the maximum number of cliques to be considered in the next iteration.
Additional parameters to be passed to haploclique.
Warning: this won’t overwrite the other options (e.g. clique_size_limi
and max_num_cliques
should still be set via their own respective properties, do not pass parameters --limit_clique_size=
nor --max_cliques=
via this extra
property).
Use to specify a region of interest.
Use to specify a region of interest.
9719
29836
When the ground truth is available (e.g., simulation studies), a multiple sequence alignment of types making up the population can be provided, and additional checks are performed.
Minimal number of cooccurrences to search for in an amplicon. Lowering this property to 1 will make COJAC also look for amplicons with singleton mutations.
1
Format of the output CSV.
lines (default) - each amplicon is a separate entry on a separate line.
columns - one column per amplicon.
columns
This section is used to set up a timeline of the samples. Some outputs, e.g., the deconvolution of quasispecies mixtures using LolliPop, need to have a time component. By default, it calls a script that uses regular expressions and look-up tables to extract this information from the samples’ own names. But by using the properties script and options and adapting the environment provided in property conda, it is possible to heavily customize the actions (e.g., it is possible to query an external database instead). For inspiration, see the default script file_parser.py.
The default environment only provides regular expression functions (python-reges), but depending on your needs you may want to provide a custom environment with additional tools (e.g., drivers to query a database, etc.)
Don’t dispatch the timeline rule to the cluster for execution, run locally.
False
Script that sets up a timeline of the samples.
Its purpose is to take the V-pipe’s samples TSV file and add two columns:
It will receive the following parameters (in addition to what is specified in property options):
For an example, see the default script file_parser.py. It uses regular expressions (regex) to parse the first two columns (sample and batch names) and extract a date and a location code that is further looked up in a table. It takes two additional parameters:
Additional options to be passed to the script, e.g. for an extra configuration file with database server information.
By default, passes an option to the default script to force always using the regex (do not fall back to copy-pasting columns).
Option for the default script: TSV table that maps location codes (e.g. short alphanumeric codes) used in sample names to full names of locations (e.g. city names).
For example:
code location
10 Zürich (ZH)
16 Genève (GE)
Ba Basel (BS)
wastewater_plants.tsv
Option for the default script: YAML file that defines how to parse time series information out of the columns of samples.tsv, e.g.:
sample: (?P<location>\d+)_(?P<year>20\d{2})_(?P<month>[01]?\d)_(?P<day>[0-3]?\d)
datefmt: "%Y%m%d"
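To see what the sample regular expression in the example above captures, it can be tried on a sample name with Python's re module. This is a minimal sketch, not the code of file_parser.py: the use of re.search, the function name and the conversion of the year, month and day groups into a date object are assumptions made for illustration.

```python
import re
from datetime import date

# Regular expression copied from the example above.
SAMPLE_REGEX = re.compile(
    r"(?P<location>\d+)_(?P<year>20\d{2})_(?P<month>[01]?\d)_(?P<day>[0-3]?\d)"
)


def parse_sample_name(name):
    """Return (location code, sampling date) from a sample name, or None if no match."""
    match = SAMPLE_REGEX.search(name)
    if match is None:
        return None
    sampling_date = date(
        int(match.group("year")), int(match.group("month")), int(match.group("day"))
    )
    return match.group("location"), sampling_date


print(parse_sample_name("A1_05_2023_04_12"))
print(parse_sample_name("patient1"))
```

The location code is returned as the captured string; looking it up in a table of location names would be a separate step.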
sample defines a regular expression to be applied on the first column (sample names)
batch defines a regular expression to be applied on the second column (sequencing batch dates)
location is used for the location codes
year, month, day are used for the dates of the timeline
date can be used to capture the whole date string
datefmt gives a time format string to parse the date capturing group.
If set, this user-provided TSV file (e.g., generated with an external tool prior to running V-pipe) will be used for obtaining locations and dates, as needed by LolliPop, instead of generating results/timeline.tsv with the rule timeline.
This follows the following format (similar to the output of rule timeline):
sample batch reads proto location_code date location
A1_05_2023_04_12 20230428_HNG5MDRX2 250 v41 5 2023-04-12 Lugano (TI)
A2_10_2023_04_13 20230428_HNG5MDRX2 250 v41 10 2023-04-13 Zürich (ZH)
A3_16_2023_04_14 20230428_HNG5MDRX2 250 v41 16 2023-04-14 Genève (GE)
…
samples.tsv
Configuration file with parameters for kernel deconvolution
/git/lollipop/deconv_linear_logit_quasi_strat.yaml
/git/lollipop/deconv_linear_wald.yaml
/git/lollipop/deconv_bootstrap
Variants configuration used during deconvolution
var_conf.yaml
Variants to scan per period (as determined with COJAC by leveraging the output of the cooc rule)
var_dates.yaml
Format of the output CSV.
lines (default) - each variant is a separate entry on a separate line.
columns - one column per variant.
columns
Host’s genome used to remove reads (e.g. human genome)
Note: if this file is absent, it is possible to fetch it from a remote server, see property ref_host_url
below.
/cluster/project/igenomes/Homo_sapiens/NCBI/GRCh38/Sequence/BWAIndex/genome.fa
If the host’s genome specified in property ref_host isn’t present, fetch it from a remote server.
Note: remember to set aside enough memory for the indexing rule, see section ref_bwa_index, property mem.
http://ftp.ensembl.org/pub/release-105/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.29_GRCh38.p14/GCA_000001405.29_GRCh38.p14_genomic.fna.gz
Indicate whether to store the host-aligned reads in a CRAM file …/alignments/host_aln.cram
.
True
Use this option when generating dehumanized raw reads (dehuman.cram) on old samples that have already been processed in the past, i.e., as a catch-up.
Normally, removing host-mapping reads requires analyzing reads which were rejected by V-pipe’s main processing (as specified in section general, property aligner). But this output is considered temporary and will get deleted by Snakemake once the processing of a sample has finished. To generate dehuman.cram, V-pipe would need to run the aligner again, which would both regenerate the data necessary for this output and generate a new alignment, which would trigger the whole workflow again.
Use this property catchup to only generate the input necessary for dehuman.cram, leaving untouched the alignment and everything else that has already been processed.
True
This section is used to assist and prepare uploads of the data, e.g., to the European Nucleotide Archive. By default, it calls a script that creates symlinks making it easy to identify new/updated samples between calls of V-pipe. But by using the properties script and options and adapting the environment provided in property conda, it is possible to heavily customize the actions (e.g., it is possible to upload to an SFTP server by calling sftp from a modified script). For inspiration, see the default script prepare_upload_symlinks.sh.
The default environment only provides hashing functions (xxhash, Linux coreutils’ sha{nnn}sum collection, etc.), but depending on your needs you may want to provide a custom environment with additional tools (e.g., sftp, rsync, curl, lftp, custom specialized cloud uploaders, etc.)
Don’t dispatch the rule to the cluster for execution, run locally.
False
When preparing data for upload, specifies which consensus sequence should be uploaded.
majority
Generate checksum for each individual consensus sequence (if a consensus is regenerated, it will help determine whether the new file has changed content or is virtually the same as the previous).
True
Also include the original .fastq.gz
sequencing reads files from raw_data/
in the list of files to be uploaded. See property orig_cram
below for a compressed version and see output dehumanized_raw_reads
and section dehuman
for depleting reads from the host.
True
Also include a compressed version of the original sequencing raw reads files from raw_data/
. Similar to property orig_fastq
above, but with reference-based compression.
True
Custom script that assists and prepares uploads.
It will receive the following positional parameters:
For an example, see the default script prepare_upload_symlinks.sh. It generates symlinks that help track which samples are new and/or updated between runs of V-pipe and thus should be considered for upload.
Named options to be passed to the script, before the positional parameters. E.g. for an extra configuration file with SFTP server information.