Poly Pipeline
master @ 7c668f0

Workflow Type: Shell Script

Stable

POLY_PIPELINE

A data analysis pipeline for STOmics data tailored to polyploid organisms.

Contributing

Contributions are welcome! Please fork the repository and submit a pull request. See CONTRIBUTING.md for details.

Main Pipeline (Cluster Execution)

The core scripts are optimized for execution on SGE and PBS clusters. They use relative paths and must be run from the POLY_PIPELINE main directory.

Data Input

The input must be a .gef file (post-processed by the SAW pipeline), placed in the INPUT/datasets/ folder.

IMPORTANT: Place only one .gef file in the datasets folder.
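
A quick check before submission can catch a datasets folder that holds zero or several .gef files. The following is a minimal sketch and not part of the pipeline: the function names count_gef_files and check_single_gef are hypothetical, and it assumes a find that supports -maxdepth (GNU or BSD).

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of POLY_PIPELINE.

# Count the .gef files directly inside a datasets directory (non-recursive).
count_gef_files() {
    local dir="${1:-INPUT/datasets}"
    find "$dir" -maxdepth 1 -type f -name '*.gef' | wc -l | tr -d ' '
}

# Return 0 only when exactly one .gef file is present.
check_single_gef() {
    local dir="${1:-INPUT/datasets}"
    local n
    n="$(count_gef_files "$dir")"
    if [ "$n" -ne 1 ]; then
        echo "Expected exactly one .gef file in $dir, found $n" >&2
        return 1
    fi
}
```

Running check_single_gef from the POLY_PIPELINE main directory with no argument checks INPUT/datasets.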

The script provides a converter (see ANALYSIS [0] below) that converts .GEM and .H5AD files to the .GEF format prior to execution.

Optionally (NOT REQUIRED), differential analysis can be generated for a list of genes of interest. To enable it, create the file INPUT/interest_genes.txt, or point to a custom file with an explicit path (see INTEREST_GENES_PATH below), using the following structure:

gene_name,Gene_ID_1,Gene_ID_2,Gene_ID_3,Gene_ID_4,Gene_ID_5
FLORAL_MERISTEM,AT5G08570,LOC107775591,Nicotiana_T001,LOC107775592
STRESS_HEAT,AT1G53540,LOC107817066,GmHIS4_A01,LOC107817067,AT2G41090
AUXIN_RESPONSE,AT3G15540,LOC107833544,Os02g0602300,AT4G20560
CELL_CYCLE,LOC107769919,AT1G44110,LOC107769920,AT3G53210
APICAL_DOMINANCE,LOC107802111,AT2G44320,LOC107802112
DEFENSE_MECH,AT5G41220,LOC107764120,Solyc01g099710
GIBBERELLIN_SYN,LOC107823450,AT1G05030,LOC107823451,AT3G44360

Each line represents one gene of interest: it starts with the name of the gene, followed by all corresponding IDs (the IDs must match the mapping reference used to generate the .gef file).
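
Since a malformed line is easy to miss, the file can be checked before a job is submitted. The sketch below is illustrative only and is not part of the pipeline: validate_interest_genes is a hypothetical helper that assumes plain comma-separated fields with no quoting, as in the example above. It checks only the structure (a name plus at least one ID per line), not whether the IDs match the mapping reference.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of POLY_PIPELINE.
# Exits non-zero if any non-blank line lacks a gene name or has an empty ID.
validate_interest_genes() {
    local file="${1:-INPUT/interest_genes.txt}"
    awk -F',' '
        NF == 0 { next }    # skip blank lines
        NF < 2 || $1 == "" {
            printf "line %d: need a gene name and at least one ID\n", NR > "/dev/stderr"
            bad = 1
            next
        }
        {
            for (i = 2; i <= NF; i++)
                if ($i == "") {
                    printf "line %d: empty ID in field %d\n", NR, i > "/dev/stderr"
                    bad = 1
                }
        }
        END { exit bad }
    ' "$file"
}
```

Called without an argument, the helper checks INPUT/interest_genes.txt relative to the current directory.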

Step	Script	Description
Analysis AND Annotation	`bin/2_COMPLETE_ANALYSIS.sh`	Complete analysis following the Stereopy documentation (generates the `stereopy_ultimate_analysis.py` script).

Cluster Execution Example (SGE or PBS)

Jobs are submitted with an explicit path to the Miniconda Python executable or to the container image, and parameters are passed as variables with qsub -v.

Analysis Script:

qsub -v ST_PYTHON="/home/user/.conda/envs/st/bin/python",ANALYSIS=1,MIN_COUNTS=50,MIN_GENES=5,PCT_COUNTS_MT=30,N_PCS=30 bin/2_COMPLETE_ANALYSIS.sh

qsub -v ST_PYTHON="/project/directory/POLY_PIPELINE/stereopy_1.5.1.sif",ANALYSIS=1,MIN_COUNTS=50,MIN_GENES=5,PCT_COUNTS_MT=30,N_PCS=30 bin/2_COMPLETE_ANALYSIS.sh

Variable	Description	Default
`ST_PYTHON`	Path to the python executable inside the st environment (SGE) or the container (PBS) for main analysis.	-
`R_CONTAINER`	Path to the R container for secondary analysis (3 - Network Analysis).	-
`MIN_COUNTS`	Minimum number of counts per cell.	20
`MIN_GENES`	Minimum number of genes per cell.	3
`PCT_COUNTS_MT`	Maximum acceptable percentage of counts from mitochondrial genes.	2
`N_PCS`	Number of principal components. This value can be improved after the first run: check the elbow plot (RESULTS/results_ultimate/plots/qc/pca_elbow_enhanced.png) and set N_PCS to the value at the elbow.	-
`ANALYSIS`	(Optional) Select the type of analysis (check below for details): [0] Converter, [1] Primary analysis, [3] Network Analysis	1
`INTEREST_GENES_PATH`	(Optional) Path to the list of candidate genes for analysis (see above). Use an explicit path for a custom list: INTEREST_GENES_PATH="/Storage/user/file_name.txt"	"INPUT/interest_genes.txt"
`EXPRESSION_THR`	(Optional) Set expression threshold for Interest Genes filtering.	1.0
`MIN_X`	(Optional) Minimum X coordinate for spatial filtering.	-
`MAX_X`	(Optional) Maximum X coordinate for spatial filtering.	-
`MIN_Y`	(Optional) Minimum Y coordinate for spatial filtering.	-
`MAX_Y`	(Optional) Maximum Y coordinate for spatial filtering.	-
`HVG_MIN_MEAN`	(Optional) Min mean filtering for selection of HVGs.	0.0125
`HVG_MAX_MEAN`	(Optional) Max mean filtering for selection of HVGs.	3.0
`HVG_DISP`	(Optional) Dispersion filtering for selection of HVGs.	0.5
`HVG_TOP`	(Optional) Number of top genes selected for HVG filtering.	2000
`INPUT_PATH`	(Required for analysis [0] Converter) Input file, or folder with files, to be converted to .gef format.	-
`BIN_SIZE`	(Optional, for analysis [0] Converter) Bin size used for the conversion.	-
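
The defaults in the table can be expressed with shell parameter expansion. This is only a sketch of how a wrapper might apply them, assuming that unset or empty variables fall back to the listed values. It is not taken from bin/2_COMPLETE_ANALYSIS.sh, and print_effective_params is a hypothetical name. Variables without a listed default are left out.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of POLY_PIPELINE.
# Prints the value each parameter would take, using the defaults from the table
# whenever the variable is unset or empty.
print_effective_params() {
    echo "ANALYSIS=${ANALYSIS:-1}"
    echo "MIN_COUNTS=${MIN_COUNTS:-20}"
    echo "MIN_GENES=${MIN_GENES:-3}"
    echo "PCT_COUNTS_MT=${PCT_COUNTS_MT:-2}"
    echo "INTEREST_GENES_PATH=${INTEREST_GENES_PATH:-INPUT/interest_genes.txt}"
    echo "EXPRESSION_THR=${EXPRESSION_THR:-1.0}"
    echo "HVG_MIN_MEAN=${HVG_MIN_MEAN:-0.0125}"
    echo "HVG_MAX_MEAN=${HVG_MAX_MEAN:-3.0}"
    echo "HVG_DISP=${HVG_DISP:-0.5}"
    echo "HVG_TOP=${HVG_TOP:-2000}"
}
```

Printing the effective values before calling qsub makes it easier to record which thresholds a run actually used.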

Converting files (.GEM or .H5AD) to .GEF prior to the primary analysis requires only the input file or input folder (INPUT_PATH); the bin size (BIN_SIZE) is optional:

SGE

qsub -v ST_PYTHON="/home/user/.conda/envs/st/bin/python",INPUT_PATH="path/to/file.h5ad",BIN_SIZE=100,ANALYSIS=0 bin/2_COMPLETE_ANALYSIS.sh
qsub -v ST_PYTHON="/home/user/.conda/envs/st/bin/python",INPUT_PATH="path/to/files/",BIN_SIZE=100,ANALYSIS=0 bin/2_COMPLETE_ANALYSIS.sh

PBS

qsub -v ST_PYTHON="/project/directory/POLY_PIPELINE/stereopy_1.5.1.sif",ANALYSIS=0,INPUT_PATH="path/to/file.h5ad",BIN_SIZE=50 bin/2_COMPLETE_ANALYSIS.sh
qsub -v ST_PYTHON="/project/directory/POLY_PIPELINE/stereopy_1.5.1.sif",ANALYSIS=0,INPUT_PATH="path/to/files/",BIN_SIZE=50 bin/2_COMPLETE_ANALYSIS.sh

The variables are optional: the script can run with the defaults on the entire tissue area.
The correct Python or container image path (ST_PYTHON) for the server must be selected.
If coordinate filtering is required (MIN_X, MAX_X, MIN_Y, MAX_Y), all coordinate parameters must be provided together.
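
The all-or-none rule for the coordinate parameters can be checked before submission. A minimal sketch, assuming the four values are held in shell variables of the same names; check_coordinate_vars is a hypothetical helper and not part of the pipeline.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of POLY_PIPELINE.
# Returns 0 when none or all four of MIN_X, MAX_X, MIN_Y, MAX_Y are set.
check_coordinate_vars() {
    local set_count=0
    local name
    local value
    for name in MIN_X MAX_X MIN_Y MAX_Y; do
        # Read the variable whose name is stored in $name (names are fixed above).
        eval "value=\${$name:-}"
        if [ -n "$value" ]; then
            set_count=$((set_count + 1))
        fi
    done
    if [ "$set_count" -ne 0 ] && [ "$set_count" -ne 4 ]; then
        echo "Provide all of MIN_X, MAX_X, MIN_Y, MAX_Y or none of them" >&2
        return 1
    fi
}
```

The check only counts which variables are set; it does not verify that each minimum is below its maximum.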

Analysis selection

By default the script runs the primary analysis [1], which is suitable for any spatial analysis starting from the original files and generates the primary results.
The script also includes a secondary analysis for specific uses: [3] for network analysis.
The option must be given explicitly when submitting the job; otherwise [1] is used.

Local Execution Example

IMPORTANT: This analysis requires substantial computational resources, and running it locally is not recommended. To run the main analysis locally using the bash wrapper and your specific Conda path:

ST_PYTHON='/home/user/.conda/envs/st/bin/python' MIN_COUNTS=50 MIN_GENES=5 PCT_COUNTS_MT=30 N_PCS=30 ANALYSIS=1 bash bin/2_COMPLETE_ANALYSIS.sh

Network Visualization (Secondary analysis)

Example of job command (SGE):

qsub -v ST_PYTHON="/home/user/.conda/envs/st/bin/python",ANALYSIS=3 bin/2_COMPLETE_ANALYSIS.sh

Example of job command (PBS):

qsub -v ST_PYTHON="/project/directory/POLY_PIPELINE/stereopy_1.5.1.sif",R_CONTAINER="/project/directory/POLY_PIPELINE/r_hdwgcna.sif",ANALYSIS=3 bin/2_COMPLETE_ANALYSIS.sh

After the secondary network analysis finishes, the edge and node files are generated in the EXPORTS folder. They can be used for downstream visualization and filtering, mainly with NetworkX and Cytoscape.
Importing into Cytoscape:
File -> Import -> Network from File....
Select file EXPORTS/[project_name]_FULL_EDGES.txt.
Under the configuration, select fromNode (Source Node), toNode (Target Node) and weight (Edge Attribute).
Click OK.
File -> Import -> Table from File....
Select file EXPORTS/[project_name]_FULL_NODES.txt.
Select (auto) column gene_name as key.
Click OK.
To color clusters properly: select Style from the sidebar, select Fill Color, select module as the column, select Discrete Mapping, and right-click to open Mapping Value Generators, which assigns a color to each module automatically.
Check documentation for further details.
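
For filtering outside Cytoscape, the edge file can be thinned by weight on the command line. This is a minimal sketch, assuming the file is tab-separated with a header row that contains the weight column named above. The delimiter and the helper name filter_edges_by_weight are assumptions, so inspect the exported file first.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of POLY_PIPELINE.
# Keeps the header row and every edge whose weight is at least the threshold.
# Assumes a tab-separated file with a header containing a "weight" column.
filter_edges_by_weight() {
    local edges_file="$1"
    local min_weight="$2"
    awk -F'\t' -v min="$min_weight" '
        NR == 1 {
            # Locate the weight column by its header name.
            for (i = 1; i <= NF; i++) if ($i == "weight") w = i
            if (!w) { print "no weight column found" > "/dev/stderr"; exit 1 }
            print
            next
        }
        ($w + 0) >= (min + 0)
    ' "$edges_file"
}
```

For example, filter_edges_by_weight EXPORTS/[project_name]_FULL_EDGES.txt 0.5 > strong_edges.txt writes the retained edges to a new file; the threshold 0.5 is arbitrary and should be chosen from the weight distribution of the network at hand.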