# IsoAnnot

IsoAnnot is a new tool for generating functional and structural annotation at the isoform level. It collects and integrates information from different databases to categorize and describe each isoform, including functional and structural information for both transcript and protein.

⚠️⚠️ IsoAnnot is currently in beta testing. Please see the [latest release](https://github.com/ConesaLab/IsoAnnot/releases/tag/v0.9.0b1) and download IsoAnnot from the [release](https://github.com/ConesaLab/IsoAnnot/tree/v0.9.0b1) branch.

## Requirements

### Computational Requirements

The computational requirements to run IsoAnnot vary with the organism of interest and the size of the transcriptome you want to annotate.

**Reference benchmark** (Human transcriptome):

- **Transcriptome size**: 252,205 isoforms
- **CPU cores**: 8 cores
- **Memory**: 12 GB RAM
- **Disk space**: 14 GB
- **Execution time**: ~20 hours

The number of cores can be modified by editing the `--cores` parameter in the last line of `IsoAnnot/isoannot.sh` (default is 8 cores).

### Software Prerequisites

IsoAnnot requires the following software to be installed before use:

1. **Operating System**: GNU/Linux (tested and supported)
2. **Python**: Python 3 (managed automatically by conda)
3. **Conda**: For dependency management
4. **Snakemake**: Workflow management system (version 7.x recommended)

## Installation

IsoAnnot is distributed as a compressed file containing the proper directory structure.

**Installation steps:**

1. Extract the package to your desired installation folder:
   ```bash
   tar -xzf IsoAnnot.tar.gz
   cd IsoAnnot
   ```

2. Ensure all prerequisites are installed (see [Installation Prerequisites](#installation-prerequisites))

3. Install external software (see [External Software](#external-software))

4. Activate the snakemake conda environment:
   ```bash
   conda activate snakemake
   ```

You're now ready to run IsoAnnot!

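The prerequisite check in the installation steps can be automated with a short script. The following is a minimal sketch and not part of IsoAnnot: it assumes the prerequisites are exposed as executables named `conda` and `snakemake` on your `PATH`, and the function name `find_missing_tools` is hypothetical.

```python
import shutil
from typing import Callable, Iterable, List, Optional

# Assumed executable names; adjust them to match your installation.
REQUIRED_TOOLS = ("conda", "snakemake")


def find_missing_tools(
    tools: Iterable[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the tools that cannot be located on PATH, in input order."""
    # shutil.which returns None when no matching executable is found.
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing from PATH: " + ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

Run it from the same shell in which you plan to launch IsoAnnot, after activating the conda environment, because the contents of `PATH` depend on the active environment.
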
## Configuration Files

Configuration files control how IsoAnnot processes data for each species and database combination. Snakemake configuration files in IsoAnnot use the YAML file format and are organized on a per-species basis.

### Where to Find Config Files

Configuration files are organized in a hierarchical directory structure:

```
IsoAnnot/config/
├── ensembl/
│   ├── hsapiens/
│   │   ├── config.yaml
│   │   └── Snakefile.smk
│   ├── mmusculus/
│   │   ├── config.yaml
│   │   └── Snakefile.smk
│   └── ...
├── refseq/
│   ├── hsapiens/
│   │   ├── config.yaml
│   │   └── Snakefile.smk
│   └── ...
├── mytranscripts/
│   ├── hsapiens/
│   │   ├── config.yaml
│   │   └── Snakefile.smk
│   └── ...
└── generic/
    ├── config.yaml               # Generic settings
    ├── Snakefile.smk             # Main workflow
    ├── Snakefile_ensembl.smk
    ├── Snakefile_refseq.smk
    └── Snakefile_mytranscripts.smk
```

**Path structure**: `config///config.yaml`

**Examples**:

- Human Ensembl: `config/ensembl/hsapiens/config.yaml`
- Mouse RefSeq: `config/refseq/mmusculus/config.yaml`
- Custom human transcripts: `config/mytranscripts/hsapiens/config.yaml`

### How to Modify Config Files

To modify an existing configuration:

1. Navigate to the config file:
   ```bash
   cd IsoAnnot/config///
   nano config.yaml
   ```

2. Edit parameters as needed (see [Configuration Parameters Explained](#configuration-parameters-explained))

3. Save the file

4.
   Run IsoAnnot with the updated configuration:
   ```bash
   cd IsoAnnot
   ./isoannot.sh --database --species
   ```

**Common modifications**:

- Update database URLs to newer releases
- Change file paths for custom data
- Adjust species-specific parameters
- Modify the `transcript_versioned` flag

### Generic Configuration

The generic configuration file (`config/generic/config.yaml`) contains global settings used across all species:

```yaml
interproscan_path: "software/interproscan/interproscan.sh"
pfam_clan_url: ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.clans.tsv.gz
dir_sqanti: "scripts/sqanti3/"
```

**Key parameters**:

- `interproscan_path`: Path to the InterProScan executable
- `pfam_clan_url`: URL for the Pfam clan database
- `dir_sqanti`: Directory containing the SQANTI3 scripts

## Output

### Output Structure

IsoAnnot writes its output to a structured directory hierarchy inside the directory supplied by the user. If none is given, the current working directory is used. The structure of the `/data/` folder is as follows:

```
/data/
└── /                                       # e.g., Hsapiens/
    ├── _tappas__annotation_file.gff3       # Main output
    ├── _tappas__annotation_file.gff3_mod   # Modified GFF3
    ├── config/                             # Downloaded config files
    │   ├── ensembl/
    │   ├── refseq/
    │   └── global/
    ├── output/
    │   └── /                               # Database-specific outputs
    │       ├── layers/                     # Annotation layers
    │       │   ├── go.gtf
    │       │   ├── interpro.gtf
    │       │   ├── reactome.gtf
    │       │   └── ...
    │       ├── transcripts/                # Transcript files
    │       ├── proteins/                   # Protein sequences
    │       └── ...
    └── tmp/                                # Temporary processing files
```

**Directory naming**:

- ``: Capitalized species prefix from config (e.g., `Hsapiens`, `Mmusculus`, `Stuberosum`)
- ``: Lowercase common name from config (e.g., `human`, `mouse`, `potato`)
- ``: Database used (e.g., `ensembl`, `refseq`, `mytranscripts`)

### Main Output Files

#### Primary Annotation File

**File**: `_tappas__annotation_file.gff3`

This is the main output file containing comprehensive isoform-level annotations.

**Example**: `human_tappas_ensembl_annotation_file.gff3`

**Location**: `IsoAnnot/data//`

**Content**: GFF3-formatted annotation with:

- Gene and transcript structures
- Protein-coding predictions
- Functional annotations from multiple databases
- Structural features
- Post-translational modifications

#### Modified Annotation File

**File**: `_tappas__annotation_file.gff3_mod`

A modified version of the main GFF3 file optimized for downstream analysis tools.

### Understanding the GFF3 Annotation File

The output GFF3 file integrates information from multiple sources:

**Structural information**:

- Gene and transcript coordinates
- Exon/intron structure
- CDS (coding sequence) regions
- UTR regions (5' and 3')

**Functional annotations** (in attributes column):

- **Gene Ontology (GO)**: Biological process, molecular function, cellular component
- **InterPro**: Protein domains, families, and functional sites
- **Pfam**: Protein family classifications
- **Reactome**: Pathway associations
- **UniProt**: Protein function descriptions

**Post-translational modifications**:

- Phosphorylation sites
- Other PTMs from PhosphoSitePlus

**Example GFF3 attributes**:

```
gene_id=ENSG00000000003;transcript_id=ENST00000000003;GO=GO:0005515,GO:0003824;
InterPro=IPR001478,IPR015421;Reactome=R-HSA-112316;UniProt=P12345
```

**Using the output**:

- Import into genome browsers (IGV, UCSC Genome Browser)
- Use with tappAS for isoform-level functional analysis
- Parse programmatically for custom analyses
- Filter by specific annotation types

## Troubleshooting

**Problem**: "The snakefile or configfile requested do not exist"
- **Solution**: Ensure config files exist for your species at `config///`

**Problem**: InterProScan not found
- **Solution**: Run `./InterproScan_install.sh` or verify `interproscan_path` in `config/generic/config.yaml`

**Problem**: Out of memory errors
- **Solution**: Increase available RAM or reduce the number of cores used

**Problem**: Download errors for database files
-
  **Solution**: Check your internet connection and verify that the URLs in the config file are current

**Problem**: Snakemake directory locked
- **Solution**: Use the `--unlock` option (see [Unlocking the Working Directory](#unlocking-the-working-directory))

## Support

For issues, questions, or contributions:

- **GitHub Issues**: https://github.com/ConesaLab/IsoAnnot/issues
- **Documentation**: This README

---

## License

[License information to be added]