IsoAnnot
0.9.0b1 @ 0572562

Workflow Type: Snakemake
Work-in-progress

IsoAnnot is a tool for generating functional and structural annotation at the isoform level. It collects and integrates information from different databases to categorize and describe each isoform, covering both the transcript and the protein.

⚠️⚠️ IsoAnnot is currently in beta testing. Please check the latest release and download IsoAnnot from the release branch.

Requirements

Computational Requirements

The computational requirements to run IsoAnnot may vary depending on the organism of interest and the size of the transcriptome you want to annotate.

Reference benchmark (Human transcriptome):

  • Transcriptome size: 252,205 isoforms
  • CPU cores: 8 cores
  • Memory: 12 GB RAM
  • Disk space: 14 GB
  • Execution time: ~20 hours

To change the number of cores, edit the --cores parameter on the last line of IsoAnnot/isoannot.sh (default: 8 cores).
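
Before starting a long run, it can help to compare the local machine against the benchmark figures listed above. The following is a minimal sketch, not part of IsoAnnot: the names `REFERENCE`, `shortfalls` and `detect_resources` are hypothetical, and the benchmark values are only a rough guide because requirements vary with the organism and transcriptome size.

```python
import os
import shutil

# Figures from the human reference benchmark above; a rough guide, not a hard limit.
REFERENCE = {"cores": 8, "memory_gb": 12, "disk_gb": 14}


def shortfalls(available, reference=REFERENCE):
    """Return one message for every resource that is below the reference value."""
    return [
        f"{name}: {available[name]:.1f} available, benchmark used {needed}"
        for name, needed in reference.items()
        if available[name] < needed
    ]


def detect_resources(path="."):
    """Collect core count, physical RAM and free disk space (GNU/Linux only)."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    phys_pages = os.sysconf("SC_PHYS_PAGES")
    return {
        "cores": os.cpu_count() or 1,
        "memory_gb": page_size * phys_pages / 1024**3,
        "disk_gb": shutil.disk_usage(path).free / 1024**3,
    }
```

Calling `shortfalls(detect_resources("IsoAnnot"))` returns an empty list when the machine meets or exceeds all three benchmark figures.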

Software Prerequisites

IsoAnnot requires the following software to be installed before use:

  1. Operating System: GNU/Linux (tested and supported)
  2. Python: Python 3 (managed automatically by conda)
  3. Conda: For dependency management
  4. Snakemake: Workflow management system (version 7.x recommended)

Installation

IsoAnnot is distributed as a compressed file containing the proper directory structure.

Installation steps:

  1. Extract the package to your desired installation folder:

    tar -xzf IsoAnnot.tar.gz
    cd IsoAnnot
    
  2. Ensure all prerequisites are installed (see Software Prerequisites)

  3. Install external software (see External Software)

  4. Activate the snakemake conda environment:

    conda activate snakemake
    

You're now ready to run IsoAnnot!

Configuration Files

Configuration files control how IsoAnnot processes data for each species and database combination. Snakemake configuration files in IsoAnnot use the YAML file format and are organized on a per-species basis.

Where to Find Config Files

Configuration files are organized in a hierarchical directory structure:

IsoAnnot/config/
├── ensembl/
│   ├── hsapiens/
│   │   ├── config.yaml
│   │   └── Snakefile.smk
│   ├── mmusculus/
│   │   ├── config.yaml
│   │   └── Snakefile.smk
│   └── ...
├── refseq/
│   ├── hsapiens/
│   │   ├── config.yaml
│   │   └── Snakefile.smk
│   └── ...
├── mytranscripts/
│   ├── hsapiens/
│   │   ├── config.yaml
│   │   └── Snakefile.smk
│   └── ...
└── generic/
    ├── config.yaml          # Generic settings
    ├── Snakefile.smk        # Main workflow
    ├── Snakefile_ensembl.smk
    ├── Snakefile_refseq.smk
    └── Snakefile_mytranscripts.smk

Path structure: config///config.yaml

Examples:

  • Human Ensembl: config/ensembl/hsapiens/config.yaml
  • Mouse RefSeq: config/refseq/mmusculus/config.yaml
  • Custom human transcripts: config/mytranscripts/hsapiens/config.yaml
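
The examples above all follow the same layout, which can be captured in a small helper. This is an illustrative sketch only: `config_path` and `snakefile_path` are hypothetical names, and the `database` and `species` arguments simply mirror the two directory levels shown in the tree.

```python
from pathlib import PurePosixPath


def config_path(database, species, root="config"):
    """Build the per-species config path, e.g. config/ensembl/hsapiens/config.yaml."""
    return PurePosixPath(root) / database / species / "config.yaml"


def snakefile_path(database, species, root="config"):
    """Build the path of the Snakefile that sits next to the config file."""
    return PurePosixPath(root) / database / species / "Snakefile.smk"
```

For example, `config_path("refseq", "mmusculus")` yields the Mouse RefSeq path listed above.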

How to Modify Config Files

To modify an existing configuration:

  1. Navigate to the config file:

    cd IsoAnnot/config///
    nano config.yaml
    
  2. Edit parameters as needed (see Configuration Parameters Explained)

  3. Save the file

  4. Run IsoAnnot with the updated configuration:

    cd IsoAnnot
    ./isoannot.sh --database  --species 
    

Common modifications:

  • Update database URLs to newer releases
  • Change file paths for custom data
  • Adjust species-specific parameters
  • Modify the transcript_versioned flag
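
When several species or databases need to be processed, the run command from step 4 can be wrapped in a short script. The sketch below assumes only the two options shown in that step (`--database` and `--species`); `build_command` and `run_isoannot` are hypothetical helper names, and `workdir` is assumed to be the extracted IsoAnnot folder.

```python
import subprocess


def build_command(database, species, script="./isoannot.sh"):
    """Assemble the argument list for one IsoAnnot run."""
    return [script, "--database", database, "--species", species]


def run_isoannot(database, species, workdir="IsoAnnot"):
    """Run the wrapper script from inside the IsoAnnot folder.

    check=True makes subprocess raise CalledProcessError on a non-zero exit.
    """
    return subprocess.run(build_command(database, species), cwd=workdir, check=True)
```

Passing the arguments as a list avoids shell quoting problems, and `check=True` stops a batch script at the first failed run instead of continuing silently.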

Generic Configuration

The generic configuration file (config/generic/config.yaml) contains global settings used across all species:

interproscan_path: "software/interproscan/interproscan.sh"
pfam_clan_url: ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.clans.tsv.gz
dir_sqanti: "scripts/sqanti3/"

Key parameters:

  • interproscan_path: Path to InterProScan executable
  • pfam_clan_url: URL for Pfam clan database
  • dir_sqanti: Directory containing SQANTI3 scripts
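
A quick sanity check of the generic settings can catch a missing key before a run starts. The sketch below is deliberately simplified: it parses only flat `key: value` lines like the three shown above and does not handle nested YAML or inline comments. In practice a YAML library such as PyYAML (`yaml.safe_load`) would be the more robust choice. The names `read_flat_config`, `missing_keys` and `REQUIRED_KEYS` are hypothetical.

```python
REQUIRED_KEYS = ("interproscan_path", "pfam_clan_url", "dir_sqanti")

EXAMPLE = '''
interproscan_path: "software/interproscan/interproscan.sh"
pfam_clan_url: ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.clans.tsv.gz
dir_sqanti: "scripts/sqanti3/"
'''


def read_flat_config(text):
    """Parse simple 'key: value' lines into a dict of strings."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first colon only, so URLs such as ftp://... stay intact.
        key, sep, value = line.partition(":")
        if not sep:
            continue
        settings[key.strip()] = value.strip().strip('"').strip("'")
    return settings


def missing_keys(settings, required=REQUIRED_KEYS):
    """List the generic settings that are absent or empty."""
    return [key for key in required if not settings.get(key)]
```

With the example text, `missing_keys(read_flat_config(EXAMPLE))` returns an empty list.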

Output

Output Structure

IsoAnnot writes its output to a structured directory hierarchy inside the directory supplied by the user. If none is given, it uses the current working directory. The /data/ folder is structured as follows:

/data/
└── /                                    # e.g., Hsapiens/
    ├── _tappas__annotation_file.gff3     # Main output
    ├── _tappas__annotation_file.gff3_mod # Modified GFF3
    ├── config/                                  # Downloaded config files
    │   ├── ensembl/
    │   ├── refseq/
    │   └── global/
    ├── output/
    │   └── /                               # Database-specific outputs
    │       ├── layers/                         # Annotation layers
    │       │   ├── go.gtf
    │       │   ├── interpro.gtf
    │       │   ├── reactome.gtf
    │       │   └── ...
    │       ├── transcripts/                    # Transcript files
    │       ├── proteins/                       # Protein sequences
    │       └── ...
    └── tmp/                                    # Temporary processing files

Directory naming:

  • ``: Capitalized species prefix from config (e.g., Hsapiens, Mmusculus, Stuberosum)
  • ``: Lowercase common name from config (e.g., human, mouse, potato)
  • ``: Database used (e.g., ensembl, refseq, mytranscripts)
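
The naming rules above can be combined to predict where the main annotation file should appear. The sketch below is an assumption-laden illustration: the file name pattern is inferred from the documented example `human_tappas_ensembl_annotation_file.gff3`, the default `data_dir` assumes IsoAnnot was run from its own folder, and `annotation_file` is a hypothetical helper name.

```python
from pathlib import PurePosixPath


def annotation_file(species_prefix, species_name, database,
                    data_dir="IsoAnnot/data", modified=False):
    """Return the expected location of the main (or modified) GFF3 output.

    species_prefix: capitalized prefix, e.g. "Hsapiens"
    species_name:   lowercase common name, e.g. "human"
    database:       e.g. "ensembl", "refseq" or "mytranscripts"
    """
    suffix = ".gff3_mod" if modified else ".gff3"
    name = f"{species_name}_tappas_{database}_annotation_file{suffix}"
    return PurePosixPath(data_dir) / species_prefix / name
```

A batch script can use the returned path to check whether a run has already produced its output.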

Main Output Files

Primary Annotation File

File: _tappas__annotation_file.gff3

This is the main output file containing comprehensive isoform-level annotations.

Example: human_tappas_ensembl_annotation_file.gff3

Location: IsoAnnot/data//

Content: GFF3-formatted annotation with:

  • Gene and transcript structures
  • Protein-coding predictions
  • Functional annotations from multiple databases
  • Structural features
  • Post-translational modifications

Modified Annotation File

File: _tappas__annotation_file.gff3_mod

A modified version of the main GFF3 file optimized for downstream analysis tools.

Understanding the GFF3 Annotation File

The output GFF3 file integrates information from multiple sources:

Structural information:

  • Gene and transcript coordinates
  • Exon/intron structure
  • CDS (coding sequence) regions
  • UTR regions (5' and 3')

Functional annotations (in attributes column):

  • Gene Ontology (GO): Biological process, molecular function, cellular component
  • InterPro: Protein domains, families, and functional sites
  • Pfam: Protein family classifications
  • Reactome: Pathway associations
  • UniProt: Protein function descriptions

Post-translational modifications:

  • Phosphorylation sites
  • Other PTMs from PhosphoSitePlus

Example GFF3 attributes:

gene_id=ENSG00000000003;transcript_id=ENST00000000003;GO=GO:0005515,GO:0003824;
InterPro=IPR001478,IPR015421;Reactome=R-HSA-112316;UniProt=P12345
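
An attributes string in this form is straightforward to split into a dictionary. The following is a minimal sketch that assumes the real attributes column follows the `key=value;key=value1,value2` layout of the example above; `parse_attributes` is a hypothetical helper, not an IsoAnnot function.

```python
def parse_attributes(column):
    """Split 'key=value;key=v1,v2' into a dict mapping each key to a list of values."""
    attributes = {}
    for field in column.strip().strip(";").split(";"):
        if not field:
            continue
        key, _, value = field.partition("=")
        attributes[key.strip()] = [v for v in value.strip().split(",") if v]
    return attributes


# The example attributes from the text, joined into a single line.
EXAMPLE = (
    "gene_id=ENSG00000000003;transcript_id=ENST00000000003;GO=GO:0005515,GO:0003824;"
    "InterPro=IPR001478,IPR015421;Reactome=R-HSA-112316;UniProt=P12345"
)

attrs = parse_attributes(EXAMPLE)
print(attrs["GO"])  # the two GO terms as a list
```

Every value is returned as a list, so single-valued keys such as `UniProt` and multi-valued keys such as `GO` can be handled the same way.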

Using the output:

  • Import into genome browsers (IGV, UCSC Genome Browser)
  • Use with tappAS for isoform-level functional analysis
  • Parse programmatically for custom analyses
  • Filter by specific annotation types
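
Filtering by annotation type can be done by streaming over the file line by line. The sketch below relies on the standard nine tab-separated GFF3 columns, with the attributes in the last one, and again assumes the attribute keys look like those in the example above (for instance `Reactome` or `GO`). `records_with` is a hypothetical helper name.

```python
def records_with(lines, key):
    """Yield GFF3 records (as lists of columns) whose attributes contain `key`."""
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and GFF3 directives or comments.
        if not line or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) != 9:
            continue  # not a well-formed feature line
        keys = {
            field.partition("=")[0].strip()
            for field in columns[8].split(";")
            if field
        }
        if key in keys:
            yield columns
```

Because the function accepts any iterable of lines, it can be given an open file handle directly and will not load the whole annotation file into memory.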

Troubleshooting

Problem: "The snakefile or configfile requested do not exist"

  • Solution: Ensure config files exist for your species at config///

Problem: InterProScan not found

  • Solution: Run ./InterproScan_install.sh or verify interproscan_path in config/generic/config.yaml

Problem: Out of memory errors

  • Solution: Increase available RAM or reduce the number of cores used

Problem: Download errors for database files

  • Solution: Check internet connection and verify URLs in config file are current

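Problem: Snakemake directory locked

  • Solution: Make sure no other Snakemake run is using the same working directory, then remove the stale lock with Snakemake's --unlock option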

Support

For issues, questions, or contributions:


License

[License information to be added]

Version History

0.9.0b1 @ 0572562 (earliest) Created 16th Jun 2026 at 09:25 by Fabián Robledo

Merge pull request #9 from ConesaLab/dev

Dev updates


Creators and Submitter
Creators
  • Alessandra Martinez
  • Pablo Atienza
Submitter
Created: 16th Jun 2026 at 09:25

Last updated: 16th Jun 2026 at 09:28

Annotated Properties
Operation annotations
Scientific disciplines
Computer Science, Biochemistry, Genetics and Molecular Biology