Python Functions Reference
Complete reference for all Python functions in AnnoRefine.
Hint Generation
bam2hints()
Convert BAM alignments to Augustus/GeneMark hints format.
Signature:
annorefine.bam2hints(
bam_file: str,
output_file: str,
library_type: str,
*,
priority: int = 4,
max_gap_len: int = 14,
min_intron_len: int = 32,
max_intron_len: int = 350000,
min_end_block_len: int = 8,
max_query_gap_len: int = 5,
exonpart_cutoff: int = 10,
source: str = 'E',
introns_only: bool = False,
no_multiplicity: bool = False,
remove_redundant: bool = False,
max_coverage: int = 0,
splice_sites_on: bool = False,
truncated_splice_sites: bool = False,
score: float = 0.0,
max_gene_len: int = 400000,
threads: int = None,
contig: str = None,
region: tuple = None,
contig_map: dict = None
) -> dict
Parameters:
bam_file (str): Path to input BAM file (must be sorted and indexed if using contig/region filtering)
output_file (str): Path for output GFF hints file
library_type (str): Library strandedness specification. Options: "FR", "RF", "UU"
priority (int): Priority of hint group (default: 4)
source (str): Source identifier (default: "E")
introns_only (bool): Only retrieve intron hints (default: False)
threads (int): Number of threads to use (default: None, uses all available)
contig (str): Filter to only process alignments on this contig (default: None)
region (tuple): Filter to region as (contig, start, end) tuple (default: None)
contig_map (dict): Dictionary to rename contigs in output. Keys are input contig names, values are output contig names (default: None, no mapping)
Returns: Dictionary with conversion statistics
Example:
import annorefine
# Basic usage
result = annorefine.bam2hints(
bam_file="alignments.bam",
output_file="hints.gff",
library_type="RF"
)
# Advanced usage
result = annorefine.bam2hints(
bam_file="alignments.bam",
output_file="hints.gff",
library_type="RF",
priority=4,
source="E",
threads=8,
min_intron_len=32,
max_intron_len=350000,
contig="chr1"
)
# With contig name mapping
contig_map = {
'NC_000001.11': 'chr1',
'NC_000002.12': 'chr2',
'NC_000003.12': 'chr3'
}
result = annorefine.bam2hints(
bam_file="alignments.bam",
output_file="hints.gff",
library_type="RF",
contig_map=contig_map # Rename contigs in output
)
print(f"Generated {result['total_hints_generated']} hints")
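The region parameter takes a (contig, start, end) tuple. When regions arrive as "contig:start-end" strings (for example from a command line), a small helper can convert them. The sketch below is an illustration only: parse_region is a hypothetical name, not part of AnnoRefine, and this reference does not state whether coordinates are 0-based or 1-based, so the helper passes the numbers through unchanged.

```python
from typing import Tuple


def parse_region(text: str) -> Tuple[str, int, int]:
    """Split a 'contig:start-end' string into a (contig, start, end) tuple.

    Hypothetical helper, not part of AnnoRefine.
    """
    # rpartition splits on the last colon, so contig names containing ':' survive.
    contig, sep, span = text.rpartition(":")
    if not sep or not contig:
        raise ValueError(f"expected 'contig:start-end', got {text!r}")
    start_text, dash, end_text = span.partition("-")
    if not dash:
        raise ValueError(f"expected 'start-end' after the colon, got {span!r}")
    # Accept thousands separators such as 1,000,000.
    start = int(start_text.replace(",", ""))
    end = int(end_text.replace(",", ""))
    if start > end:
        raise ValueError(f"start {start} is greater than end {end}")
    return contig, start, end


region = parse_region("chr1:1,000,000-2,000,000")
print(region)
```

The resulting tuple could then be passed as region=region to bam2hints(), which per the parameter description requires a sorted and indexed BAM file.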
Hint Processing
join_hints()
Join and merge hints from multiple GFF files.
Signature:
annorefine.join_hints(
input_files: List[str],
output_file: str,
introns_only: bool = False
) -> dict
Parameters:
input_files (List[str]): List of input GFF hint files to join
output_file (str): Path for output joined hints file
introns_only (bool): If True, only output intron hints (useful for GeneMark) (default: False)
Returns: Dictionary with joining statistics
Example:
import annorefine
# Join hints from multiple sources
result = annorefine.join_hints(
input_files=[
"bam_hints.gff",
"protein_hints.gff",
"transcript_hints.gff"
],
output_file="joined_hints.gff"
)
# Join and filter to introns only
result = annorefine.join_hints(
input_files=["bam_hints.gff", "protein_hints.gff"],
output_file="genemark_hints.gff",
introns_only=True
)
print(f"Merged {result['total_input_hints']} into {result['output_hints']} hints")
filter_hints()
Filter hints by type, multiplicity, or contig.
Signature:
annorefine.filter_hints(
input_file: str,
output_file: str,
hint_types: List[str] = None,
min_mult: int = None,
contig: str = None
) -> dict
Parameters:
input_file (str): Input GFF file
output_file (str): Output GFF file
hint_types (List[str]): Filter by hint types (e.g., ["intron", "exon"]) (default: None)
min_mult (int): Minimum multiplicity (default: None)
contig (str): Filter to specific contig (default: None)
Returns: Dictionary with filter statistics
Example:
import annorefine
# Filter by hint type
result = annorefine.filter_hints(
input_file="all_hints.gff",
output_file="intron_hints.gff",
hint_types=["intron"]
)
# Filter by multiplicity
result = annorefine.filter_hints(
input_file="all_hints.gff",
output_file="high_confidence_hints.gff",
min_mult=10
)
# Filter by contig
result = annorefine.filter_hints(
input_file="all_hints.gff",
output_file="chr1_hints.gff",
contig="chr1"
)
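To make the min_mult filter more concrete, the sketch below shows one way such a filter could work on hint lines in plain Python. It assumes the Augustus hints convention that multiplicity is stored as mult=N in the ninth GFF column and that a hint without mult= counts as 1. Whether filter_hints() behaves exactly this way is an assumption, and the helper names are hypothetical.

```python
import re
from typing import Iterable, Iterator

# Matches 'mult=<digits>' at the start of the attribute column or after a ';'.
_MULT = re.compile(r"(?:^|;)\s*mult=(\d+)")


def hint_multiplicity(gff_line: str) -> int:
    """Return the mult= value from column 9, or 1 when it is absent (assumed convention)."""
    columns = gff_line.rstrip("\n").split("\t")
    if len(columns) < 9:
        raise ValueError("expected 9 tab-separated GFF columns")
    match = _MULT.search(columns[8])
    return int(match.group(1)) if match else 1


def keep_min_mult(lines: Iterable[str], min_mult: int) -> Iterator[str]:
    """Yield only hint lines whose multiplicity is at least min_mult."""
    for line in lines:
        # Skip comments and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        if hint_multiplicity(line) >= min_mult:
            yield line


sample = [
    "chr1\tb2h\tintron\t1000\t2000\t0\t.\t.\tmult=12;pri=4;src=E",
    "chr1\tb2h\tintron\t3000\t4000\t0\t.\t.\tpri=4;src=E",
]
for kept in keep_min_mult(sample, min_mult=10):
    print(kept)
```

Only the first sample line passes the filter, because the second has no mult= attribute and is counted as multiplicity 1 under the stated assumption.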
Annotation Refinement
refine()
Convenience function for annotation refinement with keyword arguments.
Signature:
annorefine.refine(
fasta_file: str,
gff3_file: str,
bam_file: str,
output_file: str,
*,
min_coverage: int = 5,
min_splice_support: int = 3,
max_utr_extension: int = 1000,
enable_novel_gene_detection: bool = False,
min_novel_gene_coverage: int = 10,
min_novel_gene_length: int = 300,
min_exon_length: int = 50,
validate_splice_sites: bool = True,
strand_bias_threshold: float = 0.65,
max_reads_for_strand_detection: int = 10000,
library_type: str = "auto",
contig_map: dict = None,
threads: int = None
) -> dict
Parameters:
fasta_file (str): Path to genome FASTA file
gff3_file (str): Path to input GFF3 annotation file
bam_file (str): Path to RNA-seq BAM file
output_file (str): Path for output refined GFF3 file
min_coverage (int): Minimum coverage for UTR extension (default: 5)
min_splice_support (int): Minimum reads supporting splice junction (default: 3)
max_utr_extension (int): Maximum UTR extension length (default: 1000)
enable_novel_gene_detection (bool): Enable novel gene detection (default: False)
min_novel_gene_coverage (int): Minimum coverage for novel genes (default: 10)
min_novel_gene_length (int): Minimum length for novel genes (default: 300)
min_exon_length (int): Minimum exon length (default: 50)
validate_splice_sites (bool): Validate splice sites (default: True)
strand_bias_threshold (float): Threshold for strand detection (default: 0.65)
max_reads_for_strand_detection (int): Max reads to sample for strand detection (default: 10000)
library_type (str): Library strandedness - "auto", "FR", "RF", or "UU" (default: "auto")
contig_map (dict): Map GFF3 contig names to BAM contig names (default: None)
threads (int): Number of threads (default: None, uses all available)
Returns: Dictionary with refinement statistics and results
{
'genes_processed': int, # Total genes processed
'genes_failed': int, # Genes that failed processing
'transcripts_with_structure_changes': int, # Transcripts with exon/intron changes
'transcripts_with_5utr_extension': int, # Transcripts with 5' UTR extended
'transcripts_with_3utr_extension': int, # Transcripts with 3' UTR extended
'transcripts_with_5utr_trimming': int, # Transcripts with 5' UTR trimmed
'transcripts_with_3utr_trimming': int, # Transcripts with 3' UTR trimmed
'novel_genes_detected': int, # Novel genes discovered
'gene_models': list, # List of GeneModel objects
'output_file': str # Path to output file
}
Example:
import annorefine
# Basic refinement
result = annorefine.refine(
fasta_file="genome.fa",
gff3_file="annotations.gff3",
bam_file="alignments.bam",
output_file="refined.gff3"
)
# Print summary statistics
print(f"Genes processed: {result['genes_processed']}")
print(f"Structure changes: {result['transcripts_with_structure_changes']}")
print(f"5' UTR extensions: {result['transcripts_with_5utr_extension']}")
print(f"3' UTR extensions: {result['transcripts_with_3utr_extension']}")
print(f"5' UTR trims: {result['transcripts_with_5utr_trimming']}")
print(f"3' UTR trims: {result['transcripts_with_3utr_trimming']}")
# With novel gene detection and explicit library type
result = annorefine.refine(
fasta_file="genome.fa",
gff3_file="annotations.gff3",
bam_file="alignments.bam",
output_file="refined.gff3",
enable_novel_gene_detection=True,
min_novel_gene_coverage=10,
min_novel_gene_length=300,
library_type="RF", # Specify stranded library
threads=8
)
print(f"Novel genes detected: {result['novel_genes_detected']}")
# Access gene models
for gene in result['gene_models']:
    print(f"Gene: {gene.gene_id}, Transcripts: {len(gene.transcripts)}")
# With contig mapping (GFF3 uses chr1, BAM uses NC_000001.11)
result = annorefine.refine(
fasta_file="genome.fa",
gff3_file="annotations.gff3",
bam_file="alignments.bam",
output_file="refined.gff3",
contig_map={'chr1': 'NC_000001.11', 'chr2': 'NC_000002.12'},
library_type="RF"
)
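Note that the two contig_map parameters point in opposite directions in the examples: bam2hints() maps input (BAM) names to output names, while refine() maps GFF3 names to BAM names. If a project keeps a single accession-to-chromosome table, the reverse mapping can be derived from it. This is a minimal sketch with a hypothetical helper name. It assumes the GFF3 file uses the same names that bam2hints() writes to its output.

```python
from typing import Dict


def invert_contig_map(contig_map: Dict[str, str]) -> Dict[str, str]:
    """Swap keys and values, refusing maps where two keys share a value.

    Hypothetical helper, not part of AnnoRefine.
    """
    inverted: Dict[str, str] = {}
    for source, target in contig_map.items():
        if target in inverted:
            raise ValueError(
                f"cannot invert: {source!r} and {inverted[target]!r} both map to {target!r}"
            )
        inverted[target] = source
    return inverted


# Direction used in the bam2hints() examples: BAM name -> output name.
bam_to_chr = {"NC_000001.11": "chr1", "NC_000002.12": "chr2"}

# Direction used in the refine() examples: GFF3 name -> BAM name.
gff3_to_bam = invert_contig_map(bam_to_chr)
print(gff3_to_bam)
```

The explicit duplicate check matters because a plain dictionary comprehension would silently drop one of two contigs that map to the same name.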
refine_annotations()
Refine annotations using a configuration object.
Signature:
annorefine.refine_annotations(
fasta_file: str,
gff3_file: str,
bam_file: str,
output_file: str,
config: RefinementConfig = None,
threads: int = None
) -> dict
Parameters:
fasta_file (str): Path to genome FASTA file
gff3_file (str): Path to input GFF3 annotation file
bam_file (str): Path to RNA-seq BAM file
output_file (str): Path for output refined GFF3 file
config (RefinementConfig): Configuration object (default: None, uses defaults)
threads (int): Number of threads (default: None, uses all available)
Returns: Dictionary with refinement statistics
Example:
import annorefine
# Using configuration object
config = annorefine.RefinementConfig(
min_coverage=10,
min_splice_support=5,
max_utr_extension=2000,
enable_novel_gene_detection=True
)
result = annorefine.refine_annotations(
fasta_file="genome.fa",
gff3_file="annotations.gff3",
bam_file="alignments.bam",
output_file="refined.gff3",
config=config,
threads=8
)
detect_library_type()
Detect RNA-seq library strandedness from BAM file using gene models.
Signature:
annorefine.detect_library_type(
gff3_file: str,
bam_file: str,
strand_bias_threshold: float = 0.65,
max_reads: int = 10000,
contig_map: dict = None
) -> dict
Parameters:
gff3_file (str): Path to GFF3 file with gene annotations
bam_file (str): Path to RNA-seq BAM file
strand_bias_threshold (float): Threshold for detecting stranded data, range 0.5-1.0 (default: 0.65)
max_reads (int): Maximum reads to sample for detection (default: 10000)
contig_map (dict): Map GFF3 contig names to BAM contig names (default: None)
Returns: Dictionary with library type information
{
'library_type': str, # "FR", "RF", or "UU"
'strand_bias': str, # "ForwardStranded", "ReverseStranded", or "Unstranded"
'strand_bias_ratio': float, # 0.5-1.0 (higher = more stranded)
'total_reads': int, # Total reads analyzed
'mapped_reads': int # Number of mapped reads
}
Example:
import annorefine
# Detect library type
result = annorefine.detect_library_type(
gff3_file="genes.gff3",
bam_file="alignments.bam"
)
print(f"Library type: {result['library_type']}")
print(f"Strand bias: {result['strand_bias']}")
print(f"Strand bias ratio: {result['strand_bias_ratio']:.2f}")
# Use detected library type in refinement
annorefine.refine(
fasta_file="genome.fa",
gff3_file="genes.gff3",
bam_file="alignments.bam",
output_file="refined.gff3",
library_type=result['library_type']
)
# With contig mapping (GFF3 uses chr1, BAM uses NC_000001.11)
result = annorefine.detect_library_type(
gff3_file="genes.gff3",
bam_file="alignments.bam",
contig_map={'chr1': 'NC_000001.11', 'chr2': 'NC_000002.12'}
)
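The relationship between strand_bias_threshold and the reported library type can be illustrated with a small stand-alone sketch. It is not AnnoRefine's implementation. It assumes the ratio is the larger of the sense and antisense read counts divided by their sum (which yields the documented 0.5-1.0 range), and that sense-dominated data corresponds to "FR" and antisense-dominated data to "RF". Both are assumptions made for illustration, as is the function name.

```python
from typing import Dict, Union


def classify_strandedness(
    sense_reads: int,
    antisense_reads: int,
    strand_bias_threshold: float = 0.65,
) -> Dict[str, Union[str, float]]:
    """Classify a library from read counts relative to annotated gene strands.

    Illustrative only; the FR/RF assignment is an assumption.
    """
    total = sense_reads + antisense_reads
    if total == 0:
        raise ValueError("no reads overlap annotated genes")
    # Fraction of reads on the dominant strand, always between 0.5 and 1.0.
    ratio = max(sense_reads, antisense_reads) / total
    if ratio < strand_bias_threshold:
        library_type = "UU"
    elif sense_reads >= antisense_reads:
        library_type = "FR"  # assumed: reads agree with the gene strand
    else:
        library_type = "RF"  # assumed: reads oppose the gene strand
    return {"library_type": library_type, "strand_bias_ratio": ratio}


print(classify_strandedness(900, 100))
print(classify_strandedness(520, 480))
```

Under these assumptions, raising the threshold makes the classifier more conservative: borderline data falls back to "UU" instead of being called stranded.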
Configuration Classes
RefinementConfig
Configuration object for annotation refinement.
Constructor:
annorefine.RefinementConfig(
min_coverage: int = 5,
min_splice_support: int = 3,
max_utr_extension: int = 1000,
enable_novel_gene_detection: bool = False,
min_novel_gene_coverage: int = 10,
min_novel_gene_length: int = 300,
min_exon_length: int = 50,
validate_splice_sites: bool = True,
strand_bias_threshold: float = 0.65,
max_reads_for_strand_detection: int = 10000
)
Example:
import annorefine
config = annorefine.RefinementConfig(
min_coverage=10,
min_splice_support=5,
max_utr_extension=2000,
enable_novel_gene_detection=True,
min_novel_gene_coverage=15,
min_novel_gene_length=500,
min_exon_length=50,
validate_splice_sites=True,
strand_bias_threshold=0.7
)
Bam2HintsConfig
Configuration object for BAM to hints conversion.
Constructor:
annorefine.Bam2HintsConfig(
priority: int = 4,
max_gap_len: int = 14,
min_intron_len: int = 32,
max_intron_len: int = 350000,
min_end_block_len: int = 8,
max_query_gap_len: int = 5,
exonpart_cutoff: int = 10,
source: str = "E",
introns_only: bool = False,
no_multiplicity: bool = False,
remove_redundant: bool = False,
max_coverage: int = 0,
splice_sites_on: bool = False,
truncated_splice_sites: bool = False,
score: float = 0.0,
max_gene_len: int = 400000,
contig_map: dict = None
)
Example:
import annorefine
config = annorefine.Bam2HintsConfig(
priority=4,
max_gap_len=14,
min_intron_len=32,
max_intron_len=350000,
min_end_block_len=8,
source="E",
introns_only=False
)
# With contig mapping
contig_map = {'NC_000001.11': 'chr1', 'NC_000002.12': 'chr2'}
config = annorefine.Bam2HintsConfig(
priority=4,
source="E",
contig_map=contig_map
)
Utility Functions
version()
Get the AnnoRefine version string.
Returns: str - Version string (e.g., "2025.9.18")
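Because the example version string looks like a dot-separated date, comparing versions as plain strings would order "2025.10.1" before "2025.9.18". The sketch below converts the string to a tuple of integers first. It assumes every component is numeric, as in the example above. A suffix such as "rc1" would raise ValueError. parse_version is a hypothetical helper, not part of AnnoRefine.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '2025.9.18' into (2025, 9, 18) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


installed = parse_version("2025.9.18")  # in practice: parse_version(annorefine.version())
required = parse_version("2025.9.1")    # hypothetical minimum version

if installed < required:
    raise RuntimeError("AnnoRefine is older than the required version")
print(installed)
```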
current_num_threads()
Get the current number of threads configured for parallel processing.
import annorefine
num_threads = annorefine.current_num_threads()
print(f"Using {num_threads} threads")
Returns: int - Number of threads
Return Values
All main functions return dictionaries with statistics:
bam2hints() Returns
{
'total_hints_generated': int, # Total hints generated
'intron_hints': int, # Number of intron hints
'exon_hints': int, # Number of exon hints
'exonpart_hints': int, # Number of exonpart hints
'dss_hints': int, # Number of donor splice site hints
'ass_hints': int # Number of acceptor splice site hints
}
join_hints() Returns
{
'input_files': int, # Number of input files
'total_input_hints': int, # Total hints from all inputs
'output_hints': int, # Hints after merging
'output_file': str # Output file path
}
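The join_hints() statistics can be reduced to a single figure showing how much merging shrank the hint set. A minimal sketch, using only the total_input_hints and output_hints keys listed above. The function name and the sample numbers are made up for illustration.

```python
from typing import Dict, Union


def merge_reduction(stats: Dict[str, Union[int, str]]) -> float:
    """Return the fraction of input hints removed by merging (0.0 to 1.0)."""
    total = int(stats["total_input_hints"])
    if total == 0:
        # Nothing was read, so nothing could be merged.
        return 0.0
    return 1 - int(stats["output_hints"]) / total


# Sample dictionary shaped like the join_hints() return value (values invented).
example_stats = {
    "input_files": 2,
    "total_input_hints": 200,
    "output_hints": 150,
    "output_file": "joined_hints.gff",
}
print(f"Merging removed {merge_reduction(example_stats):.0%} of input hints")
```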