BAM to Hints Conversion
AnnoRefine converts RNA-seq BAM alignments into the hints format used by the Augustus and GeneMark gene predictors.
Overview
The bam2hints command extracts evidence from RNA-seq alignments and generates hints in GFF format. These hints guide gene prediction tools to improve accuracy.
Hint Types
AnnoRefine generates several types of hints:
- intron: Splice junctions (gaps in alignments)
- exon: Internal exons (flanked by introns on both sides)
- exonpart: Terminal exons (first/last in transcript)
- dss: Donor splice sites (GT dinucleotides)
- ass: Acceptor splice sites (AG dinucleotides)
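As an illustration of where intron hints come from, the sketch below walks a CIGAR string and reports each `N` operation (a gap in the alignment) as an intron interval. It is a minimal standalone example, not AnnoRefine's implementation; the function name `introns_from_cigar` and the 1-based inclusive coordinates are assumptions made for this example.

```python
import re

# CIGAR operations that advance the position on the reference.
REF_CONSUMING = set("MDN=X")
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def introns_from_cigar(start, cigar):
    """Return 1-based inclusive (start, end) intervals, one per N gap.

    `start` is the 1-based leftmost reference position of the alignment.
    """
    introns = []
    pos = start
    for length, op in CIGAR_RE.findall(cigar):
        length = int(length)
        if op == "N":
            # The skipped reference region is the candidate intron.
            introns.append((pos, pos + length - 1))
        if op in REF_CONSUMING:
            pos += length
    return introns


print(introns_from_cigar(901, "100M1000N50M"))  # [(1001, 2000)]
```

Insertions and soft clips do not move the reference position, so they never shift the reported intron boundaries.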
Command Line Usage
Basic Usage
annorefine bam2hints \
    --input alignments.bam \
    --output hints.gff \
    --stranded RF
Common Options
annorefine bam2hints \
    --input alignments.bam \
    --output hints.gff \
    --stranded RF \
    --threads 4 \
    --priority 4 \
    --source E
Library Strandedness
The --stranded parameter is required and specifies the library type:
- FR: Paired-end forward/reverse (fr-secondstrand)
- RF: Paired-end reverse/forward (fr-firststrand) - most common for Illumina TruSeq
- UU: Paired-end unstranded
Determining Strandedness
If you're unsure of your library type, check your sequencing protocol documentation. Most modern Illumina RNA-seq libraries use RF (dUTP method).
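The library type decides how a read's orientation translates into a transcript strand. The sketch below shows the usual convention: in FR (fr-secondstrand) libraries read 1 is sense to the transcript, in RF (fr-firststrand) libraries read 2 is. The helper `transcript_strand` is hypothetical and only illustrates the rule; it is not part of AnnoRefine.

```python
def transcript_strand(library_type, is_read1, is_reverse):
    """Infer the transcript strand ('+', '-' or '.') from one paired-end read.

    library_type: "FR", "RF" or "UU"
    is_read1:     True for the first read in the pair, False for the second
    is_reverse:   True if the read aligned to the reverse strand
    """
    if library_type == "UU":
        return "."  # unstranded: orientation carries no strand information
    if library_type not in ("FR", "RF"):
        raise ValueError(f"unknown library type: {library_type}")

    read_strand = "-" if is_reverse else "+"
    # FR: read 1 is sense, read 2 is antisense. RF: the other way round.
    read_is_sense = is_read1 if library_type == "FR" else not is_read1
    if read_is_sense:
        return read_strand
    return "+" if read_strand == "-" else "-"


# RF library, read 1 aligned in reverse: the transcript is on the plus strand.
print(transcript_strand("RF", is_read1=True, is_reverse=True))
```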
Introns Only (for GeneMark)
GeneMark typically uses only intron hints:
annorefine bam2hints \
    --input alignments.bam \
    --output genemark_hints.gff \
    --stranded RF \
    --intronsonly
Per-Contig Processing
For parallel processing of large genomes:
# Process a single contig
annorefine bam2hints \
    --input alignments.bam \
    --output chr1_hints.gff \
    --stranded RF \
    --contig chr1
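To run several contigs side by side, one option is to launch one `bam2hints` process per contig from a small driver script. The sketch below uses only the flags shown above; the helper names `build_command` and `run_all`, the output naming scheme, and the worker count are assumptions for this example.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_command(bam, contig, stranded="RF"):
    """Build the bam2hints argument list for a single contig."""
    return [
        "annorefine", "bam2hints",
        "--input", bam,
        "--output", f"{contig}_hints.gff",
        "--stranded", stranded,
        "--contig", contig,
    ]


def run_all(bam, contigs, max_workers=4):
    """Run one bam2hints process per contig, max_workers at a time."""
    commands = [build_command(bam, contig) for contig in contigs]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # check=True raises CalledProcessError if any run fails.
        list(pool.map(lambda cmd: subprocess.run(cmd, check=True), commands))


# Show the commands that would be launched, without running them.
for contig in ["chr1", "chr2"]:
    print(" ".join(build_command("alignments.bam", contig)))
```

Threads are enough here because each worker only waits on an external process. The per-contig output files can be concatenated afterwards.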
Advanced Options
annorefine bam2hints \
    --input alignments.bam \
    --output hints.gff \
    --stranded RF \
    --min-intron-len 32 \
    --max-intron-len 350000 \
    --min-end-block-len 8 \
    --max-gap-len 14 \
    --priority 4 \
    --source E \
    --threads 8
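This page does not spell out how the length options are applied. As a rough illustration only, assuming `--min-intron-len` and `--max-intron-len` act as inclusive bounds on the length of an intron hint (an assumption, not confirmed behavior), the filter could be sketched as follows. `keep_intron` is a hypothetical helper.

```python
def keep_intron(start, end, min_len=32, max_len=350000):
    """Return True if a 1-based inclusive interval has an allowed length.

    Assumes both bounds are inclusive; the real tool may differ.
    """
    length = end - start + 1
    return min_len <= length <= max_len


candidates = [(1001, 2000), (5000, 5009), (1, 400000)]
print([iv for iv in candidates if keep_intron(*iv)])  # [(1001, 2000)]
```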
Python API Usage
Basic Conversion
import annorefine
result = annorefine.bam2hints(
    bam_file="alignments.bam",
    output_file="hints.gff",
    library_type="RF"
)
print(f"Generated {result['total_hints_generated']} hints")
With Options
result = annorefine.bam2hints(
    bam_file="alignments.bam",
    output_file="hints.gff",
    library_type="RF",
    priority=4,
    source="E",
    threads=4,
    min_intron_len=32,
    max_intron_len=350000
)
Introns Only
result = annorefine.bam2hints(
    bam_file="alignments.bam",
    output_file="genemark_hints.gff",
    library_type="RF",
    introns_only=True
)
Per-Contig Processing
# Process a single contig
result = annorefine.bam2hints(
    bam_file="alignments.bam",
    output_file="chr1_hints.gff",
    library_type="RF",
    contig="chr1"
)
# Or process a specific region
result = annorefine.bam2hints(
    bam_file="alignments.bam",
    output_file="region_hints.gff",
    library_type="RF",
    region=("chr1", 1000000, 2000000)
)
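When a single contig is still too large, one approach is to split it into fixed-size windows and pass each one as a `region` tuple. The sketch below only generates the tuples. The helper `windows` is hypothetical, and it assumes 1-based inclusive coordinates, which this page does not confirm; how alignments spanning a window boundary are handled is also not covered here.

```python
def windows(contig, length, size):
    """Yield (contig, start, end) tuples covering positions 1..length."""
    for start in range(1, length + 1, size):
        yield (contig, start, min(start + size - 1, length))


for region in windows("chr1", 2500000, 1000000):
    print(region)
```

Each tuple has the same shape as the `region` argument in the example above.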
Contig Name Mapping
Rename contigs in the output (useful for converting between naming conventions):
# Define contig name mapping
contig_map = {
    'NC_000001.11': 'chr1',
    'NC_000002.12': 'chr2',
    'NC_000003.12': 'chr3',
    # ... etc
}
# Generate hints with renamed contigs
result = annorefine.bam2hints(
    bam_file="alignments.bam",
    output_file="hints.gff",
    library_type="RF",
    contig_map=contig_map
)
# Contigs in the map will be renamed in the output
# Contigs not in the map will keep their original names
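The renaming rule described in the comments can be pictured as a dictionary lookup with a fallback to the original name. The sketch below applies that rule to one tab-separated GFF line; `rename_contig` is a hypothetical helper written for this example, not an AnnoRefine function.

```python
def rename_contig(line, contig_map):
    """Rename the first GFF column if it appears in contig_map."""
    if line.startswith("#") or not line.strip():
        return line  # leave comments and blank lines untouched
    seqname, sep, rest = line.partition("\t")
    # dict.get falls back to the original name for unmapped contigs.
    return contig_map.get(seqname, seqname) + sep + rest


contig_map = {"NC_000001.11": "chr1"}
line = "NC_000001.11\tb2h\tintron\t1000\t2000\t0\t+\t.\tmult=15;pri=4;src=E;"
print(rename_contig(line, contig_map))
```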
Joining Hints from Multiple Sources
After generating hints from different sources (BAM, proteins, transcripts), join them:
import annorefine
# Join hints from multiple sources
result = annorefine.join_hints(
    input_files=[
        "bam_hints.gff",
        "protein_hints.gff",
        "transcript_hints.gff"
    ],
    output_file="joined_hints.gff"
)
print(f"Merged {result['total_input_hints']} hints into {result['output_hints']}")
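What joining means is not detailed on this page. One plausible reading, sketched below, is that hints identical in every field except `mult` are collapsed into one line whose `mult` is the sum. This is an assumption about the behavior, and `join_hint_lines` is a hypothetical helper operating on tab-separated GFF lines.

```python
import re

MULT_RE = re.compile(r"mult=(\d+);?")


def join_hint_lines(lines):
    """Merge hints that differ only in their mult value, summing mult."""
    merged = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        attrs = cols[8]
        match = MULT_RE.search(attrs)
        mult = int(match.group(1)) if match else 1  # missing mult counts as 1
        rest = MULT_RE.sub("", attrs)
        key = tuple(cols[:8]) + (rest,)
        merged[key] = merged.get(key, 0) + mult

    joined = []
    for key, mult in merged.items():
        *cols, rest = key
        joined.append("\t".join(cols + [f"mult={mult};{rest}"]))
    return joined


hints = [
    "chr1\tb2h\tintron\t1000\t2000\t0\t+\t.\tmult=15;pri=4;src=E;",
    "chr1\tb2h\tintron\t1000\t2000\t0\t+\t.\tmult=5;pri=4;src=E;",
    "chr1\tb2h\texon\t2001\t2500\t0\t+\t.\tmult=10;pri=4;src=E;",
]
for line in join_hint_lines(hints):
    print(line)
```

Because the source column and the `src` attribute are part of the key, hints from different evidence types stay separate in this sketch.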
Create GeneMark Hints
# Join and filter to introns only for GeneMark
result = annorefine.join_hints(
    input_files=[
        "bam_hints.gff",
        "protein_hints.gff"
    ],
    output_file="genemark_hints.gff",
    introns_only=True
)
Output Format
The output is in GFF format compatible with Augustus and GeneMark:
# Generated by AnnoRefine v2025.9.18
# Command: bam2hints --input alignments.bam --output hints.gff --stranded RF
# Library type: RF
# Introns only: false
chr1	b2h	intron	1000	2000	0	+	.	mult=15;pri=4;src=E;
chr1	b2h	exon	2001	2500	0	+	.	mult=10;pri=4;src=E;
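Each hint line has the nine standard GFF columns, with the multiplicity, priority and source stored as `key=value;` pairs in the last column. The sketch below reads one such line into a dictionary. `parse_hint` is a hypothetical helper; it splits on any whitespace, which works for these lines because the attribute column contains no spaces.

```python
def parse_hint(line):
    """Parse one hint line into a dict of typed fields and attributes."""
    cols = line.split()
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, got {len(cols)}")
    seqname, source, feature, start, end, score, strand, frame, attrs = cols
    # Attributes look like "mult=15;pri=4;src=E;" with a trailing semicolon.
    attributes = dict(item.split("=", 1) for item in attrs.split(";") if item)
    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),
        "end": int(end),
        "score": score,
        "strand": strand,
        "frame": frame,
        "attributes": attributes,
    }


hint = parse_hint("chr1 b2h intron 1000 2000 0 + . mult=15;pri=4;src=E;")
print(hint["feature"], hint["start"], hint["end"], hint["attributes"]["mult"])
```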
Performance Tips
- Use multiple threads: Set --threads to match your CPU cores
- Process by contig: For very large genomes, process contigs in parallel
- Index your BAM: Ensure BAM files are indexed (.bai file present)
- Filter by region: Use --contig or region filtering for targeted analysis
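A quick pre-flight check for the index can save a failed run. The sketch below looks for the two common `.bai` naming patterns next to the BAM file. `find_bam_index` is a hypothetical helper, and which index locations AnnoRefine itself accepts is not stated on this page.

```python
from pathlib import Path


def find_bam_index(bam_path):
    """Return the path of an existing .bai index, or None if none is found."""
    bam = Path(bam_path)
    candidates = [
        Path(str(bam) + ".bai"),  # alignments.bam.bai
        bam.with_suffix(".bai"),  # alignments.bai
    ]
    for candidate in candidates:
        if candidate.exists():
            return candidate
    return None


index = find_bam_index("alignments.bam")
if index is None:
    print("No index found; run 'samtools index alignments.bam' first")
else:
    print(f"Using index {index}")
```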