BAM to Hints Conversion

AnnoRefine converts RNA-seq BAM alignments into hints for the Augustus and GeneMark gene predictors.

Overview

The bam2hints command extracts evidence from RNA-seq alignments and generates hints in GFF format. These hints guide gene prediction tools to improve accuracy.

Hint Types

AnnoRefine generates several types of hints:

intron: Splice junctions (gaps in alignments)
exon: Internal exons (flanked by introns on both sides)
exonpart: Terminal exons (first/last in transcript)
dss: Donor splice sites (GT dinucleotides)
ass: Acceptor splice sites (AG dinucleotides)
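
To make the intron hint concrete: in a spliced alignment, the gap is recorded as an N operation in the CIGAR string. The sketch below recovers intron coordinates from a start position and a CIGAR string. It illustrates the general idea only and is not AnnoRefine's implementation; the function name introns_from_cigar is hypothetical.

```python
import re

# CIGAR operations that advance the position on the reference sequence.
REF_CONSUMING = set("MDN=X")


def introns_from_cigar(ref_start, cigar):
    """Return 1-based inclusive (start, end) intervals, one per N operation.

    ref_start is the 1-based leftmost reference position of the alignment
    (the POS column of a SAM record).
    """
    introns = []
    pos = ref_start
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        length = int(length)
        if op == "N":
            # The skipped region itself is the candidate intron.
            introns.append((pos, pos + length - 1))
        if op in REF_CONSUMING:
            pos += length
    return introns


# A read starting at 990 with 10 aligned bases, a 1001 bp gap, then 500 bases.
print(introns_from_cigar(990, "10M1001N500M"))  # [(1000, 2000)]
```

Soft clips (S) and insertions (I) do not consume reference bases, so they leave the intron coordinates unchanged.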

Command Line Usage

Basic Usage

annorefine bam2hints \
    --input alignments.bam \
    --output hints.gff \
    --stranded RF

Common Options

annorefine bam2hints \
    --input alignments.bam \
    --output hints.gff \
    --stranded RF \
    --threads 4 \
    --priority 4 \
    --source E

Library Strandedness

The --stranded parameter is required and specifies the library type:

FR: Paired-end forward/reverse (fr-secondstrand)
RF: Paired-end reverse/forward (fr-firststrand) - most common for Illumina TruSeq
UU: Paired-end unstranded

Determining Strandedness

If you're unsure of your library type, check the documentation for your library preparation kit. Stranded Illumina kits based on the dUTP method (for example TruSeq Stranded) produce RF libraries.
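
The library type determines how a read's alignment strand maps to the strand of the transcript it came from. The sketch below encodes the usual convention (FR: read 1 is sense; RF: read 2 is sense). It is an illustration under that assumption, not AnnoRefine's internal code, and transcript_strand is a hypothetical helper.

```python
def transcript_strand(library_type, is_read1, is_reverse):
    """Infer the transcript strand for one paired-end read.

    library_type: "FR", "RF" or "UU" (the values accepted by --stranded).
    is_read1:     True for the first mate, False for the second.
    is_reverse:   True if the read aligned to the reverse strand.
    """
    if library_type == "UU":
        return "."  # unstranded: the strand cannot be inferred from the read
    if library_type not in ("FR", "RF"):
        raise ValueError(f"unknown library type: {library_type!r}")

    read_strand = "-" if is_reverse else "+"
    # In FR libraries read 1 has the transcript's orientation;
    # in RF libraries read 2 does.
    sense_mate_is_read1 = library_type == "FR"
    if is_read1 == sense_mate_is_read1:
        return read_strand
    return "+" if read_strand == "-" else "-"


# RF library, read 1 aligned to the reverse strand: transcript is on "+".
print(transcript_strand("RF", is_read1=True, is_reverse=True))
```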

Introns Only (for GeneMark)

GeneMark typically uses only intron hints, so restrict the output with --intronsonly:

annorefine bam2hints \
    --input alignments.bam \
    --output genemark_hints.gff \
    --stranded RF \
    --intronsonly

Per-Contig Processing

For large genomes, process one contig per invocation so that several contigs can run in parallel:

# Process a single contig
annorefine bam2hints \
    --input alignments.bam \
    --output chr1_hints.gff \
    --stranded RF \
    --contig chr1
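
One way to drive several per-contig runs at once is a small Python wrapper around the command above. This is a minimal sketch: it uses only the flags shown on this page, the helper names bam2hints_command and run_contigs are hypothetical, and the output naming (<contig>_hints.gff) simply follows the example.

```python
import shlex
import subprocess
from concurrent.futures import ThreadPoolExecutor


def bam2hints_command(bam, contig, stranded="RF"):
    """Build the argument list for one per-contig bam2hints invocation."""
    return [
        "annorefine", "bam2hints",
        "--input", bam,
        "--output", f"{contig}_hints.gff",
        "--stranded", stranded,
        "--contig", contig,
    ]


def run_contigs(bam, contigs, max_workers=4):
    """Run one bam2hints process per contig, max_workers at a time."""
    commands = [bam2hints_command(bam, contig) for contig in contigs]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # check=True raises CalledProcessError if an invocation fails.
        list(pool.map(lambda cmd: subprocess.run(cmd, check=True), commands))


# Preview the commands without running them.
for contig in ["chr1", "chr2"]:
    print(shlex.join(bam2hints_command("alignments.bam", contig)))
```

Calling run_contigs("alignments.bam", ["chr1", "chr2"]) would then execute the commands. Threads are enough here because the work happens in the child processes.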

Advanced Options

annorefine bam2hints \
    --input alignments.bam \
    --output hints.gff \
    --stranded RF \
    --min-intron-len 32 \
    --max-intron-len 350000 \
    --min-end-block-len 8 \
    --max-gap-len 14 \
    --priority 4 \
    --source E \
    --threads 8
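
The length thresholds share their names and example values with the options of the Augustus bam2hints tool. Assuming AnnoRefine follows the same semantics (this page does not confirm it), an alignment gap would be handled roughly as in the sketch below; classify_gap is a hypothetical helper.

```python
def classify_gap(gap_len, min_intron_len=32, max_intron_len=350000,
                 max_gap_len=14):
    """Decide how a gap of gap_len bases in an alignment is treated.

    Assumed semantics, modelled on the Augustus bam2hints option
    descriptions:
      "close"   - the gap is short enough to be bridged as part of an exon
      "intron"  - the gap is reported as an intron hint
      "discard" - the gap is too long for an indel but outside the accepted
                  intron length range, so the alignment is not used
    """
    if gap_len <= max_gap_len:
        return "close"
    if gap_len < min_intron_len or gap_len > max_intron_len:
        return "discard"
    return "intron"


for gap in (10, 20, 1001, 400000):
    print(gap, classify_gap(gap))
```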

Python API Usage

Basic Conversion

import annorefine

result = annorefine.bam2hints(
    bam_file="alignments.bam",
    output_file="hints.gff",
    library_type="RF"
)

print(f"Generated {result['total_hints_generated']} hints")

With Options

result = annorefine.bam2hints(
    bam_file="alignments.bam",
    output_file="hints.gff",
    library_type="RF",
    priority=4,
    source="E",
    threads=4,
    min_intron_len=32,
    max_intron_len=350000
)

Introns Only

result = annorefine.bam2hints(
    bam_file="alignments.bam",
    output_file="genemark_hints.gff",
    library_type="RF",
    introns_only=True
)

Per-Contig Processing

# Process a single contig
result = annorefine.bam2hints(
    bam_file="alignments.bam",
    output_file="chr1_hints.gff",
    library_type="RF",
    contig="chr1"
)

# Or process a specific region
result = annorefine.bam2hints(
    bam_file="alignments.bam",
    output_file="region_hints.gff",
    library_type="RF",
    region=("chr1", 1000000, 2000000)
)

Contig Name Mapping

Rename contigs in the output (useful for converting between naming conventions):

# Define contig name mapping
contig_map = {
    'NC_000001.11': 'chr1',
    'NC_000002.12': 'chr2',
    'NC_000003.12': 'chr3',
    # ... etc
}

# Generate hints with renamed contigs
result = annorefine.bam2hints(
    bam_file="alignments.bam",
    output_file="hints.gff",
    library_type="RF",
    contig_map=contig_map
)

# Contigs in the map will be renamed in the output
# Contigs not in the map will keep their original names
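
The renaming rule described in the comments above is a dictionary lookup with a fallback to the original name. A minimal sketch, with rename_contig as a hypothetical helper:

```python
def rename_contig(name, contig_map):
    """Return the mapped name, or the original name if it is not in the map."""
    return contig_map.get(name, name)


contig_map = {"NC_000001.11": "chr1", "NC_000002.12": "chr2"}

print(rename_contig("NC_000001.11", contig_map))  # chr1
print(rename_contig("NC_012920.1", contig_map))   # NC_012920.1 (unchanged)
```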

Joining Hints from Multiple Sources

After generating hints from different sources (BAM, proteins, transcripts), join them:

import annorefine

# Join hints from multiple sources
result = annorefine.join_hints(
    input_files=[
        "bam_hints.gff",
        "protein_hints.gff",
        "transcript_hints.gff"
    ],
    output_file="joined_hints.gff"
)

print(f"Merged {result['total_input_hints']} hints into {result['output_hints']}")
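
To illustrate what joining can mean, the sketch below merges hints that describe the same feature and sums their mult values. This is an assumption about the merge rule, modelled on how multiplicities are commonly combined for Augustus hints; it is not taken from AnnoRefine's source, and join_hint_lines is a hypothetical helper.

```python
def parse_attrs(text):
    """Parse 'mult=15;pri=4;src=E;' into an ordered dict of strings."""
    attrs = {}
    for item in text.strip().strip(";").split(";"):
        if item:
            key, _, value = item.partition("=")
            attrs[key.strip()] = value.strip()
    return attrs


def join_hint_lines(lines):
    """Merge identical hints (same contig, type, span, strand, src)."""
    merged = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        cols = line.rstrip("\n").split("\t")
        attrs = parse_attrs(cols[8])
        key = (cols[0], cols[2], cols[3], cols[4], cols[6], attrs.get("src"))
        mult = int(attrs.get("mult", 1))  # a missing mult counts as 1
        if key in merged:
            merged[key]["mult"] += mult
        else:
            merged[key] = {"cols": cols[:8], "attrs": attrs, "mult": mult}

    joined = []
    for entry in merged.values():
        attrs = dict(entry["attrs"], mult=str(entry["mult"]))
        attr_text = "".join(f"{k}={v};" for k, v in attrs.items())
        joined.append("\t".join(entry["cols"] + [attr_text]))
    return joined


lines = [
    "chr1\tb2h\tintron\t1000\t2000\t0\t+\t.\tmult=15;pri=4;src=E;",
    "chr1\tb2h\tintron\t1000\t2000\t0\t+\t.\tmult=5;pri=4;src=E;",
]
print(join_hint_lines(lines))
```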

Create GeneMark Hints

# Join and filter to introns only for GeneMark
result = annorefine.join_hints(
    input_files=[
        "bam_hints.gff",
        "protein_hints.gff"
    ],
    output_file="genemark_hints.gff",
    introns_only=True
)

Output Format

The output is in GFF format compatible with Augustus and GeneMark:

# Generated by AnnoRefine v2025.9.18
# Command: bam2hints --input alignments.bam --output hints.gff --stranded RF
# Library type: RF
# Introns only: false
chr1    b2h intron  1000    2000    0   +   .   mult=15;pri=4;src=E;
chr1    b2h exon    2001    2500    0   +   .   mult=10;pri=4;src=E;
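
Each hint line has the nine standard GFF columns, separated by tabs: sequence name, source, feature, start, end, score, strand, frame, and attributes. A minimal parser for downstream scripts might look like this; the Hint class and parse_hint function are illustrative and not part of the AnnoRefine API.

```python
from dataclasses import dataclass


@dataclass
class Hint:
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: str
    strand: str
    frame: str
    attributes: dict


def parse_hint(line):
    """Parse one tab-separated hint line into a Hint record."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    # Attributes look like 'mult=15;pri=4;src=E;'.
    attributes = dict(
        item.split("=", 1) for item in cols[8].split(";") if "=" in item
    )
    return Hint(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                cols[5], cols[6], cols[7], attributes)


hint = parse_hint("chr1\tb2h\tintron\t1000\t2000\t0\t+\t.\tmult=15;pri=4;src=E;")
print(hint.feature, hint.start, hint.end, hint.attributes["mult"])
```

Lines starting with # are header comments and should be skipped before parsing.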

Performance Tips

Use multiple threads: Set --threads to match your CPU cores
Process by contig: For very large genomes, process contigs in parallel
Index your BAM: Ensure BAM files are indexed (.bai file present)
Filter by region: Use --contig on the command line, or the contig and region arguments in the Python API, for targeted analysis
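
A quick pre-flight check for the index can save a failed run. The sketch below looks for the two common index file names next to the BAM file; find_bam_index is a hypothetical helper, and it only checks that a file exists, not that the index is up to date.

```python
from pathlib import Path


def find_bam_index(bam_path):
    """Return the path of an existing .bai index for bam_path, or None."""
    bam = Path(bam_path)
    candidates = [
        Path(str(bam) + ".bai"),   # alignments.bam.bai
        bam.with_suffix(".bai"),   # alignments.bai
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None


if find_bam_index("alignments.bam") is None:
    print("No index found; create one with: samtools index alignments.bam")
```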
