Skip to content

fastx

FASTA/FASTQ file handling utilities.

fastx

translate(dna, strand, phase, table=1)

Translates DNA sequence into proteins.

Takes DNA (or rather cDNA sequence) and translates to proteins/amino acids. It requires the DNA sequence, the strand, translation phase, and translation table.

Parameters

dna : str DNA (cDNA) sequence as nucleotides strand : str, (+/-) strand to translate (+ or -) phase : int phase to start translation [0,1,2] table : int, default=1 translation table [1]

Returns

protSeq : str string of translated amino acid sequence

Source code in buscolite/fastx.py
def translate(dna, strand, phase, table=1):
    """Translates DNA sequence into proteins.

    Takes DNA (or rather cDNA sequence) and translates to proteins/amino acids.
    It requires the DNA sequence, the strand, translation phase, and translation
    table.

    Parameters
    ----------
    dna : str
        DNA (cDNA) sequence as nucleotides
    strand : str, (+/-)
        strand to translate (+ or -)
    phase : int
        phase to start translation [0,1,2]
    table : int, default=1
        translation table [1]

    Returns
    -------
    protSeq : str
        string of translated amino acid sequence

    """

    def _split(str, num):
        return [str[start : start + num] for start in range(0, len(str), num)]

    if strand == "-" or strand == -1:
        seq = RevComp(dna)
    else:
        seq = dna
    seq = seq[phase:]
    # map seq to proteins
    protSeq = []
    for c, i in enumerate(_split(seq, 3)):
        if len(i) == 3:
            iSeq = i.upper()
            if c == 0:  # first codon
                if iSeq in codon_table[table]["start"]:
                    aa = "M"
                    protSeq.append(aa)
                else:
                    if iSeq in codon_table[table]["table"]:
                        aa = codon_table[table]["table"][iSeq]
                        protSeq.append(aa)
                    else:
                        protSeq.append("X")
            else:
                if iSeq in codon_table[table]["table"]:
                    aa = codon_table[table]["table"][iSeq]
                    protSeq.append(aa)
                else:
                    protSeq.append("X")
    return "".join(protSeq)

fasta2dict(fasta, full_header=False)

Read FASTA file to dictionary.

This is same as biopython SeqIO.to_dict(), return dictionary keyed by contig name and value is the sequence string.

Parameters

fasta : filename FASTA input file (can be gzipped) full_header : bool, default=False return full header for contig names, default is split at first space

Returns

seqs : dict returns OrderedDict() of header: seq

Source code in buscolite/fastx.py
def fasta2dict(fasta, full_header=False):
    """Read FASTA file to dictionary.

    This is same as biopython SeqIO.to_dict(), return dictionary
    keyed by contig name and value is the sequence string.

    Parameters
    ----------
    fasta : filename
        FASTA input file (can be gzipped)
    full_header : bool, default=False
        return full header for contig names, default is split at first space

    Returns
    -------
    seqs : dict
        returns OrderedDict() of header: seq

    """
    seqs = OrderedDict()
    for title, seq in pyfastx.Fasta(fasta, full_name=True, build_index=False):
        if full_header:
            title = title
        else:
            if " " in title:
                title = title.split(" ", 1)[0]
        seqs[title] = seq
    return seqs

fasta2headers(fasta, full_header=False)

Read FASTA file set of headers.

Simple function to read FASTA file and return set of contig names

Parameters

fasta : filename FASTA input file (can be gzipped) full_header : bool, default=False return full header for contig names, default is split at first space

Returns

headers : set returns set() of header names

Source code in buscolite/fastx.py
def fasta2headers(fasta, full_header=False):
    """Read FASTA file set of headers.

    Simple function to read FASTA file and return set of contig names

    Parameters
    ----------
    fasta : filename
        FASTA input file (can be gzipped)
    full_header : bool, default=False
        return full header for contig names, default is split at first space

    Returns
    -------
    headers : set
        returns set() of header names

    """
    # generate a set of the contig/scaffold names
    headers = set()
    for title, seq in pyfastx.Fasta(fasta, full_name=True, build_index=False):
        if full_header:
            title = title
        else:
            if " " in title:
                title = title.split(" ", 1)[0]
        headers.add(title)
    return headers

fasta2lengths(fasta, full_header=False)

Read FASTA file to dictionary of sequence lengths.

Reads FASTA file (optionally gzipped) and returns dictionary of contig header names as keys with length of sequences as values

Parameters

fasta : filename FASTA input file (can be gzipped) full_header : bool, default=False return full header for contig names, default is split at first space

Returns

seqs : dict returns dictionary of header: len(seq)

Source code in buscolite/fastx.py
def fasta2lengths(fasta, full_header=False):
    """Read FASTA file to dictionary of sequence lengths.

    Reads FASTA file (optionally gzipped) and returns dictionary of
    contig header names as keys with length of sequences as values

    Parameters
    ----------
    fasta : filename
        FASTA input file (can be gzipped)
    full_header : bool, default=False
        return full header for contig names, default is split at first space

    Returns
    -------
    seqs : dict
        returns dictionary of header: len(seq)

    """
    seqs = {}
    for title, seq in pyfastx.Fasta(fasta, full_name=True, build_index=False):
        if full_header:
            title = title
        else:
            if " " in title:
                title = title.split(" ", 1)[0]
        seqs[title] = len(seq)
    return seqs

explode_fasta(fasta, folder, suffix='.fa')

Read FASTA file and write 1 contig per file to folder

Parameters

fasta : filename FASTA input file (can be gzipped) folder : directory directory to write contigs to

Returns

seqs : dict returns dictionary of header: len(seq)

Source code in buscolite/fastx.py
def explode_fasta(fasta, folder, suffix=".fa"):
    """Read FASTA file and write 1 contig per file to folder

    Parameters
    ----------
    fasta : filename
        FASTA input file (can be gzipped)
    folder : directory
        directory to write contigs to

    Returns
    -------
    seqs : dict
        returns dictionary of header: len(seq)

    """
    if not os.path.isdir(folder):
        os.makedirs(folder)
    seqs = {}
    for title, seq in pyfastx.Fasta(fasta, full_name=True, build_index=False):
        if " " in title:
            title = title.split(" ", 1)[0]
        seqs[title] = len(seq)
        with open(os.path.join(folder, "{}{}".format(title, suffix)), "w") as outfile:
            outfile.write(">{}\n{}\n".format(title, softwrap(seq)))

    return seqs

getSeqRegions(seqs, header, coordinates)

From sequence dictionary return spliced coordinates.

Takes a sequence dictionary (ie from fasta2dict), the contig name (header) and the coordinates to fetch (list of tuples)

Parameters

seqs : dict dictionary of sequences keyed by contig name/ header header : str contig name (header) for sequence in seqs dictionary coordinates : list of tuples list of tuples of sequence coordinates to return [(1,10), (20,30)]

Returns

result : str returns spliced DNA sequence

Source code in buscolite/fastx.py
def getSeqRegions(seqs, header, coordinates):
    """From sequence dictionary return spliced coordinates.

    Takes a sequence dictionary (ie from fasta2dict), the contig name (header)
    and the coordinates to fetch (list of tuples)

    Parameters
    ----------
    seqs : dict
        dictionary of sequences keyed by contig name/ header
    header : str
        contig name (header) for sequence in seqs dictionary
    coordinates : list of tuples
        list of tuples of sequence coordinates to return [(1,10), (20,30)]

    Returns
    -------
    result : str
        returns spliced DNA sequence

    """
    # takes SeqRecord dictionary or Index, returns sequence string
    # coordinates is a list of tuples [(1,10), (20,30)]
    result = ""
    sorted_coordinates = sorted(coordinates, key=lambda tup: tup[0])
    for x in sorted_coordinates:
        result += seqs[header][x[0] - 1 : x[1]]
    return result