Bioinformatics algorithms implemented in C, wrapped in Python.

Disclaimer: I had this idea in the shower.

Bioinformatics applications are naturally targeted at solving a specific problem. If a researcher creates a new protocol, he or she must a.) adjust that protocol to the software b.) write software specific to that problem or c.) wait for someone to write software specific to that problem. The goal of Pypette is to make option b easier by implementing common bioinformatics algorithms that can be mixed and matched.

Example problem:

A researcher wants to perform a de novo assembly from short reads using a de-Brujin graph approach similar to Velvet, taking into account quality scores. The short reads came from an environmental sample amplified with random hexamers and is thus a metagenomic sample. The researcher estimates that there are 4 prominent organisms in the sample and each should have a genome coverage of about 18x. Before assembling, the researcher wishes to bin the reads according to which organism the read is most likely to belong to using the hidden-markov model approach (via: Assembly is hard because it’s not decomposable)

This procedure would be complicated to implement from scratch. As a result the researcher may either give up, change the experiment, or attempt to carry out the procedure by adapting existing tools.


One way to make this easier would be to implement sensible interfaces (APIs) to commonly used bioinformatics programs allowing them to be combined in a programming language.

The following example is contrived and not implemented. It simply represents an ideal scenario.

In this example solution, the inputs and outputs of the read binner and assembler are easily combined using an imaginary Python library.

from pypette import debrujin, markov, fastq

# Load the reads
with open("reads.fastq") as handle:
    reads = fastq.load(handle)

# Cluster reads
reads_binned = \
    markov.cluster(reads, probability=lambda x: x.qual.to_f(), bins=4)
# Construction of debrujin graph per bin
graphs = {}
for n, bin in enumerate(reads_binned):
    graphs[n] = debrujin(bin, probability=lambda x: x.qual.to_f(), k=31)

# Solve graph, convert to contigs
contigs = {}
[ contigs[n] = graphs[n].contigs() for n in graphs.solve() ]

# Output contig files in fasta format
for n in contigs:
    with open("contigs_%s" % n) as handle:
        for contig in contigs[n]:
            print >> handle, contig.fastq()

or something like that…


This is a super over-simplified example but it illustrates the idea.

The user was able to craft a complex bioinformatics application with very little code without having to implement the underlying bioinformatics algorithms, just customize them.