
SCons for data science and compbio

Published at: 2014-12-20 21:20

Tags: compbio, data-science, scons, python


When I started working at the Hutch with Erick Matsen, I was introduced to a tool the team used for orchestrating analyses: SCons, a Python based build tool. Like make, SCons was initially developed for compiling software. However, it's also quite useful for data science and computational biology.

Why you want this

Data science and compbio projects invariably require pushing data through multiple processing and computational steps. While for simple projects one can do this manually, executing one command at a time becomes untenable as things get more complicated:

  1. Reproducibility becomes tricky without a record of how you performed the computations, what parameters you used, etc.
  2. Running analyses for multiple parameter settings means entering all the commands multiple times, which is error prone and not DRY.
  3. The process requires your continued attention throughout the computation; you can't just hit go and walk away.
  4. If you update one of the intermediate results, it can be difficult to keep track of what needs to be rerun downstream and to make sure things are run the same way.

Shell scripts are a simple solution to some of these problems. They are ubiquitous, familiar to many, and easy to construct given a series of shell commands. However, proper build tools have some advantages:

  1. Dependency tracking: if you change an intermediate result, only downstream results are rebuilt.
  2. Parallel execution of independent computations.
  3. [EDIT: added 2015-05-13] In some cases, protection against clobbering existing data.

These advantages are exactly why build tools were developed: to save time. This is especially crucial in computational biology, where jobs are often long running (sometimes taking days or weeks), and things are frequently built and tweaked iteratively.

Installation

SCons can be installed with pip install --egg scons if you're using pip. If you're on Ubuntu, you can also apt-get install scons. Homebrew may be an option for OSX users as well.

Example Usage

Taking an example from computational biology, let's say you have a set of HIV genetic sequences from several patients and want to determine the most likely evolutionary relationship between the viruses in these patients. A simple analysis might consist of:

  1. Aligning the sequences
  2. Building a phylogenetic tree from the alignment
  3. Coloring the tree's tips by patient

Each of these steps can be carried out by a different command line program (in this case muscle, fasttree and a custom script colored_tree.py, respectively). Supposing we have sequence and patient data in sequences.fasta and metadata.csv files respectively, let's see how we'd put everything together with SCons.

SCons is designed to read SConstruct files which specify the flow of the computation. In this case, our SConstruct file would look like this:

# SCons actually does this import for you, but I like the added clarity.
from SCons.Script import Command

# Assign our input filenames to some variables for convenience
sequences = 'input/sequences.fasta'
metadata = 'input/metadata.csv'

# Our first step is to create the alignment file. 
align = Command('output/alignment.fasta',   # Path of the target (output) file
        sequences,                          # Source (input)
        'muscle -in $SOURCE -out $TARGET')  # Action executed; Note the use of
                                            # $SOURCE and $TARGET

# We can now specify the `align` object as a source for other targets
tree = Command('output/tree.tre', align, 'fasttree -nt $SOURCE > $TARGET')

# Now let's build our final target, a colored tree
colored_tree = Command('output/colored_tree.svg',
        [tree, metadata],
        './bin/colored_tree.py -I -c patient $SOURCES $TARGET')

The only thing in all of this that isn't vanilla Python is the Command function. This function takes three arguments:

  • target filename(s)
  • source files, either filename strings, or the results of other Command calls
  • and an action which can either be a function or a command line string

Calling this function registers the given task and returns an object representing the file(s) built by that task. Based on the succession of tasks registered, SCons evaluates a dependency graph. As it executes the tasks in the graph, it stores a fingerprint for each file so that it knows what needs to be rebuilt on subsequent runs if something changes. Additionally, changing the action for a given target will trigger a rebuild of that target.
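As mentioned above, an action can also be a Python function rather than a command line string. SCons calls such a function with target, source, and env arguments, passing in the node lists. Here's a minimal sketch (the count_sequences step is a hypothetical addition, not part of the pipeline above):

```python
def count_sequences(target, source, env):
    """SCons builder action: count the FASTA records in the source
    file and write the tally to the target file."""
    # str() on a node (or plain filename) gives its path
    with open(str(source[0])) as f:
        n = sum(1 for line in f if line.startswith('>'))
    with open(str(target[0]), 'w') as out:
        out.write(str(n) + '\n')

# In an SConstruct, a function action registers just like a string action:
# counts = Command('output/sequence_count.txt', sequences, count_sequences)
```

Function actions are handy when a step is easier to express in Python than as a shell command, and SCons fingerprints them just like string actions.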

To run the SConstruct, execute scons and you should see a number of files created in a new output directory. Try opening output/colored_tree.svg and enjoy the fruits of your labor.

OK; that's pretty cool, but what about something more complicated?

If this was where SCons stopped, you might be wondering whether it's worth all the trouble. So let's take a look at a more intricate example.

Suppose we want to run the above analysis, but compare the trees obtained from three different alignment methods. All we'd have to do is add a little for loop that:

  • performs roughly the same analysis for each of the different methods
  • gathers the final results in a collection
  • runs a final Command on said collection to produce the singular final output

To start, let's add from os import path to the beginning of our SConstruct file. Then, modify everything after the input files as follows:

# Create a dictionary we can use to get the correct action string for each alignment method
action_strings = dict(muscle='muscle -in $SOURCE -out $TARGET',
        mafft='mafft $SOURCE > $TARGET',
        clustal='clustalw -INFILE=$SOURCE -OUTFILE=$TARGET -OUTPUT=FASTA -TYPE=DNA')

# Next we have to initialize the collection of colored trees we're going to join together
colored_trees = []


# Now we branch on the various alignment methods
for program in ['muscle', 'mafft', 'clustal']:
    outdir = path.join('output', program)    # use a method-specific outdir
    align_action = action_strings[program]   # get correct action string

    # The rest of the for loop is almost the same as in the last example...

    align = Command(path.join(outdir, 'alignment.fasta'),
            sequences,
            align_action)

    tree = Command(path.join(outdir, 'tree.tre'), align, 'fasttree -nt $SOURCE > $TARGET')

    colored_tree = Command(path.join(outdir, 'colored_tree.svg'),
            [tree, metadata],
            './bin/colored_tree.py -I -c patient $SOURCES $TARGET')
    colored_trees.append(colored_tree)       # add the colored tree to our collection


# Final output: join together all the trees into a single SVG
combined_trees = Command('output/combined_trees.svg',
    colored_trees,
    'svg_stack.py $SOURCES > $TARGET')

Run, and voilà! SCons builds trees for each of the alignment methods and joins the resulting figures into a single SVG.

Now just for fun, rm -r output and rerun with scons -j 3 and observe the parallel awesomeness :-)

It's worth highlighting that all we needed to solve this second problem was a bit of extra Python logic. This is in contrast with make, which requires increasingly convoluted syntax as the logic becomes more complex. This was a major motivation behind the development of SCons: a real programming language scales better than specialized build syntax as build logic grows complex.

There are a number of other tools that take this approach as well, which I hope to take a look at in later posts. However, without getting into too many details, one of the things that stands out about SCons for this use case is that Command returns an object representing the file(s)/results produced by that step of the computation. Being able to pass this around in your programs becomes really valuable when doing data analysis.
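To make this concrete, here's a sketch of what you can do with those returned objects inside an SConstruct (the stats step is a hypothetical addition; align is the alignment target from the first example):

```python
# Fragment of an SConstruct; not runnable outside of scons
align = Command('output/alignment.fasta', sequences,
                'muscle -in $SOURCE -out $TARGET')

# Command returns a list of node objects; str() on a node gives its
# path, so ordinary Python logic can inspect it or derive names from it
align_path = str(align[0])   # 'output/alignment.fasta'

# Nodes mix freely with filename strings anywhere a source is expected
stats = Command('output/alignment_stats.txt', align,
                'wc -l $SOURCE > $TARGET')
```

Because these handles are just Python values, you can stash them in lists and dicts (as we did with colored_trees above), pass them to functions, or build filenames from them, all without ever hard-coding intermediate paths twice.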

Summary

I hope this tutorial has given you a general sense of the capabilities of SCons, and convinced you it's a useful tool for orchestrating data analysis pipelines. I'll soon be following this post up with more on how to get the most out of SCons. In the meantime, here are a few resources in case you're keen on digging in now:

  • Running in parallel over a Slurm cluster: bioscons
  • More declarative running of nested parameter/data sets: Nestly


PS: Thanks to Erick Matsen for his helpful feedback on this post.



Content Copyright 2019, Christopher T. Small; Site generated by Oz