Published at: 2014-12-20 21:20
Tags: compbio, data-science, scons, python
When I started working at the Hutch with Erick Matsen, I was introduced to a tool the team used for orchestrating analyses: SCons, a Python-based build tool. Like make, SCons was initially developed for compiling software. However, it's also quite useful for data science and computational biology.
Data science and compbio projects invariably require pushing data through multiple processing and computational steps. While for simple projects one can do this manually, executing one command at a time becomes untenable as things get more complicated:
Shell scripts are a simple solution to some of these problems. They are ubiquitous, familiar to many, and easy to construct given a series of shell commands. However, proper build tools have some advantages:
These advantages are exactly why build tools were developed: to save time. This is especially crucial in computational biology, where jobs are often long running (sometimes taking days or weeks), and things are frequently built and tweaked iteratively.
SCons can be installed with pip install --egg scons if you're using pip. If you're on Ubuntu, you can also apt-get install scons. Homebrew may be an option for OS X users as well.
Taking an example from computational biology, let's say you have a set of HIV genetic sequences from several patients and want to determine the most likely evolutionary relationship between the viruses in these patients. A simple analysis might consist of aligning the sequences, building a phylogenetic tree from the alignment, and coloring the tree by patient.
Each of these steps can be carried out by a different command line program (in this case muscle, fasttree and a custom script color_tree.py, respectively). Supposing we have sequence and patient data in sequences.fasta and metadata.csv files respectively, let's see how we'd put everything together with SCons.
SCons is designed to read SConstruct files which specify the flow of the computation. In this case, our SConstruct file would look like this:
    # SCons actually does this import for you, but I like the added clarity.
    from SCons.Script import Command

    # Assign our input filenames to some variables for convenience
    sequences = 'input/sequences.fasta'
    metadata = 'input/metadata.csv'

    # Our first step is to create the alignment file.
    align = Command('output/alignment.fasta',           # Path of the target (output) file
                    sequences,                          # Source (input)
                    'muscle -in $SOURCE -out $TARGET')  # Action executed; Note the use of
                                                        # $SOURCE and $TARGET

    # We can now specify the `align` object as a source for other targets
    tree = Command('output/tree.tre',
                   align,
                   'fasttree -nt $SOURCE > $TARGET')

    # Now let's build our final target, a colored tree
    colored_tree = Command('output/colored_tree.svg',
                           [tree, metadata],
                           './bin/colored_tree.py -I -c patient $SOURCES $TARGET')
The only thing in all of this that isn't vanilla Python is the Command function. This function takes 3 arguments: the path of the target (output) file, the source (input) file or files, and the action to execute.
Calling this function registers the given task and returns an object representing the file(s) built by that task. Based on the succession of tasks registered, SCons evaluates a dependency graph. As it executes the tasks in the graph, it stores a fingerprint for each file so that it knows what needs to be rebuilt on subsequent runs if something changes. Additionally, changing the action for a given target will trigger a rebuild of that target.
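To make the fingerprint idea concrete, here is a minimal sketch of how a build tool might decide whether a target is stale. It is a toy model and not SCons's actual implementation: the function names, the in-memory db dictionary, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib
import os
import tempfile


def fingerprint(sources, action):
    """Hash the action string together with the contents of every source file."""
    digest = hashlib.sha256()
    digest.update(action.encode("utf-8"))
    for source in sources:
        with open(source, "rb") as handle:
            digest.update(handle.read())
    return digest.hexdigest()


def needs_rebuild(target, sources, action, db):
    """True if the target is missing or was built from different inputs/action."""
    return not os.path.exists(target) or db.get(target) != fingerprint(sources, action)


def record_build(target, sources, action, db):
    """Remember what the target was built from (hypothetical in-memory store)."""
    db[target] = fingerprint(sources, action)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "sequences.fasta")
        tgt = os.path.join(tmp, "alignment.fasta")
        action = "muscle -in $SOURCE -out $TARGET"
        db = {}
        with open(src, "w") as handle:
            handle.write(">seq1\nACGT\n")
        print(needs_rebuild(tgt, [src], action, db))  # target does not exist yet
        open(tgt, "w").close()                        # pretend the action ran
        record_build(tgt, [src], action, db)
        print(needs_rebuild(tgt, [src], action, db))  # nothing changed since the build
        with open(src, "a") as handle:
            handle.write(">seq2\nACGA\n")
        print(needs_rebuild(tgt, [src], action, db))  # source contents changed
```

Because the action string is part of the fingerprint, editing the command for a target also marks it stale, which mirrors the behavior described above.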
To run the build, execute scons from the directory containing the SConstruct file and you should see a number of files created in a new output directory. Try opening output/colored_tree.svg and enjoy the fruits of your labor.
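For a feel of what SCons ends up running at each step, here is a rough sketch of how the $SOURCE, $SOURCES and $TARGET variables in an action string get filled in. The expand_action helper is hypothetical and built on Python's string.Template; SCons's own substitution is considerably more capable (quoting, construction variables, and so on).

```python
from string import Template


def expand_action(action, sources, targets):
    """Fill in SCons-style variables in an action string (toy version)."""
    return Template(action).substitute(
        SOURCE=sources[0],           # first source only
        SOURCES=" ".join(sources),   # every source, space separated
        TARGET=targets[0],           # first target only
        TARGETS=" ".join(targets),   # every target, space separated
    )


if __name__ == "__main__":
    # The colored tree step has two sources, hence $SOURCES in its action.
    print(expand_action('./bin/colored_tree.py -I -c patient $SOURCES $TARGET',
                        ['output/tree.tre', 'input/metadata.csv'],
                        ['output/colored_tree.svg']))
```

This is why the first two steps use $SOURCE (a single input) while the coloring step uses $SOURCES (the tree plus the metadata).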
If this was where SCons stopped, you might be wondering whether it's worth all the trouble. So let's take a look at a more intricate example.
Suppose we want to run the above analysis, but compare the trees obtained from three different alignment methods. All we'd have to do is add a little for loop that runs the analysis once per alignment method and collects the resulting colored trees, followed by a final Command on said collection to produce the singular final output.
To start, let's add from os import path to the beginning of our SConstruct file. Then, modify everything after the input files as follows:
    # Create a dictionary we can use to get the correct action string for each alignment method
    action_strings = dict(
        muscle='muscle -in $SOURCE -out $TARGET',
        mafft='mafft $SOURCE > $TARGET',
        clustal='clustalw -INFILE=$SOURCE -OUTFILE=$TARGET -OUTPUT=FASTA -TYPE=DNA')

    # Next we have to initialize the collection of colored trees we're going to join together
    colored_trees = []

    # Now we branch on the various alignment methods
    for program in ['muscle', 'mafft', 'clustal']:
        outdir = path.join('output', program)   # use a method-specific outdir
        align_action = action_strings[program]  # get correct action string

        # The rest of the for loop is almost the same as in the last example...
        align = Command(path.join(outdir, 'alignment.fasta'),
                        sequences,
                        align_action)
        tree = Command(path.join(outdir, 'tree.tre'),
                       align,
                       'fasttree -nt $SOURCE > $TARGET')
        colored_tree = Command(path.join(outdir, 'colored_tree.svg'),
                               [tree, metadata],
                               './bin/colored_tree.py -I -c patient $SOURCES $TARGET')
        colored_trees.append(colored_tree)  # add the colored tree to our collection

    # Final output: join together all the trees into a single SVG
    combined_trees = Command('output/combined_trees.svg',
                             colored_trees,
                             'svg_stack.py $SOURCES > $TARGET')
Run, and voilà! SCons builds trees for each of the alignment methods, and joins the resulting tree figures into a single figure.
Now just for fun, rm -r output and rerun with scons -j 3 and observe the parallel awesomeness :-)
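Parallel builds work here because the three alignment branches don't depend on one another. The sketch below is not SCons's scheduler; it just rebuilds the dependency graph of the example by hand and uses graphlib from the Python standard library (3.9+) to group files into waves that could be built at the same time. The build_waves helper and the hand-written graph are illustrative assumptions.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

methods = ['muscle', 'mafft', 'clustal']

# Map each file to the set of files it depends on, mirroring the SConstruct above.
graph = {}
for m in methods:
    graph[f'output/{m}/alignment.fasta'] = {'input/sequences.fasta'}
    graph[f'output/{m}/tree.tre'] = {f'output/{m}/alignment.fasta'}
    graph[f'output/{m}/colored_tree.svg'] = {f'output/{m}/tree.tre',
                                             'input/metadata.csv'}
graph['output/combined_trees.svg'] = {f'output/{m}/colored_tree.svg' for m in methods}


def build_waves(graph):
    """Group nodes into waves; everything within a wave is mutually independent."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all nodes whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(build_waves(graph)):
        print(number, wave)
```

Each of the middle waves contains one file per alignment method, so with three jobs all three methods can progress side by side; only the final combined figure has to wait for everything else.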
It's worth highlighting that all we needed to solve this second problem was a bit of extra Python logic. This is in contrast with make, which requires increasingly convoluted syntax as the logic becomes more complex. This feature was a major motivation behind the development of SCons: programming logic is the more scalable solution to complex build logic.
There are a number of other tools that take this approach as well, which I hope to take a look at in later posts. However, without getting into too many details, one of the things that stands out about SCons for this use case is that Command returns an object representing the file(s)/results produced by that step of the computation. Being able to pass this around in your programs becomes really valuable when doing data analysis.
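As one illustration of passing these objects around, the per-method steps from the loop could be wrapped in an ordinary Python function that returns the final object. This is only a sketch: tree_pipeline is a hypothetical helper, and it takes the Command function as a parameter so that the example also runs outside SCons with fake_command, a stand-in that records calls and returns the target path in place of a real SCons object.

```python
from os import path


def tree_pipeline(command, outdir, sequences, metadata, align_action):
    """Register align -> tree -> colored tree under outdir; return the last result."""
    align = command(path.join(outdir, 'alignment.fasta'), sequences, align_action)
    tree = command(path.join(outdir, 'tree.tre'), align,
                   'fasttree -nt $SOURCE > $TARGET')
    return command(path.join(outdir, 'colored_tree.svg'), [tree, metadata],
                   './bin/colored_tree.py -I -c patient $SOURCES $TARGET')


# Stand-in for SCons's Command (an assumption made so this runs as plain Python).
registered = []


def fake_command(target, source, action):
    registered.append((target, source, action))
    return target


if __name__ == "__main__":
    colored = tree_pipeline(fake_command, path.join('output', 'muscle'),
                            'input/sequences.fasta', 'input/metadata.csv',
                            'muscle -in $SOURCE -out $TARGET')
    print(colored)
    print(len(registered))
```

Inside an SConstruct file you would pass the real Command instead of fake_command, and the returned objects could be collected in a list and handed to the final combining step just as in the loop.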
I hope this tutorial has given you a general sense of the capabilities of SCons, and convinced you it's a useful tool for orchestrating data analysis pipelines. I'll soon be following this post up with more on how to maximize usage of SCons. In the meantime, here are a few resources in case you're keen on digging in now:
PS: Thanks to Erick Matsen for his helpful feedback on this post.