Published at: 2014-12-20 21:20
Tags: compbio, data-science, scons, python
When I started working at the Hutch with Erick Matsen, I was introduced to a tool the team used for orchestrating analyses: SCons, a Python-based build tool. Like make, SCons was initially developed for compiling software. However, it's also quite useful for data science and computational biology.
Data science and compbio projects invariably require pushing data through multiple processing and computational steps. While for simple projects one can do this manually, executing one command at a time becomes untenable as things get more complicated:
Shell scripts are a simple solution to some of these problems. They are ubiquitous, familiar to many, and easy to construct given a series of shell commands. However, proper build tools have some advantages:
These advantages are exactly why build tools were developed: to save time. This is especially crucial in computational biology, where jobs are often long running (sometimes taking days or weeks), and things are frequently built and tweaked iteratively.
SCons can be installed with pip install --egg scons if you're using pip. If you're on Ubuntu, you can also apt-get install scons. Homebrew may be an option for OSX users as well.
Taking an example from computational biology, let's say you have a set of HIV genetic sequences from several patients and want to determine the most likely evolutionary relationship between the viruses in these patients. A simple analysis might consist of:
Each of these steps can be carried out by a different command line program (in this case muscle, fasttree and a custom script colored_tree.py, respectively). Supposing we have sequence and patient data in sequences.fasta and metadata.csv files respectively, let's see how we'd put everything together with SCons.
SCons is designed to read SConstruct files which specify the flow of the computation. In this case, our SConstruct file would look like this:
# SCons actually does this import for you, but I like the added clarity.
from SCons.Script import Command
# Assign our input filenames to some variables for convenience
sequences = 'input/sequences.fasta'
metadata = 'input/metadata.csv'
# Our first step is to create the alignment file.
align = Command('output/alignment.fasta',           # Path of the target (output) file
                sequences,                          # Source (input)
                'muscle -in $SOURCE -out $TARGET')  # Action executed; Note the use of
                                                    # $SOURCE and $TARGET
# We can now specify the `align` object as a source for other targets
tree = Command('output/tree.tre', align, 'fasttree -nt $SOURCE > $TARGET')
# Now let's build our final target, a colored tree
colored_tree = Command('output/colored_tree.svg',
                       [tree, metadata],
                       './bin/colored_tree.py -I -c patient $SOURCES $TARGET')
The only thing in all of this that isn't vanilla Python is the Command function. This function takes 3 arguments: the target (output) file, the source (input) file or files, which can be paths or the objects returned by other Command calls, and the action to execute.
Calling this function registers the given task and returns an object representing the file(s) built by that task. Based on the succession of tasks registered, SCons evaluates a dependency graph. As it executes the tasks in the graph, it stores a fingerprint for each file so that it knows what needs to be rebuilt on subsequent runs if something changes. Additionally, changing the action for a given target will trigger a rebuild of that target.
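The fingerprinting idea can be sketched in plain Python. This is only an illustration of the concept, not SCons's actual implementation: the function names (fingerprint, needs_rebuild, record_build), the choice of SHA-256, and the in-memory dictionary standing in for SCons's stored signatures are all assumptions made for this example.

```python
import hashlib
import os


def fingerprint(file_path):
    """Return a hex digest of the file's contents."""
    with open(file_path, "rb") as handle:
        return hashlib.sha256(handle.read()).hexdigest()


def needs_rebuild(target, sources, action, db):
    """Decide whether `target` is stale, given the build records in `db`."""
    record = db.get(target)
    if record is None or not os.path.exists(target):
        return True  # never built, or the output was deleted
    if record["action"] != action:
        return True  # the command used to build the target changed
    current = {src: fingerprint(src) for src in sources}
    return current != record["sources"]  # True if any input changed


def record_build(target, sources, action, db):
    """Remember the action and source fingerprints a target was built from."""
    db[target] = {
        "action": action,
        "sources": {src: fingerprint(src) for src in sources},
    }
```

A build loop would call needs_rebuild before running each action and record_build after it succeeds, which is why editing sequences.fasta or changing an action string both lead to a rebuild, while an untouched pipeline is skipped.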
To run the SConstruct, execute scons and you should see a number of files created in a new output directory. Try opening the output/colored_tree.svg and enjoy the fruits of your labor.
If this was where SCons stopped, you might be wondering whether it's worth all the trouble. So let's take a look at a more intricate example.
Suppose we want to run the above analysis, but compare the trees obtained from three different alignment methods. All we'd have to do is add a little for loop that runs the pipeline once per alignment method and adds each colored tree to a collection, then call Command on said collection to produce the singular final output.
To start, let's add from os import path to the beginning of our SConstruct file. Then, modify everything after the input files as follows:
# Create a dictionary we can use to get the correct action string for each alignment method
action_strings = dict(muscle='muscle -in $SOURCE -out $TARGET',
                      mafft='mafft $SOURCE > $TARGET',
                      clustal='clustalw -INFILE=$SOURCE -OUTFILE=$TARGET -OUTPUT=FASTA -TYPE=DNA')
# Next we have to initialize the collection of colored trees we're going to join together
colored_trees = []
# Now we branch on the various alignment methods
for program in ['muscle', 'mafft', 'clustal']:
    outdir = path.join('output', program)   # use a method-specific outdir
    align_action = action_strings[program]  # get correct action string
    # The rest of the for loop is almost the same as in the last example...
    align = Command(path.join(outdir, 'alignment.fasta'),
                    sequences,
                    align_action)
    tree = Command(path.join(outdir, 'tree.tre'), align, 'fasttree -nt $SOURCE > $TARGET')
    colored_tree = Command(path.join(outdir, 'colored_tree.svg'),
                           [tree, metadata],
                           './bin/colored_tree.py -I -c patient $SOURCES $TARGET')
    colored_trees.append(colored_tree)  # add the colored tree to our collection
# Final output: join together all the trees into a single SVG
combined_trees = Command('output/combined_trees.svg',
                         colored_trees,
                         'svg_stack.py $SOURCES > $TARGET')
Run, and voilà! SCons builds trees for each of the alignment methods, and joins the resulting tree figures into a single figure.
Now just for fun, rm -r output and rerun with scons -j 3 (the -j flag sets the number of parallel jobs, whereas -n would only do a dry run) and observe the parallel awesomeness :-)
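To see why three jobs can run at once, it helps to look at the dependency graph the loop registers. The sketch below is not how SCons schedules work internally; it rebuilds the same graph by hand and uses the standard library's graphlib (Python 3.9+) to group targets into "waves" whose members do not depend on each other. The names graph, build_waves and waves are made up for this illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

methods = ["muscle", "mafft", "clustal"]

# Map each file to the set of files it depends on, mirroring the SConstruct.
graph = {}
for method in methods:
    outdir = f"output/{method}"
    graph[f"{outdir}/alignment.fasta"] = {"input/sequences.fasta"}
    graph[f"{outdir}/tree.tre"] = {f"{outdir}/alignment.fasta"}
    graph[f"{outdir}/colored_tree.svg"] = {f"{outdir}/tree.tre", "input/metadata.csv"}
graph["output/combined_trees.svg"] = {f"output/{m}/colored_tree.svg" for m in methods}


def build_waves(dependencies):
    """Group nodes into successive waves; nodes in one wave are independent."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose inputs are done
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = build_waves(graph)
for number, wave in enumerate(waves):
    print(number, wave)
```

The first wave holds the two input files, the next three waves each hold one file per alignment method, and the last wave is the combined SVG. Each three-file wave is work that can be spread over three parallel jobs.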
It's worth highlighting that all we needed to solve this second problem was a bit of extra Python logic. This is in contrast with make, which requires increasingly convoluted syntax as the logic becomes more complex. This feature was a major motivation behind the development of SCons: programming logic is the more scalable solution to complex build logic.
There are a number of other tools that take this approach as well, which I hope to take a look at in later posts. However, without getting into too many details, one of the things that stands out about SCons for this use case is that Command returns an object representing the file(s)/results produced by that step of the computation. Being able to pass this around in your programs becomes really valuable when doing data analysis.
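As a rough illustration of passing these objects around, the body of the for loop could be factored into an ordinary Python function that returns the final object. The function name tree_pipeline is hypothetical, and so is the choice to pass the command function in as a parameter; that is done here only so the sketch can be read and tried without SCons installed. In a real SConstruct you would pass SCons's own Command.

```python
from os import path


def tree_pipeline(command, outdir, align_action, sequences, metadata):
    """Register align -> tree -> colored tree under `outdir`.

    `command` is assumed to behave like SCons's Command(target, source, action).
    Returns whatever `command` returns for the colored tree, so the caller can
    use it as a source for further steps.
    """
    align = command(path.join(outdir, 'alignment.fasta'), sequences, align_action)
    tree = command(path.join(outdir, 'tree.tre'), align,
                   'fasttree -nt $SOURCE > $TARGET')
    return command(path.join(outdir, 'colored_tree.svg'), [tree, metadata],
                   './bin/colored_tree.py -I -c patient $SOURCES $TARGET')


# Inside an SConstruct, the loop from the example would then shrink to:
#
#   colored_trees = [tree_pipeline(Command, path.join('output', program),
#                                  action_strings[program], sequences, metadata)
#                    for program in ['muscle', 'mafft', 'clustal']]
```

Because each step hands back an object rather than just writing a file somewhere, wiring steps together is plain function composition.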
I hope this tutorial has given you a general sense of the capabilities of SCons, and convinced you it's a useful tool for orchestrating data analysis pipelines. I'll soon be following this post up with more on how to maximize usage of SCons. In the meantime, here are a few resources in case you're keen on digging in now:
PS: Thanks to Erick Matsen for his helpful feedback on this post.
Content Copyright 2019, Christopher T. Small; Site generated by Oz