Processing data can be a real mess!

Here at Factual we’ve felt the pain of managing data workflows for a very long time. Here are just a few of the issues:

  • a multitude of steps, with complicated dependencies
  • code and input can change frequently – it’s tiring and error-prone to figure out what needs to be re-built
  • inputs scattered all over (home directories, NFS, HDFS, etc.), tough to maintain, tough to sustain repeatability

Paul Butler, a self-described Data Hacker, recently published an article called “Make for Data Scientists”, which explored the challenges of managing data processing work. Paul went on to explain why GNU Make could be a viable tool for easing this pain. He also pointed out some limitations of Make, for example its assumption that all data is local.

We were gladdened to read Paul’s article, because we’d been hard at work building an internal tool to help manage our data workflows. A defining goal was to end up with a kind of “Make for data”, but targeted squarely at the problems of managing data workflow.

Introducing ‘Drake’, a “Make for Data”

We call this tool Drake, and today we are excited to share Drake with the world, as an open source project. It is written in Clojure.

Drake is a text-based command line data workflow tool that organizes command execution around data and its dependencies. Data processing steps are defined along with their inputs and outputs. Drake automatically resolves dependencies and provides a rich set of options for controlling the workflow. It supports multiple inputs and outputs and has HDFS support built in.

We use Drake at Factual on various internal projects. It serves as a primary way to define, run, and manage data workflow. Some core benefits we’ve seen:
  • Lets non-programmers run Drake and fully manage a workflow
  • Encourages repeatability of the overall data building process
  • Encourages consistent organization (e.g., where supporting scripts live, and how they’re run)
  • Gives precise control over steps (for more effective testing, debugging, etc.)
  • Unifies different tools in a single workflow (shell commands, Ruby, Python, Clojure, pushing data to production, etc.)

Examples

Here’s a simple example of a Drake workflow file with three steps:

;
; Grabs us some data from the Internets
;
contracts.csv <-
  curl http://www.ferc.gov/docs-filing/eqr/soft-tools/sample-csv/contract.txt > $OUTPUT

;
; Filters out all but the evergreen contracts
;
evergreens.csv <- contracts.csv
  grep Evergreen $INPUT > $OUTPUT

;
; Saves a super fancy report
;
report.txt <- evergreens.csv [python]
  linecount = len(open("$[INPUT]").readlines())
  with open("$[OUTPUT]", "w") as f:
    f.write("File $[INPUT] has {0} lines.\n".format(linecount))

Items to the left of an arrow ( <- ) are output files, and items to the right of an arrow are input files. Under the line specifying inputs and outputs is the body of the step, holding one or more commands. The command(s) of a step are expected to handle the input(s) and produce the expected output(s). By default, Drake steps are written as bash commands.
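To make the input/output contract concrete, here is a minimal Python sketch of running one shell step with its input and output file names exposed as the INPUT and OUTPUT environment variables. This is an illustration of the idea only, not Drake’s implementation; the helper name run_step is hypothetical.

```python
import os
import subprocess


def run_step(command, input_path, output_path):
    """Run one shell command with INPUT and OUTPUT set in its environment.

    Hypothetical helper for illustration; Drake's real step runner may
    work differently.
    """
    # Copy the current environment and add the two step-specific variables.
    env = dict(os.environ, INPUT=input_path, OUTPUT=output_path)
    # check=True raises CalledProcessError if the command exits non-zero.
    subprocess.run(command, shell=True, env=env, check=True)
```

Under these assumptions, the second step of the example workflow would correspond to run_step('grep Evergreen "$INPUT" > "$OUTPUT"', "contracts.csv", "evergreens.csv").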

Assuming we called this file workflow.d (what Drake expects by default), we’d kick off the entire workflow by simply running Drake in that directory:

$ drake

Drake will give us a preview and ask us to confirm we know what’s going on:

The following steps will be run, in order:
  1: contracts.csv <-  [missing output]
  2: evergreens.csv <- contracts.csv [projected timestamped]
  3: report.txt <- evergreens.csv [projected timestamped]
Confirm? [y/n]

By default, Drake will run all steps required to build all output files that are not up to date. But imagine we wanted to run our workflow only up to producing evergreens.csv, but no further. Easy:

$ drake evergreens.csv

The preview:

The following steps will be run, in order:
  1: contracts.csv <-  [missing output]
  2: evergreens.csv <- contracts.csv [projected timestamped]
Confirm? [y/n]
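For readers curious about how a Make-style tool might arrive at a preview like this, here is a rough Python sketch. It walks upstream from the requested target and schedules a step when one of its inputs is about to be rebuilt, when its output is missing, or when an input is newer than its output. This is a simplified model under our own assumptions, not Drake’s actual algorithm; the names WORKFLOW and plan, and the plain “timestamped” label, are made up for the illustration.

```python
# Hypothetical model of the three-step workflow above: output -> inputs.
WORKFLOW = {
    "contracts.csv": [],
    "evergreens.csv": ["contracts.csv"],
    "report.txt": ["evergreens.csv"],
}


def plan(target, steps, mtimes):
    """Return (output, reason) pairs to run, in order, to build `target`.

    `mtimes` maps existing files to modification times; a file that is
    absent from `mtimes` is treated as missing.
    """
    ordered = []      # steps that must run, dependencies first
    visited = set()

    def visit(output):
        if output in visited or output not in steps:
            return    # already handled, or a plain source file
        visited.add(output)
        inputs = steps[output]
        for name in inputs:
            visit(name)   # plan upstream steps before this one
        scheduled = {out for out, _ in ordered}
        if any(name in scheduled for name in inputs):
            # An input is about to be rebuilt, so it will become newer.
            ordered.append((output, "projected timestamped"))
        elif output not in mtimes:
            ordered.append((output, "missing output"))
        elif any(mtimes.get(name, 0) > mtimes[output] for name in inputs):
            ordered.append((output, "timestamped"))

    visit(target)
    return ordered
```

Calling plan("evergreens.csv", WORKFLOW, {}) with no files present schedules only the first two steps, because the walk starts at the target and never visits report.txt.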

That’s a very simple example. To see a workflow that’s a bit more interesting, take a look at the “human-resources” workflow in Drake’s demos. There you’ll see a workflow that uses HDFS, contains inline Ruby, Python, and Clojure code, and deals with steps that have multiple inputs and produce multiple outputs. Diagrammed, it looks like:

As our workflows grow complicated, Drake’s value grows more apparent. Take target selection for example. Imagine we’ve run the full workflow shown above and everything’s up-to-date. Then we hear that the skills database has been updated. We’d like to force a rebuild of skills and all affected dependent outputs. Drake knows how to force build (+), and it knows about the concept of downtree (^). So we can just do this:

$ drake +^skills

Drake will prompt us with a preview…

The following steps will be run, in order:
  1: skills <- [forced]
  2: people.skills <- skills, people [forced]
  3: people.json <- people.skills [forced]
  4: last_gt_first.txt, first_gt_last.txt <- people.json [forced]
  5: for_HR.csv <- people.json [forced]
Confirm? [y/n]

… and we’re off and running.
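The “downtree” idea is easy to model. The Python sketch below selects the step that builds a target plus every step that depends on it, directly or indirectly. It is only an illustration of the concept, not how Drake implements it. The step list is copied from the preview above, the extra “people” step is hypothetical, and the code assumes steps are listed in dependency order.

```python
# (outputs, inputs) per step, copied from the preview above and kept in
# dependency order. The "people" step is hypothetical, added only to show
# that unrelated steps are left alone.
HR_STEPS = [
    (["people"], []),
    (["skills"], []),
    (["people.skills"], ["skills", "people"]),
    (["people.json"], ["people.skills"]),
    (["last_gt_first.txt", "first_gt_last.txt"], ["people.json"]),
    (["for_HR.csv"], ["people.json"]),
]


def downtree(target, steps):
    """Return the step that builds `target` and all steps downstream of it.

    Assumes `steps` is already in dependency order, so one pass is enough.
    """
    rebuilt = set()   # files that selected steps will regenerate
    selected = []
    for outputs, inputs in steps:
        # Select the step if it builds the target or consumes a rebuilt file.
        if target in outputs or rebuilt.intersection(inputs):
            selected.append((outputs, inputs))
            rebuilt.update(outputs)
    return selected
```

With this model, downtree("skills", HR_STEPS) selects the same five steps as the preview and skips the hypothetical “people” step, since nothing it depends on has changed.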

But wait, there’s more!

Drake offers a ton more stuff to help you bring sanity to your otherwise chaotic data workflow, including:

  • rich target selection options
  • support for inline Ruby, Python, and Clojure code
  • tags
  • ability to “branch” your input and output files
  • HDFS integration
  • variables
  • includes

Drake’s designer gives you a screencast

Here’s a video of Artem Boytsov, primary designer of Drake, giving a detailed walkthrough:

Drake integrates with Factual

Just in case you were wondering! Drake includes convenient support for Factual’s public API, so you can easily integrate your workflows with Factual data. If that interests you, and you’re not afraid to sling a bit of Clojure, please see the wiki docs for the Clojure-based protocol called c4.

Drake has a full specification and user manual

A lot of work went into designing and specifying Drake. To prove it, here’s the 60-page specification document. The specification can be downloaded as a PDF and treated like a user manual.

We’ve also started wiki-based documentation for Drake.

Build Drake for yourself

To get your hands on Drake, you can build it from the GitHub repo.

All feedback welcome!

If you’re a wrangler of data workflows, we hope Drake might be of some use to you. Bug reports and contributions can be submitted via the GitHub repo. Any comments or questions can be submitted to the Google Group for Drake.

 

Go make some great workflows!

 

Sincerely,

Aaron Crow

Software Engineer at Factual