Introducing Drake, a kind of ‘make for data’

Processing data can be a real mess!

Here at Factual we’ve felt the pain of managing data workflows for a very long time. Here are just a few of the issues:

  • a multitude of steps, with complicated dependencies
  • code and input can change frequently – it’s tiring and error-prone to figure out what needs to be re-built
  • inputs scattered all over (home directories, NFS, HDFS, etc.), tough to maintain, tough to sustain repeatability

Paul Butler, a self-described Data Hacker, recently published an article called “Make for Data Scientists”, which explored the challenges of managing data processing work. Paul went on to explain why GNU Make could be a viable tool for easing this pain. He also pointed out some limitations of Make, for example the assumption that all data is local.

We were gladdened to read Paul’s article, because we’d been hard at work building an internal tool to help manage our data workflows. A defining goal was to end up with a kind of “Make for data”, but targeted squarely at the problems of managing data workflows.

Introducing ‘Drake’, a “Make for Data”

We call this tool Drake, and today we are excited to share Drake with the world, as an open source project. It is written in Clojure.

Drake is a text-based command line data workflow tool that organizes command execution around data and its dependencies. You define data processing steps along with their inputs and outputs; Drake automatically resolves the dependencies between steps and provides a rich set of options for controlling the workflow. It supports multiple inputs and outputs per step and has HDFS support built in.

We use Drake at Factual on various internal projects. It serves as a primary way to define, run, and manage data workflows. Some core benefits we’ve seen:
  • Non-programmers can run Drake and fully manage a workflow
  • Encourages repeatability of the overall data building process
  • Encourages consistent organization (e.g., where supporting scripts live, and how they’re run)
  • Precise control over steps (for more effective testing, debugging, etc.)
  • Unifies different tools in a single workflow (shell commands, Ruby, Python, Clojure, pushing data to production, etc.)

Examples

Here’s a simple example of a Drake workflow file with three steps:

;
; Grabs us some data from the Internets
;
contracts.csv <-
  curl http://www.ferc.gov/docs-filing/eqr/soft-tools/sample-csv/contract.txt > $OUTPUT

;
; Filters out all but the evergreen contracts
;
evergreens.csv <- contracts.csv
  grep Evergreen $INPUT > $OUTPUT

;
; Saves a super fancy report
;
report.txt <- evergreens.csv [python]
  linecount = len(file("$[INPUT]").readlines())
  with open("$[OUTPUT]", "w") as f:
    f.write("File $[INPUT] has {0} lines.\n".format(linecount))

Items to the left of the arrow ( <- ) are output files, and items to the right are input files. Under the line specifying inputs and outputs is the body of the step, holding one or more commands. A step’s commands are expected to consume the inputs and produce the expected outputs. By default, Drake steps are written as bash commands; the [python] protocol tag on the final step above tells Drake to treat that body as inline Python instead.

Assuming we called this file workflow.d (what Drake expects by default), we’d kick off the entire workflow by simply running Drake in that directory:

$ drake

Drake will give us a preview and ask us to confirm we know what’s going on:

The following steps will be run, in order:
  1: contracts.csv <-  [missing output]
  2: evergreens.csv <- contracts.csv [projected timestamped]
  3: report.txt <- evergreens.csv [projected timestamped]
Confirm? [y/n]

By default, Drake will run all steps required to build all output files that are not up to date. But imagine we wanted to run our workflow only up to producing evergreens.csv, but no further. Easy:

$ drake evergreens.csv

The preview:

The following steps will be run, in order:
  1: contracts.csv <-  [missing output]
  2: evergreens.csv <- contracts.csv [projected timestamped]
Confirm? [y/n]

That’s a very simple example. To see a workflow that’s a bit more interesting, take a look at the “human-resources” workflow in Drake’s demos. There you’ll see a workflow that uses HDFS, contains inline Ruby, Python, and Clojure code, and deals with steps that have multiple inputs and produce multiple outputs. Diagrammed, it looks like this: [workflow dependency diagram]
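
To give a flavor of such a step, here’s a minimal sketch in the same workflow syntax as above. The file names and the merge_report.sh script are purely illustrative (they are not the actual demo files), and the exact names of the numbered step variables ($INPUT0, $OUTPUT0, and so on) are covered in the specification:

;
; One step, two inputs, two outputs (illustrative names; merge_report.sh is hypothetical)
;
summary.txt, rejects.txt <- people.csv, skills.csv
  ./merge_report.sh $INPUT0 $INPUT1 $OUTPUT0 $OUTPUT1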

As our workflows grow more complicated, Drake’s value becomes more apparent. Take target selection, for example. Imagine we’ve run the full human-resources workflow and everything’s up to date. Then we hear that the skills database has been updated. We’d like to force a rebuild of skills and all affected dependent outputs. Drake knows how to force build (+), and it knows about the concept of downtree (^). So we can just do this:

$ drake +^skills

Drake will prompt us with a preview…

The following steps will be run, in order:
  1: skills <- [forced]
  2: people.skills <- skills, people [forced]
  3: people.json <- people.skills [forced]
  4: last_gt_first.txt, first_gt_last.txt <- people.json [forced]
  5: for_HR.csv <- people.json [forced]
Confirm? [y/n]

… and we’re off and running.

But wait, there’s more!

Drake offers a ton more stuff to help you bring sanity to your otherwise chaotic data workflow, including:

  • rich target selection options
  • support for inline Ruby, Python, and Clojure code
  • tags
  • ability to “branch” your input and output files
  • HDFS integration
  • variables (see the sketch just after this list)
  • includes
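
As a quick taste of variables, here’s a minimal sketch, assuming the $[...] substitution syntax described in the specification; the variable name and file paths are illustrative:

RAW=/data/raw

$[RAW]/evergreens.csv <- $[RAW]/contracts.csv
  grep Evergreen $INPUT > $OUTPUT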

Drake’s designer gives you a screencast

Here’s a video of Artem Boytsov, primary designer of Drake, giving a detailed walkthrough: [embedded screencast]

Drake integrates with Factual

Just in case you were wondering! Drake includes convenient support for Factual’s public API, so you can easily integrate your workflows with Factual data. If that interests you, and you’re not afraid to sling a bit of Clojure, please see the wiki docs for the Clojure-based protocol called c4.

Drake has a full specification and user manual

A lot of work went into designing and specifying Drake. To prove it, here’s the 60-page specification document. The specification can be downloaded as a PDF and treated like a user manual.

We’ve also started wiki-based documentation for Drake.

Build Drake for yourself

To get your hands on Drake, you can build it from the GitHub repo.
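
Here’s a rough sketch of what that looks like, assuming a standard Leiningen-based Clojure build; see the repo’s README for the authoritative steps:

git clone https://github.com/Factual/drake.git
cd drake
lein uberjar    # builds a standalone Drake jar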

All feedback welcome!

If you’re a wrangler of data workflows, we hope Drake might be of some use to you. Bug reports and contributions can be submitted via the GitHub repo. Any comments or questions can be submitted to the Google Group for Drake.

 

Go make some great workflows!

 

Sincerely,

Aaron Crow

Software Engineer at Factual

How to build a .NET Web API app that uses Factual’s C# Driver

The ASP.NET Web API provides a nice platform for building RESTful applications on the .NET Framework. Factual’s C# driver makes it easy to use Factual’s public API in the .NET world.

We recently asked Sergey Maskalik, the author of Factual’s C# driver, to bring these two things together and create a tutorial showing .NET developers how to build a Web API application that makes use of Factual.

The result is the aptly titled “ASP.NET Web API with Factual Driver Example” wiki page.

If you’re a .NET developer, we hope you find this tutorial useful.

Have fun!
Sincerely,
Aaron
Software Engineer

Real-Time Places Data Stack Goes Global

About two months ago, we announced that our entire Factual data stack was now working in real-time for US Place entities.  Today we’re pleased to report that this also now applies to the 49 additional countries that Factual covers.

As a reminder, Gil previously explained the Seven Steps a contribution travels as it works its way through Factual’s data stack. Now we are able to apply these processes to all global inputs, of which entity resolution is the most complex. Resolving an entity requires an algorithmic process that determines whether two records refer to the same place. Without a solid Resolve API, our Place data would quickly get hammered with duplicate records. For example, let’s say we want to add the following data about a well-known restaurant in France:

  • Name:  Hiramatsu
  • Address:  Rue de Longchamp
  • Locality:  Paris
  • Country: FR

Even though the contribution has no address number and contains “Rue” instead of the current value of “R.”, Resolve is able to add this data to the existing restaurant entry rather than create a duplicate record. Now imagine you want to push millions of contributions through — no problem, we can handle it.
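
To make that concrete: a Resolve call is essentially “here are the attributes I know; tell me which entity they belong to.” Written as a raw API query, the contribution above might be resolved with something roughly like the following (treat the endpoint and parameter name as illustrative; the API docs have the exact form):

/places/resolve?values={"name":"Hiramatsu","address":"Rue de Longchamp","locality":"Paris","country":"FR"}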

Things get pretty interesting when you try this with double-byte characters. Take “Tea House Takano” (aka ティーハウスタカノ) in Japan — which we have here in our dataset. Let’s say we want to add the following data fields:

  • Name:  ティーハウスタカノ
  • State/Region:  東京都
  • Country: Japan

Resolve is able to determine that this record already exists in Factual, even though the contribution has no address data and is submitted with “Japan” instead of “JP” (the country code our schema expects). Resolve both matches and normalizes the input; another dupe prevented.

What does this mean for our developers?  Access to even more fresh, accurate, and comprehensive global place data!  It also means that any data you or your users contribute to Factual through the Submit API is incorporated instantly and hassle-free.

US Resolve is currently available as a standalone API, and we will be making International Resolve available in Q1 2013. So you’ll soon be able to perform place data cleansing and enrichment for global data.

Know the correct phone number of a spa in Argentina, have a better lat/lng for a bar in Tokyo, think a cafe in Jakarta is missing?  Throw all the data at us and we’ll sort it out.

Interested in using or contributing to our Global Places data?  Get an API key and get started!

Adding data regularly,

Bill Michels

Updated Restaurants Data

Last week, excitement and joy spread through the office as news broke that the Lakers had hired Mike D’Antoni as the new head coach (at least, excitement in my part of the office — Gil is more of a Clippers fan).

Some of us made plans to attend a game. First we did the easy stuff – like buying tickets and coordinating a carpool. But we still had to answer one tough question: where to eat?

We have a pretty diverse crowd at Factual. We would need to find a restaurant that served the right food, had enough room for big groups, and was open before the game. If our party grew, we would have even more requirements – is the restaurant wheelchair accessible? Do they allow smoking? What about a kid’s menu?

Thankfully, our latest release of Restaurants data can more than answer those questions.

The updated dataset, which now includes over 1.2 million restaurants with 43 extended attributes, has improved as follows:

  • Added over 400,000 restaurants
  • Increased coverage across extended attributes by 4.5% — with the biggest gains in: hours of operation (+11%), delivery (+38%), cuisine (+22%), and price (+10%)
  • Hours listings for over 350,000 restaurants, all listed in standardized JSON format (an illustrative example follows this list)
  • Cuisine types for over 1 million restaurants
  • Price, rating, and delivery options for more than 500,000 restaurants each
    (Check out our docs for a full list of extended attributes.)
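
To illustrate the hours format mentioned above: the value is a JSON object of open/close intervals per day. The snippet below is only an illustrative assumption about the shape; the authoritative schema is in our docs:

{"monday":[["11:30","22:00"]],"friday":[["11:30","14:00"],["17:00","23:00"]]}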

The new data is derived from 60+ million contributions from over 160 thousand sources. With this round, our US Restaurants dataset is now one of the largest and most comprehensive sources of restaurant data available today.

And of course, all of this is easily accessible through the Factual API. To go back to our example, let’s say we wanted to find a restaurant in Downtown Los Angeles:

/t/restaurants?filters={"locality":"los angeles","neighborhood":{"$search":"Downtown"}}

We also want to make sure that it serves dinner and has a full bar. And does it take reservations? We have a pretty big party.

/t/restaurants?filters={"locality":"los angeles","neighborhood":{"$search":"Downtown"},"reservations":true,"alcohol_bar":true,"meal_dinner":true}

And can we see its hours? Better safe than sorry.

/t/restaurants?filters={"locality":"los angeles","neighborhood":{"$search":"Downtown"},"reservations":true,"alcohol_bar":true,"meal_dinner":true,"hours":{"$blank":false}}

These queries are just a small sampling of what you can do with our lightning-fast API. Please keep in mind that all queries made against our Global Places data can also be made against Restaurants data. Get started today by requesting an API key or a download, and look for continuous refinement and expansion in the coming months.

-Ben Coppersmith, Data Specialist

Gil Speaking at Techonomy on “The Meanings of Data”

Gil will be participating in a panel, “The Forest for the Trees: The Meanings of Data,” at the upcoming Techonomy Conference on Sunday, November 11th at the Ritz-Carlton in Tucson, Arizona.

The panel discussion will be moderated by Harvard Business Review and will include a lively debate about data patterns.

We live in a world of patterns. Now we’re getting better at discerning them. As we see the big patterns in human behavior, and in the movement of money, products, jobs, weather, energy, disease, and even molecules and stars, a new era of understanding dawns. Can companies and governments draw proper conclusions fast enough? Where will this world of patterns discerned take us?

If you are attending the conference, be sure to come by and say hi to Gil!