Gil Elbaz Interviewed at OSCON 2012

Gil was at O’Reilly Media’s Open Source Convention (OSCON) earlier this week where he was interviewed on the subject of data accessibility.

Keep on Driving

You may have noticed that we’ve built up a pretty extensive array of software drivers to access our API. I’d like to take a few moments to provide insight into our driver strategy and the lessons that have informed it.

First and foremost: we anticipate virtually all developers will prefer to use a driver when we’ve made it available. The core API exists as a neutral layer to support our drivers, and as a mechanism to fall back on when the language you are using doesn’t have an available driver. Our reason for providing drivers is to make it easier for you to write code, and for us to help you with your code if things go awry. Another way to think about it is like connecting Java to a SQL database: once someone has written a stable JDBC driver, most users would take advantage of that driver rather than writing their own.

We released our “V3” API about a year ago. Early on, most users were left to write their own code to connect to our API from within the platforms they were using. We did a lot of question answering, question asking, and debugging of both our code and our customers’ code. We also surveyed our developers to answer basic questions, such as which languages they were developing in and which features related to our core offerings were most important to them. Based on our survey results (see chart, below), we introduced our first set of “drivers” for the most common development platforms among our developer base about six months ago.

Survey question: Which of the following are you using (or planning to use) to connect to the Factual API?

Since then, we’ve had time to reflect and refactor, as well as to expand our coverage to a lot of other languages. Here’s some of what we learned along the way.

Documentation by example is everything

Developers don’t like to read long documents and usually skip to the example section, if there is one. Hopefully our documentation reflects this lesson. If not, let us know! Aside from packing our core API documentation with lots of examples, we’ve made sure that each driver is also abundantly packed with language-specific code examples. In the case of iOS, we even provided an entire working sample app to look at.

Our API has been growing by leaps and bounds

In fact, our API has been growing faster than our resources for updating drivers can keep up with. To solve this, we implemented our “raw request” functionality in every driver – guaranteeing that there is always a way to make your requests, even for features that are barely out of the starting blocks. The raw request feature in each of our drivers provides a barebones manner to make an arbitrary Factual API request with a handful of request parameters. You provide the endpoint you are hitting, a map of parameters, and your API key. We take care of the encoding, OAuth and everything else to make sure the request gets to Factual, and that any error in doing so is handled cleanly. While this doesn’t always offer the same level of convenience as a dedicated function might (e.g., some of our drivers have a row filter construction set for generating really sophisticated filter objects), it guarantees that you can make any currently unsupported request with relative ease.
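The raw request idea described above can be sketched roughly as follows. This is a minimal illustration and not the code of any Factual driver: the function names `buildRawRequest` and `rawGet`, the injected `transport` function, and the `filters` parameter name are all assumptions made for the example. It shows only the parameter-encoding step; OAuth signing and the HTTP call are left to the injected transport.

```javascript
// Hypothetical sketch of a driver's "raw request" escape hatch.
// None of these names come from a real Factual driver.
var querystring = require('querystring');

// Turn an endpoint path and a map of parameters into a request path.
// Object-valued parameters (for example a filter object) are serialized
// to JSON first; querystring.stringify then percent-encodes every value.
function buildRawRequest(path, params) {
  var flat = {};
  Object.keys(params || {}).forEach(function (key) {
    var value = params[key];
    flat[key] = (value !== null && typeof value === 'object')
      ? JSON.stringify(value)
      : value;
  });
  var query = querystring.stringify(flat);
  return query ? path + '?' + query : path;
}

// The transport (signing, HTTP, error handling) is passed in, so the
// sketch runs without credentials or network access.
function rawGet(transport, path, params, cb) {
  transport(buildRawRequest(path, params), cb);
}
```

With a design like this, a feature the driver has no dedicated method for is still reachable: the caller supplies the endpoint and a plain parameter map, and the encoding rules stay in one place.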

Make it easy to home in on where things break

There’s a lot to the Factual stack. When you connect your code to Factual, one of the most important things is making sure that we can quickly conclude whether a bug is manifested in one of our drivers, your code or configuration, or our core API. All our drivers have debug modes that dump as much helpful information as available, and we try to include more than enough in our API error responses to help piece together what has really gone wrong.

Each programming language may have a different way of doing things correctly

The Ruby Way isn’t necessarily compatible with the Zen of Python. Some languages make multi-threaded automatic garbage collection easy. Some, not so much. We’ve strived to be consistent with the expectations of each developer community, rather than forcing consistency internally across drivers.

Sometimes you need to go it alone

Determining what should be provided as syntactic sugar in a driver rather than trusted to developer creativity and coding skills is sometimes difficult. Adding code to our wrapper might not produce enough value or flexibility for our developers, and introduces the possibility of a new debugging point. We’ve tried to stay on the side of providing driver functions that spare users from writing their own code for common operations that are at high risk for errors. For example, when we introduced our new API, the overwhelming majority of errors users were hitting occurred in two places: OAuth-signing their requests and encoding filter parameters. As a result, these were the most obvious places for us to provide initial value in our drivers. We’ve taken care of both of these tasks so our developers never have to.

On the other hand, we’ve repeatedly considered whether it would be wise to add any number of convenient driver methods that connect one of the services in our Crosswalk data to the appropriate third-party API. For example, instead of just getting the Foursquare ID out of Crosswalk, we considered building a “checkIn()” method that would actually fetch the ID and do the whole dance for you. What we decided for now is that it’s a lot of trouble to build and support this functionality where it interfaces with a third-party API, and we’re likely to fall behind new functionality of that third-party API over time. While we haven’t quite hit upon a hard rule on what features to directly support in our wrappers, we’re hammering it out and feel confident that our strategy is starting to produce a pretty balanced feature set.

We’re still learning!

Currently, the overwhelming majority of our user base is accessing our API using one of our language-specific drivers. We hope that developers are picking that path because it is the path of least resistance to getting started with Factual. If you aren’t using one of our drivers yet, we’d love to know why. You can give us your feedback here.

Factual’s API: A Good Fit for Node

Most people have heard the buzz surrounding Node. Words like “fast”, “scalable”, and “concurrency” come to mind. At Factual, we pride ourselves on using (and finding) the right tool for the job. We use everything from jQuery to Hadoop, PostgreSQL to MongoDB, and fortunately our engineering culture gives us the leeway to experiment a little.  However, any technology that we deploy has to be measured and justified.  Qualities such as agility, performance, stability and cost are all part of the equation.

We understand that there are many misinformed developers out there who think Node means instant scalability and performance.  The truth is, as with a lot of technologies out there, it solves a very specific problem.  What does it solve?  How well does it solve it?  What are the tradeoffs?  This is our brief experience with Node and how it worked for us.

We’ve been using JavaScript since the very first iteration of our product with an interactive grid that allowed users to upload, update and fuse datasets into usable visualizations of data. I actually created a Ruby library that allowed us to build on some of the fundamentals of traditional object oriented programming at the time  (now known as Mochiscript).

Since then, we’ve started to focus more on curated data and having an API that can deliver it quickly.  This major shift in our product came with a whole set of new requirements.  We now have to deliver responses in under 200ms while handling basic things like: authentication, permission checking, real time statistics, and query processing.  We had to do this in an agile language that allowed us to build out features quickly to respond to popular developer requests.

Since we had experience in Ruby, we built our first prototype using a barebones Sinatra stack.  This gave us decent performance (+20ms with 120 concurrent connections on top of our datastores), but still, it didn’t quite scale enough for the type of traffic that we were anticipating.

At this point, we were considering Java and/or Clojure (both of which are used by other teams at Factual) but decided to look into Node because we needed to be agile and were already quite familiar with JavaScript.

After porting over Mochiscript from Ruby to Node and seeing just how fast Google’s V8 JavaScript Engine performed, we decided to write a prototype and pit it against our Ruby version.  The difference was outstanding: +10ms on top of our datastores processing 400 concurrent connections.  The combination of Node’s performance and the ability to use Mochiscript to help us scale our code base allowed us to go from prototype to production in a couple of months.  After a few more optimizations, we’re now under +5ms per request.

Before we dive too deep into praising Node, let’s list out some of the tradeoffs we made:

  • Still an immature framework (we discovered a socket leak in the earlier versions of their http libraries)
  • Spaghetti code: callbacks galore.  This can be mitigated through use of good design and sticking with certain patterns.  However, it’s still not fun.
  • Debugging can be soul crushing at times (mostly due to the first 2 problems)

Node was such a good fit for us because:

  1. We were IO bound
  2. We used very little CPU per request
  3. We were able to use evented programming to help us aggressively cache

The first two reasons are nothing new.  In fact, the combination of these two seems to be the poster child of what Node does well.  These two reasons fit our problem set perfectly.  The third reason, however, is the one we want to shed some light on.

Using Node can be a bit of a paradigm shift from what many engineers are accustomed to.  Callbacks can get out of control.  However, the closures provided by JavaScript can be leveraged in other ways.  One way we found extremely useful is to reduce the number of redundant database queries.

Consider the following example of getting user data from the database:

var connect = require('connect');
var users = require('./lib/users');
var app = connect.createServer();

// Look the user up in the database on every request.
app.get('/users/:userId', function (req, res) {
  users.get(req.params.userId, function (err, user) {
    if (err) {
      res.statusCode = 500;
      return res.end("Could not load user");
    }
    res.end("Welcome " + user.name);
  });
});

Now let’s add a caching layer to this:

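var connect = require('connect');
var users = require('./lib/users');
var app = connect.createServer();

// In-process cache of users, keyed by id.
var userCache = {};
function getUser(id, cb) {
  if (id in userCache) {
    cb(null, userCache[id]);
  } else {
    users.get(id, function (err, user) {
      // Remember successful lookups so later requests skip the database.
      if (!err) userCache[id] = user;
      cb(err, user);
    });
  }
}

app.get('/users/:userId', function (req, res) {
  getUser(req.params.userId, function (err, user) {
    if (err) {
      res.statusCode = 500;
      return res.end("Could not load user");
    }
    res.end("Welcome " + user.name);
  });
});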

Since we have caching, we need a way to invalidate this.  The Redis pubsub feature is a great way to handle things like this for realtime updates to your cache:

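var connect = require('connect');
var users = require('./lib/users');
var app = connect.createServer();
var redis = require('redis').createClient();

var userCache = {};
function getUser(id, cb) {
  if (id in userCache) {
    cb(null, userCache[id]);
  } else {
    users.get(id, function (err, user) {
      if (!err) userCache[id] = user;
      cb(err, user);
    });
  }
}

// Drop a cached user whenever another process publishes its id.
// (Callback-style node_redis clients deliver messages via the 'message' event.)
redis.subscribe('update-user');
redis.on('message', function (channel, id) {
  if (channel === 'update-user') delete userCache[id];
});

app.get('/users/:userId', function (req, res) {
  getUser(req.params.userId, function (err, user) {
    if (err) {
      res.statusCode = 500;
      return res.end("Could not load user");
    }
    res.end("Welcome " + user.name);
  });
});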
 

For fun, we have a call that gives us stats on URLs that were visited:

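var connect = require('connect');
var users = require('./lib/users');
var app = connect.createServer();

var userCache = {};
var counts = {};  // number of visits per URL
function getUser(id, cb) {
  if (id in userCache) {
    cb(null, userCache[id]);
  } else {
    users.get(id, function (err, user) {
      if (!err) userCache[id] = user;
      cb(err, user);
    });
  }
}

app.get('/users/:userId', function (req, res) {
  getUser(req.params.userId, function (err, user) {
    if (err) {
      res.statusCode = 500;
      return res.end("Could not load user");
    }
    counts[req.url] = (counts[req.url] || 0) + 1;
    res.end("Welcome " + user.name);
  });
});

// Report the counters straight from memory.
app.get('/stats', function (req, res) {
  res.end(JSON.stringify(counts));
});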

Using these patterns was a byproduct of the evented model Node/JavaScript employs, and it really gives us flexibility in caching, realtime statistics, and on-the-fly configuration.  Our experiences with Redis and/or memcache for caching were great, but in order to squeeze every ounce of performance from our Node servers, we wanted to limit any redundant processing (even to the point where we didn’t want to parse JSON).
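A related pattern in the same spirit is to coalesce concurrent lookups, so that several requests arriving for the same key while a query is still in flight share a single database call. The post does not show this technique; the sketch below is an illustration under that caveat, and `createCoalescingCache` and the `fetch(id, cb)` function it wraps are hypothetical names.

```javascript
// Sketch only: an in-memory cache that also coalesces concurrent lookups.
// `fetch(id, cb)` stands in for something like users.get and must call
// cb(err, value) exactly once.
function createCoalescingCache(fetch) {
  var cache = Object.create(null);    // id -> cached value
  var pending = Object.create(null);  // id -> callbacks awaiting a fetch

  function get(id, cb) {
    if (id in cache) {
      // Capture the value now and answer on the next tick, so callers
      // always get an asynchronous callback.
      var hit = cache[id];
      process.nextTick(function () { cb(null, hit); });
      return;
    }
    if (pending[id]) {
      // A fetch for this id is already running: just wait for it.
      pending[id].push(cb);
      return;
    }
    pending[id] = [cb];
    fetch(id, function (err, value) {
      var waiters = pending[id];
      delete pending[id];
      if (!err) cache[id] = value;
      waiters.forEach(function (waiter) { waiter(err, value); });
    });
  }

  // Call this from whatever invalidation channel is in use.
  function invalidate(id) {
    delete cache[id];
  }

  return { get: get, invalidate: invalidate };
}
```

Because Node runs these callbacks on a single thread, the two maps need no locking, which is much of what makes this kind of bookkeeping cheap in an evented model.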

We understand that there are more cases where Node is NOT the solution.  However, in our case, it was a great fit.  Since we started using Node about a year ago, it has matured tremendously and has great support from both Joyent and the rest of the open source community.  We’ve started using it for various other internal projects.  It is becoming more of a general solution and less niche.  I encourage any developer to explore Node and see if it is a good fit for their problem.  If anything, it’ll get you thinking about IO and how much time you spend waiting on it.

Now Featuring Neighborhoods in Global Places

Today we tagged nearly fourteen million Factual places with neighborhoods. This new data is available now to all Factual partners at no additional cost.

You can see the 50 country breakdown in our Global Places data, including over seven million places tagged with neighborhoods in the US alone. Factual API users will now see the ‘neighborhood’ attribute in all returns from the global table that have been tagged:

     "neighborhood":[
          "Grandes Carrières-Clichy",
          "Montmartre"]

or

     "neighborhood":[
          "Pill Hill",
          "Peralta Villa"]

We’ve used Flickr Neighborhoods to tag our places. This data is in turn based on the Yahoo Geoplanet dataset, which takes a liberal attitude towards what constitutes a ‘neighborhood’: basically any informal, local geography. We’re fans of this resource because its use does not impose upstream license encumbrances on our users, it increases discoverability of the data, and, lastly, it is not tessellated (the neighborhoods overlap), which we think better reflects the situation in the real world.
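Since neighborhoods overlap and only tagged places carry the attribute, client code has to treat ‘neighborhood’ as an optional array rather than a single value. A small sketch of what that might look like follows; the sample rows and the helper names `inNeighborhood` and `countByNeighborhood` are made up for illustration, and only the shape of the ‘neighborhood’ attribute is taken from the examples above.

```javascript
// Sketch: the rows below are illustrative, not real Factual records.
// True when the row is tagged with the given neighborhood.
function inNeighborhood(row, name) {
  return Array.isArray(row.neighborhood) &&
    row.neighborhood.indexOf(name) !== -1;
}

// Count places per neighborhood. A place tagged with two overlapping
// neighborhoods is counted once under each of them.
function countByNeighborhood(rows) {
  var counts = Object.create(null);
  rows.forEach(function (row) {
    (row.neighborhood || []).forEach(function (name) {
      counts[name] = (counts[name] || 0) + 1;
    });
  });
  return counts;
}

var sampleRows = [
  { name: 'Cafe A', neighborhood: ['Grandes Carrières-Clichy', 'Montmartre'] },
  { name: 'Cafe B', neighborhood: ['Montmartre'] },
  { name: 'Cafe C' }  // untagged: the attribute is simply absent
];
```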

We hope you find this a useful enhancement to Factual data. As ever, if you don’t yet have one, get your Factual API key here.

-Tyler Bell
(Puts on cardigan zipper sweater, removes dress shoes for sneakers)

Factual Node.js Driver

We’re pleased to announce the release of the officially supported Node.js driver for Factual.

Get going!

You can get the driver using npm…

$ npm install factual-api

…and then require it in your project and connect:

var Factual = require('factual-api');
var factual = new Factual('YOUR_KEY', 'YOUR_SECRET');

(If you don’t have an API key, sign up to get one. It’s free and easy.)

Basic Query Example

Do a full-text search of the restaurant database for rows that match the terms “Coffee, Los Angeles”:

factual.get('/t/restaurants-us', {q:"Coffee, Los Angeles"}, function (error, res) {
  console.log("show "+ res.included_rows +"/"+ res.total_row_count +" rows:", res.data);
});

You can find out more about basic query capabilities in our Core API docs.

Geo Filter Example

The driver supports all of Factual’s Geo Filtering. Here’s an example of finding Starbucks locations near a latitude, longitude:

factual.get('/t/places', {q:"starbucks",
                          geo:{"$circle":{"$center":[34.041195,-118.331518],"$meters":1000}}},
                          function (error, res) {
  console.log(res.data);
});
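Writing the `$circle` object by hand makes it easy to swap latitude and longitude. One way to guard against that is a small builder like the sketch below. The `circle` helper is hypothetical and not part of the driver; it only reproduces the object shape used in the example above and range-checks its arguments.

```javascript
// Hypothetical helper: builds the geo filter object shown above.
function circle(latitude, longitude, meters) {
  if (typeof latitude !== 'number' || !(latitude >= -90 && latitude <= 90)) {
    throw new RangeError('latitude must be a number between -90 and 90');
  }
  if (typeof longitude !== 'number' || !(longitude >= -180 && longitude <= 180)) {
    throw new RangeError('longitude must be a number between -180 and 180');
  }
  if (typeof meters !== 'number' || !(meters > 0)) {
    throw new RangeError('meters must be a positive number');
  }
  return { $circle: { $center: [latitude, longitude], $meters: meters } };
}

// Same filter as in the example above, built through the helper.
var nearby = circle(34.041195, -118.331518, 1000);
```

The result could then be passed as the `geo` parameter in the call shown earlier. The check will not catch every swapped pair, but it does catch this one, because -118.331518 is not a valid latitude.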

Crosswalk Example

Factual’s Crosswalk service lets you map third-party (Yelp, Foursquare, etc.) identifiers for businesses or points of interest to each other where each ID represents the same place.

Here’s an example using the Node.js driver to query with a Factual ID, getting entities from Yelp and Foursquare:

factual.get('/places/crosswalk', {"factual_id":"57ddbca5-a669-4fcf-968f-a1c8210a479a",
                                   only:"yelp,foursquare"},
                                   function (error, res) {
  console.log(res.data);
});

Resolve

Use Factual’s Resolve to enrich your data and match it against Factual’s:

//
// Resolve an entity, starting with the business name and location:
//
factual.get('/places/resolve', {values:{"name":"huckleberry",
                                        "latitude":34.023827,
                                        "longitude":-118.49251}},
                                function (error, res) {
  console.log(res.data);
});

We’ve put great effort into making our public API fast. We know you expect your Node projects to be fast too. Use this driver to access Factual’s data platform, be fast, and go do great work!

Sincerely,
Aaron
Software Engineer