First Release of Factual C# / .NET Driver

We’re pleased to announce the release of the officially supported C# / .NET driver for Factual.

Get going!

The driver is available on Nuget. To install the driver into your Visual Studio project, go to the package manager console window and do this:

Install-Package FactualDriver

See the README for more details, including instructions for non-Nuget use.

Once you have the driver installed in your project, you can create an authenticated handle to Factual like this:

using FactualDriver; // namespace assumed from the package name

Factual factual = new Factual(MY_KEY, MY_SECRET);

(If you don’t have an API key, sign up to get one. It’s free and easy.)

Basic Query Example

Do a full-text search of Factual’s Places table for entities matching “Sushi Santa Monica”:

factual.Fetch("places", new Query().Search("Sushi Santa Monica"));

You can find out more about detailed query capabilities in the driver’s README as well as in our Core API docs.
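
Fetch hands back the API’s raw JSON response. Here’s a minimal sketch of reading the results; it assumes Fetch returns the JSON as a string, and uses Json.NET for parsing:

// Requires: using Newtonsoft.Json.Linq;
var response = factual.Fetch("places", new Query().Search("Sushi Santa Monica"));
var json = JObject.Parse(response);
// In the V3 response envelope, rows live under response -> data.
foreach (var row in json["response"]["data"])
{
    Console.WriteLine(row["name"]);
}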

Geo Filter Example

The driver supports all of Factual’s Geo Filtering. Here’s an example of finding Starbucks locations near a latitude, longitude:

factual.Fetch("places", new Query()
                              .Field("name").BeginsWith("Starbucks")
                              .WithIn(new Circle(34.06018, -118.41835, 5000)));

Crosswalk Example

Factual’s Crosswalk service maps third-party identifiers (Yelp, Foursquare, etc.) for businesses and points of interest to one another, where each ID represents the same place.

Here’s an example using the C# driver to query with a Factual ID and get the matching entities from Yelp and Foursquare:

// Start from the Factual ID of the place to cross-reference...
var query = new Query().Field("factual_id").Equal("97598010-433f-4946-8fd5-4a6dd1639d77");
// ...and OR together the third-party namespaces we want IDs for.
query.Or
     (
       query.Field("namespace").Equal("yelp"),
       query.Field("namespace").Equal("foursquare")
     );
factual.Fetch("crosswalk", query);

Resolve

Use Factual’s Resolve to enrich your data and match it against Factual’s:

factual.Fetch("places", new ResolveQuery()
                              .Add("name", "McDonalds")
                              .Add("address", "10451 Santa Monica Blvd")
                              .Add("region", "CA")
                              .Add("postcode", "90025"));

Geopulse

The C# driver supports Factual’s Geopulse API. This gives you hyperlocal context for any geographic point in the world. Check it out:

factual.Geopulse(new Geopulse(new Point(34.06021, -118.41828))
                     .Only("commercial_density", "commercial_profile"));

Thanks go to Sergey Maskalik for his care and dedication in the creation and continuing development of this driver.

We hope you’ll take the Factual C# / .NET driver and build something great!

Sincerely,
Aaron
Software Engineer

Factual Makes a House Call with Improved Healthcare Provider Data

In the U.S., we have many choices when considering healthcare providers for ourselves or our loved ones. Unfortunately, much of this information is scattered, disjointed, and unreliable. If you’re shopping around for a specific type of doctor or specialist based on the insurance you have, that information is not easily available via the big search engines or the major health portals and apps. Insurance companies have made it fairly difficult for both consumers and application developers to get data on the doctors they reimburse.

Fortunately, Factual has spent months poring over numerous sources and tens of millions of inputs to build enhanced Healthcare Provider data, which now includes the critical “insurance carried” field. Additionally, this 1.8-million-record dataset includes other attributes such as medical affiliations, gender, languages spoken, and education.

Among the initial launch partners for this data will be Doctor.com, a leading health site helping consumers connect with the healthcare providers best suited to address their specific needs and preferences.

Factual is focused on delivering high-quality data to its customers, such as Doctor.com, and we’re constantly improving our data quality. This new release of Healthcare Providers draws on more sources of data, has better rules for ensuring you see only high-quality, sensible data, and, of course, is served through our ultra-fast API.

Querying Data

As a developer, you can use the API with the same filter syntax you’re already familiar with (these examples are formatted for readability; remember to URL-encode your request values). These examples show the HTTP requests to make, but be sure to make life easy and check out the API drivers.

This query returns the first twenty female doctors in Omaha (use the offset parameter to pull the next set).
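
Here’s a sketch of that request (the table name health-care-providers and the exact field names are assumptions on my part; check the table schema for the authoritative names):

http://api.v3.factual.com/t/health-care-providers?filters={"gender":"female","locality":"Omaha"}&limit=20

Adding &offset=20 would pull the next twenty.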

The text fields in the table have been indexed, which enables full-text search against them. Developers may find it easier to just create ad hoc queries rather than querying against specific attributes. This query produces almost identical results to the one above, with a few extra rows because words can appear in fields other than the expected one, such as an address containing “Omaha Blvd”.
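
For instance, something like this (again with an assumed table name, and spaces left unencoded for readability):

http://api.v3.factual.com/t/health-care-providers?q=female doctor omaha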

Here we do a combination of filters and free text search:
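
For example, pairing free text with a geo filter (the coordinates are just a point in Omaha, and the table name remains an assumption):

http://api.v3.factual.com/t/health-care-providers?q=female M.D.&geo={"$circle":{"$center":[41.2565,-95.9345],"$meters":5000}}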

Now we have a precise location and a fairly precise query, without having to know that the column names were degrees and gender.

Let’s also make sure she’s an experienced doctor, with at least 10 years in practice. To do this, we need to add a filter attribute with an inequality (see the Core API documentation on filters for further technical details):
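
Roughly like this, using the API’s $gte operator (years_in_practice is a hypothetical field name standing in for the real attribute):

http://api.v3.factual.com/t/health-care-providers?q=female M.D.&geo={"$circle":{"$center":[41.2565,-95.9345],"$meters":5000}}&filters={"years_in_practice":{"$gte":10}}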

Learning More

We hope that you’re healthy, and your users too. But when it’s time to see a doctor, dentist, chiropractor, or one of a multitude of other providers, we think this data will prove to be a helpful resource.

The Data Stack: A Structured Approach

At Factual, we think about data day and night. Our goal is not just to create products, but to create new data economies that draw people to participate because we offer a better data experience. Because of this mission, we organize our work using what I call the data stack.

The idea of a data stack became necessary at Factual as we developed our initial plan to pursue many data verticals at once. These verticals have very different data, but the challenges associated with accessing and improving them turn out to have a number of commonalities. The structure of the stack emerged from my experience building, investing in, and observing data-driven businesses. The data stack refers to a set of categories that describe the different capabilities needed to transform data into more valuable forms. At Factual, we gain significant efficiencies by starting with a common framework, which leads to pooled expertise and maximum reuse of technology.

Could the notion of a data stack be relevant outside of Factual? I believe that using the data stack as an organizing principle can help people avoid a number of common problems.  In today’s blog, I attempt to explain why.

Data Stack Basics

The number of technology layers involved in processing data is daunting, as is the number of steps required to make data useful from beginning to end. It is a huge challenge in most organizations just to understand the data that is available, let alone gain a coherent picture of how data flows from beginning to end when it is put to use. For example, in most businesses there is a department responsible for business intelligence that connects and collects information through an extract, transform, and load process that cleans and joins the information. This information is formed into a standard schema from which insights are derived through analysis, reports are rendered, and applications are built. What is often lost, in my opinion, is that there is an army of analysts working all around the company who are going through similar steps. Everyone involved would benefit from a wider awareness of the common process that takes place, so that knowledge and techniques could be shared.

Without common awareness of the data stack—the stages in the end-to-end process of putting data to work—people tend to organize efforts around specific technologies or repositories. You find data warehousing groups, analytics groups, reporting groups, and so on, each of which creates its own version of a data stack.

A company built around the data stack uses the concept to organize the common categories of transformations and capabilities applied to data, without tying them to specific technologies. As I will expand upon later, creating a common way of talking about the lifecycle of data has numerous benefits:

  • Brings more awareness of the available universe of data services
  • Leads to more standardization, improving alignment and optimization of data-related efforts
  • Speeds up development by increasing reusability and sharing of knowledge about more powerful ways of using technology
  • Lowers costs of integration
  • Catalyzes innovation and experimentation
  • Increases emphasis on managing data as a flow, not a repository

Here are the elements of the data stack:

  • Resource. Any data resource or service that can be used to access data. These resources may be thought of as raw data, what some have been calling the “oil of the 21st century.” But just as likely, a resource may be the end result of a different data value chain: the improved data offered by some other company or process.
    Key question: Do you have a catalog of all resources that are readily accessible?
  • Connect. Refers to the way that you connect to a resource. The Web created a fairly standard way of connecting to information through URIs and HTTP or through RESTful API calls. Portable authorization schemes like OAuth have created a way to share permissions. But many resources have not adopted standardized approaches.
    Key question: Do you have standardized ways of connecting to all available resources?
  • Collect. Incorporates the notion of connecting to something one or more times to collect everything needed. The web crawl is a well-known form of collection, but the growth of sensors, machine data, and the data exhaust we all produce open up many new forms of collection.
    Key question: How often do you connect to your resources to ensure the freshness of your collections?
  • Cache. Once collected, it often makes sense to cache what you’ve collected so that you can look at it carefully on your own time rather than having to go back to the source. Remember, creating a cache should rarely be a goal in itself, but a means to dramatically speed up a process.
    Key question: Have you optimized the cache so that all interested parties can access the collected data efficiently?
  • Index. Indexing is another key step to bringing efficiency to your processes: it’s the way to access information you’ve cached without scanning everything. One very common type is the full-text index, which, as Google does for the web, indexes every word of every document for instant access.
    Key question: Does your indexing strategy optimize for the most important access patterns?
  • Clean. Data cleaning often involves fixing errors, filling in missing data, or eliminating duplicate records. A cleaning process can involve application of basic cleaning rules, manual intervention, or advanced machine learning.
    Key question: Do you know whether your data needs cleaning?
  • Join.  Data often lives in silos but can be connected. This can be easy if you have a clean relational model or if you have standardized on common identification schemes.  In other cases, it can be incredibly painful.
    Key question: Is data integration a challenge that is stopping you from asking and answering important questions?
  • Derive. New and valuable information can be generated by crunching all the data you have. Such derivative data is especially valuable if you have a unique algorithm or are differentiated by the set of data you can access.
    Key question: Do you have a catalog of common and proprietary analysis and machine learning techniques?
  • Render. The world is very interested in data visualization because it improves the user experience of any consumer of that information. A basic form of rendering might involve transforming a numerical rating into an image showing a number of stars, or into a sad, neutral, or happy face. Much of rendering and visualization is used to support applications and user interfaces.
    Key question: Do you have standards for the way data is rendered and common technology to support those standards?
  • Resource/InfoService/Application. The last step is to provide the results as another resource that can then be easily consumed by another service or end-user application. When you provide the results as a resource, the whole cycle can start again. Applications, of course, can use the results of many different data stacks to achieve their goals.
    Key question: Is it easy to find and understand the data and capabilities of your information services and applications?

In later blog entries, I’ll look at specific technologies in each category, explain ways to apply each capability of the stack to understand various types of information value chains, and describe why the data stack is really not so much a stack as a wide-ranging network of services that can be used piecemeal.

At Factual, we have data value chains that work together in complex ways. For example, in our data contribution ecosystem, we have one value chain that uses various capabilities of the data stack to rapidly get changes into our canonical database. We have another value chain that operates in parallel to apply a more time-consuming, and ultimately more fine-tuned, way of evaluating contributions. Both value chains are needed to make the canonical database both timely and accurate.

Applications of the Data Stack

So far, so good. But what good is the data stack? How can it help you improve your ability to get more value from data?

Here are three ways that I use the stack to help me in my work. I suspect they will work for others as well.

1) Normalizing Data Products

Every day, I closely follow the news about new companies offering data-related services or, more generally, working in the big data space. When I hear about a new product or service, I decompose the idea using the capabilities in the stack. That helps me quickly figure out whether there is anything special or new going on, and how it might be integrated with or complementary to Factual.

Here are some examples of normalizing data products based on the data stack.

Twitter:

  • Consumers (resource) are engaged via web and mobile apps (connect). Tweets are propagated to internal distribution mechanisms (collect) and hosted and indexed (cache, index). Derivative data, such as retweet counts (derive), is then offered cleanly (render) to the final consumer or through the API (new resource).

Klout:

  • Twitter (resource) is among a set of underlying APIs that are accessed (connect) to build a Klout Score (derive) that is in turn offered through a beautiful interface (render) and through its own API (new resource).

DataSift:

  • Various APIs, including Twitter and Klout (resources), are accessed (connect) and archived (collect, cache) to offer advanced filtering (index) of a robust, enriched historical Twitter archive (new resource).

Wikipedia:

  • A person (resource) logs in and edits a page (connect) and submits a change (collect); the change is posted to Wikipedia (distribution), indexing is re-run on the page (index), and, in parallel, a moderation process may be triggered for some pages (improve).

2) Product Design

The data stack is also a huge help when designing new products, especially when it comes to thinking about where your services sit in relation to others.

First of all, it helps to understand the raw materials you have to work with. Twitter views individuals as the resource and connects to them via a variety of client-side tools. The Twitter API then becomes a valuable raw resource for Klout’s influence score, which is in turn offered via Klout’s own API. DataSift uses Klout’s API to build an enriched Twitter archive index. And an individual may notice an interesting pattern while playing with DataSift and add a comment to a Wikipedia page. (Yes, this is a bit circular, but that is the beauty of this burgeoning ecosystem.)

3) Optimizing an Organization

At Factual, we had the luxury of building our organization with the data stack in mind. The way we have organized our engineering and product development processes reflects a division of labor, with related areas of the stack clustered under the right leadership. We architected the company so that all raw documents we collect are stored in the same place (dCache), can be processed using a common Hadoop-based system, access cleaning rules from a common rule store, are indexed using a pluggable framework (currently Solr and Postgres), and are hosted generically either on Amazon EC2 or in our data center.

But even though most organizations I encounter are vitally concerned with creating value from data, many of them have a haphazard organization of capabilities, with a lot of silos and barriers to communication. The problem of connecting to a common data source may be solved a dozen different ways. The same is true for approaches to collection, indexing, caching, and so on. As a result of this diffusion of effort, investment capital is squandered, expertise is rarely reused, and everything takes longer than it should. Rarely does such an organization perform to its potential to create value from data.

Conducting a simple census of what each part of an organization is doing in terms of the data stack is a great start to understanding what sort of capabilities are duplicated. Then the hard job of creating an optimal organizational structure can begin.

I hope this tour of the data stack provides some helpful ideas. Please let me know if you put the data stack to work in a productive way. I would love to share more about ways to create great data products.

Gil Elbaz Interviewed at OSCON 2012

Gil was at O’Reilly Media’s Open Source Convention (OSCON) earlier this week, where he was interviewed on the subject of data accessibility.

Keep on Driving

You may have noticed that we’ve built up a pretty extensive array of software drivers to access our API. I’d like to take a few moments to provide insight into our driver strategy and the lessons that have informed it.

First and foremost: we anticipate virtually all developers will prefer to use a driver when we’ve made it available. The core API exists as a neutral layer to support our drivers, and as a mechanism to fall back on when the language you are using doesn’t have an available driver. Our reason for providing drivers is to make it easier for you to write code, and for us to help you with your code if things go awry. Another way to think about it is like connecting Java to a SQL database: once someone has written a stable JDBC driver, most users would take advantage of that driver rather than writing their own.

We released our “V3” API about a year ago. Early on, most users were left to write their own code to connect to our API from within the platforms they were using. We did a lot of question answering, question asking, and debugging of both our code and our customers’ code. We also surveyed our developers to answer basic questions, like what languages they were developing in and which features related to our core offerings mattered most to them. Based on our survey results (see chart below), we introduced our first set of “drivers” for the most common development platforms among our developer base about six months ago.

Survey question: Which of the following are you using (or planning to use) to connect to the Factual API?

Since then, we’ve had time to reflect and refactor, as well as to expand our coverage to a lot of other languages. Here’s some of what we learned along the way.

Documentation by example is everything

Developers don’t like to read long documents and usually skip straight to the example section, if there is one. Hopefully our documentation reflects this lesson. If not, let us know! Aside from packing our core API docs with lots of examples, we’ve made sure that each driver is abundantly stocked with language-specific code examples. In the case of iOS, we even provide an entire working sample app to look at.

Our API has been growing by leaps and bounds

In fact, our API has been growing faster than we can update the drivers to keep up. To solve this, we implemented “raw request” functionality in every driver, guaranteeing that there is always a way to make your requests, even for features that are barely out of the starting blocks. The raw request feature in each of our drivers provides a barebones way to make an arbitrary Factual API request with a handful of request parameters. You provide the endpoint you are hitting, a map of parameters, and your API key. We take care of the encoding, OAuth, and everything else needed to make sure the request gets to Factual, and that any error along the way is handled cleanly. While this doesn’t always offer the same level of convenience as a dedicated function might (e.g., some of our drivers have a row filter construction set for generating really sophisticated filter objects), it guarantees that you can make any currently unsupported request with relative ease.
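
In the C# driver, the pattern looks roughly like this. This is a hedged sketch: the method name RawQuery and its exact signature are assumptions on my part, so check the driver’s README for the real call:

// Requires: using System.Collections.Generic;
// You supply the endpoint path and a map of raw parameters; the driver
// handles the URL encoding, OAuth signing, and error handling.
var parameters = new Dictionary<string, object>
{
    { "q", "starbucks" },
    { "limit", 10 }
};
string response = factual.RawQuery("t/places", parameters);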

Make it easy to home in on where things break

There’s a lot to the Factual stack. When you connect your code to Factual, one of the most important things is making sure we can quickly determine whether a bug lives in one of our drivers, in your code or configuration, or in our core API. All our drivers have debug modes that dump as much helpful information as is available, and we try to include more than enough in our API error responses to help piece together what has really gone wrong.

Each programming language may have a different way of doing things correctly

The Ruby Way isn’t necessarily compatible with the Zen of Python. Some languages make multi-threaded automatic garbage collection easy. Some, not so much. We’ve strived to be consistent with the expectations of each developer community, rather than forcing consistency internally across drivers.

Sometimes you need to go it alone

Determining what should be provided as syntactic sugar in a driver, rather than trusted to developer creativity and coding skill, is sometimes difficult. Adding code to our wrapper might not produce enough value or flexibility for our developers, and it introduces a potential new point of failure to debug. We’ve tried to err on the side of providing driver functions that spare users from writing their own code for common operations that are at high risk for errors. For example, when we introduced our new API, the overwhelming majority of errors users hit occurred in two places: OAuth-signing their requests and encoding filter parameters. As a result, these were the most obvious places for us to provide initial value in our drivers. We’ve taken care of both of these tasks so our developers never have to.
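
To make the second of those concrete: a row filter built in code, such as new Query().Field("name").BeginsWith("Starbucks"), must ultimately be serialized into the API’s JSON filters parameter. Unencoded for readability, that looks roughly like this ($bw is the API’s “begins with” operator):

filters={"name":{"$bw":"Starbucks"}}

The driver builds, URL-encodes, and OAuth-signs that parameter so you never touch it by hand.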

On the other hand, we’ve repeatedly considered whether it would be wise to add any number of convenient driver methods that connect one of the services in our Crosswalk data to the appropriate third-party API. For example, instead of just getting the Foursquare ID out of Crosswalk, we considered building a “checkIn()” method that would actually fetch the ID and do the whole dance for you. What we decided, for now, is that it’s a lot of trouble to build and support this kind of functionality where it interfaces with a third-party API, and we’re likely to fall behind that API’s new functionality over time. While we haven’t quite hit upon a hard rule for which features to support directly in our wrappers, we’re hammering it out and feel confident that our strategy is starting to produce a pretty balanced feature set.

We’re still learning!

Currently, the overwhelming majority of our user base accesses our API using one of our language-specific drivers. We hope developers are picking that path because it is the path of least resistance to getting started with Factual. If you aren’t using one of our drivers yet, we’d love to know why. You can give us your feedback here.