I’m happy to announce that today Factual is extending our entire data stack to include more robust real-time capabilities – not just serving the data, but consuming and assimilating new information. With this release, developers and enterprises can now make changes to Factual Places data in real-time. These real-time contributions are available for US places today, with the rest of the world and Factual Global Products following later this quarter.

As data stewards, we are expected to be the gold standard for data quality and standardization, and our launch today keeps that bar high while introducing a new level of data freshness.  Factual’s expertise lies in data engineering, aggregation, cleaning, and normalization.  A significant part of these data maintenance duties is collating contributions from our many data partners, reconciling conflicts, burning off impurities, and fundamentally improving the data.

To date, we’ve seen over 80 million contributions back to Factual from our data partners, including Yext, the market-leading location software company, which is sending us edits from 70,000 small businesses.  We then push the most accurate data back to our customers in real-time.

We’re all excited to share more of the inner workings of our real-time data stack in action.  With today’s launch, anything written through our write API gets ushered through our entire data stack in seconds.  Allow me to take you on the journey – the long and varied life of a contributed data point.

  1. Accept new input: Check authentication levels and add the raw data to a queue.
  2. Validate & Structure: Clean and format raw data to match an accepted semantic type (a phone number, a taxonomic category, etc.).
  3. Resolve: Match the submitted content against existing records in our database in order to minimize duplication. Performing a robust fuzzy match against 63 million records can be computationally expensive, so we parallelize this.
  4. Store: Save the new ‘input’ in an HBase backend.  We don’t consider inputs to be facts yet, but rather opinions about a fact.  Many trusted ‘opinions’ allow us to validate new information as factual.
  5. Summarize: Determine new factual information by running truth prediction models across all known input data about a given Place entity.  Our machine learning systems have led to smarter models that weigh opinions based on historical trust rank (the overall reliability of the source).
  6. Generate Diff: Inspect newly summarized data for changes (a delta or ‘diff’ in Factual parlance) and distribute updates to internal index clusters and external systems via a PubSub-type mechanism.
  7. Index: Update PostgreSQL and Solr materializations on EC2 which power our fast APIs.
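To make the flow above concrete, here is a toy sketch of steps 2 through 6 in Python. Everything here is illustrative and hypothetical – the function names (`validate_phone`, `resolve`, `submit`), the token-overlap matcher, and the trust weights are all stand-ins, and the real system runs on queues, HBase, PostgreSQL, and Solr rather than in-memory dictionaries:

```python
import re
from collections import defaultdict

# Hypothetical in-memory stand-ins for the HBase input store and the summary.
OPINIONS = defaultdict(list)   # entity_id -> list of (field, value, trust)
CURRENT_SUMMARY = {}           # entity_id -> {field: summarized value}

def validate_phone(raw):
    """Step 2: normalize raw input to an accepted semantic type (US phone)."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 10:
        return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"
    raise ValueError(f"not a US phone number: {raw!r}")

def resolve(name, existing):
    """Step 3: crude fuzzy match (token overlap) against known records."""
    tokens = set(name.lower().split())
    best_id, best_score = None, 0.0
    for entity_id, known_name in existing.items():
        overlap = tokens & set(known_name.lower().split())
        score = len(overlap) / max(len(tokens), 1)
        if score > best_score:
            best_id, best_score = entity_id, score
    return best_id if best_score >= 0.5 else None

def submit(entity_id, field, value, trust):
    """Steps 4-6: store the opinion, re-summarize, and emit a diff (or None)."""
    OPINIONS[entity_id].append((field, value, trust))
    # Step 5: trust-weighted vote over all opinions about this field.
    weights = defaultdict(float)
    for f, v, t in OPINIONS[entity_id]:
        if f == field:
            weights[v] += t
    winner = max(weights, key=weights.get)
    # Step 6: only an actual change produces a diff for downstream indexes.
    if CURRENT_SUMMARY.get(entity_id, {}).get(field) != winner:
        CURRENT_SUMMARY.setdefault(entity_id, {})[field] = winner
        return {"entity": entity_id, "field": field, "new": winner}
    return None
```

For example, resolving a contribution for “Joes Coffee” against a record named “Joe’s Coffee Shop” matches on token overlap; a first trusted phone edit then yields a diff, while a later low-trust conflicting edit is outvoted and produces no diff at all. The real summarization step uses learned trust-rank models rather than a fixed weight per source, but the shape of the computation – many opinions in, one winning fact and a diff out – is the same.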

I hope the brief description above gives you a small glimpse into how complex this process is, and why we are so proud of today’s announcement.  In the coming weeks, look out for more details on how this all works, what it means for our various customer segments, and how you can take full advantage of it.  To get started, get an API key and read the developer documentation.

-Gil Elbaz