Updated Restaurants Data

Last week, excitement and joy spread through the office as news broke that the Lakers had hired Mike D’Antoni as the new head coach (at least, excitement in my part of the office — Gil is more of a Clippers fan).

Some of us made plans to attend a game. First we did the easy stuff – like buying tickets and coordinating a carpool. But we still had to answer one tough question: where to eat?

We have a pretty diverse crowd at Factual. We would need to find a restaurant that served the right food, had enough room for big groups, and was open before the game. If our party grew, we would have even more requirements – is the restaurant wheelchair accessible? Do they allow smoking? What about a kids’ menu?

Thankfully, our latest release of Restaurants data can more than answer those questions.

The updated dataset, which now includes over 1.2 million restaurants with 43 extended attributes, has improved as follows:

  • Added over 400,000 restaurants
  • Increased coverage across extended attributes by 4.5% — with the biggest gains in: hours of operation (+11%), delivery (+38%), cuisine (+22%), and price (+10%)
  • Hours listings for over 350,000 restaurants, all listed in standardized JSON format
  • Cuisine types for over 1 million restaurants
  • Price, rating, and delivery options for more than 500,000 restaurants each
    (Check out our docs for a full list of extended attributes.)

The new data is derived from 60+ million contributions from over 160 thousand sources.  With this round, our US Restaurants dataset is now one of the largest and most comprehensive sources of restaurant data available today.
And of course, all of this is easily accessible through the Factual API.  To go back to our example, let’s say we wanted to find a restaurant in Downtown Los Angeles:

/t/restaurants?filters={"locality":"los angeles","neighborhood":{"$search":"Downtown"}}

We also want to make sure that it serves dinner and has a full bar. And does it take reservations? We have a pretty big party.

/t/restaurants?filters={"locality":"los angeles","neighborhood":{"$search":"Downtown"},"reservations":true,"alcohol_bar":true,"meal_dinner":true}

And can we see its hours? Better safe than sorry.

/t/restaurants?filters={"locality":"los angeles","neighborhood":{"$search":"Downtown"},"reservations":true,"alcohol_bar":true,"meal_dinner":true,"hours":{"$blank":false}}
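
Since the filters are just a JSON object passed in a URL parameter, they are easier to build in code than by hand. Here is a minimal Python sketch, assuming only the path and `filters` parameter shown above. The helper name `build_query` is ours (it is not part of any Factual client), and the API host and authentication are deliberately left out.

```python
import json
from urllib.parse import urlencode, urlsplit, parse_qs


def build_query(table, filters):
    """Return the path and query string for a filtered read of `table`.

    Hypothetical helper for illustration; not part of any Factual client.
    """
    # Compact JSON keeps the URL short; urlencode percent-escapes it safely.
    encoded = urlencode({"filters": json.dumps(filters, separators=(",", ":"))})
    return f"/t/{table}?{encoded}"


# The last query from the post: downtown LA, takes reservations,
# has a full bar, serves dinner, and has hours on file.
filters = {
    "locality": "los angeles",
    "neighborhood": {"$search": "Downtown"},
    "reservations": True,
    "alcohol_bar": True,
    "meal_dinner": True,
    "hours": {"$blank": False},
}
url = build_query("restaurants", filters)

# Round trip: decoding the URL gives back the same filter object.
decoded = json.loads(parse_qs(urlsplit(url).query)["filters"][0])
assert decoded == filters
```

Building the filter as a dictionary also means new requirements (wheelchair access, a kids’ menu) are one extra key rather than another round of hand-edited braces and quotes.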

These queries are just a small sampling of what you can do with our lightning-fast API.  Please keep in mind that all queries made against our Global Places data can also be made against Restaurants data.  Get started today by requesting an API key or download, and look for continuous refinement and expansion in the coming months.

-Ben Coppersmith, Data Specialist

Gil Speaking at Techonomy on “The Meanings of Data”

Gil will be participating in the panel “The Forest for the Trees: The Meanings of Data” at the upcoming Techonomy Conference on Sunday, November 11th at the Ritz-Carlton in Tucson, Arizona.

The panel discussion will be moderated by Harvard Business Review and include a lively debate about data patterns. 

We live in a world of patterns. Now we’re getting better at discerning them. As we see the big patterns in human behavior, and in the movement of money, products, jobs, weather, energy, disease, and even molecules and stars, a new era of understanding dawns. Can companies and governments draw proper conclusions fast enough? Where will this world of patterns discerned take us?

If you are attending the conference, be sure to come by and say hi to Gil!

Factual Adds Ingredients and Nutrition Information to Global Products




They say, “You are what you eat”.  Unfortunately, it’s not always easy for consumers to make informed decisions based on the nutrient content and ingredients of the foods and products they purchase.  That’s why we are excited to announce the addition of ingredient and nutrition attributes to Factual Global Products.  With ingredient lists for over 350,000 of the most popular consumer packaged goods, and nutrition facts for over 150,000 of the most popular packaged foods, Factual Global Products is now one of the largest and most complete sources available for ingredient and nutrition information.

The Factual API makes it easy to query products using this data.  For instance, if you want a snack but are watching your waistline, you can query for snacks that are 100 calories or less using the following query.

/t/products-cpg-nutrition?filters={"category":"snacks","calories":{"$lte":100}}

In addition to calories, you can access and query data for 13 other nutrition-related attributes: servings, serving size, fat calories, total fat, saturated fat, trans fat, cholesterol, carbohydrates, dietary fiber, sugars, sodium, potassium, and protein.

With the inclusion of ingredients, you can now access the entire ingredients list for a product, as well as search for products that contain, or do not contain, certain ingredients.  For instance, if you’re looking to avoid High Fructose Corn Syrup, you can use the “not equal to” parameter ($neq) to find products that do not contain HFCS in the ingredient list.

/t/products-cpg-nutrition?filters={"ingredients":{"$neq":"high fructose corn syrup"}}

If there’s a particular ingredient you’d like to include in your diet, for example green tea, you can easily find products containing that ingredient using the standard $search parameter on the ingredients field.  This will return any product whose ingredient list contains the words “green tea”, matching ingredients such as “Green Tea Extract” and “Camellia Sinensis (Green Tea)”.

/t/products-cpg-nutrition?filters={"ingredients":{"$search":"green tea"}}

 

Data Insights
Having ingredient and nutrition information for so many products gives us the opportunity to glean some interesting insights from the data.  For instance, we can facet the data by attributes such as brand or category to generate counts of products containing certain ingredients by the faceted attribute.  The chart below shows product categories and counts for products containing green tea.  It shows that green tea can be found in everything from shampoo to dog treats.

/t/products-cpg-nutrition/facets?filters={"ingredients":{"$search":"green tea"}}&select=category
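
To make the facet query concrete, here is a rough local analogue in Python: keep the records whose ingredients mention a phrase, then count the matches per category. The records and helper names are invented for illustration, and the case-insensitive substring match is only an approximation of `$search`; the API does this work server-side and may match differently.

```python
from collections import Counter

# Hypothetical sample records; not real Factual data.
PRODUCTS = [
    {"category": "Tea", "ingredients": ["Green Tea", "Natural Flavor"]},
    {"category": "Shampoo", "ingredients": ["Water", "Camellia Sinensis (Green Tea)"]},
    {"category": "Tea", "ingredients": ["Black Tea"]},
    {"category": "Dog Treats", "ingredients": ["Chicken", "Green Tea Extract"]},
    {"category": "Tea", "ingredients": ["Jasmine", "green tea"]},
]


def mentions(product, phrase):
    """True if any ingredient contains `phrase`, ignoring case."""
    phrase = phrase.lower()
    return any(phrase in ingredient.lower() for ingredient in product["ingredients"])


def facet_counts(products, phrase, field="category"):
    """Count matching products per value of `field`."""
    return Counter(p[field] for p in products if mentions(p, phrase))


counts = facet_counts(PRODUCTS, "green tea")
for category, n in counts.most_common():
    print(category, n)
```

Swapping `field="category"` for a brand field (if the records carried one) would give the per-brand view instead.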

 

We also decided to look at sugar content faceted by brand to determine which brands you should reach for if you have a breakfast sweet tooth, or which you should avoid if you prefer to limit your sugar to your cup of coffee.

The chart above shows that if you’re looking for a sugar fix, Malt-O-Meal is your best bet [1].  No surprise then that they feature cereals with names like Chocolate Marshmallow Mateys and Frosted Mini Spooners.  If you’re trying to avoid a sugar high, it’s probably better to stick with brands like Kashi and Nature’s Path.

 

EAN-13
Our Global Products data is also becoming a little more global today with the addition of EAN-13 data.  EAN-13 is a superset of UPC (the US standard), and is the 13-digit code used for identifying retail products worldwide.  Products can now be found via the API using either UPC or EAN-13 by using the query below.

/t/products-cpg?q=0077326830772
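
If you are collecting barcodes on your side before querying, it can help to normalize and sanity-check them first. The sketch below uses the standard EAN-13 check-digit rule (weights alternate 1 and 3 from the left) and the fact that a 12-digit UPC-A code becomes EAN-13 by prefixing a zero. The function names are ours, and nothing here calls the Factual API.

```python
def _is_digits(value, length):
    """True if `value` is exactly `length` ASCII digits."""
    return len(value) == length and value.isascii() and value.isdigit()


def ean13_check_digit(first12):
    """Compute the EAN-13 check digit for the first 12 digits."""
    if not _is_digits(first12, 12):
        raise ValueError("expected exactly 12 digits")
    # Weights alternate 1, 3, 1, 3, ... starting from the leftmost digit.
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(first12))
    return (10 - total % 10) % 10


def is_valid_ean13(code):
    """True if `code` is 13 digits and its last digit is the right check digit."""
    return _is_digits(code, 13) and ean13_check_digit(code[:12]) == int(code[-1])


def upc_to_ean13(upc):
    """Convert a 12-digit UPC-A code to EAN-13 by prefixing a zero."""
    if not _is_digits(upc, 12):
        raise ValueError("expected a 12-digit UPC-A code")
    return "0" + upc


# The code from the query above is a UPC-A code with a leading zero.
assert upc_to_ean13("077326830772") == "0077326830772"
assert is_valid_ean13("0077326830772")
```

The leading zero does not change the check digit, because it contributes nothing to the weighted sum and the remaining digits keep the same weights.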

 

Summary    
Factual Global Products now includes ingredients, nutrition information, EAN-13, and data for more than 650,000 of the most popular consumer packaged goods.  You can learn more and preview the data by visiting the Factual Global Products page.

We’ve just scratched the surface of the types of queries you can perform and insights you can derive from our ingredients and nutrition data, and from the addition of EAN-13.  We are excited to see what you build using all this data.  Get started by requesting an API key or download, and look for additional product categories and the expansion of our international products in the coming months.

Retailers and manufacturers interested in including their products in our Global Products dataset can visit our Merchant Partners page for more information.

John Delacruz, Product Manager

 

 

 

Notes
[1] We first looked at the distribution of sugar content across all breakfast foods.  Median sugar content was 7 g per serving; the 1st quartile was 5 g and the 3rd quartile was 13 g, so the lowest 25% of foods contained 5 g or less and the highest 25% contained 13 g or more.  We designated foods as high sugar if they contained 13 g of sugar or more per serving, and low sugar if they contained 5 g of sugar or less.  We then looked at the top 20 brands that produce breakfast foods, ranked by the number of unique breakfast products they sell.  For each of these brands we plotted the percent of their products that were low sugar versus the percent of their products that were high sugar.  Brands that sell more low sugar products are located towards the bottom right, and brands that sell more high sugar products are located towards the top left.
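
For readers who want to reproduce this kind of chart on their own data, the classification in this note can be sketched in a few lines of Python. The 5 g and 13 g thresholds come from the note; the brand names, sugar values, and function names are made up for illustration.

```python
from collections import defaultdict

LOW_SUGAR_MAX_G = 5    # 1st quartile from the note
HIGH_SUGAR_MIN_G = 13  # 3rd quartile from the note


def classify(sugar_g):
    """Label a product by grams of sugar per serving."""
    if sugar_g <= LOW_SUGAR_MAX_G:
        return "low"
    if sugar_g >= HIGH_SUGAR_MIN_G:
        return "high"
    return "mid"


def brand_sugar_profile(products):
    """Map each brand to (percent low sugar, percent high sugar)."""
    tallies = defaultdict(lambda: {"low": 0, "mid": 0, "high": 0})
    for brand, sugar_g in products:
        tallies[brand][classify(sugar_g)] += 1
    profile = {}
    for brand, tally in tallies.items():
        total = sum(tally.values())
        profile[brand] = (100 * tally["low"] / total, 100 * tally["high"] / total)
    return profile


# Hypothetical (brand, grams of sugar per serving) pairs.
sample = [
    ("Brand A", 3), ("Brand A", 5), ("Brand A", 14), ("Brand A", 8),
    ("Brand B", 13), ("Brand B", 15),
]
profile = brand_sugar_profile(sample)
```

Each brand’s pair is one point on the plot: the first number is its position on the low sugar axis and the second its position on the high sugar axis.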

 

One Step Closer to Representing the Physical World in Real-Time

I’m happy to announce Factual is today extending our entire data stack to include more robust real-time capabilities – not just serving the data, but consuming and assimilating new information. With this release, developers and enterprises can now make changes to Factual Places data in real-time. These real-time contributions are available for US places today, with the rest of the world and Factual Global Products following later this quarter.

As data stewards, we are expected to be the gold standard for data quality and standardization, and our launch today keeps that bar high while introducing a new level of data freshness.  Factual’s expertise lies in data engineering, aggregation, cleaning, and normalization.  A significant part of these data maintenance duties is collating contributions from our many data partners, reconciling conflicts, burning off impurities, and fundamentally improving the data.

To date, we’ve seen over 80 million contributions back to Factual from our data partners, including Yext, the market-leading location software company, which is sending us edits from 70 thousand small businesses.  We then push the most accurate data back to our customers in real-time.

We’re all excited to share more of the inner workings of our real-time data stack in action.  With today’s launch, anything written through our write API gets ushered through our entire data stack in seconds.  Allow me to take you on the journey – the long and varied life of a contributed data point.

  1. Accept new input: Check authentication levels and add raw data to queue.
  2. Validate & Structure: Clean and format raw data to match an accepted semantic type (a phone number, a taxonomic category, etc.).
  3. Resolve: Match the submitted content against our existing database to minimize duplication. Performing a robust fuzzy match against 63 million records can be computationally expensive, so we parallelize this.
  4. Store: Save the new ‘input’ in an HBase backend.  We don’t consider inputs as facts yet, but rather opinions about a fact.  Many trusted ‘opinions’ allow us to validate new information as factual.
  5. Summarize: Determine new factual information by running truth prediction models across all known input data about a given Place entity.  Our machine learning systems have led to smarter models that weigh opinions based on historical trust rank (the overall reliability of the source).
  6. Generate Diff: Inspect newly summarized data for changes (a delta or ‘diff’ in Factual parlance) and distribute updates to internal index clusters and external systems via a PubSub-type mechanism.
  7. Index: Update the PostgreSQL and Solr materializations on EC2 that power our fast APIs.
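
To give a feel for the Summarize and Generate Diff steps, here is a toy Python sketch. It treats each input as an opinion carrying a trust weight, picks the value with the most total trust per attribute, and then reports only what changed. This is an assumption-laden illustration, not our production models: the simple weighted vote, the function names, and the sample data are all made up.

```python
from collections import defaultdict


def summarize(inputs):
    """Pick, per attribute, the value with the highest total trust.

    `inputs` is a list of (source_trust, {attribute: value}) opinions.
    A plain weighted vote stands in for a real truth prediction model.
    """
    votes = defaultdict(lambda: defaultdict(float))
    for trust, opinion in inputs:
        for attribute, value in opinion.items():
            votes[attribute][value] += trust
    return {attr: max(vals, key=vals.get) for attr, vals in votes.items()}


def diff(old, new):
    """Return only the attributes whose summarized value changed."""
    return {attr: value for attr, value in new.items() if old.get(attr) != value}


# Hypothetical opinions about one place.
inputs = [
    (0.9, {"name": "Cafe X", "tel": "(310) 555-0100"}),
    (0.4, {"tel": "(310) 555-0199"}),
    (0.3, {"tel": "(310) 555-0199"}),
]
before = summarize(inputs)

# A new trusted contribution arrives and tips the balance on the phone number.
inputs.append((0.8, {"tel": "(310) 555-0199"}))
after = summarize(inputs)
changes = diff(before, after)
```

Only the entries in `changes` would need to be published downstream, which is the point of generating a diff instead of re-sending the whole record.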

I hope the brief description above gives you a small glimpse into how complex and non-trivial this process is, and why we are so proud of this announcement today.  In the coming weeks, look out for more details on how this all works, what it means to the various customer segments, and how you can optimally take advantage of this.  To get started, get an API key, and read the developer documentation.

-Gil Elbaz

 

You need to know the Power of APIs

Note: This was originally posted by CIO Review on 10/2/2012, available here.

As someone who is dedicated to creating high quality and accessible data, I am encouraged that more and more people are realizing that data in massive volumes does not really mean much unless it is high quality, verifiable, relevant, and accessible. The most important trend I see in the world of data today is that the drive to create quality, accessible data is entering the mainstream.

The rise in data marketplaces or data exchanges is a great sign because they enable distribution to new customer segments, more creative pricing, as well as easier data sharing. Companies like Microsoft’s Azure Data Marketplace, docs, Factual and EMC/Greenplum are creating exchanges that demonstrate, to paraphrase Bill Joy, that there may be more valuable data outside your company than in it. These marketplaces/exchanges will foster the sharing and using of external data sources and will drive the creation of business models so people can get valuable data at a fair trade. There will be a natural process of quality control as the organizers and participants of the markets seek to protect their brands and insist that those who provide data show that it is correct. Third parties will review the quality of data, as they do for most other products.

The rising adoption of Application Programming Interfaces (APIs) outside of the world of tech startups will accelerate data accessibility. Most people do not know that most of the traffic for Twitter comes via APIs. The same is true for other popular Internet companies. APIs allow people to incorporate data directly into applications. To make this work, you have to understand how to build and run an API and what kinds of applications it will support. Companies like Apigee are bringing the infrastructure and knowledge of how to publish APIs to market.

I predict that as more great data becomes accessible through APIs and other means, the idea of a data supply chain will become popular. At Factual, we realized that all over the Internet, relevant data was available, but it was incredibly fragmented, and without any clean standards applied. Our first mission is to collect those fragments along with direct contributions from organizations and individuals and to create data about places and products. We have also created many APIs to help mobile app developers, ad targeting efforts, and businesses. We think that as more and more data becomes accessible, more companies will join us in creating data supply chains.

Road to the Future:

The discipline of Data Science will play a much more pivotal role in shaping every aspect of a business, moving from R&D/back office to being indispensable to core strategy. It’s the difference between having a few Hadoop engineers and employing data strategists who will ensure all functions have access to clean, high quality internal and external data, as well as a deep toolkit to do predictive analysis. While it’s exciting to see positions like Chief Data Officer, and Stanford opening a Data Science 101 class, we are still far from really knowing the combination of talent + software + culture required to realize the impact big data can have across the entire organization. I predict Data Scientists will be in the core DNA of many organizations, and a norm in the workforce rather than an outlier and a minority.

Challenges faced by Entrepreneurs:

The characteristics of successful entrepreneurs will change in the next decade. Open source and cloud technology have dramatically lowered the costs of the development and production of new products. A startup that required $10 million a decade ago can now get going for $500,000 or in some cases even far less. While that sounds rosy, it also means that there is more competition than ever from other startups as well as from nimble groups within larger organizations.

On the consumer side of the business, we have seen a tremendous acceleration around new products and features, and in order to succeed several things have to come together.

Creating great products is of course primary, but communication and marketing skills will become more important than ever as an increasing number of products and services proliferate in the market. Attracting talent is the first chore. Once a product is in place, developing market awareness and changing consumer behavior is the next challenge. Successful entrepreneurs will be more balanced between engineering and communication skills.

In addition, it is clear that Lisa Gansky is right on the money in her characterization of how businesses need to embrace the Mesh. The most significant moat will come from data and relationships, and Mesh-oriented products look to every customer to provide information or reach that extends the value of the entire network to the entire consumer base.