The Wisdom of Crowds: Using Ensembles for Machine Learning

Whether it’s computing Point of Interest similarities for our Resolve service, or predicting business categories from business names, we at Factual use Machine Learning to solve a wide variety of problems in our pursuit of quality data and products. One of the most powerful Machine Learning techniques we turn to is ensembling. Ensemble methods build surprisingly strong models out of a collection of weak models called base learners, and typically require far less tuning when compared to models like Support Vector Machines.

In this post I will attempt to explain why aggregating a collection of base learners is an effective way to build a strong model, and how to do so correctly. Most ensemble methods use decision trees as base learners, and many ensembling techniques, like Random Forests and AdaBoost, are specific to tree ensembles. So, let’s start with a brief introduction to decision trees.

Decision Trees

Training a decision tree for classification (predicting a category) or regression (predicting a number) is actually fairly simple. You start with a table containing all of your rows. Then, for each column (aka “feature”) you assess every possible split point to divide the data into two new sets.

If you have a categorical variable that has values like “red”, “green”, or “blue” you could split on whether the value is “red” or not “red”; whereas if you have a real-valued variable like Age, you could split on the various possible Age values that exist in your training data. If a row’s value lies above the split point it’s partitioned into one group, otherwise the row is partitioned into the other group. After that you continue to split the groups in a recursive fashion to end up with a decision tree.

You measure the quality of a split point using metrics like Information Gain or Gini impurity if you’re doing classification, and reduction in the Standard Deviation if you’re doing regression. Essentially, all these metrics measure how much you’ve reduced your uncertainty by picking that split point. By greedily picking the best split at each step, the decision tree training algorithm tries to form decisions with as few splits as possible.
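To make the greedy split search concrete, here’s a rough sketch of scoring candidate split points by variance reduction for regression (the same idea as standard deviation reduction). It’s purely illustrative: the class and method names are made up, and it’s written for clarity rather than speed.

// A minimal sketch of the greedy split search for one node of a regression
// tree, scoring candidate split points by how much they reduce the variance
// of the target values.
using System;
using System.Collections.Generic;
using System.Linq;

static class SplitSearch
{
    static double Variance(IList<double> ys)
    {
        if (ys.Count == 0) return 0.0;
        double mean = ys.Average();
        return ys.Sum(y => (y - mean) * (y - mean)) / ys.Count;
    }

    // rows are feature vectors; targets are the values we want to predict.
    public static (int feature, double threshold, double gain) FindBestSplit(
        double[][] rows, double[] targets)
    {
        double parentVariance = Variance(targets);
        var best = (feature: -1, threshold: 0.0, gain: 0.0);

        for (int f = 0; f < rows[0].Length; f++)
        {
            // Every distinct value of this feature is a candidate split point.
            foreach (double split in rows.Select(r => r[f]).Distinct())
            {
                var left  = targets.Where((_, i) => rows[i][f] <= split).ToList();
                var right = targets.Where((_, i) => rows[i][f] >  split).ToList();
                if (left.Count == 0 || right.Count == 0) continue;

                // Weighted variance of the two child sets after the split.
                double childVariance = (left.Count * Variance(left) +
                                        right.Count * Variance(right)) / targets.Length;
                double gain = parentVariance - childVariance;   // uncertainty removed

                if (gain > best.gain) best = (f, split, gain);
            }
        }
        return best;
    }
}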

Finally, the leaf nodes of the tree represent predictions. In the case of classification you’d predict the most frequent category in the leaf set, and in the case of regression you’d predict the mean or median value in the leaf set.

To avoid overfitting with trees, one typically uses various stopping criteria to decide when to stop splitting. You can set a max depth for the tree, or a minimum on the number of rows in a leaf set. It’s common sense to stop splitting if your rows have output values that are all of the same category (in the classification case) or have zero variance (in the regression case). There are also various pruning techniques that reduce the size of a tree once it has been fully grown.

Example: Predicting Hotel Star Ratings

Let’s say we wanted to predict how many stars a hotel has. We know that a hotel can have anywhere between 1 and 5 stars. So, if we assume that the distribution of stars across all hotels is uniform, then the Root Mean Square Error (RMSE) of making random guesses is about 1.6 stars. We can be smarter by guessing 3 stars (the mean) every time, which would give us an RMSE of about 1.2 stars.
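As a quick sanity check on those back-of-envelope figures, here’s a small simulation that assumes ratings are spread uniformly over the interval from 1 to 5; it reproduces the rough 1.6 and 1.2 star numbers above.

// Sanity check of the two baselines above, assuming star ratings are
// uniformly distributed over [1, 5]. Purely illustrative.
using System;

class RatingBaselines
{
    static void Main()
    {
        var rng = new Random(0);
        const int n = 1_000_000;
        double seRandom = 0, seMean = 0;

        for (int i = 0; i < n; i++)
        {
            double actual = 1 + 4 * rng.NextDouble();    // "true" star rating
            double guess  = 1 + 4 * rng.NextDouble();    // uniform random guess
            seRandom += (actual - guess) * (actual - guess);
            seMean   += (actual - 3.0) * (actual - 3.0); // always guess the mean
        }

        Console.WriteLine($"random guessing RMSE: {Math.Sqrt(seRandom / n):F2}"); // ~1.63
        Console.WriteLine($"guess-the-mean RMSE:  {Math.Sqrt(seMean / n):F2}");   // ~1.15
    }
}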

As an experiment, let’s see if we can do better using Geopulse data from Factual’s API to predict hotel star ratings. To do this I first queried Factual’s API for Hotels in the Los Angeles area that have a non-blank star rating. Then I took the latitude and longitude values for each hotel and queried the Geopulse API for information about the surrounding area. In the end, I formed a table of data including “median income”, “commercial density”, and various other counts of place categories in the area like “legal & financial”, “health & medicine”, “education”, “arts, entertainment & nightlife”, etc.

After partitioning the data into a training and test set, I trained a decision tree using the training set. The resultant tree predicted the star ratings of the withheld test set with an RMSE of 0.927 stars. This is a decent improvement on our uninformed guess of 3 stars for each hotel. Here is what the tree looks like:

[Figure: the decision tree trained on the hotel data]

Each rectangle represents a split, where the “yes” (or true) branch is on the left and the “no” (or false) branch is on the right. The circles represent leaf nodes and the number inside each circle is the value that would be predicted. It’s interesting that none of the predictions are greater than 3.8, which means the model would never predict a 5-star hotel correctly.

The Bias-Variance Tradeoff

One way to look at the problem of overfitting (and underfitting) a model like a decision tree is through the bias-variance tradeoff. Let’s use regression as an example. You can view the target values (Y) as being the sum of some unknown deterministic function (f(x)) mapping inputs (X) to target values, plus some random noise (R) which corrupts the target values. We can assume that the noise is zero-mean since any bias would be incorporated into f(x).

Y=f(X)+R

You can think of R as being randomness induced from not having perfect features or it could be a truly random source of corruption. Your best bet is to estimate f(x) by a function g(x) using the training data. With that in mind, it turns out you can decompose the Mean Square Error of your g(x) in predicting the real target Y into three terms: the irreducible error due to the variance in R (known as the Bayes Limit), the squared bias, and the variance.

E[(Y - g(X))^2] = Var(R) + Bias(g(X))^2 + Var(g(X))

The bias is the average error of g(x) in estimating f(x) over all possible inputs. Some of your errors may be positive, some may be negative, but if your model is unbiased, then they balance out over the long run. The variance captures how much spread your errors have in estimating f(x) with g(x), so a larger variance means that the magnitudes of your errors are often very large even though they might average out to zero. Notice that bias and variance are measured with respect to estimating f(x) and not Y, since R is random and therefore truly unpredictable.

Overfitting typically leads to a low bias, but a very high variance, while underfitting will give you the opposite scenario. When you tune models you’re looking for a happy medium between bias and variance and there will be a point that minimizes the prediction error overall.

[Figure: prediction error as a function of model complexity, showing the tradeoff between bias and variance]

Squashing the Variance with Bagging

The big insight with regard to ensembling is that if you have a large number of models, each with zero or near-zero bias, then you can reduce the variance by averaging their respective answers. Think of it this way: each separate model gives a prediction that is somewhere near the right answer. If the predictions are well spread around the right answer, then their average should be a lot closer to the right answer than most of the individual predictions.

You may have noticed that there’s a glaring requirement in this scheme: each model in the ensemble has to give a different prediction. If each model gave the same prediction, then averaging the predictions would give you no benefit. It would come out to the same prediction as a single model. So if we’re using trees (or any repeated base learner for that matter) for our ensemble, how can we get them to give different predictions? Well, the answer could be: use different training sets. But we only have one training set, right?

Enter bagging (bootstrap aggregation). To generate different training sets from a single training set we randomly sample the original training set. Some people sample with replacement, others don’t, but the main objective is to have the training sets be relatively large but as different as possible.

Then we train a completely overfit tree on each training set. Why do we want the tree to be overfit? Because then the bias is near zero. Our little averaging trick doesn’t work if the individual models have bias, because averaging does nothing to cancel out bias; it would persist and undo the benefit we gain from reduced variance.
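Schematically, the bagging loop looks something like the sketch below. The trainTree delegate stands in for whatever routine fully grows a tree on a training set; the rest is just bootstrap sampling and averaging, and none of the names here refer to real Factual code.

// A sketch of bagging for regression: draw bootstrap samples, fully grow one
// tree per sample, and average the trees' predictions at query time.
using System;
using System.Collections.Generic;
using System.Linq;

class BaggedRegressor
{
    private readonly List<Func<double[], double>> _trees = new List<Func<double[], double>>();

    public void Train(
        double[][] rows, double[] targets, int nTrees,
        Func<double[][], double[], Func<double[], double>> trainTree,
        Random rng)
    {
        int n = rows.Length;
        for (int t = 0; t < nTrees; t++)
        {
            // Bootstrap sample: n rows drawn with replacement from the training set.
            var idx = Enumerable.Range(0, n).Select(_ => rng.Next(n)).ToArray();
            var sampleRows    = idx.Select(i => rows[i]).ToArray();
            var sampleTargets = idx.Select(i => targets[i]).ToArray();

            // Grow the tree fully (no pruning) so its bias stays near zero.
            _trees.Add(trainTree(sampleRows, sampleTargets));
        }
    }

    // Averaging the individual high-variance predictions squashes the variance.
    public double Predict(double[] row) => _trees.Average(tree => tree(row));
}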

I trained a bagged tree model on the hotel data and the resultant model produced an RMSE of 0.787 stars, which is definitely an improvement over the 0.927 star RMSE of the single decision tree.

Random Forests

So, the bagging trick is pretty exciting. This technique will almost always produce better results than a single base learner could. But how can we do even better than bagging? The answer lies in our technique for producing different base models. Yes, our individual decision trees are producing different predictions because they use slightly different training sets, but slightly is the operative word here.

We need even more difference between the individual trees to get better results. So, how can we introduce more randomness? Well, instead of using every feature as a possible column to split on, we could also randomly sample the features to use. And not only do we randomly sample a new set of features for each tree, we actually randomly sample a different set of features for each split of each tree. A model that uses this technique for creating more variation in the individual trees is called a Random Forest.
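That extra randomness comes down to a tiny helper like the one below, which picks a random subset of column indices to consider at a given split. A common rule of thumb is to sample roughly the square root of the number of features for classification, or a third of them for regression; as before, this is a sketch, not our actual implementation.

// At every split of every tree, a Random Forest only considers a random
// subset of the feature columns. This helper samples that subset.
using System;
using System.Linq;

static class FeatureSampling
{
    public static int[] SampleFeatures(int totalFeatures, int featuresPerSplit, Random rng)
    {
        // Partial Fisher-Yates shuffle: take the first k entries of a shuffled index list.
        var indices = Enumerable.Range(0, totalFeatures).ToArray();
        for (int i = 0; i < featuresPerSplit; i++)
        {
            int j = rng.Next(i, totalFeatures);
            (indices[i], indices[j]) = (indices[j], indices[i]);
        }
        return indices.Take(featuresPerSplit).ToArray();
    }
}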

Notice that bagging could be applied to any base learning model, but Random Forests are specific to trees and this is why the name of the ensembling method alludes to trees: many trees make a forest.

In the case of our hotel data, the Random Forest improved on the result of the bagged trees, achieving an RMSE of 0.745 stars. As we train a Random Forest, we can estimate the test error using what’s called the out-of-bag (OOB) error. You do this by predicting the output of each row using only the trees whose bootstrap samples didn’t include that row. The following graph shows how the OOB error decreased as more and more trees were added to the ensemble.

[Figure: out-of-bag error as a function of the number of trees in the Random Forest]

Notice that the OOB error of the first tree is horrible. This is due to the individual trees being completely overfit to minimize bias. But, as more trees are added, the OOB error due to variance decreases very quickly. By the time there are 50 trees most of the improvement has been realized. You typically see a slower dropoff for larger datasets.
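In code, the OOB estimate is computed roughly like this: record which rows went into each tree’s bootstrap sample, then score every row using only the trees that never saw it. The names and types here are illustrative.

// Out-of-bag RMSE in miniature: each row is scored only by the trees whose
// bootstrap samples did not contain it. inBag[t] is the set of row indices
// that tree t was trained on; trees[t] is that tree's prediction function.
using System;
using System.Collections.Generic;
using System.Linq;

static class OutOfBag
{
    public static double OobRmse(
        double[][] rows, double[] targets,
        IList<Func<double[], double>> trees,
        IList<HashSet<int>> inBag)
    {
        double squaredError = 0;
        int scored = 0;

        for (int i = 0; i < rows.Length; i++)
        {
            // Only trees that never saw row i get a vote on it.
            var oobPredictions = trees
                .Where((_, t) => !inBag[t].Contains(i))
                .Select(tree => tree(rows[i]))
                .ToList();
            if (oobPredictions.Count == 0) continue;   // row landed in every bag

            double error = oobPredictions.Average() - targets[i];
            squaredError += error * error;
            scored++;
        }
        return Math.Sqrt(squaredError / scored);
    }
}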

Other Advancements in Tree Ensembling

So, I guess there really is wisdom in crowds. Random Forests are great models. They have very few tuning parameters (the number of trees and the number of features to sample at each split), and yet in a 2006 study of various machine learning algorithms, including SVMs, Neural Networks and other standard models, Random Forests were found to have among the best average performance, second only to boosting ensembles.

Boosting is a class of ensemble algorithms where each new base learner tries to correct the mistakes of the current ensemble in some fashion. This contrasts with Random Forests, because the individual base learners are unaware of each other’s performance in a Random Forest. AdaBoost is a popular boosting technique for classification. It assigns weights to the rows in your training set, and these weights get larger over time if a particular row is being frequently misclassified by the existing group of trees. The new trees use a modified split metric that uses the weights to try to take better care of the higher-weight rows.
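Stripped to its essentials, one round of the textbook binary AdaBoost weight update looks something like the sketch below (the standard formulation, not Factual’s code).

// One round of the classic binary AdaBoost weight update: rows the newly
// trained tree got wrong have their weights bumped up, so the next tree's
// split metric pays more attention to them.
using System;
using System.Linq;

static class AdaBoostStep
{
    // misclassified[i] is true if the newly trained tree got row i wrong.
    // weights are modified in place and re-normalized to sum to 1.
    public static double UpdateWeights(double[] weights, bool[] misclassified)
    {
        double err = weights.Where((w, i) => misclassified[i]).Sum();
        err = Math.Max(err, 1e-12);                      // guard against a perfect tree

        double alpha = 0.5 * Math.Log((1 - err) / err);  // this tree's vote weight

        for (int i = 0; i < weights.Length; i++)
            weights[i] *= Math.Exp(misclassified[i] ? alpha : -alpha);

        double total = weights.Sum();
        for (int i = 0; i < weights.Length; i++)
            weights[i] /= total;

        return alpha;   // used to weight this tree's vote in the final ensemble
    }
}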

Another recent advancement in tree ensembling is Rotation Forests. These ensembles rotate the feature space with a different rotation matrix for each individual tree in the ensemble. This helps because trees can only form decision boundaries that are perpendicular to the feature axes. If your data happens to have optimal decision boundaries that are more “diagonal”, the trees will probably approximate those boundaries fairly poorly. There are some lofty claims that Rotation Forests can outperform Random Forests and AdaBoost, but I guess only time will tell.

Introducing Factual Global Products

At Factual, our goal is to provide access to definitive global data.  We started with Factual Global Places, our flagship product which combines data on over 58 million local businesses and points of interest with rich APIs to help bring context to every point on the globe.  Today, we’re excited to announce the release of our second major data vertical, Factual Global Products.  Factual Global Products currently provides detailed product data for over 500,000 of the most popular consumer packaged goods in the US, including your favorite health, beauty, food, beverage, and household products.  With Global Products, you can easily access key product attributes, find products using powerful search tools or UPC lookup, and connect to product pages across the web.

Key Attributes

Our Products data provides the key attributes you need to describe a product including brand, product name, size, category, UPC, links to images, and Factual ID.  For each attribute, we clean, normalize, and combine data from multiple sources in order to provide you with the most definitive data for each product.

Powerful Search & UPC Lookup

The Products API makes it easy to find product data using full-text search or fast UPC lookup.  In addition, we enable advanced filtering and faceting by any of our product attributes, including category.  To increase the relevancy of search results, results are returned sorted by popularity based on Factual Product Rank.

For example, you can search for tea using the following request:

http://api.v3.factual.com/t/products-cpg?q=tea

Filter by brand using:

http://api.v3.factual.com/t/products-cpg?q=tea&filters={"brand":"mighty leaf"}

And UPC lookups are easy as:

http://api.v3.factual.com/t/products-cpg?q=052000131512
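If you’d rather go through a driver than raw HTTP, here’s roughly what the same three queries might look like with our C# / .NET driver (introduced in a separate post below), assuming the products-cpg table can be addressed by name just like the places table:

Factual factual = new Factual(MY_KEY, MY_SECRET);

// Full-text search for tea
factual.Fetch("products-cpg", new Query().Search("tea"));

// Full-text search for tea, filtered to a single brand
factual.Fetch("products-cpg", new Query().Search("tea").Field("brand").Equal("mighty leaf"));

// UPC lookup
factual.Fetch("products-cpg", new Query().Search("052000131512"));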

More examples and full documentation are available in our Developer Documentation.

Crosswalk for Products

Crosswalk for Products connects Factual IDs to product data and related purchase pages at more than 25 third party services, including major retailer and review sites.  This allows you to integrate with third party APIs to pull in additional product data such as pricing and reviews, and to more easily integrate with affiliate programs.  When available, we provide third party IDs, such as SKUs, and enable you to convert between IDs at different retailers using Factual IDs as the common key.

Crosswalk for Products also provides a way for merchants and retailers to increase traffic and sales.  Qualified merchants can add links to their products directly to Crosswalk, making those links available to application developers and consumers around the world.  Interested merchants can learn more by visiting our Merchant Partnership page.

Product Data for All

In short, Factual Global Products makes it easier than ever to access and build great applications on top of rich product data.   And combined with Factual Global Places, developers now have access to the data infrastructure that helps power mobile commerce.  We can’t wait to see what you build.   We’re also excited to be able to announce two great partners, NetPlenish and Grokr, who are already using data from Global Products, and for the additional partners we will be announcing soon.

Visit the Factual Global Products page to learn more, or get started by requesting an API key or download.  If you are already a Factual developer, you can use your existing API key.  As always, please don’t hesitate to send us your feedback and suggestions, including data you would like to see next.

John Delacruz
Factual Global Products

A Brief Tour of Factual’s Machine Learning Pipeline

In my previous post, 5 Principles for Applying Machine Learning Techniques, I reviewed the principles for putting machine learning to work. In this blog, I will present a brief guided tour of the pipeline that translates that theory into action.

At the beginning of our pipeline, we have an ocean of data coming at us in many different forms. Some of it is super high quality. Some of it is junk. Most of it is in between.  Here are the broad stages of our pipeline.

  • The first task is to make sense of the data (clean it)
  • Put it in a standard form (canonicalize it)
  • Determine if the data applies to existing entities (resolve it)

Performing those tasks involves handling a huge amount of data, leaving interesting corner cases that we attack by:

  • Using our own algorithms (applying advanced techniques)
  • Determining which changes are trustworthy (trust rank)
  • Figuring out how to make sure our data is correct (quality control)

I’ll finish off this post by describing how we have two versions of our pipeline (one quick, one more heavy duty) and by adding a few notes about the technology we use.

Clean and Canonicalize

The algorithmic pipeline starts with the idea of cleaning the data. Imagine a restaurant called Johnny’s BBQ Joint. People could refer to it as “Johnny’s”, “Johnny’s Joint” or using a bunch of other names. When cleaning data, we have to find out which words are meaningful and which are not. For example, “LAX Hilton” and “LAX” are clearly referring to different places, even though just one word is different.

We can determine whether words are meaningful by grouping the data in various ways. For example, suppose we’re looking at business names. If we group the words by location, we might observe that “Hilton” is more interesting than “LAX” because it has less geographic clustering.

Similarly, we can identify stopwords (such as “the”, “and”, and “of”) by looking at adjacent word correlation.  The problem can be solved with a few heuristics:

  • A word like “the” will be found very frequently and next to all kinds of other words.
  • A word like “Starbucks” will be found fairly frequently but next to a small variety of words like “cafe” or “coffee” (which often implies a business chain).
  • A word like “pizzeria” will be found frequently and next to a large variety of words like “Johnny’s” or “Tony’s”, which often indicates a business category.

Doing these kinds of analyses lets us improve our automated data-cleaning efforts.
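As a toy illustration of that idea, you could tally how often each word occurs and how many distinct words it appears next to across a corpus of business names, along these lines (the names are invented and any thresholds you would apply on top are a matter of tuning):

// A toy version of the neighbor-diversity heuristic: count how often each
// word occurs and how many distinct words it occurs next to.
// High frequency + huge neighbor variety     -> likely stopword ("the")
// Decent frequency + few distinct neighbors  -> likely chain ("starbucks")
// High frequency + many distinct neighbors   -> likely category ("pizzeria")
using System;
using System.Collections.Generic;

static class NeighborDiversity
{
    public static Dictionary<string, (int frequency, int distinctNeighbors)> Profile(
        IEnumerable<string> businessNames)
    {
        var frequency = new Dictionary<string, int>();
        var neighbors = new Dictionary<string, HashSet<string>>();

        foreach (var name in businessNames)
        {
            var words = name.ToLowerInvariant()
                            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
            for (int i = 0; i < words.Length; i++)
            {
                frequency.TryGetValue(words[i], out int count);
                frequency[words[i]] = count + 1;

                if (!neighbors.TryGetValue(words[i], out var set))
                    neighbors[words[i]] = set = new HashSet<string>();
                if (i > 0) set.Add(words[i - 1]);
                if (i < words.Length - 1) set.Add(words[i + 1]);
            }
        }

        var profile = new Dictionary<string, (int, int)>();
        foreach (var kv in frequency)
            profile[kv.Key] = (kv.Value, neighbors[kv.Key].Count);
        return profile;
    }
}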

Cleaning is really tough. We use all of the following approaches and more in our cleaning process:

  • Grammar induction
  • Vocabulary building
  • Language modeling for names

There are all sorts of fascinating problems related to cleaning data. Of course, when you have huge amounts of data, it is easy to find all the ways people misspell a name like Britney Spears. But if we are to do a good job, we cannot just focus on where we have data. We have to apply our algorithms to sparse data. Lots of our corner cases occur where we have sparse data.

The next stage of the pipeline is to canonicalize the data, which means to put the data into a standard form. While this sounds easy enough, just think of all the ways you can write street names or all the different ways you can write out a phone number, not just in the US but all over the world. Did I mention that we are doing all of this work for more than 50 languages? That, of course, adds to the challenge.

Resolve

Resolving the data is a key step early in the pipeline. Specifically, we want to figure out whether the data is for a new place or whether it’s for a place we already track. (We assign every business and point of interest we track a unique ID number.) Of course, doing a good job cleaning and canonicalizing makes this easier, but resolving data is a very nontrivial task. One of our most useful insights here is that each attribute has two important qualities: its information content and its resolution ability.

The information content (a.k.a. entropy) of an attribute is how much it tells us about a place. For example, a state code puts a business into one of 50 buckets, a zip code puts it into one of ~45,000 buckets, and an address puts it into one of millions of buckets.
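Measured as entropy over the observed value distribution, that intuition looks roughly like the sketch below: a uniform spread over 50 states is about 5.6 bits, a uniform spread over ~45,000 zip codes is about 15.5 bits, and addresses carry more still.

// Information content of an attribute, measured as the entropy (in bits) of
// its observed value distribution across a set of records.
using System;
using System.Collections.Generic;
using System.Linq;

static class AttributeEntropy
{
    public static double EntropyBits(IEnumerable<string> observedValues)
    {
        var counts = observedValues
            .GroupBy(v => v)
            .Select(g => (double)g.Count())
            .ToList();
        double total = counts.Sum();

        // H = -sum over values of p * log2(p)
        return counts.Sum(c =>
        {
            double p = c / total;
            return -p * Math.Log(p, 2);
        });
    }
}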

The resolution ability of an attribute describes if that attribute is helpful for identifying matches, non-matches, both, or neither. For example:

  • A state code is useful for identifying non-matches but useless for determining matches (if two businesses are in different states, they are definitely different; if they are in the same state, then we don’t know whether they’re different).
  • A phone number is somewhat useful for identifying matches and non-matches, but it’s not perfect because different businesses can share one phone number (e.g. doctors in an office building, stores in a chain, or stores in a mall) and one business can have multiple phone numbers.
  • An address is very useful for determining non-matches (a single business won’t have two addresses), and somewhat useful for determining matches (a single address often – but not always – houses exactly one business).

Once we understand a value’s distribution and have a similarity scorer, we can then figure out how to weigh each attribute and use this information to determine whether two places are the same.
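Schematically, that final decision reduces to something like a weighted blend of per-attribute similarity scores compared against a threshold. The weights and the threshold below are placeholders; the real scorer is considerably more involved than this sketch.

// A schematic resolve decision: blend per-attribute similarity scores using
// per-attribute weights, then compare against a threshold.
using System.Collections.Generic;
using System.Linq;

static class ResolveScoring
{
    // similarities: attribute name -> similarity in [0, 1] between two records
    // weights:      attribute name -> how heavily that attribute's evidence counts
    public static bool SameEntity(
        IDictionary<string, double> similarities,
        IDictionary<string, double> weights,
        double threshold = 0.85)
    {
        double weightedScore = similarities.Sum(kv => weights[kv.Key] * kv.Value);
        double totalWeight   = similarities.Sum(kv => weights[kv.Key]);
        return weightedScore / totalWeight >= threshold;
    }
}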

Advanced Techniques

Once we have done a first pass of cleaning, canonicalizing, and resolving the data, the bulk of the data has either been determined to be useful or tossed out. What’s left are the corner cases, a much smaller amount of data that is nonetheless crucial for Factual to gain and maintain an edge in quality. One approach, of course, would be to toss all of these cases out, but that would leave too much on the table. In fact, we do toss out a lot of these cases, but we don’t do it casually.

It is also important to point out that much of the advanced work relies on the work done in the previous stage. The overall technique in many of these advanced cases is to use regularities or patterns in the data (from the easy cases) to inform the hard cases. One of the reasons we spend so much time on the corner cases is that solving them often shows us how to do a better job with the earlier algorithms. For example, we may find that a single location has many entities, which may require extra processing.

Here are some of the advanced techniques we use:

  • Advanced language models to map many variations of a word into an abstract form. Stemming is a form of this.
  • Domain models, i.e. language modeling informed by domain knowledge
  • Normalization
  • Name scoring
  • Address scoring
  • Geographic distance metrics
  • Entity-oriented distance metrics that go beyond the term frequency–inverse document frequency (TF-IDF) method
  • Other natural language processing techniques
  • Web-scale clustering

We also do a lot of work to infer facts from a variety of different types of data. For example, let’s say that a restaurant lists hours from 10:00am to 9:00pm on its web site. But then, from our social media data, we start seeing check-ins or Tweets or other comments about being at the restaurant after 9:00pm up to midnight starting after Memorial Day. Could this be an expansion of hours? Summer hours? It is the job of our algorithms to figure this out when we can for the 60 million places we track worldwide.

Trust Ranking

It is not at all uncommon for conflicts to appear after even the most advanced processing. Then the questions become:

  • Which data should I trust?
  • Which sources should I trust?
  • How can we decide between conflicting changes?
  • How do we identify malicious actors?

Over time, we build a trust rank for sources, so we know which sources are high quality and which are full of errors. We also have to take into account that, for various reasons, people may put bad data out there. One way of identifying bad data was inspired by the boosting technique developed by machine learning researchers.

Quality

Finally, even after we have done our best, we know that errors slip through our algorithms. So we have a variety of quality control methods to make sure that we catch false positives, data that looks good but isn’t.

One important principle of our systems is that we don’t assume everything can be automated. We take random samples of our data and run them by our quality control team and through teams organized with the help of Amazon’s Mechanical Turk service. This allows us to identify specific cases that slipped through our algorithms. The errors then go back to our algorithm designers who try to find improvements.

Quick and Stable Pipelines

To make sure our data is updated as quickly as possible, we have two pipelines of data. One pipeline updates data quickly and applies all of the changes for which we have high confidence after a short amount of processing. The second pipeline takes longer but applies all the algorithmic firepower we have. The changes from this pipeline are applied whenever the processing is complete. We employ many such approaches to balance tradeoffs like the one between speed and quality discussed here.

Technology We Love

We are massive users of open source. As you might expect, we use Hadoop and HBase, but we also do a lot of work making Hadoop work better by using Clojure and Cascalog, along with dcache, our URI-based repository.

Cascalog (built on top of the Cascading library) is our preferred method for joining data and managing the flow of data through the pipeline. Cascalog allows us to write our joining and algorithmic processing code in Clojure, and to easily orchestrate complicated streams of processing depending on what we find.

The most exciting aspect of working at Factual in general and on the algorithm team in particular is that wonderful feeling of being just at the beginning of something new. We’ve made a lot of progress, but there is so much more to do and so much more to learn. If this post has provoked any questions, please send them along.

 

5 Principles for Applying Machine Learning Techniques

Here at Factual we apply machine learning techniques to help us build high quality data sets out of the gnarly mass of data that we gather from everywhere we can find it.  To date we have built a collection of high quality datasets in the areas of places (local businesses and other points of interest) and products (starting with consumer packaged goods).  In the long term, however, Factual is about perfecting the process of building data regardless of the area, so many of our techniques are domain agnostic.  In this post, I cover 5 principles we use when putting machine learning techniques to work.

1. Don’t Ignore the Corners

The biggest mistake people make when they attempt to use machine learning on data at huge volumes is ignoring the corner cases. If you have a dataset in the billions, you will have 4.5 sigma events in the thousands. To do a great job, you have to handle cases that are this far away from the mean.

The key is not giving up too soon. The challenge is that the game gets harder as you move away from the mass of data to the exceptions. In the corners, simple and straightforward techniques won’t work, and you must apply increasing amounts of algorithmic firepower to tease out the meaning. A lot of people give up instead of doing this work and it impacts the quality of their data.

Think of Olympic sprinters. Millions of people can do a 100-meter dash in 20 seconds. A few hundred can do it in less than 10 seconds. The elite runners spend huge amounts of time battling for hundredths of a second.

At Factual, we run this race toward high quality data in a pipeline with several stages we will describe in a later blog post. Our first approaches handle most of the data. When you are dealing with the bulk of the data, having a huge data set really helps. Our second stage gets us much of the rest of the way (to the level of the 10 second sprinter). Then we spend a huge amount of time battling with the remaining data. In other words, the cost of getting the corners right in time and effort is much larger than getting the rest of the data right. But that’s where we make new discoveries, and that’s why we spend extra time so our datasets can be relied on.

2. Be Attentive to the Boundaries

Boundary cases are another area where we pay significant attention. If we decide that everything with a score above 0.85 is good and everything below is bad, then the stuff near the boundary is really worth being careful about. This helps us tune the threshold to make sure we are right most of the time and also helps us figure out other approaches for catching the edge cases.

3. Spend Time on Special Cases

When we spend time on corners and boundaries, we often find different categories that need special treatment. Entities like Starbucks may have many locations, all with the same name. Entities like The White House may have to be curated manually or locked. Some locations may be loaded with entities. As time has passed, we have enhanced our algorithms to handle these special cases. We are always on the lookout for more.

4. Listen to the Data

Probably the biggest thing I’ve learned from my work on advanced algorithms is to listen to the data. When I go in the wrong direction, it is usually because I’m trying to make the data fit an idea I started with rather than watching and listening to the subtleties that the data is whispering through the technology.

5. Love Your Data

All of this involves building teams with smart people, giving them the best tools, using and contributing to open source, allowing room for mistakes, and, perhaps most of all, creating a culture of people who love data. If you love data too, please get in touch. We are always looking for like-minded people to help us do this work.

Stay tuned for my next post where I give a tour of our machine learning pipeline.

First Release of Factual C# / .NET Driver

We’re pleased to announce the release of the officially supported C# / .NET driver for Factual.

Get going!

The driver is available on NuGet. To install the driver into your Visual Studio project, go to the package manager console window and do this:

Install-Package FactualDriver

See the README for more details, including instructions for non-NuGet use.

Once you have the driver installed in your project, you can create an authenticated handle to Factual like this:

Factual factual = new Factual(MY_KEY, MY_SECRET);

(If you don’t have an API key, sign up to get one. It’s free and easy.)

Basic Query Example

Do a full-text search of Factual’s Places table for entities matching “Sushi Santa Monica”:

factual.Fetch("places", new Query().Search("Sushi Santa Monica"));

You can find out more about detailed query capabilities in the driver’s README as well as the general docs for our Core API.

Geo Filter Example

The driver supports all of Factual’s Geo Filtering. Here’s an example of finding Starbucks locations near a given latitude and longitude:

factual.Fetch("places", new Query()
                              .Field("name").BeginsWith("Starbucks")
                              .WithIn(new Circle(34.06018, -118.41835, 5000)));

Crosswalk Example

Factual’s Crosswalk service lets you map third-party (Yelp, Foursquare, etc.) identifiers for businesses or points of interest to each other where each ID represents the same place.

Here’s an example using the C# driver to query with a Factual ID, getting entities from Yelp and Foursquare:

var query = new Query().Field("factual_id").Equal("97598010-433f-4946-8fd5-4a6dd1639d77");
query.Or
     (
       query.Field("namespace").Equal("yelp"),
       query.Field("namespace").Equal("foursquare")
     );
factual.Fetch("crosswalk", query);

Resolve

Use Factual’s Resolve to enrich your data and match it against Factual’s:

factual.Fetch("places", new ResolveQuery()
                              .Add("name", "McDonalds")
                              .Add("address", "10451 Santa Monica Blvd")
                              .Add("region", "CA")
                              .Add("postcode", "90025"));

Geopulse

The C# driver supports Factual’s Geopulse API. This gives you hyperlocal context for any geographic point in the world. Check it out:

factual.Geopulse(new Geopulse(
                           new Point(34.06021, -118.41828)).Only(
                                                             "commercial_density",
                                                             "commercial_profile"));

Thanks go to Sergey Maskalik for his care and dedication in the creation and continuing development of this driver.

We hope you’ll take the Factual C# / .NET driver and build something great!

Sincerely,
Aaron
Software Engineer