Factual’s Geopulse Audience product assembles real-world profiles for millions of smartphone users around the world. A suite of sophisticated geo-fencing, machine-learning, and heuristic methods is used to convert the input, a set of lat/long records for a particular device, into a colorful description of the user. This description includes demographic, behavioral, and geographic information, such as a user’s age, income, and ethnicity, whether they are a likely golfer, mattress shopper, or electronics buyer, and which places they have visited over the past year.
As part of our ongoing QA of Geopulse Audience profiles, we had the opportunity to delve into this rich dataset of users in search of emergent consumer patterns. We asked questions such as: Who is likely to visit which places? Are there places that are often visited in concert? Which people exhibit which consumer behaviors?
We calculated the pairwise correlations among 85 places, 9 demographic descriptors, and 25 behavioral segments for a set of more than 30 million user records, using a Clojure library for distributed statistics.
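To make the computation concrete, here is a minimal single-machine sketch of a pairwise correlation matrix. It assumes the Pearson coefficient, which the post does not name explicitly, and it is written in plain Python rather than the distributed Clojure library we actually used. The field names and values are made-up toy data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant field
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Correlate every field with every other field (a Cartesian join)."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical toy data: one value per user for each field.
columns = {
    "starbucks_visits": [3, 0, 5, 1, 4, 2],
    "atm_visits": [2, 0, 4, 1, 5, 1],
    "is_golfer": [0, 1, 0, 1, 0, 0],
}
corr = correlation_matrix(columns)
```

Each entry `corr[a][b]` lies between -1 and 1, and the matrix is symmetric, so in practice only half of it needs to be computed.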
First, we looked at a bird’s-eye view of the space: the correlations for a Cartesian join on the full set of places, behavioral segments, and demographics. While at first glance this plot looks inscrutable, there are a couple of noteworthy observations. First, as a sanity check, records are perfectly correlated with themselves. Second, there are a lot more purple squares (positive correlations) than brown squares (negative correlations). This reflects a positive bias in the way that we gather information about a particular device. An application that streams geotags more often has more geotags for ubiquitous places like Starbucks, ATMs, and movie theaters. Because the number of visits to common places is a function of the number of geotags, place visits to Starbucks, ATMs, etc. are positively correlated. These positive correlations are more likely to surface in Audience profiles than are true negative correlations, such as a person who often visits McDonalds being that much less likely to visit Burger King. This skew toward positive correlations can also apply to behavioral segments that are learned in part based on place visits.
Also, there are several fields that look vaguely like white stripes, i.e. they appear to be evenly distributed across most other fields. These white stripes include affluent consumer, age, college student, commuter, entertainment enthusiast, financial customer, female, income, leisure seeker, and NFL enthusiast. We expect income, age, and gender to be more or less equally distributed across places because most of the places present in these records are visited by a wide range of people (McDonalds, CVS, etc.). It is also plausible that behavioral segments like affluent consumer (one who frequents non-chain stores), financial customer (one who uses banks or ATMs), entertainment enthusiast (one who frequents movie theaters, dance clubs, etc.), date nighter (one who frequents restaurants and bars), commuter (one who travels more than five miles to get to work), leisure seeker (one who goes to playgrounds, parks, and pools), and NFL enthusiast are somewhat equally distributed across demographics, other segments, and visits to various places.
Next, we zoom in on some interesting features. We grab the 15 fields with the highest mean absolute correlation (most correlated across the board) and the 15 fields with the highest standard deviation in correlation (a rough proxy for fields that are highly correlated with a few others). Birds of a feather flock together, it would appear. Different car dealerships feature as a block of highly correlated fields, as car dealerships tend to be located near one another and, if you’re shopping for a car, you are likely to visit more than one lot. Clothing stores such as Old Navy, JC Penney, Abercrombie and Fitch, Gap, Banana Republic, H&M, American Eagle Outfitters, and Victoria’s Secret cluster together as well, likely for the same reason (what we see in this block of highly correlated retail stores is a topology of the typical American mall!).
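The field selection described above can be sketched as follows. This is an illustration rather than our production code: it assumes the correlation matrix is a nested dictionary, and it leaves each field’s perfect self-correlation out of both statistics, which is an assumption the post does not spell out. The matrix values are hypothetical.

```python
import statistics

def rank_fields(corr, k):
    """Return the top-k fields by mean absolute correlation and by
    standard deviation of correlation, ignoring the diagonal."""
    mean_abs = {}
    spread = {}
    for field, row in corr.items():
        others = [r for other, r in row.items() if other != field]
        mean_abs[field] = sum(abs(r) for r in others) / len(others)
        spread[field] = statistics.pstdev(others)
    top_mean = sorted(mean_abs, key=mean_abs.get, reverse=True)[:k]
    top_spread = sorted(spread, key=spread.get, reverse=True)[:k]
    return top_mean, top_spread

# Hypothetical correlation matrix for four fields.
corr = {
    "jeep_dealer": {"jeep_dealer": 1.0, "dodge_dealer": 0.6, "gap": 0.1, "age": 0.0},
    "dodge_dealer": {"jeep_dealer": 0.6, "dodge_dealer": 1.0, "gap": 0.1, "age": 0.0},
    "gap": {"jeep_dealer": 0.1, "dodge_dealer": 0.1, "gap": 1.0, "age": 0.05},
    "age": {"jeep_dealer": 0.0, "dodge_dealer": 0.0, "gap": 0.05, "age": 1.0},
}
top_mean, top_spread = rank_fields(corr, 2)
```

In this toy matrix the two dealerships top both rankings, because their one strong mutual correlation raises both the mean and the spread of their rows.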
Rather than ask only whether a user is more likely to shop at two stores that are in close proximity, we can look for stores that we do not expect to be near one another but that are still significantly correlated. In this case, we are interested in the elements off the block diagonal in the above image, such as the correlation between Dairy Queen and Jeep and Dodge dealers, between JC Penney visitors and Mitsubishi dealers, and the curious possibility that consumers are not using cash to purchase their Hyundais.
We partition places into a few sets (retail stores, car dealerships, ATMs/Starbucks/movie theaters) and identify correlations between places from different groups. We plot correlations between some of these places below. It looks like if you’re going to NAPA Auto Parts, you’re likely to eat burgers at Sonic; if you’re shopping at Nordstrom, you’re probably picking up pet supplies at PetSmart; and if you’re frequenting ATM Banks, it may be to fuel your Starbucks addiction. The last of these is likely a function of the fact that ATM Banks and Starbucks co-occur quite frequently (80,000 times) and is just another manifestation of the more data, more visits/segments bias.
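A small sketch of the cross-group filtering step, under stated assumptions: the group labels, place names, and correlation values below are hypothetical, and `cross_group_pairs` is an illustrative helper, not a function from our pipeline.

```python
from itertools import combinations

def cross_group_pairs(corr, group_of, min_abs=0.0):
    """List (place_a, place_b, r) for places in different groups,
    strongest absolute correlation first."""
    pairs = []
    for a, b in combinations(sorted(corr), 2):
        if group_of[a] != group_of[b] and abs(corr[a][b]) >= min_abs:
            pairs.append((a, b, corr[a][b]))
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)

# Hypothetical group assignment and pairwise correlations.
group_of = {
    "jc_penney": "retail",
    "gap": "retail",
    "mitsubishi_dealer": "dealership",
    "jeep_dealer": "dealership",
}
raw = {
    ("jc_penney", "gap"): 0.5,
    ("mitsubishi_dealer", "jeep_dealer"): 0.6,
    ("jc_penney", "mitsubishi_dealer"): 0.2,
    ("jc_penney", "jeep_dealer"): 0.05,
    ("gap", "mitsubishi_dealer"): 0.03,
    ("gap", "jeep_dealer"): 0.1,
}
# Expand the pair list into a symmetric nested dictionary.
corr = {}
for (a, b), r in raw.items():
    corr.setdefault(a, {})[b] = r
    corr.setdefault(b, {})[a] = r

pairs = cross_group_pairs(corr, group_of)
```

The within-group pairs (retail with retail, dealership with dealership) are dropped even though they carry the largest correlations, which is exactly the point of looking off the block diagonal.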
We can double-check that we’ve identified the major clusters of places by running a metric multidimensional scaling (MDS) algorithm on our various places (temporarily removing segments for the sake of clarity). MDS positions each place in a two-dimensional space so that the distances between points approximate the dissimilarities between the corresponding places. We see that the majority of car dealerships are clustered in the top left corner, though notably Mercedes and Land Rover dealers are in their own space. In the top right-hand corner, we see the Nordstrom, American Eagle, GameStop cluster we identified earlier. We may also laugh to see the male demographic sit squarely at the center of Best Buy, Subway, and McDonalds.
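For readers curious about the mechanics, below is a minimal sketch of classical (Torgerson) MDS, one common form of metric MDS. The post does not say which variant or library we used, so treat this as an illustration only. It double-centers the squared distance matrix and extracts the top eigenvectors by power iteration, and it assumes the input distances are close to Euclidean. The place names and distances are hypothetical; in practice one might derive a distance such as 1 - r from each correlation r, which is also an assumption.

```python
import math

def classical_mds(dist, dims=2, iters=200):
    """Embed n items in `dims` dimensions from an n x n distance matrix."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    # Double centering: B = -1/2 * J * D^2 * J (D is symmetric).
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * dims for _ in range(n)]
    for d in range(dims):
        # Deterministic, normalized starting vector for power iteration.
        v = [math.sin(i + 1 + d) for i in range(n)]
        length = math.sqrt(sum(x * x for x in v))
        v = [x / length for x in v]
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        eigval = sum(v[i] * bv[i] for i in range(n))  # Rayleigh quotient
        if eigval > 0:
            for i in range(n):
                coords[i][d] = v[i] * math.sqrt(eigval)
        # Deflate so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigval * v[i] * v[j]
    return coords

# Hypothetical distances: two dealerships near each other, two stores near each other.
names = ["jeep_dealer", "dodge_dealer", "gap", "h_and_m"]
dist = [
    [0.00, 0.28, 0.96, 1.00],
    [0.28, 0.00, 1.00, 0.96],
    [0.96, 1.00, 0.00, 0.28],
    [1.00, 0.96, 0.28, 0.00],
]
coords = classical_mds(dist)
```

Because these toy distances come from an exact planar layout, the embedding reproduces them; with real correlation-derived distances the two-dimensional picture is only an approximation.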
As a final investigation (though with such a rich dataset there are myriad cool questions one could ask), we ask whether particular behavioral segments are likely to co-occur and what, if any, demographic biases exist for particular consumer behaviors.
A plot like the one above provides powerful insight into consumer segments. It also allows us to see whether various segments are behaving as we expect them to. Each segment was derived from an independent model, so this type of cross-examination based on external ‘truthiness’ is very powerful. For example, it seems reasonable that a user is less likely to be a college student the older they are or the more they appear to be a frequent traveller. Affluent consumers have higher incomes, date nighters overlap a fair bit with entertainment enthusiasts, and males are more likely than females to be golfers.
This exploration is the first of a series in which we QA Geopulse Audience profiles by comparing what they encode about user segments, demographics, and place visits to what we know about those fields based on external sources of information, such as consumer surveys and census data. These analyses enable us to refine our models and maximize the quality of our Geopulse Audience product.
Please email me at email@example.com if you have any questions or feedback about these results or if you would like to learn more about being a Data Engineer at Factual!
- Natasha Whitney, Data Engineering Intern
1. In order to ensure that our results will hold for samples of varying sizes, we only included correlations that had a 5% or smaller probability of occurring by mere chance given the number of records that were used to calculate the correlation (see this discussion of statistical significance for correlations).
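As an illustration of this kind of filter, here is a sketch that uses the Fisher z-transformation with a normal approximation, which is reasonable for large samples. This is an assumption made for the example; the footnote does not specify which significance test was applied, and the function names are hypothetical.

```python
import math

def correlation_p_value(r, n):
    """Approximate two-sided p-value for a Pearson correlation r
    computed from n records, via the Fisher z-transformation."""
    if n <= 3:
        raise ValueError("need more than 3 records")
    if abs(r) >= 1:
        return 0.0
    z = math.atanh(r) * math.sqrt(n - 3)
    # Two-sided tail probability of a standard normal variable.
    return math.erfc(abs(z) / math.sqrt(2))

def is_significant(r, n, alpha=0.05):
    """Keep a correlation only if chance alone would produce it
    with probability alpha or less."""
    return correlation_p_value(r, n) <= alpha
```

With millions of records even a very small correlation passes this filter, so significance here says that a relationship is unlikely to be noise, not that it is strong.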