Validating Mobile Ad Location Data at Factual

Location data appears straightforward at first blush: two numbers — longitude and latitude — combine as a coordinate to identify an unambiguous point on the earth’s surface: X marks the spot, unequivocally.

Location data in the Mobile Ad-tech Ecosystem, however — especially that used by marketers and advertisers — has a number of distinguishing characteristics that make it more problematic. In no particular order, they are:

  1. Unvalidated: the great majority of location data, such as that coming through the Real-Time Bid (RTB) stream, comes from unknown sources: no inherent quality guarantees are attached to the data, and it must initially be viewed with skepticism;
  2. Independent: most coordinate pairs come through the pipes naked, unclothed by metadata. Although mobile devices can supply precision, speed, and heading alongside their location readings, this welcome context is rarely, if ever, present;
  3. Intermittent: the majority of mobile apps register location data infrequently — think of most mobile location data as an extremely low-res digital sample of a rich, analogue behavior. Put another way, most mobile location data represents only a fraction of ongoing activity, with little context about what came before or after.

These are pretty major caveats. How then, with so many qualifiers and so much dubious, isolated, and unvalidated data, can one extract signal from the noise? The answer is a Location Validation Stack, a platform that pre-processes location data before we build critical consumer insights.

Factual has two products that require a Location Validation Stack: Geopulse Audience, which creates geographic, behavioral, demographic and retail profiles based on where people go over time; and Geopulse Proximity, which performs realtime server-side geofencing to the tune of 20k-50k queries per second per server.

Most customers run Factual Location Validation on between three and 100 servers; a single customer may validate location at a rate of five billion location queries per day, with peaks around 500k qps. Taken together, every month Factual processes over 600 billion location data points for Geopulse Audience creation, and over 800 billion of the same for Proximity realtime geofencing. These not-insignificant volumes, combined with the requirements of both asynchronous and real-time data validation, drove us to create a location validation solution that is both fast and intelligent.

Factual’s Location Data Cleaning Process

When a location data point comes down the pipe, we look at it closely and reject it outright if it hits any of our filter criteria. These are:

Coordinate Truncation

Coordinates with three decimal places provide no better than ~100m accuracy, and truncating coordinates (reducing the number of decimal places in a measurement) consistently ‘pulls’ devices in a single direction away from their real location. Factual creates audiences with precise location targeting — our algorithms tie devices to specific businesses, not grid squares — so coordinates with fewer than four decimal places are insufficiently precise, and for small venues a precision of five or more decimal places is optimal. You really cannot use coordinates with fewer than four decimal places for precise real-time geofencing or to create retail-based audiences. Anyone who says differently is selling you something.

Here’s a specific example of what happens when you truncate decimal places. Let’s say that you are at Factual’s HQ in Century City, Los Angeles, and we progressively truncate your coordinate precision. With each decimal removed, your apparent location drifts to the southeast:

5 decimal places (~1m): You are correctly located in Factual’s building (perhaps alongside other businesses on different floors).
4 decimal places (~10m): Factual’s building is large, so you are still located here. However, this precision would place you outside a smaller venue.
3 decimal places (~100m): About a football field from your actual location; you are now across the street in a hotel.
2 decimal places (~1,000m): You are now over 1km away on a golf course.
1 decimal place (~10,000m): You are in another city, and have been eaten by a grue.

Figure 001 — location drift with coordinate truncation; locations with fewer than five decimal places are almost unusable in precision targeting. (Image: Google Maps)
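To make this drift concrete, here is a minimal Python sketch. It is illustrative only: the coordinate is a made-up point near Century City (not Factual’s address), and the distances come from the standard haversine formula.

import math

def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two WGS84 points."""
    r = 6371000.0  # mean earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin(math.radians(lat2 - lat1) / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(math.radians(lng2 - lng1) / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def truncate(value, decimals):
    """Drop decimal places without rounding, as careless code often does."""
    factor = 10 ** decimals
    return math.trunc(value * factor) / factor

lat, lng = 34.05813, -118.41675  # hypothetical point near Century City

for d in (5, 4, 3, 2, 1):
    t_lat, t_lng = truncate(lat, d), truncate(lng, d)
    print("%d decimals: (%f, %f) drifts %.0f m"
          % (d, t_lat, t_lng, haversine_m(lat, lng, t_lat, t_lng)))

Because truncation rounds toward zero, a point with positive latitude and negative longitude (like Los Angeles) always drifts south and east, exactly the pattern shown in the figure.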

Invalid Coordinates

Figure 002 — there’s not much really happening at Null Island (0,0) – but it’s a great trap for bad data, and you can check the weather online (image: NOAA)

This category covers coordinates with bad numbers. The most common offenders are coordinates found at ‘Null Island’ (0,0), but there is also a growing menagerie of points representing classes of errors indicative of sloppy coding or an upstream data issue, such as matching coordinate pairs. ‘Null Island’ is a valid geographic point, but geo-geeks use it as a trap to ‘catch’ bogus device locations — seeing it in data streams always points to places where location data is missing.
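A minimal sketch of this kind of sanity filter follows; the specific checks and thresholds are illustrative examples of the error classes described above, not Factual’s production rules.

def is_invalid(lat, lng, epsilon=1e-6):
    """Reject obviously bogus coordinates before doing any real work (sketch)."""
    # 'Null Island': a (0, 0) reading almost always means missing data.
    if abs(lat) < epsilon and abs(lng) < epsilon:
        return True
    # A matching coordinate pair (latitude equal to longitude) is a common
    # symptom of sloppy coding upstream.
    if abs(lat - lng) < epsilon:
        return True
    # NaN values sneak in from upstream systems; NaN is never equal to itself.
    if lat != lat or lng != lng:
        return True
    return False

print(is_invalid(0.0, 0.0))        # True
print(is_invalid(34.05, -118.42))  # False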

Out of Bounds Coordinates

Figure 003: the effect of swapping longitude and latitude: swapped, in the ocean (red); corrected, in Buenos Aires (green). (map: OpenStreetMap)

When you process billions of records, you’re going to see a lot of weird coordinates. Many are considered ‘out-of-bounds’ — most because they fall outside the range of legal coordinates, but others because they land at the extremes of the earth where very few people live; their appearance in the location data pipeline is generally due to developers inadvertently swapping latitude and longitude. For example, in figure 003 we identify and discard the coordinate -58.436597, -34.607187 (the red marker) as out-of-bounds, because it is deep in the ocean at the southern extremes of the earth. Debugging this erroneous location shows that switching the coordinate order to -34.607187, -58.436597 (the green marker) puts the point in Buenos Aires, almost certainly the legitimate location the developer intended.
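A rough sketch of the range check and the swap heuristic is below; the southern-latitude cutoff is an assumption chosen for illustration, not Factual’s actual rule.

def classify_bounds(lat, lng):
    """Classify a coordinate as ok, probably swapped, or out of bounds (sketch)."""
    def legal(la, ln):
        return -90.0 <= la <= 90.0 and -180.0 <= ln <= 180.0

    def plausible(la, ln):
        # Treat the far southern open ocean as implausible for mobile devices;
        # the -55 degree cutoff is an illustrative assumption (the southernmost
        # cities sit near 55 degrees south).
        return legal(la, ln) and la > -55.0

    if plausible(lat, lng):
        return "ok", (lat, lng)
    if plausible(lng, lat):
        # Reversing the pair yields a believable location, so the developer
        # most likely passed (lng, lat) instead of (lat, lng).
        return "probably swapped", (lng, lat)
    return "out of bounds", (lat, lng)

print(classify_bounds(-58.436597, -34.607187))
# ('probably swapped', (-34.607187, -58.436597)) -- i.e. Buenos Aires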

Blacklisted Coordinates

This filter mechanism catches the biggest proportion of transgressions by identifying apparently high-precision points that have been encoded using a wifi, IP, cell tower, or centroid lookup.

Some of these points may be ‘fraudulent’, but most are just negligent coding on the developer’s part. The best part of this feature is that it does not run off a static list of blacklisted places, but instead evolves its logic from the 20+ billion points we see daily. The approach is built on a statistical model that identifies blacklisted points via a hypothesis-testing framework, learning which points are over-represented relative to all points in the system. The model therefore improves with every point that we see, and requires very little active maintenance. To date we’ve identified 650k bogus points globally using this method, and the list is growing.
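The sketch below illustrates the general idea in Python: count exact coordinates and flag those whose frequency is an extreme outlier under a simple Poisson null. It is a rough caricature with an arbitrary threshold, not Factual’s actual model.

from collections import Counter
import math

def find_overrepresented(points, z_threshold=50.0):
    """Flag exact coordinates that appear far more often than chance allows (sketch)."""
    counts = Counter(points)
    total = sum(counts.values())
    mu = total / len(counts)  # expected hits per distinct point under a uniform null
    blacklist = []
    for point, k in counts.items():
        z = (k - mu) / math.sqrt(mu)  # normal approximation to the Poisson tail
        if z > z_threshold:
            blacklist.append((point, k))
    return blacklist

# One centroid coordinate drowning out genuinely diverse readings:
stream = [(37.999004, -96.97783)] * 5000 + [(34.05 + i * 1e-4, -118.41) for i in range(5000)]
print(find_overrepresented(stream))  # [((37.999004, -96.97783), 5000)]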

Figure 004 – a farmstead in rural Wichita: the most populated place in the world according to unfiltered mobile location logs (image: Google Maps)

One of the most egregious examples of these blacklisted points is the coordinate (37.999004, -96.97783), which corresponds to a lovely, innocuous-looking plot of farmland 30 miles northeast of Wichita, Kansas (figure 004). According to our unfiltered location logs, this point is the most popular location for mobile users across the globe, beating New York, London, and Seoul for top billing.

Our blacklist does not care why a specific place is artificially popular, but this one is easy to explain: it is the geographic center of the continental United States (which by itself speaks volumes about the quality of geodata in mobile location streams). It’s clear that publishers are tagging US locations with the intention of noting that the data point is ‘in the United States’, but the high-precision, low-accuracy coordinate only adds noise to the signal.

Fortunately, because our model has identified this point as curiously over-subscribed, it is ignored when processing audiences and validating geofence inputs, and all is well.

Bad Devices, Bad Apps

Bad apples: we frequently observe device IDs that are over-represented in every data stream we monitor — usually because the device ID is poorly coded, is not passed to the bid stream correctly, or, very likely, is shared between devices. We also detect devices that appear to ‘blip’ between locations, suggesting either travel above Mach 1 or a developer employing randomized locations (figure 005); a sketch of this implied-speed check follows the figure below. Other developers semi-randomize locations in ways that can be observed and blocked. When we see evidence of malicious or detrimental coding, we usually block the whole app from our pipelines and work with our partners to address and remedy the issue.

Figure 005 – real world example of a bad apple: good app (left) vs bad app (right). Most location pathologies are less pronounced (map: OpenStreetMap)
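Here is a minimal sketch of the implied-speed check mentioned above: compute the great-circle distance between consecutive pings and flag devices whose implied speed exceeds roughly Mach 1. A real pipeline would also account for GPS jitter and clock skew; this is an illustration, not Factual’s implementation.

import math

def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two points."""
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin(math.radians(lat2 - lat1) / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(math.radians(lng2 - lng1) / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def has_impossible_jumps(pings, max_speed_mps=343.0):
    """True if any consecutive pair of time-ordered (unix_seconds, lat, lng)
    pings implies travel faster than roughly Mach 1 (sketch)."""
    for (t1, la1, ln1), (t2, la2, ln2) in zip(pings, pings[1:]):
        dt = t2 - t1
        if dt <= 0:
            continue  # skip duplicate or out-of-order timestamps in this sketch
        if haversine_m(la1, ln1, la2, ln2) / dt > max_speed_mps:
            return True
    return False

# A device 'blipping' from Los Angeles to New York in five minutes is not travel:
print(has_impossible_jumps([(0, 34.05, -118.24), (300, 40.71, -74.00)]))  # True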

These checks are applied to all location data consumed by our Geopulse Audience and Geopulse Proximity products. We do this at speeds measured in microseconds (millionths of a second), which means that we can do more verification in less time, sort the wheat from the chaff, and provide the best possible location-based consumer insights.

If you’re interested in learning more about our mobile ad targeting capabilities, please contact us.

Tyler Bell & Tom White
Geopulse Product and Engineering Leads

Changes in our Global Places Data – Q1 2015

The global business landscape is not static. Places open, close, move, shift ownership, update their names, and change in all manner of other ways every day. So we put a great deal of effort into continually ensuring that the Global Places data we provide is the most accurate snapshot of the real world.

Below is a summary of some changes that we’ve made since our last update. In the 11 countries listed here, we added about 6.4 million places, discarded about 8.8 million old records, and updated at least one field[1] in 10.4 million records.

See the breakdown of these updates by field in the chart below[2].


It’s a tough job staying on top of the world’s places, but we’re up to the challenge.

- Julie Levine, Marketing Associate

In Case You Missed It:
See updates from 2014 here

Notes:
1. Fields include: address, address extended, country, locality, name, PO box, postcode, region, telephone number.

2. Note that some records had updates to more than one field, thus the number of updates is larger than the number of updated records.

A Day in the Life of a Factual Engineer: Polygon Compression

In this series of blog posts, Factual’s blog team asked engineers to describe what a typical day looks like.

Background

Chris Bleakley, our resident polygon and Lucene expert, had written meticulous documentation about the problem he was solving. The first paragraph read:

“Because search times are dominated by the cost of deserializing JTS objects when using Lucene to store polygon data, queries that match many polygons and require ALL results to be compared against the original polygon are slow. Is it possible to make polygon loading and point-polygon intersections faster?”

We had spoken a while back about creating an open-source polygon geometry library, and this seemed like a good enough reason to do it. His current approach of delta-encoding and quantizing everything to 1cm resolution got a 10x space and ~30x time advantage over using JTS and WKT, and given that each of the ~3 million polygons was relatively small (11 points on average), any further gains were likely to come from compressing whole cities at a time. My goal for the day was to figure out whether this was possible, and if so, how to do it.
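As a rough illustration of what delta-encoding plus quantization means in practice (a sketch for readers, not Chris’ actual encoder), each ring can be stored as a starting coordinate plus small integer deltas measured in roughly centimeter-sized quanta:

QUANTUM_DEG = 1e-7  # roughly a centimeter of latitude; an illustrative choice

def delta_quantize(ring, quantum=QUANTUM_DEG):
    """Delta-encode a polygon ring and quantize each delta to integer quanta (sketch).

    Note: quantizing deltas independently can accumulate error along the ring;
    a careful encoder quantizes cumulative positions instead."""
    start = ring[0]
    deltas = []
    prev_lat, prev_lng = start
    for lat, lng in ring[1:]:
        deltas.append((int(round((lat - prev_lat) / quantum)),
                       int(round((lng - prev_lng) / quantum))))
        prev_lat, prev_lng = lat, lng
    return start, deltas

ring = [(34.114365, -118.420783), (34.114377, -118.420653), (34.114295, -118.420642)]
print(delta_quantize(ring))
# e.g. ((34.114365, -118.420783), [(120, 1300), (-820, 110)])

The small, repetitive integers produced this way are what make the entropy analysis below worthwhile.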

First Step: Getting the Data into a Usable Form

I started by exporting the Postgres geometries to a WKT text file using Chris’ instructions:

$ psql -c "copy (select id, st_astext(the_geog) from prod.asset
           where source_id = 10) to STDOUT" \
  | gzip > polygons.gz
$ nfu polygons.gz | wc -l       # make sure we have everything
3141244
$ nfu -T1 polygons.gz           # take a look at one of them
376d29ce-0c62-4dd7-b386-fa501e5a5366    MULTIPOLYGON(((-118.42078343774
34.1143654686345,-118.420652634588 34.1143771084748,-118.42064208021
34.1142953947106,-118.420514149298 34.1143067764812,-118.420506448828
...
)))
$

These were all simple polygons; that is, each consisted of a MULTIPOLYGON with exactly one ring, so extracting the points turned out to be really easy. I went ahead and delta-encoded the points at the same time:

$ nfu polygons.gz \
    -m 'row %1 =~ /(-?\d+\.\d+ -?\d+\.\d+)/g' \
    -m 'my $ref_lat = 0;
        my $ref_lng = 0;
        row map {
          my ($lng, $lat)   = split / /;
          my ($dlng, $dlat) = ($lng - $ref_lng, $lat - $ref_lat);
          ($ref_lat, $ref_lng) = ($lat, $lng);
          "$dlat,$dlng";
        } @_' \
  | gzip > polygon-deltas.gz

My initial version of this code had two mistakes that I didn’t notice for a while:

  1. I started by assuming WKT used a comma to separate lat/lng. It actually uses a space for this, and a comma to separate points. (And it stores numbers as lng lat instead of lat lng.)
  2. Instead of $dlat,$dlng, I had written sprintf("%f,%f", $dlat, $dlng). What I didn’t realize is that sprintf’s %f keeps only six decimal places, which is about 10cm of error.

I also took a look at some of the polygon shapes to get a sense for the type of data we had to work with:

$ nfu polygon-deltas.gz -T10000 --pipe shuf -T1 \
      -m 'map row(split /,/), @_' -s01p %l



Second Step: Finding a Compression Limit

At this point I still didn’t know whether it was even possible to compress these polygons much better than Chris already could. His format used about 130 bytes per polygon, so in order to justify much effort I’d have to get below 60 bytes. (If the idea of designing a compression algorithm is unfamiliar, I wrote up a PDF a while back that explains the intuition I was using.)

First I looked at the distribution of latitude and longitude deltas to see what kind of histograms we were dealing with. We’re allowed to introduce up to 1cm of error, so I went ahead and pre-quantized all of the deltas accordingly (the s/,.*//r below is just a faster way of saying (split /,/)[0], which pulls the latitude from the $dlat,$dlng pairs we generated earlier):

$ nfu polygon-deltas.gz -m 'map s/,.*//r, @_[1..$#_]' -q0.000000001oc \
  | gzip > polygon-lat-deltas.gz
$ nfu polygon-lat-deltas.gz -f10p %l            # histogram
$ nfu polygon-lat-deltas.gz --entropy -f10p %l  # running entropy

Latitude delta histogram:


(In hindsight, quantizing the deltas wasn’t strictly valid because I could have easily accumulated more than 1cm of error for any given point — though for this preliminary stuff it probably didn’t impact the results very much.)

The latitude distribution entropy converges to just over 16 bits, which is bad news; if the longitude distribution follows the same pattern, then we’re up to just over 32 bits (4 bytes) per delta, which is an average of 44 bytes of deltas per polygon. We’ll also need to encode the total number of points and the starting location, which could easily bump the average size to 56 bytes (and this assumes a perfect encoding against the distribution, which requires storing the distribution itself somewhere — yet another space factor to be considered).
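For reference, here is a small Python sketch of the estimate being made here: compute the empirical entropy of the quantized deltas and turn bits per delta pair into bytes per polygon. The header size is an assumption for illustration.

from collections import Counter
import math

def entropy_bits(samples):
    """Shannon entropy in bits per symbol of an empirical distribution."""
    counts = Counter(samples)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def bytes_per_polygon(lat_deltas, lng_deltas, avg_points=11, header_bytes=12):
    """Back-of-the-envelope size of an entropy-coded polygon (sketch).

    Bits per (dlat, dlng) pair times the average point count, plus an assumed
    fixed header for the point count and starting coordinate."""
    bits_per_pair = entropy_bits(lat_deltas) + entropy_bits(lng_deltas)
    return bits_per_pair * avg_points / 8 + header_bytes

print(entropy_bits([0, 0, 1, 1]))  # 1.0 bit for a 50/50 distribution

With ~16 bits of entropy per delta component, 11 points cost about 44 bytes before the header, matching the back-of-the-envelope figure above.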

My next approach was to find out if latitude and longitude deltas showed any kind of dependence in aggregate. Maybe we can do better than encoding each delta by itself:

$ nfu polygon-deltas.gz \
      -m '@_[1..$#_]' -F , -q01,0.000000001 \
      -m 'row %0 + rand(0.9e-8), %1 + rand(0.9e-8)' \   # explained below
      -T1000000p %d

Progressively zooming in:



Clearly there’s a lot of dependence here, particularly in the form of rings and latitude/longitude-only offsets. This suggested to me that a polar encoding might be worthwhile (I had to modify nfu to add rect_polar):

$ nfu polygon-deltas.gz \
      -m '@_[1..$#_]' -F , -q01,0.000000001 \
      -m 'row rect_polar(%0 + rand(0.9e-8), %1 + rand(0.9e-8))' \
      -T1000000p %d

Here theta is on the Y axis and rho on the X axis. The first thing I noticed looking at this is that the bands of rho values are wavy; this is most likely because we’re converting from degrees latitude and longitude, which aren’t equal units of distance. That’s something I should be correcting for, so I found a latitude/longitude distance calculator and got the following figures for Los Angeles:

reference point = 34.11436, -118.42078
+ 1° latitude   = 111km
+ 1° longitude  = 92km

Converting the deltas to km (which incidentally makes the quantization factor a true centimeter):

# note: the conversion math below seems backwards to me, and I never figured
# out why.
$ nfu polygon-deltas.gz \
      -m '@_[1..$#_]' -F , -m 'row %0 / 92.0, %1 / 111.0' -q01,0.000000001 \
      -m 'row rect_polar(%0 + rand(0.9e-8), %1 + rand(0.9e-8))' \
      -T1000000p %d

At this point I was kind of stuck. The distribution looked good and fairly compressible, but I wasn’t quite sure how to quantize the polar coordinates maximally while preserving error bounds. At some level it’s simple trigonometry, but delta errors accumulate and that complicates things.

Explanation for rand(0.9e-8): In the nfu code above I’m adding a small random number to each point. This is a hack I sometimes use with quantized data to get a sense for correlation frequency; 0.9e-8 is 90% of the quantum, so we end up with visible cell boundaries and stochastic shading corresponding to each cell’s population. It becomes visible as an interference pattern if you zoom way in:
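In Python, the same jitter trick looks roughly like this (a sketch assuming a 1e-8 quantum, matching the rand(0.9e-8) term above):

import random

def jitter(quantized_points, quantum=1e-8):
    """Spread each quantized point within 90% of its cell so that cell
    populations show up as shading instead of collapsing onto one pixel (sketch)."""
    spread = 0.9 * quantum
    return [(x + random.uniform(0, spread), y + random.uniform(0, spread))
            for x, y in quantized_points]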

Third Step: Figuring Out What to Take Away From This

I didn’t have any very concrete results at the end of the day, but had some ideas about what a good compression algorithm might look like. I also strongly suspected that it was possible to get below 50 bytes/polygon. Before knocking off I saved a bunch of screenshots and nfu commands to my weekly notes on the wiki to review later on.

Working at Factual (for me, anyway) involves a lot of this kind of exploration, learning a bunch of less-than-definite stuff, and later revisiting a problem to write something concrete. I’ve had other work to do in the meantime, but hopefully before too long I’ll have time to come back to this and write some beercode to implement an awesome compression strategy.

– Spencer Tipping, Software Engineer

Factual’s Trusted Data Contributor Program Adds 13 New Partners

In a continuing effort to provide the highest quality data, we have added 13 new companies to the Factual Trusted Data Contributor Program. These organizations work directly with businesses and brands, and in turn ensure that their data is represented accurately in Factual. This select group of partners provides high-quality data to Factual and equally excellent service to their customers. They have a large number of SMBs and national brands under their belts, and we gladly help them disseminate their customers’ data with fidelity across our network. We are excited to welcome the following organizations to the program:

Connectivity (info@connectivity.com)
Connectivity provides companies with customer intelligence solutions. It also monitors reviews across hundreds of sites and claims listing information.
Countries covered: US and Canada

ICE Portal (info@iceportal.com)
ICE Portal is a technology and marketing company that helps travel suppliers manage, curate, and deliver their visuals to thousands of online travel and travel-related websites, including major OTAs, GDSs, DHISCO (formerly Pegasus), search engines, local directories, and social networks. ICE Portal distributes visual content to over 11 million unique visitors a month on websites worldwide.
Countries covered: Global

Local Market Launch (info@localmarketlaunch.com)
Local Market Launch provides a Local Presence Automation solution to simplify the foundational process of local presence management. Our user-friendly technology platform delivers an automated solution to establish and manage local presence and business listings across all digital channels. Our SMB and Brand products are sold through our digital agency, newspaper group, and yellow page publisher partners.
Countries covered: US, UK

Localistico (contact@localistico.com)
Localistico helps local businesses become easier to find and look their best across different platforms, from Google to Yelp. Not only do they ensure that businesses can get the right data in front of their customers, but they also keep track of changes and reviews and give actionable recommendations on how to improve their local presence. Localistico has clients in Spain, the UK, Ireland, and the Czech Republic, and will be available in other EU countries in the coming months.
Countries covered: UK, Ireland, Spain, with plans to expand to France, Germany, and other EU countries

MetaLocator (Info@metalocator.com)
MetaLocator provides store locator and product finder solutions for its customers. MetaLocator is a SaaS solution and works with any website.
Countries covered: Global

Milestone Internet Marketing (info@naptunelistings.com)
Milestone is a leading software and service provider of online marketing and advertising solutions for the hospitality industry. Milestone’s Naptune™ is an award-winning listing management SEO software that gives businesses better online and mobile visibility.
Countries covered: USA, Canada, UK, UAE, Mexico, Cayman Islands, Puerto Rico, Aruba, Costa Rica, Brazil, Tanzania, Portugal, China, Singapore, India

NavAds (sales@navads.eu)
NavAds offers small businesses and enterprise customers a global platform to manage business listings on a variety of GPS devices, navigation apps, map development companies, and POI databases.
Countries covered: 122 countries, mainly EMEA and the Americas

seoClarity (sales@seoclarity.net)
seoClarity is the world’s leading enterprise SEO platform, trusted by 2,000+ brands worldwide including brick-and-mortar retailers such as Dick’s Sporting Goods, Barnes and Noble, Paychex, and more.
Countries covered: Global

SIM Partners (sales@simpartners.com)
SIM Partners’ Velocity platform empowers enterprise brands to drive customer acquisition at a local level. Through automation, Velocity optimizes location-specific content and business information, enabling national marketers to dominate local, mobile, and social search results across thousands of locations. SIM Partners has offices in Chicago and San Francisco and authorized resellers in Australia, Germany, and Italy.
Countries covered: Global

SweetIQ (brad@sweetiq.com)
SweetIQ is an industry leader in local search marketing, helping F500 brands convert online searches to in-store shoppers. Our analytics and automation platform delivers local search results for national brands and their marketing agencies. Since 2010 we’ve been helping our clients grow their sales by increasing their online findability and giving them the tools they need to manage their online presence. The company has offices in Montreal and San Francisco, working with clients across North America.
Countries covered: USA, Canada, Australia

uberall.com (hello@uberall.com)
A Berlin-based geomarketing software company that, among other things, manages local listings for local shops and service providers, franchises, retail chains, and more. Based in Germany and currently also active in the UK, France, Spain, Austria, and Switzerland.
Countries covered: Germany, UK, France, Spain, Austria, Switzerland; in 2015 they will be adding Poland, Italy, and several other countries

Vendasta (sales@vendasta.com)
Vendasta’s 10x platform helps digital agencies and media companies sell digital marketing solutions to local businesses. Through this platform, Vendasta provides white-label reputation management, social media marketing, and local listings management while integrating with third-party applications to offer a complete solution for any local business. In local listing management, Vendasta is the only platform to provide complete transparency, as well as an entire repository of white-label training materials, support resources, and marketing collateral to help partners achieve 10x growth as quickly — and efficiently — as possible.
Countries covered: USA, Canada, Australia

YP (press@yp.com)
YP is a leading local marketing solutions provider in the US dedicated to helping local businesses and communities grow. Formerly AT&T Interactive and AT&T Advertising Solutions, YP launched in May 2012, bringing the two companies together with the mission of helping local businesses and communities grow. YP’s flagship consumer brands include the popular YP℠ app and YP.com, which are used by more than 80 million visitors each month in the U.S., and The Real Yellow Pages® directory.
Countries covered: US

While we try to process submissions quickly, it can take four to six weeks for submissions to be reflected in our data.

For more information please reach out to tdc@factual.com.

- Andrea Chang, Partner Services Team

Featured Partner: Beasy

Making plans is often harder than it should be. We text, email, send calendar invites, call, and even poll our friends and colleagues in person to determine when and where to gather, but often end up with more headaches than group activities. Beasy solves this problem with a simple solution designed explicitly for the challenge of setting up plans, instead of trying to force-fit another approach. See our interview below with Founder and CEO Whitney Komor about this inventive approach to making plans.

Company: Beasy
Located: Sydney, Australia
Partner Since: 2014
Website: www.beasy.co
App Store: iTunes, Google Play
Name and Title: Whitney Komor, Founder and CEO

 

Q: Introduce readers to Beasy.
A: Beasy is a messaging app that makes it easier for groups of people to coordinate social plans. In the Beasy app, whenever someone mentions a time or place, it is identified automatically and added to a list where it can be voted on by everyone in the conversation. You can think of this as similar to how iMessage automatically identifies when you enter a time or date in a message and gives you the option to create a calendar item. What is special about Beasy is that it doesn’t only identify times and dates, but also the places that you’re trying to meet, and it lets the group vote before the plans are set.

Beasy builds on our first app, The Best Day, which offers similar functionality around voting on the best time and place to meet. Unlike The Best Day, Beasy extracts time and place information directly from your conversations, so you can continue to talk with your friends while still planning efficiently.

Q: Why is location data important for Beasy?
A: A main part of Beasy’s functionality is that it automatically identifies the places that users mention in their messages and provides detailed information on them to everyone in the conversation. For example, when a text message is entered such as, “let’s go to Intelligentsia Coffee tomorrow afternoon,” we want to identify “tomorrow afternoon” as a time and “Intelligentsia Coffee” as a place that you’re going to. To do this on the back end, we break each message down into tokens of one to a few words and pass them through a system that determines which ones are most likely to be places. For example, tokens following “go to” are weighted as likely to be places. Once we have gotten this far, we need a location database to identify the real-world locations being discussed. This process uses more than just text: we also qualify queries with information such as where the user is when they send the message (lat/long), or whether they’ve mentioned any cities earlier in the conversation.
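As a purely hypothetical illustration of the kind of heuristic described here (not Beasy’s actual implementation), a naive place-candidate extractor might favor capitalized token runs that follow cue phrases such as “go to”:

import re

PLACE_CUES = ("go to", "meet at")  # hypothetical cue phrases

def place_candidates(message):
    """Return capitalized token runs that follow a cue phrase (toy heuristic).

    A real system would score many signals and resolve candidates against a
    places database, qualified by the sender's lat/lng and earlier context."""
    candidates = []
    lowered = message.lower()
    for cue in PLACE_CUES:
        for match in re.finditer(re.escape(cue) + r"\s+", lowered):
            tail = message[match.end():]
            run = re.match(r"((?:[A-Z][\w']*\s*)+)", tail)
            if run:
                candidates.append(run.group(1).strip())
    return candidates

print(place_candidates("let's go to Intelligentsia Coffee tomorrow afternoon"))
# ['Intelligentsia Coffee']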

Q: Why did you choose Factual as your location data provider?
A: In order to offer a seamless user experience, we wanted to host the places data ourselves and query it at high speed. Factual allows us to do this. Additionally, Factual provides rich data around each place, so if users are discussing a restaurant for example, we can use Factual’s Restaurants data to offer details like hours of operation and cuisine.

Q: Beasy uses online communication to foster offline communication. Do you think in the long run, having technology that is with us 24/7 and integral to our every interaction will make us more or less social?
A: This might seem funny coming from me, but I think it’s making us less social. Before starting The Best Day, I remember being so frustrated liking my friends’ pictures on Instagram while struggling to organize dinner in real life. Then it struck me that I would rather have technology help me see people in the real world than let me follow their lives digitally. Somewhat ironically, Beasy is a social app that’s really designed to let you spend as little time as possible actually absorbed in technology.

Q: Beasy provides a simple and elegant solution to a surprisingly complicated problem (getting some friends together). Since you have your hands full with your own app, what similar problems would you like to see someone else solve?
A: I’ve read a bit about businesses that are trying to help you park instead of circling around. When someone figures out how to do that well, I’ll be really happy about it.

Q: What lessons have you learned starting The Best Day and Beasy?
A: I’ve mostly learned from mistakes. One lesson I always try to remember is that, when it comes to product development, you should identify all of the assumptions you’re making and test them as quickly and cheaply as possible. For example, before we build out a new natural language processing feature, we write out a message to parse, and one of us pretends to be the bot and manually writes out what we want it to detect, to confirm whether it provides a worthwhile user experience. It takes 10 minutes instead of two weeks and saves a lot of effort.

- Julie Levine, Marketing Associate

In Case You Missed It
Check out some other Featured Partners, like navigation apps Urban Engines and 2GIS, market research company Infoscout, and social apps Emu, HelloTel, and SocialRadar. See even more featured partners here.