Changes in our Global Places Data – Q1 2015

The global business landscape is not static. Every day, places open, close, move, change ownership, update their names, and change in all manner of other ways. So we put a great deal of effort into continually ensuring that the Global Places data we provide is the most accurate snapshot of the real world.

Below is a summary of some changes that we’ve made since our last update. In the 11 countries listed here, we added about 6.4 million places, discarded about 8.8 million old records, and updated at least one field [1] in 10.4 million records.

See the breakdown of these updates by field in the chart below [2].


It’s a tough job staying on top of the world’s places, but we’re up to the challenge.

- Julie Levine, Marketing Associate

In Case You Missed It:
See updates from 2014 here

Notes:
1. Fields include: address, address extended, country, locality, name, PO box, postcode, region, telephone number.

2. Note that some records had updates to more than one field, thus the number of updates is larger than the number of updated records.

A Day in the Life of a Factual Engineer: Polygon Compression

In this series of blog posts, Factual’s blog team asked engineers to describe what a typical day looks like.

Background

Chris Bleakley, our resident polygon and Lucene expert, had written meticulous documentation about the problem he was solving. The first paragraph read:

“Because search times are dominated by the cost of deserializing JTS objects from when using Lucene to store polygon data, queries that match many polygons and require ALL results to be compared against the original polygon are slow. Is it possible to make polygon loading and point-polygon intersections faster?”

We had spoken a while back about creating an open-source polygon geometry library, and this seemed like a good enough reason to do it. His current approach of delta-encoding and quantizing everything to 1cm resolution got a 10x space and ~30x time advantage over using JTS and WKT, and given that each of the ~3 million polygons was relatively small (11 points on average), any further gains were likely going to come from compressing whole cities at a time. My goal for the day was to figure out whether this was possible, and if so how to do it.

First Step: Getting the Data into a Usable Form

I started by exporting the Postgres geometries to a WKT text file using Chris’ instructions:

$ psql -c "copy (select id, st_astext(the_geog) from prod.asset
           where source_id = 10) to STDOUT" \
  | gzip > polygons.gz
$ nfu polygons.gz | wc -l       # make sure we have everything
3141244
$ nfu -T1 polygons.gz           # take a look at one of them
376d29ce-0c62-4dd7-b386-fa501e5a5366    MULTIPOLYGON(((-118.42078343774
34.1143654686345,-118.420652634588 34.1143771084748,-118.42064208021
34.1142953947106,-118.420514149298 34.1143067764812,-118.420506448828
...
)))
$

These were all simple polygons; that is, each consisted of a MULTIPOLYGON with exactly one ring, so extracting the points turned out to be really easy. I went ahead and delta-encoded the points at the same time:

$ nfu polygons.gz \
    -m 'row %1 =~ /(-?\d+\.\d+ -?\d+\.\d+)/g' \
    -m 'my $ref_lat = 0;
        my $ref_lng = 0;
        row map {
          my ($lng, $lat)   = split / /;
          my ($dlng, $dlat) = ($lng - $ref_lng, $lat - $ref_lat);
          ($ref_lat, $ref_lng) = ($lat, $lng);
          "$dlat,$dlng";
        } @_' \
  | gzip > polygon-deltas.gz

My initial version of this code had two mistakes that I didn’t notice for a while:

  1. I started by assuming WKT used a comma to separate lat/lng. It actually uses a space for this, and a comma to separate points. (And it stores numbers as lng lat instead of lat lng.)
  2. Instead of "$dlat,$dlng", I had written sprintf("%f,%f", $dlat, $dlng). What I didn’t realize is that sprintf’s %f rounds floats to six decimal places, which is about 10cm of error.
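
For readers who don't have nfu handy, here is a minimal Python sketch of the same idea. It is not the code used in this post; the names (POINT_RE, delta_encode_wkt, degrees_lat_to_cm) are made up for illustration, and the 111 km per degree of latitude is the rounded figure quoted later in the post. It parses "lng lat" pairs out of WKT, delta-encodes them, and then shows how much a "%f" round-trip costs compared with keeping full precision:

```python
import re

# WKT separates coordinates with a space and points with a comma,
# and lists longitude before latitude.
POINT_RE = re.compile(r"(-?\d+\.\d+) (-?\d+\.\d+)")
KM_PER_DEGREE_LAT = 111.0   # rounded figure, used only to express error in cm


def delta_encode_wkt(wkt):
    """Return (dlat, dlng) pairs; the first pair is relative to (0, 0)."""
    ref_lat = ref_lng = 0.0
    deltas = []
    for lng_text, lat_text in POINT_RE.findall(wkt):
        lng, lat = float(lng_text), float(lat_text)
        deltas.append((lat - ref_lat, lng - ref_lng))
        ref_lat, ref_lng = lat, lng
    return deltas


def degrees_lat_to_cm(degrees):
    """Convert a latitude difference in degrees to centimeters."""
    return degrees * KM_PER_DEGREE_LAT * 1000 * 100


sample = ("MULTIPOLYGON(((-118.42078343774 34.1143654686345,"
          "-118.420652634588 34.1143771084748)))")
deltas = delta_encode_wkt(sample)
dlat = deltas[1][0]            # latitude delta between the two points

exact = float(repr(dlat))      # repr keeps every digit of the float
lossy = float("%f" % dlat)     # "%f" keeps six digits after the point

print(degrees_lat_to_cm(abs(dlat - exact)))   # no loss
print(degrees_lat_to_cm(abs(dlat - lossy)))   # a few centimeters
print(degrees_lat_to_cm(1e-6))                # one "%f" step, roughly 11 cm
```

The last line is where the "about 10cm" figure comes from: one unit in the sixth decimal place of a degree of latitude is roughly 11 cm on the ground.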

I also took a look at some of the polygon shapes to get a sense for the type of data we had to work with:

$ nfu polygon-deltas.gz -T10000 --pipe shuf -T1 \
      -m 'map row(split /,/), @_' -s01p %l



Second Step: Finding a Compression Limit

At this point I still didn’t know whether it was even possible to compress these polygons much better than Chris already could. His format used about 130 bytes per polygon, so in order to justify much effort I’d have to get below 60 bytes. (If the idea of designing a compression algorithm is unfamiliar, I wrote up a PDF a while back that explains the intuition I was using.)

First I looked at the distribution of latitude and longitude deltas to see what kind of histograms we were dealing with. We’re allowed to introduce up to 1cm of error, so I went ahead and pre-quantized all of the deltas accordingly (the s/,.*//r below is just a faster way of saying (split /,/)[0], which pulls the latitude from the $dlat,$dlng pairs we generated earlier):

$ nfu polygon-deltas.gz -m 'map s/,.*//r, @_[1..$#_]' -q0.000000001oc \
  | gzip > polygon-lat-deltas.gz
$ nfu polygon-lat-deltas.gz -f10p %l            # histogram
$ nfu polygon-lat-deltas.gz --entropy -f10p %l  # running entropy

Latitude delta histogram:


(In hindsight, quantizing the deltas wasn’t strictly valid because I could have easily accumulated more than 1cm of error for any given point — though for this preliminary stuff it probably didn’t impact the results very much.)
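
To make the accumulation problem concrete, here is a small Python sketch (not part of the original analysis; QUANTUM and the function names are hypothetical). It compares quantizing each delta independently against one common alternative, which is to snap absolute positions to the grid first and delta-encode the resulting integers:

```python
QUANTUM = 1e-7  # hypothetical grid size, chosen only for illustration


def max_error_quantized_deltas(values):
    """Quantize each delta on its own, then rebuild by cumulative sum."""
    rebuilt, prev, worst = 0.0, 0.0, 0.0
    for v in values:
        rebuilt += round((v - prev) / QUANTUM) * QUANTUM
        prev = v
        worst = max(worst, abs(rebuilt - v))
    return worst


def max_error_quantized_positions(values):
    """Snap each position to the grid first, then delta-encode integers."""
    cells = [round(v / QUANTUM) for v in values]
    deltas = [cur - prev for prev, cur in zip([0] + cells, cells)]
    rebuilt, worst = 0, 0.0
    for v, d in zip(values, deltas):
        rebuilt += d                      # exact integer arithmetic
        worst = max(worst, abs(rebuilt * QUANTUM - v))
    return worst


# Ten points, each 0.4 quanta past the previous one: every individual
# delta rounds to zero, so the first scheme never moves at all.
walk = [i * 0.4 * QUANTUM for i in range(1, 11)]

print(max_error_quantized_deltas(walk) / QUANTUM)     # about 4 quanta
print(max_error_quantized_positions(walk) / QUANTUM)  # at most half a quantum
```

With the second scheme the error at every point stays within half a quantum no matter how long the polygon is, because the integer deltas sum back to the snapped position exactly.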

The latitude distribution entropy converges to just over 16 bits, which is bad news; if the longitude distribution follows the same pattern, then we’re up to just over 32 bits (4 bytes) per delta, which is an average of 44 bytes of deltas per polygon. We’ll also need to encode the number of points total and the starting location, which could easily bump the average size to 56 bytes (and this assumes a perfect encoding against the distribution, which involves storing the whole thing somewhere; that is yet another space factor to be considered).
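
The arithmetic behind that estimate can be sketched in a few lines of Python. This is plain Shannon entropy over an observed sequence, which may differ in detail from what nfu's --entropy computes, and the function names are invented for the example:

```python
import math
from collections import Counter


def entropy_bits(symbols):
    """Shannon entropy, in bits per symbol, of an observed sequence."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def delta_bytes_per_polygon(bits_lat, bits_lng, points=11):
    """Delta payload only; excludes the point count and starting location."""
    return (bits_lat + bits_lng) * points / 8


# 65536 equally likely values need 16 bits each, so a 16-bit entropy means
# the quantized deltas are about as hard to code as a uniform 16-bit integer.
print(entropy_bits(range(65536)))
print(delta_bytes_per_polygon(16, 16))   # 44.0 bytes for an 11-point polygon
```

An ideal entropy coder can't average fewer bits per symbol than this figure, which is why 16 bits per coordinate puts a floor of roughly 44 bytes on any scheme that encodes each delta independently.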

My next approach was to find out if latitude and longitude deltas showed any kind of dependence in aggregate. Maybe we can do better than encoding each delta by itself:

# note: rand(0.9e-8) is explained below
$ nfu polygon-deltas.gz \
      -m '@_[1..$#_]' -F , -q01,0.000000001 \
      -m 'row %0 + rand(0.9e-8), %1 + rand(0.9e-8)' \
      -T1000000p %d

Progressively zooming in:



Clearly there’s a lot of dependence here, particularly in the form of rings and latitude/longitude-only offsets. This suggested to me that a polar encoding might be worthwhile (I had to modify nfu to add rect_polar):

$ nfu polygon-deltas.gz \
      -m '@_[1..$#_]' -F , -q01,0.000000001 \
      -m 'row rect_polar(%0 + rand(0.9e-8), %1 + rand(0.9e-8))' \
      -T1000000p %d

Here theta is on the Y axis and rho on the X axis. The first thing I noticed looking at this is that the bands of rho values are wavy; this is most likely because we’re converting from degrees latitude and longitude, which aren’t equal units of distance. That’s something I should be correcting for, so I found a latitude/longitude distance calculator and got the following figures for Los Angeles:

reference point = 34.11436, -118.42078
+ 1° latitude   = 111km
+ 1° longitude  = 92km
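
As a sanity check on those two figures, a spherical approximation says a degree of longitude shrinks by the cosine of the latitude. A quick Python sketch (the function name is made up; 111 km is the rounded figure above):

```python
import math

KM_PER_DEGREE_LAT = 111.0   # the rounded figure quoted above


def km_per_degree_lng(lat_degrees):
    """Spherical approximation: longitude degrees shrink by cos(latitude)."""
    return KM_PER_DEGREE_LAT * math.cos(math.radians(lat_degrees))


print(round(km_per_degree_lng(34.11436)))   # 92, matching the calculator
```

A single pair of constants is only reasonable here because all of the polygons sit in a narrow band of latitudes; a global dataset would need the cosine evaluated per polygon.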

Converting the deltas to km (which incidentally makes the quantization factor a true centimeter):

# note: the conversion math below seems backwards to me, and I never figured
# out why.
$ nfu polygon-deltas.gz \
      -m '@_[1..$#_]' -F , -m 'row %0 / 92.0, %1 / 111.0' -q01,0.000000001 \
      -m 'row rect_polar(%0 + rand(0.9e-8), %1 + rand(0.9e-8))' \
      -T1000000p %d
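
For comparison, here is how the same transformation might look in Python. This is a stand-in, not nfu's rect_polar (whose exact output isn't shown in this post), and the function names are hypothetical. It uses the conventional direction for the unit conversion, multiplying degrees by km per degree, and pairs the latitude delta with the latitude constant; it doesn't attempt to explain why the nfu version above looked right with the division.

```python
import math

KM_PER_DEGREE_LAT = 111.0   # figures quoted above for Los Angeles
KM_PER_DEGREE_LNG = 92.0


def delta_to_polar(dlat, dlng):
    """Return (rho_km, theta_radians) for one delta given in degrees."""
    north_km = dlat * KM_PER_DEGREE_LAT
    east_km = dlng * KM_PER_DEGREE_LNG
    rho = math.hypot(east_km, north_km)
    theta = math.atan2(north_km, east_km)   # 0 = due east, pi/2 = due north
    return rho, theta


def polar_to_delta(rho, theta):
    """Invert delta_to_polar, returning (dlat, dlng) in degrees."""
    north_km = rho * math.sin(theta)
    east_km = rho * math.cos(theta)
    return north_km / KM_PER_DEGREE_LAT, east_km / KM_PER_DEGREE_LNG


# First edge of the sample polygon shown earlier: about 12 m, nearly due east.
rho, theta = delta_to_polar(1.16398403e-05, 1.30803152e-04)
print(rho * 1000, math.degrees(theta))
```

Because both axes are in kilometers before the conversion, rho is a real distance, and a fixed step in rho means the same thing in every direction.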

At this point I was kind of stuck. The distribution looked good and fairly compressible, but I wasn’t quite sure how to quantize the polar coordinates maximally while preserving error bounds. At some level it’s simple trigonometry, but delta errors accumulate and that complicates things.

Explanation for rand(0.9e-8): In the nfu code above I’m adding a small random number to each point. This is a hack I sometimes use with quantized data to get a sense for correlation frequency; 0.9e-8 is 90% of the quantum, so we end up with visible cell boundaries and stochastic shading corresponding to each cell’s population. It becomes visible as an interference pattern if you zoom way in:
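
The jitter trick itself is tiny. A Python sketch, taking the quantum to be 1e-8 as the explanation above implies (the names are hypothetical):

```python
import random

QUANTUM = 1e-8   # assumed from "0.9e-8 is 90% of the quantum"


def jitter(value, rng, fraction=0.9):
    """Spread a quantized value uniformly over 90% of its grid cell."""
    return value + rng.uniform(0.0, fraction * QUANTUM)


rng = random.Random(0)
samples = [jitter(0.0, rng) for _ in range(1000)]
print(min(samples), max(samples))   # all within [0, 0.9e-8]
```

Leaving the last 10% of each cell empty is what makes the cell boundaries show up as thin gaps in a scatter plot, while the density of dots inside a cell reflects how many points were quantized into it.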

Third Step: Figuring Out What to Take Away From This

I didn’t have any very concrete results at the end of the day, but had some ideas about what a good compression algorithm might look like. I also strongly suspected that it was possible to get below 50 bytes/polygon. Before knocking off I saved a bunch of screenshots and nfu commands to my weekly notes on the wiki to review later on.

Working at Factual (for me, anyway) involves a lot of this kind of exploration, learning a bunch of less-than-definite stuff, and later revisiting a problem to write something concrete. I’ve had other work to do in the meantime, but hopefully before too long I’ll have time to come back to this and write some beercode to implement an awesome compression strategy.

– Spencer Tipping, Software Engineer

Factual’s Trusted Data Contributor Program Adds 13 New Partners

In a continuous effort to provide the highest quality data, we have added 13 new companies to the Factual Trusted Data Contributor Program. These organizations work directly with businesses and brands, and in turn ensure that their data is represented accurately in Factual. This select group of partners provides high-quality data to Factual and equally excellent service to their customers. These organizations have a large number of SMBs and national brands under their belt, and we gladly help them disseminate their customers’ data with fidelity across our network. We are excited to welcome the following organizations to the program:

Connectivity
info@connectivity.com
Service description: Connectivity provides companies with customer intelligence solutions. It also monitors reviews across hundreds of sites & claim listing information.
Countries covered: US and Canada

ICE Portal
info@iceportal.com
Service description: ICE Portal is a technology and marketing company that helps travel suppliers manage, curate and deliver their visuals to thousands of online travel and travel related websites, including major OTAs, GDS’, DHISCO (Formerly Pegasus), Search Engines, Local Directories and Social Networks. ICE Portal distributes visual content to over 11 million unique visitors a month on websites worldwide.
Countries covered: Global

Local Market Launch
info@localmarketlaunch.com
Service description: Local Market Launch provides a Local Presence Automation solution to simplify the foundational process of local presence management. Our user-friendly technology platform delivers an automated solution to establish and manage local presence and business listings across all digital channels. Our SMB and Brand products are sold through our digital agency, newspaper group, and yellow page publisher partners.
Countries covered: US, UK

Localistico
contact@localistico.com
Service description: Localistico helps local businesses become easier to find and look their best across different platforms, from Google to Yelp. Not only do they ensure that businesses can get the right data in front of their customers, but also keep track of changes and reviews and give actionable recommendations on how to improve their local presence. Localistico has clients in Spain, UK, Ireland and Czech Republic but will be available in other EU countries during the next months.
Countries covered: UK, IE, Spain. With plans to expand to France, Germany and other EU countries

MetaLocator
Info@metalocator.com
Service description: MetaLocator provides store locator and product finder solutions for its customers. MetaLocator is a SaaS solution, and works with any website.
Countries covered: Global

Milestone Internet Marketing
info@naptunelistings.com
Service description: Milestone is a leading software and service provider of online marketing and advertising solutions for the hospitality industry. Milestone’s Naptune™ is an award winning listing management SEO software allowing businesses better online and mobile visibility.
Countries covered: USA, Canada, UK, UAE, Mexico, Cayman Islands, Puerto Rico, Aruba, Costa Rica, Brazil, Tanzania, Portugal, China, Singapore, India

NavAds
sales@navads.eu
Service description: NavAds offers small businesses and enterprise customers a global platform to manage business listings on a variety of GPS devices, navigation apps, map development companies and POI databases.
Countries covered: 122 countries, mainly EMEA and Americas

seoClarity
sales@seoclarity.net
Service description: seoClarity is the world’s leading Enterprise SEO platform trusted by 2,000+ brands worldwide including brick and mortar retailers such as Dicks Sporting Goods, Barnes and Noble, Paychex and more.
Countries covered: Global

SIM Partners
sales@simpartners.com
Service description: SIM Partners’ Velocity platform empowers enterprise brands to drive customer acquisition at a local level. Through automation, Velocity optimizes location-specific content and business information enabling national marketers to dominate local, mobile, and social search results across thousands of locations. SIM Partners has offices in Chicago and San Francisco and authorized resellers in Australia, Germany and Italy.
Countries covered: Global

SweetIQ
brad@sweetiq.com
Service description: SweetIQ is an industry leader in local search marketing, helping F500 brands convert online searches to in-store shoppers. Our analytics and automation platform delivers local search results for national brands and their marketing agencies. Since 2010 we’ve been helping our clients grow their sales by increasing their online findability and giving them the tools they need to manage their online presence. The company has offices in Montreal and San Francisco, working with clients across North America.
Countries covered: USA, Canada, Australia

uberall.com
hello@uberall.com
Service description: A Berlin-based Geomarketing Software Company that among other things manages local listings for local shops and service providers, franchises, retail chains, etc. Based in Germany and currently also active in the UK, France, Spain, Austria, and Switzerland.
Countries covered: Germany, UK, France, Spain, Austria, Switzerland. In 2015 they will be adding Poland, Italy and several other countries

Vendasta
sales@vendasta.com
Service description: Vendasta’s 10x platform helps digital agencies and media companies sell digital marketing solutions to local businesses. Through this platform, Vendasta provides white label reputation management, social media marketing and local listings management while integrating with third party applications to offer a complete solution for any local business. In local listing management, Vendasta is the only platform to provide complete transparency, as well as an entire repository of white label training materials, support resources, and marketing collateral to help partners achieve 10X growth as quickly and efficiently as possible.
Countries covered: USA, Canada, Australia

YP
press@yp.com
Service description: YP is a leading local marketing solutions provider in the US dedicated to helping local businesses and communities grow. Formerly AT&T Interactive and AT&T Advertising Solutions, YP launched in May 2012, bringing the two companies together with the mission of helping local businesses and communities grow. YP’s flagship consumer brands include the popular YP℠ app and YP.com, which are used by more than 80 million visitors each month in the U.S. and The Real Yellow Pages® directory.
Countries covered: US

While we try to process submissions quickly, it can take four to six weeks for submissions to be reflected in our data.

For more information please reach out to tdc@factual.com.

- Andrea Chang, Partner Services team

Featured Partner: Beasy

Making plans is often harder than it should be. We text, email, calendar invite, call, and even poll our friends and colleagues in person to determine when and where to gather, but often end up with more headaches than group activities. Beasy solves this problem by providing a simple solution designed explicitly for the challenge of setting up plans, instead of trying to force fit another approach. See our interview below with Founder and CEO Whitney Komor about this inventive approach to making plans.

Company: Beasy
Located: Sydney, Australia
Partner Since: 2014
Website: www.beasy.co
App Store: iTunes, Google Play
Name and Title: Whitney Komor, Founder and CEO

Q: Introduce readers to Beasy.
A: Beasy is a messaging app that makes it easier for groups of people to coordinate social plans. In the Beasy app, whenever someone mentions a time or place, it is identified automatically and added to a list where it can be voted on by everyone in the conversation. You can think of this as similar to how iMessage automatically identifies when you enter a time or date in a message and gives you the option to create a calendar item. What is special about Beasy is that it doesn’t only identify times and dates, but also the places that you’re trying to meet, and it lets the group vote before the plans are set.

Beasy builds on our first app, The Best Day, which offers similar functionality around voting on the best time and place to meet. Unlike The Best Day, Beasy extracts time and place information directly from your conversations, so you can continue to talk with your friends while still planning efficiently.

Q: Why is location data important for Beasy?
A: A main part of Beasy’s functionality is that it automatically identifies the places that users mention in their messages and provides detailed information on them to everyone in the conversation. For example, when a text message is entered such as, “let’s go to Intelligentsia Coffee tomorrow afternoon,” what we want is to identify “tomorrow afternoon” as a time and “Intelligentsia Coffee” as a place that you’re going to. To do this on the back end, we break each message down into tokens of one to a few words and pass them through a system where we determine which ones are most likely to be places. For example, tokens following “go to” are weighted as likely to be places. Once we have gotten this far, we need a location database to identify which real-world locations are being discussed. This process uses more than just text; we also qualify queries with information such as where the user is when they send the message (lat/long), or whether they’ve mentioned any cities earlier in the conversation.
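
To give a feel for the tokenizing step described in this answer, here is a toy Python sketch. It is not Beasy's system: the cue phrase, the weights, and every name in it are hypothetical, and a real pipeline would go on to look the candidates up in a places database.

```python
CUE = ("go", "to")     # hypothetical cue phrase
CUE_BOOST = 2.0        # hypothetical weight for phrases that follow the cue


def candidate_phrases(words, max_words=3):
    """Yield (start_index, phrase) for every run of 1..max_words words."""
    for start in range(len(words)):
        for size in range(1, max_words + 1):
            if start + size <= len(words):
                yield start, " ".join(words[start:start + size])


def score_candidates(message):
    """Give a higher weight to phrases that directly follow the cue."""
    words = message.split()
    lowered = [w.lower() for w in words]
    scores = {}
    for start, phrase in candidate_phrases(words):
        follows_cue = tuple(lowered[max(0, start - len(CUE)):start]) == CUE
        weight = CUE_BOOST if follows_cue else 1.0
        scores[phrase] = max(weight, scores.get(phrase, 0.0))
    return scores


scores = score_candidates("let's go to Intelligentsia Coffee tomorrow afternoon")
print(scores["Intelligentsia Coffee"], scores["tomorrow afternoon"])
```

The boosted phrases here include "Intelligentsia", "Intelligentsia Coffee", and "Intelligentsia Coffee tomorrow"; choosing among them is where the places lookup and the user's location would come in.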

Q: Why did you choose Factual as your location data provider?
A: In order to offer a seamless user experience, we wanted to host the places data ourselves and query it at high speed. Factual allows us to do this. Additionally, Factual provides rich data around each place, so if users are discussing a restaurant for example, we can use Factual’s Restaurants data to offer details like hours of operation and cuisine.

Q: Beasy uses online communication to foster offline communication. Do you think in the long run, having technology that is with us 24/7 and integral to our every interaction will make us more or less social?
A: This might seem funny coming from me, but I think it’s making us less social. Before starting The Best Day, I remember being so frustrated liking my friends’ pictures on Instagram but struggling to organize dinner in real life. Then it struck me that I prefer to have technology help me see people in the real world instead of letting me follow their lives digitally. Somewhat ironically, Beasy is a social app that’s really designed to let you spend as little time as possible actually absorbed in technology.

Q: Beasy provides a simple and elegant solution to a surprisingly complicated problem (getting some friends together). Since you have your hands full with your own app, what similar problems would you like to see someone else solve?
A: I’ve read a bit about businesses that are trying to help you park instead of circling around. When someone figures out how to do that well, I’ll be really happy about it.

Q: What lessons have you learned starting The Best Day and Beasy?
A: I’ve mostly learned from mistakes. One lesson that I always try to remember in product development is to think about all of the assumptions you’re making and test them as quickly and cheaply as possible. For example, before we build out a new natural language processing feature, we write out a message to parse and one of us will pretend to be the bot and manually write out what we want it to detect to confirm if it provides a worthwhile user experience. It takes 10 minutes instead of two weeks and saves a lot of effort.

- Julie Levine, Marketing Associate

In Case You Missed It
Check out some other Featured Partners, like navigation apps Urban Engines and 2GIS, market research company Infoscout, and social apps Emu, HelloTel, and SocialRadar. See even more featured partners here.

Factual Featured Partner: Urban Engines

Getting from Point A to Point B in any city can be challenging. Vital information such as which direction to walk, what train to take, when the next bus will arrive, and if there’s an event going on that will shut down part of the route is not always clear to commuters. Urban Engines is making it easier to traverse cities by collecting and analyzing everything from train and bus data to anonymous commuter movement patterns, empowering transit agencies and improving individual commuter experiences around the world. See our Q&A with Urban Engines’ Dan Zheng and Resmi Arjunanpillai below.

Company: Urban Engines
Located: Los Altos, CA
Partner Since: 2014
Website: www.urbanengines.com/
Facebook: www.facebook.com/urbanengines
Twitter: @UrbanEngines
Google+: +Urbanengines
LinkedIn: www.linkedin.com/company/urbanengines
App Store: iTunes, Google Play
Name and Title: Dan Zheng, General Manager
Resmi Arjunanpillai, Marketing Manager

Q: Introduce readers to Urban Engines.
A: Urban Engines is using technology to make urban living better; our primary focus is improving urban mobility — making it easier to get where you’re going — by using information from the billions of trips that people and vehicles make each day.

We work closely with cities and transit agencies to optimize their systems using transit data, such as “tap in/tap out” data from commuters entering and leaving public transit stations. With this simple information, we get a complete and clear view of how commuters are using the transit system and learn things like how full trains are and how many people are waiting on the platform. With Urban Engines’ data analytics solution, transit agencies can make decisions that can alleviate congestion and improve commuter experience, such as adding more buses on a crowded route.

Urban Engines uses the same powerful space-time engine that powers our analytics to drive an app that gives commuters the fastest public transit route from point A to B, whether online or offline, by evaluating traffic, area events, and how transit routes have fared historically. It is designed to be fast, always accessible (online and offline maps, search, and routing) and personalized, providing one-touch access and navigation to your favorite locations.

Q: Why is location data important for Urban Engines? Why did you choose Factual as your location data provider?
A: Within the Urban Engines app, it’s critical to surface locations that people are searching for. We like working with Factual because the location data is very high quality, allowing us to deliver a good user experience.

Q: Aside from providing maps and routes, how does the Urban Engines app help improve transit experiences?
A: When we designed the Urban Engines app, we had a modern commuter with a mobile device in mind. We set out to solve three problems: speed, connectivity, and orientation.

Speed: If you look at other apps today, given small mobile device screens, it takes a lot of effort just to input an address. Then there’s a lot of work to actually get to the location information that you want. We have provided one touch browsing for maps. As soon as you open up the app, we have already computed the best routes to all of your favorite locations.

Connectivity: For most places in metropolitan areas, there’s decent coverage in terms of connectivity to the internet. But generally if you have to go into places such as a subway station or basement bar, or you are traveling to a city where you won’t have data or internet access, you need full offline capability. With the Urban Engines app, you have complete offline functionality — you can do offline browsing of any map, search (for an address or point of interest), and route to any destination.

Orientation: For me, every time I get out of a train station — especially new ones – I get lost. Your phone has a camera, gyroscope, GPS, and a bunch of other sensors and features that actually could make it well equipped to assist you in this situation if deployed better. Our app combines these features to get what we call “X-ray mode.” This is a transparent map overlay of streets, transit stops, and transit routes on your camera view. When you turn on X-ray mode, you can tap on a bus stop or train station to see next arrival times and a walking path to see exactly how to get there without having to guess which way is uptown.

These are the things we’ve focused on so far, but there are many ways to use a mobile device to improve commuting and this is only the beginning!

Q: What’s an interesting or unexpected thing that you’ve learned from analyzing different cities’ transit patterns? Are there any key differences in commuting between different countries or continents?
A: Commuting is so location dependent. For example, in Singapore (given that it is near the equator) it rains a lot and the amount of rain has an impact on how people travel. When there’s light rain, people tend to hop on buses more often — it’s like a large moving umbrella. When it pours however, the subway stations get packed.

Q: How do you think ride-sharing apps (like Uber and Lyft) are impacting commuting and traffic? Is this something you plan to track in the future as well?
A: We know that urban commuters are changing their habits. A lot of it has to do with the shared economy – Zipcar, Uber, Lyft, or bike sharing programs. That’s going to continue to evolve in the sense that for younger people it’s about access rather than ownership. For the app, we will track this space closely and continue to add relevant functionality. More broadly, our space-time engine can also be used by ride-sharing apps to analyze movement data.

Q: Do you have any advice for the average commuter on some simple steps they can take to start improving their experience?
A: One of the things we do in the Urban Engines app is provide real-time data (when available), such as whether there’s an event happening on your route and how likely your choice of public transportation is to be delayed. Looking at this real-time information can be extremely helpful in determining your route each day, so you don’t get stuck in unanticipated congestion.

- Julie Levine, Marketing Associate

In Case You Missed It
Check out some other partners we’ve interviewed, like concierge app Snips, city guides Tastemade and Jetpac City Guides, and networking apps HelloTel and SocialRadar. See even more Featured Partners here.