Testing code is hard, but testing data is even harder.

When writing code, you’d like to be able to test the “happy path”, error conditions, throughput, thread-safety, fault-tolerance, etc. Back in the real world, however, you find yourself delighted when the code you’re working with has even a single test. The silver lining is that code testing is becoming easier and easier, with tools like continuous integration, dependency injection, mocking frameworks, unit testing frameworks, FitNesse, automatic test generators like AgitarOne, and so on.

For data, as for code, there are plenty of things to test, including consistency, accuracy, coverage, correctness, completeness, and so on. However, not only is it hard to test data using automated tests (as opposed to manual review), it’s often hard to even know where to start!

How does one test data???

Let’s say I gave you the following list of all of the Starbucks cafes in zip codes 90067 and 98134, and then asked you to tell me if the data is high quality. What would you do? How would you approach the problem?

1999 Ave. of the Stars, Century City, CA 90067, (310) 557-0785
2025 Ave. of the Stars, Century City, CA 90067, 310-228-1234
2049 Century Park East, Century City, CA 90067, (310) 286-3023
2401 Utah Avenue South, Seattle, WA, 98134, (206) 447-0828
4115 4th Avenue S, Seattle, 98134, (206)-381-1553
1926 1st Ave S., Seattle, WA, 98134, (206)-624-6045
134071 Seventh Ave NW, Seattle, WA, 98134, (206)-521-1242

Here are some of the things you might want to look for:

  1. Is the data complete? For example, the ‘state’ for the Starbucks on 4th Avenue S. is missing.
  2. Is the data comprehensive? For example, there’s a Starbucks at 1875 Century Park East, Century City, CA, but it’s not on the list.
  3. Is the data correct? For example, the Starbucks at 1926 1st Ave S. should actually be at 1962 1st Ave S.
  4. Is the data structured correctly? For example, you did find a telephone number for the Starbucks at 2401 Utah Avenue South, but it’s actually the fax number, not the phone number.
  5. If the data was cleaned/canonicalized, was it cleaned properly? For example, notice that the phone numbers all have slightly different formats. Also, “Ave. of the Stars” should probably be “Avenue of the Stars”, and there are other small issues as well: sometimes the address contains ‘Ave’ and sometimes ‘Avenue’; ‘S’, ‘S.’, and ‘South’ are used interchangeably; street names mix digits and spelled-out numbers (e.g. ‘4th’ vs. ‘Seventh’); and so on. (A small code sketch of checks like these follows this list.)
  6. Are there outliers in your data? For example, the last Starbucks in the list (which was made up) has a 6-digit street number. If street numbers never run longer than 5 digits, then you know you have a problem.
  7. If you clustered data (for deduping), is it underfolded? For example, if you had Starbucks@123 First St, Los Angeles, CA, 90067 and Starbucks@123 1st Street, Los Angeles, CA 90067, they SHOULD be considered the same object.
  8. If you clustered data (for deduping), is it overfolded? For example, if you had Starbucks@123 First St, Los Angeles, CA, 90067 and Starbucks@123 First Blvd, Los Angeles, CA 90067, they SHOULD NOT be considered the same object.
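
To make a couple of these checks concrete, here is a minimal sketch in Java (which is what most of our data-crunching code is in) of checks 1 and 5: a completeness test and a phone-number canonicalizer. The field layout and the choice of (NNN) NNN-NNNN as the canonical phone format are just illustrative assumptions, not the one true answer.

import java.util.regex.Pattern;

public class RecordChecks {
    private static final Pattern NON_DIGITS = Pattern.compile("\\D");

    // Check 1 (completeness): flag records with any missing field,
    // e.g. the blank 'state' for the Starbucks on 4th Avenue S.
    static boolean isComplete(String[] fields) {
        for (String f : fields) {
            if (f == null || f.trim().isEmpty()) return false;
        }
        return true;
    }

    // Check 5 (canonicalization): normalize a US phone number to one
    // arbitrary canonical format, (NNN) NNN-NNNN.
    static String normalizePhone(String raw) {
        String digits = NON_DIGITS.matcher(raw).replaceAll("");
        if (digits.length() != 10) return raw; // leave anything unexpected alone
        return String.format("(%s) %s-%s",
                digits.substring(0, 3), digits.substring(3, 6), digits.substring(6));
    }

    public static void main(String[] args) {
        String[] row = {"4115 4th Avenue S", "Seattle", "", "98134", "(206)-381-1553"};
        System.out.println(isComplete(row));                 // false: state is missing
        System.out.println(normalizePhone("310-228-1234"));  // (310) 228-1234
    }
}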

Not only are these attributes hard to test, but often the tests are as hard as the original problem! For example, if you can write an automated test that detects overfolding or underfolding, then presumably you can plug that same logic into the deduping algorithm itself and avoid the bad folds in the first place. That leaves you back at square one for testing.

Hard problems offer great rewards

While testing data quality is a big challenge, it also yields great dividends. When you’re talking about millions or even billions of pieces of data, improving the quality by a few percent has a huge absolute impact.

Your turn!

Here are a few example problems that you might encounter during the Factual interview process. If you work on one of these challenges and are happy with your results, please email your solution to us at jobs@factual.com. We’d love to meet you!

A few practical notes: First, for efficiency reasons, a lot of our data-crunching code is in Java. If your answers are in Java, that would be ideal; if not, that’s okay too. Second, these problems are hard and have a tendency to suck up as much time as you allot to them. Please don’t feel like you have to go overboard and write thousands of lines of code. The goal is to play with data and have some fun, not to take a week off from your regular activities. =)

Problem 1: Outlier Detection

Given a file containing between 10,000 and 1,000,000 inputs (1 line = 1 input), write a program to emit outliers along with brief justifications of why they are outliers. You can assume the total amount of data will not exceed 500MB. For example:

Sample Input

12345
12342
23532
34539
348597
30948
30487
abcde
38947
98374
… 9990 more 5-digit numbers …

Sample Output

348597 (99.99% of inputs have length 5)
abcde (99.99% of inputs consist of only digits)
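
One minimal way to produce justifications like these is to bucket every input by a cheap “shape” signature (length, character classes) and flag inputs whose bucket is vanishingly rare. Here’s a rough Java sketch along those lines; the signature and the 0.1% threshold are arbitrary illustrative choices, and a real solution would want more (and smarter) signatures.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OutlierSketch {
    // Cheap "shape" signature for a line: its length plus a crude
    // character-class profile (all digits, all letters, or mixed).
    static String signature(String s) {
        String cls = s.chars().allMatch(Character::isDigit) ? "digits"
                   : s.chars().allMatch(Character::isLetter) ? "letters"
                   : "mixed";
        return "length " + s.length() + ", " + cls;
    }

    public static void main(String[] args) throws IOException {
        // Pass 1: count how common each signature is. (Holding all lines
        // in memory is fine at 500MB; re-reading the file would cut RAM.)
        List<String> lines = new ArrayList<>();
        Map<String, Integer> counts = new HashMap<>();
        try (BufferedReader r = new BufferedReader(new FileReader(args[0]))) {
            for (String line; (line = r.readLine()) != null; ) {
                lines.add(line);
                counts.merge(signature(line), 1, Integer::sum);
            }
        }
        // Pass 2: flag lines whose signature covers under 0.1% of inputs.
        for (String line : lines) {
            int c = counts.get(signature(line));
            if (c < 0.001 * lines.size()) {
                System.out.printf("%s (only %.2f%% of inputs have shape: %s)%n",
                        line, 100.0 * c / lines.size(), signature(line));
            }
        }
    }
}

On the sample input above, both 348597 (length 6, digits) and abcde (length 5, letters) land in buckets of size 1 and get flagged.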

The Ideal Solution

  1. Is relatively fast and memory efficient. For 500MB of data, it’s okay to take a few seconds or a gigabyte or two of RAM, but remember, you will often encounter much more data in production!
  2. Provides helpful justifications (e.g. “99.95% of strings have length 4” instead of “looks suspicious”).
  3. Finds a significant portion of actual outliers without too many false positives.
  4. Works well across multiple domains: marathon times, UK postal codes, book ISBNs, country names, etc.
  5. Consists of well-written/well-structured code, so that it’s easy to tweak the outlier-detection algorithm 6 months from now.
  6. Half credit for unimplemented ideas. If you implement algorithm “X”, great! If you don’t want to invest more time but make a note that “Algorithm X seems like it would be promising”, we’ll happily give you partial credit.

Sample Data

It’s hard to give sample data for a problem like this without strongly biasing approaches and solutions. If you’d like some sample data, we recommend that you download a real dataset: marathon times for a race, a list of all words in the English language, etc. If your heuristics work well, then you (probably) won’t see many outliers in the downloaded data. However, if you manually add a couple of suspicious lines, they should be flagged.

Problem 2: Dedupe

You will be given a file with up to 1,000,000 inputs (< 500MB of data). There will be one input per line, and each input will consist of the following TAB-separated elements: id, address, city, region, and postal code. You can assume that the data itself will not contain tabs, so just splitting each line by ‘\t’ will be fine.

Your goal is to generate a list of pairs of elements that you think are identical. For example:

Sample Input

1 123 First Street Los Angeles California 90025
2 123 First St. West Los Angeles CA 90025
3 123 First Street Paris Texas 75462
4 123 First Street Paris France 75008
5 123 1er Street Paris France 75008

Sample Output

(1,2)
(4,5)
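
To avoid comparing every possible pair (point 2 below), one standard trick is blocking: bucket inputs by a cheap key, such as postal code, and only compare within a bucket. The sketch below does that, with a toy address normalizer and exact matching after normalization. All of these are illustrative choices, and deliberately incomplete.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DedupeSketch {
    // Crude address normalization: lowercase, expand a few abbreviations,
    // spell out small ordinals. A real canonicalizer would go much further.
    static String normalize(String s) {
        return s.toLowerCase()
                .replaceAll("\\bst\\.?\\b", "street")
                .replaceAll("\\bave\\.?\\b", "avenue")
                .replaceAll("\\b1st\\b", "first")
                .replaceAll("\\b2nd\\b", "second")
                .replaceAll("[^a-z0-9 ]", "")
                .replaceAll("\\s+", " ")
                .trim();
    }

    public static void main(String[] args) throws IOException {
        // Block on postal code so we never compare across blocks.
        Map<String, List<String[]>> blocks = new HashMap<>();
        try (BufferedReader r = new BufferedReader(new FileReader(args[0]))) {
            for (String line; (line = r.readLine()) != null; ) {
                String[] f = line.split("\t"); // id, address, city, region, postal
                blocks.computeIfAbsent(f[4], k -> new ArrayList<>()).add(f);
            }
        }
        // Within each block, declare a pair dupes if normalized addresses match.
        for (List<String[]> block : blocks.values()) {
            for (int i = 0; i < block.size(); i++) {
                for (int j = i + 1; j < block.size(); j++) {
                    if (normalize(block.get(i)[1]).equals(normalize(block.get(j)[1]))) {
                        System.out.printf("(%s,%s)%n", block.get(i)[0], block.get(j)[0]);
                    }
                }
            }
        }
    }
}

Note that exact matching after normalization underfolds: it catches (1,2) above but misses (4,5), since ‘1er’ isn’t in the tiny abbreviation table. Swapping the equality test for an edit-distance or token-overlap similarity with a threshold is the natural next step.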

The Ideal Solution

  1. Works for addresses from various countries, although even doing a good job with US data is an impressive feat.
  2. Does not explode combinatorially. If you’re comparing every possible pair of inputs, that won’t scale very far.
  3. Does not overfold often (i.e. two inputs are declared dupes, but they are not).
  4. Does not underfold often (i.e. two inputs are dupes, but they are not marked that way by the solution).
  5. Consists of well-written/well-structured code, so that it’s easy to tweak the deduping algorithm 6 months from now.
  6. Some algorithm configuration is okay, but should not be excessive.
  7. Half credit for unimplemented ideas. If you implement algorithm “X”, great! If you don’t want to invest more time but make a note that “Algorithm X seems like it would be promising”, we’ll happily give you partial credit.

Sample Data

As with outlier detection, sample data tends to bias solutions. A good test would be to find a list of locations for some chain (Starbucks, McDonald’s, etc.). Once you have this list, copy some of the inputs and modify them slightly (add a typo, change the street number, reformat the phone number, etc.). A good solution should mark the cosmetically modified copies as duplicates of their originals, while correctly treating copies with substantive changes (like a different street number) as distinct rows.