Factual Partners with Nexage to Accelerate the Availability of Audience Targeting Data in Mobile Advertising

We are very excited to announce our new partnership with Nexage, the premium mobile programmatic marketplace. Demand Partners of Nexage and Factual will leverage our Geopulse Audience solution in order to provide enriched geo-behavioral targeting, enabling advertisers to better reach their desired audience.

Via Nexage Connect, advertisers can now target mobile campaigns based on how users interact with the physical world. Over 275 segments will be available at launch, with more being added over time. These include behavioral segments such as business traveler, in-market auto buyer, and quick-service restaurant diner; geographic segments such as zip code; and Brand Affinity targeting for McDonald’s, Target, Home Depot, Toyota, Foot Locker, and hundreds more retail, dining, financial, grocery, and automotive brands. Advertisers can choose any combination of these segments to build the precise audience that best fits their campaign.

To develop the audience segments, Geopulse Audience applies our extensive location data (including our Global Places data, which covers over 65 million local businesses and points of interest across 50 countries, as well as additional contextual data about geographic areas), deep data engineering, and machine learning expertise to the vast amount of location data running through Nexage Connect. By leveraging all of this, Geopulse Audience can better understand mobile user behavior and use that understanding to develop audience segments.

“Nexage has designed our platform to be the market’s most robust and advanced technology in mobile advertising because we understand that the strength of technology directly impacts our customers’ ability to do business and make money,” said Mark Connon, CRO & EVP of Corporate Development of Nexage. “Nexage Connect is another example of extending our platform and providing a powerful capability that enables our customers—both media buyers and publishers—to be more successful and manage their business how they want.”

Publishers running ad inventory through Nexage will benefit both from the increase in the value of their ad inventory driven by the audience segments and from the increase in the relevance of the ads seen by their users. Nexage and Factual are also working together to bring Geopulse Audience analytics directly to publishers. Demand Side Platforms (DSPs) and ad networks connected to Nexage will benefit by being able to offer their advertisers enhanced targeting capabilities. Advertisers will benefit from more effective ad spend by being able to better reach their desired audience.

“Factual makes signals from location, one of the most important and unique value drivers in mobile, accessible to marketers,” said Bill Michels, SVP of Products and Partnerships at Factual. “Our audience segments based on location intelligence strengthen media buyers’ and advertisers’ ability to deliver more personalized content, and we’re excited about bringing this capability to Nexage, the leading mobile programmatic platform.”

-Glen Nigel Straub, Director, Business Development

Factual to Enhance First Data’s Insightics Solution for SMBs

First Data has chosen to partner with Factual to enhance its Insightics solution for small and medium-sized businesses. Insightics is innovative cloud-based software that unlocks the power of the big data behind payment transactions, giving SMB merchants the ability to monitor key business metrics and better understand their customers. First Data already evaluates a vast amount of data, but needed more in-depth data for additional attributes and intelligence. This collaboration goes beyond transaction processing and produces advanced business and consumer insights that help merchants grow their businesses.

It’s a big deal for small businesses to get actionable insights about their operations. How many new versus repeat customers have they been getting over the past few weeks? Which parts of the local neighborhood do customers come from, and how far are they willing to travel to reach the store? If the SMB is having a good or bad month, is it just that SMB, or all similar businesses in the area? Delivering such capabilities requires all the horsepower of state-of-the-art big data analytics, presented to the merchant in an extremely easy-to-understand manner. That’s what First Data Insightics does.

These powerful insights combine the tremendous amount of payments data flowing through First Data’s systems with learnings from Factual’s U.S. Places data. For First Data to provide the most relevant and beneficial insight to a merchant, it needs to know a lot about that merchant – that’s where Factual starts to play a part. Factual provides data that includes the exact location of each merchant, similar and dissimilar merchants nearby, the name of the neighborhood where the merchant is located, hours of operation, price range, and more. U.S. Places, which covers over 17 million local businesses and points of interest, gives First Data a comprehensive view of all of the businesses in any geographic area. The rich categorization scheme allows for highly refined benchmarking, ensuring that, for example, a beauty salon is benchmarked strictly against other beauty salons and not also against manicure/pedicure nail shops.

First Data will also be using our Restaurants, Hotels, and Healthcare Providers Extended Attributes data sets. These contain extensive, category-specific attributes that enable even more refined segmentation and analysis. With our restaurants data, for example, a First Data merchant, say a 4-star Italian restaurant, can analyze its performance against other 4-star Italian restaurants, other 4-star restaurants, or all other Italian restaurants, to get a much richer picture of how it is doing and how much of its performance is driven by its own business versus the overall business environment.

Here’s what we heard from First Data:

“After extensive research to find a data partner, we found that Factual far exceeded other companies on data quality and its capabilities to maintain and keep the data up to date,” stated Sandeep Garg, VP Product, Information and Analytics at First Data. “Additionally, Factual goes beyond legacy data companies with its technology and the innovative services that it builds on top of the data, enabling us to provide the valuable insights needed for our clients to grow their businesses.”

Thousands of developers (small business owners) in the mobile space use Factual data to drive their businesses. Processing more than 59 billion transactions globally for more than six million merchant locations in 2013, First Data has long been a data powerhouse. We’re looking forward to working with First Data to bring the power of big data to small businesses!

How Factual Uses Persistent Storage For Its Real-Time Services

As part of Factual’s Geopulse product suite, we need to be able to absorb and process large amounts of data, and deliver back a somewhat smaller amount of data. There is a significant amount of technology available for the processing stage, but less for intake and delivery. Today, we’re open sourcing two libraries that we’ve used for these purposes, s3-journal and riffle. Both libraries are notable for making efficient use of persistent storage by avoiding random writes, which will be discussed in more detail later in this post.

s3-journal

On the intake side, we need to persist the data we’re being sent somewhere Hadoop can get at it. At the volume we were receiving it (billions of daily entries, comprising terabytes of compressed text), S3 seemed to be the most straightforward place to put it. While we don’t make a habit of losing data, guaranteed delivery to S3 upon receiving a particular entry was not a requirement. This meant that we could implement our intake service as a horizontally scalable set of servers, accepting data and uploading it in batches to S3.

At first, we used the Hadoop S3 client, which we had used in some lightweight use cases before. Unfortunately, it ended up coping poorly. Each server had a sustained throughput of about 10 megabytes of compressed data per second, and every so often S3 would hiccup and stop accepting data for a few minutes. When this happened, Hadoop’s client would simply buffer the data in-memory until all the memory was exhausted, at which point the process would become non-responsive. That server’s traffic would then spill over onto the other servers, and the process would be repeated.

One solution would be to give each process so much memory that it could survive any S3 outage of reasonable duration, but this is fragile both in the face of unreasonably long outages and with respect to process death: arbitrarily large in-memory buffers enable arbitrarily large data loss. Given this, a much better solution is to write the buffer to disk, since that gives us much more space to work with and makes us robust to process death.

A downside to this approach is that storing and retrieving things in memory is much, much faster than accessing things on disk, even if the disk is solid state. The underlying library, durable-queue, uses append-only slab-allocated buffers to minimize I/O overhead, but this doesn’t change the fundamental capabilities of the hardware.

So in addition to minimizing overhead, s3-journal writes entries to disk in batches constrained by the number of entries, the maximum time allowed to elapse between the first and last entry, or both. These entries are written and then read out as a single binary blob, giving us a high effective throughput measured in entries per second without saturating the throughput of the disk controller. The downside to this, obviously, is the same as with the previous client: data buffered in memory may be lost forever. However, the issue with the other client was not so much that data could be lost, but that the amount of data that could be lost was unbounded. By exposing parameters for how often s3-journal flushes to disk, people using the library can find their own golden mean between throughput and durability.
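To illustrate the idea, here is a conceptual sketch of count/time-bounded batching. This is not s3-journal’s actual API; the function and parameter names are made up, and a real implementation needs synchronization and a background flush timer.

(require '[clojure.string]) ; no external deps; plain Clojure

;; Entries accumulate in an in-memory buffer and are handed to `flush!`
;; (which would append them to the on-disk journal) once either the entry
;; count or the elapsed time since the last flush crosses its threshold.
;; Not thread-safe; for illustration only.
(defn batching-appender
  [flush! {:keys [max-entries max-latency-ms]}]
  (let [buffer     (atom [])
        last-flush (atom (System/currentTimeMillis))]
    (fn [entry]
      (let [entries (swap! buffer conj entry)
            now     (System/currentTimeMillis)]
        (when (or (>= (count entries) max-entries)
                  (>= (- now @last-flush) max-latency-ms))
          (flush! entries)
          (reset! buffer [])
          (reset! last-flush now))))))

Larger thresholds buy throughput at the cost of a larger window of entries that can be lost if the process dies before the next flush.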

Since we started using it at the beginning of the year, s3-journal has successfully persisted over a trillion entries into S3, comprising over half a petabyte of textual data. Given that it’s built on top of a service which is under continuous development, and that new versions may be deployed without our knowledge at any time, there may be new and exciting failure modes introduced in the future. But so far it has proven a simple and reliable way to get our data safely persisted, and we expect it will be a useful tool for a wide variety of other applications.

riffle

After the data is persisted and processed, we need to get it back to our customers in a form they find useful. For some, we can simply expose it via our public API, but many of them are latency-sensitive enough that the data needs to be on-premise. Of those customers, some already have data stores that can be used to serve up our data (which is just a large number of key/value pairs), but many do not. Given this, and given our desire to make the integration as straightforward as possible, we figured we’d just provide one for them.

There exist today a surprising number of in-process key/value stores, including LevelDB, LMDB, RocksDB, and many others. Each of these provides some variation on the “database as library” use case, where a process needs to persist and look up certain values by key. Our first version took LevelDB, which had decent Java bindings, added some background syncing logic, and wrapped an HTTP server around it.

At first, this worked quite well. Starting from an empty initial state, we were able to quickly populate the database, all the while maintaining an impressively low latency on reads. But then the size of the database exceeded the available memory.

LevelDB uses memory-mapping to keep its index files in memory regions that can be quickly accessed by the process. However, once the size of the memory-mapped regions exceeds the available memory, some will be evicted from memory, and only refetched if the region is accessed again. This works well if only some of the regions are “hot” – they can stay in memory and the others can be lazily loaded on demand. Unfortunately, our data had a uniform access pattern, which meant that regions were being continuously evicted and reloaded.

Even worse, once the database grew past a threshold, write throughput plummeted. This is because of write amplification, which is a measure of how many bytes need to be written to disk for each byte written to the database. Since most databases need to keep their entries ordered, a write to the middle of an index will require at least half the index to be rewritten. Most databases will offset this cost by keeping a write-ahead log, or WAL, which persists the recent updates in order of update, and then periodically merges these changes into the main index. This amortizes the write amplification, but the overhead can still be significant. In our case, with a 100GB database comprising 100mm entries, it appeared to be as high as 50x.

So this meant that each time we had a batch of keys to update, it was not only slow (and getting slower), but it saturated the I/O throughput of the system, greatly degrading the read latency. This degraded state would persist until the updates were done, which could take as long as a day.

One solution we investigated was using RocksDB, which is a fork of LevelDB made in response to precisely the sort of production behavior we saw. But while RocksDB was better, it was still fundamentally a solution to a difficult problem we didn’t care about: random writes to disk. Our usage pattern was almost all reads, except for batch updates every week or so. Rather than write to the database, we could simply write a new database. And so we did.

Previous examples of this approach include D. J. Bernstein’s cdb and Google’s sorted-string tables (SSTables), which were first introduced in the BigTable paper and have gone on to be used extensively within Google (including as the underlying storage format in LevelDB) and elsewhere. Both formats are similar in their intent: an index maps keys onto the location of the values on disk, meaning that values can generally be fetched in a single read from disk.

There are, however, some meaningful differences between the two. SSTables keep the entire set of keys in memory, so that the file offset for the value can be immediately determined. The cdb format, conversely, uses an in-file hashtable that maps hashed keys onto offsets. The possibility of hash collisions means that there may be more than one lookup within the file, but it also means our memory usage is not determined by the size of our keys. Even if the hashtable is memory-mapped, as it usually is, the cost per key is fixed (typically ~10 bytes per key), rather than dependent on the size of the keys themselves. The original spec for cdb only allows for databases smaller than 4 gigabytes, but this can be extended to a terabyte by adding 8 bits to the hashtable’s address space, and sharding can enable multi-terabyte indices.
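To make that concrete, here is a rough sketch of a lookup against a simplified cdb-style layout. The layout here is a toy for illustration, not the actual cdb or Riffle format: a table of num-slots 12-byte slots (4-byte key hash, 8-byte record offset) at the start of the file, followed by length-prefixed key and value records.

(import '[java.io RandomAccessFile]
        '[java.util Arrays])

;; A zero offset marks an empty slot; collisions probe linearly to the next slot.
(defn lookup [^RandomAccessFile file num-slots ^bytes k]
  (let [h (Arrays/hashCode k)]
    (loop [i (mod h num-slots)]
      (.seek file (* 12 i))
      (let [slot-hash (.readInt file)
            offset    (.readLong file)]
        (cond
          (zero? offset)     nil                              ; empty slot: key absent
          (not= slot-hash h) (recur (mod (inc i) num-slots))  ; different hash: keep probing
          :else
          (do (.seek file offset)
              (let [klen     (.readInt file)
                    stored-k (byte-array klen)]
                (.readFully file stored-k)
                (if (Arrays/equals stored-k k)
                  (let [vlen (.readInt file)
                        v    (byte-array vlen)]
                    (.readFully file v)
                    v)
                  ;; same hash, different key: keep probing
                  (recur (mod (inc i) num-slots))))))))))

The important property is the one described above: once we land on the right slot, fetching the value is typically a single seek and read.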

Keys within SSTables are stored in lexicographic order, so that any two SSTable files can be merged in linear time and constant memory. Conversely, cdb files are arbitrarily ordered, which allows a new database with a single updated value to be created in a way that minimally disturbs the ordering of the previous database.
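The merge itself is the classic single-pass merge of key-sorted sequences. A minimal in-memory sketch follows; a real SSTable merge streams entries from disk rather than holding them in memory.

;; Merge two sequences of [key value] pairs, each already sorted by key.
;; When the same key appears in both, keep the entry from `a` (assumed newer).
(defn merge-sorted-entries [a b]
  (lazy-seq
    (cond
      (empty? a) b
      (empty? b) a
      :else
      (let [[ka :as ea] (first a)
            [kb :as eb] (first b)
            c (compare ka kb)]
        (cond
          (neg? c) (cons ea (merge-sorted-entries (rest a) b))
          (pos? c) (cons eb (merge-sorted-entries a (rest b)))
          :else    (cons ea (merge-sorted-entries (rest a) (rest b))))))))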

Lastly, contemporary SSTables use block compression, where contiguous values are grouped and compressed. If there is shared structure between values, this can greatly increase the compression ratio relative to compressing individual values, but it means that each lookup requires reading and decompressing more than a single value. While cdb doesn’t use any sort of compression, a similar scheme could be grafted onto it.

Our implementation cherry-picked the elements we wanted from each of these sources: fixed memory overhead per key, linear-time merging, and block compression. While memory-mapping is used for the hashtable, values are read directly from disk, decoupling our I/O throughput from how much memory is available. The keys are consistently ordered to allow for linear merges, but not lexicographically, so unlike SSTables, range queries are not possible. The resulting library, Riffle, is far from novel, but it is only ~600 lines of code and is something we can understand inside and out. That allowed us to write a simple set of Hadoop jobs which, given an arbitrary set of updated keys and values, construct a set of sharded Riffle indices; these can then be downloaded by our on-premise database servers and efficiently merged into the current set of values. A server which is new or has fallen behind can simultaneously download many such updates, merging them all together in constant space and linear time.

It’s worth noting that this approach of building entire databases using Hadoop is not new, as demonstrated by ElephantDB and others. One reason for avoiding ElephantDB is that it uses LevelDB under the covers (albeit in a read-only capacity), and generally assumes that the distributed filesystem and database servers are on the same local network and operated by the same people. It’s an excellent piece of infrastructure used to good effect in a number of places, but not quite what we needed.

Riffle is newer than s3-journal, but it has handled significant traffic and volume without any issues. For reference, here are some basic random read benchmarks, taken on an AWS instance using the riffle benchmark command:

Unfortunately, we haven’t gone through the exercise of loading the same 100mm entries into LevelDB, LMDB, or others, so it’s hard to say exactly how “good” these results are. However, given the use of memory-mapping by many of these databases, as well as the O(log n) lookups in the underlying on-disk LSM trees, compared to the O(1) lookups for Riffle, it’s likely that any such comparison would favor Riffle on a dataset of similar size. For datasets where the memory-mapped regions actually fit in memory, however, the other databases would be faster. It’s worth noting, though, that in either case the use of Clojure to implement Riffle has no real impact on performance, since it’s inescapably I/O bound.

Riffle inhabits an interesting local maximum in the design space. It’s certainly not a drop-in replacement for other key/value databases, but we think it will be useful in a broad range of applications, and look forward to seeing how it’s used elsewhere.

Using Clojure To Generate Java To Reimplement Clojure

Most data structures are designed to hold arbitrary amounts of data. When we talk about their complexity in time and space, we use big O notation, which is only concerned with performance characteristics as n grows arbitrarily large. Understanding how to cast an O(n) problem as O(log n) or even O(1) is certainly valuable, and necessary for much of the work we do at Factual. And yet, most instances of data structures used in non-numerical software are very small. Most lists are tuples of a few entries, and most maps are a few keys representing different facets of related data. These may be elements in a much larger collection, but this still means that the majority of operations we perform are on small instances.

But except in special cases, like 2- or 3-vectors that represent coordinates, it’s rarely practical to specify that a particular tuple or map will always have a certain number of entries. And so our data structures have to straddle both cases, behaving efficiently at all possible sizes. Clojure, however, uses immutable data structures, which means it can do an end run around this problem. Each operation returns a new collection, which means that if we add an element to a small collection, it can return something better suited to holding a large collection.

Clojure already takes advantage of this, by representing maps with eight or fewer elements as a PersistentArrayMap, and everything else as a PersistentHashMap. A PersistentArrayMap is pretty much what it sounds like: it has an Object[16] array that holds keys and values in alternating slots, and for lookups will walk each key and do an equality check. This is less naive than it may seem; over small arrays, linear scans are often faster, both due to simpler code and memory access coherency. For this reason, many quicksort implementations will use insertion sort at the leaves.

A map with fewer than eight entries simply leaves the remaining slots of the array empty, but if an entry is added to a full PersistentArrayMap, it spills over into a PersistentHashMap, which under the covers is a 32-way tree.
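The transition is easy to observe at the REPL (the concrete classes are implementation details, but this is the behavior of current Clojure versions):

> (type (zipmap (range 8) (range 8)))
clojure.lang.PersistentArrayMap

> (type (zipmap (range 9) (range 9)))
clojure.lang.PersistentHashMap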

This has a very nice property: we can use approaches which are only optimal at a small size without suffering drawbacks as the collection grows. So nice, in fact, that a year ago I adapted it for vectors in a library called clj-tuple, which implemented separate types for each vector cardinality, up to six elements. The library is discussed in more detail in this talk, but the general idea is that even the most basic operations in a vector require iteration, often via a java.util.Iterator, since much of Clojure tries to provide general implementations for equality, hash calculations, and other similar operations. If we have a type of vector which can only be created by adding an element to an empty vector, we can guarantee there’s only one element, and remove any need for iteration. As we proceed, inductively, to 2-vectors and 3-vectors and so on, we can make sure that all the operations that normally require iteration are unrolled.

In clj-tuple, special vectors were generated via macros for every cardinality up to six, after which it spilled over into a standard Clojure vector. This made certain operations, especially those where a vector was used as a key in a map, significantly faster. It also suggested that while not entirely general, the PersistentArrayMap could be made more specific. When adding or replacing an entry, it must copy the entire array, update the indices, and instantiate a new wrapper. Even when there’s a single entry, it must go through the ceremony of iterating over that entry.

While at Clojure West, I spoke to Rich Hickey, who said that he’d be happy to accept a patch which implemented unrolled maps and vectors, as long as it was in Java. Months later, at a Factual hackathon, I got around to actually attempting to implement this. The result is 1000 lines of Clojure, which generated 5500 lines of Java, which are currently awaiting review to be merged into Clojure.

Writing Java with Clojure

Clojure can call Java functions via JVM bytecode, but there’s no special interop between Clojure and Java source code. So when I started out, I wrote something like this:

(require '[clojure.string :as str])

(defn class
  [{:keys [modifiers implements extends]}
   name
   & body]
  (str
    (str/join " " modifiers)
    " class " name
    (when extends
      (str " extends " extends))
    (when implements
      (str " implements " (str/join " " implements)))
    "{ "
    (apply str body)
    " }"))

This just concatenates a bunch of strings together into something which is valid Java, albeit a little difficult to read:

> (class
    {:modifiers '[abstract]}
    'AbstractSingletonVisitor
    "protected Object singleton;")
"abstract class AbstractSingletonVisitor{ protected Object singleton; }"

The result is perfectly valid Java which, with a package header and a proper file name, would be happily compiled by javac. But this workflow leaves a lot to be desired. My string-concatenated Java is bound to have some missing semicolons or other such issues, and when I pass my barely-formatted Java into javac, it will give an error referencing line 1, column 50000. Mapping the compiler error back into the code which generated it will be a chore, to say the least.

This approach would also mean that I could only work with code which can be compiled, which means that the code needs to not only be syntactically valid, but also have proper package and import headers, a surrounding class with all the necessary fields and methods defined, and so on. This is greatly at odds with my typical Clojure implementation pattern, which involves writing small bits of functionality, testing each at the REPL, and building upwards from there. The goal is to quickly invalidate dead ends, and keep yourself moving constantly forward.

For this reason, my first four hours were spent figuring out how to hijack the Java parsing and formatting mechanisms in Eclipse, and use them for my own questionable purposes. The resulting function was only six lines long:

(import
  'org.eclipse.jdt.core.formatter.CodeFormatter
  'org.eclipse.jdt.internal.formatter.DefaultCodeFormatter
  'org.eclipse.jface.text.Document)

(defn format-java [s]
  (let [f  (DefaultCodeFormatter.)
        te (.format f CodeFormatter/K_UNKNOWN s 0 (count s) 0 nil)
        d  (Document. s)]
    (.apply te d)
    (.get d)))

(NB: the dependency versions to make this work are hilariously specific)

This function takes a string representing possibly-correct Java, and either returns a pretty-printed version of the Java, or throws an exception if it’s invalid. Unlike javac, this will format Java snippets, not just full class files:

> (format-java "1+2+3")
"1 + 2 + 3"

> (println
    (format-java
      "private void sayHello(){ System.out.println(\"hello\"); }"))
private void sayHello() {
    System.out.println("hello");
}

With this, I could begin to build up my data structure implementations piecemeal, testing them each step of the way. If an input threw an exception, I could narrow the input until I hit upon exactly what was causing the issue.

Of course, this didn’t catch even simple semantic issues, like misspelling the name of a method, which meant that I still needed to rely on the compiler to validate the final code. But since the input to the compiler was now well-formatted, it was much easier to map the Java code back onto the Clojure which had generated it.

Implementing Persistent Vectors

Clojure data structures serve two masters: the clojure.lang.* interfaces that are used by Clojure functions, and the java.util.* interfaces that allow Clojure’s data structures to be consumed by normal Java libraries. This means that many of the methods they have to implement are redundant, like size() on java.util.Collection and count() on clojure.lang.Counted.
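Both sets of interfaces are visible from the REPL; a Clojure vector answers both the Clojure-side count and the Java-side size():

> (count [1 2 3])
3

> (.size [1 2 3])
3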

This makes implementing collections that behave like Clojure’s base collections fairly tedious, and easy to screw up – if you forget to implement size(), everything will work fine in Clojure, but fail months later when you pass the collection into a Java library. There are libraries which try to make this easier, but it remains a fundamental pain point for Clojure developers trying to venture off the beaten path. Clojure’s own implementation uses abstract classes to hide away most of the boilerplate, but Clojure has poor interop with abstract classes.

Luckily, we’re writing Java ourselves, so we can just use the abstract classes directly. Even so, in order to implement something that acts like Clojure’s persistent vectors, we need to implement all of the following:

// returns the metadata on the object
public IPersistentMap meta()

// returns a new collection with the provided metadata
clojure.lang.IObj withMeta(IPersistentMap meta)

// returns the nth element, throwing an exception if the index is out of bounds
Object nth(int i)

// like nth, but returns 'notFound' if the index is out of bounds
Object nth(int i, Object notFound)

// returns the size of the vector
int count()

// returns an empty instance of the vector
clojure.lang.IPersistentVector empty()

// returns a new collection with 'val' at the given index
clojure.lang.IPersistentVector assocN(int i, Object val)

// returns a new collection with 'val' appended
clojure.lang.IPersistentVector cons(Object val)

// returns a new collection without the last element
clojure.lang.IPersistentVector pop()

// reduces over the collection, passing in both the index and value
Object kvreduce(IFn f, Object init)

// reduces over the collection, passing in just the value
Object reduce(IFn f, Object init)

// like 'reduce', but uses the first element as the 'init' value
Object reduce(IFn f)

// returns the JVM-style hash of this collection
int hashCode()

// returns the Clojure-style hash of this collection
int hasheq()

// returns the result of a JVM-style equality check
boolean equals(Object o)

// returns the result of a Clojure-style equality check
boolean equiv(Object o)

// returns the vector in array form
Object[] toArray()

// returns an iterator over the vector
java.util.Iterator iterator()

Notice that since Clojure uses its own equality semantics, mostly to allow BigIntegers to be equal to their corresponding primitive numbers, it must implement its own hash code and equality methods. Clojure also recently began using a more expensive hashing scheme to minimize hash collisions.
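A quick REPL illustration of the two sets of semantics (1N is Clojure’s arbitrary-precision integer literal; this is the behavior of recent Clojure versions):

> (= 1 1N)
true

> (.equals 1 1N)
false

> (= (hash 1) (hash 1N))
true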

Most of these methods are trivial to implement if we know our size in advance, and simply have a field per element. Let’s consider a 2-vector, represented by the fields e0 and e1:

int count() {
    return 2;
}

Object nth(int i, Object notFound) {
    switch (i) {
        case 0: return e0;
        case 1: return e1;
        default: return notFound;
    }
}

int hashCode() {
    int hash = 1;
    hash = (31 * hash) + (e0 == null ? 0 : e0.hashCode());
    hash = (31 * hash) + (e1 == null ? 0 : e1.hashCode());
    return hash;
}

It’s easy to see how these can be programmatically generated for vectors of arbitrary size, and how the assembly generated for these functions will be extremely straightforward. In the hashCode() implementation, especially, where each element only requires a multiplication and addition (in addition to its own hashCode() call), the overhead of iterating over the collection can easily dwarf the cost of actually calculating the hash.

A less obvious method which benefits from unrolling is reduce:

Object reduce(IFn f, Object init) {
    init = f.invoke(init, e0);
    if (RT.isReduced(init)) {
        return ((IDeref) init).deref();
    }
    init = f.invoke(init, e1);
    if (RT.isReduced(init)) {
        return ((IDeref) init).deref();
    }
    return init;
}

In Clojure, reduction is the underlying mechanism for eager sequence processing. By optimizing this, we transitively optimize most operations over the data structure.

There are definitely diminishing returns on unwinding these sorts of loops. In a language like C, we’d have to balance the benefit of keeping our instructions small and able to fit in cache against the overhead of iteration (which is often vanishingly small with for loops on modern hardware). On the JVM, the cost of iterating with a java.util.Iterator can be significantly higher, but so too can the cost of unrolling. The JVM will only apply certain optimizations to functions that are smaller than a certain number of bytecode instructions. Past that, there is a size threshold at which it ceases to try to optimize the function at all.

This means that for smaller vectors we can safely unroll these collections, but past some point it makes sense to spill over into a more typical Clojure persistent vector. Based on some simple benchmarks I decided to set the magic threshold at six elements for both the vectors and maps, but a more exhaustive analysis may be useful here.

An interesting facet of 2-vectors is that clojure.lang.MapEntry implements both clojure.lang.IPersistentVector and java.util.Map.Entry, which means that we can treat a map as a sequence of 2-vectors. We cannot, however, simply add any 2-vector we have onto a map, since it expects a Map.Entry. Obviously we can’t have Clojure’s PersistentVector implement Map.Entry, since that is only valid when it has two elements, but since our unrolled implementation has an entire class dedicated to the two-element case, we can ensure that all 2-vectors can be conj’ed onto a map.

Implementing Persistent Maps

Clojure’s persistent maps have a similar panoply of functions that need to be implemented, and most can be implemented as simply as they are for the vectors. Unfortunately, when doing a lookup on a map we cannot simply jump to the correct entry; we have to scan over all the entries and see if there’s a matching key.

private int indexOf(int h, Object key) {
    if (key instanceof Keyword) {
        if (k0 == key) {
            return 0;
        } else if (k1 == key) {
            return 1;
        }
        return -1;
    }
    return indexOfObj(h, key);
}

Notice we first check if the key is a clojure.lang.Keyword, which is typical for Clojure’s maps. Keywords have the nice property that there’s only ever a single instance of a keyword for a given string, which means that when checking if two keywords are equivalent, we only need to check if they’re the same object. This same optimization is already used in Clojure’s existing PersistentArrayMap.
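This interning is easy to verify at the REPL:

> (identical? :foo (keyword "foo"))
true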

However, if the key is not a keyword, we have to fail over to a more generic lookup:

private int indexOfObj(int h, Object key) {
    Util.EquivPred ep = Util.equivPred(key);
    if (h0 == h && ep.equiv(key, k0)) {
        return 0;
    } else if (h1 == h && ep.equiv(key, k1)) {
        return 1;
    }
    return -1;
}

Notice that at each step, before doing a generic equivalence check we first compare their hashes. Unlike PersistentArrayMap, in addition to the keys and values the PersistentUnrolledMap also stores the hash code. This is for a very simple reason: most equivalence checks we do will be false. In the case where we’re constructing a map (i.e. adding new entries), all of our checks will be false. In the case where we’re looking up a value, depending on the key’s location we may need to scan all the other entries before arriving at it.

Clojure’s equivalence checks are relatively simple:

static public boolean equiv(Object k1, Object k2) {
    if (k1 == k2) return true;

    if (k1 != null) {
        if (k1 instanceof Number && k2 instanceof Number)
            return Numbers.equal((Number) k1, (Number) k2);
        else if (k1 instanceof IPersistentCollection
                 || k2 instanceof IPersistentCollection)
            return pcequiv(k1, k2);
        return k1.equals(k2);
    }

    return false;
}

First it checks if the two objects are identical, and then checks whether we need to use the special equivalence checks for numbers or collections, and otherwise falls through to the base Java equals() check. However, this is enough overhead that calls to equiv() are typically not inlined, especially when we’re doing many of them. Avoiding this entire stack of calls by doing a single integer comparison can be a significant gain.

This is at the cost of storing, rather than recomputing, the 32-bit hashes of each key. But hashes are also stored in a PersistentHashMap, so it’s a very reasonable tradeoff.

Implementing Transient Collections

Updating an unrolled collection is relatively inexpensive, since we don’t need to copy any underlying arrays, just directly instantiate a new instance:

public IPersistentVector assocN(int i, Object val) {
    switch (i) {
        case 0: return new Card2(meta, val, e1);
        case 1: return new Card2(meta, e0, val);
        case 2: return cons(val);
        default: throw new IndexOutOfBoundsException();
    }
}

(Note that Card2 is the classname of our 2-tuple vector implementation)

Still, no matter how cheap allocations are, it’s still preferable to not make any. Clojure’s transient collections allow us to do in-place updates on collections in order to quickly construct them, and our unrolled collections also need a transient form. This ends up being a little strange, though, because this means we no longer have a 1-to-1 relationship between how many fields we have and how much data we contain. Instead we have a single transient vector which has room for six elements, and an empty transient map needs fields that can store six keys, values, and hashes.
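For reference, this is what the transient API looks like from the REPL:

> (persistent! (reduce conj! (transient []) (range 5)))
[0 1 2 3 4]

The unrolled collections need to support the same conj!/assoc!/persistent! protocol, but backed by a fixed set of fields and an explicit count.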

This means that we can’t simply unroll our loops anymore – we only want to iterate over elements which have values. If we were iterating over an array, this would be easy:

for (int i = 0; i < count; i++) {
    ...
}

But since we don’t have an array, the obvious solution is a little ugly:

if (count > 0) {
    ...
}
if (count > 1) {
    ...
}
...

This creates a lot of instruction branches, and for smaller collections forces us to make multiple redundant checks (“Is 0 > 0? No? Well, is 0 > 1?”). We can short circuit these checks like so:

if (count > 0) {
    ...
} else {
    return ...;
}
if (count > 1) {
    ...
} else {
    return ...;
}
...

But this makes our code even more noisy, and the resulting bytecode even larger. Luckily, we have a solution in the form of Duff’s device, which uses the case fall-through behavior in switch to great advantage:

private int indexOf(int h, Object key) {
    if (key instanceof Keyword) {
        switch (6 - count) {
            case 0:
                if (k5 == key) {
                    return 5;
                }
            case 1:
                if (k4 == key) {
                    return 4;
                }
            case 2:
                if (k3 == key) {
                    return 3;
                }
            case 3:
                if (k2 == key) {
                    return 2;
                }
            case 4:
                if (k1 == key) {
                    return 1;
                }
            case 5:
                if (k0 == key) {
                    return 0;
                }
            default:
                return -1;
        }
    }
    return indexOfObj(h, key);
}

Here we see the same key lookup as above, but wrapped in a switch, which will jump to a particular if clause, depending on the current count. If count is 1, we jump to case 5, and will only compare k0 before falling through to the default case. If count is 2, we jump to case 4, which means we’ll compare k1 and k0 before falling through to the default case, and so on. This means that instead of having to worry about bounds checking and short-circuiting, we simply jump over all the keys we don’t care about, and proceed from there.

Testing the Collections

So, at the end of this exercise we have more than 5000 lines of Java, and we want to add them to the core implementation of Clojure. Ideally, we won’t introduce any bugs in the process. But the same unrolling that makes the code faster makes it significantly harder to simply read the code and verify it’s “correct”. The Clojure code which generates the Java, while more compact, is mostly concerned with string concatenating its way to proper syntax. The semantics of both codebases are a bit opaque.

But even if the code were perfectly clear, data structures are easy to get wrong and difficult to test. On a map, the boundary conditions that need to be tested include key collisions, hash collisions, removing non-existent keys, conversions to and from transient collections, and so on. The exact boundary conditions depend on the underlying implementation, so a test suite that covers one implementation of a vector or map won’t necessarily cover another, even if they behave identically.

Given this, it seems like writing exhaustive unit tests is a poor way to make sure our implementation is sound. A better approach is property-based testing, which, instead of defining both inputs and expected outputs, allows us to define invariants which must hold true across all inputs. The testing framework will then search through the space of inputs for one which breaks an invariant, and then reduce that input down to the simplest possible reproducing case.

This is an especially nice approach for us, since both the inputs and invariants are straightforward. The inputs are all possible actions that can be performed on the data structure (conj, disj, assoc, dissoc, etc.), and the invariant is that it must behave just like the existing implementation, no matter what actions we take. Luckily, there’s a ready-made library for this purpose: collection-check. This library has been used to validate most (possibly all) of the alternate data structures in the Clojure ecosystem. It also uncovered a bug in Clojure’s own implementation of transient maps, which is discussed in more detail in this talk.
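As a rough sketch of what such a property looks like, here is a version using clojure.test.check directly rather than collection-check, with a hypothetical ->unrolled-map constructor standing in for the new implementation:

(require '[clojure.test.check :as tc]
         '[clojure.test.check.generators :as gen]
         '[clojure.test.check.properties :as prop])

;; Generate random sequences of assoc/dissoc actions.
(def actions-gen
  (gen/vector
    (gen/tuple (gen/elements [:assoc :dissoc]) gen/keyword gen/nat)))

;; Apply a sequence of actions to a map.
(defn apply-actions [m actions]
  (reduce (fn [m [op k v]]
            (case op
              :assoc  (assoc m k v)
              :dissoc (dissoc m k)))
          m
          actions))

;; Invariant: the unrolled map behaves exactly like an ordinary map,
;; no matter what actions we take. `->unrolled-map` is hypothetical.
(def behaves-like-ordinary-map
  (prop/for-all [actions actions-gen]
    (= (apply-actions {} actions)
       (apply-actions (->unrolled-map) actions))))

(tc/quick-check 1000 behaves-like-ordinary-map)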

But while collection-check is a useful tool for validating the unrolled collections, we still need to map it effectively onto the underlying implementation. The initial tests for the collections only checked maps of integers onto integers, which skipped over the special code paths dedicated to keywords. When an additional test was run for maps of keywords onto integers, an off-by-one error in the Duff’s device snippet discussed above was found.

Bringing It All Together

The results of this work can be found in the cambrian-collections library, the output of which has been submitted as a patch for Clojure, and is currently under review to be merged into the Clojure 1.8.0 release. Initial performance tests are promising: the cost of building collections when deserializing JSON in Cheshire is halved with unrolled collections, giving us a 33% overall improvement in performance. Since at Factual we spend quite a bit of time deserializing data from disk, this will have a meaningful and immediate impact on our daily operations. We hope it will have a similar benefit for others.

Factual Partners with EQ Works to Bring Industry-Leading Location Data and Mobile Ad Targeting to Canada

Factual is heading North! We are excited to announce our partnership with EQ Works (TSX: EQ), the first Canadian advertising company to work with Factual. With this partnership, EQ Works now brings to Canadian marketers the most advanced mobile audience data as part of its audience-targeting platform.

Improving the relevancy of every ad served to mobile users, our data identifies audiences based on locations they are near, retail chains visited in the past, and numerous behaviors including shopping, travel, dining, and entertainment preferences. EQ Works then tailors messages and serves the right mobile ad based on this context, making every mobile interaction more targeted, engaging and informative.

Advertisers can now deliver different ads to consumers based on how close they are to an electronics store, for example, or target people who have visited a car dealership in the last seven days. Advertisers can reach their most desired audience segments when they are at related places, or even at their competitors’ locations. This level of targeting is now possible in Canada, and changes the game for mobile marketers.

According to a recent study by eMarketer, by the end of this year mobile advertising in Canada will have grown 70 percent over last year to over $550 million, and is expected to grow to over $2 billion in 2018 (source: eMarketer, March 2014). More than ever, it is important to make mobile advertising more targeted and compelling for consumers, and this is precisely what EQ Works is now doing with Factual.

“Mobile advertising is exploding, and marketers can’t settle for mobile 1.0 solutions anymore,” said David Katz, EVP of EQ Works. “Our goal is to better interpret mobile interactions to help advertisers reach their audiences when and where they are most likely to engage with the message. Our mobile targeting platform enriched by Factual’s database allows us to do just that with meticulous detail.”

Our database includes information from over 65 million businesses and points of interest across 50 countries in 29 languages. More specifically in Canada, Factual has information on more than two million businesses and points of interest.

The additional context we derive from location aligns well with EQ Works’ data-driven audience-centric approach to mobile marketing. We’re thrilled to be partnering with EQ Works to bring these capabilities to the Canadian market.