At Factual, we think about data day and night. Our goal is not just to create products, but to create new data economies that draw people to participate because we offer a better data experience. Because of this mission, we organize our work using what I call the data stack.
The idea of a data stack became necessary at Factual as we developed our initial plan to pursue many data verticals at once. These verticals have very different data, but the challenges of accessing and improving them turn out to have a number of commonalities. The structure of the stack emerged from my experience building, investing in, and observing data-driven businesses. The data stack refers to a set of categories that describe the different capabilities needed to transform data into more valuable forms. At Factual, we gain significant efficiencies by starting with a common framework, which leads to pooled expertise and maximum reuse of technology.
Could the notion of a data stack be relevant outside of Factual? I believe that using the data stack as an organizing principle can help people avoid a number of common problems. In today’s blog, I attempt to explain why.
Data Stack Basics
The number of technology layers involved in processing data is daunting, as is the number of steps needed to make data useful from beginning to end. It is a huge challenge in most organizations just to understand the data that is available, let alone gain a coherent picture of how data flows from beginning to end when it is put to use. For example, in most businesses there is a department responsible for business intelligence that connects to and collects information through an extract, transform, and load process that cleans and joins the information. This information is formed into a standard schema from which insights are derived through analysis, reports are rendered, and applications are built. What is often lost, in my opinion, is that there is an army of analysts working all around the company who are going through similar steps. Everyone involved would benefit if there were a wider awareness of the common process that takes place, so that knowledge and techniques could be shared.
Without common awareness of the data stack—the stages in the end-to-end process of putting data to work—people tend to organize efforts around specific technologies or repositories. You find data warehousing groups, analytics groups, reporting groups, and so on, each of which creates its own version of a data stack.
A company based on the data stack uses the concept to organize the common categories of transformations and capabilities applied to data, not tied to specific technologies. As I will expand upon later, creating a common way of talking about the lifecycle of data has numerous benefits:
- Brings more awareness of the available universe of data services
- Leads to more standardization, improving alignment and optimization of data-related efforts
- Speeds up development by increasing reusability and sharing of knowledge about more powerful ways of using technology
- Lowers costs of integration
- Catalyzes innovation and experimentation
- Increases emphasis on managing data as a flow, not a repository
Here are the elements of the data stack:
- Resource. Any data resource or service that can be used to access data. These resources may be thought of as raw data, what some have been referring to as the “oil of the 21st century.” But just as likely, a resource may be the end-result of a different data value chain – the improved data offered by some other company or process.
Key question: Do you have a catalog of all resources that are readily accessible?
- Connect. Refers to the way that you connect to a resource. The Web created a fairly standard way of connecting to information through URIs and HTTP or through RESTful API calls. Portable authorization schemes like OAuth have created a way to share permissions. But many resources have not adopted standardized approaches.
Key question: Do you have standardized ways of connecting to all available resources?
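As a sketch of what a standardized connection layer might look like, the snippet below describes every resource with the same config shape and builds a ready-to-send RESTful request from it. The endpoint URLs and token names are hypothetical, not real Factual or Twitter endpoints.

```python
from urllib.parse import urlencode

# Hypothetical resource catalog: every resource is described the same way,
# so one function can connect to any of them.
RESOURCES = {
    "tweets": {"base_url": "https://api.example.com/v1/tweets", "token": "TWEETS_TOKEN"},
    "places": {"base_url": "https://api.example.com/v1/places", "token": "PLACES_TOKEN"},
}

def build_request(resource, params):
    """Return (url, headers) for a RESTful GET against a cataloged resource."""
    cfg = RESOURCES[resource]
    url = cfg["base_url"] + "?" + urlencode(sorted(params.items()))
    headers = {"Authorization": "Bearer " + cfg["token"]}  # OAuth-style bearer token
    return url, headers
```

The point is not the three lines of request-building, but that a single catalog plus a single function replaces a dozen bespoke connection scripts.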
- Collect. Incorporates the notion of connecting to something one or more times to collect everything needed. The web crawl is a well-known form of collection, but the growth of sensors, machine data, and the data exhaust we all produce open up many new forms of collection.
Key question: How often do you connect to your resources to ensure the freshness of your collections?
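A minimal sketch of freshness-aware collection: re-fetch a resource only when the copy on hand is older than a maximum age. The fetch function is injected, so the same policy works for a web crawl, a sensor feed, or an API poll.

```python
import time

class Collector:
    """Re-fetches a resource only when the stored copy is older than max_age_s."""
    def __init__(self, fetch, max_age_s):
        self.fetch = fetch            # callable: url -> payload
        self.max_age_s = max_age_s
        self.collected = {}           # url -> (timestamp, payload)

    def collect(self, url, now=None):
        now = time.time() if now is None else now
        entry = self.collected.get(url)
        if entry is None or now - entry[0] > self.max_age_s:
            self.collected[url] = (now, self.fetch(url))
        return self.collected[url][1]
```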
- Cache. Once collected, it often makes sense to cache what you’ve collected so that you can look at it carefully on your own time rather than having to go back to the source. Remember, creating a cache should rarely be a goal in itself, but a means to dramatically speed up a process.
Key question: Have you optimized the cache so that all interested parties can access the collected data efficiently?
- Index. Indexing is another key step to bringing efficiency to your processes – it’s the way to access information you’ve cached without scanning everything. One very common type is the full-text index, which, as in Google’s search engine, indexes every word of every document for instant access.
Key question: Does your indexing strategy optimize for the most important access patterns?
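At its core, a full-text index is an inverted index: a map from each token to the documents that contain it. A minimal sketch:

```python
from collections import defaultdict

def build_index(docs):
    """Build an inverted index: token -> set of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def search(index, query):
    """Return doc ids containing every query token (AND semantics)."""
    sets = [index.get(t, set()) for t in query.lower().split()]
    return set.intersection(*sets) if sets else set()
```

Production systems layer stemming, ranking, and distribution on top, but the access-pattern question is the same: a lookup per token instead of a scan per query.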
- Clean. Data cleaning often involves fixing errors, filling in missing data, or eliminating duplicate records. A cleaning process can involve application of basic cleaning rules, manual intervention, or advanced machine learning.
Key question: Do you know whether your data needs cleaning?
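A sketch of basic rule-based cleaning for place records: normalize the fields, then drop records whose cleaned form has already been seen. The field names and rules are illustrative, not Factual's actual cleaning rules.

```python
import re

def normalize(record):
    """Apply basic cleaning rules: trim, lowercase, strip phone punctuation."""
    return {
        "name": re.sub(r"\s+", " ", record["name"].strip().lower()),
        "phone": re.sub(r"\D", "", record.get("phone", "")),
    }

def dedupe(records):
    """Keep the first record for each cleaned (name, phone) pair."""
    seen, out = set(), []
    for r in records:
        clean = normalize(r)
        key = (clean["name"], clean["phone"])
        if key not in seen:
            seen.add(key)
            out.append(clean)
    return out
```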
- Join. Data often lives in silos but can be connected. This can be easy if you have a clean relational model or if you have standardized on common identification schemes. In other cases, it can be incredibly painful.
Key question: Is data integration a challenge that is stopping you from asking and answering important questions?
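When two silos do share an identification scheme, the join itself is straightforward; the snippet below sketches a hash join on a caller-supplied key function. (The painful case the text mentions, joining without a common key, requires the cleaning and matching machinery above.)

```python
def join_on(key_fn, left, right):
    """Hash-join two record lists on a shared, normalized key."""
    by_key = {}
    for r in right:
        by_key.setdefault(key_fn(r), []).append(r)
    joined = []
    for l in left:
        for r in by_key.get(key_fn(l), []):
            joined.append({**l, **r})  # merge fields from both silos
    return joined
```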
- Derive. New and valuable information can be generated by crunching all the data you have. Such derivative data is especially valuable if you have a unique algorithm or are differentiated by the set of data you can access.
Key question: Do you have a catalog of common and proprietary analysis and machine learning techniques?
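As a trivial example of derivation, retweet counts per author are not stored anywhere in the raw tweets; they are crunched out of them. The field names here are hypothetical.

```python
from collections import Counter

def retweet_counts(tweets):
    """Derive a simple metric from raw tweets: retweets received per author."""
    counts = Counter()
    for t in tweets:
        if t.get("retweet_of"):            # hypothetical field naming the original author
            counts[t["retweet_of"]] += 1
    return counts
```

Real derivations (influence scores, sentiment, place confidence) are far richer, but they share this shape: raw records in, new data out.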
- Render. The world is very interested in data visualization because it improves the user experience of any consumer of that information. A basic form of rendering might involve transforming a numerical rating into an image showing a number of stars, or the display of a sad, neutral, or happy face. Much of both rendering and visualization is used to support applications and user interfaces.
Key question: Do you have standards for the way data is rendered and common technology to support those standards?
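Even the star-rating example above can be standardized in a few lines, so every application renders ratings the same way:

```python
def render_stars(rating, out_of=5):
    """Render a numeric rating as a row of filled and empty stars."""
    filled = int(round(rating))
    return "★" * filled + "☆" * (out_of - filled)
```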
- Resource/InfoService/Application. The last step is to provide the results as another resource that can then be easily consumed by another service or end-user application. When you provide the results as a resource, the whole cycle can start again. Applications, of course, can use the results of many different data stacks to achieve their goals.
Key question: Is it easy to find and understand the data and capabilities of your information services and applications?
In later blog entries, I’ll look at specific technologies in each category, explain ways to apply each capability of the stack to understand various types of information value chains, and describe why the data stack is really not so much a stack as a wide-ranging network of services that can be used piecemeal.
At Factual, we have data value chains that work together in complex ways. For example, in our data contribution ecosystem, we have one value chain that uses various capabilities of the data stack to rapidly get changes into our canonical database. We have another value chain that operates in parallel to apply a more time-consuming, and ultimately more fine-tuned, way of evaluating contributions. Both value chains are needed to make the canonical database both timely and accurate.
Applications of the Data Stack
So far, so good. But what good is the data stack? How can it help you improve your ability to get more value from data?
Here are three ways that I use the stack to help me in my work. I suspect they will work for others as well.
1) Normalizing Data Products
Every day, I closely follow the news about new companies offering data-related services, and about the big data space more generally. When I hear about a new product or service, I decompose the idea using the capabilities in the stack. That helps me quickly figure out whether anything special or new is going on, and how it might integrate with or complement Factual.
Here are some examples of normalizing data products based on the data stack.
- Twitter: Consumers (resource) are engaged via web & mobile apps (connect). Tweets are propagated to internal distribution mechanisms (collect) and hosted and indexed (cache, index). Derivative data such as retweet counts (derive) are offered cleanly (render) to the final consumer or through the API (new resource).
- Twitter (resource) is among a set of underlying APIs that are accessed (connect) to build a Klout Score (derive) that is in turn offered through a beautiful interface (render) and through its own API (new resource).
- DataSift: Various APIs, including Twitter and Klout (resources), are accessed (connect) and archived (collect, cache) to offer advanced filtering (index) of a robust, enriched historical Twitter archive (new resource).
- Wikipedia: A person (resource) logs in and edits a page (connect), submits a change (collect), the change is posted at Wikipedia (render, new resource), indexing is re-run on the page (index), and, in parallel, a moderation process may be triggered for some pages (clean).
2) Product Design
The data stack is also a huge help when designing new products, especially when it comes to thinking about where your services sit in relation to others.
First of all, it helps if you understand the raw materials you have to work with. Twitter views individuals as the resource, and connects to them via a variety of client-side tools. The Twitter API becomes a valuable raw resource for Klout's influence score, which is then offered via Klout's own API. DataSift, in turn, uses Klout's API to build an enriched Twitter archive index. And an individual may notice an interesting pattern when playing with DataSift, and add a comment on a Wikipedia page. (Yes, this is a bit circular – but that is the beauty of this burgeoning ecosystem.)
3) Optimizing an Organization
At Factual, we had the luxury of building our organization with the data stack in mind. The way we have organized our engineering and product development processes reflects a division of labor, with related areas of the stack clustered under the right leadership. We architected the company so that all raw documents we collect are stored in the same place (dCache), can be processed using a common Hadoop-based system, access cleaning rules from a common rule store, are indexed using a pluggable framework (currently, we use Solr and Postgres), and are hosted generically either on Amazon EC2 or in our data center.
But even though most organizations I encounter are vitally concerned with creating value from data, many of them have a haphazard organization of capabilities, with many silos and barriers to communication. The problem of connecting to a common data source may be solved a dozen different ways. The same is true for approaches to collection, indexing, caching, and so on. As a result of this diffusion of effort, investment capital is squandered, expertise is rarely reused, and everything takes longer than it should. Rarely does such an organization perform to its potential to create value from data.
Conducting a simple census of what each part of an organization is doing in terms of the data stack is a great start to understanding what sort of capabilities are duplicated. Then the hard job of creating an optimal organizational structure can begin.
I hope this tour of the data stack provides some helpful ideas. Please let me know if you put the data stack to work in a productive way. I would love to share more about ways to create great data products.