I spent this summer as a Data Specialist Intern at Factual, where I was tasked with improving our Global Places categorization. Factual employs a wide variety of strategies at every stage of its data pipeline, and categorization is just one part of that. To clarify: every Factual Place belongs to one category in our 400+ node taxonomy. My job was to ensure that the existing process was producing high-quality data, and to explore alternate means of improving category accuracy and coverage. Here are some things I wish I’d known before I started.
- Ask for help first. If you’re not the sole proprietor of every piece of code at your company, then there are things you don’t know how to do that someone else can do faster (read: cheaper). So if there’s something you need, like data faceted in a special way, or a Ruby script that interfaces nicely with a step in the workflow, it’s a good bet that a coworker has that exact piece of code sitting somewhere in their personal repository.
- The documentation is your best friend. I spent a full day trying to find decent documentation for the open source Apache Mahout project. It’s simply not out there. By documentation, I mean walkthroughs and explanations that take place in a context, as well as the standard Javadoc-style list of method headers. All the meticulously optimized algorithmic libraries in the world are useless to someone who can’t figure out how to format their input data. (Disclaimer: I take full responsibility for my own Hadoop ineptitude; this bullet is meant as a journal entry for my personal experience, not as a polemic against Mahout.) On the other side of this comparison is Python’s scikit-learn. It’s built on top of the meticulously documented NumPy and SciPy libraries for scientific computing in Python, so if you can figure out how to get your data into a NumPy array, you’re good to go.
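To give a sense of how little glue scikit-learn needs once your data is in a NumPy array, here’s a minimal sketch; the feature matrix, labels, and model choice are made up for illustration, not the actual categorization pipeline:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical features: one row per place, one column per signal.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [0.0, 0.0]])
y = np.array([1, 0, 1, 0])  # made-up category labels

# Once the data is a NumPy array, any scikit-learn estimator takes it as-is.
clf = LogisticRegression()
clf.fit(X, y)
print(clf.predict(X))
```

Every estimator in the library follows this same fit/predict convention, which is exactly the kind of context the documentation makes easy to discover.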
- Know the ecosystem. Working in Python instead of Java/Hadoop was a great decision, both in terms of the rapid prototyping virtues of the former, and my familiarity with the underlying data structures. If my work gets ported to our Hadoop cluster, it’ll be by someone who knows what they’re doing on that platform.
- Machine Learning applications are mostly the boring stuff. The most “machine-learny” code I wrote all summer was two lines of Python, initializing and then fitting a model. The majority of the effort is in pre-processing: getting from the raw data to a standardized, plain-text dataset, to an array of 1s and 0s. There are still interesting problems to be solved here, of course. For example, I employed the hashing trick, which relies on some high-level probabilistic linear algebra to guarantee solutions that are “close enough.”
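As an illustration of both points, here’s a sketch using scikit-learn’s `HashingVectorizer`, one implementation of the hashing trick (the place names, labels, and classifier here are made up): each token is mapped straight to a column index by a hash function, so the dimensionality is fixed up front and no vocabulary needs to be stored; hash collisions are the price, and they’re what makes the result only “close enough.”

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up place names and category labels for illustration.
names = ["joe's coffee shop", "downtown dental clinic", "coffee and bagels"]
labels = [0, 1, 0]

# Hash each token into one of 2**10 columns; alternate_sign=False keeps the
# counts non-negative so a multinomial model can consume them directly.
vec = HashingVectorizer(n_features=2**10, alternate_sign=False)
X = vec.transform(names)
print(X.shape)  # (3, 1024)

# And the "machine-learny" part really is two lines.
clf = MultinomialNB()
clf.fit(X, labels)
```

Everything before those last two lines is the pre-processing; everything after is just calling `predict` on more hashed text.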
- Save the ML for the problems you can’t think to solve in any other way. Through a combination of hand curation and careful translation into our taxonomy, Factual gets incredibly far before applying any machine learning at all. A statistical approach is only as good as the data it’s given, which in this case is produced by a sophisticated and deterministic workflow. Before plugging into a fancy stochastic optimizer, make sure you’ve done everything possible to improve your inputs.
- Coding in R makes you feel like a ninja. The R core library is full of awesome one-liners waiting to happen, and is built around a set of vectorized operations, including the greatest thing since the static compiler: filtering by a vector of booleans. There’s basically nothing you can’t do in R with liberal application of implicit looping functions, and it makes even the boring stuff feel pretty cool.
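That filtering-by-a-boolean-vector idiom isn’t unique to R, for what it’s worth: NumPy supports the same thing, which makes for a handy side-by-side (toy data, with the equivalent R expression noted in the comment):

```python
import numpy as np

x = np.array([3, 7, 1, 9, 4])

# Filtering by a vector of booleans, the same idiom as R's x[x > 3].
mask = x > 3
print(x[mask])  # [7 9 4]
```

The comparison `x > 3` is itself a vectorized operation, producing one boolean per element, and indexing with it keeps only the True positions.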
I’m sure that if I’d known all these things before I started, I’d have come up with a totally different list. Working at a startup like Factual was a great way to get up to speed quickly with a bunch of different tools, and to be consistently challenged to expand and apply my repertoire in new ways. That holds for every single person at Factual. They work hard in unfamiliar territory because the task demands it. If there’s a driving principle behind Factual, it’s this:
- You can never have enough good data.