Drake after Two Years: “Barely Famous”

We released Drake (“Data workflow tool, like a ‘Make for data’”) two years ago. The impetus behind Drake was the classic “scratch your own itch”. We craved a text-based data workflow tool that would make it easier to build and manage our data processing tasks.

We open sourced Drake in the hope that others would find it useful and maybe even be willing to help. We were delighted when the original blog post attracted attention. A fair number of relevant bug reports and feature requests flowed in. “Are you awash in fame and fortune?” my boss, Boris, asked me. “No,” I replied, “we’re awash in github tickets.” Most exciting of all, over the past two years we’ve seen substantial bug fixes and contributions to the project from outside Factual, including S3 support and a Clojure front end (thanks, all!).

Early on we applied Drake internally with an almost reckless abandon, trying it out on just about any project that looked even remotely like a data workflow. Over time, however, we developed a better feel for the strengths and weaknesses of Drake’s design choices.

Drake makes it refreshingly easy to spin up proofs of concept. We like to use Drake to express evolving data workflows that will be shared across a wide team. It is especially useful for tying together heterogeneous scripts and command line calls, taming what would otherwise be a hodgepodge of tasks. We also like Drake for managing chains of interrelated Hadoop jobs.

On the other hand: we’ve found Drake is not great if you just need to glue together some existing same-language code. That can usually be done more simply by staying within the borders of that language. Doubly so if you plan to share your work with others who don’t already know and understand Drake workflows.

Also, Drake does not answer every problem in the data workflow space. A leading example: how do you manage long-running tasks? For tasks that take a long time and do a large amount of work, you often want features like resumability and trackability. At Factual, we have a custom-built task management service called Vineyard that addresses these and similar issues for us. We have glue that allows Vineyard and Drake to work together in various ways, but out of the box Drake doesn’t offer these kinds of features for long-running tasks.

Earlier this year Factual welcomed Clojure aficionado Alan Malloy to our engineering ranks. Alan showed interest in Drake and invested time and expertise in maintaining the codebase and responding to community requests. This was no surprise given Alan’s Clojure chops and generous willingness to help people. We invited Alan to become the primary owner of Drake’s codebase and it was super great that he accepted.

We hope that Drake’s future is bright and that the project continues to evolve to better serve users. We’re encouraged by the traction the project has seen so far; I like to think of Drake as “barely famous”. Drake was given its very own chapter in the recently published “Data Science at the Command Line”, and Artem’s YouTube tutorial for Drake has been viewed over 5,000 times. As Artem puts it: “Since launch, people spent cumulative 639.8 hours watching Drake tutorial on Youtube, which is not Apache Hadoop, of course, but still pretty neat. :)”.

If you’re a current user of Drake, we hope you’ll let us know what you think and tell us what’s missing. If you’ve never used Drake but have always wanted a ‘Make for data’, we hope you’ll give it a go.

And if you’ve ever filed a ticket or, even better, sent us a pull request… Thank you!

Yours,

Aaron Crow
Factual engineer and Drake contributor

Changes in our Global Places Data – Q4 2014

Place data, like most things, has an expiration date. Go too long without picking a fresh crop, and you end up with something stale and unpalatable, or, in many of our partners’ cases, even unusable. This is why we work tirelessly to refresh and improve Global Places all the time. We clean out listings for places that have gone out of business, add ones that have recently opened, update changed websites and names, and so on.

Below is a summary of some changes that we’ve made since our last update. In the 11 countries listed here, we added about 5.2 million places, discarded about 6.3 million old records, and updated at least one field[1] in 9.4 million records.

See the breakdown of these updates by field in the chart below[2].


What better way to start off the new year than with a heaping plate of fresh data?

- Julie Levine, Marketing Associate

In Case You Missed It:
See updates from the first three quarters of 2014 here:

Notes:
[1] Fields include: address, address extended, country, locality, name, po box, postcode, region, tel.

[2] Some records had updates to more than one field, so the number of updates is larger than the number of updated records.

Factual Featured Partner: Snips

Wouldn’t it be great if your smartphone always knew exactly what you were thinking? Dr. Rand Hindi, named one of MIT Technology Review’s 2014 35 Innovators Under 35, is building an app that provides you with the information you need, right when you need it. Although Snips doesn’t actually read your mind, it does use loads of data to intelligently interpret your situation and provide you with just what you were about to look for. We had the opportunity to talk to Dr. Hindi about Snips; see the Q&A below.

Company Name: Snips
Located: Paris, France
Factual Partner Since: 2014
Website: www.snips.net
Twitter: @Snips
Blog: www.snips.net/blog
Your Name and Title: Dr. Rand Hindi, CEO and Founder

Q: Introduce readers to Snips. What do you do?
A: We build context-aware interfaces for mobile devices, so that people never have to look for and juggle multiple apps. Snips learns and adapts to your lifestyle, helping you to quickly access the information and apps you need, when you need it.

Q: What’s an example of how Snips learns from your behavior? How long do you need to use the app to start getting feedback?
A: We integrated several sources of data, some requiring some time to accumulate (such as knowing where you like to go for lunch on weekends) and others that are immediate (such as your calendar events and social networks). Our product works the first time someone uses it, and gets better as we get more and more data.

Q: Why is location data important for Snips?
A: Our goal is to make our devices better integrated in our everyday lives. This means being able to understand what people do throughout their day and how they use their devices in these contexts. We use location data in combination with other sensors to determine what their habits are, such as the time they go to the gym, if they prefer to take a taxi or the tube to go to work, or what kinds of restaurants they like. This level of intelligence is necessary to anticipate what apps and services people will need to use. And, not only does all the location data processing happen on the device, we also empower users to modify and delete anything they want so that they control what our algorithms know about them.

Q: Why did you choose to work with Factual?
A: Because a raw location trace without context is useless for our product, we needed to access a comprehensive database of places with rich attributes. Factual had both, and in a large number of countries, so the choice was pretty easy to make!

Q: Snips is about using context and data to provide mobile users with the best experience. What are some other services that you think could benefit from this approach?
A: We can go so much further than just smarter interfaces. We can literally make the entire world around us able to adapt to our lifestyles. For example, we did a project where we helped people find a seat on their train home by predicting passenger flow. For that, we modeled the context around each station– demography, places nearby, concerts, etc.– and used it to predict how many people would be on board each train. The more accurate data we get, the more things we can predict, and the more we can improve our quality of life.

Q: Have you faced any technology roadblocks while working on Snips? What new or improved technology or data would you like to see in coming years?
A: We faced a large number of roadblocks, since none of the technologies we are using existed. We had to build no less than 25 different algorithms to turn all this data into something meaningful. Our long term technology goal is to build an artificial intelligence capable of discovering and automating all the rules pertaining to a user. Kind of like creating a pocket brain!

Q: What wishes do you have for Snips in the New Year?
A: That our product launch goes well!

- Julie Levine, Marketing Associate

In Case You Missed It
Check out some other partners we’ve interviewed, such as networking apps HelloTel and SocialRadar, consumer purchase data company Infoscout, and city guides Tastemade and Jetpac City Guides. See even more Featured Partners here.

Factual Debuts Two New Tools for Geofencing and Audience Targeting, Putting Location Data in the Hands of the Client

Factual is excited to launch two new product features to make it easier for marketers to take advantage of the power and flexibility of our Geopulse Proximity and Geopulse Audience products. The Geopulse Proximity Designer and Geopulse Audience Builder are transparent and powerful self-serve tools that enable our partners to craft their campaigns – using geofencing or location based audiences – to reach precisely the right audience.

Geopulse Proximity Designer enables our partners to quickly and easily design their geofenced ad campaigns by selecting their targeted places and setting their desired radius. The underlying data comes from our leading Global Places data, which covers over 65 million businesses and other points of interest across 50 countries, in 29 languages, organized into 467 categories. This data drives the powerful search functionality, enabling users to select places by any combination of business name, merchant chain, category, or geography. Users can also specify any radius or combination of radii, and design advanced targeting that differentiates between, say, targeting your own stores, targeting competitors’ stores, and targeting your locations that are close to those of your competitors. All of this is done on an interactive map, providing complete transparency into the campaign’s targeting.

For a guided tour of the Geopulse Proximity Designer tool, check out this video.

Geopulse Audience Builder enables our partners to mix and match the hundreds of audience segments provided by Geopulse Audience to build their ideal audience. Geopulse Audience builds user profiles by analyzing where users spend time and the places they visit, assigning each user to a series of audience segments covering geography, demographics, behavior, and brand affinity. Airlines can build one audience of business travelers who live in LA and spend time in NYC to promote one type of service, and another audience of leisure travelers who live in the Northeast and visit the Caribbean. Retailers can build separate audiences of consumers who frequent their stores, consumers who frequent their competitors’ stores, and consumers who frequent both. The tool allows any number of combinations of segments, along with AND and OR logic controls, so that advertisers can go as wide or as deep as they wish to reach their precise audience.

For a guided tour of the Geopulse Audience Builder tool, check out this video.

We’ve teamed up with a number of media partners so that advertisers can run campaigns with our data and targeting technology through their preferred providers. These two tools are currently being used by DSPs and networks such as StrikeAd, Turn, Manage, Deep Forest Media, EQ Works, MdotM, Juice Mobile, Adsmovil, TAPTAP, as well as publishers like The Weather Company and trading desks including Cadreon and Horizon (HX).

Clients who have used these tools have had remarkable experiences. Here is what they have to say:

“Factual provides an important stream of best-in-class location data to our clients on a global basis which allows them to harness some of the most unique capabilities of mobile marketing,” said Emily Del Greco, VP of Sales for Adelphic. “Of all our data partnerships, Factual’s stands apart because of the quality of their location data, the diversity of its application and the sheer scalability – which is highly differentiated.”

“Factual’s tools allow us to easily plan campaigns with precise location targeting that’s optimized to the specific needs of our clients,” said Alex Rahaman, CEO of StrikeAd. “We have clients around the world and having global tools is wonderful from an internal organization / operations perspective and gives our multinational clients consistency across their markets.”

“We partner with the best data and technology providers to enable market leading mobile advertising solutions for advertisers,” stated Alberto Pardo, CEO of AdsMovil. “The data and tools provided by Factual enable us to bring our clients location targeting techniques that were not previously available in Latin America and that significantly enhance the ability of advertisers to reach Hispanic consumers in the United States.”

Contact sales to learn more about Factual’s mobile ad targeting products: http://www.factual.com/contact?subject=sales

The Humongous nfu Survival Guide

Github: github.com/spencertipping/nfu

A lot of projects I’ve worked on lately have involved an initial big Hadoop job that produces a few gigabytes of data, followed by some exploratory analysis to look for patterns. In the past I would have sampled the data before loading it into a Ruby or Clojure REPL, but increasingly I’ve started to use nfu for these last-mile transformation tasks:

$ history | nfu -F '\s+' -f2gcOT5
96 ls
61 nfu # nfu at #2! booyah!
49 gs
48 vim
25 cd
$

This humongous survival guide covers nearly every aspect of nfu, from simple text-based data aggregation to module loading, JSON parsing, local map/reduce, and inner and outer joins.

What is nfu?

nfu is a command-line Perl tool that builds shell pipelines to transform text data. Conceptually, it does two things:

  1. Provides shorthands for common idioms of UNIX shell commands like sort
  2. Wraps Perl to be more directly applicable to text data

Before diving into how nfu works, here are a couple of use cases that illustrate what it does at a high level.

Testing correlation

As words get longer, do they tend to contain more “e”s? Probably, but let’s find out empirically by looking at all English words. Specifically, we’ll generate a plot whose X axis is cumulative length and whose Y axis is cumulative “e” count. Using standard UNIX tools:

$ perl -ne 'chomp;
            $_ = lc;
            print $x += length, "\t", $y += length(s/[^e]//gr), "\n"' \
  < /usr/share/dict/words \
  | gnuplot -e 'plot "-" with lines' -persist

Using nfu:

$ nfu /usr/share/dict/words \
      -m lc%0 \
      -m 'row length %0, length %0 =~ s/[^e]//gr' \
      -s01p %l

We can easily change this command to get a scatter plot. To see densities better, let’s add a bit of randomness to each data point:

$ nfu /usr/share/dict/words \
      -m lc%0 \
      -m 'row length %0, length %0 =~ s/[^e]//gr' \
      -m 'row %0 + rand(), %1 + rand()' \
      -p %d

The first three lines are identical, so we can generate an intermediate file. nfu transparently loads compressed data, so let’s gzip the file to save some space:

$ nfu /usr/share/dict/words \
      -m lc%0 \
      -m 'row length %0, length %0 =~ s/[^e]//gr' \
  | gzip > points.gz
$ nfu points.gz -s01p %l
$ nfu points.gz -m 'row %0 + rand(), %1 + rand()' -p %d

Joint analysis of JSON files

Suppose you’ve got two files, one a TSV of user-id, tweet-json and the other a TSV of user-id, user-json. You’d like to measure the differences between tweets of people with many vs few followers by building a list of most-used words faceted by log(followers). Using standard tools (including the indispensable jq) you might do this:

$ cut -f2 tweet-jsons \
  | jq -r '.text' \
  | sed 's/[\t\n]//g' \
  | paste <(cut -f1 tweet-jsons) - \
  | sort -k 1b,1 \
  > sorted-tweet-text
$ cut -f2 user-jsons \
  | jq -r '.followers_count' \
  | perl -ne 'print int(log($_ || 1)), "\n"' \
  | paste <(cut -f1 user-jsons) - \
  | sort -k 1b,1 \
  | join - sorted-tweet-text \
  | cut -f2-3 \
  | sort -n \
  > follower-counts-and-texts
$ perl -ne 'my ($count, $text) = split /\t/;
            print "$count\t$_\n" for split /\W+/, $text' \
  < follower-counts-and-texts \
  | sort \
  | uniq -c \
  | perl -ane 'print "@F[1,0,2]\n"' \
  | sort -n \
  > sorted-logfollowers-word-histogram

Of course, an easier way to do most of this would be to use SQLite or local map/reduce, but these introduce their own overhead, particularly when interoperating with other tools. Here’s how to do all of the above using nfu:

$ nfu tweet-jsons -m 'map row(%0, $_), split /\W+/, jd(%1).text' \
      -i0 sh:"$(nfu --quote user-jsons \
                    -m 'row %0, int log(jd(%1).followers_count || 1)')" \
      -f21gcf102o \
      > sorted-logfollowers-word-histogram

Introduction

nfu is all about tab-delimited text data. It does a number of things to make this data easier to work with; for example:

$ git clone git://github.com/spencertipping/nfu
$ cd nfu
$ ./nfu README.md               # behaves like 'less'
$ gzip README.md
$ ./nfu README.md.gz            # transparent decompression (+ xz, bz2, lzo)

Now let’s do some basic word counting. We can get a word list by using nfu’s -m operator, which takes a snippet of Perl code and executes it once for each line. Then we sort (-g, or --group), count-distinct (-c), and reverse-numeric-sort (-O, or --rorder) to get a histogram descending by frequency:

$ nfu README.md -m 'split /\W+/, %0' -gcO
48  
28  nfu
20  seq
19  100
...
$

%0 is shorthand for $_[0], which is how you access the first element of Perl’s function-arguments (@_) variable. Any Perl code you give to nfu will be run inside a subroutine, and the arguments are usually tab-separated field values.

Commands you issue to nfu are chained together using shell pipes. This means that the following are equivalent:

$ nfu README.md -m 'split /\W+/, %0' -gcO
$ nfu README.md | nfu -m 'split /\W+/, %0' \
                | nfu -g \
                | nfu -c \
                | nfu -O

nfu uses a number of shorthands whose semantics may become confusing. To see what’s going on, you can use its documentation options:

$ nfu --expand-code 'split /\W+/, %0'
split /\W+/, $_[0]
$ nfu --explain README.md -m 'split /\W+/, %0' -gcO
file    README.md
--map   'split /\W+/, %0'
--group
--count
--rorder
--preview
$

You can also run nfu with no arguments to see a usage summary.

Basic idioms

Extracting data

  • -m 'split /\W+/, %0': convert text file to one word per line
  • -m 'map {split /\W+/} @_': same thing for text files with tabs
  • -F '\W+': convert file to one word per column, preserving lines
  • -m '@_': reshape to a single column, flattening into rows
  • seq 10 | tr '\n' '\t': reshape to a single row, flattening into columns

The -F operator resplits lines by the regexp you provide. So to parse /etc/passwd, for example, you’d say nfu -F : /etc/passwd ….
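
For instance, a quick sketch that builds a descending histogram of login shells (assuming a standard /etc/passwd layout, where field 6 is the shell):

$ nfu -F : /etc/passwd -f6gcO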

Generating data

  • -P 5 'cat /proc/loadavg': run ‘cat /proc/loadavg’ every five seconds, collecting stdout
  • --repeat 10 README.md: read README.md 10 times in a row (this is more useful than it looks; see “Pipelines, Combination, and Quotation” below)
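
As a small sketch of -P (assuming a Linux-style /proc/loadavg), this samples the load averages every five seconds and keeps just the 1-minute figure:

$ nfu -P 5 'cat /proc/loadavg' -F '\s+' -f0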

Basic transformations

  • -n: prepend line numbers as first column
  • -m 'row @_, %0 * 2': keep all existing columns, appending %0 * 2 as a new one
  • -m '%1 =~ s/foo/bar/g; row @_': transform second column by replacing ‘foo’ with ‘bar’
  • -m 'row %0, %1 =~ s/foo/bar/gr, @_[2..$#_]': same thing, but without in-place modification of %1
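
For example, a minimal sketch of the append idiom: keep the original column and add a doubled copy next to it:

$ seq 3 | nfu -m 'row @_, %0 * 2'
1   2
2   4
3   6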

-M is a variant of -m that runs a pool of parallel subprocesses (by default 16). This doesn’t preserve row ordering, but can be useful if you’re doing something latency-bound like fetching web documents:

$ nfu url-list -M 'row %0, qx(curl %0)'

In this example, Perl’s qx() operator could easily produce a string containing newlines; in fact, most shell commands end their output with one. Because of this, nfu’s row() function strips the newlines from each of its input strings, which guarantees that row() produces exactly one line of output.

Filtering

  • -k '%2 eq "nfu"': keep any row whose third column is the text “nfu”
  • -k '%0 < 10': keep any row whose first column parses to a number < 10
  • -k '@_ < 5': keep any row with fewer than five columns
  • -K '@_ < 5': reject any row with fewer than five columns (-K vs -k)
  • -k 'length %0 < 10'
  • -k '%0 eq -+-%0': keep every row whose first column is numeric
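
As a quick sketch of -k versus -K with the same predicate:

$ seq 20 | nfu -k '%0 % 3 == 0'     # keep multiples of 3
$ seq 20 | nfu -K '%0 % 3 == 0'     # drop them, keep everything else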

Row slicing

  • -T5: take the first 5 lines
  • -T+5: take the last 5 lines (drop all others)
  • -D5: drop the first 5 lines
  • --sample 0.01: take 1% of rows randomly
  • -E100: take every 100th row deterministically
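
These compose left to right; for example, this small sketch drops the first five rows and then takes five, leaving rows 6 through 10:

$ seq 1000 | nfu -D5 -T5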

Column slicing

  • -f012: keep the first three columns (fields) in their original order
  • -f10: swap the first two columns, drop the others
  • -f00.: duplicate the first column, pushing others to the right
  • -f10.: swap the first two columns, keep the others in their original order
  • -m 'row(reverse @_)': reverse the fields within each row (row() is a function that keeps an array on one row; otherwise you’d flatten the columns across multiple rows)
  • -m 'row(grep /^-/, @_)': keep fields beginning with -
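
A short sketch combining -n with -f: number the rows, then swap the two columns (the scaling just makes the swap visible):

$ seq 3 | nfu -m '%0 * 10' -n -f10
10  1
20  2
30  3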

Histograms (group, count)

  • -gcO: descending histogram of most frequent values
  • -gcOl: descending histogram of most frequent values, log-scaled
  • -gcOs: cumulative histogram, largest values first
  • -gcf1.: list of unique values (group, count, fields 1..n)

Sorting and counting operators support field selection:

  • -g1: sort by second column
  • -c0: count unique values of field 0
  • -c01: count unique combinations of fields 0 and 1 jointly
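
For instance, a sketch contrasting the plain -gcO idiom with a field-selected re-sort: both build a first-letter histogram from the system dictionary, but the second orders by the letter (field 1) rather than by the count:

$ nfu /usr/share/dict/words -m 'substr lc(%0), 0, 1' -gcO
$ nfu /usr/share/dict/words -m 'substr lc(%0), 0, 1' -gc -g1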

Common numeric operations

  • -q0.05: round (quantize) each number to the nearest 0.05
  • -q10: quantize each number to the nearest 10
  • -s: running sum
  • -S: delta (inverse of -s)
  • -l: log-transform each number, base e
  • -L: inverse log-transform (exponentiate) each number
  • -a: running average
  • -V: running variance
  • --sd: running sample standard deviation

Each of these operations can be applied to a specified set of columns. For example:

  • seq 10 | nfu -f00s1: first column is 1..10, second is running sum of first
  • seq 10 | nfu -f00a1: first column is 1..10, second is running mean of first

Some of these commands take an optional argument; for example, you can get a windowed average if you specify a second argument to -a:

  • seq 10 | nfu -f00a1,5: second column is a 5-value sliding average
  • seq 10 | nfu -f00q1,5: second column quantized to 5
  • seq 10 | nfu -f00l1,5: second column log base-5
  • seq 10 | nfu -f00L1,5: second column is 5 raised to the first column (inverse of log base 5)

Multiple-digit fields are interpreted as multiple single-digit fields:

  • seq 10 | nfu -f00a01,5: calculate 5-average of fields 0 and 1 independently

The only ambiguous case happens when you specify only one argument: should it be interpreted as a column selector, or as a numeric parameter? nfu resolves this by using it as a parameter if the function requires an argument (e.g. -q), otherwise treating it as a column selector.
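
A small sketch of that rule in action: -q requires a numeric parameter, so its argument is taken as the quantum, while -s takes none, so its argument selects a column:

$ seq 10 | nfu -q3        # 3 is a parameter: quantize to the nearest 3
$ seq 10 | nfu -f00s1     # 1 is a column: running sum of the duplicated field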

Plotting

Note: all plotting requires that gnuplot be in your $PATH.

  • seq 100 | nfu -p: 2D plot; input values are Y coordinates
  • seq 100 | nfu -m 'row @_, %0 * %0' -p: 2D plot; first column is X, second is Y
  • seq 100 | nfu -p %l: plot with lines
  • seq 100 | nfu -m 'row %0, sin(%0), cos(%0)' --splot: 3D plot

For example, on a finer grid and plotted with lines:

$ seq 1000 | nfu -m '%0 * 0.1' \
                 -m 'row %0, sin(%0), cos(%0)' \
                 --splot %l

You can use nfu --expand-gnuplot '%l', for example, to see how nfu is transforming your gnuplot options. (There’s also a list of these shorthands in nfu’s usage documentation.)

Progress reporting

If you’re doing something with a large amount of data, it’s sometimes hard to know whether it’s worth hitting ^C and optimizing stuff. To help with this, nfu has a --verbose (-v) option that activates throughput metrics for each operation in the pipeline. For example:

$ seq 100000000 | nfu -o                # this might take a while
$ seq 100000000 | nfu -v -o             # keep track of lines and kb

Advanced usage (assumes some Perl knowledge)

JSON

nfu provides two functions, jd (or json_decode) and je (or json_encode), that are available within any code you write:

$ ip_addrs=$(seq 10 | tr '\n' '\t' | nfu -m 'join ",", map "%0.4.4.4", @_')
$ query_url="www.datasciencetoolkit.org/ip2coordinates/$ip_addrs"
$ curl "$query_url" \
  | nfu -m 'my $json = jd(%0);
            map row($_, ${$json}{$_}.locality), keys %$json'

This code uses another shorthand, .locality, which expands to a Perl hash dereference ->{"locality"}. There isn’t a similar shorthand for arrays, which means you need to explicitly dereference those:

$ echo '[1,2,3]' | nfu -m 'jd(%0)[0]'           # won't work!
$ echo '[1,2,3]' | nfu -m '${jd(%0)}[0]'

Multi-plotting

You can set up a multiplot by creating multiple columns of data. gnuplot then lets you refer to these with its using N construct, which nfu lets you write as %uN:

$ seq 1000 | nfu -m '%0 * 0.01' | gzip > numbers.gz
$ nfu numbers.gz -m 'row sin(%0), cos(%0)' \
                 --mplot '%u1%l%t"sin(x)"; %u2%l%t"cos(x)"'
$ nfu numbers.gz -m 'sin %0' \
                 -f00a1 \
                 --mplot '%u1%l%t"sin(x)"; %u2%l%t"average(sin(x))"'
$ nfu numbers.gz -m 'sin %0' \
                 -f00a1 \
                 -m 'row @_, %1-%0' \
                 --mplot '%u1%l%t"sin(x)";
                          %u2%l%t"average(sin(x))";
                          %u3%l%t"difference"'

The semicolon notation is something nfu requires. It works this way because internally nfu scripts gnuplot like this:

plot "tempfile-name" using 1 with lines title "sin(x)"
plot "tempfile-name" using 2 with lines title "average(sin(x))"
plot "tempfile-name" using 3 with lines title "difference"

Local map-reduce

nfu provides an aggregation operator for sorted data. This groups adjacent rows by their first column and hands you a series of array references, one for each column’s values within that group. For example, here’s word-frequency again, this time using -A:

$ nfu README.md -m 'split /\W+/, %0' \
                -m 'row %0, 1' \
                -gA 'row $_, sum @{%1}'

A couple of things are happening here. First, the current group key is stored in $_; this allows you to avoid the more cumbersome (but equivalent) ${%0}[0]. Second, %1 is now an array reference containing the second field of all grouped rows. sum is provided by nfu and does what you’d expect.

In addition to map/reduce functions, nfu also gives you --partition, which you can use to send groups of records to different files. For example:

$ nfu README.md -m 'split /\W+/, %0' \
                --partition 'substr(%0, 0, 1)' \
                            'cat > words-starting-with-{}'

--partition will keep up to 256 subprocesses running; if you have more groups than that, it will close and reopen pipes as necessary, which will cause your subprocesses to be restarted. (For this reason, cat > … isn’t a great subprocess; cat >> … is better.)

Loading Perl code

nfu provides a few utility functions:

  • sum @array
  • mean @array
  • uniq @array
  • frequencies @array
  • read_file "filename": returns a string
  • read_lines "filename": returns an array of chomped strings

But sometimes you’ll need more definitions to write application-specific code. For this nfu gives you two options, --use and --run:

$ nfu --use myfile.pl ...
$ nfu --run 'sub foo {...}' ...

Any definitions will be available inside -m, -A, and other code-evaluating operators.

A common case where you’d use --run is to precompute some kind of data structure before using it within a row function. For example, to count up all words that never appear at the beginning of a line:

$ nfu README.md -F '\s+' -f0 > first-words
$ nfu --run '$::seen{$_} = 1 for read_lines "first-words"' \
      -m 'split /\W+/, %0' \
      -K '$::seen{%0}'

Notice that we’re package-scoping %::seen. This is required because while row functions reside in the same package as --run and --use code, they’re in a different lexical scope. This means that any my or our variables are invisible and will trigger compile-time errors if you try to refer to them from other compiled code.

Pseudofiles

Gzipped data is uncompressed automatically by an abstraction that nfu calls a pseudofile. In addition to uncompressing things, several other pseudofile forms are recognized:

$ nfu http://factual.com                # uses stdout from curl
$ nfu sh:ls                             # uses stdout from a command
$ nfu user@host:other-file              # pipe file over ssh -C

nfu supports pseudofiles everywhere it expects a filename, including in read_file and read_lines.
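
For example, a quick sketch that treats a shell command's output as input, with no temporary file involved:

$ nfu sh:'seq 10' -s      # running sum over the command's stdout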

Pipelines, combination, and quotation

nfu gives you several commands that let you gather data from other sources. For example:

$ nfu README.md -m 'split /\W+/, %0' --prepend README.md
$ nfu README.md -m 'split /\W+/, %0' --append README.md
$ nfu README.md --with sh:'tac README.md'
$ nfu --repeat 10 README.md
$ nfu README.md --pipe tac
$ nfu README.md --tee 'cat > README2.md'
$ nfu README.md --duplicate 'cat > README2.md' 'tac > README-reverse.md'

Here’s what these things do:

  • --prepend: prepends a pseudofile’s contents to the current data
  • --append: appends a pseudofile
  • --with: joins a pseudofile column-wise, ending when either side runs out of rows
  • --repeat: repeats a pseudofile the specified number of times, forever if n = 0; ignores any prior data
  • --pipe: same thing as a shell pipe, but doesn’t lose nfu state
  • --tee: duplicates data to a shell process, collecting its stdout into your data stream (you can avoid this by using > /dev/null)
  • --duplicate: sends your data to two shell processes, combining their stdouts
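
As a sketch of --with, here a second stream supplies a parallel column that we multiply against the first:

$ seq 5 | nfu --with sh:'seq 5' -m 'row %0, %0 * %1'
1   1
2   4
3   9
4   16
5   25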

Sometimes you’ll want to use nfu itself as a shell command, but this can become difficult due to nested quotation. To get around this, nfu provides the --quote operator, which generates a properly quoted command line:

$ nfu --repeat 10 sh:"$(nfu --quote README.md -m 'split /\W+/, %0')"

Keyed joins

This works on sorted data, and behaves like SQL’s JOIN construct. Under the hood, nfu takes care of the sorting and the voodoo associated with getting sort and join to work together, so you can write something simple like this:

$ nfu /usr/share/dict/words -m 'row %0, length %0' > bytes-per-word
$ nfu README.md -m 'split /\W+/, %0' \
                -I0 bytes-per-word \
                -m 'row %0, %1 // 0' \
                -gA 'row $_, sum @{%1}'

Here’s what’s going on:

  • -I0 bytes-per-word: outer left join using field 0 from the data, adjoining all columns after the key field from the pseudofile ‘bytes-per-word’
  • -m 'row %0, %1 // 0': when we didn’t get any join data, default to 0 (// is Perl’s defined-or-else operator)
  • -gA 'row $_, sum @{%1}': reduce by word, summing total bytes

We could sidestep all nonexistent words by using -i0 for an inner join instead. This drops all rows with no corresponding entry in the lookup table.

Go forth and conquer

nfu is an actively-maintained project we use in production, so bug reports and feature requests are both welcome. If you read this far and find this kind of thing interesting, you should consider working at Factual!