Note: Explore nfu on GitHub here

We often use the UNIX command line for ad-hoc data crunching. Most of the time we have the good sense to use a better tool after the first 100 characters or so, but sometimes we’ll just blow past the right margin with a string of sort, uniq -c, sort -nr, cut -f1, and other “glue” commands. To make this easier, I decided to bundle a bunch of common ones up into a Perl script called nfu.

The idea behind nfu is to save as much command-line real estate as possible for simple data analysis. It's designed to wrap or replace filter processes like sort, uniq, and in many cases awk and perl, by providing a series of composable operators that work on rows of whitespace-delimited columns of text. For example, two such operators are “sum” and “delta”:

$ seq 4 | nfu -s       # or nfu --sum
1
3
6
10
$ seq 4 | nfu -d       # or nfu --delta
1
1
1
1
$

Operators compose by juxtaposition, so -ss applies the running sum twice:

$ seq 4 | nfu -ss
1
4
10
20
$
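
If you want to see what these operators compute without installing nfu, here is a rough sketch in plain awk. The helper names running_sum and delta are made up for this illustration and are not part of nfu; they only handle a single numeric column, and composition is done with ordinary pipes rather than juxtaposed flags.

```shell
# Hypothetical helpers (not part of nfu): approximate --sum and --delta
# for single-column numeric input.
running_sum() { awk '{ total += $1; print total }'; }
delta()       { awk '{ print $1 - prev; prev = $1 }'; }  # first row is its delta from zero

seq 4 | running_sum                 # running totals
seq 4 | delta                       # differences between consecutive rows
seq 4 | running_sum | running_sum   # piping twice plays the role of -ss
```

The three pipelines should reproduce the nfu outputs shown above, which is a quick way to convince yourself what -s, -d, and -ss are doing.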

These operators are especially useful in combination with things like histograms, which you can construct using the “group” and “count” operators. For example, here’s how you might build a histogram of English letter pairings:

$ perl -ne 'print "$1$2\n" while s/^(.)(.)/$2/' < /usr/share/dict/words | grep '\w\w' > pairings
$ nfu -gcO < pairings | head -n5
  16982 in
  15871 er
  13647 es
  10556 ti
  10400 on
$

nfu hasn’t saved much effort yet since we could have typed sort | uniq -c | sort -rn easily enough. The real wins happen when we want to do things like log-scaling, calculating cumulative totals, or plotting the data:

$ nfu -gcOl < pairings                  # log-scale each number
$ nfu -gcOs < pairings                  # cumulative total
$ nfu -gcOp 'with lines' < pairings     # show data with gnuplot

And more usefully, combining some of these features:

$ nfu -gcOsp 'title "pair frequency" with lines' < pairings

At this point we can see that just under 20% of the pairings account for about 80% of the occurrences. Let’s take a closer look at the long tail by dropping the first 200 points and log-scaling the rest:

$ nfu -gcOS200,0lp 'with lines' < pairings
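
The 80/20 observation can also be checked numerically instead of by eye. The sketch below is not an nfu feature; cumulative_share is a hypothetical awk helper that takes "count key" lines sorted by descending count (the shape that sort | uniq -c | sort -rn produces) and prints each key's rank alongside the cumulative share of all occurrences. The toy input stands in for the pairings file.

```shell
# Hypothetical helper (not an nfu operator): rank, key, cumulative share.
# Expects "count key" lines already sorted by descending count.
cumulative_share() {
  LC_ALL=C awk '
    { count[NR] = $1; key[NR] = $2; total += $1 }
    END {
      for (i = 1; i <= NR; i++) {
        running += count[i]
        printf "%d %s %.3f\n", i, key[i], running / total
      }
    }'
}

# Toy data standing in for the pairings file.
printf 'ab\nab\nab\ncd\ncd\nef\n' | sort | uniq -c | sort -rn | cumulative_share
```

On the real data you would look for the rank at which the third column crosses 0.8 and compare it to the total number of distinct pairings.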

All of the data fields are preserved, so you can easily go back and look at the original letter pairings. Here are some of the least-common lowercase ones, for example:

$ nfu -gcO < pairings | grep -v '[A-Z]' | tail -n10
      1 zg
      1 zd
      1 xv
      1 vg
      1 qb
      1 mk
      1 jr
      1 gq
      1 fw
      1 cb
$

And we can easily find the words corresponding to these pairings by using the second column to form a pattern for grep. We can extract this column using -f and specifying the index 1 (fields are zero-indexed):

$ grep -E "$(nfu -gcOf1 < pairings | grep -v '[A-Z]' | tail -n10 | paste -sd'|' -)" /usr/share/dict/words
Chongqing
Fitzgerald
Gujranwala
Iqbal
Knoxville
Macbeth
Mazda
Novgorod
Potemkin
halfway
$

nfu also supports commands for generating and processing noisy sample-based data. For example, a couple of days ago I started logging my battery level using polling (-P):

$ nfu -P 1 'cat /sys/class/power*/BAT*/energy_now' > battery-log &

So every second, nfu runs cat and adds another data point to the log file. Here’s the unfiltered data:

$ nfu -p 'with lines' < battery-log
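
For comparison, the polling step can be approximated with a small shell loop. This is only a sketch of the idea, not how nfu implements -P: poll_n is a hypothetical helper, and unlike -P it stops after a fixed number of samples so that it is easy to try out.

```shell
# Hypothetical helper (not nfu's implementation): run a command `count` times,
# sleeping `interval` seconds between runs. Uses plain globals to stay POSIX.
poll_n() {
  count=$1; interval=$2; shift 2
  i=0
  while [ "$i" -lt "$count" ]; do
    "$@"                                   # each run appends one sample to stdout
    i=$((i + 1))
    if [ "$i" -lt "$count" ]; then sleep "$interval"; fi
  done
}

# Harmless demo; for the battery log you would poll the cat command instead.
poll_n 3 0 echo sample
```

Redirecting the output to a file and backgrounding the call gives roughly the same kind of log as the nfu command above.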

There isn’t much discontinuity, so let’s look at the deltas to measure charge/discharge rate:

$ nfu -dp 'with lines' < battery-log

The spike in the delta plot comes from the computer being on standby, which left an unsampled stretch of charging. We could clip the spike, but we'd be losing information. It's easier to take a sliding average to spread it out. We also need to remove the first data point after delta-transforming, since it will be encoded as its delta from zero (and will therefore look like another spike):

$ nfu -dS1,0a1000p 'with lines' < battery-log
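
To make the smoothing step concrete, here is a minimal awk sketch of a sliding average. It assumes a trailing window (the mean of the most recent N values); whether nfu's -a uses exactly this windowing is an assumption on my part, and sliding_average is a hypothetical name. The window must be at least 1.

```shell
# Hypothetical helper (not nfu's implementation): trailing moving average
# over the last `window` values of a single numeric column.
sliding_average() {
  LC_ALL=C awk -v window="$1" '
    {
      slot = NR % window
      if (NR > window) sum -= buf[slot]   # drop the value leaving the window
      buf[slot] = $1
      sum += $1
      n = (NR < window) ? NR : window     # fewer values available at the start
      print sum / n
    }'
}

# Take deltas while skipping the first row (its delta from zero),
# then smooth with a window of 2.
printf '100\n98\n96\n90\n' |
  awk 'NR > 1 { print $1 - prev } { prev = $1 }' |
  sliding_average 2
```

A larger window spreads an isolated spike over more samples, which is the effect the 1000-sample average has on the standby gap.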

We can use the eval command to calculate the relative amounts of time spent on AC and battery power:

$ nfu -dS1,0e '%0 < 0 ? "BAT" : "AC"' -gc < battery-log
  83464 AC
  35214 BAT
$
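
The same classification can be sketched with standard tools, which may help clarify what the eval expression is doing to each row. classify_power is a hypothetical helper, not part of nfu, and the sample readings are invented for the demo: a negative delta is treated as time on battery, anything else as time on AC.

```shell
# Hypothetical helper (not part of nfu): label each delta between consecutive
# readings as BAT (falling) or AC (not falling), then count the labels.
classify_power() {
  awk 'NR > 1 { print (($1 - prev) < 0 ? "BAT" : "AC") } { prev = $1 }' |
    sort | uniq -c
}

# Invented readings: two falling steps followed by three rising ones.
printf '100\n98\n96\n97\n99\n100\n' | classify_power
```

Because the log has one sample per second, the counts can be read directly as seconds spent in each state.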

You can also use eval to filter values by returning an empty list. Here's how you might reduce the dataset to battery samples only and plot the running average (-a0) of the discharge rate:

$ nfu -dS1,0e '%0 < 0 ? %0 : ()' -a0p 'with lines' < battery-log

nfu isn’t a great piece of architecture or design. It’s just one of those scripts that saves a little bit of time here and there, and is (we hope) useful to have around. Grab the code and let us know what you think! Feature requests and bug reports are welcome as always.

- Spencer Tipping, Software Engineer @ Factual