Text processing in UNIX

by Noufal Ibrahim

May 17, 2019 in Technology

Introduction

I’ve been conducting a series of workshops at engineering colleges in Kerala. It’s a two-day presentation titled “Programming as Engineering” and designed to inspire rather than instruct. It is structured as 5 separate “topics”, viz. UNIX, Optimisation, Higher order functions, Data structures and Small languages. They’re distinct from each other so that I have the flexibility to cut it into a shorter one-day course if necessary. I’ve conducted it five times now and it’s been fairly well received.

I took a tiny piece of the UNIX presentation and delivered it as a lightning talk at PyCon India 2013. I then expanded it a little and did an open space on Shell + Emacs tricks at the same conference, which also seemed to interest a few folks. One of the things we touched upon was taking the text of a book and trying to extract as much information about it as possible without reading it. In this blog post, I’m going to detail that exercise. The UNIX shell is as underrated as it is powerful, and I think it’s a worthwhile exercise to remind ourselves of that.

To run this, I took a copy of Moby Dick from the Gutenberg project website. I excised the headers and footers added by the project and scrubbed the file to use UNIX newlines. The exercise is to get as much information from the book as possible without actually reading it. I was inspired to do this by Sherlock Holmes’ trick at the beginning of The Valley of Fear, where the detective deduces the name of a book from some information that one of his agents sends him.

The Analysis

I use zsh version 4.3.17 on Debian. These examples are tested with that shell but they should work with bash too since I don’t use too many esoteric features.

First, we try to find the number of chapters in the book. This is not too hard. We simply run this.

grep -iw chapter moby-dick.txt | wc -l

and we get 172. So we know that it has (roughly) 172 chapters. The -i option to grep makes the search case insensitive (we match both Chapter and chapter). The -w restricts the pattern to word boundaries, so we won’t match things like chapters.
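The count is rough because it includes every occurrence of the word, not just the chapter headings. If, as in many Gutenberg plain-text editions, each heading starts a line with the word CHAPTER in capitals, a stricter count could be had with something like the following (this assumes that formatting holds for this particular copy):

grep -c '^CHAPTER' moby-dick.txt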

Next, we try to get the number of pages in the book. A typical paperback, which is the kind I’d get if I bought a paper copy of Moby Dick, has approximately 350 words on a page (35 lines per page and 10 words per line). I know this because I actually counted them in 10 books. We can estimate the page count using

expr $(wc -w moby-dick.txt | awk '{print $1}') / 350

expr is an underappreciated command-line calculator that you can use in a pipeline. The $( and ) is command substitution: the snippet inside the brackets is run and its output takes the place of the whole expression. In this case, we count the words in the book and divide the total by 350. The output is 595. That’s around 3 pages a chapter on average.
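The same pattern works for other back-of-the-envelope figures. For instance, dividing the total word count by the 172 chapter occurrences we found earlier gives a rough words-per-chapter estimate (a sketch reusing the numbers from this copy):

expr $(wc -w moby-dick.txt | awk '{print $1}') / 172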

The next thing we try to get is the length of sentences. This is useful for approximating the reading level of the book. The Flesch-Kincaid tests use sentence length (among other things) to calculate a reading grade. It’s also fair to say that technical books usually keep sentence lengths somewhat low (although code snippets can ruin our estimates), and children’s books have shorter sentences still. The sentences we usually speak in conversation are about 20 words long.

To do this, first we run the book through tr '\n' ' ', which changes all newlines to spaces so the whole book fits on a single line. Then we pipe that through tr '.' '\n', which converts it to one sentence per line. We then count the words per such “line” using awk '{print NF}' and finally pipe that through sort -n | uniq -c | sort -n, which gives us a frequency count per sentence length in increasing order of frequency. The last few lines will tell us what the lengths of most of the sentences are. The complete pipeline is shown below.
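Putting the steps together, the whole pipeline looks like this, with a tail at the end so we only see the last few lines of the frequency table:

tr '\n' ' ' < moby-dick.txt | tr '.' '\n' | awk '{print NF}' | sort -n | uniq -c | sort -n | tail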