Introduction
I’ve been conducting a series of workshops at engineering colleges in Kerala. It’s a two-day presentation titled “Programming as Engineering” and is designed to inspire rather than instruct. It is structured as five separate “topics”, viz. UNIX, optimisation, higher order functions, data structures and small languages. They’re distinct from each other so that I have the flexibility to cut it down to a shorter one-day course if necessary. I’ve conducted it five times now and it’s been fairly well received.
I took a tiny piece of the UNIX presentation and delivered it as a lightning talk at PyCon India 2013. I then expanded it a little and ran an open space on shell and Emacs tricks at the same conference, which also seemed to interest a few folks. One of the things we touched on was taking the text of a book and trying to extract as much information about it as possible without reading it. In this blog post, I’m going to detail that exercise. The UNIX shell is as underrated as it is powerful, and I think it’s a worthwhile exercise to remind ourselves of that.
To run this, I took a copy of Moby Dick from the Project Gutenberg website. I excised the headers and footers added by the project and scrubbed the file to use UNIX newlines. The exercise is to get as much information from the book as possible without actually reading it. I was inspired by Sherlock Holmes’ trick at the beginning of The Valley of Fear, where the detective deduces the name of a book from some information that one of his agents sends him.
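For what it’s worth, the scrubbing can be done with the same tools used below. This is a minimal sketch; the raw file name is a placeholder, and I removed the Project Gutenberg header and footer by hand rather than scripting it.

# Gutenberg plain-text files use DOS line endings; delete the carriage returns.
# pg-moby-dick-raw.txt is a placeholder name for the downloaded file.
tr -d '\r' < pg-moby-dick-raw.txt > moby-dick.txt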
The analysis
I use zsh version 4.3.17 on Debian. These examples are tested with that shell but they should work with bash too since I don’t use too many esoteric features.
First, we try to find the number of chapters in the book. This is not too hard. We simply run this.
grep -iw chapter moby-dick.txt | wc -l
and we get 172. So we know that it has (roughly) 172 chapters. The -i option to grep makes the search case insensitive (we match Chapter and chapter). The -w restricts the pattern to word boundaries, so we won’t match things like chapters.
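Note that this also counts sentences that merely mention the word chapter. If your copy puts each chapter heading on its own line starting with the word (an assumption about the formatting), you can anchor the pattern for a tighter estimate:

grep -ic '^chapter' moby-dick.txt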
Next, we try to get the number of pages in the book. A typical paperback book, which is the kind I’d get if I bought a paper copy of Moby Dick, has approximately 350 words on a page (35 lines per page and 10 words per line). I know this because I actually counted them on 10 books. We can get this using
expr $(wc -w moby-dick.txt | awk '{print $1}') / 350
expr is an underappreciated command line calculator that you can use in a pipeline. The $( and ) is command substitution: the snippet inside the brackets is run and its output is put in place of the $( and ). In this case, we count the words and divide the count by 350. The output is 595. That’s around 3 pages a chapter on average.
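If you’d rather not shell out to expr, the shell’s own arithmetic does the same job. Here’s a small sketch that also works out the pages-per-chapter figure; the numbers in the comments are the approximations from above.

words=$(wc -w < moby-dick.txt)   # just the count, no file name column
echo $(( words / 350 ))          # roughly 595 pages
echo $(( words / 350 / 172 ))    # roughly 3 pages per chapter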
The next thing we try to get is the length of sentences. This is useful for approximating the reading grade of the book. The Flesch-Kincaid tests use sentence length (among other things) to calculate a reading level. It’s also fair to say that technical books usually keep sentence lengths somewhat low (although code snippets can ruin our estimates), and children’s books have shorter sentences still. The sentences we usually speak during conversation are about 20 words long. To do this, first we run the book through tr '\n' ' '. This changes all newlines to spaces so the whole book fits on a single line. Then we pipe that through tr '.' '\n', which converts it to one sentence per line. We then count the words per such “line” using awk '{print NF}' and pipe that through sort -n | uniq -c | sort -n, which gives us a frequency count per sentence length in increasing order. The last few lines will tell us what the lengths of most of the sentences are.
cat moby-dick.txt | tr '\n' ' ' | tr '.' '\n' | awk '{print NF}' | sort -n | uniq -c | sort -n
The last 20 lines of this give me
153 27
158 24
158 25
159 12
162 13
163 26
164 20
166 11
168 22
168 23
173 19
176 14
178 21
179 17
179 18
179 8
186 15
194 9
197 16
230 2
The first column is the number of sentences and the second column the length of the sentence. Summing column one from this gives us 3490, using
cat moby-dick.txt | tr '\n' ' ' | tr '.' '\n' | awk '{print NF}' | sort -n | uniq -c | sort -n | tail -20 | awk '{sum += $1} END {print sum;}'
and the total sentence count from
cat moby-dick.txt | tr '\n' ' ' | tr '.' '\n' | wc -l
is 7385. So the last 20 lengths account for a little less than half the number of sentences. Sorting the last 20 by sentence length using
cat moby-dick.txt | tr '\n' ' ' | tr '.' '\n' | awk '{print NF}' | sort -n | uniq -c | sort -n | tail -20 | sort -n -k2
gives us some more insight into the lengths.
230 2
179 8
194 9
166 11
159 12
162 13
176 14
186 15
197 16
179 17
179 18
173 19
164 20
178 21
168 22
168 23
158 24
158 25
163 26
153 27
which is that they’re all 27 words or fewer. That’s fairly conversational. However, the maximum sentence length is 394, and there are even two sentences that are 224 words long. This makes it quite unlikely that it’s a children’s book or a technical one. We can even go a step further and drop the tail -20 to get a frequency distribution.
cat moby-dick.txt | tr '\n' ' ' | tr '.' '\n' | awk '{print NF}' | sort -n | uniq -c | sort -n -k2 | awk '{print $2 " " $1}'
and then plot that using gnuplot to see the shape of the distribution.
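Here is roughly what I mean; the data and image file names are placeholders. Save the distribution to a file and hand it to gnuplot:

cat moby-dick.txt | tr '\n' ' ' | tr '.' '\n' | awk '{print NF}' | sort -n | uniq -c | sort -n -k2 | awk '{print $2 " " $1}' > lengths.dat
gnuplot -e "set terminal png; set output 'lengths.png'; set xlabel 'sentence length'; set ylabel 'sentences'; plot 'lengths.dat' with lines notitle"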
The next thing we can try to approximate is the year the book was written. Something like
cat moby-dick.txt | tr -Cs '[A-Za-z0-9]' '\n' | grep -w '[12][0-9][0-9][0-9]'
gives us all the years in the book. This first converts the text into one word per line (by changing all “non word” characters into newlines) and then looks for numbers that look like years. This gives us quite a list. Sticking a wc -l at the end gives us the number of matches (in our case 30). We can sum these and then divide by the number of matches to get an average.
expr $(cat moby-dick.txt | tr -Cs '[A-Za-z0-9]' '\n' | grep -w '[12][0-9][0-9][0-9]' | awk '{sum += $1} END {print sum;}') / $(cat moby-dick.txt | tr -Cs '[A-Za-z0-9]' '\n' | grep -w '[12][0-9][0-9][0-9]' | wc -l)
This is hairy but actually runs over a two hundred thousand word text file in about 0.03 seconds on my computer. That’s much less than the time needed to write a real program to do this. I get 1796. It’s likely that the book was written a little after this date (unless it’s futuristic speculative fiction of some kind), so let’s say the early 19th century.
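A slightly less hairy way to write the same thing is to extract the years once into a variable and let awk do the averaging; this is just a restatement of the pipeline above:

years=$(cat moby-dick.txt | tr -Cs '[A-Za-z0-9]' '\n' | grep -w '[12][0-9][0-9][0-9]')
echo "$years" | awk '{sum += $1; n += 1} END {print int(sum / n)}'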
So far, we have a non-technical book for older audiences. It’s approximately 600 pages spread across 170 chapters, written in the early 19th century. Let’s go on.
We can do a frequency analysis of the words. First, lower case everything and get one word per line using
cat moby-dick.txt | tr '[A-Z]' '[a-z]' | tr -Cs '[a-z0-9]' '\n'
Then pipe that through sort | uniq -c | sort -n | tail -20 to get the 20 most common words in the book. The results are disappointing.
1382 this
1515 all
1593 for
1627 was
1690 is
1692 with
1720 as
1737 s
1805 but
1876 he
2114 i
2495 his
2496 it
3045 that
4077 in
4539 to
4636 a
6325 and
6469 of
14175 the
None of these words gives us any information about the content of the book. We can filter for longer words by sticking a grep ..... before the first sort to look only for words longer than 4 letters.
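Put together from the steps above, the whole pipeline would be something like

cat moby-dick.txt | tr '[A-Z]' '[a-z]' | tr -Cs '[a-z0-9]' '\n' | grep ..... | sort | uniq -c | sort -n | tail -20

which gives us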
252 queequeg
257 stubb
262 again
268 after
280 white
282 seemed
292 great
295 before
303 those
308 about
312 still
327 captain
382 though
394 these
410 other
426 would
604 their
625 which
854 there
1150 whale
in about 1.6 seconds. You can see themes here. Whale is obviously something important. Captain makes the book either military or nautical, and the whale suggests the latter. Great and white are not significant in themselves, but together with whale they give you great white whale, which is good. The captain is important in the story. There are also two words which you can’t find in the dictionary – queequeg and stubb. It’s likely that these are characters in the story. By changing the lengths of the words we filter for, we get some more out of the text. Using ...., we get ahab and ship. Using ......, we get whaling, pequod and starbuck. Adjusting the lengths like this, we end up with these words that are not in the dictionary – Ahab, Queequeg, Stubb, Pequod, Starbuck and Tashtego – and these words that are – ship, white, whale, captain, whaling, harpooneers, harpooneer, leviathan, Nantucket.
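If you want to sweep over several word lengths in one go, a small loop does it. This is just a sketch of the adjustment described above; the set of lengths and the tail -10 are arbitrary choices.

for pattern in .... ..... ...... .......; do
    echo "== words of at least ${#pattern} letters =="
    cat moby-dick.txt | tr '[A-Z]' '[a-z]' | tr -Cs '[a-z0-9]' '\n' | grep "$pattern" | sort | uniq -c | sort -n | tail -10
done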
So, we can judge that this is a non-technical book for older audiences. It’s approximately 600 pages spread across 170 chapters, written in the early 19th century. It tells a story set on a whaling ship. The captain is an important character in the story. They are hunting a white whale. It’s likely an American story (since Nantucket has a history of whaling). The main characters in the story are Ahab, Queequeg etc.
Now you can head over to the Wikipedia page for Moby Dick and see how close we’ve come.
It’s possible to squeeze out more information from the text. We can, for example, get bigrams from it with this (try it).
paste <(cat moby-dick.txt| tr '[A-Z]' '[a-z]' | tr -sC '[a-z0-9]' '\n') <(cat moby-dick.txt| tr '[A-Z]' '[a-z]' | tr -sC '[a-z0-9]' '\n' | tail -n +2) | sort | uniq -c | sort -n | tail -20
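The same trick extends to trigrams by pasting in a third copy of the word stream, shifted by two:

paste <(cat moby-dick.txt | tr '[A-Z]' '[a-z]' | tr -sC '[a-z0-9]' '\n') <(cat moby-dick.txt | tr '[A-Z]' '[a-z]' | tr -sC '[a-z0-9]' '\n' | tail -n +2) <(cat moby-dick.txt | tr '[A-Z]' '[a-z]' | tr -sC '[a-z0-9]' '\n' | tail -n +3) | sort | uniq -c | sort -n | tail -20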
I think these tools are not sufficiently advertised in the modern developer community, and that’s a loss. I’m planning to put together a course that teaches these skills, which I’m going to hold in Bangalore later this year. You should sign up for my trainings list or follow me on Twitter if you want to know when I’m ready.
This course is also going to be part of a larger boot camp style thing which I’m doing under the Lycæum banner, so stay tuned.