Often, we become so captivated by complexity and sophistication that we overlook the profound effectiveness of simple, fundamental methods.
Kenneth Ward Church, in his wonderful Unix™ for Poets [1], shows how basic text processing tasks like splitting text into words, counting words, sorting words in rhyming order, and even computing n-grams can be done by anybody using simple Unix tools, without any fancy software or algorithms, even on a modest PC. In this article, we walk through some basic NLP tasks with Unix tools, following Unix™ for Poets and using the Moby-Dick corpus [2], which is 214,492 words, 19,390 lines, and 1.2 MB in size.
1. Count Word Frequencies In a Text
The problem is to read a text file and output a list of the words in the file along with their frequency counts. To split the text into words we can use this simple method:
> tr -sc 'a-zA-Z' '\n' < moby.txt | tr 'A-Z' 'a-z' > words.txt
> less words.txt
moby
dick
or
the
whale
by
herman
melville
chapter
loomings
call
me
ishmael
some
years
...
tr can be used to translate or delete characters. The -c option means complement, so in this example tr converts all non-alphabetical characters to \n. With -s we tell it to squeeze each sequence of repeated output characters into a single occurrence. Hence every time tr encounters a run of non-alphabetical characters (e.g. spaces, punctuation, numbers) it replaces the whole run with a single \n. Finally, we replace all uppercase letters with lowercase ones and store the result in the words.txt file.
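To see the tokenizer in isolation, we can run it on the opening sentence of the book (a quick sanity check of my own, not part of the original article):
> echo "Call me Ishmael. Some years ago..." | tr -sc 'a-zA-Z' '\n' | tr 'A-Z' 'a-z'
call
me
ishmael
some
years
ago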
To calculate word frequencies, we can simply use the sort and uniq commands.
> sort words.txt | uniq -c | sort -nr | less
14178 the
6467 of
6321 and
4629 a
4537 to
4073 in
3036 that
2494 his
2489 it
2108 i
1874 he
1804 but
1735 s
1717 as
1691 with
1691 is
1628 was
1594 for
1513 all
1377 this
...
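Each stage of the pipeline has a job: the first sort puts identical words on adjacent lines (uniq -c only collapses adjacent duplicates, so its input must be sorted), uniq -c prefixes every distinct line with its count, and sort -nr orders the counts numerically in reverse. A toy run of my own makes the mechanics visible:
> printf 'the\nwhale\nthe\n' | sort | uniq -c | sort -nr
      2 the
      1 whale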
And we have a list of words sorted by frequency in the output. Note that there are some entries like a single s. That’s because we use a very simple word tokenization method: words like It’s break into two parts, it and s. We could use more accurate tokenization methods, but in this simple example the naive approach serves us well.
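One easy refinement, if we wanted it, would be to keep apostrophes inside words by adding ' to the set of kept characters (a rough heuristic of my own, not a real tokenizer; it will also keep stray quotation marks):
> tr -sc "a-zA-Z'" '\n' < moby.txt | tr 'A-Z' 'a-z' | sort | uniq -c | sort -nr | less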
Small changes to the tokenizing rule are easy to make, but can have a dramatic impact. For example, we can count vowel sequences instead of words by changing the tokenizing rule to emit sequences of vowels rather than sequences of alphabetic characters.
> tr 'A-Z' 'a-z' < moby.txt | tr -sc 'aeiou' '\n' | sort | uniq -c | sort -nr | less
94705 e
63440 a
53233 i
50012 o
14698 u
7597 ou
6274 ea
4197 ee
3352 ai
2734 oo
1895 ie
1575 io
1307 ei
1262 oa
962 ia
886 ue
769 oi
540 au
...
With tr 'A-Z' 'a-z' we replace uppercase characters with lowercase ones, and then with tr -sc 'aeiou' '\n' we replace every run of non-vowel characters with a single \n.
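The same kind of one-line edit works in the other direction too: swap the kept set to the consonants and the pipeline counts consonant sequences instead (a variant sketch of my own; output omitted):
> tr 'A-Z' 'a-z' < moby.txt | tr -sc 'bcdfghjklmnpqrstvwxyz' '\n' | sort | uniq -c | sort -nr | less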
2. Sort Words By Rhyming Order
Although it may seem like a complex task, we can do a decent job using a very simple technique. By rhyming order, we mean that the words should be sorted from the right rather than the left. So the only thing we need to do is reverse each word, sort, and reverse back. We can reverse lines characterwise using the rev command:
> tr -sc 'a-zA-Z' '\n' < moby.txt | tr 'A-Z' 'a-z' > words.txt
> rev words.txt | sort | uniq | rev | less
a
caramba
cuba
juba
malacca
mecca
seneca
republica
oceanica
america
africa
da
armada
canada
andromeda
zogranda
anaconda
golconda
...
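Because the list is ordered from the right, the immediate neighbors of a word in it are the words that end most like it. A quick way to peek at the rhyme neighborhood of whale (an example of my own; grep -C 3 prints three lines of context on each side):
> rev words.txt | sort | uniq | rev | grep -C 3 '^whale$'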
3. N-grams
I have presented this material in quite a number of tutorials over the years. The students almost always come up with a big “aha” reaction at this point. Counting words seems easy enough, but for some reason, students are almost always pleasantly surprised to discover that counting ngrams (bigrams, trigrams, 50-grams) is just as easy as counting words. - Kenneth Ward Church, Unix™ for Poets
We can easily count bigrams (pairs of words) using this algorithm:
Step 1: Tokenize by word
Step 2: Print word_i and word_i+1 on the same line
Step 3: Count
We already know how to tokenize words and store them in the words.txt file:
> tr -sc 'a-zA-Z' '\n' < moby.txt | tr 'A-Z' 'a-z' > words.txt
For the second step we can use the tail command:
> tail -n +2 words.txt > next-words.txt
> less next-words.txt
With tail -n +2 words.txt we skip the first line and print from line 2 to the end, so the next-words.txt file is words.txt with every line (word) shifted up by one. Finally, we can use the paste command to merge the lines of words.txt and next-words.txt:
> paste words.txt next-words.txt | sort | uniq -c | sort -nr | less
1840 of the
1146 in the
717 to the
432 from the
377 the whale
369 of his
361 and the
351 on the
328 at the
324 to be
318 of a
315 by the
310 with the
304 for the
291 it was
282 it is
258 in his
256 in a
248 the ship
...
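For clarity, this is what the intermediate paste output looks like before the counting stages; each line holds a word and its successor, separated by a tab (the first three lines follow from the word list shown earlier):
> paste words.txt next-words.txt | head -n 3
moby	dick
dick	or
or	the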
We can count trigrams and more using the same simple method. We can even turn it into a bash script that reads the text from standard input:
#!/usr/bin/bash
# mktrigram.sh -- count trigrams in the text supplied on stdin
# ($$ expands to the shell's PID, which keeps the temporary file names unique)
tr -sc 'a-zA-Z' '\n' | tr 'A-Z' 'a-z' > $$words
tail -n +2 $$words > $$nextwords
tail -n +3 $$words > $$nextwords2
paste $$words $$nextwords $$nextwords2 | sort | uniq -c | sort -nr
# remove the temporary files
rm $$words $$nextwords $$nextwords2
We can then run it and get trigrams sorted by frequency:
> sh mktrigram.sh < moby.txt | less
114 the sperm whale
97 of the whale
88 the white whale
64 one of the
58 of the sea
56 out of the
53 the ship s
53 part of the
52 a sort of
50 the whale s
48 the pequod s
42 of the sperm
37 of the ship
36 i don t
35 the right whale
34 the old man
33 sperm whale s
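As the quote promises, nothing here is specific to trigrams. Below is a sketch of a generalized script; the name mkngram.sh and the n-as-first-argument interface are my own choices, not from the original article:
#!/usr/bin/bash
# mkngram.sh -- count n-grams in the text supplied on stdin; n is $1 (default 2)
n=${1:-2}
tr -sc 'a-zA-Z' '\n' | tr 'A-Z' 'a-z' > $$words.0
files="$$words.0"
# build one shifted copy of the word list per extra position in the n-gram
for i in $(seq 1 $((n - 1))); do
    tail -n +$((i + 1)) $$words.0 > $$words.$i
    files="$files $$words.$i"
done
paste $files | sort | uniq -c | sort -nr
rm $$words.*    # clean up the PID-named temporary files
Running sh mkngram.sh 4 < moby.txt | less should then give 4-grams sorted by frequency.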
4. Find Palindrome Words
Palindrome words are spelled the same backward and forward, e.g. noon. We can find palindromes with this algorithm:
Step 1: Reverse each word
Step 2: Find words that are identical to their reverse
> rev words.txt > rev-words.txt
> paste words.txt rev-words.txt | awk '$1 == $2' | sort | uniq | less
...
peep peep
pip pip
poop poop
pop pop
p p
razar razar
redder redder
refer refer
sees sees
sexes sexes
...
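A small variation finds every word whose reversal is also a word (palindromes included). The sketch below is my own extension of the trick, not from the original article; it uses comm -12, which prints the lines common to two sorted files:
> sort -u words.txt > sorted-words.txt
> rev words.txt | sort -u > sorted-rev.txt
> comm -12 sorted-words.txt sorted-rev.txt | less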
5. Conclusion
Thanks to Kenneth Ward Church, we now understand that anyone can perform basic NLP tasks on their own PC using just Unix tools, even without specialized knowledge of NLP or complex algorithms. In today’s landscape, dominated by the hype surrounding AI and large language models, we often underestimate the effectiveness and strength of simple, foundational techniques.
References
[1] Kenneth Ward Church, Unix™ for Poets. https://www.cs.upc.edu/~padro/Unixforpoets.pdf
[2] Moby-Dick Corpus