The smallest, simplest, and most educational language model ever created β written in a single line of AWK.
A "Baby Language Model" is a tiny statistical model that learns the most basic thing about language: which words appear most often.
This one-liner AWK script is the most minimal version of a unigram language model β the same core idea that powered early statistical language models before deep learning took over.
awk '{
gsub(/[^a-zA-Z0-9 ]/, " ");
for(i=1;i<=NF;i++) {
w = tolower($i);
if(w) count[w]++
}
} END { for(w in count) print count[w], w }' text.txt | sort -nr
Remove punctuation and convert to lowercase.
Build a frequency table using AWKβs powerful associative arrays.
Sort by frequency β the fundamental idea behind statistical language modeling.