<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://svivek.com//writing/feed.xml" rel="self" type="application/atom+xml" /><link href="https://svivek.com//" rel="alternate" type="text/html" /><updated>2026-01-28T06:54:00-07:00</updated><id>https://svivek.com//writing/feed.xml</id><title type="html">Vivek Srikumar | Writing</title><subtitle>Vivek Srikumar&apos;s website</subtitle><entry><title type="html">Named Neurons</title><link href="https://svivek.com//writing/2026-01-28-named-neurons.html" rel="alternate" type="text/html" title="Named Neurons" /><published>2026-01-28T00:00:00-07:00</published><updated>2026-01-28T00:00:00-07:00</updated><id>https://svivek.com//writing/named-neurons</id><content type="html" xml:base="https://svivek.com//writing/2026-01-28-named-neurons.html"><![CDATA[<p>In our 2019 paper on augmenting neural networks with logic <a class="citation" href="#li2019augmenting">(Li &amp; Srikumar, 2019)</a>, <a href="https://users.cs.utah.edu/~tli/">Tao Li</a>, and I wrote
about something called a <strong>named neuron</strong>. We introduced it as a concept that
helps bind a symbol in first-order logic to an element of a neural network.
Essentially, named neurons are symbols in neural networks. Over the years, I
have found myself thinking about the idea often, and I discussed it most
recently in my <a href="https://svivek.com/teaching/neurosymbolic-modeling">neurosymbolic modeling
class</a>.</p>

<p>I am writing this note to elaborate on the idea. To get things started, we will
look at what a representation is. Then we will look at symbols and neurons,
focusing on how they both derive aspects of their meaning from the computation
they admit. Finally, we will see what all this has to do with named
neurons.</p>

<h2 id="what-is-a-representation">What is a representation?</h2>

<p>A representation is simply a mapping from one conceptual domain to another. In
this relationship, one domain “stands in” for the other. Let us see some
examples.</p>

<div class="inset-image-full with-accent">
  <img src="/writing/img/babylonian-map.png" alt="A Babylonian stone tablet in the center, with a drawing of its content on the left and a map of ancient Babylon represented by the tablet on the right" />
  
  <p class="image-caption" style="margin-top: 0.5em; font-size: 0.85em;">An ancient map. Images courtesy Wikipedia</p>
  
</div>

<p>A map is a representation of a territory. Every region in a map corresponds to
some region in a territory. For example, consider this image of <a href="https://en.wikipedia.org/wiki/Babylonian_Map_of_the_World">an ancient
Neo-Babylonian stone tablet supposedly representing the
world</a>
above.<sup id="fnref:wikipedia-ack"><a href="#fn:wikipedia-ack" class="footnote" rel="footnote" role="doc-noteref">1</a></sup> The stone tablet in the middle is an ancient map that
does not follow modern conventions of map design. Wikipedia has a helpful
diagram (shown on the left) that highlights the key information on the tablet. The
diagram represents the tablet, which in turn represents the territory (on the
right). A map of a map, a representation of a representation!<sup id="fnref:nitpick"><a href="#fn:nitpick" class="footnote" rel="footnote" role="doc-noteref">2</a></sup></p>

<p>Sometimes a representation is built of names or symbols: elements of a (possibly
infinite) discrete set. Emojis are fantastic examples of symbolic
representations that convey <em>prefabricated</em> thoughts. For example, the symbol ⚠️
represents the idea of “warning” and ♀️ represents the concept of “femininity” or
“womanhood”. Words and phrases like <code class="language-plaintext highlighter-rouge">curiosity</code>, <code class="language-plaintext highlighter-rouge">transformer neural network</code>
and <code class="language-plaintext highlighter-rouge">Emmy Noether</code> also refer to concepts that may be well-defined or vague.</p>

<p>Such names of concepts are useful because we can use them as prepackaged units
to talk to others. Moreover, such names also relate to one another. For example,
it is probably safe to say that the concepts of <code class="language-plaintext highlighter-rouge">cat</code> and <code class="language-plaintext highlighter-rouge">dog</code> are mutually
exclusive.</p>

<h2 id="what-do-symbols-mean">What do symbols mean?</h2>

<p>Symbolic logic is a language of symbol compositions. There are many families of
symbolic logic; let us take an informal look at first-order logic here.<sup id="fnref:2"><a href="#fn:2" class="footnote" rel="footnote" role="doc-noteref">3</a></sup>
First-order logic has several different types of symbols, summarized in the
table below.</p>

<table>
  <thead>
    <tr>
      <th>Symbol type</th>
      <th>Role</th>
      <th>Examples</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><em>Constants</em></td>
      <td>Refer to objects and concepts. They could be real or imaginary, and concrete or abstract</td>
      <td><code class="language-plaintext highlighter-rouge">Salt Lake City</code>, $\pi$, <code class="language-plaintext highlighter-rouge">Emmy Noether</code>, ⚠️, ♀️, <code class="language-plaintext highlighter-rouge">curiosity</code>, <code class="language-plaintext highlighter-rouge">transformer neural network</code>, <code class="language-plaintext highlighter-rouge">Exit 213</code></td>
    </tr>
    <tr>
      <td><em>Predicates</em></td>
      <td>State relationships between objects that could either be <code class="language-plaintext highlighter-rouge">true</code> or <code class="language-plaintext highlighter-rouge">false</code></td>
      <td><code class="language-plaintext highlighter-rouge">IsHappy</code>, <code class="language-plaintext highlighter-rouge">IsTall</code>, <code class="language-plaintext highlighter-rouge">BrotherOf</code></td>
    </tr>
    <tr>
      <td><em>Functions</em></td>
      <td>Relate objects to each other</td>
      <td><code class="language-plaintext highlighter-rouge">NextIntegerOf</code>, <code class="language-plaintext highlighter-rouge">RightEyeOf</code></td>
    </tr>
  </tbody>
</table>

<p>Using such symbols, we can write down claims like:</p>
<ul>
  <li><code class="language-plaintext highlighter-rouge">IsTall(John)</code>: “John is tall”, which may or may not hold.</li>
  <li><code class="language-plaintext highlighter-rouge">IsEven(4)</code>: “The number 4 is even”, which is <code class="language-plaintext highlighter-rouge">true</code>.</li>
  <li><code class="language-plaintext highlighter-rouge">IsEven(NextIntegerOf(4))</code>: “The integer that follows 4 is even”, which is
<code class="language-plaintext highlighter-rouge">false</code> .</li>
</ul>

<p>The language also includes the standard Boolean operators ($\wedge, \vee, \neg,
\rightarrow, \leftrightarrow$) and the universal and existential quantifiers
$\forall, \exists$ respectively. With these operators, we can compose statements
like:</p>

\[\forall x \in \mathbb{Z}, \texttt{IsEven(x)} \rightarrow \neg\texttt{IsEven(NextIntegerOf(x))}\]

<p>Using a rule like $\forall$ <code class="language-plaintext highlighter-rouge">x, y</code>, <code class="language-plaintext highlighter-rouge">Father(x, y)</code> $\to$ <code class="language-plaintext highlighter-rouge">Parent(x, y)</code> and the
observation <code class="language-plaintext highlighter-rouge">Father(Bob, Rich)</code>, we can conclude that <code class="language-plaintext highlighter-rouge">Parent(Bob, Rich)</code>. This
is an application of an inference rule.</p>

<p>But do the symbols <em>by themselves</em> have any meaning? That is, what do <code class="language-plaintext highlighter-rouge">IsEven</code>,
<code class="language-plaintext highlighter-rouge">transformer neural networks</code> and <code class="language-plaintext highlighter-rouge">IsTall</code> mean? They look like English phrases,
which is helpful for communication. But the machinery of inference does not
depend on the meaning and choice of the symbols. We could systematically rename
all the symbols and the computational steps of inference would not change.</p>

<p>For example, if we rename <code class="language-plaintext highlighter-rouge">Father</code> and <code class="language-plaintext highlighter-rouge">Parent</code> as <code class="language-plaintext highlighter-rouge">Pred31</code> and <code class="language-plaintext highlighter-rouge">Pred81</code>
respectively, but keep the names of the people as is, we would arrive at the
conclusion <code class="language-plaintext highlighter-rouge">Pred81(Bob, Rich)</code>. This predicate is uninterpretable, and
practically anonymous. What we have done here is similar to renaming variables
in a program: a correct refactoring should not change the program’s behavior.</p>

<div class="inset-image-full with-accent">
  <img src="/writing/img/rule-renaming-isomorphism.png" alt="An illustration that shows that renaming symbols systematically does not change the computational process of inference" />
  
  <p class="image-caption" style="margin-top: 0.5em; font-size: 0.85em;">Renaming the predicate symbols produces a symbolic system that is isomorphic to the original one in terms of the computations that occur with them.</p>
  
</div>

<p>To a certain extent, this is a strength of symbolic reasoning. The process of
inference does not need to know what the symbols mean. By merely manipulating
symbols, we can arrive at valid conclusions without having any clue about what
the symbols mean, or even if they mean anything at all. Systematic renaming of
symbols will not change how they participate in computation, or what computation
can be performed on them.</p>
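
<p>To make this concrete, here is a tiny sketch of forward chaining over ground
facts. The rule format and all the names are invented for illustration; they are
not taken from any particular system. Renaming the predicates changes nothing
about the computation; only the labels on the conclusions change.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def forward_chain(facts, rules):
    """Apply rules (premise, conclusion), read as premise(args) implies conclusion(args)."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            for predicate, args in list(facts):
                if predicate == premise and (conclusion, args) not in facts:
                    facts.add((conclusion, args))
                    changed = True
    return facts

# With meaningful names ...
print(forward_chain({("Father", ("Bob", "Rich"))}, [("Father", "Parent")]))
# ... and after a systematic renaming. The computation is identical; only the labels differ.
print(forward_chain({("Pred31", ("Bob", "Rich"))}, [("Pred31", "Pred81")]))
</code></pre></div></div>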

<p>To summarize this part, there are two aspects to the meaning of a symbol:</p>

<ol>
  <li><em>Meaning from name</em>: What a symbol intrinsically means according to conventional
understanding, which allows us to communicate with each other using symbols. For example,
when I write <code class="language-plaintext highlighter-rouge">RightEyeOf</code>, (I hope) I do not need to explain what that means.</li>
  <li><em>Meaning from computation</em>: What we can do with it, which does not depend on
any intrinsic meaning of the symbols themselves. Symbols could be a private
language if all we need to do is to perform operations on them.</li>
</ol>

<h2 id="do-neurons-mean-anything">Do neurons mean anything?</h2>

<p>Modern neural networks, particularly transformer-type models, are essentially
units of <em>anonymous compute</em>. They are vast computation graphs where internal
nodes do not require intrinsic meaning to function. We can associate meaning
with some elements of a network, usually at the edges. Let us look at two
examples.</p>

<p>Suppose we use BERT to classify whether a sentence is <em>imperative</em> or
<em>interrogative</em>. These labels have meanings defined by an external label
ontology that sits outside the model. The inner nodes of BERT, however, have no
such semantic home. They are simply numbers in tensors that are shaped by the
training data and not directly defined in terms of the linguistic
concepts they help process.</p>

<p>Another example is the classic MNIST digit classification task. In a CNN trained
on the task, the input is an image and the output consists of ten nodes that
together form the output of a softmax. Each output node is explicitly grounded to a
digit from zero to nine. The outputs are clearly “named”. But what about all the
activations in the layers between the inputs and the outputs?</p>

<p>There is, of course, work that shows that certain concepts can be discovered in
neural networks. An early example is the “cat face neuron” in a large network
trained with unsupervised learning on unlabeled images <a class="citation" href="#le2012building">(Le et al., 2012)</a>. But these were accidents
of training, and not the result of intentional design. That is, nobody
designated a certain neuron to be the cat face neuron when the network
architecture was designed. The designation was a <em>post hoc</em> imposition by an
observer.</p>

<p>Sometimes, inner nodes are designated to bear meaning. A good example is the
decomposable attention network <a class="citation" href="#parikh2016decomposable">(Parikh et al., 2016)</a>, an early
neural network for textual entailment. Given a premise and a hypothesis, the
network first encoded the words in both, aligned them using attention and used
the alignment to generate a representation that led to the final entailment
decision. The metaphor was that attention equals alignment. Nodes inside the
network, trained end-to-end, represented soft versions of
alignment.<sup id="fnref:alignment-caveat"><a href="#fn:alignment-caveat" class="footnote" rel="footnote" role="doc-noteref">4</a></sup></p>

<p>There are attempts to discover symbols inside neural networks by projecting them
into high-dimensional spaces. Embeddings are distributed representations that
pack multiple concepts into a single vector. Mechanistic interpretability
techniques seek to disentangle these using tools like sparse autoencoders <a class="citation" href="#huben2024sparse">(Huben et al., 2024)</a>. To state this differently, mechanistic interpretability
attempts to discover symbols or concepts inside neural networks.</p>

<p>What we have, one way or another, is a mapping from elements of neural networks
(inputs, labels, inner nodes, or projections) to external concepts.</p>

<p>Interestingly, we can rearrange the matrices in a network so that its internal
structure looks completely different, yet its behavior does not change. Nodes in
computation graphs do not care about their specific “address” within a layer.
That is, meaning is not bound to location but to computation. Swapping rows in a
matrix produces a different matrix, but if subsequent computations account for
that swap, the overall behavior remains unchanged.</p>
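
<p>Here is a small numerical sketch of this point, using a made-up two-layer
network rather than any particular model: permuting the hidden units of the
first layer, and applying the same permutation to the columns of the second
layer’s weight matrix, leaves the output unchanged even though both matrices now
look different.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

rng = np.random.default_rng(0)

# A tiny two-layer network, y = W2 @ relu(W1 @ x). All sizes are arbitrary.
W1 = rng.normal(size=(8, 4))   # 4 inputs, 8 hidden units
W2 = rng.normal(size=(3, 8))   # 8 hidden units, 3 outputs
x = rng.normal(size=4)

def relu(z):
    return np.maximum(z, 0.0)

y_original = W2 @ relu(W1 @ x)

# Shuffle the hidden units: permute the rows of W1 and, to compensate, the
# columns of W2 by the same permutation. The matrices change; the output does not.
perm = rng.permutation(8)
y_permuted = W2[:, perm] @ relu(W1[perm, :] @ x)

print(np.allclose(y_original, y_permuted))   # True
</code></pre></div></div>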

<p>So the lesson is that meaning in neural networks is bestowed in two ways: by
tying elements to external concepts, and by the computational processes that
operate on them. The meaning of a cat neuron is not intrinsic; it could have
appeared elsewhere if the matrices were shuffled.</p>

<p>Once again, just like symbolic logic, meaning is a function of computation.</p>

<h2 id="what-are-named-neurons">What are named neurons?</h2>

<p>We have seen that both symbols and neurons acquire aspects of their meaning by
the computation they admit. Symbols may have intrinsic (human-understandable)
meaning by the choice of their name, and they are assigned meaning via mappings
to external concepts. The name assigned to a symbol allows us to communicate
about the system and constrain it. Can we similarly assign names to neurons?</p>

<p>In our 2019 paper, we said a named neuron is an element of a neural network that
has external semantics. More precisely, it is a neuron that admits a mapping to
a symbolic space.</p>

<p>Formally, a named neuron is a <em>scalar</em> activation in a neural network that
admits a mapping to a symbol in a formal system, enabling logical reasoning over
the network.</p>
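
<p>One way to picture this definition in code (a hypothetical sketch; the fields
and the interface are invented for illustration, not an API from our paper): a
named neuron is an address into the computation graph, paired with the symbol it
is declared to stand for, and a way to read the activation at that address as a
soft truth value.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from dataclasses import dataclass

@dataclass
class NamedNeuron:
    """Binds one scalar activation in a network to a symbol in a formal system.

    The fields and the interface are hypothetical; any scheme that addresses a
    single scalar in the computation graph would do.
    """
    symbol: str   # e.g., "HasCatFace"
    layer: str    # which layer or module the activation lives in
    index: int    # which scalar within that layer

    def truth_value(self, activations, threshold=0.5):
        """Read the named activation as a score and a thresholded truth value.

        `activations` is assumed to map layer names to vectors from a forward pass.
        """
        score = float(activations[self.layer][self.index])
        return score, score &gt; threshold

# Purely illustrative: declare that scalar 117 of a layer called "conv4"
# stands for the predicate HasCatFace.
cat_face = NamedNeuron(symbol="HasCatFace", layer="conv4", index=117)
</code></pre></div></div>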

<p>If a representation is a mapping from one domain to another, then a named neuron
is a bridge that closes the gap between the discrete world of symbols and the
continuous world of tensors.</p>

<div class="inset-image-large with-accent">
  <img src="/writing/img/named-neuron.png" alt="A three-way mapping of symbols, computation graph nodes and concepts" />
  
  <p class="image-caption" style="margin-top: 0.5em; font-size: 0.85em;">A three-way mapping of symbols, computation graph nodes and concepts</p>
  
</div>

<p>We can think of this as a three-way mapping. Symbols map to concepts through
interpretations. Some nodes in neural networks map to concepts. We argue for a
mapping between symbols and neural network nodes via named neurons.</p>

<p>In other words, while a neural network may be awash with anonymous units of
compute, named neurons are some nodes in the network which may be recognizable
as concepts.</p>

<p>Now we have a complete picture. Consider the BERT example again. The output node
labeled <em>imperative</em> is a symbol. In fact, it is a predicate. Outputs of neural
networks are predicates. Inputs are objects. This gives us a first-order logic
mapping. When a neural network predicts that a sentence <code class="language-plaintext highlighter-rouge">x</code> is an imperative, we
can think of it as asserting the truth of a proposition <code class="language-plaintext highlighter-rouge">Imperative(x)</code>.</p>

<p>Inner nodes may also be predicates. For example, if the output node
corresponding to a <em>cat</em> label corresponds to the predicate <code class="language-plaintext highlighter-rouge">IsCat(x)</code> for an
input image <code class="language-plaintext highlighter-rouge">x</code>, then the “cat-face” neuron in the CNN might represent a
predicate called <code class="language-plaintext highlighter-rouge">HasCatFace(x)</code>.</p>

<p>How do we find named neurons in practice? Sometimes they are designed in (output
labels and even nodes in hidden layers, though not always), sometimes discovered
post-hoc (cat face neurons), or extracted systematically through mechanistic
interpretability techniques like sparse autoencoders. The field is moving from
accidental discovery to principled extraction.</p>

<h2 id="what-does-naming-neuron-give-us">What does naming neuron give us?</h2>

<p>Why bother naming neurons? Why not let the “distributed representation” handle
everything in its high-dimensional, anonymous way?</p>

<p>Naming a neuron transforms it from a passive statistical artifact to an active
participant in reasoning. Doing so opens the door to new ways of thinking about
neural networks.</p>

<p>First, we could write logical rules that constrain a network’s behavior
regardless of what it learns during training. For example, we could have a rule
that says “do not predict cat if a cat face is not recognizable”, written as:</p>

\[\neg\texttt{HasCatFace(x)} \to \neg \texttt{IsCat(x)}\]

<p>We could enforce this rule at test time, paying some extra compute for
verification in exchange for a more robust predictor.</p>
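
<p>Here is a rough sketch of what such test-time enforcement could look like,
reusing the hypothetical <code class="language-plaintext highlighter-rouge">NamedNeuron</code> binding from above. The model
interface, the threshold and the veto policy are all illustrative assumptions,
not a prescribed recipe.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np

def constrained_predict(model, x, cat_face, cat_label, threshold=0.5):
    """Test-time enforcement of: not HasCatFace(x) implies not IsCat(x).

    Assumes `model(x)` returns (probs, activations), where `probs` is a numpy
    array over labels and `activations` maps layer names to vectors. Both are
    assumptions about the interface made for this sketch.
    """
    probs, activations = model(x)
    _, has_cat_face = cat_face.truth_value(activations, threshold)
    if not has_cat_face:
        probs = probs.copy()
        probs[cat_label] = 0.0        # veto the consequent when the antecedent fails
        probs = probs / probs.sum()   # renormalize over the remaining labels
    return int(np.argmax(probs)), probs
</code></pre></div></div>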

<p>Second, we could define loss functions that encourage rule-following behavior.
This way, we would not need to apply the rule at test time. This approach
promises ways for our models to do better than the data they are trained on.</p>
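
<p>As a sketch of how a rule can become a loss (one possible relaxation, not the
specific formulation from our paper): treat the named activations as soft truth
values and penalize the degree to which the rule is violated. The rule
$\neg\texttt{HasCatFace(x)} \to \neg\texttt{IsCat(x)}$ is equivalent to
$\texttt{IsCat(x)} \to \texttt{HasCatFace(x)}$, and under the Łukasiewicz
relaxation the latter is violated by $\max(0, \texttt{IsCat(x)} - \texttt{HasCatFace(x)})$.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch

def implication_loss(is_cat_prob, has_cat_face_act):
    """Soft penalty for violating the rule, averaged over a batch.

    The penalty is zero whenever the named cat-face activation is at least as
    large as the predicted cat probability; otherwise it grows linearly.
    """
    return torch.relu(is_cat_prob - has_cat_face_act).mean()

# During training (the weight and the tensor names are hypothetical):
# loss = task_loss + 0.1 * implication_loss(probs[:, cat_label], cat_face_acts)
</code></pre></div></div>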

<p>Finally, named neurons could help enhance interpretability. Techniques like
sparse autoencoders are essentially “naming machines”. They attempt to find
latent symbols buried in the anonymous vectors by projecting them into a new
space that disentangles concepts. We could even close the loop from
interpretability to model improvement with constraints.</p>
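
<p>For concreteness, here is a bare-bones sparse autoencoder of the kind used in
this line of work. This is a generic sketch with made-up sizes, not the exact
architecture from the cited paper: an overcomplete encoder with a sparsity
penalty, whose individual latent units are the candidate “names”.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Maps hidden activations to an overcomplete, sparse code and back.

    Each latent unit is a candidate named feature; the sparsity penalty pushes
    most units to zero for any given input. All sizes here are illustrative.
    """
    def __init__(self, d_model=768, d_latent=16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def forward(self, h):
        z = torch.relu(self.encoder(h))   # sparse code: the candidate "symbols"
        h_hat = self.decoder(z)           # reconstruction of the original activation
        return z, h_hat

def sae_loss(h, z, h_hat, l1_weight=1e-3):
    # Reconstruction error plus an L1 sparsity penalty on the code.
    return ((h - h_hat) ** 2).mean() + l1_weight * z.abs().mean()
</code></pre></div></div>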

<p>By naming a neuron, we give it a role in a larger story, one governed by the
laws of logic, not just the randomness of gradient descent.</p>

<p>In future posts, we will explore how named neurons enable new training
objectives and architectural choices. For now, the key insight is that by naming
neurons, we gain leverage, namely the ability to reason about and constrain what
neural networks do.</p>

<h2 id="references">References</h2>

<ol class="bibliography"><li><span id="li2019augmenting">Li, T., &amp; Srikumar, V. (2019). Augmenting Neural Networks with First-order Logic. <i>Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</i>, 292–302.</span></li>
<li><span id="russell2021artificial">Russell, S. J., &amp; Norvig, P. (2021). <i>Artificial Intelligence: A Modern Approach</i> (Fourth Edition). Pearson.</span></li>
<li><span id="le2012building">Le, Q. V., Ranzato, M. A., Monga, R., Devin, M., Chen, K., Corrado, G. S., Dean, J., &amp; Ng, A. Y. (2012). Building high-level features using large scale unsupervised learning. <i>Proceedings of the 29th International Coference on International Conference on Machine Learning</i>, 507–514.</span></li>
<li><span id="parikh2016decomposable">Parikh, A., Täckström, O., Das, D., &amp; Uszkoreit, J. (2016). A Decomposable Attention Model for Natural Language Inference. <i>Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing</i>.</span></li>
<li><span id="huben2024sparse">Huben, R., Cunningham, H., Smith, L. R., Ewart, A., &amp; Sharkey, L. (2024). Sparse Autoencoders Find Highly Interpretable Features in Language Models. <i>The Twelfth International Conference on Learning Representations</i>.</span></li></ol>

<hr />

<div class="foldable-note">
  <div class="foldable-note-header">
    <h4 class="foldable-note-title">How to cite this post</h4>
    <span class="foldable-note-toggle">▼</span>
  </div>
  <div class="foldable-note-content">
    <div class="foldable-note-inner">
      
<p><strong>Recommended citation:</strong></p>

<p>Srikumar, Vivek. “Named Neurons.” January 27, 2026. https://svivek.com/writing/named-neurons.html</p>

<p><strong>BibTeX:</strong></p>
<div class="language-bibtex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">@misc</span><span class="p">{</span><span class="nl">srikumar2026named</span><span class="p">,</span>
  <span class="na">author</span> <span class="p">=</span> <span class="s">{Srikumar, Vivek}</span><span class="p">,</span>
  <span class="na">title</span> <span class="p">=</span> <span class="s">{Named Neurons}</span><span class="p">,</span>
  <span class="na">howpublished</span> <span class="p">=</span> <span class="s">{\url{https://svivek.com/writing/named-neurons.html}}</span><span class="p">,</span>
  <span class="na">year</span> <span class="p">=</span> <span class="s">{2026}</span><span class="p">,</span>
  <span class="na">month</span> <span class="p">=</span> <span class="s">{January}</span><span class="p">,</span>
  <span class="na">note</span> <span class="p">=</span> <span class="s">{Blog post}</span>
<span class="p">}</span>
</code></pre></div></div>


    </div>
  </div>
</div>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:wikipedia-ack">
      <p>All three images are courtesy Wikipedia:
[<a href="https://en.wikipedia.org/wiki/Babylonian_Map_of_the_World#/media/File:BabylonianWorldMap2.jpg">1</a>],
[<a href="https://en.wikipedia.org/wiki/Babylonian_Map_of_the_World#/media/File:Baylonianmaps.JPG">2</a>]
and [<a href="https://commons.wikimedia.org/wiki/File:Meso2mil-English.JPG">3</a>] <a href="#fnref:wikipedia-ack" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:nitpick">
      <p>If you want to nitpick, you might observe that even the rightmost
image is a representation because it is just a map, not the actual
territory. So what we have is a map of a map of a map. And if you want to
nitpick more, you might note we have an image, i.e., a representation, of
the stone tablet, not the tablet itself. <a href="#fnref:nitpick" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2">
      <p>If you are interested, Russell and Norvig’s AI book <a class="citation" href="#russell2021artificial">(Russell &amp; Norvig, 2021)</a> has an excellent introduction to logic that
presents this more formally in the context of AI. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:alignment-caveat">
      <p>Of course, not every use of the attention mechanism is
interpretable. Self-attention within transformer networks is not. <a href="#fnref:alignment-caveat" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name></name></author><category term="neuro-symbolic modeling" /><category term="lecture notes" /><summary type="html"><![CDATA[This post elaborates on the concept of a named neuron, which bridges neural networks with symbolic logic.]]></summary></entry><entry><title type="html">Five Million Years of Solitude</title><link href="https://svivek.com//writing/2025-11-23-5-million-years-of-solitude.html" rel="alternate" type="text/html" title="Five Million Years of Solitude" /><published>2025-11-23T00:00:00-07:00</published><updated>2025-11-23T00:00:00-07:00</updated><id>https://svivek.com//writing/5-million-years-of-solitude</id><content type="html" xml:base="https://svivek.com//writing/2025-11-23-5-million-years-of-solitude.html"><![CDATA[<div class="inset-image-small with-accent">
  <img src="img/A-kadabba-reading.png" alt="A silhouette of A. kaddaba reading a stack of papers" />
  
</div>

<p>How big is the <a href="https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1">FineWeb
corpus</a>? The
documentation tells us that it has 15 trillion tokens, but the number is hard to
picture.</p>

<p>Instead, let’s go back in time to meet the <em><a href="https://en.wikipedia.org/wiki/Ardipithecus_kadabba">Ardipithecus
kadabba</a></em>. It is an early
hominid that lived about 5.7 million years ago, soon after ancestral chimps and
early hominids parted ways from their common ancestral hearth.</p>

<p>Imagine that an A. kadabba decided to start reading the corpus at the pace of a
modern American leisure reader. Busy with her reading, she would have bypassed
the whole Homo erectus, Neanderthal and Denisovan business, missed the birth of
the first Homo sapiens, and would be wrapping up her reading just in time to use
a modern chatbot.</p>

<p>Of course, she would be no match for the chatbot because she has only been
pretrained, and that too largely on 21st century English text. I suspect that
the surprisingly literate and well-read hominid would have been disappointed and
questioning the purpose of having stayed up (and alive) doing all that reading.
She would still need instruction tuning and preference alignment. But still,
they say reading builds character. And character, like personality, goes a long
way.</p>

<p>Maybe the lesson isn’t about how much data we can shovel into the system, but
that we need a better way to structure the intelligence we build.</p>

<div class="foldable-note">
  <div class="foldable-note-header">
    <h4 class="foldable-note-title">Technical details: Why 5.7 million years?</h4>
    <span class="foldable-note-toggle">▼</span>
  </div>
  <div class="foldable-note-content">
    <div class="foldable-note-inner">
      
<p>How did we arrive at 5.7 million as the number of years it will take to read
the FineWeb corpus? If you’re curious, and don’t mind some aggressive
approximations, read on.</p>

<h3 id="tokens-to-words">Tokens to words</h3>

<p>The FineWeb corpus has 15 trillion tokens. According to <a href="https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1">the blogpost describing
the corpus</a>,
the tokens are constructed using the GPT2 tokenizer. The exchange rate between
tokens and words is not 1:1. Words can be, and are, broken down into subwords by
the tokenizer. Small words may be retained as is, and bigger words will be
broken down. As of now, <a href="https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them">the rule of thumb suggested by
OpenAI</a>
is that we can think of 100 tokens as 75 words, which gives an exchange rate of
1.33 tokens per word.</p>

<p>But that is for the more recent tokenizer, whose vocabulary size (i.e., the
number of word pieces) is much larger. GPT2’s vocabulary contains about 50,000
tokens. With a smaller vocabulary, more words fall outside it and get split into
pieces. As a result, the number of tokens we get from a word will, on average,
be larger.</p>

<p>Consequently, we can think of one word as 1.5 tokens. With this exchange rate,
the FineWeb corpus will have 10 trillion words.</p>

<h3 id="words-to-books">Words to books</h3>

<p>Now that we have an estimate of how many words are in the corpus, let us
estimate how many books they can fill. Of course, we will need to make
aggressive approximations here. Books can be long (Marcel Proust’s <em>In Search of
Lost Time</em> is about 1.2 million words long), and books can be short (<em>Animal Farm</em>
has about 30,000 words). However, an average book has about 80,000 to 100,000
words. Let us take the upper end for simplicity.</p>

<p>FineWeb has 10 trillion words. At 100,000 words per book, the corpus corresponds
to 100 million books!</p>

<h3 id="books-to-years">Books to years</h3>

<p>Finally, let us see how long it takes to read 100 million books. <a href="https://worldpopulationreview.com/country-rankings/average-books-read-per-year-by-country">According to
the World Population
Review</a>,
an average American reader reads 17 books per year. To read 100 million books,
it will take about 5.88 million years. Conveniently, this is quite close to when
the A. kadabba lived, give or take a slight error. Maybe the A. kadabba in our
story is a faster reader, and shaves off 0.18 million years. This gives us the
5.7 million years.</p>
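
<p>If you prefer the arithmetic spelled out in one place (the same approximations
as above, nothing more precise):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tokens = 15e12            # FineWeb size, in GPT-2 tokens
tokens_per_word = 1.5     # aggressive approximation for a ~50K-token vocabulary
words_per_book = 100_000  # upper end of a typical book
books_per_year = 17       # average American reader, per World Population Review

words = tokens / tokens_per_word   # 10 trillion words
books = words / words_per_book     # 100 million books
years = books / books_per_year     # about 5.88 million years
print(f"{years / 1e6:.2f} million years")   # 5.88
</code></pre></div></div>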


    </div>
  </div>
</div>]]></content><author><name></name></author><category term="satire" /><summary type="html"><![CDATA[Suppose an early hominid started reading the FineWeb corpus...]]></summary></entry><entry><title type="html">Welcome</title><link href="https://svivek.com//writing/2025-11-22-welcome.html" rel="alternate" type="text/html" title="Welcome" /><published>2025-11-22T00:00:00-07:00</published><updated>2025-11-22T00:00:00-07:00</updated><id>https://svivek.com//writing/welcome</id><content type="html" xml:base="https://svivek.com//writing/2025-11-22-welcome.html"><![CDATA[<p>Over the years, I have been explaining my research to slightly different
audiences: <a href="/students.html">students</a>, linguists, mental health experts, <a href="https://www.price.utah.edu/ai/upskilling-in-ai">Utah
faculty learning about AI</a>, etc.
Each conversation has been slightly different, but has taught me more about AI.</p>

<p>This blog is an attempt to continue those conversations and to work through
ideas publicly. What kinds of ideas? That the most interesting AI challenges are
not just about scale. That techno-solutionism without context from domain
experts may be problematic. You can find more about <a href="/research">my research interests
here</a>. My current research focuses on neuro-symbolic methods,
low-data domains and resources, and systematic benchmarking.</p>

<p><strong>What I’ll write about</strong>: Research ideas. Lecture notes on topics like
<a href="/teaching/neurosymbolic-modeling">neuro-symbolic methods</a>. Thinking about
benchmarking beyond metrics. Thinking about linguistic phenomena. Making AI do
interesting things with limited data or compute resources. Commentary about AI
and society.</p>

<p>But I will likely also write about topics that are not just AI-related. I
generally like nerdy things, sometimes do recreational math for relaxation,
enjoy reading and writing good code, and am almost always reading a book or two
if I can find the time.</p>]]></content><author><name></name></author><category term="meta" /><summary type="html"><![CDATA[Why this blog exists]]></summary></entry></feed>