Book Reader RAG
Published on Dec 22, 2025
9 min read

Prelude
How long can one live under a rock?
It is finally time for The Engineer to do something AI-related!
But what possible topic could one choose that countless others haven’t covered already?
So let’s start with the typical one: how to build a RAG from scratch, and continue from there.
But why?
It is actually quite simple. With a search system, say Google, or something using Elasticsearch, you get documents relevant to what you are searching for. Essentially, it throws a pile of possibly relevant books at you.

By including an LLM in the flow, however, you could instead receive a nicely formulated letter with the information you seek. Neat, right?
Give me a concrete use case!
I recently read AI Engineering by Chip Huyen (coincidence?) but found out after going through many chapters that I forget most of what I read; and I do take notes! There is just so much useful information that my current short-span-coerced brain can no longer handle.
So why not enlist some help?
If an AI can function as my 2nd never-tired brain, let’s do it!
Overview
What one needs is essentially a way to ask a book questions. The LLM will be the one formulating nice answers (the G in RAG), while we are in charge of giving it the required data (the RA in RAG). This would roughly look like the diagram below:

There are 2 paths:
- Ingestion. Take the book contents, process them into a format the LLM can use, and store them.
- Retrieval. Take the question, fetch the relevant sections from the book, and formulate an answer.
Decisions
Based on the above, there are some technical decisions to make:
Where do we run this?
Locally. We want this to be cheap. We want to host it ourselves. We do not have that much data. At this point, we are also the only user. We do not want to rely on network traffic, nor token limits.
Ollama is perfect for this.
Which model?
It does not really matter. Ok, maybe it does; a bit. But we are only experimenting here. We also don’t care that much about the in-built knowledge a model might have. We simply care about its ability to understand what we are searching for and summarize a response, given the knowledge we provide it.
Let’s just pick llama3 for now.
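As a quick aside, the model weights need to be pulled locally before first use. A minimal sketch doing that through the ollama Python client (the equivalent of running ollama pull llama3 from the CLI), assuming the Ollama server is already installed and running:
import ollama

ollama.pull("llama3")   # download the model weights if they are not present yet
print(ollama.list())    # list the models available locally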
How to parse the book?
Most likely we will have a PDF version. While arguably Adobe produces the best parser for their own monster of a file format, we said we would use open source.
So here comes pymupdf, which seems the most capable option at the time of writing.
How do we store the book?
Traditionally (does this word even make sense for something introduced what feels like a few years ago?) we would use a vector database. There have, however, been plenty of developments introducing vector support into “classical” databases such as Postgres. Without going deeper into the arguments here, let’s just pick a popular one with a simple API.
Chroma seems perfectly adequate.
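To illustrate that simple API, here is a minimal sketch using the native chromadb client directly (later we will go through the langchain wrapper instead); the collection name and texts are made up:
import chromadb

# In-memory client; the names and texts below are purely illustrative.
client = chromadb.Client()
collection = client.create_collection(name="sample")
collection.add(
    documents=["RAG stands for retrieval-augmented generation."],
    ids=["doc-1"],
)
print(collection.query(query_texts=["What is RAG?"], n_results=1))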
Wait, why vectors?
The search itself does not run over the raw text of the book. It runs over embeddings, which are vectors of numbers that effectively represent (and compress) the information from the book. Yes, this means potentially also images, or other multi-modal data. A complex topic expertly explored in this article.
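To make that concrete: an embedding is just a long list of floats. A quick sketch using the same llama3 model through Ollama:
from langchain_ollama import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="llama3")
# Turn a sentence into its embedding vector and peek at it.
vector = embeddings.embed_query("What is retrieval-augmented generation?")
print(len(vector), vector[:5])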
Which libraries do we need?
Well of course we use Python and langchain to glue everything together. Not only does it have a friendly API, but the docs are good and it also allows freely switching out components.
Implementation
First let’s cover what a basic chatbot looks like with Ollama.
from langchain_core.messages import SystemMessage, HumanMessage
from langchain_ollama import ChatOllama

if __name__ == "__main__":
    query = input("Prompt:\n")

    # Chat model served by the local Ollama instance.
    llm = ChatOllama(
        model="llama3",
        temperature=0,
    )

    result = llm.invoke([
        SystemMessage(content="You are a helpful chat bot."),
        HumanMessage(content=query)
    ])

    print(result.content)

Of course this isn’t really a complex chat; it can simply answer a single prompt and close. But it’s quite easy to set up, isn’t it?
We chose the Ollama chatbot implementation with the llama3 model.
This of course required installing Ollama locally too.
Then we pass a simple system prompt along with the prompt from the user.
But this is it!
Adding retrieval
We need to extend the above with the Ingest and the Retrieve parts of our diagram.
More specifically, we need to:
- fetch relevant segments from our book, and
- extend the system prompt so that our friendly bot can summarize responses.
from langchain_chroma import Chroma
from langchain_ollama import OllamaEmbeddings

embeddings = OllamaEmbeddings(
    model="llama3",
)

# Persistent vector store holding the book chunks.
vector_store = Chroma(
    collection_name="sample",
    embedding_function=embeddings,
    persist_directory="./book_helper_db",
)

retriever = vector_store.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 2},
)

relevant_docs = retriever.invoke(query)
context_documents_str = "\n\n".join(doc.page_content for doc in relevant_docs)

We configure the embeddings, the vector store, and what kind of retrieval we do.
We use llama3 embeddings for now, just to keep it simple, though of course this is an area open to improvements.
Then we define the vector store location and specify that we fetch the top 2 results based on similarity.
The search_type is again an area open to improvements, as one can employ other methods.
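For instance, langchain also supports maximal marginal relevance (MMR) out of the box, which trades pure similarity for some diversity among the results. A sketch of swapping it in (parameter values are arbitrary, not tuned):
# Alternative retriever configuration: MMR instead of plain similarity.
retriever = vector_store.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 2, "fetch_k": 10},  # keep 2 diverse results out of 10 candidates
)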
In the next blog posts, maybe?
Actually, the more one thinks about it, this is perhaps the most important decision to make. If we fail to retrieve the most relevant documents for the user, the whole project makes no sense! But I digress.
The last step is to simply prepare those relevant documents in a format suitable for the LLM.
Which leads us to the prompt.
prompt = f"""
--- CONTEXT ---
{context_documents_str}
--- END CONTEXT ---
Based only on the context above extracted from ingested books, answer the following question.
Be concise.
--- QUESTION ---
{query}
"""
result = llm.invoke([
    SystemMessage(content="You are a helpful book reader that can summarize key concepts from books."),
    HumanMessage(content=prompt)
])

Instead of passing the query directly, we need to extend our prompt with the relevant documents - or chunks, actually - that we just fetched. Then put everything together in a prompt both we and the LLM can understand.
Ingestion
As a data engineer, I’d like to think this is the most important part. The vector store needs to have the contents of the book in a format the LLM can understand, sure, but more importantly a format that allows it to formulate a relevant answer without taking an hour to do so.
Take a second.
What does that mean?
In principle, one could just add the whole book to the context of each question. But is that efficient? Would the model be able to produce relevant answers?
There are plenty of discussions on this topic, though the answer usually boils down to: we need to help the model for it to help us.

This means chunking the data into segments ourselves instead of relying on the LLM to figure out what is relevant out of everything. Many models would likely not allow us to hand over entire documents anyway, due to context window limitations.
Incidentally, this is also an interesting topic: why wouldn’t all models just allow limitless context? The pragmatists reveal themselves again, as the simplest answer is: the bigger the context, the more memory and time required to process it. Not to mention the cost of all those tokens.
But I digress again.
Wait, is there a difference between prompt and context?
In terms of our code, not really. They essentially include one another. A prompt would be the question you ask the LLM or a line of dialogue. The context would be the memory of the LLM, so the knowledge and rules it is expected to work with.
Back to the code.
import pymupdf4llm
from langchain_chroma import Chroma
from langchain_ollama import OllamaEmbeddings
from langchain_text_splitters import MarkdownHeaderTextSplitter

# Split on up to four levels of markdown headings.
headers = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
    ("####", "Header 4"),
]
markdown_splitter = MarkdownHeaderTextSplitter(headers, strip_headers=False)

if __name__ == "__main__":
    embeddings = OllamaEmbeddings(
        model="llama3",
    )

    vector_store = Chroma(
        collection_name="sample",
        embedding_function=embeddings,
        persist_directory="./book_helper_db",
    )

    pdf = pymupdf4llm.to_markdown("data/sample.pdf")  # parse the PDF into markdown
    chunks = markdown_splitter.split_text(pdf)        # one chunk per heading section
    vector_store.add_documents(chunks)                # embed and store the chunks

Let’s start with the main function. It looks pretty similar to the retrieval, as we define the embeddings and the vector store. The last 3 lines are what we care about the most. We parse the PDF, split it into usable chunks, then store them in the db.
The parsing itself is easy, as pymupdf already covers our use case for Gen AI.
We choose to use the functionality to convert to markdown, as that makes splitting a document much easier.
The last step of creating the embeddings and storing them is also a piece of cake.
So the only real decision we have to make is how to split the markdown.
The standard solution is to simply split every X characters with the RecursiveCharacterTextSplitter,
optionally including some overlap to not miss the context around the text chunks.
However, we chose markdown for a reason.
It has clear text separators via headings, so let’s split based on those.
What if chapters are extremely long?
Well here we know they are not that big; at most 1-2 pages.
Though langchain does provide the functionality to chain splitters, so we could also split on text size!
Or maybe simply by paragraphs?
Let’s keep it simple for now.
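Still, for reference, a rough sketch of what chaining the two splitters could look like, reusing the headers and pdf variables from the ingestion script (we do not use this here; chunk size and overlap are arbitrary):
from langchain_text_splitters import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

# First split per heading, then cap each chunk at a maximum character size.
md_splitter = MarkdownHeaderTextSplitter(headers, strip_headers=False)
char_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

header_chunks = md_splitter.split_text(pdf)
chunks = char_splitter.split_documents(header_chunks)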
Summary
Too quick?
We are indeed done! Two simple scripts allow us to ingest a whole book and ask “it” questions. Let’s try out our scripts.
For the example, I used a single chapter of another book. See if you can guess which one ;)

Addendum
Why bother with embeddings when the retrieval part could just be Elasticsearch?
Retrieval is indeed essential to our chatbot here. Those who skimmed through the embeddings article above might already know the answer.
One indeed could “simply” search. Elasticsearch relies on keyword search, which it makes fast largely thanks to its distributed architecture. Diving deeper, though, we see that it also uses a version of BM25 to rank these results. Unlike the dense embeddings typically used with LLMs nowadays, BM25 is sparse, essentially an evolution of the “classical” TF-IDF. The main difference is that a sparse vector helps with exact word matches.
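For the curious, a sparse keyword ranker is easy to try on its own. A hedged sketch using langchain’s BM25Retriever (backed by the rank_bm25 package); the sample texts are made up:
from langchain_community.retrievers import BM25Retriever

texts = [
    "Embeddings are dense vectors learned by a model.",
    "BM25 ranks documents by keyword overlap with the query.",
    "Chunking splits a book into smaller segments.",
]
# Rank the texts by keyword overlap, no embeddings involved.
bm25 = BM25Retriever.from_texts(texts, k=2)
print(bm25.invoke("How does keyword ranking work?"))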
So why not use Elasticsearch? Well we do not have the hardware; that’s one reason. We can’t have the user waiting for ages, wondering if the chatbot crashed trying to find an answer.
And would it actually provide better results?
What if I have a question about 6-7?

Well that’s not in any book, is it?
It does however exemplify an important topic. What if the book does not have the answer? Actually what if it did? Will the embedding dictionary even have that concept and be able to aid us in finding relevant context?
It’s interesting to realize that for our current RAG to work, the embedding model must have encountered the same words or concepts (or parts of words) as the ones a book might introduce.
Looking back at the previous Addendum question, indeed, involving some simple search functionality might help. And this is not just my thought. How to do so though? Do we configure it as a tool? Simply have a “classical” search method alongside embeddings? Stay tuned ;)