PageIndex: The RAG Framework That Threw Embeddings in the Trash (and Achieved 98.7% Accuracy)

If you’ve ever built a RAG pipeline, you know the frustration: you slice a PDF into 500-token chunks, load them into your vector database, and cross your fingers that the embedding model will surface the chunk that answers the question. Often it doesn’t, and not because the LLM is bad: the retrieval step destroyed the document’s structure before the LLM could reason about it.

PageIndex, an open source framework from VectifyAI with over 29.2K stars on GitHub, starts from an uncomfortable observation: semantic similarity is not the same as relevance.


The Problem with Chunking

Traditional RAG makes two bets: that documents should be cut into chunks, and that vector similarity will surface the correct ones. For casual Q&A about blog posts, this works. For professional documents — financial reports, legal contracts, technical specs — it usually breaks down.

A number in a table cell means nothing without its column header. A footnote referencing Section 4.2 is useless if Section 4.2 ended up in another chunk. The pipeline strips the document of its hierarchical structure — the same structure that makes it readable — and then asks the LLM to reason about the scraps.
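
To make the failure concrete, here is a minimal sketch of fixed-size chunking. It splits on whitespace words rather than model tokens, and the document text, the chunk size, and the helper name chunk_words are all invented for illustration. Real pipelines differ in detail, but the effect is the same.

```python
def chunk_words(text: str, size: int) -> list[str]:
    """Split text into consecutive chunks of `size` whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


# Made-up document: a table header, some filler prose, then the row it describes.
doc = (
    "Table 3: Revenue (USD millions) by fiscal year. "
    + "Filler sentence about methodology. " * 5
    + "FY2023 4120 FY2024 4675"
)

chunks = chunk_words(doc, size=12)

# Find which chunk holds the header and which holds the value.
header_chunk = next(i for i, c in enumerate(chunks) if "USD millions" in c)
value_chunk = next(i for i, c in enumerate(chunks) if "4675" in c)

print(f"header in chunk {header_chunk}, value in chunk {value_chunk}")
```

The header and the figure it labels end up in different chunks, so a retriever that returns only the chunk containing 4675 hands the LLM a number with no units and no column name.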

VectifyAI calls this the “garbage in, garbage out” trap. PageIndex avoids it entirely by ditching both chunking and embeddings.


How PageIndex Works

Instead of a vector database, PageIndex builds a hierarchical tree-structured index from the document. Think of it as a smart table of contents: each node has a title, a summary, and a page range. The structure reflects how the document is actually organized — chapters, sections, subsections, tables.
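
A node like the one described above could be modeled roughly as follows. This is a sketch only: the class name Node, the field names, and the sample report are assumptions made for illustration, not PageIndex's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One entry in a hypothetical tree index: title, summary, page range."""
    title: str
    summary: str
    start_page: int
    end_page: int
    children: list["Node"] = field(default_factory=list)


def outline(node: Node, depth: int = 0) -> list[str]:
    """Render the tree as indented table-of-contents lines."""
    lines = [f"{'  ' * depth}{node.title} (pp. {node.start_page}-{node.end_page})"]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


# Invented example document, not taken from any real filing.
root = Node("Annual Report", "Full-year results and disclosures", 1, 120, [
    Node("Financial Statements", "Income statement and balance sheet", 40, 80, [
        Node("Income Statement", "Revenue and net income by year", 40, 55),
        Node("Balance Sheet", "Assets, liabilities, and equity", 56, 80),
    ]),
    Node("Risk Factors", "Market and legal risks", 81, 120),
])

print("\n".join(outline(root)))
```

Because every node carries a page range, a retrieval step that selects a node can hand the LLM the full section rather than an arbitrary slice of it.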

When a query arrives, an LLM reads the tree and reasons about which nodes are most likely to contain the answer. It can follow cross-references, recognize when a multi-part question requires searching across two different sections, and return a complete trace of the reasoning showing exactly which nodes it visited.
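
The retrieval loop can be sketched as a walk down the tree with a pluggable decision step. Everything here is an assumption for illustration: the names Node, search, and keyword_selector are hypothetical, and the keyword matcher is only a stand-in for the LLM call that PageIndex would make at that point. It is not how the library itself is implemented.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Node:
    title: str
    summary: str
    pages: tuple[int, int]
    children: list["Node"] = field(default_factory=list)


# Decision step: given the query and candidate nodes, return the ones to open.
Selector = Callable[[str, list[Node]], list[Node]]


def keyword_selector(query: str, candidates: list[Node]) -> list[Node]:
    """Stand-in for the LLM: keep nodes that share a word with the query."""
    q = set(query.lower().split())
    return [n for n in candidates
            if q & set(f"{n.title} {n.summary}".lower().split())]


def search(root: Node, query: str, select: Selector) -> tuple[list[Node], list[str]]:
    """Walk the tree breadth-first; return leaf hits and a trace of visited titles."""
    hits: list[Node] = []
    trace: list[str] = []
    frontier = [root]
    while frontier:
        node = frontier.pop(0)
        trace.append(node.title)
        if not node.children:
            hits.append(node)  # a leaf we chose to open is a retrieval result
            continue
        frontier.extend(select(query, node.children))
    return hits, trace


# Invented example tree.
root = Node("10-K", "annual filing", (1, 120), [
    Node("Financial Statements", "income statement balance sheet revenue", (40, 80), [
        Node("Income Statement", "revenue net income by year", (40, 55)),
        Node("Balance Sheet", "assets liabilities equity", (56, 80)),
    ]),
    Node("Risk Factors", "market legal risks", (81, 120)),
])

hits, trace = search(root, "total revenue versus legal risks", keyword_selector)
print([n.title for n in hits])
print(trace)
```

The two-part query pulls results from two different branches, and the trace records every node that was opened along the way. Swapping keyword_selector for a function that prompts an LLM with the candidate titles and summaries would keep the same loop and the same trace.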

Mingtian Zhang, co-founder of VectifyAI, describes it as “AlphaGo for document retrieval”: the same tree-search logic that powered game-playing AIs now navigates document hierarchies instead of board states. Some reviewers consider the AlphaGo comparison overstated, since in practice it is an LLM reasoning over a JSON tree, not Monte Carlo Tree Search with trained value networks. The approach works well regardless of the framing.

The list of dependencies reflects the simplicity of the approach: OpenAI SDK, PyMuPDF, tiktoken. No PyTorch, no FAISS, no vector database. The entire system comes in at around 2,500 lines of Python.


The Benchmark That Turned Heads

FinanceBench is a financial Q&A benchmark built on SEC filings and earnings reports. It represents one of the hardest retrieval problems seen in production, requiring multi-step reasoning, cross-references between sections, and exact numbers.

System                      Accuracy on FinanceBench
GPT-4o alone                ~31%
Perplexity                  ~45%
Traditional vector RAG      ~50–60%
PageIndex (Mafin 2.5)       98.7%

That gap of roughly 40 to 50 points over traditional vector RAG is enough for any developer working on document pipelines to take it seriously.


Where It Really Shines

PageIndex is designed for structured, professional documents where hierarchy matters:

  • Financial reports and SEC filings
  • Legal contracts and regulatory documentation
  • Technical manuals and academic papers
  • Any document where a table of contents exists for a reason

The traceability bonus is real: instead of a black-box cosine similarity score, you get a complete trace of the reasoning. In strict compliance environments — finance, legal, healthcare — being able to show why the system retrieved a particular section isn’t a nice-to-have: it’s a requirement.


Where the Trade-offs Are Real

It’s important to be direct about the costs:

Indexing is more expensive. Building the tree requires LLM calls per document. For documents you’ll only query once or twice, the overhead might not be worth it.

Latency is different, not necessarily slower. The co-founder explains that because retrieval happens inline with generation (rather than as a blocking pre-step), Time to First Token can be comparable to a standard LLM call. But total token usage per query is higher than vector retrieval.

It doesn’t replace semantic search over large collections. Vector databases still win for fuzzy queries across thousands of documents. PageIndex is a precision tool for deep retrieval within individual documents.

Gap between cloud and self-hosted. The open source version uses standard PDF parsing. The cloud service adds improved OCR and better handling of complex layouts. For documents with heavy visual structure — scanned PDFs, complex financial tables — this matters.
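
The indexing overhead mentioned above can be reasoned about with simple amortization: the one-time cost of building the tree is spread across however many queries the document receives. The sketch below uses placeholder token counts chosen purely for illustration. They are not measurements of PageIndex, and the function name is hypothetical.

```python
def amortized_tokens_per_query(index_tokens: int, query_tokens: int, n_queries: int) -> float:
    """Average tokens spent per query once the one-time indexing cost is spread out."""
    if n_queries < 1:
        raise ValueError("n_queries must be at least 1")
    return index_tokens / n_queries + query_tokens


# Placeholder figures, not benchmarks: 200k tokens to index, 5k tokens per query.
INDEX_TOKENS = 200_000
QUERY_TOKENS = 5_000

for n in (1, 2, 100):
    cost = amortized_tokens_per_query(INDEX_TOKENS, QUERY_TOKENS, n)
    print(f"{n:>3} queries: {cost:,.0f} tokens per query")
```

With these made-up numbers, a document queried once costs 205,000 tokens per query, while one queried 100 times costs 7,000. Plugging in your own measured figures shows quickly whether a given document is worth indexing.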


How to Get Started

pip install pageindex

The repo includes a complete agentic RAG example using the OpenAI Agents SDK, and there’s an MCP server for direct integration with agents — compatible with Claude Code or any tool supporting MCP.

For teams already running RAG pipelines on financial or legal documents and hitting precision ceilings, PageIndex deserves a serious evaluation. For general-purpose Q&A over large mixed collections, your current vector setup is probably still the right call.

→ GitHub: VectifyAI/PageIndex
→ Cloud platform: pageindex.ai


Are you building RAG pipelines at work? What kinds of documents give you the most headaches with traditional approaches?