Scaling Semantic Discovery: Indexing 78,000 Public Domain Books with a Self-Hosted Qwen RAG Pipeline
A small team has deployed a fully self-hosted, open-source Retrieval-Augmented Generation (RAG) system that indexes 78,000 public domain books from Project Gutenberg. This article covers the ingestion pipeline, why intent-based retrieval is harder than traditional metadata filtering, and the architectural choices made to minimize hallucinations using open-weight models such as Qwen.
The Challenge of Intent-Based Retrieval
Traditional library and book discovery often relies on rigid metadata filters, such as genre tags, author matching, or purchase history. This approach fails on nuanced, qualitative queries. The team highlighted that true semantic discovery requires "intent matching," which goes beyond lexical overlap. A query like "something hopeful but not naive" cannot be satisfied by genre filtering alone; it requires matching narrative structure, emotional arcs, and thematic patterns.
Translating abstract human intent into a vector representation that can be searched is the core technical hurdle the project addresses. Its goal is a semantic discovery layer over the entirety of Project Gutenberg's library.
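The gap between lexical overlap and intent matching can be illustrated with a toy comparison. This is only a sketch: the passage text and the three-dimensional vectors are invented for illustration and stand in for the output of a real embedding model, which the article does not name.

```python
import math


def jaccard(a: str, b: str) -> float:
    """Lexical overlap: shared words divided by all distinct words."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


query = "something hopeful but not naive"
passage = "after years of loss she rebuilds her life with clear eyes"

# The two strings share no words, so a purely lexical match scores zero.
lexical_score = jaccard(query, passage)

# Hypothetical vectors: placeholders for what an embedding model would return.
query_vec = [0.8, 0.5, 0.1]
passage_vec = [0.7, 0.6, 0.2]
semantic_score = cosine(query_vec, passage_vec)

print(f"lexical={lexical_score:.2f} semantic={semantic_score:.2f}")
```

The point is only that two texts can be close in vector space while sharing no vocabulary; how close they land in practice depends entirely on the embedding model used.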
Architectural Stack and Infrastructure
The entire infrastructure is built on principles of open-source auditability and self-sufficiency, and runs on the team's own GPU hardware located in Gijón, northern Spain. The system operates without relying on cloud services (e.g., AWS) or external proprietary APIs (e.g., OpenAI).
Open-Weight Model Deployment
The RAG pipeline leverages a diverse set of open-weight Qwen models for various tasks within the ingestion and generation phases. The deployed models include:
- Qwen3.5-2B
- Qwen2.5-7B-Instruct
- Qwen3.5-9B
- Qwen3-8B-FP8
- Qwen3.6-27B-FP8
- Qwen3-30B-A3B-Instruct-2507-FP8
Deep Dive into the Ingestion Pipeline
The process of incorporating 78,000 documents into the vector store is highly complex, consisting of five sequential phases: fetching, transforming, enriching, storing, and post-processing. The most critical phase for achieving high-fidelity retrieval is the contextual enrichment step.
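The five phases can be pictured as a chain of functions, each taking the output of the previous one. The sketch below is an assumption about shape only: the `Batch` type, the `make_phase` helper, and the phase bodies are hypothetical placeholders, not the team's code.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Batch:
    """State handed from one phase to the next (hypothetical shape)."""
    book_ids: list[int]
    log: list[str] = field(default_factory=list)


Phase = Callable[[Batch], Batch]


def make_phase(name: str) -> Phase:
    """Build a placeholder phase that only records that it ran."""
    def run(batch: Batch) -> Batch:
        batch.log.append(name)
        return batch
    return run


# Phase names follow the order given in the article.
PHASE_NAMES = ("fetch", "transform", "enrich", "store", "post_process")
PHASES: list[Phase] = [make_phase(name) for name in PHASE_NAMES]


def ingest(batch: Batch) -> Batch:
    """Run every phase in sequence; a failure in one phase stops the chain."""
    for phase in PHASES:
        batch = phase(batch)
    return batch


result = ingest(Batch(book_ids=[1, 2]))
print(result.log)
```

Keeping each phase behind the same signature makes it straightforward to rerun a single phase, such as enrichment, without repeating the fetch.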
LLM-Powered Contextual Enrichment
After initial token-splitting, every document chunk undergoes LLM-powered contextual enrichment. This step matters because it gives even a small, seemingly isolated chunk of text a precise summary of where it sits in the broader document (e.g., which character, what moment, which book it belongs to). That added context makes the chunk considerably easier to retrieve for a relevant query.
The team noted that this methodology draws inspiration from Anthropic’s published contextual retrieval research, which reported a reduction in retrieval failures of more than 60%. Although the underlying research is open, the team handled the implementation and inference entirely on its own.
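A minimal sketch of the enrichment step might look like the following. Everything here is an assumption made for illustration: whitespace splitting stands in for a real tokenizer, the prompt wording is invented, and `llm` is any callable that maps a prompt to text (in the team's setup this would be one of the self-hosted Qwen models, but the article does not say which).

```python
from typing import Callable


def split_tokens(text: str, size: int) -> list[str]:
    """Split on whitespace tokens; a stand-in for a real tokenizer."""
    tokens = text.split()
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), size)]


# Hypothetical prompt; the team's actual prompt is not published in the article.
PROMPT = (
    "Book: {title}\n\n"
    "Document:\n{document}\n\n"
    "Chunk:\n{chunk}\n\n"
    "In one or two sentences, state where this chunk sits in the book "
    "(which character, what moment) so that search can find it."
)


def enrich(title: str, document: str, size: int,
           llm: Callable[[str], str]) -> list[str]:
    """Prepend an LLM-written context summary to every chunk."""
    enriched = []
    for chunk in split_tokens(document, size):
        prompt = PROMPT.format(title=title, document=document, chunk=chunk)
        context = llm(prompt).strip()
        enriched.append(f"{context}\n\n{chunk}")
    return enriched


def fake_llm(prompt: str) -> str:
    """Placeholder model so the sketch runs without a GPU."""
    return "[context placeholder]"


chunks = enrich("Example Title", "one two three four five", size=2, llm=fake_llm)
print(len(chunks))
```

The enriched string, not the bare chunk, is what would be embedded and stored. Passing the full document in every prompt is the simplest option but costs one long prompt per chunk, so a real pipeline would likely rely on prompt caching or a shortened excerpt.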
Mitigating Hallucination in RAG Systems
In the context of advanced RAG systems, preventing hallucinations is a primary engineering challenge. The team implemented three specific techniques that proved effective in maintaining output trustworthiness:
1. Citations as the Sole Verification Mechanism
Every generated response must surface the specific source passage it drew from. The premise is that if the cited passage does not support the claim being made, the output cannot be trusted. The team describes this as the only honest check that does not require manually re-reading the entire source material.
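Part of this check can be automated. The sketch below, with a hypothetical `Citation` type and function names, only confirms that a quoted passage really appears in the retrieved chunk it points to. It does not judge whether the passage supports the claim; that judgment is still left to the reader.

```python
from dataclasses import dataclass


@dataclass
class Citation:
    """A quote attributed to one retrieved chunk (hypothetical shape)."""
    chunk_id: str
    quote: str


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks do not matter."""
    return " ".join(text.lower().split())


def citation_is_grounded(citation: Citation, retrieved: dict[str, str]) -> bool:
    """True only if the quote occurs verbatim in the chunk it cites."""
    source = retrieved.get(citation.chunk_id)
    if source is None or not citation.quote.strip():
        return False
    return normalize(citation.quote) in normalize(source)


retrieved = {"c1": "It was the best of times,\nit was the worst of times."}

good = Citation("c1", "it was the worst of times")
bad = Citation("c1", "it was the age of reason")       # not in the chunk
missing = Citation("c9", "it was the best of times")   # unknown chunk id

for c in (good, bad, missing):
    print(c.chunk_id, citation_is_grounded(c, retrieved))
```

A response whose citations fail this test can be rejected or regenerated before it is shown, which leaves only the harder question of relevance to human inspection.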
2. Pre-Generation Reranking
Unlike lightweight RAG implementations that often bypass this step, this system scores and ranks retrieved chunks for relevance *before* they reach