Build a RAG Chatbot in 40 Lines of Python
- •Marko Frei demonstrates building a functional RAG chatbot using approximately 40 lines of Python code.
- •The system utilizes sentence-transformers to generate local embeddings and Anthropic's Claude to produce context-aware answers.
- •Tutorial provides a modular pipeline for retrieving private document chunks to reduce hallucination and ensure factual accuracy.
Marko Frei published a tutorial on June 12, 2026, detailing how to build a RAG (Retrieval-Augmented Generation) chatbot in approximately 40 lines of Python. The guide addresses the tendency of language models to provide incorrect information when queried about data they were not trained on, such as internal documentation or niche product details. RAG solves this by fetching relevant document segments and presenting them as context during an inquiry, rather than modifying the model itself.
The implementation process involves five primary stages: breaking documents into manageable chunks, converting those chunks into vectors (embeddings), storing them, performing a similarity search when a user asks a question, and finally prompting the LLM with the retrieved context to generate a precise answer. The tutorial utilizes Python 3.9 or newer along with three core packages: sentence-transformers for local embedding generation, numpy for mathematical operations, and the Anthropic library for model interaction. The author uses a pre-trained model, 'all-MiniLM-L6-v2', which generates 384-dimensional vectors, to handle the embedding process locally on a laptop.
For the demonstration, the author uses 'Nimbus', a hypothetical cloud file storage service, to show how the system retrieves specific information based on cosine similarity scores. The final chatbot code forces the model to rely exclusively on provided context, instructing it to state that it does not know the answer if the information is unavailable. While the current setup uses a basic brute-force loop for retrieval, the author notes that users can scale the system by integrating dedicated vector databases like Chroma or FAISS for larger document volumes. The implementation allows for modular updates, enabling users to swap the Anthropic client for other providers such as OpenAI or local models running via Ollama.