What is RAG?
RAG (Retrieval-Augmented Generation) is a technique that combines text generation by an LLM with information retrieval from a knowledge base. Unlike a standard language model, which relies solely on its pre-trained knowledge, RAG enriches responses with recent, domain-specific context retrieved at query time.
This approach revolutionizes the use of LLMs (Large Language Models) by allowing them to access external information, solving several major limitations:
Access to recent information (classical models have a cutoff date)
Reduction of hallucinations (invented responses)
Personalization with your own data
Traceability of information sources
How does RAG work?
The RAG process breaks down into three main steps:
1. Indexing (Ingestion)
Documents are split into chunks, converted into embeddings (numerical vectors), and stored in a vector database.
2. Retrieval
When a question is asked, the system converts the query into an embedding and searches the vector database for the most similar chunks (a minimal sketch of this follows the three steps).
3. Generation
The LLM generates a response using the retrieved chunks as context, in addition to its pre-trained knowledge.
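To make this concrete before introducing any framework, here is a minimal sketch of steps 1 and 2 using toy, made-up embedding vectors: retrieval boils down to ranking chunks by cosine similarity with the query embedding.

import numpy as np

# Hypothetical toy data: each chunk has already been converted into an
# embedding vector by some embedding model (the values here are made up).
chunk_texts = ["Refunds are accepted within 30 days.",
               "Shipping takes 3-5 business days.",
               "Support is available by email."]
chunk_vectors = np.array([[0.9, 0.1, 0.0],
                          [0.1, 0.8, 0.1],
                          [0.0, 0.2, 0.9]])

query_vector = np.array([0.85, 0.15, 0.05])  # embedding of "refund policy?"

# Cosine similarity between the query and every chunk
scores = chunk_vectors @ query_vector / (
    np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(query_vector)
)

# Keep the most similar chunk(s) and pass them to the LLM as context
best = int(np.argmax(scores))
print(chunk_texts[best])  # -> "Refunds are accepted within 30 days."

In production, the vector database performs this similarity search at scale instead of a hand-rolled NumPy computation.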
Setup with Python and LangChain
LangChain is a Python framework that greatly simplifies creating RAG applications. Here's how to set up a complete system.
Installing dependencies
Start by installing the necessary libraries:
pip install langchain langchain-community langchain-openai chromadb sentence-transformers
Complete code example
Here is a complete RAG system example with LangChain:
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, OpenAI
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
# 1. Load documents
loader = TextLoader("documents.txt")
documents = loader.load()

# 2. Split into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
chunks = text_splitter.split_documents(documents)

# 3. Create embeddings and vector database
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings
)

# 4. Create RAG chain
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    retriever=vectorstore.as_retriever(),
    return_source_documents=True
)

# 5. Ask a question
response = qa_chain.invoke({"query": "What is the refund policy?"})
print(response["result"])
Using open-source models
To avoid OpenAI API costs, you can use local models with Ollama:
from langchain_community.llms import Ollama
from langchain_community.embeddings import OllamaEmbeddings

# Use Ollama for embeddings and LLM
# (pull the models first: ollama pull nomic-embed-text && ollama pull llama2)
embeddings = OllamaEmbeddings(model="nomic-embed-text")
llm = Ollama(model="llama2")
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings
)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever()
)
Integration with OpenWebUI
OpenWebUI is an open-source web interface for interacting with LLMs. Here's how to integrate it with a RAG system:
OpenWebUI configuration with RAG
Install OpenWebUI:
docker run -d -p 3000:8080 -v open-webui:/app/backend/data --add-host=host.docker.internal:host-gateway ghcr.io/open-webui/open-webui:main
Configure the RAG connector in settings
Upload your documents via the web interface
The system automatically indexes and enables RAG queries
Custom API for OpenWebUI
You can create a Flask/FastAPI API to connect your RAG system to OpenWebUI:
from flask import Flask, request, jsonify
from langchain.chains import RetrievalQA

app = Flask(__name__)
qa_chain = None  # Initialized with your RAG system

@app.route("/api/rag/query", methods=["POST"])
def query_rag():
    data = request.json
    query = data.get("query")
    response = qa_chain.invoke({"query": query})
    return jsonify({
        "answer": response["result"],
        "sources": [doc.page_content for doc in response["source_documents"]]
    })
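# Example call once the server is running (assuming qa_chain has been
# replaced by the RetrievalQA chain built in the previous section):
#   curl -X POST http://localhost:5000/api/rag/query \
#        -H "Content-Type: application/json" \
#        -d '{"query": "What is the refund policy?"}'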
if __name__ == "__main__":
    app.run(port=5000)
Best practices and optimizations
Smart chunking
Chunk size is crucial: too small and you lose context, too large and you introduce noise. Recommendations (a splitter configuration sketch follows this list):
Chunk size: 500-1500 tokens depending on content type
Overlap: 10-20% to preserve context between chunks
Use smart separators (paragraphs, sections)
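As an illustration of these recommendations (the separator list and the token-based sizing via tiktoken are example choices, not the only valid configuration), the splitter from the earlier example could be set up like this:

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Token-based sizing (requires the tiktoken package): roughly 1000-token
# chunks with ~15% overlap, splitting on paragraphs, then lines, then
# sentences. `documents` is the list loaded in the earlier example.
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=1000,
    chunk_overlap=150,
    separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = text_splitter.split_documents(documents)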
Vector database choice
Several options are available (a swap-in example follows this list):
Chroma: Simple and lightweight, ideal for development
Pinecone: High-performance cloud service for production
Weaviate: Open-source with many features
Qdrant: Fast and easy to deploy
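With LangChain, changing the store usually only changes the construction call. As a sketch, assuming the qdrant-client package and LangChain's community Qdrant wrapper (the collection name is arbitrary), the Chroma call from the earlier example could be swapped for:

from langchain_community.vectorstores import Qdrant

# In-memory Qdrant instance for local experiments; point location (or url)
# at a running Qdrant server for production. "rag_demo" is an arbitrary name.
vectorstore = Qdrant.from_documents(
    documents=chunks,
    embedding=embeddings,
    location=":memory:",
    collection_name="rag_demo"
)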
Improving retrieval
To improve response quality (see the hybrid retrieval sketch after this list):
Use hybrid retrieval (vector + BM25)
Reorder results with a re-ranker
Filter by metadata (date, source, type)
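As a sketch of the hybrid approach, LangChain's EnsembleRetriever can blend a BM25 keyword retriever with the vector retriever built earlier; the weights and k values below are illustrative assumptions, and BM25Retriever needs the rank_bm25 package:

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# Keyword (BM25) retriever built from the same chunks as the vector store
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 4

# Dense retriever from the vector store built earlier
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# Merge both result lists; the weights are illustrative starting points
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.4, 0.6]
)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,  # the LLM defined earlier (OpenAI() or Ollama)
    retriever=hybrid_retriever
)

Giving slightly more weight to the vector retriever is a common starting point; tune the weights against your own evaluation questions.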
RAG use cases
Conversational assistants with business knowledge
Customer support chatbots with documentation
Q&A systems on internal documents
Semantic search in knowledge bases
Need help with your RAG project?
Setting up a high-performing RAG system requires deep technical expertise. Our AI-specialized agency can assist you with:
Design and architecture of your RAG system
Integration with your existing systems
Performance and response quality optimization
Production deployment with monitoring
Discover our AI agency and our custom solutions to transform your data into an intelligent assistant.
