Build A Local Q&A System With LLMs & Python
Hey guys! Ever thought about building your own question-answering system that can understand documents and give you precise answers? It sounds like something out of a sci-fi movie, but with the power of Large Language Models (LLMs) and Python, it's totally achievable. In this article, we'll dive into creating a local Q&A system that can ingest PDF documents, chunk them into manageable pieces, embed those chunks into a vector database, and then use an LLM to answer your questions based on the content. Plus, we'll make sure the answers come with citations so you know exactly where the information came from. Let's get started!
Why Build a Local Q&A System?
Before we jump into the how-to, let's talk about the why. Why would you want to build a local question-answering system when there are so many options available online? Here are a few compelling reasons:
- Privacy: When you're dealing with sensitive information, you might not want to send your documents to a third-party service. A local system keeps your data on your machine.
- Cost: Cloud-based LLM services can get expensive, especially if you're processing a lot of documents. A local system, especially one using open-source LLMs, can save you money.
- Customization: Building your own system gives you complete control over the process. You can tweak every aspect, from the chunking strategy to the LLM prompting, to perfectly fit your needs.
- Offline Access: A local system works even without an internet connection, which is a huge plus if you need to access information in remote locations or during internet outages.
The Key Components
So, what goes into building a local Q&A system? Here's a breakdown of the main steps and components:
- Ingest a PDF: We'll start by loading the PDF document you want to query. Python libraries like pypdf (the successor to PyPDF2) or PDFMiner can handle this; we'll use pypdf below.
- Chunk the Text: LLMs have input length limits, so we can't just feed them the entire document. We'll break it into smaller chunks, usually paragraphs or sections, so the model gets enough context to answer accurately without exceeding its limits. Chunk size and overlap have a real impact on quality, so they're worth tuning.
- Embed the Chunks: To make the chunks searchable, we'll convert them into numerical vectors (embeddings) that capture the semantic meaning of the text, using an embedding model like those from Sentence Transformers. The idea is that similar pieces of text end up with vectors that are close together in the embedding space, which is what makes retrieval by meaning possible (there's a tiny sketch of this right after this list).
- Store in Qdrant: We'll store the embeddings in a vector database like Qdrant, which is built for fast similarity search. Its approximate nearest neighbor search finds the most relevant chunks without exhaustively comparing every vector, so lookups stay fast even for large document collections.
- Ask a Question: Now comes the fun part! We'll embed your question with the same model we used for the chunks, so the query lives in the same semantic space as the document and can be matched by meaning rather than just keywords.
- Retrieve Top-k Chunks: We'll ask Qdrant for the top-k chunks most similar to the question embedding; these are the ones most likely to contain the answer. Retrieval narrows down what the LLM has to read, and the choice of k balances how much context you provide against how much text the model has to process.
- Prompt the LLM: We'll build a prompt containing the question and the retrieved chunks and feed it to an LLM, which uses the provided context to generate an answer. Prompt wording matters here: a clear prompt with explicit instructions produces noticeably better answers than a vague one.
- Return an Answer with Citations: Finally, we'll return the LLM's answer along with citations indicating which file and page the information came from, so you can verify the answer and trace it back to its source.
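To make the embedding idea a bit more concrete, here's a tiny standalone sketch (separate from the full script later in this article) showing that semantically similar sentences really do land close together. It uses the same all-MiniLM-L6-v2 model we'll use later; the example sentences are made up purely for illustration.

```python
# Minimal demo: similar sentences get similar vectors.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The invoice is due on March 31.",                            # (1)
    "Payment for the invoice must arrive by the end of March.",   # (2) similar to (1)
    "The cafeteria serves lunch at noon.",                        # (3) unrelated
]
embeddings = model.encode(sentences)

# Cosine similarity matrix: entry [0][1] should be clearly higher than [0][2] or [1][2].
print(util.cos_sim(embeddings, embeddings))
```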
Building the Prototype: A Python Script
Let's break down how you can build this prototype using a single Python script or Jupyter Notebook. This approach makes it easy to run locally and experiment with different parameters.
1. Install the Necessary Libraries
First, you'll need to install the required Python libraries. Open your terminal and run:
```bash
pip install pypdf sentence-transformers qdrant-client transformers torch
```
Here's what each library does:
- `pypdf`: For reading PDF files.
- `sentence-transformers`: For generating text embeddings.
- `qdrant-client`: For interacting with the Qdrant vector database.
- `transformers`: For using pre-trained LLMs.
- `torch`: PyTorch, a deep learning framework required by many transformers models.
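Before moving on, it's worth a quick sanity check that everything installed cleanly. This optional snippet simply reports the installed version of each package (and will raise an error if one is missing):

```python
# Optional sanity check: confirm each package is installed and print its version.
from importlib.metadata import version

for pkg in ("pypdf", "sentence-transformers", "qdrant-client", "transformers", "torch"):
    print(pkg, version(pkg))
```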
2. Set Up Qdrant
You'll need a Qdrant instance to store your embeddings. The easiest way to get started is to run Qdrant in Docker. If you don't have Docker installed, you can download it from Docker's website. Once Docker is running, run this command in your terminal:
```bash
docker run -d -p 6333:6333 -p 6334:6334 qdrant/qdrant
```
This will start a Qdrant instance running locally on ports 6333 (for the API) and 6334 (for the gRPC interface).
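Before wiring Qdrant into the script, you can quickly confirm the container is reachable from Python. This is just an optional check against the default port from the command above; a fresh instance should report an empty list of collections.

```python
# Quick connectivity check against the Dockerized Qdrant instance (REST API on port 6333).
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")
print(client.get_collections())  # a fresh instance has no collections yet
```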
3. The Python Script
Now, let's create the Python script. Here's a basic outline of what the script will do:
- Load the PDF: Read the PDF file using `pypdf`.
- Chunk the Text: Split the text into chunks.
- Embed the Chunks: Generate embeddings for each chunk using `sentence-transformers`.
- Store in Qdrant: Store the embeddings in a Qdrant collection.
- Ask a Question: Take a question as input.
- Embed the Question: Generate an embedding for the question.
- Retrieve Top-k Chunks: Search Qdrant for the most similar chunks.
- Prompt the LLM: Create a prompt with the question and chunks, and send it to the LLM.
- Return an Answer: Print the LLM's answer with citations.
Here's a code snippet to get you started:
```python
import pypdf
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient, models
from transformers import pipeline


# 1. Load the PDF
def load_pdf(pdf_path):
    with open(pdf_path, "rb") as f:
        reader = pypdf.PdfReader(f)
        # extract_text() can return None for image-only pages, so fall back to ""
        text = "".join(page.extract_text() or "" for page in reader.pages)
    return text


# 2. Chunk the Text
def chunk_text(text, chunk_size=500, chunk_overlap=50):
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        start += chunk_size - chunk_overlap
    return chunks


# 3. Embed the Chunks
def embed_chunks(chunks, model):
    return model.encode(chunks)


# 4. Store in Qdrant
def store_in_qdrant(chunks, embeddings, qdrant_client, collection_name):
    qdrant_client.recreate_collection(
        collection_name=collection_name,
        vectors_config=models.VectorParams(
            size=embeddings.shape[1], distance=models.Distance.COSINE
        ),
    )
    qdrant_client.upsert(
        collection_name=collection_name,
        points=models.Batch(
            ids=list(range(len(chunks))),
            vectors=embeddings.tolist(),
            payloads=[{"text": chunk} for chunk in chunks],
        ),
    )


# 5. Ask a Question, 6. Embed the Question, 7. Retrieve Top-k Chunks
def retrieve_chunks(question, model, qdrant_client, collection_name, top_k=5):
    question_embedding = model.encode(question)
    hits = qdrant_client.search(
        collection_name=collection_name,
        query_vector=question_embedding.tolist(),
        limit=top_k,
    )
    return hits


# 8. Prompt the LLM, 9. Return an Answer
def generate_answer(question, hits, llm):
    context = "\n\n".join(hit.payload["text"] for hit in hits)
    prompt = f"""Answer the question based on the context below.

Context:
{context}

Question: {question}

Answer:"""
    return llm(prompt, max_new_tokens=256)[0]["generated_text"]


def main():
    pdf_path = "your_document.pdf"  # Replace with your PDF file
    collection_name = "my_qa_collection"

    # Initialize models and clients
    embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
    qdrant_client = QdrantClient(":memory:")  # in-memory for quick tests; use QdrantClient(url="http://localhost:6333") for the Docker instance
    # flan-t5 is a sequence-to-sequence model, so it needs the text2text-generation pipeline
    llm = pipeline("text2text-generation", model="google/flan-t5-base")

    # 1. Load the PDF
    text = load_pdf(pdf_path)
    # 2. Chunk the Text
    chunks = chunk_text(text)
    # 3. Embed the Chunks
    embeddings = embed_chunks(chunks, embedding_model)
    # 4. Store in Qdrant
    store_in_qdrant(chunks, embeddings, qdrant_client, collection_name)

    # 5. Ask a Question
    question = input("Ask a question: ")
    # 6. & 7. Embed the Question and Retrieve Top-k Chunks
    hits = retrieve_chunks(question, embedding_model, qdrant_client, collection_name)
    # 8. & 9. Prompt the LLM and Return an Answer
    answer = generate_answer(question, hits, llm)
    print("Answer:", answer)


if __name__ == "__main__":
    main()
```
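One thing the snippet above leaves out is the citation part: it stores only the raw chunk text, so the answer can't tell you which file and page it came from. Here's one way you might extend it, sketched under the assumption that you chunk each page separately and carry the page number through the Qdrant payload. The helper names (`load_pdf_pages`, `chunk_pages`) are illustrative, not from any library.

```python
# Sketch of page-aware ingestion so answers can cite file and page.
import pypdf


def load_pdf_pages(pdf_path):
    """Return a list of (page_number, page_text) tuples instead of one big string."""
    with open(pdf_path, "rb") as f:
        reader = pypdf.PdfReader(f)
        return [(i + 1, page.extract_text() or "") for i, page in enumerate(reader.pages)]


def chunk_pages(pages, chunk_size=500, chunk_overlap=50):
    """Chunk each page separately and remember which page each chunk came from."""
    chunks = []
    for page_number, text in pages:
        start = 0
        while start < len(text):
            end = min(start + chunk_size, len(text))
            chunks.append({"text": text[start:end], "page": page_number})
            start += chunk_size - chunk_overlap
    return chunks


# When upserting, store the page number and file name in the payload:
#   payloads=[{"text": c["text"], "page": c["page"], "source": pdf_path} for c in chunks]
# Then, after retrieval, print citations alongside the answer:
#   for hit in hits:
#       print(f'[{hit.payload["source"]}, page {hit.payload["page"]}] score={hit.score:.2f}')
```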
4. Putting It All Together
- **Replace `your_document.pdf`**: Set `pdf_path` to the PDF file you actually want to query.