Applying RAG for Working with Wazuh Documentation: A Step-by-Step Guide (Part 2)

Preparing for Code Development

For local RAG development, you will need the following:

  • Ollama
  • Python v3.9+
  • Basic Python knowledge
  • Wazuh documentation in PDF format

Running and Configuring Ollama

  1. Install Ollama
  2. Pull the necessary models: llama3.2 and nomic-embed-text (see the commands below).

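Both models can be pulled with the Ollama CLI: nomic-embed-text will generate the embeddings, and llama3.2 will answer the questions:

ollama pull llama3.2
ollama pull nomic-embed-text
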
Developing a Mechanism for Loading PDF Documentation

For development, we will use the following tools:

  • LangChain - for creating data processing chains.
  • Ollama - for running and configuring models.
  • Python - as the main programming language.
  • ChromaDB - as a vector store.

Install the dependencies. Create a file requirements.txt with the following contents:

chromadb==0.6.3
unstructured==0.16.14
langchain==0.3.18
langchain-text-splitters==0.3.6
unstructured[all-docs]
langchain-community==0.3.14
langchain-ollama==0.2.2
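
Then install the dependencies:

pip install -r requirements.txt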

After installation, let’s create a mechanism for loading PDF documentation.

Create a Python script upload.py and add the following code:

Add imports at the beginning of your script:

import argparse
import os

from langchain_community.document_loaders import UnstructuredPDFLoader
from langchain_ollama import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma

Now create a function for loading PDF documentation:

def upload(document_path, model_name, collection_name, ollama_base_url):
    # Directory of the current script; ChromaDB data is persisted next to it in ./data
    current_path = os.path.dirname(os.path.realpath(__file__))
    chroma_persistent_directory = current_path + "/data"
    if not os.path.exists(chroma_persistent_directory):
        os.makedirs(chroma_persistent_directory, exist_ok=True)
    # Load the PDF and extract its text
    loader = UnstructuredPDFLoader(file_path=document_path)
    data = loader.load()
    # Split the text into overlapping chunks
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=7500, chunk_overlap=100)
    chunks = text_splitter.split_documents(data)
    # Embed the chunks with the Ollama model and persist them in ChromaDB
    vector = Chroma.from_documents(
        documents=chunks,
        embedding=OllamaEmbeddings(
            base_url=ollama_base_url,
            model=model_name, show_progress=True
        ),
        collection_name=collection_name,
        persist_directory=chroma_persistent_directory
    )
    return vector

Explanation of the function

Path to the current script:

current_path = os.path.dirname(os.path.realpath(__file__))

This line determines the path to the directory where the current script is located.

Creating a directory for data storage:

chroma_persistent_directory = current_path + "/data"
if not os.path.exists(chroma_persistent_directory):
    os.makedirs(chroma_persistent_directory, exist_ok=True)

Here, a data directory is created inside the current directory if it does not already exist. This directory will be used for data storage.

Loading the PDF document:

loader = UnstructuredPDFLoader(file_path=document_path)
data = loader.load()

The UnstructuredPDFLoader class is used to load the PDF document at the specified document_path.
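
To see what the loader returns, you can inspect the resulting Document objects. A minimal sketch (the file name wazuh.pdf is just a placeholder):

from langchain_community.document_loaders import UnstructuredPDFLoader

loader = UnstructuredPDFLoader(file_path="wazuh.pdf")  # placeholder path
data = loader.load()
print(len(data))                    # number of Document objects (usually 1 in the default "single" mode)
print(data[0].page_content[:200])   # preview of the extracted text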

Splitting the text into chunks:

text_splitter = RecursiveCharacterTextSplitter(chunk_size=7500, chunk_overlap=100)
chunks = text_splitter.split_documents(data)

The text from the PDF document is split into chunks of 7,500 characters with an overlap of 100 characters. Chunking keeps each piece small enough to embed and retrieve individually, while the overlap preserves context across chunk boundaries.
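
A quick way to check the result of the split, reusing the data loaded above:

from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=7500, chunk_overlap=100)
chunks = text_splitter.split_documents(data)       # "data" is the list of Documents loaded above
print(len(chunks))                                 # number of chunks
print(max(len(c.page_content) for c in chunks))    # size of the largest chunk in characters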

Creating the vector store:

vector = Chroma.from_documents(
    documents=chunks,
    embedding=OllamaEmbeddings(
        base_url=ollama_base_url,
        model=model_name, show_progress=True
    ),
    collection_name=collection_name,
    persist_directory=chroma_persistent_directory
)

The Chroma class is used to create a vector representation of the text chunks.

The OllamaEmbeddings model is used, which is loaded from the specified URL (ollama_base_url) and model name (model_name).

The vector representation is stored in a collection named collection_name in the chroma_persistent_directory.
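
To verify that the collection was persisted, you can reopen it and run a plain similarity search. A minimal sketch, assuming the defaults used in this article (collection wazuh, nomic-embed-text embeddings, data stored in ./data next to the script); the query string is only an example:

from langchain_community.vectorstores import Chroma
from langchain_ollama import OllamaEmbeddings

db = Chroma(
    collection_name="wazuh",
    embedding_function=OllamaEmbeddings(model="nomic-embed-text"),
    persist_directory="data",  # the ./data directory created by upload()
)
for doc in db.similarity_search("How do I enroll a Wazuh agent?", k=3):
    print(doc.page_content[:100])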

Returning the result:

return vector

The function returns the vector object, which represents the vectorized text.
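
The function can also be called directly from Python rather than through the command-line interface added below. A short sketch with placeholder values:

db = upload(
    document_path="wazuh-docs.pdf",            # placeholder path to your PDF
    model_name="nomic-embed-text",
    collection_name="wazuh",
    ollama_base_url="http://127.0.0.1:11434",
)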

Now let's complete the script by adding a command-line interface:

if __name__ == '__main__':
    # Creating argument parser
    parser = argparse.ArgumentParser(description='Upload a PDF to the vector store')
    # Adding arguments
    parser.add_argument('-p', '--path', type=str, help='Path to the PDF file', required=True)
    parser.add_argument('-m', '--model', type=str, help='Name of the Ollama model for embedding',
                        default='nomic-embed-text')
    parser.add_argument('-n', '--name', type=str, help='Collection name in ChromaDB', default='wazuh')
    parser.add_argument('-b', '--base-url', type=str, help='Base URL for the Ollama server',
                        default='http://127.0.0.1:11434')
    # Parsing arguments
    args = parser.parse_args()
    # Calling the upload function
    upload(document_path=args.path, model_name=args.model, collection_name=args.name, ollama_base_url=args.base_url)

Save everything and run:

ollama pull nomic-embed-text
python upload.py -p path/to/your/file.pdf

The upload will take some time. Be patient.

Now let’s try asking something:

Create a Python script (for example, ask.py) that will use Ollama to answer questions. Note that it imports Chroma from the langchain-chroma integration package, which is not in the requirements.txt above, so install it separately (pip install langchain-chroma).

import argparse
import os

from langchain.retrievers import MultiQueryRetriever
import chromadb
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate, ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_ollama import ChatOllama
from langchain_ollama import OllamaEmbeddings
from langchain_chroma import Chroma


def ask_ollama(question, collection_name='wazuh', embedding_model='nomic-embed-text', local_model='llama3.2'):
    # Open the ChromaDB collection that upload.py persisted in ./data
    current_path = os.path.dirname(os.path.realpath(__file__))
    chroma_persistent_directory = current_path + "/data"
    embedding = OllamaEmbeddings(model=embedding_model)
    persistent_client = chromadb.PersistentClient(path=chroma_persistent_directory)
    vector_db = Chroma(
        client=persistent_client,
        collection_name=collection_name,
        embedding_function=embedding,
    )

    # Chat model that both rephrases the question and generates the final answer
    load_ollama = ChatOllama(model=local_model)

    # Prompt used by MultiQueryRetriever to produce alternative phrasings of the question
    prompt_template = PromptTemplate(
        input_variables=["question"],
        template="""You are an AI language model assistant. Your task is to generate 2
    different versions of the given user question to retrieve relevant documents from
    a vector database. By generating multiple perspectives on the user question, your
    goal is to help the user overcome some of the limitations of the distance-based
    similarity search. Provide these alternative questions separated by newlines.
    Original question: {question}""",
    )

    # Prompt used to answer the question from the retrieved context only
    template = """Answer the question based ONLY on the following context:
    {context}
    Question: {question}
    """

    # Retrieve documents for the original question and its generated variants
    retriever = MultiQueryRetriever.from_llm(vector_db.as_retriever(), load_ollama, prompt=prompt_template)
    prompt = ChatPromptTemplate.from_template(template)
    # Chain: retrieve context -> fill the answer prompt -> query the model -> return plain text
    chain = (
            {"context": retriever, "question": RunnablePassthrough()}
            | prompt
            | load_ollama
            | StrOutputParser()
    )
    return chain.invoke(question)


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Ask Ollama a question')
    parser.add_argument('-q', '--question', type=str, help='The question to ask Ollama', required=True)
    args = parser.parse_args()
    print(ask_ollama(args.question))

Save and run the script:

python3 ask.py -q "What is Wazuh and what is it for?"

Sample response:

Wazuh is an open-source log management system that provides real-time monitoring and alerting capabilities for security, compliance, and IT operations. It was originally developed by Qualys, a leading provider of vulnerability management and compliance solutions.

Wazuh acts as a bridge between the host operating system and external threat intelligence feeds, allowing users to collect, process, and analyze log data from various sources. This enables users to:

1.  **Monitor security events**: Wazuh collects and analyzes log data from various sources (e.g., system logs, application logs, and network devices) to identify potential security threats and anomalies.
2.  **Detect vulnerabilities**: By integrating with external threat intelligence feeds, Wazuh can detect known vulnerabilities in the environment and alert users to take corrective action.
3.  **Enforce compliance**: Wazuh supports various compliance frameworks (e.g., PCI-DSS, HIPAA/HITECH, GDPR) by providing features for logging, auditing, and reporting on security-related data.

Overall, Wazuh helps organizations proactively manage their security posture, detect potential threats, and maintain regulatory compliance.

As you can see, there is nothing complicated here. Stay tuned for updates.

