Applying RAG for Working with Wazuh Documentation: A Step-by-Step Guide (Part 2)
Preparing for Code Development
To develop the RAG code locally, you will need the following:
- Ollama
- Python v3.9+
- Basic Python knowledge
- Wazuh documentation in PDF format
Running and Configuring Ollama
- Install Ollama
- Obtain the necessary models with ollama pull: llama3.2 and nomic-embed-text.
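Before going further, it can help to confirm that both models are actually available. The sketch below is one possible check, assuming an Ollama server on the default address http://127.0.0.1:11434 and its /api/tags endpoint for listing local models. The helper names missing_models and fetch_tags are made up for this illustration.

```python
import json
import urllib.request

REQUIRED_MODELS = ("llama3.2", "nomic-embed-text")


def missing_models(tags_payload, required=REQUIRED_MODELS):
    """Return the required models that are absent from an /api/tags payload."""
    # Ollama reports names with a tag suffix such as "llama3.2:latest",
    # so only the part before the colon is compared.
    installed = {m["name"].split(":")[0] for m in tags_payload.get("models", [])}
    return [name for name in required if name not in installed]


def fetch_tags(base_url="http://127.0.0.1:11434"):
    """Ask a running Ollama server which models it has pulled (needs the server up)."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as response:
        return json.load(response)


# Offline demonstration with a hand-written payload (not real server output).
sample_payload = {"models": [{"name": "llama3.2:latest"}]}
print(missing_models(sample_payload))
```

With the server running, missing_models(fetch_tags()) should return an empty list once both models have been pulled.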
Developing a Mechanism for Loading PDF Documentation
For development, we will use the following tools:
- LangChain - for creating data processing chains.
- Ollama - for running and configuring models.
- Python - as the main programming language.
- ChromaDB - as a vector store.
Install the dependencies. Create a requirements.txt file with the following content:
chromadb==0.6.3
unstructured==0.16.14
langchain==0.3.18
langchain-text-splitters==0.3.6
unstructured[all-docs]
langchain-community==0.3.14
langchain-ollama==0.2.2
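After the packages from requirements.txt are installed, a short script can confirm that the installed versions match the pins. This is only a sketch: parse_pins and find_mismatches are hypothetical helper names, and only lines of the form name==version are checked.

```python
from importlib import metadata


def parse_pins(requirements_text):
    """Map package name -> pinned version for every 'name==version' line."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.strip()
        if "==" not in line or line.startswith("#"):
            continue  # skip unpinned entries such as unstructured[all-docs]
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins):
    """Return {name: (pinned, installed_or_None)} for packages that differ."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches


print(parse_pins("chromadb==0.6.3\nunstructured[all-docs]\nlangchain==0.3.18"))
```

Calling find_mismatches(parse_pins(open("requirements.txt").read())) in the project directory returns an empty dictionary when everything matches.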
Install the packages with pip install -r requirements.txt. After installation, let's create a mechanism for loading PDF documentation.
Create a Python script named upload.py. Start with the imports:
import argparse
import os
from langchain_community.document_loaders import UnstructuredPDFLoader
from langchain_ollama import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
Now create a function for loading PDF documentation:
def upload(document_path, model_name, collection_name, ollama_base_url):
    current_path = os.path.dirname(os.path.realpath(__file__))
    chroma_persistent_directory = current_path + "/data"
    if not os.path.exists(chroma_persistent_directory):
        os.makedirs(chroma_persistent_directory, exist_ok=True)
    loader = UnstructuredPDFLoader(file_path=document_path)
    data = loader.load()
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=7500, chunk_overlap=100)
    chunks = text_splitter.split_documents(data)
    vector = Chroma.from_documents(
        documents=chunks,
        embedding=OllamaEmbeddings(
            base_url=ollama_base_url,
            model=model_name, show_progress=True
        ),
        collection_name=collection_name,
        persist_directory=chroma_persistent_directory
    )
    return vector
Explanation of the function
Path to the current script:
current_path = os.path.dirname(os.path.realpath(__file__))
This line determines the path to the directory where the current script is located.
Creating a directory for data storage:
chroma_persistent_directory = current_path + "/data"
if not os.path.exists(chroma_persistent_directory):
    os.makedirs(chroma_persistent_directory, exist_ok=True)
Here, a data directory is created next to the script if it does not already exist. ChromaDB persists the vector store in this directory.
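The same step can be written more compactly with pathlib. This is an alternative sketch rather than the code used in upload.py, and ensure_data_dir is a hypothetical helper name.

```python
from pathlib import Path


def ensure_data_dir(base_dir):
    """Create <base_dir>/data if needed and return its path."""
    data_dir = Path(base_dir) / "data"
    # exist_ok=True makes the call idempotent, so no separate existence check is needed.
    data_dir.mkdir(parents=True, exist_ok=True)
    return data_dir
```

Inside a script, ensure_data_dir(Path(__file__).resolve().parent) yields the same directory as the os.path version above.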
Loading the PDF document:
loader = UnstructuredPDFLoader(file_path=document_path)
data = loader.load()
The UnstructuredPDFLoader class loads the PDF document at the specified document_path.
Splitting the text into chunks:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=7500, chunk_overlap=100)
chunks = text_splitter.split_documents(data)
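The text from the PDF document is split into chunks of at most 7500 characters, with an overlap of 100 characters between neighboring chunks.
Splitting keeps each piece small enough to embed and retrieve on its own, and the overlap reduces the chance that a sentence is lost at a chunk boundary.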
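To see what chunk size and overlap mean in practice, here is a deliberately simplified splitter. It is not how RecursiveCharacterTextSplitter works internally: that class prefers to break on separators such as paragraphs, lines, and spaces, so its chunk boundaries will differ. The function name split_fixed is made up for this example.

```python
def split_fixed(text, chunk_size=7500, chunk_overlap=100):
    """Naive fixed-width splitter: each chunk starts chunk_size - chunk_overlap after the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks


sample = "abcdefghij" * 2  # 20 characters
for chunk in split_fixed(sample, chunk_size=8, chunk_overlap=2):
    print(chunk)
# prints:
# abcdefgh
# ghijabcd
# cdefghij
```

Each chunk repeats the last two characters of the previous one, which is the role the 100-character overlap plays in upload.py.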
Creating the vector store:
vector = Chroma.from_documents(
    documents=chunks,
    embedding=OllamaEmbeddings(
        base_url=ollama_base_url,
        model=model_name, show_progress=True
    ),
    collection_name=collection_name,
    persist_directory=chroma_persistent_directory
)
Chroma.from_documents computes an embedding for each text chunk and saves the results in a vector store.
The embeddings are produced by OllamaEmbeddings, which calls the model model_name on the Ollama server at ollama_base_url.
The vectors are stored in a collection named collection_name inside chroma_persistent_directory.
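Conceptually, a vector store answers a query by embedding it and returning the chunks whose vectors lie closest to the query vector. The toy example below illustrates that idea with squared Euclidean distance and hand-written 3-dimensional vectors. The texts, vectors, and function names are all hypothetical, and the metric a real Chroma collection uses depends on how it is configured.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_chunks(query_vector, store, k=2):
    """Return the k chunk texts whose vectors are closest to query_vector."""
    ranked = sorted(store, key=lambda item: squared_l2(query_vector, item[1]))
    return [text for text, _ in ranked[:k]]


# Made-up embeddings; real embedding models produce hundreds of dimensions.
toy_store = [
    ("Installing the Wazuh agent", [0.9, 0.1, 0.0]),
    ("Configuring alert rules", [0.1, 0.8, 0.2]),
    ("Upgrading the Wazuh manager", [0.7, 0.2, 0.1]),
]
print(nearest_chunks([1.0, 0.0, 0.0], toy_store))
```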
Returning the result:
return vector
The function returns the vector object, which is the Chroma vector store holding the embedded chunks.
Now complete our script:
if __name__ == '__main__':
    # Creating argument parser
    parser = argparse.ArgumentParser(description='Upload a PDF to the vector store')
    # Adding arguments
    parser.add_argument('-p', '--path', type=str, help='Path to the PDF file', required=True)
    parser.add_argument('-m', '--model', type=str, help='Name of the Ollama model for embedding',
                        default='nomic-embed-text')
    parser.add_argument('-n', '--name', type=str, help='Collection name in ChromaDB', default='wazuh')
    parser.add_argument('-b', '--base-url', type=str, help='Base URL for the Ollama server',
                        default='http://127.0.0.1:11434')
    # Parsing arguments
    args = parser.parse_args()
    # Calling the upload function
    upload(document_path=args.path, model_name=args.model, collection_name=args.name, ollama_base_url=args.base_url)
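To see how the defaults are filled in, the parser can be exercised without touching a PDF. The sketch below rebuilds the same four options in a function (build_parser is a name chosen for this example) and parses an explicit argument list. The file name docs.pdf and the collection name wazuh-docs are placeholders.

```python
import argparse


def build_parser():
    """Recreate the upload.py options so they can be inspected in isolation."""
    parser = argparse.ArgumentParser(description="Upload a PDF to the vector store")
    parser.add_argument("-p", "--path", type=str, required=True, help="Path to the PDF file")
    parser.add_argument("-m", "--model", type=str, default="nomic-embed-text",
                        help="Name of the Ollama model for embedding")
    parser.add_argument("-n", "--name", type=str, default="wazuh",
                        help="Collection name in ChromaDB")
    parser.add_argument("-b", "--base-url", type=str, default="http://127.0.0.1:11434",
                        help="Base URL for the Ollama server")
    return parser


# Parse an explicit list instead of sys.argv; omitted options fall back to their defaults.
args = build_parser().parse_args(["-p", "docs.pdf", "-n", "wazuh-docs"])
print(args.path, args.model, args.name, args.base_url)
```

Note that argparse exposes the --base-url option as args.base_url, with the hyphen turned into an underscore.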
Save everything and run:
ollama pull nomic-embed-text
python upload.py -p /path/to/your/file.pdf
The upload will take some time. Be patient.
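Now let's try asking something.
Create a Python script (for example, ask.py) that uses Ollama to answer questions. It imports Chroma from the langchain-chroma package, which is not listed in requirements.txt above, so install it first with pip install langchain-chroma.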
import argparse
import os
from langchain.retrievers import MultiQueryRetriever
import chromadb
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate, ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_ollama import ChatOllama
from langchain_ollama import OllamaEmbeddings
from langchain_chroma import Chroma
def ask_ollama(question, collection_name='wazuh', embedding_model='nomic-embed-text', local_model='llama3.2'):
    current_path = os.path.dirname(os.path.realpath(__file__))
    chroma_persistent_directory = current_path + "/data"
    embedding = OllamaEmbeddings(model=embedding_model)
    persistent_client = chromadb.PersistentClient(path=chroma_persistent_directory)
    vector_db = Chroma(
        client=persistent_client,
        collection_name=collection_name,
        embedding_function=embedding,
    )
    load_ollama = ChatOllama(model=local_model)
    prompt_template = PromptTemplate(
        input_variables=["question"],
        template="""You are an AI language model assistant. Your task is to generate 2
different versions of the given user question to retrieve relevant documents from
a vector database. By generating multiple perspectives on the user question, your
goal is to help the user overcome some of the limitations of the distance-based
similarity search. Provide these alternative questions separated by newlines.
Original question: {question}""",
    )
    template = """Answer the question based ONLY on the following context:
{context}
Question: {question}
"""
    retriever = MultiQueryRetriever.from_llm(vector_db.as_retriever(), load_ollama, prompt=prompt_template)
    prompt = ChatPromptTemplate.from_template(template)
    chain = (
        {"context": retriever, "question": RunnablePassthrough()}
        | prompt
        | load_ollama
        | StrOutputParser()
    )
    return chain.invoke(question)
if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Ask Ollama a question')
    parser.add_argument('-q', '--question', type=str, help='The question to ask Ollama', required=True)
    args = parser.parse_args()
    print(ask_ollama(args.question))
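The retrieval step in ask_ollama can be pictured as follows: the LLM rewrites the question into a few variants, the vector store is searched once per variant, and the results are merged without duplicates. The sketch below mimics that flow with plain functions. The stand-ins fake_generate and fake_retrieve and their data are invented for illustration and do not reflect the internals of MultiQueryRetriever.

```python
def multi_query_retrieve(question, generate_queries, retrieve):
    """Retrieve for every generated query and return the de-duplicated union."""
    seen = set()
    merged = []
    for query in generate_queries(question):
        for doc in retrieve(query):
            if doc not in seen:  # keep the first occurrence, preserve order
                seen.add(doc)
                merged.append(doc)
    return merged


def fake_generate(question):
    """Stand-in for the LLM call that produces alternative questions."""
    return [question + " (rephrased 1)", question + " (rephrased 2)"]


def fake_retrieve(query):
    """Stand-in for a vector store lookup; returns made-up chunk labels."""
    index = {
        "What is Wazuh? (rephrased 1)": ["overview", "architecture"],
        "What is Wazuh? (rephrased 2)": ["architecture", "use cases"],
    }
    return index.get(query, [])


print(multi_query_retrieve("What is Wazuh?", fake_generate, fake_retrieve))
```

The chunk labelled "architecture" is returned by both variants but appears only once in the merged result, which then becomes the context passed to the answering prompt.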
Save and run the script:
python3 ask.py -q "What is Wazuh and what is it used for?"
Sample response:
Wazuh is an open-source log management system that provides real-time monitoring and alerting capabilities for security, compliance, and IT operations. It was originally developed by Qualys, a leading provider of vulnerability management and compliance solutions.
Wazuh acts as a bridge between the host operating system and external threat intelligence feeds, allowing users to collect, process, and analyze log data from various sources. This enables users to:
1. **Monitor security events**: Wazuh collects and analyzes log data from various sources (e.g., system logs, application logs, and network devices) to identify potential security threats and anomalies.
2. **Detect vulnerabilities**: By integrating with external threat intelligence feeds, Wazuh can detect known vulnerabilities in the environment and alert users to take corrective action.
3. **Enforce compliance**: Wazuh supports various compliance frameworks (e.g., PCI-DSS, HIPAA/HITECH, GDPR) by providing features for logging, auditing, and reporting on security-related data.
Overall, Wazuh helps organizations proactively manage their security posture, detect potential threats, and maintain regulatory compliance.
As you can see, there is nothing complicated here. Stay tuned for updates.
See also
- Applying RAG for Wazuh Documentation: A Step-by-Step Guide (Part 1)
- Enhancing Wazuh with Ollama: Boosting Cybersecurity (Part 4)
- Enhancing Wazuh with Ollama: Boosting Cybersecurity (Part 3)
- Enhancing Wazuh with Ollama: Boosting Cybersecurity (Part 2)
- Enhancing Wazuh with Ollama: A Cybersecurity Boost (Part 1)