Introduction
Picture this: you’re staring at security logs with thousands of events streaming in daily. Which ones are actually dangerous? Which can you safely ignore? Traditional signature-based detection is like playing whack-a-mole with cybercriminals — they’ve gotten really good at dodging known signatures faster than we can create them.
Enter machine learning — your new cybersecurity superpower! Imagine having a system that learns attacker behavior patterns and predicts new threats before they even hit signature databases. Sounds too good to be true? Well, it’s not!
Honeypot data is the secret sauce that makes this magic happen. Unlike those sterile academic datasets gathering dust, honeypots capture real attackers in their natural habitat — like having a hidden camera in the cybercriminal underworld. This authentic data gives us unprecedented insights into how bad actors actually operate.
In this guide, we’ll take you on a journey from raw honeypot data to a working threat detection system that would make any SOC analyst jealous. Ready to turn chaos into clarity and transform your threat detection game? Let’s dive in!
What’s Hidden in Your Honeypot Data?
Before we start cooking up some ML magic, let’s peek behind the curtain and see what treasures our honeypot traps actually capture. Think of honeypots as security cameras recording cybercriminals in action. Here’s what our “footage” reveals:
Network Flow Data
Raw network connections contain fundamental information about attack patterns:
- Source/Destination IPs and Ports: Geographic and service targeting patterns
- Protocol Information: TCP/UDP usage, application layer protocols
- Flow Statistics: Packet counts, byte volumes, session duration
- Timing Data: Connection timestamps, session intervals
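To make these fields concrete, a single flow record might look like the toy example below (all values are invented, and the field names simply mirror the columns the preprocessing code later in this guide expects):

# Hypothetical flow record (illustrative values only)
flow_event = {
    'timestamp': '2024-01-15T14:30:00Z',
    'src_ip': '203.0.113.42',     # attacker source (documentation address range)
    'src_port': 51812,
    'dst_ip': '10.0.0.5',         # honeypot address
    'dst_port': 22,
    'protocol': 'tcp',
    'bytes_sent': 1024,
    'bytes_received': 2048,
    'session_duration': 3.7,      # seconds
}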
Application-Layer Events
Higher-level application interactions provide behavioral insights:
- Login Attempts: Credential stuffing, brute force patterns
- Command Execution: Shell commands, malware deployment
- File Operations: Upload/download activities, data exfiltration attempts
- Protocol-Specific Actions: HTTP requests, SSH sessions, database queries
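As an illustration, a single SSH login attempt captured by a honeypot could be represented roughly like this (a hand-written example, not any specific honeypot's schema):

# Hypothetical application-layer event (illustrative only)
ssh_event = {
    'timestamp': '2024-01-15T14:30:05Z',
    'src_ip': '203.0.113.42',
    'dst_port': 22,
    'event_type': 'login_attempt',
    'username': 'root',
    'password': '123456',
    'success': False,
    'command': None,              # filled in once a session is established
}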
Enriched Metadata
Additional context enhances the raw data:
- Geolocation: Country, region, ASN information
- Threat Intelligence: IP reputation, known malware signatures
- Behavioral Patterns: Session clustering, attack campaign attribution
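Enrichment usually just appends extra columns to each record; a hypothetical enriched event might carry fields like these (the attack_type label is the one the training code below relies on):

# Hypothetical enrichment fields (illustrative only)
enrichment = {
    'country': 'NL',
    'asn': 64501,                      # documentation ASN, not a real allocation
    'ip_reputation': 'known_scanner',  # from a threat-intelligence feed
    'attack_type': 'brute_force',      # label used by the models later in this guide
}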
Turning Chaos into Order: Data Cleaning
Raw honeypot data is like crude oil — full of potential, but you need to refine it first! Think of yourself as a detective sorting through evidence: some witness statements are unreliable, timestamps don’t add up, and some records are just duplicates. Here’s how we bring order to this beautiful chaos:
1. Data Validation and Sanitization
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
def validate_network_data(df):
    # Map column names (dataset uses 'dest_port' instead of 'dst_port')
    if 'dest_port' in df.columns and 'dst_port' not in df.columns:
        df['dst_port'] = df['dest_port']

    # Convert timestamp column with UTC handling for mixed timezones
    if '@timestamp' in df.columns and 'timestamp' not in df.columns:
        df['timestamp'] = pd.to_datetime(df['@timestamp'], format='mixed', errors='coerce', utc=True)
    elif 'timestamp' in df.columns:
        df['timestamp'] = pd.to_datetime(df['timestamp'], format='mixed', errors='coerce', utc=True)

    # Convert port columns to numeric, handling string ports
    for port_col in ['src_port', 'dst_port']:
        if port_col in df.columns:
            df[port_col] = pd.to_numeric(df[port_col], errors='coerce')

    # Remove invalid IP addresses
    if 'src_ip' in df.columns:
        df['src_ip'] = df['src_ip'].astype(str)
        df = df[df['src_ip'].str.match(r'^(?:[0-9]{1,3}\.){3}[0-9]{1,3}$', na=False)]

    # Validate port ranges (after converting to numeric)
    if 'src_port' in df.columns:
        df = df[df['src_port'].notna() & (df['src_port'] >= 1) & (df['src_port'] <= 65535)]
    if 'dst_port' in df.columns:
        df = df[df['dst_port'].notna() & (df['dst_port'] >= 1) & (df['dst_port'] <= 65535)]

    # Remove rows with invalid timestamps (compare against a timezone-aware "now")
    if 'timestamp' in df.columns:
        current_time = pd.Timestamp.now(tz='UTC')
        df = df[df['timestamp'].notna() & (df['timestamp'] <= current_time)]

    return df
2. Handling Missing Data and Outliers
def preprocess_security_events(df):
    # Handle missing geolocation data
    df['country'] = df['country'].fillna('Unknown')
    df['asn'] = df['asn'].fillna(0)

    # Cap extreme outliers in numerical features
    for col in ['bytes_sent', 'bytes_received', 'session_duration']:
        q99 = df[col].quantile(0.99)
        df[col] = df[col].clip(upper=q99)

    # Remove duplicate events (keep first occurrence)
    df = df.drop_duplicates(subset=['src_ip', 'dst_port', 'timestamp'], keep='first')

    return df
Feature Engineering for Threat Detection
Now for the fun part — turning raw data into ML model “food”! Think of this as cooking a gourmet meal from raw ingredients: each feature is like a spice that adds its unique flavor to our understanding of attacks. Let’s explore which “recipes” work best:
1. Temporal Features
Time-based patterns are crucial for identifying attack campaigns and behavioral anomalies:
def create_temporal_features(df):
    df['hour'] = df['timestamp'].dt.hour
    df['day_of_week'] = df['timestamp'].dt.dayofweek
    df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)

    # Time since first connection from the same source
    df['first_seen'] = df.groupby('src_ip')['timestamp'].transform('min')
    df['hours_since_first_seen'] = (df['timestamp'] - df['first_seen']).dt.total_seconds() / 3600

    # Connection frequency features
    df['daily_connection_count'] = df.groupby(['src_ip', df['timestamp'].dt.date])['src_ip'].transform('count')

    return df
2. Behavioral Aggregation Features
Statistical summaries reveal attacker patterns:
def create_behavioral_features(df):
    # Per-source IP aggregations
    source_stats = df.groupby('src_ip').agg({
        'dst_port': ['nunique', 'count'],
        'bytes_sent': ['mean', 'std', 'max'],
        'session_duration': ['mean', 'median'],
        'protocol': lambda x: x.mode().iloc[0] if len(x) > 0 else 'unknown'
    }).reset_index()

    # Flatten column names
    source_stats.columns = ['src_ip', 'unique_ports', 'total_connections',
                            'avg_bytes', 'std_bytes', 'max_bytes',
                            'avg_duration', 'median_duration', 'primary_protocol']

    # Merge back to original dataset
    df = df.merge(source_stats, on='src_ip', how='left')

    # Port scanning indicators
    df['port_diversity'] = df['unique_ports'] / df['total_connections']
    df['is_port_scanner'] = (df['unique_ports'] > 10).astype(int)

    return df
3. Geographic and Network Features
Geographic patterns help identify coordinated attacks:
def create_geographic_features(df):
    # Country-level threat scoring
    country_threat_scores = df.groupby('country').agg({
        'src_ip': 'nunique',
        'attack_type': lambda x: (x != 'benign').sum()
    }).reset_index()
    country_threat_scores['threat_ratio'] = (
        country_threat_scores['attack_type'] / country_threat_scores['src_ip']
    )
    df = df.merge(country_threat_scores[['country', 'threat_ratio']],
                  on='country', how='left')

    # ASN-based features
    asn_stats = df.groupby('asn').agg({
        'src_ip': 'nunique',
        'bytes_sent': 'mean'
    }).reset_index()
    df = df.merge(asn_stats, on='asn', suffixes=('', '_asn_avg'))

    return df
Choosing Your ML Weapon for Threat Hunting
Time to pick our weapon of choice! Just like in video games where different bosses require different strategies, threat detection needs different ML approaches. Let’s figure out which “gear” works best for your specific mission:
1. Binary Classification: Attack vs. Benign
For basic threat detection, start with binary classification:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
def train_binary_classifier(df):
    # Prepare features
    feature_cols = ['hour', 'day_of_week', 'unique_ports', 'total_connections',
                    'avg_bytes', 'port_diversity', 'threat_ratio']
    X = df[feature_cols].fillna(0)
    y = (df['attack_type'] != 'benign').astype(int)

    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    # Train model
    rf_model = RandomForestClassifier(
        n_estimators=100,
        max_depth=10,
        min_samples_split=5,
        random_state=42
    )
    rf_model.fit(X_train, y_train)

    # Evaluate
    y_pred = rf_model.predict(X_test)
    print(classification_report(y_test, y_pred))

    return rf_model
2. Multi-class Attack Classification
For detailed threat categorization:
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier
def train_multiclass_classifier(df):
    # Encode attack types
    le = LabelEncoder()
    df['attack_label'] = le.fit_transform(df['attack_type'])

    feature_cols = ['hour', 'day_of_week', 'unique_ports', 'total_connections',
                    'avg_bytes', 'std_bytes', 'port_diversity', 'avg_duration']
    X = df[feature_cols].fillna(0)
    y = df['attack_label']

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    # XGBoost for multi-class
    xgb_model = XGBClassifier(
        n_estimators=200,
        max_depth=6,
        learning_rate=0.1,
        subsample=0.8,
        random_state=42
    )
    xgb_model.fit(X_train, y_train)

    y_pred = xgb_model.predict(X_test)
    print(classification_report(y_test, y_pred, target_names=le.classes_))

    return xgb_model, le
3. Anomaly Detection for Zero-Day Threats
Unsupervised learning identifies previously unseen attack patterns:
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
def train_anomaly_detector(df):
    # Use only benign traffic for training
    benign_data = df[df['attack_type'] == 'benign']
    feature_cols = ['unique_ports', 'total_connections', 'avg_bytes',
                    'port_diversity', 'avg_duration', 'hours_since_first_seen']
    X_benign = benign_data[feature_cols].fillna(0)

    # Scale features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X_benign)

    # Train isolation forest
    iso_forest = IsolationForest(
        n_estimators=100,
        contamination=0.05,  # Expect 5% anomalies
        random_state=42
    )
    iso_forest.fit(X_scaled)

    # Test on full dataset
    X_all = df[feature_cols].fillna(0)
    X_all_scaled = scaler.transform(X_all)
    anomaly_scores = iso_forest.decision_function(X_all_scaled)

    df['anomaly_score'] = anomaly_scores
    df['is_anomaly'] = iso_forest.predict(X_all_scaled) == -1

    return iso_forest, scaler
From Lab to Battlefield: Model Deployment
Performance Metrics for Security Models
Security models require specialized evaluation metrics:
from sklearn.metrics import precision_recall_curve, roc_curve, auc
import numpy as np
import matplotlib.pyplot as plt

def evaluate_security_model(y_true, y_pred_proba):
    # Precision-Recall curve (better for imbalanced data)
    precision, recall, pr_thresholds = precision_recall_curve(y_true, y_pred_proba)
    pr_auc = auc(recall, precision)

    # ROC curve
    fpr, tpr, roc_thresholds = roc_curve(y_true, y_pred_proba)
    roc_auc = auc(fpr, tpr)

    # Find the threshold that best balances precision and recall
    # (drop the final PR point, which has no associated threshold)
    optimal_idx = np.argmax(precision[:-1] * recall[:-1])
    optimal_threshold = pr_thresholds[optimal_idx]

    print(f"PR-AUC: {pr_auc:.3f}")
    print(f"ROC-AUC: {roc_auc:.3f}")
    print(f"Optimal Threshold: {optimal_threshold:.3f}")

    return optimal_threshold
Real-time Inference Pipeline
Deploy models for real-time threat detection:
import joblib
import numpy as np
from datetime import datetime

class ThreatDetectionPipeline:
    def __init__(self, model_path, scaler_path=None):
        self.model = joblib.load(model_path)
        self.scaler = joblib.load(scaler_path) if scaler_path else None

    def preprocess_event(self, event):
        # Convert a raw event into a feature vector; the keys and their order
        # must match the feature columns the model was trained on
        event_time = datetime.fromisoformat(event['timestamp'])
        features = {
            'hour': event_time.hour,
            'day_of_week': event_time.weekday(),
            'unique_ports': event.get('unique_ports', 1),
            'total_connections': event.get('connection_count', 1),
            'avg_bytes': event.get('bytes_sent', 0),
            'port_diversity': event.get('port_diversity', 0),
            'threat_ratio': event.get('threat_ratio', 0),
        }
        feature_vector = np.array(list(features.values()), dtype=float).reshape(1, -1)
        if self.scaler:
            feature_vector = self.scaler.transform(feature_vector)
        return feature_vector

    def predict_threat(self, event):
        features = self.preprocess_event(event)
        # Probability of the positive ("attack") class
        threat_probability = self.model.predict_proba(features)[0][1]
        return {
            'threat_probability': float(threat_probability),
            'is_threat': threat_probability > 0.5,
            'risk_level': 'high' if threat_probability > 0.8
                          else 'medium' if threat_probability > 0.5
                          else 'low'
        }
# Usage example
pipeline = ThreatDetectionPipeline('threat_model.pkl', 'feature_scaler.pkl')

sample_event = {
    'timestamp': '2024-01-15T14:30:00',
    'src_ip': '192.168.1.100',
    'dst_port': 22,
    'bytes_sent': 1024,
    'unique_ports': 5,
    'connection_count': 15
}

result = pipeline.predict_threat(sample_event)
print(f"Threat Assessment: {result}")
Best Practices and Considerations
1. Dataset Quality and Labeling
- Ground Truth Validation: Regularly validate honeypot logs against known attack signatures
- Continuous Labeling: Implement automated labeling pipelines for new attack types (a minimal sketch follows this list)
- Data Freshness: Retrain models with recent attack data to maintain effectiveness
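As a starting point for continuous labeling, a minimal rule-based labeler like the sketch below can tag the obvious patterns and leave everything else for analyst review. The thresholds and label names are my own assumptions, not part of any dataset mentioned here:

import pandas as pd

def auto_label_events(df):
    # First-pass heuristic labels; anything unmatched stays 'unlabeled' for manual review
    labels = pd.Series('unlabeled', index=df.index)
    labels[(df['dst_port'] == 22) & (df['total_connections'] > 50)] = 'brute_force'
    labels[df['unique_ports'] > 20] = 'port_scan'
    labels[df['bytes_sent'] > 10_000_000] = 'data_exfiltration'
    df['auto_label'] = labels
    return df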
2. Model Drift Monitoring
- Performance Tracking: Monitor precision/recall metrics in production
- Feature Distribution: Detect shifts in input feature distributions (see the sketch after this list)
- Automated Retraining: Set up pipelines to retrain models when performance degrades
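One lightweight way to implement the feature-distribution check is a two-sample Kolmogorov-Smirnov test between the training data and a recent production window. This is a sketch; the 0.05 p-value cutoff and the choice of features are assumptions you should tune:

from scipy.stats import ks_2samp

def detect_feature_drift(train_df, recent_df, feature_cols, p_threshold=0.05):
    # Flag features whose recent distribution differs significantly from training
    drifted = {}
    for col in feature_cols:
        stat, p_value = ks_2samp(train_df[col].dropna(), recent_df[col].dropna())
        if p_value < p_threshold:
            drifted[col] = {'ks_stat': float(stat), 'p_value': float(p_value)}
    return drifted  # a non-empty result is a signal to consider retraining

# Example: detect_feature_drift(train_df, last_week_df, ['unique_ports', 'avg_bytes'])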
3. Integration with Security Operations
- SIEM Integration: Export model predictions to security information systems (example after this list)
- Alert Tuning: Adjust thresholds based on organizational risk tolerance
- Human-in-the-Loop: Give security analysts a way to confirm or dismiss alerts and feed that verdict back into retraining
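For the SIEM side, the simplest contract is to emit each prediction as a structured JSON alert that your SIEM already knows how to ingest. The sketch below assumes the prediction dictionary returned by ThreatDetectionPipeline.predict_threat above; the field names and the transport (syslog, HTTP, message queue) are up to your environment:

import json
from datetime import datetime, timezone

def to_siem_alert(event, prediction):
    # Wrap a model prediction in a JSON document for SIEM ingestion;
    # adapt the field names to your SIEM's schema (ECS, CEF, etc.)
    return json.dumps({
        '@timestamp': datetime.now(timezone.utc).isoformat(),
        'source_ip': event.get('src_ip'),
        'destination_port': event.get('dst_port'),
        'threat_probability': prediction['threat_probability'],
        'risk_level': prediction['risk_level'],
        'detector': 'honeypot-ml-pipeline',
    })

# Example: alert_json = to_siem_alert(sample_event, result)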
Working with Public Honeypot Datasets
Dataset Arsenal: Your Threat Intelligence Toolkit
To show you the power of these methods in action, I’ve assembled a complete collection of real honeypot datasets on Hugging Face. Each dataset tells its own “story” about how attackers behave in the wild. Let’s meet your new threat intelligence toolkit:
The Big Boss: cyber-security-events-full
pyToshka/cyber-security-events-full
- Size: 772K events — the heavyweight champion for serious experiments
- What’s Inside: A full-length movie about cyberattacks with a rich feature set
- Features: Network flows, behavioral patterns, geographic data, IP reputation
- Perfect For: Training production-ready threat detection models
- Special Power: It’s like the Wikipedia of attacks — everything’s in here!
The Time Whisperer: attacks-daily
- Size: 676K records — laser-focused on temporal patterns
- What’s Inside: Daily chronicles of attacks with timestamps
- Features: Time series attacks, seasonal patterns, activity cycles
- Perfect For: Predicting “when” the next attack will happen
- Special Power: Shows that even hackers have daily routines!
The Compact Trainer: cyber-security-events
pyToshka/cyber-security-events
- Size: 15.1K events — perfect size for rapid experimentation
- What’s Inside: Curated selection of the most interesting attacks
- Features: Balanced mix of different attack types
- Perfect For: First steps and quick prototyping
- Special Power: Like a starter pack for ML researchers!
The Intrusion Specialist: network-intrusion-detection
pyToshka/network-intrusion-detection
- Size: 100 records — small but mighty
- What’s Inside: High-quality examples of network intrusions
- Features: Clear classifications, samples for IDS systems
- Perfect For: Intrusion detection system developers
- Special Power: Each record is a textbook example of “how NOT to secure your network”
Author’s Tip: Start with cyber-security-events to learn the basics, move to attacks-daily for temporal analysis, and finish with cyber-security-events-full for serious experiments. It’s like leveling up in a game: from newbie to expert!
Dataset Selection Guidelines
When using publicly available honeypot datasets from platforms like Hugging Face, consider these practical approaches:
Dataset Selection Criteria
- Data Recency: Choose datasets with recent attack patterns
- Volume and Variety: Ensure sufficient samples across different attack types
- Documentation: Look for well-documented datasets with clear feature descriptions
- Licensing: Verify appropriate usage rights for your use case
Example Integration with Hugging Face Datasets
from datasets import load_dataset
# Load a comprehensive security events dataset
dataset = load_dataset("pyToshka/cyber-security-events-full")
# Convert to pandas for easier manipulation
df = dataset['train'].to_pandas()
# Apply preprocessing and feature engineering pipeline
df = validate_network_data(df)
df = preprocess_security_events(df)
df = create_temporal_features(df)
df = create_behavioral_features(df)
df = create_geographic_features(df)  # provides the 'threat_ratio' feature used below

# Train your models
model = train_binary_classifier(df)
macOS M1/M2/M4 Compatibility
When running this code on Apple Silicon Macs (M1, M2, M4), you may encounter XGBoost installation issues. Here’s how to resolve them:
Installing Dependencies for Apple Silicon
# Install OpenMP runtime (required for XGBoost)
brew install libomp
# Install Python packages
pip install pandas numpy scikit-learn datasets xgboost
# If you encounter issues with XGBoost, try:
pip uninstall xgboost
pip install xgboost --no-cache-dir
Troubleshooting Common Issues
Issue: XGBoostError: XGBoost Library (libxgboost.dylib) could not be loaded
Solution:
brew install libomp
export LDFLAGS="-L/opt/homebrew/opt/libomp/lib"
export CPPFLAGS="-I/opt/homebrew/opt/libomp/include"
Issue: Performance issues on Apple Silicon
Solution: Ensure you’re using the native ARM64 Python installation, not x86_64 through Rosetta.
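A quick way to confirm which Python build you are running:

import platform
print(platform.machine())  # 'arm64' for native Apple Silicon Python, 'x86_64' under Rosetta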
Related Reading
- Mitigation Anomaly Revelation Keeper (MARK) - AI-powered threat analysis platform
- Integrating Wazuh with Ollama: Part 1 - AI-enhanced SIEM for threat detection
- How to Set Up a Custom Integration between Wazuh and MARK - Advanced security automation
- Amazon EKS SOC 2 Type II Compliance Checklist Part 1 - Enterprise security compliance