Introduction
Picture this: you’re staring at security logs with thousands of events streaming in daily. Which ones are actually dangerous? Which can you safely ignore? Traditional signature-based detection is like playing whack-a-mole with cybercriminals — they’ve gotten really good at dodging known signatures faster than we can create them.
Enter machine learning — your new cybersecurity superpower! Imagine having a system that learns attacker behavior patterns and predicts new threats before they even hit signature databases. Sounds too good to be true? Well, it’s not!
Honeypot data is the secret sauce that makes this magic happen. Unlike those sterile academic datasets gathering dust, honeypots capture real attackers in their natural habitat — like having a hidden camera in the cybercriminal underworld. This authentic data gives us unprecedented insights into how bad actors actually operate.
In this guide, we’ll take you on a journey from raw honeypot data to a working threat detection system that would make any SOC analyst jealous. Ready to turn chaos into clarity and transform your threat detection game? Let’s dive in!
What’s Hidden in Your Honeypot Data?
Before we start cooking up some ML magic, let’s peek behind the curtain and see what treasures our honeypot traps actually capture. Think of honeypots as security cameras recording cybercriminals in action. Here’s what our “footage” reveals:
Network Flow Data
Raw network connections contain fundamental information about attack patterns:
- Source/Destination IPs and Ports: Geographic and service targeting patterns
- Protocol Information: TCP/UDP usage, application layer protocols
- Flow Statistics: Packet counts, byte volumes, session duration
- Timing Data: Connection timestamps, session intervals
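To make these fields concrete, a single flow record might look like the toy example below (all values are invented, and the field names simply mirror the columns the preprocessing code later in this guide expects):

# Hypothetical flow record (illustrative values only)
flow_event = {
    'timestamp': '2024-01-15T14:30:00Z',
    'src_ip': '203.0.113.42',     # attacker source (documentation address range)
    'src_port': 51812,
    'dst_ip': '10.0.0.5',         # honeypot address
    'dst_port': 22,
    'protocol': 'tcp',
    'bytes_sent': 1024,
    'bytes_received': 2048,
    'session_duration': 3.7,      # seconds
}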
Application-Layer Events
Higher-level application interactions provide behavioral insights:
- Login Attempts: Credential stuffing, brute force patterns
- Command Execution: Shell commands, malware deployment
- File Operations: Upload/download activities, data exfiltration attempts
- Protocol-Specific Actions: HTTP requests, SSH sessions, database queries
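As an illustration, a single SSH login attempt captured by a honeypot could be represented roughly like this (a hand-written example, not any specific honeypot's schema):

# Hypothetical application-layer event (illustrative only)
ssh_event = {
    'timestamp': '2024-01-15T14:30:05Z',
    'src_ip': '203.0.113.42',
    'dst_port': 22,
    'event_type': 'login_attempt',
    'username': 'root',
    'password': '123456',
    'success': False,
    'command': None,              # filled in once a session is established
}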
Enriched Metadata
Additional context enhances the raw data:
- Geolocation: Country, region, ASN information
- Threat Intelligence: IP reputation, known malware signatures
- Behavioral Patterns: Session clustering, attack campaign attribution
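Enrichment usually just appends extra columns to each record; a hypothetical enriched event might carry fields like these (the attack_type label is the one the training code below relies on):

# Hypothetical enrichment fields (illustrative only)
enrichment = {
    'country': 'NL',
    'asn': 64501,                      # documentation ASN, not a real allocation
    'ip_reputation': 'known_scanner',  # from a threat-intelligence feed
    'attack_type': 'brute_force',      # label used by the models later in this guide
}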
Turning Chaos into Order: Data Cleaning
Raw honeypot data is like crude oil — full of potential, but you need to refine it first! Think of yourself as a detective sorting through evidence: some witness statements are unreliable, timestamps don’t add up, and some records are just duplicates. Here’s how we bring order to this beautiful chaos:
1. Data Validation and Sanitization
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
def validate_network_data(df):
    # Map column names (dataset uses 'dest_port' instead of 'dst_port')
    if 'dest_port' in df.columns and 'dst_port' not in df.columns:
        df['dst_port'] = df['dest_port']

    # Convert timestamp column with UTC handling for mixed timezones
    if '@timestamp' in df.columns and 'timestamp' not in df.columns:
        df['timestamp'] = pd.to_datetime(df['@timestamp'], format='mixed', errors='coerce', utc=True)
    elif 'timestamp' in df.columns:
        df['timestamp'] = pd.to_datetime(df['timestamp'], format='mixed', errors='coerce', utc=True)

    # Convert port columns to numeric, handling string ports
    for port_col in ['src_port', 'dst_port']:
        if port_col in df.columns:
            df[port_col] = pd.to_numeric(df[port_col], errors='coerce')

    # Remove invalid IP addresses
    if 'src_ip' in df.columns:
        df['src_ip'] = df['src_ip'].astype(str)
        df = df[df['src_ip'].str.match(r'^(?:[0-9]{1,3}\.){3}[0-9]{1,3}$', na=False)]

    # Validate port ranges (after converting to numeric)
    if 'src_port' in df.columns:
        df = df[df['src_port'].notna() & (df['src_port'] >= 1) & (df['src_port'] <= 65535)]
    if 'dst_port' in df.columns:
        df = df[df['dst_port'].notna() & (df['dst_port'] >= 1) & (df['dst_port'] <= 65535)]

    # Remove rows with invalid timestamps (compare against a timezone-aware "now")
    if 'timestamp' in df.columns:
        current_time = pd.Timestamp.now(tz='UTC')
        df = df[df['timestamp'].notna() & (df['timestamp'] <= current_time)]

    return df
2. Handling Missing Data and Outliers
def preprocess_security_events(df):
    # Handle missing geolocation data
    df['country'] = df['country'].fillna('Unknown')
    df['asn'] = df['asn'].fillna(0)

    # Cap extreme outliers in numerical features
    for col in ['bytes_sent', 'bytes_received', 'session_duration']:
        q99 = df[col].quantile(0.99)
        df[col] = df[col].clip(upper=q99)

    # Remove duplicate events (keep first occurrence)
    df = df.drop_duplicates(subset=['src_ip', 'dst_port', 'timestamp'], keep='first')

    return df
Feature Engineering for Threat Detection
Now for the fun part — turning raw data into ML model “food”! Think of this as cooking a gourmet meal from raw ingredients: each feature is like a spice that adds its unique flavor to our understanding of attacks. Let’s explore which “recipes” work best:
1. Temporal Features
Time-based patterns are crucial for identifying attack campaigns and behavioral anomalies:
def create_temporal_features(df):
    df['hour'] = df['timestamp'].dt.hour
    df['day_of_week'] = df['timestamp'].dt.dayofweek
    df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)

    # Time since first connection from the same source
    df['first_seen'] = df.groupby('src_ip')['timestamp'].transform('min')
    df['hours_since_first_seen'] = (df['timestamp'] - df['first_seen']).dt.total_seconds() / 3600

    # Connection frequency features
    df['daily_connection_count'] = df.groupby(['src_ip', df['timestamp'].dt.date])['src_ip'].transform('count')

    return df
2. Behavioral Aggregation Features
Statistical summaries reveal attacker patterns:
def create_behavioral_features(df):
    # Per-source IP aggregations
    source_stats = df.groupby('src_ip').agg({
        'dst_port': ['nunique', 'count'],
        'bytes_sent': ['mean', 'std', 'max'],
        'session_duration': ['mean', 'median'],
        'protocol': lambda x: x.mode().iloc[0] if len(x) > 0 else 'unknown'
    }).reset_index()

    # Flatten column names
    source_stats.columns = ['src_ip', 'unique_ports', 'total_connections',
                            'avg_bytes', 'std_bytes', 'max_bytes',
                            'avg_duration', 'median_duration', 'primary_protocol']

    # Merge back to original dataset
    df = df.merge(source_stats, on='src_ip', how='left')

    # Port scanning indicators
    df['port_diversity'] = df['unique_ports'] / df['total_connections']
    df['is_port_scanner'] = (df['unique_ports'] > 10).astype(int)

    return df
3. Geographic and Network Features
Geographic patterns help identify coordinated attacks:
def create_geographic_features(df):
    # Country-level threat scoring
    country_threat_scores = df.groupby('country').agg({
        'src_ip': 'nunique',
        'attack_type': lambda x: (x != 'benign').sum()
    }).reset_index()
    country_threat_scores['threat_ratio'] = (
        country_threat_scores['attack_type'] / country_threat_scores['src_ip']
    )
    df = df.merge(country_threat_scores[['country', 'threat_ratio']],
                  on='country', how='left')

    # ASN-based features
    asn_stats = df.groupby('asn').agg({
        'src_ip': 'nunique',
        'bytes_sent': 'mean'
    }).reset_index()
    df = df.merge(asn_stats, on='asn', suffixes=('', '_asn_avg'))

    return df
Choosing Your ML Weapon for Threat Hunting
Time to pick our weapon of choice! Just like in video games where different bosses require different strategies, threat detection needs different ML approaches. Let’s figure out which “gear” works best for your specific mission:
1. Binary Classification: Attack vs. Benign
For basic threat detection, start with binary classification:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
def train_binary_classifier(df):
    # Prepare features
    feature_cols = ['hour', 'day_of_week', 'unique_ports', 'total_connections',
                    'avg_bytes', 'port_diversity', 'threat_ratio']
    X = df[feature_cols].fillna(0)
    y = (df['attack_type'] != 'benign').astype(int)

    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    # Train model
    rf_model = RandomForestClassifier(
        n_estimators=100,
        max_depth=10,
        min_samples_split=5,
        random_state=42
    )
    rf_model.fit(X_train, y_train)

    # Evaluate
    y_pred = rf_model.predict(X_test)
    print(classification_report(y_test, y_pred))

    return rf_model
2. Multi-class Attack Classification
For detailed threat categorization:
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier
def train_multiclass_classifier(df):
    # Encode attack types
    le = LabelEncoder()
    df['attack_label'] = le.fit_transform(df['attack_type'])

    feature_cols = ['hour', 'day_of_week', 'unique_ports', 'total_connections',
                    'avg_bytes', 'std_bytes', 'port_diversity', 'avg_duration']
    X = df[feature_cols].fillna(0)
    y = df['attack_label']

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    # XGBoost for multi-class
    xgb_model = XGBClassifier(
        n_estimators=200,
        max_depth=6,
        learning_rate=0.1,
        subsample=0.8,
        random_state=42
    )
    xgb_model.fit(X_train, y_train)

    y_pred = xgb_model.predict(X_test)
    print(classification_report(y_test, y_pred, target_names=le.classes_))

    return xgb_model, le
3. Anomaly Detection for Zero-Day Threats
Unsupervised learning identifies previously unseen attack patterns:
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
def train_anomaly_detector(df):
    # Use only benign traffic for training
    benign_data = df[df['attack_type'] == 'benign']
    feature_cols = ['unique_ports', 'total_connections', 'avg_bytes',
                    'port_diversity', 'avg_duration', 'hours_since_first_seen']
    X_benign = benign_data[feature_cols].fillna(0)

    # Scale features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X_benign)

    # Train isolation forest
    iso_forest = IsolationForest(
        n_estimators=100,
        contamination=0.05,  # Expect 5% anomalies
        random_state=42
    )
    iso_forest.fit(X_scaled)

    # Test on full dataset
    X_all = df[feature_cols].fillna(0)
    X_all_scaled = scaler.transform(X_all)
    anomaly_scores = iso_forest.decision_function(X_all_scaled)

    df['anomaly_score'] = anomaly_scores
    df['is_anomaly'] = iso_forest.predict(X_all_scaled) == -1

    return iso_forest, scaler
From Lab to Battlefield: Model Deployment
Performance Metrics for Security Models
Security models require specialized evaluation metrics:
from sklearn.metrics import precision_recall_curve, roc_curve, auc
import numpy as np
import matplotlib.pyplot as plt

def evaluate_security_model(y_true, y_pred_proba):
    # Precision-Recall curve (better for imbalanced data)
    precision, recall, pr_thresholds = precision_recall_curve(y_true, y_pred_proba)
    pr_auc = auc(recall, precision)

    # ROC curve
    fpr, tpr, roc_thresholds = roc_curve(y_true, y_pred_proba)
    roc_auc = auc(fpr, tpr)

    # Find the threshold that best balances precision and recall
    # (drop the final PR point, which has no associated threshold)
    optimal_idx = np.argmax(precision[:-1] * recall[:-1])
    optimal_threshold = pr_thresholds[optimal_idx]

    print(f"PR-AUC: {pr_auc:.3f}")
    print(f"ROC-AUC: {roc_auc:.3f}")
    print(f"Optimal Threshold: {optimal_threshold:.3f}")

    return optimal_threshold
Real-time Inference Pipeline
Deploy models for real-time threat detection:
import joblib
import numpy as np
from datetime import datetime

class ThreatDetectionPipeline:
    def __init__(self, model_path, scaler_path=None):
        self.model = joblib.load(model_path)
        self.scaler = joblib.load(scaler_path) if scaler_path else None

    def preprocess_event(self, event):
        # Convert a raw event into a feature vector; the keys and their order
        # must match the feature columns the model was trained on
        event_time = datetime.fromisoformat(event['timestamp'])
        features = {
            'hour': event_time.hour,
            'day_of_week': event_time.weekday(),
            'unique_ports': event.get('unique_ports', 1),
            'total_connections': event.get('connection_count', 1),
            'avg_bytes': event.get('bytes_sent', 0),
            'port_diversity': event.get('port_diversity', 0),
            'threat_ratio': event.get('threat_ratio', 0),
        }
        feature_vector = np.array(list(features.values()), dtype=float).reshape(1, -1)
        if self.scaler:
            feature_vector = self.scaler.transform(feature_vector)
        return feature_vector

    def predict_threat(self, event):
        features = self.preprocess_event(event)
        # Probability of the positive ("attack") class
        threat_probability = self.model.predict_proba(features)[0][1]
        return {
            'threat_probability': float(threat_probability),
            'is_threat': threat_probability > 0.5,
            'risk_level': 'high' if threat_probability > 0.8
                          else 'medium' if threat_probability > 0.5
                          else 'low'
        }
# Usage example
pipeline = ThreatDetectionPipeline('threat_model.pkl', 'feature_scaler.pkl')

sample_event = {
    'timestamp': '2024-01-15T14:30:00',
    'src_ip': '192.168.1.100',
    'dst_port': 22,
    'bytes_sent': 1024,
    'unique_ports': 5,
    'connection_count': 15
}

result = pipeline.predict_threat(sample_event)
print(f"Threat Assessment: {result}")
Best Practices and Considerations
1. Dataset Quality and Labeling
- Ground Truth Validation: Regularly validate honeypot logs against known attack signatures
- Continuous Labeling: Implement automated labeling pipelines for new attack types (a minimal sketch follows this list)
- Data Freshness: Retrain models with recent attack data to maintain effectiveness
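As a starting point for continuous labeling, a minimal rule-based labeler like the sketch below can tag the obvious patterns and leave everything else for analyst review. The thresholds and label names are my own assumptions, not part of any dataset mentioned here:

import pandas as pd

def auto_label_events(df):
    # First-pass heuristic labels; anything unmatched stays 'unlabeled' for manual review
    labels = pd.Series('unlabeled', index=df.index)
    labels[(df['dst_port'] == 22) & (df['total_connections'] > 50)] = 'brute_force'
    labels[df['unique_ports'] > 20] = 'port_scan'
    labels[df['bytes_sent'] > 10_000_000] = 'data_exfiltration'
    df['auto_label'] = labels
    return df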
2. Model Drift Monitoring
- Performance Tracking: Monitor precision/recall metrics in production
- Feature Distribution: Detect shifts in input feature distributions (see the sketch after this list)
- Automated Retraining: Set up pipelines to retrain models when performance degrades
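One lightweight way to implement the feature-distribution check is a two-sample Kolmogorov-Smirnov test between the training data and a recent production window. This is a sketch; the 0.05 p-value cutoff and the choice of features are assumptions you should tune:

from scipy.stats import ks_2samp

def detect_feature_drift(train_df, recent_df, feature_cols, p_threshold=0.05):
    # Flag features whose recent distribution differs significantly from training
    drifted = {}
    for col in feature_cols:
        stat, p_value = ks_2samp(train_df[col].dropna(), recent_df[col].dropna())
        if p_value < p_threshold:
            drifted[col] = {'ks_stat': float(stat), 'p_value': float(p_value)}
    return drifted  # a non-empty result is a signal to consider retraining

# Example: detect_feature_drift(train_df, last_week_df, ['unique_ports', 'avg_bytes'])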
3. Integration with Security Operations
- SIEM Integration: Export model predictions to security information systems (example after this list)
- Alert Tuning: Adjust thresholds based on organizational risk tolerance
- Human-in-the-Loop: Give security analysts a way to confirm or dismiss alerts and feed that verdict back into retraining
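For the SIEM side, the simplest contract is to emit each prediction as a structured JSON alert that your SIEM already knows how to ingest. The sketch below assumes the prediction dictionary returned by ThreatDetectionPipeline.predict_threat above; the field names and the transport (syslog, HTTP, message queue) are up to your environment:

import json
from datetime import datetime, timezone

def to_siem_alert(event, prediction):
    # Wrap a model prediction in a JSON document for SIEM ingestion;
    # adapt the field names to your SIEM's schema (ECS, CEF, etc.)
    return json.dumps({
        '@timestamp': datetime.now(timezone.utc).isoformat(),
        'source_ip': event.get('src_ip'),
        'destination_port': event.get('dst_port'),
        'threat_probability': prediction['threat_probability'],
        'risk_level': prediction['risk_level'],
        'detector': 'honeypot-ml-pipeline',
    })

# Example: alert_json = to_siem_alert(sample_event, result)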
Working with Public Honeypot Datasets
Dataset Arsenal: Your Threat Intelligence Toolkit
To show you the power of these methods in action, I’ve assembled a complete collection of real honeypot datasets on Hugging Face. Each dataset tells its own “story” about how attackers behave in the wild. Let’s meet your new threat intelligence toolkit:
The Big Boss: cyber-security-events-full
pyToshka/cyber-security-events-full
- Size: 772K events — the heavyweight champion for serious experiments
- What’s Inside: A full-length movie about cyberattacks with a rich feature set
- Features: Network flows, behavioral patterns, geographic data, IP reputation
- Perfect For: Training production-ready threat detection models
- Special Power: It’s like the Wikipedia of attacks — everything’s in here!
The Time Whisperer: attacks-daily
- Size: 676K records — laser-focused on temporal patterns
- What’s Inside: Daily chronicles of attacks with timestamps
- Features: Time series attacks, seasonal patterns, activity cycles
- Perfect For: Predicting “when” the next attack will happen
- Special Power: Shows that even hackers have daily routines!
The Compact Trainer: cyber-security-events
pyToshka/cyber-security-events
- Size: 15.1K events — perfect size for rapid experimentation
- What’s Inside: Curated selection of the most interesting attacks
- Features: Balanced mix of different attack types
- Perfect For: First steps and quick prototyping
- Special Power: Like a starter pack for ML researchers!
The Intrusion Specialist: network-intrusion-detection
pyToshka/network-intrusion-detection
- Size: 100 records — small but mighty
- What’s Inside: High-quality examples of network intrusions
- Features: Clear classifications, samples for IDS systems
- Perfect For: Intrusion detection system developers
- Special Power: Each record is a textbook example of “how NOT to secure your network”
Author’s Tip: Start with cyber-security-events to learn the basics, move to attacks-daily for temporal analysis, and finish with cyber-security-events-full for serious experiments. It’s like leveling up in a game: from newbie to expert!
Dataset Selection Guidelines
When using publicly available honeypot datasets from platforms like Hugging Face, consider these practical approaches:
Dataset Selection Criteria
- Data Recency: Choose datasets with recent attack patterns
- Volume and Variety: Ensure sufficient samples across different attack types
- Documentation: Look for well-documented datasets with clear feature descriptions
- Licensing: Verify appropriate usage rights for your use case
Example Integration with Hugging Face Datasets
from datasets import load_dataset
# Load a comprehensive security events dataset
dataset = load_dataset("pyToshka/cyber-security-events-full")
# Convert to pandas for easier manipulation
df = dataset['train'].to_pandas()
# Apply preprocessing and feature engineering pipeline
df = validate_network_data(df)
df = preprocess_security_events(df)
df = create_temporal_features(df)
df = create_behavioral_features(df)
df = create_geographic_features(df)  # provides the 'threat_ratio' feature used below

# Train your models
model = train_binary_classifier(df)
macOS M1/M2/M4 Compatibility
When running this code on Apple Silicon Macs (M1, M2, M4), you may encounter XGBoost installation issues. Here’s how to resolve them:
Installing Dependencies for Apple Silicon
# Install OpenMP runtime (required for XGBoost)
brew install libomp
# Install Python packages
pip install pandas numpy scikit-learn datasets xgboost
# If you encounter issues with XGBoost, try:
pip uninstall xgboost
pip install xgboost --no-cache-dir
Troubleshooting Common Issues
Issue: XGBoostError: XGBoost Library (libxgboost.dylib) could not be loaded
Solution:
brew install libomp
export LDFLAGS="-L/opt/homebrew/opt/libomp/lib"
export CPPFLAGS="-I/opt/homebrew/opt/libomp/include"
Issue: Performance issues on Apple Silicon
Solution: Ensure you’re using the native ARM64 Python installation, not x86_64 through Rosetta.
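A quick way to confirm which Python build you are running:

import platform
print(platform.machine())  # 'arm64' for native Apple Silicon Python, 'x86_64' under Rosetta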
Related Reading
- Mitigation Anomaly Revelation Keeper (MARK) - AI-powered threat analysis platform
- Integrating Wazuh with Ollama: Part 1 - AI-enhanced SIEM for threat detection
- How to Set Up a Custom Integration between Wazuh and MARK - Advanced security automation
- Amazon EKS SOC 2 Type II Compliance Checklist Part 1 - Enterprise security compliance