DETECTION ENGINEERING

Building a Production Ready Wazuh SIEM: Multi Node Distributed Deployment

A production ready Wazuh 4.14.5 SIEM deployment across 13 VMs, featuring distributed indexers, clustered managers, HAProxy load balancing, and automated Windows/Linux agent deployment via GPO and Ansible. This post covers the end-to-end cluster design, deployment flow, and agent onboarding to make the lab feel closer to a real SOC environment.

Dimasqi Ramadhani · 9 min read

When I started this project, the goal was simple: build a Wazuh lab that actually resembles what you'd deploy in a real managed security environment, not just a single VM doing everything. Single-node setups are great for understanding Wazuh's internals, but the moment you start thinking about high availability, agent load distribution, and index lifecycle management, you need the real thing.

This is a complete walkthrough of that lab. Every stage was verified before moving to the next. All commands are real. No hand-waving.


Infrastructure

The lab runs on a flat 192.168.90.0/24 subnet across 13 VMs:

| Component       | VM                         | IP (192.168.90.x) | Role                  |
|-----------------|----------------------------|-------------------|-----------------------|
| Indexer cluster | wazuh-indexer-01/02/03     | .111, .113, .114  | OpenSearch data nodes |
| Manager cluster | wazuh-manager-master       | .115              | Master node           |
| Manager cluster | wazuh-manager-worker-01/02 | .116, .117        | Worker nodes          |
| Dashboard       | wazuh-dashboard            | .118              | Web UI + API proxy    |
| Load balancer   | wazuh-lb-01                | .112              | HAProxy TCP frontend  |

Why this topology? Enrollment traffic goes to the master only (port 1515), event reporting is distributed round-robin across both workers (port 1514) through HAProxy, and Filebeat on each server node ships alerts to the indexer cluster over TLS. The dashboard hits the indexer cluster on port 9200 and the master API on port 55000. Clean separation of concerns at every layer.

Monitored Endpoints

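| VM                | IP (192.168.90.x) | Role / deployment method  |
|-------------------|-------------------|---------------------------|
| windows-ad-dc     | .121              | Active Directory DC + DNS |
| win-agent-01/02   | .122, .123        | GPO startup script        |
| linux-agent-01/02 | .119, .120        | Ansible playbook          |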

Architecture Overview

flowchart TB
    subgraph EP["Endpoints"]
        DC["windows-ad-dc<br/>192.168.90.121<br/>Active Directory DC"]
        W1["win-agent-01<br/>192.168.90.122<br/>group: windows"]
        W2["win-agent-02<br/>192.168.90.123<br/>group: windows"]
        U1["ubuntu-agent-01<br/>192.168.90.119<br/>group: linux"]
        U2["ubuntu-agent-02<br/>192.168.90.120<br/>group: linux"]
    end

    LB["wazuh-lb-01<br/>192.168.90.112<br/>HAProxy TCP"]

    subgraph SRV["Wazuh server cluster"]
        M["wazuh-master-01<br/>192.168.90.115<br/>master"]
        K1["wazuh-worker-01<br/>192.168.90.116<br/>worker"]
        K2["wazuh-worker-02<br/>192.168.90.117<br/>worker"]
    end

    subgraph IDX["Wazuh indexer cluster"]
        I1["wazuh-indexer-01<br/>192.168.90.111"]
        I2["wazuh-indexer-02<br/>192.168.90.113"]
        I3["wazuh-indexer-03<br/>192.168.90.114"]
    end

    SNAP["Snapshot repo<br/>/mnt/wazuh-snapshots<br/>ISM: alerts 90d, archives 30d"]

    D["wazuh-dashboard-01<br/>192.168.90.118"]
    A["Admin / User browser"]

    DC -.->|GPO pushes agent| W1
    DC -.->|GPO pushes agent| W2
    W1 -->|1514 event / 1515 enroll| LB
    W2 -->|1514 event / 1515 enroll| LB
    U1 -->|1514 event / 1515 enroll| LB
    U2 -->|1514 event / 1515 enroll| LB

    LB -->|1515 enrollment| M
    LB -->|1514 reporting RR| K1
    LB -->|1514 reporting RR| K2

    M <-->|1516 cluster sync| K1
    M <-->|1516 cluster sync| K2

    M -->|Filebeat 9200| I1
    K1 -->|Filebeat 9200| I2
    K2 -->|Filebeat 9200| I3

    I1 <-->|9300:9400 transport| I2
    I2 <-->|9300:9400 transport| I3
    I1 <-->|9300:9400 transport| I3

    I1 -.->|snapshot| SNAP
    I2 -.->|snapshot| SNAP
    I3 -.->|snapshot| SNAP

    D -->|9200 search| I1
    D -->|55000 API| M
    A -->|443 HTTPS| D

Agent traffic and load distribution path

sequenceDiagram
    participant Agent
    participant LB as wazuh-lb-01 (HAProxy)
    participant Master as wazuh-master-01
    participant W1 as wazuh-worker-01
    participant W2 as wazuh-worker-02

    Agent->>LB: 1515 enrollment request
    LB->>Master: forward 1515 (enrollment backend)
    Master-->>Agent: agent key issued, group assigned

    Agent->>LB: 1514 events + keepalive
    LB->>W1: round robin to worker-01
    Note over W1: decode, rule match, generate alert

    Agent->>LB: 1514 events + keepalive
    LB->>W2: round robin to worker-02
    Note over W2: decode, rule match, generate alert

    Note over LB,W2: HAProxy health checks keep both workers in the pool
    Note over LB,W2: if one is unavailable traffic continues on the other

Deployment Sequence

flowchart LR
    P["Stage 0 to 1<br/>Prepare VMs +<br/>generate certs"] --> IX["Stage 2<br/>Indexer cluster<br/>green, 3 nodes"]
    IX --> SC["Stage 3<br/>Server cluster<br/>master + 2 workers"]
    SC --> DB["Stage 4<br/>Dashboard<br/>API online"]
    DB --> LBD["Stage 5<br/>HAProxy LB<br/>backends up"]
    LBD --> G["Stage 6<br/>Groups +<br/>agent.conf"]
    G --> AG["Stage 7A/7B<br/>Agents via<br/>Ansible + AD GPO"]
    AG --> IM["Stage 8<br/>ISM policies +<br/>snapshot repo"]
    IM --> R["Stage 9<br/>Final validation<br/>+ lab report"]

Hardware Spec (The Honest Version)

All 8 server-side nodes: Ubuntu 22.04, 2 GB RAM, 128 GB disk. JVM heap on indexer nodes: 1 GB (-Xms1g -Xmx1g). Swap: 4 GB swapfile on every node, vm.swappiness=10.

This is the constrained profile: tight, but functional for a low-volume PoC or home lab. If you're running this on a laptop with limited RAM, set your expectations: the dashboard takes 3–5 minutes on first load, and you'll see swap pressure during indexer initialization. That's fine. It works.

For anything resembling real throughput testing, bump indexer nodes to 4 GB RAM minimum (2 GB heap), or go straight to Profile B with 16 GB per indexer node.
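The heap figures above (2 GB RAM with a 1 GB heap, 4 GB RAM with a 2 GB heap) follow the usual "half of RAM" rule of thumb. Here is a minimal sketch of that arithmetic. The function names are hypothetical, and the 31 GB ceiling is my assumption based on common OpenSearch sizing guidance, not something this lab tested.

```python
def indexer_heap_gb(ram_gb: float, cap_gb: float = 31.0) -> float:
    """Return a heap size in GB: half of RAM, never above cap_gb (assumed ceiling)."""
    if ram_gb <= 0:
        raise ValueError("ram_gb must be positive")
    return min(ram_gb / 2, cap_gb)


def jvm_heap_flags(ram_gb: float) -> str:
    """Render matching -Xms/-Xmx flags in megabytes for jvm.options."""
    heap_mb = int(indexer_heap_gb(ram_gb) * 1024)
    return f"-Xms{heap_mb}m -Xmx{heap_mb}m"


if __name__ == "__main__":
    # The three RAM sizes mentioned in this section.
    for ram in (2, 4, 16):
        print(f"{ram} GB RAM -> {jvm_heap_flags(ram)}")
```

Setting -Xms and -Xmx to the same value, as the lab does, avoids heap resizing at runtime.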


Stage 0–1: OS Baseline and Certificates

Before any Wazuh package touches the nodes, the baseline matters. I configured all 8 server nodes with:

  • Hostnames and /etc/hosts entries for every node (FQDN resolution is mandatory because Wazuh certificates are tied to node names)
  • chrony for NTP sync across all nodes
  • 4 GB swapfile + vm.swappiness=10
  • vm.max_map_count=262144 on all three indexer nodes (OpenSearch will refuse to start without this)
  • UFW rules per role: indexers only accept connections from server nodes and the dashboard, managers only accept connections from agents and the load balancer
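The kernel settings in that list are easy to get wrong on one node out of eight, so a small preflight check helps. This is a sketch only, assuming a Linux host with procfs mounted; the function names are hypothetical and the thresholds are the ones from the list above.

```python
from pathlib import Path

MIN_MAX_MAP_COUNT = 262144  # required on indexer nodes (value from the baseline)
MAX_SWAPPINESS = 10         # value from the baseline


def read_sysctl(name: str, root: Path = Path("/proc/sys")) -> int:
    """Read an integer sysctl such as 'vm.swappiness' from procfs (Linux only)."""
    return int((root / name.replace(".", "/")).read_text().strip())


def baseline_problems(values: dict, is_indexer: bool) -> list:
    """Return human-readable problems; an empty list means the node passes."""
    problems = []
    swappiness = values.get("vm.swappiness")
    if swappiness is None or swappiness > MAX_SWAPPINESS:
        problems.append(f"vm.swappiness is {swappiness}, expected <= {MAX_SWAPPINESS}")
    if is_indexer:
        max_map = values.get("vm.max_map_count")
        if max_map is None or max_map < MIN_MAX_MAP_COUNT:
            problems.append(
                f"vm.max_map_count is {max_map}, expected >= {MIN_MAX_MAP_COUNT}"
            )
    return problems


if __name__ == "__main__":
    names = ("vm.swappiness", "vm.max_map_count")
    current = {name: read_sysctl(name) for name in names}
    for problem in baseline_problems(current, is_indexer=True):
        print("FAIL:", problem)
```

Run it with is_indexer=False on manager and dashboard nodes, where only the swappiness value is checked.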

Certificates were generated once on wazuh-indexer-01 using wazuh-certs-tool.sh -A with a config.yml listing all 8 nodes. The output is a single wazuh-certificates.tar (50K, 18 cert files), distributed via scp to each node. This is the one step you cannot skip or defer: every mTLS connection in the cluster depends on these certs.


Stage 2: Indexer Cluster

I installed wazuh-indexer 4.14.5-1 (pinned) on all three indexer nodes, distributed the certs, and configured opensearch.yml per node with the correct node.name and network.host, plus discovery.seed_hosts pointing to all three nodes.

Cluster initialization runs on one node only:

/usr/share/wazuh-indexer/bin/indexer-security-init.sh

Post-init validation:

curl -s -XGET "https://192.168.90.111:9200/_cluster/health" \
  -u "admin:<INDEXER_ADMIN_PASSWORD>" --cacert /etc/wazuh-indexer/certs/root-ca.pem | python3 -m json.tool

Target state: "status": "green", "number_of_nodes": 3, "unassigned_shards": 0, "active_shards_percent_as_number": 100.0. I hit green on the first attempt once I resolved a timing issue: the security init script needs all three nodes up and reachable before it runs, not just one.
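Eyeballing JSON gets old quickly, so the target state can be turned into a pass/fail check. This is a minimal sketch, assuming the health response is piped in on stdin in place of json.tool; the function name is hypothetical, and it reads the numeric active_shards_percent_as_number field that the cluster health API returns.

```python
import json

# Target values for this three-node lab.
EXPECTED = {"status": "green", "number_of_nodes": 3, "unassigned_shards": 0}


def health_problems(raw_json: str) -> list:
    """Compare a _cluster/health response with the target state."""
    health = json.loads(raw_json)
    problems = [
        f"{key}: expected {want!r}, got {health.get(key)!r}"
        for key, want in EXPECTED.items()
        if health.get(key) != want
    ]
    percent = health.get("active_shards_percent_as_number", 0.0)
    if percent < 100.0:
        problems.append(f"active shards at {percent}%, expected 100%")
    return problems


if __name__ == "__main__":
    import sys

    issues = health_problems(sys.stdin.read())
    print("\n".join(issues) or "cluster health OK")
    sys.exit(1 if issues else 0)
```

The non-zero exit code makes it usable as a gate before moving on to Stage 3.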


Stage 3: Server Cluster

Installed wazuh-manager 4.14.5-1 and filebeat 7.10.2 on all three server nodes. The cluster is defined in ossec.conf on each node:

<cluster>
  <name>wazuh</name>
  <node_name>wazuh-master-01</node_name>
  <node_type>master</node_type>
  <key>65eee392122e08d63ee68141da37398b</key>
  <port>1516</port>
  <bind_addr>0.0.0.0</bind_addr>
  <nodes>
    <node>192.168.90.115</node>
  </nodes>
  <hidden>no</hidden>
  <disabled>no</disabled>
</cluster>

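The cluster key must be identical across all three nodes. Workers differ in two places: each has its own <node_name>, and <node_type> is set to worker. The <nodes> entry on every worker still points to the master at 192.168.90.115.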
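Since the three blocks differ only in node name and type, they can be rendered from one template. The sketch below is illustrative, not the tooling I used: the function names are hypothetical, and the key generator assumes the 32-character key format seen in the block above.

```python
import secrets

CLUSTER_TEMPLATE = """<cluster>
  <name>wazuh</name>
  <node_name>{node_name}</node_name>
  <node_type>{node_type}</node_type>
  <key>{key}</key>
  <port>1516</port>
  <bind_addr>0.0.0.0</bind_addr>
  <nodes>
    <node>{master_ip}</node>
  </nodes>
  <hidden>no</hidden>
  <disabled>no</disabled>
</cluster>"""


def new_cluster_key() -> str:
    """Generate a random 32-character hex key (16 random bytes)."""
    return secrets.token_hex(16)


def render_cluster_block(node_name: str, node_type: str, key: str, master_ip: str) -> str:
    """Render the <cluster> block for one node; the key is shared by all nodes."""
    if node_type not in ("master", "worker"):
        raise ValueError("node_type must be 'master' or 'worker'")
    if len(key) != 32:
        raise ValueError("cluster key must be 32 characters long")
    return CLUSTER_TEMPLATE.format(
        node_name=node_name, node_type=node_type, key=key, master_ip=master_ip
    )


if __name__ == "__main__":
    key = new_cluster_key()  # generate once, reuse on every node
    for name, kind in (("wazuh-master-01", "master"), ("wazuh-worker-01", "worker")):
        print(render_cluster_block(name, kind, key, "192.168.90.115"))
```

Generating the key once and passing it to every render call removes the most common cause of workers failing to join.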

Filebeat on each node is configured to talk to all three indexers:

output.elasticsearch:
  hosts:
    - 192.168.90.111:9200
    - 192.168.90.113:9200
    - 192.168.90.114:9200

The enrollment password is stored in /var/ossec/etc/authd.pass on the master and enabled with <use_password>yes</use_password> in the <auth> block. Agents authenticate with this password during enrollment, so no manual key extraction is needed.

Cluster validation:

/var/ossec/bin/cluster_control -l

Expected output:

NAME             TYPE    VERSION  ADDRESS
wazuh-master-01  master  4.14.5   192.168.90.115
wazuh-worker-01  worker  4.14.5   192.168.90.116
wazuh-worker-02  worker  4.14.5   192.168.90.117
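The expected output can also be checked mechanically. This is a sketch under the assumption that cluster_control -l prints a whitespace-separated table with the four column headers shown above; the function names are hypothetical.

```python
def parse_cluster_nodes(output: str) -> list:
    """Turn the whitespace-separated table into a list of dicts keyed by column."""
    lines = [line for line in output.splitlines() if line.strip()]
    if not lines:
        return []
    columns = [name.lower() for name in lines[0].split()]
    return [dict(zip(columns, line.split())) for line in lines[1:]]


def cluster_problems(nodes: list, version: str = "4.14.5", workers: int = 2) -> list:
    """Check for exactly one master, the expected worker count, and one version."""
    problems = []
    masters = [n for n in nodes if n.get("type") == "master"]
    found_workers = [n for n in nodes if n.get("type") == "worker"]
    if len(masters) != 1:
        problems.append(f"expected 1 master, found {len(masters)}")
    if len(found_workers) != workers:
        problems.append(f"expected {workers} workers, found {len(found_workers)}")
    for node in nodes:
        if node.get("version") != version:
            problems.append(f"{node.get('name')} runs {node.get('version')}")
    return problems


SAMPLE = """NAME             TYPE    VERSION  ADDRESS
wazuh-master-01  master  4.14.5   192.168.90.115
wazuh-worker-01  worker  4.14.5   192.168.90.116
wazuh-worker-02  worker  4.14.5   192.168.90.117
"""

if __name__ == "__main__":
    print(cluster_problems(parse_cluster_nodes(SAMPLE)) or "cluster OK")
```

The version check doubles as a guard against a node that was upgraded or installed unpinned.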

Stage 4–5: Dashboard and HAProxy

The dashboard install is straightforward. The interesting part is the HAProxy config, because this is what makes the whole thing production-like.

HAProxy listens on two frontends:

# mode tcp is set explicitly in every section so the config does not
# depend on whatever the defaults section of haproxy.cfg declares.
frontend wazuh_enrollment
  bind *:1515
  mode tcp
  default_backend wazuh-master-enrollment

frontend wazuh_reporting
  bind *:1514
  mode tcp
  default_backend wazuh-workers

backend wazuh-master-enrollment
  mode tcp
  server wazuh-master-01 192.168.90.115:1515 check

backend wazuh-workers
  mode tcp
  balance roundrobin
  server wazuh-worker-01 192.168.90.116:1514 check
  server wazuh-worker-02 192.168.90.117:1514 check

Enrollment always hits the master (agents need to get their key from a single authoritative source). Event reporting goes round-robin across workers.

Failover test: I stopped wazuh-manager on worker-01, confirmed that HAProxy stats showed it DOWN, and verified that traffic continued on worker-02 without agent disconnects. After starting it again, I confirmed it returned to rotation automatically. This is the kind of validation step that most lab guides skip. Don't.
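To make the failover behavior concrete, here is a toy model of a round-robin pool that skips backends marked down. It only illustrates the behavior observed in the test above; it is not how HAProxy is implemented, and the class and method names are hypothetical.

```python
class RoundRobinPool:
    """Toy model: rotate over servers, skipping any that failed a health check."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.up = {server: True for server in self.servers}
        self._next = 0

    def mark(self, server: str, is_up: bool) -> None:
        """Record the result of a health check for one server."""
        self.up[server] = is_up

    def pick(self) -> str:
        """Return the next healthy server in rotation order."""
        for _ in range(len(self.servers)):
            server = self.servers[self._next % len(self.servers)]
            self._next += 1
            if self.up[server]:
                return server
        raise RuntimeError("no healthy backend available")


if __name__ == "__main__":
    pool = RoundRobinPool(["wazuh-worker-01", "wazuh-worker-02"])
    print([pool.pick() for _ in range(4)])   # alternates between both workers
    pool.mark("wazuh-worker-01", False)      # simulate stopping wazuh-manager
    print([pool.pick() for _ in range(4)])   # only worker-02 is selected
    pool.mark("wazuh-worker-01", True)       # service restarted
    print([pool.pick() for _ in range(4)])   # worker-01 is back in rotation
```

The three phases mirror the manual test: normal rotation, one worker down, and automatic return once the check passes again.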


Stage 6–7: Agent Groups and Mass Deployment

Agent groups (windows, linux) are created on the master, and each one defines a centralized agent.conf:

  • Windows group: pulls Security, System, Application, and Sysmon event channels
  • Linux group: monitors auth.log, syslog, and audit.log

Agents pull their group config automatically after enrollment. No per-agent configuration.
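As an illustration of what the Linux group's centralized config contains, the sketch below renders an agent.conf with one <localfile> entry per log. The full paths and log formats are my assumptions for a stock Ubuntu host (the text above only names the files), and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# (log_format, location) pairs; paths are assumed Ubuntu defaults.
LINUX_LOGS = [
    ("syslog", "/var/log/auth.log"),
    ("syslog", "/var/log/syslog"),
    ("audit", "/var/log/audit/audit.log"),
]


def render_agent_conf(logs) -> str:
    """Build an <agent_config> document with one <localfile> per log source."""
    root = ET.Element("agent_config")
    for log_format, location in logs:
        localfile = ET.SubElement(root, "localfile")
        ET.SubElement(localfile, "log_format").text = log_format
        ET.SubElement(localfile, "location").text = location
    ET.indent(root, space="  ")  # requires Python 3.9+
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # On the master, the result would go to /var/ossec/etc/shared/linux/agent.conf
    print(render_agent_conf(LINUX_LOGS))
```

Building the file with an XML library rather than string concatenation keeps it well-formed, which matters because a broken shared agent.conf is pushed to every agent in the group.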

Ubuntu Agents via Ansible

A single Ansible playbook handles all Linux agents:

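# Assumes the Wazuh apt repository is already configured on the hosts.
- name: Deploy Wazuh agent
  hosts: linux_agents
  become: true
  tasks:
    - name: Install wazuh-agent (pinned); the variables below drive enrollment
      apt:
        name: wazuh-agent=4.14.5-1
        state: present
      environment:
        WAZUH_MANAGER: "192.168.90.112"
        WAZUH_AGENT_GROUP: "linux"
        WAZUH_REGISTRATION_PASSWORD: "WazuhEnroll2024!"

    - name: Enable and start the agent service
      systemd:
        name: wazuh-agent
        enabled: true
        state: started
        daemon_reload: true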

Both agents enrolled through the load balancer at .112 and were automatically assigned to the linux group. Scaling to 50 Linux hosts is just a matter of adding entries to the inventory file.

Windows Agents via Active Directory GPO

This is where most guides give up. The setup:

  1. Promote windows-ad-dc as the forest root for lab.local
  2. Domain join both win-agent-01 and win-agent-02
  3. Create a GPO with a Computer Startup Script that runs the Wazuh MSI installer silently, passing the manager IP, group, and enrollment password as MSI properties
  4. GPO applies on next boot or gpupdate /force

Both Windows agents enrolled successfully, were assigned to the windows group, and appeared active in the dashboard. The GPO approach mirrors how you'd deploy agents across a real Windows fleet: scaling to 500 machines takes little additional effort.
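The silent install in step 3 boils down to one msiexec command line. Here is a sketch that assembles it, assuming the same WAZUH_* names used for the Linux agents are passed as MSI properties. The MSI path and password in the usage example are placeholders, and the function name is hypothetical.

```python
def msi_install_command(msi_path: str, manager: str, group: str, password: str) -> str:
    """Build a silent msiexec command line for the Wazuh agent MSI."""
    properties = {
        "WAZUH_MANAGER": manager,
        "WAZUH_AGENT_GROUP": group,
        "WAZUH_REGISTRATION_PASSWORD": password,
    }
    # Keep the sketch simple: refuse values that would need quote escaping.
    for value in (msi_path, *properties.values()):
        if '"' in value:
            raise ValueError("double quotes are not supported in this sketch")
    rendered = " ".join(f'{key}="{value}"' for key, value in properties.items())
    return f'msiexec.exe /i "{msi_path}" /q {rendered}'


if __name__ == "__main__":
    # Placeholder path and password; substitute your own share and secret.
    print(
        msi_install_command(
            r"\\lab.local\NETLOGON\wazuh-agent.msi",
            "192.168.90.112",
            "windows",
            "<ENROLLMENT_PASSWORD>",
        )
    )
```

The manager address is the load balancer at .112, so Windows agents enroll and report through the same path as the Linux ones.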

Stage 8–9: Index Management and Final Validation

Two ISM policies applied:

  • wazuh-alerts-policy: 90-day retention on alert indices
  • wazuh-archives-policy: 30-day retention on archive indices

Snapshot repository configured at /mnt/wazuh-snapshots. A test snapshot (snapshot-test-02) completed with SUCCESS status.
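For reference, a retention-only ISM policy is a small JSON document. The sketch below builds one for each policy named above. The state names ("hot", "delete") and the index patterns (wazuh-alerts-*, wazuh-archives-*) are my assumptions, not copied from the lab, and the function name is hypothetical.

```python
import json


def retention_policy(description: str, index_pattern: str, max_age: str) -> dict:
    """Build an ISM policy that deletes indices once they reach max_age."""
    return {
        "policy": {
            "description": description,
            "default_state": "hot",
            "states": [
                {
                    "name": "hot",
                    "actions": [],
                    "transitions": [
                        {"state_name": "delete", "conditions": {"min_index_age": max_age}}
                    ],
                },
                {"name": "delete", "actions": [{"delete": {}}], "transitions": []},
            ],
            # Attach the policy automatically to newly created matching indices.
            "ism_template": [{"index_patterns": [index_pattern], "priority": 1}],
        }
    }


POLICIES = {
    "wazuh-alerts-policy": retention_policy("Alert retention", "wazuh-alerts-*", "90d"),
    "wazuh-archives-policy": retention_policy("Archive retention", "wazuh-archives-*", "30d"),
}

if __name__ == "__main__":
    # Each body would be sent with PUT _plugins/_ism/policies/<policy_id>.
    for policy_id, body in POLICIES.items():
        print(policy_id, json.dumps(body, indent=2))
```

Note that ism_template only affects indices created after the policy exists; existing indices need the policy attached explicitly.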

End-to-end validation: I triggered failed SSH login attempts on linux-agent-01. Rule 5710 alerts (sshd: attempt to log in with a non-existent user) appeared in OpenSearch within seconds of the events and were searchable through the dashboard. This is the real test: not just "all services are running" but "the entire pipeline from endpoint event to indexed alert works."
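The same check can be scripted instead of clicked through in the dashboard. This sketch only builds the search body, assuming the standard Wazuh alert fields rule.id, agent.name, and timestamp; the function name is hypothetical and sending the request is left out.

```python
import json


def recent_rule_hits_query(rule_id: str, agent_name: str, minutes: int = 5) -> dict:
    """Build a search body for alerts of one rule from one agent in a recent window."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return {
        "size": 10,
        "sort": [{"timestamp": {"order": "desc"}}],
        "query": {
            "bool": {
                "filter": [
                    {"term": {"rule.id": rule_id}},
                    {"term": {"agent.name": agent_name}},
                    {"range": {"timestamp": {"gte": f"now-{minutes}m"}}},
                ]
            }
        },
    }


if __name__ == "__main__":
    # The body would be posted to wazuh-alerts-*/_search on any indexer node.
    print(json.dumps(recent_rule_hits_query("5710", "linux-agent-01"), indent=2))
```

A non-empty hits list within a minute or two of the failed logins confirms the full path from agent to worker to Filebeat to the indexers.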

Final agent status: 4 active agents (001 agent-linux-01, 002 agent-linux-02, 003 win-agent-02, 004 win-agent-01). All reporting through the load balancer, all indexed.


Key Lessons

1. Version pinning is not optional. I pinned 4.14.5-1 across every component: manager, indexer, dashboard, agents. Mixed versions in a Wazuh cluster cause silent failures that are miserable to debug.

2. Do the OS baseline properly. vm.max_map_count=262144, swap, NTP, FQDN resolution: skip any of these and you'll debug it later under pressure.

3. Validate before progressing. Every stage has a verification step: the cluster health check after Stage 2, cluster_control -l after Stage 3, HAProxy stats after Stage 5. Don't skip them.

4. HAProxy is the right answer for agent load distribution. Wazuh doesn't have native agent-level load balancing. An HAProxy TCP frontend gives you health checking, failover, and round-robin distribution with a config of about 20 lines.

5. GPO for Windows agents, Ansible for Linux. Both are the production-grade approach in their respective ecosystems. Learning both in the same lab is the point.


What's Next

This lab is the foundation for several follow-up projects:

  • Custom detection rulesets (Linux and Windows use cases)
  • SOAR integration via Wazuh active response
  • PPL Rule Engine for detection-as-code on top of OpenSearch
  • Anomaly detection pipeline using Isolation Forest + Markov models

The infrastructure is stable. Now the interesting work starts.

Result

Screenshots: Home Page, Agent List, Discover Menu, IT Hygiene.


Tags

Wazuh, SIEM, Detection Engineering, SOC, OpenSearch, HAProxy, Ansible, Active Directory, Linux, Windows
