When I started this project, the goal was simple: build a Wazuh lab that actually resembles what you'd deploy in a real managed security environment, not just a single VM doing everything. Single-node setups are great for understanding Wazuh's internals, but the moment you start thinking about high availability, agent load distribution, and index lifecycle management, you need the real thing.
This is a complete walkthrough of that lab. Every stage verified before moving to the next. All commands real. No hand-waving.
Infrastructure
The lab runs on a flat 192.168.90.0/24 subnet across 13 VMs:
| Component | VM | IP (192.168.90.x) | Role |
|---|---|---|---|
| Indexer cluster | wazuh-indexer-01/02/03 | .111, .113, .114 | OpenSearch data nodes |
| Manager cluster | wazuh-master-01 | .115 | Master node |
| Manager cluster | wazuh-worker-01/02 | .116, .117 | Worker nodes |
| Dashboard | wazuh-dashboard-01 | .118 | Web UI + API proxy |
| Load balancer | wazuh-lb-01 | .112 | HAProxy TCP frontend |
Why this topology? Agents only ever talk to HAProxy: enrollment traffic (port 1515) is forwarded to the master only, while event reporting (port 1514) is distributed round-robin across both workers. Filebeat on each server node ships alerts to the indexer cluster over TLS. The dashboard hits the indexer cluster on port 9200 and the master API on port 55000. Clean separation of concerns at every layer.
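A quick way to sanity-check that topology is to probe each port from a host the firewall allows. The sketch below is only an illustration: the function names are made up for this post, the purpose labels are mine, and it assumes an `nc` that supports `-z` and `-w` (netcat-openbsd on Ubuntu does).

```shell
#!/bin/sh
# Port map taken from the topology described above: "host port purpose".
lab_port_map() {
  cat <<'EOF'
192.168.90.112 1514 agent-reporting-via-haproxy
192.168.90.112 1515 agent-enrollment-via-haproxy
192.168.90.115 1515 enrollment-on-master
192.168.90.116 1514 reporting-on-worker-01
192.168.90.117 1514 reporting-on-worker-02
192.168.90.115 55000 manager-api
192.168.90.111 9200 indexer-01
192.168.90.113 9200 indexer-02
192.168.90.114 9200 indexer-03
EOF
}

# Probe every entry in the map and print OK or FAIL per port.
# Assumption: nc accepts -z (scan only) and -w (timeout in seconds).
check_lab_ports() {
  lab_port_map | while read -r host port purpose; do
    if nc -z -w 3 "$host" "$port" 2>/dev/null; then
      echo "OK   $host:$port ($purpose)"
    else
      echo "FAIL $host:$port ($purpose)"
    fi
  done
}

# Usage: check_lab_ports
```

Expect different results depending on where you run it: with the per-role UFW rules in place, an agent subnet host should only reach the load balancer entries.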
Monitored Endpoints
| VM | IP (192.168.90.x) | Role / deployment method |
|---|---|---|
| windows-ad-dc | .121 | Active Directory DC + DNS |
| win-agent-01/02 | .122, .123 | GPO startup script |
| linux-agent-01/02 | .119, .120 | Ansible playbook |
Architecture Overview
flowchart TB
subgraph EP["Endpoints"]
DC["windows-ad-dc<br/>192.168.90.121<br/>Active Directory DC"]
W1["win-agent-01<br/>192.168.90.122<br/>group: windows"]
W2["win-agent-02<br/>192.168.90.123<br/>group: windows"]
U1["ubuntu-agent-01<br/>192.168.90.119<br/>group: linux"]
U2["ubuntu-agent-02<br/>192.168.90.120<br/>group: linux"]
end
LB["wazuh-lb-01<br/>192.168.90.112<br/>HAProxy TCP"]
subgraph SRV["Wazuh server cluster"]
M["wazuh-master-01<br/>192.168.90.115<br/>master"]
K1["wazuh-worker-01<br/>192.168.90.116<br/>worker"]
K2["wazuh-worker-02<br/>192.168.90.117<br/>worker"]
end
subgraph IDX["Wazuh indexer cluster"]
I1["wazuh-indexer-01<br/>192.168.90.111"]
I2["wazuh-indexer-02<br/>192.168.90.113"]
I3["wazuh-indexer-03<br/>192.168.90.114"]
end
SNAP["Snapshot repo<br/>/mnt/wazuh-snapshots<br/>ISM: alerts 90d, archives 30d"]
D["wazuh-dashboard-01<br/>192.168.90.118"]
A["Admin / User browser"]
DC -.->|GPO pushes agent| W1
DC -.->|GPO pushes agent| W2
W1 -->|1514 event / 1515 enroll| LB
W2 -->|1514 event / 1515 enroll| LB
U1 -->|1514 event / 1515 enroll| LB
U2 -->|1514 event / 1515 enroll| LB
LB -->|1515 enrollment| M
LB -->|1514 reporting RR| K1
LB -->|1514 reporting RR| K2
M <-->|1516 cluster sync| K1
M <-->|1516 cluster sync| K2
M -->|Filebeat 9200| I1
K1 -->|Filebeat 9200| I2
K2 -->|Filebeat 9200| I3
I1 <-->|9300:9400 transport| I2
I2 <-->|9300:9400 transport| I3
I1 <-->|9300:9400 transport| I3
I1 -.->|snapshot| SNAP
I2 -.->|snapshot| SNAP
I3 -.->|snapshot| SNAP
D -->|9200 search| I1
D -->|55000 API| M
A -->|443 HTTPS| D
Agent traffic and load distribution path
sequenceDiagram
participant Agent
participant LB as wazuh-lb-01 (HAProxy)
participant Master as wazuh-master-01
participant W1 as wazuh-worker-01
participant W2 as wazuh-worker-02
Agent->>LB: 1515 enrollment request
LB->>Master: forward 1515 (enrollment backend)
Master-->>Agent: agent key issued, group assigned
Agent->>LB: 1514 events + keepalive
LB->>W1: round robin to worker-01
Note over W1: decode, rule match, generate alert
Agent->>LB: 1514 events + keepalive
LB->>W2: round robin to worker-02
Note over W2: decode, rule match, generate alert
Note over LB,W2: HAProxy health checks keep both workers in the pool
Note over LB,W2: if one is unavailable traffic continues on the other
Deployment Sequence
flowchart LR
P["Stage 0 to 1<br/>Prepare VMs +<br/>generate certs"] --> IX["Stage 2<br/>Indexer cluster<br/>green, 3 nodes"]
IX --> SC["Stage 3<br/>Server cluster<br/>master + 2 workers"]
SC --> DB["Stage 4<br/>Dashboard<br/>API online"]
DB --> LBD["Stage 5<br/>HAProxy LB<br/>backends up"]
LBD --> G["Stage 6<br/>Groups +<br/>agent.conf"]
G --> AG["Stage 7A/7B<br/>Agents via<br/>Ansible + AD GPO"]
AG --> IM["Stage 8<br/>ISM policies +<br/>snapshot repo"]
IM --> R["Stage 9<br/>Final validation<br/>+ lab report"]
Hardware Spec (The Honest Version)
All 8 server-side nodes: Ubuntu 22.04, 2 GB RAM, 128 GB disk. JVM heap on indexer nodes: 1 GB (-Xms1g -Xmx1g). Swap: 4 GB swapfile on every node, vm.swappiness=10.
This is the constrained profile: tight, but functional for a low-volume PoC or home lab. If you're running this on a laptop with limited RAM, set your expectations: the dashboard takes 3–5 minutes on first load, and you'll see swap pressure during indexer initialization. That's fine. It works.
For anything resembling real throughput testing, bump indexer nodes to 4 GB RAM minimum (2 GB heap), or go straight to Profile B with 16 GB per indexer node.
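Both sizing examples above use a heap of half the node's RAM (2 GB RAM with 1 GB heap, 4 GB RAM with 2 GB heap). A minimal sketch of that ratio, with hypothetical function names, that prints the two JVM flags in megabytes:

```shell
#!/bin/sh
# Heap in MB for a given amount of RAM in MB, using the half-of-RAM
# ratio from the lab's two examples (2048 -> 1024, 4096 -> 2048).
heap_mb_for_ram_mb() {
  echo $(( $1 / 2 ))
}

# Print the matching -Xms/-Xmx lines (min and max heap kept equal).
jvm_heap_flags() {
  heap=$(heap_mb_for_ram_mb "$1")
  printf '%s\n' "-Xms${heap}m" "-Xmx${heap}m"
}

# Usage: jvm_heap_flags 4096
# The two lines replace the existing -Xms/-Xmx entries in
# /etc/wazuh-indexer/jvm.options, followed by an indexer restart.
```

This is only the ratio, not a tuning guide: on nodes this small, the memory left for the OS page cache matters as much as the heap itself.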
Stage 0–1: OS Baseline and Certificates
Before any Wazuh package touches the nodes, the baseline matters. I configured all 8 server nodes with:
- Hostnames and /etc/hosts entries for every node (FQDN resolution is mandatory: Wazuh certificates are tied to node names)
- chrony for NTP sync across all nodes
- 4 GB swapfile + vm.swappiness=10
- vm.max_map_count=262144 on all three indexer nodes (OpenSearch will refuse to start without this)
- UFW rules per role: indexers only accept from server nodes and the dashboard, managers only accept from agents and the load balancer
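The swap and sysctl parts of that baseline can be sketched as below. The function names and the drop-in file name `99-wazuh-lab.conf` are my own choices for illustration, not something the lab prescribes, and `apply_baseline` needs root.

```shell
#!/bin/sh
# Print the sysctl settings for a node role.
# "indexer" gets vm.max_map_count in addition to the swappiness setting.
baseline_sysctl() {
  echo "vm.swappiness=10"
  if [ "$1" = "indexer" ]; then
    echo "vm.max_map_count=262144"
  fi
}

# Create the 4 GB swapfile and load the sysctl settings (run as root).
# Assumption: /swapfile does not exist yet and the filesystem supports fallocate.
apply_baseline() {
  fallocate -l 4G /swapfile && chmod 600 /swapfile
  mkswap /swapfile && swapon /swapfile
  baseline_sysctl "$1" > /etc/sysctl.d/99-wazuh-lab.conf
  sysctl --system
}

# Usage: apply_baseline indexer    (on indexer nodes)
#        apply_baseline server     (on every other node)
```

To keep the swapfile across reboots it also needs an entry in /etc/fstab, which the sketch leaves out.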
Certificates were generated once on wazuh-indexer-01 using wazuh-certs-tool.sh -A with a config.yml listing all 8 nodes. The output is a single wazuh-certificates.tar (50K, 18 cert files) distributed via scp to each node. This is the one step you cannot skip or do later: every mTLS connection in the cluster depends on these certs.
Stage 2: Indexer Cluster
Installed wazuh-indexer 4.14.5-1 (pinned) on all three indexer nodes, distributed the certs, configured opensearch.yml per node with the correct node.name, network.host, and discovery.seed_hosts pointing to all three nodes.
Cluster initialization runs on one node only:
/usr/share/wazuh-indexer/bin/indexer-security-init.sh
Post-init validation:
curl -XGET https://192.168.90.111:9200/_cluster/health \
-u admin:<INDEXER_ADMIN_PASSWORD> --cacert /etc/wazuh-indexer/certs/root-ca.pem | python3 -m json.tool
Target state: "status": "green", "number_of_nodes": 3, "unassigned_shards": 0, "active_shards_percent_as_number": 100.0. I hit green on the first attempt once I had resolved a timing issue: the security init script needs all three nodes up and reachable before it runs, not just one.
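That target state is easy to turn into a gate you can loop on instead of eyeballing JSON. A minimal sketch, assuming only grep is available (no jq); `cluster_is_healthy` is a hypothetical helper name:

```shell
#!/bin/sh
# Read a _cluster/health JSON body on stdin. Succeed only if the status is
# green, the node count matches $1, and there are no unassigned shards.
# The patterns tolerate both compact and pretty-printed JSON.
cluster_is_healthy() {
  expected_nodes="$1"
  body=$(cat)
  printf '%s\n' "$body" | grep -Eq '"status" *: *"green"' || return 1
  printf '%s\n' "$body" | grep -Eq "\"number_of_nodes\" *: *${expected_nodes}([^0-9]|\$)" || return 1
  printf '%s\n' "$body" | grep -Eq '"unassigned_shards" *: *0([^0-9]|$)' || return 1
}

# Usage: poll until the cluster reaches the target state.
# until curl -s https://192.168.90.111:9200/_cluster/health \
#     -u admin:<INDEXER_ADMIN_PASSWORD> \
#     --cacert /etc/wazuh-indexer/certs/root-ca.pem | cluster_is_healthy 3
# do sleep 10; done
```

Text matching is good enough for a lab gate; for anything scripted long-term, a real JSON parser is the safer choice.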
Stage 3: Server Cluster
Installed wazuh-manager 4.14.5-1 and filebeat 7.10.2 on all three server nodes. The cluster is defined in ossec.conf on each node:
<cluster>
<name>wazuh</name>
<node_name>wazuh-master-01</node_name>
<node_type>master</node_type>
<key>65eee392122e08d63ee68141da37398b</key>
<port>1516</port>
<bind_addr>0.0.0.0</bind_addr>
<nodes>
<node>192.168.90.115</node>
</nodes>
<hidden>no</hidden>
<disabled>no</disabled>
</cluster>
The cluster key must be identical across all three nodes. Workers differ only in <node_name> and <node_type>worker</node_type>; the <nodes> entry points at the master on every node.
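To make the per-node differences explicit, here is a small sketch that renders the block for any node from three arguments. `render_cluster_block` is a hypothetical helper, not a Wazuh tool; everything except the node name, node type and key is copied from the master block above.

```shell
#!/bin/sh
# Render a <cluster> block. Only node_name and node_type vary per node;
# the key is shared and <nodes> always lists the master's address.
render_cluster_block() {
  node_name="$1"; node_type="$2"; cluster_key="$3"
  cat <<EOF
<cluster>
  <name>wazuh</name>
  <node_name>${node_name}</node_name>
  <node_type>${node_type}</node_type>
  <key>${cluster_key}</key>
  <port>1516</port>
  <bind_addr>0.0.0.0</bind_addr>
  <nodes>
    <node>192.168.90.115</node>
  </nodes>
  <hidden>no</hidden>
  <disabled>no</disabled>
</cluster>
EOF
}

# Usage:
# KEY=$(openssl rand -hex 16)          # 32 hex characters, generated once
# render_cluster_block wazuh-worker-01 worker "$KEY"
# render_cluster_block wazuh-worker-02 worker "$KEY"
```

Generating the key once and passing it to every render is the point: a mismatched key is one of the easier ways to end up with workers that never join.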
Filebeat on each node is configured to talk to all three indexers:
output.elasticsearch:
  hosts:
    - 192.168.90.111:9200
    - 192.168.90.113:9200
    - 192.168.90.114:9200
The enrollment password is stored in /var/ossec/etc/authd.pass on the master, and password authentication is switched on in the <auth> block. Agents authenticate with this password during enrollment, so no manual key extraction is needed.
Cluster validation:
/var/ossec/bin/cluster_control -l
Expected output:
NAME TYPE VERSION ADDRESS
wazuh-master-01 master 4.14.5 192.168.90.115
wazuh-worker-01 worker 4.14.5 192.168.90.116
wazuh-worker-02 worker 4.14.5 192.168.90.117
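If you want that check to fail loudly instead of relying on a visual scan, the listing can be piped through a small awk filter. A sketch, with a made-up function name, that expects exactly one master and a given number of workers:

```shell
#!/bin/sh
# Read `cluster_control -l` output on stdin and succeed only if it lists
# exactly one master and $1 workers. The header row is ignored because
# its second column is "TYPE".
cluster_roles_ok() {
  awk -v want_workers="$1" '
    $2 == "master" { masters++ }
    $2 == "worker" { workers++ }
    END { exit !(masters == 1 && workers == want_workers) }
  '
}

# Usage:
# /var/ossec/bin/cluster_control -l | cluster_roles_ok 2 && echo "cluster complete"
```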
Stage 4–5: Dashboard and HAProxy
Dashboard install is straightforward. The interesting part is the HAProxy config, because this is what makes the whole thing production-like.
HAProxy listens on two frontends:
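# Enrollment: every agent registers against the master only
frontend wazuh_enrollment
bind *:1515
mode tcp
default_backend wazuh-master-enrollment
# Reporting: agent events are spread across both workers
frontend wazuh_reporting
bind *:1514
mode tcp
default_backend wazuh-workers
backend wazuh-master-enrollment
mode tcp
server wazuh-master-01 192.168.90.115:1515 check
backend wazuh-workers
mode tcp
balance roundrobin
# "check" enables health checks so a dead worker leaves the rotation
server wazuh-worker-01 192.168.90.116:1514 check
server wazuh-worker-02 192.168.90.117:1514 check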
Enrollment always hits the master (agents need to get their key from a single authoritative source). Event reporting goes round-robin across workers.
Failover test: stopped wazuh-manager on worker-01, confirmed HAProxy stats showed it DOWN and traffic continued on worker-02 without agent disconnects. Started it back, confirmed automatic return to rotation. This is the kind of validation step that most lab guides skip. Don't.
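The same check can be read from the command line rather than the stats page. The sketch below parses HAProxy's `show stat` CSV, where the first field is the proxy name, the second the server name and the eighteenth the status. It assumes the admin socket is enabled at /run/haproxy/admin.sock (the Ubuntu package default) and that socat is installed; neither is shown in the config above, and the function name is hypothetical.

```shell
#!/bin/sh
# Read HAProxy "show stat" CSV on stdin and print "proxy/server status"
# for each real server, skipping the header and the FRONTEND/BACKEND
# summary rows.
haproxy_server_status() {
  awk -F, '
    /^#/ { next }
    NF >= 18 && $2 != "FRONTEND" && $2 != "BACKEND" { print $1 "/" $2, $18 }
  '
}

# Usage (assumed socket path):
# echo "show stat" | socat stdio /run/haproxy/admin.sock | haproxy_server_status
```

During the failover test, the worker-01 line should flip to DOWN after the health checks fail and return to UP once the manager is started again.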
Stage 6–7: Agent Groups and Mass Deployment
Agent groups (windows, linux) are created on the master and define centralized agent.conf per group:
- Windows group: pulls Security, System, Application, and Sysmon event channels
- Linux group: monitors auth.log, syslog, and audit.log
Agents pull their group config automatically after enrollment. No per-agent configuration.
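For the Linux group, a shared agent.conf covering those three log sources could look like the output of the sketch below. The exact file paths (in particular /var/log/audit/audit.log) and the function name are assumptions on my part; the lab's real agent.conf is not reproduced here.

```shell
#!/bin/sh
# Print a shared agent.conf for the linux group with three log sources.
# Assumed paths: Ubuntu defaults for auth.log, syslog and the auditd log.
linux_agent_conf() {
  cat <<'EOF'
<agent_config>
  <localfile>
    <log_format>syslog</log_format>
    <location>/var/log/auth.log</location>
  </localfile>
  <localfile>
    <log_format>syslog</log_format>
    <location>/var/log/syslog</location>
  </localfile>
  <localfile>
    <log_format>audit</log_format>
    <location>/var/log/audit/audit.log</location>
  </localfile>
</agent_config>
EOF
}

# Usage on the master (as root):
# /var/ossec/bin/agent_groups -a -g linux -q
# linux_agent_conf > /var/ossec/etc/shared/linux/agent.conf
```

Once the file is in the group's shared directory, agents in that group pick it up on their own; nothing has to be copied to the endpoints.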
Ubuntu Agents via Ansible
A single Ansible playbook handles all Linux agents:
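# Assumes the Wazuh apt repository is already configured on the targets
- name: Deploy Wazuh agent
  hosts: linux_agents
  become: true
  tasks:
    - name: Install wazuh-agent
      apt:
        name: wazuh-agent=4.14.5-1
        state: present
        update_cache: true
      # Deployment variables read by the package at install time
      environment:
        WAZUH_MANAGER: "192.168.90.112"
        WAZUH_AGENT_GROUP: "linux"
        WAZUH_REGISTRATION_PASSWORD: "WazuhEnroll2024!"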
Both agents enrolled through the load balancer at .112 and were auto-assigned to the linux group. Scaling to 50 Linux hosts is just adding entries to the inventory file.
Windows Agents via Active Directory GPO
This is where most guides give up. The setup:
- Promote windows-ad-dc as the forest root for lab.local
- Domain join both win-agent-01 and win-agent-02
- Create a GPO with a Computer Startup Script that runs the Wazuh MSI installer silently with deployment variables (passed as MSI properties) for manager IP, group, and enrollment password
- GPO applies on next boot (gpupdate /force refreshes the policy, but a startup script still only runs at boot)
Both Windows agents enrolled successfully, were assigned to the windows group, and appeared active in the dashboard. The GPO approach mirrors exactly how you'd deploy agents across a real Windows fleet, and it scales to 500 machines with zero additional effort.
Stage 8–9: Index Management and Final Validation
Two ISM policies applied:
- wazuh-alerts-policy: 90-day retention on alert indices
- wazuh-archives-policy: 30-day retention on archive indices
Snapshot repository configured at /mnt/wazuh-snapshots. A test snapshot (snapshot-test-02) completed with SUCCESS status.
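A retention policy of this kind can be expressed as a two-state ISM policy: indices sit in one state until they reach the age limit, then move to a state that deletes them. The sketch below renders such a policy body; the state names, the description text and the helper name are my own, and the lab's actual policies may be structured differently.

```shell
#!/bin/sh
# Print an ISM policy that deletes indices matching $1 once they are
# older than $2 days. State names "hot" and "delete" are illustrative.
ism_retention_policy() {
  pattern="$1"; days="$2"
  cat <<EOF
{
  "policy": {
    "description": "Delete ${pattern} indices after ${days} days",
    "default_state": "hot",
    "states": [
      {
        "name": "hot",
        "actions": [],
        "transitions": [
          { "state_name": "delete", "conditions": { "min_index_age": "${days}d" } }
        ]
      },
      { "name": "delete", "actions": [ { "delete": {} } ], "transitions": [] }
    ],
    "ism_template": { "index_patterns": ["${pattern}"], "priority": 100 }
  }
}
EOF
}

# Usage:
# ism_retention_policy 'wazuh-alerts-*' 90 | curl -s -XPUT \
#   "https://192.168.90.111:9200/_plugins/_ism/policies/wazuh-alerts-policy" \
#   -u admin:<INDEXER_ADMIN_PASSWORD> --cacert /etc/wazuh-indexer/certs/root-ca.pem \
#   -H 'Content-Type: application/json' -d @-
```

The ism_template section only attaches the policy to indices created after it exists; indices that are already there have to be attached separately.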
End-to-end validation: triggered failed SSH login attempts on linux-agent-01. Rule 5710 alerts appeared in OpenSearch within seconds of the events, searchable through the dashboard. This is the real test: not just "all services are running" but "the entire pipeline from endpoint event to indexed alert works."
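The same check works without the dashboard by querying the alert indices directly. A minimal sketch of the search body, filtering on the rule ID and a recent time window; the helper name and the result size of 5 are arbitrary choices for illustration.

```shell
#!/bin/sh
# Print a search body for the newest alerts of one rule ID ($1) within
# a time window ($2, in OpenSearch date-math units such as 15m or 1h).
rule_alert_query() {
  rule_id="$1"; window="$2"
  cat <<EOF
{
  "size": 5,
  "sort": [ { "timestamp": "desc" } ],
  "query": {
    "bool": {
      "filter": [
        { "term": { "rule.id": "${rule_id}" } },
        { "range": { "timestamp": { "gte": "now-${window}" } } }
      ]
    }
  }
}
EOF
}

# Usage:
# rule_alert_query 5710 15m | curl -s -XPOST \
#   "https://192.168.90.111:9200/wazuh-alerts-*/_search" \
#   -u admin:<INDEXER_ADMIN_PASSWORD> --cacert /etc/wazuh-indexer/certs/root-ca.pem \
#   -H 'Content-Type: application/json' -d @-
```

A non-empty hits array shortly after the failed logins confirms the whole path: agent, load balancer, worker, Filebeat, indexer.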
Final agent status: 4 active agents (001 agent-linux-01, 002 agent-linux-02, 003 win-agent-02, 004 win-agent-01). All reporting through the load balancer, all indexed.
Key Lessons
1. Version pinning is not optional. I pinned 4.14.5-1 across every component: manager, indexer, dashboard, agents. Mixed versions in a Wazuh cluster cause silent failures that are miserable to debug.
2. Do the OS baseline properly. vm.max_map_count=262144, swap, NTP, FQDN resolution: skip any of these and you'll debug it later under pressure.
3. Validate before progressing. Every stage has a verification step. The cluster health check after Stage 2, cluster_control -l after Stage 3, HAProxy stats after Stage 5. Don't skip them.
4. HAProxy is the right answer for agent load distribution. Wazuh doesn't have native agent-level load balancing. An HAProxy TCP frontend gives you health checking, failover, and round-robin distribution with a 20-line config.
5. GPO for Windows agents, Ansible for Linux. Both are the production-grade approach in their respective ecosystems. Learning both in the same lab is the point.
What's Next
This lab is the foundation for several follow-up projects:
- Custom detection rulesets (Linux and Windows use cases)
- SOAR integration via Wazuh active response
- PPL Rule Engine for detection-as-code on top of OpenSearch
- Anomaly detection pipeline using Isolation Forest + Markov models
The infrastructure is stable. Now the interesting work starts.
Tags
Wazuh, SIEM, Detection Engineering, SOC, OpenSearch, HAProxy, Ansible, Active Directory, Linux, Windows
