When enterprises in regulated industries talk about compliance automation, it sounds straightforward: write a few playbooks, run them regularly, generate a report, done. Reality looks different — especially when it comes to PCI DSS compliance for large VMware estates with thousands of ESXi hosts.
In this article, I share my approach to automating PCI DSS audits and remediations with Ansible at enterprise scale. It covers architecture decisions that have proven effective, pitfalls you should know about, and one central insight:
Ansible can handle enterprise-scale compliance automation, but only if you treat it as a software engineering project, not a scripting exercise.
The Problem
Large enterprises in regulated industries — banking, energy, travel and tourism — often operate VMware estates with thousands of ESXi hosts distributed across numerous vCenter instances and multiple physically separated data center zones. Every single host must be regularly audited against hardening standards: SSH configuration, NTP settings, syslog forwarding, lockdown mode, firewall rules, password policies, Active Directory integration, and more.
The requirements are clear:
- Audit all hosts against defined hardening controls
- Remediate deviations — as automated as possible
- Audit-ready reports that auditors can understand and accept
- Repeatability — the whole process must run regularly and reliably
Manual auditing is out of the question at this scale. But naive automation quickly hits its limits too.
My Architecture Approach
1. Dynamic Inventory: The Foundation for Everything
With thousands of hosts being constantly provisioned and decommissioned, a static inventory is not an option. And when the environment spans numerous vCenter instances, a simple inventory plugin per vCenter becomes unwieldy fast.
My approach: Use a single, aggregated data source as inventory. In VMware environments, VCF Operations (formerly vRealize Operations / Aria Operations) is ideal for this, as it already collects data from all vCenters. From there, you can develop a custom Ansible inventory plugin that generates the entire ESXi inventory via the VCF Operations API — including automatic grouping by vCenter, cluster, ESXi version, and location.
Why this matters:
- New hosts automatically appear in the inventory
- Decommissioned hosts automatically drop out
- Grouping enables targeted audit runs (e.g., only a specific cluster or ESXi version)
This plugin should be packaged as an Ansible Collection — with unit tests and code quality checks. For a system of this importance, that’s not optional.
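The grouping step can be sketched in a few lines of Python. This is a minimal illustration, not the plugin itself: the record fields (`name`, `vcenter`, `cluster`, `version`, `location`) and the `build_groups` helper are assumptions standing in for whatever the VCF Operations API actually returns after parsing.

```python
import re
from collections import defaultdict


def group_name(prefix: str, value: str) -> str:
    """Build a group name made only of letters, digits, and underscores."""
    return f"{prefix}_{re.sub(r'[^A-Za-z0-9_]', '_', value).lower()}"


def build_groups(hosts: list[dict]) -> dict[str, list[str]]:
    """Map each group name to the sorted list of hosts that belong to it."""
    groups: dict[str, set[str]] = defaultdict(set)
    for host in hosts:
        for prefix in ("vcenter", "cluster", "version", "location"):
            value = host.get(prefix)
            if value:  # skip attributes the data source did not report
                groups[group_name(prefix, value)].add(host["name"])
    return {name: sorted(members) for name, members in sorted(groups.items())}


# Hypothetical records, standing in for a parsed API response.
hosts = [
    {"name": "esx01.example.com", "vcenter": "vc-a", "cluster": "Prod-01",
     "version": "8.0.2", "location": "zone1"},
    {"name": "esx02.example.com", "vcenter": "vc-a", "cluster": "Prod-01",
     "version": "7.0.3", "location": "zone1"},
    {"name": "esx03.example.com", "vcenter": "vc-b", "cluster": "Test-01",
     "version": "8.0.2", "location": "zone2"},
]
inventory_groups = build_groups(hosts)
```

Inside a real inventory plugin, the same mapping would be fed to the inventory object through `self.inventory.add_group()` and `self.inventory.add_host()`. Keeping the grouping logic in a pure function like this makes it easy to cover with unit tests.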
2. One Playbook for Audit and Remediation
A common mistake is separating audit and remediation logic into different playbooks. This inevitably leads to drift: what gets checked and what gets fixed diverge over time.
My approach: A single playbook that covers both modes. Ansible’s --check mode is a powerful but often underestimated feature:
- ansible-playbook compliance.yml --check → Audit mode: checks without changing
- ansible-playbook compliance.yml → Remediation mode: fixes deviations
The prerequisite is a consistently idempotent implementation of all tasks. This means: every module first checks the current state and only makes changes when an actual deviation exists. Every module must also support check mode, so that a task reporting "changed" under --check marks a deviation without touching the host. Run the playbook a second time, and nothing happens, unless there are new deviations.
The advantages:
- No drift between audit and remediation logic
- Colleagues can trigger an audit or remediation with a single click in AWX/AAP
- The system is usable by non-developers
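The check-then-change pattern behind both modes can be illustrated with a small Python sketch. The `enforce` function, the `ControlResult` type, and the NTP settings are hypothetical names chosen for this example. A real module would read and write the state through the vSphere API rather than a dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class ControlResult:
    changed: bool
    diff: dict = field(default_factory=dict)


def enforce(current: dict, desired: dict, apply, check_mode: bool) -> ControlResult:
    """Compare current and desired settings; call apply() only outside check mode."""
    diff = {
        key: {"before": current.get(key), "after": value}
        for key, value in desired.items()
        if current.get(key) != value
    }
    if diff and not check_mode:
        apply({key: change["after"] for key, change in diff.items()})
    return ControlResult(changed=bool(diff), diff=diff)


# Hypothetical host state for an NTP control.
state = {"ntp_servers": ["10.0.0.1"], "ntp_enabled": True}
desired = {"ntp_servers": ["10.0.0.1", "10.0.0.2"], "ntp_enabled": True}

audit = enforce(state, desired, state.update, check_mode=True)   # reports only
state_after_audit = dict(state)                                  # still the old values
fix = enforce(state, desired, state.update, check_mode=False)    # applies the change
rerun = enforce(state, desired, state.update, check_mode=False)  # nothing left to do
```

The audit call and the remediation call run exactly the same comparison, which is what keeps the two modes from drifting apart. The third call shows the idempotency requirement: once the state matches, the result is `changed=False` with an empty diff.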
3. Custom Ansible Modules Over Community Modules
For enterprise-scale compliance, community modules (community.vmware) often aren’t sufficient. You need full control over API calls, error handling, and return values.
My approach: Custom Ansible modules in Python that communicate directly with the vSphere and ESXi APIs. For a typical VMware hardening implementation, this means developing around 30 custom modules — one per control.
Why custom modules?
- Full control over --check mode behavior
- Precise error handling and return values for the audit report
- No dependency on external collections that may change
- Ability to test your own modules with unit tests (pytest)
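One way to get testable modules is to keep the control logic separate from the API access. The sketch below assumes a hypothetical client interface (`get_service_running`, `stop_service`) and a `FakeHostClient` test double. Neither is part of any real library, and the result keys are only one possible layout.

```python
class FakeHostClient:
    """Test double for a vSphere/ESXi API wrapper (hypothetical interface)."""

    def __init__(self, ssh_running: bool, fail: bool = False):
        self.ssh_running = ssh_running
        self.fail = fail

    def get_service_running(self, service: str) -> bool:
        if self.fail:
            raise TimeoutError("API call timed out")
        return self.ssh_running

    def stop_service(self, service: str) -> None:
        self.ssh_running = False


def run_ssh_control(client, check_mode: bool) -> dict:
    """Ensure the SSH service is stopped; return a structured result for the report."""
    result = {"control": "ssh_disabled", "changed": False, "failed": False, "msg": ""}
    try:
        if client.get_service_running("TSM-SSH"):
            result["changed"] = True
            if check_mode:
                result["msg"] = "SSH service is running (would be stopped)"
            else:
                client.stop_service("TSM-SSH")
                result["msg"] = "SSH service was running and has been stopped"
        else:
            result["msg"] = "SSH service is stopped"
    except TimeoutError as exc:
        # An unreachable host is reported as a failure, not as a deviation.
        result["failed"] = True
        result["msg"] = f"could not query host: {exc}"
    return result


compliant = run_ssh_control(FakeHostClient(ssh_running=False), check_mode=True)
deviation = run_ssh_control(FakeHostClient(ssh_running=True), check_mode=True)
unreachable = run_ssh_control(FakeHostClient(ssh_running=True, fail=True), check_mode=True)
```

In an actual module, the returned dictionary would be handed to `module.exit_json()` or `module.fail_json()`, and `check_mode` would come from `module.check_mode`. Because the logic only depends on the client interface, pytest can exercise all three outcomes without a vCenter in reach.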
4. Audit Trail: Capturing Results Systematically
All task results should be captured automatically and in a structured format — not just in the Ansible console, but in a database. ARA (ARA Records Ansible) is excellent for this: an open-source tool that records all playbook results via an Ansible callback and makes them searchable.
5. The Compliance Report: Interactive, Not Static
Auditors don’t want to scroll through spreadsheets with thousands of rows. They want answers to questions like: “How many hosts running ESXi version X are still non-compliant?” or “Which controls are failing in cluster Y?”
My approach: A self-contained HTML report with embedded JavaScript (jQuery + DataTables). The audit data is embedded as JSON directly in the HTML, and a Python script generates the document from the ARA database.
The result:
- Color-coded — green for compliant hosts, red for pending remediation
- Filterable in real-time — by ESXi version, vCenter, cluster, or compliance status
- Dynamic summaries — statistics automatically adapt when filtering
- No setup required — just open the HTML file
This approach has proven extremely valuable during audits. Instead of scrolling through endless tables, the relevant data can be filtered and presented live.
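The idea of embedding the audit data as JSON in a single HTML file can be sketched as follows. This is a simplified illustration: it uses a few lines of plain JavaScript instead of jQuery and DataTables, and the record fields (`host`, `control`, `compliant`) are assumptions rather than the ARA data model.

```python
import html
import json

TEMPLATE = """<!DOCTYPE html>
<html><head><meta charset="utf-8"><title>{title}</title>
<style>.ok {{ background: #c8e6c9; }} .fail {{ background: #ffcdd2; }}</style>
</head><body>
<h1>{title}</h1>
<table id="results">
<thead><tr><th>Host</th><th>Control</th><th>Status</th></tr></thead>
<tbody></tbody>
</table>
<script id="audit-data" type="application/json">{data}</script>
<script>
const rows = JSON.parse(document.getElementById("audit-data").textContent);
const body = document.querySelector("#results tbody");
for (const r of rows) {{
  const tr = body.insertRow();
  tr.className = r.compliant ? "ok" : "fail";
  const status = r.compliant ? "compliant" : "pending remediation";
  for (const text of [r.host, r.control, status]) {{
    tr.insertCell().textContent = text;
  }}
}}
</script>
</body></html>
"""


def render_report(title: str, results: list[dict]) -> str:
    """Return a self-contained HTML document with the results embedded as JSON."""
    # Escape "</" so no value can terminate the surrounding <script> element.
    data = json.dumps(results).replace("</", "<\\/")
    return TEMPLATE.format(title=html.escape(title), data=data)


# Hypothetical audit records.
results = [
    {"host": "esx01.example.com", "control": "ssh_disabled", "compliant": True},
    {"host": "esx02.example.com", "control": "ntp_servers", "compliant": False},
]
report_html = render_report("ESXi hardening audit", results)
```

Writing `report_html` to a file yields a document that opens in any browser without a server. Filtering and live summaries would be layered on top of the same embedded JSON.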

The Pitfalls
Performance: Hours, Not Minutes
With thousands of hosts, execution time is a constant challenge. Even with optimal forks and batching, a complete audit run takes several hours. The causes:
- The nature of Ansible: for every task, a Python interpreter is started per host and a separate API call is made
- Slow or problematic hosts: older hosts with misconfigurations cause API timeouts
- Connectivity issues: unreachable hosts block progress
- AWX/AAP limits: at this scale, you hit boundaries with output logs and job runtimes
My recommendations:
- Segmentation: Split audits by physical data center zones or clusters and process them sequentially
- Overnight runs: Execute complete audits overnight, with automatic continuation only after error-free completion of the previous segment
- Caution with production workloads: Extra care with critical systems to avoid triggering incidents
- Separate connectivity playbooks: Identify unreachable hosts beforehand and — where possible — automatically repair them
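The segmentation and stop-on-error recommendations could be wired together roughly like this. The sketch assumes that each segment corresponds to an inventory group and that a non-zero exit code from ansible-playbook should halt the sequence. The group names and the `run_segments` helper are hypothetical.

```python
import subprocess
from typing import Callable, Sequence


def build_command(playbook: str, segment: str, audit_only: bool = True) -> list[str]:
    """Build the ansible-playbook call for one inventory group."""
    command = ["ansible-playbook", playbook, "--limit", segment]
    if audit_only:
        command.append("--check")
    return command


def run_command(command: Sequence[str]) -> int:
    """Run one command and return its exit code."""
    return subprocess.run(command, check=False).returncode


def run_segments(
    playbook: str,
    segments: Sequence[str],
    runner: Callable[[Sequence[str]], int] = run_command,
) -> list[str]:
    """Run segments in order; stop at the first non-zero exit code."""
    completed = []
    for segment in segments:
        if runner(build_command(playbook, segment)) != 0:
            break  # later segments wait until this one has been investigated
        completed.append(segment)
    return completed


def fake_runner(command: Sequence[str]) -> int:
    """Stand-in for a real run: pretend the second zone fails."""
    return 2 if "location_zone2" in command else 0


done = run_segments(
    "compliance.yml",
    ["location_zone1", "location_zone2", "location_zone3"],
    runner=fake_runner,
)
```

With the fake runner, only the first zone completes and the third is never started. Swapping in the default `run_command` would execute the real playbook, and the same ordering could equally be expressed as a workflow in AWX/AAP.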
Unexpected Byproducts
Connectivity problems with individual hosts almost inevitably lead to the development of dedicated diagnostic and repair playbooks. What starts as an obstacle becomes a standalone automation tool that delivers value well beyond the compliance context.
Software Engineering Meets Infrastructure as Code
This is the core of my conviction: Infrastructure automation at enterprise scale is software engineering. Treat it like a scripting project, and it collapses under its own complexity.
What this means in practice:
- Ansible Collection as the packaging unit for inventory plugin and custom modules
- Unit tests (pytest) for all Python components — inventory plugin and custom modules
- SonarQube integration for static code analysis — for both Python modules and Ansible YAML
- Idempotent implementation as the fundamental principle for all modules
- Clean API abstraction — custom modules encapsulate the complexity of vSphere/ESXi APIs
- Versioning and CI/CD — the collection goes through a pipeline like any other software
Without this approach, a project with 30 custom modules, an inventory plugin, and a report generator simply becomes unmaintainable.
Conclusion
Enterprise-scale compliance automation with Ansible is feasible — but it demands discipline. Here are the key principles I’ve taken away from projects of this kind:
- Dynamic inventory from an aggregated source removes the biggest obstacle in large, distributed VMware environments.
- One playbook for audit and remediation with consistent idempotency reduces complexity and prevents drift.
- Custom modules are inevitable. If you need full control over API calls, error handling, and check-mode behavior, there’s no way around writing your own Python modules.
- The report determines audit success. An interactive HTML document beats any static spreadsheet.
- Software engineering discipline is not optional. Unit tests, code analysis, collection packaging, CI/CD — without these practices, a project of this size becomes unmaintainable.
- Runtime is a reality. Hours, not minutes. Planning with segmentation and overnight runs is part of the architecture.
This approach is transferable — whether PCI DSS, SOX, ISO 27001, or other compliance frameworks. The tools and patterns remain the same. What matters is the engineering discipline behind them.
