Idempotent Network Config with Ansible and Jinja2

The promise of network automation is that running the same playbook twice changes nothing the second time. The reality, for a lot of teams, is a playbook that appends the same lines on every run, reports “changed” forever, and occasionally pushes a render so wrong it takes a device offline. The gap between those two is idempotency and the guardrails around it. This is how to structure Ansible roles and Jinja2 templates so a second run is genuinely a no-op, and how to stop a bad render before it ever reaches a device.

What idempotency actually means here

Idempotent means applying the configuration repeatedly produces the same result as applying it once. For network config that is a higher bar than it sounds, because of how config gets pushed. If your automation sends a list of commands, the device runs them in config mode and merges them with whatever is already there. Send the same commands again and most are no-ops — but anything stateful, anything that appends, and anything the device normalises differently from how you wrote it will report a change, or worse, stack up.

The symptom is familiar: a playbook that always says “changed: [leaf01]” even when nothing is different. That is not cosmetic. It means you can no longer trust “changed” to mean something is wrong, so you stop reading it, so the day a real unexpected change appears, nobody notices. Idempotency is what makes “changed” meaningful again.

Why naive automation is not idempotent

The naive pattern renders a template into a block of CLI and pushes it with a config module in merge mode. Three things break idempotency:

  • Ordering and normalisation — you write permit ip any any and the device stores permit ip any any log or reorders ACL entries. Your text no longer matches the device’s text, so every run is a “change.”
  • Append-only constructs — banners, some SNMP and AAA lines, and anything without a clean negation can accumulate rather than replace.
  • No notion of removal — merge mode only adds. Delete a VLAN from your data model and a merge push leaves it on the device forever, so the device drifts further from intent with every change.

The fix is to stop thinking in commands and start thinking in declared intent: describe the full desired state, and let a tool compute the difference against the running config.

Structure: separate data from template

Everything downstream depends on one discipline — the data that varies between devices lives in variables, and the template contains no facts, only structure. A template with a hard-coded VLAN ID in it is a bug waiting to happen.

The data model

Put shared facts in group_vars and per-device facts in host_vars. Group by role and by site so a device inherits the right layers — all, then leaf, then site_lon, then the host. A leaf’s variables might look like:

# host_vars/leaf01.yml
hostname: leaf01
loopback: 10.255.1.1
asn: 4200000001
uplinks:
  - { intf: Ethernet1, peer: spine01, ip: 10.0.1.1/31 }
  - { intf: Ethernet2, peer: spine02, ip: 10.0.2.1/31 }
vlans:
  - { id: 10, name: web,  svi: 10.10.10.1/24 }
  - { id: 20, name: app,  svi: 10.10.20.1/24 }

Notice there is not a single CLI command in there. It is a description of what the device is, not how to configure it. The same data model can render Arista, Cisco, or FRR config from three different templates, and it is the thing you review in pull requests — humans reason about intent far better than about diffs of generated CLI.
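The layering described above can be sketched as a YAML inventory. The file path and the second host, leaf02, are assumptions for illustration. Sibling groups merge in alphabetical order by default, which happens to put leaf before site_lon; ansible_group_priority is the setting to reach for if your group names do not sort so conveniently.

```yaml
# inventory/hosts.yml (hypothetical path and hosts, for illustration only)
all:
  children:
    leaf:              # role layer, values in group_vars/leaf.yml
      hosts:
        leaf01:
        leaf02:
    site_lon:          # site layer, values in group_vars/site_lon.yml
      hosts:
        leaf01:
        leaf02:
      vars:
        # Optional: make site values beat role values regardless of
        # group name ordering (the default priority is 1). This must be
        # set in the inventory itself, not in group_vars/.
        ansible_group_priority: 10
```

Per-device facts still live in host_vars/leaf01.yml and win over both group layers.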

The template

The Jinja2 template turns that data into config and contains only loops and conditionals. Keep logic minimal; if a template needs real computation, do it in the data layer or a filter plugin instead.

{# templates/leaf.j2 #}
hostname {{ hostname }}
!
{% for v in vlans %}
vlan {{ v.id }}
   name {{ v.name }}
interface Vlan{{ v.id }}
   ip address {{ v.svi }}
{% endfor %}
!
{% for u in uplinks %}
interface {{ u.intf }}
   description >> {{ u.peer }}
   no switchport
   ip address {{ u.ip }}
{% endfor %}
!
router bgp {{ asn }}
   router-id {{ loopback }}

Render this with the same data twice and you get byte-identical output. That determinism is the foundation everything else is built on — but identical text is not yet idempotent on the device. For that you need to push it the right way.
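As a rough sketch, rendering can be its own task that writes the intended config to the control node without contacting the device. The intended/ output directory is an assumed layout, not something the original prescribes.

```yaml
# Render the intended config locally; nothing touches the device here.
# The intended/ directory is an assumption for illustration.
- name: Render intended config for each leaf
  ansible.builtin.template:
    src: leaf.j2                                   # resolved from templates/
    dest: "intended/{{ inventory_hostname }}.cfg"
    mode: "0644"
  delegate_to: localhost
```

Run it twice with unchanged data and the second run reports ok for this task, because the template module only reports changed when the rendered file content differs.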

Push declaratively, not by merging

The step that actually delivers idempotency is replacing the running config with intended config rather than merging commands into it. Two mature approaches:

Config replace. Cisco’s config replace, and the platform *_config modules in replace mode, diff your rendered intended config against the running config and apply only the delta, adding what is missing and removing what should not be there. Run it once and the device converges to intent. Run it again and the diff is empty, so Ansible reports ok, not changed. That is idempotency you can see.
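A minimal sketch of a full replace, assuming Arista EOS, the arista.eos collection, and a rendered file at a hypothetical intended/ path. One caveat matters a great deal: a full replace needs the complete device config, so a partial template like the leaf example would have to grow management, AAA, and everything else before being pushed this way. Check the module documentation for your collection version before relying on the options shown.

```yaml
# Assumes Arista EOS; intended/<host>.cfg is a hypothetical path and must
# contain the COMPLETE intended config, not a fragment.
- name: Replace running config with rendered intent
  arista.eos.eos_config:
    src: "intended/{{ inventory_hostname }}.cfg"
    replace: config     # replace the whole config rather than merge lines
```

Running the play with --check --diff first should show the delta without applying it.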

Resource modules. The structured *_l3_interfaces, *_bgp_address_family, and similar modules take structured data (the same shape as your data model) and reconcile state per resource, including removals, without you templating CLI at all. They are the cleanest path where they cover your features.
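For illustration, here is roughly what a resource module call looks like for VLANs, assuming Arista EOS and the arista.eos.eos_vlans module. Note that the module expects vlan_id where the example data model uses id, so some mapping between the two is needed in practice; the values are written out literally here to keep the sketch simple.

```yaml
# Assumes Arista EOS. Values are copied by hand from the example data model.
- name: Make VLANs on the device match the data model exactly
  arista.eos.eos_vlans:
    config:
      - vlan_id: 10
        name: web
      - vlan_id: 20
        name: app
    state: overridden   # VLANs on the device but not listed here are removed
```

The overridden state is what supplies the removal behaviour that merge mode lacks; merged would only add.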

Whichever you use, drive it through a diff. Generate the candidate, show the diff against running with --check --diff, and only then apply. The diff is the single most valuable artefact in the whole pipeline — it is the thing a human approves before anything touches a device.

Validation guardrails

Idempotency stops you pushing redundant change. Guardrails stop you pushing wrong change. Layer them so a mistake is caught as early and as cheaply as possible.

Validate the data before it renders

The cheapest bug to fix is the one caught before render. Validate host_vars/group_vars against a schema so a typo’d field or an out-of-range VLAN fails immediately. Ansible’s ansible.utils.validate with a JSON schema, or a small pre-commit hook, both work. Catch vlan: 5000 here, not on the device.
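A sketch of the schema check, assuming the ansible.utils collection and the jsonschema Python library are installed on the control node. The schema itself is hypothetical and kept inline for brevity; in a real repo it would more likely live in its own file.

```yaml
- name: Validate VLAN data against a JSON schema before rendering
  ansible.utils.validate:
    data: "{{ vlans }}"
    criteria: "{{ vlan_schema }}"
    engine: ansible.utils.jsonschema
  vars:
    vlan_schema:              # hypothetical schema, for illustration only
      type: array
      items:
        type: object
        required: [id, name]
        properties:
          id:
            type: integer
            minimum: 1
            maximum: 4094     # an id of 5000 fails here, before any render
          name:
            type: string
```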

Assert intent inside the play

Use assert tasks to encode rules the schema cannot: no two SVIs share a subnet, every uplink has a /31, the management VRF is never touched. These read as executable documentation of what “valid” means for your network.

- name: every leaf must reach two spines
  ansible.builtin.assert:
    that:
      - uplinks | length == 2
      - uplinks | map(attribute='peer') | unique | length == 2
    fail_msg: "{{ hostname }} does not have two distinct spine uplinks"
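The /31 rule mentioned above could be written in the same style. This sketch uses a plain regular-expression test on the string rather than real address parsing, which is an assumption that the ip field always carries a prefix length.

```yaml
- name: every uplink must be a /31
  ansible.builtin.assert:
    that:
      # Drop every address that ends in /31; nothing should be left over.
      - uplinks | map(attribute='ip') | reject('match', '.*/31$') | list | length == 0
    fail_msg: "{{ hostname }} has an uplink that is not a /31"
```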

Model the network before you touch it

For changes with real blast radius, validate the behaviour of the candidate config, not just its syntax. Batfish parses your rendered configs and answers questions about the resulting network offline — will these two endpoints have a path, does this ACL actually permit what you intended, do any BGP sessions fail to come up. Running that in CI against the candidate catches whole classes of outage that a per-device diff cannot see, because they are emergent properties of the whole fabric.

Put it in a pipeline

Wire the layers into CI so they run on every change to the repo, in order of cost: lint and schema-validate the data, render the templates, run Batfish against the result, and only on a merged, approved change does the pipeline run the playbook in check mode and surface the diff for a human to approve. The repo is the source of truth; the device is downstream of it. Nobody logs into a switch to make a change — they change the data and let the pipeline prove it is safe.
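The ordering can be sketched as a CI definition. The article does not name a CI system, so GitLab CI is an assumption here, and the job names, playbook names, and the Batfish wrapper script are all hypothetical.

```yaml
# .gitlab-ci.yml (sketch; job names, playbooks, and ci/batfish_checks.py
# are hypothetical and stand in for whatever your repo actually uses)
stages: [validate, render, model, plan]

lint-and-schema:
  stage: validate
  script:
    - yamllint .
    - ansible-lint
    - ansible-playbook validate.yml       # schema and assert tasks only

render-configs:
  stage: render
  script:
    - ansible-playbook render.yml         # writes intended/*.cfg
  artifacts:
    paths: [intended/]

batfish-checks:
  stage: model
  script:
    - python ci/batfish_checks.py intended/

check-mode-diff:
  stage: plan
  rules:
    - if: '$CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH'   # merged changes only
  script:
    - ansible-playbook deploy.yml --check --diff
```

The cheap stages run first, so a typo fails in seconds and never reaches the stages that render configs or model the fabric.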

Deploy without holding your breath

Even a validated change should land carefully on a live fabric.

  • Run in batches (serial, for a rolling deploy) so a bad change hits one leaf, not forty, and the play halts on first failure.
  • Use commit-confirm where the platform supports it — the device reverts automatically if you do not confirm within a window, so a change that severs your own session undoes itself.
  • Verify after, not just before — follow the push with checks that BGP sessions are up and the loopback is reachable, and fail the run if they are not. A deploy that does not verify its own success is just hope with a progress bar.
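A rough sketch of the batching and verification points, again assuming Arista EOS and a hypothetical intended/ directory holding complete rendered configs. The BGP check is deliberately crude: it only waits for the text Estab to appear in the summary output, which proves at least one session is up, not all of them. Commit-confirm is left out because its mechanics are platform-specific.

```yaml
- name: Roll out intended config one leaf at a time
  hosts: leaf
  gather_facts: false
  serial: 1                 # one device per batch
  any_errors_fatal: true    # stop the whole play on the first failure
  tasks:
    - name: Replace running config with rendered intent
      arista.eos.eos_config:
        src: "intended/{{ inventory_hostname }}.cfg"   # hypothetical path
        replace: config

    - name: Wait for at least one BGP session to reach Established
      arista.eos.eos_command:
        commands:
          - show ip bgp summary
        wait_for:
          - result[0] contains Estab   # crude text match, an assumption
        retries: 10
        interval: 6
```

If the verification task fails on the first leaf, the play stops there and the remaining leaves are never touched.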

The shape of a sane setup

Pulled together, the pattern is small and boring, which is the goal. Facts live in version-controlled variables. Templates are pure structure and render deterministically. Config is pushed by replacing intent, so a second run is a true no-op and “changed” means something again. And a stack of guardrails — schema, asserts, Batfish, check-mode diff — catches mistakes long before they reach a device. None of it is exotic. It is the difference between automation you trust to run unattended and a playbook you watch nervously every single time.
