Terraform Bring-Up Adventures
This write-up continues the prior post: I create a fully reproducible environment that deploys a 3-node lab using QEMU/KVM, libvirt, Terraform, cloud-init, AppArmor tuning, and sysctl kernel configuration.
1. Overview
TXGrid is a reproducible 3-node lab environment that runs entirely on your workstation using:
- QEMU/KVM via libvirt
- Netplan-hosted Linux bridge for LAN access
- A private libvirt network for cluster traffic
- Terraform provisioning
- Cloud-init for automation
- Sysctl & AppArmor adjustments to support routing & QEMU access
This runbook documents the steps, beyond those in the prior post, needed to bring up:
- txgrid-cp0 (control plane)
- txgrid-wk1 (worker 1)
- txgrid-wk2 (worker 2)
All nodes run the Ubuntu Server 24.04 cloud image.
2. Prerequisites & Host Preparation
Refer to the prior blog post kvm_libvirt_setup_guide for:
- Installation of QEMU/KVM, libvirt, and supporting packages
- Bridge creation via Netplan
- User/group permissions for libvirt and KVM
- Cloud-init basics
This guide focuses only on what is required in addition to the earlier setup.
3. Architecture Overview
TXGrid uses:
- A Linux bridge (br0) for external network access
- A libvirt-managed internal network for inter-node traffic
- Terraform to orchestrate all VMs, disks, NICs, and cloud-init configurations
- Cloud-init for hostname assignment, SSH access, package installs, and bootstrap scripting
4. Terraform Bring-Up (High Level)
Terraform handles:
- Volume creation (qcow2 cloning)
- Domain definitions for each VM
- NIC attachment (bridge + private network)
- Cloud-init ISO injection
- VM boot ordering and metadata
The full configs and modules are on GitHub at my-ai-journey/tree/v0.1.1/infra/terraform. The code is under active development and changes often, so I have linked v0.1.1, the version I wrote for this runbook.
5. AppArmor Adjustments
Ubuntu’s default AppArmor profiles may block certain QEMU operations, especially when:
- Using bridged networking
- Accessing custom disk paths or cloud-init data
- Running Terraform-created domains that reference nonstandard directories
Symptoms include:
- qemu-system: failed to open /dev/...
- AppArmor DENIED messages in dmesg or /var/log/syslog
- Terraform-created domains failing to launch
The following override resolves the Permission denied errors that occur when QEMU attempts to read Terraform-created files.
Create the folder:
sudo mkdir -p /etc/apparmor.d/abstractions/libvirt-qemu.d/
Edit: /etc/apparmor.d/abstractions/libvirt-qemu.d/override
Add this line (the trailing comma is required, so do not omit it):
/var/lib/libvirt/images/** rwk,
Reload AppArmor and libvirtd:
sudo systemctl restart apparmor
sudo systemctl restart libvirtd
Check status:
sudo aa-status | grep libvirtd
You should see libvirt-related profiles in enforce mode, with your updated rules in effect.
   libvirtd
   libvirtd//qemu_bridge_helper
   /usr/sbin/libvirtd (50117) libvirtd
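To confirm that AppArmor is really what is blocking QEMU, it can help to pull the denied paths out of the kernel log before and after adding the override. The following is a minimal sketch, not part of the TXGrid repo: the function name denied_paths is hypothetical, and it assumes the usual audit line layout in which libvirt's per-VM profiles are named libvirt-<uuid> and the denied file appears in a name="..." field.

```shell
# Print each distinct path that AppArmor denied to a libvirt-managed
# QEMU profile. Reads kernel or syslog lines on standard input.
denied_paths() {
  grep 'apparmor="DENIED"' |
    grep 'profile="libvirt-' |
    sed -n 's/.* name="\([^"]*\)".*/\1/p' |
    sort -u
}
```

Typical use would be `sudo dmesg | denied_paths` or `denied_paths < /var/log/syslog`. Any path it prints that sits outside the directories already allowed is a candidate for a rule in the override file. Denials logged against the libvirtd profile itself are not matched by this filter.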
6. Kernel Tuning (sysctl)
This setting is required to support forwarding and inter-VM routing. Without this kernel parameter enabled, the bridged adapter on each node will not receive an IP address.
- Allow IPv4 forwarding between br0 and txgrid-net
Modify (the append must run as root, so pipe through sudo tee; with sudo echo ... >> the redirect runs as your own user and fails):
echo "net.ipv4.ip_forward = 1" | sudo tee -a /etc/sysctl.conf
Apply:
sudo sysctl --system
Verify:
sysctl net.ipv4.ip_forward
Expected:
net.ipv4.ip_forward = 1
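Appending to /etc/sysctl.conf adds a duplicate line every time the command is rerun. A small idempotent variant is sketched below. The helper name ensure_sysctl_line and the SUDO variable are illustrative choices, not something the repo provides. It appends the key only when no uncommented assignment for it already exists in the file.

```shell
# Append "KEY = VALUE" to a sysctl config file unless an uncommented
# assignment for KEY is already present.
# Usage: ensure_sysctl_line FILE KEY VALUE
# Set SUDO=sudo when FILE is writable only by root.
ensure_sysctl_line() {
  conf_file="$1"
  conf_key="$2"
  conf_value="$3"

  # Compare the text left of "=" (spaces and tabs removed) with the key.
  # Commented lines such as "#net.ipv4.ip_forward=1" do not match.
  if awk -F'=' -v key="$conf_key" '
       { lhs = $1; gsub(/[ \t]/, "", lhs) }
       lhs == key { found = 1 }
       END { exit !found }
     ' "$conf_file" 2>/dev/null; then
    return 0
  fi

  printf '%s = %s\n' "$conf_key" "$conf_value" | ${SUDO:-} tee -a "$conf_file" >/dev/null
}
```

On the host this would be called as `SUDO=sudo ensure_sysctl_line /etc/sysctl.conf net.ipv4.ip_forward 1`, followed by `sudo sysctl --system` to apply it. The helper only checks whether the key is assigned, not which value it has, so an existing `net.ipv4.ip_forward = 0` line would still need to be edited by hand.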
7. Migrating to the Latest Terraform libvirt Provider
Status: Migration is a work in progress. I created a WIP branch with partial changes, but I did not finish migrating to the latest provider because of upstream bugs and instability. I am not recommending an upgrade at this time.
WIP-Code: my-ai-journey/tree/upgrade_libvirt_provider_to_0_9_x
I also have a discussion open with the maintainer at terraform-provider-libvirt/discussions/1231; time will tell whether upstream stabilizes.
Observed issues during attempted migration to 0.9.0 (and why I paused):
- Breaking changes in resource attributes that required nontrivial refactors of node modules.
- Removal or renaming of disk/network attributes that made old state incompatible.
- Domain XML differences and provider behavior changes causing Terraform to attempt destructive updates.
- Upstream provider bugs causing intermittent failures during plan/apply.
- State files from older versions becoming difficult to reconcile without careful manual interventions.
Current guidance:
- Continue using the provider version that matches your working branch / environment.
- If you experiment with the WIP branch, do so on disposable state and snapshots only.
- Watch the upstream provider issue tracker for patches/fixes before attempting production migration.
- I have preserved my WIP branch in the repo for anyone who wants to inspect the attempted changes, but do not treat it as stable or recommended.
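One way to check which provider version a working tree is actually pinned to is to read it from the dependency lock file that terraform init writes (.terraform.lock.hcl). The sketch below assumes the provider's registry address ends in dmacvicar/libvirt and that the lock file uses the standard provider block layout. The function name locked_provider_version is hypothetical.

```shell
# Print the locked version of a provider from a Terraform lock file.
# Usage: locked_provider_version LOCK_FILE PROVIDER_ADDRESS_SUFFIX
locked_provider_version() {
  awk -v wanted="$2" '
    # Start of the block for the provider we are looking for.
    $1 == "provider" && index($2, wanted) { inblock = 1; next }
    # First "version = ..." line inside that block: strip quotes, print, stop.
    inblock && $1 == "version" { gsub(/"/, "", $3); print $3; exit }
    # Block ended without a version line.
    inblock && $1 == "}" { inblock = 0 }
  ' "$1"
}
```

For example, `locked_provider_version infra/terraform/.terraform.lock.hcl dmacvicar/libvirt` prints the version Terraform will install for that directory, or nothing if the provider is not in the lock file. Comparing that value between the working branch and the WIP branch is a quick way to see which side of the migration a checkout is on.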
8. Operation Workflow
The steps to run the code are documented in the repo. At a high level:
- Clone the code.
- Navigate to the infra/terraform folder.
- Run terraform init to install all dependencies.
- Run terraform plan to preview the changes.
- Run terraform apply -auto-approve to provision the infrastructure.
Note: You have to run step 5 (terraform apply) twice because of a bug in the Terraform provider, which does not refresh libvirt's internal state. After the first run the networking information is stale; the second run populates it.
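The workflow above, including the second apply, can be wrapped in a small helper so the double run is not forgotten. This is a sketch under assumptions: the function name bring_up is made up for illustration, the repository is assumed to be cloned already, and it relies on Terraform's global -chdir option rather than cd.

```shell
# Run init, plan, and two applies against a Terraform directory.
# Usage: bring_up [TERRAFORM_DIR]   (defaults to infra/terraform)
bring_up() {
  tf_dir="${1:-infra/terraform}"

  terraform -chdir="$tf_dir" init || return 1
  terraform -chdir="$tf_dir" plan || return 1

  # The first apply creates the resources; the second lets the provider
  # pick up the networking details that were stale on the first pass.
  terraform -chdir="$tf_dir" apply -auto-approve || return 1
  terraform -chdir="$tf_dir" apply -auto-approve || return 1
}
```

Run from the repository root, `bring_up` with no argument targets infra/terraform. Each step stops the sequence on failure, so a broken plan never reaches apply.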
Here is what the outputs look like:
Outputs:

vm_ip_addresses = [
  {
    "txgrid-cp0" = tolist([
      {
        "addresses" = tolist([
          "192.168.2.166",
          "fd8c:9056:52a4:f645:3697:f6ff:feaa:bbc0",
          "fe80::3697:f6ff:feaa:bbc0",
        ])
        "bridge" = "br0"
        "hostname" = ""
        "mac" = "34:97:F6:AA:BB:C0"
        "macvtap" = ""
        "network_id" = ""
        "network_name" = ""
        "passthrough" = ""
        "private" = ""
        "vepa" = ""
        "wait_for_lease" = false
      },
      {
        "addresses" = tolist([
          "192.168.50.10",
          "fe80::5054:ff:fe5e:144e",
        ])
        "bridge" = ""
        "hostname" = ""
        "mac" = "52:54:00:5E:14:4E"
        "macvtap" = ""
        "network_id" = "82dc306f-1a38-4e32-b944-f5c8ad31351d"
        "network_name" = "hostnet"
        "passthrough" = ""
        "private" = ""
        "vepa" = ""
        "wait_for_lease" = false
      },
    ])
  },
  {
    "txgrid-wk1" = tolist([
      {
        "addresses" = tolist([
          "192.168.2.167",
          "fd8c:9056:52a4:f645:3697:f6ff:feaa:bbc1",
          "fe80::3697:f6ff:feaa:bbc1",
        ])
        "bridge" = "br0"
        "hostname" = ""
        "mac" = "34:97:F6:AA:BB:C1"
        "macvtap" = ""
        "network_id" = ""
        "network_name" = ""
        "passthrough" = ""
        "private" = ""
        "vepa" = ""
        "wait_for_lease" = false
      },
      {
        "addresses" = tolist([
          "192.168.50.11",
          "fe80::5054:ff:febe:ee14",
        ])
        "bridge" = ""
        "hostname" = ""
        "mac" = "52:54:00:BE:EE:14"
        "macvtap" = ""
        "network_id" = "82dc306f-1a38-4e32-b944-f5c8ad31351d"
        "network_name" = "hostnet"
        "passthrough" = ""
        "private" = ""
        "vepa" = ""
        "wait_for_lease" = false
      },
    ])
  },
  {
    "txgrid-wk2" = tolist([
      {
        "addresses" = tolist([
          "192.168.2.168",
          "fd8c:9056:52a4:f645:3697:f6ff:feaa:bbc2",
          "fe80::3697:f6ff:feaa:bbc2",
        ])
        "bridge" = "br0"
        "hostname" = ""
        "mac" = "34:97:F6:AA:BB:C2"
        "macvtap" = ""
        "network_id" = ""
        "network_name" = ""
        "passthrough" = ""
        "private" = ""
        "vepa" = ""
        "wait_for_lease" = false
      },
      {
        "addresses" = tolist([
          "192.168.50.12",
          "fe80::5054:ff:fe9a:3209",
        ])
        "bridge" = ""
        "hostname" = ""
        "mac" = "52:54:00:9A:32:09"
        "macvtap" = ""
        "network_id" = "82dc306f-1a38-4e32-b944-f5c8ad31351d"
        "network_name" = "hostnet"
        "passthrough" = ""
        "private" = ""
        "vepa" = ""
        "wait_for_lease" = false
      },
    ])
  },
]
9. Verification Steps
At this point the nodes should be online. The steps below verify the domains, their addresses, SSH access, and connectivity.
Verify domains:
virsh list --all
You should see all three domains in the running state:
 Id   Name         State
----------------------------
 1    txgrid-wk2   running
 2    txgrid-cp0   running
 3    txgrid-wk1   running
You can also get VM IP addresses via the QEMU guest agent using the domain name.
virsh domifaddr --source agent txgrid-cp0
Example successful output (for txgrid-cp0):
 Name       MAC address          Protocol     Address
-------------------------------------------------------------------------------
 lo         00:00:00:00:00:00    ipv4         127.0.0.1/8
 -          -                    ipv6         ::1/128
 ens3       34:97:f6:aa:bb:c0    ipv4         192.168.2.166/24
 -          -                    ipv6         fd8c:9056:52a4:f645:3697:f6ff:feaa:bbc0/64
 -          -                    ipv6         fe80::3697:f6ff:feaa:bbc0/64
 ens4       52:54:00:5e:14:4e    ipv4         192.168.50.10/24
 -          -                    ipv6         fe80::5054:ff:fe5e:144e/64
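When only one address is needed, for example the private-network IPv4 to pass to ssh, the table can be filtered with awk. The helper below is a sketch: the name agent_ipv4 is hypothetical, and it assumes the four-column layout shown above (name, MAC, protocol, address).

```shell
# Print the IPv4 address (without the prefix length) of one interface.
# Reads `virsh domifaddr` output on standard input.
# Usage: virsh domifaddr --source agent DOMAIN | agent_ipv4 IFACE
agent_ipv4() {
  awk -v ifname="$1" '
    $1 == ifname && $3 == "ipv4" {
      sub(/\/.*/, "", $4)   # drop the "/24" style suffix
      print $4
    }
  '
}
```

With the example output for txgrid-cp0, `virsh domifaddr --source agent txgrid-cp0 | agent_ipv4 ens4` would print 192.168.50.10. Continuation rows that start with "-" are skipped, so only the row carrying the interface name is considered.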
SSH into txgrid-cp0:
ssh -i ~/.ssh/id_ed25519 -o StrictHostKeyChecking=no txgrid@192.168.50.10
Output:
Warning: Permanently added '192.168.50.10' (ED25519) to the list of known hosts.
Welcome to Ubuntu 24.04.3 LTS (GNU/Linux 6.8.0-88-generic x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/pro

 System information as of Sun Nov 30 04:04:31 UTC 2025

  System load:           0.0
  Usage of /:            94.6% of 2.35GB
  Memory usage:          2%
  Swap usage:            0%
  Processes:             135
  Users logged in:       0
  IPv4 address for ens3: 192.168.2.166
  IPv6 address for ens3: fd8c:9056:52a4:f645:3697:f6ff:feaa:bbc0

  => / is using 94.6% of 2.35GB


Expanded Security Maintenance for Applications is not enabled.

0 updates can be applied immediately.

Enable ESM Apps to receive additional future security updates.
See https://ubuntu.com/esm or run: sudo pro status


Last login: Sun Nov 30 04:04:31 2025 from 192.168.50.1
txgrid@cp0:~$ uptime
 04:05:40 up 31 min,  1 user,  load average: 0.00, 0.00, 0.00
txgrid@cp0:~$
Connectivity tests between nodes:
From txgrid-cp0 -> txgrid-wk1:
txgrid@cp0:~$ ping -c 3 192.168.50.11
PING 192.168.50.11 (192.168.50.11) 56(84) bytes of data.
64 bytes from 192.168.50.11: icmp_seq=1 ttl=64 time=0.177 ms
64 bytes from 192.168.50.11: icmp_seq=2 ttl=64 time=0.203 ms
64 bytes from 192.168.50.11: icmp_seq=3 ttl=64 time=0.261 ms

--- 192.168.50.11 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2037ms
rtt min/avg/max/mdev = 0.177/0.213/0.261/0.035 ms
From txgrid-cp0 -> txgrid-wk2:
txgrid@cp0:~$ ping -c 3 192.168.50.12
PING 192.168.50.12 (192.168.50.12) 56(84) bytes of data.
64 bytes from 192.168.50.12: icmp_seq=1 ttl=64 time=0.277 ms
64 bytes from 192.168.50.12: icmp_seq=2 ttl=64 time=0.236 ms
64 bytes from 192.168.50.12: icmp_seq=3 ttl=64 time=0.232 ms

--- 192.168.50.12 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2053ms
rtt min/avg/max/mdev = 0.232/0.248/0.277/0.020 ms
Outbound connectivity to the public internet:
txgrid@cp0:~$ ping -c 3 www.google.com
PING www.google.com (172.253.124.99) 56(84) bytes of data.
64 bytes from ys-in-f99.1e100.net (172.253.124.99): icmp_seq=1 ttl=104 time=40.1 ms
64 bytes from ys-in-f99.1e100.net (172.253.124.99): icmp_seq=2 ttl=104 time=40.1 ms
64 bytes from ys-in-f99.1e100.net (172.253.124.99): icmp_seq=3 ttl=104 time=40.6 ms
--- www.google.com ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2002ms
rtt min/avg/max/mdev = 40.067/40.237/40.559/0.227 ms
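The ping checks above can be repeated in one pass with a small loop that reports each target and returns non-zero if any of them is unreachable. This is only a sketch: check_nodes is a hypothetical name, and the -c and -W options assume the iputils ping shipped with Ubuntu.

```shell
# Ping each address once and report ok/FAIL per target.
# Returns 0 when every target answered, 1 otherwise.
# Usage: check_nodes ADDRESS...
check_nodes() {
  failed=0
  for target in "$@"; do
    # One probe, two-second timeout; output is discarded on purpose.
    if ping -c 1 -W 2 "$target" >/dev/null 2>&1; then
      echo "ok   $target"
    else
      echo "FAIL $target"
      failed=1
    fi
  done
  return "$failed"
}
```

From any node, `check_nodes 192.168.50.10 192.168.50.11 192.168.50.12 www.google.com` would cover both the private network and outbound access in one call, and the exit status makes it easy to use after terraform apply in a script.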
10. Troubleshooting & Gotchas
Common Issues
| Issue | Likely Cause | Fix |
|---|---|---|
| VMs fail to start | AppArmor restriction | Update QEMU profile (see repo diffs) or relax policy for lab usage |
| NIC missing | Bridge misconfiguration | Validate Netplan & ensure bridge exists before VM start |
| Cloud-init not applying | Wrong metadata or template path | Rebuild the cloud-init ISO module |
| Terraform repeatedly recreates resources | Provider incompatibility | Use pinned provider version and avoid migrating until upstream stabilizes |
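For the bridge row in the table, a quick pre-flight check can confirm that br0 exists and is really a bridge before any VM starts. The sketch below relies on the Linux sysfs convention that bridge devices expose a bridge subdirectory under /sys/class/net/<name>. The function name is_bridge and its optional second argument (used only to point the check at a different sysfs root) are illustrative.

```shell
# Succeed when the named interface exists and is a Linux bridge.
# Usage: is_bridge IFACE [SYSFS_NET_DIR]
is_bridge() {
  [ -d "${2:-/sys/class/net}/$1/bridge" ]
}
```

On the host, `is_bridge br0 || echo "br0 is missing or not a bridge"` gives a clear message before Terraform tries to attach NICs to it.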
Useful Commands
virsh list --all
virsh domifaddr txgrid-cp0
virsh net-dhcp-leases default
virsh dumpxml txgrid-wk1
11. Further Enhancements
Future improvements may include:
- Ansible provisioning post-boot
- GPU passthrough testing for ML workloads
- iperf-based network benchmarking between nodes
