Automated Server Fleet Management: Update and Health Investigation Workflows
These workflows are extracted from my actual operational runbooks. They manage a mixed fleet of Linux and macOS nodes connected over a Tailscale mesh network. They work well as manual checklists, shell scripts, or AI-assistant-driven procedures.
The Problem
I manage a fleet of servers with different roles: build machines, application hosts, a NAS, and cloud VPS instances. They run different operating systems, serve different purposes, and need different update strategies. Before formalizing these workflows, updates were ad-hoc SSH sessions where I would forget to check disk space before upgrading, miss a server that needed a reboot, or discover three weeks later that a kernel update broke a module directory.
The two workflows below solve this by making every step explicit, every check automated, and every result recorded in a structured report.
Architecture
All servers are connected via Tailscale, a WireGuard-based mesh VPN. SSH access uses Ed25519 key authentication with a consistent user across all nodes. The general connection pattern:
ssh -i ~/.ssh/id_ed25519 -o ConnectTimeout=10 -o BatchMode=yes user@<tailscale-ip> "hostname && uptime"
The -o BatchMode=yes flag is critical for automation: it disables interactive prompts, so the connection fails immediately if key auth does not work rather than hanging on a password prompt. The -o ConnectTimeout=10 option prevents a single unreachable node from blocking the entire workflow.
The fleet includes both dnf-based (Fedora/RHEL) and apt-based (Debian/Ubuntu) systems, plus one macOS host that uses softwareupdate. Every script in these workflows detects the package manager at runtime and adapts accordingly.
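A minimal sketch of that runtime detection, assuming it runs on the target host. The function name detect_pkg_manager is hypothetical; the checks themselves mirror the `command -v` tests used in the steps below. Darwin is tested first so that an unrelated `apt` binary on a macOS host cannot be mistaken for the Debian tool.

```shell
#!/bin/sh
# Sketch: report which update tool this host should use.
# detect_pkg_manager is a hypothetical helper name, not part of the runbook.
detect_pkg_manager() {
    if [ "$(uname -s)" = "Darwin" ]; then
        echo "softwareupdate"          # macOS host
    elif command -v apt >/dev/null 2>&1; then
        echo "apt"                     # Debian/Ubuntu
    elif command -v dnf >/dev/null 2>&1; then
        echo "dnf"                     # Fedora/RHEL
    else
        echo "unknown"                 # caller decides how to handle this
        return 1
    fi
}
```

A caller could branch on the result with `case "$(detect_pkg_manager)" in apt) ... ;; dnf) ... ;; esac`.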
Workflow 1: Server Update
The update workflow applies system patches across the fleet in a controlled, auditable sequence. It produces a Markdown report with per-package version diffs for every server.
Step 1 -- Verify Connectivity
Before touching any packages, confirm that every server is reachable:
ssh -i ~/.ssh/id_ed25519 -o ConnectTimeout=10 -o BatchMode=yes user@<IP> "hostname && uptime"
Run this against every node in the fleet. If some servers are unreachable, decide whether to proceed with the reachable subset or abort entirely. If all servers fail, something is wrong with the network or Tailscale. Stop and investigate.
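One way to turn this check into a loop is sketched below. The helper names probe_host and check_fleet, the REACHABLE/UNREACHABLE variables, and the exit-status convention are all assumptions for illustration; only the ssh options come from the command above (plus -n so ssh does not consume stdin).

```shell
#!/bin/sh
# probe_host is the only function that touches the network.
probe_host() {
    ssh -n -i ~/.ssh/id_ed25519 -o ConnectTimeout=10 -o BatchMode=yes \
        "user@$1" "hostname && uptime" >/dev/null 2>&1
}

# Sort hosts into REACHABLE / UNREACHABLE (space-separated lists).
# Exit status: 0 = all reachable, 1 = partial, 2 = none reachable.
check_fleet() {
    REACHABLE=""
    UNREACHABLE=""
    for cf_host in "$@"; do
        if probe_host "$cf_host"; then
            REACHABLE="${REACHABLE:+$REACHABLE }$cf_host"
        else
            UNREACHABLE="${UNREACHABLE:+$UNREACHABLE }$cf_host"
        fi
    done
    if [ -z "$REACHABLE" ]; then
        echo "ABORT: no hosts reachable; check the network or Tailscale" >&2
        return 2
    elif [ -n "$UNREACHABLE" ]; then
        echo "PARTIAL: unreachable hosts: $UNREACHABLE" >&2
        return 1
    fi
    echo "OK: all hosts reachable"
}
```

Keeping the three outcomes as distinct exit codes leaves the proceed-or-abort decision with the operator instead of burying it in the loop.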
Step 2 -- Gather Update Parameters
Before running anything, define three parameters:
| Parameter | Options | Default |
|---|---|---|
| Update scope | Full system, security-only, or specific packages | Full system |
| Reboot handling | Automatic, ask-per-server, or skip (flag only) | Ask-per-server |
| Execution order | Sequential (safer) or parallel (faster) | Sequential |
Sequential updates are strongly recommended when servers run interdependent services. Parallel is acceptable for independent build machines or development nodes.
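The three parameters can be validated before anything runs. The sketch below assumes they arrive as shell variables: SCOPE and PACKAGES are the names used in Step 4, while REBOOT_MODE, ORDER, the short value names (auto/ask/skip), and validate_params itself are hypothetical.

```shell
#!/bin/sh
# Apply the defaults from the table and reject unknown values.
# REBOOT_MODE and ORDER are hypothetical variable names.
validate_params() {
    SCOPE="${SCOPE:-full}"
    REBOOT_MODE="${REBOOT_MODE:-ask}"
    ORDER="${ORDER:-sequential}"
    case "$SCOPE" in
        full|security|specific) ;;
        *) echo "invalid SCOPE: $SCOPE" >&2; return 1 ;;
    esac
    case "$REBOOT_MODE" in
        auto|ask|skip) ;;
        *) echo "invalid REBOOT_MODE: $REBOOT_MODE" >&2; return 1 ;;
    esac
    case "$ORDER" in
        sequential|parallel) ;;
        *) echo "invalid ORDER: $ORDER" >&2; return 1 ;;
    esac
    # A package list is only meaningful (and required) for specific updates
    if [ "$SCOPE" = "specific" ] && [ -z "${PACKAGES:-}" ]; then
        echo "SCOPE=specific requires PACKAGES" >&2
        return 1
    fi
    echo "scope=$SCOPE reboot=$REBOOT_MODE order=$ORDER"
}
```

Failing early on a typo such as SCOPE=secuirty matters here, because an unmatched case branch in the update script would otherwise do nothing without reporting an error.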
Step 3 -- Pre-Update Snapshot
Capture the baseline state of each server before making changes. This is the data you will diff against after the update.
ssh -i ~/.ssh/id_ed25519 user@<IP> bash <<'EOF'
echo "=== HOSTNAME ===" && hostname
echo "=== OS ===" && grep -E "^(NAME|VERSION)=" /etc/os-release
echo "=== KERNEL ===" && uname -r
echo "=== UPTIME ===" && uptime
echo "=== DISK ===" && df -h /
echo "=== PENDING UPDATES ==="
if command -v apt &>/dev/null; then
apt list --upgradable 2>/dev/null | tail -n +2
elif command -v dnf &>/dev/null; then
dnf check-update --quiet 2>/dev/null || true
fi
echo "=== REBOOT REQUIRED ==="
if [ -f /var/run/reboot-required ]; then echo "YES"; else echo "NO"; fi
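EOF
Record the OS, kernel version, disk usage on /, number of pending updates, and whether a reboot was already required. Note that /var/run/reboot-required exists only on Debian-family systems, so this check always prints NO on dnf-based hosts; Step 5 shows the dnf equivalent. If disk usage on / exceeds 80%, warn before proceeding; an update that fills the root filesystem can leave the server in an unrecoverable state.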
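The 80% disk check can be automated with a small parser. This is a sketch under two assumptions: it reads `df -P /` (POSIX output format, one line per filesystem) rather than the `df -h /` shown above, and the function names root_usage_pct and check_root_disk are hypothetical.

```shell
#!/bin/sh
# Extract the use% of the root filesystem from `df -P /` output on stdin.
root_usage_pct() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Warn when usage exceeds a threshold (second argument, default 80).
check_root_disk() {
    if [ "$1" -gt "${2:-80}" ]; then
        echo "WARNING: / is at $1% (threshold ${2:-80}%)"
        return 1
    fi
    echo "OK: / is at $1%"
}
```

On a live host the two combine as `check_root_disk "$(df -P / | root_usage_pct)"`; a non-zero exit is the signal to pause and ask before upgrading.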
Step 4 -- Apply Updates
Detect the package manager and run the appropriate update command:
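# The heredoc is quoted, so SCOPE and PACKAGES must be passed to the remote shell explicitly
ssh -i ~/.ssh/id_ed25519 user@<IP> "SCOPE=$(printf %q "$SCOPE") PACKAGES=$(printf %q "$PACKAGES") bash -s" <<'SCRIPT'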
set -e
if command -v apt &>/dev/null; then
sudo apt update
case "$SCOPE" in
full) sudo DEBIAN_FRONTEND=noninteractive apt upgrade -y ;;
security) sudo DEBIAN_FRONTEND=noninteractive unattended-upgrade ;;
specific) sudo DEBIAN_FRONTEND=noninteractive apt install --only-upgrade -y $PACKAGES ;;
esac
sudo apt autoremove -y
elif command -v dnf &>/dev/null; then
case "$SCOPE" in
full) sudo dnf upgrade -y ;;
security) sudo dnf upgrade --security -y ;;
specific) sudo dnf upgrade -y $PACKAGES ;;
esac
sudo dnf autoremove -y
else
echo "ERROR: No supported package manager found"
exit 1
fi
echo "=== UPDATE COMPLETE ==="
SCRIPT
For macOS hosts, skip the package manager entirely and use:
softwareupdate -l # List available updates
sudo softwareupdate -ia # Install all available updates (installation needs admin rights)
If an update command exits non-zero, capture the error output and decide whether to continue with the next server or abort the run. Never silently swallow update failures.
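A sketch of sequential execution with captured errors and early abort follows. Everything named here is an assumption for illustration: update_host is a placeholder wrapper, update.sh is a hypothetical file holding the Step 4 script body, and LOG_DIR and CONTINUE_ON_ERROR are invented knobs.

```shell
#!/bin/sh
# Placeholder: run the Step 4 script on one host.
# update.sh is a hypothetical file containing that script body.
update_host() {
    ssh -i ~/.ssh/id_ed25519 -o ConnectTimeout=10 -o BatchMode=yes \
        "user@$1" "SCOPE=${SCOPE:-full} bash -s" < update.sh
}

# Update hosts one at a time, keeping each host's full output in a log file.
# Stops at the first failure unless CONTINUE_ON_ERROR=1.
update_fleet() {
    uf_logs="${LOG_DIR:-./logs}"
    uf_failed=""
    mkdir -p "$uf_logs"
    for uf_host in "$@"; do
        if update_host "$uf_host" >"$uf_logs/$uf_host.log" 2>&1; then
            echo "$uf_host: success"
        else
            uf_rc=$?
            echo "$uf_host: FAILED (exit $uf_rc), see $uf_logs/$uf_host.log"
            uf_failed="$uf_failed $uf_host"
            [ "${CONTINUE_ON_ERROR:-0}" = "1" ] || break
        fi
    done
    # Non-zero exit if any host failed, so the caller cannot miss it
    [ -z "$uf_failed" ]
}
```

Because every host's stdout and stderr land in its own log, the error text is available for the report even when the run aborts early.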
Step 5 -- Handle Reboot
After updates complete, check whether a reboot is required:
# Debian/Ubuntu
[ -f /var/run/reboot-required ] && echo "REBOOT_REQUIRED=YES"
# Fedora/RHEL
needs-restarting -r &>/dev/null; [ $? -eq 1 ] && echo "REBOOT_REQUIRED=YES"
If rebooting automatically, issue sudo reboot and poll for SSH to come back:
ssh user@<IP> "sudo reboot"
sleep 30
for i in $(seq 1 18); do
ssh -o ConnectTimeout=10 -o BatchMode=yes user@<IP> "uptime" 2>/dev/null && break
sleep 10
done
This gives the server roughly 3.5 to 6.5 minutes to come back online: a 30-second initial wait plus 18 polls, each of which can spend up to 10 seconds on the connection attempt before the 10-second sleep. If it does not respond, flag it as requiring manual intervention.
Step 6 -- Post-Update Verification
Re-run the same checks from Step 3 and diff against the baseline:
- Did the kernel version change? (Expected if a kernel package was updated.)
- Did disk usage spike abnormally? (An increase of more than 5 percentage points warrants investigation.)
- Are there remaining pending updates? (Should be zero for a full update.)
- Are any systemd services in a failed state?
systemctl --failed --no-pager 2>/dev/null
A failed service after an update is a critical finding. It likely means a package upgrade broke a service configuration or a dependency changed.
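The before/after comparison can be scripted if each snapshot is reduced to a few values. The sketch below assumes a hypothetical snapshot file format of KEY=value lines (KERNEL, DISK_PCT, PENDING) and hypothetical function names; the 5-point disk threshold is the one from the checklist above.

```shell
#!/bin/sh
# Read one value from a KEY=value snapshot file (hypothetical format).
snap_get() {
    sed -n "s/^$2=//p" "$1"
}

# Compare a pre-update snapshot ($1) with a post-update snapshot ($2).
compare_snapshots() {
    cs_k1=$(snap_get "$1" KERNEL);   cs_k2=$(snap_get "$2" KERNEL)
    cs_d1=$(snap_get "$1" DISK_PCT); cs_d2=$(snap_get "$2" DISK_PCT)
    cs_pending=$(snap_get "$2" PENDING)
    [ "$cs_k1" = "$cs_k2" ] || echo "INFO: kernel changed $cs_k1 -> $cs_k2"
    [ $((cs_d2 - cs_d1)) -le 5 ] || echo "WARNING: / usage rose $((cs_d2 - cs_d1)) points"
    [ "$cs_pending" -eq 0 ] || echo "WARNING: $cs_pending updates still pending"
}
```

The output lines already carry a severity prefix, so they can be appended to the report's issues section as they are.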
Step 7 -- Collect Package Details
For the report, extract the exact list of packages that changed. On dnf-based systems, query the transaction history:
LAST_TXN=$(sudo dnf history list --reverse | tail -1 | awk '{print $1}')
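sudo dnf history info "$LAST_TXN"
This gives you every package that was upgraded, installed, removed, or downgraded in that transaction, with old and new version numbers. If dnf autoremove removed anything in Step 4, it is recorded as a separate, later transaction, so check the one before it for the upgrade itself. On apt-based systems, parse /var/log/apt/history.log for the most recent transaction block.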
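For the apt side, a possible parser is sketched below. It assumes the usual history.log layout, where transaction blocks are separated by blank lines and upgrades appear on a line of the form `Upgrade: pkg:arch (old, new), ...`. The function names last_apt_txn and upgraded_packages are hypothetical.

```shell
#!/bin/sh
# Print the most recent transaction block from an apt history log.
# awk paragraph mode (RS = "") treats blank-line-separated blocks as records.
last_apt_txn() {
    awk 'BEGIN { RS = "" } { last = $0 } END { if (NR) print last }' \
        "${1:-/var/log/apt/history.log}"
}

# Read one transaction block on stdin; print "pkg old -> new" per upgrade.
upgraded_packages() {
    sed -n 's/^Upgrade: //p' |
        tr ')' '\n' |
        sed -n 's/^[, ]*\([^ ]*\) (\([^,]*\), \(.*\)$/\1 \2 -> \3/p'
}
```

Usage would be `last_apt_txn | upgraded_packages`. As with dnf, an autoremove that follows the upgrade is logged as its own block, so the upgrade may be the second-to-last block rather than the last.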
Step 8 -- Generate Report
Save a structured Markdown report to output/reports/server-update-YYYY-MM-DD_HHMMSS.md. The report includes:
Summary table:
| Server | OS | Status | Upgraded | Installed | Removed | Reboot |
|---|---|---|---|---|---|---|
| build-server-1 | Fedora 44 | Success | 23 pkgs | 2 pkgs | 0 pkgs | Yes |
| storage-node | Fedora 44 | Success | 18 pkgs | 0 pkgs | 1 pkg | No |
Per-server detail sections including:
- Before/after comparison table (kernel, disk, uptime, pending updates).
- Full upgraded packages table with previous and new versions.
- Newly installed packages (dependencies pulled in by upgrades).
- Removed packages (autoremoved or replaced).
- Security advisories addressed, if the package manager provides this data.
Notable updates section highlighting security-sensitive packages:
- Kernel updates (version change, reboot implications).
- Container runtime updates (docker-ce, containerd, podman).
- Cryptographic libraries (openssl, gnutls, nss).
- Remote access tools (openssh, sudo, curl).
- Language runtimes (python3, nodejs, go).
Issues and warnings section listing any problems encountered:
- Update failures with error messages.
- Servers still pending a reboot.
- Failed systemd services detected post-update.
- Disk space warnings.
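The report path and summary table are simple to generate. The helper names below (report_path, pkgs, summary_header, summary_row) are hypothetical; the path pattern and column layout are taken from the description above.

```shell
#!/bin/sh
# Timestamped report path in the format described above.
report_path() {
    echo "output/reports/server-update-$(date +%Y-%m-%d_%H%M%S).md"
}

# "1 pkg" / "N pkgs", matching the sample table.
pkgs() {
    if [ "$1" -eq 1 ]; then echo "1 pkg"; else echo "$1 pkgs"; fi
}

summary_header() {
    echo "| Server | OS | Status | Upgraded | Installed | Removed | Reboot |"
    echo "|---|---|---|---|---|---|---|"
}

# Arguments: server, os, status, upgraded, installed, removed, reboot
summary_row() {
    printf '| %s | %s | %s | %s | %s | %s | %s |\n' \
        "$1" "$2" "$3" "$(pkgs "$4")" "$(pkgs "$5")" "$(pkgs "$6")" "$7"
}
```

Calling `summary_row storage-node "Fedora 44" Success 18 0 1 No` reproduces the second row of the sample table.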
Workflow 2: Server Health Investigation
The health investigation is a read-only audit. It makes no changes to any server. Use it after major events (OS upgrades, kernel updates, infrastructure migrations) or as a periodic check to catch drift before it becomes a problem.
Step 1 -- Verify Connectivity
Same as the update workflow: test SSH access to every node, record which are reachable, and proceed with the reachable subset.
Step 2 -- System Identity
Collect the baseline identity of each server:
echo "=== OS ===" && grep -E "^(NAME|VERSION|VERSION_ID)=" /etc/os-release
echo "=== KERNEL ===" && uname -r
echo "=== ARCH ===" && uname -m
echo "=== UPTIME ===" && uptime
echo "=== LAST REBOOT ===" && who -b
echo "=== TIMEZONE ===" && timedatectl | grep "Time zone"
This establishes what you are working with. A server running an unexpected kernel version or the wrong timezone is an early signal that something has drifted.
Step 3 -- Systemd Service Health
Check for failed services and verify that critical services are running:
# Any failed services?
systemctl --failed --no-pager
# Is the system in a degraded state?
systemctl is-system-running
# Check critical services
for svc in sshd docker containerd tailscaled chronyd crond firewalld; do
STATUS=$(systemctl is-active "$svc" 2>/dev/null)
ENABLED=$(systemctl is-enabled "$svc" 2>/dev/null)
# Report every unit except those that are both inactive and disabled
{ [ "$STATUS" != "inactive" ] || [ "$ENABLED" != "disabled" ]; } && \
echo "$svc: active=$STATUS enabled=$ENABLED"
done
Flags:
- Any failed service is a finding.
- A "degraded" system state means at least one unit failed during boot.
- sshd not active is critical (you are connected over SSH; if it restarts and fails, you lose access).
Step 4 -- Disk and Filesystem Health
df -h # Filesystem usage
df -i / # Inode usage
lsblk -o NAME,SIZE,TYPE,FSTYPE,MOUNTPOINT # Block device layout
# SMART health (requires smartmontools)
for dev in $(lsblk -d -n -o NAME | grep -E "^(sd|nvme|vd)"); do
sudo smartctl -H "/dev/$dev" | grep -E "^(SMART|overall|result)"
done
# Storage arrays
cat /proc/mdstat 2>/dev/null # MD RAID
sudo zpool status 2>/dev/null # ZFS pools
sudo lvs --noheadings 2>/dev/null # LVM volumes
Thresholds:
| Metric | Warning | Critical |
|---|---|---|
| Filesystem usage | >85% | >95% |
| Inode usage | >85% | >95% |
| SMART health | Any warning | Failed |
| RAID/ZFS status | Degraded | Faulted |
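The percentage thresholds in the table translate directly into a classifier. This sketch assumes `df -P` output on stdin and mount points without spaces; classify_pct and fs_findings are hypothetical names. The same classifier would work for inode usage fed from `df -Pi`.

```shell
#!/bin/sh
# classify_pct VALUE WARN CRIT -> OK | WARNING | CRITICAL
classify_pct() {
    if [ "$1" -gt "$3" ]; then
        echo "CRITICAL"
    elif [ "$1" -gt "$2" ]; then
        echo "WARNING"
    else
        echo "OK"
    fi
}

# Read `df -P` output on stdin; print one finding per filesystem over threshold.
fs_findings() {
    awk 'NR > 1 && $5 ~ /^[0-9]+%$/ { sub(/%/, "", $5); print $6, $5 }' |
        while read -r ff_mount ff_pct; do
            ff_sev=$(classify_pct "$ff_pct" 85 95)
            if [ "$ff_sev" != "OK" ]; then
                echo "$ff_sev: $ff_mount at $ff_pct%"
            fi
        done
}
```

Run it as `df -P | fs_findings`; no output means every filesystem is within thresholds.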
Step 5 -- Memory and Swap
free -h
swapon --show
ps aux --sort=-%mem | head -6
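# -k implies the current boot only; matching the kernel transport directly covers the full 7 days
sudo journalctl --since "7 days ago" _TRANSPORT=kernel | grep -ci "oom"
Flags: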
- Available memory below 500 MB.
- Swap usage above 50% (indicates memory pressure).
- Any OOM kills in the last 7 days (a process was killed by the kernel due to memory exhaustion).
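The first two flags can be evaluated mechanically. The sketch below reads /proc/meminfo instead of parsing `free -h`, which is a substitution on my part because the kB fields are easier to compare; mem_findings is a hypothetical name and the 500 MB and 50% thresholds are the ones listed above.

```shell
#!/bin/sh
# Evaluate the memory and swap flags from a meminfo-style file
# (defaults to /proc/meminfo; values there are in kB).
mem_findings() {
    awk '
        /^MemAvailable:/ { avail = $2 }
        /^SwapTotal:/    { st = $2 }
        /^SwapFree:/     { sf = $2 }
        END {
            if (avail < 500 * 1024)
                print "WARNING: available memory " int(avail / 1024) " MB"
            if (st > 0 && (st - sf) * 100 / st > 50)
                print "WARNING: swap usage " int((st - sf) * 100 / st) "%"
        }' "${1:-/proc/meminfo}"
}
```

Hosts without swap (SwapTotal of 0) are skipped by the `st > 0` guard rather than reported as a division error.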
Step 6 -- CPU and Load
echo "$(nproc) cores"
cat /proc/loadavg
ps aux --sort=-%cpu | head -6
iostat -c 1 2 | awk 'NF' | tail -1 # CPU steal and iowait (last non-empty line, from the second sample)
Flags:
- Load average (1-minute) exceeding 2x the core count.
- Any single process consuming >80% CPU persistently.
- iowait above 20% (indicates disk I/O bottleneck).
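The load flag is a single comparison, but it involves a decimal, which plain `[ ... -gt ... ]` cannot handle. A sketch using awk for the arithmetic, with load_check as a hypothetical name:

```shell
#!/bin/sh
# load_check LOAD_1MIN CORES
# Flags a 1-minute load average above 2x the core count.
load_check() {
    awk -v l="$1" -v c="$2" 'BEGIN {
        if (l > 2 * c)
            print "WARNING: load " l " exceeds 2x " c " cores"
        else
            print "OK: load " l " (" c " cores)"
    }'
}
```

On a live host: `load_check "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"`.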
Step 7 -- Network and Connectivity
ip -brief addr # Network interfaces
ip route show default # Default route
dig +short google.com A # DNS resolution
curl -sL -o /dev/null -w "%{http_code}\n" --connect-timeout 5 https://google.com # -L follows the redirect to www, so a healthy result is 200
sudo ss -tlnp | head -20 # Listening ports
tailscale status | head -5 # Mesh VPN status
Flags:
- DNS resolution failure.
- External connectivity failure (HTTP status not 200).
- Tailscale not connected (if the server is expected to be on the mesh).
- Unexpected listening ports.
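"Unexpected" only means something relative to an allowlist. The sketch below assumes a per-host allowlist of port numbers (a hypothetical input) and parses `ss -tlnH` output, where the fourth column is the local address and port; unexpected_ports is a hypothetical name.

```shell
#!/bin/sh
# unexpected_ports "22 443 ..." < output of `ss -tlnH`
# Prints each listening port that is not in the allowlist, once.
unexpected_ports() {
    awk -v allow="$1" '
        BEGIN {
            n = split(allow, a, " ")
            for (i = 1; i <= n; i++) ok[a[i]] = 1
        }
        {
            # Port is the text after the last colon (handles [::]:22 and *:80)
            m = split($4, p, ":")
            port = p[m]
            if (!(port in ok) && !(port in seen)) {
                seen[port] = 1
                print port
            }
        }'
}
```

Usage: `sudo ss -tlnH | unexpected_ports "22 443"`.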
Step 8 -- Container Runtime Health
This is the most detailed check. Containers are where most application logic runs, and a misbehaving container can consume all host resources.
# Docker
docker --version
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Image}}\t{{.Ports}}"
# Health and restart counts
docker ps --format "{{.Names}}" | while read -r name; do
health=$(docker inspect --format='{{if .State.Health}}{{.State.Health.Status}}{{else}}no-healthcheck{{end}}' "$name")
restarts=$(docker inspect --format='{{.RestartCount}}' "$name")
echo "$name: health=$health restarts=$restarts"
done
# Resource usage
docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}"
# Disk consumption
docker system df
# Compose projects
docker compose ls
Flags:
| Condition | Severity |
|---|---|
| Container health = unhealthy | Critical |
| Container status = restarting | Critical |
| Restart count > 5 | Warning |
| Container using > 80% host memory | Warning |
| Docker images > 50 GB total | Warning (prune needed) |
| Build cache > 50 GB | Warning (run docker builder prune) |
| Exited containers older than 7 days | Info |
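Two rows of this table can be applied to the output of the health-and-restart loop above. The sketch assumes input lines in that loop's format (`name: health=... restarts=...`); container_findings is a hypothetical name.

```shell
#!/bin/sh
# Read "name: health=STATUS restarts=N" lines on stdin and apply
# the unhealthy (Critical) and restart-count (Warning) rules.
container_findings() {
    while read -r cf_name cf_health cf_restarts; do
        cf_name=${cf_name%:}
        cf_health=${cf_health#health=}
        cf_restarts=${cf_restarts#restarts=}
        if [ "$cf_health" = "unhealthy" ]; then
            echo "CRITICAL: $cf_name is unhealthy"
        fi
        if [ "$cf_restarts" -gt 5 ]; then
            echo "WARNING: $cf_name restarted $cf_restarts times"
        fi
    done
}
```

Piping the loop's output straight into `container_findings` yields report-ready lines with a severity prefix.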
Step 9 -- Security Posture
# Pending security updates
dnf updateinfo list --available --security | head -20
# SELinux status
getenforce
# SSH hardening (effective settings, including drop-in files under sshd_config.d)
sudo sshd -T | grep -Ei "^(permitrootlogin|passwordauthentication) "
# Failed login attempts (last 7 days)
sudo journalctl --since "7 days ago" -u sshd | grep -ci "failed\|invalid"
Flags:
- SELinux disabled on a server that should be enforcing.
- PermitRootLogin yes (should be no or prohibit-password).
- PasswordAuthentication yes on servers that should be key-only.
- Elevated failed SSH login attempts (threshold depends on exposure; a public-facing server with 500 failures/week is normal, while an internal Tailscale-only server with 50 is suspicious).
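The two SSH settings can be checked mechanically. The sketch below reads `keyword value` lines on stdin, so it accepts either the effective configuration printed by `sshd -T` or uncommented lines from sshd_config; ssh_findings is a hypothetical name.

```shell
#!/bin/sh
# Flag the two SSH hardening settings listed above.
# Keyword matching is case-insensitive: sshd -T prints keywords in lowercase.
ssh_findings() {
    awk '
        tolower($1) == "permitrootlogin" && tolower($2) == "yes" {
            print "WARNING: PermitRootLogin yes (expected no or prohibit-password)"
        }
        tolower($1) == "passwordauthentication" && tolower($2) == "yes" {
            print "WARNING: PasswordAuthentication yes (expected key-only)"
        }'
}
```

Usage: `sudo sshd -T | ssh_findings`.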
Step 10 -- GRUB and Boot Configuration
Often overlooked, but critical after kernel updates:
# GRUB entries and default
sudo grubby --info=ALL | grep -E '^(index|kernel|title)'
sudo grubby --default-kernel
# Installed kernel packages
rpm -qa kernel-core --qf '%{NAME}-%{VERSION}-%{RELEASE}.%{ARCH}\n' | sort
# Orphan module directories
for d in /lib/modules/*/; do
kver=$(basename "$d")
rpm -q "kernel-core-${kver%.*}" >/dev/null 2>&1 || echo "ORPHAN: $d"
done
# Boot partition usage
df -h /boot
df -h /boot/efi
Flags:
- GRUB default kernel does not match the running kernel (mismatch after update without reboot).
- More than 3 kernel entries (stale kernels consuming /boot space).
- /boot partition above 80% usage.
- Orphan /lib/modules/ directories with no matching installed kernel package.
- /boot/efi mount missing the nofail option in /etc/fstab (can prevent boot if the EFI partition is temporarily unavailable).
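The first flag, default kernel versus running kernel, is a string comparison once the path prefix is stripped. This sketch assumes `grubby --default-kernel` prints a path ending in vmlinuz-<version>; check_default_kernel is a hypothetical name.

```shell
#!/bin/sh
# check_default_kernel DEFAULT_KERNEL_PATH RUNNING_VERSION
check_default_kernel() {
    # Strip everything up to and including "vmlinuz-"
    cdk_default=${1##*/vmlinuz-}
    if [ "$cdk_default" = "$2" ]; then
        echo "OK: default and running kernel are both $2"
    else
        echo "WARNING: default kernel $cdk_default, running $2 (reboot pending?)"
        return 1
    fi
}
```

Usage: `check_default_kernel "$(sudo grubby --default-kernel)" "$(uname -r)"`.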
Step 11 -- Application-Specific Checks
Catch-all for everything else:
# Scheduled tasks
sudo crontab -l
systemctl list-timers --no-pager | head -15
# NTP synchronization
timedatectl show | grep -E "^(NTP|Synchronized)"
# Recent errors
sudo journalctl --since "1 hour ago" -p err --no-pager | tail -20
# Kernel errors
sudo dmesg --level=err,crit,alert,emerg | tail -10
NTP not synchronized is a warning; clock drift causes TLS certificate validation failures, log timestamp confusion, and distributed system coordination issues.
Step 12 -- Generate Report
Save to output/reports/server-health-YYYY-MM-DD_HHMMSS.md with per-server detail.
Summary table:
| Server | OS | Kernel | Uptime | Status | Issues |
|---|---|---|---|---|---|
| build-server-1 | Fedora 44 | 7.0.10 | 12d 4h | Healthy | 0 |
| storage-node | Fedora 44 | 7.0.10 | 23d 8h | Warning | 2 |
Per-server sections with a category-level status matrix:
| Category | Status | Details |
|---|---|---|
| System services | OK | 0 failed services |
| Disk | WARNING | /boot at 87% |
| Memory | OK | 28.3 GB available |
| CPU | OK | Load 0.12 (8 cores) |
| Network | OK | All checks passed |
| Containers | OK | 4 running, 0 unhealthy |
| Security | OK | 0 pending security updates |
| GRUB/Boot | WARNING | 4 kernel entries, 1 orphan module dir |
Each finding gets a severity level (CRITICAL, WARNING, INFO) and a recommended action. The report ends with aggregated sections for critical issues, warnings, and recommendations.
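If every check emits lines prefixed with their severity, the Status and Issues columns of the summary table can be derived from them. The sketch assumes that prefix convention (`CRITICAL: ...`, `WARNING: ...`, `INFO: ...`) and assumes that Issues counts criticals plus warnings; overall_status is a hypothetical name.

```shell
#!/bin/sh
# Read severity-prefixed findings on stdin; print "<Status> <issue count>".
# Status is the worst severity seen; INFO lines do not count as issues.
overall_status() {
    awk -F: '
        $1 == "CRITICAL" { c++ }
        $1 == "WARNING"  { w++ }
        END {
            s = c ? "Critical" : (w ? "Warning" : "Healthy")
            print s, c + w
        }'
}
```

For the storage-node example above, two WARNING lines produce `Warning 2`.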
Design Decisions
Why Workflows, Not Scripts
These are structured procedures, not shell scripts. The distinction matters:
- Decision points. A script either handles every edge case or crashes. A workflow can pause and ask: "Server X has 94% disk usage. Proceed with the update anyway?" An AI assistant or a human operator can make that judgment call with context a script does not have.
- Heterogeneous fleet. The fleet includes Fedora, potentially Debian-based systems, and macOS. A single script would need extensive branching; a workflow describes the intent and lets each step adapt to the detected environment.
- Auditability. The structured report is not an afterthought; it is the primary output. Package-level version diffs, before/after comparisons, and flagged issues create an audit trail that a raw script log does not provide.
- Composability. Run the health investigation before and after the update workflow. Use the update workflow's report to feed into change management. The workflows are designed to chain.
Why Tailscale
Every connection goes through Tailscale's WireGuard mesh rather than public IP + firewall rules. Benefits:
- No exposed SSH ports. The servers' public IPs do not need port 22 open. Some nodes use port knocking as an additional layer; the rest are Tailscale-only.
- Stable addressing. Tailscale IPs do not change when a VPS provider reassigns public IPs or when a machine moves networks.
- Mutual authentication. Both sides are authenticated by Tailscale's control plane. The SSH key is a second factor, not the only factor.
- Traversal. NAT traversal is handled automatically. The NAS behind a home router is as reachable as a cloud VPS.
Report Format
Reports are Markdown for three reasons:
- They render natively in any code editor, terminal pager, or web browser.
- They diff cleanly in git if you version-control your reports directory.
- They are parseable by AI assistants for follow-up analysis ("which servers had kernel updates last month?").
Operational Patterns
Chaining the Workflows
The typical sequence after a planned maintenance window:
- Health investigation (pre-update baseline).
- Server update (apply patches).
- Health investigation (post-update verification).
- Diff the two health reports to confirm nothing regressed.
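The last step can be narrowed to what actually matters: lines that are new in the post-update report and carry a WARNING or CRITICAL status. A sketch, with new_findings as a hypothetical name and the two report paths supplied by the caller:

```shell
#!/bin/sh
# new_findings BEFORE.md AFTER.md
# Print lines that appear only in the second report and mention
# WARNING or CRITICAL. No output means nothing regressed.
new_findings() {
    diff "$1" "$2" | sed -n 's/^> //p' | grep -E 'WARNING|CRITICAL' || true
}
```

This leans on the report format diffing cleanly line by line, which is one of the reasons the reports are Markdown tables.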
Frequency
| Workflow | Cadence | Trigger |
|---|---|---|
| Server update | Weekly or biweekly | Scheduled maintenance window |
| Health investigation | Weekly | After updates, after incidents, periodic audit |
| Ad-hoc health check | As needed | After kernel upgrades, infra changes, outage recovery |
Escalation
Issues found during either workflow follow a simple escalation model:
| Severity | Action | Timeline |
|---|---|---|
| CRITICAL | Fix immediately or take the server out of service | Same day |
| WARNING | Schedule a fix in the next maintenance window | Within 1 week |
| INFO | Note for future cleanup, no urgency | Next convenient time |
Lessons Learned
Always snapshot before updating. The pre-update snapshot has saved me twice: once when a kernel update broke a ZFS module (I knew exactly which kernel version to roll back to), and once when dnf autoremove removed a package that was actually needed (the snapshot showed it was installed before the update, so I knew what to reinstall).
Check /boot space before kernel updates. A full /boot partition causes dnf upgrade to fail mid-transaction, leaving the system in a partially updated state. The health investigation flags this at 80%, giving you time to clean up old kernels before it becomes an emergency.
Sequential updates are worth the time. Parallel updates are tempting on a multi-node fleet, but if an update breaks a shared dependency, you want to catch it on the first server before it propagates to all of them. Sequential with early abort is the safe default.
Container health checks matter more than you think. A container can be "running" (green in docker ps) but internally broken; it may be returning 500s, stuck in a retry loop, or consuming all available memory. The health investigation checks docker inspect health status and restart counts, not just the running state.
Report everything, even when nothing is wrong. A report that says "all healthy, no issues" is still valuable. It establishes a baseline and proves you checked. When something does break, you can point to the last clean report and narrow the window of change.