Automated Server Fleet Management: Update and Health Investigation Workflows

These workflows are extracted from my actual operational runbooks. They manage a mixed fleet of Linux and macOS nodes connected over a Tailscale mesh network. They work well as manual checklists, shell scripts, or AI-assistant-driven procedures.

The Problem

I manage a fleet of servers with different roles: build machines, application hosts, a NAS, and cloud VPS instances. They run different operating systems, serve different purposes, and need different update strategies. Before formalizing these workflows, updates were ad-hoc SSH sessions where I would forget to check disk space before upgrading, miss a server that needed a reboot, or discover three weeks later that a kernel update broke a module directory.

The two workflows below solve this by making every step explicit, every check automated, and every result recorded in a structured report.

Architecture

All servers are connected via Tailscale, a WireGuard-based mesh VPN. SSH access uses Ed25519 key authentication with a consistent user across all nodes. The general connection pattern:

ssh -i ~/.ssh/id_ed25519 -o ConnectTimeout=10 -o BatchMode=yes user@<tailscale-ip> "hostname && uptime"

The -o BatchMode=yes flag is critical for automation: it disables interactive prompts, causing the connection to fail immediately if key auth does not work rather than hanging on a password prompt. The -o ConnectTimeout=10 prevents a single unreachable node from blocking the entire workflow.

The fleet includes both dnf-based (Fedora/RHEL) and apt-based (Debian/Ubuntu) systems, plus one macOS host that uses softwareupdate. Every script in these workflows detects the package manager at runtime and adapts accordingly.

Workflow 1: Server Update

The update workflow applies system patches across the fleet in a controlled, auditable sequence. It produces a Markdown report with per-package version diffs for every server.

Step 1 -- Verify Connectivity

Before touching any packages, confirm that every server is reachable:

ssh -i ~/.ssh/id_ed25519 -o ConnectTimeout=10 -o BatchMode=yes user@<IP> "hostname && uptime"

Run this against every node in the fleet. If some servers are unreachable, decide whether to proceed with the reachable subset or abort entirely. If all servers fail, something is wrong with the network or Tailscale. Stop and investigate.
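The triage described above can be sketched as a small loop. This is a minimal sketch rather than the runbook's actual script: `probe_host` and `triage_fleet` are hypothetical names, and `user` stands in for the fleet's SSH user.

```shell
# Hypothetical helper: succeeds if the host answers over key-based SSH.
probe_host() {
  ssh -i ~/.ssh/id_ed25519 -o ConnectTimeout=10 -o BatchMode=yes \
    "user@$1" "hostname && uptime" >/dev/null 2>&1 </dev/null
}

# Probe every host given as an argument and print one status line each,
# followed by a verdict: PROCEED (all up), PARTIAL (some up), ABORT (none up).
triage_fleet() {
  up=0
  down=0
  for host in "$@"; do
    if probe_host "$host"; then
      echo "REACHABLE $host"
      up=$((up + 1))
    else
      echo "UNREACHABLE $host"
      down=$((down + 1))
    fi
  done
  if [ "$up" -eq 0 ]; then
    echo "VERDICT ABORT"
  elif [ "$down" -gt 0 ]; then
    echo "VERDICT PARTIAL"
  else
    echo "VERDICT PROCEED"
  fi
}
```

Calling `triage_fleet` with the Tailscale IPs of the fleet prints the per-node status and leaves the PARTIAL case as an explicit decision for the operator.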

Step 2 -- Gather Update Parameters

Before running anything, define three parameters:

Parameter	Options	Default
Update scope	Full system, security-only, or specific packages	Full system
Reboot handling	Automatic, ask-per-server, or skip (flag only)	Ask-per-server
Execution order	Sequential (safer) or parallel (faster)	Sequential

Sequential updates are strongly recommended when servers run interdependent services. Parallel is acceptable for independent build machines or development nodes.

Step 3 -- Pre-Update Snapshot

Capture the baseline state of each server before making changes. This is the data you will diff against after the update.

ssh -i ~/.ssh/id_ed25519 user@<IP> bash <<'EOF'
echo "=== HOSTNAME ===" && hostname
echo "=== OS ===" && grep -E "^(NAME|VERSION)=" /etc/os-release
echo "=== KERNEL ===" && uname -r
echo "=== UPTIME ===" && uptime
echo "=== DISK ===" && df -h /
echo "=== PENDING UPDATES ==="
if command -v apt &>/dev/null; then
  apt list --upgradable 2>/dev/null | tail -n +2
elif command -v dnf &>/dev/null; then
  dnf check-update --quiet 2>/dev/null || true
fi
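echo "=== REBOOT REQUIRED ==="
if [ -f /var/run/reboot-required ]; then echo "YES"    # Debian/Ubuntu marker file
elif command -v needs-restarting &>/dev/null && ! needs-restarting -r &>/dev/null; then echo "YES"    # Fedora/RHEL
else echo "NO"; fi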
EOF

Record the OS, kernel version, disk usage on /, number of pending updates, and whether a reboot was already required. If disk usage on / exceeds 80%, warn before proceeding; an update that fills the root filesystem can leave the server in an unrecoverable state.

Step 4 -- Apply Updates

Detect the package manager and run the appropriate update command:

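# SCOPE and PACKAGES are set locally; the quoted heredoc is not expanded, so pass them in the remote command.
ssh -i ~/.ssh/id_ed25519 user@<IP> "SCOPE='$SCOPE' PACKAGES='$PACKAGES' bash -s" <<'SCRIPT'
set -e
case "$SCOPE" in full|security|specific) ;; *) echo "ERROR: SCOPE must be full, security, or specific" >&2; exit 1 ;; esac
 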
if command -v apt &>/dev/null; then
  sudo apt update
  case "$SCOPE" in
    full)     sudo DEBIAN_FRONTEND=noninteractive apt upgrade -y ;;
    security) sudo DEBIAN_FRONTEND=noninteractive unattended-upgrade ;;
    specific) sudo DEBIAN_FRONTEND=noninteractive apt install --only-upgrade -y $PACKAGES ;;
  esac
  sudo apt autoremove -y
 
elif command -v dnf &>/dev/null; then
  case "$SCOPE" in
    full)     sudo dnf upgrade -y ;;
    security) sudo dnf upgrade --security -y ;;
    specific) sudo dnf upgrade -y $PACKAGES ;;
  esac
  sudo dnf autoremove -y
 
else
  echo "ERROR: No supported package manager found"
  exit 1
fi
 
echo "=== UPDATE COMPLETE ==="
SCRIPT

For macOS hosts, skip the package manager entirely and use:

softwareupdate -l             # List available updates
sudo softwareupdate -ia       # Install all available updates (installing requires admin rights)

If an update command exits non-zero, capture the error output and decide whether to continue with the next server or abort the run. Never silently swallow update failures.
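One way to make that rule concrete is a sequential driver that logs each server's output and stops at the first failure. This is a sketch under assumptions: `update_host`, `update_sequentially`, the `LOG_DIR` variable, and the `update.sh` file holding the script from this step are all hypothetical names.

```shell
# Hypothetical stand-in for the per-server update shown in this step.
update_host() {
  ssh -i ~/.ssh/id_ed25519 -o ConnectTimeout=10 -o BatchMode=yes \
    "user@$1" "SCOPE=full bash -s" < update.sh
}

# Update hosts one at a time. Each host's output goes to LOG_DIR/<host>.log;
# the first failure stops the run so later hosts are left untouched.
update_sequentially() {
  log_dir=${LOG_DIR:-./update-logs}
  mkdir -p "$log_dir"
  for host in "$@"; do
    if update_host "$host" >"$log_dir/$host.log" 2>&1; then
      echo "OK $host"
    else
      status=$?
      echo "FAILED $host (exit $status), see $log_dir/$host.log"
      return 1
    fi
  done
}
```

Stopping on the first failure is the conservative choice. A variant that records the failure and continues would suit independent build machines better.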

Step 5 -- Handle Reboot

After updates complete, check whether a reboot is required:

# Debian/Ubuntu
[ -f /var/run/reboot-required ] && echo "REBOOT_REQUIRED=YES"
 
# Fedora/RHEL
needs-restarting -r &>/dev/null; [ $? -eq 1 ] && echo "REBOOT_REQUIRED=YES"

If rebooting automatically, issue sudo reboot and poll for SSH to come back:

ssh user@<IP> "sudo reboot"
sleep 30
for i in $(seq 1 18); do
  ssh -o ConnectTimeout=10 -o BatchMode=yes user@<IP> "uptime" 2>/dev/null && break
  sleep 10
done

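This gives the server roughly three and a half to six and a half minutes to come back online: a 30-second initial wait, then 18 polls that each sleep 10 seconds and can spend up to 10 more seconds waiting on ConnectTimeout. If it does not respond, flag it as requiring manual intervention.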
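The polling loop does not record whether the server ever came back. A minimal sketch of a variant that does, with hypothetical names `ssh_ready` and `wait_for_host` and `user` as a placeholder account:

```shell
# Hypothetical helper: succeeds once the host accepts key-based SSH again.
ssh_ready() {
  ssh -o ConnectTimeout=10 -o BatchMode=yes "user@$1" "uptime" \
    >/dev/null 2>&1 </dev/null
}

# Poll a host up to <attempts> times, sleeping <interval> seconds between tries.
# Prints BACK or MANUAL_INTERVENTION and returns 0 or 1 accordingly.
wait_for_host() {
  host=$1
  attempts=${2:-18}
  interval=${3:-10}
  i=1
  while [ "$i" -le "$attempts" ]; do
    if ssh_ready "$host"; then
      echo "BACK $host after $i attempt(s)"
      return 0
    fi
    sleep "$interval"
    i=$((i + 1))
  done
  echo "MANUAL_INTERVENTION $host"
  return 1
}
```

The non-zero return lets the calling workflow add the server to the report's issues section instead of silently moving on.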

Step 6 -- Post-Update Verification

Re-run the same checks from Step 3 and diff against the baseline:

Did the kernel version change? (Expected if a kernel package was updated.)
Did disk usage spike abnormally? (An increase of more than 5 percentage points warrants investigation.)
Are there remaining pending updates? (Should be zero for a full update.)
Are any systemd services in a failed state?

systemctl --failed --no-pager 2>/dev/null

A failed service after an update is a critical finding. It likely means a package upgrade broke a service configuration or a dependency changed.
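If the Step 3 output is saved to one file before the update and another after it, the comparison can be scripted. The sketch below assumes the `=== NAME ===` section headers printed by the snapshot script. `snapshot_section` and `compare_section` are hypothetical helpers, best suited to single-line sections such as KERNEL.

```shell
# Print the body of one "=== NAME ===" section from a saved snapshot file.
snapshot_section() {
  awk -v want="=== $1 ===" '
    $0 == want      { on = 1; next }   # start printing after the wanted header
    /^=== .* ===$/  { on = 0 }         # any other header ends the section
    on
  ' "$2"
}

# Compare one section across two snapshot files: prints UNCHANGED or CHANGED.
compare_section() {
  before=$(snapshot_section "$1" "$2")
  after=$(snapshot_section "$1" "$3")
  if [ "$before" = "$after" ]; then
    echo "$1: UNCHANGED"
  else
    echo "$1: CHANGED ($before -> $after)"
  fi
}
```

For multi-line sections such as PENDING UPDATES, running `diff` on the two `snapshot_section` outputs is likely more readable than the one-line summary.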

Step 7 -- Collect Package Details

For the report, extract the exact list of packages that changed. On dnf-based systems, query the transaction history:

LAST_TXN=$(sudo dnf history list --reverse | tail -1 | awk '{print $1}')
sudo dnf history info "$LAST_TXN"

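This gives you every package that was upgraded, installed, removed, or downgraded, with old and new version numbers. Because Step 4 runs dnf autoremove as a separate command, the most recent transaction can be the autoremove rather than the upgrade, so check the preceding transaction ID as well. On apt-based systems, parse /var/log/apt/history.log for the most recent transaction block.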
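For the apt side, a minimal sketch of that parsing might look like the following. It assumes the usual history.log layout, where each transaction starts with a `Start-Date:` line and upgrades appear on a single `Upgrade:` line as `name:arch (old, new)` entries. `last_apt_txn` and `apt_upgrades` are hypothetical names.

```shell
# Print the most recent transaction block from an apt history log.
last_apt_txn() {
  awk '
    /^Start-Date:/ { block = "" }      # a new transaction resets the buffer
    { block = block $0 "\n" }
    END { printf "%s", block }
  ' "${1:-/var/log/apt/history.log}"
}

# Read one transaction block on stdin; print "package old-version new-version"
# for every entry on its Upgrade: line.
apt_upgrades() {
  awk '/^Upgrade: / {
    sub(/^Upgrade: /, "")
    n = split($0, items, /\), /)
    for (i = 1; i <= n; i++) {
      item = items[i]
      gsub(/[(),]/, "", item)          # drop parentheses and the comma
      print item
    }
  }'
}
```

Piping one into the other (`last_apt_txn | apt_upgrades`) yields rows that map directly onto the report's upgraded-packages table.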

Step 8 -- Generate Report

Save a structured Markdown report to output/reports/server-update-YYYY-MM-DD_HHMMSS.md. The report includes:

Summary table:

Server	OS	Status	Upgraded	Installed	Removed	Reboot
build-server-1	Fedora 44	Success	23 pkgs	2 pkgs	0 pkgs	Yes
storage-node	Fedora 44	Success	18 pkgs	0 pkgs	1 pkg	No

Per-server detail sections including:

Before/after comparison table (kernel, disk, uptime, pending updates).
Full upgraded packages table with previous and new versions.
Newly installed packages (dependencies pulled in by upgrades).
Removed packages (autoremoved or replaced).
Security advisories addressed, if the package manager provides this data.

Notable updates section highlighting security-sensitive packages:

Kernel updates (version change, reboot implications).
Container runtime updates (docker-ce, containerd, podman).
Cryptographic libraries (openssl, gnutls, nss).
Remote access tools (openssh, sudo, curl).
Language runtimes (python3, nodejs, go).

Issues and warnings section listing any problems encountered:

Update failures with error messages.
Servers still pending a reboot.
Failed systemd services detected post-update.
Disk space warnings.
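The report skeleton can be produced with two small helpers. This is a sketch, not the actual generator: `report_path` and `md_row` are hypothetical names, and the column set is whatever the caller passes in.

```shell
# Build the timestamped report path for a workflow name ("update" or "health").
report_path() {
  printf 'output/reports/server-%s-%s.md\n' "$1" "$(date +%Y-%m-%d_%H%M%S)"
}

# Print one Markdown table row from the given cells.
md_row() {
  printf '|'
  for cell in "$@"; do
    printf ' %s |' "$cell"
  done
  printf '\n'
}

# Example use (commented out so that sourcing this file writes nothing):
#   report=$(report_path update)
#   mkdir -p "$(dirname "$report")"
#   {
#     md_row Server OS Status Upgraded Installed Removed Reboot
#     md_row --- --- --- --- --- --- ---
#     md_row build-server-1 "Fedora 44" Success "23 pkgs" "2 pkgs" "0 pkgs" Yes
#   } > "$report"
```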

Workflow 2: Server Health Investigation

The health investigation is a read-only audit. It makes no changes to any server. Use it after major events (OS upgrades, kernel updates, infrastructure migrations) or as a periodic check to catch drift before it becomes a problem.

Step 1 -- Verify Connectivity

Same as the update workflow: test SSH access to every node, record which are reachable, and proceed with the reachable subset.

Step 2 -- System Identity

Collect the baseline identity of each server:

echo "=== OS ===" && grep -E "^(NAME|VERSION|VERSION_ID)=" /etc/os-release
echo "=== KERNEL ===" && uname -r
echo "=== ARCH ===" && uname -m
echo "=== UPTIME ===" && uptime
echo "=== LAST REBOOT ===" && who -b
echo "=== TIMEZONE ===" && timedatectl | grep "Time zone"

This establishes what you are working with. A server running an unexpected kernel version or the wrong timezone is an early signal that something has drifted.

Step 3 -- Systemd Service Health

Check for failed services and verify that critical services are running:

# Any failed services?
systemctl --failed --no-pager
 
# Is the system in a degraded state?
systemctl is-system-running
 
# Check critical services; skip any that are both inactive and disabled (intentionally off)
for svc in sshd docker containerd tailscaled chronyd crond firewalld; do
  STATUS=$(systemctl is-active "$svc" 2>/dev/null)
  ENABLED=$(systemctl is-enabled "$svc" 2>/dev/null)
  if [ "$STATUS" != "inactive" ] || [ "$ENABLED" != "disabled" ]; then
    echo "$svc: active=$STATUS enabled=$ENABLED"
  fi
done

Flags:

Any failed service is a finding.
A "degraded" system state means the system is operational but at least one unit is in a failed state.
sshd not active is critical (you are connected over SSH; if it restarts and fails, you lose access).

Step 4 -- Disk and Filesystem Health

df -h                              # Filesystem usage
df -i /                            # Inode usage
lsblk -o NAME,SIZE,TYPE,FSTYPE,MOUNTPOINT  # Block device layout
 
# SMART health (requires smartmontools)
for dev in $(lsblk -d -n -o NAME | grep -E "^(sd|nvme|vd)"); do
  sudo smartctl -H "/dev/$dev" | grep -E "^(SMART|overall|result)"
done
 
# Storage arrays
cat /proc/mdstat 2>/dev/null       # MD RAID
sudo zpool status 2>/dev/null      # ZFS pools
sudo lvs --noheadings 2>/dev/null  # LVM volumes

Thresholds:

Metric	Warning	Critical
Filesystem usage	>85%	>95%
Inode usage	>85%	>95%
SMART health	Any warning	Failed
RAID/ZFS status	Degraded	Faulted
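The filesystem and inode thresholds can be applied mechanically to df output. The sketch below assumes POSIX-format output (`df -P` for space, `df -Pi` for inodes on Linux), where the percentage is the fifth column and the mount point the sixth. `classify_usage` is a hypothetical name.

```shell
# Read `df -P` (or `df -Pi`) output on stdin and print a severity per mount.
# Optional arguments override the warning and critical percentages.
classify_usage() {
  awk -v warn="${1:-85}" -v crit="${2:-95}" '
    NR == 1 { next }                     # skip the header row
    {
      pct = $5
      sub(/%/, "", pct)
      if (pct !~ /^[0-9]+$/) next        # e.g. "-" when inodes are not reported
      level = "OK"
      if (pct + 0 > crit + 0) level = "CRITICAL"
      else if (pct + 0 > warn + 0) level = "WARNING"
      print level, $6, pct "%"
    }
  '
}
```

For example, `df -P | classify_usage` uses the defaults from the table, while `df -P /boot | classify_usage 80 95` applies a stricter warning level to a single mount.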

Step 5 -- Memory and Swap

free -h
swapon --show
ps aux --sort=-%mem | head -6
sudo journalctl --since "7 days ago" _TRANSPORT=kernel | grep -ciE "out of memory: kill"   # OOM kills; -k alone would limit this to the current boot

Flags:

Available memory below 500 MB.
Swap usage above 50% (indicates memory pressure).
Any OOM kills in the last 7 days (a process was killed by the kernel due to memory exhaustion).
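The first two flags can be derived from `free -m` output. This sketch assumes the procps-ng layout, in which the `Mem:` row ends with an `available` column (the seventh field). `memory_flags` is a hypothetical name.

```shell
# Read `free -m` output on stdin and print one line per flag raised.
memory_flags() {
  awk '
    /^Mem:/  { avail = $7 }
    /^Swap:/ { swap_total = $2; swap_used = $3 }
    END {
      if (avail + 0 < 500)
        print "LOW_MEMORY available=" avail "MB"
      if (swap_total > 0 && swap_used / swap_total > 0.5)
        print "SWAP_PRESSURE used=" swap_used "MB of " swap_total "MB"
    }
  '
}
```

Running `free -m | memory_flags` on a healthy host prints nothing, which keeps the report section empty unless there is something to say.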

Step 6 -- CPU and Load

echo "$(nproc) cores"
cat /proc/loadavg
ps aux --sort=-%cpu | head -6
iostat -c 1 2 | awk 'NF' | tail -1    # CPU steal and iowait (second sample; drops the trailing blank line)

Flags:

Load average (1-minute) exceeding 2x the core count.
Any single process consuming >80% CPU persistently.
iowait above 20% (indicates disk I/O bottleneck).
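The load-average flag needs a floating-point comparison, which plain shell arithmetic cannot do, so this sketch hands it to awk. `load_flag` is a hypothetical name; its arguments are the 1-minute load and the core count.

```shell
# Print HIGH_LOAD if the 1-minute load average exceeds 2x the core count.
load_flag() {
  awk -v load="$1" -v cores="$2" 'BEGIN {
    if (load + 0 > 2 * cores) print "HIGH_LOAD " load " on " cores " cores"
    else print "OK " load " on " cores " cores"
  }'
}

# Example use on a live host (commented out):
#   load_flag "$(cut -d" " -f1 /proc/loadavg)" "$(nproc)"
```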

Step 7 -- Network and Connectivity

ip -brief addr                     # Network interfaces
ip route show default              # Default route
dig +short google.com A            # DNS resolution
curl -sL -o /dev/null -w "%{http_code}" --connect-timeout 5 https://google.com   # -L follows the redirect to www so a healthy result is 200
sudo ss -tlnp | head -20           # Listening ports
tailscale status | head -5         # Mesh VPN status

Flags:

DNS resolution failure.
External connectivity failure (HTTP status not 200).
Tailscale not connected (if the server is expected to be on the mesh).
Unexpected listening ports.
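"Unexpected" only means something relative to an allowlist. A minimal sketch, assuming header-less `ss -tlnH` output where the fourth column is the local address and port; `unexpected_ports` and the example allowlist are hypothetical.

```shell
# Read `ss -tlnH` output on stdin and print every listening port that is
# not in the space-separated allowlist passed as the first argument.
unexpected_ports() {
  awk -v allowed="$1" '
    BEGIN {
      n = split(allowed, list, " ")
      for (i = 1; i <= n; i++) ok[list[i]] = 1
    }
    {
      port = $4
      sub(/.*:/, "", port)               # keep what follows the last colon
      if (!(port in ok) && !(port in seen)) { seen[port] = 1; print port }
    }
  '
}

# Example use on a live host (commented out):
#   ss -tlnH | unexpected_ports "22 443"
```

The allowlist would differ per server role, so it probably belongs in the same inventory that lists each node's Tailscale IP.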

Step 8 -- Container Runtime Health

This is the most detailed check. Containers are where most application logic runs, and a misbehaving container can consume all host resources.

# Docker
docker --version
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Image}}\t{{.Ports}}"
 
# Health and restart counts
docker ps --format "{{.Names}}" | while read -r name; do
  health=$(docker inspect --format='{{if .State.Health}}{{.State.Health.Status}}{{else}}no-healthcheck{{end}}' "$name")
  restarts=$(docker inspect --format='{{.RestartCount}}' "$name")
  echo "$name: health=$health restarts=$restarts"
done
 
# Resource usage
docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}"
 
# Disk consumption
docker system df
 
# Compose projects
docker compose ls

Flags:

Condition	Severity
Container health = unhealthy	Critical
Container status = restarting	Critical
Restart count > 5	Warning
Container using > 80% host memory	Warning
Docker images > 50 GB total	Warning (prune needed)
Build cache > 50 GB	Warning (run `docker builder prune`)
Exited containers older than 7 days	Info
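The health and restart-count rows of this table can be applied to the `name: health=... restarts=...` lines that the inspection loop prints. This is a sketch covering only those two conditions; `container_flags` is a hypothetical name.

```shell
# Read lines shaped like "name: health=unhealthy restarts=7" on stdin
# and print a severity line for each condition a container trips.
container_flags() {
  awk '
    {
      name = $1;     sub(/:$/, "", name)
      health = $2;   sub(/^health=/, "", health)
      restarts = $3; sub(/^restarts=/, "", restarts)
      if (health == "unhealthy") print "CRITICAL " name " is unhealthy"
      if (restarts + 0 > 5)      print "WARNING " name " restarted " restarts " times"
    }
  '
}
```

Containers without a health check pass the first test by design, which is why the restart count matters as a second signal.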

Step 9 -- Security Posture

# Pending security updates
dnf updateinfo list --available --security | head -20
 
# SELinux status
getenforce
 
# SSH hardening (effective settings, including defaults and drop-ins under sshd_config.d)
sudo sshd -T | grep -Ei "^(permitrootlogin|passwordauthentication) "
 
# Failed login attempts (last 7 days)
sudo journalctl --since "7 days ago" -u sshd | grep -ci "failed\|invalid"

Flags:

SELinux disabled on a server that should be enforcing.
PermitRootLogin yes (should be no or prohibit-password).
PasswordAuthentication yes on servers that should be key-only.
Elevated failed SSH login attempts (threshold depends on exposure; a public-facing server with 500 failures/week is normal, while an internal Tailscale-only server with 50 is suspicious).

Step 10 -- GRUB and Boot Configuration

Often overlooked, but critical after kernel updates:

# GRUB entries and default
sudo grubby --info=ALL | grep -E '^(index|kernel|title)'
sudo grubby --default-kernel
 
# Installed kernel packages
rpm -qa kernel-core --qf '%{NAME}-%{VERSION}-%{RELEASE}.%{ARCH}\n' | sort
 
# Orphan module directories
for d in /lib/modules/*/; do
  kver=$(basename "$d")
  rpm -q "kernel-core-${kver%.*}" &>/dev/null || echo "ORPHAN: $d"
done
 
# Boot partition usage
df -h /boot
df -h /boot/efi

Flags:

GRUB default kernel does not match the running kernel (mismatch after update without reboot).
More than 3 kernel entries (stale kernels consuming /boot space).
/boot partition above 80% usage.
Orphan /lib/modules/ directories with no matching installed kernel package.
/boot/efi mount missing the nofail option in /etc/fstab (can prevent boot if the EFI partition is temporarily unavailable).
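None of the commands in this step inspect /etc/fstab, so the last flag needs its own check. A minimal sketch, assuming the standard six-field fstab layout with mount options in the fourth field; `efi_nofail_check` is a hypothetical name.

```shell
# Print MISSING_NOFAIL if the given fstab has a /boot/efi entry whose
# options (4th field) do not include nofail; print nothing otherwise.
efi_nofail_check() {
  awk '
    /^[ \t]*#/ { next }                  # skip comment lines
    $2 == "/boot/efi" && $4 !~ /(^|,)nofail(,|$)/ {
      print "MISSING_NOFAIL " $1
    }
  ' "${1:-/etc/fstab}"
}
```

With no argument it reads /etc/fstab, so `efi_nofail_check` can be appended to the commands above as-is.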

Step 11 -- Application-Specific Checks

Catch-all for everything else:

# Scheduled tasks
sudo crontab -l
systemctl list-timers --no-pager | head -15
 
# NTP synchronization
timedatectl show | grep -E "^(NTP|Synchronized)"
 
# Recent errors
sudo journalctl --since "1 hour ago" -p err --no-pager | tail -20
 
# Kernel errors
sudo dmesg --level=err,crit,alert,emerg | tail -10

NTP not synchronized is a warning; clock drift causes TLS certificate validation failures, log timestamp confusion, and distributed system coordination issues.

Step 12 -- Generate Report

Save to output/reports/server-health-YYYY-MM-DD_HHMMSS.md with per-server detail.

Summary table:

Server	OS	Kernel	Uptime	Status	Issues
build-server-1	Fedora 44	7.0.10	12d 4h	Healthy	0
storage-node	Fedora 44	7.0.10	23d 8h	Warning	2

Per-server sections with a category-level status matrix:

Category	Status	Details
System services	OK	0 failed services
Disk	WARNING	/boot at 87%
Memory	OK	28.3 GB available
CPU	OK	Load 0.12 (8 cores)
Network	OK	All checks passed
Containers	OK	4 running, 0 unhealthy
Security	OK	0 pending security updates
GRUB/Boot	WARNING	4 kernel entries, 1 orphan module dir

Each finding gets a severity level (CRITICAL, WARNING, INFO) and a recommended action. The report ends with aggregated sections for critical issues, warnings, and recommendations.

Design Decisions

Why Workflows, Not Scripts

These are structured procedures, not shell scripts. The distinction matters:

Decision points. A script either handles every edge case or crashes. A workflow can pause and ask: "Server X has 94% disk usage. Proceed with the update anyway?" An AI assistant or a human operator can make that judgment call with context a script does not have.
Heterogeneous fleet. The fleet includes dnf-based systems, apt-based systems, and a macOS host. A single script would need extensive branching; a workflow describes the intent and lets each step adapt to the detected environment.
Auditability. The structured report is not an afterthought; it is the primary output. Package-level version diffs, before/after comparisons, and flagged issues create an audit trail that a raw script log does not provide.
Composability. Run the health investigation before and after the update workflow. Use the update workflow's report to feed into change management. The workflows are designed to chain.

Why Tailscale

Every connection goes through Tailscale's WireGuard mesh rather than public IP + firewall rules. Benefits:

No exposed SSH ports. The servers' public IPs do not need port 22 open. Some nodes use port knocking as an additional layer; the rest are Tailscale-only.
Stable addressing. Tailscale IPs do not change when a VPS provider reassigns public IPs or when a machine moves networks.
Mutual authentication. Both sides are authenticated by Tailscale's control plane. The SSH key is a second factor, not the only factor.
Traversal. NAT traversal is handled automatically. The NAS behind a home router is as reachable as a cloud VPS.

Report Format

Reports are Markdown for three reasons:

They render natively in any code editor, terminal pager, or web browser.
They diff cleanly in git if you version-control your reports directory.
They are parseable by AI assistants for follow-up analysis ("which servers had kernel updates last month?").

Operational Patterns

Chaining the Workflows

The typical sequence after a planned maintenance window:

Health investigation (pre-update baseline).
Server update (apply patches).
Health investigation (post-update verification).
Diff the two health reports to confirm nothing regressed.

Frequency

Workflow	Cadence	Trigger
Server update	Weekly or biweekly	Scheduled maintenance window
Health investigation	Weekly	After updates, after incidents, periodic audit
Ad-hoc health check	As needed	After kernel upgrades, infra changes, outage recovery

Escalation

Issues found during either workflow follow a simple escalation model:

Severity	Action	Timeline
CRITICAL	Fix immediately or take the server out of service	Same day
WARNING	Schedule a fix in the next maintenance window	Within 1 week
INFO	Note for future cleanup, no urgency	Next convenient time

Lessons Learned

Always snapshot before updating. The pre-update snapshot has saved me twice: once when a kernel update broke a ZFS module (I knew exactly which kernel version to roll back to), and once when dnf autoremove removed a package that was actually needed (the snapshot showed it was installed before the update, so I knew what to reinstall).

Check /boot space before kernel updates. A full /boot partition causes dnf upgrade to fail mid-transaction, leaving the system in a partially updated state. The health investigation flags this at 80%, giving you time to clean up old kernels before it becomes an emergency.

Sequential updates are worth the time. Parallel updates are tempting on a multi-node fleet, but if an update breaks a shared dependency, you want to catch it on the first server before it propagates to all of them. Sequential with early abort is the safe default.

Container health checks matter more than you think. A container can be "running" (listed as Up in docker ps) but internally broken; it may be returning 500s, stuck in a retry loop, or consuming all available memory. The health investigation checks docker inspect health status and restart counts, not just the running state.

Report everything, even when nothing is wrong. A report that says "all healthy, no issues" is still valuable. It establishes a baseline and proves you checked. When something does break, you can point to the last clean report and narrow the window of change.