What I Learned Building Audit Logging for a Multi-Tenant Compliance Platform
I built the audit trail system for TeamPulse, a multi-tenant compliance platform with 139 API routes, independent feature gates per workspace, and full NIST SP 800-92 / SP 800-63B / SOC 2 CC6.1 compliance. The audit system went through three rewrites over four weeks. Each rewrite was informed by operational experience from the previous version. Here is what I learned.
Lesson 1: Audit logging must fail open
The single most important rule: an audit write failure must never break the primary operation. If a user updates a risk entry and the audit log write throws a database error, the user must still see the risk update succeed.
I enforce this with fire-and-forget writes wrapped in try/catch:
```typescript
// apps/web/lib/audit.ts
export async function writePlatformAuditLog(params) {
  try {
    await prisma.platformAuditLog.create({ data: { /* ... */ } });
  } catch (err) {
    console.error('[platform-audit] Failed to write audit log:', err);
    // Never re-throw -- primary operation already succeeded
  }
}
```

The withErrorHandler wrapper that instruments every API route uses void writePlatformAuditLog(...) -- explicitly discarding the promise -- so the response is sent before the audit write even resolves. Audit data is valuable, but availability of the product is more valuable.
A test verifies this behaviour: api-handler.test.ts asserts that a handler still returns 200 when the audit write throws. This test has caught regressions twice.
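The fail-open behaviour can be sketched without a database. This is a minimal illustration, not the real withErrorHandler: the `AuditWriter` and `Handler` types and the `failOpen` and `withAudit` names are hypothetical stand-ins for shapes the article does not show.

```typescript
// Hypothetical shapes for illustration only.
type AuditWriter = (event: { action: string }) => Promise<void>;
type Handler = () => Promise<{ status: number }>;

// Wraps a writer so that a throwing or rejecting write is logged and swallowed.
function failOpen(write: AuditWriter): AuditWriter {
  return async (event) => {
    try {
      await write(event);
    } catch (err) {
      console.error("[platform-audit] Failed to write audit log:", err);
    }
  };
}

// Runs the handler first, then starts the audit write without awaiting it.
function withAudit(handler: Handler, write: AuditWriter, action: string): Handler {
  const safeWrite = failOpen(write);
  return async () => {
    const response = await handler();
    void safeWrite({ action }); // fire-and-forget: the response does not wait on the write
    return response;
  };
}
```

Because `failOpen` never rejects, discarding its promise with `void` cannot produce an unhandled rejection, so it is safe to leave the write running after the response is returned.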
Lesson 2: Derive, do not configure
The first version of the audit middleware required each API route to manually declare its category, entity type, and action. Developers forgot, copied the wrong values, or omitted audit calls entirely on new routes.
The current version (parseRequestForAudit) derives everything from the URL and HTTP method. It is a pure function -- no database lookups, no session reads, no side effects. Given /api/compliance/risks/cm9x8y7z and method PATCH, it returns:
| Field | Derived Value | How |
|---|---|---|
| category | COMPLIANCE | URL prefix /api/compliance/ |
| entityType | Risk | De-pluralized PascalCase from segment |
| action | risk.update | HTTP method + entity type |
| entityId | cm9x8y7z | CUID-like segment detection |
| severity | INFO (or WARNING for DELETE) | Method + outcome |
Three design choices matter here:
- Action verbs are filtered out. Segments like attest, transition, review, quantify describe what is being done, not what entity is being targeted. Walking the URL backwards and skipping these verbs finds the real resource name.
- CUID detection by pattern. Segments that start with c and contain 15-40 lowercase alphanumeric characters are treated as entity IDs. This avoids parsing route parameter notation ([id]).
- Auth routes are excluded entirely. /api/auth/* paths return null from the parser. Auth events (sign-in, sign-out, MFA, password changes) are instrumented explicitly in the auth flow code. Mixing automatic and explicit logging produces duplicate entries and confusing severity assignments.
The payoff: any new API route under /api/compliance/, /api/security/, or /api/incidents/ is automatically audited with zero developer effort. There are 37 unit tests covering the URL parser: nested paths, kebab-case conversion, de-pluralization, action verb skipping, deep paths, and excluded routes.
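A rough sketch of this kind of URL-driven derivation follows. It is not the real parseRequestForAudit, and it rests on several assumptions:

- The function name `deriveAudit` is hypothetical.
- The method-to-verb map is invented for illustration.
- The action naming is a guess.
- The ID rule is read as "c followed by 15-40 lowercase alphanumerics".
- The de-pluralization is deliberately naive.

```typescript
type Severity = "INFO" | "WARNING";

interface DerivedAudit {
  category: string;
  entityType: string;
  action: string;
  entityId: string | null;
  severity: Severity;
}

// Assumed values: the article names only these four verbs.
const ACTION_VERBS = new Set(["attest", "transition", "review", "quantify"]);
// Assumed mapping from HTTP method to action verb.
const METHOD_VERBS: Record<string, string> = {
  POST: "create",
  PUT: "update",
  PATCH: "update",
  DELETE: "delete",
};
// Assumed reading of the CUID-like rule.
const ID_PATTERN = /^c[a-z0-9]{15,40}$/;

// "evidence-items" -> "EvidenceItem". Irregular plurals would need a lookup table.
function toPascalSingular(segment: string): string {
  const pascal = segment
    .split("-")
    .map((part) => part.charAt(0).toUpperCase() + part.slice(1))
    .join("");
  if (pascal.endsWith("ies")) return pascal.slice(0, -3) + "y";
  if (pascal.endsWith("s") && !pascal.endsWith("ss")) return pascal.slice(0, -1);
  return pascal;
}

// Pure function: no I/O, same input always yields the same output.
function deriveAudit(method: string, path: string): DerivedAudit | null {
  const segments = path.split("/").filter(Boolean);
  if (segments[0] !== "api" || segments.length < 3) return null;
  if (segments[1] === "auth") return null; // auth events are logged explicitly
  const verb = METHOD_VERBS[method.toUpperCase()];
  if (!verb) return null; // this sketch does not audit reads

  let entityId: string | null = null;
  let resource: string | null = null;
  // Walk backwards: the last segment that is neither an ID nor an action verb is the resource.
  for (let i = segments.length - 1; i >= 2; i--) {
    const segment = segments[i];
    if (ID_PATTERN.test(segment)) {
      if (entityId === null) entityId = segment;
      continue;
    }
    if (ACTION_VERBS.has(segment)) continue;
    resource = segment;
    break;
  }
  if (resource === null) return null;

  const entityType = toPascalSingular(resource);
  return {
    category: segments[1].toUpperCase().replace(/-/g, "_"),
    entityType,
    action: `${entityType.charAt(0).toLowerCase()}${entityType.slice(1)}.${verb}`,
    entityId,
    severity: method.toUpperCase() === "DELETE" ? "WARNING" : "INFO",
  };
}
```

Under these assumptions, a PATCH to /api/compliance/risks/ followed by a full-length ID yields category COMPLIANCE, entityType Risk, and action risk.update, matching the table above. Any /api/auth/ path yields null.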
Lesson 3: Two audit tables, not one
I split auditing into two tables serving different audiences:
| Aspect | Admin Audit Log | Platform Audit Log |
|---|---|---|
| Table | audit_logs | platform_audit_logs |
| Scope | Global -- all tenants | Tenant-scoped |
| Audience | Platform administrators | Tenant users |
| Events | Role changes, user management, tenant ops | Auth, compliance, security |
This separation matters for multi-tenant access control. A tenant user viewing /audit-log must never see another tenant's audit events. With a single table, every query requires a WHERE tenantId = ? clause -- easy to forget, hard to audit. With separate tables, the platform audit log is inherently scoped: tenant users can only query the tenant-scoped table through API routes that enforce their session's tenantId.
The admin audit log is only accessible through the admin app (port 3001), which requires a separate authentication session with SUPER_ADMIN role. No tenant user can access it, and no tenant-scoped API route reads from it.
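One way to enforce the session's tenantId is to build the query filter from the session alone. This is a minimal sketch; `SessionLike`, `AuditLogFilter`, and `buildAuditLogFilter` are hypothetical names, not identifiers from the codebase.

```typescript
// Hypothetical shapes for illustration only.
interface SessionLike {
  userId: string;
  tenantId: string;
}

interface AuditLogFilter {
  tenantId: string;
  action?: string;
}

// The tenant always comes from the session. A tenantId supplied by the client is ignored.
function buildAuditLogFilter(
  session: SessionLike,
  query: Record<string, string | undefined>,
): AuditLogFilter {
  const filter: AuditLogFilter = { tenantId: session.tenantId };
  if (query.action) filter.action = query.action;
  return filter;
}
```

Keeping the tenant scoping in one helper means a route handler cannot forget it, and a single unit test can cover the cross-tenant case.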
Lesson 4: tenantId must be nullable
The first version of platform_audit_logs had tenantId as a non-nullable foreign key. This broke immediately because sign-in failures happen before a tenant is selected -- there is no tenant ID to associate with a failed password attempt.
Making tenantId nullable fixed it, but required care:
- Auth events (sign-in failure, account lockout, 401 unauthorized) carry a null tenant.
- All other events carry the tenant from the session token.
- writePlatformAuditLog() accepts tenantId: string | null and passes it through without coercion.
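In type terms, "without coercion" means null stays null all the way to the row. A small sketch, with `AuditRecord` and `buildAuditRecord` as hypothetical names and the action string invented for the example:

```typescript
// Hypothetical shapes for illustration only.
interface AuditRecord {
  action: string;
  tenantId: string | null;
}

interface SessionLike {
  tenantId: string;
}

function buildAuditRecord(action: string, session: SessionLike | null): AuditRecord {
  // No session means a pre-authentication event: keep null, do not invent a placeholder tenant.
  return { action, tenantId: session ? session.tenantId : null };
}
```

A placeholder such as an empty string would satisfy a non-nullable column, but it would also make "no tenant" indistinguishable from a real tenant value in later queries.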
Lesson 5: Severity is a classification, not a guess
I follow NIST SP 800-92 guidance with three severity levels:
| Severity | Criteria |
|---|---|
| INFO | Successful operations (sign-in, CRUD, password change) |
| WARNING | Failed operations (invalid password, denied access, feature violations) and all delete actions, including successful ones |
| CRITICAL | Account lockout events |
Two principles guided the severity design:
- DELETE is always WARNING, even on success. Deleting data is a destruction event and should be flagged regardless of whether the operation itself completed without error.
- CRITICAL is reserved for security incidents. Only account lockout triggers it -- signaling a potential brute-force attack. No other event type reaches CRITICAL. If everything is critical, nothing is.
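The criteria above reduce to a short ordered check. This is a sketch rather than the production classifier; the `LOCKOUT_ACTION` string and the input shape are assumptions.

```typescript
type Severity = "INFO" | "WARNING" | "CRITICAL";
type Outcome = "SUCCESS" | "FAILURE";

// Hypothetical input shape for illustration only.
interface SeverityInput {
  action: string;
  outcome: Outcome;
  method?: string;
}

// Hypothetical action name for account lockout events.
const LOCKOUT_ACTION = "auth.account_locked";

function classifySeverity(event: SeverityInput): Severity {
  // Order matters: lockout outranks every other rule.
  if (event.action === LOCKOUT_ACTION) return "CRITICAL";
  if (event.outcome === "FAILURE") return "WARNING";
  // Deletes are flagged even when they succeed.
  if (event.method !== undefined && event.method.toUpperCase() === "DELETE") return "WARNING";
  return "INFO";
}
```

Putting the rules in one function keeps severity a lookup rather than a per-route judgment call.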
Lesson 6: Retention is part of the schema, not an afterthought
Per NIST SP 800-92 section 4.2.4, I encoded retention policy in the database schema comments before the first row was written:
- Online retention: 90 days -- data available for immediate querying in the live database.
- Archive retention: 1 year -- exported and stored for compliance audits.
- Critical events: retained indefinitely -- account lockouts and security incidents are never pruned.
Retrofitting retention onto a growing audit table is painful: you discover you needed a partition key, an archive format, or an index that makes pruning efficient. Designing it upfront costs nothing.
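A pruning job could encode the three tiers as a single decision function. This sketch assumes both windows are measured from the row's creation time and that "1 year" means 365 days. It treats every CRITICAL row as never pruned. The function and type names are hypothetical.

```typescript
type Severity = "INFO" | "WARNING" | "CRITICAL";
type RetentionAction = "keep_online" | "archive" | "prune" | "retain_indefinitely";

const MS_PER_DAY = 24 * 60 * 60 * 1000;
const ONLINE_DAYS = 90;
const ARCHIVE_DAYS = 365; // assumption: one year counted as 365 days from creation

function retentionAction(createdAt: Date, severity: Severity, now: Date): RetentionAction {
  // Critical events are never pruned, whatever their age.
  if (severity === "CRITICAL") return "retain_indefinitely";
  const ageDays = (now.getTime() - createdAt.getTime()) / MS_PER_DAY;
  if (ageDays <= ONLINE_DAYS) return "keep_online";
  if (ageDays <= ARCHIVE_DAYS) return "archive";
  return "prune";
}
```

Passing `now` in as a parameter keeps the function pure, so the boundaries can be tested with fixed dates.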
Lesson 7: NIST compliance is a feature, not a checklist
Every standards reference is traceable to a specific line of source code:
| Standard | Implementation |
|---|---|
| NIST SP 800-92 event categories | PlatformAuditCategory enum: AUTH, COMPLIANCE, SECURITY, ADMIN, etc. |
| NIST SP 800-92 log format | Every row has outcome, severity, source, actorId, actorEmail, metadata |
| NIST SP 800-63B lockout | AccountSecurity model tracks failedAttempts and lockedUntil; auth flow enforces 5/15 rule |
| SOC 2 CC6.1 immutability | No UPDATE or DELETE endpoints for audit tables; append-only |
| SOC 2 CC6.1 access protection | Audit tables gated behind authenticated API routes; no direct database access |
When an auditor asks "show me how you implement NIST SP 800-63B section 5.2.2," you should point to the exact file and line where the lockout logic lives, not just a paragraph in a policy document.
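As an example of logic small enough to point an auditor at, here is a sketch of lockout tracking built on the failedAttempts and lockedUntil fields from the table above. It reads the "5/15 rule" as five failed attempts triggering a 15-minute lock. That reading is an assumption, and the function names are hypothetical.

```typescript
interface AccountSecurity {
  failedAttempts: number;
  lockedUntil: Date | null;
}

const MAX_FAILED_ATTEMPTS = 5; // assumed meaning of the "5" in the 5/15 rule
const LOCK_MINUTES = 15; // assumed meaning of the "15" in the 5/15 rule

function isLocked(state: AccountSecurity, now: Date): boolean {
  return state.lockedUntil !== null && state.lockedUntil.getTime() > now.getTime();
}

// Returns the next state after a failed sign-in. The input is not mutated.
function recordFailedAttempt(state: AccountSecurity, now: Date): AccountSecurity {
  const failedAttempts = state.failedAttempts + 1;
  if (failedAttempts >= MAX_FAILED_ATTEMPTS) {
    const lockedUntil = new Date(now.getTime() + LOCK_MINUTES * 60 * 1000);
    return { failedAttempts, lockedUntil };
  }
  return { failedAttempts, lockedUntil: state.lockedUntil };
}
```

With thresholds declared as named constants, "file and line" is a literal answer: the constants and the comparison are the control.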
Lesson 8: E2E tests catch what unit tests miss
The audit log UI has a comprehensive Playwright test suite covering page load, filters, per-page selector, pagination controls, API contract, and mobile viewport (390x844).
Two bugs the suite caught before users did:
- Filter change did not reset pagination. Changing the action filter from "All" to "CREATE" while on page 3 returned 0 results because the filter was applied to page 3 of the unfiltered set.
- Page size selector triggered duplicate API calls. The useEffect dependency array was missing pageSize, so changing from 25 to 50 rows sent two requests (old size + new size).
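One way to make the first bug structurally impossible is to route every list-state change through a reducer that owns the page reset. This is a framework-free sketch with hypothetical names. Resetting the page on a page-size change is an added assumption, not something the article states.

```typescript
// Hypothetical state and event shapes for illustration only.
interface AuditListState {
  page: number;
  pageSize: number;
  actionFilter: string;
}

type AuditListEvent =
  | { type: "setFilter"; actionFilter: string }
  | { type: "setPageSize"; pageSize: number }
  | { type: "setPage"; page: number };

function reduceAuditList(state: AuditListState, event: AuditListEvent): AuditListState {
  switch (event.type) {
    case "setFilter":
      // A new filter changes the result set, so the old page index is meaningless.
      return { ...state, actionFilter: event.actionFilter, page: 1 };
    case "setPageSize":
      // Assumption: a new page size also returns to the first page.
      return { ...state, pageSize: event.pageSize, page: 1 };
    case "setPage":
      return { ...state, page: event.page };
  }
}
```

A data-fetching effect that depends on this single state object sees one consistent snapshot per change, with filter, page size, and page index updated together.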
Lesson 9: The three-rewrite pattern
| Version | Scope | Mechanism | Problem |
|---|---|---|---|
| v1.31.0 | Compliance only | logAudit() called explicitly in 19 route handlers | Not scalable; developers forgot |
| v1.45.3 | Compliance + pagination UI | Same logAudit() but with rows-per-page selector | Still manual; no auth event logging |
| v1.57.0 | Platform-wide | Pure-function derivation + withErrorHandler auto-instrumentation | Current state |
The progression -- explicit to automatic to unified -- is a pattern worth recognizing: the first version proves the concept, the second stabilizes the UX, the third eliminates the manual burden.
Summary
Building audit logging for a multi-tenant compliance platform taught nine lessons applicable to any SaaS product handling sensitive operations:
- Audit writes must fail open -- never break the primary operation.
- Derive audit metadata from the URL, do not require developers to configure it.
- Separate audit tables for separate audiences (admin vs tenant).
- Make tenantId nullable for pre-authentication events.
- Classify severity with clear criteria; reserve CRITICAL for security incidents.
- Design retention policy before writing the first audit row.
- Map every compliance standard reference to a specific line of source code.
- E2E tests catch pagination and filter interaction bugs that unit tests miss.
- Expect to rewrite audit instrumentation at least twice as the product grows.