What I Learned Building Audit Logging for a Multi-Tenant Compliance Platform

I built the audit trail system for TeamPulse, a multi-tenant compliance platform with 139 API routes and independent feature gates per workspace. The audit subsystem is designed to meet NIST SP 800-92 ^[1], SP 800-63B ^[2], and SOC 2 CC6.1 ^[3] guidance. It went through three rewrites over four weeks, each informed by operational experience from the previous version. Here is what I learned.

Lesson 1: Audit logging must fail open

The single most important rule: an audit write failure must never break the primary operation. If a user updates a risk entry and the audit log write throws a database error, the user must still see the risk update succeed.

I enforce this with fire-and-forget writes wrapped in try/catch:

// apps/web/lib/audit.ts
export async function writePlatformAuditLog(params) {
  try {
    await prisma.platformAuditLog.create({ data: { /* ... */ } });
  } catch (err) {
    console.error('[platform-audit] Failed to write audit log:', err);
    // Never re-throw -- primary operation already succeeded
  }
}

The withErrorHandler wrapper that instruments every API route calls void writePlatformAuditLog(...) to discard the promise, so the response can return before the audit write resolves. The try/catch inside the function is the correctness guarantee: if the write fails, the error is caught and logged internally, so the discarded promise can never reject. The outer void is a performance optimization that avoids blocking the response on I/O. In strict TypeScript with the @typescript-eslint/no-floating-promises rule enabled, void is the accepted way to mark an intentionally discarded promise.

A test verifies this behaviour: api-handler.test.ts asserts that a handler still returns 200 when the audit write throws. This test has caught regressions twice.
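The fail-open behaviour described above can be sketched as follows. This is a minimal illustration, not the actual withErrorHandler: the AuditParams and HandlerResult types, the makeSafeAuditWriter helper, and the injected write function are all hypothetical stand-ins for the real Prisma call and framework types.

```typescript
// Sketch only: AuditParams, HandlerResult, and the injected `write` function
// are hypothetical stand-ins for the real Prisma call and framework types.
type AuditParams = { action: string; tenantId: string | null };
type HandlerResult = { status: number; body: unknown };
type AuditWriter = (params: AuditParams) => Promise<void>;

// Wraps a raw writer so that a rejected write is logged and swallowed.
export function makeSafeAuditWriter(write: AuditWriter): AuditWriter {
  return async (params) => {
    try {
      await write(params);
    } catch (err) {
      console.error('[platform-audit] Failed to write audit log:', err);
      // Never re-throw: the primary operation has already succeeded.
    }
  };
}

// Runs the handler first, then fires the audit write without awaiting it.
export function withErrorHandler(
  handler: () => Promise<HandlerResult>,
  audit: AuditWriter,
  params: AuditParams,
): () => Promise<HandlerResult> {
  return async () => {
    const result = await handler();
    void audit(params); // intentionally discarded promise
    return result;
  };
}
```

Because the safe writer never rejects, discarding its promise cannot produce an unhandled rejection. A test in the spirit of api-handler.test.ts would pass in a writer that always throws and assert that the wrapped handler still resolves with status 200.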

Lesson 2: Derive, do not configure

The first version of the audit middleware required each API route to manually declare its category, entity type, and action. Developers forgot, copied the wrong values, or omitted audit calls entirely on new routes.

The current version (parseRequestForAudit) derives everything from the URL and HTTP method. It is a pure function -- no database lookups, no session reads, no side effects. Given /api/compliance/risks/cm9x8y7z and method PATCH, it returns:

Field	Derived Value	How
category	COMPLIANCE	URL prefix /api/compliance/
entityType	Risk	De-pluralized PascalCase from segment
action	risk.update	HTTP method + entity type
entityId	cm9x8y7z	CUID-like segment detection
severity	INFO (or WARNING for DELETE)	Method + outcome

Three design choices matter here:

Action verbs are filtered out. Segments like attest, transition, review, quantify describe what is being done, not what entity is being targeted. Walking the URL backwards and skipping these verbs finds the real resource name.

Opaque ID detection by length and character set. Segments that are 15-40 lowercase alphanumeric characters (matching common auto-generated ID formats like CUIDs, CUID2, or nanoid) are treated as entity IDs. Short identifiers or custom ID schemes that do not match the length threshold may need explicit handling. This heuristic avoids parsing route parameter notation ([id]).

Auth routes are excluded entirely. /api/auth/* paths return null from the parser. Auth events (sign-in, sign-out, MFA, password changes) are instrumented explicitly in the auth flow code. Mixing automatic and explicit logging produces duplicate entries and confusing severity assignments.

The payoff: any new API route under /api/compliance/, /api/security/, or /api/incidents/ is automatically audited with zero developer effort. There are 37 unit tests covering the URL parser: nested paths, kebab-case conversion, de-pluralization, action verb skipping, deep paths, and excluded routes.
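A rough sketch of this kind of URL-driven derivation is shown below. It is not the production parseRequestForAudit: the category map, the verb list beyond the four named above, the method-to-verb table, and the camel-cased action prefix are assumptions made for illustration. Severity is derived from the method only, since a pure URL parser does not know the outcome. The example ID in the text is shorter than the 15-character threshold, so any test against this sketch needs a full-length ID.

```typescript
// Sketch only: the maps and verb list below are illustrative assumptions.
interface DerivedAudit {
  category: string;
  entityType: string;
  action: string;
  entityId: string | null;
  severity: 'INFO' | 'WARNING';
}

const CATEGORY_BY_PREFIX: Record<string, string> = {
  compliance: 'COMPLIANCE',
  security: 'SECURITY',
  incidents: 'INCIDENTS', // assumed enum value
};
const ACTION_VERBS = new Set(['attest', 'transition', 'review', 'quantify']);
const METHOD_VERBS: Record<string, string> = {
  GET: 'read', POST: 'create', PUT: 'update', PATCH: 'update', DELETE: 'delete',
};
const OPAQUE_ID = /^[a-z0-9]{15,40}$/; // length and charset heuristic

function singular(word: string): string {
  if (word.endsWith('ies')) return word.slice(0, -3) + 'y';
  if (word.endsWith('s') && !word.endsWith('ss')) return word.slice(0, -1);
  return word;
}

// "risk-entries" -> "RiskEntry"
function toEntityType(segment: string): string {
  const words = segment.split('-');
  const last = words.pop() ?? '';
  words.push(singular(last));
  return words.map((w) => w.charAt(0).toUpperCase() + w.slice(1)).join('');
}

export function parseRequestForAudit(path: string, method: string): DerivedAudit | null {
  const segments = path.split('/').filter((s) => s.length > 0);
  if (segments.length < 3 || segments[0] !== 'api') return null;
  const prefix = segments[1] ?? '';
  if (prefix === 'auth') return null; // auth events are logged explicitly
  const category = CATEGORY_BY_PREFIX[prefix];
  const verb = METHOD_VERBS[method.toUpperCase()];
  if (category === undefined || verb === undefined) return null;

  let entityId: string | null = null;
  let resource: string | null = null;
  // Walk the URL backwards: remember the first opaque ID, skip action verbs,
  // and stop at the first segment that looks like a resource name.
  for (let i = segments.length - 1; i >= 2; i--) {
    const seg = segments[i] ?? '';
    if (OPAQUE_ID.test(seg)) {
      if (entityId === null) entityId = seg;
      continue;
    }
    if (ACTION_VERBS.has(seg)) continue;
    resource = seg;
    break;
  }
  if (resource === null) return null;

  const entityType = toEntityType(resource);
  const actionPrefix = entityType.charAt(0).toLowerCase() + entityType.slice(1);
  return {
    category,
    entityType,
    action: `${actionPrefix}.${verb}`,
    entityId,
    severity: method.toUpperCase() === 'DELETE' ? 'WARNING' : 'INFO',
  };
}
```

One limitation of the pure length-and-charset check: a resource name of 15 or more lowercase letters with no hyphen would be mistaken for an ID, which is the kind of case that needs explicit handling.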

Lesson 3: Two audit tables, not one

I split auditing into two tables serving different audiences:

Aspect	Admin Audit Log	Platform Audit Log
Table	`audit_logs`	`platform_audit_logs`
Scope	Global -- all tenants	Tenant-scoped
Audience	Platform administrators	Tenant users
Events	Role changes, user management, tenant ops	Auth, compliance, security

This separation matters for multi-tenant access control. A tenant user viewing /audit-log must never see another tenant's audit events, and must never see global admin events. With a single table, every query has to carry both a WHERE tenantId = ? clause and a filter that excludes admin-only rows -- easy to forget, hard to audit. With separate tables, global admin events never sit in a table that tenant-facing code can read, and tenant users can only query the tenant-scoped table through API routes that enforce their session's tenantId.

The admin audit log is only accessible through the admin app (port 3001), which requires a separate authentication session with SUPER_ADMIN role. No tenant user can access it, and no tenant-scoped API route reads from it.
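One way to make the session-enforced scoping hard to bypass is to build every tenant-facing query through a single helper that takes the tenant from the session and nothing else. The sketch below assumes hypothetical TenantSession and AuditLogFilters types and a 100-row page cap; the returned object loosely mirrors the shape of a Prisma findMany argument but is plain data here.

```typescript
// Sketch only: types, defaults, and the page cap are illustrative assumptions.
interface TenantSession { userId: string; tenantId: string }
interface AuditLogFilters { category?: string; page?: number; pageSize?: number }

interface TenantAuditQuery {
  where: { tenantId: string; category?: string };
  orderBy: { createdAt: 'desc' };
  skip: number;
  take: number;
}

export function buildTenantAuditQuery(
  session: TenantSession,
  filters: AuditLogFilters,
): TenantAuditQuery {
  const pageSize = Math.min(Math.max(filters.pageSize ?? 25, 1), 100);
  const page = Math.max(filters.page ?? 1, 1);

  // tenantId always comes from the session, never from request input.
  const where: { tenantId: string; category?: string } = {
    tenantId: session.tenantId,
  };
  if (filters.category !== undefined) where.category = filters.category;

  return {
    where,
    orderBy: { createdAt: 'desc' },
    skip: (page - 1) * pageSize,
    take: pageSize,
  };
}
```

Because AuditLogFilters has no tenantId field, a caller cannot widen the scope even by accident; the only way to change the tenant is to change the session.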

Lesson 4: tenantId must be nullable

The first version of platform_audit_logs had tenantId as a non-nullable foreign key. This broke immediately because sign-in failures happen before a tenant is selected -- there is no tenant ID to associate with a failed password attempt.

Making tenantId nullable fixed it, but required care:

Auth events (sign-in failure, account lockout, 401 unauthorized) carry null tenant.
All other events carry the tenant from the session token.
writePlatformAuditLog() accepts tenantId: string | null and passes it through without coercion.
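The "no coercion" rule can be illustrated with a small sketch. The SessionLike and PlatformAuditEvent types and the action names are hypothetical; the point is only that a missing session yields null rather than an empty string or a sentinel tenant.

```typescript
// Sketch only: types and action names are illustrative assumptions.
type SessionLike = { tenantId: string } | null;

interface PlatformAuditEvent {
  action: string;
  tenantId: string | null;
}

export function buildAuditEvent(action: string, session: SessionLike): PlatformAuditEvent {
  // null means "no tenant selected yet"; it is never replaced by '' or a
  // placeholder ID, so pre-authentication rows stay distinguishable.
  return { action, tenantId: session?.tenantId ?? null };
}
```

Keeping null as null also means a tenant-scoped query filtering on a concrete tenantId never matches pre-authentication rows.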

Lesson 5: Severity is a classification, not a guess

I follow NIST SP 800-92 guidance ^[1] with three severity levels:

Severity	Criteria
INFO	Successful operations (sign-in, CRUD, password change)
WARNING	Failed operations (invalid password, denied access, feature violations), plus all delete actions regardless of outcome
CRITICAL	Account lockout events

Two principles guided the severity design:

DELETE is always WARNING, even on success. Deleting data is a destruction event and should be flagged regardless of whether the operation itself completed without error.
CRITICAL is reserved for security incidents. Only account lockout triggers it -- signaling a potential brute-force attack. No other event type reaches CRITICAL. If everything is critical, nothing is.
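The severity table and the two principles reduce to a short, ordered classification. The sketch below is one possible encoding; the SeverityInput shape and the lockout action name are assumptions, not the names used in the codebase.

```typescript
// Sketch only: the input shape and LOCKOUT_ACTION value are assumptions.
type Severity = 'INFO' | 'WARNING' | 'CRITICAL';
type Outcome = 'SUCCESS' | 'FAILURE';

interface SeverityInput {
  action: string;
  outcome: Outcome;
  method?: string;
}

const LOCKOUT_ACTION = 'auth.account.lockout'; // hypothetical action name

export function classifySeverity(input: SeverityInput): Severity {
  // CRITICAL is reserved for account lockout; nothing else reaches it.
  if (input.action === LOCKOUT_ACTION) return 'CRITICAL';
  // Any failed operation is a WARNING.
  if (input.outcome === 'FAILURE') return 'WARNING';
  // DELETE is a WARNING even when it succeeds.
  if (input.method?.toUpperCase() === 'DELETE') return 'WARNING';
  return 'INFO';
}
```

The order of the checks carries the policy: lockout wins over everything, failure wins over method, and INFO is only the fall-through.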

Lesson 6: Retention is part of the schema, not an afterthought

Per NIST SP 800-92 section 4.2.4 ^[1], I encoded retention policy in the database schema comments before the first row was written:

Online retention: 90 days -- data available for immediate querying in the live database.
Archive retention: 1 year -- exported and stored for compliance audits.
Critical events: retained indefinitely -- account lockouts and security incidents are never pruned.

Retrofitting retention onto a growing audit table is painful: you discover you needed a partition key, an archive format, or an index that makes pruning efficient. Designing it upfront costs nothing.

Where the archive lives

The audit archive is not stored in the application database. A cron job runs weekly and exports rows older than 90 days as newline-delimited JSON (one JSON object per line, with all PII fields stripped) to a GCS bucket with object-level retention lock. The export query filters on WHERE createdAt < NOW() - INTERVAL '90 days' AND severity != 'CRITICAL', so critical events stay in the live database permanently. The GCS bucket is in the same region as the database to avoid egress costs, and the service account used by the export job is granted only the Storage Object Creator role (roles/storage.objectCreator) -- it can create objects but cannot read, list, or delete them once written.
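The selection and serialization steps of such an export can be sketched in memory as below. This does not talk to a database or to GCS. The AuditRow shape is an assumption, and treating actorEmail as the only PII field is an assumption too; the real export may strip more.

```typescript
// Sketch only: AuditRow and the PII field list are illustrative assumptions.
interface AuditRow {
  id: string;
  createdAt: Date;
  severity: 'INFO' | 'WARNING' | 'CRITICAL';
  action: string;
  actorId: string | null;
  actorEmail: string | null;
}

const ONLINE_RETENTION_DAYS = 90;
const MS_PER_DAY = 24 * 60 * 60 * 1000;
const PII_FIELDS = ['actorEmail'] as const; // assumed list

// Mirrors: createdAt < now - 90 days AND severity != 'CRITICAL'
export function selectForArchive(rows: AuditRow[], now: Date): AuditRow[] {
  const cutoff = now.getTime() - ONLINE_RETENTION_DAYS * MS_PER_DAY;
  return rows.filter(
    (row) => row.createdAt.getTime() < cutoff && row.severity !== 'CRITICAL',
  );
}

// One JSON object per line, with PII fields removed before serialization.
export function toNdjson(rows: AuditRow[]): string {
  return rows
    .map((row) => {
      const copy: Record<string, unknown> = { ...row };
      for (const field of PII_FIELDS) delete copy[field];
      return JSON.stringify(copy);
    })
    .join('\n');
}
```

Stripping PII on a copy rather than in the query keeps the field list in one place, which makes it easier to review when a new column is added to the table.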

Lesson 7: Compliance guidance is a design input, not a checklist

Every standards reference is traceable to a specific line of source code. The relevant NIST and SOC 2 publications provide implementation guidance that informs the code, not a set of boxes to tick at the end:

Standard	Implementation
NIST SP 800-92 event categories ^[1]	PlatformAuditCategory enum: AUTH, COMPLIANCE, SECURITY, ADMIN, etc.
NIST SP 800-92 log format ^[1]	Every row has outcome, severity, source, actorId, actorEmail, metadata
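NIST SP 800-63B lockout ^[2]	AccountSecurity model tracks failedAttempts and lockedUntil; auth flow enforces the 5/15 rule (5 failed attempts within 15 minutes triggers lockout). The 5/15 threshold is a local policy choice: NIST SP 800-63B section 5.2.2 requires rate limiting and caps consecutive failed attempts at no more than 100, so 5/15 sits well inside that requirement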
SOC 2 CC6.1 immutability ^[3]	No UPDATE or DELETE endpoints for audit tables; append-only
SOC 2 CC6.1 access protection ^[3]	Audit tables gated behind authenticated API routes; no direct database access

When an auditor asks "show me how you implement NIST SP 800-63B section 5.2.2" ^[2], point to the exact file and line where the lockout logic lives -- not a paragraph in a policy document.
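For reference, the 5/15 lockout check can be written as a small pure function. This is a sketch under stated assumptions rather than the auth flow itself: failedAttempts and lockedUntil come from the AccountSecurity model named above, but the firstFailedAt field used to track the window and the 15-minute lock duration are hypothetical.

```typescript
// Sketch only: firstFailedAt and LOCK_MS are assumptions for illustration.
interface AccountSecurity {
  failedAttempts: number;
  firstFailedAt: Date | null; // hypothetical: start of the current window
  lockedUntil: Date | null;
}

const MAX_FAILED_ATTEMPTS = 5;
const WINDOW_MS = 15 * 60 * 1000;
const LOCK_MS = 15 * 60 * 1000; // assumed lock duration

export function registerFailedAttempt(state: AccountSecurity, now: Date): AccountSecurity {
  // Start a new window if there is none or the previous one has expired.
  const windowExpired =
    state.firstFailedAt === null ||
    now.getTime() - state.firstFailedAt.getTime() > WINDOW_MS;

  const firstFailedAt = windowExpired ? now : state.firstFailedAt;
  const failedAttempts = windowExpired ? 1 : state.failedAttempts + 1;
  const lockedUntil =
    failedAttempts >= MAX_FAILED_ATTEMPTS
      ? new Date(now.getTime() + LOCK_MS)
      : state.lockedUntil;

  return { failedAttempts, firstFailedAt, lockedUntil };
}

export function isLocked(state: AccountSecurity, now: Date): boolean {
  return state.lockedUntil !== null && now.getTime() < state.lockedUntil.getTime();
}
```

Keeping the rule in one pure function gives the auditor a single place to read the thresholds and gives the test suite a function that needs no database or clock mocking.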

Lesson 8: E2E tests catch what unit tests miss

The audit log UI is built as a Next.js page with a Playwright E2E suite covering page load, filters, per-page selector, pagination controls, API contract, and mobile viewport (390x844).

Two bugs the suite caught before users did:

Filter change did not reset pagination. Changing the action filter from "All" to "CREATE" while on page 3 returned 0 results because the filter was applied to page 3 of the unfiltered set.
Page size selector triggered duplicate API calls. The useEffect dependency array was missing pageSize, so changing from 25 to 50 rows sent two requests (old size + new size).
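A common way to rule out the first bug, and to keep the request parameters in a single place, is to hold page, page size, and filter in one piece of state and reset the page whenever the other two change. The reducer below is a framework-free sketch of that idea, not the actual page component; the state shape and event names are hypothetical.

```typescript
// Sketch only: state shape and event names are illustrative assumptions.
interface AuditListState {
  page: number;
  pageSize: number;
  actionFilter: string;
}

type AuditListEvent =
  | { type: 'setPage'; page: number }
  | { type: 'setPageSize'; pageSize: number }
  | { type: 'setActionFilter'; actionFilter: string };

export function auditListReducer(
  state: AuditListState,
  event: AuditListEvent,
): AuditListState {
  switch (event.type) {
    case 'setPage':
      return { ...state, page: Math.max(1, event.page) };
    case 'setPageSize':
      // A new page size changes what "page 3" means, so start over.
      return { ...state, pageSize: event.pageSize, page: 1 };
    case 'setActionFilter':
      // A new filter changes the result set, so start over.
      return { ...state, actionFilter: event.actionFilter, page: 1 };
  }
}

// The request is derived from the whole state, so one state change
// corresponds to one query string.
export function toQueryString(state: AuditListState): string {
  return (
    `page=${state.page}&pageSize=${state.pageSize}` +
    `&action=${encodeURIComponent(state.actionFilter)}`
  );
}
```

In a React page this reducer could back a useReducer call, with the fetch effect depending on the derived query string so that each state transition produces a single request.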

Lesson 9: The three-rewrite pattern

Version	Scope	Mechanism	Problem
v1.31.0	Compliance only	logAudit() called explicitly in 19 route handlers	Not scalable; developers forgot
v1.45.3	Compliance + pagination UI	Same logAudit() but with rows-per-page selector	Still manual; no auth event logging
v1.57.0	Platform-wide	Pure-function derivation + withErrorHandler auto-instrumentation	Current state

The progression -- explicit calls, then explicit calls with a usable UI, then automatic platform-wide derivation -- is a pattern worth recognizing: the first version proves the concept, the second stabilizes the UX, the third eliminates the manual burden.

Summary

Building audit logging for a multi-tenant compliance platform taught nine lessons applicable to any SaaS product handling sensitive operations:

Audit writes must fail open -- never break the primary operation.
Derive audit metadata from the URL, do not require developers to configure it.
Separate audit tables for separate audiences (admin vs tenant).
Make tenantId nullable for pre-authentication events.
Classify severity with clear criteria; reserve CRITICAL for security incidents.
Design retention policy before writing the first audit row.
Map every compliance standard reference to a specific line of source code.
E2E tests catch pagination and filter interaction bugs that unit tests miss.
Plan for audit instrumentation to evolve: the first version proves the concept, the second stabilizes UX, the third eliminates manual burden.

References

[1] National Institute of Standards and Technology, "SP 800-92 — Guide to Computer Security Log Management," https://csrc.nist.gov/publications/detail/sp/800-92/final, September 2006. NIST has published a draft revision (SP 800-92r1, 2023) introducing additional requirements around structured logging formats; this implementation targets the original 2006 publication until r1 is finalized.

[2] National Institute of Standards and Technology, "SP 800-63B — Digital Identity Guidelines: Authentication and Lifecycle Management," https://csrc.nist.gov/publications/detail/sp/800-63b/final, accessed June 2026.

[3] American Institute of Certified Public Accountants, "SOC 2 CC6.1 — Logical and Physical Access Controls," https://www.aicpa.org/soc4so, accessed June 2026. The full Trust Services Criteria specification that defines CC6.1 in detail is available for purchase via AICPA.