What I Learned Building Audit Logging for a Multi-Tenant Compliance Platform

Tags: Audit Logging, Multi-Tenancy, Compliance, NIST, SOC 2, TypeScript, Next.js

I built the audit trail system for TeamPulse, a multi-tenant compliance platform with 139 API routes, independent feature gates per workspace, and full NIST SP 800-92 / SP 800-63B / SOC 2 CC6.1 compliance. The audit system went through three versions over four weeks, each rewrite informed by operational experience with the version before it. Here is what I learned.

Lesson 1: Audit logging must fail open

The single most important rule: an audit write failure must never break the primary operation. If a user updates a risk entry and the audit log write throws a database error, the user must still see the risk update succeed.

I enforce this with fire-and-forget writes wrapped in try/catch:

// apps/web/lib/audit.ts
export async function writePlatformAuditLog(params) {
  try {
    await prisma.platformAuditLog.create({ data: { /* ... */ } });
  } catch (err) {
    console.error('[platform-audit] Failed to write audit log:', err);
    // Never re-throw -- primary operation already succeeded
  }
}

The withErrorHandler wrapper that instruments every API route uses void writePlatformAuditLog(...) -- explicitly discarding the promise -- so the response is sent before the audit write even resolves. Audit data is valuable, but availability of the product is more valuable.

A test verifies this behaviour: api-handler.test.ts asserts that a handler still returns 200 when the audit write throws. This test has caught regressions twice.
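
The fail-open rule can be sketched in isolation as follows. This is a minimal illustration, not the real withErrorHandler: the names failOpen, withAudit, AuditEvent, AuditWriter, and RouteResult are hypothetical, and the audit writer is injected so the failure path is easy to simulate.

```typescript
// Sketch only: these types and helpers are illustrative, not TeamPulse's API.
type AuditEvent = { action: string; outcome: "SUCCESS" | "FAILURE" };
type AuditWriter = (event: AuditEvent) => Promise<void>;
type RouteResult = { status: number; body: unknown };

// Wraps a writer so that any failure is logged and swallowed.
function failOpen(write: AuditWriter): AuditWriter {
  return async (event) => {
    try {
      await write(event);
    } catch (err) {
      console.error("[platform-audit] Failed to write audit log:", err);
      // Never re-throw: the primary operation already succeeded.
    }
  };
}

// Runs the handler first, then starts the audit write without awaiting it.
function withAudit(
  handler: () => Promise<RouteResult>,
  write: AuditWriter,
  action: string,
): () => Promise<RouteResult> {
  const safeWrite = failOpen(write);
  return async () => {
    const result = await handler();
    const outcome = result.status < 400 ? "SUCCESS" : "FAILURE";
    void safeWrite({ action, outcome }); // promise discarded on purpose
    return result;
  };
}

// Usage: a writer that always fails must not change the handler's result.
const brokenWriter: AuditWriter = async () => {
  throw new Error("database unavailable");
};
const updateRisk = withAudit(
  async () => ({ status: 200, body: { ok: true } }),
  brokenWriter,
  "risk.update",
);
```

Calling updateRisk() still resolves with the handler's own result; the only trace of the failed audit write is the console.error line.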

Lesson 2: Derive, do not configure

The first version of the audit middleware required each API route to manually declare its category, entity type, and action. Developers forgot, copied the wrong values, or omitted audit calls entirely on new routes.

The current version (parseRequestForAudit) derives everything from the URL and HTTP method. It is a pure function -- no database lookups, no session reads, no side effects. Given /api/compliance/risks/cm9x8y7z and method PATCH, it returns:

| Field | Derived Value | How |
|---|---|---|
| category | COMPLIANCE | URL prefix /api/compliance/ |
| entityType | Risk | De-pluralized PascalCase from segment |
| action | risk.update | HTTP method + entity type |
| entityId | cm9x8y7z | CUID-like segment detection |
| severity | INFO (or WARNING for DELETE) | Method + outcome |

Three design choices matter here:

Action verbs are filtered out. Segments like attest, transition, review, quantify describe what is being done, not what entity is being targeted. Walking the URL backwards and skipping these verbs finds the real resource name.

CUID detection by pattern. A segment that starts with c and is 15-40 lowercase alphanumeric characters long is treated as an entity ID. This avoids parsing route parameter notation ([id]).

Auth routes are excluded entirely. /api/auth/* paths return null from the parser. Auth events (sign-in, sign-out, MFA, password changes) are instrumented explicitly in the auth flow code. Mixing automatic and explicit logging produces duplicate entries and confusing severity assignments.

The payoff: any new API route under /api/compliance/, /api/security/, or /api/incidents/ is automatically audited with zero developer effort. There are 37 unit tests covering the URL parser: nested paths, kebab-case conversion, de-pluralization, action verb skipping, deep paths, and excluded routes.
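
A rough sketch of this kind of derivation is shown below. It is not the real parseRequestForAudit: the function name parseRequestForAuditSketch, the method-to-verb map, the upper-cased category, the de-pluralization rules, and the exact ID pattern are all assumptions made for illustration, and severity here depends on the method only. The usage example uses a longer hypothetical ID, because cm9x8y7z is shorter than the 15-40 character range described above.

```typescript
// Sketch only: every name and rule below is an illustrative assumption.
type DerivedSeverity = "INFO" | "WARNING";

interface DerivedAudit {
  category: string;
  entityType: string;
  action: string;
  entityId: string | null;
  severity: DerivedSeverity;
}

const ACTION_VERBS = new Set(["attest", "transition", "review", "quantify"]);
// Assumed reading of the rule: "c" plus 14-39 more characters (15-40 in total).
const CUID_LIKE = /^c[a-z0-9]{14,39}$/;
// Only PATCH -> update is confirmed by the article; the rest is assumed.
const VERB_BY_METHOD: Record<string, string> = {
  GET: "read",
  POST: "create",
  PUT: "update",
  PATCH: "update",
  DELETE: "delete",
};

// Naive de-pluralization: "policies" -> "policy", "risks" -> "risk".
function singularize(segment: string): string {
  if (segment.endsWith("ies")) return segment.slice(0, -3) + "y";
  if (segment.endsWith("s") && !segment.endsWith("ss")) return segment.slice(0, -1);
  return segment;
}

// "risk-assessment" -> "RiskAssessment"
function toPascalCase(kebab: string): string {
  return kebab
    .split("-")
    .filter((part) => part.length > 0)
    .map((part) => part.charAt(0).toUpperCase() + part.slice(1))
    .join("");
}

// Pure function: no database lookups, no session reads, no side effects.
function parseRequestForAuditSketch(path: string, method: string): DerivedAudit | null {
  const queryStart = path.indexOf("?");
  const pathname = queryStart === -1 ? path : path.slice(0, queryStart);
  const [root, area, ...rest] = pathname.split("/").filter((s) => s.length > 0);
  // Auth routes are logged explicitly elsewhere, so they are excluded here.
  if (root !== "api" || area === undefined || area === "auth") return null;

  const httpMethod = method.toUpperCase();
  const verb = VERB_BY_METHOD[httpMethod];
  if (verb === undefined) return null;

  // Walk backwards: remember the last ID, skip action verbs, stop at the resource.
  let entityId: string | null = null;
  let resource: string | null = null;
  for (let i = rest.length - 1; i >= 0; i--) {
    const segment = rest[i];
    if (segment === undefined) continue;
    if (CUID_LIKE.test(segment)) {
      if (entityId === null) entityId = segment;
      continue;
    }
    if (ACTION_VERBS.has(segment)) continue;
    resource = segment;
    break;
  }
  if (resource === null) return null;

  const singular = singularize(resource);
  return {
    category: area.toUpperCase(),
    entityType: toPascalCase(singular),
    action: `${singular}.${verb}`,
    entityId,
    severity: httpMethod === "DELETE" ? "WARNING" : "INFO",
  };
}

// Usage with a hypothetical 25-character ID:
const derived = parseRequestForAuditSketch(
  "/api/compliance/risks/cm9x8y7za1b2c3d4e5f6g7h8i",
  "PATCH",
);
console.log(derived?.entityType, derived?.action); // prints "Risk risk.update"
```

Because the function takes only a path and a method, each rule (verb skipping, ID detection, exclusions) can be unit-tested with plain string inputs.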

Lesson 3: Two audit tables, not one

I split auditing into two tables serving different audiences:

| Aspect | Admin Audit Log | Platform Audit Log |
|---|---|---|
| Table | audit_logs | platform_audit_logs |
| Scope | Global -- all tenants | Tenant-scoped |
| Audience | Platform administrators | Tenant users |
| Events | Role changes, user management, tenant ops | Auth, compliance, security |

This separation matters for multi-tenant access control. A tenant user viewing /audit-log must never see another tenant's audit events. With a single table, every query requires a WHERE tenantId = ? clause -- easy to forget, hard to audit. With separate tables, global admin events never share a table with tenant-visible rows, and tenant users can only reach the tenant-scoped table through API routes that enforce their session's tenantId.

The admin audit log is only accessible through the admin app (port 3001), which requires a separate authentication session with SUPER_ADMIN role. No tenant user can access it, and no tenant-scoped API route reads from it.
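
One way to make the tenant filter hard to forget is to build every tenant-facing query through a single helper that takes tenantId from the session and copies only known filter fields. The sketch below assumes hypothetical names (TenantSession, AuditLogFilters, buildTenantAuditQuery) and a Prisma-style where/skip/take shape; the real routes may look different.

```typescript
// Sketch only: type and function names are illustrative assumptions.
interface TenantSession {
  userId: string;
  tenantId: string;
}

interface AuditLogFilters {
  category?: string;
  severity?: string;
  page?: number;
  pageSize?: number;
}

interface TenantAuditQuery {
  where: { tenantId: string; category?: string; severity?: string };
  skip: number;
  take: number;
}

function buildTenantAuditQuery(
  session: TenantSession,
  filters: AuditLogFilters,
): TenantAuditQuery {
  const pageSize = Math.min(Math.max(filters.pageSize ?? 25, 1), 100);
  const page = Math.max(filters.page ?? 1, 1);

  // tenantId always comes from the session, never from client input.
  const where: TenantAuditQuery["where"] = { tenantId: session.tenantId };
  // Copy known fields one by one so unexpected keys are dropped.
  if (filters.category !== undefined) where.category = filters.category;
  if (filters.severity !== undefined) where.severity = filters.severity;

  return { where, skip: (page - 1) * pageSize, take: pageSize };
}

// Usage: a client-supplied tenantId in the query payload is ignored.
const untrustedFilters = JSON.parse(
  '{"tenantId":"tenant-b","category":"AUTH","page":2}',
) as AuditLogFilters;
const scopedQuery = buildTenantAuditQuery(
  { userId: "user-1", tenantId: "tenant-a" },
  untrustedFilters,
);
console.log(scopedQuery.where.tenantId); // prints "tenant-a"
```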

Lesson 4: tenantId must be nullable

The first version of platform_audit_logs had tenantId as a non-nullable foreign key. This broke immediately because sign-in failures happen before a tenant is selected -- there is no tenant ID to associate with a failed password attempt.

Making tenantId nullable fixed it, but required care:

  • Auth events (sign-in failure, account lockout, 401 unauthorized) carry null tenant.
  • All other events carry the tenant from the session token.
  • writePlatformAuditLog() accepts tenantId: string | null and passes it through without coercion.
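
As a small illustration of "without coercion", the sketch below builds an audit row from an optional session. The names SessionLike, AuditRowDraft, and draftAuditRow, and the action string in the usage example, are hypothetical; the point is only that a missing tenant stays null instead of becoming an empty string or a placeholder.

```typescript
// Sketch only: the real session and row types are not shown in this post.
interface SessionLike {
  userId: string;
  tenantId?: string;
}

interface AuditRowDraft {
  tenantId: string | null;
  actorId: string | null;
  action: string;
}

function draftAuditRow(session: SessionLike | null, action: string): AuditRowDraft {
  return {
    // Keep null as null: a placeholder such as "" or "unknown" would violate
    // the foreign key or be matched by accident in tenant-scoped queries.
    tenantId: session?.tenantId ?? null,
    actorId: session?.userId ?? null,
    action,
  };
}

// A failed sign-in has no session yet, so the row carries a null tenant.
const preAuthRow = draftAuditRow(null, "auth.sign_in_failed");
// An authenticated request carries the tenant from its session.
const tenantRow = draftAuditRow({ userId: "user-1", tenantId: "tenant-a" }, "risk.update");
```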

Lesson 5: Severity is a classification, not a guess

I follow NIST SP 800-92 guidance with three severity levels:

| Severity | Criteria |
|---|---|
| INFO | Successful operations (sign-in, CRUD, password change) |
| WARNING | Failed operations (invalid password, denied access, feature violations) and all delete actions |
| CRITICAL | Account lockout events |

Two principles guided the severity design:

  1. DELETE is always WARNING, even on success. Deleting data is a destruction event and should be flagged regardless of whether the operation itself completed without error.

  2. CRITICAL is reserved for security incidents. Only account lockout triggers it -- signaling a potential brute-force attack. No other event type reaches CRITICAL. If everything is critical, nothing is.
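
The criteria above fit in one small pure function. This is a sketch under assumptions: the input shape (method, outcome, accountLocked) and the name classifySeverity are hypothetical, chosen only to make the precedence of the rules explicit.

```typescript
// Sketch only: the input shape is an illustrative assumption.
type EventSeverity = "INFO" | "WARNING" | "CRITICAL";

interface SeverityInput {
  method: string;
  outcome: "SUCCESS" | "FAILURE";
  accountLocked: boolean;
}

function classifySeverity(input: SeverityInput): EventSeverity {
  // CRITICAL is reserved for account lockout and takes precedence.
  if (input.accountLocked) return "CRITICAL";
  // Any failed operation is at least a WARNING.
  if (input.outcome === "FAILURE") return "WARNING";
  // DELETE is a destruction event: WARNING even when it succeeds.
  if (input.method.toUpperCase() === "DELETE") return "WARNING";
  return "INFO";
}

console.log(classifySeverity({ method: "DELETE", outcome: "SUCCESS", accountLocked: false }));
// prints "WARNING"
```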

Lesson 6: Retention is part of the schema, not an afterthought

Per NIST SP 800-92 section 4.2.4, I encoded retention policy in the database schema comments before the first row was written:

  • Online retention: 90 days -- data available for immediate querying in the live database.
  • Archive retention: 1 year -- exported and stored for compliance audits.
  • Critical events: retained indefinitely -- account lockouts and security incidents are never pruned.

Retrofitting retention onto a growing audit table is painful: you discover you needed a partition key, an archive format, or an index that makes pruning efficient. Designing it upfront costs nothing.
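
A retention policy like this can also be expressed as a pure function that a pruning job calls per row. The sketch below makes two assumptions the policy text leaves open: the one-year archive window is measured from the row's creation time, and critical rows move to the archive after 90 days but are never pruned. The name retentionTier and the tier labels are hypothetical.

```typescript
// Sketch only: tier names and window semantics are illustrative assumptions.
type RetentionSeverity = "INFO" | "WARNING" | "CRITICAL";
type RetentionTier = "online" | "archive" | "prune";

const ONLINE_DAYS = 90;
const ARCHIVE_DAYS = 365; // assumption: counted from creation, not from archiving
const MS_PER_DAY = 24 * 60 * 60 * 1000;

function retentionTier(
  createdAt: Date,
  severity: RetentionSeverity,
  now: Date,
): RetentionTier {
  const ageDays = (now.getTime() - createdAt.getTime()) / MS_PER_DAY;
  if (ageDays <= ONLINE_DAYS) return "online";
  // Critical events are retained indefinitely, so they never reach "prune".
  if (severity === "CRITICAL") return "archive";
  return ageDays <= ARCHIVE_DAYS ? "archive" : "prune";
}
```

Passing now as a parameter keeps the function deterministic, which makes the boundary cases (day 90, day 365) straightforward to test.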

Lesson 7: NIST compliance is a feature, not a checklist

Every standards reference is traceable to a specific line of source code:

| Standard | Implementation |
|---|---|
| NIST SP 800-92 event categories | PlatformAuditCategory enum: AUTH, COMPLIANCE, SECURITY, ADMIN, etc. |
| NIST SP 800-92 log format | Every row has outcome, severity, source, actorId, actorEmail, metadata |
| NIST SP 800-63B lockout | AccountSecurity model tracks failedAttempts and lockedUntil; auth flow enforces 5/15 rule |
| SOC 2 CC6.1 immutability | No UPDATE or DELETE endpoints for audit tables; append-only |
| SOC 2 CC6.1 access protection | Audit tables gated behind authenticated API routes; no direct database access |

When an auditor asks "show me how you implement NIST SP 800-63B section 5.2.2," you should be able to point to the exact file and line where the lockout logic lives, not just a paragraph in a policy document.
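
For illustration, lockout logic of this kind might look like the sketch below. Only the failedAttempts and lockedUntil fields come from the table above; the function names, the LockoutPolicy type, and the returned severity are hypothetical. The usage example reads "5/15" as five failed attempts followed by a 15-minute lock, which is an assumption rather than something stated here.

```typescript
// Sketch only: a pure-function model of lockout, not the real auth flow.
interface AccountSecurityState {
  failedAttempts: number;
  lockedUntil: Date | null;
}

interface LockoutPolicy {
  maxAttempts: number;
  lockMinutes: number;
}

interface FailedSignInResult {
  state: AccountSecurityState;
  severity: "WARNING" | "CRITICAL";
}

function isLocked(account: AccountSecurityState, now: Date): boolean {
  return account.lockedUntil !== null && account.lockedUntil.getTime() > now.getTime();
}

function recordFailedSignIn(
  account: AccountSecurityState,
  policy: LockoutPolicy,
  now: Date,
): FailedSignInResult {
  const failedAttempts = account.failedAttempts + 1;
  if (failedAttempts >= policy.maxAttempts) {
    // Threshold reached: lock the account and emit a CRITICAL audit event.
    const lockedUntil = new Date(now.getTime() + policy.lockMinutes * 60_000);
    return { state: { failedAttempts, lockedUntil }, severity: "CRITICAL" };
  }
  // Below the threshold: an ordinary failed attempt is a WARNING.
  return { state: { failedAttempts, lockedUntil: account.lockedUntil }, severity: "WARNING" };
}

// Assumed reading of "5/15": five attempts, fifteen-minute lock.
const assumedPolicy: LockoutPolicy = { maxAttempts: 5, lockMinutes: 15 };
```

Keeping the rule in one pure function gives the auditor a single place to look and gives the test suite a single place to pin the thresholds.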

Lesson 8: E2E tests catch what unit tests miss

The audit log UI has a comprehensive Playwright test suite covering page load, filters, per-page selector, pagination controls, API contract, and mobile viewport (390x844).

Two bugs the suite caught before users did:

  1. Filter change did not reset pagination. Changing the action filter from "All" to "CREATE" while on page 3 returned 0 results because the filter was applied to page 3 of the unfiltered set.

  2. Page size selector triggered duplicate API calls. The useEffect dependency array was missing pageSize, so changing from 25 to 50 rows sent two requests (old size + new size).
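
Both bugs come down to table state that can drift out of sync. One possible guard, sketched here with hypothetical names (AuditTableState, reduceAuditTable, auditLogQueryString) and assumed query parameter names, is to keep page, page size, and filter in one reducer and derive a single request key from it, so that one state change produces exactly one request.

```typescript
// Sketch only: state shape and parameter names are illustrative assumptions.
interface AuditTableState {
  page: number;
  pageSize: number;
  actionFilter: string;
}

type AuditTableEvent =
  | { type: "filterChanged"; actionFilter: string }
  | { type: "pageSizeChanged"; pageSize: number }
  | { type: "pageChanged"; page: number };

function reduceAuditTable(
  current: AuditTableState,
  change: AuditTableEvent,
): AuditTableState {
  switch (change.type) {
    case "filterChanged":
      // A new filter means a new result set, so go back to page 1.
      return { ...current, actionFilter: change.actionFilter, page: 1 };
    case "pageSizeChanged":
      // Page numbers are not comparable across page sizes; reset as well.
      return { ...current, pageSize: change.pageSize, page: 1 };
    case "pageChanged":
      return { ...current, page: Math.max(1, change.page) };
  }
}

// One string derived from the whole state: use it as the single fetch key.
function auditLogQueryString(table: AuditTableState): string {
  const action = encodeURIComponent(table.actionFilter);
  return `page=${table.page}&pageSize=${table.pageSize}&action=${action}`;
}

const onPageThree: AuditTableState = { page: 3, pageSize: 25, actionFilter: "All" };
const afterFilter = reduceAuditTable(onPageThree, { type: "filterChanged", actionFilter: "CREATE" });
console.log(auditLogQueryString(afterFilter)); // prints "page=1&pageSize=25&action=CREATE"
```

In a React component, a single effect keyed on the derived string would then fire once per change, regardless of which control triggered it.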

Lesson 9: The three-version pattern

| Version | Scope | Mechanism | Problem |
|---|---|---|---|
| v1.31.0 | Compliance only | logAudit() called explicitly in 19 route handlers | Not scalable; developers forgot |
| v1.45.3 | Compliance + pagination UI | Same logAudit() but with rows-per-page selector | Still manual; no auth event logging |
| v1.57.0 | Platform-wide | Pure-function derivation + withErrorHandler auto-instrumentation | Current state |

The progression from explicit calls in each route to automatic, platform-wide instrumentation is a pattern worth recognizing: the first version proves the concept, the second stabilizes the UX, the third eliminates the manual burden.

Summary

Building audit logging for a multi-tenant compliance platform taught nine lessons applicable to any SaaS product handling sensitive operations:

  1. Audit writes must fail open -- never break the primary operation.
  2. Derive audit metadata from the URL, do not require developers to configure it.
  3. Separate audit tables for separate audiences (admin vs tenant).
  4. Make tenantId nullable for pre-authentication events.
  5. Classify severity with clear criteria; reserve CRITICAL for security incidents.
  6. Design retention policy before writing the first audit row.
  7. Map every compliance standard reference to a specific line of source code.
  8. E2E tests catch pagination and filter interaction bugs that unit tests miss.
  9. Expect to rewrite audit instrumentation at least twice as the product grows.