← Back to Blog

Comprehensive GCP Cloud Cost Analysis: A FinOps Workflow

·8 min read
GCPFinOpsCloud CostBigQueryInfrastructureDevOps

This workflow is extracted from my operational runbooks and represents three years of iterating on GCP cost analysis. It's designed to be run weekly, requires only read access to billing and monitoring data, and produces actionable insights in under 30 minutes.

The Problem

Cloud bills are opaque. On GCP, your monthly invoice can hide 40% of its cost in logging alone. Idle resources accumulate in corners of the infrastructure nobody's touching. Commitments go underutilized. Spikes go unnoticed until the bill lands. This workflow solves all of that.

What This Workflow Covers

This is a 13-step process:

  1. Parameter gathering (billing account, projects, date range)
  2. Project discovery and metadata collection
  3. BigQuery billing export cost queries
  4. Cloud Logging ingestion cost analysis
  5. Cloud Monitoring cost analysis
  6. Cloud Asset Inventory scan
  7. Active Assist recommendations
  8. Idle resource detection (VMs, disks, IPs, Cloud SQL, Cloud Run, Cloud Functions)
  9. Storage lifecycle and retention cost analysis
  10. Commitment coverage and utilization
  11. Budget alerts and threshold audit
  12. Anomaly detection (week-over-week spikes)
  13. Cost forecasting and trend analysis

All queries are read-only. All data comes from billing exports, metrics, or inventory APIs.

Architecture

The workflow relies on three primary data sources:

Data SourcePurposeAccuracy
BigQuery Billing ExportActual dollar costs, real-time, line-item level100% (matches your invoice)
Cloud Monitoring MetricsService-specific metrics (logging bytes, CPU, memory)95% (subject to aggregation window)
Cloud Asset InventoryResource inventory, labels, metadataNear real-time (5-10 min latency)

BigQuery billing export is the source of truth. The other two fill in the details.

Step-by-Step Walkthrough

Querying Costs from BigQuery

Start by exporting your GCP billing data to BigQuery. Once enabled, you'll have a table like YOUR_PROJECT.YOUR_DATASET.gcp_billing_export_v1_XXXXXX.

Weekly cost by project:

SELECT
  project.id,
  project.name,
  DATE(usage_start_time) as date,
  SUM(cost) as total_cost,
  COUNT(DISTINCT service.description) as service_count
FROM `YOUR_PROJECT.YOUR_DATASET.gcp_billing_export_v1_XXXXXX`
WHERE DATE(usage_start_time) >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
  AND project.id IS NOT NULL
GROUP BY project.id, project.name, date
ORDER BY date DESC, total_cost DESC;

Top 20 SKUs by cost (weekly):

SELECT
  sku.description,
  SUM(cost) as total_cost,
  ROUND(SUM(usage.amount), 2) as usage_amount,
  usage.unit as unit,
  COUNT(*) as line_item_count
FROM `YOUR_PROJECT.YOUR_DATASET.gcp_billing_export_v1_XXXXXX`
WHERE DATE(usage_start_time) >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
GROUP BY sku.description, usage.unit
ORDER BY total_cost DESC
LIMIT 20;

Credits breakdown (if applicable):

SELECT
  credits.name,
  SUM(credits.amount) as total_credits,
  COUNT(*) as transaction_count,
  MIN(usage_start_time) as earliest_use
FROM `YOUR_PROJECT.YOUR_DATASET.gcp_billing_export_v1_XXXXXX`
WHERE DATE(usage_start_time) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
  AND ARRAY_LENGTH(credits) > 0
CROSS JOIN UNNEST(credits) as credits
GROUP BY credits.name
ORDER BY total_credits DESC;

Cloud Logging Cost Analysis: The Killer Line Item

Cloud Logging is often the largest single cost on a GCP bill, yet it's almost never monitored. On high-cardinality workloads, logging can exceed compute costs by 3 to 5x.

Cost drivers:

  • Log ingestion: $0.50/GiB after 50 GiB free per month
  • Log storage: $0.01/GiB-month (StackDriver storage, deprecated but still used)
  • Log exclusion filters (applied at ingestion, not at storage)

Query monthly logging ingestion (from Cloud Monitoring):

gcloud monitoring time-series list \
  --filter 'metric.type="logging.googleapis.com/billing/monthly_bytes_ingested"' \
  --format='table(resource.labels.project_id, metric.labels.resource_type, points[0].value.double_value)' \
  --interval-start-time=$(date -u -d '30 days ago' +%Y-%m-%dT%H:%M:%SZ) \
  --interval-end-time=$(date -u +%Y-%m-%dT%H:%M:%SZ)

This metric is a GAUGE, not DELTA. Query it once per day; do not sum it. The value represents bytes ingested that day.

To estimate monthly cost:

daily_bytes_ingested = 1_500_000_000_000  # 1.5 TB/day
monthly_bytes = daily_bytes_ingested * 30
free_tier = 50 * (1024 ** 3)  # 50 GiB
billable_bytes = max(0, monthly_bytes - free_tier)
cost_usd = (billable_bytes / (1024 ** 3)) * 0.50
print(f"Estimated monthly logging cost: ${cost_usd:.2f}")

Log exclusion filters to investigate:

gcloud logging sinks list --project=YOUR_PROJECT \
  --format='table(name, destination, includeChildren, filter)'

High-cardinality logs to exclude:

  • Health check logs (Compute Engine health checks, load balancer checks)
  • Verbose debug logs in production (usually from misconfigured SDK clients)
  • Third-party service logs with low signal-to-noise ratio

Idle Resource Detection

Idle resources are the second-largest waste category after logging.

Terminated VMs with attached disks

SELECT
  resource.labels.instance_id,
  resource.labels.instance_name,
  resource.labels.zone,
  SUM(CASE WHEN metric.type = 'compute.googleapis.com/instance/cpu/utilization' THEN value.double_value ELSE 0 END) as avg_cpu_utilization,
  COUNT(*) as measurement_count
FROM `YOUR_PROJECT.YOUR_DATASET.monitoring_metrics`
WHERE metric.type IN ('compute.googleapis.com/instance/cpu/utilization')
  AND resource.type = 'gce_instance'
  AND timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
GROUP BY resource.labels.instance_id, resource.labels.instance_name, resource.labels.zone
HAVING avg_cpu_utilization < 0.05
ORDER BY avg_cpu_utilization ASC;

Cost: $0.10/hour per vCPU, $0.04/hour per GB memory (compute-optimized).

Unattached persistent disks

gcloud compute disks list --project=YOUR_PROJECT \
  --filter='users:* OR users!:*' \
  --format='table(name, zone, sizeGb, type, users.len())'

Cost: $0.17/GB-month.

Unused static IP addresses

gcloud compute addresses list --project=YOUR_PROJECT \
  --filter='status != IN_USE' \
  --format='table(name, region, status)'

Cost: $0.01/hour per IP address.

Cloud SQL idle instances

Idle Cloud SQL is tricky: a paused instance still has compute costs if it's not actually paused. Monitor:

  • Connections: cloudsql.googleapis.com/database/mysql/connections: if zero for 7+ days, the instance is idle regardless of CPU.
  • CPU: cloudsql.googleapis.com/database/cpu/utilization: anything under 2% is usually idle.
  • Memory: high memory allocation but zero connections = over-provisioned.

Cloud Run idle services

gcloud run services list --project=YOUR_PROJECT \
  --format='table(metadata.name, status.replicas, metadata.creationTimestamp)'

Check request metrics:

gcloud monitoring time-series list \
  --filter='resource.type="cloud_run_revision" AND metric.type="run.googleapis.com/request_count"' \
  --format='table(resource.labels.service_name, points[0].value.int64_value)'

If zero requests for 30 days, delete the service or migrate to Cloud Functions.

Cloud Functions idle

gcloud functions list --project=YOUR_PROJECT \
  --format='table(name, runtime, status, sourceArchiveUrl)'

Check invocation metrics:

gcloud monitoring time-series list \
  --filter='resource.type="cloud_function" AND metric.type="cloudfunctions.googleapis.com/function/invocations"' \
  --format='table(resource.labels.function_name, points[0].value.int64_value)'

If zero invocations for 30 days: delete. If error rate > 5% for 7 days: investigate immediately (error invocations still cost).

Commitment Coverage and Utilization

Commitments are the largest lever for cost reduction. Most teams leave 30 to 50% of commitment capacity unused.

SELECT
  service.description,
  SUM(CASE WHEN commitment_id IS NOT NULL THEN cost ELSE 0 END) as commitment_cost,
  SUM(CASE WHEN commitment_id IS NULL THEN cost ELSE 0 END) as on_demand_cost,
  ROUND(
    SUM(CASE WHEN commitment_id IS NOT NULL THEN cost ELSE 0 END) /
    (SUM(CASE WHEN commitment_id IS NOT NULL THEN cost ELSE 0 END) +
     SUM(CASE WHEN commitment_id IS NULL THEN cost ELSE 0 END) + 0.001) * 100,
    1
  ) as commitment_coverage_percent
FROM `YOUR_PROJECT.YOUR_DATASET.gcp_billing_export_v1_XXXXXX`
WHERE DATE(usage_start_time) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY service.description
ORDER BY commitment_coverage_percent ASC;

Target: >80% commitment coverage on compute, >90% on sustained-use discounts.

Anomaly Detection (Week-over-Week Spikes)

WITH current_week AS (
  SELECT
    DATE(DATE_SUB(CURRENT_DATE(), INTERVAL DAYOFWEEK(CURRENT_DATE()) - 1 DAY)) as week_start,
    project.id,
    SUM(cost) as total_cost
  FROM `YOUR_PROJECT.YOUR_DATASET.gcp_billing_export_v1_XXXXXX`
  WHERE DATE(usage_start_time) >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
  GROUP BY week_start, project.id
),
prior_week AS (
  SELECT
    DATE(DATE_SUB(CURRENT_DATE(), INTERVAL DAYOFWEEK(CURRENT_DATE()) + 6 DAY)) as week_start,
    project.id,
    SUM(cost) as total_cost
  FROM `YOUR_PROJECT.YOUR_DATASET.gcp_billing_export_v1_XXXXXX`
  WHERE DATE(usage_start_time) >= DATE_SUB(CURRENT_DATE(), INTERVAL 14 DAY)
    AND DATE(usage_start_time) < DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
  GROUP BY week_start, project.id
)
SELECT
  cw.project.id,
  cw.total_cost as current_week_cost,
  pw.total_cost as prior_week_cost,
  ROUND((cw.total_cost - pw.total_cost) / pw.total_cost * 100, 1) as percent_change
FROM current_week cw
FULL OUTER JOIN prior_week pw USING (project.id)
WHERE (cw.total_cost - pw.total_cost) / pw.total_cost > 0.30
ORDER BY percent_change DESC;

Flag any project with a >30% spike.

Cost Forecasting

Simple linear regression over 90 days:

SELECT
  project.id,
  DATE(usage_start_time) as date,
  SUM(cost) as daily_cost,
  AVG(SUM(cost)) OVER (
    PARTITION BY project.id
    ORDER BY DATE(usage_start_time)
    ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
  ) as rolling_30day_avg
FROM `YOUR_PROJECT.YOUR_DATASET.gcp_billing_export_v1_XXXXXX`
WHERE DATE(usage_start_time) >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
GROUP BY project.id, date
ORDER BY date DESC;

Calculate trend: if the 30-day rolling average is rising, flag it.

The Report

The output is a weekly Markdown report with:

  1. Executive summary: Total cost, week-over-week change, top 5 cost drivers.
  2. Cost by project: Ranked list with sparklines.
  3. Cost by service: Breakdown of compute, storage, networking, logging, etc.
  4. Idle resources: List of candidates for deletion, with estimated monthly savings.
  5. Anomalies: Week-over-week spikes (>30%), projects with high logging cost.
  6. Forecasting: 90-day trend direction and projected end-of-month cost.
  7. Recommendations: Specific actions (delete idle resources, increase commitment, reduce logging).

Key Insights from Three Years of Cost Analysis

  • Cloud Logging often exceeds compute costs. If logging is >40% of your bill, investigate immediately. Log exclusion filters are underutilized.
  • Cloud SQL is deceptive. A paused instance still charges; zero connections is the true idle signal, not CPU. Autoscaling doesn't help idle databases.
  • Cloud Functions errors are pure waste. Every invocation (failed or succeeded) costs. >5% error rate means your function is broken and you're burning money on failures.
  • Label coverage below 50% is a governance issue. You can't optimize what you can't tag. Implement a tag-enforcement policy.
  • Always use ALIGN_NONE when querying gauge metrics. Summing a gauge (e.g., logging.googleapis.com/billing/monthly_bytes_ingested) gives nonsense numbers. Read the metric descriptor.
  • Commitments are inflexible but powerful. Buy for your baseline usage (the 50th percentile over 90 days), not your peak. Overcommit by at most 10%.
  • Billing export is delayed by 3 to 5 days. For real-time cost tracking, use Cloud Monitoring billing metrics. For accuracy, use BigQuery export.

Run this workflow every Sunday evening. Most teams find $2K to $10K/month in savings in the first pass.