TECH INSIGHTS

Designing a GCP Cloud ETL Pipeline for 10TB/Day with Sub-Second Latency

July 11, 2025 | 4 Minute Read

Using Cloud Functions (GCS-triggered), Pub/Sub, Dataflow & BigQuery

At Improving, we architected a cloud-native, real-time ETL pipeline on Google Cloud Platform (GCP) that could:

  • Ingest 10TB of data per day

  • Maintain sub-second end-to-end latency

  • Remain fully serverless, reliable, and production-grade

This post details our architecture, the challenges we encountered, and how we optimized the system to meet both performance and cost-efficiency goals.

Objectives

  • Ingestion volume: 10TB/day (~115MB/sec sustained)

  • Latency target: < 1 second from data upload to visibility in BigQuery

  • Design constraints: Serverless, autoscaling, minimal ops

  • Core services: Cloud Functions, Pub/Sub, Dataflow (Apache Beam), BigQuery

Architecture Design & Overview


[Data Producers → GCS Upload]
        ↓  (object finalization event)
[Cloud Function Triggered by GCS]
        ↓
[Pub/Sub Topic (Buffer)]
        ↓
[Dataflow Streaming Job]
        ↓
[BigQuery (Analytics Layer)]

Cloud Function Triggered by GCS Upload

Our ingestion pipeline begins when files are uploaded to Google Cloud Storage. A Cloud Function is automatically triggered by GCS “object finalized” events.

This function (sketched in the example after this list):

  • Parses GCS metadata (bucket, filename, timestamp, source)

  • Adds custom tags or schema versioning info

  • Publishes a message to a Pub/Sub topic with metadata
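For illustration, here is a minimal Java sketch of such a function using the Functions Framework CloudEvents API and the Pub/Sub publisher client. The post's actual source isn't shown, so the class name, topic, and attribute keys below are hypothetical; note the static Publisher, which is the client-reuse optimization described next.

```java
import com.google.cloud.functions.CloudEventsFunction;
import com.google.cloud.pubsub.v1.Publisher;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import com.google.protobuf.ByteString;
import com.google.pubsub.v1.PubsubMessage;
import com.google.pubsub.v1.TopicName;
import io.cloudevents.CloudEvent;
import java.nio.charset.StandardCharsets;

// Hypothetical class and topic names; the attribute keys are an illustrative contract.
public class GcsToPubSub implements CloudEventsFunction {

  // Created once per instance and reused across invocations,
  // avoiding per-request socket and TLS setup (see Optimizations below).
  private static final Publisher PUBLISHER = createPublisher();

  private static Publisher createPublisher() {
    try {
      return Publisher.newBuilder(TopicName.of("my-project", "ingest-topic")).build();
    } catch (Exception e) {
      throw new IllegalStateException("Could not create Pub/Sub publisher", e);
    }
  }

  @Override
  public void accept(CloudEvent event) {
    // The GCS "object finalized" payload arrives as JSON in the CloudEvent data.
    JsonObject gcs = JsonParser
        .parseString(new String(event.getData().toBytes(), StandardCharsets.UTF_8))
        .getAsJsonObject();

    PubsubMessage message = PubsubMessage.newBuilder()
        .putAttributes("bucket", gcs.get("bucket").getAsString())
        .putAttributes("name", gcs.get("name").getAsString())
        .putAttributes("timeCreated", gcs.get("timeCreated").getAsString())
        .putAttributes("schemaVersion", "v2") // custom schema-versioning tag
        .setData(ByteString.copyFromUtf8(gcs.toString()))
        .build();

    PUBLISHER.publish(message); // async; returns an ApiFuture with the message ID
  }
}
```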

Optimizations

  • Allocated enough memory (~512MB) to reduce cold starts

  • Reused network clients to avoid socket overhead

  • Enabled up to 1,000 concurrent executions while guarding against retry deadlocks

Pub/Sub: Decoupling & Buffering

Pub/Sub acted as our message buffer, enabling:

  • Horizontal scaling of downstream consumers

  • Replay capabilities in case of downstream failure

  • Backpressure protection via flow control

Each message carried metadata needed by Dataflow to load, transform, and route the data.
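In our pipeline the subscriber is Dataflow, whose Pub/Sub source handles flow control internally. To make the backpressure point concrete, though, here is a sketch of what explicit flow control looks like with the Java client for a custom consumer; the resource names and limits are illustrative:

```java
import com.google.api.gax.batching.FlowControlSettings;
import com.google.cloud.pubsub.v1.MessageReceiver;
import com.google.cloud.pubsub.v1.Subscriber;
import com.google.pubsub.v1.ProjectSubscriptionName;

public class FlowControlledConsumer {
  public static void main(String[] args) {
    // Illustrative names; not the production resources.
    ProjectSubscriptionName subscription =
        ProjectSubscriptionName.of("my-project", "ingest-sub");

    // Cap in-flight work so a slow consumer pushes load back onto Pub/Sub
    // instead of falling over.
    FlowControlSettings flowControl = FlowControlSettings.newBuilder()
        .setMaxOutstandingElementCount(10_000L)
        .setMaxOutstandingRequestBytes(256L * 1024 * 1024) // 256 MiB in flight
        .build();

    MessageReceiver receiver = (message, consumer) -> {
      // ... process the message ...
      consumer.ack();
    };

    Subscriber subscriber = Subscriber.newBuilder(subscription, receiver)
        .setFlowControlSettings(flowControl)
        .build();
    subscriber.startAsync().awaitRunning();
  }
}
```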

Dataflow: Stream Processing Engine

We used Dataflow (Apache Beam, Java SDK) in streaming mode to:

  • Fetch and parse raw file content from GCS

  • Apply schema-aware transformations

  • Handle deduplication using windowing and watermarking

  • Write enriched data to BigQuery with sub-second delay (a pipeline sketch follows this list)
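The job's full source is beyond this post, but its shape is roughly the sketch below. The subscription, table name, and FetchAndParseFn are placeholders: the real DoFns download the GCS object named in each message, chunk it, apply schema-aware transforms, and dedupe before the BigQuery write.

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class StreamingEtl {

  // Hypothetical DoFn: in the real job this fetches gs://{bucket}/{name},
  // splits it into records, enriches, and dedupes. Stubbed here to forward
  // the message metadata so the sketch compiles.
  static class FetchAndParseFn extends DoFn<PubsubMessage, TableRow> {
    @ProcessElement
    public void process(@Element PubsubMessage msg, OutputReceiver<TableRow> out) {
      out.output(new TableRow()
          .set("bucket", msg.getAttribute("bucket"))
          .set("name", msg.getAttribute("name")));
    }
  }

  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(
        PipelineOptionsFactory.fromArgs(args).withValidation().create());

    pipeline
        // Messages carry the GCS object metadata published by the Cloud Function.
        .apply("ReadEvents", PubsubIO.readMessagesWithAttributes()
            .fromSubscription("projects/my-project/subscriptions/ingest-sub"))
        .apply("FetchParseEnrich", ParDo.of(new FetchAndParseFn()))
        // Streaming inserts keep BigQuery visibility latency sub-second.
        .apply("WriteToBigQuery", BigQueryIO.writeTableRows()
            .to("my-project:analytics.landing_events")
            .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
            .withCreateDisposition(CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(WriteDisposition.WRITE_APPEND));

    pipeline.run();
  }
}
```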

Dataflow Optimizations

  • Enabled Streaming Engine + Runner v2

  • Used dynamic work rebalancing and auto-scaling (5–200 workers)

  • Split large files into smaller chunks (ParDo + batching)
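The first two items are plain pipeline options (dynamic work rebalancing comes for free with the Dataflow runner). A sketch, with worker counts matching the 5–200 range above; on newer SDKs Runner v2 is already the default:

```java
import java.util.List;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class EtlOptions {
  static DataflowPipelineOptions build(String[] args) {
    DataflowPipelineOptions options = PipelineOptionsFactory
        .fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
    options.setStreaming(true);
    options.setEnableStreamingEngine(true);            // Streaming Engine
    options.setExperiments(List.of("use_runner_v2"));  // Runner v2
    options.setNumWorkers(5);                          // pre-scaled floor
    options.setMaxNumWorkers(200);                     // autoscaling ceiling
    return options;
  }
}
```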

BigQuery: Real-Time Analytics Layer

Data was streamed into a landing table in BigQuery via streaming inserts. A separate scheduled transformation job partitioned and clustered the data into curated tables.

BigQuery Optimizations

  • Partitioned by ingestion date

  • Clustered by source_id, event_type

  • Compacted streamed data daily using scheduled queries

  • Used streaming buffer metrics to detect lag spikes
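To make the partitioning and clustering concrete, here is a hypothetical table-definition sketch using the BigQuery Java client. The post does not show the real schema, so the dataset, table, and columns are illustrative:

```java
import java.util.Arrays;
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Clustering;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.StandardSQLTypeName;
import com.google.cloud.bigquery.StandardTableDefinition;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableInfo;
import com.google.cloud.bigquery.TimePartitioning;

public class CreateCuratedTable {
  public static void main(String[] args) {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // Illustrative columns; the real pipeline's schema isn't shown in the post.
    Schema schema = Schema.of(
        Field.of("source_id", StandardSQLTypeName.STRING),
        Field.of("event_type", StandardSQLTypeName.STRING),
        Field.of("payload", StandardSQLTypeName.STRING));

    StandardTableDefinition definition = StandardTableDefinition.newBuilder()
        .setSchema(schema)
        // No partition field set => partitioned by ingestion time, daily.
        .setTimePartitioning(TimePartitioning.of(TimePartitioning.Type.DAY))
        // Cluster on the columns most queries filter by.
        .setClustering(Clustering.newBuilder()
            .setFields(Arrays.asList("source_id", "event_type")).build())
        .build();

    bigquery.create(TableInfo.of(
        TableId.of("analytics", "events_curated"), definition)); // placeholder names
  }
}
```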

Bottlenecks & Challenges


Recovery and Observability

  • Retries: Cloud Functions and Pub/Sub retries with exponential backoff

  • Dead-letter queues: For malformed events or schema mismatches

  • Logging: Centralized with Cloud Logging and custom labels

  • Metrics: Data freshness, latency, throughput, and error counts

  • Reprocessing: Manual or scheduled Dataflow replay from GCS + Pub/Sub
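As a concrete example of the retry and dead-letter items above, here is a sketch using the Pub/Sub admin client; the resource names and the five-attempt limit are illustrative, not the production values:

```java
import com.google.cloud.pubsub.v1.SubscriptionAdminClient;
import com.google.protobuf.Duration;
import com.google.pubsub.v1.DeadLetterPolicy;
import com.google.pubsub.v1.RetryPolicy;
import com.google.pubsub.v1.Subscription;

public class CreateDlqSubscription {
  public static void main(String[] args) throws Exception {
    try (SubscriptionAdminClient admin = SubscriptionAdminClient.create()) {
      admin.createSubscription(Subscription.newBuilder()
          .setName("projects/my-project/subscriptions/ingest-sub")
          .setTopic("projects/my-project/topics/ingest-topic")
          // Exponential backoff between redelivery attempts.
          .setRetryPolicy(RetryPolicy.newBuilder()
              .setMinimumBackoff(Duration.newBuilder().setSeconds(10))
              .setMaximumBackoff(Duration.newBuilder().setSeconds(600)))
          // After repeated failures, park the message on a dead-letter topic
          // for inspection and later replay.
          .setDeadLetterPolicy(DeadLetterPolicy.newBuilder()
              .setDeadLetterTopic("projects/my-project/topics/ingest-dlq")
              .setMaxDeliveryAttempts(5))
          .build());
    }
  }
}
```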

Monitoring, Cost Optimization & Failure Recovery

Building a highly performant ETL pipeline isn’t enough; you also need observability and resilience. Here’s how we ensured visibility, cost control, and robust error handling in our 10TB/day GCP ETL pipeline.

Monitoring the Pipeline

We leveraged Cloud Monitoring and Logging to track every component in real time:

Cloud Function:

  • Monitored invocation count, execution duration, and error rates using Cloud Monitoring dashboards.

  • Configured alerts for failed executions or abnormal spikes in duration.

Pub/Sub:

  • Tracked message backlog and throughput via Pub/Sub metrics.

  • Alerts were set to detect message backlog growth, indicating slow or failing subscribers.
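The metric behind those backlog alerts is pubsub.googleapis.com/subscription/num_undelivered_messages. As a sketch, it can also be polled directly with the Cloud Monitoring Java client (the project name is a placeholder):

```java
import com.google.cloud.monitoring.v3.MetricServiceClient;
import com.google.monitoring.v3.ListTimeSeriesRequest;
import com.google.monitoring.v3.ProjectName;
import com.google.monitoring.v3.TimeInterval;
import com.google.monitoring.v3.TimeSeries;
import com.google.protobuf.util.Timestamps;

public class BacklogCheck {
  public static void main(String[] args) throws Exception {
    try (MetricServiceClient client = MetricServiceClient.create()) {
      long now = System.currentTimeMillis();
      ListTimeSeriesRequest request = ListTimeSeriesRequest.newBuilder()
          .setName(ProjectName.of("my-project").toString()) // placeholder project
          // Built-in backlog metric: growth signals slow or failing subscribers.
          .setFilter("metric.type=\"pubsub.googleapis.com/subscription/num_undelivered_messages\"")
          .setInterval(TimeInterval.newBuilder()
              .setStartTime(Timestamps.fromMillis(now - 10 * 60 * 1000)) // last 10 minutes
              .setEndTime(Timestamps.fromMillis(now)))
          .build();

      for (TimeSeries series : client.listTimeSeries(request).iterateAll()) {
        if (series.getPointsCount() > 0) {
          System.out.println(series.getResource().getLabelsOrDefault("subscription_id", "?")
              + " backlog = " + series.getPoints(0).getValue().getInt64Value());
        }
      }
    }
  }
}
```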

Dataflow:

  • Monitored autoscaling behavior, CPU utilization, memory usage, and system lag.

  • Configured custom lag alerts to catch delays in processing (e.g., high watermarks lagging behind ingestion).

  • Used Dataflow job logs to debug transformation errors, memory issues, or skewed partitions.

BigQuery:

  • Monitored streaming insert errors and throughput using the BigQuery audit logs.

  • Tracked query cost, slot usage, and latency to optimize performance and budget.

Tooling: Cloud Monitoring dashboards were customized for our pipeline, with charts for:

  • Throughput (files/min, records/sec)

  • Latency (end-to-end from GCS to BQ)

  • Job failure alerts

  • Cost metrics per service (BQ storage/query, Dataflow VMs, etc.)

Results

  • Scaled to 10TB/day sustained load

  • Achieved end-to-end latency of ~600ms on average

  • Fully serverless and autoscaling with minimal ops overhead

  • Real-time querying and downstream analytics available within seconds

Lessons Learned

  • Cloud-native doesn’t mean compromising on performance; sub-second latency is achievable with the right patterns.

  • File-based triggers from GCS are reliable if you design for retries and idempotency.

  • The key to performance is pre-scaling, batch tuning, and decoupling stages.

Conclusion

GCP provides a robust foundation for building high-throughput, low-latency ETL pipelines. By combining event-driven triggers, Pub/Sub decoupling, and streaming dataflows, we were able to handle terabytes of data per day with real-time SLAs.

Whether you’re modernizing a batch system or building greenfield streaming architecture, this design can act as a blueprint. Contact us to learn more or check out our career page to find out how you can be a part of the Improving team.
