Diagnosing and Fixing a Hidden ClickHouse Bottleneck in Your Billing Pipeline
Overview
ClickHouse is a powerful OLAP database that many organizations rely on for real-time analytics. At Cloudflare, we process millions of ClickHouse queries daily to calculate usage-based billing, power fraud detection, and handle other critical workflows. When our daily aggregation jobs suddenly slowed down after a routine migration, the impact was immediate—invoices became delayed, and downstream systems struggled to keep up. The usual suspects (I/O, memory, rows scanned, parts read) looked normal. After deep investigation, we discovered a hidden bottleneck buried deep within ClickHouse’s internal thread pool and locking mechanisms. This tutorial walks through the exact steps we took to identify, diagnose, and fix the issue. By the end, you’ll learn how to set up per-namespace retention, spot hidden performance killers, and apply targeted patches to restore query speed.

Prerequisites
Before diving into the diagnostics, ensure you have:
- ClickHouse expertise: Familiarity with MergeTree tables, partitions, primary keys, and system tables.
- Access to production or staging ClickHouse clusters: Ability to run queries against system.query_log, system.parts, system.metrics, etc.
- Basic Linux administration: For reading logs and adjusting configuration files.
- A test environment to apply changes without affecting live traffic.
- A billing or aggregation pipeline that relies on ClickHouse for time-sensitive jobs.
Step-by-Step Instructions
Step 1: Analyze Your Retention Requirements
Cloudflare’s Ready-Analytics platform stored all data in a single massive table partitioned by day, with a 31-day retention enforced by dropping old partitions. This “one-size-fits-all” approach forced teams with longer retention needs (e.g., 90 days for legal compliance) to build separate infrastructure. The first step is to inventory your own retention policies: which namespaces or datasets require more than the default? Which need less? Document these requirements to design a flexible per-namespace retention system.
Step 2: Implement Per-Namespace Retention
To replace the rigid partition-drop approach, we introduced a new system that allows each namespace (identified by a namespace field) to have its own retention period. Here’s how to set it up in ClickHouse.
- Add a retention metadata table: Create a small table like namespace_retention with columns namespace, retention_days, and updated_at. Populate it with your new policies.
- Modify the primary key: Ensure your primary key includes namespace and timestamp so that data from different namespaces can be efficiently pruned. Example: PRIMARY KEY (namespace, indexID, timestamp).
- Create a custom retention job: Instead of dropping entire partitions, run a periodic job that deletes rows older than each namespace's allowed age. Use the ALTER TABLE ... DELETE command with a condition like WHERE namespace = 'x' AND timestamp < now() - toIntervalDay(retention_days). Because deletes in ClickHouse are asynchronous and create mutations, batch namespaces to avoid overwhelming the system.
- Optimize with parts merging: After deletions, run OPTIMIZE TABLE ... FINAL sparingly (it is resource-intensive) to reclaim space and improve query performance.
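The steps above can be sketched in ClickHouse SQL. The table and column names here (namespace_retention, events, the example namespaces) are illustrative, not the exact production schema:

```sql
-- Illustrative metadata table: one retention policy per namespace.
CREATE TABLE namespace_retention
(
    namespace String,
    retention_days UInt16,
    updated_at DateTime DEFAULT now()
)
ENGINE = ReplacingMergeTree(updated_at)
ORDER BY namespace;

INSERT INTO namespace_retention (namespace, retention_days) VALUES
    ('billing', 90),   -- longer retention, e.g. for compliance
    ('fraud', 31);     -- default retention

-- Periodic retention job: the job reads retention_days from the metadata
-- table, then issues one asynchronous delete mutation per namespace.
-- Run these in small batches rather than all at once.
ALTER TABLE events DELETE
WHERE namespace = 'fraud'
  AND timestamp < now() - toIntervalDay(31);  -- 31 looked up by the job
```

The literal interval is substituted by the retention job rather than read via a subquery, since mutations are replayed on replicas and should stay deterministic.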
After implementing this, we observed that billing aggregation queries slowed drastically—but not because of the delete logic. The slowdown appeared only after a migration that changed the merge scheduler’s behavior. This led us to the real bottleneck.
Step 3: Identify the Hidden Bottleneck
When query performance degrades, start with the usual diagnostics. Check system.query_log for increased latency, system.parts for part count, and system.metrics for CPU and memory usage. In our case, all were normal. The hidden bottleneck was inside ClickHouse’s BackgroundProcessingPool—a global thread pool responsible for merging parts, deleting stale data, and other background tasks. After the migration, the merge scheduler started holding a lock longer than expected, causing queries that needed to access the same partitions to queue up.

To detect this:
- Query system.merges to see if merges are stuck or taking unusually long.
- Examine system.processes for queries in a "Lock wait" state.
- Check system.events for increments in LockWaitTime.
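These checks are plain read-only queries against system tables, so they are safe to run on a live cluster. A minimal sketch (column selections are examples):

```sql
-- Long-running or stuck merges.
SELECT database, table, elapsed, progress, num_parts
FROM system.merges
ORDER BY elapsed DESC;

-- Currently executing queries; long elapsed times with little visible
-- progress often point at waiting on an internal lock.
SELECT query_id, elapsed, query
FROM system.processes
ORDER BY elapsed DESC;

-- Cumulative lock-related counters since server start.
SELECT event, value
FROM system.events
WHERE event ILIKE '%lock%';
```

Snapshot these counters before and after a slow period; it is the delta, not the absolute value, that indicates growing lock contention.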
We discovered that the per-namespace delete mutations were triggering many small merges, each competing for the same internal lock. This lock contention created a ripple effect that slowed all queries waiting for those partitions.
Step 4: Apply Targeted Fixes
We wrote three patches to address the root cause. Here’s a conceptual approach you can replicate:
- Patch 1: Tune the BackgroundProcessingPool – Increase the number of threads and reduce the lock-holding interval. In ClickHouse config, adjust
background_pool_sizeandbackground_merges_mutations_concurrency_ratio. Example:(from default 8).16 - Patch 2: Prioritize Merge Tasks – Separate merge tasks for different namespaces into independent queues so that a slow namespace doesn’t block others. This requires customizing the merge selector logic (our patch used a sliding-window approach based on namespace popularity).
- Patch 3: Reduce Mutations Impact – Instead of running a single large DELETE across many namespaces, batch smaller deletes with pauses. Also, use the
max_mutations_per_partitionsetting to limit concurrent mutations per partition.
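For Patch 1, the pool settings are server-level options in config.xml. A minimal sketch, with example values only (defaults and accepted settings vary by ClickHouse version, so verify against your release before deploying):

```xml
<clickhouse>
    <!-- More background threads for merges and mutations (example: 16, up from 8). -->
    <background_pool_size>16</background_pool_size>
    <!-- How many merge/mutation tasks may be scheduled per background thread. -->
    <background_merges_mutations_concurrency_ratio>2</background_merges_mutations_concurrency_ratio>
</clickhouse>
```

Raising the pool size trades CPU and disk bandwidth for merge throughput, so increase it in small steps while watching system.merges.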
Apply these changes in a test environment first. After rolling out to production, monitor system.merges and query latency. Within hours, our billing pipeline returned to normal speeds.
Common Mistakes
- Ignoring the background pool: Many engineers only look at active queries and overlook background processes. Always check system.merges and system.mutations.
- Over-tuning deletions: Deleting rows too aggressively (e.g., many small deletes simultaneously) can flood the mutation queue. Spread out deletions and use larger batches.
- Assuming the bottleneck is I/O: In our case, I/O was fine. Don’t stop at surface-level metrics; dive into ClickHouse’s internal locks.
- Neglecting to test migrations: The migration that triggered the slowdown changed the default configuration of the merge scheduler. Always test config changes on a staging cluster with production-like workload.
Summary
A hidden bottleneck inside ClickHouse’s background thread pool can cripple your billing pipeline even when traditional metrics look healthy. By implementing per-namespace retention, systematically diagnosing lock contention, and applying targeted patches to tune concurrency and isolation, you can restore performance. The three patches we wrote—increasing pool size, separating merge queues, and throttling mutations—solved our problem and can be adapted to your environment. Remember to always validate changes in a staging environment and monitor system internals, not just query logs.