Diagnosing and Fixing a Hidden ClickHouse Bottleneck in Your Billing Pipeline
Overview
ClickHouse is a powerful OLAP database that many organizations rely on for real-time analytics. At Cloudflare, we process millions of ClickHouse queries daily to calculate usage-based billing, power fraud detection, and handle other critical workflows. When our daily aggregation jobs suddenly slowed down after a routine migration, the impact was immediate—invoices became delayed, and downstream systems struggled to keep up. The usual suspects (I/O, memory, rows scanned, parts read) looked normal. After deep investigation, we discovered a hidden bottleneck buried deep within ClickHouse’s internal thread pool and locking mechanisms. This tutorial walks through the exact steps we took to identify, diagnose, and fix the issue. By the end, you’ll learn how to set up per-namespace retention, spot hidden performance killers, and apply targeted patches to restore query speed.

Prerequisites
Before diving into the diagnostics, ensure you have:
- ClickHouse expertise: Familiarity with MergeTree tables, partitions, primary keys, and system tables.
- Access to production or staging ClickHouse clusters: Ability to run queries against system.query_log, system.parts, system.metrics, etc.
- Basic Linux administration: For reading logs and adjusting configuration files.
- A test environment to apply changes without affecting live traffic.
- A billing or aggregation pipeline that relies on ClickHouse for time-sensitive jobs.
Step-by-Step Instructions
Step 1: Analyze Your Retention Requirements
Cloudflare’s Ready-Analytics platform stored all data in a single massive table partitioned by day, with a 31-day retention enforced by dropping old partitions. This “one-size-fits-all” approach forced teams with longer retention needs (e.g., 90 days for legal compliance) to build separate infrastructure. The first step is to inventory your own retention policies: which namespaces or datasets require more than the default? Which need less? Document these requirements to design a flexible per-namespace retention system.
Step 2: Implement Per-Namespace Retention
To replace the rigid partition-drop approach, we introduced a new system that allows each namespace (identified by a namespace field) to have its own retention period. Here’s how to set it up in ClickHouse.
- Add a retention metadata table: Create a small table like namespace_retention with columns namespace, retention_days, and updated_at. Populate it with your new policies.
- Modify the primary key: Ensure your primary key includes namespace and timestamp so that data from different namespaces can be efficiently pruned. Example: PRIMARY KEY (namespace, indexID, timestamp).
- Create a custom retention job: Instead of dropping entire partitions, run a periodic job that deletes rows older than each namespace's allowed age. Use the ALTER TABLE ... DELETE command with a condition like WHERE namespace = 'x' AND timestamp < now() - toIntervalDay(retention_days). Because deletes in ClickHouse are asynchronous and create mutations, batch namespaces to avoid overwhelming the system.
- Optimize with parts merging: After deletions, run OPTIMIZE TABLE ... FINAL sparingly (it is resource-intensive) to reclaim space and improve query performance.
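The steps above can be sketched in ClickHouse SQL. The table and column names here (namespace_retention, events, the example namespaces) are illustrative, not the exact production schema:

```sql
-- Illustrative metadata table: one retention policy per namespace.
CREATE TABLE namespace_retention
(
    namespace String,
    retention_days UInt16,
    updated_at DateTime DEFAULT now()
)
ENGINE = ReplacingMergeTree(updated_at)
ORDER BY namespace;

INSERT INTO namespace_retention (namespace, retention_days) VALUES
    ('billing', 90),   -- longer retention, e.g. for compliance
    ('fraud', 31);     -- default retention

-- Periodic retention job: the job reads retention_days from the metadata
-- table, then issues one asynchronous delete mutation per namespace.
-- Run these in small batches rather than all at once.
ALTER TABLE events DELETE
WHERE namespace = 'fraud'
  AND timestamp < now() - toIntervalDay(31);  -- 31 looked up by the job
```

The literal interval is substituted by the retention job rather than read via a subquery, since mutations are replayed on replicas and should stay deterministic.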
After implementing this, we observed that billing aggregation queries slowed drastically—but not because of the delete logic. The slowdown appeared only after a migration that changed the merge scheduler’s behavior. This led us to the real bottleneck.
Step 3: Identify the Hidden Bottleneck
When query performance degrades, start with the usual diagnostics. Check system.query_log for increased latency, system.parts for part count, and system.metrics for CPU and memory usage. In our case, all were normal. The hidden bottleneck was inside ClickHouse’s BackgroundProcessingPool—a global thread pool responsible for merging parts, deleting stale data, and other background tasks. After the migration, the merge scheduler started holding a lock longer than expected, causing queries that needed to access the same partitions to queue up.

To detect this:
- Query system.merges to see if merges are stuck or taking unusually long.
- Examine system.processes for queries in a "Lock wait" state.
- Check system.events for increments in LockWaitTime.
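These checks are plain read-only queries against system tables, so they are safe to run on a live cluster. A minimal sketch (column selections are examples):

```sql
-- Long-running or stuck merges.
SELECT database, table, elapsed, progress, num_parts
FROM system.merges
ORDER BY elapsed DESC;

-- Currently executing queries; long elapsed times with little visible
-- progress often point at waiting on an internal lock.
SELECT query_id, elapsed, query
FROM system.processes
ORDER BY elapsed DESC;

-- Cumulative lock-related counters since server start.
SELECT event, value
FROM system.events
WHERE event ILIKE '%lock%';
```

Snapshot these counters before and after a slow period; it is the delta, not the absolute value, that indicates growing lock contention.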
We discovered that the per-namespace delete mutations were triggering many small merges, each competing for the same internal lock. This lock contention created a ripple effect that slowed all queries waiting for those partitions.
Step 4: Apply Targeted Fixes
We wrote three patches to address the root cause. Here’s a conceptual approach you can replicate:
- Patch 1: Tune the BackgroundProcessingPool – Increase the number of threads and reduce the lock-holding interval. In ClickHouse config, adjust
background_pool_sizeandbackground_merges_mutations_concurrency_ratio. Example:(from default 8).16 - Patch 2: Prioritize Merge Tasks – Separate merge tasks for different namespaces into independent queues so that a slow namespace doesn’t block others. This requires customizing the merge selector logic (our patch used a sliding-window approach based on namespace popularity).
- Patch 3: Reduce Mutations Impact – Instead of running a single large DELETE across many namespaces, batch smaller deletes with pauses. Also, use the
max_mutations_per_partitionsetting to limit concurrent mutations per partition.
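For Patch 1, the pool settings are server-level options in config.xml. A minimal sketch, with example values only (defaults and accepted settings vary by ClickHouse version, so verify against your release before deploying):

```xml
<clickhouse>
    <!-- More background threads for merges and mutations (example: 16, up from 8). -->
    <background_pool_size>16</background_pool_size>
    <!-- How many merge/mutation tasks may be scheduled per background thread. -->
    <background_merges_mutations_concurrency_ratio>2</background_merges_mutations_concurrency_ratio>
</clickhouse>
```

Raising the pool size trades CPU and disk bandwidth for merge throughput, so increase it in small steps while watching system.merges.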
Apply these changes in a test environment first. After rolling out to production, monitor system.merges and query latency. Within hours, our billing pipeline returned to normal speeds.
Common Mistakes
- Ignoring the background pool: Many engineers only look at active queries and overlook background processes. Always check system.merges and system.mutations.
- Over-tuning deletions: Deleting rows too aggressively (e.g., many small deletes simultaneously) can flood the mutation queue. Spread out deletions and use larger batches.
- Assuming the bottleneck is I/O: In our case, I/O was fine. Don’t stop at surface-level metrics; dive into ClickHouse’s internal locks.
- Neglecting to test migrations: The migration that triggered the slowdown changed the default configuration of the merge scheduler. Always test config changes on a staging cluster with production-like workload.
Summary
A hidden bottleneck inside ClickHouse’s background thread pool can cripple your billing pipeline even when traditional metrics look healthy. By implementing per-namespace retention, systematically diagnosing lock contention, and applying targeted patches to tune concurrency and isolation, you can restore performance. The three patches we wrote—increasing pool size, separating merge queues, and throttling mutations—solved our problem and can be adapted to your environment. Remember to always validate changes in a staging environment and monitor system internals, not just query logs.