Cluster Health
Operations & database monitoring for ClickHouse — replication, Keeper, merges, mutations, parts pressure, and long queries fanned out across every node
Cluster Health gives you operational visibility into a running ClickHouse cluster: replication status, ZooKeeper/Keeper health, background merges and mutations, parts pressure, long-running queries, and data location — fanned out across every node, not just the one CH-UI is connected to.
Available on Pro and Enterprise plans.
It is built entirely from ClickHouse system.* tables, so there is nothing to install on your nodes — no agents, no exporters.
Scope
Distributed-database monitoring has three pillars. Cluster Health deliberately covers two of them:
- Operations — replication, Keeper, merges/mutations, backups
- Database — long queries, timeouts, parts-vs-limit, data location
- Infrastructure (CPU / memory / disk / network) — out of scope
Host infrastructure needs node-level agents (Prometheus, node_exporter, etc.). Everything Cluster Health shows comes from ClickHouse itself, which is why it works with zero setup.
What it shows
Operations
- Replication: per-replica delay, queue depth, inserts/merges in queue, and readonly/session-expired state (system.replicas).
- Queue issues: stalled or failing replication tasks with their retry count, postpone reason, and exception (system.replication_queue).
- Keeper: ZooKeeper/Keeper connection, session uptime, and expiry (system.zookeeper_connection).
- Merges & mutations: in-flight merges and unfinished mutations, including the failure reason for stuck mutations (system.merges, system.mutations).
- Backups: recent backup/restore operations and their status (system.backups).
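As a rough illustration of how the replication signals above can be read, the sketch below flags problem replicas from rows shaped like system.replicas. The column names (is_readonly, is_session_expired, absolute_delay) are real system.replicas columns; the function name and the 60-second delay threshold are hypothetical and not taken from CH-UI.

```python
from typing import Dict, List


def replica_issues(row: Dict[str, int], max_delay_seconds: int = 60) -> List[str]:
    """Return human-readable issues for one system.replicas row.

    max_delay_seconds is an illustrative threshold, not a CH-UI default.
    """
    issues = []
    if row["is_readonly"]:
        issues.append("readonly")
    if row["is_session_expired"]:
        issues.append("session expired")
    if row["absolute_delay"] > max_delay_seconds:
        issues.append("delay {}s".format(row["absolute_delay"]))
    return issues


healthy = {"is_readonly": 0, "is_session_expired": 0, "absolute_delay": 0}
lagging = {"is_readonly": 1, "is_session_expired": 0, "absolute_delay": 900}
print(replica_issues(healthy))
print(replica_issues(lagging))
```

A replica with an empty issue list would count as healthy under these assumed rules.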
Database
- Long queries: currently running queries over a configurable threshold (system.processes).
- Parts pressure: active parts per partition compared against parts_to_throw_insert, the classic "too many parts" early warning before inserts start failing (system.parts, system.merge_tree_settings).
- Data location: disk and S3 distribution with free/total usage (system.disks).
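The parts-pressure comparison can be sketched as follows, assuming the active part count per partition and the parts_to_throw_insert limit have already been fetched. The two SQL strings use real system tables and columns, but they are an illustration rather than the queries CH-UI runs; the Python names are hypothetical, and per-table overrides of the setting are ignored here.

```python
from typing import List, NamedTuple, Tuple

# Illustrative queries (not necessarily what CH-UI issues).
PARTS_SQL = """
SELECT database, table, partition, count() AS active_parts
FROM system.parts
WHERE active
GROUP BY database, table, partition
"""
LIMIT_SQL = (
    "SELECT value FROM system.merge_tree_settings "
    "WHERE name = 'parts_to_throw_insert'"
)


class PartitionParts(NamedTuple):
    node: str
    table: str
    partition: str
    active_parts: int


def parts_pressure(
    rows: List[PartitionParts], parts_to_throw_insert: int
) -> List[Tuple[PartitionParts, float]]:
    """Pair each partition with active_parts / limit, worst first."""
    if parts_to_throw_insert <= 0:
        raise ValueError("parts_to_throw_insert must be positive")
    scored = [(r, r.active_parts / parts_to_throw_insert) for r in rows]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


rows = [
    PartitionParts("ch-1", "default.events", "202401", 750),
    PartitionParts("ch-2", "default.events", "202401", 1500),
]
for row, ratio in parts_pressure(rows, parts_to_throw_insert=3000):
    print(row.node, row.partition, ratio)
```

A ratio approaching 1.0 means the partition is close to the point where inserts are rejected.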
How it works
Fan-out across nodes
Queries run via clusterAllReplicas('<cluster>', system.X) with per-node attribution, so a problem on any replica is visible — not just the node CH-UI happens to be connected to. The cluster is auto-detected (the one containing the connected node); you can override it.
- Single-node deployments transparently fall back to the local node.
- If remote nodes can't be reached (for example, the connection's credentials aren't valid cluster-wide), the view degrades to the local node and is flagged as such rather than failing.
- System tables that don't exist on a given ClickHouse version (for example system.zookeeper_connection or system.backups) are skipped gracefully instead of erroring.
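The fan-out pattern described above can be sketched as a small query builder. clusterAllReplicas and hostName() are real ClickHouse functions; the helper name, its parameters, and the exact shape of the query are assumptions for illustration and may differ from what CH-UI generates.

```python
from typing import Optional


def fanout_query(table: str, cluster: Optional[str], columns: str = "*") -> str:
    """Build a read of a system table with per-node attribution.

    With a cluster name the read goes through clusterAllReplicas;
    with cluster=None it falls back to the local node only.
    """
    if not table.startswith("system."):
        raise ValueError("expected a system.* table")
    if cluster is None:
        source = table  # single-node or degraded local fallback
    else:
        # Escape backslashes and quotes for a ClickHouse string literal.
        escaped = cluster.replace("\\", "\\\\").replace("'", "\\'")
        source = "clusterAllReplicas('{}', {})".format(escaped, table)
    # hostName() is evaluated on each node, which gives the attribution.
    return "SELECT hostName() AS node, {} FROM {}".format(columns, source)


print(fanout_query("system.replicas", "prod", "database, table, absolute_delay"))
print(fanout_query("system.replicas", None))
```

Keeping the local fallback in the same builder means callers can degrade to a single node without changing how they read the result set.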
History & retention
A background harvester samples lightweight per-node aggregates on an interval and stores them in SQLite to power the trend charts. Only compact numeric aggregates are stored — detailed drill-down lists are always fetched live — so storage stays tiny.
Samples older than the retention window are pruned automatically. The default is 7 days, adjustable from 1 to 365.
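A minimal sketch of the store-and-prune cycle, using Python's built-in sqlite3 module. The table layout, metric name, and function names are hypothetical; only the 7-day default and the 1 to 365 day range come from the text above.

```python
import sqlite3
import time

# Hypothetical schema: one compact numeric aggregate per node and metric.
SCHEMA = """
CREATE TABLE IF NOT EXISTS samples (
    ts     INTEGER NOT NULL,  -- Unix seconds
    node   TEXT    NOT NULL,
    metric TEXT    NOT NULL,
    value  REAL    NOT NULL
)
"""


def record_sample(db, node, metric, value, now=None):
    """Store one aggregate sample, timestamped now (or at the given time)."""
    ts = int(time.time()) if now is None else int(now)
    db.execute(
        "INSERT INTO samples (ts, node, metric, value) VALUES (?, ?, ?, ?)",
        (ts, node, metric, value),
    )


def prune(db, retention_days=7, now=None):
    """Delete samples older than the retention window; return rows removed."""
    if not 1 <= retention_days <= 365:
        raise ValueError("retention_days must be between 1 and 365")
    current = int(time.time()) if now is None else int(now)
    cutoff = current - retention_days * 86400
    cursor = db.execute("DELETE FROM samples WHERE ts < ?", (cutoff,))
    return cursor.rowcount


db = sqlite3.connect(":memory:")
db.execute(SCHEMA)
record_sample(db, "ch-1", "max_replication_delay", 2.0)
print(prune(db, retention_days=7))
```

Because only numeric aggregates are written, each poll adds a handful of small rows per node, which is what keeps the store compact.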
Charts & filtering
The trend charts (replication delay, parts pressure) plot one series per node with time axes and a hover legend, over a selectable window (1h / 6h / 24h / 7d) — so "which node is lagging" is visible at a glance instead of hidden in a cluster-wide aggregate.
The page also filters end-to-end: a free-text search narrows section rows (table names, postpone reasons, exceptions…), and a node filter — pick from the dropdown or click the funnel on any row of the per-node table — focuses the headline tiles, the charts, and every drill-down section on a single node. One click answers "is this one node's problem, or everyone's?"
Settings
Open Cluster Health → Settings (admin only):
| Setting | Default | Range | Purpose |
|---|---|---|---|
| Enable background collection | On | — | Turn the harvester on or off per connection |
| History retention (days) | 7 | 1–365 | TTL for stored samples |
| Poll interval (seconds) | 60 | 15–3600 | How often each cluster is sampled |
| Long-query threshold (seconds) | 30 | 1–3600 | When a running query counts as "long" |
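The defaults and ranges in the table can be expressed as a validated settings object. This is a sketch only: the class and field names are hypothetical and do not reflect CH-UI's actual configuration keys or API.

```python
from dataclasses import dataclass


def _check(name: str, value: int, low: int, high: int) -> None:
    """Reject values outside the documented inclusive range."""
    if not low <= value <= high:
        raise ValueError("{} must be between {} and {}".format(name, low, high))


@dataclass(frozen=True)
class ClusterHealthSettings:
    # Defaults and ranges mirror the settings table; names are illustrative.
    collection_enabled: bool = True
    retention_days: int = 7
    poll_interval_seconds: int = 60
    long_query_threshold_seconds: int = 30

    def __post_init__(self) -> None:
        _check("retention_days", self.retention_days, 1, 365)
        _check("poll_interval_seconds", self.poll_interval_seconds, 15, 3600)
        _check(
            "long_query_threshold_seconds",
            self.long_query_threshold_seconds,
            1,
            3600,
        )


print(ClusterHealthSettings())
print(ClusterHealthSettings(retention_days=30, poll_interval_seconds=15))
```

Validating at construction time means an out-of-range value fails immediately instead of surfacing later in the harvester.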
Requirements
- A Pro license. The page, the /api/cluster-health/* endpoints, and the background harvester are all disabled on the free edition.
- The connection's ClickHouse user needs read access to the relevant system.* tables, and to remote nodes for full cluster fan-out.
For load-balanced deployments where CH-UI sits behind chproxy / HAProxy / a Kubernetes Service, see ClickHouse Clusters & Load Balancers for the session-affinity setup that keeps system.* reads consistent.
Cluster Health pairs naturally with Query Insights: Cluster Health tells you what the cluster is doing, Query Insights tells you which queries made it do that.
Query Insights
Visual analytics over system.query_log — latency percentiles, slow and memory-heavy query patterns, failures, users, and hot tables, with dashboard-wide cross-filtering
ClickHouse Clusters & Load Balancers
Sticky routing with X-CH-UI-Session for multi-node ClickHouse deployments behind chproxy, HAProxy, or Kubernetes