If you've made it this far, you're probably following my journey building Userloom.io. Things are going well so far. In the last post, I explained why ClickHouse was the obvious choice for analytics storage. Now it's time to actually deploy the thing.
I decided to go self-hosted for two reasons:
- Control. I want full control over cluster topology, configuration tuning, and scaling decisions.
- Learning. I want to understand how to architect and deploy a distributed database cluster myself. No black boxes.
So the goal for the week was clear: build a production-ready ClickHouse cluster with complete operational control.
Then the R&D rabbit hole began.
Why I Skipped ClickHouse Cloud
Let me get this out of the way first. ClickHouse Cloud is a great product. But I ruled it out for this project.
It comes back to the control point above: I want to own cluster topology, configuration tuning, and scaling decisions, and I don't want to wait behind support tickets to change settings or debug issues.
Bare metal (or VMs) it is.
Why Terraform + Ansible Over Kubernetes
Here's where I'll be completely honest: I don't have deep experience with clusters or advanced DevOps. So I turned to everyone's best friends, Claude and ChatGPT, to help me think through options.
My first instinct was to learn Kubernetes. I have plenty of experience with Docker and containers, so K8s felt like the natural next step. But after some research, I realized it wasn't the right choice for this stage.
Info
At small scale (5 nodes), Kubernetes adds significant overhead without proportional benefits. The 2-4GB RAM per node for K8s components alone would eat into resources better spent on ClickHouse itself.
Here's the comparison that convinced me:
| Factor | Kubernetes | Terraform + Ansible |
|---|---|---|
| Operational complexity | High (need to manage K8s itself) | Low (direct server management) |
| Resource overhead | 2-4GB RAM per node for K8s components | Near zero |
| ClickHouse I/O performance | Tricky with PVCs | Native disk access, optimal |
| Debugging | More layers to troubleshoot | SSH in, check logs, done |
| Learning curve | Steep if not already using K8s | Moderate |
| Time to production | Longer | Faster |
The decision became obvious. Terraform + Ansible gives me infrastructure-as-code discipline without the Kubernetes tax.
Why Both Terraform and Ansible?
Terraform is declarative for cloud resources. Ansible is procedural for server configuration. They complement each other, though there's some overlap.
I could use Ansible alone. If you're comfortable creating servers manually in Hetzner's UI (takes 10 minutes), Ansible handles the rest just fine.
But Terraform adds value when you want to:
- Tear down and recreate environments reliably
- Manage DNS, firewalls, private networks, and load balancers as code
- Maintain multiple environments (staging, production)
- Review infrastructure changes in pull requests
I'm planning to replicate this environment later, so investing in Terraform now reduces future headaches. If we're doing this, let's do it right.
The Cluster Architecture
After a week of research, here's the architecture I landed on:
Five nodes total:
- Nodes 1-4: ClickHouse servers (2 shards × 2 replicas)
- Nodes 1-3: Also run ClickHouse Keeper for consensus
- Node 5: Dedicated monitoring stack (Prometheus, Grafana, Alertmanager, backups)
Everything lives inside a private network. The only public exposure is through the load balancer with SSL termination.
SSH Access
SSH (port 22) is restricted to your IP only. Never expose SSH to the public internet on production infrastructure.
Understanding ClickHouse Keeper and Quorum
Info
ClickHouse Keeper is the modern replacement for ZooKeeper, built directly into ClickHouse. It uses the same Raft consensus protocol but eliminates the need to manage a separate Java-based ZooKeeper cluster.
Raft needs a quorum (a majority of Keeper nodes) to commit any decision.
Here's the math:
| Nodes | Quorum | Survives N Failures |
|---|---|---|
| 1 | 1 | 0 (no fault tolerance) |
| 2 | 2 | 0 (both needed, pointless) |
| 3 | 2 | 1 node can fail ✓ |
| 4 | 3 | 1 node can fail (same as 3) |
| 5 | 3 | 2 nodes can fail |
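Every row in that table falls out of one formula: the quorum is floor(N / 2) + 1, and the cluster survives N minus quorum failures. If you want to sanity-check the numbers yourself, a quick ClickHouse query reproduces them:

```sql
-- Quorum math: a majority is floor(N / 2) + 1, and the cluster
-- tolerates (N - quorum) simultaneous Keeper failures.
SELECT
    arrayJoin([1, 2, 3, 4, 5]) AS keeper_nodes,
    intDiv(keeper_nodes, 2) + 1 AS quorum,
    keeper_nodes - quorum AS survivable_failures;
```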
Why 3 Keeper Nodes?
Three is the minimum for fault tolerance (it survives one failure), and odd numbers avoid split-brain scenarios. A fourth node adds no fault tolerance over three. For my scale, five Keeper nodes would be overkill.
What Keeper actually does:
- Coordinates which replica is the leader for writes
- Tracks replication state across nodes
- Manages distributed DDL operations (CREATE/ALTER TABLE)
- Stores merge scheduling metadata
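If you want to see that coordination metadata for yourself once the cluster is up, ClickHouse exposes the Keeper tree through the built-in system.zookeeper table (the name is a holdover from ZooKeeper). This is just a peek, not something you need day to day:

```sql
-- Browse the Keeper tree. system.zookeeper requires a path filter;
-- '/' lists the top-level znodes, and the replication metadata for the
-- tables defined later in this post lives under /clickhouse/tables.
SELECT name, path
FROM system.zookeeper
WHERE path = '/';
```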
Sharding vs Replication: What's the Difference?
This confused me at first. Let me break it down simply.
Sharding = Horizontal Scaling (Split Your Data)
Sharding distributes data across multiple servers. Each shard holds a subset of the total data, partitioned by a shard key (like user_id, tenant_id, or date).
When to add shards:
- Single node can't hold all your data
- Query performance degrades due to data volume
- Write throughput exceeds single node capacity
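As a rough mental model for how rows end up on a shard: with equally weighted shards (my setup has two), the Distributed engine takes the sharding key modulo the total shard weight and maps the remainder to a shard. A tiny illustration, assuming the tenant_id key from my schema:

```sql
-- With 2 equally weighted shards, a row lands on the shard whose slot
-- matches sharding_key % 2 (remainder 0 -> first shard, 1 -> second).
SELECT
    arrayJoin([7, 8, 1001, 1002]) AS tenant_id,
    tenant_id % 2 AS shard_slot;
```

In practice many setups wrap the key in a hash like cityHash64(tenant_id) to spread skewed keys more evenly; I'm keeping the raw key for now.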
Replication = High Availability (Copy Your Data)
Replication creates identical copies of each shard on multiple nodes. Writes go to any replica, then sync automatically. If one replica dies, others keep serving.
Benefits:
- Zero data loss if a node fails
- Read scaling (distribute reads across replicas)
- Online maintenance (take one replica offline without downtime)
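Replication health is also easy to watch from SQL. Here's a minimal check you can run against any node, using ClickHouse's built-in system.replicas table (what thresholds to alert on is up to you):

```sql
-- Replication health at a glance: one row per replicated table on this node.
SELECT
    database,
    table,
    is_readonly,        -- 1 usually means the replica lost its Keeper session
    active_replicas,    -- replicas currently connected to Keeper
    total_replicas,
    queue_size,         -- pending fetches and merges for this replica
    absolute_delay      -- seconds this replica lags behind the freshest one
FROM system.replicas;
```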
My Setup: 2 Shards × 2 Replicas
Here's what this configuration provides:
- Can lose any 1 node without data loss
- Can lose 1 node per shard and stay online
- 2x write throughput (parallel writes to shards)
- 4x read throughput (all 4 nodes serve reads)
- 2x storage capacity (data split across shards)
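Once the cluster definition is in place, you can verify the topology from any node via system.clusters. A quick sketch, using the 'analytics' cluster name from the table definitions below:

```sql
-- Confirm ClickHouse sees 2 shards x 2 replicas for the 'analytics' cluster.
-- Expect four rows: shard_num 1-2, replica_num 1-2 (hostnames will be yours).
SELECT cluster, shard_num, replica_num, host_name, is_local
FROM system.clusters
WHERE cluster = 'analytics';
```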
How Distributed Queries Work
The table setup uses two layers:
```sql
-- Local table: lives on each shard, holds that shard's data only
CREATE TABLE events ON CLUSTER 'analytics' (
    tenant_id UInt32,
    event_time DateTime,
    event_type String
) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events', '{replica}')
PARTITION BY toYYYYMM(event_time)
ORDER BY (tenant_id, event_time);

-- Distributed table: virtual table that queries ALL shards
CREATE TABLE events_distributed ON CLUSTER 'analytics' AS events
ENGINE = Distributed('analytics', 'default', 'events', tenant_id);
```
When you query events_distributed, here's what happens:
1. The query hits any node through the load balancer
2. That node fans out the query to one replica per shard
3. Each shard processes its portion in parallel
4. Results are merged and returned to the client
The client doesn't need to know anything about shards or replicas. It just queries events_distributed and ClickHouse handles the rest.
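For completeness, here's what day-to-day usage looks like against the tables above (the values are made up):

```sql
-- Writes go through the distributed table too: each row is routed to a
-- shard by tenant_id, then replicated within that shard.
INSERT INTO events_distributed VALUES
    (42, now(), 'page_view'),
    (43, now(), 'signup');

-- Reads fan out to one replica per shard; partial results are merged
-- before being returned to the client.
SELECT event_type, count() AS events
FROM events_distributed
WHERE tenant_id = 42
GROUP BY event_type;
```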
Key Takeaways for Self-Hosted ClickHouse
If you're considering self-hosted ClickHouse:
- Skip Kubernetes at small scale. Terraform + Ansible gives you infrastructure-as-code without the overhead.
- Use 3 Keeper nodes. It's the minimum for fault tolerance, and 4 nodes doesn't improve anything.
- Understand the difference between shards and replicas. Shards split data for scale. Replicas copy data for availability.
- Start with 2 shards × 2 replicas. It's a solid foundation that survives node failures and scales reads/writes.
- Keep everything in a private network. Expose only what you need through a load balancer.
The whole setup took about a week of learning and iteration. Most of that time was understanding concepts, not writing code. Once the architecture clicked, the Terraform and Ansible configs came together quickly.
Next up: Building the JavaScript SDK for client-side event tracking and batching.