
Optimizing 100 Billion Clickhouse Events

How we built an analytics data model, migrated it, then migrated it again

Ryan Voris, Engineer @ Replo

One of the newer products we've launched at Replo is Replo Analytics. We want to enable Replo customers to sell anything online, and part of selling online is analyzing the performance of your campaigns, determining conversion rates based on per-session purchase data, etc. Since Replo automatically deploys campaign landing pages, we can easily track this event data and provide it for our customers with way less setup than something like Google Analytics.

We use ClickHouse to store this data, and throughout the development of this product we had to implement several different architectures to account for storage growth while keeping queries performant.

#Storing Analytics Data

At initial launch of Replo Analytics, we had spent 3-4 months prototyping and built what felt like a solid analytics pipeline on top of ClickHouse. Our events table sat at the center, steadily collecting frontend events from Replo-powered storefronts. The data fidelity was high. The queries returned what we expected. Things were working and life was good.

Here's what our events table looked like:
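The original DDL isn't reproduced here, so this is a hypothetical reconstruction based on the problems described later (row-store habits, nullable columns everywhere, an ordering key with no tenant scoping) — column names and types are assumptions:

```sql
-- Hypothetical sketch of the original events table. Note the issues:
-- nullable columns, no namespace/tenant column in the sorting key, and
-- an ordering key led by a high-resolution timestamp.
CREATE TABLE events
(
    id         UUID,
    created_at DateTime,
    project_id Nullable(String),
    session_id Nullable(String),
    event_type Nullable(String),
    payload    Nullable(String)  -- JSON stored as a string
)
ENGINE = MergeTree
ORDER BY (created_at, id);
```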

If you're experienced with ClickHouse, you probably already see the problems, but back then we were new to ClickHouse and column-oriented databases, so it felt reasonable to us.

But as usage grew, so did the number of events, the sessions behind them, and the complexity of the metrics we wanted to compute. Query times slowed - slow became unreliable, then unreliable became unacceptable. Queries for customers tracking 10M+ pageviews per month timed out.

We tried tuning, we tried optimizing queries, we tried adding more and more tables to try to split and speed things up, but eventually we realized it wasn't something we could patch our way out of. We needed a fresh start, a new model that wasn't just correct, but fast. Thus began our journey into what we now call internally “events data model V2”.

If you're working with ClickHouse at any meaningful scale, or trying to make analytics feel instant for real users, this story might be familiar (or at least entertaining). This is the story of how we built the new data model and then carefully migrated to it, then migrated it again.

#events_v2 and events_computed

As more sessions rolled in and our customer base grew, queries that once took hundreds of milliseconds started creeping into seconds, then tens of seconds. Eventually we reached a point where running a simple COUNT() across recent sessions required a meditation break.

We started noticing a few other patterns:

  1. The same per-session purchase data was being recomputed over and over again. We didn't have any caching layer, so every query was a cold query.
  2. Computed metrics were derived on the fly. This added tons of latency and made our analytics dashboards feel sluggish at best.

So, we began exploring what would become our V2. The goal wasn't just faster queries - we wanted to rethink the whole approach to computing metrics from raw events.

At the heart of this would be precomputation. Not in the generic, “yeah we'll batch some things” sense, but full, session-level attribution logic baked into the pipeline, computed once, and never recomputed unless necessary.
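A hypothetical reconstruction of the table, pieced together from the design decisions described below (exact column names and types are assumptions):

```sql
-- Sketch of events_v2: low-cardinality namespace first in the key,
-- session-oriented ordering, dedup via ReplacingMergeTree, JSON payload.
CREATE TABLE events_v2
(
    namespace LowCardinality(String),  -- Replo project id
    session   String,
    timestamp DateTime,
    id        UUID,
    data      String                   -- event payload as JSON
)
ENGINE = ReplacingMergeTree
PARTITION BY (toYYYYMM(timestamp), namespace)
ORDER BY (namespace, session, toDate(timestamp), id);
```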

This was the first version of our events_v2 table. It reflected lessons we had started to internalize about ClickHouse's performance characteristics. namespace became a central column to establish ownership boundaries across tenants - we used Replo project ids for this. It was relatively low in cardinality, which allowed ClickHouse to skip large portions of data when queries were scoped by customer.

We optimized storage and query performance through careful partitioning, ordering, and deduplication strategies.

We partitioned the table by both time and namespace to align with common query patterns that filter on these fields. On insert, ClickHouse creates a separate data part for each unique partition key value, rather than combining all inserted rows into a single part. This reduces unnecessary scans, because ClickHouse can prune entire partitions that don't match a query's filters and only read the parts it needs.

Within each partition, events were sorted by namespace, then session, followed by date and id. This ordering enables queries to read data in a session-oriented sequence, while ensuring uniqueness guarantees through the id field.

To handle duplicate events, we used a ReplacingMergeTree. This ClickHouse table engine removes duplicate rows that share the same sorting key during background merges, keeping only the most recent version.

Finally, the event payload itself is stored in a single data column as JSON. This design gives us flexibility across event types without needing to alter the table schema as events evolve.

We also made the important decision to pre-compute some metrics with events_computed:
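A sketch of that table, assuming the same shape as the events_v2 reconstruction above (column names are assumptions):

```sql
-- events_computed: identical to events_v2 plus one extra column that
-- holds pre-computed, per-session metrics as JSON (NULL until filled).
CREATE TABLE events_computed
(
    namespace LowCardinality(String),
    session   String,
    timestamp DateTime,
    id        UUID,
    data      String,
    computed  Nullable(String)
)
ENGINE = ReplacingMergeTree
PARTITION BY (toYYYYMM(timestamp), namespace)
ORDER BY (namespace, session, toDate(timestamp), id);
```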

Close readers might realize this definition is almost the same as events_v2 - it is, but with one key difference: it includes computed values. The idea was that we could take common metrics, like per-session purchase amounts, and pre-compute them once to save query performance time.

So how do those values get populated? That's the 100 billion event question. This is where our team started slipping into the rabbit hole of figuring out when and what to recompute, trying every kind of materialized view ClickHouse supports, building cache tables, setting indexes, and eventually rethinking the entire strategy from the ground up.

#Pre-computing Event Columns Every Minute

Let's look at a common case of per-session purchase data. In Replo Analytics, we want to compute the number of purchases per session and the total amount of those purchases.

Since most events aren't purchases, we started by copying everything from events_v2 into events_computed with a null value in the computed column. This happened through a simple materialized view:
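Something like this, assuming the reconstructed schemas above (the view name is hypothetical):

```sql
-- On every insert into events_v2, mirror the row into events_computed
-- with the computed column left NULL.
CREATE MATERIALIZED VIEW events_computed_mv TO events_computed AS
SELECT
    namespace, session, timestamp, id, data,
    CAST(NULL, 'Nullable(String)') AS computed
FROM events_v2;
```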

Next, we needed a way to figure out which sessions required computation. To track this, we created a lightweight summing table called sessions_requiring_computation. The idea was simple: if a purchase event came in, we marked the session with a +1. Once we processed it, we subtracted one to "unmark" it.
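The marker table can be sketched with a SummingMergeTree, which collapses rows with the same key by summing the counter (column names are assumptions):

```sql
-- One logical row per (namespace, session); after merges,
-- pending > 0 means the session still needs computation.
CREATE TABLE sessions_requiring_computation
(
    namespace LowCardinality(String),
    session   String,
    pending   Int64   -- +1 when a purchase arrives, -1 once processed
)
ENGINE = SummingMergeTree(pending)
ORDER BY (namespace, session);
```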

Two materialized views handled the bookkeeping: one to increment when a purchase was seen, and one to decrement when that purchase was processed.
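Roughly like this — the view names and the JSON field used to detect a purchase are assumptions:

```sql
-- Mark: a purchase landing in events_v2 increments the session.
CREATE MATERIALIZED VIEW mark_session_mv TO sessions_requiring_computation AS
SELECT namespace, session, 1 AS pending
FROM events_v2
WHERE JSONExtractString(data, 'type') = 'purchase';

-- Unmark: a computed purchase row landing in events_computed decrements it.
CREATE MATERIALIZED VIEW unmark_session_mv TO sessions_requiring_computation AS
SELECT namespace, session, -1 AS pending
FROM events_computed
WHERE computed IS NOT NULL
  AND JSONExtractString(data, 'type') = 'purchase';
```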

So far, so good, but we still needed to actually compute the metrics and write them to events_computed. That's where events_computed_fn came in, a view that scanned sessions_requiring_computation, found active rows, joined them with raw events, did the aggregation using OVER (PARTITION BY session), and emitted a JSON blob. Note that we needed to limit the amount of possible sessions we were computing because we started running into memory issues if we tried to go too big.
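A minimal sketch of that view, assuming the schemas above (the batch limit, JSON fields, and output shape are illustrative, not the production query):

```sql
CREATE VIEW events_computed_fn AS
WITH active_sessions AS
(
    SELECT namespace, session
    FROM sessions_requiring_computation
    GROUP BY namespace, session
    HAVING sum(pending) > 0
    LIMIT 1000  -- bound the batch; large batches caused memory issues
)
SELECT
    namespace, session, timestamp, id, data,
    toJSONString(map(
        'purchase_count',
        toString(countIf(JSONExtractString(data, 'type') = 'purchase') OVER w),
        'purchase_total',
        toString(sumIf(JSONExtractFloat(data, 'amount'),
                       JSONExtractString(data, 'type') = 'purchase') OVER w)
    )) AS computed
FROM events_v2
INNER JOIN active_sessions USING (namespace, session)
WINDOW w AS (PARTITION BY session);
```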

We ran this every minute using another materialized view:
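Using ClickHouse's refreshable materialized views (experimental in older releases; the view name is an assumption):

```sql
-- Re-run the computation on a schedule and write the results into
-- events_computed, where ReplacingMergeTree collapses the old rows.
CREATE MATERIALIZED VIEW events_computed_refresh
REFRESH EVERY 1 MINUTE TO events_computed AS
SELECT * FROM events_computed_fn;
```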

And to track how it was doing, we added an audit table that recorded how many sessions still needed computation every minute:
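A sketch of that audit setup (table and view names are assumptions):

```sql
CREATE TABLE computation_audit
(
    checked_at       DateTime,
    pending_sessions UInt64
)
ENGINE = MergeTree
ORDER BY checked_at;

-- Snapshot the queue depth once a minute so backlog can be graphed.
CREATE MATERIALIZED VIEW computation_audit_refresh
REFRESH EVERY 1 MINUTE TO computation_audit AS
SELECT now() AS checked_at, count() AS pending_sessions
FROM
(
    SELECT namespace, session
    FROM sessions_requiring_computation
    GROUP BY namespace, session
    HAVING sum(pending) > 0
);
```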

This was a lot of steps, but it did work - our performance testing showed our queries were 4-10x faster, so we migrated our ~60 billion events.

For the first time.

#Expanding, Collapsing, and Caching

Replo Analytics shipped with support for first-touch and last-touch purchase attribution, but a major customer request was fractional attribution. Instead of attributing 100% of revenue to the first or last page visited, each visited page gets a fraction of the revenue, which more accurately reflects that a session usually spans multiple pages that may all contribute to a purchase.

The logic was straightforward: to figure out how much a pageview contributed to a session's revenue, divide the total purchase amount by the number of meaningful pageviews. If you had five pageviews and one purchase, each pageview was worth a fifth of the purchase's amount. Multiply that out and you get fractional revenue.
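That division can be expressed directly in SQL — a hedged sketch, assuming the JSON fields used elsewhere in this post:

```sql
-- Revenue attributed to each pageview in a session:
-- total purchase amount / number of pageviews.
SELECT
    session,
    sumIf(JSONExtractFloat(data, 'amount'),
          JSONExtractString(data, 'type') = 'purchase')
      / countIf(JSONExtractString(data, 'type') = 'pageview')
      AS revenue_per_pageview
FROM events_v2
GROUP BY session;
```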

As we started testing this, we noticed the calculations seemed off. We eventually traced it to duplicate events that threw off the math. While both events_v2 and events_computed used ReplacingMergeTree, that engine only deduplicates eventually, during background merges. You cannot rely on the tables being perfectly clean in real time, especially not for recent events (and we were only pre-computing fields for recent events).

Fortunately, ClickHouse has tools to help with this: FINAL and DISTINCT.
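At read time they look like this (filter values are illustrative): FINAL forces merge-time collapsing before the query runs, and DISTINCT ON keeps one row per key.

```sql
-- Deduplicate at query time. Both operations are expensive when they
-- have to scan large partitions.
SELECT DISTINCT ON (id) *
FROM events_v2 FINAL
WHERE namespace = 'some-project'
  AND timestamp > now() - INTERVAL 1 HOUR;
```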

This was the version of events_computed_fn that tipped the whole system over. At first it worked, and because of the flaws in the original, non-deduplicated calculations, we had to re-migrate all of the data to trigger a recalculation. That's when the symptoms began.

Memory usage crept up, and eventually refreshes began to miss their schedule entirely. At the same time, we noticed something else: the sessions_requiring_computation queue wasn't shrinking. It was meant to track work that still needed processing by incrementing for a session when a purchase happened, and decrementing after the computation ran. If a session had a value greater than zero, we knew it still needed work. But the unmarking wasn't happening.

#Large Scans Result in Timeouts

The view used FINAL and DISTINCT ON to collapse duplicates, which required scanning large partitions of the events_v2 table. That made refreshes expensive. When memory usage spiked during one of these scans, the refresh began failing silently.

No successful refresh meant no -1 was written to the sessions_requiring_computation table. So even after a session had been processed, it still showed up as needing work. The queue grew and grew, filled with sessions that had already been handled, but never marked as such. CPU usage climbed, since every refresh now required more and more computation. View refreshes became unstable, and it started to look like the system was falling behind, when in reality it was doing the same work over and over.

At this point, everything was tangled. The function relied on a join between events_v2 and the computation queue to narrow the working set, but now that queue was bloated with phantom entries. Every minute, we scanned massive partitions, joined millions of rows, ran layered CTEs, and waited for ClickHouse to spill to disk or crash. Sometimes it finished, sometimes it didn't.

#What about a cache?

We tried the usual fixes - limit the number of sessions per run, shrink the time window, filter columns, etc. This helped a little, but the core problem wasn't a tuning issue. We were doing too much work, too often, with too many moving parts on the read path.

Our core issue was we needed to deduplicate recent events so our calculations were accurate, but we needed to do it without using FINAL or DISTINCT ON against the entirety of our events_v2 or events_computed tables. So what if we created smaller “cache” tables that only held the most recent events, something like events_v2_recent_events and events_computed_recently?
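A sketch of one such cache table — the name comes from the post, the schema and TTL window are assumptions:

```sql
-- Small "recent events" cache: a TTL keeps it tiny, so FINAL and
-- DISTINCT only ever have to collapse a short window of data.
CREATE TABLE events_v2_recent_events
(
    namespace LowCardinality(String),
    session   String,
    timestamp DateTime,
    id        UUID,
    data      String
)
ENGINE = ReplacingMergeTree
ORDER BY (namespace, session, id)
TTL timestamp + INTERVAL 1 HOUR;

-- Keep it fed on every insert into events_v2.
CREATE MATERIALIZED VIEW events_v2_recent_mv TO events_v2_recent_events AS
SELECT namespace, session, timestamp, id, data
FROM events_v2;
```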

We prototyped this, and technically got it working. Performance was great, the numbers checked out, and the views were refreshing again. But by then, the whole system had become a maze of CTEs, materialized views, audit tables, and fallback logic - each one solving a problem introduced by the last.

#Long Live Simplicity

Eventually, we took a step back and made the call: no more marking, no more unmarking, no more session queues to audit and debug. The whole computation pipeline had become too tightly coupled to an increasingly fragile set of assumptions. We needed to simplify. That's when we landed on what we now call the flusher model.

Instead of trying to track what needs recomputing and when, we flipped the problem around. We asked a simpler question: What if we just recompute everything for sessions that had recent purchase activity? Not all time, not all events, just the last 30–40 minutes of “live” sessions. If it had a purchase event recently, it was eligible for recomputation. If not, we ignored it.

This was the table that powered that shift:
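A sketch of it, assuming a 40-minute TTL window (the table name and schema are assumptions):

```sql
-- Rolling window of sessions with recent purchase activity.
-- The TTL keeps the table naturally small.
CREATE TABLE recent_purchase_sessions
(
    namespace    LowCardinality(String),
    session      String,
    purchased_at DateTime
)
ENGINE = MergeTree
ORDER BY (namespace, session)
TTL purchased_at + INTERVAL 40 MINUTE;
```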

Every minute, a materialized view populated this table with new purchase events from events_v2, giving us a rolling window of active sessions to recompute.
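With a refreshable materialized view, that might look like (view name and JSON field are assumptions):

```sql
-- Every minute, pull the sessions that purchased in the last 40 minutes
-- into the rolling window table.
CREATE MATERIALIZED VIEW recent_purchase_sessions_refresh
REFRESH EVERY 1 MINUTE TO recent_purchase_sessions AS
SELECT namespace, session, max(timestamp) AS purchased_at
FROM events_v2
WHERE timestamp > now() - INTERVAL 40 MINUTE
  AND JSONExtractString(data, 'type') = 'purchase'
GROUP BY namespace, session;
```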

Then came the flusher. Every minute, it took the current window of active sessions, joined them against recent events, ran the attribution math, and flushed the computed results directly into events_computed.
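A minimal sketch of the flusher, reusing the hypothetical schemas from earlier (the metric names and window are illustrative):

```sql
-- Every minute: join recent events to the active-session window, run the
-- per-session math, and flush the results into events_computed, where
-- ReplacingMergeTree collapses the older, uncomputed rows.
CREATE MATERIALIZED VIEW flusher
REFRESH EVERY 1 MINUTE TO events_computed AS
SELECT
    e.namespace, e.session, e.timestamp, e.id, e.data,
    toJSONString(map(
        'purchase_count',
        toString(countIf(JSONExtractString(e.data, 'type') = 'purchase') OVER w),
        'purchase_total',
        toString(sumIf(JSONExtractFloat(e.data, 'amount'),
                       JSONExtractString(e.data, 'type') = 'purchase') OVER w)
    )) AS computed
FROM events_v2 AS e
INNER JOIN recent_purchase_sessions AS r USING (namespace, session)
WHERE e.timestamp > now() - INTERVAL 40 MINUTE
WINDOW w AS (PARTITION BY e.session);
```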

The flusher was simpler, cheaper, and most importantly, predictable. It removed the guesswork. No more tracking computation queues, no more fighting deduplication edge cases, no more unmarking logic silently failing. Just compute what's fresh, and move on.

In hindsight, this was the version we were always working toward. We just had to build and break the complicated one first. And so we migrated and recomputed one final time, now at ~80 billion events.

#What about the rest?

Not everything fits neatly into a rolling 40-minute window. But instead of trying to build a catch-all pipeline that solved every edge case in real time, we took a more pragmatic route. For any events that fell outside the live window, we built separate batch jobs. They could run on-demand, whenever we needed to backfill or reprocess older data, without complicating the main system.

#Key Takeaways

We didn't set out to become experts in ClickHouse, materialized views, or session-based attribution pipelines, and we still aren't. We just wanted fast, reliable metrics. But somewhere along the way - after CTE recursion, queue bloat, memory blowups, and view refresh chaos - we picked up a few things we feel are worth keeping.

#When to use CTEs, and when to walk away

While CTEs can improve query organization, their performance impact compared to subqueries or temporary tables can vary. In some cases, especially with complex CTEs or when the ClickHouse optimizer struggles to find an optimal plan, temporary tables might offer better performance due to the ability to apply indexes and statistics.

ClickHouse's query optimizer may not always fully optimize CTEs, potentially leading to repeated computations if a CTE is referenced multiple times. Understanding the query plan using EXPLAIN PIPELINE can help identify and address such inefficiencies.
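For example, a quick way to check whether a CTE shows up more than once in the plan (query shape is illustrative):

```sql
-- If the CTE's scan appears multiple times in the pipeline output,
-- ClickHouse is recomputing it rather than reusing the result.
EXPLAIN PIPELINE
WITH purchases AS
(
    SELECT session, JSONExtractFloat(data, 'amount') AS amount
    FROM events_v2
    WHERE JSONExtractString(data, 'type') = 'purchase'
)
SELECT session, sum(amount)
FROM purchases
GROUP BY session;
```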

#Avoid refreshable recompute queues unless you really need them

Tracking sessions that “need” computation with counters and flags sounds clean. In reality, it's another system to keep warm and in sync. Ours silently bloated for weeks because of a broken unmarking path. It added pressure to our write path and complicated our observability. We're better off computing in place, on filtered windows, and letting TTLs clean up the rest.

#Use TTLs, caches, and filtered refreshes

We got the best results when we stopped trying to compute everything all the time. Instead, we focused on what's fresh. Live sessions flow into temporary tables with TTLs. Materialized views only scan a few minutes of data. And when we do need to backfill, we run controlled batch jobs on archived partitions. ClickHouse is great at streaming and batch—but mixing the two takes care.

#ReplacingMergeTree + minimal MVs = win

There's no magic here. ReplacingMergeTree gives us clean deduplication over time without needing to micromanage it. A few well-scoped materialized views give us the projections we need. No FINALs, no DISTINCTs, no bloated queues. Just a handful of tables that are easy to reason about and cheap to maintain.

We do use FINAL and DISTINCT, just where they're meant to be used - at read time, and only when querying the most recent data.

#Going Forward

We think there are still performance gains to be made. We didn't cover everything we did to get here - removing nullable fields, using materialized columns, explicitly setting low-cardinality columns, etc. It took a few false starts, but we're happy with where things landed for now.

If you're building something similar, hopefully this saves you a few steps, gives you some patterns to try, or at least gives you some clear examples of what not to do.

See anything we can optimize here, or something we missed? Reach out at x.com/replohq, and if you want to work on systems like this or think you could build it better, join our team at replo.app/careers.
