Optimizing 100 Billion Clickhouse Events

One of the newer products we've launched at Replo is Replo Analytics. We want to enable Replo customers to sell anything online, and part of selling online is analyzing the performance of your campaigns, determining conversion rates based on per-session purchase data, etc. Since Replo automatically deploys campaign landing pages, we can easily track this event data and provide it for our customers with way less setup than something like Google Analytics.

We use ClickHouse to store this data, and throughout the development of this product we had to implement several different architectures to account for storage growth while keeping queries performant.

#Storing Analytics Data

At initial launch of Replo Analytics, we had spent 3-4 months prototyping and built what felt like a solid analytics pipeline on top of ClickHouse. Our events table sat at the center, steadily collecting frontend events from Replo-powered storefronts. The data fidelity was high. The queries returned what we expected. Things were working and life was good.

Here's what our events table looked like:

CREATE TABLE IF NOT EXISTS default.events (
    id UUID,
    name String,
    createdAt DateTime,
    receivedAt DateTime,
    sessionId UUID,
    -- ...
    payload Nested(
        key String,
        value String
    ),
) ENGINE = MergeTree()
PARTITION BY toStartOfInterval(createdAt, INTERVAL 6 hours)
ORDER BY (createdAt, sessionId, urlHost, urlPath);

If you're experienced with ClickHouse, you probably already see the problems, but back then we were new to ClickHouse and column-oriented databases, so it felt reasonable to us.

But as usage grew, so did the number of events, the sessions behind them, and the complexity of the metrics we wanted to compute. Query times slowed - slow became unreliable, then unreliable became unacceptable. Queries for customers tracking 10M+ pageviews per month timed out.

We tried tuning, we tried optimizing queries, we tried adding more and more tables to try to split and speed things up, but eventually we realized it wasn't something we could patch our way out of. We needed a fresh start, a new model that wasn't just correct, but fast. Thus began our journey into what we now call internally “events data model V2”.

If you're working with ClickHouse at any meaningful scale or trying to make analytics feel instant for real users this story might be familiar (or at least entertaining). This is the story of how we built the new data model and then carefully migrated to it, then migrated it again.

#`events_v2` and `events_computed`

As more sessions rolled in and our customer base grew, queries that once took hundreds of milliseconds started creeping into seconds, then tens of seconds. Eventually we reached a point where running a simple COUNT() across recent sessions required a meditation break.

We started noticing a few other patterns:

The same per-session purchase data was being recomputed over and over again. We didn't have any caching layer, so every query was a cold query.
Computed metrics were derived on the fly. This added tons of latency and made our analytics dashboards feel sluggish at best.

So, we began exploring what would become our V2. The goal wasn't just faster queries - we wanted to rethink the whole approach to computing metrics from raw events.

At the heart of this would be precomputation. Not in the generic, “yeah we'll batch some things” sense, but full, session-level attribution logic baked into the pipeline, computed once, and never recomputed unless necessary.

CREATE TABLE events_v2 (
  namespace   String,
  session     UUID,
  date        DateTime64,
  id          UUID,
  name        String,
  data        String
)
ENGINE = ReplacingMergeTree()
PARTITION BY (
  toStartOfInterval(date, INTERVAL 2 MONTH),
  namespace
)
ORDER BY (namespace, date, session, id);

This was the first version of our events_v2 table. It reflected lessons we had started to internalize about ClickHouse's performance characteristics. namespace became a central column to establish ownership boundaries across tenants - we used Replo project ids for this. It was relatively low in cardinality, which allowed ClickHouse to skip large portions of data when queries were scoped by customer.

We optimized storage and query performance through careful partitioning, ordering, and deduplication strategies.

We partitioned the table by both time and namespace to align with common query patterns that filter on these fields. Whenever rows are inserted, ClickHouse creates a new data part for each unique partition key value, rather than combining all inserted rows into a single part. This approach reduces unnecessary scans and makes queries more efficient, because ClickHouse scans by loading partitions into memory.

Within each partition, events were sorted by namespace, then session, followed by date and id. This ordering enables queries to read data in a session-oriented sequence, while ensuring uniqueness guarantees through the id field.

To handle duplicate events, we used a Replacing MergeTree. This ClickHouse Table Engine automatically removes duplicate rows that share the same sorting key, keeping only the most recent version.

Finally, the event payload itself is stored in a single data column as JSON. This design gives us flexibility across event types without needing to alter the table schema as events evolve.

We also made the important decision to pre-compute some metrics with events_computed :

CREATE TABLE events_computed (
  namespace       String,
  session         UUID,
  date            DateTime64,
  id              UUID,
  name            String,
  data            String,
  computed        Nullable(String),
  computedVersion UInt8,
  computedDate    Nullable(DateTime64)
)
ENGINE = ReplacingMergeTree()
PARTITION BY (
  toStartOfInterval(date, INTERVAL 2 MONTH),
  namespace
)
ORDER BY (namespace, date, session, id);

Close readers might realize this definition is almost the same as events_v2 - it is, but with one key difference: it includes computed values. The idea was that we could take common metrics, like per-session purchase amounts, and pre-compute them once to save query performance time.

So how do those values get populated? That's the 100 billion event question. This is where our team started slipping into the rabbit hole of figuring out when and what to recompute, trying every kind of materialized view ClickHouse supports, building cache tables, setting indexes, and eventually rethinking the entire strategy from the ground up.

#Pre-computing Event Columns Every Minute

Let's look at a common case of per-session purchase data. In Replo Analytics, we want to compute number of purchases per session, and the total amount of those purchases.

Since most events aren't purchases, we started by copying everything from events_v2 into events_computed with a null value in the computed column. This happened through a simple materialized view:

CREATE MATERIALIZED VIEW events_computed_default_mv
TO events_computed
AS
	SELECT
	  namespace,
	  name,
	  session,
	  date,
	  id,
	  data,
	  CAST(NULL AS Nullable(String)) AS computed,
	  CAST(NULL AS Nullable(DateTime64)) AS computedDate
	FROM events_v2
	WHERE name != 'purchase';

Next, we needed a way to figure out which sessions required computation. To track this, we created a lightweight summing table called sessions_requiring_computation. The idea was simple: if a purchase event came in, we marked the session with a +1. Once we processed it, we subtracted one to "unmark" it.

CREATE TABLE sessions_requiring_computation (
  namespace           String,
  session             UUID,
  insertDate          DateTime64,
  requiresComputation Int32
)
ENGINE = SummingMergeTree()
PARTITION BY (toStartOfInterval(insertDate, INTERVAL 6 HOUR), namespace)
ORDER BY (namespace, session);

Two materialzed views handled the bookkeeping: one to increment when a purchase was seen, and one to decrement when that purchase was processed.

-- Mark
CREATE MATERIALIZED VIEW sessions_requiring_computation_mark_mv
TO sessions_requiring_computation AS
	SELECT namespace, session, now() AS insertDate, 1 AS requiresComputation
	FROM events_v2
	WHERE name = 'purchase';

-- Unmark
CREATE MATERIALIZED VIEW sessions_requiring_computation_unmark_mv
TO sessions_requiring_computation AS
	SELECT namespace, session, now() AS insertDate, -1 AS requiresComputation
	FROM events_computed
	WHERE name = 'purchase';

So far, so good, but we still needed to actually compute the metrics and write them to events_computed. That's where events_computed_fn came in, a view that scanned sessions_requiring_computation, found active rows, joined them with raw events, did the aggregation using OVER (PARTITION BY session), and emitted a JSON blob. Note that we needed to limit the amount of possible sessions we were computing because we started running into memory issues if we tried to go too big.

CREATE VIEW events_computed_fn AS
WITH
  200000 AS sessionLimit,
  (now() - INTERVAL 4 YEAR) AS startDate,
  (now() + INTERVAL 1 DAY) AS endDate,
  1 AS currentComputedVersion,

  computable_sessions AS (
    SELECT namespace, session, SUM(requiresComputation) AS requiresComputation
    FROM sessions_requiring_computation
    WHERE insertDate >= startDate AND insertDate < endDate
    GROUP BY namespace, session
    HAVING requiresComputation > 0
    LIMIT sessionLimit
  ),

  events_to_compute AS (
    SELECT e.namespace, e.session, e.date, e.id, e.name, e.data
    FROM events_v2 e
    INNER JOIN computable_sessions s USING (namespace, session)
  ),

  computed_events AS (
    SELECT
      namespace, session, date, id, name, data,
      COUNT(IF(name = 'purchase', 1, NULL)) OVER (PARTITION BY session)
        AS sessionPurchaseCount,
      SUM(IF(name = 'purchase', JSONExtract(data, 'payload', 'amount', 'Float64'), 0)) OVER (PARTITION BY session)
        AS sessionPurchaseSum
    FROM events_to_compute
    GROUP BY namespace, session, date, id, name, data
  )

SELECT
  namespace, session, date, id, name, data,
  toJSONString(map(
    'session', map(
      'purchase', map(
        'count', toFloat64(sessionPurchaseCount),
        'sum', toFloat64(sessionPurchaseSum)
      )
    )
  )) AS computed,
  now() AS computedDate
FROM computed_events;

We ran this every minute using another materialized view:

CREATE MATERIALIZED VIEW events_computed_mv
REFRESH EVERY 1 MINUTE
APPEND TO events_computed AS
	SELECT * FROM events_computed_fn;

And to track how it was doing, we added an audit table that recorded how many sessions still needed computation every minute:

CREATE TABLE sessions_requiring_computation_audit (
  count UInt64,
  date DateTime64
) ENGINE = MergeTree()
ORDER BY date;

CREATE MATERIALIZED VIEW sessions_requiring_computation_audit_mv
REFRESH EVERY 1 MINUTE
APPEND TO sessions_requiring_computation_audit AS
	WITH
	  computable_sessions AS (
	    SELECT namespace, session
	    FROM sessions_requiring_computation
	    WHERE insertDate >= now() - INTERVAL 4 YEAR
		    AND insertDate < now() + INTERVAL 1 DAY
	    GROUP BY namespace, session
	    HAVING SUM(requiresComputation) > 0
	  )
	SELECT count() AS count, now64() AS date
	FROM computable_sessions;

This was a lot of steps, but it did work - our performance testing showed our queries were 4-10x faster, so we migrated our ~60 billion events.

For the first time.

#Expanding, Collapsing, and Caching

Replo Analytics shipped with support for first-touch and last-touch purchase attribution, but a major request from customers was to attribute fractionally. Instead of attributing 100% of revenue to a given page for being the first or last page visited, give each page visited a fraction of the attribution based on the overall pages visited, to more accurately reflect that customer visits have multiple pages and they might all contribute to a purchase in combination.

The logic was straightforward - in order to figure how how much a pageview contributed to revenue of a session, divide the total purchase amount by the number of meaningful pageviews. If you had five pageviews and one purchase, each pageview was worth a fifth of the purchase's amount. Multiply that out and you get fractional revenue.

As started testing this, we noticed the calculations seemed off. We eventually traced it to duplicate events that threw off the calculations. While both events_v2 and events_computed used ReplacingMergeTree, they are only eventually deduplicated. You cannot rely on them being perfectly clean in real time, especially not for recent events (and we were only pre-computing fields for recent events).

Fortunately, ClickHouse has tools to help with this: FINAL and DISTINCT.

CREATE VIEW events_computed_fn
-- ...
events_to_compute AS (
  SELECT DISTINCT ON (e.namespace, e.subject, e.date, e.id) -- New DISTINCT ON specifier
  -- ...
  FROM events_v2 e FINAL -- New FINAL specifier
  INNER JOIN computable_subjects s ON s.namespace = e.namespace AND s.session = e.session
),
session_metrics AS (
  SELECT
    namespace,
    session,
    COUNT(IF(
      name == 'page_view',
      NULL
    )) AS attributablePageViewCount
  FROM events_to_compute
  GROUP BY
    namespace,
    session
),
computed_events AS (
  SELECT
    ... AS sessionPurchaseCount,
    ... AS sessionPurchaseSum,
    ... AS sessionPurchaseFraction,
    ... AS sessionPurchaseFractionalSum,
    -- ...
  FROM events_to_compute e
  JOIN session_metrics s ON (
    e.namespace = s.namespace AND e.session = s.session
  )
  GROUP BY
    e.namespace,
    -- ...
    e.data,
    s.attributablePageViewCount
)
SELECT
  -- ...
  toJSONString(map(
    'session', map(
      'purchase', map(
        'count', toFloat64(sessionPurchaseCount),
        'sum', toFloat64(sessionPurchaseSum),
        'fraction', toFloat64(sessionPurchaseFraction),
        'fractionalSum', toFloat64(sessionPurchaseFractionalSum)
      )
    )
  )) AS computed,
FROM computed_events;

This was the version of events_computed_fn that tipped the whole system over. At first it worked, and because of the flaws of the original unduplicated calculations, we had to re-migrate all of the data to trigger a recalculation. That's when the symptoms began.

Memory usage crept up, and eventually refreshes began to miss their schedule entirely. At the same time, we noticed something else: the sessions_requiring_computation queue wasn't shrinking. It was meant to track work that still needed processing by incrementing for a session when a purchase happened, and decrementing after the computation ran. If a session had a value greater than zero, we knew it still needed work. But the unmarking wasn't happening.

#Large Scans Result in Timeouts

The view used FINAL and DISTINCT ON to collapse duplicates, which required scanning large partitions of the events_v2 table. That made refreshes expensive. We realized that when memory usage spiked during one of these scans, it began silently failing.

No successful refresh meant no -1 was written to the sessions_requiring_computation table. So even after a session had been processed, it still showed up as needing work. The queue grew and grew, filled with sessions that had already been handled, but never marked as such. CPU usage climbed, since every refresh now required more and more computation. Wiew refreshes became unstable, and it started to look like the system was falling behind, when in reality it was doing the same work over and over.

At this point, everything was tangled. The function relied on a join between events_v2 and the computation queue to narrow the working set, but now that queue was bloated with phantom entries. Every minute, we scanned massive partitions, joined millions of rows, ran layered CTEs, and waited for ClickHouse to spill to disk or crash. Sometimes it finished, sometimes it didn't.

#What about a cache?

We tried the usual fixes - limit the number of sessions per run, shrink the time window, filter columns, etc. This helped a little, but the core problem wasn't a tuning issue. We were doing too much work, too often, with too many moving parts on the read path.

Our core issue was we needed to deduplicate recent events so our calculations were accurate, but we needed to do it without using FINAL or DISTINCT ON against the entirety of our events_v2 or events_computed tables. So what if we created smaller “cache” tables that only held the most recent events, something like events_v2_recent_events and events_computed_recently ?

We prototyped this, and technically got it working. Performance was great, the numbers checked out, and the views were refreshing again. But by then, the whole system had become a maze of CTEs, materialized views, audit tables, and fallback logic - each one solving a problem introduced by the last.

#Long Live Simplicity

Eventually, we took a step back and made the call: no more marking, no more unmarking, no more session queues to audit and debug. The whole computation pipeline had become too tightly coupled to an increasingly fragile set of assumptions. We needed to simplify. That's when we landed on what we now call the flusher model.

Instead of trying to track what needs recomputing and when, we flipped the problem around. We asked a simpler question: What if we just recompute everything for sessions that had recent purchase activity? Not all time, not all events, just the last 30–40 minutes of “live” sessions. If it had a purchase event recently, it was eligible for recomputation. If not, we ignored it.

This was the table that powered that shift:

CREATE TABLE live_session_events_trigger (
    namespace String,
    session UUID,
    date DateTime64(3),
    id UUID,
    name String,
    data String
)
ENGINE = MergeTree
ORDER BY (namespace, session, date, id)
TTL toDateTime(date) + INTERVAL 40 MINUTE;

Every minute, a materialized view populated this table with new purchase events from events_v2, giving us a rolling window of active sessions to recompute.

CREATE MATERIALIZED VIEW live_session_events_trigger_mv
TO live_session_events_trigger AS
SELECT
    namespace,
    session,
    date,
    id,
    name,
    data
    FROM events_v2
    WHERE date >= (now() - INTERVAL 40 MINUTES)
        AND name = 'purchase';

Then came the flusher. Every minute, it took the current window of active sessions, joined them against recent events, ran the attribution math, and flushed the computed results directly into events_computed.

CREATE MATERIALIZED VIEW session_flusher_mv
REFRESH EVERY 1 MINUTE
APPEND TO events_computed AS
SELECT
    e.namespace as namespace,
    -- ...
    e.data as data,
    CONCAT(
        '{',
            '"session":{',
            '"purchase":{',
                '"count":', toString(IFNULL(m.sessionPurchaseCount, 0)), ',',
                '"sum":', toString(IFNULL(m.sessionPurchaseSum, 0)), ',',
                '"fractionalSum":', toString(m.attributablePageViewCount > 0 ? m.sessionPurchaseSum * (1.0 / m.attributablePageViewCount) : 0),
                    -- ...
            '}',
            '}',
        '}'
    ) AS computed,
    --- ...
    now64(3) AS computedDate,
    now64(3) AS insertDate
    FROM live_session_events_trigger AS t
    LEFT JOIN live_session_events AS e USING (namespace, session)
    LEFT JOIN (
    SELECT
        namespace,
        session,
        COUNTIf(name = 'page_view' AND ...) AS attributablePageViewCount,
        COUNTIf(name = 'purchase') AS sessionPurchaseCount,
        SUMIf(amount, name = 'purchase') AS sessionPurchaseSum,
        anyIf(currencyCode, name = 'purchase') AS currencyCode
    FROM live_session_events
    GROUP BY namespace, session
    ) AS session_metrics USING (namespace, session);

The flusher was simpler, cheaper, and most importantly, predictable. It removed the guesswork. No more tracking computation queues, no more fighting deduplication edge cases, no more unmarking logic silently failing. Just compute what's fresh, and move on.

In hindsight, this was the version we were always working toward. We just had to build and break the complicated one first. And so we migrated and recomputed one final time, now at ~80 billion events.

#What about the rest?

Not everything fits neatly into a rolling 40-minute window. But instead of trying to build a catch-all pipeline that solved every edge case in real time, we took a more pragmatic route. For any events that fell outside the live window, we built separate batch jobs. They could run on-demand, whenever we needed to backfill or reprocess older data, without complicating the main system.

#Key Takeaways

We didn't set out to become experts in ClickHouse, materialized views, or session-based attribution pipelines, and we still aren't. We just wanted fast, reliable metrics. But somewhere along the way - after CTE recursion, queue bloat, memory blowups, and view refresh chaos - we picked up a few things we feel are worth keeping.

#When to use CTEs, and when to walk away

While CTEs can improve query organization, their performance impact compared to subqueries or temporary tables can vary. In some cases, especially with complex CTEs or when the ClickHouse optimizer struggles to find an optimal plan, temporary tables might offer better performance due to the ability to apply indexes and statistics.

ClickHouse's query optimizer may not always fully optimize CTEs, potentially leading to repeated computations if a CTE is referenced multiple times. Understanding the query plan using EXPLAIN PIPELINE can help identify and address such inefficiencies.

#Avoid refreshable recompute queues unless you really need them

Tracking sessions that “need” computation with counters and flags sounds clean. In reality, it's another system to keep warm and in sync. Ours silently bloated for weeks because of a broken unmarking path. It added pressure to our write path and complicated our observability. We're better off computing in place, on filtered windows, and letting TTLs clean up the rest.

#Use TTLs, caches, and filtered refreshes

We got the best results when we stopped trying to compute everything all the time. Instead, we focused on what's fresh. Live sessions flow into temporary tables with TTLs. Materialized views only scan a few minutes of data. And when we do need to backfill, we run controlled batch jobs on archived partitions. ClickHouse is great at streaming and batch—but mixing the two takes care.

#ReplacingMergeTree + minimal MVs = win

There's no magic here. ReplacingMergeTree gives us clean deduplication over time without needing to micromanage it. A few well-scoped materialized views give us the projections we need. No FINALs, no DISTINCTs, no bloated queues. Just a handful of tables that are easy to reason about and cheap to maintain.

We do use FINAL and DISTINCT, just where they're meant to be used - at read time, and only when querying the most recent data.

#Going Forward

We think there are still performance gains to be made. We didn't cover everything we did to get here - removing nullable fields, using materialized columns, explicitly setting low-cardinality columns, etc. It took a few false starts, but we're happy with where things landed for now.

If you're building something similar, hopefully this saves you a few steps, gives you some patterns to try, or at least gives you some clear examples of what not to do.

Anything we can optimize here, or seeing something we didn't? Reach out at x.com/replohq, and if you want to work on systems like this or think you could build it better, join our team at replo.app/careers.