[ Feb 06, 2026 ][ Matheus Gaudencio ][ 12 MIN ]

Building a Generic Analytics Solution: Lessons from Scale

When you're processing millions of documents with AI, you need to know what's happening. And fast.

The Problem: Observability Beyond Logs

At Enter, we build AI systems that process large volumes of complex documents and turn them into structured, useful insights. Our agents reason over heterogeneous inputs, extract meaning, and generate clear, actionable outputs that help teams make better decisions at scale.

Doing this well requires running AI workloads continuously and reliably. Each stage—model inference, document understanding, and analysis—produces signals we need to track: How much are we spending on LLM calls per customer? Which document types take the longest to process? Where are the bottlenecks?

We started with the obvious approaches. All of them failed.

Logs don't scale for analytics. Searching through logs for "all LLM calls for customer X in the last 7 days" is painful. Log aggregation tools like Elasticsearch or Loki are built for debugging specific incidents: finding a needle in a haystack. They're not designed for analytical queries that scan and aggregate millions of records.

Prometheus metrics are inflexible. Prometheus excels at system metrics (CPU, memory, request counts), but adding high-cardinality dimensions like document_id or customer_id quickly becomes problematic. You end up with millions of time series that blow up memory and slow down queries. And getting business stakeholders to query PromQL? Not happening.

OLTP databases aren't designed for this. We tried storing analytics data in our PostgreSQL database. The writes added latency to the hot path, analytical queries slowed down transactional workloads, and at scale we ran into WAL replication lag on read replicas. OLTP and OLAP workloads don't mix well.

Why OLTP and OLAP don't mix: OLTP databases (Online Transaction Processing — PostgreSQL, MySQL) store data row-by-row and optimize for fast single-row operations: inserts, updates, lookups by primary key. When you run an analytical query like "sum all costs grouped by customer for the last month," the database must scan every row, reading columns it doesn't need. OLAP databases (Online Analytical Processing — ClickHouse, BigQuery) store data column-by-column, so aggregating a single column scans only that column's data. The difference at scale is orders of magnitude in query time and resource usage.
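To make the difference concrete, this is the shape of query we care about (the table and column names here are illustrative, not our schema):

Analytical query example
-- "Sum all costs grouped by customer for the last month."
-- A row store reads every column of every matching row to answer this;
-- a column store reads only customer_id, cost_usd, and the timestamp column.
SELECT customer_id, sum(cost_usd) AS total_cost
FROM llm_calls
WHERE created_at >= now() - INTERVAL 30 DAY
GROUP BY customer_id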

We needed something different: high-throughput publishing that doesn't slow down services, with batch consumption into an OLAP database optimized for the queries we actually run.


The Insight: Schema-on-Read

We took inspiration from EAV-style (entity-attribute-value) approaches, specifically the pattern used by Cloudflare's Workers Analytics Engine. Their schema looks like this:

timestamp | index1 | blob1...blob20 | double1...double20

A timestamp, one index field, 20 string fields, and 20 numeric fields. No migrations. No coordination between teams. Ship events now, define the structure in your queries later.

Most analytics queries don't need schema enforcement. You're aggregating, filtering, grouping. The structure lives in convention, not database constraints. We adopted this pattern with one modification: instead of index1, we use dataset as our primary discriminator. Each logical metric type (llm_metrics, document_processing, ocr_pipeline) becomes a dataset. Same table, different semantics.

analytics/models.py
from datetime import datetime
from typing import Optional
import uuid  # uuid.uuid7 is available in Python 3.14+; older versions need a uuid7 backport

from pydantic import BaseModel, Field


class Analytics(BaseModel):
    id: str = Field(default_factory=lambda: str(uuid.uuid7()))
    dataset: str          # e.g., "llm_metrics"
    timestamp: datetime

    # 20 string fields for dimensions
    blob1: Optional[str] = None   # customer_id
    blob2: Optional[str] = None   # model_name
    blob3: Optional[str] = None   # document_type
    # ... blob4 through blob20

    # 20 numeric fields for measures
    double1: Optional[float] = None   # cost_usd
    double2: Optional[float] = None   # latency_ms
    double3: Optional[float] = None   # token_count
    # ... double4 through double20

The tradeoff is real: you lose type safety and self-documenting schemas. But teams can ship new metrics without waiting for anyone. No tickets, no migrations, no coordination. When you're iterating fast on AI pipelines, that velocity matters.


The Architecture

Three components: an event bus for decoupling, a batch consumer for efficiency, and an OLAP database for queries.

NATS JetStream: The Event Bus

We use NATS JetStream as our main event bus. Services publish events without waiting for acknowledgment. True fire-and-forget with sub-millisecond publish times. If NATS is temporarily unavailable, events are lost, and we're okay with that. Analytics data is valuable but not critical; we'd rather lose some events than add latency to document processing.

NATS persists events in durable streams until the consumer explicitly acknowledges them. The consumer pulls events at its own pace, which enables efficient batching on the consumption side.

Why fire-and-forget? In a request-response model, every analytics publish would add network round-trip latency to your hot path. With fire-and-forget, the publishing service continues immediately. The risk is losing events if NATS is down, but for analytics (not billing or audit logs), this tradeoff usually makes sense.

The publisher interface is minimal:

analytics/publisher.py
await publish_analytics_event(
    dataset="llm_metrics",
    blob1=customer_id,
    blob2=model_name,
    blob3=operation_type,
    double1=cost_usd,
    double2=tokens_input,
    double3=tokens_output,
    double4=latency_ms,
)

One function call. No connection management in the service, no batching logic, no retry handling.
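For context, here's a minimal sketch of what such a helper can look like with the nats-py client. The connection helper and subject constant are placeholders; our production implementation differs in the details:

Publisher internals (sketch)
import json
from datetime import datetime, timezone

async def publish_analytics_event(dataset: str, **fields) -> None:
    # Fire-and-forget: a core NATS publish returns without waiting for a
    # JetStream ack, so the hot path only pays for local serialization.
    event = {
        "dataset": dataset,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        **fields,
    }
    try:
        nc = await get_nats_connection()  # placeholder: shared connection helper
        await nc.publish(ANALYTICS_SUBJECT, json.dumps(event).encode())
    except Exception:
        # Analytics is best-effort: a failed publish must never break processing.
        pass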

MooseStack: Batch Consumption

Raw events in NATS need to reach ClickHouse efficiently. ClickHouse performs best with batch inserts of thousands of rows. Individual event inserts would overwhelm it and create excessive disk merges.

We use MooseStack to orchestrate the consumption. A scheduled workflow pulls batches of events, validates them, and inserts them into ClickHouse.

Why a custom workflow instead of direct streaming? MooseStack supports direct insertion from Kafka/Redpanda streams out of the box. But we already use NATS JetStream as our event bus across all services. Rather than running two streaming systems, we built a workflow that bridges NATS to ClickHouse. One event bus, simpler operations.

analytics/workflows.py
@workflow(schedule="@every 3m")
async def consume_analytics():
    # Pull up to 10k pending events from the durable JetStream consumer.
    events = await pull_batch(
        subject=ANALYTICS_SUBJECT,
        batch_size=10_000,
        timeout_seconds=120,
    )
    if not events:
        return

    # Validate against the wide-table model, then insert as a single batch.
    records = [Analytics(**e) for e in events]
    await insert_batch("analytics.Events", records)

    # Ack only after the insert succeeds, so failed batches are redelivered.
    await ack_batch(events)

The batch size (10,000) and schedule (3 minutes) are tunable based on your latency requirements. Larger batches improve ClickHouse throughput; more frequent runs shorten the lag before new events show up in queries.
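The pull and ack helpers are thin wrappers around a JetStream pull consumer. A rough sketch with nats-py, assuming a durable consumer named "analytics" (the helper names are placeholders):

Pull/ack helpers (sketch)
import json
from nats.errors import TimeoutError as NatsTimeoutError

async def pull_batch(subject: str, batch_size: int, timeout_seconds: int) -> list[dict]:
    nc = await get_nats_connection()  # placeholder: shared connection helper
    js = nc.jetstream()
    # In practice the pull subscription would be created once and reused.
    sub = await js.pull_subscribe(subject, durable="analytics")
    try:
        msgs = await sub.fetch(batch_size, timeout=timeout_seconds)
    except NatsTimeoutError:
        msgs = []  # nothing pending within the window
    # Carry the message handle along so it can be acked after a successful insert.
    # The extra "_msg" key is ignored by the Pydantic model (default extra="ignore").
    return [{**json.loads(m.data), "_msg": m} for m in msgs]

async def ack_batch(events: list[dict]) -> None:
    for e in events:
        await e["_msg"].ack()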

ClickHouse: The Query Engine

ClickHouse handles analytical queries. Its columnar storage means aggregating a single column (like summing costs) only reads that column's data, not entire rows. Queries that would take minutes in PostgreSQL return in milliseconds.

analytics/schema.sql
CREATE TABLE analytics.Events (
    id String,
    dataset String,
    timestamp DateTime64(3),
    blob1 Nullable(String),
    blob2 Nullable(String),
    -- ... blob3 through blob20
    double1 Nullable(Float64),
    double2 Nullable(Float64),
    -- ... double3 through double20
)
ENGINE = ReplicatedMergeTree(
    '/clickhouse/tables/{shard}/analytics',
    '{replica}'
)
PARTITION BY toYYYYMM(timestamp)
ORDER BY (dataset, timestamp, id)

Key design decisions:

Ordering by (dataset, timestamp, id): ClickHouse builds a sparse primary index based on the ORDER BY columns. When you filter by dataset and a time range, ClickHouse skips entire blocks of rows that don't match. This is where most query performance comes from: the database reads only relevant data blocks.

Monthly partitioning: Partitions allow dropping old data without rewriting the table (ALTER TABLE DROP PARTITION). They also enable partition pruning — queries filtering by time range skip entire partitions.

UUID7 for IDs: Time-ordered unique identifiers. Ensures uniqueness and makes IDs roughly chronological when inspecting raw data.


Wide Tables vs. Normalized: The Tradeoff

The 20 blob + 20 double schema is unconventional. A normalized alternative would be dedicated tables per metric type:

Normalized approach
CREATE TABLE llm_metrics (
    timestamp DateTime64(3),
    customer_id String,
    model_name String,
    cost_usd Float64,
    tokens_input UInt64,
    tokens_output UInt64
)
ENGINE = MergeTree
ORDER BY (customer_id, timestamp)

Normalized tables are self-documenting and type-safe. But they require coordination: every new metric type needs a migration, code review, and deployment. In a fast-moving team, that coordination cost adds up quickly.

The wide table trades schema clarity for organizational velocity. The schema documentation lives in the publisher code:

Self-documenting publishers
await publish_analytics_event(
    dataset="llm_metrics",
    blob1=customer_id,        # Customer identifier
    blob2=model_name,         # gpt-4, claude-3, etc.
    blob3=operation_type,     # completion, embedding
    double1=cost_usd,         # Cost in USD
    double2=tokens_input,     # Input token count
    double3=tokens_output,    # Output token count
)

Engineers can ship new metrics in minutes instead of waiting for infrastructure changes.
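One way to keep that convention honest is a thin wrapper per dataset, so the blob/double mapping lives in exactly one place. A sketch (the function name is ours to invent):

Per-dataset wrapper (sketch)
async def publish_llm_metrics(
    customer_id: str,
    model_name: str,
    operation_type: str,
    cost_usd: float,
    tokens_input: int,
    tokens_output: int,
) -> None:
    # The only place where named LLM metrics are mapped onto blob/double slots.
    await publish_analytics_event(
        dataset="llm_metrics",
        blob1=customer_id,
        blob2=model_name,
        blob3=operation_type,
        double1=cost_usd,
        double2=tokens_input,
        double3=tokens_output,
    )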


Visualization: Operations and Business

Data in ClickHouse serves two audiences with different needs.

Grafana serves operations teams with real-time dashboards showing usage, processing latency, and error rates. Alerts trigger when metrics exceed thresholds.

Grafana query example
SELECT
    toStartOfHour(timestamp) AS hour,
    blob2 AS model_name,
    sum(double1) AS total_cost
FROM analytics.Events
WHERE dataset = 'llm_metrics'
  AND timestamp > now() - INTERVAL 24 HOUR
GROUP BY hour, model_name
ORDER BY hour

Metabase serves business stakeholders who don't write SQL. Dashboards can be embedded directly in our product with JWT-based authentication, enabling customers to see their own analytics with proper row-level security.
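The embedding itself follows Metabase's standard signed-embedding flow: the backend signs a short-lived JWT containing the dashboard id and locked parameters, and the frontend loads the resulting URL in an iframe. A sketch with PyJWT (the dashboard id and parameter name are illustrative):

Embed URL generation (sketch)
import time
import jwt  # PyJWT

METABASE_SITE_URL = "https://metabase.example.com"  # illustrative
METABASE_SECRET_KEY = "..."                          # embedding secret from Metabase admin

def dashboard_embed_url(dashboard_id: int, customer_id: str) -> str:
    payload = {
        "resource": {"dashboard": dashboard_id},
        # Locked parameter: the customer can only see their own rows.
        "params": {"customer_id": customer_id},
        "exp": int(time.time()) + 600,  # token valid for 10 minutes
    }
    token = jwt.encode(payload, METABASE_SECRET_KEY, algorithm="HS256")
    return f"{METABASE_SITE_URL}/embed/dashboard/{token}#bordered=true&titled=true"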


Technical Considerations

Schema Evolution

What happens when 20 fields aren't enough? We haven't hit this limit yet, but our escape hatches are:

  1. Add a metadata_json column for overflow fields, as sketched below (lose some query performance but gain flexibility)
  2. Create a v2 table with more fields and migrate
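Option 1 would look roughly like this; the column and key names are illustrative:

Overflow column (sketch)
ALTER TABLE analytics.Events ADD COLUMN metadata_json Nullable(String);

-- Overflow fields get extracted at query time instead of occupying a slot.
SELECT
    JSONExtractString(metadata_json, 'region') AS region,
    count() AS events
FROM analytics.Events
WHERE dataset = 'llm_metrics'
GROUP BY region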

The 20-field limit is a real constraint. We chose it because it matches Cloudflare's design and has been sufficient for their scale. If you're tracking hundreds of distinct dimensions per event, this pattern might not fit.

Query Performance at Scale

As data grows, complex aggregations slow down. ClickHouse materialized views can pre-aggregate data on insert, for example by maintaining hourly rollups of metrics that are frequently queried at daily granularity.

We haven't needed materialized views yet at 2.5M events/day, but the architecture supports adding them without changing the publishing side.
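If we get there, a rollup could look something like this; the view name and grain are illustrative:

Hourly rollup (sketch)
-- In a replicated cluster this would use ReplicatedSummingMergeTree.
CREATE MATERIALIZED VIEW analytics.EventsHourly
ENGINE = SummingMergeTree
PARTITION BY toYYYYMM(hour)
ORDER BY (dataset, blob1, hour)
AS
SELECT
    dataset,
    blob1,                             -- customer_id by convention
    toStartOfHour(timestamp) AS hour,
    sum(double1) AS total_cost
FROM analytics.Events
GROUP BY dataset, blob1, hour

-- Queries still need sum(total_cost): background merges collapse rows eventually,
-- not at insert time.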


The Result

Queries over months of data return in sub-second times. Engineers ship new metrics without tickets or migrations.

Another benefit: the simple schema makes AI-assisted development practical. The well-documented publisher function is easy for LLMs to understand and generate code against. We migrated metrics from a legacy system in under a day using Claude. It read the existing code, understood what data was being tracked, and generated the corresponding publish_analytics_event calls with appropriate blob/double mappings.

Grafana's MCP integration with ClickHouse has a similar effect. AI assistants can generate valid ClickHouse queries and help debug dashboards, which speeds up iteration on monitoring.

Wide tables require discipline. You need conventions for which blob means what in each dataset, and errors surface at query time rather than write time. But for high-throughput analytics where shipping speed matters, this architecture has worked well for us. We can now answer questions about our AI pipelines that were previously invisible.


Technical Stack

Component | Technology | Purpose
Event Bus | NATS JetStream | Low-latency, durable event streaming
ETL | MooseStack | Batch consumption, workflow orchestration
OLAP | ClickHouse | High-performance analytical queries
Operations | Grafana | Real-time dashboards, alerting
Business Intelligence | Metabase | Self-service analytics, embedded dashboards

The principles transfer: decouple publishing from storage, batch for efficiency, use schema-on-read for flexibility. The specific technologies matter less than the pattern.