When we started scaling our analytics platform at Fractal, we hit the classic monolith wall. Our system needed to process 100,000+ events daily while maintaining sub-200ms response times. This is the story of how we built an event-driven architecture that met both targets.
Why Event-Driven?
Event-driven architecture offered us several critical advantages:
- Loose coupling between services — teams can develop and deploy independently
- Better fault isolation — one service failure doesn't cascade
- Horizontal scalability — add more consumers as load increases
- Real-time data processing — events are processed as they occur
- Audit trail — every event is logged and replayable
Our Tech Stack
Apache Kafka → Event streaming platform
FastAPI + Python → Microservices framework
MongoDB → Primary data store
Redis → Caching and rate limiting
Azure Kubernetes Service (AKS) → Container orchestration
Datadog → Observability
Architecture Overview
We designed a system with three main layers:
1. Ingestion Layer
FastAPI services that validate incoming events and publish them to Kafka. Each service handles a specific domain (user events, transaction events, system events).
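A minimal sketch of the validate-then-publish step at the edge. The field names and the `publish` stand-in are illustrative, not our actual schema or producer code:

```python
import json
import time
import uuid

# Illustrative required fields; the real schema is per-domain.
REQUIRED_FIELDS = {"event_type", "customer_id", "payload"}

def validate_event(raw: dict) -> dict:
    """Reject malformed events at the edge, before they reach Kafka."""
    missing = REQUIRED_FIELDS - raw.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    # Stamp ingestion metadata so downstream consumers can trace the event.
    return {
        **raw,
        "event_id": str(uuid.uuid4()),
        "ingested_at": time.time(),
    }

def publish(event: dict) -> bytes:
    """Stand-in for the Kafka producer call: serialize to JSON bytes."""
    return json.dumps(event).encode("utf-8")
```

Rejecting bad events here, rather than in the consumers, keeps poison messages out of the topics entirely.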
2. Processing Layer
Kafka consumers that transform, enrich, and route events. We use consumer groups to enable parallel processing and automatic rebalancing.
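Key-based partitioning is what lets consumer groups parallelize without losing per-customer ordering: the same key always maps to the same partition, so one consumer sees that customer's events in order. A sketch of the idea (Kafka's default partitioner uses murmur2; the CRC32 here is just a stand-in, and the partition count is illustrative):

```python
import zlib

NUM_PARTITIONS = 12  # illustrative partition count

def partition_for(customer_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Same customer -> same partition -> ordering preserved per customer,
    while different customers spread across the consumer group."""
    return zlib.crc32(customer_id.encode("utf-8")) % num_partitions

def enrich(event: dict) -> dict:
    """Example enrichment step: tag the event with its routing partition."""
    return {**event, "partition": partition_for(event["customer_id"])}
```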
3. Storage Layer
MongoDB for persistence with Redis caching for hot data. We use MongoDB's change streams to keep caches synchronized.
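The cache-sync pattern looks roughly like this: tail the change stream and mirror each change into the cache, so hot keys never go stale. A plain dict stands in for Redis here; the change documents follow MongoDB's change event shape (`operationType`, `documentKey`, `fullDocument`, the last of which requires the `updateLookup` option for updates):

```python
def apply_change(change: dict, cache: dict) -> None:
    """Mirror one MongoDB change-stream document into the cache."""
    key = change["documentKey"]["_id"]
    op = change["operationType"]
    if op in ("insert", "update", "replace"):
        cache[key] = change["fullDocument"]  # overwrite with latest state
    elif op == "delete":
        cache.pop(key, None)                 # evict rather than serve stale
```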
Achieving P99 Under 200ms
Getting sub-200ms P99 latency required several optimizations:
- Async I/O throughout — No blocking operations in the hot path
- Strategic Kafka partitioning — Events partitioned by customer ID for ordering guarantees
- MongoDB index optimization — Covered queries for common access patterns
- Connection pooling — Reuse connections to databases and Kafka
- Backpressure handling — Graceful degradation under load
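The backpressure idea in miniature: cap the number of in-flight events with a semaphore so bursts queue at the boundary instead of overwhelming downstream stores. This is a sketch of the pattern, not our production consumer; the cap and function names are illustrative:

```python
import asyncio

MAX_IN_FLIGHT = 4  # illustrative cap; tune against downstream capacity

async def run_with_backpressure(events: list) -> int:
    """Process events with at most MAX_IN_FLIGHT concurrently.
    Returns the peak concurrency observed (it never exceeds the cap)."""
    sem = asyncio.Semaphore(MAX_IN_FLIGHT)
    state = {"in_flight": 0, "peak": 0}

    async def handle(event):
        async with sem:  # waits here once the cap is reached
            state["in_flight"] += 1
            state["peak"] = max(state["peak"], state["in_flight"])
            await asyncio.sleep(0)  # stand-in for async I/O (Kafka, Mongo)
            state["in_flight"] -= 1

    await asyncio.gather(*(handle(e) for e in events))
    return state["peak"]
```

Under this scheme latency stays bounded during spikes because excess work waits at the semaphore rather than piling onto the database.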
P99 latency dropped from 800ms to 180ms after implementing proper backpressure handling and optimizing our Kafka consumer configuration.
Observability is Non-Negotiable
With 100K+ events flowing through the system daily, you can't debug issues without proper observability:
- Distributed tracing — Every event gets a correlation ID that follows it through all services
- Structured logging — JSON logs with consistent schema
- Metrics everywhere — Latency percentiles, throughput, error rates
- Alerting — PagerDuty integration for critical issues
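Correlation IDs and structured logging combine into something like the following sketch: the ID is minted once at ingestion, then every service stamps it onto every JSON log record it emits. Field names here are illustrative, not our actual log schema:

```python
import json
import time
import uuid

def new_correlation_id() -> str:
    """Minted once at ingestion, then copied onto every hop's logs."""
    return str(uuid.uuid4())

def log_line(service: str, msg: str, correlation_id: str, **fields) -> str:
    """Emit one JSON log record with a consistent schema. The key point is
    that every service logs the same correlation_id, so a single query can
    stitch together an event's full path through the system."""
    record = {
        "ts": time.time(),
        "service": service,
        "msg": msg,
        "correlation_id": correlation_id,
        **fields,
    }
    return json.dumps(record)
```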
Lessons Learned
- Start with idempotency — Design every consumer to handle duplicate events
- Dead letter queues are essential — Have a plan for events that can't be processed
- Schema evolution matters — Use Avro or similar for backward compatibility
- Test at scale early — Performance issues often only appear under load
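The first two lessons fit in one sketch: a consumer that skips duplicate deliveries and routes unprocessable events to a dead letter queue instead of retrying forever. The in-memory set and list stand in for a persistent dedup store and a real DLQ topic:

```python
processed_ids: set = set()    # in production: a persistent store
dead_letter_queue: list = []  # in production: a dedicated DLQ topic

def handle(event: dict) -> str:
    """Placeholder for the real processing logic."""
    if "payload" not in event:
        raise ValueError("malformed event")
    return "processed"

def consume(event: dict) -> str:
    """Idempotent consumer sketch: duplicates become no-ops, and poison
    events are dead-lettered rather than crashing the consumer."""
    event_id = event.get("event_id")
    if event_id is None:
        dead_letter_queue.append(event)  # no identity: can't dedup safely
        return "dead-lettered"
    if event_id in processed_ids:
        return "skipped"                 # at-least-once delivery duplicate
    try:
        result = handle(event)
    except Exception:
        dead_letter_queue.append(event)  # park it for offline inspection
        return "dead-lettered"
    processed_ids.add(event_id)
    return result
```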
Results
- Processing 100K+ events/day reliably
- P99 latency under 200ms
- 99.9% uptime over 12 months
- 30% reduction in MTTR due to better observability