When we started scaling our analytics platform at Fractal, we hit the classic monolith wall. Our system needed to process 100,000+ events daily while maintaining sub-200ms response times. This is the story of how we built an event-driven architecture that met both targets.
Why Event-Driven?
Event-driven architecture offered us several critical advantages:
- Loose coupling between services — teams can develop and deploy independently
- Better fault isolation — one service failure doesn't cascade
- Horizontal scalability — add more consumers as load increases
- Real-time data processing — events are processed as they occur
- Audit trail — every event is logged and replayable
Our Tech Stack
Apache Kafka → Event streaming platform
FastAPI + Python → Microservices framework
MongoDB → Primary data store
Redis → Caching and rate limiting
Azure Kubernetes Service (AKS) → Container orchestration
Datadog → Observability
Architecture Overview
We designed a system with three main layers:
1. Ingestion Layer
FastAPI services that validate incoming events and publish them to Kafka. Each service handles a specific domain (user events, transaction events, system events).
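A minimal sketch of the validate-then-publish step at the edge. The field names and the `publish` stand-in are illustrative, not our actual schema or producer code:

```python
import json
import time
import uuid

# Illustrative required fields; the real schema is per-domain.
REQUIRED_FIELDS = {"event_type", "customer_id", "payload"}

def validate_event(raw: dict) -> dict:
    """Reject malformed events at the edge, before they reach Kafka."""
    missing = REQUIRED_FIELDS - raw.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    # Stamp ingestion metadata so downstream consumers can trace the event.
    return {
        **raw,
        "event_id": str(uuid.uuid4()),
        "ingested_at": time.time(),
    }

def publish(event: dict) -> bytes:
    """Stand-in for the Kafka producer call: serialize to JSON bytes."""
    return json.dumps(event).encode("utf-8")
```

Rejecting bad events here, rather than in the consumers, keeps poison messages out of the topics entirely.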
2. Processing Layer
Kafka consumers that transform, enrich, and route events. We use consumer groups to enable parallel processing and automatic rebalancing.
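Key-based partitioning is what lets consumer groups parallelize without losing per-customer ordering: the same key always maps to the same partition, so one consumer sees that customer's events in order. A sketch of the idea (Kafka's default partitioner uses murmur2; the CRC32 here is just a stand-in, and the partition count is illustrative):

```python
import zlib

NUM_PARTITIONS = 12  # illustrative partition count

def partition_for(customer_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Same customer -> same partition -> ordering preserved per customer,
    while different customers spread across the consumer group."""
    return zlib.crc32(customer_id.encode("utf-8")) % num_partitions

def enrich(event: dict) -> dict:
    """Example enrichment step: tag the event with its routing partition."""
    return {**event, "partition": partition_for(event["customer_id"])}
```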
3. Storage Layer
MongoDB for persistence with Redis caching for hot data. We use MongoDB's change streams to keep caches synchronized.
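The cache-sync pattern looks roughly like this: tail the change stream and mirror each change into the cache, so hot keys never go stale. A plain dict stands in for Redis here; the change documents follow MongoDB's change event shape (`operationType`, `documentKey`, `fullDocument`, the last of which requires the `updateLookup` option for updates):

```python
def apply_change(change: dict, cache: dict) -> None:
    """Mirror one MongoDB change-stream document into the cache."""
    key = change["documentKey"]["_id"]
    op = change["operationType"]
    if op in ("insert", "update", "replace"):
        cache[key] = change["fullDocument"]  # overwrite with latest state
    elif op == "delete":
        cache.pop(key, None)                 # evict rather than serve stale
```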
Achieving P99 Under 200ms
Getting sub-200ms P99 latency required several optimizations:
- Async I/O throughout — No blocking operations in the hot path
- Strategic Kafka partitioning — Events partitioned by customer ID for ordering guarantees
- MongoDB index optimization — Covered queries for common access patterns
- Connection pooling — Reuse connections to databases and Kafka
- Backpressure handling — Graceful degradation under load
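The backpressure idea in miniature: cap the number of in-flight events with a semaphore so bursts queue at the boundary instead of overwhelming downstream stores. This is a sketch of the pattern, not our production consumer; the cap and function names are illustrative:

```python
import asyncio

MAX_IN_FLIGHT = 4  # illustrative cap; tune against downstream capacity

async def run_with_backpressure(events: list) -> int:
    """Process events with at most MAX_IN_FLIGHT concurrently.
    Returns the peak concurrency observed (it never exceeds the cap)."""
    sem = asyncio.Semaphore(MAX_IN_FLIGHT)
    state = {"in_flight": 0, "peak": 0}

    async def handle(event):
        async with sem:  # waits here once the cap is reached
            state["in_flight"] += 1
            state["peak"] = max(state["peak"], state["in_flight"])
            await asyncio.sleep(0)  # stand-in for async I/O (Kafka, Mongo)
            state["in_flight"] -= 1

    await asyncio.gather(*(handle(e) for e in events))
    return state["peak"]
```

Under this scheme latency stays bounded during spikes because excess work waits at the semaphore rather than piling onto the database.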
P99 latency dropped from 800ms to 180ms after implementing proper backpressure handling and optimizing our Kafka consumer configuration.
Observability is Non-Negotiable
With 100K+ events flowing through the system daily, you can't debug issues without proper observability:
- Distributed tracing — Every event gets a correlation ID that follows it through all services
- Structured logging — JSON logs with consistent schema
- Metrics everywhere — Latency percentiles, throughput, error rates
- Alerting — PagerDuty integration for critical issues
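Correlation IDs and structured logging combine into something like the following sketch: the ID is minted once at ingestion, then every service stamps it onto every JSON log record it emits. Field names here are illustrative, not our actual log schema:

```python
import json
import time
import uuid

def new_correlation_id() -> str:
    """Minted once at ingestion, then copied onto every hop's logs."""
    return str(uuid.uuid4())

def log_line(service: str, msg: str, correlation_id: str, **fields) -> str:
    """Emit one JSON log record with a consistent schema. The key point is
    that every service logs the same correlation_id, so a single query can
    stitch together an event's full path through the system."""
    record = {
        "ts": time.time(),
        "service": service,
        "msg": msg,
        "correlation_id": correlation_id,
        **fields,
    }
    return json.dumps(record)
```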
Lessons Learned
- Start with idempotency — Design every consumer to handle duplicate events
- Dead letter queues are essential — Have a plan for events that can't be processed
- Schema evolution matters — Use Avro or similar for backward compatibility
- Test at scale early — Performance issues often only appear under load
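The first two lessons fit in one sketch: a consumer that skips duplicate deliveries and routes unprocessable events to a dead letter queue instead of retrying forever. The in-memory set and list stand in for a persistent dedup store and a real DLQ topic:

```python
processed_ids: set = set()    # in production: a persistent store
dead_letter_queue: list = []  # in production: a dedicated DLQ topic

def handle(event: dict) -> str:
    """Placeholder for the real processing logic."""
    if "payload" not in event:
        raise ValueError("malformed event")
    return "processed"

def consume(event: dict) -> str:
    """Idempotent consumer sketch: duplicates become no-ops, and poison
    events are dead-lettered rather than crashing the consumer."""
    event_id = event.get("event_id")
    if event_id is None:
        dead_letter_queue.append(event)  # no identity: can't dedup safely
        return "dead-lettered"
    if event_id in processed_ids:
        return "skipped"                 # at-least-once delivery duplicate
    try:
        result = handle(event)
    except Exception:
        dead_letter_queue.append(event)  # park it for offline inspection
        return "dead-lettered"
    processed_ids.add(event_id)
    return result
```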
Results
- Processing 100K+ events/day reliably
- P99 latency under 200ms
- 99.9% uptime over 12 months
- 30% reduction in MTTR due to better observability