Real-time data processing sounds simple in theory. Events arrive, you process them, you emit results. In practice, distributed stream processing is one of the hardest problems in systems engineering. After building several Kafka-based pipelines in production, I've learned that the difference between a system that works in a demo and one that survives production is understanding the failure modes.
Our first Kafka deployment processed 10,000 events per second with no issues. Then we hit 50,000 events per second and everything fell apart. Consumer lag spiked to millions of messages. Rebalances caused 30-second processing gaps. Duplicate events corrupted downstream databases. We spent three weeks debugging, rewriting, and learning why stream processing is fundamentally different from request-response architectures.
This is what I learned about building stream processing systems that actually work.
Why Batch Processing Isn't Enough
Before Kafka, we processed data in hourly batches. Every hour, a cron job queried the database for new records, processed them, and wrote results back. This worked until:
Problem 1: Latency. Users wanted real-time updates. Hourly batches meant decisions were made on stale data. By the time we detected fraud, the transaction had already completed.
Problem 2: Scale. At 50M events per day, our hourly batches processed 2M records at once. Database queries timed out. ETL jobs took 90 minutes, overlapping with the next batch.
Problem 3: Reprocessing. When logic changed, we had to reprocess historical data. Batch jobs couldn't handle this efficiently. Backfilling months of data took weeks.
Streaming solves these issues:
- Latency: Events are processed within milliseconds of arrival
- Scale: Processing is distributed across consumers, not a single batch job
- Reprocessing: Kafka retains event history, making backfills trivial
But streaming introduces new complexity: ordering, fault tolerance, and delivery guarantees.
Kafka Fundamentals: Topics, Partitions, and Consumer Groups
Kafka organizes data into topics. A topic is a logical stream of events (e.g., "user-clicks", "transactions", "sensor-readings"). Topics are divided into partitions for parallelism.
Here's the architecture:
Topic: user-clicks (3 partitions)
  Partition 0: [event1, event4, event7, ...]
  Partition 1: [event2, event5, event8, ...]
  Partition 2: [event3, event6, event9, ...]

Consumer Group: click-processors (3 consumers)
  Consumer A → reads Partition 0
  Consumer B → reads Partition 1
  Consumer C → reads Partition 2
Key principles:
Ordering: Events in the same partition are strictly ordered. Events across partitions are not. If you need total ordering, use one partition (but lose parallelism).
Partitioning: Events are assigned to partitions by key. Events with the same key go to the same partition, preserving per-key ordering.
Consumer groups: Multiple consumers in the same group share partition consumption. Each partition is read by exactly one consumer in the group. This enables parallel processing.
Offsets: Each partition maintains an offset (position in the log). Consumers track their offset and can replay events by resetting to an earlier offset.
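To make the partitioning rule concrete, here's a minimal producer sketch (assuming kafka-python; the key and payload are illustrative, not from our production code). Because the partition is chosen by hashing the key, all of one user's clicks stay in order:

from kafka import KafkaProducer

# Events with the same key always hash to the same partition,
# so per-key ordering is preserved even with many partitions.
producer = KafkaProducer(bootstrap_servers='kafka:9092')
producer.send(
    'user-clicks',
    key=b'user-42',                      # partition chosen by hashing this key
    value=b'{"url": "/home", "ts": 1}',  # payload bytes (serialization omitted)
)
producer.flush()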
This architecture is elegant, but the devil is in the details.
The Consumer Lag Disaster
Our first production issue was consumer lag. Kafka was receiving 50,000 events/second, but consumers were only processing 30,000 events/second. The backlog grew at 20,000 events/second.
Within an hour, we had 72 million unprocessed events. Downstream systems were operating on hour-old data. We couldn't catch up because our consumers were CPU-bound.
The root cause: inefficient processing. Our consumer code looked like this:
while True:
    messages = consumer.poll(timeout_ms=1000)
    for message in messages:
        process_message(message)  # Blocking I/O call
        consumer.commit()         # Commit after each message
This had two problems:
Problem 1: Processing one message at a time. We processed messages serially. Even at 30ms per message, that's only 33 messages/second per consumer. With 10 consumers, we maxed out at 330 messages/second.
Problem 2: Committing after every message. Kafka commits are expensive (network round-trip to the broker). Committing after every message added 20ms overhead per message.
The fix: batch processing and asynchronous commits.
while True:
    messages = consumer.poll(timeout_ms=1000, max_records=500)

    # Process batch asynchronously
    results = process_batch(messages)  # Parallel processing

    # Write results
    write_to_database(results)

    # Commit offsets once per batch
    consumer.commit()
This improved throughput from 330 messages/second to 8,000 messages/second per consumer. With 10 consumers across 10 partitions, we hit 80,000 messages/second, exceeding our ingestion rate.
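For reference, process_batch doesn't need to be exotic. A minimal sketch, assuming process_message is the same I/O-bound handler as before and that a thread pool is an acceptable way to overlap that I/O (the worker count is illustrative):

from concurrent.futures import ThreadPoolExecutor

def process_batch(messages, max_workers=32):
    # Fan the blocking I/O out across a thread pool so a 500-record
    # batch isn't handled one message at a time.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_message, messages))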
The lesson: stream processing throughput depends on batching. Kafka itself is fast. Your consumer code is often the bottleneck.
Partition Rebalancing and the 30-Second Gap
Once we solved throughput, we hit our second issue: rebalancing.
When a consumer joins or leaves the consumer group, Kafka triggers a rebalance. All consumers stop processing, partition assignments are recalculated, and consumers resume with new assignments.
During rebalancing, no messages are processed. For our workload, rebalances took 30 seconds. This happened whenever:
- We deployed new consumer code (rolling restart)
- A consumer crashed
- A consumer's heartbeat failed (network hiccup or GC pause)
Rebalances every 10 minutes created 30-second gaps in processing. Downstream systems received stale data. Alerts fired. Users complained.
The fix required three changes:
1. Increase session timeout and heartbeat interval
consumer = KafkaConsumer(
    'user-clicks',
    group_id='click-processors',
    session_timeout_ms=60000,     # 60 seconds (up from 10s)
    heartbeat_interval_ms=20000,  # 20 seconds (up from 3s)
)
This reduced false positives (consumers kicked out due to transient delays) but increased detection time for real failures.
2. Static membership
consumer = KafkaConsumer(
    'user-clicks',
    group_id='click-processors',
    group_instance_id='consumer-1',  # Static ID
)
Static membership tells Kafka "this consumer will come back." During rolling restarts, Kafka waits for the consumer to rejoin instead of immediately rebalancing. This reduced rebalances during deployments from 10 (one per consumer) to 0.
3. Incremental cooperative rebalancing
We upgraded to Kafka 2.4+ and enabled incremental rebalancing:
consumer = KafkaConsumer(
    'user-clicks',
    partition_assignment_strategy=[CooperativeStickyAssignor],
)
Instead of stopping all consumers during rebalancing, Kafka only stops consumers whose partition assignments change. In practice, this reduced rebalance time from 30 seconds to less than 2 seconds.
These changes cut our rebalance frequency from every 10 minutes to once per day (during deployments), and reduced rebalance duration from 30 seconds to 2 seconds.
Exactly-Once Semantics and the Duplicate Event Problem
Our third issue was duplicate processing. Kafka's default delivery guarantee is at-least-once. Messages are guaranteed to be delivered, but they might be delivered multiple times (e.g., during failures or retries).
For analytics, duplicates are annoying but tolerable (you deduplicate in post-processing). For transactional systems, duplicates are catastrophic. We had a payment processing pipeline where duplicate events caused double charges.
The sequence:
- Consumer reads message: "charge user $50"
- Consumer processes message: database records charge
- Consumer crashes before committing offset
- Consumer restarts, reads same message again
- User charged $50 twice
We tried idempotency keys:
def process_payment(message):
    idempotency_key = message['event_id']

    # Check if already processed
    if database.exists(idempotency_key):
        return  # Skip duplicate

    # Process payment
    charge_user(message['amount'])
    database.insert(idempotency_key, message)
This mostly worked, but had a race condition. Two consumers could read the same message (during rebalancing), check the database simultaneously (both see no record), and double-charge.
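The check-and-insert race can be narrowed by collapsing both steps into a single atomic statement. A sketch, assuming PostgreSQL, an open psycopg2 connection, and a hypothetical processed_events table with a unique event_id column (the pipeline-level fix we settled on follows):

def process_payment_atomic(conn, message):
    with conn.cursor() as cur:
        # Insert-or-skip in one statement: only one consumer can claim this event_id.
        cur.execute(
            "INSERT INTO processed_events (event_id) VALUES (%s) "
            "ON CONFLICT (event_id) DO NOTHING",
            (message['event_id'],),
        )
        if cur.rowcount == 0:  # another consumer already claimed it
            conn.rollback()
            return
        charge_user(message['amount'])
    conn.commit()  # record the event as processed only after the charge succeeds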
The real solution: Kafka transactions (exactly-once semantics).
producer = KafkaProducer(
    transactional_id='payment-processor',
    enable_idempotence=True,
)
producer.init_transactions()

while True:
    messages = consumer.poll()
    producer.begin_transaction()

    for message in messages:
        # Process message
        result = process_payment(message)

        # Produce result to output topic
        producer.send('payment-results', result)

        # Send offset commit as part of transaction
        producer.send_offsets_to_transaction(
            {message.partition: message.offset + 1},
            consumer.group_id
        )

    # Commit transaction atomically
    producer.commit_transaction()
Kafka transactions provide atomic writes: either the output events and offset commits all succeed, or they all fail. Consumers reading the output topic see results only after the transaction commits. No duplicates, no partial failures.
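One detail implied here: the downstream consumer only gets that behavior if it reads committed data. A minimal sketch using confluent-kafka's Consumer and its isolation.level setting (the group name is illustrative):

from confluent_kafka import Consumer

consumer = Consumer({
    'bootstrap.servers': 'kafka:9092',
    'group.id': 'payment-results-readers',  # illustrative group name
    'isolation.level': 'read_committed',    # skip messages from aborted or open transactions
})
consumer.subscribe(['payment-results'])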
The cost? Throughput dropped 20% due to transaction overhead. For payment processing, this was acceptable. For analytics, we stuck with at-least-once plus deduplication.
Event Sourcing: The Event Log as Source of Truth
Once we had reliable stream processing, we rebuilt our architecture around event sourcing.
Traditional architecture:
User action → Service updates database → Other services poll database
Event sourcing:
User action → Service emits event to Kafka → Other services consume events
The database becomes a materialized view of the event log, not the source of truth. Benefits:
1. Auditability. Every state change is recorded as an immutable event. You have a complete audit trail.
2. Replayability. You can rebuild any downstream system by replaying events from the beginning. This makes experimentation easy: spin up a new consumer, replay historical events, and test your logic on real data.
3. Decoupling. Services don't query each other's databases. They consume events. This reduces tight coupling and enables independent scaling.
4. Time travel. You can reconstruct system state at any point in history by replaying events up to that timestamp.
We used event sourcing for our inventory management system:
# Events
ProductAdded(product_id, quantity)
ProductSold(product_id, quantity)
ProductReturned(product_id, quantity)

# State is derived from events
def current_inventory(product_id):
    events = kafka.read_topic('inventory-events', filter=product_id)
    inventory = 0
    for event in events:
        if event.type == 'ProductAdded':
            inventory += event.quantity
        elif event.type == 'ProductSold':
            inventory -= event.quantity
        elif event.type == 'ProductReturned':
            inventory += event.quantity
    return inventory
When we discovered a bug in our inventory calculation logic, we fixed it and replayed all historical events. The corrected inventory counts were ready in 20 minutes. In our old architecture, this would have required manual database corrections.
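Concretely, "replaying" just means pointing a fresh consumer at the start of the log. A sketch, assuming kafka-python and a three-partition inventory-events topic (the rebuild group name is made up):

from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(
    bootstrap_servers='kafka:9092',
    group_id='inventory-rebuild-v2',  # fresh group: no committed offsets to resume from
    enable_auto_commit=False,
)
# Read every partition of the topic from the beginning of the log.
consumer.assign([TopicPartition('inventory-events', p) for p in range(3)])
consumer.seek_to_beginning()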
Handling Schema Evolution
As systems evolve, event schemas change. You add fields, rename fields, or change types. Stream processing systems must handle this gracefully.
We use Avro with a schema registry:
import json

from confluent_kafka import avro
from confluent_kafka.avro import AvroProducer

schema = {
    "type": "record",
    "name": "UserClick",
    "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "url", "type": "string"},
        {"name": "timestamp", "type": "long"}
    ]
}

producer = AvroProducer({
    'bootstrap.servers': 'kafka:9092',
    'schema.registry.url': 'http://schema-registry:8081'
}, default_value_schema=avro.loads(json.dumps(schema)))  # avro.loads expects a JSON string
The schema registry ensures backward and forward compatibility. When you evolve a schema:
Backward compatibility: Consumers using the new schema can read events written with the old one. Adding a field is backward compatible only if you give it a default value.
Forward compatibility: Consumers using the old schema can read events written with the new one. Adding fields is forward compatible because old consumers simply ignore them; removing a field requires it to have had a default.
Full compatibility: Both backward and forward compatible. The gold standard.
We enforce compatibility at the registry level. Attempting to register an incompatible schema fails, preventing production issues.
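Enforcing that is a one-time setting per subject on the registry. A sketch using the Schema Registry REST API via requests (the subject name follows the default topic-name strategy, i.e. user-clicks-value):

import requests

# Pin the compatibility mode for the user-clicks value schema to FULL;
# the registry will then reject any incompatible schema registration.
resp = requests.put(
    'http://schema-registry:8081/config/user-clicks-value',
    json={'compatibility': 'FULL'},
)
resp.raise_for_status()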
Monitoring and Alerting
Stream processing failures are often silent. Messages pile up in Kafka, but no errors are thrown. Consumers keep running, just too slowly.
We monitor three key metrics:
1. Consumer lag. The gap between the latest offset in Kafka and the consumer's current offset.
kafka-consumer-groups --bootstrap-server kafka:9092 \
  --group click-processors --describe

GROUP             TOPIC        PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
click-processors  user-clicks  0          1000000         1000500         500
click-processors  user-clicks  1          1000000         1002000         2000
Alert if lag exceeds 10,000 messages or grows for >5 minutes.
2. Processing rate. Messages processed per second. Compare to ingestion rate.
import time
from prometheus_client import Counter

messages_processed = Counter('messages_processed_total', 'Messages processed')

start_time = time.time()
message_count = 0

while True:
    messages = consumer.poll()
    for message in messages:
        process_message(message)
        message_count += 1
        messages_processed.inc()

    if time.time() - start_time > 60:
        rate = message_count / 60
        print(f"Processing rate: {rate} messages/sec")
        message_count = 0
        start_time = time.time()
Alert if processing rate falls below 90% of average.
3. Rebalance frequency. Rebalances disrupt processing. Too many rebalances indicate instability.
Alert if rebalances occur more than once per hour (outside deployments).
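Counting rebalances is easiest from inside the consumer itself. A sketch, assuming kafka-python's rebalance listener hook and a Prometheus counter (the metric name is illustrative):

from kafka import KafkaConsumer, ConsumerRebalanceListener
from prometheus_client import Counter

rebalances_total = Counter('consumer_rebalances_total', 'Partition assignments received')

class RebalanceCounter(ConsumerRebalanceListener):
    def on_partitions_revoked(self, revoked):
        pass  # nothing to clean up in this sketch
    def on_partitions_assigned(self, assigned):
        rebalances_total.inc()  # each new assignment marks a completed rebalance

consumer = KafkaConsumer(bootstrap_servers='kafka:9092', group_id='click-processors')
consumer.subscribe(['user-clicks'], listener=RebalanceCounter())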
When Not to Use Streaming
Streaming isn't always the answer. We've learned when to use streaming and when to stick with simpler architectures:
Use streaming when:
- You need sub-second latency
- Events must be processed in order (per key)
- Reprocessing historical data is common
- Multiple downstream systems consume the same data
Don't use streaming when:
- Hourly or daily batch processing is acceptable
- Data volume is low (less than 10,000 events/day)
- Ordering doesn't matter and you can tolerate duplicates
- You don't have expertise to operate distributed systems
For our analytics dashboard, we switched from Kafka back to batch processing. Users were happy with hourly updates, and the operational complexity of streaming wasn't justified.
Cost Analysis
Running Kafka in production isn't cheap:
Infrastructure:
- 3 Kafka brokers (for reliability): $400/month
- Schema registry: $50/month
- Monitoring (Prometheus + Grafana): $100/month
Operations:
- On-call rotations: 10 hours/month × $100/hour = $1,000/month
- Incident response: Average 5 hours/month × $150/hour = $750/month
Total: $2,300/month
For comparison, our previous batch processing approach cost $300/month (cron jobs on EC2).
The nearly 8x cost increase was justified for our real-time use cases (fraud detection, inventory management) but not for analytics. Be honest about whether you need real-time.
Lessons Learned
1. Start simple, scale gradually. Our first Kafka topic had one partition. We added partitions as throughput increased. Don't over-engineer upfront.
2. Optimize consumers, not Kafka. Kafka can handle millions of messages per second. Your consumer code is the bottleneck 95% of the time. Profile and optimize your processing logic.
3. Test failure scenarios. Kill consumers during processing. Unplug network cables. Fill disks. Your system will face these failures in production. Test them in staging.
4. Monitor lag aggressively. Consumer lag is the most important metric. Everything else (error rates, latency, throughput) derives from it.
5. Choose delivery guarantees carefully. At-least-once is simple and fast. Exactly-once is complex and slower. Use exactly-once only when duplicates are unacceptable.
6. Plan for schema evolution. Your schemas will change. Use Avro or Protobuf with a schema registry from day one.
7. Document your partitioning strategy. How you assign keys to partitions matters for ordering and consumer load balancing. Document this clearly.
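One lightweight way to make lesson 7 executable rather than just written down: pass an explicit partitioner. A sketch, assuming kafka-python's pluggable partitioner callable (the hashing rule here is an example, not our actual scheme):

import hashlib
from kafka import KafkaProducer

def stable_partitioner(key_bytes, all_partitions, available_partitions):
    # Use a stable hash so the key-to-partition rule is reproducible across
    # processes and restarts (Python's built-in hash() is not).
    if key_bytes is None:
        return available_partitions[0]
    digest = int.from_bytes(hashlib.md5(key_bytes).digest()[:4], 'big')
    return all_partitions[digest % len(all_partitions)]

producer = KafkaProducer(bootstrap_servers='kafka:9092', partitioner=stable_partitioner)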
The Future: ksqlDB and Stream Processing Abstraction
We're experimenting with ksqlDB, which lets you write stream processing logic as SQL:
CREATE STREAM user_clicks (
    user_id VARCHAR,
    url VARCHAR,
    timestamp BIGINT
) WITH (
    kafka_topic='user-clicks',
    value_format='avro'
);

CREATE TABLE clicks_per_user AS
    SELECT user_id, COUNT(*) AS click_count
    FROM user_clicks
    WINDOW TUMBLING (SIZE 1 HOUR)
    GROUP BY user_id;
This is simpler than writing consumer code, but less flexible. For complex logic, we still write custom consumers. For simple aggregations, ksqlDB is attractive.
Stream processing has matured significantly. Kafka is now a commodity, and higher-level abstractions (ksqlDB, Flink SQL, Spark Structured Streaming) make it accessible to more teams.
But the fundamentals remain: understand partitioning, manage consumer lag, choose appropriate delivery guarantees, and test failure scenarios. Get these right, and streaming becomes a powerful tool for building real-time systems. Get them wrong, and you'll spend weeks debugging production outages at 2am.
After two years of running Kafka in production, I've learned that stream processing isn't magic. It's a set of well-understood patterns and trade-offs. Master the fundamentals, and you can build systems that process millions of events per second reliably.