A 50ms API endpoint suddenly took 12 seconds. No code changes. No infrastructure changes. Just a normal Tuesday afternoon when our database decided it had had enough.
The culprit? An innocuous-looking query that worked fine with 100 records. At 10,000 records, it brought our entire API to its knees. This is the nature of database performance: problems hide at small scale and explode at large scale.
After spending hundreds of hours optimizing databases in production, I've learned that database performance isn't about memorizing best practices. It's about understanding query execution, recognizing patterns, and knowing when to denormalize, when to cache, and when to accept that some queries are just expensive.
The N+1 Query Problem
Our 12-second API endpoint was fetching a list of users with their posts. The Django ORM code looked clean:
def get_users_with_posts():
    users = User.objects.all()
    result = []
    for user in users:
        result.append({
            'name': user.name,
            'posts': list(user.posts.all())  # N+1 here
        })
    return result
This generated SQL like:
SELECT * FROM users; -- 1 query
-- Then for each user:
SELECT * FROM posts WHERE user_id = 1;
SELECT * FROM posts WHERE user_id = 2;
SELECT * FROM posts WHERE user_id = 3;
-- ... 10,000 queries
One query for users, plus N queries for posts (one per user). Total: 10,001 queries. At 1ms per query, that's 10 seconds just in network round-trips.
The fix is select_related or prefetch_related:
def get_users_with_posts():
    users = User.objects.prefetch_related('posts').all()
    result = []
    for user in users:
        result.append({
            'name': user.name,
            'posts': list(user.posts.all())  # No extra query!
        })
    return result
This generates:
SELECT * FROM users;
SELECT * FROM posts WHERE user_id IN (1, 2, 3, ..., 10000);
Two queries instead of 10,001. Response time dropped from 12 seconds to 50ms.
The lesson: N+1 queries are the most common database performance problem. They're invisible in development (N is small) and catastrophic in production (N is large). Always use prefetch_related or join queries when fetching related data.
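The mirror-image tool is select_related. prefetch_related runs a second query and stitches the rows together in Python, which fits reverse foreign keys and many-to-many relations; for single-valued relations such as a post's author (ForeignKey or OneToOne), select_related pulls the related table into the original query with a SQL JOIN. A minimal sketch, assuming the same User and Post models as above:
def get_posts_with_authors():
    # One JOINed query instead of one extra query per post for post.user
    posts = Post.objects.select_related('user').all()
    return [{'title': post.title, 'author': post.user.name} for post in posts]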
Understanding Query Plans
When a query is slow, the first step is understanding what the database is actually doing. This requires reading query plans.
For a slow query:
SELECT u.name, COUNT(p.id) as post_count
FROM users u
LEFT JOIN posts p ON u.id = p.user_id
WHERE u.created_at > '2024-01-01'
GROUP BY u.id
HAVING COUNT(p.id) > 10
ORDER BY post_count DESC
LIMIT 100;
Check the execution plan:
EXPLAIN ANALYZE
SELECT u.name, COUNT(p.id) as post_count
FROM users u
LEFT JOIN posts p ON u.id = p.user_id
WHERE u.created_at > '2024-01-01'
GROUP BY u.id
HAVING COUNT(p.id) > 10
ORDER BY post_count DESC
LIMIT 100;
Output (PostgreSQL):
Limit  (cost=15234.42..15234.67 rows=100)
  ->  Sort  (cost=15234.42..15284.42 rows=20000)
        Sort Key: (count(p.id)) DESC
        ->  HashAggregate  (cost=13234.42..13434.42 rows=20000)
              Group Key: u.id
              Filter: (count(p.id) > 10)
              ->  Hash Left Join  (cost=234.42..11234.42 rows=200000)
                    Hash Cond: (u.id = p.user_id)
                    ->  Seq Scan on users u  (cost=0.00..145.00 rows=5000)
                          Filter: (created_at > '2024-01-01'::date)
                    ->  Hash  (cost=145.00..145.00 rows=5000)
                          ->  Seq Scan on posts p  (cost=0.00..145.00 rows=5000)
Key insights from this plan:
1. Seq Scan on users: The database is scanning all users sequentially because there's no index on created_at. This is expensive.
2. Seq Scan on posts: No index on user_id, so all posts are scanned. Another expensive operation.
3. HashAggregate: Grouping requires aggregating all matching rows in memory.
4. Sort: Sorting post_count requires sorting all grouped results.
The plan reveals two missing indexes:
CREATE INDEX idx_users_created_at ON users(created_at);
CREATE INDEX idx_posts_user_id ON posts(user_id);
After adding indexes, the query plan changes:
Limit  (cost=1234.42..1234.67 rows=100)
  ->  Sort  (cost=1234.42..1284.42 rows=20000)
        Sort Key: (count(p.id)) DESC
        ->  HashAggregate  (cost=1034.42..1234.42 rows=20000)
              Group Key: u.id
              Filter: (count(p.id) > 10)
              ->  Hash Left Join  (cost=34.42..834.42 rows=20000)
                    Hash Cond: (u.id = p.user_id)
                    ->  Index Scan using idx_users_created_at on users u
                          Index Cond: (created_at > '2024-01-01'::date)
                    ->  Hash  (cost=145.00..145.00 rows=5000)
                          ->  Index Scan using idx_posts_user_id on posts p
Query time dropped from 3 seconds to 80ms.
The lesson: Always check query plans for slow queries. Seq Scans on large tables are red flags. Add indexes on frequently filtered or joined columns.
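If the slow query comes from the ORM, you don't have to copy the SQL out by hand: Django 2.1+ exposes QuerySet.explain(), which runs EXPLAIN on the SQL the queryset would execute, and on PostgreSQL extra options such as analyze are passed straight through. A rough equivalent of the query above, with model and field names assumed:
from django.db.models import Count

queryset = (
    User.objects
    .filter(created_at__gt='2024-01-01')
    .annotate(post_count=Count('posts'))
    .filter(post_count__gt=10)
    .order_by('-post_count')[:100]
)
print(queryset.explain(analyze=True))  # prints the PostgreSQL plan with real timings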
Index Design: Composite Indexes and Covering Indexes
Not all indexes are created equal. Consider this query:
SELECT name, email
FROM users
WHERE country = 'US' AND age > 18
ORDER BY created_at DESC
LIMIT 100;
You might add three indexes:
CREATE INDEX idx_country ON users(country);
CREATE INDEX idx_age ON users(age);
CREATE INDEX idx_created_at ON users(created_at);
But the database can only use one index per scan (usually). It'll pick idx_country, filter by country, then scan the result to filter by age and sort by created_at.
Better: A composite index:
CREATE INDEX idx_country_age_created_at ON users(country, age, created_at);
This allows the database to:
- Seek directly to the country = 'US' range
- Scan only the age > 18 portion of that range
- Read age and created_at from the index entries without touching the table for filtering
One caveat: because age > 18 is a range condition, the matching entries are ordered by age (then created_at), so the ORDER BY created_at still needs a sort. If that sort turns out to be the bottleneck, an index on (country, created_at) with age filtered afterwards is worth benchmarking.
Order matters. The index works for:
- WHERE country = 'US'
- WHERE country = 'US' AND age > 18
- WHERE country = 'US' AND age > 18 ORDER BY created_at
But not for:
- WHERE age > 18 (country is not specified)
- WHERE created_at > '2024-01-01' (country and age are not specified)
Rule of thumb: order composite index columns by how the query uses them, equality filters first, then range filters, then sort columns; among the equality columns, put the more selective and more frequently queried ones first.
Covering indexes go further. If the index contains all columns in the SELECT clause, the database doesn't need to access the table at all:
CREATE INDEX idx_covering ON users(country, age, created_at, name, email);
Now the query reads only the index, not the table. This is called an "index-only scan" and is extremely fast.
The trade-off? Indexes consume disk space and slow down writes (every INSERT/UPDATE/DELETE must update all indexes). We had a table with 15 indexes. Writes became slower than reads. We removed 8 redundant indexes, and write throughput doubled.
Balance index coverage with write performance. For read-heavy workloads, generous indexing is fine. For write-heavy workloads, minimize indexes.
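Finding candidates to drop doesn't require guesswork: PostgreSQL's pg_stat_user_indexes view records how often each index has been scanned since the statistics were last reset. A rough sketch through Django's raw connection (treating zero scans as "unused" is a starting point, not a hard rule):
from django.db import connection

def unused_indexes():
    # Indexes the planner has never chosen since the last stats reset
    with connection.cursor() as cursor:
        cursor.execute("""
            SELECT relname AS table_name, indexrelname AS index_name, idx_scan
            FROM pg_stat_user_indexes
            WHERE idx_scan = 0
            ORDER BY relname, indexrelname;
        """)
        return cursor.fetchall()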
The JOIN Performance Trap
JOINs are powerful but can be expensive. Consider:
SELECT u.name, p.title, c.content
FROM users u
JOIN posts p ON u.id = p.user_id
JOIN comments c ON p.id = c.post_id
WHERE u.country = 'US';
Suppose 10% of 1 million users are in the US, each user has 10 posts, and each post has 100 comments. This query:
- Filters down to 100,000 US users
- Joins with posts: 10 posts per user makes 1 million post rows
- Joins with comments: 100 comments per post makes 100 million comment rows
The result set explodes. The database has to produce on the order of 100 million joined rows to answer the query.
The fix depends on your access pattern:
Option 1: Denormalize
Store comment counts and latest comment content directly on the posts table. Avoid the join entirely.
ALTER TABLE posts ADD COLUMN comment_count INT DEFAULT 0;
ALTER TABLE posts ADD COLUMN latest_comment TEXT;
-- Update on INSERT/UPDATE to comments
-- Now query is:
SELECT u.name, p.title, p.latest_comment
FROM users u
JOIN posts p ON u.id = p.user_id
WHERE u.country = 'US';
This is 100x faster but introduces complexity (keeping comment_count accurate) and data duplication.
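One way to keep comment_count honest from the application layer is an atomic F() update whenever a comment is created or deleted (a database trigger is the other common choice). A hedged sketch, with the model names assumed:
from django.db import transaction
from django.db.models import F

def add_comment(post, author, text):
    # Create the comment and bump the denormalized counter atomically
    with transaction.atomic():
        comment = Comment.objects.create(post=post, author=author, content=text)
        Post.objects.filter(id=post.id).update(comment_count=F('comment_count') + 1)
    return comment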
Option 2: Pagination
Retrieve data in chunks:
SELECT u.name, p.title
FROM users u
JOIN posts p ON u.id = p.user_id
WHERE u.country = 'US'
LIMIT 100 OFFSET 0;
Then for each post, fetch comments separately (with LIMIT). This avoids the Cartesian explosion.
Option 3: Separate queries
-- Query 1: Get posts
SELECT p.id, p.title
FROM users u
JOIN posts p ON u.id = p.user_id
WHERE u.country = 'US'
LIMIT 100;
-- Query 2: Get comments for these posts
SELECT c.post_id, c.content
FROM comments c
WHERE c.post_id IN (post_ids_from_query_1);
Two queries, but total rows scanned is much lower. This is often faster than a single massive JOIN.
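Through the ORM, the same two-step pattern looks like this (model and field names are assumptions); it is essentially what prefetch_related does for you automatically:
# Step 1: the first 100 post ids for US users
post_ids = list(
    Post.objects.filter(user__country='US').values_list('id', flat=True)[:100]
)

# Step 2: only the comments belonging to those posts
comments = Comment.objects.filter(post_id__in=post_ids).values('post_id', 'content')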
The lesson: Multi-way JOINs can explode result set size. Consider denormalization, pagination, or separate queries to control scan volume.
Transactions and Locking
A seemingly innocent transaction caused a 30-second API timeout:
with transaction.atomic():
    user = User.objects.select_for_update().get(id=user_id)
    user.balance -= amount
    user.save()

    # Heavy operation: generate invoice PDF
    invoice_pdf = generate_invoice(user, amount)  # Takes 25 seconds
    Invoice.objects.create(user=user, pdf=invoice_pdf)
The select_for_update() acquires a row lock. Other transactions that try to update this user, or lock the same row with select_for_update(), block until the transaction commits; plain reads are unaffected thanks to MVCC, but every write path for this user stalls. The PDF generation holds the lock for 25 seconds, blocking all of those requests.
The fix: Move slow operations outside the transaction:
# Fetch the user without a lock and generate the PDF outside the transaction
user = User.objects.get(id=user_id)
invoice_pdf = generate_invoice(user, amount)

with transaction.atomic():
    user = User.objects.select_for_update().get(id=user_id)
    user.balance -= amount
    user.save()
    Invoice.objects.create(user=user, pdf=invoice_pdf)
Now the lock is held for less than 100ms instead of 25 seconds.
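If the invoice only needs to exist once the balance change is durable, another option is django.db.transaction.on_commit(), which defers a callback until the transaction commits, so the slow work never runs under the lock at all. A sketch, where create_invoice_async is an assumed task-queue helper:
from django.db import transaction

def charge(user_id, amount):
    with transaction.atomic():
        user = User.objects.select_for_update().get(id=user_id)
        user.balance -= amount
        user.save()
        # Runs only after COMMIT; the row lock is already released by then
        transaction.on_commit(lambda: create_invoice_async(user_id, amount))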
The lesson: Keep transactions short. Lock only what's necessary, and release locks as soon as possible. Never hold locks during I/O operations (network requests, file writes, external API calls).
Distributed Databases: CAP Theorem in Practice
We migrated from a single PostgreSQL instance to a distributed database (CockroachDB) for geographic redundancy. This introduced new failure modes.
Scenario: User updates profile
BEGIN;
UPDATE users SET name = 'Alice' WHERE id = 123;
COMMIT;
In single-node PostgreSQL, this takes about 1ms. In a CockroachDB cluster with nodes in us-east, us-west, and eu-central, it takes around 100ms, because the database must:
- Coordinate the transaction across replicas (consensus protocol)
- Write to multiple nodes (replication)
- Wait for acknowledgment from a quorum (durability)
The trade-off: High availability and disaster recovery come at the cost of latency.
We optimized by:
1. Locality-aware partitioning: Store US users in us-east, EU users in eu-central. Queries within a region avoid cross-region latency.
2. Read replicas: For read-heavy workloads, read from the nearest replica and accept eventual consistency (reads might lag writes by 100ms). A routing sketch follows this list.
3. Caching: Aggressive caching for user profiles, reducing database load and masking latency.
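A sketch of how item 2 can look at the application layer: a Django database router that sends reads to a nearby replica alias and writes to the primary. The alias names are assumptions; they would be defined alongside 'default' in DATABASES, and the class listed in the DATABASE_ROUTERS setting. Replica reads are eventually consistent, as noted above.
import random

class PrimaryReplicaRouter:
    # Reads go to one of the replica aliases (random here; pick by region in practice)
    def db_for_read(self, model, **hints):
        return random.choice(['replica_us_east', 'replica_eu_central'])

    # Writes always go to the primary
    def db_for_write(self, model, **hints):
        return 'default'

    def allow_relation(self, obj1, obj2, **hints):
        # Every alias points at the same logical schema
        return True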
The CAP theorem is real. You can't have perfect consistency, availability, and partition tolerance. We chose availability + partition tolerance, accepting eventual consistency for reads.
The lesson: Distributed databases are not drop-in replacements for single-node databases. Understand the consistency model and design your application accordingly.
Connection Pooling and Database Load
Our application opened a new database connection for every request. At 1,000 requests/second, that's 1,000 concurrent connections. PostgreSQL's max_connections is 100 by default. We hit the limit, and requests started failing.
The fix: Connection pooling with pgBouncer:
[databases]
mydb = host=postgres.example.com port=5432 dbname=mydb
[pgbouncer]
pool_mode = transaction
max_client_conn = 10000
default_pool_size = 20
PgBouncer maintains 20 persistent connections to PostgreSQL. Application requests share these connections. 10,000 concurrent application requests → 20 database connections.
The trade-off: Connection pooling introduces slight latency (waiting for an available connection) but prevents connection exhaustion.
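On the application side, Django simply points at PgBouncer instead of PostgreSQL. One caveat with pool_mode = transaction: server-side cursors don't survive across transactions, so Django's documentation recommends disabling them in this setup. A sketch of the settings, with host and credentials as placeholders:
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'NAME': 'mydb',
        'HOST': 'pgbouncer.example.com',  # placeholder: the PgBouncer host
        'PORT': '6432',                   # PgBouncer's default port
        'CONN_MAX_AGE': 0,                # let PgBouncer own connection reuse
        'DISABLE_SERVER_SIDE_CURSORS': True,  # needed with transaction pooling
    }
}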
We also optimized by:
1. Reducing connection lifetime: Close connections immediately after use instead of keeping them open.
2. Read replicas: Route read queries to replicas, reducing load on the primary.
3. Query caching: Cache query results in Redis for 60 seconds, reducing database queries by 40% (sketched below).
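A minimal version of that result cache through Django's cache framework, assuming it is backed by Redis (django-redis or Django 4+'s built-in RedisCache); the key format and 60-second TTL are illustrative:
from django.core.cache import cache

def get_user_profile(user_id):
    # Hits Redis first; on a miss, queries the database and caches for 60 seconds
    return cache.get_or_set(
        f'user_profile:{user_id}',
        lambda: User.objects.values('name', 'email').get(id=user_id),
        timeout=60,
    )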
The lesson: Database connections are a limited resource. Use connection pooling, minimize connection lifetime, and offload reads to replicas or caches.
When to Denormalize
Normalization (3NF) is taught as best practice, but production databases often denormalize for performance.
Example: E-commerce order totals.
Normalized schema:
CREATE TABLE orders (id BIGSERIAL PRIMARY KEY, user_id BIGINT, created_at TIMESTAMPTZ);
CREATE TABLE order_items (id BIGSERIAL PRIMARY KEY, order_id BIGINT REFERENCES orders(id), product_id BIGINT, price NUMERIC(10, 2), quantity INT);
To get order total:
SELECT SUM(price * quantity) as total
FROM order_items
WHERE order_id = 123;
With an index on order_id, that's cheap for a single order. But computing totals on the fly across 1 million orders with 5 items each (for reports, order lists, or sorting by total) means aggregating 5 million order_items rows.
Denormalized schema:
CREATE TABLE orders (id BIGSERIAL PRIMARY KEY, user_id BIGINT, created_at TIMESTAMPTZ, total NUMERIC(10, 2) DEFAULT 0);
Store the total on the order. Reading a total is now a plain column lookup, no aggregation required. The trade-off? You must keep the total in sync:
-- On INSERT/UPDATE/DELETE to order_items:
UPDATE orders
SET total = (SELECT SUM(price * quantity) FROM order_items WHERE order_id = orders.id)
WHERE id = 123;
This adds write complexity but speeds up reads. For read-heavy workloads, this is a winning trade-off.
The lesson: Denormalize when reads vastly outnumber writes, and when keeping the duplicated data consistent is manageable.
Monitoring and Alerting
Database performance degrades gradually. Indexes become fragmented. Query plans change as data distributions shift. Detecting issues early requires monitoring.
We monitor:
1. Slow query log: Log queries > 100ms. Analyze daily for new slow queries.
2. Connection pool utilization: Alert if >80% of connections are in use (indicates capacity issues).
3. Replication lag: For replicas, alert if lag exceeds 5 seconds (indicates primary is overloaded or network issues).
4. Index hit rate: Percentage of table scans served by indexes vs. sequential scans. Target >99% (see the sketch after this list).
5. Transaction rollback rate: High rollbacks indicate lock contention or application bugs.
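For metric 4, PostgreSQL's pg_stat_user_tables tracks per-table counts of sequential vs. index scans, which is enough for a rough ratio. A sketch:
from django.db import connection

def index_scan_ratio():
    # Share of scans on each table that went through an index
    with connection.cursor() as cursor:
        cursor.execute("""
            SELECT relname,
                   idx_scan,
                   seq_scan,
                   round(idx_scan::numeric / NULLIF(idx_scan + seq_scan, 0), 4) AS ratio
            FROM pg_stat_user_tables
            ORDER BY seq_scan DESC;
        """)
        return cursor.fetchall()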
These metrics caught issues before users noticed degradation.
Closing Thoughts
Database performance is not a one-time optimization. It's ongoing maintenance. As data grows, query patterns change, and indexes age, performance degrades. Regular review of slow query logs, query plans, and monitoring metrics is essential.
The strategies that work:
1. Understand query execution: Use EXPLAIN ANALYZE. Don't guess.
2. Index strategically: Composite indexes for common queries. Covering indexes for critical paths. But don't over-index.
3. Avoid N+1 queries: Prefetch related data. Always.
4. Keep transactions short: Lock only what's necessary. Never hold locks during I/O.
5. Denormalize when justified: Read-heavy workloads benefit from denormalization. Write-heavy workloads don't.
6. Monitor proactively: Slow query logs and connection metrics catch issues early.
7. Test at scale: Performance problems hide at small scale. Test with production-sized datasets.
The 12-second API endpoint taught me that database performance is about understanding fundamentals, recognizing patterns, and making informed trade-offs. There's no silver bullet. But with the right tools and mental models, you can build systems that scale gracefully.