Redis (Paper-Server Edition)
Lesson 4: Architectural Tactics
The complete HOW reference
Your personal architecture cache. No expiration policy.
How to use this handout
This document is a reference, not a linear read. Use it while working on AdBid and on real projects.
Lesson 3 taught you the WHAT and the WHY. This handout teaches you the HOW — specific tactics for achieving each quality attribute.
Structure of each tactic:
- What is it?
- When to use
- When NOT to use
- Tradeoffs
- Concrete example
Remember: The Lesson 3 → Lesson 4 Transition
| Lesson 3 (Strategy) | Lesson 4 (Tactics) |
|---|---|
| WHAT quality attributes matter | HOW to achieve performance |
| WHY we make tradeoffs | HOW to achieve scalability |
| Context analysis | Specific techniques |
| Tradeoff matrix | ADRs and decisions |
Strategy without tactics = good intentions without execution. Tactics without strategy = doing things at random.
Performance Tactics
How do we make things faster?
Tactic 1: Caching
What is it? Store the results of expensive work so you don't have to repeat it. Instead of doing the same work repeatedly, do it once and save the result.
Cache layers (from outside in):
Client
↓
Browser Cache (client stores resources locally)
↓
CDN (Cloudflare, Fastly — geographically close to the user)
↓
Application Cache (Redis, Memcached — in-memory, ultra fast)
↓
Database Query Cache (the DB caches query results)
↓
Your database
When to use:
- Data that doesn't change frequently (product catalog, configuration, reference data)
- Expensive to compute or fetch (complex calculations, slow third-party APIs)
- Same data requested repeatedly (homepage, popular products)
When NOT to use:
- Data that changes very frequently (real-time stock price)
- Critical data where freshness matters a lot (bank balance at this exact moment)
- When the cost of cache invalidation outweighs the benefit
Tradeoffs:
- ✅ Gain: drastically improved speed, less load on the database
- ❌ Sacrifice: possibly stale data, invalidation complexity
Example:
// Without cache: 50ms per request × 1,000 requests/min = 50 seconds of DB time per minute
const products = await db.query('SELECT * FROM products WHERE category = ?', [category]);
// With cache: ~50ms on the first request (miss), ~0.1ms on each hit.
// The same 1,000 requests cost about 0.1 seconds of Redis time, plus one DB query per hour.
const cacheKey = `products:category:${category}`;
const cached = await redis.get(cacheKey); // a JSON string on a hit, null on a miss
let products;
if (cached) {
  products = JSON.parse(cached); // Redis stores strings, so parse before use
} else {
  products = await db.query('SELECT * FROM products WHERE category = ?', [category]);
  await redis.setex(cacheKey, 3600, JSON.stringify(products)); // TTL: 1 hour
}
Rule of thumb: If the data has more than 1 minute of "useful life" and is requested more than 10 times per minute, it's probably worth caching.
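The check, load-on-miss, store-with-TTL flow above can be wrapped in one small helper. This is a minimal sketch, not a production cache: a `Map` stands in for Redis so it runs without a server, and `createCache`, `getOrLoad`, and `invalidate` are illustrative names, not part of any Redis client.

```javascript
// Cache-aside helper. A Map stands in for Redis so the sketch runs anywhere.
// `now` is injectable so expiry can be tested without waiting.
function createCache(now = () => Date.now()) {
  const entries = new Map(); // key -> { value, expiresAt }

  return {
    async getOrLoad(key, ttlSeconds, loader) {
      const hit = entries.get(key);
      if (hit && hit.expiresAt > now()) {
        return hit.value; // cache hit: the loader is not called
      }
      const value = await loader(); // cache miss: do the expensive work once
      entries.set(key, { value, expiresAt: now() + ttlSeconds * 1000 });
      return value;
    },
    invalidate(key) {
      entries.delete(key); // call this when the underlying data changes
    },
  };
}
```

With a real Redis client the body of `getOrLoad` would use `redis.get` and `redis.setex` and parse the JSON, but the shape stays the same.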
Tactic 2: Async Processing
What is it? Don't make users wait for operations that don't need an immediate result. Accept the task, confirm to the user, process in the background.
The pattern:
User takes action
→ System accepts and saves the task (immediate, <100ms)
→ System returns "success" to user
→ Background job processes the task
→ System notifies when done (optional)
When to use:
- Operations that take more than 2 seconds
- Result not needed immediately
- User can continue using the app while waiting
- Confirmation emails, image processing, report generation, data exports
When NOT to use:
- User needs the result immediately to continue
- Operation is critical and can't fail silently
- Transactions where the user needs to know the outcome instantly (payments)
Tradeoffs:
- ✅ Gain: faster user experience, decouples expensive operations
- ❌ Sacrifice: eventual consistency, user doesn't know immediately if something failed
My principle: When you need messaging between components, it's almost always some kind of queue — or sometimes as simple as a DB table with a status column (pending, processing, completed, failed). Start simple.
Minimal implementation:
-- The simplest possible queue: a PostgreSQL table
CREATE TABLE background_jobs (
id SERIAL PRIMARY KEY,
type VARCHAR(50) NOT NULL,
payload JSONB NOT NULL,
status VARCHAR(20) DEFAULT 'pending',
attempts INTEGER DEFAULT 0,
created_at TIMESTAMP DEFAULT NOW(),
processed_at TIMESTAMP
);
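One way a worker could drain that table is sketched below. An array of objects stands in for the `background_jobs` rows so the example runs on its own; `processPendingJobs`, `handlers`, and the retry limit of 3 are assumptions for illustration.

```javascript
const MAX_ATTEMPTS = 3; // assumed retry limit, not from the handout

// `jobs` mirrors rows of background_jobs; `handlers` maps job.type to an async function.
async function processPendingJobs(jobs, handlers) {
  for (const job of jobs) {
    if (job.status !== 'pending') continue;

    job.status = 'processing'; // claim the job before doing the work
    job.attempts += 1;
    try {
      await handlers[job.type](job.payload);
      job.status = 'completed';
      job.processed_at = new Date();
    } catch (error) {
      // Put the job back for a later run unless it has used up its attempts.
      job.status = job.attempts >= MAX_ATTEMPTS ? 'failed' : 'pending';
    }
  }
}
```

Against real PostgreSQL, claiming a job has to be atomic so two workers never take the same row; `SELECT ... FOR UPDATE SKIP LOCKED` is a common way to do that.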
Tactic 3: Database Optimization
What is it? Make your database queries faster. Almost always the first and biggest performance win.
The 4 main techniques:
1. Indexes
-- Without index: PostgreSQL reads the ENTIRE table (table scan)
SELECT * FROM orders WHERE user_id = 123; -- slow with 1M rows
-- With index: PostgreSQL goes directly (index scan)
CREATE INDEX idx_orders_user_id ON orders(user_id);
SELECT * FROM orders WHERE user_id = 123; -- fast with 1M rows
Rule: index columns you use in WHERE, JOIN, ORDER BY. But don't index everything — indexes slow down writes.
2. The N+1 problem
// ❌ BAD: N+1 queries — 1 query for orders + 1 query PER order
const orders = await db.query('SELECT * FROM orders LIMIT 100'); // 1 query
for (const order of orders) {
const user = await db.query('SELECT * FROM users WHERE id = ?', [order.user_id]); // 100 more queries!
}
// Total: 101 queries
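// ✅ GOOD: 2 queries total
const orders = await db.query('SELECT * FROM orders LIMIT 100');
const userIds = [...new Set(orders.map(o => o.user_id))]; // de-duplicate the ids
// Array expansion for IN (?) depends on the driver: mysql2's query() expands arrays;
// with pg, use 'WHERE id = ANY($1)' instead.
const users = await db.query('SELECT * FROM users WHERE id IN (?)', [userIds]);
// Total: 2 queries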
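After the two queries, the orders and users still have to be matched up in application code. A minimal sketch (the helper name `attachUsers` is hypothetical) that indexes users by id once instead of searching the list for every order:

```javascript
// Attach each order's user without extra queries: build an id -> user index once.
function attachUsers(orders, users) {
  const usersById = new Map(users.map((user) => [user.id, user]));
  return orders.map((order) => ({
    ...order,
    user: usersById.get(order.user_id) ?? null, // null if the user row is missing
  }));
}
```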
3. Connection Pooling
Creating a new DB connection per request is expensive (~10ms). Connection pooling reuses connections.
// Pool of up to 10 connections: 10 queries can run at once with no connection setup cost;
// additional requests wait for a free connection.
const { Pool } = require('pg');
const pool = new Pool({ max: 10, connectionString: process.env.DATABASE_URL });
4. EXPLAIN ANALYZE
-- Run this to understand what the DB is doing with your query
EXPLAIN ANALYZE SELECT * FROM orders WHERE created_at > '2026-01-01' ORDER BY total DESC;
-- Shows: how it searches, how many rows it scans, how long it takes
-- Note: EXPLAIN ANALYZE actually executes the query, so be careful with INSERT/UPDATE/DELETE
When NOT to use: You should almost always optimize queries. The only exception is when the cost of optimizing outweighs the benefit (queries that run once a day, for example).
Tactic 4: Load Balancing
What is it? Distribute incoming traffic across multiple server instances.
Common algorithms:
| Algorithm | How it works | When to use |
|---|---|---|
| Round Robin | 1, 2, 3, 1, 2, 3... | Similar-duration requests |
| Least Connections | Send to server with fewest active connections | Variable-duration requests |
| IP Hash | Same client IP → same server | When you need some session affinity (note: this compromises statelessness) |
When to use: When a single server can't handle all the traffic.
When NOT to use: If your problem is processing speed (not capacity), a faster single server may be simpler. Don't add a load balancer if you have no evidence you need more than one server.
Tradeoffs:
- ✅ Gain: distributed capacity, tolerance for individual instance failures
- ❌ Sacrifice: operational complexity, you must be stateless
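The three algorithms in the table are small enough to sketch directly. This is a toy illustration, assuming servers are plain objects with a `name` and an `activeConnections` count; in practice your load balancer (nginx, HAProxy, a cloud LB) implements these for you.

```javascript
// Round robin: cycle through the servers in order.
function createRoundRobin(servers) {
  let next = 0;
  return () => {
    const server = servers[next];
    next = (next + 1) % servers.length;
    return server;
  };
}

// Least connections: pick the server with the fewest active connections.
function pickLeastConnections(servers) {
  return servers.reduce((best, s) => (s.activeConnections < best.activeConnections ? s : best));
}

// IP hash: the same client IP always maps to the same server.
function pickByIpHash(servers, clientIp) {
  let hash = 0;
  for (const ch of clientIp) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // keep it an unsigned 32-bit integer
  }
  return servers[hash % servers.length];
}
```

Note that with IP hash, adding or removing a server changes `servers.length` and so remaps most clients, which is one reason affinity and stateless design pull against each other.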
Tactic 5: Compression
What is it? Compress responses before sending them to the client.
Without compression: 100KB JSON → 100KB over the wire
With gzip: 100KB JSON → ~20KB over the wire (80% reduction typical for JSON)
When to use: Large responses (data-heavy APIs), mobile clients, slow networks.
When NOT to use: Already-compressed data (images, video). Re-compressing is inefficient.
Tradeoffs:
- ✅ Gain: less bandwidth, faster transfer
- ❌ Sacrifice: CPU to compress/decompress (usually negligible)
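To see the effect on your own payloads, Node's built-in `zlib` module is enough. The sketch below gzips a made-up, repetitive JSON list and prints both sizes; the exact ratio depends entirely on the data, so treat the 80% figure as a rough guide.

```javascript
const zlib = require('zlib');

// Build a repetitive JSON payload, similar in shape to a typical list response.
const items = Array.from({ length: 500 }, (_, i) => ({
  id: i,
  name: `Product ${i}`,
  category: 'electronics',
  inStock: true,
}));
const json = JSON.stringify(items);

const compressed = zlib.gzipSync(json); // what would travel over the wire
const restored = zlib.gunzipSync(compressed).toString('utf8'); // what the client decodes

console.log(`original: ${Buffer.byteLength(json)} bytes`);
console.log(`gzipped:  ${compressed.length} bytes`);
console.log(`round trip ok: ${restored === json}`);
```

In an Express app you would normally not call `zlib` yourself; the `compression` middleware or your reverse proxy handles it per response.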
Scalability Tactics
How do we handle more load? Performance = speed per request. Scalability = more simultaneous requests.
Tactic 1: Stateless Services FOUNDATIONAL
What is it? Your server stores no user session data in its own memory. Session lives in external shared storage.
Why it's foundational:
App with server-side state:
Server 1 (knows about User A) Server 2 (knows nothing about User A)
↑ ↑
User A → can only go to Server 1 → you can't scale horizontally
Stateless app:
Server 1 Server 2 Server 3
↑ ↑ ↑
User A can go to ANY of them → add servers freely
How to implement:
Option 1 — Redis for session:
// Store session in Redis (shared across all servers)
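app.use(session({
  store: new RedisStore({ client: redisClient }),
  secret: process.env.SESSION_SECRET,
  resave: false, // don't rewrite sessions that didn't change
  saveUninitialized: false // don't store empty sessions
}));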
Option 2 — JWT (session in the token):
// Token contains all session info
// Server validates the token, doesn't need to "remember" anything
// Set an expiry: a signed token can't easily be revoked once issued
const token = jwt.sign({ userId: user.id, role: user.role }, process.env.JWT_SECRET, { expiresIn: '1h' });
// Client stores the token. Server only validates it.
When to use: Always for web applications that need to scale.
When NOT to use: Single permanent server applications (rare, but they exist).
Tradeoffs:
- ✅ Gain: horizontal scaling of the app tier (any server can handle any request; the shared store and DB become the next limit)
- ❌ Sacrifice: slight complexity, added latency to retrieve session from Redis
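The difference between the two diagrams fits in a few lines. In this sketch each "server" is a plain object and a `Map` stands in for Redis; `createServer`, `login`, and `whoAmI` are made-up names for illustration.

```javascript
// A server only knows the session store it was given.
function createServer(name, sessionStore) {
  return {
    login(sessionId, userId) {
      sessionStore.set(sessionId, { userId });
    },
    whoAmI(sessionId) {
      const session = sessionStore.get(sessionId);
      return session ? `${name}: user ${session.userId}` : `${name}: not logged in`;
    },
  };
}

// Stateful: every server keeps sessions in its own memory.
const statefulA = createServer('A', new Map());
const statefulB = createServer('B', new Map());
statefulA.login('s1', 42);
console.log(statefulB.whoAmI('s1')); // prints "B: not logged in"

// Stateless: all servers share one external store.
const sharedStore = new Map();
const statelessA = createServer('A', sharedStore);
const statelessB = createServer('B', sharedStore);
statelessA.login('s1', 42);
console.log(statelessB.whoAmI('s1')); // prints "B: user 42"
```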
Tactic 2: Horizontal Scaling
What is it? Add more server instances instead of bigger instances.
| | Vertical Scaling | Horizontal Scaling |
|---|---|---|
| How | Bigger server | More servers |
| Ceiling | Yes (hardware has limits) | No (keep adding) |
| Cost | Exponential at the top | Linear |
| Complexity | Lower | Higher |
| Risk | Single point of failure | Redundant |
Requirements:
- Stateless services ← most important
- Load balancer to distribute traffic
- Shared external storage (DB, Redis)
When to use: When vertical scaling is no longer cost-effective or you've hit the limit.
When NOT to use: Before you need it. Don't add multi-server complexity before you have evidence that one server isn't enough.
Tactic 3: Read Replicas
What is it? A copy of your database that accepts reads only. Write traffic goes to the primary. Read traffic is distributed across replicas.
Writes → Primary DB
↓ (replication, lag ~1-5 seconds)
Reads → Replica 1 | Replica 2 | Replica 3
Why it works: Most applications are read-heavy (often around 90% reads). If the primary also serves reads, one replica ≈ double the read capacity and three replicas ≈ 4x total read capacity.
When to use: Read-heavy applications. Often the easiest, most overlooked scaling win.
When NOT to use: When you need to read data immediately after writing it (replication lag is a problem). e.g., showing a bank balance immediately after a transaction.
Tradeoffs:
- ✅ Gain: multiplied read capacity, reduced load on primary
- ❌ Sacrifice: eventual consistency (replicas can be 1-5 seconds behind), cost of additional instances
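In application code, read replicas usually show up as a small routing layer. A minimal sketch, assuming `primary` and each replica are objects with an async `query(sql, params)` method; `createDbRouter` and the `fresh` option are hypothetical names.

```javascript
// Routes writes to the primary and spreads reads across replicas (round robin).
function createDbRouter(primary, replicas) {
  let next = 0;
  return {
    write(sql, params) {
      return primary.query(sql, params);
    },
    // Pass { fresh: true } for reads that must see a write made moments ago.
    read(sql, params, { fresh = false } = {}) {
      if (fresh || replicas.length === 0) {
        return primary.query(sql, params); // avoids replication lag
      }
      const replica = replicas[next];
      next = (next + 1) % replicas.length;
      return replica.query(sql, params);
    },
  };
}
```

The `fresh` escape hatch is one way to handle the read-after-write case: the balance shown right after a transaction goes to the primary, everything else goes to a replica.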
Tactic 4: Queue-Based Processing
What is it? Decouple who produces work from who processes it using an intermediate message queue.
Without queue:
User → Web Server → processes everything immediately → responds
With queue:
User → Web Server → puts in queue → responds "accepted" ✓
↓
Workers consume from queue
Workers process at their own pace
Benefits:
- Absorbs spikes: if 10,000 requests arrive in 1 second, the queue holds them, workers process at their own pace
- Independent scaling: you can have 3 web servers and 10 workers, or the reverse
- Automatic retry: if a job fails, the queue can retry it
Tools:
- RabbitMQ — flexible, feature-rich, self-hosted
- AWS SQS — managed, simple, reliable, good price
- BullMQ (Node.js) — built on Redis, great DX, for moderate scale
When to use: Background processing, high-volume events, anything that can tolerate delay.
When NOT to use: When the user needs the result immediately.
Tradeoffs:
- ✅ Gain: spike absorption, scalability, resilience
- ❌ Sacrifice: eventual consistency, operational complexity, one more system to manage
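A toy version of the producer/worker split makes the spike-absorption idea concrete. It is an in-memory sketch only (a real queue such as SQS or RabbitMQ persists jobs and survives restarts); `createQueue`, `enqueue`, and `drain` are made-up names.

```javascript
// Minimal in-memory queue: producers enqueue instantly, workers drain at their own pace.
function createQueue() {
  const jobs = [];
  return {
    enqueue(job) {
      jobs.push(job); // fast: the web server can respond "accepted" right away
      return jobs.length; // current queue depth
    },
    // Start `workerCount` workers; resolves once the queue is empty.
    async drain(workerCount, handler) {
      const worker = async () => {
        while (jobs.length > 0) {
          const job = jobs.shift();
          await handler(job);
        }
      };
      await Promise.all(Array.from({ length: workerCount }, () => worker()));
    },
  };
}
```

However large the spike, at most `workerCount` jobs run at once, so the downstream system sees a steady load; scaling the workers is independent of scaling the web servers.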
Tactic 5: Caching for Scale
What is it? Each cache layer prevents requests from hitting the bottleneck below it.
Without cache: 10,000 requests → all hit the DB
With layered cache:
- CDN answers 7,000 (static assets, cached pages)
- Redis answers 2,500 (frequent app data)
- DB receives only 500 (unique data, writes, uncached data)
For scalability, think of cache as a shield, not just a speed optimization.
Tactic 6: Polyglot Persistence
What is it? Using different databases for different types of data, each optimized for its use case.
The most common pattern:
Central ACID DB (PostgreSQL):
→ Critical transactional data
→ Orders, payments, user accounts
→ You need ACID: Atomicity, Consistency, Isolation, Durability
→ You cannot lose this
Specialized DB (Cassandra, DynamoDB, InfluxDB, etc.):
→ High-volume, less-critical data
→ Analytics logs, tracking events, time-series data
→ Eventual consistency is fine
→ Volume that would overwhelm your relational DB
Real examples:
| Company | Critical DB | Specialized DB | For what |
|---|---|---|---|
| Uber | PostgreSQL | Cassandra | GPS tracking |
| | PostgreSQL | Cassandra | Activity feeds |
| Netflix | MySQL/CockroachDB | Cassandra | Viewing history |
The flock analogy: Two small, manageable flocks (one fancy, one regular) instead of one giant herd of 100 sheep. Each has its own shepherd, its own space, its own rhythm.
When to use:
- You have data with radically different access patterns
- High volume of data where eventual consistency is fine
- You already have evidence that one DB isn't enough
When NOT to use:
- Your app is small or mid-size (adds real operational complexity)
- The team has no experience with the specialized DB
- "Because Netflix does it" (they also have 100+ person infrastructure teams)
Tradeoffs:
- ✅ Gain: each DB optimized for its case, independent scaling
- ❌ Sacrifice: two systems to operate, monitor, and back up; cross-DB joins are complex
Tactic 7: Database Sharding THE NUCLEAR OPTION
What is it? Split a single database into multiple databases (shards) where each shard contains a subset of the data.
Without sharding:
Single DB → all data for all users
With sharding (by user_id):
Shard 1 → users 1 to 1,000,000
Shard 2 → users 1,000,001 to 2,000,000
Shard 3 → users 2,000,001 to 3,000,000
⚠️ THE SHEEP HERDING PRINCIPLE ⚠️
1 sheep: Simple. Click, move, done.
100 sheep with no experience and no sheepdog: Chaos. They scatter, you can't control them, you lose them.
1 database: Simple. Query, result, done.
10 sharded databases: Every query becomes:
- Which shard has this data?
- Need data from multiple shards? (cross-shard query — very slow)
- How do I keep shards balanced? (rebalancing — nightmare)
- Joins across shards? (very, very painful)
- User moved between shards? (migration — complex)
You're herding 100 sheep with no sheepdog.
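Here is what "which shard has this data?" looks like in code, using the range scheme above. Plain arrays stand in for the shard databases, and `shardFor`, `findOrders`, and `findLargeOrders` are hypothetical helpers.

```javascript
const USERS_PER_SHARD = 1000000; // matches the ranges in the example above

// Which shard holds this user? Shards are numbered from 1.
function shardFor(userId) {
  return Math.ceil(userId / USERS_PER_SHARD);
}

// A single-user query touches exactly one shard...
function findOrders(shards, userId) {
  return shards[shardFor(userId)].filter((order) => order.user_id === userId);
}

// ...but a query across all users has to visit every shard and merge the results itself.
function findLargeOrders(shards, minTotal) {
  return Object.values(shards)
    .flatMap((orders) => orders.filter((order) => order.total >= minTotal))
    .sort((a, b) => b.total - a.total);
}
```

Even in this toy form, the sorting and merging that a single database did for you has moved into application code, and nothing here handles rebalancing or a shard being down.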
The Principle
Keep your flock small until wool demand is too high.
= Use ONE database until you absolutely cannot handle the load anymore.
"Cannot handle the load" means you've already tried:
- ✓ Caching at all layers
- ✓ Read replicas (3-4 replicas)
- ✓ Vertical scaling (bigger server)
- ✓ Optimizing all slow queries
- ✓ Connection pooling
- You're at 90%+ capacity and growth continues
THEN, maybe, consider sharding.
Reality check:
- Well-optimized PostgreSQL handles billions of rows on a single instance
- Instagram used a single database for years before needing to shard
- Most companies never need to shard
- If you eventually do need it: hire someone with experience, or use a DB with auto-sharding (CockroachDB, PlanetScale)
When to use: When you've literally exhausted every other option.
When NOT to use: Practically always.
Availability Tactics
How do we stay running when things fail? (And they will.)
Tactic 1: Redundancy
What is it? Eliminate single points of failure by having multiple instances of critical components.
Modes:
| | Active-Active | Active-Passive |
|---|---|---|
| How it works | Both instances handle traffic simultaneously | One active, one on standby |
| If one instance fails | Traffic continues on the remaining instance | Standby takes over (failover) |
| Complexity | Higher (both must handle load) | Lower |
| Resource usage | More efficient | Standby resources "wasted" |
When to use: Systems that cannot have downtime — production databases, payment services, authentication, network infrastructure.
When NOT to use: In development. In systems where some downtime is acceptable and redundancy complexity isn't worth it.
Tradeoffs:
- ✅ Gain: eliminates single points of failure, improved uptime
- ❌ Sacrifice: cost (minimum 2x), operational complexity
Tactic 2: Health Checks
What is it? A dedicated endpoint that reports whether your service is working correctly. The orchestration system uses it to decide whether to send traffic.
// Simple endpoint
app.get('/health', async (req, res) => {
try {
await db.query('SELECT 1'); // verify DB responds
res.json({ status: 'healthy', timestamp: new Date() });
} catch (error) {
res.status(503).json({ status: 'unhealthy', error: error.message });
}
});
// Detailed health check
app.get('/health/detailed', async (req, res) => {
const checks = {
database: await checkDatabase(),
redis: await checkRedis(),
externalApi: await checkExternalApi(),
};
const allHealthy = Object.values(checks).every(c => c.healthy);
res.status(allHealthy ? 200 : 503).json({ status: allHealthy ? 'healthy' : 'degraded', checks });
});
When to use: Always. On every production service. It is basically free to implement and catches many problems before users notice.
Tradeoffs:
- ✅ Gain: automatic failure detection, automatic restart, routing away from degraded instances
- ❌ Sacrifice: one more endpoint to maintain (minimal)
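The detailed endpoint assumes helpers such as `checkDatabase()` that resolve to an object with a `healthy` flag. One possible shape for them is a generic wrapper with a timeout, so a hung dependency can't hang the health check itself. `runCheck` and its 2-second default are assumptions, not part of the handout's code.

```javascript
// Runs one dependency probe and never throws: it always resolves to { healthy, ... }.
async function runCheck(probe, timeoutMs = 2000) {
  const startedAt = Date.now();
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${timeoutMs}ms`)), timeoutMs);
  });
  try {
    await Promise.race([probe(), timeout]); // whichever settles first wins
    return { healthy: true, latencyMs: Date.now() - startedAt };
  } catch (error) {
    return { healthy: false, error: error.message };
  } finally {
    clearTimeout(timer); // don't keep the process alive for the timer
  }
}
```

A helper could then be as short as `const checkDatabase = () => runCheck(() => db.query('SELECT 1'));`, with `db` being whatever client the service already uses.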
Tactic 3: Graceful Degradation
What is it? When a component fails, the system continues working with reduced functionality instead of failing completely.
Practical examples:
// Instead of failing when recommendation service is down:
async function getProductRecommendations(userId) {
try {
return await recommendationService.getFor(userId); // try the service
} catch (error) {
logger.warn('Recommendation service unavailable, using fallback');
return await getPopularProducts(); // fallback: generic popular products
}
}
// Instead of showing an error when payment service is slow:
async function processOrder(order) {
await saveOrderToQueue(order); // save to process later
return { status: 'queued', message: 'Your order was received. We will confirm by email.' };
}
When to use: Whenever your system depends on external or internal services that can fail.
When NOT to use: When the failing functionality is so critical that there's no point continuing without it. If the payment processor fails during checkout and the order can't safely be queued for later, there's no way to degrade gracefully: the transaction must fail.
Tradeoffs:
- ✅ Gain: user experiences reduced functionality instead of a total error
- ❌ Sacrifice: complexity of implementing and testing multiple paths; degradation can go unnoticed unless fallbacks are logged and monitored
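The try/fallback shape from the recommendations example can be pulled into a reusable helper. A minimal sketch, assuming both sources are async functions; `withFallback` and the `degraded` flag are illustrative names.

```javascript
// Try the primary source; on error, report it and serve the fallback instead.
async function withFallback(primary, fallback, onDegraded = () => {}) {
  try {
    return { data: await primary(), degraded: false };
  } catch (error) {
    onDegraded(error); // make the degradation visible (log, metric, alert)
    return { data: await fallback(), degraded: true };
  }
}
```

Returning the `degraded` flag lets the caller decide whether to tell the user, and the `onDegraded` hook gives monitoring a place to count how often the fallback path runs.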
Tactic 4: Circuit Breakers
What is it? A mechanism that automatically stops calling services that are failing, preventing error cascades.
The three states:
CLOSED state (normal):
Your service → calls pass through → External service ✓
OPEN state (failure detected):
Your service → BLOCKED immediately → Never reaches external service
↓
Returns fast error or cached data
HALF-OPEN state (testing recovery, after timeout):
Your service → ONE test call → External service
If it works → CLOSED state
If it fails → OPEN state (reset timer)
Why it matters: Without circuit breaker: external service is slow → your threads pile up waiting → your thread pool exhausts → YOUR service falls too.
With circuit breaker: external service is slow → circuit opens → your threads fast-fail → your service stays responsive.
Tools:
- Node.js: opossum library
- Java: Resilience4j
- .NET: Polly
- Or implement a simple one yourself
When to use: When calling external services (third-party APIs) or your own internal microservices.
When NOT to use: Simple internal function calls, database operations where you want errors to propagate normally.
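For the "implement a simple one yourself" route, the three states map onto a small class. This is a teaching sketch under stated assumptions: the threshold of 3 failures and the 10-second reset are arbitrary defaults, the clock is injectable for testing, and it does not limit concurrent test calls while half-open, which a library such as opossum would handle.

```javascript
class CircuitBreaker {
  constructor(action, { failureThreshold = 3, resetTimeoutMs = 10000, now = () => Date.now() } = {}) {
    this.action = action; // the async function that calls the external service
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now;
    this.state = 'CLOSED';
    this.failures = 0;
    this.openedAt = 0;
  }

  async call(...args) {
    if (this.state === 'OPEN') {
      if (this.now() - this.openedAt < this.resetTimeoutMs) {
        throw new Error('Circuit is open'); // fail fast, never reach the service
      }
      this.state = 'HALF_OPEN'; // timeout elapsed: let a test call through
    }
    try {
      const result = await this.action(...args);
      this.state = 'CLOSED'; // success closes the circuit and clears the count
      this.failures = 0;
      return result;
    } catch (error) {
      this.failures += 1;
      if (this.state === 'HALF_OPEN' || this.failures >= this.failureThreshold) {
        this.state = 'OPEN';
        this.openedAt = this.now(); // (re)start the timer
      }
      throw error;
    }
  }
}
```

Callers typically catch the fast "Circuit is open" error and return cached data or a fallback, which is where this tactic meets graceful degradation.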
Tactic 5: Database Replication (for availability)
What is it? Keep a "warm" copy of the database ready to take over if the primary fails.
(Note: also used for scalability — see Read Replicas. Here the focus is failover, not load distribution.)
Modes:
| | Synchronous | Asynchronous |
|---|---|---|
| Write confirms when | Primary AND replica confirm | Only primary confirms |
| Data loss if primary fails | Zero | Possibly last 1-5 seconds |
| Write speed | Slower | Faster |
| For what | Critical financial data | Most apps |
When to use: In any production system where losing data would be unacceptable.
When NOT to use: Development or staging where downtime is acceptable.
Tactic 6: Monitoring & Alerting
What is it? Systems that measure your application's behavior and notify you when something goes wrong.
What to monitor:
Technical Metrics:
- Error rate (% of requests returning 5xx)
- Response times (p50, p95, p99)
- Resource usage (CPU, memory, disk)
- Queue depth
- Active DB connections
Business Metrics (underrated!):
- Orders per hour
- User signups per day
- Revenue per hour
- Searches per minute
Business metrics are crucial: a sudden drop in orders/hour is often the first signal of a technical problem, before technical metrics show it clearly.
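Two of the technical metrics above, percentiles and error rate, are easy to compute by hand, which helps when reading dashboards. A small sketch using the nearest-rank method (monitoring tools may interpolate differently); `percentile` and `errorRate` are illustrative helper names.

```javascript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(values, p) {
  if (values.length === 0) return null;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Share of requests that returned a 5xx status, as a percentage.
function errorRate(statusCodes) {
  if (statusCodes.length === 0) return 0;
  const errors = statusCodes.filter((code) => code >= 500).length;
  return (errors / statusCodes.length) * 100;
}

const latenciesMs = [10, 20, 30, 40, 1000];
console.log(percentile(latenciesMs, 50)); // 30
console.log(percentile(latenciesMs, 99)); // 1000
```

The sample shows why p95/p99 matter: the median looks healthy at 30ms while one request in five took a full second.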
Accessible tools:
- Sentry — errors and exceptions (has a free tier)
- Datadog — metrics and infrastructure (paid, powerful)
- Grafana + Prometheus — open source, requires setup
- Uptime Robot — simple, monitors that endpoints respond
When to use: Always in production. Non-negotiable.
Security Tactics — Overview
Full deep dive in Lesson 8. This is the introduction.
| Tactic | What it does | When |
|---|---|---|
| Authentication | Verifies who you are | Always for access to user data |
| Authorization | Verifies what you can do | Always for privileged operations |
| Encryption in transit | HTTPS (TLS) protects data on the network | Always in production |
| Encryption at rest | Encrypted DB | Sensitive data (PII, health, finance) |
| Input validation | Never trust external data | Always, server-side |
| Rate limiting | Prevents abuse and brute force | Public APIs, auth endpoints |
| Secrets management | Keeps credentials in environment variables or a vault | Always; never store secrets in code or git |
Golden rule: If you don't know whether you need a security tactic, you do.
Maintainability Tactics
Can the team understand and modify this code in 6 months?
Tactic 1: Modularity
What is it? Organize code into modules with clear, well-defined responsibilities.
The most common pattern — Layered Architecture:
┌─────────────────────────────┐
│ Controllers │ ← HTTP, input validation, responses
│ (what the world sees) │
└─────────────┬───────────────┘
│
┌─────────────▼───────────────┐
│ Services │ ← Business logic, orchestration
│ (the brain) │
└─────────────┬───────────────┘
│
┌─────────────▼───────────────┐
│ Repositories │ ← Data access, DB queries
│ (the memory) │
└─────────────────────────────┘
📁 src/
📁 controllers/ ← handles HTTP
userController.js
orderController.js
📁 services/ ← business logic
userService.js
orderService.js
📁 repositories/ ← data access
userRepository.js
orderRepository.js
📁 models/ ← data structures
📁 middleware/ ← auth, logging, error handling
When to use: From day one. On any project with more than one person or more than 3 months of life.
Tactic 2: Low Coupling, High Cohesion
What is it?
Low Coupling: Modules depend minimally on each other.
// ❌ HIGH COUPLING: UserService knows too much about how orders are stored
class UserService {
async deleteUser(userId) {
await this.db.query('DELETE FROM orders WHERE user_id = ?', [userId]); // ← directly accesses order DB!
await this.db.query('DELETE FROM users WHERE id = ?', [userId]);
}
}
// ✅ LOW COUPLING: UserService coordinates, each repository handles its own concern
class UserService {
async deleteUser(userId) {
await this.orderRepository.deleteByUserId(userId); // ← delegates to whoever owns it
await this.userRepository.delete(userId);
}
}
High Cohesion: Related things live together.
❌ Bad: Your auth logic scattered across:
- authController.js (some here)
- userService.js (some here)
- middleware/validate.js (some here)
- utils/token.js (some here)
✅ Good: All auth logic in:
- auth/authService.js (complete logic)
- auth/authMiddleware.js (middleware)
- auth/tokenUtils.js (token utilities)
Coupling test: If you change one file, how many others do you have to change? If the answer is more than 2-3, coupling is too high.
Tactic 3: Clear Abstractions
What is it? Hide complexity behind simple interfaces. Code that uses the abstraction doesn't know or care how it's implemented.
import { promises as fs } from 'fs'; // used by LocalStorageService below

// Define the interface (the contract)
interface StorageService {
uploadFile(file: Buffer, path: string): Promise<string>;
deleteFile(path: string): Promise<void>;
getFileUrl(path: string): string;
}
// Development implementation
class LocalStorageService implements StorageService {
async uploadFile(file: Buffer, path: string): Promise<string> {
await fs.writeFile(`./uploads/${path}`, file);
return `/uploads/${path}`;
}
// ...
}
// Production implementation
class S3StorageService implements StorageService {
async uploadFile(file: Buffer, path: string): Promise<string> {
await this.s3.upload({ Bucket: 'my-bucket', Key: path, Body: file }).promise();
return `https://my-bucket.s3.amazonaws.com/${path}`;
}
// ...
}
// Code using storage doesn't change when you change the implementation
class UserService {
  constructor(
    private storage: StorageService, // ← injected, not hardcoded
    private userRepository: UserRepository, // assumed repository type, also injected
  ) {}
  async updateAvatar(userId: string, image: Buffer) {
    const url = await this.storage.uploadFile(image, `avatars/${userId}.jpg`);
    await this.userRepository.updateAvatar(userId, url);
  }
}
When to use: When you might swap implementations (vendor change), when you want easy tests (mock the interface), when multiple modules use the same thing.
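The "easy tests" benefit is worth seeing once. Below is a sketch of a third implementation that keeps files in memory, plus a `UserService` that receives it by injection. It is written in plain JavaScript, so the interface is implicit (any object with the same three methods works); `InMemoryStorageService` and the `memory://` URL scheme are made up for illustration.

```javascript
// Test double: same three-method contract as StorageService, but files live in a Map.
class InMemoryStorageService {
  constructor() {
    this.files = new Map();
  }
  async uploadFile(file, path) {
    this.files.set(path, file);
    return this.getFileUrl(path);
  }
  async deleteFile(path) {
    this.files.delete(path);
  }
  getFileUrl(path) {
    return `memory://${path}`;
  }
}

// The service depends only on the contract, never on S3 or the file system.
class UserService {
  constructor(storage, userRepository) {
    this.storage = storage;
    this.userRepository = userRepository;
  }
  async updateAvatar(userId, image) {
    const url = await this.storage.uploadFile(image, `avatars/${userId}.jpg`);
    await this.userRepository.updateAvatar(userId, url);
    return url;
  }
}
```

A unit test can now build `new UserService(new InMemoryStorageService(), fakeRepository)` and run in milliseconds with no network or disk access.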
Tactic 4: Automated Testing
What is it? Tests that run automatically to verify code works as expected.
The three types and when to use each:
| Type | What it tests | Speed | When to prioritize |
|---|---|---|---|
| Unit | Individual function in isolation | Very fast | Complex business logic, algorithms |
| Integration | Components working together | Medium | Data flows, service interactions |
| E2E (End-to-End) | Complete user flow | Slow | Critical paths (checkout, login, signup) |
You don't need 100% coverage. You need confidence to change code.
Prioritize tests for:
- Critical business logic (price calculation, authorization, validations)
- Things that have broken before
- Complex algorithms
Don't spend much energy testing:
- Simple configuration (if DB config works, you know in 5 seconds)
- Code that changes constantly (tests that break on every change aren't helping)
// Example of a well-written unit test
describe('OrderService.calculateTotal', () => {
it('should apply 10% discount for orders over $100', () => {
const order = { items: [{ price: 60 }, { price: 50 }] }; // total = $110
const result = orderService.calculateTotal(order);
expect(result.discount).toBe(11); // 10% of $110
expect(result.total).toBe(99); // $110 - $11
});
it('should not apply discount for orders under $100', () => {
const order = { items: [{ price: 40 }, { price: 30 }] }; // total = $70
const result = orderService.calculateTotal(order);
expect(result.discount).toBe(0);
expect(result.total).toBe(70);
});
});
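The test above exercises `orderService.calculateTotal`, which the handout doesn't show. A hypothetical implementation that would satisfy both cases is sketched here; the rule for an order of exactly $100 (no discount) is an assumption, since the tests only cover "over" and "under".

```javascript
const DISCOUNT_THRESHOLD_CENTS = 100 * 100; // orders over $100 qualify
const DISCOUNT_PERCENT = 10;

const orderService = {
  calculateTotal(order) {
    // Work in integer cents so the arithmetic is exact.
    const subtotalCents = order.items.reduce(
      (sum, item) => sum + Math.round(item.price * 100),
      0
    );
    const discountCents = subtotalCents > DISCOUNT_THRESHOLD_CENTS
      ? Math.round((subtotalCents * DISCOUNT_PERCENT) / 100)
      : 0;
    return {
      discount: discountCents / 100,
      total: (subtotalCents - discountCents) / 100,
    };
  },
};
```

Because the function is pure (input in, result out, no database or network), it is exactly the kind of business logic where unit tests are cheap and valuable.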
Tactic 5: Documentation
What is it? Explaining the WHY behind decisions, not the WHAT of the code.
What's worth documenting:
// ❌ Bad comment — explains the obvious:
// sum item prices
const total = items.reduce((sum, item) => sum + item.price, 0);
// ✅ Good comment — explains the non-obvious why:
// We use price.amount in cents (not decimal pesos) to avoid floating-point
// errors in financial calculations. The client had a $0.01 bug in production
// from using floats. See ADR-042 for details.
const totalCents = items.reduce((sum, item) => sum + item.price.amount, 0);
The minimum documentation stack:
- README.md — how to run, how to deploy, required environment variables
- ADRs — why we made the important architecture decisions
- API documentation (OpenAPI/Swagger) — how to call the endpoints
- Code comments — only for things that are truly non-obvious
Tactic 6: Consistent Patterns
What is it? Using the same approach to solve the same type of problem throughout the codebase.
Example — Consistent error handling:
// ✅ All controllers follow the same pattern:
// Result: any dev understands any controller immediately
async function getUser(req, res) {
try {
const user = await userService.getById(req.params.id);
if (!user) return res.status(404).json({ error: 'User not found' });
res.json({ data: user });
} catch (error) {
logger.error('Error in getUser', { error, params: req.params });
res.status(500).json({ error: 'Internal server error' });
}
}
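Once every controller repeats the same try/catch, the pattern itself can be extracted so it can't drift. A sketch of one way to do that, assuming Express-style `req`/`res` objects; `withErrorHandling` is a made-up helper and the in-memory `userService` exists only so the example runs on its own.

```javascript
// Wraps a controller so the try/catch and the 500 response live in one place.
function withErrorHandling(name, handler, logger = console) {
  return async function wrapped(req, res) {
    try {
      await handler(req, res);
    } catch (error) {
      logger.error(`Error in ${name}`, { error, params: req.params });
      res.status(500).json({ error: 'Internal server error' });
    }
  };
}

// Hypothetical in-memory service standing in for the real one.
const userService = {
  getById: async (id) => (id === '1' ? { id: '1', name: 'Ada' } : null),
};

// The controller now contains only what is specific to it.
const getUser = withErrorHandling('getUser', async (req, res) => {
  const user = await userService.getById(req.params.id);
  if (!user) return res.status(404).json({ error: 'User not found' });
  res.json({ data: user });
});
```

The tradeoff is one more layer of indirection; for a small codebase, copying the explicit try/catch may be just as consistent and easier to read.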
To maintain consistency in a team:
- Linter and formatter (ESLint, Prettier) to auto-enforce style
- Code review — catches deviations from patterns
- Document patterns in the architecture README
- Templates for new files (controllers, services, etc.)
ADRs — The Thinking Tool
The real value
ADRs aren't only for documenting past decisions. Their real value is forcing rigorous thinking before you decide.
When you have to write:
- What problem am I actually solving?
- What options did I consider?
- Why this option for my specific context?
- What are the consequences?
...you catch bad ideas early. You catch golden hammer thinking. You catch resume-driven development.
When to write an ADR
Write an ADR when:
- The decision costs 1+ week to change if you're wrong
- The team is split between options
- You want to remember in 6 months why you chose this
- The decision affects multiple teams or components
Don't write an ADR when:
- The decision is obvious to everyone
- Trivial to change later
- You're exploring or prototyping
- Ceremony doesn't add value
Template — One Page Maximum
# ADR-[number]: [Descriptive decision title]
**Date:** YYYY-MM-DD
**Status:** [Proposed | Accepted | Deprecated | Superseded by ADR-XXX]
## Context
[2-3 sentences: What problem do we have? What constraints exist?
Why do we need to make this decision now?]
## Options Considered
1. **[Option A]** — brief description
2. **[Option B]** — brief description
3. **[Option C]** — brief description (if applicable)
## Decision
We chose **[Option X]**.
## Rationale
Why this option for OUR specific context:
- [Reason 1 — ties to your real constraints]
- [Reason 2 — ties to your real constraints]
- [Reason 3 if needed]
## Consequences
**We gain:**
- [Benefit 1]
- [Benefit 2]
**We sacrifice:**
- [Tradeoff 1]
- [Tradeoff 2]
**Next steps:** [optional — what concrete actions does this decision generate]
Real Example — ADR for Notification Service Scenario A
# ADR-001: Use PostgreSQL table as queue for email notifications
**Date:** 2026-02-22
**Status:** Accepted
## Context
We need to notify subscribers when a writer publishes. Current volume: ~400
notifications/day. Team: 3 junior developers. Infrastructure budget: $500/month.
Timeline: working system in 2 months.
## Options Considered
1. **PostgreSQL table as queue** + cron job every 5 minutes
2. **RabbitMQ** + worker pool
3. **AWS SQS** + Lambda functions
## Decision
We chose **Option 1: PostgreSQL table + cron**.
## Rationale
- 400 notifications/day is trivially handled by a PostgreSQL table
- The 3-junior team already knows PostgreSQL; no need to learn RabbitMQ
- The $500/month budget doesn't justify $150+/month of additional messaging infrastructure
- Delivery within 5 minutes is explicitly acceptable per requirements
- Simple system, easy to understand and debug from day one
## Consequences
**We gain:** Simplicity, no new infrastructure, team can maintain it from day 1
**We sacrifice:** Not real-time (~5 min delay), doesn't scale easily beyond ~10,000/day
**Next steps:** If we exceed 5,000 notifications/day, revisit and add a real queue
Quick Reference Tables
Tactics by Quality Attribute
| Quality Attribute | Main Tactics | When to Start |
|---|---|---|
| Performance | Caching, Async, DB optimization | From MVP if you have 100+ users/day |
| Scalability | Stateless, Read replicas, Queues | When one server shows >70% load |
| Availability | Health checks, Monitoring, Graceful degradation | Always in production |
| Security | Auth/Authz, HTTPS, Input validation | From day one, no exceptions |
| Maintainability | Modularity, Testing, Consistent patterns | From day one |
When NOT to use each tactic
| Tactic | Don't use when... |
|---|---|
| Caching | Data changes frequently or freshness is critical |
| Microservices / Queues | Small team, startup, scale doesn't justify it |
| Sharding | You haven't exhausted all other options first |
| Polyglot persistence | Team has no experience with the second DB |
| Redundancy | Cost outweighs the value of additional uptime |
Common Combinations
| Scenario | Recommended tactic combination |
|---|---|
| Startup web app | Stateless + Redis session + PostgreSQL + Health checks + Sentry |
| Growing mid-size app | All above + Read replicas + CDN + Background jobs |
| High-scale app | All above + Queues + Workers + Polyglot persistence + Full monitoring |
| Never in this course | Sharding (you need to be Instagram first) |
Case Study: Notification Service
Scenario A — 400 notifications/day
Context: 50 writers, 200 subscribers each, 3-junior-dev team, $500/month budget.
Recommended architecture:
PostgreSQL table: notification_queue
(id, article_id, subscriber_email, status, created_at, processed_at)
Cron job → every 5 minutes → query status='pending' → SendGrid API → update to 'sent'
Tactics used: DB as queue, cron scheduling, email via SendGrid API
Tactics explicitly NOT used: RabbitMQ (overkill), worker pools (overkill), real-time websockets (unnecessary), Redis (nothing to cache at this scale)
Why this design? Because 3 junior devs can build it, maintain it, and debug it. It meets all requirements. Additional cost: $0.
Scenario B — 1,000,000 notifications/day
Context: 5,000 writers, 10,000 subscribers each, 10-dev team + ops, $10K/month budget.
Recommended architecture:
Article published → event to Message Queue (RabbitMQ / AWS SQS)
↓
Worker Pool (3-5 workers, auto-scaling)
↓
SendGrid API with rate limiting
↓ (in parallel)
Redis for retry deduplication
↓
Monitoring: queue depth, failed jobs, delivery rates
Tactics used: Queue for decoupling, worker pool for horizontal scaling, rate limiting, Redis for deduplication, full monitoring
Why this design? Because at 1M notifications/day, a cron job would collapse. Viral spikes (article = 100K simultaneous notifications) need a buffer. The team has the capacity to operate this.
The A → B Migration
Start (Scenario A): cron + PostgreSQL table
↓ when: volume exceeds 5,000/day and cron takes too long
Step 1: add RabbitMQ, convert cron to publisher
↓ when: queue fills consistently during spikes
Step 2: add worker pool (2-3 workers)
↓ when: workers consistently fall behind
Step 3: add worker auto-scaling
↓ when: you have trouble debugging failures
Step 4: add full monitoring
Each step: triggered by PROVEN need, not hypothetical.
Redis (Paper-Server Edition) | Lesson 4 | Software Architecture for Junior Developers