Exercise: Notification Service
Lesson 4 — Architectural Tactics
Goal: Design a solution to the same problem in two radically different contexts, and show that the right tactics depend on context.
The Problem
A blogging platform needs to notify subscribers by email when a writer publishes a new article.
Same problem. Two very different contexts.
Scenario A — The Reality
The company
- Name: BlogMX (fictional)
- Size: 50 registered writers
- Subscribers: ~200 per writer on average
- Notification volume: ~400 per day
- Peak: One popular writer with 1,000 subscribers publishes
The team
- Developers: 3 junior developers
- Messaging experience: None (they know SQL, JavaScript, basic APIs)
- Infrastructure budget: $500 USD / month total (including server, DB, everything)
- Timeline: Basic notifications working in 2 months
System requirements
- When a writer publishes, notify all their subscribers by email
- Email only for now (no push notifications)
- Delivery within 5 minutes is acceptable
- Reliable: notifications cannot be permanently lost
- Simple to build and maintain (the junior team must be able to debug it)
Current stack
- Node.js + Express (API)
- PostgreSQL (database)
- Heroku (hosting, $25/month)
- SendGrid (email, free tier: 100 emails/day, $15/month for 40,000/month)
Scenario B — The Dream
The company
- Name: BlogMX (3 years later, same product)
- Size: 5,000 registered writers
- Subscribers: ~10,000 per writer on average
- Notification volume: ~1,000,000 per day
- Peak: Viral article, 100,000 simultaneous notifications
The team
- Developers: 10 developers + dedicated operations team
- Experience: Mixed — some senior, some mid-level
- Infrastructure budget: $10,000 USD / month
- Timeline: 6 months to build it properly
System requirements
- Same as Scenario A, BUT:
- Must handle extreme traffic spikes
- Delivery within 1 minute (not 5)
- Monitoring and alerts: know when something fails
- Retry logic: if an email fails, retry automatically
- No duplicate sends if there are retries
- Observable system: able to see the status of any notification
Current stack
- Node.js (multiple services)
- PostgreSQL + some read replicas
- AWS (EC2, RDS, various services)
- SendGrid Business (higher limits, reliable API)
Your Task
For EACH scenario, complete the worksheet below.
Success criteria: Your answers must be DIFFERENT for each scenario. If they're identical, you're not thinking about context.
Worksheet — Scenario A (400 notifications/day)
1. Overall Architecture
How does a notification flow from when the writer publishes to when the subscriber receives the email?
Writer publishes article
↓
[write the flow step by step here]
↓
Subscriber receives email
2. Performance Tactics
Which of these do you need? Why or why not?
| Tactic | Use it? | Justification |
|---|---|---|
| Caching (Redis/Memcached) | Yes / No | |
| Async processing | Yes / No | |
| DB optimization | Yes / No | |
| Load balancing | Yes / No | |
| Compression | Yes / No | |
3. Scalability Tactics
Which do you need at 400 notifications/day?
| Tactic | Use it? | Justification |
|---|---|---|
| Stateless services | Yes / No | |
| Horizontal scaling | Yes / No | |
| Read replicas | Yes / No | |
| Message queue (RabbitMQ, SQS) | Yes / No | |
| Worker pool | Yes / No | |
| Polyglot persistence | Yes / No | |
4. Availability Tactics
What do you need to be reliable without over-engineering?
| Tactic | Use it? | Justification |
|---|---|---|
| Server redundancy | Yes / No | |
| Health checks | Yes / No | |
| Graceful degradation | Yes / No | |
| Circuit breakers | Yes / No | |
| Database replication | Yes / No | |
| Monitoring and alerts | Yes / No | |
5. Key questions
Can this team of 3 juniors build AND maintain what you're proposing?
What happens if the email service (SendGrid) is down for 30 minutes?
What is the simplest thing that can work for this context?
Worksheet — Scenario B (1,000,000 notifications/day)
1. Overall Architecture
How does a notification flow? The flow will be more complex than in A.
Writer publishes article
↓
[write the flow step by step here]
↓
Subscriber receives email
2. Performance Tactics
Which of these do you need at this scale?
| Tactic | Use it? | Justification |
|---|---|---|
| Caching (Redis/Memcached) | Yes / No | |
| Async processing | Yes / No | |
| DB optimization | Yes / No | |
| Load balancing | Yes / No | |
| Compression | Yes / No | |
3. Scalability Tactics
At 1M notifications/day, what is now necessary?
| Tactic | Use it? | Justification |
|---|---|---|
| Stateless services | Yes / No | |
| Horizontal scaling | Yes / No | |
| Read replicas | Yes / No | |
| Message queue (RabbitMQ, SQS) | Yes / No | |
| Worker pool | Yes / No | |
| Polyglot persistence | Yes / No | |
4. Availability Tactics
What availability guarantees does this system need?
| Tactic | Use it? | Justification |
|---|---|---|
| Server redundancy | Yes / No | |
| Health checks | Yes / No | |
| Graceful degradation | Yes / No | |
| Circuit breakers | Yes / No | |
| Database replication | Yes / No | |
| Monitoring and alerts | Yes / No | |
5. Key questions
How do you handle a spike of 100,000 notifications in 1 minute (viral article)?
How do you avoid sending the same email twice if there's a retry?
How do you know the system is failing BEFORE users report it to you?
Instructor Solutions
Read this AFTER completing your worksheet. Compare your reasoning.
Solution: Scenario A — The Minimum Viable Architecture
Flow:
Writer publishes article
↓
API saves article to PostgreSQL
↓
API inserts one record into notification_queue per subscriber
(article_id, subscriber_email, status='pending')
↓
[Every 5 minutes — Cron Job]
↓
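Query: SELECT * FROM notification_queue WHERE status='pending' ORDER BY created_at LIMIT 100
(repeat in batches of 100, attempting each row at most once per run, until the backlog is drained; a single batch of 100 per run would need 50 minutes to notify the 1,000-subscriber peak)
↓
For each notification: call SendGrid API
↓
On success: update status to 'sent'. On error: increment attempts, store error_message, and leave status='pending' so the next run retries. Mark 'failed' only after a retry limit (for example, 20 attempts).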
↓
Subscriber receives email
The schema:
CREATE TABLE notification_queue (
  id               SERIAL PRIMARY KEY,
  article_id       INTEGER NOT NULL REFERENCES articles(id),
  subscriber_email VARCHAR(255) NOT NULL,
  status           VARCHAR(20) NOT NULL DEFAULT 'pending', -- pending, sent, failed
  attempts         INTEGER NOT NULL DEFAULT 0,             -- failed send attempts so far
  created_at       TIMESTAMP NOT NULL DEFAULT NOW(),
  processed_at     TIMESTAMP,                              -- NULL until the row is processed
  error_message    TEXT                                    -- last SendGrid error, if any
);
-- The index you need: the cron job filters on status, and created_at keeps the oldest rows first.
CREATE INDEX idx_notification_queue_status ON notification_queue(status, created_at);
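The cron job from the flow can be sketched in a few lines of Node.js. Treat this as a minimal sketch under assumptions, not a reference implementation: `db` and `mailer` are hypothetical interfaces standing in for a PostgreSQL client and the SendGrid SDK, `BATCH_SIZE` and `MAX_ATTEMPTS` are illustrative values, and `createMemoryDb` is an in-memory stand-in for the notification_queue table so the sketch runs without any services.

```javascript
// Sketch of the Scenario A cron job. `db` and `mailer` are hypothetical
// interfaces (stand-ins for a PostgreSQL client and the SendGrid SDK).
const BATCH_SIZE = 100;  // illustrative
const MAX_ATTEMPTS = 20; // illustrative retry limit

async function runOnce(db, mailer) {
  const seen = new Set(); // each row is attempted at most once per run
  let sent = 0;
  for (;;) {
    const batch = await db.fetchPending(BATCH_SIZE, seen);
    if (batch.length === 0) return sent; // backlog drained for this run
    for (const row of batch) {
      seen.add(row.id);
      try {
        await mailer.send(row.subscriber_email, row.article_id);
        await db.markSent(row.id);
        sent += 1;
      } catch (err) {
        // Leave the row pending so the next run retries it;
        // it becomes 'failed' only after MAX_ATTEMPTS.
        await db.recordFailure(row.id, String(err.message), MAX_ATTEMPTS);
      }
    }
  }
}

// In-memory stand-in for the notification_queue table (assumption for the demo).
function createMemoryDb(rows) {
  const byId = (id) => rows.find((r) => r.id === id);
  return {
    async fetchPending(limit, excludeIds) {
      return rows
        .filter((r) => r.status === 'pending' && !excludeIds.has(r.id))
        .sort((a, b) => a.id - b.id) // oldest first
        .slice(0, limit);
    },
    async markSent(id) {
      byId(id).status = 'sent';
    },
    async recordFailure(id, message, maxAttempts) {
      const row = byId(id);
      row.attempts += 1;
      row.error_message = message;
      if (row.attempts >= maxAttempts) row.status = 'failed';
    },
  };
}
```

A real `fetchPending` would run the SELECT from the flow, and the script would be started by the scheduler every 5 minutes. The `seen` set is what keeps a run from looping forever while SendGrid is down: failed rows stay pending, but they are not picked up again until the next run.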
Tactics used (and why):
| Tactic | Used | Reason |
|---|---|---|
| Caching | ❌ No | Nothing worth caching at 400/day |
| Async processing | ✅ Yes | The table-as-queue is the async processing: the publish request returns before any email is sent |
| DB optimization | ✅ Yes | The index on status is needed |
| Load balancing | ❌ No | One server is more than enough for this volume |
| Message queue (RabbitMQ) | ❌ No | Overkill for 400/day, team doesn't know it |
| Worker pool | ❌ No | The cron job is the "worker" |
| Read replicas | ❌ No | No read load that justifies it |
| Health checks | ✅ Yes | Always in production, practically free |
| Monitoring | ✅ Minimal | At minimum, Sentry for errors plus an uptime monitor |
Can the 3-junior team maintain this? Yes. It's a PostgreSQL table and a script that runs every 5 minutes. Any junior who knows SQL can debug it.
What if SendGrid is down?
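Failed sends stay at status='pending' (the job increments attempts instead of marking a row 'failed' on the first error), so the next cron run retries them 5 minutes later. A 30-minute outage delays delivery, but no notification is lost.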
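How many retries does such an outage consume? A small sketch, assuming the job counts failed sends in the `attempts` column and gives up at some limit (the function name is hypothetical):

```javascript
// Number of cron runs that land inside an outage, which is also the number
// of failed attempts each pending row accumulates before SendGrid recovers.
function attemptsDuringOutage(outageMinutes, cronIntervalMinutes) {
  return Math.ceil(outageMinutes / cronIntervalMinutes);
}

console.log(attemptsDuringOutage(30, 5)); // 6
```

Any retry limit therefore has to sit comfortably above 6, or a 30-minute outage would turn delayed notifications into lost ones.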
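Additional infrastructure cost: $0. Everything runs on the server they already have. The only new spend is SendGrid's $15/month plan, because ~400 emails/day exceeds the free tier's 100/day.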
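The email volume can be checked against the SendGrid tiers listed in the stack. This is a back-of-the-envelope sketch: the figures come from the scenario description, the 30-day month is an assumption, and the spend total counts only the Heroku and SendGrid line items named there.

```javascript
// Figures from the Scenario A description.
const EMAILS_PER_DAY = 400;
const FREE_TIER_PER_DAY = 100;
const PAID_TIER_PER_MONTH = 40000;
const PAID_TIER_USD = 15;
const HEROKU_USD = 25;
const BUDGET_USD = 500;

function emailPlan(emailsPerDay, daysPerMonth = 30) {
  const perMonth = emailsPerDay * daysPerMonth;
  return {
    perMonth,
    fitsFreeTier: emailsPerDay <= FREE_TIER_PER_DAY,
    fitsPaidTier: perMonth <= PAID_TIER_PER_MONTH,
  };
}

const scenarioAPlan = emailPlan(EMAILS_PER_DAY);
const listedMonthlySpend = HEROKU_USD + PAID_TIER_USD;

console.log(scenarioAPlan);
// { perMonth: 12000, fitsFreeTier: false, fitsPaidTier: true }
console.log(`${listedMonthlySpend} of ${BUDGET_USD} USD`); // 40 of 500 USD
```

At 12,000 emails/month the paid tier has more than three times the needed headroom, so even several peak days fit without another upgrade.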
Solution: Scenario B — The Architecture That Scales
Flow:
Writer publishes article
↓
API saves article to PostgreSQL
↓
API publishes event to Message Queue: { articleId, authorId, publishedAt }
↓
Subscriber Lookup Worker:
→ Reads event from queue
→ Queries PostgreSQL: SELECT email FROM subscriptions WHERE author_id = ?
→ For each subscriber: publishes individual message to Notification Queue
↓
Email Worker Pool (3-5 workers at baseline, auto-scaling under load):
→ Consumes messages from Notification Queue
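→ Builds deduplication key from article_id and subscriber_email (for example email:sent:{article_id}:{subscriber_email})
→ Claims the key in Redis atomically (SET with NX, TTL: 24 hours); if the key already exists, the email was already handled, so skip it
→ If the claim succeeds: calls SendGrid API
→ If the send fails: deletes the key so the retry can send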
→ ACKs the queue (marks as processed)
↓
Monitoring:
→ Queue depth (Datadog/Grafana)
→ Failed job rate
→ Email delivery rate
→ Worker health checks
↓
Subscriber receives email (within ~1 minute)
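The Subscriber Lookup Worker step (one event in, one message per subscriber out) can be sketched like this. It is a sketch under assumptions: `lookupSubscribers` and `publish` are hypothetical callbacks standing in for the PostgreSQL query and the Notification Queue client, and the message shape is illustrative.

```javascript
// Fan-out: turn one "article published" event into one message per subscriber.
async function fanOut(event, lookupSubscribers, publish) {
  const emails = await lookupSubscribers(event.authorId);
  let published = 0;
  for (const email of emails) {
    await publish({
      articleId: event.articleId,
      subscriberEmail: email,
      // Stable per (article, subscriber): a redelivered event yields the same key.
      dedupKey: `email:sent:${event.articleId}:${email}`,
    });
    published += 1;
  }
  return published;
}

// Usage with in-memory stand-ins for the database and the queue.
const notificationQueue = [];
fanOut(
  { articleId: 7, authorId: 3, publishedAt: '2024-01-01T00:00:00Z' },
  async () => ['a@example.com', 'b@example.com'],
  async (message) => { notificationQueue.push(message); },
).then((count) => console.log(count)); // 2
```

Publishing one small message per subscriber is what lets the email workers scale independently: each message is a unit of work that any worker can pick up and retry on its own.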
Tactics used (and why):
| Tactic | Used | Reason |
|---|---|---|
| Message queue | ✅ Yes | Absorbs spikes, decouples publish from send |
| Worker pool | ✅ Yes | Horizontal scaling of processing |
| Worker auto-scaling | ✅ Yes | Viral spikes of 100K messages |
| Read replicas | ✅ Yes | Subscriber lookup can be heavy |
| Redis (deduplication) | ✅ Yes | Retries at 1M/day will happen; dedup prevents double-send |
| Full monitoring | ✅ Yes | You cannot debug 1M/day without it |
| Circuit breaker | ✅ Yes | If SendGrid is slow, don't block the workers |
| Health checks | ✅ Yes | Always |
| Polyglot persistence | Maybe | If email analytics needs a specialized DB |
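The circuit breaker row deserves a concrete picture. Below is a deliberately small sketch, not a production library: the class, its option names, and the default thresholds are all hypothetical, and `now` is injectable only so the behavior can be exercised without waiting. For the "SendGrid is slow" case, the wrapped function would also need a timeout so that a slow call counts as a failure.

```javascript
// Minimal circuit breaker: after `failureThreshold` consecutive failures it
// rejects calls immediately for `cooldownMs`, then lets a trial call through.
class CircuitBreaker {
  constructor({ failureThreshold = 5, cooldownMs = 30000, now = () => Date.now() } = {}) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
    this.failures = 0;
    this.openedAt = null; // null means the circuit is closed
  }

  async call(fn) {
    if (this.openedAt !== null && this.now() - this.openedAt < this.cooldownMs) {
      throw new Error('circuit open: skipping call');
    }
    // Closed, or cooldown elapsed (half-open): try the real call.
    try {
      const result = await fn();
      this.failures = 0;
      this.openedAt = null; // success closes the circuit
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.failureThreshold) this.openedAt = this.now();
      throw err;
    }
  }
}
```

A worker would wrap each send, for example `breaker.call(() => sendEmail(message))` with a hypothetical `sendEmail`, and leave the message on the queue when the breaker rejects the call, so the backlog waits in the queue instead of tying up workers.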
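How do you handle the viral spike of 100,000 messages? The message queue absorbs them. The API publishes one event and returns immediately, the autoscaler adds workers as queue depth grows, and the workers drain the backlog at their own pace. Without the queue, the API would have to make 100,000 SendGrid calls as part of the publish request, and it would time out or collapse.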
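It helps to put numbers on the spike. The sketch below works out the send rate that the 1-minute target implies and the worker count that follows from it. The per-worker throughput is a hypothetical figure chosen for illustration; measure your own workers, and check the email provider's rate limits, before trusting the result.

```javascript
// Messages per second needed to clear a backlog before the deadline.
function requiredRate(messages, deadlineSeconds) {
  return messages / deadlineSeconds;
}

// Workers needed at a given sustained per-worker throughput.
function workersNeeded(messages, deadlineSeconds, perWorkerPerSecond) {
  return Math.ceil(requiredRate(messages, deadlineSeconds) / perWorkerPerSecond);
}

const PER_WORKER_PER_SECOND = 50; // hypothetical; depends on concurrency and API latency

console.log(Math.round(requiredRate(100000, 60)));             // 1667
console.log(workersNeeded(100000, 60, PER_WORKER_PER_SECOND)); // 34
```

Under that assumption, the baseline pool is nowhere near enough during a viral spike, which is why auto-scaling and queue-depth monitoring appear together in the tactics table.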
How do you avoid duplicates on retries?
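// Before sending each email (assumes an ioredis-style client):
const dedupKey = `email:sent:${articleId}:${subscriberEmail}`;
// Claim the key atomically. A separate exists-then-set lets two workers both pass the check.
// SET ... NX succeeds only if the key does not exist yet: 'OK' on success, null otherwise.
const claimed = await redis.set(dedupKey, '1', 'EX', 86400, 'NX'); // TTL: 24 hours
if (claimed !== 'OK') return; // already claimed by another attempt, skip
try {
  await sendGrid.send({ to: subscriberEmail, ... });
} catch (err) {
  await redis.del(dedupKey); // release the claim so the retry can send
  throw err; // let the queue redeliver the message
}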
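To watch deduplication behave under a redelivery, here is a self-contained sketch that swaps Redis for an in-memory stand-in. It illustrates one way to make the send idempotent: claim the key atomically before sending, and release it if the send fails. `FakeRedis`, `setIfAbsent`, `sendOnce`, and `fakeSend` are hypothetical names; `FakeRedis` mimics only set-if-absent and delete semantics and ignores expiry.

```javascript
// In-memory stand-in for Redis: set-if-absent and delete only, no TTL.
class FakeRedis {
  constructor() { this.keys = new Map(); }
  async setIfAbsent(key, value) {
    if (this.keys.has(key)) return null; // same result shape as SET ... NX
    this.keys.set(key, value);
    return 'OK';
  }
  async del(key) { return this.keys.delete(key) ? 1 : 0; }
}

async function sendOnce(redis, send, articleId, subscriberEmail) {
  const dedupKey = `email:sent:${articleId}:${subscriberEmail}`;
  const claimed = await redis.setIfAbsent(dedupKey, '1');
  if (claimed !== 'OK') return 'skipped'; // another attempt already owns this email
  try {
    await send(subscriberEmail);
    return 'sent';
  } catch (err) {
    await redis.del(dedupKey); // release the claim so a retry can send
    throw err;
  }
}

async function demo() {
  const redis = new FakeRedis();
  const outbox = [];
  const fakeSend = async (to) => { outbox.push(to); };
  // The queue delivers the same message twice (for example, after a lost ACK).
  const first = await sendOnce(redis, fakeSend, 42, 'reader@example.com');
  const second = await sendOnce(redis, fakeSend, 42, 'reader@example.com');
  return { first, second, emailsSent: outbox.length };
}

demo().then((result) => console.log(result));
// { first: 'sent', second: 'skipped', emailsSent: 1 }
```

The trade-off of claiming first is that a worker crashing between the claim and the send drops that email until the key expires. Storing the key only after a successful send has the opposite failure mode, an occasional duplicate.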
Additional cost: ~$1,500-2,000/month more in infrastructure. Justified by the volume.
The Final Comparison
| Aspect | Scenario A | Scenario B |
|---|---|---|
| Architecture | Cron + PostgreSQL table | Queue + workers + Redis |
| Complexity | Low | High |
| Additional infra cost | $0 | $1,500-2,000/month |
| 3 juniors can build it | ✅ Yes | ⚠️ With senior help |
| Scales to 1M/day | ❌ No | ✅ Yes |
| Delivery time | ~5 minutes | ~1 minute |
| Deduplication | Simple (status in DB) | Redis + ACK logic |
The A → B Migration Path
The beauty: you can start at A and evolve toward B when you actually need it.
Start: cron + PostgreSQL table ← start here
Trigger 1: volume > 5,000 notifications/day, cron takes > 5 minutes
Action: Add RabbitMQ. Convert cron into event publisher.
Workers consume from the queue.
Trigger 2: Queue fills consistently during spikes
Action: Add second and third worker.
Trigger 3: Workers consistently falling behind
Action: Configure worker auto-scaling.
Trigger 4: You start having failures you can't debug
Action: Add full monitoring (Datadog/Grafana).
Trigger 5: Retries are generating duplicates
Action: Add Redis for deduplication.
Each step is taken when there is evidence it's needed. Not before.
This is incremental architecture. You don't build Scenario B from day one "just in case."
Final Reflection
Before next week, reflect:
Where in your current work does this same problem exist — architecture decisions that ignore the real context of the team and budget?
If you had to defend Scenario A (the simple solution) to a client who wants "enterprise-grade," how would you explain it?
What is the concrete trigger that would make you move from Scenario A to Scenario B in your context?