Monitoring and Observability: A Complete Guide
Modern applications require comprehensive observability to maintain reliability and performance. This guide covers monitoring, logging, tracing, and observability best practices.
Monitoring vs Observability
Monitoring tells you when something is wrong:
- Predefined metrics and alerts
- Known failure modes
- System health dashboards
Observability helps you understand why:
- Explore unknown issues
- Debug complex systems
- Understand system behavior
Both are essential for reliable systems.
The Three Pillars of Observability
1. Metrics
Numerical measurements over time:
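// Prometheus metrics example
const prometheus = require('prom-client')

// Counter: monotonically increasing
const httpRequestsTotal = new prometheus.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'path', 'status'],
})

// Gauge: can go up or down
const activeConnections = new prometheus.Gauge({
  name: 'active_connections',
  help: 'Number of active connections',
})

// Record one request; use the route template as `path`, not the raw URL,
// so the number of distinct label values stays bounded
httpRequestsTotal.inc({ method: 'GET', path: '/users/:id', status: 200 })

// Track connections as they open and close
activeConnections.inc()
activeConnections.dec()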
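Because a counter only ever increases, dashboards usually plot its rate of change rather than its raw value. The sketch below is a simplified, dependency-free illustration of that idea. `counterIncrease`, `counterRate`, and the sample shape are hypothetical names, not part of prom-client, and the reset handling only approximates what a query engine such as Prometheus does.

```javascript
// Hypothetical helper: total increase of a counter across time-ordered samples.
// A drop in value is treated as a process restart (counter reset to zero).
function counterIncrease(samples) {
  let increase = 0
  for (let i = 1; i < samples.length; i++) {
    const prev = samples[i - 1].value
    const curr = samples[i].value
    // After a reset the counter restarted from 0, so the whole current value is new.
    increase += curr >= prev ? curr - prev : curr
  }
  return increase
}

// Hypothetical helper: average per-second rate over the sampled window.
function counterRate(samples) {
  if (samples.length < 2) return 0
  const first = samples[0].timestampMs
  const last = samples[samples.length - 1].timestampMs
  const seconds = (last - first) / 1000
  return seconds > 0 ? counterIncrease(samples) / seconds : 0
}

// Illustrative samples scraped every 15 seconds; the process restarts before the third scrape.
const samples = [
  { timestampMs: 0, value: 100 },
  { timestampMs: 15000, value: 130 },
  { timestampMs: 30000, value: 12 },
  { timestampMs: 45000, value: 42 },
]

console.log(counterIncrease(samples)) // 72
console.log(counterRate(samples)) // 1.6
```

A gauge needs no such treatment: its current value is already the quantity of interest.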
2. Logs
Structured event records:
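// Structured logging
const pino = require('pino')

const logger = pino({
  level: 'info',
  formatters: {
    // Emit the level as its label ("info") instead of pino's numeric default (30)
    level: (label) => ({ level: label }),
  },
})

// `user` and `req` are assumed to come from the surrounding request handler
logger.info({
  event: 'user_login',
  userId: user.id,
  ip: req.ip,
  userAgent: req.headers['user-agent'],
})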
3. Traces
Follow requests across services:
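// OpenTelemetry tracing
// Without a registered OpenTelemetry SDK, the API hands out no-op spans
const { trace, SpanStatusCode } = require('@opentelemetry/api')

const tracer = trace.getTracer('my-service')

async function processOrder(order) {
  // startActiveSpan makes this span the parent of any spans started inside the callback
  return tracer.startActiveSpan('process-order', async (span) => {
    span.setAttribute('orderId', order.id)
    try {
      await validateOrder(order)
      await chargePayment(order)
      await fulfillOrder(order)
    } catch (err) {
      // Mark the span as failed so the error is visible in the trace
      span.recordException(err)
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message })
      throw err
    } finally {
      span.end()
    }
  })
}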
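A trace can only follow a request across services if each service passes its trace context to the next one, typically in the W3C Trace Context `traceparent` header. OpenTelemetry's propagators normally handle this automatically. The sketch below only illustrates what travels on the wire. `parseTraceparent` and `nextTraceparent` are hypothetical helpers, and the parser accepts version `00` only, which is a simplification.

```javascript
const { randomBytes } = require('node:crypto')

// traceparent layout: version "00", 32 hex trace id, 16 hex parent span id, 2 hex flags.
const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/

// Hypothetical helper: parse an incoming header, or return null if it is missing or invalid.
function parseTraceparent(header) {
  const match = TRACEPARENT_RE.exec(header || '')
  if (!match) return null
  const [, traceId, parentSpanId, flags] = match
  // All-zero ids are invalid.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) return null
  return { traceId, parentSpanId, flags }
}

// Hypothetical helper: build the header for an outgoing call.
// The trace id is inherited so both services land in the same trace;
// only the span id changes, identifying the caller's current span.
function nextTraceparent(incomingHeader) {
  const parent = parseTraceparent(incomingHeader)
  const traceId = parent ? parent.traceId : randomBytes(16).toString('hex')
  const spanId = randomBytes(8).toString('hex')
  const flags = parent ? parent.flags : '01' // 01 = sampled
  return `00-${traceId}-${spanId}-${flags}`
}

// Illustrative incoming header
const incoming = '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01'
const outgoing = nextTraceparent(incoming)

console.log(outgoing) // same trace id as `incoming`, new random span id
```

When no valid header arrives, `nextTraceparent` starts a new trace, which is what the first service in a request path would do.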
Best Practices
Use Structured Logging
- JSON format for easy parsing
- Include correlation IDs
- Log at appropriate levels
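The three points above can be sketched without any logging library, using Node's built-in `AsyncLocalStorage` to attach a correlation ID to every log line written during a request. This is a minimal illustration, not a replacement for a real logger. `createLogger`, `withCorrelationId`, and the `correlationId` field name are hypothetical.

```javascript
const { AsyncLocalStorage } = require('node:async_hooks')
const { randomUUID } = require('node:crypto')

// Holds per-request context so nested code can log the correlation id
// without it being passed through every function call.
const requestContext = new AsyncLocalStorage()

const LEVELS = { debug: 20, info: 30, warn: 40, error: 50 }

// Hypothetical minimal logger: one JSON object per line, filtered by level.
function createLogger({ level = 'info', write = (line) => process.stdout.write(line + '\n') } = {}) {
  const log = (name, fields) => {
    if (LEVELS[name] < LEVELS[level]) return
    const context = requestContext.getStore() || {}
    write(JSON.stringify({ level: name, time: new Date().toISOString(), ...context, ...fields }))
  }
  return {
    debug: (fields) => log('debug', fields),
    info: (fields) => log('info', fields),
    warn: (fields) => log('warn', fields),
    error: (fields) => log('error', fields),
  }
}

// Reuse the caller's id when one is supplied, otherwise mint a new one.
function withCorrelationId(incomingId, fn) {
  return requestContext.run({ correlationId: incomingId || randomUUID() }, fn)
}

const logger = createLogger()

withCorrelationId('req-123', () => {
  logger.debug({ event: 'cache_lookup' }) // dropped: below the configured level
  logger.info({ event: 'user_login', userId: 42 }) // written with correlationId "req-123"
})
```

In a web server, `withCorrelationId` would wrap each request handler, reading the incoming ID from a request header if the caller sent one.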
Define Key Metrics
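- Request rate: requests per second
- Error rate: fraction of requests that fail
- Duration (latency): track percentiles such as p95, not only averages
- Saturation: how close a resource such as CPU, memory, or a queue is to its limit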
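As a rough illustration of how these four numbers are derived, the sketch below computes them from a list of completed requests. In practice a metrics backend does this from counters and histograms. The `summarize` and `saturation` helpers, the sample shape, and the choice to count only 5xx responses as errors are all assumptions made for the example.

```javascript
// Hypothetical sample shape: one record per completed request in the window.
// status is the HTTP status code, durationMs the time taken to respond.
function summarize(requests, windowSeconds) {
  const total = requests.length
  const errors = requests.filter((r) => r.status >= 500).length
  const durations = requests.map((r) => r.durationMs).sort((a, b) => a - b)

  // Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
  const percentile = (p) => {
    if (durations.length === 0) return null
    const rank = Math.ceil((p / 100) * durations.length)
    return durations[Math.max(rank, 1) - 1]
  }

  return {
    requestRate: total / windowSeconds, // requests per second
    errorRate: total === 0 ? 0 : errors / total, // fraction of requests that failed
    p50Ms: percentile(50),
    p95Ms: percentile(95),
  }
}

// Saturation: how close a bounded resource is to its limit (0 = idle, 1 = full).
function saturation(inUse, capacity) {
  return capacity > 0 ? inUse / capacity : 0
}

// Illustrative data: ten requests observed over a five-second window.
const requests = [
  { status: 200, durationMs: 12 },
  { status: 200, durationMs: 15 },
  { status: 200, durationMs: 18 },
  { status: 500, durationMs: 950 },
  { status: 200, durationMs: 20 },
  { status: 200, durationMs: 14 },
  { status: 200, durationMs: 16 },
  { status: 503, durationMs: 30 },
  { status: 200, durationMs: 13 },
  { status: 200, durationMs: 17 },
]

console.log(summarize(requests, 5))
console.log(saturation(45, 50)) // e.g. 45 of 50 pooled connections in use
```

In this sample the median is 16 ms while the 95th percentile is 950 ms and the mean is 110.5 ms, which shows how an average can hide the slow requests that users actually notice.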
Implement Alerting
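- Alert on symptoms that users feel (errors, slow responses), not on every possible cause
- Reduce alert fatigue: every alert should be actionable
- Link a runbook from each alert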
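One way to combine these three points is a rule that watches a user-visible symptom, notifies only when the problem is sustained, and carries a runbook link. The sketch below is a simplified, hypothetical evaluator, not the rule format of any particular alerting tool. `createAlertRule`, its options, and the runbook URL are placeholders.

```javascript
// Hypothetical alert rule: fire only when a user-visible symptom (error rate)
// stays above the threshold for several consecutive evaluations.
function createAlertRule({ name, threshold, forEvaluations, runbookUrl }) {
  let breaches = 0
  let firing = false

  // Call once per evaluation interval with the latest error rate (0 to 1).
  // Returns a notification object on a state change, otherwise null.
  return function evaluate(errorRate) {
    breaches = errorRate > threshold ? breaches + 1 : 0

    if (!firing && breaches >= forEvaluations) {
      firing = true
      return { name, state: 'firing', value: errorRate, runbookUrl }
    }
    if (firing && breaches === 0) {
      firing = false
      return { name, state: 'resolved', value: errorRate, runbookUrl }
    }
    return null // no state change, so no repeat notification
  }
}

const evaluate = createAlertRule({
  name: 'HighErrorRate',
  threshold: 0.05,
  forEvaluations: 3,
  runbookUrl: 'https://runbooks.example.com/high-error-rate', // placeholder URL
})

// A single spike is ignored; a sustained breach notifies once; recovery notifies once.
const errorRates = [0.01, 0.2, 0.02, 0.08, 0.09, 0.1, 0.12, 0.01]
const notifications = errorRates.map((rate) => evaluate(rate)).filter(Boolean)

console.log(notifications)
```

Requiring several consecutive breaches trades a slightly slower alert for far fewer pages from brief spikes, which is usually the right trade for reducing alert fatigue.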
Conclusion
Effective observability requires combining metrics, logs, and traces. Start with the basics and iterate based on what incidents teach you about your system's blind spots.



