The explanation is as follows:
Monitoring answers: "Is something wrong?" and observability answers: "Why is it wrong?" You need both.
The explanation is as follows:
In practice, you need three things. Metrics are used to detect problems, logs to explain errors, and traces to discover where latency comes from. If your metrics are missing or wrong, you will never know that something is failing. And if you don't know something is failing, you never check logs and traces, which is why metrics are the entry point of any investigation.
Monitoring Challenges
The explanation is as follows:
Most of the time, teams don't have a strategy for monitoring. It is the last backlog item to be picked up before the final production release. One service team adds a dashboard, another adds alerts, and a third team introduces a different naming convention. Six months down the line, you get duplicate metrics, inconsistent naming, no standard dashboards, and alerts that nobody trusts. Eventually, teams ignore alerts, stop relying on monitoring, and fall back to guesswork. That is a dangerous place to be.

One pattern I have seen repeatedly is metric explosion without clarity. A service exposes 400 metrics, and nobody knows which one matters. Good monitoring is not about collecting more metrics. It is about collecting the right metrics. A production-ready service rarely needs more than 10-20 core metrics and a small number of critical alerts. Everything else is an investigation detail, not an operational signal.
The author says four metrics should be tracked.
1. Latency: Earliest Signal
Here the author says that metrics showing averages are not useful. Looking at percentiles such as p50 and p90 is better. If p99 starts to move, it means there is a problem.
Example
We do it like this. The query below shows the endpoints in order-service.
uri: groups metrics per endpoint
le: means "less than or equal"
# Latency (p50, p95, p99)
histogram_quantile(0.50,
  sum(rate(http_server_requests_seconds_bucket{application="order-service"}[5m])) by (uri, le))
histogram_quantile(0.95,
  sum(rate(http_server_requests_seconds_bucket{application="order-service"}[5m])) by (uri, le))
histogram_quantile(0.99,
  sum(rate(http_server_requests_seconds_bucket{application="order-service"}[5m])) by (uri, le))
2. Traffic: System Load
The explanation is as follows:
Traffic metrics include requests per second, events per second, messages per second, and batch rates. Most incidents begin with a traffic change, sometimes expected and sometimes not.

A common pattern I have observed again and again: traffic increases, and that increases latency. Integrations slow down, and errors appear. Without traffic metrics, the root cause looks mysterious. With traffic metrics, it becomes obvious.

Prometheus query example for requests per second:

rate(http_server_requests_seconds_count[1m])

This metric alone explains a surprising number of incidents.
3. Errors: The Most Misunderstood Signal
The important points here are:
- Error rate is more important than error count
- 4xx vs 5xx: a critical distinction
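A minimal PromQL sketch of these two points, assuming the same Micrometer metric names used in the latency example and a status label on each request series (both are assumptions about the setup):

```promql
# 5xx error rate as a fraction of all requests (e.g. 0.02 = 2% failing)
sum(rate(http_server_requests_seconds_count{application="order-service", status=~"5.."}[5m]))
/
sum(rate(http_server_requests_seconds_count{application="order-service"}[5m]))
```

The ratio stays meaningful whether traffic is 10 rps or 10,000 rps, which is why the rate matters more than the raw count. Filtering on status=~"5.." keeps client-side 4xx responses out of the alert, since those usually point at the caller, not the service.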
4. Saturation: Where Failures Actually Begin
- CPU and memory: necessary but not enough
- Connection pool usage
- Kubernetes saturation signals
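As a sketch of the bullet points above, here are two saturation queries, assuming cAdvisor container metrics in Kubernetes and a HikariCP connection pool exported via Micrometer (the exact metric and label names depend on the actual setup):

```promql
# CPU throttling ratio: what fraction of CFS scheduling periods were throttled
rate(container_cpu_cfs_throttled_periods_total[5m])
/
rate(container_cpu_cfs_periods_total[5m])

# Connection pool usage: active connections vs pool maximum
hikaricp_connections_active{application="order-service"}
/
hikaricp_connections_max{application="order-service"}
```

When either ratio approaches 1.0, requests start queueing before CPU or error metrics show anything, which is why saturation is where failures actually begin.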