Observability and problem management

Build observability for the worst moment

Jonas Simonsson

I have spent a lot of time in rooms full of green dashboards, and I have watched what people actually do the moment something goes wrong. Mostly, they do not look at the screens. They open a terminal, message the one person who really understands the system, and start working from memory. The displays that looked like control turn out to be decoration the instant control is needed.

That gap is the whole subject. It is easy to build something that looks like visibility during the calm hours, when nobody needs it. It is much harder to build something a person can actually use in the few tense minutes when a decision has to be made and the cost of getting it wrong is real. Most monitoring is built for the first situation and quietly fails the second, and the failure only shows up at the worst possible time.

Monitoring and observability are not the same activity

The two words get used interchangeably, and the blur hides an important difference. Monitoring tells you whether the things you already thought to watch are within the range you already decided was acceptable. It is good at known problems. Observability is supposed to let you ask a question you did not anticipate, about a situation you have not seen before, and get an answer you can act on.

A thousand pre-built panels do not add up to the second thing. You can have complete monitoring and still be blind the moment reality steps outside the boxes you drew in advance. And reality, in any system that matters, eventually does.
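
The difference can be made concrete with a small sketch. Everything in it is an assumption made for illustration: the event fields (route, region, status, latency_ms), the sample data, the threshold, and both function names. The first function is monitoring in the sense above, one pre-chosen value checked against one pre-chosen limit. The second is closer to observability, because the field to break the data down by is chosen at the moment the question is asked.

```python
from collections import defaultdict

# Hypothetical raw request events; the field names are illustrative only.
EVENTS = [
    {"route": "/pay", "region": "eu", "status": 200, "latency_ms": 120},
    {"route": "/pay", "region": "eu", "status": 500, "latency_ms": 950},
    {"route": "/pay", "region": "us", "status": 200, "latency_ms": 110},
    {"route": "/cart", "region": "eu", "status": 200, "latency_ms": 90},
    {"route": "/cart", "region": "us", "status": 200, "latency_ms": 95},
    {"route": "/pay", "region": "eu", "status": 500, "latency_ms": 990},
]


def error_rate_alert(events, threshold=0.5):
    """Monitoring: one value decided in advance, against one limit decided in advance."""
    if not events:
        return False
    errors = sum(1 for e in events if e["status"] >= 500)
    return errors / len(events) > threshold


def error_rate_by(events, field):
    """Ad hoc question: error rate broken down by any field, chosen at query time."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for e in events:
        totals[e[field]] += 1
        if e["status"] >= 500:
            errors[e[field]] += 1
    return {key: errors[key] / totals[key] for key in totals}


if __name__ == "__main__":
    print(error_rate_alert(EVENTS))        # overall check against the fixed threshold
    print(error_rate_by(EVENTS, "route"))  # same data, question chosen on the spot
    print(error_rate_by(EVENTS, "region"))
```

With this sample data the overall alert stays quiet, because two failures in six requests is under the threshold, while the breakdown by route shows that half of the /pay requests are failing. The alert was not wrong. It answered the only question it was built to answer.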

The question is the unit of value

A useful way to judge any piece of observability is to ask what decision it exists to support. Not what it displays. What it helps someone decide. If you cannot name the decision, the panel is decoration, and decoration is not free. It takes up attention, it adds to the noise, and it slowly trains people to ignore the screen, which is the worst outcome of all, because the genuinely important signal gets ignored along with everything else.

If you cannot name the decision a signal supports, it is not observability. It is wallpaper.

The questions that matter under pressure are concrete, and they are almost always about connection rather than any single value.

Questions like these are not answered by a single number turning red. They are answered by relating signals to each other and to time. A system that shows you a hundred values but cannot help you connect them has handed you data and kept the understanding for itself.
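
One way to picture relating signals to each other and to time is to join a metric series against a log of changes. The sketch below is a minimal illustration under stated assumptions: the sample data, the change descriptions, the limit, the 30-minute window, and the function names are all hypothetical. It finds the moment an error rate first crossed a limit, then lists the changes that landed shortly before that moment.

```python
from datetime import datetime, timedelta

# Hypothetical data: (timestamp, error_rate) samples and a log of changes.
SAMPLES = [
    (datetime(2024, 1, 1, 12, 0), 0.01),
    (datetime(2024, 1, 1, 12, 5), 0.01),
    (datetime(2024, 1, 1, 12, 10), 0.02),
    (datetime(2024, 1, 1, 12, 15), 0.20),
    (datetime(2024, 1, 1, 12, 20), 0.25),
]
CHANGES = [
    (datetime(2024, 1, 1, 9, 30), "rotate TLS certificate"),
    (datetime(2024, 1, 1, 12, 12), "deploy checkout v2"),
    (datetime(2024, 1, 1, 12, 18), "scale up workers"),
]


def first_breach(samples, limit):
    """Return the timestamp of the first sample above the limit, or None."""
    for ts, value in samples:
        if value > limit:
            return ts
    return None


def changes_before(changes, moment, window):
    """Return changes that landed within `window` before `moment`, newest first."""
    start = moment - window
    hits = [(ts, what) for ts, what in changes if start <= ts <= moment]
    return sorted(hits, reverse=True)


if __name__ == "__main__":
    onset = first_breach(SAMPLES, limit=0.1)
    if onset is not None:
        for ts, what in changes_before(CHANGES, onset, timedelta(minutes=30)):
            print(ts.isoformat(), what)
```

A change that lands just before the onset is a lead, not a proven cause. The point of the sketch is only that the question "what changed just before this started?" needs two signals and a shared clock, which no single red number provides.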

Design for the worst moment, not the calm one

Treating observability as decision support changes what you build. You stop asking what you can measure and start asking what you will need to know when things are at their worst, with the least time and the most pressure. That reframing tends to produce fewer signals, not more, and each one earns its place. It is a harder design problem, because you are designing for the moment of stress rather than the moment of leisurely review, but it is the only moment that actually tests the work.

It changes how you treat history too. The same data that supports a live decision is what lets you learn afterwards, so that the next event of the same kind is recognised in seconds instead of rediscovered from nothing. This is where observability meets problem management. Good observability is what turns an incident from a one-time scramble into something the organisation can learn from. Without it, every incident arrives as if it were the first one, and the same problem keeps coming back wearing a different face.
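
Recognising the next event of the same kind can be sketched as a lookup against recorded history. The example below assumes, purely for illustration, that each past incident was stored as a set of observed symptoms with a known cause. The incident identifiers, symptom strings, causes, and the 0.5 cut-off are all invented. It uses Jaccard similarity over symptom sets, which is one simple choice among many and not a recommendation of a specific method.

```python
# Hypothetical incident history: each record is a set of symptoms plus a cause.
PAST_INCIDENTS = {
    "INC-1": {
        "symptoms": {"queue depth rising", "worker timeouts", "db connections maxed"},
        "cause": "connection pool exhausted",
    },
    "INC-2": {
        "symptoms": {"disk full", "write errors"},
        "cause": "log rotation disabled",
    },
}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of overlap divided by size of union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def closest_incident(symptoms, history, min_score=0.5):
    """Return (incident_id, score) for the best match, or (None, score) if too weak."""
    best_id, best_score = None, 0.0
    for incident_id, record in history.items():
        score = jaccard(symptoms, record["symptoms"])
        if score > best_score:
            best_id, best_score = incident_id, score
    if best_score >= min_score:
        return best_id, best_score
    return None, best_score


if __name__ == "__main__":
    current = {"worker timeouts", "db connections maxed", "queue depth rising", "high cpu"}
    match, score = closest_incident(current, PAST_INCIDENTS)
    if match is not None:
        print(match, score, PAST_INCIDENTS[match]["cause"])
```

The lookup is only as good as what was written down last time. If the previous incident left no record of what was seen, there is nothing to match against, and the event arrives as if it were the first.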

The test

There is a simple test for whether your observability does its job. The next time something breaks, watch where people actually look. If they turn to what you built and it helps them decide, you have decision support. If they ignore it and start digging elsewhere, then what you built was a display, and the real observability still lives in a few people's heads. The goal is to get it out of their heads and into something the whole organisation can use, under pressure, to decide well.

Dashboards are fine. They are the surface. What matters is whether, underneath the surface, you have built something that helps a person answer the question in front of them at the moment it is hardest to answer. That is the entire job. The rest is decoration, and decoration does not survive contact with a bad night.
