Trace Points in C++: Diagnosing Production Systems Without Restart

One of the goals behind trace points in the C++ logging library logme was to solve a very practical production problem: the logs available during an incident are usually not the logs developers actually need.

Production issues almost never happen when developers are actually prepared to investigate them. During development everything works correctly, test environments appear stable, and existing logs usually seem sufficient. However, the situation changes quickly once a real production workload appears.

A live system may suddenly start behaving differently under real traffic and real load. A connection may occasionally reset for no obvious reason. One request out of thousands may behave differently from all the others. In other situations, a server may unexpectedly enter a reconnect loop under load. Sometimes strange behavior appears deep inside the system and nobody can reproduce it locally.

As a result, the same unpleasant realization appears again and again: the logs that already exist inside the system are not enough.

This is exactly the point where traditional logging starts breaking down.

Why DEBUG Logging Stops Working

Most logging systems are still built around a fairly simple idea. Developers decide in advance which messages should always be written. They add INFO, WARNING, and DEBUG logs, sometimes combined with channels or categories. Eventually the application is deployed into production with the assumption that those logs will hopefully be sufficient later.

Sometimes they are.

However, production systems have an annoying habit of failing in places nobody expected and in ways nobody anticipated. As a result, the most difficult incidents often appear precisely in the parts of the codebase that seemed completely uninteresting during development.

The first reaction is usually predictable: “let’s enable DEBUG logging.” On small projects this can still work reasonably well. In larger systems, however, DEBUG logs quickly become a problem of their own. They start consuming gigabytes of storage. Meanwhile, useful information gets buried in noise, disk activity grows, and sometimes the logging itself begins affecting performance and timing behavior.

More importantly, developers usually do not need additional diagnostics everywhere. Instead, they need more diagnostics in one specific location: inside one suspicious function, one handler, or one rare execution path that only becomes active under a specific combination of conditions.

This is exactly the problem trace points are designed to solve.


What Makes Trace Points Different

At first glance, a trace point looks almost identical to an ordinary log statement:

LogmeI_TPt("received packet size=%zu", size);

Internally, however, it behaves very differently.

An ordinary DEBUG log either writes messages continuously or remains completely disabled. A trace point always exists. Even while disabled, the system still knows about it, hit counters continue increasing, and the point itself can still be discovered and activated through runtime tools. Actual log records are not emitted until the trace point is explicitly enabled.

As a result, a trace point is not simply another DEBUG macro. It behaves more like dormant instrumentation embedded directly into the application. Most of the time it stays quiet and waits until somebody actually needs the diagnostics.
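The dormant behavior described above can be illustrated with a minimal sketch. This is not logme's actual implementation: the `TracePoint` type, its fields, and the `SKETCH_TRACE` macro are hypothetical names invented for illustration. It only assumes what the text states, namely that the hit counter advances on every pass while records are written only once the point is enabled.

```cpp
#include <atomic>
#include <cassert>
#include <cstdarg>
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Hypothetical type for illustration only; not logme's real data structure.
struct TracePoint {
    const char* name;                        // label used to find the point at runtime
    std::atomic<bool> enabled{false};        // flipped by a runtime control tool
    std::atomic<std::uint64_t> hits{0};      // advances on every pass, enabled or not
    std::atomic<std::uint64_t> emitted{0};   // advances only when a record is written

    explicit TracePoint(const char* n) : name(n) {}

    void Hit(const char* fmt, ...) {
        hits.fetch_add(1, std::memory_order_relaxed);
        if (!enabled.load(std::memory_order_relaxed)) {
            return;  // dormant: counted, but nothing is formatted or written
        }
        emitted.fetch_add(1, std::memory_order_relaxed);
        std::va_list args;
        va_start(args, fmt);
        std::fprintf(stderr, "[%s] ", name);
        std::vfprintf(stderr, fmt, args);
        std::fputc('\n', stderr);
        va_end(args);
    }
};

// Hypothetical call-site macro: one static TracePoint per source location.
#define SKETCH_TRACE(name, ...)          \
    do {                                 \
        static TracePoint tp_{name};     \
        tp_.Hit(__VA_ARGS__);            \
    } while (0)
```

At a call site this would read much like an ordinary log statement, for example `SKETCH_TRACE("HttpParser::OnPacket", "received packet size=%zu", size);`. While the point is disabled, the cost in this sketch is one relaxed atomic increment and one relaxed load.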

That changes the entire philosophy of runtime diagnostics.


A Typical Production Scenario

Consider a fairly typical scenario. Imagine a high-load server where strange behavior occasionally appears somewhere deep inside the HTTP parser. Nobody can reproduce the issue locally. Everything works perfectly in the staging environment. The problem only appears under real customer traffic and only under specific load patterns.

The traditional workflow in this situation is painfully familiar. Developers add temporary DEBUG logs, build a special version, restart the service, wait for the problem to happen again, and eventually discover that the collected diagnostics are still insufficient. Then the entire cycle repeats.

Anyone who has worked on backend systems long enough has gone through this process many times.

Trace points allow a completely different workflow. The diagnostics are already embedded into the production binary ahead of time. The only thing required is activating the suspicious area:

trace enable "*HttpParser*"

At that point the application begins emitting only the diagnostics that are actually relevant to the current investigation. Consequently, there is no need for a globally enabled DEBUG level, no explosion of log volume, and no service restart.

Once the investigation is finished, the trace points can simply be disabled again.
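To make the pattern-based activation above more concrete, here is a small sketch of how a pattern such as "*HttpParser*" could be applied to a list of named points. The exact matching rules used by logme are not described here, so this assumes the simplest case: '*' matches any run of characters and everything else matches literally. `GlobMatch`, `Point`, and `SetEnabled` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Returns true if `text` matches `pattern`; '*' matches any run of
// characters, including an empty one. Assumed semantics, not logme's.
bool GlobMatch(const std::string& pattern, const std::string& text) {
    std::size_t p = 0, t = 0;
    std::size_t star = std::string::npos;  // position of the last '*' seen
    std::size_t mark = 0;                  // text position to retry from
    while (t < text.size()) {
        if (p < pattern.size() && pattern[p] == '*') {
            star = p++;
            mark = t;
        } else if (p < pattern.size() && pattern[p] == text[t]) {
            ++p;
            ++t;
        } else if (star != std::string::npos) {
            p = star + 1;   // backtrack: let the last '*' absorb one more char
            t = ++mark;
        } else {
            return false;
        }
    }
    while (p < pattern.size() && pattern[p] == '*') ++p;
    return p == pattern.size();
}

// Hypothetical stand-in for a registered trace point.
struct Point {
    std::string name;
    bool enabled;
};

// Enables or disables every point whose name matches; returns the count.
std::size_t SetEnabled(std::vector<Point>& points,
                       const std::string& pattern, bool on) {
    std::size_t changed = 0;
    for (auto& pt : points) {
        if (GlobMatch(pattern, pt.name)) {
            pt.enabled = on;
            ++changed;
        }
    }
    return changed;
}
```

Calling `SetEnabled(points, "*HttpParser*", true)` switches on only the points whose names contain "HttpParser", and the same call with `false` switches them off again once the investigation is over.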

In practice this changes the entire feel of production debugging. Instead of constantly rebuilding and redeploying special diagnostic versions, engineers can investigate a live running system almost in real time.


Why Trace Points Are Still Rare

Interestingly, mechanisms like this are still relatively uncommon in the world of C++ logging libraries. The reason is not that the idea lacks value. Rather, the difficulty lies in the implementation.

To make trace points truly usable in production, a library has to solve a number of difficult engineering problems. It needs trace point registration, runtime discovery, wildcard matching, counters, thread safety, and remote control. At the same time, the overhead must remain small enough that developers are comfortable leaving trace points permanently inside production code.

Because of this, many logging libraries simply stop much earlier. Traditional logging already solves most everyday tasks. Trace points, however, address an entirely different problem space. This is no longer just about “printing logs nicely.” Instead, it is about runtime diagnostics for already deployed production systems.

In that sense, trace points are philosophically much closer to technologies like DTrace, eBPF, or ETW than to ordinary DEBUG logging.
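Two of the engineering problems listed in this section, registration and runtime discovery under thread safety, can be sketched in a few lines. This is an illustrative design under stated assumptions, not a description of how logme does it: `Registry` and `Point` are hypothetical names, and the sketch assumes points have static storage duration so that the stored pointers stay valid.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>
#include <string>
#include <vector>

class Point;

// Hypothetical process-wide registry. Registration and discovery are rare,
// so a plain mutex is used; the hot path never touches it.
class Registry {
public:
    static Registry& Instance() {
        static Registry r;  // initialization is thread-safe since C++11
        return r;
    }
    void Add(Point* p) {
        std::lock_guard<std::mutex> lock(mutex_);
        points_.push_back(p);
    }
    // Returns a copy so callers can iterate without holding the lock.
    std::vector<Point*> Snapshot() {
        std::lock_guard<std::mutex> lock(mutex_);
        return points_;
    }
private:
    std::mutex mutex_;
    std::vector<Point*> points_;
};

class Point {
public:
    // Each point registers itself once, when it is first constructed.
    explicit Point(const char* name) : name_(name) {
        Registry::Instance().Add(this);
    }
    // Hot path: one relaxed increment and one relaxed load, no locking.
    bool Hit() {
        hits_.fetch_add(1, std::memory_order_relaxed);
        return enabled_.load(std::memory_order_relaxed);
    }
    void Enable(bool on) { enabled_.store(on, std::memory_order_relaxed); }
    const char* name() const { return name_; }
    std::uint64_t hits() const { return hits_.load(std::memory_order_relaxed); }
private:
    const char* name_;
    std::atomic<bool> enabled_{false};
    std::atomic<std::uint64_t> hits_{0};
};
```

A control tool would call `Registry::Instance().Snapshot()` to list the names and hit counts of all known points and then call `Enable` on the ones it wants. The design choice worth noting is the split between a locked, rarely used management path and a lock-free hot path, which is what keeps the cost of a disabled point small.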


logmeweb and Runtime Diagnostics

Once the number of trace points inside an application starts growing, another problem quickly appears. Managing them from the command line becomes increasingly inconvenient.

This is one of the reasons logmeweb was created: a web interface built on top of the runtime control server.

Through a browser it becomes possible to inspect channels and subsystems, enable and disable trace points, view hit counters, filter points, reset statistics, and execute runtime commands without restarting the application.

The benefits become especially noticeable in larger systems where the number of trace points may already reach into the hundreds. In practice, this is the point where CLI commands gradually stop being a convenient operational tool. Meanwhile, a visual runtime diagnostics interface starts saving a considerable amount of engineering time.

At the same time, it is important to understand what logmeweb actually is. It is not a separate observability platform and not an external agent running beside the application. Instead, the diagnostics already exist inside the process itself. The web interface simply provides a more convenient way to interact with the built-in control server.


Changing the Way Developers Think About Diagnostics

There is also another interesting effect that becomes visible over time.

Developers often avoid adding detailed diagnostics in advance because permanent DEBUG logging feels too expensive and too noisy. As a result, a large amount of potentially valuable diagnostic information never makes it into the codebase at all.

Trace points gradually change that mindset. Developers can afford to leave dormant diagnostic hooks in potentially problematic areas of the system. Those hooks do not continuously pollute production logs and do not require DEBUG logging to remain permanently enabled.

Consequently, when a real issue eventually appears, the necessary diagnostics are already waiting inside.

Traditional logging answers the question:

“What should the application always write?”

Trace points answer a much more interesting question:

“What information might suddenly become important later?”

For modern production systems, the second question is often far more valuable than the first.
