> Processing/streaming logs to get metrics is a terrible waste of time, energy a...

viraptor · on July 31, 2020

You can go the https://www.honeycomb.io/ way and make structured logs your metrics. It will cost you a lot in storage, but it simplifies a lot. Just throw properly structured logs into storage, as long as you can query them efficiently (which honeycomb provides).
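A minimal sketch of the "structured logs as metrics" idea, using only the Python standard library. This is not Honeycomb's API; the `emit` and `count_by` helpers and the field names (`route`, `status`, `duration_ms`) are made up for illustration. The point is that the metric is not stored anywhere: it is computed at query time from the stored events.

```python
import json
from collections import Counter

def emit(event: dict) -> str:
    """Serialize one event as a single JSON line (hypothetical log format)."""
    return json.dumps(event, sort_keys=True)

# Hypothetical structured events; the field names are illustrative only.
log_lines = [
    emit({"route": "/users", "status": 200, "duration_ms": 12}),
    emit({"route": "/users", "status": 500, "duration_ms": 340}),
    emit({"route": "/orders", "status": 200, "duration_ms": 48}),
]

def count_by(lines, field):
    """Derive a metric at query time by grouping stored events on one field."""
    return Counter(json.loads(line)[field] for line in lines)

status_counts = count_by(log_lines, "status")
route_counts = count_by(log_lines, "route")
```

Any field can become a grouping key after the fact, which is the simplification; the cost is that every event has to be kept and scanned.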

awithrow · on July 31, 2020

I think the only times it ever really makes sense to use logs to generate metrics are fairly limited:

1. You haven't instrumented the application with metrics yet.

2. The logs are from a third-party tool that doesn't emit metrics.

3. The log format is well defined and doesn't change (I'd still prefer native metrics).

Otherwise the issue is that logging messages can and do change over the lifetime of an application. Relying on the content of the log for metrics becomes an implicit API that's not obvious to developers working on the code. I've seen issues of broken monitoring and alerting because a refactor changed log formatting and content. Much better to be explicit about metrics and instrument them directly.
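To make the "implicit API" problem concrete, here is a small sketch. The log messages, the `FAILED_LOGIN` pattern, and the metric name `login_failures_total` are all hypothetical. A monitoring job that counts failures by matching log text silently drops to zero once a refactor rewords the message, while an explicit counter incremented in code keeps counting.

```python
import re
from collections import Counter

# Hypothetical pattern a monitoring job might use to count failed logins.
FAILED_LOGIN = re.compile(r"^login failed for user (\S+)$")

def failures_from_logs(lines):
    """Count failures by matching log text: an implicit contract with the logger."""
    return sum(1 for line in lines if FAILED_LOGIN.match(line))

old_logs = ["login failed for user alice", "login failed for user bob"]
# The same two events after a refactor reworded the message.
new_logs = ["authentication failure: user=alice", "authentication failure: user=bob"]

# Explicit instrumentation: the counter does not depend on message wording.
metrics = Counter()

def record_login_failure():
    metrics["login_failures_total"] += 1

for _ in new_logs:
    record_login_failure()

scraped_before = failures_from_logs(old_logs)  # matches the old wording
scraped_after = failures_from_logs(new_logs)   # pattern no longer matches
```

Nothing errors in the second case. The scraped count just goes to zero, which can look like good news on a dashboard.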

KaiserPro · on July 31, 2020

Aha! That is the eternal question.

TL;DR:

Almost never. Structured logs are expensive in terms of infra, management and query time. Storing logs just in case is much more expensive at any kind of scale compared to metrics alone.

Long answer:

A lot of it depends on what the service/program is meant to be doing.

If we take a proxying web service router, for example, listening on example.com/*, we would want metrics to tell us how well it's doing at its specific job, and how any upstream services are doing.

So for each service URL we'd want at least a hit count for 2xx, 3xx, specific 4xx and 5xx return codes. We'd also want the time taken to process that request.

We'd also probably want to know the total number of active connections to the back end, and the total number of clients connected. Memory and CPU usage would also be a given.

From that we could easily ascertain the health of upstream services, the performance, and total load (which is useful for autoscaling of either the service router, or the upstream apps)
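A rough sketch of what that set of counters could look like in-process, assuming nothing about any particular metrics library. The class `RouterMetrics` and its attribute names are invented for illustration. 2xx and 3xx responses are grouped into classes, while every other code (including specific 4xx and 5xx codes) is kept as-is.

```python
from collections import defaultdict

class RouterMetrics:
    """Hypothetical in-memory metrics for a proxying service router."""

    def __init__(self):
        self.hits = defaultdict(int)            # (route, status bucket) -> count
        self.duration_sum = defaultdict(float)  # route -> total seconds spent
        self.active_backend_connections = 0     # gauge: open back-end connections
        self.connected_clients = 0              # gauge: currently connected clients

    @staticmethod
    def bucket(status: int) -> str:
        """Group 2xx and 3xx into classes; keep all other codes specific."""
        if 200 <= status < 300:
            return "2xx"
        if 300 <= status < 400:
            return "3xx"
        return str(status)

    def observe(self, route: str, status: int, seconds: float) -> None:
        """Record one finished request: a hit count plus time taken."""
        self.hits[(route, self.bucket(status))] += 1
        self.duration_sum[route] += seconds

m = RouterMetrics()
m.observe("/api", 200, 0.03)
m.observe("/api", 204, 0.01)
m.observe("/api", 503, 1.50)
m.connected_clients += 1  # a client connected
```

The storage cost here is a handful of numbers per route, regardless of traffic volume, which is the contrast with keeping every log line.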

I think it requires sitting down with a piece of paper and imagining your service/app breaking, and then working back to see how that would look. Once you've done that, you can figure out some counters to keep track of those things.