Prometheus metrics saves us from painful kernel debugging (2022)

tass · on July 21, 2024

This awakened a memory from last year where a colleague and I were trying to understand where an increase in Linux memory was coming from in machines that hadn’t been rebooted in a while. This was alerted to by Prometheus metrics.

Even after all apps had been restarted, it persisted. Turned out to be a leak of slab memory allocations by a kernel module. That kernel module had since been updated, but all previous versions were still loaded by the kernel so the leak persisted until the next reboot.

The leaky kernel module was CrowdStrike’s falcon sensor. It started a discussion where engineering had no option but to run these things for the sake of security, there were no instances where it actually caught anything, but it had the potential to cause incidents and outages.

josephcsible · on July 21, 2024

Do you mean "engineering had no option but to run these things for the sake of compliance"?

nsguy · on July 21, 2024

As far as I'm aware there's never a requirement to use a specific product/technology for compliance. The standards will require processes or something very general and the way a company complies is their choice. It's certainly possible that a company will buy a certain product because it checks some compliance box but I would expect there are many other ways to check that box.

tass · on July 21, 2024

Compliance was the reasoning they raised, yes.

Whether you require this particular tool to meet compliance, and whether that tool needs to be deployed across the entire stack might be open to interpretation though.

atomicnumber3 · on July 21, 2024

Open to interpretation but it often ends up something like this:

1. "Do we need to use this shovelware? Can't we achieve compliance via <a few sane steps>?

2. CTO/CEO: We can either use this software and you don't have to think about it, or we can do your thing and I'm going to hang you out to dry the second literally anything goes wrong with it.

3. "Ah. Ok nevermind." [starts updating resume]

IMO it's part of this toxic culture against DIYing literally anything. Yes NIH can be bad, but so is the complete opposite where you make insane decisions (technical or money-wise) because of a culture where people are punished for trying to DIY when it does make sense.

At a particularly toxic past org of mine, we actually had a "joke" (not the ha-ha kind of joke, the i can't believe we actually do this kind) where for a few key decisions we appointed someone to be the "Jesus" for that decision - someone who was already planning to quit, so they could make a decision that was correct but would get them punished, then "die for our sins", leaving us with the benefits of a correct decision having been made but without any of the political fallout of having done something that offends the sensibilities of leadership.

Yes eventually you run out of messiahs to sacrifice and yes it sucked to work where a lot. But damn did they pay really well, so, yknow.

znpy · on July 21, 2024

> against DIYing literally anything

I get your feeling but I largely disagree: I’ve seen too many DYI mini-projects that implemented just enough and became too much important, and when they broke no one was around to fix that (the implementer had left a long ago and left scarce to no documentation at all).

The nice thing about using an off-the-shelf product is that usually either it comes with support or there’s some kind of community where you can go and ask for help).

There is this insidious cognitive bias in (software?) engineers to only consider the happy path when thinking about the consequences of their actions.

kiitos · on July 21, 2024

> I’ve seen too many DYI mini-projects that implemented just enough and became too much important, and when they broke no one was around to fix that (the implementer had left a long ago and left scarce to no documentation at all)

Personally I have seen far more damage done by a reflexive rejection of NIH syndrome, than the costs you're describing here.

I've generally found it much easier to maintain a bespoke system built around a narrow set of requirements, than a general purpose system applied too-eagerly.

carbotaniuman · on July 22, 2024

> There is this insidious cognitive bias in (software?) engineers to only consider the happy path when thinking about the consequences of their actions.

This applies to buying off the shelf software (and other things) too. Vendor provided software is not immune from bugs or shortcomings or terrible vendors.

znpy · on July 21, 2024

> CrowdStrike’s falcon sensor

Oh god, is that falcon thing from crowdstrike as well?

spiffytech · on July 21, 2024

My team spent weeks using log-aggregated metrics to gradually figure out why servers' clocks would go out of whack.

It turned out Docker Swarm made undocumented† use of a UDP port that some VMware product also used, and once in a while they'd cross the streams.

We only figured it out because we put every system event we could find onto a Grafana graph and narrowed down which ones kept happening at the same time.

† I think? It's been a while, might have just been hard to find.

louwrentius · on July 21, 2024

As a side note, I’m into storage performance and the node-exporter data is absolutely spot on. I performed storage benchmarks with FIO and the metrics matched the loads and reported os metrics (iostat) perfectly.

I actually made a Grafana dashboard[0] for it, but haven’t used this in a while myself.

[0]: https://grafana.com/grafana/dashboards/11801-i-o-statistics/

nsguy · on July 21, 2024

It's all standard Linux metrics that node-exporter exposes. Mostly from procfs.

louwrentius · on July 22, 2024

I’ve tried to get the same results with telegraf and influxdb and never got the metrics right

lordnacho · on July 21, 2024

Sounds like they have node on their machines. Not the js framework, but the prometheus/grafana package that gives you all the meters for a generic system monitoring dashboard. Disk usage, CPU, memory, it's all set up already, just plug and play.

In fact, I found a memory leak this way not long ago.

Super useful having this on your infra, saves a lot of time.

gouthamve · on July 21, 2024

Yup, https://github.com/prometheus/node_exporter is the standard way to monitor machine metrics with Prometheus.

alfons_foobar · on July 21, 2024

FYI, it's called "node_exporter" ;)

luckman212 · on July 23, 2024

Anyone know of a good guide for configuring windows_exporter and setting up alerts based on that data? I was tasked with monitoring some Windows endpoints for a memory leak and I failed in my attempts so far.