> What is the system supposed to do here? There are a million possibilities…
No, there are two: You dump old data or you dump new data. Everything else should be up to the user code. It's really not as difficult as you are making it out to be. There's certainly no excuse for a ridiculous API as described in the article.
Huh? If you dump data you miss events. Imagine if Process Monitor decided to suddenly dump half of the system calls it monitored. Wouldn't that be ridiculous? For a general event-tracing system, there have to be more options provided. Maybe it wouldn't matter so much for context-switching per se, but for a ton of other types of events you really need to track each and every event.
Yes, you miss events. But if you try to build the kitchen sink into your low-level logging system then it ceases to be low level. If your logging system allocates memory then how can you log events from your VM subsystem? If your logging system logs to the disk, then how do you log ATA events? It becomes recursive and intractable.
The solution is to make your main interface a very simple pre-allocated ring buffer and have userspace take that and do what they please with it (as fast as it can so things don't overflow).
There is always a point at which your logging system can't keep up. At the kernel level you decide which side of the ring buffer to drop (new data or old) and at the userspace level you decide whether to drop things at all or whether to grind the system to a halt with memory, disk, or network usage.
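To make the two kernel-side choices concrete, here is a rough sketch of a pre-allocated ring buffer with a drop-old or drop-new policy. It is illustrative Python rather than kernel code, and the names (`RingBuffer`, `drop_oldest`, `drain`) are made up for this example, not taken from any real tracing API.

```python
class RingBuffer:
    """Fixed-capacity event buffer; storage is allocated once, up front."""

    def __init__(self, capacity, drop_oldest=True):
        self._slots = [None] * capacity  # pre-allocated, never grows
        self._capacity = capacity
        self._head = 0                   # index of the oldest stored event
        self._count = 0                  # number of events currently stored
        self._drop_oldest = drop_oldest
        self.dropped = 0                 # how many events were lost

    def push(self, event):
        """Store an event; returns False only if this event was discarded."""
        if self._count == self._capacity:
            self.dropped += 1
            if not self._drop_oldest:
                return False             # drop-new: discard the incoming event
            # drop-old: overwrite the oldest slot, then advance the head
            self._slots[self._head] = event
            self._head = (self._head + 1) % self._capacity
            return True
        tail = (self._head + self._count) % self._capacity
        self._slots[tail] = event
        self._count += 1
        return True

    def drain(self):
        """Consumer side: remove and return all events, oldest first."""
        out = []
        while self._count:
            out.append(self._slots[self._head])
            self._slots[self._head] = None
            self._head = (self._head + 1) % self._capacity
            self._count -= 1
        return out
```

The producer never allocates and never blocks; the only policy baked in is which end loses data. Whatever calls `drain()` decides what happens to the events afterwards, and the `dropped` counter at least tells it that something was lost.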
The options are not simply "drop data" or "don't drop data". The options depend on the logging source, because not every logging source requires a fixed-size buffer. The API itself needs to support various logging sources and thus needs to support extensible buffers (e.g. file-backed sources, the way ProcMon does). Whether or not a particular logging source supports that is independent of whether or not the generic logging interface needs to support it.
I think we're talking past each other here. I don't think we're disagreeing on the userspace part. I'm not even implying that the low-level kernel interface should have unconfigurable buffer sizes. They should be configurable, but pre-allocated and non-growable. You're right, the userspace part can do whatever it wants. But I stand by my last paragraph (you either drop or grind things to a halt).
> Huh? If you dump data you miss events. Imagine if Process Monitor decided to suddenly dump half of the system calls it monitored. Wouldn't that be ridiculous?
All sorts of systems have worked like this in the past (search for "ring buffer overwrite"). If you can't assume unlimited storage, you have to make a decision whether it's more important to have the latest data, dropping older samples, or whether it's more important to maintain the range of history by lowering precision (e.g. overwriting every other sample).
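As a sketch of the "lower the precision" option: one way to do it is to throw away every other stored sample when the buffer fills and then record only every 2nd, 4th, 8th, ... incoming sample from that point on. This is just one possible scheme, written as illustrative Python; `DecimatingBuffer` and its methods are hypothetical names, and it assumes a capacity of at least 2.

```python
class DecimatingBuffer:
    """Fixed-size buffer that halves its sampling rate instead of overflowing."""

    def __init__(self, capacity):
        self._slots = [None] * capacity  # pre-allocated, never grows
        self._capacity = capacity
        self._count = 0                  # number of samples currently stored
        self._stride = 1                 # keep one sample out of every `stride`
        self._seen = 0                   # total samples offered so far

    def push(self, sample):
        index = self._seen
        self._seen += 1
        if index % self._stride != 0:
            return                       # skipped at the current precision
        if self._count == self._capacity:
            # Full: compact in place, keeping every other stored sample,
            # and halve the sampling rate for everything that follows.
            kept = (self._count + 1) // 2
            for i in range(kept):
                self._slots[i] = self._slots[2 * i]
            self._count = kept
            self._stride *= 2
            if index % self._stride != 0:
                return                   # no longer on the coarser grid
        self._slots[self._count] = sample
        self._count += 1

    def samples(self):
        """Return the stored samples, oldest first."""
        return self._slots[:self._count]
```

With a capacity of 4 and samples 0 through 12 pushed in order, the buffer ends up holding 0, 4, 8, 12: the full time range survives, at a quarter of the original resolution.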
> but for a ton of other types of events you really need to track each and every event.
If you really need this, you have to change the design to keep up with event generation. That's outside the scope of a low-level kernel API where performance and stability trump a desire for data.