Jupyter notebooks are actually a use case we think about a lot; you can try a live demo with a Jupyter notebook here: https://jamsocket.com/tmpenv/
It wasn't really one thing with Kubernetes that was slow; it was that the more we tried to optimize it, the less of core Kubernetes we were using, and so the less value we were getting for the complexity tax we were paying. The image pulling you mention is a good example of that: having pre-pulled images is a big factor, but we have too many images to push every image to every node. Instead, we'd like the scheduler to be aware of which node has which image. We could do that with node affinity, but what we'd end up building would be more work than writing our own scheduler to support it from day one.
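For context, the node-affinity workaround would look roughly like this. This is a sketch, assuming some out-of-band process labels each node with the images it has cached; the label key and names here are hypothetical, and that labeler is the part you'd have to build yourself:

```yaml
# Hypothetical pod spec pinning a pod to nodes that already have its image.
# Assumes a daemon you wrote sets a node label like
# `images.example.com/my-app-v42: "true"` whenever the image lands in the
# node's cache -- Kubernetes doesn't do this for you.
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  affinity:
    nodeAffinity:
      # Hard requirement: only schedule onto nodes carrying the label.
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: images.example.com/my-app-v42
                operator: In
                values: ["true"]
  containers:
    - name: my-app
      image: registry.example.com/my-app:v42
```

Even in this sketch, the scheduler itself never knows about image locality; you're maintaining a parallel labeling system just to feed it hints, which is the extra work being described.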
> From what it looks like in your repo it might be that you need to do session timing (like ms) response time from a browser?
Our goal is subsecond container starts. We're not there yet, and might not get there with Docker, but we have a POC that is there with WebAssembly-based workloads. Too bad those are rare :)
(By the way, I'm always happy to chat about this stuff, my email is in my profile)
It turned out to be somewhat tricky, because it increased the size of the Node object, and colocating node heartbeats onto the same object meant that a bigger object was changing relatively often. But that was addressed by moving heartbeats to a different object: https://github.com/kubernetes/enhancements/issues/589
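For anyone curious what that change looks like concretely: after that KEP, each kubelet renews a small per-node Lease object in the kube-node-lease namespace instead of rewriting the large Node object on every heartbeat. A Lease looks roughly like this (field values illustrative):

```yaml
# A node-heartbeat Lease as introduced by the KEP linked above
# (values illustrative). The kubelet periodically bumps spec.renewTime
# on this small object, so frequent heartbeats no longer churn the
# much bigger Node object.
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: node-1
  namespace: kube-node-lease
spec:
  holderIdentity: node-1
  leaseDurationSeconds: 40
  renewTime: "2023-01-01T00:00:00.000000Z"
```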
Very cool, I didn't know about this either. So many of these features keep landing, which is great, but part of the drag of k8s is the constant upgrade churn and having to keep your YAML fresh.
AWS has put work into fast-starting containers [1] using tricks like lazy loading container storage, profiling container startup, non-lazily priming critical blocks, and caching shared blocks. IIRC parts of it are open source. I don't know if enough of it is open source to be helpful, but it's cool stuff!
> the more we tried to optimize it the less of core Kubernetes we were using and so the less value we were getting for the complexity tax we were paying
Since we were headed down that path, we took a step back and asked what we were really getting out of Kubernetes, and most of it was orthogonal to our intended use case. The way Kubernetes is architected around control loops works great for what it was designed for, but we wanted a more event-driven system.
Event driven ... like a streaming data pipeline? Given your comment about Jupyter notebooks, that makes sense. It might be that the Mesos project is better architected for your use case. Then again, I think Mesos ported some of their schedulers to Kubernetes.