No mention of user namespaces whatsoever, which is the primary security isolation mechanism for containers on linux. This is what enables "rootless" mode. Of course, this is from 2017, but user namespaces were released with linux 3.8 in February 2013.
Docker particularly has always required extra work to run in rootless mode because it was released soon after in March 2013, and for whatever reason it hasn't been a priority to rework the codebase to make that the default. I switched to podman for exactly this reason as my go-to oci implementation and haven't looked back.
The Linux kernel features that enable the various forms of isolation all require root privileges (CAP_SYS_ADMIN). Once user namespaces were a thing, you could use them to get around the root requirement for all the other isolation namespaces (a minimal sketch follows the list below).
All of the below still require CAP_SYS_ADMIN:
CLONE_NEWCGROUP: cgroup namespace, for resource control (mem/cpu/block io/devices/network bandwidth)
CLONE_NEWIPC: ipc namespace, for SysV IPC objects and message queues
CLONE_NEWNET: network namespace, for isolated virtual networking
CLONE_NEWNS: mount namespace, for isolated mounting (filesystems, etc.)
CLONE_NEWPID: pid namespace, for an isolated view of running processes
CLONE_NEWUTS: UNIX Time-Sharing System namespace, for isolation of hostname and domain name
see: https://man7.org/linux/man-pages/man2/clone.2.html
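To make the user-namespace trick concrete, here's a minimal sketch in C (an illustration only, assuming a kernel with unprivileged user namespaces enabled; the hostname and the particular flag combination are arbitrary choices): creating the user namespace first gives the process a full capability set inside it, which is what lets it create the other namespace types without real root.

    /* Minimal sketch, assuming unprivileged user namespaces are enabled.
     * "sandbox" and the flag choice are arbitrary for illustration.
     * Creating the user namespace first gives this process a full capability
     * set inside it, so the other namespace types no longer need real root. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        /* As a plain user this fails with EPERM if CLONE_NEWUSER is dropped. */
        if (unshare(CLONE_NEWUSER | CLONE_NEWUTS | CLONE_NEWNS | CLONE_NEWPID) == -1) {
            perror("unshare");
            exit(EXIT_FAILURE);
        }

        /* Allowed now: we hold CAP_SYS_ADMIN in the new user namespace,
         * which owns the new UTS namespace. Only our own view changes. */
        if (sethostname("sandbox", strlen("sandbox")) == -1)
            perror("sethostname");

        /* Until uid_map is written we appear as the overflow uid (65534). */
        printf("uid inside the new user namespace: %d\n", (int)getuid());
        return 0;
    }

Compile with gcc and run it as a normal user; on distros that disable unprivileged user namespaces (the Debian/Ubuntu situation mentioned elsewhere in this thread) the unshare() call will fail with EPERM.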
Exactly, "set up". Many people (not all) don't want to fiddle with things, they just want it to work out of the box. The importance of secure defaults can't be overstated, especially when there are virtually no downsides.
Docker has a big community, lots of guides, and ready-to-use containers. It became pretty much the de facto standard for self-hosting things. You also have a very high chance of getting a piece of software to work out of the box as intended with Docker. The only way this or some other way of running stuff will overtake Docker is if it matches Docker in these respects.
As much as I'd love to try this lightweight VM idea, I don't have the time or energy to convert 20+ projects I'm self-hosting into this and then keep everything updated. I'd rather invest this time into learning Docker more and making my existing setup more secure and robust.
Maybe try out kraft.cloud: we take Dockerfiles as input and automatically convert to lightweight VMs/unikernels when deploying (disclaimer: I'm one of the paper's authors and one of the people behind KraftCloud).
I recently built a similar thing for learning purposes using firecracker + the firecracker go api.
I wrote a small init system in rust and combined that with filesystem images derived from the Debian, Ubuntu etc. container images (that can be extended with more layers).
What really surprised me the most is how quick and simple it is to compile the linux kernel. Cloned a tag with --depth 1, configured it, and then it took ~5 minutes to build vmlinuz.bin. As someone who is too young to have had to regularly do that, I had heard multiple stories of how long that's supposed to take, but it really doesn't take that long.
I then tried to move from firecracker to qemu microvms but didn't get that far yet since I didn't have more time.
All in all a great learning experience and if I wasn't an undergrad student with no time, I'd love to build a service/business around it.
I came across Unikraft a while ago and went “wow that’s cool, but I have no idea how to use this”, cloud offering and docs you have up there now look amazing! Will 100% be giving this a go first thing tomorrow!!
Fly: takes your docker image, converts it into a Firecracker VM and runs that: kernel boundaries etc are all the same as before (and the same as running your container locally).
Kraft Cloud: takes your docker image, and turns it into a “unikernel”, and runs that. In a unikernel, your application _is_ the kernel. There’s no process boundary, no kernel-space/userspace split there’s a single address-space etc.
I believe the idea is that you get a perf benefit: as your application is often the only one running in the container, security is provided by the hypervisor anyway, so you may as well cut out all the middle layers that aren't getting you much. Seems some of the authors/founders of Unikraft are in the comments, they can explain much better than I.
Hey, author/founder here, thanks for providing that answer, all correct there :) . I would also add that KraftCloud unikernels are built using Unikraft, and that its modularity allows us to tailor/specialize those images to obtain great perf.
Finally, we also had to design and implement a controller from scratch -- nothing out there provided the millisecond semantics and scalability we needed (plus we also did tweaks to network interface creation and a few other things to get the end to end experience to be fast).
My work had a product that was doing builds and hosting for arbitrary client code; you're doing all that, plus more. I've got massive respect for that, because there were some hard problems to solve, even in our pretty vanilla environment. Looks like you guys have done a far better job than we did, plus more!
It sounds like consequences of bugs like memory corruption are far more challenging to deal with in the Kraft cloud situation. Sometimes isolation has other benefits.
Isn't that better isolation, though? A memory corruption will at worst break the OS, which is the app, and nothing else. Push the model further and you can have one unikernel per user and reduce the consequences of bugs even further.
Ahem, this is a research paper. You should look at this stuff as "Innovation" and someone may just consider building a tool or product on the idea... or not.
Author here, we did this, first by continuing the research alongside the creation of the Unikraft LF OSS project -- the result of which was the Eurosys 2021 best paper award (https://dl.acm.org/doi/10.1145/3447786.3456248).
Commercially, we leverage Unikraft on kraft.cloud to provide a cloud platform with millisecond semantics.
OS: I provide isolation where needed, handle safely interacting with outside world, and abstract away all the pesky stuff so programmers can just get stuff done.
Container / VM: I provide isolation where needed, handle safely interacting with outside world, and abstract away all the pesky stuff so programmers can just get stuff done.
I get that a dev machine (OS) isn't usually suitable for deployment or shared development (Container/VM). But it seems to me the promise of the operating system has fallen short, if we are striving to meet so many of the same goals of the OS with something on top of the OS that tries to abstract away the OS.
I guess this came to be due to the poor original security model of classic OSs, which led to a proliferation of viruses and complex management of shared resources. Users, groups and access flags are not enough to manage the security of a system.
Linux tried to fix that with namespaces and it turned out to be more or less successful, but Linux is not an OS, it's just a kernel, and it's up to real OSs built atop Linux to use namespaces as an implementation detail for real application isolation.
One way to do that is OCI containers, the other way is Flatpak. Neither of those is a proper OS yet, but you could call Kubernetes an operating system which uses containers as a means for application and resource isolation. Naturally that means Kubernetes is a complex beast, but that's what it takes to provide what users expect from an OS.
Android also comes to mind, they managed to isolate applications between each other quite safely.
I say this with great care as I do not want to launch a flamewar.
If you do not consider Linux with namespaces an OS (because of the fragmented userland): would you then consider FreeBSD with jails or Solaris with zones as fully fledged?
If you still consider those flawed (maybe because they do not force you into jails/zones), should we at least not consider OS/390 or z/OS as proper operating systems to that/your (not meant to be inflammatory!) standard?
Yes. Though you do not mention them directly, DOS and Windows have ruled the world for years and they opened the door for the nasties. But they were not all there was - only the popular/easy choice. Everything is a trade off.
Isolation mechanisms is not what makes an OS. It's the stable ABI that application developers can depend on and which provides a way to use shared resources: disk, CPU, RAM, GPU, network, screen space, push notifications, GUI integrations, your favorite LLM integration, so on, so forth... Yes, it might have an imperfect security model, but nothing's perfect under the sun.
Raw Linux without userspace could be considered an OS, but it has the ABI only in form of syscalls and the minimal standard FS. That's barely enough for anything other than, say, a statically linked Go binary, which is why it's seldom used by app developers as a target.
To most of your examples I say – yes, that's an OS, and jails or zones have nothing to do with it. Although I'm not familiar with them other than FreeBSD, so I'm relying on your short description and your implied criteria for selecting these examples.
I don't really see how rootless containers change anything at all. You're still "just" one kernel privilege escalation away from breaking out. The level of isolation is much better in virtual machines, and the performance penalty is comparable these days.
The virtual machine images are a bit heavier, since you need a kernel and whatnot, but the difference is negligible. The memory footprint of virtual machines with memory deduplication and such means that you get very close to the footprint of containers. You have the cold start issue with microvms, but these days they generally start in less than a couple of hundred milliseconds, not that far off your typical container.
Memory de-dup is computationally expensive, and KSM hit rate is generally much worse than people tend to expect - not to mention that it comes with its own security issues. I agree that the security tradeoffs need to be taken seriously, but the real-world performance/efficiency considerations are definitely not negligible at scale.
There are also significant operational concerns. With containers you can just have your CI/CD system spit out a new signed image every N days and do fairly seamless A/B rollouts. With VMs that's a lot harder. You may be able to emulate some of this by building some sort of static microvm, but there's a LOT of complexity you'll need to handle (e.g. networking config, OS updates, debugging access) that is going to be some combination of flaky and hard to manage.
I by no means disagree with the security points but people are overstating the case for replacing containers with VMs in these replies.
And these overheads are even smaller if you use unikernels as per the paper. Eg, cold starts of a few milliseconds depending on the app/size of the image.
I'm struggling a little bit to grasp all the concepts when we start talking about unikernels, wasm and so on. Hopefully that's just a sign of the maturity of it, and not a sign of my mental decline. But on paper (as I understand it) it looks /so cool/.
Unikernels aren't too complicated conceptually. They're more or less a kernel stripped down to the bare minimum required by a single application. The complete bundle of the minimal kernel and application together is called a unikernel. The uni- prefix means one, as in the kernel only supports one userspace application, instead of something like linux, which supports many. The benefits, as mentioned in the paper and in this thread, are that you can run that as a VM, since it contains its own operating system, unlike a container, which is dependent on the host operating system. Also, they boot very quickly.
Agree with epr's definition of a unikernel (and no, no mental decline on your part, this isn't always well defined).
First off, a unikernel is a virtual machine, albeit a pretty specialized one. They're often based on modular operating systems (e.g., Unikraft), in order to be able to easily pick the OS modules needed for each application, at compile time. You can think of it as a VM that has a, say, NGINX-specific distro, all the way down to the OS kernel modules.
VMs provide what's called hardware-level isolation, running on top of a hypervisor like KVM, Xen or Hyper-V. Wasm runs higher up the stack, in user-space, and provides what's called language-level isolation (you could even create a wasm unikernel, that is, a specialized VM that runs wasm inside; e.g., see https://docs.kraft.cloud/guides/wazero/). Generally speaking, the higher you go up the stack, the more code you're running and the higher the chances of a vulnerability.
Why weren't containers rootless from the start anyway? What did they need that user space doesn't provide? Wine, emulators and VMs didn't require it either (with the exception of some VMs needing a kernel module for performance reasons like memory management, which I also find stupid; the OS should provide all the performance in user space).
As I mentioned in another comment, the linux kernel feature (user namespaces) that enables "rootless" containers was released in February 2013, and Docker was released soon after in March of that year. For whatever reason, they haven't made it a priority to make rootless the default, although it is technically doable. If you are annoyed by this, I'd suggest checking out podman, which has done a lot of work to be basically a drop-in replacement with a similar workflow to docker.
Because the docker developers hate security. The idea of the docker group is insane, for example. You can mount any directory into a container so being in the docker group is like having a root account.
People were running containers for a decade before rootless podman came around.
There have been a lot of sharp corners around userns and related tech that needed to get resolved. Notably, Debian & Ubuntu disabled unprivileged userns for some legitimate security concerns.
Funny, the original commit message for that suggests it was simply a precaution. It's not out of the ordinary to avoid newer kernel features just in case.
> This is a short-term patch. Unprivileged use of CLONE_NEWUSER
> is certainly an intended feature of user namespaces. However
> for at least saucy we want to make sure that, if any security
> issues are found, we have a fail-safe.
I really don't get that: having to run something substantial as root seems a much bigger security concern than what it is shielding from user space (example: hosting a web server at port 80).
There is a lot of discussion on here about the different isolation levels available, but these micro-VMs aren't playing in the same field and can't be compared apples-to-apples.
If you go read the paper this requires a specialized Xen kernel, which in turn requires processor virtualization extensions directly available where you're running these containers. Those extensions aren't generally available if you're already running inside of a VM.
This is a solution that only works on bare metal. I would bet money that the vast majority of people using containers, outside of development environments at least, are not running their containers on bare metal but in an existing VM, such as on AWS or GCP, where this solution is simply a non-starter.
Neat, niche, and doesn't operate in the same world as containers.
>On the downside, containers offer weaker isolation than VMs, to the point where people run containers in virtual machines to achieve proper isolation.
That's not really why containers are deployed in VMs, especially in the context of on-prem enterprise software. I think that's more of a legacy issue. For example, for on-prem enterprise software, the enterprise already invested millions into their VM infrastructure so deploying a containerized stack means deploying into their VM infrastructure.
I think when centralized container orchestrators get enough market penetration with properly trained IT, you'll probably see that change.
Also, very few people choose containers for security and isolation. Typically it's for flexibility in deployment, and control of the environment (no more dependency hell).
high level, a vm is an entire virtual machine with its own kernel/operating system/filesystem/etc. a container is a process (and associated files/archived filesystem) with a (more or less) isolated view of the world (network/filesystem/etc.) running on top of the same kernel/os as other processes on the same machine.
examples:
a) vm - an entire windows install running in a window on my linux workstation so i can use tax software once a year. two kernels running at the same time. (N+1 for N VMs)
b) container - a small python service, its dependencies, and various filesystem bits from alpine-minimal packaged into a file that docker/containerd/whatever can turn into the service running in a little isolated portion of my machine. no matter how many i run, one kernel. the various processes just don't see the host or other procs' files/memory/etc. via namespace trickery (unless there's a security problem, lol). (sketch below)
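as a rough illustration of that "isolated view" (a hypothetical sketch, assuming unprivileged user namespaces are enabled): the child below runs on the exact same kernel as everything else on the machine, yet sees itself as pid 1.

    /* hypothetical sketch: one kernel, but the child's view of the process
     * table starts at pid 1. a user namespace is created first so no root
     * is needed to create the pid namespace. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        if (unshare(CLONE_NEWUSER | CLONE_NEWPID) == -1) {
            perror("unshare");
            exit(EXIT_FAILURE);
        }

        pid_t child = fork();               /* first child becomes pid 1 inside */
        if (child == 0) {
            printf("inside:  pid = %d\n", (int)getpid());   /* prints 1 */
            _exit(0);
        }
        printf("outside: child pid = %d\n", (int)child);    /* ordinary host pid */
        waitpid(child, NULL, 0);
        return 0;
    }

mount a fresh /proc inside and tools like ps only show the namespaced processes; skip that and you still see the host's /proc, which is about all the "trickery" amounts to.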
A VM is a virtualized instance with virtual hardware and can therefore run its own operating system with its own kernel to interface with the virtual hardware.
A container is basically a process restricted by multiple kernel namespace isolation mechanisms. It shares the same kernel with the host and does not present any “virtual hardware”.
Technically, and simplifying enormously, the VM emulates the whole machine while the Container scopes the OS process. I prefer the analogy of an office building.
Your VM is your whole office building: overnight maybe a whole new company can move in, but it's still using the whole building. Your container is a set of rules somebody told you when you arrived at the reception desk, about which single office in the building you can use, plus maybe some common access to shared areas once in a while, like the WC and kitchen. :-)
Containers seem light and cheap, but they have subtle problems: they lack the solid guarantees, prioritization, and limits on compute, network, and storage resources that type-1 v12n provides.
Isolated, but are they isolated enough? The article states that containers offer weaker isolation than VMs. (it doesn't quantify it though and I don't know this kind of thing offhand)
Who is complaining? And if containers do not offer enough of an isolation, why would you think VMs do? There are use cases where you have to have host-level isolation - for example, if you want to build a HIPAA-compliant cloud service, your customer data has to be isolated at the host level and VMs are not enough.
The Linux kernel has far too large of an attack surface to be trusted as a hard security boundary. It is good enough to prevent mostly trusted software from accidentally interfering with each other but I would not trust it to protect me from an untrusted workload.
For example GCP and AWS both have container running services. They both use hardware VMs to isolate different tenants. You will never share a kernel with another customer (I don't even think you will share one with yourself by default).
I agree with the other comments. On the cloud, the VM is still the gold standard for strong (hardware-level) isolation: if you deploy a container in the cloud, you can almost be sure there's a VM underneath. Given this, what we tried to do in that paper, in the LF Unikraft project (www.unikraft.org), and on kraft.cloud, is ensure that each VM only has the thinnest possible layer between the application and the hypervisor underneath -- strong isolation and hopefully max efficiency. We do use Dockerfiles to have users specify the app/filesystem, but then we transparently convert them to unikernels (specialized VMs) at deploy time.
Correct -- and you can run multiple kernels on the machine with virtualization extensions. Even Docker Desktop does this. You'd do this for _real_ isolation purposes.
It depends on the type of v12n. For paravirtualization and similar, the answer is sort-of, while for hard emulation it is definitely yes. There are efficiencies in memory usage because they will often share the same kernel code and userland code, which are memory pages that can be deduplicated at the hypervisor level. Read more about type-1 v12n.
> All processes in a proper OS are already isolated and there is no need for VM.
No. This is not how things work in reality. (Ideally, yes, because hypervisors are OS "duct tape", but there is no such readily-available OS with strict resource limits and hard enforced VFS and network isolation.) Isolation, sharing, and hard limits on RAM, CPU, networking, and storage (bandwidth, block devices, and IOPS) are beyond the capabilities of every major OS. This is why VMware and similar type-1 hypervisors exist.
I'm wondering though what value will Kubernetes add beside integrating with existing (presumably Kubernetes-based) infrastructure? At least, this is my understanding of the rationale for Kata containers. Other than that, it seems like it'd be just getting in the way...
I believe this work originated at Intel as "clear containers" (which I believe started life from an acquisition, but I could be mixing this up... my memory isn't what it used to be). Either way it's great they are being used like this and at Nvidia (I know Alibaba Cloud also uses this tech).
Yes, Kata started as clear containers. And yes, the main purpose is compatibility with containers -- though generally speaking, adding layers to the cloud stack never helps to make a deployment more efficient. On kraft.cloud we use Dockerfiles to specify app/filesystem, but then at deploy time automatically and transparently convert that to a specialized VM/unikernel for best performance.
Back when we did the paper, Firecracker wasn't mainstream so we ended up doing a (much hackier) version of a fast VMM by modifying's Xen's VMM; but yeah, a few millis was totally feasible back then, and still now (the evolution of that paper is Unikraft, a LF OSS project at www.unikraft.org).
(Cold) boot times are determined by a chain of components, including (1) the controller (eg, k8s/Borg), (2) the VMM (Firecracker, QEMU, Cloud Hypervisor), (3) the VM's OS (e.g., Linux, Windows, etc), (4) any initialization of processes, libs, etc and finally (5) the app itself.
With Unikraft we build extremely specialized VMs (unikernels) in order to minimize the overhead of (3) and (4). On KraftCloud, which leverages Unikraft/unikernels, we additionally use a custom controller to optimize (1) and Firecracker to optimize (2). What's left is (5), the app, which hopefully the developers can optimize if needed.
LightVM is stating a VM creation time of 2.3ms while Firecracker states 125ms from VM creation to a working user space. So this is comparing apples and oranges.
I know it's cool to talk about these insane numbers, but from what I can tell people have AWS lambdas that boot slower than this to the point where people send warmup calls just to be sure. What exactly warrants the ability to start a VM this quickly?
The 125ms is using Linux. Using a unikernel and tweaking Firecracker a bit (on KraftCloud) we can get, for example, 20 millis cold starts for NGINX, and have features on the way to reduce this further.
But if you can get isolation, security AND reproducible environments using a VM, especially one that's nearly as fast as an OS process, the case for using containers instead pretty much disappears.
I don't know this LightVM thing but I will definitely investigate it, especially given that on my Mac I need to use a VM anyway to run containers!
Check out kraft.cloud and the accompanying LF OSS project www.unikraft.org :) (disclaimer: I'm one of the authors of the paper and one of the people behind that cloud offering). On KraftCloud we use Dockerfiles so users can conveniently specify the app/filesystem, and then at deploy time transparently convert that to a unikernel (a specialized VM). With this in place, NGINX cold starts in 20 millis, and even heavier apps/frameworks like Spring Boot in < 300 millis (and we have a number of techniques to bring these numbers even further down).
For anyone else wondering how heavy this is on macOS, I ran the install script and it just delegated to brew... brew listed the following packages being installed:
Most should already exist on your Mac if you do development... it seems to rely on qemu, unsurprisingly... openjdk as well (probably to support Java out-of-the-box?), imagemagick etc.
Took a few minutes to finish installing... the CLI seems to be based on the Docker commands (build, clean, run, 'net create', inspect etc.), some package-manager like commands ('pkg info', 'pkg pull', 'pkg list' etc.), a bunch of "cloud" commands (I suppose that's the non-free part) and "compose" commands just like docker-compose. Interesting stuff.
I tried to run the C hello world example... I get an error, it wants to run Docker?!?! I thought the whole point was to avoid Docker (and containers)??
Here's the log:
i creating ephemeral buildkit container
W could not connect to BuildKit client '' is BuildKit running?
W
W By default, KraftKit will look for a native install which
W is located at /run/buildkit/buildkit.sock. Alternatively, you
W can run BuildKit in a container (recommended for macOS users)
W which you can do by running:
W
W docker run --rm -d --name buildkit --privileged moby/buildkit:latest
W export KRAFTKIT_BUILDKIT_HOST=docker-container://buildkit
W
W For more usage instructions visit: https://unikraft.org/buildkit
W
E creating buildkit container: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?: failed to create container
PS. running the hello-world pre-built "image" worked:
> kraft run unikraft.org/helloworld:latest
EDIT:
A lot of stuff looks broken on MacOS.
For example, `kraft menu` doesn't work (error "no choices provided", even though the docs show it working fine without "choices"?)...
`kraft run --elfloader loaders.unikraft.org/strace:latest ./my_binary` also doesn't work (the docs show it working).
Error: "unknown flag: --elfloader".
i don't think devs care if they use containers or VMs, as long as it's easy and they don't have to worry about which version of Python the host is running
This. It's why vagrant was popular before the container revolution.
The killer app of Docker isn't the container, it's the depth and uniformity of the UX surrounding the container system. When that is broken by something on the host (non-x86 CPUs were a major pain for a while before popular images were cross-built) and emulation gets in the way and is not as easy, or just mildly different (Windows behind corporate firewalls that assign IPs used by the Docker engine, for example), the ease of use falls away for non-power users and it's all painful again.
Tech like Docker for windows and Rancher Desktop and lima has largely matured at this point, but somebody could make a new machine and then the process of gradual improvement starts all over again.
Certainly depends a lot on what the term "VM" actually means in the context. If it's something as specialized as the JVM, or native virtualization with an extremely trimmed down guest, then at some point you'll find yourself in need of something more heterogeneous, e.g. running a tool on the side that does not fit the VM. Then you're back at square one, only this time with containers (or back in some irreproducible ad-hoc setup). Going with containers from the start, containers that may or may not contain a VM, and that may or may not actually do more than what a VM could supply, that's much less hassle than changing horses at a later point.
VMs in general use more CPU power, as you have two OSes each doing things like updating their real-time clock... There are VM-aware OSes that will not do this, but it needs special code and CPU support, which means you are often lagging behind the latest (to be fair, this is rarely important). A container will normally be slightly faster than a VM, never slower (assuming a reasonable OS; I could write an exception if I were malicious), and so there is a lot of interest in whether they are good enough.
You don't need a VM on Mac to run containers, check out OrbStack, they provide a Docker compatible engine that is using native MacOS capabilities for running containers without the hidden Linux VM.
I don't know where you got that idea. OrbStack absolutely runs a Linux VM. That Linux VM then uses Linux containerization technologies (namely LXD) for each separate OrbStack 'machine' you set up, which is how you get such fast startup times for your OrbStack 'machines'.
For Docker, OrbStack does the same thing as Docker Desktop, Podman Desktop, Rancher Desktop, etc., which is set up a Linux VM running Docker and then present a native socket interface on macOS which relays everything it receives to the Docker socket inside the VM.
macOS doesn't have native capabilities for running containers, which is why the nearest thing you can get to containerd on it requires you to disable SIP so it can use a custom filesystem to emulate bind mounts/null mounts: https://darwin-containers.github.io/
If you read the PRs where the principal author of the Darwin Containers implementation is trying to upstream bits of his work, you'll see containerd comparing his approaches to others and complimenting them by calling them 'the most containerish' because real capabilities aren't there.
(I believe I've read rumors here on HN that Apple has those features internally, fwiw. But they've evidently never released them in a public copy of macOS.)
Another clue in all this is to just run uname in any of your Docker containers in OrbStack; you'll see they're Linux machines. Some operating systems have Linux syscall emulation layers (WSL1, FreeBSD's Linux emulation, Illumos' LX Zones) that could perhaps be used to run Linux containers without hardware emulation or paravirtualization in combination with some native containerization capabilities. Afaik Illumos' LX Zones is the only implementation where that's a supported, intended use case but maybe FreeBSD can do it. At any rate, macOS has never had that kind of syscall compatibility layer for Linux, either. So when you run `uname` in a 'macOS container' and see 'Linux', you can be certain that there's a VM in that stack.
PS: Aside from the fact that it's proprietary, I really do quite like OrbStack. It's the nicest-to-use implementation of something like this that I've tried, including WSL2 and Lima. The fact that it makes the VM machinery so invisible is very much to its credit from a UX perspective!
Interesting! I could swear that in the early days of OrbStack, somewhere on their website, I read that they were using native macOS frameworks without the need for a Linux VM, but I can't find that anymore (they don't mention a Linux VM either, but the language still differs from what I remember).
They do use native GUI frameworks rather than something like Electron, which they still mention. And maybe they also used to have something about relying on Apple's Virtualization Framework or something like that, rather than qemu as Lima used for a long time. (I think it may still be Lima's default, but not for long.)
Where is the tooling to build and distribute lightweight VMs like containers? How can I copy one HTML file into an nginx VM, build this VM image for multiple architectures (I have ARM and x64 servers), publish it, pull it, and run it multiple times?
Once again: Containers are not about isolation or security, they are a package format for shipping applications. The packages are easy to build, distribute, multiarch, ...
And requiring a Linux VM on macOS to run Linux containers is not particularly surprising.
At the time of publication of the article, the tool used to create the minimalistic VM (Tinyx) had not been released, and as far as I can see it never was.
Correct, we never did release Tinyx, mostly because it was in a very unclean/researchy state = not ready for public consumption. In retrospect, we probably should have either (a) made it available in whatever state it was in or (b) put more cycles into it.
Containers are perfect for build environments and for creating the root filesystem. The issue is that kernels these days are super bulky and are intended for multi-user, multi-process environments. Running a container runtime on top just makes it worse when you're looking for "isolation".
This paper argues that when you build an extremely minimal kernel (i.e. ditch Linux entirely) and link your application against necessary bits of code to execute _as_ a VM, then you'll get better performance than a container and you'll get that isolation.
I am looking at the examples. They all have a Dockerfile. Is that just for local development on my laptop?
When using the deploy command-line tool, is the Dockerfile used to determine dependencies for the hosted VM? What if a developer is using an unusual programming language, like Common Lisp. Is that doable?
A Dockerfile is just a file with a bunch of commands to execute and get a working "computer". https://github.com/combust-labs/firebuild is fairly aged translation of the Dockerfile to a VM rootfs.
> build an extremely minimal kernel (i.e. ditch Linux entirely) and link your application against necessary bits of code
It would be nice, but this is really hard to do when modern software has so many layers of crud. Good luck getting say, a PyTorch app, to work doing this without some serious time investment.
But you don't need to write against all the layers of crud. You only have to write against the bottom layer, the kernel API. This sort of software would have no need to specifically support "libxml" or "TLS", because that is multiple layers above what this sort of software does.
The flip side is that if you want something like low-level access to your specific graphics card you may need to implement a lot of additional support. But of course nothing says you have to use this everywhere at the exclusion of everything else. There are plenty of systems in the world that from the kernel's point of view are basically "I need TCP" and a whole bunch of compute and nothing else terribly special (see the sketch below).
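To illustrate the point (not from the paper; a hedged sketch, with the port number an arbitrary choice): a service whose only real demand on the kernel is "I need TCP" looks like this, plain socket calls that a stripped-down kernel could satisfy, with no libxml/TLS-style layers anywhere in it.

    /* Illustrative sketch: the whole "bottom layer" this service needs is a
     * TCP socket API. Port 8080 is an arbitrary choice for the example. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int srv = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr = { 0 };
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(8080);

        if (srv == -1 || bind(srv, (struct sockaddr *)&addr, sizeof(addr)) == -1 ||
            listen(srv, 16) == -1) {
            perror("socket/bind/listen");
            return 1;
        }

        for (;;) {                          /* accept loop: TCP in, bytes out */
            int c = accept(srv, NULL, NULL);
            if (c == -1)
                continue;
            const char *msg = "hello from the kernel API\n";
            write(c, msg, strlen(msg));
            close(c);
        }
    }

Everything above libc here is the application itself, which is roughly the shape of workload where a kernel-API-only target makes sense.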
[Author of the paper here] You hit the nail on the head, this is precisely what we do (kernel API compatibility) with the LF Unikraft project (the evolution of the 2017 paper) at www.unikraft.org, and kraft.cloud, a cloud platform that leverages Unikraft.
Most of that effort should be sharable. If you know you will only have one python process you can get rid of a lot of cruft. If you know you will be running in a VM then you only need the driver for the network interface the VM provides, not every network interface ever designed (often including ones that your hardware doesn't even physically support). So while there is serious time investment, it isn't nearly as much as it would be to write a competitor to linux.
I'm not sure if I missed a bit here, but I have some colleagues doing research on unikernels for HPC and the point is that this unikernel is running directly on the hardware or hypervisor and not inside another VM. The unikernel is effectively a minimal VM and the network stack is one of the things they struggle the most with due to sheer effort.
[One of the authors of the paper] I wouldn't recommend writing a network stack from scratch, that is a lot of effort. Instead, with the Unikraft LF project (www.unikraft.org) we took the lwip network stack and turned it into a Unikraft lib/module. At KraftCloud we also have a port of the FreeBSD stack.
I tell people "An OCI container is a way to turn any random runtime into a statically linked binary."
It is very useful for managing dependency hell, or at least moving it into "API dependencies" not "Library dependencies", it is handy for pickling a CI/CD release engineering infrastructure.
It's not a security boundary.
(I'm 100% agreeing with parent, in case I sound contentious)
All security boundaries are "incidental" in that sense, though. Virtualization isn't a "purpose-designed" security boundary either, most of the time it's deployed for non-security reasons and the original motivation was software compatibility management.
The snobbery deployed in this "containers vs. VMs" argument really gets out of hand sometimes. Especially since it's almost never deployed symmetrically. Would you make the same argument against using a BSD jail? Do you refuse to run your services in a separate UID because it's not as secure as a container (or jail, or VM)? Of course not. Pick the tools that match the problem, don't be a zealot.
> All security boundaries are "incidental" in that sense, though
X86 protected mode, processor rings, user isolation in the multi user operating systems, secure execution environments in X86 and ARM ISAs, kernel and userspace isolation, etc. are purpose built security boundaries.
Virtualization is actually built to allow better utilization of servers, which is built as a "nested protected mode", but had great overhead in the beginning, which has been reduced over generations. Containers are just BSD jails, ported to Linux. This doesn't make containers bad, however. They're a cool tech, but held very wrong in some cases because of laziness.
The motivation for MMU hardware was reliability and not "security". Basically no one was thinking about computer crime in the 1970's. They were trying to keep timesharing systems running without constant operator intervention.
Yeah, but that's not an incidental property of *namespaces* (of which cgroups is only one isolation axis), that was the requirement when namespaces were designed.
Yeah, I know. Namespaces are pretty cool outside containers too.
My comment was more of a soft jab against using containers as the ultimate "thing" for anything and everything. I prefer to use them as "statically linked binaries" for short lived processes (like document building, etc.).
But, whenever someone abuses containers (like adding an HTTPS-fronting container in front of anything which can handle HTTPS on its own) I'm displeased.
There is no such thing as a reproducible build environment anymore. You can get a temporary reproducible build environment, but any sane security policy will have certificates that expire and that in turn means that in a couple years your build environment won't be reproducible anymore.
> but any sane security policy will have certificates that expire and that in turn means that in a couple years your build environment won't be reproducible anymore.
"Reproducible" is usually defined as "identical output except for the cryptographic signature at the end" (and that should be the only use for a certificate in your build environment, a high-quality build environment should be self-contained and have no network access). That is, once you remove the signature, the built artifacts should be bit-by-bit identical.
If you run multiple instances of a container image, you get a reproducible environment.
If you run a docker build multiple times, and copy a few files into the container, you get a reproducible container image. It is not a hash perfect duplicate, but functionally equivalent.
If builds of your favourite programming language are reproducible or not, is not really related to VM vs. Container.
The main advantage in my use case is in fact isolation (network and volumes) and a well defined API enabling management of those containers in production (not k8s, a tiny subnet of that perhaps).
The isolation could be achieved using namespaces directly. But the API, tooling and registry add a lot of value that would otherwise require a lot of development.
Also last time I looked hypervisors aren't possible on all cloud vendors, unless you have a bare metal server. This matters in my case. Maybe it has changed in the past 3 years.
When docker fits it's great. Same can be said of k8s, where there are a whole bunch of additional benefits.
If this were true, then wouldn't folks just need an application binary that statically links all of its required libraries and resources into a giant, say, ELF? Why even bother with a container?
Programmers discover the benefits of static linking, and then programmers discover the benefits of dynamic linking, and then programmers discover the benefits of static linking, and then...
Anyway containers go quite a bit further than just static linking, most people aren't out there linking all the binaries that their shell script uses together?
What if your application is not just one binary? What if it's a pipeline of complex tasks, calls some python scripts, uses a patched version of some obscure library, ...
It's not possible to package half a Linux distribution into a single binary. That's why we have containers.
First thing that comes to mind is the need to link against libraries across platforms. Imagine that my app depends on opencv: if I wanted to statically link everything on my Windows machine, I'd need to compile opencv for Linux on my Windows machine (or use pre-compiled binaries). Also, if you link against libraries dynamically, it's likely you can compile them on the host machine (or in a container) with more optimizations enabled. And the last thing is probably the ability to "freeze" the whole "system" environment (like folders, permissions, versions of system libraries).
Personally, I use containers to quickly spin-up different database servers for development or as an easy way of deployment to a cloud service...
Well yes, but try turning some random python, java or ruby service into a single binary .. now do that 12 times.
Or try with a native app that leverages both the GPU and libLLVM, and enjoy finding out the kind of precautions you have to take for LLVM to not blow up on a computer where your GPU driver was built with a different LLVM version.
That said, it makes sense from a developer POV; if, during development, you don't need the isolation you can run multiple containers (with on paper fast boot times and minimal overhead) on your development box.
There's plenty of cases to imagine where you need the containerization but not necessarily the isolation.
Because static libraries ain't a big thing anymore. Maybe they will become popular again. This would make it easier to have reproducible builds without a container. But I think containers are the new static libs now.
we arguably already had this with things like python venv.
the article's main point still remains, containers are a slow and bloated answer to this problem.
I concede you'll need containers for Kubernetes, and Kubernetes on the surface is a very good idea, but this level of infrastructure automation exists already in things like foreman and openstack. designs like shift-on-stack trade the simplicity of traditional hardware for ever more byzantine levels of brittle versioned complexity... so ultimately instead of fixing the problem we invoke the god of immutability, destroy and rebuild, and hope the problem fixes itself somehow... it's really quite comical.
baremetal rust/python/go with good architecture and CI will absolutely crush container workloads in a fraction of disk, CPU, RAM, and personal frustration.
Python venv is language specific, doesn't handle the interpreter version and doesn't handle C libraries.
I really don't understand why people do this: I get having a distaste for containers but some people, seeing the massive success of OCI images, mainly seem content on trying to figure out how to discredit its popularity, rather than trying to understand why it's popular. The former may be good for contrarian Internet forums, but the latter is more practically useful and interesting.
I say this with some level of understanding as I also have a distaste for containers and Docker is not my preferred way to do "hermetic" or "reproducible" (I am a huge Nix proponent.) I want to get past the "actually it was clearly useless from the start" because it wasn't...
All the younger engineers I talk to think you would need to be Albert Einstein to bootstrap a bare metal server.
As someone who made a living doing this at scale, where we would build a new datacenter every 2-4 weeks using 100% open source or off the shelf tools, I completely disagree.
I think PXE booting some servers and running a binary on them is 90% easier than most container orchestration engines, Kubernetes control planes, and all the other problems engineers seem to have invented for themselves. I also think it's almost always much more performant. Engineers don't have the intuition to realize that their XXLarge-SuperDuper instance is actually a 5 year old Xeon they're sharing with 4 other customers. Cloud providers obfuscate this as much as possible, and charge a king's ransom if you want modern, dedicated hardware.
NixOS and Guix System offer a far lighter and more reproducible approach, one that also doesn't push images built by who-knows-who on the internet straight into production, full of outdated deps, wasting storage and CPU resources in the meantime...
Yet it doesn't even come close to a fraction of the adoption scale of containers, no matter how good it is. Ecosystems matter more than individual quality.
That's because some interested parties have advertised containers: they are good to sell as pre-built stuff, nice for selling VPSes and the like, etc., while pure IaC is useful for anyone and invites you NOT to be dependent on third-party platforms.
It's not a technical matter, it's a human, economic matter, and actually... most people are poor; following the largest scale means following poverty, which is not a good thing.
[disclaimer: I'm one of the authors of the paper] I 100% agree, containers are an amazing dev env/reproducible env tool! In fact, we think they're the perfect marriage with the unikernels (specialized VMs) we used in the paper; on kraft.cloud, a cloud platform we built, we use Dockerfiles to specify apps/filesystems, and transparently convert them to unikernels for deployment. The end result is the convenience of containers with the power of unikernels (eg, millisecond cold starts, scale to zero and autoscale, reduced TCB, etc).
While reproducible build envs are a nice feature of using containers, they aren't the primary benefit.
The primary benefit is resource usage and orchestration.
Rather than duplicating entire aspects of an OS stack (which might be considered wasteful), they allow workloads to share aspects of the system they run on while maintaining a kind of logical isolation.
This allows for more densely packed workloads and more effective use of resources. This is a reason why the tech was developed and pushed by google and adopted by hyperscalers.
You are absolutely correct, and the creators of Docker did mention that was the core reason. Unfortunately your comment comes 10 years too late for many.
> If there is some additional isolation required, just run the container in a VM.
No. Running a container in a VM gets you no additional isolation. Containers share kernel space and as such have limited isolation compared to VMs, which have isolated kernels. In exchange for this Lack of additional isolation, you’ve added a Bunch of extra Complexity.
Pardon the extra caps I am using iOS voice dictation.
I think they mean run a VM with one container inside. So you do get strong isolation.
This is similar to how managed container IaaS works. They launch a VM and run your container in it.
It is extra complexity but has a few advantages. 1. People already have a convenient workflow for building container images. 2. The base OS can manage hardware, networking and whatever other low-level needs so that the container doesn't need to have these configurations. 3. If you want to trade off isolation for efficiency you can do this. For example running two instances of a container in the same VM. The container doesn't need any changes to support this setup.
The model of a single container within a VM just adds overhead. The ideal case would be to remove the container layer and have the application(s) within the container run directly in the VM (which hopefully only includes the libs and OS modules needed for the app to run, and nothing more).
This is the approach we take at kraft.cloud (based on the LF Unikraft project): use Dockerfiles to specify the app/filesystem, and at deploy time automatically convert to a lightweight VM (unikernel) without the container runtime/layer.
No, you don't. There is no benefit the container is providing, because the only feature of the container is isolating you from the zero other containers running on the VM.
The isolation I am referencing is from the VM, not the container. Containers don't provide strong isolation, that is why the VM is required in this model.