Why can't it? End-to-end encryption means you don't need to hold on to pictures for lengthy periods of time, and video calling can be done without a centralized service.
Neither of those costs anything extra to run, though. Since it's end-to-end encrypted with no server backups, the images and video can be sent directly between the clients; WhatsApp just plays matchmaker.
It could, but that's not how it works. They store the (encrypted) media on their servers so that both phones don't need to be online at the same time, and so that you don't have to upload multiple times when you send to a group.
That said, they shouldn't need to keep them around for that long, unlike FB and similar.
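
To make the store-and-forward idea concrete, here's a rough Python sketch of "encrypt once, upload once, hand each recipient a pointer and the key". Everything in it is made up for illustration: `MediaServer`, `send_media`, `receive_media`, and especially `keystream_xor`, which is a toy stand-in for a real cipher and not something to use for actual encryption. It is not WhatsApp's real protocol, just the general shape of the approach described above.

```python
import hashlib
import secrets


def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy placeholder for a real cipher (assumption, NOT secure crypto).

    XORs the data with a SHA-256 counter keystream, so applying it twice
    with the same key returns the original bytes.
    """
    out = bytearray()
    for offset in range(0, len(data), 32):
        block = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        chunk = data[offset:offset + 32]
        out.extend(b ^ k for b, k in zip(chunk, block))
    return bytes(out)


class MediaServer:
    """Hypothetical relay: stores opaque blobs and never sees any key."""

    def __init__(self):
        self.blobs = {}
        self.uploads = 0

    def upload(self, ciphertext: bytes) -> str:
        blob_id = hashlib.sha256(ciphertext).hexdigest()
        self.blobs[blob_id] = ciphertext
        self.uploads += 1
        return blob_id

    def download(self, blob_id: str) -> bytes:
        return self.blobs[blob_id]


def send_media(server, media: bytes, recipients):
    """Encrypt and upload once, then return one (blob_id, key) per recipient.

    In a real system each pair would travel inside that recipient's
    end-to-end encrypted message; here it is just returned directly.
    """
    key = secrets.token_bytes(32)
    blob_id = server.upload(keystream_xor(key, media))
    return {name: (blob_id, key) for name in recipients}


def receive_media(server, pointer):
    """Fetch the blob whenever the recipient comes online and decrypt it."""
    blob_id, key = pointer
    return keystream_xor(key, server.download(blob_id))


server = MediaServer()
photo = b"holiday photo bytes" * 10
pointers = send_media(server, photo, ["alice", "bob", "carol"])
```

The sender is done after a single upload, however many people are in the group, and each recipient can call `receive_media` later without the sender being online. The server only ever holds ciphertext, so it could delete the blob after some retention window without having been able to read it.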