I've thought about this before: give the AI a 3D physics engine in which it can create objects and run simulations to check whether they are physically viable. That could also help with question answering that requires spatial knowledge or real-world simulation.
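To make the "physical viability" idea a bit more concrete, here is a rough sketch of the kind of check I mean. It is not a physics engine, just a static stability test for a stack of boxes along one horizontal axis, and the names `Box` and `stack_is_stable` are made up for illustration. It assumes rigid boxes with positive mass, no glue or friction effects, listed bottom to top.

```python
from dataclasses import dataclass


@dataclass
class Box:
    x: float      # horizontal position of the box's center
    width: float  # horizontal extent
    mass: float   # assumed to be positive


def stack_is_stable(boxes):
    """Return True if a stack of boxes (ordered bottom to top) stays put.

    At each interface, the combined center of mass of everything above
    must lie over the contact region between the two touching boxes.
    """
    for i in range(len(boxes) - 1):
        lower, upper = boxes[i], boxes[i + 1]
        above = boxes[i + 1:]

        # Center of mass of all boxes resting on `lower`.
        total_mass = sum(b.mass for b in above)
        com = sum(b.x * b.mass for b in above) / total_mass

        # Contact region: overlap of the two boxes' horizontal extents.
        left = max(lower.x - lower.width / 2, upper.x - upper.width / 2)
        right = min(lower.x + lower.width / 2, upper.x + upper.width / 2)

        # No contact at all, or the load hangs past the contact edge.
        if left >= right or not (left <= com <= right):
            return False
    return True


# A gently offset tower holds; a more aggressive offset tips over.
gentle = [Box(0.0, 2.0, 1.0), Box(0.5, 2.0, 1.0), Box(1.0, 2.0, 1.0)]
steep = [Box(0.0, 2.0, 1.0), Box(0.8, 2.0, 1.0), Box(1.6, 2.0, 1.0)]
print(stack_is_stable(gentle), stack_is_stable(steep))
```

A real engine would step the dynamics forward in time instead of checking a static condition, but the interface would be similar: the AI proposes objects, the simulator reports whether the arrangement holds.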
Yes, I think the latest advances in ML, and everything else, will help in creating those traditional simulations.
But I also think that for an AI to be really effective at, say, answering questions about a video, it would basically need to run compressed versions of those simulations in a spatial-temporal-abstract latent space, which should be a better model than the textual space alone.