I've been wondering whether we could eventually train the models in first-order terms, i.e. feed in some HTML/JS/CSS plus user state (e.g. scroll position, viewport width/height) and have the model output a final rasterized frame representing the current state of the viewport. The training data would be fairly obvious and easy to collect.
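
To make the shape of that training data concrete, here is a rough sketch of what one sample and a collection loop might look like. Everything here is an assumption for illustration: the names `ViewportState`, `RenderSample`, and `collect_samples` are made up, and `render` is a hypothetical hook standing in for whatever real browser you would use to produce the reference frames.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass(frozen=True)
class ViewportState:
    """User state fed to the model alongside the page source (hypothetical schema)."""
    width: int          # viewport width in CSS pixels
    height: int         # viewport height in CSS pixels
    scroll_x: int = 0   # horizontal scroll offset
    scroll_y: int = 0   # vertical scroll offset


@dataclass(frozen=True)
class RenderSample:
    """One (input, target) pair: page source + state -> rasterized frame."""
    html: str           # page source, with any JS/CSS inlined
    state: ViewportState
    frame: bytes        # encoded image of the viewport, e.g. PNG bytes


# Hypothetical hook: any function that renders the page in a real browser
# and returns the visible viewport as image bytes.
Renderer = Callable[[str, ViewportState], bytes]


def collect_samples(
    pages: Iterable[str],
    states: Iterable[ViewportState],
    render: Renderer,
) -> Iterator[RenderSample]:
    """Yield one sample for every (page, state) combination."""
    state_list = list(states)  # materialize so it can be reused per page
    for html in pages:
        for state in state_list:
            yield RenderSample(html=html, state=state, frame=render(html, state))
```

In practice `render` would wrap a headless browser screenshot call; keeping it as a plain callable just means the loop itself doesn't depend on any particular browser or automation library.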

Failing that, the model could produce an optimized binary blob that achieves the same thing, trained on modern browser specifications. If you go to ChatGPT, start talking about ISO 32000-compliant implementations, and poke at the edges, you can get it to start writing a PDF engine pretty quickly.