With AlphaZero there are clear evaluation metrics: you win, lose, or draw the game under a fixed set of rules. With chess, there is even a way of detecting end-game threats via check. The zero-human-data approach works here because of that, allowing the computer to find strong strategies through self-play alone.
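To make that concrete, here is a minimal sketch of what "the rules give you the score for free" looks like. It uses tic-tac-toe rather than chess to keep it short, and the names (`WIN_LINES`, `outcome`) are my own for illustration, not anything from AlphaZero itself.

```python
from typing import List, Optional

# All eight winning lines on a 3x3 board, cells indexed 0..8 row by row.
WIN_LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]


def outcome(board: List[str], player: str) -> Optional[int]:
    """Score a position for `player` using only the rules of the game.

    Returns +1 for a win, -1 for a loss, 0 for a draw,
    and None while the game is still in progress.
    """
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return 1 if board[a] == player else -1
    if " " not in board:
        return 0  # board is full and nobody completed a line
    return None
```

No human has to label anything here: any finished game can be scored mechanically, which is what lets self-play generate its own training signal.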

With natural language you don't have that kind of unaided evaluation metric: nothing in the text itself tells the system whether its reading was right. This is especially true for idioms, domain-specific terms, and the like.

So the feedback has to come from people, and that is slow, hard work: you need to process some text, evaluate and correct the output, retrain, and repeat with the next text. You also need to check and correct the existing data, because inconsistencies in it will compound any errors.