Say your text is "I like red cats" and the current context is "I like red". Suppose the model thinks "cats" is 25% likely as the next token, while "dogs" is, say, 50% likely. Then instead of the 1 bit (= -log2 p) it would take to encode "dogs", the coder spends 2 bits to encode "cats" in the output code. Totally lossless. At decode time you're reading the code, so the decoder picks the "cats" token, even though it's not the likeliest: it isn't randomly sampling, it's just using the underlying probabilistic model to drive the entropy coding. Nothing is lossy here, because we never let the model or a sampling process freely choose what the text is made of.
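Here's a minimal sketch of the idea as an arithmetic coder. The `model` function is a hypothetical stand-in for the LLM (it just hard-codes the 50%/25% probabilities from the example); exact `Fraction` intervals are used instead of a real bit-level coder to keep the roundtrip obviously lossless. Encoding a token with probability p shrinks the interval by a factor of 1/p, which is exactly -log2 p bits: "cats" (p = 1/4) costs 2 bits where "dogs" (p = 1/2) would have cost 1.

```python
from fractions import Fraction

# Hypothetical stand-in for an LLM: given the tokens so far, return the
# next-token distribution. Only the final step uses the probabilities
# from the example; earlier steps just use a uniform distribution.
def model(prefix):
    if prefix == ["I", "like", "red"]:
        return {"dogs": Fraction(1, 2), "cats": Fraction(1, 4), "birds": Fraction(1, 4)}
    vocab = ["I", "like", "red", "dogs", "cats", "birds"]
    return {t: Fraction(1, len(vocab)) for t in vocab}

def encode(tokens):
    # Each token narrows [low, high) to the sub-interval the model
    # assigns to it; more likely tokens get wider sub-intervals.
    low, high = Fraction(0), Fraction(1)
    prefix = []
    for tok in tokens:
        dist = model(prefix)
        cum = Fraction(0)
        for t, p in dist.items():
            if t == tok:
                span = high - low
                low, high = low + cum * span, low + (cum + p) * span
                break
            cum += p
        prefix.append(tok)
    # Any number inside the final interval identifies the whole sequence.
    return (low + high) / 2

def decode(code, n_tokens):
    # Replays the same model and deterministically picks whichever
    # token's sub-interval contains the code -- no sampling involved.
    low, high = Fraction(0), Fraction(1)
    prefix = []
    for _ in range(n_tokens):
        dist = model(prefix)
        cum = Fraction(0)
        for t, p in dist.items():
            span = high - low
            lo2, hi2 = low + cum * span, low + (cum + p) * span
            if lo2 <= code < hi2:
                prefix.append(t)
                low, high = lo2, hi2
                break
            cum += p
    return prefix
```

The decoder recovers "cats" exactly, not "dogs", because it follows the code rather than the model's preference: `decode(encode(["I", "like", "red", "cats"]), 4)` gives back the original four tokens.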