They use perplexity on GitHub data to demonstrate the effectiveness of their model.
I suspect GitHub data has a lot of copy-pasted code. I.e., a good chunk of what you are asking the model to do is to go back X million tokens and copy a chunk verbatim.
Sure, the model might also be looking back at some code X million tokens ago and using that to improve its guess of the next token (oh look, the definition of the API I am using is back here, that'll help me get this right!).
But the perplexity number alone doesn't differentiate those cases - and considering how much code copying/templating happens in software, I suspect verbatim repetition lowers perplexity a lot more than genuinely smart use of the context window does.
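As a rough sanity check on this worry, you could measure how much of a corpus is verbatim repetition before trusting a perplexity number on it. This is a minimal sketch (the helper name `repeated_line_fraction` and the toy inputs are my own, not from any paper): it counts the fraction of non-blank lines that exactly repeat an earlier line, a crude proxy for copy-pasted or templated code.

```python
from collections import Counter

def repeated_line_fraction(text: str) -> float:
    """Fraction of non-blank lines that are verbatim repeats of an
    earlier line -- a crude proxy for copy-paste in a corpus."""
    seen = Counter()
    repeats = 0
    total = 0
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        total += 1
        if seen[line]:
            repeats += 1
        seen[line] += 1
    return repeats / total if total else 0.0

# Toy illustration: code-like text repeats far more than prose.
code_heavy = "import os\nimport os\nx = 1\nimport os\n"
prose = "The cat sat.\nA dog barked.\nBirds flew away.\n"
print(repeated_line_fraction(code_heavy))  # 0.5
print(repeated_line_fraction(prose))       # 0.0
```

If code corpora score much higher on a measure like this than prose corpora, that would support the suspicion that long-context perplexity gains on GitHub data partly reflect verbatim copying rather than deeper use of the context.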
I wonder if these models work well on other kinds of data?