Man, if someone asked me to build a system to merely identify whether a Unicode string is human language or not, I would flatly refuse. There are thousands of spoken languages, many with no standard written form, some transcribed into multiple different writing systems, and some with no writing tradition at all, only ad-hoc transliteration unique to each user and use.
Even being 90% confident would be a massive undertaking, and "speakers of this language may/may not use the internet" feels like high stakes for getting it wrong.
It seems a little niche, but I'm sure a few times a year some far-flung town gets connected and suddenly there are speakers of a previously unknown-to-the-internet language newly online.
Note that the metric here is "is the tweet in one of the languages spoken by the user". This hypothetically allows more nuanced implementations than you contemplate.
For example, they could have a language "unrecognized" and assume everyone speaks it.
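That "unrecognized" bucket is easy to make concrete. A minimal sketch, assuming a hypothetical classifier that returns a language code plus a confidence score (the names `LangGuess`, `effective_language`, and `visible_to` are all my inventions, not anything Twitter actually uses); BCP 47 does reserve `und` as the real code for "undetermined":

```python
from dataclasses import dataclass

# Hypothetical classifier output: a language code plus a confidence score.
@dataclass
class LangGuess:
    code: str          # e.g. "en", "fr"
    confidence: float  # 0.0 .. 1.0

def effective_language(guess: LangGuess, threshold: float = 0.9) -> str:
    # BCP 47 reserves "und" for "undetermined"; map low-confidence
    # guesses to it rather than committing to a possibly wrong label.
    return guess.code if guess.confidence >= threshold else "und"

def visible_to(user_langs: set[str], guess: LangGuess) -> bool:
    # Everyone is assumed to "speak" und, so uncertain tweets
    # are never filtered out of anyone's feed.
    lang = effective_language(guess)
    return lang == "und" or lang in user_langs
```

The point of the threshold is that the cost of a wrong confident label (hiding a tweet from its own audience) is higher than the cost of showing an unidentified tweet to everyone.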
I broadly find this useful: I see tweets in other languages when they're retweeted by people I follow, and about half the time I machine translate them. But I don't want my whole feed to be that.
Well, if someone asked me to do that, I would suggest basing it on their recent tweet history and not just one tweet. And I would make my case in the meeting.
Second, it’s already been done, so my next suggestion would be to look at what all the computational linguistics majors have been up to.
I think it would be pretty easy with the language models we have available these days.
And there are always the options of having unsupported languages or inferring it from user settings or user location.
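Combining the two ideas above, a rough sketch of inferring a user's languages from their recent history with a settings fallback (everything here, including the `min_share` cutoff and the function name, is an assumption for illustration, not a known implementation):

```python
from collections import Counter

def infer_user_languages(tweet_langs, settings_lang=None, min_share=0.2):
    """Infer languages a user likely reads from per-tweet language guesses.

    tweet_langs: one guessed code per recent tweet, with "und" for
    undetermined. Any language making up at least min_share of the
    identifiable tweets is kept; the profile-settings language is
    added as a fallback signal.
    """
    counts = Counter(lang for lang in tweet_langs if lang != "und")
    total = sum(counts.values())
    langs = {l for l, n in counts.items() if total and n / total >= min_share}
    if settings_lang:
        langs.add(settings_lang)
    return langs
```

Aggregating over many tweets smooths out the per-tweet errors that make single-string identification so unreliable: one misclassified tweet barely moves the counts.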
From a product perspective, you will need this feature if you want worldwide coverage: very few people are polyglots, and most people don’t speak English as a first language.
Demoting a tweet that's entirely unidentifiable as any human language seems fair enough.
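That weaker check doesn't require identifying any language at all, only noticing whether the string contains letters in some Unicode script. A minimal sketch using the standard library (the function name and the idea of using it as a demotion pre-filter are mine):

```python
import unicodedata

def looks_like_language(text: str) -> bool:
    """Rough pre-filter: does the string contain any letter character
    in any Unicode script? Strings that fail this (pure emoji, digits,
    punctuation) can be demoted without identifying a language."""
    # Unicode general categories starting with "L" (Lu, Ll, Lo, ...)
    # cover letters in every script, Latin or otherwise.
    return any(unicodedata.category(ch).startswith("L") for ch in text)
```

This deliberately errs on the side of letting things through: a single letter from any script, known to the classifier or not, is enough to avoid demotion.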