None of the fact-based queries I have asked so far have been answered 100% correctly by any LLM, including this one.
Here are some examples of the worst performing:
"What platform front rack fits a Stromer ST2?": The answer is the Racktime ViewIt. Nothing, not even Google, seems to get this one. Discord gives the right answer.
"Is there a pre-existing controller or utility to migrate persistent volume claims from one storage class to another in the open source Kubernetes ecosystem?" It said no (wrong) and then provided another approach that partially used Velero that wasn't correct, if you know what Velero does in those particular commands. Discord communities give the right answer, such as `pvmigrate` (https://github.com/replicatedhq/pvmigrate).
Here is something more representative:
"What alternatives to Gusto would you recommend? Create a table showing the payroll provider in a column, the base monthly subscription price, the monthly price per employee, and the total cost for 3 full time employees, considering that the employees live in two different states" This and Claude do a good job, but do not correctly retrieve all the prices. Claude omitted Square Payroll, which is really the "right answer" to this query. Google would never be able to answer this "correctly." Discord gives the right answer.
The takeaway is pretty obvious, right? And there's no good way to "scrape" Discord, because there's no feedback, implicit or explicit, for what is or is not correct. So to a certain extent their data gathering approach - paying Kenyans - is sort of fucked for these long tail questions. Another interpretation is that for many queries, people are asking in the wrong places.
So do you just have a list of like thousands of specialized Discord servers for every question you want to ask? You're the first person I've seen who is actually _fond_ of Discord locking information behind a login, instead of the forums and issue trackers of old.
I personally don't think it's a useful evaluation here either, as you're trying to pretend Discord is just a "service" like Google or ChatGPT, but it's not. It's a social platform, and as such there's a ton of variance in which subjects will be answered, and with what degree of expertise and certainty.
I'm assuming you asked these questions because you yourself know the answers in advance. Is it then safe to assume that you were _already_ in the servers where you asked your questions, already knew users there would be likely to know the answer, etc? Did you copy paste the questions as quoted above? I hope not! They're pretty patronizing without a more casual tone, perhaps a greeting. And if you didn't, it doesn't exactly seem like a fair evaluation.
I don't know why I'm typing this all out. Of course domain expert _human beings_ are better than a language model. That's the _whooole_ point here: trying to match humans' general intelligence. While LLMs may excel in many areas and even beat the "average" person - you're not evaluating against the "average" person.