Cool stuff! We use a similar process internally to rerank and filter our cold outbound lists. We just use an off-the-shelf model as the judge, give it custom criteria, and let it run for a set number of iterations. It's helped narrow wide searches down to the maximally relevant set of people (a few thousand mediocre matches down to a few hundred good ones).
It's not cheap and it's not fast, but it definitely works pretty well!
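For anyone curious, the shape of it is roughly this minimal sketch (the OpenAI client, the model name, the prompt, and the pass count are placeholders for whatever off-the-shelf judge and criteria you use, not our exact setup):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; any off-the-shelf judge model works

def judge_match(profile_text: str, criteria: str, model: str = "gpt-4o-mini") -> bool:
    """Ask the judge model for a strict yes/no on one profile against the criteria."""
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": "You are a strict recruiter. Answer only YES or NO."},
            {"role": "user", "content": f"Criteria:\n{criteria}\n\nProfile:\n{profile_text}\n\nDoes this profile meet the criteria?"},
        ],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")

def filter_list(profiles: list[str], criteria: str, passes: int = 3) -> list[str]:
    """Run a set number of judging passes, keeping only profiles that survive every one."""
    survivors = profiles
    for _ in range(passes):
        survivors = [p for p in survivors if judge_match(p, criteria)]
    return survivors
```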
It's all unstructured text (title, company, company size, experience, skills, raw text, etc.), and LLMs are pretty bad at assigning numerical scores in a vacuum. To make that work, we'd have to provide a representative set of examples, break scoring down by specific field, and so on.
Kind of a lot of work compared to just dumping the text of two profiles into a context window along with a vague description of what I want, and having the LLM make the binary judgment.
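Something like this, sketched with a generic chat-completions call just for illustration (the model name and prompt wording are made up, not a recommendation):

```python
from openai import OpenAI

client = OpenAI()

def better_fit(profile_a: str, profile_b: str, want: str, model: str = "gpt-4o-mini") -> str:
    """Dump two raw profiles plus a loose description and ask for a binary A/B call."""
    prompt = (
        f"I'm looking for: {want}\n\n"
        f"Profile A:\n{profile_a}\n\n"
        f"Profile B:\n{profile_b}\n\n"
        "Which profile is the better match? Reply with exactly one letter: A or B."
    )
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return "A" if resp.choices[0].message.content.strip().upper().startswith("A") else "B"
```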
Yeah, that's exactly what we observed. Our goal was to create an absolute score that's completely independent of the corpus, which is difficult because every ELO distribution is inherently tied to the corpus itself!
When we were exploring the mathematical foundations, we considered ELO scoring against a "Universal Corpus" based on the natural entropy of human language (obviously that's intractable, but sometimes that term cancels out, as it does in the DPO proof).
But eventually we figured out a method that uses cross-query comparisons to assign an "ELO bias" to all document ELOs within a given query's candidate list. This normalizes things correctly: when a candidate list is all bad, the ELOs shift low, and when it's all good, they shift high, even when the relative ELOs are identical.
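Very roughly, the idea looks something like this heavily simplified sketch (the shared reference list, the logistic-style fit for the bias, and all the constants are illustrative choices for doing cross-query comparisons, not the exact method):

```python
import random
from typing import Callable

K = 32  # standard Elo update step

def expected(r_a: float, r_b: float) -> float:
    """Expected win probability of A over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, a_won: bool) -> tuple[float, float]:
    """One Elo update after a single pairwise LLM judgment."""
    s = 1.0 if a_won else 0.0
    e = expected(r_a, r_b)
    return r_a + K * (s - e), r_b + K * ((1.0 - s) - (1.0 - e))

def rate_list(docs: list[str], beats: Callable[[str, str], bool], rounds: int = 200) -> dict[str, float]:
    """Relative ELOs within one query's candidate list, from random pairwise matchups."""
    ratings = {d: 1000.0 for d in docs}
    for _ in range(rounds):
        a, b = random.sample(docs, 2)
        ratings[a], ratings[b] = update(ratings[a], ratings[b], beats(a, b))
    return ratings

def elo_bias(docs: list[str], ratings: dict[str, float],
             reference: list[str], ref_ratings: dict[str, float],
             beats: Callable[[str, str], bool], trials: int = 100, lr: float = 40.0) -> float:
    """Fit a single per-query offset so that cross-list matchups against an
    already-calibrated reference list agree with what the shifted ELOs predict."""
    bias = 0.0
    for _ in range(trials):
        d, r = random.choice(docs), random.choice(reference)
        won = 1.0 if beats(d, r) else 0.0
        pred = expected(ratings[d] + bias, ref_ratings[r])
        bias += lr * (won - pred)  # gradient step on the pairwise log-loss
    return bias
```

The final score for each document is just `ratings[d] + bias`: a list full of weak candidates loses most of its cross-list matchups, picks up a negative bias, and the whole list shifts low even though the relative ELOs inside it are unchanged.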