0xab's comments

> When VLMs make errors, they don't make random mistakes. Instead, 75.70% of all errors are "bias-aligned" - meaning they give the expected answer based on prior knowledge rather than what they actually see in the image.

Yeah, that's exactly what our paper said 5 years ago!

They didn't even cite us :(

"Measuring Social Biases in Grounded Vision and Language Embeddings" https://arxiv.org/pdf/2002.08911


I think the social biases in your paper (e.g. the angry-black-woman stereotype) are different from the cognitive biases about facts (e.g. number of legs, whether lines are parallel) that the OP is about.

Social biases are subjective. Facts are not.


As far as the model's concerned, there's not much difference. Social biases will tend to show up objectively in the training data because the training data is influenced by those biases (the same thing happens with humans, which is how these biases proliferate and persist).


I see a clear difference. One is objective (only one correct answer); the other is subjective (multiple plausible answers).


Well you send a vaguely worded email like "I think you may find our work relevant" and everyone knows what that means and adds the citation


Hello 0xab,

Sorry that we missed your work. There are a lot of works in this area, both textual and visual, especially on social biases.

We wish we could mention them all, but space is limited, so one can often only discuss the most relevant ones. We'll consider discussing yours in our next revision.

Genuine question: Would you categorize the type of bias in our work as "social"?


It's easier to succeed if you ignore the issues and the users are not aware of them. The rate of evolution of "AI" recently is so fast that no one is stopping to do actual benchmarks and analysis of all the new models.


That's weird, you're at MIT. You're in the circle of people that's allowed to succeed.

I wouldn't think much about it, as it was probably a genuine mistake.


What does allowed to succeed mean?


Your work usually has 1,000x the exposure and external validation compared to doing it outside those environments, where it would just get discarded and ignored.

Not a complaint, though. It's a requirement for our world to be the way it is.


Is there truth to this? Do you have any sources to link to on this?


Sure dude, here's the link to the UN Resolution about which researchers deserve attention and which others do not, signed by all countries around the world [1].

*sigh*

It's pretty obvious: if you publish something at Harvard, MIT, et al., you even get a dedicated PR team to make your research stand out.

If you publish that on your own, or on some small research university in Namibia, no one will notice.

I might be lying, though, 'cause there's no "proof".

1: https://tinyurl.com/3uf7r5r7


What do you genuinely think they built upon from your paper?

If anything, the presentation of their results in such an accessible format next to the paper should be commended.


Datasets need to stop shipping with any training sets at all! And they should forbid anyone from using the test set to update the parameters of any model through their license.

We did this with ObjectNet (https://objectnet.dev/) years ago. It's only a test set, no training set provided at all. Back then it was very controversial and we were given a hard time for it initially. Now it's more accepted. Time to make this idea mainstream.

No more training sets. Everything should be out of domain.


I don't know how this is possible with LLM tests. The closed source models will get access to at least the questions when sending the questions over the fence via API.

This gives closed source models an enormous advantage over open-source models.

The FrontierMath dataset has this same problem[1].

It's a shame because creating these benchmarks is time consuming and expensive.

I don't know of a way to fix this except perhaps partially by using reward models to evaluate results on random questions instead of using datasets, but there would be a lot of reproducibility problems with that.

Still -- not sure how to overcome this.

[1]: https://news.ycombinator.com/item?id=42494217


It's possible.

I'm not worried about cheaters. We just need to lay out clear rules: you cannot look at the inputs or outputs in any way, you cannot log them, and you cannot record them for future use, either manually or in an automated way.

If someone cheats, they will be found out. Their contribution won't stand the test of time, no one will replicate those results with their method. And their performance on datasets that they cheated on will be astronomical compared to everything else.

FrontierMath is a great example of a failure in this space. By going closed, instead of using a license, they've created massive confusion. At first they told us that the benchmark was incredibly hard, and they showed reviewers subsets that were hard. Now they're telling us that actually, 25% of the questions are easy and 50% of the questions are pretty hard, and only a small fraction are what the reviewers saw.

Closed datasets aren't the answer. They're just unscientific nonsense. I refuse to even consider running on them.

We need test sets that are open for scrutiny. With licenses that prevent abuse. We can be very creative about the license. Like, you can only evaluate on this dataset once, and must preregister your evaluations.


I would like to agree with you, but I doubt the honor system will work here. We are talking about companies that have blatantly trampled (or are willing to risk a judicial confrontation about trampling) copyright. It would be unreasonable to assume they would not engage in the same behavior about benchmarks and test sets, especially with the amount of money on the line for the winners.


I understand the idea but I don't think that it is beneficial in the end.

Access to the dataset is needed to understand why we get a given result. First, from a transparency point of view, to check whether the results make sense and why one model is favored over another.

But it is also needed to understand why a model performs badly on some aspect, so we can determine how to improve it.


I'm the lead author of the paper you cited. Glad you enjoyed our work :)

Sure, you can break systems. That doesn't mean that they aren't useful! In many cases a system will see the same boring input many times over. People are often willing to be a bit flexible and help out when it happens to misread something. The fact that you can intentionally break systems like that, and that you can break them in a particular direction, like making them always think there's no danger in an image, is really worrisome.

Our work shows that your autonomous car won't always work well; that its vision system has some systematic error which we can characterize now. Adversarial attacks show that someone can intentionally make your car see a lane, whenever and wherever they feel like it, and drive you off the road. It's a whole different ballgame, and the language of attack and defense really fits well.


Years ago we had a lot of good problems. Not just 2006; 2007 was great too, with secret messages to save Endo. 2008 at least had you doing some fun things with the rover. In 2014 you programmed a fun machine. One of the years was an optimization problem, I forget which, but it involved orbital mechanics, so it was a lot deeper than the current setup.

In the past 5 or so years the contest has settled into a pretty boring rut. The problems are all the same. "We came up with a system that has an agent. It can do 5-15 things. Get it to solve this simple to define problem efficiently."

It's not that these problems aren't fun, it's that they're all the same :(

What's the point in doing the 10th contest in exactly the same format? It's gotten to the point where I can just reuse code from previous years.


Aside from 2006-2008, the ramp-up of complex "fun" projects, nearly all the contests followed the same optimization style, often a maze-solving format. Off the top of my head I remember the ant colony, the Mars rover, the Pac-Man-type game, and the mining maze.

The oldest one I remember was an HTML compressor.

The real downfall of ICFP is that in the past 10 years, anyone talented enough to participate is no longer merely a student or professor, but can get a job at or create their own startup, big company, or meaningful open source project.


> merely a student or professor, but can get a job at or create their own startup or big company or meaningful open source project of their own.

Hey! Professors do meaningful things too :)

> The real downfall of icfp is that in the past 10 years, anyone talented enough to participate

I don't think the optimization style is the biggest problem. If you look at it, Endo in 2007 was "optimization" too: you had to find the shortest prefix to save our alien friend. But the problem had more meat to it, and it was different from the problems that came before.

People are no different today than 10-20 years ago, and more good young people are created all the time who can participate. If anything, there are way more capable people that can participate today than 20 years ago.

The problems are now all the same. Why would you waste your time on this year's problem when last year's is nearly identical but with a different two-paragraph lead-in story?

Regardless of our diagnosis, the numbers bear out that the ICFP contest is slowly dying. The number of teams submitting from 2014 to 2018: 275, 230, 203, 120, 107.

We have a few more years before it's scrapped at this rate.


There's so much wrong with this.

HIPAA doesn't talk about US citizens or distinguish different types of records based on any properties of the people those records cover. The word "citizen" does not appear in HIPAA or HITECH. HIPAA applies to any records held by covered entities, which is what it discusses, regardless of who those records refer to.

There is no requirement in HIPAA that PII must be stored in the US. This is such basic info that it's in the HIPAA FAQ from HHS: https://www.hhs.gov/hipaa/for-professionals/special-topics/c.... Question 9 is unequivocal: you can store data outside of the US, but you need to think about any dangers or risks associated with doing so. Which is totally logical.

There are lots of reasons to have issues with the US. But not what you're talking about.


If you use lens as just a way to access records like you do in other languages, then there is absolutely nothing hard about it. Literally all you need to know is:

Name your records like "data Prefix = Prefix { prefixFieldName :: ... }", call "makeFields ''Prefix" once at the bottom of your file, and use "obj ^. fieldName" to access and "obj & fieldName .~ value" to set.

That's it. You now have 100% of the capabilities of record update in any other language. This doesn't get any simpler in any other language. It even pretty much looks like what you would do in other languages.

I'll grant you, Haskell and lens do a terrible job of explaining subsets of functionality that are simple and let you get the job done before jumping in the deep end.


Yeah, so it's a less good way of accessing record fields than the one present in 99% of other programming languages. Your own description makes this plain. Let's compare to Javascript:

* I don't need to import a module to make available the syntax for getting and setting fields of an object.

* I can use the same syntax for any object, and don't have to worry about doing a bunch of code generation via a badly-designed metaprogramming hack.

* I don't have to worry about adding prefixes to all my field names.

* The syntax uses familiar operators that I won't have to look up again on hackage if I stop writing Javascript for a few months.

* No-one modifying my code can get "clever" and use one of ~50 obscure and unnecessary operators to save a couple of lines of code.

What bugs me is when Haskell advocates try to use all the additional esoteric features of the lens library as an excuse for this fundamental baseline crappiness.

Haskell really just needs proper support for record types. Then people could use lenses when they actually need lenses (never?). At the moment, they're using lenses because they want something that looks almost like a sane syntax for record updates.


Record types are not a solution to the problem lens solves. Lens is a good library and a good concept. If we spent some time on it in programming class, most people would get it. When moving to non-Haskell languages, the lack of proper lenses is something I notice almost immediately.


I know what the lens library does - I write Haskell for my day job.

In practice, the main reason people use it is to work around the deficiencies of Haskell's built-in record system:

> I never built fclabels because I wanted people to use my software (maybe just a bit), but I wanted a nice solution for Haskell’s non-composable record labels. (http://fvisser.nl/post/2013/okt/11/why-i-dont-like-the-lens-...)

The other features of lenses don't strike me as particularly useful. YMMV. I'd also question the quality of the library. It's full of junk like e.g. http://hackage.haskell.org/package/lens-4.17.1/docs/src/Cont..., which is just an invitation to write unreadable code.


My biggest use case for lenses that I miss in other languages is the ability to interact with all elements of a collection, or elements in deeply nested collections.

For example, if I had a list of records with a field named 'categories' holding a list of objects with a field named 'tags', and I wanted to get all of these tags in one list without nested loops, lens makes it easy: record ^.. categories . each . tags . each. Or I could update them all, etc. It's just so easy to do this kind of data munging with lens that writing fors, whiles, etc. in other languages is painful.
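For contrast, here is roughly the same flattening without lenses: a minimal Python sketch using dicts, with hypothetical field names matching the example.

```python
# Hypothetical nested data mirroring the example: records with a
# 'categories' list, each category with a 'tags' list.
records = [
    {"categories": [{"tags": ["a", "b"]}, {"tags": ["c"]}]},
    {"categories": [{"tags": ["d"]}]},
]

# What 'record ^.. categories . each . tags . each' expresses in one
# lens path becomes a triple-nested comprehension (or three for-loops):
all_tags = [tag
            for record in records
            for category in record["categories"]
            for tag in category["tags"]]
# all_tags == ['a', 'b', 'c', 'd']
```

And unlike the lens path, updating every tag in place cannot reuse the same expression; you write the loops again by hand.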


There's a control group. Both groups were watched by the experimenters. Not only that, the same teachers had students in both groups in different classes.


But the control group would know that nothing has changed. The group testing the new method would feel the excitement - the control group would not.


I do research in computer vision and this paper is so bad it's beyond words.

* They give the network a huge advantage: they teach it that it should say "no" 80% of the time. The training data is unbalanced (80% no vs. 20% yes), as is the test data. Of course it does well! I don't care what they do at training time, but the test data should be balanced, or they should correct for this in the analysis.

* They measure the wrong things, in a way that rewards the network. Because the dataset is imbalanced you can't use an ROC curve, sensitivity, or specificity. You need to use precision and recall and make a PR curve. This is machine learning and stats 101.

* They measure the wrong thing about humans. What a doctor does is decide how confident they are and then refer you for a biopsy. They don't eyeball it and go "looks fine" or "it's bad". They should measure how often this leads to a referral, and they'll see totally different results. There's a long history in papers like this of defining a bad task and then saying that humans can't do it.

* They have a biased sample of doctors that is highly skewed toward people with no experience. Look at figure 1. A lot of those doctors have about as much experience to detect melanoma as you do. They just don't do this task.

* "Electronic questionnaires" are a junk way of gathering data for this task. Doctors are busy. What tells the authors that they're going to be as careful on this task as with a real patient? Real patients also have histories, etc.

I could go on. The number of problems with this paper is just interminable (54% of their images were non-cancer because a bunch of people looked at them. If people are so wrong, why are they trusting these images? I would only trust biopsies).

This isn't coming to a doctor's office anywhere near you. It's just a publicity stunt by clueless people. Please collaborate with some ML folks before publishing work like this! There are so many of us!
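To make the first point concrete, a toy sketch (numbers assumed, not from the paper): on an 80/20 test split, a degenerate classifier that always answers "no" already looks good on accuracy, and only per-class recall exposes it.

```python
# Assumed toy test set: 80 "no" cases, 20 "yes" (melanoma) cases.
y_true = [0] * 80 + [1] * 20
y_pred = [0] * 100                    # classifier that always says "no"

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_pos / sum(y_true)       # sensitivity on the cancer class

print(accuracy)   # 0.8: looks respectable
print(recall)     # 0.0: finds no cancers at all
```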


Since this is a journal focused on cancer and not machine learning, I can understand why the editors would see this paper as worthy of publication. Unfortunately, many of the readers will read the paper uncritically.

If possible, you should write a critical response to this paper, focusing on its methodological flaws, and send it to the editors. It doesn't have to be long; critical responses are usually a couple pages at most. This is likely the most effective way of removing (or at the very least, heavily qualifying) bad science from research journals.


This is a huge problem throughout science, not just ML. As scientists, we're rewarded for publishing cool new things that work, not for pointing out things that don't or for pointing out flaws in existing papers. If the point is to get people to not read one bad paper, it's just a waste of my time. Most papers are false and a lot of them should never have passed review.

If the authors actually wanted to do good ML research, they could always have reached out to a decent ML researcher who could have told them all of this. There's no shortage of us. The journal could have reached out to an ML reviewer. Why wouldn't they? But no one did, because the results look good and so they send it off to press and it's good for both the authors and the journal to have something that is hype-worthy. It's just the sad reality of modern science.


It's amazing that a similar concern was raised and discussed here just a couple hours ago: https://news.ycombinator.com/item?id=19788088

Any chance we could connect over email or something?


> Most papers are false and a lot of them should never have passed review.

Do you mean this literally or is this a metaphor to illustrate the point? If you actually mean most papers are false it'd be nice to see a link on that!


John Ioannidis claims that "most published research is false" based on some rather dubious assumptions.

https://www.annualreviews.org/doi/abs/10.1146/annurev-statis...


I agree with him although the accuracy of that statement is partially based on how “published research” is defined. Operational definitions and measurement are themselves much of the problem.


How to Publish a Scientific Comment in 1 2 3 Easy Steps

http://frog.gatech.edu/Pubs/How-to-Publish-a-Scientific-Comm...

I agree that a formal comment is best although not necessarily easy. A comment on PubPeer is easier but it will probably only be seen by those with the PubPeer extension.

I do machine learning in computational biology and cancer. The issues described in the parent comment are known among experts. It’s too bad so many others don’t know or care.


Thank you for that link, it was a joy to read


I mean, if it's an interdisciplinary study, you may want to get advisers from all sides to look at it before you publish, no?


Why would you ever balance your test data? If 80/20 is the actual population distribution, the sample that forms your test set should conform to that. Balance all you want in train/validation sets, but never the test set.

Not balancing and using ROC is a terrible combo, but the metric is the problem, not the lack of artificial balance.


I agree, they should do one or the other.

The imbalance is totally artificial and objectionable, though. Where's the evidence that doctors see an 80/20 split in real life? If there is going to be an imbalance, they should make it reflect the actual statistics of the task the doctors perform, not some artificial number. It doesn't even reflect the statistics of the dataset they started with (which is 90/10 unbalanced).

Admittedly, the correct analysis for when the data is unbalanced is more annoying, and ROC curves are easier to interpret. That's why in something like ImageNet, even though the training set is imbalanced, the test set is balanced.

Comparisons against humans are also harder when the data is imbalanced in a way that reflects the training set, not the task. Humans don't know they are supposed to say "no" 80% of the time. That rewards the machine, and it isn't easy to correct (you can adjust what you conclude about the machine's results with respect to a baseline, but not what biases the humans had).


> Where's the evidence that doctors see a 80/20 split in real life?

Cause they definitely don’t. Even in a select subpopulation - say, people going to a derm for screening - you’d expect one melanoma per 620 persons screened (as per the SCREEN trial). Since most people have more than one mole for evaluation, and even those with melanoma will have multiple innocent moles... a mole count >50 triggers a referral for screening, though in more cautious docs, possibly as few as 25...

If you wanna be really generous and consider our hypothetical high risk group to have an average of 10 moles per person, that’s 6209:1, not 80:20.


Another reason to balance the test set when the train set is unbalanced is to check if lack of training data for certain classes is a problem. You would use cross-validation, but do different splits for each class. It might well turn out that certain classes are just "easy", and you don't need to find more training samples for them to get the overall accuracy up.


80/20 is not the actual population distribution though.


Do you have an explanation of why ROC is bad for unbalanced datasets? Isn't ROC unaffected by dataset imbalance?


Agreed. I have a hard time believing this person does CV research (though I suppose it could just be a hobby for them) with a statement like that. Especially calling out that they didn't balance the test set... umm, what?


I would say your criticism is way off base. I've developed and fielded ML-based medical devices, and this looks like a reasonable study that suggests they have a system worthy of further testing. There's nothing wrong with using an ROC curve here, and they document the experience of the doctors, so they weren't hiding that; around 60 or so doctors had greater than 5 years of experience. Also, studies like this generally don't use only biopsy-proven negatives, since that would bias the negatives towards those that were suspicious enough to biopsy.

Without knowing more details than the paper provides, I cannot say the results are valid, but I also don't see any terrible errors after a quick scan. The main weakness is probably that the test set came from the same image archive used for development. As a result, there can be all sorts of biases the CNN is using to inflate its performance, unbeknownst to the developers. The best way to eliminate that concern is to use a test set gathered through a different data collection effort using different clinics, but that is expensive and time consuming and not something I would do initially. This looks like a good first step, and I would encourage the developers to carry it further.

EDIT: I'll add that the ratio of positives to negatives in the training set is irrelevant and in no way invalidates the study.

As far as testing goes, there is always a balance you must strike in a reader study involving doctors. Ideally, you would have the exact ratio a doctor would encounter in practice, but for a screening study that is typically impractical, as you would need a huge number of cases and doctor time is expensive. A ratio of 1 positive to 4 negatives is entirely reasonable, although the doctors (particularly the less experienced ones) will almost certainly have an elevated sensitivity and reduced specificity, since they will know it is an enriched set. But this is fine for ROC comparison purposes, as it mostly just selects a different point on each doctor's personal ROC curve. Note that some studies even tell the doctors beforehand what percentage of cases are positive.


Thank you for posting this; I can see that this evaluation came very easily to you because of your experience and expertise but to me it shows how much knowledge is required to evaluate something like this. There really should be a protocol defined around this kind of study that encodes the criticisms that you make here (and others) and stops publication of this kind of thing in its tracks.


Agreed. See this paper for a reputable reference in this space: https://www.nature.com/articles/nature21056


Why can't you use ROC with an imbalanced dataset?

My understanding is the PR curve is preferable to ROC since the ROC can make it difficult to discern differences between models on imbalanced data; but the ROC is still a valid way to compare/measure models.


I work as an ML engineer, some thoughts:

The train/test data being imbalanced in the same way does give the model an advantage, but I don't think that making the test set 50% positive would solve the issue completely either. Doctors have been "trained" on the true distribution, which is not 50% (I'd guess the true distribution is actually extremely unbalanced).

The model isn't simply learning to predict "no" 80% of the time; it is learning the distribution of the data with respect to the input features. For example, say we have a simple model with only 3 binary features. It may learn that when features X_0, X_1, and X_2 are all 1, the probability of cancer is 70%. This isn't a simple multiplication of the true probability by an upscaling factor, though: it depends on the percentage of negative samples with this feature vector and the percentage of positive samples with it.

If we change the test set to be 50% positive and keep the same train distribution, the model no longer has the correct information about cancer rates with respect to feature distributions, but neither does the dermatologist. The specificity and sensitivity still can't be interpreted as predicted specificity and sensitivity in the real world.

There is no issue with reporting specificity/sensitivity if they had used the true distribution of cases. Yes, the curves/AUCs will look better than the precision/recall rates, but they do not mis-represent what the doctors are interested in (what percent of people will be missed, and what percent of healthy people will be subjected to unnecessary procedures).

Anyway, the classifier doesn't actually seem to be that good: there were doctors who did better than the classifiers if you check the paper.
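The base-rate mismatch described above can at least be corrected on the output side. A minimal sketch (my own illustration, toy numbers, not from the paper) of the standard prior-shift adjustment, which rescales a model's probability from the training prevalence to a different deployment prevalence:

```python
def shift_prior(p, train_rate, deploy_rate):
    """Recalibrate probability p, produced by a model trained at base rate
    train_rate, to a deployment base rate deploy_rate (a prior shift on
    the odds, assuming class-conditional likelihoods are unchanged)."""
    odds = (p / (1 - p)) * (deploy_rate / (1 - deploy_rate)) \
                         / (train_rate / (1 - train_rate))
    return odds / (1 + odds)

# A 50% score under a 20% training prevalence collapses to well under 1%
# if the true prevalence is closer to, say, 1 in 600:
p = shift_prior(0.5, train_rate=0.2, deploy_rate=1 / 600)
```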


Sensitivity and recall are two names for the same thing, Mr Stats 101 :)

Also, please explain the problem with using ROC here. The probabilistic interpretation of ROC AUC is the probability of correctly ranking a random mixed pair (i.e. ranking the positive example higher than a negative one). How is that metric affected by the 80/20 split of the test data? Genuinely curious here...


It does not matter whether the data is balanced or not when you report ROC (AUC), sensitivity, and specificity for the purpose of comparing two ways of image interpretation (e.g. humans vs. machines), as long as the evaluation is done on the same dataset with the same methodology. Obviously, the absolute numbers would not mean much outside of the study.


> test data should be balanced or they should correct for this in the analysis.

Why should it be balanced? It should be the expected natural clinical class distribution, no? The humans have priors about this too. If anything, it should be more imbalanced, as I would guess (I would hope!) that less than 20% of scans are malignant.


Very useful, thanks for this level of critique.

I wish they added this context in the limitations section. The paper only says:

"There are some limitations to this system. It remains an open question whether the design of the questionnaire had any influence on the performance of the dermatologists compared with clinical settings. Furthermore, clinical encounters with actual patients provide more information than that can be provided by images alone. Hänßle et al. showed that additional clinical data improve the sensitivity and specificity of dermatologists slightly [5]. Machine learning techniques can also include this information in their decisions. However, even with this slight improvement, the CNN would still outperform the dermatologists."

Your points hit on validity issues. Where would they fit on the errors of omission/commission scale?


While I agree that there are problems with the paper, I think you are confused about the suitability of ROC vs. PR and about how test-set class imbalance affects them.

Your first two suggestions combined are very wrong. If you made the test dataset balanced and then measured a PR curve, the precision would be way too optimistic, as precision is directly affected by class imbalance. An ROC curve, on the other hand, is invariant to test-set imbalance.

You can find interesting this short article I have written about this problem: https://arxiv.org/abs/1812.01388
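A quick way to see the claim (toy scores, my own illustration): replicate the negatives tenfold, and the ROC operating point at a fixed threshold is unchanged while precision collapses.

```python
def metrics(pos_scores, neg_scores, thresh=0.5):
    # TPR and FPR are per-class rates, so they ignore the class ratio;
    # precision mixes the two classes, so it does not.
    tp = sum(s >= thresh for s in pos_scores)
    fp = sum(s >= thresh for s in neg_scores)
    return (tp / len(pos_scores),          # TPR (sensitivity)
            fp / len(neg_scores),          # FPR (1 - specificity)
            tp / (tp + fp))                # precision

pos = [0.9, 0.8, 0.6, 0.4]                 # scores on positive cases
neg = [0.7, 0.3, 0.2, 0.1]                 # scores on negative cases

print(metrics(pos, neg))                   # (0.75, 0.25, 0.75)
print(metrics(pos, neg * 10))              # (0.75, 0.25, ~0.23)
```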


> * They measure the wrong things that reward the network. Because the dataset is imbalanced you can't use an ROC curve, sensitivity, or specificity. You need to use precision and recall and make a PR curve. This is machine learning and stats 101.

A̶F̶A̶I̶K̶,̶ ̶a̶ ̶R̶O̶C̶ ̶c̶u̶r̶v̶e̶ ̶c̶a̶n̶ ̶b̶e̶ ̶m̶i̶s̶l̶e̶a̶d̶i̶n̶g̶ ̶f̶o̶r̶ ̶a̶n̶ ̶i̶m̶b̶a̶l̶a̶n̶c̶e̶d̶ ̶d̶a̶t̶a̶s̶e̶t̶,̶ ̶b̶u̶t̶ ̶t̶h̶e̶ ̶A̶U̶C̶ ̶i̶s̶ ̶s̶t̶i̶l̶l̶ ̶o̶k̶a̶y̶ ̶f̶o̶r̶ ̶s̶e̶l̶e̶c̶t̶i̶n̶g̶ ̶m̶o̶d̶e̶l̶s̶.̶ Edit: This is incorrect, a PR curve + PR AUC should be used for model selection if imbalanced. I agree it would be really misleading if they (say) just reported accuracy (since the null classifier of always guess negative would give 80% overall accuracy). I̶ ̶t̶h̶o̶u̶g̶h̶t̶ ̶t̶h̶a̶t̶ ̶t̶h̶e̶ ̶A̶U̶C̶ ̶f̶o̶r̶ ̶R̶O̶C̶ ̶c̶u̶r̶v̶e̶ ̶s̶h̶o̶u̶l̶d̶ ̶s̶t̶i̶l̶l̶ ̶b̶e̶ ̶a̶ ̶v̶a̶l̶i̶d̶ ̶m̶e̶a̶s̶u̶r̶e̶ ̶s̶i̶n̶c̶e̶ ̶i̶t̶'̶s̶ ̶s̶h̶o̶w̶i̶n̶g̶ ̶h̶o̶w̶ ̶m̶u̶c̶h̶ ̶b̶e̶t̶t̶e̶r̶ ̶t̶h̶e̶ ̶m̶o̶d̶e̶l̶ ̶p̶e̶r̶f̶o̶r̶m̶s̶ ̶t̶h̶a̶n̶ ̶r̶a̶n̶d̶o̶m̶ ̶g̶u̶e̶s̶s̶i̶n̶g̶.̶

How do you usually handle imbalanced data? I've had some success with SMOTE or weighted loss for imbalanced datasets, but I'm embarrassed to say I've been using AUC with ROC curves as the default. If this gives inferior model selection compared to AUC with a PR curve, I'll have to start doing that instead.
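For the weighted-loss option mentioned above, a minimal sketch (an assumed toy example, pure Python): up-weight the rare positive class in the cross-entropy, e.g. by the negative-to-positive ratio.

```python
import math

def weighted_log_loss(y_true, p_pred, pos_weight):
    """Cross-entropy with the positive class up-weighted by pos_weight,
    e.g. pos_weight = n_negatives / n_positives for an 80/20 split."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        if y == 1:
            total -= pos_weight * math.log(p)   # missed positives cost more
        else:
            total -= math.log(1 - p)
    return total / len(y_true)

# With a 4:1 imbalance, a fence-sitting 0.5 prediction on the positive
# case now dominates the average loss:
loss = weighted_log_loss([1, 0], [0.5, 0.5], pos_weight=4.0)
```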


Thanks for the comments, this is a great summary. Curious what you'd think of a Kappa score given the imbalance?

https://en.wikipedia.org/wiki/Cohen%27s_kappa
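A hedged sketch of what kappa does here (my own toy numbers, not the original poster's): it subtracts the agreement expected by chance from the marginals, so the always-"no" classifier on an 80/20 set scores 0 despite 80% accuracy.

```python
def cohens_kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement corrected for the agreement
    expected by chance from each rater's marginal label frequencies."""
    n = len(y_true)
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    labels = set(y_true) | set(y_pred)
    p_e = sum((y_true.count(c) / n) * (y_pred.count(c) / n) for c in labels)
    return (p_o - p_e) / (1 - p_e)

k = cohens_kappa([0] * 80 + [1] * 20, [0] * 100)
print(k)   # 0.0: no better than chance, despite 80% raw accuracy
```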


>you can't use an ROC curve, sensitivity, or specificity. You need to use precision and recall and make a PR curve

But sensitivity and recall are the same thing...


There is nothing wrong with using ROC for imbalanced data. It is also perfectly reasonable to use an enriched dataset for a reader study, this is the standard practice.


It's almost as if publishing the thing was more important for the authors than the scientific value of the content.


That's a common misconception. She did not do jail time for insider trading; she almost certainly would have gotten away with that. The problem is she lied to a federal agent while trying to hide her insider trading. That's what they charged and convicted her with.


In terms of immigrants as a percentage of population the US is nothing special. Canada has a lot more (closer to 19% instead of 14%). Many European countries have statistics similar to that of the US (Sweden has a bit more, the UK a bit less, etc).

But I digress. The problem with the existing system is that it is insane and a waste of resources. One has to get a lawyer and wait for a long time with no certainty in the outcome while dealing with an extremely opaque bureaucracy where anything can go wrong at any time.

This has nothing at all to do with volume. Indeed, a simpler more streamlined and less opaque system would help with volume.

Say a system like the Canadian or the UK one. There's a point scale. You can compute the number of points you get ahead of time, now there's a simple web form actually. If you cross the threshold you will get in. There are known wait times, you can just call the embassy. There are points for things the country needs (particular jobs), for certain qualifications (degrees, etc), for language proficiency, some regional tweaks, family, and having an employment offer.

And, extremely importantly, you get permanent resident status (a green card), not an H1B. What Americans don't realize is that you want to hand out green cards. H1Bs lower both your salaries and ours. The H1B restricts immigrant mobility (holders can't move to a better job and raise the average salary) and encourages people to go home with all of their newly gained knowledge and money.

I'll give you a personal example of the difference between a sane and an insane system:

As an 11-year-old I filled out all of our Canadian immigration paperwork (my parents checked it, but it was correct and they didn't change it). We knew we would get in based on our points. The embassy told us the timeframe in which we should expect our paperwork to go through. It went through a bit early. We moved to Canada.

Now for the US. As a near-30-year-old with a US PhD working at a top research institution, I have to pay a specialized law firm several thousand dollars, spend weeks getting paperwork, bug people in several countries to write absurd letters, build a case, etc. All of this to get basically the same thing. And in the end, who knows what will happen, because there are no standards, no appeals, and no one to discuss anything with. Oh, and I have no idea what the processing time is.

So no. It is not an issue of "why can't they just let me in". It's a system that hurts your salary by restricting my mobility, hurts me by making me pay lawyers needlessly, hurts the image of the US by creating disgruntled people, and hurts the economy by routing business (and, increasingly, prestigious conferences) elsewhere. It just makes no sense.


It's interesting, actually. In some ways the US is actually the small-government, low-regulation country that it likes to pretend to be, but in other areas it's just a labyrinth of aggressive, molasses-paced bureaucracy. Examples:

Example: the DMV. In both Australia and the US the procedure for getting my licence renewed is the same. I go into an office, I fill out a form, they take a picture, and I get a licence. The difference is that when I did this in Australia I waited about five minutes and they printed my licence on the spot, whereas in the US it for some reason takes one or two hours and they print the licence in six to eight weeks. The California DMV doesn't appear to have fewer staff per customer or anything; its procedures just make no sense, and nobody is able or willing to fix them.

Other examples: immigration and the TSA, but let's not even go there.


> It's interesting, actually. In some ways the US is actually the small-government, low-regulation country that it likes to pretend to be, but in other areas it's just a labyrinth of aggressive, molasses-paced bureaucracy. Examples:

Exactly this. There's a myth that the US is less bureaucratic than Europe, for example, and while Europe has things like Italy, which is horrible in bureaucracy terms, a lot of things there are simpler.


To be fair, that's a US state problem and not a US problem.

I have had driver licenses from multiple US states. My experiences on the matter vary in extreme measures from one state to the next. This is mostly based on local laws, resources, and demand.

One time it took hours, but I had my license in hand before leaving the building. In another case it took around twenty minutes, and I likewise left with license in hand. Things are different in different places.

Then again, I've never had a license from California so that's a pain I haven't endured.


To defend the DMV: I renewed my license online and it was mailed to me. Took 5 minutes.


>It just makes no sense.

Now, I'm not saying this is right or wrong; it's a complex issue, and I am personally undecided. But you seem to be missing a major argument against the "points system" you describe.

The Canadian system explicitly biases immigration in favor of the wealthy and well-educated. That's exactly what the points system is meant to do. Now, many people think this is a good thing; their argument is that wealthy and well-educated people bring good things into the country. I'm not saying they are wrong, I'm just saying that you should understand how some people feel that is unfair.

The US system does this to some extent, too; for instance, the H1B visa is biased in just that way, and even for the lottery we set minimum "you can likely support yourself" levels. And we have special ways for really wealthy people to jump the queue, and of course one could argue that the way the lottery is set up is quite racist. But you can also make the argument that the US lottery system is a lot more fair to people who have the ability to support themselves but maybe weren't wealthy enough to get an advanced degree.


> but maybe weren't wealthy enough to get an advanced degree.

Of course, in many countries the ability to get an advanced degree is not very dependent on being wealthy.


>Of course, in many countries the ability to get an advanced degree is not very dependent on being wealthy

I... find your assertion to be unlikely. Of course, I could be wrong. Do you have statistics? Is there a country where there is not a very strong correlation between high parental income and advanced degrees?


Finland should have no correlation here, as all the schools are free and you get an allowance from the government for the duration of your studies. Unfortunately it turns out a high income correlates well with having an advanced degree, and a parent with an advanced degree correlates well with a child who has one.

Which results in a situation where the higher socio-economical status of your parents makes you much more likely to have an advanced degree, even though there should be no correlation.


Obviously IQ has something to do with this.


Most of the people in my PhD lab were paid to be there, fees waived, and 50% came from what would be described as low-income backgrounds. Anecdotal, I know, but in the UK you don't need to be wealthy to get an advanced degree... although getting the first degree got a bit more expensive a couple of years ago.


The correlation in Europe tends to be between parental education level and child education level. The correlation between parental income and the education level of the child is weaker.

Being educated does not imply being wealthy.


This is true in the United States as well: there is a very strong correlation between wealth and education level, but the correlation between parents' education level and child's education level is stronger than the correlation between parents' wealth and child's education level. IIRC, there is still some evidence that wealth plays a role independent of parents' education level in determining the child's education level, but it's a smaller role.


No statistics, sorry. If you want to look yourself try the usual suspects, India, Iran, ex-Soviet states...


This is the first post by 0xab after 1992 days of being a registered HN user. Imagine (a) the determination needed to not yield to temptation and post something over 5.5 years, and (b) how upsetting immigration is to someone as determined as this PhD for his first comment to be on immigration.

