
Coauthor of the paper here. No --- this is not one of the three techniques that we implemented. It was a hypothetical suggestion for future work. Unfortunate that the article didn't make that clear.

Here is the paper http://randomwalker.info/publications/ad-blocking-framework-...

Here is our blog post about it: https://freedom-to-tinker.com/2017/04/14/the-future-of-ad-bl...


Just throwing out a question:

One of the biggest issues (IMO) is the security threat... if there is a second copy that "we" don't see, doesn't that mean that the second copy can do "bad things" still?

Case in Point: Forbes requests you disable ad-blocker... then serves pop-under malware:

http://www.networkworld.com/article/3021113/security/forbes-...

I could stomach ads - just like I stomached commercials on TV shows... you get used to them and they become white noise... if only the ads were unobtrusive and non-invasive (e.g., Google Search results). Instead it's full-screen, pop-up/under, audio, video, etc. Then throw malware on top of that...

What's to stop the malware from biting on that second copy?


This is what matters most to me. My dislike of ads is out of fear of malicious code first, tracking second, and wasted bandwidth third.

The simple gif banner ads of yore were tolerable enough.


Neuromarketing proved that they do not become white noise; you just stop consciously noticing them and how they affect you [1]. But even if you somehow developed some kind of immunity, you'd still be affected by ads, because advertising costs are factored into product prices: when you buy something, even if you've never seen any of the ads for that product, you're still paying for them.

[1]: https://en.wikipedia.org/wiki/Neuromarketing#Study_examples


Here's what probably amounts to a dumb question: why couldn't we just have an AI act like a real person and click on some of the ads that it deemed 'safe', in a separate sandboxed browser instance? This would reduce the negative effects of ad blockers on content publishers.


Looks like you've accidentally repasted the paper URL for the blog post


Oops! Thanks, fixed.


Coauthor here. Some of the press articles about our work didn't have a lot of nuance (unsurprisingly), but in the paper we're careful about what we say, what we don't say, and what the implications are. Happy to engage in informed discussion :)


Do you have any evidence that this effect results in machines making systematically wrong inferences?

Near as I can tell, your paper shows that these "biases" result in significantly more accurate predictions. For example, Fig 1 shows that a machine trained on human language can accurately predict the % female of many professions. Fig 2 shows the machine can accurately predict the gender of humans.

Normally I'd expect a "bias" to result in wrong predictions - but in this case (due to an unusual redefinition of "bias") the exact opposite seems to occur.

(Drawing on your analogy with stereotypes, it's probably also worth linking to a pointer on stereotype accuracy: http://emilkirkegaard.dk/en/wp-content/uploads/Jussim-et-al-... http://spsp.org/blog/stereotype-accuracy-response )


Accuracy might mean "positively" right, as your post suggests, but that doesn't necessarily mean "normatively" right.

From what I understand, the fear surrounding embedding human stereotypes into ML systems is that the stereotypes will get reinforced. In some way or form, there will be less equality of opportunity in the future than exists today, because machines will make decisions that humans are currently making. Societal norms evolve over time, yet code can become locked in place.

Is your takeaway from this paper that we, as the creators of intelligent machines, should allow them to continue making "positively" right assumptions simply because that's the way we, as humans, have always made them? Is "positively" right, in your opinion, in all cases equivalent to "normatively" right?


I think your questions would be answered by reading the article. Particularly:

"In AI and machine learning, bias refers generally to prior information, a necessary prerequisite for intelligent action (4). Yet bias can be problematic where such information is derived from aspects of human culture known to lead to harmful behavior. Here, we will call such biases “stereotyped” and actions taken on their basis “prejudiced.”"

This definition is not unusual. This is about inferences that are wrong in the sense of prejudiced, not necessarily inaccurate.


The usual definition of bias in ML papers is E[theta_estimator - theta]. That is explicitly a systematically wrong prediction.
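
For concreteness, here's a toy numpy simulation (my own made-up example, not from the paper) of that statistical sense of bias: the plain MLE variance estimator is systematically low, and the standard n-1 correction removes that.

    import numpy as np

    rng = np.random.default_rng(0)
    true_var = 4.0                              # the "theta" being estimated
    samples = rng.normal(0.0, 2.0, size=(100_000, 5))

    mle_var = samples.var(axis=1)               # divides by n: the biased estimator
    unbiased_var = samples.var(axis=1, ddof=1)  # divides by n-1

    print(mle_var.mean() - true_var)            # about -0.8: E[estimator - theta] != 0
    print(unbiased_var.mean() - true_var)       # about 0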

In any case, the paper suggests that this "bias" or "prejudice" is better described as "truths I don't like". I'm asking if the author knows of any cases where they are actually not truthful. The paper does not suggest any, but maybe there are some?


Again, per the article "bias refers generally to prior information, a necessary prerequisite for intelligent action (4)." This includes a citation to a well-known ML text. This seems broader than the statistical definition you cite.

Think for example of an inductive bias. If I see a couple of white swans, I may conclude that all swans are white, and we all know this is wrong. Similarly, I may conclude the sun rises every day, and for all practical purposes this is correct. This kind of bias is neither wrong nor right, but, in the words of the article, "a necessary prerequisite for intelligent action", because no induction/generalization would be possible without it.

There are undoubtedly examples where the prejudiced kind of biases lead to both truthful and untruthful predictions, but that seems beside the point, which is to design a system with the biases you want, and without the ones you don't.


Isn't the word bias being redefined from a social justice point of view? Normally bias would be with reference to failing to match reality (e.g., women in general have weaker upper bodies than men), not with reference to failing to match whatever standard of equality a society wishes were eventually the case.


You should check out the Implicit Association Test that they used to measure the biases. Just as one example, there's nothing about being a doctor that is inherently more male or female. So all gender differences would have external causes.


Since we think of the biases of a large human corpus as wrong, I'm curious how one would find or make one that is "right".

Given how accurate human corpora are at predicting things like the gender distribution in jobs, for instance, wouldn't making an "unbiased" corpus produce an inaccurate AI?

Shouldn't we be careful about what we imply regarding these biases and their solutions? For instance, I'd like to know if my algorithm for filtering job applicants is trying to undo the injustices of the world in addition to finding the best candidates.


Hi! I just read the paper--impressive work! Have you tried any other languages? For example, French or German?


Can you provide an example of how this bias might play out in a human-AI interaction?


The paper has them:

- An AI correctly infers (simply by reading text) that a physicist is male and a nurse is female.

- An AI correctly infers the gender of humans with androgynous names.

- An AI infers insects are unpleasant and flowers are pleasant to humans.

- An AI also infers that African American names are more likely to be associated with unpleasantness than European names.

[edit: to those who dislike this comment, can you tell me what you object to? Which of my concrete examples is not in the paper?]


It appears that the linked paper has examples.


Very interesting topic - could you please share a link to the original paper?

Probably best to read that one first.


> Very interesting topic - could you please share a link to the original paper?

Unless the link was changed in the few minutes since you posted your comment, the link for the article is the original Science paper (http://science.sciencemag.org/content/356/6334/183.full)


(From a Javascript-disabled perspective)

Page with actual link:

http://science.sciencemag.org/content/356/6334/183/tab-pdf

Link to PDF itself:

http://science.sciencemag.org/content/sci/356/6334/183.full....


Sorry, mixed it up.

If I look at GloVe & WordNet usage, e.g. for topic extraction, bagging/clustering, or semantic similarity, would you say we would need to get rid of such a bias, e.g. by creating something like a Geiger counter for NLP?

Alternative view: when doing sentiment analysis/classification, would you say that such a bias actually helps to identify a type of sentiment in a doc/sentence?


Wouldn't this lead to an entire idea of contextual bias? Times when it could benefit and be used, and times where it is occluded.


Coauthor here. Someone has been DoSing the paper site(!), so here's a copy for now: https://drive.google.com/file/d/0B59AisMv54waZXRhbE9GV2NDQUE...


That's definitely one of our main areas for future research. So far, the only part of the paper where we consider other languages is in studying how model bias affects language translation:

Unsurprisingly, today’s statistical machine translation systems reflect existing gender stereotypes. Translations to English from many gender-neutral languages such as Finnish, Estonian, Hungarian, Persian, and Turkish lead to gender-stereotyped sentences. For example, Google Translate converts these Turkish sentences with genderless pronouns: "O bir doktor. O bir hemşire." to these English sentences: "He is a doctor. She is a nurse." A test of the 50 occupation words used in the results presented in Figure 1 shows that the pronoun is translated to "he" in the majority of cases and "she" in about a quarter of cases; tellingly, we found that the gender association of the word vectors almost perfectly predicts which pronoun will appear in the translation.

See section on "Effects of bias in NLP applications" http://randomwalker.info/publications/language-bias.pdf


Coauthor here. The blog post is written in relatively non-technical language for a general audience, but our paper has tons of technical details that HN readers might enjoy. Give it a read!

http://randomwalker.info/publications/language-bias.pdf


Doing this was a great idea. Great paper: easy-to-follow and to-the-point.

The results are not too surprising, as models for learning word embeddings like GloVe, word2vec, etc. learn to encode in vector form the existing relationships between words in the training corpora. If a corpus is biased, the embeddings learned from it will necessarily be biased too.

However, the implications of this finding are wide-ranging. For starters, any machine learning system that relies on word embeddings learned from biased corpora to make predictions (or to make decisions!) will necessarily be biased in favor of certain groups of people and against others.

Moreover, it's not obvious to me how one would go about obtaining "unbiased" corpora without somehow relying on subjective societal values that are different everywhere and continually evolving. You have raised an important, non-trivial problem.


> For starters, any machine learning system that relies on word embeddings learned from biased corpora to make predictions (or to make decisions!) will necessarily be biased in favor of certain groups of people and against others.

This is not true.

Here's an oversimplified example. Suppose your machine learning system wants to predict something, e.g. loan repayment probabilities. One input might be a written evaluation by a loan officer.

When trained on a corpus of group X, the predicted probability might be:

    pred = a*written_evaluation + other_factors
(Using linear regression to keep the example simple.)

However, now let's suppose the written evaluation is biased to the tune of 25% against group Y, i.e., group Y has written scores that are 25% lower than group X's.

Then a new predictor which includes pairwise terms, trained on a corpus covering both groups X and Y, will work out to be:

    pred = a*written_evaluation + 0.33*a*written_evaluation*isY + other_factors
This predictor would be unbiased. In general, if you have a biased input and the biasing factor is also present in your input, your model should correct the bias. (Obvious caveats: your model needs to be sufficiently expressive, etc.)
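
For concreteness, here's a toy simulation of that claim - my own sketch with made-up numbers, using numpy's least squares, not anything from the paper under discussion:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 10_000
    is_y = rng.integers(0, 2, n)                        # group membership indicator
    ability = rng.normal(5.0, 1.0, n)                   # what the evaluation tries to measure
    written_eval = ability * np.where(is_y, 0.75, 1.0)  # scores 25% lower for group Y
    repayment = ability + rng.normal(0.0, 0.1, n)       # outcome driven by true ability

    # least squares with the pairwise (interaction) term included
    X = np.column_stack([written_eval, written_eval * is_y, np.ones(n)])
    coef, *_ = np.linalg.lstsq(X, repayment, rcond=None)
    print(coef)  # roughly [1.0, 0.33, 0.0]: the interaction term undoes the 25% penalty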

Interestingly, everyone's favorite bogeyman, namely redundant encoding ( http://deliprao.com/archives/129 ), will actually help fix this problem *even if you don't include the biasing factors in the model*.


> ...now let's suppose the written evaluation is biased to the tune of 25% against group Y...

How do you find out that the written evaluation is biased "to the tune of 25% against group Y?"

THAT is the problem. It's not obvious to me how you would go about determining written evaluations are biased (and to what extent!) against group Y without somehow relying on subjective societal values that are different everywhere and continually evolving.


Finding out is the easy part. I don't mean to trivialize it, because doing stats right is actually a very technical matter, but this is just ordinary statistics.

You build a sufficiently expressive statistical model and include the potentially biasing factors as features in the model. Then the model will correct the bias all by itself because correcting for bias maximizes accuracy.

In the example above, you find the bias by doing linear regression and including (written_evaluation x isY) as a term. Least squares will handle the rest. If you're using something fancier than least squares (e.g. deep neural networks, SVMs with interesting kernels), you probably don't even need to explicitly include potentially debiasing terms - the model will do it for you.

I give toy examples (designed to illustrate the point and also be easy to understand) here: https://www.chrisstucchio.com/blog/2016/alien_intelligences_...

This paper does the same thing - it discovers that standard predictors of college performance (grades, GPA) are biased in favor of blacks and men, against Asians and women, and the model itself fixes these biases: http://ftp.iza.org/dp8733.pdf

Statistics turns fixing racism into a math problem.

If the topic were anything less emotionally charged, you wouldn't even think twice about it. If I suggested including `isMobile`, `isDesktop` and `isTablet` as features in an ad-targeting algorithm to deal with the fact that users on mobile and desktop browse differently, you'd yawn.


> ...include the potentially biasing factors as features...

Who decides what the "potentially biasing factors" are? How is that decided without somehow relying on subjective societal values?

Factors that no one thought were biased in the past are considered biased today; factors that no one thinks are biased today may be considered biased in the future; and factors that you and I consider biased today may not be considered biased by people in other parts of the world. I don't know how one would go about finding those "potentially biasing factors" without relying on subjective societal values that are different everywhere and always evolving.


A potentially biasing factor is a factor that you think would be predictive if you included it in the model. If it's actually predictive, you win, your model becomes more accurate and you make more money.

Go read the wikipedia article on the topic: https://en.wikipedia.org/wiki/Omitted-variable_bias

It's true that as we learn more things we discover new predictive factors. That doesn't make them subjective. A lung cancer model that excludes smoking is not subjective, it's just wrong. And the way to fix the model is to add smoking as a feature and re-run your regression.
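
To see omitted-variable bias concretely, here's a toy simulation (my own sketch with synthetic data; the "coffee" confounder is made up purely for illustration):

    import numpy as np

    rng = np.random.default_rng(2)
    n = 100_000
    smoking = rng.integers(0, 2, n).astype(float)
    coffee = smoking + rng.normal(0.0, 1.0, n)              # correlated with smoking (made up)
    cancer_risk = 3.0 * smoking + rng.normal(0.0, 1.0, n)   # only smoking matters here

    def ols(X, y):
        X = np.column_stack([X, np.ones(len(y))])
        return np.linalg.lstsq(X, y, rcond=None)[0]

    print(ols(coffee, cancer_risk))                              # coffee looks harmful: omitted-variable bias
    print(ols(np.column_stack([coffee, smoking]), cancer_risk))  # coffee coefficient ~0 once smoking is a feature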

Again, would you make the same argument you just made if I said I had an accurate ad-targeting model?


OK, I see where the disconnect is. I think the best way to address it is with an example.

Many people today would object a priori to businesses using race as a factor to predict loan default risk, regardless of whether doing that makes the predictions more accurate or not. In many cases, using race as a factor WILL get you in trouble with the law (e.g., redlining is illegal in the US).

Please tell me, how would you predict what factors society will find objectionable in the future (like race today)?


My claim is very specific. If you tell an algorithm to predict loan default probabilities, and you give it inputs (race, other_factor), the algorithm will usually correct for the bias in other_factor.

I claimed a paperclip maximizer will maximize paperclips; I didn't claim a paperclip maximizer will determine that the descendants of its creators really wanted it to maximize sticky tape.

Now, if you want an algorithm not to use race as a factor, that's also a math problem. Just don't use race as an input and you've solved it. But if you refuse to use race and race is important, then you can't get an optimal outcome. The world simply won't allow you to have everything you want.

A fundamental flaw in modern left wing thought is that it rejects analytical philosophy. Analytical philosophy requires us to think about our tradeoffs carefully - e.g., how many unqualified employees is racial diversity worth? How many bad loans should we make in order to have racial equity?

These are uncomfortable questions - google how angry the phrase "lowering the bar" makes left wing types. If you have an answer to these questions you can simply encode it into the objective function of your ML system and get what you want.

Modern left wing thought refuses to answer these questions and simply takes a religious belief that multiple different objective functions are simultaneously maximizable. But then machine learning systems come along, maximize one objective, and the others aren't maximized. In much the same way, faith healing doesn't work.

The solution here is to actually answer the uncomfortable questions and come up with a coherent ideology, not to double down on faith and declare reality to be "biased".


My claim was specific too: if a corpus is biased -- as defined by evolving societal values -- then the word embeddings learned from that corpus will necessarily be biased too -- according to those same societal values, regardless of whether you think those values are rational and coherent.


> Moreover, it's not obvious to me how one would go about obtaining "unbiased" corpora without somehow relying on subjective societal values that are different everywhere and continually evolving.

I don't believe that problem will ever be completely solvable. But I think the way to go is to always make these assumptions explicit. I.e., when the machine learning system derives a result, program it to additionally return a proof of how it came to this result. And also give a way to let the ML system return a list of all axioms and derivation rules that it has currently learned, so that they can independently be checked for bias and thus be corrected.


It's pretty hard to return "rules" for an ML system, especially a non-linear system. Google is currently working on systems that use a trillion features - I can't imagine returning some kind of rule list for that.

LIME[1] is a nice start, though.

[1] https://www.oreilly.com/learning/introduction-to-local-inter...


Another text about interpreting convolutional neural networks:

http://cs231n.github.io/understanding-cnn/

> Google is currently working on systems that use a trillion features - I can't imagine returning some kind of rule list for that.

As I wrote: it would already help if the ML system as a first step returned the derivation with only the rules that were actually used for that concrete derivation - this list is much shorter and can thus be checked much more easily.


Gave it a quick read.

Are biases distinct from "preferences"? Humans view flowers as more pleasurable than insects, and human language associates flowers with pleasurable terms, states, and so forth.

"Bias" is term associated with "irrational beliefs" whereas "preferences" more often imply "arbitrary preferences". Especially, biases are held to prevent rational deduction whereas preferences have no such stumbling block.

Now, one supposes that question would come down to whether a computer would "know it's a computer, not a person".

If the AI was asked "do you like cockroaches or daisies better", would it say "why, daisies are prettier and smell better" or would it say "most people like daisies, but I'm a machine, can't smell or taste, and only care about the preferences entered into my control panel" (or something)?

And you'd expect that a thing that merely "parroted" human speech without understanding would give the former answer.

Which is to say I don't think you are really fully grappling with word-association and word-logic coming together, i.e., "meaning".


Very interesting results. I really like the approach of paralleling the classic bias experiments. And I think your recommendations in the last paragraph of the "Awareness is better than blindness" section are excellent - although I'd go farther and suggest that the long-term interdisciplinary research program should have a highly diverse team, and include experts on diversity.

I thought the section on "Challenges" could have been stronger. You talk about the bias in "the basic representation of knowledge" used in these systems today -- but it's not like there aren't other possible representations of knowledge. How much effort has gone into exploring knowledge representations (and approaches to deriving semantics) that are designed to highlight and reduce biases?


Hi. I'd be interested to know what you think of my attempts to deduce bias from a text corpus. Very early stage compared to yours, mind: https://linkingideasblog.wordpress.com/2015/08/19/data-minin...


OP here. We address this argument in detail in our paper, and we're deeply skeptical of it. See the sections titled "Challenges in addressing bias" and "Awareness is better than blindness".

Here's the short version:

We view the approach of "debiasing" word embeddings (Bolukbasi et al., 2016) with skepticism. If we view AI as perception followed by action, debiasing alters the AI’s perception (and model) of the world, rather than how it acts on that perception. This gives the AI an incomplete understanding of the world. We see debiasing as "fairness through blindness". It has its place, but also important limits: prejudice can creep back in through proxies (although we should note that Bolukbasi et al. (2016) do consider "indirect bias" in their paper). Efforts to fight prejudice at the level of the initial representation will necessarily hurt meaning and accuracy, and will themselves be hard to adapt as societal understanding of fairness evolves.
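
For readers who want to see what that kind of debiasing looks like mechanically, here is a rough sketch of the projection idea with made-up toy vectors - an illustration of the general approach, not code from Bolukbasi et al. or from our paper:

    import numpy as np

    # toy word vectors, made-up numbers purely for illustration
    emb = {
        "he":    np.array([ 0.8, 0.1, 0.3]),
        "she":   np.array([-0.8, 0.1, 0.3]),
        "nurse": np.array([-0.5, 0.6, 0.2]),
    }

    gender_dir = emb["he"] - emb["she"]
    gender_dir /= np.linalg.norm(gender_dir)

    def debias(v):
        # drop the component along the gender direction ("fairness through blindness")
        return v - np.dot(v, gender_dir) * gender_dir

    print(np.dot(emb["nurse"], gender_dir))           # before: a nonzero gender component
    print(np.dot(debias(emb["nurse"]), gender_dir))   # after: ~0, at the cost of some information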

Direct link to our paper: http://randomwalker.info/publications/language-bias.pdf


I agree with the suggestion to de-bias the application and not the representation itself.

Recently I was using a version of Conceptnet Numberbatch (word embeddings built from ConceptNet, word2vec, and GloVe data that perform very well on evaluations) as an input to sentiment analysis. So its input happens to include a crawl of the Web (via GloVe) and things that came to mind as people played word games (via ConceptNet). All of this went into a straightforward support vector regression with AFINN as training data.

You can probably see where this is going. The resulting sentiment classification of words such as "Mexican", "Chinese", and "black" would make Donald Trump blush.

I think the current version is less extreme about it, but there is still an effect to be corrected: it ends up with slightly negative opinions about most words that describe groups of people, especially the more dissimilar they are from the American majority.

So my correction is to add words about groups of people to the training data for the sentiment analyzer, with a lot of weight, saying that their output has to be 0.
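
A rough sketch of what that correction looks like (toy stand-in data and an illustrative word list, not my actual pipeline):

    import numpy as np
    from sklearn.svm import SVR

    # stand-ins: the real inputs would be Numberbatch vectors and the AFINN lexicon
    embeddings = {"good": np.array([0.9, 0.1]), "awful": np.array([-0.8, 0.2]),
                  "mexican": np.array([0.1, -0.3]), "chinese": np.array([0.2, -0.4])}
    afinn = {"good": 3.0, "awful": -3.0}
    identity_words = ["mexican", "chinese"]             # illustrative, not exhaustive

    X = np.array([embeddings[w] for w in list(afinn) + identity_words])
    y = np.array([afinn[w] for w in afinn] + [0.0] * len(identity_words))
    weights = np.array([1.0] * len(afinn) + [50.0] * len(identity_words))  # "a lot of weight"

    model = SVR().fit(X, y, sample_weight=weights)      # identity terms pinned toward 0 sentiment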


I'm not convinced by your skepticism about correcting prejudiced bias. Debiasing certainly gives the AI a different understanding of the world than the original (biased) language dataset does, but it's not necessarily less complete - or less accurate. After all, any one corpus is incomplete, and has biases based on the items that were chosen for it - which are likely to reflect the biases of the past, and of the person choosing the corpus. It may not be a "complete" or "accurate" reflection of today's world - let alone the future's. So it's not at all clear to me that efforts to undo the bias will necessarily make it less "accurate".


> Debiasing certainly gives the AI a different understanding of the world than the original (biased) language dataset does, but it's not necessarily less complete - or less accurate.

If you're translating from an ungendered language and have to choose, the only way you're going to get anything sensible is from context and common usage. Which is going to choose "she is a nurse" because an algorithm that can deduce that fathers are most likely male can also deduce that nurses are most likely female. But without that you get bad translations like "she is a father" and "he is a fine ship" and "John is her own person."


"She is a nurse" is also not a bias. It's a prior and a valid one - the system will be right 93% of the time.

http://work.chron.com/gender-equality-issues-nursing-careers...

A bias would be if it incorrectly weighted "JOHN" and "nurse", and used the feminine for "John the nurse".


> "She is a nurse" is also not a bias. It's a prior ...

Assuming that lower-status professions are female and higher-status professions are male ("he is a doctor") when translating ungendered words is indeed a bias.

> the system will be right 93% of the time.

And "this person is a doctor, that person is a nurse" will be right 100% of the time.


It's a bias in the sense that it accurately reflects a fact you dislike. It's not a bias in the statistical sense, namely something that causes the answer to be wrong systematically in a particular direction. See my other post here discussing the distinction.

It's also not wrong in the sense of generics: https://sites.ualberta.ca/~francisp/papers/GenericsIntro.pdf

The phrase "this person is a doctor" has a different meaning than "she is a doctor" - "she" and "he" refers to (I'm probably messing up the terminology here) contextually implicit person. "This person" does not.


> And "this person is a doctor, that person is a nurse" will be right 100% of the time.

Except when it produces "that person is a fine ship" or "John is that person's own person" or equally ridiculous things.


Interesting paper, thanks!

See how you sway the argument in your favour using words with negative connotations like "fairness through blindness" and "hurt meaning and accuracy". Nobody would want to deliberately blind or hurt something, would they? How about rebalance or recalibrate or re-correct.

A concrete analogy:

1) I have a meter measuring stick, but I discover that it was made wrong: it is actually 2mm shorter than advertised. Every time I make a measurement with it I have to add 2mm to the measurement. Would it not be better to use a more accurate stick and not have to continually compensate?


Stereotypes have great predictive power. The reason we sometimes avoid them is they can lead to outcomes that are seen as undesirable.


With your analogy, that would assume we know exactly how long a meter is. Here, "we know that we are wrong, but we don't know the exact right answer". Also, language shifts, and biases are not constants. Oh, then you have the issue of a corpus attempting to manipulate the learning algorithm itself.


Are there any overlaps or thoughts regarding Sapir-Whorf here in your research?


Yes! We address this in the section "Implications for understanding human prejudice".

The simplicity and strength of our results suggests a new null hypothesis for explaining origins of prejudicial behavior in humans, namely, the implicit transmission of ingroup/outgroup identity information through language. That is, before providing an explicit or institutional explanation for why individuals make decisions that disadvantage one group with regards to another, one must show that the unjust decision was not a simple outcome of unthinking reproduction of statistical regularities absorbed with language. Similarly, before positing complex models for how prejudicial attitudes perpetuate from one generation to the next or from one group to another, we must check whether simply learning language is sufficient to explain the observed transmission of prejudice. These new null hypotheses are important not because we necessarily expect them to be true in most cases, but because Occam’s razor now requires that we eliminate them, or at least quantify findings about prejudice in comparison to what is explainable from language transmission alone

(The paper has more along these lines.)


Coauthor here. I lead the research team at Princeton working to uncover online tracking. Happy to answer questions.

The tool we built to do this research is open-source https://github.com/citp/OpenWPM/ We'd love to work with outside developers to improve it and do new things with it. We've also released the raw data from our study.


What can be done by the browser vendors such as Mozilla, Google, and Microsoft?

To prevent fingerprinting, your browser has to disable all sorts of useful modern JavaScript APIs (e.g., WebRTC) by default, prevent spurious HTTP requests (e.g., to prevent abusing @font-face to find out which fonts are installed), and pretend you are an American using the most popular web browser of the moment (i.e., hide the user's preferred language and claim en-US as your preference, and change the user agent string to blend into the crowd).

This is all assuming people don't run any third party plugins like Flash.

Are browser vendors on track to figure out a solution to this problem that combines user friendliness with privacy? Or will anonymous browsing remain a privilege for those with the right amount of technical know-how?

The problem, it seems, is that simply disabling JavaScript is not an option for normal web browsing; JavaScript is even a requirement for interacting with the web services used by organisations you have a relationship with (e.g., the government, insurance companies, banks, etcetera).


Personally I think there are so many of these APIs that for the browser to try to prevent fingerprinting is like trying to put the genie back in the bottle.

But there is one powerful step browsers can take: put stronger privacy protections into private browsing mode, even at the expense of some functionality. Firefox has taken steps in this direction https://blog.mozilla.org/blog/2015/11/03/firefox-now-offers-...

Traditionally all browsers viewed private browsing mode as protecting against local adversaries and not trackers / network adversaries, and in my opinion this was a mistake.


Google has explicitly WontFix'd bugs on the subject of expanding incognito to be hardened against fingerprinting: https://bugs.chromium.org/p/chromium/issues/detail?id=142214...

Don't you think this sort of thing warrants a separate sort of browsing mode? A lot of people who use the likes of incognito mode just use it for e.g. browsing porn where they don't want the local history to be preserved.

Turning that mode into one that's highly hardened against fingerprinting would in practice ruin the browsing experience for those users. Just look at what the Tor browser needs to do with fixed preset resolutions, no JavaScript etc.


> Google has explicitly WontFix'd bugs on the subject of expanding incognito to be hardened against fingerprinting

Obviously. Google is in the business of destroying your privacy: Advertising revenue is maximized when the consumer is/remains completely tracked and profiled at all times.

Other browser vendors which are not in the ad business could use this as an opportunity to differentiate themselves from Google:

Introduce a 3rd browsing mode which kills fingerprinting (with the "cost" of reduced user friendliness).


Or just let the user decide at the start of a private session. Firefox already does this with tracking protection. If Mozilla decides to improve tracking protection at the cost of usability (such as hiding your preferred language), then offering that as a toggle-able option on that page might be sufficient to empower the user to decide on his own.


> offering that as a toggle-able option on that page might be sufficient

Technically speaking yes.

From a marketing/communication standpoint, I would separate this "feature" clearly from the 2 known browsing experiences. Not only does it clearly communicate to the user that a different browsing experience is about to start; by selling it as "the third browsing mode", it also adds more perceived value to the product.


Google already has that info without these hacks.

In fact it is in Google's best interest to remove these security holes so other advertisers lose whatever minor advantage they can get.


> Don't you think this sort of thing warrants a separate sort of browsing mode? A lot of people who use the likes of incognito mode just use it for e.g. browsing porn where they don't want the local history to be preserved.

Think about it this way, would those using incognito mode for porn be OK with their normal browsing being peppered with ads claiming to "Improve your <fetish> with our range of <sexual implements>"?

Whilst I think incognito mode's warnings about not hiding data from network operators should remain (i.e. your boss can find out what sites you were visiting at work), that doesn't mean efforts to prevent fingerprinting shouldn't be made.


> there are so many of these APIs that for the browser to try to prevent the ability to fingerprint is putting the genie back in the bottle.

I disagree. If there were billions to be made from this new tech, secure browsing, then the browser vendors would be moving rapidly toward it. Certainly, more difficult technical challenges are overcome regularly.

Surveillance was implemented without users' knowledge and without public debate, presented as a fait accompli, and now the latest tactic is to say there's nothing that can be done about it. People accept that because they feel helpless, but I don't think we should be perpetuating this rhetoric of inevitability. There's no technical reason it can't be done.


That's not effective, because very few people use private mode.


The browser vendors could start taking the idea of asking for permission seriously.

For WebRTC, browsers could block local addresses. uBlock Origin can do this on Firefox already.

For battery: browsers could treat it like location and ask for permission. Why does the average site need to know my battery status?

For fonts: browsers could standardize a list of system fonts available on each platform. It's 2016 already: web fonts are here, are widely supported, and no legitimate website should be relying on some oddball manually installed font.

This problem is hard to solve, but the Tor browser has it mostly solved. Other browsers could learn from it.


> browsers could standardize a list of system fonts available on each platform.

It would probably make sense to completely disable support for local fonts unless permitted by the user (for legacy websites that depend on it). All modern browsers support @font-face, and without @font-face you can always depend on the special keywords serif, sans-serif, and monospace; these will load the system's default font for that category.


I hope there is another way to solve it, as I have installed web fonts on my PC to improve page loading speed for some common fonts I keep seeing (the most recent being the Roboto font stack from Google).

It would be a shame to have to keep re-downloading that every time.


You don't have to. Fonts should be (and usually are) offered to the web browser with the instruction to cache them indefinitely. You will only have to re-download them when your cache is cleaned up (due to its size, private browsing, or manually cleaning it). Upcoming technology WOFF2 helps further compress them by a significant margin as well (I've seen up to 50% improvement in size over plain WOFF).

The problem is that the behaviour you are describing is also one of the ways a fingerprinter gets its data on your fonts; by specifying an @font-face declaration that first tries for a local font, and only loads a remote font if that is not found. Do this for a short-list of popular but distinct fonts (such as Roboto), and you have a nice amount of bits of identifying data to add to the stack.

Also, tricks like these exist (using rendering metrics to detect fonts):

http://www.lalit.org/lab/javascript-css-font-detect/


Browser caches are extremely unreliable and pretty small in the grand scheme of things.

On some mobile platforms the browser cache can be replaced entirely by some heavy pages!

Plus, even if we assumed "cached forever" actually worked for a significant amount of time, it still doesn't solve the problem that I am hoping to solve. I know many websites use the Roboto font. By installing it I no longer need to ever download that font again. It doesn't matter if it's the first time I'm seeing the site, if they use a CDN, if they link to the bold/light/regular version or their own packed font, etc...

I understand that it's a privacy issue, but I'm hoping there is a way to solve that privacy issue without removing that feature.


I would go further and disable remote fonts as well since it's not crucial like images and is an attack vector that should have been avoided. The better solution would have been a shared set of web fonts distributed with browsers, just like certificates.


I don't see why fonts are any different from images.


Fonts are different in that they're not as crucial to the content as images. You cannot replace an image with an alternative text form while retaining the content, but you can display the content completely with WOFF missing.


I think the logical conclusion of that argument is to also disable all CSS. Fonts are styling for text, CSS is styling for markup. I think most of the arguments against disabling CSS can be used against disabling fonts (barring that people do crazy shit with CSS most often now, so it complicates the issue).

Really, in the end, all input accepted from the remote side (including text/html) needs to be vetted and processed by security conscious routines. I don't personally have a reason to assume a font library is more likely to be exploitable than an HTML+CSS parser and layout engine. Based on complexity, I would actually assume the opposite, which is probably right, except we've already found and fixed a lot of the exploits for the HTML parser and layout engine.


I've considered what might be necessary to dispose of server-side CSS.

A set of standard page templates could do it. Clients could then choose their preferred client-side CSS to apply. Article, index page, image gallery, catalog entry, search result, etc.

Seems a finite set should cover most needs, which ought to be available from a few fairly standard sources (CMS, blogging engines, frameworks).


As I see it, the web is the most vibrant medium of expression and innovation we have today. While I don't doubt there would be gains in security to limiting it in many aspects, I question whether the specific level of security gain would be worth the loss of innovation and expression. I think there are many areas we could focus on instead that would increase security without the same level of negative consequences, so I espouse doing those first.

Not to mention I don't think it's a solution that's viable given economic principle and how much people value expression. We'd just be back to the equivalent of Flash sites again, with whatever takes over for flash (canvas?).


It's not just security; the other half is usability. The way popular web sites work is that they change stuff around regularly, ostensibly just to make things look different, but they change the behavior as well. This breaks stuff like existing functionality, key+mouse sequences to get stuff done, and places you've learned to look at and navigate to quickly. Computers and modern appliances (including cars) are strangely affected by this constant desire to shift things around unnecessarily, because someone told them they could sell it as long as it looks different now.

GitHub and Gmail are prime examples of sites that broke a lot of things in the process.

Maybe what we need instead are real APIs and custom clients.


Consider the implications of what this means, though. If sites are not free to innovate, things like Github and Gmail wouldn't exist. The only reason we aren't stuck with a Hotmail interface circa 2002 is because people were able to innovate on the web. To lock down CSS (or Javascript; there's no reason I can think of that you would lock CSS and not Javascript) to a specific set of capabilities is both a statement that it is sufficient for all needs, and that we can decide by committee what is a good set of standards to lock into. I think both assertions are laughably false.

If we had locked down CSS five years ago, what CSS would we not be capable of using today? If we lock it down today, what would we be missing out on five years from now?

Design by committee is horribly inefficient, and rarely takes into consideration the full needs of the users. What's more, it can't take into consideration future needs. Design by committee gets us XML. Adoption by iteration and evolution gets us JSON. XML has its place, but JSON is overwhelmingly more popular in certain contexts for a reason, it fits the domain better.

Lastly, iterating on Github and Gmail would not stop even if there was a complete lack of CSS and Javascript, it would just be more tedious as everything was done through a full page serve, just like the old days. That wouldn't prevent site redesigns along with missing or broken features, it would just make everything look shittier, perform slower, and use more server side resources.

That said, a sane standard for embedded interfaces, where choice is restricted, it needs to live a long time, and needs to have sane accessibility features would do well with better standards. I view that as a separate problem.


The fact of standard templates needn't prevent the possibility of novel templates. But it ought to make the prospect slightly more user-controllable. Design-by-committee isn't the alternative to design-by-fuckwits, the present mode.

Github and Gmail are both tools which now face the dilemma of gratuitous changes -- many of the recent innovations haven't done much for usability, for numerous reasons (familiarity itself is a key factor, GUI offers limited capacity for improved functionality, jwz has commented on this from his Mozilla experiences).

But most changes to default styles are pants.

Hell, much of the problem is that default styles are pants. If browsers had a set of presentation styles that did work well (see the "readability" modes offered by Safari, Firefox, Readability, Pocket, Instapaper, etc.), then we'd have slightly less of a problem.

Github, Gmail, Google Maps, etc., are largely the exception to long-form informational content pages. I'm OK with an explicit "app mode" for such sites. But 99.999999% of what I read would do vastly better with uniform presentation.

More attention to content and semantic construction. Less to layout frippery.

Something tells me you'll not be convinced.


> The fact of standard templates needn't prevent the possibility of novel templates.

If your stance is "provide well established default templates, but don't enforce their use", then I have no disagreement. That's not how I interpreted "I've considered what might be necessary to dispose of server-side CSS."

> Github, Gmail, Google Maps, etc., are largely the exception to long-form informational content pages. I'm OK with an explicit "app mode" for such sites. But 99.999999% of what I read would do vastly better with uniform presentation.

I think that depends heavily on what you use the web for. You and I likely read a lot on the web. Some people might stick largely to Facebook and Gmail. There are people that spend a lot of time in Github, and others that spend very little. Some people use a lot of online organizational and collaboration tools, others none.

> More attention to content and semantic construction. Less to layout frippery.

What you call layout frippery, someone else desires. This sounds suspiciously like remaking the web for your use cases, not for general use cases (which are always changing). But I'm not sure there's even a problem to address; you already addressed it by referencing "readability modes" as an example of presentation styles that do work well. Why isn't that your solution to this perceived problem?

It feels like you're trying to achieve the equivalent of forcing all the printers to agree to not print magazines that don't conform to someone's opinion of what a good magazine is. I'm just not sure why that's even desirable.

> Something tells me you'll not be convinced.

No, not yet, if I understand your position correctly.


Defaulting to standard formats, and, on the basis of improved semantic parsing and ranking, promoting them through higher Search rankings (ceteris paribus), would be a Good Thing.

Among the problems of present Web design is that the Web is an error condition (there's a wonderful essay exploring this), and browsers default to allowing broken behavior, even adapting themselves to it, explicitly.

The lack of a publishing gateway, even a minimal one which enforces markup correctness to the Web is a problem.

Layout frippery as pertains textual content has a rather well-supported basis. Complexity is the enemy of reliability, and more complex layouts offer far more ways for sites to break. That's a well-established fact that successive generations eventual learn (or fail to learn) at their peril.

(The phrase "Complexity is the enemy" itself dates to the 1950s. I'd have to check the year, but have remarked on it before. Source is The Economist newspaper.)

I've seen what happens when documents and other media are aimed at very specific readers. Eventually, they rot.

Bog-standard HTML (or some alternative markup -- I'm increasingly partial to LaTeX) tends, strongly, to avoid this.

You're also going back to ignoring points raised earlier in this conversation about security, privacy, and usability.

And yes, if there's a call for an app-based runtime environment, which Google seem quite bent on producing, well, that's a thing. But no need to fuck up the game for the rest of us.

And models which prove useful could and should be incorporated.

I'm pretty gobsmacked, for example, that 25 years after its introduction there's no affordance in HTML for notes (e.g., footnotes, endnotes, sidenotes, as presentation is a client issue), or for hierarchical presentation, e.g., of comments threads.

One can create nested hierarchies, but one with integrated expand/collapse/sort/filter functionality doesn't exist. This was extant in Usenet newsreaders and mail clients 20 years ago. Why not the Web?


I don't really have any issue with most of what you are saying, except "The lack of a publishing gateway, even a minimal one which enforces markup correctness to the Web is a problem.", and my issue with that really depends on what you mean by "problem". Sure, a publishing gateway would enforce some conformity, and some level of conformity is beneficial (I'll even allow that more conformity than we currently have would be beneficial), but too much conformity is not. Too much conformity breeds stagnation. So I'll re-frame my stance: how do you enforce or encourage conformity without going too far? How do you keep the entity or entities you've entrusted this task to from going too far?

> You're also going back to ignoring points raised earlier in this conversation about security, privacy, and usability.

I was just working off your points, which all seemed to be about usability. I've been treating this discussion as somewhat distinct from that one. I can definitely make arguments about conformity having its own negative aspects with regard to security.


But that's not what I'm suggesting.

We can begin by actually reviving browser user style sheets; having a well-known and respected set of names will allow for appropriate styling on the client.


I'm all for user style sheets, I see no problem in people overriding site defaults. Re: sites breaking existing functionality while changing, maybe I just don't see big regressions as having happened in Gmail (which I always have open) or Github (which I rarely have open, as my source is in a local repo, but I visit on a regular basis from links here and elsewhere). It is interesting that you mention keyboard mouse combos, when to my knowledge both sites have put specific effort into making keyboard shortcuts that work and allow some level of navigation without any mouse.


I have to use Gmail in static HTML mode so that it doesn't try to reinvent and fail at a text edit control for composing a mail.

GitHub has been, like Twitter, grabbing key bindings that already existed in the web browser, like Ctrl-K, and their comment edit box got limited in its resizability, forcing me to edit outside the website and paste into it often enough that it's an annoyance.


> I have to use Gmail in static HTML mode so that it doesn't try to reinvent and fail at a text edit control for composing a mail.

I'm not really sure what you are referring to here. Gmail does attempt to give you an editor for emails, but it's extremely simple in my experience to get it to do what you want most of the time. If your complaint is that you want it to just send a text email, and not a multi-part message with a plain text version and an HTML version, then I have to question why, as all that does is add choice and allow people to view it in the format they prefer, and it should look the same either way.

> grabbing more key bindings that existed before in a web browser like Ctrl-K and their comment edit box got limited in its resizability

Re: key binding, yeah, I can see that as somewhat annoying. I suspect they are trying to match some standard usability map and thinking of their site as an application, but it's annoying that it interferes with the browser (but only when within a text input, from what I can tell).

To some degree, I have to agree with what's probably Github's stance, which is that it's their site, and while it may seem annoying in some respects, they may have specific reasons they do things. They obviously aren't going to be able to make every change something everyone likes, but I don't necessarily think they are making change for change's sake. It's likely in response to pressure from gitlab and competitors. Presumably they are audience testing. The best way you can speak to this is to not use them when possible, or urge others to not use them.


Re Gmail: the bug is that they replaced the browser text box edit control with their homegrown JavaScript solution which does not work at all. Copy/Paste, scrolling, and many other features are broken with it. On top of that, you cannot resize it.

Re Github: The big issue is that they start hijacking keys that were free before. It's hard, if not impossible, to sway developers to use anything but Github. I've tried and been treated as if I'm in the luddite camp.


> homegrown JavaScript solution which does not work at all. Copy/Paste, scrolling, and many other features are broken with it.

These all just work for me. I'm not sure what the specific complaints are; maybe it's a Firefox thing, but it's not like there's a lot in Chrome that FF doesn't support.

> On top of that, you cannot resize it.

A little convoluted, but there is a way. In the subject of the thread, to the right, along with the collapse-all control, there's the option to open the thread in a new window. This window can be resized, and since the input is sized to the window, it resizes with it. Although, I suspect Gmail is meant to be viewed as more of an app than a site, so if the window size is not just about composing, but use, it might be worth using it as a freestanding browser window, distinct from and sized differently than other tabs, if you aren't already. I might actually play around with doing that now that I've said it.

> Re Github: The big issue is that they start hijacking keys that were free before. It's hard to impossible to sway developers to use anything but Github. I've tried and been treated as if I'm in the luddite camp.

Yeah, that's unfortunate, and I would have hoped Github would do better. I don't really think it's the norm though.


As cm3 notes, usability is as much of a concern as privacy and security, if not more. Though I'd not dislodge any of these three from a position of high primacy.

There's a risk/frequency trade-off with all of these. Privacy failures can quite possibly be costly or fatal, though they are slightly more rare. Not so rare, though, that 20% of all Web users in a US Department of Commerce survey (see my recent comments history) report known credit card fraud. That's many tens of millions of affected users.

The security risks are similar but also extend to organisations which might stand to lose control over their own (validly) private information, or control over systems (see for example concerns over SCADA infrastructure, or industrial process control).

Usability and adaptability issues pose lower risks, but have a much larger affected field.

It goes well beyond the visually disabled, illiterate, and cognitively challenged. Anyone who's landed on a desktop site that's unusable on mobile has encountered a usability challenge. Google, Apple, Facebook, and Amazon are all rapidly pushing us, some kicking and screaming (I include myself) to an audible Web -- one in which the primary control and response interfaces are spoken.

What landing on a small set of templates does is provide for clearly parseable and understandable content. In a world where the goal isn't to read a full page but to extract and convey a useful item of information from it, wading verbally through megabytes of unparsable and nonexcludable content isn't particularly useful (and yes, figuring out how much a data reference is worth to the data-reference intermediary is another question worth considering).

More generally, in my case, with only modest perceptual impairments (reading small, low-contrast type is among the earlier signs of your impending death), I've come to conclude for some years now that Web design isn't the solution, Web design is the problem. There are only so many ways you can present content that don't fuck with readability. I try, very hard, to ensure I'm not doing this in my own modest designs (look up "Edward Morbius's Motherfucking Web Page", a riff on a popular refrain, for my own principles in action).

My most common response when landing on a website is to sigh, roll my eyes, and dump it to something more readable. Firefox's Reader Mode. Pocket. Straight ASCII text. w3m.

And no, "novel graphic design" isn't conveying vast new amounts of information. I grew tired of hearing that argument 30 years ago, it's not got fresher since. Bloomberg, The New York Times, and the BBD are all experimenting with high-concept article formatting. In my experienct, without exception, it simply Gets In The Fucking Way.

My half-serious response to this is to create a new web browser embodying these and a few other principles. The working title is "the fuck your web design browser". FYWD for short.

Ninnies may opt to call it the Fine Young Western Dinosaurs browser as an alternative.


> My most common response when landing on a website is to sigh, roll my eyes, and dump it to something more readable. Firefox's Reader Mode. Pocket. Straight ASCII text. w3m.

> My half-serious response to this is to create a new web browser embodying these and a few other principles.

In all seriousness, I wonder if spoofing a mobile client (easily done through most browsers' developer consoles or an extension) might immediately result in a more useful experience for you on the majority of sites. Given the viewing constraints of most mobile platforms, and the focus on mobile accessibility (it's supposed to account for over 50% of traffic now), I imagine many sites try to put some minimum level of effort in to at least make it usable.


The majority of my browsing is mobile these days. 10" tablet.

Even sites which are otherwise well-designed (Aeon and Medium come to mind) insist on dark-pattern behavior such as fixed headers/footers. Again: straight to reader-mode for that.

Except for the sites which break that. Violet Blue's Peerlyst comes to mind: https://plus.google.com/104092656004159577193/posts/PWuVmx2r...

(Screenshots contrasting site and a Reader Mode session included.)

I've written directly to the site designer, who seems utterly insensate to why 14pt font isn't in fact a majickal solution to all readability problems.

HN itself is only barely usable.


> 10" tablet.

Really? That seems unlikely. I mostly see people use phones and small tablets, so <= 7".

> Except for the sites which break that. Violet Blue's Peerlyst comes to mind

There will always be someone thwarting best practices, just as there will always be those that skirt or break the rules in systems that are less lenient. There's not a lot of recourse, you want what they've got, so you are at their whim unless you can work around their imposed difficulties or find another source.

> I've written directly to the site designer, who seems utterly insensate to why 14pt font isn't in fact a majickal solution to all readability problems.

See above :/

> HN itself is only barely usable.

Yeah, but I think the reasoning behind HN is slightly different. I suspect HN assumes you will take some appropriate steps to optimize your use of the platform. Instead of "we will tailor the view to our artistic vision and you shall not besmirch it!" it's more of a "we believe in user agency, so get off your ass and make it better for yourself." Depending on your point of view, skill level, and site usage, you might find one more appealing than the other.

Personally, I use one of the browser extensions that allows collapsible comments, inline replying, and user info on hover over username.


>> 10" tablet.

> Really? That seems unlikely. I mostly see people use phones, and small tables, so <= 7".

Ignore that, I misread the sentence. I thought you were saying most mobile browsing is with a 10" tablet. I'm not trying to tell you that you're wrong about your own reported habits...


I have this hope that Servo will lend itself to modularity and building a usable web client with it.


What's Servo?


Mozilla's replacement for the Gecko engine that runs Firefox, written in Rust. Often covered here[1]; the benchmarks look really promising. Small portions of the codebase are already trickling back into Firefox where applicable.

The complexity and size of a modern web browser, and the need for better engineering tools to combat them, are often cited as part of the reason the Rust project was started.

1: https://hn.algolia.com/?query=servo&sort=byPopularity&prefix...


Of course you're right, and I didn't mention CSS but thought of it as another utilitarian piece.

I agree that the most complex parts are HTML+CSS+JS+DOM+GFX, but some parts cannot be reasonably disabled without breaking it completely.


Actually, that makes me wonder — if I spoof 5% battery charge will I get fewer annoying features on any sites?


You'll get more ads for chargers and laptop batteries.


Good idea.


> Why does the average site need to know my battery status?

I would go further and suggest that really no site needs to know it (I am sure there could be a few reasonable uses, but still). Which makes me wonder if we could strike back by abusing the WebRTC spec and fuzzing values like these, instead of simply blocking them.
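For illustration, here's a minimal user-script-style sketch of what fuzzing might look like for the battery API (the shape of the fake BatteryManager object is an assumption on my part; a real extension would also need to cover events and run before page scripts):

    // Sketch only: replace navigator.getBattery with one that returns
    // plausible but randomized readings instead of the real ones.
    (function () {
      if (!('getBattery' in navigator)) return;   // API absent, nothing to fuzz
      navigator.getBattery = function () {
        return Promise.resolve({
          charging: Math.random() > 0.5,
          chargingTime: Infinity,
          dischargingTime: Math.round(3600 + Math.random() * 7200), // 1-3 hours
          level: Math.round(Math.random() * 100) / 100,             // 0.00 - 1.00
          addEventListener: function () {},   // swallow levelchange etc.
          removeEventListener: function () {}
        });
      };
    })();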


For WebRTC, this behavior is now default in Chrome since version 48. Please see the release notes here: https://groups.google.com/d/msg/discuss-webrtc/_5hL0HeBeEA/H...


Exactly - most sites don't need my exact location, or access to WebAudio or whatever. It should be a red flag for most sites, however most users won't know how to react in such a situation.


> For WebRTC, browsers could block local addresses.

That would defeat a huge selling point of WebRTC, the ability to create in-browser p2p connections over the user's local network.


I'm not familiar with WebRTC. What's the use case there? I can't remember ever wanting to create an in-browser p2p connection on my local network. What would it be used for?


Please read the Chrome 48 release notes, WebRTC's default behavior has changed. https://groups.google.com/d/msg/discuss-webrtc/_5hL0HeBeEA/H...


But it still ignores the proxy settings and will use STUN to discover your "external IP". Thus users that think they are using a proxy end up not actually doing so.


WebRTC is a secure real-time protocol for audio, video, and data.

You'd want to use it any time you want a high-speed network connection with another user. For example, a multiplayer game or video teleconference.
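At the API level, the data-channel side looks roughly like this (a sketch; sendToPeer is a stand-in for whatever signalling channel the application uses to exchange the offer and answer):

    const pc = new RTCPeerConnection({ iceServers: [{ urls: 'stun:stun.example.org' }] });
    const channel = pc.createDataChannel('game-state');

    channel.onopen = () => channel.send(JSON.stringify({ x: 10, y: 20 }));
    channel.onmessage = (e) => console.log('peer says', e.data);

    pc.createOffer()
      .then((offer) => pc.setLocalDescription(offer))
      .then(() => sendToPeer(pc.localDescription));  // hypothetical signalling helper
    // ...later, once the remote answer arrives: pc.setRemoteDescription(answer);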


in-company hangouts, video conferencing, etc. Without p2p on local network that would have go outside the company and back in.

Others have pointed out this behavior has changed in Chrome 48. You don't get the local IP unless the page asks for access to the mic/camera which the user has to give permission for.


I think the answer isn't technical, but legal and cultural. Make it unacceptable in the court of public opinion for companies to misuse this data, and strengthen privacy laws.

These two things, of course, go hand-in-hand, but us techies tend to look, I think, for the technical solution because that's the place where it's easiest to see how we could have any sort of impact. The other stuff is a lot of talking to and listening to people, consensus-building, being persuasive, etc.


Legislation and regulation are necessary, and they tend to help keep the really big boys in check, but how can you actually tell whether a company is actively compiling profiles on you or not? I can ask my browser to pass the Do-Not-Track header indicating my objection to that practice, but why would a company specialised in tracking users respect that header?

I have tried to convince the Dutch banks I use (ING and ABN AMRO, i.e., big banks) to stop employing tracking beacons and third party tracking services on their secured internet banking environments, but the responses I get range from 'yeah we need those to improve your customer experience' to 'you are welcome to block these trackers yourself' (I already do, thank you very much).


How about using the permission system? You don't have to disable WebGL by default, but you can ask users for permission when it's needed (usually in a game).

Other stuff like GPS, camera, and microphone already require permission before being used.


What about time and date? I'm almost certain NetFlix uses this to detect geo-unblocking.


Netflix uses an up-to-date list of known VPN-endpoints in addition to a database of IP-ranges by country. They don't need to detect anything client-side. This list is constantly in flux though, so sometimes accessing Netflix via VPN works, sometimes it doesn't.


When everybody was running Windows on a smorgasbord of hardware / patchlevel / plugins / fonts, it was easy to fingerprint. Are we moving towards a more monolithic landscape where fingerprinting is less able to track individual users?

* If I have a fleet of Chromebooks running the same version of Chrome OS, will they all have the same fingerprint?

* Will, say, all iPhones 6 with the same hardware parts, running the same Mobile Safari and iOS version, have the same fingerprint?

Thank you!


Why yes, some monoculture has made fingerprinting harder. Try [0] with a few of those devices.

0: https://panopticlick.eff.org/


This is much-needed research. Thank you for your work. Regarding the WebRTC tracking- would it be possible for WebRTC to work without exposing the local IP? I.e. is there any real reason that fingerprint needs to be there?


Other co-author here. Unfortunately there are good performance reasons for allowing WebRTC to access the local IP, see the lengthy discussion here: https://bugzilla.mozilla.org/show_bug.cgi?id=959893. One use case is allowing two peers behind the same NAT to communicate directly without leaving the local network.

The working group recommendation that we linked in the paper (https://datatracker.ietf.org/doc/draft-ietf-rtcweb-ip-handli...) addresses some of the concerns that arise from that (namely the concern that a user behind a VPN or proxy will have their real, public address exposed), but still recommends that a single private IP address be returned by default and without user permission.

However that's still quite identifying for some network configurations, e.g. a network which assigns non-RFC1918 IPs to users behind a NAT. Seems to me that putting access to the local IP address behind a permission would both remove the tracking risk and still allow the performance gains after the user grants permission.
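For context, this is roughly how a page reads the candidate addresses today, without any permission prompt (just a sketch of the standard WebRTC API, not code from our measurement tool):

    const pc = new RTCPeerConnection({ iceServers: [] });
    pc.createDataChannel('');                 // needed to kick off ICE gathering
    pc.onicecandidate = (event) => {
      if (!event.candidate) return;           // null candidate: gathering finished
      // Host candidates embed a local address, e.g. "candidate:... typ host ..."
      console.log(event.candidate.candidate);
    };
    pc.createOffer().then((offer) => pc.setLocalDescription(offer));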


Thanks for the response! If you're interested and it would be useful for your research, I have some really, really interesting privacy findings regarding Service Workers I'd be happy to share. I'm strongly in favor of an enhanced Open Web, but I'm not comfortable with the opaque nature in which tracking/privacy can be likewise enhanced with little user interaction or notification. Keep up the good work.


Feel free to email us at the addresses listed on the bottom of the linked site.


I am going to ask about a really basic question: what is fingerprinting?

I had to dig around, from the paper is sounds like a stateless form of tracking.

The audio example made sense:

1. the mic comes on, and it identifies a particular background noise.

2. I browse to another site, or a different page without a cookie.

3. The mic comes on again, matches the ambient noise and realizes I am the same person.

Is that what you mean? If this is the case, how can the "canvas fingerprinting" work since I had to browse to a new page and all the old pixels from the previous page are no longer there.

Anyway, if it is what I understand it to be, then it sounds very interesting. I bet some science fiction author wishes they had thought to use it.


I can see how you would be led to that interpretation. Looking at the "fingerprinting" webapp, however, it details that sound is NOT actually recorded -- only the uniqueness of your machine's audio processing stack. At least I hope that's the case. The idea of a microphone recording without permission upon visiting a website would cause quite a brouhaha.

https://audiofingerprint.openwpm.com/

    > "This page tests browser-fingerprinting using the AudioContext and Canvas API. 
    > Using the AudioContext API to fingerprint does not collect sound played or
    > recorded by your machine - an AudioContext fingerprint is a property of your
    >  machine's audio stack itself. If you choose to see your fingerprint, we will
    >  collect the fingerprint along with a randomly assigned identifier, your IP
    > Address, and your User-Agent and store it in a private database so that we can
    > analyze the effectiveness of the technique. We will not release the raw data 
    > publicly. A cookie will be set in your browser to help in our analysis. We
    > also test a form of fingerprinting using Flash if you have Flash enabled."


Yes, no sound is recorded. Access to the user's mic isn't possible without a permission. If there are sections of the website or paper that seem to imply that, let me know and we'll clarify.


It seems strange that access to the audio stack isn't also behind the permission.


It does, and also as someone who has never heard of AudioContext before, I can't fathom why it would be necessary for a web application to generate an audio data stream that isn't output to speakers _AND_ _THEN_ analyze the result.

What is the typical use case for AudioContext?

The capabilities of AudioContext used in audio fingerprinting seem like they're beyond what is really necessary?


AudioContext is actually pretty cool. As far as I know, only Firefox supports it at the moment. But it allows you to work with audio streams in raw byteform, which means you can do advanced audio processing in client side javascript.


Presumably, an application would need to enumerate audio capabilities before even offering the option.


The front page [1] doesn't say either way, and I wondered the same thing when I read it. A few words or a link to an explanation would be helpful.

[1] https://webtransparency.cs.princeton.edu/webcensus/index.htm...


Thanks, I've added a clarification.


Wow, so are mic settings that different on, say, different iOS devices? If you and I have the same model iPhone with the same iOS version, is the audio stack that different?


Probably not - but basically, you add together your microphone stack with all the other data it can possibly find about your device, and that's your fingerprint.

Check out https://panopticlick.eff.org - this will attempt to fingerprint your browser and see if it's unique.


If you'd like a lot more on this topic, check out https://33bits.org/sitemap/ and start clicking on the links.


I know iOS was just an example but, just to clarify, the WebRTC spec isn't supported in-browser on iOS. To further clarify: it's not supported in any browser on iOS.

As a developer, you can take advantage of the spec only if you're building a native app. There are frameworks you can use if you do. But within Safari or Chrome you have zero WebRTC support.

It's supported in modern versions of chrome on Android but won't be supported on iOS until apple does something about it.


> But within Safari or Chrome you have zero WebRTC support.

Chrome is just a skin on Safari for iOS, because Apple doesn't allow third party browsers, right? I would think FF (or any other browser) wouldn't be able to on iOS either, given that constraint.


> "how can the "canvas fingerprinting" work since I had to browse to a new page and all the old pixels from the previous page are no longer there"

The linked page answers this: "Differences in font rendering, smoothing, anti-aliasing, as well as other device features cause devices to draw the image differently."

Put differently, the function measureText(canvas full of text with various fonts and bizarre features with varying implementation) is a pretty good hashing function for a population of web users, because each of these web users has a pretty-unique [canvas rendering engine, underlying OS, installed fonts] combination.

Combine several of these techniques (webrtc, audio, list of plugins installed and their version, etc), and you go from a "pretty unique" to a "guaranteed unique" hash, which you can follow across the web.
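A toy version of the canvas part, to make it concrete (the exact text, colors, and hashing step vary between real trackers; this is just a sketch showing where the entropy comes from):

    function canvasFingerprint() {
      const canvas = document.createElement('canvas');
      canvas.width = 300;
      canvas.height = 60;
      const ctx = canvas.getContext('2d');
      ctx.textBaseline = 'top';
      ctx.font = '16px Arial';
      ctx.fillStyle = '#f60';
      ctx.fillRect(100, 5, 80, 30);
      ctx.fillStyle = '#069';
      ctx.fillText('Cwm fjordbank glyphs vext quiz', 2, 20);
      // Every pixel of the rendered text ends up in the data URL; tiny
      // differences in font rendering change the string, so hashing it
      // yields a per-machine value.
      return canvas.toDataURL();
    }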


Thanks for the explanation. I missed it because I never thought these settings could be that unique.


Just imagine, if the audio stack exposes the volume level, that's roughly 7.5 bits of uniqueness to contribute to the 33 required to uniquely identify any person on Earth (not that you can expect it to be uniformly distributed, and thus fully usable).


That doesn't quite work, since the volume level changes frequently.


Yes, it's a poor example in that respect. I meant it more as an explanation of how different attributes of a source each contribute a little bit toward providing a unique identifier, but you are correct that it's much less useful if the attribute is not static.


I'm going to answer the basic question: fingerprinting is about trying to identify your device as uniquely as possible using available APIs, in order to track you cross-site, without cookies.

To do that, you first try to identify APIs that return different results depending on the browser or the device, and then track those results. For example, the user agent has some identifying information. It's not unique to each person, but it gives you a bit of identifying information. Do that with multiple APIs (available fonts, installed plugins, ...), and you start having enough identifying information to uniquely identify some browsers, without an actual ID being provided by the browser.

To test your browser, you can visit https://panopticlick.eff.org/


I never understood panopticlick, even when I repeatedly visit it, it always tells me that

"Your browser fingerprint appears to be unique among the 135,054 tested so far."

Shouldn't it tell me that my browser is not unique on my 10th attempt, considering it has recorded my previous attempts? This warning actually never changes, regardless of the duration between consecutive attempts. That can only mean that Panopticlick is flawed or that my browser signature is in constant flux (which would essentially make it useless from a tracking perspective).


I thought the same thing, so I did a bit of digging.

Turns out they put a bunch of tracking cookies on your machine without asking you (it is mentioned in the about page though), which seems rather naughty for an organisation promoting online privacy.

When I removed all 4 of them, I get down to being "almost unique". I'm currently down to having the same fingerprint as 1 in 45132.3333333 browsers.


This is confusing; my incorrect assumption was that "1 out of n browsers" meant n = the total number of browsers evaluated by Panopticlick. Rather, what they mean is 1 out of k, where k is determined by the unique bits. There might be other factors, such as the entropy of each fingerprint.


How are two sites sharing this fingerprint information in a way that says "yup, this is the same guy"? Like is there some sort of cabal of evil advertising companies running a bunch of sites, or what?


Yes.


Which API reveals my system fonts to a website?

Edit: The fingerprint test at https://panopticlick.eff.org/ shows my System Fonts


I don't think you can enumerate them these days, but you can test for them by trying to use them in CSS (which font is used would affect the width of a span of text, use a wildly different fallback font and you can guess which is installed) or <canvas> (where you can inspect the actual pixels rendered).
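A sketch of the width-probing approach (the probe string and fallback font here are arbitrary choices, not a canonical list):

    function hasFont(fontName) {
      const span = document.createElement('span');
      span.textContent = 'mmmmmmmmmmlli';        // widths differ a lot between fonts
      span.style.cssText = 'position:absolute; visibility:hidden; font-size:72px;';

      span.style.fontFamily = 'monospace';       // measure with the fallback alone
      document.body.appendChild(span);
      const fallbackWidth = span.offsetWidth;

      span.style.fontFamily = '"' + fontName + '", monospace';  // candidate first
      const candidateWidth = span.offsetWidth;

      document.body.removeChild(span);
      return candidateWidth !== fallbackWidth;   // width changed => font is installed
    }
    // e.g. ['Calibri', 'Ubuntu', 'Helvetica Neue'].filter(hasFont)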


Or use @font-face and detect calls to a remote URL — which happens when the named local font is missing:

    @font-face {
      font-family: "Roboto";
      src: local("Roboto"), url("https://example.com/user-does-not-have-roboto") format("woff2");
    }


Flash provides an API for that.

Edit: Disable Flash or make it click-to-activate and https://panopticlick.eff.org/ shouldn't list your fonts anymore.


You can't get installed fonts via javascript, but you can change the font of a known text and fingerprint the size to determine whether the font is installed or it defaulted to another font.


No. Microphone requires permission.

Your browser can leak a ton of information about your computer silently: window size, screen resolution, pixel density, time zone, language, installed fonts, installed plugins and their versions, operating system and version, browser version, etc. There are good reasons for all of this data to be available to JavaScript for legitimate purposes. The AND of all of these datapoints, however, may be unique (or close), particularly if you have ever done something like install a novelty font. The EFF runs a website that fingerprints you and tells you how unique your fingerprint is:

[0] https://panopticlick.eff.org/tracker


Canvas fingerprinting uses differences in rendering e.g. of fonts. Output a text, hash resulting pixel values. Depending on exact version of the font(s) installed, anti-aliasing settings, default font sizes, operating system... you get slightly different results. So you don't rely on information stored on the device, but on repeatable behavior that differs between devices.


I thought that was "Canvas-Font Fingerprinting"

But now I see that is just seeing which fonts are available.

Thanks for the explanation. It's just hard to believe devices are so different. I would think most versions of iOS would have roughly the same set of fonts, etc.


Canvas fingerprinting by itself won't uniquely identify users. But the idea is that you can combine various different techniques, each one giving you more bits of uniqueness, until you have enough to do so. For example, say that canvas fingerprinting gives you one of 100 possibilities, and you combine it with other techniques that give you one of 10,000 possibilities, then combined (assuming they're not correlated) you get it to a million, letting you uniquely identify people with decent reliability from a decently large visitor pool.


What are some unexpected things that would differ between two iPhones of the same model running the same versions of the software stack?


Good question. I'm not particularly informed on this stuff, so take this with a grain of salt, but my understanding is that mobile devices in general and iPhones in particular are much harder to fingerprint reliably. Things like time zone, clock skew, and ping times might help differentiate users, but you probably can't get it down to a single person. I imagine there's still a use for fingerprinting which helps you differentiate groups of users even if you can't narrow it down to just one.


Actually, checking if a font is available does not require canvas (you can simply inject a piece of text into the page with a specific font stack set and check its width). Rather, what canvas is used for is to obtain the sub-pixel anti-aliasing of a given piece of text, which is different between browsers and OS even when the same font is present.


I would assume that iOS devices are quite hard to tell apart using most of these techniques, yes. But I also wouldn't be too surprised if there were something that works for them, some kind of cookie that isn't cleared by default or ...


Yup HSTS supercookies or some kinds of network fingerprinting will work to distinguish between two otherwise identical iOS devices.


what is fingerprinting?

I don't know the research definition, but fingerprinting is a technique to uniquely track a user across multiple sites without a tracking beacon.

The most basic form of fingerprinting is to use the browser-supplied headers (user agent, version, OS). Canvas fingerprinting works because identical browser versions across different machines may render slightly differently, but consistently. IIUC, canvas fingerprinting doesn't rely on any pixels shown to the user or anything unique to the site, but if the same canvas is rendered exactly the same way on two different sites, that's another indication that both visits were from the same user.

I don't think the AudioContext fingerprinting uses the actual microphone: it uses the browser's (and possibly OS's) audio engine to generate an audio stream, then fingerprints the resulting data stream.
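Roughly, a sketch of how that could look with an OfflineAudioContext (the oscillator settings and the sample range summed over are arbitrary; the point is that the rendering happens offline, so nothing is played or recorded):

    function audioFingerprint() {
      const ctx = new OfflineAudioContext(1, 44100, 44100);   // 1 channel, 1 second

      const oscillator = ctx.createOscillator();
      oscillator.type = 'triangle';
      oscillator.frequency.value = 10000;

      const compressor = ctx.createDynamicsCompressor();
      oscillator.connect(compressor);
      compressor.connect(ctx.destination);
      oscillator.start(0);

      return ctx.startRendering().then((buffer) => {
        const samples = buffer.getChannelData(0);
        let sum = 0;
        for (let i = 4500; i < 5000; i++) sum += Math.abs(samples[i]);
        return sum;   // stable on one machine, differs across audio stacks
      });
    }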


Thanks for this research, really interesting to see.

I do want to state for the record that instinctiveads.com was testing augur.io and that's why we're listed there. We don't use them anymore but unfortunate timing, especially considering we're trying to be a better ad network than the rest.

Also I'd like to point out that one of the most pervasive tracking methods is done through form submissions. Anywhere you submit an email (login, purchase, etc) can be used as identification and first-party cookie matching.


Having the insight of someone who works in online advertising would be interesting and informative. Is there anything you can share that we might find interesting?


I could talk about this for hours. Fundamentally, identity is important for the ad industry but it's not about your personal info, it's just a reliable ID that we're all after.

A reliable ID allows for storing your ad history and interests to show you better ads and less of the same. This is proven since it's all math and data science and we can see the increase in metrics with better targeting. By the way, clicks are not the most important metric either, there's much more that goes into an ad campaign. Ironically, reliable IDs also allow for storing any opt-out settings since it's just a value attached to that ID.

The email login I mentioned above is the most common way to track online, most of the big sites actually sell login data and fire tracking tags when you're logged in with the email address passed through (usually hashed but not always) so that providers can set their own cookies and recognize you again. Since emails are strongly unique, this is really effective.

This tech is also used to combat ad fraud (which is what we were using it for). Fraud is a massive problem since it's so easy to start up botnets and churn through millions of ad impressions quickly.

Unfortunately a lot of this new age of tracking is the result of politics, bad incentives, and a lack of regulation that's led to a wild west situation where these companies can do anything. Clearly the technical talent is capable (as seen in this research) but it's being put to the wrong use. The DNT (do not track) header was a compromise but lacked any real regulation to make it effective. 3rd party cookies were fine but unfairly demonized and the default blocking of them pushed the industry to these deeper tactics.

Ultimately this is a business process issue: if there was a standardized ID like IDFA but for browsers (or even better at the OS level) and privacy regulation that's actually enforced, that would be a good compromise. Sites and ad networks get a reliable ID and you get control over when and how that ID is refreshed.

EDIT - All this stuff used by independent ad companies is just a tiny fraction of the industry. This barely covers ISPs, who have very refined tracking abilities that you really can't avoid since they control the traffic. Comcast/Verizon has the AOL ad network using this. And the 2 biggest ad companies are Google and Facebook, both of which don't need fingerprinting because they already know who you are from just being logged in.


> if there was a standardized ID like IDFA but for browsers (or even better at the OS level) and privacy regulation that's actually enforced, that would be a good compromise. Sites and ad networks get a reliable ID and you get control over when and how that ID is refreshed.

To detect ad fraud, would the ID need to be the same on all sites? Instead of sites dropping cookies on clients, what if browsers generated their own random per-site IDs? Users and browsers would have more control over managing and clearing cookies and user IDs.
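Purely as a sketch of what per-site IDs could look like (the deviceSecret and the HMAC construction are assumptions, not anything browsers do today): derive each ID from a local random secret plus the site's origin, so it's stable per site, unlinkable across sites, and resettable by rotating the secret.

    async function perSiteId(deviceSecret, origin) {
      const enc = new TextEncoder();
      const key = await crypto.subtle.importKey(
        'raw', enc.encode(deviceSecret),
        { name: 'HMAC', hash: 'SHA-256' }, false, ['sign']);
      const mac = await crypto.subtle.sign('HMAC', key, enc.encode(origin));
      return Array.from(new Uint8Array(mac))
        .map((b) => b.toString(16).padStart(2, '0')).join('');
    }
    // perSiteId(localSecret, 'https://example.com').then(console.log);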


Yes the ID would have to be the same, just like it is on mobile (Android Advertising ID, iOS IDFA) and would be best at the device level but browser would be a start.

If it's unique to every site then it's nothing new; networks can already set IDs today with 1st party cookies. It's being able to have an internet-wide ID that's valuable and is what 3rd party cookies allow(ed).

The ID itself doesn't matter, it's just random characters and mapped in various ways by networks. It's the reliability and consistency on a device level that's needed. Having something like this would make a massive difference - all the cookies/tracking junk would be obsolete, along with the hundreds of pixel sync tags, and would make everything faster, more accurate, more private and more secure.


Why would two browsers with the exact same user agent (i.e. same version, same OS, same arch) yield two different renditions of an audio fingerprint?


They wouldn't.

But the point of fingerprinting is that practically no two "browsers" are the same:

  - browser software and exact version
  - installed plugins
  - size of browser window
  - OS software and exact version (think of patches!)
  - language
  - time zone
  - screen resolution
  - ...
  - (and all the stuff mentioned in the submitted article!)
See the EFF's Panopticlick to see _how_ unique your browser is. Be sure to click the "Show full results for fingerprinting" after the test to see all things it considers.

[0] https://panopticlick.eff.org/


But is any of this stuff stable enough to ensure a fingerprint -> user correlation which doesn't break every time? It's not very much use if all it does is create a unique fingerprint for each refresh?


Yes; the things I've mentioned above don't change on page refresh.

If you'd find some things do change too often to be relied upon you could either take that into account, or simply don't use that specific fingerprinting technique.


That doesn't really answer the question, because most of the factors you listed should be irrelevant for _audio_ fingerprinting.


It does.

> They wouldn't.


That's a pretty short answer, and it sounds wrong to me. Are you implying that browser version, OS type and version, and system architecture are all factors that matter for audio fingerprinting? If so, what would be the point of audio fingerprinting when you can just look at the user agent string?


Sorry, it seems I misunderstood your intention/question.

The `AudioContext` API exposes several details about the host which may depend on the hardware (sound card, sound chip), software stack (OS, on Linux e.g. PulseAudio vs. ALSA), sound driver and its versions, and connected periphery (speakers? headphones?).

Additionally, the audio API is used to generate a sound (which is muted before being played, but still generated). Sound is hard, so the browser vendors don't necessarily generate the "sound bits" themselves but ask the OS to do so. Which might in fact ask its sound system to do so. Which might ask its sound driver...

Some of these properties are fairly common or likely to change often. But chances are that, combined, they give you more bits of information than, say, the simple user agent string (which is shared by thousands - if not more! - of other browsers).


Will you post insight into the data you've collected? Obviously I don't care about IP addresses, etc, but it would be nice to know how many people have submitted data vs how many unique hashes have been collected for say the "Fingerprint using DynamicsCompressor", etc. I also haven't checked every page on the site, so the data might already be there (and I'm missing it)...


Yes, we'll definitely do an analysis on this and write it up.


If the sites can be detected, wouldn't it be possible to come up with a browser extension to at least let people know this is happening?


In your paper you say,

"When using the headless configuration, we are able to run up to 10 stateful browser instances on an Amazon EC2 “c4.2xlarge” virtual machine."

Also it seems like you ran the crawl only in the month of January this year, and crawled about 90 million pages. Were you able to do that on a single AWS instance, using Firefox via Selenium? What do you think the performance would have been just issuing raw requests?

Just interested because I'm currently building a crawler and am trying to decide if Selenium would be worth it performance wise.


On iOS I use Safari and disable access to location etc., also disable cookies, the advertising ID, etc. Then I feel quite safe when using a VPN. Does that still hold?


Yes.

Changing common settings might in fact even make you stand out _more_.

Check the EFF's Panopticlick [0] to see how your specific configurations leaks identifying information.

[0] https://panopticlick.eff.org/


This week’s http://www.heise.de/artikel-archiv/ct/2016/11/144_kiosk states (translated from German): “Many ... fingerprinting techniques come up empty on mobile devices. ... Moreover, there is no way to specifically defend against fingerprinting via sensor characteristics - neither on iOS nor on Android.” It then concludes by recommending ad blockers for Safari on iOS, noting that this depends on the quality of the block list. It also mentions that ad blockers on iOS don’t work inside apps, unlike on Android.

IOW, most fingerprinting techniques fail on mobile devices, and sensors, e.g. batteries, are one of the few remaining avenues for fingerprinting on iOS. Do you disagree with Heise? Could you please substantiate your statements regarding iOS fingerprinting?


You're right that mobile devices are harder to fingerprint – too many people with the same screen size, operating system, browser version, timezone, and language/region settings.

However, mobile devices have a bunch of sensors, some of which can be accessed by JavaScript without permission, namely the ones you wouldn't expect to yield identifying information (e.g. accelerometer). The problem is that no two sensors are alike – they all introduce noise into the data which can be enough to fingerprint a device.

See here for a paper on the subject: https://crypto.stanford.edu/gyrophone/sensor_id.pdf
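As a crude illustration of the idea (not the calibration-estimation method from that paper): sample the motion sensor for a moment -- devicemotion historically needed no permission prompt -- and summarize its bias and noise, which differ slightly even between physically identical devices.

    function sensorSketch(ms) {
      return new Promise((resolve) => {
        const xs = [];
        const onMotion = (e) => {
          if (e.accelerationIncludingGravity) xs.push(e.accelerationIncludingGravity.x);
        };
        window.addEventListener('devicemotion', onMotion);
        setTimeout(() => {
          window.removeEventListener('devicemotion', onMotion);
          const mean = xs.reduce((a, b) => a + b, 0) / (xs.length || 1);
          const variance = xs.reduce((a, b) => a + Math.pow(b - mean, 2), 0) / (xs.length || 1);
          resolve({ samples: xs.length, mean: mean, variance: variance });
        }, ms);
      });
    }
    // sensorSketch(2000).then(console.log);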


nice Arvind!


The key bit from the abstract:

As long as they could not see themselves in a mirror, ants with a blue dot painted on their clypeus did not try to remove it. Set in front of a mirror, ants with such a blue dot on their clypeus tried to clean themselves, while ants with a brown painted dot — of the same color as that of their cuticle — on their clypeus and ants with a blue dot on their occiput did not clean themselves. Very young ants did not present such behavior.


And from the conclusion:

Briefly, if an animal detains self recognition ability, it will recognize itself in a mirror and will try to clean the alien colored spot it bears. The inverse is not always true: if an animal clean itself in front of a mirror, it might do so without recognizing itself. So, on the basis that ants conspicuously marked on their clypeus clean themselves while ants marked otherwise do not, both only after having been in front of a mirror, it can be presumed (but not yet asserted) that, for the Myrmica species presently tested, and for individuals of a given age, self recognition is not impossible.


I am not a scientist of any field, but the above made me think twice. They used a mirror. Just by virtue of seeing something in a mirror, the ant was able to deduce "hey, that is me, there is something up with my swag here, I better freshen up a bit". That ant not only judged itself against the looks of others, it compared them to its own reflection and decided the difference was wrong enough to attempt removing it. Can I further think they can tell the difference between, say, an "ant birthmark" and just paint? If so, I am pretty shocked at just how self-aware they appear to be.


There's a follow-up to this post here: https://freedom-to-tinker.com/blog/englehardt/the-web-privac...

And here's our open-source tool that we've been using to do all these privacy measurements: https://github.com/citp/OpenWPM/

We'd love to see pull requests or just other people using our tool for new interesting findings.


Kudos to the Princeton team on yet another interesting find! I recently submitted the work on de-anonymizing programmers from binary signatures using machine learning that was also fascinating (not to mention a bit scary)

But I confess I'm perplexed as to how a canvas render can yield a consistent unique fingerprint? Even allowing for subtleties in font anti aliasing implementations amongst drivers, wouldn't two machines with identical gpu hardware and drivers be identical down to the last pixel? And what about browsers using software rendering?

My interest is purely academic of course ;)


I'm the instructor of an upcoming Coursera course [1]. A couple of observations from my point of view:

* I wish there were a way to fund online education through philanthropy/donations. Coursera being for-profit leaves a bit of a bad taste in the mouth. At a practical level, it complicates what images I can use in my lectures and qualify as fair use.

* After several years the site is far from being at a point where an instructor can log on and upload content. The interface is constantly changing, confusing, and buggy. My university has a dedicated team who help out instructors with putting their material online and even they are often confused about how to edit this or upload that.

Overall I'm glad that Coursera exists and is finding a revenue stream; my own undergraduate education would have been vastly different if I'd had access to the material that's available today.

[1] Bitcoin and Cryptocurrency technologies https://www.coursera.org/course/bitcointech


To address your first point, Edx.org is very similar to Coursera, but is a non-profit organization that releases all its software as open source (https://open.edx.org/). For your second point, EdX Studio (https://studio.edx.org/) is focused on being accessible and easy to use for instructors - we hear good things from course staff about usability compared to Coursera.

(I work for EdX.org as a developer)


Out of curiosity: any news on the hosted version of edX (mooc.org)? The site went up quite a while ago, but I haven't heard of any developments since.


Interesting observations and notes. Glad to see a little bit of background / context regarding the mechanics. Regarding my perspectives, I have a background teaching and an advanced degree in education course design.

Regarding point 1, my understanding of Fair Use within an Education environment is that an instructor using protected material in the context of a lecture or assignment is, by default, an instance of Fair Use. A lot of the pivot relates to the scope of the use - as in, photocopying an entire chapter or short-story is okay, but photocopying the entire book is not. With images, I think you're well in the clear. I can understand where you're coming from with your concern, I just don't believe it to be material.

> My university has a dedicated team who help out instructors with putting their material online and even they are often confused about how to edit this or upload that.

This scenario strikes me as counter-intuitive from a savings perspective, because now there's two layers involved: Instructors and IT Support. Actually, it sounds like a terrible waste of overhead and expense the University is laying out. Will Coursera reimburse your institution for the burden, or is it so small compared to the revenue brought in through Coursera that the expense is immaterial?

I get a macabre laugh out of learning Coursera actually kind of sucks at its main value proposition of being a technology platform for education, in that it's not user friendly for actual educators. Yeah it's a 'disruption' platform, sure. Just seems to me like throwing a baseball into an Olympic swimming pool.


> This scenario strikes me as counter-intuitive from a savings perspective, because now there's two layers involved: Instructors and IT Support. Actually, it sounds like a terrible waste of overhead and expense the University is laying out. Will Coursera reimburse your institution for the burden, or is it so small compared to the revenue brought in through Coursera that the expense is immaterial?

If it's anything like my university was, the team he's referring to didn't exclusively support Coursera; their purpose is to provide faculty assistance with managing online content in general. Whether it's the university's internal Blackboard site or Coursera, they support whatever platforms the professors are using (assuming it's a university approved platform).

So the marginal expense of supporting Coursera is likely negligible, unless they've somehow managed to make a worse interface than Blackboard and it's particularly resource intensive to support.


It hasn't launched yet but you may be interested in checking out snowdrift: https://snowdrift.coop/.

I brought this thread up in #snowdrift on freenode and there was some interesting discussion around Coursera's model and this critical talk by Eben Moglen: https://boingboing.net/2012/05/27/innovation-under-austerity.... There's a full transcript here: https://www.softwarefreedom.org/events/2012/freedom-to-conne.... I've mostly viewed Coursera as a good thing even if it's a for-profit company, but I'm not as confident any more.


Have you looked at http://www.donorschoose.org/? From their site:

DonorsChoose.org makes it easy to help classrooms in need. Public school teachers post classroom project requests which range from pencils for poetry to microscopes for mitochondria.

I bet there's a way you could sneak in support for your Coursera teaching.

As for uploading content, that seems like a really tough and time consuming problem. Are you allowed to put together a wiki or webpage? I was doing some prep for an Operating Systems course and read this excellent blog post about why textbooks should be free [1]. In the post, the writer mentions that "perfect is the ultimate enemy of good", so he decided to write the initial draft of a textbook purely in plain text rather than properly format it with something like LaTeX. Getting the necessary content out there seems like a good first step for you and your team.

[1] http://from-a-to-remzi.blogspot.com/2014/01/the-case-for-fre...


> * I wish there were a way to fund online education through philanthropy/donations. Coursera being for-profit leaves a bit of a bad taste in the mouth. At a practical level, it complicates what images I can use in my lectures and qualify as fair use.

There are ways to do this. The problem is that they don't readily scale. Self-funding systems scale much better than systems that require ever-increasing amounts of external funding.


That's a bit hand-wavy. Can you elaborate why you claim this?


It's a matter of requirements for external resources. If a system needs constant infusions of external money, its ability to grow will be determined by its ability to bring in such external money. This is how non-profits tend to work and why they dedicate such attention to fundraising.

Systems that generate what they need to grow don't have the same constraints. If money is what your business needs to grow and it generates a significant yearly profit, then your business can meet its own needs to enable growth.

Is that clearer? Some sorts of systems, when functioning correctly, will tend to be self-perpetuating. Others will, as an artifact of structure, require endless external resourcing.


OK. Find me an example of a system that doesn't require external resources :)


It's not about that. It's about not needing continual fresh infusions. This is why GE, which doesn't need to raise money every six months, is more likely to be around in ten years than any given startup trying to raise a B round.


Yes, that clarifies your point a lot. Thank you.


There are ways for the developers to earn the donations they ask for; edX is an example of another model. I live off of MOOC courses - I learn everything from them - and that's coming from a current college student.


hi @randomwalker, I enrolled in that course a few hours ago! I hope that the course is not too bad because of the limitations you mentioned above.

