
I don't really understand the issue? If I want to prefetch an image, I'm on the same origin the whole time and this cache segregation doesn't matter.


Indeed. As the author puts it "sometimes you know something is very likely to be needed". Let's have a look:

    <link rel=prefetch href=url>
What's going on here, for the test case given? It's introducing tight coupling to an external resource (or the coupling already exists, and you're trying to capture a description of it to serve to the browser). It's not that prefetch is broken; it's that wanting to gesture at the existence of a resource outside your organization's control, while insisting it's so important as to be timing-critical, is trying to have your cake and eat it, too.

As mentioned in similar comments, the observed behavior for this particular test case is potentially a problem if you are building Modern Web Apps by following the received wisdom of how you're supposed to do that. There are lots of unstated assumptions in the article in this vein. One such assumption is that you're going to do things that way. Another assumption is that the arguments for doing things that way and the plight of the tech professionals doing the doing are universally recognized and accepted.

From the Web-theoretic perspective—that is, following the original use cases that the Web was created to address—if that resource is so important to your organization, then you can mint your own identifier for it under your own authority.

Ultimately, I don't have a lot of sympathy for the plight described in the article. It's fair to say that the instances where this sort of thing shows up involve abusing the fundamental mechanism of the Web to do things that are, although widely accepted by contemporaries as standard practice, totally counter to its spirit.


I understand what you're saying, but I fundamentally disagree with you.

The issue is that you're immediately deciding that domain X and domain Y are different entities.

In practice, I find that there are a HUGE number of use cases where two domains are actually the same organization, or two organizations that are collaborating.

There is basically no way to say to the browser - I am "X" and my friends are "Y" and "Z", they should have permissions to do things that I don't allow "A", "B", and "C" to do.

---

We actually have a functioning standard for this on mobile: both iOS and Android support /.well-known paths for manifests that allow apps to couple more tightly with sites that list them as "well-known" (aka friends).

The browser support for this kind of thing is basically non-existent, though, and it's maddening. SameSite would have been a PERFECT use case. We're already doing shit like preflight requests, why not just fetch a standard manifest and treat whitelisted sites as "SameSite=None" and everything else as "SameSite=Lax"? Instead orgs were forced into a binary choice of None (forfeiting the new CSRF protections) or Lax (forfeiting cross-site sharing).
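
For context, this is roughly what the Android flavor of that standard looks like: a Digital Asset Links file served at /.well-known/assetlinks.json, declaring which apps the site trusts. The package name and fingerprint here are made up:

```json
[
  {
    "relation": ["delegate_permission/common.handle_all_urls"],
    "target": {
      "namespace": "android_app",
      "package_name": "com.example.friendapp",
      "sha256_cert_fingerprints": ["AA:BB:CC:..."]
    }
  }
]
```

iOS does the equivalent with an apple-app-site-association file. There's no analogous "these origins are my friends" manifest that browsers consult.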


> There is basically no way to say to the browser - I am "X" and my friends are "Y" and "Z", they should have permissions to do things that I don't allow "A", "B", and "C" to do.

Isn't that CORS? That sounds like CORS.


CORS can fully stop an origin from accessing an HTTP resource. This discussion is about allowing a different origin to reuse the browser's local state (like cookies, the HTTP cache, and local storage).

CORS is a stronger security layer in the browser than cache segregation; some people want to keep the CORS security model but weaken cache segregation.
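
To make the distinction concrete: CORS is a server at one origin opting in to letting another origin read its responses, nothing more. A response like this (origins illustrative) is the whole mechanism:

```http
HTTP/1.1 200 OK
Access-Control-Allow-Origin: https://y.example
Access-Control-Allow-Credentials: true
```

Nothing in that exchange lets y.example share the serving origin's cache partition or storage; it only governs whether y.example's scripts can read the response bytes.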


> The issue is that you're immediately deciding that domain X and domain Y are different entities.

That's... literally what domain means.

"A domain name is an identification string that defines a realm of administrative autonomy, authority or control within the Internet." - Wikipedia

The entire security policy of the Internet is built on this definition. It's not an assumption. It's a core mechanism.


wrong again... It's very clear "*A* realm". Not "The realm". Certainly not "The only realm".

Entities are allowed to control many assets.

Simplest possible case in the wild for you, since you're being obtuse.

I'm company "A". I just bought company "B". Now fucking what?


> but I fundamentally disagree with you ["We actually have a functioning standard for this on mobile"]

I wouldn't expect any less.

https://quoteinvestigator.com/2017/11/30/salary/

> In practice [...]

There's your problem. Try fixing that.


Moving everything under a single domain makes no sense for the use-case of cross-organization sharing though. The domain as the root identity on the web is just broken and there's no way to make it work.

    mysharedhosting.com/customer1
    mysharedhosting.com/customer2
Two separate identities. Is it even possible to let the browser know that they should be segmented? Nope.

    customer1.mysharedhosting.com
    customer2.mysharedhosting.com
How about this? No again, not without getting yourself on the Public Suffix List, which is just a hack.

    mycompany.com
    joinmycompany.com
Can you tell the browser that these are actually the same and shouldn't be segmented? Nope!


This makes more sense and doesn’t dilute your brand, improves SEO, and helps protect against phishing by making your customers and staff look for a single known-good domain:

   https://mycompany.com/join


Your take is wrong.

Having a domain is identical to having a driver's license: This org says I am "X".

It is fundamentally different from uniquely identifying me.

I am still the same person if I give you my library card - a different ID from a different org that says I am "Y".


Nah, the problem is definitely on your end.

> Having a domain is identical to having a driver's license: This org says I am "X".

Nope. You just described two different documents.

http://horsawlarway.example/drivers-license.html

http://horsawlarway.example/library-card.rdf

It's very odd the position you're taking here, given the sentiment in your other tirade about being digital sharecroppers to Facebook and Google <https://news.ycombinator.com/item?id=27369652>. Your proposed solution is dumping more fuel in their engines—which is why it's the kind of solution they prefer themselves—and completely at odds with e.g. Solid and other attempts to do things that would actually empower and protect individual users. I'm interacting with your digital homestead, why are you so adamant about leaking my activity to another domain?


If you actually search a bit on the internet, you'll find Google actually made an RFC for a standard allowing websites to list other domains that should be considered as same origin. Look for the comments on it by the IETF and you'll understand why this is a terrible idea.


As somebody directly involved in this space, that's a pretty bad summary. The spec for first party sets (not an RFC, just a draft) isn't in a great state. Google is going to implement it anyway, Microsoft is supporting, Apple basically said the spec was crap but they might be interested in a good version of it, and Firefox said they didn't like it.

Speaking as an engineer: the Firefox folks don't really get it. You can't just break what sites like StackOverflow and Wikipedia have been doing for years (and in some cases decades) and then say "you were doing the wrong thing." Some version of FPS will ship in browsers, probably in the next 2 years.

Quoting Apple's position directly "[...] Given these issues, I don’t think we’d implement the proposal in its current state. That said, we’re very interested in this area, and indeed, John Wilander [Safari lead] proposed a form of this idea before Mike West’s [Google] later re-proposal. If these issues were addressed in a satisfactory way, I think we’d be very interested. [...]"

Also it was a W3C TAG review. The W3C and IETF are different organizations.
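
For reference: as I recall the early first-party sets draft (the details have been in flux, so take this as a sketch), each member domain would serve a manifest at /.well-known/first-party-set looking roughly like this, with made-up domains:

```json
{
  "owner": "https://mycompany.example",
  "members": [
    "https://joinmycompany.example",
    "https://mycompany-cdn.example"
  ]
}
```

The browser only honors the set if the owner and every member mutually list each other.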


> RFC for a standard allowing websites to list other domains that should be considered as same origin

No, they allowed an origin to list other origins whose cookies would be sent back to the serving origin correctly even if they were iframes loaded in the parent origin DOM.

I.e. this is the expected behavior for iframes until Safari decided that there was such a thing as "third party" origins whose web semantics could be broken in their war against advertising.

Google is trying to (partially) restore the expected behavior of iframes so that named origins get their own cookies sent to them, which is how things worked for the first two decades of the web.


Why don't you search a bit and come back with a link?

Because I can't comment on an RFC I haven't seen, and a quick google search of my own based on your comment turns up nada.

That said - I'm fully aware of the downsides of this approach, but I want my browser to be (to put it crudely) MY FUCKING USER AGENT. I want to be able to allow sharing by default in most cases, and I want a little dropdown menu that shows me the domains a site has listed as friendly/same-entity, and I want a checkbox I can uncheck for each of them.

Then I want an extension API to allow someone else to do the unchecking for me, based on whether the domain is highly correlated with tracking (Google analytics, Segment, Heap, Braze, etc)

-------------

The way I see it, the road to hell is paved with good intentions. If the web was developed in our current climate of security/privacy focus, how likely is it that even a fucking <a href=[3rd party]> would be allowed? Because I see us driving to a spot where this is verboten. Which also happens to be the final nail in the coffin for any sort of real open platform.

Welcome to the world where the web is literally subdomains of facebook/google. What a fucking trash place to be.


Hum... The resource is not that important to me. I'm just the author of the current page, and am letting the browser know that users are very likely to want that resource.

You are attributing a lot of intention to a mechanism. You don't know if it's a 3rd party tracker or the news link in a discussion page.

The proposal in the article is actually quite good, since I should always know very well whether it will load in a frame or as a link.


> You are attributing a lot of intention to a mechanism. You don't know if it's a [...] or [...]

Interestingly, I think this remark is a signal that you've read something out of my comment that's just not there (and thus attributing a lot more intention to me than you should).

> The resource is not that important to me.

These are a class of resources that are important enough that folks would pause what they are doing to try and deliberately mark it with a magic incantation that they expect will cause the user agent to do something, notice that it doesn't do the thing that they want it to do, and then go and either write a blog post to complain about it, or throw their support behind someone else's complaints about it. The argument that you don't find it particularly important is pretty much self-defeating.


I could be wrong but it seems to me that any cross-domain prefetch that uses the “document” option from the article is potentially privacy-violating and can reintroduce the same leaks that necessitated the original segregation.

A.test prefetches b.test/visited_a.js, b.test/unique_id.js, and log(n) URLs that bisect unique_id.js so that you can search the cache for the unique id.
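
The probe described above might look something like this on a.test (hostnames and the bisection paths are made up, and whether it works at all depends on how the browser partitions "document" prefetches):

```html
<!-- Hypothetical probe page on a.test. Each prefetch lands in
     b.test's cache partition, where b.test can later check which
     entries exist. -->
<link rel="prefetch" href="https://b.test/visited_a.js">
<!-- Binary-search probes: b.test recovers the unique id bit by bit
     by seeing which of the log(n) bisection URLs are cached. -->
<link rel="prefetch" href="https://b.test/id/ge_32768.js">
<link rel="prefetch" href="https://b.test/id/ge_16384.js">
```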

Have to be careful to balance performance and “this is useful to me” with abuse prevention at scale. It’s also important to realize we have to tread carefully with browser features that seem useful as the graveyard of deprecated features that didn’t survive privacy attacks is quite large.


What's the privacy violation risk here? A doesn't learn anything through those prefetches about whether or not the user has previously cached any B resources (because A never sees the timing or result of its prefetch requests - it doesn't even know if the browser attempted them).

B obviously knows what resources A prefetched because they were requested from B in the first place. And if A wants to pass information to B, they don't need to do a complex prefetch dance, they can just load an img src.

So I don't see any way for A or B to learn anything about the user's behavior on one another's site without the other site's cooperation?


Well, if so it's a problem, because the article says that Chrome and Safari handle prefetch exactly that way.

But I don't see that problem. In this case the a.test domain cannot see what is in the cache; only b.test sees it. (At least by what I understood.)


Yeah, his slideshow example doesn't show the problem. Unless each slide was on its own domain, this isn't a problem. It matters for things like Google Fonts, but very few folks have multiple domains that share enough of the same assets for this to matter in practice.
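
For concreteness, here's a rough sketch of what the segregation does (sites and URLs made up): the cache is keyed by (top-level site, URL) instead of URL alone, so a shared font fetched under two different sites is fetched and stored twice.

```python
# Minimal sketch of a partitioned HTTP cache: entries are keyed by
# (top-level site, resource URL) instead of resource URL alone.
class PartitionedCache:
    def __init__(self):
        self.entries = {}

    def put(self, top_level_site, url, body):
        # Store the response under the partition of the visited site.
        self.entries[(top_level_site, url)] = body

    def get(self, top_level_site, url):
        # A hit requires the same top-level site AND the same URL.
        return self.entries.get((top_level_site, url))

cache = PartitionedCache()
cache.put("a.test", "https://fonts.example/roboto.woff2", b"...")

# Same font URL visited from a different top-level site: a miss, by design.
assert cache.get("b.test", "https://fonts.example/roboto.woff2") is None
assert cache.get("a.test", "https://fonts.example/roboto.woff2") == b"..."
```

That's why the shared-assets win only evaporates when the same asset is used across distinct top-level sites, as the comment says.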


Question: Does Google use Google Fonts to track users across the web?

Google's FAQ [1] says that it only collects the information needed to serve fonts, but it says the generic Google privacy policy applies. The Google Privacy Policy allows Google to use any information it collects for advertising purposes.

While Google also states that requests do not contain cookies, Google Chrome will automatically send a high-entropy [3], persistent identifier on all requests to Google properties, and this cannot be disabled (X-Client-Data) [2]. Google can use this X-Client-Data header, combined with the user agent's IP address, to uniquely identify each Chrome user, without cookies.

So, perhaps the privacy statement is more of a sneakily worded non-denial?

[1]: https://developers.google.com/fonts/faq?hl=en#what_does_usin...

[2]: https://github.com/w3ctag/design-reviews/issues/467#issuecom...

[3]: A sample: `X-client-data: CIS2yQEIprbJAZjBtskBCKmdygEI8J/KAQjLrsoBCL2wygEI97TKAQiVtcoBCO21ygEYq6TKARjWscoB` - looks very high entropy to me!


> Google Chrome will automatically send a high-entropy [3], persistent identifier on all requests to Google properties, and this cannot be disabled (X-client-data) [2].

X-Client-Data indicates which experiment variations are active in Chrome:

Additionally, a subset of low entropy variations are included in network requests sent to Google. The combined state of these variations is non-identifying, since it is based on a 13-bit low entropy value (see above). These are transmitted using the "X-Client-Data" HTTP header, which contains a list of active variations. On Android, this header may include a limited set of external server-side experiments, which may affect the Chrome installation. This header is used to evaluate the effect on Google servers - for example, a networking change may affect YouTube video load speed or an Omnibox ranking update may result in more helpful Google Search results. -- https://www.google.com/chrome/privacy/whitepaper.html#variat...

Google doesn't use fingerprinting for ad targeting, though, as with IP, UA, etc., it does receive the information it would need if it were going to. I don't see a way Google could demonstrate this publicly except an audit (which would show that X-Client-Data is only used for the evaluation of Chrome variations).

(Disclosure: I work on ads at Google, speaking only for myself)


Thanks for the informative answer. I still have trust in engineers and assume truth and good faith, so that is comforting to know.


You could always ask someone who works on Google Fonts. I did just that. The answer is they don't use the logs for much apart from counting how many people use each font to draw pretty graphs.

Doesn't mean that won't change in the future though. But log retention is only a matter of days, so they can't retrospectively change what they do to invade your privacy.


I find myself wondering whether Google’s front end implements a fully generic tracker: collect source address and headers and forward it to an analytics system. The developers involved in each individual Google property behind the front end might not even know it’s there. Correlating the headers with the set of URLs hit and their timing might give quite a lot of information about the pages being visited.

I hope Google doesn’t do this, but I would not be entirely surprised if they did.


Unless it's regularly verified by a trusted third party, such as a government agency, I wouldn't trust them not to. After all: we're talking about a corporation that lives off the data it gathers about people using their services and products.


If the frontend had a fully generic tracker, teams wouldn't need to set up their own logging and stats systems... Which they do...


I think they would in any case. My impression is that data is siloed internally at Google, and that data sharing between departments would be way more complex than just setting up some (possibly redundant) logging.


I spent ten seconds thinking about the logistics of adding logging to the frontends, and...

Well, obviously I can't say for sure they don't have any. I didn't look it up, and if I had I wouldn't be able to tell you. But since I didn't, I can tell you that the concept seems completely infeasible. There's too much traffic, and nowhere to put them.

Besides that, not everything is legal to log. The frontends don't know what they're seeing, though; they're generic reverse proxies. So...


> completely infeasible. There's too much traffic, and nowhere to put them

If there’s one company in the world for whom bandwidth and storage are not an issue, it’s Google.


It sounds so easy to make, yet so useful, that I can't see how they wouldn't do that. Deontology has been thrown out Google's window a long time ago.


I just went for the easy solution and disabled web fonts. Comes with the drawback that many site UIs are now at least partially broken (especially since some developers had the bright idea to use fonts for UI icons), though flashier sites tend to come with less interesting content anyway.

But as it stands I don't want to trust Google, Facebook etc. more than absolutely necessary. They have lost every right to that a long time ago and are incentivized by their business model to not change anything.


So download your fonts off Google and serve them from your own domain.


Yes, sorry, an image is a bad example. The main issue is with HTML documents. You might open one at the top level, by clicking on it and navigating to it, or you might open one as a child of the current page, by putting it in an iframe. Since they can be opened in both contexts, prefetch doesn't know what to do.
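
If I'm reading the article's proposal right, the fix is to let the page author disambiguate up front. Something along these lines (hedged sketch; the URL is made up and the exact syntax is the article's, which I'm paraphrasing):

```html
<!-- Intended as a top-level navigation: cache it in the
     destination site's partition. -->
<link rel="prefetch" href="https://other.example/article.html" as="document">
```

With the intent declared, the browser knows which cache partition the prefetched document belongs to.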


But is it right that the issue is fetching resources from a different domain than the current one? As a user, just because I've connected to domain A, it doesn't mean I necessarily want my computer to connect to any domain B that A links to. Also, I'd rather developers focus on making small pages that are easier to fetch on demand, and am worried that they'll use prefetch to justify bloated pages. If a page is large enough to need prefetch, then I might not want to spend the data pre-fetching it especially if the click probability is not very high. Between all of these, I'm not convinced of the need for cross-domain prefetch.

Apologies that I'm not a front-end person so this may be naive, but it would be great to hear your thoughts!


Yes, this is only an issue for cross domain prefetch.

With HTML resources, the goal of prefetch is typically not to get a head start on loading enormous amounts of data, but instead to knock a link off of the critical path. The HTML typically references many different resources (JS, CSS, images, etc) and, if the HTML was successfully prefetched, when the browser starts trying to load the page for real it then can kick off the requests for those resources immediately.


Makes sense, thanks for the reply!


It would be full document prefetches. Would be useful for sites like Reddit or Google News. Also for things like Okta application list page.


It’s not an issue most of the time, but I do agree that it would be nice to have a fix.


Most, or at least a lot, of the prefetching is for third party libraries (think jQuery, Google Fonts, Facebook Pixel, etc). There’s a general speed advantage for users caching commonly used libraries and fonts across sites. Nonetheless I believe prefetch will still have a speed advantage even when the cache is segregated.



