Azure Active Directory down (twitter.com/azuresupport)
91 points by Decabytes on March 15, 2021 | hide | past | favorite | 44 comments


As someone working at a company that pays for Azure support, it's pretty sad that Microsoft 365's status Twitter account has been providing more insight than Azure's own status page.

https://twitter.com/MSFT365Status/status/1371554704518352896

They actually acknowledged it was due to a "recent change to an authentication system", whereas Azure Support just vaguely mentions an outage.


I have zero expectations of Azure support as a whole; in fact, 80% of the time I'll rebuild production services in Azure just to avoid having them tell me to do exactly that in the ticket. The level of actual support is abysmal.


It was interesting watching this happen today at a large Fortune xx that is a year or so into a wide-scale Azure rollout.

It took out Outlook, Teams, and all meetings over the course of 5 minutes, then moved on to internal service alarms going off across multiple (10+) major customer-facing digital products that got vaporized, presumably due to various dependencies on AD. I think some of this was service accounts on the backend losing their connections to various DBs, file systems, etc.

What a total shit show.

Company basically down from 2pm CST and barely coming back up 5 hours later.

IT Buddy who is 2 people away from CIO pinged me and said “DNS issue, data center power outage or Azure just crapped its pants.”


It is a huge pet peeve of mine that,

a. the status page update took as long as it did.

b. the status page still only claims that AAD is down. You're not "up" if a required dependency of yours is down, IMO, and most Azure services right now are severely degraded to the point of being unusable due to the AAD outage. Yes, that would make the status page look bad. (And I would clearly message that the other outages are fallout from AAD.)

c. I don't have any real expectation that we'll get a public PM (postmortem) about this. (And a PM must include sufficient information for me to understand what went wrong, and what's been done to prevent it going forward.)
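Point (b) is essentially a dependency rule a status page could compute mechanically. A toy sketch of that rule (hypothetical statuses and names, not Azure's actual status model):

```python
def effective_status(own_up: bool, deps_up: dict) -> str:
    """A service is only 'up' if it AND all required dependencies are up."""
    if not own_up:
        return "down"
    if all(deps_up.values()):
        return "up"
    return "degraded (dependency outage)"

# A hypothetical service that is itself healthy, but depends on AAD:
print(effective_status(True, {"AAD": False}))  # degraded (dependency outage)
print(effective_status(True, {"AAD": True}))   # up
```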

I had been starting to wonder what sort of incident it would take for them to actually update the status page. I guess this is it. We've witnessed a number of outages in various services over the past few months, none of which made it to the status page. I pushed a support rep on the issue, and was told that they don't want to cause a panic.


I'm also here because my work is blocked by the inability to do anything on Azure right now.

I looked at their past status histories, and they have provided pretty detailed postmortems for much more minor incidents. I don't have any reason to believe that they won't do that this time. I'm new to Azure, so this kind of outage really looks bad in my mind, but it's par for the course with Cloud stuff. They'll fix it and there will be one less thing to break in the future.


As someone who has been using Azure for 6 months, it's fairly bad but most of the Azure pain is more death by 1000 cuts than large explosions.

Documentation that contains triple negatives, UI elements that have no consistency (clicking off a modal on one page does nothing while doing so on another closes it and clears your data), widgets and modals forgetting state when you navigate to different submenus.

The sort of unexciting stuff that is fine by itself but slowly makes you descend into madness.


Just wait until you see their made up rules for using their Redis implementation. You have to roll your own Redis to do anything interesting.


Their managed PostgreSQL offering also has fun quirks, such as that you have to connect using `username@db_name`, but your actual username in the database is still just `username`, so any third party software needs to support using a different username to connect than it uses to perform user-related queries (they have some sort of application-level load balancer in front of it that uses `db_name` to route the connection).
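A minimal sketch of that mismatch (hypothetical names): the login you hand the driver differs from the role the database actually knows about, because the proxy routes on the part after the `@`.

```python
# Illustration of the dual-username quirk described above (names made up).
# The gateway reportedly routes on the suffix after '@' in the login name,
# while the role inside Postgres is still just the bare username.

def gateway_login(user: str, db_name: str) -> str:
    """Username to put in the connection string (what the proxy parses)."""
    return f"{user}@{db_name}"

def database_role(login: str) -> str:
    """Username to use for user-related queries inside the database."""
    return login.split("@", 1)[0]

login = gateway_login("app_user", "mydb")
print(login)                 # app_user@mydb -> goes in the DSN
print(database_role(login))  # app_user     -> what the catalog knows
```

Third-party software that assumes these two strings are identical is exactly what breaks.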


We were well into implementation of a SaaS that relied on PG and used their cloud offering before we found that the latency was just abysmal.

The best we could get was ~45ms using pgbench, vs ~10ms locally and ~18ms on AWS Aurora.

Their solution: wait for the Flexible Server offering, which is not available in AUS; when tested in another region it showed ~30ms, so it was still not fit for purpose.

We eventually had to switch to running PG in AKS which has its pros and cons.

The funniest thing was that an Azure VM talking to AWS PG was faster than an Azure VM talking to Azure PG.


Oh that solution is a garbage fire.

We hit a 4TB storage limit they hadn't told us about, when the documentation said 16TB. They told us we were on the older storage tier and that we needed to rebuild a huge analytics database on our own time, because they couldn't migrate us to the newer backend storage tier. So we had a database server stuck in read-only mode until we could migrate off of it.

Then we had random restarts and outages with their proxy that handles all of the connections. Which is why you actually need the username@db-name, because without the @db-name it doesn't know where to route you, because the postgres connection string is actually just mapping to one IP that they manage for multiple databases!

Fun trick is you can make your url anything that maps to the IP they use for their proxy and it'll take you to whatever the @db-name is in your connection string.

Their solution to the random connections dropping for minutes at a time was to "implement retry logic" on our applications.
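That suggested workaround boils down to a retry wrapper with backoff. A generic sketch (this is my own sketch, not Microsoft's actual guidance; the flaky function here is a stand-in for a real query):

```python
import time

def with_retries(fn, attempts=5, base_delay=0.1):
    """Call fn(), retrying on connection errors with exponential backoff.
    This is the kind of band-aid 'implement retry logic' amounts to."""
    for i in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if i == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** i)

# Usage with a stand-in for a flaky query: fails twice, then succeeds.
calls = {"n": 0}
def flaky_query():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("proxy dropped the connection")
    return "rows"

assert with_retries(flaky_query, base_delay=0.01) == "rows"
assert calls["n"] == 3  # two failures absorbed before success
```

Of course, retries don't help when the proxy drops connections for minutes at a time.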

Our solution was to migrate away from it.


This is ... something - wow - they couldn't just make the actual username "username@db_name"?


Nope. Their load balancer seems to strip the `@db_name` from connections, and anyway some software doesn't like having an @ in Postgres usernames (which is probably also a sign it's vulnerable to SQL injection).


I've been an active user of Azure for almost five or six years now. Things were a lot rougher back then (especially if you'd seen the old portal) than they are now.

People always talk about the cloud as an entity that can't/shouldn't have problems at all, when it's really more about convenience than anything else. So I wouldn't worry too much about this if I were you.


I don't particularly expect the cloud to be problem free. I do expect that when an issue is caused by an outage or a service problem, it's possible to get it escalated to actual engineers. That has not been easy, IMO. Even obvious stuff, like "we sent X request around Y time and got Z 500 Internal Server Error from service S", has required multiple round-trips.¹

Messaging around transient issues helps. E.g., AWS would often email us to tell us "hey, sorry, such & such VM is being hosed by the underlying hardware, it is migrating but you'll see a forced reboot". That's all I need to know: I expect some interruption to that portion of the system, and I expect to see it heal within a certain time frame, and if it does, great, no support ticket required.

Azure sort of has that with "Resource health"; but, e.g., one of our recent support tickets is that we had a VM reboot unexpectedly due to a resource health event with its disks. And then, a few hours later, another VM, same resource health event, same issue, same reboot. And then again, a few hours later. And that pattern, with no further communication, requires me to write the (obvious) ticket of, "So… what's happening?"

(That ticket ended with such and such service was experiencing an incident. Never made the status page. It did make the internal "Service alerts" page.)

Part of issuing a PM is also to help show & convince me, the customer, that I won't be filing support issues for the rest of my life. But these last 6 months have just felt like support ticket after support ticket, and it's sort of depressing, since I'd rather be coding. I do think that, last spring or so, it was not nearly this bad.

¹Every place I've worked at… we page on 5xx codes sent to a client. If Azure is internally doing that, it sure is hard to tell from where I sit.


Bad status pages are a very bad sign in my opinion.

I had to use the IBM cloud for a project once upon a time and the status monitoring was almost always wrong ...

I really hope that as customers we can succeed in making "excellent status reporting" an absolute minimum requirement for cloud services. Observing the quality of the status page reporting history is one of the first things I do now when evaluating a new service -- and no I'm not just looking for a page that shows all greens -- I'm looking for a page that connects to some conceivably meaningful metric as well as detailed reports about major and minor outages ... something to prove that they actually even know if their service is working at a given point in time ...

The IBM cloud page was a page that tended to update in response to my own issue reports (usually if I also pointed out that nothing about the outage appeared on the status page ...). I don't think that's at all an acceptable standard ...


Are status pages the right answer though? They seem to be a failure all around, so let's find a better solution.


Honest question for the sake of discussion/learning: What impact would it have on you if (1) the status page was updated instantly and (2) All impacted capabilities were marked as down/degraded?


This is affecting Teams, Azure DevOps, and likely more services.

It turned the whole afternoon into learning time at our company. Thankfully our Okta integration goes through our on prem AD servers and not purely AAD. Otherwise I wouldn't be able to get to learning resources which authenticate through AD!


It seems to be coming back online now.

Strange that HN was down at exactly the same time, although this is reported to be unrelated.

> CURRENT STATUS: Engineering teams are currently rolling out mitigation worldwide. Customers should begin seeing recovery at this time, with full mitigation expected within 60 minutes. This message was last updated at 21:12 UTC on 15 March 2021


I was seeing:

DX10501: Signature validation failed. Unable to match keys: kid: '[PII is hidden]', token: '[PII is hidden]'.

and sure enough, it seems it was some authentication thing going sideways
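That "Unable to match keys" means the token's signing-key ID (the `kid` in its header) couldn't be found in the published key set. A minimal sketch of that lookup step (toy key set, not the real AAD keys):

```python
def find_signing_key(jwks: dict, kid: str):
    """Return the JWK whose 'kid' matches the token header, or None.
    A None here is the condition that surfaces as 'Unable to match keys'."""
    for key in jwks.get("keys", []):
        if key.get("kid") == kid:
            return key
    return None

# Toy key set; real ones come from the issuer's JWKS endpoint.
jwks = {"keys": [{"kid": "key-2021-03", "kty": "RSA"}]}
print(find_signing_key(jwks, "key-2021-03"))  # found -> can verify signature
print(find_signing_key(jwks, "rotated-away"))  # None -> DX10501-style failure
```

A botched key rollout on the issuer side produces exactly this symptom for every relying service at once.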


What's crazy is that apparently there's a "fallback" Azure AD authentication endpoint that is a delayed read-only replica of the production endpoint with a subset of the features. During the last global AAD outage, they claimed that they learned their lesson and improved the failover to the fallback endpoint. That didn't seem to happen...

What is more crazy is that low level service-to-service authentication in Azure is primarily based on Azure AD. This includes Key Vault and Storage Accounts. Both were mentioned in the Microsoft notification page as affected by this outage.

How can anyone build a reliable service on top of Azure if what is essentially the failure of the "Office 365 login" also breaks your VM's access to its SSL certificates (or whatever)?


I may be naive, but it is crazy to me that we experience outages of this magnitude from any major supplier (GOOG/MS/AMZN). I get a lot of crap for being crotchety about keeping my backups both in and out of the cloud, and it's stuff like this that proves my point!


"The more they overthink the plumbing, the easier it is to stop up the drain." - Scotty

Trying to provide a globally available, replicated service that meets each and every need 24/7/365 makes Active Directory about 600x more complicated than it would be otherwise. It's basically impossible for a service that complicated to meet the uptime of... installing Windows Server 2019 on a VM and patching it monthly. (Bear in mind, if I know when my business is not affected by an outage, I can do that without being disruptive. Microsoft, by definition, cannot. Every moment is critical, since their customers are everywhere.)

There are types of businesses to which the former might be a necessary solution, but most would be better off with the latter. It'll be really interesting when people start realizing the cloud is mostly just a scam to get people on subscription revenue streams, and not actually providing any greater reliability or less management overhead than what they had before.


Plus, with a massive customer base, eating their own dog food makes their own schedule take priority over other customers'. Your priority is lower than theirs.

Microsoft is always a shitshow this time of year. Infrastructure changes usually land ahead of spring releases.


Indeed, no matter how much we pay Microsoft, they will not care about our business as we care about our business.


Though I'd also rather a company eat their own dog food for something like this, so personally that's a tough one. I wouldn't want to find out Microsoft's internal infrastructure is all on AWS.


I agree!


You know, in some ways I'm actually glad that Azure AD is down: I ran a SQL update and right after it I started seeing errors and thought I'd done something wrong... I've been shitting myself for an hour :D


Teams completely borked because of this. My employer hugely depends on this as we all WFH, and basically has stopped all communication outside of email (thank god we are on-prem).


I run a saas company and we use azure ad. B2B sales. Good thing is we are down but so are they. This is strangely not that bad. Left the office early today


Azure AD instability has been a big pain point for us for a while. We've had more of these outages than I care to remember. On the one hand, the ability to have one central auth between on-prem and other services is great, but every time this happens so many services just stop working. Microsoft really needs to fix their stability issues.


This should be a huge shitshow for Microsoft. This never should have happened.

Azure AD taking down pretty much the entirety of Azure / O365 / Teams is frankly inexcusable. Astounding incompetence, and an astonishing single point of failure that needs to be re-architected.


In fairness to Microsoft, there's no way to make authentication not a single point of failure.

You have to make the service as resilient as possible, which clearly they failed at, but you can't very well fail over from your AD service to something else.


> In fairness to Microsoft, there's no way to make authentication not a single point of failure.

Huh? x509-type authentication works without a single point of failure. I can authenticate myself to any client via public-key cryptography without any external dependencies, after we've agreed on a root of trust.

Active Directory chose a solution that requires centralized availability. They didn't have to do it this way (though it does make certain administrative tasks, like revocation, simpler).

Note that even AD itself does not require a single point of failure, though! You can define secondary (i.e. failover) DCs (usually via DNS). Then DNS becomes your point of failure -- and again, most DNS implementations support failover, such that any single point of failure gets pushed lower in the stack (e.g. NICs), which also support failover or multipathing, etc., if you really need high availability.

Anyway, it's totally possible to make authentication not be a single point of failure. Of course, you have to make some tradeoff to do it (see: the CAP theorem).
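The failover-DC pattern described above amounts to trying a preference-ordered list of controllers and only failing if all of them do. A sketch with stubs (a real client would do a Kerberos/LDAP bind, not this toy function):

```python
def authenticate(dcs, bind):
    """Try each domain controller in preference order; fail only if all fail."""
    last_err = None
    for dc in dcs:
        try:
            return bind(dc)
        except ConnectionError as err:
            last_err = err
    raise last_err or ConnectionError("no controllers configured")

# Stub bind: the primary is down, the secondary answers.
def bind(dc):
    if dc == "dc1.example.com":
        raise ConnectionError("primary unreachable")
    return f"ticket-from-{dc}"

print(authenticate(["dc1.example.com", "dc2.example.com"], bind))
# ticket-from-dc2.example.com
```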


My point was poorly stated, based on one line of the GP comment:

> Azure AD taking down pretty much the entirety of Azure / O365 / Teams is frankly inexcusable

My objection was that when the Microsoft Azure AD service is down, there's nothing Microsoft can do to protect the services that rely on it.

Setting aside your valid x509 counter-argument, the remaining options you listed to increase availability, to me, have to do with resilience of the Azure AD service. They're not alternative services.

It depends on where you draw the boundary around the service. Again, clearly Microsoft has failed to provide sufficient resiliency, not arguing against that at all. But once the service is down, practically by definition Teams and Exchange and everything else is doomed.


Microsoft is rolling back some update they pushed that borked Azure AD.

Early signs are not pointing to this as a service resilience issue - this appears to be sheer incompetence - pushing an update that was likely not tested well enough that broke pretty much everything. More than that, why aren't updates to Azure AD being rolled out regionally? Why is Azure AD architected in such a way that the entire thing going down can do so much damage?

It's pretty hard to be understanding with a preventable screwup with this kind of global scale.


You are making a whole bunch of assumptions here without ANY info about how Microsoft rolls out updates to its Azure services.


To add another datapoint, we recently faced an issue where Microsoft rolled out a change to their OIDC endpoints, which broke our client's tenant, but not other tenants. (It was crashing with an HTTP 500 if the Accept request header was a certain value: the default value in OpenJDK 8).

Some days later and after our complaints, Microsoft rolled back the change.

I would call this evidence that they do practice gradual deployments of changes.

I would guess that this outage was caused by something more complex than MS not slow-rolling their deployment.


Sure, but those assumptions are largely backed up by a single update to Azure AD knocking it out worldwide. End of the day, whatever they pushed was not nearly well tested enough, and they paid for it with a massive global outage that borked pretty much everything they sell.


For something as critical as auth, I would expect most changes to be rolled out over a few days just to make sure something like this cannot happen to everyone. Ideally a change would be shadowed for a few % of traffic before it was allowed to start rolling out.
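A stable-hash bucket is one common way to shadow a change onto a few percent of traffic before a wider rollout. A generic sketch (my own illustration; nothing here reflects how Microsoft actually deploys):

```python
import hashlib

def in_rollout(tenant_id: str, percent: int) -> bool:
    """Deterministically place a tenant into the first `percent` of 100
    hash buckets, so the same tenants see the change on every request."""
    digest = hashlib.sha256(tenant_id.encode()).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100
    return bucket < percent

print(in_rollout("contoso", 100))  # True  -> full rollout
print(in_rollout("contoso", 0))    # False -> change disabled
# The canary slice is stable: the same tenant lands in the same bucket
# every time, so a bad change burns a small, consistent set of tenants.
```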


They could quite easily have progressive migrations, blue/green deployments, or a failover environment..


True. I guess it’s semantics whether to call that resiliency within a service or failover to another service.


cloud=a single point of failure for the whole world!


Every time someone tells me we should move to the cloud, it seems like there's an outage the same day. Microsoft isn't actually any good at managing your IT infrastructure for you.

Between Exchange 365 outages and Azure AD outages, I'm not sure how anyone thinks moving Windows infrastructure to the cloud is not an existential risk to business operations.



