+1, however, from what I read, the vulnerability can only be exploited if the attacker has network access to the salt master's port, which should never occur. The people that got compromised had Salt exposed to the Internet, which is obviously ridiculous.
Not trying to downplay the critical nature of the vulnerability, but the ones that were compromised by this issue have deeper security issues to deal with.
> has network access to the salt master's port, which should never occur
You seem to subscribe to the "hard shell, soft gooey center" network security philosophy. Should people expose an Oracle server to the internet? Absolutely not. Does moving it behind a firewall change the fact that every mildly skilled exploit developer is sitting on an Oracle 0day? Absolutely not.
People have legitimate reasons for exposing Salt to the internet. I do. It's how I bootstrap random VMs and bare metal from the internet. But in my case the attack was mitigated by the fact that Salt cascades changes in a bunch of other systems and re-masters minions to a host only reachable over a tunnel. I blew away the internet master, restored from a backup, and patched.
> the ones that were compromised by this issue have deeper security issues to deal with
Or it was just another Monday. When you become sufficiently large you deal with incidents on a daily basis. Kudos to the people who publicly postmortem and talk about what went well and what didn't.
(For the record, I've already been working for a few months on a move to Ansible for non-security reasons)
> People have legitimate reasons for exposing Salt to the internet. I do. It's how I bootstrap random VMs and bare metal from the internet.
I question whether that is a legitimate reason to expose it to the internet.
Defense in depth is a thing, and putting the keys to the kingdom at layer 0 doesn't seem wise even if a VPN or bastion doesn't offer perfect protection.
Read the sentence after the ones you quoted. The internet connected salt master is used to provision accepted hosts in to the tunneled (VPN) network where the real master lives.
Twice I encountered breaking changes between versions that required manually upgrading minions. I also got the overall feeling Salt was built by developers, Ansible by sysadmins - and I fit into the latter bucket.
Ansible (originally known as "Fedora Unified Network Controller" or "func") was made to solve the problems of automating Fedora Infrastructure.
Puppet did not make Fedora Infrastructure administrators happy. So func was designed around solving their problems, and expanded its scope as people found it useful. Then it was renamed to Ansible, the developers left Red Hat to create AnsibleWorks, and the rest is history!
That is exactly what the internet connected Salt master does. It bootstraps enough control that I can get the tunnels and keys properly configured, and the other 95% takes place once it is switched to a protected Salt master.
+1, agreed, but exposing Salt to the internet is not the problem. A simple IP-whitelist ingress firewall rule on the salt master port would have helped; blocking access on this port entirely is also possible. With cloud services it has become trivial to group server resources so that resources in the same group can communicate with each other. I don't use Salt; however, I am not a proponent of network isolation as a form of security.
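For illustration, the rule could be as simple as this sketch (assuming AWS and boto3; the security group ID and CIDR are placeholders, and 4505/4506 are the salt master's publish and request ports):

    import boto3

    ec2 = boto3.client("ec2")

    # Allow the salt master ports only from networks we already know about;
    # everything else stays blocked by the security group's default deny.
    ec2.authorize_security_group_ingress(
        GroupId="sg-0123456789abcdef0",   # hypothetical security group for the salt master
        IpPermissions=[{
            "IpProtocol": "tcp",
            "FromPort": 4505,             # publish port
            "ToPort": 4506,               # request/return port
            "IpRanges": [{"CidrIp": "203.0.113.0/24",
                          "Description": "known minion networks only"}],
        }],
    )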
I was in this situation; I went with “salt master exposed to the internet” because it’s the only service on that box - if I’d wrapped it in a VPN, then I’m replacing one exposed service with a different exposed service, and VPNs aren’t immune to exploits either (plus an extra layer of configuration means an extra layer of things that can go wrong)
If they wrote software which should never be visible to the internet, they should have made that clearer.
It's far too easy to make something internet-visible. They could have set up a simple check to see whether the service is internet-facing, and refused to work if it was.
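Something along these lines, as a rough sketch of the idea (the flag and function names are made up, not anything Salt actually ships):

    import ipaddress
    import os
    import sys

    def refuse_if_internet_facing(bind_addr: str) -> None:
        """Exit unless the operator explicitly opted in to a public bind address."""
        addr = ipaddress.ip_address(bind_addr)
        # 0.0.0.0/:: (bind everywhere) or any globally routable address counts
        # as internet-visible for this check.
        if addr.is_unspecified or addr.is_global:
            if os.environ.get("I_REALLY_WANT_PUBLIC_EXPOSURE") != "1":
                sys.exit(f"refusing to bind to {bind_addr}: it looks internet-facing")

    refuse_if_internet_facing("0.0.0.0")   # would refuse by default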
SSH bastions and VPN are two standard ways to allow external clients access into an internal network, meaning salt is never exposed publicly.
I read this as a guideline that the salt master must not be exposed to the internet, though it could be better worded for a developer audience that doesn't understand bastions or VPNs well.
> one week's notice between the initial announcement and the patch coming out. The patch being released is basically a disclosure of the vulnerability
While your other points may be valid, one week should be plenty of time between announcement and patch. Any longer and I would call the timetable problematic.
That sounds like an old corporation problem. The world should not pay the price of old corporations inflexibility.
If someone hacks your system you certainly won't have a week to respond. The longer a vendor sits on a vuln, the more likely it is to leak or to be rediscovered by a malicious party.
The intruders had root access to every server in a salt deployment for who knows how long, and yet everyone is claiming there's no evidence that any data or secrets (customers' or otherwise) were exfiltrated from the network. This is a very dangerous assumption. Nobody has any idea what was run on the servers, since it seems that once the initial attack script was deployed it downloaded and executed new scripts every 60s and then removed themselves. Pretty standard C&C ops. It may have started as a mining operation, but that doesn't mean it was the only thing it was doing.
> ... and yet everyone is claiming there's no evidence that any data or secrets (customers' or otherwise) were exfiltrated from the network.
A number of people have carefully reviewed the payload that was deployed to servers, especially during what we're calling v1-v4 of the attack. (v5 onwards got more complex, but that wasn't until Monday, with variability for timezone.)
> Nobody has any idea what was run on the servers ...
Well, that's not true - there are a number of victims with useful IDS tools, including auditd, plus the review of the binaries and shell scripts deployed, etc.
Some of us also have netflow collection at the edge, and can review connections initiated from within our networks.
> ... once the initial attack script was deployed it downloaded and executed new scripts every 60s and then removed themselves.
I don't think any of us have found scripts that removed themselves. While that may sound naive, there are a few researchers who have been analysing these tools, including via large honeypot networks, and this just hasn't (at least for the first 2-3 days) been the profile of the attack.
Thankfully - and I appreciate it's very weird to say this - the initial attacks were very much vanilla cryptocurrency mining opportunism. It could have been a lot worse, and Algolia's assessment matches a lot of other independent assessments on this front.
I hope for everyone's sake that it was just a naive crypto mining operation. But given the length of time this vulnerability was available, and the extent of access it allowed, I just find it very hard to say with any certainty that we know everything that it was doing. Exploits like this get passed around in nefarious circles pretty regularly. One of the scripts I saw went to great lengths to eliminate competing crypto miners from the systems so they could run their own. That tells me there were multiple people (or groups) exploiting this in competition with each other.
You said v5 of the attack got more sophisticated. How do we know there wasn't a "v0" that was even more sophisticated and innocuous? You can't trust the server logs. Firewall tables were flushed, SELinux was disabled. It's just really hard to say the full extent of the damage.
You're absolutely right that we can't be 100% confident, and best practice dictates a full rebuild from known sources, as usual after IOCs, especially ones of this magnitude.
However, the number of public, unpatched Salt servers might be considered a small enough pool for bad actors to have investigated; who can say why it took so long to see genuinely malign attacks.
> One of the scripts I saw went to great lengths to eliminate competing crypto miners from the systems so they could run their own. That tells me there were multiple people (or groups) exploiting this in competition with each other.
It wasn't very sophisticated - just a series of kill statements. This tells me that the author of that script picked up an existing script that's probably been around for years and adjusted it to their needs.
The script also tried to kill Confluence, amongst a handful of other large, relatively rare applications, which further suggests this was old-fashioned copy-pasting by unsophisticated script kiddies ... or someone just wanting to do a PSA and draw attention to this exploit while making a few BTC for their troubles. Who can say.
We don't know there wasn't a 'v0' - but we're fairly confident. Unless it was disabled as soon as 'v1' popped up, you'd expect honeypot systems to identify non-benign variants - and honeypot systems were identifying modest, reversible changes and nothing in the way of data exfiltration.
By Tuesday or Wednesday of this week I expect there were more (and worse) exploits than could be tracked, though, and some people are really going to suffer as a result.
I'll try to give you some insight as I'm a security engineer at Algolia.
Your concern is valid, and it's true, we cannot know for sure. That's the reason why, as explained in the blog post, we are reinstalling all impacted servers and rotating our secrets. If our assumption is false, this should contain the issue.
That being said, we have good reasons to make that assumption.
- Our analysis of the incident and of how the malware behaved on our systems didn't find any evidence of access to or transfer of data.
- There are other public analyses of the malware. Other companies that were hit reached the same conclusions as us, and you can have a look at https://saltexploit.com/ which maintains an interesting list of what is known about the attack, how it behaved, and how fast it's evolving to adapt.
I agree. I would like to see more details of how they determined it was only crypto mining. Finding only mining scripts in your logs doesn't mean they were not running other code once they had root.
It seems bizarre to me that a crypto miner got in. It wouldn't make much money on regular CPUs, and the high processor usage would immediately draw attention. So it looks like a low-effort botnet, which is embarrassing to get pwned by.
(The coin mining could be a cover like you mention, but it seems unlikely since it naturally draws attention.)
It's weird that these salt masters are reachable from the internet and people can sleep well with that.
Even with zero-trust networking or the BeyondCorp idea, I still find the extra layer of protection a VPC gives to be great. A few years ago there was an issue with the K8s API server, and updating K8s isn't a walk in the park. I felt relaxed back then because we had everything inside a VPC.
You can use SSH or a VPN to access services inside the VPC. But any tool that has permission to manage your infrastructure should never be exposed to the internet.
Same thing with Jenkins: if you are using Jenkins to manage Terraform or trigger Ansible/Salt/Chef runs, make sure Jenkins is not reachable from the internet. Use a different method to route webhooks into it.
I never understood the current trend of saying the VPN is a thing of the past. Redundancy in security layers is how you don't get affected by every CVE out there.
Imo this is THE lesson to learn from this story.
Secondary: Salt and Ansible are not very mature yet.
Salt is definitely immature (I've been using it for 5 years and the situation has actually gotten worse in that time), but Ansible is a weird thing to group it with.
Yeah, I completely agree and really don't see the point of having a configuration management server facing the Internet and basically having all your servers connect to it through the Internet! The BeyondCorp idea of eliminating the road-warrior concept is one thing; having your infra management exposed to CVEs in the wild is another!
For Jenkins it's a bit more complicated because of GitHub webhooks, although GitHub does publish its IPs in a programmatic form so you can whitelist them.
> I used to rely on the Github IP whitelist but one day I realized anyone can hit my Jenkins using Github.
That's a really good point, but I guess you are talking about Actions egress, right? Webhooks in theory have dedicated IP ranges [1] and I think they are not shared with Actions egress, although TBH I haven't tested it.
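For what it's worth, here's a quick sketch of pulling the published ranges, assuming the GitHub meta API is the programmatic form in question; it exposes a "hooks" list for webhook deliveries and, as far as I can tell, a separate "actions" list for Actions egress, so the two can be compared:

    import requests

    # GitHub publishes its IP ranges as JSON; no authentication needed.
    meta = requests.get("https://api.github.com/meta", timeout=10).json()

    webhook_cidrs = meta["hooks"]            # ranges used to deliver webhooks
    actions_cidrs = meta.get("actions", [])  # ranges used by Actions runners, if listed

    print("webhook ranges:", webhook_cidrs)
    print("overlap with Actions:", set(webhook_cidrs) & set(actions_cidrs))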
> Why would anyone have salt ports open/exposed to public/internet?
If you're bootstrapping random servers, this is a fine approach.
The whole Salt connection methodology is 'trust on first connect' (a bit like default SSH), with a manual stage for accepting an incoming request, and the connection stream is encrypted.
If you're using salt to bootstrap your VPN servers or network appliances then it's understandable that you'd have it exposed to a more public network, and the documentation was clear that this was fine.
Not everything is a virtual machine on a cloud provider.
Kind of a tough situation. I personally wouldn't be ready to accept this is the last such vulnerability that will be found.
In light of this attack, maybe going forward have a setup script that creates an SSH tunnel back to a machine that can talk to the salt-master for you. You could use a VPN instead, but if it's at all flaky, it could cost you the ability to update machines.
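A rough sketch of what such a setup script might do (hostnames are placeholders, and this assumes a bastion that can reach the real salt-master): forward the master's two ports through the bastion and point the minion at localhost.

    import subprocess

    BASTION = "bastion.example.com"          # hypothetical jump host reachable over SSH
    SALT_MASTER = "salt.internal.example"    # only reachable from the bastion

    # Forward the salt master's publish (4505) and request (4506) ports locally.
    subprocess.Popen([
        "ssh", "-N",
        "-L", f"4505:{SALT_MASTER}:4505",
        "-L", f"4506:{SALT_MASTER}:4506",
        BASTION,
    ])

    # /etc/salt/minion on this machine would then set:  master: 127.0.0.1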
Or perhaps (and I say this as a saltstack user) ansible really is the more secure model for those scenarios.
> If you're bootstrapping random servers, this is a fine approach.
Define "random". I think there is an alternative method not involving exposing you CM server on the Internet for almost any definition of random. In the Algolia case it's pretty sure because they now filter the access by IP (so they KNOW the IPs)
"Random" can mean "I don't know before I start my instance".
If you're multi-cloud (Vultr, DO, AWS and GCP) you almost certainly will not know your instance's IP before it's provisioned, and you can't make use of nice features like network tags or security labels.
If you're producing test environments then bootstrapping those is going to be significantly more painful than just opening up your salt-master and running an authenticated API request to allow those new machines.
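That authenticated request can be a single call to salt-api, assuming the rest_cherrypy netapi module with external auth is enabled; a rough sketch (host, credentials, and minion ID are placeholders):

    import requests

    # Accept a newly provisioned minion's key via salt-api's /run endpoint.
    resp = requests.post(
        "https://salt.example.com:8000/run",
        json=[{
            "client": "wheel",
            "fun": "key.accept",
            "match": "new-minion-id",
            "eauth": "pam",
            "username": "provisioner",
            "password": "...",              # placeholder credential
        }],
        verify="/etc/pki/salt-api-ca.crt",  # pin the API's CA instead of trusting broadly
    )
    resp.raise_for_status()
    print(resp.json())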
As other people have mentioned, this was always supposed to be /possible/; it's akin to SSH. Sure, you can avoid some log spam and potential issues by firewalling it off - but it's meant to be possible to run it publicly, it has always been marketed this way, so it's not "insane" that people did it.
> As other people have mentioned, this was always supposed to be /possible/; it's akin to SSH. Sure, you can avoid some log spam and potential issues by firewalling it off - but it's meant to be possible to run it publicly, it has always been marketed this way, so it's not "insane" that people did it.
I'm not blaming anyone, I'm just saying that if you put well-known software facing the Internet you are exposing yourself to more risk than if you don't. And for core infra software such as SaltStack I don't really see a good reason to justify it. I don't think putting SSH publicly accessible is justified either, unless you are a really, really small company or an individual.
In a multi-cloud setup, all the clouds are joined together with site-to-site VPNs. One doesn't just do a setup where they're public and connect to one another's databases over the public internet.
That's easier said than done. There are no simple cross-cloud-provider solutions for private networking other than ZeroTier, which has its own issues.
Trusting a central control server is the fundamental mistake here.
It creates a very high value target that is difficult to secure.
I prefer a model where the management commands are signed at a management workstation and those commands are pushed by the server and authenticated at the managed node against a security policy.
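A minimal sketch of that model (not any particular tool, just Ed25519 via the Python cryptography library): the workstation signs the command, the central server only relays it, and the managed node verifies the signature and checks its own policy before acting.

    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives.asymmetric.ed25519 import (
        Ed25519PrivateKey,
        Ed25519PublicKey,
    )
    from cryptography.hazmat.primitives.serialization import Encoding, PublicFormat

    # --- management workstation ---
    signing_key = Ed25519PrivateKey.generate()      # in practice, kept offline or in an HSM
    trusted_pubkey = signing_key.public_key().public_bytes(Encoding.Raw, PublicFormat.Raw)

    command = b"service restart nginx"
    signature = signing_key.sign(command)           # (command, signature) is relayed by the server

    # --- managed node ---
    ALLOWED_PREFIXES = (b"service ", b"pkg ")       # the node's local security policy

    def authorize(cmd: bytes, sig: bytes, pubkey_bytes: bytes) -> bool:
        pubkey = Ed25519PublicKey.from_public_bytes(pubkey_bytes)
        try:
            pubkey.verify(sig, cmd)                 # rejects anything the workstation didn't sign
        except InvalidSignature:
            return False
        return cmd.startswith(ALLOWED_PREFIXES)     # even signed commands must pass local policy

    assert authorize(command, signature, trusted_pubkey)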
Both this and the Ghost CMS updates seem to hint that the only reason this was discovered was that loud crypto miners were exhausting resources. What are the chances a quieter attacker hadn't thoroughly ploughed through the entire infrastructure days ahead of them?
Also think about how many years this vuln has been present and exposed. Who's to know blackhats haven't sat on this 0day for years, quietly compromising private keys and other data? Spooky.
I've seen various "deployment" tools (or "configuration management" tools, if you will) called "insecure" or "immature" in the comments, or one claimed to be better than another; however, I think this is a good opportunity to talk about a deeper problem, namely the architectural choices each tool has made.
These choices all impact the reliability and security of the resulting system, especially the following:
* do they rely on SSH, or have they implemented their own authentication / authorization techniques? (personally I would be very reluctant to trust anything that just listens on a network port for deployment commands and isn't SSH;)
* do the agents run with full `root` privileges, or is there a builtin mechanism that allows the agent to act only in a limited capacity, within the confines of a set of whitelisted actions? (perhaps even requiring a secondary authentication mechanism for certain "sensitive" actions, for example something integrated with `sudo`, that provides a sort of 2-factor-authentication with a human in the loop;)
* do the operators have enough "visibility" into what is happening during the deployments? (more specifically, are the deployment scripts easily auditable or are they a spaghetti of dependencies? are the concrete actions to be taken clearly described, or are they hidden in the source code of the tool?)
* are there builtin mechanisms to "verify" the results of the deployments?
* and building upon the previous item, are there mechanisms to continuously "verify" if the deployment hasn't changed behind the scenes?
I understand that some of these features wouldn't have directly helped to prevent this particular case; however, they would have helped with alerting and diagnosis.
Can anyone describe the business benefits of an algolia implementation (vs Elasticsearch?) for a company that doesn't heavily rely on content searches? It seems expensive and something that I'd build on my own.
(Disclaimer: long-time operator and fledgling programmer)
IMHO the two main advantages in favor of Algolia are the sane defaults for relevancy and speed, and the fact that the service is hosted and can grow with your business without needing dedicated engineers to manage both the configuration and the infrastructure.
Also, on top of the Algolia services per se (search, analytics, recommendation, etc.), we provide a lot of backend and frontend libraries which one would otherwise need to reimplement when using an Elastic- or Solr-based implementation.
Search is hard to get right and the cost of Algolia is negligible vs. doing it yourself. As a programmer, every line of code you write is a line of code you own: the less code you own in production, the better off you are. Algolia has saved us hundreds of hours which translates to tens of thousands of dollars.
As a point of comparison, you can also expose Puppet masters to the public Internet but Puppet is using HTTP/HTTPS as a transport, so it is trivial to put a reverse proxy in front of it, requiring a valid certificate (managed and signed by Puppet) to contact the service. This way, no need to maintain a whitelist of legitimate clients.
- the notification was a week ago to a small mailing list, which is tucked away on their site
- no notification to those registered when you go to download Salt (at least I never received an email, but I still get plenty of marketing spam)
- no posts on social media as far as I can tell, I couldn't find a tweet, anything on reddit, or anything on hn.
- they only blogged about it on their official site yesterday, way after damage had been done
- one week's notice between the initial announcement and the patch coming out. The patch being released is basically a disclosure of the vulnerability
- the patch was released late Thursday early Friday depending on your timezone, giving attackers the weekend head start
- the official salt docker images were only patched yesterday
- You can't get a patch for older versions without filling out a form and supplying details
- Ubuntu and other repositories are still vulnerable