Actually, leaking customer information does not really matter to the company. What the company cares about is that the leak is not their fault. They just want someone else to take the responsibility, so putting data on 'Someone Else's Computer' is actually a good strategy.
Only when losing customer information translates into revenue drops will companies take security more seriously. A law requiring companies that store customer data to follow common security practices is a possible solution, though it would hurt low-budget startups.
Every[1] computer is, at worst, a misconfigured firewall and a late security patch - or, at best, one moron clicking a phishing e-mail away from being 'Some Martian script kiddie's computer'.
Keeping your data in the cloud introduces new attack vectors. It closes others.
If you're going to be an idiot[2][3], and host a production database which doesn't require any authentication on the cloud, I don't have much hope for your non-cloud-deployed security.
[1] For most reasonable definitions of every.
[2] I have had the dubious pleasure of working in a ~15 year-old software company where ~half of the machines were virus-infested (Mostly Conficker, but also some other shit that I forget. This was in 2012. If I recall correctly, we had one part-time IT guy for ~30 engineers. His job was ordering computer parts, and keeping e-mail working. The health of the dev machines was not his problem.) The SOP was to do all your work in a VM running Windows XP, and wipe it every few weeks, or whenever performance would grind to a halt - whichever came first. One of my tasks, a few months in, was to deal with the virus situation on the build server, so that we could 'securely' build the release, and sign it with the encryption key that only the VPs had access to.
[3] I have also had a chance to work in a company obsessed with security (because of the nature of their products). One of my discoveries, a few months before I left, was that one of their products' updates was pushed via an insecure HTTP downloader. While I was there, nobody budgeted time to get it fixed.
The difference with computers in the "cloud" is that there is a much higher incentive to hack them since there is much more data on them, and hackers can be fairly confident that they will get valuable data in a hack. This is not the case with your personal computer. So if you are bad at security, I would argue that it is much safer (from a security point of view, at least) to store things on your own computer, since your mistakes are much less likely to be noticed.
Even if they were Amazon and actually owned the hardware being run on, it wouldn't have changed the outcome at all, so I don't understand how its being someone else's computer is relevant.
Right. In this case, the OCR software dev had control of the cloud machines.
However, the customers that uploaded sensitive documents to this cloud OCR service did not have control of the computers, the code, or the configuration.
Yes, if you don't trust anyone you can't get anything done, but this feels like the kind of task where you should be a little bit nervous each time you do it.
In my experience you have control and insight into machines in AWS; what insight or control do you think was lacking here? It seems to me more that they didn't understand the technology they were building on.
As far as I am concerned, see p1necone's comment above: it's not about the relation between Abbyy and Amazon, it's about the relation between them and their customers, who stored their documents with some company where they had no insight into how their data was being handled.
Another MongoDB misconfiguration, wow. How is it still so easy to configure mongo with no creds and an open port? Feels like these alone cause a large % of data leaks.
If I were running a cloud platform of such a massive scale, I'd probably scan my own ports to identify glaring problems like this one. Kind of surprising that isn't happening, considering how bad it is to have the brand associated with a report like this.
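That kind of self-scan doesn't need much machinery. A minimal sketch in Python (the hosts listed are placeholder documentation addresses, not real instances; a real sweep would enumerate your own fleet's IPs):

```python
import socket

def is_port_open(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical sweep of your own instances for MongoDB's default port (27017).
for host in ["198.51.100.10", "198.51.100.11"]:  # example/documentation addresses
    if is_port_open(host, 27017):
        print(f"warning: {host} exposes MongoDB on 27017")
```

A production version would rate-limit, scan a port list rather than one port, and feed results into an alerting pipeline, but even this naive check would have flagged the misconfiguration discussed here.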
If you spin up a server and install mongodb (while forgetting to turn on a firewall that blocks all non-port 443/80 traffic) - everyone has root access to your database.
Easy mistake to make. I've probably done it at least once on publicly accessible test instances.
By default, MongoDB doesn't listen on the public interface, so it won't be exposed to the Internet - it only listens on localhost. Old versions of MongoDB had bad defaults, but that hasn't been the case in years.
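For reference, the relevant settings in a modern `mongod.conf` look roughly like this (the file path and values are illustrative, not taken from the incident):

```yaml
# /etc/mongod.conf (typical location on deb/rpm installs)
net:
  port: 27017
  bindIp: 127.0.0.1      # listen on localhost only; never 0.0.0.0 without a firewall
security:
  authorization: enabled  # require authenticated users even for local connections
```

Exposing the database to the open Internet requires changing `bindIp` (or the older `bind_ip` option) away from this default, which is why these leaks usually involve someone deliberately loosening the config without adding auth.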
The defaults on Ubuntu at least have been changed (not sure since when, though) "since release 2.6.0 we have made localhost binding the default configuration in our most popular deployment package formats, RPM and deb" from https://www.mongodb.com/blog/post/update-how-to-avoid-a-mali...
You should really be configuring your AWS security groups with the proper inbound/outbound ports, so this is a failure at an even more basic level: for the machine to be open that fully, someone had to whitelist everything.
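As a sketch of what "proper" looks like, a minimal security group that only admits web traffic might be written like this in Terraform (resource names and the egress policy are illustrative assumptions):

```hcl
# Hypothetical security group: HTTPS/HTTP in, everything else (e.g. 27017) blocked.
resource "aws_security_group" "web_only" {
  name = "web-only"

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```

Because security groups are default-deny on ingress, MongoDB's port is unreachable from outside unless someone explicitly adds a rule for it.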
Yet AWS still defaults to wide-open security groups for the new Cloud9/CodePipeline instances that my devs create, and then Trusted Advisor tells me about the insecure configuration...
Defaults matter but I think there’s also a lot of blame for developer culture, especially in the circles where Mongo is popular. Faster, faster, ship it…
I think document stores are prone to an early threshold of thinking you've gotten a lot done without having to "waste" time defining models/types, and some people never shake that feeling even after being hip deep in all of the code they're now writing to migrate, validate, or analyze that data.
Our ORM allows this on a relational database with similar 'quick dev start' benefits, but all that pesky structure, validation, and indexing is much easier to add later. Also, when there are perf issues with Postgres and MySQL, it usually takes a few minutes to find and fix the problem; with Mongo, even if you find it, you might be at a loss to fix it.
Allowing unauthenticated access is the default configuration, but I think you have to go out of your way to make it accessible from external systems, let alone by anyone on the open internet...
There is a severe lack of good LOCAL OCR options for documents. The only ones I know are Abbyy (very good and very expensive), OCR.space Local (affordable but not as good), and of course Tesseract. But I feel Tesseract is increasingly left behind with regard to OCR quality and speed.
Aside: this was the OCR engine included with the DEVON(think/note) applications.
Thankfully, the OCR engine processed documents offline. In a seemingly prescient move, adherence to the rule of processing and storing as little data as needed prevented an anxiety-inducing chain of events.
A shame that small teams (I think that team is fewer than 10 devs) are sometimes the only businesses with enough sense to ensure that their exposure to things like this is mitigated or reduced.
This is why the bank where I work developed its own OCR software (and translation and entity recognition, etc.). It simply couldn't afford to have this kind of leak.
If I have the right 'Abbyy', I'm surprised that I don't see any information about this data release on their website. Or, maybe not. Are there disclosure laws that one would expect to come into play here?
The bigger mistake was made before this data breach even happened.