Great article, until the end. Who uses PII in test data derived from real custom...

akamor · on Jan 31, 2019

Hi, this is Adam. I'm a founder at Tonic.

As others have said, we've found a lot of smaller companies will test with production data because of their need/desire to move quickly. But we've also seen much, much larger companies use production data in their dev/staging environments. Sometimes there will be production-like safeguards and security measures in place, but not always. People shy away from practices that slow down development and testing.

We think synthetic data is the right solution for a few reasons. Most importantly, we believe it provides the right level of security while still allowing your team to be productive, i.e., your business logic and test cases still work. It also allows you to scale really easily, since you effectively have a ruleset for generating data of any size. Finally, it’s a great way to share data throughout your organization and can help facilitate sales and partnerships. If you’re curious about scaling, check this post out: https://www.tonic.ai/blog/condenser-a-database-subsetting-to...
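
A minimal sketch of the "ruleset for generating data of any size" idea, assuming a ruleset is just a mapping from column name to a function that draws one value. The column names, the value pools, and `generate_rows` are hypothetical and are not taken from any Tonic product.

```python
import random
import string

# Hypothetical ruleset: column name -> function that draws one value.
RULES = {
    "name": lambda rng: rng.choice(["Ada", "Grace", "Linus", "Margaret"]),
    "email": lambda rng: "".join(rng.choices(string.ascii_lowercase, k=8)) + "@example.com",
    "age": lambda rng: rng.randint(18, 90),
}

def generate_rows(n, seed=0):
    """Yield n synthetic rows; the same seed reproduces the same rows."""
    rng = random.Random(seed)
    for i in range(n):
        row = {"id": i + 1}
        for column, rule in RULES.items():
            row[column] = rule(rng)
        yield row

# The same rules produce a tiny fixture or a large load-test set.
small = list(generate_rows(10))
large_count = sum(1 for _ in generate_rows(100_000))
```

Because the rows come from a generator, the size is only a parameter; a fixed seed keeps test fixtures reproducible.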

magicalhippo · on Feb 1, 2019

Funny you should ask. This just popped up the other day over here in Norway:

https://www.dagensmedisin.no/artikler/2019/01/27/brukte-virk...

In summary, when doctors were testing a new electronic patient journal system, they used real social security numbers (our version of them). And just for kicks they tested in production, so the people whose numbers were used got all kinds of prescriptions for stuff they didn't need, etc.

alkonaut · on Jan 31, 2019

I have never seen a “dev” instance of a DB that wasn’t just a snapshot of the prod DB from earlier. I admit I haven’t seen many, but I have seen zero of any other kind (e.g. anonymized or synthetic).

sjjshzvuiajhz · on Feb 1, 2019

Just going to throw out there that I’ve never seen a dev database that was anything other than fake data, or internal dogfood data. Have worked at major public tech companies and late-stage startups.

alkonaut · on Feb 1, 2019

I think one reason might be that this was never sensitive personal data. Phone numbers, emails and addresses were mostly corporate. But real passwords (hashes) from real users, on 50+ laptops with unencrypted drives, was pretty normal.

I think culturally there may be a difference since I'm in a place where some data (addresses, phone numbers, ...) is public info, i.e. given your name I can get your address and phone number from a public DB anyway.

flukus · on Feb 1, 2019

I've done this plenty of times, but you can anonymize the data fairly easily and get the benefits of both.
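
A minimal sketch of one way to do this, assuming keyed hashing (HMAC-SHA256) is an acceptable replacement for the sensitive columns. The column names and the key handling are hypothetical. Strictly speaking this is pseudonymization: the same input always maps to the same token, so joins keep working, but other columns left in place can still identify a person.

```python
import hashlib
import hmac

# Assumption: the key is stored outside the dev environment and never copied with the data.
SECRET_KEY = b"replace-me"
PII_COLUMNS = {"name", "email", "phone"}  # hypothetical column names

def pseudonymize(value, key=SECRET_KEY):
    """Map a value to a stable token using HMAC-SHA256, truncated to 16 hex characters."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]

def anonymize_row(row):
    """Return a copy of row with non-null PII columns replaced by tokens."""
    return {
        col: pseudonymize(str(val)) if col in PII_COLUMNS and val is not None else val
        for col, val in row.items()
    }

row = {"id": 7, "name": "Jane Doe", "email": "jane@example.com", "plan": "pro"}
masked = anonymize_row(row)
```

Non-PII columns such as `id` and `plan` pass through unchanged, so row counts and data distribution stay close to production.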

The flip side of the coin is dev databases not being representative of production, which can cause performance issues. "It works with 10 rows on localhost, why doesn't it work with a million in production?"

barefoot · on Jan 31, 2019

Many small companies take this approach. Usually it’s lower-risk PII.

Some small companies will refuse to use generated data if it takes even a minute more to generate it vs. import it from production. In the consulting world I’ve seen multiple examples of companies complaining bitterly about other security-minded consultants' efforts to improve security and privacy through even small amounts of additional development time.

baroffoos · on Jan 31, 2019

I have seen it done in a small company to check if a query will run too slow in production. Take a copy of the biggest database. Run query, see what happens, delete copy.
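
The copy, run, delete routine can be sketched roughly as follows, assuming a SQLite database for simplicity (a server database would use its own backup and restore tooling instead). `time_query_on_copy` and the demo table are hypothetical.

```python
import os
import sqlite3
import tempfile
import time

def time_query_on_copy(source_path, query):
    """Copy the database, time the query on the copy, then delete the copy."""
    fd, copy_path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    try:
        src = sqlite3.connect(source_path)
        dst = sqlite3.connect(copy_path)
        try:
            src.backup(dst)  # copy the whole database into the temporary file
            start = time.perf_counter()
            rows = dst.execute(query).fetchall()
            elapsed = time.perf_counter() - start
        finally:
            dst.close()
            src.close()
    finally:
        os.remove(copy_path)  # the copy is removed even if the query fails
    return elapsed, len(rows)

# Demo with a throwaway database standing in for production.
demo_dir = tempfile.mkdtemp()
prod_path = os.path.join(demo_dir, "prod.db")
conn = sqlite3.connect(prod_path)
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
conn.executemany("INSERT INTO orders (total) VALUES (?)", [(i * 1.5,) for i in range(1000)])
conn.commit()
conn.close()

elapsed, row_count = time_query_on_copy(prod_path, "SELECT * FROM orders WHERE total > 100")
```

The query never touches the source database, so a slow or locking query cannot affect it.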

fooey · on Jan 31, 2019

It's probably more often that the query is just run against production in the first place.

Making a copy is probably more effort than most developers out in the wild are going to make.

sroussey · on Feb 1, 2019

Not true. If you were to throw up a slow locking query in production, you could take down the site. Restoring a backup should be fast.

Kalium · on Feb 1, 2019

It's very easy for a small company to get in the habit of using data cloned from prod for testing. This practice, easy as it is to start, gets progressively more difficult to move away from as the application and service grow in complexity.

As a result, you get shockingly mature companies that do exactly this obviously absurd thing because it's a ton of work to stop, and there's no obvious reason to do that work instead of feature work.

casual_slacker · on Jan 31, 2019

We used a similar trick, not for testing, but for the ability to download the prod database and debug things locally. We hit scaling issues before PII, so a coworker built a system to generate real-ish DBs with only one customer's data. And then in a future version, sensitive fields were filtered or replaced with mock data. Not sure if there are better ways of doing this that take less engineering effort, but it was a great tool when debugging.
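
A rough sketch of that kind of tool, assuming a two-table SQLite schema; the schema, the mock values, and `export_customer` are all hypothetical and only illustrate the shape of the approach, not the system described above.

```python
import sqlite3

def export_customer(src, dst, customer_id):
    """Copy one customer's rows from src to dst, writing mock values for sensitive fields."""
    dst.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
    dst.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
    for (cid,) in src.execute("SELECT id FROM customers WHERE id = ?", (customer_id,)):
        # Sensitive fields are never read from the source; only mock values are written.
        dst.execute(
            "INSERT INTO customers VALUES (?, ?, ?)",
            (cid, f"Customer {cid}", f"customer{cid}@example.com"),
        )
    orders = src.execute(
        "SELECT id, customer_id, total FROM orders WHERE customer_id = ?", (customer_id,)
    ).fetchall()
    dst.executemany("INSERT INTO orders VALUES (?, ?, ?)", orders)
    dst.commit()

# Demo: an in-memory database standing in for production.
src = sqlite3.connect(":memory:")
src.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
INSERT INTO customers VALUES (1, 'Alice Real', 'alice@real.test'), (2, 'Bob Real', 'bob@real.test');
INSERT INTO orders VALUES (10, 1, 9.5), (11, 2, 20.0), (12, 1, 3.25);
""")
dst = sqlite3.connect(":memory:")
export_customer(src, dst, 1)
```

The local copy holds only the selected customer's orders, with the name and email replaced, so it stays small and carries less sensitive data.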

icoe · on Feb 1, 2019

Makes a lot of sense. I actually think leveraging your prod data to create a test environment is one of the best approaches, as long as you're mindful of privacy. Full disclosure: I'm a founder of tonic.ai and we make tools to make it easier to create synthetic staging instances from production environments.

munk-a · on Jan 31, 2019

Small companies that are just starting out may use real data in test environments since it's a bit easier than using mocked data... Honestly this really only holds for companies that also avoid unit/integration tests (which will generally require that data to support the tests be explicitly mocked in some manner)

Since this involves computers nothing above is a hard rule, but it goes along with my experience.

jboy55 · on Jan 31, 2019

The $25 million revenue limit would be a pretty good guide for 'small'. Typically there are a lot of changes around that mark, one of which should be to stop using customer data insecurely.

coldacid · on Feb 1, 2019

Except that revenue limit is just one term of an OR clause. If you hit any of those three listed points, CCPA comes down on you. No revenue at all but 50k unique visitors, and it applies.
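
The OR logic can be written out as a small sketch. Only the $25 million revenue and 50k figures come from this thread; the third condition (half or more of revenue from selling personal information) and the exact comparison operators are assumptions based on the 2018 text of the law, and `ccpa_applies` is a hypothetical helper.

```python
def ccpa_applies(annual_revenue_usd, consumers_per_year, revenue_share_from_selling_pi):
    """Return True if any one of the three thresholds is met (they are OR'ed, not AND'ed)."""
    return (
        annual_revenue_usd > 25_000_000
        or consumers_per_year >= 50_000
        # Assumption: third threshold, not stated in this thread.
        or revenue_share_from_selling_pi >= 0.5
    )

# No revenue at all, but 50k consumers: still covered.
no_revenue_but_traffic = ccpa_applies(0, 50_000, 0.0)
# Small on every axis: not covered.
small_shop = ccpa_applies(1_000_000, 100, 0.0)
```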

munk-a · on Feb 1, 2019

Yeah, but the $25mil portion of the clause is the only one I see an excuse for. If you're saying that -all- businesses generating X revenue or higher need to comply with a regulation, then it's good to make sure X is high enough that businesses in unrelated fields will be able to afford the cost of compliance without going bankrupt.

The other two categories specifically target companies that really should comply with this law. I assume the $25mil clause is there to make sure large companies can't loophole themselves out of this somehow (offload PII responsibility onto a subsidiary or a "third party" that is incorporated in Bermuda by the owner of the company).

lotu · on Jan 31, 2019

People who have been hurt by the fact that synthesized data often doesn't exactly match real data.

throwaway91747 · on Feb 1, 2019

TSA. My buddy was on a contract to modernize the system behind the supposedly secret no-fly list in the United States. The sample data was a sample of the production data.

Throwaway for obvious reasons.

j16sdiz · on Feb 1, 2019

When the system involves some not-well-understood edge cases (e.g. pre-Unicode, non-Latin person names; historical dates and times during calendar and timezone switches) and other underdocumented business rules.

icoe · on Jan 31, 2019

Sadly it's much more common than you might imagine.

kerkeslager · on Feb 1, 2019

A huge part of the problem is that many companies don't take security seriously.

wglb · on Jan 31, 2019

It would surprise you.

brohee · on Jan 31, 2019

I've seen that at an insurer...