
I've told this story before, but it was fun, so I'm sharing it again:

I'll skip the details, but a previous employer dealt with a large, then-new .mil website. Our customers would log into the site to check on the status of their invoices, and each page load would take approximately 1 minute. Seriously. It took about 10 minutes to log in and get to the list of invoices available to be checked, then another minute to look at one of them, then another minute to get out of it and back into the list, and so on.

My job was to write a scraper for that website. It ran all night to fetch data into our DB, and then our website could show the same information to our customers in a matter of milliseconds (or all at once if they wanted one big aggregate report). Our customers loved this. The .mil website's developer hated it, and blamed all sorts of their tech problems on us, although:

- While optimizing, I figured out how to skip lots of intermediate page loads and go directly to the invoices we wanted to see (roughly as in the sketch after this list).

- We ran our scraper at night so that it wouldn't interfere with their site during the day.

- Because each of our customers had to check every one of their invoices each day if they wanted to get paid, and we were doing it more efficiently, our total load on their site was lower than what our customers' combined load would have been.
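For the curious, here's a minimal sketch of that nightly-scrape-into-a-DB pattern in Python. Every endpoint path, form field, and regex here is a hypothetical stand-in, not the actual .mil site's structure:

    import re
    import sqlite3
    import requests

    BASE = "https://invoices.example.mil"  # hypothetical base URL

    def load_invoice_ids(session):
        # Hypothetical: one listing page containing all invoice links.
        html = session.get(BASE + "/invoices").text
        return re.findall(r'href="/invoice/(\d+)"', html)

    def scrape_all(username, password):
        # Log in once and reuse the session cookie for every request.
        session = requests.Session()
        session.post(BASE + "/login", data={"user": username, "pass": password})

        db = sqlite3.connect("invoices.db")
        db.execute("CREATE TABLE IF NOT EXISTS invoices"
                   " (id TEXT PRIMARY KEY, status TEXT, html TEXT)")

        # Hit each invoice page directly instead of clicking through
        # the slow intermediate menus.
        for invoice_id in load_invoice_ids(session):
            html = session.get(BASE + "/invoice/" + invoice_id).text
            match = re.search(r"Status:\s*(\w+)", html)  # hypothetical markup
            status = match.group(1) if match else "unknown"
            db.execute("INSERT OR REPLACE INTO invoices VALUES (?, ?, ?)",
                       (invoice_id, status, html))
        db.commit()

Kick that off from a nightly cron job, and the daytime site only ever serves reads out of the local DB.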

Their site kept crashing, and we were their scapegoat. It was great fun when they blamed us in a public meeting and we responded that we'd actually disabled our crawler for the past week, so the problem was still on their end.

Eventually, they threatened to cut off all our access to the site. We helpfully pointed out that their brand-new site wasn't ADA compliant, and that we had vision-impaired customers who weren't able to use it. We offered to let our customers run the same reports from our website, at no cost to the .mil agency, so that they wouldn't have to rebuild their website from the ground up. They saw it our way and begrudgingly allowed us to keep scraping.



This sounds like exactly the problem a 'data ownership' law would solve. Allow the user, via some official OAuth flow with their service providers, to authorize even a competitor to access their account, so the competitor can bear the burden of interfacing with the API to port the new user's data over. It should be limited to something like once a year, though, so the law doesn't force companies to scale their service to handle bots like the ones the main OP is experiencing.
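A rough sketch of what the provider-side once-a-year limit could look like (all names and the token format are made up; a real implementation would issue signed, scoped OAuth tokens):

    from datetime import datetime, timedelta

    EXPORT_INTERVAL = timedelta(days=365)
    last_export = {}  # user_id -> datetime of the last authorized export

    def grant_portability_token(user_id, competitor_client_id):
        # Refuse if this user has already spent their yearly export.
        now = datetime.utcnow()
        previous = last_export.get(user_id)
        if previous is not None and now - previous < EXPORT_INTERVAL:
            raise PermissionError("portability export already used this year")
        last_export[user_id] = now
        # Stand-in for a signed, scoped, short-lived token.
        return "export:%s:%s:%s" % (user_id, competitor_client_id,
                                    now.isoformat())

The once-a-year cap means a provider only has to support a bounded number of bulk exports, rather than continuous bot traffic.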


I have worked with .mil customers who paid us to scrape and index their website because they didn't have a better way to access their official, public documents.


This is not .mil-specific: I've been told of a case where an airline first went after a flight search engine (Skyscanner) legally for scraping, then told them to continue when they realized their own search engine couldn't handle all the traffic, and that even if it could, it was more expensive per query than routing via Skyscanner.


Michael Lewis' podcast recently had an episode where the Athena Health people related a (self-promotional) anecdote: after they had essentially reverse-engineered the insurers' medical billing systems and were marketing that knowledge as software to providers, a major insurance company called them up and asked to license information about its own billing system, because the insurer's internal systems were too complicated to understand.


Yep. Have seen similar things.


Me too, but for a private company.

In reality, it was probably more like org subgroup A wanted to leverage org subgroup B's data, but B wouldn't cooperate.


Amazing story :) Though I am left wondering if there are ever any circumstances where minorities don't get used as leverage somehow


Yeah, that was unfortunate. We had precious few Federal-strength levers at our disposal, though, and sometimes you have to go with what's available.



