
It does have a feature to do the opposite. You can, in theory, extract tabular data from PDFs with Excel (note: only on the Windows version; this function isn’t available in macOS Excel).

In practice I’ve found it to be extremely unreliable, and I suspect this may be because the optional metadata that semantically defines a table as a table is missing from the errant PDF. It’ll still look like a table when rendered, but there’s nothing that defines it as such. It’s just a bunch of graphical and text elements that, when rendered, happen to look like a table.
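To make that concrete, here is a rough sketch of what an extractor is left to do when the table tags are missing: guess rows and columns from coordinates alone. The fragment data, the `group_into_rows` helper, and the `y_tolerance` value are all made up for illustration; this is not how Excel or any particular library does it.

```python
# Hypothetical text fragments as a PDF might store them: (x, y, text).
# Nothing says "table"; they arrive in arbitrary order with slightly
# uneven baselines.
fragments = [
    (200.0, 700.2, "Qty"),
    (72.0, 700.0, "Item"),
    (72.0, 684.1, "Widget"),
    (200.0, 684.0, "3"),
    (200.0, 668.0, "12"),
    (72.0, 668.3, "Gadget"),
]


def group_into_rows(fragments, y_tolerance=2.0):
    """Guess table rows by clustering fragments with similar y positions.

    y_tolerance is an assumed fudge factor; real documents need tuning.
    """
    rows = []  # each entry: (y of the row's first fragment, [(x, text), ...])
    # Walk from the top of the page down (larger y is higher in PDF space).
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append((y, [(x, text)]))
    # Within each row, order cells left to right by x.
    return [[text for _, text in sorted(cells)] for _, cells in rows]


table = group_into_rows(fragments)
for row in table:
    print(row)
```

This only works because the toy data is tidy. A cell that wraps onto a second line lands at a different y and gets treated as a new row, which is exactly the kind of guess that goes wrong on real documents.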



Yeah. The "extremely unreliable" part of that is the stinker. Some of the exports I get through FOIA run to thousands and thousands of pages, so the unreliability compounds quickly. It's frustrating, because there are many things Microsoft could do with PDFs to make that a non-problem. But it's consistently been a naive implementation that doesn't consider newlines.



