
"           "

There, I cut and pasted a quote from the linked document into this response. What do you see? I see a bunch of boxes.



If you use an HTML inspector and remove that mess, you'll find that the text in the HTML disappears as well.

It's a font encoding problem, which is one of the differences between PDF and HTML. Sometimes you cannot copy text out of a PDF even though it displays correctly.

I'm working on that problem. I've done it this way so far because I think visual accuracy is more important.
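
To illustrate the font encoding issue described above, here is a toy sketch in Python. It is not how pdf2htmlEX works internally; the tables `GLYPH_SHAPES` and `TO_UNICODE`, the helper names, and the Private Use Area fallback are all assumptions made for the example. The point is only that drawing needs a code-to-glyph mapping, while copying needs a separate code-to-Unicode mapping that a PDF may not carry.

```python
# Toy model of a subset font with its own arbitrary character codes.
# Hypothetical data: the codes 1..4 and their glyph shapes are made up.
GLYPH_SHAPES = {1: "H", 2: "e", 3: "l", 4: "o"}  # what each glyph looks like
TO_UNICODE = {1: "H", 2: "e", 3: "l", 4: "o"}    # optional meaning of each code


def render(codes):
    """What the reader sees: only the glyph shapes are needed."""
    return "".join(GLYPH_SHAPES[c] for c in codes)


def copy_text(codes, to_unicode):
    """What copy/paste yields: needs a code -> Unicode mapping.

    Assumption for this sketch: codes with no known meaning fall back to
    the Private Use Area (U+E000 + code), which usually shows up as boxes.
    """
    return "".join(to_unicode.get(c, chr(0xE000 + c)) for c in codes)


codes = [1, 2, 3, 3, 4]
visible = render(codes)                 # looks right on screen
pasted_ok = copy_text(codes, TO_UNICODE)  # mapping present: real text
pasted_bad = copy_text(codes, {})         # mapping missing: PUA characters
```

With the mapping present, the pasted text matches what is displayed. With it missing, the page still looks right but the pasted characters carry no meaning.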


"It's the problem of font encoding"

Yes, and that was exactly my point. It would be really cool if the converter generated character code points for the characters on the screen, so that cutting and pasting did what you would expect. But to make that work you need to do some form of OCR on the document: figure out where the text is and how it is composed, create a font that re-creates the look based on the imagery in the document, and then generate the CSS that lays down the text, decorates it with the font, and re-creates the visual of the PDF (or make it an EPUB).

If you can get it to that point, there will be huge utility for folks who want to convert paper books to e-books, because the typical scanner generates PDF but the typical e-book reader will only reflow EPUB (or .mobi, or proprietary formats).


OCR is beyond the scope of pdf2htmlEX. I'm just trying to find out the real meaning of the glyphs through their glyph names.

Actually, you should usually be able to select and copy text without problems, as long as there are no Type 0 fonts.
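
As a rough sketch of what recovering meaning from glyph names could look like, the Python below maps a glyph name to a character. It is an illustration only, not pdf2htmlEX's code: `AGL_EXCERPT` is a tiny hand-picked excerpt in the style of the Adobe Glyph List, `glyph_name_to_char` is a hypothetical helper, and the handling of the `uniXXXX` and `uXXXX` naming conventions is simplified (no ligature sequences, no underscore-joined components).

```python
import re

# Tiny excerpt of name -> code point pairs, in the style of the Adobe Glyph List.
AGL_EXCERPT = {
    "A": 0x0041,
    "space": 0x0020,
    "eacute": 0x00E9,
    "quotedblleft": 0x201C,
    "quotedblright": 0x201D,
}

_UNI_NAME = re.compile(r"uni([0-9A-F]{4})$")   # e.g. "uni0041"
_U_NAME = re.compile(r"u([0-9A-F]{4,6})$")     # e.g. "u1F600"


def glyph_name_to_char(name):
    """Return the character a glyph name stands for, or None if unknown."""
    base = name.split(".", 1)[0]  # drop a variant suffix such as ".alt"
    if base in AGL_EXCERPT:
        return chr(AGL_EXCERPT[base])
    match = _UNI_NAME.match(base) or _U_NAME.match(base)
    if match:
        code_point = int(match.group(1), 16)
        if code_point <= 0x10FFFF:  # stay inside the Unicode range
            return chr(code_point)
    return None  # names like "g123" carry no recoverable meaning
```

Names that follow neither convention return `None`, which is the case where only something like OCR could recover the text.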



