I'm flattered the author mentions Crocodoc. Crocodoc is hiring by the way if anyone wants to hack on stuff like this full time: https://crocodoc.com/jobs/
When viewing the output (using the "computer science cheat sheet") I found some differences between browsers that I thought HN readers might find interesting. These aren't primarily issues with your tool, hence posting here.
- I primarily use Chrome (21) as my browser, and the cheat sheet renders very quickly. I noticed it doesn't seem to render some equations correctly (see bad operators here[1]).
- FF (15.0.1) seems to render more correctly, but it is glacially slow. The whole app (chrome and all) freezes for several seconds between clicks while the document is loaded in any tab.
- IE (9) renders the same page both correctly and quickly.
Amazing. I am attempting to install this on Mac OS X Lion, and it is taking a long time because of the dependencies. With so many dependencies, the probability of failure is very high. Let's hope it works.
I urge you to find a way to allow people to install your software more easily.
I managed to get it to install (after about an hour and a half of tinkering). However, I get "Segmentation fault" when I try running it:
pdf2htmlEX --debug=1 test.pdf
temporary dir: /tmp/pdf2htmlEX-LY9cOv
Preprocessing: ....
Working: Add new temporary file: /tmp/pdf2htmlEX-LY9cOv/__css
Add new temporary file: /tmp/pdf2htmlEX-LY9cOv/__pages
Add new temporary file: /tmp/pdf2htmlEX-LY9cOv/p1.png
Install font: (29 0) -> f1
Add new temporary file: /tmp/pdf2htmlEX-LY9cOv/f1.pfa
This has been confirmed by several Mac users.
We are working on this.
Please hold on, and join the discussion on GitHub if you like.
Thanks for your patience.
What I'd like to see is a library that can extract multi-column text into a readable format. From looking at the source of the HTML here, they're doing it with absolute positioning. Nothing wrong with that for display purposes, but I'd like to have a library that can extract text meaningfully from a multi-column PDF.
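To make concrete what I mean by "meaningfully", here is a minimal sketch of column-aware reading order. It assumes some extractor has already given you word boxes as (x, y, text) with y growing downward, and that the page has equal-width columns; both are assumptions on my part, and `reading_order` is a hypothetical helper, not part of any existing library. Real pages would also need words on the same line snapped to a common y.

```python
from typing import List, Tuple

# (x of left edge, y of top edge, text); y grows downward. Hypothetical input format.
Word = Tuple[float, float, str]


def reading_order(words: List[Word], page_width: float, columns: int = 2) -> str:
    """Join words column by column, top to bottom within each column."""
    col_width = page_width / columns

    def key(word: Word):
        x, y, _ = word
        # Assign the word to a column by its left edge (assumes equal-width columns).
        col = min(int(x // col_width), columns - 1)
        return (col, y, x)

    ordered = sorted(words, key=key)
    return " ".join(text for _, _, text in ordered)
```

With absolutely positioned output, the words arrive in whatever order the PDF drew them, so a sort like this is the step that is missing for plain-text extraction.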
The pdftotext tool from xpdf does something like that. One option pads the output text with spaces to roughly match the layout of the pdf (the -layout option) and another option just strips the pdf formatting out (the -raw option).
Depending on the structure of the pdf, one or the other may give better output (the -layout output would need some more processing).
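If you want to compare the two modes from a script, something like the following should work. It is a rough sketch that assumes pdftotext is on your PATH; `pdftotext_cmd` and `extract_text` are names I made up for illustration. Passing "-" as the output file makes pdftotext write to stdout.

```python
import subprocess
from typing import List


def pdftotext_cmd(pdf_path: str, mode: str = "layout") -> List[str]:
    """Build the pdftotext command line for -layout or -raw mode."""
    if mode not in ("layout", "raw"):
        raise ValueError("mode must be 'layout' or 'raw'")
    # "-" as the output file asks pdftotext to write the text to stdout.
    return ["pdftotext", "-" + mode, pdf_path, "-"]


def extract_text(pdf_path: str, mode: str = "layout") -> str:
    """Run pdftotext (assumed to be installed) and return the extracted text."""
    result = subprocess.run(
        pdftotext_cmd(pdf_path, mode),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `extract_text("paper.pdf", "layout")` and `extract_text("paper.pdf", "raw")` on the same file is a quick way to see which mode suits a given document.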
From my point of view, that's not really tex to html. That's tex markup to html. I am talking about using the latex software, whose purpose is to do typesetting. The amazing thing about this converter is that it takes the latex OUTPUT and produces html.
Neat idea. Make it generate epub and it moves the pdf->e-book ball a bit further down the field. Looking at the source of this page (view-source:http://coolwanglu.github.com/pdf2htmlEX/demo/geneve.html), it looks like you can't yet generate a font from the characters; rather, it uses the 'font trick' to put images on the page. That makes the epub problem harder (epub really wants fonts, not images, it seems).
If you use an HTML inspector and remove that piece of mess, you'll find the text in the HTML also disappears.
It's a font encoding problem, which is one of the differences between PDF and HTML. Sometimes you cannot copy the text out of a PDF even though it displays correctly.
I'm working on that problem. I've done it this way so far because I think visual accuracy is more important.
Yes, and that was exactly my comment. It would be really cool if the converter generated character code points for the characters on the screen, so that cutting and pasting did what you might expect. But to make that work you need to do some form of OCR on the document and figure out where the text is and how it is composed. Then you create a font which re-creates the look based on the imagery in the document, generate the CSS that lays down the text and decorates it with the font, and so re-create the visual of the PDF (or make it an epub).
If you can get it to that point, there will be huge utility for folks who want to convert paper books to e-books, because the typical scanner will generate PDF but the typical e-book reader will only flow epub (or .mobi or proprietary formats).
Looks pretty neat. Saving it as an HTML file works great, but you can't print the docs (in Chrome, print to PDF only shows a scroll bar; in FF it is not properly formatted).
I'm very impressed. Can you post some more examples online, some non-technical PDFs for example? I'm curious how well it does 'generic' PDFs (for example magazine layouts).
There are also various use cases for doing this as part of a larger product. Say you need to take a customer's crappy PDFs and reformat them for display within a web app, on a public display, or to send as an HTML email. You could use this tool, convert to HTML, then drop in your own CSS stylesheet to reformat it. If your customer had many of said crappy PDFs, you could no doubt automate the whole process.
Needless to say I had to do something pretty similar recently, though I ended up having to ask the customer to provide better source data than the PDFs they initially sent. This tool could have been very useful at the time, hope to give it a spin soon.
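For what it's worth, the automation I have in mind could be sketched roughly like this. It is untested against the real tool and rests on assumptions: that pdf2htmlEX is on the PATH, that a bare `pdf2htmlEX input.pdf` writes `<name>.html` into the current directory, and that the output has a `</head>` tag to hook a stylesheet into. The function names and the `custom.css` path are mine, purely for illustration.

```python
import subprocess
from pathlib import Path
from typing import List


def convert_cmd(pdf: Path) -> List[str]:
    """Bare invocation with no extra flags assumed: pdf2htmlEX <input.pdf>."""
    return ["pdf2htmlEX", str(pdf)]


def add_stylesheet(html: str, css_href: str) -> str:
    """Insert a <link> to our own stylesheet just before the first </head>."""
    link = '<link rel="stylesheet" href="%s"/>' % css_href
    if "</head>" in html:
        return html.replace("</head>", link + "</head>", 1)
    # No head element found: fall back to prepending the link.
    return link + html


def convert_all(pdf_dir: Path, out_dir: Path, css_href: str) -> None:
    """Convert every PDF in pdf_dir and attach css_href to each result."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for pdf in sorted(pdf_dir.glob("*.pdf")):
        # Assumption: the tool writes <stem>.html into the working directory.
        subprocess.run(convert_cmd(pdf.resolve()), cwd=out_dir, check=True)
        html_path = out_dir / (pdf.stem + ".html")
        html = html_path.read_text(encoding="utf-8")
        html_path.write_text(add_stylesheet(html, css_href), encoding="utf-8")
```

Something like `convert_all(Path("incoming"), Path("site"), "custom.css")` would then process a whole batch, with the stylesheet link placed last in the head so its rules can override the generated ones.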