Pdf2htmlEX: A PDF to HTML converter

peterlai · on Sept 16, 2012

I'm flattered the author mentions Crocodoc. Crocodoc is hiring by the way if anyone wants to hack on stuff like this full time: https://crocodoc.com/jobs/

coolwanglu · on Sept 16, 2012

Hello, I'm the author. MathML is not used. The PDF is rendered with only HTML/CSS and a little JS.

Please comment on GitHub so that I can see it in time.

davidp · on Sept 16, 2012

This is clever, thanks for sharing.

When viewing the output (using the "computer science cheat sheet") I found some differences between browsers that I thought HN readers might find interesting. These aren't primarily issues with your tool, hence posting here.

- I primarily use Chrome (21) as my browser, and the cheat sheet renders very quickly. I noticed it doesn't seem to render some equations correctly (see bad operators here[1]).

- FF (15.0.1) seems to render more correctly, but it is glacially slow. The whole app (chrome and all) freezes for several seconds between clicks while the document is loaded in any tab.

- IE (9) renders the same page both correctly and quickly.

[1] http://imageshack.us/a/img88/3754/chromeformulas.png

coolwanglu · on Sept 16, 2012

The problem happens only on Windows.

For Chrome, if you zoom in, I think everything should be fine. But Chrome lacks antialiasing on Windows.

I'm trying to solve the Firefox problem.

mih · on Sept 16, 2012

Just to add to the compatibility list, all examples render perfectly on Opera 12, albeit a bit slow.

mgualt · on Sept 16, 2012

Amazing - I am attempting to install this on mac osx lion -- it is taking a lot of time because of the dependencies. With so many dependencies the probability of failure is very high. Let's hope it works.

I urge you to find a way to allow people to install your software more easily.

I managed to get it to install (after about an hour and a half of tinkering). However, I get "Segmentation fault" when I try running it:

     pdf2htmlEX --debug=1 test.pdf
     temporary dir: /tmp/pdf2htmlEX-LY9cOv
     Preprocessing: ....
     Working: Add new temporary file: /tmp/pdf2htmlEX-LY9cOv/__css
     Add new temporary file: /tmp/pdf2htmlEX-LY9cOv/__pages
     Add new temporary file: /tmp/pdf2htmlEX-LY9cOv/p1.png
     Install font: (29 0) -> f1
     Add new temporary file: /tmp/pdf2htmlEX-LY9cOv/f1.pfa
     Segmentation fault: 11

mgualt · on Sept 16, 2012

I was able to install cmake, fontforge and libpoppler with homebrew, and gcc-4.7 using https://github.com/sol-prog/gcc-4.7-binary

evandrix · on Sept 17, 2012

you mean `poppler` instead of `libpoppler`

tincholio · on Sept 16, 2012

     sudo apt-add-repository ppa:coolwanglu/pdf2htmlex   
     sudo apt-get update
     sudo apt-get install pdf2htmlex

I haven't yet tried to build on mac os, but in ubuntu it was trivially simple.

coolwanglu · on Sept 16, 2012

This has been confirmed by several Mac users. We are working on it. Please hold on, and join the discussion on GitHub if you like. Thanks for your patience.

coolwanglu · on Sept 17, 2012

The compile problem should be fixed now. Could you please try the latest master branch and see whether it works, or whether it fails at an assertion?

mgualt · on Sept 17, 2012

Thank you for replying - I tried the new branch, and posted the problem I encountered as an issue on GitHub, with gists attached.

coolwanglu · on Sept 16, 2012

I cannot reproduce it with a 20110222 version of fontforge.

Would you mind sending me the PDF file so that I can debug it?

Does it also crash with other PDF files?

mgualt · on Sept 16, 2012

yes - it crashes as described with any pdf file.

coolwanglu · on Sept 16, 2012

Sorry to hear that. Some people are working on a MacPorts port and a Homebrew formula.

I hope this helps: https://trac.macports.org/ticket/36028

coolwanglu · on Sept 16, 2012

The problem is that I don't have a Mac. Which version of fontforge have you installed?

mgualt · on Sept 16, 2012

This is what fontforge displays when I start it up:

Executable based on sources from 14:57 GMT 31-Jul-2012-D. Library based on sources from 14:57 GMT 31-Jul-2012.

coolwanglu · on Sept 16, 2012

I'm now trying to compile with an older version. But please update fontforge if you can.

mgualt · on Sept 16, 2012

I think my fontforge is the current version (see above) -- please correct me if I'm wrong.

coolwanglu · on Sept 17, 2012

I usually build from git. There have been some improvements relevant to pdf2htmlEX during the past month.

However, it should not crash, and the crash has now been confirmed by many people.

Could you please try the commit f02e1d4?

MartinMond · on Sept 16, 2012

Hi! This is an incredible project. I'm just curious, you mention that crocodoc has been "consulted" for this project.

Did you ask them how they do their HTML5 conversion or what exactly do you mean by that?

Anyway, a big Thanks for creating this project!

coolwanglu · on Sept 16, 2012

I meant I took a look at an HTML page generated by crocodoc. Their approach was interesting.

tincholio · on Sept 16, 2012

This is awesome stuff! Thanks for sharing this.

antidoh · on Sept 16, 2012

Damn, that's cool. Somewhat full circle too, in light of the many pdf printer drivers in use today.

wesley · on Sept 16, 2012

What I'd like to see is a library that can extract multi-column text into a readable format. From looking at the source of the HTML here, they're doing it with absolute positioning. Nothing wrong with that for display purposes, but I'd like to have a library that can extract text meaningfully from a multi-column PDF.

maxerickson · on Sept 17, 2012

The pdftotext tool from xpdf does something like that. One option pads the output text with spaces to roughly match the layout of the pdf (the -layout option) and another option just strips the pdf formatting out (the -raw option).

Depending on the structure of the pdf, one or the other may give better output (the -layout output would need some more processing).
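To make the two modes easy to compare on the same file, here is a minimal Python sketch that wraps the -layout and -raw options described above. It assumes a `pdftotext` binary is on the PATH; the helper names `build_pdftotext_cmd` and `extract_text` are made up for illustration, and passing `-` as the output file is assumed to send the text to stdout.

```python
import subprocess


def build_pdftotext_cmd(pdf_path, mode="layout"):
    """Build the pdftotext command line for one of the two modes discussed above."""
    if mode not in ("layout", "raw"):
        raise ValueError("mode must be 'layout' or 'raw'")
    # "-" as the output file asks pdftotext to write the text to stdout.
    return ["pdftotext", "-" + mode, pdf_path, "-"]


def extract_text(pdf_path, mode="layout"):
    """Run pdftotext (must be installed) and return the extracted text."""
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path, mode),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `extract_text("paper.pdf", "layout")` and `extract_text("paper.pdf", "raw")` on a two-column document and comparing the results is a quick way to see which mode suits that particular PDF.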

fudged71 · on Sept 16, 2012

This is fantastic! I've been using LaTeX for a while now, and nothing has really outputted HTML anywhere near this quality. I'm very impressed!

mgualt · on Sept 16, 2012

Very interesting - It would be great if the author could outline his overall goals and design ideas.

What are some of the constraints on the PDF in terms of page dimensions or configuration?

How is the math translation done? Does it use MathML or something else?

For me, the interest is that I can now go LaTeX ---> Webpage.

agilebyte · on Sept 16, 2012

Have you tried wiki.lyx.org/Tools/ELyXer for tex to html? I have used it on my dissertation and was mightily impressed (I am easily impressed):

http://patterns.radekstepan.com/

mgualt · on Sept 16, 2012

From my point of view, that's not really tex to html. That's tex markup to html. I am talking about using the latex software, whose purpose is to do typesetting. The amazing thing about this converter is that it takes the latex OUTPUT and produces html.

ChuckMcM · on Sept 16, 2012

Neat idea, make it generate epub and it moves the pdf->e-book ball a bit further down the field. Looking at the source to this page view-source:http://coolwanglu.github.com/pdf2htmlEX/demo/geneve.html it looks like you can't yet generate a font from the characters, rather it uses the 'font trick' to put images on the page. That makes the epub problem harder (which really really wants fonts not images it seems)

coolwanglu · on Sept 17, 2012

What do you mean no fonts? You can try to copy the text out, which is not possible if images are used.

ChuckMcM · on Sept 17, 2012

"           "

There I cut and pasted a quote from the document linked into this response. What do you see? I see a bunch of boxes.

coolwanglu · on Sept 17, 2012

If you use an HTML inspector and remove that mess, you'll find that the text on the HTML page disappears as well.

It's the problem of font encoding, which is one of the differences between PDF and HTML. Sometimes you cannot copy the text out of a PDF, even though it reads correctly.

I'm working on that problem. I've done it this way so far because I think visual accuracy is more important.

ChuckMcM · on Sept 17, 2012

"It's the problem of font encoding"

Yes, and that was exactly my comment. It would be really cool if the converter generated character code points for the characters on the screen. So that cutting and pasting did what you might expect. But to make that work you need to do some form of OCR on the document, figure out where the text is, and how it is composed, then you create a font which re-creates the look based on the imagery in the document and then you generate the CSS that lays down the text and decorates it with the font and re-create the visual of the PDF. (or make it an epub)

If you can get it to that point, there will be huge utility for folks who want to convert paper books to e-books. Because the typical scanner will generate PDF but the typical e-book will only flow e-pub (or .mobi or proprietary formats).

coolwanglu · on Sept 17, 2012

OCR is beyond the scope of pdf2htmlEX. I'm just trying to find out the real meaning of the glyphs through glyph names.

Actually, you should usually be able to select and copy text without problems if there are no Type 0 fonts.
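As a rough illustration of recovering meaning from glyph names, the Python sketch below follows the general naming conventions of the Adobe Glyph List (drop any suffix after a period, split ligature names on underscores, decode `uniXXXX` and `uXXXXXX` forms). This is not taken from pdf2htmlEX's source; `AGL_SUBSET` is a tiny hand-picked sample rather than the full list, and the function names are hypothetical.

```python
import re

# Tiny illustrative subset of the Adobe Glyph List (assumption: a real
# converter would load the complete table instead).
AGL_SUBSET = {
    "A": "\u0041",
    "space": "\u0020",
    "alpha": "\u03b1",
    "summation": "\u2211",
}

_UNI_NAME = re.compile(r"uni((?:[0-9A-F]{4})+)")  # uni + groups of 4 hex digits
_U_NAME = re.compile(r"u([0-9A-F]{4,6})")         # u + 4 to 6 hex digits


def _is_valid_code_point(cp):
    """Accept Unicode scalar values only (no surrogates)."""
    return 0 <= cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF


def component_to_text(component):
    """Map one glyph-name component to text, or '' if it is unknown."""
    if component in AGL_SUBSET:
        return AGL_SUBSET[component]
    match = _UNI_NAME.fullmatch(component)
    if match:
        digits = match.group(1)
        cps = [int(digits[i:i + 4], 16) for i in range(0, len(digits), 4)]
        if all(_is_valid_code_point(cp) for cp in cps):
            return "".join(chr(cp) for cp in cps)
        return ""
    match = _U_NAME.fullmatch(component)
    if match:
        cp = int(match.group(1), 16)
        return chr(cp) if _is_valid_code_point(cp) else ""
    return ""


def glyph_name_to_text(name):
    """Drop any '.suffix', split on '_', and map each component."""
    base = name.split(".", 1)[0]
    return "".join(component_to_text(part) for part in base.split("_"))
```

Names that carry no recognizable meaning (for example an arbitrary `g123`) map to an empty string here, which is exactly the case where copied text comes out wrong even though the page looks right.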

fpp · on Sept 16, 2012

Looks pretty neat - saving it as html file works great, but you can't print the docs (in Chrome print to pdf only shows a scroll bar, in FF it does not properly format).

Great work anyway - I'll have a deeper look.

coolwanglu · on Sept 17, 2012

Yeah, known issue. Currently I've no idea how to fix it :(

akie · on Sept 16, 2012

I'm very impressed. Can you post some more examples online, some non-technical PDFs for example? I'm curious how well it does 'generic' PDFs (for example magazine layouts).

coolwanglu · on Sept 16, 2012

What do you suggest? I don't have one in mind right now.

SeanDav · on Sept 16, 2012

Sorry for the very noob question, but how do you actually get this to run on a windows XP system?

I just want to run a quick test, but it seems I have to build the project - is that correct?

lectrick · on Sept 17, 2012

Step 1: Only use Windows for games. Fire up a *nix VM, fullscreen it and get real work done in the big boy's open source developer land :)

coolwanglu · on Sept 16, 2012

You can build it with Cygwin.

corry · on Sept 16, 2012

Very cool, definitely an area where there needs to be lots of work done.

What is browser compatibility like? Is IE8 supported?

Edit: removed reference to HTML5/canvas, didn't see any in the source HTML.

coolwanglu · on Sept 17, 2012

AFAIK, IE8 doesn't support enough HTML5 features, so no. IE9 should be OK.

guilloche · on Sept 16, 2012

Amazing. Can the same trick be used for latex=>html? It would be better than tth, which is also very good.

coolwanglu · on Sept 16, 2012

Of course, you can compile LaTeX to PDF first.
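A minimal Python sketch of that two-step route, assuming both `pdflatex` and `pdf2htmlEX` are installed and on the PATH. The helper names are hypothetical, and the PDF is assumed to land in the current working directory under the .tex file's base name, which is pdflatex's default behaviour.

```python
import subprocess
from pathlib import Path


def build_commands(tex_file):
    """Return the two commands: LaTeX -> PDF, then PDF -> HTML."""
    tex = Path(tex_file)
    # Assumption: pdflatex writes <basename>.pdf into the current directory.
    pdf_name = tex.with_suffix(".pdf").name
    return [
        ["pdflatex", "-interaction=nonstopmode", str(tex)],
        ["pdf2htmlEX", pdf_name],
    ]


def latex_to_html(tex_file):
    """Run both steps in order, stopping at the first failure."""
    for cmd in build_commands(tex_file):
        subprocess.run(cmd, check=True)
```

Documents with cross-references or a bibliography may need more than one pdflatex pass before the conversion step; this sketch runs it only once.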

neurostimulant · on Sept 16, 2012

Very cool! This is exactly what I need. I'm going to play with it for a while.

Genmutant · on Sept 16, 2012

Has somebody already built it for Windows and could upload the binary?

coolwanglu · on Sept 17, 2012

I've tried and succeeded with Cygwin, but I have no idea how to distribute the package with its dependencies.

coolwanglu · on Sept 17, 2012

Please try commit f02e1d4 if any of you cannot build it on a Mac.

dutchbrit · on Sept 16, 2012

What about complex vectors with gradients etc?

coolwanglu · on Sept 16, 2012

Just go ahead and try it; you'll be pleased.

additive · on Sept 16, 2012

Is it faster than pdftohtml?

coolwanglu · on Sept 17, 2012

Probably not, as font conversion is slow. pdftohtml does not extract fonts for now.

Evbn · on Sept 16, 2012

Technically impressive, but what systems can render HTML and JS but not PDF?

nilliams · on Sept 16, 2012

There are also various use-cases for doing this as part of a larger product. Say you need to take a customer's crappy PDFs & reformat them for display within a web app, on a public display or to send as an HTML email. You could use this tool, convert to HTML, then drop in your own CSS stylesheet to reformat it. If your customer had many of said crappy PDFs you could no doubt automate the whole process.

Needless to say I had to do something pretty similar recently, though I ended up having to ask the customer to provide better source data than the PDFs they initially sent. This tool could have been very useful at the time, hope to give it a spin soon.

rabidsnail · on Sept 16, 2012

Firefox and IE on Windows.

Evbn · on Sept 16, 2012

Is it possible to reflow the page, at least in simple cases like 2-column documents? That would be awesome for mobile.

coolwanglu · on Sept 16, 2012

That's beyond the scope of pdf2htmlEX.

mgualt · on Sept 16, 2012

Indeed, and I hope it stays out of scope forever. The idea of reflow is anathema to the idea of typesetting, as far as I can see.