Embedding Images into a PDF
September 04, 2007, at 10:29 PM
So, I've been working on embedding image files into a PDF. This could be done by uncompressing them and storing them in any convenient codec, but there are two good reasons and one bad one not to do that. A shame, because it would potentially be a lot less work; I could use the existing Haskell binding against the GD library to get at uncompressed pixel data.
First, it would potentially be substantially larger - ever seen a PDF that was tens of megs just for a couple pages? That's probably because somebody was careless about image encoding.
Second, when converting from a lossy compression format, recompressing will cause quality loss.
Third, and this would trump the other concerns anyhow, doing it this way gives me a chance to learn neat stuff.
You would really think there would be a nice library to do this stuff already, since Adobe specifically makes note of the possibility and all that, but there's not... I found a nice paper that actually surveys a lot of open-source and some closed-source libraries and concludes that almost all of them do it the lazy way.
JPEGs turn out to be really easy. One of the built-in codecs can deal with an embedded stream that's simply an entire JPEG file. It even extracts the image size and other parameters from that stream, instead of you having to specify them explicitly like with every other codec. So I've already got that working. Of course, if time were at all a consideration, I would just declare it done at this point and have the users convert all images to JPEG beforehand, but since it's not...
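As a rough illustration of the JPEG case, here is a minimal Python sketch that wraps an unmodified JPEG file in a PDF image object using the DCTDecode filter. The function names are made up for this example. The PDF specification lists /Width, /Height, /ColorSpace and /BitsPerComponent as entries of the image dictionary, so the sketch reads them from the JPEG's frame header rather than leaving them out; it assumes 8-bit samples and does not handle object numbering or the rest of the PDF file.

```python
import struct

# Start-of-frame markers carry the image dimensions.
# 0xC4 (DHT), 0xC8 (JPG) and 0xCC (DAC) are not frame headers.
SOF_MARKERS = {0xC0, 0xC1, 0xC2, 0xC3, 0xC5, 0xC6, 0xC7,
               0xC9, 0xCA, 0xCB, 0xCD, 0xCE, 0xCF}

def jpeg_frame_info(data: bytes):
    """Return (width, height, components) from the first SOF segment."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing SOI marker")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("expected a marker at offset %d" % pos)
        marker = data[pos + 1]
        if marker == 0xFF:            # fill byte before the real marker
            pos += 1
            continue
        if marker == 0xDA:            # start of scan: no frame header seen
            break
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if marker in SOF_MARKERS:
            # Segment body: precision(1) height(2) width(2) components(1)
            _prec, height, width, ncomp = struct.unpack(
                ">BHHB", data[pos + 4:pos + 10])
            return width, height, ncomp
        pos += 2 + length             # skip marker plus segment
    raise ValueError("no SOF segment found")

def pdf_jpeg_xobject(data: bytes) -> bytes:
    """Wrap the whole JPEG file, byte for byte, as a PDF image stream."""
    width, height, ncomp = jpeg_frame_info(data)
    colorspace = {1: "/DeviceGray", 3: "/DeviceRGB", 4: "/DeviceCMYK"}[ncomp]
    header = (
        "<< /Type /XObject /Subtype /Image"
        " /Width %d /Height %d"
        " /ColorSpace %s /BitsPerComponent 8"   # assumption: 8-bit samples
        " /Filter /DCTDecode /Length %d >>\nstream\n"
        % (width, height, colorspace, len(data))
    ).encode("ascii")
    return header + data + b"\nendstream"
```

The point is that the stream body is the JPEG file itself: nothing is decoded or recompressed along the way.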
Other formats are a bit more work. PDF supports the basic codecs used for GIF - which is of course the famous LZW, finally out of patent as of a couple years ago - and PNG, which, to my surprise, is the old familiar Deflate codec that's used by both zip and gzip. Unfortunately, the pixel data isn't the only thing in the file; GIF is always a palettized format and PNG can be one, so you need to extract the palette, too.
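To make the palette step concrete, here is a small Python sketch, with function names invented for this example, that pulls the global color table out of a GIF and the PLTE and IDAT chunks out of a PNG. It only covers the simple cases: GIF local color tables are ignored and PNG CRCs are skipped rather than checked. In a PDF, a palette like this would typically end up as the lookup table of an /Indexed color space.

```python
import struct

def gif_global_palette(data: bytes):
    """Return the GIF global color table as a list of (r, g, b) tuples."""
    if data[:6] not in (b"GIF87a", b"GIF89a"):
        raise ValueError("not a GIF file")
    # Logical Screen Descriptor: width, height, packed flags, bg index, aspect
    _w, _h, packed, _bg, _aspect = struct.unpack("<HHBBB", data[6:13])
    if not packed & 0x80:
        return []                      # no global table in this file
    size = 2 << (packed & 0x07)        # table holds 2^(N+1) entries
    table = data[13:13 + 3 * size]
    return [tuple(table[i:i + 3]) for i in range(0, len(table), 3)]

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def png_chunks(data: bytes):
    """Yield (type, payload) for each chunk; CRCs are skipped, not verified."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    pos = 8
    while pos + 8 <= len(data):
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        yield ctype, data[pos + 8:pos + 8 + length]
        pos += 12 + length             # length + type + payload + CRC

def png_palette_and_pixels(data: bytes):
    """Return (raw PLTE bytes, concatenated IDAT bytes)."""
    palette, idat = b"", []
    for ctype, payload in png_chunks(data):
        if ctype == b"PLTE":
            palette = payload          # packed r, g, b triples
        elif ctype == b"IDAT":
            idat.append(payload)       # pieces of one Deflate (zlib) stream
    return palette, b"".join(idat)
```

Note that the concatenated IDAT data is still row-filtered the PNG way, so a PDF stream that reuses it as-is would also have to describe that filtering in its decode parameters.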
This is relatively simple stuff as it goes... I'm most of the way through doing it for GIF now. I got a major boost when I finally found a source of working implementations I could look at: a couple of small utilities that come with Ghostscript, written as command-line tools and themselves implemented in PostScript (yes...). It's a little awkward reading that, since the PDF side of things kinda happens automagically, but at least I know it works, so it's useful.
One other thing that actually is kinda important: EPS. Encapsulated PostScript is a vector format. You might think that since it's so closely related to PDF, you could somehow convert it directly, but you can't, because they have different built-in operator sets (in particular, PDF has no flow-control constructs). You could rasterize the EPS, but of course it would be much nicer to keep it resolution-independent, not to mention smaller.
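To show what the lack of flow control means in practice: a PostScript program can draw three vertical rules with a loop such as `0 10 20 { 0 moveto 0 100 rlineto stroke } for`, while a PDF content stream has to spell out every path explicitly. The Python sketch below (the function name and the choice of drawing are just for illustration) does the unrolling on the generating side and emits only plain PDF path operators.

```python
def vertical_rules(count: int, spacing: float, height: float) -> str:
    """Emit PDF path operators for `count` vertical lines, one set per line."""
    ops = []
    for i in range(count):
        x = i * spacing
        ops.append("%g 0 m" % x)              # m: begin a subpath (moveto)
        ops.append("%g %g l" % (x, height))   # l: straight segment (lineto)
        ops.append("S")                       # S: stroke the path
    return "\n".join(ops)

print(vertical_rules(3, 10, 100))
```

The loop lives in whatever program writes the PDF; the PDF itself only ever sees the finished list of drawing operations.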
I wasn't sure how to deal with EPS, but then I realized that Ghostscript can be used as a library, and in that mode you can run it without any support files installed anywhere. So you can just statically link it into the distribution and be done with it. Have Ghostscript convert the EPS to a PDF, then extract the content stream of the PDF, and use it directly as a content stream in your own PDF. PostScript and PDF do have very similar rendering operations; they're just invoked differently. So in effect this is running the program contained in the EPS, but trapping all the graphics calls it makes, and, instead of running them, outputting equivalent calls in PDF vocabulary. If the EPS contained a program to compute the entire Mandelbrot set, this would have disastrous results, but “don't do that, then.”
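For a feel of the conversion step, here is a minimal Python sketch that drives the Ghostscript command-line executable rather than the library interface described above; that swap is only to keep the example short. The function names are made up, and the executable name `gs` is an assumption (it differs on Windows). Extracting the content stream from the resulting PDF is not shown.

```python
import subprocess

def ghostscript_eps_to_pdf_command(eps_path: str, pdf_path: str, gs: str = "gs"):
    """Build the argument list for converting one EPS file to a PDF."""
    return [
        gs,                          # assumption: Ghostscript is on the PATH
        "-q",                        # suppress startup messages
        "-dNOPAUSE", "-dBATCH",      # run non-interactively, exit when done
        "-dSAFER",                   # restrict file access from the EPS program
        "-sDEVICE=pdfwrite",         # emit PDF instead of rendering pixels
        "-dEPSCrop",                 # size the page to the EPS bounding box
        "-sOutputFile=" + pdf_path,
        eps_path,
    ]

def eps_to_pdf(eps_path: str, pdf_path: str, gs: str = "gs") -> None:
    """Run Ghostscript; raises CalledProcessError if the conversion fails."""
    subprocess.run(ghostscript_eps_to_pdf_command(eps_path, pdf_path, gs),
                   check=True)
```

Since the EPS is executed as a program during conversion, it is probably worth putting a timeout on the call for untrusted input.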
For the Larpbase project, I was thinking EPS files could be used for the heart and energy logos, although on close examination they appear to be high-res raster images right now. At any rate, it's an interesting enough challenge that I'm going to make it work one way or another!