PDF Mail-Merge in Haskell

August 27, 2007, at 09:51 PM

So, I've been taking a break from my other projects by working on a component that I'll eventually be needing, although the system it will be part of is nowhere near ready yet. This is essentially mail-merge for PDF, although I'll need to support both text and images. I'll also need to support multiple fonts.

At first I looked at Perl libraries; Perl has several that deal with PDFs. PDF::Reuse is almost exactly what I want, and nicely documented. Unfortunately, I didn't really understand how I was supposed to use it until I'd already read up on the PDF format itself, and by now I've reimplemented enough of its functionality that it would no longer be a time-saver.

Mostly, when I was looking at existing work, I ran into many other Perl modules which were not PDF::Reuse, and which were all very poorly documented, and finally I had to turn to the official PDF reference to understand what they were supposed to be doing. And then I realized that the reason it's so confusing is that most of these libraries handle just one piece of the problem - dealing only with the document's internal index, or only with generating content streams (the PostScript-like portion of the file), or only with parsing the syntax.

So I have a syntax parser/lexer; I put it together in Parsec. I'm starting to really dislike Parsec, but I have to concede that I seriously doubt I could have supported the full syntax in just a couple hours without it. I also wrote Show instances for everything, so that my PDF.Syntax module can both generate and consume. I even handled all the weird little escape things in the string syntax; sweet, huh? Biggest holdup with this part was that, ouch! Different strings can be in different text encodings! You can even mix-and-match line endings within the same file! So I had to poke deeper than I wanted to into Haskell's IO stuff in order to come up with something 8-bit-clean.
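To give a flavour of the "weird little escape things", here is a minimal sketch of writing and reading PDF literal strings. It is not the PDF.Syntax module from this post; the names escapePdfString and unescapeBody are made up for illustration, and it assumes a String holding one Char per byte (values 0-255), which is what you get from a handle opened in binary mode.

```haskell
import Data.Char (chr, digitToInt, isOctDigit, ord)
import Numeric (showOct)

-- | Render bytes as a PDF literal string, parentheses included.
-- Assumes every Char is a byte (0-255); larger values are truncated.
escapePdfString :: String -> String
escapePdfString s = "(" ++ concatMap esc s ++ ")"
  where
    esc '('  = "\\("
    esc ')'  = "\\)"
    esc '\\' = "\\\\"
    esc '\n' = "\\n"
    esc '\r' = "\\r"
    esc '\t' = "\\t"
    esc c
      | ord c < 32 || ord c > 126 = '\\' : octal3 (ord c `mod` 256)
      | otherwise                 = [c]
    -- Always emit three octal digits so a following digit is not swallowed.
    octal3 n = let o = showOct n "" in replicate (3 - length o) '0' ++ o

-- | Decode the text between the outer parentheses of a literal string.
unescapeBody :: String -> String
unescapeBody ('\\' : '\r' : '\n' : rest) = unescapeBody rest  -- line continuation
unescapeBody ('\\' : '\n' : rest)        = unescapeBody rest
unescapeBody ('\\' : '\r' : rest)        = unescapeBody rest
unescapeBody ('\\' : c : rest)
  | isOctDigit c =
      let (more, rest') = spanUpTo 2 isOctDigit rest
          n = foldl (\acc d -> acc * 8 + digitToInt d) 0 (c : more)
      in chr (n `mod` 256) : unescapeBody rest'
  | otherwise = maybe c id (lookup c simple) : unescapeBody rest
  where
    -- An unknown escape just drops the backslash.
    simple = [ ('n', '\n'), ('r', '\r'), ('t', '\t'), ('b', '\b'), ('f', '\f')
             , ('(', '('), (')', ')'), ('\\', '\\') ]
unescapeBody (c : rest) = c : unescapeBody rest
unescapeBody []         = []

-- | Like span, but takes at most n matching elements.
spanUpTo :: Int -> (a -> Bool) -> [a] -> ([a], [a])
spanUpTo n p xs = let (a, _) = span p (take n xs) in (a, drop (length a) xs)
```

For the 8-bit-clean part, one option in the standard System.IO module is openBinaryFile (or hSetBinaryMode on an existing handle), which turns off newline and encoding translation so each Char read is exactly one byte.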

Syntax is only half the issue. A PDF file, it turns out, is not a linear stream. Well, figures, right? Anyway, it has an index at the end of the file, the xref table, which gives a mapping of object IDs to byte offsets. I was thinking I would have to deal with space allocation as I deleted and replaced objects, and then regenerate the xref table from scratch, but it turns out that the design is nicer than that: You could do that, but you can also leave the existing document untouched and just append the new or replaced data, and another xref table which contains entries only for the changed things, plus a pointer back to the old one. Since all the internal data structures are built in terms of object IDs anyway, the ability to remap object IDs means you can do anything in append-only mode that you could do by writing an entire new file. Sweet, huh?
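The append-only trick can be sketched roughly like this. This is an illustration rather than the updater described here: incrementalUpdate and its parameters are hypothetical names, all objects are assumed to be generation 0, and the original file is assumed to end with a newline after its %%EOF. The caller supplies the original file length, the byte offset of the previous xref table, the new /Size (one more than the highest object number in use), and the /Root reference copied from the old trailer.

```haskell
import Data.List (sortOn)

type ObjNum = Int

-- | Serialize one indirect object (generation 0 assumed).
renderObject :: ObjNum -> String -> String
renderObject n body = show n ++ " 0 obj\n" ++ body ++ "\nendobj\n"

-- | One in-use xref entry; each entry is exactly 20 bytes long.
xrefEntry :: Int -> String
xrefEntry off = replicate (10 - length digits) '0' ++ digits ++ " 00000 n \n"
  where digits = show off

-- | Bytes to append to an existing PDF so that the given objects are
-- added or replaced. Nothing in the original file is modified.
incrementalUpdate
  :: Int                  -- ^ length of the original file in bytes
  -> Int                  -- ^ byte offset of the previous xref table
  -> Int                  -- ^ new /Size
  -> String               -- ^ /Root reference, e.g. "1 0 R"
  -> [(ObjNum, String)]   -- ^ new or replaced objects and their bodies
  -> String
incrementalUpdate origLen prevXref size root objs =
  concat rendered ++ xref ++ trailer
  where
    rendered = [renderObject n b | (n, b) <- objs]
    -- Running byte offsets: each object starts where the previous one ended.
    offsets  = scanl (+) origLen (map length rendered)
    xrefPos  = last offsets
    located  = sortOn fst (zip (map fst objs) offsets)
    -- One single-entry subsection per changed object keeps the sketch simple.
    xref     = "xref\n"
            ++ concat [show n ++ " 1\n" ++ xrefEntry off | (n, off) <- located]
    trailer  = "trailer\n<< /Size " ++ show size
            ++ " /Root " ++ root
            ++ " /Prev " ++ show prevXref
            ++ " >>\nstartxref\n" ++ show xrefPos ++ "\n%%EOF\n"
```

A reader starts from the last startxref, finds the new table, and follows /Prev back to the old one for any object not listed. A real updater would probably also carry over other trailer keys such as /Info.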

It took me a while to get that working, since it was hard to see where the problem in my generated output was when all I got was either total success or total failure when I tried to use the resulting file. So to do it in pieces, first I tried making a PDF entirely by hand. Ouch! Painful! Had to use a hex editor to find offsets into what was otherwise more or less a text file! So I wrote some code to generate a trivial PDF, and refactored that until it turned into a PDF updater. Which now works.
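For anyone wanting to skip the hex editor, a trivial generator along those lines might look like the sketch below. It is a guess at the general shape, not the code from this post; buildPdf, writePdf and onePage are illustrative names, objects are numbered 1..n in list order with object 1 as the catalog, and everything is generation 0.

```haskell
import System.IO (IOMode (WriteMode), hPutStr, withBinaryFile)

-- | Build a complete PDF from object bodies, computing the xref
-- offsets so nobody has to count bytes by hand.
buildPdf :: [String] -> String
buildPdf bodies = header ++ concat objs ++ xref ++ trailer
  where
    header  = "%PDF-1.4\n"
    objs    = [ show n ++ " 0 obj\n" ++ b ++ "\nendobj\n"
              | (n, b) <- zip [1 :: Int ..] bodies ]
    -- Offset of each object, followed by the offset of the xref table.
    offsets = scanl (+) (length header) (map length objs)
    xrefPos = last offsets
    count   = length bodies + 1          -- plus the free entry for object 0
    xref    = "xref\n0 " ++ show count ++ "\n"
           ++ "0000000000 65535 f \n"
           ++ concatMap entry (init offsets)
    entry off = let d = show off
                in replicate (10 - length d) '0' ++ d ++ " 00000 n \n"
    trailer = "trailer\n<< /Size " ++ show count ++ " /Root 1 0 R >>\n"
           ++ "startxref\n" ++ show xrefPos ++ "\n%%EOF\n"

-- | A one-page, blank, US-Letter-sized document.
onePage :: String
onePage = buildPdf
  [ "<< /Type /Catalog /Pages 2 0 R >>"
  , "<< /Type /Pages /Kids [3 0 R] /Count 1 >>"
  , "<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] /Resources << >> >>"
  ]

-- | Write in binary mode so no newline or encoding translation
-- shifts the byte offsets.
writePdf :: FilePath -> String -> IO ()
writePdf path pdf = withBinaryFile path WriteMode (\h -> hPutStr h pdf)
```

Calling writePdf "blank.pdf" onePage should produce a file to open in a viewer; keeping the bodies this small makes it easier to see which piece broke.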

Right now, after basically eight hours of coding, what I have it doing is opening up a file and drawing a predefined graphic either on top of or below every page. I took the example code for the graphic from Adobe's reference.

Making it generate text will be easy; it just has to output a content stream with the appropriate command. Making it do so in any given font won't be much harder; the document contains copious metadata, and it just has to look up what fonts the document already has embedded, and then decide which to use.
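As a rough idea of what "the appropriate command" looks like, here is a small sketch that builds a text-drawing content stream. The function names are made up, and the font resource name (say "F1") is an assumption: it has to match an entry in the page's /Resources /Font dictionary, which is exactly the lookup described above. Coordinates are in points from the bottom-left corner of the page.

```haskell
-- | Escape the characters that are special inside a literal string.
escapeLiteral :: String -> String
escapeLiteral = concatMap esc
  where
    esc c | c `elem` "()\\" = ['\\', c]
          | otherwise       = [c]

-- | Operators that draw one line of text.
textOps :: String -> Int -> (Int, Int) -> String -> String
textOps font size (x, y) txt = unlines
  [ "BT"                                        -- begin text object
  , "/" ++ font ++ " " ++ show size ++ " Tf"    -- select font and size
  , show x ++ " " ++ show y ++ " Td"            -- move to the start position
  , "(" ++ escapeLiteral txt ++ ") Tj"          -- show the string
  , "ET"                                        -- end text object
  ]

-- | Wrap operators in a stream object body. /Length counts only the
-- bytes between "stream\n" and the newline before "endstream".
streamObject :: String -> String
streamObject ops =
  "<< /Length " ++ show (length ops) ++ " >>\nstream\n" ++ ops ++ "\nendstream"
```

For example, streamObject (textOps "F1" 12 (72, 720) "Hello") gives an object body that can be added to the page's /Contents.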

Making it generate graphics will be a bit harder. Apparently PDF supports some of the same codecs that popular image formats are based on, particularly JPEG. So I have a choice of whether to slice-and-dice the input image and just embed its existing compressed chunks into the PDF, or whether to uncompress the image, convert it to a different format, and re-encode it. There are existing tools for both. The former apparently gives a very significant space savings, but is a lot more fiddly and there are fewer examples for me to work from. So, I'm reading up on that.
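For the JPEG case specifically, the embed-as-is route might look something like the following sketch, since PDF's DCTDecode filter accepts JPEG data directly. This is one possible approach under stated assumptions, not a finished implementation: jpegInfo and jpegXObject are illustrative names, the input is a String of bytes, only 8-bit grayscale and RGB images are handled, and the size is read from the first start-of-frame marker.

```haskell
import Data.Char (ord)

-- | Find (width, height, components) by walking the JPEG marker segments.
jpegInfo :: String -> Maybe (Int, Int, Int)
jpegInfo ('\xFF' : '\xD8' : rest) = go rest
  where
    go ('\xFF' : m : hi : lo : body)
      | m == '\xFF'   = go ('\xFF' : hi : lo : body)     -- skip a fill byte
      | isSOF (ord m) = case body of
          (_precision : h1 : h2 : w1 : w2 : n : _) ->
            Just (be w1 w2, be h1 h2, ord n)
          _ -> Nothing
      -- The segment length includes its own two bytes.
      | otherwise     = go (drop (be hi lo - 2) body)
    go _ = Nothing
    be a b  = ord a * 256 + ord b                         -- big-endian 16-bit
    isSOF m = m >= 0xC0 && m <= 0xCF && m `notElem` [0xC4, 0xC8, 0xCC]
jpegInfo _ = Nothing

-- | Body of an image XObject that embeds the JPEG bytes unchanged.
-- Returns Nothing for anything other than grayscale or RGB.
jpegXObject :: String -> Maybe String
jpegXObject jpg = do
  (w, h, n) <- jpegInfo jpg
  cs <- lookup n [(1, "/DeviceGray"), (3, "/DeviceRGB")]
  return $ "<< /Type /XObject /Subtype /Image"
        ++ " /Width " ++ show w ++ " /Height " ++ show h
        ++ " /ColorSpace " ++ cs
        ++ " /BitsPerComponent 8"        -- assumption: baseline 8-bit samples
        ++ " /Filter /DCTDecode"
        ++ " /Length " ++ show (length jpg)
        ++ " >>\nstream\n" ++ jpg ++ "\nendstream"
```

The page then needs the object listed under /Resources /XObject and a content stream that scales and paints it, since an image is drawn into a 1x1 unit square by default.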

I also haven't decided yet whether I want to store image files as blobs within the database (which means I can take advantage of the version-control I'll be implementing for the rest of the database, and users won't have to explicitly think about moving them around), or as separate files (which would make the database smaller). Well, plenty of time to think about that. The PDF code will be designed so it doesn't care where its data is coming from, anyhow.

Shark Rules!

August 15, 2007, at 07:22 AM

Shark, for those who don't know, is Apple's profiling tool. You don't need to run in their IDE to use it, which is good... It's impressed me in many ways with both its precision and its versatility, but what it did for me just now was particularly cool.

A website that I run was being slow. So I start up Shark and record a trace as I load the page from my test server, running on the local system. I don't want to assume without checking that the problem is in my code and not in, say, the Apache config causing things to be done in a slower way than necessary, so I tell it to record the whole system.

So it records the whole system and chews on its data and pops up a box informing me that the most-time-consuming function in the most-time-consuming process is Perl_utf8_length, with a self time of 41% of the total userspace time of that process during the sampling period. Hah! After a moment of thought I realize that this just confirms what I suspected anyway, namely that XML::SAX::PurePerl is way too damn slow, but still, it's nice to know that for sure.

But the slick part of all this is in what I didn't have to do. Notice how I didn't tell it the path to my script. In fact I just let it loose on the whole system and it realized that the program that was thinking hardest was probably the one I was most interested in hearing about, so it showed me that one first - with a drop-down menu to see all the others. It wasn't able to decipher the Perl call-stack, only the Perl runtime library, so I couldn't tell the name of the Perl function that was calling Perl_utf8_length, but that's still helpful; it's clearly something parsing-related. And notice also that I didn't have to tell it how to find the debug symbols for that systemwide library. Or even tell it to go fetch them at all; it did that on its own.

There was one other amusing thing I noticed, although it wasn't useful to me this time. I wondered whether IO operations might be the main expense, so I looked at the special process which is used to account for time spent in kernelspace - and it informed me that something like 90% of kernel time was spent in the library AppleIntelCPUPowerManagement. After a moment of thought, I realized this meant it was sleeping in the idle loop. (I have two CPUs, and parsing is inherently serial, so one CPU can still be idle while the other is working hard.) Pause for a moment to reflect on how slick this is. It wasn't able to get me the symbols from that driver, although it does get symbols from the kernel at large. But it continued its sampling uninterrupted (pardon the pun!), even in kernelspace, even going the extra mile to figure out what driver the PC (the program counter) was in.

Usability and stuff is all great too, but let me tell you, it's things like this that really warm my heart towards Apple.