Server clinic: PDF for the server
Contents:
Programmatic PDF generation
PDF's "Hello, world"
Putting together PDF pieces
Conclusion
Resources
About the author
Automate generation of professional-quality output

Cameron Laird (claird@phaseit.net)
Vice president, Phaseit, Inc.
September 2002

PDF is the recognized standard for several categories of top-quality displayable output. While most programmers regard it as a "desktop" technology, a format that a content specialist chooses through a SaveAs operation, you can make your document management processes more powerful through server-side automation of PDF creation. This month, Cameron introduces the ReportLab library for PDF management and programming.

You know PDF. When someone in marketing wants a brochure that looks "just so," or legal needs a document that shouldn't be changed, they publish it as Portable Document Format (PDF). PDF is a standard defined by Adobe Systems for platform-independent, device-independent rendering and display of documents. PDF builds on the fantastic success of Adobe's PostScript (PS), first released in 1984 to improve the printing sophistication possible with common hardware. In principle, PDF has a fixed appearance, invariant across different Web browsers and different devices including printers; the content of PDF documents is "locked down."

While neither of these propositions is strictly true, they're close enough for most purposes. Moreover, PDF generally prints well; only a plain text document is more likely to be compatible with any particular printer.

What does that have to do with you? As a systems or server-side programmer, perhaps you think of PDF as just another opaque content type. Your desktop users or document specialists occasionally update instances on your servers, and you serve up the files just as you would any others. That, you say, should be the limit of your involvement.

Programmatic PDF generation
That model misses out on several interesting server-side possibilities for processing PDF, though. When you automate generation of PDF, you can begin to use all the techniques of software engineering: version control, abstraction, professional-quality backups, regression tests, and so on. Programmatic PDF generation means you can customize deliverables in a manageable way. Perhaps your organization's habit with PDF is to have someone adept with a particular desktop word processor set up a "mail merge" sort of operation to parameterize document output. Automation can reach far deeper, though.

Desktop software vendors have a partial appreciation of this. Several word-processing or desktop-publishing packages have scripting capabilities that reach at least part of the way to PDF. Some shops create PostScript images and transform them into PDF with Ghostscript or similar packages.

My favorite way to automate PDF generation, though, is with one of three actively maintained open source libraries: ReportLab, PJ, and PDFlib. They're all roughly comparable, and I've had medium to good success on projects that relied on each. Pointers to all three, along with several other tools, appear in Resources, below.

Among these, ReportLab is the one I currently use most: it handles the multi-megabyte PDFs with which I work, its exposure of Python as a scripting language suits me, its library includes all the functionality I need for daily work, and the ReportLab company behind the library appears to have a sustainable business. Moreover, its convenient integration into the Python interactive shell makes for a delightfully productive development environment. The rest of this month's "Server clinic" illustrates how you can start to program PDF.

PDF's "Hello, world"
While you probably have a good Python installation on your servers already, Python.org's download page can help ensure that you're current. Version 2.2.1 is a good choice.

With Python installed, you need to visit the ReportLab Download page before you begin your PDF programming career. Even over slow connections, downloads and installations of both Python and the ReportLab Toolkit take well under an hour (see Resources for links to both downloads).

The source code for your first application can be as simple as this:

Source code for a "Hello, world" page

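      from reportlab.pdfgen import canvas
      from reportlab.lib.units import inch

      # Everything that controls the page's appearance is an ordinary variable.
      font = "Helvetica"
      font_size = 26
      text = "Hello, world"
      # Coordinates are measured from the lower-left corner of the page.
      x = 5.0 * inch
      y = 8.0 * inch
      destination_file = "/tmp/first.pdf"

      my_canvas = canvas.Canvas(destination_file)
      my_canvas.setFont(font, font_size)
      # drawRightString ends the text at x, so the headline is right-aligned.
      my_canvas.drawRightString(x, y, text)
      # save() writes the finished document to destination_file.
      my_canvas.save()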

This code simply puts a headline on an otherwise blank piece of paper. While mundane, it hints at new horizons: font style and size, content, and formatting are all programmable. When your organization decides to publish in Times New Roman rather than Helvetica, you can, in principle, change one configuration assignment and regenerate everything, rather than having to open each of thousands of documents, alter them, and write them back out. The same is true for other effects: if you want to enlarge the type on information targeted to older readers, for instance, your application can automate that.

Don't think you have to develop your own word processor to accomplish anything meaningful, though. While the ReportLab library is broad and deep enough to allow that, it also supports a couple of specific shortcuts that enormously simplify my PDF programming. First is the import_HTML method. This renders valid HTML source into PDF pages. For many applications, I find it convenient to prototype in HTML, get "stakeholder sign-off" for a sample document, parameterize the HTML generation, then complete an implementation with:

my_document.import_HTML(my_html_source)

This gives me a very fast, easily maintained, fully programmatic way to pour content into PDF. ReportLab's processing efficiency is so good that I can comfortably generate all kinds of PDF documents for Web display on the fly. This gives me the opportunity to keep critical financial or engineering reports fully current with the latest data while preserving an appropriate visual appearance. Print documents enjoy the same choices for customization, of course.
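
The "parameterize the HTML generation" step can be sketched with nothing more than the Python standard library. This is only an illustration under stated assumptions: it targets Python 3, the template text and the build_report_html helper are hypothetical, and it stops at producing the HTML string; handing that string to a PDF renderer is the separate step described above.

```python
from html import escape
from string import Template

# Hypothetical report layout, approved once as a static HTML sample.
REPORT_TEMPLATE = Template("""\
<html>
  <body>
    <h1>$title</h1>
    <p>Prepared for $customer on $date.</p>
    <p>Balance due: $balance</p>
  </body>
</html>
""")


def build_report_html(title, customer, date, balance):
    """Fill the approved layout with current data, escaping HTML-special characters."""
    return REPORT_TEMPLATE.substitute(
        title=escape(title),
        customer=escape(customer),
        date=escape(date),
        balance=escape(balance),
    )


my_html_source = build_report_html(
    "Account summary", "Smith & Sons", "2002-09-30", "$1,250.00"
)
```

Escaping each value before substitution keeps data such as "Smith & Sons" from turning the result into invalid HTML, which matters when the renderer expects valid input.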

Putting together PDF pieces
A second crucial library function is copyPages. It appends an existing PDF document to a Canvas instance. copyPages makes it easy to construct a PDF document as a concatenation of several pieces.

For more sophisticated effects, ReportLab, like other PDF tool vendors, licenses a for-fee product. In ReportLab's case, its PageCatcher product annotates existing PDF documents, reorders their pages, reformats them for different printing methods, adds backgrounds (including watermarks), and fills in PDF forms. ReportLab documents several interesting uses for PageCatcher. One example is programmatic preparation of completed Internal Revenue Service (IRS) forms.

A final ReportLab capability I've found important is its management of Tables of Contents. Online document readers appreciate these navigational aids, which Adobe calls "bookmarks" or "outlines." Most PDF viewers show these as menus in a left-hand window. The ReportLab Reference itself constitutes a nice example of a bookmarked document. Such ReportLab functions as copyPages include an option to import an outline properly into a larger document, or discard it.

Conclusion
Whenever a computing job seems tedious or error-prone -- updating documents "by hand," for example -- you should be on the lookout for a way to automate the process. If you have questions about how far this attitude can take you, refer back to my review of the Limoncelli and Hogan book ("Server clinic", May 2002). Those authors even constructed the drafts for their book as the output of make processes. Although many systems programmers don't seem to realize it, management of PDF documents presents rich opportunities for automation and abstraction. Use the ReportLab libraries or other available PDF-savvy tools to teach your server to do your PDF work. That should free your time for more productive pursuits.

Future installments of "Server clinic" are likely to touch on other underappreciated fields for server-side automation, including generation of Excel and Word documents.

Disclaimer: I'm on cordial personal terms with the employees of several companies that specialize in PDF-related products. However, I've never had a financial interest in any of the companies, nor any contractual relationship other than as an ordinary customer.

Resources

PDF is a biiiiig subject. You don't have to know all of it, though, to begin a successful automation project. The resources below are more than sufficient to get you started.

About the author
Cameron is a full-time consultant for Phaseit, Inc., who writes and speaks frequently on open source and other technical topics. You can contact Cameron at claird@phaseit.net.

