IBM developerWorks : Linux : Linux articles
developerWorks
Server clinic: RTF on the server
Automate document handling with low-cost server processes

Cameron Laird (claird@phaseit.net)
Vice president, Phaseit, Inc.
December 2002

Office workers habitually exchange documents encoded in Microsoft Word .DOC format. An abundance of open-source tools makes it feasible to automate management of their content.

"PDF for the server" was one of the more popular columns in this series. More precisely, it's the one that inspired the most e-mail in response. Several readers asked that Server clinic treat Microsoft Word documents the same way: describe how to manage them programmatically.

It's important to do so. Few office workers get the point of automation, despite the large investments Microsoft and others have made in scripting, "active documents," and related technologies. "Civilians" are largely habituated to routines of typing in data that come from computer print-outs. I see plenty of workplaces where it's unusual even to question such practices.

On the other side, many of the system programmers with the expertise to help end-users integrate complex work-flows don't regard Microsoft Word formats as feasible targets for server-side programming. Commercial document management packages are available, but only at prices of $20,000 and up.

In fact, there's plenty you can do with Word documents on a Linux or other UNIX server, at a modest cost. Consider the possibilities:

Simplest first
First, for quick human readability, rough word counts, and so on, it's often enough to scan a .DOC document with strings. A command such as

strings something.doc | wc -w

generally returns a word count within 10% of the correct one.
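
To apply the same estimate to several files at once, the one-liner can be wrapped in a small script. What follows is only a sketch: the function names rough_word_count and printable_runs are illustrative, and the tr/awk fallback for hosts without strings is an assumption of this example, not part of the approach described above.

```shell
#!/bin/sh
# Rough word counts for one or more Word files, using the strings | wc -w idea.

# Print runs of four or more printable characters, one run per line.
# Assumption: when strings(1) is not installed, approximate it with tr and awk.
printable_runs() {
    if command -v strings >/dev/null 2>&1; then
        strings "$1"
    else
        LC_ALL=C tr -c '[:print:]' '\n' < "$1" | awk 'length($0) >= 4'
    fi
}

# Print "COUNT FILE" for each file named on the command line.
rough_word_count() {
    for f in "$@"; do
        # tr strips the padding some wc implementations put before the number.
        n=$(printable_runs "$f" | wc -w | tr -d ' ')
        printf '%s %s\n' "$n" "$f"
    done
}

rough_word_count "$@"
```

Saved under a name of your choosing and run as, say, sh wordcount.sh *.doc, this prints one estimate per file. The counts are no more accurate than the one-liner's; the script only removes the retyping.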

It's surprisingly difficult to improve on that crude approach. The heart of the problem is that .DOC, as a format, has changed quite a bit through the years. It's hard to track.

The related .RTF has several advantages: it's encoded in ASCII and is almost human-readable, and it's far less likely to bear effective infections. Moreover, it appears to have been less volatile through the years; a reader from 1997 will probably be able to digest a .RTF written this year, and vice-versa. On some of the networks I manage, I restrict traffic to exclude .DOC in favor of .RTF as prophylaxis against malicious code. In principle, this deprives users of certain word-processing features only available with .DOC. As a practical matter, I have never had a user who actually uses an effect that can't be achieved with .RTF.
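
A filter of that kind should not trust file extensions, since a binary Word file renamed to .rtf still opens in Word. The sketch below classifies a file by its first bytes instead. It relies on two facts: RTF files begin with the ASCII characters {\rtf, and binary .DOC files live in an OLE2 compound-file container whose first eight bytes are D0 CF 11 E0 A1 B1 1A E1. The function name sniff_format is illustrative. Note that the OLE2 container is shared with other binary Office formats, so "ole2" here means "not RTF, possibly .DOC" rather than a positive identification.

```shell
#!/bin/sh
# Classify a file by its leading bytes rather than by its extension.
# Prints "rtf", "ole2" (the container used by binary .DOC files), or "other".
sniff_format() {
    # First eight bytes as lowercase hex digits with no separators.
    sig=$(od -An -tx1 -N 8 "$1" | tr -d ' \n')
    case $sig in
        d0cf11e0a1b11ae1) echo ole2 ;;
        7b5c727466*)      echo rtf ;;    # "{\rtf" in ASCII
        *)                echo other ;;
    esac
}

# Report on every file named on the command line.
for f in "$@"; do
    printf '%s: %s\n' "$f" "$(sniff_format "$f")"
done
```

How the result is used (rejecting an attachment, quarantining it, or converting it) depends on the mail or proxy software in place, which this sketch does not assume.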

The Resources section, below, lists several lightweight Word readers: wvWare, catdoc, and so on. These are generally quick and easy both to install and to use. Most UNIX desktop users now know that OpenOffice-on-UNIX is entirely viable as a replacement for common uses of Windows Word, and is quite adept at both reading and writing .DOC documents. OpenOffice exposes scriptable interfaces that make it feasible to program document content in Java, C++, Python, OpenOffice.org Basic, StarScript, CORBA, or OLE Automation. OpenOffice also integrates macro recording with this technology. Essentially the same is true of the commercially licensed StarOffice(TM) product.

In fact, while StarOffice is formally distinct from OpenOffice, this column focuses entirely on the latter, as "[f]uture versions of StarOffice software, beginning with 6.0, will be built using the OpenOffice.org source, APIs [application programming interfaces], file formats, and reference implementation", according to the latter's Web site (see Resources). In future OpenOffice implementations, "UNO (Universal Network Objects) is the interface-based component model."

OpenOffice is quite a "heavy" way to work with Word documents, though. It requires at least a graphical user interface (GUI) service, often a rather delicate installation, and multiple programmed processes. XML-oriented "formatting objects" (FO) is much the same: while powerful, it demands quite a bit of machinery be in place before it begins to work. If you're out to do the simple kinds of operations I most often encounter -- generate an .RTF invoice of a fixed format, "scrape" an incoming weekly status report, customize a Web download with reader-specific information, that sort of thing -- you should look into the direct language bindings of .RTF libraries. Best among these is Robert Rothenberg's Perl API.

RTF::Document
For the simplest .RTF generation, cutting and pasting is enough. You can parameterize production of a form like Figure 1 with a shell script.

Listing 1. Source code for invoice.sh (partial)
   
    #!/bin/sh
    
    AMOUNT="1234.56"
    DATE="06 October 2002"
    NUMBER="9999/3333"
    PO="6543"
    
    FORM="{\rtf1\ansi\deff0\deftab720{\fonttbl...
    \par \pard\plain\f3\fs20 
    \par \pard\qr\plain\f2\fs24\cf0 $DATE
    \par \pard\plain\f2\fs24\cf0 Phaseit, Inc.
    \par #$NUMBER
    \par 
    \par Please pay \$$AMOUNT to
	...
    

Figure 1. Screen shot of simple Word document generated on Linux server
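
Because Listing 1 is abridged, here is a separate, complete sketch of the same idea. It is not a reconstruction of invoice.sh: the font table and layout are a minimal set chosen for this example, and the names make_invoice and rtf_escape are illustrative. The escaping step is an addition; it matters as soon as the values come from outside the script, because RTF gives special meaning to the backslash and to both braces.

```shell
#!/bin/sh
# Minimal sketch of a parameterized RTF invoice; not the abridged Listing 1.

# Escape the three characters RTF treats specially in text: \ { }
rtf_escape() {
    printf '%s\n' "$1" | sed -e 's/\\/\\\\/g' -e 's/{/\\{/g' -e 's/}/\\}/g'
}

# make_invoice AMOUNT DATE NUMBER PO  (writes RTF to standard output)
make_invoice() {
    amount=$(rtf_escape "$1")
    inv_date=$(rtf_escape "$2")
    number=$(rtf_escape "$3")
    po=$(rtf_escape "$4")
    # Unquoted here-document: $variables expand, \$ yields a literal dollar
    # sign, and the other backslashes (RTF control words) pass through as-is.
    cat <<EOF
{\rtf1\ansi\deff0
{\fonttbl{\f0\froman Times New Roman;}}
\pard\qr\f0\fs24 $inv_date\par
\pard\ql Phaseit, Inc.\par
Invoice #$number\par
Purchase order $po\par
\par
Please pay \$$amount.\par
}
EOF
}

make_invoice "1234.56" "06 October 2002" "9999/3333" "6543"
```

Redirecting the output to a file with an .rtf extension gives a document that Word and other RTF readers should open directly. To produce one invoice per customer, call make_invoice in a loop over rows from whatever data source is at hand.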

For more structured, scalable, and maintainable programming, use Perl's RTF modules. These make it possible to write code as in Listing 2.

Listing 2. Source code for invoice.pl (partial)
   
     use RTF::Document;
     
     $rtf = new RTF::Document({
         doc_page_width => '8.5in',
         doc_page_height => '11in'
     });                 
     $fCourier = $rtf->add_font ("Courier",
         { family=>monospace, pitch=>fixed,
           alternates=>["Courier New", "American Typewriter"]
         }
     ); 
     $fTimes = $rtf->add_font ("Times New Roman",
         { default => 1
         }
     ); 
     
     $rtf->add_text( $rtf->root(), "Invoice", ...    
    

With this approach, of course, I have all the power and productivity of Perl immediately at hand to tap into external data sources, transform content, and so on.

Summary
Don't count on problems to fix themselves. Part of your responsibility as a server-side developer is to be on the prowl for frictions in the operations around you. If there's a report that frequently gets lost, or miscoded, one approach is to exhort employees to put in longer hours, or to be more careful. Sometimes that works. With automation tools, though, you can systematically engineer effective processes.

Your automation can do even more than just reduce error. When you automate content generation or processing, you open up new possibilities for customization and qualitatively better service. Pick the resources below that best fit your situation, use them to solve problems already consuming time in your organization, and move on to more interesting and rewarding challenges.

Resources

  • "Rich Text Format (RTF) Specification, version 1.6" is a document Microsoft first published in 1999.

  • Antiword is a free MS Word reader for Linux and RISC OS. Some commercial UNIX distributions also include a proprietary reader of common Windows file formats. As these are not available for Linux, this column doesn't describe them further.

  • The catdoc Word reader is compact and straightforward.

  • wvWare is a library for converting Word documents.

  • Read Cameron's article "PDF for the server" (developerWorks, September 2002).

  • The CPAN RTF directory includes Perl modules both to parse and generate RTF documents. While author Robert Rothenberg labels them "experimental" and "alpha", they can be quite useful even in production situations.

  • Docserver is a Perl-coded application that renders .DOC and related formats into more standard text formats. It depends on a (licensed) installation of Microsoft Office running on a Windows host accessible through the network.

  • The OpenOffice home page and StarOffice home page lead to plenty of information about working with .DOC and .RTF.

  • The UNO Development Kit Project describes the OpenOffice approach to scripting. For more details, see the UNO technical documents.

  • Check out "Converting RTF documents with graphics into HTML documents" (developerWorks, February 2002).

  • NuxDocument is a "Zope product" that converts from Microsoft Word and other formats into HTML and plain text. Zope is the popular Python-based content management and application server.

  • Windward Reports is a commercial Java-coded product that includes .RTF->{XML,TXT,HTML,...} functionality. Windward is not FO-based (xsl:fo), although it has a similar external appearance.

  • HtmlToHlp is a Java-coded conversion tool that translates HTML files to RTF.

  • JRTF dynamically generates Word RTF documents using servlets.

  • "Using XSL-FO to create printable documents" introduces FO and touches on its potential to work with RTF (developerWorks, November 2001).

  • Find more resources for Linux developers in the developerWorks Linux zone.

About the author
Cameron is a full-time consultant for Phaseit, Inc., who writes and speaks frequently on Open Source and other technical topics. You can contact him at claird@phaseit.net.

