Automate document handling with low-cost server processes
Cameron Laird (claird@phaseit.net), Vice president, Phaseit, Inc.
December 2002
Office workers habitually exchange documents encoded in Microsoft Word .DOC format. An abundance of open-source tools makes it feasible to automate management of their content.
"PDF for the server" was one of the more popular
columns in this series. More precisely, it's the one that inspired the
most e-mail in response. Several readers asked that Server clinic treat
Microsoft Word documents the same way: describe how to manage them
programmatically.
It's important to do so. Few office workers get the point of
automation, despite the large investments Microsoft and others have made
in scripting, "active documents," and related technologies. "Civilians"
are largely habituated to routines of typing in data that come from
computer print-outs. I see plenty of workplaces where it's unusual even
to question such practices.
On the other side, many of the system programmers with the expertise to
help end-users integrate complex work-flows don't regard Microsoft Word
formats as feasible targets for server-side programming. Commercial
document management packages are available, but only at prices of
$20,000 and up.
In fact, there's plenty you can do with Word documents on a Linux or
other UNIX server, at a modest cost. Consider the possibilities:
Simplest first
First, for quick human readability, rough word counts, and so on, it's
often enough to scan a .DOC document with
strings. A command such as
strings something.doc | wc -w
generally returns a word count within 10% of the correct one.
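The one-liner above extends naturally to a batch report. What follows is a minimal sketch, assuming a `strings` utility (for example, the one from GNU binutils) is on the PATH; the function name `rough_word_count` is made up for illustration, and the result is only as accurate as the pipeline it wraps.

```shell
#!/bin/sh
# rough_word_count FILE
# Print an approximate word count for a .DOC file by counting the
# whitespace-separated tokens among its printable strings.
# (Hypothetical helper name; the pipeline is the one shown above.)
rough_word_count() {
    # -a scans the whole file; tr strips the padding some wc versions add.
    strings -a "$1" | wc -w | tr -d '[:space:]'
}

# Report on every file named on the command line, one per line.
for doc in "$@"; do
    printf '%s\t%s\n' "$(rough_word_count "$doc")" "$doc"
done
```

Saved under a name such as wordcount.sh (also hypothetical), it could be run as `sh wordcount.sh *.doc` to list a rough count beside each file name.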
It's surprisingly difficult to improve on that crude approach. The
heart of the problem is that .DOC, as a format, has changed quite a bit
through the years. It's hard to track.
The related .RTF has several advantages: it's encoded in ASCII and is
almost human-readable, and it's far less likely to carry a working
infection. Moreover, it appears to have been less volatile through the
years; a reader from 1997 will probably be able to digest an .RTF file
written this year, and vice versa. On some of the networks I manage, I
restrict traffic to exclude .DOC in favor of .RTF as prophylaxis against
malicious code. In principle, this deprives users of certain
word-processing features only available with .DOC. As a practical matter,
I have never had a user who actually uses an effect that can't be
achieved with .RTF.
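To see how readable the format is, consider the following sketch, which writes a complete one-paragraph .RTF document from the shell. The function name and the sample sentence are illustrative only; the control words (`\rtf1`, `\ansi`, `\deff0`, `\fonttbl`, `\fs24`, `\par`) are standard RTF, with `\fs24` giving the font size in half-points.

```shell
#!/bin/sh
# minimal_rtf: write a tiny but complete RTF document to standard output.
# The quoted 'EOF' keeps the shell from touching backslashes or dollar signs.
minimal_rtf() {
    cat <<'EOF'
{\rtf1\ansi\deff0
{\fonttbl{\f0\froman Times New Roman;}}
\pard\plain\f0\fs24 Hello from the server.\par
}
EOF
}

minimal_rtf
```

Redirecting the output to a file with an .rtf extension should give something a word processor can open directly.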
The Resources section, below, lists several lightweight Word readers:
wvWare, catdoc, and so on. These are generally quick and easy both to
install and use. Most UNIX desktop users now know that OpenOffice-on-UNIX
is entirely viable as a replacement for common uses of Windows Word, and
is quite adept at both reading and writing .DOC
documents. OpenOffice exposes scriptable interfaces that make it feasible
to program document content in any of Java, C++, Python, OpenOffice.org
Basic, StarScript, CORBA, or OLE Automation. OpenOffice also integrates
macro recording with this technology. Essentially the same is true of the
commercially licensed StarOffice(TM) product.
In fact, while StarOffice is formally distinct from OpenOffice, this
column focuses entirely on the latter, as "[f]uture versions of StarOffice
software, beginning with 6.0, will be built using the OpenOffice.org
source, APIs [application programming interfaces], file formats, and
reference implementation", according to the latter's Web site (see Resources). In future OpenOffice implementations,
"UNO (Universal Network Objects) is the interface-based component model."
OpenOffice is quite a "heavy" way to work with Word documents, though.
It requires at least a graphical user interface (GUI) service, often a
rather delicate installation, and multiple programmed processes.
XML-oriented "formatting objects" (FO) is much the same: while powerful,
it demands quite a bit of machinery be in place before it begins to work.
If you're out to do the simple kinds of operations I most often encounter
-- generate an .RTF invoice of a fixed format,
"scrape" an incoming weekly status report, customize a Web download with
reader-specific information, that sort of thing -- you should look into
the direct language bindings of .RTF libraries.
Best among these is Robert Rothenberg's Perl API.
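Before reaching for a library, note that the simplest scraping jobs can sometimes be prototyped with standard text tools, because .RTF is plain ASCII. The sketch below rests on loud assumptions: it supposes a status report contains text such as `Status: GREEN` (a made-up convention), and that the word processor wrote the label and its value contiguously on one line, which is not guaranteed. The helper name `rtf_field` is hypothetical.

```shell
#!/bin/sh
# rtf_field LABEL < report.rtf
# Print the first word that follows "LABEL: " in RTF read from standard input.
# Hypothetical helper; LABEL must not contain regex metacharacters or slashes.
rtf_field() {
    sed -n "s/.*$1: \([A-Za-z0-9]*\).*/\1/p" | head -n 1
}
```

With a report saved as weekly.rtf (an assumed name), `rtf_field Status < weekly.rtf` would print the value. Anything more structured than this is a job for a real parser.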
RTF::Document
For the simplest .RTF generation, simple
cutting and pasting is enough. You can parameterize production of a form
like Figure 1 with a shell script.
Listing 1. Source code for invoice.sh (partial)
#!/bin/sh
AMOUNT="1234.56"
DATE="06 October 2002"
NUMBER="9999/3333"
PO="6543"
FORM="{\rtf1\ansi\deff0\deftab720{\fonttbl...
\par \pard\plain\f3\fs20
\par \pard\qr\plain\f2\fs24\cf0 $DATE
\par \pard\plain\f2\fs24\cf0 Phaseit, Inc.
\par #$NUMBER
\par
\par Please pay \$$AMOUNT to
...
Figure 1. Screen shot of simple Word document generated on Linux server
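One detail deserves care when substituting values into a template like Listing 1: backslashes and braces are RTF syntax, so data containing them must be escaped first. Here is a minimal sketch of that idea, not the column's actual invoice.sh; the names `rtf_escape` and `render_invoice` and the much-abbreviated template are hypothetical.

```shell
#!/bin/sh
# rtf_escape TEXT: escape the three characters RTF treats as syntax.
rtf_escape() {
    printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/{/\\{/g' -e 's/}/\\}/g'
}

# render_invoice DATE NUMBER AMOUNT: fill a (much abbreviated) template.
render_invoice() {
    date=$(rtf_escape "$1")
    number=$(rtf_escape "$2")
    amount=$(rtf_escape "$3")
    # In an unquoted here-document, "\$" yields a literal dollar sign and
    # other backslashes (as in \par) pass through unchanged.
    cat <<EOF
{\rtf1\ansi\deff0{\fonttbl{\f0\fswiss Arial;}}
\pard\qr\plain\f0\fs24 $date\par
\pard\plain\f0\fs24 Invoice #$number\par
Please pay \$$amount\par
}
EOF
}

render_invoice "06 October 2002" "9999/3333" "1234.56"
```

When a template lives in a shell variable instead, printing it with `printf '%s\n'` is safer than `echo`, since some shells' `echo` interprets sequences such as `\f`.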
For more structured, scalable, and maintainable programming, use Perl's
RTF modules. These make it possible to write code as in Listing 2.
Listing 2. Source code for invoice.pl (partial)
use RTF::Document;

# Document properties: US Letter page size.
$rtf = RTF::Document->new({
    doc_page_width  => '8.5in',
    doc_page_height => '11in'
});

# Font definitions.
$fCourier = $rtf->add_font("Courier",
    { family => 'monospace', pitch => 'fixed',
      alternates => ["Courier New", "American Typewriter"]
    }
);
$fTimes = $rtf->add_font("Times New Roman",
    { default => 1
    }
);

# Add the invoice text under the document root.
$rtf->add_text( $rtf->root(), "Invoice", ...
With this approach, of course, I have all the power and productivity of
Perl immediately at hand to tap into external data sources, transform
content, and so on.
Summary
Don't count on problems to fix themselves. Part of your responsibility as
a server-side developer is to be on the prowl for frictions in the
operations around you. If there's a report that frequently gets lost, or
miscoded, one approach is to exhort employees to put in longer hours, or
to be more careful. Sometimes that works. With automation tools, though,
you can systematically engineer effective processes.
Your automation can do even more than just reduce error. When you
automate content generation or processing, you open up new possibilities
for customization and qualitatively better service. Pick the resources
below that best fit your situation, use them to solve problems already
consuming time in your organization, and move on to more interesting and
rewarding challenges.
Resources
- "Rich Text Format (RTF) Specification, version 1.6"
is a document Microsoft first published in 1999.
- Antiword is a free MS Word reader for Linux and RISC operating systems.
Some commercial UNIX distributions also include a proprietary reader of
common Windows file formats. As these are not available for Linux,
this column doesn't describe them further.
- The catdoc Word reader is compact and straightforward.
- wvWare is a library for converting Word documents.
- Read Cameron's article "PDF for the server" (developerWorks, September 2002).
- The CPAN RTF
directory includes Perl modules both to parse and generate RTF documents.
While author Robert Rothenberg labels them "experimental" and "alpha",
they can be quite useful even in production situations.
- Docserver is a Perl-coded application that renders .DOC and related
formats into more standard text formats. It depends on a (licensed)
installation of Microsoft Office running on a Windows host accessible
through the network.
- The Open Office home page and Star Office home page lead to plenty of
information about working with .DOC and .RTF.
- The UNO Development Kit Project
describes the OpenOffice approach to scripting. For more details, see the UNO technical documents.
- Check out "Converting RTF documents with graphics into HTML documents"
(developerWorks, February 2002).
- NuxDocument is a "Zope product" that converts from Microsoft Word and
other formats into HTML and plain text. Zope is the popular Python-based
content management and application server.
- Windward Reports is a commercial Java-coded product that includes
.RTF->{XML,TXT,HTML,...} functionality. Windward is not FO-based
(xsl:fo), although it has a similar external appearance.
- HtmlToHlp is a Java-coded conversion tool that translates HTML files to
RTF.
- JRTF dynamically generates Word RTF documents using servlets.
- "Using XSL-FO to create printable documents" introduces FO and touches
on its potential to work with RTF (developerWorks, November 2001).
- Find more resources for Linux developers in the developerWorks Linux zone.
About the author
Cameron is a full-time consultant for
Phaseit, Inc., who writes and speaks frequently on Open Source and
other technical topics. You can contact him at claird@phaseit.net.