Originally Published: Friday, 4 May 2001 Author: Bob DuCharme

XSL and XSLT: Creating Acrobat Files and Other Formatted Output

In this article, based on an excerpt from his new book "XSLT Quickly", Bob DuCharme shares the scoop on creating PDFs and other formatted documents on your Linux system using the bleeding-edge Extensible Stylesheet Language and the Apache project's free Java-based FOP.

This article is a revised excerpt of the author's book "XSLT Quickly" available from Manning Publications.

As the W3C's XSL specification gets closer and closer to Recommendation status, more software is appearing that can support it. One such program, the Apache project's free, Java-based FOP, lets you create Acrobat PDF files from XSL formatting object files. Used in conjunction with an XSLT processor, it can turn XML files into nice-looking PDF files on your Linux system.

XSL is the Extensible Stylesheet Language, a W3C standard for specifying the visual or audio presentation of an XML document. "Visual presentation" refers to details like fonts, margins, bolding, italicizing, and other page layout issues. Audio presentation refers to the pitch, speed, volume, and other parameters of a spoken voice communicating a document. As I write this, with the XSL spec in Candidate Recommendation status, I know of no program that can do anything with a stylesheet that has audio properties specified, but there are several that can turn XML documents into attractive pages suitable for publishing.

XSLT's relationship with XSL can be confusing. They have similar names, and they both offer specialized elements that you assemble into stylesheets that convert XML documents into something else. Technically, XSLT is a part of XSL; XSL was designed to be a language for transforming and formatting documents, and the "Transformations" part of this plan (the "T" in "XSLT") proved so valuable that the W3C's XSL Working Group split XSLT out into its own specification. Now, when people refer to "XSL" they usually just mean the formatting part.

Before XSLT became its own spec that described the conversion of XML documents into other XML documents (or even into non-XML documents), the original plan for this transformation language was to use it to convert XML documents into trees of formatting objects. Formatting objects are the specific elements from the XSL namespace that describe how to present the document's information. Although XSLT was split out to be separate, it's still very good for this.

An XSLT processor can convert an XML file into a formatting object XML file suitable for rendering by an XSL processor.

Before we look at an example of using XSLT to create an XSL formatting object document, let's look at a short XSL document created by hand so that we can get a feel for the structure of formatting object documents. (Comments refer to each stylesheet's filename in the zip file where you can get the stylesheets and sample documents.)

<!-- xq501.xml -->
<fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format">

  <fo:layout-master-set>

    <fo:simple-page-master>
      <fo:region-body/>
    </fo:simple-page-master>

  </fo:layout-master-set>

  <!-- Optional fo:declaration elements can go here. -->

  <fo:page-sequence>

    <!-- A sequence of pages. -->

    <fo:flow>
      <fo:block>Him thus intent Ithuriel with his spear</fo:block>
    </fo:flow>

  </fo:page-sequence>

</fo:root>

A formatting object stylesheet document uses elements from the http://www.w3.org/1999/XSL/Format namespace, and the namespace prefix declared for it is usually "fo" (for "formatting object"). The document element is fo:root, an element that has two required child elements: fo:layout-master-set, which describes the page layouts, and fo:page-sequence, which holds the content to put on those pages.

In the sample document above, the fo:layout-master-set uses the simplest master, fo:simple-page-master, to set the relevant values to their default values.

An fo:page-sequence contains the fo:flow flow objects that make up the actual content of the document. The example above has one fo:flow element to add the phrase "Him thus intent Ithuriel with his spear" to the formatting object tree.

An XSL processor turns these elements into whatever is appropriate for the output formats it supports. FOP ("Formatting Object Processor"), the XSL renderer originally written by James Tauber and available through the XML Apache project (xml.apache.org), can turn the document above into a PDF file that looks like this when displayed in Adobe Acrobat:

Adobe Acrobat displaying a PDF file created by FOP from a simple XSL formatting object file

The text is right up against the left and top edges of the "paper", because no margins were specified, so most laser printers wouldn't be able to print this little document. It's still an impressive achievement, though: with very little XML markup and an open-source program unaffiliated with any major software company, we've created a working Acrobat document.
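
XSL processors that follow the XSL 1.0 Recommendation expect a few naming attributes that this minimal example omits. The following is only a sketch of how the same document might look with them added; the name "simple" is an arbitrary choice made for illustration, and whether a particular FOP release requires or even accepts these attributes depends on which draft of the spec it was built against.

```xml
<!-- Sketch only: the name "simple" is an arbitrary, illustrative choice. -->
<fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format">

  <fo:layout-master-set>
    <!-- master-name gives this page layout a name to refer to. -->
    <fo:simple-page-master master-name="simple">
      <fo:region-body/>
    </fo:simple-page-master>
  </fo:layout-master-set>

  <!-- master-reference picks the page layout by name. -->
  <fo:page-sequence master-reference="simple">
    <!-- flow-name sends the content to the body region. -->
    <fo:flow flow-name="xsl-region-body">
      <fo:block>Him thus intent Ithuriel with his spear</fo:block>
    </fo:flow>
  </fo:page-sequence>

</fo:root>
```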

Let's look at how we can use XSLT to create a more complex XSL document suitable for FOP rendering. (For an introduction to the basics of XSLT stylesheets, see my articles Copying, Deleting, and Renaming Elements and Adding New Elements and Attributes at XML.com.) For input, the following poem document has a title element and an in-line prop element for proper names. Elements of both types will be formatted differently from the rest of the poem's text. We want line breaks after each verse of the poem, and we want a little extra space after the title in the final Acrobat version.

<poem>
  <title>"Paradise Lost" excerpt</title>
  <verse>Him thus intent <prop>Ithuriel</prop> with his spear</verse>
  <verse>Touched lightly; for no falsehood can endure</verse>
  <verse>Touch of Celestial temper, but returns</verse>
  <verse>Of force to its own likeness: up he starts</verse>
  <verse>Discovered and surprised.</verse>
</poem>

Our XSLT stylesheet will read this poem document and convert it into an XSL stylesheet, or "formatting object file", suitable for conversion into an Acrobat file by FOP. The XSLT stylesheet declares two namespaces in its xsl:stylesheet start-tag: one to identify the XSLT instructions to the XSLT processor and one to identify the XSL elements to the rendering program.

<!-- xq503.xsl: converts xq502.xml into xq504.xml -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:fo="http://www.w3.org/1999/XSL/Format">

  <xsl:template match="/">

    <fo:root>

      <fo:layout-master-set>
        <fo:simple-page-master>
          <fo:region-body margin-top="36pt"
              margin-bottom="36pt" margin-left="36pt"
              margin-right="36pt"/>
        </fo:simple-page-master>
      </fo:layout-master-set>

      <fo:page-sequence>
        <fo:flow flow-name="xsl-region-body">
          <xsl:apply-templates/>
        </fo:flow>
      </fo:page-sequence>

    </fo:root>

  </xsl:template>


  <xsl:template match="verse">
    <fo:block font-size="10pt" font-family="Times">
      <xsl:apply-templates/>
    </fo:block>
  </xsl:template>


  <xsl:template match="title">
    <fo:block font-size="14pt" font-weight="bold"
        space-before.optimum="12pt"
        space-after.optimum="12pt">
      <xsl:apply-templates/>
    </fo:block>
  </xsl:template>


  <xsl:template match="prop"><!-- proper names -->
    <fo:inline font-style="italic">
      <xsl:apply-templates/>
    </fo:inline>
  </xsl:template>

</xsl:stylesheet>

When the XSLT processor finds the root ("/") of the source tree, the stylesheet's first template rule adds the result XSL stylesheet's fo:root document element to the result tree. It also adds that fo:root element's fo:layout-master-set and fo:page-sequence child elements to the result tree. The fo:layout-master-set element resembles the one in the earlier example except that the fo:region-body element in its fo:simple-page-master doesn't just leave all its parameters at their default values. Instead, it sets the top, bottom, left, and right margins to 36 points, or half an inch.
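
Because there are 72 points to an inch, the same margins could presumably be written in inches instead. The fragment below is a sketch of a drop-in alternative for the fo:region-body element in the stylesheet above, assuming the formatter accepts the "in" length unit that the XSL spec defines; the namespace declaration is repeated here only so that the fragment stands on its own.

```xml
<!-- Sketch: 0.5in is the same length as the 36pt used in the stylesheet above. -->
<fo:region-body xmlns:fo="http://www.w3.org/1999/XSL/Format"
    margin-top="0.5in" margin-bottom="0.5in"
    margin-left="0.5in" margin-right="0.5in"/>
```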

The fo:page-sequence element has one fo:flow object, and this element's contents in the result tree will be determined by the xsl:apply-templates instruction between the fo:flow tags in the XSLT stylesheet. Whatever this stylesheet's template rules do with the nodes that the XSLT processor finds hanging off the source tree's root node, the nodes that they add to the result tree will be between these fo:flow tags in the result document.

The stylesheet's three remaining template rules turn parts of the poem document into formatting objects to go inside this fo:flow element. The first of the three sets the verse elements to 10 point text in the Times font family. The second sets the title element to 14 point bold text; the font-family is left at the default value, which happens to be Helvetica for the FOP XSL formatter.

The "title" template rule also sets space-before.optimum and space-after.optimum attribute values; related ones include space-before.minimum, space-before.maximum, and the corresponding space-after attribute values. The opportunity to set three different parameters to control the amount of allowable space before or after a given text block lets you specify exactly how much leeway you want to give to a page layout engine's automated decisions about these settings.
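
As a rough illustration of those three parameters working together, here is a variant of the "title" template rule that gives the formatter a range for the space after the title. The 6pt and 18pt bounds are assumptions picked for the example, and how much use a given formatter makes of the range is up to that formatter.

```xml
<!-- Sketch: variant of the "title" template rule from the stylesheet above.
     The 6pt and 18pt bounds are illustrative assumptions. -->
<xsl:template match="title"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:fo="http://www.w3.org/1999/XSL/Format">
  <fo:block font-size="14pt" font-weight="bold"
      space-before.optimum="12pt"
      space-after.minimum="6pt"
      space-after.optimum="12pt"
      space-after.maximum="18pt">
    <!-- Preferred gap is 12pt; the formatter may use anything from 6pt to 18pt. -->
    <xsl:apply-templates/>
  </fo:block>
</xsl:template>
```

Inside a complete stylesheet the two namespace declarations would already be on the xsl:stylesheet start-tag, so they would not need repeating on the template.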

The poem's title and verse elements are both added to the formatting object result tree as fo:block elements. They're each their own block of text, and you can set block-oriented parameters for them such as the amount of space to put before and after each block.

The final template rule adds an fo:inline element to the result tree for prop elements. This tells the XSL processor to treat elements of this type as part of their surrounding block instead of treating each one as its own block. Emphasized words, technical terms set in a different font, and in this case, a single proper name to be italicized are typical elements that make good candidates for inline rendering instead of block rendering.

When this stylesheet is run with the poem source document shown above, it creates this result (I added carriage returns and indenting to make the document easier to read, but these wouldn't affect the FOP program's treatment of it):

<?xml version="1.0" encoding="UTF-8"?>
<fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format">
   <fo:layout-master-set><fo:simple-page-master>
   <fo:region-body margin-top="36pt" margin-bottom="36pt" 
         margin-left="36pt" margin-right="36pt"/>
   </fo:simple-page-master>
   </fo:layout-master-set><fo:page-sequence><fo:flow flow-name="xsl-region-body">
   <fo:block font-size="14pt" font-weight="bold" 
       space-before.optimum="12pt" 
       space-after.optimum="12pt">"Paradise Lost" excerpt</fo:block>
   <fo:block font-size="10pt" 
       font-family="Times">Him thus intent 
<fo:inline font-style="italic">Ithuriel</fo:inline>
 with his spear</fo:block>
   <fo:block font-size="10pt" font-family="Times">
Touched lightly; for no falsehood can endure</fo:block>
   <fo:block font-size="10pt" font-family="Times">
Touch of Celestial temper, but returns</fo:block>
   <fo:block font-size="10pt" font-family="Times">
Of force to its own likeness: up he starts</fo:block>
   <fo:block font-size="10pt" font-family="Times">
Discovered and surprised.</fo:block>
   </fo:flow></fo:page-sequence></fo:root>

FOP turns this document into a PDF file that looks like this in Acrobat:

Adobe Acrobat displaying a PDF file created by FOP from a more complex XSL formatting object file

A glance through the W3C XSL specification (see http://www.w3.org/TR/xsl) shows many other formatting objects that you can use along with those shown in the examples above: fo:page-number, fo:list-item, fo:table-and-caption, and more.
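
As one illustration, here is a sketch of a page footer built with fo:page-number. The master name "poem-page", the 24pt footer height, and the 8pt font size are assumptions made for the example. The attribute names follow the XSL 1.0 Recommendation, so a processor built against an earlier draft may expect something slightly different.

```xml
<!-- Sketch only: names and sizes are illustrative assumptions. -->
<fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format">

  <fo:layout-master-set>
    <fo:simple-page-master master-name="poem-page">
      <fo:region-body margin-top="36pt" margin-bottom="36pt"
          margin-left="36pt" margin-right="36pt"/>
      <!-- Reserve a 24pt strip at the bottom of the page for the footer. -->
      <fo:region-after extent="24pt"/>
    </fo:simple-page-master>
  </fo:layout-master-set>

  <fo:page-sequence master-reference="poem-page">

    <!-- Static content is repeated on every page of the sequence. -->
    <fo:static-content flow-name="xsl-region-after">
      <fo:block text-align="center" font-size="8pt">
        Page <fo:page-number/>
      </fo:block>
    </fo:static-content>

    <fo:flow flow-name="xsl-region-body">
      <fo:block>Him thus intent Ithuriel with his spear</fo:block>
    </fo:flow>

  </fo:page-sequence>

</fo:root>
```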

For now, FOP only converts formatting object files to Acrobat PDF files, but that's pretty useful. Other XSL engines will certainly appear that convert them to other formats, whether RTF or other vendors' own rendering formats. Any given rendering engine may not support the entire XSL spec; for example, it will be a while before all of the audio properties are supported by one package, and some of the visual ones may be difficult to support as well. Still, it's a great way to create nice-looking documents on Linux using nothing but free software and open standards.

XSLT Quickly can be purchased from www.manning.com.