J2EE News, Forums, Discussions, Articles, Jobs, Books J2EEWorld.COM
Use XML and XSLT to automatically produce both plain text and HTML newsletters
Written by Michael   
Friday, 20 August 2004

Managing ezines with JavaMail and XSLT, Part 1

Benoît Marchal
Consultant, Pineapplesoft
Updated 1 March 2001

In part one of this two-part series, Benoît Marchal demonstrates how to automate e-mail publishing chores with Java and XML. This concrete application of XML and XSLT describes an e-mail newsletter (e-zine) publishing application that outputs both HTML and plain text e-mail messages. Six reusable code samples include a sample newsletter marked up in DocBook, an XSL style sheet to convert the DocBook sample to a custom text output, a Java text formatter (in the form of a SAX ContentHandler), two SAX filters, and the Java code that puts it all together in a multistep transformation. (The next part of this article covers the JavaMail API.)

So you've learned XML. You mastered your way through DTD, XSLT, SAX and DOM. You unraveled the secrets of namespaces, and you think you're on top of the X-rated acronym soup. Congratulations! Now what?

Judging from the developer feedback I hear, you're not alone in asking yourself that crucial question. This article proposes one answer through a practical application: it demonstrates how to automate publishing chores with Java and XML. As such, I think it will prove inspiring.

This article does not include an introduction to XML. I assume you are familiar with XSLT and have some notion of SAX parsing. Even if you need background on those topics, you might still want to read through the article, as it will inspire you to learn more. But make sure you consult the Resources section for some basic XML references.

XML ... and e-mail?
XML may not seem like a natural technology match with e-mail. Stay with me, and you may be surprised by the utility of this strange combination.

As you probably know, Eudora, Outlook, Netscape, and other modern e-mail clients let you send HTML e-mails. Originally e-mail messages were limited to plain text and they would not support bold, italics or hyperlinks. Modern e-mail clients recognize HTML, and so you can now send either plain text messages or richly decorated documents.

This choice of e-mail formats poses a problem to e-mail magazine (e-zine) publishers. Indeed the choice plays a part in the strategies e-zine publishers develop to confront their two biggest problems: acquiring and retaining subscribers. Unfortunately, subscribers have strong positions for or against HTML e-mails.

To make things worse, some e-mail clients (including the popular AOL 4.0 to 5.0) do not support HTML at all. Unless you are extra careful, subscribers with those older e-mail clients see only garbage.

Traditionally, e-zine publishers have gone to great lengths to ensure their readers' comfort. In the days of plain-text e-mails, savvy publishers would manually format their prose. Some continue this fine tradition with HTML e-mails, painstakingly preparing two versions of each document: plain text for older e-mail clients and HTML for newer ones. When I heard about that, a lightbulb popped over my head and I thought "XSLT style sheets." (This may be a sure sign that I should get a life.)

Principles
In this two-part article, you'll see how XML, XSLT and some Java programming can simplify things. In the process of doing so, you'll use various XML techniques. Let's start by reviewing them all:

  • XML itself, of course. The e-zine will be written in XML and, more specifically, in DocBook. DocBook is a popular XML vocabulary for technical documentation.
  • XSLT is typically used to convert XML documents to HTML. That would solve half of our problem (preparing the HTML version of the e-zine).
  • A special text formatter that enhances XSLT support for text. Indeed, as you might have understood, top-notch text formatting is a priority for e-zines.
  • JavaMail, the standard Java API to send e-mail.

Figure 1 illustrates the relationship between these components. From left to right, the ultimate goal is to prepare a so-called multipart e-mail with both text and HTML versions of the e-zine.

Figure 1. How the components of the solution interact

Preparing the e-mail involves going through two style sheets: one creates the text output, the other outputs the HTML version. The text formatter assists the text style sheet. JavaMail picks up both copies and sends them to subscribers.
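
To make the "one source, two style sheets" idea concrete, here is a minimal sketch that runs the same XML through two style sheets with the standard javax.xml.transform API. Both inline style sheets and the class name TwoOutputsDemo are hypothetical stand-ins I made up for illustration; they are far simpler than the real text and HTML style sheets.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class TwoOutputsDemo {
    static final String XSL_OPEN =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>";

    // Hypothetical stand-in for the text style sheet
    static final String TEXT_XSL = XSL_OPEN
      + "<xsl:output method='text'/>"
      + "<xsl:template match='para'>"
      + "<xsl:value-of select='.'/><xsl:text>&#10;</xsl:text>"
      + "</xsl:template></xsl:stylesheet>";

    // Hypothetical stand-in for the HTML style sheet
    static final String HTML_XSL = XSL_OPEN
      + "<xsl:output method='html'/>"
      + "<xsl:template match='para'>"
      + "<p><xsl:apply-templates/></p>"
      + "</xsl:template></xsl:stylesheet>";

    // Applies one style sheet to one document and returns the result as a string
    static String apply(String stylesheet, String xml) throws TransformerException {
        StringWriter out = new StringWriter();
        TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(stylesheet)))
            .transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        String xml = "<article><para>Learn XSL first.</para></article>";
        System.out.println(apply(TEXT_XSL, xml));   // candidate plain-text part
        System.out.println(apply(HTML_XSL, xml));   // candidate HTML part
    }
}
```

The two strings returned by apply() correspond to the two copies that would end up in the multipart e-mail.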

This first installment of the series concentrates on the text transformation. The second installment will wrap things up with JavaMail.

The DocBook document
The starting point is the article in article.xml in Listing 1. It is written in DocBook, meaning that the XML tags (<article>, <title>, <para>) are all tags defined by DocBook.

Listing 1. article.xml


<?xml version="1.0"?>
<article>
<articleinfo>
 <title>XSL -- First Step in Learning XML</title>
 <author><firstname>Benoît</firstname>
  <surname>Marchal</surname></author>
</articleinfo>
<sect1><title>The Value of XSL</title>
 <para>This is an excerpt from the September 2000 issue of
  Pineapplesoft Link. To subscribe free visit
  <ulink url="http://www.marchal.com">marchal.com</ulink>.</para>
 <para>Where do you start learning XML? Increasingly my answer
  is with XSL. XSL is a very powerful tool with many
  applications. Many XML applications depend on it. Let's take
  two examples.</para>
</sect1>
<sect1>
 <title>XSL and Web Publishing</title>
 <para>As a webmaster you would benefit from using XSL.</para>
 <para>Let's suppose that you decide to support smartphones.
  You will need to redo your web site using WML, the
  <emphasis>wireless markup language</emphasis>, instead of
  HTML. While learning WML is easy, it can take days if not
  months to redo a large web site. Imagine having to edit every
  single page by hand!</para>
 <para>In contrast with XSL, it suffices to update one style
  sheet the changes flow across the entire web site.</para>
</sect1>
<sect1>
 <title>XSL and Programming</title>
 <para>The second facet of XSL is the scripting language. XSL
  has many features of scripting languages including loops,
  function calls, variables and more.</para>
 <para>In that respect, XSL is a valuable addition to any
  programmer toolbox. Indeed, as XML popularity keeps growing,
  you will find that you need to manipulate XML documents
  frequently and XSL is the language for so doing.</para>
</sect1>
<sect1>
 <title>Conclusion</title>
 <para>If you're serious about learning XML, learn XSL. XSL is
  a tool to manipulate XML documents for web publishing or
  programming.</para>
</sect1>
</article>

Text-markup language
Now let's see how to convert DocBook to text. XSLT has some support for text formatting (in the form of <xsl:output method="text"/>) but, in my experience, it is inadequate for e-zine publishing. More specifically, with XSL text output, it's:

  • Impossible to break lines at a specific length (a requirement with old e-mail clients)
  • Difficult to remove accented characters (another limitation with old e-mail clients)
  • Troublesome to remove duplicate spaces in the original document
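
The limitations listed above are easy to observe. The following is a minimal sketch, assuming the XSLT processor bundled with the JDK; the one-template style sheet and the class name TextOutputDemo are invented for this illustration. It shows that the text output method copies the source's line breaks and duplicate spaces through unchanged.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class TextOutputDemo {
    // Hypothetical style sheet: prints the text of each <para> as-is
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='para'><xsl:value-of select='.'/></xsl:template>"
      + "</xsl:stylesheet>";

    static String transform(String xml) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)),
                              new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        // Three spaces and a line break inside the paragraph
        String xml = "<para>To   subscribe\nfree visit marchal.com.</para>";
        // The brackets make the untouched whitespace visible
        System.out.println("[" + transform(xml) + "]");
    }
}
```

XSLT 1.0 does offer normalize-space() for collapsing spaces, but nothing in the text output method re-wraps lines at a given width, which is the gap the text formatter fills.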

At first sight it would appear that XSLT cannot help, but a small dose of Java programming can make it work. The trick is to define a special XML vocabulary, which I'll call the text-markup language, to describe text documents.

I created this text-markup language specifically for this article, so it's as simple as it needs to be. Indeed it has only two tags: <txt:root> (the root of the document) and <txt:block> (a paragraph with a line break before and after it). Both are defined in the http://www.psol.com/xns/xslist/xml2text namespace. Incidentally, remember that a namespace is just an identifier; it looks just like a URL, but it does not point to anything.

<txt:root> has a lineWidth attribute for the ... that's right: the line width. <txt:block> has a linesAfter attribute with the number of line breaks after the block.

Next, you write a Java application to convert text-markup language to plain text. For example, the document below (Input) will become the following document (Output). Notice that the lines are broken so that none is longer than the 65 characters specified by the lineWidth attribute:

Input


<?xml version="1.0" encoding="UTF-8"?>
<txt:root lineWidth="65"
          xmlns:txt="http://www.psol.com/xns/xslist/xml2text">
<txt:block linesAfter="1">This is an excerpt from the September
2000 issue of Pineapplesoft Link. To subscribe free visit
marchal.com.</txt:block>
</txt:root>
Output


This is an excerpt from the September 2000 issue of
Pineapplesoft Link. To subscribe free visit marchal.com.

To convert from the original XML document to text-markup language, I'll use XSLT (of course). Incidentally, why bother with the text-markup language? If I'm going to write Java code, why not process DocBook directly? In a nutshell, because it's easier this way. For example:

  • Instead of parsing all the many tags in DocBook, I need to process only the two tags in the text markup language.
  • To change the text output, it suffices to edit the style sheet and, because XSLT is a scripting language, that's easier than hacking around in Java.
  • Last but not least, the combination of text-markup language and XSLT works with DocBook and any other XML vocabulary.

If you're familiar with XSL, this text-markup language is similar to using FO to create PDF files.

The style sheet
The style sheet to convert from DocBook to the text-markup language is text.xsl in Listing 2. Notice the <xsl:output method="xml"/> tag: this style sheet converts from XML (DocBook) to XML (text-markup language) -- not HTML.

Listing 2. text.xsl


<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:txt="http://www.psol.com/xns/xslist/xml2text">

<xsl:output method="xml"/>

<xsl:template match="/">
<txt:root lineWidth="65">
   <xsl:apply-templates/>
</txt:root>
</xsl:template>

<xsl:template match="articleinfo">
   <txt:block linesAfter="0">&gt; <xsl:value-of
       select="title"/> &lt;</txt:block>
   <txt:block linesAfter="2">
       by <xsl:value-of select="author/firstname"/>
       <xsl:text> </xsl:text>
       <xsl:value-of select="author/surname"/>
    </txt:block>
</xsl:template>

<xsl:template match="sect1/title">
   <txt:block linesAfter="1">* <xsl:apply-templates/> *</txt:block>
</xsl:template>

<xsl:template match="ulink">
   <xsl:apply-templates/>
   <xsl:text> &lt;</xsl:text>
   <xsl:value-of select="@url"/>
   <xsl:text>&gt;</xsl:text>
</xsl:template>

<xsl:template match="emphasis">
   <xsl:text>*</xsl:text>
   <xsl:apply-templates/>
   <xsl:text>*</xsl:text>
</xsl:template>

<xsl:template match="para">
   <txt:block linesAfter="1"><xsl:apply-templates/></txt:block>
</xsl:template>

</xsl:stylesheet>

The text formatter
You can see the text formatter itself, Xml2Text.java, in Listing 3. Xml2Text is a SAX ContentHandler. (If you're not familiar with SAX, see the sidebar SAX defined.) As SAX handlers go, this one is easy. In the startElement() and characters() events, it buffers the content of <txt:block>. In endElement(), Xml2Text writes the text and inserts line breaks as appropriate.

Listing 3. Xml2Text.java


package com.psol.xslist;

import java.io.*;
import org.xml.sax.*;
import org.xml.sax.helpers.*;

public class Xml2Text
   extends DefaultHandler
{
   protected static final String
      NAMESPACE_URI = "http://www.psol.com/xns/xslist/xml2text";
   protected static final int NONE = 0,
                              ROOT = 1,
                              BLOCK = 2;
   protected StringBuffer buffer;
   protected int state,
                 lineWidth,
                 linesAfter;
   protected PrintWriter writer = null;

   public Xml2Text(PrintWriter writer)
   {
      this.writer = writer;
   }

   public void startElement(String uri,
                            String name,
                            String qualifiedName,
                            Attributes atts)
   {
      if(!uri.equals(NAMESPACE_URI))
         return;
      if(state == ROOT && name.equals("block"))
      {
         state = BLOCK;
         buffer = new StringBuffer(128);
         try
         {
            linesAfter =
               Integer.parseInt(atts.getValue("linesAfter"));
         }
         catch(NumberFormatException e)
         {
            linesAfter = 0;
         }
      }
      else if(state == NONE && name.equals("root"))
      {
         state = ROOT;
         try
         {
            lineWidth =
               Integer.parseInt(atts.getValue("lineWidth"));
         }
         catch(NumberFormatException e)
         {
            lineWidth = 65;
         }
      }
   }

   public void endElement(String uri,
                          String name,
                          String qualifiedName)
   {
      if(!uri.equals(NAMESPACE_URI))
         return;
      if(state == BLOCK && name.equals("block"))
      {
         state = ROOT;
         int start = 0,
             current = start,
             lastSpace = start - 1;
         while(current < buffer.length())
         {
            while(current < start + lineWidth &&
                  current < buffer.length())
            {
               if(Character.isWhitespace(buffer.charAt(current)))
                  lastSpace = current;
               current++;
            }
            if(current < buffer.length() && start < lastSpace)
            {
               for(int i = start;i < lastSpace;i++)
                  writer.print(buffer.charAt(i));
               start = lastSpace + 1;
            }
            else
            {
               for(int i = start;i < current;i++)
                  writer.print(buffer.charAt(i));
               start = current;
            }
            current = start;
            lastSpace = start - 1;
            writer.println();
         }
         for(int i = 0;i < linesAfter;i++)
            writer.println();
         buffer.delete(0,buffer.length());
      }
      else if(state == ROOT && name.equals("root"))
         state = NONE;
   }

   public void characters(char[] chars,int start,int length)
   {
      if(state == BLOCK)
         buffer.append(chars,start,length);
   }

   public void startDocument()
   {
      state = NONE;
   }

   public void endDocument()
   {
      writer.flush();
   }
}

SAX defined


SAX is the Simple API for XML, one of the most efficient solutions for processing XML documents. SAX is an event-based API, meaning that the parser sends events to your application (rather than reading the document's entire node tree into memory).

The most important events are startElement(), endElement(), and characters().

SAX filters are special event handlers designed to be chained with each other. The second installment of this article series will revisit SAX.
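
If you have never seen SAX events in action, here is a minimal sketch. The EventLogger class is hypothetical and written only for this sidebar; it records each event as a line of text instead of building a tree.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class SaxEventsDemo {
    /** Hypothetical handler: logs the three most important SAX events. */
    static class EventLogger extends DefaultHandler {
        final List<String> events = new ArrayList<>();

        @Override
        public void startElement(String uri, String name,
                                 String qualifiedName, Attributes atts) {
            events.add("start " + name);
        }

        @Override
        public void endElement(String uri, String name, String qualifiedName) {
            events.add("end " + name);
        }

        @Override
        public void characters(char[] chars, int start, int length) {
            // A parser is free to deliver one run of text in several chunks
            events.add("chars " + new String(chars, start, length));
        }
    }

    static List<String> parse(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);   // so that local names are reported
        XMLReader reader = factory.newSAXParser().getXMLReader();
        EventLogger logger = new EventLogger();
        reader.setContentHandler(logger);
        reader.parse(new InputSource(new StringReader(xml)));
        return logger.events;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<para>Learn <emphasis>XSL</emphasis> first.</para>";
        for (String event : parse(xml))
            System.out.println(event);
    }
}
```

Because characters() may be called more than once for the same run of text, handlers that need the complete text typically buffer it until endElement().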

Two handy SAX filters
Xml2Text lacks the ability to remove unwanted spaces and accented letters. Instead of cramming these features in Xml2Text, it makes more sense to implement them as two SAX filters. The beauty of SAX filters is that you can freely combine them.

I can think of several other cases when I could use these two filters, for example, to remove unwanted spaces as preprocessing before publishing HTML documents.
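
To illustrate how filters combine, here is a minimal sketch of a two-filter chain. UpperCaseFilter, UnderscoreFilter and Collector are hypothetical classes invented for this example; they are not the filters from the listings. The sketch chains the filters through the XMLFilterImpl constructor that takes a parent XMLReader, which is one of two common ways to wire a chain.

```java
import java.io.StringReader;
import java.util.Locale;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

class FilterChainDemo {
    /** Hypothetical filter: upper-cases all character data. */
    static class UpperCaseFilter extends XMLFilterImpl {
        UpperCaseFilter(XMLReader parent) { super(parent); }

        @Override
        public void characters(char[] chars, int start, int length)
                throws SAXException {
            char[] upper = new String(chars, start, length)
                .toUpperCase(Locale.ROOT).toCharArray();
            super.characters(upper, 0, upper.length);   // pass it down the chain
        }
    }

    /** Hypothetical filter: replaces each space with an underscore. */
    static class UnderscoreFilter extends XMLFilterImpl {
        UnderscoreFilter(XMLReader parent) { super(parent); }

        @Override
        public void characters(char[] chars, int start, int length)
                throws SAXException {
            char[] replaced = new String(chars, start, length)
                .replace(' ', '_').toCharArray();
            super.characters(replaced, 0, replaced.length);
        }
    }

    /** Final handler: collects whatever text reaches the end of the chain. */
    static class Collector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] chars, int start, int length) {
            text.append(chars, start, length);
        }
    }

    static String run(String xml) throws Exception {
        XMLReader parser =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        // Events flow: parser -> UpperCaseFilter -> UnderscoreFilter -> Collector
        XMLReader chain = new UnderscoreFilter(new UpperCaseFilter(parser));
        Collector collector = new Collector();
        chain.setContentHandler(collector);
        chain.parse(new InputSource(new StringReader(xml)));
        return collector.text.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run("<para>learn xsl first</para>"));   // prints LEARN_XSL_FIRST
    }
}
```

Adding, removing or reordering a filter only changes the line that builds the chain; the filters themselves know nothing about each other.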

WhitespaceFilter.java in Listing 4 is the SAX filter that removes duplicate spaces. Again, if you are familiar with SAX handlers, this class is easy. In startElement() and characters(), it buffers the text. endElement() removes duplicate spaces. Note that this code is optimized for clarity, not efficiency: it buffers too much.

The filter also recognizes the standard xml:space attribute. You have probably forgotten about xml:space, but it is defined in the original XML standard. It takes one of two values: preserve (keep duplicate spaces, as HTML <pre> does) and default, which means duplicate spaces can be removed.

Listing 4. WhitespaceFilter.java


package com.psol.xslist;

import java.util.*;
import org.xml.sax.*;
import org.xml.sax.helpers.*;

public class WhitespaceFilter
   extends XMLFilterImpl
{
   protected Stack stack;

   public WhitespaceFilter()
   {
      super();
   }

   public WhitespaceFilter(XMLReader reader)
   {
      super(reader);
   }

   public void startElement(String uri,
                            String name,
                            String qualifiedName,
                            Attributes atts)
      throws SAXException
   {
      String space = atts.getValue("xml:space");
      if(null != space && space.equals("preserve"))
         stack.push(null);
      else
         stack.push(new StringBuffer());
      super.startElement(uri,name,qualifiedName,atts);
   }

   public void endElement(String uri,
                          String name,
                          String qualifiedName)
      throws SAXException
   {
      Object object = stack.pop();
      if(object instanceof StringBuffer)
      {
         StringBuffer input = (StringBuffer)object,
                      output = new StringBuffer();
         boolean wasWhitespace = false;
         for(int current = 0;current < input.length();current++)
         {
            char c = input.charAt(current);
            if(c == '\n' || c == '\r')
               c = ' ';
            if(Character.isWhitespace(c))
            {
               if(!wasWhitespace)
                  output.append(c);
               wasWhitespace = true;
            }
            else
            {
               output.append(c);
               wasWhitespace = false;
            }
         }
         char[] chars = new char[output.length()];
         output.getChars(0,output.length(),chars,0);
         super.characters(chars,0,output.length());
      }
      super.endElement(uri,name,qualifiedName);
   }

   public void characters(char[] chars,int start,int length)
      throws SAXException
   {
      Object object = stack.peek();
      if(object instanceof StringBuffer)
         ((StringBuffer)object).append(chars,start,length);
      else
         super.characters(chars,start,length);
   }

   public void startDocument()
      throws SAXException
   {
      stack = new Stack();
      super.startDocument();
   }
}

The second filter, AsciiFilter.java (see Listing 5), removes accented characters and other special characters not recognized by old e-mail clients. All the processing takes place in characters().

Note that AsciiFilter does not filter attributes and, for the simplicity of this example, it's limited to the accents used in the French language. You might want to add more special characters to filter other languages.
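
If listing every accented letter by hand becomes tedious, one alternative (a different technique from the switch statement used in AsciiFilter) is Unicode decomposition with java.text.Normalizer. The sketch below is only an illustration; the class and method names are hypothetical, and characters that do not decompose, such as ligatures and the copyright sign, still need explicit rules.

```java
import java.text.Normalizer;

class AccentStripper {
    static String stripAccents(String text) {
        // NFD splits an accented letter into its base letter followed by
        // one or more combining marks (for example, e + combining acute)
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        // \p{M} matches combining marks; removing them leaves the base letters
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        // Ligatures and symbols do not decompose, so map them explicitly
        return stripped.replace("\u0153", "oe")    // oe ligature
                       .replace("\u00E6", "ae")    // ae ligature
                       .replace("\u00A9", "(c)");  // copyright sign
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file pure ASCII
        String sample = "Beno\u00EEt a d\u00E9j\u00E0 vu l'\u0153uvre";
        System.out.println(stripAccents(sample));   // prints: Benoit a deja vu l'oeuvre
    }
}
```

A method like this could be called from a filter's characters() method in place of the switch statement, at the cost of creating a few temporary strings per call.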

Listing 5. AsciiFilter.java



package com.psol.xslist;

import java.io.*;
import org.xml.sax.*;
import org.xml.sax.helpers.*;

public class AsciiFilter
   extends XMLFilterImpl
{
   public AsciiFilter()
   {
      super();
   }

   public AsciiFilter(XMLReader reader)
   {
      super(reader);
   }

   public void characters(char[] chars,int start,int length)
      throws SAXException
   {
      StringBuffer filtered =
         new StringBuffer((int)(length * 1.1));
      int i = start,
          stop = start + length;
      while(i < stop)
      {
         char c = chars[i++];
         switch(c)
         {
            case '\u0153': // œ (oe ligature)
               filtered.append("oe");
               break;
            case '©':
               filtered.append("(c)");
               break;
            case 'à':
            case 'ä':
               filtered.append('a');
               break;
            case 'æ':
               filtered.append("ae");
               break;
            case 'ç':
               filtered.append('c');
               break;
            case 'è':
            case 'é':
            case 'ê':
            case 'ë':
               filtered.append('e');
               break;
            case 'î':
            case 'ï':
               filtered.append('i');
               break;
            case 'ô':
            case 'ö':
               filtered.append('o');
               break;
            case 'ù':
            case 'û':
            case 'ü':
               filtered.append('u');
               break;
            // more characters would come here
            default:
               filtered.append(c);
         }
      }
      char[] newChars = new char[filtered.length()];
      filtered.getChars(0,filtered.length(),newChars,0);
      super.characters(newChars,0,filtered.length());
   }
}

Running the project
Console.java in Listing 6 puts all the pieces together. It applies the style sheet (through javax.xml.transform, the standard Java transformation API designed by Sun) and runs the result through the two filters and the text formatter. Keep in mind that this really is a multistep transformation: from DocBook to the text-markup language to plain text.

Listing 6. Console.java


package com.psol.xslist;

import java.io.*;
import javax.xml.transform.*;
import javax.xml.transform.sax.*;
import javax.xml.transform.stream.*;

public class Console
{
   public static void main(String[] args)
   {
      try
      {
         if(args.length < 3)
         {
            System.out.println("java com.psol.xslist.Console " +
                               "input.xml stylesheet.xsl output.txt");
            return;
         }
         Xml2Text xml2Text =
            new Xml2Text(new PrintWriter(new FileWriter(args[2])));
         WhitespaceFilter whitespaceFilter = new WhitespaceFilter();
         whitespaceFilter.setContentHandler(xml2Text);
         AsciiFilter asciiFilter = new AsciiFilter();
         asciiFilter.setContentHandler(whitespaceFilter);
         TransformerFactory factory = TransformerFactory.newInstance();
         Transformer transformer =
            factory.newTransformer(new StreamSource(new File(args[1])));
         transformer.transform(new StreamSource(new File(args[0])),
                               new SAXResult(asciiFilter));
      }
      catch(IOException e)
      {
         System.err.println(e.getMessage());
      }
      catch(TransformerException e)
      {
         System.err.println(e.getMessage());
      }
   }
}

In summary
This might seem like a lot of work to please a group of subscribers with antiquated e-mail clients. Why bother? Some people would suggest the subscribers should upgrade, but many e-zine publishers are willing to go the extra mile to satisfy their readers. Furthermore, thanks to XML and XSLT, it's not too difficult to automate the repetitive parts of the process, making it more practical to make the effort.

In the second installment, I'll show how to combine the text conversion with JavaMail to completely automate the operation.

Resources

 