Building a Quick and Dirty XML Parser

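XML is a popular data format today. Its strengths are that it is human-readable, self-describing, and easy to use. Unfortunately, Java-based XML parsers tend to be large: Sun's jaxp.jar and parser.jar, for example, are each 1.4 MB. If your program has to run where memory is limited, such as a J2ME environment, or where bandwidth is limited, such as an applet, these large packages are not a good choice.
Note: all the code needed for this article can be downloaded from this link:
http://www.matrix.org.cn/down_view.ASP?id=67
Here is the code for QDParser: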
package qdxml;
import java.io.*;
import java.util.*;

/** Quick and Dirty XML parser.  This parser is, like the SAX parser,
    an event based parser, but with much less functionality.  */
public class QDParser {
  private static int popMode(Stack st) {
    if(!st.empty())
      return ((Integer)st.pop()).intValue();
    else
      return PRE;
  }
  private final static int
    TEXT = 1,
    ENTITY = 2,
    OPEN_TAG = 3,
    CLOSE_TAG = 4,
    START_TAG = 5,
    ATTRIBUTE_LVALUE = 6,
    ATTRIBUTE_EQUAL = 9,
    ATTRIBUTE_RVALUE = 10,
    QUOTE = 7,
    IN_TAG = 8,
    SINGLE_TAG = 12,
    COMMENT = 13,
    DONE = 11,
    DOCTYPE = 14,
    PRE = 15,
    CDATA = 16;
  public static void parse(DocHandler doc,Reader r) throws Exception {
    Stack st = new Stack();
    int depth = 0;
    int mode = PRE;
    int c = 0;
    int quotec = '"';
    depth = 0;
    StringBuffer sb = new StringBuffer();
    StringBuffer etag = new StringBuffer();
    String tagName = null;
    String lvalue = null;
    String rvalue = null;
    Hashtable attrs = null;
    st = new Stack();
    doc.startDocument();
    int line=1, col=0;
    boolean eol = false;
    while((c = r.read()) != -1) {

      // We need to map \r, \r\n, and \n to \n
      // See XML spec section 2.11
      if(c == '\n' && eol) {
        eol = false;
        continue;
      } else if(eol) {
        eol = false;
      } else if(c == '\n') {
        line++;
        col=0;
      } else if(c == '\r') {
        eol = true;
        c = '\n';
        line++;
        col=0;
      } else {
        col++;
      }

      if(mode == DONE) {
        doc.endDocument();
        return;

      // We are between tags collecting text.
      } else if(mode == TEXT) {
        if(c == '<') {
          st.push(new Integer(mode));
          mode = START_TAG;
          if(sb.length() > 0) {
            doc.text(sb.toString());
            sb.setLength(0);
          }
        } else if(c == '&') {
          st.push(new Integer(mode));
          mode = ENTITY;
          etag.setLength(0);
        } else
          sb.append((char)c);

      // we are processing a closing tag: e.g. </foo>
      } else if(mode == CLOSE_TAG) {
        if(c == '>') {
          mode = popMode(st);
          tagName = sb.toString();
          sb.setLength(0);
          depth--;
          if(depth==0)
            mode = DONE;
          doc.endElement(tagName);
        } else {
          sb.append((char)c);
        }

      // we are processing CDATA
      } else if(mode == CDATA) {
        if(c == '>'
        && sb.toString().endsWith("]]")) {
          sb.setLength(sb.length()-2);
          doc.text(sb.toString());
          sb.setLength(0);
          mode = popMode(st);
        } else
          sb.append((char)c);

      // we are processing a comment.  We are inside
      // the <!-- .... --> looking for the -->.
      } else if(mode == COMMENT) {
        if(c == '>'
        && sb.toString().endsWith("--")) {
          sb.setLength(0);
          mode = popMode(st);
        } else
          sb.append((char)c);

      // We are outside the root tag element
      } else if(mode == PRE) {
        if(c == '<') {
          mode = TEXT;
          st.push(new Integer(mode));
          mode = START_TAG;
        }

      // We are inside one of these <? ... ?>
      // or one of these <!DOCTYPE ... >
      } else if(mode == DOCTYPE) {
        if(c == '>') {
          mode = popMode(st);
          if(mode == TEXT) mode = PRE;
        }

      // we have just seen a < and
      // are wondering what we are looking at
      // <foo>, </foo>, <!-- ... -->, etc.
      } else if(mode == START_TAG) {
        mode = popMode(st);
        if(c == '/') {
          st.push(new Integer(mode));
          mode = CLOSE_TAG;
        } else if (c == '?') {
          mode = DOCTYPE;
        } else {
          st.push(new Integer(mode));
          mode = OPEN_TAG;
          tagName = null;
          attrs = new Hashtable();
          sb.append((char)c);
        }

      // we are processing an entity, e.g. &lt;, &#187;, etc.
      } else if(mode == ENTITY) {
        if(c == ';') {
          mode = popMode(st);
          String cent = etag.toString();
          etag.setLength(0);
          if(cent.equals("lt"))
            sb.append('<');
          else if(cent.equals("gt"))
            sb.append('>');
          else if(cent.equals("amp"))
            sb.append('&');
          else if(cent.equals("quot"))
            sb.append('"');
          else if(cent.equals("apos"))
            sb.append('\'');
          // Could parse hex entities if we wanted to
          //else if(cent.startsWith("#x"))
            //sb.append((char)Integer.parseInt(cent.substring(2),16));
          else if(cent.startsWith("#"))
            sb.append((char)Integer.parseInt(cent.substring(1)));
          // Insert custom entity definitions here
          else
            exc("Unknown entity: &"+cent+";",line,col);
        } else {
          etag.append((char)c);
        }

      // we have just seen something like this:
      // <foo a="b"/
      // and are looking for the final >.
      } else if(mode == SINGLE_TAG) {
        if(tagName == null)
          tagName = sb.toString();
        if(c != '>')
          exc("Expected > for tag: <"+tagName+"/>",line,col);
        doc.startElement(tagName,attrs);
        doc.endElement(tagName);
        if(depth==0) {
          doc.endDocument();
          return;
        }
        sb.setLength(0);
        attrs = new Hashtable();
        tagName = null;
        mode = popMode(st);

      // we are processing something
      // like this <foo ... >.  It could
      // still be a <!-- ... --> or something.
      } else if(mode == OPEN_TAG) {
        if(c == '>') {
          if(tagName == null)
            tagName = sb.toString();
          sb.setLength(0);
          depth++;
          doc.startElement(tagName,attrs);
          tagName = null;
          attrs = new Hashtable();
          mode = popMode(st);
        } else if(c == '/') {
          mode = SINGLE_TAG;
        } else if(c == '-' && sb.toString().equals("!-")) {
          mode = COMMENT;
        } else if(c == '[' && sb.toString().equals("![CDATA")) {
          mode = CDATA;
          sb.setLength(0);
        } else if(c == 'E' && sb.toString().equals("!DOCTYP")) {
          sb.setLength(0);
          mode = DOCTYPE;
        } else if(Character.isWhitespace((char)c)) {
          tagName = sb.toString();
          sb.setLength(0);
          mode = IN_TAG;
        } else {
          sb.append((char)c);
        }

      // We are processing the quoted right-hand side
      // of an element's attribute.
      } else if(mode == QUOTE) {
        if(c == quotec) {
          rvalue = sb.toString();
          sb.setLength(0);
          attrs.put(lvalue,rvalue);
          mode = IN_TAG;
        // See the XML spec, section 3.3.3
        // on normalization processing.
        } else if(" \r\n\u0009".indexOf(c)>=0) {
          sb.append(' ');
        } else if(c == '&') {
          st.push(new Integer(mode));
          mode = ENTITY;
          etag.setLength(0);
        } else {
          sb.append((char)c);
        }

      } else if(mode == ATTRIBUTE_RVALUE) {
        if(c == '"' || c == '\'') {
          quotec = c;
          mode = QUOTE;
        } else if(Character.isWhitespace((char)c)) {
          ;
        } else {
          exc("Error in attribute processing",line,col);
        }

      } else if(mode == ATTRIBUTE_LVALUE) {
        if(Character.isWhitespace((char)c)) {
          lvalue = sb.toString();
          sb.setLength(0);
          mode = ATTRIBUTE_EQUAL;
        } else if(c == '=') {
          lvalue = sb.toString();
          sb.setLength(0);
          mode = ATTRIBUTE_RVALUE;
        } else {
          sb.append((char)c);
        }

      } else if(mode == ATTRIBUTE_EQUAL) {
        if(c == '=') {
          mode = ATTRIBUTE_RVALUE;
        } else if(Character.isWhitespace((char)c)) {
          ;
        } else {
          exc("Error in attribute processing.",line,col);
        }

      } else if(mode == IN_TAG) {
        if(c == '>') {
          mode = popMode(st);
          doc.startElement(tagName,attrs);
          depth++;
          tagName = null;
          attrs = new Hashtable();
        } else if(c == '/') {
          mode = SINGLE_TAG;
        } else if(Character.isWhitespace((char)c)) {
          ;
        } else {
          mode = ATTRIBUTE_LVALUE;
          sb.append((char)c);
        }
      }
    }
    if(mode == DONE)
      doc.endDocument();
    else
      exc("missing end tag",line,col);
  }
  private static void exc(String s,int line,int col)
    throws Exception
  {
    throw new Exception(s+" near line "+line+", column "+col);
  }
}
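Why not use SAX?
You could implement the SAX interfaces with only limited functionality, throwing a NotImplemented exception whenever you meet something you do not need.
That would certainly give you classes smaller than jaxp.jar and parser.jar. You can get an even smaller footprint, though, by defining your own classes. In fact, the classes defined here are much smaller than the SAX interfaces themselves.
Our quick and dirty XML parser is somewhat similar to SAX. Like a SAX parser, it lets you implement an interface to receive and handle events for attributes and start/end tags. If you have used SAX before, it will feel familiar.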

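Limited XML functionality
Many people like XML as a simple, self-describing, text-based data format. They want easy access to the elements, the attributes, and the attribute values. With that in mind, let us consider which features we really need.
Our simple parser has just one class, QDParser, and one interface, DocHandler. QDParser has a single public static method, parse(DocHandler, Reader), which we implement as a finite state machine.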
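The article does not list DocHandler itself. The following is a minimal sketch of what the interface probably looks like, inferred only from the calls that QDParser.parse() makes in the listing above; the parameter names and throws clauses are assumptions, and the downloadable source may declare it differently. EventLog is a hypothetical handler added here to show how the callbacks fit together.

```java
import java.util.ArrayList;
import java.util.Hashtable;
import java.util.List;

// Callback interface inferred from the calls made by QDParser.parse().
// Parameter names and "throws Exception" are assumptions.
interface DocHandler {
  void startDocument() throws Exception;
  void endDocument() throws Exception;
  void startElement(String tag, Hashtable attrs) throws Exception;
  void endElement(String tag) throws Exception;
  void text(String str) throws Exception;
}

// Hypothetical handler that records every event as a string,
// in the order the parser would deliver them.
class EventLog implements DocHandler {
  final List<String> events = new ArrayList<String>();

  public void startDocument() { events.add("startDocument"); }
  public void endDocument() { events.add("endDocument"); }

  public void startElement(String tag, Hashtable attrs) {
    // Hashtable.toString() renders the attributes as {name=value, ...}
    events.add("start:" + tag + attrs);
  }

  public void endElement(String tag) { events.add("end:" + tag); }
  public void text(String str) { events.add("text:" + str); }
}
```

An instance of such a handler would be passed as the first argument of QDParser.parse(), and the parser would then call these methods as it moves through its states.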
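Our simple parser treats the DTD declaration <!DOCTYPE> and <?xml version="1.0"?> as mere comments, so they cause no confusion, and their contents are of no use to us.
Because we do not process DOCTYPE, our parser cannot read custom entity definitions. Only the standard ones are available: &amp;, &lt;, &gt;, &apos;, and &quot;. If that is not enough, you can insert code to add your own definitions, or you can preprocess your XML file before handing it to QDParser.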
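One way to add custom entities is to replace the chain of equals() tests in the ENTITY state with a lookup table. The sketch below is not part of QDParser; EntityTable, define, and resolve are hypothetical names. It also handles hexadecimal character references, which the listing above leaves commented out.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical lookup table for entity names (the text between '&' and ';').
class EntityTable {
  private final Map<String, String> entities = new HashMap<String, String>();

  EntityTable() {
    // The five predefined XML entities.
    entities.put("lt", "<");
    entities.put("gt", ">");
    entities.put("amp", "&");
    entities.put("quot", "\"");
    entities.put("apos", "'");
  }

  // Registers a custom entity, e.g. define("copy", "\u00A9").
  void define(String name, String value) {
    entities.put(name, value);
  }

  // Returns the replacement text for an entity name.
  // Numeric references are cast to char, so only code points
  // up to U+FFFF are handled, as in the original parser.
  String resolve(String name) {
    if (name.startsWith("#x"))
      return String.valueOf((char) Integer.parseInt(name.substring(2), 16));
    if (name.startsWith("#"))
      return String.valueOf((char) Integer.parseInt(name.substring(1)));
    String value = entities.get(name);
    if (value == null)
      throw new IllegalArgumentException("Unknown entity: &" + name + ";");
    return value;
  }
}
```

The "#x" test has to come before the plain "#" test; otherwise a hexadecimal reference would be passed to the decimal parser and fail.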
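Our simple parser does not support conditional sections either, for example <![INCLUDE[ ... ]]> or <![IGNORE[ ... ]]>. Since we cannot define custom entities through DOCTYPE, this feature would be of no use to us anyway. Conditional sections can be resolved before the data is sent to our limited-capacity device.
Because our parser does not process any attribute declarations, the XML specification requires us to treat every attribute as type CDATA. This lets us use java.util.Hashtable in place of org.xml.sax.AttributeList to hold an element's attribute list. The Hashtable carries only name/value pairs; we have no need for getType(), since it would always return CDATA anyway.
The lack of attribute declarations has other consequences. For example, the parser does not supply default attribute values. Nor can whitespace be collapsed automatically by declaring an attribute as NMTOKENS. These things can be handled when we prepare or generate the XML file, however, and the extra code can live outside the program that uses our parser.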
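If preparing the XML file in advance is not an option, both gaps can also be papered over in the handler. The helper below is a rough sketch, not something the article's source provides; AttrHelper, attr, and collapse are hypothetical names. It assumes the attribute value has already passed through QDParser, which turns tabs and line breaks inside quoted values into spaces.

```java
import java.util.Hashtable;

// Hypothetical helpers for use inside a DocHandler implementation.
class AttrHelper {
  // Returns the attribute value, or defaultValue when the attribute
  // is absent (the parser itself never supplies defaults).
  static String attr(Hashtable attrs, String name, String defaultValue) {
    Object value = attrs.get(name);
    return value == null ? defaultValue : (String) value;
  }

  // Approximates the extra normalization applied to non-CDATA
  // attribute types: strip leading and trailing whitespace and
  // collapse each run of spaces into a single space.
  static String collapse(String value) {
    return value.trim().replaceAll(" +", " ");
  }
}
```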
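In practice, all of the missing features can be compensated for when the XML file is prepared, so much of what our parser leaves out is simply shifted to document preparation.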

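Parser features
After all this talk about what our parser cannot do, what can it do?
1.        It recognizes the start and end tags of all elements.
2.        It lists all attributes.
3.        It recognizes the <![CDATA[ ... ]]> construct.
4.        It recognizes the standard entities &amp;, &lt;, &gt;, &quot;, and &apos;, as well as numeric entities.
5.        It maps \r\n and \r to \n on input, treating each as a single line ending, in accordance with section 2.11 of the XML specification.
The parser performs only very limited error checking. It throws an exception when it encounters incorrect syntax, such as an entity it does not recognize.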
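The line-ending rule in item 5 is the part of the state machine that is easiest to look at in isolation. The sketch below restates the same idea as a standalone method working on a String rather than a Reader; LineEndings and normalize are hypothetical names and are not part of QDParser.

```java
// Hypothetical standalone version of the end-of-line handling
// performed at the top of the parse loop.
class LineEndings {
  // Maps "\r\n" and a lone "\r" to "\n"; a lone "\n" is kept as it is.
  static String normalize(String in) {
    StringBuilder out = new StringBuilder();
    boolean afterCr = false;   // true when the previous char was '\r'
    for (int i = 0; i < in.length(); i++) {
      char c = in.charAt(i);
      if (c == '\n' && afterCr) {
        // Second half of "\r\n": a '\n' was already emitted for the '\r'.
        afterCr = false;
        continue;
      }
      afterCr = (c == '\r');
      out.append(afterCr ? '\n' : c);
    }
    return out.toString();
  }
}
```

Emitting the '\n' as soon as the '\r' is seen, and then swallowing an immediately following '\n', means no lookahead is needed, which suits a parser that reads one character at a time.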

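How to use the parser
Using our quick and dirty parser is simple. First implement the DocHandler interface; then you can parse an XML file:
         DocHandler doc = new MyDocHandler();
QDParser.parse(doc,new FileReader("config.xml"));
The source code includes two examples that implement the full DocHandler interface. The first, Reporter, simply prints out what it reads; you can try it with the sample XML file, config.xml.
The second example, Conf, is a little more complex. Conf updates data that is already resident in memory. It uses java.lang.reflect to locate the fields and objects named in config.xml. If you run the program, it tells you which objects are being updated and how. If it is asked to update a field that does not exist, it reports an error.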
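The Conf source is not reproduced in this article, so the following is only an illustration of the underlying technique: setting a field by name through java.lang.reflect and reporting an error when the field does not exist. Settings and FieldUpdater are hypothetical classes, and the sketch is limited to public static int and String fields.

```java
import java.lang.reflect.Field;

// Hypothetical holder for configuration that is already in memory.
class Settings {
  public static int port = 80;
  public static String host = "localhost";
}

// Hypothetical helper; the real Conf example may work differently.
class FieldUpdater {
  // Sets a public static int or String field of target from its
  // textual value, as it might appear in an XML attribute.
  static void update(Class<?> target, String fieldName, String value) {
    try {
      Field f = target.getField(fieldName);   // public fields only
      if (f.getType() == int.class) {
        f.setInt(null, Integer.parseInt(value));   // null: static field
      } else if (f.getType() == String.class) {
        f.set(null, value);
      } else {
        throw new IllegalArgumentException(
            "Unsupported field type: " + f.getType().getName());
      }
    } catch (NoSuchFieldException e) {
      throw new IllegalArgumentException(
          "No such field: " + target.getName() + "." + fieldName);
    } catch (IllegalAccessException e) {
      throw new IllegalStateException("Cannot access field: " + fieldName);
    }
  }
}
```

A handler built this way could call FieldUpdater.update(Settings.class, "port", "8080") from startElement(), taking the field name and value from the attribute table.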

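Modifying the package
You can modify this class to suit your own needs. You can add your own entity definitions at line 180 of QDParser.java, at the comment "Insert custom entity definitions here".
You can also add the features I left out to the parser's finite state machine. This is fairly easy, because there is so little existing code.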

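Keep it small
QDParser is only about 3 KB once it is compiled or packaged into a jar file. The source is only about 300 lines, comments included. That makes it practical for many small-capacity devices while still conforming to the XML standard and providing the basic functionality.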

Translated and published by Matrix open-source technology with authorization from JavaWorld.
Translator: chris.
