convert.xml.tok
Class Tokenizer

java.lang.Object
  extended by convert.xml.tok.Tokenizer

public class Tokenizer
extends java.lang.Object

Provides operations on char arrays that represent all or part of a parsed XML entity.

Several methods operate on char subarrays. The subarray is specified by a char array buf and two integers, off and end; off gives the index in buf of the first char of the subarray and end gives the index in buf of the char immediately after the last char.
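The off/end convention above is a half-open range: off is inclusive and end is exclusive, so the subarray holds end - off chars. The following minimal sketch illustrates the convention only; SubarrayDemo and subarrayToString are hypothetical names and are not part of convert.xml.tok.

```java
final class SubarrayDemo {
    /** Copies the chars in buf[off..end) into a String; end is exclusive. */
    static String subarrayToString(char[] buf, int off, int end) {
        if (off < 0 || end > buf.length || off > end) {
            throw new IndexOutOfBoundsException("off=" + off + ", end=" + end);
        }
        // The String constructor takes a count, so convert end into a length.
        return new String(buf, off, end - off);
    }

    public static void main(String[] args) {
        char[] buf = "<doc>text</doc>".toCharArray();
        // Index 5 is the first 't'; index 9 is the '<' just after "text".
        System.out.println(subarrayToString(buf, 5, 9)); // prints "text"
    }
}
```

An empty subarray is expressed as off == end.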

The main operations provided by Tokenizer are tokenizeProlog, tokenizeContent and tokenizeCdataSection; these are used to divide up an XML entity into tokens. tokenizeProlog is used for the prolog of an XML document as well as for the external subset and parameter entities (except when referenced in an EntityValue); it can also be used for parsing the Misc* that follows the document element. tokenizeContent is used for the document element and for parsed general entities that are referenced in content except for CDATA sections. tokenizeCdataSection is used for CDATA sections, following the <![CDATA[ up to and including the ]]>.

tokenizeAttributeValue and tokenizeEntityValue are used to further divide up tokens returned by tokenizeProlog and tokenizeContent; they are also used to divide up entities referenced in attribute values or entity values.
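The hand-offs between the three main operations, as described in the method details on this page, can be pictured as a small state machine. The sketch below is an illustration under stated assumptions: Mode, Event and next are hypothetical names, the events stand in for EndOfPrologException and the TOK_CDATA_SECT_OPEN and TOK_CDATA_SECT_CLOSE return values, and the use of tokenizeProlog for the Misc* after the document element is left out.

```java
final class TokenizerModeSketch {
    /** Which tokenize method an application would call next (hypothetical). */
    enum Mode { PROLOG, CONTENT, CDATA_SECTION }

    /** Simplified outcomes of a tokenize call (hypothetical). */
    enum Event { END_OF_PROLOG, CDATA_SECT_OPEN, CDATA_SECT_CLOSE, OTHER }

    static Mode next(Mode mode, Event event) {
        switch (mode) {
            case PROLOG:
                // The document element starts: switch to content tokenizing.
                return event == Event.END_OF_PROLOG ? Mode.CONTENT : Mode.PROLOG;
            case CONTENT:
                // <![CDATA[ was seen: switch to CDATA section tokenizing.
                return event == Event.CDATA_SECT_OPEN ? Mode.CDATA_SECTION : Mode.CONTENT;
            case CDATA_SECTION:
                // ]]> was seen: return to content tokenizing.
                return event == Event.CDATA_SECT_CLOSE ? Mode.CONTENT : Mode.CDATA_SECTION;
            default:
                throw new AssertionError(mode);
        }
    }

    public static void main(String[] args) {
        Mode m = Mode.PROLOG;
        m = next(m, Event.END_OF_PROLOG);
        m = next(m, Event.CDATA_SECT_OPEN);
        m = next(m, Event.CDATA_SECT_CLOSE);
        System.out.println(m); // prints "CONTENT"
    }
}
```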


Field Summary
static int TOK_ATTRIBUTE_VALUE_S
          Represents a white space character in an attribute value, excluding white space characters that are part of line boundaries.
static int TOK_CDATA_SECT_CLOSE
          Represents the end of a CDATA section ]]>.
static int TOK_CDATA_SECT_OPEN
          Represents the start of a CDATA section <![CDATA[.
static int TOK_CHAR_PAIR_REF
          Represents a numeric character reference (decimal or hexadecimal), when the referenced character is greater than 0xFFFF and so is represented by a pair of chars.
static int TOK_CHAR_REF
          Represents a numeric character reference (decimal or hexadecimal), when the referenced character is less than or equal to 0xFFFF and so is represented by a single char.
static int TOK_CLOSE_BRACKET
          Represents ] in the prolog.
static int TOK_CLOSE_PAREN
          Represents a ) in the prolog that is not followed immediately by any of *, + or ?.
static int TOK_CLOSE_PAREN_ASTERISK
          Represents )* in the prolog.
static int TOK_CLOSE_PAREN_PLUS
          Represents )+ in the prolog.
static int TOK_CLOSE_PAREN_QUESTION
          Represents )? in the prolog.
static int TOK_COMMA
          Represents , in the prolog.
static int TOK_COMMENT
          Represents a comment <!-- comment -->.
static int TOK_COND_SECT_CLOSE
          Represents ]]> in the prolog.
static int TOK_COND_SECT_OPEN
          Represents <![ in the prolog.
static int TOK_DATA_CHARS
          Represents one or more characters of data.
static int TOK_DATA_NEWLINE
          Represents a newline (CR, LF or CR followed by LF) in data.
static int TOK_DECL_CLOSE
          Represents > in the prolog.
static int TOK_DECL_OPEN
          Represents <!NAME in the prolog.
static int TOK_EMPTY_ELEMENT_NO_ATTS
          Represents an empty element tag <name/>, that doesn't have any attribute specifications.
static int TOK_EMPTY_ELEMENT_WITH_ATTS
          Represents an empty element tag <name att="val"/>, that contains one or more attribute specifications.
static int TOK_END_TAG
          Represents a complete end-tag </name>.
static int TOK_ENTITY_REF
          Represents a general entity reference.
static int TOK_LITERAL
          Represents a literal (EntityValue, AttValue, SystemLiteral or PubidLiteral).
static int TOK_MAGIC_ENTITY_REF
          Represents a general entity reference to one of the five predefined entities amp, lt, gt, quot, apos.
static int TOK_NAME
          Represents an unprefixed name in the prolog.
static int TOK_NAME_ASTERISK
          Represents a name followed immediately by *.
static int TOK_NAME_PLUS
          Represents a name followed immediately by +.
static int TOK_NAME_QUESTION
          Represents a name followed immediately by ?.
static int TOK_NMTOKEN
          Represents a name token in the prolog that is not a name.
static int TOK_OPEN_BRACKET
          Represents [ in the prolog.
static int TOK_OPEN_PAREN
          Represents a ( in the prolog.
static int TOK_OR
          Represents | in the prolog.
static int TOK_PARAM_ENTITY_REF
          Represents a parameter entity reference in the prolog.
static int TOK_PERCENT
          Represents a % in the prolog that does not start a parameter entity reference.
static int TOK_PI
          Represents a processing instruction.
static int TOK_POUND_NAME
          Represents #NAME in the prolog.
static int TOK_PREFIXED_NAME
          Represents a name with a prefix.
static int TOK_PROLOG_S
          Represents whitespace in the prolog.
static int TOK_START_TAG_NO_ATTS
          Represents a complete start-tag <name>, that doesn't have any attribute specifications.
static int TOK_START_TAG_WITH_ATTS
          Represents a complete start-tag <name att="val">, that contains one or more attribute specifications.
static int TOK_XML_DECL
          Represents an XML declaration or text declaration (a processing instruction whose target is xml).
 
Constructor Summary
Tokenizer()
           
 
Method Summary
static java.lang.String getPublicId(char[] buf, int off, int end)
          Checks that a literal contained in the specified char subarray is a legal public identifier and returns a string with the normalized content of the public id.
static boolean matchesXMLString(char[] buf, int off, int end, java.lang.String str)
          Returns true if the specified char subarray is equal to the string.
static void movePosition(char[] buf, int off, int end, Position pos)
          Moves a position forward.
static int skipIgnoreSect(char[] buf, int off, int end)
          Skips over an ignored conditional section.
static int skipS(char[] buf, int off, int end)
          Skips over XML whitespace characters at the start of the specified subarray.
static int tokenizeAttributeValue(char[] buf, int off, int end, Token token)
          Scans the first token of a char subarray that contains part of a literal attribute value.
static int tokenizeCdataSection(char[] buf, int off, int end, Token token)
          Scans the first token of a char subarray that starts with the content of a CDATA section.
static int tokenizeContent(char[] buf, int off, int end, ContentToken token)
          Scans the first token of a char subarray that contains content.
static int tokenizeEntityValue(char[] buf, int off, int end, Token token)
          Scans the first token of a char subarray that contains part of a literal entity value.
static int tokenizeProlog(char[] buf, int off, int end, Token token)
          Scans the first token of a char subarray that contains part of a prolog.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

TOK_DATA_CHARS

public static final int TOK_DATA_CHARS
Represents one or more characters of data.

See Also:
Constant Field Values

TOK_DATA_NEWLINE

public static final int TOK_DATA_NEWLINE
Represents a newline (CR, LF or CR followed by LF) in data.

See Also:
Constant Field Values

TOK_START_TAG_NO_ATTS

public static final int TOK_START_TAG_NO_ATTS
Represents a complete start-tag <name>, that doesn't have any attribute specifications.

See Also:
Constant Field Values

TOK_START_TAG_WITH_ATTS

public static final int TOK_START_TAG_WITH_ATTS
Represents a complete start-tag <name att="val">, that contains one or more attribute specifications.

See Also:
Constant Field Values

TOK_EMPTY_ELEMENT_NO_ATTS

public static final int TOK_EMPTY_ELEMENT_NO_ATTS
Represents an empty element tag <name/>, that doesn't have any attribute specifications.

See Also:
Constant Field Values

TOK_EMPTY_ELEMENT_WITH_ATTS

public static final int TOK_EMPTY_ELEMENT_WITH_ATTS
Represents an empty element tag <name att="val"/>, that contains one or more attribute specifications.

See Also:
Constant Field Values

TOK_END_TAG

public static final int TOK_END_TAG
Represents a complete end-tag </name>.

See Also:
Constant Field Values

TOK_CDATA_SECT_OPEN

public static final int TOK_CDATA_SECT_OPEN
Represents the start of a CDATA section <![CDATA[.

See Also:
Constant Field Values

TOK_CDATA_SECT_CLOSE

public static final int TOK_CDATA_SECT_CLOSE
Represents the end of a CDATA section ]]>.

See Also:
Constant Field Values

TOK_ENTITY_REF

public static final int TOK_ENTITY_REF
Represents a general entity reference.

See Also:
Constant Field Values

TOK_MAGIC_ENTITY_REF

public static final int TOK_MAGIC_ENTITY_REF
Represents a general entity reference to one of the five predefined entities amp, lt, gt, quot, apos.

See Also:
Constant Field Values
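The five predefined entities each stand for a single character, which is why a caller can resolve them without consulting any declarations. A minimal sketch of that mapping follows; PredefinedEntities and charFor are hypothetical names, not part of this package.

```java
final class PredefinedEntities {
    /** Returns the character that a predefined entity name stands for. */
    static char charFor(String name) {
        switch (name) {
            case "amp":  return '&';
            case "lt":   return '<';
            case "gt":   return '>';
            case "quot": return '"';
            case "apos": return '\'';
            default:
                throw new IllegalArgumentException("not a predefined entity: " + name);
        }
    }

    public static void main(String[] args) {
        System.out.println(charFor("amp")); // prints "&"
    }
}
```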

TOK_CHAR_REF

public static final int TOK_CHAR_REF
Represents a numeric character reference (decimal or hexadecimal), when the referenced character is less than or equal to 0xFFFF and so is represented by a single char.

See Also:
Constant Field Values

TOK_CHAR_PAIR_REF

public static final int TOK_CHAR_PAIR_REF
Represents a numeric character reference (decimal or hexadecimal), when the referenced character is greater than 0xFFFF and so is represented by a pair of chars.

See Also:
Constant Field Values
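The split between TOK_CHAR_REF and TOK_CHAR_PAIR_REF follows from Java's UTF-16 char type: a code point above 0xFFFF needs a surrogate pair. The sketch below shows the distinction using only java.lang.Character; CharRefSketch and decode are hypothetical names, and the sketch does not check that the code point is a legal XML character.

```java
final class CharRefSketch {
    /** Decodes reference text such as "&#65;" or "&#x1F600;" into one or two chars. */
    static char[] decode(String ref) {
        if (!ref.startsWith("&#") || !ref.endsWith(";")) {
            throw new IllegalArgumentException("not a character reference: " + ref);
        }
        String digits = ref.substring(2, ref.length() - 1);
        int codePoint = digits.startsWith("x")
                ? Integer.parseInt(digits.substring(1), 16)   // hexadecimal form
                : Integer.parseInt(digits, 10);               // decimal form
        // Character.toChars returns one char up to 0xFFFF, otherwise a surrogate pair.
        return Character.toChars(codePoint);
    }

    public static void main(String[] args) {
        System.out.println(decode("&#65;").length);      // prints "1"
        System.out.println(decode("&#x1F600;").length);  // prints "2"
    }
}
```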

TOK_PI

public static final int TOK_PI
Represents a processing instruction.

See Also:
Constant Field Values

TOK_XML_DECL

public static final int TOK_XML_DECL
Represents an XML declaration or text declaration (a processing instruction whose target is xml).

See Also:
Constant Field Values

TOK_COMMENT

public static final int TOK_COMMENT
Represents a comment <!-- comment -->. This can occur both in the prolog and in content.

See Also:
Constant Field Values

TOK_ATTRIBUTE_VALUE_S

public static final int TOK_ATTRIBUTE_VALUE_S
Represents a white space character in an attribute value, excluding white space characters that are part of line boundaries.

See Also:
Constant Field Values

TOK_PARAM_ENTITY_REF

public static final int TOK_PARAM_ENTITY_REF
Represents a parameter entity reference in the prolog.

See Also:
Constant Field Values

TOK_PROLOG_S

public static final int TOK_PROLOG_S
Represents whitespace in the prolog. The token contains one whitespace character.

See Also:
Constant Field Values

TOK_DECL_OPEN

public static final int TOK_DECL_OPEN
Represents <!NAME in the prolog.

See Also:
Constant Field Values

TOK_DECL_CLOSE

public static final int TOK_DECL_CLOSE
Represents > in the prolog.

See Also:
Constant Field Values

TOK_NAME

public static final int TOK_NAME
Represents an unprefixed name in the prolog.

See Also:
Constant Field Values

TOK_PREFIXED_NAME

public static final int TOK_PREFIXED_NAME
Represents a name with a prefix.

See Also:
Constant Field Values

TOK_NMTOKEN

public static final int TOK_NMTOKEN
Represents a name token in the prolog that is not a name.

See Also:
Constant Field Values

TOK_POUND_NAME

public static final int TOK_POUND_NAME
Represents #NAME in the prolog.

See Also:
Constant Field Values

TOK_OR

public static final int TOK_OR
Represents | in the prolog.

See Also:
Constant Field Values

TOK_PERCENT

public static final int TOK_PERCENT
Represents a % in the prolog that does not start a parameter entity reference. This can occur in an entity declaration.

See Also:
Constant Field Values

TOK_OPEN_PAREN

public static final int TOK_OPEN_PAREN
Represents a ( in the prolog.

See Also:
Constant Field Values

TOK_CLOSE_PAREN

public static final int TOK_CLOSE_PAREN
Represents a ) in the prolog that is not followed immediately by any of *, + or ?.

See Also:
Constant Field Values

TOK_OPEN_BRACKET

public static final int TOK_OPEN_BRACKET
Represents [ in the prolog.

See Also:
Constant Field Values

TOK_CLOSE_BRACKET

public static final int TOK_CLOSE_BRACKET
Represents ] in the prolog.

See Also:
Constant Field Values

TOK_LITERAL

public static final int TOK_LITERAL
Represents a literal (EntityValue, AttValue, SystemLiteral or PubidLiteral).

See Also:
Constant Field Values

TOK_NAME_QUESTION

public static final int TOK_NAME_QUESTION
Represents a name followed immediately by ?.

See Also:
Constant Field Values

TOK_NAME_ASTERISK

public static final int TOK_NAME_ASTERISK
Represents a name followed immediately by *.

See Also:
Constant Field Values

TOK_NAME_PLUS

public static final int TOK_NAME_PLUS
Represents a name followed immediately by +.

See Also:
Constant Field Values

TOK_COND_SECT_OPEN

public static final int TOK_COND_SECT_OPEN
Represents <![ in the prolog.

See Also:
Constant Field Values

TOK_COND_SECT_CLOSE

public static final int TOK_COND_SECT_CLOSE
Represents ]]> in the prolog.

See Also:
Constant Field Values

TOK_CLOSE_PAREN_QUESTION

public static final int TOK_CLOSE_PAREN_QUESTION
Represents )? in the prolog.

See Also:
Constant Field Values

TOK_CLOSE_PAREN_ASTERISK

public static final int TOK_CLOSE_PAREN_ASTERISK
Represents )* in the prolog.

See Also:
Constant Field Values

TOK_CLOSE_PAREN_PLUS

public static final int TOK_CLOSE_PAREN_PLUS
Represents )+ in the prolog.

See Also:
Constant Field Values

TOK_COMMA

public static final int TOK_COMMA
Represents , in the prolog.

See Also:
Constant Field Values
Constructor Detail

Tokenizer

public Tokenizer()
Method Detail

movePosition

public static void movePosition(char[] buf,
                                int off,
                                int end,
                                Position pos)
Moves a position forward. On entry, pos gives the position of the char at index off in buf. On exit, pos gives the position of the char at index end, which must be greater than or equal to off. The chars between off and end must encode one or more complete characters. A carriage return followed by a line feed is treated as a single line delimiter, provided that both are passed to movePosition in the same call.
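The following sketch shows one way such position tracking could work, including the caveat about a CR LF pair that is split across two calls. It is not the implementation of movePosition: the Position API is not documented on this page, so LineCol, its 1-based line and 0-based column fields, and advance are all assumptions made for illustration.

```java
final class PositionSketch {
    /** Hypothetical stand-in for Position: 1-based line, 0-based column. */
    static final class LineCol {
        int line = 1;
        int column = 0;
    }

    /** Advances pos over buf[off..end). */
    static void advance(char[] buf, int off, int end, LineCol pos) {
        for (int i = off; i < end; i++) {
            char c = buf[i];
            if (c == '\r') {
                // CR LF is one delimiter only when both chars are in this range.
                if (i + 1 < end && buf[i + 1] == '\n') {
                    i++;
                }
                pos.line++;
                pos.column = 0;
            } else if (c == '\n') {
                pos.line++;
                pos.column = 0;
            } else {
                // A surrogate pair encodes one character, so count it once.
                if (Character.isHighSurrogate(c) && i + 1 < end
                        && Character.isLowSurrogate(buf[i + 1])) {
                    i++;
                }
                pos.column++;
            }
        }
    }

    public static void main(String[] args) {
        LineCol pos = new LineCol();
        char[] buf = "ab\r\ncd\ne".toCharArray();
        advance(buf, 0, buf.length, pos);
        System.out.println(pos.line + ":" + pos.column); // prints "3:1"
    }
}
```

If the same text were passed as "a\r" followed by "\nb", this sketch would count two line delimiters instead of one, which is the situation the caveat in the description warns about.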


tokenizeCdataSection

public static int tokenizeCdataSection(char[] buf,
                                       int off,
                                       int end,
                                       Token token)
                                throws EmptyTokenException,
                                       PartialTokenException,
                                       InvalidTokenException,
                                       ExtensibleTokenException
Scans the first token of a char subarray that starts with the content of a CDATA section. Returns one of the TOK_ constants listed under See Also, according to the type of token that the subarray starts with.

Information about the token is stored in token.

After TOK_CDATA_SECT_CLOSE is returned, the application should use tokenizeContent.

Throws:
EmptyTokenException - if the subarray is empty
PartialTokenException - if the subarray contains only part of a legal token
InvalidTokenException - if the subarray does not start with a legal token or part of one
ExtensibleTokenException - if the subarray encodes just a carriage return ('\r')
See Also:
TOK_DATA_CHARS, TOK_DATA_NEWLINE, TOK_CDATA_SECT_CLOSE, Token, EmptyTokenException, PartialTokenException, InvalidTokenException, ExtensibleTokenException, tokenizeContent(char[], int, int, convert.xml.tok.ContentToken)
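The ExtensibleTokenException case for a lone carriage return presumably exists because CR followed by LF forms a single newline token (see TOK_DATA_NEWLINE), so a CR at the very end of the available chars cannot be classified until the next char is known. The sketch below illustrates that decision with a sentinel return value instead of an exception; NewlineSketch, newlineLength and NEED_MORE are hypothetical names.

```java
final class NewlineSketch {
    /** Hypothetical sentinel: the token may extend into chars not yet available. */
    static final int NEED_MORE = -1;

    /**
     * Returns the length of the newline at buf[off], 0 if there is none,
     * or NEED_MORE when a CR is the last available char.
     */
    static int newlineLength(char[] buf, int off, int end) {
        if (off >= end) {
            return 0;
        }
        if (buf[off] == '\n') {
            return 1;
        }
        if (buf[off] == '\r') {
            if (off + 1 == end) {
                return NEED_MORE; // could be the first half of CR LF
            }
            return buf[off + 1] == '\n' ? 2 : 1;
        }
        return 0;
    }

    public static void main(String[] args) {
        System.out.println(newlineLength("\r\nx".toCharArray(), 0, 3)); // prints "2"
        System.out.println(newlineLength("\r".toCharArray(), 0, 1));    // prints "-1"
    }
}
```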

tokenizeContent

public static int tokenizeContent(char[] buf,
                                  int off,
                                  int end,
                                  ContentToken token)
                           throws PartialTokenException,
                                  InvalidTokenException,
                                  EmptyTokenException,
                                  ExtensibleTokenException
Scans the first token of a char subarray that contains content. Returns one of the TOK_ constants listed under See Also, according to the type of token that the subarray starts with.

Information about the token is stored in token.

When TOK_CDATA_SECT_OPEN is returned, tokenizeCdataSection should be called until it returns TOK_CDATA_SECT_CLOSE.

Throws:
EmptyTokenException - if the subarray is empty
PartialTokenException - if the subarray contains only part of a legal token
InvalidTokenException - if the subarray does not start with a legal token or part of one
ExtensibleTokenException - if the subarray encodes just a carriage return ('\r')
See Also:
TOK_START_TAG_NO_ATTS, TOK_START_TAG_WITH_ATTS, TOK_EMPTY_ELEMENT_NO_ATTS, TOK_EMPTY_ELEMENT_WITH_ATTS, TOK_END_TAG, TOK_DATA_CHARS, TOK_DATA_NEWLINE, TOK_CDATA_SECT_OPEN, TOK_ENTITY_REF, TOK_MAGIC_ENTITY_REF, TOK_CHAR_REF, TOK_CHAR_PAIR_REF, TOK_PI, TOK_XML_DECL, TOK_COMMENT, ContentToken, EmptyTokenException, PartialTokenException, InvalidTokenException, ExtensibleTokenException, tokenizeCdataSection(char[], int, int, convert.xml.tok.Token)

tokenizeProlog

public static int tokenizeProlog(char[] buf,
                                 int off,
                                 int end,
                                 Token token)
                          throws PartialTokenException,
                                 InvalidTokenException,
                                 EmptyTokenException,
                                 ExtensibleTokenException,
                                 EndOfPrologException
Scans the first token of a char subarray that contains part of a prolog. Returns one of the TOK_ constants listed under See Also, according to the type of token that the subarray starts with.

Throws:
EmptyTokenException - if the subarray is empty
PartialTokenException - if the subarray contains only part of a legal token
InvalidTokenException - if the subarray does not start with a legal token or part of one
EndOfPrologException - if the subarray starts with the document element; tokenizeContent should be used on the remainder of the entity
ExtensibleTokenException - if the subarray is a legal token but subsequent chars in the same entity could be part of the token
See Also:
TOK_PI, TOK_XML_DECL, TOK_COMMENT, TOK_PARAM_ENTITY_REF, TOK_PROLOG_S, TOK_DECL_OPEN, TOK_DECL_CLOSE, TOK_NAME, TOK_NMTOKEN, TOK_POUND_NAME, TOK_OR, TOK_PERCENT, TOK_OPEN_PAREN, TOK_CLOSE_PAREN, TOK_OPEN_BRACKET, TOK_CLOSE_BRACKET, TOK_LITERAL, TOK_NAME_QUESTION, TOK_NAME_ASTERISK, TOK_NAME_PLUS, TOK_COND_SECT_OPEN, TOK_COND_SECT_CLOSE, TOK_CLOSE_PAREN_QUESTION, TOK_CLOSE_PAREN_ASTERISK, TOK_CLOSE_PAREN_PLUS, TOK_COMMA, ContentToken, EmptyTokenException, PartialTokenException, InvalidTokenException, ExtensibleTokenException, EndOfPrologException

tokenizeAttributeValue

public static int tokenizeAttributeValue(char[] buf,
                                         int off,
                                         int end,
                                         Token token)
                                  throws PartialTokenException,
                                         InvalidTokenException,
                                         EmptyTokenException,
                                         ExtensibleTokenException
Scans the first token of a char subarray that contains part of a literal attribute value. The opening and closing delimiters are not included in the subarray. Returns one of the TOK_ constants listed under See Also, according to the type of token that the subarray starts with.

Throws:
EmptyTokenException - if the subarray is empty
PartialTokenException - if the subarray contains only part of a legal token
InvalidTokenException - if the subarray does not start with a legal token or part of one
ExtensibleTokenException - if the subarray encodes just a carriage return ('\r')
See Also:
TOK_DATA_CHARS, TOK_DATA_NEWLINE, TOK_ATTRIBUTE_VALUE_S, TOK_MAGIC_ENTITY_REF, TOK_ENTITY_REF, TOK_CHAR_REF, TOK_CHAR_PAIR_REF, Token, EmptyTokenException, PartialTokenException, InvalidTokenException, ExtensibleTokenException

tokenizeEntityValue

public static int tokenizeEntityValue(char[] buf,
                                      int off,
                                      int end,
                                      Token token)
                               throws PartialTokenException,
                                      InvalidTokenException,
                                      EmptyTokenException,
                                      ExtensibleTokenException
Scans the first token of a char subarray that contains part of a literal entity value. The opening and closing delimiters are not included in the subarray. Returns one of the TOK_ constants listed under See Also, according to the type of token that the subarray starts with.

Throws:
EmptyTokenException - if the subarray is empty
PartialTokenException - if the subarray contains only part of a legal token
InvalidTokenException - if the subarray does not start with a legal token or part of one
ExtensibleTokenException - if the subarray encodes just a carriage return ('\r')
See Also:
TOK_DATA_CHARS, TOK_DATA_NEWLINE, TOK_MAGIC_ENTITY_REF, TOK_ENTITY_REF, TOK_PARAM_ENTITY_REF, TOK_CHAR_REF, TOK_CHAR_PAIR_REF, Token, EmptyTokenException, PartialTokenException, InvalidTokenException, ExtensibleTokenException

skipIgnoreSect

public static int skipIgnoreSect(char[] buf,
                                 int off,
                                 int end)
                          throws PartialTokenException,
                                 InvalidTokenException
Skips over an ignored conditional section. The subarray starts immediately after the <![ IGNORE [.

Returns:
the index of the character following the closing ]]>
Throws:
PartialTokenException - if the subarray does not contain the complete ignored conditional section
InvalidTokenException - if the ignored conditional section contains illegal characters
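In XML 1.0, ignored conditional sections can contain nested <![ ... ]]> pairs, so finding the closing ]]> requires tracking nesting depth. The sketch below shows that scan under simplifying assumptions: it performs no character validation, it signals an incomplete section with IllegalStateException rather than PartialTokenException, and IgnoreSectSketch and skip are hypothetical names.

```java
final class IgnoreSectSketch {
    /** Returns the index just after the ]]> that closes the section starting at off. */
    static int skip(char[] buf, int off, int end) {
        int depth = 1; // the caller has already consumed the opening <![ IGNORE [
        int i = off;
        while (i < end) {
            if (startsWith(buf, i, end, "<![")) {
                depth++;        // nested conditional section opens
                i += 3;
            } else if (startsWith(buf, i, end, "]]>")) {
                depth--;        // innermost open section closes
                i += 3;
                if (depth == 0) {
                    return i;
                }
            } else {
                i++;
            }
        }
        throw new IllegalStateException("ignored section is not complete");
    }

    private static boolean startsWith(char[] buf, int i, int end, String s) {
        if (end - i < s.length()) {
            return false;
        }
        for (int k = 0; k < s.length(); k++) {
            if (buf[i + k] != s.charAt(k)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String s = " a <![ b ]]> c ]]>rest";
        int next = skip(s.toCharArray(), 0, s.length());
        System.out.println(s.substring(next)); // prints "rest"
    }
}
```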

getPublicId

public static java.lang.String getPublicId(char[] buf,
                                           int off,
                                           int end)
                                    throws InvalidTokenException
Checks that a literal contained in the specified char subarray is a legal public identifier and returns a string with the normalized content of the public id. The subarray includes the opening and closing quotes.

Throws:
InvalidTokenException - if it is not a legal public identifier
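The XML 1.0 specification normalizes a public identifier by collapsing each run of white space to a single space and removing leading and trailing white space; the legal characters are given by its PubidChar production. The sketch below applies those two rules to a quoted literal. It is an illustration, not this method's implementation: PublicIdSketch and normalize are hypothetical names, IllegalArgumentException stands in for InvalidTokenException, and the quotes themselves are not checked.

```java
final class PublicIdSketch {
    // Punctuation allowed by the PubidChar production of XML 1.0.
    private static final String PUNCTUATION = "-'()+,./:=?;!*#@$_%";

    /** Normalizes the literal in buf[off..end), which includes both quotes. */
    static String normalize(char[] buf, int off, int end) {
        StringBuilder out = new StringBuilder();
        boolean pendingSpace = false;
        // buf[off] and buf[end - 1] are the quotes, so scan strictly between them.
        for (int i = off + 1; i < end - 1; i++) {
            char c = buf[i];
            if (c == ' ' || c == '\r' || c == '\n') {
                // Remember one space per run; leading white space is dropped.
                pendingSpace = out.length() > 0;
            } else if (isPubidNonSpace(c)) {
                if (pendingSpace) {
                    out.append(' ');
                    pendingSpace = false;
                }
                out.append(c);
            } else {
                throw new IllegalArgumentException("illegal public id char at index " + i);
            }
        }
        // A space still pending here was trailing white space and is discarded.
        return out.toString();
    }

    private static boolean isPubidNonSpace(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                || (c >= '0' && c <= '9') || PUNCTUATION.indexOf(c) >= 0;
    }

    public static void main(String[] args) {
        char[] lit = "\"  -//W3C//DTD   XHTML 1.0\n Strict//EN \"".toCharArray();
        System.out.println(normalize(lit, 0, lit.length));
    }
}
```

Note that a tab is not a legal public identifier character, so the sketch rejects it rather than treating it as white space.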

matchesXMLString

public static boolean matchesXMLString(char[] buf,
                                       int off,
                                       int end,
                                       java.lang.String str)
Returns true if the specified char subarray is equal to the string. The string must contain only XML significant characters.
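A comparison of this kind can be sketched as a length check followed by a char-by-char scan, which avoids allocating a String for the subarray. MatchSketch and matches are hypothetical names; the sketch compares chars directly and does not model the restriction on the characters that str may contain.

```java
final class MatchSketch {
    /** Returns true if buf[off..end) holds exactly the chars of str. */
    static boolean matches(char[] buf, int off, int end, String str) {
        if (end - off != str.length()) {
            return false; // different lengths can never be equal
        }
        for (int i = 0; i < str.length(); i++) {
            if (buf[off + i] != str.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        char[] buf = "<!DOCTYPE doc>".toCharArray();
        // Chars 2..9 (end exclusive) spell DOCTYPE.
        System.out.println(matches(buf, 2, 9, "DOCTYPE")); // prints "true"
    }
}
```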


skipS

public static int skipS(char[] buf,
                        int off,
                        int end)
Skips over XML whitespace characters at the start of the specified subarray.

Returns:
the index of the first non-whitespace character, or end if the subarray is all whitespace
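XML 1.0 defines white space as space, tab, carriage return and line feed. A minimal sketch of a skip over those four characters follows; SkipSketch and skipWhitespace are hypothetical names, and the sketch is not the implementation of skipS.

```java
final class SkipSketch {
    /** Returns the index of the first non-whitespace char in buf[off..end), or end. */
    static int skipWhitespace(char[] buf, int off, int end) {
        while (off < end) {
            switch (buf[off]) {
                case ' ':
                case '\t':
                case '\r':
                case '\n':
                    off++;      // XML white space: keep scanning
                    break;
                default:
                    return off; // first char that is not white space
            }
        }
        return end; // the whole subarray was white space
    }

    public static void main(String[] args) {
        char[] buf = " \t\r\nname".toCharArray();
        System.out.println(skipWhitespace(buf, 0, buf.length)); // prints "4"
    }
}
```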