public class: Parser

All Implemented Interfaces:
    DTDConstants

Direct Known Subclasses:
    DocumentParser
A simple DTD-driven HTML parser. The parser reads an HTML file from a Reader (see parse(Reader)) and calls various methods (which should be overridden in a subclass) when tags and data are encountered.
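The subclassing pattern described above might look like the following minimal sketch. The class name TagCollector and its collect helper are hypothetical. The way the DTD is obtained is also an assumption: this page does not say how to get one, so the sketch relies on ParserDelegator loading the default "html32" DTD when it is constructed.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.html.parser.DTD;
import javax.swing.text.html.parser.Parser;
import javax.swing.text.html.parser.ParserDelegator;
import javax.swing.text.html.parser.TagElement;

// Hypothetical subclass that records every callback it receives.
class TagCollector extends Parser {
    final List<String> events = new ArrayList<>();

    TagCollector(DTD dtd) {
        super(dtd);
    }

    @Override
    protected void handleStartTag(TagElement tag) {
        events.add("start:" + tag.getElement().getName());
    }

    @Override
    protected void handleEndTag(TagElement tag) {
        events.add("end:" + tag.getElement().getName());
    }

    @Override
    protected void handleText(char[] text) {
        events.add("text:" + new String(text));
    }

    // Hypothetical helper: parses a string and returns the recorded events.
    static List<String> collect(String html) {
        try {
            // Assumption: constructing a ParserDelegator registers the default
            // "html32" DTD, so DTD.getDTD("html32") returns a populated DTD.
            new ParserDelegator();
            TagCollector parser = new TagCollector(DTD.getDTD("html32"));
            parser.parse(new StringReader(html));
            return parser.events;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling TagCollector.collect("<p>Hello <b>world</b></p>") returns the callbacks in document order. The parser may also report tags it implies on its own, such as html or body, so the list can contain more entries than the input has tags.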

Unfortunately there are many badly implemented HTML parsers out there, and as a result there are many badly formatted HTML files. This parser attempts to parse most HTML files. This means that the implementation sometimes deviates from the SGML specification in favor of HTML.

The parser treats \r and \r\n as \n. Newlines after start tags and before end tags are ignored, just as specified in the SGML/HTML specification.
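The line-ending rule can be illustrated with a small standalone sketch. NewlineNormalizer is a hypothetical helper written for this example; it mirrors the documented rule and is not the parser's own code.

```java
// Hypothetical helper illustrating the documented rule: \r and \r\n become \n.
class NewlineNormalizer {
    static String normalize(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\r') {
                // Treat "\r\n" as a single line break by skipping the '\n'.
                if (i + 1 < s.length() && s.charAt(i + 1) == '\n') {
                    i++;
                }
                out.append('\n');
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

For instance, NewlineNormalizer.normalize("a\r\nb\rc") yields "a\nb\nc".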

The HTML spec does not specify very well how spaces are to be coalesced. Specifically, the following scenarios are not discussed (note that a space should be used here, but I am using &nbsp; to force the space to be displayed):

'<b>blah <i> <strike> foo' which can be treated as: '<b>blah <i><strike>foo'

as well as: '<p><a href="xx"> <em>Using</em></a></p>' which appears to be treated as: '<p><a href="xx"><em>Using</em></a></p>'

If strict is false, when a tag that breaks flow (TagElement.breaksFlow) or trailing whitespace is encountered, all whitespace will be ignored until a non-whitespace character is encountered. This appears to give behavior closer to the popular browsers.
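One way to observe this whitespace handling is to set the protected strict field from a subclass and inspect the text chunks passed to handleText. The sketch below is only a probe under assumptions: TextProbe and textOf are hypothetical names, and the DTD is obtained by constructing a ParserDelegator first, which this page does not describe.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.parser.DTD;
import javax.swing.text.html.parser.Parser;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical probe that shows each text chunk in brackets.
class TextProbe extends Parser {
    final StringBuilder text = new StringBuilder();

    TextProbe(DTD dtd, boolean strictMode) {
        super(dtd);
        // 'strict' is the protected field listed in the Field Summary.
        this.strict = strictMode;
    }

    @Override
    protected void handleText(char[] chunk) {
        text.append('[').append(chunk).append(']');
    }

    // Hypothetical helper: returns the bracketed text chunks for the input.
    static String textOf(String html, boolean strictMode) {
        try {
            // Assumption: this loads the default "html32" DTD as a side effect.
            new ParserDelegator();
            TextProbe probe = new TextProbe(DTD.getDTD("html32"), strictMode);
            probe.parse(new StringReader(html));
            return probe.text.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Comparing textOf(html, false) with textOf(html, true) for the second example above shows whether the space after the anchor start tag reaches handleText in each mode.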

Field Summary
protected  DTD dtd     
protected  boolean strict    This flag determines whether or not the Parser will be strict in enforcing SGML compatibility. If false, it will be lenient with certain common classes of erroneous HTML constructs. Strict or not, in either case an error will be recorded. 
Constructor:
 public Parser(DTD dtd) 
Method from javax.swing.text.html.parser.Parser Summary:
addString,   endTag,   error,   error,   error,   error,   errorContext,   flushAttributes,   getAttributes,   getBlockStartPosition,   getChars,   getChars,   getCurrentLine,   getCurrentPos,   getEndOfLineString,   getString,   handleComment,   handleEOFInComment,   handleEmptyTag,   handleEndTag,   handleError,   handleStartTag,   handleText,   handleText,   handleTitle,   ignoreElement,   legalElementContext,   legalTagContext,   makeTag,   makeTag,   markFirstTime,   parse,   parseAttributeSpecificationList,   parseAttributeValue,   parseComment,   parseContent,   parseDTDMarkup,   parseIdentifier,   parseInvalidTag,   parseLiteral,   parseMarkupDeclarations,   parseScript,   parseTag,   resetStrBuffer,   skipSpace,   startTag,   strIndexOf
Methods from java.lang.Object:
clone,   equals,   finalize,   getClass,   hashCode,   notify,   notifyAll,   toString,   wait,   wait,   wait
Method from javax.swing.text.html.parser.Parser Detail:
  void addString(int c) 
    Add a char to the string buffer.
 protected  void endTag(boolean omitted) 
    Handle an end tag. The end tag is popped from the tag stack.
 protected  void error(String err) 
 protected  void error(String err,
    String arg1) 
 protected  void error(String err,
    String arg1,
    String arg2) 
 protected  void error(String err,
    String arg1,
    String arg2,
    String arg3) 
    Invoke the error handler.
  void errorContext() throws ChangedCharSetException 
    Error context. Something went wrong; make sure we are in the document's body context.
 protected  void flushAttributes() 
 protected SimpleAttributeSet getAttributes() 
 int getBlockStartPosition() 
    Returns the start position of the current block. "Block" is overloaded here: it really means the start position of the current comment, tag, text, block, and so on. This is provided for subclassers that wish to know the start of the current block when called with one of the handleXXX methods.
 char[] getChars(int pos) 
 char[] getChars(int pos,
    int endPos) 
 protected int getCurrentLine() 
 protected int getCurrentPos() 
 String getEndOfLineString() 
    Returns the end of line string. This will return the end of line string that has been encountered the most, one of \r, \n or \r\n.
 String getString(int pos) 
    Get the string that's been accumulated.
 protected  void handleComment(char[] text) 
    Called when an HTML comment is encountered.
 protected  void handleEOFInComment() 
 protected  void handleEmptyTag(TagElement tag) throws ChangedCharSetException 
    Called when an empty tag is encountered.
 protected  void handleEndTag(TagElement tag) 
    Called when an end tag is encountered.
 protected  void handleError(int ln,
    String msg) 
    An error has occurred.
 protected  void handleStartTag(TagElement tag) 
    Called when a start tag is encountered.
 protected  void handleText(char[] text) 
    Called when PCDATA is encountered.
  void handleText(TagElement tag) 
    Output text.
 protected  void handleTitle(char[] text) 
    Called when an HTML title tag is encountered.
 boolean ignoreElement(Element elem) 
 boolean legalElementContext(Element elem) throws ChangedCharSetException 
    Create a legal context for an element.
  void legalTagContext(TagElement tag) throws ChangedCharSetException 
    Create a legal context for a tag.
 protected TagElement makeTag(Element elem) 
 protected TagElement makeTag(Element elem,
    boolean fictional) 
    Makes a TagElement.
 protected  void markFirstTime(Element elem) 
    Marks the first time a tag has been seen in a document.
 public synchronized  void parse(Reader in) throws IOException 
    Parse an HTML stream, given a DTD.
  void parseAttributeSpecificationList(Element elem) throws IOException 
    Parse attribute specification List. [31] 327:17
 String parseAttributeValue(boolean lower) throws IOException 
    Parse attribute value. [33] 331:1
  void parseComment() throws IOException 
    Parse a comment. [92] 391:7
  void parseContent() throws IOException 
    Parse Content. [24] 320:1
 public String parseDTDMarkup() throws IOException 
    Parses the Document Type Declaration markup declaration. Currently ignores it.
 boolean parseIdentifier(boolean lower) throws IOException 
    Parse identifier. Uppercase characters are folded to lowercase when lower is true. Returns false if no identifier is found. [55] 346:17
  void parseInvalidTag() throws IOException 
    Parse an invalid tag.
  void parseLiteral(boolean replace) throws IOException 
    Parse literal content. [46] 343:1 and [47] 344:1
 protected boolean parseMarkupDeclarations(StringBuffer strBuff) throws IOException 
    Parse markup declarations. Currently only handles the Document Type Declaration markup. Returns true if it is a markup declaration, false otherwise.
  void parseScript() throws IOException 
  void parseTag() throws IOException 
    Parse a start or end tag.
  void resetStrBuffer() 
  void skipSpace() throws IOException 
    Skip space. [5] 297:5
 protected  void startTag(TagElement tag) throws ChangedCharSetException 
    Handle a start tag. The new tag is pushed onto the tag stack. The attribute list is checked for required attributes.
 int strIndexOf(char target)
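
The error methods listed above all end up invoking handleError, so a subclass can collect parse errors by overriding it. The following is a minimal sketch, not a reference implementation: ErrorCollector and errorsIn are hypothetical names, and the DTD lookup assumes that constructing a ParserDelegator loads the default "html32" DTD.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.html.parser.DTD;
import javax.swing.text.html.parser.Parser;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical subclass that records each reported error with its line number.
class ErrorCollector extends Parser {
    final List<String> errors = new ArrayList<>();

    ErrorCollector(DTD dtd) {
        super(dtd);
    }

    @Override
    protected void handleError(int ln, String msg) {
        errors.add("line " + ln + ": " + msg);
    }

    // Hypothetical helper: parses the input and returns the recorded errors.
    static List<String> errorsIn(String html) {
        try {
            // Assumption: this loads the default "html32" DTD as a side effect.
            new ParserDelegator();
            ErrorCollector parser = new ErrorCollector(DTD.getDTD("html32"));
            parser.parse(new StringReader(html));
            return parser.errors;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Parsing input that contains a tag the DTD does not define, such as a made-up bogus element, should produce at least one entry naming that tag. The exact message text is an implementation detail and may differ between JDK releases.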