Parsing HTML Documents

These functions require the Swing API.

parseHTML( (String | File | InputStream | Reader | URL) input ,
( Map| Package | HTMLEditorKit$ParserCallback ) handlers
{, boolean ignoreCharset } )

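parseHTML() reads an HTML document from input and, as it parses, calls the callback functions defined in handlers for the text, tags, and errors it encounters. The optional ignoreCharset argument is a boolean.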

When handlers is a Map or a pnuts.lang.Package, it can hold zero or more key-value mappings. Each key should be one of the following function names, and its value should be the function to call.

handleText(char[] data, int position)
handleStartTag(HTML.Tag tag, Map attributeSet, int position)
handleEndTag(HTML.Tag tag, int position)
handleSimpleTag(HTML.Tag tag, Map attributeSet, int position)
handleError(String errorMessage, int position)
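The callback names above match the methods of javax.swing.text.html.HTMLEditorKit.ParserCallback. The Java sketch below is not Pnuts code and not taken from the Pnuts sources. It assumes that parseHTML() hands its work to Swing's ParserDelegator, which this page does not state, and records the order in which the callbacks fire. The class name CallbackTrace and the method trace are hypothetical. Note that in Java the attribute set arrives as a MutableAttributeSet, whereas the Pnuts handlers are documented as receiving a Map.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: records which parser callbacks fire, in order.
public class CallbackTrace {

    static List<String> trace(String html) throws IOException {
        final List<String> events = new ArrayList<>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                events.add("text:" + new String(data));
            }

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                events.add("start:" + tag);
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                events.add("end:" + tag);
            }

            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // Called for elements without an end tag, such as <br>.
                events.add("simple:" + tag);
            }

            @Override
            public void handleError(String errorMsg, int pos) {
                events.add("error:" + errorMsg);
            }
        };

        try (Reader reader = new StringReader(html)) {
            // The third argument is Swing's ignoreCharSet flag.
            new ParserDelegator().parse(reader, callback, true);
        }
        return events;
    }
}
```

Calling CallbackTrace.trace("<html><body><p>Hello<br>world</p></body></html>") returns a list that includes start, text, simple, and end events. The Swing parser may also report tags it implies on its own, such as head, so the list can contain more entries than the markup has tags.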

The classes javax.swing.text.html.HTML.Tag and javax.swing.text.html.HTML.Attribute can be accessed by the short names Tag and Attribute. They provide static fields that represent the tags and attributes of HTML documents, such as Tag::A and Attribute::HREF.
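As a rough illustration of what those static fields are, the Java sketch below uses the underlying Swing classes directly. Each well-known tag or attribute is a single shared constant, so it can be compared by identity, which is what a test like tag == Tag::A in a handler relies on. The class name TagConstants and its methods are hypothetical and serve only this illustration.

```java
import java.util.Locale;
import javax.swing.text.html.HTML;

// Hypothetical helper showing how the Swing tag and attribute constants behave.
public class TagConstants {

    // Returns the predefined constant for a tag name, or null if the name
    // is not one of the well-known tags.
    static HTML.Tag tagFor(String name) {
        return HTML.getTag(name.toLowerCase(Locale.ROOT));
    }

    // Returns the predefined constant for an attribute name, or null if unknown.
    static HTML.Attribute attributeFor(String name) {
        return HTML.getAttributeKey(name.toLowerCase(Locale.ROOT));
    }

    // Constants are shared instances, so an identity comparison is enough.
    static boolean isAnchor(HTML.Tag tag) {
        return tag == HTML.Tag.A;
    }
}
```

For instance, TagConstants.isAnchor(TagConstants.tagFor("A")) is true, while TagConstants.tagFor("nosuchtag") is null.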

For example, the following script extracts the links from an HTML document.

e.g.
use("html.parser")

function htmllinks(url) htmllinks(url, reader(url))

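function htmllinks(url, r) {
 w = list()   // collects the links found in the document
 // only handleStartTag is defined; the other callbacks are not needed here
 callback = $(
   function handleStartTag(tag, aset, i) {
      if (tag == Tag::A){
         ref = aset.get(Attribute::HREF)
         if (ref != null) push(w, getURL(url, ref))
      }
   }
 )
 parseHTML(r, callback, true)   // third argument: ignoreCharset
 w   // the collected links are the function's result
}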