A mini-tutorial on how to write a "raw" SAX parser using SaxObjC. The example implements a very simple RSS newsfeed parser.

Introduction

When writing an XML processor using SAX, you are dealing with two entities:

  • the SAX parser (in SAX slang called a "reader")
  • the SAX handler
The parser is the part which actually reads an XML file and produces a sequence of "SAX events" which are sent to the handler. If you know AppKit, a SAX handler is basically a "delegate" of the parser.
So for doing XML processing, you only need to know how to instantiate a parser and how to write a handler dealing with SAX events.

What do we want to parse?

The example implements a simple RSS parser which collects just the RSS item information and gives it back to the processor as an NSArray of NSDictionaries.
Sample RSS section:

...
<item rdf:about="http://slashdot.org/article.pl?sid=03/01/22/1413238">
<title>Elect Steve Jobs President of the United States</title>
<link>http://slashdot.org/article.pl?sid=03/01/22/1413238</link>
<description>
Will Foster writes "There is a groundswell of support for electing Steve Jobs
President of the United States." I'll vote for him if I can write in my vote -- ...
</description>
<dc:subject>humor</dc:subject>
<dc:date>2003-01-22T23:12:48+00:00</dc:date>
...
</item>
...

1. Instantiating the Parser

SaxObjC parsers are usually implemented as bundles and are managed by the SaxXMLReaderFactory class which can locate and load an appropriate SAX parser bundle for you.

id parser;
parser = [[SaxXMLReaderFactory standardXMLReaderFactory]
           createXMLReaderForMimeType:@"text/xml"];

You can reuse a parser instantiated like this as many times as you wish, but you can use it in only one thread (the object itself is not reentrant).

2. Writing a SAX Handler

To do the actual XML processing, you need to write a SAX handler class. What we show here is only a very simplified one for RSS, but it shows the concepts pretty well.

SAX handlers usually inherit from the SaxDefaultHandler class, which already implements all SAX handler protocols with (usually empty) default implementations.
So does ours:

@interface RSSSaxHandler : SaxDefaultHandler
{
  NSMutableArray      *entries;
  /* parsing state */
  NSMutableDictionary *entry;
  BOOL                isInItem; /* are we inside an 'item' tag ? */
  NSString            *value;   /* the (PCDATA) content of a tag */
}
- (NSArray *)rssEntries;
@end

We have one array variable entries for keeping the results of the processing. The other variables are required for tracking where we are in the XML document and for collecting data, see below ...

The usual stuff: @implementation for implementing the handler class, setting up some objects used for processing, and ensuring that they are correctly deallocated:

@implementation RSSSaxHandler
- (id)init {
  if ((self = [super init])) {
    self->entries = [[NSMutableArray alloc] initWithCapacity:16];
    self->entry   = [[NSMutableDictionary alloc] initWithCapacity:8];
  }
  return self;
}
- (void)dealloc {
  [self->value   release]; /* may still hold data if parsing was aborted */
  [self->entry   release];
  [self->entries release];
  [super dealloc];
}

Once parsing is done, we have collected all <item> information in the entries array of the handler. We reuse that array for each parsing invocation, so we give back a copy of the array in the results accessor called "-rssEntries":

- (NSArray *)rssEntries {
  return [[self->entries copy] autorelease];
}

The SAX reader sends the handler a startDocument message before parsing starts and an endDocument message when it is done. Those callbacks are useful to set up and tear down per-document processing state. In this startDocument implementation we ensure that the entries array is empty (eg if a processing error occurred in the previous run, it might contain partial results).

- (void)startDocument {
  [self->entries removeAllObjects];
  self->isInItem = NO;
}

The SAX parser triggers callbacks if it encounters tags, processing instructions, content, errors, namespace declarations, etc. All callbacks are implemented by our superclass SaxDefaultHandler, so we only need to override the callbacks we are interested in: tags and content.

If the SAX parser encounters a start tag (eg <item>) it calls the startElement callback and passes in the tag name: as it appears in the file in rawName, and after XML namespace processing in localName and ns. The attributes of the tag are provided in the attributes object, but since this example doesn't need them (the only attribute in the sample is rdf:about on <item>), we can ignore them.

In the case of an <item> tag, we clear the entry record and set a marker (isInItem). The entry dictionary is used to collect the information of all subtags of <item>.
We also clear the value on any tag we enter. That variable is explained with the -characters callback.

- (void)startElement:(NSString *)_localName
  namespace:(NSString *)_ns
  rawName:(NSString *)_rawName
  attributes:(id)_attributes
{
  if ([_localName isEqualToString:@"item"]) {
    [self->entry removeAllObjects];
    self->isInItem = YES;
  }
  /* always reset content when entering a new tag */
  [self->value release]; self->value = nil;
}

The endElement callback is called when the parser encounters a closing tag. There are three cases: a) the item section is closed by a </item>, b) we are inside of an item section, c) we are outside of an item section.
In case a) we add the record containing the item information to the entries array. We make a copy of entry since we reuse that dictionary for every item.
In case b) we use the tag name of the subtag as the key for the collected character data and store that data inside the entry record. We only add the key if the subtag actually contained some character data.
In case c) we do nothing ;-) We are only interested in information contained inside an <item> section.

- (void)endElement:(NSString *)_localName
  namespace:(NSString *)_ns
  rawName:(NSString *)_rawName
{
  if ([_localName isEqualToString:@"item"]) {
    /* found end of item */
    self->isInItem = NO;
    [self->entries addObject:[[self->entry copy] autorelease]];
  }
  else if (self->isInItem) {
    /* any tag inside of an item is a key for the entry dict */
    if (self->value) {
      /* if we collected a PCDATA value, add it */
      [self->entry setObject:self->value forKey:_localName];
      [self->value release]; self->value = nil;
    }
  }
}

Finally the PCDATA (non-tag content) callback. If we encounter a <i>hello</i> the SAX parser will call the -characters callback with "hello" as the string.
For our example we collect all PCDATA in the value variable for later addition to the entry record in the -endElement callback.
Attention: it is not guaranteed that the SAX parser calls the callback only once per piece of text! Eg you might well get two calls like characters:"he" and characters:"llo". This complicates the handler (we need to append the string if we already stored one), but makes it easier to write parsers.
By checking whether we are in an <item> section, we ensure that we don't collect unnecessary content.

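- (void)characters:(unichar *)_chars length:(int)_len {
  NSString *s;
  if (!self->isInItem) return;
  s = [[NSString alloc] initWithCharacters:_chars length:_len];
  if (self->value) {
    /* append to the data collected so far and release the old string */
    NSString *tmp = [[self->value stringByAppendingString:s] copy];
    [self->value release];
    self->value = tmp;
    [s release];
  }
  else
    self->value = s; /* first chunk, we keep ownership of s */
}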

Close the implementation, that's it ;-)

@end /* RSSSaxHandler */
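
The append logic of the -characters callback can also be written with an NSMutableString, which avoids creating a new string object for every chunk. What follows is only a small Foundation-only sketch of that idea and not part of SaxObjC: the class ChunkCollector and its -reset and -value methods are made up for illustration, and main() merely simulates a parser which delivers "hello" in two chunks. Like the rest of the tutorial it uses manual retain/release, so it needs to be compiled without ARC.

```objc
#import <Foundation/Foundation.h>

/* Hypothetical helper, not part of SaxObjC: collects PCDATA chunks. */
@interface ChunkCollector : NSObject
{
  NSMutableString *value; /* nil until the first chunk arrives */
}
- (void)reset;
- (void)characters:(unichar *)_chars length:(int)_len;
- (NSString *)value;
@end

@implementation ChunkCollector
- (void)dealloc {
  [self->value release];
  [super dealloc];
}
/* call this when entering a new tag */
- (void)reset {
  [self->value release]; self->value = nil;
}
/* same signature as the callback in the handler above */
- (void)characters:(unichar *)_chars length:(int)_len {
  NSString *s;
  s = [[NSString alloc] initWithCharacters:_chars length:_len];
  if (self->value == nil)
    self->value = [s mutableCopy]; /* first chunk */
  else
    [self->value appendString:s];  /* later chunks are appended in place */
  [s release];
}
/* returns an immutable snapshot, or nil if nothing was collected */
- (NSString *)value {
  return [[self->value copy] autorelease];
}
@end

int main(void) {
  NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];
  ChunkCollector *c = [[ChunkCollector alloc] init];
  unichar part1[] = { 'h', 'e' };
  unichar part2[] = { 'l', 'l', 'o' };
  /* simulate a parser which splits "hello" into two callbacks */
  [c characters:part1 length:2];
  [c characters:part2 length:3];
  NSLog(@"collected: %@", [c value]);
  [c release];
  [pool release];
  return 0;
}
```

In a real handler the -reset call would go where startElement clears the value, and -endElement would store [c value] in the entry dictionary.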

3. Connecting the SAX Handler to the Parser

Now that you have the parser and a handler, you need to connect the two:

RSSSaxHandler *sax;
sax = [[[RSSSaxHandler alloc] init] autorelease];
[parser setContentHandler:sax];
[parser setErrorHandler:sax];

A SAX parser can actually have different kinds of handlers - eg separate handlers for errors, for DTD information, for the content - but in practice you almost always use a single handler which inherits from the SaxDefaultHandler class.

4. Start the Parsing

Easy. Let the parser do the parsing by passing it a URL, then query the results from the handler.

NSArray *entries;
[parser parseFromSource:[NSURL URLWithString:@"file:///...."]];
entries = [sax rssEntries];

Note: You can also pass the parser an NSData or NSString object containing an XML document for parsing.
Note: You can also parse "plist", "pyx", "iCalendar" and "vCalendar" files using specialized SAX parsers that come with SaxObjC! (SAX is good for processing a lot of different XML-like structured text formats.)
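
Since the result of -rssEntries is a plain NSArray of NSDictionaries, it can be walked with ordinary Foundation calls. The following is a minimal sketch under some assumptions: the keys are taken to be the local tag names of the sample item (title, link, subject), the hand-built array only stands in for a real parser result, and the function name PrintEntries is made up for this example. It uses manual retain/release like the rest of the tutorial.

```objc
#import <Foundation/Foundation.h>

/* Hypothetical helper: logs title and link of each collected item
   and returns the number of items seen. */
static unsigned PrintEntries(NSArray *entries) {
  NSEnumerator *e = [entries objectEnumerator];
  NSDictionary *entry;
  unsigned     count = 0;
  while ((entry = [e nextObject]) != nil) {
    /* a key is missing if the tag was absent or had no character data */
    NSString *title = [entry objectForKey:@"title"];
    NSString *link  = [entry objectForKey:@"link"];
    NSLog(@"%@ <%@>", title ? title : @"(untitled)", link ? link : @"");
    count++;
  }
  return count;
}

int main(void) {
  NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];
  /* stand-in for [sax rssEntries], built by hand from the sample item */
  NSDictionary *item = [NSDictionary dictionaryWithObjectsAndKeys:
    @"Elect Steve Jobs President of the United States", @"title",
    @"http://slashdot.org/article.pl?sid=03/01/22/1413238", @"link",
    @"humor", @"subject",
    nil];
  NSArray *entries = [NSArray arrayWithObject:item];
  PrintEntries(entries);
  [pool release];
  return 0;
}
```

Because the handler only adds a key when a subtag contained character data, the code checks for missing values instead of assuming every item has a title and a link.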

What's next?

"Raw" SAX handlers are usually only used if you need to process very large documents or if you need to process documents before you have the whole data available (in a streaming fashion).
So for doing "real" work, take a look at SaxObjectDecoder or DOM - much easier.

Note: Before you start implementing an RSS reader using the tutorial as a starting point, take a look at the excellent MulleNewz application available for MacOSX, for .NET and for GNUstep!


Written by Helge Heß