XML Analyzer

From BroWiki

Background

The XML analyzer is intended as a generic tool for the analysis of arbitrary XML traffic. Because XML is a generic markup language used to define specific data formats, those formats can vary significantly in structure and complexity. Writing a full-fledged analyzer in C++ for every XML format to be analyzed would be a tedious task.

The basic idea behind the XML analyzer is to leverage existing XML processing technologies so that the analysis can easily be adapted to a specific XML data format. This is achieved with XQuery, a language designed to select parts of XML data and to transform that data.

A selector script, written in XQuery, can be provided to the generic XML analyzer. When running the analyzer against an XML data stream, it executes each selector script against every XML document encountered in the stream. If a script matches the data format of a document, relevant parts of the document can be selected and provided as parameters of events reported to the analysis tool (e.g. Bro).


Selector Scripts

A selector script is an XQuery program that is restricted only in its output format (i.e. the format of the XML data produced as output). Here is an example of a selector script intended to analyze RSS data:

xquery version "1.0";
for $rss in ./rss
return
  <event type="rss_event">
    <double value="{$rss/@version}"/>
    <set type="rss_channels">
      {
        for $channel in $rss/channel
        return
          <record type="rss_channel_data">
            <string value="{$channel/title}"/>
            <string value="{$channel/link}"/>
            <count value="{count($channel/item)}"/>
          </record>
      }
    </set>
  </event>

The script produces an event of type "rss_event" for every document whose document element is 'rss', or for every 'rss' element in a collection. The event and its parameters are specified in XML syntax; the parameter values are filled in by XQuery expressions evaluated on the input data. At the start of query processing, the context node is bound to the document root and can be accessed using a single '.'.

In the example, the 'rss_event' has two direct parameters: the RSS version given as type 'double' and a set containing information about the RSS channels. Every entry in the set is a record containing the title of the channel and its location as a URL, as well as the number of news items in the channel.
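
To make the selection concrete, the following Python sketch computes the same three kinds of values from a small RSS document. It is not part of the XML analyzer: the sample document, the function name 'rss_event_params', and the use of Python's ElementTree in place of an XQuery engine are assumptions made for illustration, and the sketch assumes the 'version' attribute is present and numeric.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, made up for illustration only.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>http://example.org/news</link>
    <item><title>First</title></item>
    <item><title>Second</title></item>
  </channel>
  <channel>
    <title>Empty Channel</title>
    <link>http://example.org/empty</link>
  </channel>
</rss>"""


def rss_event_params(xml_text):
    """Return (version, channels) as the RSS selector script selects them,
    or None when the document element is not 'rss' (no event is produced)."""
    root = ET.fromstring(xml_text)
    if root.tag != "rss":
        return None  # corresponds to 'for $rss in ./rss' matching nothing
    # <double value="{$rss/@version}"/>
    version = float(root.get("version"))
    # One (title, link, item count) triple per <record type="rss_channel_data">
    channels = [
        (ch.findtext("title"), ch.findtext("link"), len(ch.findall("item")))
        for ch in root.findall("channel")
    ]
    return version, channels
```

Calling 'rss_event_params(SAMPLE_RSS)' yields the version 2.0 and one triple per channel, the second of which has an item count of 0.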


Events and Parameters

An important part of a selector script is its output, which is given in XML format and represents the event to be reported when the script matches, together with the parameters of that event. An event is generated by constructing an XML element with tag name 'event' and an attribute 'type' whose value is a string giving the type label of the event. Parameters are added as element content.

Parameters can have a primitive type, represented through elements 'bool', 'count', 'int', 'double', or 'string' with the value of the parameter as content of the 'value' attribute. Elements representing primitive types have to be empty (i.e. no element content).

Furthermore, there are three different types of compound parameters: 'set', 'record', and 'table', each with a type given by the value of the 'type' attribute, as with events. A set consists of zero or more child elements, which have to be of the same type (i.e. all of type 'string', all of type 'count', all of type 'record', etc.). A record is composed of one or more child elements, which can be of arbitrary types, not necessarily the same. Finally, a table consists of zero or more 'tableEntry' elements, each specifying an entry in the table given by an index child element of arbitrary type and a value child element of arbitrary type. The first element in a 'tableEntry' is interpreted as the index and the second as the value.

With compound types, the tree of parameters of an event can be of arbitrary depth.
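
The rules above can be checked mechanically. The following Python sketch validates a selector script's output against them; it is an illustration under stated assumptions, not the analyzer's own C++ code. The function names 'check_param' and 'check_event' are hypothetical, and "same type" for set members is interpreted here as the same element name and the same 'type' attribute.

```python
import xml.etree.ElementTree as ET

PRIMITIVES = {"bool", "count", "int", "double", "string"}
COMPOUNDS = {"set", "record", "table"}


def check_param(elem):
    """Return a list of rule violations for one parameter element."""
    tag, children, errors = elem.tag, list(elem), []
    if tag in PRIMITIVES:
        if "value" not in elem.attrib:
            errors.append(f"<{tag}> lacks a 'value' attribute")
        if children or (elem.text or "").strip():
            errors.append(f"<{tag}> must be empty")
        return errors
    if tag not in COMPOUNDS:
        return [f"unknown parameter element <{tag}>"]
    if "type" not in elem.attrib:
        errors.append(f"<{tag}> lacks a 'type' attribute")
    if tag == "set" and len({(c.tag, c.get("type")) for c in children}) > 1:
        errors.append("<set> children differ in type")
    if tag == "record" and not children:
        errors.append("<record> needs at least one child")
    if tag == "table":
        for entry in children:
            if entry.tag != "tableEntry" or len(entry) != 2:
                errors.append("<table> needs <tableEntry> children "
                              "holding an index and a value")
                continue
            for part in entry:  # first child is the index, second the value
                errors.extend(check_param(part))
        return errors
    for child in children:  # set and record members, at arbitrary depth
        errors.extend(check_param(child))
    return errors


def check_event(xml_text):
    """Return all violations found in one event element; [] means valid."""
    root = ET.fromstring(xml_text)
    if root.tag != "event" or "type" not in root.attrib:
        return ["output must be an <event> element with a 'type' attribute"]
    errors = []
    for param in root:
        errors.extend(check_param(param))
    return errors
```

For instance, an event containing '<table type="t"><tableEntry><string value="k"/><count value="1"/></tableEntry></table>' (with made-up type label 't') passes the check, whereas a set mixing 'string' and 'count' children is reported as a violation.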


Activating the XML Analyzer in Bro

The XML analyzer is implemented in C++ as an analyzer component 'XML_Analyzer' of Bro. It is a subclass of 'Analyzer' and its main methods are 'DeliverStream' (called with a chunk of XML data) and 'EndOfData' (signaling that an XML document or collection has been completely received via 'DeliverStream'). Currently, it is dynamically added as a child analyzer of the 'HTTP_Analyzer' if the string '<?xml' is encountered at the beginning of the payload of an HTTP message.

This is achieved by utilizing the Bro signature engine. The following signature is responsible for adding a child XML analyzer to the HTTP analyzer if XML data is discovered:

signature http_xml {
  ip-proto == tcp
  http-body /<\?xml/
  enable "http:xml"
}

The most notable thing is the 'enable' statement, which specifies that a new analyzer of type 'xml' is to be created and added as a child to the HTTP analyzer. From the moment the XML analyzer is added, the HTTP analyzer forwards all HTTP payload data to it.

This signature has to be placed in a signature file to be loaded by Bro, either by using the '-s <signature_file>' switch on the command line or by redefining the variable 'signature_files' in a policy script, e.g.:

redef signature_files += "xml.sig";


Declaring Events and Parameters Generated in a Selector Script

All events that might be generated in a selector script have to be declared in a Bro policy script, like in the following example:

global rss_event: event
   (c: connection, vers: double, chans: rss_channels);

Note: The event generation engine of the XML analyzer always adds the connection object of the current XML document as an implicit first parameter. However, this parameter has to be given explicitly in the event declaration (as shown in the example above).

For some basic event types, the declarations are already provided in the xml-init.bro policy script. If an event originating in a selector script has not been declared before it is created, a run-time error occurs.

The same is true for parameters added to an event in a selector script. Primitive types do not have to be declared, but every compound type used as a parameter in a selector script (in the RSS example, both 'rss_channels' and 'rss_channel_data') has to be declared as well, e.g.:

type rss_channel_data: record {
  title: string;
  uri: string;
  num_items: count;
};


Handling Events Generated in a Selector Script

The events originating in a selector script can be handled just like ordinary Bro events; there is no difference.


Tweaking the XML Parser

A central part of the XML analyzer is the XML parser that is used to parse XML documents. Currently, Xerces-C is used as XML parser. There are several switches which can be used to tweak the parser (such as instructing the parser to use namespaces or to do schema validation). Some of these switches are exposed to Bro so that they can be adapted in a policy file (see the xml-init.bro policy file for more information).


Specifying the Selector Scripts to Execute

A variable 'query_files' is used in policy scripts to instruct the XML analyzer which selector scripts are to be executed on XML documents. A selector script can be added by redefining the variable, e.g.:

redef query_files += "xml-rss";

Names of selector scripts can be given with or without extension. If given without, the standard extension 'xq' is appended before trying to locate the script.

For all scripts specified in the 'query_files' variable, a lookup for the corresponding selector script is performed in the BROPATH. If a script cannot be found, or an error occurs when the script is parsed and compiled, the corresponding selector script is disabled. Apart from that, XML analyzer execution is not affected, so all other queries from successfully parsed selector scripts are still executed.
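
The lookup just described can be sketched in Python as follows. This is only an approximation of the behavior stated above, not the analyzer's implementation: the function names are hypothetical, BROPATH is assumed to be a colon-separated list of directories, and a name is assumed to "have an extension" whenever it contains a dot-suffix. Parse and compile errors, which also disable a script, are not modeled.

```python
import os


def resolve_selector_script(name, bropath):
    """Return the path of a selector script found via bropath, or None."""
    # Append the standard extension 'xq' when the name has none.
    if not os.path.splitext(name)[1]:
        name += ".xq"
    # Assumption: bropath is a colon-separated list of directories.
    for directory in bropath.split(":"):
        candidate = os.path.join(directory or ".", name)
        if os.path.isfile(candidate):
            return candidate
    return None


def load_selector_scripts(query_files, bropath):
    """Map each usable script name to its path.

    Names that cannot be resolved are left out (disabled); the remaining
    scripts stay enabled, so one missing script does not affect the others.
    """
    enabled = {}
    for name in query_files:
        path = resolve_selector_script(name, bropath)
        if path is not None:
            enabled[name] = path
    return enabled
```

With 'query_files' containing "xml-rss" and a name that does not exist, only "xml-rss" ends up in the returned mapping, resolved to a file named 'xml-rss.xq'.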
