BinPAC Userguide

From BroWiki

About

This User Guide is currently a work in progress. It is by no means complete or accurate yet. If you stumble on this guide and have questions, feel free to email me ... or bug Ruoming.

Glossary and Convention

To make this document easier to read, the following terms and conventions are used.

  • PAC grammar - .pac file written by user.
  • PAC source - _pac.cc file generated by binpac
  • PAC header - _pac.h file generated by binpac
  • Analyzer - Protocol decoder generated by compiling PAC grammar
  • Field - a member of a record
  • Primary field - member of a record as direct result of parsing
  • Derivative field - member of a record evaluated through post processing

Image:Binpac_compile.png

Getting Started

Step by step on how to write a basic analyzer with binpac.

Techniques

Specific example on how to do certain tricks.

BinPAC Language Reference

BinPAC language consists of:

  • analyzer
  • type - a data-structure-like definition describing a parsing unit. Types can build on each other to form more complex types, similar to yacc productions.
  • flow - defines how data is fed into the analyzer and names the top-level parsing unit.
  • Keywords
  • Built-in macros

Defining an analyzer

There are two components to an analyzer definition: the top level context and the connection definition.


Context Definition

Each analyzer requires a top level context defined by the following syntax:

analyzer <ContextName> withcontext {
... context members ...
}

Typically, the top-level context contains pointers to the top-level analyzer (the connection) and to the flow, as below:

analyzer HTTP withcontext {
   connection : HTTP_analyzer;
   flow     : HTTP_flow;
};


Connection Definition

connection defines the entry point into the analyzer. It consists of two flow definitions, an upflow and a downflow.

connection <AnalyzerName>(optional parameter) {
 upflow = <UpflowConstructor>;
 downflow = <DownflowConstructor>;
}

Example:

connection HTTP_analyzer {
   upflow = HTTP_flow (true);
   downflow = HTTP_flow (false);
};

type

type is the basic building block of a binpac-generated parser. A type describes the structure of a byte segment. Each non-primitive type generates a C++ class that can independently parse the structure it describes. Syntax:

type <typeName>{(<optional type parameter(s)>)} = <compositor or primitive class>{
  cases or members declaration.
} <optional attribute(s)>;

Example:

PAC grammar:

type myType = record {
   data:uint8;
};  

PAC header:

class myType{
public:
   myType();
   ~myType();
   int Parse(const_byteptr const t_begin_of_data, const_byteptr const t_end_of_data);
   uint8 data() const  { return data_; }
protected:
   uint8 data_;
};


Primitives

A primitive type can be thought of as a #define in C. Primitive types are embedded into the types that reference them and do not generate any parsing code of their own. The available primitive types are:

  • int8
  • int16
  • int32
  • uint8
  • uint16
  • uint32
  • Regular expression
type HTTP_URI   = RE/[[:alnum:][:punct:]]+/;
  • bytestring

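Example: assuming a type number has been defined as uint8[3] (the definition is not shown here), the declaration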

type foo = record { x: number; }; 

is equivalent to:

type foo = record { x: uint8[3]; };

(Note: this behavior may change in future versions of binpac.)

record

record composes primitive types and other records to create a new type. This new type can in turn be used as part of a parent type or directly for parsing.

Example:

type SMB_body = record {
   word_count  : uint8;
   parameter_words : uint16[word_count];
   byte_count  : uint16;
};
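The way word_count drives the size of parameter_words can be illustrated with a hand-written C++ sketch of what a parser for this record has to do. This is not the code binpac generates: the names SmbBody, ReadU16LE and ParseSmbBody are hypothetical, and little-endian byte order is an assumption, since the excerpt above does not state one.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical in-memory form of the SMB_body record above.
struct SmbBody {
    uint8_t word_count = 0;
    std::vector<uint16_t> parameter_words;
    uint16_t byte_count = 0;
};

// Read a 16-bit little-endian value (the byte order is an assumption).
static uint16_t ReadU16LE(const uint8_t* p) {
    return static_cast<uint16_t>(p[0] | (p[1] << 8));
}

// Parse [begin, end) into *out. Returns the number of bytes consumed,
// or -1 if the input is too short for the record.
int ParseSmbBody(const uint8_t* begin, const uint8_t* end, SmbBody* out) {
    assert(begin != nullptr && begin <= end && out != nullptr);
    const uint8_t* p = begin;
    if (end - p < 1) return -1;
    out->word_count = *p++;  // primary field, parsed first

    // The array length depends on the field parsed just before it.
    const std::ptrdiff_t words_bytes =
        2 * static_cast<std::ptrdiff_t>(out->word_count);
    if (end - p < words_bytes + 2) return -1;

    out->parameter_words.clear();
    for (int i = 0; i < out->word_count; ++i, p += 2)
        out->parameter_words.push_back(ReadU16LE(p));

    out->byte_count = ReadU16LE(p);
    p += 2;
    return static_cast<int>(p - begin);
}
```

For the seven input bytes 02 01 00 34 12 05 00, this sketch consumes all seven bytes and yields word_count 2, parameter_words {0x0001, 0x1234} and byte_count 5.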

case

The case compositor allows switching between different parsing methods.

type SMB_string(unicode: bool, offset: int) = case unicode of {
   true    -> u: SMB_unicode_string(offset);
   false   -> a: SMB_ascii_string;
};

case supports an optional default label that is taken when none of the other labels match. If no field follows a given label, the user can specify an arbitrary field name with the empty type. See the following example.

type HTTP_Message(expect_body: ExpectBody) = record {
       headers:        HTTP_Headers;
       body_or_not:    case expect_body of {
               BODY_NOT_EXPECTED       -> none:        empty;
               default                 -> body:        HTTP_Body(expect_body);
       };
};

Note that only one field is allowed after a given label. If multiple fields are needed, they should first be packed into another record type. Other uses of case are described under Keywords.

array

A type can be defined as a sequence of "single-type elements". By default, an array type keeps parsing array elements in an unbounded loop. Alternatively, an array size can be specified to control the number of elements matched, or &until can be used to end parsing when a condition is met:

# This matches exactly 10 elements
type HTTP_Headers = HTTP_Header [10];

# This will match until the condition is met
type HTTP_Headers = HTTP_Header [] &until(/*Some condition*/);
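How an unbounded array with an &until condition terminates can be sketched in plain C++. The function below is a hand-written illustration, not binpac output. The stop condition (the element just parsed equals zero, as in a NUL-terminated string) and the decision to keep the terminating element in the result are assumptions of this sketch, and the name ParseUntilZero is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Parse uint8 elements from [begin, end) until one of them equals zero.
// The condition is checked after each element has been parsed, so the
// terminating element is stored as well (an assumption of this sketch).
// Returns false if the input ends before the terminator is seen.
bool ParseUntilZero(const uint8_t* begin, const uint8_t* end,
                    std::vector<uint8_t>* out, std::size_t* consumed) {
    assert(begin != nullptr && begin <= end);
    assert(out != nullptr && consumed != nullptr);
    out->clear();
    for (const uint8_t* p = begin; p != end; ++p) {
        out->push_back(*p);  // "parse" one element
        if (*p == 0) {       // evaluate the until-condition
            *consumed = static_cast<std::size_t>(p - begin) + 1;
            return true;
        }
    }
    // Ran out of input: the array is incomplete.
    *consumed = static_cast<std::size_t>(end - begin);
    return false;
}
```

A fixed-size array such as HTTP_Header [10] would replace the condition check with a simple element counter.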

Arrays can also be used directly inside a record. For example:

type DNS_message = record {
 header:		DNS_header;
 question:	DNS_question(this)[header.qdcount];
 answer:		DNS_rr(this, DNS_ANSWER)[header.ancount];
 authority:	DNS_rr(this, DNS_AUTHORITY)[header.nscount];
 additional:	DNS_rr(this, DNS_ADDITIONAL)[header.arcount];
} &byteorder = bigendian, &exportsourcedata;
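With &byteorder = bigendian, multi-byte integers are decoded most significant byte first. A minimal C++ sketch of that decoding, applied to the four counts that size the arrays above, might look as follows. The helper names ReadU16BE, DnsCounts and ReadDnsCounts are hypothetical, and the 12-byte header layout is taken from RFC 1035 rather than from this guide, since DNS_header is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Decode a 16-bit big-endian (network order) value.
inline uint16_t ReadU16BE(const uint8_t* p) {
    return static_cast<uint16_t>((p[0] << 8) | p[1]);
}

// Hypothetical holder for the four counts referenced by DNS_message.
struct DnsCounts {
    uint16_t qdcount;
    uint16_t ancount;
    uint16_t nscount;
    uint16_t arcount;
};

// Extract the counts from a 12-byte DNS header. Layout per RFC 1035:
// id, flags, qdcount, ancount, nscount, arcount, 16 bits each.
// Returns false if fewer than 12 bytes are available.
bool ReadDnsCounts(const uint8_t* data, std::size_t len, DnsCounts* out) {
    assert(data != nullptr && out != nullptr);
    if (len < 12) return false;
    out->qdcount = ReadU16BE(data + 4);
    out->ancount = ReadU16BE(data + 6);
    out->nscount = ReadU16BE(data + 8);
    out->arcount = ReadU16BE(data + 10);
    return true;
}
```

Each count then bounds the loop that parses the corresponding array field.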

flow

flow defines how data is fed into the analyzer. It also maintains custom state information declared by %member. A flow is configured by specifying the type of its data unit.

Syntax:

flow <Flow name>(<optional attribute>) {
  <flowunit|datagram> = <top level data unit> withcontext (<context constructor parameter>);
};


When a flow is added to the top-level analyzer context, it enables the use of &oneline and &length in record types. The flow buffers data when there is not enough to evaluate the record and dispatches the data for evaluation once the threshold is reached.

flowunit

When flowunit is used, the analyzer uses a flow buffer to handle incremental input and to support &oneline and &length. For further detail, see Buffering.

flowunit = HTTP_PDU(is_orig) withcontext (analyzer, this);

datagram

In contrast to flowunit, declaring the data unit as datagram opts out of the flow buffer. This results in faster parsing, but without incremental input or buffering support.

datagram = HTTP_PDU(is_orig) withcontext (analyzer, this);

Byte Ordering and Alignment

Byte Ordering

Byte Alignment

type RPC_Opaque = record {
   length: uint32;
   data:   uint8[length];
   pad:    padding align 4;    # pad to 4-byte boundary
};
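The number of bytes consumed by a padding field such as the one above follows from simple modular arithmetic. The sketch below is a hand-written illustration with hypothetical names (PaddingFor, OpaqueEncodedSize). It assumes the alignment is measured from the start of the record.

```cpp
#include <cassert>
#include <cstddef>

// Number of pad bytes needed to advance `offset` to the next multiple of
// `alignment`. Returns 0 when `offset` is already aligned.
std::size_t PaddingFor(std::size_t offset, std::size_t alignment) {
    assert(alignment > 0);
    return (alignment - offset % alignment) % alignment;
}

// Total encoded size of an RPC_Opaque-style value: a 4-byte length field,
// `data_length` payload bytes, then padding up to a 4-byte boundary.
// Assumes offsets are counted from the start of the record.
std::size_t OpaqueEncodedSize(std::size_t data_length) {
    const std::size_t unpadded = 4 + data_length;
    return unpadded + PaddingFor(unpadded, 4);
}
```

For example, a 5-byte payload occupies 4 + 5 = 9 bytes before padding, needs 3 pad bytes, and therefore takes 12 bytes in total.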

Functions

Users can define functions in binpac.

Declaration

A function can be declared in one of three ways:

PAC with embedded body

Declare a PAC-style function prototype and embed the C++ body between %{ and %}.

function print_stuff(value :const_bytestring):bool
%{
   printf("Value [%s]\n", std_str(value).c_str());
   return true;
%}

PAC with PAC-case body

A PAC-style function with a case body. This type of declaration is useful because it can be extended later with casefunc.

function RPC_Service(prog: uint32, vers: uint32): EnumRPCService =
   case prog of {
       default -> RPC_SERVICE_UNKNOWN;
   };


Inlined by %code

A function can be written entirely in C++ by placing it in a %code block.

%code{
EnumRPCService RPC_Service(const RPC_Call* call)
   {
   return call ? call->service() : RPC_SERVICE_UNKNOWN;
   }
%}


Usage

Extending

PAC code can be extended by using refine. This is useful for code reuse and for splitting functionality for parallel development.

Extending record

A record can be extended with additional attributes by using refine typeattr. A typical use is to add &let in order to separate protocol parsing from protocol analysis.

refine typeattr HTTP_RequestLine += &let {
   process_request: bool =
       process_func(method, uri, version);
};

Extending type case

refine casetype RPC_Params += {
   RPC_SERVICE_PORTMAP -> portmap: PortmapParams(call);
};

Extending function case

A function that is declared with a PAC case body can be extended by adding further cases to the switch.

refine casefunc RPC_BuildCallVal += {
   RPC_SERVICE_PORTMAP ->
       PortmapBuildCallVal(call, call.params.portmap);
};

Extending connection

A connection can be extended to add functions and members.

refine connection RPC_Conn += {
   function ProcessPortmapReply(results: PortmapResults): bool
       %{
       %}
};

State Management

State is maintained by extending a parsing class with derivative fields. State lasts until the top-level parsing unit (flowunit/datagram) is destroyed.

Keywords

Source code embedding

C++ code can be embedded within the .pac file using the following directives. This code is copied into the final generated code.

%header{...%}

Code to be inserted in binpac generated header file.

%code{...%}

Code to be inserted at the beginning of binpac generated C++ file.

%member{...%}

Add additional member(s) to connection (?) and flow class.

%init{...%}

Code to be inserted in flow constructor.

%cleanup{...%}

Code to be inserted in flow destructor.

Embedded pac primitive

${
$set{
$type{
$typeof{
$const_def{

Condition checking

&until

&until is used in conjunction with array declaration. It specifies exit condition for array parsing.

type HTTP_Headers = HTTP_Header[] &until($input.length() == 0);

&requires

Process data dependencies before evaluating field.

Example: typically, a derivative field is evaluated after the primary fields. Here, however, &requires is used to force evaluation of length before msg_body.

type RPC_Message = record {
   xid:        uint32;
   msg_type:   uint32;
   msg_body:   case msg_type of {
       RPC_CALL    -> call:    RPC_Call(this);
       RPC_REPLY   -> reply:   RPC_Reply(this);
   } &requires(length);
} &let {
   length = sourcedata.length();   # length of the RPC_Message
} &byteorder = bigendian, &exportsourcedata, &refcount;

&if

Evaluate field only if condition is met

type DNS_label(msg: DNS_message) = record {
   length:     uint8;
   data:       case label_type of {
       0 ->    label:  bytestring &length = length;
       3 ->    ptr_lo: uint8;
   };
} &let {
   label_type: uint8   = length >> 6;
   last: bool      = (length == 0) || (label_type == 3);
   ptr: DNS_name(msg)
       withinput $context.flow.get_pointer(msg.sourcedata,
           ((length & 0x3f) << 8) | ptr_lo)
       &if(label_type == 3);
   clear_pointer_set: bool = $context.flow.reset_pointer_set()
       &if(last);
};
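The bit arithmetic in the &let block above can be checked with a few lines of plain C++. The helpers below mirror the expressions for label_type, last and the offset passed to get_pointer(). Their names are hypothetical and they are only a sketch of the arithmetic, not of the generated parser.

```cpp
#include <cassert>
#include <cstdint>

// The top two bits of the length byte select the label type:
// 0 = ordinary label, 3 = compression pointer.
inline uint8_t LabelType(uint8_t length) {
    return static_cast<uint8_t>(length >> 6);
}

// Mirrors the `last` field: a name ends at a zero-length label
// or at a compression pointer.
inline bool IsLastLabel(uint8_t length) {
    return length == 0 || LabelType(length) == 3;
}

// Mirrors the offset expression: the low 6 bits of the length byte are
// the high bits of a 14-bit offset, and ptr_lo supplies the low 8 bits.
inline uint16_t PointerOffset(uint8_t length, uint8_t ptr_lo) {
    assert(LabelType(length) == 3);  // only meaningful for pointers
    return static_cast<uint16_t>(((length & 0x3f) << 8) | ptr_lo);
}
```

For the common byte pair 0xC0 0x0C, the label type is 3 and the offset is 12, which is why ptr is only evaluated under &if(label_type == 3).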

case

There are two uses of the case keyword.

  • As part of a record field. In this scenario, it allows alternative ways to parse a field.
type RPC_Reply(msg: RPC_Message) = record {
   stat:       uint32;
   reply:      case stat of {
       MSG_ACCEPTED    -> areply:  RPC_AcceptedReply(call);
       MSG_DENIED  -> rreply:  RPC_RejectedReply(call);
   };
} &let {
   call: RPC_Call = $context.connection.FindCall(msg.xid);
   success: bool = (stat == MSG_ACCEPTED && areply.stat == SUCCESS);
};
  • As function definition
function RPC_Service(prog: uint32, vers: uint32): EnumRPCService =
       case prog of {
               default -> RPC_SERVICE_UNKNOWN;
       };

Note that one can 'refine' both types of cases:

refine casefunc RPC_Service += {
       100000  -> RPC_SERVICE_PORTMAP;
};

Built-in macros

$input

This macro refers to the data that was passed into the ParseBuffer function. When $input is used, binpac generates a const_bytestring that holds the start and end pointers of the input.

PAC grammar

&until($input.length()==0);

PAC source

const_bytestring t_val__elem_input(t_begin_of_data, t_end_of_data);
if (  ( t_val__elem_input.length() == 0 )  )

$element

$element provides access to an entry of an array type. It can be used in the following ways.

  • Current element

Check the value of the most recently parsed entry. The check is executed each time an entry has been parsed. Example:

type SMB_ascii_string       = uint8[] &until($element == 0);
  • Current element's field

Example:

type DNS_label(msg: DNS_message) = record {
   length:     uint8;
   data:       case label_type of {
       0 ->    label:  bytestring &length = length;
       3 ->    ptr_lo: uint8;
   };
} &let {
   label_type: uint8   = length >> 6;
   last: bool      = (length == 0) || (label_type == 3);
};
type DNS_name(msg: DNS_message) = record {
   labels:     DNS_label(msg)[] &until($element.last);
};

$context

This macro refers to the Analyzer context class (Context<Name> class gets generated from analyzer <Name> withcontext {}). Using this macro, users can gain access to the flow object and analyzer object.

Others

&transient

Do not create a copy of the bytestring.

type MIME_Line = record {
   line:   bytestring &restofdata &transient;
} &oneline;

&let and let

  • &let - adds a derivative field to a record
type ncp_request(length: uint32) = record {
   data        : uint8[length];
} &let {
   function    = length > 0 ? data[0] : 0;
   subfunction = length > 1 ? data[1] : 0;
};
  • let - declares global value. If the user does not specify a type, the compiler will assume the int type.

PAC grammar

let myValue:uint8=10; 

PAC source

uint8 const myValue = 10;

PAC header

extern uint8 const myValue;

&restofdata

Grab the rest of the data available in the FlowBuffer.

PAC grammar

   onebyte: uint8;
   value: bytestring &restofdata &transient;

PAC source

   // Parse "onebyte"
   onebyte_ = *((uint8 const *) (t_begin_of_data));
   // Parse "value"
   int t_value_string_length;
   t_value_string_length = (t_end_of_data) - ((t_begin_of_data + 1));
   int t_value__size;
   t_value__size = t_value_string_length;
   value_.init((t_begin_of_data + 1), t_value_string_length);

&length

&length can appear in two different contexts: as a property of a field or as a property of a record.

Example: &length as a field property

protocol    : bytestring &length = 4;

translates into

const_byteptr t_end_of_data = t_begin_of_data + 4;
int t_protocol_string_length;
t_protocol_string_length = 4;
int t_protocol__size;
t_protocol__size = t_protocol_string_length;
protocol_.init(t_begin_of_data, t_protocol_string_length);

&check

Check a condition and raise exception if not met.

&chunked and $chunk

When parsing a long field of variable length, &chunked can be used to improve performance. However, chunked fields are not buffered across packets. The data for the chunk in the current packet can be accessed through $chunk.

&exportsourcedata

The data matched by a particular type can be retained by using &exportsourcedata.

.pac file

type myType = record {
   data:uint8;
} &exportsourcedata;

_pac.h

class myType
{
public:
   myType();
   ~myType();
   int Parse(const_byteptr const t_begin_of_data, const_byteptr const t_end_of_data);
   uint8 data() const    { return data_; }
   const_bytestring const & sourcedata() const { return sourcedata_; }
protected:
   uint8 data_;
   const_bytestring sourcedata_;
};

_pac.cc

sourcedata_ = const_bytestring(t_begin_of_data, t_end_of_data);
sourcedata_.set_end(t_begin_of_data + 1);


Source data can be used within the type that matches it or in the parent type.

type myParentType (child:myType) = record {
    somedata:uint8;
} &let{
   do_something:bool = print_stuff(child.sourcedata);
};

translates into

do_something_ = print_stuff(child()->sourcedata());

&refcount

withinput

Parsing Methodology

Buffering

binpac supports incremental input to deal with packet fragmentation. This is done with the FlowBuffer class and by maintaining buffering and parsing states.

FlowBuffer Class

FlowBuffer provides two modes of buffering: line and frame. Line mode is useful for parsing line-based protocols like HTTP. Frame mode is best for fixed-length messages. The buffering mode can be switched during parsing, and this is done transparently to the grammar writer.

At compile time, binpac calculates the number of bytes required to evaluate each field. At run time, data is buffered in FlowBuffer until there is enough to evaluate the record. To optimize the buffering process, if FlowBuffer has enough data to evaluate the record on the first NewData call, it only marks the start and end pointers instead of copying.

  • void NewMessage();
    • Advances the orig_data_begin_ pointer depending on the current mode_: by 1/2 characters in LINE_MODE, by frame_length_ in FRAME_MODE, and not at all in UNKNOWN_MODE (the default mode).
    • Set buffer_n_ to 0
    • Reset message_complete_
  • void NewLine();
    • Reset frame_length_ and chunked_, set mode_ to LINE_MODE
  • void NewFrame(int frame_length, bool chunked_);
  • void GrowFrame(int new_frame_length);
  • void AppendToBuffer(const_byteptr data, int len);
    • Reallocate buffer_ to add new data then copy data
  • void ExpandBuffer(int length);
    • Reallocate buffer_ to new size if new size is bigger than current size.
    • Set minimum size to 512 (optimization?)
  • void MarkOrCopyLine ();
    • Search the current input for an end of line (CR, LF or CRLF, depending on the line break mode). If one is found, append the data found to the buffer if a buffer has already been created, or just mark it (set frame_length_) if not, to minimize copying. If no end of line is found, append the partial data up to the end of input to the buffer, creating the buffer if it does not exist yet.
  • const_byteptr begin()/end()
    • Return buffer_ and buffer_n_ if a buffer exists, otherwise orig_data_begin_ and orig_data_begin_ + frame_length_.
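The mark-or-copy idea behind MarkOrCopyLine and begin()/end() can be sketched with a much smaller class. MiniLineBuffer below is a hypothetical, simplified stand-in and not the real FlowBuffer: it only recognizes LF as the line terminator, uses std::string for storage, and has no frame mode.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Simplified, hypothetical stand-in for FlowBuffer's line mode.
class MiniLineBuffer {
public:
    // Offer a chunk of input. Returns true once a complete line is
    // available through begin()/end(). *consumed is set to the number
    // of bytes of this chunk that were used.
    bool NewData(const char* data, std::size_t len, std::size_t* consumed) {
        assert(data != nullptr && consumed != nullptr);
        std::size_t i = 0;
        while (i < len && data[i] != '\n') ++i;
        const bool found = i < len;
        *consumed = found ? i + 1 : len;
        if (found && buffer_.empty()) {
            // The whole line is inside this chunk: mark it, do not copy.
            mark_begin_ = data;
            mark_len_ = i;
            copied_ = false;
        } else {
            // The line spans several chunks: append to the buffer.
            buffer_.append(data, i);
            copied_ = true;
        }
        return found;
    }

    const char* begin() const {
        return copied_ ? buffer_.data() : mark_begin_;
    }
    const char* end() const {
        return begin() + (copied_ ? buffer_.size() : mark_len_);
    }
    bool copied() const { return copied_; }

    // Discard the current line and get ready for the next one.
    void NewMessage() {
        buffer_.clear();
        mark_begin_ = nullptr;
        mark_len_ = 0;
        copied_ = false;
    }

private:
    std::string buffer_;
    const char* mark_begin_ = nullptr;
    std::size_t mark_len_ = 0;
    bool copied_ = false;
};
```

In this sketch, a line that arrives whole in one chunk is exposed through pointers into the caller's data, while a line split across two chunks is assembled in the internal buffer first.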

Parsing States

  • buffering_state_ - each parsing class contains a flag indicating whether enough data has been buffered to evaluate the next block.
  • parsing_state_ - each parsing class that consists of multiple parsing data units (lines/frames) has this flag to indicate the parsing stage. Each time new data comes in, the parsing function is invoked and switches on parsing_state_ to determine which sub-parser to use next.
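The role of parsing_state_ can be illustrated with a small resumable parser. TwoStageParser below is a hypothetical example and not binpac output: it assumes a message made of a 1-byte length header followed by a body, and it folds the "is enough data buffered" check into each stage instead of keeping a separate buffering_state_ flag.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical two-stage parser: a 1-byte length header, then a body.
class TwoStageParser {
public:
    // Feed more input. Returns true once a whole message was parsed.
    bool NewData(const uint8_t* data, std::size_t len) {
        assert(data != nullptr);
        pending_.append(reinterpret_cast<const char*>(data), len);
        for (;;) {
            switch (parsing_state_) {
            case 0:  // header stage: needs 1 byte
                if (pending_.size() < 1) return false;
                body_length_ = static_cast<uint8_t>(pending_[0]);
                pending_.erase(0, 1);
                parsing_state_ = 1;
                break;
            case 1:  // body stage: needs body_length_ bytes
                if (pending_.size() < body_length_) return false;
                body_ = pending_.substr(0, body_length_);
                pending_.erase(0, body_length_);
                parsing_state_ = 2;
                break;
            default:  // message complete
                return true;
            }
        }
    }

    const std::string& body() const { return body_; }
    int parsing_state() const { return parsing_state_; }

private:
    std::string pending_;  // bytes received but not yet parsed
    std::string body_;
    std::size_t body_length_ = 0;
    int parsing_state_ = 0;
};
```

Because the stage is stored in the object, a call that runs out of data simply returns, and the next call resumes at the same stage with the additional input.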

Regular Expression

Evaluation Order

Running Binpac-generated Analyzer Standalone

(an updated description with a bit more detail on how to decouple binpac from Bro based on release 1.4 can be found here)

To run binpac-generated code independently of Bro, the regex library must be substituted. Below is one way of doing it, using the following three header files.

RE.h

/*Dummy file to replace bro's file*/
#include "binpac_pcre.h"
#include "bro_dummy.h"

bro_dummy.h

#ifndef BRO_DUMMY
#define BRO_DUMMY
#define DEBUG_MSG(x...)  fprintf(stderr, x)
/*Dummy to link, this function suppose to be in Bro*/
double network_time();
#endif

binpac_pcre.h

#ifndef bro_pcre_h
#define bro_pcre_h
#include <stdio.h>
#include <assert.h>
#include <string>
using namespace std;
// TODO: use configure to figure out the location of pcre.h
#include "pcre.h"
class RE_Matcher {
public:
   RE_Matcher(const char* pat){
       pattern_ = "^";
       pattern_ += "(";
       pattern_ += pat;
       pattern_ += ")";
       pcre_ = NULL;
       pextra_=NULL;
   }
   ~RE_Matcher() {
       if (pcre_) {
           pcre_free(pcre_);
       }
   }
   int Compile() {
               const char *err = NULL;
               int erroffset = 0;
       pcre_ = pcre_compile(pattern_.c_str(),
                                    0,  // options,
                                    &err,
                                    &erroffset,
                                    NULL);
       if (pcre_ == NULL) {
           fprintf(stderr,
                   "Error in RE_Matcher::Compile(): %d:%s\n",
                   erroffset, err);
           return 0;
       }
       return 1;
   }
   
   int MatchPrefix (const char* s, int n){
       const char *err=NULL;
       assert(pcre_);
       const int MAX_NUM_OFFSETS = 30;
       int offsets[MAX_NUM_OFFSETS];
       int ret = pcre_exec(pcre_,
                                   pextra_,  // pcre_extra
                                   //NULL,  // pcre_extra
                                   s, n,  
                                   0,     // offset
                                   0,     // options
                                   offsets,
                                   MAX_NUM_OFFSETS);
       if (ret < 0) {
           return -1;
       }
       assert(offsets[0] == 0);
       return offsets[1];
   }
protected:
   pcre *pcre_;
   pcre_extra *pextra_;  // always NULL here; passed to pcre_exec()
   string pattern_;
};
#endif

main.cc

In your main source, add this dummy stub.

/* Dummy to link; this function is supposed to be in Bro */
double network_time(){
   return 0;
}

Compiler Documentation

Moved to here

Q & A

  • Does &oneline only work when flow is used?

Yes. binpac uses the flowunit definition in flow to figure out which types require buffering. For those that do, the parse function is:

       bool ParseBuffer(flow_buffer_t t_flow_buffer, ContextHTTP * t_context);

And the code of flow_buffer_t provides the functionality of buffering up to one line. That's why &oneline is only active when flow is used and the type requires buffering.

In certain cases one may want to use &oneline even if the type does not require buffering; binpac currently does not provide such functionality.

  • How would incremental input work in the case of regex?

A regex should not take incremental input. (The binpac compiler will complain when that happens.) It should always appear below some type that has either &length=... or &oneline.

  • What is the role of the Context<Name> class (generated by analyzer <Name> withcontext)?
  • What is the difference between withcontext and w/o withcontext?

withcontext should always be there. It's fine to have an empty context.

  • Elaborate on $context and how it is related to withcontext.

A "context" parameter is passed to every type. It provides a vehicle to pass something to every type without adding a parameter to every type. In that sense, it's optional. It exists for convenience.

  • Example usage of composite type array.

Please see HTTP_Headers in http-protocol.pac in the Bro source code.

  • Clarification on connection keyword (binpac paper).

What are the specific questions?

  • Need a new way to hook additional code into each class besides &let.

Why? Can you describe a scenario?

  • How is &transient different from declaring an anonymous field? Currently it doesn't seem to do much.
type HTTP_Header = record {
   name:       HTTP_HEADER_NAME &transient;
   :       HTTP_WS;
   value:      bytestring &restofdata &transient;
} &oneline;

   // Parse "name"
   int t_name_string_length;
   t_name_string_length = 
       HTTP_HEADER_NAME_re_011.MatchPrefix(
           t_begin_of_data,
           t_end_of_data - t_begin_of_data);
   if ( t_name_string_length < 0 )
       {
       throw ExceptionStringMismatch("./http-protocol.pac:96", "|([^: \\t]+:)", string((const char *) (t_begin_of_data),   (const char *) t_end_of_data).c_str()); 
       }
   int t_name__size;
   t_name__size = t_name_string_length;
   name_.init(t_begin_of_data, t_name_string_length);
  • Detail on the globals ($context, $element, $input...etc)

Again, what are the specific questions?

  • How does BinPAC work with dynamic protocol detection?

Well, you can use the code in DNS-binpac.cc as a reference. First, create a pointer to the connection. (See the example in DNS-binpac.cc)

interp = new binpac::DNS::DNS_Conn(this);

Pass the data received from DeliverPacket or DeliverStream to interp->NewData(). (Again, see the example in DNS-binpac.cc)

void DNS_UDP_Analyzer_binpac::DeliverPacket(int len, const u_char* data, bool orig, int seq, const IP_Hdr* ip, int caplen)
       {
       Analyzer::DeliverPacket(len, data, orig, seq, ip, caplen);
       interp->NewData(orig, data, data + len);
       }
  • Explanation of &withinput
  • Difference between using flow and not using flow (binpac generates Parse method instead of ParseBuffer)
  • &check currently working?
  • Difference between flowunit and datagram, datagram and &oneline, &length?
  • Go over TODO list in binpac release
  • How would input get handled/buffered when the length is not known (chunked)?
  • More features for multi-byte characters? UTF-16, UTF-32, etc.

TODO List

New Features

  • Provide a method to match simple ASCII text.
  • Allow the use of fixed-length arrays in addition to vector.

Bugs

Small clean-ups

  • Remove anonymous field bytestring assignment.
  • Redundant overflow checking/more efficient fixed length text copying.

Warning/Errors

Things that the compiler should flag at code generation time:

  • Give a warning when &transient is used on a non-bytestring
  • Give a warning when &oneline or &length is used and flowunit is not
  • Give a warning when more than one connection is defined