fxgrep - The Functional XML Querying Tool

-----------------

Contents

0 What is fxgrep?
0 The Processing Model of fxgrep
0 The Pattern Language of fxgrep
0 The Grammar Formalism of fxgrep
0 Text Patterns
0 fxgrep Options
0 fxgrep's Graphical User Interface
0Online Demo

-----------------

What is fxgrep?

fxgrep is a powerful XML querying tool. It is written in the functional programming language SML, based on the XML parser fxp.

Given a query, it parses an XML document and finds all matches of the query within that document. The query can be specified either as a pattern or a grammar; while the pattern syntax is more succinct and intuitive, the grammar syntax is more expressive.

fxgrep can evaluate both unary and binary queries. Unary queries locate individual nodes in the input document. Binary queries locate pairs of nodes which are in some specified relationship.

-----------------

fxgrep Options

0 Invocation
0 Options by Example
0 Summary of Command Line Options

-----------------

Invocation

The typical invocation of fxgrep is

fxgrep [ option ... ] [ file ]

If file is given, fxgrep reads its input document from that file, otherwise from standard input.

-----------------

Options by Example

0 Specification of the Query
0 Controlling the Output
0 Restricting the Matching Algorithm
0 Controlling the Number of Passes

-----------------

Specification of the Query

The query can be specified either as a pattern using the -p or --pattern option or as a grammar with the -g or --grammar option. Since patterns are usually short, they may also be given on the command line using option -P. In this case, however, they can only contain ASCII characters.

The pattern language of fxgrep is described here.

The grammar formalism is described here.

-----------------

Controlling the Output

fxgrep reports the nodes matching a unary query in document order. By default, for each of these nodes a processing instruction with target fxgrep-match is printed, followed by the subtree rooted at that node. For instance, suppose that the file test.xml contains the following XML document:

<a>
  <b>hello<b>world</b></b>
  <a><b>!!!</b></a>
</a>

Invoking fxgrep with the command

fxgrep -P /a//b test.xml

produces the following output:

<?fxgrep-match [Test/d0.xml:2.3]--------------------------------------?>
<b>hello<b>world</b></b>
          
<?fxgrep-match [Test/d0.xml:2.11]-------------------------------------?>
<b>world</b>

<?fxgrep-match [Test/d0.xml:3.6]--------------------------------------?>
<b>!!!</b>

Output of the position can be disabled with the -np or --no-position option. Thus, invoking

fxgrep -P /a//b --no-position test.xml

produces the following output:

<b>hello<b>world</b></b>
          
<b>world</b>

<b>!!!</b>

Conversely, fxgrep omits printing of the subtree if the -nt or --position-only option is given:

fxgrep -P /a//b --position-only test.xml

reports the matches as follows:

<?fxgrep-match [Test/d0.xml:2.3]--------------------------------------?>
<?fxgrep-match [Test/d0.xml:2.11]-------------------------------------?>
<?fxgrep-match [Test/d0.xml:3.6]--------------------------------------?>

The --no-output or -n option makes fxgrep suppress the reporting of matches. This is useful in scripts for determining whether a document contains a match at all: In case is does fxgrep's exit status is 0, otherwise it is 1. Alternatively, the --count or -k option makes fxgrep count the matches instead of reporting them individually:

fxgrep -P /a//b --count test.xml

yields the answer 3.

The output of binary matches is described here.

-----------------

Restricting the Matching Algorithm

Sometimes it is desirable to find only matches that are not a descendant of another matching node or do not contain another match. This can be achieved with the options --outermost/-O and --innermost/-I: The former makes fxgrep report only outermost matches. For instance, invoking

fxgrep -P /a//b --outermost test.xml

yields:

<?fxgrep-match [Test/d0.xml:2.3]--------------------------------------?>
<b>hello<b>world</b></b>

<?fxgrep-match [Test/d0.xml:3.6]--------------------------------------?>
<b>!!!</b>

On the other hand, --innermost selects only innermost matches of the query. The output for test.xml is then:

<?fxgrep-match [Test/d0.xml:2.11]-------------------------------------?>
<b>world</b>

<?fxgrep-match [Test/d0.xml:3.6]--------------------------------------?>
<b>!!!</b>

The --outermost and --innermost options are only sensible when the matches are individual nodes in the input (i.e. for unary queries, as opposed to binary queries).

-----------------

Controlling the Number of Passes

fxgrep performs the matching by one or two runs of a forest automaton on the input document forest. The number of runs actually needed is determined by analysis of the query prior to the matching: If the query is 'simple' enough, a single pass suffices. In this case it is even possible to perform the algorithm 'on the fly', i.e., without having to construct an in-memory copy of the document forest. Though this method needs only a (nearly) constant amount of memory even for huge documents, it is slightly slower than performing the querying on an in-memory copy of the document forest, because subtrees that fulfill the contextual part of the query must be constructed on-demand, requiring an extra administrative overhead. Therefore fxgrep does not perform on-the-fly querying by default.

fxgrep provides command-line options for controlling this behavior:

Option -2 or --force-double makes fxgrep use the two-pass algorithm disregarding the result of the query analysis. This is mainly useful for testing and debugging purposes.

-1 or --force-single enforces the one-pass algorithm disregarding query analysis. Use this option only if you deeply understand the querying algorithm of fxgrep! Otherwise it might report invalid matches.

The -h or --force-hooks option enforces that matching is performed on the fly, during parsing. Use this option for large documents in order to reduce memory usage of fxgrep. For unary queries which are right-ignorant (i.e. their right context of matches need not be checked), matches are output in the usual way. If the unary queries are not right-ignorant, only the position of the matches is output, in order to save memory-space by avoiding constructing in-memory copies of the match trees. Similarly for binary patterns which are right-ignorant. This option is not valid, as it may lead to false matches in case of non right-ignorant binary queries. An warning is issued in this case.

-----------------

Summary of Command Line Options

Each option can be one of:

A file name specifying the input document. Only one input document may be specified.

A long option of the form --key or --key=arg

A short option of the form -k, where k consists of a single character. If k consists of more than one character, each character is assumed to be a short option itself (e.g., -vic equals -v -i -c). The only exception is when the first character is an n (see below).

A short option with argument of the form -k arg, where k consists of a single character.

A negative short option of the form -nk, where k consists of a single character. If k consists of more than one character, each character is assumed to be a negative short option itself (e.g., -nvic equals -nv -ni -nc). If k is empty, then this is the (non-negative) short option -n.

The string --. This option is ignored, except that all remaining options are interpreted as file names, whether they start with - or not.

fxgrep understands all options documented for fxp. Additionally, the following options are available:

-P pat
Use the pattern pat. This option may not be used in conjunction with the -p or -g option.
-p file
--pattern=file
Read the pattern from the file named file. This option may not be used in conjunction with the -P or -g option.
-g file
--grammar=file
Read the grammar from the file named file. This option may not be used in conjunction with the -P or -p option.
--grammar-validate[=(yes|no)]
Turn on or off validation of the grammar as an XML document. Default is no.
--grammar-include[=(yes|no)]
Turn on or off inclusion of external parsed entities when parsing a grammar in non-validating mode. Default is yes.
--warn-undef-vars[=(yes|no)]
Warn about undefined variables used in the grammar. Default is no.
--ignore-xml-errors[=(yes|no)]
Continue even if the grammar is not a well-formed/valid XML document. Default is yes.
--ignore-syntax-errors=[(yes|no)]
Try to continue even if the grammar or pattern has syntax errors. De fault is no.
-n
--no-output
Do not generate any output, except for errors and warnings.
-o fname
--output=fname
Write all output, except for errors and warnings, to the file named fname. If fname is -, use the standard output. Defaults to -.
--output-encoding=enc
Use encoding enc for generating the output. If no output encoding is given, the encoding of the input document is used.
--output-textdecl[=(yes|no)]
Controls whether to start the output with a text declaration. Default is yes.

-k
--count
Only count matches but don't generate output. Implies --no-output.
-np
--no-position
Do not output a processing instruction giving the position of each match. Default is to do this.
-nt
--position-only
Output only the processing instruction with the position. Default is to print also the matching subdocument.

-O
--outermost
Report only matches of the pattern that are not descendants of another matching subtree. Default is to report all matches.
-I
--innermost
Report only matches of the pattern that do not contain other matching subtrees. Default is to report all matches.
-A
--all
Report all matches of the pattern. This is the default.

-h
--force-hooks
Force streamed matching, i.e. matching is performed without building the document tree in memory (for unary patterns and right-ignoring binary patterns).
-1
--force-single
Force matching in a single left-to-right pass disregarding grammar analysis.
-2
--force-double
Force matching in two passes disregarding grammar analysis.

-----------------