The Functional XML Querying Tool
|
What is fxgrep? |
|
The Processing Model of fxgrep |
|
The Pattern Language of fxgrep |
|
The Grammar Formalism of fxgrep |
|
Text Patterns |
|
fxgrep Options |
|
fxgrep's Graphical User Interface |
![]() | Online Demo |
fxgrep is a powerful XML querying tool. It is written in the functional programming language SML, based on the XML parser fxp.
Given a query, it parses an XML document and finds all matches of the query within that document. The query can be specified either as a pattern or a grammar; while the pattern syntax is more succinct and intuitive, the grammar syntax is more expressive.
fxgrep can evaluate both unary and binary queries. Unary queries locate individual nodes in the input document. Binary queries locate pairs of nodes which are in some specified relationship.
|
Invocation |
|
Options by Example |
|
Summary of Command Line Options |
The typical invocation of fxgrep is
fxgrep [ option ... ] [ file ]
If file is given, fxgrep reads its input document from that file, otherwise from standard input.
|
Specification of the Query |
|
Controlling the Output |
|
Restricting the Matching Algorithm |
|
Controlling the Number of Passes |
The query can be specified either as a pattern using the -p or --pattern option or as a grammar with the -g or --grammar option. Since patterns are usually short, they may also be given on the command line using option -P. In this case, however, they can only contain ASCII characters.
The pattern language of fxgrep is described here.
The grammar formalism is described here.
fxgrep reports the nodes matching a unary query in document order. By default, for each of these nodes a processing instruction with target fxgrep-match is printed, followed by the subtree rooted at that node. For instance, suppose that the file test.xml contains the following XML document:
<a> <b>hello<b>world</b></b> <a><b>!!!</b></a> </a> |
Invoking fxgrep with the command
fxgrep -P /a//b test.xml
produces the following output:
<?fxgrep-match [Test/d0.xml:2.3]--------------------------------------?>
<b>hello<b>world</b></b>
<?fxgrep-match [Test/d0.xml:2.11]-------------------------------------?>
<b>world</b>
<?fxgrep-match [Test/d0.xml:3.6]--------------------------------------?>
<b>!!!</b>
|
Output of the position can be disabled with the -np or --no-position option. Thus, invoking
fxgrep -P /a//b --no-position test.xml
produces the following output:
<b>hello<b>world</b></b>
<b>world</b>
<b>!!!</b>
|
Conversely, fxgrep omits printing of the subtree if the -nt or --position-only option is given:
fxgrep -P /a//b --position-only test.xml
reports the matches as follows:
<?fxgrep-match [Test/d0.xml:2.3]--------------------------------------?> <?fxgrep-match [Test/d0.xml:2.11]-------------------------------------?> <?fxgrep-match [Test/d0.xml:3.6]--------------------------------------?> |
The --no-output or -n option makes fxgrep suppress the reporting of matches. This is useful in scripts for determining whether a document contains a match at all: In case is does fxgrep's exit status is 0, otherwise it is 1. Alternatively, the --count or -k option makes fxgrep count the matches instead of reporting them individually:
fxgrep -P /a//b --count test.xml
yields the answer 3.
The output of binary matches is described here.
Sometimes it is desirable to find only matches that are not a descendant of another matching node or do not contain another match. This can be achieved with the options --outermost/-O and --innermost/-I: The former makes fxgrep report only outermost matches. For instance, invoking
fxgrep -P /a//b --outermost test.xml
yields:
<?fxgrep-match [Test/d0.xml:2.3]--------------------------------------?> <b>hello<b>world</b></b> <?fxgrep-match [Test/d0.xml:3.6]--------------------------------------?> <b>!!!</b> |
On the other hand, --innermost selects only innermost matches of the query. The output for test.xml is then:
<?fxgrep-match [Test/d0.xml:2.11]-------------------------------------?> <b>world</b> <?fxgrep-match [Test/d0.xml:3.6]--------------------------------------?> <b>!!!</b> |
The --outermost and --innermost options are only sensible when the matches are individual nodes in the input (i.e. for unary queries, as opposed to binary queries).
fxgrep performs the matching by one or two runs of a forest automaton on the input document forest. The number of runs actually needed is determined by analysis of the query prior to the matching: If the query is 'simple' enough, a single pass suffices. In this case it is even possible to perform the algorithm 'on the fly', i.e., without having to construct an in-memory copy of the document forest. Though this method needs only a (nearly) constant amount of memory even for huge documents, it is slightly slower than performing the querying on an in-memory copy of the document forest, because subtrees that fulfill the contextual part of the query must be constructed on-demand, requiring an extra administrative overhead. Therefore fxgrep does not perform on-the-fly querying by default.
fxgrep provides command-line options for controlling this behavior:
|
Option -2 or --force-double makes fxgrep use the two-pass algorithm disregarding the result of the query analysis. This is mainly useful for testing and debugging purposes. |
|
-1 or --force-single enforces the one-pass algorithm disregarding query analysis. Use this option only if you deeply understand the querying algorithm of fxgrep! Otherwise it might report invalid matches. |
|
The -h or --force-hooks option enforces that matching is performed on the fly, during parsing. Use this option for large documents in order to reduce memory usage of fxgrep. For unary queries which are right-ignorant (i.e. their right context of matches need not be checked), matches are output in the usual way. If the unary queries are not right-ignorant, only the position of the matches is output, in order to save memory-space by avoiding constructing in-memory copies of the match trees. Similarly for binary patterns which are right-ignorant. This option is not valid, as it may lead to false matches in case of non right-ignorant binary queries. An warning is issued in this case. |
Each option can be one of:
|
A file name specifying the input document. Only one input document may be specified. |
|
A long option of the form --key or --key=arg |
|
A short option of the form -k, where k consists of a single character. If k consists of more than one character, each character is assumed to be a short option itself (e.g., -vic equals -v -i -c). The only exception is when the first character is an n (see below). |
|
A short option with argument of the form -k arg, where k consists of a single character. |
|
A negative short option of the form -nk, where k consists of a single character. If k consists of more than one character, each character is assumed to be a negative short option itself (e.g., -nvic equals -nv -ni -nc). If k is empty, then this is the (non-negative) short option -n. |
|
The string --. This option is ignored, except that all remaining options are interpreted as file names, whether they start with - or not. |
fxgrep understands all options documented for fxp. Additionally, the following options are available: