The Functional XML Querying Tool
|
What is fxgrep? |
|
The Processing Model of fxgrep |
|
The Pattern Language of fxgrep |
|
The Grammar Formalism of fxgrep |
|
Text Patterns |
|
fxgrep Options |
|
fxgrep's Graphical User Interface |
|
Introduction |
|
Special Variables |
|
Conjunction and Negation: Forest Expressions |
|
White-Space Handling |
|
Regular Expressions |
|
Summary of the Grammar Syntax |
|
Specification of the Grammar |
fxgrep's grammar formalism is an alternative way of specifying queries. It is less succinct but more expressive than the pattern language. For example, the pattern language cannot express the following query:
Find the elements of type b in a binary tree of type a where left siblings are always of type a and right siblings always of type b.
In the grammar formalism, this query is expressible by the grammar:
FORMULA y START _ x _ RULES x -> <a> xy | <a> y -> <b> xy | <b>
Consider for example the following query:
A b element such that all of its ancestors have element type a.
Note that this could be expressed in the pattern language by the pattern (a/)+b.
In the grammar formalism, this query is expressible using a recursive rule:
FORMULA x START _ (x|y) _ RULES x -> <b> _ y -> <a> _ (x|y) _
Each rule of a grammar has a variable on its left-hand side; the right-hand side specifies permitted expansions for this variable. A grammar produces or derives a document forest as follows:
|
Begin with a sequence of variables in the language of a start expression. These are the expressions given after the START keyword. In our example there is only one start expression, namely _(x|y)_. We might thus begin with the sequence of variables .y (see regular expressions). |
||||||||||
|
Now repeatedly replace a variable with a partial tree matching a right hand side given for this variable, until no variables remain. The special variable . may be replaced with an arbitrary document subtree. In the example, we might perform the following steps:
|
||||||||||
|
For unary queries, as for example the one defined by the grammar specified above, a node is a match of the grammar if the boolean formula over variables specified after FORMULA evaluates to true when each variable is assigned true if there is a derivation of the document forest in which the node is derived from the variable, and false otherwise. In our example, this is the b node, because there is a derivation of it from x which is the only variable in the formula. In other derived forests, there might also be more than one match. |
||||||||||
|
For binary queries, a formula must be specified for the primary match and one or more formula must be specified for the secondary matches. A node is a primary match if the primary formula is fulfilled as described above, and, similary, a node is a secondary match if the corresponding secondary formula is fulfilled. Furthermore, a set of primary targets and one set of secondary targets have to be specified for each secondary formula. A primary match node and a secondary match node are reported together as a match pair if there exists a derivation in which the primary node is derived from a variable in the set of primary targets and the secondary node is derived from a variable in the set of secondary targets. For example to match all the b nodes as above together with their a fathers the following grammar can be specified as follows: |
FORMULA x TARGETS x FORMULA y TARGETS y START _ (x|y) _ RULES x -> <b> _ y -> <a> _ (x|y) _
All right hand sides appearing in this example specify element nodes. Such a right hand side specifies between the < and > an element-type pattern for the types of allowed elements and a list of possibly negated attribute tests for its attributes. The meaning of these attribute tests is the same as within node patterns in the pattern syntax. For instance a right-hand side of the form <!a|b x y="1" !z~"1">_ allows to fill in an element node whose type is different from a and b, and which has an attribute named x, and attribute y with value 1, and either no attribute z or one whose value does not contain 1.
Alternatively, a right-hand side of a grammar can specify a processing instruction. An example is <?^fxp-?> _, which allows any target that begins with the string fxp-, as indicated by the text pattern between <? and ?>. It is followed by a regular expression which specifies the content of the processing instruction, i.e., its text. This regular expression should therefore be a single variable.
The third kind of right-hand sides specify text nodes which must contain a substring matching a text pattern given between a pair of quotes, e.g., 'hurlyburly'. If an exact match of the text pattern is desired, the text pattern must be enclosed between ^ and $ (see below).
fxgrep provides two predefined variables . and ~, describing arbitrary trees and white space. In this context, white space means either a sequence of white-space characters or a processing instruction. These two variables are defined by the following grammar rules:
~ -> "^ $" ~ -> <??> .** . -> "" . -> <??> .** . -> <*> .**
The pattern language allows specification of multiple, possibly negated structure qualifiers for a single node pattern. Analogously, the grammar syntax allows multiple, possibly negated regular expressions on the right-hand sides of rules, in the form of forest expressions. A forest expression is composed from regular expressions using the operator &&; each single regular expression may be negated by !.
However, the meaning of a grammar with conjunction and negation cannot be given with the concept of a single derivation. If a node was generated by a rule with conjunction, then there must be a derivation for its children under each of the given regular expressions; if one of these regular expressions is negated, then there must not be a derivation. Moreover, a node can only be a match of the grammar if no negated regular expression was applied for all of its ancestors. For example, consider the following grammar:
FORMULA y START _ x _ RULES x -> <a> _ x _ x -> <a> _ y _ && ! _ z _ y -> <b> z -> <c> y
According to the second rule, all trees can be derived from variable x that have type a and whose children can be derived from _ y _ but not from _ z _. Therefore the grammar derives all forests that contain an empty element of type b, all of whose ancestors have type a and that does not have a sibling of type c with a single empty child of type b. This first b element is also a match of the grammar in that forest. For instance, the b node is a match in the document <a><a><b/><c/></a></a>, but there is no derivation (and thus no match) for the document <a><b/><c><b/></c></a>.
XML documents are often indented according to the element structure by inserting line breaks and white space. Moreover, comments - which are ignored by fxgrep - are often written into a separate line, introducing extra white space. Similarly, a processing instruction may be placed at arbitrary positions within content without being a significant part of it. Therefore fxgrep considers processing instructions as white space as well.
When querying a document, one usually wants to ignore this extra white space. Therefore fxgrep implicitly allows white space between elements, and at the start and end of an element's content. For instance, the language of the regular expression x*y contains ~~x~x~~~y~ as well as xxy. In order to avoid this behavior, there are two possibilities:
|
Enclosing a regular expression into ^, and ,$ disallows leading and trailing white-space (^ and $ prohibit an arbitrary sequence before and after respectively, while , prohibits white spaces). Thus x~x~~~y is in the language of the regular expression ^,x*y,$ whereas ~xxy~ is not. |
|
Using the operator , for concatenation and ++ and ** for repetition disallows the white space between elements. Thus ~xxy~ is in the language of the regular expression x**,y whereas x~x~~~y is not. |
The regular expressions of fxgrep grammars are an extended form
of conventional regular expressions. A regular expression is composed
from variable names,
, ^,
$, ~ and _
using unary operators ?, *,
**, +, ++
and binary operators | ,
and concatenation.
|
A regular expression describes a set of sequences of variables, called its language. This language is defined as follows:
|
The language of a regular expression consisting of a variable x contains only the unary sequence x. The two symbols . and ~ are predefined special variables. |
|
The language of the empty regular expression ( |
|
A regular expression of the form e1|e2 denotes the disjunction of both subexpressions. Its language is the union of the languages of e1 and e2. |
|
There are two kinds of concatenation. The first one has the form e1,e2. Its language consists of all sequences obtained by appending a sequence from the language of e2 to a sequence in the language of e1. The second form of concatenation drops the comma and allows a sequence of ~'s between the two subsequences. More precisely, a regular expression of the form e1e2 is an abbreviation for e1,~**,e2. |
|
The ? specifies an optional subsequence, i.e., the regular expression e? is an abbreviation for (e|). |
|
A regular expression of the form e++ denotes the repetition of e. Its language is the set of all sequences obtained by concatenating one or more sequences from the language of e. A similar operator is + which allows a sequence of ~'s after each subsequence, that is, e+ is an abbreviation for (e,~**)+. The operators ** and * are analogous, except that they allow additionally the empty sequence. |
A whole regular expression implicitly allows for a leading and
trailing sequence of arbitrary nodes. That is, a regular
expression e is treated as if it were
|
The query can be specified as a grammar using the -g or --grammar option. fxgrep offers two syntaxes for grammars. If the suffix of the file is different from .xml and .XML, then the grammar is read from a file and interpreted in the syntax described above. In order to use arbitrary Unicode characters in the grammar, the file must be encoded in UTF-8 or UTF-16. If the suffix of the file is .xml or .XML, then the grammar is parsed as an XML document which should conform to the following DTD:
<!DOCTYPE grammar [ <!ELEMENT grammar (rule|start|formula|targets|secondaries)*> <!ELEMENT rule (#PCDATA)> <!ELEMENT start (#PCDATA)> <!ELEMENT targets (#PCDATA)> <!ATTLIST targets sign CDATA "+"> <!ELEMENT secondaries (formula,targets)> <!ATTLIST rule var NMTOKEN #REQUIRED> ]> |
Each rule element contains a right-hand side for the variable specified in its var attribute. Each start element contains one start expression; and a targets element contains a space-separated list of target variables. The order of elements is not significant. For instance, the grammar
FORMULA y START _ x _ RULES x -> <a> _ x _ x -> <a> _ y _ && ! _ z _ y -> <b> z -> <c> v
could be specified in XML syntax as follows:
<grammar> <rule var=" x "><![CDATA[ <a> _ x _ ]]> </rule> <rule var=" x "><![CDATA[ <a> _ y _ && ! _ z _ ]]> </rule> <rule var=" y "><![CDATA[ <b> ]]> </rule> <rule var=" z "><![CDATA[ <c> v ]]> </rule> <start> _ x _ </start> <formula> y </formula> </grammar> |
The specification must contain a formula element specifying the primary matches. In the case of binary matches, an element targets must specify the primary targets. Secondary matches must be specified by the element secondaries element. This must have an element formula and an element targets specifying a secondary formula and the corresponding secondary targets, as in the case of pure text grammar specification as above.
For example in:
<grammar>
<rule var=" x1 "><![CDATA[ <*> _ (x1|xa|xb) _ ]]> </rule>
<rule var=" xa "><![CDATA[ <a> _ (x2|xc1) _ ]]> </rule>
<rule var=" x2 "><![CDATA[ <*> _ (x2|xc1) _ ]]> </rule>
<rule var=" xc1 "><![CDATA[ <c> _ xd _ ]]> </rule>
<rule var=" xb "><![CDATA[ <b> _ (x3|xc2) _ ]]> </rule>
<rule var=" x3 "><![CDATA[ <*> _ (x3|xc2) _ ]]> </rule>
<rule var=" xc2 "><![CDATA[ <c> _ xd _ ]]> </rule>
<rule var=" xd "><![CDATA[ <d> _ ]]> </rule>
<start> _ x1 _ </start>
<formula> xd </formula>
<targets> xd </targets>
<secondaryTargets>
<formula> xc1 & xc2 </formula>
<targets> xc1 xc2 </targets>
</secondaryTargets>
</grammar>
|
the primary matches are d nodes that have a c father and some a or b ancestor. The secondary matches are c nodes that have both an a and a b ancestor. A binary match is a pair of a d node and its c father having both an a and a b ancestor.
The reason for the use of CDATA sections is that the < and & characters would be recognized as markup characters otherwise. Though the grammar can be indented using white space, the CDATA sections make the grammar hardly intelligible. However, the XML syntax has the advantage of allowing an arbitrary encoding as far as it is supported by fxp.
By default, the XML document containing the grammar is parsed in non-validating mode with inclusion of external parsed entities. If the grammar has validity or well-formedness errors, fxgrep ignores them. This behavior can be changed with the --grammar-validate, --grammar-include and --ignore-xml-errors options. For instance,
fxgrep -g gram.xml --grammar-validate --ignore-xml-errors=no
reads the grammar from the file gram.xml in validating mode, and exits with an error if that file is invalid or ill-formed XML, whereas
fxgrep -g gram.xml --grammar-include=no
parses that file in non-validating mode without inclusion of external entities.
fxgrep normally warns about undefined variables in a grammar, i.e., variables that are used on a right hand side but not defined by a rule. For instance, fxgrep will produce the following warning for the above example grammar:
[gram.xml:5.42] Grammar warning: Variable 'v' is used here but is not defined.
This warning can be disabled with the option --warn-undef-vars=no.
If the grammar or the pattern contains a syntax error (with respect to the pattern/grammar syntax of fxgrep) then fxgrep will normally exit with an error. However, if --ignore-syntax-errors is given, it will try to recover from the error and perform the querying any way. Beware that in this case the query might be misinterpreted due to the syntax error and yield unintended matches!