public abstract class TregexPattern extends Object implements Serializable
The Tregex pattern language is modeled on tgrep and tgrep2. However, unlike these
tree pattern matching systems, but like Unix grep, there is no pre-indexing of the data to be searched.
Rather there is a linear scan through the trees where matches are sought.
As a result, matching is slower, but a TregexPattern can be applied
to an arbitrary set of trees at runtime in a processing pipeline without pre-indexing.
TregexPattern instances can be matched against instances of the Tree class.
The main(java.lang.String[]) method can be used to find matching nodes of a treebank from the command line.
/^MW/ < IN.
We then create a pattern, find matches in a given tree, and process
those matches as follows:
// Create a reusable pattern object
TregexPattern patternMW = TregexPattern.compile("/^MW/ < IN");
// Run the pattern on one particular tree
TregexMatcher matcher = patternMW.matcher(tree);
// Iterate over all of the subtrees that matched
while (matcher.findNextMatchingNode()) {
    Tree match = matcher.getMatch();
    // do what we want to do with the subtree
    match.pennPrint();
}
| Symbol | Meaning |
|---|---|
| A << B | A dominates B |
| A >> B | A is dominated by B |
| A < B | A immediately dominates B |
| A > B | A is immediately dominated by B |
| A $ B | A is a sister of B (and not equal to B) |
| A .. B | A precedes B |
| A . B | A immediately precedes B |
| A ,, B | A follows B |
| A , B | A immediately follows B |
| A <<, B | B is a leftmost descendant of A |
| A <<- B | B is a rightmost descendant of A |
| A >>, B | A is a leftmost descendant of B |
| A >>- B | A is a rightmost descendant of B |
| A <, B | B is the first child of A |
| A >, B | A is the first child of B |
| A <- B | B is the last child of A |
| A >- B | A is the last child of B |
| A <` B | B is the last child of A |
| A >` B | A is the last child of B |
| A <i B | B is the ith child of A (i > 0) |
| A >i B | A is the ith child of B (i > 0) |
| A <-i B | B is the ith-to-last child of A (i > 0) |
| A >-i B | A is the ith-to-last child of B (i > 0) |
| A <: B | B is the only child of A |
| A >: B | A is the only child of B |
| A <<: B | A dominates B via an unbroken chain (length > 0) of unary local trees. |
| A >>: B | A is dominated by B via an unbroken chain (length > 0) of unary local trees. |
| A $++ B | A is a left sister of B (same as $.. for context-free trees) |
| A $-- B | A is a right sister of B (same as $,, for context-free trees) |
| A $+ B | A is the immediate left sister of B (same as $. for context-free trees) |
| A $- B | A is the immediate right sister of B (same as $, for context-free trees) |
| A $.. B | A is a sister of B and precedes B |
| A $,, B | A is a sister of B and follows B |
| A $. B | A is a sister of B and immediately precedes B |
| A $, B | A is a sister of B and immediately follows B |
| A <+(C) B | A dominates B via an unbroken chain of (zero or more) nodes matching description C |
| A >+(C) B | A is dominated by B via an unbroken chain of (zero or more) nodes matching description C |
| A .+(C) B | A precedes B via an unbroken chain of (zero or more) nodes matching description C |
| A ,+(C) B | A follows B via an unbroken chain of (zero or more) nodes matching description C |
| A <<# B | B is a head of phrase A |
| A >># B | A is a head of phrase B |
| A <# B | B is the immediate head of phrase A |
| A ># B | A is the immediate head of phrase B |
| A == B | A and B are the same node |
| A <= B | A and B are the same node or A is the parent of B |
| A : B | [this is a pattern-segmenting operator that places no constraints on the relationship between A and B] |
| A <... { B ; C ; ... } | A has exactly B, C, etc as its subtree, with no other children. |
AbstractTreebankLanguagePack.getBasicCategoryFunction().
Note that Label description regular expressions are matched as find(),
as in Perl/tgrep, not as matches();
you need to use ^ or $ to constrain matches to
the ends of strings.
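The difference between find() and matches() semantics can be seen with plain `java.util.regex`, independently of Tregex. The sketch below is only an illustration: the class and method names (`LabelRegexDemo`, `findsIn`, `matchesWhole`) are hypothetical and are not part of the Tregex API.

```java
import java.util.regex.Pattern;

/** Contrasts find()-style and matches()-style regex matching on node labels. */
final class LabelRegexDemo {

    /** True if the regex matches anywhere inside the label, as with find(). */
    static boolean findsIn(String regex, String label) {
        return Pattern.compile(regex).matcher(label).find();
    }

    /** True only if the regex matches the entire label, as with matches(). */
    static boolean matchesWhole(String regex, String label) {
        return Pattern.compile(regex).matcher(label).matches();
    }

    public static void main(String[] args) {
        // Unanchored: "NN" is found inside the longer label "NNS".
        System.out.println(findsIn("NN", "NNS"));      // true
        System.out.println(matchesWhole("NN", "NNS")); // false
        // Anchored with ^ and $: find() now accepts only the exact label "NN".
        System.out.println(findsIn("^NN$", "NNS"));    // false
        System.out.println(findsIn("^NN$", "NN"));     // true
    }
}
```

Under find() semantics, a description such as /NN/ would therefore also accept labels like NNS or NNP unless it is anchored.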
(S < VP < NP) means
"an S over a VP and also over an NP".
Nodes can be grouped using parentheses '(' and ')'
as in S < (NP $++ VP) to match an S
over an NP, where the NP has a VP as a right sister.
So, if instead what you want is an S above a VP above an NP, you must write
"S < (VP < NP)".
B "follows" node A if B
or one of its ancestors is a right sibling of A or one
of its ancestors. Node B "immediately follows" node
A if B follows A and there
is no node C such that B follows
C and C follows A.
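The definition of "follows" can be made concrete on a toy tree. This is a minimal sketch under stated assumptions: `FollowsDemo` and its `Node` class are hypothetical stand-ins written for this illustration, not the library's Tree class or its actual implementation of the relation.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the "follows" relation, using parent pointers. */
final class FollowsDemo {

    static final class Node {
        final String label;
        Node parent;
        final List<Node> children = new ArrayList<>();

        Node(String label, Node... kids) {
            this.label = label;
            for (Node kid : kids) {
                kid.parent = this;
                children.add(kid);
            }
        }
    }

    /** True if x and y share a parent and x appears to the right of y. */
    static boolean isRightSibling(Node x, Node y) {
        if (x == y || x.parent == null || x.parent != y.parent) {
            return false;
        }
        List<Node> siblings = x.parent.children;
        return siblings.indexOf(x) > siblings.indexOf(y);
    }

    /** B follows A if B or one of its ancestors is a right sibling of A or one of its ancestors. */
    static boolean follows(Node b, Node a) {
        for (Node x = b; x != null; x = x.parent) {
            for (Node y = a; y != null; y = y.parent) {
                if (isRightSibling(x, y)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // (S (NP DT NN) (VP VBD))
        Node dt = new Node("DT"), nn = new Node("NN"), vbd = new Node("VBD");
        Node np = new Node("NP", dt, nn), vp = new Node("VP", vbd);
        new Node("S", np, vp); // links NP and VP under a common parent

        System.out.println(follows(vbd, nn)); // true: VP is a right sibling of NP
        System.out.println(follows(np, dt));  // false: NP dominates DT
    }
}
```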
A dominates B through an unbroken
chain of unary local trees only if A is also
unary. (A (B)) is a valid example that matches
A <<: B
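The unary-chain condition can be sketched on a toy tree as follows. The `UnaryChainDemo` class and its `Node` type are hypothetical and written only for this illustration; they are not how the library represents trees or evaluates the relation.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy check for "A <<: B": A reaches B through a chain of single-child nodes. */
final class UnaryChainDemo {

    static final class Node {
        final String label;
        final List<Node> children = new ArrayList<>();

        Node(String label, Node... kids) {
            this.label = label;
            for (Node kid : kids) {
                children.add(kid);
            }
        }
    }

    /** Walks down from a while each node has exactly one child, looking for b. */
    static boolean dominatesViaUnaryChain(Node a, Node b) {
        Node current = a;
        while (current.children.size() == 1) {
            current = current.children.get(0);
            if (current == b) {
                return true;
            }
        }
        return false; // chain ended or branched before reaching b
    }

    public static void main(String[] args) {
        Node b = new Node("B");
        Node a = new Node("A", b);                // (A (B))
        System.out.println(dominatesViaUnaryChain(a, b));   // true

        Node b2 = new Node("B");
        Node a2 = new Node("A", b2, new Node("C")); // (A (B) (C)): A is not unary
        System.out.println(dominatesViaUnaryChain(a2, b2)); // false
    }
}
```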
C, the description
C cannot be a full Tregex expression, but only an
expression specifying the name of the node. Negation of this
description is allowed.
== has the same precedence as the other relations, so the expression
A << B == A << C associates as
(((A << B) == A) << C), not as
((A << B) == (A << C)). (Both expressions are
equivalent, of course, but this is just an example.)
(NP < NN | < NNS) will match an NP node dominating either
an NN or an NNS. (NP > S & $++ VP) matches an NP that
is both under an S and has a VP as a right sister.
Expressions stop evaluating as soon as the result is known. For
example, if the pattern is NP=a | NNP=b and the NP
matches, then variable b will not be assigned even if
there is an NNP in the tree.
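Java's `||` operator short-circuits in the same way, which gives a rough analogy for why b stays unassigned. The sketch below is not Tregex code: it models a tree as a flat list of labels, and `ShortCircuitDemo`, `bind`, and `match` are hypothetical names used only for this illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java analogy for short-circuit evaluation of NP=a | NNP=b. */
final class ShortCircuitDemo {

    /** Records name -> label in bindings if the label is present; returns whether it was. */
    static boolean bind(List<String> labels, String label, String name, Map<String, String> bindings) {
        if (!labels.contains(label)) {
            return false;
        }
        bindings.put(name, label);
        return true;
    }

    /** Evaluates the analogue of NP=a | NNP=b and returns the variables that were assigned. */
    static Map<String, String> match(List<String> labels) {
        Map<String, String> bindings = new HashMap<>();
        // || stops at the first true operand, so the second bind() call is
        // skipped entirely when the first one succeeds.
        if (bind(labels, "NP", "a", bindings) || bind(labels, "NNP", "b", bindings)) {
            return bindings;
        }
        return new HashMap<>();
    }

    public static void main(String[] args) {
        System.out.println(match(Arrays.asList("NP", "NNP"))); // {a=NP}
        System.out.println(match(Arrays.asList("NNP")));       // {b=NNP}
    }
}
```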
Relations can be grouped using brackets '[' and ']'. So the expression
NP [< NN | < NNS] & > S
matches an NP that (1) dominates either an NN or an NNS, and (2) is under an S. Without
brackets, & takes precedence over |, and equivalent operators are
left-associative. Also note that & is the default combining operator if the
operator is omitted in a chain of relations, so that the two patterns are equivalent:
(S < VP < NP)
(S < VP & < NP)
As another example, (VP < VV | < NP % NP)
can be written explicitly as (VP [< VV | [< NP & % NP] ] )
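The effect of the brackets on grouping can be checked with ordinary booleans, since Java's `&&` likewise binds more tightly than `||`. This is only an analogy: `PrecedenceDemo` and its parameters are hypothetical, and each boolean stands for whether one relation holds at a given NP node.

```java
/** Boolean analogy for how & and | group with and without brackets. */
final class PrecedenceDemo {

    /** Models NP < NN | < NNS & > S, which groups as < NN | [< NNS & > S]. */
    static boolean withoutBrackets(boolean overNN, boolean overNNS, boolean underS) {
        return overNN || (overNNS && underS);
    }

    /** Models NP [< NN | < NNS] & > S. */
    static boolean withBrackets(boolean overNN, boolean overNNS, boolean underS) {
        return (overNN || overNNS) && underS;
    }

    public static void main(String[] args) {
        // An NP that dominates an NN but is not under an S.
        System.out.println(withoutBrackets(true, false, false)); // true
        System.out.println(withBrackets(true, false, false));    // false
    }
}
```

The two groupings disagree exactly when the NP dominates an NN but is not under an S.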
(NP !< NNP) matches only NPs not dominating
an NNP. Label descriptions can also be negated with '!':
(NP < !NNP|NNS) matches NPs dominating some node
that is not an NNP or an NNS.
@ symbol. For example
(@NP < @/NN.?/) This can only be used for individual nodes;
if you want all nodes to use the basic category, it would be more efficient
to use a TreeNormalizer to remove functional
tags before passing the tree to the TregexPattern.
S : NP matches only those S nodes in trees that also have an NP node.
(NP < NNP=name) will match an NP dominating an NNP
and after a match is found, the map can be queried with the
name to retrieve the matched node using TregexMatcher.getNode(String o)
with (String) argument "name" (TregexParseException to be thrown. Named nodes
(@NP <, (@NP $+ (/,/ $+ (@NP $+ /,/=comma))) <- =comma)
matches only an NP dominating exactly the four node sequence
NP , NP , -- the mother NP cannot have any other
daughters. Multiple backreferences are allowed. If the node with no
node description does not refer to a previously named node, there
will be no error; the expression simply will not match anything.
Another way to refer to previously named nodes is with the "link" symbol: '~'.
A link is like a backreference, except that instead of having to be equal to the
referred node, the current node only has to match the label of the referred to node.
A link cannot have a node description, i.e. the '~' symbol must immediately follow a
relation symbol.
<#, >#, <<#,
and >>#, and also
the Function mapping from labels to Basic Category tags can be
chosen by using a TregexPatternCompiler.
/ <regex-stuff> /#<group-number>%<variable-name>
For example, the pattern (designed for Penn Treebank trees)
@SBAR < /^WH.*-([0-9]+)$/#1%index << (__=empty < (/^-NONE-/ < /^\*T\*-([0-9]+)$/#1%index))
will match only if the WH- node under the SBAR is coindexed with the trace node that gets the name empty.
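The coindexing test that the shared variable name enforces can be imitated with plain `java.util.regex`: capture group 1 from each label and require the two captures to be equal. This sketch reuses the two regular expressions from the pattern above, but `CoindexDemo` and its methods are hypothetical and do not reflect how Tregex implements variable groups.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Imitates the #1%index constraint by comparing capture groups of two labels. */
final class CoindexDemo {

    private static final Pattern WH = Pattern.compile("^WH.*-([0-9]+)$");
    private static final Pattern TRACE = Pattern.compile("^\\*T\\*-([0-9]+)$");

    /** Returns capture group 1 if the regex is found in the label, else null. */
    static String group1(Pattern pattern, String label) {
        Matcher m = pattern.matcher(label);
        return m.find() ? m.group(1) : null;
    }

    /** True if both labels match their regex and the captured indices are equal. */
    static boolean coindexed(String whLabel, String traceLabel) {
        String whIndex = group1(WH, whLabel);
        String traceIndex = group1(TRACE, traceLabel);
        return whIndex != null && whIndex.equals(traceIndex);
    }

    public static void main(String[] args) {
        System.out.println(coindexed("WHNP-1", "*T*-1")); // true
        System.out.println(coindexed("WHNP-1", "*T*-2")); // false
    }
}
```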
A | B will not work.
/(.*)/#1%foo and
/(.*)/#1%bar. You might then want to write a pattern
that matches the concatenation of these patterns,
/(.*)(.*)/#1%foo#2%bar, but that will not work.
| Modifier and Type | Class and Description |
|---|---|
| static class | TregexPattern.TRegexTreeReaderFactory |
| Modifier and Type | Method and Description |
|---|---|
| static TregexPattern | compile(String tregex) Creates a pattern from the given string using the default HeadFinder and BasicCategoryFunction. |
| static void | main(String[] args) Prints out all matches of a tree pattern on each tree in the path. |
| TregexMatcher | matcher(Tree t) Get a TregexMatcher for this pattern on this tree. |
| TregexMatcher | matcher(Tree t, HeadFinder headFinder) Get a TregexMatcher for this pattern on this tree. |
| String | pattern() |
| void | prettyPrint() Print a multi-line representation of the pattern illustrating its syntax to System.out. |
| void | prettyPrint(PrintStream ps) Print a multi-line representation of the pattern illustrating its syntax. |
| void | prettyPrint(PrintWriter pw) Print a multi-line representation of the pattern illustrating its syntax. |
| static TregexPattern | safeCompile(String tregex, boolean verbose) Creates a pattern from the given string using the default HeadFinder and BasicCategoryFunction. |
| abstract String | toString() |
public TregexMatcher matcher(Tree t)
Get a TregexMatcher for this pattern on this tree.
Parameters:
t - a tree to match on
public TregexMatcher matcher(Tree t, HeadFinder headFinder)
Get a TregexMatcher for this pattern on this tree. Any Relations which use heads of trees should use the provided HeadFinder.
Parameters:
t - a tree to match on
headFinder - a HeadFinder to use when matching
public static TregexPattern compile(String tregex)
Creates a pattern from the given string using the default HeadFinder and BasicCategoryFunction. To use a different HeadFinder or BasicCategoryFunction, use a TregexPatternCompiler object.
Parameters:
tregex - the pattern string
Throws:
TregexParseException - if the string does not parse
public static TregexPattern safeCompile(String tregex, boolean verbose)
Creates a pattern from the given string using the default HeadFinder and BasicCategoryFunction. To use a different HeadFinder or BasicCategoryFunction, use a TregexPatternCompiler object.
Rather than throwing an exception when the string does not parse,
simply returns null.
Parameters:
tregex - the pattern string
verbose - whether to log errors when the string doesn't parse
public String pattern()
public abstract String toString()
public void prettyPrint(PrintWriter pw)
public void prettyPrint(PrintStream ps)
public void prettyPrint()
public static void main(String[] args) throws IOException
java edu.stanford.nlp.trees.tregex.TregexPattern [[-TCwfosnu] [-filter] [-h <node-name>]]* pattern filepath
Arguments:
pattern: the tree pattern, which optionally names some set of nodes (i.e., gives each one a "handle") with =name (for some arbitrary
string "name")
filepath: the path to files with trees. If this is a directory, there will be recursive descent and the pattern will be run on all files beneath the specified directory.
-C suppresses printing of matches, so only the
number of matches is printed.
-w causes the whole of a tree that matches to be printed.
-f causes the filename to be printed.
-i <filename> causes the pattern to be matched to be read from <filename> rather than the command line. Don't specify a pattern when this option is used.
-o Specifies that each tree node can be reported only once as the root of a match (by default a node will
be printed once for every way the pattern matches).
-s causes trees to be printed all on one line (by default they are pretty printed).
-n causes the number of the tree in which the match was found to be
printed before every match.
-u causes only the label of each matching node to be printed, not complete subtrees.
-t causes only the yield (terminal words) of the selected node to be printed (or the yield of the whole tree, if the -w option is used).
-encoding <charset_encoding> option allows specification of character encoding of trees.
-h <node-handle> If a -h option is given, the root tree node will not be printed. Instead,
for each node-handle specified, the node matched and given that handle will be printed. Multiple nodes can be printed by using the
-h option multiple times on a single command line.
-hf <headfinder-class-name> use the specified HeadFinder class to determine headship relations.
-hfArg <string> pass a string argument in to the HeadFinder class's constructor. -hfArg can be used multiple times to pass in multiple arguments.
-trf <TreeReaderFactory-class-name> use the specified TreeReaderFactory class to read trees from files.
-e <extension> Only attempt to read files with the given extension. If not provided, will attempt to read all files.
-v print every tree that contains no matches of the specified pattern, but print no matches to the pattern.
-x Instead of the matched subtree, print the matched subtree's identifying number as defined in tgrep2: a
unique identifier for the subtree, in the form s:n, where s is an integer specifying
the sentence number in the corpus (starting with 1), and n is an integer giving the order
in which the node is encountered in a depth-first search, starting with 1 at the top node in the
sentence tree.
-extract <code> <tree-file> extracts the subtree s:n specified by code from the specified tree-file.
Overrides all other behavior of tregex. Can't specify multiple encodings etc. yet.
-extractFile <code-file> <tree-file> extracts every subtree specified by the subtree codes in
code-file, which must appear exactly one per line, from the specified tree-file.
Overrides all other behavior of tregex. Can't specify multiple encodings etc. yet.
-filter causes this to act as a filter, reading tree input from stdin
-T causes all trees to be printed as processed (for debugging purposes). Otherwise only matching nodes are printed.
-macros <filename> file of macro substitutions to use, with one tab-separated original/replacement pair per line.
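The node number n in the s:n identifiers used by -x, -extract, and -extractFile above can be illustrated with a depth-first numbering of a toy tree. This is a sketch under assumptions: it numbers each parent before its children and visits children left to right, which is one reading of the description rather than a confirmed detail of the tool, and `SubtreeCodeDemo` and its `Node` class are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Numbers the nodes of a toy tree depth-first, starting with 1 at the root. */
final class SubtreeCodeDemo {

    static final class Node {
        final String label;
        final List<Node> children = new ArrayList<>();

        Node(String label, Node... kids) {
            this.label = label;
            for (Node kid : kids) {
                children.add(kid);
            }
        }
    }

    /** Returns the label of the n-th node (1-based) in depth-first order, or null if there is none. */
    static String labelAt(Node root, int n) {
        Node found = visit(root, n, new int[] {0});
        return found == null ? null : found.label;
    }

    /** Counts this node, then its children left to right (assumed order). */
    private static Node visit(Node node, int n, int[] counter) {
        if (++counter[0] == n) {
            return node;
        }
        for (Node child : node.children) {
            Node hit = visit(child, n, counter);
            if (hit != null) {
                return hit;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // (S (NP DT NN) (VP VBD)) is numbered S=1, NP=2, DT=3, NN=4, VP=5, VBD=6.
        Node s = new Node("S",
                new Node("NP", new Node("DT"), new Node("NN")),
                new Node("VP", new Node("VBD")));
        System.out.println(labelAt(s, 1)); // S
        System.out.println(labelAt(s, 5)); // VP
    }
}
```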
Throws: IOException