I have an XML file that represents the syntax trees of all the sentences in a book:
<book>
<sentence>
<w class="pronoun" role="subject">
I
</w>
<wg type="verb phrase">
<w class="verb" role="verb">
like
</w>
<wg type="noun phrase" role="object">
<w class="adj">
green
</w>
<w class="noun">
eggs
</w>
</wg>
</wg>
</sentence>
<sentence>
...
</sentence>
...
</book>
This example is fake, but the point is that the actual words (the <w> elements) are nested in unpredictable ways based on syntactic relationships.
What I'm trying to do is find <sentence> nodes with <w> children matching particular criteria in a certain order. For example, I may be looking for a sentence with a w[#class='pronoun'] descendant followed by a w[#class='verb'] descendant.
It's easy to find sentences that just contain both descendants, without caring about ordering:
//sentence[descendant::w[criteria1] and descendant::w[criteria2]]
I did manage to figure out this query that does what I want, which looks for a <w> with a following <w> matching the criteria with the same closest <sentence> ancestor:
for $sentence in //sentence
where $sentence[descendant::w[criteria1 and
following::w[(ancestor::sentence[1] = $sentence) and criteria2]]]
return ...
...but unfortunately it's very slow, and I'm not sure why.
Is there a non-slow way to search for a node that contains descendants matching criteria in a certain order? I'm using XQuery 3.1 with BaseX. If I can't find a reasonable way to do this with XQuery, plan B is to do post-processing with Python.
The following axis is expensive indeed, as it spans all subsequent nodes of a document that are no descendants and no ancestors.
The node comparison operators (<<, >>, is) may help you here. In the code example below, it is checked if there is at least one verb that is followed by a noun:
for $sentence in //sentence
let $words1 := $sentence//w[#class = 'verb']
let $words2 := $sentence//w[#class = 'noun']
where some $w1 in $words1 satisfies
some $w2 in $words2 satisfies $w1 << $w2
return $sentence
Related
I have a short question. How can I display only the elements who's value is = '.'
I have no idea how to do that. I'm newbie in XPath.
<SalesTransaction>
<TransactionHeader>
<TransactionHeaderFields>
<WrntyID>a</WrntyID>
<ExternalID/>
<Type>.</Type>
<Status>
Submited
</Status>
<CreationDate>
2015-01-12
</CreationDate>
<Date>
2015-01-12T11:41:29Z
</Date>
<DeliveryDate>
2015-01-12
</DeliveryDate>
<Remark/>
</TransactionHeaderFields>
<CatalogFields>
<CatalogID>
saf
</CatalogID>
</CatalogFields>
</TransactionHeader>
</SalesTransaction>
Ignoring any of the structure and just looking for any element who's text() is equal to ".", you could use:
//*[text()='.']
//* will search through the entire tree structure, looking for any element at any level
[text()='.'] is a predicate filter (kind of like a WHERE clause in SQL) that performs a test on each of those matched elements. Only the ones that have a text() node who's value is equal to . will evaluate to true() and will be what is left.
It's not not he most efficient XPath expression, but may be good enough for what you need.
What I need doesn't quite seem to match what other articles of a similar title are about.
I need, using Xpath 1, to be able to get node a, or node b, excusively, in that order.
That is, node a if it exists, otherwise, node b.
an xpath expression such as :
expression | expression
will get me both in the case they both exist. that is not what I want.
I could go:
(expression | expression)[last()]
Which does in fact gget me what I need (in my case), but seems to be a bit inefficient, because it will evaluate both sides of the expression before the last result is selected.
I was hoping for an expression that is going to stop working once the left side succeeds.
A more concrete example of XML
<one>
<two>
<three>hello</three>
<four>bye</four>
</two>
<blahfive>again</blahfive>
</one>
and the xpath that works (but inefficient):
(/one/*[starts-with(local-name(.), 'blah')] | .)[last()]
To be clear, I would like to grab the immediate child node of 'one' which starts with 'blah'. However, if it doesn't exist, I would like only the current node.
If the 'blah' node does exist, I do not want the current node.
Is there a more efficient way to achieve this?
I need, using Xpath 1, to be able to get node a, or node b,
excusively, in that order. That is, node a if it exists, otherwise,
node b.
an xpath expression such as :
expression | expression
will get me both in the case they both exist. that is not what I want.
I could go:
(expression | expression)[last()]
Which does in fact gget me what I need (in my case),
This statement is not true.
Here is an example. Let us have this XML document:
<one>
<a/>
<b/>
</one>
Expression1 is:
/*/a
Expression2 is:
/*/b
Your composite expression:
(Expression1 | Expression2)[last()]
when we substitute the two expressions above is:
(/*/a | /*/b)[last()]
And this expression actually selects b -- not a -- because b is the last of the two in document order.
Now, here is an expression that selects just a if it exists, and selects b only if a doesn't exist -- regardless of document order:
/*/a | /*/b[not(/*/a)]
When this expression is evaluated on the XML document above, it selects a, regardless of its document order -- try swapping in the XML document above the places of a and b to confirm that in both cases the element that is selected is a.
To summarize, one expression that selects the wanted node regardless of any document order is:
Expression1 | Expression2[not(Expression1)]
Let us apply this general expression in your case:
Expression1 is:
/one/*[starts-with(local-name(.), 'blah')]
Expression2 is:
self::node()
The wanted expression (after substituting Expression1 and Expression2 in the above general expression) is:
/one/*[starts-with(local-name(.), 'blah')]
|
self::node()[not(/one/*[starts-with(local-name(.), 'blah')])]
This is for C code detection. I'm trying to flag case statements that don't have a break. The hierarchy of the tree looks like this when there are multiple lines before the break statement. This is an example in C:
switch (x) {
case 1:
if (...) {...}
int y = 0;
for (...) {...}
break;
case 2:
It is somehow represented as this:
<switch>
<case>...</case>
<if>...</if>
<expression>...</expression>
<for>...</for>
<break>...</break>
<case>...</case>
</switch>
I need to find <case>s where a <break> exists after any number of lines, but before the next <case>.
This code only helps me find those where the break doesn't immediately follow the case:
//case [name(following-sibling::*[1]) != 'break']
..but when I try to use following-sibling::* it will find a break, but not necessarily before the next case.
How can I do this?
Select any case that has a following break and either no following case or where the position of the next break is less than the position of the next case. With the positions determined by running count() on the preceding siblings.
//case
[
following-sibling::break and
(
not(following-sibling::case) or
(
count(following-sibling::break[1]/preceding-sibling::*) <
count(following-sibling::case[1]/preceding-sibling::*)
)
)
]
To grab the other cases, those without breaks, just throw a big old not() in there like so:
//case
[not(
following-sibling::break and
(
not(following-sibling::case) or
(
count(following-sibling::break[1]/preceding-sibling::*) <
count(following-sibling::case[1]/preceding-sibling::*)
)
)
)]
I agree with #PeterHall, It would be better to restructure the XML into something more closely representing the abstract syntax tree of the C grammar. You can do this easily enough (for this case) with XSLT grouping:
<xsl:for-each-group select="*" group-starting-with="case">
<case>
<xsl:copy-of select="current-group()[not(self::case)]"/>
</case>
</xsl:for-each-group>
You can then find cases with no break as switch/case[not(break)].
I think you are struggling because your XML format does not really model the problem very well. It would be much easier if the other statements were nested inside the <case> elements, instead of being siblings, then you could just use switch/case[break].
With your current structure, it's easiest to start by finding the <break> and then work backwards to find the matching <case>. As #LarsH pointed out, my original expression would find some additional clauses. It can't really be modified to fix that, unless you restrict it to find just the first case:
switch/break/preceding-sibling::case[1]
#derp's answer is better, and can find both cases with and without breaks.
Derp's answer is correct. But I'll just add another. This selects case elements that do have a break:
//case[generate-id(.) =
generate-id(following-sibling::break[1]/preceding-sibling::case[1])]
In otherwords, this selects case elements for which this is true:
The context element is identical to the first case element preceding the next break element (considering siblings only).
If you have a lot of case statements, this variant could be faster than using count(). But you never know for sure unless you test it with the relevant data using the relevant XPath processor.
BTW, the . in generate-id(.) is not required, as the argument defaults to . anyway. But I prefer to make it explicit, for readability.
Using the count(preceding-sibling::*) XPath expression one can obtaining incrementing counters. However, can the same also be accomplished in a two-levels deep sequence?
example XML instance
<grandfather>
<father>
<child>a</child>
</father>
<father>
<child>b</child>
<child>c</child>
</father>
</grandfather>
code (with Saxon HE 9.4 jar on the CLASSPATH for XPath 2.0 features)
Trying to get an counter sequence of 1,2 and 3 for the three child nodes with different kinds of XPath expressions:
XPathExpression expr = xpath.compile("/grandfather/father/child");
NodeList nodes = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);
for (int i = 0 ; i < nodes.getLength() ; i++) {
Node node = nodes.item(i);
System.out.printf("child's index is: %s %s %s, name is: %s\n"
,xpath.compile("count(preceding-sibling::*)").evaluate(node)
,xpath.compile("count(preceding-sibling::child)").evaluate(node)
,xpath.compile("//child/position()").evaluate(doc)
,xpath.compile(".").evaluate(node));
}
The above code prints:
child's index is: 0 0 1, name is: a
child's index is: 0 0 1, name is: b
child's index is: 1 1 1, name is: c
None of the three XPaths I tried managed to produce the correct sequence: 1,2,3. Clearly it can trivially be done using the i loop variable but I want to accomplish it with XPath if possible. Also I need to keep the basic framework of evaluating an XPath expression to get all the nodes to visit and then iterating on that set since that's the way the real application I work on is structured. Basically I visit each node and then need to evaluate a number of XPath expressions on it (node) or on the document (doc); one of these XPAth expressions is supposed to produce this incrementing sequence.
Use the preceding axis with a name test instead.
count(preceding::child)
Using XPath 2.0, there is a much better way to do this. Fetch all <child/> nodes and use the position() function to get the index:
//child/concat("child's index is: ", position(), ", name is: ", text())
You don't say efficiency is important, but I really hate to see this done with O(n^2) code! Jens' solution shows how to do that if you can use the result in the form of a sequence of (position, name) pairs. You could also return an alternating sequence of strings and numbers using //child/(string(.), position()): though you would then want to use the s9api API rather than JAXP, because JAXP can only really handle the data types that arise in XPath 1.0.
If you need to compute the index of each node as part of other processing, it might still be worth computing the index for every node in a single initial pass, and then looking it up in a table. But if you're doing that, the simplest way is surely to iterate over the result of //child and build a map from nodes to the sequence number in the iteration.
Given a set of strings, for example:
EFgreen
EFgrey
EntireS1
EntireS2
J27RedP1
J27GreenP1
J27RedP2
J27GreenP2
JournalP1Black
JournalP1Blue
JournalP1Green
JournalP1Red
JournalP2Black
JournalP2Blue
JournalP2Green
I want to be able to detect that these are three sets of files:
EntireS[1,2]
J27[Red,Green]P[1,2]
JournalP[1,2][Red,Green,Blue]
Are there any known ways of approaching this problem - any published papers I can read on this?
The approach I am considering is for each string look at all other strings and find the common characters and where differing characters are, trying to find sets of strings that have the most in common, but I fear that this is not very efficient and may give false positives.
Note that this is not the same as 'How do I detect groups of common strings in filenames' because that assumes that a string will always have a series of digits following it.
I would start here: http://en.wikipedia.org/wiki/Longest_common_substring_problem
There are links to supplemental information in the external links, including Perl implementations of the two algorithms explained in the article.
Edited to add:
Based on the discussion, I still think Longest Common Substring could be at the heart of this problem. Even in the Journal example you reference in your comment, the defining characteristic of that set is the substring 'Journal'.
I would first consider what defines a set as separate from the other sets. That gives you your partition to divide up the data, and then the problem is in measuring how much commonality exists within a set. If the defining characteristic is a common substring, then Longest Common Substring would be a logical starting point.
To automate the process of set detection, in general, you will need a pairwise measure of commonality which you can use to measure the 'difference' between all possible pairs. Then you need an algorithm to compute the partition that results in the overall lowest total difference. If the difference measure is not Longest Common Substring, that's fine, but then you need to determine what it will be. Obviously it needs to be something concrete that you can measure.
Bear in mind also that the properties of your difference measurement will bear on the algorithms that can be used to make the partition. For example, assume diff(X,Y) gives the measure of difference between X and Y. Then it would probably be useful if your measure of distance was such that diff(A,C) <= diff(A,B) + diff(B,C). And obviously diff(A,C) should be the same as diff(C,A).
In thinking about this, I also begin to wonder whether we could conceive of the 'difference' as a distance between any two strings, and, with a rigorous definition of the distance, could we then attempt some kind of cluster analysis on the input strings. Just a thought.
Great question! The steps to a solution are:
tokenizing input
using tokens to build an appropriate data structure. a DAWG is ideal, but a Trie is simpler and a decent starting point.
optional post-processing of the data structure for simplification or clustering of subtrees into separate outputs
serialization of the data structure to a regular expression or similar syntax
I've implemented this approach in regroup.py. Here's an example:
$ cat | ./regroup.py --cluster-prefix-len=2
EFgreen
EFgrey
EntireS1
EntireS2
J27RedP1
J27GreenP1
J27RedP2
J27GreenP2
JournalP1Black
JournalP1Blue
JournalP1Green
JournalP1Red
JournalP2Black
JournalP2Blue
JournalP2Green
^D
EFgre(en|y)
EntireS[12]
J27(Green|Red)P[12]
JournalP[12](Bl(ack|ue)|(Green|Red))
Something like that might work.
Build a trie that represents all your strings.
In the example you gave, there would be two edges from the root: "E" and "J". The "J" branch would then split into "Jo" and "J2".
A single strand that forks, e.g. E-n-t-i-r-e-S-(forks to 1, 2) indicates a choice, so that would be EntireS[1,2]
If the strand is "too short" in relation to the fork, e.g. B-A-(forks to N-A-N-A and H-A-M-A-S), we list two words ("banana, bahamas") rather than a choice ("ba[nana,hamas]"). "Too short" might be as simple as "if the part after the fork is longer than the part before", or maybe weighted by the number of words that have a given prefix.
If two subtrees are "sufficiently similar" then they can be merged so that instead of a tree, you now have a general graph. For example if you have ABRed,ABBlue,ABGreen,CDRed,CDBlue,CDGreen, you may find that the subtree rooted at "AB" is the same as the subtree rooted at "CD", so you'd merge them. In your output this will look like this: [left branch, right branch][subtree], so: [AB,CD][Red,Blue,Green]. How to deal with subtrees that are close but not exactly the same? There's probably no absolute answer but someone here may have a good idea.
I'm marking this answer community wiki. Please feel free to extend it so that, together, we may have a reasonable answer to the question.
try "frak" . It creates regex expression from set of strings. Maybe some modification of it will help you.
https://github.com/noprompt/frak
Hope it helps.
There are many many approaches to string similarity. I would suggest taking a look at this open-source library that implements a lot of metrics like Levenshtein distance.
http://sourceforge.net/projects/simmetrics/
You should be able to achieve this with generalized suffix trees: look for long paths in the suffix tree which come from multiple source strings.
There are many solutions proposed that solve the general case of finding common substrings. However, the problem here is more specialized. You're looking for common prefixes, not just substrings. This makes it a little simpler.
A nice explanation for finding longest common prefix can be found at
http://www.geeksforgeeks.org/longest-common-prefix-set-1-word-by-word-matching/
So my proposed "pythonese" pseudo-code is something like this (refer to the link for an implementation of find_lcp:
def count_groups(items):
sorted_list = sorted(items)
prefix = sorted_list[0]
ctr = 0
groups = {}
saved_common_prefix = ""
for i in range(1, sorted_list):
common_prefix = find_lcp(prefix, sorted_list[i])
if len(common_prefix) > 0: #we are still in the same group of items
ctr += 1
saved_common_prefix = common_prefix
prefix = common_prefix
else: # we must have switched to a new group
groups[saved_common_prefix] = ctr
ctr = 0
saved_common_prefix = ""
prefix = sorted_list[i]
return groups
For this particular example of strings to keep it extremely simple consider using simple word/digit -separation.
A non-digit sequence apparently can begin with capital letter (Entire). After breaking all strings into groups of sequences, something like
[Entire][S][1]
[Entire][S][2]
[J][27][Red][P][1]
[J][27][Green][P][1]
[J][27][Red][P][2]
....
[Journal][P][1][Blue]
[Journal][P][1][Green]
Then start grouping by groups, you can fairly soon see that prefix "Entire" is a common for some group and that all subgroups have S as headgroup, so only variable for those is 1,2.
For J27 case you can see that J27 is only leaf, but that it then branches at Red and Green.
So somekind of List<Pair<list, string>> -structure (composite pattern if I recall correctly).
import java.util.*;
class StringProblem
{
public List<String> subString(String name)
{
List<String> list=new ArrayList<String>();
for(int i=0;i<=name.length();i++)
{
for(int j=i+1;j<=name.length();j++)
{
String s=name.substring(i,j);
list.add(s);
}
}
return list;
}
public String commonString(List<String> list1,List<String> list2,List<String> list3)
{
list2.retainAll(list1);
list3.retainAll(list2);
Iterator it=list3.iterator();
String s="";
int length=0;
System.out.println(list3); // 1 1 2 3 1 2 1
while(it.hasNext())
{
if((s=it.next().toString()).length()>length)
{
length=s.length();
}
}
return s;
}
public static void main(String args[])
{
Scanner sc=new Scanner(System.in);
System.out.println("Enter the String1:");
String name1=sc.nextLine();
System.out.println("Enter the String2:");
String name2=sc.nextLine();
System.out.println("Enter the String3:");
String name3=sc.nextLine();
// String name1="salman";
// String name2="manmohan";
// String name3="rahman";
StringProblem sp=new StringProblem();
List<String> list1=new ArrayList<String>();
list1=sp.subString(name1);
List<String> list2=new ArrayList<String>();
list2=sp.subString(name2);
List<String> list3=new ArrayList<String>();
list3=sp.subString(name3);
sp.commonString(list1,list2,list3);
System.out.println(" "+sp.commonString(list1,list2,list3));
}
}