XPath: Find following siblings that don't follow an order pattern

This is for C code detection. I'm trying to flag case statements that don't have a break. The hierarchy of the tree looks like this when there are multiple lines before the break statement. This is an example in C:
switch (x) {
case 1:
if (...) {...}
int y = 0;
for (...) {...}
break;
case 2:
It is somehow represented as this:
<switch>
<case>...</case>
<if>...</if>
<expression>...</expression>
<for>...</for>
<break>...</break>
<case>...</case>
</switch>
I need to find <case>s where a <break> exists after any number of lines, but before the next <case>.
This code only helps me find those where the break doesn't immediately follow the case:
//case [name(following-sibling::*[1]) != 'break']
..but when I try to use following-sibling::* it will find a break, but not necessarily before the next case.
How can I do this?

Select any case that has a following break and either no following case, or a next break whose position is less than the position of the next case. The positions are determined by running count() on the preceding siblings.
//case
[
following-sibling::break and
(
not(following-sibling::case) or
(
count(following-sibling::break[1]/preceding-sibling::*) <
count(following-sibling::case[1]/preceding-sibling::*)
)
)
]
To grab the other cases, those without breaks, just throw a big old not() in there like so:
//case
[not(
following-sibling::break and
(
not(following-sibling::case) or
(
count(following-sibling::break[1]/preceding-sibling::*) <
count(following-sibling::case[1]/preceding-sibling::*)
)
)
)]
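A minimal sketch for trying both expressions with the JDK's built-in XPath 1.0 API. The class name `CaseBreakDemo` and the sample document are made up for illustration: case 1 and case 3 are followed by a break, case 2 falls through.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class CaseBreakDemo {
    // Hypothetical input that mirrors the flat sibling structure from the question.
    static final String XML =
        "<switch><case>1</case><if/><expression/><for/><break/>"
      + "<case>2</case><expression/><case>3</case><break/></switch>";

    // The predicate from the answer, written on one line.
    static final String HAS_BREAK =
        "following-sibling::break and (not(following-sibling::case) or "
      + "count(following-sibling::break[1]/preceding-sibling::*) < "
      + "count(following-sibling::case[1]/preceding-sibling::*))";

    static final String WITH_BREAK = "//case[" + HAS_BREAK + "]";
    static final String WITHOUT_BREAK = "//case[not(" + HAS_BREAK + ")]";

    // Evaluates the expression and returns the text of each selected node.
    static List<String> select(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(select(XML, WITH_BREAK));    // cases 1 and 3
        System.out.println(select(XML, WITHOUT_BREAK)); // case 2
    }
}
```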

I agree with @PeterHall: it would be better to restructure the XML into something that more closely represents the abstract syntax tree of the C grammar. You can do this easily enough (for this case) with XSLT 2.0 grouping:
<xsl:for-each-group select="*" group-starting-with="case">
<case>
<xsl:copy-of select="current-group()[not(self::case)]"/>
</case>
</xsl:for-each-group>
You can then find cases with no break as switch/case[not(break)].
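If no XSLT 2.0 processor is at hand, the same regrouping idea can be sketched by hand on a DOM tree. This is only an illustration, not the XSLT above: the class `RegroupDemo` and the sample document are assumptions, and unlike the stylesheet this version modifies the tree in place and keeps each case's own content.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class RegroupDemo {
    // Hypothetical flat input: case 2 has no break before case 3.
    static final String XML =
        "<switch><case>1</case><if/><expression/><for/><break/>"
      + "<case>2</case><expression/><case>3</case><break/></switch>";

    // Moves every non-case element into the nearest preceding <case>.
    static void nestUnderCases(Element switchElement) {
        // Snapshot the children first, because appendChild re-parents nodes.
        List<Element> children = new ArrayList<>();
        for (Node n = switchElement.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element) {
                children.add((Element) n);
            }
        }
        Element currentCase = null;
        for (Element child : children) {
            if ("case".equals(child.getTagName())) {
                currentCase = child;
            } else if (currentCase != null) {
                currentCase.appendChild(child);
            }
        }
    }

    // Regroups the document, then applies the simple nested query.
    static List<String> casesWithoutBreak(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            nestUnderCases(doc.getDocumentElement());
            NodeList cases = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("/switch/case[not(break)]", doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < cases.getLength(); i++) {
                result.add(cases.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(casesWithoutBreak(XML)); // case 2
    }
}
```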

I think you are struggling because your XML format does not really model the problem very well. It would be much easier if the other statements were nested inside the <case> elements instead of being siblings. Then you could just use switch/case[break].
With your current structure, it's easiest to start by finding the <break> and then work backwards to find the matching <case>. As @LarsH pointed out, my original expression would find some additional clauses. It can't really be modified to fix that, unless you restrict it to find just the first case:
switch/break/preceding-sibling::case[1]
@derp's answer is better, and can find both cases with and without breaks.

Derp's answer is correct. But I'll just add another. This selects case elements that do have a break:
//case[generate-id(.) =
generate-id(following-sibling::break[1]/preceding-sibling::case[1])]
In other words, this selects case elements for which this is true:
The context element is identical to the first case element preceding the next break element (considering siblings only).
If you have a lot of case statements, this variant could be faster than using count(). But you never know for sure unless you test it with the relevant data using the relevant XPath processor.
BTW, the . in generate-id(.) is not required, as the argument defaults to . anyway. But I prefer to make it explicit, for readability.
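generate-id() is defined by XSLT rather than by XPath 1.0 itself, so a standalone XPath processor may not offer it. A sketch of the same identity test using the count(. | X) = 1 idiom instead, run through the JDK's XPath API. The class name `IdentityDemo` and the sample document are assumptions; the extra [following-sibling::break] predicate is there because the union would otherwise also have size 1 when no break follows at all.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class IdentityDemo {
    // Hypothetical flat input: case 2 has no break before case 3.
    static final String XML =
        "<switch><case>1</case><if/><expression/><for/><break/>"
      + "<case>2</case><expression/><case>3</case><break/></switch>";

    // A union of two node-sets has size 1 only if both hold the same single node.
    static final String WITH_BREAK =
        "//case[following-sibling::break]"
      + "[count(. | following-sibling::break[1]/preceding-sibling::case[1]) = 1]";

    // Evaluates the expression and returns the text of each selected node.
    static List<String> select(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(select(XML, WITH_BREAK)); // cases 1 and 3
    }
}
```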

Related

XQuery/XPath "except" operation - Select part of sequence that is not in other sequence

I have a pretty simple example but I am just learning and can't find a solution for the following:
Given 2 sequences, being
<emp>10</emp>
<emp>42</emp>
<emp>100</emp>
and another sequence
<emp>10</emp>
<emp>42</emp>
What I want to do is compare the sequences and return the part that is in the first sequence but not in the second, which is <emp>100</emp> in this case.
I was thinking about an "except" operation, but can't figure out how to make it work.
Help greatly appreciated.
The except expression operates on node identity, not node value. What I think you want is a value comparison over your sequences. For example:
let $seq1 :=
(<emp>10</emp>,
<emp>42</emp>,
<emp>100</emp>)
let $seq2 :=
(<emp>10</emp>,
<emp>42</emp>)
return $seq1[not(. = $seq2)]
=>
<emp>100</emp>
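The same value comparison also works in plain XPath 1.0, because = between a node and a node-set is true when any node in the set has an equal string value. A small sketch using the JDK's XPath API, assuming both sequences sit in one document under hypothetical wrapper elements `first` and `second` (the class name `ValueExceptDemo` is also made up):

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class ValueExceptDemo {
    // Hypothetical wrapper document holding both sequences.
    static final String XML =
        "<root>"
      + "<first><emp>10</emp><emp>42</emp><emp>100</emp></first>"
      + "<second><emp>10</emp><emp>42</emp></second>"
      + "</root>";

    // Keep the emp elements of the first sequence whose value matches
    // no emp element of the second sequence.
    static final String ONLY_IN_FIRST = "/root/first/emp[not(. = /root/second/emp)]";

    static List<String> select(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(select(XML, ONLY_IN_FIRST)); // the emp with value 100
    }
}
```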

Select all nodes until a specific given node/tag

Given the following markup:
<div id="about">
<dl>
<dt>Date</dt>
<dd>1872</dd>
<dt>Names</dt>
<dd>A</dd>
<dd>B</dd>
<dd>C</dd>
<dt>Status</dt>
<dd>on</dd>
<dt>Another Field</dt>
<dd>X</dd>
<dd>Y</dd>
</dl>
</div>
I'm trying to extract all the <dd> nodes following <dt>Names</dt> but only until another <dt> starts. In this case, I'm after the following nodes:
<dd>A</dd>
<dd>B</dd>
<dd>C</dd>
I'm trying the following XPath code, but it's not working as intended.
xpath("//div[@id='about']/dl/dt[contains(text(),'Names')]/following-sibling::dd[not(following-sibling::dt)]/text()")
Any thoughts on how to fix it?
Many thanks.
Update: much simpler solution
There is a prerequisite in your situation, that is that the anchor item always is the first preceding sibling with a certain property. Because of that, here's a much simpler way of writing the below complex expression:
/div/dl/dd[preceding-sibling::dt[1][. = 'Names']]
In other words:
select any dd
that has a first preceding sibling dt (the preceding sibling axis counts backwards)
that itself has a value of "Names"
In oXygen, it selects the nodes you wanted to select (and if you change "Names" to "Status" or "Another Field", it selects only the dd elements that follow that dt, up to the next dt).
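A minimal sketch for checking the simpler expression with the JDK's XPath API. The class `DefinitionListDemo` and the helper `ddAfter` are hypothetical, the markup is the question's sample written on one line, and the helper assumes the label contains no apostrophe.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class DefinitionListDemo {
    // The markup from the question, compacted.
    static final String XML =
        "<div id=\"about\"><dl>"
      + "<dt>Date</dt><dd>1872</dd>"
      + "<dt>Names</dt><dd>A</dd><dd>B</dd><dd>C</dd>"
      + "<dt>Status</dt><dd>on</dd>"
      + "<dt>Another Field</dt><dd>X</dd><dd>Y</dd>"
      + "</dl></div>";

    // Returns the text of every dd whose nearest preceding dt has the given label.
    // Assumption: the label contains no apostrophe, since it is pasted into the expression.
    static List<String> ddAfter(String xml, String label) {
        String expression =
            "//div[@id='about']/dl/dd[preceding-sibling::dt[1][. = '" + label + "']]";
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(ddAfter(XML, "Names"));         // A, B, C
        System.out.println(ddAfter(XML, "Another Field")); // X, Y
    }
}
```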
Original complex solution (leaving in for reference)
This is far easier in XPath 2.0, but let's assume you can only use XPath 1.0. The trick is to count the number of preceding siblings from your anchor element (the one with "Names" in it), and disregard any that have the wrong count (i.e., when we cross over <dt>Status</dt>, the number of preceding siblings has increased).
For XPath 1.0, remove the comments between (: and :). Whitespace is insignificant in XPath, so you can spread the expression over several lines for readability, but comments are not possible in 1.0.
/div/dl/dd
(: any dd having a dt before it with "Names" :)
[preceding-sibling::dt[. = 'Names']]
(: count the preceding siblings up to dt with "Names", add one to include 'self' :)
[count(preceding-sibling::dt[. = 'Names']/preceding-sibling::dt) + 1
=
(: compare with count of all preceding siblings :)
count(preceding-sibling::dt)]
As a one-liner:
/div/dl/dd[preceding-sibling::dt[. = 'Names']][count(preceding-sibling::dt[. = 'Names']/preceding-sibling::dt) + 1 = count(preceding-sibling::dt)]
How about this:
//dd[preceding-sibling::dt[contains(., 'Names')]][following-sibling::dt]

UU-Parsinglib slowing down drastically when some rules are enabled

I'm writing a compiler using uu-parsinglib and I saw a very strange thing. I defined a pChoice combinator like:
pChoice = foldr (<<|>) pFail
(notice, I'm using greedy <<|>).
Let's consider the following code:
pFactor i = pChoice [ Expr.Var <$> pVar
, Expr.Lit <$> pLit True
, L.pParensed (pExpr i)
-- , Expr.Tuple <$> pTuple (pOpE i)
-- , Expr.List <$> pLst (pListE i)
]
Each element starts with different character - Expr.Var starts with a letter, Expr.Lit with a number, L.pParensed with parenthesis (, Expr.Tuple with brace { and Expr.List with bracket [.
I have a large test source file that contains no tuples and no lists. It parses in 0.15s. When I uncomment the two lines above, the time increases to 0.65s, which is more than four times slower. How is that possible? I'm using only greedy operators, and I'm sure the parser is not hanging in the Tuple or List section, because the whole input contains no tuple and no list.
If you need more code or definitions, I'll of course post them.
I think the cause may lie in the fact that you have parameterised pFactor. Each call to such a parser builds a new parser, which takes time. It is much better to create such parsers once and for all and share them in the actual parsing process. Since I cannot see how you are using this parser, I cannot answer your question any further.

XPath :: running counter two levels

Using the count(preceding-sibling::*) XPath expression one can obtain incrementing counters. However, can the same also be accomplished in a two-level-deep sequence?
example XML instance
<grandfather>
<father>
<child>a</child>
</father>
<father>
<child>b</child>
<child>c</child>
</father>
</grandfather>
code (with Saxon HE 9.4 jar on the CLASSPATH for XPath 2.0 features)
Trying to get a counter sequence of 1, 2 and 3 for the three child nodes with different kinds of XPath expressions:
XPathExpression expr = xpath.compile("/grandfather/father/child");
NodeList nodes = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);
for (int i = 0 ; i < nodes.getLength() ; i++) {
Node node = nodes.item(i);
System.out.printf("child's index is: %s %s %s, name is: %s\n"
,xpath.compile("count(preceding-sibling::*)").evaluate(node)
,xpath.compile("count(preceding-sibling::child)").evaluate(node)
,xpath.compile("//child/position()").evaluate(doc)
,xpath.compile(".").evaluate(node));
}
The above code prints:
child's index is: 0 0 1, name is: a
child's index is: 0 0 1, name is: b
child's index is: 1 1 1, name is: c
None of the three XPaths I tried managed to produce the correct sequence: 1, 2, 3. Clearly it can trivially be done using the i loop variable, but I want to accomplish it with XPath if possible. I also need to keep the basic framework of evaluating an XPath expression to get all the nodes to visit and then iterating over that set, since that's how the real application I work on is structured. Basically I visit each node and then need to evaluate a number of XPath expressions on it (node) or on the document (doc); one of these XPath expressions is supposed to produce this incrementing sequence.
Use the preceding axis with a name test instead.
count(preceding::child)
Using XPath 2.0, there is a much better way to do this. Fetch all <child/> nodes and use the position() function to get the index:
//child/concat("child's index is: ", position(), ", name is: ", text())
You don't say efficiency is important, but I really hate to see this done with O(n^2) code! Jens' solution shows how to do that if you can use the result in the form of a sequence of (position, name) pairs. You could also return an alternating sequence of strings and numbers using //child/(string(.), position()): though you would then want to use the s9api API rather than JAXP, because JAXP can only really handle the data types that arise in XPath 1.0.
If you need to compute the index of each node as part of other processing, it might still be worth computing the index for every node in a single initial pass, and then looking it up in a table. But if you're doing that, the simplest way is surely to iterate over the result of //child and build a map from nodes to the sequence number in the iteration.
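A sketch that puts the two suggestions side by side while keeping the JAXP loop from the question: a lookup table filled in a single pass, and the per-node count(preceding::child) expression with 1 added to make it 1-based. The class name `ChildIndexDemo` and the output format are assumptions for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class ChildIndexDemo {
    // The XML instance from the question, compacted.
    static final String XML =
        "<grandfather><father><child>a</child></father>"
      + "<father><child>b</child><child>c</child></father></grandfather>";

    // Returns one line per child: "<table index> <XPath index> <name>".
    static List<String> describe(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                "/grandfather/father/child", doc, XPathConstants.NODESET);

            // Single pass: remember each node's 1-based position in the result.
            Map<Node, Integer> indexOf = new IdentityHashMap<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                indexOf.put(nodes.item(i), i + 1);
            }

            List<String> lines = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Node node = nodes.item(i);
                // Per-node alternative: counts all earlier child elements on every call.
                Double viaXPath = (Double) xpath.evaluate(
                    "count(preceding::child) + 1", node, XPathConstants.NUMBER);
                lines.add(indexOf.get(node) + " " + viaXPath.intValue()
                    + " " + node.getTextContent());
            }
            return lines;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        describe(XML).forEach(System.out::println);
    }
}
```

The table is built once and then costs a single lookup per node, whereas the XPath expression walks the preceding nodes again for every child.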

An efficient technique to replace an occurrence in a sequence with mutable or immutable state

I am searching for an efficient technique to find a sequence of Op occurrences in a Seq[Op]. Once an occurrence is found, I want to replace it with a defined replacement and run the same search again until the list stops changing.
Scenario:
I have three types of Op case classes: Pop() extends Op, Push() extends Op and Nop() extends Op. I want to replace each occurrence of Push(), Pop() with Nop(). Basically the code could look like seq.replace(Push() ~ Pop() ~> Nop()).
Problem:
When I call seq.replace(...), I have to search the sequence for an occurrence of Push(), Pop(). So far so good: I find the occurrence. But now I have to splice the occurrence from the list and insert the replacement.
There are two options. My list could be mutable or immutable. If I use an immutable list, I am worried about performance, because those sequences are usually 500+ elements in size. If I replace a lot of occurrences like A ~ B ~ C ~> D ~ E, I will create a lot of new objects, if I am not mistaken. However, I could also use a mutable sequence like ListBuffer[Op].
Coming from a linked-list background, I would just do some pointer-bending, and after a total of four operations I would be done with the replacement without creating new objects. That is why I am now concerned about performance, especially since this is a performance-critical operation for me.
Question:
How would you implement the replace() method in a Scala fashion and what kind of data structure would you use keeping in mind that this is a performance-critical operation?
I am happy with answers that point me in the right direction or pseudo code. No need to write a full replace method.
Thank you.
Ok, some considerations to be made. First, recall that, on lists, tail does not create objects, and prepending (::) only creates one object for each prepended element. That's pretty much as good as you can get, generally speaking.
One way of doing this would be this:
def myReplace(input: List[Op], pattern: List[Op], replacement: List[Op]) = {
// This function should be part of a KMP algorithm instead, for performance
def compare(pattern: List[Op], list: List[Op]): Boolean = (pattern, list) match {
case (x :: xs, y :: ys) if x == y => compare(xs, ys)
// The whole pattern was consumed, so it is a prefix of the list
case (Nil, _) => true
case _ => false
}
var processed: List[Op] = Nil
var unprocessed: List[Op] = input
val patternLength = pattern.length
val reversedReplacement = replacement.reverse
// Do this until we finish processing the whole sequence
while (unprocessed.nonEmpty) {
// This inside algorithm would be better if replaced by KMP
// Quickly process non-matching sequences
while (unprocessed.nonEmpty && unprocessed.head != pattern.head) {
processed ::= unprocessed.head
unprocessed = unprocessed.tail
}
if (unprocessed.nonEmpty) {
if (compare(pattern, unprocessed)) {
processed :::= reversedReplacement
unprocessed = unprocessed drop patternLength
} else {
processed ::= unprocessed.head
unprocessed = unprocessed.tail
}
}
}
processed.reverse
}
You may gain speed by using KMP, particularly if the pattern searched for is long.
Now, what is the problem with this algorithm? The problem is that it won't test if the replaced pattern causes a match before that position. For instance, if I replace ACB with C, and I have an input AACBB, then the result of this algorithm will be ACB instead of C.
To avoid this problem, you should create a backtrack. First, you check at which position in your pattern the replacement may happen:
val positionOfReplacement = pattern.indexOfSlice(replacement)
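Then, you modify the replacement part of the algorithm like this:
if (compare(pattern, unprocessed)) {
if (positionOfReplacement > 0) {
// Drop the matched pattern and push the replacement back onto the input
unprocessed = replacement ::: (unprocessed drop patternLength)
// processed is kept in reverse order, so restore the original order
// of the elements that are re-queued for matching
unprocessed = (processed take positionOfReplacement).reverse ::: unprocessed
processed = processed drop positionOfReplacement
} else {
processed :::= reversedReplacement
unprocessed = unprocessed drop patternLength
}
} else {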
This will backtrack enough to solve the problem.
This algorithm won't deal efficiently, however, with multiple patterns at the same time, which I guess is where you are going. For that, you'll probably need some adaptation of KMP to do it efficiently, or otherwise use a DFA to control possible matchings. It gets even worse if you want to match both AB and ABC.
In practice, the full-blown problem is equivalent to regex match & replace, where the replacement is a function of the match. This means, of course, that you may want to start looking into regex algorithms.
EDIT
I was forgetting to complete my reasoning. If that technique doesn't work for some reason, then my advice is going with an immutable tree-based vector. Tree-based vectors enable replacement of partial sequences with low amount of copying.
And if that doesn't work either, then the solution is a doubly linked list. Pick one from a library with slice replacement, otherwise you may end up spending way too much time debugging a known but tricky algorithm.

Resources