Xquery for determining all nodes are unique - xpath

Assuming a document such as:
<a>
<b>TEST1</b>
<b>TEST2</b>
<b>TEST1</b>
</a>
Is there any XQuery one-liner you can use to check that all values of b are unique without having to run a for-each over the doc? For example, for the above document it would return false, whereas for the document below it would return true:
<a>
<b>TEST1</b>
<b>TEST2</b>
<b>TEST3</b>
</a>

A similar approach, but using empty(), might be a bit more efficient:
empty(/a/b[. = following-sibling::b])
empty() returns true if the parameter expression yields an empty sequence, and false otherwise. So in this case, if any b has a following sibling b with the same value (a.k.a. a dupe), empty() will return false.

You can look for a sibling with the same value, e.g.:
count(b[preceding-sibling::b = .])
or to get true or false:
not(b[preceding-sibling::b = .])
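The not(...) form is plain XPath 1.0, so it can be tried without an XQuery processor. Below is a minimal sketch using the JDK's built-in javax.xml.xpath API; the class name UniqueValuesCheck and the absolute path /a/b are my own choices for the example, not part of the answer above.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
class UniqueValuesCheck {
    // Returns true when no <b> has an earlier sibling <b> with the same string value.
    static boolean allUnique(String xml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return (Boolean) xpath.evaluate(
                    "not(/a/b[preceding-sibling::b = .])",
                    new InputSource(new StringReader(xml)),
                    XPathConstants.BOOLEAN);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        // Prints "false" (TEST1 appears twice), then "true".
        System.out.println(allUnique("<a><b>TEST1</b><b>TEST2</b><b>TEST1</b></a>"));
        System.out.println(allUnique("<a><b>TEST1</b><b>TEST2</b><b>TEST3</b></a>"));
    }
}
```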

The most efficient is probably
count(b) = count(distinct-values(b))
The other solutions you have been given are likely to be quadratic in the number of test elements, whereas this is likely to be O(n log n).
(However, this is on the assumption that duplicates are rare. If duplicates are very common, then some kind of fold operation might find them faster, especially with XQuery 3.0).
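To make the trade-off concrete outside XQuery, here is a rough Java sketch of both ideas. It uses a hash set rather than whatever an XQuery processor does internally, and the class and method names are invented for the example. The first method mirrors count(b) = count(distinct-values(b)) and always visits every value; the second is the fold-style variant that stops at the first duplicate.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper class, for illustration only.
class DistinctCheck {
    // Compare the number of values with the number of distinct values.
    static boolean allUniqueByCount(List<String> values) {
        return new HashSet<>(values).size() == values.size();
    }

    // Fold-style variant: Set.add returns false for a value already seen,
    // so the loop can stop at the first duplicate.
    static boolean allUniqueEarlyExit(List<String> values) {
        Set<String> seen = new HashSet<>();
        for (String value : values) {
            if (!seen.add(value)) {
                return false;
            }
        }
        return true;
    }
}
```

When duplicates are rare both methods do about the same work; the early exit only pays off when a duplicate tends to show up near the start.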

Related

Xquery not returning desired values

I am trying to return a certain set of values, but the query is not quite returning what I would like. I would like to return records by the author "Hennie J. Steenhagen" grouped by year. However, what it returns is every record from any year in which Hennie has a record, not only Hennie's.
For example, if we have the record <www><author>Hennie*</author><year>1990</year></www> and <www><author>Derpie</author><year>1990</year></www> the query will return both records grouped in the year 1990. I would like only Hennie's to be returned.
for $y in /*/*/year where $y/../author ="Hennie J. Steenhagen" return <year-Pub>{$y}{/*/*[year = $y]}</year-Pub>
Your question is quite difficult to understand because your XPath addresses a larger XML node tree than the example XML you have provided. However, for the example I will assume that your records are wrapped in an element named record. Also, the output of your XPath does not make a lot of sense to me, but I will assume that you know what you want!
Given the XML:
<record>
<www>
<author>Hennie J. Steenhagen</author>
<year>1990</year>
</www>
and
<www>
<author>Derpie</author>
<year>1990</year>
</www>
</record>
If you have an XQuery 3.0 processor, you could use the following:
/record/www[author = "Hennie J. Steenhagen"] ! <year-Pub>{year}{.}</year-Pub>
If you only have access to an XQuery 1.0 processor, then you could fall-back to the following:
for $w in /record/www[author = "Hennie J. Steenhagen"]
return
<year-Pub>{$w/year}{$w}</year-Pub>
Both of my examples only use a single predicate which will only filter the data once. Whereas your self-found solution uses both a predicate and a where expression, and so has to filter the data twice.
Fixed it,
for $y in /*/*/year where $y/../author ="Hennie J. Steenhagen" and /*/*[year=$y] return <year-Pub>{$y/../*}</year-Pub>
Thanks to anyone who spent their time looking.

Need some explanation about getting max in XPath

I'm kinda new to XPath and I've found that to get the max attribute number I can use the following statement: //Book[not(@id > //Book/@id)] and it works quite well.
I just can't understand why it returns the max id instead of the min id, because it looks like I'm checking whether the id of a node is greater than any other node's id and then returning a Book where it's not.
I'm probably stupid, but, please, someone, explain :)
You're not querying for maximum values, but for minimum values. Your query
//Book[not(@id > //Book/@id)]
could be translated to natural language as "Find all books which do not have an @id that is larger than any other book's @id". You probably want to use
//Book[not(@id < //Book/@id)]
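For arbitrary input, several books could share the maximum value, and this query would return all of them. Switching to <= would not help: //Book/@id includes the book's own @id, so the negated comparison would never hold and nothing would be returned. As @ids must be unique, this does not matter here.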
Be aware that //Book[@id > //Book/@id] is not equal to the query above, although math would suggest so. XPath's comparison operators adhere to a kind of set semantics: if any value on the left side is larger than any value on the right side, the predicate is true; thus it would include all books but the one with the minimum @id value.
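The difference is easy to check with the XPath 1.0 engine that ships with the JDK. This is only a sketch: the three-book sample document and the class name MaxIdDemo are made up for the illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
class MaxIdDemo {
    // Made-up sample data with unique ids.
    static final String BOOKS =
            "<Books><Book id=\"1\"/><Book id=\"2\"/><Book id=\"3\"/></Books>";

    // Evaluates expr against xml and collects the id attribute of each selected Book.
    static List<String> selectedIds(String xml, String expr) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            NodeList nodes = (NodeList) xpath.evaluate(
                    expr, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            List<String> ids = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                ids.add(((Element) nodes.item(i)).getAttribute("id"));
            }
            return ids;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(selectedIds(BOOKS, "//Book[not(@id > //Book/@id)]")); // [1], the minimum
        System.out.println(selectedIds(BOOKS, "//Book[not(@id < //Book/@id)]")); // [3], the maximum
        System.out.println(selectedIds(BOOKS, "//Book[@id > //Book/@id]"));      // [2, 3], all but the minimum
    }
}
```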
Besides the XPath 1.0 approach, in XPath 2.0 you can use max():
/Books/Book[id = max(../Book/id)]
The EXSLT math:max function returns the maximum value of the nodes passed as the argument. The maximum value is defined as follows. The node set passed as an argument is sorted in descending order as it would be by xsl:sort with a data type of number. The maximum is the result of converting the string value of the first node in this sorted list to a number using the number function.
If the node set is empty, or if the result of converting the string values of any of the nodes to a number is NaN, then NaN is returned.
The math:max template returns a result tree fragment whose string value is the result of turning the number returned by the function into a string.

OR between two function call

What is the meaning of a || between two function calls,
like
{
//some code
return Find(n.left,req)||Find(n.right,req);
}
http://www.careercup.com/question?id=7560692
Can someone help me understand? Many thanks in advance.
It means that it returns true if one of the two functions is true (or both of them).
Depending on the programming language, it first calls Find(n.left,req); if that is true, it returns true. If it is false, it calls Find(n.right,req) and returns its Boolean value.
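For context, a line like that usually sits at the end of a recursive tree search. The following is only a guess at the shape of such a function, written in Java; the Node class, the int payload and the two early returns are my assumptions, since the original snippet elides that part.

```java
// Hypothetical reconstruction, for illustration only.
class TreeSearch {
    // Minimal binary tree node; not taken from the original question.
    static final class Node {
        final int value;
        final Node left;
        final Node right;

        Node(int value, Node left, Node right) {
            this.value = value;
            this.left = left;
            this.right = right;
        }
    }

    // Returns true if req occurs anywhere in the tree rooted at n.
    static boolean find(Node n, int req) {
        if (n == null) {
            return false;          // empty subtree: nothing to find
        }
        if (n.value == req) {
            return true;           // found at this node
        }
        // The right subtree is searched only if the left search returned false.
        return find(n.left, req) || find(n.right, req);
    }
}
```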
In Java (and C and C#) || means "lazy or". The single stroke | also means "or", but operates slightly differently.
To calculate a||b, the computer calculates the truth value (true or false) of a, and if a is true it returns true without bothering to calculate b, hence the word "lazy". Only if a is false will it check b to see if it is true (and so whether a or b is true).
To calculate a|b, the computer works out the value of a and b first, then "ors" the answers together.
The "lazy or" || is more efficient, because it sometimes does not need to calculate b at all. The only reason you might want to use a|b is if b is actually a function (method) call, and because of some side effect of the method you want to be sure it executes exactly once. I personally consider this a poor programming technique, and on the very few occasions that I want b to always be calculated I do this explicitly and then use a lazy or.
E.g. consider a function or method foo() which returns a boolean. Instead of
boolean x = a|foo(something);
I would write
boolean c=foo(something);
boolean x = a||c;
Which explicitly calls foo() exactly once, so you know what is going on.
Much better programming practice, IMHO. Indeed the best practice would be to eliminate the side effect in foo() entirely, but sometimes that's hard to do.
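If you want to see the difference for yourself, a counter makes it visible. This is a small sketch; foo() and the calls counter are invented for the demonstration.

```java
// Hypothetical demo class, for illustration only.
class ShortCircuitDemo {
    static int calls = 0;

    // A method with a side effect: it counts how often it runs.
    static boolean foo() {
        calls++;
        return true;
    }

    // Number of times foo() runs when evaluating a || foo().
    static int callsWithLazyOr(boolean a) {
        calls = 0;
        boolean ignored = a || foo();
        return calls;
    }

    // Number of times foo() runs when evaluating a | foo().
    static int callsWithEagerOr(boolean a) {
        calls = 0;
        boolean ignored = a | foo();
        return calls;
    }

    public static void main(String[] args) {
        System.out.println(callsWithLazyOr(true));   // 0: foo() is skipped
        System.out.println(callsWithLazyOr(false));  // 1
        System.out.println(callsWithEagerOr(true));  // 1: foo() always runs
        System.out.println(callsWithEagerOr(false)); // 1
    }
}
```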
If you are using lazy or || think about the order you evaluate it in. If a is easy to calculate and usually true, a||b will be more efficient than b||a, as most of the time a simple calculation for a is all that is needed. Conversely if b is usually false and is difficult to calculate, b||a will not be much more efficient than a|b. If one of a or b is a constant and the other a method call, you should have the constant as the first term a||foo() rather than foo()||a as a method call will always be slower than using a simple local variable.
Hope this helps.
Peter Webb
return Find(n.left,req)||Find(n.right,req);
means: execute the first find, Find(n.left,req), and return true if it returns true; otherwise
execute the second find and return true if it returns true, otherwise false.

sorting using weird groovy code

I'm a beginner at Groovy and I can't seem to understand this code. Can you please tell me how this code operates?
def list = [ [1,0], [0,1,2] ]
list = list.sort { a,b -> a[0] <=> b[0] }
assert list == [ [0,1,2], [1,0] ]
What I know is that the second line should return the value of 1 because of the spaceship operator, but what is the use of that? And what type of sort is this? (There are 6 sort methods in the GDK API and I'm not really sure which one is used here.)
The code is using Collection#sort(Closure). Notice that this method has two variants:
If the closure is binary (i.e. it takes two parameters), sort uses it as the typical comparator interface: it should return a negative integer, zero or a positive integer when the first parameter is less than, equal to, or greater than the second parameter respectively.
This is the variant that is being used in that piece of code. It is comparing the elements of the list, which are, in turn, lists, by their first element.
If the closure is unary (i.e. it takes only one parameter) it is used to generate the values that are then going to be used for comparison (in some languages this is called a "key" function).
Therefore, the snippet of code you posted can be rewritten as:
def list = [[1,0], [0,1,2]]
list = list.sort { it[0] } // or { it.first() }
assert list == [[0,1,2], [1,0]]
Notice that using this unary-closure variant is very convenient when you want to compare the elements by some value or some "weight" that is calculated the same way for every element.
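If you are more used to Java, the two variants correspond roughly to passing a hand-written comparator versus building one from a key function with Comparator.comparing. This is an analogy and not what Groovy does internally; the class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper class, for illustration only.
class SortVariants {
    // Binary variant: the lambda compares two elements directly,
    // like the Groovy closure { a, b -> a[0] <=> b[0] }.
    static List<List<Integer>> sortWithComparator(List<List<Integer>> input) {
        List<List<Integer>> copy = new ArrayList<>(input);
        copy.sort((a, b) -> Integer.compare(a.get(0), b.get(0)));
        return copy;
    }

    // Unary variant: the lambda extracts a key and the library compares the keys,
    // like the Groovy closure { it[0] }.
    static List<List<Integer>> sortWithKey(List<List<Integer>> input) {
        List<List<Integer>> copy = new ArrayList<>(input);
        copy.sort(Comparator.comparing((List<Integer> inner) -> inner.get(0)));
        return copy;
    }
}
```

Both methods turn [[1, 0], [0, 1, 2]] into [[0, 1, 2], [1, 0]], matching the assert in the Groovy snippet.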
The sort in your code snippet uses the comparator argument method call - see http://groovy.codehaus.org/groovy-jdk/java/util/Collection.html#sort(java.util.Comparator)
So, you are sorting the collection using your own comparator. Now the comparator simply uses the first element of the inner collection to decide the order of the outer collection.

An efficient technique to replace an occurrence in a sequence with mutable or immutable state

I am searching for an efficient technique to find a sequence of Op occurrences in a Seq[Op]. Once an occurrence is found, I want to replace the occurrence with a defined replacement and run the same search again until the list stops changing.
Scenario:
I have three types of Op case classes. Pop() extends Op, Push() extends Op and Nop() extends Op. I want to replace the occurrence of Push(), Pop() with Nop(). Basically the code could look like seq.replace(Push() ~ Pop() ~> Nop()).
Problem:
Now that I call seq.replace(...) I will have to search in the sequence for an occurrence of Push(), Pop(). So far so good. I find the occurrence. But now I will have to splice the occurrence from the list and insert the replacement.
Now there are two options. My list could be mutable or immutable. If I use an immutable list I am scared regarding performance because those sequences are usually 500+ elements in size. If I replace a lot of occurrences like A ~ B ~ C ~> D ~ E I will create a lot of new objects, if I am not mistaken. However, I could also use a mutable sequence like ListBuffer[Op].
Basically from a linked-list background I would just do some pointer-bending and after a total of four operations I am done with the replacement without creating new objects. That is why I am now concerned about performance. Especially since this is a performance-critical operation for me.
Question:
How would you implement the replace() method in a Scala fashion and what kind of data structure would you use keeping in mind that this is a performance-critical operation?
I am happy with answers that point me in the right direction or pseudo code. No need to write a full replace method.
Thank you.
Ok, some considerations to be made. First, recall that, on lists, tail does not create objects, and prepending (::) only creates one object for each prepended element. That's pretty much as good as you can get, generally speaking.
One way of doing this would be this:
def myReplace(input: List[Op], pattern: List[Op], replacement: List[Op]): List[Op] = {
  require(pattern.nonEmpty, "pattern must not be empty")
  // This function should be part of a KMP algorithm instead, for performance.
  // It checks whether list starts with pattern.
  def compare(pattern: List[Op], list: List[Op]): Boolean = (pattern, list) match {
    case (x :: xs, y :: ys) if x == y => compare(xs, ys)
    case (Nil, _) => true // whole pattern consumed, so list starts with it
    case _ => false
  }
  var processed: List[Op] = Nil // kept in reverse order
  var unprocessed: List[Op] = input
  val patternLength = pattern.length
  val reversedReplacement = replacement.reverse
  // Do this until we finish processing the whole sequence
  while (unprocessed.nonEmpty) {
    // This inside algorithm would be better if replaced by KMP
    // Quickly process non-matching sequences
    while (unprocessed.nonEmpty && unprocessed.head != pattern.head) {
      processed ::= unprocessed.head
      unprocessed = unprocessed.tail
    }
    if (unprocessed.nonEmpty) {
      if (compare(pattern, unprocessed)) {
        processed :::= reversedReplacement
        unprocessed = unprocessed drop patternLength
      } else {
        processed ::= unprocessed.head
        unprocessed = unprocessed.tail
      }
    }
  }
  processed.reverse
}
You may gain speed by using KMP, particularly if the pattern searched for is long.
Now, what is the problem with this algorithm? The problem is that it won't test if the replaced pattern causes a match before that position. For instance, if I replace ACB with C, and I have an input AACBB, then the result of this algorithm will be ACB instead of C.
To avoid this problem, you should create a backtrack. First, you check at which position in your pattern the replacement may happen:
val positionOfReplacement = pattern.indexOfSlice(replacement)
Then, you modify the replacement part of the algorithm like this:
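if (compare(pattern, unprocessed)) {
  if (positionOfReplacement > 0) {
    // Drop the matched pattern and put the replacement back so it is scanned again
    unprocessed = replacement ::: (unprocessed drop patternLength)
    // Move the last positionOfReplacement processed elements back as well;
    // processed is kept in reverse order, so restore the input order first
    unprocessed :::= (processed take positionOfReplacement).reverse
    processed = processed drop positionOfReplacement
  } else {
    processed :::= reversedReplacement
    unprocessed = unprocessed drop patternLength
  }
} else {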
This will backtrack enough to solve the problem.
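As a cross-check of the AACBB example, here is a rough sketch of a different way to get the same backtracking effect, written in Java with plain characters standing in for Op values. It is not the Scala code above: it keeps the already-scanned output in a buffer, tests after every step whether the buffer ends with the pattern, and on a match pushes the replacement back onto the pending input so it is scanned again. The names are made up, and the sketch assumes a non-empty pattern and a replacement shorter than the pattern, which guarantees termination.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch, for illustration only.
class RewriteSketch {
    static String rewrite(String input, String pattern, String replacement) {
        if (pattern.isEmpty() || replacement.length() >= pattern.length()) {
            throw new IllegalArgumentException(
                    "sketch assumes a non-empty pattern and a shorter replacement");
        }
        Deque<Character> pending = new ArrayDeque<>(); // input still to be scanned
        for (char c : input.toCharArray()) {
            pending.addLast(c);
        }
        StringBuilder out = new StringBuilder();       // scanned output so far
        while (!pending.isEmpty()) {
            out.append(pending.pollFirst());
            int start = out.length() - pattern.length();
            if (start >= 0 && out.substring(start).equals(pattern)) {
                out.setLength(start);                  // remove the matched pattern
                // Push the replacement back in front, last character first,
                // so it is rescanned together with what is already in out.
                for (int i = replacement.length() - 1; i >= 0; i--) {
                    pending.addFirst(replacement.charAt(i));
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(rewrite("AACBB", "ACB", "C")); // prints C
    }
}
```

Because a match is detected at the end of the scanned output, a replacement that completes a pattern started earlier is picked up without an explicit backtrack offset.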
This algorithm won't deal efficiently, however, with multiple patterns at the same time, which I guess is where you are going. For that, you'll probably need some adaptation of KMP to do it efficiently or, otherwise, use a DFA to control possible matchings. It gets even worse if you want to match both AB and ABC.
In practice, the full-blown problem is equivalent to regex match & replace, where the replacement is a function of the match. Which means, of course, you may want to start looking into regex algorithms.
EDIT
I forgot to complete my reasoning. If that technique doesn't work for some reason, then my advice is to go with an immutable tree-based vector. Tree-based vectors enable replacement of partial sequences with a low amount of copying.
And if that doesn't do, then the solution is doubly linked lists. And pick one from a library with slice replacement -- otherwise you may end up spending way too much time debugging a known but tricky algorithm.
