sequence (XPath 2.0) vs nodeSet (XPath 1.0) - xpath

Why was the concept of a node-set replaced by sequence in XPath 2.0, and for what reason? What problems were there with node-sets? What advantages does a sequence have over a node-set?
I would say that:
A node-set contains zero or more nodes; no node can appear in the node-set more than once (that is, no duplicates are possible), and the nodes are not in any particular order.
and
A sequence, by contrast, allows a node to appear more than once (duplicates are permitted), and the nodes in the sequence are in a particular order; in addition, sequences can contain nodes, atomic values, or any mixture of the two.

Firstly, the only kind of collection allowed in XPath 1.0 was a collection of nodes. XPath 2.0 also allows collections (sequences) of strings, numbers, and so on. Without this, functions such as tokenize() or string-to-codepoints() would be impossible.
Secondly, having only sets rather than sequences means you can't do things like binding a variable to the result of a sort operation.
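To make the contrast concrete, here is a small illustration in Ruby (not XPath; the variable names are mine): a node-set behaves like an unordered, duplicate-free set, while a sequence is an ordered list that may mix items with atomic values.

```ruby
require "set"

# XPath 1.0 node-set semantics: duplicates collapse, order is not significant.
node_set = Set.new(["a", "b", "a"])

# XPath 2.0 sequence semantics: order is kept, duplicates are allowed,
# and atomic values (here an Integer) can sit alongside other items.
sequence = ["a", "b", "a", 42, "b"]

# Operations like sorting only make sense on an ordered sequence:
sorted_strings = sequence.grep(String).sort
```

The set silently drops the second "a" and gives no guarantee about order, which is exactly why a sorted result could not be represented as an XPath 1.0 node-set.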

Related

Implementing the Rope data structure using binary search trees (splay trees)

In a standard implementation of the Rope data structure using splay trees, the nodes would be ordered according to a rank statistic measuring the position of each one from the start of the string, so the keys normally found in a binary search tree would be irrelevant, would they not?
I ask because the keys shown in the graphic below (thanks Wikipedia!) are letters, which would presumably become non-unique once the number of nodes exceeded the length of the chosen alphabet. Wouldn't it be better to use integers or avoid using keys altogether?
Separately, can anyone point me to a good implementation of the logic to recompute rank statistics after each operation?
Presumably, if the index for a split falls within the substring attached to a particular node, say, between "Hel" and "llo_" on the node E above, you would remove the substring from E, split it and reattach it as two children of E. Correct?
Finally, after a certain number of such operations, the tree could, I suppose, end up with as many leaves as letters. What would be the best way to keep track of that and prune the tree (by combining substrings) as necessary?
Thanks!
For what it's worth, you can implement a Rope using Splay Trees by attaching a substring to each node of the binary search tree (not just to the leaf nodes as shown above).
The rank of each node is its size plus the size of its left subtree. But when recomputing ranks during splay operations, you need to remember to walk down the node.left.right branch, too.
If each node records a reference to the substring it represents (rather than the actual substring itself), everything runs faster. That way, when a split operation falls within an existing node, you just need to modify the node's attributes to reflect the right part of the substring you want to split, then add another node to represent the left part and merge it with the left subtree.
Done as above, each node records (in addition to its left, right and parent attributes etc.) its rank, size (in characters) and the location of the first character it represents in the string you're trying to modify. That way, you never actually modify the initial string: you just do your operations on bits of the tree and reproduce the final string when you're ready by walking it in order.
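A minimal sketch of that layout, assuming the Wikipedia example string (the struct and field names are mine, not from any standard library): each node stores an (offset, length) reference into the original string, and its rank is its own length plus the size of its left subtree.

```ruby
# Each node references (offset, length) into the original string rather
# than holding a substring copy.
RopeNode = Struct.new(:offset, :length, :left, :right) do
  # Total number of characters covered by this subtree.
  def size
    length + (left ? left.size : 0) + (right ? right.size : 0)
  end

  # Rank: own length plus the size of the left subtree.
  def rank
    length + (left ? left.size : 0)
  end
end

original = "Hello_my_name_is_Simon"
# Root covers "_my_" (offset 5, length 4); its left child covers "Hello",
# its right child covers "name_is_Simon".
root = RopeNode.new(5, 4,
                    RopeNode.new(0, 5, nil, nil),
                    RopeNode.new(9, 13, nil, nil))
```

A split then only rewrites offsets and lengths; the original string is never touched, and the final string is reproduced by an in-order walk.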

Traverse and reconstruct a Ruby object graph efficiently without recursion

I'm probably just trying to do something crazy here, so let me explain my use case first:
I've got an object graph in Ruby, consisting only of the basic JSON types (strings, numbers, arrays, hashes, trues, falses, nils). I'd like ultimately to serialize this graph to JSON.
The problem is that I don't have control over the origin of all objects in the graph. This means that some of the strings contained in the object graph might be tagged with the wrong encodings (for example, a string that's actually just a bunch of random garbage bytes ends up tagged with a UTF-8 encoding). This will cause the JSON serialization to fail (since JSON only supports UTF-8 encoded strings).
I have a strategy for handling these problematic strings, which basically consists of replacing them with a transformed version of each string (the exact transformation isn't really relevant).
In order to apply this transformation to strings, I need to walk the entire object graph and find all of them. This is trivial to implement recursively using standard depth-first search. One wrinkle is that I'd like to avoid mutating the original object graph or any strings therein, so I'm basically building a copy of the object graph as I traverse it (with only the non-problematic leaf nodes being referenced directly from the new graph, and all other nodes being duped).
This all works, and is reasonably efficient, save the duplication of non-leaf nodes in the transformed object graph. The problem is that it sometimes gets fed very deeply-nested object graphs, so the recursion will on occasion produce a SystemStackError.
I've implemented a non-recursive solution using DFS with a stack of Enumerator objects, but it seems to be dramatically slower than the recursive solution (presumably on account of the extra object allocations for the Enumerators and the silly StopIteration exceptions that get raised at the end of each Enumerator).
Breadth-first search seems inappropriate, because I don't think there's a way to determine the path back up to the root when visiting a given node, which I think I need in order to build a copy of the tree.
Am I wrong about BFS here? Are there other techniques that I could be using to accomplish this traversal without recursion? Is this all just loony?
Instead of using recursion you could use a stack explicitly; see here for more details:
Way to go from recursion to iteration
http://haacked.com/archive/2007/03/04/Replacing_Recursion_With_a_Stack.aspx/
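Sketched in Ruby under the question's assumptions (a JSON-ish graph of hashes, arrays, strings, numbers, booleans and nils; the `transform_strings` name is mine, and the block stands in for whatever string transformation you apply), an explicit-stack version that copies the graph without recursion might look like this:

```ruby
# Return a copy of a JSON-like object graph in which every String has been
# replaced by fix.(string); containers are duped, other leaves are shared.
# Uses an explicit work stack, so arbitrarily deep graphs cannot raise
# SystemStackError.
def transform_strings(root, &fix)
  # Shallow-copy one node: containers become empty copies to be filled in,
  # strings are transformed, all other leaves are shared as-is.
  shallow = lambda do |node|
    case node
    when Hash   then {}
    when Array  then []
    when String then fix.call(node)
    else node
    end
  end

  result = shallow.call(root)
  stack = []
  stack << [root, result] if root.is_a?(Hash) || root.is_a?(Array)

  until stack.empty?
    orig, dup = stack.pop
    # Normalize both container kinds to [key, value] pairs.
    pairs = orig.is_a?(Hash) ? orig.to_a : orig.each_with_index.map { |v, i| [i, v] }
    pairs.each do |key, value|
      child = shallow.call(value)
      dup[key] = child
      stack << [value, child] if value.is_a?(Hash) || value.is_a?(Array)
    end
  end
  result
end
```

Each stack entry pairs an original container with its partially built copy, so no path back to the root (and no Enumerator allocation) is ever needed.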

How to match a tree against a large set of patterns?

I have a potentially infinite set of symbols: A, B, C, ... There is also a distinct special placeholder symbol ? (its meaning will be explained below).
Consider non-empty finite trees such that every node has a symbol attached to it and 0 or more non-empty sub-trees. The order of sub-trees of a given node is significant (so, for example, if there is a node with 2 sub-trees, we can distinguish which one is left and which one is right). Any given symbol can appear in the tree 0 or more times attached to different nodes. The placeholder symbol ? can be attached only to leaf nodes (i.e. nodes having no sub-trees). It follows from the usual definition of a tree that trees are acyclic.
The finiteness requirement means that the total number of nodes in a tree is a positive finite integer. It follows that the total number of attached symbols, the tree depth and the total number of nodes in every sub-tree are all finite.
Trees are given in a functional notation: a node is represented by the symbol attached to it and, if there are any sub-trees, it is followed by parentheses containing a comma-separated list of sub-trees represented recursively in the same way. So, for example, the tree

    A
   / \
  ?   B
     / \
    A   C
   /|\
  A C Q
       \
        ?

is represented as A(?,B(A(A,C,Q(?)),C)).
I have a pre-calculated unchanging set of trees S that will be used as patterns to match. The set will typically have ~10^5 trees, and each of its elements will typically have ~10-30 nodes. I can use plenty of time to create beforehand any representation of S that best suits my problem stated below.
I need to write a function that accepts a tree T (typically with ~10^2 nodes) and checks as fast as possible if T contains as a subtree any element of S, provided that any node with placeholder symbol ? matches any non-empty subtree (both when it appears in T or in an element of S).
Please suggest a data structure to store the set S and an algorithm to check for a match. Any programming language or pseudo-code is OK.
This paper describes a variant of the Aho–Corasick algorithm, where instead of using a finite state machine (which the standard Aho–Corasick algorithm uses for string matching) the algorithm instead uses a pushdown automaton for subtree matching. Like the Aho-Corasick string-matching algorithm, their variant only requires one pass through the input tree to match against the entire dictionary of S.
The paper is quite complex - it may be worth it to contact the author to see if he has any source code available.
What you need is a finite state machine that tracks the set of potential matches you might have.
In essence, such a machine is the result of matching the patterns against each other, and determining what part of the individual matches they share. This is analogous to how lexers take sets of regular expressions for tokens and compose them into a large FSA that can match any of the regular expressions by processing characters one at a time.
You can find references to methods for doing this under term rewriting systems.
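For comparison with the automaton-based approaches above, the naive baseline (try every pattern at every node of T) is easy to state. Here trees are nested Ruby arrays of the form [symbol, subtree, subtree, ...], and "?" matches any whole subtree whether it appears in the pattern or in T; the function names are mine.

```ruby
# Does the pattern match the subtree rooted at node? Iterative, so deep
# trees cannot overflow the call stack.
def subtree_match?(pattern, node)
  stack = [[pattern, node]]
  until stack.empty?
    p, n = stack.pop
    next if p[0] == "?" || n[0] == "?"   # placeholder matches any subtree
    return false unless p[0] == n[0] && p.length == n.length
    (1...p.length).each { |i| stack << [p[i], n[i]] }
  end
  true
end

# Does any pattern in the set match at any node of the tree?
def contains_any?(tree, patterns)
  todo = [tree]
  until todo.empty?
    node = todo.pop
    return true if patterns.any? { |p| subtree_match?(p, node) }
    (1...node.length).each { |i| todo << node[i] }
  end
  false
end
```

This is O(total pattern size × |T|) in the worst case; the point of the pushdown-automaton and FSM constructions described above is to match the whole dictionary in a single pass over T instead.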

Why do we need a sentinel character in a Suffix Tree?

Why do we need to append "$" to the original string when we implement a suffix tree?
There can be special reasons for appending one (or even more) special characters to the end of the string when specific construction algorithms are used – both in the case of suffix trees and suffix arrays.
However, the most fundamental underlying reason in the case of suffix trees is a combination of two properties of suffix trees:
Suffix trees are PATRICIA trees, i.e. the edge labels are, unlike the edge labels of tries, strings consisting of one or more characters
Internal nodes exist only at branching points
This means you can potentially have a situation where one edge label is a prefix of another:
The idea here is that the black node on the right is a leaf node, i.e. a suffix ends here. But if the text has a suffix aa, then the single character a must also be a suffix. But there is no way for us to store the information that a suffix ends after the first a, because aa forms one continuous edge of the tree (property 1 above). We would have to introduce an intermediate node in which we could store the information, like this:
But this would be illegal because of property 2: No inner node must exist unless there is a branching point.
The problem is solved if we can guarantee that the last character of the text is a character that occurs nowhere else in the entire string. The dollar sign is normally used as a symbol for that.
Clearly, if the last character occurs nowhere else, there can't possibly be any repetition (such as aa, or even a more complex one like abcabc) at the end of the string, hence the problem of non-branching inner nodes does not occur. In the example above, the effect of putting $ at the end of the string is:
There are three suffixes now: aa$, a$ and $, but none is a prefix of another. Obviously, this means we need to introduce an inner node after all, and there are a total of three leaves now. So, to be sure, the advantage of this is not that we save space or anything becomes more efficient. It's just a way to guarantee the two properties above. These properties are important when we prove certain useful characteristics of suffix trees, including the fact that its number of inner nodes is linear in the length of the string (you could not prove this if non-branching inner nodes were allowed).
This also means that in practice, you might use different ways of dealing with suffixes that are prefixes of other suffixes, and with non-branching inner nodes. For example, if you use the well-known Ukkonen algorithm to construct the tree, you can do that without appending a unique character to the end; you just have to make sure that at the end, after the final iteration, you put non-branching inner nodes to the end of every implicit suffix (i.e. every suffix that ends in the middle of an edge).
Again, there can be further, and very specific reasons for putting $ at the end of text before constructing a suffix tree or array. For example, in construction algorithms for suffix arrays that are based on the DC (difference cover) principle, you must put two $ signs to the end of the string to ensure that even the last character of the string is part of a complete character trigram, as the algorithm is based on trigram sorting. Furthermore, there are specific situations when the unique $ character must be interpreted in a special way. For the Ukkonen construction algorithm, it is sufficient for $ to be unique; for the DC suffix array algorithms it is necessary, in addition to uniqueness, that $ is lexicographically smaller than all other characters, and in the suffix-tree based circular string cutting algorithm (mentioned recently here) it is actually necessary to interpret $ as the lexicographically largest character.
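The prefix property discussed above is easy to verify mechanically; a tiny sketch (plain Ruby, the function name is mine):

```ruby
# True if some suffix of s is a proper prefix of another suffix of s -
# i.e. the situation where a suffix would end in the middle of an edge.
def suffix_is_prefix_of_another?(s)
  sufs = (0...s.length).map { |i| s[i..-1] }
  sufs.combination(2).any? { |a, b| a.start_with?(b) || b.start_with?(a) }
end
```

Appending a character that occurs nowhere else makes every suffix end in that character, so no suffix can be a proper prefix of another.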
I suspect that it is for traversal purposes. When you are generating something from the suffix tree you need to know if you are at a node where the string finishes or not; if not, then you know that you have to keep going. Looking at the longest common substring problem, to which a suffix tree provides a linear-time solution, you need the $ sentinels to determine that you've arrived at a node where the string terminates. You can't finish after A-NA.
from Wikipedia
1. NOT A PATRICIA
A suffix tree is NOT a PATRICIA tree, which is radix 2. A suffix tree node may have 2 or MORE children.
2. NO VALID REASON TODAY
There is no reason to add a special character other than
the requirement to have 2 or more children per internal node
the requirement to have exactly n leaves for a string of n characters
A suffix tree can be implemented the same way as a compressed trie (a radix tree is one kind of these), without any special symbols, and there are no functional disadvantages in this case.
3. OLD TRAILS
If you look into the old book from 1973, you'll see a structure very similar to a trie, which is named an "uncompressed suffix tree", but with values and termination symbols. Then they compact it.
4. BUT WHAT'S DIFFERENT?
Prefix and suffix trees both have metadata in nodes, right? This is implemented as the value of the node.
But with a suffix tree we've got one interesting requirement: we need to keep the index of each suffix. So, in the last node we have to keep TWO metadata fields, TWO values. And you need to keep nodes of the same size, byte-to-byte. SO THEY DID IT THROUGH AN ADDITIONAL NODE, AN END NODE
In the modern world, you can keep as many fields as you want; you are not going to save each and every byte spent, so you don't need this trick.
5. SO, DO WE HAVE A REASON FOR AN END SYMBOL?
Yes, potentially we have a non-functional reason: saving a few bytes in each non-leaf node.
6. STILL ... ANY FUNCTIONAL REASON FOR END SYMBOL?
Yes, we may have one case where end symbolS are useful: the GENERALIZED suffix tree, not just a suffix tree.
A generalized suffix tree requires different end markers, either in a collection on the node or as separate end symbols. Again, you can implement it with or without special symbols.
7. BOTTOMLINE
These requirements seem to be a legacy of old systems
Feel free to implement a suffix tree the same way as a compressed prefix tree; there are no caveats except a few bytes wasted in each node for an unused end-index flag
A generalized suffix tree is a structure where end symbolS may be useful (but you can still build it without them)
I hope this makes the situation clearer.

Data structure for range query

I was recently asked a coding question on the below problem.
I have some solution to this problem but I am not very sure if those are most efficient.
Problem:
Write a program to track a set of text ranges. Start point and end point will be strings.
Text range example : [AbA-Ef]
Aa would fall before this range
AB would fall inside this range
etc.
String comparison would be like 'A' < 'a' < 'B' < 'b' ... 'Z' < 'z'
We need to support following operations on this range
Add range - this should merge the ranges if applicable
Delete range - this deletes a range from the tracked ranges and recomputes the ranges
Query range - given a character, the function should return whether it is part of any of the tracked ranges or not.
Note that tracked ranges can be discontinuous.
My solutions:
I came up with two approaches.
Store ranges as a doubly linked list, or
Store ranges as some sort of balanced tree with leaf nodes holding the actual data, inter-connected as a linked list.
Do you think these solutions are good enough, or can you think of a better way of doing this so that those three APIs give the best performance?
You are probably looking for an interval tree.
Use the data structure with your custom comparator to indicate "What's on range", and you will be able to do the required operations efficiently.
Note that an interval tree is actually an efficient way to implement your 2nd idea (storing ranges as some sort of balanced tree).
I'm not clear on what the "delete range" operation is supposed to do. Does it:
Delete a previously inserted range, and recompute the merge of the remaining ranges?
Stop tracking the deleted range, regardless of how many times parts of it have been added?
That doesn't make a huge difference algorithmically; it's just bookkeeping. But it's important to clarify. Also, are the ranges closed or half-open? (Another detail which doesn't affect the algorithm but does affect the implementation).
The basic approach to this problem is to merge the tracked set into a sorted list of disjoint (non-overlapping) ranges; either as a vector or a binary search tree, or basically any structure which supports O(log n) searching.
One approach is to put both endpoints of every disjoint range into the data structure. To find out if a target value is in a range, find the index of the smallest endpoint greater than the target. If the index is odd the target is in some range; if even, it's outside.
Alternatively, index all the disjoint ranges by their start points; find the target by searching for the largest start-point not greater than the target, and then compare the target with the associated end-point.
I usually use the first approach with sorted vectors, which are plausible if (a) space utilization is important and (b) insert and merge are relatively rare. With binary search trees, I go for the second approach. But they differ only in details and constants.
Merging and deleting are not difficult, but there are an annoying number of cases. You start by finding the ranges corresponding to the endpoints of the range to be inserted/deleted (using the standard find operation), remove all the ranges in between the two, and fiddle with the endpoints to correct the partially overlapping ranges. While the find operation is always O(log n), the tree/vector manipulation is O(n) (if the inserted/deleted range is large, anyway).
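The endpoint-array query can be sketched as follows (closed ranges, plain String ordering; the question's custom 'A' < 'a' < 'B' collation is omitted for brevity, and the class name is mine):

```ruby
# Disjoint, sorted, closed ranges stored as a flat array of endpoints
# [s1, e1, s2, e2, ...]. A point lies inside some range iff the first
# endpoint >= point sits at an odd index (it is a range's end), or
# equals the point exactly (the point is a range's start).
class RangeSet
  def initialize(pairs) # pairs: already-merged, sorted [[start, end], ...]
    @eps = pairs.flatten
  end

  def include?(x)
    i = @eps.bsearch_index { |ep| ep >= x } # O(log n) binary search
    !i.nil? && (i.odd? || @eps[i] == x)
  end
end
```

Add and delete then reduce to splicing endpoints into or out of this array while keeping it sorted, which is the O(n) manipulation mentioned above.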
Most languages, including Java and C++, have some sort of ordered map or ordered set in which you can both look up a value and find the next value after or the first value before a given value. You could use this as a building block: if it contains a set of disjoint ranges, then it will hold the least element of a range followed by the greatest element of that range, followed by the least element of the next range, and so on. When you add a range you can check whether you have preserved this property; if not, you need to merge ranges. Similarly, you want to preserve this when you delete. Then you can query by just looking to see whether there is a least element just before your query point and a greatest element just after.
If you want to create your own data structure from scratch, I would think about some sort of radix trie structure, because this avoids doing lots of repeated string comparisons.
I think you would go for a B+ tree; it's the same as what you have mentioned as your second approach.
Here are some properties of B+ tree:
All data is stored in leaf nodes.
Every leaf is at the same level.
All leaf nodes have links to other leaf nodes.
Here are a few applications of B+ trees:
It reduces the number of I/O operations required to find an element in the tree.
Often used in the implementation of database indexes.
The primary value of a B+ tree is in storing data for efficient retrieval in a block-oriented storage context — in particular, file systems.
NTFS uses B+ trees for directory indexing.
Basically it helps with range-query lookups and minimizes tree traversal.
