Confusion with the XPATH - Document Order


In his book, the author explains document order as follows:
In other words, document order simply refers to the order in which nodes appear in an XML document. There's no question about the order when you're dealing with elements that enclose other elements, for example, but when you're dealing with elements on the same level—sibling elements—document order specifies that they should be ordered as they were in the original XML document.
Here's one more thing to know about document order—attribute nodes are not in any special order, even in document order.
Now my questions are -
Why is the order of attributes not needed?
"There's no question about the order when you're dealing with elements that enclose other elements," - why ?

To elaborate on the previous answer and respond to the first comment, I think it all revolves around hierarchies. If an element contains other elements, the order is obvious because there is a hierarchy.
In the following example, a precedes b in document order.
<a>
<b/>
</a>
In the following example, b and c are siblings (both on the same level; children of a). It's not quite as obvious what the document order is for the siblings, but c precedes b in document order.
<a>
<c/>
<b/>
</a>
This can get confusing if the structure is complex. For example, in the following document d precedes b in document order even though d is further down the hierarchical tree (it is a child of b's sibling c).
<a>
<c>
<d/>
</c>
<b/>
</a>
The order of attributes is not needed because they don't represent a hierarchy. They just describe or further define the element. Think metadata. The only document ordering they really have is that an element's attributes precede any of that element's child elements. The relative order of the attributes is implementation-dependent.
For example, if you use the XPath /*/@*[1] on the following document:
<foo b="x" a="x"/>
you could either get the a attribute or the b attribute depending on how the implementation orders the attributes.
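The element examples in this answer can be checked mechanically. The following is a minimal sketch in Python, assuming the standard-library xml.etree.ElementTree parser, whose Element.iter() walks a tree in document order. The helper name document_order is made up for illustration.

```python
import xml.etree.ElementTree as ET


def document_order(xml_text):
    """Return the element names of xml_text in document order."""
    root = ET.fromstring(xml_text)
    # Element.iter() yields the element itself, then all of its
    # descendants, in document (depth-first) order.
    return [element.tag for element in root.iter()]


print(document_order("<a><b/></a>"))             # ['a', 'b']
print(document_order("<a><c/><b/></a>"))         # ['a', 'c', 'b']
print(document_order("<a><c><d/></c><b/></a>"))  # ['a', 'c', 'd', 'b']
```

In the last call d comes before b even though it sits one level deeper, which matches the third example.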

"There's no question about the order when you're dealing with elements that enclose other elements," - why ?
Well, there clearly is a question, because you've just asked it. It's just that the author, for some reason, thought the answer was obvious. The answer is that if A is an ancestor of B, then A precedes B in document order.
Why is the order of attributes not needed?
It's a design principle in XML that you shouldn't be using attributes if order is significant. That relates to the semantics of object modelling: attributes represent properties of an object that are independent and orthogonal. Like adjectives: saying something is a big red box means the same as saying it is a red big box. If not (as in "great white shark"), then the adjectives are not truly attributive qualifiers of the noun, and shouldn't be modelled in XML as attributes.

In order to support streaming, document order is the "obvious" choice, "no question about it", and in most cases the most useful one for XML data.
However, considering that an XML document can be represented as a tree structure, and that XPath operates on that tree, it is far from obvious in which order XPath results are presented. Tree traversal is an interesting subject with many variations, e.g. pre-order, in-order, breadth-first, and several others (see Wikipedia and other sources on "tree traversal").
So although the author's description is correct with respect to XML, he is glossing over a lot of things. Specifically regarding XPath, which operates on the tree, the OP's question is perfectly valid and not obvious, but it is already well answered.
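To make the point about traversal variants concrete, here is a small sketch in Python (standard library only; the function names preorder and breadth_first are illustrative). It walks the same tree in two ways. Only the pre-order walk reproduces document order.

```python
import xml.etree.ElementTree as ET
from collections import deque


def preorder(root):
    """Depth-first pre-order: a node first, then each child's subtree in turn."""
    names = [root.tag]
    for child in root:
        names.extend(preorder(child))
    return names


def breadth_first(root):
    """Level by level: all nodes at depth 1, then depth 2, and so on."""
    names = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        names.append(node.tag)
        queue.extend(node)  # iterating an Element yields its children in order
    return names


tree = ET.fromstring("<a><c><d/></c><b/></a>")
print(preorder(tree))       # ['a', 'c', 'd', 'b']  (document order)
print(breadth_first(tree))  # ['a', 'c', 'b', 'd']
```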

Related

Relation between two texts with different tags

I'm currently having a problem with the conception of an algorithm.
I want to create a WYSIWYG editor that goes along the current [bbcode] editor I have.
To do that, I use a div with contenteditable set to true for the WYSIWYG editor and a textarea containing the associated bbcode. Until there, no problem. But my concern is that if a user wants to add a tag (for example, the [b] tag), I need to know where they want to include it.
For that, I need to know exactly where in the bbcode I should insert the tags. I thought of comparing the two texts (one with html tags like <span>, the other with bbcode tags like [b]), and that's where I'm struggling.
I did some research but couldn't find anything that would help me, or I did not understand it correctly (maybe I searched for the wrong thing). What I could find is the Jaccard index, but I don't really know how to make it work correctly.
I also thought of another alternative. I could just take the code in the WYSIWYG editor before the cursor location, and split it every time I encounter a html tag. That way, I can, in the bbcode editor, search for the first occurrence, then search for the second occurrence starting at the last index found, and so on until I reach the place where the cursor is pointing at.
I'm not sure if it would work, and I find that solution a bit dirty. Am I totally wrong or should I do it this way?
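For what it's worth, the split-and-search idea described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name bbcode_index_for_cursor is invented, the HTML tag pattern is a simple regular expression, and it assumes the visible text is identical in both editors (no HTML entities, and no text chunk that also happens to match part of a bbcode tag).

```python
import re

HTML_TAG = re.compile(r"<[^>]*>")


def bbcode_index_for_cursor(html_before_cursor, bbcode):
    """Map the cursor position in the WYSIWYG HTML to an index in the bbcode.

    The HTML before the cursor is split on tags; each text chunk is then
    searched for in the bbcode, starting where the previous chunk ended.
    """
    position = 0
    for chunk in HTML_TAG.split(html_before_cursor):
        if not chunk:
            continue  # nothing between two adjacent tags
        found = bbcode.find(chunk, position)
        if found == -1:
            raise ValueError(f"text {chunk!r} not found in bbcode")
        position = found + len(chunk)
    return position


html = 'Hello <span style="font-weight: bold">bold</span> wor'
bbcode = "Hello [b]bold[/b] world"
index = bbcode_index_for_cursor(html, bbcode)
print(index, repr(bbcode[:index]))  # 21 'Hello [b]bold[/b] wor'
```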
Thanks for the help.

A popular way of measuring how similar two texts are is to compute the Jaccard similarity you mentioned. Citing Wikipedia:
The Jaccard index, also known as Intersection over Union and the Jaccard similarity coefficient, is a statistic used for comparing the similarity and diversity of sample sets. The Jaccard coefficient measures the similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets:
J(A,B) = |A ∩ B| / |A ∪ B|
If you have a large number of texts, though, computing the full Jaccard index for every possible pair of texts is computationally very expensive. There is a way to approximate this index, called minhashing. It applies several (e.g. 100) independent hash functions to each set and keeps the minimum hash value per function as that set's signature. The nice property is that, for each hash function, the probability (over all permutations) that the two sets produce the same minimum value, T1 = T2, is the same as J(A,B).
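As a rough illustration of both ideas, here is a sketch in Python using only the standard library. The function names are invented for this example, and the "independent hash functions" are simulated by prefixing each item with a seed before hashing with SHA-1, which is an assumption of this sketch and not a prescribed construction.

```python
import hashlib


def jaccard(a, b):
    """Exact Jaccard index: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


def minhash_signature(items, num_hashes=100):
    """One minimum hash value per simulated hash function (items must be non-empty)."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()[:8], "big")
            for item in items
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two sets agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


doc1 = set("the quick brown fox jumps over the lazy dog".split())
doc2 = set("the quick brown cat sleeps near the lazy dog".split())
print(jaccard(doc1, doc2))  # 5 shared words out of 11 distinct ones
print(estimate_jaccard(minhash_signature(doc1), minhash_signature(doc2)))
```

The estimate gets closer to the exact value as num_hashes grows. The benefit is that signatures are short and fixed-size, so they can be compared much more cheaply than the full word sets.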
Another way to cluster similar texts (or any other data) together is to use Locality Sensitive Hashing, which is itself an approximation of what KNN does. It is usually less accurate than KNN but definitely faster to compute. The basic idea is to project the data into a low-dimensional binary space (that is, each data point is mapped to an N-bit vector, the hash key). Each hash function h must satisfy the locality-sensitive hashing property prob[h(x)=h(y)]=sim(x,y), where sim(x,y) in [0,1] is the similarity function of interest. For dot products it can be visualized as follows:
We can now ask what the hash of the indicated point would be (in this case it's 101), and everything that is close to this point has the same hash.
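One common way to get such bit keys for dot products is random-hyperplane hashing: each bit records on which side of a random hyperplane the point lies. The sketch below is a minimal Python illustration of that technique, not necessarily the exact scheme the missing figure showed. The function names and the fixed seed are assumptions. For this particular family, the chance that two points agree on a single bit is 1 - θ/π, where θ is the angle between them.

```python
import random


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw one random normal vector per output bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def lsh_key(point, hyperplanes):
    """Bit i is 1 when the point lies on the positive side of hyperplane i."""
    bits = []
    for normal in hyperplanes:
        dot = sum(n * x for n, x in zip(normal, point))
        bits.append("1" if dot >= 0 else "0")
    return "".join(bits)


planes = make_hyperplanes(num_bits=3, dim=2)
print(lsh_key([1.0, 2.0], planes))    # some 3-bit key
print(lsh_key([1.1, 2.1], planes))    # a nearby point, most likely the same key
print(lsh_key([-1.0, -2.0], planes))  # the opposite direction flips every bit
```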
EDIT to answer the comment
No, you asked about the text similarity and so I answered that. You basically ask how can you predict the position of the character in text 2. It depends on whether you analyze the writer's style or just pure syntax. In any of those two cases, IMHO you need some sort of statistics that will tell where it is likely for this character to occur given all the other data/text. You can go with n-grams, RNNs, LSTMs, Markov Chains or any other form of sequential data analysis.

What type of list should be used for a timeline?

An online exercise I am doing gave as a solution for building a timeline an ordered list.
I had constructed the timeline using a description list, since I thought it would look weird to have a number or letter preceding a year.
I think a description list looks better, but I'm wondering about WAI-ARIA: does it make sense for a timeline to be constructed as an ordered list so the progression is semantically logical as well as in appearance?
And, if so, is it possible to hide the ordinal indicator (i.e., letter, number) of the <ol>?

Like you've suggested, it's all about semantics. Without referring to a spec, it makes sense to use HTML elements that "suggest" to someone/something (developers/machines) reading the code directly that there's further meaning, in this case the data following a logical order.
Other examples would be semantic elements introduced in HTML5 like <header>, <article>, <section>, <aside>, <time> or even older elements like <address>.
Comparing your two options:
An ordered list <ol> implies that the data is ordered, which suits a list of dates/events in a timeline.
A description list <dl> uses term elements <dt> for holding the term/name and description elements <dd> for describing that term. Depending on the type of timeline, it could be argued that a year is the term, but are you describing it as a term? Most likely it's not being described but just used as a point in time for other data (think: x-axis).
Furthermore, using an ordered list would mean:
Other developers (even in the CSS/JS) would know to respect the order, whether that be in the generation of those elements or in the styling, and get some insight into the data.
If you have an end user with a disability using a screen-reader, the reader can respect that order (think: cooking instructions).
So an ordered list is probably most appropriate, though don't lose sleep over your choice either, we're almost splitting hairs in this case.
If you need to hide the ordinal indicator you can do that quite easily with CSS:
ol {
  list-style: none; /* removes the ordinal indicator */
  padding-left: 0;  /* removes the left-over space, if needed */
}

Performance of XElement LINQ query against attribute or element

I'm using LINQ to XML to search the XML below (a small section of the final document), initially on the Country name attribute, and was wondering if there would be any performance benefit in making the name a child element rather than an attribute.
<Countries>
<Country name="United Kingdom">
<Grades>
<Grade>PA</Grade>
<Grade>FE</Grade>
</Grades>
</Country>
</Countries>
Thanks

In terms of LINQ processing time the difference should be very small, and it depends on the shape of the document. Looking up an attribute on an element with many attributes is slower than on an element with just one, and the same goes for child elements. So if the sample above is representative, the attribute would be the faster one: Country has just one attribute, but it would have two child elements if you moved the name into an element.
A second consideration, which is probably more important, is parsing speed. You need to parse the document before you can search it. Parsing speed depends mainly on the number of characters the parser has to process, so the longer the input document (in bytes), the longer it takes to parse. In this sense attributes are (usually) a bit shorter than elements. The bookkeeping the parser has to do is also a little bit less for attributes than for elements (especially if you have just one attribute on an element).
But as with anything about performance: Measure It. That's the ultimate answer.

What is a quad linked list?

I'm currently working on implementing a list-type structure at work, and I need it to be crazy effective. In my search for effective data structures I stumbled across a patent for a quad linked list, and this sparked my interest enough to make me forget about my current task and start investigating the quad list instead. Unfortunately, the internet was very secretive about the whole thing, and Google didn't produce much in terms of usable results. The only explanation I got was the patent description, which stated:
A quad linked data structure that provides bidirectional search capability for multiple related fields within a single record. The data base is searched by providing sets of pointers at intervals of N data entries to accommodate a binary search of the pointers followed by a linear search of the resultant range to locate an item of interest and its related field.
This, unfortunately, just makes me more puzzled, as I cannot wrap my head around the non-layman explanation. So I turn to you all in the hope that you can explain to me what this quad linked list really is, as I know that not knowing will drive me up and over the walls pretty quickly.
Do you know what a quad linked list is?

I can't be sure, but it sounds a bit like a skip list.
Even if that's not what it is, you might find skip lists handy. (To the best of my knowledge they are unidirectional, however.)

I've not come across the term formally before, but from the patent description, I can make an educated guess.
A linked list is one where each node has a link to the next...
a -->-- b -->-- c -->-- d -->-- null
A doubly linked list means each node holds a link to its predecessor as well.
  --<--   --<--   --<--
|       |       |       |
a -->-- b -->-- c -->-- d -->-- null
Let's assume the list is sorted. If I want to perform binary search, I'd normally go half way down the list to find the middle node, then go into the appropriate interval and repeat. However, linked list traversal is always O(n) - I have to follow all the links. From the description, I think they're just adding additional links from a node to "skip" a fixed number of nodes ahead in the list. Something like...
  --<--   --<--   --<--
|       |       |       |
a -->-- b -->-- c -->-- d -->-- null
|                       |
|----------->-----------|
 -----------<-----------
Now I can traverse the list more rapidly, especially if I choose the extra link targets carefully (i.e. ensure they always go back/forward half of the offset of the item they point from in the list length). I then find the rough interval I want with these links, and use the normal links to find the item.
This is a good example of why I hate software patents. It's eminently obvious stuff, wrapped in florid prose to confuse people.

I don't know if this is exactly a "quad-linked list", but it sounds like something like this:
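#include <string>
// One record threaded through three doubly-linked lists at once.
struct Customer {
    // Normal doubly-linked list.
    Customer *nextCustomer;
    Customer *prevCustomer;
    // Links through the collection in firstName order.
    std::string firstName;
    Customer *nextByFirstName;
    Customer *prevByFirstName;
    // Links through the collection in lastName order.
    std::string lastName;
    Customer *nextByLastName;
    Customer *prevByLastName;
};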
That is: you maintain several orderings through your collection. You can easily navigate in firstName order, or in lastName order. It's expensive to keep the links up to date, but it makes navigation quite quick.
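To show how such links might be built and walked, here is a small Python sketch of the same idea. Everything in it (the Person class, thread, walk, and the sample names) is made up for illustration, and it simply rebuilds each ordering by sorting, whereas a real implementation would update the four pointers on every insert or delete.

```python
class Person:
    def __init__(self, first_name, last_name):
        self.first_name = first_name
        self.last_name = last_name
        # Four links per record: next/prev in each of the two orderings.
        self.next_by_first = self.prev_by_first = None
        self.next_by_last = self.prev_by_last = None


def thread(people, key, next_attr, prev_attr):
    """Link the records in the order given by key and return the head."""
    ordered = sorted(people, key=key)
    for left, right in zip(ordered, ordered[1:]):
        setattr(left, next_attr, right)
        setattr(right, prev_attr, left)
    return ordered[0] if ordered else None


def walk(head, next_attr):
    """Follow one family of links from head to the end."""
    node = head
    while node is not None:
        yield node
        node = getattr(node, next_attr)


people = [Person("Carol", "Adams"), Person("Alice", "Young"), Person("Bob", "Miller")]
by_first = thread(people, lambda p: p.first_name, "next_by_first", "prev_by_first")
by_last = thread(people, lambda p: p.last_name, "next_by_last", "prev_by_last")
print([p.first_name for p in walk(by_first, "next_by_first")])  # ['Alice', 'Bob', 'Carol']
print([p.last_name for p in walk(by_last, "next_by_last")])     # ['Adams', 'Miller', 'Young']
```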
Of course, this could be something completely different.

My reading of it is that a quad linked list is one which can be traversed (backwards or forwards) in O(n) in two different ways, i.e. sorted according to FieldX or FieldY:
(a) generating first and second sets of link pointers, wherein the first set of link pointers points to successor elements of the set of related records when the records are ordered with respect to the fixed ID field, and the second set of link pointers points to predecessor elements of the set of related records when the records are ordered with respect to the fixed ID field;
(b) generating third and fourth sets of link pointers, wherein the third set of link pointers points to successor elements of the set of related records when the records are ordered with respect to the variable ID field, and the fourth set of link pointers points to predecessor elements of the set of related records when the records are ordered with respect to the variable ID field;
So if you had a quad linked list of employees you could store it sorted by name AND sorted by age, and enumerate either in O(n).

One source of the patent is this. There are, it appears, two claims, the second of which is more nearly relevant:
A computer implemented method for organizing and searching a set of related records, wherein each record includes:
i) a fixed ID field; and
ii) a variable ID field; the method comprising the steps of:
(a) generating first and second sets of link pointers, wherein the first set of link pointers points to successor elements of the set of related records when the records are ordered with respect to the fixed ID field, and the second set of link pointers points to predecessor elements of the set of related records when the records are ordered with respect to the fixed ID field;
(b) generating third and fourth sets of link pointers, wherein the third set of link pointers points to successor elements of the set of related records when the records are ordered with respect to the variable ID field, and the fourth set of link pointers points to predecessor elements of the set of related records when the records are ordered with respect to the variable ID field;
(c) generating first and second sets of field pointers, wherein the first set of field pointers includes an ordered set of pointers that point to every Nth fixed ID field when the records are ordered with respect to the fixed ID field, and the second set of pointers includes an ordered set of pointers that point to every Nth variable ID field when the records are ordered with respect to the variable ID field;
(d) when searching for a particular record by reference to its fixed ID field, conducting a binary search of the first set of field pointers to determine an initial pointer and a final pointer defining a range within which the particular record is located;
(e) examining, by linear search, the fixed ID fields within the range determined in step (d) to locate the particular record;
(f) when searching for a particular record by reference to its variable ID field, conducting a binary search of the second set of field pointers to determine an initial pointer and a final pointer defining a range within which the particular record is located;
(g) examining, by linear search, the variable ID fields within the range determined in step (f) to locate the particular record.
When you work through the patent gobbledegook, I think it means approximately the same as having two skip lists (one for forward search, one for backwards search) on each of two keys (hence 4 lists in total, and the name 'quad-list'). I don't think it is a very good patent - it looks to be an obvious application of skip lists to a data set where you have two keys to search on.
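Steps (c) to (e) of the claim can be sketched as follows in Python. This is one reading of the claim text and not the patented implementation. It covers a single key only (the claim repeats the same structure for the second key), and the names Node, build_list and find are invented for this example.

```python
import bisect


class Node:
    def __init__(self, key):
        self.key = key
        self.next = None  # successor link
        self.prev = None  # predecessor link


def build_list(keys, n):
    """Build a sorted doubly-linked list plus pointers to every Nth node."""
    nodes = [Node(k) for k in sorted(keys)]
    for left, right in zip(nodes, nodes[1:]):
        left.next, right.prev = right, left
    index_nodes = nodes[::n]                      # step (c): every Nth record
    index_keys = [node.key for node in index_nodes]
    return index_keys, index_nodes


def find(index_keys, index_nodes, target):
    """Binary search over the pointers, then a linear scan inside the range."""
    i = bisect.bisect_right(index_keys, target) - 1  # step (d)
    if i < 0:
        return None  # target is smaller than every key
    node = index_nodes[i]
    while node is not None and node.key <= target:   # step (e)
        if node.key == target:
            return node
        node = node.next
    return None


index_keys, index_nodes = build_list(range(0, 100, 3), n=4)
hit = find(index_keys, index_nodes, 27)
print(hit.key, hit.prev.key, hit.next.key)  # 27 24 30
print(find(index_keys, index_nodes, 28))    # None
```

With an interval of N, the linear part never visits more than N nodes, so a lookup costs roughly log(n/N) + N steps instead of a full O(n) walk.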

The description isn't particularly good, but as best I can gather, it sounds like a less-efficient skip list.

Is there a specific name for the node that corresponds to a subtree?

I'm designing a web site navigation hierarchy. It's a tree of nodes. Nodes represent web pages.
Some nodes on the tree are special. I need a name for them.
There are multiple such nodes. Each is the "root" of a sub-tree with pages that have a distinct logo, style sheet, or layout. Think of different departments.
site map with color-coded sub-trees http://img518.imageshack.us/img518/153/subtreesfe1.gif
What should I name this type of node?

How about Root (node with children, but no parent), Node (node with children and parent) and Leaf (node with no children and parent)?
You can then distinguish by name and position within the tree structure (e.g. DepartmentRoot, DepartmentNode, DepartmentLeaf) if need be.
Update Following Comment from OP
Looking at your question, you said that "some" are special, and in your diagram different nodes look different at different levels. The nodes may differ in their design, because you can build a tree structure in many ways. For example, you could have a single abstract class that can have child nodes: if it has no children, it's a leaf; if it has no parent, it's a root; and this can change during its lifetime. Or you could have a fixed class structure in which leaves are a specific class type that cannot have children added to them in any way.
If your design does not need you to distinguish nodes depending on their position (relative to the root), that suggests you have one abstract class used for them all.
In which case, it raises the question: how is this node different?
If it is simply the same as the standard node everywhere else, but with a bit of styling, how about StyledNode? Do you even need it to be separate (no style == no big deal, it doesn't render)?
Since I don't know the mechanics of how the tree is architected, there could possibly be several factors to consider when naming.

The word you are looking for is "Section". It's part of a whole and has the same stuff inside.
So, you have Nodes, which have children and a parent, and you have SectionNodes which are the roots of these special subtrees.
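As an illustration of that naming, here is a minimal Python sketch. It uses a single node class with an optional section marker, which is one possible design and an assumption of this example. The names PageNode, section_style and section() are all hypothetical.

```python
class PageNode:
    """A page in the navigation tree; a section root carries its own style."""

    def __init__(self, title, section_style=None):
        self.title = title
        self.section_style = section_style  # e.g. a stylesheet name, or None
        self.parent = None
        self.children = []

    def add(self, child):
        child.parent = self
        self.children.append(child)
        return child

    @property
    def is_section_root(self):
        return self.section_style is not None

    def section(self):
        """Return the nearest section root at or above this node, if any."""
        node = self
        while node is not None:
            if node.is_section_root:
                return node
            node = node.parent
        return None


home = PageNode("Home")
sales = home.add(PageNode("Sales", section_style="sales.css"))
pricing = sales.add(PageNode("Pricing"))
about = home.add(PageNode("About"))

print(pricing.section().title)  # Sales
print(about.section())          # None
```

A page then picks up its logo, style sheet, or layout by asking for its section, so only the section roots need to store that information.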

How about PageTemplate to embody the fact that its children have their own layout, CSS etc?

AreaNode

So, it sounds like you are gathering categories. The nodes are the entry points of these categories. How about "TopCategoryNode", and then "CategoryEntry" for something that is below them? Or, if you want to divide further, something like "CategoryCSS", "CategoryLayout", etc.?
This is kind of generic, but you make clear that there are "categories", and that these do consist of more than one subnode, or subtheme.

Branch ?
Keeps the tree analogy and also hints in this case at departments etc
Thinking about class hierarchies, Root is probably a special case of Branch, which is a special case of Node, which is a special case of Leaf. The Branch/Node distinction is one you get to make for your special situation.
