Understanding X-Path Expression - xpath

I'm trying to get an understanding of XPath in order to parse a diffxml file. I skimmed over the w3schools site. Am I understanding these correctly?
Statement 1: /node()[1]/node()[3]
Selects the third child of the root node
Statement 2: /node()[1]/node()[1]/node()[1]
Selects the child of the first node of the root node
Statement 3: /node()[1]/node()[3]/node()[2]
Selects the second child of the third node under the root node.

Yes, you understand them correctly, but this is not how you'd normally use XPath. First, node() can match any kind of node, not just elements. Second, a pure positional index is arguably the worst way of selecting things; you should really use names, and possibly predicates for filtering the node-sets.

You'll find a lot of criticism of w3schools on this site. Personally I find it a useful resource, but only when I'm trying to remind myself of something I once knew. It's not really designed for teaching yourself things from scratch, and I suggest you need a different learning strategy. Call me old-fashioned, but when I'm learning a new technology I find there's nothing better than a good book.
You've understood your examples correctly as far as I can tell. But have you understood what a "node" is? For example, do you know under what circumstances whitespace text counts as a node? The key to understanding XPath is to understand the data model, and the way in which the data model relates to the lexical (angle-bracket) form of the XML.
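To make the point about whitespace nodes concrete, here is a minimal sketch using Python's standard `xml.dom.minidom`. The tiny document is made up for illustration, and some XML processors can be configured to strip whitespace-only text, so treat this as one common behaviour rather than a rule.

```python
from xml.dom import minidom

# Hypothetical document: the indentation between tags is kept as text nodes.
doc = minidom.parseString("<root>\n  <a/>\n  <b/>\n</root>")
root = doc.documentElement

# Every child node of <root>, whatever its kind (this is what node() sees).
children = root.childNodes
kinds = [child.nodeName for child in children]
print(kinds)  # ['#text', 'a', '#text', 'b', '#text']

# Only the element children (this is what * sees).
elements = [c for c in children if c.nodeType == c.ELEMENT_NODE]
print([e.tagName for e in elements])  # ['a', 'b']
```

In this document the first child node of root is whitespace text, not the a element, which is why node()[1] and *[1] can select different things.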

Related

tool for extracting xpath query from specified/selected node

Normally, one would use an XPath query to obtain a certain value or node. In my case, I'm doing some web-scraping with google spreadsheets, using the importXML function to update automatically some values. Two examples are given below:
=importxml("http://www.creditagricoledtvm.com.br/";"(//td[@class='xl7825385'])[9]")
=importxml("http://www.bloomberg.com/quote/ELIPCAM:BZ";"(//span)[32]")
The problem is that the pages I'm scraping will change every now and then and I understand very little about XML/XPath, so it takes a lot of trial and error to get to a node. I was wondering if there is any tool I could use to point to an element (either in the page or in its code) that would provide an appropriate query.
For example, in the second case, I noticed the info I wanted was in a span node (hence (//span)), so I printed all of them in a spreadsheet and used the line count to find the [32] index. This takes a long time to load, so it's pretty inconvenient. Also, I don't even remember how I figured out the //td[@class='xl7825385'] query. That is why I'm wondering if there is a more practical method of pointing to page elements.
Some clues :
Learning XPath basics is still useful. W3Schools is a good starting point.
https://www.w3schools.com/xml/xpath_intro.asp
Otherwise, the built-in dev tools of your browser can generate an absolute XPath for you. Select an element, right-click on it, then choose Copy > Copy XPath.
https://developers.google.com/web/tools/chrome-devtools/open
Browser extensions like Chropath can generate absolute or relative XPath for you.
https://autonomiq.io/chropath/
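For a rough idea of what such tools do when they produce an absolute XPath, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The function name `absolute_xpath` and the sample markup are made up for illustration; real pages are often not well-formed XML and would need an HTML parser first.

```python
import xml.etree.ElementTree as ET


def absolute_xpath(root, target):
    """Build an absolute, index-based XPath from root down to target."""
    # ElementTree elements do not know their parent, so build a lookup first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not root:
        parent = parent_of[node]
        # XPath positions are 1-based and count siblings with the same tag.
        same_tag = [c for c in parent if c.tag == node.tag]
        steps.append(f"{node.tag}[{same_tag.index(node) + 1}]")
        node = parent
    steps.append(root.tag)
    return "/" + "/".join(reversed(steps))


# Hypothetical page fragment.
root = ET.fromstring(
    "<html><body><div><span>a</span><span>b</span></div></body></html>"
)
target = root.findall(".//span")[1]
path = absolute_xpath(root, target)
print(path)  # /html/body[1]/div[1]/span[2]
```

Paths built this way break as soon as the page layout shifts, which is the same weakness as the hand-counted [32] index in the question.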

Webscraping Selectors

At what level of hierarchy do you begin your selectors?
There seems to be a convention of beginning with the container of the target element, but why not ever the target element itself, especially in the case of an id or starting with a wildcard plus a unique identifier?
Recursive descent seems like everyone's best friend.
XPaths and Css-Selectors are very versatile, and can describe the same element in many different ways, i.e. a single element has infinitely many possible locators to describe it. The goal is to get something that fits the needs of the developer, which might include being readable, unique, and/or adaptive.
Consider the following html example:
<div id='mainContainer'>
<span>some span</span>
</div>
If I were trying to make a locator for the <span> element, I wouldn't choose //span, because that will probably yield way too many results. Instead you could start with its parent, which has an id, and then proceed to the span: //*[@id='mainContainer']/span, and alternatively: //span[parent::*[@id='mainContainer']]. Which XPath is better? Whichever one you personally find more readable. I agree with you that the first example does seem to be more common, although I myself am more partial to the latter.
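As a small illustration of the difference, here is a sketch using Python's standard `xml.etree.ElementTree`, which supports a limited XPath subset (paths start with `.` relative to the parsed root, and the parent:: axis form is not available). The second div with id 'sidebar' is made up to show why //span alone is too broad.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: the container from the example plus one extra div.
page = ET.fromstring(
    "<body>"
    "<div id='mainContainer'><span>some span</span></div>"
    "<div id='sidebar'><span>other span</span></div>"
    "</body>"
)

# Bare tag name: matches every span on the page.
all_spans = page.findall(".//span")
print(len(all_spans))  # 2

# Anchored on the parent's id: matches only the span we want.
anchored = page.findall(".//*[@id='mainContainer']/span")
print([s.text for s in anchored])  # ['some span']
```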
Sometimes the point of making a locator a certain way is to be adaptable. For instance, I rarely write a locator like this: //*[@class='fooBar']. The reason is that in modern web development classes come and go frequently, and it's likely that the element's class could change at the slightest breeze. Instead you might write //*[contains(@class,'fooBar')]. Now when a developer goes in and adds a class for pure styling, you don't have to go back and update all of your selenium tests. That is also the reason I use wildcard characters frequently. If a developer goes in and updates a div to a span, my test will still work.
As @Gilles Quenot commented, it isn't always safe to assume that ids are unique. Many websites were written by someone's unemployed uncle who took an html class back in '86. They are terrible, and don't care at all about standards or audits. This is another reason that you need to include enough information in your locator to specify the exact element/elements you are talking about, but not so little that you end up describing too many elements.
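The brittleness of an exact class match can be sketched with Python's standard `xml.etree.ElementTree`. Its XPath subset has no contains() function, so the flexible match below is emulated in plain Python with a hypothetical helper `has_class`; it compares whole class tokens, which is slightly stricter than contains() (that would also match a class such as 'fooBarBaz'). The markup is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: the second element gained an extra styling class.
page = ET.fromstring(
    "<body>"
    "<div class='fooBar'>plain</div>"
    "<span class='fooBar highlighted'>restyled</span>"
    "<div class='other'>unrelated</div>"
    "</body>"
)

# Exact attribute match: misses the element with the added class.
exact = page.findall(".//*[@class='fooBar']")
print([el.text for el in exact])  # ['plain']


def has_class(element, name):
    """True if name is one of the space-separated class tokens."""
    return name in element.get("class", "").split()


# Token match with a wildcard tag: survives both the extra class
# and the div-to-span change.
by_token = [el for el in page.iter() if has_class(el, "fooBar")]
print([el.text for el in by_token])  # ['plain', 'restyled']
```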
One more comment is that XPaths are bidirectional, whereas Css-Selectors are not. This means XPaths can go from child to parent and from parent to child, where Css-Selectors can only go from parent to child. This affects which node you are starting at, and may be a reason that you see more Css-Selectors start from a parent/ancestor node.
TL;DR There isn't a convention, just personal preferences. Do what meets your needs.

Arango single tree response using AQL only

I have found several questions that are similar but no solution worked as needed, or used internal functions. This is the most relevant one:
Getting data for d3 from ArangoDB using AQL (or arangojs)
I'm unable to understand how to return a single response with a tree structure of parent + children, something that D3 can understand. Whatever I do, beyond the first iteration, everything is a mess. I have tried MERGE and MERGE_RECURSIVE but they did not work the way I expected.
I'm clueless to how I can make it to work. I'm used to Neo4J and for some reason this one is just hard for me to understand.
Any help will do,
Thanks,
DD.
I found a simple solution. I'm just using AQL to get a flat list of results and their edges. After that, I just arrange it as I need in my code.
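The poster did not share the code, so the following is only a sketch of what "arrange it in my code" could look like. It assumes the query returns a list of vertex documents and a list of edge documents (ArangoDB's `_id`, `_from` and `_to` attributes), that every vertex has a `name` field, and that the output should be the nested name/children shape D3 hierarchies commonly use. The function name `build_tree` and the sample data are hypothetical.

```python
def build_tree(nodes, edges, root_id):
    """Nest a flat vertex/edge list into a {'name', 'children'} tree."""
    # One output node per vertex, keyed by its document id.
    by_id = {n["_id"]: {"name": n["name"], "children": []} for n in nodes}
    # Each edge attaches the child (_to) under its parent (_from).
    for edge in edges:
        by_id[edge["_from"]]["children"].append(by_id[edge["_to"]])
    return by_id[root_id]


# Hypothetical flat result, as an AQL query might return it.
nodes = [
    {"_id": "items/1", "name": "root"},
    {"_id": "items/2", "name": "child A"},
    {"_id": "items/3", "name": "child B"},
    {"_id": "items/4", "name": "grandchild"},
]
edges = [
    {"_from": "items/1", "_to": "items/2"},
    {"_from": "items/1", "_to": "items/3"},
    {"_from": "items/2", "_to": "items/4"},
]

tree = build_tree(nodes, edges, "items/1")
print([child["name"] for child in tree["children"]])  # ['child A', 'child B']
```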

Shorten XPATH with wildcards

I'm currently trying to figure out how to shorten my extremely long xpath.
//div[@class='m_set_part'][1]/div/div[2]/div[@class='row']/div[@class='col details detail-head']/div[@class='detail-body']/div[2]/div/div[@class='size']/div/div[@class='m_product_finder_size']/ul/li[1]/span[@class='size-btn']/a
This is the one I have right now and it's way too long, the problem is I need the first node to differentiate between products. Is there a way to shorten it like
//div[@class='m_set_part']/*/span[@class='size-btn']/a
Or do I have to go through all childnodes to reach the last nodes?
Link
I want to find the size buttons for each product. The only way to differentiate them, I guess, is by adding a [1] or [2] to the m_set_part node.
You are basically correct. As said in the comments, you can use // to select descendant or self nodes. Hence, this will give you all the size links:
//span[@class='size-btn']/a
As you suggest, you can select the specific product using a positional predicate. However, if you prefer you could also use another detail, e.g. the name. This would simply be
//div[@class="m_set_part"][.//label="Vælg"]
to give you the Vælg product.
Now combine them both and you can get the size link for this specific product using
//div[@class="m_set_part"][.//label="Vælg"]//span[@class='size-btn']/a
or, using the positional predicate, it would be
//div[@class="m_set_part"][1]//span[@class='size-btn']/a
Also, please make sure you use a proper namespace, as this is an actual XHTML document. One other thing is that you might prefer to use contains(@class, 'm_set_part') instead of @class="m_set_part" and the like, because the query will still work even if they add new CSS classes to this element.
To answer to your question: No you don't have to go through all nodes.
You may use the // descendant-or-self selector to 'skip' zero or more nodes in between the preceding and the next part of the expression. So //div[@class='m_set_part']//span[@class='size-btn']/a might give you exactly what you want. * on the other hand matches any element, but exactly one level per step. Therefore
//div[@class='m_set_part'][1]/*/*[2]/*[@class='row']/*[@class='col details detail-head']/*[@class='detail-body']/*[2]/*/*[@class='size']/*/*[@class='m_product_finder_size']/*/*[1]/*[@class='size-btn']/a
is another way to rewrite your original expression. Whether it still returns only the node you are interested in, or more, depends solely on the document you apply the expression to.
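The difference between // (any depth) and /*/ (exactly one level) can be checked with a short sketch using Python's standard `xml.etree.ElementTree`. The markup is a heavily simplified, made-up stand-in for the real page, and the product is picked by list index in Python rather than with a positional predicate, because ElementTree implements only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified page with two products.
page = ET.fromstring(
    "<body>"
    "<div class='m_set_part'><div><ul><li>"
    "<span class='size-btn'><a>S</a></span>"
    "</li></ul></div></div>"
    "<div class='m_set_part'><div><ul><li>"
    "<span class='size-btn'><a>M</a></span>"
    "</li></ul></div></div>"
    "</body>"
)

# One entry per product container.
products = page.findall(".//div[@class='m_set_part']")
first = products[0]

# '//' skips any number of intermediate levels.
deep = first.findall(".//span[@class='size-btn']/a")
print([a.text for a in deep])  # ['S']

# '/*/' stands for exactly one intermediate element, so nothing matches here.
one_level = first.findall("./*/span[@class='size-btn']/a")
print(one_level)  # []
```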

XPath simple conditional statement? If node X exists, do Y?

I am using google docs for web scraping. More specifically, I am using the Google Sheets built in IMPORTXML function in which I use XPath to select nodes to scrape data from.
What I am trying to do is basically check if a particular node exists, if YES, select some other random node.
/*IF THIS NODE EXISTS*/
if(exists(//table/tr/td[2]/a/img[@class='special'])){
/*SELECT THIS NODE*/
//table/tr/td[2]/a
}
You don't have logic quite like that in XPath, but you might be able to do something like what you want.
If you want to select //table/tr/td[2]/a but only if it has an img[@class='special'] in it, then you can use //table/tr/td[2]/a[img[@class='special']].
If you want to select some other node in some other circumstance, you could union two paths (the | operator), and just make sure that each has a filter (within []) that is mutually exclusive, like having one be a path and the other be not() of that path. I'd give an example, but I'm not sure what "other random node" you'd want… Perhaps you could clarify?
The key thing is to think of XPath as a querying language, not a procedural one, so you need to be thinking of selectors and filters on them, which is a rather different way of thinking about problems than most programmers are used to. But the fact that the filters don't need to specifically be related to the selector (you can have a filter that starts looking at the root of the document, for instance) leads to some powerful (if hard-to-read) possibilities.
Use:
/self::node()[//table/tr/td[2]/a/img[@class='special']]
//table/tr/td[2]/a
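The two readings of the question discussed in the answers above can be told apart with a short sketch using Python's standard `xml.etree.ElementTree`. Its XPath subset supports neither nested predicates nor the self::node() form, so both conditions are checked in Python here; the table markup is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical table: only the second row's link contains the special image.
page = ET.fromstring(
    "<html><body><table>"
    "<tr><td>1</td><td><a href='/plain'>plain</a></td></tr>"
    "<tr><td>2</td><td><a href='/special'><img class='special'/></a></td></tr>"
    "</table></body></html>"
)

links = page.findall(".//table/tr/td[2]/a")

# Reading 1: keep only the links that themselves contain the image,
# like //table/tr/td[2]/a[img[@class='special']].
with_special = [
    a for a in links if a.find("img[@class='special']") is not None
]
print([a.get("href") for a in with_special])  # ['/special']

# Reading 2: if the image exists anywhere, return every link, else nothing,
# like the /self::node()[...]//table/tr/td[2]/a expression.
condition = page.find(".//table/tr/td[2]/a/img[@class='special']") is not None
all_or_nothing = links if condition else []
print([a.get("href") for a in all_or_nothing])  # ['/plain', '/special']
```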

Resources