How to use not contains() in XPath? - xpath

I have some XML that is structured like this:
<whatson>
<productions>
<production>
<category>Film</category>
</production>
<production>
<category>Business</category>
</production>
<production>
<category>Business training</category>
</production>
</productions>
</whatson>
And I need to select every production with a category that doesn't contain "Business" (so just the first production in this example).
Is this possible with XPath? I tried working along these lines but got nowhere:
//production[not(contains(category,'business'))]

XPath queries are case sensitive. Having looked at your example (which, by the way, is awesome, nobody seems to provide examples anymore!), I can get the result you want just by changing "business", to "Business"
//production[not(contains(category,'Business'))]
I have tested this by opening the XML file in Chrome, and using the Developer tools to execute that XPath queries, and it gave me just the Film category back.

I need to select every production with a category that doesn't contain "Business"
Although I upvoted #Arran's answer as correct, I would also add this...
Strictly interpreted, the OP's specification would be implemented as
//production[category[not(contains(., 'Business'))]]
rather than
//production[not(contains(category, 'Business'))]
The latter selects every production whose first category child doesn't contain "Business". The two XPath expressions will behave differently when a production has no category children, or more than one.
It doesn't make any difference in practice as long as every <production> has exactly one <category> child, as in your short example XML. Whether you can always count on that being true or not, depends on various factors, such as whether you have a schema that enforces that constraint. Personally, I would go for the more robust option, since it doesn't "cost" much... assuming your requirement as stated in the question is really correct (as opposed to e.g. 'select every production that doesn't have a category that contains "Business"').

You can use not(expression) function.
not() is a function in xpath (as opposed to an operator)
Example:
//a[not(contains(#id, 'xx'))]
OR
expression != true()

Should be xpath with not contains() method, //production[not(contains(category,'business'))]

Related

Webscraping Selectors

At what level of hierarchy do you begin your selectors?
There seems to be a convention of beginning with the container of the target element, but why not ever the target element itself, especially in the case of an id or starting with a wildcard plus a unique identifier?
Recursive descent seems like everyone's best friend.
XPaths and Css-Selectors are very versatile, and can describe the same element in many different ways - i.e. an single element has infinitely many possible locators to describe it. The goal is to get something to fit the needs of the developer which might include being readable, unique, and or adaptive.
Consider the following html example:
<div id='mainContainer'>
<span>some span</span>
</div>
If I were trying to make a locator for the <span> element, I wouldn't choose //span, because that will probably yield way too many results. Instead you could start with its parent who has an id, and then proceed to the span: //*[#id='mainContainer']/span, and alternatively: //span[parent::*[#id='mainContainer']]. Which XPath is better? Whichever one you personally find more readable. I agree with you that the first example does seem to be more common, although I myself am more partial to the latter.
Sometimes the point of making a locator a certain way is to be adaptable. For instance, I rarely write a locator like this: //*[#class='fooBar']. The reason is because in modern web development classes come and go frequently, and it's likely that that element's class could change at the slightest breeze. Instead you might write //*[contains(#class,'fooBar')]. Now when a developer goes in and adds a class for pure styling, you don't have to go back and update all of your selenium tests. That is also the reason I use wildcard characters frequently. If a developer goes in and updates a div to a span, my test will still work.
As #Gilles Quenot commented, it isn't always safe to assume that ids are unique. Many websites were written by someone's unemployed uncle who took an html class back in '86. They are terrible, and don't care at all about standards or audits. This is another reason that you need to include enough information in your locator to specify the exact element/elements you are talking about, but not too much information that you are describing too many elements.
One more comment is that XPaths are bidirectional, whereas Css-Selectors are not. This means XPaths can go from child to parent and from parent to child, where Css-Selectors can only go from parent to child. This affects which node you are starting at, and may be a reason that you see more Css-Selectors start from a parent/ancestor node.
TL;DR There isn't a convention, just personal preferences. Do what meets your needs.

Shorten XPATH with wildcards

I'm currently trying to figure out how to shorten my extremely long xpath.
//div[#class='m_set_part'][1]/div/div[2]/div[#class='row']/div[#class='col details detail-head']/div[#class='detail-body']/div[2]/div/div[#class='size']/div/div[#class='m_product_finder_size']/ul/li[1]/span[#class='size-btn']/a
This is the one I have right now and it's way too long, the problem is I need the first node to differentiate between products. Is there a way to shorten it like
//div[#class='m_set_part']/*/span[#class='size-btn']/a
Or do I have to go through all childnodes to reach the last nodes?
Link
I want to find the for each product the sizebuttons. The only way to differentiate them, I guess, is via adding a [1] or [2] to the m_set_part node.
You are basically correct. As said in the comments, you can use // to select descendant or self nodes. Hence, this will give you all the size links:
//span[#class='size-btn']/a
As you suggest, you can select the specific product using a positional predicate. However, if you prefer you could also use another detail, e.g. the name. This would simply be
//div[#class="m_set_part"][.//label="Vælg"]
to given you the Vælg product.
Now combine them both and you can get the size link for this specifc product using
//div[#class="m_set_part"][.//label="Vælg"]//span[#class='size-btn']/a
or using the psoitional predicate it would be
//div[#class="m_set_part"][1]//span[#class='size-btn']/a
Also, please make sure you use a proper namespace as this is an actual XHTML document. One other thing is that you might prefer to use contains(#class, 'm_set_part') instead of #class="m_set_part" and the like, because the query will still work even if the add new CSS classes to this element.
To answer to your question: No you don't have to go through all nodes.
You may use the // descendant-or-self selector to 'skip' zero or more nodes in between the preceeding and the next part of the expression. So //div[#class='m_set_part']//span[#class='size-btn']/a might give you exactly what you want. * on the other hand matches any node, but exactly one node. Therfore
//div[#class='m_set_part'][1]/*/*[2]/*[#class='row']/*[#class='col details detail-head']/*[#class='detail-body']/*[2]/*/*[#class='size']/*/*[#class='m_product_finder_size']/*/*[1]/*[#class='size-btn']/a
is another way to shorten your original expression. Whether it's still returns only the interested node or more is solely depends on the document you apply the expression on.

XPath simple conditional statement? If node X exists, do Y?

I am using google docs for web scraping. More specifically, I am using the Google Sheets built in IMPORTXML function in which I use XPath to select nodes to scrape data from.
What I am trying to do is basically check if a particular node exists, if YES, select some other random node.
/*IF THIS NODE EXISTS*/
if(exists(//table/tr/td[2]/a/img[#class='special'])){
/*SELECT THIS NODE*/
//table/tr/td[2]/a
}
You don't have logic quite like that in XPath, but you might be able to do something like what you want.
If you want to select //table/tr/td[2]/a but only if it has a img[#class='special'] in it, then you can use //table/tr/td[2]/a[img[#class='special']].
If you want to select some other node in some other circumstance, you could union two paths (the | operator), and just make sure that each has a filter (within []) that is mutually exclusive, like having one be a path and the other be not() of that path. I'd give an example, but I'm not sure what "other random node" you'd want… Perhaps you could clarify?
The key thing is to think of XPath as a querying language, not a procedural one, so you need to be thinking of selectors and filters on them, which is a rather different way of thinking about problems than most programmers are used to. But the fact that the filters don't need to specifically be related to the selector (you can have a filter that starts looking at the root of the document, for instance) leads to some powerful (if hard-to-read) possibilities.
Use:
/self::node()[//table/tr/td[2]/a/img[#class='special']]
//table/tr/td[2]/a

Understanding X-Path Expression

I'm trying to get an understanding of XPath in order to parse a diffxml file. I skimmed over the w3schools site. Am I understanding these correctly?
Statement 1: /node()[1]/node()[3]
Selects the third child of the root node
Statement 2: /node()[1]/node()[1]/node()[1]
Selects the child of the first node of the root node
Statement 3: /node()[1]/node()[3]/node()[2]
Selects the second child of the third node under the root node.
Yes, you understand them correctly, but this is not how you'd use XPath. First node() can be anything, not just elements. Then the pure index is arguably the wort way of selecting things, you should really use names, and possibly predicates for filtering the node-sets.
You'll find a lot of criticism of w3schools on this site. Personally I find it a useful resource, but only when I'm trying to remind myself of something I once knew. It's not really designed for teaching yourself things from scratch, and I suggest you need a different learning strategy. Call me old-fashioned, but when I'm learning a new technology I find there's nothing better than a good book.
You've understood your examples correctly as far as I can tell. But have you understood what a "node" is? For example, do you know under what circumstances whitespace text counts as a node? The key to understanding XPath is to understand the data model, and the way in which the data model relates to the lexical (angle-bracket) form of the XML.

XPath Query in JMeter

I'm currently working with JMeter in order to stress test one of our systems before release. Through this, I need to simulate users clicking links on the webpage presented to them. I've decided to extract theese links with an XPath Post-Processor.
Here's my problem:
I have an a XPath expression that looks something like this:
//div[#data-attrib="foo"]//a//#href
However I need to extract a specific child for each thread (user). I want to do something like this:
//div[#data-attrib="foo"]//a[position()=n]//#href
(n being the current index)
My question:
Is there a way to make this query work, so that I'm able to extract a new index of the expression for each thread?
Also, as I mentioned, I'm using JMeter. JMeter creates a variable for each of the resulting nodes, of an XPath query. However it names them as "VarName_n", and doesn't store them as a traditional array. Does anyone know how I can dynamicaly pick one of theese variables, if possible? This would also solve my problem.
Thanks in advance :)
EDIT:
Nested variables are apparently not supported, so in order to dynamically refer to variables that are named "VarName_1", VarName_2" and so forth, this can be used:
${__BeanShell(vars.get("VarName_${n}"))}
Where "n" is an integer. So if n == 1, this will get the value of the variable named "VarName_1".
If the "n" integer changes during a single thread, the ForEach controller is designed specifically for this purpose.
For the first question -- use:
(//div[#data-attrib="foo"]//a)[position()=$n]/#href
where $n must be substituted with a specific integer.
Here we also assume that //div[#data-attrib="foo"] selects a single div element.
Do note that the XPath pseudo-operator // typically result in very slow evaluation (a complete sub-tree is searched) and also in other confusing problems ( this is why the brackets are needed in the above expression).
It is recommended to avoid using // whenever the structure of the document is known and a complete, concrete path can be specified.
As for the second question, it is not clear. Please, provide an example.

Resources