Webscraping Selectors

Webscraping Selectors - xpath

At what level of hierarchy do you begin your selectors?
There seems to be a convention of beginning with the container of the target element, but why never with the target element itself, especially when it has an id, or with a wildcard plus a unique identifier?
Recursive descent seems like everyone's best friend.

XPaths and Css-Selectors are very versatile and can describe the same element in many different ways, i.e. a single element has infinitely many possible locators. The goal is to get something that fits the needs of the developer, which might include being readable, unique, and/or adaptive.
Consider the following html example:
<div id='mainContainer'>
<span>some span</span>
</div>
If I were trying to make a locator for the <span> element, I wouldn't choose //span, because that will probably yield way too many results. Instead I could start with its parent, which has an id, and then proceed to the span: //*[@id='mainContainer']/span, or alternatively: //span[parent::*[@id='mainContainer']]. Which XPath is better? Whichever one you personally find more readable. I agree with you that the first example does seem to be more common, although I myself am more partial to the latter.
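As a rough illustration of the first locator, here is a minimal sketch using Python's standard-library xml.etree.ElementTree. That module supports only a subset of XPath (it has no parent:: axis, so the second form is not shown), and the extra span and the body wrapper are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: the snippet above plus one unrelated span.
html = """
<body>
  <span>unrelated span</span>
  <div id='mainContainer'>
    <span>some span</span>
  </div>
</body>
"""
root = ET.fromstring(html)

# //span is too broad: it matches every span in the document.
all_spans = root.findall(".//span")

# Starting from the parent that has the id narrows it to one element.
scoped = root.findall(".//*[@id='mainContainer']/span")

print(len(all_spans))   # 2
print(scoped[0].text)   # some span
```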
Sometimes the point of making a locator a certain way is to be adaptable. For instance, I rarely write a locator like this: //*[@class='fooBar']. The reason is that in modern web development classes come and go frequently, and it's likely that the element's class could change at the slightest breeze. Instead you might write //*[contains(@class,'fooBar')]. Now when a developer goes in and adds a class for pure styling, you don't have to go back and update all of your selenium tests. That is also the reason I use wildcard characters frequently. If a developer goes in and updates a div to a span, my test will still work.
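The difference between the exact and the contains() form can be sketched as follows. This assumes Python's xml.etree.ElementTree, whose XPath subset has no contains(), so the looser match is emulated with a plain substring test; the markup and the extra class name are invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: a developer added a styling class next to fooBar.
root = ET.fromstring("<body><div class='fooBar highlighted'>target</div></body>")

# Equivalent of //*[@class='fooBar']: exact attribute match, now broken.
exact = root.findall(".//*[@class='fooBar']")

# Rough equivalent of //*[contains(@class,'fooBar')], done in Python
# because ElementTree's XPath subset does not include contains().
loose = [el for el in root.iter() if "fooBar" in el.get("class", "")]

print(len(exact))     # 0
print(loose[0].tag)   # div
```

Like contains() itself, the substring test would also match a class such as fooBarBaz, so it trades precision for resilience.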
As @Gilles Quenot commented, it isn't always safe to assume that ids are unique. Many websites were written by someone's unemployed uncle who took an html class back in '86. They are terrible, and don't care at all about standards or audits. This is another reason that you need to include enough information in your locator to specify the exact element/elements you are talking about, but not too much information that you are describing too many elements.
One more comment is that XPaths are bidirectional, whereas Css-Selectors are not. This means XPaths can go from child to parent and from parent to child, where Css-Selectors can only go from parent to child. This affects which node you are starting at, and may be a reason that you see more Css-Selectors start from a parent/ancestor node.
TL;DR There isn't a convention, just personal preferences. Do what meets your needs.

Related

Identifying objects in Tosca with Xpath

I have recently been brushing up my skills in TOSCA. I worked with it 2 years ago and then switched to Selenium. I noticed that the new TOSCA allows identification using XPath, which I am really familiar with now. However, I cannot make it work in TOSCA, even though I am sure the object identification works, because I am testing my XPath in the Google Chrome developer tools.
Something as simple as (//*[text()='Forgot Password?'])[1] does not seem to be working. Could I be missing something?
This is the webpage I am using as reference for this example:
https://www.freecrm.com/index.html

XPath certainly can be used to identify elements of an HTML web UI in Tosca.
Since the question was originally posted, the "Forgot Password?" link at https://www.freecrm.com/index.html appears to have changed so that its text is now "Forgot your password?" and is actually located at https://ui.freecrm.com/.
To account for that change, this answer uses "(//*[text()='Forgot your password?'])[1]" instead of the expression provided in the original post.
With the text modification, the expression works to identify the element in XScan after wrapping it in double quotes:
"(//*[text()='Forgot your password?'])[1]"
Some things to keep in mind when using XPath in Tosca:
It seems that XPath expressions need to be wrapped in double quotes (") so that XScan knows when to start evaluating XPath instead of using its normal rules. Looking closely at the expression that is pregenerated when XScan starts, we see that it is wrapped in double quotes:
"id('ui')/div[1]/div[1]/div[1]/a[1]"
A valid XPath expression doesn't necessarily guarantee uniqueness, so it is helpful to pay attention to any feedback messages at the bottom of XScan. There is a significant difference between "The selected element was not found" and "The selected element is not unique". The former simply indicates XScan can't find a match, the latter indicates that XScan matches successfully, but cannot uniquely identify the element.
My experience has been that it helps to explicitly identify the element to reduce the possibility of ambiguity. If the idea is to target the anchor element so that tests can click a link, then it helps to reduce the scope from any element, i.e. "(//*[text()='Forgot your password?'])[1]", to only anchor elements with that text: "//a[text()='Forgot your password?']".
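Outside Tosca, the effect of narrowing the element type can be sketched with Python's xml.etree.ElementTree (Python 3.7 or later). The markup is invented, and ElementTree's [.='text'] predicate compares an element's whole text content, which is close to but not identical to text()='...' in full XPath.

```python
import xml.etree.ElementTree as ET

# Invented markup in which two different elements carry the same text.
root = ET.fromstring(
    "<body>"
    "<span>Forgot your password?</span>"
    "<a href='#'>Forgot your password?</a>"
    "</body>"
)

# Any element with that text: two candidates, so the match is not unique.
any_element = root.findall(".//*[.='Forgot your password?']")

# Only anchor elements with that text: a single candidate.
anchors_only = root.findall(".//a[.='Forgot your password?']")

print(len(any_element))    # 2
print(len(anchors_only))   # 1
```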
In general, Tricentis (or at least the trainers with whom I have spoken) recommends using methods other than XPath to identify a target if they are available. That said, in my experience I've had better luck with XPath than with "Identify by Anchor".
An XPath expression is visible and editable in the XModuleAttribute properties without having to rescan. Personally, I find it easier to work with than the XML value of the RelativeId property that is generated when using Identify by Anchor.
With Anchor, I've had issues where XModuleAttributes scanned in one browser can no longer be found when switching to another browser, specifically from IE to Chrome. With XPath, I've not had these issues.
While XPath works well to identify the properties of one element with attributes of another because it can identify the relationship between them (very common with controls in Angular applications), the same can often be accomplished by adapting the engine layer using the TBox API (i.e. building a custom control). This requires some initial work up front from developer resources, but it can significantly improve how tests steer these controls in addition to reducing the need for Automation Specialists to have to rely on XPath.

What I know is that you can identify elements with XPath when working with XML messages in Tosca API testing. Your use case seems to be UI testing, but I am not sure about that.

Did you try to use XScan to scan the page? Usually Tosca automatically calculates an XPath expression for you that you can use immediately.
Please see the manual for details.
If it still does not work, please try to be more specific. What isn't working? Is there an error message? Unexpected behavior? ...

Tosca provides its own set of attributes for locating any type of element. You can directly select as many attributes as you need to make your element unique, along with the index of that element. Just make sure that you are not using any dynamic values in the 'id' or 'class-name' of that element, and that the index range is not too large: something like 5 out of 10 rather than 20 out of 100 will be easier to update in the future.
Also make use of parent elements that can easily be located uniquely, and then locate your expected element from there.

TOSCA provides various ways to locate an element, just like Selenium, and it offers other properties in addition. Under transition properties you will find the XPath, and it will be an absolute XPath; since you know Selenium, you know the difference between absolute and relative XPath. If your relative XPath is not working, I would suggest going with:
1. Identify by ID or name
2. Identify by anchor

Try "load all properties" at the bottom right. For me, though, it showed up without clicking on it.

Shorten XPATH with wildcards

I'm currently trying to figure out how to shorten my extremely long xpath.
//div[@class='m_set_part'][1]/div/div[2]/div[@class='row']/div[@class='col details detail-head']/div[@class='detail-body']/div[2]/div/div[@class='size']/div/div[@class='m_product_finder_size']/ul/li[1]/span[@class='size-btn']/a
This is the one I have right now and it's way too long, the problem is I need the first node to differentiate between products. Is there a way to shorten it like
//div[@class='m_set_part']/*/span[@class='size-btn']/a
Or do I have to go through all childnodes to reach the last nodes?
I want to find the size buttons for each product. The only way to differentiate them, I guess, is by adding a [1] or [2] to the m_set_part node.

You are basically correct. As said in the comments, you can use // to select descendant or self nodes. Hence, this will give you all the size links:
//span[@class='size-btn']/a
As you suggest, you can select the specific product using a positional predicate. However, if you prefer, you could also use another detail, e.g. the name. This would simply be
//div[@class="m_set_part"][.//label="Vælg"]
to give you the Vælg product.
Now combine them both and you can get the size link for this specific product using
//div[@class="m_set_part"][.//label="Vælg"]//span[@class='size-btn']/a
or, using the positional predicate, it would be
//div[@class="m_set_part"][1]//span[@class='size-btn']/a
Also, please make sure you use a proper namespace, as this is an actual XHTML document. One other thing is that you might prefer to use contains(@class, 'm_set_part') instead of @class="m_set_part" and the like, because the query will still work even if they add new CSS classes to this element.

To answer to your question: No you don't have to go through all nodes.
You may use the // descendant-or-self selector to 'skip' zero or more nodes in between the preceding and the next part of the expression. So //div[@class='m_set_part']//span[@class='size-btn']/a might give you exactly what you want. * on the other hand matches any node, but exactly one node. Therefore
//div[@class='m_set_part'][1]/*/*[2]/*[@class='row']/*[@class='col details detail-head']/*[@class='detail-body']/*[2]/*/*[@class='size']/*/*[@class='m_product_finder_size']/*/*[1]/*[@class='size-btn']/a
is another way to shorten your original expression. Whether it still returns only the node you are interested in, or more, depends solely on the document you apply the expression to.
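The difference between // and a single * step can be tried out with a small sketch. It assumes Python's xml.etree.ElementTree and a made-up, heavily simplified version of the product markup. Because ElementTree's positional predicates do not follow full XPath semantics, the first product is picked by list index in Python instead of with [1].

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified version of the product markup.
page = """
<body>
  <div class='m_set_part'>
    <div><div class='size'><ul>
      <li><span class='size-btn'><a>S</a></span></li>
      <li><span class='size-btn'><a>M</a></span></li>
    </ul></div></div>
  </div>
  <div class='m_set_part'>
    <div><div class='size'><ul>
      <li><span class='size-btn'><a>L</a></span></li>
    </ul></div></div>
  </div>
</body>
"""
root = ET.fromstring(page)

# // skips any number of intermediate levels.
all_links = root.findall(".//div[@class='m_set_part']//span[@class='size-btn']/a")

# /*/ skips exactly one level, so it misses the deeply nested spans.
one_level = root.findall(".//div[@class='m_set_part']/*/span[@class='size-btn']/a")

# Pick the first product in Python, then search below it.
first_product = root.findall(".//div[@class='m_set_part']")[0]
first_links = first_product.findall(".//span[@class='size-btn']/a")

print([a.text for a in all_links])     # ['S', 'M', 'L']
print(len(one_level))                  # 0
print([a.text for a in first_links])   # ['S', 'M']
```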

Understanding X-Path Expression

I'm trying to get an understanding of XPath in order to parse a diffxml file. I skimmed over the w3schools site. Am I understanding these correctly?
Statement 1: /node()[1]/node()[3]
Selects the third child of the root node
Statement 2: /node()[1]/node()[1]/node()[1]
Selects the child of the first node of the root node
Statement 3: /node()[1]/node()[3]/node()[2]
Selects the second child of the third node under the root node.

Yes, you understand them correctly, but this is not how you'd use XPath. First, node() can be anything, not just elements. Second, a pure index is arguably the worst way of selecting things; you should really use names, and possibly predicates for filtering the node-sets.

You'll find a lot of criticism of w3schools on this site. Personally I find it a useful resource, but only when I'm trying to remind myself of something I once knew. It's not really designed for teaching yourself things from scratch, and I suggest you need a different learning strategy. Call me old-fashioned, but when I'm learning a new technology I find there's nothing better than a good book.
You've understood your examples correctly as far as I can tell. But have you understood what a "node" is? For example, do you know under what circumstances whitespace text counts as a node? The key to understanding XPath is to understand the data model, and the way in which the data model relates to the lexical (angle-bracket) form of the XML.
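To see why the data model matters for expressions like /node()[1]/node()[3], here is a minimal sketch using Python's xml.dom.minidom, which keeps whitespace text nodes. The document is made up, and other parsers may be configured to drop such nodes.

```python
from xml.dom import minidom

# Small made-up document with ordinary indentation between elements.
doc = minidom.parseString("<root>\n  <a/>\n  <b/>\n</root>")
children = doc.documentElement.childNodes

# Five child nodes, not two: whitespace text nodes sit around <a/> and <b/>.
print(len(children))                    # 5
print([n.nodeName for n in children])   # ['#text', 'a', '#text', 'b', '#text']

# XPath positions are 1-based, so node()[3] corresponds to index 2 here.
# /node()[1]/node()[3] would therefore land on whitespace, not on <b/>.
third = children[2]
print(third.nodeType == third.TEXT_NODE)   # True
```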

How to use not contains() in XPath?

I have some XML that is structured like this:
<whatson>
<productions>
<production>
<category>Film</category>
</production>
<production>
<category>Business</category>
</production>
<production>
<category>Business training</category>
</production>
</productions>
</whatson>
And I need to select every production with a category that doesn't contain "Business" (so just the first production in this example).
Is this possible with XPath? I tried working along these lines but got nowhere:
//production[not(contains(category,'business'))]

XPath queries are case sensitive. Having looked at your example (which, by the way, is awesome, nobody seems to provide examples anymore!), I can get the result you want just by changing "business" to "Business":
//production[not(contains(category,'Business'))]
I have tested this by opening the XML file in Chrome and using the Developer tools to execute that XPath query, and it gave me just the Film category back.

I need to select every production with a category that doesn't contain "Business"
Although I upvoted @Arran's answer as correct, I would also add this...
Strictly interpreted, the OP's specification would be implemented as
//production[category[not(contains(., 'Business'))]]
rather than
//production[not(contains(category, 'Business'))]
The latter selects every production whose first category child doesn't contain "Business". The two XPath expressions will behave differently when a production has no category children, or more than one.
It doesn't make any difference in practice as long as every <production> has exactly one <category> child, as in your short example XML. Whether you can always count on that being true or not, depends on various factors, such as whether you have a schema that enforces that constraint. Personally, I would go for the more robust option, since it doesn't "cost" much... assuming your requirement as stated in the question is really correct (as opposed to e.g. 'select every production that doesn't have a category that contains "Business"').
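The two readings can be compared with a short sketch. Python's xml.etree.ElementTree does not implement not() or contains(), so the predicates are emulated with ordinary Python here; the fourth production, with two categories, is made up to show where the expressions diverge.

```python
import xml.etree.ElementTree as ET

# The question's XML, plus one made-up production with two categories.
doc = """
<whatson><productions>
  <production><category>Film</category></production>
  <production><category>Business</category></production>
  <production><category>Business training</category></production>
  <production><category>Business</category><category>Music</category></production>
</productions></whatson>
"""
productions = ET.fromstring(doc).findall(".//production")

def cats(p):
    """Return the text of every category child of a production."""
    return [c.text or "" for c in p.findall("category")]

# //production[not(contains(category, 'Business'))]: only the first category counts.
first_only = [cats(p) for p in productions
              if "Business" not in (cats(p)[0] if cats(p) else "")]

# //production[category[not(contains(., 'Business'))]]: any category may qualify.
any_cat = [cats(p) for p in productions
           if any("Business" not in c for c in cats(p))]

# Case matters: lowercase 'business' filters nothing out.
lowercase = [cats(p) for p in productions if "business" not in cats(p)[0]]

print(first_only)       # [['Film']]
print(any_cat)          # [['Film'], ['Business', 'Music']]
print(len(lowercase))   # 4
```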

You can use not(expression) function.
not() is a function in xpath (as opposed to an operator)
Example:
//a[not(contains(@id, 'xx'))]
OR
expression != true()

It should be an XPath using not(contains()): //production[not(contains(category,'Business'))]

HPricot css search: How do I select the parent/ancestor of a particular element using a string selector?

I'm using HPricot's css search to identify a table within a web page. Here's a sample html snippet I'm parsing:
<table height=61 width=700>
<tbody>
<tr>
<td><font size=3pt color = 'Blue'><b><A NAME=a1>Some header text</A></b></font></td></tr>
...
</tbody></table>
There are lots of tables in the page. I want to find the table which contains the A Name=a1 reference.
Right now, the way I'm doing it is
(page/"a[@name=a1]")[0].parent.parent.parent.parent.parent
I don't like this because
It is ugly
It is error prone (what if the folks who maintain the web page remove the tbody?)
Is there a way to tell hpricot to get me the table ancestor of the specified element?
Edit: Here's the full blown page I'm parsing: http://www.blonnet.com/businessline/scoboard/a.htm
The bits I'm interested in are the two tables, one with quarterly results and another with the annual results. Right now, the way I'm extracting those tables is by finding and and moving up from there.

Rohith is right. It is ugly and it is error prone (more than it needs to be). Again as he says it is much more clear with the intent to say "find the closest parent that is a table", and this could go for any child/parent relationship.
If it's "not possible" to do that with hpricot then just say so. But don't just say "it's hopeless to try to do that anyway what's the point". That's a bogus answer. It also doesn't help the next person who comes along (myself) looking for the answer to the same question but for different reasons, which is parsing many pages where differences are ASSUMED and not just feared.
To actually answer the question... I don't know, yet. And I don't have much hope of finding out with hpricot. The documentation is absolutely horridly nonexistent.
But here's a workaround that does about the same thing.
table = (page%"a[@name=a1]").parent
table = table.parent while table.name != "table"
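The same "walk up until you reach a table" idea, translated from the Ruby workaround into Python for illustration, might look like the sketch below. It uses xml.etree.ElementTree on a made-up, well-formed stand-in for the page; closest and parent_of are hypothetical helper names, not part of any library.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed stand-in for the page: two tables, one holding the anchor.
page = ET.fromstring("""
<body>
  <table id='other'><tr><td>noise</td></tr></table>
  <table height='61' width='700'>
    <tbody><tr><td><font><b><a name='a1'>Some header text</a></b></font></td></tr></tbody>
  </table>
</body>
""")

# ElementTree elements have no parent pointer, so build a child -> parent map.
parent_of = {child: parent for parent in page.iter() for child in parent}

def closest(element, tag):
    """Walk upward until an ancestor with the given tag is found (or None)."""
    node = parent_of.get(element)
    while node is not None and node.tag != tag:
        node = parent_of.get(node)
    return node

anchor = page.find(".//a[@name='a1']")
table = closest(anchor, "table")
print(table.get("width"))   # 700
```

Because the loop stops at the first table ancestor, it keeps working whether or not a tbody sits between the row and the table.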

Without seeing the whole page it's hard to give a definitive answer, but often the way you're going about it is the right answer. You have to find a decent landmark, then navigate from there, and if it involves backing up the chain then that's what you do.
You might be able to use XPATH to find the table and then look inside it for the link, but that doesn't really improve things, it only changes them. Firebug, the Firefox plugin, makes it easy to get the XPATH to an element in the page, so you could find the table in question and have Firebug show you the path, or just copy it by right-clicking on the node in the xpath display, and paste that into your lookup.
"It is ugly", well, maybe, but not all code is beautiful or elegant because not all problems lend themselves to beautiful and/or elegant solutions. Sometimes we have to be happy with "it works". As long as it works reliably and you know why then you're ahead of many other coders.
"... what if the folks who maintain the web page remove the tbody?", almost all parsing of HTML or XML suffers from the same concern because we're not in control of the source. You write your code as best as you can, comment the spots that are likely to fail if content changes, then cross your fingers and move on. Even if you were parsing tabular data from a TPS report you could run into the same problem.
The only thing I'd suggest doing differently, is to use the % (AKA "at") instead of / (AKA search). % returns only the first occurrence so you can drop the [0] index.
(page%"a[@name=a1]").parent.parent.parent.parent.parent
or
page%'//a[@name="a1"]/../../../../../..'
which uses the XPath engine to step back up the chain. That should be a little faster if speed is a consideration.
If you know that the target table is the only one with that width and height, you can use a more specific xpath:
page%'//table[@height=61 and @width=700]'
I recommend Nokogiri over Hpricot.
You can also use XPath from the top of the document down:
irb(main):039:0> print (doc/'//body/table[2]/tr/td[2]/table[2]').to_html[0..100]
<table height="61" width="700"><tbody>
<tr><td width="700" colspan="7" align="center"> <font size="3p=> nil
Basically the XPath pattern means:
Find the body tag, then the third table, then its row's third cell. In the cell locate the third table.
Note: Firefox automatically adds the <tbody> tag to the source, even if it wasn't there in the HTML file received. That can really mess you up trying to use Firefox to view the source to develop your own XPaths.
The other table you are after is /html/body/table[2]/tbody/tr/td[2]/table[3] according to Firefox so you have to strip the tbody. Also you don't need to anchor at /html.
