Unable to understand XPath siblings behaviour

I am trying to scrape an HTML page in a scenario where I only have consecutive tags with information.
From the following code I would like to get the text of the "a" tags (e.g. Name1, Name2, ...), taking into consideration:
"a" followed by "span" gives information about that ID being a Customer or not.
"a" followed by "a" means that ID is anonymous.
<span class="list">
<em>List 1:</em>
</span>
Name1,
Name2
<span class="small">(Customer)</span>,
Name3
<span class="small">(Non Customer)</span>,
Name4
<span class="small">(Customer)</span>,
Name5
Name6
<span class="small">(Non Customer)</span>
I'm using the following XPath to try to match an "a" followed by a "span":
//a[contains(@href,'ID/') and ./following-sibling::span[1][text() = '(Customer)']]/text()
This returns Name1, Name2 and Name4, even though Name1 is not a Customer. What am I doing wrong?

That happens because following-sibling::span[1] selects the nearest following sibling that is a span, skipping over any "a" elements in between. For Name1, that span is the "(Customer)" one that comes after Name2, so the predicate is true.
What you should do instead is take the first following sibling of any kind (*[1]), check that it is a span ([self::span]), and only then check whether its text equals "(Customer)":
//a[contains(@href,'ID/') and ./following-sibling::*[1][self::span][text() = '(Customer)']]/text()
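The difference between the two predicates can be sketched in Python. This is a hand-written emulation, not an XPath evaluation: the standard-library ElementTree module does not support the following-sibling axis, so the sibling walk is done with plain list slicing. The <a> tags and href values in the sample markup are assumptions, since the snippet in the question shows only the link text, and the function name customers and its skip_non_spans flag are made up for this example.

```python
import xml.etree.ElementTree as ET

# Assumed markup: the <a> tags and href values are illustrative guesses.
HTML = """<div>
  <span class="list"><em>List 1:</em></span>
  <a href="ID/1">Name1</a>,
  <a href="ID/2">Name2</a> <span class="small">(Customer)</span>,
  <a href="ID/3">Name3</a> <span class="small">(Non Customer)</span>,
  <a href="ID/4">Name4</a> <span class="small">(Customer)</span>,
  <a href="ID/5">Name5</a>
  <a href="ID/6">Name6</a> <span class="small">(Non Customer)</span>
</div>"""


def customers(root, skip_non_spans):
    """Return link texts whose checked sibling reads '(Customer)'.

    skip_non_spans=True  mimics following-sibling::span[1]
    skip_non_spans=False mimics following-sibling::*[1][self::span]
    """
    children = list(root)
    names = []
    for i, node in enumerate(children):
        if node.tag != "a" or "ID/" not in node.get("href", ""):
            continue
        following = children[i + 1:]
        if skip_non_spans:
            # Nearest later <span>, no matter how many <a> siblings precede it.
            candidates = [n for n in following if n.tag == "span"]
        else:
            # Only the immediate next element, kept only if it is a <span>.
            candidates = [n for n in following[:1] if n.tag == "span"]
        if candidates and candidates[0].text == "(Customer)":
            names.append(node.text)
    return names


root = ET.fromstring(HTML)
print(customers(root, skip_non_spans=True))   # ['Name1', 'Name2', 'Name4']
print(customers(root, skip_non_spans=False))  # ['Name2', 'Name4']
```

With the first mode Name1 "borrows" the span that belongs to Name2, which is the behaviour described in the question. The second mode rejects Name1 because its immediate next sibling is another link.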

Related

XPATH Exclude a class or start scraping after it

<p class="region-list"><a class="parent xh-highlight" href="/mediterranean-yacht-charters-1548.htm" title="Mediterranean Yachts for Charter - Summer">Mediterranean</a>Croatia, Italy, Montenegro</p>
I have this query:
//div[@class='hide-for-small']/div/div/div/div[1]/div/div/div/p[@class='region-list']/a
Which returns:
Mediterranean
Croatia
Italy
Montenegro
However, I want to exclude the parent, which is "Mediterranean", so I want to either:
1) Skip the first <a> and grab the rest, or 2) exclude the <a class="parent">.
I have been wrestling with [@class!="parent"] but can't seem to get it to work.

Actually, you can do either.
Skip the first <a> node:
//p[@class='region-list']/a[position()>1]/text()
or skip the <a> node with the specific class attribute value:
//p[@class='region-list']/a[not(@class='parent xh-highlight')]/text()
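Both filters can be sketched in Python with the standard library. ElementTree's XPath subset has neither position()>1 nor not(), so the filtering is done on the list of links instead. The three region links and their href values are assumptions, because the question's snippet lost those tags. The class filter tests for the single token "parent" rather than the exact string 'parent xh-highlight', because xh-highlight looks like a class injected by a browser XPath helper extension (an assumption) and may be absent from the page a scraper actually downloads.

```python
import xml.etree.ElementTree as ET

# Assumed markup: the three region <a> elements and their hrefs are illustrative.
HTML = """<p class="region-list">
  <a class="parent xh-highlight" href="/mediterranean-yacht-charters-1548.htm">Mediterranean</a>
  <a href="/croatia">Croatia</a>,
  <a href="/italy">Italy</a>,
  <a href="/montenegro">Montenegro</a>
</p>"""

links = ET.fromstring(HTML).findall("a")

# Mirrors a[position()>1]: drop the first link by position.
by_position = [a.text for a in links[1:]]

# Token-based class test: drop any link whose class list contains "parent",
# whatever other classes sit next to it.
by_class = [a.text for a in links if "parent" not in a.get("class", "").split()]

print(by_position)
print(by_class)
```

A comparable token test in XPath 1.0 would be a[not(contains(concat(' ', normalize-space(@class), ' '), ' parent '))].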

XPath How to Get Posts Separately

How can I get the texts separately with XPath?
The code I tried returns a single result with all the info merged instead of separate posts:
Post xpath: div
Title xpath: ./p/strong/child::node()
Desc xpath: ./ul/child::node()
Desired:
Title1
Desc1
Title2
Desc2
Got:
Title1 Title2
Desc1 Desc2
HTML:
<div>
<p><strong>Title1</strong></p>
<ul>
<li>Desc1</li>
</ul>
<p><strong>Title2</strong></p>
<ul>
<li>Desc2</li>
</ul>
</div>

It is not entirely clear what the pairs labeled 1 and 2 in your "Desired" example represent, but if you are just trying to select each title text followed by the text of its immediately following ul/li, you can use an expression such as the following (it relies on XPath 2.0 syntax, so an XPath 1.0 engine will reject it):
//div/p/(
./normalize-space(string()),
./(following-sibling::ul[1])/normalize-space(string()))
For each p it selects the entire text content of the p as string and then selects the immediately following ul sibling of the p and selects its entire string content. This can be easily refined to only select p/strong content (instead of all of the p) and similar for ul/li.
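Where only XPath 1.0 is available, the same pairing can be done in the host language. The following Python sketch uses the standard-library ElementTree parser on the markup from the question and walks the siblings by hand. The helper names text_of and posts are made up for this example.

```python
import xml.etree.ElementTree as ET

HTML = """<div>
<p><strong>Title1</strong></p>
<ul>
<li>Desc1</li>
</ul>
<p><strong>Title2</strong></p>
<ul>
<li>Desc2</li>
</ul>
</div>"""


def text_of(node):
    """Whitespace-normalized string value, like normalize-space(string())."""
    return " ".join("".join(node.itertext()).split())


def posts(root):
    """Pair every <p> with the nearest <ul> sibling that follows it."""
    children = list(root)
    pairs = []
    for i, node in enumerate(children):
        if node.tag != "p":
            continue
        # Counterpart of following-sibling::ul[1].
        ul = next((n for n in children[i + 1:] if n.tag == "ul"), None)
        pairs.append((text_of(node), text_of(ul) if ul is not None else ""))
    return pairs


print(posts(ET.fromstring(HTML)))  # [('Title1', 'Desc1'), ('Title2', 'Desc2')]
```

Returning tuples keeps each title attached to its own description, which is the separation the question asks for.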

Xpath fetch specific nodes without their child nodes from XML

I have XML data that looks like this
<priceData>
<div class='price'>
<div class='price-old'>20.00</div>
<div class='price-new'>10.00</div>
<div class='price-tax'>8.00</div>
</div>
<div class='price'>
40.00 <div class='price-tax'>25.00</div>
</div>
</priceData>
I want to use XPath to extract the "price-new" value from the first price div and the value 40.00 from the second price div. This must be done with a single expression.
I tried expressions like
//div[contains(@class, 'price') and not(contains(@class, 'tax')) and not(contains(@class, '-old'))]
and
//div[contains(@class, 'price') and not(contains(@class, 'tax')) and not(descendant::div[contains(@class, '-old') and not(contains(@class, '-tax'))]) and not(contains(@class, '-old'))]
and some others, but I can't get it to work the way it is supposed to.
I always end up fetching extra nodes in the first case, when I only need a single node (price-new, or price itself if it contains no price-new).

You can use the XPath union operator (|) to combine two queries into one. Given the markup in the question as XML input, the following XPath (formatted for readability):
//div[@class='price']/div[@class='price-new']/text()
|
//div[@class='price']/text()[normalize-space()]
returned the expected result in an XPath tester:
Text='10.00'
Text='40.00'
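The two branches of the union can be mimicked in Python with the standard library. ElementTree's XPath subset has neither the | operator nor text() steps, so this sketch is an emulation under those constraints rather than an evaluation of the expression itself. The function name prices is made up for the example. Unlike a real union, which returns nodes in document order, the sketch lists a div's price-new children before its own text.

```python
import xml.etree.ElementTree as ET

XML = """<priceData>
<div class='price'>
<div class='price-old'>20.00</div>
<div class='price-new'>10.00</div>
<div class='price-tax'>8.00</div>
</div>
<div class='price'>
40.00 <div class='price-tax'>25.00</div>
</div>
</priceData>"""


def prices(root):
    found = []
    for div in root.findall(".//div[@class='price']"):
        # First branch: text of any price-new child.
        found += [new.text for new in div.findall("div[@class='price-new']")]
        # Second branch: non-blank text nodes directly under the price div,
        # i.e. its leading text plus the tail text after each child element.
        own = [div.text] + [child.tail for child in div]
        found += [t.strip() for t in own if t and t.strip()]
    return found


print(prices(ET.fromstring(XML)))  # ['10.00', '40.00']
```

The strip() call trims the surrounding whitespace. The XPath text() node for the second div still contains it, so a scraper using the original expression may want to normalize the value as well.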

How to take XPath of element that is between br tags with <strong> in account

My code is like this,
<div>
<strong> Text1: </strong>
1234
<br>
<strong> Text2: </strong>
5678
<br>
</div>
where the numbers 1234 and 5678 are generated dynamically. When I take the XPath of "Text2 : 5678", the browser gives me something like /html/body/div[7]/div/div[2]/div/div[2]/div[2]/br[2]. This does not work for me. I need an XPath for only "Text2 : 5678". Any help will be appreciated. (I am using Selenium WebDriver and C# to code my test script.)

I second @Anil's comment above. The text "Text2:" is retrievable because it is inside the "strong" element. But "5678" is a text node that belongs to the div itself; it is not the inner content of either "strong" or "br".
Hence, to retrieve the text "Text2: 5678", you'll have to retrieve the text of the "div" and cut out the part you need.
Below is a Java code snippet to retrieve the text:
WebElement ele = driver.findElement(By.xpath("//div"));
System.out.print(ele.getText().split("\n")[1]); // Splitting using newline as the split string.
I hope you can translate the above into C#.
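To make the point about which node owns "5678" concrete, here is a small Python sketch. It is not Selenium code: it parses the question's markup with the standard-library ElementTree module, where the text that follows a closing tag is exposed as that element's tail. The <br> tags are written as <br/> so the XML parser accepts them, and the function name labelled_values is made up for this example.

```python
import xml.etree.ElementTree as ET

# Markup from the question, with <br> written as <br/> for the XML parser.
HTML = """<div>
<strong> Text1: </strong>
1234
<br/>
<strong> Text2: </strong>
5678
<br/>
</div>"""


def labelled_values(div):
    """Map each <strong> label to the text node that directly follows it."""
    return {
        strong.text.strip(): (strong.tail or "").strip()
        for strong in div.findall("strong")
    }


values = labelled_values(ET.fromstring(HTML))
print("Text2: " + values["Text2:"])  # Text2: 5678
```

The number is reachable only through the div, as the text that trails the strong element, which is why an element-only locator cannot point at it directly.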

xpath getting the name in a certain pattern

I want to get a class name like the following:
class="hostHostGrid0_body"
The integer between hostHostGrid and _body can change, but everything else must appear exactly as shown, in that order.
How can I achieve this?

In XPath 1.0 you can use this:
//*[starts-with(@class,'hostHostGrid') and substring-after(@class,'_') = 'body']
to select any element whose class attribute holds just this one class. It matches elements of any name, in any context. It will match all three elements below:
<div class="hostHostGrid0_body">
<span class="hostHostGrid123_body"/>
<b class="hostHostGrid1_body">xxx</b>
</div>
Limitations: it doesn't restrict what sits between 'hostHostGrid' and '_body' to a number. It can be anything, including spaces (e.g. it will also match class="hostHostGrid xyz abc_body").
This one allows for the class occurring among other classes:
//*[contains(substring-before(@class,'_body'),'hostHostGrid')]
It will match:
<div class="other-class hostHostGrid0_body">
<span class="hostHostGrid123_body other-class"/>
<b class="hostHostGrid1_body">xxx</b>
</div>
(it also has the same limitations - will match anything between 'hostHostGrid' and '_body')
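If the results are post-processed in a host language, the limitation above can be tightened there. The following Python sketch is one possible approach, not part of the XPath answer: it splits each class attribute into tokens and keeps only elements where one token matches hostHostGrid, then digits, then _body. The sample markup adds a fourth element, the <i>, purely as a negative case, and the names PATTERN and grid_elements are made up for this example.

```python
import re
import xml.etree.ElementTree as ET

# Sample markup from the answer plus one element that should NOT match.
HTML = """<div class="other-class hostHostGrid0_body">
<span class="hostHostGrid123_body other-class"/>
<b class="hostHostGrid1_body">xxx</b>
<i class="hostHostGrid xyz abc_body">unwanted</i>
</div>"""

# One class token: the fixed prefix, at least one digit, the fixed suffix.
PATTERN = re.compile(r"hostHostGrid\d+_body")


def grid_elements(root):
    """Return the tag names of elements carrying a matching class token."""
    return [
        el.tag
        for el in root.iter()
        if any(PATTERN.fullmatch(tok) for tok in el.get("class", "").split())
    ]


print(grid_elements(ET.fromstring(HTML)))  # ['div', 'span', 'b']
```

Matching per token also avoids false positives where 'hostHostGrid' and '_body' come from two different classes.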
