I am trying to scrape an HTML page in a scenario where I only have consecutive tags with information.
From the following code I would like to get the text for the tags (e.g. Name1, Name2, ...), taking into consideration:
"a" followed by "span" gives information about that ID being a Customer or not.
"a" followed by "a" means that ID is anonymous.
<span class="list">
<em>List 1:</em>
</span>
Name1,
Name2
<span class="small">(Customer)</span>,
Name3
<span class="small">(Non Customer)</span>,
Name4
<span class="small">(Customer)</span>,
Name5
Name6
<span class="small">(Non Customer)</span>
I'm using the following XPATH to try to match "a" followed by "span"
//a[contains(@href,'ID/') and ./following-sibling::span[1][text() = '(Customer)']]/text()
This will return Name1, Name2 and Name4, even if Name1 is not a Customer. What am I doing wrong?
It's because the first following-sibling span of that Name1 does indeed equal "(Customer)".
What you should do instead is find the first following sibling (*[1]) and check to see if that sibling is a span ([self::span]) and if it is, then check to see if it's equal to "(Customer)"...
//a[contains(@href,'ID/') and ./following-sibling::*[1][self::span][text() = '(Customer)']]/text()
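The difference between "first following span" and "first following sibling, which must be a span" can be sketched in plain Python. This uses only the standard library's `xml.etree.ElementTree`, which has no `following-sibling` axis, so the sibling check is written by hand. The `<a href="ID/n">` wrappers and their href values are assumptions, since the markup in the question shows only the link text.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the <a href="ID/n"> wrappers are assumed, not taken
# from the question, which shows only the names and the spans.
HTML = """
<div>
  <a href="ID/1">Name1</a>,
  <a href="ID/2">Name2</a>
  <span class="small">(Customer)</span>,
  <a href="ID/3">Name3</a>
  <span class="small">(Non Customer)</span>,
  <a href="ID/4">Name4</a>
  <span class="small">(Customer)</span>,
  <a href="ID/5">Name5</a>
  <a href="ID/6">Name6</a>
  <span class="small">(Non Customer)</span>
</div>
"""

def customers(root):
    """Names whose immediately following sibling element is <span>(Customer)</span>."""
    children = list(root)
    names = []
    # Pair every child with the element right after it.
    for node, nxt in zip(children, children[1:]):
        # Mirrors following-sibling::*[1][self::span][text() = '(Customer)']
        if (node.tag == "a" and "ID/" in node.get("href", "")
                and nxt.tag == "span" and nxt.text == "(Customer)"):
            names.append(node.text)
    return names

customer_names = customers(ET.fromstring(HTML))
print(customer_names)
```

Name1 is skipped here because the element right after it is another `a`, not a `span`.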
<p class="region-list"><a class="parent xh-highlight" href="/mediterranean-yacht-charters-1548.htm" title="Mediterranean Yachts for Charter - Summer">Mediterranean</a>Croatia, Italy, Montenegro</p>
I have this query:
//div[@class='hide-for-small']/div/div/div/div[1]/div/div/div/p[@class='region-list']/a
Which returns:
Mediterranean
Croatia
Italy
Montenegro
However, I want to exclude the parent which is "Mediterranean" so I want to either say:
1) Skip the first <a> and grab the rest, OR 2) exclude the <a class="parent">
I have been wrestling with [@class!="parent"] but can't seem to get this to work.
Actually, you can do both:
Skip the first <a> node:
//p[@class='region-list']/a[position()>1]/text()
or skip the <a> node with the specific class attribute value:
//p[@class='region-list']/a[not(@class='parent xh-highlight')]/text()
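Both options can be sketched in Python with the standard library's `xml.etree.ElementTree`. The snippet in the question shows only the parent link, so the three region links below are assumed markup. Option 2 here tests for the `parent` class token rather than the exact attribute string, which is a deliberate variation: it keeps working if the second class (`xh-highlight`) is not present in the page source.

```python
import xml.etree.ElementTree as ET

# Assumed markup: only the first <a> comes from the question; the three
# region links and their href values are hypothetical.
HTML = """
<p class="region-list">
  <a class="parent xh-highlight" href="/mediterranean-yacht-charters-1548.htm">Mediterranean</a>
  <a href="#">Croatia</a>
  <a href="#">Italy</a>
  <a href="#">Montenegro</a>
</p>
"""

links = ET.fromstring(HTML).findall("a")

# Option 1: skip the first <a>, like a[position()>1]
by_position = [a.text for a in links[1:]]

# Option 2: drop any link whose class list contains the token "parent".
# This is looser than the exact string comparison in the XPath above.
by_class = [a.text for a in links if "parent" not in a.get("class", "").split()]

print(by_position)
print(by_class)
```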
How can I get the texts separately with XPath?
The code I tried returns a single result with all the info combined instead of separate results:
Post xpath: div
Title xpath: ./p/strong/child::node()
Desc xpath: ./ul/child::node()
Desired:
Title1
Desc1
Title2
Desc2
Got:
Title1 Title2
Desc1 Desc2
HTML:
<div>
<p><strong>Title1</strong></p>
<ul>
<li>Desc1</li>
</ul>
<p><strong>Title2</strong></p>
<ul>
<li>Desc2</li>
</ul>
</div>
It's not really clear what your "Desired" example represents with the pairs labeled 1 and 2, but if you are just trying to select each title text followed by the text of its immediately following ul/li, you can use an XPath 2.0 expression such as:
//div/p/(
./normalize-space(string()),
./(following-sibling::ul[1])/normalize-space(string()))
For each p, it selects the entire text content of the p as a string, then selects the first following ul sibling of that p and takes its entire string content. This can easily be refined to select only the p/strong content (instead of all of the p), and similarly for ul/li.
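The same pairing can be sketched in Python for cases where no XPath 2.0 processor is available. This is a minimal version using the standard library's `xml.etree.ElementTree`; the helper names `text_of` and `title_desc_pairs` are made up for the illustration.

```python
import xml.etree.ElementTree as ET

HTML = """
<div>
<p><strong>Title1</strong></p>
<ul>
<li>Desc1</li>
</ul>
<p><strong>Title2</strong></p>
<ul>
<li>Desc2</li>
</ul>
</div>
"""

def text_of(elem):
    """Whitespace-normalized string value, like normalize-space(string())."""
    return " ".join("".join(elem.itertext()).split())

def title_desc_pairs(root):
    children = list(root)
    pairs = []
    for i, node in enumerate(children):
        if node.tag != "p":
            continue
        # First later sibling that is a <ul>, like following-sibling::ul[1]
        ul = next((c for c in children[i + 1:] if c.tag == "ul"), None)
        # Unlike the XPath, a missing <ul> yields an empty string here.
        pairs.append((text_of(node), text_of(ul) if ul is not None else ""))
    return pairs

pairs = title_desc_pairs(ET.fromstring(HTML))
for title, desc in pairs:
    print(title, desc)
```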
I have XML data that looks like this
<priceData>
<div class='price'>
<div class='price-old'>20.00</div>
<div class='price-new'>10.00</div>
<div class='price-tax'>8.00</div>
</div>
<div class='price'>
40.00 <div class='price-tax'>25.00</div>
</div>
</priceData>
I want to use Xpath to extract data for "price-new" from the first price div, and value 40.00 from the second price div. This must be done using single expression.
I tried expressions like
//div[contains(@class, 'price') and not(contains(@class, 'tax')) and not(contains(@class, '-old'))]
and
//div[contains(@class, 'price') and not(contains(@class, 'tax')) and not(descendant::div[contains(@class, '-old') and not(contains(@class, '-tax'))]) and not(contains(@class, '-old'))]
and some others but I can't get it to work how it is supposed to.
I always end up with fetching extra nodes from the first case and I only need the single node (price-new or price if there are no more nodes in it).
You can try using xpath union (|) to combine 2 queries into one. Given markup in the question as XML input, the following xpath (formatted for readability) :
//div[@class='price']/div[@class='price-new']/text()
|
//div[@class='price']/text()[normalize-space()]
returned 'expected' result in xpath tester :
Text='10.00'
Text='40.00'
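For comparison, here is a rough Python equivalent using the standard library's `xml.etree.ElementTree`. It is not a literal translation of the union: it prefers `price-new` when present and otherwise falls back to the price div's own leading text, and `price.text` only covers text before the first child element. The function name `current_prices` is made up for the sketch.

```python
import xml.etree.ElementTree as ET

XML = """
<priceData>
    <div class='price'>
        <div class='price-old'>20.00</div>
        <div class='price-new'>10.00</div>
        <div class='price-tax'>8.00</div>
    </div>
    <div class='price'>
        40.00 <div class='price-tax'>25.00</div>
    </div>
</priceData>
"""

def current_prices(root):
    prices = []
    for price in root.findall(".//div[@class='price']"):
        new = price.find("div[@class='price-new']")
        if new is not None:
            # Case 1: an explicit price-new child exists.
            prices.append(new.text.strip())
        elif price.text and price.text.strip():
            # Case 2: the price is the div's own non-blank leading text.
            prices.append(price.text.strip())
    return prices

prices = current_prices(ET.fromstring(XML))
print(prices)
```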
My code is like this,
<div>
<strong> Text1: </strong>
1234
<br>
<strong> Text2: </strong>
5678
<br>
</div>
where the numbers 1234 and 5678 are generated dynamically. When I take the XPath of "Text2: 5678", I get something like /html/body/div[7]/div/div[2]/div/div[2]/div[2]/br[2]. This does not work for me. I need an XPath for only "Text2: 5678". Any help will be appreciated. (I am using Selenium WebDriver and C# for my test script.)
I second @Anil's comment above. The text "Text2:" is retrievable because it is inside the "strong" element. But "5678" is a text node directly under the div and is not the innerHTML of either "strong" or "br".
Hence, to retrieve the text "Text2: 5678", you'll have to retrieve the innerHTML/text of the "div" and process it to get the required text.
Below is a Java code snippet to retrieve the text:-
WebElement ele = driver.findElement(By.xpath("//div"));
System.out.print(ele.getText().split("\n")[1]); // Split on newlines and take the second line.
I hope you can formulate the above in C#.
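Indexing with `[1]` depends on the rows staying in the same order. A small variation is to pick the line by its label instead. The sketch below shows only the string handling, in Python, and assumes the div's visible text comes back as one line per row; the value of `div_text` and the helper `line_for` are hypothetical.

```python
# Assumed value of the div's visible text; in a real test this would come
# from the WebDriver element rather than a literal.
div_text = "Text1: 1234\nText2: 5678"

def line_for(label, text):
    """Return the first line that starts with the given label, or None."""
    for line in text.splitlines():
        if line.strip().startswith(label):
            return line.strip()
    return None

text2 = line_for("Text2:", div_text)
print(text2)
```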
I want to get a class name like the following:
class="hostHostGrid0_body"
The integer in between hostHostGrid and _body can change, but everything else I want it just like that in the order.
How can I achieve this?
In XPath 1.0 you can use this:
//*[starts-with(@class,'hostHostGrid') and substring-after(@class,'_') = 'body']
to select any element containing one class. It will match tags in any context. It will match all three elements below:
<div class="hostHostGrid0_body">
<span class="hostHostGrid123_body"/>
<b class="hostHostGrid1_body">xxx</b>
</div>
Limitations: it doesn't restrict what is between them to a number. It can be anything, including spaces (ex: it will also match this: class="hostHostGrid xyz abc_body")
This one allows for the class occurring among other classes:
//*[contains(substring-before(@class,'_body'),'hostHostGrid')]
It will match:
<div class="other-class hostHostGrid0_body">
<span class="hostHostGrid123_body other-class"/>
<b class="hostHostGrid1_body">xxx</b>
</div>
(it also has the same limitations - will match anything between 'hostHostGrid' and '_body')
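If the matching can be done outside XPath, the "digits only" restriction is easy to add with a regular expression applied to each class token. This is a sketch in Python under that assumption; the helper name `has_grid_body_class` is made up, and it checks a class attribute string rather than querying a document.

```python
import re

# One class token: "hostHostGrid", one or more digits, then "_body".
GRID_BODY = re.compile(r"hostHostGrid\d+_body")

def has_grid_body_class(class_attr):
    """True if any whitespace-separated class token matches the pattern."""
    return any(GRID_BODY.fullmatch(token) for token in class_attr.split())

for value in ["hostHostGrid0_body",
              "other-class hostHostGrid123_body",
              "hostHostGrid xyz abc_body"]:
    print(value, has_grid_body_class(value))
```

Unlike the two XPath expressions, this rejects `class="hostHostGrid xyz abc_body"` because no single token has the full shape.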