XPath: extract data from a node

I am trying to extract some data from a webpage. The structure of the webpage is as follows:
<li id="yui_3_4_1_1_1326860702769_9706">
<span id="yui_3_4_1_1_1326860702769_9705">Sales rank: </span>
2
</li>
http://www.barnesandnoble.com/w/enders-game-orson-scott-card/1100353963?ean=9781429963930
I need to extract the value "2" from the markup above, using "Sales rank" as the identifier.
Thanks for all the help.

Try this:
//descendant::*[@class='product-details box']/ul/li[span='Sales rank: ']/text()

You can try using:
//div[@class="product-details"]/ul/li[9]
Not tested though.

Use:
//li[@id='yui_3_4_1_1_1326860702769_9706']
/span[. = 'Sales rank: ']
/following-sibling::text()[1]
This selects the first following-sibling text node of any span element with string value 'Sales rank: ', that is a child of any li element whose id attribute has the value of 'yui_3_4_1_1_1326860702769_9706' .

Try this; if you have any questions, please let me know:
`//li[@id]/*[contains(text(), 'Sales rank')]/following-sibling::node()[1]`
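As a rough illustration of the following-sibling idea from the answers above, here is a minimal Python sketch using only the standard library. ElementTree implements just a subset of XPath (no text() or following-sibling steps), so the sketch reads the span's `tail` attribute, which holds the text directly after the span. The helper name `text_after_label` is made up for this example, and the markup is the fragment from the question rather than the full page.

```python
import xml.etree.ElementTree as ET

# Sample markup copied from the question (not the full page).
SAMPLE = """<li id="yui_3_4_1_1_1326860702769_9706">
<span id="yui_3_4_1_1_1326860702769_9705">Sales rank: </span>
2
</li>"""


def text_after_label(li, label):
    """Return the text that directly follows the child <span> whose text is `label`.

    Hypothetical helper: returns None when no such span exists.
    """
    for span in li.findall("span"):
        if span.text == label:
            # `tail` is the text between this element's end tag and the next
            # sibling element (or the parent's end tag).
            return (span.tail or "").strip()
    return None


li = ET.fromstring(SAMPLE)
print(text_after_label(li, "Sales rank: "))  # prints 2
```

A full XPath 1.0 engine would evaluate the following-sibling expressions directly; the `tail` lookup is only a stand-in for that step.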

Related

Please help to extract date using xpath

<div class="postbodytop">
<a class="xxxxxxxxxxxxxxxx" href="xxxxxxxxxxxxxx">tonyd</a>
"posted this 4 minutes ago "
<span class="hidden-xs"> </span>
</div>
Hello, I want to extract the "posted this 4 minutes ago" or just "4 minutes" using xpath. Can anybody help me? Thank you
The div whose class equals postbodytop contains three child nodes of interest: an a element, a text node, and a span. Your path should start at the div and then select the child text node, for which the appropriate test is text().
div/text()
Of course this is just a fragment of a bigger page, so your XPath may need something at the start, e.g. /html/body/. If there are other div elements at the same level as the <div class="postbodytop">, be more specific about the div, e.g. div[@class="postbodytop"] instead of just div in that XPath expression.
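The div/text() idea can be sketched in Python with the standard library. This is only an approximation: ElementTree has no text() node test, so the direct text children are rebuilt from `text` and `tail`. The helper `direct_text` and the regular expression for "4 minutes" are assumptions added for illustration, and the sample uses matching quotes around the class value so that it parses.

```python
import re
import xml.etree.ElementTree as ET

# Fragment from the question, with the class attribute quoted consistently.
SAMPLE = """<div class="postbodytop">
<a class="xxxxxxxxxxxxxxxx" href="xxxxxxxxxxxxxx">tonyd</a>
"posted this 4 minutes ago "
<span class="hidden-xs"> </span>
</div>"""


def direct_text(element):
    """Return the non-blank text nodes that are direct children of `element`."""
    # element.text precedes the first child; each child's tail follows that child.
    pieces = [element.text] + [child.tail for child in element]
    return [p.strip() for p in pieces if p and p.strip()]


div = ET.fromstring(SAMPLE)
texts = direct_text(div)
print(texts)

# Hypothetical follow-up step: pull just "4 minutes" out of the matched text.
match = re.search(r"\d+\s+\w+(?= ago)", texts[0])
print(match.group(0) if match else None)
```

Whitespace-only text nodes are dropped here; the XPath div/text() would return them as well.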

replace full string in xpath just get before

I am looking for a way to remove part of a string value obtained from a webpage with an XPath function.
I have this :
<div id="article_body" class="">
This my wonderful sentence, however here the string i dont want :
<br><br>
<div class="typo">Found a typo in the article? Click here.
</div>
</div>
So at the end I would have
This my wonderful sentence, however here the string i dont want :
I get the text with
//*[@id="article_body"]
Then I try to use replace:
//replace('*[@id="article_body"]','Found a typo in the article? ', )
But it doesn't work, so I think it's because I'm a newbie with XPath...
How can I do that please?
It appears that you are getting the computed string value of the selected div element.
The string-value of an element node is the concatenation of the string-values of all text node descendants of the element node in document order.
If you don't want to include the text() from the descendant nodes, and only want the text() that are immediate children of the div, then adjust your XPath:
//*[@id="article_body"]/text()
Otherwise, you could use substring-before():
substring-before(//*[@id="article_body"], 'Found a typo in the article?')
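To make the difference between the string-value and the direct text children concrete, here is a small Python sketch using the standard library. The function names are invented for this example, `substring_before` only imitates the XPath function of the same name (it returns an empty string when the marker is absent), and the `<br>` tags are written as `<br/>` because an XML parser would reject the original markup.

```python
import xml.etree.ElementTree as ET

# The question's markup, with <br> written as <br/> so an XML parser accepts it.
SAMPLE = """<div id="article_body" class="">
This my wonderful sentence, however here the string i dont want :
<br/><br/>
<div class="typo">Found a typo in the article? Click here.
</div>
</div>"""


def string_value(element):
    """Concatenate all descendant text nodes in document order."""
    return "".join(element.itertext())


def substring_before(text, marker):
    """Imitate XPath substring-before(): empty string if `marker` is absent."""
    before, found, _ = text.partition(marker)
    return before if found else ""


def own_text(element):
    """Concatenate only the text nodes that are direct children of `element`."""
    pieces = [element.text] + [child.tail for child in element]
    return "".join(p for p in pieces if p)


div = ET.fromstring(SAMPLE)
print(substring_before(string_value(div), "Found a typo in the article?").strip())
print(own_text(div).strip())
```

Both calls print the wanted sentence for this sample; they would differ if the unwanted text were a direct child of the div rather than inside the nested div.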

Use xpath or xquery to show text in title attribute

I'd like to use xquery (I believe) to output the text from the title attribute of an html element.
Example:
<div class="rating" title="1.0 stars">...</div>
I can use xpath to select the element, but it tries to output the info between the div tags. I think I need to use xquery to output the "1.0 stars" text from the title attribute.
There's gotta be a way to do this. My Google skills are proving ineffective in coming up with an answer.
Thanks.
XPath: //div[@class='rating']/@title
This will give you the title text for every div with a class of "rating".
Addendum (following from comments below):
If the class has other, additional text in it, in addition to "rating", then you can use something like this:
//div[contains(concat(' ', normalize-space(@class), ' '), ' rating ')]
(Hat tip to How can I match on an attribute that contains a certain string?).
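For comparison, a minimal Python sketch of both selections with the standard library. The sample document is made up for illustration (the "featured" and "ratings" classes are not from the question), and the token test is written in plain Python because ElementTree's XPath subset has no contains() or concat().

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one exact match, one multi-class match, one near miss.
SAMPLE = """<p>
<div class="rating" title="2.0 stars">...</div>
<div class="rating featured" title="1.0 stars">...</div>
<div class="ratings" title="not a rating">...</div>
</p>"""


def titles_exact(root):
    """Titles of divs whose class attribute is exactly 'rating'."""
    return [d.get("title") for d in root.findall(".//div[@class='rating']")]


def titles_by_token(root, token="rating"):
    """Titles of divs whose space-separated class list contains `token`."""
    return [
        d.get("title")
        for d in root.iter("div")
        if token in d.get("class", "").split()
    ]


root = ET.fromstring(SAMPLE)
print(titles_exact(root))     # ['2.0 stars']
print(titles_by_token(root))  # ['2.0 stars', '1.0 stars']
```

Splitting the class value on whitespace plays the same role as the concat/normalize-space trick: it matches "rating" as a whole token and ignores "ratings".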
You should use:
let $XML := <p><div class="rating" title="2.0 stars">sdfd</div><div class="rating" title="1.0 stars">sdfd</div></p>
for $title in $XML//@title
return
<p>{data($title)}</p>
to get output:
<p>2.0 stars</p>
<p>1.0 stars</p>

How to get element's index by xpath?

I have the following structure:
<div id='list'>
<div class='column'>aaa</div>
<div class='column'>bbb</div>
...
<div class='column'>jjj</div>
</div>
I was wondering whether there is a way to write an XPath query that returns the index of a requested element within the "list" element.
That is, I would ask for the location of the class='column' element whose text value is aaa and get back 0 or 1...
Thanks
You could just count the div elements preceding the element you're looking for:
count(div[@id = 'list']/div[@id = 'myid']/preceding-sibling::div)
You can count the preceding siblings:
count(//div[@id="list"]/div[@id="3"]/preceding-sibling::*)
Selenium way to evaluate the position:
int position = driver.findElements(By.xpath("//div[@class='column' and text()='jjj']/preceding-sibling::div[@class='column']")).size() + 1;
System.out.println(position);
You can count how many "preceding-siblings" that are also of class 'column', by using this xpath:
//div[@class='column' and text()='jjj']/preceding-sibling::div[@class='column']
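The counting idea can be sketched in Python with the standard library: the number of preceding siblings of an element equals its zero-based index among those siblings. The helper `index_of` is hypothetical, and the sample leaves out the elided "..." rows from the question.

```python
import xml.etree.ElementTree as ET

# The question's structure, without the elided rows.
SAMPLE = """<div id="list">
<div class="column">aaa</div>
<div class="column">bbb</div>
<div class="column">jjj</div>
</div>"""


def index_of(parent, css_class, text):
    """Zero-based index of the matching child among divs of that class, or -1."""
    columns = parent.findall(f"div[@class='{css_class}']")
    for index, column in enumerate(columns):
        if column.text == text:
            # `index` equals the number of preceding siblings with this class.
            return index
    return -1


root = ET.fromstring(SAMPLE)
print(index_of(root, "column", "aaa"))  # 0
print(index_of(root, "column", "jjj"))  # 2
```

Add 1 to the result for an XPath-style position, as the Selenium answer does.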

Extracting text in between nodes through XPath

I'm trying to read specific parts of a webpage through XPath. The page is not very well-formed but I can't change that...
<root>
<div class="textfield">
<div class="header">First item</div>
Here is the text of the <strong>first</strong> item.
<div class="header">Second item</div>
<span>Here is the text of the second item.</span>
<div class="header">Third item</div>
Here is the text of the third item.
</div>
<div class="textfield">
Footer text
</div>
</root>
I want to extract the text of the various items, i.e. the text in between the header divs (e.g. 'Here is the text of the first item.'). I've used this XPath expression so far:
//text()[preceding::*[@class='header' and contains(text(),'First item')] and following::*[@class='header' and contains(text(),'Second item')]]
However, I cannot hardcode the ending item name because in the pages I want to scrape the order of the items differ (e.g. 'First item' may be followed by 'Third item').
Any help on how to adapt my XPath query would be greatly appreciated.
Found it!
//text()[preceding::*[@class='header' and contains(text(),'First item')]][following::*[preceding::*[@class='header'][1][contains(text(),'First item')]]]
Indeed your solution, Aleh, won't work for tags inside the text.
Now, the one remaining case is the last item, which is not followed by an element with class=header; so it will include all text found 'till the end of the document. Ideas?
//*[@class='header' and contains(text(),'First item')]/following::text()[1] will select the first text node after <div class="header">First item</div>.
//*[@class='header' and contains(text(),'Second item')]/following::text()[1] will select the first text node after <div class="header">Second item</div>, and so on.
EDIT: Sorry, this will not work for the <strong> cases. I will update my answer.
EDIT2: Used @Michiel's part. It looks ugly but works: //div[@class='textfield'][1]//text()[preceding::*[@class='header' and contains(text(),'First item')]][following::*[preceding::*[not(self::strong) and not(self::span)][1][contains(text(),'First item')]] or not(//*[preceding::*[@class='header' and contains(text(),'First item')]])]
Seems that this should be solved with a better solution :)
For the sake of completeness, the final query, composed of various suggestions throughout the thread:
//*[
  @class='textfield' and position() = 1
]
//text() [
  preceding::*[
    @class='header' and contains(text(),'First item')
  ]
][
  following::*[
    preceding::*[
      @class='header'
    ][1][
      contains(text(),'First item')
    ]
  ]
]
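As an alternative to a single XPath expression, the same extraction can be done procedurally. The Python sketch below uses only the standard library and is not the thread's XPath: it walks the children of the first textfield div and collects text from the requested header until the next header. The helper names are invented for this example. Because the walk stays inside one container, the last item stops at the end of that div instead of running on into the footer.

```python
import xml.etree.ElementTree as ET

# Sample document copied from the question.
SAMPLE = """<root>
<div class="textfield">
<div class="header">First item</div>
Here is the text of the <strong>first</strong> item.
<div class="header">Second item</div>
<span>Here is the text of the second item.</span>
<div class="header">Third item</div>
Here is the text of the third item.
</div>
<div class="textfield">
Footer text
</div>
</root>"""


def is_header(element):
    return element.get("class") == "header"


def text_after_header(container, title):
    """Collect the text between the header containing `title` and the next header."""
    pieces = []
    collecting = False
    for child in container:
        if is_header(child):
            if collecting:
                break  # the next header ends the section
            collecting = title in (child.text or "")
            if collecting:
                pieces.append(child.tail or "")
        elif collecting:
            # Inline markup such as <strong> or <span>: keep its text and its tail.
            pieces.append("".join(child.itertext()))
            pieces.append(child.tail or "")
    # Collapse runs of whitespace left over from the markup layout.
    return " ".join("".join(pieces).split())


root = ET.fromstring(SAMPLE)
first_field = root.find("div[@class='textfield']")  # first matching div only
for name in ("First item", "Second item", "Third item"):
    print(text_after_header(first_field, name))
```

This assumes the headers are direct children of the textfield div, as in the sample; nested headers would need a different walk.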

Resources