Extracting text with XPath with different nodes

I'm currently trying to extract some text from a website with XPath and RapidMiner.
I want to extract the "270 €" from the following code:
<dd class="grid-item three-fifths">
<span class="is1-operator">+</span>
270 €
</dd>
I tried the following, which didn't work:
//h:dd[@class='grid-item three-fifths']//text()
Thanks for your help :)

Your Xpath returns 3 text nodes:
""
"+"
"270€"
Try the XPath below to fetch only "270€":
//h:dd[@class='grid-item three-fifths']/text()[string-length() > 0]

As the previous answer mentions, a string-length filter can be used, but [string-length() > 0] still returns 3 nodes with //text(): the line-break text node and the '+' text node each contain one character.
[string-length() > 1] should work.
If you are sure about the item's position (in this case it is the 3rd text node), wrap the path in parentheses so the index applies to the whole result:
(//dd[@class='grid-item three-fifths']//text())[3]
If you are sure it is always the last item:
//dd[@class='grid-item three-fifths']/text()[last()]
You can get the text node after the span in the dd:
//dd[@class='grid-item three-fifths']//span/following-sibling::text()
Look for euro sign:
//dd/text()[contains(.,'€')]
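The expressions above can be tried outside RapidMiner. The following is a minimal sketch, assuming the JDK's built-in javax.xml.xpath engine (XPath 1.0) and the question's snippet pasted as well-formed XML without the h: namespace prefix. The class name PriceTextDemo and the helper eval are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class PriceTextDemo {
    // The question's markup as well-formed XML; the euro sign is written as a Unicode escape.
    static final String XML =
            "<dd class=\"grid-item three-fifths\">\n"
            + "<span class=\"is1-operator\">+</span>\n"
            + "270 \u20ac\n"
            + "</dd>";

    // Hypothetical helper: parses the XML and returns the string value of an XPath 1.0 expression.
    static String eval(String xml, String expr) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return XPathFactory.newInstance().newXPath().evaluate(expr, doc);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String dd = "//dd[@class='grid-item three-fifths']";
        // How many descendant text nodes survive each length filter.
        System.out.println(eval(XML, "count(" + dd + "//text()[string-length() > 0])"));
        System.out.println(eval(XML, "count(" + dd + "//text()[string-length() > 1])"));
        // Three ways to reach the price; normalize-space() trims the surrounding line breaks.
        System.out.println(eval(XML, "normalize-space(" + dd + "/text()[last()])"));
        System.out.println(eval(XML, "normalize-space(" + dd + "//span/following-sibling::text())"));
        System.out.println(eval(XML, "normalize-space(//dd/text()[contains(., '\u20ac')])"));
    }
}
```

Wrapping each path in normalize-space() is a choice made for this sketch: the selected text node still carries the line breaks around the price, and RapidMiner may or may not trim them for you.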

Related

xpath handle double quotes with some other tags

I have this html sample
<html>
<body>
....
<p id="book-1" class="abc">
<b>
book-1
section
</b>
"I have a lot of "
<i>different</i>
"text, and I want "
<i>all</i>
" text and we may or may not have italic surrounded text."
</p>
....
The XPath I currently have is this:
@"/html[1]/body[1]/p[1]/text()"
this gives this result:
I have a lot of
but I want this result:
I have a lot of different text, and I want all text and we may or may not have italic surrounded text.
Thanks for your help.
In XPath 2 and higher you could use string-join(/html[1]/body[1]/p[1]/b/following-sibling::node(), ''), I think. It is not quite clear which nodes you want, but that would select all sibling nodes following the b child of the p and then concatenate their string values into one.
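If only an XPath 1.0 engine is available, string-join cannot be used, but the same concatenation can be done in the host language. Below is a rough sketch of that workaround, assuming the JDK's javax.xml.xpath engine and a well-formed version of the sample in which the surrounding spaces are part of the text nodes. The class name JoinSiblingsDemo and the method joinFollowingSiblings are invented for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class JoinSiblingsDemo {
    // Adapted sample: the quoted strings of the question become plain text nodes.
    static final String XML =
            "<html><body><p id=\"book-1\" class=\"abc\"><b>book-1 section</b>"
            + "I have a lot of <i>different</i> text, and I want <i>all</i>"
            + " text and we may or may not have italic surrounded text.</p></body></html>";

    // Selects every sibling node after the <b> child and concatenates the string values.
    static String joinFollowingSiblings(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                    "/html[1]/body[1]/p[1]/b/following-sibling::node()",
                    doc, XPathConstants.NODESET);
            StringBuilder joined = new StringBuilder();
            for (int i = 0; i < nodes.getLength(); i++) {
                // getTextContent() yields the text of a text node or of an element's descendants.
                joined.append(nodes.item(i).getTextContent());
            }
            return joined.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(joinFollowingSiblings(XML));
    }
}
```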

Xpath get element above

suppose I have this structure:
<div class="a" attribute="foo">
<div class="b">
<span>Text Example</span>
</div>
</div>
In xpath, I would like to retrieve the value of the attribute "attribute" given I have the text inside: Text Example
If I use this xpath:
.//*[@class='a']//*[text()='Text Example']
It returns the element span, but I need the div.a, because I need to get the value of the attribute through Selenium WebDriver
There are a lot of ways to do this.
Say the text Text Example is given; you can locate the div through that text:
//span[text()='Text Example']/../.. --> If you know it is 2 levels up
OR
//span[text()='Text Example']/ancestor::div[@class='a'] --> If you don't know how many levels up this `div` is
The two XPaths above are useful when you want to locate the element via the text Text Example. If you do not need to go through the text, you can select the element directly:
//div[@class='a']
Your question already contains the answer:
but I need the div.a,
Try this:
driver.findElement(By.cssSelector("div.a")).getAttribute("attribute");
A CSS selector gives the best result here.
Or else try the following XPath (note that it matches any class value containing the letter a):
//div[contains(@class, 'a')]
If you want the attribute of the div.a whose descendant span contains a given text, try the following:
driver.findElement(By.xpath("//div[@class = 'a' and descendant::span[text() = 'Text Example']]")).getAttribute("attribute");
Hope it helps..:)
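The ancestor and descendant approaches above can be checked without a browser. This is a small sketch, assuming the JDK's javax.xml.xpath engine instead of Selenium WebDriver and the question's markup as the whole document. The class name AncestorAttributeDemo and the helper eval are illustrative only.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class AncestorAttributeDemo {
    static final String XML =
            "<div class=\"a\" attribute=\"foo\">"
            + "<div class=\"b\"><span>Text Example</span></div>"
            + "</div>";

    // Hypothetical helper: returns the string value of an XPath 1.0 expression.
    static String eval(String xml, String expr) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return XPathFactory.newInstance().newXPath().evaluate(expr, doc);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Walk up from the span to the enclosing div.a, then read its attribute.
        System.out.println(eval(XML,
                "//span[text()='Text Example']/ancestor::div[@class='a']/@attribute"));
        // Start at div.a and require a matching span somewhere below it.
        System.out.println(eval(XML,
                "//div[@class='a' and descendant::span[text()='Text Example']]/@attribute"));
    }
}
```

Unlike Selenium, this engine can return an attribute node directly, so no getAttribute call is needed in the sketch.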

XPath for Google Results: <em> and description without date

I have 3 questions:
1) How can I use XPath to get the text that is marked bold (<em>) in the Google results? If there's no <em>, nothing should be shown.
2) =XPathOnUrl("https://www.google.de/search?q=KEYWORD&num=10";"//span[@class='st']") This gives me the Google description, but how can I get the description without the <span class="f"> date?
3) I get the description with � as an "ä, ö, ü". How can these letters be displayed?
HTML DOM CODE:-
<span class="st">
<span class="f">18.11.2009 - </span>
This Thursday 19th November
<em>Moonshine</em>
turns 4 years old. I'm proud to say that's 4 years of Malaysian acts pretty much every month. We've ...
</span>
The code I used for this issue
driver.get("https://www.google.de/?gws_rd=ssl#q=moonshine+site:blogspot.com&nu%E2%80%8C%E2%80%8Bm=10");
List<WebElement> ele = driver.findElements(By.xpath("//span[@class='f']/following-sibling::text()"));
ele.toString();
for (int i = 0; i < ele.size(); i++) {
    System.out.println(ele.get(i).getText());
}
This code throws an InvalidSelectorException:
The result of the xpath expression "//span[@class='f']/following-sibling::text()" is: [object Text]. It should be an element.
The following XPath does select only the text, i.e. the description:
//span[@class='f']/following-sibling::text()
However, Selenium cannot return that text, because findElements only returns elements, not text nodes. This is a known open Selenium issue:
[selenium-developer-activity] Issue 5459 in selenium: InvalidSelectorError: The result of the xpath expression is: [object Text]
You can find the issue details at the link below:
http://grokbase.com/t/gg/selenium-developer-activity/13475y4cgj/issue-5459-in-selenium-invalidselectorerror-the-result-of-the-xpath-expression-is-object-text
Use the XPath below to get the dates. It returns all the dates present on the page:
//span[@class='f']/text()
If you just want the description text, use the XPath below:
//span[@class='st' and not(@class='f')]/text()
Hope it will help you :)
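For comparison, an XPath engine that can return text nodes handles both the description and the <em> text directly. The sketch below assumes the JDK's javax.xml.xpath engine rather than Selenium, and a shortened, well-formed copy of the snippet instead of a live Google page. The class name SnippetTextDemo and its methods are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class SnippetTextDemo {
    // Shortened sample of one result snippet.
    static final String XML =
            "<span class=\"st\"><span class=\"f\">18.11.2009 - </span>"
            + "This Thursday 19th November <em>Moonshine</em> turns 4 years old.</span>";

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Joins every child of span.st except the span.f date.
    static String descriptionWithoutDate(String xml) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                    "//span[@class='st']/node()[not(self::span[@class='f'])]",
                    parse(xml), XPathConstants.NODESET);
            StringBuilder text = new StringBuilder();
            for (int i = 0; i < nodes.getLength(); i++) {
                text.append(nodes.item(i).getTextContent());
            }
            return text.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns the text of the first <em>, or an empty string when there is none.
    static String emphasized(String xml) {
        try {
            return XPathFactory.newInstance().newXPath()
                    .evaluate("string(//span[@class='st']/em)", parse(xml));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(descriptionWithoutDate(XML));
        System.out.println(emphasized(XML));
    }
}
```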

Xpath, getting substring-after "closed" at end-html tag/end-node

I want to extract information about Mortal Kombat characters, starting with their weapons.
Sample code:
<ul class="characterInfo">
<li>Name: <b> <span>Lui Kang</span></b></li>
<li>Created by: <b><span>John Tobias</span></b></li>
<li>Battle cry: <b><span><u>Click here</u></span></b></li>
<li>Weapons: <b><span>Dragon sword and nunchaku</span></b></li>
<li>Origin: <b><span>China</span></b> </li>
</ul>
Using Xpath substring-before(substring-after(.,'Weapons: '),',') the extraction becomes
Dragon sword and nunchaku
Origin: China
So I am not using substring-after the correct way. I should end the extraction with the first </span>-node
I have tried substring-before(substring-after(.,'Weapons: '),'</span>') but it isn't returning anything.
I think I am close, can anyone poke me in the right direction?
XPath works on the XML structure of a document, not on the raw text. If the text you want to extract is always inside a <b> element, you can use:
string(//ul[@class = 'characterInfo']/li[starts-with(., 'Weapons:')]/b)
The following is more universal:
substring-after(//ul[@class = 'characterInfo']/li[starts-with(., 'Weapons: ')], 'Weapons: ')
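Both expressions can be checked against the sample list. Here is a minimal sketch, assuming the JDK's javax.xml.xpath engine (XPath 1.0) and the question's markup as a well-formed document. The class name WeaponsDemo and the helper eval are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class WeaponsDemo {
    static final String XML =
            "<ul class=\"characterInfo\">\n"
            + "<li>Name: <b> <span>Lui Kang</span></b></li>\n"
            + "<li>Created by: <b><span>John Tobias</span></b></li>\n"
            + "<li>Battle cry: <b><span><u>Click here</u></span></b></li>\n"
            + "<li>Weapons: <b><span>Dragon sword and nunchaku</span></b></li>\n"
            + "<li>Origin: <b><span>China</span></b> </li>\n"
            + "</ul>";

    // Hypothetical helper: returns the string value of an XPath 1.0 expression.
    static String eval(String xml, String expr) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return XPathFactory.newInstance().newXPath().evaluate(expr, doc);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Pick the <li> by its label, then take the string value of its <b> child.
        System.out.println(eval(XML,
                "string(//ul[@class = 'characterInfo']/li[starts-with(., 'Weapons:')]/b)"));
        // Same <li>, but strip the label from the item's own string value.
        System.out.println(eval(XML,
                "substring-after(//ul[@class = 'characterInfo']/li[starts-with(., 'Weapons: ')], 'Weapons: ')"));
    }
}
```

Selecting the single <li> first is what stops the result from running on into the Origin item: the string functions then see only that item's text, not the text of the whole list.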

Xquery to extract text in html

I am working on extracting text out of HTML documents and storing it in a database. I am using the Web-Harvest tool for extracting the content, but I am stuck at one point. Inside Web-Harvest I use an XQuery expression in order to extract the data. The HTML document that I am parsing is as follows:
<td><a name="hw">HELLOWORLD</a>Hello world</td>
I need to extract "Hello world" text from the above html script.
I have tried extracting the text in this fashion:
$hw := data($item//a[@name='hw']/text())
However, what I always get is "HELLOWORLD" instead of "Hello world".
Is there a way to extract "Hello world"? Please help.
What if I want to do it this way:
<td>
<a name="hw1">HELLOWORLD1</a>Hello world1
<a name="hw2">HELLOWORLD2</a>Hello world2
<a name="hw3">HELLOWORLD3</a>Hello world3
</td>
I would like to extract the text Hello world2 that is in between hw2 and hw3. I would not want to use text()[3], but is there some way I could extract the text between /a[@name='hw2'] and /a[@name='hw3']?
Your XPath is selecting the text of the a nodes, not the text of the td nodes:
$item//a[@name='hw']/text()
Change it to this:
$item[a/@name='hw']/text()
Update (following comments and update to question):
This XPath selects the second text node from each $item that has an a child whose name attribute is set to hw:
$item[a/@name='hw']//text()[2]
I would not want to use text()[3] but
is there some way I could extract the
text out between /a[@name='hw2'] and
/a[@name='hw3'].
If there is just one text node between the two <a> elements, then the following would be quite simple:
/a[@name='hw3']/preceding::text()[1]
If there is more than one text node between the two elements, then you need to express the intersection of all text nodes following the first element with all text nodes preceding the second element. The formula for the intersection of two node-sets (the Kaysian method of intersection) is:
$ns1[count(.|$ns2) = count($ns2)]
So, in the above expression, just replace $ns1 with:
/a[@name='hw2']/following-sibling::text()
and $ns2 with:
/a[@name='hw3']/preceding-sibling::text()
Lastly, if you really have XQuery (or XPath 2), then this is simply:
/a[@name='hw2']/following-sibling::text()
intersect
/a[@name='hw3']/preceding-sibling::text()
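The XPath 1.0 intersection formula can be exercised as follows. This is only a sketch, assuming the JDK's javax.xml.xpath engine instead of Web-Harvest, with the <td> from the question as the document root and the two node-sets written as //a[...] paths. The class name BetweenAnchorsDemo and the method textBetween are invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class BetweenAnchorsDemo {
    static final String XML =
            "<td>\n"
            + "<a name=\"hw1\">HELLOWORLD1</a>Hello world1\n"
            + "<a name=\"hw2\">HELLOWORLD2</a>Hello world2\n"
            + "<a name=\"hw3\">HELLOWORLD3</a>Hello world3\n"
            + "</td>";

    // Returns the trimmed text between two named anchors using the count-based intersection.
    static String textBetween(String xml, String first, String second) {
        // ns1: text nodes after the first anchor; ns2: text nodes before the second anchor.
        String ns1 = "//a[@name='" + first + "']/following-sibling::text()";
        String ns2 = "//a[@name='" + second + "']/preceding-sibling::text()";
        // A node of ns1 is also in ns2 exactly when adding it to ns2 does not grow ns2.
        String expr = "normalize-space(" + ns1
                + "[count(. | " + ns2 + ") = count(" + ns2 + ")])";
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return XPathFactory.newInstance().newXPath().evaluate(expr, doc);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(textBetween(XML, "hw2", "hw3"));
    }
}
```

Note that normalize-space() takes only the first node of the intersection, which is enough here because a single text node sits between the two anchors.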
This handles your expanded case, while letting you select by attribute value rather than position:
let $item :=
<td>
<a name="hw1">HELLOWORLD1</a>Hello world1
<a name="hw2">HELLOWORLD2</a>Hello world2
<a name="hw3">HELLOWORLD3</a>Hello world3
</td>
return $item//node()[./preceding-sibling::a/@name = "hw2"][1]
This gets the first node that has a preceding-sibling "a" element with a name attribute of "hw2".
