I'm working on a Selenium Project.
I have an HTML file (copied below) for which I have to write an XPath expression that finds the value of the 6th "b" element under the "a" for which:
2nd "b" has a value of "888" and
4th "b" has a value of "World is Great".
HTML Code
<a>
<b>111</b>
<b>222</b>
<b>333</b>
<b>444</b>
<b>ttt</b>
<b>uuu</b>
</a>
<a>
<b>999</b>
<b>888</b>
<b>777</b>
<b>World is Great</b>
<b>aaa</b>
<b>bbb</b>
</a>
I tried the following, but neither worked. Any help will be appreciated.
//a[b='888'][b='World is Great']/b[6]
//a[contains(@b[2], '888') and contains(@b[4], 'World is Great')]/b[6]
Q. (modified): find the value of the 6th "b" element under the "a" for which the 2nd "b" has a value of "888" and the 4th "b" has a value of "World is Great".
Try this:
"//a[b[2]='888' and b[4] = 'World is Great' ]/b[6]"
Attention:
But your first try, //a[b='888'][b='World is Great']/b[6], should also work. If it does not return "bbb" for your example, there is something else wrong.
There is HTML like this:
<div class="paginate_box">
<span class="disabled prev_page">Back</span>
<span class="current">1</span>
<a rel="next" href="page2">2</a>
<a rel="next" href="page3">3</a>
<a class="next_page" rel="next" href="page2">Next</a>
</div>
To get the largest page number, I wrote this:
doc = Nokogiri::HTML(html)
doc.xpath('//div[@class="paginate_box"]/a[not(@class="next_page")]').last.text
#=> "3"
At first I wrote a[@class!="next_page"] instead of a[not(@class="next_page")], but it didn't match the tag. Why doesn't it match? What am I doing wrong?
The problem is that you are using != on an attribute (@class) that is only present on the last anchor. On the other anchors @class selects nothing, so the test is essentially asking whether nothing != 'next_page'.
In XPath, comparing an empty node-set with a string is false for every operator (including != and =), because there is no node for which the comparison could hold.
In your not() version you are asking whether nothing = 'next_page', which is always false (as explained above), so not() makes it true and the element is selected.
You can prove this by adding a class to one of the other anchor tags and then using the != version.
Side note: you can simplify the code by doing the selection entirely in XPath:
doc.xpath('//div[@class="paginate_box"]/a[not(@class="next_page")][last()]').text
#=> "3"
# Or
doc.xpath('//div[@class="paginate_box"]/a[not(@class="next_page")][last()]/text()').to_s
#=> "3"
Also, if the next_page anchor is always present and always last, and the highest page number always precedes it, then you can avoid the condition altogether:
doc.xpath('//div[@class="paginate_box"]/a[position()=last()-1]').text
#=> "3"
Here we are saying: find the anchor in the position right before the last one in that div.
Alternative:
doc.xpath('//div[@class="paginate_box"]/a[last()]/preceding-sibling::a[1]').text
#=> "3"
This finds the last anchor, then walks its preceding anchor siblings in reverse document order (nearest first) and selects the first one in that list.
Let's say I want to scrape the "Weight" attribute from the following content on a website:
<div>
<h2>Details</h2>
<ul>
<li><b>Height:</b>6 ft</li>
<li><b>Weight:</b>6 kg</li>
<li><b>Age:</b>6</li>
</ul>
</div>
All I want is "6 kg". But it's not labeled, and neither is anything around it. But I know that I always want the text after "Weight:". Is there a way of selecting an element based on the text near it or in it?
In pseudocode, this is what it might look like:
require 'selenium-webdriver'
require 'nokogiri'
doc = parsed document
div_of_interest = doc.div where text of h2 == "Details"
element_of_interest = <li> element in div_of_interest with content that contains the string "Weight:"
selected_text = (content in element) minus ("<b>Weight:</b>")
Is this possible?
You can write the following code
p driver.find_elements(xpath: "//li").detect{|li| li.text.include?'Weight'}.text[/:(.*)/,1]
output
"6 kg"
My suggestion is to use Watir, which is a wrapper around the Ruby Selenium bindings, where you can simply write the following code
p b.li(text: /Weight/).text[/:(.*)/,1]
Yes.
require 'nokogiri'
Nokogiri::HTML.parse(File.read(path_to_file))
.css("div > ul > li")
.children # get the children of each 'li': a 'b' element followed by a text node
.each_slice(2) # pair a 'b' item and the text following it
.find{|b, text| b.text == "Weight:"}
.last # extract the text element
.text
will return
"6 kg"
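You can locate the element with pure XPath: use the contains() function, which returns a Boolean indicating whether its second argument is found in the first, and pass it . (the string value of the node, including the text of its descendants) and the target string. Note that text() would not work as the first argument here, because "Weight:" sits inside the <b> child and not in a text node that is a direct child of the <li>. The path also starts with // because the <div> is not the root of the page.
xpath_locator = '//li[contains(., "Weight:")]'
value = driver.find_element(:xpath, xpath_locator).text.partition('Weight:').last
The partition call then leaves just the value after "Weight:".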
I'm currently trying to extract some text from a website with xPath and Rapidminer.
I want to extract the "270€" from the following code:
<dd class="grid-item three-fifths">
<span class="is1-operator">+</span>
270 €
</dd>
I tried the following which didn't work.
//h:dd[@class='grid-item three-fifths']//text()
Thanks for your help :)
Your Xpath returns 3 text nodes:
""
"+"
"270€"
Try the XPath below to fetch only "270€":
//h:dd[@class='grid-item three-fifths']/text()[string-length() > 0]
As mentioned in the previous post, a string-length filter can be used, but [string-length() > 0] still returns the unwanted nodes, because both the line-break and the '+' text nodes contain at least one character.
[string-length() > 1] should work, as long as the whitespace-only node is a single character.
If you are sure about the item position (in this case it is the 3rd text node), wrap the path in parentheses so that [3] indexes the whole result rather than the text children of each parent:
(//dd[@class='grid-item three-fifths']//text())[3]
If you are sure it is always the last item:
//dd[@class='grid-item three-fifths']/text()[last()]
You can get the text node after the span in the dd:
//dd[@class='grid-item three-fifths']//span/following-sibling::text()
Look for euro sign:
//dd/text()[contains(.,'€')]
Given this XML, what XPath returns all elements whose prop attribute contains Foo (the first three nodes):
<bla>
<a prop="Foo1"/>
<a prop="Foo2"/>
<a prop="3Foo"/>
<a prop="Bar"/>
</bla>
//a[contains(@prop,'Foo')]
Works if I use this XML to get results back.
<bla>
<a prop="Foo1">a</a>
<a prop="Foo2">b</a>
<a prop="3Foo">c</a>
<a prop="Bar">a</a>
</bla>
Edit:
Another thing to note is that while the XPath above returns the correct answer for that particular XML, if you want to guarantee that you only get the "a" elements directly inside the "bla" element, you should, as others have mentioned, also use
/bla/a[contains(@prop,'Foo')]
The following, by contrast, searches all "a" elements in your entire XML document, regardless of whether they are nested in a "bla" element:
//a[contains(@prop,'Foo')]
I added this for the sake of thoroughness and in the spirit of stackoverflow. :)
This XPath will give you all nodes that have attributes containing 'Foo' regardless of node name or attribute name:
//attribute::*[contains(., 'Foo')]/..
Of course, if you're more interested in the contents of the attribute themselves, and not necessarily their parent node, just drop the /..
//attribute::*[contains(., 'Foo')]
descendant-or-self::*[contains(@prop,'Foo')]
Or:
/bla/a[contains(@prop,'Foo')]
Or:
/bla/a[position() <= 3]
Dissected:
descendant-or-self::
The Axis - search through every node underneath and the node itself. It is often better to say this than //. I have encountered some implementations where // means anywhere (descendant-or-self of the root node), while others use the default axis.
* or /bla/a
The Tag - a wildcard match, and /bla/a is an absolute path.
[contains(@prop,'Foo')] or [position() <= 3]
The condition within [ ]. @prop is shorthand for attribute::prop, as attribute is another search axis. Alternatively you can select the first 3 by using the position() function.
Have you tried something like:
//a[contains(@prop, "Foo")]
I've never used the contains function before but suspect that it should work as advertised...
John C is the closest, but XPath is case sensitive, so the correct XPath would be:
/bla/a[contains(@prop, 'Foo')]
If you also need to match the content of the link itself, use text():
//a[contains(@href,"/some_link")][text()="Click here"]
/bla/a[contains(@prop, "Foo")]
Try this:
//a[contains(@prop,'Foo')]
That should work for any "a" tags in the document.
For the code above...
//*[contains(@prop,'Foo')]
I am working on extracting text out of HTML documents and storing it in a database. I am using the webharvest tool for extracting the content. However, I am kind of stuck at one point. Inside webharvest I use an XQuery expression in order to extract the data. The HTML document that I am parsing is as follows:
<td><a name="hw">HELLOWORLD</a>Hello world</td>
I need to extract "Hello world" text from the above html script.
I have tried extracting the text in this fashion:
$hw :=data($item//a[@name='hw']/text())
However, what I always get is "HELLOWORLD" instead of "Hello world".
Is there a way to extract "Hello world"? Please help.
What if I want to do it this way:
<td>
<a name="hw1">HELLOWORLD1</a>Hello world1
<a name="hw2">HELLOWORLD2</a>Hello world2
<a name="hw3">HELLOWORLD3</a>Hello world3
</td>
I would like to extract the text Hello world2 that is in between hw2 and hw3. I would not want to use text()[3], but is there some way I could extract the text between /a[@name='hw2'] and /a[@name='hw3']?
Your XPath is selecting the text of the a nodes, not the text of the td nodes:
$item//a[@name='hw']/text()
Change it to this:
$item[a/@name='hw']/text()
Update (following comments and update to question):
This XPath selects the second text node from an $item that has an a child whose name attribute is set to hw:
$item[a/@name='hw']//text()[2]
"I would not want to use text()[3] but is there some way I could extract the text out between /a[@name='hw2'] and /a[@name='hw3']."
If there is just one text node between the two <a> elements, then the following would be quite simple:
/a[@name='hw3']/preceding::text()[1]
If there is more than one text node between the two elements, then you need to express the intersection of all text nodes following the first element with all text nodes preceding the second element. The formula for the intersection of two node-sets (aka the Kaysian method of intersection) is:
$ns1[count(.|$ns2) = count($ns2)]
So, just replace $ns1 in the above expression with:
/a[@name='hw2']/following-sibling::text()
and $ns2 with:
/a[@name='hw3']/preceding-sibling::text()
Lastly, if you really have XQuery (or XPath 2), then this is simply:
/a[@name='hw2']/following-sibling::text()
intersect
/a[@name='hw3']/preceding-sibling::text()
This handles your expanded case, while letting you select by attribute value rather than position:
let $item :=
<td>
<a name="hw1">HELLOWORLD1</a>Hello world1
<a name="hw2">HELLOWORLD2</a>Hello world2
<a name="hw3">HELLOWORLD3</a>Hello world3
</td>
return $item//node()[./preceding-sibling::a/#name = "hw2"][1]
This gets the first node that has a preceding-sibling "a" element with a name attribute of "hw2".