Consider an HTML page:
<html>
apple
orange
drugs
</html>
How can you select "orange" using XPath?
/html/text()[2]
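does not work.
You can't do it directly with a node selection, because apple, orange and drugs all sit in a single text node, so text()[2] matches nothing. You need to call an XPath string function on text() to cut out the string you want: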
substring-after(/html/text()," ") // something like this,
Here is a list of string functions.
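To make the "one text node" point concrete, here is a minimal sketch in Python using only the standard library's xml.etree.ElementTree. That choice is an assumption made for illustration (the thread itself uses XPath and Nokogiri), and the page is treated as well-formed XML. It shows that the three words live in a single text node and have to be cut apart as a string.

```python
import xml.etree.ElementTree as ET

# The three words are not separate nodes: they form ONE text node under <html>.
page = "<html>\napple\norange\ndrugs\n</html>"
root = ET.fromstring(page)

# root.text holds the whole text node, newlines included.
words = root.text.split()  # split on any run of whitespace
second_word = words[1]
print(second_word)  # prints "orange"
```

Splitting on whitespace plays the same role here as substring-after() does in the XPath answer.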
If the strings are separated with <br>, it works:
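require 'nokogiri'

# Each <br> splits the surrounding text into separate text nodes.
doc = Nokogiri::HTML(<<~HTML)
  <html>
  apple
  <br>
  orange
  <br>
  drugs
  </html>
HTML
# Select every text node that is the second text child of its parent.
p doc.xpath('//text()[2]') #=> orange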
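The same effect can be sketched in Python with the standard library's xml.etree.ElementTree (an assumption for illustration, not the Nokogiri code above). ElementTree is an XML parser, so the <br> tags are written self-closed here. Text before the first child is exposed as .text and text after each child as that child's .tail.

```python
import xml.etree.ElementTree as ET

# <br/> is self-closed because ElementTree only accepts well-formed XML.
page = "<html>\napple\n<br/>\norange\n<br/>\ndrugs\n</html>"
root = ET.fromstring(page)

# Leading text is root.text; the text after each <br/> is that element's .tail.
text_nodes = [root.text] + [child.tail for child in root]
cleaned = [t.strip() for t in text_nodes if t and t.strip()]
print(cleaned[1])  # prints "orange"
```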
Related
I have this HTML sample:
<html>
<body>
....
<p id="book-1" class="abc">
<b>
book-1
section
</b>
"I have a lot of "
<i>different</i>
"text, and I want "
<i>all</i>
" text and we may or may not have italic surrounded text."
</p>
....
The XPath I currently have is this:
@"/html[1]/body[1]/p[1]/text()"
This gives this result:
I have a lot of
But I want this result:
I have a lot of different text, and I want all text and we may or may not have italic surrounded text.
Thanks for your help.
In XPath 2.0 and higher you could use string-join(/html[1]/body[1]/p[1]/b/following-sibling::node(), ''), I think. It is not quite clear which nodes you want, but that expression selects all sibling nodes following the b child of the p and then concatenates their string values into one string.
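Where only XPath 1.0 or no XPath engine is available, the same concatenation can be done in code. The following is a minimal sketch in Python with the standard library's xml.etree.ElementTree. The helper name text_after_first_child is hypothetical, and the sample is a condensed, well-formed version of the paragraph from the question.

```python
import xml.etree.ElementTree as ET

SAMPLE = (
    '<p id="book-1" class="abc">'
    '<b>book-1 section</b>'
    'I have a lot of <i>different</i> text, and I want <i>all</i>'
    ' text and we may or may not have italic surrounded text.'
    '</p>'
)

def text_after_first_child(parent, tag):
    """Concatenate the string values of all nodes that follow the first <tag> child."""
    children = list(parent)
    start = next(i for i, child in enumerate(children) if child.tag == tag)
    # Text directly after the <tag> element is stored as its tail.
    parts = [children[start].tail or ""]
    for sibling in children[start + 1:]:
        parts.append("".join(sibling.itertext()))  # text inside the sibling
        parts.append(sibling.tail or "")           # text right after it
    return "".join(parts)

p = ET.fromstring(SAMPLE)
result = text_after_first_child(p, "b")
print(result)
```

This mirrors b/following-sibling::node() joined with an empty separator: the content of the b element itself is skipped, and both plain text and italic text after it are kept in document order.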
How can I get the texts separately with XPath?
The code I tried returns only one result with all the info joined, instead of separate results:
Post xpath: div
Title xpath: ./p/strong/child::node()
Desc xpath: ./ul/child::node()
Desired:
Title1
Desc1
Title2
Desc2
Got:
Title1 Title2
Desc1 Desc2
HTML:
<div>
<p><strong>Title1</strong></p>
<ul>
<li>Desc1</li>
</ul>
<p><strong>Title2</strong></p>
<ul>
<li>Desc2</li>
</ul>
</div>
It is not really clear what your "Desired" example represents with the pairs labeled 1 and 2, but if you are just trying to select each title text followed by the text of its immediately following ul/li, you can use an XPath 2.0 expression such as:
//div/p/(
./normalize-space(string()),
./(following-sibling::ul[1])/normalize-space(string()))
For each p, it returns the entire text content of the p as a string, then takes the immediately following ul sibling of that p and returns its entire string content. This can easily be refined to select only the p/strong content (instead of all of the p), and similarly only ul/li.
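The pairing logic described above can also be sketched outside XPath. This is a minimal Python example using the standard library's xml.etree.ElementTree, offered as an illustration rather than as the answerer's method. The helper name normalized_text is hypothetical and stands in for normalize-space(string()).

```python
import xml.etree.ElementTree as ET

HTML = """<div>
<p><strong>Title1</strong></p>
<ul>
<li>Desc1</li>
</ul>
<p><strong>Title2</strong></p>
<ul>
<li>Desc2</li>
</ul>
</div>"""

def normalized_text(element):
    """Join all descendant text and collapse whitespace."""
    return " ".join("".join(element.itertext()).split())

div = ET.fromstring(HTML)
children = list(div)
pairs = []
for index, element in enumerate(children):
    if element.tag != "p":
        continue
    # First <ul> sibling after this <p>, mirroring following-sibling::ul[1].
    following_ul = next((s for s in children[index + 1:] if s.tag == "ul"), None)
    description = normalized_text(following_ul) if following_ul is not None else ""
    pairs.append((normalized_text(element), description))

for title, description in pairs:
    print(title)
    print(description)
```

Iterating over each p and looking up its own following ul keeps every title with its description, instead of collecting all titles first and all descriptions second.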
I'm using the Nokogiri gem in Ruby and running into some problems.
I want to scrape addresses from webpages and there is no set format to the way the addresses will be displayed.
I've got a list of postcodes and I want my Ruby script to return the node including the postcode so that I can find the rest of the address.
This is what I've got in Ruby, with some example HTML content:
require 'nokogiri'
require 'open-uri'
content1 = '
<div>
<div>
<div>Our Address:</div>
1 North Street
North Town
North County
N21 4DD
</div>
</div>'
doc = Nokogiri::HTML(content1)
result = doc.search "[text()*='N21 4DD']"
puts result.inspect
This returns []
I understand the example above is a strange way for an address to appear in HTML but it's the simplest way I can show the problems I've had. Here's another content variable that returns nothing:
content1 = '
<div>
<div>Our Address:</div>
<div>
1 North Street<br>
North Town<br>
North County<br>
N21 4DD
</div>
</div>'
I know that Nokogiri might have trouble with the above because the <br> tags are not written as self-closing <br />, but this is quite common on websites.
THIS EXAMPLE WORKS:
content1 = '
<div>
<div>Our Address:</div>
<div>
1 North Street
North Town
North County
N21 4DD
</div>
</div>'
Can someone explain why the node is not being found from the first two content examples above and how I can fix this?
I'm not looking for a custom solution that will find the postcode in the sample content examples above – these are just for demonstration purposes. The postcode (and address) could be anywhere in the html – body, p, div, td, span, li etc.
Thanks.
With XPath:
doc.xpath('.//div[contains(.,"N21 4DD")]')
This still returns two nodes because there is a nested div. I'm not sure that there is a way to get the middle div without the 'Our Address' div because it is in the same node.
Let's look at the first one and how Nokogiri translates your "CSS" (that's not valid CSS, by the way):
Nokogiri::CSS.xpath_for "[text()*='N21 4DD']"
#=> ["//*[contains(child::text(), 'N21 4DD')]"]
OK, so the problem here is that contains() converts child::text() to a string using only the first text node, which is the whitespace-only text before the "Our Address" div.
doc.search("//*[contains(child::text(), 'N21 4DD')]").length
#=> 0
No matches = not good.
Now let's try it jQuery-style, using the :contains pseudo-class:
Nokogiri::CSS.xpath_for ":contains('N21 4DD')"
#=> ["//*[contains(., 'N21 4DD')]"]
doc.search("//*[contains(., 'N21 4DD')]").length
#=> 4
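This is actually correct, but maybe not what you expected: contains(., ...) tests the full string value of each element, so every ancestor of the matching text (html, body and both divs) matches too.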
Let's try it one more way:
doc.search("//*[text()[contains(., 'N21 4DD')]]").length
#=> 1
It sounds like this is what you're looking for: just the div that has the string in one of its child text nodes.
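The three matching strategies above can be compared side by side in a small sketch. It uses Python's standard xml.etree.ElementTree instead of Nokogiri (an assumption for illustration), and the helper name own_text_nodes is hypothetical. The fragment is parsed as XML, so there are no html and body wrappers and the "whole string value" count is 2 here, not the 4 that Nokogiri reports.

```python
import xml.etree.ElementTree as ET

CONTENT = """<div>
<div>
<div>Our Address:</div>
1 North Street
North Town
North County
N21 4DD
</div>
</div>"""
POSTCODE = "N21 4DD"

def own_text_nodes(element):
    """Direct text-node children only: leading text plus the tail after each child."""
    candidates = [element.text] + [child.tail for child in element]
    return [text for text in candidates if text is not None]

root = ET.fromstring(CONTENT)
elements = list(root.iter())

# Like contains(child::text(), ...): only the FIRST text node is tested.
first_node_hits = [e for e in elements
                   if own_text_nodes(e) and POSTCODE in own_text_nodes(e)[0]]

# Like contains(., ...): the full string value, so ancestors match as well.
string_value_hits = [e for e in elements if POSTCODE in "".join(e.itertext())]

# Like text()[contains(., ...)]: ANY direct text node may match.
own_text_hits = [e for e in elements
                 if any(POSTCODE in text for text in own_text_nodes(e))]

print(len(first_node_hits), len(string_value_hits), len(own_text_hits))  # 0 2 1
```

Only the last strategy isolates the single element that directly holds the postcode, which is why the text()[contains(., ...)] form is the one that helps when the address can sit in any kind of element.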
My code looks like this:
<div>
<strong> Text1: </strong>
1234
<br>
<strong> Text2: </strong>
5678
<br>
</div>
where the numbers 1234 and 5678 are generated dynamically. When I copy the XPath of "Text2: 5678", I get something like /html/body/div[7]/div/div[2]/div/div[2]/div[2]/br[2]. This does not work for me. I need an XPath for only "Text2: 5678". Any help will be appreciated. (I am using Selenium WebDriver and C# for my test script.)
I second @Anil's comment above. The text "Text2:" is retrievable because it is inside the "strong" element. But "5678" is a direct child of the div and is not the innerHTML of either "strong" or "br".
Hence, to retrieve the text "Text2: 5678", you'll have to retrieve the innerHTML/text of the "div" and split it to get the required text.
Below is a Java code snippet to retrieve the text:
WebElement ele = driver.findElement(By.xpath("//div"));
System.out.print(ele.getText().split("\n")[1]); // Split on newlines; index 1 is the second line.
I hope you can formulate the above in C#.
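To show why "5678" cannot be addressed as an element, here is a minimal offline sketch in Python with the standard library's xml.etree.ElementTree. It is not Selenium code, and the <br> tags are written self-closed so the fragment parses as XML. Each number is the text that trails a strong element, so it is paired with that element's label.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """<div>
<strong> Text1: </strong>
1234
<br/>
<strong> Text2: </strong>
5678
<br/>
</div>"""

div = ET.fromstring(FRAGMENT)
labelled = {}
for strong in div.iter("strong"):
    label = (strong.text or "").strip()  # text inside <strong>
    value = (strong.tail or "").strip()  # bare text node that follows it
    labelled[label] = value

combined = f"Text2: {labelled['Text2:']}"
print(combined)  # prints "Text2: 5678"
```

In XPath terms the value corresponds to the first text node that follows the strong element, which is why a locator that must return an element can reach the label or the whole div but not the number alone.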
I'd like to use XQuery (I believe) to output the text from the title attribute of an HTML element.
Example:
<div class="rating" title="1.0 stars">...</div>
I can use XPath to select the element, but it tries to output the content between the div tags. I think I need to use XQuery to output the "1.0 stars" text from the title attribute.
There's gotta be a way to do this. My Google skills are proving ineffective in coming up with an answer.
Thanks.
XPath: //div[@class='rating']/@title
This will give you the title text for every div with a class of "rating".
Addendum (following from comments below):
If the class attribute contains other values in addition to "rating", you can use something like this:
//div[contains(concat(' ', normalize-space(@class), ' '), ' rating ')]
(Hat tip to How can I match on an attribute that contains a certain string?).
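The difference between the exact match and the token match can be sketched in Python with the standard library's xml.etree.ElementTree. The sample document below is made up for illustration; only the first div comes from the question. ElementTree's limited XPath supports the [@class='rating'] predicate but not concat() or contains(), so the token test is done in Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the second and third divs are added for illustration.
DOC = """<root>
<div class="rating" title="1.0 stars">...</div>
<div class="rating featured" title="2.0 stars">...</div>
<div class="ratings" title="not a rating">...</div>
</root>"""

root = ET.fromstring(DOC)

# Exact attribute match, like //div[@class='rating']/@title.
exact_titles = [d.get("title") for d in root.findall(".//div[@class='rating']")]

# Token match, like the concat/normalize-space expression:
# "rating" must be one whole whitespace-separated class name.
token_titles = [
    d.get("title")
    for d in root.iter("div")
    if "rating" in d.get("class", "").split()
]

print(exact_titles)  # ['1.0 stars']
print(token_titles)  # ['1.0 stars', '2.0 stars']
```

The token test accepts "rating featured" but rejects "ratings", which a plain substring test such as contains(@class, 'rating') would not do.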
You should use:
let $XML := <p><div class="rating" title="2.0 stars">sdfd</div><div class="rating" title="1.0 stars">sdfd</div></p>
for $title in $XML//@title
return
<p>{data($title)}</p>
to get output:
<p>2.0 stars</p>
<p>1.0 stars</p>