Extract all text between two nodes using XPath for web scraping?

<div class="jokeContent">
<h2 style="color:#369;">Can I be Frank</h2>
What did Ellen Degeneres say to Kathy Lee?
<p></p> <p>Can I be Frank with you? </p>
<p>Submitted by Calamjo</p>
<p>Edited by Curtis</p>
<div align="right" style="margin-top:10px;margin-bottom:10px;">#joke #short </div>
<div style="clear:both;"></div>
</div>
So I am trying to extract all text after the </h2> and before the <div align="right" style=...> node.
What I have tried so far:
jokes = response.xpath('//div[@class="jokeContent"]')
for joke in jokes:
    # keep only the non-blank text nodes directly under the div
    text = joke.xpath('text()[normalize-space()]').extract()
    if len(text) > 0:
        yield text
This works to some extent, but the website's HTML is inconsistent: sometimes the text is embedded in <p>TEXT</p>, sometimes in <br>TEXT</br>, and sometimes it is just bare TEXT.
So I thought it might make sense to extract everything after the header and before the styled div, and do the filtering afterwards.

If you are looking for a literal xpath of what you are describing, it could be something like:
In [1]: sel.xpath("//h2/following-sibling::*[not(self::div) and not(preceding-sibling::div)]//text()").extract()
Out[1]: [u'Can I be Frank with you? ', u'Submitted by Calamjo', u'Edited by Curtis']
But there's probably a more logical, cleaner solution:
In [2]: sel.xpath("//h2/following-sibling::p//text()").extract()
Out[2]: [u'Can I be Frank with you? ', u'Submitted by Calamjo', u'Edited by Curtis']
This just selects paragraph tags. You said the text might be wrapped in other tags instead; you can match several different tags with the self::tag specification:
In [3]: sel.xpath("//h2/following-sibling::*[self::p or self::br]//text()").extract()
Out[3]: [u'Can I be Frank with you? ', u'Submitted by Calamjo', u'Edited by Curtis']
Edit: apparently I missed the text directly under the div itself. This can be amended with the | (union) operator:
In [4]: sel.xpath("//h2/../text()[normalize-space(.)] | //h2/../p//text()").extract()
Out[4]:
[u'\n What did Ellen Degeneres say to Kathy Lee? \n ',
u'Can I be Frank with you? ',
u'Submitted by Calamjo',
u'Edited by Curtis']
normalize-space(.) is there only to get rid of text values that contain no text (e.g. ' \n').
You can append the first part of this xpath to any of the above and you'd get similar results.
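For the literal "everything between the header and the styled div" reading, the ideas above can be folded into one query that walks all sibling nodes, elements and bare text alike. This is only a minimal sketch: it uses lxml directly rather than the Scrapy selector from the question, joke_text and SAMPLE are illustrative names, and it assumes the first sibling div after the h2 marks the end of the joke.

```python
from lxml import html

# The markup from the question, used here as test input.
SAMPLE = """<div class="jokeContent">
<h2 style="color:#369;">Can I be Frank</h2>
What did Ellen Degeneres say to Kathy Lee?
<p></p> <p>Can I be Frank with you? </p>
<p>Submitted by Calamjo</p>
<p>Edited by Curtis</p>
<div align="right" style="margin-top:10px;margin-bottom:10px;">#joke #short </div>
<div style="clear:both;"></div>
</div>"""


def joke_text(markup):
    """Return the non-blank text pieces between the <h2> and the first sibling <div>."""
    doc = html.document_fromstring(markup)
    pieces = doc.xpath(
        '//div[@class="jokeContent"]/h2'
        # every sibling node after the header: elements *and* bare text
        '/following-sibling::node()'
        # assumption: the first sibling <div> ends the joke, so stop there
        '[not(self::div) and not(preceding-sibling::div)]'
        # the node itself if it is text, otherwise the text nested inside it
        '/descendant-or-self::text()[normalize-space()]'
    )
    # Collapse the whitespace left over from the page layout.
    return [" ".join(piece.split()) for piece in pieces]


print(joke_text(SAMPLE))
```

Because the query selects node() rather than *, it picks up text wrapped in p, text trailing a br, and bare text in the same pass, so the filtering can indeed be done afterwards.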

Related

xpath handle double quotes with some other tags

I have this html sample
<html>
<body>
....
<p id="book-1" class="abc">
<b>
book-1
section
</b>
"I have a lot of "
<i>different</i>
"text, and I want "
<i>all</i>
" text and we may or may not have italic surrounded text."
</p>
....
the xpath I currently have is this:
@"/html[1]/body[1]/p[1]/text()"
this gives this result:
I have a lot of
but I want this result:
I have a lot of different text, and I want all text and we may or may not have italic surrounded text.
Thanks for your help.

In XPath 2 and higher you could use string-join(/html[1]/body[1]/p[1]/b/following-sibling::node(), '') I think. It is not quite clear which nodes you want but that would select all sibling nodes following the b child of the p and then concatenate their string values into one.
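string-join needs an XPath 2.0 processor. With an XPath 1.0 engine, such as the one in Python's lxml, the same concatenation can be sketched by selecting the sibling nodes and joining their string values in the host language. The sample below assumes the double quotes in the question are the browser inspector's rendering of text nodes and are not part of the markup; text_after_bold is an illustrative name.

```python
from lxml import html

# Assumed form of the question's markup (inspector quotes removed).
SAMPLE = (
    '<p id="book-1" class="abc"><b>book-1 section</b>'
    'I have a lot of <i>different</i> text, and I want <i>all</i>'
    ' text and we may or may not have italic surrounded text.</p>'
)


def text_after_bold(markup):
    """Concatenate the string values of every node following the <b> child."""
    doc = html.document_fromstring(markup)
    nodes = doc.xpath('//p[@id="book-1"]/b/following-sibling::node()')
    # lxml returns text nodes as str subclasses and elements as objects,
    # so elements are reduced to their text with text_content().
    joined = "".join(n if isinstance(n, str) else n.text_content() for n in nodes)
    return " ".join(joined.split())


print(text_after_bold(SAMPLE))
```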

Ruby Nokogiri text search not working with br tags and others

I'm using the Nokogiri gem in Ruby and running into some problems.
I want to scrape addresses from webpages and there is no set format to the way the addresses will be displayed.
I've got a list of postcodes and I want my Ruby script to return the node including the postcode so that I can find the rest of the address.
This is what I've got in Ruby, with some example HTML content:
require 'nokogiri'
require 'open-uri'
content1 = '
<div>
<div>
<div>Our Address:</div>
1 North Street
North Town
North County
N21 4DD
</div>
</div>'
doc = Nokogiri::HTML(content1)
result = doc.search "[text()*='N21 4DD']"
puts result.inspect
This returns []
I understand the example above is a strange way for an address to appear in HTML but it's the simplest way I can show the problems I've had. Here's another content variable that returns nothing:
content1 = '
<div>
<div>Our Address:</div>
<div>
1 North Street<br>
North Town<br>
North County<br>
N21 4DD
</div>
</div>'
I know that Nokogiri might have trouble with the above because the <br> tags should be </br> but this is quite common on websites.
THIS EXAMPLE WORKS:
content1 = '
<div>
<div>Our Address:</div>
<div>
1 North Street
North Town
North County
N21 4DD
</div>
</div>'
Can someone explain why the node is not being found from the first two content examples above and how I can fix this?
I'm not looking for a custom solution that will find the postcode in the sample content examples above – these are just for demonstration purposes. The postcode (and address) could be anywhere in the html – body, p, div, td, span, li etc.
Thanks.

With Xpath:
doc.xpath('.//div[contains(.,"N21 4DD")]')
This still returns two nodes because there is a nested div. I'm not sure that there is a way to get the middle div without the 'Our Address' div because it is in the same node.

Let's look at the first one and how Nokogiri translates your "css" (that's not valid css btw):
Nokogiri::CSS.xpath_for "[text()*='N21 4DD']"
#=> ["//*[contains(child::text(), 'N21 4DD')]"]
Ok, so here the problem is the child::text() will actually only match the first text node, which is the empty text before the "Our Address" div.
doc.search("//*[contains(child::text(), 'N21 4DD')]").length
#=> 0
No matches = not good.
Now let's try it jquery-style using the :contains pseudo:
Nokogiri::CSS.xpath_for ":contains('N21 4DD')"
#=> ["//*[contains(., 'N21 4DD')]"]
doc.search("//*[contains(., 'N21 4DD')]").length
#=> 4
This is actually correct, but maybe not what you expected.
Let's try it one more way:
doc.search("//*[text()[contains(., 'N21 4DD')]]").length
#=> 1
It sounds like this is what you're looking for. Just the div that has the string in a child text node.
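The difference between the three predicates is not specific to Nokogiri. As a rough cross-check, here is the same comparison sketched in Python with lxml, which like Nokogiri builds on libxml2. The variable and function names are illustrative, and the markup is the first content1 example from the question.

```python
from lxml import html

CONTENT = """
<div>
<div>
<div>Our Address:</div>
1 North Street
North Town
North County
N21 4DD
</div>
</div>"""

# document_fromstring wraps the fragment in <html><body>, as Nokogiri::HTML does.
doc = html.document_fromstring(CONTENT)


def count(xpath):
    """Number of nodes the expression selects in the parsed document."""
    return len(doc.xpath(xpath))


# contains() on a node-set only looks at the FIRST child text node.
first_text_only = count("//*[contains(child::text(), 'N21 4DD')]")
# '.' is the full string value, so every ancestor matches as well.
any_descendant = count("//*[contains(., 'N21 4DD')]")
# Test each child text node individually: only the direct parent matches.
own_text_node = count("//*[text()[contains(., 'N21 4DD')]]")

print(first_text_only, any_descendant, own_text_node)
```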

How to check box in Capybara if there are no name, id or label text?

I am a newbie here. Please advise: how do I select the checkbox in my case?
<ul class="phrases-list" style="">
<li>
<input type="checkbox" class="select-phrase">
<span class="prase-title"> Dog - Wikipedia, the free encyclopedia </span>
(en.wikipedia.org)
<div class="prase-desc hidden">The domestic dog (Canis lupus familiaris or Canis familiaris) is a domesticated...</div>
</li>
The following doesn't work for me:
When /I check box "([^\"]+)"$/ do |label|
page.check(label)
end
step: And I check box "Dog - Wikipedia, the free encyclopedia"

If you can change the html, wrap the input and span in a label element
<ul class="phrases-list" style="">
<li>
<label>
<input type="checkbox" class="select-phrase">
<span class="prase-title"> Dog - Wikipedia, the free encyclopedia </span>
</label>
(en.wikipedia.org)
<div class="prase-desc hidden">The domestic dog (Canis lupus familiaris or Canis familiaris) is a domesticated...</div>
</li>
which has the added benefit of clicks on the "Dog - Wikipedia ..." text triggering the checkbox too. With that change your step should work as written. If you can't modify the html then things get more difficult.
Something like
find('span', text: label).find(:xpath, './preceding-sibling::input').set(true)
should work, although I'm curious how you're using these checkboxes from JS with nothing tying them to any specific value

Let's assume that you are prevented from changing the HTML. In this case, it would probably be easiest to query for the element via XPath. For example:
# Here's the XPath query
q = "//span[contains(text(), 'Dog - Wikipedia')]/preceding-sibling::input"
# Use the query to find the checkbox. Then, check the checkbox.
page.find(:xpath, q).set(true)
Okay - it's not as bad as it looks! Let's analyze this XPath so we can understand what it's doing:
//span
This first part says "Search the entire HTML document and discover all 'span' elements." Of course, there are probably a LOT of "span" elements in the HTML document, so we'll need to restrict this:
//span[contains(text(), 'Dog - Wikipedia')]
Now we're only searching for the "span" elements that contain the text "Dog - Wikipedia". Presumably, this text will uniquely identify the desired "span" element on the page (if not, then just search for more of the text).
At this point, we have the "span" element that is adjacent to the desired "input" element. So, we can query for the "input" element using the "preceding-sibling::" XPath Axis:
//span[contains(text(), 'Dog - Wikipedia')]/preceding-sibling::input
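The query can also be tried outside Capybara before wiring it into a step. The following is a small sketch in Python with lxml, not Capybara code: checkbox_for is a hypothetical helper, the closing </ul> is added so the fragment is complete, and the title is passed as an XPath variable to avoid quoting problems.

```python
from lxml import html

SAMPLE = """<ul class="phrases-list" style="">
<li>
<input type="checkbox" class="select-phrase">
<span class="prase-title"> Dog - Wikipedia, the free encyclopedia </span>
(en.wikipedia.org)
<div class="prase-desc hidden">The domestic dog (Canis lupus familiaris or Canis familiaris) is a domesticated...</div>
</li>
</ul>"""


def checkbox_for(doc, title):
    """Return the <input> elements that precede a <span> containing `title`."""
    query = "//span[contains(text(), $title)]/preceding-sibling::input"
    # lxml binds keyword arguments to XPath variables ($title here).
    return doc.xpath(query, title=title)


doc = html.document_fromstring(SAMPLE)
matches = checkbox_for(doc, "Dog - Wikipedia")
print([(m.tag, m.get("class")) for m in matches])
```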

How to take XPath of element that is between br tags with <strong> in account

My code is like this,
<div>
<strong> Text1: </strong>
1234
<br>
<strong> Text2: </strong>
5678
<br>
</div>
where the numbers 1234 and 5678 are generated dynamically. When I take the XPath of "Text2: 5678", it gives me something like /html/body/div[7]/div/div[2]/div/div[2]/div[2]/br[2]. This does not work for me. I need the XPath of only "Text2: 5678". Any help will be appreciated. (I am using Selenium WebDriver and C# for my test script.)

I second @Anil's comment above. The text "Text2:" is retrievable as it is within the "strong" element. But "5678" comes under the div and is not the innerHTML of either "strong" or "br".
Hence, to retrieve the text "Text 2: 5678", you'll have to retrieve the innerHTML/text of "div" and modify it accordingly to get the required text.
Below is a Java code snippet to retrieve the text:-
WebElement ele = driver.findElement(By.xpath("//div"));
System.out.print(ele.getText().split("\n")[1]); // Split on newlines and take the second line.
I hope you can formulate the above in C#.
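Taking the second line by index breaks as soon as the order of the entries changes. A slightly more defensive variant of the same post-processing idea, sketched here in Python rather than C#, parses every "label: value" line of the element's visible text into a dictionary. It assumes the driver returns one pair per line; the rendered string below is an assumed example, not captured output, and parse_labeled_lines is a hypothetical helper.

```python
def parse_labeled_lines(visible_text):
    """Map each 'label: value' line of an element's visible text to its value."""
    pairs = {}
    for line in visible_text.splitlines():
        label, sep, value = line.partition(":")
        if sep:  # skip lines that do not contain a colon
            pairs[label.strip()] = value.strip()
    return pairs


# Assumed text of the <div>, one "label: value" pair per line.
rendered = "Text1: 1234\nText2: 5678"
print(parse_labeled_lines(rendered)["Text2"])
```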

Extracting text in between nodes through XPath

I'm trying to read specific parts of a webpage through XPath. The page is not very well-formed but I can't change that...
<root>
<div class="textfield">
<div class="header">First item</div>
Here is the text of the <strong>first</strong> item.
<div class="header">Second item</div>
<span>Here is the text of the second item.</span>
<div class="header">Third item</div>
Here is the text of the third item.
</div>
<div class="textfield">
Footer text
</div>
</root>
I want to extract the text of the various items, i.e. the text in between the header divs (e.g. 'Here is the text of the first item.'). I've used this XPath expression so far:
//text()[preceding::*[@class='header' and contains(text(),'First item')] and following::*[@class='header' and contains(text(),'Second item')]]
However, I cannot hardcode the ending item name because in the pages I want to scrape the order of the items differ (e.g. 'First item' may be followed by 'Third item').
Any help on how to adapt my XPath query would be greatly appreciated.

Found it!
//text()[preceding::*[@class='header' and contains(text(),'First item')]][following::*[preceding::*[@class='header'][1][contains(text(),'First item')]]]
Indeed your solution, Aleh, won't work for tags inside the text.
Now, the one remaining case is the last item, which is not followed by an element with class=header, so the query will include all text up to the end of the document. Ideas?

//*[@class='header' and contains(text(),'First item')]/following::text()[1] will select the first text node after <div class="header">First item</div>.
//*[@class='header' and contains(text(),'Second item')]/following::text()[1] will select the first text node after <div class="header">Second item</div>, and so on.
EDIT: Sorry, this will not work for <strong> cases. Will update my answer.
EDIT2: Used @Michiel's part. Looks like omg but works: //div[@class='textfield'][1]//text()[preceding::*[@class='header' and contains(text(),'First item')]][following::*[preceding::*[not(self::strong) and not(self::span)][1][contains(text(),'First item')]] or not(//*[preceding::*[@class='header' and contains(text(),'First item')]])]
Seems that this should be solved with a better solution :)

For the sake of completeness, the final query, composed of various suggestions throughout the thread:
//*[
  @class='textfield' and position() = 1
]
//text() [
  preceding::*[
    @class='header' and contains(text(),'First item')
  ]
][
  following::*[
    preceding::*[
      @class='header'
    ][1][
      contains(text(),'First item')
    ]
  ]
]
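The open question of the last item can possibly be sidestepped by staying on the sibling axes instead of preceding/following. The sketch below is a variant, not the query from the thread: it assumes the headers and their text are siblings inside the same container (as in the sample), uses Python with lxml, and ITEM_XPATH and item_text are illustrative names. Each node is kept only if its nearest preceding header sibling is the requested one, so nothing leaks into the footer div.

```python
from lxml import html

SAMPLE = """<root>
<div class="textfield">
<div class="header">First item</div>
Here is the text of the <strong>first</strong> item.
<div class="header">Second item</div>
<span>Here is the text of the second item.</span>
<div class="header">Third item</div>
Here is the text of the third item.
</div>
<div class="textfield">
Footer text
</div>
</root>"""

ITEM_XPATH = (
    # start at the header whose text contains the requested title
    '//div[@class="header"][contains(., $title)]'
    # later siblings that are not headers themselves
    '/following-sibling::node()[not(self::*[@class="header"])]'
    # keep a node only if its NEAREST preceding header is that same header
    '[preceding-sibling::*[@class="header"][1][contains(., $title)]]'
    # text nodes directly, plus text nested in <strong>, <span>, ...
    '/descendant-or-self::text()'
)


def item_text(doc, title):
    """Return the whitespace-normalized text that follows the given header."""
    pieces = doc.xpath(ITEM_XPATH, title=title)
    return " ".join("".join(pieces).split())


doc = html.document_fromstring(SAMPLE)
for title in ("First item", "Second item", "Third item"):
    print(title, "->", item_text(doc, title))
```

Since the item name is passed in as a variable and the end of each item is found relative to the next header, the order of the items on the page does not need to be known in advance.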
