XPath: Matching text between two similar tags

I'm trying to scrape a website with a messy structure. The text I need lies between the first run of exactly 5 consecutive br tags (no more, no fewer) and the following 2 consecutive br tags.
It looks like this:
<p class="A">
"Some text"
<br>
"Some text"
<br>
<br>
"Some text"
<br>
<br>
<br>
<br>
<br>
"Required text"
<br>
"Required text"
<br>
"Required text"
<br>
<br>
</p>

In the extracted text, each <br> boundary shows up as a newline character (the line breaks around the tags are kept in the text nodes), so you can just extract the whole text and split it on 5 consecutive newline characters:
> text = sel.xpath('//text()').extract()
['\n"Some text"\n', '\n"Some text"\n', ...]
> values = ''.join(text).split('\n\n\n\n\n')[1]
'\n"Required text"\n\n"Required text"\n\n"Required text"\n\n\n'
> values.strip().split('\n\n')
['"Required text"', '"Required text"', '"Required text"']

Related

Clarification of Nokogiri::NodeSet XML Content based on 'puts node' and 'puts node.inspect'

I rarely use xpath(), but when I do I keep tripping myself up on interpreting the content of Nokogiri NodeSets, and I believe I now know where I have always gone wrong.
Simply put, when I do a 'puts NodeSet' I have always assumed that I could search the NodeSet based on the returned XML. But the first tag returned does not appear to actually be part of the node XML.
'puts n1' returns XML that has a SPAN as its first element, but if I then do a search with n1.xpath('SPAN') or n1.xpath('SPAN/DIV'), no nodes are found. n1.xpath('DIV') returns the output I expect and proves there is no SPAN tag in the XML.
The only way I can logically explain this to myself is to assume that the first XML tag of a 'puts node' is the "node name" and not part of the node XML. This works for me going forward, but am I missing something that is going to bite me elsewhere?
CODE:
require 'nokogiri'

docxml = Nokogiri::XML(<<EOT)
<DIV><SPAN><DIV id='1'><H1>-H1-</H1><h1>-h1-</h1></DIV>
<DIV id='2'><H2>-H2-</H2> <h2>-h2-</h2></DIV>
<DIV id='3'><H3>-H3-</H3><h3>-h3-</h3></DIV>
</SPAN></DIV>
EOT
n0 = docxml.xpath('DIV')
n1 = n0.xpath('SPAN')
n2 = n1.xpath('DIV')
n3 = n2.xpath('*')
n4 = n3.xpath('*')
puts "n1:xpath('SPAN'): \n#{n1.xpath('SPAN')}\n#{'^'*80} \nn1 XML:\n#{n1}\n#{'^'*80}\
\nn1:inspect \n#{n1.inspect}\n#{'^'*80}\n"
OUTPUT:
=begin
n1:xpath('SPAN'):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
n1 XML:
<SPAN>
<DIV id="1"> <H1>-H1-</H1> <h1>-h1-</h1> </DIV>
<DIV id="2"> <H2>-H2-</H2> <h2>-h2-</h2> </DIV>
<DIV id="3"> <H3>-H3-</H3> <h3>-h3-</h3> </DIV>
</SPAN>
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
n1:inspect
[#<Nokogiri::XML::Element:0x1c10964 name="SPAN"
children=[
#<Nokogiri::XML::Element:0x1c10820 name="DIV" attributes=[#<Nokogiri::XML::Attr:0x18fff90 name="id" value="1">]
children=[#<Nokogiri::XML::Element:0x1c1064c name="H1" children=[#<Nokogiri::XML::Text:0x1c1ffe8 "-H1-">]>,
#<Nokogiri::XML::Element:0x1c10604 name="h1" children=[#<Nokogiri::XML::Text:0x1c1fdcc "-h1-">]>
]>,
#<Nokogiri::XML::Element:0x1c107d8 name="DIV" attributes=[#<Nokogiri::XML::Attr:0x1c1fc10 name="id" value="2">]
children=[#<Nokogiri::XML::Element:0x1c105bc name="H2" children=[#<Nokogiri::XML::Text:0x1c1f874 "-H2-">]>,
#<Nokogiri::XML::Text:0x1c1f778 " ">,
#<Nokogiri::XML::Element:0x1c10574 name="h2" children=[#<Nokogiri::XML::Text:0x1c1f5f8 "-h2-">]
>]>,
#<Nokogiri::XML::Element:0x1c10790 name="DIV" attributes=[#<Nokogiri::XML::Attr:0x1c1f43c name="id" value="3">]
children=[#<Nokogiri::XML::Element:0x1c1052c name="H3" children=[#<Nokogiri::XML::Text:0x1c1f0a0 "-H3-">]>,
#<Nokogiri::XML::Element:0x1c104e4 name="h3" children=[#<Nokogiri::XML::Text:0x1c1ee90 "-h3-">]
>]
>]
>]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
=end
Now that I have had some sleep, this is the explanation that works for me:
'nodeset = xpath('tag1/tag2')' returns a 'nodeset' whose member is the 'tag2' node
'puts nodeset' displays that 'tag2' member node, including its own tag
'nodeset.xpath('*')' returns the child elements of 'tag2'
'nodeset.xpath('tag2')' finds nothing, because 'tag2' is the context node and not one of its own children
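The behaviour summarised above follows from XPath's default child axis: a relative path such as 'SPAN' is evaluated against the children of the context node, and a node is not its own child. Here is a minimal sketch of the same idea. It uses Ruby's bundled REXML library rather than Nokogiri (a swapped-in library, chosen only so the snippet runs without extra gems), and the trimmed-down document is an assumption for illustration:

```ruby
require 'rexml/document'

# Trimmed-down version of the document from the question (illustrative only).
xml = "<DIV><SPAN><DIV id='1'/><DIV id='2'/><DIV id='3'/></SPAN></DIV>"
doc = REXML::Document.new(xml)

# Select the SPAN element; it becomes the context node for the next searches.
span = REXML::XPath.first(doc, '/DIV/SPAN')

# A relative path is evaluated against the children of the context node.
span_children = REXML::XPath.match(span, 'SPAN') # SPAN has no SPAN child
div_children  = REXML::XPath.match(span, 'DIV')  # the three inner DIVs

puts span.name          # name of the context node itself
puts span_children.size # number of SPAN children found
puts div_children.size  # number of DIV children found
```

Printing the node shows its own tag because the node is serialised as a whole, but a search starting from it only ever looks below it.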

RegEx code works in theory but not when code is run

I'm trying to use this RegEx search in Ruby: <div class="ms3">(\n.*?)+< but as soon as I add the last character, "<", it stops working altogether. I've tested it in Rubular and the RegEx works perfectly fine there. I'm using RubyMine to write my code, and I also ran it from PowerShell with the same result: no error message. When I run <div class="ms3">(\n.*?)+ it prints <div class="ms3">, which is exactly what I'm looking for, but as soon as I add the "<" it prints nothing.
My code:
#!/usr/bin/ruby
# encoding: utf-8
File.open('ms3.txt', 'w') do |fo|
  fo.puts File.foreach('input.txt').grep(/<div class="ms3">(\n.*?)+/)
end
Some of what I'm searching through:
<div class="ms3">
<span xml:lang="zxx"><span xml:lang="zxx">Still the tone of the remainder of the chapter is bleak. The</span> <span class="See_In_Glossary" xml:lang="zxx">DAY OF THE <span class="Name_Of_God" xml:lang="zxx">LORD</span></span> <span xml:lang="zxx">holds no hope for deliverance (5.16–18); the futility of offering sacrifices unmatched by common justice is once more underlined, and exile seems certain (5.21–27).</span></span>
</div>
<div class="Paragraph">
<span class="Verse_Number" id="idAMO_5_1" xml:lang="zxx">1</span><span class="scrText">Listen, people of Israel, to this funeral song which I sing over you:</span>
</div>
<div class="Stanza_Break"></div>
The full RegEx I need is <div class="ms3">(\n.*?)+<\/div>, which should pick up the first section and nothing else.
Your problem starts with File.foreach('input.txt'), which yields the file one line at a time. The pattern is therefore matched against each line separately, and no single line can match it (by definition, no line has a \n in its middle).
You should have better luck reading the whole text as one block and calling match on it:
File.read('input.txt').match(/<div class="ms3">(\n.*?)+<\/div>/)
# => #<MatchData "<div class=\"ms3\">\n <span xml:lang=\"zxx\">
# => <span xml:lang=\"zxx\">Still the tone of the remainder of the chapter is bleak. The</span>
# => <span class=\"See_In_Glossary\" xml:lang=\"zxx\">DAY OF THE
# => <span class=\"Name_Of_God\" xml:lang=\"zxx\">LORD</span></span>
# => <span xml:lang=\"zxx\">holds no hope for deliverance (5.16–18);
# => the futility of offering sacrifices unmatched by common justice is once more
# => underlined, and exile seems certain (5.21–27).</span></span>\n </div>" 1:"\n ">
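The difference between the two approaches can be reproduced without any files. In this sketch the `html` string is a shortened stand-in for input.txt (an assumption for illustration), and `each_line` plays the role of File.foreach. The last pattern, which uses the /m flag so that `.` also matches newlines, is offered as one possible simplification and is not part of the original answer:

```ruby
# Shortened stand-in for the contents of input.txt (illustrative only).
html = <<~HTML
  <div class="ms3">
  <span>Still the tone is bleak.</span>
  </div>
  <div class="Paragraph">
  <span>Listen, people of Israel</span>
  </div>
HTML

pattern = /<div class="ms3">(\n.*?)+<\/div>/

# Line by line: every candidate ends at its own "\n", so nothing can match.
line_hits = html.each_line.grep(pattern)

# Whole text at once: the pattern can run across the line breaks.
block_hit = html[pattern]

# Possible simplification: /m lets "." match newlines, lazily up to the first </div>.
simpler = html[/<div class="ms3">.*?<\/div>/m]

p line_hits
puts block_hit
puts simpler == block_hit
```

For anything beyond a quick one-off extraction, an HTML parser is usually more robust than a regular expression, since nested div elements would end the lazy match too early.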

Get Text between two tags using nokogiri

My HTML structure is
<div class="line">
<h2>Header</h2>
<h3>Mailing Address</h3>
2349 Glorem ipsun lorem ipsum CA 95833<br>
<br>
Phone: 111-111-2111 Fax: 111-511-1111<br>
<a onfocus="blur()" target="_blank"" href="">some text</a><br>
<a onfocus="blur()" target="_blank" href="">some address</a><br>
<div><p></p></div>
<h3>Contact(s)</h3>
</div>
The HTML page contains several <div class=line></div> elements. For each div I need to extract the phone and fax numbers into an array along with other data. I tried using
doc.css("div#ctl00_cphContent_divBrowseByMember").each do |div|
  div.css("div.line").each do |line|
    line.xpath('//text()[preceding-sibling::br and following-sibling::a]').text.strip
  end
end
It returns nothing and then fails with a timeout error.
If I instead try
line.xpath('//text()[preceding-sibling::br and following-sibling::a]')[0].text.strip
it returns the same phone and fax for every div. Please suggest any other solution that will help me.
The easy way:
phone, fax = line.text.scan(/\d{3}-\d{3}-\d{4}/)
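As a side note on the question's XPath: an expression that starts with // is evaluated from the document root even when it is called on a nested node, which would explain why every div gave the same result. The scan approach avoids XPath entirely. Below is a minimal sketch on a plain string; `line_text` is a hypothetical stand-in for what line.text might return, and the label-anchored variant is an extra suggestion, not part of the original answer:

```ruby
# Hypothetical stand-in for line.text of one div.line element.
line_text = <<~TEXT
  Header
  Mailing Address
  2349 Glorem ipsun lorem ipsum CA 95833

  Phone: 111-111-2111 Fax: 111-511-1111
TEXT

# Positional: the first number found is the phone, the second is the fax.
phone, fax = line_text.scan(/\d{3}-\d{3}-\d{4}/)

# Label-anchored: still correct if a div lists the fax first or omits one.
labelled_phone = line_text[/Phone:\s*([\d-]+)/, 1]
labelled_fax   = line_text[/Fax:\s*([\d-]+)/, 1]

puts phone
puts fax
```

The positional form is shorter, while the labelled form returns nil for a missing number instead of silently shifting the values.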

Locating element in same paragraph of another element in watir-webdriver

Given the following HTML code snippet, after finding the link by ID, how would you select the checkbox in the same paragraph?
For example, I want to select the checkbox associated with the link with ID="inst_17901-1746-1747".
The order of the paragraphs in the DIV is not consistent between sessions, so I cannot select the checkbox by index or by its own ID.
<div id="inst-results">
<p>
<input id="inst-results0-check" type="checkbox">
<a class="ws-rendered" id="inst_17901-1746-1747" title="!!QA Data 2/DOOR FURNITURE/316 Stainless - Altro Range"><img src="http://yr-qa-svr2/Agility/ACMSImages?type=objectType&objectTypeID=32"> <span>!!QA Data 2/DOOR FURNITURE/316 Stainless - Altro Range</span></a>
</p>
<p>
<input id="inst-results1-check" type="checkbox"><a class="ws-rendered" id="inst_17882-1746-1747" title="!!QA Data/DOOR FURNITURE/316 Stainless - Altro Range"><img src="http://yr-qa-svr2/Agility/ACMSImages?type=objectType&objectTypeID=32"> <span>!!QA Data/DOOR FURNITURE/316 Stainless - Altro Range</span></a>
</p>
</div>
I figured out this solution, which works off the text of the link, but Zeljko's solution is much better.
$browser.div(:id, "inst-results").ps.each do |para|
  if para.link.text == "!!QA Data/DOOR FURNITURE/316 Stainless - Altro Range"
    para.checkbox.set
    break
  end
end
If there is only one checkbox in the paragraph with the link:
browser.link(:id => "inst_17901-1746-1747").parent.checkbox.set
This works with watir-webdriver; I'm not sure whether it works with other Watir gems.

How do I extract text from a web page with <br /> tags using Hpricot?

I'm trying to parse an HTML file using Hpricot and Ruby, but I'm having issues extracting "free floating" text which is not enclosed in tags like <p></p>.
require 'hpricot'
text = <<SOME_TEXT
<a href="http://www.somelink.com/foo/bar.html">Testing:</a><br />
line 1<br />
line 2<br />
line 3<br />
line 4<br />
line 5<br />
<b>Here's some more text</b>
SOME_TEXT
parsed = Hpricot(text)
parsed = parsed.search('//a[@href="http://www.somelink.com/foo/bar.html"]').first.following_siblings
puts parsed
I would expect the result to be
<br />
line 1<br />
line 2<br />
line 3<br />
line 4<br />
line 5<br />
<b>Here's some more text</b>
But I am getting
<br />
<br />
<br />
<br />
<br />
<br />
<b>Here's some more text</b>
How can I make Hpricot return line 1, line 2, etc?
Your first step is to read the following_siblings documentation:
Find sibling elements which follow the current one. Like the other “sibling” methods, this weeds out text and comment nodes.
Then you should use the Hpricot source to generalize how following_siblings works to get something that works like following_siblings but doesn't filter out non-container nodes:
parsed = Hpricot(text)
link = parsed.search('//a[@href="http://www.somelink.com/foo/bar.html"]').first
link_sibs = link.parent.children
what_you_want = link_sibs[link_sibs.index(link) + 1 ... link_sibs.length]
puts what_you_want
That's pretty much following_siblings with parent.children instead of parent.containers. Having access to the source code of the libraries you use is pretty handy and studying it is to be encouraged.
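The slicing idea does not depend on Hpricot. In this plain-Ruby sketch an array stands in for link.parent.children: strings play the role of text nodes and symbols play the role of elements (both are assumptions for illustration, not Hpricot objects):

```ruby
# Stand-in for link.parent.children: symbols are elements, strings are text nodes.
children = [:a, :br, "line 1", :br, "line 2", :br, :b]
link = :a

# Everything after the link, text nodes included (the parent.children approach).
after_link = children[(children.index(link) + 1)...children.length]

# Keeping elements only mimics what a containers-based lookup would return.
only_elements = after_link.grep(Symbol)

# The text nodes are what the question was actually after.
only_text = after_link.grep(String)

p after_link
p only_elements
p only_text
```

Filtering to elements reproduces the unwanted result from the question (only the br and b tags), while the unfiltered slice keeps the lines of text.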
It's been a while since I've used Hpricot, but here are some things I remember that might help:
The quick way to get all the text:
irb(main):023:0> print parsed.inner_text
Testing:
line 1
line 2
line 3
line 4
line 5
Here's some more text
The downside to that is you get the text embedded in tags too.
Similarly, we can search for all 'text()' nodes:
irb(main):033:0> puts (parsed / 'text()')
Testing:
line 1
[...]
line 5
So, we can do this:
irb(main):036:0> puts (parsed / 'text()')[2 .. -3]
line 1
line 2
line 3
line 4
line 5
or:
irb(main):037:0> (parsed / 'text()')[2 .. -3]
=> #<Hpricot::Elements["\n line 1", " \n line 2", "\n line 3", "\n line 4", "\n line 5", "\n "]>
or:
irb(main):039:0> (parsed / 'text()')[2 .. -3].map{ |t| t.inner_text.strip }
=> ["line 1", "line 2", "line 3", "line 4", "line 5", ""]
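The last result still ends with an empty string left over from a whitespace-only text node. A small follow-up sketch, using plain strings in place of the Hpricot text nodes (the `raw` array is copied from the output above and is otherwise an assumption), shows one way to drop such entries:

```ruby
# Plain strings standing in for the text nodes returned by (parsed / 'text()')[2 .. -3].
raw = ["\n line 1", " \n line 2", "\n line 3", "\n line 4", "\n line 5", "\n "]

# Strip surrounding whitespace, then discard entries that were whitespace only.
lines = raw.map(&:strip).reject(&:empty?)

p lines
```

Rejecting empty strings also makes the fixed [2 .. -3] slice less fragile, since stray whitespace-only nodes no longer matter.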
The main idea for grabbing data or text from a web page is to look for landmarks you can use to navigate through the page. Often we can grab text from inside a <div> or <p> tag. If a page doesn't give you landmarks, you have to use other tricks: looking for a series of text nodes followed by <br> nodes, maybe, or the five lines following an <a> tag with a certain href attribute. That's the fun and challenge of dealing with HTML.
In the back of my mind there's a nagging thought that there is a more elegant way to do this, but this seems to be working. Dig around on the Hpricot Challenge page for variations on the theme of digging out content.
