Hpricot: How to extract inner text without other html subelements - ruby

I'm working on a vim rspec plugin (https://github.com/skwp/vim-rspec) - and I am parsing some html from rspec. It looks like this:
doc = %{
<dl>
<dt id="example_group_1">This is the heading text</dt>
Some puts output here
</dl>
}
I can get the entire inner of the using:
(Hpricot.parse(doc)/:dl).first.inner_html
I can get just the dt by using
(Hpricot.parse(doc)/:dl).first/:dt
But how can I access the "Some puts output here" area? If I use inner_html, there is way too much other junk to parse through. I've looked through hpricot docs but don't see an easy way to get essentially the inner text of an html element, disregarding its html children.

I ended up figuring out a route by myself, by manually parsing the children:
(#context/"dl").each do |dl|
dl.children.each do |child|
if child.is_a?(Hpricot::Elem) && child.name == 'dd'
# do stuff with the element
elsif child.is_a?(Hpricot::Text)
text=child.to_s.strip
puts text unless text.empty?
end
end

Note that this is bad HTML you have there. If you have control over it, you should wrap the content you want in a <dd>.
In XML terms what you are looking for is the TextNode following the <dt> element. In my comment I showed how you can select this node using XPath in Nokogiri.
However, if you must use Hpricot, and cannot select text nodes using it, then you could hack this by getting the inner_html and then stripping out the unwanted:
(Hpricot.parse(doc)/:dl).first.inner_html.sub %r{<dt>.+?</dt>}, ''

Related

Excluding contents of <span> from text using Waitr

Watir
mytext =browser.element(:xpath => '//*[#id="gold"]/div[1]/h1').text
Html
<h1>
This is the text I want
<span> I do not want this text </span>
</h1>
When I run my Watir code, it selects all the text, including what is in the spans. How do I just get the text "This is the text I want", and no span text?
If you have a more complicated HTML, I find it can be easier to deal with this using Nokogiri as it provides more methods for parsing the HTML:
require 'nokogiri'
h1 = browser.element(:xpath => '//*[#id="gold"]/div[1]/h1')
doc = Nokogiri::HTML.fragment(h1.html)
mytext = doc.at('h1').children.select(&:text?).map(&:text).join.strip
Ideally start by trying to avoid using XPath. One of the most powerful features of Watir is the ability to create complicated locators without XPath syntax.
The issue is that calling text on a node gets all content within that node. You'd need to do something like:
top_level = browser.element(id: 'gold')
h1_text = top_level.h1.text
span_text = top_level.h1.span.text
desired_text = h1_text.chomp(span_text)
This is useful for top level text.
If there is only one h1, you can ommit id
#b.h1.text.remove(#b.h1.children.collect(&:text).join(' '))
Or specify it if there are more
#b.h1(id: 'gold').text.remove(#b.h1.children.collect(&:text).join(' '))
Make it a method and call it from your script with get_top_text(#b.h1) to get it
def get_top_text(el)
el.text.chomp(#b.h1.children.collect(&:text).join(' '))
end

How to retrieve string using XPath without returning null errors

I'm trying to write "Private Equity Group; USA" to a file.
"Private Equity Group" prints fine, but I get an error for the "USA" portion
TypeError: null is not an object (evaluating 'style.display')"
HTML code:
<div class="cl profile-xsmall">
<div class="cl profile-small-bold">Private Equity Group</div>
USA
</div>
The XPath for "USA" is:
//*[#id="addrDiv-Id"]/div/div[3]/text()
I get the error when I print the XPath or have it in an if statement:
if (internet.has_xpath?('//*[#id="addrDiv-Id"]/div/div[3]/text()')){
file.puts "#{internet.find(:xpath, '//*[#id="addrDiv-Id"]/div/div[3]/text()')}"
}
Capybara is not a general purpose xpath library - it is a library aimed at testing, and therefore is element centric. The xpaths used need to refer to elements, not text nodes.
if (internet.has_xpath?('//*[#id="addrDiv-Id"]/div/div[3]')){
file.puts internet.find(:xpath, '//*[#id="addrDiv-Id"]/div/div[3]').text
}
although using XPath at all for this is just a bad idea. Whenever possible default to CSS, it's easier to read, and faster for the browser to process - something like
if (internet.has_css?('#addrDiv-Id > div > div:nth-of-type(3)')){
file.puts internet.find('#addrDiv-Id" > div > div:nth-of-type(3)').text
}
or if the HTML allows it (I don't know without seeing more of the HTML)
if (internet.has_css?('#addrDiv-id .cl.profile-xsmall')){
file.puts internet.find('#addrDiv-id .cl.profile-xsmall').text
}
or even cleaner if it works for your use case
file.puts internet.first('#addrDiv-id .cl.profile-xsmall')&.text
Another way to do it :
xml = %{<div class="cl profile-xsmall">
<div class="cl profile-small-bold">Private Equity Group</div>
USA</div>}
require 'rexml/document'
doc = REXML::Document.new xml
print(REXML::XPath.match(doc, 'normalize-space(string(//div[#class="cl profile-xsmall"]))'))
Output :
["Private Equity Group USA"]
I'd say the HTML isn't well-formed, using span would have been better, but this works:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<div class="cl profile-xsmall">
<div class="cl profile-small-bold">Private Equity Group</div>
USA
</div>
EOT
div = doc.at('.profile-small-bold')
[div.text.strip, div.next_sibling.text.strip].join(' ')
# => "Private Equity Group USA"
which can be reduced to:
[div, div.next_sibling].map { |n| n.text.strip }.join(' ')
# => "Private Equity Group USA"
The problem is that you have two nested divs, with "USA" trailing, so it's important to point to the inner node which has the main text you want. Then "USA" is in the following text node, which is accessible using next_sibling:
div.next_sibling.class # => Nokogiri::XML::Text
div.next_sibling # => #<Nokogiri::XML::Text:0x3c "\n USA\n">
Note, I'm using CSS selectors; They're easier to read, which is echoed by the Nokogiri documentation. I have no proof they're faster, and, because Nokogiri uses libxml to process both, there's probably no real difference worth worrying about, so use whatever makes more sense, and run benchmarks if you're curious.
You might be tempted to use text against the div class="cl profile-xsmall" node, but don't be sucked into that, as it's a trap:
doc.at('.profile-xsmall').text # => "\n Private Equity Group\n USA\n"
doc.at('.profile-xsmall').text.gsub(/\s+/, ' ').strip # => "Private Equity Group USA"
text will return a string of the text nodes after they're concatenated together. In this particular, rare case, it results in a somewhat usable result, however, usually you'll get something like this:
doc = Nokogiri::HTML('<div><p>foo</p><p>bar</p></div>')
doc.at('div').text # => "foobar"
doc.search('p').text # => "foobar"
Once those text nodes have been concatenated it's really difficult to take them apart again. Nokogiri's documentation talks about this:
Note: This joins the text of all Node objects in the NodeSet:
doc = Nokogiri::XML('<xml><a><d>foo</d><d>bar</d></a></xml>')
doc.css('d').text # => "foobar"
Instead, if you want to return the text of all nodes in the NodeSet:
doc.css('d').map(&:text) # => ["foo", "bar"]
The XPath for "USA" is:
//*[#id="addrDiv-Id"]/div/div[3]/text()
Um, no, not according to the HTML you gave us. But, let's pretend.
Using an absolute path to a node is a good way to write fragile selectors. It takes only a small change in the HTML to break your access to the node. Instead, find way-points to skip through the HTML to find the node you want, taking advantage of CSS and XPath to search downward through the DOM.
Typically, a selector like yours is generated by a browser, which isn't a good source to trust. Often browsers do fixups on malformed HTML, which changes it from what Nokogiri or a parser would see, resulting in a non-existing target, or the browser presents the HTML after JavaScript has had a change to run, which can move nodes, hide them, add new ones, etc.
Instead of trusting the browser, use curl, wget or nokogiri at the command-line to dump the file and look at it using a text editor. Then you'll be seeing it just as Nokogiri sees it, prior to any fixups or mangling.

Parsing a nested tag, moving it outside of the parent, and changing its type using Nokogiri

I have HTML coming from an API that I want to clean up and format it.
I'm trying to get any <strong> tags that are the first element inside a <p> tag, and change it to be the parent of the <p> tag, and convert the <p> tag to <h4>.
For example:
<p><strong>This is what I want to pull out to an h4 tag.</strong>Here's the rest of the paragraph.</p>
becomes:
<h4>This is what I want to pull out to an h4 tag.</h4><p>Here's the rest of the paragraph.</p>
EDIT: Apologies for the nature of the question being too 'please write this for me'. I posted the solution I came up with below. I just had to take the time to really learn how Nokogiri works, but it is quite powerful and it seems like you can do almost anything with it.
doc = Nokogiri::HTML::DocumentFragment.parse(html)
doc.css("p").map do |paragraph|
first = paragraph.children.first
if first.element? and first.name == "strong"
first.name = 'h4'
paragraph.add_previous_sibling(first)
end
end

How to find an element's text in Capybara while ignoring inner element text

In the HTML example below I am trying to grab the $16.95 text in the outer span.price element and exclude the text from the inner span.sale one.
<div class="price">
<span class="sale">
<span class="sale-text">"Low price!"</span>
"$16.95"
</span>
</div>
If I was using Nokogiri this wouldn't be too difficult.
price = doc.css('sale')
price.search('.sale-text').remove
price.text
However Capybara navigates rather than removes nodes. I knew something like price.text would grab text from all sub elements, so I tried to use xpath to be more specific. p.find(:xpath, "//span[#class='sale']", :match => :first).text. However this grabs text from the inner element as well.
Finally, I tried looping through all spans to see if I could separate the results but I get an Ambiguous error.
p.find(:css, 'span').each { |result| puts result.text }
Capybara::Ambiguous: Ambiguous match, found 2 elements matching css "span"
I am using Capybara/Selenium as this is for a web scraping project with authentication complications.
There is no single statement way to do this with Capybara since the DOMs concept of innerText doesn't really support what you want to do. Assuming p is the '.price' element, two ways you could get what you want are as follows:
Since you know the node you want to ignore just subtract that text from the whole text
p.find('span.sale').text.sub(p.find('span.sale-text').text, '')
Grab the innerHTML string and parse that with Nokogiri or Capybara.string (which just wraps Nokogiri elements in the Capybara DSL)
doc = Capybara.string(p['innerHTML'])
nokogiri_fragment = doc.native
#do whatever you want with the nokogiri fragment

How can I extract the node names for fragmented XML document using Ruby?

I an XML-like document which is pre-processed by a system out of my control. The format of the document is like this:
<template>
Hello, there <RECALL>first_name</RECALL>. Thanks for giving me your email.
<SETPROFILE><NAME>email</NAME><VALUE><star/></VALUE></SETPROFILE>. I have just sent you something.
</template>
However, I only get as a text string what is between the <template> tags.
I would like to be able to extract without specifying the tags ahead of time when parsing. I can do this with the Crack gem but only if the tags are at the end of the string and there is only one.
With Crack, I can put a string like
string = "<SETPROFILE><NAME>email</NAME><VALUE>go#go.com</VALUE></SETPROFILE>"
and my output from Crack is:
{"SETPROFILE"=>{"NAME"=>"email", "VALUE"=>"go#go.com"}}
Then I can use a case statement for the possible values I care about.
Given that I need to have multiple <tags> in the string and they cannot be at the end of the string, how can I parse out the node names and the values easily, similar to what I do with crack?
These tags also need to be removed. I would like to continue to use the excellent suggestion from #TinMan.
It works perfectly once I know the name of the tag. The number of tags will be finite. I send the tag to the appropriate method once I know it, but it needs to get parsed out easily first.
Using Nokogiri, you can treat the string as a DocumentFragment, then find the embedded nodes:
require 'nokogiri'
doc = Nokogiri::XML::DocumentFragment.parse(<<EOT)
Hello, there <RECALL>first_name</RECALL>. Thanks for giving me your email.
<SETPROFILE><NAME>email</NAME><VALUE><star/></VALUE></SETPROFILE>. I have just sent you something.
EOT
nodes = doc.search('*').each_with_object({}){ |n, h|
h[n] = n.text
}
nodes # => {#<Nokogiri::XML::Element:0x3ff96083b744 name="RECALL" children=[#<Nokogiri::XML::Text:0x3ff96083a09c "first_name">]>=>"first_name", #<Nokogiri::XML::Element:0x3ff96083b5c8 name="SETPROFILE" children=[#<Nokogiri::XML::Element:0x3ff96083a678 name="NAME" children=[#<Nokogiri::XML::Text:0x3ff960836884 "email">]>, #<Nokogiri::XML::Element:0x3ff96083a650 name="VALUE" children=[#<Nokogiri::XML::Element:0x3ff96083a5c4 name="star">]>]>=>"email", #<Nokogiri::XML::Element:0x3ff96083a678 name="NAME" children=[#<Nokogiri::XML::Text:0x3ff960836884 "email">]>=>"email", #<Nokogiri::XML::Element:0x3ff96083a650 name="VALUE" children=[#<Nokogiri::XML::Element:0x3ff96083a5c4 name="star">]>=>"", #<Nokogiri::XML::Element:0x3ff96083a5c4 name="star">=>""}
Or, more legibly:
nodes = doc.search('*').each_with_object({}){ |n, h|
h[n.name] = n.text
}
nodes # => {"RECALL"=>"first_name", "SETPROFILE"=>"email", "NAME"=>"email", "VALUE"=>"", "star"=>""}
Getting the content of a particular tag is easy then:
nodes['RECALL'] # => "first_name"
Iterating over all the tags is also easy:
nodes.keys.each do |k|
...
end
You can even replace a tag and its content with text:
doc.at('RECALL').replace('Fred')
doc.to_xml # => "Hello, there Fred. Thanks for giving me your email. \n<SETPROFILE>\n <NAME>email</NAME>\n <VALUE>\n <star/>\n </VALUE>\n</SETPROFILE>. I have just sent you something.\n"
How to replace the nested tags is left to you as an exercise.

Resources