Parsing a nested tag, moving it outside of the parent, and changing its type using Nokogiri - ruby

I have HTML coming from an API that I want to clean up and format.
I'm trying to find any <strong> tag that is the first element inside a <p> tag, move it out so that it sits immediately before the <p>, and convert it to an <h4>.
For example:
<p><strong>This is what I want to pull out to an h4 tag.</strong>Here's the rest of the paragraph.</p>
becomes:
<h4>This is what I want to pull out to an h4 tag.</h4><p>Here's the rest of the paragraph.</p>
EDIT: Apologies for the nature of the question being too 'please write this for me'. I posted the solution I came up with below. I just had to take the time to really learn how Nokogiri works, but it is quite powerful and it seems like you can do almost anything with it.

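require 'nokogiri'

doc = Nokogiri::HTML::DocumentFragment.parse(html)
doc.css("p").each do |paragraph|
  first = paragraph.children.first
  # Skip empty paragraphs and paragraphs that do not start with <strong>
  next unless first && first.element? && first.name == "strong"
  first.name = 'h4'                      # rename <strong> to <h4>
  paragraph.add_previous_sibling(first)  # move it in front of the <p>
end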

Related

Excluding contents of <span> from text using Watir

Watir:
mytext = browser.element(:xpath => '//*[@id="gold"]/div[1]/h1').text
HTML:
<h1>
This is the text I want
<span> I do not want this text </span>
</h1>
When I run my Watir code, it selects all the text, including what is in the spans. How do I just get the text "This is the text I want", and no span text?
If you have more complicated HTML, I find it can be easier to deal with this using Nokogiri, as it provides more methods for parsing the HTML:
require 'nokogiri'
h1 = browser.element(:xpath => '//*[@id="gold"]/div[1]/h1')
doc = Nokogiri::HTML.fragment(h1.html)
mytext = doc.at('h1').children.select(&:text?).map(&:text).join.strip
Ideally start by trying to avoid using XPath. One of the most powerful features of Watir is the ability to create complicated locators without XPath syntax.
The issue is that calling text on a node gets all content within that node. You'd need to do something like:
top_level = browser.element(id: 'gold')
h1_text = top_level.h1.text
span_text = top_level.h1.span.text
desired_text = h1_text.chomp(span_text)
This is useful for top level text.
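
One caveat with the chomp approach is that it only strips a matching suffix. A small sketch with plain strings standing in for the values Watir would return (the sample strings are assumptions, not real page output):

```ruby
# Plain strings stand in for what element.text would return
h1_text   = 'This is the text I want I do not want this text'
span_text = 'I do not want this text'

# chomp removes span_text only when it is at the very end of h1_text
trailing = h1_text.chomp(span_text).strip

# sub removes the first occurrence anywhere, so it also handles
# a child element that sits in the middle of the parent text
mixed_text = 'Before inner text after'
middle = mixed_text.sub('inner text ', '')

puts trailing
puts middle
```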
If there is only one h1, you can omit the id (String#remove comes from ActiveSupport):
@b.h1.text.remove(@b.h1.children.collect(&:text).join(' '))
Or specify it if there are more:
@b.h1(id: 'gold').text.remove(@b.h1(id: 'gold').children.collect(&:text).join(' '))
Make it a method and call it from your script with get_top_text(@b.h1):
def get_top_text(el)
  el.text.chomp(el.children.collect(&:text).join(' '))
end

How to find an element's text in Capybara while ignoring inner element text

In the HTML example below I am trying to grab the $16.95 text in the outer span.sale element and exclude the text from the inner span.sale-text one.
<div class="price">
<span class="sale">
<span class="sale-text">"Low price!"</span>
"$16.95"
</span>
</div>
If I was using Nokogiri this wouldn't be too difficult.
price = doc.css('.sale')
price.search('.sale-text').remove
price.text
However, Capybara navigates rather than removes nodes. I knew something like price.text would grab text from all sub-elements, so I tried to use XPath to be more specific: p.find(:xpath, "//span[@class='sale']", :match => :first).text. However, this grabs text from the inner element as well.
Finally, I tried looping through all spans to see if I could separate the results but I get an Ambiguous error.
p.find(:css, 'span').each { |result| puts result.text }
Capybara::Ambiguous: Ambiguous match, found 2 elements matching css "span"
I am using Capybara/Selenium as this is for a web scraping project with authentication complications.
There is no single-statement way to do this with Capybara, since the DOM's concept of innerText doesn't really support what you want to do. Assuming p is the '.price' element, two ways you could get what you want are as follows:
Since you know the node you want to ignore, just subtract that text from the whole text:
p.find('span.sale').text.sub(p.find('span.sale-text').text, '')
Grab the innerHTML string and parse that with Nokogiri or Capybara.string (which just wraps Nokogiri elements in the Capybara DSL)
doc = Capybara.string(p['innerHTML'])
nokogiri_fragment = doc.native
#do whatever you want with the nokogiri fragment

Hpricot: How to extract inner text without other html subelements

I'm working on a vim rspec plugin (https://github.com/skwp/vim-rspec) - and I am parsing some html from rspec. It looks like this:
doc = %{
<dl>
<dt id="example_group_1">This is the heading text</dt>
Some puts output here
</dl>
}
I can get the entire inner HTML of the <dl> using:
(Hpricot.parse(doc)/:dl).first.inner_html
I can get just the dt by using
(Hpricot.parse(doc)/:dl).first/:dt
But how can I access the "Some puts output here" area? If I use inner_html, there is way too much other junk to parse through. I've looked through hpricot docs but don't see an easy way to get essentially the inner text of an html element, disregarding its html children.
I ended up figuring out a route by myself, by manually parsing the children:
(@context/"dl").each do |dl|
  dl.children.each do |child|
    if child.is_a?(Hpricot::Elem) && child.name == 'dd'
      # do stuff with the element
    elsif child.is_a?(Hpricot::Text)
      text = child.to_s.strip
      puts text unless text.empty?
    end
  end
end
Note that this is bad HTML you have there. If you have control over it, you should wrap the content you want in a <dd>.
In XML terms what you are looking for is the TextNode following the <dt> element. In my comment I showed how you can select this node using XPath in Nokogiri.
However, if you must use Hpricot, and cannot select text nodes using it, then you could hack this by getting the inner_html and then stripping out the unwanted:
(Hpricot.parse(doc)/:dl).first.inner_html.sub(%r{<dt[^>]*>.+?</dt>}m, '')

Building an HTML document with content from another

I have a document A and want to build a new one, B, using A's node values.
Given A looks like this...
<html>
<head></head>
<body>
<div id="section0">
<h1>Section 0</h1>
<div>
<p>Some <b>important</b> info here</p>
<div>Some unimportant info here</div>
</div>
</div>
<div id="section1">
<h1>Section 1</h1>
<div>
<p>Some <i>important</i> info here</p>
<div>Some unimportant info here</div>
</div>
</div>
</body>
</html>
When building a B document, I'm using method a.at_css("#section#{n} h1").text to grab the data from A's h1 tags like this:
require 'nokogiri'
a = Nokogiri::HTML(html)
Nokogiri::HTML::Builder.new do |doc|
...
doc.h1 a.at_css("#section#{n} h1").text
...
end
So there are three questions:
How do I grab the content of <p> tags preserving tags inside <p>? Currently, a.at_css("#section#{n} p").text returns plain text, which is not what's needed. If I use .to_html or .inner_html instead of .text, the HTML appears escaped, so I get, for example, &lt;p&gt; instead of <p>.
Is there any known true way of assigning nodes at the document building stage, so that I don't have to go through the text method at all? That is, how do I give the doc.h1 node the value of the a.at_css("#section#{n} h1") node at the building stage?
What's the profit of Nokogiri::Builder.with(...) method? I wonder if I can make use of it...
How do I grab the content of <p> tags preserving tags inside <p>?
Use .inner_html. The entities are not escaped when accessing them. They will be escaped if you do something like builder.node_name raw_html. Instead:
require 'nokogiri'
para = Nokogiri.HTML( '<p id="foo">Hello <b>World</b>!</p>' ).at('#foo')
doc = Nokogiri::HTML::Builder.new do |d|
d.body do
d.div(id:'content') do
d.parent << para.inner_html
end
end
end
puts doc.to_html
#=> <body><div id="content">Hello <b>World</b>!</div></body>
Is there any known true way of assigning nodes at the document building stage?
Similar to the above, one way is:
puts Nokogiri::HTML::Builder.new{ |d| d.body{ d.parent << para } }.to_html
#=> <body><p id="foo">Hello <b>World</b>!</p></body>
Voila! The node has moved from one document to the other.
What's the profit of Nokogiri::Builder.with(...) method?
That's rather unrelated to the rest of your question. As the documentation says:
Create a builder with an existing root object. This is for use when you have an existing document that you would like to augment with builder methods. The builder context created will start with the given root node.
I don't think it would be useful to you here.
In general, I find the Builder to be convenient when writing a large number of custom nodes from scratch with a known hierarchy. When not doing that you may find it simpler to just create a new document and use DOM methods to add nodes as appropriate. It's hard to tell how much hard-coded nodes/hierarchy your document will have versus procedurally created.
One other, alternative suggestion: perhaps you should create a template XML document and then augment that with details from the other, scraped HTML?

Parsing inner tags using Nokogiri

I'm stuck not being able to parse irregularly embedded html tags. Is there a way to remove all html tags from a node and retain all text?
I'm using the code:
rows = doc.search('//table[@id="table_1"]/tbody/tr')
details = rows.collect do |row|
detail = {}
[
[:word, 'td[1]/text()'],
[:meaning, 'td[6]/font'],
].collect do |name, xpath|
detail[name] = row.at_xpath(xpath).to_s.strip
end
detail
end
Using Xpath:
[:meaning, 'td[6]/font']
generates
:meaning: ! '<font size="3">asking for information specifying <font
color="#CC0000" size="3">what is your name?</font> /what/ as in, <font color="#CC0000" size="3">I'm not sure what you mean</font>
/what/ as in <a style="text-decoration: none;" href="http://somesecretlink.com">what</a></font>
On the other hand, using Xpath:
'td/font/text()'
generates
:meaning: asking for information specifying
thus ignoring all children of the node. What I want to achieve is this
:meaning: asking for information specifying what is your name? /what/ as in, I'm not sure what you mean /what/ as in what? I can't hear you
This depends on what you need to extract. If you want all text in font elements, you can do it with the following xpath:
'td/font//text()'
It extracts all text nodes in font tags. If you want all text nodes in the cell, then:
'td//text()'
You can also call the text method on a Nokogiri node:
row.at_xpath(xpath).text
I added an answer for this same sort of question the other day. It's a very easy process.
Take a look at: Convert HTML to plain text and maintain structure/formatting, with ruby
