Getting a single child element with a given name with Nokogiri - ruby

Let's say I have XML which looks like this:
<paper>
<header>
</header>
<body>
<paragraph>
</paragraph>
</body>
<conclusion>
</conclusion>
</paper>
Is there a way I can just get conclusion, without making an ugly loop like:
for child in paper.children do
  if child.name == "conclusion"
    conclusion = child
  end
end
puts conclusion
Ideally something like python's Element.find('conclusion').

Try the xpath method:
node = doc.xpath("//conclusion")[0]
or, if you know there is just one match:
node = doc.at_xpath("//conclusion")

Related

Parsing a nested tag, moving it outside of the parent, and changing its type using Nokogiri

I have HTML coming from an API that I want to clean up and format.
I'm trying to find any <strong> tag that is the first element inside a <p> tag, move it out in front of that <p> tag, and convert it to an <h4>.
For example:
<p><strong>This is what I want to pull out to an h4 tag.</strong>Here's the rest of the paragraph.</p>
becomes:
<h4>This is what I want to pull out to an h4 tag.</h4><p>Here's the rest of the paragraph.</p>
EDIT: Apologies for the nature of the question being too 'please write this for me'. I posted the solution I came up with below. I just had to take the time to really learn how Nokogiri works, but it is quite powerful and it seems like you can do almost anything with it.
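require 'nokogiri'

doc = Nokogiri::HTML::DocumentFragment.parse(html)
doc.css("p").each do |paragraph|
  first = paragraph.children.first
  # Skip empty paragraphs and paragraphs that do not start with <strong>.
  next unless first && first.element? && first.name == "strong"
  first.name = 'h4'                     # rename <strong> to <h4>
  paragraph.add_previous_sibling(first) # move it in front of the <p>
end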

Get element in particular index nokogiri

How can I get the element at index 2?
For example, in the following HTML I want to display the third element, i.e. a DIV:
<HTMl>
<DIV></DIV>
<OL></OL>
<DIV> </DIV>
</HTML>
I have been trying the following:
p1 = html_doc.css('body:nth-child(2)')
puts p1
I don't think you're understanding how we use a parser like Nokogiri, because it's a lot easier than you make it out to be.
I'd use:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<HTMl>
<DIV>1</DIV>
<OL></OL>
<DIV>2</DIV>
</HTML>
EOT
doc.at('//div[2]').to_html # => "<div>2</div>"
That's using at, which returns the first Node that matches the selector. //div[2] is an XPath selector that matches a <div> that is the second <div> child of its parent. search could be used instead of at, but it returns a NodeSet, which is like an array, so I'd need to extract that particular node from it.
Alternately, I could use CSS instead of XPath:
doc.search('div:nth-child(3)').to_html # => "<div>2</div>"
Which, to me, is not really an improvement over the XPath as far as readability.
Using search to find all occurrences of a particular tag means I have to select the particular element from the returned NodeSet:
doc.search('div')[1].to_html # => "<div>2</div>"
Or:
doc.search('div').last.to_html # => "<div>2</div>"
The downside to using search this way is that it will be slower and needlessly memory-intensive on big documents: search finds every node in the document that matches the selector, and all but one are then thrown away. search, css and xpath all behave that way, so if you only need the first matching node, use at or its at_css and at_xpath equivalents, and provide a selector specific enough to find just the tag you want.
'body:nth-child(2)' doesn't work because that's not how ":nth-child()" works. nth-child matches an element with the given tag only if that element is the "nth" child of its parent. So you're asking for a <body> that is the second child of its "html" parent, not for a child of <body>. At best that returns the <body> itself, as in a correctly formed HTML document:
<html>
<head></head>
<body></body>
</html>
(How you tell Nokogiri to parse the document determines how the resulting DOM is structured.)
Instead, use div:nth-child(3), which says "find the <div> that is the third child of its parent". The parent here is "body", so the result is the second div tag.
Back to how Nokogiri can be told to parse a document. Meditate on the difference between these:
doc = Nokogiri::HTML(<<EOT)
<p>foo</p>
EOT
puts doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body>
# >> <p>foo</p>
# >> </body></html>
and:
require 'nokogiri'
doc = Nokogiri::HTML::DocumentFragment.parse(<<EOT)
<p>foo</p>
EOT
puts doc.to_html
# >> <p>foo</p>
If you can modify the HTML, add ids and classes so you can easily target what you are looking for (also add the body tag).
If you cannot modify the HTML, keep your selector simple and access the second element of the array:
html_doc.css('div')[1]

Select element by attribute value with XPath in Nokogiri

So if I have this piece of code
<body>
<div class="red">
<a href="http://www.example.com">Example</a>
</div>
</body>
I know that I want to get an element with the attribute "class" and value "red", but I don't know where it is located.
If I used XPath, is this piece of code right?
url = "http://www.domain.com"
doc = Nokogiri::HTML(open(url))
doc.xpath('.//*[class="red"]')
I'm just learning so I don't know if any of this is wrong. I can't make it work. Thanks.
Edit: Now it's working =)
doc.xpath('//*[@class="red"]')
Change class to @class. Remove the dot in the beginning. Then it will work.

Watir. How to make a call to an element of page object, inside a line like $browser.link.click?

We have page object elements like:
link(:test_link, xpath: ".//a[@id = '3']")
unordered_list(:list, id: 'test')
And the code:
def method(elementcontainer, elementlink)
  elementcontainer = elementcontainer.downcase.gsub(' ', '_')
  elementlink = elementlink.downcase.gsub(' ', '_')
  object = send("#{elementcontainer}_element")
  object2 = send("#{elementlink}_element")
  total_results_1 = object.element.links(id: '3').length
  total_results_2 = object.element.links(object2).length
end
The last 2 lines contain the mystery.
The total_results_1 is able to get the number of links contained in the unordered list that have id = '3'.
total_results_2 does not work (of course). I don't want to repeat the identification of the links in the middle of the code. That is already done in the page object.
How is it possible to write something like the total_results_2 line, but in a working version?
I might be misunderstanding the question, but I do not believe you need to create a method for what you want. It can all be done using the page object accessors.
Say we have the following page (I matched this to your accessors, though it seems unlikely that all links would have the same id):
<html>
<body>
<a id="3" href="#">1</a>
<ul id="test">
<li><a id="3" href="#">2</a></li>
<li><a id="3" href="#">3</a></li>
<li><a id="3" href="#">4</a></li>
</ul>
<a id="3" href="#">5</a>
</body>
</html>
As you did, you could define the list with the accessor:
unordered_list(:list, id: 'test')
To get only the links with id 3 that are within the list, you could:
Define the links as a collection - ie use links instead of link.
Use a block to locate the elements. This would allow you to consider the element nesting - ie locate links within the list element.
This would be done with:
links(:test_link){ list_element.link_elements(:id => '3') }
All together, your page object would be:
class MyPage
  include PageObject
  unordered_list(:list, id: 'test')
  links(:test_link){ list_element.link_elements(:id => '3') }
end
To find the number of links, you would access the element collection and check its length.
browser = Watir::Browser.new
browser.goto('your_test_page.htm')
page = MyPage.new(browser)
puts page.test_link_elements.length
#=> 3

Building an HTML document with content from another

I have a document A and want to build a new one, B, using A's node values.
Given A looks like this...
<html>
<head></head>
<body>
<div id="section0">
<h1>Section 0</h1>
<div>
<p>Some <b>important</b> info here</p>
<div>Some unimportant info here</div>
</div>
</div>
<div id="section1">
<h1>Section 1</h1>
<div>
<p>Some <i>important</i> info here</p>
<div>Some unimportant info here</div>
</div>
</div>
</body>
</html>
When building a B document, I'm using method a.at_css("#section#{n} h1").text to grab the data from A's h1 tags like this:
require 'nokogiri'
a = Nokogiri::HTML(html)
Nokogiri::HTML::Builder.new do |doc|
...
doc.h1 a.at_css("#section#{n} h1").text
...
end
So there are three questions:
1. How do I grab the content of <p> tags preserving tags inside <p>? Currently, once I hit a.at_css("#section#{n} p").text it returns plain text, which is not what's needed. If, instead of .text, I hit .to_html or .inner_html, the HTML appears escaped. So I get, for example, &lt;p&gt; instead of <p>.
2. Is there any known true way of assigning nodes at the document building stage, so that I wouldn't have to dance with the text method at all? I.e. how do I assign the doc.h1 node the value of the a.at_css("#section#{n} h1") node at the building stage?
3. What's the profit of Nokogiri::Builder.with(...) method? I wonder if I can make use of it...
How do I grab the content of <p> tags preserving tags inside <p>?
Use .inner_html. The entities are not escaped when accessing them. They will be escaped if you do something like builder.node_name raw_html. Instead:
require 'nokogiri'
para = Nokogiri.HTML( '<p id="foo">Hello <b>World</b>!</p>' ).at('#foo')
doc = Nokogiri::HTML::Builder.new do |d|
  d.body do
    d.div(id:'content') do
      d.parent << para.inner_html
    end
  end
end
puts doc.to_html
#=> <body><div id="content">Hello <b>World</b>!</div></body>
Is there any known true way of assigning nodes at the document building stage?
Similar to the above, one way is:
puts Nokogiri::HTML::Builder.new{ |d| d.body{ d.parent << para } }.to_html
#=> <body><p id="foo">Hello <b>World</b>!</p></body>
Voila! The node has moved from one document to the other.
What's the profit of Nokogiri::Builder.with(...) method?
That's rather unrelated to the rest of your question. As the documentation says:
Create a builder with an existing root object. This is for use when you have an existing document that you would like to augment with builder methods. The builder context created will start with the given root node.
I don't think it would be useful to you here.
In general, I find the Builder to be convenient when writing a large number of custom nodes from scratch with a known hierarchy. When not doing that, you may find it simpler to just create a new document and use DOM methods to add nodes as appropriate. It's hard to tell how many hard-coded nodes and how much fixed hierarchy your document will have versus procedurally created content.
One other, alternative suggestion: perhaps you should create a template XML document and then augment that with details from the other, scraped HTML?
