I have a variable e which stores a Nokogiri::XML::Element object.
When I execute puts e, I get the following on the screen:
<h3 class="fixed-recipe-card__h3">
<a href="https://www.allrecipes.com/recipe/21712/chocolate-covered-strawberries/" data-content-provider-id="" data-internal-referrer-link="hub recipe" class="fixed-recipe-card__title-link">
<span class="fixed-recipe-card__title-link">Chocolate Covered Strawberries</span>
</a>
</h3>
I would like to scrape this part: https://www.allrecipes.com/recipe/21712/chocolate-covered-strawberries/
How can I do this using Nokogiri?
If you want to extract the link, you can use:
e.at_css("a").attributes["href"].value
.at_css returns the first element matching the CSS selector (another Nokogiri::XML::Element). To get a list of all matching elements, use .css instead.
.attributes gives you a hash mapping attribute name to Nokogiri::XML::Attr. Once you look up the desired attribute in this hash (href), you can call .value to get the actual text value.
I'm trying to write "Private Equity Group; USA" to a file.
"Private Equity Group" prints fine, but I get an error for the "USA" portion
TypeError: null is not an object (evaluating 'style.display')
HTML code:
<div class="cl profile-xsmall">
<div class="cl profile-small-bold">Private Equity Group</div>
USA
</div>
The XPath for "USA" is:
//*[@id="addrDiv-Id"]/div/div[3]/text()
I get the error when I print the XPath or have it in an if statement:
if internet.has_xpath?('//*[@id="addrDiv-Id"]/div/div[3]/text()')
  file.puts "#{internet.find(:xpath, '//*[@id="addrDiv-Id"]/div/div[3]/text()')}"
end
Capybara is not a general-purpose XPath library; it is aimed at testing and is therefore element-centric. The XPath expressions you pass to it need to select elements, not text nodes. Select the containing element and read its text instead:
if internet.has_xpath?('//*[@id="addrDiv-Id"]/div/div[3]')
  file.puts internet.find(:xpath, '//*[@id="addrDiv-Id"]/div/div[3]').text
end
That said, using XPath for this is a poor choice. Default to CSS whenever possible: it's easier to read and faster for the browser to process. Something like:
if internet.has_css?('#addrDiv-Id > div > div:nth-of-type(3)')
  file.puts internet.find('#addrDiv-Id > div > div:nth-of-type(3)').text
end
Or, if the HTML allows it (I can't tell without seeing more of it):
if internet.has_css?('#addrDiv-Id .cl.profile-xsmall')
  file.puts internet.find('#addrDiv-Id .cl.profile-xsmall').text
end
Or, even cleaner, if it works for your use case:
file.puts internet.first('#addrDiv-Id .cl.profile-xsmall')&.text
Another way to do it:
xml = %{<div class="cl profile-xsmall">
<div class="cl profile-small-bold">Private Equity Group</div>
USA</div>}
require 'rexml/document'
doc = REXML::Document.new xml
print(REXML::XPath.match(doc, 'normalize-space(string(//div[@class="cl profile-xsmall"]))'))
Output:
["Private Equity Group USA"]
I'd say the HTML isn't well designed; wrapping "USA" in a span would have been better. But this works:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<div class="cl profile-xsmall">
<div class="cl profile-small-bold">Private Equity Group</div>
USA
</div>
EOT
div = doc.at('.profile-small-bold')
[div.text.strip, div.next_sibling.text.strip].join(' ')
# => "Private Equity Group USA"
which can be reduced to:
[div, div.next_sibling].map { |n| n.text.strip }.join(' ')
# => "Private Equity Group USA"
The problem is that you have two nested divs, with "USA" trailing, so it's important to point to the inner node which has the main text you want. Then "USA" is in the following text node, which is accessible using next_sibling:
div.next_sibling.class # => Nokogiri::XML::Text
div.next_sibling # => #<Nokogiri::XML::Text:0x3c "\n USA\n">
Note, I'm using CSS selectors; they're easier to read, a point the Nokogiri documentation echoes. I have no proof they're faster, and, because Nokogiri uses libxml to process both, there's probably no real difference worth worrying about. Use whatever makes more sense, and run benchmarks if you're curious.
You might be tempted to use text against the div class="cl profile-xsmall" node, but don't be sucked into that, as it's a trap:
doc.at('.profile-xsmall').text # => "\n Private Equity Group\n USA\n"
doc.at('.profile-xsmall').text.gsub(/\s+/, ' ').strip # => "Private Equity Group USA"
text returns a single string made by concatenating all the text nodes. In this particular, rare case the result is somewhat usable; usually, however, you'll get something like this:
doc = Nokogiri::HTML('<div><p>foo</p><p>bar</p></div>')
doc.at('div').text # => "foobar"
doc.search('p').text # => "foobar"
Once those text nodes have been concatenated it's really difficult to take them apart again. Nokogiri's documentation talks about this:
Note: This joins the text of all Node objects in the NodeSet:
doc = Nokogiri::XML('<xml><a><d>foo</d><d>bar</d></a></xml>')
doc.css('d').text # => "foobar"
Instead, if you want to return the text of all nodes in the NodeSet:
doc.css('d').map(&:text) # => ["foo", "bar"]
The XPath for "USA" is:
//*[@id="addrDiv-Id"]/div/div[3]/text()
Um, no, not according to the HTML you gave us. But, let's pretend.
Using an absolute path to a node is a good way to write fragile selectors. It takes only a small change in the HTML to break your access to the node. Instead, find way-points to skip through the HTML to find the node you want, taking advantage of CSS and XPath to search downward through the DOM.
Typically, a selector like yours is generated by a browser, which isn't a good source to trust. Browsers often do fixups on malformed HTML, which changes it from what Nokogiri or another parser would see and results in a non-existent target. Alternatively, the browser presents the HTML after JavaScript has had a chance to run, which can move nodes, hide them, add new ones, and so on.
Instead of trusting the browser, use curl, wget or nokogiri at the command-line to dump the file and look at it using a text editor. Then you'll be seeing it just as Nokogiri sees it, prior to any fixups or mangling.
Let's say I have a simple page that has fewer IDs than I'd like for testing:
<div class="__panel_body">
<div class="__panel_header">Real Estate Rating</div>
<div class="__panel_body">
<div class="__panel_header">Property Rating Info</div>
<a class="icon.edit"></a>
<a class="icon.edit"></a>
</div>
<div class="__panel_body">
<div class="__panel_header">General Risks</div>
<a class="icon.edit"></a>
<a class="icon.edit"></a>
</div>
<div class="__panel_body">
<div class="__panel_header">Amenities</div>
<a class="icon.edit"></a>
<a class="icon.edit"></a>
</div>
</div>
I'm using Jeff Morgan's Page Object gem and I want to make accessors for the edit links in any given section.
The challenge is that the panel headers differentiate what body I want to choose. Then I need to access the parent and get all links with class "icon.edit". Assume I can't change the HTML to solve this.
Here's a start
module RealEstateRatingPageFields
div(:general_risks_section, ....)
def general_risks_edit_links
general_risks_section_element.links(class: "icon.edit")
end
end
How do I get the general_risks_section accessor to work, though?
I want that to represent the parent div to the panel header with text 'General Risks'...
There are a number of ways to get the general risk section.
Using a Block
The accessors can take a block where you can describe more programmatically how to locate the element. This allows you to locate a distinguishing element and then traverse the DOM to the element you actually want. In this case, you can locate the header with the matching text and navigate to its parent.
div(:general_risks_section) { div_element(class: '__panel_header', text: 'General Risks').parent }
Using XPath
Although it is harder to read and write, you could also use an XPath locator. The concept and thought process are the same as with the block. The only benefit is that it reduces the number of element calls, which slightly improves performance.
div(:general_risks_section, xpath: './/div[@class="__panel_body"][./div[@class="__panel_header" and text() = "General Risks"]]')
The XPath is saying:
.//div                        # Find a div element that
[@class="__panel_body"]       # Has the class "__panel_body" and
[./div[                       # Contains a div element that
@class="__panel_header" and   # Has the class "__panel_header" and
text() = "General Risks"      # Has the text "General Risks"
]]
Using the Body Text
Given the HTML, you could also just locate the section directly based on its text.
div(:general_risks_section, class: '__panel_body', text: 'General Risks')
Note that this assumes that the HTML given was not simplified. If there are actually other text nodes, this probably would not be the best option.
Say I want to have a page with content like this:
<h1>{{page.comment_count}} Comment(s)</h1>
{% for c in page.comment_list %}
<div>
<strong>{{c.title}}</strong><br/>
{{c.content}}
</div>
{% endfor %}
There are no variables on the page named comment_count or comment_list by default; instead I want these variables to be added to the page from a Jekyll plugin. Where is a safe place I can populate those fields from without interfering with Jekyll's existing code?
Or is there a better way of achieving a list of comments like this?
Unfortunately, there is currently no way to add these attributes without touching Jekyll internals. We're working on adding hooks such as #after_initialize, but aren't there yet.
My best suggestion is to add these attributes as I've done with my Octopress Date plugin on my blog. It uses Jekyll v1.2.0's Jekyll::Post#to_liquid method to add these attributes, which are collected via send(attr) on the Post:
class Jekyll::Post
def comment_count
comment_list.size
end
def comment_list
YAML.safe_load_file("_comments/#{self.id}.yml")
end
# Convert this post into a Hash for use in Liquid templates.
#
# Returns <Hash>
def to_liquid(attrs = ATTRIBUTES_FOR_LIQUID)
super(attrs + %w[
comment_count
comment_list
])
end
end
super(attrs + %w[ ... ]) will ensure that all the old attributes are still included, then collect the return values of the methods corresponding to the entries in the String array.
This is the best means of extending posts and pages so far.
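The super(attrs + ...) pattern can be tried out in isolation. The sketch below does not use Jekyll at all: LiquidLike and FakePost are hypothetical stand-ins that only mimic the "collect each attribute via send" behaviour described above, and the comment data is hard-coded instead of being read from a YAML file.

```ruby
# Hypothetical stand-in for the Jekyll code that turns attribute names into a Hash.
module LiquidLike
  def to_liquid(attrs)
    attrs.to_h { |name| [name, send(name)] }
  end
end

# Hypothetical stand-in for Jekyll::Post.
class FakePost
  include LiquidLike

  ATTRIBUTES_FOR_LIQUID = %w[title].freeze

  def title
    'Hello'
  end

  # Hard-coded here; the real plugin loads this from a YAML file.
  def comment_list
    [{ 'title' => 'First', 'content' => 'Nice post' }]
  end

  def comment_count
    comment_list.size
  end

  # Keep the existing attributes and append the new ones before delegating.
  def to_liquid(attrs = ATTRIBUTES_FOR_LIQUID)
    super(attrs + %w[comment_count comment_list])
  end
end

payload = FakePost.new.to_liquid
p payload
```

The resulting Hash contains title alongside comment_count and comment_list, which is what makes the new fields visible to the template.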
I have a document A and want to build a new document B using A's node values.
Given A looks like this...
<html>
<head></head>
<body>
<div id="section0">
<h1>Section 0</h1>
<div>
<p>Some <b>important</b> info here</p>
<div>Some unimportant info here</div>
</div>
</div>
<div id="section1">
<h1>Section 1</h1>
<div>
<p>Some <i>important</i> info here</p>
<div>Some unimportant info here</div>
</div>
</div>
</body>
</html>
When building a B document, I'm using method a.at_css("#section#{n} h1").text to grab the data from A's h1 tags like this:
require 'nokogiri'
a = Nokogiri::HTML(html)
Nokogiri::HTML::Builder.new do |doc|
...
doc.h1 a.at_css("#section#{n} h1").text
...
end
So there are three questions:
How do I grab the content of <p> tags preserving tags inside <p>?
Currently, when I call a.at_css("#section#{n} p").text it returns plain text, which is not what's needed.
If I call .to_html or .inner_html instead of .text, the HTML comes out escaped. So I get, for example, &lt;p&gt; instead of <p>.
Is there any known true way of assigning nodes at the document building stage? So that I wouldn't dance with text method at all? I.e. how do I assign doc.h1 node with value of a.at_css("#section#{n} h1") node at building stage?
What's the profit of Nokogiri::Builder.with(...) method? I wonder if I can get use of it...
How do I grab the content of <p> tags preserving tags inside <p>?
Use .inner_html. The markup is not escaped when you read it; it only gets escaped if you pass it as text to a builder method, as in builder.node_name raw_html. Instead, append it to the builder's current parent node:
require 'nokogiri'
para = Nokogiri.HTML( '<p id="foo">Hello <b>World</b>!</p>' ).at('#foo')
doc = Nokogiri::HTML::Builder.new do |d|
d.body do
d.div(id:'content') do
d.parent << para.inner_html
end
end
end
puts doc.to_html
#=> <body><div id="content">Hello <b>World</b>!</div></body>
Is there any known true way of assigning nodes at the document building stage?
Similar to the above, one way is:
puts Nokogiri::HTML::Builder.new{ |d| d.body{ d.parent << para } }.to_html
#=> <body><p id="foo">Hello <b>World</b>!</p></body>
Voila! The node has moved from one document to the other.
What's the profit of Nokogiri::Builder.with(...) method?
That's rather unrelated to the rest of your question. As the documentation says:
Create a builder with an existing root object. This is for use when you have an existing document that you would like to augment with builder methods. The builder context created will start with the given root node.
I don't think it would be useful to you here.
In general, I find the Builder convenient when writing a large number of custom nodes from scratch with a known hierarchy. When you're not doing that, you may find it simpler to create a new document and use DOM methods to add nodes as appropriate. It's hard to tell how much of your document's hierarchy will be hard-coded versus procedurally created.
One other, alternative suggestion: perhaps you should create a template XML document and then augment that with details from the other, scraped HTML?