Get content after header tag with Nokogiri - ruby

I am playing with Nokogiri just to learn it and am trying to write a little CL scraper. Right now I am trying to match up each State on the main page with the cities underneath. Below is a snippet of the HTML:
<div class="colmask">
<div class="box box_1">
<h4>Alabama</h4>
<ul>
<li>auburn</li>
<li>birmingham</li>
<li>dothan</li>
<li>florence / muscle shoals</li>
<li>gadsden-anniston</li>
<li>huntsville / decatur</li>
<li>mobile</li>
<li>montgomery</li>
<li>tuscaloosa</li>
</ul>
<h4>Alaska</h4>
<ul>
<li>anchorage / mat-su</li>
<li>fairbanks</li>
<li>kenai peninsula</li>
<li>southeast alaska</li>
</ul>
I can already pull out just this div class of "colmask" easy enough. But now I am just trying to get the UL directly after each h4, but can't find a way to do it so far. Suggestions?

You can get ul elements after h4 using following-sibling:
require 'nokogiri'
html = <<-EOF
<div class="colmask">
<div class="box box_1">
<h4>Alabama</h4>
<ul>
<li>auburn</li>
<li>birmingham</li>
<li>dothan</li>
<li>florence / muscle shoals</li>
<li>gadsden-anniston</li>
<li>huntsville / decatur</li>
<li>mobile</li>
<li>montgomery</li>
<li>tuscaloosa</li>
</ul>
<h4>Alaska</h4>
<ul>
<li>anchorage / mat-su</li>
<li>fairbanks</li>
<li>kenai peninsula</li>
<li>southeast alaska</li>
</ul>
EOF
doc = Nokogiri::HTML(html)
doc.xpath('//h4/following-sibling::ul').each do |node|
puts node.to_html
end
To select ul after an h4 with exact text:
puts doc.xpath("//h4[text()='Alabama']/following-sibling::ul")[0].to_html

I'd do something like this:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<h4>Alabama</h4>
<ul>
<li>auburn</li>
<li>birmingham</li>
</ul>
<h4>Alaska</h4>
<ul>
<li>anchorage / mat-su</li>
<li>fairbanks</li>
</ul>
EOT
states = doc.search('h4')
states_and_cities = states.map{ |state|
cities = state.next_element.search('li')
[state.text, cities.map(&:text)]
}.to_h
At this point states_and_cities is a hash of arrays:
states_and_cities
# => {"Alabama"=>["auburn", "birmingham"],
# "Alaska"=>["anchorage / mat-su", "fairbanks"]}
If you're concerned about having a big structure, it'd be very easy to convert states to a hash where each state's name is a key, and the associated value is the state's node. Then, that node could be grabbed to find only the cities for the particular state.
However, if you're running this code to generate content for a web-page on the fly, then you're going about it wrong. The information for states and cities should be dumped into a database where it can be accessed much more quickly. Then you won't have to do it every time the page is generated.
Being kind and gentle to other sites is important. Research the HEAD HTTP request; it's your key to determining whether you should retrieve a page in full. Also, learn how to read the cache information from the HTTP headers returned by a server, which tells you what your minimum refresh interval should be. Finally, pay attention to the robots.txt file, which tells you what the site considers safe for you to scrape; ignoring it can get you banned.
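As a rough illustration of the HEAD request idea, here is a sketch using Ruby's standard Net::HTTP. The method names max_age_seconds and cache_hints are invented for this example, and which headers a given server actually returns is an assumption you would need to check per site.

```ruby
require 'net/http'
require 'uri'

# Extract max-age (in seconds) from a Cache-Control header value.
# Returns nil when the header is missing or has no max-age directive.
def max_age_seconds(cache_control)
  match = cache_control.to_s.match(/max-age=(\d+)/i)
  match && match[1].to_i
end

# Issue a HEAD request, so no body is downloaded, and collect caching hints.
def cache_hints(url)
  uri = URI(url)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.head(uri.request_uri)
  end
  {
    max_age: max_age_seconds(response['Cache-Control']),
    last_modified: response['Last-Modified'],
    etag: response['ETag']
  }
end

# Usage (performs a real network request, so it is left commented out):
#   puts cache_hints('https://example.com/').inspect
```

If max_age comes back as a number, re-fetching the page more often than that is unlikely to give you anything new.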

Related

undefined method `children' for nil:NilClass (NoMethodError)

I'm trying to revive a simple example of parsing a site with the help of Nokogiri and ran into this error: undefined method `children' for nil:NilClass (NoMethodError)
require 'open-uri'
url = 'http://www.cubecinema.com/programme'
html = open(url)
puts html
require 'nokogiri'
doc = Nokogiri::HTML(html)
showings = doc.css('.showing').map do |showing|
showing_id = showing['id'].split('_').last.to_i
tags = showing.css('.tags a')
.map{|tag| tag.text.strip}
title_el = showing.at_css('h1 a')
.children
.delete_if{|c| c.name == 'span'}
title = title_el.text.strip
dates = showing.at_css('.start_and_pricing')
.inner_html
.strip
.split('<br>')
.map(&:strip)
.map{|d| DateTime.parse(d)}
description = showing.at_css('.copy')
.text
.delete('[more...]')
.strip
{id: showing_id,
title: title,
tags: tags,
dates: dates,
description: description}
end
I found a possible solution at https://translate.googleusercontent.com/translate_c?anno=2&depth=1&rurl=translate.google.com&sl=auto&sp=nmt4&tl=ru&u=https://github.com/dwightjack/grunt-email-boilerplate/issues/12&xid=25657,15700023,15700186,15700191,15700248,15700253&usg=ALkJrhgLkK2xqf-6SfL3K16DBRdtdNH0Cw but it's not clear what the premailer subtasks are. Reading the site didn't really help, and I don't know where these subtasks need to be written. I would be very grateful for clarification, either of my mistake or of how these subtasks should be defined; I lack the experience to work it out myself.
I am not able to leave a comment due to lack of reputation, so I can only give advice in the answers section.
So, I think you should first check what showing.at_css('h1 a') returns before calling children on it. at_css returns nil when nothing matches the selector, and nil has no children method. Hope it helps.
I ran your program locally, and I can't find any h1 tags within the section of markup you are scraping.
You are getting this error because at_css returns nil when nothing matches, and you are then calling children on that nil, which raises the NoMethodError.
This is the section of markup you are attempting to retrieve "h1 a" from.
<div class="showing" id="event_10427"> <div class="event_image"> <a href="/programme/event/vula-viel-do-not-be-afraid-album-tour,10427/">
<img src="/media/diary/thumbnails/MSJ_vvlive.jpg.600x0_q45.jpg" alt="Picture for event Vula Viel - “Do Not Be Afraid” Album Tour"></a> <span class="tags"> music </span> </div> <!-- div event_image --> <a href="/programme/event/vula-viel-do-not-be-afraid-album-tour,10427/">
<p><span class="pre_title"> Ear Trumpet Music presents </span></p> <h3>Vula Viel - “Do Not Be Afraid” Album Tour</h3> <span class="post_title"> </span> </a> <p></p>
<div class="event_details"> <p class="start_and_pricing"> Thu 28 March // 20:00 <br> </p> <p class="copy">The trio of music makers called Vula Viel weave sparse polyrhythms and intricate rhythm structures around ... [<a class="more" href="/programme/event/vula-viel-do-not-be-afraid-album-tour,10427/">more</a>]</p> </div> </div>
As you can see, there are no h1 tags, so Nokogiri returns nil for your search.
You can either change the selector, if 'h1 a' was a mistake on your part, or, if not every page has an 'h1 a' element, check whether
title_el = showing.at_css('h3 a')
returns nil before you call children on it.

Scraping content from html page

I'm using nokogiri to scrape web pages. The structure of the page is made of an unordered list containing multiple list items each of which has a link, an image and text, all contained in a div.
I'm trying to find clean way to extract the elements in each list item so I can have each li contained in an array or hash like so:
li[0] = ['Acme co 1', 'image1.png', 'Customer 1 details']
li[1] = ['Acme co 2', 'image2.png', 'Customer 2 details']
At the moment I get all the elements in one go then store them in separate arrays. Is there a better, more idiomatic way of doing this?
This is the code atm:
data = Nokogiri::HTML(html)
images = []
names = []
data.css('ul li img').each {|l| images << l}
data.css('ul li a').each {|a| names << a.text }
This is the html I'm working from:
<ul class="customers">
<li>
<div>
Acme co 1
<div class="customer-image">
<img src="image1.png"/>
</div>
<div class=" customer-description">
Customer 1 details
</div>
</div>
</li>
<li>
<div>
Acme co 2
<div class="customer-image">
<img src="image1.png"/>
</div>
<div class=" customer-description">
Customer 2 details
</div>
</div>
</li>
</ul>
Thanks
Assuming the code you have is giving you what you want, I wouldn't try to rewrite anything significant. You can be more brief and idiomatic by replacing your #each methods with #map:
data = Nokogiri::HTML(html)
images = data.css('ul li img')
names = data.css('ul li a').map(&:text)
This simplifies your code slightly, but your original version wasn't too bad.
And my simplification may not generalise if you are, for example, scraping images from multiple regions on the page! In which case, reverting back to something like your original may be fine.

Page Object Gem. Making accessors for elements without ids

Let's say I have a simple page that has fewer IDs than I'd like for testing
<div class="__panel_body">
<div class="__panel_header">Real Estate Rating</div>
<div class="__panel_body">
<div class="__panel_header">Property Rating Info</div>
<a class="icon.edit"></a>
<a class="icon.edit"></a>
</div>
<div class="__panel_body">
<div class="__panel_header">General Risks</div>
<a class="icon.edit"></a>
<a class="icon.edit"></a>
</div>
<div class="__panel_body">
<div class="__panel_header">Amenities</div>
<a class="icon.edit"></a>
<a class="icon.edit"></a>
</div>
</div>
I'm using Jeff Morgan's Page Object gem and I want to make accessors for the edit links in any given section.
The challenge is that the panel headers differentiate what body I want to choose. Then I need to access the parent and get all links with class "icon.edit". Assume I can't change the HTML to solve this.
Here's a start
module RealEstateRatingPageFields
div(:general_risks_section, ....)
def general_risks_edit_links
general_risks_section_element.links(class: "icon.edit")
end
end
How do I get the general_risks_section accessor to work, though?
I want that to represent the parent div to the panel header with text 'General Risks'...
There are a number of ways to get the general risk section.
Using a Block
The accessors can take a block where you can more programmatically describe how to locate the element. This allows you to locate a distinguishing element and then traverse the DOM to the element you actually want. In this case, you can locate the header with the matching text and navigate to its parent.
div(:general_risks_section) { div_element(class: '__panel_header', text: 'General Risks').parent }
Using XPath
While harder to read and write, you could also use an XPath locator. The concept and thought process is the same as using the block. The only benefit is that it reduces the number of element calls, which slightly improves performance.
div(:general_risks_section, xpath: './/div[@class="__panel_body"][./div[@class="__panel_header" and text() = "General Risks"]]')
The XPath is saying:
.//div # Find a div element that
[@class="__panel_body"] # Has the class "__panel_body" and
[./div[ # Contains a div element that
@class="__panel_header" and # Has the class "__panel_header" and
text() = "General Risks" # Has the text "General Risks"
]]
Using the Body Text
Given the HTML, you could also just locate the section directly based on its text.
div(:general_risks_section, class: '__panel_body', text: 'General Risks')
Note that this assumes that the HTML given was not simplified. If there are actually other text nodes, this probably would not be the best option.

How do I replace tags defining a node?

We're trying to move from a rather small bug tracking system to Redmine. For our old system, there's no ready migration solution script available, so we want to do that ourselves.
I suggested using Nokogiri to move some of the formatting over to the new format (Textile), however, I ran into problems.
This is from the DB field in our old system's DB:
<ul>
<li>list item 1</li>
<li>list item 2</li>
</ul>
This needs to be translated into Textile, and it would look like this:
* list item 1
* list item 2
Now, starting to parse using Nokogiri, I'm here:
def self.handle_ul(page)
uls = page.css("ul")
uls.each {|ul|
lis = ul.css("li")
lis.each { |li|
li.inner_html = "*" << li.text << "\n"
}
}
end
This works like a charm. However, I need to do two replacements:
<li>
</li>
tags need to be removed from the <li> object, and:
<ul>
</ul>
tags need to be removed from the <ul> object. However, I cannot seem to find the actual tags in the object representing it. inner_html returned only the HTML between the tags I'm looking for:
ul.inner_html
Results in:
<li>list item 1</li>
<li>list item 2</li>
Where can I find the tags I need to replace? I thought about using parent and reassociate the child <li> tags with the parent.parent, but that would order them at the end of the grandparent.
Can I somehow access the whole HTML representation of an object, without stripping its defining tags out, so that I can replace them?
EDIT:
As requested, here is a mockup of an old DB entry and the style it should have in textile.
Before transformation:
Fixed for rev. 1.7.92.
<h4>Problems:</h4>
<ul>
<li>fixed.</li>
<li>fixed. New minimum 270x270</li>
<li>fixed.</li>
<li>fixed.</li>
<li>fixed.</li>
<li>fixed. Column types list is growing horizontally now.</li>
</ul>
After transformation:
Fixed for rev. 1.7.92.
h4.Problems:
* fixed.
* fixed. New minimum 270x270
* fixed.
* fixed.
* fixed.
* fixed. Column types list is growing horizontally now.
EDIT 2:
I tried to overwrite parts of the to_s method of the Nokogiri elements:
li.to_s["<li>"]=""
but that doesn't seem to be a valid lvalue (not that there is an error, it just doesn't do anything).
Here's the basis for such a transform:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<ul>
<li>list item 1</li>
<li>list item 2</li>
</ul>
EOT
puts doc.to_html
doc.search('ul').each do |ul|
ul.search('li').each do |li|
li.replace("* #{ li.text.strip }")
end
ul.replace(ul.text)
end
puts doc.to_html
Running that outputs:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><ul>
<li>list item 1</li>
<li>list item 2</li>
</ul></body></html>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>* list item 1
* list item 2
</body></html>
I didn't intend, or attempt, to make the first "item" have a leading carriage-return or line-feed. That's left as an exercise for the reader. Nor did I try to handle the <h4> tags or similar substitutions. From the answer code you should be able to figure out how to do it.
Also, I'm using Nokogiri::HTML to parse the HTML, which turns it into a full HTML document with the appropriate DOCTYPE header, <html> and <body> tags to mimic a full HTML document. That could be changed using Nokogiri::HTML::DocumentFragment.parse instead but wouldn't really make a difference in the output.
You may want to look at ClothRed, which is an HTML to Textile converter in Ruby. It hasn't been updated in a while, but it's simple and may be a good starting point for your own converter.
If you really want to use Nokogiri, you're writing a filter, so you may want to use the SAX interface.
You may want to try McBean (https://github.com/flavorjones/mcbean) [caveat: I'm the author of the gem, and it hasn't been updated in a while].
It's similar to ClothRed in spirit, but uses Nokogiri under the hood and actually transforms the document structure into output text. It supports a substantial subset of Textile; in fact, I've used it successfully to convert wiki pages between wiki systems, as you're trying to do.
If anybody interested finds this later, another alternative is to use Pandoc. I've just run my first tests, and it seems almost sufficient, and it can handle many more formats.

Building an HTML document with content from another

I have a document A and want to build a new one, B, using A's node values.
Given A looks like this...
<html>
<head></head>
<body>
<div id="section0">
<h1>Section 0</h1>
<div>
<p>Some <b>important</b> info here</p>
<div>Some unimportant info here</div>
</div>
</div>
<div id="section1">
<h1>Section 1</h1>
<div>
<p>Some <i>important</i> info here</p>
<div>Some unimportant info here</div>
</div>
</div>
</body>
</html>
When building a B document, I'm using method a.at_css("#section#{n} h1").text to grab the data from A's h1 tags like this:
require 'nokogiri'
a = Nokogiri::HTML(html)
Nokogiri::HTML::Builder.new do |doc|
...
doc.h1 a.at_css("#section#{n} h1").text
...
end
So there are three questions:
1. How do I grab the content of <p> tags preserving tags inside <p>? Currently, calling a.at_css("#section#{n} p").text returns plain text, which is not what's needed. If I call .to_html or .inner_html instead of .text, the HTML appears escaped, so I get, for example, &lt;p&gt; instead of <p>.
2. Is there any known true way of assigning nodes at the document building stage, so that I don't have to go through the text method at all? That is, how do I assign the node a.at_css("#section#{n} h1") to doc.h1 at the building stage?
3. What's the profit of Nokogiri::Builder.with(...) method? I wonder if I can make use of it...
How do I grab the content of <p> tags preserving tags inside <p>?
Use .inner_html. The entities are not escaped when accessing them. They will be escaped if you do something like builder.node_name raw_html. Instead:
require 'nokogiri'
para = Nokogiri.HTML( '<p id="foo">Hello <b>World</b>!</p>' ).at('#foo')
doc = Nokogiri::HTML::Builder.new do |d|
d.body do
d.div(id:'content') do
d.parent << para.inner_html
end
end
end
puts doc.to_html
#=> <body><div id="content">Hello <b>World</b>!</div></body>
Is there any known true way of assigning nodes at the document building stage?
Similar to the above, one way is:
puts Nokogiri::HTML::Builder.new{ |d| d.body{ d.parent << para } }.to_html
#=> <body><p id="foo">Hello <b>World</b>!</p></body>
Voila! The node has moved from one document to the other.
What's the profit of Nokogiri::Builder.with(...) method?
That's rather unrelated to the rest of your question. As the documentation says:
Create a builder with an existing root object. This is for use when you have an existing document that you would like to augment with builder methods. The builder context created will start with the given root node.
I don't think it would be useful to you here.
In general, I find the Builder to be convenient when writing a large number of custom nodes from scratch with a known hierarchy. When not doing that you may find it simpler to just create a new document and use DOM methods to add nodes as appropriate. It's hard to tell how much hard-coded nodes/hierarchy your document will have versus procedurally created.
One other, alternative suggestion: perhaps you should create a template XML document and then augment that with details from the other, scraped HTML?

Resources