I'm trying to revive a simple example that parses a site with Nokogiri, and I hit this error: undefined method `children' for nil:NilClass (NoMethodError)
require 'open-uri'
url = 'http://www.cubecinema.com/programme'
html = open(url)
puts html
require 'nokogiri'
doc = Nokogiri::HTML(html)
showings = doc.css('.showing').map do |showing|
  showing_id = showing['id'].split('_').last.to_i
  tags = showing.css('.tags a')
                .map{|tag| tag.text.strip}
  title_el = showing.at_css('h1 a')
                    .children
                    .delete_if{|c| c.name == 'span'}
  title = title_el.text.strip
  dates = showing.at_css('.start_and_pricing')
                 .inner_html
                 .strip
                 .split('<br>')
                 .map(&:strip)
                 .map{|d| DateTime.parse(d)}
  description = showing.at_css('.copy')
                       .text
                       .delete('[more...]')
                       .strip
  {id: showing_id,
   title: title,
   tags: tags,
   dates: dates,
   description: description}
end
I found a possible solution at https://translate.googleusercontent.com/translate_c?anno=2&depth=1&rurl=translate.google.com&sl=auto&sp=nmt4&tl=ru&u=https://github.com/dwightjack/grunt-email-boilerplate/issues/12&xid=25657,15700023,15700186,15700191,15700248,15700253&usg=ALkJrhgLkK2xqf-6SfL3K16DBRdtdNH0Cw but it isn't clear what the premailer subtasks are, and reading the site didn't really help. Where do I need to write these subtasks? I would be very grateful for clarification, either of my mistake or of how these subtasks need to be defined. I don't understand it myself, possibly because I lack experience.
I can't leave a comment because I lack the reputation, so I can only give advice in the answers section.
I think you should first check what showing.at_css('h1 a') returns, to be sure it responds to children. at_css returns nil when nothing matches, and nil has no children method. Also, some Nokogiri nodes have no children (a meta tag, for example). Hope it helps.
I ran your program locally, and I can't find any h1 tags within the section of HTML you are scraping.
You are getting this error because Nokogiri returns nil when the selector matches nothing, and you then call a method on that nil, which raises the NoMethodError.
This is the section of HTML you are attempting to retrieve "h1 a" from.
<div class="showing" id="event_10427"> <div class="event_image"> <a href="/programme/event/vula-viel-do-not-be-afraid-album-tour,10427/">
<img src="/media/diary/thumbnails/MSJ_vvlive.jpg.600x0_q45.jpg" alt="Picture for event Vula Viel - “Do Not Be Afraid” Album Tour"></a> <span class="tags"> music </span> </div> <!-- div event_image --> <a href="/programme/event/vula-viel-do-not-be-afraid-album-tour,10427/">
<p><span class="pre_title"> Ear Trumpet Music presents </span></p> <h3>Vula Viel - “Do Not Be Afraid” Album Tour</h3> <span class="post_title"> </span> </a> <p></p>
<div class="event_details"> <p class="start_and_pricing"> Thu 28 March // 20:00 <br> </p> <p class="copy">The trio of music makers called Vula Viel weave sparse polyrhythms and intricate rhythm structures around ... [<a class="more" href="/programme/event/vula-viel-do-not-be-afraid-album-tour,10427/">more</a>]</p> </div> </div>
As you can see, there are no h1 tags, so Nokogiri returns nil for your search.
You can either change the selector, if it was a mistake on your part, or, if not every page has an 'h1 a' element, check whether
title_el = showing.at_css('h3 a')
returns nil before you call any further methods on it.
I'm working on a white-hat web-crawler that will periodically log into my account and check some information for me using Ruby with Watir and Nokogiri.
Here's the simplified HTML I'm trying to pull information from:
<div class="navbar navbar-default navbar-fixed-top hidden-lg hidden-md" style="z-index: 1002">
<div class="banner-g">
<div class="container">
<div id="user-info">
<div id="acct-value">
GAIN/LOSS <span class="SPShares">-$12.85</span>
</div>
<div id="committed">
INVESTED <span class="SPPortfolio">$152.11</span>
</div>
<div id="avail">
AVAILABLE <span class="SPBalance">$26.98</span>
</div>
I'm trying to pull the $26.98 at the bottom of the excerpt.
Here are three snippets of code I'm using. They're all pretty much identical except for the XPath. The first two return their values perfectly, but the third always returns a value of "0" even though it 'should' return "$26.98" or "26.98".
val_one = page_html.xpath(".//*[@id='openone']/div/div[2]/div[1]/div/div[2]/table/tbody/tr[2]/td[1]").text.gsub(/\D/,'').to_i
val_two = page_html.xpath(".//*[@id='opentwo']/div/div[2]/div[2]/div/div[2]/table/tbody/tr[2]/td[1]").text.gsub(/\D/,'').to_i
val_three = page_html.xpath(".//*[@id='avail']/a/span").text.gsub(/\D/,'').to_i
puts val_three
I assume it's a problem with the XPath, but I've gone through dozens of XPath troubleshooting questions here and none have worked. I checked the XPath with both FirePath and "XPath Checker". I also tried having the XPath search for the "SPBalance" class but that gave the same result.
When I remove .to_i from the end, it returns a blank line instead of a zero.
Elsewhere in the site when using Watir, I was able to fix problems recording a value by calling .focus, but for this piece of the code, which is more Nokogiri, using .focus causes the error message:
undefined method `focus' for []:Nokogiri::XML::NodeSet (NoMethodError)
I assume .focus doesn't work for Nokogiri.
Update: Replaced HTML with a cleaner/more complete version.
I've continued to play around with different ways of reaching that data cell, including xpath, css and a search method. Someone told me xpath wouldn't work for this page so I spent even more time trying to get css to work. Someone else told me the page had Javascript, which would prevent Watir from working. So I tried rewriting the app for Selenium instead. Selenium did not solve the problem, and created a whole host of other problems.
Update: After following advice from the Tin Man, I've found that the node is not actually visible in the HTML when it is downloaded using curl.
I'm now trying to access the node using Watir instead of Nokogiri (as he suggested).
Here's some of what I've tried so far:
avail_funds = browser.span :class => 'SPBalance'
avail_funds.exists?
avail_funds.text
avail_funds = browser.span(:css, 'span[customattribute]').text
avail_funds = browser.div(:id => "avail").a(:href => "/Profile/MyShares").span(:class => "SPBalance").text
avail_funds = browser.span(:xpath, ".//*[@id='avail']/a/span").text
avail_funds = browser.span(:css, 'span[class="SPBalance"]').text
avail_funds = browser.span.text
avail_funds = browser.div.text
browser.span(:class, "SPBalance").focus
avail_funds = browser.span(:class, "SPBalance").text
avail_funds = @browser.span(:class => 'SPBalance').inner_html
puts @browser.spans(:class => "SPBalance")
puts @browser.span(:class => "SPBalance")
texts = @browser.spans(:class => "SPBalance").map do |span|
  span.text
end
So far all of the above return either blank lines or an error message.
The div with the ID "user-info" is visible within the HTML as downloaded via curl. Everything beneath that, however, is not visible.
When I try:
avail_funds = browser.div(:id => "user-info").text
I get only blank lines.
When I try:
avail_funds = browser.div(:class => "navbar navbar-default navbar-fixed-top hidden-xs hidden-sm").text
I get actual text back! But unfortunately the string does not contain the value I want.
I also tried:
puts browser.html
Because I thought that if the value were visible in that version of the HTML, as it is through my Firefox plug-in, I could parse down to the value I want. But unfortunately the value is not visible in that version of the HTML.
With the first two commands you fetch data from a table cell, starting from the root of the document; in the last one you start from the middle.
Try giving the span an id and fetching the data again, then increase the complexity step by step, and you will find the error in your XPath.
The first problem is you're trying to use a long, too-long, selector that is referencing tags that don't exist:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<head>
<body class="cbp-spmenu-push">
<div id="FreshWidget" class="freshwidget-container responsive" data-html2canvas-ignore="true" style="display: none;">
<div id="freshwidget-button" class="freshwidget-button fd-btn-right" data-html2canvas-ignore="true" style="display: none; top: 235px;">
<link rel="stylesheet" href="/Content/css/NavPushComponent.css"/>
<script src="/Scripts/classie.js"/>
<script src="/Scripts/modernizr.custom.js"/>
<div class="navbar navbar-default navbar-fixed-top hidden-lg hidden-md" style="z-index: 1002">
<div class="banner-g">
<div class="container">
<div id="user-info">
<div id="acct-value">
<div id="committed">
<div id="avail">
<a href="/Profile/MyBalance">
AVAILABLE
<span class="SPBalance">$31.59</span>
EOT
doc.at('tbody') # => nil
".//*[@id='openone']/div/div[2]/div[1]/div/div[2]/table/tbody/tr[2]/td[1]"
".//*[@id='opentwo']/div/div[2]/div[2]/div/div[2]/table/tbody/tr[2]/td[1]"
There is no <tbody> tag in your sample, and there rarely is in HTML created in the wild, especially if people created it manually. We usually see <tbody> in HTML someone grabbed from a browser's "View Source" display, which is the resulting output after their engine has mangled the HTML in an attempt to make it readable. Don't use that output. Instead, ALWAYS go straight to the source and use wget or curl and download the page and inspect it with an editor, or even use nokogiri some_url on the command-line and look at it there.
A second problem is your HTML snippet is invalid because it's full of unterminated tags. Nokogiri will do fixups on bad HTML, which can actually move nodes around, making it difficult to find nodes, especially when debugging. In this particular case Nokogiri is able to terminate them, but it's important to honor tag closures.
Here's what I'd use:
value = doc.at('span.SPBalance').text # => "$31.59"
This is using CSS which is usually much more readable than XPath. at means "find the first occurrence" and is equivalent to search('span.SPBalance').first.
The XPath equivalent would be:
doc.at('//span[@class="SPBalance"]')
doc.at('//span[@class="SPBalance"]').text # => "$31.59"
Once I have the value then it's easy to manipulate it.
value[/[\d.]+/].to_f # => 31.59
Moving on...
the third always returns a value of "0" even though it should return "$31.59" or "31.59"
'$31.58'.to_i # => 0
'$'.to_i # => 0
'31.58'.to_i # => 31
'$31.58'.to_f # => 0.0
'31.58'.to_f # => 31.58
The documentation for to_f and to_i say respectively:
Returns the result of interpreting leading characters in str as a floating point number.
and
Returns the result of interpreting leading characters in str as an integer base base (between 2 and 36).
In both cases "leading characters" is significant.
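To make the "leading characters" point concrete, here is a small sketch in plain Ruby. The helper name `parse_amount` is hypothetical, not part of Nokogiri or Watir; it assumes amounts look like "$26.98" or "-$12.85". It pulls the numeric part out before converting, and returns nil when there are no digits at all, which is what happens when a search finds nothing and the text is an empty string (an empty string converts to 0 with to_i):

```ruby
# Hypothetical helper: extract a signed decimal amount from text such as
# "$26.98" or "-$12.85". Returns nil when the text contains no digits.
def parse_amount(text)
  match = text.to_s.match(/(-?)\$?(\d+(?:\.\d+)?)/)
  return nil unless match

  # Rebuild the number with its sign first, so to_f sees leading digits.
  "#{match[1]}#{match[2]}".to_f
end

available = parse_amount('$26.98')   # the "$" is skipped before conversion
gain_loss = parse_amount('-$12.85')  # the minus sign is kept
not_found = parse_amount('')         # nil instead of a misleading 0
```

Returning nil for an empty string keeps "the node was not found" distinguishable from a genuine balance of zero.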
using .focus causes the error message:
undefined method `focus' for []:Nokogiri::XML::NodeSet (NoMethodError)
I assume .focus doesn't work for Nokogiri.
You could always check the NodeSet documentation, which confirms that focus is not a method.
I am playing with Nokogiri just to learn it and am trying to write a little CL scraper. Right now I am trying to match up each State on the main page with the cities underneath. Below is a snippet of the HTML:
<div class="colmask">
<div class="box box_1">
<h4>Alabama</h4>
<ul>
<li>auburn</li>
<li>birmingham</li>
<li>dothan</li>
<li>florence / muscle shoals</li>
<li>gadsden-anniston</li>
<li>huntsville / decatur</li>
<li>mobile</li>
<li>montgomery</li>
<li>tuscaloosa</li>
</ul>
<h4>Alaska</h4>
<ul>
<li>anchorage / mat-su</li>
<li>fairbanks</li>
<li>kenai peninsula</li>
<li>southeast alaska</li>
</ul>
I can already pull out just this div class of "colmask" easy enough. But now I am just trying to get the UL directly after each h4, but can't find a way to do it so far. Suggestions?
You can get ul elements after h4 using following-sibling:
require 'nokogiri'
html = <<-EOF
<div class="colmask">
<div class="box box_1">
<h4>Alabama</h4>
<ul>
<li>auburn</li>
<li>birmingham</li>
<li>dothan</li>
<li>florence / muscle shoals</li>
<li>gadsden-anniston</li>
<li>huntsville / decatur</li>
<li>mobile</li>
<li>montgomery</li>
<li>tuscaloosa</li>
</ul>
<h4>Alaska</h4>
<ul>
<li>anchorage / mat-su</li>
<li>fairbanks</li>
<li>kenai peninsula</li>
<li>southeast alaska</li>
</ul>
EOF
doc = Nokogiri::HTML(html)
doc.xpath('//h4/following-sibling::ul').each do |node|
puts node.to_html
end
To select ul after an h4 with exact text:
puts doc.xpath("//h4[text()='Alabama']/following-sibling::ul")[0].to_html
I'd do something like this:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<h4>Alabama</h4>
<ul>
<li>auburn</li>
<li>birmingham</li>
</ul>
<h4>Alaska</h4>
<ul>
<li>anchorage / mat-su</li>
<li>fairbanks</li>
</ul>
EOT
states = doc.search('h4')
states_and_cities = states.map{ |state|
  # the sample <li> elements hold plain text, so select 'li' rather than 'li a'
  cities = state.next_element.search('li')
  [state.text, cities.map(&:text)]
}.to_h
At this point states_and_cities is a hash of arrays:
states_and_cities
# => {"Alabama"=>["auburn", "birmingham"],
# "Alaska"=>["anchorage / mat-su", "fairbanks"]}
If you're concerned about having a big structure, it'd be very easy to convert states to a hash where each state's name is a key, and the associated value is the state's node. Then, that node could be grabbed to find only the cities for the particular state.
However, if you're running this code to generate content for a web-page on the fly, then you're going about it wrong. The information for states and cities should be dumped into a database where it can be accessed much more quickly. Then you won't have to do it every time the page is generated.
Being kind and gentle to other sites is important; research the HEAD HTTP request. It's your key to determining whether you should retrieve a page in full. Also, learn how to sniff the cache information from the HTTP headers returned by a server. That tells you what your minimum refresh rate should be. Also, pay attention to the robots.txt file, which tells you what they consider safe for you to scrape; ignoring that can lead to being banned.
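As a rough illustration of the HEAD-and-cache advice, the sketch below uses Ruby's standard Net::HTTP. Both helper names (`head_response`, `max_age_seconds`) are hypothetical, and reading only the `max-age` directive of `Cache-Control` is a simplifying assumption; real servers may also send `Expires`, `ETag`, or `Last-Modified`:

```ruby
require 'net/http'
require 'uri'

# Send a HEAD request, which returns the headers without the page body.
# Defined but not called here, because it needs network access.
def head_response(url)
  uri = URI(url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.head(uri.request_uri)
  end
end

# Hypothetical helper: read max-age (in seconds) from a Cache-Control
# header value, or return nil when the directive is absent.
def max_age_seconds(cache_control)
  value = cache_control.to_s[/max-age=(\d+)/i, 1]
  value && value.to_i
end

one_hour = max_age_seconds('public, max-age=3600')
no_hint  = max_age_seconds('no-store')
```

With a live site, something like `max_age_seconds(head_response(url)['Cache-Control'])` would give a lower bound on how often it is reasonable to re-fetch the page.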
I have some weirdly formatted HTML files which I have to parse.
This is my Ruby code:
File.open('2.html', 'r:utf-8') do |f|
  @parsed = Nokogiri::HTML(f, nil, 'windows-1251')
  puts @parsed.xpath('//span[@id="f5"]//div[@id="f5"]').inner_text
end
I want to parse a file containing:
<span style="position:absolute;top:156pt;left:24pt" id=f6>36.4.1.1. варенье, джемы, конфитюры, сиропы</span>
<div style="position:absolute;top:167.6pt;left:24.7pt;width:709.0;height:31.5;padding-top:23.8;font:0pt Arial;border-width:1.4; border-style:solid;border-color:#000000;"><table></table></div>
<span style="position:absolute;top:171pt;left:28pt" id=f5>003874</span>
<div style="position:absolute;top:171pt;left:99pt" id=f5>ВАРЕНЬЕ "ЭКОПРОДУКТ" ЧЕРНАЯ СМОРОДИНА</div>
<div style="position:absolute;top:180pt;left:99pt" id=f5>325гр. </div>
<div style="position:absolute;top:167.6pt;left:95.8pt;width:2.8;height:31.5;padding-top:23.8;font:0pt Arial;border-width:0 0 0 1.4; border-style:solid;border-color:#000000;"><table></table></div>
I need to select every <div> or <span> with id "f5". With my current XPath selector that isn't possible. If I remove //span[@id="f5"], for example, then the divs are selected correctly. I can output them one after another:
puts @parsed.xpath('//div[@id="f5"]').inner_text
puts @parsed.xpath('//span[@id="f5"]').inner_text
but then the order would be a complete mess. The parsed span has to stay directly next to the div, as in the original file.
Am I missing some basics? I haven't found anything on the web regarding parallel parsing of two elements. Most posts are concerned with parsing two classes of a div for example, but not two different elements at a time.
If I understand this correctly, you can use the following XPath:
//*[self::div or self::span][@id="f5"]
The XPath above finds every element named either div or span whose id attribute equals "f5".
Output:
<span id="f5" style="position:absolute;top:171pt;left:28pt">003874</span>
<div id="f5" style="position:absolute;top:171pt;left:99pt">ВАРЕНЬЕ "ЭКОПРОДУКТ" ЧЕРНАЯ СМОРОДИНА</div>
<div id="f5" style="position:absolute;top:180pt;left:99pt">325гр.</div>
I have this html code.
<div class="main" data-reactid=".0.2.1.1">
<div contenteditable="true" data-reactid=".0.2.1.1.0" autocomplete="off">
<p>
<br>
</p>
</div>
</div>
I have to write into the <p> tag. For this I wrote:
paragraph(:article_title) {div_element(:class=>'main').div(:index=>1).paragraph(:index=>1)}
but it is giving an error. I don't understand what is wrong in this.
There are a couple of problems:
Watir uses a 0-based index. As a result, div(:index=>1) actually means to find the 2nd div tag. As this does not exist, you will get an unable to locate element error.
div and paragraph are not methods defined in the page-object gem. You will get deprecation errors when you try to use them. It should be div_element and paragraph_element respectively.
Try doing:
paragraph(:article_title) {div_element(:class=>'main').div_element(:index=>0).paragraph_element(:index=>0)}
More simply, since :index => 0 is implied:
paragraph(:article_title){div_element(:class=>'main').div_element.paragraph_element}
As there is only one paragraph element, you could further simplify it to:
paragraph(:article_title) {div_element(:class=>'main').paragraph_element}
I am using Watir to write some tests for a web application. I need to get the text 'Bishop' from the HTML below but can't figure out how to do it.
<div id="dnn_ctr353_Main_ctl00_ctl00_ctl00_ctl07_Field_048b9dfa-bc64-42e4-8bd5-b45385e5f45b_view" style="display: block;">
<div class="workprolabel wpFieldLabel">
<span title="Please select a courtesy title from the list.">Title</span> <span class="validationIndicator wpValidationText"></span>
</div>
<span class="wpFieldViewContent" id="dnn_ctr353_Main_ctl00_ctl00_ctl00_ctl07_Field_048b9dfa-bc64-42e4-8bd5-b45385e5f45b_view_value"><p class="wpFieldValue ">Bishop</p></span>
</div>
Firebug tells me the xpath is:
html/body/form/div[5]/div[6]/div[2]/div[2]/div/div/span/span/div[2]/div[4]/div[1]/span[1]/div[2]/span/p/text()
but I can't format the element_by_xpath call to pick it up.
You should be able to access the paragraph right away if it's unique:
my_p = browser.p(:class, "wpFieldValue ")
my_text = my_p.text
See HTML Elements Supported by Watir
Try
//span[@id='dnn_ctr353_Main_ctl00_ctl00_ctl00_ctl07_Field_048b9dfa-bc64-42e4-8bd5-b45385e5f45b_view_value']//text()
EDIT:
Maybe this will work
path = "//span[@id='dnn_ctr353_Main_ctl00_ctl00_ctl00_ctl07_Field_048b9dfa-bc64-42e4-8bd5-b45385e5f45b_view_value']/p";
ie.element_by_xpath(path).text
And check if the span's id is constant
Maybe you have an extra space in the end of the name?
<p class="wpFieldValue ">
Try one of these (worked for me, please notice trailing space after wpFieldValue in the first example):
browser.p(:class => "wpFieldValue ").text
#=> "Bishop"
browser.span(:id => "dnn_ctr353_Main_ctl00_ctl00_ctl00_ctl07_Field_048b9dfa-bc64-42e4-8bd5-b45385e5f45b_view_value").text
#=> "Bishop"
It seems that at run time the div's style changes from none to block.
In that case, collect the text (the entire source or just the div's source) and extract the value from it.
For example:
text=ie.text
particular_div=text.scan(%r{div id="dnn_ctr353_Main_ctl00_ctl00_ctl00_ctl07_Field_048b9dfa-bc64-42e4-8bd5-b45385e5f45b_view" style="display: block;(.*)</span></div>}im).flatten.to_s
particular_div.scan(%r{ <p class="wpFieldValue ">(.*)</p> }im).flatten.to_s
The sample code above should solve your problem.