Cannot access Nokogiri element within block - ruby

I run the following successfully:
require 'nokogiri'
require 'open-uri'
own = Nokogiri::HTML(open('https://www.sec.gov/cgi-bin/own-disp?action=getowner&CIK=0001513362'))
own_table = own.css('table#transaction-report')
p own_table.css('tr').css('td')[4].css('a').attr('href').value
=> "/Archives/edgar/data/0001513362/000162828016019444/0001628280-16-019444-index.htm"
However, when I try to use the element above in a block (as shown in code below), I get a NoMethodError for nil:NilClass.
I am confused, because I thought that the local variable link in the block would be the same object as in the code above.
Furthermore, if I change the definition of link below to:
link = row.css('td')[4].class
I get a hash without error, showing the value of link as Nokogiri::XML::Element.
Can anyone explain why I have a Nokogiri::XML::Element object but cannot call the css method on it, especially when I can call it in the first snippet?
require 'nokogiri'
require 'open-uri'
own = Nokogiri::HTML(open('https://www.sec.gov/cgi-bin/own-disp?action=getowner&CIK=0001513362'))
own_table = own.css('table#transaction-report')
own_table.css('tr').each do |row|
names = [:acq, :transaction_date, :execution_date, :issuer, :form, :transaction_type, :direct_or_indirect_ownership, :number_of_securities_transacted, :number_of_securities_owned, :line_number, :issuer_cik, :security_name, :url]
values = row.css('td').map(&:text)
link = row.css('td')[4].css('a').attr('href').value
values << link
hash = Hash[names.zip values]
puts hash
end
secown.rb:11:in `block in <main>': undefined method `css' for nil:NilClass (NoMethodError)
from /Users/piperwarrior/.rvm/gems/ruby-2.2.1/gems/nokogiri-1.6.7.2/lib/nokogiri/xml/node_set.rb:187:in `block in each'
from /Users/piperwarrior/.rvm/gems/ruby-2.2.1/gems/nokogiri-1.6.7.2/lib/nokogiri/xml/node_set.rb:186:in `upto'
from /Users/piperwarrior/.rvm/gems/ruby-2.2.1/gems/nokogiri-1.6.7.2/lib/nokogiri/xml/node_set.rb:186:in `each'
from secown.rb:8:in `<main>'

The crucial insight is that in the first snippet, own_table.css('tr') returns a NodeSet, and .css('td') then finds every td that is a descendant of any node in that NodeSet. [4] picks the element at index 4 of that combined list (the fourth speaking as a programmer, the fifth for normal people :P ).
The second snippet treats each row individually as a Node, finds the td elements that descend from that one row, and picks the one at index 4 within that row.
So if you have this structure:
tr id=1
td id=2
td id=3
tr id=4
td id=5
td id=6
td id=7
td id=8
td id=9
then the first snippet will give you the id 7 td (it is at index 4 when the td of all tr are counted together). The second snippet looks for index 4 among the td of tr id 1, then among the td of tr id 4, but it errors out on tr id 1, which has only two td: row.css('td')[4] is nil there, and calling css on nil raises NoMethodError.
Edit: Specifically, having checked your URL, the first tr has no td; all the others have 12. So own_table.css('tr')[0].css('td')[4].class is NilClass, not Nokogiri::XML::Element as you report.

Consider this:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<div><span><p>foo</p></span></div>
<div id="bar"><span><p>bar</p></span></div>
</body>
</html>
EOT
If I chain the methods I'm going to find all matching <p> nodes inside the <div>s:
doc.css('div').css('span').css('p').to_html
# => "<p>foo</p><p>bar</p>"
or:
doc.css('div').css('p').to_html
# => "<p>foo</p><p>bar</p>"
That's equivalent to using the following selectors, only they are a bit more efficient as they don't involve calling libXML multiple times:
doc.css('div span p').to_html
# => "<p>foo</p><p>bar</p>"
or:
doc.css('div p').to_html
# => "<p>foo</p><p>bar</p>"
Really you should find landmarks in the target markup and leapfrog from one to the next, not step from tag to tag:
doc.css('#bar p').to_html
# => "<p>bar</p>"
If your intention was to find all matches then replace #bar with div in the above selector and it'll loosen the search.
Finally, if your goal is to extract the text of a set of nodes, you don't want to use something like:
doc.css('div p').text
css, like search and xpath, returns a NodeSet, and text concatenates the text from all returned nodes, making it difficult to retrieve the text of the individual nodes. Instead use:
doc.css('div p').map(&:text)
which will return an array containing the text of each node found:
doc.css('div p').text
# => "foobar"
versus:
doc.css('div p').map(&:text)
# => ["foo", "bar"]
See "How to avoid joining all text from Nodes when scraping" also.

Related

Nokogiri children method

I have the following XML here:
<listing>
<seller_info>
<payment_types>Visa, Mastercard, , , , 0, Discover, American Express </payment_types>
<shipping_info>siteonly, Buyer Pays Shipping Costs </shipping_info>
<buyer_protection_info/>
<auction_info>
<bid_history>
<item_info>
</listing>
The following code works fine for displaying first child of the first //listing node:
require 'nokogiri'
require 'open-uri'
html_data = open('http://aiweb.cs.washington.edu/research/projects/xmltk/xmldata/data/auctions/321gone.xml')
nokogiri_object = Nokogiri::XML(html_data)
listing_elements = nokogiri_object.xpath("//listing")
puts listing_elements[0].children[1]
This also works:
puts listing_elements[0].children[3]
I tried to access the second node, <payment_types>, with the following code:
puts listing_elements[0].children[2]
but a blank line was displayed. Looking through Firebug, it is clearly the 2nd child of the listing node. In general, only odd numbers work with the children method.
Is this a bug in Nokogiri? Any thoughts?
It's not a bug; it's the whitespace between elements (such as "\n"), which the parser keeps as blank text nodes. You can use the noblanks option to drop them:
nokogiri_object = Nokogiri::XML(html_data) { |conf| conf.noblanks }
Use that and you will have no blanks in your array.
The problem is you are not parsing the document correctly. children returns more than you think, and its use is painting you into a corner.
Here's a simplified example of how I'd do it:
require 'nokogiri'
doc = Nokogiri::XML(DATA.read)
auctions = doc.search('listing').map do |listing|
seller_info = listing.at('seller_info')
auction_info = listing.at('auction_info')
hash = [:seller_name, :seller_rating].each_with_object({}) do |s, h|
h[s] = seller_info.at(s.to_s).text.strip
end
[:current_bid, :time_left].each do |s|
hash[s] = auction_info.at(s.to_s).text.strip
end
hash
end
__END__
<?xml version='1.0' ?>
<!DOCTYPE root SYSTEM "http://www.cs.washington.edu/research/projects/xmltk/xmldata/data/auctions/321gone.dtd">
<root>
<listing>
<seller_info>
<seller_name>537_sb_3 </seller_name>
<seller_rating> 0</seller_rating>
</seller_info>
<auction_info>
<current_bid> $839.93</current_bid>
<time_left> 1 Day, 6 Hrs</time_left>
</auction_info>
</listing>
<listing>
<seller_info>
<seller_name> lapro8</seller_name>
<seller_rating> 0</seller_rating>
</seller_info>
<auction_info>
<current_bid> $210.00</current_bid>
<time_left> 4 Days, 21 Hrs</time_left>
</auction_info>
</listing>
</root>
After running, auctions will be:
auctions
# => [{:seller_name=>"537_sb_3",
# :seller_rating=>"0",
# :current_bid=>"$839.93",
# :time_left=>"1 Day, 6 Hrs"},
# {:seller_name=>"lapro8",
# :seller_rating=>"0",
# :current_bid=>"$210.00",
# :time_left=>"4 Days, 21 Hrs"}]
Notice there are no empty text nodes to deal with because I told Nokogiri exactly which nodes to grab text from. You should be able to extend the code to grab any information you want easily.
A typically formatted XML or HTML document that displays nesting or indentation uses text nodes to provide that indenting:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<p>foo</p>
</body>
</html>
EOT
Here's what your code is seeing:
doc.at('body').children.map(&:to_html)
# => ["\n" +
# " ", "<p>foo</p>", "\n" +
# " "]
The Text nodes are what are confusing you:
doc.at('body').children.first.class # => Nokogiri::XML::Text
doc.at('body').children.first.text # => "\n "
If you don't drill down far enough you will pick up the Text nodes and have to clean up the results:
doc.at('body')
.text # => "\n foo\n "
.strip # => "foo"
Instead, explicitly find the node you want and extract the information:
doc.at('body p').text # => "foo"
In the suggested code above I used strip because the incoming XML had spaces surrounding some text:
h[s] = seller_info.at(s.to_s).text.strip
which is the result of the original XML creation code not cleaning the lines prior to generating the XML. So sometimes we have to clean up their mess, but the proper accessing of the node can reduce that a lot.
The problem is that children includes text nodes, such as the whitespace between elements. If you use element_children instead, you get just the child elements (i.e. the tags themselves, without the whitespace-only text nodes around them).

Properly separate String elements in an Array [duplicate]

I am trying to parse an HTML page using Nokogiri to get some companies names.
names = []
names << Nokogiri::HTML(mypage).css(".name a").text
My result is:
["MikeGetsLeadsUruBlondeLaunch LIVERynoRyderBoyer ProductionsStrangerxCerealLume CubeKatapyMacaulay Outdoor PromotionsFlixit ABMedia MosaicLiftCast.TVcool.mediaPeekKLIKseeStreamingo SolutionsPvgnaalughaUser"]
But what I'd like to get is:
["MikeGetsLeads", "Uru", "Blonde", "Launch LIVE", "RynoRyderBoyer Productions", "Stranger", "xCereal", "Lume Cube", "Katapy", "Macaulay Outdoor Promotions", "Flixit AB", "Media Mosaic", "LiftCast.TV", "cool.media", "Peek", "KLIKsee", "Streamingo Solutions", "Pvgna", "alugha", "User"]
I tried to use .split, but it does not give me the right result either. On this page, each name belongs to its own <div>, so the names are clearly separated in the HTML structure.
The HTML structure looks like this
<div class='name'>
MikeGetsLeads
</div>
The problem is, you are using text with a NodeSet, not with individual nodes. With a NodeSet all the text is concatenated into a single String. Per the NodeSet.inner_text AKA text documentation:
Get the inner text of all contained Node objects
and the actual code is:
def inner_text
collect(&:inner_text).join('')
end
whereas Node.content AKA text or inner_text
Returns the content for this Node
Meditate on this:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<div>
<p>foo</p>
<p>bar</p>
</div>
EOT
doc.css('p').class # => Nokogiri::XML::NodeSet
doc.css('p').text # => "foobar"
Instead, you need to use text on individual nodes:
doc.css('p').map{ |n| n.class } # => [Nokogiri::XML::Element, Nokogiri::XML::Element]
doc.css('p').map{ |n| n.text } # => ["foo", "bar"]
The previous line can be simplified to:
doc.css('p').map(&:text) # => ["foo", "bar"]
See "How to avoid joining all text from Nodes when scraping" also.
require 'rubygems'
require 'nokogiri'
require 'pp'
names = []
mypage = File.open("myhtml.html", "r")
Nokogiri::HTML(mypage).css(".name a").each do |item|
names << item.text
end
pp names
returns:
["MikeGetsLeads", "MikeGetsLeads2", "MikeGetsLeads3"]

Is there a way to access Nokogiri::XML::Attr by using a symbol key, not a string key

I have code like this:
require 'nokogiri'
url = ENV['URL']
doc = Nokogiri::HTML(open(url))
link = doc.css('a#foo').attr('href').value
I want to access the Nokogiri::XML::Attr by using a symbol, like this:
doc = Nokogiri::HTML(open(url), hash_key_symbol: true)
link = doc.css('a#foo').attr(:href).value
I couldn't find information for it, but maybe I've overlooked it.
Is there an option like this?
You are calling attr on the NodeSet returned from css, not on a single Node object. attr on a Node will accept a symbol to specify the attribute, and has done for a while, but it looks like the corresponding change hasn’t been made to NodeSet#attr. Note that the NodeSet version of attr is for setting the attribute on all nodes in the set, and will only return the value of the attribute on the first node it contains if you don’t specify a value.
You can use at_css to explicitly only select the first matching node of your query, then you can use a symbol:
doc.at_css('a#foo').attr(:href).value
Alternatively you could select the node from the node set by its index:
doc.css('a#foo')[0].attr(:href).value
The simple way to access an attribute of a tag is to use the "hash" [] syntax:
require 'nokogiri'
html = <<EOT
<html>
<body>
<a id="foo" href="blah">bar</a>
</body>
</html>
EOT
doc = Nokogiri::HTML(html)
doc.at('a#foo')['href'] # => "blah"
But we can use a symbol:
doc.at('a#foo')[:href] # => "blah"
Note that at is equivalent to search('a#foo').first, and both return a Node. search and its CSS and XPath variants return NodeSets, which cannot return the attribute of a specific node or of all nodes. To process multiple nodes, use map:
html = <<EOT
<html>
<body>
<a class="foo" href="blah">bar1</a>
<a class="foo" href="more_blah">bar2</a>
</body>
</html>
EOT
doc = Nokogiri::HTML(html)
doc.search('a.foo').map{ |n| n['href'] } # => ["blah", "more_blah"]

Get text of a paragraph with all the markup (and their content) removed

How can I get only the text of the node <p> which has other tags in it like:
<p>hello my website is <a>click here</a> <b>test</b></p>
I only want "hello my website is"
This is what I tried:
begin
node = html_doc.css('p')
node.each do |node|
node.children.remove
end
return (node.nil?) ? '' : node.text
rescue
return ''
end
Update 2: All right. You are removing all children with node.children.remove, including the text nodes. A possible solution might look like:
# 1. select all <p> nodes
doc.css('p').
# 2. map children, and flatten
map { |node| node.children }.flatten.
# 3. select text nodes only
select { |node| node.text? }.
# 4. get text and join
map { |node| node.text }.join(' ').strip
This sample returns "hello my website is", but note that doc.css('p') also finds <p> tags within <p> tags.
Update: sorry, misread your question, you only want "hello my website is", see solution above, original answer:
Not directly with nokogiri, but the sanitize gem might be an option: https://github.com/rgrove/sanitize/
Sanitize.clean(html, {}) # => " hello my website is click here test "
FYI, it uses nokogiri internally.
Your test case did not include any interesting text interleaved with the markup.
If you want to turn <p>Hello <b>World</b>!</p> into "Hello !", then removing the children is one way to do it. Simpler (and less destructive) is to just find all the text nodes and join them:
require 'nokogiri'
html = Nokogiri::HTML('<p>Hello <b>World</b>!</p>')
# Find the first paragraph (in this case the only one)
para = html.at('p')
# Find all the text nodes that are children (not descendants),
# change them from nodes into the strings of text they contain,
# and then smush the results together into one big string.
p para.search('text()').map(&:text).join
#=> "Hello !"
If you want to turn <p>Hello <b>World</b>!</p> into "Hello " (no exclamation point) then you can simply do:
p para.children.first.text # if you know that text is the first child
p para.at('text()').text # if you want to find the first text node
As @Iwe showed, you can use the String#strip method to remove leading/trailing whitespace from the result, if you like.
There's a different way to go about this. Rather than bother with removing nodes, remove the text that those nodes contain:
require 'nokogiri'
doc = Nokogiri::HTML('<p>hello my website is <a>click here</a> <b>test</b></p>')
text = doc.search('p').map{ |p|
p_text = p.text
a_text = p.at('a').text
p_text[a_text] = ''
p_text
}
puts text
>>hello my website is test
This is a simple example, but the idea is to find the <p> tags, then scan inside those for the tags that contain the text you don't want. For each of those unwanted tags, grab their text and delete it from the surrounding text.
In the sample code, you'd have a list of undesirable nodes at the a_text assignment, loop over them, and iteratively remove the text, like so:
text = doc.search('p').map{ |p|
p_text = p.text
%w[a].each do |bad_nodes|
bad_nodes_text = p.at(bad_nodes).text
p_text[bad_nodes_text] = ''
end
p_text
}
You get back text which is an array of the tweaked text contents of the <p> nodes.

how to find all the child nodes inside the matched elements (including text nodes)?

In jQuery it's quite simple, for instance:
$("br").parent().contents().each(function() {
But with Nokogiri and XPath it's not working out quite so well:
var = doc.xpath('//br/following-sibling::text()|//br/preceding-sibling::text()').map do |fruit| fruit.to_s.strip end
require 'rubygems'
require 'nokogiri'
doc = Nokogiri::HTML(DATA.read)
fruits = doc.xpath('//br/../text()').map { |text| text.content.strip }
p fruits
__END__
<html>
<body>
<div>
apple<br>
banana<br>
cherry<br>
orange<br>
</div>
</body>
I'm not familiar with nokogiri, but are you trying to find all the children of any element that contains a <br/>? If so, then try:
//*[br]/node()
In any case, using text() will only match text nodes, and not any sibling elements, which may or may not be what you want. If you actually only want text nodes, then
//*[br]/text()
should do the trick.

Resources