I have the following XML here:
<listing>
<seller_info>
<payment_types>Visa, Mastercard, , , , 0, Discover, American Express </payment_types>
<shipping_info>siteonly, Buyer Pays Shipping Costs </shipping_info>
<buyer_protection_info/>
<auction_info>
<bid_history>
<item_info>
</listing>
The following code works fine for displaying the first child of the first //listing node:
require 'nokogiri'
require 'open-uri'
html_data = open('http://aiweb.cs.washington.edu/research/projects/xmltk/xmldata/data/auctions/321gone.xml')
nokogiri_object = Nokogiri::XML(html_data)
listing_elements = nokogiri_object.xpath("//listing")
puts listing_elements[0].children[1]
This also works:
puts listing_elements[0].children[3]
I tried to access the second node <payment_types> with the following code:
puts listing_elements[0].children[2]
but a blank line was displayed. Looking through Firebug, it is clearly the 2nd child of the listing node. In general, only odd numbers work with the children method.
Is this a bug in Nokogiri? Any thoughts?
It's not a bug; it's the whitespace text nodes created when parsing strings that contain "\n" (or empty nodes). You can use the noblanks option to avoid them:
nokogiri_object = Nokogiri::XML(html_data) { |conf| conf.noblanks }
Use that and you will have no blanks in your array.
The problem is you are not parsing the document correctly. children returns more than you think, and its use is painting you into a corner.
Here's a simplified example of how I'd do it:
require 'nokogiri'
doc = Nokogiri::XML(DATA.read)
auctions = doc.search('listing').map do |listing|
  seller_info = listing.at('seller_info')
  auction_info = listing.at('auction_info')
  hash = [:seller_name, :seller_rating].each_with_object({}) do |s, h|
    h[s] = seller_info.at(s.to_s).text.strip
  end
  [:current_bid, :time_left].each do |s|
    hash[s] = auction_info.at(s.to_s).text.strip
  end
  hash
end
__END__
<?xml version='1.0' ?>
<!DOCTYPE root SYSTEM "http://www.cs.washington.edu/research/projects/xmltk/xmldata/data/auctions/321gone.dtd">
<root>
<listing>
<seller_info>
<seller_name>537_sb_3 </seller_name>
<seller_rating> 0</seller_rating>
</seller_info>
<auction_info>
<current_bid> $839.93</current_bid>
<time_left> 1 Day, 6 Hrs</time_left>
</auction_info>
</listing>
<listing>
<seller_info>
<seller_name> lapro8</seller_name>
<seller_rating> 0</seller_rating>
</seller_info>
<auction_info>
<current_bid> $210.00</current_bid>
<time_left> 4 Days, 21 Hrs</time_left>
</auction_info>
</listing>
</root>
After running, auctions will be:
auctions
# => [{:seller_name=>"537_sb_3",
# :seller_rating=>"0",
# :current_bid=>"$839.93",
# :time_left=>"1 Day, 6 Hrs"},
# {:seller_name=>"lapro8",
# :seller_rating=>"0",
# :current_bid=>"$210.00",
# :time_left=>"4 Days, 21 Hrs"}]
Notice there are no empty text nodes to deal with because I told Nokogiri exactly which nodes to grab text from. You should be able to extend the code to grab any information you want easily.
A typically formatted XML or HTML document that displays nesting or indentation uses text nodes to provide that indenting:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<p>foo</p>
</body>
</html>
EOT
Here's what your code is seeing:
doc.at('body').children.map(&:to_html)
# => ["\n" +
# " ", "<p>foo</p>", "\n" +
# " "]
The Text nodes are what are confusing you:
doc.at('body').children.first.class # => Nokogiri::XML::Text
doc.at('body').children.first.text # => "\n "
If you don't drill down far enough you will pick up the Text nodes and have to clean up the results:
doc.at('body')
.text # => "\n foo\n "
.strip # => "foo"
Instead, explicitly find the node you want and extract the information:
doc.at('body p').text # => "foo"
In the suggested code above I used strip because the incoming XML had spaces surrounding some text:
h[s] = seller_info.at(s.to_s).text.strip
which is the result of the original XML creation code not cleaning the lines prior to generating the XML. So sometimes we have to clean up their mess, but the proper accessing of the node can reduce that a lot.
The problem is that children includes text nodes, such as the whitespace between elements. If you use element_children instead, you get just the child elements (i.e. the tags themselves, not the surrounding whitespace).
I run the following successfully:
require 'nokogiri'
require 'open-uri'
own = Nokogiri::HTML(open('https://www.sec.gov/cgi-bin/own-disp?action=getowner&CIK=0001513362'))
own_table = own.css('table#transaction-report')
p own_table.css('tr').css('td')[4].css('a').attr('href').value
=> "/Archives/edgar/data/0001513362/000162828016019444/0001628280-16-019444-index.htm"
However, when I try to use the element above in a block (as shown in code below), I get a NoMethodError for nil:NilClass.
I am confused, because I thought that the local variable link in the block would be the same object as in the code above.
Furthermore, if I change the definition of link below to:
link = row.css('td')[4].class
I get a hash without error, saying the value of link is Nokogiri::XML::Element.
Can anyone explain why I have a Nokogiri::XML::Element object but cannot call the css method on it, especially when I can call it in the first snippet?
require 'nokogiri'
require 'open-uri'
own = Nokogiri::HTML(open('https://www.sec.gov/cgi-bin/own-disp?action=getowner&CIK=0001513362'))
own_table = own.css('table#transaction-report')
own_table.css('tr').each do |row|
  names = [:acq, :transaction_date, :execution_date, :issuer, :form, :transaction_type, :direct_or_indirect_ownership, :number_of_securities_transacted, :number_of_securities_owned, :line_number, :issuer_cik, :security_name, :url]
  values = row.css('td').map(&:text)
  link = row.css('td')[4].css('a').attr('href').value
  values << link
  hash = Hash[names.zip values]
  puts hash
end
secown.rb:11:in `block in <main>': undefined method `css' for nil:NilClass (NoMethodError)
from /Users/piperwarrior/.rvm/gems/ruby-2.2.1/gems/nokogiri-1.6.7.2/lib/nokogiri/xml/node_set.rb:187:in `block in each'
from /Users/piperwarrior/.rvm/gems/ruby-2.2.1/gems/nokogiri-1.6.7.2/lib/nokogiri/xml/node_set.rb:186:in `upto'
from /Users/piperwarrior/.rvm/gems/ruby-2.2.1/gems/nokogiri-1.6.7.2/lib/nokogiri/xml/node_set.rb:186:in `each'
from secown.rb:8:in `<main>'
The crucial insight is that in the first case, own_table.css('tr') returns a NodeSet; .css('td') then finds every td that is a descendant of any node in that NodeSet, and [4] picks the fourth one (speaking as a programmer, fifth for normal people :P ).
The second snippet treats each row individually as a Node, finds all td descendants of that row, then picks the fourth one.
So if you have this structure:
tr id=1
td id=2
td id=3
tr id=4
td id=5
td id=6
td id=7
td id=8
td id=9
then the first snippet will give you the id 7 td (it being the fourth td in all tr); the second snippet would try to find the fourth td in id 1 tr, then fourth td in id 4 tr, but it errors out because id 1 tr doesn't have a fourth td.
Edit: Specifically, having checked your URL, the first tr has no td; all the others have 12. So own_table.css('tr')[0].css('td')[4].class is NilClass, not Nokogiri::XML::Element as you report.
Consider this:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<div><span><p>foo</p></span></div>
<div id="bar"><span><p>bar</p></span></div>
</body>
</html>
EOT
If I chain the methods I'm going to find all matching <p> nodes inside the <div>s:
doc.css('div').css('span').css('p').to_html
# => "<p>foo</p><p>bar</p>"
or:
doc.css('div').css('p').to_html
# => "<p>foo</p><p>bar</p>"
That's equivalent to using the following selectors, only they are a bit more efficient as they don't involve calling libXML multiple times:
doc.css('div span p').to_html
# => "<p>foo</p><p>bar</p>"
or:
doc.css('div p').to_html
# => "<p>foo</p><p>bar</p>"
Really you should find landmarks in the target markup and leapfrog from one to the next, not step from tag to tag:
doc.css('#bar p').to_html
# => "<p>bar</p>"
If your intention was to find all matches then replace #bar with div in the above selector and it'll loosen the search.
Finally, if your goal is to extract the text of a set of nodes, you don't want to use something like:
doc.css('#bar p').text
css, like search and xpath, returns a NodeSet, and text will concatenate the text from all returned nodes, making it difficult to retrieve the text from the individual nodes. Instead use:
doc.css('#bar p').map(&:text)
which will return an array containing the text of each node found:
doc.css('div p').text
# => "foobar"
versus:
doc.css('div p').map(&:text)
# => ["foo", "bar"]
See "How to avoid joining all text from Nodes when scraping" also.
I am trying to parse an HTML page using Nokogiri to get some companies names.
names = []
names << Nokogiri::HTML(mypage).css(".name a").text
My result is:
["MikeGetsLeadsUruBlondeLaunch LIVERynoRyderBoyer ProductionsStrangerxCerealLume CubeKatapyMacaulay Outdoor PromotionsFlixit ABMedia MosaicLiftCast.TVcool.mediaPeekKLIKseeStreamingo SolutionsPvgnaalughaUser"]
But what I'd like to get is:
["MikeGetsLeads", "Uru", "Blonde", "Launch LIVE", "RynoRyderBoyer Productions", "Stranger", "xCereal", "Lume Cube", "Katapy", "Macaulay Outdoor Promotions", "Flixit AB", "Media Mosaic", "LiftCast.TV", "cool.media", "Peek", "KLIKsee", "Streamingo Solutions", "Pvgna", "alugha", "User"]
I tried to use .split but it does not give me the right result either. On this page, each name belongs to a <div>, so it's clearly separated in the HTML structure.
The HTML structure looks like this
<div class='name'>
MikeGetsLeads
</div>
The problem is, you are using text with a NodeSet, not with individual nodes. With a NodeSet all the text is concatenated into a single String. Per the NodeSet.inner_text AKA text documentation:
Get the inner text of all contained Node objects
and the actual code is:
def inner_text
collect(&:inner_text).join('')
end
whereas Node.content AKA text or inner_text
Returns the content for this Node
Meditate on this:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<div>
<p>foo</p>
<p>bar</p>
</div>
EOT
doc.css('p').class # => Nokogiri::XML::NodeSet
doc.css('p').text # => "foobar"
Instead, you need to use text on individual nodes:
doc.css('p').map{ |n| n.class } # => [Nokogiri::XML::Element, Nokogiri::XML::Element]
doc.css('p').map{ |n| n.text } # => ["foo", "bar"]
The previous line can be simplified to:
doc.css('p').map(&:text) # => ["foo", "bar"]
See "How to avoid joining all text from Nodes when scraping" also.
require 'rubygems'
require 'nokogiri'
require 'pp'
names = []
mypage = File.open("myhtml.html", "r")
Nokogiri::HTML(mypage).css(".name a").each do |item|
names << item.text
end
pp names
returns:
["MikeGetsLeads", "MikeGetsLeads2", "MikeGetsLeads3"]
How can I get only the text of the node <p> which has other tags in it like:
<p>hello my website is <a>click here</a> <b>test</b></p>
I only want "hello my website is"
This is what I tried:
begin
node = html_doc.css('p')
node.each do |node|
node.children.remove
end
return (node.nil?) ? '' : node.text
rescue
return ''
end
Update 2: All right, you are removing all children with node.children.remove, including the text nodes. A proposed solution might look like:
# 1. select all <p> nodes
doc.css('p').
# 2. map children, and flatten
map { |node| node.children }.flatten.
# 3. select text nodes only
select { |node| node.text? }.
# 4. get text and join
map { |node| node.text }.join(' ').strip
This sample returns "hello my website is", but note that doc.css('p') also finds <p> tags within <p> tags.
Update: sorry, misread your question, you only want "hello my website is", see solution above, original answer:
Not directly with nokogiri, but the sanitize gem might be an option: https://github.com/rgrove/sanitize/
Sanitize.clean(html, {}) # => " hello my website is click here test "
FYI, it uses nokogiri internally.
Your test case did not include any interesting text interleaved with the markup.
If you want to turn <p>Hello <b>World</b>!</p> into "Hello !", then removing the children is one way to do it. Simpler (and less destructive) is to just find all the text nodes and join them:
require 'nokogiri'
html = Nokogiri::HTML('<p>Hello <b>World</b>!</p>')
# Find the first paragraph (in this case the only one)
para = html.at('p')
# Find all the text nodes that are children (not descendants),
# change them from nodes into the strings of text they contain,
# and then smush the results together into one big string.
p para.search('text()').map(&:text).join
#=> "Hello !"
If you want to turn <p>Hello <b>World</b>!</p> into "Hello " (no exclamation point) then you can simply do:
p para.children.first.text # if you know that text is the first child
p para.at('text()').text # if you want to find the first text node
As @Iwe showed, you can use the String#strip method to remove leading/trailing whitespace from the result, if you like.
There's a different way to go about this. Rather than bother with removing nodes, remove the text that those nodes contain:
require 'nokogiri'
doc = Nokogiri::HTML('<p>hello my website is <a>click here</a> <b>test</b></p>')
text = doc.search('p').map{ |p|
  p_text = p.text
  a_text = p.at('a').text
  p_text[a_text] = ''
  p_text
}
puts text
>>hello my website is test
This is a simple example, but the idea is to find the <p> tags, then scan inside those for the tags that contain the text you don't want. For each of those unwanted tags, grab their text and delete it from the surrounding text.
In the sample code, you'd have a list of undesirable nodes at the a_text assignment, loop over them, and iteratively remove the text, like so:
text = doc.search('p').map{ |p|
  p_text = p.text
  %w[a].each do |bad_nodes|
    bad_nodes_text = p.at(bad_nodes).text
    p_text[bad_nodes_text] = ''
  end
  p_text
}
You get back text which is an array of the tweaked text contents of the <p> nodes.
I'm trying to write a Nokogiri script that will grep XML for text nodes containing ASCII double-quotes («"»). Since I want a grep-like output I need the line number, and the contents of each line. However, I am unable to see how to tell the line number where the element starts at. Here is my code:
require 'rubygems'
require 'nokogiri'
ARGV.each do |filename|
  xml_stream = File.open(filename)
  reader = Nokogiri::XML::Reader(xml_stream)
  titles = []
  text = ''
  grab_text = false
  reader.each do |elem|
    if elem.node_type == Nokogiri::XML::Node::TEXT_NODE
      data = elem.value
      lines = data.split(/\n/, -1);
      lines.each_with_index do |line, idx|
        if (line =~ /"/) then
          STDOUT.printf "%s:%d:%s\n", filename, elem.line()+idx, line
        end
      end
    end
  end
end
elem.line() does not work.
XML and parsers don't really have a concept of line numbers. You're talking about the physical layout of the file.
You can play a game with the parser using accessors looking for text nodes containing linefeeds and/or carriage returns but that can be thrown off because XML allows nested nodes.
require 'nokogiri'
xml =<<EOT_XML
<atag>
<btag>
<ctag
id="another_node">
other text
</ctag>
</btag>
<btag>
<ctag id="another_node2">yet
another
text</ctag>
</btag>
<btag>
<ctag id="this_node">this text</ctag>
</btag>
</atag>
EOT_XML
doc = Nokogiri::XML(xml)
# find a particular node via CSS accessor
doc.at('ctag#this_node').text # => "this text"
# count how many "lines" there are in the document
doc.search('*/text()').select{ |t| t.text[/[\r\n]/] }.size # => 12
# walk the nodes looking for a particular string, counting lines as you go
content_at = []
doc.search('*/text()').each do |n|
content_at << [n.line, n.text] if (n.text['this text'])
end
content_at # => [[14, "this text"]]
This works because of the parser's ability to figure out what is a text node and cleanly return it, without relying on regex or text matches.
EDIT: I went through some old code, snooped around in Nokogiri's docs some, and came up with the above edited changes. It's working correctly, including working with some pathological cases. Nokogiri FTW!
As of 1.2.0 (released 2009-02-22), Nokogiri supports Node#line, which returns the line number in the source where that node is defined.
It appears to use the libxml2 function xmlGetLineNo().
require 'nokogiri'
doc = Nokogiri::XML(File.open('tmpfile.xml'))
doc.xpath('//xmlns:package[@arch="x86_64"]').each do |node|
  puts '%4d %s' % [node.line, node['name']]
end
NOTE: If you are working with large XML files (> 65535 lines), be sure to use Nokogiri 1.13.0 or newer (released 2022-01-06), or your Node#line results will not be accurate for large line numbers. See PR 2309 for an explanation.
I have an XML file which is too big. To make it smaller, I want to replace all tags and attribute names with shorter versions of the same thing.
So, I implemented this:
string.gsub!(/<(\w+) /) do |match|
case match
when 'Image' then 'Img'
when 'Text' then 'Txt'
end
end
puts string
which deletes all opening tags but does not do much else.
What am I doing wrong here?
Here's another way:
class String
  def minimize_tags!
    {"image" => "img", "text" => "txt"}.each do |from,to|
      gsub!(/<#{from}\b/i,"<#{to}")
      gsub!(/<\/#{from}>/i,"<\/#{to}>")
    end
    self
  end
end
This will probably be a little easier to maintain, since the replacement patterns are all in one place. And on strings of any significant size, it may be a lot faster than Kevin's way. I did a quick speed test of these two methods using the HTML source of this stackoverflow page itself as the test string, and my way was about 6x faster...
Here's the beauty of using a parser such as Nokogiri:
This lets you manipulate selected tags (nodes) and their attributes:
require 'nokogiri'
xml = <<EOT
<xml>
<Image ImagePath="path/to/image">image comment</Image>
<Text TextFont="courier" TextSize="9">this is the text</Text>
</xml>
EOT
doc = Nokogiri::XML(xml)
doc.search('Image').each do |n|
  n.name = 'img'
  n.attributes['ImagePath'].name = 'path'
end
doc.search('Text').each do |n|
  n.name = 'txt'
  n.attributes['TextFont'].name = 'font'
  n.attributes['TextSize'].name = 'size'
end
print doc.to_xml
# >> <?xml version="1.0"?>
# >> <xml>
# >> <img path="path/to/image">image comment</img>
# >> <txt font="courier" size="9">this is the text</txt>
# >> </xml>
If you need to iterate through every node, maybe to do a universal transformation on the tag-name, you can use doc.search('*').each. That would be slower than searching for individual tags, but might result in less code if you need to change every tag.
The nice thing about using a parser is it'll work even if the layout of the XML changes since it doesn't care about whitespace, and will work even if attribute order changes, making your code more robust.
Try this:
string.gsub!(/(<\/?)(\w+)/) do |match|
  tag_mark = $1
  case $2
  when /^image$/i
    "#{tag_mark}Img"
  when /^text$/i
    "#{tag_mark}Txt"
  else
    match
  end
end
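As for why the original attempt deletes the opening tags: the block receives the whole match (for example "<Image " with the bracket and trailing space), so no when branch matches, case returns nil, and gsub! substitutes an empty string. The sketch below reproduces that and shows a table-driven variant of the fix; broken_shorten, shorten_tags and SHORT_NAMES are names made up for this illustration:

```ruby
# Reproduces the question's behaviour: `match` is "<Image ", never "Image",
# so the case expression yields nil and the tag is replaced by "".
def broken_shorten(xml)
  xml.gsub(/<(\w+) /) do |match|
    case match
    when 'Image' then 'Img'
    when 'Text' then 'Txt'
    end
  end
end

# Hypothetical lookup table; extend it with whatever names you need.
SHORT_NAMES = { 'Image' => 'Img', 'Text' => 'Txt' }.freeze

# Captures the "<" or "</" prefix and the tag name separately, then puts the
# prefix back in front of the short name (or the original name if unknown).
def shorten_tags(xml)
  xml.gsub(%r{(</?)(\w+)}) { "#{$1}#{SHORT_NAMES.fetch($2, $2)}" }
end

sample = '<Image ImagePath="a">x</Image>'
puts broken_shorten(sample)
puts shorten_tags(sample)
```

This variant is case-sensitive and only renames tags, not attributes; for attribute names the parser-based approach is the safer route.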