nokogiri doc.xpath('head') returns nil - ruby

I am trying to get all the scripts declared in the head section of a given HTML page, but no matter what I try, it always returns nil.
doc = Nokogiri::HTML(open('http://www.walmart.com.br/'))
puts doc.at('body') # returns nil
doc.xpath('//html/head').each # this also will never iterate
Any suggestions?

The page's DOCTYPE isn't valid, so Nokogiri parses the page improperly. A quick, inefficient fix to the problem:
require 'nokogiri'
require 'open-uri'
require 'pp'
# Request the HTML before parsing
html = open("http://www.walmart.com.br/").read
# Replace original DOCTYPE with a valid DOCTYPE
html = html.sub(/^<!DOCTYPE html(.*)$/, '<!DOCTYPE html>')
# Parse
doc = Nokogiri::HTML(html)
# Party.
pp doc.xpath("/html/head")
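If you want to see exactly what tripped the parser before the fix, Nokogiri records the syntax errors it recovered from on the document itself. A quick check (a sketch, parsing the unmodified page):
bad_doc = Nokogiri::HTML(open("http://www.walmart.com.br/").read)
puts bad_doc.errors  # lists the markup problems libxml2 hit while parsing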

Ok, when I tried it in script/console, I could indeed get something useful for:
doc.at('body')
so I'm not sure what's going wrong there for you.
As for the head: I can't get the head element either. The html element works fine, but head doesn't, no matter how I query it.
I think there's something screwy with that walmart page. I tried doing the same thing for
Nokogiri::HTML(open('http://google.com/'))
and it worked just fine.
So unless you can figure out what they're doing to stop you from accessing parts of the page... then I don't know.
If you can work with all the scripts in the doc, I found that this works just fine:
doc.xpath('//script')
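A minimal sketch of iterating those nodes, printing each script's src attribute when one is present:
doc.xpath('//script').each do |script|
  puts script['src'] || '(inline script)'
end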

Related

Removing <head> issues, I need guide/assistance

So guys, I'm making a web parser. It was going well, but I saw that some words inside <head> were screwing everything up (and <strong> in the body too). My code before Nokogiri is this one here, but I'm new to Ruby programming and only found out about Nokogiri a few hours ago.
I wish someone could help me make this work. I need to .read the URL, remove <head> and everything inside it, and scan words over the rest of the page.
PS: Is it possible to bring back JUST the body and read it? That would be easier.
PPS: About <strong> tags, is it hard to remove them?
My exercise is to count how many times a specific word appears in the page (not the source code); that's why I need to grab only the body and eliminate those tags.
Really hope someone can help me >.<
Thanks guys!
Here is my actual failing code; the pure original is here:
require 'open-uri'
require 'cgi'
require 'nokogiri'

class Counter
  def initialize(url)
    @url = url
  end

  def decapitate
    Nokogiri::HTML(url)
    url.css('head').remove.to_s
  end

  def scan(word)
    url.scan(word)
  end
end

url, word = ARGV
puts "Found #{Counter.new(url).open.decapitate.scan(word).length} matches."
Many mistakes there.
url in decapitate is an undefined local variable. You need to use @url.
Nokogiri::HTML expects either an IO object or a string, not a URL. You probably wanted to use open(@url) to read the URL contents (I assume, given that you required open-uri).
Nokogiri::HTML returns a document, but you don't store this return value anywhere.
Consequently, url (or rather @url) would be a string, and strings don't have a css method; you want to call css on the document instead.
remove returns the node that is removed; as the last expression in the method, that is what decapitate returns. Thus decapitate would return the text of the head node.
At the end, ...decapitate.scan will invoke the String#scan method, not the method you defined.
You can do what you want as follows:
require 'nokogiri'
require 'open-uri'

def count(pattern, url)
  doc = Nokogiri::HTML(open(url))
  doc.css('head').remove        # drop <head> and everything inside it
  doc.text.scan(pattern).size   # count occurrences in the remaining text
end
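Hypothetical usage, wired to the same ARGV handling as the original script:
url, word = ARGV
puts "Found #{count(word, url)} matches."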

Ruby RSS::Parser.to_s silently fails?

I'm using Ruby 1.8.7's RSS::Parser, part of stdlib. I'm new to Ruby.
I want to parse an RSS feed, make some changes to the data, then output it (as RSS).
The docs say I can use #to_s, and it seems to work with some feeds, but not others.
This works:
#!/usr/bin/ruby -w
require 'rss'
require 'net/http'
url = 'http://news.ycombinator.com/rss'
feed = Net::HTTP.get_response(URI.parse(url)).body
rss = RSS::Parser.parse(feed, false, true)
# Here I would make some changes to the RSS, but right now I'm not.
p rss.to_s
Returns expected output: XML text.
This fails:
#!/usr/bin/ruby -w
require 'rss'
require 'net/http'
url = 'http://feeds.feedburner.com/devourfeed'
feed = Net::HTTP.get_response(URI.parse(url)).body
rss = RSS::Parser.parse(feed, false, true)
# Here I would make some changes to the RSS, but right now I'm not.
p rss.to_s
Returns nothing (empty quotes).
And yet, if I change the last line to:
p rss
I can see that the object is filled with all of the feed data. It's the to_s method that fails.
Why?
How can I get some kind of error output to debug a problem like this?
From what I can tell, the problem isn't in to_s, it's in the parser itself. Stepping deep into the parser.rb code showed nothing being returned, so to_s returning an empty string is valid.
I'd recommend looking at something like Feedzirra.
Also, as an FYI, take a look at Ruby's OpenURI module for easy retrieval of web assets, like feeds. Open-URI is simple but adequate for most tasks. Net::HTTP is lower level, and will require you to write a lot more code to replicate the functionality of Open-URI.
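For example, the Net::HTTP fetch in the question collapses to a single line with Open-URI (a sketch, using the first feed above):
require 'rss'
require 'open-uri'

feed = open('http://news.ycombinator.com/rss').read
rss  = RSS::Parser.parse(feed, false, true)
p rss.to_s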
I had the same problem, so I started debugging the code. I think the Ruby RSS library has a few too many required elements: the channel needs to have title, link, and description, and if one is missing, to_s will fail.
The second feed in the example above is missing the description, which makes to_s fail.
I believe this is a bug, but I really don't understand the code and barely know Ruby, so who knows. It would seem natural to me that to_s would try its best even if some elements are missing.
Either way
rss.channel.description="something"
rss.to_s
will "work"
The problem lies in def have_required_elements?, or in self.class::MODELS.
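A slightly more defensive version of that workaround, filling in the description only when it is actually missing (a sketch, assuming rss was parsed as in the question):
if rss.channel.description.nil? || rss.channel.description.empty?
  rss.channel.description = '(no description)'
end
puts rss.to_s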

Nokogiri - Works with XML, not so much with HTML

I'm having an issue getting Nokogiri to work properly. I'm using version 1.4.4 with Ruby 1.9.2.
I have both libxml2 and libxslt installed and up to date. When I run a Ruby script with XML, it works great.
require 'nokogiri'

doc = Nokogiri::XML(File.open("test.xml"))
doc = doc.css("name").each do |node|
  puts node.text
end
At the command line, running ruby test.rb returns:
Name 1
Name 2
Name 3
And the crowd goes wild.
I tweak a few things, make a few adjustments to the code...
require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(open("http://domain.tld"))
doc = doc.css("p").each do |node|
  puts node.text
end
Back at the command line, ruby test.rb returns... nothing! Just a new, empty line.
Is there any reason that it will work with an XML file, but not HTML?
To debug this sort of problem we need more information from you. Since you're not giving a working URL, and because we know that Nokogiri works fine for this sort of problem, the debugging falls on you.
Here's what I would do to test, in IRB:
1. Do you get output when you do open('http://whateverURLyouarehiding.com').read?
2. If that returns a valid document, what do you get when you wrap the previous open statement in Nokogiri::HTML(...)? That needs to preserve the .read from the previous line too, so Nokogiri receives the body of the page, NOT an IO stream.
3. Try #2 above, but remove the .read. That will tell you whether there's a problem with Nokogiri reading an IO stream, though I seriously doubt it has one, since I use it all the time. At that point I'd suspect a problem on your system.
4. If you're getting a document in #2 and #3, then the problem could be in your accessor; I suspect what you're looking for doesn't exist.
5. If it does exist, check the value of doc.errors after Nokogiri parses the document. It could be finding errors in the document, and, if so, they'll be captured there.
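Condensed into a script, those steps look roughly like this (a sketch, keeping the placeholder URL from above):
require 'nokogiri'
require 'open-uri'

body = open('http://whateverURLyouarehiding.com').read  # step 1: fetch the raw body
doc  = Nokogiri::HTML(body)                             # step 2: parse the string
puts doc.css('p').size                                  # step 4: does your selector match anything?
puts doc.errors                                         # step 5: any parse errors captured?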

HTML to Plain Text with Ruby?

Is there anything out there to convert HTML to plain text (maybe a Nokogiri script)? Something that would keep the line breaks, but that's about it.
If I write something in Google Docs, like this, and run that command, it outputs this (after removing the CSS and JavaScript):
\n\n\n\n\nh1. Test h2. HELLO THEREI am some teexton the next line!!!OKAY!#*!)$!
So the formatting's all messed up. I'm sure someone has solved the details like these somewhere out there.
Actually, this is much simpler:
require 'rubygems'
require 'nokogiri'
puts Nokogiri::HTML(my_html).text
You still have line break issues, though, so you're going to have to figure out how you want to handle those yourself.
You could start with something like this:
require 'open-uri'
require 'rubygems'
require 'nokogiri'
uri = 'http://stackoverflow.com/questions/2505104/html-to-plain-text-with-ruby'
doc = Nokogiri::HTML(open(uri))
doc.css('script, link').each { |node| node.remove }
puts doc.css('body').text.squeeze(" \n")
Is simply stripping tags and excess line breaks acceptable?
html.gsub(/<\/?[^>]*>/, '').gsub(/\n\n+/, "\n").gsub(/^\n|\n$/, '')
The first gsub strips tags, the second collapses consecutive line breaks down to one, and the third removes line breaks at the start and end of the string.
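A tiny worked example of that chain (the sample string is made up):
html = "<p>Hello</p>\n\n\n<p>world</p>"
puts html.gsub(/<\/?[^>]*>/, '').gsub(/\n\n+/, "\n").gsub(/^\n|\n$/, '')
# prints "Hello" and "world" on separate lines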
require 'open-uri'
require 'nokogiri'

url = 'http://en.wikipedia.org/wiki/Wolfram_language'
doc = Nokogiri::HTML(open(url))

text = ''
doc.css('p,h1').each do |e|
  text << e.content
end
puts text
This extracts just the desired text from a webpage (most of the time). If, for example, you also wanted to include link text, add the a selector to the CSS expression in the block.
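For instance, adding links to the selection (a sketch, reusing the text variable from above):
doc.css('p, h1, a').each do |e|
  text << e.content
end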
I'm using the sanitize gem.
(" " + Sanitize.clean(html).gsub("\n", "\n\n").strip).gsub(/^ /, "\t")
It does drop hyperlinks though, which may be an issue for some applications. But I'm doing NLP text analysis, so this is perfect for my needs.
If you are using Rails you can:
html = '<div class="asd">hello world</div><p><span>Hola</span><br> que tal</p>'
puts ActionView::Base.full_sanitizer.sanitize(html)
You want hpricot_scrub:
http://github.com/UnderpantsGnome/hpricot_scrub
You can specify which tags to strip / keep in a config hash.
If it's in Rails, you may use this:
html_escape_once(value).gsub("\n", "\r\n<br/>").html_safe
Building slightly on Matchu's answer, this worked for my (very similar) requirements:
html.gsub(/<\/?[^>]*>/, ' ').gsub(/\n\n+/, "\n").gsub(/^\n|\n$/, ' ').squish
(Note the double quotes around "\n" in the second gsub; in single quotes it would be a literal backslash and n.)
Hope it makes someone's life a bit easier :-)

extracting content of content attribute in meta tag of a website given a specified value for the name attribute with nokogiri in ruby?

My first question here; it would be awesome to find an answer. I am new to using Nokogiri.
Here is my problem. I have something like this in the HTML head on a target site (here a techcrunch post):
<meta content="During my time at TechCrunch I've seen thousands of startups and written about hundreds of them. I sure as hell don't know all ..." name="description"/>
I would now like to have a script to run through the meta tags, locate the one with the name attribute "description" and get what is in the content attribute.
I have tried something like this:
require 'rubygems'
require 'nokogiri'
require 'open-uri'

url = "http://www.techcrunch.com/2009/10/11/the-underutilized-power-of-the-video-demo-to-explain-what-the-hell-you-actually-do/"
doc = Nokogiri::HTML(open(url))

posts = doc.xpath("//meta")
posts.each do |link|
  a = link.attributes['name']
  b = link.attributes['content']
end
after which I could select the tag where the name attribute equals "description". But this code returns nil for a and b.
I played around with
posts = doc.xpath("//meta"), posts = doc.xpath("//meta/*"), etc., but still got nil.
The problem is not with the XPath; it seems the document does not parse. You can check that with puts doc: the output does not contain the full input. It seems to be a problem with parsing comments (I suspect either invalid HTML or a bug in libxml2).
In your case I would use a regular expression as a workaround. Given that <meta> tags are simple, that might work, e.g. /<meta name="([^"]*)" content="([^"]*)"/
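A sketch of that workaround, reusing url and open-uri from the question's script; note that attribute order matters with a regex, and the tag in the question puts content before name:
html = open(url).read
description = html[/<meta content="([^"]*)" name="description"/, 1]
puts description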
You should change
doc = Nokogiri::HTML(open(url))
to
doc = Nokogiri::HTML(open(url).read)
Update: or maybe not :) Actually, your code works for me, using Ruby 1.8.7 / Nokogiri 1.4.0.
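If the page does parse on your setup, the lookup itself can be a one-liner with a CSS attribute selector (a sketch):
tag = doc.at('meta[name="description"]')
puts tag['content'] if tag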
