Ruby RSS::Parser.to_s silently fails?

I'm using Ruby 1.8.7's RSS::Parser, part of stdlib. I'm new to Ruby.
I want to parse an RSS feed, make some changes to the data, then output it (as RSS).
The docs say I can use '#to_s', and it seems to work with some feeds, but not others.
This works:
#!/usr/bin/ruby -w
require 'rss'
require 'net/http'
url = 'http://news.ycombinator.com/rss'
feed = Net::HTTP.get_response(URI.parse(url)).body
rss = RSS::Parser.parse(feed, false, true)
# Here I would make some changes to the RSS, but right now I'm not.
p rss.to_s
Returns expected output: XML text.
This fails:
#!/usr/bin/ruby -w
require 'rss'
require 'net/http'
url = 'http://feeds.feedburner.com/devourfeed'
feed = Net::HTTP.get_response(URI.parse(url)).body
rss = RSS::Parser.parse(feed, false, true)
# Here I would make some changes to the RSS, but right now I'm not.
p rss.to_s
Returns nothing (empty quotes).
And yet, if I change the last line to:
p rss
I can see that the object is filled with all of the feed data. It's the to_s method that fails.
Why?
How can I get some kind of error output to debug a problem like this?

From what I can tell, the problem isn't in to_s, it's in the parser itself. Stepping way into the parser.rb code showed nothing being returned, so to_s returning an empty string is valid.
I'd recommend looking at something like Feedzirra.
Also, as an FYI, take a look at Ruby's OpenURI module (open-uri) for easy retrieval of web assets, like feeds. Open-URI is simple but adequate for most tasks. Net::HTTP is lower level, and will require you to write a lot more code to replace the functionality of Open-URI.
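For comparison, here is a minimal sketch of the same fetch using Open-URI; the feed URL is the one from the question:
require 'rss'
require 'open-uri'
# open() from open-uri accepts a URL and returns an IO-like object;
# read returns the response body as a string.
feed = open('http://news.ycombinator.com/rss').read
rss = RSS::Parser.parse(feed, false, true)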

I had the same problem, so I started debugging the code. I think the Ruby RSS library has a few too many required elements. The channel needs to have "title, link, description"; if one is missing, to_s will fail.
The second feed in the example above is missing the description, which makes to_s fail...
I believe this is a bug, but I really don't understand the code and barely know Ruby, so who knows. It would seem natural to me that to_s would try its best even if some elements are missing.
Either way
rss.channel.description="something"
rss.to_s
will "work"
The problem lies in def have_required_elements?
Or in the
self.class::MODELS
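Building on that, here is a minimal defensive sketch (my own workaround, not part of the library) that fills placeholders into the required channel fields before serializing:
# Assumes rss was parsed as in the question; the placeholder strings are arbitrary.
rss.channel.title ||= 'untitled'
rss.channel.link ||= 'http://example.com/'
rss.channel.description ||= 'no description'
puts rss.to_s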

Related

Removing <head> issues, I need guide/assistance

So guys, I'm making a web parser. It was going well, but I saw that some words inside <head> were screwing everything up (and <strong> in the body too). My code is the one here, before Nokogiri, but I'm new at Ruby programming and only started learning about Nokogiri a few hours ago.
I wish someone could help me make this work. I need to .read the URL, remove <head> and everything inside it, and scan words over the rest of the page.
PS: Is it possible to bring in JUST the body and read it? It would be easier.
PPS: About <strong> tags, is it hard to remove them?
My exercise is to count how many times a specific word appears in the page, not in the source code; that's why I need to grab only the body and eliminate a tag.
Really hope someone can help me >.<
Thanks guys!
Here is my actual failing code; the pure original is here:
require 'open-uri'
require 'cgi'
require 'nokogiri'

class Counter
  def initialize(url)
    @url = url
  end

  def decapitate
    Nokogiri::HTML(url)
    url.css('head').remove.to_s
  end

  def scan(word)
    url.scan(word)
  end
end

url, word = ARGV
puts "Found #{Counter.new(url).open.decapitate.scan(word).length} matches."
Many mistakes there.
url in decapitate is an undefined local variable. You need to use @url.
Nokogiri::HTML expects either an IO object or a string, not a URL. You probably wanted to use open(@url) to read the URL contents (I assume, given that you required open-uri).
Nokogiri::HTML returns a document, but you don't store this return value anywhere.
Consequently, url (or rather @url) would be a string, and strings don't have a css method; you want to apply css to the document instead.
remove will return the node that is removed; as the last thing in the method, that will be what is returned. Thus decapitate would return the text of the head node.
At the end, ...decapitate.scan will invoke the String#scan method, not the method you defined.
You can do what you want as follows:
require 'nokogiri'
require 'open-uri'

def count(pattern, url)
  doc = Nokogiri::HTML(open(url))  # fetch and parse the page
  doc.css('head').remove           # drop <head> and everything inside it
  doc.text.scan(pattern).size      # count matches in the visible text only
end
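For example (the word and URL are placeholders):
puts count('ruby', 'http://example.com')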

How to tidy up malformed xml in ruby

I'm having issues tidying up malformed XML I'm getting back from the SEC's EDGAR database.
For some reason they have horribly formed XML. Tags that contain any sort of string aren't closed, and it can actually contain other XML or HTML documents inside other tags. Normally I'd hand this off to Tidy, but that isn't being maintained.
I've tried using Nokogiri::XML::SAX::Parser, but that seems to choke because the tags aren't closed. It seems to work alright until it hits the first ending tag, and then it doesn't fire on any more of them. But it is spitting out the right characters.
class Filing < Nokogiri::XML::SAX::Document
  def start_element name, attrs = []
    puts "starting: #{name}"
  end

  def characters str
    puts "chars: #{str}"
  end

  def end_element name
    puts "ending: #{name}"
  end
end
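For reference, a handler like this is driven along these lines (the file name is a placeholder):
parser = Nokogiri::XML::SAX::Parser.new(Filing.new)
parser.parse(File.open('filing.xml'))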
It seems like this would be the best option because I can simply have it ignore the other XML or HTML docs. It also makes the most sense because some of these documents can get quite large, so storing the whole DOM in memory would probably not work.
Here are some example files: 1 2 3
I'm starting to think I'll just have to write my own custom parser.
Nokogiri's normal DOM mode is able to automatically fix up the XML so it is syntactically correct, or a reasonable facsimile of that. It sometimes gets confused and will shift closing tags around, but you can preprocess the file to give it a nudge in the right direction if need be.
I saved the XML #1 out to a document and loaded it:
require 'nokogiri'
doc = ''
File.open('./test.xml') do |fi|
  doc = Nokogiri::XML(fi)
end
puts doc.to_xml
After parsing, you can check the Nokogiri::XML::Document instance's errors method to see what errors were generated, for perverse pleasure.
doc.errors
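For example, printing each captured error is a one-liner:
doc.errors.each { |err| puts err }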
If using Nokogiri's DOM model isn't good enough, have you considered using XMLLint to preprocess and clean the data, emitting clean XML so the SAX will work? Its --recover option might be of use.
xmllint --recover test.xml
It will output errors on stderr and the recovered XML on stdout, so you can redirect each easily to a file.
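For instance, to capture both streams in separate files:
xmllint --recover test.xml > clean.xml 2> errors.txt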
As for writing your own parser... why? You have other options available to you, and reinventing a nicely implemented wheel is not a good use of time.

Need to build a url and work with the returned result

I would like to start with a little script that fetches the examination results of me and my friends from our university website.
I would like to pass it the roll number as the POST parameter and work with the returned data.
I don't know how to create the POST string.
It would be great if someone could tell me where to start and what things to learn; links to a tutorial would be most appreciated.
I don't want someone to write code for me, just guidance on how to get started.
I've written a solution here just as a reference for whatever you might come up with. There are multiple ways of attacking this.
#fetch_scores.rb
require 'open-uri'

# Define a constant named URL so if the results URL changes we don't
# need to replace a hardcoded URL everywhere.
URL = "http://www.nitt.edu/prm/ShowResult.html?&param="

# Check the count of arguments passed to the script.
# It only takes one, so otherwise show the user how to use the script.
if ARGV.length != 1
  puts "Usage: fetch_scores.rb student_name"
else
  name = ARGV[0] # could drop the ARGV length check and add a default
                 # using ||, e.g. name = ARGV[0] || 'nikhil'
  results = open(URL + name).read
end
You might examine Nokogiri or Hpricot to better parse/format your results. Ruby is an "implicit return" language, so if you wondered why there is no return statement, it's because results holds the value of the last expression evaluated.
You could have a look at the net/http library, included as part of the standard library. See http://www.ruby-doc.org/stdlib/libdoc/net/http/rdoc/index.html for details, there are some examples on that page to get you started.
A very simple way to do this is to use the open-uri library and just put the query parameters in the URL query string:
require 'open-uri'
name = 'nikhil'
results = open("http://www.nitt.edu/prm/ShowResult.html?&param=#{name}").read
results now contains the body text fetched from the URL.
If you are looking for something more ambitious, look at net/http and the httparty gem.
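Since the question asks about POST specifically, here is a minimal sketch using net/http; the field name 'param' is an assumption carried over from the query string above:
require 'net/http'
require 'uri'

uri = URI.parse('http://www.nitt.edu/prm/ShowResult.html')
# post_form URL-encodes the hash into the request body
response = Net::HTTP.post_form(uri, 'param' => 'nikhil')
puts response.body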

Nokogiri - Works with XML, not so much with HTML

I'm having an issue getting Nokogiri to work properly. I'm using version 1.4.4 with Ruby 1.9.2.
I have both libxml2 and libxslt installed and up to date. When I run a Ruby script with XML, it works great.
require 'nokogiri'
doc = Nokogiri::XML(File.open("test.xml"))
doc = doc.css("name").each do |node|
puts node.text
end
At the command line, running ruby test.rb returns
Name 1
Name 2
Name 3
And the crowd goes wild.
I tweak a few things, make a few adjustments to the code...
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open("http://domain.tld"))
doc = doc.css("p").each do |node|
puts node.text
end
Back at the command line, ruby test.rb returns... nothing! Just a new, empty line.
Is there any reason that it will work with an XML file, but not HTML?
To debug this sort of problem we need more information from you. Since you're not giving a working URL, and because we know that Nokogiri works fine for this sort of problem, the debugging falls on you.
Here's what I would do to test, in IRB (a consolidated sketch follows this list):
1. Do you get output when you do open('http://whateverURLyouarehiding.com').read?
2. If that returns a valid document, what do you get when you wrap the previous open statement in Nokogiri::HTML(...)? That needs to preserve the .read in the previous line too, so Nokogiri is receiving the body of the page, NOT an IO stream.
3. Try #2 above, but remove the .read. That will tell you if there's a problem with Nokogiri reading an IO stream, though I seriously doubt it has a problem since I use it all the time. At that point I'd suspect a problem on your system.
4. If you're getting a document in #2 and #3, then the problem could be in your accessor; I suspect what you're looking for doesn't exist.
5. If it does exist, then check the value of doc.errors after Nokogiri parses the document. It could be finding errors in the document, and, if so, they'll be captured there.
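Putting those steps together in one sketch (the URL is a placeholder):
require 'nokogiri'
require 'open-uri'

html = open('http://example.com').read # step 1: do we get a body at all?
doc = Nokogiri::HTML(html)             # step 2: parse the body string
puts doc.errors                        # step 5: any parse errors captured?
puts doc.css('p').size                 # step 4: does the selector match anything?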

extract single string from HTML using Ruby/Mechanize (and Nokogiri)

I am extracting data from a forum, and my script is working fine so far. Now I need to extract the date and time (21 Dec 2009, 20:39) from a single post. I cannot get it to work. I used FireXPath to determine the XPath.
Sample code:
require 'rubygems'
require 'mechanize'
post_agent = WWW::Mechanize.new
post_page = post_agent.get('http://www.vbulletin.org/forum/showthread.php?t=230708')
puts post_page.parser.xpath('/html/body/div/div/div/div/div/table/tbody/tr/td/div[2]/text()').to_s.strip
puts post_page.parser.at_xpath('/html/body/div/div/div/div/div/table/tbody/tr/td/div[2]/text()').to_s.strip
puts post_page.parser.xpath('//*[@id="post1960370"]/tbody/tr[1]/td/div[2]/text()')
All my attempts end with an empty string or an error.
I cannot find any documentation on using Nokogiri within Mechanize. The Mechanize documentation says at the bottom of the page:
After you have used Mechanize to navigate to the page that you need to scrape, then scrape it using Nokogiri methods.
But what methods? Where can I read about them with samples and explained syntax? I did not find anything on Nokogiri's site either.
Radek, I'm going to show you how to fish.
When you call Mechanize::Page::parser, it's giving you the Nokogiri document. So your xpath and at_xpath calls are invoking Nokogiri. The problem is in your XPaths. In general, start out with the most general XPath you can get to work, and then narrow it down. So, for example, instead of this:
puts post_page.parser.xpath('/html/body/div/div/div/div/div/table/tbody/tr/td/div[2]/text()').to_s.strip
start with this:
puts post_page.parser.xpath('//table').to_html
This gets any tables, anywhere, and then prints them as HTML. Examine the HTML to see what tables it brought back. It probably grabbed several when you want only one, so you'll need to tell it how to pick out the one table you want. If, for example, you notice that the table you want has the CSS class "userdata", then try this:
puts post_page.parser.xpath("//table[#class='userdata']").to_html
Any time you don't get back an array, you goofed up the XPath, so fix it before proceeding. Once you're getting the table you want, then try to get the rows:
puts post_page.parser.xpath("//table[#class='userdata']//tr").to_html
If that worked, then take off the "to_html" and you now have an array of Nokogiri nodes, each one a table row.
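From there, a short sketch of walking those rows (still using the hypothetical "userdata" class):
rows = post_page.parser.xpath("//table[@class='userdata']//tr")
rows.each do |tr|
  puts tr.text.strip # print the visible text of each row
end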
And that's how you do it.
I think you have copied this from Firebug. Firebug gives you an extra tbody, which might not be there in the actual code, so my suggestion is to remove that tbody and try again.
If it still doesn't work, then follow Wayne Conrad's process; that's the best!
