Feedzirra cannot parse atom feeds - ruby

The idea of having a single parser for any kind of feed is great and was hoping that it would work for me.
I have been trying to get feedzirra to parse atom feeds.
specifically:
http://pindancing.blogspot.com/feeds/posts/default
http://adam.heroku.com/feed
Those are just 2 that I tried with the problem is that feedzirra cannot parse the
entry URL. It always comes out nil
feed = Feedzirra::Feed.fetch_and_parse(search.rss_feed_url)
p feed.entries.first.title
p feed.entries.first.url #=> returns nil
Is there anything I need to do to get it working?
thanks for your help

Hate to say "works for me", but, well, works for me:
require 'Feedzirra'
urls = %w{
http://adam.heroku.com/feed
http://pindancing.blogspot.com/feeds/posts/default
}
urls.each do |url|
feed = Feedzirra::Feed.fetch_and_parse(url)
puts feed.entries.first.title
puts feed.entries.first.url
end
# => Memcached, a Database?
# => http://adam.heroku.com/past/2010/7/19/memcached_a_database/
# => The answer to "Will you mentor me?" is
# => http://pindancing.blogspot.com/2010/12/answer-to-will-you-mentor-me-is.html
It'd help to see the rest of your code, particularly the actual parameter you're using in the fetch_and_parse method.

Related

Reading Several URIs in ruby

I need to read the contents of web page for several times and extract some information out of it for which I use regular expressions. I am using open-uri to read contents of the page and the sample code I written is as follows:
require 'open-uri'
def getResults(words)
results = []
words.each do |word|
results.push getAResult(word)
end
results
end
def getAResult(word)
file = open("http://www.somapage.com?option=#{word}")
contents = file.read
file.close
contents.match /some-regex-here/
$1.empty? ? -1 : $1.to_f
end
The problem is unless I comment out file.close line getAResult returns always -1. When I try this code on console, getAResult immediately returns -1, but ruby process runs for another two to three seconds or so.
If I remove file.close line getAResult returns the correct result, but now getResults is a bunch of -1s except for the first one. I tried to use curb gem for reading the page, but similar problem appears.
This seems like an issue related with threading. However, I couldn't come up with something reasonable to search and find a corresponding solution. What do you think problem would be?
NOTE: This web page I try to read does not return results so fast. It takes some time.
try hpricot or nokogiri
it can search documents via XPath in your html file
You should grab the match result, like the following:
1.9.3-327 (main):0 > contents.match /div/
=> #<MatchData "div">
1.9.3-327 (main):0 > $1
=> nil
1.9.3-327 (main):0 > contents.match /(div)/
=> #<MatchData "div" 1:"div">
1.9.3-327 (main):0 > $1
=> "div"
If you are worried about thread safety, then you shouldn't use the $n regexp variables. Capture your results directly, like this:
value = contents[/regexp/]
Specifically, here's a more ruby-like formatting of that method:
def getAResult(word)
contents = open("http://www.somapage.com?option=#{word}"){|f| f.read }
value = contents[/some-regex-here/]
value.empty? ? -1 : value.to_f
end
The block form of #open (as above) automatically closes the file when you are done with it.

ruby and curl: skipping invalid pages

I am building a script to parse multiple page titles. Thanks to another question in stack I have now this working bit
curl = %x(curl http://odin.1.ai)
simian = curl.match(/<title>(.*)<\/title>/)[1]
puts simian
but if you try the same where a page has no title for example
curl = %x(curl http://zales.1.ai)
it dies with undefined method for nill class as it has no title ....
I can't check if curl is nil as it is not in this case (it contains another line)
Do you have any solution to have this working even if the title is not present and move to the next page to check ? I would appreciate if we stick to this code as I did try other solutions with nokogiri and uri (Nokogiri::HTML(open("http:/.....") but this is not working either as subdomains like byname_meee.1.ai do not work with the default open-uri so I am thankful if we can stick to this code that uses curl.
UPDATE
I realize that I probably left out some specific cases that are ought to be clarified. This is for parsing 300-400 pages. In the first run I have noticed at least two cases where nokogiri, hpricot but even the more basic open-uri do not work
1) open-uri simply fails in a simple domain with _
like http://levant_alejandro.1.ai this is a valid domain and works with curl but not with open_uri or nokogiri using open_uri
2)The second case if a page has no title like
http://zales.1.ai
3) Third is a page with an image and no valid HTML like http://voldemortas.1.ai/
A fourth case would be a page that has nothing but an internal server error or passenger/rack error.
The first three cases can be sorted with this solution (thanks to Havenwood in #ruby IRC channel)
curl = %x(curl http://voldemortas.1.ai/)
begin
simian = curl.match(/<title>(.*)<\/title>/)[1]
rescue NoMethodError
simian = "" # curl was nil?
rescue ArguementError
simian = "" # not html?
end
puts simian
Now I am aware that this is not elegant nor optimal.
REPHRASED QUESTION
Do you have better way to achieve the same with nokogiri or another gem that includes these cases (no title or no HTML valid page or even 404 page) ? Given that the pages I am parsing have a fairly simple title structure, is the above solution suitable ? For the sake of knowledge it would be useful to know why using an extra gem for the parsing like nokogiri would be better option (note: I try to have few gem dependencies as often and over time they tend to break).
You're making it much to hard on yourself.
Nokogiri doesn't care where you get the HTML, it just wants the body of the document. You can use Curb, Open-URI, a raw Net::HTTP connection, and it'll parse the content returned.
Try Curb:
require 'curb'
require 'nokogiri'
doc = Nokogiri::HTML(Curl.get('http://http://odin.1.ai').body_str)
doc.at('title').text
=> "Welcome to Dotgeek.org * 1.ai"
If you don't know whether you'll have a <title> tag, then don't try to do it all at once:
title = doc.at('title')
next if (!title)
puts title.text
Take a look at "equivalent of curl for Ruby?" for more ideas.
You just need to check for the match before accessing it. If curl.match is nil, the you can't access the grouping:
curl = %x(curl http://odin.1.ai)
simian = curl.match(/<title>(.*)<\/title>/)
simian &&= simian[1] # only access the matched group if available
puts simian
Do heed the Tin Man's advice and use Nokogiri. Your regexp is really only suitable for a brittle solution -- it fails when the title element is spread over multiple lines.
Update
If you really don't want to use an HTML parser and if you promise this is for a quick script, you can use OpenURI (wrapper around net/http) in the standard library. It's at least a little cleaner than parsing curl output.
require 'open-uri'
def extract_title_content(line)
title = line.match(%r{<title>(.*)</title>})
title &&= title[1]
end
def extract_title_from(uri)
title = nil
open(uri) do |page|
page.lines.each do |line|
return title if title = extract_title_content(line)
end
end
rescue OpenURI::HTTPError => e
STDERR.puts "ERROR: Could not download #{uri} (#{e})"
end
puts extract_title_from 'http://odin.1.ai'
What you're really looking for, it seems, is a way to skip non-HTML responses. That's much easier with a curl wrapper like curb, like the Tin Man suggested, than dropping to the shell and using curl there:
1.9.3p125 :001 > require 'curb'
=> true
1.9.3p125 :002 > response = Curl.get('http://odin.1.ai')
=> #<Curl::Easy http://odin.1.ai?>
1.9.3p125 :003 > response.content_type
=> "text/html"
1.9.3p125 :004 > response = Curl.get('http://voldemortas.1.ai')
=> #<Curl::Easy http://voldemortas.1.ai?>
1.9.3p125 :005 > response.content_type
=> "image/png"
1.9.3p125 :006 >
So your code could look something like this:
response = Curl.get(url)
if response.content_type == "text/html" # or more fuzzy: =~ /text/
match = response.body_str.match(/<title>(.*)<\/title>/)
title = match && match[1]
# or use Nokogiri for heavier lifting
end
No more exceptions
puts simian

How to visit a URL with Ruby via http and read the output?

So far I have been able to stitch this together :)
begin
open("http://www.somemain.com/" + path + "/" + blah)
rescue OpenURI::HTTPError
#failure += painting.permalink
else
#success += painting.permalink
end
But how do I read the output of the service that I would be calling?
Open-URI extends open, so you'll get a type of IO stream returned:
open('http://www.example.com') #=> #<StringIO:0x00000100977420>
You have to read that to get content:
open('http://www.example.com').read[0 .. 10] #=> "<!DOCTYPE h"
A lot of times a method will let you pass different types as a parameter. They check to see what it is and either use the contents directly, in the case of a string, or read the handle if it's a stream.
For HTML and XML, such as RSS feeds, we'll typically pass the handle to a parser and let it grab the content, parse it, and return an object suitable for searching further:
require 'nokogiri'
doc = Nokogiri::HTML(open('http://www.example.com'))
doc.class #=> Nokogiri::HTML::Document
doc.to_html[0 .. 10] #=> "<!DOCTYPE h"
doc.at('h1').text #=> "Example Domains"
doc = open("http://etc..")
content = doc.read
More often people want to be able to parse the returned document, for this use something like hpricot or nokogiri
I'm not sure if you want to do this yourself for the hell of it or not but if you don't.. Mecanize is a really nice gem for doing this.
It will visit the page you want and automatically wrap the page with nokogiri so that you can access it's elements with css selectors such as "div#header h1". Ryan Bates has a video tutorial on it which will teach you everything you need to know to use it.
Basically you can just
require 'rubygems'
require 'mechanize'
agent = Mechanize.new
agent.get("http://www.google.com")
agent.page.at("some css selector").text
It's that simple.

Using Open-URI to fetch XML and the best practice in case of problems with a remote url not returning/timing out?

Current code works as long as there is no remote error:
def get_name_from_remote_url
cstr = "http://someurl.com"
getresult = open(cstr, "UserAgent" => "Ruby-OpenURI").read
doc = Nokogiri::XML(getresult)
my_data = doc.xpath("/session/name").text
# => 'Fred' or 'Sam' etc
return my_data
end
But, what if the remote URL times out or returns nothing? How I detect that and return nil, for example?
And, does Open-URI give a way to define how long to wait before giving up? This method is called while a user is waiting for a response, so how do we set a max timeoput time before we give up and tell the user "sorry the remote server we tried to access is not available right now"?
Open-URI is convenient, but that ease of use means they're removing the access to a lot of the configuration details the other HTTP clients like Net::HTTP allow.
It depends on what version of Ruby you're using. For 1.8.7 you can use the Timeout module. From the docs:
require 'timeout'
begin
status = Timeout::timeout(5) {
getresult = open(cstr, "UserAgent" => "Ruby-OpenURI").read
}
rescue Timeout::Error => e
puts e.to_s
end
Then check the length of getresult to see if you got any content:
if (getresult.empty?)
puts "got nothing from url"
end
If you are using Ruby 1.9.2 you can add a :read_timeout => 10 option to the open() method.
Also, your code could be tightened up and made a bit more flexible. This will let you pass in a URL or default to the currently used URL. Also read Nokogiri's NodeSet docs to understand the difference between xpath, /, css and at, %, at_css, at_xpath:
def get_name_from_remote_url(cstr = 'http://someurl.com')
doc = Nokogiri::XML(open(cstr, 'UserAgent' => 'Ruby-OpenURI'))
# xpath returns a nodeset which has to be iterated over
# my_data = doc.xpath('/session/name').text # => 'Fred' or 'Sam' etc
# at returns a single node
doc.at('/session/name').text
end

How to parse SOAP response from ruby client?

I am learning Ruby and I have written the following code to find out how to consume SOAP services:
require 'soap/wsdlDriver'
wsdl="http://www.abundanttech.com/webservices/deadoralive/deadoralive.wsdl"
service=SOAP::WSDLDriverFactory.new(wsdl).create_rpc_driver
weather=service.getTodaysBirthdays('1/26/2010')
The response that I get back is:
#<SOAP::Mapping::Object:0x80ac3714
{http://www.abundanttech.com/webservices/deadoralive} getTodaysBirthdaysResult=#<SOAP::Mapping::Object:0x80ac34a8
{http://www.w3.org/2001/XMLSchema}schema=#<SOAP::Mapping::Object:0x80ac3214
{http://www.w3.org/2001/XMLSchema}element=#<SOAP::Mapping::Object:0x80ac2f6c
{http://www.w3.org/2001/XMLSchema}complexType=#<SOAP::Mapping::Object:0x80ac2cc4
{http://www.w3.org/2001/XMLSchema}choice=#<SOAP::Mapping::Object:0x80ac2a1c
{http://www.w3.org/2001/XMLSchema}element=#<SOAP::Mapping::Object:0x80ac2774
{http://www.w3.org/2001/XMLSchema}complexType=#<SOAP::Mapping::Object:0x80ac24cc
{http://www.w3.org/2001/XMLSchema}sequence=#<SOAP::Mapping::Object:0x80ac2224
{http://www.w3.org/2001/XMLSchema}element=[#<SOAP::Mapping::Object:0x80ac1f7c>,
#<SOAP::Mapping::Object:0x80ac13ec>,
#<SOAP::Mapping::Object:0x80ac0a28>,
#<SOAP::Mapping::Object:0x80ac0078>,
#<SOAP::Mapping::Object:0x80abf6c8>,
#<SOAP::Mapping::Object:0x80abed18>]
>>>>>>> {urn:schemas-microsoft-com:xml-diffgram-v1}diffgram=#<SOAP::Mapping::Object:0x80abe6c4
{}NewDataSet=#<SOAP::Mapping::Object:0x80ac1220
{}Table=[#<SOAP::Mapping::Object:0x80ac75e4
{}FullName="Cully, Zara"
{}BirthDate="01/26/1892"
{}DeathDate="02/28/1979"
{}Age="(87)"
{}KnownFor="The Jeffersons"
{}DeadOrAlive="Dead">,
#<SOAP::Mapping::Object:0x80b778f4
{}FullName="Feiffer, Jules"
{}BirthDate="01/26/1929"
{}DeathDate=#<SOAP::Mapping::Object:0x80c7eaf4>
{}Age="81"
{}KnownFor="Cartoonists"
{}DeadOrAlive="Alive">]>>>>
I am having a great deal of difficulty figuring out how to parse and show the returned information in a nice table, or even just how to loop through the records and have access to each element (ie. FullName,Age,etc). I went through the whole "getTodaysBirthdaysResult.methods - Object.new.methods" and kept working down to try and work out how to access the elements, but then I get to the array and I got lost.
Any help that can be offered would be appreciated.
If you're going to parse the XML anyway, you might as well skip SOAP4r and go with Handsoap. Disclaimer: I'm one of the authors of Handsoap.
An example implementation:
# wsdl: http://www.abundanttech.com/webservices/deadoralive/deadoralive.wsdl
DEADORALIVE_SERVICE_ENDPOINT = {
:uri => 'http://www.abundanttech.com/WebServices/DeadOrAlive/DeadOrAlive.asmx',
:version => 1
}
class DeadoraliveService < Handsoap::Service
endpoint DEADORALIVE_SERVICE_ENDPOINT
def on_create_document(doc)
# register namespaces for the request
doc.alias 'tns', 'http://www.abundanttech.com/webservices/deadoralive'
end
def on_response_document(doc)
# register namespaces for the response
doc.add_namespace 'ns', 'http://www.abundanttech.com/webservices/deadoralive'
end
# public methods
def get_todays_birthdays
soap_action = 'http://www.abundanttech.com/webservices/deadoralive/getTodaysBirthdays'
response = invoke('tns:getTodaysBirthdays', soap_action)
(response/"//NewDataSet/Table").map do |table|
{
:full_name => (table/"FullName").to_s,
:birth_date => Date.strptime((table/"BirthDate").to_s, "%m/%d/%Y"),
:death_date => Date.strptime((table/"DeathDate").to_s, "%m/%d/%Y"),
:age => (table/"Age").to_s.gsub(/^\(([\d]+)\)$/, '\1').to_i,
:known_for => (table/"KnownFor").to_s,
:alive? => (table/"DeadOrAlive").to_s == "Alive"
}
end
end
end
Usage:
DeadoraliveService.get_todays_birthdays
SOAP4R always returns a SOAP::Mapping::Object which is sometimes a bit difficult to work with unless you are just getting the hash values that you can access using hash notation like so
weather['fullName']
However, it does not work when you have an array of hashes. A work around is to get the result in xml format instead of SOAP::Mapping::Object. To do that I will modify your code as
require 'soap/wsdlDriver'
wsdl="http://www.abundanttech.com/webservices/deadoralive/deadoralive.wsdl"
service=SOAP::WSDLDriverFactory.new(wsdl).create_rpc_driver
service.return_response_as_xml = true
weather=service.getTodaysBirthdays('1/26/2010')
Now the above would give you an xml response which you can parse using nokogiri or REXML. Here is the example using REXML
require 'rexml/document'
rexml = REXML::Document.new(weather)
birthdays = nil
rexml.each_recursive {|element| birthdays = element if element.name == 'getTodaysBirthdaysResult'}
birthdays.each_recursive{|element| puts "#{element.name} = #{element.text}" if element.text}
This will print out all elements that have any text.
So once you have created an xml document you can pretty much do anything depending upon the methods the library you choose has ie. REXML or Nokogiri
Well, Here's my suggestion.
The issue is, you have to snag the right part of the result, one that is something you can actually iterator over. Unfortunately, all the inspecting in the world won't help you because it's a huge blob of unreadable text.
What I do is this:
File.open('myresult.yaml', 'w') {|f| f.write(result.to_yaml) }
This will be a much more human readable format. What you are probably looking for is something like this:
--- !ruby/object:SOAP::Mapping::Object
__xmlattr: {}
__xmlele:
- - &id024 !ruby/object:XSD::QName
name: ListAddressBooksResult <-- Hash name, so it's resul["ListAddressBooksResult"]
namespace: http://apiconnector.com
source:
- !ruby/object:SOAP::Mapping::Object
__xmlattr: {}
__xmlele:
- - &id023 !ruby/object:XSD::QName
name: APIAddressBook <-- this bastard is enumerable :) YAY! so it's result["ListAddressBooksResult"]["APIAddressBook"].each
namespace: http://apiconnector.com
source:
- - !ruby/object:SOAP::Mapping::Object
The above is a result from DotMailer's API, which I spent the last hour trying to figure out how to enumerate over the results. The above is the technique I used to figure out what the heck is going on. I think it beats using REXML etc this way, I could do something like this:
result['ListAddressBooksResult']['APIAddressBook'].each {|book| puts book["Name"]}
Well, I hope this helps anyone else who is looking.
/jason

Resources