Yahoo Finance news pubDate not accessible with Ruby Nokogiri - ruby

I'm able to access the titles of Yahoo Finance news headlines, but I'm having a hard time parsing pubDate so that I only look at, say, the last week's news and ignore anything older.
require 'nokogiri'
require 'open-uri' # needed so that open can fetch a URL
sym = "1313.HK"
url = "https://feeds.finance.yahoo.com/rss/2.0/headline?s=#{sym}&region=US&lang=en-US"
doc = Nokogiri::HTML(open(url))
titles = doc.css("title")
puts titles.length # works, comes back with 0-20
puts titles.text # works
pubDates = doc.css("pubDate")
puts pubDates.length # does NOT work, always 0
puts pubDates.text # does NOT work, always blank
keywordregex = /bad news/i # a Regexp; String =~ String raises TypeError
nodes = doc.search('title') # search title tags only, for keywords
puts found_title = nodes.select{ |n| n.name=='title' && n.text =~ keywordregex } # TODO && pubDate > 7 days old

Try it with Nokogiri::XML; RSS is really XML.
doc = Nokogiri::XML(open(url))

The pubDate node names come out lowercase when the feed is parsed with Nokogiri::HTML, because the HTML parser downcases element names:
> doc.css("pubdate").length
=> 7
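Once the pubDate text can be read, the remaining TODO in the question is the one-week filter. A minimal sketch, assuming the dates are in the usual RSS (RFC 822) format; the helper name recent_pub_date? and the sample items are hypothetical stand-ins for the title and pubDate text pulled from the feed, and only Ruby's standard library is used:

```ruby
require 'time'

SECONDS_PER_DAY = 24 * 60 * 60

# Returns true when an RFC 822 style pubDate string falls within the
# last `days` days relative to `now`. (Hypothetical helper name.)
def recent_pub_date?(pub_date_text, now: Time.now, days: 7)
  published = Time.rfc2822(pub_date_text)
  published >= now - days * SECONDS_PER_DAY && published <= now
rescue ArgumentError
  false # unparseable date: treat the item as not recent
end

# Hypothetical [title, pubDate] pairs standing in for values read from the feed.
items = [
  ["Bad news for builders", "Mon, 12 May 2014 08:00:00 +0000"],
  ["Old bad news",          "Tue, 01 Apr 2014 08:00:00 +0000"]
]

now = Time.utc(2014, 5, 14) # fixed "current" time so the example is repeatable
recent_titles = items.select { |_title, date| recent_pub_date?(date, now: now) }
                     .map(&:first)
puts recent_titles
```

In a real script the now: argument would simply be left at its default, and the keyword match from the question could be combined with this check inside the same select block.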

Related

Scraping issue with Nokogiri

I am trying to write a simple script that will tell me when the next episode of x show will be released.
here is what I have so far:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
url = "http://www.tv.com/shows/game-of-thrones/episodes/"
doc = Nokogiri::HTML(open(url))
puts doc.at_css('h1').text
airdate = doc.at_css('.highlight_date span , h1').text
date = /\W/.match(airdate)
puts date
When I run this, all it returns is:
Game of thrones
The CSS selector I use there returns the line containing the air date, in the form xx/xx/xx. I only want the date itself, which is why I used /\W/, although I could be completely wrong here.
So basically I want it to just print the show title and the date of the next episode.
You can do it as below:
require 'nokogiri'
require 'open-uri'
url = "http://www.tv.com/shows/game-of-thrones/episodes/"
doc = Nokogiri::HTML(open(url))
# under season4 currently 7 episodes present, which may change later.
doc.css('#season-4-eps > li').size # => 7
# collect season4 episodes and then their dates and titles
doc.css('#season-4-eps > li').collect { |node| [node.css('.title').text,node.css('.date').text] }
# => [["Mockingbird", "5/18/14"],
# ["The Laws of God and Men", "5/11/14"],
# ["First of His Name", "5/4/14"],
# ["Oathkeeper", "4/27/14"],
# ["Breaker of Chains", "4/20/14"],
# ["The Lion and the Rose", "4/13/14"],
# ["Two Swords", "4/6/14"]]
Looking at the webpage again, I can see that it always opens with the latest season's data. Thus the above code can be modified as below:
# how many seasons are present
latest_season = doc.css(".filters > li[data-season]").size # => 4
# collect the latest season's episodes, then their titles and dates
doc.css("#season-#{latest_season}-eps > li").collect do |node|
  p [node.css('.title').text, node.css('.date').text]
end
# >> ["The Mountain and the Viper", "6/1/14"]
# >> ["Mockingbird", "5/18/14"]
# >> ["The Laws of God and Men", "5/11/14"]
# >> ["First of His Name", "5/4/14"]
# >> ["Oathkeeper", "4/27/14"]
# >> ["Breaker of Chains", "4/20/14"]
# >> ["The Lion and the Rose", "4/13/14"]
# >> ["Two Swords", "4/6/14"]
As per the comment, it seems the OP may be interested in getting the data out of the NEXT EPISODE box of the webpage. Here is a way to do that:
require 'nokogiri'
require 'open-uri'
url = "http://www.tv.com/shows/game-of-thrones/episodes/"
doc = Nokogiri::HTML(open(url))
hash = {}
doc.css('div[class ~= next_episode] div.highlight_info').tap do |node|
hash['date'] = node.css('p.highlight_date > span').text[/\d{1,2}\/\d{1,2}\/\d{4}/]
hash['title'] = node.css('div.highlight_name > a').text
end
hash # => {"date"=>"5/18/2014", "title"=>"Mockingbird"}
Worth reading: tap{|x|...} → obj
Yields x to the block, and then returns x. The primary purpose of this method is to “tap into” a method chain, in order to perform operations on intermediate results within the chain.
and str[regexp] → new_str or nil.
Also read up on CSS selectors to understand how selectors work with the #css method.
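The two methods mentioned above can be tried without fetching the page. This is only a sketch: the airdate_text value is a hypothetical string shaped like the date line on the page. It also shows why the /\W/ attempt in the question fell short: /\W/ matches a single non-word character, whereas str[regexp] with a date pattern returns the whole date.

```ruby
require 'date'

# Hypothetical text, shaped like the "next episode" date line on the page.
airdate_text = "Sunday 5/18/2014 9:00 PM"

date_pattern = /\d{1,2}\/\d{1,2}\/\d{4}/

# String#[] with a regexp returns the first match, or nil when nothing matches.
date_text = airdate_text[date_pattern]
missing   = "Game of Thrones"[date_pattern]

# tap yields the object to the block and then returns that same object.
info = {}.tap do |h|
  h['date']   = date_text
  h['parsed'] = Date.strptime(date_text, '%m/%d/%Y') if date_text
end

puts info['date']
puts info['parsed']
puts missing.inspect
```

Parsing the matched text with Date.strptime is an addition to the answer's approach; it makes the value comparable with Date.today when looking for the next episode.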

How to get RSS feed in xml format for ruby script

I am using the following Ruby script from this Dashing widget. It retrieves an RSS feed, parses it, and sends the parsed title and description to a widget.
require 'net/http'
require 'uri'
require 'nokogiri'
require 'htmlentities'
news_feeds = {
"seattle-times" => "http://seattletimes.com/rss/home.xml",
}
Decoder = HTMLEntities.new
class News
  def initialize(widget_id, feed)
    @widget_id = widget_id
    # pick apart feed into domain and path
    uri = URI.parse(feed)
    @path = uri.path
    @http = Net::HTTP.new(uri.host)
  end
  def widget_id()
    @widget_id
  end
  def latest_headlines()
    response = @http.request(Net::HTTP::Get.new(@path))
    doc = Nokogiri::XML(response.body)
    news_headlines = [];
    doc.xpath('//channel/item').each do |news_item|
      title = clean_html( news_item.xpath('title').text )
      summary = clean_html( news_item.xpath('description').text )
      news_headlines.push({ title: title, description: summary })
    end
    news_headlines
  end
  def clean_html( html )
    html = html.gsub(/<\/?[^>]*>/, "")
    html = Decoder.decode( html )
    return html
  end
end
@News = []
news_feeds.each do |widget_id, feed|
  begin
    @News.push(News.new(widget_id, feed))
  rescue Exception => e
    puts e.to_s
  end
end
SCHEDULER.every '60m', :first_in => 0 do |job|
  @News.each do |news|
    headlines = news.latest_headlines()
    send_event(news.widget_id, { :headlines => headlines })
  end
end
The example RSS feed works correctly because the URL points to an XML file. However, I want to use this for a different RSS feed that does not provide an actual XML file. The RSS feed I want is at http://www.ttc.ca/RSS/Service_Alerts/index.rss
This doesn't seem to display anything on the widget. Instead of using "http://www.ttc.ca/RSS/Service_Alerts/index.rss", I also tried "http://www.ttc.ca/RSS/Service_Alerts/index.rss?format=xml" and "view-source:http://www.ttc.ca/RSS/Service_Alerts/index.rss" but with no luck. Does anyone know how I can get the actual xml data related to this rss feed so that I can use it with this ruby script?
You're right that the link does not provide the kind of XML the script expects; the script is written specifically to parse the example feed's layout. The feed you're trying to parse provides RDF/XML, which you can parse with the rdf-rdfxml gem (RDF::RDFXML).
Something like:
require 'nokogiri'
require 'rdf/rdfxml'
rss_feed = 'http://www.ttc.ca/RSS/Service_Alerts/index.rss'
RDF::RDFXML::Reader.open(rss_feed) do |reader|
# use reader to iterate over elements within the document
end
From here you can try learning how to use RDFXML to extract the content you want. I'd begin by inspecting the reader object for methods I could use:
puts reader.methods.sort - Object.methods
That will print out the reader's own methods. Look for one you might be able to use for your purposes, such as reader.each_entry.
To further dig down you can inspect what each entry looks like:
reader.each_entry do |entry|
puts "----here's an entry----"
puts entry.inspect
end
or see what methods you can call on the entry:
reader.each_entry do |entry|
puts "----here's an entry's methods----"
puts entry.methods.sort - Object.methods
break
end
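The same introspection trick can be tried on any object, without the gem installed. In this sketch StringIO is only a stand-in for the reader or entry, and Object.instance_methods is used in place of Object.methods, a small swap so that exactly the methods every object already has are subtracted:

```ruby
require 'stringio'

# StringIO is only a stand-in here; substitute the reader or an entry.
obj = StringIO.new("sample")

# Subtracting the methods every Object has leaves the ones specific to this class.
own_methods = obj.methods.sort - Object.instance_methods
puts own_methods.first(10)

# grep narrows the list further, e.g. to iterator-style methods.
iterators = own_methods.grep(/\Aeach/)
puts iterators
```

Narrowing the list with grep is usually quicker than reading the full output when hunting for an each_* style iterator.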
I was able to crudely find some titles and descriptions using this hack job:
RDF::RDFXML::Reader.open('http://www.ttc.ca/RSS/Service_Alerts/index.rss') do |reader|
reader.each_object do |object|
puts object.to_s if object.is_a? RDF::Literal
end
end
# returns:
# TTC Service Alerts
# http://www.ttc.ca/Service_Advisories/index.jsp
# TTC Service Alerts.
# TTC.ca
# http://www.ttc.ca
# http://www.ttc.ca/images/ttc-main-logo.gif
# Service Advisory
# http://www.ttc.ca/Service_Advisories/all_service_alerts.jsp#Service+Advisory
# 196 York University Rocket route diverting northbound via Sentinel, Finch due to a collision that has closed the York U Bus way.
# - Affecting: Bus Routes: 196 York University Rocket
# 2013-12-17T13:49:03.800-05:00
# Service Advisory (2)
# http://www.ttc.ca/Service_Advisories/all_service_alerts.jsp#Service+Advisory+(2)
# 107B Keele North route diverting northbound via Keele, Lepage due to a collision that has closed the York U Bus way.
# - Affecting: Bus Routes: 107 Keele North
# 2013-12-17T13:51:08.347-05:00
But I couldn't quickly find a way to know which one was a title, and which a description :/
Finally, if you still can't find how to extract what you want, start a new question with this info.
Good luck!

Optimizing Ruby RSS

I'm writing a very simple Ruby script to parse tweets out of a twitter RSS feed. Here's the code I have:
require 'rss'
@rss = RSS::Parser.parse('statuses.xml', false)
outputfile = open("output.txt", "w")
@rss.items.each do |i|
  pubdate = i.published.to_s
  if pubdate.include? '2011-05'
    tweet = i.title.to_s
    tweet = tweet.gsub(/<title>SlyFlourish: /, "")
    tweet = tweet.gsub(/<\/title>/, "\n\n")
    outputfile << tweet
  end
end
I think I'm missing something about dealing with the objects coming out of the RSS parser. Can someone tell me how I can better pull out the title and date entries from the object returned by the parser?
Is there a reason you chose RSS? Parsing XML is expensive.
I'd consider using JSON instead.
There's also a twitter Ruby gem that makes this really easy:
require "twitter"
Twitter.user_timeline("gavin_morrice").each do |tweet|
puts tweet.text
puts tweet.created_at
end

How to check whether a string contains today's date in a specific format

I'm parsing some RSS feeds that aggregate what's going on in a given city. I'm only interested in the stuff that is happening today.
At the moment I have this:
require 'rubygems'
require 'rss/1.0'
require 'rss/2.0'
require 'open-uri'
require 'shorturl'
source = "http://rss.feed.com/example.xml"
content = ""
open(source) do |s| content = s.read end
rss = RSS::Parser.parse(content, false)
t = Time.now
day = t.day.to_s
month = t.strftime("%b")
rss.items.each do |rss|
if "#{rss.title}".include?(day)&&(month)
# does stuff with it
end
end
Of course by checking whether the title (that I know contains the date of event in the following format: "(2nd Apr 11)") contains the day and the month (eg. '2' and 'May') I get also info about the events that happen on 12th May, 20th of May and so on. How can I make it foolproof and only get today's events?
Here's a sample title: "Diggin Deeper @ The Big Chill House (12th May 11)"
today = Time.now.strftime("%d:%b:%y")
if date_string =~ /(\d*).. (.*?) (\d\d)/
article_date = sprintf("%02i:%s:%s", $1.to_i, $2, $3)
if today == article_date
#this is today
else
#this is not today
end
else
raise("No date found in title.")
end
There could potentially be problems if the title contains other numbers. Does the title have any bounding characters around the date, such as a hyphen before the date or brackets around it? Adding those to the regex could prevent trouble. Could you give us an example title? (An alternative would be to use Time#strftime to create a string which would perfectly match the date as it appears in the title and then just use String#include? with that string, but I don't think there's an elegant way to put the 'th'/'nd'/'rd'/etc on the day.)
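The alternative mentioned in parentheses above can be sketched as follows. The helper names (ordinal_suffix, title_date, today_event?) and the sample titles are hypothetical, and the sketch assumes the date always appears in brackets as "(12th May 11)". Including the opening bracket in the search string is what stops "2nd" from matching inside "12th" or "22nd".

```ruby
# Returns the English ordinal suffix for a day of the month (1..31).
def ordinal_suffix(day)
  return 'th' if (11..13).cover?(day % 100)
  case day % 10
  when 1 then 'st'
  when 2 then 'nd'
  when 3 then 'rd'
  else 'th'
  end
end

# Builds the date exactly as it is assumed to appear in titles, e.g. "(12th May 11)".
def title_date(time)
  "(#{time.day}#{ordinal_suffix(time.day)} #{time.strftime('%b %y')})"
end

# True when the title carries the given day's date (defaults to today).
def today_event?(title, now = Time.now)
  title.include?(title_date(now))
end

now = Time.new(2011, 5, 12) # fixed date so the example is repeatable
puts title_date(now)
puts today_event?("Sample Event (12th May 11)", now)
puts today_event?("Another Event (2nd May 11)", now)
```

In the feed loop this would replace the include?(day) check, for example by calling today_event?(item.title) for each item.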
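Use something like this:
def check_day(date)
  t = Time.now
  day = t.day.to_s
  month = t.strftime("%b")
  year = t.strftime("%y")
  # anchor the day at the start and accept any ordinal suffix (st, nd, rd, th)
  if date =~ /^#{day}(?:st|nd|rd|th)\s#{month}\s#{year}/
    puts "today!"
  else
    puts "not today!"
  end
end
# assuming today is 3 May 2011:
check_day "3rd May 11" #=> today!
check_day "13th May 11" #=> not today!
check_day "30th May 11" #=> not today!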

Ruby - Mechanize: Select link by classname and other questions

At the moment I'm having a look at Mechanize.
I am pretty new to Ruby, so please be patient.
I wrote a little test script:
require 'rubygems'
require 'mechanize'
agent = WWW::Mechanize.new
page = agent.get('http://www.google.de')
pp page.title
google_form = page.form_with(:name => 'f')
google_form.q = 'test'
page = agent.submit(google_form)
pp page.title
page_links = Array.new
page.links.each do |ll|
page_links << ll
end
puts page_links.size
This works, but page_links includes more than the search results. It also includes Google's own links such as Login, Pictures, ...
The result links have the style class "l". Is it possible to select only the links with class == "l"? How do I achieve this?
Is it possible to modify the agent alias? If I own a website that includes Google Analytics or something similar, what browser client will I see in GA when I visit my site with Mechanize?
Can I select elements by their ID instead of their name? I tried to use
my_form = page.form_with(:id => 'myformid')
But this does not work.
In cases like yours I use Nokogiri's DOM search.
Here is your code, slightly rewritten:
require 'rubygems'
require 'mechanize'
agent = Mechanize.new
page = agent.get('http://www.google.de')
pp page.title
google_form = page.form_with(:name => 'f')
google_form.q = 'test'
page = agent.submit(google_form)
pp page.title
page_links = Array.new
#maybe you better use 'h3.r > a.l' here
page.parser.css("a.l").each do |ll|
#page.parser here is Nokogiri::HTML::Document
page_links << ll
puts ll.text + "=>" + ll["href"]
end
puts page_links.size
Probably this article is a good place to start:
getting-started-with-nokogiri
By the way samples in the article also deal with Google search ;)
You can build a list of just the search result links by changing your code as follows:
page.links.each do |ll|
cls = ll.attributes.attributes['class']
page_links << ll if cls && cls.value == 'l'
end
For each element ll in page.links, ll.attributes is a Nokogiri::XML::Element and ll.attributes.attributes is a Hash containing the attributes on the link, hence the need for ll.attributes.attributes to get at the actual class and the need for the nil check before comparing the value to 'l'
The problem with using :id in the criteria to find a form is that it clashes with Ruby's Object#id method, which returns a Ruby object's internal id. I'm not sure what the workaround for this is. You would have no problem selecting the form by some other attribute (e.g. its action).
I believe the selector you are looking for is :dom_id, e.g. in your case:
my_form = page.form_with(:dom_id => 'myformid')
