Scraping issue with Nokogiri - ruby

I am trying to write a simple script that will tell me when the next episode of x show will be released.
here is what I have so far:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
url = "http://www.tv.com/shows/game-of-thrones/episodes/"
doc = Nokogiri::HTML(open(url))
puts doc.at_css('h1').text
airdate = doc.at_css('.highlight_date span , h1').text
date = /\W/.match(airdate)
puts date
when i run this all it returns is:
Game of thrones
The css selector I use there gives the line airdate is /xx/xx/xx, however I only want to the date so thats why I have used the /\W/ although I could be completely wrong here.
So basically I want it to just print the show title and the date of the next episode.

You can do as below :-
require 'nokogiri'
require 'open-uri'
url = "http://www.tv.com/shows/game-of-thrones/episodes/"
doc = Nokogiri::HTML(open(url))
# under season4 currently 7 episodes present, which may change later.
doc.css('#season-4-eps > li').size # => 7
# collect season4 episodes and then their dates and titles
doc.css('#season-4-eps > li').collect { |node| [node.css('.title').text,node.css('.date').text] }
# => [["Mockingbird", "5/18/14"],
# ["The Laws of God and Men", "5/11/14"],
# ["First of His Name", "5/4/14"],
# ["Oathkeeper", "4/27/14"],
# ["Breaker of Chains", "4/20/14"],
# ["The Lion and the Rose", "4/13/14"],
# ["Two Swords", "4/6/14"]]
Looking at the webpage again, I can see, that it always open with latest season's data. Thus the above code can be modified as below :-
# how many sessions are present
latest_session = doc.css(".filters > li[data-season]").size # => 4
# collect season4 episodes and then their dates and titles
doc.css("#season-#{latest_session}-eps > li").collect do |node|
p [node.css('.title').text,node.css('.date').text]
end
# >> ["The Mountain and the Viper", "6/1/14"]
# >> ["Mockingbird", "5/18/14"]
# >> ["The Laws of God and Men", "5/11/14"]
# >> ["First of His Name", "5/4/14"]
# >> ["Oathkeeper", "4/27/14"]
# >> ["Breaker of Chains", "4/20/14"]
# >> ["The Lion and the Rose", "4/13/14"]
# >> ["Two Swords", "4/6/14"]
As per the comment, it seems OP may interested to get the data out from NEXT EPISODE box of the webpage. Here is a way to do the same :
require 'nokogiri'
require 'open-uri'
url = "http://www.tv.com/shows/game-of-thrones/episodes/"
doc = Nokogiri::HTML(open(url))
hash = {}
doc.css('div[class ~= next_episode] div.highlight_info').tap do |node|
hash['date'] = node.css('p.highlight_date > span').text[/\d{1,2}\/\d{1,2}\/\d{4}/]
hash['title'] = node.css('div.highlight_name > a').text
end
hash # => {"date"=>"5/18/2014", "title"=>"Mockingbird"}
Worth to read tap{|x|...} → obj
Yields x to the block, and then returns x. The primary purpose of this method is to “tap into” a method chain, in order to perform operations on intermediate results within the chain.
and str[regexp] → new_str or nil.
Also read CSS selectors to understand how the selectors are with the method #css.

Related

Yahoo Finance news pubDate not accessable by ruby nokogiri

I'm able to access Yahoo Finance news headlines title, but have a hard time parsing pubDate so that I only look at say the last week's news and ignore anything older.
require 'nokogiri'
sym = "1313.HK"
url = "https://feeds.finance.yahoo.com/rss/2.0/headline?s=#{sym}&region=US&lang=en-US"
doc = Nokogiri::HTML(open(url))
titles = doc.css("title")
puts titles.length # works, comes back with 0-20
puts titles.text # works
pubDates = doc.css("pubDate")
puts pubDates.length #does NOT work, always 0
puts pubDates.text #does NOT work, always blank
keywordregex = "bad news"
nodes = doc.search('title') # search title tags only, for keywords
puts found_title = nodes.select{ |n| n.name=='title' && n.text =~ keywordregex } # TODO && pubDate > 7 days old
Try it with Nokogiri::XML, rss is really XML.
doc = Nokogiri::XML(open(url))
pubdate node names in your XML source are lowercase.
> doc.css("pubdate").length
=> 7

How to print XPath search?

I'm trying to parse an XML file with Nokogiri:
require 'nokogiri'
require 'open-uri'
#doc = Nokogiri::XML(open('http://xml.pinnaclesports.com/pinnacleFeed.aspx?sportType=E%20Sports&contest=no'))
#doc.xpath("//event[#league='*LOL*']")
print #doc.text
which works and prints all the events that contain "LOL" in the "league" attribute, but when I create a block, it runs but prints nothing:
#doc.xpath("//event[#league='*LOL*']").each do |league_element|
puts "\n"+league_element.xpath('league').text
end
require 'nokogiri'
require 'open-uri'
#doc = Nokogiri::XML(open('http://xml.pinnaclesports.com/pinnacleFeed.aspx?sportType=E%20Sports&contest=no'))
events = #doc.xpath("//event[#league='*LOL*']")
puts #doc.children
.children "returns a new NodeSet containing all the children of all the nodes in the NodeSet." you can keep filtering node names and values using children.xpath()
For example:
#doc = Nokogiri::XML(open('http://xml.pinnaclesports.com/pinnacleFeed.aspx?sportType=E%20Sports&contest=no'))
events = #doc.xpath("//event[#league='*LOL*']")
puts #doc.children.xpath('//league').text
=> LOL Cham Kor
=> LOL Cham Kor
=> ....
Or
#doc.children.each do |item|
puts item.xpath('//league')
end

Rejecting info from being stored in a file

I have a working program that searches Google using Mechanize, however when the program searches Google it also pulls sites that look something like http://webcache.googleusercontent.com/.
I would like to reject that site from being stored in the file. All the sites' URLs are structured differently.
Source code:
require 'mechanize'
PATH = Dir.pwd
SEARCH = "test"
def info(input)
puts "[INFO]#{input}"
end
def get_urls
info("Searching for sites.")
agent = Mechanize.new
page = agent.get('http://www.google.com/')
google_form = page.form('f')
google_form.q = "#{SEARCH}"
url = agent.submit(google_form, google_form.buttons.first)
url.links.each do |link|
if link.href.to_s =~ /url.q/
str = link.href.to_s
str_list = str.split(%r{=|&})
urls_to_log = str_list[1]
success("Site found: #{urls_to_log}")
File.open("#{PATH}/temp/sites.txt", "a+") {|s| s.puts("#{urls_to_log}")}
end
end
info("Sites dumped into #{PATH}/temp/sites.txt")
end
get_urls
Text file:
http://www.speedtest.net/
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:M47_v0xF3m8J
http://www.speedtest.net/%252Btest%26gbv%3D1%26%26ct%3Dclnk
http://www.speedtest.net/results.php
http://www.speedtest.net/mobile/
http://www.speedtest.net/about.php
https://support.speedtest.net/
https://en.wikipedia.org/wiki/Test
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:R94CAo00wOYJ
https://en.wikipedia.org/wiki/Test%252Btest%26gbv%3D1%26%26ct%3Dclnk
https://www.test.com/
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:S92tylTr1V8J
https://www.test.com/%252Btest%26gbv%3D1%26%26ct%3Dclnk
https://www.speakeasy.net/speedtest/
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:sCEGhiP0qxEJ:https://www.speakeasy.net/speedtest/%252Btest%26gbv%3D1%26%26ct%3Dclnk
https://www.google.com/webmasters/tools/mobile-friendly/
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:WBvZnqZfQukJ:https://www.google.com/webmasters/tools/mobile-friendly/%252Btest%26gbv%3D1%26%26ct%3Dclnk
http://www.humanmetrics.com/cgi-win/jtypes2.asp
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:w_lAt3mgXcoJ:http://www.humanmetrics.com/cgi-win/jtypes2.asp%252Btest%26gbv%3D1%26%26ct%3Dclnk
http://speedtest.xfinity.com/
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:snNGJxOQROIJ:http://speedtest.xfinity.com/%252Btest%26gbv%3D1%26%26ct%3Dclnk
https://www.act.org/content/act/en/products-and-services/the-act/taking-the-test.html
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:1sMSoJBXydo
https://www.act.org/content/act/en/products-and-services/the-act/taking-the-test.html%252Btest%26gbv%3D1%26%26ct%3Dclnk
https://www.16personalities.com/free-personality-test
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:SQzntHUEffkJ
https://www.16personalities.com/free-personality-test%252Btest%26gbv%3D%26%26ct%3Dclnk
https://www.xamarin.com/test-cloud
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:ypEu7XAFM8QJ:
https://www.xamarin.com/test-cloud%252Btest%26gbv%3D1%26%26ct%3Dclnk
It works now. I had issue with success('log'), i dont know why but commented it.
str_list = str.split(%r{=|&})
next if str_list[1].split('/')[2] == "webcache.googleusercontent.com"
# success("Site found: #{urls_to_log}")
File.open("#{PATH}/temp/sites.txt", "a+") {|s| s.puts("#{urls_to_log}")}
There are well-tested wheels used to tear apart URLs into the component parts so use them. Ruby comes with URI, which allows us to easily extract the host, path or query:
require 'uri'
URL = 'http://foo.com/a/b/c?d=1'
URI.parse(URL).host
# => "foo.com"
URI.parse(URL).path
# => "/a/b/c"
URI.parse(URL).query
# => "d=1"
Ruby's Enumerable module includes reject and select which make it easy to loop over an array or enumerable object and reject or select elements from it:
(1..3).select{ |i| i.even? } # => [2]
(1..3).reject{ |i| i.even? } # => [1, 3]
Using all that you could check the host of a URL for sub-strings and reject any you don't want:
require 'uri'
%w[
http://www.speedtest.net/
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:M47_v0xF3m8J
].reject{ |url| URI.parse(url).host[/googleusercontent\.com$/] }
# => ["http://www.speedtest.net/"]
Using these methods and techniques you can reject or select from an input file, or just peek into single URLs and choose to ignore or honor them.

Structuring Nokogiri output without HTML tags

I got Ruby to travel to a web site, iterate through a list of campaigns and scrape the pages for specific data. The problem I have now is getting it from the structure Nokogiri gives me, and outputting it into a readable form.
campaign_list = Array.new
campaign_list.push(1042360, 1042386, 1042365, 992307)
browser = Watir::Browser.new :chrome
browser.goto '<redacted>'
browser.text_field(:id => 'email').set '<redacted>'
browser.text_field(:id => 'password').set '<redacted>'
browser.send_keys :enter
file = File.new('hourlysales.csv', 'w')
data = {}
campaign_list.each do |campaign|
browser.goto "<redacted>"
if browser.text.include? "Application Error"
puts "Error loading page, I recommend restarting script"
# Possibly automatic restart of script
else
hourly_data = Nokogiri::HTML.parse(browser.html).text
# file.write data
puts hourly_data
end
This is the output I get:
{"views":[[17,145],[18,165],[19,99],[20,71],[21,31],[22,26],[23,10],[0,15],[1,1], [2,18],[3,19],[4,35],[5,47],[6,44],[7,67],[8,179],[9,141],[10,112],[11,95],[12,46],[13,82],[14,79],[15,70],[16,103]],"orders":[[17,10],[18,9],[19,5],[20,1],[21,1],[22,0],[23,0],[0,1],[1,0],[2,1],[3,0],[4,1],[5,2],[6,1],[7,5],[8,11],[9,6],[10,5],[11,3],[12,1],[13,2],[14,4],[15,6],[16,7]],"conversion_rates":[0.06870229007633588,0.05442176870748299,0.050505050505050504,0.014084507042253521,0.03225806451612903,0.0,0.0,0.06666666666666667,0.0,0.05555555555555555,0.0,0.02857142857142857,0.0425531914893617,0.022727272727272728,0.07462686567164178,0.06134969325153374,0.0425531914893617,0.044642857142857144,0.031578947368421054,0.021739130434782608,0.024390243902439025,0.05063291139240506,0.08571428571428572,0.06741573033707865]}
The arrays stand for { views [[hour, # of views], [hour, # of views], etc. }. Same with orders. I don't need conversion rates.
I also need to add the values up for each key, so after doing this for 5 pages, I have one key for each hour of the day, and the total number of views for that hour. I tried a couple each loops, but couldn't make any progress.
I appreciate any help you guys can give me.
It looks like the output (which from your code I assume is the content of hourly_data) is JSON. In that case, it's easy to parse and add up the numbers. Something like this:
require "json" # at the top of your script
# ...
def sum_hours_values(data, hours_values=nil)
# Start with an empty hash that automatically initializes missing keys to `0`
hours_values ||= Hash.new {|hsh,hour| hsh[hour] = 0 }
# Iterate through the [hour, value] arrays, adding `value` to the running
# count for that `hour`, and return `hours_values`
data.each_with_object(hours_values) do |(hour, value), hsh|
hsh[hour] += value
end
end
# ... Watir/Nokogiri stuff here...
# Initialize these so they persist outside the loop
hours_views, orders_views = nil
campaign_list.each do |campaign|
browser.goto "<redacted>"
if browser.text.include? "Application Error"
# ...
else
# ...
hourly_data_parsed = JSON.parse(hourly_data)
hours_views = sum_hours_values(hourly_data_parsed["views"], hours_views)
hours_orders = sum_hours_values(hourly_data_parsed["orders"], orders_views)
end
end
puts "Views by hour:"
puts hours_views.sort.map {|hour_views| "%2i\t%4i" % hour_views }
puts "Orders by hour:"
puts hours_orders.sort.map {|hour_orders| "%2i\t%4i" % hour_orders }
P.S. There's a really nice recursive version of sum_hours_values I didn't include since the iterative version is clearer to most Ruby programmers. If you're into recursion I leave it as an exercise for you. ;)

How to get RSS feed in xml format for ruby script

I am using the following ruby script from this dashing widget that retrieves an RSS feed and parses it and sends that parsed title and description to a widget.
require 'net/http'
require 'uri'
require 'nokogiri'
require 'htmlentities'
news_feeds = {
"seattle-times" => "http://seattletimes.com/rss/home.xml",
}
Decoder = HTMLEntities.new
class News
def initialize(widget_id, feed)
#widget_id = widget_id
# pick apart feed into domain and path
uri = URI.parse(feed)
#path = uri.path
#http = Net::HTTP.new(uri.host)
end
def widget_id()
#widget_id
end
def latest_headlines()
response = #http.request(Net::HTTP::Get.new(#path))
doc = Nokogiri::XML(response.body)
news_headlines = [];
doc.xpath('//channel/item').each do |news_item|
title = clean_html( news_item.xpath('title').text )
summary = clean_html( news_item.xpath('description').text )
news_headlines.push({ title: title, description: summary })
end
news_headlines
end
def clean_html( html )
html = html.gsub(/<\/?[^>]*>/, "")
html = Decoder.decode( html )
return html
end
end
#News = []
news_feeds.each do |widget_id, feed|
begin
#News.push(News.new(widget_id, feed))
rescue Exception => e
puts e.to_s
end
end
SCHEDULER.every '60m', :first_in => 0 do |job|
#News.each do |news|
headlines = news.latest_headlines()
send_event(news.widget_id, { :headlines => headlines })
end
end
The example rss feed works correctly because the URL is for an xml file. However I want to use this for a different rss feed that does not provide an actual xml file. This rss feed I want is at http://www.ttc.ca/RSS/Service_Alerts/index.rss
This doesn't seem to display anything on the widget. Instead of using "http://www.ttc.ca/RSS/Service_Alerts/index.rss", I also tried "http://www.ttc.ca/RSS/Service_Alerts/index.rss?format=xml" and "view-source:http://www.ttc.ca/RSS/Service_Alerts/index.rss" but with no luck. Does anyone know how I can get the actual xml data related to this rss feed so that I can use it with this ruby script?
You're right, that link does not provide regular XML, so that script won't work in parsing it since it's written specifically to parse the example XML. The rss feed you're trying to parse is providing RDF XML and you can use the Rubygem: RDFXML to parse it.
Something like:
require 'nokogiri'
require 'rdf/rdfxml'
rss_feed = 'http://www.ttc.ca/RSS/Service_Alerts/index.rss'
RDF::RDFXML::Reader.open(rss_feed) do |reader|
# use reader to iterate over elements within the document
end
From here you can try learning how to use RDFXML to extract the content you want. I'd begin by inspecting the reader object for methods I could use:
puts reader.methods.sort - Object.methods
That will print out the reader's own methods, look for one you might be able to use for your purposes, such as reader.each_entry
To further dig down you can inspect what each entry looks like:
reader.each_entry do |entry|
puts "----here's an entry----"
puts entry.inspect
end
or see what methods you can call on the entry:
reader.each_entry do |entry|
puts "----here's an entry's methods----"
puts entry.methods.sort - Object.methods
break
end
I was able to crudely find some titles and descriptions using this hack job:
RDF::RDFXML::Reader.open('http://www.ttc.ca/RSS/Service_Alerts/index.rss') do |reader|
reader.each_object do |object|
puts object.to_s if object.is_a? RDF::Literal
end
end
# returns:
# TTC Service Alerts
# http://www.ttc.ca/Service_Advisories/index.jsp
# TTC Service Alerts.
# TTC.ca
# http://www.ttc.ca
# http://www.ttc.ca/images/ttc-main-logo.gif
# Service Advisory
# http://www.ttc.ca/Service_Advisories/all_service_alerts.jsp#Service+Advisory
# 196 York University Rocket route diverting northbound via Sentinel, Finch due to a collision that has closed the York U Bus way.
# - Affecting: Bus Routes: 196 York University Rocket
# 2013-12-17T13:49:03.800-05:00
# Service Advisory (2)
# http://www.ttc.ca/Service_Advisories/all_service_alerts.jsp#Service+Advisory+(2)
# 107B Keele North route diverting northbound via Keele, Lepage due to a collision that has closed the York U Bus way.
# - Affecting: Bus Routes: 107 Keele North
# 2013-12-17T13:51:08.347-05:00
But I couldn't quickly find a way to know which one was a title, and which a description :/
Finally, if you still can't find how to extract what you want, start a new question with this info.
Good luck!

Resources