Having problems with Ruby file from Dashing

I am having trouble with twitter_user.rb, which is supposed to get the number of tweets, followers, and following of a given Twitter username.
I assume that I am supposed to replace TWITTER_USERNAME in line 9 with the Twitter username that I am interested in. I did that and started dashing but I got:
scheduler caught exception:
undefined method '[]' for nil:NilClass
/.../jobs/twitter_user.rb:19:in 'block in <top (required)>'
It looks like the problem is with line 19, which is:
tweets = /profile["']>[\n\t\s]*<strong>([\d.,]+)/.match(response.body)[1].delete('.,').to_i
Can anybody tell me what is going on and how to fix it?

Your assumption is incorrect. The program is looking for an environment variable called TWITTER_USERNAME that is set to the relevant user name. If that variable doesn't exist then the code uses foobugs instead.
If you would rather modify the code than set up an environment variable, then change
twitter_username = ENV['TWITTER_USERNAME'] || 'foobugs'
to
twitter_username = 'myusername'
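To see which name the job will actually use, here is a hypothetical irb session showing the fallback (the export line is a shell command, not Ruby, and myusername is a placeholder):

ENV['TWITTER_USERNAME']               # => nil when the variable is not set
ENV['TWITTER_USERNAME'] || 'foobugs'  # => "foobugs"

# Set the variable from your shell before starting dashing, e.g.:
#   export TWITTER_USERNAME=myusername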

This is untested code, but it gives a general idea of how it should have been written. If you clone the source from the original page you can adjust it for your own purposes (i.e. fix it):
require 'nokogiri'

# `content` stands for the fetched page body (`response.body` in the
# original job).
doc = Nokogiri::XML(content)

tweets    = doc.at('profile strong').text.delete('.,').to_i
following = doc.at('following strong').text.delete('.,').to_i
followers = doc.at('followers strong').text.delete('.,').to_i
The above three lines can be reduced to something like:
tweets, following, followers = %w[profile following followers].map { |tag|
  doc.at("#{tag} strong").text.delete(',.').to_i
}
Again, without a usable sample of the XML/HTML I can't do much more, but as a practice we (programmers) shouldn't use regular expressions to try to parse XML or HTML. It's much too easy to break a pattern with those types of files.
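As a rough illustration of how brittle the pattern is (the markup below is made up, not Twitter's actual HTML), a single extra attribute defeats the regex, while a CSS selector keeps working:

require 'nokogiri'

# Made-up markup: one extra attribute after class="profile" breaks the regex.
html = '<span class="profile" data-x="1"><strong>1,234</strong></span>'

/profile["']>[\n\t\s]*<strong>([\d.,]+)/.match(html)  # => nil
Nokogiri::HTML(html).at('.profile strong').text       # => "1,234"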

I managed to solve the same issue for myself by using the Twitter API instead to pull out the relevant information. It seems the web page had changed too much for the scraping to work, and as various people have already said, it could stop working again without notice.
This is the solution I used.
#### Get your twitter keys & secrets:
#### https://dev.twitter.com/docs/auth/tokens-devtwittercom
Twitter.configure do |config|
  config.consumer_key = 'YOUR_CONSUMER_KEY'
  config.consumer_secret = 'YOUR_CONSUMER_SECRET'
  config.oauth_token = 'YOUR_OAUTH_TOKEN'
  config.oauth_token_secret = 'YOUR_OAUTH_SECRET'
end

twitter_username = 'foobugs'

MAX_USER_ATTEMPTS = 10
user_attempts = 0

SCHEDULER.every '10m', :first_in => 0 do |job|
  begin
    tw_user = Twitter.user(twitter_username)
    if tw_user
      tweets = tw_user.statuses_count
      followers = tw_user.followers_count
      following = tw_user.friends_count
      send_event('twitter_user_tweets', current: tweets)
      send_event('twitter_user_followers', current: followers)
      send_event('twitter_user_following', current: following)
    end
  rescue Twitter::Error => e
    user_attempts += 1
    puts "Twitter error #{e}"
    puts "\e[33mFor the twitter_user widget to work, you need to put in your twitter API keys in the jobs/twitter_user.rb file.\e[0m"
    sleep 5
    retry if user_attempts < MAX_USER_ATTEMPTS
  end
end

I resolved this by substituting this line:
followers = /<strong>([\d.]+)<\/strong> Follower/.match(response.body)[0].delete('.,').to_i
with these two:
followers_count_metadata = /followers_count":[\d]+/.match(response.body)
followers = /[\d]+/.match(followers_count_metadata.to_s).to_s
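Note that this leaves followers as a String. A variant with a single capture group does the extraction and conversion in one step (a sketch, assuming the page still embeds a JSON followers_count field):

if (m = /"followers_count":(\d+)/.match(response.body))
  followers = m[1].to_i
end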

Related

Dashing - Twitter search term not updating

I am new to dashing and have managed to work a lot out using the internet, but I am now at a loss as to why my widget doesn't update to the new search_term when I change it in the twitter.rb file.
I am using the default twitter.rb file with a couple of amendments. Firstly, I have included my tokens and authorisation keys from twitter.com; secondly, I have added an extra line to print more info when something fails in the Twitter::Error rescue.
This is my current code (minus the keys & tokens):
search_term = URI::encode('#weather')

SCHEDULER.every '2m', :first_in => 0 do |job|
  begin
    tweets = twitter.search("#{search_term}")
    if tweets
      tweets = tweets.map do |tweet|
        { name: tweet.user.name, body: tweet.text, avatar: tweet.user.profile_image_url_https }
      end
      send_event('twitter_mentions', comments: tweets)
    end
  rescue Twitter::Error => e
    puts "Twitter Error: #{e}"
    puts "\e[33mFor the twitter widget to work, you need to put in your twitter API keys in the jobs/twitter.rb file.\e[0m"
  end
end
I have restarted Dashing; I have even rebooted the box it is on, but all to no avail. I am at a total loss.
Any help would be greatly appreciated.

Structuring Nokogiri output without HTML tags

I got Ruby to travel to a web site, iterate through a list of campaigns, and scrape the pages for specific data. The problem I have now is getting the data out of the structure Nokogiri gives me and outputting it in a readable form.
require 'watir'
require 'nokogiri'

campaign_list = Array.new
campaign_list.push(1042360, 1042386, 1042365, 992307)

browser = Watir::Browser.new :chrome
browser.goto '<redacted>'
browser.text_field(:id => 'email').set '<redacted>'
browser.text_field(:id => 'password').set '<redacted>'
browser.send_keys :enter

file = File.new('hourlysales.csv', 'w')
data = {}

campaign_list.each do |campaign|
  browser.goto "<redacted>"
  if browser.text.include? "Application Error"
    puts "Error loading page, I recommend restarting script"
    # Possibly automatic restart of script
  else
    hourly_data = Nokogiri::HTML.parse(browser.html).text
    # file.write data
    puts hourly_data
  end
end
This is the output I get:
{"views":[[17,145],[18,165],[19,99],[20,71],[21,31],[22,26],[23,10],[0,15],[1,1], [2,18],[3,19],[4,35],[5,47],[6,44],[7,67],[8,179],[9,141],[10,112],[11,95],[12,46],[13,82],[14,79],[15,70],[16,103]],"orders":[[17,10],[18,9],[19,5],[20,1],[21,1],[22,0],[23,0],[0,1],[1,0],[2,1],[3,0],[4,1],[5,2],[6,1],[7,5],[8,11],[9,6],[10,5],[11,3],[12,1],[13,2],[14,4],[15,6],[16,7]],"conversion_rates":[0.06870229007633588,0.05442176870748299,0.050505050505050504,0.014084507042253521,0.03225806451612903,0.0,0.0,0.06666666666666667,0.0,0.05555555555555555,0.0,0.02857142857142857,0.0425531914893617,0.022727272727272728,0.07462686567164178,0.06134969325153374,0.0425531914893617,0.044642857142857144,0.031578947368421054,0.021739130434782608,0.024390243902439025,0.05063291139240506,0.08571428571428572,0.06741573033707865]}
The arrays are [hour, # of views] pairs under the "views" key, and likewise for "orders". I don't need the conversion rates.
I also need to add the values up for each key, so that after doing this for 5 pages I have one key for each hour of the day and the total number of views for that hour. I tried a couple of each loops, but couldn't make any progress.
I appreciate any help you guys can give me.
It looks like the output (which from your code I assume is the content of hourly_data) is JSON. In that case, it's easy to parse and add up the numbers. Something like this:
require "json" # at the top of your script
# ...
def sum_hours_values(data, hours_values=nil)
  # Start with an empty hash that automatically initializes missing keys to `0`
  hours_values ||= Hash.new {|hsh, hour| hsh[hour] = 0 }

  # Iterate through the [hour, value] arrays, adding `value` to the running
  # count for that `hour`, and return `hours_values`
  data.each_with_object(hours_values) do |(hour, value), hsh|
    hsh[hour] += value
  end
end

# ... Watir/Nokogiri stuff here...

# Initialize these so they persist outside the loop
hours_views, hours_orders = nil

campaign_list.each do |campaign|
  browser.goto "<redacted>"
  if browser.text.include? "Application Error"
    # ...
  else
    # ...
    hourly_data_parsed = JSON.parse(hourly_data)
    hours_views = sum_hours_values(hourly_data_parsed["views"], hours_views)
    hours_orders = sum_hours_values(hourly_data_parsed["orders"], hours_orders)
  end
end

puts "Views by hour:"
puts hours_views.sort.map {|hour_views| "%2i\t%4i" % hour_views }
puts "Orders by hour:"
puts hours_orders.sort.map {|hour_orders| "%2i\t%4i" % hour_orders }
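A quick sanity check of the helper with made-up numbers (hypothetical session, not the question's data):

sum_hours_values([[17, 145], [18, 165]])   # => {17=>145, 18=>165}
sum_hours_values([[17, 10]], {17 => 145})  # => {17=>155}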
P.S. There's a really nice recursive version of sum_hours_values I didn't include since the iterative version is clearer to most Ruby programmers. If you're into recursion I leave it as an exercise for you. ;)
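For the curious, one possible take on that exercise (an untested sketch; the iterative version above is still clearer):

def sum_hours_values(data, hours_values = Hash.new { |hsh, hour| hsh[hour] = 0 })
  return hours_values if data.empty?
  (hour, value), *rest = data
  hours_values[hour] += value
  sum_hours_values(rest, hours_values)
end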

How to get RSS feed in xml format for ruby script

I am using the following Ruby script from this dashing widget, which retrieves an RSS feed, parses it, and sends the parsed title and description to a widget.
require 'net/http'
require 'uri'
require 'nokogiri'
require 'htmlentities'

news_feeds = {
  "seattle-times" => "http://seattletimes.com/rss/home.xml",
}

Decoder = HTMLEntities.new

class News
  def initialize(widget_id, feed)
    @widget_id = widget_id

    # pick apart feed into domain and path
    uri = URI.parse(feed)
    @path = uri.path
    @http = Net::HTTP.new(uri.host)
  end

  def widget_id()
    @widget_id
  end

  def latest_headlines()
    response = @http.request(Net::HTTP::Get.new(@path))
    doc = Nokogiri::XML(response.body)

    news_headlines = []
    doc.xpath('//channel/item').each do |news_item|
      title = clean_html( news_item.xpath('title').text )
      summary = clean_html( news_item.xpath('description').text )
      news_headlines.push({ title: title, description: summary })
    end

    news_headlines
  end

  def clean_html( html )
    html = html.gsub(/<\/?[^>]*>/, "")
    html = Decoder.decode( html )
    return html
  end
end

@News = []
news_feeds.each do |widget_id, feed|
  begin
    @News.push(News.new(widget_id, feed))
  rescue Exception => e
    puts e.to_s
  end
end

SCHEDULER.every '60m', :first_in => 0 do |job|
  @News.each do |news|
    headlines = news.latest_headlines()
    send_event(news.widget_id, { :headlines => headlines })
  end
end
The example RSS feed works correctly because the URL is for an XML file. However, I want to use this for a different RSS feed that does not provide an actual XML file.
This doesn't seem to display anything on the widget. Instead of using "http://www.ttc.ca/RSS/Service_Alerts/index.rss", I also tried "http://www.ttc.ca/RSS/Service_Alerts/index.rss?format=xml" and "view-source:http://www.ttc.ca/RSS/Service_Alerts/index.rss" but with no luck. Does anyone know how I can get the actual xml data related to this rss feed so that I can use it with this ruby script?
You're right, that link does not provide regular XML, so that script won't work for parsing it since it's written specifically to parse the example XML. The RSS feed you're trying to parse is RDF XML, which you can parse with the RDF::RDFXML Rubygem.
Something like:
require 'nokogiri'
require 'rdf/rdfxml'

rss_feed = 'http://www.ttc.ca/RSS/Service_Alerts/index.rss'

RDF::RDFXML::Reader.open(rss_feed) do |reader|
  # use reader to iterate over elements within the document
end
From here you can try learning how to use RDFXML to extract the content you want. I'd begin by inspecting the reader object for methods I could use:
puts reader.methods.sort - Object.methods
That will print out the reader's own methods. Look for one you might be able to use for your purposes, such as reader.each_entry.
To further dig down you can inspect what each entry looks like:
reader.each_entry do |entry|
  puts "----here's an entry----"
  puts entry.inspect
end
or see what methods you can call on the entry:
reader.each_entry do |entry|
  puts "----here's an entry's methods----"
  puts entry.methods.sort - Object.methods
  break
end
I was able to crudely find some titles and descriptions using this hack job:
RDF::RDFXML::Reader.open('http://www.ttc.ca/RSS/Service_Alerts/index.rss') do |reader|
  reader.each_object do |object|
    puts object.to_s if object.is_a? RDF::Literal
  end
end
# returns:
# TTC Service Alerts
# http://www.ttc.ca/Service_Advisories/index.jsp
# TTC Service Alerts.
# TTC.ca
# http://www.ttc.ca
# http://www.ttc.ca/images/ttc-main-logo.gif
# Service Advisory
# http://www.ttc.ca/Service_Advisories/all_service_alerts.jsp#Service+Advisory
# 196 York University Rocket route diverting northbound via Sentinel, Finch due to a collision that has closed the York U Bus way.
# - Affecting: Bus Routes: 196 York University Rocket
# 2013-12-17T13:49:03.800-05:00
# Service Advisory (2)
# http://www.ttc.ca/Service_Advisories/all_service_alerts.jsp#Service+Advisory+(2)
# 107B Keele North route diverting northbound via Keele, Lepage due to a collision that has closed the York U Bus way.
# - Affecting: Bus Routes: 107 Keele North
# 2013-12-17T13:51:08.347-05:00
But I couldn't quickly find a way to know which one was a title, and which a description :/
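One way around that might be to read full statements instead of bare objects, keeping the predicate so titles and descriptions stay distinguishable. An untested sketch, assuming the feed uses the standard RSS 1.0 vocabulary (the purl.org URIs below are that vocabulary's predicate names):

require 'rdf/rdfxml'

RDF::RDFXML::Reader.open('http://www.ttc.ca/RSS/Service_Alerts/index.rss') do |reader|
  reader.each_statement do |statement|
    # The predicate tells us what role the literal plays in the feed.
    case statement.predicate.to_s
    when 'http://purl.org/rss/1.0/title'
      puts "TITLE: #{statement.object}"
    when 'http://purl.org/rss/1.0/description'
      puts "DESC:  #{statement.object}"
    end
  end
end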
Finally, if you still can't find how to extract what you want, start a new question with this info.
Good luck!

Issue parsing web page data from twitter for dashing ruby app

I think my issue is the same as that in Having problems with Ruby file from Dashing, which to date has no answer.
The full problem is that when I start dashing I get:
scheduler caught exception:
undefined method `[]' for nil:NilClass
/home/bhladmin/Shopify-dashing-e672d84/dashboard/jobs/twitter_user.rb:19:in `block in <top (required)>'
/usr/lib64/ruby/gems/1.9.1/gems/rufus-scheduler-2.0.23/lib/rufus/sc/jobs.rb:230:in `call'
/usr/lib64/ruby/gems/1.9.1/gems/rufus-scheduler-2.0.23/lib/rufus/sc/jobs.rb:230:in `trigger_block'
/usr/lib64/ruby/gems/1.9.1/gems/rufus-scheduler-2.0.23/lib/rufus/sc/jobs.rb:204:in `block in trigger'
/usr/lib64/ruby/gems/1.9.1/gems/rufus-scheduler-2.0.23/lib/rufus/sc/scheduler.rb:430:in `call'
/usr/lib64/ruby/gems/1.9.1/gems/rufus-scheduler-2.0.23/lib/rufus/sc/scheduler.rb:430:in `block in trigger_job'
Something isn't right on line 19, but I can't work out what...
The full section of code is below...
#!/usr/bin/env ruby
require 'net/http'

# Track publicly available information of a twitter user like followers,
# following and tweet count by scraping the user profile page.

# Config
# ------
twitter_username = ENV['TWITTER_USERNAME'] || 'foobugs'

SCHEDULER.every '2m', :first_in => 0 do |job|
  http = Net::HTTP.new("twitter.com", Net::HTTP.https_default_port())
  http.use_ssl = true
  response = http.request(Net::HTTP::Get.new("/#{twitter_username}"))

  if response.code != "200"
    puts "twitter communication error (status-code: #{response.code})\n#{response.body}"
  else
    tweets = /profile["']>[\n\t\s]*<strong>([\d.,]+)/.match(response.body)[1].delete('.,').to_i
    following = /following["']>[\n\t\s]*<strong>([\d.,]+)/.match(response.body)[1].delete('.,').to_i
    followers = /followers["']>[\n\t\s]*<strong>([\d.,]+)/.match(response.body)[1].delete('.,').to_i

    send_event('twitter_user_tweets', current: tweets)
    send_event('twitter_user_followers', current: followers)
    send_event('twitter_user_following', current: following)
  end
end
From the previous question it looks like the way of extracting the data from the web page is the problem, but I don't know Ruby well enough to fix it. I've tried removing the ENV['TWITTER_USERNAME'] section to make sure the username I used (not the one above) is being used. If I dump out the raw HTML then it contains the info I'm searching for, so I know that part is working.
I think I've solved this myself by going about it a different way: I've changed the code to use the Twitter API rather than page scraping. Details below. The auth checking and timeout handling aren't great, so if anyone has hints on making them better they'd be welcome...
#### Get your twitter keys & secrets:
#### https://dev.twitter.com/docs/auth/tokens-devtwittercom
Twitter.configure do |config|
  config.consumer_key = 'YOUR_CONSUMER_KEY'
  config.consumer_secret = 'YOUR_CONSUMER_SECRET'
  config.oauth_token = 'YOUR_OAUTH_TOKEN'
  config.oauth_token_secret = 'YOUR_OAUTH_SECRET'
end

twitter_username = 'foobugs'

MAX_USER_ATTEMPTS = 10
user_attempts = 0

SCHEDULER.every '10m', :first_in => 0 do |job|
  begin
    tw_user = Twitter.user(twitter_username)
    if tw_user
      tweets = tw_user.statuses_count
      followers = tw_user.followers_count
      following = tw_user.friends_count
      send_event('twitter_user_tweets', current: tweets)
      send_event('twitter_user_followers', current: followers)
      send_event('twitter_user_following', current: following)
    end
  rescue Twitter::Error => e
    user_attempts += 1
    puts "Twitter error #{e}"
    puts "\e[33mFor the twitter_user widget to work, you need to put in your twitter API keys in the jobs/twitter_user.rb file.\e[0m"
    sleep 5
    retry if user_attempts < MAX_USER_ATTEMPTS
  end
end
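On the "auth checking and timeout isn't great" point: one possible refinement is to wrap the API call in a small retry helper that backs off exponentially instead of sleeping a fixed 5 seconds. An untested sketch (with_retries is a made-up helper name, not part of the twitter gem):

def with_retries(max_attempts = 10)
  attempts = 0
  begin
    yield
  rescue Twitter::Error => e
    attempts += 1
    puts "Twitter error #{e}"
    sleep [2**attempts, 60].min  # back off 2, 4, 8... seconds, capped at 60
    retry if attempts < max_attempts
  end
end

# Usage inside the job:
#   tw_user = with_retries { Twitter.user(twitter_username) }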

Ruby EOFError with open-uri and loop

I'm attempting to build a web crawler and ran into a bit of a snag. Basically what I'm doing is extracting the links from a web page and pushing each link to a queue. Whenever the Ruby interpreter hits this section of code:
links.each do |link|
  url_frontier.push(link)
end
I receive the following error:
/home/blah/.rvm/rubies/ruby-1.9.3-p0/lib/ruby/1.9.1/net/protocol.rb:141:in `read_nonblock': end of file reached (EOFError)
If I comment out the above block of code I get no errors. Please, any help would be appreciated. Here is the rest of the code:
require 'open-uri'
require 'net/http'
require 'uri'
require 'thread' # Queue lives in the thread stdlib on Ruby 1.9

class WebCrawler
  def self.Spider(root)
    eNDCHARS = %{.,'?!:;}
    num_documents = 0
    token_list = []
    url_repository = Hash.new
    url_frontier = Queue.new

    url_frontier.push(root.to_s)
    while !url_frontier.empty? && num_documents < 10
      url = url_frontier.pop
      if !url_repository.has_key?(url)
        document = open(url)
        html = document.read

        # extract url's
        links = URI.extract(html, ['http']).collect { |u| eNDCHARS.index(u[-1]) ? u.chop : u }
        links.each do |link|
          url_frontier.push(link)
        end

        # tokenize
        Tokenizer.tokenize(document).each do |word|
          token_list.push(IndexStructures::Term.new(word, url))
        end

        # add to the repository
        url_repository[url] = true
        num_documents += 1
      end
    end

    # sort by term (primary) and document id (secondary) in reverse to aid in
    # the construction of the inverted index
    return num_documents, token_list.sort_by! { |term| [term.term, term.document_id] }.reverse!
  end
end
I encountered the same error, but with Watir-webdriver running Firefox in headless mode. What I found was that if I ran two of my applications in parallel and destroyed "headless" in one of them, the other was automatically killed as well, with the exact error you quoted. Though my situation is not the same as yours, I think the issue is related to the file handle being closed externally while your application is still using it. I removed the destroy command from my application and the error disappeared.
Hope this helps.
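If the error really is a connection being cut mid-read, another option for the crawler above is to guard each fetch so one bad URL doesn't kill the whole crawl. An untested sketch (fetch_html is a made-up helper, not part of open-uri):

require 'open-uri'

# Fetch a page, swallowing transient read errors such as EOFError and
# returning nil so the caller can skip that URL.
def fetch_html(url)
  open(url).read
rescue EOFError, OpenURI::HTTPError, SocketError => e
  warn "Skipping #{url}: #{e.class}: #{e.message}"
  nil
end

# In Spider: `html = fetch_html(url)` and `next if html.nil?` before parsing.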
