Ruby and Timeout.timeout performance issue - ruby

I'm not sure how to solve a big performance issue in my application. I'm using open-uri to request the most popular videos from YouTube, and when I ran perftools (https://github.com/tmm1/perftools.rb)
it showed that the biggest performance cost is Timeout.timeout. Can anyone suggest how to solve this?
I'm using ruby 1.8.7.
Edit:
This is the output from my profiler
https://docs.google.com/viewer?a=v&pid=explorer&chrome=true&srcid=0B4bANr--YcONZDRlMmFhZjQtYzIyOS00YjZjLWFlMGUtMTQyNzU5ZmYzZTU4&hl=en_US

Timeout is wrapping the function that is actually doing the work to ensure that if the server fails to respond within a certain time, the code will raise an error and stop execution.
I suspect that what you are seeing is that the server is taking some time to respond. You should look at caching the response in some way.
For instance, using memcached (pseudocode)
require 'dalli'
require 'open-uri'

DALLI = Dalli::Client.new

class PopularVideos
  def self.get
    # Reuse today's cached result if present, otherwise fetch and parse
    unless result = DALLI.get("videos_#{Date.today.to_s}")
      doc = open("http://youtube/url")
      result = parse_videos(doc) # parse the doc somehow
      DALLI.set("videos_#{Date.today.to_s}", result)
    end
    result
  end
end
PopularVideos.get # calls your expensive parsing script once
PopularVideos.get # gets the result from memcached for the rest of the day

Related

How to extract data from dynamic collapsing table with hidden elements using Nokogiri and Ruby

I am trying to scrape the following website:
https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/cases-in-us.html
to get all of the state statistics on coronavirus.
My code below works:
require 'nokogiri'
require 'open-uri'
require 'httparty'
require 'pry'
url = "https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/cases-in-us.html"
doc = Nokogiri::HTML.parse(open(url))
total_cases = doc.css("span.count")[0].text
total_deaths = doc.css("span.count")[1].text
new_cases = doc.css("span.new-cases")[0].text
new_deaths = doc.css("span.new-cases")[1].text
However, I am unable to get into the collapsed data/gridcell data.
I have tried searching by the class .aria-label and by the .rt-tr-group class. Any help would be appreciated. Thank you.
Although Layon Ferreira's answer already states the problem, it does not provide the steps needed to load the data.
As already said in the linked answer, the data is loaded asynchronously: it is not present in the initial page and is fetched by JavaScript after the page loads.
Open the browser development tools and go to the "Network" tab. Clear out all requests, then refresh the page; you'll see a list of all requests made. If you're looking for asynchronously loaded data, the most interesting requests are often those of type "json" or "xml".
When browsing through the requests you'll find that the data you're looking for is located at:
https://www.cdc.gov/coronavirus/2019-ncov/json/us-cases-map-data.json
Since this is JSON you don't need "nokogiri" to parse it.
require 'httparty'
require 'json'
response = HTTParty.get('https://www.cdc.gov/coronavirus/2019-ncov/json/us-cases-map-data.json')
data = JSON.parse(response.body)
When executing the above you'll get the exception:
JSON::ParserError ...
This is caused by a Byte Order Mark (BOM) that HTTParty does not remove, most likely because the response doesn't specify a UTF-8 charset.
response.body[0]
#=> "\uFEFF" # the invisible BOM character
format '%X', response.body[0].ord
#=> "FEFF"
To correctly handle the BOM, Ruby 2.7 added the set_encoding_by_bom method to IO, which is also available on StringIO.
require 'httparty'
require 'json'
require 'stringio'
response = HTTParty.get('https://www.cdc.gov/coronavirus/2019-ncov/json/us-cases-map-data.json')
body = StringIO.new(response.body)
body.set_encoding_by_bom
data = JSON.parse(body.gets(nil))
#=> [{"Jurisdiction"=>"Alabama", "Range"=>"10,001 to 20,000", "Cases Reported"=>10145, ...
If you're not yet on Ruby 2.7 you can strip the BOM manually, though the former is probably the safer option:
data = JSON.parse(response.body.force_encoding('utf-8').sub(/\A\xEF\xBB\xBF/, ''))
That page is using AJAX to load its data.
In that case you can use Watir to fetch the page with a real browser,
as answered here: https://stackoverflow.com/a/13792540/2784833
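For example, a minimal sketch with Watir (this assumes Chrome plus chromedriver are installed; it is not taken verbatim from the linked answer):
require 'watir'
require 'nokogiri'

# Drive a real browser so the JavaScript that loads the data actually runs
browser = Watir::Browser.new :chrome
browser.goto 'https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/cases-in-us.html'

# Hand the fully rendered HTML to Nokogiri as usual
doc = Nokogiri::HTML(browser.html)
browser.close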
Another way is to get data from the API directly.
You can see the other endpoints by checking the network tab on your browser console
I replicated your code and found some errors that you might have made.
require 'HTTParty'
will not work. You need to use
require 'httparty'
Secondly, there should be quotes around your url variable's value, i.e.
url = "https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/cases-in-us.html"
Other than that, it just worked fine for me.
Also, if you're trying to get the Covid-19 data you might want to use these APIs
For US Count
For US Daily Count
For US Count - States
You could learn more about the APIs here

Ruby github api not sending same response as irb?

When I run the following snippet of Ruby code in RubyMine, it responds with #<Github::Search:0x38d5b52>. However, when I run it in the irb shell, it responds appropriately with a large JSON object which is what I'm looking for. Anybody know why this is happening and how to fix it?
require 'github_api'

github = Github.new do |config|
  config.endpoint = 'http://my.domain.com/api/v3'
  config.site     = 'http://github.com'
  config.adapter  = :net_http
end

puts github.repos.search("pushed:2014-06-20")
In IRB, the console automatically calls #inspect on every returned object. This oftentimes confuses developers who are new to Rails, for example: they're led to believe that queries can't be chained together, because the #inspect call they see causes the query to execute.
My guess is that you're witnessing the same thing here.
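You can see the difference with a contrived example (FakeResponse is purely hypothetical, just to illustrate what puts and the IRB echo each call):
# A hypothetical stand-in for the object returned by github.repos.search
class FakeResponse
  def inspect
    '{"total_count" => 40, "items" => [...]}' # what the IRB echo shows
  end
end

res = FakeResponse.new
puts res # prints #<FakeResponse:0x00007f...> because puts uses #to_s
p res    # prints {"total_count" => 40, "items" => [...]} because p uses #inspect, like IRB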

Ruby get RSS feed won't get the latest feed

I have a problem parsing an RSS feed.
When I do this:
feed = getFeed("http://example.com/rss")
If the feed content changes, it doesn't update.
If I do it like this:
feed = getFeed("http://example.com/rss?" + Random.rand(20).to_s)
It works most of the time but not always.
getFeed() is implemented like this:
def getFeed(url)
  rss_content = ""
  open(url) do |f|
    rss_content = f.read
  end
  return rss_content
end
I'm using this in Sinatra with Ruby 1.9.3, if that makes a difference.
In my opinion it gets cached somewhere, but I have no idea where.
Edit:
Okay, after running on the server for half a day it works without a problem.
This:
feed = getFeed("http://example.com/rss?" + Random.rand(20).to_s)
implies the problem is with caching, but Ruby, OpenURI and Sinatra shouldn't be caching anything. Perhaps your code is running behind a caching device or app that is handling outgoing requests as well as incoming?
This isn't the fix, but your code can be streamlined greatly:
def getFeed(url)
  open(url).read
end
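If an intermediate cache does turn out to be the culprit, one thing worth trying is sending explicit no-cache request headers; open-uri passes String-keyed options through as HTTP request headers. A sketch, not a confirmed fix for your setup:
require 'open-uri'

def getFeed(url)
  # Ask any proxy or cache along the way to revalidate instead of serving a stale copy
  open(url, "Cache-Control" => "no-cache", "Pragma" => "no-cache").read
end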

Heroku, Nokogiri & Sidekiq memory leak - how to debug?

I just switched to using Sidekiq on Heroku but I'm getting the following after my jobs run for a while:
2012-12-11T09:53:07+00:00 heroku[worker.1]: Process running mem=1037M(202.6%)
2012-12-11T09:53:07+00:00 heroku[worker.1]: Error R14 (Memory quota exceeded)
2012-12-11T09:53:28+00:00 heroku[worker.1]: Error R14 (Memory quota exceeded)
2012-12-11T09:53:28+00:00 heroku[worker.1]: Process running mem=1044M(203.9%)
It keeps growing like that.
For these jobs I'm using Nokogiri and HTTParty to retrieve URLs and parse them. I've tried changing some code but I'm not actually sure what I'm looking for in the first place. How should I go about debugging this?
I tried adding New Relic to my app but unfortunately that doesn't support Sidekiq yet.
Also, after Googling I'm trying to switch to a SAX parser and see if that works but I'm getting stuck. This is what I've done so far:
require 'nokogiri'

class LinkParser < Nokogiri::XML::SAX::Document
  def start_element(name, attrs = [])
    if name == 'a'
      puts Hash[attrs]['href']
    end
  end
end
Then I try something like:
page = HTTParty.get("http://site.com")
parser = Nokogiri::XML::SAX::Parser.new(LinkParser.new)
Then I tried using the following methods with the data I retrieved using HTTParty, but haven't been able to get any of these methods to work correctly:
parser.parse(File.read(ARGV[0], 'rb'))
parser.parse_file(filename, encoding = 'UTF-8')
parser.parse_memory(data, encoding = 'UTF-8')
Update
I discovered that the parser wasn't working because I was calling parser.parse(page) instead of parser.parse(page.body). However, when I tried printing out all the HTML tags for various websites using the above script, for some sites it printed all the tags while for others it only printed a few.
If I use Nokogiri::HTML() instead of parser.parse() it works fine.
I was using Nokogiri::XML::SAX::Parser.new() instead of Nokogiri::HTML::SAX::Parser.new() for HTML documents and that's why I was running into trouble.
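Putting those two fixes together (reusing the LinkParser class from above), the call side looks roughly like this:
page = HTTParty.get("http://site.com")

# Use the HTML SAX parser for HTML documents and pass the body string, not the response object
parser = Nokogiri::HTML::SAX::Parser.new(LinkParser.new)
parser.parse(page.body)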
Code Update
Ok, I've got the following code working now, but can't figure out how to put the data I get into an array which I can use later on...
require 'nokogiri'

class LinkParser < Nokogiri::XML::SAX::Document
  attr_accessor :link

  def initialize
    @link = false
  end

  def start_element(name, attrs = [])
    url = Hash[attrs]
    # String#start_with? is built into Ruby, so no extra helper is needed
    if name == 'a' && url['href'] && url['href'].start_with?("http")
      @link = true
      puts url['href']
      puts url['rel']
    end
  end

  def characters(anchor)
    puts anchor if @link
  end

  def end_element(name)
    @link = false
  end
end
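As for getting the data into an array instead of printing it, one sketch along the same lines (LinkCollector is just an illustrative name, not code from the original post) is to give the document an accumulator and read it back after parsing:
require 'httparty'
require 'nokogiri'

class LinkCollector < Nokogiri::XML::SAX::Document
  attr_reader :links

  def initialize
    @links = []
  end

  def start_element(name, attrs = [])
    url = Hash[attrs]
    # Store each absolute link instead of printing it
    @links << url['href'] if name == 'a' && url['href'] && url['href'].start_with?("http")
  end
end

page = HTTParty.get("http://site.com")
collector = LinkCollector.new
Nokogiri::HTML::SAX::Parser.new(collector).parse(page.body)
collector.links # => array of hrefs, available after parsing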
In the end I discovered that the memory leak is due to the 'Typhoeus' gem which is a dependency for the 'PageRankr' gem that I'm using in part of my code.
I discovered this by running the code locally while monitoring memory usage with watch "ps u -C ruby", and then testing different parts of the code until I could pinpoint where the memory leak came from.
I'm marking this as the accepted answer since in the original question I didn't know how to debug memory leaks but someone told me to do the above and it worked.
Just in case you can't resolve the gem's memory leak issue:
You can run Sidekiq jobs inside a fork, as described in this answer: https://stackoverflow.com/a/1076445/3675705
Just add an application helper "do_in_child" and then call it inside your worker:
def perform
  do_in_child do
    # some polluted task
  end
end
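A minimal sketch of such a do_in_child helper, following the fork-and-wait pattern described in the linked answer (the exact implementation there may differ):
def do_in_child
  pid = fork do
    yield
    # exit! skips at_exit handlers; the child's memory is reclaimed by the OS on exit
    exit!(0)
  end
  Process.wait(pid)
end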
Yes, I know it's kind of a dirty solution, because Sidekiq is meant to work in threads, but in my case it was the only quick fix for production, because I have slow jobs that parse big XML files with Nokogiri.
The "fast" threading doesn't give any advantage here, while the memory leaks give me a 2GB+ main Sidekiq process after 10 minutes of work. After one day, Sidekiq's virtual memory grows to 11GB (all the virtual memory available on my server) and all the tasks run extremely slowly.

Ruby tweetstream stop unexpectedly

I use tweetstream gem to get sample tweets from Twitter Streaming API:
TweetStream.configure do |config|
  config.username    = 'my_username'
  config.password    = 'my_password'
  config.auth_method = :basic
end

@client = TweetStream::Client.new
@client.sample do |status|
  puts "#{status.text}"
end
However, this script will stop printing out tweets after about 100 tweets (the script continues to run). What could be the problem?
The Twitter Search API sets certain limits on things that look arbitrary from the outside; from the docs:
GET statuses/:id/retweeted_by Show user objects of up to 100 members who retweeted the status.
From the gem, the code for the method is:
# Returns a random sample of all public statuses. The default access level
# provides a small proportion of the Firehose. The "Gardenhose" access
# level provides a proportion more suitable for data mining and
# research applications that desire a larger proportion to be statistically
# significant sample.
def sample(query_parameters = {}, &block)
  start('statuses/sample', query_parameters, &block)
end
I checked the API docs but don't see an entry for 'statuses/sample'; looking at the one above, I'm assuming you've reached 100 of whatever statuses/xxx is being accessed.
Also, correct me if I'm wrong, but I believe Twitter no longer accepts basic auth and you must use an OAuth key. If so, you're effectively unauthenticated, and the API will limit you in other ways too; see https://dev.twitter.com/docs/rate-limiting
Hope that helps.
OK, I made a mistake there: I was looking at the search API when I should've been looking at the streaming API (my apologies). It's still possible some of the things I mentioned are the cause of your problems, so I'll leave this up. Twitter has definitely moved away from basic auth, so I'd try resolving that first; see:
https://dev.twitter.com/docs/auth/oauth/faq
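If you do switch to OAuth, the tweetstream configuration looks roughly like this (option names as documented in the gem's README; the credential values are placeholders):
require 'tweetstream'

TweetStream.configure do |config|
  config.consumer_key       = 'YOUR_CONSUMER_KEY'
  config.consumer_secret    = 'YOUR_CONSUMER_SECRET'
  config.oauth_token        = 'YOUR_ACCESS_TOKEN'
  config.oauth_token_secret = 'YOUR_ACCESS_TOKEN_SECRET'
  config.auth_method        = :oauth
end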
