Heroku, Nokogiri & Sidekiq memory leak - how to debug?

I just switched to using Sidekiq on Heroku but I'm getting the following after my jobs run for a while:
2012-12-11T09:53:07+00:00 heroku[worker.1]: Process running mem=1037M(202.6%)
2012-12-11T09:53:07+00:00 heroku[worker.1]: Error R14 (Memory quota exceeded)
2012-12-11T09:53:28+00:00 heroku[worker.1]: Error R14 (Memory quota exceeded)
2012-12-11T09:53:28+00:00 heroku[worker.1]: Process running mem=1044M(203.9%)
It keeps growing like that.
For these jobs I'm using Nokogiri and HTTParty to retrieve URLs and parse them. I've tried changing some code but I'm not actually sure what I'm looking for in the first place. How should I go about debugging this?
I tried adding New Relic to my app but unfortunately that doesn't support Sidekiq yet.
Also, after Googling I'm trying to switch to a SAX parser and see if that works but I'm getting stuck. This is what I've done so far:
class LinkParser < Nokogiri::XML::SAX::Document
def start_element(name, attrs = [])
if name == 'a'
puts Hash[attrs]['href']
end
end
end
Then I try something like:
page = HTTParty.get("http://site.com")
parser = Nokogiri::XML::SAX::Parser.new(LinkParser.new)
Then I tried using the following methods with the data I retrieved using HTTParty, but haven't been able to get any of these methods to work correctly:
parser.parse(File.read(ARGV[0], 'rb'))
parser.parse_file(filename, encoding = 'UTF-8')
parser.parse_memory(data, encoding = 'UTF-8')
Update
I discovered that the parser wasn't working because I was calling parser.parse(page) instead of parser.parse(page.body). However, when I print all the HTML tags for various websites using the above script, it prints every tag for some sites but only a few tags for others.
If I use Nokogiri::HTML() instead of parser.parse() it works fine.
I was using Nokogiri::XML::SAX::Parser.new() instead of Nokogiri::HTML::SAX::Parser.new() for HTML documents and that's why I was running into trouble.
Code Update
Ok, I've got the following code working now, but can't figure out how to put the data I get into an array which I can use later on...
require 'nokogiri'
class LinkParser < Nokogiri::XML::SAX::Document
  attr_accessor :link
  def initialize
    @link = false
  end
  def start_element(name, attrs = [])
    url = Hash[attrs]
    if name == 'a' && url['href'] && url['href'].starts_with?("http")
      @link = true
      puts url['href']
      puts url['rel']
    end
  end
  def characters(anchor)
    puts anchor if @link
  end
  def end_element(name)
    @link = false
  end
  def self.starts_with?(prefix)
    prefix.respond_to?(:to_str) && self[0, prefix.length] == prefix
  end
end

In the end I discovered that the memory leak is due to the 'Typhoeus' gem which is a dependency for the 'PageRankr' gem that I'm using in part of my code.
I discovered this by running the code locally while monitoring memory usage with watch "ps u -C ruby", and then testing different parts of the code until I could pinpoint where the memory leak came from.
I'm marking this as the accepted answer since in the original question I didn't know how to debug memory leaks but someone told me to do the above and it worked.
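The same bisecting approach can be done from inside Ruby instead of watching ps in a second terminal. The following is a minimal sketch, assuming a Unix-like system (it reads /proc on Linux and falls back to ps elsewhere); the helper names rss_kb and measure_memory_growth are hypothetical and not part of any gem.

```ruby
# Resident set size of the current process in kilobytes.
# Reads /proc on Linux and falls back to `ps` on other Unix-like systems.
def rss_kb
  status = "/proc/#{Process.pid}/status"
  if File.readable?(status)
    File.read(status)[/^VmRSS:\s+(\d+)/, 1].to_i
  else
    `ps -o rss= -p #{Process.pid}`.to_i
  end
end

# Runs the block and returns [block result, RSS growth in KB].
# GC runs before and after so that short-lived garbage is less likely
# to be mistaken for a leak.
def measure_memory_growth(label)
  GC.start
  before = rss_kb
  result = yield
  GC.start
  growth = rss_kb - before
  puts "#{label}: RSS changed by #{growth} KB"
  [result, growth]
end

# Wrap one suspect piece of code at a time and compare the numbers.
measure_memory_growth("build strings") do
  Array.new(100_000) { |i| "item #{i}" }.size
end
```

Calling the suspect code repeatedly in a loop and watching whether the reported growth keeps climbing is usually more telling than a single measurement, since Ruby does not always return freed memory to the operating system.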

In case you can't resolve a memory leak inside a gem:
You can run Sidekiq jobs inside forks, as described in the answer https://stackoverflow.com/a/1076445/3675705
Just add the application helper "do_in_child" and then, inside your worker:
def perform
  do_in_child do
    # some polluted task
  end
end
Yes, I know it's kind of a dirty solution because Sidekiq is supposed to work in threads, but in my case it's the only quick fix for production because I have slow jobs that parse big XML files with Nokogiri.
The "fast" thread feature gives no advantage here, while the memory leaks give me a 2GB+ main Sidekiq process after 10 minutes of work. After one day, Sidekiq's virtual memory grows to 11GB (all the available virtual memory on my server) and all the tasks run extremely slowly.

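For readers who do not want to follow the link, here is a minimal sketch of what a do_in_child helper could look like. It is written from the general fork-and-pipe idea, so treat the details as assumptions: it needs a platform that supports fork, the block's return value must be serializable with Marshal, and in a Rails app the child would probably need to re-establish its own database connections.

```ruby
# Runs the block in a forked child process and returns its result.
# Whatever memory the block allocates is returned to the OS when the child exits.
def do_in_child
  reader, writer = IO.pipe

  pid = fork do
    reader.close
    result = yield
    Marshal.dump(result, writer)
    writer.close
    exit!(0) # skip at_exit handlers inherited from the parent
  end

  writer.close
  payload = reader.read # read before waiting so a large result cannot block the child
  reader.close
  Process.wait(pid)

  raise "child process failed" if payload.empty? || !$?.success?
  Marshal.load(payload)
end

total = do_in_child { (1..1_000).to_a.sum }
puts total
```

The parent only ever holds the marshalled result, so a leaky library loaded and used inside the block cannot inflate the long-running worker process.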

Attempting to use pools crashes Celluloid

I'm trying to use pools in a project of mine that uses Celluloid. However, whenever I invoke the pool method on a class which includes Celluloid (thus receiving methods from Celluloid::ClassMethods) I consistently get the error:
NoMethodError: undefined method `services' for Celluloid:Module
router at /Users/my_username/.rvm/gems/jruby-9.0.5.0/gems/celluloid-supervision-0.20.6/lib/celluloid/supervision/deprecate/supervise.rb:54
supervise at /Users/my_username/.rvm/gems/jruby-9.0.5.0/gems/celluloid-supervision-0.20.6/lib/celluloid/supervision/deprecate/supervise.rb:6
pool at /Users/my_username/.rvm/gems/jruby-9.0.5.0/gems/celluloid-pool-0.20.5/lib/celluloid/supervision/container/behavior/pool.rb:13
<top> at celluloid_pool_test.rb:14
Specifically, this part seems to be the problem:
NoMethodError: undefined method `services' for Celluloid:Module
It tells me that the offending line is /Users/my_username/.rvm/gems/jruby-9.0.5.0/gems/celluloid-supervision-0.20.6/lib/celluloid/supervision/deprecate/supervise.rb:54. It turns out that line holds the code for the Celluloid::Supervision.router method:
def router(*_args)
# TODO: Actually route, based on :branch, if present; or else:
Celluloid.services ### this line is what causes the error
end
To make sure that the issue wasn't with my particular project, I grabbed a code sample from this article which utilizes pools and tried to run it:
require 'celluloid'
require 'mathn'
class PrimeWorker
include Celluloid
def prime(number)
if number.prime?
puts number
end
end
end
pool = PrimeWorker.pool
(2..1000).to_a.map do |i|
pool.prime! i
end
sleep 100
It failed with the exact same error as my project.
Finally, I ran a dead simple piece of code in IRB to see if pool is what triggers the error about services:
class Foo
include Celluloid
end
Foo.pool
Sure enough, I got the exact same error. It seems that there is a bug in Celluloid or that I'm not loading a dependency properly. However, I did require 'celluloid/supervision' in my attempts at solving this, to no avail. Am I doing something wrong on my end or is this a bug in Celluloid?
It seems that others have run into this issue before: https://github.com/celluloid/celluloid-pool/issues/10. I guess it has something to do with Celluloid.services being deprecated and not working in newer versions of Celluloid, so using require 'celluloid/current' rather than just require 'celluloid' seems to do the trick.

Rails with mutex on class variable, rake task and cron

Sorry for such a big question. I do not have much experience with Rails threads and mutex.
I have a class, shown below, which is used by different controllers to get the license for each customer.
Customers and their licenses get added and removed every hour. An API is available to get all customers and their licenses.
I plan to create a rake task that calls update_set_customers_licenses and runs hourly via a cron job.
I have the following questions:
1) Even with a mutex, there is currently potential for a problem: my rake task could run while an update is in progress. Any idea how to solve this?
2) My design below writes the JSON out to a file. This is done for safety, as the API is not that reliable. As can be seen, the file is never read back, so in essence the file write is useless. I tried to implement a file read, but together with the mutex and the rake task it gets really confusing. Any pointers will help here.
class Customer
  @@customers_to_licenses_hash = nil
  @@last_updated_at = nil
  @@mutex = Mutex.new
  CUSTOMERS_LICENSES_FILE = "#{Rails.root}/tmp/customers_licenses"
  def self.cached_license_with_customer(customer)
    Rails.cache.fetch('customer') {self.license_with_customer(customer)}
  end
  def self.license_with_customer(customer)
    @@mutex.synchronize do
      license = @@customers_to_licenses_hash[customer]
      if license
        return license
      elsif(@@customers_to_licenses_hash.nil? || Time.now.utc - @@last_updated_at > 1.hours)
        updated = self.update_set_customers_licenses
        return @@customers_to_licenses_hash[customer] if updated
      else
        return nil
      end
    end
  end
  def self.update_set_customers_licenses
    updated = nil
    file_write = File.open(CUSTOMERS_LICENSES_FILE, 'w')
    results = self.get_active_customers_licenses
    if results
      @@customers_to_licenses_hash = results
      file_write.print(results.to_json)
      @@last_updated_at = Time.now.utc
      updated = true
    end
    file_write.close
    updated
  end
  def self.get_active_customers_licenses
    #http get thru api
    #return hash of records
  end
end
I'm pretty sure that every time Rails loads, the environment is "fresh" and has no concept of "state" shared between instances. That is to say, a mutex in one Ruby process (the one serving requests to Rails) has no effect on a second Ruby process (another Rails server process or, in this case, a rake task).
If you follow the data upstream, you'll find that the common root of every instance that can be used to synchronize them is the database. You could use transactional blocks or maybe a manual flag you set and unset in the database.
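The point about separate processes can be seen without Rails. The following is a small sketch, assuming a platform that supports fork; LicenseCache is a hypothetical stand-in for the Customer class, the forked child plays the role of the rake task, and the parent plays the role of the Rails server.

```ruby
# Hypothetical stand-in for a class that keeps shared state in class variables.
class LicenseCache
  @@licenses = {}
  @@mutex = Mutex.new

  def self.store(customer, license)
    @@mutex.synchronize { @@licenses[customer] = license }
  end

  def self.lookup(customer)
    @@mutex.synchronize { @@licenses[customer] }
  end
end

# The child process stands in for the rake task started by cron.
pid = fork do
  LicenseCache.store("acme", "gold")
  exit!(0)
end
Process.wait(pid)

# The parent stands in for the Rails server. The child updated its own
# copy of the class variables, so nothing changed here.
puts LicenseCache.lookup("acme").inspect # prints nil
```

Both the hash and the mutex are copied into the child at fork time, which is why anything that must be shared between the server and the rake task has to live outside the process, for example in the database as suggested above.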

How to create a sandboxed RSpec environment?

Essentially, I want to create a program that will run some untrusted code that defines some method or class, and then run an untrusted rspec spec against it.
I've looked into sandboxing Ruby a bit, and this video from rubyconf was particularly helpful. After looking at several solutions, the two that appear to be the most helpful are rubycop, which essentially does static analysis on the code, and the jruby sandbox (both covered in above video). My instinct tells me that the jruby sandbox is probably safer, but I could well be wrong.
Here's a completely unsafe example of what I want to do:
code = <<-RUBY
class Person
def hey
"hey!"
end
end
RUBY
spec = <<-RUBY
describe Person do
let(:person) { Person.new }
it "says hey" do
person.hey.should == "hey!"
end
end
RUBY
# code and spec will be from user input (unsafe)
eval code
require 'rspec/autorun'
eval spec
Which all works fine, but the code obviously needs to be sandboxed. It will be a matter of minutes before some genius submits system("rm -rf /*"), fork while fork or something equally dangerous.
I made various attempts with the jruby sandbox...
sand = Sandbox::Safe.new
sand.eval("require 'rspec/autorun'")
sand.activate! # lock it down
sand.eval code
puts sand.eval spec
That code throws this exception:
Sandbox::SandboxException: NoMethodError: undefined method `require' for #<RSpec::Core::Configuration:0x7c3cfaab>
This is because RSpec tries to require some stuff after the sandbox has been locked down.
So, I tried to force RSpec to require stuff before the sandbox gets locked down by calling an empty describe:
sand = Sandbox::Safe.new
sand.eval("require 'rspec/autorun'")
sand.eval("describe("") { }")
sand.activate! # lock it down
sand.eval code
sand.eval spec
And I get this:
Sandbox::SandboxException: NameError: uninitialized constant RSpec
Which basically means that RSpec doesn't exist in the sandbox. Which is odd, considering sand.eval("require 'rspec/autorun'") returns true, and that the earlier example actually worked (RSpec's autoloader started to run).
It may be a problem with gems and this particular sandbox though. The sandbox object actually supports a method #require, which is essentially bound to Kernel.require, and therefore can't load gems.
It's starting to look like using this sandbox just might not really be possible with rspec. The main problem is trying to actually load it into the sandbox. I even tried something like this:
require 'rspec'
sand.ref(RSpec) # open access to local rspec
But it wasn't having any of it.
So, my question is two-fold:
Does anyone have any bright ideas on how to get this to work with the jruby sandbox?
If not, how secure is rubycop? Apparently codeschool use it, so it must be pretty well tested... it would be nice to be able to use ruby 1.9 instead of jruby as well.
It looks like the sandbox environment isn't loading the bundle/gemset. RVM could be at fault here if you are using a gemset or something.
One might try loading the bundle again once sandboxed.
I would look at Ruby's taint modes, which are controlled by $SAFE, the security level:
0 --> No checks are performed on externally supplied (tainted) data. (default)
1 --> Potentially dangerous operations using tainted data are forbidden.
2 --> Potentially dangerous operations on processes and files are forbidden.
3 --> All newly created objects are considered tainted.
4 --> Modification of global data is forbidden.
I have been trying to figure out a similar problem. I want to use some gems like json and rest-client inside my sandbox after activating it. I tried the following.
require "sandbox"
s=Sandbox.safe
s.eval <<-RUBY
require 'bundler'
Bundler.require :sandbox
RUBY
s.activate!
Gemfile.rb
group :sandbox do
platforms :jruby do
gem 'json'
gem 'rest-client'
end
end
This way, I was able to require gems in my sandbox. But then there were some gem-specific issues with the sandbox. For example, I had to add the method initialize_dup to the whitelist in safe.rb in jruby-sandbox. RestClient has some problem with the Fake File System ALT_SEPARATOR, which I am trying to patch. You can try this approach for RSpec and see if everything goes through.

LoadError: Expected {app_path}/models/model file.rb to define model name

I am getting this error when running my code:
LoadError: Expected /home/user/Desktop/Tripurari/myapp/app/models/host.rb to define Host##
But everything is in its place. Can someone tell me what exactly the problem is with the method below?
def self.check_all(keyword)
  memo_mutex = Mutex.new
  memo = {}
  threads = []
  name = keyword.keyword
  SITES.each do |site_and_options|
    threads << Thread.new do
      @host = Host.find_or_create_by_name(site)
      if keyword.unavailable_usernames.find_by_host_id(@host.id)
        memo[@host.name] = true
      else
        memo[@host.name] = false
      end
    end
  end
  threads.each { |t| t.join }
  memo
end
The issue is probably caused by the autoloader. If the Host class is not yet loaded when you first enter the loop that creates the new threads, it is autoloaded, i.e. Rails searches the load path for a file matching the naming conventions and requires it.
This process is not thread-safe. In your case, as you are creating several threads in quick succession, each trying to autoload the same class, you get race conditions and strange things happen. Basically, you have two options for tackling this:
You can explicitly load the model by calling require 'host' before starting your loop.
Or you can set config.threadsafe! in an initializer. This will (among other things) preload all your classes when starting your server. This is preferred because it avoids a truckload of other difficult-to-debug concurrency issues. For more information about config.threadsafe!, please refer to the excellent article by Aaron Patterson arguing it should be removed altogether in Rails 4.
Assuming the code you've quoted above is in a model's .rb file, add require_relative "host" to the top of that file.
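The first option can be sketched without Rails. This is only an illustration under stated assumptions: Host is a hypothetical Struct standing in for the ActiveRecord model, and the include? check is a placeholder for the real database lookup. The sketch resolves the constant on the main thread before any thread starts, keeps each host in a thread-local variable, and writes to the shared hash only while holding the mutex.

```ruby
# Hypothetical stand-in for the ActiveRecord model; in Rails this
# constant would be resolved by the autoloader on first use.
Host = Struct.new(:name)

def check_all(sites)
  # Touch the constant on the main thread so that any loading it
  # triggers has finished before the worker threads start.
  Host.name

  memo = {}
  memo_mutex = Mutex.new

  threads = sites.map do |site|
    Thread.new do
      host = Host.new(site)      # local variable, private to this thread
      taken = site.include?("x") # placeholder for the real lookup
      memo_mutex.synchronize { memo[host.name] = taken }
    end
  end

  threads.each(&:join)
  memo
end

result = check_all(%w[example.com sample.org])
puts result.inspect
```

Building the thread list with map and joining afterwards keeps the structure of the original method while making the shared state explicit.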

Ruby and Timeout.timeout performance issue

I'm not sure how to solve this big performance issue in my application. I'm using open-uri to request the most popular videos from YouTube, and when I ran perftools (https://github.com/tmm1/perftools.rb) it showed that the biggest performance issue is Timeout.timeout. Can anyone suggest how to solve the problem?
I'm using ruby 1.8.7.
Edit:
This is the output from my profiler
https://docs.google.com/viewer?a=v&pid=explorer&chrome=true&srcid=0B4bANr--YcONZDRlMmFhZjQtYzIyOS00YjZjLWFlMGUtMTQyNzU5ZmYzZTU4&hl=en_US
Timeout is wrapping the function that is actually doing the work to ensure that if the server fails to respond within a certain time, the code will raise an error and stop execution.
I suspect that what you are seeing is that the server is taking some time to respond. You should look at caching the response in some way.
For instance, using memcached (pseudocode)
require 'dalli'
require 'date'
require 'open-uri'
DALLI = Dalli::Client.new
class PopularVideos
  def self.get
    key = "videos_#{Date.today}"
    result = DALLI.get(key)
    unless result
      doc = open("http://youtube/url") # use URI.open on Ruby 2.5 and later
      result = parse_videos(doc) # parse the doc somehow
      DALLI.set(key, result)
    end
    result
  end
end
PopularVideos.get # calls your expensive parsing script once
PopularVideos.get # gets the result from memcached for the rest of the day
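To try the caching idea without a memcached server, the same pattern can be sketched with a plain Hash as the store. This swaps memcached for an in-process cache, so it only helps within a single Ruby process; the class name DailyCache is hypothetical, and the block stands in for the slow open-uri request plus parsing.

```ruby
require 'date'

# In-process cache whose keys include today's date, so entries are
# refreshed at most once per day. The Hash stands in for memcached.
class DailyCache
  def initialize(store = {})
    @store = store
    @mutex = Mutex.new
  end

  # Returns the cached value for today's key and runs the block only on a miss.
  def fetch(name)
    key = "#{name}_#{Date.today}"
    @mutex.synchronize do
      @store.fetch(key) { @store[key] = yield }
    end
  end
end

calls = 0
cache = DailyCache.new
2.times do
  cache.fetch("videos") do
    calls += 1 # the expensive request would happen here
    ["video a", "video b"]
  end
end
puts calls # prints 1
```

With several server processes each one would fill its own cache once per day, which is the reason the answer reaches for a shared store such as memcached.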
