How to deal with memory leaks in RMagick in Ruby?

I'm developing a web application with Merb and I'm looking for a safe and stable image processing library. I used to work with Imagick in PHP, then moved to Ruby and started using RMagick. But there is a problem: long-running scripts cause memory leaks. A couple of solutions exist, but I don't know which one is the most stable. So, what do you think?
Right now, my app uses an internal API that I wrote in PHP to process images. It's running on a separate server along with other applications, so it's not a big problem, but I don't think it's a good architecture.
Anyway, I'll consider any practical tips.

I too have encountered this issue - the solution is to force garbage collection.
When you have reassigned the image variable to a new image, simply use GC.start to ensure the old reference is released from memory.
On later versions of RMagick, I believe you can also call destroy! on the image when you have finished processing it.
A combination of the two would probably ensure you are covered, but I'm not sure of the real-life impact on performance (I would assume it is negligible in most cases).
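For example (a minimal sketch of the destroy!-plus-GC.start combination; the file names are placeholders):
require 'RMagick'
image = Magick::Image.read('a.tif').first
# ... process the image ...
image.destroy!                              # release ImageMagick's pixel memory right away
image = Magick::Image.read('b.tif').first   # reassign the variable to the next image
GC.start                                    # nudge the GC so the old Ruby wrapper goes too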
Alternatively, you could use MiniMagick (the mini_magick gem), which is a wrapper around the ImageMagick command-line tools.
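A minimal sketch of a batch conversion with MiniMagick (assuming the mini_magick gem; the glob pattern and directories are illustrative). Because it shells out to the ImageMagick command-line tools, the pixel data never lives in the Ruby process:
require 'mini_magick'

Dir.glob('/home/tiffs/*.tif') do |file|
  image = MiniMagick::Image.open(file)   # copies the file to a tempfile
  image.format 'png'                     # conversion happens in the external CLI process
  image.write("/home/png/#{File.basename(file, '.*')}.png")
end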

When using RMagick it's important to remember to destroy the image once you are done; otherwise you will fill up the /tmp dir when working with large sets of images. For example, you must call destroy!:
require 'RMagick'

Dir.foreach('/home/tiffs/') do |file|
  next if file == '.' or file == '..'
  image = Magick::Image.read(File.join('/home/tiffs', file)).first
  image.format = "PNG"
  image.write("/home/png/#{File.basename(file, '.*')}.png")
  image.destroy!  # frees the pixel data (and any /tmp spill files) right away
end

Actually, it isn't really a Ruby-specific problem; other interpreters share it as well. The concrete problem is that Ruby's GC only sees memory that was allocated by Ruby itself, not memory allocated by external libraries (with the notable exception of libraries that use Ruby's memory management facilities). So an ImageMagick object in Ruby memory space is really small, but the image in the space managed by ImageMagick is large. This is not a leak per se, but it behaves like one.
Ruby's garbage collector never kicks in as long as your process stays under a certain limit (8 MB is the default). Since ImageMagick never creates large objects in Ruby space, the GC probably never runs. So either use the proposed method of spawning a new process or using exec, or run an image-processing service in the backend that forks for every task. Another option is to have some kind of monitoring in place that kickstarts the GC every once in a while.
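A minimal sketch of the fork-per-task idea (the resize step and file names are illustrative; newer RMagick releases are assumed, hence require 'rmagick'). All ImageMagick allocations happen in the short-lived child, so the operating system reclaims them the moment the child exits:
require 'rmagick'

def process_in_child(path)
  pid = fork do
    # all RMagick/ImageMagick memory is allocated in this child process
    image = Magick::Image.read(path).first
    image.resize_to_fit!(800, 800)
    image.write("#{path}.thumb.jpg")
  end
  Process.wait(pid)   # parent never touches the pixel data and stays small
end

process_in_child('/home/tiffs/photo.tif')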
There is another library called MagickWand by Timothy Paul Hunter (the author of RMagick) that tries to address these issues and provide a nicer API. It's in alpha and requires a rather new release of ImageMagick, though.

Now you can tell RMagick which memory space should be used.
I think RMAGICK_ENABLE_MANAGED_MEMORY = true and GC.start is what you need.
MANAGED_MEMORY
If true, RMagick is using Ruby managed memory for all allocations. If false,
RMagick allocates memory for objects directly from the operating system. You can
enable RMagick to use Ruby managed memory (when built with ImageMagick 6.4.0-11
and later) by setting
RMAGICK_ENABLE_MANAGED_MEMORY = true
before requiring RMagick.
https://rmagick.github.io/constants.html
However, image.destroy! by itself is enough to stabilize memory consumption.
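A minimal sketch of that setup (the constant has to be defined before the require, and per the documentation above this needs ImageMagick 6.4.0-11 or later; the file name is a placeholder):
RMAGICK_ENABLE_MANAGED_MEMORY = true    # must be set before RMagick is loaded
require 'rmagick'

image = Magick::Image.read('photo.tif').first
# ... work with the image ...
image.destroy!   # free the pixel data immediately
GC.start         # with managed memory, the GC now accounts for these allocations too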

This is not due to ImageMagick; it's due to Ruby itself, and it's a well known problem. My suggestion is to split your program into two parts: a long-running part that allocates little memory and just deals with the control of the system, and a separate program that actually does the processing work. The long-running control process should do just enough to find some work for a child process that it spawns, and the child should do all of the processing for that particular work item.
Another option would be to leave the two combined, but after a work unit is complete, use exec to replace your process with a freshly started version of the same program, which would search for another work item, process it, and exec itself again.
This is assuming that the work items are fairly large, which they almost certainly are if you're using ImageMagick. If they're not, you'll find that the overhead of spawning a new process and having the Ruby interpreter re-parse your entire program starts to get a little too large. You can deal with this by having your program do more work units (say, ten or a hundred) before re-executing itself.
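A rough sketch of the exec variant (WorkQueue and ImageJob are hypothetical placeholders for however you fetch and process work items); the point is that exec replaces the bloated interpreter with a fresh one after a fixed number of items:
# worker.rb
ITEMS_PER_RUN = 10                 # tune so the re-exec overhead stays negligible

ITEMS_PER_RUN.times do
  item = WorkQueue.pop             # hypothetical: fetch the next pending work item
  exit if item.nil?                # nothing left to do
  ImageJob.process(item)           # hypothetical: the actual ImageMagick work
end

exec('ruby', __FILE__)             # start over with a clean memory footprint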

Related

How to index tons of data at once with Rails, (Re)tire, JSON without eating (all) memory?

In a Rails 3.2.x app using (Re)tire to access an ES cluster, a rake task is going through approximately 1M rows to create a new index (Ruby 1.9.3).
The task uses .to_json with specific attributes and methods listed to limit the resulting hash for each element.
Yet as the task runs, memory is eaten away, usually ending with the process being killed by the system.
The task is already using find_in_batches. Smaller batch sizes (using find_each) don't help.
checking without index
Removing the index.import call does improve things (obviously): the task goes through the whole collection very fast without a problem. That points to either ES, Tire, or the JSON conversion (and the relations it might call upon).
reducing the scope of the task
Adding back index.import and passing a very limited hash (with string keys) for each item does make things slower, but not by much, and does not eat memory away. So JSON might not be the culprit here.
adding attributes and methods back
The culprit seems to be one of the methods used to grab one of the additional attributes. It's based on a relation of the model and another ... ending up with a lot of models being involved and sifted through.
As pointed out in Index the results of a method in ElasticSearch (Tire + ActiveRecord), adding includes does help a bit, but the task still ends up heavy.
going around
I also tried to work around part of the problem by replacing the calls to Tire with the ES bulk API.
Generating JSON files and sending them with a Ruby HTTP lib can work. Yet the same memory problem arises, since the same requests to the DB are made.
What's left?
What I don't get is why, even with find_in_batches, Ruby keeps eating memory. I would expect that after each batch of data, the memory related to that batch would be freed.
Next to try: GC.start calls, and deactivating Active Record caching around the task.
Yet, unless a solution limits memory use drastically (300 or 500 MB instead of 800+), the underlying issue remains: indexing a lot of instances of a model, including data related to some other models.
Am I missing something about the import and includes that would solve the issue?
Would splitting that task into smaller background jobs (Resque, Sidekiq) help? I would suppose so, as each batch would be isolated from the others and, once processed, would really free up the memory (?). (Orchestrating those tasks would be another problem.)
Are there good practices for indexing large quantities of data into ES?
I've been using Rails + Elasticsearch for a while and did this kind of dance a few times.
A few things come to mind, in no particular order.
Did you try the recent elasticsearch gem (instead of Tire)? I've updated my apps to use it and I like having more control over what is done.
I would also try to force a GC sweep after each ActiveRecord loop (see the sketch after this list). You could also be extra careful with memory allocation by explicitly resetting all local variables each time.
You could use the fork & exec trick to fork a brand-new process at each loop; it would be the most effective GC you can get. It's a little extra work the first time you write it, but the pay-off is great. Take good care to limit the amount of memory used in the outer part of the task. Using a process-based background task would partly achieve the same goal, but you might still get memory bloat.
Can you limit the use of ActiveRecord? If you only need some basic associations you could use a lower-level/simpler tool like Sequel (or something else) and work with Ruby hashes/arrays instead of full-fledged AR models.
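A minimal sketch of the per-batch cleanup idea (the model name and index call are illustrative, based on the Tire-style index.import mentioned in the question):
Model.find_in_batches(batch_size: 500) do |batch|
  Model.index.import(batch)   # push just this batch to Elasticsearch
  batch = nil                 # drop the local reference, as suggested above
  GC.start                    # force a sweep before loading the next batch
end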

Forcing a synchronous garbage collection in Ruby

I am trying to use the GDAL bindings to create geographic datasets in a Ruby on Rails app. However, GDAL only flushes those datasets on disk when the corresponding Ruby objects are destroyed. This (unanswered) question provides a nice explanation of what I am facing.
I tried setting every variable to nil and manually running GC.start, but as I understand it the Ruby GC is somewhat asynchronous (tell me if I'm wrong there, as I have limited Ruby experience), so this doesn't work all the time.
Is there a way to force a synchronous garbage collection so that I can be absolutely certain that my objects are destroyed when it is done?
Note that I would vastly prefer using GDAL over other libraries, as I have a large existing Python codebase that also uses the GDAL bindings, and the Python to Ruby translation is (or should be) relatively painless.

Why does loading cached objects increase the memory consumption drastically when computing them will not?

Relevant background info
I've built a little software that can be customized via a config file. The config file is parsed and translated into a nested environment structure (e.g. .HIVE$db = an environment, .HIVE$db$user = "Horst", .HIVE$db$pw = "my password", .HIVE$regex$date = some regex for dates etc.)
I've built routines that can handle those nested environments (e.g. look up the value of "db/user" or "regex/date", change it, etc.). The thing is that the initial parsing of the config files takes a long time and results in quite a big object (actually three to four objects, between 4 and 16 MB). So I thought "No problem, let's just cache them by saving the object(s) to .Rdata files". This works, but "loading" the cached objects makes my Rterm process go through the roof with respect to RAM consumption (over 1 GB!!), and I still don't really understand why (this doesn't happen when I "compute" the object anew, but that's exactly what I'm trying to avoid since it takes too long).
I already thought about maybe serializing it, but I haven't tested it as I would need to refactor my code a bit. Plus I'm not sure if it would affect the "loading back into R" part in just the same way as loading .Rdata files.
Question
Can anyone tell me why loading a previously computed object has such effects on memory consumption of my Rterm process (compared to computing it in every new process I start) and how best to avoid this?
If desired, I will also try to come up with an example, but it's a bit tricky to reproduce my exact scenario. Yet I'll try.
It's likely because the environments you are creating are carrying around their ancestors. If you don't need the ancestor information then set the parents of such environments to emptyenv() (or just don't use environments if you don't need them).
Also note that formulas (and, of course, functions) have environments so watch out for those too.
If it's not reproducible by others, it will be hard to answer. However, I do something quite similar to what you're doing, yet I use JSON files to store all of my values. Rather than parse the text, I use RJSONIO to convert everything to a list, and getting stuff from a list is very easy. (You could, if you want, convert to a hash, but it's nice to have layers of nested parameters.)
See this answer for an example of how I've done this kind of thing. If that works out for you, then you can forego the expensive translation step and the memory ballooning.
(Taking a stab at the original question...) I wonder if your issue is that you are using an environment rather than a list. Saving environments might be tricky in some contexts. Saving lists is no problem. Try using a list or try converting to/from an environment. You can use the as.list() and as.environment() functions for this.

NSThread or pythons' threading module in pyobjc?

I need to do some network bound calls (e.g., fetch a website) and I don't want it to block the UI. Should I be using NSThread's or python's threading module if I am working in pyobjc? I can't find any information on how to choose one over the other. Note, I don't really care about Python's GIL since my tasks are not CPU bound at all.
It will make no difference; you will get the same behavior with slightly different interfaces. Use whichever fits best into your system.
Learn to love the run loop. Use Cocoa's URL-loading system (or, if you need plain sockets, NSFileHandle) and let it call you when the response (or failure) comes back. Then you don't have to deal with threads at all (the URL-loading system will use a thread for you).
Pretty much the only time to create your own threads in Cocoa is when you have a large task (>0.1 sec) that you can't break up.
(Someone might say NSOperation, but NSOperationQueue is broken and RAOperationQueue doesn't support concurrent operations. Fine if you already have a bunch of NSOperationQueue code or really want to prepare for working NSOperationQueue, but if you need concurrency now, run loop or threads.)
I'm more fond of the native Python threading solution since I can join and reference threads. AFAIK, NSThreads don't support joining and cancelling, and you can get a variety of things done with Python threads.
Also, it's a bummer that NSThreads can't take multiple arguments, and though there are workarounds for this (like using NSDictionaries and NSArrays), it's still not as elegant or as simple as invoking a thread with the arguments laid out in order as corresponding parameters.
But yeah, if the situation demands that you use NSThreads, there shouldn't be any problem at all. Otherwise, it's fine to stick with native Python threads.
I have a different suggestion, mainly because Python threading is just plain awful because of the GIL (Global Interpreter Lock), especially when you have more than one CPU core. There is a video presentation that goes into this in excruciating detail, but I cannot find the video right now - it was done by a Google employee.
Anyway, you may want to think about using the subprocess module instead of threading (have a helper program that you can execute, or use another binary on the system). Or use NSThread; it should give you more performance than what you can get with CPython threads.

Is there a way to dump the objects in memory from a running ruby process?

Killing the process while obtaining this information would be fine.
A quick-and-dirty way would be ObjectSpace.each_object{|e| p e}. You could do some tests to determine what you want to keep, or Marshal the objects.
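A slightly more structured variant of that idea, counting live objects by class so the output stays readable (a minimal sketch; run it inside the process, e.g. from a console or signal handler):
counts = Hash.new(0)
ObjectSpace.each_object { |obj| counts[obj.class] += 1 }   # walk every live object
counts.sort_by { |_, n| -n }.first(20).each do |klass, n|
  puts "#{klass}: #{n}"                                    # top 20 classes by instance count
end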
For 1.9.2/1.9.3 there's the heap_dump gem; it can be injected into a running process using gdb (but a more stable way is to include it in the process itself; there's no performance overhead).
It dumps references to objects, not the objects themselves, but this is usable if you're fighting leaks.
For the more hardcore there is also BleakHouse, which gives you a special custom-compiled copy of Ruby with better memory-leak-tracking powers.
