Reduce memory usage of Ruby in a RoR script

I have a RoR app and a cron rake task, something like:
Model.all.each do |m|
  if m < some_condition
    m.do_something
    m.save
  end
end
Model has 1,000,000 records (about 200,000 of which match the condition). Is there any way to reduce the task's memory usage? It currently takes gigabytes of memory, and the Ruby process gets killed on the production server. My DB is PostgreSQL.

You should use methods like #find_each and #find_in_batches. These load only a small batch of records at a time instead of the whole table. Take a look at ActiveRecord::Batches.
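For example, applied to the model from the question (a minimal sketch; matches_condition? is a hypothetical predicate standing in for the original check):
# find_each loads records in batches (1000 by default) and yields them one at a time
Model.find_each do |m|
  if m.matches_condition? # hypothetical stand-in for `m < some_condition`
    m.do_something
    m.save
  end
end

# find_in_batches yields each batch as an array instead
Model.find_in_batches(:batch_size => 500) do |batch|
  batch.each(&:do_something)
end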

I would suggest using find_each, which loads your objects in batches and yields them one at a time.
Also, if possible, apply the condition you currently check inside the loop in SQL, so ActiveRecord does not have to instantiate (and therefore use memory for) objects you are not going to use anyway:
Model.find_each(:conditions => { :my => :condition }) do |m|
  # do something
end
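On newer Rails versions the same idea is usually written as a where scope chained with find_each (a sketch; the column name and batch size here are placeholders):
Model.where(:my => :condition).find_each(:batch_size => 1000) do |m|
  m.do_something
  m.save
end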

You can try the following method:
def with_gc(scope)
  count = scope.count
  limit = 100
  # walk the scope in slices of `limit` records, forcing a GC run after each slice
  (0..count).step(limit) do |offset|
    new_count = scope.count
    raise "query depends on updated param. Expected count #{count}, got #{new_count}" if count != new_count
    # scope must respond to offset/limit, e.g. an ActiveRecord relation
    scope.offset(offset).limit(limit).each do |record|
      yield record
    end
    GC.start
  end
end
You can use it like this:
with_gc(Model.all) do |m|
  if m < some_condition
    m.do_something
    m.save
  end
end

Related

Ruby Array Operations: Increasing Efficiency

In preparation for manipulating a large chunk of data, I perform some cleaning and pruning operations prior to processing it. The code functions fine, but I'm worried about how these operations will scale when I run them on millions of points. The code feels inefficient, but I don't know how to simplify my steps.
In my process, I parse a CSV file, check for garbage data (i.e. non-numerical values), typecast the remaining data to floats, and then sort it. I'm hoping for some guidance on how to improve this if possible.
require 'green_shoes'
require 'csv'
class String
  def valid_float?
    true if Float self rescue false
  end
end
puts "Parsing..."
temp_file = ask_open_file("")
temp_arr = CSV.read(temp_file)
temp_arr.each do |temp_row|
  temp_row.map!{ |x| !x.valid_float? ? 0 : x }
  temp_row.map!{ |x| x.to_f }
  temp_row.sort!
end
My guess is that you'd want to return the cleaned file contents when you are done, right? If so, you'll want to use map! on temp_arr instead of each.
You can also save an iteration by combining the first two map! calls:
temp_arr.map! do |temp_row|
  temp_row.map!{ |x| x.valid_float? ? x.to_f : 0.0 }
  temp_row.sort!
end
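If memory is still a concern at millions of rows, one further option (my suggestion, not part of the original answer) is to stream the file with CSV.foreach instead of slurping it with CSV.read; this reuses the valid_float? helper from the question:
cleaned = []
CSV.foreach(temp_file) do |temp_row| # reads one row at a time instead of the whole file
  cleaned << temp_row.map { |x| x.valid_float? ? x.to_f : 0.0 }.sort
end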

Equivalent to find_each in DataMapper

Is there a way to do the equivalent of ActiveRecord#find_each in DataMapper?
(find_each iterates over the result of a query by fetching records in batches of 1000 rather than loading everything into memory.)
I checked dm-chunked_query as @MichaelKohl suggested, but I couldn't make it work as I expected: it fetches the whole collection (I'd expect it to use OFFSET + LIMIT). So I wrote my own extension; it's pretty simple, hope it helps:
class DataMapper::Collection
  def batch(n)
    Enumerator.new do |y|
      offset = 0
      loop do
        records = slice(offset, n)
        break if records.empty?
        records.each { |record| y.yield(record) }
        offset += records.size
      end
    end
  end
end

# Example
Model.all(:order => :id.asc).batch(1000).each { |obj| p obj }
I don't use DM much, but it would not be that hard to write your own, assuming DM lets you apply your own 'limit' and 'offset' manually to queries.
Check out the implementation of find_each/find_in_batches in AR; it's only a couple dozen lines.
https://github.com/rails/rails/blob/master/activerecord/lib/active_record/relation/batches.rb#L19
https://github.com/rails/rails/blob/master/activerecord/lib/active_record/relation/batches.rb#L48
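For reference, the AR implementation linked above batches on the primary key rather than OFFSET, which stays fast even on large tables. A rough DataMapper-flavoured sketch of the same idea (my own, untested; it assumes a numeric :id key and uses standard DM query options):
def each_in_batches(model, batch_size = 1000)
  last_id = 0
  loop do
    # fetch the next slice strictly above the last seen id, ordered by id
    batch = model.all(:id.gt => last_id, :order => [:id.asc], :limit => batch_size).to_a
    break if batch.empty?
    batch.each { |record| yield record }
    last_id = batch.last.id
  end
end

# Usage (hypothetical helper, not part of DataMapper):
each_in_batches(Model, 500) { |obj| p obj }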

What's the most efficient way to iterate through an entire table using Datamapper?

If I do this, does Datamapper try to pull the entire result set into memory before performing the iteration? Assume, for the sake of argument, that I have millions of records and that this is infeasible:
Author.all.each do |a|
  puts a.title
end
Is there a way that I can tell Datamapper to load the results in chunks? Is it smart enough to know to do this automatically?
Thanks, Nicolas, I actually came up with a similar solution. I've accepted your answer since it makes use of Datamapper's dm-pagination system, but I'm wondering if this would do equally as well (or worse):
while authors = Author.slice(offset, CHUNK) do
  authors.each do |a|
    # do something with a
  end
  offset += CHUNK
end
DataMapper will run just one SQL query for the example above, so it will have to keep the whole result set in memory.
I think you should use some sort of pagination if your collection is big.
Using dm-pagination you could do something like:
PAGE_SIZE = 20
pager = Author.page(:per_page => PAGE_SIZE).pager # This will run a count query
(1..pager.total_pages).each do |page_number|
  Author.page(:per_page => PAGE_SIZE, :page => page_number).each do |a|
    puts a.title
  end
end
You can play around with different values for PAGE_SIZE to find a good trade-off between the number of SQL queries and memory usage.
What you want is the dm-chunked_query plugin (example from the docs):
require 'dm-chunked_query'

MyModel.each_chunk(20) do |chunk|
  chunk.each do |resource|
    # ...
  end
end
This will allow you to iterate over all the records in the model, in chunks of 20 records at a time.
EDIT: the example above had an extra #each after #each_chunk, and it was unnecessary. The gem author updated the README example, and I changed the above code to match.

Guidance with benchmarking some string splitting etc

I want to load an array with 1 million GUIDs, then loop through them and perform some string operations on each element of the array.
I only want to benchmark the time for the operations I perform on each element of the array, not the time it takes to initialize the array with 1 million rows.
I tried doing a benchmark before, but I didn't understand the output.
How would you do this? I have:
rows = []
(1..1000000).each do |x|
  rows[x] = "some string data" # placeholder for the real data
end

n = 50000
Benchmark.bm do |x|
  rows.each do |r|
    # perform string logic here
  end
end
Will this return consistent results?
Any guidance/gotchas I should know about?
Yes, this will return consistent results. You need to report the benchmark, however, and (if processing a million rows is too fast) you will need to use your n variable to iterate a few times. (Start with a low n and increase it if your times are in the tenths or hundredths of a second.)
require 'benchmark'

# Prepare your test data here
n = 1
Benchmark.bm do |x|
  x.report('technique 1') do
    n.times do
      # perform your string logic here
    end
  end
  x.report('technique 2') do
    n.times do
      # perform your alternative logic here
    end
  end
end
Make sure you run your multiple comparisons in the same Benchmark block; don't write one attempt, write down the numbers, and then change the code to run it again. Not only is that more work for you, but it also may produce incorrect comparisons if your machine is in a different state (or if, heaven forfend, you run one test on one machine and another test on another machine).
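If you're worried about warm-up or GC noise skewing the comparison, the standard library's Benchmark.bmbm runs a rehearsal pass before the measured pass (a sketch with placeholder bodies):
require 'benchmark'

n = 1
Benchmark.bmbm do |x|
  x.report('technique 1') do
    n.times { } # replace with your string logic
  end
  x.report('technique 2') do
    n.times { } # replace with your alternative logic
  end
end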

How to chunk an array in Ruby

In Ruby 1.8.6, I have an array of, say, 100,000 user ids, each of which is an int. I want to perform a block of code on these user ids, but I want to do it in chunks. For example, I want to process them 100 at a time. How can I achieve this as simply as possible?
I could do something like the following, but probably there's an easier way:
a = Array.new
userids.each { |userid|
  a << userid
  if a.length == 100
    # Process chunk
    a = Array.new
  end
}
unless a.empty?
  # Process chunk
end
Use each_slice:
require 'enumerator' # only needed in Ruby 1.8.6 and earlier

userids.each_slice(100) do |a|
  # do something with a
end
Rails has in_groups_of, which under the hood uses each_slice.
userids.in_groups_of(100) do |group|
  # process group
end
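As an aside (my own addition, assuming a modern ActiveRecord setup and a User model, neither of which appears in the original question), chunked ids pair naturally with batched database lookups, so you never hold more than one chunk of records in memory:
userids.each_slice(100) do |batch|
  # one query per 100 ids; only this batch of User records is instantiated at a time
  User.where(:id => batch).each do |user|
    # do something with user
  end
end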
