In Ruby 1.8.6, I have an array of, say, 100,000 user ids, each of which is an int. I want to perform a block of code on these user ids, but I want to do it in chunks of, for example, 100 at a time. How can I achieve this as simply as possible?
I could do something like the following, but probably there's an easier way:
a = Array.new
userids.each { |userid|
  a << userid
  if a.length == 100
    # Process chunk
    a = Array.new
  end
}
unless a.empty?
  # Process chunk
end
Use each_slice:
require 'enumerator' # only needed in ruby 1.8.6 and before
userids.each_slice(100) do |a|
  # do something with a
end
Rails has in_groups_of, which under the hood uses each_slice.
userids.in_groups_of(100) { |group|
  # process group
}
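Note that in_groups_of pads the last group with nil by default; pass false as the second argument if you want the final group left short instead:
userids.in_groups_of(100, false) { |group|
  # process group; the last one may have fewer than 100 elements
}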
In preparation for manipulating a large chunk of data, I perform some cleaning and pruning operations before processing it. The code works fine, but I'm worried about how these operations will scale when I run them on millions of points. It feels inefficient, but I don't know how to simplify the steps.
In my process, I parse a CSV file, check for garbage data (i.e. non-numerical values), typecast the remaining data to floats, and then sort it. I'm hoping for some guidance on how to improve this if possible.
require 'green_shoes'
require 'csv'

class String
  def valid_float?
    true if Float(self) rescue false
  end
end

puts "Parsing..."
temp_file = ask_open_file("")
temp_arr = CSV.read(temp_file)
temp_arr.each do |temp_row|
  temp_row.map!{ |x| !x.valid_float? ? 0 : x }
  temp_row.map!{ |x| x.to_f }
  temp_row.sort!
end
My guess is that you'd want to return the file contents when you are done, right? If so, you'll want to use map on temp_arr instead of each.
You can also save an iteration by combining the first two map! calls into one:
temp_arr.map! do |temp_row|
  temp_row.map!{ |x| x.valid_float? ? x.to_f : 0.0 }
  temp_row.sort!
end
Note: I'm using Ruby 1.9.3 and I can't bring in any external dependencies; I have to do this with the core library.
@run_histogram is a Hash mapping scenario names to a Hash of counts (two at this time, :failure and :run).
foo = @run_histogram.sort_by { |scenario_name, failure_and_run_count|
  (failure_and_run_count[:failure].to_f / failure_and_run_count[:run].to_f) * -100.0
}
foo.each{|x| puts x[0]; puts x[1][:run]; puts x[1][:failure] }
The sorting is working properly, but now my issue is that I want to be able to print out the scenario_name (the key into the hash) and then print out the run and failure counts.
Unfortunately at the moment I'm forced to use the indexes of the array after using sort_by, which is bad. They're "magic numbers". I would much rather continue to use the :run and :failure symbols to access the data.
Anyone have a better solution?
Why don't you just name the arguments like you did in the first place?
foo = @run_histogram.sort_by { |scenario_name, failure_and_run_count|
  (failure_and_run_count[:failure].to_f / failure_and_run_count[:run].to_f) * -100.0
}
foo.each do |scenario_name, failure_and_run_count|
  puts scenario_name
  puts failure_and_run_count[:run]
  puts failure_and_run_count[:failure]
end
You can also chain these together if you don't need to store the intermediate form:
@run_histogram.sort_by do |scenario_name, failure_and_run_count|
  # ...
end.each do |scenario_name, failure_and_run_count|
  # ...
end
Note that it's more typical to use { ... } for single-line blocks, and do ... end for multi-line. Forcing a series of things into one line impairs readability and is generally a bad thing.
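For example, the same trivial iteration written both ways:
[1, 2, 3].each { |n| puts n } # single-line: braces

[1, 2, 3].each do |n| # multi-line: do ... end
  squared = n * n
  puts squared
end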
I have a RoR app and a cron rake-task, something like:
Model.all.each do |m|
  if m < some_condition
    m.do_something
    m.save
  end
end
Model has 1,000,000 records (200,000 of which meet the condition). Is there any way to improve the task's memory usage? It takes gigabytes of memory, and the Ruby process gets killed by the server in production. My DB is PostgreSQL.
You should use methods like #find_each and #find_in_batches. These load only a small portion of the records at a time. Take a look at ActiveRecord::Batches.
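For instance, a minimal sketch of both (the batch size and method bodies are placeholders; both default to batches of 1000):
Model.find_each(:batch_size => 500) do |m|
  # yields records one at a time, loaded 500 per query
  m.do_something
  m.save
end

Model.find_in_batches(:batch_size => 500) do |batch|
  # yields each batch as an array of up to 500 records
  batch.each { |m| m.do_something }
end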
I would suggest using find_each, which yields your objects in batches.
Also, apply the condition you have inside the loop in SQL if possible, so ActiveRecord does not have to instantiate (and therefore use memory for) the objects you're not using anyway:
Model.find_each(:conditions => {:my => :condition}) do |m|
  # do something
end
You can try the following method:
def with_gc(enum)
  count = enum.count
  limit = 100
  # walk the collection a page at a time, forcing garbage collection after
  # each page; assumes enum is a lazy scope responding to #count, #skip and #limit
  (0..count).select { |i| i % limit == 0 }.each do |index|
    new_count = enum.count
    raise "query depends on updated param. Expected count #{count}, got #{new_count}" if count != new_count
    enum.skip(index).limit(limit).each do |record|
      yield record
    end
    GC.start
  end
end
You can use it like this:
with_gc(Model.all) do |m|
  if m < some_condition
    m.do_something
    m.save
  end
end
Is there a way to do the equivalent of ActiveRecord#find_each in DataMapper ?
(find_each iterates over the result of a query, fetching records in batches of 1000 rather than loading everything into memory at once.)
I checked dm-chunked_query as @MichaelKohl suggested, but I couldn't make it work as I'd expect: it fetches the whole collection (I'd expect it to use OFFSET+LIMIT). So I wrote my own extension. It's pretty simple; hope it helps:
class DataMapper::Collection
  def batch(n)
    Enumerator.new do |y|
      offset = 0
      loop do
        records = slice(offset, n)
        break if records.empty?
        records.each { |record| y.yield(record) }
        offset += records.size
      end
    end
  end
end
# Example
Model.all(:order => :id.asc).batch(1000).each { |obj| p obj }
I don't use DM much, but it would not be that hard to write your own, assuming DM lets you apply your own 'limit' and 'offset' manually to queries.
Check out the implementation of find_each/find_in_batches in AR, only a couple dozen lines.
https://github.com/rails/rails/blob/master/activerecord/lib/active_record/relation/batches.rb#L19
https://github.com/rails/rails/blob/master/activerecord/lib/active_record/relation/batches.rb#L48
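As a rough sketch of the idea, a stand-alone helper (the find_in_batches name here is made up, and it assumes the model accepts :limit, :offset and :order query options):
def find_in_batches(model, batch_size = 1000)
  offset = 0
  loop do
    # fetch one page of records ordered by id so pages don't overlap
    batch = model.all(:limit => batch_size, :offset => offset, :order => [:id.asc])
    break if batch.empty?
    yield batch
    offset += batch.size
  end
end

find_in_batches(Model, 500) { |batch| batch.each { |obj| p obj } }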
Here is the code I'm working with:
class Trader
  def initialize(ticker = "GLD")
    @ticker = ticker
  end

  def yahoo_data(days = 12)
    require 'yahoofinance'
    YahooFinance::get_historical_quotes_days(@ticker, days) do |row|
      puts "#{row.join(',')}" # this is where a solution is required
    end
  end
end
The yahoo_data method gets data from Yahoo Finance and puts the price history on the console. But instead of a simple puts that evaporates into the ether, how would you use the preceding code to populate an array that can later be manipulated as an object?
Something along the lines of :
do |row| populate_an_array_method(row.join(',')) end
If you don't give a block to get_historical_quotes_days, you'll get an array back. You can then use map on that to get an array of the results of join.
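For example (a sketch; per the above, get_historical_quotes_days returns an array of rows when no block is given):
def yahoo_data(days = 12)
  require 'yahoofinance'
  rows = YahooFinance::get_historical_quotes_days(@ticker, days)
  rows.map { |row| row.join(',') } # returns an array of CSV-style strings
end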
In general, since Ruby 1.8.7, most iterator methods will return an Enumerator when they're called without a block. So if foo.bar {|x| puts x} would print the values 1, 2, 3, then enum = foo.bar will return an enumerator containing the values 1, 2, 3, and arr = foo.bar.to_a will give you the array [1, 2, 3].
If you have an iterator method which does not do this (from some library, perhaps, which does not adhere to this convention), you can use foo.enum_for(:bar) to get an enumerator containing all the values yielded by bar.
So hypothetically, if get_historical_quotes_days did not already return an array, you could use YahooFinance.enum_for(:get_historical_quotes_days).map {|row| row.join(",") } to get what you want.
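A self-contained illustration of enum_for with a block-only iterator (the Quotes class is made up for the demo):
class Quotes
  # only yields; never returns an enumerator on its own
  def each_row
    yield [1, 2]
    yield [3, 4]
  end
end

q = Quotes.new
q.enum_for(:each_row).map { |row| row.join(',') } #=> ["1,2", "3,4"]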