Let's say I have a large query (for the purposes of this exercise say it returns 1M records) in MongoDB, like:
users = Users.where(:last_name => 'Smith')
If I loop through this result, working with each member, with something like:
users.each do |user|
  # Some manipulation to "user"
  # Some calculation for "user"
  ...
  # Saving "user"
end
I'll often get a Mongo cursor timeout (because the cursor the database reserves stays open longer than the default timeout). I know I can extend the cursor timeout, or even turn it off--but this isn't always the most efficient approach. So, one way I get around this is to change the code to:
users = Users.where(:last_name => 'Smith')
user_array = []
users.each do |u|
  user_array << u
end
THEN, I can loop through user_array (since it's a Ruby array), doing manipulations and calculations, without worrying about a MongoDB timeout.
This works fine, but there has to be a better way--does anyone have a suggestion?
If your result set is so large that it causes cursor timeouts, it's not a good idea to load it entirely into RAM.
A common approach is to process records in batches (a code sketch follows these steps):
Get 1000 users (sorted by _id).
Process them.
Get another batch of 1000 users whose _id is greater than the _id of the last processed user.
Repeat until done.
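A rough sketch of that loop, assuming Mongoid-style criteria as in the question (the exact syntax for the :_id comparison and sorting may differ between Mongoid versions):
batch_size = 1000
last_id = nil

loop do
  scope = Users.where(:last_name => 'Smith').asc(:_id).limit(batch_size)
  scope = scope.where(:_id.gt => last_id) if last_id
  batch = scope.to_a              # only batch_size documents are held in memory
  break if batch.empty?

  batch.each do |user|
    # Some manipulation / calculation, then save "user"
  end

  last_id = batch.last.id         # checkpoint for the next batch
end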
For a long running task, consider using rails runner.
runner runs Ruby code in the context of Rails non-interactively. For instance:
$ rails runner "Model.long_running_method"
For further details, see:
http://guides.rubyonrails.org/command_line.html
Related question:
I'm implementing Model.find_each as mentioned here:
http://edgeguides.rubyonrails.org/active_record_querying.html#retrieving-multiple-objects-in-batches
but I'm getting this error message:
Ruby ActiveRecord: DEPRECATION WARNING: Relation#find_in_batches with finder options is deprecated. Please build a scope and then call find_in_batches on it instead.
for this code:
Person.find_each(start: start_index, limit: limit) do |person|
I'm pretty much following the code given in the documentation so I'm a little puzzled. Is this code correct and, if not, what's the fix?
Change limit: limit to batch_size: limit.
EDIT
Since you need a limit and your batch is based on id, I think you can do something like:
Person.where("id < ?", (start_index + limit)).find_each(start: start_index) do |person|
I don't get exactly what the issue is, but I think you can just write this:
Person.offset(start_index).limit(limit).find_each(batch_size: NUMBER) do |person|
find_each is just a replacement for each; it's the last thing you should call on a query, and you should use it only to choose how many results are loaded into memory at a time (batch_size).
For offsets/limits you should use the usual query API.
Important: note that find_each has nothing to do with find; find_each replaces each, which should be used to loop over records, not to find/filter them.
Also, it is clearly stated in the guide:
The find_each method accepts most of the options allowed by the
regular find method, except for :order and :limit, which are reserved
for internal use by find_each.
So you can't use :limit. In any case I think there is something wrong in the guide: you should use find_each for looping, not for searching.
Notice also that :start is meant for resuming an interrupted job from a saved point, much as the docs state:
By default, records are fetched in ascending order of the primary key,
which must be an integer. The :start option allows you to configure
the first ID of the sequence whenever the lowest ID is not the one you
need. This would be useful, for example, if you wanted to resume an
interrupted batch process, provided you saved the last processed ID as
a checkpoint.
You should not use it to replace ActiveRecord's offset and limit methods.
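For completeness, a minimal sketch of the checkpoint-style resume described in the docs (the last_processed_id bookkeeping is my own illustration, not part of the question's code):
last_processed_id = 2000   # e.g. loaded from wherever the checkpoint was saved

Person.find_each(start: last_processed_id + 1, batch_size: 1000) do |person|
  # ... process person ...
  last_processed_id = person.id   # persist this somewhere durable in a real job
end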
I have a problem (due to time) when inserting around 13,000 records into the device's database.
Is there any way to optimize this? Is it possible to put these all into one transaction? I believe it is currently creating one transaction per insert, which apparently has a diabolical effect on speed.
Currently this takes around 10 minutes; that includes converting the CSV to a hash (which doesn't seem to be the bottleneck).
Stupidly I am not using RhoSync...
Thanks
Set up a transaction around the inserts and then only commit at the end.
From their FAQ.
http://docs.rhomobile.com/faq#how-can-i-seed-a-large-amount-of-data-into-my-application-with-rhom
db = ::Rho::RHO.get_src_db('Model')
db.start_transaction
begin
  items.each do |item|
    # create hash of attribute/value pairs
    data = {
      :field1 => item['value1'],
      :field2 => item['value2']
    }
    # Creates a new Model object and saves it
    new_item = Model.create(data)
  end
  db.commit
rescue
  db.rollback
end
I've found this technique to be a tremendous speed up.
Use fixed schema rather than property bag, and you can use one transaction (see the link below for how).
http://docs.rhomobile.com/rhodes/rhom#perfomance-tips.
This question was answered by someone else on Google Groups (HAYAKAWA Takashi).
I'm looking for a Ruby ORM to replace ActiveRecord. I've been looking at Sequel and DataMapper. They look pretty good; however, none of them seems to do the basics: not loading everything into memory when you don't need it.
I mean I've tried the following (or equivalent) on ActiveRecord and Sequel on a table with lots of rows:
posts.each { |p| puts p }
Both of them go crazy on memory. They seem to load everything into memory rather than fetching stuff when needed. I used find_in_batches in ActiveRecord, but it's not an acceptable solution:
ActiveRecord is not an acceptable solution because we had too many problems with it.
Why should my code be aware of a paging mechanism? I'm happy to configure the page size somewhere, but that's it. With find_in_batches you need to do something like:
post.find_in_batches { |batch| batch.each { |p| puts p } }
But that should be transparent.
So is there somewhere a reliable Ruby ORM which does the fetch properly?
Update:
As Sergio mentioned, in Rails 3 you can use find_each, which is exactly what I want. However, as ActiveRecord is not an option (unless someone can really convince me to use it), the questions are:
Which ORMs support the equivalent of find_each?
How to do it?
Why do we need a find_each, when find should do it, shouldn't it?
Sequel's Dataset#each does yield individual rows at a time, but most database drivers will load the entire result in memory first.
If you are using Sequel's Postgres adapter, you can choose to use real cursors:
posts.use_cursor.each{|p| puts p}
This fetches 1000 rows at a time by default, but you can use an option to specify the number of rows to grab per cursor fetch:
posts.use_cursor(:rows_per_fetch=>100).each{|p| puts p}
If you aren't using Sequel's Postgres adapter, you can use Sequel's pagination extension:
Sequel.extension :pagination
posts.order(:id).each_page(1000){|ds| ds.each{|p| puts p}}
However, like ActiveRecord's find_in_batches/find_each, this does separate queries, so you need to be careful if there are concurrent modifications to the dataset you are retrieving.
The reason this isn't the default in Sequel is probably the same reason it isn't the default in ActiveRecord, which is that it isn't a good default in the general case. Only queries with large result sets really need to worry about it, and most queries don't return large result sets.
At least with the Postgres adapter cursor support, it's fairly easy to make it the default for your model:
Post.dataset = Post.dataset.use_cursor
For the pagination extension, you can't really do that, but you can wrap it in a method that makes it mostly transparent.
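For example, a minimal sketch of such a wrapper (the helper name each_row_in_pages is my own, not part of Sequel's API):
Sequel.extension :pagination

# Iterate a dataset page by page so callers can treat it like a plain each.
def each_row_in_pages(dataset, page_size = 1000)
  dataset.order(:id).each_page(page_size) do |page|
    page.each { |row| yield row }
  end
end

each_row_in_pages(posts) { |p| puts p }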
Sequel.extension :pagination
posts.order(:id).each_page(1000) do |ds|
  ds.each { |p| puts p }
end
It is very very slow on large tables!
It becomes clear when you look at the method body:
http://sequel.rubyforge.org/rdoc-plugins/classes/Sequel/Dataset.html#method-i-paginate
# File lib/sequel/extensions/pagination.rb, line 11
def paginate(page_no, page_size, record_count=nil)
  raise(Error, "You cannot paginate a dataset that already has a limit") if @opts[:limit]
  paginated = limit(page_size, (page_no - 1) * page_size)
  paginated.extend(Pagination)
  paginated.set_pagination_info(page_no, page_size, record_count || count)
end
ActiveRecord actually has an almost transparent batch mode:
User.find_each do |user|
  NewsLetter.weekly_deliver(user)
end
This code works faster than find_in_batches in ActiveRecord:
id_max = table.get { max(:id) }
id_min = table.get { min(:id) }
n = 1000
(0..(id_max - id_min) / n).each do |i|
  table.where { (id >= id_min + n * i) & (id < id_min + n * (i + 1)) }.each { |row| }
end
Maybe you can consider Ohm, which is based on the Redis NoSQL store.
Let me set the stage: My application deals with gift cards. When we create cards they have to have a unique string that the user can use to redeem it with. So when someone orders our gift cards, like a retailer, we need to make a lot of new card objects and store them in the DB.
With that in mind, I'm trying to see how quickly I can have my application generate 100,000 cards. Database expert I am not, so I need someone to explain this little phenomenon: when I create 1,000 cards, it takes 5 seconds. When I create 100,000 cards, it should take 500 seconds, right?
Now I know what you want to see, the card creation method I'm using, since the first assumption would be that it's getting slower as it goes along because it's checking the uniqueness of more and more cards. But I can show you my rake task:
desc "Creates cards for a retailer"
task :order_cards, [:number_of_cards, :value, :retailer_name] => :environment do |t, args|
t = Time.now
puts "Searching for retailer"
#retailer = Retailer.find_by_name(args[:retailer_name])
puts "Retailer found"
puts "Generating codes"
value = args[:value].to_i
number_of_cards = args[:number_of_cards].to_i
codes = []
top_off_codes(codes, number_of_cards)
while codes != codes.uniq
codes.uniq!
top_off_codes(codes, number_of_cards)
end
stored_codes = Card.all.collect do |c|
c.code
end
while codes != (codes - stored_codes)
codes -= stored_codes
top_off_codes(codes, number_of_cards)
end
puts "Codes are unique and generated"
puts "Creating bundle"
#bundle = #retailer.bundles.create!(:value => value)
puts "Bundle created"
puts "Creating cards"
#bundle.transaction do
codes.each do |code|
#bundle.cards.create!(:code => code)
end
end
puts "Cards generated in #{Time.now - t}s"
end
def top_off_codes(codes, intended_number)
(intended_number - codes.size).times do
codes << ReadableRandom.get(CODE_LENGTH)
end
end
I'm using a gem called readable_random for the unique code. If you read through all of that code, you'll see that it does all of its uniqueness testing before it ever starts creating cards. It also writes status updates to the screen while it's running, and it always sits for a while at the card-creation step, while it flies through the uniqueness tests.
So my question to the Stack Overflow community is: why is my database slowing down as I add more cards? Why is this not a linear function with regard to time per card? I'm sure the answer is simple and I'm just a moron who knows nothing about data storage. And if anyone has any suggestions, how would you optimize this method, and how fast do you think you could get it to create 100,000 cards?
(When I plotted my times on a graph and did a quick curve fit to get a line formula, I calculated how long it would take to create 100,000 cards with my current code, and it came out to about 5.5 hours. That may be completely wrong, I'm not sure, but if it stays on the curve I fitted, it would be right around there.)
Not an answer to your question, but a couple of suggestions on how to make the insert faster:
Use Ruby's Hash to eliminate duplicates: using your card codes as hash keys, add them to a hash until it grows to the desired size. You can also use the Set class instead (though I doubt it's any faster than Hash); see the sketch after these suggestions.
Use a bulk insert into the database instead of a series of INSERT queries. Most DBMSs offer this possibility: create a text file with the new records and tell the database to import it (for example, MySQL's LOAD DATA INFILE and PostgreSQL's COPY).
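A minimal sketch of the first suggestion, reusing ReadableRandom and CODE_LENGTH from the question (the helper name generate_unique_codes is my own):
require 'set'

# Keep generating codes until we have the requested number of unique ones.
# Set membership checks are cheap, so duplicates are dropped as they appear.
def generate_unique_codes(count)
  codes = Set.new
  codes << ReadableRandom.get(CODE_LENGTH) while codes.size < count
  codes.to_a
end

codes = generate_unique_codes(number_of_cards)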
My first thoughts would be around transactions: if you have 100,000 pending changes waiting to be committed in the transaction, that would slow things down a little, but any decent DB should be able to handle that.
What DB are you using?
What indexes are in place?
Any DB optimisations, e.g. clustered tables/indexes?
Not sure of the Ruby transaction support: is the @bundle.transaction line something from ActiveModel or another library you are using?
I have a collection of users:
users = User.all()
I want to pass a subset of the user collection to a method.
Each subset should contain 1000 items (or less on the last iteration).
some_method(users)
So, say users has 9500 items in it: I want to call some_method 10 times, passing 1000 items nine times and 500 the last time.
You can use the Enumerable#each_slice method:
User.all.each_slice(1000) do |subarray|
  some_method subarray
end
but that would first pull all the records from the database.
However, I guess you could make something like this:
def ar_each_slice(scope, size)
  (scope.count.to_f / size).ceil.times do |i|
    yield scope.scoped(:offset => i * size, :limit => size)
  end
end
and use it as in:
ar_each_slice(User.scoped, 1000) do |slice|
  some_method slice.all
end
It will first get the number of records (using COUNT), then fetch them 1000 at a time using OFFSET and LIMIT, passing each slice to your block.
Since Rails 2.3 one can specify batch_size:
User.find_in_batches(:batch_size => 1000) do |users|
  some_method(users)
end
In this case, the framework will run a select query for every 1000 records. It keeps memory usage low if you are processing a large number of records.
I think you should divide the collection into subsets manually.
For example,
some_method(users[0..999])
I forgot about using :batch_size but Chandra suggested it. That's the right way to go.
Using .all will ask the database to retrieve all records, passing them to Ruby to hold and then iterate over internally. That is a really bad way to handle it if your database will be growing, because the glob of records will make the DBM work harder as it grows, and Ruby will have to allocate more and more space to hold them. Your response time will grow as a result.
A better solution is to use the :limit and :offset options to tell the DBM to successively find the first 1000 records at offset 0, then the next 1000 records at offset 1000, and so on. Keep looping until there are no more records.
You can determine how many times you'll have to loop by doing a .count before you begin asking, which is really fast unless your where-clause is beastly, or simply loop until you get no records back.
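A minimal sketch of that loop, using the hash-style finder options seen elsewhere in this thread (the batch size of 1000 is arbitrary):
batch_size = 1000
offset = 0

loop do
  # Ask the database for the next slice instead of holding everything in Ruby.
  users = User.all(:limit => batch_size, :offset => offset)
  break if users.empty?

  some_method(users)
  offset += batch_size
end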