Creating thousands of records in Rails

Let me set the stage: my application deals with gift cards. When we create cards, each one has to have a unique string that the user can redeem it with. So when someone such as a retailer orders our gift cards, we need to make a lot of new Card objects and store them in the DB.
With that in mind, I'm trying to see how quickly I can have my application generate 100,000 Cards. Database expert, I am not, so I need someone to explain this little phenomenon: when I create 1,000 Cards, it takes 5 seconds. So when I create 100,000 Cards it should take 500 seconds, right?
Now I know what you're going to want to see: the card creation method I'm using, because the first assumption would be that it's getting slower since it's checking the uniqueness of more and more cards as it goes along. So here is my rake task:
desc "Creates cards for a retailer"
task :order_cards, [:number_of_cards, :value, :retailer_name] => :environment do |t, args|
t = Time.now
puts "Searching for retailer"
#retailer = Retailer.find_by_name(args[:retailer_name])
puts "Retailer found"
puts "Generating codes"
value = args[:value].to_i
number_of_cards = args[:number_of_cards].to_i
codes = []
top_off_codes(codes, number_of_cards)
while codes != codes.uniq
codes.uniq!
top_off_codes(codes, number_of_cards)
end
stored_codes = Card.all.collect do |c|
c.code
end
while codes != (codes - stored_codes)
codes -= stored_codes
top_off_codes(codes, number_of_cards)
end
puts "Codes are unique and generated"
puts "Creating bundle"
#bundle = #retailer.bundles.create!(:value => value)
puts "Bundle created"
puts "Creating cards"
#bundle.transaction do
codes.each do |code|
#bundle.cards.create!(:code => code)
end
end
puts "Cards generated in #{Time.now - t}s"
end
def top_off_codes(codes, intended_number)
(intended_number - codes.size).times do
codes << ReadableRandom.get(CODE_LENGTH)
end
end
I'm using a gem called readable_random for the unique code. If you read through all of that code, you'll see that it does all of its uniqueness testing before it ever starts creating cards. It also writes status updates to the screen while it's running, and it always sits for a while at the card-creation step, while it flies through the uniqueness tests. So my question to the Stack Overflow community is: why is my database slowing down as I add more cards? Why is this not a linear function in terms of time per card? I'm sure the answer is simple and I'm just a moron who knows nothing about data storage. If anyone has any suggestions, how would you optimize this method, and how fast do you think you could get it to create 100,000 cards?
(When I plotted my times on a graph and did a quick curve fit to get my line formula, I calculated how long it would take to create 100,000 cards with my current code: about 5.5 hours. That may be completely wrong, I'm not sure, but if it stays on the curve I fitted, it would be right around there.)

Not an answer to your question, but a couple of suggestions on how to make the inserts faster:
Use a Ruby Hash to eliminate duplicates: use your card codes as hash keys and keep adding them until the hash grows to the desired size. You could also use the Set class instead (but I doubt it's any faster than Hash).
Use a bulk insert into the database instead of a series of INSERT queries. Most DBMSs offer this possibility: create a text file with the new records and tell the database to import it (LOAD DATA INFILE in MySQL, COPY in PostgreSQL).
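To make those two suggestions concrete, here is a minimal sketch (my own illustration, not the poster's code) that dedupes the codes with a Set and inserts them in one statement; it assumes a Rails version with ActiveRecord's insert_all (6.0+) and a bundle_id column on cards, both of which are assumptions:
require "set"

codes = Set.new
# A Set silently ignores duplicates, so keep generating until it is big enough.
codes << ReadableRandom.get(CODE_LENGTH) while codes.size < number_of_cards

# Drop any codes already taken in the DB, checking only the generated codes
# rather than loading Card.all (top the set back up if this removes any).
codes.subtract(Card.where(code: codes.to_a).pluck(:code))

# One multi-row INSERT instead of 100,000 separate statements.
Card.insert_all(codes.map { |code| { code: code, bundle_id: @bundle.id } })
Note that insert_all bypasses validations and callbacks, so a unique index on code is the real uniqueness guarantee here.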

My first thoughts would be around transactions: if you have 100,000 pending changes waiting to be committed in one transaction, that would slow things down a little, but any decent DB should be able to handle it.
What DB are you using?
What indexes are in place?
Any DB optimisations, e.g. clustered tables/indexes?
I'm not sure about Ruby's transaction support: is the @bundle.transaction line something from ActiveModel or another library you are using?
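For reference, here is a minimal sketch of the single-transaction idea under discussion, assuming plain ActiveRecord (my illustration, not the poster's code); wrapping all the creates in one transaction means the database commits once rather than once per row:
# Hypothetical sketch: one COMMIT for the whole batch instead of one per insert.
Card.transaction do
  codes.each do |code|
    Card.create!(:code => code)
  end
end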

Related

Algorithm in Ruby to trigger a method based on presence of certain texts

I don't know if it can be called an algorithm, but I think it's close.
I will be pulling data from an API that will have certain words in the title, e.g.:
Great Software 2.0 Download Now
Buy Great Software for just $10
Great Software Torrent Download
So, I want to do different things based on the presence of certain words such as Download, Buy, etc. For example, if the title has the word 'buy' in it, I would like to extract the word buy and the amount present in the title and show it in another div, so in this case it would be "Buy for $10" or "Buy $10", etc. I could do if/else, but I don't want to, because there could be more such conditions in the future. So what I am thinking about is using the send method, e.g.:
def buy(string)
  'Buy for just' + string.scan(/\$\d+/).first
end

def whichkeyword(title)
  send (title.scan(/(download|buy)/i)[0][0]).downcase.to_sym, title
end

whichkeyword('Buy this software for $10 now')
Is there a better way to do this? Or is this even a good way to do it? Any help would be appreciated.
First of all, use send only if you need to call a private method; use public_send otherwise.
In this particular case metaprogramming is overkill. It requires too much redundant code, plus it requires the code to be changed for new items. I would go with building a hash like:
@hash = { 'buy' => { text: 'Buy for just %{placeholder}', re: /\$\d+/ } }
This hash might be placed somewhere outside of the code; e.g. it might be stored in a YAML file near the code and loaded in advance. That way you could change the behaviour without modifying the code, which is handy, for instance, in a gem.
With the hash defined/loaded, I would declare the method:
def format string
  key = string[/#{Regexp.union(@hash.keys).source}/i].downcase
  puts @hash[key][:text] % { placeholder: string[@hash[key][:re]] }
end
Yielding:
▶ format("Buy this software for $10 now")
#⇒ Buy for just $10
There are many advantages over declaring methods; e.g. matches may now contain spaces, you can easily add/remove matchers, etc.
First of all, your algorithm can work, but it has some troubles in it, such as: what happens if no keyword matches?
I have two solutions for you:
NLP
If you want it to be much more dynamic, you can use NLP (Natural Language Processing). NLP will find the main words in your sentence, and then you can choose the right action for each.
A good gem for that is Treat, which you can use with stanford-core-nlp. After processing the data you can find the verbs and even synonyms in the sentence and figure out what to do.
sentence('Buy this software for $10 now').verbs # ['buy']
Simple Hash
This solution is less dynamic, but much simpler. Like you did with the scan, just use a constant to manage your keywords and the output for each (I would do it with lambdas). You can also add a default to the hash:
# String keys so the downcased match can be used for lookup directly.
KEYWORDS = Hash.new('Default Title').merge(
  'buy'      => -> { },
  'download' => -> { }
)

KEYWORDS[sentence[/(#{KEYWORDS.keys.join('|')})/i].downcase]
I think this solution is good enough.
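To make that concrete, here is a small filled-in version of the hash-of-lambdas idea; the lambda bodies, titles, and the handle name are my own invention for illustration:
KEYWORDS = Hash.new(->(_title) { 'Default Title' }).merge(
  'buy'      => ->(title)  { "Buy for just #{title[/\$\d+/]}" },
  'download' => ->(_title) { 'Download Now' }
)

def handle(title)
  # Find which keyword (if any) appears in the title, then call its lambda.
  key = title[/#{KEYWORDS.keys.join('|')}/i]&.downcase
  KEYWORDS[key].call(title)
end

handle('Buy this software for $10 now')   # => "Buy for just $10"
handle('Great Software Torrent Download') # => "Download Now"
handle('Great Software 2.0')              # => "Default Title"
This keeps the dispatch table in data, so adding a new keyword is just another hash entry.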
The only thing that looks strange to me is scan(/(download|buy)/i)[0][0]. Personally, I don't much like that [] indexing, and I don't think scan is necessary here. What about:
def whichkeyword(title)
  title =~ /(download|buy)/i
  send $1.downcase.to_sym, title unless $1.nil?
end
UPDATE
def whichkeyword(title)
  action = title[/(download|buy)/i]
  public_send action.downcase.to_sym, title if action
end

What's the 'Rails 4 Way' of finding some number of random records?

User.find(:all, :order => "RANDOM()", :limit => 10) was the way I did it in Rails 3.
User.all(:order => "RANDOM()", :limit => 10) is how I thought Rails 4 would do it, but this is still giving me a Deprecation warning:
DEPRECATION WARNING: Relation#all is deprecated. If you want to eager-load a relation, you can call #load (e.g. `Post.where(published: true).load`). If you want to get an array of records from a relation, you can call #to_a (e.g. `Post.where(published: true).to_a`).
You'll want to use the order and limit methods instead. You can get rid of the all.
For PostgreSQL and SQLite:
User.order("RANDOM()").limit(10)
Or for MySQL:
User.order("RAND()").limit(10)
As the random function differs between databases, I would recommend using the following code:
User.offset(rand(User.count)).first
Of course, this is useful only if you're looking for just one record.
If you want to get more than one, you could do something like:
User.offset(rand(User.count) - 10).limit(10)
The - 10 is to ensure you get 10 records in case rand returns a number greater than count - 10.
Keep in mind you'll always get 10 consecutive records.
I think the best solution is really ordering randomly in the database.
But if you need to avoid the database-specific random function, you can use the pluck and shuffle approach.
For one record:
User.find(User.pluck(:id).shuffle.first)
For more than one record:
User.where(id: User.pluck(:id).sample(10))
I would suggest making this a scope as you can then chain it:
class User < ActiveRecord::Base
  scope :random, -> { order(Arel::Nodes::NamedFunction.new('RANDOM', [])) }
end

User.random.limit(10)
User.active.random.limit(10)
While not the fastest solution, I like the brevity of:
User.ids.sample(10)
The .ids method yields an array of User IDs and .sample(10) picks 10 random values from this array.
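If you need the actual records rather than just the IDs, a natural follow-up (my addition, not part of the original answer) is to feed the sampled IDs back into where:
# Load the users whose ids were sampled; note the result order is not itself random.
random_users = User.where(id: User.ids.sample(10))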
I strongly recommend this gem for random records; it is specially designed for tables with lots of data rows:
https://github.com/haopingfan/quick_random_records
All the other answers perform badly with a large database, except this gem:
quick_random_records only costs 4.6ms in total.
The accepted answer's User.order('RAND()').limit(10) costs 733.0ms.
The offset approach costs 245.4ms in total.
The User.all.sample(10) approach costs 573.4ms.
Note: my table only has 120,000 users. The more records you have, the bigger the performance difference will be.
UPDATE:
Performed on a table with 550,000 rows:
Model.where(id: Model.pluck(:id).sample(10)) costs 1384.0ms
gem quick_random_records only costs 6.4ms in total
For MySQL this worked for me:
User.order("RAND()").limit(10)
You could call .sample on the records, like: User.all.sample(10)
The answer from @maurimiranda, User.offset(rand(User.count)).first, is not good if we need to get 10 random records, because User.offset(rand(User.count) - 10).limit(10) returns a sequence of 10 records starting at a random position; they are not truly random. So we would need to call that function 10 times to get 10 truly random records.
Besides that, offset is also not good if the random function returns a high value. If your query looks like offset: 10000 and limit: 20, it generates 10,020 rows and throws away the first 10,000 of them,
which is very expensive. So calling offset/limit 10 times is not efficient.
So I think that if we just want one random user, User.offset(rand(User.count)).first may be better (at least we can improve it by caching User.count).
But if we want 10 or more random users, User.order("RAND()").limit(10) should be better.
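For completeness, here is a quick sketch of the "call it 10 times" idea from the first paragraph (my illustration; it caches the count once, but duplicate picks are possible):
count = User.count
# Ten separate single-row queries, each starting at a random offset.
random_users = Array.new(10) { User.offset(rand(count)).first }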
Here's a quick solution. I'm currently using it with over 1.5 million records and getting decent performance. The best solution would be to cache one or more random record sets and then refresh them with a background worker at a desired interval (see the sketch at the end of this answer).
I created a random_records_helper.rb file:
module RandomRecordsHelper
  def random_user_ids(n)
    user_ids = []
    user_count = User.count
    # NOTE: this assumes ids run contiguously from 1..count with no gaps.
    n.times { user_ids << rand(1..user_count) }
    return user_ids
  end
end
In the controller:
@users = User.where(id: random_user_ids(10))
This is much quicker than the .order("RANDOM()").limit(10) method: I went from a 13-second load time down to 500ms.
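Here is a rough sketch of the caching idea mentioned at the top of this answer; the cache key, the 10-minute interval, and the helper name are my own assumptions:
# Serve a cached random ID set, regenerating it at most every 10 minutes.
def cached_random_user_ids
  Rails.cache.fetch('random_user_ids', expires_in: 10.minutes) do
    User.ids.sample(10)
  end
end

@users = User.where(id: cached_random_user_ids)
A background worker could instead call Rails.cache.write with a fresh sample on whatever schedule suits, so no request ever pays the regeneration cost.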

RhoMobile 13,000 inserts causing issues due to time

I have a problem (due to time) when inserting around 13,000 records into the device's database.
Is there any way to optimize this? Is it possible to put all of these into one transaction? I believe it is currently creating one transaction per insert, which apparently has a diabolical effect on speed.
Currently this takes around 10 minutes, including converting the CSV to a hash (which doesn't seem to be the bottleneck).
Stupidly, I am not using RhoSync...
Thanks
Set up a transaction around the inserts and then only commit at the end.
From their FAQ.
http://docs.rhomobile.com/faq#how-can-i-seed-a-large-amount-of-data-into-my-application-with-rhom
db = ::Rho::RHO.get_src_db('Model')
db.start_transaction
begin
  items.each do |item|
    # create hash of attribute/value pairs
    data = {
      :field1 => item['value1'],
      :field2 => item['value2']
    }
    # Creates a new Model object and saves it
    new_item = Model.create(data)
  end
  db.commit
rescue
  db.rollback
end
I've found this technique to be a tremendous speed up.
Use fixed schema rather than property bag, and you can use one transaction (see the link below for how).
http://docs.rhomobile.com/rhodes/rhom#perfomance-tips
This question was answered by someone else on Google Groups (HAYAKAWA Takashi).

Best way to convert a Mongo query to a Ruby array?

Let's say I have a large query (for the purposes of this exercise say it returns 1M records) in MongoDB, like:
users = Users.where(:last_name => 'Smith')
If I loop through this result, working with each member, with something like:
users.each do |user|
  # Some manipulation to "user"
  # Some calculation for "user"
  ...
  # Saving "user"
end
I'll often get a Mongo cursor timeout (as the database cursor that is reserved exceeds the default timeout length). I know I can extend the cursor timeout, or even turn it off--but this isn't always the most efficient method. So, one way I get around this is to change the code to:
users = Users.where(:last_name => 'Smith')
user_array = []

users.each do |u|
  user_array << u
end
THEN, I can loop through user_array (since it's a Ruby array), doing manipulations and calculations, without worrying about a MongoDB timeout.
This works fine, but there has to be a better way--does anyone have a suggestion?
If your result set is so large that it causes cursor timeouts, it's not a good idea to load it entirely to RAM.
A common approach is to process records in batches.
Get 1000 users (sorted by _id).
Process them.
Get another batch of 1000 users where _id is greater than _id of last processed user.
Repeat until done.
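A minimal sketch of that batching loop, assuming the Mongoid ODM (which the Users.where syntax suggests); the batch size of 1000 is the one suggested above:
batch_size = 1000
last_id = nil

loop do
  criteria = Users.where(last_name: 'Smith').asc(:_id).limit(batch_size)
  criteria = criteria.where(:_id.gt => last_id) if last_id
  batch = criteria.to_a          # each query is short-lived, so no cursor timeout
  break if batch.empty?

  batch.each do |user|
    # manipulate / calculate / save "user"
  end

  last_id = batch.last.id
end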
For a long-running task, consider using rails runner.
runner runs Ruby code in the context of Rails non-interactively. For instance:
$ rails runner "Model.long_running_method"
For further details, see:
http://guides.rubyonrails.org/command_line.html

Are there any Ruby ORMs which use cursors or smart fetch?

I'm looking for a Ruby ORM to replace ActiveRecord. I've been looking at Sequel and DataMapper. They look pretty good, however neither of them seems to do something quite basic: not loading everything into memory when you don't need it.
I mean I've tried the following (or equivalent) with ActiveRecord and Sequel on a table with lots of rows:
posts.each { |p| puts p }
Both of them go crazy on memory. They seem to load everything into memory rather than fetching rows as needed. I used find_in_batches in ActiveRecord, but it's not an acceptable solution:
ActiveRecord is not an acceptable solution because we had too many problems with it.
Why should my code be aware of a paging mechanism? I'm happy to configure the page size somewhere, but that's it. With find_in_batches you need to do something like:
post.find_in_batches { |batch| batch.each { |p| puts p } }
But that should be transparent.
So is there somewhere a reliable Ruby ORM which does the fetch properly?
Update:
As Sergio mentioned, in Rails 3 you can use find_each, which is exactly what I want. However, since ActiveRecord is not an option (unless someone can really convince me to use it), the questions are:
Which ORMs support the equivalent of find_each?
How do you do it?
Why do we need a find_each at all, when find should do it, shouldn't it?
Sequel's Dataset#each does yield individual rows at a time, but most database drivers will load the entire result in memory first.
If you are using Sequel's Postgres adapter, you can choose to use real cursors:
posts.use_cursor.each{|p| puts p}
This fetches 1000 rows at a time by default, but you can use an option to specify the amount of rows to grab per cursor fetch:
posts.use_cursor(:rows_per_fetch=>100).each{|p| puts p}
If you aren't using Sequel's Postgres adapter, you can use Sequel's pagination extension:
Sequel.extension :pagination
posts.order(:id).each_page(1000){|ds| ds.each{|p| puts p}}
However, like ActiveRecord's find_in_batches/find_each, this does separate queries, so you need to be careful if there are concurrent modifications to the dataset you are retrieving.
The reason this isn't the default in Sequel is probably the same reason it isn't the default in ActiveRecord, which is that it isn't a good default in the general case. Only queries with large result sets really need to worry about it, and most queries don't return large result sets.
At least with the Postgres adapter cursor support, it's fairly easy to make it the default for your model:
Post.dataset = Post.dataset.use_cursor
For the pagination extension, you can't really do that, but you can wrap it in a method that makes it mostly transparent.
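Something along these lines could serve as that wrapper; this is only a sketch following the pagination usage shown above, and each_row is a made-up name:
Sequel.extension :pagination

module EachRow
  # Iterate over every row in id order, one page at a time,
  # without loading the whole result set into memory.
  def each_row(page_size = 1000, &block)
    order(:id).each_page(page_size) { |page| page.each(&block) }
  end
end

Sequel::Dataset.send(:include, EachRow)

posts.each_row { |p| puts p }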
Sequel.extension :pagination

posts.order(:id).each_page(1000) do |ds|
  ds.each { |p| puts p }
end
It is very very slow on large tables!
It becomes clear if you look at the method body:
http://sequel.rubyforge.org/rdoc-plugins/classes/Sequel/Dataset.html#method-i-paginate
# File lib/sequel/extensions/pagination.rb, line 11
def paginate(page_no, page_size, record_count=nil)
  raise(Error, "You cannot paginate a dataset that already has a limit") if @opts[:limit]
  paginated = limit(page_size, (page_no - 1) * page_size)
  paginated.extend(Pagination)
  paginated.set_pagination_info(page_no, page_size, record_count || count)
end
ActiveRecord actually has an almost transparent batch mode:
User.find_each do |user|
  NewsLetter.weekly_deliver(user)
end
This code works faster than find_in_batches in ActiveRecord
# Uses old Sequel core-extension syntax (:max[:id], :id >= ...).
id_max = table.get(:max[:id])
id_min = table.get(:min[:id])
n = 1000

(0..(id_max - id_min) / n).each do |i|
  table.filter(:id >= id_min + n * i, :id < id_min + n * (i + 1)).each { |row| }
end
Maybe you can consider Ohm, which is based on the Redis NoSQL store.
