The following code returns a unique 3-character code by repeatedly checking whether the generated code already exists in the database. Once it finds one that does not exist, the loop exits.
How can I protect against race conditions which could lead to non-unique codes being returned?
pubcode = Pubcode.find_by_pub_id(current_pub.id)
new_id = nil
begin
  new_id = SecureRandom.hex(2)[0..2]
  old_id = Pubcode.find_by_guid(new_id)
  # Save only when no existing record already uses this code
  if old_id.nil?
    pubcode.guid = new_id
    pubcode.save
  end
end while (old_id)
Don't use the database as a synchronization point. Apart from synchronization issues, your code is susceptible to slowdown as the number of available codes shrinks. There is no guarantee your loop would terminate.
A far better approach to this would be to have a service which pre-generates a batch of unique identifiers and hands these out on a first-come, first-served basis.
Given that you are only using 3 hexadecimal characters for this code, you can only store 16^3 = 4096 records. You could generate the entire list of three-character codes up front and remove entries from this list as you allocate them.
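A minimal sketch of the pre-generated list idea, assuming the codes are the 3-character hex strings produced by the question's SecureRandom.hex(2)[0..2]. The CodePool class and its method names are hypothetical, and the pool lives in memory in a single process; a real service would need to persist which codes have been handed out.

```ruby
# Hypothetical in-memory pool holding every 3-character hex code (16**3 = 4096).
class CodePool
  def initialize
    # "000".."fff", shuffled so codes are handed out in random order
    @codes = (0...16**3).map { |n| format('%03x', n) }.shuffle
    @lock = Mutex.new
  end

  # Returns a code that has not been handed out yet,
  # or nil once all 4096 codes are taken.
  def take
    @lock.synchronize { @codes.pop }
  end

  # Number of codes still available.
  def remaining
    @lock.synchronize { @codes.size }
  end
end

pool = CodePool.new
code = pool.take # a unique 3-character string
```

Because every code is removed from the list when it is handed out, allocation never loops and it is obvious when the pool is exhausted.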
You can add a unique index on the database column and then simply try to update the Pubcode with a random code. If that fails because of the unique index, just try another code:
pubcode = Pubcode.find_by_pub_id!(current_pub.id)
begin
  pubcode.update!(guid: SecureRandom.hex(2)[0..2])
rescue ActiveRecord::RecordNotUnique
  # The unique index rejected a duplicate code; try a different one
  retry
end
Perhaps you want to count the number of retries and raise the exception if there was no code found within a certain number of tries (because there are only 4096 possible ids).
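Counting the retries could look roughly like this. It is only a sketch: assign_unique_code, MAX_ATTEMPTS and DuplicateCode are made-up names, and DuplicateCode stands in for the database's uniqueness error so the example runs without Rails.

```ruby
require 'securerandom'

# Hypothetical stand-in for the uniqueness error raised by the database layer.
class DuplicateCode < StandardError; end

MAX_ATTEMPTS = 10 # assumed limit, tune to taste

# Yields random 3-character hex codes to the block until the block
# succeeds. Re-raises the error once max_attempts have been used up.
def assign_unique_code(max_attempts: MAX_ATTEMPTS, error_class: DuplicateCode)
  attempts = 0
  begin
    attempts += 1
    code = SecureRandom.hex(2)[0..2]
    yield code
    code
  rescue error_class
    retry if attempts < max_attempts
    raise
  end
end

# Example with an in-memory list playing the role of the unique index:
taken = %w[abc]
code = assign_unique_code do |c|
  raise DuplicateCode if taken.include?(c)
  taken << c
end
```

In a Rails app the block would presumably be the update! call from above, with error_class set to the exception the unique index raises.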
Preventing a race between threads (within a single Ruby process) is done by wrapping the critical section in a mutex:
@mutex = Mutex.new
Then, within the method that calls your code:
@mutex.synchronize do
  # Whatever process you want to protect from races
end
But a problem with your approach is that your loop may never end since you are using randomness.
I have a slow sql query, and I'd like to limit the number of times a function can be called in one second to (3).
Let's say I have these calls; I'd like the function to behave like this:
call_func() -> true
call_func() -> true
call_func() -> true
call_func() -> false
sleep(1)
call_func() -> true
I placed a limit of up to 3 calls per second, after which the timer is reset. How would you do this in Ruby using Process.clock_gettime(Process::CLOCK_MONOTONIC)?
https://msp-greg.github.io/eventmachine/EventMachine/PeriodicTimer.html
Your code would look something like this:
n = 0
timer = EventMachine::PeriodicTimer.new(1) do
  call_func()
  timer.cancel if (n+=1) > 3
end
Though in truth I think the overhead of this alone will eat into some of the 1 second that you are allocating for the SQL query, so you might end up with a situation where the query only executes 1 or 2 times within the second allocated to the timer, and I couldn't begin to predict the resource increase (memory, etc.) that adding this would create.
You may be better off simply using Threads to handle the issue. You would simply create a Thread for each query and whenever it completes, it completes. Then you could restrict the Thread pool to manage the load.
https://rossta.net/blog/a-ruby-antihero-thread-pool.html
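For comparison, here is a sketch that follows the question's own suggestion of using Process.clock_gettime(Process::CLOCK_MONOTONIC) directly. It is a simple fixed-window limiter; the RateLimiter class and its parameter names are assumptions made for illustration, and the clock is injectable only so the behaviour can be checked without sleeping.

```ruby
# Hypothetical fixed-window limiter: at most max_calls per window of
# per_seconds, measured with the monotonic clock.
class RateLimiter
  def initialize(max_calls:, per_seconds:,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @max_calls = max_calls
    @per_seconds = per_seconds
    @clock = clock
    @window_start = nil
    @count = 0
    @lock = Mutex.new
  end

  # Returns true if the call is allowed, false if the limit for the
  # current window has already been reached.
  def allow?
    @lock.synchronize do
      now = @clock.call
      # Start a new window on the first call or once the old one has expired
      if @window_start.nil? || now - @window_start >= @per_seconds
        @window_start = now
        @count = 0
      end
      return false if @count >= @max_calls
      @count += 1
      true
    end
  end
end

limiter = RateLimiter.new(max_calls: 3, per_seconds: 1)
limiter.allow? # call the slow query only when this returns true
```

The window starts at the first allowed call, so three calls pass, the fourth is rejected, and calls pass again once a second has elapsed since the window opened.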
I'm struggling with locking a PostgreSQL table I'm working on. Ideally I want to lock the entire table, but individual rows will do as long as they actually work.
I have several concurrent ruby scripts that all query a central jobs database on AWS (via a DatabaseAccessor class), find a job that hasn't yet been started, change the status to started and carry it out. The problem is, since these are all running at once, they'll typically all find the same unstarted job at once, and begin carrying it out, wasting time and muddying the results.
I've tried a bunch of things, .lock, .transaction, the fatalistic gem but they don't seem to be working, at least, not in pry.
My code is as follows:
class DatabaseAccessor
  require 'pg'
  require 'pry'
  require 'active_record'
  class Jobs < ActiveRecord::Base
    enum status: [ :unstarted, :started, :slow, :completed]
  end
  def initialize(db_credentials)
    ActiveRecord::Base.establish_connection(
      adapter: db_credentials[:adapter],
      database: db_credentials[:database],
      username: db_credentials[:username],
      password: db_credentials[:password],
      host: db_credentials[:host]
    )
  end
  def find_unstarted_job
    # .first returns a single record (or nil) rather than a relation
    job = Jobs.where(status: 0).first
    job&.started!
    job
  end
end
Does anyone have any suggestions?
EDIT: It seems that LOCK TABLE jobs IN ACCESS EXCLUSIVE MODE; is the way to do this - however, I'm struggling with then returning the results of this after updating. RETURNING * will return the results after an update, but not inside a transaction.
SOLVED!
So the key here is locking in Postgres. There are a few different table-level locks, detailed here.
There are three factors here in making a decision:
Reads aren't thread safe. Two threads reading the same record will result in that job being run multiple times at once.
Records are only updated once (to be marked as completed) and created, other than the initial read and update to being started. Scripts that create new records will not read the table.
Reading varies in frequency. Waiting for an unlock is non-critical.
Given these factors, if there were a read-lock that still allowed writes, this would be acceptable, however, there isn't, so ACCESS EXCLUSIVE is our best option.
Given this, how do we deal with locking? A hunt through the ActiveRecord documentation gives no mention of it.
Thankfully, other methods to deal with PostgreSQL exist, namely the ruby-pg gem. A bit of a play with SQL later, and a test of locking, and I get the following method:
def converter
  result_hash = {}
  conn = PG::Connection.open(:dbname => 'my_db')
  conn.exec("BEGIN WORK;
             LOCK TABLE jobs IN ACCESS EXCLUSIVE MODE;")
  conn.exec("UPDATE jobs SET status = 1 WHERE id =
             (SELECT id FROM jobs WHERE status = 0 ORDER BY ID LIMIT 1)
             RETURNING *;") do |result|
    result.each { |row| result_hash = row }
  end
  conn.exec("COMMIT WORK;")
  result_hash.transform_keys!(&:to_sym)
end
This will result in:
An output of an empty hash if there are no jobs with a status of 0
An output of a symbolized hash if one is found and updated
Sleeping if the database is currently locked, before returning the above once unlocked.
The table will remain locked until the COMMIT WORK statement.
As an aside, I wish there was a cleaner way to convert the result to a hash. If anyone has any suggestions, please let me know in the comments! :)
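One possibly cleaner way to do the conversion, sketched under the assumption that the result object is Enumerable and yields one string-keyed hash per row (which is how PG::Result behaves). The helper name first_row_as_symbol_hash is made up, and a plain array stands in for the query result so the snippet runs without a database.

```ruby
# Takes any Enumerable of string-keyed row hashes and returns the first
# row with symbol keys, or an empty hash when there are no rows.
def first_row_as_symbol_hash(rows)
  (rows.first || {}).transform_keys(&:to_sym)
end

# A plain array standing in for the result of the UPDATE ... RETURNING query
rows = [{ 'id' => '7', 'status' => '1' }]
job = first_row_as_symbol_hash(rows)
```

This avoids the mutable result_hash variable and the each loop, since only one row is ever returned.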
I'm writing a worker to add lots of users to a group. I'm wondering whether it's better to run one big task that handles all users, to batch them (say, 100 users per task), or to run one task per user.
For the moment here is my code
class AddUsersToGroupWorker
  include Sidekiq::Worker
  sidekiq_options :queue => :group_utility
  def perform(store_id, group_id, user_ids_to_add)
    begin
      store = Store.find store_id
      group = Group.find group_id
    rescue ActiveRecord::RecordNotFound => e
      Airbrake.notify e
      return
    end
    users_to_process = store.users.where(id: user_ids_to_add)
                            .where.not(id: group.user_ids)
    group.users += users_to_process
    users_to_process.map(&:id).each do |user_to_process_id|
      UpdateLastUpdatesForUserWorker.perform_async store.id, user_to_process_id
    end
  end
end
Maybe it's better to have something like this in my method:
def add_users
  users_to_process = store.users.where(id: user_ids_to_add)
                          .where.not(id: group.user_ids)
  users_to_process.map(&:id).each do |user_to_process_id|
    AddUserToGroupWorker.perform_async group_id, user_to_process_id
    UpdateLastUpdatesForUserWorker.perform_async store.id, user_to_process_id
  end
end
But that means so many find requests. What do you think?
I have a Sidekiq Pro licence if needed (for batches, for example).
Here are my thoughts.
1. Do a single SQL query instead of N queries
This line: group.users += users_to_process is likely to produce N SQL queries (where N is users_to_process.count). I assume that you have a many-to-many association between users and groups (with a user_groups join table/model), so you should use a mass-insert technique:
users_to_process_ids = store.users.where(id: user_ids_to_add)
                            .where.not(id: group.user_ids)
                            .pluck(:id)
sql_values = users_to_process_ids.map { |i| "(#{i.to_i}, #{group.id.to_i}, NOW(), NOW())" }
Group.connection.execute("
  INSERT INTO groups_users (user_id, group_id, created_at, updated_at)
  VALUES #{sql_values.join(",")}
")
Yes, it's raw SQL. And it's fast.
2. Use pluck(:id) instead of map(&:id)
pluck is much quicker, because:
It will select only 'id' column, so less data is transferred from DB
More importantly, it won't create an ActiveRecord object for each row
Doing SQL is cheap. Creating Ruby objects is really expensive.
3. Use horizontal parallelization instead of vertical parallelization
What I mean here is that if you need to do sequential tasks A -> B -> C for a dozen records, there are two major ways to split the work:
Vertical segmentation. AWorker does A(1), A(2), A(3); BWorker does B(1), etc.; CWorker does all C(i) jobs;
Horizontal segmentation. UniversalWorker does A(1)+B(1)+C(1).
Use the latter (horizontal) way.
It's a statement from experience, not from some theoretical point of view (where both ways are feasible).
Why should you do that?
When you use vertical segmentation, you will likely get errors when you pass a job from one worker down to another. You will pull your hair out if you bump into such errors, because they are neither persistent nor easily reproducible. Sometimes they happen and sometimes they don't. Is it possible to write code that passes the work down the chain without errors? Sure, it is. But it's better to keep it simple.
Imagine that your server is at rest. And then suddenly new jobs arrive. Your B and C workers will just waste the RAM, while your A workers do the job. And then your A and C will waste the RAM, while B's are at work. And so on. If you make horizontal segmentation, your resource drain will even itself out.
Applying that advice to your specific case: for starters, don't call perform_async in another async task.
4. Process in batches
Answering your original question – yes, do process in batches. Creating and managing an async task takes some resources by itself, so there's no need to create too many of them.
TL;DR So in the end, your code could look something like this:
# model code
BATCH_SIZE = 100
def add_users
  users_to_process_ids = store.users.where(id: user_ids_to_add)
                              .where.not(id: group.user_ids)
                              .pluck(:id)
  # An empty VALUES list would be invalid SQL, so stop early
  return if users_to_process_ids.empty?
  # With 100,000 users the performance of this query should be acceptable
  # enough to run it synchronously
  sql_values = users_to_process_ids.map { |i| "(#{i.to_i}, #{group.id.to_i}, NOW(), NOW())" }
  Group.connection.execute("
    INSERT INTO groups_users (user_id, group_id, created_at, updated_at)
    VALUES #{sql_values.join(",")}
  ")
  users_to_process_ids.each_slice(BATCH_SIZE) do |batch|
    AddUserToGroupWorker.perform_async store.id, group.id, batch
  end
end
# add_user_to_group_worker.rb
def perform(store_id, group_id, user_ids_to_add)
  group = Group.find group_id
  # Do some heavy load with a batch as a whole
  # ...
  # ...
  # If nothing here is left, call UpdateLastUpdatesForUserWorker from the model instead
  user_ids_to_add.each do |id|
    # do it synchronously – we already parallelized the job
    # by splitting it in slices in the model above
    UpdateLastUpdatesForUserWorker.new.perform store_id, id
  end
end
There's no silver bullet. It depends on your goals and your application. General questions to ask yourself:
How many user ids could you pass to a worker? Is it possible to pass 100? What about 1000000?
How long can your workers run? Should there be any restrictions on their working time? Can they get stuck?
For big applications it's necessary to split the passed arguments into smaller chunks, to avoid creating long-running jobs. Creating a lot of small jobs allows you to scale easily - you can always add more workers.
It might also be a good idea to define some kind of timeout for workers, so that stuck workers stop processing.
Is it possible to troubleshoot HBase batch puts? I'm using HBase batch puts of 5000 records at a time, and I would like to, on put failure, find out which row or rows is causing a problem and to log it.
The method HTable.batch(List actions) receives a list of Puts and returns an array of the same size as the actions list (the list of puts you gave to the function). If action i failed, then result[i] will be null.
Please note that when the failure inside batch() is due to maximum number of attempts to write, you need to catch RetriesExhaustedWithDetailsException, and call getExceptions(), to get the array which contains the mapping of the error to the put causing it.
See code here
I need to perform long-running operation in ruby/rails asynchronously.
Googling around one of the options I find is Sidekiq.
class WeeklyReportWorker
  include Sidekiq::Worker
  def perform(user, product, year = Time.now.year, week = Date.today.cweek)
    report = WeeklyReport.build(user, product, year, week)
    report.save
  end
end
# call WeeklyReportWorker.perform_async('user', 'product')
Everything works great! But there is a problem.
If I keep calling this async method every few seconds while the heavy operation itself takes about a minute to run, things won't work.
Let me put it in example.
5.times { WeeklyReportWorker.perform_async('user', 'product') }
Now my heavy operation will be performed 5 times. Ideally it should be performed only once or twice, depending on whether execution of the first operation started before the 5th async call was made.
Do you have tips how to solve it?
Here's a naive approach. I'm a resque user, maybe sidekiq has something better to offer.
def perform(user, product, year = Time.now.year, week = Date.today.cweek)
  # first, make a name for lock key. For example, include all arguments
  # there, so that another perform with the same arguments won't do any work
  # while the first one is still running
  lock_key_name = make_lock_key_name(user, product, year, week)
  Sidekiq.redis do |redis| # sidekiq uses redis, let us leverage that
    # protection from race condition. Since incr is atomic, the very first
    # one will set value to 1. All subsequent incrs will return greater values.
    # if incr returned not 1, then another copy of this operation is
    # already running, so we quit without touching its lock.
    return if redis.incr(lock_key_name) != 1
    begin
      # finally, perform your business logic here
      report = WeeklyReport.build(user, product, year, week)
      report.save
    ensure
      # drop lock key, so that operation may run again. This runs only in
      # the copy that acquired the lock.
      redis.del lock_key_name
    end
  end
end
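The make_lock_key_name helper is not defined in the snippet above. One way it might look, purely as an assumption (the "lock:weekly_report" prefix is invented for illustration), is to join the arguments into a single Redis key:

```ruby
# Hypothetical helper: builds one lock key per distinct set of arguments,
# so only jobs with identical arguments block each other.
def make_lock_key_name(user, product, year, week)
  "lock:weekly_report:#{[user, product, year, week].join(':')}"
end

key = make_lock_key_name('user', 'product', 2024, 7)
```

Including every argument in the key means reports for different users, products or weeks can still run at the same time.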
I am not sure I understood your scenario well, but how about looking at this gem:
https://github.com/collectiveidea/delayed_job
So instead of doing:
5.times { WeeklyReportWorker.perform_async('user', 'product') }
You can do:
5.times { WeeklyReportWorker.delay.perform('user', 'product') }
Out of the box, this will make the worker process the second job only after the first job has finished, but only if you use the default settings (because by default there is only one worker process).
The gem offers possibilities to:
Put jobs on a queue;
Have different queues for different jobs if that is required;
Have more than one worker process a queue (for example, you can start 4 workers on a 4-CPU machine for higher efficiency);
Schedule jobs to run at exact times, or after set amount of time after queueing the job. (Or, by default, schedule for immediate background execution).
I hope it helps you as much as it helped me.