I am using the redis-semaphore gem, version 0.3.1.
For some reason, I occasionally can't release a stale Redis lock. From my analysis it seems to happen if my Docker process crashed after the lock was created.
I have described my debugging process below and would like to know if anyone can suggest how to further debug.
Assume that we want to create a redis lock with this name:
name = "test"
We insert this variable in two different terminal windows. In the first, we run:
def lock_for_15_secs(name)
job = Redis::Semaphore.new(name.to_sym, redis: NonBlockingRedis.new(), custom_blpop: true, :stale_client_timeout => 15)
if job.lock(-1) == "0"
puts "Locked and starting"
sleep(15)
puts "Now it's stale, try to release in another process"
sleep(15)
puts "Now trying to unlock"
unlock = job.unlock
puts unlock == false ? "Wuhuu, already unlocked" : "Hm, should have been unlocked by another process, but wasn't"
end
end
lock_for_15_secs(name)
In the second we run:
def release_and_lock(name)
job = Redis::Semaphore.new(name.to_sym, redis: NonBlockingRedis.new(), custom_blpop: true, :stale_client_timeout => 15)
release = job.release_stale_locks!
count = job.available_count
puts "Release reponse is #{release.inspect} and available count is #{count}"
if job.lock(-1) == "0"
puts "Wuhuu, we can lock it"
job.unlock
else
puts "Hmm, we can't lock it"
end
end
release_and_lock(name)
This usually plays out as expected. For 15 seconds, the second terminal can't relase the lock, but when run after 15 seconds, it releases. Below is the output from release_and_lock(name).
Before 15 seconds have passed:
irb(main):1:0> release_and_lock(name)
Release reponse is {"0"=>"1580292557.321834"} and available count is 0
Hmm, we can't lock it
=> nil
After 15 seconds have passed:
irb(main):2:0> release_and_lock(name)
Release reponse is {"0"=>"1580292557.321834"} and available count is 1
Wuhuu, we can lock it
=> 1
irb(main):3:0> release_and_lock(name)
Release reponse is {} and available count is 1
Wuhuu, we can lock it
But whenever I see that a stale lock isn't released, and I run release_and_lock(name) to diagnose, this is returned:
irb(main):4:0> release_and_lock(name)
Release reponse is {} and available count is 0
Hmm, we can't lock it
And at this point my only option is to flush redis:
require 'non_blocking_redis'
non_blocking_redis = NonBlockingRedis.new()
non_blocking_redis.flushall
P.s. My NonBlockingRedis inherits from Redis:
class Redis
class Semaphore
def initialize(name, opts = {})
#custom_opts = opts
#name = name
#resource_count = opts.delete(:resources) || 1
#stale_client_timeout = opts.delete(:stale_client_timeout)
#redis = opts.delete(:redis) || Redis.new(opts)
#use_local_time = opts.delete(:use_local_time)
#custom_blpop = opts.delete(:custom_blpop) # false=queue, true=cancel
#tokens = []
end
def lock(timeout = 0)
exists_or_create!
release_stale_locks! if check_staleness?
token_pair = #redis.blpop(available_key, timeout, #custom_blpop)
return false if token_pair.nil?
current_token = token_pair[1]
#tokens.push(current_token)
#redis.hset(grabbed_key, current_token, current_time.to_f)
if block_given?
begin
yield current_token
ensure
signal(current_token)
end
end
current_token
end
alias_method :wait, :lock
end
end
class NonBlockingRedis < Redis
def initialize(options = {})
if options.empty?
options = {
url: Rails.application.secrets.redis_url,
db: Rails.application.secrets.redis_sidekiq_db,
driver: :hiredis,
network_timeout: 5
}
end
super(options)
end
def blpop(key, timeout, custom_blpop)
if custom_blpop
if timeout == -1
result = lpop(key)
return result if result.nil?
return [key, result]
else
super(key, timeout)
end
else
super
end
end
def lock(timeout = 0)
exists_or_create!
release_stale_locks! if check_staleness?
token_pair = #redis.blpop(available_key, timeout, #custom_blpop)
return false if token_pair.nil?
current_token = token_pair[1]
#tokens.push(current_token)
#redis.hset(grabbed_key, current_token, current_time.to_f)
if block_given?
begin
yield current_token
ensure
signal(current_token)
end
end
current_token
end
alias_method :wait, :lock
end
require 'non_blocking_redis'
😜 An awesome bug 👏
The bug
I think it happens if you kill the process when it does lpop on the SEMAPHORE:test:AVAILABLE
Most probably here https://github.com/dv/redis-semaphore/blob/v0.3.1/lib/redis/semaphore.rb#L67
To replicate it
NonBlockingRedis.new.flushall
release_and_lock('test');
NonBlockingRedis.new.lpop('SEMAPHORE:test:AVAILABLE')
Now initially you have:
SEMAPHORE:test:AVAILABLE 0
SEMAPHORE:test:VERSION 1
SEMAPHORE:test:EXISTS 1
After the above code you get:
SEMAPHORE:test:VERSION 1
SEMAPHORE:test:EXISTS 1
The code checks the SEMAPHORE:test:EXISTS and then expects to have SEMAPHORE:test:AVAILABLE / SEMAPHORE:test:GRABBED
Solution
From my brief check I don't think it is possible to make the gem work without a modification. I tried adding an expiration: but somehow it managed to disable the expiration for SEMAPHORE:test:EXISTS
NonBlockingRedis.new.ttl('SEMAPHORE:test:EXISTS') # => -1 and it should have been e.g. 20 seconds and going down
So.. maybe a fix will be
class Redis
class Semaphore
def exists_or_create!
token = #redis.getset(exists_key, EXISTS_TOKEN)
if token.nil? || all_tokens.empty?
create!
else
# Previous versions of redis-semaphore did not set `version_key`.
# Make sure it's set now, so we can use it in future versions.
if token == API_VERSION && #redis.get(version_key).nil?
#redis.set(version_key, API_VERSION)
end
true
end
end
end
end
the all_tokens is https://github.com/dv/redis-semaphore/blob/v0.3.1/lib/redis/semaphore.rb#L120
I'll open a PR to the gem shortly -> https://github.com/dv/redis-semaphore/pull/66 maybe 🤷♂️
Note 1
Not sure how you use the NonBlockingRedis but it is not in use in Redis::Semaphore. You do lock(-1) which does in the code lpop. Also the code never calls your lock.
Random
Here is a helper to dump the keys
class Test
def self.all
r = NonBlockingRedis.new
puts r.keys('*').map { |k|
[
k,
((r.hgetall(k) rescue r.get(k)) rescue r.lrange(k, 0, -1).join(' | '))
].join("\t\t")
}
end
end
> Test.all
SEMAPHORE:test:AVAILABLE 0
SEMAPHORE:test:VERSION 1
SEMAPHORE:test:EXISTS 1
For completeness here is how it looks when you have grabbed the lock
SEMAPHORE:test:VERSION 1
SEMAPHORE:test:EXISTS 1
SEMAPHORE:test:GRABBED {"0"=>"1583672948.7168388"}
Related
Jobs in sidekiq are suppose to check if they have been cancelled, but if I have a long running job, I'd like for it to check itself periodically. This example does not work : I've not wrapped the fake work in any sort of future within which I can raise an exception -- which I'm not sure is even possible. How might I do this?
class ThingWorker
def perform(phase, id)
thing = Thing.find(id)
# schedule the initial check
schedule_cancellation_check(thing.updated_at, id)
# maybe wrap this in something I can raise an exception within?
sleep 10 # fake work
#done = true
return true
end
def schedule_cancellation_check(initial_time, thing_id)
Concurrent.schedule(5) {
# just check right away...
return if #done
# if our thing has been updated since we started this job, kill this job!
if Thing.find(thing_id).updated_at != initial_time
cancel!
# otherwise, schedule the next check
else
schedule_cancellation_check(initial_time, thing_id)
end
}
end
# as per sidekiq wiki
def cancelled?
#cancelled
Sidekiq.redis {|c| c.exists("cancelled-#{jid}") }
end
def cancel!
#cancelled = true
# not sure what this does besides marking the job as cancelled tho, read source
Sidekiq.redis {|c| c.setex("cancelled-#{jid}", 86400, 1) }
end
end
You're thinking about this way too hard. Your worker should be a loop and check for cancellation every iteration.
def perform(thing_id, updated_at)
thing = Thing.find(thing_id)
while !cancel?(thing, updated_at)
# do something
end
end
def cancel?(thing, last_updated_at)
thing.reload.updated_at > last_updated_at
end
I am trying to implement a simple timeout class that handles timeouts of different requests.
Here is the first version:
class MyTimer
def handleTimeout mHash, k
while mHash[k] > 0 do
mHash[k] -=1
sleep 1
puts "#{k} : #{mHash[k]}"
end
end
end
MAX = 3
timeout = Hash.new
timeout[1] = 41
timeout[2] = 5
timeout[3] = 14
t1 = MyTimer.new
t2 = MyTimer.new
t3 = MyTimer.new
first = Thread.new do
t1.handleTimeout(timeout,1)
end
second = Thread.new do
t2.handleTimeout(timeout,2)
end
third = Thread.new do
t3.handleTimeout(timeout,3)
end
first.join
second.join
third.join
This seems to work fine. All the timeouts work independently of each other.
Screenshot attached
The second version of the code however produces different results:
class MyTimer
def handleTimeout mHash, k
while mHash[k] > 0 do
mHash[k] -=1
sleep 1
puts "#{k} : #{mHash[k]}"
end
end
end
MAX = 3
timeout = Hash.new
timers = Array.new(MAX+1)
threads = Array.new(MAX+1)
for i in 0..MAX do
timeout[i] = rand(40)
# To see timeout value
puts "#{i} : #{timeout[i]}"
end
sleep 1
for i in 0..MAX do
timers[i] = MyTimer.new
threads[i] = Thread.new do
timers[i].handleTimeout( timeout, i)
end
end
for i in 0..MAX do
threads[i].join
end
Screenshot attached
Why is this happening?
How can I implement this functionality using arrays?
Is there a better way to implement the same functionality?
In the loop in which you are creating threads by using Thread.new, the variable i is shared between main thread (where threads are getting created) and in the threads created. So, the value of i seen by handleTimeout is not consistent and you get different results.
You can validate this by adding a debug statement in your method:
#...
def handleTimeout mHash, k
puts "Handle timeout called for #{mHash} and #{k}"
#...
end
#...
To fix the issue, you need to use code like below. Here parameters are passed to Thread.new and subsequently accessed using block variables.
for i in 0..MAX do
timers[i] = MyTimer.new
threads[i] = Thread.new(timeout, i) do |a, b|
timers[i].handleTimeout(a, b)
end
end
More on this issue is described in When do you need to pass arguments to Thread.new? and this article.
I wrote a crawler which uses 8 threads to download JSON from the Internet:
#encoding: utf-8
require 'net/http'
require 'sqlite3'
require 'zlib'
require 'json'
require 'thread'
$mutex = Mutex.new # Lock of database and $cnt
$cntMutex = Mutex.new # Lock of $threadCnt
$threadCnt = 0 # number of running threads
$cnt = 0 # number of lines in this COMMIT to database
db = SQLite3::Database.new "price.db"
db.results_as_hash = true
STDOUT.sync = true
start = 10000000
def fetch(http, url, timeout = 10)
# ...
end
def parsePrice( i, db)
ss = fetch(Net::HTTP.start('p.3.cn',80), 'http://p.3.cn/prices/get?skuid=J_'+i.to_s)
doc = JSON.parse(ss)[0]
puts "processing "+i.to_s
STDOUT.flush
begin
$mutex.synchronize {
$cnt = $cnt+1
db.execute("insert into prices (id, price) VALUES (?,?)", [i,doc["p"].to_f])
if $cnt > 20
db.execute('COMMIT')
db.execute('BEGIN')
$cnt = 0
end
}
rescue SQLite3::ConstraintException
warn("duplicate id: "+i.to_s)
$cntMutex.synchronize {
$threadCnt -= 1;
}
Thread.terminate
rescue NoMethodError
warn("Matching failed")
rescue
raise
ensure
end
$cntMutex.synchronize {
$threadCnt -= 1;
}
end
puts "will now start from " + start.to_s()
db.execute("BEGIN")
Thread.new {
for ii in start..12000000 do
sleep 0.1 while $threadCnt > 7
$cntMutex.synchronize {
$threadCnt += 1;
}
Thread.new {
parsePrice( ii, db)
}
end
db.execute('COMMIT')
} . join
Then I created a database named price.db:
sqlite3 > create table prices (id INT PRIMATY KEY, price REAL);
To make my code thread-safe, db, $cnt, $threadCnt are all protected by $mutex or $cntMutex.
However, when I tried to run this script, the following messages were printed:
[lz#lz crawl]$ ruby priceCrawler.rb
will now start from 10000000
http://p.3.cn/prices/get?skuid=J_10000008http://p.3.cn/prices/get?skuid=J_10000008
http://p.3.cn/prices/get?skuid=J_10000008http://p.3.cn/prices/get?skuid=J_10000002http://p.3.cn/prices/get?skuid=J_10000008
http://p.3.cn/prices/get?skuid=J_10000008
http://p.3.cn/prices/get?skuid=J_10000002http://p.3.cn/prices/get?skuid=J_10000002
processing 10000002
processing 10000002processing 10000008processing 10000008processing 10000002
duplicate id: 10000002
duplicate id: 10000002processing 10000008
processing 10000008duplicate id: 10000008
duplicate id: 10000008processing 10000008
duplicate id: 10000008
It seems that this script skipped some id and called parsePrice with the same id more than once.
So why did this error occur? Any help would be appreciated.
It seems to me that your thread scheduling is wrong. I have modified your code to illustrates some possible race conditions you were triggering.
re 'net/http'
require 'sqlite3'
require 'zlib'
require 'json'
require 'thread'
$mutex = Mutex.new # Lock of database and $cnt
$cntMutex = Mutex.new # Lock of $threadCnt
$threadCnt = 0 # number of running threads
$cnt = 0 # number of lines in this COMMIT to database
db = SQLite3::Database.new "price.db"
db.results_as_hash = true
STDOUT.sync = true
start = 10000000
def fetch(http, url, timeout = 10)
# ...
end
def parsePrice(i, db)
must_terminate = false
ss = fetch(Net::HTTP.start('p.3.cn',80), "http://p.3.cn/prices/get?skuid=J_#{i}")
doc = JSON.parse(ss)[0]
puts "processing #{i}"
STDOUT.flush
begin
$mutex.synchronize {
$cnt = $cnt+1
db.execute("insert into prices (id, price) VALUES (?,?)", [i,doc["p"].to_f])
if $cnt > 20
db.execute('COMMIT')
db.execute('BEGIN')
$cnt = 0
end
}
rescue SQLite3::ConstraintException
warn("duplicate id: #{i}")
must_terminate = true
rescue NoMethodError
warn("Matching failed")
rescue
# Raising here does not prevent ensure from running.
# It will raise after we decrement $threadCnt on
# ensure clause.
raise
ensure
$cntMutex.synchronize {
$threadCnt -= 1;
}
end
Thread.terminate if must_terminate
end
puts "will now start from #{start}"
# This begin makes no sense for me.
db.execute("BEGIN")
for ii in start..12000000 do
should_redo = false
# Instead of sleeping, we acquire the lock and check
# if we can create another thread. If we can't, we just
# release the lock and retry latter (using for-redo).
$cntMutex.synchronize{
if $threadCnt <= 7
$threadCnt += 1;
Thread.new { parsePrice(ii, db) }
else
# We use this flag since we don't know for sure redo's
# behavior inside a lock.
should_redo = true
end
}
# Will redo this iteration if we can't create the thread.
if should_redo
# Mitigate busy waiting a bit.
sleep(0.1)
redo
end
end
# This commit makes no sense to me.
db.execute('COMMIT')
Thread.list.each { |t| t.join }
Also, most databases already implement locks themselves. You can probably remove the mutex that locks the database. And another advice is that you be more consistent with your commits. You have a lot of scattered begins and commits in the code. I suggest that you either make the operation and then commit or use a commit buffer and then commit everything in a single place.
The race condition, it seems you were not being careful enough when dealing with $threadCnt. The implementation I gave you makes more sense to me, but I have not tested it.
The redo in the main loop is a form of busy waiting, which is bad for performance. You can and you should put a sleep clause there. But it is essential that you maintain the $threadCnt checking and updating inside the lock. The way you implemented it before did not ensure the check and updating was an atomic operation.
In ruby, is it possible to cause a thread to pause from a different concurrently running thread.
Below is the code that I've written so far. I want the user to be able to type 'pause thread' and the sample500 thread to pause.
#!/usr/bin/env ruby
# Creates a new thread executes the block every intervalSec for durationSec.
def DoEvery(thread, intervalSec, durationSec)
thread = Thread.new do
start = Time.now
timeTakenToComplete = 0
loopCounter = 0
while(timeTakenToComplete < durationSec && loopCounter += 1)
yield
finish = Time.now
timeTakenToComplete = finish - start
sleep(intervalSec*loopCounter - timeTakenToComplete)
end
end
end
# User input loop.
exit = nil
while(!exit)
userInput = gets
case userInput
when "start thread\n"
sample500 = Thread
beginTime = Time.now
DoEvery(sample500, 0.5, 30) {File.open('abc', 'a') {|file| file.write("a\n")}}
when "pause thread\n"
sample500.stop
when "resume thread"
sample500.run
when "exit\n"
exit = TRUE
end
end
Passing Thread object as argument to DoEvery function makes no sense because you immediately overwrite it with Thread.new, check out this modified version:
def DoEvery(intervalSec, durationSec)
thread = Thread.new do
start = Time.now
Thread.current["stop"] = false
timeTakenToComplete = 0
loopCounter = 0
while(timeTakenToComplete < durationSec && loopCounter += 1)
if Thread.current["stop"]
Thread.current["stop"] = false
puts "paused"
Thread.stop
end
yield
finish = Time.now
timeTakenToComplete = finish - start
sleep(intervalSec*loopCounter - timeTakenToComplete)
end
end
thread
end
# User input loop.
exit = nil
while(!exit)
userInput = gets
case userInput
when "start thread\n"
sample500 = DoEvery(0.5, 30) {File.open('abc', 'a') {|file| file.write("a\n")} }
when "pause thread\n"
sample500["stop"] = true
when "resume thread\n"
sample500.run
when "exit\n"
exit = TRUE
end
end
Here DoEvery returns new thread object. Also note that Thread.stop called inside running thread, you can't directly stop one thread from another because it is not safe.
You may be able to better able to accomplish what you are attempting using Ruby Fiber object, and likely achieve better efficiency on the running system.
Fibers are primitives for implementing light weight cooperative
concurrency in Ruby. Basically they are a means of creating code
blocks that can be paused and resumed, much like threads. The main
difference is that they are never preempted and that the scheduling
must be done by the programmer and not the VM.
Keeping in mind the current implementation of MRI Ruby does not offer any concurrent running threads and the best you are able to accomplish is a green threaded program, the following is a nice example:
require "fiber"
f1 = Fiber.new { |f2| f2.resume Fiber.current; while true; puts "A"; f2.transfer; end }
f2 = Fiber.new { |f1| f1.transfer; while true; puts "B"; f1.transfer; end }
f1.resume f2 # =>
# A
# B
# A
# B
# .
# .
# .
I was trying to speed up multiple FTP downloads by using threaded FTP connections. My problem is that I always have threads hang. I am looking for a clean way of either telling FTP it needs to retry the ftp transaction, or at least knowing when the FTP connection is hanging.
In the code below I am threading 5/6 separate FTP connections where each thread has a list of files it is expected to download. When the script completes, a few of the threads hang and can not be joined. I am using the variable #last_updated to represent the last successful download time. If the current time + 20 seconds exceeds #last_updated, kill the remaining threads. Is there a better way?
threads = []
max_thread_pool = 5
running_threads = 0
Thread.abort_on_exception = true
existing_file_count = 0
files_downloaded = 0
errors = []
missing_on_the_server = []
#last_updated = Time.now
if ids.length > 0
ids.each_slice(ids.length / max_thread_pool) do |id_set|
threads << Thread.new(id_set) do |t_id_set|
running_threads += 1
thread_num = running_threads
thread_num.freeze
puts "making thread # #{thread_num}"
begin
ftp = Net::FTP.open(#remote_site)
ftp.login(#remote_user, #remote_password)
ftp.binary = true
#ftp.debug_mode = true
ftp.passive = false
rescue
raise "Could not establish FTP connection"
end
t_id_set.each do |id|
#last_updated = Time.now
rmls_path = "/_Photos/0#{id[0,2]}00000/#{id[2,1]}0000/#{id[3,1]}000/#{id}-1.jpg"
local_path = "#{#photos_path}/01/#{id}-1.jpg"
progress += 1
unless File.exist?(local_path)
begin
ftp.getbinaryfile(rmls_path, local_path)
puts "ftp reponse: #{ftp.last_response}"
# find the percentage of progress just for fun
files_downloaded += 1
p = sprintf("%.2f", ((progress.to_f / total) * 100))
puts "Thread # #{thread_num} > %#{p} > #{progress}/#{total} > Got file: #{local_path}"
rescue
errors << "#{thread_num} unable to get file > ftp response: #{ftp.last_response}"
puts errors.last
if ftp.last_response_code.to_i == 550
# Add the missing file to the missing list
missing_on_the_server << errors.last.match(/\d{5,}-\d{1,2}\.jpg/)[0]
end
end
else
puts "found file: #{local_path}"
existing_file_count += 1
end
end
puts "closing FTP connection #{thread_num}"
ftp.close
end # close thread
end
end
# If #last_updated has not been updated on the server in over 20 seconds, wait 3 seconds and check again
while Time.now < #last_updated + 20 do
sleep 3
end
# threads are hanging so joining the threads does not work.
threads.each { |t| t.kill }
The trick for me that worked was to use ruby's Timeout.timeout to ensure the FTP connection was not hanging.
begin
Timeout.timeout(10) do
ftp.getbinaryfile(rmls_path, local_path)
end
# ...
rescue Timeout::Error
errors << "#{thread_num}> File download timed out for: #{rmls_path}"
puts errors.last
rescue
errors << "unable to get file > ftp reponse: #{ftp.last_response}"
# ...
end
Hanging FTP downloads were causing my threads to appear to hang. Now that the threads are no longer hanging, I can use the more proper way of dealing with threads:
threads.each { |t| t.join }
rather than the ugly:
# If #last_updated has not been updated on the server in over 20 seconds, wait 3 seconds and check again
while Time.now < #last_updated + 20 do
sleep 3
end
# threads are hanging so joining the threads does not work.
threads.each { |t| t.kill }