Time-based conditional query - ruby

Suppose I have the following Server data model:
Server
-> created_at Timestamp
-> last_ping Timestamp
A "stale" Server is defined as a Server whose last_ping occurred more than one hour ago (i.e., last_ping < Time.now - 1 hour). It should be destroyed if there exists another non-stale server that has come online (created_at) within one hour of the last_ping of the stale server.
How can I find all the Servers that should be destroyed? What would a query look like for this?

Something like…
def clean_stale_servers
return unless Server.exists?(last_ping: 1.hour.ago..)
Server.where(last_ping: ...1.hour.ago)
.destroy_all # .delete_all is faster, use that if possible
end
Then you can call the clean_stale_servers method periodically, e.g. from a cron job.
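Note that the method above destroys every stale server as soon as any non-stale server exists; it does not compare created_at with each stale server's last_ping. Here is a minimal plain-Ruby sketch of the full rule from the question, using in-memory records instead of ActiveRecord. The Server struct, ONE_HOUR and servers_to_destroy are hypothetical names, and "within one hour" is assumed to mean one hour before or after the last ping.

```ruby
# Hypothetical in-memory stand-in for the Server model; not ActiveRecord.
Server = Struct.new(:name, :created_at, :last_ping)

ONE_HOUR = 60 * 60 # seconds

# Returns the stale servers for which some non-stale server came online
# within one hour (before or after, by assumption) of the stale server's
# last ping.
def servers_to_destroy(servers, now: Time.now)
  cutoff = now - ONE_HOUR
  # Stale: last ping more than one hour ago. Everything else is fresh.
  stale, fresh = servers.partition { |s| s.last_ping < cutoff }
  stale.select do |old|
    fresh.any? { |s| (s.created_at - old.last_ping).abs <= ONE_HOUR }
  end
end
```

The same pairing could probably be pushed into SQL with a self-join or an EXISTS subquery, but that depends on the database in use.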

Related

How to handle signed S3 image URL expiration in Elasticsearch and Rails 7

I am using AWS S3 with Rails 7 to store images via Active Storage. I'm presenting my data to the view by querying Elasticsearch (using the elasticsearch-model gem).
While this works great for my other data, the expiration of the signed AWS URL becomes an issue after a little while and the images are of course no longer accessible.
class MyClass
has_one_attached :image
end
I'd like to be able to have a fresh URL and still use Elasticsearch so that I don't need to make a trip to the database every time I want to see the image.
I have looked up whether I can just remove the expiration however I've read that it's unsafe and mostly unsupported. I know that Elasticsearch::Model callbacks exists but I'm not clear on whether that could be applied to ActiveStorage::Blob, especially since nothing changes in the DB when the expiration occurs.
I've also thought about just changing the URLs to expire after 1 week by passing the expires_in param to the url method on the attachment, and then running a cron job to update the image once a week. Seems hacky though.
I'm sure there are many ways to approach this, but what worked for me was an after_save callback, on the model that includes Elasticsearch::Model, that enqueues an async job. Whenever this particular attribute was updated, I scheduled the job with a delay just short of the maximum signed URL lifetime allowed by S3, which is 7 days.
after_save :set_refresh_url_job, if: Proc.new { logo_url? }
def set_refresh_url_job
RefreshLogoUrlJob.
set(wait: MyModel::LOGO_EXPIRTY_REFRESH).
perform_later(self)
end
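The answer does not show the value of the delay constant. As a rough sketch of how "just before 7 days" could be computed in plain Ruby, here is one option; the constant names, the one-hour margin and refresh_wait_seconds are all assumptions made for illustration.

```ruby
# Longest signed URL lifetime mentioned in the answer above: 7 days.
MAX_SIGNED_URL_SECONDS = 7 * 24 * 60 * 60

# Hypothetical safety margin so the refresh runs before the URL expires.
REFRESH_MARGIN_SECONDS = 60 * 60

# Seconds to wait before refreshing a URL that was signed at signed_at.
# Returns 0 when the refresh is already due.
def refresh_wait_seconds(signed_at, now: Time.now)
  expires_at = signed_at + MAX_SIGNED_URL_SECONDS
  wait = expires_at - REFRESH_MARGIN_SECONDS - now
  wait.negative? ? 0 : wait
end
```

The result could then be passed as the wait: value when enqueuing the job.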

ActiveRecord Postgres database not locking - getting race conditions

I'm struggling with locking a PostgreSQL table I'm working on. Ideally I want to lock the entire table, but individual rows will do as long as they actually work.
I have several concurrent ruby scripts that all query a central jobs database on AWS (via a DatabaseAccessor class), find a job that hasn't yet been started, change the status to started and carry it out. The problem is, since these are all running at once, they'll typically all find the same unstarted job at once, and begin carrying it out, wasting time and muddying the results.
I've tried a bunch of things, .lock, .transaction, the fatalistic gem but they don't seem to be working, at least, not in pry.
My code is as follows:
class DatabaseAccessor
require 'pg'
require 'pry'
require 'active_record'
class Jobs < ActiveRecord::Base
enum status: [ :unstarted, :started, :slow, :completed]
end
def initialize(db_credentials)
ActiveRecord::Base.establish_connection(
adapter: db_credentials[:adapter],
database: db_credentials[:database],
username: db_credentials[:username],
password: db_credentials[:password],
host: db_credentials[:host]
)
end
def find_unstarted_job
job = Jobs.where(status: 0).limit(1)
job.started!
job
end
end
Does anyone have any suggestions?
EDIT: It seems that LOCK TABLE jobs IN ACCESS EXCLUSIVE MODE; is the way to do this - however, I'm struggling with then returning the results of this after updating. RETURNING * will return the results after an update, but not inside a transaction.
SOLVED!
So the key here is locking in Postgres. There are a few different table-level locks, detailed here.
There are three factors here in making a decision:
Reads aren't thread safe. Two threads reading the same record will result in that job being run multiple times at once.
Records are only updated once (to be marked as completed) and created, other than the initial read and update to being started. Scripts that create new records will not read the table.
Reading varies in frequency. Waiting for an unlock is non-critical.
Given these factors, a read lock that still allowed writes would be acceptable; however, there isn't one, so ACCESS EXCLUSIVE is our best option.
Given this, how do we deal with locking? A hunt through the ActiveRecord documentation gives no mention of it.
Thankfully, other methods to deal with PostgreSQL exist, namely the ruby-pg gem. A bit of a play with SQL later, and a test of locking, and I get the following method:
def converter
result_hash = {}
conn = PG::Connection.open(:dbname => 'my_db')
conn.exec("BEGIN WORK;
LOCK TABLE jobs IN ACCESS EXCLUSIVE MODE;")
conn.exec("UPDATE jobs SET status = 1 WHERE id =
(SELECT id FROM jobs WHERE status = 0 ORDER BY ID LIMIT 1)
RETURNING *;") do |result|
result.each { |row| result_hash = row }
end
conn.exec("COMMIT WORK;")
result_hash.transform_keys!(&:to_sym)
end
This will result in:
An output of an empty hash if there are no jobs with a status of 0
An output of a symbolized hash if one is found and updated
Sleeping if the database is currently locked, before returning the above once unlocked.
The table will remain locked until the COMMIT WORK statement.
As an aside, I wish there was a cleaner way to convert the result to a hash. If anyone has any suggestions, please let me know in the comments! :)
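On the aside about a cleaner conversion: PG::Result is Enumerable and yields each row as a string-keyed hash, so taking the first row and transforming its keys may be enough. The sketch below also includes an alternative claim query using FOR UPDATE SKIP LOCKED, which locks only the selected row instead of the whole table. This assumes PostgreSQL 9.5 or later, and CLAIM_JOB_SQL and first_row_as_hash are hypothetical names.

```ruby
# Alternative claim query (assumption: PostgreSQL 9.5+). FOR UPDATE SKIP LOCKED
# locks only the selected row and makes concurrent workers skip it, so the
# whole table does not have to be locked.
CLAIM_JOB_SQL = <<~SQL
  UPDATE jobs SET status = 1
  WHERE id = (
    SELECT id FROM jobs
    WHERE status = 0
    ORDER BY id
    LIMIT 1
    FOR UPDATE SKIP LOCKED
  )
  RETURNING *;
SQL

# Converts the first row of a result into a symbol-keyed hash, or {} if the
# result is empty. Accepts any Enumerable of string-keyed hashes.
def first_row_as_hash(rows)
  row = rows.first
  row ? row.transform_keys(&:to_sym) : {}
end
```

With an open connection this could be called as first_row_as_hash(conn.exec(CLAIM_JOB_SQL)). A single statement runs in its own transaction, so no explicit BEGIN or COMMIT should be needed for it.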

Aborting queries on neo4jrb

I am running something along the lines of the following:
results = queries.map do |query|
begin
Neo4j::Session.query(query)
rescue Faraday::TimeoutError
nil
end
end
After a few iterations I get an unrescued Faraday::TimeoutError: too many connection resets (due to Net::ReadTimeout - Net::ReadTimeout) and Neo4j needs switching off and on again.
I believe this is because the queries themselves aren't aborted - i.e. the connection times out but Neo4j carries on trying to run my query. I actually want to time them out, so simply increasing the timeout window won't help me.
I've had a scout around and it looks like I can find my queries and abort them via the Neo4j API, which will be my next move.
Am I right in my diagnosis? If so, is there a recommended way of managing queries (and aborting them) from neo4jrb?
Rebecca is right about managing queries manually. Though if you want Neo4j to automatically stop queries within a certain time period, you can set this in your neo4j conf:
dbms.transaction.timeout=60s
You can find more info in the docs for that setting.
The Ruby gem is using Faraday to connect to Neo4j via HTTP and Faraday has a built-in timeout which is separate from the one in Neo4j. I would suggest setting the Neo4j timeout as a bit longer (5-10 seconds perhaps) than the one in Ruby (here are the docs for configuring the Faraday timeout). If they both have the same timeout, Neo4j might raise a timeout before Ruby, making for a less clear error.
Query management can be done through Cypher. You must be an admin user.
To list all queries, you can use CALL dbms.listQueries;.
To kill a query, you can use CALL dbms.killQuery('ID-OF-QUERY-TO-KILL');, where the ID is obtained from the list of queries.
The previous statements must be executed as a raw query; it does not matter whether you are using an OGM, as long as you can input queries manually. If there is no way to manually input queries, and there is no way of doing this in your framework, then you will have to access the database using some other method in order to execute the queries.
So thanks to Brian and Rebecca for useful tips about query management within Neo4j. Both of these point the way to viable solutions to my problem, and Brian's explicitly lays out steps for achieving one via Neo4jrb so I've marked it correct.
As both answers assume, the diagnosis I made IS correct - i.e. if you run a query from Neo4jrb and the HTTP connection times out, Neo4j will carry on executing the query and Neo4jrb will not issue any instruction for it to stop.
Neo4jrb does not provide a wrapper for any query management functionality, so simply setting a transaction timeout seems most sensible and probably what I'll adopt. Actually intercepting and killing queries is also possible, but this means running your query on one thread so that you can look up its queryId in another. This is the somewhat hacky solution I'm working with atm:
class QueryRunner
DEFAULT_TIMEOUT=70
def self.query(query, timeout_limit=DEFAULT_TIMEOUT)
new(query, timeout_limit).run
end
def initialize(query, timeout_limit)
@query = query
@timeout_limit = timeout_limit
end
def run
start_time = Time.now.to_i
Thread.new { @result = Neo4j::Session.query(@query) }
sleep 0.5
return @result if @result
id = if query_ref = Neo4j::Session.query("CALL dbms.listQueries;").to_a.find {|x| x.query == @query }
query_ref.queryId
end
while @result.nil?
if (Time.now.to_i - start_time) > @timeout_limit
puts "killing query #{id} due to timeout"
Neo4j::Session.query("CALL dbms.killQuery('#{id}');")
@result = []
else
sleep 1
end
end
@result
end
end
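The polling loop in the class above could probably be simplified with Thread#join, which accepts a time limit and returns nil if the thread has not finished by then. Below is a minimal sketch of just that waiting pattern in plain Ruby; run_with_deadline and on_timeout are hypothetical names, and the callback is where the dbms.killQuery call would go.

```ruby
# Runs the block in a worker thread and waits at most timeout_seconds.
# On timeout, calls on_timeout (e.g. to kill the server-side query), stops
# the worker thread, and returns nil. All names here are hypothetical.
def run_with_deadline(timeout_seconds, on_timeout:, &work)
  worker = Thread.new(&work)
  if worker.join(timeout_seconds) # join returns nil when the wait times out
    worker.value
  else
    on_timeout.call
    worker.kill
    nil
  end
end
```

This only covers the client-side wait; looking up the queryId via dbms.listQueries would still have to happen separately, as in the class above.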

Ruby and Rails Async

I need to perform a long-running operation in Ruby/Rails asynchronously.
Googling around, one of the options I found is Sidekiq.
class WeeklyReportWorker
include Sidekiq::Worker
def perform(user, product, year = Time.now.year, week = Date.today.cweek)
report = WeeklyReport.build(user, product, year, week)
report.save
end
end
# call WeeklyReportWorker.perform_async('user', 'product')
Everything works great! But there is a problem.
If I keep calling this async method every few seconds while the heavy operation itself takes about a minute, things won't work.
Let me give an example.
5.times { WeeklyReportWorker.perform_async('user', 'product') }
Now my heavy operation will be performed 5 times. Ideally it should be performed only once or twice, depending on whether execution of the first operation started before the 5th async call was made.
Do you have any tips on how to solve it?
Here's a naive approach. I'm a resque user, maybe sidekiq has something better to offer.
def perform(user, product, year = Time.now.year, week = Date.today.cweek)
# first, make a name for lock key. For example, include all arguments
# there, so that another perform with the same arguments won't do any work
# while the first one is still running
lock_key_name = make_lock_key_name(user, product, year, week)
Sidekiq.redis do |redis| # sidekiq uses redis, let us leverage that
res = redis.incr lock_key_name
return if res != 1 # protection from race condition. Since incr is atomic,
# the very first one will set value to 1. All subsequent
# incrs will return greater values.
# if incr returned not 1, then another copy of this
# operation is already running, so we quit.
begin
# finally, perform your business logic here
report = WeeklyReport.build(user, product, year, week)
report.save
ensure
# drop lock key, so that operation may run again. This runs only in the
# job that acquired the lock, so a skipped job cannot release it early.
redis.del lock_key_name
end
end
end
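The make_lock_key_name helper is not defined in the answer. One possible implementation, purely as an illustration, joins a fixed prefix with the job arguments; the prefix and the format are assumptions.

```ruby
# Hypothetical implementation of the make_lock_key_name helper used above.
# Builds one Redis key per distinct argument list, so jobs with the same
# arguments share a lock and jobs with different arguments do not.
def make_lock_key_name(*args)
  (["weekly_report", "lock"] + args.map(&:to_s)).join(":")
end
```

If a worker process dies before the key is deleted, the lock would stay in place, so setting an expiry on the key (for example with Redis EXPIRE) may be worth considering.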
I am not sure I understood your scenario well, but how about looking at this gem:
https://github.com/collectiveidea/delayed_job
So instead of doing:
5.times { WeeklyReportWorker.perform_async('user', 'product') }
You can do:
5.times { WeeklyReportWorker.delay.perform('user', 'product') }
Out of the box, this makes the worker process the second job only after the first has finished, but only with the default settings (by default there is a single worker process).
The gem offers possibilities to:
Put jobs on a queue;
Have different queues for different jobs if that is required;
Have more than one worker to process a queue (for example, you can start 4 workers on a 4-CPU machine for higher efficiency);
Schedule jobs to run at exact times, or after set amount of time after queueing the job. (Or, by default, schedule for immediate background execution).
I hope this helps you as you helped me.

Limitation in retrieving rows from a mongodb from ruby code

I have code that gets all the records from a MongoDB collection and then performs some computations.
My program takes too much time, as the "coll_id.find().each do |eachitem|......." returns only 300 records at a time.
If I place a counter inside the loop and check, it prints 300 records and then sleeps for around 3 to 4 seconds before printing the counter value for the next set of 300 records.
coll_id.find().each do |eachcollectionitem|
puts "counter value for record " + counter.to_s
counter=counter +1
# ... my computations here ...
end
Is this a limitation of the ruby-mongodb API, or does some configuration need to be done so that the code can access all the records at once?
How large are your documents? It's possible that the deserialization is taking a long time. Are you using the C extensions (bson_ext)?
You might want to try passing a logger when you connect. That could help sort out what's going on. Alternatively, can you paste in the MongoDB log? What's happening there during the pause?
