How can I troubleshoot silently failing queued jobs? - laravel

I have a job that is dispatched with two arguments: the path and the filename of a file. The job parses the file using SimpleXML, records the result in the database, and moves the file to the appropriate folder for safekeeping. If anything goes wrong, it moves the file to a separate folder for failed files and fires an event that sends me a notification.
My problem is that sometimes the job fails silently. The job is removed from the queue, but the file has not been parsed and remains in the original directory. The failed_jobs table is empty (I'm using the database queue driver for development) and the failed() method has not been triggered. The Queue::failing() callback I registered in the app service provider has not been triggered either. I know this because both contain only a single log call to check whether they were hit. The Laravel log is empty (it's readable, and Laravel does write to it for other errors; I double-checked), and so are the relevant system logs, such as PHP's.
At first I thought it was a timeout issue, but the queue listener has not failed, stopped, or been restarted. I increased the timeout to 300 seconds anyway, and verified that all of the "[datetime] Processed: [job]" lines the listener generates were well within that timespan. PHP's execution time limits are also far longer than this job requires.
So how on earth can I troubleshoot this when the logs are empty, nothing appears to fail, and I get no notification of what's wrong? If I queue up 200 files then maybe 180 will be processed and the remaining 20 fail silently. If I refresh the database + migrations and queue up the same 200 again, then maybe 182 will be processed and 18 will fail silently - but they won't necessarily be the same.
My handle method, simplified to show relevant bits, looks as follows:
public function handle()
{
    try {
        $xml = simplexml_load_file($this->path.$this->filename);
        $this->parse($xml);
        $parsedFilename = config('feeds.parsed path').$this->filename;
        File::move($this->path.$this->filename, $parsedFilename);
    } catch (Exception $e) {
        // if i put deliberate errors in the files, this works fine
        $errorFilename = config('feeds.error path').$this->filename;
        File::move($this->path.$this->filename, $errorFilename);
        event(new ParserThrewAnError($this->filename));
    }
}

Okay, so I still have absolutely no idea why, but... after restarting the VM I have tested eight times with various different files and options and had zero problems. If anyone can guess the reason, feel free to reply and I'll accept your answer if it sounds reasonable. For now, I'll mark my own answer as correct once I can, in case somebody else stumbles across this later.
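One way to narrow down a failure that leaves no trace is to log the start and end of every job and to check the job's post-condition afterwards (here: the source file must no longer be in the inbox). The sketch below is written in Python rather than PHP to keep it framework-free. The names run_checked and handler are hypothetical, and it only illustrates the idea; it is not a drop-in for a Laravel job.

```python
import logging
import os
import tempfile

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("jobs")


def run_checked(job_id, source_path, handler):
    """Run handler(source_path), then verify the file actually left the inbox.

    Returns True if the post-condition holds, False if the handler returned
    normally but the file is still where it was (a "silent" failure).
    """
    log.info("start %s", job_id)
    try:
        handler(source_path)
    except Exception:
        # Log with traceback before the queue machinery gets a chance to swallow it.
        log.exception("job %s raised", job_id)
        raise
    if os.path.exists(source_path):
        log.error("job %s finished but %s is still in place", job_id, source_path)
        return False
    log.info("done %s", job_id)
    return True


if __name__ == "__main__":
    inbox = tempfile.mkdtemp()
    path = os.path.join(inbox, "feed.xml")
    open(path, "w").close()
    # A handler that does nothing reproduces the symptom and is flagged.
    print(run_checked("demo", path, lambda p: None))
```

A missing "start" line would point at the dispatch or the worker, while a "start" line without a matching "done" or error line would point at the process dying mid-job.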

Related

How to send an exception to Sentry from Laravel Job only on final fail?

Configuration
I'm using Laravel 8 with the sentry/sentry-laravel package.
There is a job that works just fine 99% of the time. It retries N times in case of any problems, because of:
public $backoff = 120;
public function retryUntil()
{
    return now()->addHours(6);
}
And it simply calls some service:
public function handle()
{
    // Service calls some external API
    $service->doSomeWork(...);
}
The doSomeWork method sometimes throws an exception due to network problems, like Curl error: Operation timed out after 15001 milliseconds with 0 bytes received. This is fine because of the automatic retries. In most cases the next retry will succeed.
Problem
Every curl error is sent to Sentry. As an administrator I must check every alert, because this job is pretty important and I can't afford to miss a job that has actually failed. For example:
There is some network problem that is not resolved for an hour.
The application queues a job.
Every 2 minutes the application sends a similar message to Sentry.
After the network problem is resolved the job succeeds, so no attention is required.
But we are seeing dozens of errors that could, in theory, be ignored. What if there is an actual problem in that pile and I miss it?
Question
How can I make sure that only the "final" job failure sends a message to Sentry? I mean after 6 hours of failed retries: only then would I like to receive one alert.
What I tried
There is one workaround that kind of "works". We can replace Exception with SomeCustomException and add it to \App\Exceptions\Handler::$dontReport array. In that case there are no "intermediate" messages sent to Sentry.
But when the job finally fails, Laravel sends the standard "... job has been attempted too many times or run too long" message, without the details of the actual error.
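The behaviour being asked for can be sketched independently of Laravel and Sentry: swallow intermediate errors, keep retrying until a deadline, and hand only the last real exception to the reporter. The Python below is a minimal model under those assumptions. run_with_retries, report, clock, and sleep are hypothetical names, and the clock is injected so the example runs instantly.

```python
def run_with_retries(job, report, retry_for, backoff, clock, sleep):
    """Retry job() every `backoff` seconds for up to `retry_for` seconds.

    Intermediate failures are not reported. When another retry would land
    on or past the deadline, the last real exception is reported once and
    re-raised.
    """
    deadline = clock() + retry_for
    attempts = 0
    while True:
        attempts += 1
        try:
            return job()
        except Exception as exc:
            if clock() + backoff >= deadline:
                report(exc, attempts)  # the single "final" alert
                raise
            sleep(backoff)


if __name__ == "__main__":
    now = [0]
    reports = []

    def always_down():
        raise TimeoutError("Operation timed out")

    try:
        run_with_retries(
            always_down,
            lambda exc, n: reports.append((str(exc), n)),
            retry_for=600,
            backoff=120,
            clock=lambda: now[0],
            sleep=lambda s: now.__setitem__(0, now[0] + s),
        )
    except TimeoutError:
        pass
    print(reports)
```

The point of the design is that the alert carries the original curl error rather than a generic "too many attempts" message, because the reporter is called with the exception that caused the last failure.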

Why $time from $lock=Cache::lock('name', $time) should be greater than the updating Cache time?

I placed this code inside a Route::get() method only to test it quicker. So this is how it looks:
use Illuminate\Support\Facades\Cache;
Route::get('/cache', function(){
    $lock = Cache::lock('test', 4);
    if($lock->get()){
        Cache::put('name', 'SomeName'.now());
        dump(Cache::get('name'));
        sleep(5);
        // dump('inside get');
    }else{
        dump('locked');
    }
    // $lock->release();
});
If you reach this route from two browsers at (almost) the same time, both respond with the result of dump(Cache::get('name'));. Shouldn't the second browser's response be "locked", because $lock->get() is supposed to return false there? When the second browser reaches this route, the lock should still be set.
That same code works just fine if the code after $lock = Cache::lock('test', 4) takes less than 4 seconds to execute. If you call sleep($sec) with $sec < 4, the first browser to reach this route responds with the result of Cache::get('name') and the second browser responds with "locked", as expected.
Can anyone explain why this is happening? Isn't any get() call on that lock, except the first one, supposed to return false for as long as the lock is set? I used two different browsers, but it works the same with two tabs of the same browser too.
Quote from the 5.6 docs https://laravel.com/docs/5.6/cache#atomic-locks:
To utilize this feature, your application must be using the memcached or redis cache driver as your application's default cache driver. In addition, all servers must be communicating with the same central cache server.
Quote from the 5.8 docs https://laravel.com/docs/5.8/cache#atomic-locks:
To utilize this feature, your application must be using the memcached, dynamodb, or redis cache driver as your application's default cache driver. In addition, all servers must be communicating with the same central cache server.
Quote from the 8.0 docs https://laravel.com/docs/8.x/cache#atomic-locks:
To utilize this feature, your application must be using the memcached, redis, dynamodb, database, file, or array cache driver as your application's default cache driver. In addition, all servers must be communicating with the same central cache server.
Apparently, they have been adding support for more drivers to make use of this lock functionality. Check which Cache driver you are using and if it fits the support list of your Laravel version.
There is likely an atomicity issue here where the cache driver you are using is not able to lock a file atomically. What should happen is that when a process (i.e. a PHP request) is writing to the lock file, all other processes requiring the lock file should at least wait until the lock file is available to be read again. If not, they read the lock file before it has been written to, which obviously causes a race condition.
Coming back to this question I asked: I can now say that the problem was not caused by the atomic lock. The problem is the sleep call. If the time passed to sleep is longer than the lock's lifetime, then by the time the next request is able to hit the route, the lock will already have expired (been released). To see why, say you have defined a route like this:
Route::get('case/{value}', function($value){
    if($value){
        dump('hit-1');
    }else{
        sleep(5);
        dump('hit-0');
    }
});
And you open two browser tabs with the same URL that hits this route something like:
127.0.0.1:8000/case/0
and
127.0.0.1:8000/case/1
You will see that the first request takes 5 seconds to finish and, even if the second request is sent at almost the same time as the first, it still waits for the first one to finish before it runs. This means the second request takes the 5 seconds of the first request plus its own run time.
Back to the original question: the lock will have expired by the time the second request gets to run the $lock->get() statement.
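The timing argument can be checked with a small model. The Python sketch below assumes, as the answer does, that the two requests are served one after the other, so the second $lock->get() only runs once the first request has finished sleeping. TtlLock and two_back_to_back_requests are hypothetical names, and the clock is simulated rather than real.

```python
class TtlLock:
    """In-memory stand-in for a cache lock that expires after `ttl` seconds."""

    def __init__(self, ttl, clock):
        self.ttl = ttl
        self.clock = clock
        self.expires_at = None

    def get(self):
        now = self.clock()
        if self.expires_at is not None and now < self.expires_at:
            return False  # still held by an earlier caller
        self.expires_at = now + self.ttl
        return True


def two_back_to_back_requests(ttl, work_seconds):
    """Serve two requests sequentially; each tries the lock, then 'sleeps'."""
    now = {"t": 0}
    lock = TtlLock(ttl, lambda: now["t"])
    results = []
    for _ in range(2):
        if lock.get():
            now["t"] += work_seconds  # stands in for sleep()
            results.append("acquired")
        else:
            results.append("locked")
    return results


if __name__ == "__main__":
    print(two_back_to_back_requests(ttl=4, work_seconds=5))
    print(two_back_to_back_requests(ttl=4, work_seconds=3))
```

With a 4 second lock and 5 seconds of work, the second request starts at t = 5, after the lock expired at t = 4, so it acquires the lock too. With 3 seconds of work it starts at t = 3 and is refused.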

Laravel Jobs fail on Redis when attempting to use throttle

End Goal
The aim is for my application to fire off potentially a lot of emails to the Redis queue (this bit is working) and then have Redis throttle the processing of these to a set number of emails every selected number of minutes.
For this example, I have a test job that appends the time to a file and I am attempting to throttle it to once every 60 seconds.
The story so far....
So far, I have the application successfully pushing a test amount of 50 jobs to the Redis queue. I can log in to Horizon and see these 50 jobs in the "processjob" queue. I can also log in to redis-cli and see 50 entries under the list key "queues:processjob".
My issue is that as soon as I attempt to put the throttle on, only 1 job runs and the rest fail with the following error:
Predis\Response\ServerException: ERR Error running script (call to f_29cc07bd431ccbf64637e5dcb60484560fdfa2da): #user_script:10: WRONGTYPE Operation against a key holding the wrong kind of value in /var/www/html/smhub/vendor/predis/predis/src/Client.php:370
If I remove the throttle, everything works fine and 5 jobs are run instantly.
I thought maybe it was the incorrect key name but if I change the following:
public function handle()
{
    //
    Redis::throttle('queues:processjob')->allow(1)->every(60)->then(function(){
        Storage::disk('local')->append('testFile.txt',date("Y-m-d H:i:s"));
    }, function (){
        return $this->release(10);
    });
}
to this:
public function handle()
{
    //
    Redis::funnel('queues:processjob')->limit(1)->then(function(){
        Storage::disk('local')->append('testFile.txt',date("Y-m-d H:i:s"));
    }, function (){
        return $this->release(10);
    });
}
then it all works fine.
My thoughts...
Something tells me that the issue is that the redis key is of type "list" and that the jobs are all under a single list. That being said, if it didn't work this way, how would we throttle a queue as the throttle requires a unique key.
For anybody else that is having issues attempting to get this to work and is getting the same issue as I was, this is what resolved my issues:
The Fault
I assumed that Redis::throttle('queues:processjob') was meant to be referring to the queue that you wanted to be throttled. However, after some re-reading of the documentation and testing of the code, I realized that this was not the case.
The Fix
Redis::throttle('queues:processjob') expects the name of the throttle's own Redis key, so it must be a key name that nothing else is using. Therefore, changing it to Redis::throttle('throttle:queues:processjob') worked fine for me.
The workings
When I first looked into this, I assumed that Redis::throttle('this') throttled the queue that you specified. That is not how it works: the name is only a label for the limiter, and it says nothing about which queue the jobs are on.
As far as I can tell, Redis::throttle('this') keeps its own bookkeeping for the current time window under the key 'this'. The queue itself already lives under 'queues:processjob' as a Redis list, so pointing the throttle at that same key makes the throttle script operate on a value of the wrong type, which is the WRONGTYPE error above. The jobs never leave their queue: a job that is refused by the throttle is simply put back by $this->release(10) and tried again later.
I hope this helps!
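The WRONGTYPE error is what Redis returns when a command for one data type is run against a key that holds another. The Python sketch below models that with a tiny in-memory store. It assumes that the throttle keeps a counter in a hash under its key while the queue keeps its jobs in a list; TinyStore, WrongTypeError, and allow are hypothetical names, and the time window itself is deliberately not modelled.

```python
class WrongTypeError(Exception):
    """Raised when a key already holds a value of another type."""


class TinyStore:
    """Minimal key-value store that, like Redis, types each key."""

    def __init__(self):
        self.data = {}

    def rpush(self, key, value):
        current = self.data.setdefault(key, [])
        if not isinstance(current, list):
            raise WrongTypeError(key)
        current.append(value)

    def hincrby(self, key, field, amount):
        current = self.data.setdefault(key, {})
        if not isinstance(current, dict):
            raise WrongTypeError(key)
        current[field] = current.get(field, 0) + amount
        return current[field]


def allow(store, limiter_key, max_per_window):
    """Count one attempt against the limiter and say whether it may run."""
    return store.hincrby(limiter_key, "count", 1) <= max_per_window


if __name__ == "__main__":
    store = TinyStore()
    store.rpush("queues:processjob", "job-1")  # the queue owns this key as a list
    try:
        allow(store, "queues:processjob", 1)  # limiter reuses the queue's key
    except WrongTypeError as exc:
        print("WRONGTYPE on", exc)
    print(allow(store, "throttle:queues:processjob", 1))  # separate key works
```

Giving the limiter a key of its own, as in the fix above, avoids the collision without changing which queue the jobs are on.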

spring integration message released twice from aggregator

I have a Spring Integration flow that starts with a file inbound channel adapter, which picks up files and passes them through the system as messages.
After a few components, the messages are aggregated at an "Aggregator", from which they are released based on release strategies or by a group timeout of 30 seconds.
The downstream processing has another bunch of components up to the final one.
The problem I am facing is this:
When I send 33 files, which create 33 "groups/buckets" based on correlation IDs at the "Aggregator", some of the files or messages seem to be "released" twice. I conclude this because I have a channel interceptor which shows a few messages passing through the "released" channel (right after the aggregator) a second time, after they have already completed the downstream processing successfully the first time. Additionally, this behavior causes my application to not find a file and throw an exception, which I see. This leads me to conclude that the message bucket/group/corrID is somehow being "released" twice.
I have tried to debug this in many ways, but essentially I want to know how a corrID/bucket, after being released and having successfully gone through all downstream components in a single thread, can be "released" again.
My question is, how can I debug this? I want to know what is making this message/bucket reappear in the aggregator.
My aggregator is as follows,
<int:aggregator id="bufferedFiles" input-channel="inQueueForStage"
    output-channel="released" expire-groups-upon-completion="true"
    send-partial-result-on-expiry="true" release-strategy="releaseHandler"
    release-strategy-method="canRelease"
    group-timeout-expression="size() > 0 ? T(com.att.datalake.ifr.loader.utils.MessageUtils).getAggregatorTimeout(one, #sourceSnapshot) : -1">
    <int:poller fixed-delay="${files.pickup.delay:3000}"
        max-messages-per-poll="${num.files.pickup.per.poll:10}"
        task-executor="executor" />
</int:aggregator>
Explanation of the aggregator: the size() > 0 check applies to EACH correlation bucket. Each of the 33 files I am sending creates a new bucket because of its file name, so the aggregator will have 33 buckets/groups/corrIds, and each bucket will contain only one file.
So the aggregator's SpEL expression simply says that if no release strategy fires, the bucket/group is released after 30 seconds, provided the group has at least one file.
My Channel inbound adapter is as follows:
<int-file:inbound-channel-adapter id="files"
    channel="dispatchFiles" directory="${source.dir}" scanner="directoryScanner">
    <int:poller fixed-delay="${files.pickup.delay:3000}"
        max-messages-per-poll="${num.files.pickup.per.poll:10}" />
</int-file:inbound-channel-adapter>
Logs
Here is the log of a message completing the flow the first time. The completion time logged suggests it reached the last component, a "completionHandler" SA.
Explanation of the log: "cor" is the bucket/corrId that is being released twice. The reason I get the final exception is that, the first time, the file is removed from its original location and processed. So the second time around, when this erroneous release happens, there is nothing left to process there.
From the pictures it can be seen that the first batch/corrId/bucket is processed and finished around 11:09, and the second one is started around 11:10.
An important point: I noticed that this behavior only happens when I have a global channel interceptor in which I am doing somewhat long processing. When this interceptor is commented out, the errors go away.
Question:
Is it possible for the aggregator to double-release a batch/corrId under any circumstances? How can I make the aggregator emit logs?
Thanks
Edit 10:15pm
My channel following the aggregator has an interceptor as follows,
public Message<?> preSend(Message<?> message, MessageChannel channel) {
    LOGGER.info("******** Releasing from aggregator(interceptor) , corrID:{} at time:{} ********", MessageUtils.getCorrelationId(message), new Date());
    finalReporter.callback(channel.toString(), message);
    return message;
}
From the aggregator down to the final completionHandler SA, I have single-threaded processing:
Aggregator -> releasedChannel -> some SA1 -> some channel -> ..... -> completionChannel->completeSA
When I run for 33 partitions, let's follow corrId = "alh". The first time it is released, it looks like the following,
What it shows is that thread-5 released it and should process all the downstream components. But the thread leaves it midway and starts doing other things, and the message is picked up again by a different thread a little later, as follows,
That seems to be the problem.
Solution Update:
I did the following 3 things as a workaround for the moment:
For some reason, my interceptors were doing return super.preSend(message, channel) instead of simply return message. I changed them to the latter.
I had global channel interceptors. I removed the global ones and kept the individual ones.
If the channel interceptors had any issues before returning, would that cause a new release?
Although I still see the scenario depicted in the pictures above, I am no longer getting double processing attempts, so the errors are avoided. I am still trying to make sense of this.
I understand it's too specific and difficult to explain; still thanks for the time and comments...
However, yes, I think @GaryRussell is right: since you use expire-groups-upon-completion="true", some partial groups may be released by the group-timeout-expression, and new messages with the same correlationId will then form a new group, which is released by the next group timeout. Your size() > 0 condition isn't good either: it means a partial group will be released after that group timeout. Maybe size() > 1? A live group can't have size() == 0, because the group is created on the first message, so if the group exists it contains at least one message. A group can be empty, but only when the aggregator is configured with expire-groups-upon-completion="false": in that case the released group is kept, marked as completed, and doesn't accept new messages.
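The effect of expire-groups-upon-completion described here can be illustrated with a toy model. The Python below is a simplified simulation, not Spring Integration code: TinyAggregator is a hypothetical class, the timeout is measured from group creation, and the only release strategy modelled is the group timeout.

```python
class TinyAggregator:
    """Toy aggregator that releases a group once it is `group_timeout` old."""

    def __init__(self, group_timeout, expire_groups_upon_completion):
        self.group_timeout = group_timeout
        self.expire = expire_groups_upon_completion
        self.groups = {}        # correlation id -> (created_at, messages)
        self.completed = set()  # released ids that are kept as "completed"
        self.released = []      # (correlation id, messages), in release order

    def receive(self, corr_id, message, now):
        if corr_id in self.completed:
            return False  # late arrival for a completed group is discarded
        _, messages = self.groups.setdefault(corr_id, (now, []))
        messages.append(message)
        return True

    def tick(self, now):
        """Release every group whose timeout has elapsed."""
        for corr_id, (created_at, messages) in list(self.groups.items()):
            if now - created_at >= self.group_timeout:
                self.released.append((corr_id, list(messages)))
                del self.groups[corr_id]
                if not self.expire:
                    self.completed.add(corr_id)


if __name__ == "__main__":
    agg = TinyAggregator(group_timeout=30, expire_groups_upon_completion=True)
    agg.receive("alh", "file-1", now=0)
    agg.tick(now=30)                      # first release of "alh"
    agg.receive("alh", "file-1", now=31)  # same correlation id shows up again
    agg.tick(now=61)                      # forms a new group, released again
    print([corr_id for corr_id, _ in agg.released])
```

In this model, any message that re-enters the aggregator with an already released correlation id produces a second release when groups are expired on completion, and is discarded when they are not.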
After struggling with debugging and various blind scenarios, I believe that at least I have a workaround and a possible root cause. I will try to outline all the things that I modified,
Root Cause:
My interceptors were calling a Common class with a common callback method. This method, based on the channel name from which the request was coming from, would decide the appropriate action to take. The actions were essentially collecting data, incrementing counters and persisting to database some information.
It seems that some of them were having errors and, consequently, the thread was dying and the message was re-released. I am not entirely sure about this, so please correct me if that's not the case.
But after I fixed those errors, the re-release issue seems to have subsided or vanished altogether.
The reason it was hard to diagnose is that I could not see the errors thrown during the callback method invocations; maybe I was catching them, or maybe they were lost.
I also found that the issue only occurred with channel interceptors AFTER the aggregator. Interceptors before the aggregator did not present any issues, maybe because they were simpler...
To debug,
I removed the interceptors and made the callback directly from various components (SAs), removed global interceptors and tried to add individual interceptors for specific channels.
Thanks for all the help.

Basic Sidekiq Questions about Idempotency and functions

I'm using Sidekiq to perform some heavy processing in the background. I looked online but couldn't find the answers to the following questions. I am using:
Class.delay.use_method(listing_id)
And then, inside the class, I have:
def self.use_method(listing_id)
  listing = Listing.find_by_id listing_id
  UserMailer.send_mail(listing)
  Class.call_example_function()
end
Two questions:
How do I make this function idempotent for the UserMailer sendmail? In other words, in case the delayed method runs twice, how do I make sure that it only sends the mail once? Would wrapping it in something like this work?
mail_sent = false
if !mail_sent
  UserMailer.send_mail(listing)
  mail_sent = true
end
I'm guessing not, since when the function is tried again mail_sent is set to false for the second run. So how do I make sure that UserMailer is only run once?
Are functions called within the delayed async method also asynchronous? In other words, is Class.call_example_function() executed asynchronously (not part of the response / request cycle?) If not, should I use Class.delay.call_example_function()
Overall, just getting familiar with Sidekiq so any thoughts would be appreciated.
Thanks
I'm coming into this late, but having been around the loop and had this StackOverflow entry appearing prominently via Google, it needs clarification.
The issue of idempotency and the issue of unique jobs are not the same thing. The 'unique' gems look at the parameters of job at the point it is about to be processed. If they find that there was another job with the same parameters which had been submitted within some expiry time window then the job is not actually processed.
The gems are literally what they say they are; they consider whether an enqueued job is unique or not within a certain time window. They do not interfere with the retry mechanism. In the case of the O.P.'s question, the e-mail would still get sent twice if Class.call_example_function() threw an error thus causing a job retry, but the previous line of code had successfully sent the e-mail.
Aside: The sidekiq-unique-jobs gem mentioned in another answer has not been updated for Sidekiq 3 at the time of writing. An alternative is sidekiq-middleware which does much the same thing, but has been updated.
https://github.com/krasnoukhov/sidekiq-middleware
https://github.com/mhenrixon/sidekiq-unique-jobs (as previously mentioned)
There are numerous possible solutions to the O.P.'s email problem and the correct one is something that only the O.P. can assess in the context of their application and execution environment. One would be: If the e-mail is only going to be sent once ("Congratulations, you've signed up!") then a simple flag on the User model wrapped in a transaction should do the trick. Assuming a class User accessible as an association through the Listing via listing.user, and adding in a boolean flag mail_sent to the User model (with migration), then:
listing = Listing.find_by_id(listing_id)
unless listing.user.mail_sent?
  User.transaction do
    listing.user.mail_sent = true
    listing.user.save!
    UserMailer.send_mail(listing)
  end
end
Class.call_example_function()
...so that if the user mailer throws an exception, the transaction is rolled back and the change to the user's flag setting is undone. If the "call_example_function" code throws an exception, then the job fails and will be retried later, but the user's "e-mail sent" flag was successfully saved on the first try so the e-mail won't be resent.
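The same flag-in-a-transaction idea can be tried outside Rails. The Python sketch below uses the standard sqlite3 module with an in-memory database. The users table, the mail_sent column, and the make_db and send_once helpers are stand-ins invented for the example, not part of the original application.

```python
import sqlite3


def make_db():
    """Create an in-memory database with one user whose mail is not yet sent."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, mail_sent INTEGER NOT NULL DEFAULT 0)"
    )
    conn.execute("INSERT INTO users (id) VALUES (1)")
    conn.commit()
    return conn


def send_once(conn, user_id, send_mail):
    """Set the flag and send the mail in one transaction; skip if already sent."""
    row = conn.execute(
        "SELECT mail_sent FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    if row[0]:
        return False  # an earlier run already sent the mail
    with conn:  # commits on success, rolls back if send_mail raises
        conn.execute("UPDATE users SET mail_sent = 1 WHERE id = ?", (user_id,))
        send_mail(user_id)
    return True


if __name__ == "__main__":
    conn = make_db()
    sent = []
    print(send_once(conn, 1, sent.append))  # first run sends
    print(send_once(conn, 1, sent.append))  # a retry skips the mail
    print(sent)
```

One gap remains in any version of this pattern: if the process dies after the mail has gone out but before the commit, the flag is rolled back and a retry would send the mail again.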
Regarding idempotency, you can use https://github.com/mhenrixon/sidekiq-unique-jobs gem:
All that is required is that you specifically set the sidekiq option for unique to true like below:
sidekiq_options unique: true
For jobs scheduled in the future it is possible to set for how long the job should be unique. The job will be unique for the number of seconds configured or until the job has been completed.
If you want the unique job to stick around even after it has been successfully processed then just set the unique_unlock_order to anything except :before_yield or :after_yield (unique_unlock_order = :never)
I'm not sure I understand the second part of the question. When you delay a method call, the whole method call is deferred to the Sidekiq process. If by 'response / request cycle' you mean that you are running a web server and you call delay from there, then all the calls within use_method are made from the Sidekiq process, and hence outside of that cycle. They are called synchronously relative to each other, though...

Resources