We have MariaDB 10.1, beanstalkd 1.10 and Laravel 4.2.
We have one query that runs successfully without the queue, but when we run it through beanstalkd it has no effect and we get a 'MySQL server has gone away' error in the log file.
config:
wait_timeout = 120
max_allowed_packet = 1024M
Why is the behavior different with and without the queue?
We had similar issues. The cause was either that the code was running in a different thread and the connection was being lost, or some strange garbage collection that closed the connection in long-running processes.
Anyway, what we implemented is:
- when a job is reserved and starts processing we reconnect the DB always
- when we detect a connection gone away, we release the job (so it will be picked up again)
If the connection is lost in the middle of the processing flow, you may want to reconnect and discard the work done so far on that job, provided the job is transactional.
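The two rules above can be sketched roughly as follows. This is only an illustration in Python, not Laravel or beanstalkd code: `FakeDb`, `ConnectionLost`, `process` and the `released` list are hypothetical stand-ins for the real database handle, the driver's "server has gone away" error, the job handler and the queue's release call.

```python
class ConnectionLost(Exception):
    """Stand-in for a driver error such as 'MySQL server has gone away'."""


class FakeDb:
    """Hypothetical DB handle; a real worker would wrap its driver here."""

    def __init__(self):
        self.reconnects = 0
        self.fail_next = False  # set to True to simulate a dropped connection

    def reconnect(self):
        self.reconnects += 1

    def execute(self, sql):
        if self.fail_next:
            self.fail_next = False
            raise ConnectionLost("server has gone away")
        return "ok"


def process(job, db, released):
    """Handle one reserved job; return True if it completed."""
    db.reconnect()  # always start the job on a fresh connection
    try:
        db.execute(job)
        return True
    except ConnectionLost:
        released.append(job)  # hand the job back so it is picked up again
        return False
```

Reconnecting at the start of every job is cheap compared with a failed job, and it sidesteps the case where a long-lived worker sits idle for longer than wait_timeout between jobs.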
I am using SchedulerLock in Spring Boot, and I am running 2 servers.
What I'm curious about is why the "lockAtMostFor" option exists.
Take an example: on one of my 2 servers, the schedule runs first and then locks.
But something went wrong while running, and my server went down.
At this point, my scheduled task has ended without completing.
Any guide I read is full of vague answers about "lock time in case a node dies".
When a node dies, it can no longer execute schedules.
But why keep holding a LOCK for a dead node?
Even if I urgently try to manually execute the schedule on the 2nd server, it is impossible to manually execute it because of the above lock.
What does this option exist for?
Our application is a Spring Batch job running in OpenShift. The application calls another service via REST to fetch records from a database. Both use an nginx sidecar to handle the traffic. Both sidecars restarted for some reason and the Spring Batch job terminated suddenly. I have already implemented a retry mechanism using @Retryable, but the logic never even reached the retry part. The only log I found in the application is given below.
"Encountered an error executing step myPartitionStep in job myJob","level":"ERROR","thread":"main","logClass":"o.s.batch.core.step.AbstractStep","logMethod":"execute","stack_trace":"o.s.b.core.JobExecutionException: Partition handler returned an unsuccessful step
o.s.b.c.p.support.PartitionStep.doExecute(PartitionStep.java:112)
o.s.batch.core.step.AbstractStep.execute(AbstractStep.java:208)
o.s.b.core.job.SimpleStepHandler.handleStep(SimpleStepHandler.java:152)
o.s.b.c.job.flow.JobFlowExecutor.executeStep(JobFlowExecutor.java:68)
o.s.b.c.j.f.s.state.StepState.handle(StepState.java:68)
o.s.b.c.j.f.support.SimpleFlow.resume(SimpleFlow.java:169)
o.s.b.c.j.f.support.SimpleFlow.start(SimpleFlow.java:144)
o.s.batch.core.job.flow.FlowJob.doExecute(FlowJob.java:137)
o.s.batch.core.job.AbstractJob.execute(AbstractJob.java:320)
o.s.b.c.l.s.SimpleJobLauncher$1.run(SimpleJobLauncher.java:149)
o.s.core.task.SyncTaskExecutor.execute(SyncTaskExecutor.java:50)
o.s.b.c.l.s.SimpleJobLauncher.run(SimpleJobLauncher.java:140)
j.i.r.NativeMethodAccessorImpl.invoke0(NativeMethodAccessorImpl.java)
j.i.r.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
j.i.r.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:566)
o.s.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:344)
o.s.a.f.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:198)
o.s.a.f.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
o.s.b.c.c.a.SimpleBatchConfiguration$PassthruAdvice.invoke(SimpleBatchConfiguration.java:128)
... 13 frames truncated\n"
I am not able to pinpoint the exact reason for this error. It stopped at the partition step, which uses an ItemReader to call another service and fetch the records, and a FlatFileItemWriter to write them. We cannot afford to have duplicates in our file. Is it possible to restart the app exactly where it stopped, without producing duplicates?
The stacktrace you shared is truncated, so it is not possible to see the root cause from what you shared.
Spring Batch supports restarting a failed partitioned step, as long as you use a persistent job repository. You need to restart the same job instance, i.e. use the same job parameters that you used in your first (failed) run. Only failed partitions will be rerun, and each failed partition will resume from where it left off.
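To illustrate the restart semantics described above, here is a minimal sketch in Python. It is not the Spring Batch API: `run_job`, the dict used as a job repository, and the `fail_at` parameter are all hypothetical, and the per-item checkpoint is a simplification (Spring Batch saves its state at chunk boundaries).

```python
def run_job(repo, params, partitions, fail_at=None):
    """Run or restart the job instance identified by `params`.

    `repo` is a plain dict standing in for a persistent job repository.
    `fail_at` is a hypothetical (partition, index) pair that simulates a crash.
    Returns the items written during this run and whether the job completed.
    """
    state = repo.setdefault(
        params, {name: {"done": False, "pos": 0} for name in partitions}
    )
    written = []
    for name, items in partitions.items():
        part = state[name]
        if part["done"]:
            continue  # completed partitions are not rerun
        for i in range(part["pos"], len(items)):
            if fail_at == (name, i):
                return written, False  # simulated failure; checkpoint stays at i
            written.append(items[i])
            part["pos"] = i + 1  # checkpoint after each written item
        part["done"] = True
    return written, True
```

Calling `run_job` a second time with the same `params` finds the stored state and continues from the saved position; calling it with different `params` would create a new job instance and start from scratch, which is where duplicates would come from.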
I was running some tests to understand MaxMemory-Reserved and MaxMemory-Policy, and we hit the “Server Closed the connection” error a few times when the Redis DB was almost full. Here are the details:
1) Created the Redis Cache with Standard C1(1 GB) tier and chose “allkeys-lru” and max-memory-reserved as 50 MB
2) Ran the Redis Benchmark tool to add the Keys in Redis DB to make sure Redis DB is almost full.
3) As soon as the DB reached around 960-980 MB, we ran the Benchmark tool again to add some more keys and got the following error. In which scenarios can this error occur?
Note: The Connected_Clients value was 0 when we ran the info command just before we encountered this error.
4) At same time ran the info command on Azure Portal Console and got the output as “Error”.
5) This error lasted for approximately 2-3 minutes, and we were able to add keys after that. When we ran the info command again, we got the following stats. Here we see that the difference between used_memory and used_memory_rss is around 76 MB. Do you think the above error could be because of this?
info
Server
redis_version:3.2.3
redis_mode:standalone
os:Windows
arch_bits:64
multiplexing_api:winsock_IOCP
hz:10
Clients
connected_clients:2
client_longest_output_list:0
client_biggest_input_buf:0
client_total_writes_outstanding:0
client_total_sent_bytes_outstanding:0
blocked_clients:0
Memory
used_memory:968991592
used_memory_human:924.10M
used_memory_rss:1049776128
used_memory_rss_human:1001.14M
used_memory_peak:1070912296
used_memory_peak_human:1021.30M
used_memory_lua:37888
maxmemory:1100000000
maxmemory_human:1.02G
maxmemory_policy:allkeys-lru
mem_allocator:jemalloc-3.6.0
Most likely you are running into the scenario of a high number of unauthenticated connections. Redis-benchmark first creates all the client connections (in your case, -c 400 connections) and only then authenticates them. The delay in auth causes a high number of unauthenticated connections from a single IP, and Azure Redis Cache closes them for DoS protection. Hence the error “Server closed the connection”.
You can try the redis-benchmark from here, which I have modified to authenticate as soon as a connection has been made and should solve this issue.
I noticed that some workflows stop working because the connection pool loses its connection to Hive. If I stop and then restart the connection pool, everything works. How can I avoid this problem, which occurs only with Hive? Has anybody else faced this problem?
This is a bug: DBCP currently assumes the connections are valid, so it can return a dead connection the next time the processor runs, causing the error you are seeing.
I think HiveConnectionPool would benefit from the same fix in NIFI-2381, and perhaps other DBCP settings to ensure only valid connections are returned/borrowed. I have written NIFI-2927 to cover this.
Unfortunately I believe the only workaround at present is to restart NiFi.
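For illustration, the idea of returning only valid connections can be sketched as a pool that checks each connection before lending it out. This is a generic Python sketch, not NiFi or DBCP code; `Pool`, `borrow`, `give_back`, `factory` and `is_valid` are hypothetical names.

```python
class Pool:
    """Tiny pool that validates a connection before lending it out."""

    def __init__(self, factory, is_valid):
        self._factory = factory    # creates a new connection
        self._is_valid = is_valid  # e.g. runs a cheap validation query
        self._idle = []

    def borrow(self):
        while self._idle:
            conn = self._idle.pop()
            if self._is_valid(conn):
                return conn
            # stale connection: drop it and try the next idle one
        return self._factory()  # nothing usable left, open a new connection

    def give_back(self, conn):
        self._idle.append(conn)
```

The cost is one validation call per borrow; in exchange, a connection that died while idle is discarded instead of being handed to the next caller.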
I am running a banking program, coded in Oracle PL/SQL. This program runs for 2-3 hours every day, as part of the End of Day processing.
Until yesterday, it was working fine. When I ran it today, after around 30 minutes the program terminated with the error ORA-01013: user requested cancel of current operation. I am not terminating the program manually.
I feel this could be a problem with a setting in WebLogic (where the application is deployed). I am not fluent in WebLogic and am not sure which parameter can be changed to stop this error. Please help!
Oracle version: 11.2.0.3
Oracle weblogic server: 11g
This sounds like a JDBC timeout. From the WebLogic console go to Services->Data Sources and click the name of your data source to see its settings. Click the Connection Pool tab, and expand the Advanced section at the bottom of the page. Look for the Statement Timeout setting.
From the documentation:
When Statement Timeout is set to -1, (the default) statements do not timeout.
The behaviour you're seeing suggests the timeout is set to 1800 seconds, if it's timing out after 30 minutes.
However, this won't have changed on its own, and if it was already set then it was being ignored previously, which would need some investigation. Has anything else changed?
Another possibility is that your code makes several calls within the 2-3 hour window and one of them is now exceeding the timeout on its own, which might be the case if the timeout is lower than 1800 seconds. Without seeing your code or the current timeout value I'm just guessing, obviously.
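If you want to check the second possibility, one rough approach is to time each call separately and compare it with the configured timeout. The sketch below is Python rather than PL/SQL or WebLogic code, and `timed_steps` and its parameters are hypothetical; it only shows the bookkeeping.

```python
import time


def timed_steps(steps, timeout_s, clock=time.monotonic):
    """Run each (name, fn) step and report those slower than timeout_s."""
    slow = []
    for name, fn in steps:
        start = clock()
        fn()
        elapsed = clock() - start
        if elapsed > timeout_s:
            slow.append((name, elapsed))
    return slow
```

This only helps while the timeout is raised or disabled (for example in a test environment), since a statement that actually hits the limit is cancelled before it finishes.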