Hive job gets killed and query execute() remains hanging - amazon-ec2

I am using hive-jdbc-0.7.1-cdh3u5.jar. I have some memory-intensive queries running on EMR which occasionally fail. When I look at the job tracker I see that the query has been killed, with the following error:
java.io.IOException: Task process exit with nonzero status of 137
However, the Hive JDBC driver execute() call does not detect this; it is simply left hanging and no exception is raised. Any ideas? Thanks:
ST stQuery = MY_QUERY;
try {
    Statement stmt = conn.createStatement();
    // Hangs here without knowing that the job has been killed. Exception does not get raised.
    stmt.execute(stQuery.render());
}
catch (SQLException sqle) {
    sqle.printStackTrace();
    log.error("Failed to run query");
    return;
}

This is probably because Hadoop kills a task after 10 minutes (600 seconds) if it has not reported any progress in that time. By setting the parameter mapred.task.timeout=0 we can stop Hadoop from killing tasks that run for more than 10 minutes without reporting progress.
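If the query genuinely needs that long without reporting progress, a minimal sketch (assuming your Hive version accepts per-session SET commands over JDBC, which CDH3-era Hive generally does) is to set the property on the same connection before executing the query; the value is in milliseconds and 0 disables the timeout:
// Sketch: relax the task timeout for this Hive session only.
Statement stmt = conn.createStatement();
stmt.execute("SET mapred.task.timeout=0");
// or raise it instead of disabling it, e.g. to 30 minutes:
// stmt.execute("SET mapred.task.timeout=1800000");
stmt.execute(stQuery.render());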
Also, in these cases one can write the mapper/reducer in such a way that it reports progress on a regular basis (more frequently than every 10 minutes). This may be achieved in a number of ways (see the sketch after this list):
Call setStatus() on Reporter to set a human-readable description of the task's progress
Call incrCounter() on Reporter to increment a user counter
Call progress() on Reporter to tell Hadoop that your task is still there (and making progress)
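As a minimal sketch using the old org.apache.hadoop.mapred API that ships with CDH3 (the class, status message and counter names here are illustrative, not from the original question):
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class LongRunningMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    @Override
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        // ... expensive per-record work here ...

        reporter.progress();                                  // tell Hadoop the task is still alive
        reporter.setStatus("processing offset " + key.get()); // human-readable status
        reporter.incrCounter("MyJob", "RecordsProcessed", 1); // bump a user counter
    }
}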

Related

I'm executing the script for 1 hr but it is running for 10 min

I'm executing the script for 1 hour but it only runs for 10 minutes. I also checked Loop Forever, the test data is correct, and the whole script runs without any error. I ran the script three times and validated everything, but I still can't see why this is happening.
How do I overcome this issue and make the script run for the full hour?
Normally you can find the reason for the termination of a thread or of the whole test in the jmeter.log file. If it is not there, or it's vague, you can increase the JMeter logging level to something more verbose.
The most common reasons for premature end of the test are:
Not enough loops to cover the anticipated duration of the test in the Thread Group
Not enough test data if CSV Data Set Config is configured to stop thread on EOF
Thread group is configured to stop the thread/test on a sampler error
There is Flow Control Action sampler somewhere configured to stop thread/test
There is a runtime error like OutOfMemoryError or StackOverflowError

Running out of SQL connections with Quarkus and hibernate-reactive-panache

I've got a Quarkus app which uses hibernate-reactive-panache to run some queries, then process the result and return JSON via a REST call.
For each REST call, 5 DB queries are executed; the last one loads about 20k rows:
public Uni<GraphProcessor> loadData(GraphProcessor graphProcessor) {
    return myEntityRepository.findByDateLeaving(graphProcessor.getSearchDate())
        .select().where(graphProcessor::filter)
        .onItem().invoke(graphProcessor::onNextRow)
        .collect().asList()
        .onItem().invoke(g -> log.info("loadData - end"))
        .replaceWith(graphProcessor);
}

// In myEntityRepository
public Multi<MyEntity> findByDateLeaving(LocalDate searchDate) {
    LocalDateTime startDate = searchDate.atStartOfDay();
    return MyEntity.find("#MyEntity.findByDate",
            Parameters.with("startDate", startDate).map())
        .stream();
}
This all works fine for the first 4 calls, but on the 5th call I get:
11:12:48:070 ERROR [org.hibernate.reactive.util.impl.CompletionStages:121] (147) HR000057: Failed to execute statement [select <ONE OF THE QUERIES HERE>]: could not load an entity: [com.mycode.SomeEntity#1]: java.util.concurrent.CompletionException: io.vertx.core.impl.NoStackTraceThrowable: Timeout
    at <16 internal lines>
    at io.vertx.sqlclient.impl.pool.SqlConnectionPool$1PoolRequest.lambda$null$0(SqlConnectionPool.java:202) <4 internal lines>
    at io.vertx.sqlclient.impl.pool.SqlConnectionPool$1PoolRequest.lambda$onEnqueue$1(SqlConnectionPool.java:199) <15 internal lines>
Caused by: io.vertx.core.impl.NoStackTraceThrowable: Timeout
I've checked https://quarkus.io/guides/reactive-sql-clients#pooled-connection-idle-timeout and configured
quarkus.datasource.reactive.idle-timeout=1000
That itself did not make a difference.
I then added
quarkus.datasource.reactive.max-size=10
I was able to run 10 REST calls before getting the timeout again. With a pool setting of max-size=20 I was able to run it 20 times. So it does look like each REST call uses up a SQL connection and does not release it again.
Is there something that needs to be done to manually release the connection or is this simply a bug?
The problem was with using @Blocking on a reactive REST method.
See https://github.com/quarkusio/quarkus/issues/25138 and https://quarkus.io/blog/resteasy-reactive-smart-dispatch/ for more information.
So if you have a REST method that returns e.g. Uni or Multi, DO NOT use @Blocking on it. I had initially added it because I received an exception telling me that the thread cannot block; this was due to some CPU-intensive calculations. Adding @Blocking made that exception go away (in dev mode, although another problem popped up in native mode) but caused this SQL pool issue.
The real solution was to use emitOn to move the CPU-intensive method onto a worker thread:
.emitOn(Infrastructure.getDefaultWorkerPool())
.onItem().transform(processor::cpuIntensiveMethod)
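Applied to the code from the question, the endpoint could look roughly like this. It is only a sketch: the method name getGraph, the construction of GraphProcessor and the GraphResult return type of cpuIntensiveMethod are assumptions, Infrastructure comes from io.smallrye.mutiny.infrastructure, and the JAX-RS annotations may be javax.ws.rs or jakarta.ws.rs depending on your Quarkus version:
@GET
public Uni<GraphResult> getGraph() {
    GraphProcessor processor = new GraphProcessor(); // hypothetical construction
    return loadData(processor)                       // the reactive DB load shown in the question
        // Shift only the CPU-heavy step to a worker thread,
        // instead of annotating the whole endpoint with @Blocking.
        .emitOn(Infrastructure.getDefaultWorkerPool())
        .onItem().transform(processor::cpuIntensiveMethod);
}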

Spring Batch: Terminating the current running job

I am having an issue terminating the currently running Spring Batch job. I wrote
Set<Long> executions = jobOperator.getRunningExecutions("Job-Builder");
jobOperator.stop(executions.iterator().next());
in my code after going through the spring documentation.
The problem I am facing is that sometimes the termination of the job happens as expected, and at other times it does not. Every time I call stop on the JobOperator it updates the BATCH_JOB_EXECUTION table. When the termination happens successfully, the status of the job is updated to STOPPED and the jobExecution in my batch process is killed. When it fails, the batch completes the rest of its flows and updates the status to FAILED in the BATCH_JOB_EXECUTION table.
But every time I call stop on the JobOperator I see a message like this in my console:
2020-09-30 18:14:29.780 [http-nio-8081-exec-5] INFO o.s.b.c.l.s.SimpleJobOperator:428 - Aborting job execution: JobExecution: id=33058, version=2, startTime=2020-09-30 18:14:25.79, endTime=null, lastUpdated=2020-09-30 18:14:28.9, status=STOPPING, exitStatus=exitCode=UNKNOWN;exitDescription=, job=[JobInstance: id=32922, version=0, Job=[Job-Builder]], jobParameters=[{date=1601504064263, time=1601504064262, extractType=false, JobId=1601504064262}]
My project has a series of flows and steps within it.
Overall my batch process looks like this:
JobBuilderFactory has 3 flows
Each flow has a stepbuilder and two tasklets.
Each step builder has a partitioner and a chunk-based (chunk size 100) itemReader, itemProcessor and itemWriter.
I am calling the stop method while the very first flow in my JobBuilderFactory is executing. The overall process takes about 30 minutes to complete, so there are roughly 20-25 minutes left from the time I call the stop method; the chunk size is 100 within each and every flow and I am dealing with more than 500k records.
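For reference, a job wired along those lines might look roughly like this (bean and flow names are illustrative, not taken from the actual project; imports from org.springframework.batch are omitted):
@Bean
public Job job(JobBuilderFactory jobBuilderFactory, Flow flow1, Flow flow2, Flow flow3) {
    return jobBuilderFactory.get("Job-Builder")
            .start(flow1)   // each flow: partitioned chunk step (size 100) plus two tasklets
            .next(flow2)
            .next(flow3)
            .end()
            .build();
}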
So, my question is: why does the jobExecution stop sometimes when I call the stop method (which is what I want), and why is it unable to stop the jobExecution the remaining times?
Thanks in advance
So, my question is: why does the jobExecution stop sometimes when I call the stop method (which is what I want), and why is it unable to stop the jobExecution the remaining times?
It's not easy to figure out the reason for that from what you shared, but I can give you a couple of notes about stopping jobs:
jobOperator.stop does not guarantee that the job stops; it only sends a stop signal to the job execution. From what you shared, you are not checking the boolean returned by stop, which indicates whether the signal was correctly sent or not, so you should be doing that first.
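A minimal sketch of checking that return value (jobOperator and log are assumed to be available as in your snippet; the exception types come from org.springframework.batch.core.launch and the handling here is illustrative):
try {
    Set<Long> executions = jobOperator.getRunningExecutions("Job-Builder");
    Long executionId = executions.iterator().next();
    boolean stopMessageSent = jobOperator.stop(executionId);
    if (!stopMessageSent) {
        log.warn("Stop signal was not accepted for execution " + executionId);
    }
} catch (NoSuchJobException | NoSuchJobExecutionException | JobExecutionNotRunningException e) {
    log.error("Could not send stop signal", e);
}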
You did not share your tasklet code, but you need to use StoppableTasklet instead of Tasklet to make sure the stop signal is correctly relayed to your steps.
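A rough sketch of what such a tasklet could look like (the per-iteration work is a placeholder; StoppableTasklet, RepeatStatus, StepContribution and ChunkContext come from Spring Batch):
public class MyStoppableTasklet implements StoppableTasklet {

    private volatile boolean stopped = false;

    @Override
    public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) {
        if (stopped) {
            return RepeatStatus.FINISHED;   // exit early once a stop signal arrived
        }
        doUnitOfWork();                     // placeholder for the real processing
        return RepeatStatus.CONTINUABLE;
    }

    @Override
    public void stop() {
        // invoked by the framework when JobOperator.stop() is called
        stopped = true;
    }

    private void doUnitOfWork() {
        // ...
    }
}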

Hadoop streaming jobs fail to report?

All jobs were running successfully using hadoop-streaming, but all of a sudden I started to see errors caused by one of the worker machines:
Hadoop job_201110302152_0002 failures on master:
Attempt: attempt_201110302152_0002_m_000037_0 | Task: task_201110302152_0002_m_000037 | Machine: worker2 | State: FAILED
Error: Task attempt_201110302152_0002_m_000037_0 failed to report status for 622 seconds. Killing!
Error: Task attempt_201110302152_0002_m_000037_0 failed to report status for 601 seconds. Killing!
Questions:
- Why is this happening?
- How can I handle such issues?
Thank you
The description of mapred.task.timeout, which defaults to 600 seconds (600000 ms), says: "The number of milliseconds before a task will be terminated if it neither reads an input, writes an output, nor updates its status string."
Increasing the value of mapred.task.timeout might solve the problem, but you need to figure out whether more than 600 seconds is actually required for the map task to finish processing its input data, or whether there is a bug in the code which needs to be debugged.
According to the Hadoop best practices, on average a map task should take a minute or so to process an InputSplit.
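If more time really is needed, a sketch of raising the timeout for a single streaming job is to pass the generic -D option before the streaming options (the value is in milliseconds; the streaming jar path and script names below are illustrative and depend on your installation):
hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming.jar \
    -D mapred.task.timeout=1800000 \
    -input /data/input -output /data/output \
    -mapper my_mapper.py -reducer my_reducer.py
Alternatively, a streaming mapper or reducer can report status itself by writing lines of the form reporter:status:<message> to its standard error, which counts as a status update and resets the timeout.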

Quartz.NET: Need CronTrigger on an IStatefulJob instance to *delay instead of skip* if running job while schedule matures

Greetings, your friendly neighborhood Quartz.NET n00b is back!
I have a Windows Service running IStatefulJob instances on a Quartz.NET CronTrigger-based schedule... The CRON string used to schedule the job is: "0 0/1 * * * ? *"
Everything works great. However, if I have a job that is set to run, say, at the X:00 mark of every minute, and that job happens to run for MORE than a minute, I notice that the subsequent job runs IMMEDIATELY after the first one finishes executing, rather than waiting until its next scheduled run, effectively "queuing" up instead of merely skipping the job till its next scheduled run.
I put a CronTrigger MisfireInstruction of DONOTHING on the trigger, but the exact same thing happens when a job overruns its next scheduled execution.
How do I get an IStatefulJob instance to merely SKIP a scheduled execution trigger if it is currently running, rather than have it delay until the current execution completes?
I explicitly set the trigger.MisfireInstruction = MisfireInstruction.CronTrigger.DoNothing;
...But instead of "doing nothing", for a job scheduled to run every minute that takes 90 seconds to complete, I experience the following execution log:
Job runs at 9:00:00am, finishes at 9:01:30am <- job runs for 1:30
Job runs at 9:01:30am, finishes at 9:03:00am <- subsequent job that should have run at 9:01:00
Job runs at 9:04:00am, finishes at 9:05:30am <- shouldn't this one have run at 9:03:00?
Job runs at 9:05:30am, finishes at 9:07:00am <- subsequent job that should have run at 9:05:00
Job runs at 9:08:00am, finishes at 9:09:30am <- shouldn't this have run at 9:07:00?
... it seems like it runs correctly the first time, on the minute... delays for 30 seconds as the 90-second job execution time expires, and then, instead of waiting till the NEXT full minute, EXECUTES IMMEDIATELY at the 30-second mark... Doubly odd is that it then finishes the SECOND job on the minute mark, but waits till the NEXT minute mark to execute instead of running it back-to-back...
Pretty much seems like it works correctly EVERY OTHER RUN, when it is not running on the :30 marks...
What's the best way to get a job not to delay/queue, but to just SKIP until it is idle and the next schedule matures?
EDIT: I tried going back to IJobs instead of IStatefulJobs using the same DONOTHING trigger misfire instruction, but the job executes EVERY MINUTE despite the prior execution still being active. I can't seem to get it to skip a scheduled run if it is currently running with either IJob or IStatefulJob...
EDIT #2: I think that my triggers are NEVER misfiring, which is why DoNothing as a misfire instruction is useless... Given that's the case, I guess I need another mechanism to detect whether a job instance of a schedule is running, to ensure the job SKIPS its next execution and waits for the following scheduled time rather than delaying it until the first instance completes...
EDIT #3: I tried adding an element to the IStatefulJob JobDataMap called "IsRunning"... I set it to TRUE when the execute sequence starts and return it to false after job completion. Before executing, the job checks the element, which is persisted between runs, and quits early (logging "JOB SKIPPED!") if it is true... This unfortunately doesn't work, for probably obvious reasons: if the jobs are running following the bulleted schedule above, the job is never running SIMULTANEOUSLY with itself, since each run is delayed until the previous one ends, so this check is useless. According to the documentation, returning to IJob from IStatefulJob would not help here either, as the JobDataMap is only persisted between runs for the stateful job type...
I still haven't solved how to SKIP a scheduled job instead of delaying it until its current iteration completes... If anyone has ideas, you're a lifesaver! :)
It is most likely caused by the misfireThreshold of RAMJobStore (http://quartznet.sourceforge.net/apidoc/topic2722.html):
The time span by which a trigger must have missed its next-fire-time, in order for it to be considered "misfired" and thus have its misfire instruction applied.
It is 60 seconds by default, so a job isn't considered "misfired" until it is late by more than the misfireThreshold value.
To resolve the problem, just decrease this threshold (the code below sets it to 1 ms):
// ...
properties["quartz.jobStore.misfireThreshold"] = "1"; // value is in milliseconds
// ...
schedulerFactory = new StdSchedulerFactory(properties);
It should resolve the issue.

Resources