When I execute my application, which uses Embedded Kafka Streams to process data in parallel, I get the following error:
Exception in thread "TopicInGroup-684d9a1a-35fd-40eb-9d76-d869eab30251-StreamThread-1" org.apache.kafka.streams.errors.StreamsException: task [1_0] Fatal error while trying to lock the state directory for task 1_0
Note: each Kafka Streams instance runs in its own, separate JVM.
Related
I had been running my streaming job for a while and it had processed thousands of batches.
There is a retention policy on the checkpoint file system, so older directories are removed. When I restarted the streaming job, it failed with the following error:
terminated with error",throwable.class="java.lang.IllegalStateException",throwable.msg="failed to read log file for batch 0"
This is because the corresponding batch directory is no longer available. Is there a way to make the streaming job start from a specific batchId?
I have a Spring Batch job which reads, transforms, and writes to an Oracle database. I am running the job via the CommandLineJobRunner utility (using a fat jar plus dependencies generated with the Maven Shade plugin). The job fails halfway through with a "java heap memory limit reached" error, but it is not marked as FAILED; it still shows status STARTED.
I tried to re-run the job using the same job parameters (as the docs suggest), but this gives me the following error:
5:24:34.147 [main] ERROR o.s.b.c.l.s.CommandLineJobRunner - Job Terminated in error: A job execution for this job is already running: JobInstance: id=1, version=0, Job=[maskTableJob]
org.springframework.batch.core.repository.JobExecutionAlreadyRunningException: A job execution for this job is already running: JobInstance: id=1, version=0, Job=[maskTableJob]
at org.springframework.batch.core.repository.support.SimpleJobRepository.createJobExecution(SimpleJobRepository.java:120) ~[maskng-batch-1.0-SNAPSHOT-executable.jar:1.0-SNAPSH
I have tried all sorts of things (manually setting the status to FAILED, using the -restart argument) but to no avail. Is there something I am missing here? I thought one of the strong points of Spring Batch was its ability to restart jobs where they left off.
The first thing to know is that JobLauncher cannot be used to restart a job that has already run.
You are getting JobExecutionAlreadyRunningException because the job parameters you are passing already exist in the database for an execution that is still marked as running.
In Spring Batch, a job can be restarted only if it completed with FAILED or STOPPED status.
JobOperator has a restart method that restarts a failed job when you pass it the id of a JobExecution that ended with FAILED or STOPPED status.
Please note that a job cannot be restarted if it completed successfully (COMPLETED status).
In that case you will have to submit a new job instance with new job parameters.
If you want to manually mark the job as failed, run the query below and then restart the job using the JobOperator.restart() method.
update batch_job_execution set status='FAILED', version=version+1 where job_instance_id=jobId;
Improper transaction management is one possible reason why your job status is not being updated to FAILED. Please make sure your transaction completes, so the job repository can record the final status, even if the job encounters a runtime exception.
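The restart rule above can be sketched in plain Java. `isRestartable` below is a hypothetical helper that only illustrates which statuses qualify; `jobOperator.restart(executionId)` is the actual Spring Batch call you would make once the execution is in a restartable state:

```java
// Hypothetical helper illustrating Spring Batch's restart rule:
// only FAILED or STOPPED executions may be restarted.
public class RestartRule {
    static boolean isRestartable(String batchStatus) {
        return "FAILED".equals(batchStatus) || "STOPPED".equals(batchStatus);
    }

    public static void main(String[] args) {
        System.out.println(isRestartable("FAILED"));    // true
        System.out.println(isRestartable("STOPPED"));   // true
        // A successfully completed execution must be re-run as a new
        // job instance with new parameters, not restarted:
        System.out.println(isRestartable("COMPLETED")); // false
        // With a real JobOperator bean you would then call:
        // jobOperator.restart(failedExecutionId);
    }
}
```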
Hi,
I am trying to schedule a Falcon process using the Falcon CLI and the falcon service user on a Kerberized cluster. I am getting the following error message:
ERROR: Bad Request;default/org.apache.falcon.FalconWebException::org.apache.falcon.FalconException: Entity schedule failed for process: testHiveProc
The Falcon application logs show the following:
used by: org.apache.falcon.FalconException: E0501 : E0501: Could not perform authorization operation, Failed on local exception: java.io.IOException: Couldn't set up IO streams; Host Details :
Any suggestions?
Thanks.
Root cause:
Oozie was running out of processes because of the large number of scheduled jobs.
Short term solution:
Restart Oozie server
Long term solution:
- Increase ulimit
- Limit the number of scheduled jobs in Oozie
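As a sketch of the ulimit change: on Linux, the per-user limits can be raised in /etc/security/limits.conf. The oozie user name and the values below are assumptions; adjust them for your environment.

```
# /etc/security/limits.conf -- raise limits for the Oozie service user
oozie  soft  nproc   16384
oozie  hard  nproc   32768
oozie  soft  nofile  32768
oozie  hard  nofile  65536
```

After logging in again as the oozie user, the effective limits can be checked with ulimit -u and ulimit -n.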
I'm running a few Spark Streaming jobs in a chain (each one looking for input in the output folder of the previous one) on a Hadoop cluster, using HDFS and running in yarn-cluster mode.
job 1 --> reads from folder A outputs to folder A'
job 2 --> reads from folder A' outputs to folder B
job 3 --> reads from folder B outputs to folder C
...
When running the jobs independently they work just fine.
But when they are all waiting for input and I place a file in folder A, job 1 changes its status from RUNNING to ACCEPTED to FAILED.
I cannot reproduce this error when using the local filesystem, only when running on a cluster (using HDFS).
Client: Application report for application_1422006251277_0123 (state: FAILED)
INFO Client:
client token: N/A
diagnostics: Application application_1422006251277_0123 failed 2 times due to AM Container for appattempt_1422006251277_0123_000002 exited with exitCode: 15 due to: Exception from container-launch.
Container id: container_1422006251277_0123_02_000001
Exit code: 15
Even though MapReduce ignores files that start with . or _, Spark Streaming does not.
The problem is that while a file is still being copied or otherwise in flight, a partial file is visible on HDFS (e.g. "somefilethatsuploading.txt.tmp"), and Spark will try to process it.
By the time the process starts to read the file, it is either gone or not yet complete.
That is why the processes kept blowing up.
Ignoring files that start with . or _ or end with .tmp fixes this issue.
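The filter described above can be sketched in plain Java; the method name `accept` is an assumption, but the rules mirror the ones just listed (skip names starting with . or _ and names ending in .tmp):

```java
public class InputFileFilter {
    // Accept only files that are safe to process: skip hidden/marker
    // files (leading '.' or '_', as MapReduce does) and in-flight
    // uploads that still carry a .tmp suffix.
    static boolean accept(String fileName) {
        return !fileName.startsWith(".")
            && !fileName.startsWith("_")
            && !fileName.endsWith(".tmp");
    }

    public static void main(String[] args) {
        System.out.println(accept("data-0001.txt"));                  // true
        System.out.println(accept("_SUCCESS"));                       // false
        System.out.println(accept(".data-0001.txt.crc"));             // false
        System.out.println(accept("somefilethatsuploading.txt.tmp")); // false
    }
}
```

In Spark Streaming, a predicate like this can be passed as the filter argument of fileStream so partial files are never picked up.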
Addition:
We kept having issues with the chained jobs. It appears that as soon as Spark notices a file, even if it is not completely written, it will try to process it and ignore any data appended afterwards. Writing the output to a temporary name and renaming it into the watched folder only once it is complete avoids this, since a rename on HDFS is atomic.
I am successfully using the DataStax Java Driver to access Cassandra inside my Java code just before I start a MapReduce Job.
cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
However, I need to check additional information to decide, on a per-record basis, how to reduce each record. If I attempt to use the same code inside a Hadoop Reducer class, it fails to connect with the error:
INFO mapred.JobClient: Task Id :
attempt_201310280851_0012_r_000000_1, Status : FAILED
com.datastax.driver.core.exceptions.NoHostAvailableException:
All host(s) tried for query failed (tried: /127.0.0.1 ([/127.0.0.1]
Unexpected error during transport initialization
(com.datastax.driver.core.TransportException: [/127.0.0.1] Error writing)))
at com.datastax.driver.core.ControlConnection.reconnectInternal(ControlConnection.java:186)
at com.datastax.driver.core.ControlConnection.connect(ControlConnection.java:81)
at com.datastax.driver.core.Cluster$Manager.init(Cluster.java:662)
at com.datastax.driver.core.Cluster$Manager.access$100(Cluster.java:604)
at com.datastax.driver.core.Cluster.<init>(Cluster.java:69)
at com.datastax.driver.core.Cluster.buildFrom(Cluster.java:96)
at com.datastax.driver.core.Cluster$Builder.build(Cluster.java:585)
The MapReduce input and output read from and write to Cassandra successfully. As I mentioned, I can connect before I run the job, so I do not think the problem is with the Cassandra server itself.
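One thing worth checking in a setup like this: reduce tasks run on worker nodes, so 127.0.0.1 inside a Reducer refers to that worker node, not to the machine where the pre-job code connected. A hedged sketch of resolving the contact point from the environment instead of hardcoding loopback (CASSANDRA_HOST is a hypothetical variable name, not part of the driver):

```java
public class ContactPoint {
    // Resolve the Cassandra contact point from an environment value,
    // falling back to loopback only for local testing. On a cluster the
    // reducer would then call
    // Cluster.builder().addContactPoint(resolve(...)).build().
    static String resolve(String envValue) {
        return (envValue == null || envValue.isEmpty()) ? "127.0.0.1" : envValue;
    }

    public static void main(String[] args) {
        System.out.println(resolve(System.getenv("CASSANDRA_HOST")));
        System.out.println(resolve("cassandra-node1.example.com")); // cassandra-node1.example.com
    }
}
```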