Operation not permitted while launching MR job - hadoop

I have changed my Kerberos cluster to unkerberized;
after that, the exception below occurred while launching MR jobs.
Application application_1458558692044_0001 failed 1 times due to AM Container for appattempt_1458558692044_0001_000001 exited with exitCode: -1000
For more detailed output, check application tracking page:http://hdp1.impetus.co.in:8088/proxy/application_1458558692044_0001/Then, click on links to logs of each attempt.
Diagnostics: Operation not permitted
Failing this attempt. Failing the application.

I was able to continue my work by doing the following:
delete the yarn folder from all the nodes (the directories defined by the yarn.nodemanager.local-dirs property),
then restart the YARN processes.

Related

Spark Streaming: How to restart spark streaming job running on hdfs cleanly

We have a Spark Streaming job that reads data from Kafka, running on a 4-node cluster and using a checkpoint directory on HDFS. We had an I/O error where we ran out of space, so we had to go in and delete a few HDFS folders to free some space. We now have bigger disks mounted and want to restart cleanly; there is no need to preserve the checkpoint data or Kafka offsets. We are getting the error:
Application application_1482342493553_0077 failed 2 times due to AM Container for appattempt_1482342493553_0077_000002 exited with exitCode: -1000
For more detailed output, check application tracking page:http://hdfs-name-node:8088/cluster/app/application_1482342493553_0077Then, click on links to logs of each attempt.
Diagnostics: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1266542908-96.118.179.119-1479844615420:blk_1073795938_55173 file=/user/hadoopuser/streaming_2.10-1.0.0-SNAPSHOT.jar
Failing this attempt. Failing the application.
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1484420770001
final status: FAILED
tracking URL: http://hdfs-name-node:8088/cluster/app/application_1482342493553_0077
user: hadoopuser
From the error, what I can make out is that it's still looking for old HDFS blocks that we deleted.
From research I found that changing the checkpoint directory should help. I tried changing it and pointing to a new directory, but it still isn't helping Spark restart on a clean slate; it still gives the same block exception. Are we missing anything while making the configuration changes? And how can we make sure that Spark is started on a clean slate?
Also, this is how we are setting the checkpoint directory:
// Batch interval (in seconds) comes from the job's property file
val ssc = new StreamingContext(sparkConf, Seconds(props.getProperty("spark.streaming.window.seconds").toInt))
// HDFS checkpoint directory, also read from the property file
ssc.checkpoint(props.getProperty("spark.checkpointdir"))
val sc = ssc.sparkContext
The current checkpoint directory in the property file is:
spark.checkpointdir:hdfs://hadoopuser#hdfs-name-node:8020/user/hadoopuser/.checkpointDir1
Previously it was:
spark.checkpointdir:hdfs://hadoopuser#hdfs-name-node:8020/user/hadoopuser/.checkpointDir
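For reference, the standard checkpoint-recovery pattern looks roughly like this. This is only a minimal sketch: the app name, batch interval, checkpoint path string, and processing logic below are placeholders, not our real job.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CleanSlateStart {
  // Stands in for the spark.checkpointdir value from the property file
  val checkpointDir = "hdfs://hdfs-name-node:8020/user/hadoopuser/.checkpointDir1"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("streaming-job") // placeholder app name
    val ssc = new StreamingContext(conf, Seconds(30))      // placeholder batch interval
    ssc.checkpoint(checkpointDir)
    // ... define the Kafka input and processing here ...
    ssc
  }

  def main(args: Array[String]): Unit = {
    // If checkpointDir is empty, createContext() runs and the job starts from
    // scratch; if it still contains old metadata, that metadata (including
    // references to old files) is replayed instead.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
The clean-slate start comes from pointing getOrCreate at a directory that contains no prior checkpoint data.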

Yarn app timeout and no error

I am running a map-reduce job triggered through YARN's REST API. The YARN app starts and triggers another map-reduce job, but the actual YARN app times out at almost exactly 12 minutes.
This is the final log where it ends:
2016-09-01 13:22:53 DEBUG ProtobufRpcEngine:221 - Call: getJobReport took 0ms
2016-09-01 13:22:54 DEBUG Client:97 - stopping client from cache: org.apache.hadoop.ipc.Client#6bbe2511
There are literally no errors or exceptions. I don't know which setting in Hadoop is causing this.
The diagnostics say: Application *application_whatever* failed 1 times due to ApplicationMaster for attempt *appattempt_application_whatever* timed out. Failing the application.

Unable to restart a Spring batch job

I have a Spring Batch job which reads, transforms and writes to an Oracle database. I am running the job via the CommandLineJobRunner utility (using a fat jar plus dependencies generated with the Maven Shade plugin). The job subsequently fails halfway through due to "java heap memory limit reached", yet it is not marked as FAILED but rather still shows status STARTED.
I tried to re-run the job using the same job parameters (as the docs suggest) but this gives me this error:
5:24:34.147 [main] ERROR o.s.b.c.l.s.CommandLineJobRunner - Job Terminated in error: A job execution for this job is already running: JobInstance: id=1, version=0, Job=[maskTableJob]
org.springframework.batch.core.repository.JobExecutionAlreadyRunningException: A job execution for this job is already running: JobInstance: id=1, version=0, Job=[maskTableJob]
at org.springframework.batch.core.repository.support.SimpleJobRepository.createJobExecution(SimpleJobRepository.java:120) ~[maskng-batch-1.0-SNAPSHOT-executable.jar:1.0-SNAPSH
I have tried all sorts of things (like manually setting the status to FAILED, or using the -restart argument) but to no avail. Is there something I am missing here? I thought one of the strong points of Spring Batch was its ability to restart jobs where they left off.
The first thing you should know is that JobLauncher cannot be used to restart a job that has already run.
The reason you are getting JobExecutionAlreadyRunningException is that the job parameters you are passing are already present in the DB for an execution that is still marked as running.
In Spring Batch, a job can be restarted only if it completed with FAILED or STOPPED status.
JobOperator has a restart method that can be used to restart a failed job by passing the id of a job execution that ended with FAILED or STOPPED status.
Please note that a job cannot be restarted if it completed with COMPLETED status.
In that case you will have to submit a new job with new job parameters.
If you want to manually set the status of the job to FAILED, run the query below and then restart the job using the JobOperator.restart() method.
update batch_job_execution set status='FAILED', version=version+1 where job_instance_id=jobId;
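As a rough sketch of that restart call (assuming a Spring context that already defines a JobOperator bean; the context file name and the way the execution id is obtained are placeholders):
import org.springframework.batch.core.launch.JobOperator
import org.springframework.context.support.ClassPathXmlApplicationContext

object RestartFailedJob {
  def main(args: Array[String]): Unit = {
    // "batch-context.xml" is a placeholder for whatever context defines the
    // JobRepository, JobRegistry, JobExplorer and JobOperator beans.
    val ctx = new ClassPathXmlApplicationContext("batch-context.xml")
    val jobOperator = ctx.getBean(classOf[JobOperator])

    // Id of the execution now marked FAILED (or STOPPED) in BATCH_JOB_EXECUTION
    val failedExecutionId = args(0).toLong
    val newExecutionId = jobOperator.restart(failedExecutionId)
    println(s"Restarted execution $failedExecutionId as execution $newExecutionId")
  }
}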
Improper transaction management could be one possible reason why your job status is not getting updated to FAILED. Please make sure the transaction is completed even if the job encounters a runtime exception.

Unable to schedule falcon process - Could not perform authorization operation, java.io.IOException: Couldn't set up IO streams

Hi,
I am trying to schedule a Falcon process using the Falcon CLI and the falcon service user on a Kerberized cluster. I am getting the following error message:
ERROR: Bad Request;default/org.apache.falcon.FalconWebException::org.apache.falcon.FalconException: Entity schedule failed for process: testHiveProc
The Falcon app logs show the following:
Caused by: org.apache.falcon.FalconException: E0501 : E0501: Could not perform authorization operation, Failed on local exception: java.io.IOException: Couldn't set up IO streams; Host Details :
Any suggestions?
Thanks.
Root cause:
Oozie was running out of processes because of the large number of scheduled jobs.
Short-term solution:
Restart the Oozie server.
Long-term solution:
- Increase the ulimit (max user processes) for the user running Oozie
- Limit the number of scheduled jobs in Oozie

Spark streaming jobs fail when chained

I'm running a few Spark Streaming jobs in a chain (each one looking for input in the output folder of the previous one) on a Hadoop cluster, using HDFS and running in yarn-cluster mode.
job 1 --> reads from folder A outputs to folder A'
job 2 --> reads from folder A' outputs to folder B
job 3 --> reads from folder B outputs to folder C
...
When running the jobs independently, they work just fine.
But when they are all waiting for input and I place a file in folder A, job 1 will change its status from running to accepting to failed.
I cannot reproduce this error when using the local FS, only when running on a cluster (using HDFS).
Client: Application report for application_1422006251277_0123 (state: FAILED)
INFO Client:
client token: N/A
diagnostics: Application application_1422006251277_0123 failed 2 times due to AM Container for appattempt_1422006251277_0123_000002 exited with exitCode: 15 due to: Exception from container-launch.
Container id: container_1422006251277_0123_02_000001
Exit code: 15
Even though MapReduce ignores files that start with . or _, Spark Streaming does not.
The problem is that when a file is still being copied or processed and a trace of it is visible on HDFS (e.g. "somefilethatsuploading.txt.tmp"), Spark will try to process it.
By the time the process starts to read the file, it is either gone or not complete yet.
That's why the processes kept blowing up.
Ignoring files that start with . or _ or end with .tmp fixes this issue.
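As a rough illustration of such a filter (only a sketch; the paths, app name, and batch interval below are placeholders, not the original jobs), the fileStream variant that takes an explicit path filter can be used instead of textFileStream:
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FilteredFileStream {
  // Accept only files that look complete: skip hidden and partial-upload markers.
  def isComplete(path: Path): Boolean = {
    val name = path.getName
    !name.startsWith(".") && !name.startsWith("_") && !name.endsWith(".tmp")
  }

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("job-2"), Seconds(30)) // placeholder interval
    // fileStream lets us pass isComplete, so in-flight uploads (e.g. "*.tmp")
    // in the previous job's output folder are ignored until they are renamed.
    val lines = ssc
      .fileStream[LongWritable, Text, TextInputFormat]("hdfs:///chain/folderAprime", isComplete _, newFilesOnly = true)
      .map(_._2.toString)
    lines.saveAsTextFiles("hdfs:///chain/folderB/part")
    ssc.start()
    ssc.awaitTermination()
  }
}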
Addition:
We kept having issues with the chained jobs. It appears that as soon as Spark notices a file (even if it's not completely written), it will try to process it and ignore any data appended afterwards. Writing to a temporary name and renaming the file only once it is complete avoids this, since the rename operation is typically atomic.
