Snakebite recently emerged as an alternative to the hdfs CLI. As I understand it, the hdfs CLI does not retry a command that fails because of a cluster issue.
My question is whether snakebite handles retries in case of a cluster failure. By retry I mean that it attempts the command several times if the command keeps failing.
Yes, from the CLI interface it does retry. But once enough retries have failed, the command itself fails.
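To illustrate the behaviour described above (retry a few times, then give up), here is a minimal sketch of a bounded retry loop. It is not snakebite's actual implementation; the function name run_with_retries and its parameters are hypothetical and only show the general pattern.

```python
import time


def run_with_retries(operation, max_attempts=3, delay_seconds=0.0):
    """Call operation() until it succeeds or max_attempts is used up.

    Hypothetical helper for illustration only; it does not reflect
    snakebite's internals.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as error:  # a real client would catch narrower errors
            last_error = error
            if attempt < max_attempts:
                time.sleep(delay_seconds)  # wait before the next attempt
    # Every attempt failed, so surface the last error to the caller.
    raise last_error
```

A caller would wrap the failing command in a function and pass it in; once the attempts are exhausted, the last error is raised, which matches the "after enough retries it does fail" behaviour.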
I currently have a snakemake pipeline running with multiple jobs on a cluster. I want to cancel my jobs early, and the snakemake documentation says that I can use the --cluster-cancel option. However, it doesn't have any example of how to use it. The cluster I am using cancels jobs with qdel. So, I tried using snakemake --cluster-cancel "qdel", but when I do this it returns the error
snakemake: error: unrecognized arguments: --cluster-cancel
--cluster-cancel is a feature that was introduced in version 7.0.0. If you’re able to use that or a newer version on the cluster, you ought to be able to use that feature.
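If you want to check this programmatically, a rough sketch could compare the version string against 7.0.0. The helper name supports_cluster_cancel is hypothetical, and it assumes the version string looks like "7.32.4" (for example, as reported by snakemake --version).

```python
def supports_cluster_cancel(version):
    """Return True if a Snakemake version string is 7.0.0 or newer.

    Hypothetical helper; assumes a dotted version such as "7.32.4".
    """
    parts = []
    for piece in version.strip().split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break  # ignore suffixes such as "rc1"
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)  # treat "7.0" as "7.0.0"
    return tuple(parts) >= (7, 0, 0)
```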
I haven't used the --cluster-cancel feature, but I follow this routine whenever I want to kill a bunch of snakemake jobs on my cluster:
Interrupt the snakemake pipeline: ^C
Kill existing jobs on the cluster: bkill 0 (kills all jobs) or bkill [jobid]
Unlock the snakemake pipeline for future use: snakemake --unlock
I have changed my Kerberized cluster to a non-Kerberized one.
After that, the exception below occurred while launching MR jobs.
Application application_1458558692044_0001 failed 1 times due to AM Container for appattempt_1458558692044_0001_000001 exited with exitCode: -1000
For more detailed output, check application tracking page: http://hdp1.impetus.co.in:8088/proxy/application_1458558692044_0001/ Then, click on links to logs of each attempt.
Diagnostics: Operation not permitted
Failing this attempt. Failing the application.
I was able to continue my work with the following steps:
Delete the YARN folder defined by the yarn.nodemanager.local-dirs property from all the nodes.
Then restart the YARN processes.
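To find out which folders that property points to, something like the sketch below could read them from the contents of yarn-site.xml. It assumes the usual Hadoop configuration layout (property elements with name and value children) and a comma-separated value. The function name is hypothetical, and it does not resolve defaults or variable substitutions such as ${hadoop.tmp.dir}.

```python
import xml.etree.ElementTree as ET


def nodemanager_local_dirs(yarn_site_xml):
    """Return the directories listed under yarn.nodemanager.local-dirs.

    yarn_site_xml is the text of a yarn-site.xml file. Returns an empty
    list if the property is not set explicitly.
    """
    root = ET.fromstring(yarn_site_xml)
    for prop in root.findall("property"):
        name = prop.findtext("name", default="").strip()
        if name == "yarn.nodemanager.local-dirs":
            value = prop.findtext("value", default="")
            # The value is a comma-separated list of local paths.
            return [d.strip() for d in value.split(",") if d.strip()]
    return []
```

Printing the result on each node before deleting anything is a cheap way to double-check that you are removing the right folders.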
My Pig script works fine on its own, until I put it in an Oozie workflow, where I receive the following error:
ERROR 2043: Unexpected error during execution.
org.apache.pig.backend.executionengine.ExecException: ERROR 2043: Unexpected error during execution.
...
Caused by: java.io.IOException: No FileSystem for scheme: hbase
I registered the HBase and Zookeeper jars successfully, but received the same error.
I also attempted to set the Zookeeper quorum by adding variations of this line to the Pig script:
SET hbase.zookeeper.quorum 'vm-myhost-001,vm-myhost-002,vm-myhost-003'
Some searching on the internet instructed me to add this to the beginning of my workflow.xml:
SET mapreduce.fileoutputcommitter.marksuccessfuljobs false
This solved the problem. I was even able to remove the registration of the HBase and Zookeeper jars and the Zookeeper quorum.
Now after double checking, I noticed that my jobs actually do their job: they store the results in HBase as expected. But, Oozie claims that a failure occurred, when it didn't.
I don't think that setting the mapreduce.fileoutputcommitter.marksuccessfuljobs to false constitutes a solution.
Are there any other solutions?
It seems that there is currently no real solution for this.
However, this answer to a different question seems to indicate that the best workaround is to create the success flag 'manually'.
I just started to practice AWS EMR.
I have a sample word-count application that I set up, ran, and completed from the web interface.
Following the guideline here, I have set up the command-line interface.
So when I run the command:
./elastic-mapreduce --list
I receive
j-27PI699U14QHH COMPLETED ec2-54-200-169-112.us-west-2.compute.amazonaws.com Word count
COMPLETED Setup hadoop debugging
COMPLETED Word count
Now, I want to see the log files. I run the command
./elastic-mapreduce --ssh --jobflow j-27PI699U14QHH
Then I receive the following error:
Error: Jobflow entered COMPLETED while waiting to ssh
Can someone please help me understand what's going on here?
Thanks,
When you set up a job on EMR, Amazon provisions a cluster on demand for you for a limited amount of time. During that time, you are free to ssh to your cluster and look at the logs as much as you want, but once your job has finished running, your cluster is taken down! At that point, you won't be able to ssh anymore because your cluster simply won't exist.
The workflow typically looks like this:
Create your jobflow
It will stay in status STARTING for a few minutes. At that point, if you try to run ./elastic-mapreduce --ssh --jobflow <jobid>, it will simply wait because the cluster is not available yet.
After a while the status will switch to RUNNING. If you had already started the ssh command above it should automatically connect you to your cluster. Otherwise you can initiate your ssh command now and it should connect you directly without any wait.
The RUNNING step could take a while or be very short, depending on how much data you're processing and the nature of your computations.
Once all your data has been processed, the status will switch to SHUTTING_DOWN. At that point, if you already sshed before you will get disconnected. If you try to use the ssh command at that point, it will not connect.
Once the cluster has finished shutting down it will enter a terminal state of either COMPLETED or FAILED depending on whether your job succeeded or not. At that point your cluster is no longer available, and if you try to ssh you will get the error you are seeing.
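The steps above can be summarized as a small lookup from jobflow state to what the ssh command does. This is only a restatement of the description in this answer, not an AWS API; the names SSH_BEHAVIOR and ssh_outcome are made up for illustration.

```python
# What happens to "./elastic-mapreduce --ssh" in each jobflow state,
# as described in the steps above (illustrative only, not an AWS API).
SSH_BEHAVIOR = {
    "STARTING": "waits until the cluster is available",
    "RUNNING": "connects",
    "SHUTTING_DOWN": "does not connect",
    "COMPLETED": "error: cluster no longer exists",
    "FAILED": "error: cluster no longer exists",
}


def ssh_outcome(state):
    """Return the expected ssh behaviour for a jobflow state."""
    try:
        return SSH_BEHAVIOR[state.upper()]
    except KeyError:
        raise ValueError("unknown jobflow state: %r" % state)


def can_ssh(state):
    """True only while the cluster is up and accepting connections."""
    return ssh_outcome(state) == "connects"
```

In the question, the jobflow was already COMPLETED, so the lookup lands on the error case.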
Of course there are exceptions: you could set up an EMR cluster in interactive mode, for example if you just want to have Hive set up and then ssh in and run Hive queries. In that case you would have to take your cluster down manually. But if you just want a MapReduce job to run, then you will only be able to ssh for the duration of the job.
That being said, if all you want to do is debug, there is no need to ssh in the first place! When you create your jobflow, you have the option to enable debugging, so you could do something like this:
./elastic-mapreduce --create --enable-debugging --log-uri s3://myawsbucket
What that means is that all the logs for your job will be written to the S3 bucket specified (you have to own this bucket, of course, and have permission to write to it). If you do that, you can also go to the EMR section of the AWS console afterwards, where you will see a debug button next to your job; this should make your life much easier.
In a small Hadoop cluster set up on a number of developer workstations (i.e., they have different local configurations), I have one TaskTracker of 6 that is being problematic. Whenever it receives a task, that task immediately fails with ChildError:
java.lang.Throwable: Child Error
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:242)
Caused by: java.io.IOException: Task process exit with nonzero status of 1.
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:229)
When I look at the stdout and stderr logs for the task, the stdout log is empty, and the stderr log only has:
execvp: Permission denied
My jobs complete because the tasktracker eventually gets blacklisted and runs on the other nodes that have no problem running a task. I am not able to get any tasks running on this one node, from any number of jobs, so this is a universal problem.
I have a DataNode running on this node with no issues.
I imagine there might be some sort of Java issue here where it is having a hard time spawning a JVM or something...
We had the same problem. We fixed it by adding execute permission to the file below.
$JAVA_HOME/jre/bin/java
This is because Hadoop uses $JAVA_HOME/jre/bin/java to spawn the task program instead of $JAVA_HOME/bin/java.
If you still have this issue after changing the file mode, I suggest using remote debugging to find the shell command that spawns the task; see "debugging hadoop task".
Whatever it is trying to execvp does not have the executable bit set on it. You can set the executable bit using chmod from the commandline.
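As a rough sketch of the same check done from a script: the function below looks at the owner execute bit of a file and adds execute permission if it is missing, which is roughly what chmod +x does by hand. The function name ensure_executable is hypothetical, and the path you pass in (for example the java binary under $JAVA_HOME/jre/bin) is your own assumption about what the task is trying to run.

```python
import os
import stat


def ensure_executable(path):
    """Add execute permission to path if the owner execute bit is missing.

    Returns True if the mode was changed, False if it was already
    executable by the owner.
    """
    mode = stat.S_IMODE(os.stat(path).st_mode)  # permission bits only
    if mode & stat.S_IXUSR:
        return False
    # Add the execute bit for user, group, and others.
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return True
```

You need to run this as a user who is allowed to change the file's mode (the owner or root).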
I have encountered the same problem.
You can try changing the JDK from 32-bit to 64-bit, or from 64-bit to 32-bit.