Oozie workflow fails - Mkdirs failed to create file - hadoop

I am using an Oozie workflow to run a pyspark script, and I'm running into an error I can't figure out.
When running the workflow (either locally or on YARN), a MapReduce job runs before the Spark action starts. After a few minutes the task fails (before the Spark action), and digging through the logs shows the following error:
java.io.IOException: Mkdirs failed to create file:/home/oozie/oozie-oozi/0000011-160222043656138-oozie-oozi-W/bulk-load-node--spark/output/_temporary/1/_temporary/attempt_1456129482428_0003_m_000000_2 (exists=false, cwd=file:/hadoop/yarn/local/usercache/root/appcache/application_1456129482428_0003/container_e68_1456129482428_0003_01_000004)
(Apologies for the length)
There are no other evident errors. I do not directly create this folder (I assume given the name that it is used for temporary storage of MapReduce jobs). I can create this folder from the command line using mkdir -p /home/oozie/blah.... It doesn't appear to be a permissions issue, as setting that folder to 777 made no difference. I have also added default ACLs for oozie, yarn and mapred users for that folder, so I've pretty much ruled out permission issues. It's also worth noting that the working directory listed in the error does not exist after the job fails.
After some Googling I saw that a similar problem is common on Mac systems, but I'm running on CentOS. I am running the HDP 2.3 VM Sandbox, which is a single node 'cluster'.
My workflow.xml is as follows:
<workflow-app xmlns='uri:oozie:workflow:0.4' name='SparkBulkLoad'>
<start to = 'bulk-load-node'/>
<action name = 'bulk-load-node'>
<spark xmlns="uri:oozie:spark-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<master>yarn</master>
<mode>client</mode>
<name>BulkLoader</name>
<jar>file:///test/BulkLoader.py</jar>
<spark-opts>
--num-executors 3 --executor-cores 1 --executor-memory 512m --driver-memory 512m\
</spark-opts>
</spark>
<ok to = 'end'/>
<error to = 'fail'/>
</action>
<kill name = 'fail'>
<message>
Error occurred while bulk loading files
</message>
</kill>
<end name = 'end'/>
</workflow-app>
and job.properties is as follows:
nameNode=hdfs://192.168.26.130:8020
jobTracker=http://192.168.26.130:8050
queueName=spark
oozie.use.system.libpath=true
oozie.wf.application.path=file:///test/workflow.xml
If necessary I can post any other parts of the stack trace. I appreciate any help.
Update 1
After having checked my Spark History Server, I can confirm that the actual Spark action is not starting - no new Spark apps are being submitted.

Related

Can't run shell in Oozie (error=2, No such file or directory)

I created a workflow in the Ambari Views UI for Oozie, with a sample.sh file in my workflow.
After running it I get an error. When I change the body of the shell script to a simple command, for example echo 1, the error does not appear.
Please advise.
2:34,752 WARN ShellActionExecutor:523 - SERVER[dlmaster02.sic] USER[root] GROUP[-] TOKEN[] APP[shell-wf] JOB[0000043-180630152627142-oozie-oozi-W] ACTION[0000043-180630152627142-oozie-oozi-W#shell-node] Launcher ERROR, reason: Main class [org.apache.oozie.action.hadoop.ShellMain], main() threw exception, Cannot run program "sample.sh" (in directory "/hadoop/yarn/local/usercache/root/appcache/application_1531029096800_0022/container_e18_1531029096800_0022_01_000002"): error=2, No such file or directory
2018-07-21 16:42:34,753 WARN ShellActionExecutor:523 - SERVER[dlmaster02.sic] USER[root] GROUP[-] TOKEN[] APP[shell-wf] JOB[0000043-180630152627142-oozie-oozi-W] ACTION[0000043-180630152627142-oozie-oozi-W#shell-node] Launcher exception: Cannot run program "sample.sh" (in directory "/hadoop/yarn/local/usercache/root/appcache/application_1531029096800_0022/container_e18_1531029096800_0022_01_000002"): error=2, No such file or directory
java.io.IOException: Cannot run program "sample.sh" (in directory "/hadoop/yarn/local/usercache/root/appcache/application_1531029096800_0022/container_e18_1531029096800_0022_01_000002"): error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at org.apache.oozie.action.hadoop.ShellMain.execute(ShellMain.java:110)
at org.apache.oozie.action.hadoop.ShellMain.run(ShellMain.java:69)
at org.apache.oozie.action.hadoop.LauncherMain.run(LauncherMain.java:75)
at org.apache.oozie.action.hadoop.ShellMain.main(ShellMain.java:59)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:231)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:170)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:164)
Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
at java.lang.ProcessImpl.start(ProcessImpl.java:134)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
... 17 more
The XML of my workflow:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<workflow-app xmlns="uri:oozie:workflow:0.5" name="test">
<start to="shell_1"/>
<action name="shell_1">
<shell xmlns="uri:oozie:shell-action:0.3">
<job-tracker>${resourceManager}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>Group</name>
<value>hadoop</value>
</property>
</configuration>
<exec>/user/ambari-qa/sample.sh</exec>
<file>/user/ambari-qa/sample.sh</file>
</shell>
<ok to="end"/>
<error to="kill"/>
</action>
<kill name="kill">
<message>${wf:errorMessage(wf:lastErrorNode())}</message>
</kill>
<end name="end"/>
</workflow-app>
I had the same issue, but in my case the root cause was the shell script's CRLF line separator (\r\n).
The issue was resolved when I changed the shell script's line separator to LF (\n).
Note: when using IntelliJ on Windows with default settings, CRLF (\r\n) is the default line separator.
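A plausible mechanism for error=2 in this case, assuming the script starts with a shebang: with CRLF endings the first line ends in \r, so the kernel looks for an interpreter literally named /bin/bash\r, which does not exist. The sketch below checks a local copy of the script for CRLF endings and rewrites it with LF before it is uploaded to HDFS. The function names are made up for illustration.

```python
from pathlib import Path


def has_crlf(path):
    """Return True if the file contains at least one CRLF line ending."""
    return b"\r\n" in Path(path).read_bytes()


def convert_to_lf(path):
    """Rewrite the file in place with LF endings; return True if it changed."""
    p = Path(path)
    data = p.read_bytes()
    fixed = data.replace(b"\r\n", b"\n")
    if fixed != data:
        p.write_bytes(fixed)
        return True
    return False
```

Run convert_to_lf("sample.sh") on the local file, then upload it to HDFS again so the workflow picks up the fixed copy.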
Since you are doing this via the Ambari Workflow Management tool, which is an Ambari View, you should edit the shell action, scroll down to "Advanced Properties", and add a "File" that you want to run, such as "/user/admin/hello.sh", which must be a file in HDFS. If you don't do that, the file isn't copied into the YARN container's file cache, so you will get "file not found".
If you do that, then "Submit" the job, go to the Dashboard, open the job, and click on the "Definition" tab, you should see that the graphical tool added a <file> node to the workflow:
<?xml version="1.0" encoding="UTF-8" standalone="no"?><workflow-app xmlns="uri:oozie:workflow:0.5" name="helloworldshell">
<start to="shell_1"/>
<action name="shell_1">
<shell xmlns="uri:oozie:shell-action:0.3">
<job-tracker>${resourceManager}</job-tracker>
<name-node>${nameNode}</name-node>
<exec>hello.sh</exec>
<file>/user/admin/hello.sh</file>
<capture-output/>
</shell>
<ok to="end"/>
<error to="kill"/>
</action>
<kill name="kill">
<message>${wf:errorMessage(wf:lastErrorNode())}</message>
</kill>
<end name="end"/>
</workflow-app>
The important lines are:
<exec>hello.sh</exec>
<file>/user/admin/hello.sh</file>
A node like <file>/x/y/z</file> causes a hdfs file to be copied from that path on hdfs into the current working directory of the running shell action on the remote data node server where the action is run in a yarn container. It can then be used by the <exec>z</exec> element which will look for it in the $PWD of the final JVM. The $PWD is set to a generated temporary location on the final host where the work runs. This may be a different server and different folder for every run of the workflow job.
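The rule described above (the <exec> value must match a name that some <file> entry places in the working directory) can be checked before submitting. This is a rough sketch using only Python's standard library, not an Oozie tool. The function names are invented, and it assumes the "path#link" form of <file> creates a link called "link" in the working directory.

```python
import posixpath
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace: '{uri}exec' -> 'exec'."""
    return tag.rsplit("}", 1)[-1]


def localized_names(shell_action):
    """Names that the <file> entries would create in the working directory."""
    names = set()
    for el in shell_action:
        if local_name(el.tag) == "file" and el.text:
            path = el.text.strip()
            # 'path#link' uses the link name; otherwise the file's base name.
            if "#" in path:
                names.add(path.split("#", 1)[1])
            else:
                names.add(posixpath.basename(path))
    return names


def unresolved_execs(workflow_xml):
    """Return <exec> values of shell actions that no <file> entry provides."""
    root = ET.fromstring(workflow_xml)
    problems = []
    for el in root.iter():
        if local_name(el.tag) != "shell":
            continue
        names = localized_names(el)
        for child in el:
            if local_name(child.tag) == "exec" and child.text:
                if child.text.strip() not in names:
                    problems.append(child.text.strip())
    return problems
```

For the workflow in the question, this reports /user/ambari-qa/sample.sh, because the <exec> holds an absolute HDFS-style path rather than the bare name sample.sh. A command that is already installed on every node (echo, for example) would also be reported, so treat the result as a hint rather than a verdict.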
Note that the "YARN containers" that run any Oozie workflow are nothing like a Docker container. Each one is a JVM with a managed classpath and a security manager set to prevent you from reading arbitrary Linux files. On a big cluster any action could run on any node, so files must be distributed via HDFS. There is a caching mechanism to ensure that files are cached locally on hosts. The security manager set up by YARN will only let you access files that are properly set up in the file cache, as defined by one or more <file> nodes in your XML.
The Workflow GUI seems very helpful, but if you don't understand the underlying technology, the manual isn't much help when debugging. It is a good idea to run some "hello world" jobs on the command line on an edge node first, putting sample XML into an HDFS folder and then launching jobs with the command-line tools. The workflow web UI is just a graphical user interface over the top.
In general, you put the files into a subfolder below the workflow that you save:
$ tree shdemo
shdemo
├── bin
│   ├── hello.sh
└── workflow.xml
And in the workflow use a relative path to the file not an absolute path:
<file>bin/hello.sh#hello.sh</file>
The # says to symlink the file into the $PWD, which is optional but can be helpful. With a complex workflow you could have different subfolders for different file types (e.g., 'bin', 'config', 'data'). You can then add many <file> entries to your workflow XML. Finally, you copy all those folders up into HDFS, from where you would run it:
# copy up the workflow files and subfolders into hdfs
hdfs dfs -copyFromLocal -f shdemo /user/$(whoami)/my_workflows
# launch it from that location within hdfs
oozie job -oozie $OOZIE_URL -config shdemo-job.properties -run
You will notice that when you use the Workflow GUI when you want to submit a job you have to "save" your workflow to a HDFS folder. It is that hdfs folder where you would add a bin/hello.sh that you would reference as above. Once again the web UI is simply a skin over the commandline technology. Once you have a simple one working on the commandline you can import it into the workflow GUI, edit it, and save it back to the same hdfs location.
Please try the below and let me know your result.
<exec>sample.sh</exec>
<file>${nameNode}/user/ambari-qa/sample.sh</file>
It needs a full path including the NameNode; otherwise it will look in the default path, and here the error says the script is not available in the default path.

Oozie workflow with spark application reports out of memory

I tried to execute an Oozie workflow with a Spark program as a single step.
I used a jar that executes successfully with spark-submit or spark-shell (the same code):
spark-submit --packages com.databricks:spark-csv_2.10:1.5.0 --master yarn-client --class "SimpleApp" /tmp/simple-project_2.10-1.1.jar
The application shouldn't demand a lot of resources: it loads a single CSV (<10 MB) into Hive using Spark.
Spark version: 1.6.0
Oozie version: 4.1.0
Workflow is created with Hue, Oozie Workflow Editor:
<workflow-app name="Spark_test" xmlns="uri:oozie:workflow:0.5">
<start to="spark-589f"/>
<kill name="Kill">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<action name="spark-589f">
<spark xmlns="uri:oozie:spark-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapreduce.map.java.opts</name>
<value>-XX:MaxPermSize=2g</value>
</property>
</configuration>
<master>yarn</master>
<mode>client</mode>
<name>MySpark</name>
<jar>simple-project_2.10-1.1.jar</jar>
<spark-opts>--packages com.databricks:spark-csv_2.10:1.5.0</spark-opts>
<file>/user/spark/oozie/jobs/simple-project_2.10-1.1.jar#simple-project_2.10-1.1.jar</file>
</spark>
<ok to="End"/>
<error to="Kill"/>
</action>
<end name="End"/>
</workflow-app>
I got the following logs after running the workflow:
stdout:
Invoking Spark class now >>>
Invocation of Main class completed <<<
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SparkMain], exception invoking main(), PermGen space
stderr:
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "Yarn application state monitor"
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SparkMain], exception invoking main(), PermGen space
syslog:
2017-03-14 12:31:19,939 ERROR [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.OutOfMemoryError: PermGen space
Please suggest which configuration parameters should be increased.
You have at least 2 options here:
1) increase PermGen size for launcher MR job by adding this to workflow.xml:
<property>
<name>oozie.launcher.mapreduce.map.java.opts</name>
<value>-XX:PermSize=512m -XX:MaxPermSize=512m</value>
</property>
see details here: http://www.openkb.info/2016/07/memory-allocation-for-oozie-launcher-job.html
2) The preferred way is to use Java 8 instead of the outdated Java 7, since Java 8 removed the PermGen space entirely (class metadata moved to Metaspace).
PermGen is non-heap memory used to store class metadata and string constants. It does not usually grow drastically unless classes are loaded at runtime via Class.forName() or by third-party JARs.
If you get this error message as soon as you launch your application, then it means that the allocated permanent generation space is smaller than actually required by all the class files in your application.
"-XX:MaxPermSize=2g"
You already set 2 GB for PermGen memory. You can increase this value gradually, see which value no longer throws OutOfMemoryError, and keep that value. You can also use profilers to monitor the memory usage of the permanent generation and set the right value.
If this error is triggered at run time, then it might be due to runtime class loading or excessive creation of string constants in permanent generation. It requires profiling your application to fix the issue and set the right value for -XX:MaxPermSize parameter.

not able to run the shell script with oozie

Hi, I am trying to run a shell script through Oozie. While running the shell script I get the following error:
org.apache.oozie.action.hadoop.ShellMain], exit code [1]
my job.properties file
nameNode=hdfs://ip-172-31-41-199.us-west-2.compute.internal:8020
jobTracker=ip-172-31-41-199.us-west-2.compute.internal:8032
queueName=default
oozie.libpath=${nameNode}/user/oozie/share/lib/
oozie.use.system.libpath=true
oozie.wf.rerun.failnodes=true
oozieProjectRoot=shell_example
oozie.wf.application.path=${nameNode}/user/karun/${oozieProjectRoot}/apps/shell
my workflow.xml
<workflow-app xmlns="uri:oozie:workflow:0.1" name="pi.R example">
<start to="shell-node"/>
<action name="shell-node">
<shell xmlns="uri:oozie:shell-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<exec>script.sh</exec>
<file>/user/karun/oozie-oozi/script.sh#script.sh</file>
<capture-output/>
</shell>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Incorrect output</message>
</kill>
<end name="end"/>
</workflow-app>
my shell script- script.sh
export SPARK_HOME=/opt/cloudera/parcels/CDH-5.4.2-1.cdh5.4.2.p0.2/lib/spark
export YARN_CONF_DIR=/etc/hadoop/conf
export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera
export HADOOP_CMD=/usr/bin/hadoop
/SparkR-pkg/lib/SparkR/sparkR-submit --master yarn-client examples/pi.R yarn-client 4
error log file
WEBHCAT_DEFAULT_XML=/opt/cloudera/parcels/CDH-5.4.2-1.cdh5.4.2.p0.2/etc/hive-webhcat/conf.dist/webhcat-default.xml:
CDH_KMS_HOME=/opt/cloudera/parcels/CDH-5.4.2-1.cdh5.4.2.p0.2/lib/hadoop-kms:
LANG=en_US.UTF-8:
HADOOP_MAPRED_HOME=/opt/cloudera/parcels/CDH-5.4.2-1.cdh5.4.2.p0.2/lib/hadoop-mapreduce:
=================================================================
Invoking Shell command line now >>
Stdoutput Running /opt/cloudera/parcels/CDH-5.4.2-1.cdh5.4.2.p0.2/lib/spark/bin/spark-submit --class edu.berkeley.cs.amplab.sparkr.SparkRRunner --files hdfs://ip-172-31-41-199.us-west-2.compute.internal:8020/user/karun/examples/pi.R --master yarn-client /SparkR-pkg/lib/SparkR/sparkr-assembly-0.1.jar hdfs://ip-172-31-41-199.us-west-2.compute.internal:8020/user/karun/examples/pi.R yarn-client 4
Stdoutput Fatal error: cannot open file 'pi.R': No such file or directory
Exit code of the Shell command 2
<<< Invocation of Shell command completed <<<
<<< Invocation of Main class completed <<<
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.ShellMain], exit code [1]
Oozie Launcher failed, finishing Hadoop job gracefully
Oozie Launcher, uploading action data to HDFS sequence file: hdfs://ip-172-31-41-199.us-west-2.compute.internal:8020/user/karun/oozie-oozi/0000035-150722003725443-oozie-oozi-W/shell-node--shell/action-data.seq
Oozie Launcher ends
I don't know how to solve the issue. Any help will be appreciated.
sparkR-submit ... examples/pi.R ...
Fatal error: cannot open file 'pi.R': No such file or directory
The message is really explicit: your shell tries to read an R script from the local filesystem. But local to what, actually?
Oozie uses YARN to run your shell, so YARN allocates a container on a random machine. Make it a reflex: all resources required by an Oozie action (scripts, libraries, config files, whatever) must be:
1) available in HDFS beforehand
2) downloaded at execution time thanks to <file> instructions in the Oozie script
3) accessed as local files in the current working directory
In your case:
<exec>script.sh</exec>
<file>/user/karun/oozie-oozi/script.sh</file>
<file>/user/karun/some/place/pi.R</file>
Then
sparkR-submit ... pi.R ...

Oozie variable[user] cannot be resolved

I'm trying to use Oozie's Hive action in Hue. My Hive script is very simple:
create table test.test_2 as
select * from test.test
This Oozie action has only 3 steps:
start
hive_query
end
My job.properties:
jobTracker worker-1:8032
mapreduce.job.user.name hue
nameNode hdfs://batchlayer
oozie.use.system.libpath true
oozie.wf.application.path hdfs://batchlayer/user/hue/oozie/workspaces/_hue_-oozie-4-1425575226.04
user.name hue
I added hive-site.xml twice: as a file and as job.xml. The Oozie action starts and stops at the second step. The job is 'accepted', but in the Hue console I get an error:
variable[user] cannot be resolved
I'm using Apache Oozie 4.2, Apache Hive 0.14 and Hue 3.7 (from Github).
UPDATE:
This is my workflow.xml:
bash-4.1$ bin/hdfs dfs -cat /user/hue/oozie/workspaces/*.04/work*
<workflow-app name="ccc" xmlns="uri:oozie:workflow:0.4">
<start to="ccc"/>
<action name="ccc">
<hive xmlns="uri:oozie:hive-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<job-xml>/user/hue/hive-site.xml</job-xml>
<script>/user/hue/hive_test.hql</script>
<file>/user/hue/hive-site.xml#hive-site.xml</file>
</hive>
<ok to="end"/>
<error to="kill"/>
</action>
<kill name="kill">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
I tried running a sample Hive action in Oozie following steps similar to yours, and was able to resolve the error you faced with the following steps:
1) Remove the hive-site.xml addition.
2) Add the following line to your job.properties: oozie.libpath=${nameNode}/user/oozie/share/lib
3) Increase the visibility of your hive-site.xml file kept in HDFS. Maybe you have very restrictive privileges on it (in my case 500).
With this, both the "variable[user] cannot be resolved" error and the subsequent errors were resolved.
Hope it helps.
This message can be really misleading. You should check the YARN logs and diagnostics.
In my case it was the configuration of reduce task and container memory. By mistake, the container memory limit was lower than the memory limit of a single reduce task. After looking into the YARN application logs I saw the true cause in the 'diagnostics' section, which was:
REDUCE capability required is more than the supported max container capability in the cluster. Killing the Job. reduceResourceRequest: <memory:8192, vCores:1> maxContainerCapability:<memory:5413, vCores:4>
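To spot this kind of mismatch quickly, the two resource tuples in such a diagnostics line can be pulled out and compared. This is only a sketch. It assumes the line has the shape shown above, with the request first and the maximum container capability second, and the helper names are invented.

```python
import re

# Matches fragments such as "<memory:8192, vCores:1>".
RESOURCE = re.compile(r"<memory:(\d+), vCores:(\d+)>")


def parse_resources(diagnostics):
    """Return (memory, vcores) tuples in order of appearance."""
    return [(int(m), int(v)) for m, v in RESOURCE.findall(diagnostics)]


def exceeds_capability(diagnostics):
    """True if the first tuple (request) exceeds the second (maximum)."""
    requested, maximum = parse_resources(diagnostics)[:2]
    return requested[0] > maximum[0] or requested[1] > maximum[1]
```

For the line above, the request of 8192 memory is larger than the 5413 maximum, so exceeds_capability returns True, which matches the reason YARN gave for killing the job.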
Regards

Do I need to provide configuration in workflow.xml and job.properties in oozie?

I'm trying to run a job that looks like this (workflow.xml):
<workflow-app name="FirstWorkFlow" xmlns="uri:oozie:workflow:0.2">
<start to="FirstJob"/>
<action name="FirstJob">
<pig>
<job-tracker>hadoop1:50300</job-tracker>
<name-node>hdfs://hadoop1:8020</name-node>
<script>lib/FirstScript.pig</script>
</pig>
<ok to="end"/>
<error to="kill"/>
</action>
<kill name="kill">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end" />
</workflow-app>
FirstScript :
dual = LOAD 'default.dual' USING org.apache.hcatalog.pig.HCatLoader();
store dual into '/user/oozie/dummy_file.txt' using PigStorage();
job.properties:
nameNode=hdfs://hadoop1:8020
jobTracker=hadoop1:50300
oozie.wf.application.path=/user/oozie/FirstScript
oozie.use.system.libpath=true
My question is: do I need to provide the nameNode and jobTracker configuration in both job.properties and workflow.xml?
I'm quite confused, because no matter whether I set these parameters or not, I get this error (from the Hue interface):
E0902: Exception occured: [Call to localhost/127.0.0.1:8020 failed on connection exception: java.net.ConnectException: Connection refused]
Regards
Pawel
First to answer your question about job.properties - it is used to parametrize the workflow (the variables in the flow are replaced with the values specified in job.properties). So you can set the job tracker and namenode in job.properties and use the variables in workflow.xml or you can set it directly just in workflow.xml.
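To make the parametrization idea concrete, here is a toy sketch of replacing ${name} placeholders in a workflow with values read from job.properties. It is not Oozie's actual EL evaluator: it only handles plain key=value lines and simple ${name} references, and both function names are invented.

```python
import re


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def substitute(template, props):
    """Replace ${name} with props[name]; leave unknown names untouched."""
    return re.sub(
        r"\$\{([\w.]+)\}",
        lambda m: props.get(m.group(1), m.group(0)),
        template,
    )
```

With nameNode=hdfs://hadoop1:8020 in job.properties, substitute("<name-node>${nameNode}</name-node>", props) gives <name-node>hdfs://hadoop1:8020</name-node>, which is the same text you would otherwise hard-code in workflow.xml. Function-style expressions such as ${wf:errorMessage(...)} are left alone by this sketch.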
Are you sure that your Job Tracker's port is 50300? It seems suspicious for two reasons: normally, job tracker's web UI is accessible at http://ip:50030 but that is not the port that you are supposed to use for this configuration. For a Hadoop job configuration, the job tracker port is usually 8021, 9001, or 8012.
So it seems your problem is with setting the correct job tracker and name node (as opposed to setting it in the correct place). Try to check your Hadoop's settings in mapred-site.xml and core-site.xml for the correct ports and IPs. Alternatively, you can simply SSH to the machines running your Hadoop nodes and run netstat -plnt and look for the ports mentioned here.
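If you cannot SSH to the nodes, a quick check from the machine that submits the job is to try opening a TCP connection to the host and port you configured. This is a small sketch using Python's standard socket module. The function name is made up, and a successful connection only shows that something is listening there, not that it is the right service.

```python
import socket


def is_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

For example, is_port_open("hadoop1", 8020) tests the name node address from the question. A False result is consistent with the "Connection refused" in the error, or with hadoop1 not resolving from that machine.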
I see a difference between the ports you have specified for the namenode and the jobtracker. Just check what you have configured in mapred-site.xml and core-site.xml and use the appropriate port.
It might also be that the hadoop1 hostname is not being resolved. Try using the IP address of the server, or add hadoop1 to your /etc/hosts file.
You define the properties file so that the workflow can be parameterized.
Try port 9000, which is the default. Otherwise we need to see the Hadoop configuration files.
