PySpark actions submitted with Oozie failing: '[Errno 2] No such file or directory' - hadoop

I am trying to submit basic Spark actions to YARN on a Hadoop cluster through an Oozie workflow, and I get the following error (from the YARN application logs):
>>> Invoking Spark class now >>>
python: can't open file '/absolute/local/path/to/script.py': [Errno 2] No such file or directory
Hadoop Job IDs executed by Spark:
Intercepting System.exit(2)
<<< Invocation of Main class completed <<<
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SparkMain], exit code [2]
But I am sure that the file is there. In fact, when I run the following command:
spark-submit --master yarn --deploy-mode client /absolute/local/path/to/script.py arg1 arg2
it works. I get the output that I want.
Note: I followed everything in this article to get it set up (I am using Spark2):
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.1/bk_spark-component-guide/content/ch_oozie-spark-action.html
Any ideas?
workflow.xml (simplified for clarity)
<action name = "action1">
<spark xmlns="uri:oozie:spark-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<master>${sparkMaster}</master>
<mode>${sparkMode}</mode>
<name>action1</name>
<jar>${integrate_script}</jar>
<arg>arg1</arg>
<arg>arg2</arg>
</spark>
<ok to = "end" />
<error to = "kill_job" />
</action>
job.properties (simplified for clarity)
oozie.wf.application.path=${nameNode}/user/${user.name}/${user.name}/${zone}
oozie.use.system.libpath=true
nameNode=hdfs://myNameNode:8020
jobTracker=myJobTracker:8050
oozie.action.sharelib.for.spark=spark2
sparkMaster=yarn
sparkMode=client
integrate_script=/absolute/local/path/to/script.py
zone=somethingUsefulForMe
Exception when running in CLUSTER mode:
diagnostics: Application application_1502381591395_1000 failed 2 times due to AM Container for appattempt_1502381591395_1000_000002 exited with exitCode: -1000
For more detailed output, check the application tracking page: http://hostname:port/cluster/app/application_1502381591395_1000 Then click on links to logs of each attempt.
Diagnostics: File does not exist: hdfs://hostname:port/user/oozie/.sparkStaging/application_1502381591395_1000/__spark_conf__.zip
java.io.FileNotFoundException: File does not exist: hdfs://hostname:port/user/oozie/.sparkStaging/application_1502381591395_1000/__spark_conf__.zip
at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1427)
at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1419)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1419)
at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:63)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:361)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:358)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
EDIT2:
I just tried it from the shell, and it fails due to an import.
/scripts/functions/tools.py
/scripts/functions/__init__.py
/scripts/myScript.py
from functions.tools import *
And that's the line it fails at. I'm assuming the script is first copied over to the cluster and run there. How do I get all the required modules to go with it? By modifying the PYTHONPATH on HDFS? I understand why it's not working; I'm just not sure how to fix it.
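One approach I'm considering (the zip name and HDFS path below are just placeholders, not my current setup) is to zip the functions package, put the archive on HDFS, and ship it to the executors via --py-files in the action's spark-opts:
cd /scripts
zip -r functions.zip functions/
hdfs dfs -put -f functions.zip /user/myUser/apps/functions.zip
and then in the Spark action:
<spark-opts>--py-files hdfs:///user/myUser/apps/functions.zip</spark-opts>
Would that be the right way to do it?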
EDIT3:
See the stack trace below. Most of the comments online say the issue is that the Python code is setting the master to "local". This is not the case. What's more, I even removed everything Spark-related from the Python script and still get the same issue.
Diagnostics: File does not exist: hdfs://hdfs/path/user/myUser/.sparkStaging/application_1502381591395_1783/pyspark.zip
java.io.FileNotFoundException: File does not exist: hdfs://hdfs/path/user/myUser/.sparkStaging/application_1502381591395_1783/pyspark.zip
at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1427)
at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1419)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1419)
at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:63)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:361)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:358)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

If you want to call the script with Oozie, it needs to be placed on HDFS (because you'll never know which node will run the launcher).
After you place it on HDFS, you need to explicitly tell spark-submit to fetch it from the remote filesystem, so in job.properties set:
integrate_script=hdfs:///absolute/hdfs/path/to/script.py
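For example (the HDFS destination below is just a placeholder; adjust it to your layout):
hdfs dfs -mkdir -p /user/myUser/apps
hdfs dfs -put -f /absolute/local/path/to/script.py /user/myUser/apps/script.py
and then point the property at that location:
integrate_script=hdfs:///user/myUser/apps/script.py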

TL;DR: Make sure you don't set SparkSession.builder.master('something') in your application code. This must be set in the spark-submit arguments only.
I came across this question while Googling a similar problem.
My YARN job was failing with an error, java.io.FileNotFoundException: File does not exist, for a file called __spark_conf__.zip or pyspark.zip in the staging directory on HDFS.
One of the comments in this ticket https://issues.apache.org/jira/browse/SPARK-10795 helped me understand my mistake.
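In other words, a minimal sketch of the change I made (the app name is just illustrative):
from pyspark.sql import SparkSession

# Wrong: hard-coding the master here overrides whatever spark-submit or the
# Oozie Spark action passes in, which broke cluster mode for me
# spark = SparkSession.builder.master('local[*]').appName('action1').getOrCreate()

# Right: leave the master out of the code and let --master yarn decide
spark = SparkSession.builder.appName('action1').getOrCreate()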

Related

M/R job submission failing with error: Could not find Yarn tags property (mapreduce.job.tags)

I am getting the following exception when running a map/reduce job. We submit map/reduce jobs through Oozie.
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.JavaMain], main() threw exception, Could not find Yarn tags property (mapreduce.job.tags)
java.lang.RuntimeException: Could not find Yarn tags property (mapreduce.job.tags)
at org.apache.oozie.action.hadoop.LauncherMainHadoopUtils.getChildYarnJobs(LauncherMainHadoopUtils.java:53)
at org.apache.oozie.action.hadoop.LauncherMainHadoopUtils.killChildYarnJobs(LauncherMainHadoopUtils.java:88)
at org.apache.oozie.action.hadoop.JavaMain.run(JavaMain.java:46)
at org.apache.oozie.action.hadoop.LauncherMain.run(LauncherMain.java:46)
at org.apache.oozie.action.hadoop.JavaMain.main(JavaMain.java:38)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:228)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler.runSubtask(LocalContainerLauncher.java:378)
at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler.runTask(LocalContainerLauncher.java:296)
at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler.access$200(LocalContainerLauncher.java:181)
at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler$1.run(LocalContainerLauncher.java:224)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
I did a Google search and found the following SO post: Hadoop MapReduce job starts but can not find Map class? However, the resolution mentioned in that post is not working for me; I cannot see any file-permission-related errors in the log files.
We are using the Cloudera distribution.
You need to upgrade the Oozie sharelib. Follow the instructions in Cloudera's documentation, namely:
sudo oozie-setup sharelib create -fs FS_URI -locallib /usr/lib/oozie/oozie-sharelib-yarn
Don't forget to restart Oozie afterwards. This helped us solve this particular problem after a CDH 5.5 upgrade.
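After recreating the sharelib, you can also refresh and inspect it from the Oozie CLI (the server URL below is just a placeholder):
oozie admin -oozie http://oozie-host:11000/oozie -sharelibupdate
oozie admin -oozie http://oozie-host:11000/oozie -shareliblist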

ERROR : org.apache.oozie.action.hadoop.PigMain not found

I'm trying to execute a simple Pig script through an Oozie workflow which imports a Python JAR as well as some other JARs, and I eventually get an error like:
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.PigMain], exception invoking main(), java.lang.ClassNotFoundException: Class org.apache.oozie.action.hadoop.PigMain not found
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.oozie.action.hadoop.PigMain not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1895)
at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:224)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.lang.ClassNotFoundException: Class org.apache.oozie.action.hadoop.PigMain not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1801)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1893)
... 9 more
Oozie Launcher failed, finishing Hadoop job gracefully
For this workflow I added all the JARs to the lib directory, including pig.jar.
Please check that the Pig JAR is present in the physical location on the node where the Oozie workflow is running.
Alternatively, you can place the Pig JAR in the HDFS location of the Oozie shared lib and pass the parameter
oozie.use.system.libpath=true
which makes the launcher read the JAR from the shared-lib location.
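For example (the sharelib path varies by install, so the lib_<timestamp> directory below is a placeholder you would look up first):
oozie admin -oozie http://oozie-host:11000/oozie -shareliblist pig
hdfs dfs -put pig.jar /user/oozie/share/lib/lib_<timestamp>/pig/
oozie admin -oozie http://oozie-host:11000/oozie -sharelibupdate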

hadoop mapreduce job fails as it checking for jars in the HDFS (only in hbase jobs)

A Hadoop MapReduce job fails with the exception in the log below whenever the job touches HBase, even though I added the required JARs to HADOOP_CLASSPATH.
I can work around the issue by copying the required JARs to the expected paths on HDFS, but I don't think that is the correct way to handle it.
I want to know if there is something I'm missing that I should do to handle this issue.
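For reference, a launch along these lines is what I mean by putting the JARs on the client classpath (the job JAR and driver class names are placeholders):
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$(hbase classpath)
hadoop jar my-hbase-job.jar com.example.MyHBaseDriver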
BTW this issue is the same as in:
issue1
issue2
issue3
Here is the error:
Exception in thread "main" java.io.FileNotFoundException: File does not exist: hdfs://localhost/dedge1/hadoop/hbase-0.96.1.1-hadoop2/lib/netty-3.6.6.Final.jar
at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1110)
at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1102)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1102)
at org.apache.hadoop.mapreduce.filecache.ClientDistributedCacheManager.getFileStatus(ClientDistributedCacheManager.java:288)
at org.apache.hadoop.mapreduce.filecache.ClientDistributedCacheManager.getFileStatus(ClientDistributedCacheManager.java:224)
at org.apache.hadoop.mapreduce.filecache.ClientDistributedCacheManager.determineTimestamps(ClientDistributedCacheManager.java:93)
at org.apache.hadoop.mapreduce.filecache.ClientDistributedCacheManager.determineTimestampsAndCacheVisibilities(ClientDistributedCacheManager.java:57)
at org.apache.hadoop.mapreduce.JobSubmitter.copyAndConfigureFiles(JobSubmitter.java:264)
at org.apache.hadoop.mapreduce.JobSubmitter.copyAndConfigureFiles(JobSubmitter.java:300)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:387)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1268)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1265)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)

Hadoop can't find example jar file

I am trying to run this in pseudo-distributed mode following the directions in Hadoop in Action. It ran when I used local/standalone mode.
Now it can't seem to find the path to the jar file.
cd $HADOOP_HOME
jps
17559 JobTracker
17466 SecondaryNameNode
17791 TaskTracker
16993 NameNode
17942 Jps
bin/hadoop hadoop-examples-1.0.3.jar wordcount
Warning: $HADOOP_HOME is deprecated.
Exception in thread "main" java.lang.NoClassDefFoundError: hadoop-examples-1/0/3/jar
Caused by: java.lang.ClassNotFoundException: hadoop-examples-1.0.3.jar
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
Could not find the main class: hadoop-examples-1.0.3.jar. Program will exit.
My CLASSPATH is set to $HADOOP_HOME
Any ideas?
A few things don't look right:
You should also have a DataNode process running; check the logs to see what happened to it.
The correct command to use is bin/hadoop jar hadoop-examples-1.0.3.jar wordcount (see the full example below).
You should also have HADOOP_CONF_DIR set to point to the directory with hdfs-site.xml and core-site.xml.
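For example, a complete invocation looks like this (the input and output HDFS paths are placeholders):
bin/hadoop jar hadoop-examples-1.0.3.jar wordcount /user/hduser/input /user/hduser/output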

hadoop ClassNotFoundException when running start-all.sh

I tried to run ./hadoop start-all.sh
Unfortunately this error is thrown
Exception in thread "main" java.lang.NoClassDefFoundError: start/all/sh
Caused by: java.lang.ClassNotFoundException: start.all.sh
at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
Could not find the main class: start.all.sh. Program will exit.
I thought it might have been the Hadoop path, but that does not seem to fix the issue. The path that I set in hadoop-env.sh is /usr/local/hadoop/bin.
I looked at other posts with similar titles (Hadoop: strange ClassNotFoundException, what is considered the main class). I tried changing the path to /usr/local/hadoop/bin/.
It's a shell script; just running start-all.sh should do. You do not need the hadoop command. You can find more information here: http://hadoop.apache.org/common/docs/r0.19.2/quickstart.html
Just run it as follows:
/path/to/Hadoop/home/bin/start-all.sh
In your case:
/usr/local/hadoop/bin/start-all.sh
Since you are already in the hadoop/bin folder, there is no need to run ./hadoop start-all.sh; just run ./start-all.sh instead.
It will not throw any error and it will start your Hadoop processes.
