How can you resolve Oozie error JA009 - hadoop

I am running a simple Oozie workflow on Cloudera VM. The sub-workflow calls a shell script which sends a test email. However, I am getting the JA009 error:
(JA009: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.).
I have already changed mapreduce.framework.name from yarn to classic in the following places:
/etc/oozie/conf/hadoop-conf/core-site.xml
/etc/oozie/conf/hadoop-config.xml
/etc/hadoop/conf/mapred-site.xml
Also, in /etc/hadoop/conf/hadoop-env.sh I changed:
"export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce"
to:
"export HADOOP_MAPRED_HOME=/usr/lib/hadoop-0.20-mapreduce"
Is there anything I am missing? Yarn is not showing up in hadoop fs -ls /user (hive, pig, spark etc are). So I am assuming Yarn is not pre-installed here.

Related

Getting No such file or directory error when i use shell from oozie

i am trying to run shell script from oozie, when i am using hadoop commands inside shell script, it's working fine but when i am trying to run local commands, i am getting no such file or directory exception.
Example:
sample.sh
hadoop fs -touchz /user/123/test.txt
this script is working, when i use NFS path or local path i am getting
"No such file or directory" exception,
Example:
sample.sh
touch /HDFS/user/123/test.txt
is there anything i am missing, please let me know, '/HDFS' is NFS path.
The thing is all the Oozie workflows will be executed by the Oozie server so if you have the directory /HDFS/user/123 created already in the Oozie server, it will work.
So the solution to make it work would be configuring the NFS to work with (attach) the Oozie server.
Update
After clarifying some of my own unknowns, what I had mentioned above is not entirely correct. Here is my updated answer:
When you, the client, submit the Oozie job, with YARN, it goes to the ResourceManager which then negotiates and routes it to any of the NodeManagers, so for your case to work, you would have to have the NFS mount configured on all of the NodeManagers to work properly.

Spark installed but no command 'hdfs' or 'hadoop' found

I am a new pyspark user.
I just downloaded and installed a spark cluster ("spark-2.0.2-bin-hadoop2.7.tgz")
after installation I wanted to access the file system (upload local files to cluster). But when I tried to type hadoop or hdfs in command it will say "no command found".
Am I gonna install hadoop/HDFS (I thought it's built in the spark, I don't get)?
Thanks in advance.
You have to install hadoop first to access HDFS.
Follow this http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
Choose the latest version of hadoop from the apache site.
Once you done with hadoop setup go to spark http://d3kbcqa49mib13.cloudfront.net/spark-2.0.2-bin-hadoop2.7.tgz download this, Extract files. Setup java_home and hadoop_home in spark-env.sh.
You don't have hdfs or hadoop on classpath so this is the reason why you are getting message: "no command found".
If you run \yourparh\hadoop-2.7.1\bin\hdfs dfs -ls / it should works and show root content.
But, You can add your hadoop/bin (hdfs, hadoop ...) commands to classpath with something like this:
export PATH $PATH:$HADOOP_HOME/bin
where HADOOP_HOME is your env. variable with path to hadoop installation folder (download and install is required)

Job tracker and Task tracker don't sow up when ran the start-all.sh command in ububtu for hadoop

Job tracker and Task tracker don't sow up when ran the start-all.sh command in ububtu for hadoop
I do get the rest of the processes while i run the "JPS" command in unix.
Not sure why i am not being shown with the job tracker and task tracker.Have been following couple of links and couldn't get my prob sorted.
Steps done :
-Multiple times formatted the namenode
-Multiple time deleted and recreated the tmp folder with appropriate permissions.
What could be the issue ?
Any suggestions would really help me as i am struggling in setting up hadoop on my laptop.I am new to it though.
Try starting jobtracker and tasktracker separately.
From your hadoop HOME directory run
. bin/../libexec/hadoop-config.sh
Then from hadoop BIN directory run
hadoop-daemon.sh --config $HADOOP_CONF_DIR start jobtracker
hadoop-daemon.sh --config $HADOOP_CONF_DIR start tasktracker
You must have been using hadoop 2.x version where jobtracker is replaced with YARN resource manager. Using jps(jdk is needed) you can check whether resouce manager is running. If it is running then the default url for it is (host-name):8088. You can check your nodes,jobs also configuration there.If not running then start them with sbin/start-yarn.sh.

Spark Standalone Mode: Worker not starting properly in cloudera

I am new to the spark, After installing the spark using parcels available in the cloudera manager.
I have configured the files as shown in the below link from cloudera enterprise:
http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM4Ent/4.8.1/Cloudera-Manager-Installation-Guide/cmig_spark_installation_standalone.html
After this setup, I have started all the nodes in the spark by running /opt/cloudera/parcels/SPARK/lib/spark/sbin/start-all.sh. But I couldn't run the worker nodes as I got the specified error below.
[root#localhost sbin]# sh start-all.sh
org.apache.spark.deploy.master.Master running as process 32405. Stop it first.
root#localhost.localdomain's password:
localhost.localdomain: starting org.apache.spark.deploy.worker.Worker, logging to /var/log/spark/spark-root-org.apache.spark.deploy.worker.Worker-1-localhost.localdomain.out
localhost.localdomain: failed to launch org.apache.spark.deploy.worker.Worker:
localhost.localdomain: at java.lang.ClassLoader.loadClass(libgcj.so.10)
localhost.localdomain: at gnu.java.lang.MainThread.run(libgcj.so.10)
localhost.localdomain: full log in /var/log/spark/spark-root-org.apache.spark.deploy.worker.Worker-1-localhost.localdomain.out
localhost.localdomain:starting org.apac
When I run jps command, I got:
23367 Jps
28053 QuorumPeerMain
28218 SecondaryNameNode
32405 Master
28148 DataNode
7852 Main
28159 NameNode
I couldn't run the worker node properly. Actually I thought to install a standalone spark where the master and worker work on a single machine. In slaves file of spark directory, I given the address as "localhost.localdomin" which is my host name. I am not aware of this settings file. Please any one cloud help me out with this installation process. Actually I couldn't run the worker nodes. But I can start the master node.
Thanks & Regards,
bips
Please notice error info below:
localhost.localdomain: at java.lang.ClassLoader.loadClass(libgcj.so.10)
I met the same error when I installed and started Spark master/workers on CentOS 6.2 x86_64 after making sure that libgcj.x86_64 and libgcj.i686 had been installed on my server, finally I solved it. Below is my solution, wish it can help you.
It seem as if your JAVA_HOME environment parameter didn't set correctly.
Maybe, your JAVA_HOME links to system embedded java, e.g. java version "1.5.0".
Spark needs java version >= 1.6.0. If you are using java 1.5.0 to start Spark, you will see this error info.
Try to export JAVA_HOME="your java home path", then start Spark again.

Need help adding multiple DataNodes in pseudo-distributed mode (one machine), using Hadoop-0.18.0

I am a student, interested in Hadoop and started to explore it recently.
I tried adding an additional DataNode in the pseudo-distributed mode but failed.
I am following the Yahoo developer tutorial and so the version of Hadoop I am using is hadoop-0.18.0
I tried to start up using 2 methods I found online:
Method 1 (link)
I have a problem with this line
bin/hadoop-daemon.sh --script bin/hdfs $1 datanode $DN_CONF_OPTS
--script bin/hdfs doesn't seem to be valid in the version I am using. I changed it to --config $HADOOP_HOME/conf2 with all the configuration files in that directory, but when the script is ran it gave the error:
Usage: Java DataNode [-rollback]
Any idea what does the error mean? The log files are created but DataNode did not start.
Method 2 (link)
Basically I duplicated conf folder to conf2 folder, making necessary changes documented on the website to hadoop-site.xml and hadoop-env.sh. then I ran the command
./hadoop-daemon.sh --config ..../conf2 start datanode
it gives the error:
datanode running as process 4190. stop it first.
So I guess this is the 1st DataNode that was started, and the command failed to start another DataNode.
Is there anything I can do to start additional DataNode in the Yahoo VM Hadoop environment? Any help/advice would be greatly appreciated.
Hadoop start/stop scripts use /tmp as a default directory for storing PIDs of already started daemons. In your situation, when you start second datanode, startup script finds /tmp/hadoop-someuser-datanode.pid file from the first datanode and assumes that the datanode daemon is already started.
The plain solution is to set HADOOP_PID_DIR env variable to something else (but not /tmp). Also do not forget to update all network port numbers in conf2.
The smart solution is start a second VM with hadoop environment and join them in a single cluster. It's the way hadoop is intended to use.

Resources