How to set HADOOP_CLASSPATH via Oozie while running an HBase job - hadoop

I'm using CDH5. I'm hitting an HBase bug while running a MapReduce job through Oozie in a fully distributed environment. The job connects to HBase and adds records programmatically. Please refer to the links below to understand the bug I'm hitting. Note that I cannot modify the MapReduce job code. The job runs fine from the command line after setting the HADOOP_CLASSPATH environment variable, but there seems to be no way to set or override this environment variable from Oozie. As a result, the job fails when run from Oozie. Has anybody experienced this problem and found a workaround?
http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.9.1/bk_releasenotes_hdp_2.0/content/ch_relnotes-hdpch_relnotes-hdp-2.0.9.0-knownissues-hbase.html
https://issues.apache.org/jira/browse/HBASE-11118

You can set HADOOP_CLASSPATH on the system that runs the Oozie server, so that it does not have to be sent with every request.
Otherwise, it can be set in XML. In the file oozie-site.xml, set:
<property>
<name>oozie.service.HadoopAccessorService.hadoop.configurations</name>
<value>*=/home/user/oozie/etc/hadoop</value>
</property>
where /home/user/oozie/etc/hadoop is the absolute path of the directory containing the Hadoop configuration files.

Related

Unable to deploy Spark jobs using Oozie

I need to keep a Spark job running 24/7, and I am using Oozie for this. I have written workflow.xml and job.properties files containing the information necessary to invoke it.
However, when I try to submit the Oozie job using this:
oozie job –config /home/oozie/tst/job.properties -run
I get the following error message, which is very clear:
java.io.IOException: configuration is not specified
at org.apache.oozie.cli.OozieCLI.getConfiguration(OozieCLI.java:816)
at org.apache.oozie.cli.OozieCLI.jobCommand(OozieCLI.java:1055)
at org.apache.oozie.cli.OozieCLI.processCommand(OozieCLI.java:686)
at org.apache.oozie.cli.OozieCLI.run(OozieCLI.java:639)
at org.apache.oozie.cli.OozieCLI.main(OozieCLI.java:225)
configuration is not specified
The puzzling part is that the configuration file (job.properties) does exist locally at the path specified. I have also put the directory containing both files and the .jar into HDFS.
Any idea why this is failing?
Is Oozie the best tool for this task?
The -config parameter takes a local path, not an HDFS path. Check that job.properties is present at /home/oozie/tst/job.properties.
Check that job.properties contains oozie.wf.application.path=PATH_TO_HDFS_PATH_WHERE_WORKFLOW.XML_IS_PRESENT
Also, the dash (–) in front of config in your command is a different character from the dash (-) in front of run.
Specify the host in your command:
oozie job --oozie http://your_host:11000/oozie -config /home/oozie/tst/job.properties -run
11000 is the default port.
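The two checks from the answer above (a typographic dash in front of an option, and a missing oozie.wf.application.path entry) can be sketched as a small pre-flight helper. This is only an illustration using the JDK: the class OozieSubmitCheck and its methods are hypothetical and not part of Oozie, and the HDFS URI in the sample is a placeholder.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical pre-flight helper; not part of the Oozie CLI.
class OozieSubmitCheck {

    // True if the argument contains an en dash (U+2013) or em dash (U+2014),
    // which typically sneaks in when a command is copied from a web page or document.
    static boolean hasTypographicDash(String arg) {
        return arg.indexOf('\u2013') >= 0 || arg.indexOf('\u2014') >= 0;
    }

    // True if the given job.properties text defines a non-empty
    // oozie.wf.application.path entry.
    static boolean hasApplicationPath(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String path = props.getProperty("oozie.wf.application.path");
        return path != null && !path.trim().isEmpty();
    }

    public static void main(String[] args) {
        // Pass the command-line words you intend to give to "oozie job" as arguments.
        for (String arg : args) {
            if (hasTypographicDash(arg)) {
                System.out.println("Suspicious dash in argument: " + arg);
            }
        }
        // Sample content only; the HDFS URI is a placeholder.
        String sample = "oozie.wf.application.path=hdfs://namenode:8020/user/oozie/tst\n";
        System.out.println("application path set: " + hasApplicationPath(sample));
    }
}
```

To check a real file, read the local job.properties into a string and pass it to hasApplicationPath.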

Why is the MR2 map task running under the 'yarn' user and not under the user I ran the hadoop job as?

I'm trying to run a MapReduce job on MR2, Hadoop version 2.6.0-cdh5.8.0. The job uses a relative path to a directory containing a lot of files to be compressed based on some criteria (not really relevant to this question). I'm running my job as follows:
sudo -u my_user hadoop jar my_jar.jar com.example.Main
There is a folder with files on HDFS under the path /user/my_user/. But when I run my job I get the following exception:
java.io.FileNotFoundException: File /user/yarn/<path_from_job> does not exist.
I'm migrating this job from MR1, where it works correctly. My guess is that this happens because of YARN, since each container is started under the yarn user. In my job configuration I've tried setting mapreduce.job.user.name="my_user", but this didn't help.
I've found ${user.home} used in my job configuration, but I don't know where it is set or whether it is possible to change it.
The only solution I have found so far is to provide an absolute path to the folder. Is there any other way around this? I feel that this is not the correct approach.
Thank you

Hadoop 2.2.0 Web UI not showing Job Progress

I have installed single-node Hadoop 2.2.0 from this link. When I run a job from the terminal, it works fine and produces output. The web UIs I used are:
- Resource Manager : http://localhost:8088
- Namenode Daemon : http://localhost:50070
But in the Resource Manager's web UI (shown above) I can't see job progress such as Submitted Jobs, Running Jobs, etc.
My /etc/hosts file is as follows:
127.0.0.1 localhost
127.0.1.1 meitpict
My system has the IP 192.168.2.96 (I tried removing this IP, but it still didn't work).
The only host:port I specified is in core-site.xml:
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:54310</value>
</property>
I get these problems too when executing a MapReduce job.
What worked best for me: when you execute a MapReduce job, a link for checking the job's progress is printed on the machine from which you submitted it, something like the one below:
http://node1:19888/jobhistory/job/job_1409462238335_0002/mapreduce/app
Here node1 is mapped to 192.168.56.101 in my hosts file, and that is the NameNode machine.
So while your MapReduce job is running, you can go to the UI link provided by the MapReduce framework.
Once it is open, do not close it; there you can also find details about other jobs, such as when they started and when they finished.
So next time, check your PuTTY console output after submitting the MapReduce job; you should see a link for the current job that lets you check its status from the browser UI.
In Hadoop 2.x this problem could also be related to memory issues; see the question "MapReduce in Hadoop 2.2.0 not working".

JobTracker doesn't show completed tasks

I am running hadoop-1.1.2 on my laptop in pseudo-distributed mode. I am able to run a simple WordCount program, reading from and writing back to HDFS. I am also able to see JobTracker running at http://localhost:50030/jobtracker.jsp. However when I run the WordCount job from Eclipse, there is no entry, either under running or completed jobs.
Am I missing any additional property setting in one of the configuration files?
Thanks.
This happens because, when you run your job through Eclipse, the job runs locally inside Eclipse rather than being submitted to the JobTracker, since Eclipse does not know where to find the JobTracker. You need to tell it. Add the following lines to your code and it should work:
import org.apache.hadoop.conf.Configuration;
// Point the client at the NameNode and the JobTracker; adjust host and ports to your setup
Configuration conf = new Configuration();
conf.set("fs.default.name", "hdfs://localhost:9000");
conf.set("mapred.job.tracker", "localhost:9001");
Alternatively, copy your hdfs-site.xml and mapred-site.xml into your 'src' folder so that they are available on the classpath.
The job will then pick up all the configuration from them.

PIG automatically connected with default HDFS, how?

I just started learning Hadoop and Pig (two days ago!) for one of my future projects.
For experiments I've installed Hadoop in pseudo-distributed mode (HDFS on the default localhost:9000) and Pig (in map-reduce mode).
When I started Pig by typing the ./bin/pig command, it launched the Grunt command line and I got a message that Pig had connected to HDFS (localhost:9000). After that I could successfully access HDFS through Pig.
I was expecting to have to perform some manual configuration for Pig to access HDFS (as per various internet articles).
My question is: where did Pig get the default HDFS configuration (localhost:9000) from? I checked pig.properties but didn't find anything there. I need this information because I might change the default HDFS configuration in the future.
BTW, I have HADOOP_HOME and PIG_HOME defined in my OS PATH variable.
When installing Pig (I assume v0.10.0) you have to tell it how to connect to HDFS.
I don't know how you did this, but generally it is done by adding the Hadoop conf dir path to the PIG_CLASSPATH environment variable. You can also set HADOOP_CONF_DIR.
When you start the Grunt shell, Pig locates the directory of the Hadoop configuration XMLs and takes the values of fs.default.name (core-site.xml) and mapred.job.tracker (mapred-site.xml), i.e. the locations of the NameNode and the JobTracker.
For reference you may have a look at the Pig shell script to see how env. variables are collected and evaluated.
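To make the lookup described above concrete, here is a minimal sketch that reads the fs.default.name value out of core-site.xml content using only the JDK's XML parser. It is a stand-in for illustration, not how Pig does it internally (Pig relies on Hadoop's own configuration loading); the class HadoopSiteXml and its lookup method are hypothetical names, and the XML string is sample content.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper that reads one property from Hadoop-style site XML.
class HadoopSiteXml {

    // Returns the <value> of the <property> whose <name> equals propertyName,
    // or null if no such property is defined.
    static String lookup(String xml, String propertyName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList props = doc.getElementsByTagName("property");
            for (int i = 0; i < props.getLength(); i++) {
                Element prop = (Element) props.item(i);
                NodeList names = prop.getElementsByTagName("name");
                NodeList values = prop.getElementsByTagName("value");
                if (names.getLength() > 0 && values.getLength() > 0
                        && propertyName.equals(names.item(0).getTextContent().trim())) {
                    return values.item(0).getTextContent().trim();
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse configuration XML", e);
        }
    }

    public static void main(String[] args) {
        // Sample core-site.xml content matching the setup in the question.
        String coreSite =
              "<configuration>"
            + "<property><name>fs.default.name</name>"
            + "<value>hdfs://localhost:9000</value></property>"
            + "</configuration>";
        System.out.println(lookup(coreSite, "fs.default.name")); // prints hdfs://localhost:9000
    }
}
```

Running the same lookup against the core-site.xml in your Hadoop conf directory shows which NameNode address Pig will end up using.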
Pig can connect to the underlying HDFS in 3 ways:
1-
Pig uses HADOOP_HOME to find the Hadoop client to run.
Your HADOOP_HOME should already have been set up in your bash_profile:
export HADOOP_HOME=~/myHadoop/hadoop-2.5.2
2-
Or else your HADOOP_CONF_DIR may already have been set up; it points to the directory containing the XML files for the Hadoop configuration:
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop/
3-
And if neither of these is set up, you can still connect to the underlying HDFS by changing pig.properties, which is located under the PIG_HOME/conf dir.

Resources