How to output Hadoop EL counters from a streaming MapReduce job triggered by Oozie?

I am triggering a streaming MapReduce job using Oozie, for which I would like to collect the following Hadoop EL constants:
MAP_IN: Hadoop mapper input records counter name.
MAP_OUT: Hadoop mapper output records counter name.
REDUCE_IN: Hadoop reducer input records counter name.
REDUCE_OUT: Hadoop reducer output records counter name.
I see that these can be accessed using
${hadoop:counters('mr-action')[RECORDS][REDUCE_OUT]}
However, I have no idea how to get these values to be output back to either the screen via STDOUT or to a file in HDFS on the server from where I'm launching the Oozie workflow.
I've tried passing these values to a shell action and then echoing / appending them to a file, but I believe that runs on the data nodes, so I'm not able to see the output. I've also tried setting oozie.action.external.stats.write to true, as one thread suggested, and then calling
oozie job -info -verbose
but I still don't see these counters showing up under an External Stats field. Any suggestions of how to get these counters output will be very helpful.

Earlier I was running oozie job -info job-id -verbose, which wasn't displaying the external stats. The key was to make the following changes.
In the workflow.xml file, under the action I want to collect counters for, add the following property to the configuration:
<action name="mr-action">
    <configuration>
        <property>
            <name>oozie.action.external.stats.write</name>
            <value>true</value>
        </property>
    </configuration>
</action>
Then, after the job has run, execute the following on the command line:
oozie job -info job-id#mr-action -verbose
which gives me the counters I was looking for.
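If you also want the values pushed out by the workflow itself rather than pulled via the CLI, the same EL expressions can be embedded in a later action. A sketch using an email action (the node names and address are placeholders, and it assumes SMTP is configured for the Oozie server):
<action name="report-counters">
    <email xmlns="uri:oozie:email-action:0.1">
        <to>you@example.com</to>
        <subject>Counters for ${wf:id()}</subject>
        <body>
            map in:     ${hadoop:counters('mr-action')[RECORDS][MAP_IN]}
            map out:    ${hadoop:counters('mr-action')[RECORDS][MAP_OUT]}
            reduce in:  ${hadoop:counters('mr-action')[RECORDS][REDUCE_IN]}
            reduce out: ${hadoop:counters('mr-action')[RECORDS][REDUCE_OUT]}
        </body>
    </email>
    <ok to="end"/>
    <error to="fail"/>
</action>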

Related

Find running job priority

How can I find the priority used by a job running in Hadoop?
I tried Hadoop commands such as hadoop job, yarn container, and mapred job, but couldn't find a way to get the priority of a running job.
You can use the getJobPriority() method in your MapReduce code.
Use:
hadoop job -list
...which shows information for all running jobs, including their priority.
hadoop job -list all
...which shows information for all jobs (running, succeeded, failed), including their priority.
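If you want the priority programmatically, a minimal sketch using the old mapred client API (this assumes getJobPriority() on org.apache.hadoop.mapred.JobStatus; on the newer mapreduce API the equivalent accessor is getPriority()):
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobStatus;

// Sketch: list every job known to the cluster together with its run state and priority.
// The class name is illustrative; it assumes the cluster configuration is on the classpath.
public class ListJobPriorities {
    public static void main(String[] args) throws Exception {
        JobClient client = new JobClient(new JobConf());
        for (JobStatus status : client.getAllJobs()) {
            System.out.println(status.getJobID() + "\t"
                    + status.getRunState() + "\t"
                    + status.getJobPriority());
        }
    }
}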

How to set HADOOP_CLASSPATH via oozie while running HBase job

I'm using CDH5. I'm hitting an HBase bug while running a MapReduce job through Oozie in a fully distributed environment. The job connects to HBase and adds records programmatically. Please refer to these links to understand the bug I'm hitting. Note that I cannot modify the MapReduce job code. The job runs fine from the command line after setting the HADOOP_CLASSPATH environment variable, but there seems to be no way to set/override this environment variable from Oozie, so the job fails when run from Oozie. Has anybody experienced this and found a workaround?
http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.9.1/bk_releasenotes_hdp_2.0/content/ch_relnotes-hdpch_relnotes-hdp-2.0.9.0-knownissues-hbase.html
https://issues.apache.org/jira/browse/HBASE-11118
You can set HADOOP_CLASSPATH on the system that runs the Oozie server, so sending it with every request is not required.
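For example (a sketch; the exact restart command depends on your distribution, and hbase classpath is used here only as one way to pull in the HBase jars):
# On the host running the Oozie server:
export HADOOP_CLASSPATH=$(hbase classpath)
# Restart Oozie so the new environment is picked up, e.g. with CDH packages:
sudo service oozie restart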
Otherwise, you can set it in XML. In oozie-site.xml, set:
<property>
    <name>oozie.service.HadoopAccessorService.hadoop.configurations</name>
    <value>*=/home/user/oozie/etc/hadoop</value>
</property>
where /home/user/oozie/etc/hadoop is the absolute path where the Hadoop configuration files are located.

hive-builtins-0.9.0.jar FileNotFoundException

I am a newbie and I am trying to run a Hive query:
hive> SELECT xpath('<a><b id="foo">b1</b><b id="bar">b2</b></a>', '//@id') FROM src LIMIT 1;
When I execute the above command I get the following error:
Job Submission failed with exception
'java.io.FileNotFoundException(File does not exist:
hdfs://localhost:9100/usr/local/hive/lib/hive-builtins-0.9.0.jar)'
Execution failed with exit status: 2 Obtaining error information
Task failed! Task ID: Stage-1
It is looking for hive-builtins-0.9.0.jar in HDFS, but this file is available under $HIVE_HOME/lib. Why should it be uploaded to HDFS?
I have the following settings in ~/.hiverc, which run when Hive starts:
set hive.cli.print.current.db=true;
set hive.exec.mode.local.auto=true;
If I add this Hadoop property to hive-site.xml, it gives me the required output:
<property>
    <name>fs.defaultFS</name>
    <value>file:///</value>
</property>
but ideally I want to set it to
<value>hdfs://localhost</value>
since I have other Hadoop-specific Java programs that use HDFS. What mistake am I making here? Is there a configuration I need to set at startup?
As requested, here is my $PATH information:
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/local/hadoop/bin:/usr/local/hadoop/sbin:/usr/local/hadoop/bin:/usr/local/hadoop/sbin:/usr/local/hadoop/bin:/usr/local/hadoop/sbin:/usr/local/hive/bin
Please help.
Many thanks
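One workaround often suggested for this particular error (a sketch, not from this thread; adjust the paths to your installation) is to mirror the jar into HDFS at the exact path the job client is resolving, so the submission can find it:
# Copy the builtins jar into HDFS at the path shown in the error message.
# (On Hadoop 1.x, -mkdir creates parent directories without the -p flag.)
hadoop fs -mkdir -p /usr/local/hive/lib
hadoop fs -put $HIVE_HOME/lib/hive-builtins-0.9.0.jar /usr/local/hive/lib/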

Output Folders for Amazon EMR

I want to run a custom jar whose main class runs a chain of MapReduce jobs, with the output of the first job going in as the input to the second job, and so on.
What do I set in FileOutputFormat.setOutputPath("what path should be here?");
If I specify -outputdir in the arguments, I get a FileAlreadyExists error. If I don't specify it, then I do not know where the output will land. I want to be able to see the output from every job of the chained MapReduce jobs.
Thanks in advance. Please help!
You are likely getting the FileAlreadyExists error because the output directory exists prior to the job you are running. Make sure to delete the directories that you specify as output for your Hadoop jobs; otherwise you will not be able to run those jobs.
Good practice is to take the output path from the command line, as it increases the flexibility of your code: you compile your jar only once, as long as the changes are only to paths.
For example, on EMR, once you have launched your cluster and compiled your jar:
dfs_ip_folder=HDFS_IP_DIR
dfs_op_folder=HDFS_OP_DIR
hadoop jar hadoop-examples-*.jar wordcount ${dfs_ip_folder} ${dfs_op_folder}
Note: you have to create dfs_ip_folder and store the input data inside it. dfs_op_folder will be created automatically on HDFS, not on the local file system.
To access the HDFS output folder you can either copy it to the local file system or cat it, e.g.:
hadoop fs -cat ${dfs_op_folder}/<file_name>
hadoop fs -copyToLocal ${dfs_op_folder} ${your_local_input_dir_path}
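To keep every intermediate result visible, give each job in the chain its own output directory and delete stale directories up front. A minimal driver sketch under those assumptions (the class, job, and path names here are placeholders, not the poster's jar):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainedJobsDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);        // initial input directory
        Path intermediate = new Path(args[1]); // output of job 1, input of job 2
        Path output = new Path(args[2]);       // final output of job 2

        // Remove stale output directories so the FileAlreadyExists error cannot occur.
        intermediate.getFileSystem(conf).delete(intermediate, true);
        output.getFileSystem(conf).delete(output, true);

        Job job1 = Job.getInstance(conf, "chain-step-1");
        job1.setJarByClass(ChainedJobsDriver.class);
        // job1.setMapperClass(...); job1.setReducerClass(...); set key/value classes here
        FileInputFormat.addInputPath(job1, input);
        FileOutputFormat.setOutputPath(job1, intermediate);
        if (!job1.waitForCompletion(true)) {
            System.exit(1);
        }

        Job job2 = Job.getInstance(conf, "chain-step-2");
        job2.setJarByClass(ChainedJobsDriver.class);
        // job2.setMapperClass(...); job2.setReducerClass(...);
        FileInputFormat.addInputPath(job2, intermediate); // job 1's output feeds job 2
        FileOutputFormat.setOutputPath(job2, output);
        System.exit(job2.waitForCompletion(true) ? 0 : 1);
    }
}
The intermediate directory stays on HDFS (or S3) after the run, so the output of every step in the chain can be inspected afterwards.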

Can I rename the oozie job name dynamically

We have a Hadoop service under which we have multiple applications. We need to process the data for each application by re-executing the same workflow, and these runs are scheduled at the same time of day. The issue is that when these jobs are running it is hard to know which application a given job is running for (or has failed/succeeded for). Of course, I can open the job configuration and find out, but that takes time since there are tens of applications running under that service.
Is there any option in Oozie to dynamically pass the name of the workflow (or part of it) when executing the job, such as:
oozie job -run -config <filename> -name "<NameIWishToGive>"
OR
oozie job -run -config <filename> -nameSuffix "<MyApplicationNameUnderTheService>"
Also, we don't wish to create multiple job folders to execute separately, as that would be too much copy-paste.
Please suggest.
It looks to me like you should be able to just use properties set in the job config.
I was able to get a dynamic name by doing the following.
Here's an example of my workflow.xml:
<workflow-app xmlns="uri:oozie:workflow:0.2" name="map-reduce-wf-${environment}">
...
</workflow-app>
And in my job.properties I had:
...
environment=test
...
The name ended up being: "map-reduce-wf-test"
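Since these are just job configuration properties, the same value can also be supplied or overridden on the command line with -D, which gets close to the -name switch asked about above (the property and value here are only examples):
oozie job -run -config job.properties -Denvironment=appA
With the workflow name above, that run then shows up as "map-reduce-wf-appA".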
You will find a whole bunch of Oozie command lines here in the Apache docs. I'm not sure which one exactly you are looking for, so I thought I'd just paste the link. Hope this helps!
I couldn't find anything in Oozie to do that. Here is a script that does a find/replace of #{appName} and #{frequency} in the *.xml files and uploads all the files to HDFS. Values are taken from the properties file passed to the script as the third argument.
Gist - https://gist.github.com/epishkin/5952522
Example:
./upload.sh simple_reports namenode01 simple_reports/coordinator_script-1.properties
where 'simple_reports' is a folder with workflow.xml and coordinator.xml files.
workflow.xml:
<workflow-app name="#{appName}" xmlns="uri:oozie:workflow:0.3">
...
</workflow-app>
coordinator.xml:
<coordinator-app name="#{appName}-coord" xmlns="uri:oozie:coordinator:0.2"
                 frequency="#{frequency}"
                 start="${start}"
                 end="${end}"
                 timezone="America/New_York">
...
</coordinator-app>
coordinator_script-1.properties:
appName=multi_network
frequency=${coord:days(7)}
...
Hope this helps.
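The heart of such a script is just a substitution pass over the XML files followed by an upload, roughly like this (a sketch, not the contents of the gist above; the variable names and target path are placeholders):
appName=multi_network   # value taken from the properties file
mkdir -p generated
for f in simple_reports/*.xml; do
    sed "s|#{appName}|${appName}|g" "$f" > "generated/$(basename "$f")"
done
# -f overwrites existing files (Hadoop 2.x); on older versions remove the old files first.
hadoop fs -put -f generated/* /user/oozie/apps/${appName}/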
I recently faced this issue as well: all the tables use the same workflow, but the name of the Oozie application should reflect the name of the table it is processing. Pass the table name as a parameter from job.properties, and the name of the Oozie application will then come out as dataload_tablename.
