Multiple Input Paths configuration in OOZIE - hadoop

I am trying to configure a MapReduce job in Oozie. The job has two different input formats and two input data folders. Following the post "How to configure oozie workflow for multi-input path with multiple mappers",
I added these properties to my workflow.xml:
<property>
<name>mapred.input.dir.formats</name>
<value>folder/data/*;org.apache.hadoop.mapred.SequenceFileInputFormat\,data/*;org.apache.hadoop.mapred.TextInputFormat</value>
</property>
<property>
<name>mapred.input.dir.mappers</name>
<value>folder/data/*;....PublicMapper\,data/*;....PublicMapper</value>
</property>
But when the job is launched I get the following error: "No input paths specified in job".
Can anyone help me?
Thanks.

You need to set some additional properties:
<property>
<name>mapreduce.inputformat.class</name>
<value>org.apache.hadoop.mapreduce.lib.input.DelegatingInputFormat</value>
</property>
<property>
<name>mapreduce.map.class</name>
<value>org.apache.hadoop.mapreduce.lib.input.DelegatingMapper</value>
</property>

I faced the same issue today, so I used the following properties.
<property>
<name>mapreduce.inputformat.class</name>
<value>org.apache.hadoop.mapreduce.lib.input.DelegatingInputFormat</value>
</property>
<property>
<name>mapreduce.map.class</name>
<value>org.apache.hadoop.mapreduce.lib.input.DelegatingMapper</value>
</property>
<property>
<name>mapreduce.input.multipleinputs.dir.formats</name>
<value>/first/input/path;org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat,/second/input/path;org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat</value>
</property>
<property>
<name>mapreduce.input.multipleinputs.dir.mappers</name>
<value>/first/input/path;com.first.Mapper,/second/input/path;com.second.Mapper</value>
</property>
The difference is that instead of mapred.input.dir.formats and mapred.input.dir.mappers, which are part of the old MapReduce API, I used mapreduce.input.multipleinputs.dir.formats and mapreduce.input.multipleinputs.dir.mappers respectively. The job worked just fine after that. I ran it on Hadoop 1.2.1 and Oozie 3.3.2.
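For context, here is a minimal sketch of where these properties might sit inside a workflow.xml map-reduce action. The action name, the ok/error transition targets, and the ${jobTracker} and ${nameNode} parameters are assumptions for illustration and do not come from the original posts. The two new-api flags are how Oozie is normally told to use the new MapReduce API; whether the setup above also needed them is not stated, so treat them as a likely requirement rather than a confirmed one.

```xml
<!-- Sketch only: action name, transitions and ${...} parameters are hypothetical -->
<action name="multi-input-mr">
  <map-reduce>
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <configuration>
      <!-- Assumption: ask Oozie to run the job with the new MapReduce API -->
      <property>
        <name>mapred.mapper.new-api</name>
        <value>true</value>
      </property>
      <property>
        <name>mapred.reducer.new-api</name>
        <value>true</value>
      </property>
      <!-- Delegating classes dispatch each path to its own format and mapper -->
      <property>
        <name>mapreduce.inputformat.class</name>
        <value>org.apache.hadoop.mapreduce.lib.input.DelegatingInputFormat</value>
      </property>
      <property>
        <name>mapreduce.map.class</name>
        <value>org.apache.hadoop.mapreduce.lib.input.DelegatingMapper</value>
      </property>
      <!-- path;class pairs, comma-separated, as in the answer above -->
      <property>
        <name>mapreduce.input.multipleinputs.dir.formats</name>
        <value>/first/input/path;org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat,/second/input/path;org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat</value>
      </property>
      <property>
        <name>mapreduce.input.multipleinputs.dir.mappers</name>
        <value>/first/input/path;com.first.Mapper,/second/input/path;com.second.Mapper</value>
      </property>
    </configuration>
  </map-reduce>
  <ok to="end"/>
  <error to="fail"/>
</action>
```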

Related

Hadoop localhost:9870 browser interface is not working

I need to do data analysis using Hadoop, so I installed Hadoop and configured it as shown below. But localhost:9870 is not working, even though I format the namenode every time I work with it. Some articles and answers on this forum mention that 9870 is the updated port that replaced 50070. I am on Windows 10. I also tried other answers on this forum, but none of them worked. The Java home and Hadoop home paths are set, and the paths to Hadoop's bin and sbin directories are also set up. Can anyone please tell me what I am doing wrong here?
I followed this guide for the installation and configuration:
https://medium.com/#pedro.a.hdez.a/hadoop-3-2-2-installation-guide-for-windows-10-454f5b5c22d3
core-site.xml
I have set up the Java path in this xml as well.
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9870</value>
</property>
hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>C:\hadoop-3.2.2\data\namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>C:\hadoop-3.2.2\data\datanode</value>
</property>
mapred-site.xml
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
yarn-site.xml
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.auxservices.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
If you look at the namenode logs, you will very likely find an error saying that a port is already in use.
The fs.defaultFS port should be 9000, as in the single-node setup guide - https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html ; you shouldn't change this without good reason.
The NameNode web UI address isn't the value in fs.defaultFS. Its default port is 9870, and it is defined by dfs.namenode.http-address in hdfs-site.xml.
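A sketch of how the two settings might look once separated, assuming a single-node setup on localhost. The 9000 port follows the setup guide linked above, and 0.0.0.0:9870 is the stock default for dfs.namenode.http-address in Hadoop 3, so the second property is normally only needed if you want to change the UI address.

```xml
<!-- core-site.xml: the HDFS RPC endpoint, not the web UI -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
</property>

<!-- hdfs-site.xml: NameNode web UI address (shown with its default value) -->
<property>
  <name>dfs.namenode.http-address</name>
  <value>0.0.0.0:9870</value>
</property>
```

With a layout like this, the browser would be pointed at http://localhost:9870 while HDFS clients use hdfs://localhost:9000.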
"need to do data analysis"
You can do analysis on Windows without Hadoop by running Spark, Hive, MapReduce, etc. directly; they will have direct access to your machine's resources without being limited by YARN container sizes.

Duplicate YARN conf settings

I am using Hadoop 2.6.
In yarn-site.xml I have the following defined:
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
But if I look inside the yarn UI /conf URL I get the following 2 definitions:
<property>
<name>yarn.nodemanamger.vmem-check-enabled</name>
<value>false</value>
<source>java.io.BufferedInputStream#2893de87</source>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>true</value>
<source>yarn-default.xml</source>
</property>
Which one of the two properties is actually being followed?
Am I correct in saying that the <source>java.io.BufferedInputStream#2893de87</source> overwrites <source>yarn-default.xml</source> ?
The <source> element is not the property value; it records where the value was loaded from.
The value is read from your configuration through some external source (an InputStream), and it is then written out as <value>false</value>.
How that source gets read is up to the Configuration object in the Hadoop API.

Yarn timeline server log aggregation

I am configuring Hadoop 2.7.1 to retain YARN jobs for longer.
I have enabled log aggregation and the jobhistory/timeline server. When a job completes in the ResourceManager, it does show up in the jobhistory server (if you give the correct URL); however, the jobhistory server only lists M/R jobs, not YARN applications.
The problem is that the job is not visible in the timeline server; in fact, no jobs show up in the timeline server at all.
Current yarn-site.xml configuration :
<property>
<name>yarn.timeline-service.hostname</name>
<value>host1</value>
</property>
<property>
<name>yarn.timeline-service.address</name>
<value>${yarn.timeline-service.hostname}:10200</value>
</property>
<property>
<name>yarn.timeline-service.webapp.address</name>
<value>${yarn.timeline-service.hostname}:8188</value>
</property>
<property>
<name>yarn.timeline-service.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.timeline-service.generic-application-history.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.log.server.url</name>
<value>http://${yarn.timeline-service.hostname}:19888/jobhistory/logs/</value>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.nodemanager.log-dirs</name>
<value>/var/vm/apps/hadoop/logs</value>
</property>
<property>
<name>yarn.nodemanager.remote-app-log-dir</name>
<value>/var/vm/apps/hadoop/logs</value>
</property>
Am I providing conflicting configuration by using the jobhistory server AND the timeline server?
Ultimately, I want the YARN logs persisted to HDFS so they can be viewed in the web UI over the following days or weeks.
You need to set the mapreduce.job.emit-timeline-data property to true in mapred-site.xml.
This will enable MapReduce jobs to push events to the timeline server.
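As a fragment, the setting described above would look roughly like this in mapred-site.xml. This is a sketch that assumes the timeline service itself is already enabled in yarn-site.xml, as in the question.

```xml
<!-- mapred-site.xml: let MapReduce jobs publish their events to the timeline server -->
<property>
  <name>mapreduce.job.emit-timeline-data</name>
  <value>true</value>
</property>
```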

Configure job memory in Hadoop 1.2.0

I need to set the -Xmx property of a job running on a data node.
On the task tracker node I tried putting the properties
<property>
<name>mapred.map.java.opts</name>
<value>-Xmx64m</value>
</property>
<property>
<name>mapred.reduce.java.opts</name>
<value>-Xmx64m</value>
</property>
into conf/core-site.xml,
but it has no effect on submitted jobs; I still see the java process with -Xmx200m in the process list.
Please advise.
Try using:
<property>
<name>mapred.map.child.java.opts</name>
<value>-Xmx64m</value>
</property>
<property>
<name>mapred.reduce.child.java.opts</name>
<value>-Xmx64m</value>
</property>
in your conf/mapred-site.xml on each data node.
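A related sketch, assuming Hadoop 1.x behaviour: the per-task-type properties above fall back to the general mapred.child.java.opts, whose stock default of -Xmx200m matches the value seen in the process list. If map and reduce tasks should share one heap setting, a single property may be enough. The 64 MB figure is simply carried over from the question.

```xml
<!-- conf/mapred-site.xml: one heap setting for both map and reduce child JVMs -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx64m</value>
</property>
```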

Getting E0902: Exception occured: [User: oozie is not allowed to impersonate oozie]

Hi, I am new to Oozie and I am getting this error: E0902: Exception occured: [User: pramod is not allowed to impersonate pramod] when I run the following command:
./oozie job -oozie http://localhost:11000/oozie/ -config ~/Desktop/map-reduce/job.properties -run
My Hadoop version is 1.0.3 and my Oozie version is 3.3.2, running in pseudo-distributed mode.
The following is the content of my core-site.xml
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/pramod/hadoop-${user.name}</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
</property>
<property>
<name>hadoop.proxyuser.${user.name}.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.${user.name}.groups</name>
<value>*</value>
</property>
</configuration>
Can somebody help?
Hadoop 1.0.x does not support wildcards. http://mail-archives.apache.org/mod_mbox/oozie-user/201212.mbox/%3CCAOcnVr1TZZ5X0Mrb7fFA8JdW6rO6PgoJ9u0=2UYbfXf_o8r=DA#mail.gmail.com%3E
So try
<property>
<name>hadoop.proxyuser.oozie.hosts</name>
<value>localhost</value>
</property>
<property>
<name>hadoop.proxyuser.oozie.groups</name>
<value>oozie,pramod</value>
</property>
One thing missed in the discussion above:
In core-site.xml you need to use the user that Oozie is started as, that is, the user who invoked the command "bin/oozied.sh start". For example, if you have "hadoop.proxyuser.bob.hosts" along with "hadoop.proxyuser.bob.groups", then the user 'bob' must be the one who starts Oozie using "bin/oozied.sh start".
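To illustrate the 'bob' case, a sketch of the matching core-site.xml entries. The host and group values here are hypothetical placeholders chosen for a pseudo-distributed setup, not values from the original post.

```xml
<!-- core-site.xml: 'bob' is the OS user that runs bin/oozied.sh start -->
<property>
  <name>hadoop.proxyuser.bob.hosts</name>
  <!-- hypothetical: host the Oozie server runs on -->
  <value>localhost</value>
</property>
<property>
  <name>hadoop.proxyuser.bob.groups</name>
  <!-- hypothetical: groups whose members bob may impersonate -->
  <value>bob</value>
</property>
```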
I don't think you can use variables in the key name - you'll need to hardcode the user name rather than ${user.name}.
I assume you have an oozie user (which the oozie server is run as), so basically you want to configure as follows to allow the oozie user to impersonate anyone from any host:
<property>
<name>hadoop.proxyuser.oozie.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.oozie.groups</name>
<value>*</value>
</property>
Make sure you restart your HDFS / MapReduce services for this to take effect.
