Hadoop MapRed default config

I have a Hadoop 2.7.2 cluster on which I'm trying to run a DFSIO test. If I leave mapred-site.xml and yarn-site.xml untouched, will MapReduce default to classic MapReduce (V1)?
Thanks

In Hadoop 2.x the classic MRv1 (JobTracker/TaskTracker) runtime no longer exists. If you leave mapred-site.xml and yarn-site.xml untouched, mapreduce.framework.name falls back to its default value of local, so jobs (including DFSIO) run with the LocalJobRunner on the submitting machine rather than on the cluster. To run the test on the cluster, set mapreduce.framework.name to yarn and configure yarn-site.xml.
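For reference, a minimal mapred-site.xml that switches job submission over to YARN would look like the following (the property name and value are the stock Hadoop 2.x ones; everything else about the cluster is assumed to be configured already):
<configuration>
<!-- Without this, mapreduce.framework.name defaults to "local" (LocalJobRunner). -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>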

Related

How to get Hadoop configuration xml info using rest api

I have the core-site.xml, mapred-site.xml, hdfs-site.xml and yarn-site.xml files at '$(hadoop_home)\etc\hadoop'.
I need to get those XML files via a web link or a WebHDFS REST command.
Using the following link I was able to get core-site.xml and mapred-site.xml via a JMX or REST command:
http://<host-name>:8088/conf
How can I get the hdfs-site.xml and yarn-site.xml properties as well?
Finally, I found a solution for getting the Hadoop configuration information via a REST or JMX call.
NameNode configuration:
http://<host-name>:50070/conf -> (core-site.xml, mapred-site.xml, yarn-site.xml, hdfs-site.xml)
NodeManager configuration:
http://<host-name>:8042/conf -> (core-site.xml, mapred-site.xml, yarn-site.xml)
ResourceManager configuration:
http://<host-name>:8088/conf -> (core-site.xml, mapred-site.xml)
Note: the DataNode and NodeManager configuration must be checked on a slave node, while the NameNode and ResourceManager configuration must be checked on the master node.
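For example, a quick way to check a single property from one of those endpoints is to curl it and filter the output (hostname placeholders as above; the grep pattern is only a rough filter, since /conf returns XML):
# Dump the NameNode's effective configuration and look for one property
curl -s http://<host-name>:50070/conf | grep -A1 'fs.defaultFS'
# Same idea against the ResourceManager
curl -s http://<host-name>:8088/conf | grep -A1 'yarn.resourcemanager.address'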

Oozie java-action does not include core-site.xml

When an Oozie Java action runs on a freshly installed Hadoop HDP 2.2.2.4 and, for example, tries to access HDFS, it accesses the wrong filesystem:
java.lang.IllegalArgumentException: Wrong FS: hdfs:/tmp/text.txt, expected: file:///
It can be fixed by including core-site.xml in the Oozie action:
<file>hdfs:/path-to-core-site.xml-on-hdfs</file>
But what is the reason and what is the proper fix?
The reason core-site.xml is not included in the classpath of the Java action is that the property mapreduce.application.classpath points to the wrong directory:
<snip>/etc/hadoop/conf/secure
It should point to
<snip>/etc/hadoop/conf
i.e., the full property in mapred-site.xml should be something like:
<property>
<name>mapreduce.application.classpath</name>
<value>$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf</value>
</property>
Those files are included in the Hadoop classpath. As far as I know, since HDP 2.2 you need to add
// loading action conf prepared by Oozie
Configuration actionConf = new Configuration(false);
actionConf.addResource(new Path("file:///", System.getProperty("oozie.action.conf.xml")));
to use the *-site.xml files. You can find the details in the Oozie documentation:
https://oozie.apache.org/docs/4.2.0/WorkflowFunctionalSpec.html#a3.2.7_Java_Action
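Putting it together, a minimal sketch of a Java action main class that loads the action conf before touching HDFS could look like this (the class name and the /tmp/text.txt path are only illustrative):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MyJavaAction {
    public static void main(String[] args) throws Exception {
        // Load the action configuration that Oozie prepares for the task;
        // it already carries the cluster's *-site.xml settings.
        Configuration actionConf = new Configuration(false);
        actionConf.addResource(new Path("file:///", System.getProperty("oozie.action.conf.xml")));

        // With the action conf loaded, fs.defaultFS resolves to HDFS instead of file:///
        FileSystem fs = FileSystem.get(actionConf);
        System.out.println("Default FS: " + fs.getUri());
        System.out.println("Exists? " + fs.exists(new Path("/tmp/text.txt")));
    }
}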

Oozie on YARN - oozie is not allowed to impersonate hadoop

I'm trying to use Oozie from Java to start a job on a Hadoop cluster. I have very limited experience with Oozie on Hadoop 1, and now I'm struggling to do the same thing on YARN.
I was given a machine that doesn't belong to the cluster, so when I try to start my job I get the following exception:
E0501 : E0501: Could not perform authorization operation, User: oozie is not allowed to impersonate hadoop
Why is that, and what should I do?
I have read a bit about the core-site properties that need to be set:
<property>
<name>hadoop.proxyuser.oozie.groups</name>
<value>users</value>
</property>
<property>
<name>hadoop.proxyuser.oozie.hosts</name>
<value>master</value>
</property>
Does it seem that this is the problem? Should I contact people responsible for cluster to fix that?
Could there be problems because I'm using the same code for YARN as I did for Hadoop 1? Should something be changed? For example, I'm setting nameNode and jobTracker in workflow.xml. Should jobTracker still exist, now that there is a ResourceManager? I have set it to the address of the ResourceManager but left the property name as jobTracker; could that be the error?
Maybe I should also mention that Ambari is used...
Please update core-site.xml:
<property>
<name>hadoop.proxyuser.hadoop.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hadoop.hosts</name>
<value>*</value>
</property>
As for the jobTracker address: on YARN it is simply the ResourceManager address, so that will not be the problem. Once you update the core-site.xml file, it will work.
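For reference, a minimal sketch of how the java action node usually looks in workflow.xml on YARN; the hostname, the default ResourceManager port 8032 and the main class are placeholders, not values from the original post:
<action name="java-node">
<java>
<!-- On YARN, job-tracker simply points at the ResourceManager address -->
<job-tracker>master:8032</job-tracker>
<name-node>hdfs://master:8020</name-node>
<main-class>com.example.MyMain</main-class>
</java>
<ok to="end"/>
<error to="fail"/>
</action>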
Reason:
This type of error occurs when you run the Oozie server as the hadoop user but define oozie as the proxy user in core-site.xml.
Solution:
Change the ownership of the Oozie installation directory to the oozie user and run the Oozie server as the oozie user; that will solve the problem.
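As a sketch, assuming you have admin rights on the cluster: after editing core-site.xml the proxy-user settings can usually be reloaded without a full restart (with Ambari you would typically restart the affected services from the UI instead):
hdfs dfsadmin -refreshSuperUserGroupsConfiguration
yarn rmadmin -refreshSuperUserGroupsConfiguration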

Use of core-site.xml in mapreduce program

I have seen MapReduce programs using/adding core-site.xml as a resource. What is core-site.xml for, and how can it be used in MapReduce programs?
From the documentation:
Unless explicitly turned off, Hadoop by default specifies two resources, loaded in-order from the classpath:
core-default.xml : Read-only defaults for hadoop,
core-site.xml: Site-specific configuration for a given hadoop installation
Configuration config = new Configuration();
config.addResource(new Path("/user/hadoop/core-site.xml"));
config.addResource(new Path("/user/hadoop/hdfs-site.xml"));
core-site.xml and hdfs-site.xml identify the Hadoop cluster and its HDFS, so that your MapReduce program knows which cluster to point at and where the work is to be performed.
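As a small sketch of why you would add those resources: the client-side Configuration decides which cluster your job talks to, e.g. via fs.defaultFS from core-site.xml (the class name below is illustrative; the resource paths are the ones from the snippet above):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ConfDemo {
    public static void main(String[] args) throws Exception {
        Configuration config = new Configuration();
        // Layer a specific cluster's site files on top of the built-in *-default.xml values.
        config.addResource(new Path("/user/hadoop/core-site.xml"));
        config.addResource(new Path("/user/hadoop/hdfs-site.xml"));

        // fs.defaultFS (from core-site.xml) tells the client which cluster to point at.
        System.out.println("fs.defaultFS = " + config.get("fs.defaultFS"));
        FileSystem fs = FileSystem.get(config);
        System.out.println("Connected to: " + fs.getUri());
    }
}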

Hadoop / Yarn (v0.23.3) Pseudo-Distributed Mode setup :: No job node

I just set up Hadoop/YARN 2.x (specifically, v0.23.3) in pseudo-distributed mode.
I followed the instructions of a few blogs & websites which, more or less, provide the
same prescription for setting it up. I also followed the 3rd edition of O'Reilly's
Hadoop book (which, ironically, was the least helpful).
THE PROBLEM:
After running "start-dfs.sh" and then "start-yarn.sh", while all of the daemons
do start (as indicated by jps(1)), the Resource Manager web portal
(Here: http://localhost:8088/cluster/nodes) indicates 0 (zero) job-nodes in the
cluster. So while submitting the example/test Hadoop job indeed does get
scheduled, it pends forever because, I assume, the configuration doesn't see a
node to run it on.
Below are the steps I performed, including resultant configuration files.
Hopefully the community help me out... (And thank you in advance).
THE CONFIGURATION:
The following environment variables are set in both my own and the hadoop user's UNIX account profile (~/.profile):
export HADOOP_HOME=/home/myself/APPS.d/APACHE_HADOOP.d/latest
# Note: /home/myself/APPS.d/APACHE_HADOOP.d/latest -> hadoop-0.23.3
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_INSTALL=${HADOOP_HOME}
export HADOOP_CLASSPATH=${HADOOP_HOME}/lib
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop/conf
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export YARN_HOME=${HADOOP_HOME}
export YARN_CONF_DIR=${HADOOP_HOME}/etc/hadoop/conf
export JAVA_HOME=/usr/lib/jvm/jre
hadoop$ java -version
java version "1.7.0_06-icedtea"
OpenJDK Runtime Environment (fedora-2.3.1.fc17.2-x86_64)
OpenJDK 64-Bit Server VM (build 23.2-b09, mixed mode)
# Although the above shows OpenJDK, the same problem happens with Sun's JRE/JDK.
The NAMENODE & DATANODE directories, also specified in etc/hadoop/conf/hdfs-site.xml:
/home/myself/APPS.d/APACHE_HADOOP.d/latest/YARN_DATA.d/HDFS.d/DATANODE.d/
/home/myself/APPS.d/APACHE_HADOOP.d/latest/YARN_DATA.d/HDFS.d/NAMENODE.d/
Next, the various XML configuration files (again, YARN/MRv2/v0.23.3 here):
hadoop$ pwd; ls -l
/home/myself/APPS.d/APACHE_HADOOP.d/latest/etc/hadoop/conf
lrwxrwxrwx 1 hadoop hadoop 16 Sep 20 13:14 core-site.xml -> ../core-site.xml
lrwxrwxrwx 1 hadoop hadoop 16 Sep 20 13:14 hdfs-site.xml -> ../hdfs-site.xml
lrwxrwxrwx 1 hadoop hadoop 18 Sep 20 13:14 httpfs-site.xml -> ../httpfs-site.xml
lrwxrwxrwx 1 hadoop hadoop 18 Sep 20 13:14 mapred-site.xml -> ../mapred-site.xml
-rw-rw-r-- 1 hadoop hadoop 10 Sep 20 15:36 slaves
lrwxrwxrwx 1 hadoop hadoop 16 Sep 20 13:14 yarn-site.xml -> ../yarn-site.xml
core-site.xml
<?xml version="1.0"?>
<!-- core-site.xml -->
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost/</value>
</property>
</configuration>
mapred-site.xml
<?xml version="1.0"?>
<!-- mapred-site.xml -->
<configuration>
<!-- Same problem whether this (legacy) stanza is included or not. -->
<property>
<name>mapred.job.tracker</name>
<value>localhost:8021</value>
</property>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
hdfs-site.xml
<!-- hdfs-site.xml -->
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/home/myself/APPS.d/APACHE_HADOOP.d/YARN_DATA.d/HDFS.d/NAMENODE.d</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/home/myself/APPS.d/APACHE_HADOOP.d/YARN_DATA.d/HDFS.d/DATANODE.d</value>
</property>
</configuration>
yarn-site.xml
<?xml version="1.0"?>
<!-- yarn-site.xml -->
<configuration>
<property>
<name>yarn.resourcemanager.address</name>
<value>localhost:8032</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce.shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>4096</value>
</property>
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>/home/myself/APPS.d/APACHE_HADOOP.d/YARN_DATA.d/TEMP.d</value>
</property>
</configuration>
etc/hadoop/conf/slaves
localhost
# Community/friends, is this entry correct/needed for my pseudo-dist mode?
Miscellaneous wrap-up notes:
(1) As you may have gleaned from above, all files/directories are owned
by the 'hadoop' UNIX user; there is a hadoop:hadoop UNIX user and
group, respectively.
(2) The following command was run after the NAMENODE & DATANODE directories
(listed above) were created (and whose paths were entered into
hdfs-site.xml):
hadoop$ hadoop namenode -format
(3) Next, I ran "start-dfs.sh", then "start-yarn.sh".
Here is jps(1) output:
hadoop#e6510$ jps
21979 DataNode
22253 ResourceManager
22384 NodeManager
22156 SecondaryNameNode
21829 NameNode
22742 Jps
Thank you!
After much toil on this problem without success (and trust me, I tried it all), I ended up
installing Hadoop using a different approach. Whereas above I downloaded a gzip/tar ball
of the Hadoop distribution (again v0.23.3) from one of the download mirrors, this
time I used the Cloudera CDH distribution of RPM packages, which I installed via
their YUM repos. In the hope that this will help someone, here are the detailed steps.
Step-1:
For Hadoop 0.20.x (MapReduce version 1):
# rpm -Uvh http://archive.cloudera.com/redhat/6/x86_64/cdh/cdh3-repository-1.0-1.noarch.rpm
# rpm --import http://archive.cloudera.com/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera
# yum install hadoop-0.20-conf-pseudo
-or-
For Hadoop 0.23.x (MapReduce version 2):
# rpm -Uvh http://archive.cloudera.com/cdh4/one-click-install/redhat/6/x86_64/cloudera-cdh-4-0.noarch.rpm
# rpm --import http://archive.cloudera.com/cdh4/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera
# yum install hadoop-conf-pseudo
In both cases above, installing the "pseudo" package (which stands for pseudo-distributed
Hadoop mode) will conveniently trigger the installation of all the other packages you'll need (via dependency resolution).
Step-2:
Install Sun/Oracle's Java JRE (if you haven't already done so). You can
install it via the RPM that they provide, or the gzip/tar ball portable
version. It doesn't matter which, as long as you set and export JAVA_HOME
appropriately and ensure ${JAVA_HOME}/bin/java is in your PATH.
# echo $JAVA_HOME; which java
/home/myself/APPS.d/JAVA-JRE.d/jdk1.7.0_07
/home/myself/APPS.d/JAVA-JRE.d/jdk1.7.0_07/bin/java
Note: I actually create a symlink called "latest" and point/re-point it to the Java
version-specific directory whenever I update Java. I was explicit above for
the reader's understanding.
Step-3: Format hdfs as the "hdfs" Unix user (created during "yum install" above).
# sudo su hdfs -c "hadoop namenode -format"
Step-4:
Manually start the hadoop daemons.
for file in /etc/init.d/hadoop*
do
  "$file" start
done
Step-5:
Check to see if things are working. The following is for MapReduce v1
(It's not that much different for MapReduce v2 at this superficial level).
root# jps
23104 DataNode
23469 TaskTracker
23361 SecondaryNameNode
23187 JobTracker
23267 NameNode
24754 Jps
# Do the next commands as yourself (not as "root").
myself$ hadoop fs -mkdir /foo
myself$ hadoop fs -rmr /foo
myself$ hadoop jar /usr/lib/hadoop-0.20/hadoop-0.20.2-cdh3u5-examples.jar pi 2 100000
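For the MapReduce v2 / YARN case, a quick sanity check for the original symptom (zero nodes in the ResourceManager UI) is to ask the ResourceManager which NodeManagers have registered; a sketch, assuming the ResourceManager runs on localhost and your version supports these commands/endpoints:
# CLI: list the registered NodeManagers
yarn node -list
# REST: the same information from the ResourceManager web services
curl -s http://localhost:8088/ws/v1/cluster/nodes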
I hope this helped!
Noel,
I followed the steps in this tutorial the other day: http://www.thecloudavenue.com/search?q=0.23 and I managed to set up a small cluster of 3 CentOS 6.3 machines.
