Use of core-site.xml in mapreduce program - hadoop

I have seen MapReduce programs using/adding core-site.xml as a resource in the program. What is core-site.xml, and how can it be used in MapReduce programs?

From the documentation:
Unless explicitly turned off, Hadoop by default specifies two resources, loaded in order from the classpath:
core-default.xml: read-only defaults for Hadoop
core-site.xml: site-specific configuration for a given Hadoop installation
You can also add configuration resources explicitly in your program:
Configuration config = new Configuration();
config.addResource(new Path("/user/hadoop/core-site.xml"));
config.addResource(new Path("/user/hadoop/hdfs-site.xml"));

core-site.xml and hdfs-site.xml describe the Hadoop installation and its HDFS, so your MapReduce program can figure out which cluster it should point to and where the job should run.
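For illustration, here is a minimal sketch assuming the two site files are readable at the example paths used above; the loaded configuration (in particular fs.defaultFS from core-site.xml) is what tells the program which cluster to talk to:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowTargetCluster {
    public static void main(String[] args) throws Exception {
        Configuration config = new Configuration();
        // Example paths only; point these at your own copies of the site files.
        config.addResource(new Path("/user/hadoop/core-site.xml"));
        config.addResource(new Path("/user/hadoop/hdfs-site.xml"));

        // fs.defaultFS (from core-site.xml) decides which cluster the client talks to.
        System.out.println("Default filesystem: " + config.get("fs.defaultFS"));

        // FileSystem.get(...) resolves against that same configuration.
        FileSystem fs = FileSystem.get(config);
        System.out.println("Working directory: " + fs.getWorkingDirectory());
    }
}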

Related

Hadoop MapRed default config

I have a Hadoop 2.7.2 cluster on which I'm trying to run a DFSIO test. If I leave mapred-site.xml and yarn-site.xml untouched, will MapReduce be set to classic MapReduce (V1) by default?
Thanks
Yes: if you leave mapred-site.xml and yarn-site.xml untouched, mapreduce.framework.name is not set to yarn, so jobs will not be submitted to YARN and classic MapReduce (V1) behaviour is used by default.
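If you want to verify what your installation actually uses, a minimal sketch like the following can help; JobConf loads mapred-default.xml and mapred-site.xml from the classpath, so the printed value is the effective one:

import org.apache.hadoop.mapred.JobConf;

public class ShowMapReduceFramework {
    public static void main(String[] args) {
        // JobConf pulls in mapred-default.xml and mapred-site.xml from the classpath.
        JobConf conf = new JobConf();
        // Prints "yarn" when YARN is configured; otherwise the default from mapred-default.xml.
        System.out.println("mapreduce.framework.name = " + conf.get("mapreduce.framework.name"));
    }
}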

How to get Hadoop configuration xml info using rest api

I have core-site.xml, mapred-site.xml, hdfs-site.xml and yarn-site.xml files at '$(hadoop_home)\etc\hadoop'.
I need to get those XML files via a web link or a WebHDFS REST command.
Using the following link I am able to get core-site.xml and mapred-site.xml with a JMX (or REST) command:
http://<host-name>:8088/conf
How can I get the hdfs-site.xml and yarn-site.xml properties as well?
Finally I found a solution for getting Hadoop configuration information using REST or JMX commands.
NameNode configuration:
http://<host-name>:50070/conf -> (core-site.xml, mapred-site.xml, yarn-site.xml, hdfs-site.xml)
NodeManager configuration:
http://<host-name>:8042/conf -> (core-site.xml, mapred-site.xml, yarn-site.xml)
ResourceManager configuration:
http://<host-name>:8088/conf -> (core-site.xml, mapred-site.xml)
Note: check the DataNode and NodeManager configuration on a slave node, and the NameNode and ResourceManager configuration on the master node.
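For illustration, here is a minimal sketch of fetching one of these endpoints from Java (the host name is a placeholder and 50070 is the default Hadoop 2.x NameNode HTTP port; adjust both to your cluster). The response is a plain XML document listing every effective property:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class FetchClusterConf {
    public static void main(String[] args) throws Exception {
        // Example endpoint: the NameNode web UI on its default Hadoop 2.x port.
        URL url = new URL("http://namenode-host:50070/conf");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");

        // Print the XML configuration dump line by line.
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}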

Namenode high availability client request

Can anyone please tell me: if I am using a Java application to request file upload/download operations against HDFS with a NameNode HA setup, where does this request go first? I mean, how would the client know which NameNode is active?
It would be great if you could provide a workflow-type diagram or something that explains the request steps in detail (start to end).
Please check the NameNode HA architecture and the key entities involved in handling HDFS client requests.
Where does this request go first? I mean, how would the client know which NameNode is active?
For the client/driver it doesn't matter which NameNode is active, because we query HDFS with the nameservice ID rather than the hostname of a NameNode. The nameservice automatically routes client requests to the active NameNode.
Example: hdfs://nameservice_id/rest/of/the/hdfs/path
Explanation:
How does hdfs://nameservice_id/ work, and which configuration properties are involved?
In the hdfs-site.xml file:
Create a nameservice by adding an ID to it (here the nameservice_id is mycluster):
<property>
<name>dfs.nameservices</name>
<value>mycluster</value>
<description>Logical name for this new nameservice</description>
</property>
Now specify the NameNode IDs that make up the cluster, using
dfs.ha.namenodes.[$nameservice ID]:
<property>
<name>dfs.ha.namenodes.mycluster</name>
<value>nn1,nn2</value>
<description>Unique identifiers for each NameNode in the nameservice</description>
</property>
Then link the NameNode IDs to the NameNode hosts with
dfs.namenode.rpc-address.[$nameservice ID].[$name node ID]
<property>
<name>dfs.namenode.rpc-address.mycluster.nn1</name>
<value>machine1.example.com:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.nn2</name>
<value>machine2.example.com:8020</value>
</property>
After that, specify the Java class that HDFS clients use to contact the active NameNode; the DFS client uses this class to determine which NameNode is currently serving client requests:
<property>
<name>dfs.client.failover.proxy.provider.mycluster</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
Finally, after these configuration changes the HDFS URL will look like this:
hdfs://mycluster/<file_location_in_hdfs>
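As a minimal sketch, assuming the hdfs-site.xml with the HA properties above is on the client's classpath, the client only ever names the nameservice; the failover proxy provider picks whichever NameNode is currently active:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NameserviceClientSketch {
    public static void main(String[] args) throws Exception {
        // Loads core-site.xml / hdfs-site.xml from the classpath, including the HA settings above.
        Configuration conf = new Configuration();

        // The URI names the nameservice (mycluster), not a particular NameNode host.
        FileSystem fs = FileSystem.get(URI.create("hdfs://mycluster/"), conf);
        boolean exists = fs.exists(new Path("/tmp"));
        System.out.println("hdfs://mycluster/tmp exists: " + exists);
    }
}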
To answer your question I have covered only a few of the configuration properties; please check the detailed documentation on how NameNodes, JournalNodes and ZooKeeper machines form NameNode HA in HDFS.
If the Hadoop cluster is configured with HA, then it will have NameNode IDs in hdfs-site.xml, like this:
<property>
<name>dfs.ha.namenodes.mycluster</name>
<value>namenode1,namenode2</value>
</property>
Whichever NameNode is started first will become active. You may choose to start the cluster in a specific order such that your preferred node starts first.
If you want to determine the current status of a NameNode, you can use the haadmin -getServiceState command:
hdfs haadmin -getServiceState <namenode-id>
Well, while writing the driver class, you need to set the following properties on the configuration object:
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyToHaCluster {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.out.println("Usage: pgm <hdfs:///path/to/copy> </local/path/to/copy/from>");
            System.exit(1);
        }
        // Start from an empty configuration and set the HA client properties explicitly
        Configuration conf = new Configuration(false);
        conf.set("fs.defaultFS", "hdfs://nameservice1");
        conf.set("fs.default.name", conf.get("fs.defaultFS"));
        conf.set("dfs.nameservices", "nameservice1");
        conf.set("dfs.ha.namenodes.nameservice1", "namenode1,namenode2");
        conf.set("dfs.namenode.rpc-address.nameservice1.namenode1", "hadoopnamenode01:8020");
        conf.set("dfs.namenode.rpc-address.nameservice1.namenode2", "hadoopnamenode02:8020");
        conf.set("dfs.client.failover.proxy.provider.nameservice1",
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

        FileSystem fs = FileSystem.get(URI.create(args[0]), conf);
        Path srcPath = new Path(args[1]);
        Path dstPath = new Path(args[0]);
        // in case the same file exists on the remote location, it will be overwritten
        fs.copyFromLocalFile(false, true, srcPath, dstPath);
    }
}
The request will go to nameservice1 and will then be handled by the Hadoop cluster according to the NameNode status (active/standby).
For more details, please refer to the HDFS High Availability documentation.

Oozie java-action does not include core-site.xml

When running an Oozie Java action on a freshly installed Hadoop HDP 2.2.2.4 that, for example, tries to access HDFS, it accesses the wrong filesystem:
java.lang.IllegalArgumentException: Wrong FS: hdfs:/tmp/text.txt, expected: file:///
It can be fixed by including the core-site.xml in the Oozie action:
<file>hdfs:/path-to-core-site.xml-on-hdfs</file>
But what is the reason, and what is the proper fix?
The reason core-site.xml is not included in the classpath of the Java action is that the property mapreduce.application.classpath points to the wrong directory:
<snip>/etc/hadoop/conf/secure
It should point to
<snip>/etc/hadoop/conf
i.e., in mapred-site.xml the full property should be something like:
<property>
<name>mapreduce.application.classpath</name>
<value>$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf</value>
</property>
Those files are included in the Hadoop classpath. As far as I know, since HDP 2.2 you need to add
// loading action conf prepared by Oozie
Configuration actionConf = new Configuration(false);
actionConf.addResource(new Path("file:///", System.getProperty("oozie.action.conf.xml")));
to use the *-site.xml files; you can get the details in the Oozie documentation:
https://oozie.apache.org/docs/4.2.0/WorkflowFunctionalSpec.html#a3.2.7_Java_Action
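For illustration, here is a minimal sketch of a Java action main class that loads the action configuration this way (the oozie.action.conf.xml system property is set by Oozie; the HDFS path is just an example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MyOozieJavaAction {
    public static void main(String[] args) throws Exception {
        // Load the action configuration prepared by Oozie; it already contains the
        // cluster's *-site.xml settings, so fs.defaultFS points at HDFS, not file:///.
        Configuration actionConf = new Configuration(false);
        actionConf.addResource(new Path("file:///", System.getProperty("oozie.action.conf.xml")));

        // With the action conf loaded, hdfs:/ paths resolve against the right filesystem
        // and the "Wrong FS ... expected: file:///" error goes away.
        FileSystem fs = FileSystem.get(actionConf);
        System.out.println("Exists: " + fs.exists(new Path("/tmp/text.txt")));
    }
}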

How to use JobClient in hadoop2(yarn)

(Solved) I want to contact the Hadoop cluster and get some job/task information.
In Hadoop 1, I was able to use JobClient (local pseudo-distributed mode, using Eclipse):
JobClient jobClient = new JobClient(new InetSocketAddress("127.0.0.1",9001),new JobConf(config));
JobID job_id = JobID.forName("job_xxxxxx");
RunningJob job = jobClient.getJob(job_id);
.....
Today I set up a pseudo-distributed Hadoop 2 YARN cluster; however, the above code doesn't work. I use the port of the ResourceManager (8032).
JobClient jobClient = new JobClient(new InetSocketAddress("127.0.0.1",8032),new JobConf(config));
This line gives exception:
Exception in thread "main" java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.
I searched for this exception, but none of the solutions work. I use Eclipse, and I have added all the Hadoop jars, including hadoop-mapreduce-client-xxx. Also, I can successfully run the example programs on my cluster.
Any suggestions on how to use JobClient on Hadoop 2 / YARN?
Update: I was able to solve this issue by compiling against the same Hadoop libraries as the ResourceManager server. In Eclipse it still gives this exception, but after I compiled and deployed my project it works fine (not sure why, since in Hadoop 1 it works in Eclipse). There is no need to change the API; JobClient still works well in Hadoop 2.
Have you configured the mapred-site.xml file as follows? It is located in $HADOOP_HOME/etc/hadoop/ in Hadoop 2.x.
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
edit: Also make sure that your yarn-site.xml (same location) contains the following property:
<property>
<name>yarn.resourcemanager.address</name>
<value>host:port</value>
</property>
One last thing: I strongly advise you to work with hostnames instead of IPs. There are known cases of failure with Hadoop when IPs are set in the configuration files.
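For illustration, here is a rough sketch of a JobClient set up against YARN (host names and the job id are placeholders; normally these properties come from mapred-site.xml and yarn-site.xml on the classpath rather than being set in code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobID;
import org.apache.hadoop.mapred.RunningJob;

public class JobInfoOnYarn {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // These values are set here only to make the sketch self-contained;
        // on a configured client they come from the *-site.xml files.
        conf.set("mapreduce.framework.name", "yarn");
        conf.set("yarn.resourcemanager.address", "resourcemanager-host:8032");
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");

        JobClient jobClient = new JobClient(new JobConf(conf));
        // Replace with a real job id, e.g. the one printed when the job was submitted.
        RunningJob job = jobClient.getJob(JobID.forName("job_xxxxxx"));
        System.out.println(job == null ? "Job not found" : "Job state: " + job.getJobState());
    }
}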
