How to find the default scheduler in Hadoop

I have installed a multi-node setup on three Ubuntu 12.04 systems, all running Hadoop 1.2.1. Now I want to know which scheduler is running by default.
How can I check the default scheduler running in Hadoop 1.2.1?

The default scheduler in Hadoop is JobQueueTaskScheduler, which is a FIFO scheduler. To see the default, refer to the property mapred.jobtracker.taskScheduler in mapred-default.xml. If you want, you can change the scheduler to either CapacityScheduler or FairScheduler, depending on your requirements.
mapred-site.xml, which can be found inside the configuration directory, is used to override the default values in mapred-default.xml. You may not find mapred-default.xml in the configuration directory of a Hadoop binary distribution (rpm, deb, etc.); instead, it can be found directly inside the jar file hadoop-core-1.2.1.jar.
hackzon:~/hadoop-1.2.1$ jar -tvf hadoop-core-1.2.1.jar | grep mapred-default.xml
47324 Mon Jul 22 15:12:48 IST 2013 mapred-default.xml
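The same lookup can be scripted. The following is a minimal sketch, assuming the jar contains mapred-default.xml at its root (as the listing above shows) and that the file uses the usual configuration/property/name/value layout. The function name read_default_property is hypothetical, not part of Hadoop.

```python
import xml.etree.ElementTree as ET
import zipfile


def read_default_property(jar_path, property_name, resource="mapred-default.xml"):
    """Return the <value> of property_name from an XML resource in a jar, or None.

    Hypothetical helper for illustration; it is not a Hadoop API.
    """
    # A jar is a zip archive, so zipfile can read the bundled resource directly.
    with zipfile.ZipFile(jar_path) as jar:
        with jar.open(resource) as xml_file:
            root = ET.parse(xml_file).getroot()
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == property_name:
            return (prop.findtext("value") or "").strip()
    return None
```

Calling read_default_property("hadoop-core-1.2.1.jar", "mapred.jobtracker.taskScheduler") from the Hadoop directory should return the scheduler class shipped as the default. It does not tell you whether mapred-site.xml overrides it.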
Both files are registered as default resources through the addDefaultResource() method:
addDefaultResource("mapred-default.xml"); // First
addDefaultResource("mapred-site.xml"); // Second
mapred-default.xml is loaded first, then mapred-site.xml, so properties that need to be overridden can be specified in mapred-site.xml. These calls are used in the following Hadoop source files:
org.apache.hadoop.conf.Configuration.java
org.apache.hadoop.mapred.JobConf.java
org.apache.hadoop.mapred.TaskTracker.java
org.apache.hadoop.mapred.JobClient.java
org.apache.hadoop.mapred.JobTracker.java
org.apache.hadoop.mapred.JobHistoryServer.java
Have a look at any of these source files.
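The load order can be illustrated with a small sketch. This is not Hadoop's implementation: it only models "later file wins" and ignores details such as properties marked final. The XML contents and the function names parse_properties and effective_config are made up for the example.

```python
import xml.etree.ElementTree as ET


def parse_properties(xml_text):
    """Map each property name to its value for one Hadoop-style XML document."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name", "").strip(): (prop.findtext("value") or "").strip()
        for prop in root.iter("property")
    }


def effective_config(*xml_documents):
    """Merge documents in load order; later documents override earlier ones."""
    merged = {}
    for xml_text in xml_documents:
        merged.update(parse_properties(xml_text))
    return merged


# Illustrative contents only; the real files hold many more properties.
MAPRED_DEFAULT = """<configuration>
  <property>
    <name>mapred.jobtracker.taskScheduler</name>
    <value>org.apache.hadoop.mapred.JobQueueTaskScheduler</value>
  </property>
</configuration>"""

MAPRED_SITE = """<configuration>
  <property>
    <name>mapred.jobtracker.taskScheduler</name>
    <value>org.apache.hadoop.mapred.FairScheduler</value>
  </property>
</configuration>"""

config = effective_config(MAPRED_DEFAULT, MAPRED_SITE)
print(config["mapred.jobtracker.taskScheduler"])
```

With the site file loaded second, the printed value is the FairScheduler class; without it, the default from the first document remains in effect.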

On a YARN cluster (Hadoop 2.x or later), go to your ResourceManager UI and, under "Tools", click "Configuration", or simply open the URL below, replacing <resource-manager> with your ResourceManager's host name.
http://<resource-manager>:8088/conf
Search for any settings that you want.
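If you would rather search the page from a script, here is a rough sketch. It assumes the /conf page returns XML in the usual configuration/property layout. The helper names fetch_conf and find_properties are hypothetical, and the host name is a placeholder you must supply.

```python
import urllib.request
import xml.etree.ElementTree as ET


def find_properties(conf_xml, keyword):
    """Return {name: value} for properties whose name contains keyword."""
    root = ET.fromstring(conf_xml)
    matches = {}
    for prop in root.iter("property"):
        name = prop.findtext("name", "")
        if keyword in name:
            matches[name] = prop.findtext("value") or ""
    return matches


def fetch_conf(resource_manager, port=8088):
    """Download the /conf page; resource_manager is your host name (placeholder)."""
    url = f"http://{resource_manager}:{port}/conf"
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read()
```

For example, find_properties(fetch_conf("my-rm-host"), "scheduler") would list every property with "scheduler" in its name, where my-rm-host stands in for your own ResourceManager.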

After much hard work, I finally figured out how to check which scheduler is running in Hadoop 1.1.2. After running a word-count job, I went to the JobTracker web interface and opened the job history. To the right of the job file there is a link; click it and you will see everything, such as the scheduler, DFS replication, etc.
Also, in Hadoop 1.1.2 it is the mapred-site.xml file where we need to add properties, as specified in the Apache documentation for Hadoop 1.1.2.

Related

Is there any way to stop the Job History Server from showing information about our MR application?

I tried spark.eventlog.dir=false, and the Spark History Server then showed no information about the application.
Is there a similar way to keep the Job History Server from showing our application's information, as the Spark History Server does when spark.eventlog.dir is set to false?
This can be done by setting the mapreduce.jobhistory.intermediate-done-dir property in the command below to any location other than the one configured in mapred-site.xml:
yarn jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi -D mapreduce.map.memory.mb=512 -D mapreduce.reduce.memory.mb=512 -Dmapreduce.jobhistory.intermediate-done-dir=<<"new-location">> 2 10
When this is done, the Job History Server cannot move the logs from the intermediate done directory to the done directory, because it reads from the location configured in mapred-site.xml.

How to change java.io.tmpdir for a Spark job running on YARN

How can I change java.io.tmpdir folder for my Hadoop 3 Cluster running on YARN?
By default it gets something like /tmp/***, but my /tmp filesystem is too small for everything the YARN job will write there.
Is there a way to change it?
I have also set hadoop.tmp.dir in core-site.xml, but it looks like it is not really used.
Perhaps it's a duplicate of "What should be hadoop.tmp.dir?". Also, go through all the configuration files in /etc/hadoop/conf and search for tmp to see if anything is hardcoded. Also specify:
Whether you see any files being created at the location you specified as hadoop.tmp.dir.
What pattern of files is being created at /tmp/** after your changes are applied.
I have also noticed Hive creating files in /tmp, so you may also want to look at hive-site.xml. The same goes for any other ecosystem product you are using.
I configured the yarn.nodemanager.local-dirs property in yarn-site.xml and restarted the cluster. After that, Spark stopped using the /tmp file system and used the directories configured in yarn.nodemanager.local-dirs.
The java.io.tmpdir property for Spark executors was also set to the directories defined in the yarn.nodemanager.local-dirs property.
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>/somepath1,/anotherpath2</value>
</property>
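Since the original problem was a file system that was too small, it may be worth checking the free space on each configured directory before restarting. A minimal sketch, assuming the value is a comma-separated list of existing local paths as in the property above; free_space_by_dir is a hypothetical helper name.

```python
import shutil


def free_space_by_dir(local_dirs_value):
    """Return {directory: free bytes} for a comma-separated directory list."""
    report = {}
    for directory in local_dirs_value.split(","):
        directory = directory.strip()
        if directory:
            # disk_usage reports on the file system that holds the directory.
            report[directory] = shutil.disk_usage(directory).free
    return report
```

Run it on each NodeManager host with the same string you put in the value element, for example free_space_by_dir("/somepath1,/anotherpath2").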

Why does Hive search for its configuration in HADOOP_CONF_DIR first?

Today I found that if I copy hive-site.xml into $HADOOP_HOME/etc/hadoop/, Hive will use the hive-site.xml in the $HADOOP_HOME/etc/hadoop/ instead of the one in $HIVE_HOME/conf, and it will also search for the hive-log4j.properties in $HADOOP_HOME/etc/hadoop/.
If not found, Hive will just use the default one in /lib/hive-common-1.1.0-cdh5.7.6.jar!/hive-log4j.properties instead of the customized one in $HIVE_HOME/conf, but why?
I searched for "copy hive-site.xml to HADOOP_HOME" in the official Hive manual on apache.org but failed to find any explanation...
My Hive version is hive-1.1.0-cdh5.7.6, Hadoop version hadoop-2.6.0-cdh5.7.6, JDK 1.7.
Since you've mentioned Sqoop, I'll point out the proper ways of getting the Hive XML configuration.
1) There's a classpath problem if the file isn't found. Copying the file is one solution, but a poor one. A symlink is preferred.
Every time I've used Sqoop, I never had to mess around with any XML files; it just worked. Therefore, both HDP and CDH must have the proper classpath and/or symlinks set up.
2) The documentation states where configurations are loaded from:
Sqoop will fall back to $HADOOP_HOME. If it is not set either, Sqoop will use the default installation locations for Apache Bigtop, /usr/lib/hadoop and /usr/lib/hadoop-mapreduce, respectively.
The active Hadoop configuration is loaded from $HADOOP_HOME/conf/, unless the $HADOOP_CONF_DIR environment variable is set
This classpath controls where configurations are loaded from
3) You can also supply extra files at runtime:
-files <comma separated list of files> specify comma separated files to be copied to the map reduce cluster
sqoop import -files $HIVE_HOME/conf/hive-site.xml ...
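The lookup rule quoted in point 2 can be written out as a short sketch. It models only the documented fallback ($HADOOP_CONF_DIR first, then $HADOOP_HOME/conf) and is not taken from the Sqoop or Hive source. The function name resolve_hadoop_conf_dir is hypothetical.

```python
import os


def resolve_hadoop_conf_dir(env=None):
    """Return the configuration directory implied by the environment, or None."""
    env = os.environ if env is None else env
    conf_dir = env.get("HADOOP_CONF_DIR")
    if conf_dir:
        # An explicitly set HADOOP_CONF_DIR takes precedence.
        return conf_dir
    hadoop_home = env.get("HADOOP_HOME")
    if hadoop_home:
        return os.path.join(hadoop_home, "conf")
    return None
```

Calling resolve_hadoop_conf_dir() with no argument inspects the current process environment, which helps when checking which directory a tool launched from that shell would be pointed at.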

Get a YARN configuration value from the command line

In EMR, is there a way to get a specific value of the configuration given the configuration key using the yarn command?
For example, I would like to do something like this:
yarn get-config yarn.scheduler.maximum-allocation-mb
It's a bit non-intuitive, but it turns out the hdfs getconf command is capable of checking configuration properties for YARN and MapReduce, not only HDFS.
> hdfs getconf -confKey fs.defaultFS
hdfs://localhost:19000
> hdfs getconf -confKey dfs.namenode.name.dir
file:///Users/chris/hadoop-deploy-trunk/data/dfs/name
> hdfs getconf -confKey yarn.resourcemanager.address
0.0.0.0:8032
> hdfs getconf -confKey mapreduce.framework.name
yarn
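If you need the value inside a script, the command can be wrapped as below. This is a sketch that assumes hdfs is on the PATH of the machine running it. The get_conf name and its base_cmd parameter are made up for the example, and the parameter exists only so the command can be swapped out.

```python
import subprocess


def get_conf(key, base_cmd=("hdfs", "getconf", "-confKey")):
    """Run `hdfs getconf -confKey <key>` and return its output without the newline."""
    result = subprocess.run(
        [*base_cmd, key],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the command fails
    )
    return result.stdout.strip()
```

For the question's example, get_conf("yarn.scheduler.maximum-allocation-mb") would return the value as a string.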
A benefit of using this is that you'll see the actual, final results of any configuration properties as they are actually used by Hadoop. This would account for some of the more advanced configuration patterns, such as use of XInclude in the XML files or property substitutions, like this:
<property>
<description>The address of the applications manager interface in the RM.</description>
<name>yarn.resourcemanager.address</name>
<value>${yarn.resourcemanager.hostname}:8032</value>
</property>
Any scripting approach that tries to parse the XML files directly is unlikely to accurately match the implementation as it's done inside Hadoop, so it's better to ask Hadoop itself.
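To give a feel for what a hand-rolled parser would have to reproduce, here is a deliberately simplified sketch of property substitution. It is not Hadoop's code: it only looks up other properties, uses an arbitrary cap on expansion rounds, and ignores other sources that a real implementation may consult. The sample values are assumptions for the example.

```python
import re

# Matches ${name} where name contains no braces, dollar signs or whitespace.
_VAR = re.compile(r"\$\{([^}$\s]+)\}")


def substitute(value, properties, max_depth=20):
    """Expand ${name} references using other properties, up to max_depth rounds."""
    for _ in range(max_depth):
        match = _VAR.search(value)
        if match is None:
            return value
        replacement = properties.get(match.group(1))
        if replacement is None:
            return value  # unknown reference: stop and leave it unexpanded
        value = value[:match.start()] + replacement + value[match.end():]
    return value


# Sample values for illustration only.
props = {
    "yarn.resourcemanager.hostname": "0.0.0.0",
    "yarn.resourcemanager.address": "${yarn.resourcemanager.hostname}:8032",
}
print(substitute(props["yarn.resourcemanager.address"], props))  # prints 0.0.0.0:8032
```

Even this small piece shows why a script that reads only the raw value element would report the unexpanded string rather than the address Hadoop actually uses.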
You might be wondering why an hdfs command can get configuration properties for YARN and MapReduce. Great question! It's somewhat of a coincidence of the implementation needing to inject an instance of MapReduce's JobConf into some objects created via reflection. The relevant code is visible here:
https://github.com/apache/hadoop/blob/release-2.7.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/ReflectionUtils.java#L82-L114
This code is executed as part of running the hdfs getconf command. By triggering a reference to JobConf, it forces class loading and static initialization of the relevant MapReduce and YARN classes that add yarn-default.xml, yarn-site.xml, mapred-default.xml and mapred-site.xml to the set of configuration files in effect.
Since it's a coincidence of the implementation, it's possible that some of this behavior will change in future versions, but it would be a backwards-incompatible change, so we definitely wouldn't change that behavior inside the current Hadoop 2.x line. The Apache Hadoop Compatibility policy commits to backwards-compatibility within a major version line, so you can trust that this will continue working at least within the 2.x version line.

Cloudera Manager: Where do I put Java ClassPath for MapReduce jobs?

I've got Hadoop-Lzo working happily on my local pseudo-cluster but the second I try the same jar file in production, I get:
java.lang.RuntimeException: native-lzo library not available
The libraries are verified to be on the DataNodes, so my question is:
In what screen / setting do I specify the location of the native-lzo library?
For MapReduce, you need to add the entries to the MapReduce Client Environment Safety Valve. You can find it under the View and Edit tab under Configuration. Then add these lines there:
HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/opt/cloudera/parcels/HADOOP_LZO/lib/hadoop/lib/*
JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:/opt/cloudera/parcels/HADOOP_LZO/lib/hadoop/lib/native
Also add the LZO codecs to the io.compression.codecs property under the MapReduce service. To do that, go to io.compression under the View and Edit tab under Configuration and add these lines:
com.hadoop.compression.lzo.LzoCodec
com.hadoop.compression.lzo.LzopCodec
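The io.compression.codecs value is a comma-separated list of class names, so the two LZO entries are added to whatever is already configured rather than replacing it. A small sketch of that merge, with a made-up existing value and a hypothetical helper name add_codecs:

```python
def add_codecs(current_value, new_codecs):
    """Append codecs to a comma-separated list, skipping ones already present."""
    codecs = [c.strip() for c in current_value.split(",") if c.strip()]
    for codec in new_codecs:
        if codec not in codecs:
            codecs.append(codec)
    return ",".join(codecs)


# Sample existing value; your cluster's list may differ.
existing = "org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec"
lzo = ["com.hadoop.compression.lzo.LzoCodec", "com.hadoop.compression.lzo.LzopCodec"]
print(add_codecs(existing, lzo))
```

Running it a second time on its own output leaves the list unchanged, which avoids duplicate entries if the setting is edited more than once.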
Do not forget to restart your MR daemons after making the changes. Once they have restarted, redeploy your MR client configuration.
For detailed help on how to use LZO, you can visit this link:
http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM4Ent/latest/Cloudera-Manager-Installation-Guide/cmig_install_LZO_Compression.html
HTH
Try sudo apt-get install lzop on your TaskTracker nodes.

Resources