What is the difference between the configuration files under /etc/hadoop/conf, /etc/hadoop/conf.cloudera.HDFS and /etc/hadoop/conf.cloudera.YARN - hadoop

I have Cloudera 5.7, and I have Cloudera Manager as well.
Under the directory /etc/hadoop, I see three sub-directories:
/etc/hadoop/conf
/etc/hadoop/conf.cloudera.HDFS/
/etc/hadoop/conf.cloudera.YARN/
The hadoop-env.sh in ../conf/ is different from the one in ../conf.cloudera.HDFS/..
The core-site.xml in ../conf/ is different from the one in ../conf.cloudera.HDFS/.. as well.
The hadoop-env.sh under ../conf/ has settings for YARN, while the one under ../conf.cloudera.HDFS doesn't have them,
and the one in ../conf.cloudera.HDFS/.. has the settings for the NameNode, DataNodes, etc.
Since I have CM installed, I am wondering whether these configuration files are really in use.
If they are, and I need to change some environment variables, should I change all of these hadoop-env.sh files and copy them to the other nodes?
Thanks.

Cloudera Manager handles these settings for you. If you edit the configuration files manually, your changes will be erased by CM.
If you want to make a change, do it through CM.
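To see which of these directories your clients are actually reading, you can check where the /etc/hadoop/conf symlink resolves. A minimal sketch, assuming a typical CDH install where the link is managed by the Linux alternatives system:

readlink -f /etc/hadoop/conf                   # which concrete conf directory is active
alternatives --display hadoop-conf             # RHEL/CentOS: list registered alternatives and priorities
# update-alternatives --display hadoop-conf    # Debian/Ubuntu equivalent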

Related

How to change java.io.tmpdir for spark job running on yarn

How can I change the java.io.tmpdir folder for my Hadoop 3 cluster running on YARN?
By default it gets something like /tmp/***, but my /tmp filesystem is too small for everything the YARN job will write there.
Is there a way to change it?
I have also set hadoop.tmp.dir in core-site.xml, but it looks like it is not really used.
Perhaps it's a duplicate of "What should be hadoop.tmp.dir?". Also, go through all the config files in /etc/hadoop/conf and search for tmp to see if anything is hardcoded (a sketch follows below). Also specify:
Whether you see (any) files getting created at the location you specified as hadoop.tmp.dir.
What pattern of files is being created under /tmp/** after your changes are applied.
I have also noticed Hive creating files in /tmp, so you may also want to have a look at hive-site.xml. The same goes for any other ecosystem product you are using.
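For those searches, a quick sketch (the /etc/hive/conf path is an assumption and varies by distribution):

grep -ri "tmp" /etc/hadoop/conf/*.xml           # look for hardcoded tmp paths
grep -ri "tmp" /etc/hive/conf/hive-site.xml     # same check for Hive, if installed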
I configured the yarn.nodemanager.local-dirs property in yarn-site.xml and restarted the cluster. After that, Spark stopped using the /tmp filesystem and used the directories configured in yarn.nodemanager.local-dirs.
The java.io.tmpdir property for the Spark executors was also set to the directories defined in yarn.nodemanager.local-dirs.
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/somepath1,/anotherpath2</value>
</property>
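The NodeManagers have to be restarted for the new directories to take effect. A sketch using the stock Hadoop 3 scripts (if the cluster is managed by Ambari or Cloudera Manager, restart the service from its UI instead):

# run on every NodeManager host
yarn --daemon stop nodemanager
yarn --daemon start nodemanager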

Connect spark on yarn-cluster in CDH 5.4

I am trying to understand the "concept" of connecting to a remote server. What I have are 4 servers on CentOS using CDH 5.4.
What I want to do is connect Spark on YARN on all four of these nodes.
My problem is that I do not understand how to set HADOOP_CONF_DIR as specified here. Where should I set this variable, and to what value? And then, do I need to set this variable on all four nodes, or will only the master node suffice?
The documentation says "Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client side) configuration files for the Hadoop cluster".
I have read many questions similar to this before asking it here. Please let me know what I can do to solve this problem. I am able to run spark and pyspark in standalone mode on all nodes.
Thanks for your help.
Ashish
Where should I set this variable, and to what value?
The variable HADOOP_CONF_DIR should point to the directory that contains yarn-site.xml. Usually you set it in ~/.bashrc. I found documentation for CDH.
http://archive.cloudera.com/cdh5/cdh/5/hadoop/hadoop-project-dist/hadoop-common/ClusterSetup.html
Basically, all nodes need to have the configuration files pointed to by the environment variable.
Once all the necessary configuration is complete, distribute the files to the HADOOP_CONF_DIR directory on all the machines
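As a concrete sketch, assuming the client configs live under /etc/hadoop/conf (the usual CDH location; adjust for your layout), add something like this to ~/.bashrc on every node you submit from:

export HADOOP_CONF_DIR=/etc/hadoop/conf
export YARN_CONF_DIR=$HADOOP_CONF_DIR

# then, for example (the class and jar names are placeholders):
spark-submit --master yarn-cluster --class com.example.MyApp my-app.jar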

[hdfs] How to configure a different dfs.datanode.data.dir for each datanode?

I used Ambari to set up a Hadoop cluster,
but when I configure HDFS, I found that if I modify dfs.datanode.data.dir, the change takes effect on all datanodes...
How can I configure different values for each datanode?
For example, there are two disks in machine A, mounted at /data1 and /data2,
but there is only one disk in machine B, mounted at /data1.
So I want to set dfs.datanode.data.dir to "/data1,/data2" for machine A,
but only "/data1" for machine B.
HDFS directories that don't exist will be ignored, so just put them all in; it won't matter (see the snippet below).
Remember that each Hadoop node in the cluster has its own set of configuration files too (under the usual conf/ dir), so you can log in to that datanode machine and change its config files.
The local configuration on a datanode takes effect for that datanode.
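Following the "put them all in" suggestion, the same hdfs-site.xml snippet can be shipped to both machines (mount points taken from the question; per the answer above, the nonexistent /data2 on machine B will just be ignored):

<property>
  <name>dfs.datanode.data.dir</name>
  <value>/data1,/data2</value>
</property>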

A Hadoop DataNode error: host:port authority

When I try to run the Hadoop cluster, I can't get it to work. The main error is like this:
The strange thing is that the NameNode, JobTracker, SecondaryNameNode and TaskTracker are OK; only the DataNode is not.
My other configurations are like these:
hdfs-site.xml
core-site.xml
mapred-site.xml
I am not sure if it will help, but check this page.
To quote from there,
Even though I configured core-site.xml, mapred-site.xml & hdfs-site.xml under the /usr/local/hadoop/conf/ folder, by default the system was referring to /etc/hadoop/*.xml. Once I updated the configuration files in the /etc/hadoop location, everything started working.
Please make sure you are picking up the correct set of configuration files. It looks like a classpath-related issue, since your setup is bypassing whatever you have configured in your core-site.xml. Do you have another Hadoop setup on the same machine, done earlier, for which you forgot to adjust the classpath for the current setup?
Also, http:// is not required in mapred-site.xml (see the snippet below).
HTH
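For that last point, the classic MRv1 JobTracker address goes in as a plain host:port value; a sketch (the hostname and port here are placeholders, not taken from your setup):

<property>
  <name>mapred.job.tracker</name>
  <!-- plain host:port, no http:// prefix -->
  <value>master-host:9001</value>
</property>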

Apache Hadoop 2.0.0-alpha version installation in a full cluster using federation

I had installed the stable Hadoop version successfully, but I am confused while installing the hadoop-2.0.0 version.
I want to install hadoop-2.0.0-alpha on two nodes, using federation on both machines. rsi-1 and rsi-2 are the hostnames.
What should the values of the properties below be for implementing federation? Both machines are also used as datanodes.
fs.defaultFS
dfs.federation.nameservices
dfs.namenode.name.dir
dfs.datanode.data.dir
yarn.nodemanager.localizer.address
yarn.resourcemanager.resource-tracker.address
yarn.resourcemanager.scheduler.address
yarn.resourcemanager.address
One more point: in the stable version of Hadoop I have the configuration files under the conf folder in the installation directory.
But in the 2.0.0-alpha version, there is an etc/hadoop directory and it doesn't have mapred-site.xml or hadoop-env.sh. Do I need to copy the conf folder under the share folder into the Hadoop home directory? Or do I need to copy these files from the share folder into the etc/hadoop directory?
Regards, Rashmi
You can run hadoop-setup-conf.sh in the sbin folder. It walks you through the configuration step by step.
Please remember that when it asks you to input a directory path, you should use the full path,
e.g., when it asks for the conf directory, you should input /home/user/Documents/hadoop-2.0.0/etc/hadoop
After it completes, remember to check every configuration file in etc/hadoop.
In my experience, I had to modify the JAVA_HOME variable in hadoop-env.sh and some properties in core-site.xml and mapred-site.xml.
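For the federation properties asked about above, a minimal hdfs-site.xml sketch for two nameservices, one NameNode per host (the nameservice names ns1/ns2 and port 8020 are assumptions; rsi-1 and rsi-2 are the hostnames from the question):

<property>
  <name>dfs.federation.nameservices</name>
  <value>ns1,ns2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns1</name>
  <value>rsi-1:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns2</name>
  <value>rsi-2:8020</value>
</property>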
Regards

Resources