Deploying hdfs core-site.xml with cloudera manager - hadoop

I'm trying to add lzo support to my configuration files using the cloudera manager (CDH5b2).
If I add the io.compression.codecs to the service-wide hdfs configuration, and deploy the configuration file, /etc/hadoop/conf.cloudera.hdfs/core-site.xml now contains the new value.
However, /etc/hadoop/conf.cloudera.yarn/core-site.xml has a higher priority (see update-alternatives --display hadoop-conf), so the hdfs core-site.xml values are not used when I start an MR job.
Obviously, I can simply modify the yarn core-site.xml file manually, but I don't understand how to deploy the hdfs core-site.xml file properly using Cloudera Manager.

There is a MapReduce Client Environment Safety Valve, also known as 'MapReduce Service Advanced Configuration Snippet (Safety Valve) for core-site.xml', found in the GUI under the MapReduce service's Configuration -> Service-Wide -> Advanced. It allows you to add any value that doesn't fit elsewhere. (The HDFS service has an equivalent snippet for core-site.xml as well.)
Having said that, details can be found on Cloudera's site at: http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_ig_cdh5_install.html
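For example, the snippet pasted into that safety valve for the LZO codecs might look roughly like this (the codec class list is illustrative; it assumes the hadoop-lzo / GPL Extras package is installed on the hosts):

<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>

After saving, redeploy the client configuration and restart the affected services so the snippet lands in the generated core-site.xml.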

Related

Which processes need access to core-site.xml and hdfs-site.xml

The core-site.xml file tells the Hadoop daemons where the NameNode runs in the cluster. It contains the configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce.
The hdfs-site.xml file contains the configuration settings for the HDFS daemons: the NameNode, the Secondary NameNode, and the DataNodes. Here, we can configure hdfs-site.xml to specify default block replication and permission checking on HDFS. The actual number of replications can also be specified when the file is created; the default is used if replication is not specified at create time.
I'm looking to understand which processes [Namenode, Datanode, HDFS client] need access to which of those configuration files?
Namenode: I presume it only needs hdfs-site.xml because it doesn't need to know its own location.
Datanode: I presume it needs access to both core-site.xml (to locate the namenode) and hdfs-site.xml (for various settings)?
HDFS client: I presume it needs access to both core-site.xml (to locate the namenode) and hdfs-site.xml (for various settings)?
Is that accurate?
The clients and server processes need access to both files.
If you use HDFS nameservices with highly available NameNodes, the two NameNodes also need to find each other.
Some comments:
core-site.xml and hdfs-site.xml are the two files used by external programs (such as NiFi) to access the cluster / the WebHDFS API.
Edge nodes require both for cluster access.
Ambari will manage both of these along with all the others.
The three you listed all need access to both in order to run the cluster, and at a bare minimum to pick up basic settings such as proxy settings and cluster access.
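To make that concrete, here is a minimal sketch of the two files (hostname, port, and values are placeholders, not taken from any particular cluster):

core-site.xml:
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://namenode-host:8020</value>   <!-- how every daemon and client locates the NameNode -->
</property>

hdfs-site.xml:
<property>
  <name>dfs.replication</name>
  <value>3</value>                            <!-- default block replication -->
</property>
<property>
  <name>dfs.permissions.enabled</name>
  <value>true</value>                         <!-- permission checking on HDFS -->
</property>

Even the NameNode reads core-site.xml, since fs.defaultFS (and settings such as proxy users) live there rather than in hdfs-site.xml.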

Log aggregation retention in yarn-site. What does it mean, how does it work?

Let's look at https://hortonworks.com/blog/simplifying-user-logs-management-and-access-in-yarn
Here we have something like:
yarn.log-aggregation.retain-seconds
What logs are connected to this option? Hadoop DataNode? NameNode? Yarn Resource Manager?
Should I set it on all hadoop nodes? Where?
If a property starts with yarn, it only applies to YARN services. This includes the Resource Manager, the Node Manager, and possibly the new Timeline Server (if enabled). YARN-specific configuration settings belong in yarn-site.xml.
Similarly,
HDFS-specific configuration settings are found in hdfs-site.xml. They usually start with dfs.
Common settings belong in core-site.xml.
You set up yarn-site.xml on all hosts running YARN services.
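As an example, a yarn-site.xml entry on those hosts might look roughly like this (the retention value is just an illustration):

<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
<property>
  <name>yarn.log-aggregation.retain-seconds</name>
  <value>604800</value>   <!-- keep aggregated container logs for 7 days -->
</property>

These settings govern the aggregated container logs that the Node Managers upload to HDFS; they have no effect on NameNode or DataNode daemon logs.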

Plain vanilla Hadoop installation vs Hadoop installation using Ambari

I recently downloaded the Hadoop distribution from Apache and got it up and running quite quickly: download the Hadoop tarball, untar it at a location, and do some configuration. The thing here is that I am able to see the various configuration files like yarn-site.xml, hdfs-site.xml, etc., and I know the Hadoop home location.
Next, I installed Hadoop (HDP) using Ambari.
Here comes the confusing part. It seems Ambari installs HDP in /usr/hdp; however, the directory structure in plain vanilla Hadoop vs Ambari is totally different. I am not able to locate the configuration files, e.g. yarn-site.xml, etc.
So can anyone help me demystify this?
All configuration changes must be done via the Ambari UI. There is no use for the configuration files themselves, since Ambari persists the configurations in the Ambari database.
If you still need them, they are under /etc/hadoop/conf/.
It's true that configuration changes must be made via Ambari UI and that those configurations are stored in a database.
Why is it necessary to change these configuration properties in Ambari UI and not directly on disk?
Every time a service is restarted and it has a stale configuration, the ambari-agent is responsible for writing the latest configuration to disk. The files are written to /etc/<service-name>/conf. If you were to make changes directly to the configuration files on disk, they would get overwritten by the aforementioned process.
However the configuration files found on disk DO still have a use...
The configuration files (on disk) are used by the various hadoop daemons when they're started/running.
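For example, on an HDP node the rendered files typically end up in a path like the following (exact locations vary by service and version):

$ ls /etc/hadoop/conf          # core-site.xml, hdfs-site.xml, yarn-site.xml, ...
# other services follow the same pattern: /etc/<service-name>/conf

Any hand edit there will be overwritten the next time the ambari-agent rewrites the configuration, which is why changes belong in the Ambari UI.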
Basically, the benefit of using the Ambari UI in a clustered Hadoop deployment is that it gives you a central management point.
For example, take a 10-node Hadoop cluster setup.
Plain vanilla Hadoop:
If you change any configuration, you must change it on all 10 nodes.
Ambari UI:
Because the configuration is stored in a database, you just change it in the management portal and the change is reflected on all nodes from a single point.

Hadoop config - hdfs-site.xml: Should I use the same file on namenode and datanode?

On a distributed Hadoop cluster, can I copy the same hdfs-site.xml file to the namenodes and datanodes?
Some of the set-up instructions I've seen (i.e. Cloudera) say to have the dfs.data.dir property in this file on the datanodes and the dfs.name.dir property in this file on the namenode. Meaning I should have two copies of hdfs-site.xml, one for the namenode and one for the datanodes.
But if it's all the same I'd rather just own/maintain one copy of the file and push it to ALL nodes anytime I change it.
Is there any harm/risk in having both dfs.name.dir and dfs.data.dir properties in the same file? What issues might happen if a data node sees the property for "dfs.name.dir" ?
And if there are issues, what other properties should be in the hdfs-site.xml file on the namenode but not on datanode? and vice versa.
And finally, what properties need to be included in the hdfs-site.xml file that I copy to a client machine (who isn't a tasktracker or datanode, but just talks to the Hadoop cluster) ?
I've searched around, including the O'Reilly operations book, but can't find any good article describing how the config file needs to differ across different nodes.
Thanks!
The namenode is picked up from the masters file, so essentially the FSImage and edit logs will be written only on the namenode and not on the datanodes, even if you copy the same hdfs-site.xml everywhere.
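In other words, keeping both properties in one hdfs-site.xml is generally harmless, because each daemon only reads the keys it cares about. A rough sketch (paths are placeholders):

<property>
  <name>dfs.name.dir</name>
  <value>/data/1/dfs/nn</value>                  <!-- read only by the namenode -->
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/data/1/dfs/dn,/data/2/dfs/dn</value>   <!-- read only by the datanodes -->
</property>

A datanode that sees dfs.name.dir simply ignores it, and vice versa.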
For the second question: you can't necessarily communicate with HDFS without being on the cluster directly. If you want a remote client, you might try WebHDFS and build web services with which you can write or access files in HDFS.
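A minimal sketch of the WebHDFS route (hostname and port are placeholders, and dfs.webhdfs.enabled must be true on the cluster):

$ curl -i "http://namenode-host:50070/webhdfs/v1/user/myuser?op=LISTSTATUS"

That lists a directory over plain HTTP, so the remote client needs no copy of the cluster's configuration files at all.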

Locating Cloudera Manager HDFS config files

I've installed a cluster via Cloudera Manager, and now I need to launch the cluster manually.
I've been using the following command:
$ sudo -u hdfs hadoop namenode / datanode / jobtracker
But then dfs.name.dir is set to /tmp. I can't seem to find where Cloudera Manager keeps the HDFS config files. The ones in /usr/lib/hadoop-02*/conf seem to be minimal. They're missing the dfs.name.dir, which is what I'm looking for in particular. I'm on an RHEL 6 system, by the way. Being lazy, I thought I could just copy over Cloudera Manager's HDFS config files so I don't have to create them manually, then copy them over to 6 nodes :)
Thanks
I was facing the same problem.
I was changing configuration parameters from the Cloudera Manager UI but was clueless about where my changes were getting written on the local file system.
I ran a grep command and found out that in my case the configuration was stored in the /var/run/cloudera-scm-agent/process/*-hdfs-NAMENODE directory.
So David is right: whenever we change configs from the UI and restart the service, it creates new config settings in the /var/run/cloudera-scm-agent/process/ directory.
Using CentOS 6.5, the Cloudera Manager special files do not show up in a SEARCH FILES result because their permissions are set to hide from all but the 'hdfs' user. In addition, there are multiple versions of hdfs-site.xml on the local drive some of which have partial amounts of real settings. The actual settings file is in the DATANODE folder not the NAMENODE folder as evidenced by the lack of dfs.datanode.data.dir values in the latter.
Cloudera Manager deploys the config files each time you start the cluster, each time into a different directory. The directories are named after the process id or something like that.
The configuration is passed explicitly to each daemon as a parameter. So if you look at the command line of each Hadoop daemon, you can see where the configuration is sitting (or just grep the disk for hdfs-site.xml). The names of the config files are the same as usual.
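For example, something along these lines usually turns them up (run with root privileges, since the process directories are not world-readable):

$ sudo find /var/run/cloudera-scm-agent/process -name hdfs-site.xml
$ ps -ef | grep NameNode       # the java command line shows which process directory the daemon is using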
I was in the same boat and found this answer:
To allow Hadoop client users to work with the HDFS, MapReduce, YARN and HBase services you created, Cloudera Manager generates client configuration files that contain the relevant configuration files with the settings from your services. These files are deployed automatically by Cloudera Manager based on the services you have installed, when you add a service, or when you add a Gateway role on a host.
You can download and distribute these client configuration files manually to the users of a service, if necessary.
The Client Configuration URLs command on the cluster Actions menu opens a pop-up that displays links to the client configuration zip files created for the services installed in your cluster. You can download these zip files by clicking the link.
See Deploying Client Configuration Files for more information on this topic.
On our system I got there via http://your_server:7180/cmf/services/status and clicked the Actions popup under the Add Cluster button. Hope that helps.
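If you prefer the command line over the UI, the Cloudera Manager REST API can also return the client configuration zip; the API version, credentials, cluster and service names below are placeholders, so treat this as an assumption to adapt rather than a recipe:

$ curl -u admin:admin "http://your_server:7180/api/v10/clusters/Cluster1/services/hdfs/clientConfig" -o hdfs-clientconfig.zip
$ unzip hdfs-clientconfig.zip -d /tmp/hdfs-conf      # contains the core-site.xml and hdfs-site.xml a Gateway role would receive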
