Locating Cloudera Manager HDFS config files - hadoop

I've installed a cluster via Cloudera Manager, and now I need to launch the cluster manually.
I've been using the following command:
$ sudo -u hdfs hadoop namenode / datanode / jobtracker
But then dfs.name.dir ends up set to /tmp. I can't seem to find where Cloudera Manager keeps the HDFS config files. The ones in /usr/lib/hadoop-02*/conf seem to be minimal. They're missing dfs.name.dir, which is what I'm looking for in particular. I'm on an RHEL 6 system, by the way. Being lazy, I thought I could just copy over Cloudera Manager's HDFS config files so I don't have to create them manually and then copy them over to 6 nodes :)
Thanks

I was facing the same problem.
I was changing configuration parameters from the Cloudera Manager UI but was clueless about where my changes were being written on the local file system.
I ran a grep command and found out that, in my case, the configuration was stored in the /var/run/cloudera-scm-agent/process/*-hdfs-NAMENODE directory.
So David is right: whenever we change configs from the UI and restart the service, it creates new config settings in the /var/run/cloudera-scm-agent/process/ directory.
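For reference, a find along these lines should turn up the generated files; the process directory names will differ on your cluster, and you will likely need root to read them:
sudo find /var/run/cloudera-scm-agent/process/ -name hdfs-site.xml # list every generated copy of hdfs-site.xml
sudo find /var/run/cloudera-scm-agent/process/ -name hdfs-site.xml -exec grep -l dfs.name.dir {} + # show which copies actually set dfs.name.dir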

Using CentOS 6.5, the Cloudera Manager special files do not show up in a file search because their permissions are set to hide them from all but the 'hdfs' user. In addition, there are multiple versions of hdfs-site.xml on the local drive, some of which contain only a subset of the real settings. The actual settings file is in the DATANODE folder, not the NAMENODE folder, as evidenced by the lack of dfs.datanode.data.dir values in the latter.

Cloudera Manager deploys the config files each time you start the cluster, each time into a different directory. The directories are named after the process ID or something like that.
The configuration is passed explicitly to each daemon as a parameter, so if you look at the command line of each Hadoop daemon you can see where its configuration sits (or just grep the disk for hdfs-site.xml). The names of the config files are the same as usual.
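As a rough sketch of checking the daemon command lines (the exact process names and output format will vary per host):
ps -ef | grep -i namenode # the NameNode's command line includes the config directory it was started with
ps -ef | grep -i datanode # same idea for the DataNode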

I was in the same boat and found this answer:
To allow Hadoop client users to work with the HDFS, MapReduce, YARN and HBase services you created, Cloudera Manager generates client configuration files that contain the relevant configuration files with the settings from your services. These files are deployed automatically by Cloudera Manager based on the services you have installed, when you add a service, or when you add a Gateway role on a host.
You can download and distribute these client configuration files manually to the users of a service, if necessary.
The Client Configuration URLs command on the cluster Actions menu opens a pop-up that displays links to the client configuration zip files created for the services installed in your cluster. You can download these zip files by clicking the link.
See Deploying Client Configuration Files for more information on this topic.
On our system I got there via http://your_server:7180/cmf/services/status and clicked the Actions popup under the Add Cluster button. Hope that helps.
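If you prefer the command line, the same client configuration zip can also be fetched through the Cloudera Manager REST API; this is only a sketch, and the cluster name, service name, credentials and API version below are placeholders you would need to adjust for your deployment:
curl -u admin:admin "http://your_server:7180/api/v10/clusters/Cluster1/services/hdfs/clientConfig" -o hdfs-clientconfig.zip # download the HDFS client config bundle
unzip hdfs-clientconfig.zip -d hdfs-clientconfig # the usual *-site.xml files are inside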

Related

How users should work with an Ambari cluster

My question is pretty trivial, but I didn't find anyone actually asking it.
We have an Ambari cluster with Spark, Storm, HBase and HDFS (among other things).
I don't understand how a user who wants to use that cluster actually uses it.
For example, a user wants to copy a file to HDFS, run a spark-shell, or create a new table in the HBase shell.
Should he get a local account on the server that runs the corresponding service? Shouldn't he use a third-party machine (his own laptop, for example)?
If so, how should one use hadoop fs? There is no way to specify the server IP the way spark-shell has.
What is the normal/right/expected way to run all these tasks from a user's perspective?
Thanks.
The expected way to run the described tasks from the command line is as follows.
First, gain access to the command line of a server that has the required clients installed for the services you want to use, e.g. HDFS, Spark, HBase et cetera.
During the process of provisioning a cluster via Ambari, it is possible to define one or more servers where the clients will be installed.
Here you can see an example of an Ambari provisioning process step. I decided to install the clients on all servers.
Afterwards, one way to figure out which servers have the required clients installed is to check your hosts views in Ambari. Here you can find an example of an Ambari hosts view: check the green rectangle to see the installed clients.
Once you have installed the clients on one or more servers, these servers will be able to utilize the services of your cluster via the command line.
Just to be clear, the utilization of a service by a client is location-independent from the server where the service is actually running.
Second, make sure that you are compliant with the security mechanisms of your cluster. In relation to HDFS, this could influence which users you are allowed to use and which directories you can access by using them. If you do not use security mechanisms like e.g. Kerberos, Ranger and so on, you should be able to directly run your stated tasks from the command line.
Third, execute your tasks via command line.
Here is a short example of how to access HDFS without considering security mechanisms:
ssh user@hostxyz # Connect to the server that has the required HDFS client installed
hdfs dfs -ls /tmp # Command to list the contents of the HDFS tmp directory
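If the machine you are on does not have the cluster's client configuration deployed, you can also address the NameNode explicitly in the path, similar to how spark-shell takes a master URL; the hostname and port below are placeholders for your NameNode's RPC address:
hdfs dfs -ls hdfs://namenode-host:8020/tmp # List the HDFS tmp directory by pointing directly at a specific NameNode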
Take a look at Ambari Views, especially the Files View, which allows browsing HDFS.

Plain vanilla Hadoop installation vs Hadoop installation using Ambari

I recently downloaded the Hadoop distribution from Apache and got it up and running quite fast: download the Hadoop tarball, untar it at a location, and do some configuration. The thing here is that I am able to see the various configuration files like yarn-site.xml, hdfs-site.xml etc., and I know the Hadoop home location.
Next, I installed Hadoop (HDP) using Ambari.
Here comes the confusing part. It seems Ambari installs HDP in /usr/hdp; however, the directory structure in plain vanilla Hadoop vs. Ambari is totally different. I am not able to locate the configuration files, e.g. yarn-site.xml etc.
So can anyone help me demystify this?
All the configuration changes must be done via the Ambari UI. There is no use for editing the configuration files directly, since Ambari persists the configurations in the Ambari database.
If you still need them, they are under /etc/hadoop/conf/.
It's true that configuration changes must be made via Ambari UI and that those configurations are stored in a database.
Why is it necessary to change these configuration properties in Ambari UI and not directly on disk?
Every time a service is restarted and it has a stale configuration, the ambari-agent is responsible for writing the latest configuration to disk. The files are written to /etc/<service-name>/conf. If you were to make changes directly to the configuration files on disk, they would get overwritten by the aforementioned process.
However the configuration files found on disk DO still have a use...
The configuration files (on disk) are used by the various hadoop daemons when they're started/running.
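As a quick way to confirm this on an Ambari-managed HDP host (the paths below are typical defaults and may vary by HDP version):
ls -l /etc/hadoop/conf # on HDP this is usually a symlink chain into /usr/hdp/current/hadoop-client/conf
grep -A1 yarn.resourcemanager.hostname /etc/hadoop/conf/yarn-site.xml # the value you set in the Ambari UI shows up here after a restart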
Basically, the benefit of using the Ambari UI in a clustered Hadoop deployment is that it gives you a central management point.
For example, take a 10-node Hadoop cluster setup.
Plain vanilla Hadoop:
If you change any configuration, you must change it on all 10 nodes.
Ambari UI:
Because the configuration is stored in the database, you just change it in the management portal and the change is reflected on all nodes from a single point.

Unable to find oozie-site.xml in CDH5.01 version

I use Cloudera Hadoop CDH5.01.
During Oozie execution I'm getting the error
Jobtracker [cloudera:5032] not allowed, not in Oozies whitelist
In order to fix this issue, I need to add the ResourceManager address to the whitelist in oozie-site.xml.
Cloudera documents say it's located in /etc/oozie/conf/. Modifying the file is not reflected in the Oozie console; the configuration that Oozie uses comes from somewhere else and is regenerated whenever I start Oozie.
E.g.
/run/cloudera-scm-agent/process/294-oozie-OOZIE_SERVER/oozie-site.xml
How do I find the actual configuration file being used with Cloudera Hadoop + Oozie?
You have the file oozie-default.xml in $OOZIE_HOME/conf
The folder you listed is where the actual oozie-site.xml is written to; for example /run/cloudera-scm-agent/process/294-oozie-OOZIE_SERVER/oozie-site.xml. Whenever oozie is started a process directory is created and its configuration is written somewhere under that directory.
If you need to modify values that get written to oozie-site.xml, then you must modify those values in Cloudera Manager. Modifying oozie-site.xml directly will not work, as the configuration will just be overwritten the next time the service is started. Open Cloudera Manager in a browser, select your cluster, select the Oozie service, and select the Configuration tab. Then modify the setting in question. After you save the changes, you will see an icon next to the service indicating that the configuration needs to be redeployed and that the service needs to be restarted to pick up the new changes.
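To double-check which oozie-site.xml the running server actually picked up, and whether your whitelist change made it in, something along these lines should work (the process directory names are regenerated on every restart and will differ on your host):
sudo find /run/cloudera-scm-agent/process -maxdepth 1 -name '*-oozie-OOZIE_SERVER' # the generated Oozie process directories
sudo find /run/cloudera-scm-agent/process -name oozie-site.xml -exec grep -A1 whitelist {} + # confirm the whitelist values Oozie was started with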

How To Sync the contents to all nodes

I was following this tutorial trying to install and configure Spark on my cluster.
My cluster (5 nodes) is hosted on AWS and installed from Cloudera Manager.
The tutorial says to "Sync the contents of /etc/spark/conf to all nodes" after modifying the configuration file.
I am really wondering what the easiest way is to make that happen. I read a post with a question similar to mine HERE. Based on my understanding, the configuration files of Hadoop, HDFS, etc. are managed by ZooKeeper or Cloudera Manager, so for those it might be a matter of using CM deploy or ZooKeeper.
However, Spark's configuration file is totally outside ZooKeeper's scope. How can I "sync" it to the other nodes?
Many thanks!
Why don't you mount the same EBS volume via NFS to /etc/spark/conf or one of its parents, so the files are automatically synced?
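If shared storage is not an option, a minimal sketch of pushing the directory out by hand would be a loop like the following; the hostnames are placeholders for your worker nodes, and you need SSH access and write permission on the target path:
for host in node1 node2 node3 node4; do
  rsync -av /etc/spark/conf/ "$host":/etc/spark/conf/ # copy the edited Spark config to each node
done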

Namenode UI - Browse File System not working in pseudo-distributed mode

I have installed Hadoop 0.20.2 in pseudo-distributed mode (all daemons on a single machine).
It's up and running, I'm able to access HDFS through the command line and run jobs, and I'm able to see the output.
But I am not able to browse the file system using the UI provided by Hadoop:
http://namenode:50070/dfshealth.jsp shows the version and cluster status, but when I click on "Browse the filesystem" it doesn't show anything. Is there any issue with this?
I'm able to list the contents using HDFS shell commands, and in cluster mode it's working fine.
Only in pseudo-distributed mode am I not able to browse the file system; any input on this is appreciated. I have installed Hadoop 1.0.0 in pseudo-distributed mode too, and I'm facing the same problem.
try this:
vi /usr/local/hadoop/conf/core-site.xml
And change this line:
<value>hdfs://localhost:54310</value>
to
<value>hdfs://[your IP]:54310</value>
Add the hostname and IP of the NameNode to the hosts file of the system from which you are browsing the above URL. If this is not done, clicking the "Browse the filesystem" link will fail.
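For example, adding an entry on the browsing machine could look like the following; the IP address and hostname are placeholders for your NameNode:
echo "192.168.1.10   namenode" | sudo tee -a /etc/hosts # map the NameNode's hostname so the redirect from the UI resolves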
OK, I was also facing the same problem...
First, my NameNode storage directory was the tmp folder, so whenever I restarted my machine all the data was lost.
So I changed my NameNode storage directory to another location on my hard disk.
Then I was still facing the same problem: I couldn't browse my file system.
When I checked the permissions of that folder, there were no permissions set on it, and I was unable to change them.
So I copied the Hadoop folder from my tmp folder to my home folder and changed my NameNode storage directory to that folder in my home directory.
And my problem was solved.
Open /etc/hadoop/conf/core-site.xml
and change this:
hdfs://localhost:8020
to
hdfs://(your-ip):8020
then restart the hadoop-datanode service.
Check the logs if this also doesn't work.
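As a sketch of those last two steps (the exact service name and log location depend on your distribution and version; on a plain Apache tarball install the logs live under $HADOOP_HOME/logs instead):
sudo service hadoop-hdfs-datanode restart # restart the DataNode so it picks up the new fs address
tail -n 50 /var/log/hadoop-hdfs/*datanode*.log # check the DataNode log if browsing still fails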
Here is my analysis. I'm having the same problem, and I'm using AWS. The "Browse the filesystem" link points to nn_browsedfscontent.jsp.
nn_browsedfscontent.jsp typically does the following:
fetch the datanode IP address
fetch the datanode port (50075)
redirect the request to ipaddress:port.
In the case of AWS, a server instance has a private DNS name (available only between instances) and a public DNS name (available for external access over the internet).
In step #1, the address fetched is the private DNS name, not the public DNS name.
In step #3, the ipaddress:50075 is private-dns:50075, which will fail as it is not accessible publicly.
I replaced private-dns:50075 with public-dns:50075 and was able to browse the filesystem contents.
My knowledge of JavaScript is very poor, so I'm unable to modify nn_browsedfscontent.jsp to solve this problem. Not sure if it has already been resolved.
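For reference, on an EC2 instance the public DNS name can be looked up from the instance metadata service, which helps when building the public-dns:50075 URL by hand; this is just a sketch of that lookup (port 50075 is the default datanode web port mentioned above):
curl -s http://169.254.169.254/latest/meta-data/public-hostname # prints the instance's public DNS name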
