Amazon Elastic MapReduce default configurations - Hadoop

Are the Hadoop configuration defaults (core-site.xml, yarn-site.xml, etc.) published by Amazon? I've seen select parameters published, but not the complete default configurations.

You will find the default configurations on the Amazon EMR documentation page at the link below:
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/TaskConfiguration_H2.html
There is also a way to find the configuration for a job that has been triggered. Visit the S3 location where the logs are saved; there you will find a file such as j-yourjoblogsfilelocation/jobs/job_1403597560615_0001_conf.xml. That _conf.xml contains every configuration value passed to the job. These are the defaults unless you override them, so they will differ depending on the instance types you choose; memory settings, for example, may vary.
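As a quick illustration (the bucket name and job-flow ID below are placeholders for your own log location), you could copy the job's _conf.xml down from S3 and grep it for a setting such as a memory value:
$ aws s3 cp s3://your-log-bucket/j-YOURJOBFLOWID/jobs/job_1403597560615_0001_conf.xml .
$ grep -A 1 "mapreduce.map.memory.mb" job_1403597560615_0001_conf.xml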

Related

Unable to find oozie-site.xml in CDH5.01 version

I use Cloudera Hadoop CDH5.01.
During Oozie execution I'm getting the error:
Jobtracker [cloudera:5032] not allowed, not in Oozies whitelist
To fix this issue, I need to add the ResourceManager address to the whitelist in oozie-site.xml.
The Cloudera documents say it is located in /etc/oozie/conf/, but modifying that file is not reflected in the Oozie console. The configuration Oozie is actually using comes from somewhere else and is regenerated whenever I start Oozie, for example:
/run/cloudera-scm-agent/process/294-oozie-OOZIE_SERVER/oozie-site.xml
How do I find the actual configuration file being used by Cloudera Hadoop + Oozie?
You have the file oozie-default.xml in $OOZIE_HOME/conf.
The folder you listed is where the actual oozie-site.xml is written to; for example, /run/cloudera-scm-agent/process/294-oozie-OOZIE_SERVER/oozie-site.xml. Whenever Oozie is started, a process directory is created and its configuration is written under that directory.
If you need to modify values that get written to oozie-site.xml, you must modify them in Cloudera Manager. Editing oozie-site.xml directly will not work, because the configuration will simply be overwritten the next time the service is started. Open Cloudera Manager in a browser, select your cluster, select the Oozie service, and select the Configuration tab, then modify the setting in question. After you save the changes you will see an icon next to the service indicating that the configuration needs to be redeployed and that the service needs to be restarted to pick up the new changes.
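For illustration, the whitelist entry being asked about would look something like the following when added through the Oozie safety valve in Cloudera Manager (check the exact property name against your Oozie version's oozie-default.xml; the host:port here is simply the address from the error message):
<property>
  <name>oozie.service.HadoopAccessorService.jobTracker.whitelist</name>
  <value>cloudera:5032</value>
</property>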

Configure Hadoop to use S3 requester-pays-enabled

I'm using Hadoop (via Spark), and need to access S3N content which is requester-pays. Normally, this is done by enabling httpclient.requester-pays-buckets-enabled = true in jets3t.properties. Yet I've set this and Spark / Hadoop are ignoring it. Perhaps I'm putting the jets3t.properties in the wrong place (/usr/share/spark/conf/). How can I get Hadoop / Spark / JetS3t to access requester-pays buckets?
UPDATE: This is only needed if you are outside Amazon EC2. Within EC2, Amazon doesn't require requester-pays, so a crude workaround is to run from within EC2.
The Spark system is made up of several JVMs (application, master, workers, executors), so setting properties can be tricky. You could call System.getProperty() before the file operation to check whether the JVM running the code has actually loaded the right configuration, and you could even call System.setProperty() to set the value directly at that point instead of figuring out the config files.
Environment variables and config files didn't work, but some manual code did: sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "PUTTHEKEYHERE")
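A minimal Scala sketch of the two ideas above, run on the driver before any S3N access; whether JetS3t actually honours a system property set this way depends on how it loads jets3t.properties, so treat it as exploratory rather than a confirmed fix. The bucket path is hypothetical.
import org.apache.spark.{SparkConf, SparkContext}

object RequesterPaysCheck {
  def main(args: Array[String]): Unit = {
    // Set the JetS3t flag in this JVM before touching S3N (assumption: JetS3t picks it up here).
    System.setProperty("httpclient.requester-pays-buckets-enabled", "true")

    val sc = new SparkContext(new SparkConf().setAppName("requester-pays-check"))

    // Credentials set straight on the Hadoop configuration, as in the workaround above.
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

    // Confirm which value this JVM actually sees before the file operation.
    println(System.getProperty("httpclient.requester-pays-buckets-enabled"))

    // Hypothetical requester-pays bucket and path.
    println(sc.textFile("s3n://some-requester-pays-bucket/data/*").count())

    sc.stop()
  }
}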

Deploying hdfs core-site.xml with cloudera manager

I'm trying to add LZO support to my configuration files using Cloudera Manager (CDH5b2).
If I add io.compression.codecs to the service-wide HDFS configuration and deploy the configuration file, /etc/hadoop/conf.cloudera.hdfs/core-site.xml now contains the new value.
However, /etc/hadoop/conf.cloudera.yarn/core-site.xml has a higher priority (see update-alternatives --display hadoop-conf), so the HDFS core-site.xml values are not used when I start an MR job.
Obviously, I can simply modify the YARN core-site.xml file manually, but I don't understand how to deploy the HDFS core-site.xml file properly using Cloudera Manager.
There is a MapReduce Client Environment Safety Valve, also known as the 'MapReduce Service Advanced Configuration Snippet (Safety Valve) for core-site.xml', found in the GUI under the MapReduce service's Configuration -> Service-Wide -> Advanced; it allows you to add any value that doesn't fit elsewhere (there is a corresponding safety valve for core-site.xml as well).
Having said that, details can be found on Cloudera's site at: http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_ig_cdh5_install.html
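For example (illustrative only; the LZO class names assume the hadoop-lzo package is installed, so adjust the list to the codecs you actually use), the safety valve snippet might look like:
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>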

Change number of reducers/mappers in DataStax Enterprise

How can I change the number of mappers/reducers in Hadoop? For some odd reason, mapred.tasttracker.map.tasks.maximum and mapred.tasttracker.reduce.tasks.maximum are not present in the mapred-site.xml. I did manage to find these settings in dse-mapred-default.xml, but once that XML is opened there's a note indicating that the settings shouldn't be edited in this file and that the properties should be overridden in mapred-site.xml instead.
I have tried adding the two settings to mapred-site.xml and restarting Hadoop, and I was expecting the numbers to also be updated in dse-mapred-default.xml, but with no luck.
Could someone please shed some light on this?
Thanks
Majd
It's not mapred.tasttracker.map.tasks.maximum, but mapred.tasktracker.map.tasks.maximum. I hope it is only a typo and you used the correct names in your config.
On startup, DSE creates the dse-mapred-default.xml and dse-core-default.xml files and fills them with defaults adapted to your local OS configuration and hardware. This is mostly for the Hadoop autotuning feature and to simplify configuring security-enabled Hadoop. Hadoop then loads config files in the following order:
Hadoop internal defaults (the defaults you can find in the Hadoop docs)
DSE defaults from dse-core-default.xml and dse-mapred-default.xml
User files: core-site.xml and mapred-site.xml.
Settings from files loaded later override settings loaded earlier. The final state of the configuration is not written back to the default files, so you should not expect settings from mapred-site.xml to be copied into dse-mapred-default.xml.
If you're unsure what the final configuration is and whether your settings have been applied, just run a job, look in the Hadoop log directory, and search for files matching the pattern job_xxxxxxxxxxxx_xxxx_conf.xml, where x is a digit. You can also view the final config in the JobTracker HTTP console.
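As a minimal sketch, the override in mapred-site.xml with the corrected property names would look something like this (the values are placeholders; pick them to match your hardware):
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
</property>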

Locating Cloudera Manager HDFS config files

I've installed a cluster via Cloudera Manager, and now I need to launch the cluster manually.
I've been using the following command:
$ sudo -u hdfs hadoop namenode / datanode / jobtracker
But then dfs.name.dir is set to /tmp. I can't seem to find where Cloudera Manager keeps the HDFS config files. The ones in /usr/lib/hadoop-02*/conf seem to be minimal, and they're missing dfs.name.dir, which is what I'm looking for in particular. I'm on an RHEL 6 system, by the way. Being lazy, I thought I could just copy Cloudera Manager's HDFS config files so I don't have to create them manually, then copy them over to the 6 nodes :)
Thanks
I was facing the same problem.
I was changing configuration parameters from the Cloudera Manager UI but was clueless about where my changes were being written on the local file system.
I ran a grep command and found that, in my case, the configuration was stored in the /var/run/cloudera-scm-agent/process/*-hdfs-NAMENODE directory.
So David is right: whenever we change configs from the UI and restart the service, new config settings are created in the /var/run/cloudera-scm-agent/process/ directory.
Using CentOS 6.5, the Cloudera Manager special files do not show up in a file search because their permissions hide them from all but the 'hdfs' user. In addition, there are multiple versions of hdfs-site.xml on the local drive, some of which contain only a subset of the real settings. The actual settings file is in the DATANODE folder, not the NAMENODE folder, as evidenced by the lack of dfs.datanode.data.dir values in the latter.
Cloudera Manager deploys the config files each time you start the cluster, each time into a different directory. The directories are named after the process ID or something like that.
The configuration is passed explicitly to each daemon as a parameter, so if you look at the command line of each Hadoop daemon you can see where its configuration sits (or just grep the disk for hdfs-site.xml). The names of the config files are the same as usual.
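A quick sketch combining the approaches from the answers above (the directory layout may differ between Cloudera Manager versions):
$ sudo find /var/run/cloudera-scm-agent/process -name hdfs-site.xml
$ ps -ef | grep [N]ameNode   # the daemon's command line points at the process config directory in use
$ sudo grep -l "name.dir" /var/run/cloudera-scm-agent/process/*-hdfs-NAMENODE/hdfs-site.xml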
I was in the same boat and found this answer:
To allow Hadoop client users to work with the HDFS, MapReduce, YARN and HBase services you created, Cloudera Manager generates client configuration files that contain the relevant configuration files with the settings from your services. These files are deployed automatically by Cloudera Manager based on the services you have installed, when you add a service, or when you add a Gateway role on a host.
You can download and distribute these client configuration files manually to the users of a service, if necessary.
The Client Configuration URLs command on the cluster Actions menu opens a pop-up that displays links to the client configuration zip files created for the services installed in your cluster. You can download these zip files by clicking the link.
See Deploying Client Configuration Files for more information on this topic.
On our system I got there via http://your_server:7180/cmf/services/status and clicked the Actions popup under the Add Cluster button. Hope that helps.
