Change number of reducers/mappers in DataStax Enterprise - hadoop

How can I change the number of mappers/reducers in Hadoop? For some odd reason, mapred.tasttracker.map.tasks.maximum and mapred.tasttracker.reduce.tasks.maximum are not present in the mapred-site.xml. I did manage to find these settings in dse-mapred-default.xml but once the xml is opened, there's a note which indicates that the settings shouldn't be edited in this file and that the properties should be overridden in mapred-site.xml.
I have tried adding the two settings to mapred-site.xml and restarting Hadoop, and I was expecting the numbers to also be updated in dse-mapred-default.xml, but with no luck.
Could someone please shed some light on this?
Thanks
Majd

It's not mapred.tasttracker.map.tasks.maximum, but mapred.tasktracker.map.tasks.maximum. I hope it is only a typo and that you used the correct names in your config.
On startup DSE creates the dse-mapred-default.xml and dse-core-default.xml files and fills them with defaults adapted to your local OS configuration and hardware. This is mostly for the Hadoop autotuning feature and to simplify the configuration of security-enabled Hadoop. Hadoop then loads config files in the following order:
Hadoop internal defaults (the defaults you can find in the Hadoop docs)
DSE defaults from dse-core-default.xml and dse-mapred-default.xml
User files: core-site.xml and mapred-site.xml.
Settings from files loaded later override settings loaded earlier. The final state of the configuration is not written back to the files with defaults, so you should not expect settings from mapred-site.xml to be copied into the dse-mapred-default.xml file.
If you're unsure what the final configuration is and whether your settings were applied properly, just run a job, then look in the Hadoop log directory for files matching the pattern job_xxxxxxxxxxxx_xxxx_conf.xml, where each x is a digit. You can also view the final config in the JobTracker HTTP console.
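For reference, a minimal mapred-site.xml override could look like the sketch below. Note the correct spelling, tasktracker, and treat the values as placeholders to be tuned for your hardware:

<configuration>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <!-- placeholder: max concurrent map slots per TaskTracker -->
    <value>4</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <!-- placeholder: max concurrent reduce slots per TaskTracker -->
    <value>2</value>
  </property>
</configuration>

Because mapred-site.xml is loaded last, these values override whatever DSE wrote into dse-mapred-default.xml; restart the task tracker nodes for the new slot counts to take effect.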

Related

Plain vanilla Hadoop installation vs Hadoop installation using Ambari

I recently downloaded the Hadoop distribution from Apache and got it up and running quite fast: download the Hadoop tarball, untar it at a location, and apply some configuration settings. The thing here is that I am able to see the various configuration files like yarn-site.xml, hdfs-site.xml, etc., and I know the Hadoop home location.
Next, I installed Hadoop (HDP) using Ambari.
Here comes the confusing part. It seems Ambari installs HDP under /usr/hdp; however, the directory structure of plain vanilla Hadoop vs. an Ambari install is totally different, and I am not able to locate the configuration files, e.g. yarn-site.xml.
So can anyone help me demystify this?
All the configuration changes must be done via the Ambari UI. There is no use for the configuration files since Ambari persists the configurations in the Ambari database.
If you still need them, they are under /etc/hadoop/conf/.
It's true that configuration changes must be made via Ambari UI and that those configurations are stored in a database.
Why is it necessary to change these configuration properties in the Ambari UI and not directly on disk?
Every time a service with a stale configuration is restarted, the ambari-agent is responsible for writing the latest configuration to disk. The files are written to /etc/<service-name>/conf. If you were to make changes directly to the configuration files on disk, they would get overwritten by that process.
However the configuration files found on disk DO still have a use...
The configuration files (on disk) are used by the various hadoop daemons when they're started/running.
Basically, the benefit of using the Ambari UI in a clustered Hadoop deployment is that it gives you a central management point.
For example, consider a 10-node Hadoop cluster setup.
Plain vanilla Hadoop:
If you change any configuration, you must change it on all 10 nodes.
Ambari UI:
Because the configuration is stored in the database, you just change it once in the management portal and the change is reflected on all nodes.

How to enable hadoop-metrics.properties

I need to report Hadoop metrics (such as jvm, cldb) to a text file. To test, I modified the hadoop-metrics file in the conf directory on one of the nodes, but the output files still didn't appear.
I tried restarting the YARN NodeManager and the node itself, but still no result.
Do I need to do some additional magic, like changing environment variables or other configs?
The problem was a wrong config. I had been using a sample config file which was supposed to report NameNode and ResourceManager metrics, but my node was running neither of those daemons.
After adding entries for the metrics my node actually produces, it works fine.

Amazon Elastic Mapreduce default configurations

Are the Hadoop configuration defaults (core-site.xml, yarn-site.xml, etc.) published by Amazon? I've seen select parameters published but not the complete default configurations.
You will find the default configs in the Amazon EMR documentation at the link below:
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/TaskConfiguration_H2.html
There is also a way to find the config for a job that has been triggered: visit the S3 location where the logs are saved. There you will find j-yourjoblogsfilelocation/jobs/job_1403597560615_0001_conf.xml. The _conf.xml file has all the config values that were passed to the job. They are the default configs unless we override them, so they will differ depending on the instance types we choose; e.g., memory configs may be different.

Deploying hdfs core-site.xml with cloudera manager

I'm trying to add LZO support to my configuration files using Cloudera Manager (CDH5b2).
If I add io.compression.codecs to the service-wide HDFS configuration and deploy the configuration file, /etc/hadoop/conf.cloudera.hdfs/core-site.xml now contains the new value.
However, /etc/hadoop/conf.cloudera.yarn/core-site.xml has a higher priority (see update-alternatives --display hadoop-conf), so the HDFS core-site.xml values are not used when I start an MR job.
Obviously, I can simply modify the YARN core-site.xml file manually, but I don't understand how to deploy the HDFS core-site.xml file properly using Cloudera Manager.
There is a MapReduce Client Environment Safety Valve, also known as the 'MapReduce Service Advanced Configuration Snippet (Safety Valve) for core-site.xml', found in the GUI under the MapReduce service's Configuration -> Service-Wide -> Advanced, which allows you to add any value that doesn't fit elsewhere. (There is one for core-site.xml as well.)
Having said that, details can be found on Cloudera's site at: http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_ig_cdh5_install.html
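As an illustration, the snippet pasted into that safety valve might look roughly like the following. This is only a sketch: the exact codec class list depends on which codec packages (for LZO, typically the hadoop-lzo / GPL extras packages) are installed on your cluster.

<property>
  <name>io.compression.codecs</name>
  <!-- assumption: the standard codecs plus the hadoop-lzo classes are on the classpath -->
  <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>

After saving the snippet, redeploy the client configuration and restart the affected services so the value ends up in the core-site.xml that the MR jobs actually read.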

Is the Hadoop read-only default configuration file core-default.xml read at startup?

Is the file described in the documentation as a read-only configuration file, src/core/core-default.xml, used by Hadoop at startup? Some of the docs say to copy this file to conf/core-site.xml and make changes, while others say to include only the properties being changed. If the latter is the case, it seems the core-default.xml file is necessary.
core-default.xml is loaded first and then core-site.xml is overlaid on top of it. core-site.xml only needs to contain the values that should be changed from the defaults.
See the resources section at the top of: http://hadoop.apache.org/docs/current/api/org/apache/hadoop/conf/Configuration.html
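For example (an illustrative sketch only, with a property and value chosen arbitrarily), a core-site.xml that changes a single default would contain just that one property; every other setting continues to come from the read-only core-default.xml:

<configuration>
  <property>
    <!-- overrides the value shipped in core-default.xml; all other defaults still apply -->
    <name>io.file.buffer.size</name>
    <value>131072</value>
  </property>
</configuration>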
Another piece of information which may be interesting here: if you use the YarnConfiguration constructor, yarn-default.xml and yarn-site.xml are read in addition.
