Where to find the M/R configuration file and update it - Hadoop

Our Hadoop cluster's JobTracker process gradually eats up memory, to the point that we have to restart the cluster every week. I searched around for a possible solution, and one post suggested decreasing 'mapred.jobtracker.completeuserjobs.maximum' to 5.

So I checked mapred-site.xml under the /hadoop-install/conf directory on the name node and found two entries for that parameter: one sets it to 30, the other to 5. When I go to any of the data nodes and check mapred-site.xml, I don't find the parameter at all. However, when I looked at a running job on the M/R administration page and opened its job file, it showed the parameter set to 100.

I'm really confused about where this parameter is actually set. And if I update it, do I need to restart the cluster? We are running Apache Hadoop 1.2.1 on Google Cloud.

Hadoop does not automatically copy the configuration files from your driver machine to all of the cluster machines. You need to do that yourself via scp and/or rsync, or preferably with an automated deployment tool like Chef, Ansible, Puppet, etc.
As far as the individual job parameters go: you can actually set them on a per-job basis by using -D:
hadoop jar <path to jar>/myHadoopJobJar.jar -Dmapred.jobtracker.completeuserjobs.maximum=5
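If you want the new value to apply cluster-wide rather than per job, a minimal sketch of the change, assuming the file lives at /hadoop-install/conf/mapred-site.xml and that datanode1 is one of your worker hosts (both are placeholders):
<!-- mapred-site.xml: keep only one entry for the property -->
<property>
  <name>mapred.jobtracker.completeuserjobs.maximum</name>
  <value>5</value>
</property>
# push the same file to every node, for example:
scp /hadoop-install/conf/mapred-site.xml datanode1:/hadoop-install/conf/
The JobTracker only reads this value at startup, so the MapReduce daemons need a restart afterwards (see the last question below); HDFS can keep running.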

Related

How users should work with an Ambari cluster

My question is pretty trivial, but I didn't find anyone actually asking it.
We have an Ambari cluster with Spark, Storm, HBase and HDFS (among other things).
I don't understand how a user who wants to use that cluster should actually use it.
For example, a user wants to copy a file to HDFS, run a spark-shell, or create a new table in the HBase shell.
Should he get a local account on the server that runs the corresponding service? Shouldn't he rather use a third-party machine (his own laptop, for example)?
If so, how should one use hadoop fs? There is no way to specify the server IP the way spark-shell has.
What is the normal/right/expected way to run all these tasks from a user perspective?
Thanks.
The expected way to run the described tasks from the command line is as follows.
First, gain access to the command line of a server that has the required clients installed for the services you want to use, e.g. HDFS, Spark, HBase et cetera.
During the process of provisioning a cluster via Ambari, it is possible to define one or more servers where the clients will be installed.
Here you can see an example of an Ambari provisioning process step. I decided to install the clients on all servers.
Afterwards, one way to figure out which servers have the required clients installed is to check your hosts views in Ambari. Here you can find an example of an Ambari hosts view: check the green rectangle to see the installed clients.
Once you have installed the clients on one or more servers, these servers will be able to utilize the services of your cluster via the command line.
Just to be clear, a client can use a service regardless of which server that service is actually running on.
Second, make sure that you are compliant with the security mechanisms of your cluster. In relation to HDFS, this could influence which users you are allowed to use and which directories you can access by using them. If you do not use security mechanisms like e.g. Kerberos, Ranger and so on, you should be able to directly run your stated tasks from the command line.
Third, execute your tasks via command line.
Here is a short example of how to access HDFS without considering security mechanisms:
ssh user@hostxyz # Connect to the server that has the required HDFS client installed
hdfs dfs -ls /tmp # Command to list the contents of the HDFS tmp directory
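If you do want to point the client at a particular NameNode explicitly (similar to passing a master URL to spark-shell), hadoop fs accepts the generic -fs option; a small sketch, assuming the NameNode runs at namenode.example.com on port 8020 (both placeholders):
hadoop fs -fs hdfs://namenode.example.com:8020 -ls /tmp # same listing, with the NameNode given on the command line
Normally, though, the client reads this from fs.defaultFS in core-site.xml, which Ambari deploys together with the HDFS client, so you rarely need to specify it.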
Take a look at Ambari Views, especially the Files view, which allows browsing HDFS.

How to submit a job to specific nodes in Hadoop?

I have a Hadoop cluster with 1 master and 5 slaves. Is there any way of submitting jobs to a specific set of slaves? Basically, what I am trying to do is benchmark my application with many possibilities. So after testing with 5 slaves, I would like to run my application with 4 slaves, then 3 slaves, and so on.
Currently the only way I know of is decommissioning a slave and removing it from the Hadoop cluster. But that seems to be a tedious task. I was wondering if there is an easier approach, so as to avoid removing a node from the cluster.
Thanks.
In hadoop/conf there is a file called 'slaves'. Here you can simply add or remove nodes, and then restart your DFS and MapReduce daemons.
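As a sketch, for a 4-slave benchmark run hadoop/conf/slaves would list only the workers you want to use (hostnames are placeholders):
slave1
slave2
slave3
slave4
Then restart the daemons from the master, e.g. with stop-all.sh followed by start-all.sh (or the separate dfs/mapred stop and start scripts).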
There is a setting you can put in mapred-site.xml that points to a file with a list of excluded hosts. Though also a bit cumbersome, changing a single configuration value might be preferable to physically decommissioning and recommissioning multiple nodes. You could prepare multiple host-exclusion files in advance, change the setting and restart the MapReduce service. Restarting the MapReduce service is pretty quick.
In 0.23 this setting is named mapreduce.jobtracker.hosts.exclude.filename. This is a feature introduced in 0.21, though I believe the setting was named mapred.hosts.exclude then. Check what this setting is called for the version of Hadoop you are using.
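For the 1.x line mentioned in the question, a rough sketch of what this looks like, assuming the exclude file is kept at /hadoop-install/conf/mapred.exclude (the path and property name are what I would expect for 1.x; verify against your version):
<!-- mapred-site.xml -->
<property>
  <name>mapred.hosts.exclude</name>
  <value>/hadoop-install/conf/mapred.exclude</value>
</property>
The exclude file itself just lists one hostname per line. After changing it, either restart the MapReduce service or ask the JobTracker to re-read the list with:
hadoop mradmin -refreshNodes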
For those who encounter this problem, the comments from Alex and the linked Stack Overflow question will help in successfully decommissioning a node from a Hadoop cluster.
EDIT: Just editing the files hdfs-site.xml and mapred-site.xml and executing hadoop dfsadmin -refreshNodes might leave your DataNode in the 'Decommission in progress' state for a long time. So it is also necessary to change dfs.replication to an appropriate value.
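A sketch of the HDFS side, assuming the exclude file lives at /hadoop-install/conf/dfs.exclude and keeping in mind that dfs.replication cannot exceed the number of remaining DataNodes if decommissioning is to complete (paths and values are placeholders):
<!-- hdfs-site.xml -->
<property>
  <name>dfs.hosts.exclude</name>
  <value>/hadoop-install/conf/dfs.exclude</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
hadoop dfsadmin -refreshNodes # tells the NameNode to re-read the exclude file and start decommissioning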

How do I configure and reboot an HDInsight cluster running on Azure?

Specifically, I want to change the maximum number of mappers and the maximum number of reducers for each node in an HDInsight cluster running on Microsoft Azure.
Using remote desktop, I logged in to the head node. I edited the mapred-site.xml file on the head node and changed the mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum values. I tried rebooting the head node, but was not able to. I used the start-onebox.cmd and stop-onebox.cmd scripts to try to start/stop HDInsight.
I then ran a streaming MapReduce job, passing the desired number of reducers to hadoop-streaming.jar, but the number of reducers was still limited by the previous value of mapred.tasktracker.reduce.tasks.maximum. Most of my reducers were left pending execution.
Do I need to change the mapred-site.xml file on every node? Is there an easy way to change this, or do I need to remote desktop into every node? How do I reboot or restart the cluster so that my new values are used?
Thanks
I know it has been a while since the question was posted, but I would like to post this for other users who may find it useful.
There are two ways you can change Hadoop configuration files (such as mapred-site.xml, hive-site.xml, etc.) on HDInsight:
Option #1:
This is the easiest - you can supply the Hadoop configuration values per job, as shown in this blog.
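For the streaming scenario in the question, per-job values are passed with the generic -D option before the streaming-specific arguments; a sketch, with the jar path, input/output paths and executables as placeholders (note that per-TaskTracker slot limits such as mapred.tasktracker.reduce.tasks.maximum are read when the TaskTracker starts, so they are not good candidates for per-job overrides):
hadoop jar <path to hadoop-streaming.jar> -D mapred.reduce.tasks=16 -input /example/input -output /example/output -mapper mymapper.exe -reducer myreducer.exe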
Option #2:
You can customize an HDInsight cluster with Hadoop configuration values when provisioning the cluster, as shown in this blog.
Manually modifying a config file is not supported and the change will be lost when the Azure VM gets re-imaged.

Do we need to restart hadoop after making changes in the xml configuration files in Hadoop conf directory?

Suppose I change the port numbers for the TaskTrackers or change the maximum number of map tasks through the conf files in Hadoop, do I need to stop and restart the servers/daemons?
It depends on which options you change, but for the two examples you give I would say yes: restart the MapReduce services (you don't need to restart the DFS services for these options).
I don't think there is an exhaustive list anywhere of what you need to restart when you amend a specific option.
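With the classic Hadoop 1.x control scripts, restarting just the MapReduce daemons looks roughly like this (a sketch, assuming the scripts are run from the master node and are on your PATH):
stop-mapred.sh # stops the JobTracker and all TaskTrackers
start-mapred.sh # starts them again, picking up the updated mapred-site.xml
HDFS (NameNode and DataNodes) keeps running throughout.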

Hadoop on Amazon Cloud

I'm trying to get set up on the Amazon cloud to run some Hadoop MapReduce jobs, but I'm struggling to successfully create a cluster. I have downloaded the EC2 files and have my certificates and keypair file, but I believe it's the AMIs that are causing me trouble. If I'm trying to run a cluster with a master node and n slave nodes, I start n+1 instances using standard compatible AMIs and then run "hadoop-ec2 launch-cluster name n" in the terminal. The master node launches successfully, but I get an error when the slave nodes start to launch, saying "missing parameter -h (AMI missing)", and I'm not entirely sure how to proceed.
Also, some of my jobs will require an alteration of Hadoop's parameter settings (specifically the mapred-site.xml config file). Is it possible to alter this file, and if so, how do I gain access to it? Is Hadoop already installed on the Amazon machines, with this file accessible and alterable?
Thanks
Have you tried Amazon Elastic MapReduce? This is a simple API that brings up Hadoop clusters of a specified size on demand.
That's easier than creating your own cluster manually.
But once the job flow is finished, by default it shuts the cluster down, leaving you with the outputs on S3. If all you need is to do some crunching, this may be the way to go.
In case you need the HDFS contents stored permanently (e.g. if you are running HBase on top of Hadoop), you may actually need your own cluster on EC2. In this case you may find Cloudera's distribution of Hadoop for Amazon EC2 useful.
Altering the Hadoop configuration on the nodes it starts is possible using Elastic MapReduce Bootstrap Actions:
Q: How do I configure Hadoop settings for my job flow?
The Elastic MapReduce default Hadoop configuration is appropriate for most workloads. However, based on your job flow’s specific memory and processing requirements, it may be appropriate to tune these settings. For example, if your job flow tasks are memory-intensive, you may choose to use fewer tasks per core and reduce your job tracker heap size. For this situation, a pre-defined Bootstrap Action is available to configure your job flow on startup. See the Configure Memory Intensive Bootstrap Action in the Developer’s Guide for configuration details and usage instructions. An additional predefined bootstrap action is available that allows you to customize your cluster settings to any value of your choice. See the Configure Hadoop Bootstrap Action in the Developer’s Guide for usage instructions.
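To make this concrete, here is a rough sketch of how the configure-hadoop bootstrap action was typically passed with the old elastic-mapreduce CLI; the exact flag names vary by CLI and bootstrap-action version, so treat the arguments as an assumption to check against the Developer's Guide:
elastic-mapreduce --create --alive --num-instances 6 \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
  --args "-m,mapred.tasktracker.map.tasks.maximum=4" # "-m,key=value" sets a mapred-site.xml value (assumed flag)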
About the way you are starting the cluster, please clarify:
If I'm trying to run a cluster with a master node and n slave nodes, I start n+1 instances using standard compatible AMIs and then run "hadoop-ec2 launch-cluster name n" in the terminal. The master node launches successfully, but I get an error when the slave nodes start to launch, saying "missing parameter -h (AMI missing)", and I'm not entirely sure how to proceed.
How exactly are you trying to start it? Which AMIs exactly are you using?
