Amazon EMR: Set unique number of mappers and reducers per EMR instance - hadoop

I'm running an Amazon EMR cluster that has M core instances and N task instances.
My jobs run multiple times per day and are time-sensitive, so I am keeping the M core instances up and running 24/7 so that I don't incur data transfer overhead to/from S3.
The N task nodes are being dynamically launched and terminated as needed.
The M core nodes are c1.mediums and the N task nodes are m2.xlarge.
Is there a way to configure mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum per instance?
For the core nodes I want:
mapred.tasktracker.map.tasks.maximum=2
mapred.tasktracker.reduce.tasks.maximum=1
For the task nodes I want at least:
mapred.tasktracker.map.tasks.maximum=2
mapred.tasktracker.reduce.tasks.maximum=2
Note that task trackers run on the core nodes as well, so I think this configuration will need to be on a per-instance basis depending on the instance size.
Is this possible? And if so how can I set up this type of configuration?

There is a great blog post here which gives you the answer:
http://blog.earlh.com/index.php/2013/05/modifying-the-number-of-mappers-or-reducers-on-a-running-emr-cluster/
Note though that you might have to play around a bit with SSHing into your task nodes; it will not work just like that.
I would get my .pem file into a local directory,
chmod 400 that .pem file,
and then do "scp -l hadoop -i .pem and then the rest of it"
as mentioned in the blog.
Mind you, I have not tried this yet, but I believe it will work.
Also, the .versions... stuff may not be needed; you will probably just need conf.
Thanks
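For reference, the approach in that blog boils down to roughly the following (a hedged sketch, not the blog's exact steps: the key name, host name, conf path, and restart step are assumptions that vary by AMI version):
chmod 400 mykey.pem                              # lock the key down before ssh/scp will accept it
ssh -i mykey.pem hadoop@<task-node-private-dns>  # log in to the node as the hadoop user
# On the node, edit the per-node slot counts, e.g. for the m2.xlarge task nodes:
#   mapred.tasktracker.map.tasks.maximum    -> 2 (or more)
#   mapred.tasktracker.reduce.tasks.maximum -> 2 (or more)
# and 2/1 respectively on the c1.medium core nodes.
vi /home/hadoop/conf/mapred-site.xml
# Then restart the TaskTracker (or kill it and let the EMR service monitor respawn it)
# so the new slot counts take effect; the exact restart command depends on the AMI.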

Related

Modifying number of tasks executed on mesos slave

In a Mesos ecosystem (master + scheduler + slave), with the master executing tasks on the slaves, is there a configuration that allows modifying the number of tasks executed on each slave?
Say, for example, the Mesos master currently runs 4 tasks on one of the slaves (each task using 1 CPU). We have 4 slaves (4 cores each), and except for this one slave the other three are not being used.
So, instead of this execution scenario, I'd prefer the master to run 1 task on each of the 4 slaves.
I found this Stack Overflow question and these configurations relevant to this case, but I'm still not clear on how to use the --isolation=VALUE or --resources=VALUE configuration here.
Thanks for the help!
I was able to reduce the number of tasks executed on a single host at a time by adding the following options to the startup script for the Mesos agent:
--resources="cpus:<<value>>" and --cgroups_enable_cfs=true
This, however, does not take care of the concurrent-scheduling requirement of having each agent execute a task at the same time. For that, you need to look into the scheduler code, as also suggested above.
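For example, a hedged sketch of an agent startup line (the master address, work dir, and memory figure are placeholders; older releases call the binary mesos-slave):
# Advertise only 1 CPU so the master can place at most one single-CPU task on this agent,
# and enforce the limit with CFS rather than just CPU shares.
mesos-agent \
  --master=zk://zk-host:2181/mesos \
  --work_dir=/var/lib/mesos \
  --resources="cpus:1;mem:4096" \
  --cgroups_enable_cfs=true \
  --isolation=cgroups/cpu,cgroups/mem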

"Too many fetch-failures" while using Hive

I'm running a Hive query against a Hadoop cluster of 3 nodes, and I am getting an error that says "Too many fetch-failures". My Hive query is:
insert overwrite table tablename1 partition(namep)
select id,name,substring(name,5,2) as namep from tablename2;
That's the query I'm trying to run. All I want to do is transfer data from tablename2 to tablename1. Any help is appreciated.
This can be caused by various Hadoop configuration issues. Here are a couple to look for in particular:
DNS issue: examine your /etc/hosts
Not enough http threads on the mapper side for the reducer
Some suggested fixes (from Cloudera troubleshooting); a sketch of passing the job-level settings to the Hive query follows the list:
set mapred.reduce.slowstart.completed.maps = 0.80
tasktracker.http.threads = 80
mapred.reduce.parallel.copies = sqrt(node count), but in any case >= 10
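For example, the job-level settings can be passed straight to the Hive session that runs the query (a hedged sketch; tasktracker.http.threads is a daemon-side setting, so it has to go into mapred-site.xml on each TaskTracker and the daemons restarted):
# Delay the reduce phase and bump parallel copies for this query only.
hive \
  --hiveconf mapred.reduce.slowstart.completed.maps=0.80 \
  --hiveconf mapred.reduce.parallel.copies=10 \
  -e "insert overwrite table tablename1 partition(namep)
      select id, name, substring(name,5,2) as namep from tablename2;"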
Here is a link to a troubleshooting deck with more details:
http://www.slideshare.net/cloudera/hadoop-troubleshooting-101-kate-ting-cloudera
Update for 2020: things have changed a lot and AWS mostly rules the roost. Here is its troubleshooting guidance for this error:
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-troubleshoot-error-resource-1.html
Too many fetch-failures
The presence of "Too many fetch-failures" or "Error reading task output" error messages in step or task attempt logs indicates the running task is dependent on the output of another task. This often occurs when a reduce task is queued to execute and requires the output of one or more map tasks and the output is not yet available.
There are several reasons the output may not be available:
The prerequisite task is still processing. This is often a map task.
The data may be unavailable due to poor network connectivity if the data is located on a different instance.
If HDFS is used to retrieve the output, there may be an issue with HDFS.
The most common cause of this error is that the previous task is still processing. This is especially likely if the errors are occurring when the reduce tasks are first trying to run. You can check whether this is the case by reviewing the syslog log for the cluster step that is returning the error. If the syslog shows both map and reduce tasks making progress, this indicates that the reduce phase has started while there are map tasks that have not yet completed.
One thing to look for in the logs is a map progress percentage that goes to 100% and then drops back to a lower value. When the map percentage is at 100%, this does not mean that all map tasks are completed. It simply means that Hadoop is executing all the map tasks. If this value drops back below 100%, it means that a map task has failed and, depending on the configuration, Hadoop may try to reschedule the task. If the map percentage stays at 100% in the logs, look at the CloudWatch metrics, specifically RunningMapTasks, to check whether the map task is still processing. You can also find this information using the Hadoop web interface on the master node.
If you are seeing this issue, there are several things you can try:
Instruct the reduce phase to wait longer before starting. You can do this by altering the Hadoop configuration setting mapred.reduce.slowstart.completed.maps to a longer time. For more information, see Create Bootstrap Actions to Install Additional Software (a launch-time sketch follows this list).
Match the reducer count to the total reducer capability of the cluster. You do this by adjusting the Hadoop configuration setting mapred.reduce.tasks for the job.
Use a combiner class code to minimize the amount of outputs that need to be fetched.
Check that there are no issues with the Amazon EC2 service that are affecting the network performance of the cluster. You can do this using the Service Health Dashboard.
Review the CPU and memory resources of the instances in your cluster to make sure that your data processing is not overwhelming the resources of your nodes. For more information, see Configure Cluster Hardware and Networking.
Check the version of the Amazon Machine Image (AMI) used in your Amazon EMR cluster. If the version is 2.3.0 through 2.4.4 inclusive, update to a later version. AMI versions in the specified range use a version of Jetty that may fail to deliver output from the map phase. The fetch error occurs when the reducers cannot obtain output from the map phase.
Jetty is an open-source HTTP server that is used for machine-to-machine communication within a Hadoop cluster.
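As a launch-time illustration of the slow-start suggestion above, on the older AMI-based EMR releases the stock configure-hadoop bootstrap action could set it cluster-wide (a hedged sketch with the AWS CLI; the name, AMI version, instance type, and count are placeholders, and release 4.x and later use --configurations instead):
# -m writes the key=value pair into mapred-site.xml on every node before the daemons start.
aws emr create-cluster --name "fetch-failure-tuning" --ami-version 3.11.0 \
  --use-default-roles --instance-type m1.large --instance-count 5 \
  --bootstrap-actions 'Path=s3://elasticmapreduce/bootstrap-actions/configure-hadoop,Args=["-m","mapred.reduce.slowstart.completed.maps=0.90"]'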

How to find the right proportion between Hadoop instance types

I am trying to find out how many MASTER, CORE, and TASK instances are optimal for my jobs. I couldn't find any tutorial that explains how to figure it out.
How do I know if I need more than 1 core instance? What are the "symptoms" I would see in the metrics in EMR's console that would hint I need more than one core? So far, when I tried the same job with 1*core + 7*task instances, it ran pretty much like it did on 8*core, but that doesn't make much sense to me. Or is it possible that my job is so CPU-bound that the I/O is negligible? (I have a map-only job that parses Apache log files into a CSV file.)
Is there such a thing as more than 1 master instance? If yes, when is it needed? I wonder, because my master node is pretty much just waiting for the other nodes to do the job (0% CPU) 95% of the time.
Can the master and the core node be identical? I can have a master-only cluster, where the one and only node does everything. It seems logical to be able to have a cluster with 1 node that is both the master and the core, with the rest as task nodes, but it appears to be impossible to set it up that way with EMR. Why is that?
The master instance acts as a manager and coordinates everything that goes on in the whole cluster. As such, it has to exist in every job flow you run, but just one instance is all you need. Unless you are deploying a single-node cluster (in which case the master instance is the only node running), it does not do any heavy lifting as far as actual MapReducing is concerned, so the instance does not have to be a powerful machine.
The number of core instances that you need really depends on the job and how fast you want to process it, so there is no single correct answer. A good thing is that you can resize the core/task instance group, so if you think your job is running slowly, you can add more instances to a running cluster.
One important difference between core and task instance groups is that the core instances store actual data on HDFS whereas task instances do not. As a result, you can only increase the core instance group (because removing running instances would lose the data on those instances). On the other hand, you can both increase and decrease the task instance group by adding or removing task instances.
So these two types of instances can be used to adjust the processing power of your job. Typically, you use on-demand instances for core instances because they must be running all the time and cannot be lost, and you use spot instances for task instances because losing task instances does not kill the entire job (e.g., the tasks not finished by the task instances will be rerun on the core instances). This is one way to run a large cluster cost-effectively by using spot instances.
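For example, growing the task group of a running cluster with the AWS CLI looks roughly like this (a hedged sketch; the cluster and instance-group IDs are placeholders):
aws emr list-instance-groups --cluster-id j-XXXXXXXXXXXXX   # note the Id of the TASK group
aws emr modify-instance-groups \
  --instance-groups InstanceGroupId=ig-XXXXXXXXXXXXX,InstanceCount=10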
The general description of each instance type is available here:
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/InstanceGroups.html
Also, this video may be useful for using EMR effectively:
https://www.youtube.com/watch?v=a5D_bs7E3uc

Re-use files in Hadoop Distributed cache

I am wondering if someone can explain how the distributed cache works in Hadoop. I am running a job many times, and after each run I notice that the local distributed cache folder on each node is growing in size.
Is there a way for multiple jobs to re-use the same file in the distributed cache? Or is the distributed cache only valid for the lifetime of any individual job?
The reason I am confused is that the Hadoop documentation mentions that "DistributedCache tracks modification timestamps of the cache files", so this leads me to believe that if the time stamp hasn't changed, then it should not need to re-cache or re-copy the files to the nodes.
I am adding files successfully to the distributed cache using:
DistributedCache.addFileToClassPath(hdfsPath, conf);
DistributedCache uses reference counting to manage the caches. org.apache.hadoop.filecache.TrackerDistributedCacheManager.CleanupThread is in charge of cleaning up the CacheDirs whose reference count is 0. It checks every minute (the default period is 1 minute; you can set it with "mapreduce.tasktracker.distributedcache.checkperiod").
When a job finishes or fails, the JobTracker sends an org.apache.hadoop.mapred.KillJobAction to the TaskTrackers. When a TaskTracker receives a KillJobAction, it puts the action into tasksToCleanup. In the TaskTracker there is a background thread called taskCleanupThread which takes actions from tasksToCleanup and does the cleanup work. For a KillJobAction, it invokes purgeJob to clean up the job. In this method, it decreases the reference count held by this job (rjob.distCacheMgr.release();).
The above analysis is based on hadoop-core-2.0.0-mr1-cdh4.2.1-sources.jar. I also checked hadoop-core-0.20.2-cdh3u1-sources.jar and found a small difference between these two versions. For example, there is no org.apache.hadoop.filecache.TrackerDistributedCacheManager.CleanupThread in 0.20.2-cdh3u1. Instead, when initializing a job, TrackerDistributedCacheManager checks whether there is enough space to put the new cache files for the job; if not, it deletes the caches whose reference count is 0.
If you are using cdh4.2.1, you can increase "mapreduce.tasktracker.distributedcache.checkperiod" to delay the cleanup work. That increases the probability that multiple jobs can use the same distributed cache.
If you are using cdh3u1, you can increase the cache size limit ("local.cache.size", default is 10 GB) and the maximum number of cache directories ("mapreduce.tasktracker.cache.local.numberdirectories", default is 10000). This also applies to cdh4.2.1.
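For example, to let cached files survive across jobs for longer, you could raise these in mapred-site.xml on each TaskTracker and restart it (illustrative values, a hedged sketch; the check period is assumed to be in milliseconds):
local.cache.size = 53687091200                                  (about 50 GB instead of the 10 GB default)
mapreduce.tasktracker.distributedcache.checkperiod = 600000     (check every 10 minutes instead of every minute)
mapreduce.tasktracker.cache.local.numberdirectories = 50000     (up from the 10000 default)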
If you look closely at what this book says, it is that there is a limit on what can be stored in the distributed cache. By default it's 10 GB (configurable). There can be multiple different jobs running in the cluster concurrently. Furthermore, Hadoop only really guarantees that the files stay available in the cache for a single job, since availability is maintained through the reference count kept by the TaskTracker for the different tasks accessing the files in the cache. In your case, for subsequent jobs, the files may not be there, as they have already been marked for deletion.
Please correct me if you disagree anywhere. I'll be glad to discuss this further.
According to this: http://www.datasalt.com/2011/05/handling-dependencies-and-configuration-in-java-hadoop-projects-efficiently/
You should be able to do this via the DistributedCache API instead of "-libjars".

How can I add new nodes to a live hbase/hadoop cluster?

I run some batch jobs with data inputs that are constantly changing, and I'm having problems provisioning capacity. I am using Whirr to do the initial setup, but once I start, for example, 5 machines, I don't know how to add new machines to the cluster while it's running. I don't know in advance how complex or how large the data will be, so I was wondering if there is a way to add new machines to a cluster and have it take effect right away (or with some delay, but without having to bring down the cluster and bring it back up with the new nodes).
There is an exact explanation of how to add a node here:
http://wiki.apache.org/hadoop/FAQ#I_have_a_new_node_I_want_to_add_to_a_running_Hadoop_cluster.3B_how_do_I_start_services_on_just_one_node.3F
At the same time, I am not sure that already-running jobs will take advantage of these nodes, since planning where to run each task happens at job start time (as far as I understand).
I also think that it is more practical to run TaskTrackers only on these transient nodes.
Check the files referred to by the parameters below:
dfs.hosts => dfs.include
dfs.hosts.exclude
mapreduce.jobtracker.hosts.filename => mapred.include
mapreduce.jobtracker.hosts.exclude.filename
You can add the list of hosts to the files dfs.include and mapred.include and then run
hadoop mradmin -refreshNodes ;
hadoop dfsadmin -refreshNodes ;
That's all.
BTW, the 'mradmin -refreshNodes' facility was added in 0.21.
Nikhil
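Putting the two answers together, the flow on a classic MRv1 cluster looks roughly like this (a hedged sketch; the include-file paths and hostname are placeholders, and the files must be the ones your dfs.hosts / mapreduce.jobtracker.hosts.filename properties point at):
# 1. On the master: whitelist the new host in the include files.
echo "new-node.example.com" >> /etc/hadoop/conf/dfs.include
echo "new-node.example.com" >> /etc/hadoop/conf/mapred.include
# 2. On the new node: start the DataNode and, if it should also run tasks, the TaskTracker.
hadoop-daemon.sh start datanode
hadoop-daemon.sh start tasktracker
# 3. Back on the master: tell the NameNode and JobTracker to re-read the host lists.
hadoop dfsadmin -refreshNodes
hadoop mradmin -refreshNodes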
