Hadoop cluster for non-MapReduce algorithms in parallel

Apache Hadoop is inspired by the Google MapReduce paper. The flow of MapReduce can be viewed as two sets of SIMD (single instruction, multiple data) operations, one for Mappers and one for Reducers. Reducers consume the output of Mappers, grouped by a predefined "key". The essence of the MapReduce framework (and Hadoop) is to automatically partition the data, determine the number of partitions and parallel jobs, and manage distributed resources.
I have a general algorithm (not necessarily MapReducable) to be run in parallel. I am not implementing the algorithm itself the MapReduce way. Instead, the algorithm is just a single-machine Python/Java program. I want to run 64 copies of this program in parallel (assuming there is no concurrency issue in the program), i.e. I am more interested in the computing resources of the Hadoop cluster than in the MapReduce framework. Is there any way I can use the Hadoop cluster in this old-fashioned way?

Another way of thinking about MapReduce is that Map does the transformation and Reduce does some sort of aggregation.
Hadoop also allows for a Map-only job. This way it should be possible to run 64 copies of the Map program in parallel.
Hadoop has the concept of slots. By default there are 2 map and 2 reduce slots per node/machine. So, for 64 processes in parallel, 32 nodes are required. If the nodes have a higher-end configuration, the number of map/reduce slots per node can also be increased.
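The slot arithmetic above can be sketched in a few lines of Python. This is only an illustration of the calculation, not part of Hadoop; the function name nodes_needed and its parameters are made up for this example, and the default of 2 map slots per node is the figure quoted in the answer.

```python
import math


def nodes_needed(parallel_tasks: int, map_slots_per_node: int = 2) -> int:
    """Smallest number of nodes whose map slots cover all parallel tasks."""
    if parallel_tasks < 0 or map_slots_per_node <= 0:
        raise ValueError("tasks must be >= 0 and slots per node must be > 0")
    # Round up: a partly used node still has to exist.
    return math.ceil(parallel_tasks / map_slots_per_node)


print(nodes_needed(64))     # 2 map slots per node, as in the answer
print(nodes_needed(64, 8))  # hypothetical higher-end nodes with 8 map slots
```

With the default of 2 slots the result is the 32 nodes mentioned above; raising the slots per node lowers the node count proportionally.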

Related

Switching off data locality for Hadoop MapReduce jobs

I have a YARN cluster and dozens of nodes in the cluster. My program is a map-only job.
Its Avro input is very small in size, with several million rows, but processing a single row requires a lot of CPU power. What I observe is that many map tasks run on a single node, whereas other nodes do not participate. That makes some nodes very slow and affects overall HDFS performance. I assume this behaviour is due to Hadoop data locality.
I'm curious whether it's possible to switch it off, or whether there is another way to force YARN to distribute map tasks more uniformly across the cluster?
Thanks!
Assuming you can't easily redistribute the data more uniformly across the cluster (surely not all your data is on 1 node right?!) this seems to be the easy way to relax locality:
yarn.scheduler.capacity.node-locality-delay
This setting should have a default of 40. Try setting it to 1 to see whether that has the desired effect; perhaps even 0 could work.
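As a configuration sketch, and assuming the cluster uses the Capacity Scheduler with its settings kept in capacity-scheduler.xml, the property named above could be lowered like this. The value 1 is simply the one suggested in the answer, not a tested recommendation.

```xml
<!-- capacity-scheduler.xml (assumed location for Capacity Scheduler settings) -->
<property>
  <name>yarn.scheduler.capacity.node-locality-delay</name>
  <!-- Default is reported as 40 above; 1 relaxes locality almost completely. -->
  <value>1</value>
</property>
```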

How non mapreduce applications work in YARN?

By using YARN, we can run non-MapReduce applications.
But how does that work?
In HDFS, everything is stored in blocks. For each block, one mapper task is created, so that the whole dataset gets processed.
But how does a non-MapReduce application process data spread across different data nodes without using MapReduce?
Please explain.
Do not confuse the MapReduce paradigm with other applications such as Spark. Spark can run under YARN but does not use mappers or reducers.
Instead it uses executors, and these executors are aware of data locality in the same way MapReduce is.
The Spark driver will start executors on data nodes and will try to keep data locality in mind when doing so.
Also, do not confuse MapReduce's default behaviour with mandatory behaviour: you do not need to have 1 mapper per input split.
Also, HDFS and MapReduce are two different things. HDFS is just the storage layer, while MapReduce handles processing.

Can map task and reduce task be in the same node?

I am new to Hadoop. Since the data transfer between map nodes and reduce nodes may reduce the efficiency of MapReduce, why are map tasks and reduce tasks not put together on the same node?
Actually, you can run map and reduce in the same JVM if the data is 'small' enough. This is possible in Hadoop 2.0 (aka YARN) and is called an uber task.
From the great "Hadoop: The Definitive Guide" book:
If the job is small, the application master may choose to run the tasks in the same JVM as itself. This happens when it judges the overhead of allocating and running tasks in new containers outweighs the gain to be had in running them in parallel, compared to running them sequentially on one node. (This is different from MapReduce 1, where small jobs are never run on a single tasktracker.) Such a job is said to be uberized, or run as an uber task.
The amount of data to be processed is usually too large for one node, which is why map and reduce run on separate nodes. If the amount of data to be processed is small, then you can definitely run map and reduce on the same node.
Hadoop is usually used when the amount of data is very large. In that case, separate nodes are needed for map and reduce operations for the sake of high availability and concurrency.
Hope this clears up your doubt.
An uber job occurs when multiple mappers and reducers are combined and executed inside the Application Master.
So, assuming the job to be executed has MAX Mappers <= 9 and MAX Reducers <= 1, the Resource Manager (RM) creates an Application Master and the job is executed entirely within the Application Master, using its own JVM.
SET mapreduce.job.ubertask.enable=TRUE;
So the advantage of an uberized job is that it eliminates the round-trip overhead of the Application Master requesting containers for the job from the Resource Manager (RM) and the RM allocating those containers to the Application Master.
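The two thresholds quoted above can be written down as a small check. This is a simplified sketch in Python, not Hadoop's actual decision logic: the function name is_uber_candidate is hypothetical, and it only looks at the mapper and reducer counts mentioned in the answer, while a real cluster may apply further conditions (for example on input size).

```python
def is_uber_candidate(num_maps: int, num_reduces: int,
                      max_maps: int = 9, max_reduces: int = 1) -> bool:
    """Return True if the job is small enough, by task count alone,
    to be considered for running inside the Application Master's JVM."""
    return num_maps <= max_maps and num_reduces <= max_reduces


print(is_uber_candidate(4, 1))    # small job: within both limits
print(is_uber_candidate(10, 1))   # too many mappers
print(is_uber_candidate(4, 2))    # too many reducers
```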

How does Hadoop/MapReduce scale when input data is NOT stored?

The intended use for Hadoop appears to be for when the input data is distributed (HDFS) and already stored local to the nodes at the time of the mapping process.
Suppose we have data which does not need to be stored; the data can be generated at runtime. For example, the input to the mapping process is to be every possible IP address. Is Hadoop capable of efficiently distributing the Mapper work across nodes? Would you need to explicitly define how to split the input data (i.e. the IP address space) to different nodes, or does Hadoop handle that automatically?
Let me first clarify a comment you made. Hadoop is designed to support potentially massively parallel computation across a potentially large number of nodes, regardless of where the data comes from or goes. The Hadoop design favors scalability over performance when it has to. It is true that being clever about where the data starts out and how that data is distributed can make a significant difference in how well and how quickly a Hadoop job can run.
To your question and example: if you generate the input data, you can either generate it before the first job runs or generate it within the first mapper. If you generate it within the mapper, you can figure out which node the mapper is running on and then generate just the data that would be reduced in that partition (use a partitioner to direct data between mappers and reducers).
This is going to be a problem you'll have with any distributed platform. Storm, for example, lets you have some say in which bolt instance will process each tuple. The terminology might be different, but you'll be implementing roughly the same shuffle algorithm in Storm as you would in Hadoop.
You are probably trying to run a non-MapReduce task on a MapReduce cluster then (e.g. IP scanning?). There may be more appropriate tools for this, you know...
A thing few people realize is that MapReduce is about checkpointing. It was developed for huge clusters, where you can expect machines to fail during the computation. Having checkpointing and recovery built into the architecture reduces the consequences of failures and slow hosts.
And that is why everything goes from disk to disk in MapReduce. It's checkpointed before, and it's checkpointed after. And if it fails, only this part of the job is re-run.
You can easily outperform MapReduce by leaving out the checkpointing. If you have 10 nodes, you will win easily. If you have 100 nodes, you will usually win. If you have a major computation and 1000 nodes, chances are that one node fails and you wish you had been doing similar checkpointing...
Now your task doesn't sound like a MapReduce job, because the input data is virtual. It sounds much more as if you should be running some other distributed computing tool; and maybe just writing your initial result to HDFS for later processing via MapReduce.
But of course there are ways to hack around this. For example, you could use /16 subnets as input. Each mapper reads a /16 subnet and does its job on that. It's not that much fake input to generate if you realize that you don't need to generate all 2^32 IPs, unless you have that many nodes in your cluster...
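To make the /16 idea concrete, here is a small Python sketch that lists the /16 subnets of a network using the standard ipaddress module. The function name slash16_subnets is invented for this example; writing one subnet per line to a file would give each mapper a block of 65,536 addresses to work through instead of a single IP.

```python
import ipaddress


def slash16_subnets(network: str = "0.0.0.0/0"):
    """Yield every /16 subnet contained in the given IPv4 network."""
    net = ipaddress.ip_network(network)
    # subnets() returns a generator, so nothing is materialised up front.
    return net.subnets(new_prefix=16)


# Small demonstration on a /14, which contains four /16 subnets.
for subnet in slash16_subnets("10.0.0.0/14"):
    print(subnet)
```

For the whole IPv4 space this yields 2^16 lines of input rather than 2^32 individual addresses.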
The number of mappers depends on the number of splits generated by the InputFormat implementation.
There is NLineInputFormat, which you could configure to generate as many splits as there are lines in the input file. You could create a file where each line is an IP range. I have not used it personally and there are many reports that it does not work as expected.
If you really need it, you could create your own implementation of the InputFormat which generates the InputSplits for your virtual data and force as many mappers as you need.
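A real custom InputFormat would be written in Java against the Hadoop API, but the core of it for virtual data is just range arithmetic. The following Python sketch shows only that arithmetic, under the assumption that the virtual input is the integer range [0, total); make_splits is a hypothetical helper, not a Hadoop function.

```python
def make_splits(total: int, num_splits: int) -> list[tuple[int, int]]:
    """Divide [0, total) into num_splits contiguous (start, end) ranges,
    end exclusive, with sizes differing by at most one."""
    if total < 0 or num_splits <= 0:
        raise ValueError("total must be >= 0 and num_splits must be > 0")
    base, extra = divmod(total, num_splits)
    splits = []
    start = 0
    for i in range(num_splits):
        # The first `extra` splits take one additional element each.
        size = base + (1 if i < extra else 0)
        splits.append((start, start + size))
        start += size
    return splits


# The IPv4 address space as integers, divided among 64 mappers.
for start, end in make_splits(2**32, 64)[:3]:
    print(start, end)
```

Each (start, end) pair would correspond to one InputSplit, and therefore to one mapper.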

Does Amazon Elastic MapReduce run one or several mapper processes per instance?

My question is: should I take care of multiprocessing in my mapper myself (read tasks from stdin, distribute them over worker processes, combine the results in a master process and output to stdout), or will Hadoop take care of it automatically?
I haven't found the answer either in the Hadoop Streaming documentation or in the Amazon Elastic MapReduce FAQ.
Hadoop has a notion of "slots". A slot is a place where a mapper process will run. You configure the number of slots per tasktracker node. It is the theoretical maximum number of map processes that will run in parallel per node. It can be less if there are not enough separate portions of the input data (called FileSplits).
Elastic MapReduce has its own estimate of how many slots to allocate, depending on the instance capabilities.
At the same time, I can imagine a scenario where your processing is more efficient when one data stream is processed by many cores. If your mapper has built-in multicore usage, you can reduce the number of slots. But that is not usually the case in typical Hadoop tasks.
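For the case where the mapper itself uses several cores, a Hadoop Streaming mapper in Python might look roughly like the sketch below. It follows the pattern described in the question (read from stdin, fan out to worker processes, write to stdout). The names process_line and run_mapper, the squaring workload, and the worker count of 4 are all placeholders chosen for this example.

```python
import sys
from multiprocessing import Pool


def process_line(line: str) -> str:
    """Placeholder for CPU-heavy work: squares the integer on the line."""
    n = int(line.strip())
    return f"{n}\t{n * n}"


def run_mapper(stdin=sys.stdin, stdout=sys.stdout, workers: int = 4) -> None:
    """Stream lines from stdin through a pool of worker processes."""
    non_empty = (line for line in stdin if line.strip())
    with Pool(processes=workers) as pool:
        # imap keeps the output in input order while workers run in parallel.
        for result in pool.imap(process_line, non_empty, chunksize=64):
            stdout.write(result + "\n")
```

To use it as a script, call run_mapper() under an `if __name__ == "__main__":` guard. If each mapper uses several cores like this, the number of map slots per node should be lowered accordingly, as noted above.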
See the EMR doco [1] for the number of map/reduce tasks per instance type.
In addition to David's answers you can also have Hadoop run multiple threads per map slot by setting...
conf.setMapRunnerClass(MultithreadedMapRunner.class);
The default is 10 threads but it's tunable with
-D mapred.map.multithreadedrunner.threads=5
I often find this useful for custom high IO stuff.
[1] http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/HadoopMemoryDefault_AMI2.html
Once a Hadoop cluster has been set up, the minimum required to submit a job is:
Input format and location
Output format and location
Map and Reduce functions for processing the data
Location of the NameNode and the JobTracker
Hadoop will take care of distributing the job to different nodes, monitoring them, reading the data from the input and writing the data to the output. If the user had to do all those tasks, there would be no point in using Hadoop.
I suggest going through the Hadoop documentation and a couple of tutorials.
