Do Map and Reduce run in separate JVMs? - hadoop

Hi, I have a MapReduce job, say AverageScoreCalculator, which has a mapper and a reducer.
The question is: if I statically initialize a few fields in AverageScoreCalculator, will they be available to both the mapper and the reducer?

By default, each map and reduce task runs in a separate JVM, and multiple JVMs can be running on a node at any given time.
Set the following properties (a driver-side sketch follows the notes below):
mapred.job.reuse.jvm.num.tasks = -1
mapreduce.tasktracker.map.tasks.maximum = 1
mapreduce.tasktracker.reduce.tasks.maximum = 1
mapreduce.job.reduce.slowstart.completedmaps = 1
With these settings there will be only a single mapper/reducer running on a given node (with JVM reuse), and the reducers won't start until all the mappers have completed processing.
A couple of things to note:
The above approach works with the MapReduce 1.x release and is not an efficient approach.
JVM reuse is not supported in the MapReduce 2.x (YARN) release.
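For completeness, a minimal driver-side sketch of how the per-job settings above could be applied, assuming the new-API org.apache.hadoop.mapreduce classes (the class and job names are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class AverageScoreDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // MR1 only: reuse a task JVM for an unlimited number of tasks
        conf.set("mapred.job.reuse.jvm.num.tasks", "-1");
        // Do not start any reducer until 100% of the maps have completed
        conf.set("mapreduce.job.reduce.slowstart.completedmaps", "1");
        // Note: the tasktracker map/reduce tasks.maximum properties are
        // TaskTracker-side settings and belong in mapred-site.xml on each node.
        Job job = new Job(conf, "average score calculator");
        // ... set mapper/reducer classes and input/output paths as usual ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}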

Static fields will cause problems if they are updated dynamically in either the map or the reduce program. Standalone and pseudo-distributed modes are for beginners and should only be used while you are learning Hadoop. These modes won't help when processing huge volumes of data, which is the primary objective of MapReduce programming.
When tasks are distributed across the nodes, static information will be lost, so reconsider the use of static variables.
If you can, paste the map and reduce programs and the reason you need the static fields, and we can suggest a better solution.
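A common alternative to statics is to pass job-wide values through the Configuration, which is shipped to every task. A minimal sketch, assuming a made-up property name avg.score.threshold and a simplified mapper:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class ScoreJob {
    public static class ScoreMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private int threshold;

        @Override
        protected void setup(Context context) {
            // Read the job-wide value once per task attempt instead of using a static field
            threshold = context.getConfiguration().getInt("avg.score.threshold", 0);
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // ... use threshold here ...
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("avg.score.threshold", "75"); // hypothetical property name and value
        Job job = new Job(conf, "average score calculator");
        job.setJarByClass(ScoreJob.class);
        job.setMapperClass(ScoreMapper.class);
        // ... set reducer, output types and input/output paths as usual ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}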

You should first know which configuration/mode your job is going to be run in.
For instance, if you run in local (standalone) mode, there will be only one JVM running your job.
If you run it in pseudo-distributed mode, the job will be run using multiple JVMs on your machine.
If you run it in fully distributed mode, the tasks will run on different machines and, of course, in different JVMs (possibly with JVM reuse).
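If you are unsure which mode a job will use, one quick check (assuming Hadoop 1.x, where the mode is driven by mapred.job.tracker) is:

import org.apache.hadoop.mapred.JobConf;

public class ModeCheck {
    public static void main(String[] args) {
        // JobConf pulls in mapred-site.xml, where mapred.job.tracker is defined
        JobConf conf = new JobConf();
        String tracker = conf.get("mapred.job.tracker", "local");
        System.out.println("local".equals(tracker)
                ? "Job will run in the LocalJobRunner (a single JVM)"
                : "Job will run against the JobTracker at " + tracker);
    }
}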

Related

Does Hadoop behave differently in local and distributed mode for static variables?

Suppose I have static variables assigned to class variables in my mapper. The value of a static variable depends on the job, so it is the same for the set of input splits being executed for that job, and hence I can assign the job-specific values directly as static variables in my mapper (the JVM running on the JobTracker node).
For some different job these values will change, since it is a different job and has different classpath variables for its own job, but I believe it will not impact the former job, as they are running in different JVMs (JobTrackers).
Now if I try this in local mode, that other job will be running in the same JVM, so when it tries to overwrite the job-specific class variables which my former job had set, it will cause a problem for my former job.
So can we say that the behavior of the same code in local and distributed mode is not always the same?
The class variables I am setting are nothing but some resource-level and distributed cache values.
I know the use case is not good, but I just wanted to know if this is what will happen with static variables.
Thanks.
The usage of static variables is not encouraged, for the same reason you mentioned. The behavior is surely different based on the mode in which Hadoop is running. If the static is just a resource name and you are only reading it, the usage is fine. But if you try to modify it, it will cause problems in standalone mode. Also, as you know, the standalone and pseudo-distributed modes are just for beginners and learning. Use cases should not dictate our learning :) Happy learning.
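If the values really are just resource names and distributed cache entries, a sketch of the usual alternative is to pass them through the Configuration and the Hadoop 1.x DistributedCache API instead of statics (the property name and cache path below are made up):

import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class ResourceJob {
    public static class ResourceMapper extends Mapper<Object, Text, Text, Text> {
        private String resourceName;
        private Path[] cachedFiles;

        @Override
        protected void setup(Context context) throws IOException {
            Configuration conf = context.getConfiguration();
            resourceName = conf.get("my.resource.name");              // hypothetical key
            cachedFiles = DistributedCache.getLocalCacheFiles(conf);  // local copies on this node
        }
        // map() omitted
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("my.resource.name", "scores-2014");                  // hypothetical value
        DistributedCache.addCacheFile(new URI("/shared/lookup.txt"), conf);
        Job job = new Job(conf, "resource job");
        job.setJarByClass(ResourceJob.class);
        job.setMapperClass(ResourceMapper.class);
        // ... rest of the job setup ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Each job gets its own copy of the configuration and cache, so nothing is shared or overwritten between jobs, regardless of the mode.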

Adding new files to a running Hadoop cluster

Consider that you have 10 GB of data and you want to process it with a MapReduce program using Hadoop. Instead of copying all 10 GB to HDFS at the beginning and then running the program, I want to, for example, copy 1 GB, start the job, and gradually add the remaining 9 GB over time. I wonder if this is possible in Hadoop.
Thanks,
Morteza
Unfortunately this is not possible with MapReduce. When you initiate a MapReduce job, part of the setup process is determining the block locations of your input. If the input is only partially there, the setup process will only work on those blocks and won't dynamically add inputs.
If you are looking for a stream processor, have a look at Apache Storm https://storm.apache.org/ or Apache Spark https://spark.apache.org/
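To make the limitation concrete: the input splits are computed once, at job submission, from whatever is under the input path at that moment. A rough driver sketch (paths are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TenGbJob {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "process first gigabyte");
        // Splits are calculated from the files under this path when the job is submitted;
        // anything copied into /data/input after that point is simply ignored.
        FileInputFormat.addInputPath(job, new Path("/data/input"));
        FileOutputFormat.setOutputPath(job, new Path("/data/output"));
        // ... mapper/reducer setup omitted ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

So the practical options are to run one job per batch of data as it arrives, or to switch to a streaming framework as suggested above.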

Mesos & Hadoop: How to get the running job's input data size?

I'm running Hadoop 1.2.1 on top of Mesos 0.14. My goal is to log the input data size, running time, CPU usage, memory usage, and so on for later optimization. All of these except the data size are obtained using Sigar.
Is there any way I can get the input data size of any job which is running?
For example, when I'm running the Hadoop examples' terasort, I need to get the size of the data generated by teragen before the job actually runs. If I'm running the wordcount example, I need to get the size of the wordcount input file. I need to get the data size automatically, since I won't know in advance which job will be run inside this framework.
I'm using Java to write some of the Mesos library code. Preferably, I want to get the data size inside the MesosExecutor class. For some reason, upgrading Hadoop/Mesos isn't an option.
Any suggestions or relevant APIs would be appreciated. Thank you.
Does hadoop fs -dus satisfy your requirement? Before submitting the job to Hadoop, calculate the input file size and pass it as a parameter to your executor.
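If you'd rather get the same number from Java than shell out to hadoop fs -dus, a small sketch using the FileSystem API (the input path is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class InputSize {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Total length in bytes of everything under the input directory,
        // which is the same number hadoop fs -dus reports.
        long bytes = fs.getContentSummary(new Path("/user/hduser/terasort-input")).getLength();
        System.out.println("Input size: " + bytes + " bytes");
    }
}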

Why are only 1 map task, 1 reduce task and 1 node used in a Hadoop job?

I have configured a 3-node cluster to run the wordcount MapReduce program. I am using a book of 659 KB (http://www.gutenberg.org/ebooks/20417) as the test data. Interestingly, in the web UI of that job, only 1 map, 1 reduce and 1 node are involved. I am wondering if this is because the data size is too small. If so, can I manually split the data into different maps on multiple nodes?
Thanks,
Allen
The default block size is 64 MB, so yes, the framework assigns only one task of each kind because your input data is smaller than that.
1) You can give input data that is larger than 64 MB and see what happens.
2) Change the value of mapred.max.split.size, which is specific to MapReduce jobs (see the driver sketch below)
(in mapred-site.xml, or run the job with -D mapred.max.split.size=noOfBytes)
or
3) Change the value of dfs.block.size, which has a more global scope and applies to all of HDFS. (in hdfs-site.xml)
Don't forget to restart your cluster to apply the changes if you modify the conf files.
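Alternatively, for a single job you can cap the split size from the driver without touching any conf files. A rough sketch (64 KB is an arbitrary cap, chosen only so that a 659 KB input yields several splits):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SmallSplitWordCount {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "wordcount with small splits");
        job.setJarByClass(SmallSplitWordCount.class);
        // Cap each split at 64 KB so a ~659 KB input produces around ten map tasks
        FileInputFormat.setMaxInputSplitSize(job, 64 * 1024);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // ... mapper/combiner/reducer setup as in the standard wordcount ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Whether those extra maps land on different nodes is still up to the scheduler.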

How to control file assignment to different slaves in a Hadoop distributed system?

How can I control file assignment to different slaves in a Hadoop distributed system?
Is it possible to write 2 or more files in Hadoop as a MapReduce task simultaneously?
I am new to Hadoop, so any help would be really appreciated.
If you know, please answer.
This is my answer for your #1:
You can't directly control where map tasks go in your cluster or where files get sent in your cluster. The JobTracker and the NameNode handle these, respectively. The JobTracker will try to make the map tasks data-local to improve performance. (I had to guess what you meant by your question; if I didn't get it right, please elaborate.)
This is my answer for your #2:
MultipleOutputs is what you are looking for when you want to write multiple files out from a single reducer.
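A minimal sketch of that pattern with the new-API MultipleOutputs (the reducer types and output names are just illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class SplitOutputReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private MultipleOutputs<Text, IntWritable> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<Text, IntWritable>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        // Route records to different output files based on the key
        String name = key.toString().startsWith("a") ? "aWords" : "otherWords";
        mos.write(name, key, new IntWritable(sum));
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close(); // flush and close the extra outputs
    }
}

Each named output also has to be registered in the driver before submission, e.g. MultipleOutputs.addNamedOutput(job, "aWords", TextOutputFormat.class, Text.class, IntWritable.class).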

Resources