Does Hadoop behave differently in local and distributed mode for static variables?

Suppose I have static variables assigned to class variables in my mapper. The value of a static variable depends on the job, so it is the same for the set of input splits being executed on a task tracker node for that job, and hence I can assign the job-specific values directly as static variables in my mapper (in the JVM running on that task tracker node).
For some other job these values will change, since it is a different job with its own class-level values, but I believe this will not impact the job mentioned above, as the two jobs run in different JVMs.
Now, if I try this in local mode, that other job will be running in the same JVM, so when it tries to override the job-specific class variables that my former job had set, it will cause a problem for the former job.
So can we say that the behavior of the same code is not always the same in local and distributed mode?
The class variables I am setting are nothing but some resource-level and distributed cache values.
I know the use case is not good, but I just wanted to know whether this is what will happen with static variables.
Thanks.

The usage of static variables is not encouraged, for the same reason you mentioned. The behavior is indeed different depending on the mode in which Hadoop is running. If the static is just a resource name and you are only reading it, the usage is fine. But if you try to modify it, it will cause problems in standalone mode. Also, as you know, the standalone and pseudo-distributed modes are just for beginners and learning. Use cases should not dictate our learning :) Happy learning.
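For what it's worth, one common way to keep per-job values without a mutable static is to put them in the job Configuration and read them in setup(). A minimal sketch, assuming a hypothetical property name my.resource.level and made-up class names:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ResourceAwareMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // per-task copy, populated from this job's configuration in setup()
    private String resourceLevel;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // each task reads the value of its own job, so two jobs cannot
        // overwrite each other's settings, even when they share a JVM in local mode
        resourceLevel = context.getConfiguration().get("my.resource.level", "default");
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // ... use resourceLevel instead of a mutable static ...
        context.write(new Text(resourceLevel), new IntWritable(1));
    }
}

The driver would call conf.set("my.resource.level", ...) before submitting each job, so every job's tasks only ever see their own value, in both local and distributed mode.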

Related

Can HDFS block size be changed during job run? Custom Split and Variant Size

I am using Hadoop 1.0.3. Can the input split/block size be changed (increased/decreased) during run time based on some constraints? Is there a class to override to accomplish this, like FileSplit/TextInputFormat? Can we have variable-size blocks in HDFS depending on a logical constraint within one job?
You're not limited to TextInputFormat... that's entirely configurable based on the data source you are reading. Most examples are line-delimited plaintext, but that obviously doesn't work for XML, for example.
No, block boundaries can't change during runtime as your data should already be on disk, and ready to read.
But the InputSplit depends on the InputFormat for the given job, which should remain consistent throughout a particular job. The Configuration object in the code, on the other hand, is basically a HashMap, which can certainly be changed while running.
If you want to change the block size only for a particular run or application, you can do so by passing "-D dfs.block.size=134217728". This lets you change the block size for your application instead of changing the overall block size in hdfs-site.xml.
-D dfs.block.size=134217728
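The same setting can be applied programmatically in the driver; a rough sketch (the class name and paths are placeholders, and it only affects files the job writes to HDFS, not the blocks of existing input):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BlockSizeDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // 128 MB blocks for files this job writes to HDFS; blocks of existing
        // input files are unaffected, as noted above (Hadoop 1.x property name)
        conf.setLong("dfs.block.size", 134217728L);

        Job job = new Job(conf, "block-size-demo");
        job.setJarByClass(BlockSizeDriver.class);
        // ... set mapper, reducer and key/value classes as usual ...
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

If the driver uses ToolRunner/GenericOptionsParser, the -D form above ends up in the same Configuration object.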

Hadoop Spark (MapR) - how does addFile work

I am trying to understand how Hadoop works. Say I have 10 directories on HDFS, containing hundreds of files which I want to process with Spark.
In the book - Fast Data Processing with Spark
This requires the file to be available on all the nodes in the cluster, which isn't much of a problem for a local mode. When in a distributed mode, you will want to use Spark's addFile functionality to copy the file to all the machines in your cluster.
I am not able to understand this. Will Spark create a copy of the file on each node?
What I want is for it to read the files present in that directory (if that directory is present on that node).
Sorry, I am a bit confused about how to handle the above scenario in Spark.
regards
The section you're referring to introduces SparkContext::addFile in a confusing context. This is a section titled "Loading data into an RDD", but it immediately diverges from that goal and introduces SparkContext::addFile more generally as a way to get data into Spark. Over the next few pages it introduces some actual ways to get data "into an RDD", such as SparkContext::parallelize and SparkContext::textFile. These resolve your concerns about splitting up the data among nodes rather than copying the whole of the data to all nodes.
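For the directories in the question, that would look roughly like this with the Java API (the paths are made up; the glob is resolved by the underlying Hadoop input format, and every file inside the matching directories is read):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ReadHdfsDirs {
    public static void main(String[] args) {
        JavaSparkContext sc =
                new JavaSparkContext(new SparkConf().setAppName("read-hdfs-dirs"));

        // textFile accepts comma-separated paths and Hadoop glob patterns; each
        // worker reads only the splits assigned to it, so the data is partitioned
        // across the cluster rather than copied to every node
        JavaRDD<String> lines = sc.textFile("hdfs:///data/dir*");

        System.out.println("total lines: " + lines.count());
        sc.stop();
    }
}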
A real production use-case for SparkContext::addFile is to make a configuration file available to some library that can only be configured from a file on the disk. For example, when using MaxMind's GeoIP Legacy API, you might configure the lookup object for use in a distributed map like this (as a field on some class):
@transient lazy val geoIp = new LookupService("GeoIP.dat", LookupService.GEOIP_MEMORY_CACHE | LookupService.GEOIP_CHECK_CACHE)
Outside your map function, you'd need to make GeoIP.dat available like this:
sc.addFile("/path/to/GeoIP.dat")
Spark will then make it available in the current working directory on all of the nodes.
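Inside your tasks, that local copy is typically resolved via SparkFiles.get. A rough sketch using the Java API (the class name and file path are illustrative):

import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.SparkFiles;
import org.apache.spark.api.java.JavaSparkContext;

public class AddFileExample {
    public static void main(String[] args) {
        JavaSparkContext sc =
                new JavaSparkContext(new SparkConf().setAppName("addfile-example"));

        // ship the file once; Spark copies it to every executor's work directory
        sc.addFile("/path/to/GeoIP.dat");

        // inside tasks, SparkFiles.get resolves the local copy of that file;
        // the LookupService in the snippet above would be built from this path
        List<String> localPaths = sc.parallelize(Arrays.asList(1, 2, 3))
                .map(x -> SparkFiles.get("GeoIP.dat"))
                .collect();

        System.out.println(localPaths);
        sc.stop();
    }
}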
So, in contrast with Daniel Darabos' answer, there are some reasons outside of experimentation to use SparkContext::addFile. Also, I can't find any info in the documentation that would lead one to believe that the function is not production-ready. However, I would agree that it's not what you want to use for loading the data you are trying to process unless it's for experimentation in the interactive Spark REPL, since it doesn't create an RDD.
addFile is only for experimentation. It is not meant for production use. In production you just open a file specified by a URI understood by Hadoop. For example:
sc.textFile("s3n://bucket/file")

How to define a shared (global) variable in Hadoop?

I need a shared (global) variable that is accessible to all mappers and reducers. Mappers just read its value, but reducers change some values in it to be used in the next iteration. I know the DistributedCache is a technique to do that; however, it only supports reading a shared value.
This is exactly what ZooKeeper was built for. ZooKeeper can keep up with lots of reads from mappers/reducers, and still be able to write something now and then.
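As a rough illustration only (not production-hardened; the znode path /jobs/myjob/threshold and the connection handling are placeholders), mappers could read a znode while reducers update it between iterations:

import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class SharedValue {
    // placeholder znode path; parent znodes are assumed to already exist
    private static final String PATH = "/jobs/myjob/threshold";

    private final ZooKeeper zk;

    public SharedValue(String connectString) throws Exception {
        // the watcher is a no-op here; a real job should handle connection events
        zk = new ZooKeeper(connectString, 30000, event -> { });
    }

    // called from mappers: read the current shared value
    public String read() throws Exception {
        return new String(zk.getData(PATH, false, new Stat()), StandardCharsets.UTF_8);
    }

    // called from reducers: publish the value for the next iteration
    public void write(String value) throws Exception {
        byte[] data = value.getBytes(StandardCharsets.UTF_8);
        if (zk.exists(PATH, false) == null) {
            zk.create(PATH, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        } else {
            zk.setData(PATH, data, -1);  // version -1 = overwrite unconditionally
        }
    }
}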
The other option would be to set values in the Configuration object. However, this only persists globally for a single job, so you'd have to manage the passing of this value across jobs yourself. Also, you can't change it while the job is running.

HADOOP - get nodename inside mapper

I'm writing a mapper and would like to know if it is possible to get the name of the node where the mapper is running.
Hadoop automatically moves your MapReduce program to where your data is so I think you can just do getHostName() (if you're using Java that is) and it should return the name of the node on which your program is running.
java.net.InetAddress.getLocalHost().getHostName();
If you're using other languages such as Python, Ruby, etc. (i.e. using HadoopStreaming), the same idea holds true so you should be able to use the appropriate function/method available in those languages to get the host name.
The configuration value fs.default.name will most probably give you a URL to the namenode, and if you get an instance of the FileSystem (FileSystem.get(conf)) you should be able to call the getUri() method to get the same information.
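Putting the Java suggestions together inside a mapper, it would look roughly like this (the class name and the logging are illustrative):

import java.io.IOException;
import java.net.InetAddress;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class NodeNameMapper extends Mapper<LongWritable, Text, Text, Text> {

    private String hostName;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // name of the node this map task is running on
        hostName = InetAddress.getLocalHost().getHostName();
        // for comparison, the namenode URI from the job's default file system
        System.out.println("task host: " + hostName + ", default FS: "
                + FileSystem.get(context.getConfiguration()).getUri());
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(new Text(hostName), value);
    }
}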

Do Map and Reduce run in separate JVMs?

Hi, I have a MapReduce job, say AverageScoreCalculator, which has a mapper and a reducer.
The question is: if I statically initialize a few fields in AverageScoreCalculator, will they be available to both the mapper and the reducer?
By default, each map and reduce task runs in a different JVM, and there can be multiple JVMs running at any given time on a node.
Set the following properties
mapred.job.reuse.jvm.num.tasks = -1
mapreduce.tasktracker.map.tasks.maximum = 1
mapreduce.tasktracker.reduce.tasks.maximum = 1
mapreduce.job.reduce.slowstart.completedmaps = 1
and there will be only a single mapper/reducer running on a given node with JVM reuse and the reducers won't start until all the mappers have completed processing.
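Applied in the driver, that could look roughly like the following sketch; note that the two tasktracker slot maxima are really cluster-side settings that belong in mapred-site.xml on each node, so setting them in the job configuration is shown here only to mirror the list above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class AverageScoreDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // reuse one JVM for an unlimited number of this job's tasks (MR1 only)
        conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);
        // one map slot and one reduce slot per node; in practice these two are
        // cluster-side settings configured in mapred-site.xml on each tasktracker
        conf.setInt("mapreduce.tasktracker.map.tasks.maximum", 1);
        conf.setInt("mapreduce.tasktracker.reduce.tasks.maximum", 1);
        // do not start reducers until all maps have completed
        conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 1.0f);

        Job job = new Job(conf, "AverageScoreCalculator");
        // ... set mapper, reducer, input/output paths as usual ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}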
A couple of things to note:
The above approach works with the MapReduce 1.x release and is not an efficient approach.
JVM reuse is not supported in the MapReduce 2.x release.
Static fields will create problems if they are updated dynamically in either the map or the reduce program. Standalone and pseudo-distributed modes are for beginners and should only be used while learning Hadoop. These modes won't help when processing huge volumes of data, which is the primary objective of MapReduce programming.
When jobs are distributed across the nodes, static information will be lost. Reconsider the use of static variables.
If you can, paste the map and reduce programs and the reason you need static fields, and we can suggest a better solution.
You should first know which configuration/mode your job is going to be run in.
For instance, if you run in local (standalone) mode, there will be only one JVM running your job.
If you run it in a pseudo-distributed mode, the job will be run using multiple JVMs on your machine.
If you run it in fully distributed mode, the tasks will run on different machines and, of course, in different JVMs (possibly with JVM reuse).

Resources