I am just a beginner in the Hadoop framework. I have browsed many links, but I would like to get clear answers to a few concepts:
1) Why does MapReduce work only with key-value pairs? I also read that I can create a MapReduce job without actually using reduce.
2) The key for the input of the mapping phase is the file-offset key. Can I use an explicit key-value, or a custom input?
Good, you are digging into Hadoop concepts.
1) Can I use an explicit key-value, or a custom input?: Yes, write your own RecordReader (overriding the default) to do so.
2) Why does MapReduce work only with key-value pairs?:
MapReduce, as the name suggests, maps (filters) the required data and reduces it (combines it based on unique keys) from the data set fed to the program.
Now, why key-value pairs? Since you are processing unstructured data, you would not want the same unstructured data as output; you need some way to manipulate it. Think of using a Map in Java: it lets you uniquely identify each pair, and Hadoop does the same with the help of Sort & Shuffle.
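As a rough illustration of why key-value pairs matter, here is a plain-Python sketch (not Hadoop code; the function names are made up) of the map → shuffle → reduce flow for a word count:

```python
from collections import defaultdict

def map_phase(lines):
    # Mapper: emit (word, 1) for every word, like a Hadoop Mapper's context.write(key, value)
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Sort & Shuffle: group all values by their unique key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: combine the values for each unique key
    return {key: sum(values) for key, values in groups.items()}

lines = ["hello world", "hello hadoop"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'hello': 2, 'world': 1, 'hadoop': 1}
```

The key is what makes the grouping in the shuffle step possible at all; without a key there would be nothing to combine on.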
Create a MapReduce job without actually using reduce?:
Of course. It depends entirely on your use case, but it is recommended only for small operations, and for scenarios where your mapper outputs do not need to be combined to produce the expected output.
Reason: this is where the distributed concept and commodity hardware take priority. For example: I have a large data set to process. When processing it with a plain Java program (just Java, not Hadoop), we store the required data in Collection objects (which simply uses RAM space). Hadoop was introduced to do the same job in a different fashion: store the required data in the context. The context in a mapper refers to intermediate data (on the local FS); in a reducer it refers to the output (on HDFS). In both cases the context stores data on the hard disk.
Hadoop thus helps to do all the calculations on the hard disk instead of in RAM.
I suggest reading Hadoop: The Definitive Guide and the Data Algorithms book for better understanding.
I'm trying to understand lazy evaluation in Apache Spark.
My understanding is as follows:
Let's say I have a text file on the hard drive.
Steps:
1) First I create RDD1, which is nothing but a data definition right now (no data is loaded into memory yet).
2) I apply some transformation logic on RDD1 and create RDD2; RDD2 is still just a data definition (still no data loaded into memory).
3) Then I apply a filter on RDD2 and create RDD3 (still no data loaded into memory; RDD3 is also just a data definition).
4) Then I perform an action to write RDD3's output to a text file. The moment I perform this action, i.e., when I expect some output, Spark loads the data into memory, computes RDD1, RDD2 and RDD3, and produces the output.
So the laziness of RDDs in Spark means: keep building the roadmap (the RDDs) until they get the approval to actually be produced.
Is my understanding correct up to here?
My second question is: it is said that lazy evaluation is one of the reasons Spark is more powerful than Hadoop. May I know how, please, as I am not very familiar with Hadoop? What happens in Hadoop in this scenario?
Thanks :)
Yes, your understanding is fine. A graph of operations (a DAG) is built via transformations, and they are all computed at once upon an action. This is what is meant by lazy execution.
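If it helps, the build-the-plan-then-run-it behaviour can be mimicked in plain Python (this is only an analogy, not Spark code): generators define a pipeline lazily, and nothing is computed until a terminal operation pulls data through.

```python
def load(source_lines):
    # Analogue of sc.textFile: defines where data comes from, reads nothing yet
    for line in source_lines:
        yield line

def transform(rdd):
    # Analogue of a map transformation: still lazy
    for line in rdd:
        yield line.upper()

def keep_short(rdd):
    # Analogue of a filter transformation: still lazy
    for line in rdd:
        if len(line) < 10:
            yield line

lines = ["spark", "lazy evaluation", "rdd"]
rdd1 = load(lines)        # data definition only
rdd2 = transform(rdd1)    # data definition only
rdd3 = keep_short(rdd2)   # data definition only
result = list(rdd3)       # the "action": only now does data flow through
print(result)  # ['SPARK', 'RDD']
```

Until `list()` is called, none of the three functions has processed a single line, which mirrors steps 1–3 of the question.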
Hadoop only provides a filesystem (HDFS), a resource manager (YARN), and the libraries that allow you to run MapReduce. Spark only concerns itself with being more optimal than the latter, given enough memory.
Apache Pig is another framework in the Hadoop ecosystem that allows for lazy evaluation, but it has its own scripting language, whereas Spark is programmable in several general-purpose languages. Pig supports running MapReduce, Tez, or Spark actions for computations. Spark only runs and optimizes its own code.
What happens in actual MapReduce code is that you need to procedurally write each stage of an operation out to disk or memory in order to accomplish relatively large tasks.
Spark is not a replacement for "Hadoop"; it's a complement.
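A toy illustration of that difference (plain Python, with made-up stage functions): chained MapReduce jobs materialize every intermediate result to disk for the next job to read back, whereas Spark pipelines the stages and only writes the final output.

```python
import json
import os
import tempfile

def stage1(records):
    # First "job": double every value
    return [r * 2 for r in records]

def stage2(records):
    # Second "job": add one to every value
    return [r + 1 for r in records]

data = [1, 2, 3]

# MapReduce-style: stage1 writes its full output to disk, stage2 reads it back
workdir = tempfile.mkdtemp()
path1 = os.path.join(workdir, "stage1.json")
with open(path1, "w") as f:
    json.dump(stage1(data), f)
with open(path1) as f:
    intermediate = json.load(f)
mr_result = stage2(intermediate)

# Spark-style: stages are composed and run in one pass, no intermediate file
spark_result = stage2(stage1(data))

print(mr_result == spark_result)  # True: same answer, different I/O cost
```

Both approaches produce the same result; the cost difference is the disk round-trip between every pair of chained stages.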
I have a use case where I have millions of small files in S3 which need to be processed by Spark. I have two options to reduce the number of tasks:
1. Use Coalesce
2. Extend CombineFileInputFormat
But I'm not clear on the performance implications of both, or when to use one over the other.
Also, CombineFileInputFormat is an abstract class, which means I need to provide my own implementation. But the Spark API (newAPIHadoopRDD) takes the class name as a param, so I'm not sure how to pass a configurable maxSplitSize.
Another great option to consider for such scenarios is SparkContext.wholeTextFiles(), which makes one record for each file, with its name as the key and its content as the value -- see the documentation.
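To illustrate the shape of the records wholeTextFiles() produces, here is a plain-Python stand-in (not Spark code) that builds the same (filename, content) pairs from a local directory; in actual PySpark you would call sc.wholeTextFiles() on the S3 path instead.

```python
import os
import tempfile

def whole_text_files(directory):
    # Stand-in for SparkContext.wholeTextFiles: one (path, content) pair per file
    records = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        with open(path) as f:
            records.append((path, f.read()))
    return records

# Create two tiny files to demonstrate the record shape
d = tempfile.mkdtemp()
for name, text in [("a.txt", "first file"), ("b.txt", "second file")]:
    with open(os.path.join(d, name), "w") as f:
        f.write(text)

records = whole_text_files(d)
for key, value in records:
    print(os.path.basename(key), "->", value)
```

Because each file becomes one record regardless of size, millions of small files no longer map to millions of line-oriented splits.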
I am new to Apache Spark and I have a couple of basic questions which I could not understand while reading the Spark material; every source has its own style of explanation. I am using a PySpark Jupyter notebook on Ubuntu to practice.
As per my understanding, when I run the command below, the data in testfile.csv is partitioned and stored in the memory of the respective nodes (I actually know it's a lazy evaluation and it will not be processed until it sees an action command, but still, that is the concept):
rdd1 = sc.textFile("testfile.csv")
My question is: when I run the transformation and action commands below, where is the rdd2 data stored?
1. Does it get stored in memory?
rdd2 = rdd1.map( lambda x: x.split(",") )
rdd2.count()
I know the data in rdd2 will be available until I close the Jupyter notebook. Then what is the need for cache(), if rdd2 is available anyhow to do all transformations? I heard that after all the transformations the data in memory is cleared; what is that about?
2. Is there any difference between keeping an RDD in memory and cache()?
rdd2.cache()
Does it get stored in memory?
When you run a Spark transformation via an action (count, print, foreach), then, and only then, is your graph materialized, and in your case the file is consumed. The purpose of RDD.cache is to make sure that the result of sc.textFile("testfile.csv") is available in memory and doesn't need to be read over again.
Don't confuse the variable with the actual operations that are being done behind the scenes. Caching allows you to re-iterate over the data, making sure it is in memory (if there is sufficient memory to store it in its entirety), as long as you've set the right storage level (which defaults to StorageLevel.MEMORY_ONLY). From the documentation (thanks #RockieYang):
In addition, each persisted RDD can be stored using a different storage level, allowing you, for example, to persist the dataset on disk, persist it in memory but as serialized Java objects (to save space), replicate it across nodes, or store it off-heap in Tachyon. These levels are set by passing a StorageLevel object (Scala, Java, Python) to persist(). The cache() method is a shorthand for using the default storage level, which is StorageLevel.MEMORY_ONLY (store deserialized objects in memory).
You can mark an RDD to be persisted using the persist() or cache() methods on it. The first time it is computed in an action, it will be kept in memory on the nodes. Spark's cache is fault-tolerant – if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.
Is there any difference between keeping an RDD in memory and cache()?
As stated above, you keep it in memory via cache, as long as you've provided the right storage level. Otherwise, it won't necessarily be kept in memory at the time you want to re-use it.
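The effect of cache() can be mimicked in plain Python (an analogy only, not Spark code): without caching, every "action" re-runs the whole lineage from the source; with caching, the lineage runs once and later actions reuse the stored result.

```python
compute_calls = 0

def lineage():
    # Stand-in for re-running sc.textFile(...) + map from the start
    global compute_calls
    compute_calls += 1
    return [row.split(",") for row in ["a,b", "c,d"]]

# Without cache: each action recomputes the full lineage
uncached_count = len(lineage())   # first "action"
uncached_first = lineage()[0]     # second "action", recomputes everything
print(compute_calls)  # 2

# With cache: compute once, then reuse the materialized result
cached = lineage()                # analogue of rdd2.cache() plus one action
cached_count = len(cached)
cached_first = cached[0]
print(compute_calls)  # 3 -- the cached path added only one computation
```

This is why cache() matters even though the rdd2 variable stays available in the notebook: the variable holds the recipe, not the data, and without caching each action cooks the recipe again.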
Hadoop's Distributed Cache lets the developer add small files to the MR context, which can be used to obtain additional information during the map or reduce phases. However, I have not found a way to access this cache in a Partitioner. I need the contents of a small file (the output of an earlier MR job) in a custom Partitioner to determine how the keys are sent to the reducers.
Unfortunately, I cannot find any useful documentation on this, and my only idea is currently a somewhat "hackish" approach: serializing the contents of the file to a Base64 string and putting it into the Configuration. A Configuration can be used in a Partitioner by letting it implement Configurable. While the file is small enough for this approach (around 50 KB), I suppose the distributed cache is better suited for this.
EDIT:
I found another approach which I consider slightly better. Since the file I need to access in the Partitioner is on HDFS, I put its fully-qualified URI into the Configuration. In my Partitioner's setConf method I can then re-create the Path via new Path(new URI(conf.get("some.file.key"))) and read it with the help of the Configuration. Still hackish, though...
I am working on a project to extend Hive to support some image processing functions.
To do this, we need to read in an image, break it up into multiple files, pass each into a separate map task that does some processing on it, and then reduce them back into one image to be returned to the user.
To do this, we had planned to implement a UDF that would call a MapReduce task in Hadoop. However, from what we understand, a UDF only operates on either the map side OR the reduce side of the HQL query, while we ideally need it to 'bridge the gap' between the map and reduce sides.
The Hive documentation isn't the most helpful, and I was looking for some pointers on where to start looking for more information about this.
Please feel free to ask more questions if I haven't been clear enough in the question.
Looking into HIPI (Hadoop Image Processing Interface) might give you a start.
Particularly, the example on computing the Principal Components of a bunch of images might be of interest.
Use a UDAF (User Defined Aggregate Function), which has, in a sense, both a map and a reduce phase.