Are mappers/reducers involved while uploading/inserting data to HDFS? - hadoop

I have some confusion here. When we upload/insert/put data into Hadoop HDFS, it is known that the data is stored in chunks based on the block size and replication factor. Moreover, MapReduce only comes into play when processing the data.
I'm using MRv2. When I insert any data into one of my tables, I can see a MapReduce progress bar. So what is the exact picture here? In reality, are there mappers and reducers involved while inserting/uploading data to HDFS?

The need for MapReduce depends on the type of write operation.
Operations like hdfs dfs -put or -copyFromLocal do not use MapReduce when writing data from the local filesystem to HDFS. DistCp, which performs inter/intra-cluster HDFS data copying, does use mappers, and Sqoop likewise uses mappers to import data into HDFS. Hive's LOAD statements do not use MapReduce, while its INSERT statements do.
And these are mapper-only MapReduce jobs.
I'm using MRv2. When I insert any data into one of my tables...
I assume you are inserting data into a Hive table. INSERT statements in Hive use mappers.
are there mappers and reducers involved while inserting/uploading the data to HDFS?
Not always. Depending on the write operation, mappers may be involved.

The HDFS client writes directly to the datanodes after consulting with the namenode for block locations. No mappers or reducers are required.
Ref: Architecture of HDFS Read and Write
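As a minimal sketch (the path and object name below are hypothetical), this is essentially what hdfs dfs -put does through the Hadoop FileSystem API: the client itself streams the bytes to the datanodes and no MapReduce job is submitted.
```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsPutSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()        // picks up core-site.xml / hdfs-site.xml from the classpath
    val fs   = FileSystem.get(conf)

    // The client asks the namenode for block locations and then streams
    // the bytes directly to the datanodes -- no mappers or reducers.
    val out = fs.create(new Path("/user/demo/sample.txt"))  // hypothetical path
    out.write("hello hdfs\n".getBytes("UTF-8"))
    out.close()

    fs.close()
  }
}
```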
Just because there's a progress bar doesn't mean it's a MapReduce process.
If every file written to HDFS were a MapReduce job, the YARN ResourceManager UI would log it all, so if you don't believe me, check there.

MapReduce is not used when you copy data from the local filesystem or put data into HDFS.

Related

How do non-MapReduce applications work in YARN?

By using YARN, we can run non-MapReduce applications.
But how does that work?
In HDFS, everything is stored in blocks, and for each block one mapper task gets created to process the whole dataset.
But how do non-MapReduce applications process datasets spread across different data nodes without using MapReduce?
Please explain.
Do not confuse the MapReduce paradigm with other applications such as Spark. Spark can run under YARN but does not use mappers or reducers.
Instead it uses executors, and these executors are aware of data locality in the same way MapReduce is.
The Spark driver starts executors on data nodes and tries to keep data locality in mind when doing so.
Also, do not confuse MapReduce's default behaviour with required behaviour: you do not need to have one mapper per input split.
Finally, HDFS and MapReduce are two different things. HDFS is just the storage layer, while MapReduce handles processing.
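As a rough illustration of that point (the input path and app name below are made up), a minimal Spark job reading from HDFS runs entirely in executors when submitted with --master yarn; no MapReduce mappers or reducers are involved:
```scala
import org.apache.spark.{SparkConf, SparkContext}

object YarnWithoutMapReduce {
  def main(args: Array[String]): Unit = {
    // Submitted with: spark-submit --master yarn ...
    val conf = new SparkConf().setAppName("yarn-without-mapreduce")
    val sc   = new SparkContext(conf)

    // Each HDFS block becomes one partition; executors are placed with
    // data locality in mind, much like MapReduce tasks would be.
    val lines  = sc.textFile("hdfs:///user/demo/input.txt")   // hypothetical path
    val counts = lines.flatMap(_.split("\\s+"))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)

    counts.take(10).foreach(println)
    sc.stop()
  }
}
```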

Spark coalesce vs HDFS getmerge

I am developing a program in Spark. I need to have the results in a single file, so there are two ways to merge the result:
Coalesce (Spark):
myRDD.coalesce(1, false).saveAsTextFile(pathOut);
Merge it afterwards in HDFS:
hadoop fs -getmerge pathOut localPath
Which one is most efficient and quick?
Is there any other method to merge the files in HDFS (like "getmerge") saving the result to HDFS, instead of getting it to a local path?
If you are sure your data fits in memory, coalesce is probably the best option, but otherwise, in order to avoid an OOM error, I would use getmerge or, if you are using Scala/Java, the copyMerge API function from the FileUtil class.
Check this thread of the Spark user mailing list.
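If you go the copyMerge route, a sketch along these lines should work on Hadoop 2.x (FileUtil.copyMerge was deprecated and later removed in Hadoop 3; the paths below are placeholders):
```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

object MergeParts {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    val fs   = FileSystem.get(conf)

    // Concatenate all part files from the job's output directory into a
    // single HDFS file, without pulling anything to the local filesystem.
    // Hadoop 2.x signature: copyMerge(srcFS, srcDir, dstFS, dstFile,
    //                                 deleteSource, conf, addString)
    FileUtil.copyMerge(
      fs, new Path("/user/demo/pathOut"),       // directory containing part-* files
      fs, new Path("/user/demo/merged.txt"),    // single merged result
      false,                                    // keep the source parts
      conf,
      null)                                     // nothing appended between files

    fs.close()
  }
}
```
This also covers the last part of the question, since the merged result stays in HDFS rather than being pulled to a local path.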
If you're processing a large dataset (and I assume you are), I would recommend letting Spark write each partition to its own "part" file in HDFS and then using hadoop fs -getmerge to extract a single output file from the HDFS directory.
Spark splits the data up into partitions for efficiency, so it can distribute the workload among many worker nodes. If you coalesce to a small number of partitions, you reduce its ability to distribute the work, and with just 1 partition you're putting all the work on a single node. At best this will be slower, at worst it will run out of memory and crash the job.

Spark Fundamentals

I am new to Spark... some basic things are not clear to me when going through the fundamentals:
Query 1. For distributed processing - can Spark work without HDFS (the Hadoop file system) on a cluster (e.g. by creating its own distributed file system), or does it require some base distributed file system in place as a prerequisite, like HDFS, GPFS, etc.?
Query 2. If we already have a file loaded in HDFS (as distributed blocks), will Spark again convert it into blocks and redistribute it at its own level (for distributed processing), or will it just use the block distribution as per the Hadoop HDFS cluster?
Query 3. Other than defining a DAG, does Spark also create partitions like MapReduce does and shuffle them to the reducer nodes for further computation?
I am confused on the same: up to DAG creation it's clear that a Spark executor on each worker node loads data blocks as an RDD in memory and applies computation as per the DAG... but where does the part come in that partitions the data by key and moves it to other nodes where the reducer-like tasks are performed (just like MapReduce)? How is that done in memory?
This would be better asked as separate questions and question 3 is hard to understand. Anyway:
No, Spark does not require a distributed file system.
By default Spark will create one partition per HDFS block, and will co-locate computation with the data if possible.
You're asking about the shuffle. The shuffle creates blocks on the mapper side that the reducers then fetch. The spark.shuffle.memoryFraction parameter controls how much memory to allocate to shuffle block files (20% by default), and the spark.shuffle.spill parameter controls whether to spill shuffle blocks to local disk when that memory runs out.
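For reference, a minimal sketch of how those two settings would be passed in; note these are Spark 1.x parameters that later releases dropped in favour of unified memory management, and the values shown are just examples:
```scala
import org.apache.spark.{SparkConf, SparkContext}

object ShuffleTuningSketch {
  def main(args: Array[String]): Unit = {
    // Spark 1.x-style shuffle settings; these keys no longer exist in Spark 2+.
    val conf = new SparkConf()
      .setAppName("shuffle-tuning-sketch")
      .set("spark.shuffle.memoryFraction", "0.3") // default is 0.2
      .set("spark.shuffle.spill", "true")         // spill to local disk when the buffer fills

    val sc = new SparkContext(conf)
    // ... a job that triggers a shuffle (e.g. reduceByKey) goes here ...
    sc.stop()
  }
}
```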
Query 1. For distributed processing - can Spark work without HDFS?
For distributed processing, Spark does not require HDFS, but it may read/write data from/to an HDFS system. For some use cases it may write data to HDFS; for example, the tera-sort world-record run used HDFS for the sort data instead of keeping it all in memory.
Spark doesn't provide distributed storage. Integration with HDFS is one option for storage, but Spark can also use other storage systems, such as Cassandra. Have a look at this article for more details: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
Query 2. If we already have a file loaded in HDFS (as distributed blocks), will Spark again convert it into blocks and redistribute it at its own level?
I agree with Daniel Darabos' response. Spark will create one partition per HDFS block.
Query 3: on shuffle
Depending on the size of the data, the shuffle will be done in memory, or it may use disk (e.g. the tera-sort case), or it may use both. Have a look at this excellent article on Spark shuffle.
Fine with this. What if you don’t have enough memory to store the whole “map” output? You might need to spill intermediate data to the disk. Parameter spark.shuffle.spill is responsible for enabling/disabling spilling, and by default spilling is enabled
The amount of memory that can be used for storing “map” outputs before spilling them to disk is “JVM Heap Size” * spark.shuffle.memoryFraction * spark.shuffle.safetyFraction, with default values it is “JVM Heap Size” * 0.2 * 0.8 = “JVM Heap Size” * 0.16.
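To make that arithmetic concrete, here is a tiny snippet (runnable in the Scala REPL) that evaluates the quoted formula for a hypothetical 4 GB executor heap using the Spark 1.x defaults:
```scala
// Worked example of the quoted formula with Spark 1.x defaults
// and a hypothetical 4 GB executor JVM heap.
val heapGb         = 4.0
val memoryFraction = 0.2   // spark.shuffle.memoryFraction default
val safetyFraction = 0.8   // spark.shuffle.safetyFraction default

val spillThresholdGb = heapGb * memoryFraction * safetyFraction
println(f"shuffle buffer before spilling: about $spillThresholdGb%.2f GB") // about 0.64 GB
```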
Query 1.
Yes, it can work with other storage systems as well. Spark works with RDDs; if you have a corresponding RDD implementation, that's it. When you actually create an RDD by opening a file in HDFS, it inherently creates a HadoopRDD, which has the implementation for understanding HDFS. If you write your own distributed file system, you can write your own RDD implementation for it and instantiate that class, and you're done. Writing the connector RDD for your own DFS is the challenge, though. For more, you can look at the RDD interface in the Spark code.
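To illustrate (the paths below are placeholders), the same sc.textFile call resolves to a HadoopRDD backed by whichever filesystem implementation the URI scheme selects, which is how Spark stays storage-agnostic:
```scala
import org.apache.spark.{SparkConf, SparkContext}

object StorageAgnosticRead {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("storage-agnostic-read"))

    // The URI scheme picks the Hadoop FileSystem implementation that
    // backs the resulting HadoopRDD; HDFS is just one of the options.
    val fromHdfs  = sc.textFile("hdfs:///user/demo/data.txt")  // hypothetical HDFS path
    val fromLocal = sc.textFile("file:///tmp/demo/data.txt")   // hypothetical local path

    println(s"hdfs lines: ${fromHdfs.count()}, local lines: ${fromLocal.count()}")
    sc.stop()
  }
}
```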
Query 2. It won't re-create the blocks; instead, by means of the Hadoop/HDFS RDD connector, it knows where the blocks are. It will also try to use the same YARN nodes to run the JVM tasks that do the processing.
Query 3. Not sure about this.
Query 1: Put simply, Spark provides distributed processing through the RDD (resilient distributed dataset) abstraction, but without HDFS it cannot provide distributed storage.
Query 2: No, it won't re-create them. Spark exposes every block as a partition (i.e. a reference to that block), so it launches the YARN task on the same block.
Query 3: No idea.

Is it possible to specify which tasktrackers to use in a MapReduce job?

We have two types of jobs in our Hadoop cluster. One type uses MapReduce for HBase scanning; the other is just pure manipulation of raw files in HDFS. Within our HDFS cluster, some of the datanodes are also HBase regionservers, but others aren't. We would like to run the HBase scans only on the regionservers (to take advantage of data locality) and run the other type of job on all the datanodes. Is this idea possible at all? Can we specify which tasktrackers to use in the MapReduce job configuration?
Any help is appreciated.

HBase MapReduce interaction

I have a program that uses HBase and MapReduce.
I store data in HDFS; the size of this file is 100 GB. Now I put this data into HBase.
Using MapReduce to scan the flat file takes 5 minutes, but scanning the HBase table takes 30 minutes.
How can I increase the speed when using HBase with MapReduce?
Thanks.
I am assuming you have a single-node HDFS. If you had your 100 GB file in a multi-node HDFS cluster, it would have been much faster for both MapReduce and Hive.
You could try increasing the number of mappers and reducers in MapReduce to gain some performance; have a look at this post.
Hive is essentially a data-warehousing tool built on top of HDFS, and every query underneath is a MapReduce job itself. So the above post would answer this problem as well.
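As a hedged sketch of the mapper/reducer tuning suggested above (the property key is a standard MRv2 key and the values are just examples, not settings from the question): lowering the maximum split size yields more mappers for HDFS file input, and the reducer count can be set explicitly. Note that for HBase scans via TableInputFormat, the number of mappers generally follows the number of regions in the table rather than the split size.
```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job

object TuneParallelismSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()

    // Smaller max split size => more input splits => more mappers
    // (applies to jobs reading plain HDFS files).
    conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 64L * 1024 * 1024) // 64 MB

    val job = Job.getInstance(conf, "scan-tuning-sketch")
    job.setNumReduceTasks(8)   // example value; pick based on cluster capacity

    // ... the rest of the job setup (mapper class, input/output formats, paths) goes here ...
  }
}
```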
