Spark coalesce vs HDFS getmerge - hadoop

I am developing a program in Spark. I need to have the results in a single file, so there are two ways to merge the result:
Coalesce (Spark):
myRDD.coalesce(1, false).saveAsTextFile(pathOut);
Merge it afterwards in HDFS:
hadoop fs -getmerge pathOut localPath
Which one is more efficient and quick?
Is there any other method to merge the files in HDFS (like "getmerge") saving the result to HDFS, instead of getting it to a local path?

If you are sure your data fits in memory, coalesce is probably the best option. Otherwise, to avoid an OOM error, I would use getmerge or, if you are using Scala/Java, the copyMerge API function from the FileUtil class.
Check this thread of spark user mailing list.
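For completeness, here is a minimal Scala sketch of the copyMerge route (the paths are illustrative; note that FileUtil.copyMerge exists in Hadoop 2.x but was removed in Hadoop 3.x). Unlike getmerge, the destination file here stays on HDFS rather than being pulled to a local path:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
// Merge the part-* files written by saveAsTextFile into a single HDFS file.
val conf = new Configuration()
val fs = FileSystem.get(conf)
FileUtil.copyMerge(
  fs, new Path("pathOut"),        // source directory holding the part files
  fs, new Path("pathOut-merged"), // single destination file, still on HDFS
  false,                          // deleteSource: keep the original part files
  conf,
  null)                           // addString: nothing inserted between files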

If you're processing a large dataset (and I assume you are), I would recommend letting Spark write each partition to its own "part" file in HDFS and then using hadoop fs -getmerge to extract a single output file from the HDFS directory.
Spark splits the data up into partitions for efficiency, so it can distribute the workload among many worker nodes. If you coalesce to a small number of partitions, you reduce its ability to distribute the work, and with just 1 partition you're putting all the work on a single node. At best this will be slower, at worst it will run out of memory and crash the job.

Related

When does a file from the local system get moved to HDFS

I am new to Hadoop, so please excuse me if my questions are trivial.
Is the local file system different from HDFS?
While creating a MapReduce program, we set the input file path using the FileInputFormat.addInputPath() function. Does it split that data across multiple data nodes and also perform input splits? If yes, how long will this data stay in the data nodes? And can we write a MapReduce program against existing data in HDFS?
1: HDFS is a solution for distributed storage; local storage runs into capacity ceilings and backup problems. HDFS treats the storage resources of a server cluster as a whole: the NameNode manages the directory and block metadata, and the DataNodes act as the block storage containers. You can think of HDFS as a higher-level abstraction over local storage that solves the core problems of distributed storage.
2: If we use Hadoop's FileInputFormat, it first calls open() on the FileSystem and contacts the NameNode to get the block location information, which is returned to the client. It then creates an FSDataInputStream to read from the different nodes one by one, and closes the FSDataInputStream at the end.
When the client puts data into HDFS, the data is split into multiple blocks and stored on different machines (if it is bigger than the block size, 128 MB by default, 64 MB in older versions).
The data is persisted on hard disk.
So if your file is too big for a single common server and you need distributed computing, you can use HDFS.
HDFS is not your local filesystem - it is a distributed file system. This means your dataset can be larger than the maximum storage capacity of a single machine in your cluster. HDFS by default uses a block size of 64 MB. Each block is replicated across other nodes in the cluster (3 copies by default) to provide redundancy against failures (such as a node going down). So with HDFS, you can think of your entire cluster as one large file system.
When you write a MapReduce program and set your input path, it will try to locate that path on HDFS. The input is then automatically divided up into what are known as input splits - fixed-size partitions containing multiple records from your input file. A Mapper is created for each of these splits. Next, the map function (which you define) is applied to each record within each split, and the output generated is stored in the local filesystem of the node where the map function ran. The Reducer then copies this output to its node and applies the reduce function. If a runtime error occurs while executing the map function and the task fails, Hadoop will run the same mapper task on another node and have the reducer copy that output instead.
The reducers use the outputs generated from all the mapper tasks, so by this point the reducers are not concerned with the input splits that were fed to the mappers.
Grouping answers as per the questions:
HDFS vs local filesystem
Yes, HDFS and local file system are different. HDFS is a Java-based file system that is a layer above a native filesystem (like ext3). It is designed to be distributed, scalable and fault-tolerant.
How long do data nodes keep data?
When data is ingested into HDFS, it is split into blocks, replicated 3 times (by default) and distributed throughout the cluster data nodes. This process is all done automatically. This data will stay in the data nodes till it is deleted and finally purged from trash.
InputSplit calculation
FileInputFormat.addInputPath() specifies the HDFS file or directory from which files should be read and sent to mappers for processing. Before this point is reached, the data should already be available in HDFS, since it is now attempting to be processed. So the data files themselves have been split into blocks and replicated throughout the data nodes. The mapping of files, their blocks and which nodes they reside on - this is maintained by a master node called the NameNode.
Now, based on the input path specified by this API, Hadoop will calculate the number of InputSplits required for processing the file/s. Calculation of InputSplits is done at the start of the job by the MapReduce framework. Each InputSplit then gets processed by a mapper. This all happens automatically when the job runs.
MapReduce on existing data
Yes, a MapReduce program can run on existing data in HDFS.
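For illustration, here is a minimal Scala sketch of pointing a MapReduce job at data that already sits in HDFS via FileInputFormat.addInputPath() (the paths and job name are hypothetical, and the mapper/reducer classes are omitted):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
val job = Job.getInstance(new Configuration(), "process-existing-hdfs-data")
// The input path must already exist in HDFS; nothing is re-uploaded here.
FileInputFormat.addInputPath(job, new Path("/data/already/in/hdfs"))
FileOutputFormat.setOutputPath(job, new Path("/data/output"))
// setJarByClass, setMapperClass, setReducerClass and the output key/value
// classes would be configured here before submitting.
// InputSplits are computed from the input path when the job starts.
job.waitForCompletion(true)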

Is it possible to create/work with a non-parallelized file in Hadoop

We always talk about how much faster it will be if we use Hadoop to parallelize our data and programs.
I would like to know: is it possible to keep a small file on one specific DataNode (not parallelized)?
possible to keep a small file on one specific DataNode
HDFS will try to split any file into HDFS blocks. The DataNodes don't store the entire file, nor should you try to store it on a particular one. Let Hadoop manage the data locality.
Your file will be replicated 3 times by default in Hadoop for fault tolerance anyway.
If you have small files (less than the HDFS block size, 64 or 128MB, depending on the Hadoop version), then you probably shouldn't be using Hadoop. If you need parallelized processing, start with multi-threading. If you actually need distributed processes, my recommendation nowadays would be Spark or Flink, not Hadoop (MapReduce).
If you want this, it seems like you want object storage, not block storage.
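If you are curious where HDFS actually placed the blocks of a file, you can ask the NameNode through the FileSystem API. A small Scala sketch (the path is hypothetical):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
val fs = FileSystem.get(new Configuration())
val status = fs.getFileStatus(new Path("/user/me/small-file.txt"))
// Ask the NameNode which DataNodes hold each block of the file.
val blocks = fs.getFileBlockLocations(status, 0, status.getLen)
blocks.foreach(block => println(block.getHosts.mkString(", ")))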

Are mappers and reducers involved while uploading/inserting data to HDFS?

I am quite confused here. When we upload/insert/put data into Hadoop HDFS, it is known that the data is stored in chunks based on the block size and replication factor. Moreover, MapReduce only comes into play when processing the data.
I'm using MRv2, and when I insert any data into one of my tables I can see a MapReduce progress bar. So what is the exact picture here? In reality, are there mappers and reducers involved while inserting/uploading data to HDFS?
Need for MapReduce depends on the type of Write operation.
Operations like hdfs dfs -put or -copyFromLocal do not use MapReduce when writing data from the local FS to HDFS, whereas DistCp, which performs inter/intra-cluster HDFS data copying, uses mappers. Similarly, Sqoop uses mappers to import data into HDFS. Hive's LOAD statements do not, while its INSERTs do.
And they are Mapper only MapReduce jobs.
I'm using MRv2, and when I insert any data into one of my tables
I assume, you are inserting data into a Hive table. INSERT statements in Hive use Mappers.
are there mappers and reducers involved while inserting/uploading the data to HDFS?
Not always. Based on the write operation, mappers are involved.
The HDFS client writes directly to the datanodes after consulting with the namenode for block locations. No mappers or reducers are required.
Ref: Architecture of HDFS Read and Write
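To make that concrete, here is a small Scala sketch of a plain HDFS client write through the FileSystem API (the path is hypothetical). The client asks the NameNode where to place the blocks and then streams directly to the DataNodes; no MapReduce job is launched:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
val fs = FileSystem.get(new Configuration())
// create() consults the NameNode for block placement, then the client
// streams bytes straight to the chosen DataNodes - no mappers or reducers.
val out = fs.create(new Path("/user/me/uploaded.txt"))
out.writeBytes("hello hdfs\n")
out.close()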
Just because there's a progress bar doesn't mean it's a MapReduce process.
If every file written to HDFS were a MapReduce job, then the YARN ResourceManager UI would log it all, so if you don't believe me, check there.
MapReduce is not used when you copy data from local or put data in HDFS.

Spark Fundamentals

I am new to Spark... some basic things are not clear to me when going through the fundamentals:
Query 1. For distributed processing - can Spark work without HDFS (the Hadoop file system) on a cluster (e.g. by creating its own distributed file system), or does it require some base distributed file system in place as a prerequisite, like HDFS, GPFS, etc.?
Query 2. If we already have a file loaded in HDFS (as distributed blocks) - then will Spark again convert it into blocks and redistribute it at its level (for distributed processing), or will it just use the block distribution as laid out by the Hadoop HDFS cluster?
Query 3. Other than defining a DAG, does Spark also create the partitions like MapReduce does and shuffle partitions to the reducer nodes for further computation?
I am confused about this: up to DAG creation it's clear that the Spark executor on each worker node loads data blocks into memory as RDDs and the computation is applied as per the DAG... but where does the part happen that partitions the data by key and moves it to the other nodes where the reducer task will be performed (just like MapReduce)? How is that done in-memory?
This would be better asked as separate questions and question 3 is hard to understand. Anyway:
No, Spark does not require a distributed file system.
By default Spark will create one partition per HDFS block, and will co-locate computation with the data if possible.
You're asking about shuffle. Shuffle creates blocks on the mappers that the reducers will fetch from them. The spark.shuffle.memoryFraction parameter controls how much memory to allocate to shuffle block files. (20% by default.) The spark.shuffle.spill parameter controls whether to spill shuffle blocks to local disk when the memory runs out.
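As a short sketch, those two settings can be tuned through SparkConf (the values are illustrative, and note they apply to older Spark releases, before the unified memory manager replaced spark.shuffle.memoryFraction):
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf()
  .setAppName("shuffle-tuning-sketch")
  .set("spark.shuffle.memoryFraction", "0.3") // raise the shuffle buffer from the 0.2 default
  .set("spark.shuffle.spill", "true")         // spill shuffle data to local disk when memory runs out
val sc = new SparkContext(conf)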
Query 1. For distributed processing - can Spark work without HDFS?
For distributed processing, Spark does not require HDFS, but it may read/write data from/to HDFS. For some use cases it may choose to write data to HDFS; for example, the TeraSort world-record program used HDFS for sorting the data instead of doing it all in memory.
Spark doesn't provide distributed storage. Integration with HDFS is one option for storage, but Spark can also use other storage systems like Cassandra, etc. Have a look at this article for more details: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
Query 2. If we already have a file loaded in HDFS (as distributed blocks) - then will Spark again convert it into blocks and redistribute it at its level?
I agree with Daniel Darabos's response. Spark will create one partition per HDFS block.
Query 3: on shuffle
Depending on the size of the data, the shuffle will be done in memory, or it may use disk (e.g. TeraSort), or it may use both. Have a look at this excellent article on Spark shuffle.
Fine with this. What if you don’t have enough memory to store the whole “map” output? You might need to spill intermediate data to the disk. Parameter spark.shuffle.spill is responsible for enabling/disabling spilling, and by default spilling is enabled
The amount of memory that can be used for storing “map” outputs before spilling them to disk is “JVM Heap Size” * spark.shuffle.memoryFraction * spark.shuffle.safetyFraction, with default values it is “JVM Heap Size” * 0.2 * 0.8 = “JVM Heap Size” * 0.16.
Query 1.
Yes, it can work with others as well. Spark works with RDDs; if you have a corresponding RDD implementation, that's it. When you actually create an RDD by opening a file in HDFS, it inherently creates a HadoopRDD, which has the implementation for understanding HDFS. If you write your own distributed file system, you can write your own RDD implementation for it and instantiate that class - done. Writing the connector RDD to your own DFS is the challenge, though. For more, you can look at the RDD interface in the Spark code.
Query 2. It won't re-create the blocks; instead, by means of the Hadoop/HDFS RDD connector, it knows where the blocks are. It will also try to use the same YARN nodes to run the JVM tasks that do the processing.
Query 3. Not sure about this.
Query 1: Simply put, Spark provides distributed processing thanks to the RDD abstraction (resilient distributed dataset), but without HDFS it cannot provide distributed storage.
Query 2: No, it won't re-create them. Spark will treat every block as a partition (i.e. a reference to that block), so it launches the YARN task on the same block.
Query 3: no idea.

(HDFS) How to copy large data safely within a cluster?

I've got to make big sample data (say 1 TB) and I have text files of approximately 20 GB.
So I tried to just copy them 50 times to make the data that much bigger, but every time I tried the hadoop fs -cp command, some of my DataNodes died.
I heard that in UNIX, when deleting large data, one can use SHRINK to safely remove the data from disk. Is there something like that in Hadoop for copying large data?
In short, is there any way to copy large data safely within a Hadoop cluster?
Or do I have to modify some configuration files?
Try DistCp. It runs an MR job under the hood for copying data, allowing you to leverage the parallelism provided by Hadoop.
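For example, to make an extra copy of a hypothetical 20 GB sample file (paths are illustrative):
hadoop distcp /user/me/sample-20g /user/me/sample-20g-copy1
DistCp is a map-only job, so the copy work is spread across the cluster's nodes instead of being funneled through a single client the way hadoop fs -cp is.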
