(HDFS) How to copy large data safely within a cluster?

I need to make a big sample data set (say 1 TB) and I have a text file of approximately 20 GB.
So I tried to simply copy that file 50 times to make it that big, but every time I run the hadoop fs -cp command, some of my datanodes die.
I heard that on UNIX, when deleting large data, one can use SHRINK to safely remove the data from disk. Is there something like that in Hadoop for copying large data?
In short, is there any way to copy large data safely within a Hadoop cluster?
Or do I have to modify some configuration files?

Try distcp. It runs an MR job under the hood to copy the data, letting you leverage the parallelism provided by Hadoop.
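For example, a copy within the same cluster might look like the sketch below (the paths and the mapper count are placeholders, not taken from the question):
hadoop distcp -m 10 /user/me/sample-20g /user/me/sample-copy-01
The -m option caps the number of simultaneous map tasks, so you can throttle how hard the copy hits the datanodes if they were dying under the load of a single big hadoop fs -cp.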

Related

How to speed up retrieval of a large number of small files from HDFS

I am trying to copy a Parquet file from a Hadoop cluster to an edge node using hadoop fs -get. The Parquet file is around 2.4 GB in size but is made up of thousands of files, each around 2 KB in size. This process is taking forever.
Is there something I can do to speed up the process, maybe increase the concurrency?
I do not own the cluster and cannot make configuration changes to it.
You can try distcp rather than the -get command, provided the cluster where you are running the command has MapReduce support:
https://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html#Basic_Usage
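If MapReduce is not available to you there, another (hedged) option is to drive several plain -get processes in parallel from the edge node; the paths and the parallelism of 8 below are placeholders:
hadoop fs -ls /data/myfile.parquet | awk '{print $NF}' | grep part- | xargs -P8 -I{} hadoop fs -get {} /local/target/
Each hadoop fs invocation pays a JVM startup cost, which is a large part of why thousands of 2 KB files are slow to fetch one by one; running several fetches at once hides much of that latency.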

Is it possible to create/work with a non-parallelized file in Hadoop

We always talk about how much faster things will be if we use Hadoop to parallelize our data and programs.
I would like to know: is it possible to keep a small file on one specific datanode (not parallelized)?
possible to keep a small file in one specific dataNode
HDFS will try to split any file into HDFS blocks. No single datanode stores the entire file, nor should you attempt to pin it to a particular one. Let Hadoop manage the data locality.
Your file will be replicated 3 times by default in Hadoop for fault tolerance anyway.
If you have small files (less than the HDFS block size, 64 or 128 MB depending on the Hadoop version), then you probably shouldn't be using Hadoop at all. If you need parallelized processing, start with multi-threading. If you actually need distributed processing, my recommendation nowadays would be Spark or Flink, not Hadoop (MapReduce).
If you want this, it sounds like you want object storage, not block storage.
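To see how HDFS actually splits and places a file, you can ask the namenode where the blocks live (the path below is hypothetical):
hdfs fsck /user/me/somefile.txt -files -blocks -locations
The output lists every block of the file together with the datanodes holding its replicas, which makes it clear that placement is decided by HDFS, not by the user.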

HDFS small file design

I want to be able to store millions of small files (binary files: images, exe, etc., ~1 MB each) on HDFS. My requirements are basically to be able to query random files, not to run MapReduce jobs.
The main problem for me is the namenode memory issue, not the MapReduce mappers issue.
So my options are:
HAR files - aggregate the small files and only then save them, recording their har:// paths in another place
Sequence files - append the files as they come in; this is more suitable for MapReduce jobs, so I have pretty much eliminated it
HBase - saving the small files to HBase is another solution described in a few articles on Google
I guess I'm asking whether there is anything I missed. Can I achieve what I need by appending the binary files to big Avro/ORC/Parquet files and then querying them by name or by hash from a Java client program?
Thanks,
If you append multiple files into large files, then you'll need to maintain an index of which large file each small file resides in. This is basically what HBase does for you: it combines data into large files, stores them in HDFS, and uses sorting on keys to support fast random access. It sounds to me like HBase would suit your needs, and if you hand-rolled something yourself, you might end up redoing a lot of work that HBase already does.
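For completeness, the HAR option from the question looks roughly like this (the directory and archive names are made up):
hadoop archive -archiveName images.har -p /user/me/small-files /user/me/archives
hadoop fs -ls har:///user/me/archives/images.har
This collapses many namenode entries into a few per archive, but lookups go through the archive's index files rather than a key-sorted store like HBase, and HAR archives are immutable once created.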

Spark coalesce vs HDFS getmerge

I am developing a program in Spark. I need to have the results in a single file, so there are two ways to merge the result:
Coalesce (Spark):
myRDD.coalesce(1, false).saveAsTextFile(pathOut);
Merge it afterwards in HDFS:
hadoop fs -getmerge pathOut localPath
Which one is more efficient and quicker?
Is there any other method of merging the files in HDFS (like getmerge) that saves the result to HDFS instead of fetching it to a local path?
If you are sure your data fits in memory, coalesce is probably the best option, but otherwise, to avoid an OOM error, I would use getmerge or, if you are using Scala/Java, the copyMerge API function from the FileUtil class.
Check this thread on the Spark user mailing list.
If you're processing a large dataset (and I assume you are), I would recommend letting Spark write each partition to its own "part" file in HDFS and then using hadoop fs -getmerge to extract a single output file from the HDFS directory.
Spark splits the data up into partitions for efficiency, so it can distribute the workload among many worker nodes. If you coalesce to a small number of partitions, you reduce its ability to distribute the work, and with just 1 partition you're putting all the work on a single node. At best this will be slower, at worst it will run out of memory and crash the job.
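As for the last part of the question, keeping the merged result in HDFS instead of pulling it to a local path, one hedged option using only standard shell commands is to stream the part files through the client and write them back as a single HDFS file (paths are placeholders):
hadoop fs -cat pathOut/part-* | hadoop fs -put - /user/me/merged/result.txt
Like getmerge, this funnels all the data through the one machine running the command, so it is fine for modest result sizes but it is not a distributed merge.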

Hadoop as Data Archive System

I am analyzing the possibility of using Hadoop (HDFS) as a data archival solution, which gives linear scalability and a lower maintenance cost per terabyte.
Please let me know your recommendations and the set of parameters (I/O, memory, disk) that have to be analyzed to evaluate Hadoop as a data archival system.
On a related query, while trying to upload a 500 MB file using the Hadoop shell:
$ # Create a 500MB file using dd
$ dd if=/dev/zero of=500MBFile.txt bs=524288000 count=1
$ hadoop fs -Ddfs.block.size=67108864 -copyFromLocal 500MBFile.txt /user/cloudera/
Please let me know why the input file is not getting split based on the block size (64 MB). This will be good to understand, since as part of data archival, if we get a 1 TB file, how will it be split and distributed across the cluster?
I've tried the exercise using a single-node Cloudera Hadoop setup with a replication factor of 1.
Thanks again for your great response.
You can use HDFS as an archiving/storage solution, though I doubt it is optimal. Specifically, it is not as highly available as, say, OpenStack Swift, and it is not well suited to storing small files.
At the same time, if HDFS is your choice, I would suggest building the cluster with storage-oriented nodes. I would describe them as follows:
a) Use large, slow SATA disks. Since the data is not going to be read/written constantly, desktop-grade disks might do; that will be a major saving.
b) Use minimal memory - I would suggest 4 GB. It will not add much cost, but still enables occasional MR processing.
c) A single CPU will do.
Regarding copyFromLocal: yes, the file does get split according to the defined block size.
Distribution will be even across the cluster, taking the replication factor into account. HDFS will also try to put each block on more than one rack.
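If your version of the shell supports format strings for -stat, a quick hedged check of the uploaded file is:
hadoop fs -stat "%o %r" /user/cloudera/500MBFile.txt
With dfs.block.size=67108864 this should print 67108864 1, and a 500 MB file is then stored as ceil(500 / 64) = 8 blocks; a plain hadoop fs -ls only shows the logical file, which may be why the split is not obvious from the listing.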
You can load the file in .har format.
You can get more details here: Hadoop Archives
A few inputs:
Consider compression in your solution. It looks like you will be using text files, so you can achieve around 80% compression.
Make sure you select a Hadoop-friendly (i.e. splittable) compression codec.
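As a minimal sketch of the splittable-compression point (file names are illustrative): bzip2 output is one of the few compressed formats MapReduce can split, so an archived text file stays processable block by block.
bzip2 -k 500MBFile.txt
hadoop fs -copyFromLocal 500MBFile.txt.bz2 /user/cloudera/archive/
Gzip, by contrast, is not splittable, so a 1 TB .gz file would have to be read by a single mapper; LZO becomes splittable only after building an index for it.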
