What is the easiest way to combine small HDFS blocks? - hadoop

I'm collecting logs with Flume into HDFS. For the test case I end up with small files (~300 kB), because the log-collection process was scaled for real production usage.
Is there any easy way to combine these small files into larger ones which are closer to the HDFS block size (64MB)?

The GNU coreutils split could do the work.
If the source data are lines - in my case they are - and one line is around 84 bytes, then a 64 MB HDFS block can hold around 800,000 lines (64 * 1024 * 1024 / 84 ≈ 800,000):
hadoop dfs -cat /sourcedir/* | split --lines=800000 - joined_
hadoop dfs -copyFromLocal ./joined_* /destdir/
or with the --line-bytes option:
hadoop dfs -cat /sourcedir/* | split --line-bytes=67108864 - joined_
hadoop dfs -copyFromLocal ./joined_* /destdir/
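If you would rather not pipe the data through the local machine, roughly the same idea can be sketched directly against the HDFS Java API. This is a hypothetical, single-threaded sketch - the paths, the 64 MB target and the class name are placeholders, not anything from the question:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class JoinSmallFiles {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        long target = 64L * 1024 * 1024;  // aim for ~64 MB per output file
        long written = 0;
        int part = 0;
        FSDataOutputStream out = fs.create(new Path("/destdir/joined_" + part));
        for (FileStatus status : fs.listStatus(new Path("/sourcedir"))) {
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(status.getPath()), "UTF-8"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    byte[] bytes = (line + "\n").getBytes("UTF-8");
                    if (written + bytes.length > target) {
                        out.close();  // roll over to the next ~block-sized output file
                        out = fs.create(new Path("/destdir/joined_" + (++part)));
                        written = 0;
                    }
                    out.write(bytes);
                    written += bytes.length;
                }
            }
        }
        out.close();
    }
}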

My current solution is to write a MapReduce job that effectively does nothing, while having a limited number of reducers. Each reducer outputs a file, so together they concatenate the small inputs into larger ones. You can add the name of the original file to each line to help show where it came from.
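A minimal sketch of such a do-nothing job, assuming plain-text input (the class name, paths and reducer count below are placeholders, not the actual job):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ConcatSmallFiles {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "concat-small-files");
        job.setJarByClass(ConcatSmallFiles.class);
        // The stock Mapper and Reducer are identity functions, so records pass through unchanged.
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);
        // One output file per reducer; pick a count that yields roughly block-sized files.
        job.setNumReduceTasks(4);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path("/sourcedir"));
        FileOutputFormat.setOutputPath(job, new Path("/destdir"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Note that with the default TextInputFormat/TextOutputFormat each output line is prefixed with its original byte offset (the map key); to tag lines with the source file name instead, as suggested above, you would replace the identity mapper with a small custom one.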
I'm still interested in hearing if there is a standard or proven best way of doing this that I am not aware of.

You should take a look at File Crush, open-sourced by media6degrees. It might be a little outdated, but you can download the source and make your changes and/or contribute. The JAR and source are at: http://www.jointhegrid.com/hadoop_filecrush/index.jsp
This is essentially a map-reduce technique for merging small files.

Related

How to speed up retrieval of a large number of small files from HDFS

I am trying to copy a Parquet file from a Hadoop cluster to an edge node, using hadoop fs -get. The Parquet file is around 2.4 GB in size but is made up of thousands of files, each around 2 kB in size. This process is taking forever.
Is there something I can do to speed up the process, maybe increase the concurrency?
I do not own the cluster and cannot make configuration changes to it.
You can try distcp rather than the -get command, provided the cluster where you are running the command has MapReduce support:
https://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html#Basic_Usage

Is it possible to create/work with a non-parallelized file in Hadoop

We always talk about how much faster it will be if we use Hadoop to parallelize our data and programs.
I would like to know: is it possible to keep a small file on one specific DataNode (not parallelized)?
possible to keep a small file on one specific DataNode
HDFS will try to split any file into HDFS blocks. The DataNodes don't store the entire file, nor should you attempt to store it on a particular one. Let Hadoop manage the data locality.
Your file will be replicated 3 times by default in Hadoop for fault tolerance anyway.
If you have small files (less than the HDFS block size, 64 or 128 MB, depending on the Hadoop version), then you probably shouldn't be using Hadoop. If you need parallelized processing, start with multi-threading. If you actually need distributed processing, my recommendation nowadays would be Spark or Flink, not Hadoop (MapReduce).
If you want this, it seems like you want object storage, not block storage.

Spark coalesce vs HDFS getmerge

I am developing a program in Spark. I need to have the results in a single file, so there are two ways to merge the result:
Coalesce (Spark):
myRDD.coalesce(1, false).saveAsTextFile(pathOut);
Merge it afterwards in HDFS:
hadoop fs -getmerge pathOut localPath
Which one is more efficient and quicker?
Is there any other method to merge the files in HDFS (like getmerge) that saves the result to HDFS, instead of fetching it to a local path?
If you are sure your data fits in memory, coalesce is probably the best option; otherwise, to avoid an OOM error, I would use getmerge or - if you are using Scala/Java - the copyMerge API function from the FileUtil class (sketched below).
Check this thread on the Spark user mailing list.
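For completeness, a hedged sketch of the copyMerge route mentioned above (FileUtil.copyMerge exists in Hadoop 1.x/2.x and was removed in Hadoop 3; the paths are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class MergeParts {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Concatenate every part file under pathOut into one HDFS file in a
        // single pass; the bytes stream through the client, but nothing is
        // written to local disk.
        FileUtil.copyMerge(fs, new Path("/pathOut"),
                           fs, new Path("/pathOut-merged/result.txt"),
                           false,   // keep the source part files
                           conf,
                           null);   // no separator string between files
    }
}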
If you're processing a large dataset (and I assume you are), I would recommend letting Spark write each partition to its own "part" file in HDFS and then using hadoop fs -getmerge to extract a single output file from the HDFS directory.
Spark splits the data up into partitions for efficiency, so it can distribute the workload among many worker nodes. If you coalesce to a small number of partitions, you reduce its ability to distribute the work, and with just 1 partition you're putting all the work on a single node. At best this will be slower, at worst it will run out of memory and crash the job.

Merging compressed files on HDFS

How do I merge all files in a directory on HDFS, that I know are all compressed, into a single compressed file, without copying the data through the local machine? For example, but not necessarily, using Pig?
As an example, I have a folder /data/input that contains the files part-m-00000.gz and part-m-00001.gz. Now I want to merge them into a single file /data/output/foo.gz
I would suggest looking at FileCrush (https://github.com/edwardcapriolo/filecrush), a tool to merge files on HDFS using MapReduce. It does exactly what you described and provides several options to deal with compression and control the number of output files.
Crush --max-file-blocks XXX /data/input /data/output
max-file-blocks represents the maximum number of dfs blocks per output file. For example, according to the documentation:
With the default value 8, 80 small files, each being 1/10th of a dfs block, will be grouped into a single output file, since 80 * 1/10 = 8 dfs blocks. If there are 81 small files, each being 1/10th of a dfs block, two output files will be created. One output file will contain the combined contents of 41 files and the second will contain the combined contents of the other 40. A directory of many small files will be converted into a smaller number of larger files, where each output file is roughly the same size.
If you set PARALLEL to 1, then you will get a single output file.
This can be done in 2 ways:
in your Pig script, add set default_parallel 20; but note that this affects everything in your script
change PARALLEL for a single operation - like DISTINCT ID PARALLEL 1;
You can read more about Pig's Parallel Features.
I know there's an option to merge to the local filesystem using the "hdfs dfs -getmerge" command. Perhaps you can use that to merge to the local filesystem and then use the "hdfs dfs -copyFromLocal" command to copy it back into HDFS.
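A hedged Java sketch of that local round trip, in case you prefer the FileSystem API over the shell (the local path is a placeholder; FileUtil.copyMerge was removed in Hadoop 3):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class MergeViaLocal {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);
        FileSystem local = FileSystem.getLocal(conf);
        // Roughly "hdfs dfs -getmerge /data/input /tmp/foo.gz": concatenate the
        // part files onto local disk. Gzip allows concatenated members, so the
        // result is still readable as a single .gz file.
        FileUtil.copyMerge(hdfs, new Path("/data/input"),
                           local, new Path("/tmp/foo.gz"),
                           false, conf, null);
        // Roughly "hdfs dfs -copyFromLocal /tmp/foo.gz /data/output/foo.gz".
        hdfs.copyFromLocalFile(new Path("/tmp/foo.gz"), new Path("/data/output/foo.gz"));
    }
}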

Splitting BZip2 is not working

I have a 1.2 GB file in Hadoop, compressed with the BZip2 codec. Our Hadoop YARN cluster has 10 nodes. The HDFS block size is 128 MB, so I think the file is split into about 10 blocks. BZip2 should be a splittable codec, so I thought that when I start processing the input file, Hadoop would execute 10 map tasks (one for each block). But when I look at the job logs, I can see only one map task.
I did not find any setting which limits the number of mappers in YARN (in contrast with Hadoop 1).
What am I missing or what am I doing wrong?
Thank you
I've never used BZip2, but I think this issue may have to do with your FileInputFormat. You may also need to configure your FileInputFormat; please take a look at this answer.
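For what it's worth, the check that makes BZip2 splittable lives in the input format: TextInputFormat only splits a compressed file when its codec implements SplittableCompressionCodec. A custom input format that overrides isSplitable to return false (as some formats do to be safe with compression) produces one split per file, which would match the single map task you are seeing. The sketch below mirrors the check TextInputFormat itself performs - it is illustrative, not the asker's code:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplittableAwareInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        CompressionCodec codec =
            new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
        // Uncompressed files can always be split; compressed files only if the
        // codec supports splitting (BZip2Codec implements SplittableCompressionCodec).
        return codec == null || codec instanceof SplittableCompressionCodec;
    }
}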
