How to reduce number of output files in Apache Hive - hadoop

Does anyone know of a tool that can "crunch" the output files of Apache Hadoop into fewer files or one file. Currently I am downloading all the files to a local machine and the concatenate them in one file. So does anyone know of an API or a tool that does the same.
Thanks in advance.

Limiting the number of output files means you want to limit the number of reducers. You could do that with the help of mapred.reduce.tasks property from the Hive shell. Example :
hive> set mapred.reduce.tasks = 5;
But it might affect the performance of your query. Alternatively, you could use getmerge command from the HDFS shell once you are done with your query. This command takes a source directory and a destination file as input and concatenates files in src into the destination local file.
Usage :
bin/hadoop fs -getmerge <src> <localdst>
HTH

See https://community.cloudera.com/t5/Support-Questions/Hive-Multiple-Small-Files/td-p/204038
set hive.merge.mapfiles=true; -- Merge small files at the end of a map-only job.
set hive.merge.mapredfiles=true; -- Merge small files at the end of a map-reduce job.
set hive.merge.size.per.task=???; -- Size (bytes) of merged files at the end of the job.
set hive.merge.smallfiles.avgsize=??? -- File size (bytes) threshold
-- When the average output file size of a job is less than this number,
-- Hive will start an additional map-reduce job to merge the output files
-- into bigger files. This is only done for map-only jobs if hive.merge.mapfiles
-- is true, and for map-reduce jobs if hive.merge.mapredfiles is true.

Related

Merging small files into single file in hdfs

In a cluster of hdfs, i receive multiple files on a daily basis which can be of 3 types :
1) product_info_timestamp
2) user_info_timestamp
3) user_activity_timestamp
The number of files received can be of any number but they will belong to one of these 3 categories only.
I want to merge all the files(after checking whether they are less than 100mb) belonging to one category into a single file.
for eg: 3 files named product_info_* should be merged into one file named product_info.
How do i achieve this?
You can use getmerge toachieve this, but the result will be stored in your local node (edge node), so you need to be sure you have enough space there.
hadoop fs -getmerge /hdfs_path/product_info_* /local_path/product_inf
You can move them back to hdfs with put
hadoop fs -put /local_path/product_inf /hdfs_path
You can use hadoop archive (.har file) or sequence file. It is very simple to use - just google "hadoop archive" or "sequence file".
Another set of commands along the similar lines as suggested by #SCouto
hdfs dfs -cat /hdfs_path/product_info_* > /local_path/product_info_combined.txt
hdfs dfs -put /local_path/product_info_combined.txt /hdfs_path/

Hadoop : Using Pig to add text at the end of every line of a hdfs file

We have files in HDFS with raw logs, each individual log is a line as these logs are line separated.
Our requirement is that to add a text (' 12345' for e.g. ) by the end of every log in these files ... using pig / hadoop command / or any other map reduce based tool.
Please advice
Thanks
AJ
Load the files where each log entry is loaded into one field i.e. line:chararray and use CONCAT to add the text to each line.Store it into new log file.If you want the individual files then you will have to parameterize the script to load each file and store into a new file instead of wildcard load.
Log = LOAD '/path/wildcard/*.log' USING TextLoader(line:chararray);
Log_Text = FOREACH Log GENERATE CONCAT(line,'Your Text') as newline;
STORE Log_Text INTO /path/NewLog.log';
If your files aren't extremely large, you can do that with a single shell command.
hdfs dfs -cat /user/hdfs/logfile.log | sed 's/$/12345/g' |\
hdfs dfs -put - /user/hdfs/newlogfile.txt

Concatenating multiple text files into one very large file in HDFS

I have the multiple text files.
The total size of them exceeds the largest disk size available to me (~1.5TB)
A spark program reads a single input text file from HDFS. So I need to combine those files into one. (I cannot re-write the program code. I am given only the *.jar file for execution)
Does HDFS have such a capability? How can I achieve this?
What I understood from your question is you want to Concatenate multiple files into one. Here is a solution which might not be the most efficient way of doing it but it works. suppose you have two files: file1 and file2 and you want to get a combined file as ConcatenatedFile
.Here is the script for that.
hadoop fs -cat /hadoop/path/to/file/file1.txt /hadoop/path/to/file/file2.txt | hadoop fs -put - /hadoop/path/to/file/Concatenate_file_Folder/ConcatenateFile.txt
Hope this helps.
HDFS by itself does not provide such capabilities. All out-of-the-box features (like hdfs dfs -text * with pipes or FileUtil's copy methods) use your client server to transfer all data.
In my experience we always used our own written MapReduce jobs to merge many small files in HDFS in distributed way.
So you have two solutions:
Write your own simple MapReduce/Spark job to combine text files with
your format.
Find already implemented solution for such kind of
purposes.
About solution #2: there is the simple project FileCrush for combining text or sequence files in HDFS. It might be suitable for you, check it.
Example of usage:
hadoop jar filecrush-2.0-SNAPSHOT.jar crush.Crush -Ddfs.block.size=134217728 \
--input-format=text \
--output-format=text \
--compress=none \
/input/dir /output/dir 20161228161647
I had a problem to run it without these options (especially -Ddfs.block.size and output file date prefix 20161228161647) so make sure you run it properly.
You can do a pig job:
A = LOAD '/path/to/inputFiles' as (SCHEMA);
STORE A into '/path/to/outputFile';
Doing a hdfs cat and then putting it back to hdfs means, all this data is processed in the client node and will degradate your network

Pig script to compress and decompress the hdfs data in bzip2

How to compress hdfs data to bzip2 using pig such that on decompression it should give the same dir structure which it had initially.I am new to pig.
I tried to compress with bzip2 but it generated many files due to many mappers being spawned and hence reverting back to plain text file(initial form) in the same dir structure becomes difficult.
Just like how in unix if we compress bzip2 using tarball and then after decompression of bzip2.tar gives me exactly same data and folder structure which it had initially.
eg Compression:- tar -cjf compress_folder.tar.bz2 compress_folder/
Decompression:- tar -jtvf compress_folder.tar.bz2
will give exactly same dir st.
Approach 1:
you can try running one reducer to store only 1 file on hdfs. but compromise will be performance here.
set default_parallel 1;
to compress data, set these parameters in pig script , if not tried this way:-
set output.compression.enabled true;
SET mapred.output.compression.codec 'org.apache.hadoop.io.compress.BZip2Codec';
just use JsonStorage while storing file
STORE file INTO '/user/hduser/data/usercount' USING JsonStorage();
Eventually you also want to read data, use TextLoader
data = LOAD '/user/hduser/data/usercount/' USING TextLoader;
Approach 2:
filecrush: file merge utility available at #Mr. github

Hive - Possible to get total size of file parts in a directory?

Background:
I have some gzip files in a HDFS directory. These files are named in the format yyyy-mm-dd-000001.gz, yyyy-mm-dd-000002.gz and so on.
Aim:
I want to build a hive script which produces a table with the columns: Column 1 - date (yyyy-mm-dd), Column 2 - total file size.
To be specific, I would like to sum up the sizes of all of the gzip files for a particular date. The sum will be the value in Column 2 and the date in Column 1.
Is this possible? Are there any in-built functions or UDFs that could help me with my use case?
Thanks in advance!
A MapReduce job for this doesn't seem efficient since you don't actually have to load any data. Plus, doing this seems kind of awkward in Hive.
Can you write a bash script or python script or something like that to parse the output of hadoop fs -ls? I'd imagine something like this:
$ hadoop fs -ls mydir/*gz | python datecount.py | hadoop fs -put - counts.txt

Resources