Pig script to compress and decompress the hdfs data in bzip2 - hadoop

How can I compress HDFS data to bzip2 using Pig so that decompressing it restores the directory structure it had initially? I am new to Pig.
I tried compressing with bzip2, but because many mappers were spawned the job generated many output files, which makes it difficult to revert to the plain text files (the initial form) in the same directory structure.
I want something like Unix, where compressing a folder into a bzip2 tarball and then extracting the .tar.bz2 gives back exactly the same data and folder structure it had initially.
e.g. Compression: tar -cjf compress_folder.tar.bz2 compress_folder/
Decompression: tar -xjf compress_folder.tar.bz2
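For comparison, the local round trip described above can be sketched in Python with the standard tarfile module. This only illustrates the behaviour being asked for (same files, same folder layout after extraction). It is not an HDFS or Pig solution, and the function names are made up for the example.

```python
import os
import tarfile


def compress_dir(src_dir, archive_path):
    """Pack src_dir recursively into a bzip2-compressed tarball."""
    with tarfile.open(archive_path, "w:bz2") as tar:
        # Store paths relative to the folder name, like `tar -cjf x.tar.bz2 folder/`.
        tar.add(src_dir, arcname=os.path.basename(os.path.normpath(src_dir)))


def extract_archive(archive_path, dest_dir):
    """Unpack a .tar.bz2 archive into dest_dir, like `tar -xjf`."""
    with tarfile.open(archive_path, "r:bz2") as tar:
        tar.extractall(dest_dir)


def relative_files(root):
    """Return the sorted list of file paths under root, relative to root."""
    found = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            found.append(os.path.relpath(os.path.join(dirpath, name), root))
    return sorted(found)
```

Comparing relative_files() on the source folder and on the extracted folder is a quick way to confirm that the structure survived the round trip.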
Approach 1:
You can try running a single reducer so that only one file is stored on HDFS, but this comes at the cost of performance.
set default_parallel 1;
To compress the data, set these parameters in the Pig script, if you have not already tried it this way:
set output.compression.enabled true;
SET mapred.output.compression.codec 'org.apache.hadoop.io.compress.BZip2Codec';
Use JsonStorage when storing the file:
STORE file INTO '/user/hduser/data/usercount' USING JsonStorage();
When you later want to read the data back, use TextLoader:
data = LOAD '/user/hduser/data/usercount/' USING TextLoader;
Approach 2:
Concatenating multiple text files into one very large file in HDFS

I have multiple text files.
Their total size exceeds the largest disk size available to me (~1.5TB).
A Spark program reads a single input text file from HDFS, so I need to combine those files into one. (I cannot rewrite the program code. I am given only the *.jar file for execution.)
Does HDFS have such a capability? How can I achieve this?
As I understand your question, you want to concatenate multiple files into one. Here is a solution that might not be the most efficient, but it works. Suppose you have two files, file1 and file2, and you want a combined file named ConcatenatedFile. Here is the command for that:
hadoop fs -cat /hadoop/path/to/file/file1.txt /hadoop/path/to/file/file2.txt | hadoop fs -put - /hadoop/path/to/file/Concatenate_file_Folder/ConcatenateFile.txt
Hope this helps.
HDFS by itself does not provide such a capability. All out-of-the-box features (like hdfs dfs -text * with pipes, or FileUtil's copy methods) transfer all the data through your client machine.
In my experience, we always wrote our own MapReduce jobs to merge many small files in HDFS in a distributed way.
So you have two solutions:
1. Write your own simple MapReduce/Spark job to combine text files with your format.
2. Find an already implemented solution for this kind of task.
About solution #2: there is a simple project, FileCrush, for combining text or sequence files in HDFS. It might be suitable for you, so check it out.
Example of usage:
hadoop jar filecrush-2.0-SNAPSHOT.jar crush.Crush -Ddfs.block.size=134217728 \
--input-format=text \
--output-format=text \
--compress=none \
/input/dir /output/dir 20161228161647
I had problems running it without these options (especially -Ddfs.block.size and the output file date prefix 20161228161647), so make sure you pass them.
You can use a Pig job:
A = LOAD '/path/to/inputFiles' as (SCHEMA);
STORE A into '/path/to/outputFile';
Doing an hdfs cat and then putting the result back into HDFS means that all of this data passes through the client node, which will degrade your network.

Collecting Parquet data from HDFS to local file system

Given a Parquet dataset distributed on HDFS (a metadata file + many .parquet parts), how do I correctly merge the parts and collect the data onto the local file system? dfs -getmerge ... doesn't work: it merges the metadata with the actual Parquet files.
There is a way involving the Apache Spark APIs, which provides a solution, but a more efficient method without third-party tools may exist.
spark> val parquetData = sqlContext.parquetFile("pathToMultipartParquetHDFS")
spark> parquetData.repartition(1).saveAsParquetFile("pathToSinglePartParquetHDFS")
bash> ../bin/hadoop dfs -get pathToSinglePartParquetHDFS localPath
Since Spark 1.4 it's better to use DataFrame::coalesce(1) instead of DataFrame::repartition(1)
You may use Pig:
A = LOAD '/path/to parquet/files' USING parquet.pig.ParquetLoader as (x,y,z) ;
STORE A INTO 'xyz path' USING PigStorage('|');
You may create an Impala table on top of it and then use
impala-shell -e "query" -o <output>
In the same way, you may use MapReduce as well.
You may use parquet-tools:
java -jar parquet-tools.jar merge source/ target/

Data retention in Hadoop HDFS

We have a Hadoop cluster with over 100TB of data in HDFS. I want to delete data older than 13 weeks in certain Hive tables.
Is there any tool or way I can achieve this?
To delete data older than a certain time frame, you have a few options.
First, if the Hive table is partitioned by date, you could simply DROP the partitions within Hive and remove their underlying directories.
A second option would be to run an INSERT into a new table, filtering out the old data using a datestamp (if available). This is likely not a good option since you have 100TB of data.
A third option would be to recursively list the data directories for your Hive tables: hadoop fs -lsr /path/to/hive/table. This outputs a list of the files and their modification times. You can take this output, extract the date, and compare it against the time frame you want to keep. If a file is older than you want to keep, run hadoop fs -rm <file> on it.
A fourth option would be to grab a copy of the FSImage: curl --silent "http://<active namenode>:50070/getimage?getimage=1&txid=latest" -o hdfs.image. Next, turn it into a text file: hadoop oiv -i hdfs.image -o hdfs.txt. The text file will contain a text representation of HDFS, the same as what hadoop fs -ls ... would return.
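As a rough sketch of the third option, the following Python parses a recursive listing and picks out the files older than 13 weeks. It assumes the usual eight-column listing format (permissions, replication, owner, group, size, date, time, path). The function names are hypothetical, and the resulting paths would still have to be passed to hadoop fs -rm separately.

```python
from datetime import datetime, timedelta


def parse_ls_line(line):
    """Parse one listing line into (is_dir, mtime, path), or None if it is not an entry."""
    parts = line.split(None, 7)
    if len(parts) < 8:
        return None  # e.g. the "Found N items" header line
    perms, _repl, _owner, _group, _size, date, time, path = parts
    try:
        mtime = datetime.strptime(date + " " + time, "%Y-%m-%d %H:%M")
    except ValueError:
        return None  # columns did not match the assumed format
    return perms.startswith("d"), mtime, path.rstrip("\n")


def files_older_than(listing_lines, now, weeks=13):
    """Return paths of regular files last modified more than `weeks` weeks before `now`."""
    cutoff = now - timedelta(weeks=weeks)
    old = []
    for line in listing_lines:
        parsed = parse_ls_line(line)
        if parsed is None:
            continue
        is_dir, mtime, path = parsed
        if not is_dir and mtime < cutoff:
            old.append(path)
    return old
```

Directories are skipped on purpose, so only individual files are selected. Removing empty partition directories afterwards is left out of the sketch.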

Convert multiple .deflate files into one gzip file in ubuntu

I ran a Hadoop job that generated multiple .deflate files. These files are now stored on S3, so I cannot run the hadoop fs -text /somepath command on them, since it takes an HDFS path. I want to convert the multiple .deflate files stored on S3 into one gzip file.
If you make gzip files instead, using the GzipCodec, you can simply concatenate them to make one large gzip file.
You can wrap a deflate stream with a gzip header and trailer, as described in RFC 1952: a fixed 10-byte header, and an 8-byte trailer that is computed from the uncompressed data. You will therefore need to decompress each .deflate stream in order to compute its CRC-32 and uncompressed length to put in the trailer.
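A minimal Python sketch of that wrapping follows. It assumes each .deflate file is a zlib-wrapped stream (RFC 1950: a 2-byte header, the raw deflate data, then a 4-byte Adler-32), which is worth verifying on your own files. For raw deflate input you would skip the slicing and decompress with wbits=-15 instead. The function names are made up for the example.

```python
import gzip
import struct
import zlib

# Minimal RFC 1952 header: magic 1f 8b, CM=8 (deflate), no flags,
# MTIME=0, XFL=0, OS=255 (unknown).
GZIP_HEADER = b"\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\xff"


def zlib_to_gzip_member(zlib_bytes):
    """Re-wrap one zlib stream as a gzip member without recompressing it."""
    if zlib_bytes[1] & 0x20:
        raise ValueError("preset dictionary (FDICT) streams are not handled")
    # Decompress only to compute the trailer fields; this also checks the Adler-32.
    plain = zlib.decompress(zlib_bytes)
    raw_deflate = zlib_bytes[2:-4]  # drop the zlib header and the Adler-32 trailer
    trailer = struct.pack(
        "<II", zlib.crc32(plain) & 0xFFFFFFFF, len(plain) & 0xFFFFFFFF
    )
    return GZIP_HEADER + raw_deflate + trailer


def deflate_files_to_gzip(zlib_streams):
    """Concatenate re-wrapped members; a multi-member gzip file is still valid gzip."""
    return b"".join(zlib_to_gzip_member(s) for s in zlib_streams)
```

Because gzip readers treat concatenated members as one stream, decompressing the result yields the inputs' contents back to back, which is the same property that makes plain concatenation work for files written with GzipCodec.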

how to prevent hadoop corrupted .gz file

I'm using the following simple code to upload files to HDFS.
FileSystem hdfs = FileSystem.get(config);
hdfs.copyFromLocalFile(src, dst);
The files are generated by a web server Java component and are rotated and closed by logback in .gz format. I've noticed that sometimes a .gz file is corrupted.
> gunzip logfile.log_2013_02_20_07.close.gz
gzip: logfile.log_2013_02_20_07.close.gz: unexpected end of file
But the following command does show me the content of the file
> hadoop fs -text /input/2013/02/20/logfile.log_2013_02_20_07.close.gz
The impact of having such files is quite disastrous, since the aggregation for the whole day fails and several slave nodes are marked as blacklisted.
What can I do in such a case?
Can the hadoop copyFromLocalFile() utility corrupt the file?
Has anyone met a similar problem?
It shouldn't. This error is normally associated with gzip files that weren't closed properly when originally written to local disk, or that are being copied to HDFS before they have finished being written.
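One way to guard against uploading an unfinished file is to read the local .gz through to the end before copying it. The following Python sketch does that with the standard gzip module. The function name is hypothetical, and it is meant as a pre-upload check rather than as part of the Hadoop API.

```python
import gzip
import zlib


def gzip_is_complete(path, chunk_size=1 << 20):
    """Return True if the gzip file at `path` decompresses cleanly to its end."""
    try:
        with gzip.open(path, "rb") as f:
            # Read and discard the data; errors surface only while reading.
            while f.read(chunk_size):
                pass
    except (EOFError, OSError, zlib.error):
        # EOFError: truncated stream ("unexpected end of file" in gunzip).
        # OSError: bad header or CRC mismatch (also raised if the file is missing).
        # zlib.error: corrupt deflate data.
        return False
    return True
```

A file that fails this check is either still being written or was never closed properly, so retrying after a delay is more useful than uploading it.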
You should be able to check by running md5sum on the original file and on the copy in HDFS. If they match, then the original file is corrupt:
hadoop fs -cat /input/2013/02/20/logfile.log_2013_02_20_07.close.gz | md5sum
md5sum /path/to/local/logfile.log_2013_02_20_07.close.gz
If they don't match, then check the timestamps on the two files: the one in HDFS should have been modified after the one on the local file system.
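The same checksum comparison can be scripted. This Python sketch hashes the local file in chunks and hashes the HDFS copy by streaming the output of hadoop fs -cat. It assumes the hadoop CLI is on the PATH, and the function names are made up for the example.

```python
import hashlib
import subprocess


def md5_of_stream(stream, chunk_size=1 << 20):
    """Return the hex MD5 of a binary stream, read in chunks to bound memory use."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def md5_of_local_file(path):
    """Hex MD5 of a file on the local file system."""
    with open(path, "rb") as f:
        return md5_of_stream(f)


def md5_of_hdfs_file(hdfs_path):
    """Hex MD5 of a file in HDFS, streamed through `hadoop fs -cat`."""
    proc = subprocess.Popen(
        ["hadoop", "fs", "-cat", hdfs_path], stdout=subprocess.PIPE
    )
    try:
        digest = md5_of_stream(proc.stdout)
    finally:
        proc.stdout.close()
        returncode = proc.wait()
    if returncode != 0:
        raise RuntimeError("hadoop fs -cat failed for " + hdfs_path)
    return digest
```

If md5_of_local_file() and md5_of_hdfs_file() return the same value, the upload itself did not alter the file.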
