Spark: writing BZip2 compressed parquet files - hadoop

I want to write Parquet files from a DataFrame in Spark SQL with the BZip2 codec so that they are splittable. With the following code, I'm able to use codecs such as snappy and gzip:
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
sqlContext.read.parquet(stagingDir)
.dropDuplicates()
.write
.mode(SaveMode.Append)
.parquet(outputDir)
However, when I try BZip2 it seems it isn't available, as I get the exception below, even though I was able to write BZip2-compressed text files from an RDD:
java.lang.IllegalArgumentException: The value of spark.sql.parquet.compression.codec should be one of uncompressed, snappy, gzip, lzo, but was bzip2
Is there a way to write BZip2 compressed parquet files from Spark SQL?
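(For reference, a minimal sketch of the RDD-level bzip2 text write mentioned above, where the codec is passed to the Hadoop text output format rather than to Parquet; stagingDir and outputDir are the variables from the snippet, and the row-to-line mapping is an assumption, not the original code.)
import org.apache.hadoop.io.compress.BZip2Codec
sqlContext.read.parquet(stagingDir)
  .rdd
  .map(_.mkString(","))                            // flatten each Row to a line of text
  .saveAsTextFile(outputDir, classOf[BZip2Codec])  // bzip2-compressed, splittable text output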

Related

How can I get the raw content of a file stored on HDFS with gzip compression?

Is there any way to read the raw content of a file stored on Hadoop HDFS byte by byte?
Typically, when I submit a streaming job with an -input param that points to a .gz file (like -input hdfs://host:port/path/to/gzipped/file.gz),
my task receives decompressed input line by line, which is NOT what I want.
You can initialize the FileSystem with the respective Hadoop configuration:
FileSystem.get(conf);
It has an open method which should in principle allow you to read the raw data.
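A minimal sketch of that approach (the path is the one from the question; everything else is assumed):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
val fs = FileSystem.get(conf)
// open() returns the stored bytes as-is, so a .gz file comes back still compressed
val in = fs.open(new Path("hdfs://host:port/path/to/gzipped/file.gz"))
val buf = new Array[Byte](4096)
var n = in.read(buf)
while (n != -1) {
  // consume the raw compressed bytes in buf(0 until n)
  n = in.read(buf)
}
in.close()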

Pig script to compress and decompress HDFS data in bzip2

How can I compress HDFS data to bzip2 using Pig such that, on decompression, it gives back the same directory structure it had initially? I am new to Pig.
I tried compressing with bzip2, but it generated many files because many mappers were spawned, so reverting back to plain text files (the initial form) in the same directory structure becomes difficult.
It should work just like in Unix, where compressing a folder into a bzip2 tarball and then decompressing it gives back exactly the same data and folder structure, e.g.
Compression: tar -cjf compress_folder.tar.bz2 compress_folder/
Decompression: tar -xjf compress_folder.tar.bz2
restores exactly the same directory structure.
Approach 1:
You can try running a single reducer so that only one file is stored on HDFS, but the compromise will be performance:
set default_parallel 1;
To compress the data, set these parameters in the Pig script, if you have not already tried it this way:
SET output.compression.enabled true;
SET mapred.output.compression.codec 'org.apache.hadoop.io.compress.BZip2Codec';
Then just use JsonStorage while storing the file:
STORE file INTO '/user/hduser/data/usercount' USING JsonStorage();
Eventually you will also want to read the data back; use TextLoader:
data = LOAD '/user/hduser/data/usercount/' USING TextLoader;
Approach 2:
filecrush: a file merge utility available on GitHub

How is the fsimage stored in Hadoop?

How is the fsimage stored in Hadoop (the secondary namenode fsimage format): in a table format or a file format? If it is a file format, is it compressed or uncompressed, and is it in a readable format?
Thanks,
venkatbala
The fsimage is an "image" file and is not in a human-readable format. You have to use the HDFS Offline Image Viewer in Hadoop to convert it to a readable format.
The contents of the fsimage are just an "image" and cannot be read with cat. Basically, the fsimage holds metadata such as the directory structure, transactions, etc. There is a tool, oiv, which you can use to convert the fsimage into a text file.
Download the fsimage using
hdfs dfsadmin -fetchImage /tmp
Then execute the command below (-i is the input, -o the output):
hdfs oiv -i fsimage_0000000000000001382 -o /tmp/fsimage.txt

Convert multiple .deflate files into one gzip file in Ubuntu

I ran a Hadoop job which generated multiple .deflate files. These files are now stored on S3, so I cannot run the hadoop fs -text /somepath command, since it takes an HDFS path. Now I want to convert the multiple .deflate files stored on S3 into one gzip file.
If you make gzip files instead, using the GzipCodec, you can simply concatenate them to make one large gzip file.
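A minimal sketch of that concatenation, assuming the part files have already been downloaded locally (the file names are hypothetical); gzip allows multiple members in one file, so byte-level concatenation is enough:
import java.io.FileOutputStream
import java.nio.file.{Files, Paths}

val parts = Seq("part-00000.gz", "part-00001.gz")   // hypothetical part file names
val out = new FileOutputStream("combined.gz")
parts.foreach(p => Files.copy(Paths.get(p), out))   // append each gzip member verbatim
out.close()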
You can wrap a deflate stream with a gzip header and trailer, as described in RFC 1952. The header is a fixed 10 bytes, and the 8-byte trailer is computed from the uncompressed data, so you will need to decompress each .deflate stream in order to compute its CRC-32 and uncompressed length for the trailer.
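A sketch of that wrapping for a single file, assuming the .deflate payload is a raw RFC 1951 deflate stream as the answer describes (if the files carry a zlib wrapper, its 2-byte header and 4-byte Adler-32 trailer would have to be stripped first); the file names are hypothetical:
import java.nio.file.{Files, Paths}
import java.util.zip.{CRC32, Inflater}

val raw = Files.readAllBytes(Paths.get("part-00000.deflate"))

// Decompress once, only to obtain the CRC-32 and uncompressed length for the trailer
val inflater = new Inflater(true)                  // true = raw deflate, no zlib wrapper
inflater.setInput(raw)
val crc = new CRC32()
var isize = 0L
val buf = new Array[Byte](8192)
while (!inflater.finished()) {
  val n = inflater.inflate(buf)
  crc.update(buf, 0, n)
  isize += n
}

def le32(v: Long): Array[Byte] =                   // 4 bytes, little-endian
  Array(v, v >> 8, v >> 16, v >> 24).map(x => (x & 0xff).toByte)

val header  = Array[Byte](0x1f, 0x8b.toByte, 8, 0, 0, 0, 0, 0, 0, 0)  // fixed 10-byte gzip header
val trailer = le32(crc.getValue) ++ le32(isize)                        // CRC-32, then ISIZE mod 2^32

Files.write(Paths.get("part-00000.gz"), header ++ raw ++ trailer)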

ElasticMapReduce streaming compressed output

I'm running streaming jobs with Python scripts for the map and reduce; I create the job flow with the boto library.
I'm using gzip input files. How can I create gzip output files, though?
I use Java to process gzip files and generate gzip-compressed output, with the code below:
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
FileOutputFormat.setOutputPath(job, new Path(outputPath));
I hope you will find a similar API/code in Python.
You can generate gzip files as your output. Pass '-D mapred.output.compress=true -D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec' as options to your streaming job.
