How to decompress a Hadoop reduce output file ending with .snappy? - hadoop

Our Hadoop cluster uses Snappy as the default codec. The Hadoop job's reduce output files are named like part-r-00000.snappy. JSnappy fails to decompress the file because JSnappy requires the file to start with SNZ, but the reduce output file starts with some zero bytes instead.
How could I decompress the file?

Use "hadoop fs -text" to read the file and pipe the output to a text file, e.g.:
hadoop fs -text part-r-00001.snappy > /tmp/mydatafile.txt
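If you need to decompress the output programmatically rather than through the shell, a minimal sketch using Hadoop's CompressionCodecFactory could look like the following (paths are hypothetical); it resolves the codec from the .snappy extension the same way hadoop fs -text does:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils
import org.apache.hadoop.io.compress.CompressionCodecFactory
import java.io.FileOutputStream

val conf = new Configuration()                        // picks up the cluster config on the classpath
val fs = FileSystem.get(conf)
val inPath = new Path("/output/part-r-00001.snappy")  // hypothetical reduce output path

// Resolve the codec from the file extension, as "hadoop fs -text" does
val codec = new CompressionCodecFactory(conf).getCodec(inPath)
val in = codec.createInputStream(fs.open(inPath))
val out = new FileOutputStream("/tmp/mydatafile.txt")

IOUtils.copyBytes(in, out, conf, true)                // copies and closes both streams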

Related

Hadoop: Using Pig to add text at the end of every line of an HDFS file

We have files in HDFS containing raw logs; each individual log entry is one line, as the logs are line separated.
Our requirement is to add a text suffix (' 12345', for example) to the end of every log line in these files, using Pig, a Hadoop command, or any other MapReduce-based tool.
Please advise.
Thanks
AJ
Load the files so that each log entry goes into one field, i.e. line:chararray, use CONCAT to add the text to each line, and store the result into a new log file. If you want individual output files, you will have to parameterize the script to load each file and store it into its own new file instead of doing a wildcard load.
Log = LOAD '/path/wildcard/*.log' USING TextLoader() AS (line:chararray);
Log_Text = FOREACH Log GENERATE CONCAT(line, 'Your Text') AS newline;
STORE Log_Text INTO '/path/NewLog.log';
If your files aren't extremely large, you can do that with a single shell command.
hdfs dfs -cat /user/hdfs/logfile.log | sed 's/$/12345/g' |\
hdfs dfs -put - /user/hdfs/newlogfile.txt
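If you would rather go through the HDFS API than Pig or a shell pipeline, a minimal sketch (paths and the " 12345" suffix are only placeholders) could look like this; like the sed approach, it runs on a single machine, so it only makes sense for files that are not extremely large:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import java.io.PrintWriter
import scala.io.Source

val conf = new Configuration()
val fs = FileSystem.get(conf)
val in = fs.open(new Path("/user/hdfs/logfile.log"))                          // hypothetical input file
val out = new PrintWriter(fs.create(new Path("/user/hdfs/newlogfile.txt")))   // hypothetical output file

// Append the text to every line and write the result to the new file
Source.fromInputStream(in).getLines().foreach(line => out.println(line + " 12345"))

out.close()
in.close()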

How can I get the raw content of a file which is stored on HDFS with gzip compression?

Is there any way to read the raw content of a file stored on HDFS byte by byte?
Typically, when I submit a streaming job with an -input parameter that points to a .gz file (like -input hdfs://host:port/path/to/gzipped/file.gz), my task receives the decompressed input line by line, which is NOT what I want.
You can initialize the FileSystem with the respective Hadoop configuration:
FileSystem.get(conf);
It has an open method which should, in principle, allow you to read the raw data.
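For example, a minimal sketch (the path is a placeholder) could look like this; FileSystem.open returns a stream over the bytes exactly as they are stored, so nothing gets decompressed:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
val fs = FileSystem.get(conf)
val in = fs.open(new Path("/path/to/gzipped/file.gz"))  // hypothetical path to the .gz file

// Read the raw, still-compressed bytes chunk by chunk
val buf = new Array[Byte](4096)
var n = in.read(buf)
while (n != -1) {
  // buf(0) .. buf(n - 1) hold raw gzip bytes; process them here
  n = in.read(buf)
}
in.close()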

How is the fsimage stored in Hadoop?

How is the fsimage stored in Hadoop (the secondary namenode fsimage format)? Is it a table format or a file format? If it is a file, is it compressed or uncompressed, and is it in a readable format?
Thanks
venkatbala
The fsimage is an "image" file and is not in a human-readable format. You have to use the HDFS Offline Image Viewer to convert it to a readable format.
The contents of the fsimage cannot simply be read with cat. The fsimage basically holds the metadata: the directory structure, transactions, etc. There is a tool, oiv, with which you can convert the fsimage into a text file.
Download the fsimage using:
hdfs dfsadmin -fetchImage /tmp
Then execute the command below (-i for the input file, -o for the output file):
hdfs oiv -i fsimage_0000000000000001382 -o /tmp/fsimage.txt
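Note that, depending on your Hadoop version, the default oiv processor may not write a plain text dump, in which case you may need to pick one explicitly with the -p flag, for example (the output file name is just an illustration):
hdfs oiv -p XML -i fsimage_0000000000000001382 -o /tmp/fsimage.xml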

Checksum verification in Hadoop

Do we need to verify the checksum after we move files to Hadoop (HDFS) from a Linux server through WebHDFS?
I would like to make sure the files on HDFS are not corrupted after they are copied. But is checking the checksum necessary?
I read that the client computes checksums before data is written to HDFS.
Can somebody help me understand how I can make sure that the source file on the Linux system is the same as the ingested file on HDFS when using WebHDFS?
If your goal is to compare two files residing on HDFS, I would not use "hdfs dfs -checksum URI", as in my case it generates different checksums for files with identical content.
In the example below I am comparing two files with the same content in different locations:
The old-school md5sum method returns the same checksum:
$ hdfs dfs -cat /project1/file.txt | md5sum
b9fdea463b1ce46fabc2958fc5f7644a -
$ hdfs dfs -cat /project2/file.txt | md5sum
b9fdea463b1ce46fabc2958fc5f7644a -
However, the checksum generated by HDFS is different for files with the same content:
$ hdfs dfs -checksum /project1/file.txt
0000020000000000000000003e50be59553b2ddaf401c575f8df6914
$ hdfs dfs -checksum /project2/file.txt
0000020000000000000000001952d653ccba138f0c4cd4209fbf8e2e
A bit puzzling, as I would expect identical checksums to be generated for identical content.
The checksum for a file can be calculated using the hadoop fs command.
Usage: hadoop fs -checksum URI
Returns the checksum information of a file.
Example:
hadoop fs -checksum hdfs://nn1.example.com/file1
hadoop fs -checksum file:///path/in/linux/file1
Refer to the Hadoop documentation for more details.
So if you want to compare file1 on both Linux and HDFS, you can use the above utility.
I wrote a library with which you can calculate the checksum of a local file, just the way Hadoop does it for HDFS files.
So you can compare the checksums to cross-check.
https://github.com/srch07/HDFSChecksumForLocalfile
If you are doing this check via the API:
import org.apache.hadoop.fs._
import org.apache.hadoop.io._
Option 1: for the value b9fdea463b1ce46fabc2958fc5f7644a
val md5:String = MD5Hash.digest(FileSystem.get(hadoopConfiguration).open(new Path("/project1/file.txt"))).toString
Option 2: for the value 3e50be59553b2ddaf401c575f8df6914
val md5:String = FileSystem.get(hadoopConfiguration).getFileChecksum(new Path("/project1/file.txt")).toString.split(":")(1)
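To cross-check against the source file on the Linux side, the Option 1 value can be compared with a plain MD5 computed over the local copy, e.g. with a sketch like this (the local path is hypothetical):

import org.apache.hadoop.io.MD5Hash
import java.io.FileInputStream

// Plain MD5 of the local file, computed the same way as Option 1 above;
// for an uncorrupted copy it should match the md5sum / Option 1 value
val localMd5: String = MD5Hash.digest(new FileInputStream("/local/path/file.txt")).toString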
HDFS also does a CRC check: for each and every file it creates a .crc file to make sure there is no corruption.

Specify Hadoop process split

I want to run Hadoop MapReduce on a small part of my text file.
One of my tasks is failing. I can read in the log:
Processing split: hdfs://localhost:8020/user/martin/history/history.xml:3556769792+67108864
Can I execute MapReduce once again on this file, from offset 3556769792 to 3623878656 (3556769792 + 67108864)?
One way to do this is to copy the part of the file starting at the given offset and put it back into HDFS. From that point, simply run the MapReduce job on this block only.
1) Copy 67108864 bytes from the file, starting at offset 3556769792 (a faster variant is noted after step 3):
dd if=history.xml bs=1 skip=3556769792 count=67108864 > history_offset.xml
2) Import it into HDFS:
hadoop fs -copyFromLocal history_offset.xml offset/history_offset.xml
3) Run MapReduce again:
hadoop jar myJar.jar 'offset' 'offset_output'
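Note on step 1: since 3556769792 = 3392 × 1048576 and 67108864 = 64 × 1048576, an equivalent but much faster invocation (assuming GNU dd, which accepts the 1M block-size suffix) is:
dd if=history.xml bs=1M skip=3392 count=64 of=history_offset.xml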
