How is the fsimage stored in Hadoop? - hadoop

How is the fsimage stored in Hadoop (the secondary NameNode fsimage format): as a table or as a file? If it is a file, is it compressed or uncompressed, and is it in a readable format?
Thanks
venkatbala

The fsimage is an "image" file and is not in a human-readable format. You have to use the HDFS Offline Image Viewer in Hadoop to convert it to a readable format.

The contents of the fsimage are just an "image" and cannot be read with cat. Basically, the fsimage holds metadata such as the directory structure, transactions, etc. There is a tool, oiv, that converts the fsimage into a text file.
Download the fsimage using:
hdfs dfsadmin -fetchImage /tmp
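Then execute the command below (-i is the input file, -o is the output file). On releases where oiv defaults to the Web processor, also pass -p XML or -p Delimited to get a file: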
hdfs oiv -i fsimage_0000000000000001382 -o /tmp/fsimage.txt

Related

How can I get the raw content of a gzip-compressed file stored on HDFS?

Is there any way to read the raw content of a file stored on Hadoop HDFS byte by byte?
Typically, when I submit a streaming job with an -input parameter that points to a .gz file (like -input hdfs://host:port/path/to/gzipped/file.gz), my task receives decompressed input line by line. This is NOT what I want.
You can initialize the FileSystem with the respective Hadoop configuration:
FileSystem.get(conf);
It has an open method, which should in principle allow you to read the raw data.
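To illustrate what "raw" means here, below is a minimal Python sketch that uses a local file as a stand-in for HDFS (an assumption; it does not talk to a cluster). Reading the file in binary mode returns the stored gzip bytes, which start with the magic bytes 1f 8b, while going through a gzip reader returns the decompressed content, which is what the streaming task sees. The helper names read_raw_bytes and read_decompressed_bytes are hypothetical.

```python
import gzip


def read_raw_bytes(path, count):
    """Return the first `count` bytes exactly as stored (no decompression)."""
    with open(path, "rb") as handle:
        return handle.read(count)


def read_decompressed_bytes(path, count):
    """Return the first `count` bytes after gzip decompression."""
    with gzip.open(path, "rb") as handle:
        return handle.read(count)
```

In the Java API the idea is the same: the stream returned by open gives the stored bytes, and decompression only happens when a codec is applied on top of it.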

Pig script to compress and decompress HDFS data in bzip2

How can I compress HDFS data to bzip2 using Pig such that decompression gives back the same directory structure it had initially? I am new to Pig.
I tried to compress with bzip2, but it generated many files because many mappers were spawned, so reverting to plain text files (the initial form) in the same directory structure becomes difficult.
This is like Unix, where compressing a folder into a bzip2 tarball and then decompressing the .tar.bz2 gives exactly the same data and folder structure it had initially.
e.g. Compression:- tar -cjf compress_folder.tar.bz2 compress_folder/
Decompression:- tar -xjf compress_folder.tar.bz2
will give exactly the same directory structure.
Approach 1:
You can try running a single reducer so that only one file is stored on HDFS, but the compromise will be performance.
set default_parallel 1;
To compress the data, set these parameters in the Pig script, if you have not already tried this:
set output.compression.enabled true;
SET mapred.output.compression.codec 'org.apache.hadoop.io.compress.BZip2Codec';
Just use JsonStorage when storing the file:
STORE file INTO '/user/hduser/data/usercount' USING JsonStorage();
Eventually, when you want to read the data back, use TextLoader:
data = LOAD '/user/hduser/data/usercount/' USING TextLoader;
Approach 2:
filecrush: a file merge utility available on GitHub.

Move files to HDFS using Spring XD

How can I move files from the local disk to HDFS using Spring XD?
I do not want the contents streamed; I want to move the whole file for archival, keeping its original name and content.
Here is what I have tried:
stream create --name fileapple --definition "file --mode=ref --dir=/Users/dev/code/open/learnspringxd/input --pattern=apple*.txt | WHATTODOHERE"
I can now see that, with --mode=ref, the file names with their full paths are made available. How do I move those files to HDFS?
You might want to check this, which imports data from files to HDFS as a batch job, and see whether it fits your requirement. You can also try file | hdfs as a stream, if that works for you.
The example below loads the file from the data folder into HDFS and saves it into date folders (if there are multiple records with different dates), partitioned by the record column named LastModified. The data file is a JSON file with one record per line.
file --mode=ref --dir=/Users/dev/code/open/learnspringxd/input --pattern=apple*.txt | hdfs --directory=/user/file_folder --partitionPath=path(dateFormat('yyyy-MM-dd',#jsonPath(payload,'$.LastModified'),'yyyy-MM-dd')) --fileName=output_file_name_prefix --fsUri=hdfs://HDFShostname.company.com:8020 --idleTimeout=30000

How to decompress a Hadoop reduce output file ending with .snappy?

Our Hadoop cluster uses snappy as the default codec. The reduce output file names look like part-r-00000.snappy. JSnappy fails to decompress the file because JSnappy requires the file to start with SNZ. The reduce output file somehow starts with some 0 bytes.
How could I decompress the file?
Use "hadoop fs -text" to read this file and redirect the output to a text file.
For example:
hadoop fs -text part-r-00001.snappy > /tmp/mydatafile.txt

How to prevent corrupted .gz files in Hadoop

I'm using the following simple code to upload files to HDFS:
FileSystem hdfs = FileSystem.get(config);
hdfs.copyFromLocalFile(src, dst);
The files are generated by a web server Java component, then rotated and closed by logback in .gz format. I've noticed that sometimes the .gz file is corrupted.
> gunzip logfile.log_2013_02_20_07.close.gz
gzip: logfile.log_2013_02_20_07.close.gz: unexpected end of file
But the following command does show me the content of the file:
> hadoop fs -text /input/2013/02/20/logfile.log_2013_02_20_07.close.gz
The impact of such files is quite disastrous, since the aggregation for the whole day fails and several slave nodes are marked as blacklisted.
What can I do in such a case?
Can the Hadoop copyFromLocalFile() utility corrupt the file?
Has anyone met a similar problem?
It shouldn't. This error is normally associated with gzip files that were not closed out when originally written to local disk, or that are being copied to HDFS before they have finished being written.
You should be able to check by running md5sum on the original file and on the copy in HDFS. If they match, then the original file is corrupt:
hadoop fs -cat /input/2013/02/20/logfile.log_2013_02_20_07.close.gz | md5sum
md5sum /path/to/local/logfile.log_2013_02_20_07.close.gz
If they don't match, then check the timestamps on the two files: the one in HDFS should have been modified after the one on the local file system.
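The same checks can also be run on the local file before it is uploaded. The following is a minimal Python sketch, assuming the file is still available on local disk; the function names md5_of_file and gzip_is_complete are hypothetical. It computes the same digest that md5sum prints and reads the gzip stream to the end, so a truncated file such as the "unexpected end of file" case above is reported as incomplete.

```python
import gzip
import hashlib
import zlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def gzip_is_complete(path, chunk_size=1 << 20):
    """Return True if the whole gzip stream decompresses without error.

    A truncated stream raises EOFError; a damaged or missing file raises
    OSError or zlib.error. All of these are reported as False.
    """
    try:
        with gzip.open(path, "rb") as handle:
            while handle.read(chunk_size):
                pass
        return True
    except (EOFError, OSError, zlib.error):
        return False
```

One possible use is to call gzip_is_complete on each rotated log and upload only the files that pass, then compare md5_of_file with the digest of the HDFS copy.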
