I want to transfer a large number of small files (e.g. 200k files) packed in a zip file from my local machine into HDFS. When I unzip the archive locally and transfer the files into HDFS, it takes a long time. Is there any way I can transfer the original zip file into HDFS and unzip it there?
If your file is several gigabytes in size, this command helps avoid out-of-space errors, because there is no need to unzip the file on the local filesystem.
The put command in Hadoop supports reading input from stdin. To read from stdin, use '-' as the source file.
Compressed filename: compressed.tar.gz
gunzip -c compressed.tar.gz | hadoop fs -put - /user/files/uncompressed_data
Disadvantage: the only drawback of this approach is that the data ends up as a single file in HDFS, even if the local compressed file contains more than one file (for a .tar.gz, that single file is the uncompressed tar archive).
http://bigdatanoob.blogspot.in/2011/07/copy-and-uncompress-file-to-hdfs.html
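One way around the single-file drawback is to let tar hand each archive member to its own put. The following is only a sketch under assumptions that are not in the original answer: it needs GNU tar (for --to-command), it assumes a flat archive with no subdirectories (put generally expects the parent directory to exist already), and the function name and the HDFS_DEST variable are made up for illustration. It still starts one Hadoop client per file, so it saves local disk space rather than time.

```shell
# Stream every regular file in a .tar.gz into HDFS as its own file,
# without unpacking anything onto the local disk.
# Assumptions: GNU tar, a flat archive, and an existing HDFS target directory.
tar_gz_to_hdfs() {
  local archive="$1" dest_dir="$2"
  # GNU tar runs the command once per regular file, feeding the member's
  # contents on stdin and exposing its name as $TAR_FILENAME.
  # HDFS_DEST is a hypothetical helper variable passed through the environment.
  HDFS_DEST="$dest_dir" tar -xzf "$archive" \
    --to-command='hadoop fs -put - "$HDFS_DEST/$TAR_FILENAME"'
}

# Usage: tar_gz_to_hdfs compressed.tar.gz /user/files/uncompressed_data
```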
Related
I have a bunch of big compressed (.bz2) files in a Hadoop/HDFS location, and I don't have enough local space to copy them down and count their lines. I am looking for a command that gives the line count of those compressed files in HDFS, the way wc -l *.txt does on local Linux for all files matching a pattern.
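A possible approach, sketched here under assumptions: hadoop fs -text decompresses files whose extension maps to a configured codec (the BZip2 codec is normally enabled by default), so the data can be streamed through wc -l without ever being stored locally. The function name is hypothetical, and paths containing spaces are not handled because the path is taken from the last column of the -ls output.

```shell
# Print "<line count> <path>" for every HDFS file matching a glob,
# similar to `wc -l *.txt` on a local filesystem.
hdfs_wc_l() {
  local pattern="$1" path
  # File entries in `hadoop fs -ls` output start with '-';
  # the path is the last whitespace-separated field.
  hadoop fs -ls "$pattern" | awk '/^-/ {print $NF}' |
  while IFS= read -r path; do
    # -text decompresses based on the file extension; stdin is redirected
    # so the client cannot swallow the list of paths feeding the loop.
    printf '%s %s\n' \
      "$(hadoop fs -text "$path" < /dev/null | wc -l | tr -d ' ')" "$path"
  done
}

# Usage (quote the glob so the local shell does not expand it):
#   hdfs_wc_l '/data/logs/*.bz2'
```

Each file costs one client start-up plus a full read over the network, so for very large inputs a MapReduce or Hive count may be the better fit.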
I have (huge) zip files (not gzip) sitting on HDFS. These files all contain multiple files. Is there any way, other than pulling it to local, to list the files in the zip file? Like zipinfo does on Linux.
HDFS does not support processing zip files.
I understand that listing the files in a zip archive seems too simple a task to write Java code for, but you may want to try processing them with MapReduce.
Try ZipFileInputFormat.
I have a split zip file (created by WinZip on Windows), which I then transferred by FTP to the Hadoop server.
Somehow I can't unzip it with a command like the one below.
The files look like this:
file.z01, file.z02, file.z03 ... file.zip
Then I ran the command below:
hadoop fs -cat /tmp/Links.txt.gz | gzip -d | hadoop fs -put - /tmp/unzipped/Links.txt
Then an error comes up:
cat: Unable to write to output stream
What I expect is to unzip those split files into a particular Hadoop folder.
Unclear how Links.txt.gz is related to your .zip part files...
Hadoop doesn't really understand ZIP format (especially split ones), and gzip -d wouldn't work on .zip files anyway.
Neither ZIP nor gzip is splittable in Hadoop processing (read: "able to be computed in parallel"). Since WinZip supports the BZ2 format, I suggest you switch to that. I also don't see a need to create split files in Windows, unless it's to upload the file faster...
Sidenote: hadoop fs -cat /input | <anything> | hadoop fs -put - /output is not splitting "in Hadoop"... You are copying the raw text of the file to your local buffer, then doing an operation locally, then optionally streaming it back to HDFS.
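Because gzip -d only accepts gzip data, it can help to check what a file on HDFS really is before piping it anywhere. The sketch below reads the first two bytes and compares them with well-known magic numbers; the function name is hypothetical. Since head closes the pipe after two bytes, the Hadoop client may complain about being unable to write to the output stream, which is why its stderr is discarded here.

```shell
# Guess the container format of an HDFS file from its first two bytes.
hdfs_archive_type() {
  local magic
  # Read two bytes, print them as hex, and strip od's padding.
  magic="$(hadoop fs -cat "$1" 2>/dev/null | head -c 2 | od -An -tx1 | tr -d ' \n')"
  case "$magic" in
    1f8b) echo gzip ;;     # gzip header
    504b) echo zip ;;      # "PK", also used by split zip segments
    425a) echo bzip2 ;;    # "BZ"
    *)    echo unknown ;;
  esac
}

# Usage: hdfs_archive_type /tmp/Links.txt.gz
```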
I would like to know how the getmerge command works at the OS/HDFS level. Will it copy each and every byte/blocks from one file to another file, or just a simple file descriptor change? How costly an operation is it?
getmerge
Usage: hadoop fs -getmerge <src> <localdst> [addnl]
Takes a source directory and a destination file as input and concatenates files in src into the destination local file. Optionally addnl can be set to enable adding a newline character at the end of each file.
So, to answer your question,
Will it copy each and every byte/blocks from one file to another file
Yes, and no. It will find every HDFS block containing the files in the given source directory and concatenate them together into a single file on your local filesystem.
a simple file descriptor change
Not sure what you mean by that. getmerge doesn't change any file descriptors; it is just reading data from HDFS to your local filesystem.
How costly an operation is it?
Expect it to be as costly as manually cat-ing all the files in an HDFS directory. The same result as
hadoop fs -getmerge /tmp/ /home/user/myfile
could be achieved with
hadoop fs -cat '/tmp/*' > /home/user/myfile
(the glob is quoted so that the local shell does not expand it against the local /tmp).
The costly part is fetching many file pointers and transferring those records over the network to your local disk.
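Since the cost is dominated by the bytes that travel to the local disk, a rough estimate can be taken before merging. This is a minimal sketch with a hypothetical function name; it relies on the first column of hadoop fs -du -s being the logical size in bytes, and it counts subdirectories too, so treat the result as an upper bound.

```shell
# Print the total size in bytes of an HDFS path, i.e. roughly how much
# data a getmerge of that path would have to pull over the network.
hdfs_total_bytes() {
  # First column of `hadoop fs -du -s` is the logical size in bytes.
  hadoop fs -du -s "$1" | awk '{print $1}'
}

# Usage: hdfs_total_bytes /tmp/
```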
I am looking for a way to combine small RC files generated by a MapReduce program.
What is the best way to merge small RC files into one large RC file?
You can try the getmerge command. It takes a source directory and a destination file as input and concatenates the files in the source directory into the destination file.
For example, if the Hive table name is search_combined_rc, you can merge its RC files into a single local file:
hadoop fs -getmerge /user/hive/warehouse/dev.db/search_combined_rc/ /localdata/destinationfilename
Since RCFiles cannot be opened with the tools that open typical sequence files, you can try the rcfilecat tool to display their contents. You need to move the file from the local directory back to HDFS first.
hive --service rcfilecat /hdfsfilelocation