I have (huge) zip files (not gzip) sitting on HDFS, and each archive contains multiple files. Is there any way, other than pulling them to the local filesystem, to list the files inside a zip archive, the way zipinfo does on Linux?
HDFS itself has no built-in support for processing zip files; there is no hadoop fs equivalent of zipinfo.
I understand that listing the files in a zip archive is too simple a task to justify writing Java code for, but you may want to try processing them with MapReduce; try ZipFileInputFormat.
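If all you need is the listing, another option is to stream the archive out of HDFS and list its entries on the client, without writing it to local disk. This is a minimal sketch, assuming bsdtar (libarchive) is installed on the client, since it can read zip entries from a stream; the HDFS path is just a placeholder:

# Stream the zip out of HDFS and list its entries without storing it locally
hadoop fs -cat /path/on/hdfs/archive.zip | bsdtar -tf -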
I have two folders in HDFS. Folder-1 contains some files and sub-folders, and folder-2 contains some of the files and sub-folders of folder-1.
Now I need to copy the missing files and sub-folders from folder-1 to folder-2.
Is this possible to do with a shell script?
Can anyone help me with this?
Use distcp.
$ hadoop distcp -update FOLDER1 FOLDER2
I would look at the examples in the DistCp Version 2 Guide to make sure you're comfortable with the semantics of -update.
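The main thing to check is how -update changes directory handling: with -update (or -overwrite), the contents of each source directory are copied into the target rather than the source directory itself being created underneath it, and files already present in the target with matching size and checksum are skipped. A hedged illustration with hypothetical paths:

# Without -update, folder-1 itself is created under the target:
#   /data/folder-2/folder-1/...
hadoop distcp /data/folder-1 /data/folder-2

# With -update, the *contents* of folder-1 are copied into folder-2,
# and files already present with matching size/checksum are skipped:
hadoop distcp -update /data/folder-1 /data/folder-2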
I have a .bin file that is made up of 3 files:
1. a tar.gz file
2. a .zip file
3. an install.sh file
For now the install.sh file is empty. I am trying to write a shell script that can extract the .zip file and copy the tar.gz file to a specific location when the *.bin file is executed on an Ubuntu machine. A Jenkins job pulls in these 3 files to create the *.bin file.
My question is: how do I access the tar.gz and .zip files from my shell script?
There are two general tricks that I'm aware of for this sort of thing.
The first is to use a file format that will ignore invalid data and find the correct file contents automatically (I believe zip is one such format/tool).
When this is the case you just run the tool on the packed/concatenated file and let the tool do its job.
For formats and tools where that doesn't work or isn't possible, the general trick is to embed a marker in the concatenated file, so that the shell script ignores the payload but can read its own file to "extract" the embedded data and hand it to the other tool (see the sketch below).
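Here is a minimal sketch of that marker approach, assuming the Jenkins job builds the .bin by concatenating install.sh with a tar archive of the other two files; the marker name, payload layout, and destination paths are all just illustrative:

#!/bin/bash
# install.sh -- everything after the __PAYLOAD_BELOW__ marker is binary payload
# (here assumed to be a tar archive containing the tar.gz and the .zip),
# appended to this script by the build job.
set -e

# Line number of the first payload byte: the line right after the marker.
PAYLOAD_LINE=$(awk '/^__PAYLOAD_BELOW__$/ { print NR + 1; exit }' "$0")

WORKDIR=$(mktemp -d)
# Stream everything after the marker into tar to unpack the embedded files.
tail -n +"$PAYLOAD_LINE" "$0" | tar -xf - -C "$WORKDIR"

# The embedded archives are now ordinary files the script can work with.
unzip "$WORKDIR"/*.zip -d /opt/myapp
cp "$WORKDIR"/*.tar.gz /opt/myapp/packages/

exit 0
__PAYLOAD_BELOW__

On the build side, the .bin would then be produced with something like: cat install.sh payload.tar > installer.bin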
I have a Pig script (using a slightly modified MultiStorage) that transforms some data. Once the script runs, I have data in the following format on HDFS:
/tmp/data/identifier1/identifier1-0,0001
/tmp/data/identifier1/identifier1-0,0002
/tmp/data/identifier2/identifier2-0,0001
/tmp/data/identifier3/identifier3-0,0001
I'm attempting to use S3DistCp to copy these files to S3. I am using the --groupBy .*(identifier[0-9]).* option to combine files based on the identifier. The combination works, but when copying to S3, the folders are also copied. The end output is:
/s3bucket/identifier1/identifier1
/s3bucket/identifier2/identifier2
/s3bucket/identifier3/identifier3
Is there a way to copy these files without that first folder? Ideally, my output in S3 would look like:
/s3bucket/identifier1
/s3bucket/identifier2
/s3bucket/identifier3
Another solution I've considered is to use HDFS commands to pull those files out of their directories before copying to S3. Is that a reasonable solution?
Thanks!
The solution I've arrived at is to use distcp to pull these files out of their directories before running s3distcp:
hadoop distcp -update /tmp/data/** /tmp/grouped
Then I changed the s3distcp step to copy the data from /tmp/grouped into my S3 bucket (see the sketch below).
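For completeness, the follow-up s3distcp step would look roughly like this; on recent EMR releases s3-dist-cp is available as a command, and the bucket name and exact invocation here are illustrative rather than the poster's actual script:

# Group the flattened files by identifier and push them to the top of the bucket.
s3-dist-cp --src hdfs:///tmp/grouped --dest s3://s3bucket/ --groupBy '.*(identifier[0-9]).*'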
Using distcp before s3distcp is really expensive. Another option is to create a manifest file listing all of your files and give its path to s3distcp. In this manifest you can define the "base name" of each file. If you need an example of a manifest file, just run s3distcp on any folder with the --outputManifest argument.
More information can be found in the S3DistCp documentation.
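A hedged sketch of that manifest workflow, using option names from EMR's s3-dist-cp documentation (--outputManifest, --previousManifest, --copyFromManifest); the bucket, paths, and exact manifest location are placeholders and should be treated as assumptions:

# 1. Do a throwaway run that also writes a gzipped manifest describing every file copied.
s3-dist-cp --src hdfs:///tmp/data --dest s3://s3bucket/probe/ --outputManifest=manifest.gz

# 2. Edit the baseName of each manifest entry so files land directly under the bucket,
#    then replay the copy from the edited manifest.
s3-dist-cp --src hdfs:///tmp/data --dest s3://s3bucket/ \
  --previousManifest=s3://s3bucket/probe/manifest.gz --copyFromManifest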
I want to transfer a very large number of small files (e.g. 200k files), packed in a zip file, from the local machine into HDFS. When I unzip the archive and transfer the files into HDFS, it takes a long time. Is there any way I can transfer the original zip file into HDFS and unzip it there?
If your archive runs into the gigabytes, the command below also helps avoid out-of-space errors, since there is no need to unpack it on the local filesystem first.
The hadoop fs -put command supports reading its input from stdin; use '-' as the source file.
Compressed filename: compressed.tar.gz
gunzip -c compressed.tar.gz | hadoop fs -put - /user/files/uncompressed_data
The only drawback of this approach is that the data ends up merged into a single file in HDFS, even though the local compressed archive contains multiple files.
http://bigdatanoob.blogspot.in/2011/07/copy-and-uncompress-file-to-hdfs.html
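Since the question is actually about a zip archive rather than a tar.gz, the equivalent pipe would use unzip -p, which writes the extracted file contents to stdout; the same single-merged-file caveat applies. A sketch with placeholder paths:

# Stream the contents of every entry in the zip into one HDFS file.
unzip -p files.zip | hadoop fs -put - /user/files/merged_data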
I am a newbie to Hadoop and MapReduce.
I need to process a zip file in my project using MapReduce: the input has to be a zip file, and the output may be a zip or a text file.
Can anyone give me a sample or point me to a link for this?
Thanks,
varadhan.S
I am also currently working on Hadoop and MapReduce. In my case there is no need to specify anything special for compressed files; Hadoop automatically decompresses and processes them. The output I am using is text. I am currently processing a huge number of compressed files, where each tar.gz file contains a single text file.
Regards
Balaram
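One caveat worth noting: the transparent handling applies to files compressed with a Hadoop-supported codec such as .gz, whereas a multi-entry .zip archive generally needs a custom input format like the ZipFileInputFormat mentioned in the first answer. A quick, hedged way to see the transparent .gz handling, using the stock example jar (jar path and input/output locations are illustrative):

# TextInputFormat decompresses .gz input on the fly -- the mapper just sees text lines.
hadoop fs -put logs.txt.gz /input/
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /input /output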