Remove directory level when transferring from HDFS to S3 using S3DistCp - hadoop

I have a Pig script (using a slightly modified MultiStorage) that transforms some data. Once the script runs, I have data in the following format on HDFS:
/tmp/data/identifier1/identifier1-0,0001
/tmp/data/identifier1/identifier1-0,0002
/tmp/data/identifier2/identifier2-0,0001
/tmp/data/identifier3/identifier3-0,0001
I'm attempting to use S3DistCp to copy these files to S3. I am using the --groupBy .*(identifier[0-9]).* option to combine files based on the identifier. The combination works, but when copying to S3, the folders are also copied. The end output is:
/s3bucket/identifier1/identifier1
/s3bucket/identifier2/identifier2
/s3bucket/identifier3/identifier3
Is there a way to copy these files without that first folder? Ideally, my output in S3 would look like:
/s3bucket/identifier1
/s3bucket/identifier2
/s3bucket/identifier3
Another solution I've considered is to use HDFS commands to pull those files out of their directories before copying to S3. Is that a reasonable solution?
Thanks!

The solution I've arrived at is to use distcp to pull these files out of their directories before running s3distcp:
hadoop distcp -update /tmp/data/** /tmp/grouped
Then, I changed the s3distcp script to move data from /tmp/grouped into my S3 bucket.

Using distcp before s3distcp is really expensive. Another option is to create a manifest file listing all of your files and pass its path to s3distcp. In this manifest you can define the "base name" of each file. If you need an example of a manifest file, just run s3distcp on any folder with the --outputManifest argument.
More information can be found in the S3DistCp documentation.
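For reference, a manifest produced by --outputManifest is a gzip-compressed text file with one JSON object per line. From memory, an entry looks roughly like the following; the exact field names and the values for your data are best confirmed against a manifest you generate yourself with --outputManifest:

```json
{"path": "s3://s3bucket/identifier1/identifier1", "baseName": "identifier1", "srcDir": "s3://s3bucket/identifier1", "size": 1024}
```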

Related

List contents of zip file on HDFS

I have (huge) zip files (not gzip) sitting on HDFS. Each archive contains multiple files. Is there any way, other than pulling the archive to the local filesystem, to list the files inside it, the way zipinfo does on Linux?
HDFS has no built-in support for processing zip files. I understand that listing the files in a zip may seem too simple a task to write Java for, but if you want to go further and actually process the archives with MapReduce, try ZipFileInputFormat.
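That said, listing entries without pulling the archive to the local disk is doable by streaming: open the file through Hadoop's FileSystem API and wrap the stream in java.util.zip.ZipInputStream. A minimal sketch; the helper works on any InputStream, and the HDFS path in the comment is a hypothetical example:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class HdfsZipLister {

    // Walks the zip stream and collects the entry names without
    // extracting any file contents to the local filesystem.
    public static List<String> listEntries(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    /*
     * Against HDFS you would obtain the stream like this
     * (hypothetical path; requires hadoop-common on the classpath):
     *
     *   FileSystem fs = FileSystem.get(new Configuration());
     *   try (FSDataInputStream in = fs.open(new Path("/data/archive.zip"))) {
     *       listEntries(in).forEach(System.out::println);
     *   }
     */
}
```

Note that ZipInputStream reads the local file headers sequentially, so it needs no random access (which is why it works over an HDFS stream), but the reader does stream through all of the archive's bytes, even though nothing lands on local disk.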

Combine Map output for directory to one file

I have a requirement where I have to merge the mapper output for a directory into a single file. Let's say I have a directory A which contains 3 files.
../A/1.txt
../A/2.txt
../A/3.txt
I need to run a mapper to process these files, and it should generate one output file. I know a reducer will do that, but I don't want to use reducer logic.
Alternatively: can I have only one mapper process all the files under a directory?
If you set up fuse to mount your HDFS to a local directory, then your output can be the mounted filesystem.
For example, I have our HDFS mounted to /mnt/hdfs locally. I run the following command and it works great:
hadoop fs -getmerge /reports/some_output /mnt/hdfs/reports/some_output.txt
Of course, there are other reasons to use fuse to mount HDFS to a local directory, but this was a nice side effect for us.
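If you'd rather not depend on fuse, the merge itself is simple: getmerge essentially streams each file in the directory, in name order, into one output. A self-contained sketch over the local filesystem; against HDFS you would list with FileSystem.listStatus and read with fs.open instead:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class MergeDir {

    // Concatenates every regular file in dir into a single output file,
    // in sorted name order -- roughly what `hadoop fs -getmerge` does.
    public static void merge(Path dir, Path out) throws IOException {
        List<Path> parts = new ArrayList<>();
        try (DirectoryStream<Path> ds = Files.newDirectoryStream(dir)) {
            for (Path p : ds) {
                if (Files.isRegularFile(p)) {
                    parts.add(p);
                }
            }
        }
        Collections.sort(parts);
        try (OutputStream o = Files.newOutputStream(out)) {
            for (Path p : parts) {
                Files.copy(p, o); // append this part's bytes to the output
            }
        }
    }
}
```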
Can i have only one mapper to process all the files under a directory.
Have you looked into CombineFileInputFormat? Felix Ren-Chyan Chern writes about setting it up in some detail.

Is there a tool to continuously copy contents of a directory to HDFS as they are?

I tried using the Flume directory spooler source and HDFS sink. But this does not serve my purpose because the files are read by Flume and then written to HDFS as part files, which can be rolled by size/time (please correct me if I've got this wrong). Is there a tool that continuously does something like an HDFS put on every file dumped into the spool directory?
If I understood your question correctly: you have a spool directory, files keep arriving in it, and you want them moved to HDFS without being read and rewritten. HDFS copyFromLocal will solve that; you just need some logic that returns the recent files in the directory and runs a copyFromLocal command on each of them.
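That "logic which returns the recent files" can be as small as a polling loop that remembers what it has already shipped. A sketch using plain java.nio; the class and method names are my own, and the Files.copy call is a local stand-in for the real upload, which would be fs.copyFromLocalFile(...) with the Hadoop Java API or a shelled-out `hadoop fs -copyFromLocal`:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

public class SpoolPoller {

    private final Set<Path> shipped = new HashSet<>();

    // One poll cycle: ship every file in the spool dir that has not
    // been shipped before. Call this from a loop with a sleep, or
    // from a scheduler such as cron.
    public void poll(Path spoolDir, Path target) throws IOException {
        try (DirectoryStream<Path> ds = Files.newDirectoryStream(spoolDir)) {
            for (Path p : ds) {
                if (Files.isRegularFile(p) && shipped.add(p)) {
                    // Stand-in for copyFromLocal; replace with
                    // fs.copyFromLocalFile(...) against a real cluster.
                    Files.copy(p, target.resolve(p.getFileName()));
                }
            }
        }
    }
}
```

One caveat with any spool-directory scheme: make sure producers write files atomically (write under a temporary name, then rename into the spool directory), or the poller may ship partially written files.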

How to rename output file(s) of Hive on EMR?

The output of Hive on EMR is a file named 000000_0 (perhaps a different number if there is more than 1 reducer).
How do I get this file to be named differently? I see two options:
1) Get Hive to write it differently
2) Rename the file(s) in S3 after they are written. This could be a problem: from what I've read, S3 doesn't really have a "rename". You have to copy the object and delete the original. When dealing with a file that is 1 TB in size, for example, could this cause performance problems or increase usage costs?
The AWS Command Line Interface (CLI) has a convenient mv command that you could add to a script:
aws s3 mv s3://my-bucket/000000_0 s3://my-bucket/data1
Or, you could do it programmatically via the Amazon S3 copy API call. Either way, the "move" is indeed a copy followed by a delete, but the copy happens server-side within S3: the object's data is not downloaded and re-uploaded, so even very large objects are renamed without the bytes passing through your machine.

How to use hadoop fs -cp s3://<bucket> hdfs:///tmp

I want to copy a file from an S3 bucket to HDFS. I am able to copy it using the above command. But how do I do the same in Java code? I am able to use FileSystem.copyFromLocalFile and copyToLocalFile, but not -cp. How do I implement this? Any help appreciated. Thanks.
What you're looking for is org.apache.hadoop.fs.FileUtil, which implements the file system shell commands; FileUtil.copy(srcFS, src, dstFS, dst, deleteSource, conf) is the call behind -cp. See here for an example: http://hadoop.apache.org/docs/current/api/src-html/org/apache/hadoop/fs/FileUtil.html#line.285
You may also consider using s3distcp, which is optimized for copying (and concatenating) files from S3 to HDFS and vice versa.