How to upload large files from HDFS to S3 - hadoop

I have an issue while uploading a large file (larger than 5 GB) from HDFS to S3. Is there a way to upload the file directly from HDFS to S3, without downloading it to the local file system, using multipart upload?

For copying data between HDFS and S3, you should use s3DistCp. s3DistCp is optimized for AWS and efficiently copies large numbers of files in parallel between HDFS and S3 (and across S3 buckets).
For usage of s3DistCp, you can refer to the documentation here: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
The code for s3DistCp is available here: https://github.com/libin/s3distcp
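For example, on an EMR cluster (recent EMR releases ship the tool as the s3-dist-cp command) a copy from HDFS to S3 might look like the following sketch; the bucket and paths are placeholders:
# copy an HDFS directory to S3 in parallel (placeholder bucket/paths)
s3-dist-cp --src hdfs:///user/hadoop/output --dest s3://mybucket/output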

If you are using Hadoop 2.7.1 or later, use the s3a:// filesystem to talk to S3. It supports multi-part uploads, which is what you need here.
Update: September 2016
I should add that we are reworking the S3A output stream for Hadoop 2.8; the current one buffers multipart uploads in the heap and falls over when you generate bulk data faster than your network can push it to S3.
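As a minimal sketch of a single distcp from HDFS to S3 over the s3a connector (bucket, paths and credential values are placeholders; the credentials can also be set in core-site.xml instead of on the command line):
# push one large file from HDFS to S3; s3a handles multipart uploads under the hood
hadoop distcp -Dfs.s3a.access.key=YOUR_ACCESS_KEY -Dfs.s3a.secret.key=YOUR_SECRET_KEY hdfs:///user/hadoop/bigfile s3a://mybucket/bigfile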

Related

How to ingest Parquet files residing on AWS S3 into Druid

I'm very new to Druid and want to know how we can ingest Parquet files stored on S3 into Druid.
We get data in CSV format and we standardise it to Parquet format in the Data Lake. This then needs to be loaded into Druid.
Instead of trying to ingest Parquet files from S3, I streamed the data to a Kinesis stream and used that as a source for Druid.
You have to add druid-parquet-extensions to druid.extensions.loadList in the common.runtime.properties file.
After that you can restart the Druid server.
However, only ingesting a Parquet file from a local source is documented. I couldn't verify loading from S3 because my files were encrypted.
Try adding the above extension and then read from S3 just like you'd ingest a regular file from S3.
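As a sketch, the relevant line in common.runtime.properties would look something like the following (druid-s3-extensions is an assumption here, added so Druid can read input from S3; keep whatever extensions your deployment already loads):
druid.extensions.loadList=["druid-parquet-extensions", "druid-s3-extensions"]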

Can Apache Hadoop HDFS help speed up the large file uploads (through a web browser) to a server?

As I understand it, Hadoop HDFS can't increase the network speed, but I was in a discussion with a few folks brainstorming how we could significantly speed up our uploads, and someone said they were able to significantly improve upload speed using HDFS.
If a user is on a LAN (100 Mbps), is there some way Hadoop HDFS can help increase upload speeds when the user uploads a large file (>100 GB) through their browser?
The web browser and web server will then become the bottleneck themselves. They must buffer the file on that server and then upload it to HDFS, as compared to a direct write to the datanodes with hadoop fs -copyFromLocal.
HUE (which uses WebHDFS) operates in this fashion. I don't think there is an easy way to stream a file that large over HTTP onto HDFS unless you can do chunked uploads, and once you do, you'd end up with multiple smaller files on HDFS rather than the original 100+ GB one (assuming you're not trying to append to the same file reference on HDFS).
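For reference, a chunked upload over WebHDFS is a two-step exchange per chunk (hostnames, port and paths below are placeholders):
# step 1: ask the namenode where to write; it replies with a 307 redirect to a datanode
curl -i -X PUT "http://namenode.example.com:50070/webhdfs/v1/user/hue/bigfile.part0?op=CREATE"
# step 2: stream the chunk to the datanode URL returned in the Location header
curl -i -X PUT -T bigfile.part0 "<datanode URL from the Location header>"
# op=APPEND (with POST) works the same way if you do want to keep appending to a single file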

MIT StarCluster and S3

I am trying to run a mapreduce job on spot instances.
I launch my instances using StarCluster and its Hadoop plugin. I have no problem uploading the data, putting it into HDFS, and then copying the result back from HDFS.
My question is: is there a way to load the data directly from S3 and push the result back to S3? (I don't want to manually download the data from S3 to HDFS and push the result from HDFS to S3; is there a way to do it in the background?)
I am using the standard MIT StarCluster AMI.
You cannot do it directly, but you can write a script to do it.
For example, you can use:
hadoop distcp s3n://ID:key@mybucket/file /user/root/file
to copy the file directly from S3 into HDFS.
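A similar sketch works for the reverse direction, pushing the result back to S3 when the job is done (bucket and paths are placeholders):
hadoop distcp /user/root/output s3n://ID:key@mybucket/output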

How do Amazon EC2, S3 and my HDFS relate to each other?

I am learning Hadoop in pseudo-distributed mode, so I am not very familiar with clusters. From what I've read about clusters, I gather that S3 is a data storage service and EC2 is a computing service, but I couldn't understand the real use of it. Will my HDFS be available in S3? If so, when I was learning Hive I came across moving data from HDFS to S3, and this was described as archival logic:
hadoop distcp /data/log_messages/2011/12/02 s3n://ourbucket/logs/2011/12/02
If my HDFS data lands on S3, how would that be beneficial? This might be silly, but if someone could give me an overview it would be helpful.
S3 is just storage; no computation is allowed. You can think of S3 as a bucket that holds data, and you can retrieve data from it using its API.
If you are using AWS/EC2 then your Hadoop cluster will run on EC2, which is separate from S3. HDFS is just Hadoop's file system, designed to maximize input/output performance.
The command you shared is a distributed copy (distcp). It will copy data from your HDFS to S3. In short, EC2 will use HDFS as the default file system in a Hadoop environment, and you can move archived or unused data to S3, as S3 storage is cheaper than EC2 storage.

Copying HDFS-format files from S3 to local

We are using Amazon EMR and Common Crawl to perform crawling. EMR writes the output to Amazon S3 in a binary-like format. We'd like to copy that to our local file system in raw-text format.
How can we achieve that? What's the best way?
Normally we could use hadoop fs -copyToLocal, but we can't access Hadoop directly and the data is on S3.
