MIT StarCluster and S3 - hadoop

I am trying to run a MapReduce job on spot instances.
I launch my instances using StarCluster and its Hadoop plugin. I have no problem uploading the data, putting it into HDFS, and copying the results back out of HDFS.
My question: is there a way to load the data directly from S3 and push the results back to S3? (I don't want to manually download the data from S3 to HDFS and push the results from HDFS back to S3; is there a way to do this in the background?)
I am using the standard MIT StarCluster AMI.

There is no built-in way to do it, but you can write a script for it.
For example, you can use:
hadoop distcp s3n://ID:key@mybucket/file /user/root/file
to copy a file directly from S3 into HDFS.
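A minimal sketch of such a wrapper script (the bucket name, paths, jar, and class name are placeholders; credentials can be embedded in the URI as above or configured in core-site.xml):
# pull the input from S3 into HDFS
hadoop distcp s3n://mybucket/input /user/root/input
# run the MapReduce job (jar and class names are hypothetical)
hadoop jar myjob.jar com.example.MyJob /user/root/input /user/root/output
# push the results back to S3
hadoop distcp /user/root/output s3n://mybucket/output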

Related

How can I couple Amazon Glacier / S3 with hadoop map reduce / spark?

I need to process data stored in Amazon S3 and Amazon Glacier with Hadoop / EMR and save the output data in an RDBMS, e.g. Vertica.
I am a total noob in big data. I have only gone through a few online sessions and presentations about MapReduce and Spark, and have written a few dummy MapReduce programs for learning purposes.
So far I only have commands that let me import data from S3 into HDFS on Amazon EMR, and after processing they store the results in HDFS files.
So here are my questions:
Is it really mandatory to sync data from S3 to HDFS first before running MapReduce, or is there a way to use S3 directly?
How can I make Hadoop access Amazon Glacier data?
And finally, how can I store the output in a database?
Any suggestion / reference is welcome.
EMR clusters can read from and write to S3 directly, so there is no need to copy the data onto the cluster first. S3 has a Hadoop FileSystem implementation, so it can mostly be treated the same as HDFS.
AFAIK your MR/Spark jobs cannot access Glacier data directly; it first has to be restored from Glacier, which is itself a lengthy procedure.
Check out Sqoop for pumping data between HDFS and a relational database, as sketched below.
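A rough sketch of the whole pipeline (the bucket, jar, class, table, and connection details are hypothetical, and the Vertica JDBC driver is assumed to be available to Sqoop):
# the job can read its input straight from S3 and write its results to HDFS
hadoop jar myjob.jar com.example.MyJob s3://mybucket/input /user/hadoop/results
# export the results from HDFS into the database with Sqoop
sqoop export --connect jdbc:vertica://dbhost:5433/mydb \
    --username dbuser --password-file /user/hadoop/.dbpass \
    --table results --export-dir /user/hadoop/results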

How could i relate Amazon EC2,S3 and my HDFS?

I am learning Hadoop in pseudo-distributed mode, so I am not very familiar with clusters. From what I have read, S3 is a data storage service and EC2 is a computing service, but I can't understand the real use of it. Will my HDFS be available in S3? When I was learning Hive I came across moving data from HDFS to S3, which was described as an archival step:
hadoop distcp /data/log_messages/2011/12/02 s3n://ourbucket/logs/2011/12/02
If my HDFS ends up on S3, how is that beneficial? This might be silly, but if someone could give me an overview it would be helpful.
S3 is just storage; no computation happens there. You can think of S3 as a bucket that holds data and from which you retrieve data using its API.
If you are using AWS/EC2, your Hadoop cluster runs on EC2 instances; that is separate from S3. HDFS is Hadoop's own file system, designed to maximize input/output performance.
The command you shared is a distributed copy (distcp). It copies data from your HDFS to S3. In short, EC2 will host HDFS as the default file system in a Hadoop environment, and you can move archived or unused data to S3, since S3 storage is cheaper than keeping it on EC2 machines.
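Because S3 is exposed to Hadoop as just another file system, archived data can also be read back in place once credentials are configured; a small sketch (the bucket and file names are placeholders):
# after archiving with the distcp command above, the data can be listed
# and read straight from S3 without copying it back into HDFS
hadoop fs -ls s3n://ourbucket/logs/2011/12/02
hadoop fs -cat s3n://ourbucket/logs/2011/12/02/part-00000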

Define an HDFS file on Elastic MapReduce on Amazon Web Services

I'm starting an implementation of the K-means algorithm on the Hadoop MapReduce framework, using the Elastic MapReduce service offered by Amazon Web Services. I want to create an HDFS file to hold the initial cluster coordinates and to store the final results of the reducers. I'm totally confused here: is there any way to create or "upload" this file into HDFS so that it can be seen by all the mappers?
Any clarification in this regard?
Thanks.
In the end I figured out how to do it.
To upload the file into the cluster's HDFS, you have to connect to your cluster via PuTTY (using the security key).
Then run this command:
hadoop distcp s3://bucket_name/data/fileNameinS3Bucket HDFSfileName
where
fileNameinS3Bucket is the name of the file in the S3 bucket, and
HDFSfileName is the name you want the file to have once it has been uploaded.
To check that the file has been uploaded, run:
hadoop fs -ls
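To make that file visible to every mapper, one option is to ship it through the distributed cache when submitting the job. This is only a sketch: the jar, class, and path names are hypothetical, and the driver is assumed to use ToolRunner/GenericOptionsParser so the -files option is honored.
# -files places centroids.txt in each task's working directory,
# so every mapper can open it by its plain file name
hadoop jar kmeans.jar com.example.KMeansDriver \
    -files hdfs:///user/hadoop/centroids.txt \
    /user/hadoop/kmeans/input /user/hadoop/kmeans/output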

Copying HDFS-format files from S3 to local

We are using Amazon EMR and Common Crawl to perform crawling. EMR writes its output to Amazon S3 in a binary-like format. We'd like to copy that to our local machine in raw-text format.
How can we achieve that? What's the best way?
Normally we could use hadoop fs -copyToLocal, but we can't access Hadoop directly and the data is on S3.
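One possible approach, assuming the output files are Hadoop SequenceFiles and that a local Hadoop installation and the AWS CLI are available (the bucket and file names are placeholders):
# download the output from S3 to the local machine
aws s3 cp s3://our-bucket/crawl-output/ ./crawl-output/ --recursive
# decode each part file to plain text with the local Hadoop install;
# hadoop fs -text understands SequenceFiles and compressed files
hadoop fs -text file://$PWD/crawl-output/part-00000 > part-00000.txt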

How to back up and restore HDFS

I have developed an application that uses HDFS to store images. Now I want to migrate servers and set up Hadoop again on the new server. How can I back up my image files from HDFS on the old server to HDFS on the new server?
I've tried using the copyToLocal command to back up and copyFromLocal to restore, but I get an error: when the application runs, the images I restored to HDFS don't show up in my application.
How can I solve this?
Thanks.
DistCp is the command to use for large inter- and intra-cluster copying; see its documentation for details.
copyToLocal and copyFromLocal should also work well for small amounts of data. Run the HDFS CLI and make sure the restored files are actually there; if they are, the problem is likely in the application itself.
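A small sketch of the direct cluster-to-cluster copy (the hostnames, ports, and paths are placeholders):
# copy the image directory straight from the old cluster's HDFS to the new one's
hadoop distcp hdfs://old-namenode:8020/data/images hdfs://new-namenode:8020/data/images
# verify on the new cluster that the files arrived
hadoop fs -ls hdfs://new-namenode:8020/data/images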

Resources