Define an HDFS file on Elastic MapReduce on Amazon Web Services - hadoop

I'm starting an implementation of the KMeans algorithm on the Hadoop MapReduce framework, using Elastic MapReduce from Amazon Web Services. I want to create an HDFS file to hold the initial cluster coordinates and to store the final results of the reducers. I'm totally confused here. Is there any way to create or "upload" this file into HDFS so that it is visible to all the mappers?
Any clarification in this regard?
Thanks.

In the end I figured out how to do it.
To upload the file into HDFS on the cluster, you have to connect to your cluster via PuTTY (using the security key) and then run this command:
hadoop distcp s3://bucket_name/data/fileNameinS3Bucket HDFSfileName
where
fileNameinS3Bucket is the name of the file in the S3 bucket
HDFSfileName is whatever you want the file to be called once it is uploaded.
To check that the file has been uploaded:
hadoop fs -ls
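For the second half of the question (getting the reducers' final output back out), the same tool works in the other direction. A minimal sketch, assuming a hypothetical output directory and results prefix in the same bucket; on EMR the s3:// scheme is backed by EMRFS, so hadoop fs can list it directly:
hadoop distcp HDFSoutputDir s3://bucket_name/results/
hadoop fs -ls s3://bucket_name/results/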

Related

NiFi PutHDFS writes to local filesystem

Challenge
I currently have two Hortonworks clusters, a NiFi cluster and an HDFS cluster, and want to write to HDFS using NiFi.
On the NiFi cluster I use a simple GetFile connected to a PutHDFS.
When pushing a file through this, the PutHDFS terminates in success. However, rather than seeing a file dropped on my HDFS (on the HDFS cluster), I just see a file being dropped onto the local filesystem where I run NiFi.
This confuses me, hence my question:
How to make sure PutHDFS writes to HDFS, rather than to the local filesystem?
Possibly relevant context:
In the PutHDFS I have linked to the hive-site and core-site of the HDFS cluster (I tried updating all server references to the HDFS namenode, but with no effect)
I don't use Kerberos on the HDFS cluster (I do use it on the NIFI cluster)
I did not see anything looking like an error in the NiFi app log (which makes sense, as it successfully writes, just in the wrong place)
Both clusters are newly generated on Amazon AWS with CloudBreak, and opening all nodes to all traffic did not help
First, make sure that you are able to move a file from the NiFi node to Hadoop using the command below:
hadoop fs -put
If you are able to move your file using the above command, then check the Hadoop config files that you are passing to your PutHDFS processor.
Also, check that you don't have any other flow running that might be processing that file.
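One thing worth verifying (my own suggestion, not part of the original answer) is that the core-site.xml handed to PutHDFS actually points at the remote namenode: if fs.defaultFS resolves to file:///, the processor will report success while writing to the local filesystem. A rough sketch from the NiFi node, with the config path, namenode host and port as placeholders:
# Inspect the core-site.xml that the PutHDFS processor is configured with
grep -A1 'fs.defaultFS' /path/to/core-site.xml
# Explicitly target the remote HDFS to prove connectivity from the NiFi node
hadoop fs -put /tmp/test.txt hdfs://<namenode-host>:8020/tmp/
hadoop fs -ls hdfs://<namenode-host>:8020/tmp/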

How can I copy files from an external Hadoop cluster to Amazon S3 without running any commands on the cluster

I have a scenario in which I have to pull data from a Hadoop cluster into AWS.
I understand that running distcp on the Hadoop cluster is a way to copy the data into S3, but I have a restriction here: I won't be able to run any commands in the cluster. I should be able to pull the files from the Hadoop cluster into AWS. The data is available in Hive.
I thought of the options below:
1) Sqoop the data from Hive? Is it possible?
2) S3DistCp (running it on AWS)? If so, what configuration would be needed?
Any suggestions?
If the Hadoop cluster is visible from EC2-land, you could run a distcp command there, or, if it's a specific bit of data, some Hive query which uses hdfs:// as input and writes out to S3. You'll need to deal with Kerberos auth though: you cannot use distcp in an un-kerberized cluster to read data from a kerberized one, though you can go the other way.
You can also run distcp locally on one or more machines, though you are limited by the bandwidth of those individual systems. distcp is best when it schedules the uploads on the hosts which actually have the data.
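A minimal sketch of the first option (distcp run from an EC2-side Hadoop installation, pulling over hdfs:// and writing to S3), assuming the s3a connector is on the classpath and using placeholder host, bucket and credential names:
hadoop distcp \
  -Dfs.s3a.access.key=<access-key> \
  -Dfs.s3a.secret.key=<secret-key> \
  hdfs://<remote-namenode>:8020/data/path \
  s3a://<bucket>/data/path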
Finally, if it is incremental backup you are interested in, you can use the HDFS audit log as a source of changed files; this is what incremental backup tools tend to use.

HDInsight hadoop-mapreduce-examples.jar: where is the output?

I ran the sample wordcount application in HDInsight. The command ran successfully, but I cannot find the output.
The command that I ran is:
hadoop jar hadoop-mapreduce-examples.jar wordcount /example/data/gutenberg/davinci.txt /user/joe/WordCountOutput
I was expecting something to be created on the file system, but I don't see /user/joe/ created.
Please advise.
HDInsight uses Azure Blob storage as its HDFS store by default, so your output is in the storage account associated with the cluster. You can use something like CloudXplorer to easily browse your Blob storage account and find this data. It will be in your default WASB container under /user/joe/WordCountOutput.
You could also run your command like this to have more control over your output location:
hadoop jar hadoop-mapreduce-examples.jar wordcount /example/data/gutenberg/davinci.txt wasb://<container>@<storageaccount>.blob.core.windows.net/user/joe/WordCountOutput
It is not in your machine's filesystem, but in Azure blobs. Typically, Hadoop MapReduce uses the Hadoop Distributed File System (HDFS), but as Thomas Jungblut correctly pointed out in his comment, Azure blobs have completely replaced HDFS in HDInsight. Still, you should be able to access the output using the HDFS shell commands, like:
hadoop dfs -ls /user/joe/WordCountOutput
Perhaps HDInsight offers more ways to browse this filesystem (see Andrew Moll's answer), but I am not familiar with them, and this is actually quite easy already.
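To actually read the word counts once the job has finished, something like the following should work from the cluster's Hadoop command line (the part-r-00000 file name assumes the default MapReduce output naming):
hadoop fs -ls /user/joe/WordCountOutput
hadoop fs -cat /user/joe/WordCountOutput/part-r-00000 | head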

MIT StarCluster and S3

I am trying to run a MapReduce job on spot instances.
I launch my instances using StarCluster and its Hadoop plugin. I have no problem uploading the data, putting it into HDFS, and then copying the result back from HDFS.
My question is: is there a way to load the data directly from S3 and push the result back to S3? (I don't want to manually download the data from S3 to HDFS and push the result from HDFS to S3; is there a way to do it in the background?)
I am using the standard MIT StarCluster AMI.
You cannot do it directly, but you can write a script to do it.
For example, you can use:
hadoop distcp s3n://ID:key@mybucket/file /user/root/file
to put the file into HDFS directly from S3.
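The same approach works in reverse for pushing results back once the job is done; a sketch with hypothetical paths (the credentials can also live in core-site.xml as fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey instead of being embedded in the URL):
hadoop distcp /user/root/output s3n://ID:key@mybucket/output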

How could I relate Amazon EC2, S3 and my HDFS?

I am learning Hadoop in pseudo-distributed mode, so I am not very familiar with clusters. From reading about clusters, I gather that S3 is a data storage service and EC2 is a computing service, but I couldn't understand the real use of EC2. Will my HDFS be available in S3? When I was learning Hive, I came across moving data from HDFS to S3, and this was described as archival logic:
hadoop distcp /data/log_messages/2011/12/02 s3n://ourbucket/logs/2011/12/02
If my HDFS is landed on S3, how would that be beneficial? This might be silly, but if someone could give me an overview that would be helpful.
S3 is just storage; no computation is allowed. You can think of S3 as a bucket which holds data, and you can retrieve data from it using its API.
If you are using AWS/EC2, then your Hadoop cluster will be on EC2, which is separate from S3. HDFS is just the file system Hadoop uses to maximize input/output performance.
The command you shared is a distributed copy; it copies data from your HDFS to S3. In short, EC2 will have HDFS as the default file system in the Hadoop environment, and you can move archived or unused data to S3, since S3 storage is cheaper than EC2 machines.
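To make the round trip concrete, here is a minimal sketch reusing the bucket and paths from the command above: archive logs from HDFS to S3, then pull them back into HDFS later if you need to process them again.
hadoop distcp /data/log_messages/2011/12/02 s3n://ourbucket/logs/2011/12/02
hadoop distcp s3n://ourbucket/logs/2011/12/02 /data/log_messages/2011/12/02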
