Copy files from HDFS to Amazon S3 using distcp and s3a scheme - hadoop

I am using Apache Hadoop 2.7.2 and trying to copy files from HDFS to Amazon S3 with the command below.
hadoop distcp hdfs://<<namenode_host>>:9000/user/ubuntu/input/flightdata s3a://<<bucketid>>
The command above throws the following exception:
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: s3a://<<bucketid>>.distcp.tmp.attempt_1462460298670_0004_m_000001_0
Thanks much for the help.

It should be possible to go from HDFS to S3 - I have done it before using syntax like the following, running it from an HDFS cluster:
hadoop distcp -Dfs.s3a.access.key=... -Dfs.s3a.secret.key=... /user/vagrant/bigdata s3a://mytestbucket/bigdata
If you run your command like this, does it work:
hadoop distcp hdfs://namenode_host:9000/user/ubuntu/input/flightdata s3a://bucketid/flightdata
From the exception, it looks like it is expecting a 'folder' to put the data in, as opposed to the root of the bucket.

You need to provide AWS credentials in order to successfully transfer files between HDFS and S3.
You can pass the access key and secret key parameters as shown by #stephen above, but for production use you should use the credential provider API, which lets you manage your credentials without passing them around in individual commands.
Ref: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/CredentialProviderAPI.html
Secondly, you do not need to specify the "hdfs" scheme; an absolute HDFS path is sufficient.
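For example, a minimal sketch of the credential provider workflow (the jceks keystore path below is only a placeholder; adjust it and the bucket/paths to your setup):
# Store the S3A keys in a JCEKS keystore on HDFS (you are prompted for each value)
hadoop credential create fs.s3a.access.key -provider jceks://hdfs/user/ubuntu/s3.jceks
hadoop credential create fs.s3a.secret.key -provider jceks://hdfs/user/ubuntu/s3.jceks
# Point DistCp at the keystore instead of passing keys on the command line
hadoop distcp -Dhadoop.security.credential.provider.path=jceks://hdfs/user/ubuntu/s3.jceks /user/ubuntu/input/flightdata s3a://bucketid/flightdata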

Related

How to copy a file from a GCS bucket in Dataproc to HDFS using google cloud?

I had uploaded the data file to the GCS bucket of my project in Dataproc. Now I want to copy that file to HDFS. How can I do that?
For a single "small" file
You can copy a single file from Google Cloud Storage (GCS) to HDFS using the hdfs copy command. Note that you need to run this from a node within the cluster:
hdfs dfs -cp gs://<bucket>/<object> <hdfs path>
This works because hdfs://<master node> is the default filesystem. You can explicitly specify the scheme and NameNode if desired:
hdfs dfs -cp gs://<bucket>/<object> hdfs://<master node>/<hdfs path>
Note that GCS objects use the gs: scheme. Paths should appear the same as they do when you use gsutil.
For a "large" file or large directory of files
When you use hdfs dfs, data is piped through your local machine. If you have a large dataset to copy, you will likely want to do this in parallel on the cluster using DistCp:
hadoop distcp gs://<bucket>/<directory> <HDFS target directory>
Consult the DistCp documentation for details.
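For example (bucket and directory names here are only placeholders), -update skips files that already exist at the destination and -m caps the number of parallel copy tasks:
hadoop distcp -update -m 20 gs://my-bucket/dataset /user/myuser/dataset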
Consider leaving data on GCS
Finally, consider leaving your data on GCS. Because the GCS connector implements Hadoop's distributed filesystem interface, it can be used as a drop-in replacement for HDFS in most cases. Notable exceptions are when you rely on (most) atomic file/directory operations or want to use a latency-sensitive application like HBase. The Dataproc HDFS migration guide gives a good overview of data migration.
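As a quick illustration of the drop-in behaviour (jar, class, and bucket names are placeholders), a MapReduce job can read from and write to GCS directly by passing gs:// paths where you would otherwise pass HDFS paths:
hadoop jar wordcount.jar WordCount gs://my-bucket/input gs://my-bucket/output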

Running oozie job using a modified hadoop config file to support S3 to HDFS

Hello, I am trying to copy a file from my S3 bucket into HDFS using the cp command.
I do something like
hadoop --config config fs -cp s3a://path hadooppath
This works well when my config directory is on my local machine.
However, now I am trying to set it up as an Oozie job, and I am unable to pass the configuration files present in the config directory on my local system. Even if the config is in HDFS, it still doesn't seem to work. Any suggestions?
I tried the -D option in Hadoop and passed name/value pairs, but it still throws an error. It only works from my local system.
Did you try DistCp in Oozie? Hadoop 2.7.2 supports S3 as a data source, and you can schedule the copy with coordinators. Just pass the credentials to the coordinator, either via the REST API or in the properties file. It's an easy way to copy data periodically (in a scheduled manner).
${HADOOP_HOME}/bin/hadoop distcp s3://<source>/ hdfs://<destination>/
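A rough sketch of the submission side (the Oozie URL, application path, and property names are placeholders, and it assumes your coordinator/workflow definitions actually read and forward these properties to the DistCp action):
# job.properties carries the connection details and S3 credentials for the coordinator
cat > job.properties <<'EOF'
nameNode=hdfs://namenode_host:9000
jobTracker=resourcemanager_host:8032
oozie.coord.application.path=${nameNode}/user/ubuntu/apps/s3-copy
fs.s3a.access.key=...
fs.s3a.secret.key=...
EOF
# submit the coordinator to the Oozie server
oozie job -oozie http://oozie-host:11000/oozie -config job.properties -run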

Copy files from Remote Unix and Windows servers into HDFS without intermediate staging

How can I copy files from remote Unix and Windows servers into HDFS without intermediate staging from the command line?
You can use the following command:
hadoop fs -cp /user/myuser/copyTestFolder/* hdfs://remoteServer:8020/user/remoteuser/copyTestFolder/
or vice versa to copy from the server to the local machine.
You can also read the Hadoop documentation.
You can use WebHDFS and cURL to upload files. This does not require any Hadoop binaries on the client, just cURL or a cURL-like client. The BigInsights Knowledge Center has information on how to administer the file system using the HttpFS REST APIs.
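For example, the standard two-step WebHDFS upload looks roughly like this (host, port, user, and paths are placeholders; an HttpFS server exposes the same REST API, typically on port 14000):
# Step 1: ask the NameNode where to write; the reply is a 307 redirect with a Location header
curl -i -X PUT "http://namenode-host:50070/webhdfs/v1/user/remoteuser/copyTestFolder/file.txt?op=CREATE&user.name=remoteuser"
# Step 2: send the file body to the URL returned in the Location header
curl -i -X PUT -T file.txt "<URL from the Location header>"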

Transferring scripts from s3 to emr master

I've managed to get data files distributed on EMR clusters, but I can't get the simple Python scripts copied over to the master instance to run the Hadoop job.
Using the AWS CLI (aws s3 cp s3://the_bucket/the_script.py .) returns
A client error (Forbidden) occurred when calling the HeadObject operation: Forbidden.
I tried starting EMR clusters from the console, checking the defaults in the IAM roles section,
I've set up the two IAM roles EMR_DefaultRole and EMR_EC2_DefaultRole, making sure they had all S3 access permissions available,
and I've made sure to run aws configure for both ec2-user and hadoop (confirming the right credentials were in ~/.aws/config).
I still get the error above. If the hadoop user can distcp the data from the same S3 bucket that holds my Python scripts, shouldn't the hadoop user be able to copy those scripts using aws s3? Isn't the same user (hadoop) accessing the same bucket? Thanks for any pointers.

Does a file need to be in HDFS in order to use it in distributed cache?

I get
Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: file:/path/to/my.jar, expected: hdfs://ec2-xx-xx-xx-xxx.compute-1.amazonaws.com
if I try to add a local file to the distributed cache in Hadoop. When the file is on HDFS, I don't get this error (obviously, since it's using the expected FS). Is there a way to use a local file in the distributed cache without first copying it to HDFS? Here is a code snippet:
Configuration conf = job.getConfiguration();
FileSystem fs = FileSystem.getLocal(conf);
Path dependency = fs.makeQualified(new Path("/local/path/to/my.jar"));
DistributedCache.addArchiveToClassPath(dependency, conf);
Thanks
It has to be in HDFS first. I'm going to go out on a limb here, but I think it is because the file is "pulled" to the local distributed cache by the slaves, not pushed. Since the files are pulled, the slaves have no way to access that local path.
No, I don't think you can put anything in the distributed cache without it being in HDFS first. All Hadoop jobs use input/output paths relative to HDFS.
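For example, one way to follow that advice (paths are placeholders) is to stage the jar in HDFS yourself and then pass the hdfs:// path to makeQualified/addArchiveToClassPath instead of the local path:
# copy the dependency into HDFS first, then reference hdfs:///user/ubuntu/my.jar in the job setup
hdfs dfs -put /local/path/to/my.jar /user/ubuntu/my.jar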
The file can be in the local file system, HDFS, S3, or another cluster. You need to specify it as
-files hdfs://<path to file> if the file is in HDFS;
by default, it assumes the local file system.
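For example (the jar, class, and paths are placeholders, and this assumes your driver uses ToolRunner so the generic -files option is parsed):
# local file: the client uploads it to the job's staging area and the distributed cache
hadoop jar my-job.jar com.example.MyDriver -files /local/path/to/lookup.txt /input /output
# file already in HDFS: pass the full hdfs:// URI instead
hadoop jar my-job.jar com.example.MyDriver -files hdfs:///user/ubuntu/lookup.txt /input /output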
