How to copy from subdirectories using s3DistCp - hadoop

Trying to use s3DistCp to copy from s3://my-bucket/dir1/, s3://my-bucket/dir2/, and s3://my-bucket/dir3/.
All three directories have some files in them. I wanted to do something like:
hadoop jar s3distcp.jar --src s3://my-bucket/*/ --dest s3://my-bucket/some-other-dir/
But it generates an error saying:
's3://my-bucket/*/' directory not found...
So does it mean s3DistCp doesn't take wildcards in paths? Is there any work around or any ideas?

Have you tried with s3n? See: Difference between Amazon S3 and S3n in Hadoop
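If the wildcard isn't accepted, one common workaround is to point --src at the common parent and filter with --srcPattern, assuming you are running EMR's s3distcp (which takes a regular expression matched against the full key path). A sketch; the pattern here is only illustrative:
hadoop jar s3distcp.jar --src s3://my-bucket/ --srcPattern '.*/dir[123]/.*' --dest s3://my-bucket/some-other-dir/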

Related

HDFS or Hadoop command to sync files or folders between local and HDFS

I have local files which get added daily, so I want to sync the newly added files to HDFS.
I tried the command below, but it does a complete copy every time. I want a command that copies only the newly added files:
$ hdfs dfs -cp /home/user/files/* /data/files/*
You can use hsync.
https://github.com/alexholmes/hsync
It's Alex's custom package, and while it's perhaps useful on a dev box, it could be hard to deploy in a production environment. I am looking for a similar solution, but for now this seems to be the closest. The other option is to write your own shell script that compares source/target file times and then copies only the newer files, as sketched below.
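A minimal sketch of that approach, assuming files are only ever added (never modified in place), so a plain existence check on the HDFS side is enough:
for f in /home/user/files/*; do
  name=$(basename "$f")
  # copy only files that do not exist in HDFS yet
  hdfs dfs -test -e "/data/files/$name" || hdfs dfs -put "$f" /data/files/
done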

Copy files from HDFS to Amazon S3 using distcp and the s3a scheme

I am using Apache Hadoop version 2.7.2 and trying to copy files from HDFS to Amazon S3 using the command below.
hadoop distcp hdfs://<<namenode_host>>:9000/user/ubuntu/input/flightdata s3a://<<bucketid>>
I get the exception below when running the above command.
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: s3a://<<bucketid>>.distcp.tmp.attempt_1462460298670_0004_m_000001_0
Thanks much for the help.
It should be possible to go from HDFS to S3 - I have done it before using syntax like the following, running it from an HDFS cluster:
hadoop distcp -Dfs.s3a.access.key=... -Dfs.s3a.secret.key=... /user/vagrant/bigdata s3a://mytestbucket/bigdata
If you run your command like this, does it work:
hadoop distcp hdfs://namenode_host:9000/user/ubuntu/input/flightdata s3a://bucketid/flightdata
From the exception, it looks like it is expecting a 'folder' to put the data in, as opposed to the root of the bucket.
You need to provide AWS credentials in order to transfer files between HDFS and S3.
You can pass the access_key_id and secret parameters as shown by #stephen above, but for production use you should use the Credential Provider API, which lets you manage your credentials without passing them around in individual commands (see the sketch below).
Ref: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/CredentialProviderAPI.html
Secondly, you do not need to specify the "hdfs" protocol. An absolute HDFS path is sufficient.
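A minimal sketch of the Credential Provider approach; the jceks location used here is just an example path:
# store the keys once in a Java keystore on HDFS (each command prompts for the value)
hadoop credential create fs.s3a.access.key -provider jceks://hdfs/user/ubuntu/s3.jceks
hadoop credential create fs.s3a.secret.key -provider jceks://hdfs/user/ubuntu/s3.jceks
# then reference the provider instead of passing raw keys on the command line
hadoop distcp -Dhadoop.security.credential.provider.path=jceks://hdfs/user/ubuntu/s3.jceks /user/ubuntu/input/flightdata s3a://bucketid/flightdata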

Running an Oozie job using a modified Hadoop config file to copy from S3 to HDFS

Hello, I am trying to copy a file from my S3 bucket into HDFS using the cp command.
I do something like:
hadoop --config config fs -cp s3a://path hadooppath
This works well when the config directory is on my local machine.
However, I am now trying to set it up as an Oozie job, and I am unable to pass the configuration files present in the config directory on my local system. Even if the directory is in HDFS, it still doesn't seem to work. Any suggestions?
I tried the -D option in Hadoop and passed name/value pairs, but it still throws an error. It works only from my local system.
Did you try DistCp in Oozie? Hadoop 2.7.2 supports S3 as a data source, and you can schedule the copy with coordinators. Just pass the credentials to the coordinator either via the REST API or in a properties file. It's an easy way to copy data periodically (in a scheduled manner):
${HADOOP_HOME}/bin/hadoop distcp s3://<source>/ hdfs://<destination>/
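For reference, a sketch of what that copy might look like with the s3a connector and explicit credentials; the bucket, paths, and key values here are placeholders, and in an Oozie setup they would normally come from the coordinator's properties file rather than being hard-coded:
${HADOOP_HOME}/bin/hadoop distcp -Dfs.s3a.access.key=YOUR_ACCESS_KEY -Dfs.s3a.secret.key=YOUR_SECRET_KEY s3a://my-bucket/source/ hdfs://namenode:8020/user/me/target/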

Copy a file using WebHDFS

Is there a way to copy a file from (let's say) hdfs://old to hdfs://new without first downloading the file and then uploading it again?
Don't know about WebHDFS, but this is achievable using hadoop distcp.
The command looks something like this:
hadoop distcp hdfs://old_nn:8020/old/location/path.file hdfs://new_nn:8020/new/location/path.file
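If you specifically want the transfer to go over WebHDFS (for example, when the two clusters run different Hadoop versions), distcp can also use the webhdfs:// scheme. A sketch, assuming the default namenode HTTP port of 50070:
hadoop distcp webhdfs://old_nn:50070/old/location/path.file webhdfs://new_nn:50070/new/location/path.file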

Make files available locally on Elastic MapReduce

The Hadoop documentation states it's possible to make files available locally by use of the -file option.
How can I do this using the Elastic MapReduce Ruby CLI?
You could use the DistributedCache with EMR to do this.
With the Ruby client this can be done with the following option:
`--cache <path_to_file_being_cached#name_in_current_working_dir>`
It places a single file in the DistributedCache. It lets you specify the location (s3n or hdfs) of the file followed by its name as referenced in the current working directory of the application, and will place the file locally on your task nodes in the directory identified by mapred.local.dir (I think).
You can then access the cached file in your Mapper/Reducer tasks easily. I believe you can access it directly just like any normal file, but you may have to do something like DistributedCache.getLocalCacheFiles(job); in the setup method of your tasks.
An example of doing this with the Ruby client, taken from Amazon's forums:
./elastic-mapreduce --create --stream --input s3n://your_bucket/wordcount/input --output s3n://your_bucket/wordcount/output --mapper s3n://your_bucket/wordcount/wordSplitter.py --reducer aggregate --cache s3n://your_bucket/wordcount/stop-word-list#stop-word-list
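For a streaming job like the one above, the cached file simply shows up under its symlink name in the task's working directory. A minimal sketch of a mapper that uses it that way (a hypothetical script, not the wordSplitter.py from the example):
#!/bin/bash
# Hypothetical streaming mapper: the cached file is visible by its symlink
# name ("stop-word-list") in the task's current working directory.
# Split input into one word per line, drop stop words, emit "word<TAB>1".
tr -s '[:space:]' '\n' | grep -v '^$' | grep -vxF -f stop-word-list | awk '{print $0 "\t1"}'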

Resources