GCP Dataproc cluster Hadoop job to move data from a GCS bucket to an Amazon S3 bucket fails (submitted via console)

First question on stack overflow, so please forgive me for any rookie mistakes.
I am currently working on moving a very large amount of data (700+ GiB), consisting of many small files of about 1-10 MB each, from a folder in a GCS bucket to a folder in an S3 bucket.
Several attempts I made:
gsutil -m rsync -r gs://<path> s3://<path>
This results in a timeout due to the large amount of data.
gsutil -m cp -r gs://<path> s3://<path>
This takes far too long. Even with many parallel processes and/or threads it still transfers at only about 3.4 MiB/s on average. I made sure to upgrade the VM instance for this attempt.
using rclone
Same performance issue as cp
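For what it's worth, this is a minimal sketch of how the parallelism knobs mentioned above are usually raised through boto options; the paths are placeholders and the values would need tuning to the machine:
gsutil -o "GSUtil:parallel_process_count=8" \
       -o "GSUtil:parallel_thread_count=16" \
       -m cp -r gs://<path> s3://<path>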
Recently I found another possible method of doing this. However, I am not familiar with GCP, so please bear with me.
This is the reference I found: https://medium.com/swlh/transfer-data-from-gcs-to-s3-using-google-dataproc-with-airflow-aa49dc896dad
The method involves creating a Dataproc cluster through the GCP console with the following configuration:
Name: <dataproc-cluster-name>
Region: asia-southeast1
Nodes configuration: 1 main, 2 workers (2 vCPUs, 3.75 GB memory, 30 GB persistent disk each)
Properties:
core fs.s3.awsAccessKeyId <key>
core fs.s3.awsSecretAccessKey <secret>
core fs.s3.impl org.apache.hadoop.fs.s3.S3FileSystem
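For reference, roughly the same cluster can be created from a shell with gcloud instead of the console; this is a sketch under the assumption that the same three properties are wanted (machine type and key values are placeholders):
gcloud dataproc clusters create <dataproc-cluster-name> \
  --region=asia-southeast1 \
  --num-workers=2 \
  --master-machine-type=<machine-type> \
  --worker-machine-type=<machine-type> \
  --properties='core:fs.s3.awsAccessKeyId=<key>,core:fs.s3.awsSecretAccessKey=<secret>,core:fs.s3.impl=org.apache.hadoop.fs.s3.S3FileSystem'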
Then I submit the job through the console menu on the GCP website.
At this point I start noticing issues: I cannot find hadoop-mapreduce/hadoop-distcp.jar anywhere. I can only find /usr/lib/hadoop/hadoop-distcp.jar by browsing the root filesystem on my main Dataproc cluster VM instance.
The job I submit:
Start time: 31 Mar 2021, 16:00:25
Elapsed time: 3 sec
Status: Failed
Region: asia-southeast1
Cluster: <cluster-name>
Job type: Hadoop
Main class or JAR: file://usr/lib/hadoop/hadoop-distcp.jar
Arguments: -update gs://* s3://*
The job returns an error:
/usr/lib/hadoop/libexec//hadoop-functions.sh: line 2400: HADOOP_COM.GOOGLE.CLOUD.HADOOP.SERVICES.AGENT.JOB.SHIM.HADOOPRUNJARSHIM_USER: invalid variable name
/usr/lib/hadoop/libexec//hadoop-functions.sh: line 2365: HADOOP_COM.GOOGLE.CLOUD.HADOOP.SERVICES.AGENT.JOB.SHIM.HADOOPRUNJARSHIM_USER: invalid variable name
/usr/lib/hadoop/libexec//hadoop-functions.sh: line 2460: HADOOP_COM.GOOGLE.CLOUD.HADOOP.SERVICES.AGENT.JOB.SHIM.HADOOPRUNJARSHIM_OPTS: invalid variable name
2021-03-31 09:00:28,549 ERROR tools.DistCp: Invalid arguments:
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3.S3FileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2638)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3342)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3374)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:126)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3425)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3393)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:486)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
at org.apache.hadoop.tools.DistCp.setTargetPathExists(DistCp.java:240)
at org.apache.hadoop.tools.DistCp.run(DistCp.java:143)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.hadoop.tools.DistCp.main(DistCp.java:441)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
at org.apache.hadoop.util.RunJar.main(RunJar.java:236)
at com.google.cloud.hadoop.services.agent.job.shim.HadoopRunJarShim.main(HadoopRunJarShim.java:12)
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3.S3FileSystem not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2542)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2636)
... 18 more
Invalid arguments: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3.S3FileSystem not found
usage: distcp OPTIONS [source_path...] <target_path>
OPTIONS
-append Reuse existing data in target files and
append new data to them if possible
-async Should distcp execution be blocking
-atomic Commit all changes or none
-bandwidth <arg> Specify bandwidth per map in MB, accepts
bandwidth as a fraction.
-blocksperchunk <arg> If set to a positive value, fileswith more
blocks than this value will be split into
chunks of <blocksperchunk> blocks to be
transferred in parallel, and reassembled on
the destination. By default,
<blocksperchunk> is 0 and the files will be
transmitted in their entirety without
splitting. This switch is only applicable
when the source file system implements
getBlockLocations method and the target
file system implements concat method
-copybuffersize <arg> Size of the copy buffer to use. By default
<copybuffersize> is 8192B.
-delete Delete from target, files missing in
source. Delete is applicable only with
update or overwrite options
-diff <arg> Use snapshot diff report to identify the
difference between source and target
-direct Write files directly to the target
location, avoiding temporary file rename.
-f <arg> List of files that need to be copied
-filelimit <arg> (Deprecated!) Limit number of files copied
to <= n
-filters <arg> The path to a file containing a list of
strings for paths to be excluded from the
copy.
-i Ignore failures during copy
-log <arg> Folder on DFS where distcp execution logs
are saved
-m <arg> Max number of concurrent maps to use for
copy
-numListstatusThreads <arg> Number of threads to use for building file
listing (max 40).
-overwrite Choose to overwrite target files
unconditionally, even if they exist.
-p <arg> preserve status (rbugpcaxt)(replication,
block-size, user, group, permission,
checksum-type, ACL, XATTR, timestamps). If
-p is specified with no <arg>, then
preserves replication, block size, user,
group, permission, checksum type and
timestamps. raw.* xattrs are preserved when
both the source and destination paths are
in the /.reserved/raw hierarchy (HDFS
only). raw.* xattrpreservation is
independent of the -p flag. Refer to the
DistCp documentation for more details.
-rdiff <arg> Use target snapshot diff report to identify
changes made on target
-sizelimit <arg> (Deprecated!) Limit number of files copied
to <= n bytes
-skipcrccheck Whether to skip CRC checks between source
and target paths.
-strategy <arg> Copy strategy to use. Default is dividing
work based on file sizes
-tmp <arg> Intermediate work path to be used for
atomic commit
-update Update target, copying only missing files
or directories
-v Log additional info (path, size) in the
SKIP/COPY log
-xtrack <arg> Save information about missing source files
to the specified directory
How can I fix this problem? The fixes I found online weren't very helpful: they either used the Hadoop CLI or had different JAR files than mine. For example, this one: Move data from google cloud storage to S3 using dataproc hadoop cluster and airflow, and https://github.com/CoorpAcademy/docker-pyspark/issues/13
Disclaimers: I do not use the Hadoop CLI or Airflow; I use the console to do this. Submitting the job through the Dataproc cluster's main VM instance shell also returns the same error. If either of those tools is required, any detailed reference would be appreciated. Thank you very much!
Update:
Fixed the misleading path placeholders in the gsutil part.
The problem was that S3FileSystem is no longer supported by Hadoop, so I had to downgrade to an image with Hadoop 2.10 [FIXED]. The speed, however, was still not satisfactory.
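For anyone reproducing the fixed setup from a shell instead of the console, the submission corresponds roughly to the sketch below (placeholder paths; it assumes an image whose Hadoop still ships the class named in fs.s3.impl, e.g. a Hadoop 2.10 / Dataproc 1.5 image, or that the properties are switched to the s3a connector on newer images):
gcloud dataproc jobs submit hadoop \
  --cluster=<cluster-name> \
  --region=asia-southeast1 \
  --jar=file:///usr/lib/hadoop/hadoop-distcp.jar \
  -- -update gs://<bucket>/<folder> s3://<bucket>/<folder>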

I think the Dataproc solution is overkill in your case. Dataproc would make sense if you needed to copy something like a TB of data from GCS to S3 daily or hourly, but it sounds like yours is just a one-time copy that you can let run for hours or days. I'd suggest running gsutil on a Google Cloud (GCP) instance; I've tried an AWS EC2 instance for this, and it is always markedly slower for this particular operation.
Create your source and destination buckets in the same region. For example, us-east4 (N. Virginia) for GCS and us-east-1 (N. Virginia) for S3. Then deploy your instance in the same GCP region.
gsutil -m cp -r gs://* s3://*
...is probably not going to work. It definitely does not work in Dataproc, which always errors out unless I give either an explicit file location or a bucket/folder path that ends with /.
Instead, first try to explicitly copy one file successfully. Then try a whole folder or bucket.
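For example (a sketch with placeholder paths):
gsutil cp gs://<bucket>/<folder>/<one-file>.dat s3://<bucket>/<folder>/<one-file>.dat
gsutil -m cp -r gs://<bucket>/<folder>/ s3://<bucket>/<folder>/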
How many files are you trying to copy?

Related

Streamsets Mapr FS origin/dest. KerberosPrincipal exception (using hadoop impersonation (in mapr 6.0))

I am trying to do a simple data move from a mapr fs origin to a mapr fs destination (this is not my use case, just doing this simple movement for testing purposes). When trying to validate this pipeline, the error message I see in the staging area is:
HADOOPFS_11 - Cannot connect to the filesystem. Check if the Hadoop FS location: 'maprfs:///mapr/mycluster.cluster.local' is valid or not: 'java.io.IOException: Provided Subject must contain a KerberosPrincipal
Trying different variations of the Hadoop FS URI field (e.g. mfs:///mapr/mycluster.cluster.local, maprfs:///mycluster.cluster.local) does not seem to help. Looking at the logs after trying to validate, I see:
2018-01-04 10:28:56,686 mfs2mfs/mapr2sqlserver850bfbf0-6dc0-4002-8d44-b73e33fcf9b3 INFO Created source of type: com.streamsets.pipeline.stage.origin.maprfs.ClusterMapRFSSource#16978460 DClusterSourceOffsetCommitter *admin preview-pool-1-thread-3
2018-01-04 10:28:56,697 mfs2mfs/mapr2sqlserver850bfbf0-6dc0-4002-8d44-b73e33fcf9b3 INFO Error connecting to FileSystem: java.io.IOException: Provided Subject must contain a KerberosPrincipal ClusterHdfsSource *admin preview-pool-1-thread-3
java.io.IOException: Provided Subject must contain a KerberosPrincipal
....
2018-01-04 10:20:39,159 mfs2mfs/mapr2mapr850bfbf0-6dc0-4002-8d44-b73e33fcf9b3 INFO Authentication Config: ClusterHdfsSource *admin preview-pool-1-thread-3
2018-01-04 10:20:39,159 mfs2mfs/mapr2mapr850bfbf0-6dc0-4002-8d44-b73e33fcf9b3 ERROR Issues: Issue[instance='MapRFS_01' service='null' group='HADOOP_FS' config='null' message='HADOOPFS_11 - Cannot connect to the filesystem. Check if the Hadoop FS location: 'maprfs:///mapr/mycluster.cluster.local' is valid or not: 'java.io.IOException: Provided Subject must contain a KerberosPrincipal''] ClusterHdfsSource *admin preview-pool-1-thread-3
2018-01-04 10:20:39,169 mfs2mfs/mapr2mapr850bfbf0-6dc0-4002-8d44-b73e33fcf9b3 INFO Validation Error: Failed to configure or connect to the 'maprfs:///mapr/mycluster.cluster.local' Hadoop file system: java.io.IOException: Provided Subject must contain a KerberosPrincipal HdfsTargetConfigBean *admin 0 preview-pool-1-thread-3
java.io.IOException: Provided Subject must contain a KerberosPrincipal
....
However, to my knowledge, the system is not running Kerberos, so this error message is a bit confusing to me. Uncommenting #export SDC_JAVA_OPTS="-Dmaprlogin.password.enabled=true ${SDC_JAVA_OPTS}" in the sdc environment variable file for native mapr authentication did not seem to help the problem (even when reinstalling and commenting this line before running the streamsets mapr setup script).
Does anyone have any idea what is happening and how to fix it? Thanks.
This answer was provided on the mapr community forums and worked for me (using mapr v6.0). Note that the instructions here differ from those currently provided by the streamsets documentation. Throughout these instructions, I was logged in as user root.
After installing streamsets (and the mapr prerequisites) as per the documentation...
Change the owner of the streamsets $SDC_DIST or $SDC_HOME location to the mapr user (or whatever other user you plan to use for the hadoop impersonation): $chown -R mapr:mapr $SDC_DIST (for me this was the /opt/streamsets-datacollector dir.). Do the same for $SDC_CONF (/etc/sdc for me) as well as /var/lib/sdc and /var/log/sdc.
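Assuming the default locations mentioned above, that amounts to something like the following (adjust the paths to your install):
$chown -R mapr:mapr /opt/streamsets-datacollector /etc/sdc /var/lib/sdc /var/log/sdc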
In $SDC_DIST/libexec/sdcd-env.sh, set the user and group name (near the top of the file) to the mapr user "mapr" and enable mapr password login. The file should end up looking like:
# user that will run the data collector, it must exist in the system
#
export SDC_USER=mapr
# group of the user that will run the data collector, it must exist in the system
#
export SDC_GROUP=mapr
....
# Indicate that MapR Username/Password security is enabled
export SDC_JAVA_OPTS="-Dmaprlogin.password.enabled=true ${SDC_JAVA_OPTS}"
Edit the file /usr/lib/systemd/system/sdc.service to look like:
[Service]
User=mapr
Group=mapr
$cd into /etc/systemd/system/ and create a directory called sdc.service.d. Within that directory, create a file (with any name) and add the contents (without spaces):
Environment=SDC_JAVA_OPTS=-Dmaprlogin.password.enabled=true
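As a sketch of that step (the file name mapr.conf here is arbitrary, but note that systemd normally requires drop-in files to end in .conf and directives to sit under a [Service] header):
$mkdir -p /etc/systemd/system/sdc.service.d
$cat > /etc/systemd/system/sdc.service.d/mapr.conf <<'EOF'
[Service]
Environment=SDC_JAVA_OPTS=-Dmaprlogin.password.enabled=true
EOF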
If you are using mapr's sasl ticket auth. system (or something similar), generate a ticket for this user on the node that is running streamsets, in this case with the $maprlogin password command.
Then finally, restart the sdc service: $systemctl daemon-reload then $systemctl restart sdc.
Run something like $ps -aux | grep sdc | grep maprlogin to check that the sdc process is owned by mapr and that the -Dmaprlogin.password.enabled=true parameter has been successfully set. Once this is done, you should be able to validate/run maprFS-to-maprFS operations in the streamsets pipeline builder in batch processing mode.
** NOTE: If using the Hadoop Configuration Directory param. instead of the Hadoop FS URI, remember to have the files from your $HADOOP_HOME/conf directory (e.g. hadoop-site.xml, yarn-site.xml, etc.; in the case of mapr, something like /opt/mapr/hadoop/hadoop-<version>/etc/hadoop/) either soft-linked or hard-copied to a directory $SDC_DIST/resources/<some hadoop config dir. you may need to create> (I just copy everything in the directory), and add this path to the Hadoop Configuration Directory param. for your MaprFS (or HadoopFS) stage. In the sdc web UI Hadoop Configuration Directory box, it would look like: Hadoop Configuration Directory: <the directory within $SDC_DIST/resources/ that holds the hadoop files>.
** NOTE: If you are still logging errors of the form
2018-01-16 14:26:10,883
ingest2sa_demodata_batch/ingest2sademodatabatchadca8442-cb00-4a0e-929b-df2babe4fd41
ERROR Error in Slave Runner: ClusterRunner *admin
runner-pool-2-thread-29
com.streamsets.datacollector.runner.PipelineRuntimeException:
CONTAINER_0800 - Pipeline
'ingest2sademodatabatchadca8442-cb00-4a0e-929b-df2babe4fd41'
validation error : HADOOPFS_11 - Cannot connect to the filesystem.
Check if the Hadoop FS location: 'maprfs:///' is valid or not:
'java.io.IOException: Provided Subject must contain a
KerberosPrincipal'
you may also need to add -Dmaprlogin.password.enabled=true to the pipeline's /cluster/Worker Java Options tab for the origin and destination hadoop FS stages.
** The video linked from the mapr community post also says to generate a mapr ticket for the sdc user (the default user the sdc process runs as when running as a service), but I did not do this and the solution still worked for me (so if anyone has any idea why it should be done regardless, please let me know in the comments).

Spark distribute local file from master to nodes

I used to run Spark locally, and distributing files to nodes never caused me problems, but now that I am moving things to the Amazon cluster service, things are starting to break down. Basically, I am processing some IPs using the Maxmind GeoLiteCity.dat, which I placed on the local file system on the master (file:///home/hadoop/GeoLiteCity.dat).
Following an earlier question, I used sc.addFile:
sc.addFile("file:///home/hadoop/GeoLiteCity.dat")
and reference it using something like:
val ipLookups = IpLookups(geoFile = Some(SparkFiles.get("GeoLiteCity.dat")), memCache = false, lruCache = 20000)
This works when running locally on my computer, but seems to be failing on the cluster (I do not know the reason for the failure, but I would appreciate it if someone could tell me how to display the logs for the process; the logs generated by the Amazon service do not contain any information on which step is failing).
Do I have to somehow load the GeoLiteCity.dat onto the HDFS? Are there other ways to distribute a local file from the master across to the nodes without HDFS?
EDIT: Just to specify the way I run this: I wrote a JSON file which does multiple steps; the first step is to run a bash script which transfers the GeoLiteCity.dat from Amazon S3 to the master:
#!/bin/bash
cd /home/hadoop
aws s3 cp s3://test/GeoLiteCity.dat GeoLiteCity.dat
After checking that the file is in the directory, the JSON then executes the Spark JAR, but it fails. The logs produced by the Amazon web UI do not show where the code breaks.
Instead of copying the file onto the master, load the file into S3 and read it from there.
Refer to http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html for reading files from S3.
You need to provide an AWS access key ID and secret key. Either set the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY or set them programmatically, like:
sc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", YOUR_ACCESS_KEY)
sc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", YOUR_SECRET_KEY)
Then you can just read the file as a text file, like:
sc.textFile("s3n://test/GeoLiteCity.dat")
Additional reference :
How to read input from S3 in a Spark Streaming EC2 cluster application
https://stackoverflow.com/a/30852341/4057655

Accessing read-only Google Storage buckets from Hadoop

I am trying to access a Google Storage bucket from a Hadoop cluster deployed in Google Cloud using the bdutil script. It fails if bucket access is read-only.
What am I doing:
Deploy a cluster with
bdutil deploy -e datastore_env.sh
On the master:
vgorelik#vgorelik-hadoop-m:~$ hadoop fs -ls gs://pgp-harvard-data-public 2>&1 | head -10
14/08/14 14:34:21 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.2.8-hadoop1
14/08/14 14:34:25 WARN gcsio.GoogleCloudStorage: Repairing batch of 174 missing directories.
14/08/14 14:34:26 ERROR gcsio.GoogleCloudStorage: Failed to repair some missing directories.
java.io.IOException: Multiple IOExceptions.
java.io.IOException: Multiple IOExceptions.
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageExceptions.createCompositeException(GoogleCloudStorageExceptions.java:61)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.createEmptyObjects(GoogleCloudStorageImpl.java:361)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.createEmptyObjects(GoogleCloudStorageImpl.java:372)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.listObjectInfo(GoogleCloudStorageImpl.java:914)
at com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage.listObjectInfo(CacheSupplementedGoogleCloudStorage.java:455)
Looking at the GCS Java source code, it seems that the Google Cloud Storage Connector for Hadoop needs empty "directory" objects, which it can create on its own if the bucket is writeable; otherwise it fails. Setting fs.gs.implicit.dir.repair.enable=false leads to an "Error retrieving object" error.
Is it possible to use read-only buckets as MR job input somehow?
I use gsutil for file uploads. Can it be forced to create these empty objects on upload?
Yes, you can use a read-only Google Cloud Storage bucket as input for a Hadoop job.
For example, I have run this job many times:
./hadoop-install/bin/hadoop \
jar ./hadoop-install/contrib/streaming/hadoop-streaming-1.2.1.jar \
-input gs://pgp-harvard-data-public/hu0*/*/*/*/ASM/master* \
-mapper cgi-mapper.py -file cgi-mapper.py --numReduceTasks 0 \
-output gs://big-data-roadshow/output
This accesses the same read-only bucket you mention in your example above.
The difference between our examples is that mine ends with a glob (*), which the Google Cloud Storage Connector for Hadoop is able to expand without needing to use any of the "placeholder" directory objects.
I recommend you use gsutil to explore the read-only bucket you're interested in (since it doesn't need the "placeholder" objects) and once you have a glob expression that returns the list of objects you want processed, use that glob expression in your hadoop command.
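For example, to sanity-check a glob before handing it to hadoop (reusing the pattern from the streaming job above):
gsutil ls gs://pgp-harvard-data-public/hu0*/*/*/*/ASM/master* | head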
The answer to your second question ("Can gsutil be forced to create these empty objects on file upload") is currently "no".

Using hadoop distcp to copy data to s3 block filesystem: The specified copy source is larger than the maximum allowable size for a copy source

I'm trying to use hadoop's distcp to copy data from HDFS to S3 (not S3N). My understanding is that using the s3:// protocol, Hadoop will store the individual blocks on S3, and each S3 'file' will effectively be an HDFS block.
Hadoop version is 2.2.0 running on Amazon EMR.
However, trying to do a simple distcp, I get the following error:
Caused by: Status Code: 400, AWS Service: Amazon S3, AWS Request ID: 71C64ECE79FCC244, AWS Error Code: InvalidRequest, AWS Error Message: The specified copy source is larger than the maximum allowable size for a copy source: 5368709120, S3 Extended Request ID: uAnvxtrNolvs0qm6htIrKjpD0VFxzjqgIeN9RtGFmXflUHDcSqwnZGZgWt5PwoTy
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:619)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:317)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:170)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:2943)
at com.amazonaws.services.s3.AmazonS3Client.copyObject(AmazonS3Client.java:1235)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.copy(Jets3tNativeFileSystemStore.java:277)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at org.apache.hadoop.fs.s3native.$Proxy11.copy(Unknown Source)
at org.apache.hadoop.fs.s3native.NativeS3FileSystem.rename(NativeS3FileSystem.java:1217)
at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.promoteTmpToTarget(RetriableFileCopyCommand.java:161)
at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doCopy(RetriableFileCopyCommand.java:110)
at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doExecute(RetriableFileCopyCommand.java:83)
at org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:87)
Some of my source files are >5GB. Looking at the error, it seems that distcp is trying to blindly copy files from HDFS into S3, as if it were using the S3 Native filesystem. Because of the files that are >5GB, this is failing, as S3 doesn't support put requests >5GB.
Why is this happening? I would have thought that distcp would try to put the individual blocks onto S3, and these should only be 64MB (my HDFS blocksize).
In order to write files larger than 5 GB, one must use multipart uploads. This seems to have been fixed in Hadoop version 2.4.0 (see: https://issues.apache.org/jira/browse/HADOOP-9454).
That said, this is one of the reasons why it makes sense to use AWS-native Hadoop offerings like EMR and Qubole. They are already set up to deal with such idiosyncrasies. (Full disclosure: I am one of the founders at Qubole.) In addition to vanilla multipart uploads, we also support streaming multipart uploads, where the file is continuously uploaded to S3 in small chunks even as it is being generated. (In vanilla multipart upload, we first wait for the file to be fully generated and only then upload it in chunks to S3.)
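If you stay on the s3n connector with Hadoop 2.4.0 or later, multipart uploads can reportedly be switched on per job through the property added by HADOOP-9454; a hedged sketch with placeholder keys and paths:
hadoop distcp \
  -Dfs.s3n.multipart.uploads.enabled=true \
  -Dfs.s3n.awsAccessKeyId=<key> \
  -Dfs.s3n.awsSecretAccessKey=<secret> \
  hdfs:///user/<user>/data s3n://<bucket>/data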
Here is the example from the wiki: http://wiki.apache.org/hadoop/AmazonS3
% ${HADOOP_HOME}/bin/hadoop distcp hdfs://domU-12-31-33-00-02-DF:9001/user/nutch/0070206153839-1998 s3://123:456#nutch/

Make files available locally on Elastic MapReduce

The Hadoop documentation states it's possible to make files available locally by use of the -file option.
How can I do this using the Elastic MapReduce Ruby CLI?
You could use the DistributedCache with EMR to do this.
With the ruby client this can be done with the following option:
`--cache <path_to_file_being_cached#name_in_current_working_dir>`
It places a single file in the DistributedCache. It lets you specify the location (s3n or hdfs) of the file, followed by its name as referenced in the current working directory of the application, and it will place the file locally on your task nodes in the directory identified by mapred.local.dir (I think).
You can then access the files in your Mapper/Reducer tasks easily. I believe you can directly access it just like any normal file, but you may have to do something like DistributedCache.getLocalCacheFiles(job); in the setup method of your tasks.
An example of doing this with the Ruby client, taken from Amazon's forums:
./elastic-mapreduce --create --stream --input s3n://your_bucket/wordcount/input --output s3n://your_bucket/wordcount/output --mapper s3n://your_bucket/wordcount/wordSplitter.py --reducer aggregate --cache s3n://your_bucket/wordcount/stop-word-list#stop-word-list
