Using hadoop distcp to copy data to s3 block filesystem: The specified copy source is larger than the maximum allowable size for a copy source - hadoop

I'm trying to use hadoop's distcp to copy data from HDFS to S3 (not S3N). My understanding is that using the s3:// protocol, Hadoop will store the individual blocks on S3, and each S3 'file' will effectively be an HDFS block.
Hadoop version is 2.2.0 running on Amazon EMR.
However, trying to do a simple distcp, I get the following error:
Caused by: Status Code: 400, AWS Service: Amazon S3, AWS Request ID: 71C64ECE79FCC244, AWS Error Code: InvalidRequest, AWS Error Message: The specified copy source is larger than the maximum allowable size for a copy source: 5368709120, S3 Extended Request ID: uAnvxtrNolvs0qm6htIrKjpD0VFxzjqgIeN9RtGFmXflUHDcSqwnZGZgWt5PwoTy
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:619)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:317)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:170)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:2943)
at com.amazonaws.services.s3.AmazonS3Client.copyObject(AmazonS3Client.java:1235)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.copy(Jets3tNativeFileSystemStore.java:277)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at org.apache.hadoop.fs.s3native.$Proxy11.copy(Unknown Source)
at org.apache.hadoop.fs.s3native.NativeS3FileSystem.rename(NativeS3FileSystem.java:1217)
at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.promoteTmpToTarget(RetriableFileCopyCommand.java:161)
at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doCopy(RetriableFileCopyCommand.java:110)
at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doExecute(RetriableFileCopyCommand.java:83)
at org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:87)
Some of my source files are >5GB. Looking at the error, it seems that distcp is blindly copying whole files from HDFS into S3, as if it were using the S3 native filesystem. This fails for the files that are >5GB, since S3 doesn't support single PUT/copy requests larger than 5GB.
Why is this happening? I would have thought that distcp would put the individual blocks onto S3, and those should only be 64MB (my HDFS block size).

In order to write objects larger than 5GB, one must use multipart uploads. This appears to have been fixed in Hadoop 2.4.0 (see https://issues.apache.org/jira/browse/HADOOP-9454).
That said, this is one of the reasons why it makes sense to use AWS-native Hadoop offerings like EMR and Qubole. They are already set up to deal with such idiosyncrasies. (Full disclosure: I am one of the founders of Qubole.) In addition to vanilla multipart uploads, we also support streaming multipart uploads, where the file is continuously uploaded to S3 in small chunks even as it is being generated. (In a vanilla multipart upload, we first wait for the file to be fully generated and only then upload it in chunks to S3.)
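As a rough sketch (assuming a cluster on Hadoop 2.4.0 or later, and using the property names described in HADOOP-9454; the bucket and paths below are placeholders), multipart uploads for the native S3 filesystem can be enabled per job:
hadoop distcp \
  -Dfs.s3n.multipart.uploads.enabled=true \
  -Dfs.s3n.multipart.uploads.block.size=67108864 \
  hdfs:///user/hadoop/data s3n://<bucket>/data
Here 67108864 (64MB) is the part size; each part is uploaded separately rather than as a single >5GB request.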

Here is the example from the wiki: http://wiki.apache.org/hadoop/AmazonS3
% ${HADOOP_HOME}/bin/hadoop distcp hdfs://domU-12-31-33-00-02-DF:9001/user/nutch/0070206153839-1998 s3://123:456@nutch/


GCP dataproc cluster hadoop job to move data from gs bucket to s3 amazon bucket fails [CONSOLE]

First question on stack overflow, so please forgive me for any rookie mistakes.
I am currently working on moving a very large amount of data (700+ GiB), consisting of many small files of about 1-10 MB each, from a folder in a GCS bucket to a folder in S3.
Several attempts I made:
gsutil -m rsync -r gs://<path> s3://<path>
Results in a timeout due to the large amount of data
gsutil -m cp -r gs://<path> s3://<path>
Takes way too long. Even with many parallel processes and/or threads it still transfers at only about 3.4 MiB/s on average. I made sure to upgrade the VM instance for this attempt.
using rclone
Same performance issue as cp
Recently I found another possible method of doing this. However, I am not familiar with GCP, so please bear with me.
This is the reference I found: https://medium.com/swlh/transfer-data-from-gcs-to-s3-using-google-dataproc-with-airflow-aa49dc896dad
The method involves making a dataproc cluster through GCP console with the following configuration:
Name:
<dataproc-cluster-name>
Region:
asia-southeast1
Nodes configuration:
1 main, 2 workers (2 vCPU, 3.75 GB memory, 30 GB persistent disk each)
properties:
core fs.s3.awsAccessKeyId <key>
core fs.s3.awsSecretAccessKey <secret>
core fs.s3.impl org.apache.hadoop.fs.s3.S3FileSystem
Then I submit the job through the console menu in GCP website:
At this point I started noticing issues: I cannot find hadoop-mapreduce/hadoop-distcp.jar anywhere. I can only find /usr/lib/hadoop/hadoop-distcp.jar by browsing the filesystem on the main Dataproc cluster VM instance.
The job I submit:
Start time:
31 Mar 2021, 16:00:25
Elapsed time:
3 sec
Status:
Failed
Region
asia-southeast1
Cluster
<cluster-name>
Job type
Hadoop
Main class or JAR
file://usr/lib/hadoop/hadoop-distcp.jar
Arguments
-update
gs://*
s3://*
Returns an error
/usr/lib/hadoop/libexec//hadoop-functions.sh: line 2400: HADOOP_COM.GOOGLE.CLOUD.HADOOP.SERVICES.AGENT.JOB.SHIM.HADOOPRUNJARSHIM_USER: invalid variable name
/usr/lib/hadoop/libexec//hadoop-functions.sh: line 2365: HADOOP_COM.GOOGLE.CLOUD.HADOOP.SERVICES.AGENT.JOB.SHIM.HADOOPRUNJARSHIM_USER: invalid variable name
/usr/lib/hadoop/libexec//hadoop-functions.sh: line 2460: HADOOP_COM.GOOGLE.CLOUD.HADOOP.SERVICES.AGENT.JOB.SHIM.HADOOPRUNJARSHIM_OPTS: invalid variable name
2021-03-31 09:00:28,549 ERROR tools.DistCp: Invalid arguments:
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3.S3FileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2638)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3342)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3374)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:126)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3425)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3393)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:486)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
at org.apache.hadoop.tools.DistCp.setTargetPathExists(DistCp.java:240)
at org.apache.hadoop.tools.DistCp.run(DistCp.java:143)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.hadoop.tools.DistCp.main(DistCp.java:441)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
at org.apache.hadoop.util.RunJar.main(RunJar.java:236)
at com.google.cloud.hadoop.services.agent.job.shim.HadoopRunJarShim.main(HadoopRunJarShim.java:12)
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3.S3FileSystem not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2542)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2636)
... 18 more
Invalid arguments: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3.S3FileSystem not found
usage: distcp OPTIONS [source_path...] <target_path>
OPTIONS
-append Reuse existing data in target files and
append new data to them if possible
-async Should distcp execution be blocking
-atomic Commit all changes or none
-bandwidth <arg> Specify bandwidth per map in MB, accepts
bandwidth as a fraction.
-blocksperchunk <arg> If set to a positive value, fileswith more
blocks than this value will be split into
chunks of <blocksperchunk> blocks to be
transferred in parallel, and reassembled on
the destination. By default,
<blocksperchunk> is 0 and the files will be
transmitted in their entirety without
splitting. This switch is only applicable
when the source file system implements
getBlockLocations method and the target
file system implements concat method
-copybuffersize <arg> Size of the copy buffer to use. By default
<copybuffersize> is 8192B.
-delete Delete from target, files missing in
source. Delete is applicable only with
update or overwrite options
-diff <arg> Use snapshot diff report to identify the
difference between source and target
-direct Write files directly to the target
location, avoiding temporary file rename.
-f <arg> List of files that need to be copied
-filelimit <arg> (Deprecated!) Limit number of files copied
to <= n
-filters <arg> The path to a file containing a list of
strings for paths to be excluded from the
copy.
-i Ignore failures during copy
-log <arg> Folder on DFS where distcp execution logs
are saved
-m <arg> Max number of concurrent maps to use for
copy
-numListstatusThreads <arg> Number of threads to use for building file
listing (max 40).
-overwrite Choose to overwrite target files
unconditionally, even if they exist.
-p <arg> preserve status (rbugpcaxt)(replication,
block-size, user, group, permission,
checksum-type, ACL, XATTR, timestamps). If
-p is specified with no <arg>, then
preserves replication, block size, user,
group, permission, checksum type and
timestamps. raw.* xattrs are preserved when
both the source and destination paths are
in the /.reserved/raw hierarchy (HDFS
only). raw.* xattrpreservation is
independent of the -p flag. Refer to the
DistCp documentation for more details.
-rdiff <arg> Use target snapshot diff report to identify
changes made on target
-sizelimit <arg> (Deprecated!) Limit number of files copied
to <= n bytes
-skipcrccheck Whether to skip CRC checks between source
and target paths.
-strategy <arg> Copy strategy to use. Default is dividing
work based on file sizes
-tmp <arg> Intermediate work path to be used for
atomic commit
-update Update target, copying only missing files
or directories
-v Log additional info (path, size) in the
SKIP/COPY log
-xtrack <arg> Save information about missing source files
to the specified directory
How can I fix this problem? Several fixes I found online weren't very helpful: either they used the Hadoop CLI or had different jar files than mine. For example, this one right here: Move data from google cloud storage to S3 using dataproc hadoop cluster and airflow and https://github.com/CoorpAcademy/docker-pyspark/issues/13
Disclaimers: I do not use the Hadoop CLI or Airflow; I use the console to do this. Submitting the job through the Dataproc cluster's main VM instance shell also returns the same error. If that is required, any detailed reference would be appreciated. Thank you very much!
Update:
Fixed misleading path replacement on gsutil part
The problem was due to S3FileSystem no longer being supported by Hadoop, so I had to downgrade to an image with Hadoop 2.10 [FIXED]. The speed, however, was still not satisfactory.
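(For reference, an alternative to downgrading is the s3a connector, which replaced the old S3 block/native filesystems. A rough sketch, assuming the hadoop-aws/s3a classes are available on the cluster's classpath; the cluster name, buckets, and folders below are placeholders:)
gcloud dataproc jobs submit hadoop \
    --cluster=<cluster-name> --region=asia-southeast1 \
    --jar=file:///usr/lib/hadoop/hadoop-distcp.jar \
    -- -Dfs.s3a.access.key=<key> -Dfs.s3a.secret.key=<secret> \
    -update gs://<source-bucket>/<folder> s3a://<target-bucket>/<folder>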
I think the Dataproc solution is overkill in your case. Dataproc would make sense if you needed to copy something like a TB of data from GCS to S3 daily or hourly. But it sounds like yours will just be a one-time copy that you can let run for hours or days. I'd suggest running gsutil on a Google Cloud (GCP) instance. I've tried an AWS EC2 instance for this, and it is always markedly slower for this particular operation.
Create your source and destination buckets in the same region. For example, us-east4 (N. Virginia) for GCS and us-east-1 (N. Virginia) for S3. Then deploy your instance in the same GCP region.
gsutil -m cp -r gs://* s3://*
. . . is probably not going to work. It definitely does not work in Dataproc, which always errors if I don't have either an explicit file location or a bucket/folder that ends with /
Instead, first try to explicitly copy one file successfully. Then try a whole folder or bucket.
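For example (bucket, folder, and file names are placeholders):
gsutil cp gs://<source-bucket>/<folder>/<one-file> s3://<target-bucket>/<folder>/
gsutil -m cp -r gs://<source-bucket>/<folder> s3://<target-bucket>/<folder>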
How many files are you trying to copy?

How to copy a file from a GCS bucket in Dataproc to HDFS using google cloud?

I had uploaded the data file to the GCS bucket of my project in Dataproc. Now I want to copy that file to HDFS. How can I do that?
For a single "small" file
You can copy a single file from Google Cloud Storage (GCS) to HDFS using the hdfs copy command. Note that you need to run this from a node within the cluster:
hdfs dfs -cp gs://<bucket>/<object> <hdfs path>
This works because hdfs://<master node> is the default filesystem. You can explicitly specify the scheme and NameNode if desired:
hdfs dfs -cp gs://<bucket>/<object> hdfs://<master node>/<hdfs path>
Note that GCS objects use the gs: scheme. Paths should appear the same as they do when you use gsutil.
For a "large" file or large directory of files
When you use hdfs dfs, data is piped through your local machine. If you have a large dataset to copy, you will likely want to do this in parallel on the cluster using DistCp:
hadoop distcp gs://<bucket>/<directory> <HDFS target directory>
Consult the DistCp documentation for details.
Consider leaving data on GCS
Finally, consider leaving your data on GCS. Because the GCS connector implements Hadoop's distributed filesystem interface, it can be used as a drop-in replacement for HDFS in most cases. Notable exceptions are when you rely on (most) atomic file/directory operations or want to use a latency-sensitive application like HBase. The Dataproc HDFS migration guide gives a good overview of data migration.
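For example, a job can read its input from and write its output to GCS directly (the examples jar path below is the usual Dataproc location, and the bucket and paths are placeholders):
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount gs://<bucket>/input gs://<bucket>/wordcount-output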

Using S3 Links When Running Pig 0.14.0 in Local Mode?

I'm running Pig 0.14 in local mode. I'm running simple scripts over data in S3. I'd like to refer to these files directly in these scripts, e.g.:
x = LOAD 's3://bucket/path/to/file1.json' AS (...);
// Magic happens
STORE x INTO 's3://bucket/path/to/file2.json';
However, when I use the following command line:
$PIG_HOME/bin/pig -x local -P $HOME/credentials.properties -f $HOME/script.pig
I get the following error:
Failed Jobs:
JobId Alias Feature Message Outputs
N/A mainplinks MAP_ONLY Message: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Input path does not exist: s3://bucket/path/to/file.json
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:279)
at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:962)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:979)
at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:897)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824)
at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.pig.backend.hadoop20.PigJobControl.mainLoopAction(PigJobControl.java:157)
at org.apache.pig.backend.hadoop20.PigJobControl.run(PigJobControl.java:134)
at java.lang.Thread.run(Thread.java:748)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$1.run(MapReduceLauncher.java:276)
Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: s3://com.w2ogroup.analytics.soma.prod/airy/fb25b5c6/data/mainplinks.json
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:235)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:265)
... 20 more
file:/tmp/temp-948194340/tmp-48450066,
I can confirm that LOAD is failing; I suspect that STORE will fail too. REGISTER S3 links also fail. I can confirm that the links referenced by LOAD and REGISTER exist, and the links referred to by STORE don't, as Pig expects.
I've solved some issues already. For example, I dropped jets3t-0.7.1 into $PIG_HOME/lib, which fixed the runtime errors caused by the mere presence of S3 links. Additionally, I've provided the relevant AWS keys, and I can confirm that these keys work because I use them with the AWS CLI to do the same work.
If I use the AWS CLI to copy the files to local disk and rewrite the links to use the local file system, everything works fine. Thus, I'm convinced that the issue is S3-related.
How can I convince Pig to handle these S3 links properly?
AFAIK, the way Pig reads from S3 is through HDFS. Furthermore, for Pig to be able to access HDFS, it must not run in local mode. To set up non-local Pig easily, I'd suggest spinning up an EMR cluster (which is what I tried this on).
So first you need to set up your HDFS properly to access data from S3.
In your hdfs-site.xml configuration, make sure to set values for the fs.s3a keys:
<property>
<name>fs.s3a.access.key</name>
<value>{YOUR_ACCESS_KEY}</value>
<description>AWS access key ID. Omit for Role-based authentication.</description>
</property>
<property>
<name>fs.s3a.secret.key</name>
<value>{YOUR_SECRET_KEY}</value>
<description>AWS secret key. Omit for Role-based authentication.</description>
</property>
There should not be any need to restart the HDFS service, but there is no harm in doing so. To restart a service, run initctl list, then sudo stop <service name according to initctl output> followed by sudo start with the same name.
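For example (the service name below is illustrative; use whatever initctl list reports on your cluster):
initctl list | grep -i hdfs
sudo stop hadoop-hdfs-namenode
sudo start hadoop-hdfs-namenode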
Verify that you can access S3 from HDFS by running (note the s3a protocol):
$ hdfs dfs -ls s3a://bucket/path/to/file
If you get no error, then you can now use S3 paths in Pig. Run Pig in either MapReduce or Tez mode:
pig -x tez -f script.pig or pig -x mapreduce -f script.pig.
https://community.hortonworks.com/articles/25578/how-to-access-data-files-stored-in-aws-s3-buckets.html

Copy files from HDFS to Amazon S3 using distp and s3a scheme

I'm using Apache Hadoop version 2.7.2 and trying to copy files from HDFS to Amazon S3 using the command below.
hadoop distcp hdfs://<<namenode_host>>:9000/user/ubuntu/input/flightdata s3a://<<bucketid>>
I get the exception below when using the above command.
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: s3a://<<bucketid>>.distcp.tmp.attempt_1462460298670_0004_m_000001_0
Thanks much for the help.
It should be possible to go from HDFS to S3 - I have done it before using syntax like the following, running it from an HDFS cluster:
distcp -Dfs.s3a.access.key=... -Dfs.s3a.secret.key=... /user/vagrant/bigdata s3a://mytestbucket/bigdata
If you run your command like this, does it work?
hadoop distcp hdfs://namenode_host:9000/user/ubuntu/input/flightdata s3a://bucketid/flightdata
From the exception, it looks like it is expecting a 'folder' to put the data in, as opposed to the root of the bucket.
You need to provide AWS credentials in order to successfully transfer files to/from HDFS and S3.
You can pass the access_key_id and secret parameters as shown by @stephen above, but you should use the credential provider API for production use, where you can manage your credentials without passing them around in individual commands.
Ref: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/CredentialProviderAPI.html
Secondly, you do not need to specify the "hdfs" protocol; an absolute HDFS path is sufficient.
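For example, a sketch using a JCEKS credential store (the store path and bucket are placeholders; each credential create command prompts for the value to store):
hadoop credential create fs.s3a.access.key -provider jceks://hdfs/user/ubuntu/s3.jceks
hadoop credential create fs.s3a.secret.key -provider jceks://hdfs/user/ubuntu/s3.jceks
hadoop distcp -Dhadoop.security.credential.provider.path=jceks://hdfs/user/ubuntu/s3.jceks /user/ubuntu/input/flightdata s3a://<bucketid>/flightdata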

Temporary storage usage between distcp and s3distcp

I read the documentation for Amazon's S3DistCp - it says,
"During a copy operation, S3DistCp stages a temporary copy of the
output in HDFS on the cluster. There must be sufficient free space in
HDFS to stage the data, otherwise the copy operation fails. In
addition, if S3DistCp fails, it does not clean the temporary HDFS
directory, therefore you must manually purge the temporary files. For
example, if you copy 500 GB of data from HDFS to S3, S3DistCp copies
the entire 500 GB into a temporary directory in HDFS, then uploads the
data to Amazon S3 from the temporary directory".
This is not insignificant especially if you have a large HDFS cluster. Does anybody know if the regular Hadoop DistCp has this same behaviour of staging the files to copy in a temporary folder?
DistCp does not stage the whole dataset in a temporary folder; rather, it uses a MapReduce job to copy the files between or within clusters. The same applies to copying from HDFS to S3. AFAIK, DistCp will not throw away the whole batch of file copies if it fails for some reason.
If a total of 500 GB needs to be copied and DistCp fails after 200 GB has already been copied, that 200 GB of data is already in S3. When you rerun the DistCp job, it will skip the files that already exist.
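For example, rerunning an interrupted copy with -update (paths are placeholders) skips files that are already present and unchanged at the destination:
hadoop distcp -update hdfs:///data/backup s3a://<bucket>/backup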
For more information about the commands, see the DistCp guide.
