I have a huge bucket of S3 files that I want to put on HDFS. Given the number of files involved, my preferred solution is to use 'distributed copy' (distcp). However, for some reason I can't get hadoop distcp to take my Amazon S3 credentials. The command I use is:
hadoop distcp -update s3a://[bucket]/[folder]/[filename] hdfs:///some/path/ -D fs.s3a.awsAccessKeyId=[keyid] -D fs.s3a.awsSecretAccessKey=[secretkey] -D fs.s3a.fast.upload=true
However, that behaves exactly as if the '-D' arguments weren't there:
ERROR tools.DistCp: Exception encountered
java.io.InterruptedIOException: doesBucketExist on [bucket]: com.amazonaws.AmazonClientException: No AWS Credentials provided by BasicAWSCredentialsProvider EnvironmentVariableCredentialsProvider SharedInstanceProfileCredentialsProvider : com.amazonaws.SdkClientException: Unable to load credentials from service endpoint
I've looked at the hadoop distcp documentation, but can't find an explanation there for why this isn't working. I've tried -Dfs.s3n.awsAccessKeyId as a flag, which didn't work either. I've read that explicitly passing credentials isn't good practice, so maybe this is just some gentle nudge to do it some other way?
How is one supposed to pass S3 credentials to distcp? Does anyone know?
It appears the format of the credential flags has changed since the previous version, and the generic -D options need to come before the other distcp arguments. The following command works:
hadoop distcp \
-Dfs.s3a.access.key=[accesskey] \
-Dfs.s3a.secret.key=[secretkey] \
-Dfs.s3a.fast.upload=true \
-update \
s3a://[bucket]/[folder]/[filename] hdfs:///some/path
In case someone runs into the same error while using -D hadoop.security.credential.provider.path: make sure the credential store (the jceks file) is located on the distributed file system (HDFS). Since the distcp job runs on the cluster's NodeManager nodes, all of them need to be able to access the store.
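As a sketch of that setup (the NameNode address, HDFS path, and user below are placeholders, not from the answer), the store can be created directly on HDFS with the hadoop credential command, which prompts for each secret value, and then referenced from distcp:

hadoop credential create fs.s3a.access.key -provider jceks://hdfs@<namenode>:8020/user/<you>/aws.jceks
hadoop credential create fs.s3a.secret.key -provider jceks://hdfs@<namenode>:8020/user/<you>/aws.jceks

hadoop distcp \
  -D hadoop.security.credential.provider.path=jceks://hdfs@<namenode>:8020/user/<you>/aws.jceks \
  -update \
  s3a://[bucket]/[folder]/[filename] hdfs:///some/path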
Koen's answer helped me; here is my version.
hadoop distcp \
-Dfs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider \
-Dfs.s3a.access.key=[accesskey] \
-Dfs.s3a.secret.key=[secretkey] \
-Dfs.s3a.session.token=[sessiontoken] \
-Dfs.s3a.fast.upload=true \
hdfs:///some/path s3a://[bucket]/[folder]/[filename]
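For reference, one way to obtain such a temporary access key / secret key / session token triple is the AWS CLI (assuming it is installed and configured; this is not part of the original answer):

aws sts get-session-token --duration-seconds 3600
# the JSON response contains AccessKeyId, SecretAccessKey and SessionToken,
# which map to fs.s3a.access.key, fs.s3a.secret.key and fs.s3a.session.token above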
For testing purposes we would like to write to a relative local path like target/pipelines. The attempted URI was
file://target/pipelines/output.parquet
which was accessed via Spark:
if (!FileSystem.get(spark.sparkContext.hadoopConfiguration).exists(new Path(path))) {
However, the Hadoop FileSystem API does not seem too keen on that:
Wrong FS: file://target/pipelines/inputData1, expected: file:///
Full stacktrace:
java.lang.IllegalArgumentException:
Wrong FS: file://target/pipelines/inputData1, expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:649)
at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:82)
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:606)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426)
at com.mycompany.DataFrameUtils$.generateParquetFile(DataFrameUtils.scala:71)
So is it not possible to write to a local relative path?
To access the local file system through the Hadoop API, you have to use either a single / or three slashes after the colon:
file:/target/pipelines/output.parquet
or
file:///target/pipelines/output.parquet
To use a path relative to the current working directory (pwd) from the command line, the command below should work:
hadoop fs -Dfs.defaultFS="file:/" -ls testdir
If you want to do the same in a Scala or Java application, set the configuration below in the driver code, take file:/// out of the path variable, and just give the relative path.
conf.set("fs.defaultFS", "file:/");  // conf is the Hadoop Configuration object
In Spark this configuration can be overridden on the spark-submit command line using the --conf option (Hadoop settings passed this way need the spark.hadoop. prefix); again, take file:/// out of the path variable and just give the relative path.
./bin/spark-submit \
--conf spark.hadoop.fs.defaultFS="file:/" \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
... # other options
<application-jar>
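Another option that avoids touching fs.defaultFS at all is to qualify the relative path against the current working directory before handing it to the API. A small shell sketch using the question's paths ($PWD is absolute, so the resulting URI ends up with the required three slashes):

hadoop fs -ls "file://$PWD/target/pipelines"
# expands to something like file:///home/<user>/<project>/target/pipelines

The same idea applies to the Scala code from the question: build the file:// URI from an absolute path (for example via java.io.File#getAbsolutePath) instead of passing file://target/... directly.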
I am trying to copy files from HDFS to S3 using distcp by executing the following command
hadoop distcp -fs.s3a.access.key=AccessKey -fs.s3a.secret.key=SecretKey \
s3n://testbdr/test2 hdfs://hostname:portnumber/tmp/test
But I am getting the following error:
17/09/05 02:59:30 ERROR tools.DistCp: Invalid arguments:
java.lang.IllegalArgumentException: Both source file listing and source paths present
at org.apache.hadoop.tools.OptionsParser.parseSourceAndTargetPaths(OptionsParser.java:341)
at org.apache.hadoop.tools.OptionsParser.parse(OptionsParser.java:89)
at org.apache.hadoop.tools.DistCp.run(DistCp.java:112)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.tools.DistCp.main(DistCp.java:436)
Invalid arguments: Both source file listing and source paths present
usage: distcp OPTIONS [source_path...] <target_path>
OPTIONS
-append Reuse existing data in target files and
append new data to them if possible
-async Should distcp execution be blocking
To pass configuration parameters, you have to use -D:
hadoop distcp -Dfs.s3a.access.key=AccessKey -Dfs.s3a.secret.key=SecretKey \
s3n://testbdr/test2 hdfs://hostname:portnumber/tmp/test
Old Command
hadoop distcp -Dfs.s3a.access.key=AccessKey -Dfs.s3a.secret.key=SecretKey \
s3n://testbdr/test2 hdfs://hostname:portnumber/tmp/test
Rectified Command
hadoop distcp -Dfs.s3n.awsAccessKeyId=AccessKey -Dfs.s3n.awsSecretAccessKey=SecretKey \
s3n://testbdr/test2 hdfs://hostname:portnumber/tmp/test
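The rectified command works because the credential property prefix has to match the URI scheme: s3n:// paths read fs.s3n.awsAccessKeyId / fs.s3n.awsSecretAccessKey, while s3a:// paths read fs.s3a.access.key / fs.s3a.secret.key. As a sketch with the same placeholder values, the equivalent copy using the newer s3a connector would be:

hadoop distcp -Dfs.s3a.access.key=AccessKey -Dfs.s3a.secret.key=SecretKey \
    s3a://testbdr/test2 hdfs://hostname:portnumber/tmp/test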
I have a Hive table that I am trying to index into SolrCloud using a morphline; however, the data behind the Hive table is ONE big 20 GB file, which the morphline is taking a long time to process.
Instead of running multiple mappers and reducers, only 1 mapper runs, probably because we have only one file.
yarn jar /opt/<path>/search-mr-1.0.0-cdh5.5.1-job.jar \
org.apache.solr.hadoop.MapReduceIndexerTool \
--morphline-file morphlines.conf \
--output-dir hdfs://<outputdir> \
--zk-host node1.datafireball.com:2181/solr \
--collection <collectionname> \
--input-list <filewherethedatais> \
--mappers 6
And it still kicked off only 1 mapper... and this is taking forever. Can anyone shed some light on this?
Resources you might find helpful:
Cloudera Mapreduce Batch Index into Solrcloud
Kite SDK, which morphlines belong to.
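As a quick sanity check of the one-file hypothesis above (the warehouse path is a placeholder; point it at wherever the table's data actually lives), counting the files behind the table looks like this:

hdfs dfs -ls -R /user/hive/warehouse/<dbname>.db/<tablename> | grep -c '^-'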
I am trying to read an auxiliary file in my mapper; here are my code and commands.
mapper code:
#!/usr/bin/env python
from itertools import combinations
from operator import itemgetter
import sys
# load the auxiliary file that is shipped with the job via -file inputData
storage = {}
with open('inputData', 'r') as inputFile:
    for line in inputFile:
        first, second = line.split()
        storage[(first, second)] = 0

# process the actual map input from stdin
for line in sys.stdin:
    do_something()
And here is my command:
hadoop jar hadoop-streaming-2.7.1.jar \
-D stream.num.map.output.key.fields=2 \
-D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
-D mapred.text.key.comparator.options='-k1,1 -k2,2' \
-D mapred.map.tasks=20 \
-D mapred.reduce.tasks=10 \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
-mapper mapper.py -file mapper.py \
-reducer reducer.py -file reducer.py \
-file inputData \
-input /data \
-output /result
But I keep getting this error, which indicates that my mapper fails to read from stdin. After deleting the file-reading part, my code works, so I have pinpointed where the error occurs, but I don't know what the correct way of reading the file is. Can anyone help?
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads():
The error you are getting means your mapper went too long without writing anything to its stdout stream.
For example, a common reason for the error is that your do_something() function contains a loop with a continue statement guarded by some condition. When that condition holds too often in your input data, your script hits continue many times in a row without generating any output to stdout. Hadoop waits too long without seeing anything, so the task is considered failed.
Another possibility is that your input data file is too large, and it took too long to read. But I think that is considered setup time because it is before the first line of output. I am not sure though.
There are two relatively easy ways to solve this:
(developer side) Modify your code to output something every now and then. In the case of continue, write a short dummy symbol like '\n' to let Hadoop know your script is alive (see the sketch after this answer for a cleaner variant using the reporter protocol).
(system side) I believe you can raise the relevant timeout (in milliseconds) with the -D option; the parameter that controls how long a task may go without reporting progress is
mapreduce.task.timeout
I have never tried option 2. Usually I'd avoid streaming on data that requires filtering. Streaming, especially when done with a scripting language like Python, should do as little work as possible. My use cases are mostly post-processing output data from Apache Pig, where the filtering is already done in the Pig scripts and I need something that is not available in Jython.
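As an illustration of option 1, Hadoop Streaming also has a reporter protocol: any line written to stderr of the form reporter:status:<message> is treated as a status update and resets the task timeout, without polluting the real output on stdout. Below is a minimal shell sketch of a mapper heartbeat (the per-line work is just a placeholder comment); the same trick works from Python by writing to sys.stderr.

#!/usr/bin/env bash
n=0
while IFS= read -r line; do
  n=$((n + 1))
  if [ $((n % 100000)) -eq 0 ]; then
    # heartbeat: keeps the task alive even when nothing has been written to stdout yet
    echo "reporter:status:processed $n lines" >&2
  fi
  # the real per-line work would go here, writing its results to stdout
done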
I'm running through Amazon's example of running Elastic MapReduce and keep getting hit with the following error:
Error launching job , Output path already exists.
Here is the command to run the job that I am using:
C:\ruby\elastic-mapreduce-cli>ruby elastic-mapreduce --create --stream \
--mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
--input s3://elasticmapreduce/samples/wordcount/input \
--output [A path to a bucket you own on Amazon S3, such as, s3n://myawsbucket] \
--reducer aggregate
Here is where the example comes from. I'm following Amazon's directions for the output directory. The bucket name is s3n://mp.maptester321mark/. I've looked through all their suggestions for problems on this URL.
Here is my credentials.json info:
{
"access_id": "1234123412",
"private_key": "1234123412",
"keypair": "markkeypair",
"key-pair-file": "C:/Ruby/elastic-mapreduce-cli/markkeypair",
"log_uri": "s3n://mp-mapreduce/",
"region": "us-west-2"
}
Hadoop jobs won't clobber directories that already exist. You just need to run:
hadoop fs -rmr <output_dir>
before your job, or just use the AWS console to remove the directory.
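On newer Hadoop releases -rmr is deprecated in favour of -rm -r; with the question's bucket and a hypothetical output prefix, that would look like:

hadoop fs -rm -r s3n://mp.maptester321mark/output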
Use:
--output s3n://mp.maptester321mark/output
instead of:
--output s3n://mp.maptester321mark/
I suppose EMR creates the output bucket before running, which means your output directory / already exists if you specify --output s3n://mp.maptester321mark/, and that might be the reason why you get this error.
---> If the folder (bucket) already exists, then remove it.
---> If you delete it and you still get the above error, make sure your output looks like s3n://some_bucket_name/your_output_bucket rather than s3n://your_output_bucket/.
It's an issue with EMR!! I think it first creates the bucket from the first path component (some_bucket_name) and then tries to create the output folder (your_output_bucket) inside it.
Thanks
Hari