Amazon Elastic MapReduce: Output directory

I'm running through Amazon's example of running Elastic MapReduce and keep getting hit with the following error:
Error launching job , Output path already exists.
Here is the command to run the job that I am using:
C:\ruby\elastic-mapreduce-cli>ruby elastic-mapreduce --create --stream \
--mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
--input s3://elasticmapreduce/samples/wordcount/input \
--output [A path to a bucket you own on Amazon S3, such as, s3n://myawsbucket] \
--reducer aggregate
The example comes from here.
I'm following Amazon's directions for the output directory. The bucket name is s3n://mp.maptester321mark/. I've looked through all their suggestions for problems on this URL.
Here is my credentials.json info:
{
  "access_id": "1234123412",
  "private_key": "1234123412",
  "keypair": "markkeypair",
  "key-pair-file": "C:/Ruby/elastic-mapreduce-cli/markkeypair",
  "log_uri": "s3n://mp-mapreduce/",
  "region": "us-west-2"
}

Hadoop jobs won't clobber directories that already exist. You just need to run:
hadoop fs -rmr <output_dir>
before your job, or just use the AWS console to remove the directory.
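If you prefer to clear the S3 output prefix from a script rather than the console, here is a minimal sketch with boto3; the bucket and prefix names are taken from the question and the answer below, so treat them as assumptions:
# Hypothetical cleanup of an existing EMR output prefix before re-running the job.
import boto3

s3 = boto3.resource("s3")
bucket = s3.Bucket("mp.maptester321mark")  # assumed bucket name from the question
# Delete every object under the output/ prefix so EMR can recreate it.
bucket.objects.filter(Prefix="output/").delete()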

Use:
--output s3n://mp.maptester321mark/output
instead of:
--output s3n://mp.maptester321mark/
I suppose EMR creates the output bucket before running, which means your output directory / already exists if you specify --output s3n://mp.maptester321mark/, and that might be the reason why you get this error.

---> If the folder (bucket) already exists, remove it.
---> If you delete it and you still get the above error, make sure your output looks like s3n://some_bucket_name/your_output_bucket rather than s3n://your_output_bucket/.
I think it's an issue with EMR: it first creates the bucket on the path (some_bucket_name) and then tries to create the (your_output_bucket).
Thanks
Hari

Related

MapReduceIndexerTool output dir error "Cannot write parent of file"

I want to use Cloudera's MapReduceIndexerTool to understand how morphlines work. I created a basic morphline that just reads lines from the input file, and I tried to run the tool with this command:
hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar org.apache.solr.hadoop.MapReduceIndexerTool \
--morphline-file morphline.conf \
--output-dir hdfs:///hostname/dir/ \
--dry-run true
Hadoop is installed on the same machine where I run this command.
The error I'm getting is the following:
net.sourceforge.argparse4j.inf.ArgumentParserException: Cannot write parent of file: hdfs:/hostname/dir
at org.apache.solr.hadoop.PathArgumentType.verifyCanWriteParent(PathArgumentType.java:200)
The /dir directory has 777 permissions on it, so it should definitely be writable. I don't know what I should do to allow the tool to write into that output directory.
I'm new to HDFS and I don't know how I should approach this problem. The logs don't offer me any information about it.
What I tried until now (with no result):
created a hierarchy of 2 directories (/dir/dir2) and put 777 permissions on both of them
changed the output-dir schema from hdfs:///... to hdfs://... because all the examples in the --help menu are built that way, but this leads to an invalid schema error
Thank you.
It states 'cannot write parent of file'. And the parent in your case is /. Take a look into the source:
private void verifyCanWriteParent(ArgumentParser parser, Path file) throws ArgumentParserException, IOException {
  Path parent = file.getParent();
  if (parent == null || !fs.exists(parent) || !fs.getFileStatus(parent).getPermission().getUserAction().implies(FsAction.WRITE)) {
    throw new ArgumentParserException("Cannot write parent of file: " + file, parser);
  }
}
The value printed in the message is file, which in your case is hdfs:/hostname/dir, so file.getParent() will be /.
Additionally, you can check the permissions with the hadoop fs command; for example, try to create a zero-length file in the path:
hadoop fs -touchz /test-file
I solved that problem after days of working on it.
The problem is with that line --output-dir hdfs:///hostname/dir/.
First of all, there should not be 3 slashes at the beginning, as I kept putting in my attempts to make this work; there should be only 2 (as in any valid HDFS URI). I actually put 3 slashes because otherwise the tool throws an invalid schema exception! You can easily see in the code that the schema check is done before the verifyCanWriteParent check.
I tried to get the hostname by simply running the hostname command on the CentOS machine I was running the tool on. This was the main issue. I analyzed the /etc/hosts file and saw that there were 2 hostnames for the same local IP. I took the second one and it worked. (I also attached the port to the hostname, so the final format is the following: --output-dir hdfs://correct_hostname:8020/path/to/file/from/hdfs.)
This error is very confusing because everywhere you look for the namenode hostname, you will see the same thing that the hostname command returns. Moreover, the errors are not structured in a way that lets you diagnose the problem and follow a logical path to solving it.
Additional information regarding this tool and debugging it
If you want to see the actual code that runs behind it, check the Cloudera version that you are running and select the same branch on the official repository. The master branch is not up to date.
If you want to just run this tool to play with the morphline (by using the --dry-run option) without connecting to Solr, you can't. You have to specify a ZooKeeper endpoint and a Solr collection or a Solr config directory, which involves additional research. This is something that could be improved in this tool.
You don't need to run the tool with -u hdfs; it works with a regular user.
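For anyone hitting the same hostname confusion, one way to see which namenode URI the client configuration actually points at is to read fs.defaultFS from core-site.xml. A small sketch; the config path below is the usual CDH location and is an assumption:
# Hypothetical helper: print fs.defaultFS from the local Hadoop client config.
import xml.etree.ElementTree as ET

conf = ET.parse("/etc/hadoop/conf/core-site.xml")  # assumed CDH config path
for prop in conf.iter("property"):
    if prop.findtext("name") == "fs.defaultFS":
        print(prop.findtext("value"))  # e.g. hdfs://correct_hostname:8020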

Hadoop distcp No AWS Credentials provided

I have a huge bucket of S3 files that I want to put on HDFS. Given the number of files involved, my preferred solution is to use 'distributed copy'. However, for some reason I can't get hadoop distcp to take my Amazon S3 credentials. The command I use is:
hadoop distcp -update s3a://[bucket]/[folder]/[filename] hdfs:///some/path/ -D fs.s3a.awsAccessKeyId=[keyid] -D fs.s3a.awsSecretAccessKey=[secretkey] -D fs.s3a.fast.upload=true
However, that acts the same as if the '-D' arguments aren't there.
ERROR tools.DistCp: Exception encountered
java.io.InterruptedIOException: doesBucketExist on [bucket]: com.amazonaws.AmazonClientException: No AWS Credentials provided by BasicAWSCredentialsProvider EnvironmentVariableCredentialsProvider SharedInstanceProfileCredentialsProvider : com.amazonaws.SdkClientException: Unable to load credentials from service endpoint
I've looked at the hadoop distcp documentation but can't find a solution there for why this isn't working. I've tried -Dfs.s3n.awsAccessKeyId as a flag, which didn't work either. I've read that explicitly passing credentials isn't good practice, so maybe this is just some gentle suggestion to do it some other way?
How is one supposed to pass S3 credentials with distcp? Does anyone know?
It appears the format of the credentials flags has changed since the previous version. The following command works:
hadoop distcp \
-Dfs.s3a.access.key=[accesskey] \
-Dfs.s3a.secret.key=[secretkey] \
-Dfs.s3a.fast.upload=true \
-update \
s3a://[bucket]/[folder]/[filename] hdfs:///some/path
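As a side note, if you want to confirm the key pair itself is valid before wiring it into the -D flags, a quick hypothetical check with boto3 (keeping the placeholders from the question) might look like this:
# Hypothetical sanity check: confirm the credentials can see the bucket at all.
import boto3

s3 = boto3.client(
    "s3",
    aws_access_key_id="[accesskey]",
    aws_secret_access_key="[secretkey]",
)
s3.head_bucket(Bucket="[bucket]")  # raises a ClientError if the credentials or bucket are wrong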
In case someone comes here with the same error while using -D hadoop.security.credential.provider.path, please ensure your credential store (jceks file) is located in the distributed file system (HDFS), because distcp starts from one of the node manager nodes and needs to be able to access it.
Koen's answer helped me; here is my version.
hadoop distcp \
-Dfs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider \
-Dfs.s3a.access.key=[accesskey] \
-Dfs.s3a.secret.key=[secretkey] \
-Dfs.s3a.session.token=[sessiontoken] \
-Dfs.s3a.fast.upload=true \
hdfs:///some/path s3a://[bucket]/[folder]/[filename]

Writing Spark dataframe as parquet to S3 without creating a _temporary folder

Using pyspark I'm reading a dataframe from parquet files on Amazon S3 like
dataS3 = sql.read.parquet("s3a://" + s3_bucket_in)
This works without problems. But then I try to write the data
dataS3.write.parquet("s3a://" + s3_bucket_out)
I get the following exception:
py4j.protocol.Py4JJavaError: An error occurred while calling o39.parquet.
: java.lang.IllegalArgumentException: java.net.URISyntaxException:
Relative path in absolute URI: s3a://<s3_bucket_out>_temporary
It seems to me that Spark is trying to create a _temporary folder first, before writing into the given bucket. Can this be prevented somehow, so that Spark writes directly to the given output bucket?
You can't eliminate the _temporary directory, as that's used to keep the intermediate work of a query hidden until it's complete.
But that's OK, as this isn't the problem. The problem is that the output committer gets a bit confused trying to write to the root directory (it can't delete it, you see).
You need to write to a subdirectory under a bucket, with a full prefix, e.g. s3a://mybucket/work/out.
I should add that trying to commit data to S3A is not reliable, precisely because of the way it mimics rename() with something like ls -rlf src | xargs -p8 -I% "cp % dst/% && rm %". Because ls has delayed consistency on S3, it can miss newly created files and so not copy them.
See: Improving Apache Spark for the details.
Right now, you can only reliably commit to s3a by writing to HDFS and then copying. EMR s3 works around this by using DynamoDB to offer a consistent listing.
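To make that workaround concrete, here is a minimal pyspark sketch; the staging path and the final bucket/prefix are hypothetical:
# Hypothetical write-to-HDFS-then-copy pattern: commit to HDFS first,
# then copy the finished output up to S3 in a separate step.
dataS3.write.parquet("hdfs:///tmp/staging/parquet_out")
# afterwards, for example: hadoop distcp hdfs:///tmp/staging/parquet_out s3a://mybucket/work/out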
I had the same issue when writing to the root of an S3 bucket:
df.save("s3://bucketname")
I resolved it by adding a / after the bucket name:
df.save("s3://bucketname/")

How can multiple files be specified with "-files" in the CLI of Amazon for EMR?

I am trying to start an Amazon EMR cluster via the AWS CLI, but I am a little bit confused about how I should specify multiple files. My current call is as follows:
aws emr create-cluster --steps Type=STREAMING,Name='Intra country development',ActionOnFailure=CONTINUE,Args=[-files,s3://betaestimationtest/mapper.py,-files,s3://betaestimationtest/reducer.py,-mapper,mapper.py,-reducer,reducer.py,-input,s3://betaestimationtest/output_0_inter,-output,s3://betaestimationtest/output_1_intra] \
--ami-version 3.1.0 \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge \
--auto-terminate \
--log-uri s3://betaestimationtest/logs
However, Hadoop now complains that it cannot find the reducer file:
Caused by: java.io.IOException: Cannot run program "reducer.py": error=2, No such file or directory
What am I doing wrong? The file does exist in the folder I specify.
For passing multiple files in a streaming step, you need to use file:// to pass the steps as a JSON file.
The AWS CLI shorthand syntax uses the comma as a delimiter to separate a list of args. So when we try to pass parameters like "-files","s3://betaestimationtest/mapper.py,s3://betaestimationtest/reducer.py", the shorthand syntax parser treats the mapper.py and reducer.py files as two separate parameters.
The workaround is to use the JSON format. Please see the examples below.
aws emr create-cluster --steps file://./mysteps.json --ami-version 3.1.0 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge --auto-terminate --log-uri s3://betaestimationtest/logs
mysteps.json looks like:
[
  {
    "Name": "Intra country development",
    "Type": "STREAMING",
    "ActionOnFailure": "CONTINUE",
    "Args": [
      "-files",
      "s3://betaestimationtest/mapper.py,s3://betaestimationtest/reducer.py",
      "-mapper",
      "mapper.py",
      "-reducer",
      "reducer.py",
      "-input",
      "s3://betaestimationtest/output_0_inter",
      "-output",
      "s3://betaestimationtest/output_1_intra"
    ]
  }
]
You can also find examples here: https://github.com/aws/aws-cli/blob/develop/awscli/examples/emr/create-cluster-examples.rst. See example 13.
Hope it helps!
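If you would rather generate the steps file than hand-write it, here is a small hypothetical sketch using Python's json module with the same values as above, which keeps the comma-separated -files value away from the CLI shorthand parser:
# Hypothetical generator for mysteps.json.
import json

steps = [{
    "Name": "Intra country development",
    "Type": "STREAMING",
    "ActionOnFailure": "CONTINUE",
    "Args": [
        "-files", "s3://betaestimationtest/mapper.py,s3://betaestimationtest/reducer.py",
        "-mapper", "mapper.py",
        "-reducer", "reducer.py",
        "-input", "s3://betaestimationtest/output_0_inter",
        "-output", "s3://betaestimationtest/output_1_intra",
    ],
}]

with open("mysteps.json", "w") as f:
    json.dump(steps, f, indent=2)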
You are specifying -files twice; you only need to specify it once. I forget whether the CLI needs the separator to be a space or a comma for multiple values, but you can try both.
You should replace:
Args=[-files,s3://betaestimationtest/mapper.py,-files,s3://betaestimationtest/reducer.py,-mapper,mapper.py,-reducer,reducer.py,-input,s3://betaestimationtest/output_0_inter,-output,s3://betaestimationtest/output_1_intra]
with:
Args=[-files,s3://betaestimationtest/mapper.py s3://betaestimationtest/reducer.py,-mapper,mapper.py,-reducer,reducer.py,-input,s3://betaestimationtest/output_0_inter,-output,s3://betaestimationtest/output_1_intra]
or if that fails, with:
Args=[-files,s3://betaestimationtest/mapper.py,s3://betaestimationtest/reducer.py,-mapper,mapper.py,-reducer,reducer.py,-input,s3://betaestimationtest/output_0_inter,-output,s3://betaestimationtest/output_1_intra]
Add an escape for the comma separating the files:
Args=[-files,s3://betaestimationtest/mapper.py\\,s3://betaestimationtest/reducer.py,-mapper,mapper.py,-reducer,reducer.py,-input,s3://betaestimationtest/output_0_inter,-output,s3://betaestimationtest/output_1_intra]

Passing files in different S3 folders as input to mapreduce

Our log files are stored in year/month/day/hourly buckets on S3. See below for structure.
How do I pass all the logs for day=20 as input to my MapReduce program?
Eg:
bucket = logs/year=2014/month=8/day=20/hour=1/log1_1.txt
bucket = logs/year=2014/month=8/day=20/hour=2/log2_1.txt
bucket = logs/year=2014/month=8/day=20/hour=2/log2_2.txt
bucket = logs/year=2014/month=8/day=20/hour=2/log2_3.txt
bucket = logs/year=2014/month=8/day=20/hour=3/log3_1.txt
bucket = logs/year=2014/month=8/day=20/hour=4/log4_1.txt
When you say "bucket" do you actually mean distinct S3 buckets or do you mean folders/directories in a bucket? Creating that many buckets will end up hitting the S3 account limit for the number of buckets you can create.
Assuming you meant folders/directories in the bucket, use s3distcp as a step in your EMR cluster to copy the logs you want to HDFS, and then use the HDFS directory as the input to the MR program.
s3distcp takes a src directory and a srcPattern to filter the items found in src. In your example, you could do:
./elastic-mapreduce --jobflow JobFlowID --jar \
/home/hadoop/lib/emr-s3distcp-1.0.jar \
--arg --src --arg s3://logs/ \
--arg --srcPattern --arg '.*day=20.*' \
--arg --dest --arg hdfs:///input/
All of the log files that have day=20 in the path will be copied to the input directory on the HDFS of the EMR cluster with that JobFlowID.
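As a hedged alternative, you could also enumerate the day=20 objects yourself with boto3 and feed the resulting paths to the job; the bucket and prefix below follow the layout in the question and are assumptions:
# Hypothetical listing of all day=20 log objects.
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="logs", Prefix="year=2014/month=8/day=20/"):
    for obj in page.get("Contents", []):
        print("s3://logs/" + obj["Key"])  # one input path per line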
