Hadoop streaming tasks on EMR always fail with "PipeMapRed.waitOutputThreads(): subprocess failed with code 143"

Hadoop streaming tasks on EMR always fail with "PipeMapRed.waitOutputThreads(): subprocess failed with code 143" - hadoop

My hadoop streaming map-reduce jobs on Amazon EMR keep failing with the
following error:
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 143
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:372)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:586)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:135)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:441)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:377)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
From what I have read online it appears that this is related to a SIGTERM being
sent to the task (See this thread here). I have tried experimenting with
--jobconf "mapred.task.timeout=X" but still receive the same error for values
of X even up to an hour. I have also tried reporting
reporter:status:<message> at regular intervals to the STDERR as described in
the streaming docs. This also however does nothing to prevent this error
occurring. As far as I can see my process starts and begins working initially as
I get the expected output being produced in log files. Each task attempt
however always ends in this error.
This is the code I am using to launch my streaming job with make:
instances = 50
type = m1.small
bid = 0.010
maptasks = 20000
timeout = 3600000
hadoop: upload_scripts upload_data
emr -c ~/.ec2/credentials.json \
--create \
--name "Run $(maptasks) jobs with $(timeout) minute timeout and no reducer" \
--instance-group master \
--instance-type $(type) \
--instance-count 1 \
--instance-group core \
--instance-type $(type) \
--instance-count 1 \
--instance-group task \
--instance-type $(type) \
--instance-count $(instances) \
--bid-price $(bid) \
--bootstrap-action $(S3-srpt)$(bootstrap-database) \
--args "$(database)","$(http)/data","$(hadoop)" \
--bootstrap-action $(S3-srpt)$(bootstrap-phmmer) \
--args "$(hadoop)" \
--stream \
--jobconf "mapred.map.tasks=$(maptasks)" \
--jobconf "mapred.task.timeout=$(timeout)" \
--input $(S3-data)$(database) \
--output $(S3-otpt)$(shell date +%Y-%m-%d-%H-%M-%S) \
--mapper '$(S3-srpt)$(mapper-phmmer) $(hadoop)/$(database) $(hadoop)/phmmer' \
--reducer NONE

Related

Shell: How to stop a command line if it times out？

I want to stop the command if it run more than 1 minutes and continue run the next command.
for((part=0;part<=100;part++));
do
spark-submit \
--verbose \
--master yarn \
--deploy-mode cluster \
--conf spark.pyspark.python=myenv/bin/python3 \
python_demo.py $part
done
The command spark-submit will submit my code to yarn. After submitting successfully, it'll run until the code ython_demo.py stops. But now I want to the continue to submit if one is submitted successfully.
Now the shell run like :
spark-submit -> submit successfully （about 1 minutes)-> run python_demo.py(it will run for a long time) -> spark-submit next part
Expected:
spark-submit -> if run more than 1 minutes(It means a successful submission)-> spark-submit next part

With timeout command you can do something like this:
for((part=1;part<=100;part++))
do
timeout 60 2>/dev/null spark-submit \
--verbose \
--master yarn \
--deploy-mode cluster \
--conf spark.pyspark.python=myenv/bin/python3 \
python_demo.py $part
if [[ $? != 0 ]]
then
break
else
continue
fi
done

Configure EMR Cluster for Fair Scheduling

I am trying to spin up an emr cluster with fair scheduling such that I can run multiple steps in parallel. I see that this is possible via pipeline (https://aws.amazon.com/about-aws/whats-new/2015/06/run-parallel-hadoop-jobs-on-your-amazon-emr-cluster-using-aws-data-pipeline/), but I already have cluster management / creating automated via an airflow job calling the awscli[1] so it would be great to just update my configurations.
aws emr create-cluster \
--applications Name=Spark Name=Ganglia \
--ec2-attributes "${EC2_PROPERTIES}" \
--service-role EMR_DefaultRole \
--release-label emr-5.8.0 \
--log-uri ${S3_LOGS} \
--enable-debugging \
--name ${CLUSTER_NAME} \
--region us-east-1 \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=4,InstanceType=m3.xlarge)
I think it may be achieved using the --configurations (https://docs.aws.amazon.com/cli/latest/reference/emr/create-cluster.html) flag, but not sure of the correct env names

Yes, you are correct. You can use EMR configurations to achieve your goal. You can create a JSON file with something like below :
yarn-config.json:
[
{
"Classification": "yarn-site",
"Properties": {
"yarn.resourcemanager.scheduler.class": "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler"
}
}
]
as per Hadoop Fair Scheduler docs
Then modify you AWS CLI as :
aws emr create-cluster \
--applications Name=Spark Name=Ganglia \
--ec2-attributes "${EC2_PROPERTIES}" \
--service-role EMR_DefaultRole \
--release-label emr-5.8.0 \
--log-uri ${S3_LOGS} \
--enable-debugging \
--name ${CLUSTER_NAME} \
--region us-east-1 \
--instance-groups \
--configurations file://yarn-config.json
InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge
InstanceGroupType=CORE,InstanceCount=4,InstanceType=m3.xlarge)

Vagrant keeps losing file doing provision

I'm running into an odd behavior on the latest version of vagrant in a Windows7/msys/Virtualbox environment setup, where after executing a vagrant up command I get an error with rsync; 'file has vanished: "/c/Users/spencerd/workspace/watcher/.LISTEN' doing the provisioning stage.
Since google, irc, and issue trackers have little to no documentation on this issue I wonder if anyone else ran into this and what would the fix be?
And for the record I have successfully build a box using the same vagrant file and provisioning script. For those that want to look, the project code is up at https://gist.github.com/denzuko/a6b7cce2eae636b0512d, with the debug log at gist.github.com/

After digging further into the directory structure and running into issues with git pushing code up I was able to find a non-existant file that needed to be removed after a reboot.
Thus, doing a reboot and a rm -rf -- "./.LISTEN\ \ \ \ \ 0\ \ \ \ \ \ 100\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ " did the trick.

Setting the number of Reducers for an Amazon EMR application

I am trying to run the wordcount example under Amazon EMR.
-1- First, I create a cluster with the following command:
./elastic-mapreduce --create --name "MyTest" --alive
This creates a cluster with a single instance and returns a jobID, lets say j-12NWUOKABCDEF
-2- Second, I start a Job using the following command:
./elastic-mapreduce --jobflow j-12NWUOKABCDEF --jar s3n://mybucket/jar-files/wordcount.jar --main-class abc.WordCount
--arg s3n://mybucket/input-data/
--arg s3n://mybucket/output-data/
--arg -Dmapred.reduce.tasks=3
My WordCount class belongs to the package abc.
This executes without any problem, but I am getting only one reducer.
Which means that the parameter "mapred.reduce.tasks=3" is ignored.
Is there any way to specify the number of reducers that I want my application to use ?
Thank you,
Neeraj.

The "-D" and the "mapred.reduce.tasks=3" should be separate arguments.

Try to launch the EMR cluster by setting reducers and mapper with --bootstrap-action option as
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-daemons --args "-m,mapred.map.tasks=6,-m,mapred.reduce.tasks=3"

You can use the streaming Jar's built-in option of -numReduceTasks. For example with the Ruby EMR CLI tool:
elastic-mapreduce --create --enable-debugging \
--ami-version "3.3.1" \
--log-uri s3n://someBucket/logs \
--name "someJob" \
--num-instances 6 \
--master-instance-type "m3.xlarge" --slave-instance-type "c3.8xlarge" \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/install-ganglia \
--stream \
--arg "-files" \
--arg "s3://someBucket/some_job.py,s3://someBucket/some_file.txt" \
--mapper "python27 some_job.py some_file.txt" \
--reducer cat \
--args "-numReduceTasks,8" \
--input s3://someBucket/myInput \
--output s3://someBucket/myOutput \
--step-name "main processing"

hadoop copying from hdfs to S3

I've successfully completed mahout vectorizing job on Amazon EMR (using Mahout on Elastic MapReduce as reference). Now I want to copy results from HDFS to S3 (to use it in future clustering).
For that I've used hadoop distcp:
den#aws:~$ elastic-mapreduce --jar s3://elasticmapreduce/samples/distcp/distcp.jar \
> --arg hdfs://my.bucket/prj1/seqfiles \
> --arg s3n://ACCESS_KEY:SECRET_KEY#my.bucket/prj1/seqfiles \
> -j $JOBID
Failed. Found that suggestion: Use s3distcp Tried it also:
elastic-mapreduce --jobflow $JOBID \
> --jar --arg s3://eu-west-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar \
> --arg --s3Endpoint --arg 's3-eu-west-1.amazonaws.com' \
> --arg --src --arg 'hdfs://my.bucket/prj1/seqfiles' \
> --arg --dest --arg 's3://my.bucket/prj1/seqfiles'
In both cases I have the same error: java.net.UnknownHostException: unknown host: my.bucket
Below the full error output for the 2nd case.
2012-09-06 13:25:08,209 FATAL com.amazon.external.elasticmapreduce.s3distcp.S3DistCp (main): Failed to get source file system
java.net.UnknownHostException: unknown host: my.bucket
at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:214)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1193)
at org.apache.hadoop.ipc.Client.call(Client.java:1047)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
at $Proxy1.getProtocolVersion(Unknown Source)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:401)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:384)
at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:127)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:249)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:214)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1413)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:68)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1431)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:256)
at com.amazon.external.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:431)
at com.amazon.external.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:216)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at com.amazon.external.elasticmapreduce.s3distcp.Main.main(Main.java:12)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:187)

I've found a bug:
The main problem is not
java.net.UnknownHostException: unknown host: my.bucket
but:
2012-09-06 13:27:33,909 FATAL com.amazon.external.elasticmapreduce.s3distcp.S3DistCp (main): Failed to get source file system
So. After adding 1 more slash in source path - job was started without problems. Correct command is:
elastic-mapreduce --jobflow $JOBID \
> --jar --arg s3://eu-west-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar \
> --arg --s3Endpoint --arg 's3-eu-west-1.amazonaws.com' \
> --arg --src --arg 'hdfs:///my.bucket/prj1/seqfiles' \
> --arg --dest --arg 's3://my.bucket/prj1/seqfiles'
P.S. So. it is working. Job is correctly finished. I've successfully copied dir with 30Gb file.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Hadoop streaming tasks on EMR always fail with "PipeMapRed.waitOutputThreads(): subprocess failed with code 143" - hadoop

Related

Shell: How to stop a command line if it times out？

Configure EMR Cluster for Fair Scheduling

Vagrant keeps losing file doing provision

Setting the number of Reducers for an Amazon EMR application

hadoop copying from hdfs to S3

Categories

Resources