Hadoop streaming tasks on EMR always fail with "PipeMapRed.waitOutputThreads(): subprocess failed with code 143"

My hadoop streaming map-reduce jobs on Amazon EMR keep failing with the
following error:
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 143
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:372)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:586)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:135)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:441)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:377)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
From what I have read online, this appears to be caused by a SIGTERM being
sent to the task (see this thread). I have tried experimenting with
--jobconf "mapred.task.timeout=X", but I still receive the same error for
values of X up to an hour. I have also tried writing
reporter:status:<message> to STDERR at regular intervals, as described in
the streaming docs, but this does nothing to prevent the error either.
As far as I can see, my process starts and begins working, since the
expected output appears in the log files. Every task attempt, however,
ends in this error.
This is the code I am using to launch my streaming job with make:
instances = 50
type = m1.small
bid = 0.010
maptasks = 20000
timeout = 3600000
hadoop: upload_scripts upload_data
	emr -c ~/.ec2/credentials.json \
	    --create \
	    --name "Run $(maptasks) jobs with $(timeout) ms timeout and no reducer" \
	    --instance-group master \
	    --instance-type $(type) \
	    --instance-count 1 \
	    --instance-group core \
	    --instance-type $(type) \
	    --instance-count 1 \
	    --instance-group task \
	    --instance-type $(type) \
	    --instance-count $(instances) \
	    --bid-price $(bid) \
	    --bootstrap-action $(S3-srpt)$(bootstrap-database) \
	    --args "$(database)","$(http)/data","$(hadoop)" \
	    --bootstrap-action $(S3-srpt)$(bootstrap-phmmer) \
	    --args "$(hadoop)" \
	    --stream \
	    --jobconf "mapred.map.tasks=$(maptasks)" \
	    --jobconf "mapred.task.timeout=$(timeout)" \
	    --input $(S3-data)$(database) \
	    --output $(S3-otpt)$(shell date +%Y-%m-%d-%H-%M-%S) \
	    --mapper '$(S3-srpt)$(mapper-phmmer) $(hadoop)/$(database) $(hadoop)/phmmer' \
	    --reducer NONE
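For reference, the 143 in the error follows the usual shell convention of reporting 128 plus the signal number when a process is killed by a signal, and signal 15 is SIGTERM. A minimal Python sketch of that decoding is below; the function name describe_exit_code is made up for illustration and is not part of Hadoop or EMR.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Interpret a shell-style exit status.

    By convention, a status above 128 means the process was ended by
    signal number (status - 128).
    """
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited with status {code}"


print(describe_exit_code(143))
```

This only tells you that something sent SIGTERM to the subprocess; it does not say which component sent it or why.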

Related

Shell: How to stop a command line if it times out?

I want to stop a command if it runs for more than 1 minute and then continue with the next command.
for((part=0;part<=100;part++));
do
spark-submit \
--verbose \
--master yarn \
--deploy-mode cluster \
--conf spark.pyspark.python=myenv/bin/python3 \
python_demo.py $part
done
The spark-submit command submits my code to YARN. After a successful submission, it keeps running until python_demo.py finishes. Instead, I want to move on to the next submission as soon as one has been submitted successfully.
Currently the shell runs like this:
spark-submit -> submitted successfully (about 1 minute) -> python_demo.py runs (for a long time) -> spark-submit next part
Expected:
spark-submit -> if it has run for more than 1 minute (which means the submission succeeded) -> spark-submit next part
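With the timeout command you can do something like this. Note that timeout exits with status 124 when it has to stop the command, which is the expected case here, so only other non-zero statuses end the loop:
for((part=0;part<=100;part++))
do
    # Stop waiting on spark-submit after 60 seconds.
    timeout 60 2>/dev/null spark-submit \
        --verbose \
        --master yarn \
        --deploy-mode cluster \
        --conf spark.pyspark.python=myenv/bin/python3 \
        python_demo.py $part
    status=$?
    # 0 = finished within a minute, 124 = stopped by timeout; anything else is a real failure.
    if [[ $status -ne 0 && $status -ne 124 ]]
    then
        break
    fi
done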
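The same idea can be sketched in Python with subprocess.run and its timeout parameter, treating "still running at the limit" as a successful submission. The helper names run_with_limit, submit_parts, and spark_submit_cmd are hypothetical, and the sketch assumes (as the shell approach does) that the YARN application keeps running after the submitting client is stopped.

```python
import subprocess


def run_with_limit(cmd, limit_seconds):
    """Run cmd; return its exit code, or None if it was still running at the limit."""
    try:
        return subprocess.run(cmd, timeout=limit_seconds).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before re-raising.
        return None


def submit_parts(parts, make_cmd, limit_seconds=60):
    """Submit each part in turn; stop at the first command that fails before the limit."""
    submitted = []
    for part in parts:
        code = run_with_limit(make_cmd(part), limit_seconds)
        if code not in (None, 0):
            break  # a real failure, not a timeout
        submitted.append(part)
    return submitted


def spark_submit_cmd(part):
    """Build the spark-submit command line from the question for one part."""
    return [
        "spark-submit", "--verbose",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--conf", "spark.pyspark.python=myenv/bin/python3",
        "python_demo.py", str(part),
    ]
```

Calling submit_parts(range(101), spark_submit_cmd) would mirror the loop over parts 0 to 100.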

Configure EMR Cluster for Fair Scheduling

I am trying to spin up an EMR cluster with fair scheduling so that I can run multiple steps in parallel. I see that this is possible via Data Pipeline (https://aws.amazon.com/about-aws/whats-new/2015/06/run-parallel-hadoop-jobs-on-your-amazon-emr-cluster-using-aws-data-pipeline/), but I already have cluster management and creation automated via an Airflow job calling the AWS CLI, so it would be great to just update my configuration.
aws emr create-cluster \
--applications Name=Spark Name=Ganglia \
--ec2-attributes "${EC2_PROPERTIES}" \
--service-role EMR_DefaultRole \
--release-label emr-5.8.0 \
--log-uri ${S3_LOGS} \
--enable-debugging \
--name ${CLUSTER_NAME} \
--region us-east-1 \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=4,InstanceType=m3.xlarge
I think it may be achievable using the --configurations flag (https://docs.aws.amazon.com/cli/latest/reference/emr/create-cluster.html), but I am not sure of the correct env names.
Yes, you are correct. You can use EMR configurations to achieve your goal. Create a JSON file like the one below:
yarn-config.json:
[
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.resourcemanager.scheduler.class": "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler"
    }
  }
]
as per the Hadoop Fair Scheduler docs.
Then modify your AWS CLI call as follows:
aws emr create-cluster \
--applications Name=Spark Name=Ganglia \
--ec2-attributes "${EC2_PROPERTIES}" \
--service-role EMR_DefaultRole \
--release-label emr-5.8.0 \
--log-uri ${S3_LOGS} \
--enable-debugging \
--name ${CLUSTER_NAME} \
--region us-east-1 \
--configurations file://yarn-config.json \
--instance-groups \
InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
InstanceGroupType=CORE,InstanceCount=4,InstanceType=m3.xlarge
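Since the cluster creation in the question is driven from an Airflow job, the configuration file could also be generated from Python rather than kept by hand. The sketch below only reproduces the JSON shown above; the function names and the idea of writing the file just before calling the CLI are assumptions for illustration.

```python
import json

FAIR_SCHEDULER = (
    "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler"
)


def yarn_fair_scheduler_config():
    """Return the EMR configurations list that selects the Fair Scheduler."""
    return [
        {
            "Classification": "yarn-site",
            "Properties": {
                "yarn.resourcemanager.scheduler.class": FAIR_SCHEDULER,
            },
        }
    ]


def write_config(path):
    """Write the configuration to path so it can be passed as file://<path>."""
    with open(path, "w") as fh:
        json.dump(yarn_fair_scheduler_config(), fh, indent=2)
```

After write_config("yarn-config.json"), the file can be passed to --configurations file://yarn-config.json as in the command above.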

Vagrant keeps losing file doing provision

I'm running into odd behavior on the latest version of Vagrant in a Windows 7/msys/VirtualBox setup: after executing vagrant up, I get an rsync error, 'file has vanished: "/c/Users/spencerd/workspace/watcher/.LISTEN', during the provisioning stage.
Since Google, IRC, and the issue trackers have little to no documentation on this issue, I wonder whether anyone else has run into it and what the fix would be.
For the record, I have successfully built a box using the same Vagrantfile and provisioning script. For those who want to look, the project code is up at https://gist.github.com/denzuko/a6b7cce2eae636b0512d, with the debug log at gist.github.com/
After digging further into the directory structure and running into issues pushing code with git, I found a nonexistent file that needed to be removed after a reboot.
Thus, doing a reboot and a rm -rf -- "./.LISTEN\ \ \ \ \ 0\ \ \ \ \ \ 100\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ " did the trick.

Setting the number of Reducers for an Amazon EMR application

I am trying to run the wordcount example under Amazon EMR.
-1- First, I create a cluster with the following command:
./elastic-mapreduce --create --name "MyTest" --alive
This creates a cluster with a single instance and returns a jobID, let's say j-12NWUOKABCDEF
-2- Second, I start a job using the following command:
./elastic-mapreduce --jobflow j-12NWUOKABCDEF --jar s3n://mybucket/jar-files/wordcount.jar --main-class abc.WordCount \
--arg s3n://mybucket/input-data/ \
--arg s3n://mybucket/output-data/ \
--arg -Dmapred.reduce.tasks=3
My WordCount class belongs to the package abc.
This executes without any problem, but I am getting only one reducer,
which means that the parameter "mapred.reduce.tasks=3" is ignored.
Is there any way to specify the number of reducers that I want my application to use?
Thank you,
Neeraj.
The "-D" and the "mapred.reduce.tasks=3" should be separate arguments.
Try launching the EMR cluster with the number of mappers and reducers set through the --bootstrap-action option:
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-daemons --args "-m,mapred.map.tasks=6,-m,mapred.reduce.tasks=3"
You can use the streaming Jar's built-in option of -numReduceTasks. For example with the Ruby EMR CLI tool:
elastic-mapreduce --create --enable-debugging \
--ami-version "3.3.1" \
--log-uri s3n://someBucket/logs \
--name "someJob" \
--num-instances 6 \
--master-instance-type "m3.xlarge" --slave-instance-type "c3.8xlarge" \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/install-ganglia \
--stream \
--arg "-files" \
--arg "s3://someBucket/some_job.py,s3://someBucket/some_file.txt" \
--mapper "python27 some_job.py some_file.txt" \
--reducer cat \
--args "-numReduceTasks,8" \
--input s3://someBucket/myInput \
--output s3://someBucket/myOutput \
--step-name "main processing"
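To illustrate the first suggestion above (passing "-D" and the property as separate arguments), here is a small Python sketch that builds the --arg list for the custom JAR step. The function name jar_step_args is hypothetical. Placing -D before the input and output paths is an assumption: it only helps if the main class parses Hadoop's generic options (for example through ToolRunner), which the question does not show.

```python
def jar_step_args(reducers, input_uri, output_uri):
    """Return --arg pairs with -D and its property as two separate arguments."""
    # Assumption: generic options come before the job's own arguments.
    values = ["-D", f"mapred.reduce.tasks={reducers}", input_uri, output_uri]
    args = []
    for value in values:
        args += ["--arg", value]
    return args


print(" ".join(jar_step_args(3, "s3n://mybucket/input-data/",
                             "s3n://mybucket/output-data/")))
```

The printed string can be appended to the elastic-mapreduce --jobflow ... --jar ... --main-class ... command from the question.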

hadoop copying from hdfs to S3

I've successfully completed a Mahout vectorizing job on Amazon EMR (using Mahout on Elastic MapReduce as a reference). Now I want to copy the results from HDFS to S3 (to use them in future clustering).
For that I've used hadoop distcp:
den@aws:~$ elastic-mapreduce --jar s3://elasticmapreduce/samples/distcp/distcp.jar \
> --arg hdfs://my.bucket/prj1/seqfiles \
> --arg s3n://ACCESS_KEY:SECRET_KEY@my.bucket/prj1/seqfiles \
> -j $JOBID
That failed. I then found a suggestion to use s3distcp and tried it as well:
elastic-mapreduce --jobflow $JOBID \
> --jar --arg s3://eu-west-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar \
> --arg --s3Endpoint --arg 's3-eu-west-1.amazonaws.com' \
> --arg --src --arg 'hdfs://my.bucket/prj1/seqfiles' \
> --arg --dest --arg 's3://my.bucket/prj1/seqfiles'
In both cases I have the same error: java.net.UnknownHostException: unknown host: my.bucket
Below the full error output for the 2nd case.
2012-09-06 13:25:08,209 FATAL com.amazon.external.elasticmapreduce.s3distcp.S3DistCp (main): Failed to get source file system
java.net.UnknownHostException: unknown host: my.bucket
at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:214)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1193)
at org.apache.hadoop.ipc.Client.call(Client.java:1047)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
at $Proxy1.getProtocolVersion(Unknown Source)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:401)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:384)
at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:127)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:249)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:214)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1413)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:68)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1431)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:256)
at com.amazon.external.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:431)
at com.amazon.external.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:216)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at com.amazon.external.elasticmapreduce.s3distcp.Main.main(Main.java:12)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:187)
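I've found the bug. The key message is not
java.net.UnknownHostException: unknown host: my.bucket
but:
2012-09-06 13:27:33,909 FATAL com.amazon.external.elasticmapreduce.s3distcp.S3DistCp (main): Failed to get source file system
With only two slashes, hdfs://my.bucket/prj1/seqfiles is read as host my.bucket and path /prj1/seqfiles, so the client looks for a namenode called my.bucket. After adding one more slash to the source path, the job started without problems. The correct command is: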
elastic-mapreduce --jobflow $JOBID \
> --jar --arg s3://eu-west-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar \
> --arg --s3Endpoint --arg 's3-eu-west-1.amazonaws.com' \
> --arg --src --arg 'hdfs:///my.bucket/prj1/seqfiles' \
> --arg --dest --arg 's3://my.bucket/prj1/seqfiles'
P.S. It works: the job finished correctly, and I've successfully copied a directory with a 30 GB file.
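The effect of the extra slash can be seen by splitting both URIs into authority and path. The sketch below uses Python's urllib.parse as a stand-in; Hadoop does its own parsing in Java, so this only illustrates the generic URI rule, and the helper name authority_and_path is made up.

```python
from urllib.parse import urlsplit


def authority_and_path(uri):
    """Split a URI into its authority (host part) and its path."""
    parts = urlsplit(uri)
    return parts.netloc, parts.path


# Two slashes: "my.bucket" is taken as the host to connect to.
print(authority_and_path("hdfs://my.bucket/prj1/seqfiles"))
# Three slashes: empty authority, and "my.bucket" becomes part of the path.
print(authority_and_path("hdfs:///my.bucket/prj1/seqfiles"))
```

With an empty authority, the HDFS client falls back to the cluster's default file system, which is why the three-slash form resolves.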
