AWS EMR - job counters exceeded limit 120 - hadoop

I have a hadoop code base that I inherited and which I'm trying to get running on EMR. But I'm running into issues with the job counters. I get an error saying that I'm exceeding the default limit of 120. I looked into my code and I see I have about 40 counters, and EMR adds another 30 internal counters, but that should still be within the 120 default limit.
I'm running on EMR AMI version 2.4.2, and Amazon 1.0.3 hadoop distribution.
Is there a way to increase the limit? I saw the question "More than 120 counters in hadoop", but I'm not sure how to set this up on EMR.
Is there any way I can get more debug output to figure out what is going on?

You can raise the counter limit with this configuration:
[
  {
    "Classification": "mapred-site",
    "Properties": {
      "mapreduce.job.counters.max": "1024"
    }
  }
]
Here are Amazon's instructions on how to apply this configuration to your cluster. (I'm not pasting them here directly because there are many ways to do it, depending on how you create and use your cluster.)
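As one illustration, the configuration can be generated programmatically and saved to a file, which can then be supplied at cluster creation (for example through the `--configurations file://...` option of `aws emr create-cluster`). This is a minimal sketch rather than Amazon's documented procedure: the helper name `counter_limit_config` is made up for the example, and it assumes an EMR release that accepts configuration classifications (older AMI versions were configured differently).

```python
import json


def counter_limit_config(limit):
    """Build the mapred-site classification that raises the counter limit."""
    return [
        {
            "Classification": "mapred-site",
            "Properties": {
                # Property values are passed as strings.
                "mapreduce.job.counters.max": str(limit),
            },
        }
    ]


config = counter_limit_config(1024)

# Render the JSON; save this text to a file of your choosing and hand it
# to whatever tooling you use to create the cluster.
rendered = json.dumps(config, indent=2)
print(rendered)
```

Generating the JSON from code avoids hand-typed mistakes in the property key, which would otherwise be silently ignored.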

Related

Phoenix csv Bulk load fails with large data sets

I'm trying to load a dataset (280GB) using the Phoenix csv bulk load tool on a HDInsight Hbase cluster. The job fails with the following error:
18/02/23 06:09:10 INFO mapreduce.Job: Task Id : attempt_1519326441231_0004_m_000067_0, Status : FAILED
Error: Java heap space
Container killed by the ApplicationMaster.
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Here's my cluster configuration:
Region nodes: 8 cores, 56 GB RAM, 1.5 TB HDD
Master nodes: 4 cores, 28 GB RAM, 1.5 TB HDD
I tried increasing the value of yarn.nodemanager.resource.memory-mb from 5GB to 38GB, but the job still fails.
Can anyone please help me troubleshoot this issue?
Can you provide more details, such as how you are kicking off the job? Are you following the instructions here - https://blogs.msdn.microsoft.com/azuredatalake/2017/02/14/hdinsight-how-to-perform-bulk-load-with-phoenix/ ?
Specifically, can you provide the command you used and some more information: does the job fail immediately, or does it run for a while and then start to fail? Are there any log messages other than the one you described above?

writing rdd from spark to Elastic Search fails

I am trying to write a pair RDD to Elasticsearch (version 2.4.0) on Elastic Cloud.
I am using the elasticsearch-spark_2.10-2.4.0 plugin to write to ES.
Here is the code I am using to write to ES:
def predict_imgs(r):
    import json
    out_d = {}
    out_d["pid"] = r["pid"]
    out_d["other_stuff"] = r["other_stuff"]
    return (r["pid"], json.dumps(out_d))
res2 = res1.map(predict_imgs)
es_write_conf = {
    "es.nodes": image_es,
    #"es.port": "9243",
    "es.resource": "index/type",
    "es.nodes.wan.only": "True",
    "es.write.operation": "upsert",
    "es.mapping.id": "product_id",
    "es.nodes.discovery": "false",
    "es.net.http.auth.user": "username",
    "es.net.http.auth.pass": "pass",
    "es.input.json": "true",
    "es.http.timeout": "1m",
    "es.scroll.size": "10",
    "es.batch.size.bytes": "1mb",
    "es.http.retries": "1",
    "es.batch.size.entries": "5",
    "es.batch.write.refresh": "False",
    "es.batch.write.retry.count": "1",
    "es.batch.write.retry.wait": "10s"}
res2.saveAsNewAPIHadoopFile(
    path='-',
    outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_write_conf)
The error I get is as follows:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.saveAsNewAPIHadoopFile.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 744 in stage 26.0 failed 4 times, most recent failure: Lost task 744.3 in stage 26.0 (TID 2841, 10.181.252.29): org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
The interesting part is that when I take the first few elements of res2, make a new RDD out of them, and write that to ES, it works flawlessly:
x = sc.parallelize([res2.take(1)])
x.saveAsNewAPIHadoopFile(
    path='-',
    outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_write_conf)
I am using Elastic Cloud (the cloud offering of Elasticsearch) and Databricks (the cloud offering of Apache Spark).
Could it be that ES is not able to keep up with the throughput of Spark writing to it?
I increased our Elastic Cloud size from 2GB RAM to 8GB RAM.
Are there any recommended configs for the es_write_conf I used above? Any other confs that you can think of?
Does updating to ES 5.0 help?
Any help is appreciated. Have been struggling with this for a few days now. Thank you.
It looks like a problem with the PySpark computation, not necessarily with the Elasticsearch save itself. Check that your RDDs are OK by:
Performing count() on res1 (to "materialize" the results)
Performing count() on res2
If the counts are OK, try caching the results before saving to ES:
res2.cache()
res2.count() # to fill the cache
res2.saveAsNewAPIHadoopFile(...
If the problem still appears, look at the stderr and stdout of the dead executors (you can find them on the Executors tab in the Spark UI).
I also noticed the very small batch size in es_write_conf (es.batch.size.entries is 5). Try increasing it to 500 or 1000 to get better performance.
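The batch-size suggestion could be applied by overriding just the relevant key of the write configuration. A minimal sketch, assuming a trimmed-down copy of the `es_write_conf` dictionary from the question; the helper `with_larger_batches` is hypothetical, and 1000 is simply the upper value suggested above, not a measured optimum.

```python
def with_larger_batches(conf, entries=1000):
    """Return a copy of conf with a larger bulk batch size; conf is untouched."""
    tuned = dict(conf)
    # The connector settings in this dictionary are all given as strings.
    tuned["es.batch.size.entries"] = str(entries)
    return tuned


# Trimmed-down stand-in for the es_write_conf shown in the question.
es_write_conf = {
    "es.resource": "index/type",
    "es.batch.size.entries": "5",
    "es.batch.size.bytes": "1mb",
}

tuned_conf = with_larger_batches(es_write_conf)
print(tuned_conf["es.batch.size.entries"])
```

Copying the dictionary keeps the original settings available, so the two configurations can be compared run against run.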

Why are locality levels all ANY in a Spark wordcount application running on HDFS?

I ran a Spark cluster of 12 nodes (8G memory and 8 cores for each) for some tests.
I'm trying to figure out why data localities of a simple wordcount app in "map" stage are all "Any". The 14GB dataset is stored in HDFS.
I have run into the same problem, and in my case it was a configuration problem. I was running on EC2 and had a hostname mismatch. Maybe the same thing happened to you.
When you check how HDFS sees your cluster, it should look something along these lines:
hdfs dfsadmin -printTopology
Rack: /default-rack
172.31.xx.xx:50010 (ip-172-31-xx-xxx.eu-central-1.compute.internal)
172.31.xx.xx:50010 (ip-172-31-xx-xxx.eu-central-1.compute.internal)
And the same should be seen in executors' address in the UI (by default it's http://your-cluster-public-dns:8080/).
In my case I was using the public hostname for the Spark slaves. I changed SPARK_LOCAL_IP in $SPARK/conf/spark-env.sh to use the private name as well, and after that change I get NODE_LOCAL most of the time.
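A quick way to spot such a mismatch is to compare the names HDFS reports with the addresses the executors registered under. A rough sketch, assuming the `-printTopology` output has been captured as text and the executor addresses copied from the UI by hand; the hostnames below and the helper names are invented for illustration.

```python
import re

# Matches lines such as "172.31.0.11:50010 (ip-172-31-0-11.example.internal)".
NODE_LINE = re.compile(r"^\s*(\S+):\d+\s+\((\S+)\)\s*$")


def datanode_names(topology_text):
    """Collect every IP and hostname listed in the printTopology output."""
    names = set()
    for line in topology_text.splitlines():
        match = NODE_LINE.match(line)
        if match:
            names.update(match.groups())
    return names


def mismatched_executors(topology_text, executor_hosts):
    """Return executor addresses that HDFS does not list under that name."""
    known = datanode_names(topology_text)
    return [host for host in executor_hosts if host not in known]


# Made-up sample data: one executor uses a private name, one a public name.
topology = """Rack: /default-rack
   172.31.0.11:50010 (ip-172-31-0-11.example.internal)
   172.31.0.12:50010 (ip-172-31-0-12.example.internal)
"""
executors = ["ip-172-31-0-11.example.internal", "ec2-52-0-0-12.example.com"]

print(mismatched_executors(topology, executors))
```

Any address this reports is a candidate for the kind of public/private name mismatch described above.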
I encountered the same problem today. This is my situation:
My cluster has 9 workers (each running one executor by default). When I set --total-executor-cores 9, the locality level is NODE_LOCAL, but when I set it below 9, for example --total-executor-cores 7, the locality level becomes ANY and the total run time is about 10x longer than with NODE_LOCAL. You can give it a try.
I'm running my cluster on EC2s, and I fixed my problem by adding the following to spark-env.sh on the name node
SPARK_MASTER_HOST=<name node hostname>
and then adding the following to spark-env.sh on the data nodes
SPARK_LOCAL_HOSTNAME=<data node hostname>
Don't start the slaves with start-all.sh. Start each slave individually instead:
$SPARK_HOME/sbin/start-slave.sh -h <hostname> <masterURI>

Mahout - ParallelALSFactorizationJob running too long?

I am trying to run Mahout ALS recommendation on AWS EMR cluster, however, it takes much longer than I expected.
The following is the command I run:
aws emr add-steps --cluster-id <cluster_id> \
--steps Type=CUSTOM_JAR,\
Name="Mahout ALS Factorization Job",\
Jar=s3://<my_bucket>/recproto/mahout-mr-0.10.0-job.jar,\
MainClass=org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob,\
Args=["--input","s3://<my_bucket>/recproto/trainingdata/userClicks.csv.gz",\
"--output","s3://<my_bucket>/recproto/als-output/",\
"--implicitFeedback","true",\
"--lambda","150",\
"--alpha","0.05",\
"--numFeatures","100",\
"--numIterations","3",\
"--numThreadsPerSolver","4",\
"--usesLongIDs","true"]
In the userClicks.csv file, there are 1,567,808 ratings from 335,636 users and 23,934 items.
The job runs on an EMR cluster of 10 c3.xlarge nodes and has been running for more than 2 hours. Is this normal? For a rating file of this size, what cluster size and parameters should I use to get a more acceptable running time?
I solved this problem by simply using Spark ALS. Training takes LESS THAN 2 MINUTES ON MY LAPTOP on the same dataset with the same parameters.
I can now understand why some machine learning algorithms are deprecated due to performance issues (e.g., the MinHash algorithm).

How are the numbers of concurrent mappers and reducers calculated in Hadoop 2 + YARN?

I've searched for some time and found that a MapReduce cluster using Hadoop 2 + YARN has the following number of concurrent maps and reduces per node:
Concurrent Maps # = yarn.nodemanager.resource.memory-mb / mapreduce.map.memory.mb
Concurrent Reduces # = yarn.nodemanager.resource.memory-mb / mapreduce.reduce.memory.mb
However, I've set up a cluster with 10 machines, with these configurations:
'yarn_site' => {
'yarn.nodemanager.resource.cpu-vcores' => '32',
'yarn.nodemanager.resource.memory-mb' => '16793',
'yarn.scheduler.minimum-allocation-mb' => '532',
'yarn.nodemanager.vmem-pmem-ratio' => '5',
'yarn.nodemanager.pmem-check-enabled' => 'false'
},
'mapred_site' => {
'mapreduce.map.memory.mb' => '4669',
'mapreduce.reduce.memory.mb' => '4915',
'mapreduce.map.java.opts' => '-Xmx4669m',
'mapreduce.reduce.java.opts' => '-Xmx4915m'
}
But after the cluster is set up, hadoop allows 6 containers for the entire cluster. What am I forgetting? What am I doing wrong?
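For reference, plugging the settings above into the two formulas gives the concurrency one would expect from memory alone. This is a minimal sketch under stated assumptions: it supposes the scheduler rounds each container request up to a multiple of yarn.scheduler.minimum-allocation-mb, and it ignores the ApplicationMaster's own container and any vcore limits. The function name is made up for the example.

```python
import math


def containers_per_node(node_mb, request_mb, min_alloc_mb):
    """Containers that fit on one node once the request is rounded up."""
    # Assumption: each request is rounded up to a multiple of
    # yarn.scheduler.minimum-allocation-mb before it is granted.
    rounded_mb = math.ceil(request_mb / min_alloc_mb) * min_alloc_mb
    return node_mb // rounded_mb


NODE_MB = 16793      # yarn.nodemanager.resource.memory-mb
MIN_ALLOC_MB = 532   # yarn.scheduler.minimum-allocation-mb
NODES = 10

# mapreduce.map.memory.mb and mapreduce.reduce.memory.mb from the question.
maps_per_node = containers_per_node(NODE_MB, 4669, MIN_ALLOC_MB)
reduces_per_node = containers_per_node(NODE_MB, 4915, MIN_ALLOC_MB)

print(maps_per_node, reduces_per_node, maps_per_node * NODES)
```

With these values each node should fit 3 map or 3 reduce containers, roughly 30 across the 10 machines, so 6 containers cluster-wide is well below what memory alone predicts.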
Not sure if this is the same issue you're having, but I had a similar issue, where I launched an EMR cluster of 20 nodes of c3.8xlarge in the core instance group and similarly found the cluster to be severely underutilized when running a job (only 30 mappers were running concurrently across the entire cluster, even though the memory/vcore configs in YARN and MapReduce for my particular cluster show that over 500 concurrent containers can run). I was using Hadoop 2.4.0 on AMI 3.5.0.
It turns out that the instance group matters for some reason. When I relaunched the cluster with 20 nodes in the task instance group and only 1 core node, that made a HUGE difference. I got more than 500 mappers running concurrently (in my case, the mappers were mostly downloading files from S3 and as such didn't need HDFS).
I'm not sure why the different instance group type makes a difference, given that both can equally run tasks, but clearly they are being treated differently.
I thought I'd mention it here, given that I ran into this issue myself and using a different group type helped.
