Spark job just hangs with large data - hadoop

I am trying to query 15 days of data from S3. Querying each day separately works fine, and querying 14 days together works fine as well. But when I query all 15 days, the job keeps running forever (hangs) and the task count stops updating.
My settings:
I am using a 51-node cluster of r3.4xlarge instances with dynamic allocation and maximize resource allocation turned on.
All I am doing is:
val startTime="2017-11-21T08:00:00Z"
val endTime="2017-12-05T08:00:00Z"
val start = DateUtils.getLocalTimeStamp( startTime )
val end = DateUtils.getLocalTimeStamp( endTime )
val days: Int = Days.daysBetween( start, end ).getDays
val files: Seq[String] = (0 to days)
  .map( start.plusDays )
  .map( d => s"$input_path${DateTimeFormat.forPattern( "yyyy/MM/dd" ).print( d )}/*/*" )
sqlSession.sparkContext.textFile( files.mkString( "," ) ).count
When I run the same with 14 days, I get a count of 197337380, and running the 15th day separately gives 27676788. But when I query all 15 days together, the job hangs.
Update:
The job works fine with:
var df = sqlSession.createDataFrame(sc.emptyRDD[Row], schema)
for (n <- files) {
  // Read one day's path at a time and append it to the running DataFrame
  val tempDF = sqlSession.read.schema( schema ).json(n)
  df = df.union(tempDF)
}
df.count
But can someone explain why it works now but not before?
UPDATE: After setting mapreduce.input.fileinputformat.split.minsize to 256 GB, it works fine now.
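For reference, a minimal sketch of how that property could be set from the Spark side before reading. It assumes a Spark 2.x SparkSession on the classpath; the object name, app name, and command-line arguments are hypothetical, and the split size is passed in as a byte count rather than hardcoded. For splittable files, a larger minimum split size makes each input split cover more bytes, so fewer tasks are created for the same input.

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: object name, app name, and argument layout are assumptions.
object SplitMinSizeSketch {
  // Hadoop property named in the update above.
  val SplitMinSizeKey = "mapreduce.input.fileinputformat.split.minsize"

  def main(args: Array[String]): Unit = {
    // args(0): comma-separated input paths, args(1): minimum split size in bytes
    val Array(inputPaths, minSplitBytes) = args
    val spark = SparkSession.builder().appName("split-minsize-sketch").getOrCreate()

    // Set this before calling textFile, which captures the Hadoop configuration.
    spark.sparkContext.hadoopConfiguration.setLong(SplitMinSizeKey, minSplitBytes.toLong)

    val lines = spark.sparkContext.textFile(inputPaths)
    println(s"partitions = ${lines.getNumPartitions}, count = ${lines.count()}")
    spark.stop()
  }
}
```

Printing the partition count before and after changing the property is one way to see how many tasks the read will produce.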

Dynamic allocation and maximize resource allocation are different settings; when one is active, the other is disabled. With maximize resource allocation in EMR, one executor is launched per node, and all of the node's cores and memory are allocated to that executor.
I would recommend taking a different route. You seem to have a pretty big cluster with 51 nodes; I am not sure it is even required. However, follow this rule of thumb to begin with, and you will get the hang of how to tune these configurations.
Cluster memory - minimum of 2X the data you are dealing with.
Now assuming 51 nodes is what you require, try below:
An r3.4xlarge has 16 vCPUs, so you can use 15 of them and leave one for the OS and other processes.
Set the number of executors to 150; this allocates about 3 executors per node.
Set the number of cores per executor to 5 (3 executors x 5 cores = 15 cores per node).
Set the executor memory to roughly total host memory / 3, about 35G.
Control the parallelism (default partitions) by setting it to the total number of cores you have, about 800.
Adjust the shuffle partitions to twice the number of cores: 1600.
Above configurations have been working like a charm for me. You can monitor the resource utilization on Spark UI.
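The arithmetic behind those numbers can be sketched as follows. This is only an illustration: the object and field names are made up, the 122 GiB of memory per node and the 5 GB of headroom per executor are assumptions, and only the Spark property names are real. With these inputs the exact figures are 153 executors and 765 cores, which the rule of thumb above rounds to 150 and about 800.

```scala
// Sketch of the sizing arithmetic; names and the headroom value are illustrative.
object ExecutorSizing {
  final case class Plan(
      executors: Int,
      coresPerExecutor: Int,
      executorMemoryGb: Int,
      defaultParallelism: Int,
      shufflePartitions: Int) {

    // Spark property names are real; the values come from the plan.
    def toSparkConf: Map[String, String] = Map(
      "spark.executor.instances"     -> executors.toString,
      "spark.executor.cores"         -> coresPerExecutor.toString,
      "spark.executor.memory"        -> s"${executorMemoryGb}g",
      "spark.default.parallelism"    -> defaultParallelism.toString,
      "spark.sql.shuffle.partitions" -> shufflePartitions.toString
    )
  }

  def plan(nodes: Int, vcpusPerNode: Int, memoryGbPerNode: Int,
           executorsPerNode: Int, headroomGb: Int): Plan = {
    val usableCores      = vcpusPerNode - 1                // leave one core for the OS
    val coresPerExecutor = usableCores / executorsPerNode  // 15 / 3 = 5
    val executors        = nodes * executorsPerNode
    // Split node memory across executors, then subtract the assumed headroom.
    val memoryGb         = memoryGbPerNode / executorsPerNode - headroomGb
    val totalCores       = executors * coresPerExecutor
    Plan(executors, coresPerExecutor, memoryGb, totalCores, 2 * totalCores)
  }

  def main(args: Array[String]): Unit = {
    val p = plan(nodes = 51, vcpusPerNode = 16, memoryGbPerNode = 122,
                 executorsPerNode = 3, headroomGb = 5)
    // Print the settings in spark-submit form.
    p.toSparkConf.foreach { case (k, v) => println(s"--conf $k=$v") }
  }
}
```

The printed pairs can be passed to spark-submit as --conf options or set on a SparkConf before the session is created.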
Also, in your yarn config /etc/hadoop/conf/capacity-scheduler.xml file, set yarn.scheduler.capacity.resource-calculator to org.apache.hadoop.yarn.util.resource.DominantResourceCalculator - which will allow Spark to really go full throttle with those CPUs. Restart yarn service after change.

You should increase the executor memory and the number of executors. If the data is huge, try increasing the driver memory as well.
My suggestion is to turn off dynamic resource allocation, let the job run, and see whether it still hangs. (Note that the Spark job can then consume the entire cluster's resources and starve other applications, so try this only when no other jobs are running.) If it doesn't hang, the problem lies in resource allocation: start hardcoding the resources and keep increasing them until you find the best allocation you can use.
The links below can help you understand resource allocation and optimization.
http://site.clairvoyantsoft.com/understanding-resource-allocation-configurations-spark-application/
https://community.hortonworks.com/articles/42803/spark-on-yarn-executor-resource-allocation-optimiz.html

Related

How to use Sphinx Search with concurrency?

I have a large database (100M rows) indexed by SphinxSearch. Each search takes 0.1-0.5s. However, if I run 10 searches concurrently, they take 20s on average.
Is it the expected behaviour of SphinxSearch?
Should I adjust the config or move to another search engine for concurrency?
My config file is simple:
searchd
{
listen = 9312
listen = 9306:mysql41
pid_file = /var/searchd.pid
read_timeout = 30
log = /var/log/sphinxsearch/searchd.log
query_log = /var/log/sphinxsearch/query.log
}
Is it the expected behaviour of SphinxSearch?
It heavily depends on the number of CPUs. If you have more than 10 physical CPUs, then latency degrading from 0.5 sec to 20 sec when concurrency goes from 1 to 10 is definitely not expected. In that case, first make sure all your CPUs are busy under the concurrent load. If they are not, then depending on your Sphinx version and multi-tasking mode, let it run with more threads.
Should I adjust the config or move to another search engine for concurrency?
I recommend Manticore Search as:
it's open source - https://github.com/manticoresoftware/manticoresearch/
it's the only fork of Sphinx and if you are familiar with Sphinx in general it shouldn't be a problem to migrate
hundreds of bugs have been fixed
the multi-tasking mode is completely different (coroutines)

How to make Hadoop/EMR use more containers per node

I'm in the process of moving our application from Hadoop 1.0.3 to 2.7, on EMR v5.1.0. I got it running, but I'm still having problems getting my head around the resource-allocation system in Yarn. With the default settings provided by EMR, Hadoop only allocates one container per node, even if I select a larger instance type for the nodes. This is a problem, since we'll now be using twice as many nodes to do the same amount of work.
I want to squeeze more containers into one node, and ensure that we're using all the available resources. I assume that I shouldn't touch yarn.nodemanager.resource.memory-mb or yarn.nodemanager.resource.cpu-vcores, since those are set by EMR to reflect the actual available resources. Which settings do I have to change?
Your container sizes are defined by setting the memory (default criteria for a container) and vcores. The following can be configured:
yarn.scheduler.minimum-allocation-mb
yarn.scheduler.maximum-allocation-mb
yarn.scheduler.increment-allocation-mb
yarn.scheduler.minimum-allocation-vcores
yarn.scheduler.maximum-allocation-vcores
yarn.scheduler.increment-allocation-vcores
All the following criteria must be satisfied (they are per container, except for yarn.nodemanager.resource.cpu-vcores and yarn.nodemanager.resource.memory-mb, which are per NodeManager and hence per DataNode):
1 <= yarn.scheduler.minimum-allocation-vcores <= yarn.scheduler.maximum-allocation-vcores
yarn.scheduler.maximum-allocation-vcores <= yarn.nodemanager.resource.cpu-vcores
yarn.scheduler.increment-allocation-vcores = 1
1024 <= yarn.scheduler.minimum-allocation-mb <= yarn.scheduler.maximum-allocation-mb
yarn.scheduler.maximum-allocation-mb <= yarn.nodemanager.resource.memory-mb
yarn.scheduler.increment-allocation-mb = 512
You can also see this helpful link https://www.cloudera.com/documentation/enterprise/5-4-x/topics/cdh_ig_yarn_tuning.html
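As an illustration, the criteria listed in this answer can be written as a small check to run against a candidate set of values before applying them. The type and field names are hypothetical, and the thresholds (1024 and 512) are simply the ones stated above.

```scala
// Sketch: encode the criteria above as checks. Type and field names are hypothetical.
object YarnAllocationCheck {
  final case class Settings(
      minVcores: Int, maxVcores: Int, incrementVcores: Int, nodeVcores: Int,
      minMb: Int, maxMb: Int, incrementMb: Int, nodeMb: Int)

  // Returns one message per violated criterion; an empty list means all hold.
  def violations(s: Settings): List[String] = List(
    (1 <= s.minVcores)            -> "minimum-allocation-vcores must be >= 1",
    (s.minVcores <= s.maxVcores)  -> "minimum-allocation-vcores must be <= maximum-allocation-vcores",
    (s.maxVcores <= s.nodeVcores) -> "maximum-allocation-vcores must be <= yarn.nodemanager.resource.cpu-vcores",
    (s.incrementVcores == 1)      -> "increment-allocation-vcores must be 1",
    (1024 <= s.minMb)             -> "minimum-allocation-mb must be >= 1024",
    (s.minMb <= s.maxMb)          -> "minimum-allocation-mb must be <= maximum-allocation-mb",
    (s.maxMb <= s.nodeMb)         -> "maximum-allocation-mb must be <= yarn.nodemanager.resource.memory-mb",
    (s.incrementMb == 512)        -> "increment-allocation-mb must be 512"
  ).collect { case (false, message) => message }

  def main(args: Array[String]): Unit = {
    // Example values only; substitute the ones from your own cluster.
    val candidate = Settings(
      minVcores = 1, maxVcores = 8, incrementVcores = 1, nodeVcores = 16,
      minMb = 1024, maxMb = 8192, incrementMb = 512, nodeMb = 16384)
    val problems = violations(candidate)
    if (problems.isEmpty) println("all criteria hold") else problems.foreach(println)
  }
}
```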

es_rejected_execution_exception rejected execution

I'm getting the following error when doing indexing.
es_rejected_execution_exception rejected execution of org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase$1@16248886
on EsThreadPoolExecutor[bulk, queue capacity = 50,
org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@739e3764[Running,
pool size = 16, active threads = 16, queued tasks = 51, completed
tasks = 407667]
My current setup:
Two nodes. One is the master (data: true, master: true) while the other one is data only (data: true, master: false). They are both EC2 I2.4XL (16 Cores, 122GB RAM, 320GB instance storage). 2 shards, 1 replica.
Those two nodes are fed by our aggregation server, which has 20 separate workers. Each worker makes bulk indexing requests to our ES cluster with 50 items per request. Each item is between 1000 and 4000 characters.
Current server setup: 4x client facing servers -> aggregation server -> ElasticSearch.
Now the issue is that this error only started occurring when we introduced the second node. Before, when we had one machine, we got a consistent indexing throughput of 20k requests per second. Now, with two machines, once it hits the 10k mark (~20% CPU usage) we start getting some of the errors outlined above.
But here is the interesting thing I have noticed. We have a mock item generator which generates a random document to be indexed. Generally these documents are of the same size, but have random parameters. We use this to do stress tests and check stability. This mock item generator sends requests to the aggregation server, which in turn passes them to Elasticsearch. With it, we are able to index around 40-45k items per second (at ~80% CPU usage) without getting this error. So it is puzzling why we get this error with real traffic. Has anyone seen this error, or does anyone know what could be causing it?

How concurrent # mappers and # reducers are calculated in Hadoop 2 + YARN?

I've been searching for some time and found that a MapReduce cluster using Hadoop 2 + YARN has the following number of concurrent maps and reduces per node:
Concurrent Maps # = yarn.nodemanager.resource.memory-mb / mapreduce.map.memory.mb
Concurrent Reduces # = yarn.nodemanager.resource.memory-mb / mapreduce.reduce.memory.mb
However, I've set up a cluster with 10 machines, with these configurations:
'yarn_site' => {
'yarn.nodemanager.resource.cpu-vcores' => '32',
'yarn.nodemanager.resource.memory-mb' => '16793',
'yarn.scheduler.minimum-allocation-mb' => '532',
'yarn.nodemanager.vmem-pmem-ratio' => '5',
'yarn.nodemanager.pmem-check-enabled' => 'false'
},
'mapred_site' => {
'mapreduce.map.memory.mb' => '4669',
'mapreduce.reduce.memory.mb' => '4915',
'mapreduce.map.java.opts' => '-Xmx4669m',
'mapreduce.reduce.java.opts' => '-Xmx4915m'
}
But after the cluster is set up, hadoop allows 6 containers for the entire cluster. What am I forgetting? What am I doing wrong?
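For reference, plugging the values from this configuration into the formula quoted in the question gives 3 concurrent maps and 3 concurrent reduces per node, or 30 across 10 nodes, assuming every machine runs a NodeManager. That is well above the 6 containers observed, so the memory formula alone does not account for the limit. A minimal sketch of that arithmetic, with hypothetical helper names:

```scala
// Sketch of the memory-based formula from the question; names are illustrative.
object ContainerMath {
  // Integer division: a node cannot host a fractional container.
  def concurrentPerNode(nodeMemoryMb: Int, taskMemoryMb: Int): Int =
    nodeMemoryMb / taskMemoryMb

  def main(args: Array[String]): Unit = {
    val nodes        = 10      // assumes all 10 machines run a NodeManager
    val nodeMemoryMb = 16793   // yarn.nodemanager.resource.memory-mb
    val maps    = concurrentPerNode(nodeMemoryMb, 4669)  // mapreduce.map.memory.mb
    val reduces = concurrentPerNode(nodeMemoryMb, 4915)  // mapreduce.reduce.memory.mb
    println(s"maps per node = $maps, cluster-wide = ${maps * nodes}")
    println(s"reduces per node = $reduces, cluster-wide = ${reduces * nodes}")
  }
}
```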
Not sure if this is the same issue you're having, but I had a similar issue, where I launched an EMR cluster of 20 nodes of c3.8xlarge in the core instance group and similarly found the cluster to be severely underutilized when running a job (only 30 mappers were running concurrently across the entire cluster, even though the memory/vcore configs in YARN and MapReduce for my particular cluster show that over 500 concurrent containers can run). I was using Hadoop 2.4.0 on AMI 3.5.0.
It turns out that the instance group matters for some reason. When I relaunched the cluster with 20 nodes in the task instance group and only 1 core node, that made a HUGE difference. I got over 500 mappers running concurrently (in my case, the mappers were mostly downloading files from S3 and as such didn't need HDFS).
I'm not sure why the different instance group type makes a difference, given that both can equally run tasks, but clearly they are being treated differently.
I thought I'd mention it here, given that I ran into this issue myself and using a different group type helped.

Azure worker role: how do I query the memory usage from within the role and sleep or reboot

I have a worker role that runs a number of parallel background workers. These workers run tasks that last from one minute to 5 hours and use quite a lot of memory.
I would like to delay the start of a new worker by testing the current level of memory consumption. Something like this:
while (memoryAvailable < 50%) {
Thread.Sleep( 1000 * 60 * 10 ); // 10 minutes
}
Can I test for available memory within a worker role?
Also, can I automate a reboot of the instance if memory drops below a certain amount?
Since your worker role instances are Windows Server 2012, you can just set up an appropriate perf counter during role startup ( OnStart() ) with whichever pertinent Memory counters you're interested in, and set up a task to observe the perf counter periodically. When available memory drops below your threshold (or committed bytes exceeds your threshold), you can easily recycle the role instance:
RoleEnvironment.RequestRecycle();
