Adjusting pool size of Akka.NET cluster pool router

Here's the scenario I am looking to achieve using Akka.Net clustering:
A user submits a load generation task to a coordinator actor. (The task has its parameters and a 'load factor', so a factor of 10 would create 10 parallel tasks.)
The coordinator actor creates a cluster pool router with pool size equal to the load factor, and broadcasts a message to tell the routees to execute the task.
As expected, the required number of routees is created on the cluster nodes, and the load is generated.
Now I need to add a ramp-up and ramp-down capability. I plan to do that by adjusting the pool size periodically: every x seconds the pool size is increased during ramp-up and reduced during ramp-down.
The cluster pool router is created as follows, with an initial pool of 1 routee:
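The periodic adjustment described above comes down to a sequence of pool-size changes, one per tick. The following is a minimal sketch of that schedule only, written in Python for brevity; `ramp_deltas` is a hypothetical helper, not part of Akka.NET, and each yielded value would be what the coordinator passes to `AdjustPoolSize` on a timer.

```python
def ramp_deltas(current, target, step):
    """Yield the pool-size change to apply on each tick until target is reached."""
    if step <= 0:
        raise ValueError("step must be positive")
    while current != target:
        # Clamp the final change so the pool never overshoots the target.
        delta = max(-step, min(step, target - current))
        current += delta
        yield delta


# Ramp up from 1 routee to 11 in steps of 5, then back down in steps of 4.
print(list(ramp_deltas(1, 11, 5)))   # [5, 5]
print(list(ramp_deltas(11, 1, 4)))   # [-4, -4, -2]
```

A negative value shrinks the pool, so the same schedule covers both ramp-up and ramp-down.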
ClusterRouterPoolSettings settings = new ClusterRouterPoolSettings(1, 1000, false, "worker");
var props = Props.Create<TestActor>(() => new TestActor())
.WithRouter(new ClusterRouterPool(new BroadcastPool(100), settings));
var testActorRouter = Context.ActorOf(props, actorName);
To ramp up pool size by 5, the coordinator uses the following:
testActorRouter.Ask<Router>(new AdjustPoolSize(5));
However, this results in the 5 routees being added on the coordinator's node, instead of on the other cluster nodes. This is in spite of the pool settings specifying the role as "worker" (which the coordinator node doesn't have) and specifying local routees = false.
Is there a different way to adjust the pool size so it increases the cluster pool routees?
Thanks!

Related

Nifi Group Content by Given Attributes

I am trying to run a script or a custom processor to group data by given attributes every hour. Queue size is up to 30-40k on a single run and it might go up to 200k depending on the case.
MergeContent does not fit since there is no limit on min-max counts.
RouteOnAttribute does not fit since there are too many combinations.
Solution 1: Consume all flow files, group them by attributes, create a new flow file per group, and transfer it. Not ideal, but I gave it a try.
I ran this with 33k flow files waiting in the queue and checked:
session.getQueueSize().getObjectCount()
This number is always 10k, even though I increased the queue threshold numbers on the output flows.
Solution 2: A better approach is to consume one flow file and then filter the flow files matching its attributes:
final List<FlowFile> flowFiles = session.get(file -> {
if (correlationId.equals(Arrays.stream(keys).map(file::getAttribute).collect(Collectors.joining(":"))))
return FlowFileFilter.FlowFileFilterResult.ACCEPT_AND_CONTINUE;
return FlowFileFilter.FlowFileFilterResult.REJECT_AND_CONTINUE;
});
Again, with 33k waiting in the queue I expected around 200 new grouped flow files, but 320 were created. This looks like the same issue as above: the filter query does not scan all waiting flow files.
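For reference, the grouping both solutions aim for can be sketched outside NiFi. This is only an illustration in Python, assuming each flow file is represented by a plain dict of attributes; `correlation_id` and `group_by_keys` are hypothetical names, and treating a missing attribute as an empty string is a choice made for this sketch.

```python
from collections import defaultdict


def correlation_id(attributes, keys):
    # Join the selected attribute values with ":", like the filter does.
    # Assumption: a missing attribute counts as an empty string.
    return ":".join(attributes.get(k, "") for k in keys)


def group_by_keys(flow_files, keys):
    """Group attribute dicts by their correlation id."""
    groups = defaultdict(list)
    for attrs in flow_files:
        groups[correlation_id(attrs, keys)].append(attrs)
    return dict(groups)


files = [
    {"site": "a", "type": "x"},
    {"site": "a", "type": "y"},
    {"site": "a", "type": "x"},
]
grouped = group_by_keys(files, ["site", "type"])
print({k: len(v) for k, v in grouped.items()})
```

If every waiting flow file is visible to the grouping step, the number of groups equals the number of distinct correlation ids. Getting more groups than expected suggests the same id was seen in separate partial scans.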
Problems-Question:
Is there a parameter to change so this getObjectCount can take up to 300k?
Is there a way to filter all waiting flow files again by changing a parameter or by changing the processor?
I tried setting the default queue threshold to 300k in nifi.properties, but it didn't help.
In nifi.properties there is a parameter that affects batching behavior:
nifi.queue.swap.threshold=20000
Here is my test flow:
1. GenerateFlowFile with "batch size = 50K"
2. ExecuteGroovyScript with script below
3. LogAttribute (disabled) - just to have a queue after groovy
Groovy script:
def ffList = session.get(100000) // get batch with maximum 100K files from incoming queue
if(!ffList)return
def ff = session.create() // create new empty file
ff.batch_size = ffList.size() // set attribute to real batch size
session.remove(ffList) // drop all incoming batch files
REL_SUCCESS << ff // transfer new file to success
With the parameters above, 4 files are generated in the output:
1. batch_size = 20000
2. batch_size = 10000
3. batch_size = 10000
4. batch_size = 10000
According to the documentation:
There is also the notion of "swapping" FlowFiles. This occurs when the number of FlowFiles in a connection queue exceeds the value set in the nifi.queue.swap.threshold property. The FlowFiles with the lowest priority in the connection queue are serialized and written to disk in a "swap file" in batches of 10,000.
This explains the result: of the 50K incoming files, 20K are kept in memory and the rest are swapped to disk in batches of 10K.
I don't know how increasing the nifi.queue.swap.threshold property will affect your system's performance and memory consumption, but I set it to 100K on my local NiFi 1.16.3 and it works well with many small files; the first batch grew to 100K as a result.
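The batch sizes observed in the test can be reproduced with a small model. This is a simplified sketch of the observed behavior in Python, not NiFi's actual swap implementation; `session_batches` is a hypothetical function, and the only inputs are the swap threshold and the 10,000-file swap batch size quoted from the documentation.

```python
def session_batches(total, swap_threshold=20_000, swap_batch=10_000):
    """Model how many flow files each session.get() call sees."""
    # Files up to the swap threshold stay in memory and arrive as one batch.
    in_memory = min(total, swap_threshold)
    batches = [in_memory] if in_memory else []
    remaining = total - in_memory
    # Everything beyond the threshold comes back from swap files of swap_batch.
    while remaining > 0:
        chunk = min(remaining, swap_batch)
        batches.append(chunk)
        remaining -= chunk
    return batches


print(session_batches(50_000))                          # default threshold
print(session_batches(50_000, swap_threshold=100_000))  # raised threshold
```

With the default threshold the model gives the same four batch sizes as the test flow; with the threshold raised to 100K, all 50K files fit in a single batch.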

Spark job just hangs with large data

I am trying to query 15 days of data from S3. Querying each day separately works fine, and querying 14 days together works as well. But when I query all 15 days, the job keeps running forever (hangs) and the task count stops updating.
My settings:
I am using a 51-node cluster of r3.4xlarge instances with dynamic allocation and maximum resource allocation turned on.
All I am doing is:
val startTime="2017-11-21T08:00:00Z"
val endTime="2017-12-05T08:00:00Z"
val start = DateUtils.getLocalTimeStamp( startTime )
val end = DateUtils.getLocalTimeStamp( endTime )
val days: Int = Days.daysBetween( start, end ).getDays
val files: Seq[String] = (0 to days)
.map( start.plusDays )
.map( d => s"$input_path${DateTimeFormat.forPattern( "yyyy/MM/dd" ).print( d )}/*/*" )
sqlSession.sparkContext.textFile( files.mkString( "," ) ).count
When I ran the same with 14 days, I got a count of 197337380, and when I ran the 15th day separately I got 27676788. But when I query all 15 days together, the job hangs.
Update:
The job works fine with:
var df = sqlSession.createDataFrame(sc.emptyRDD[Row], schema)
for(n <- files ){
val tempDF = sqlSession.read.schema( schema ).json(n)
df = df.union(tempDF)
}
df.count
But can someone explain why it works now but not before?
UPDATE: After setting mapreduce.input.fileinputformat.split.minsize to 256 GB it works fine now.
Dynamic allocation and maximize resource allocation are different settings; one is disabled when the other is active. With maximize resource allocation on EMR, one executor is launched per node, and all of the node's cores and memory are allocated to that executor.
I would recommend taking a different route. You have a pretty big cluster with 51 nodes, and I am not sure it is even required. However, follow this rule of thumb to begin with, and you will get the hang of tuning these configurations.
Cluster memory - minimum of 2X the data you are dealing with.
Now, assuming 51 nodes is what you require, try the following:
r3.4xlarge has 16 CPUs, so you can put them to use while leaving one for the OS and other processes.
Set the number of executors to 150; this allocates 3 executors per node.
Set the number of cores per executor to 5 (3 executors per node).
Set the executor memory to roughly total host memory / 3 = 35G.
Control the parallelism (default partitions) by setting it to the total number of cores you have, ~800.
Adjust shuffle partitions to twice the number of cores: 1600.
The configurations above have been working like a charm for me. You can monitor resource utilization in the Spark UI.
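The arithmetic behind this rule of thumb can be written out as follows. This is a rough sketch in Python, assuming 16 vCPUs per r3.4xlarge node and one core reserved per node; `size_executors` and its result keys are hypothetical names, and the exact results differ slightly from the rounded figures quoted above.

```python
def size_executors(nodes, vcpus_per_node, cores_per_executor=5, reserved_cores=1):
    """Derive executor counts and partition settings from the cluster shape."""
    usable = vcpus_per_node - reserved_cores          # leave one core for the OS
    per_node = usable // cores_per_executor           # executors that fit per node
    executors = nodes * per_node
    total_cores = executors * cores_per_executor
    return {
        "executors_per_node": per_node,
        "num_executors": executors,
        "total_cores": total_cores,
        "default_parallelism": total_cores,           # one partition per core
        "shuffle_partitions": 2 * total_cores,        # twice the core count
    }


plan = size_executors(nodes=51, vcpus_per_node=16)
for name, value in plan.items():
    print(f"{name} = {value}")
```

For 51 nodes this yields 153 executors, 765 total cores, and 1530 shuffle partitions, which the answer rounds to 150, ~800, and 1600.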
Also, in your YARN config file /etc/hadoop/conf/capacity-scheduler.xml, set yarn.scheduler.capacity.resource-calculator to org.apache.hadoop.yarn.util.resource.DominantResourceCalculator, which will allow Spark to really go full throttle with those CPUs. Restart the YARN service after the change.
You should increase the executor memory and the number of executors. If the data is huge, try increasing the driver memory as well.
My suggestion is to turn off dynamic resource allocation, let the job run, and see whether it still hangs. (Please note that the Spark job can then consume the entire cluster's resources and starve other applications, so try this approach when no other jobs are running.) If it doesn't hang, the problem lies in resource allocation: start hardcoding the resources and keep increasing them until you find the best allocation you can use.
The links below can help you understand resource allocation and optimization:
http://site.clairvoyantsoft.com/understanding-resource-allocation-configurations-spark-application/
https://community.hortonworks.com/articles/42803/spark-on-yarn-executor-resource-allocation-optimiz.html

Threadpool/Queue size limitation unsolved

I am using ES to do some data indexing on Windows. However, I keep running into the following errors. It seems to be a queue size or thread pool size problem, but I could not find any documentation that explains how to change the Windows settings to solve it.
[2016-07-20 11:11:56,343][DEBUG][action.search ] [Adaptoid] [cpu-2015.09.23][2], node[1Qp4zwR_Q5GLX-VChDOc2Q], [P], v[42], s[STARTED], a[id=KznRm9A5S0OhTMZMoED0qA]: Failed to execute [org.elasticsearch.action.search.SearchRequest#444b07] lastShard [true]
RemoteTransportException[[Adaptoid][172.16.1.238:9300][indices:data/read/search[phase/query]]]; nested: EsRejectedExecutionException[rejected execution of org.elasticsearch.transport.TransportService$4#cd47e on EsThreadPoolExecutor[search, queue capacity = 1000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor#9c72f5[Running, pool size = 4, active threads = 4, queued tasks = 1000, completed tasks = 1226]]];
Caused by: EsRejectedExecutionException[rejected execution of org.elasticsearch.transport.TransportService$4#cd47e on EsThreadPoolExecutor[search, queue capacity = 1000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor#9c72f5[Running, pool size = 4, active threads = 4, queued tasks = 1000, completed tasks = 1226]]]
at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:50)
Does anyone have experience with this?
The problem is not with Elasticsearch but with your indexing procedure. By throwing that exception, ES is telling you that you are sending more search requests than it can keep up with.
If you are indexing at the same time, the pressure from the indexing process (memory, CPU, segment merging) can affect the other operations ES is performing. So, if you are also indexing, do it at a lower pace, as it is affecting the search operations.
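The numbers in the error message (pool size = 4, queue capacity = 1000, queued tasks = 1000) describe a fixed pool with a bounded queue that rejects work once both are full. The following is a minimal single-threaded model of that behavior in Python, not Elasticsearch code; `BoundedPoolModel` and `RejectedExecution` are hypothetical names used only for illustration.

```python
from collections import deque


class RejectedExecution(Exception):
    """Raised when all workers are busy and the queue is full."""


class BoundedPoolModel:
    def __init__(self, pool_size, queue_capacity):
        self.pool_size = pool_size
        self.queue_capacity = queue_capacity
        self.active = 0
        self.queued = deque()

    def submit(self, task):
        if self.active < self.pool_size:
            self.active += 1              # a free worker takes the task
            return "running"
        if len(self.queued) < self.queue_capacity:
            self.queued.append(task)      # all workers busy: wait in the queue
            return "queued"
        raise RejectedExecution(
            f"pool size = {self.pool_size}, queued tasks = {len(self.queued)}"
        )

    def finish_one(self):
        # A worker completes its task and picks up the next queued one, if any.
        if self.queued:
            self.queued.popleft()
        elif self.active:
            self.active -= 1


demo = BoundedPoolModel(pool_size=4, queue_capacity=1000)
for n in range(1004):
    demo.submit(n)                        # 4 running + 1000 queued
try:
    demo.submit("one more")
except RejectedExecution as err:
    print("rejected:", err)
```

Requests are rejected only while they arrive faster than workers finish them, which is why slowing the senders down makes the error disappear.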

es_rejected_execution_exception rejected execution

I'm getting the following error when doing indexing.
es_rejected_execution_exception rejected execution of org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase$1#16248886
on EsThreadPoolExecutor[bulk, queue capacity = 50,
org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor#739e3764[Running,
pool size = 16, active threads = 16, queued tasks = 51, completed
tasks = 407667]
My current setup:
Two nodes. One is the master (data: true, master: true) while the other one is data only (data: true, master: false). They are both EC2 I2.4XL (16 Cores, 122GB RAM, 320GB instance storage). 2 shards, 1 replication.
Those two nodes are being fed by our aggregation server which has 20 separate workers. Each worker makes bulk indexing request to our ES cluster with 50 items to index. Each item is between 1000-4000 characters.
Current server setup: 4x client facing servers -> aggregation server -> ElasticSearch.
Now the issue is that this error only started occurring when we introduced the second node. Before, with one machine, we got a consistent indexing throughput of 20k requests per second. Now, with two machines, once it hits the 10k mark (~20% CPU usage) we start getting some of the errors outlined above.
But here is the interesting thing I have noticed. We have a mock item generator which generates random documents to be indexed. Generally these documents are of the same size but have random parameters. We use this to run stress tests and check stability. The mock item generator sends requests to the aggregation server, which in turn passes them to Elasticsearch. With it, we are able to index around 40-45k items per second (~80% CPU usage) without getting this error. So it is puzzling why we get this error at all. Has anyone seen this error, or does anyone know what could be causing it?

Azure worker role: how do I query the memory usage from within the role and sleep or reboot

I have a worker role that runs a number of parallel background workers. These workers run tasks that last from one minute to 5 hours and use quite a lot of memory.
I would like to delay the start of a new worker by testing the current level of memory consumption. Something like this:
while (memoryAvailable < 50%) {
Thread.Sleep( 1000 * 60 * 10 ); // 10 minutes
}
Can I test for available memory within a worker role?
Also, can I automate a reboot of the instance if memory drops below a certain amount?
Since your worker role instances are Windows Server 2012, you can just set up an appropriate perf counter during role startup ( OnStart() ) with whichever pertinent Memory counters you're interested in, and set up a task to observe the perf counter periodically. When available memory drops below your threshold (or committed bytes exceeds your threshold), you can easily recycle the role instance:
RoleEnvironment.RequestRecycle();
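The polling logic from the question can be sketched independently of the counter that supplies the reading. This is a minimal Python illustration under stated assumptions: `read_available_fraction` is a hypothetical callable that returns available memory as a fraction between 0 and 1 (in the worker role it would be backed by a memory performance counter), and `wait_for_memory` is not an Azure API.

```python
import time


def wait_for_memory(read_available_fraction, minimum=0.5, poll_seconds=600,
                    sleep=time.sleep, max_polls=None):
    """Block until enough memory is available; return False if we give up."""
    polls = 0
    while read_available_fraction() < minimum:
        if max_polls is not None and polls >= max_polls:
            return False          # still short on memory after max_polls waits
        sleep(poll_seconds)       # default: check again in 10 minutes
        polls += 1
    return True


# Example with canned readings instead of a real counter.
readings = iter([0.2, 0.3, 0.7])
ok = wait_for_memory(lambda: next(readings), sleep=lambda seconds: None)
print(ok)
```

A `False` result is the point where the role could decide to recycle itself instead of starting another worker.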
