Oracle 11g resource manager clarification

I'm hoping that someone can just confirm my understanding of how the resource manager works...
If I've got a 4-node RAC with 2 consumer groups and 2 services, and the services pin each consumer group to a fixed pair of nodes, i.e. consumer group 1 ALWAYS gets sent to nodes 1 and 2 and consumer group 2 ALWAYS gets sent to nodes 3 and 4.
If I've got a tiered resource plan such as:
Group Name | L0  | L1  | Max
Group 1    | 75% | 0   | 80%
Group 2    | 0   | 75% | 80%
Am I right in saying that, as group 1 is on nodes 1 and 2 and group 2 is on nodes 3 and 4, each will have 75% of resources available on its respective nodes, and both will be limited to 80% on the nodes they run on?
I.e. resources are constrained and calculated on a per-node basis, not per-cluster.
So just because a connection in group 1 on node 1 is using 80% of resources, another group 1 connection on node 2 still has up to 80% available to it, not 0%.
And similarly, if group 1 is using its allocated maximum, group 2 will still get its full share on nodes 3 and 4, since the higher-priority group 1 isn't running on those nodes.

I've had a response from Oracle Support:
Resource Manager's limits are applied per node (except PARALLEL_TARGET_PERCENTAGE), so for your example, you are right. Since connections in consumer group 2 only ever hit nodes 3 and 4 (due to the services), group 2 will get a minimum of 75% of resources on those nodes, and potentially 100% if no max limit has been set, or 80% if the max limit has been set.

Related

Elasticsearch bulk write requests end up on the same node of the cluster, causing the cluster to reject writes

I have an ElasticSearch 2.1.1 cluster with 11 nodes.
Most of the time everything goes well, but when the load increases (writing data from multiple Storm topologies), all the write requests seem to go to the same node. That node ends up overworked, with its queue over the limit, while the other nodes just sit there doing nothing:
node  bulk.active  bulk.queue
1     0            0
2     0            0
3     32           114
4     0            0
...and so on
And after a while the cluster starts to reject the write requests:
nested: EsRejectedExecutionException[rejected execution of org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase$1@7e368c0b on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@13ff1612[Running, pool size = 32, active threads = 32, queued tasks = 55, completed tasks = 249363622]]]
After the load passes it recovers, but the same thing happens each time the load increases.
Has anyone encountered something like this? What might be the cause?
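Whatever the root cause of the skew, a common client-side mitigation for these rejections is to back off and retry instead of hammering the overloaded node. A generic sketch (not Elasticsearch client code; `send_bulk` and `BulkRejectedError` are hypothetical stand-ins for your client's bulk call and the exception it raises when the server answers with EsRejectedExecutionException):

```python
import random
import time

class BulkRejectedError(Exception):
    """Stand-in for the client-side rejection exception."""
    pass

def bulk_with_backoff(send_bulk, docs, retries=5, base_delay=0.5):
    """Retry send_bulk(docs) with exponential backoff and full jitter."""
    for attempt in range(retries):
        try:
            return send_bulk(docs)
        except BulkRejectedError:
            if attempt == retries - 1:
                raise  # retries exhausted; surface the rejection
            # full jitter: sleep somewhere in [0, base * 2^attempt]
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))
```

This spreads retries out over time so a briefly saturated bulk queue (capacity 50 here) can drain, rather than turning one hot node into a stream of rejections.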

How does Raft guarantee log consistency?

I'm learning Raft, and I already know the basic mechanism of Raft.
When a Leader is elected, it is responsible for bringing the Followers' logs up to date with its own. When updating a Follower, it searches backwards for the first matching <entry, term> pair, and then updates the Follower with all the log entries that follow it.
How does Raft guarantee that the logs of the Leader and the Follower before the matching <entry, term> are the same? Could this case happen:
Leader              v
Entry : 1 2 3 4 5 6 7 8 9 10
Term  : 1 1 1 2 2 3 3 3 3 3

Follower            ^
Entry : 1 2 3 4 5 6 7
Term  : 1 1 1 1 2 3 3
This property of the Raft algorithm is called Log Matching.
If two logs contain an entry with the same index and term, then the logs are identical in all entries up through the given index.
This holds because:
When sending an AppendEntries RPC, the leader includes the index and term of the entry in its log that immediately precedes the new entries. If the follower does not find an entry in its log with the same index and term, then it refuses the new entries. The consistency check acts as an induction step: the initial empty state of the logs satisfies the Log Matching Property, and the consistency check preserves the Log Matching Property whenever logs are extended. As a result, whenever AppendEntries returns successfully, the leader knows that the follower's log is identical to its own log up through the new entries.
Source https://raft.github.io/raft.pdf
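The consistency check and the backtracking repair can be sketched in a few lines (a simplified model, not a full Raft implementation: a log is just a list of entry terms, with 1-based indices as in the paper):

```python
def append_entries(follower_log, prev_index, prev_term, entries):
    """Follower side of the consistency check: accept `entries` only if
    the entry at prev_index carries prev_term (prev_index == 0 means the
    entries start at the very beginning of the log)."""
    if prev_index > 0 and (len(follower_log) < prev_index
                           or follower_log[prev_index - 1] != prev_term):
        return False               # check failed; leader backs up and retries
    del follower_log[prev_index:]  # drop any conflicting suffix
    follower_log.extend(entries)
    return True

def repair_follower(leader_log, follower_log):
    """Leader side: decrement nextIndex until the check passes, then
    replicate the rest of the leader's log over the follower's."""
    next_index = len(leader_log) + 1
    while True:
        prev = next_index - 1
        prev_term = leader_log[prev - 1] if prev > 0 else 0
        if append_entries(follower_log, prev, prev_term, leader_log[prev:]):
            return follower_log
        next_index -= 1

leader = [1, 1, 1, 2, 2, 3, 3, 3, 3, 3]
# A follower log that CAN arise (diverged during term 2, missed term 3):
follower = [1, 1, 1, 2, 2, 2, 2]
print(repair_follower(leader, follower))  # [1, 1, 1, 2, 2, 3, 3, 3, 3, 3]

# The follower pictured in the question cannot arise: it agrees with the
# leader at index 6 (term 3) but disagrees at index 4, and every term-3
# entry was only ever accepted after a successful check on the entry
# immediately before it, all the way back to the empty log.
```

Running the repair shows the induction at work: the first index where the check succeeds (index 5, term 2) is, by the Log Matching Property, a point up to which the two logs are already identical.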

Tuples taking too much time to get from spout to last bolt (aka Complete Latency is high)

Version Info:
"org.apache.storm" % "storm-core" % "1.2.1"
"org.apache.storm" % "storm-kafka-client" % "1.2.1"
I have been experimenting with a Storm topology I created, which has 4 bolts and one Kafka spout.
I was trying to tune configs like the parallelism of these bolts, max-spout-pending, etc. to see how much scale I can get out of it. After some tuning, the config/results look something like this:
max-spout-pending: 1200
Kafka Spout Executors: 10
Num Workers: 10
+----------+-------------+----------+----------------------+
| boltName | Parallelism | Capacity | Execute latency (ms) |
+----------+-------------+----------+----------------------+
| __acker | 10 | 0.008 | 0.005 |
| bolt1 | 15 | 0.047 | 0.211 |
| bolt2 | 150 | 0.846 | 33.151 |
| bolt3 | 1500 | 0.765 | 289.679 |
| bolt4 | 48 | 0.768 | 10.451 |
+----------+-------------+----------+----------------------+
Process latency and Execute latency are almost the same. There is an HTTP call in bolt 3 which takes approximately that much time, and bolts 2 and 4 are also doing some I/O.
While I can see that each bolt can individually process more than 3k qps (bolt3: 1500/289.679 ms ≈ 5.17k qps, bolt4: 48/10.451 ms ≈ 4.59k qps, and so on), overall this topology is processing tuples at only ~3k qps. I am running it on 10 boxes (so one worker per box), each with a 12-core CPU and 32 GB RAM. I have given each worker process -Xms8g and -Xmx10g, so RAM should not be a constraint either. GC also looks healthy: 4 GCs per minute, taking around 350 ms in total per minute (from a 1-minute flight recording of a worker process).
I see a Complete Latency of around 4 sec per tuple, which I am not able to understand: if I add up the time taken by all the bolts, it comes to around 334 ms. But, as mentioned here, tuples can be waiting in buffers, and the suggestion is to increase the dop (degree of parallelism), which I have done to reach the state above.
I added some more metering and I see tuples taking on average around 1.3 sec to get from bolt 2 to bolt 3, and 5 sec from bolt 3 to bolt 4. I understand Storm might be keeping them in its outbound or inbound buffers, but my question is how do I reduce this. These bolts should be able to process more tuples per second as per my earlier calculation; what is holding them back from entering and being processed at a faster rate?
I think your issue may be due to the ack tuples, which are used to start and stop the complete-latency clock, being stuck waiting at the ackers.
You have a lot of bolts and presumably high throughput, which will result in a lot of ack messages. Try increasing the number of ackers via the topology.acker.executors config value, which will hopefully reduce the queuing delay for the ack tuples.
If you are also using a custom metrics consumer, you may want to increase the parallelism of that component too, given the number of bolts you have.
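A back-of-the-envelope check on the numbers in the question supports this: with acking enabled, the spout caps in-flight tuples at max-spout-pending per spout task, so Little's law ties the observed throughput to the complete latency rather than to any single bolt's capacity (assuming complete latency ≈ 4 s, as stated):

```python
# Little's law: throughput = in-flight tuples / complete latency.
# Figures from the question; max-spout-pending applies per spout task.

max_spout_pending = 1200
spout_tasks = 10
complete_latency_s = 4.0

in_flight = max_spout_pending * spout_tasks      # 12000 tuples in flight
throughput_qps = in_flight / complete_latency_s
print(throughput_qps)  # 3000.0 -- the observed ~3k qps ceiling
```

So cutting complete latency (e.g. by adding ackers) or raising max-spout-pending should raise the throughput ceiling, even though every bolt individually has headroom.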

Elasticsearch nodes not participating in indexing

Background
With our Elasticsearch nodes, I've noticed very high CPU usage relative to I/O throughput when indexing documents (queries seem to be fine). I was able to increase throughput via vertical scaling (adding more CPUs to the servers), but I wanted to see what kind of increase I would get from horizontal scaling (doubling the number of nodes from 2 to 4).
Problem
I expected to see increased throughput with the expanded cluster size but the performance was actually a little worse. I also noticed that half of the nodes reported very little I/O and CPU usage.
Research
I saw that the primary shard distribution was wonky so I shuffled some of them around using the re-route API. This didn't really have any effect other than to change which two nodes were being used.
The _search_shards API indicates that all nodes and shards should participate.
Question
I'm not sure why only two nodes are participating in indexing. Once a document has been indexed, is there a way to see which shard it resides in? Is there something obvious that I'm missing?
Setup
Servers: 2 CPU, 10g JVM, 18G RAM, 500G SSD
Index: 8 shards, 1 replica
Routing Key: _id
Total Document Count: 4.1M
Index Document Count: 50k
Avg Document Size: 14.6K
Max Document Size: 32.4M
Stats
Shards
files-v2 4 r STARTED 664644 8.4gb 10.240.219.136 es-qa-03
files-v2 4 p STARTED 664644 8.4gb 10.240.211.15 es-qa-01
files-v2 7 r STARTED 854807 10.5gb 10.240.53.190 es-qa-04
files-v2 7 p STARTED 854807 10.2gb 10.240.147.89 es-qa-02
files-v2 0 r STARTED 147515 711.4mb 10.240.53.190 es-qa-04
files-v2 0 p STARTED 147515 711.4mb 10.240.211.15 es-qa-01
files-v2 3 r STARTED 347552 1.2gb 10.240.53.190 es-qa-04
files-v2 3 p STARTED 347552 1.2gb 10.240.147.89 es-qa-02
files-v2 1 p STARTED 649461 3.5gb 10.240.219.136 es-qa-03
files-v2 1 r STARTED 649461 3.5gb 10.240.147.89 es-qa-02
files-v2 5 r STARTED 488581 3.6gb 10.240.219.136 es-qa-03
files-v2 5 p STARTED 488581 3.6gb 10.240.211.15 es-qa-01
files-v2 6 r STARTED 186067 916.8mb 10.240.147.89 es-qa-02
files-v2 6 p STARTED 186067 916.8mb 10.240.211.15 es-qa-01
files-v2 2 r STARTED 765970 7.8gb 10.240.53.190 es-qa-04
files-v2 2 p STARTED 765970 7.8gb 10.240.219.136 es-qa-03
Make sure that the JVM + Elasticsearch configurations are the same on all nodes.
For testing purposes, try making all nodes hold all the data (in your case, set the number of replicas to 3).
About document-shard relation:
https://www.elastic.co/guide/en/elasticsearch/guide/current/routing-value.html
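The routing scheme behind that link works roughly as sketched below (an illustration, not Elasticsearch source: the real implementation hashes the routing value, which defaults to the `_id`, with murmur3, so md5 here is just a stand-in and the resulting numbers won't match a real cluster's placement):

```python
import hashlib

def shard_for(routing: str, num_primary_shards: int = 8) -> int:
    """shard = hash(routing) % number_of_primary_shards
    (md5 stands in for Elasticsearch's murmur3)."""
    digest = hashlib.md5(routing.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_primary_shards

# The routing value fixes the SHARD, not the node: whichever nodes hold
# a copy of that shard (primary or replica) do the indexing work.
for doc_id in ("doc-1", "doc-2", "doc-3"):
    print(doc_id, "-> shard", shard_for(doc_id))
```

The key point for this question is that placement is deterministic per document but spread across all 8 shards, so with all shards allocated, indexing load should not concentrate on two nodes by routing alone.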
OK, so I think I found it. I'm using Spring Data's Elasticsearch repository. Inside its save(doc) method, there's a call to refresh:
public <S extends T> S save(S entity) {
    Assert.notNull(entity, "Cannot save 'null' entity.");
    elasticsearchOperations.index(createIndexQuery(entity));
    elasticsearchOperations.refresh(entityInformation.getIndexName(), true);
    return entity;
}
I bypassed this by invoking the API without Spring's abstraction, and the CPU usage across all nodes was much, much better. I'm still not quite clear why a refresh would affect 2 nodes (rather than 1 or all), but the issue appears to be resolved.

MPI virtual graph topology broadcast

I have the following problem:
I would like to create a virtual topology based on a tree graph, for example:
      0
     / \
    1   5
   / \   \
  2   4   3
The vertices' numbers are the ranks of the processes.
I managed to do that, and I have a handle on my new communicator:
MPI_Comm graph_comm;
MPI_Graph_create(MPI_COMM_WORLD, nnodes, indexes, edges, 0, &graph_comm);
Now my question is:
Is there a way to send a broadcast (MPI_Bcast) from each parent node that has children to its children only? (In this example, rank 0 broadcasts to ranks 1 and 5; rank 1 broadcasts to ranks 2 and 4; rank 5 broadcasts to rank 3.)
It seems to be impossible without creating separate communicators for broadcasting. And while MPI_Graph_neighbors_count and MPI_Graph_neighbors should be enough to create the new groups, one might wonder why we need graph topologies in the first place if those groups can be created from exactly the same data the graph topology is built from.
Yes, you must create a group in every process, and then you can call MPI_Bcast on each group with the parent of the node as root. (In your example, 0 is the parent of 1 and 5, but remember that ranks are renumbered within the local communicator, so the parent does not have to be rank 0 in the local group; it depends on how you create it.)
This can help: Group and Communicator Management Routines
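The bookkeeping the answer describes can be sketched without MPI (a plain-Python model: it derives one (parent, children) set per internal node from the same index/edges arrays passed to MPI_Graph_create; in real code each set would then be fed into MPI_Group_incl / MPI_Comm_create, and MPI_Bcast called on that communicator with the parent as root):

```python
# Tree from the question: 0 -> {1, 5}, 1 -> {2, 4}, 5 -> {3}.
# MPI graph topology arrays; the graph is undirected, so every edge
# appears once from each endpoint.
index = [2, 5, 6, 7, 8, 10]           # cumulative neighbor counts
edges = [1, 5, 0, 2, 4, 1, 5, 1, 0, 3]

def neighbors(rank):
    """Neighbors of `rank`, decoded the way MPI_Graph_neighbors would."""
    start = index[rank - 1] if rank > 0 else 0
    return edges[start:index[rank]]

def bcast_groups(root=0):
    """BFS from the root: each node's children are the neighbors not yet
    visited; every internal node yields one (parent, children) group."""
    groups, frontier, seen = [], [root], {root}
    while frontier:
        nxt = []
        for parent in frontier:
            children = [n for n in neighbors(parent) if n not in seen]
            if children:
                groups.append((parent, children))
                seen.update(children)
                nxt.extend(children)
        frontier = nxt
    return groups

print(bcast_groups())  # [(0, [1, 5]), (1, [2, 4]), (5, [3])]
```

Each tuple is exactly one of the broadcasts the question asks for, which also illustrates the point above: the groups fall out of the same adjacency data the graph topology was built from.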
