Rebalancing algorithm with restrictions

Please help in solving the following problem.
The following entities are given:
Application. Applications reside on storage and generate traffic through a service node.
Service. The service is divided into several nodes. Each node has access to local and/or shared storage.
Storage. This is where applications reside. It can be local (connected to only one service node) or shared by several nodes.
Rules:
Each application is placed on a particular storage, and that storage cannot be changed.
The service node for the application can be changed to another one, as long as the new service node has access to the application's storage.
For example, if App resides on local storage of Node0, it can only be served by Node0. But if App resides on storage shared0, it can be served by Node0, Node1 or Node2.
The problem is to find an algorithm that rebalances applications between service nodes, given that all applications are already placed on their datastores, and to make this rebalancing as fair as possible.
If we take, for example, the shared2 storage, the solution seems trivial: we take the app counts for Node3 and Node4 and divide all apps equally between them.
But when it comes to shared1, it becomes more complicated, since Node2 also has access to the shared0 storage. So when rebalancing apps from the group [Node2, Node5] we also have to take into account apps from the group [Node0, Node1, Node2]. The groups [Node2, Node5] and [Node0, Node1, Node2] intersect, and rebalancing should be performed across all groups at once.
I suspect there should be a well-known algorithm for this problem, but I still cannot find it.

I think the Hungarian Matching algorithm would fit your needs. However, it might be a simple enough problem to try your own approach.
If you separate all the unconnected graphs, you'll have some set of Shared storage units per graph, each set associated with a collection of Apps. If you spread each of those Apps evenly across each Storage's associated Nodes, you would have some Nodes with more Apps than others; those Nodes will be the ones connected to multiple Shared storage units.
If all vacant nodes are filled, there should always be a transitive relationship between any two Nodes within a connected graph, such that the App count on one can be decreased and the App count on the other increased, even if some intermediate displacements are needed. So, iteratively move an App along the path from the heaviest Node to the lightest Node, shortcutting if you reach a vacant Node, and swapping Apps at any intermediate node as needed to continue along that path through one or more Shared storage units. You should be balanced when the counts of the heaviest and lightest Nodes differ by no more than one.
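A minimal sketch of that loop in Python, under an assumed representation that is not in the question: apps_on maps each node to the set of apps it currently serves, and allowed_nodes maps each app to the nodes that can reach its storage.

# Sketch of the heaviest-to-lightest cascade described above; the data layout
# (apps_on, allowed_nodes) is an assumption, not the asker's actual model.
from collections import deque

def neighbours(node, apps_on, allowed_nodes):
    """Nodes that `node` can hand one of its current apps to."""
    out = set()
    for app in apps_on[node]:
        out |= allowed_nodes[app] - {node}
    return out

def bfs_path(start, goal, apps_on, allowed_nodes):
    """Shortest chain of app moves linking two nodes, or None if unconnected."""
    prev, queue = {start: None}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = [node]
            while prev[node] is not None:
                node = prev[node]
                path.append(node)
            return path[::-1]
        for nxt in neighbours(node, apps_on, allowed_nodes):
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    return None

def rebalance(apps_on, allowed_nodes):
    """Assumes a single connected component; run once per component otherwise."""
    while True:
        heavy = max(apps_on, key=lambda n: len(apps_on[n]))
        light = min(apps_on, key=lambda n: len(apps_on[n]))
        if len(apps_on[heavy]) - len(apps_on[light]) <= 1:
            return apps_on                       # balanced to within one app
        path = bfs_path(heavy, light, apps_on, allowed_nodes)
        if path is None:
            return apps_on                       # heavy and light not connected
        for src, dst in zip(path, path[1:]):     # cascade one app per hop
            app = next(a for a in apps_on[src] if dst in allowed_nodes[a])
            apps_on[src].remove(app)
            apps_on[dst].add(app)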

Related

Oracle NoSQL Storage Nodes

What are the guidelines for creating an Oracle NoSQL Database Storage Node (SN)? Can we create multiple storage nodes on the same machine? If so, what are the trade-offs? I looked at the product documentation, but it's not clear.
So, digging deeper, here's what was found:
It is recommended that Storage Nodes (SNs) be allocated one per node in the cluster for availability and performance reasons. If you believe that a given node has the I/O and CPU resources to host multiple Replication Nodes, the Storage Node's capacity parameter can be set to a value greater than one, and the system will know that multiple RNs may be hosted at this SN. This way, the system can:
ensure that each Replication Node in a shard is hosted on a different Storage Node, reducing a shard's vulnerability to failure
dynamically divide up memory and other hardware resources among the Replication Nodes
ensure that the master Replication Nodes, which are the ones which service write operations in a store, are distributed evenly among the Storage Nodes both at startup and after any failovers
If more than one SN is hosted on the same node, multiple SNs are lost if that node fails, and data may become inaccessible.
You can set the capacity parameter for a Storage Node in several ways:
When using the makebootconfig command
With the change-policy command
With the plan change-params command.
Also, in very limited situations, such as for early prototyping and experimentation, it can be useful to create multiple SNs on the same node.
On a single machine, a Storage Node is uniquely identified by its root directory (KVROOT) plus a configuration file name, which defaults to "config.xml". This means you can create multiple SNs by creating a unique KVROOT directory for each SN. Usually, these would be on different nodes, but it's also possible to have them on a single node.

Number of nodes AWS Elasticsearch

I read the documentation, but unfortunately I still don't understand one thing. While creating an AWS Elasticsearch domain, I need to choose "Number of nodes" in the "Data nodes" section.
If I specify 3 data nodes and 3 AZs, what does it actually mean?
Here are my guesses:
I'll get 3 nodes with their own storage (EBS). One of the nodes is the master and the other 2 are replicas in different AZs, just copies of the master, so that data isn't lost if the master node breaks.
I'll get 3 nodes with their own storage (EBS). All of them work independently and hold different data on their storage, so data can be processed by different nodes and stored on different volumes at the same time.
It looks like the other AZs should hold replicas, but then I don't understand why I see different amounts of free space on different nodes.
Please explain how it works.
Many thanks for any info or links.
I haven't used AWS Elasticsearch, but I've used the Cloud Elasticsearch service.
Using 3 AZ (availability zones) means that your cluster will use 3 zones in order to make it resilient. If one zone has problems, then the nodes in that zone will have problems as well.
As the description section mentions, you need to specify a multiple of 3 nodes if you choose 3 AZ. If you have 3 nodes, then every AZ will have one node. If one zone has problems, then that node is out and the two remaining ones will have to pick up from there.
Now, in order to answer your question of what you get with these configurations: you can check for yourself. Run this via Kibana or any HTTP client:
GET _nodes
Check for the sections:
nodes.roles
nodes.attributes
In the various documentation, blog posts, etc. you will see that for production usage, 3 nodes and 3 AZs are a good starting point for a resilient production cluster.
So let's take it step by step:
You need an odd number of master-eligible nodes in order to avoid the split-brain problem.
You need more than one node in your cluster in order to make it resilient (in case a node becomes unavailable).
By combining these two you have the minimum requirement of 3 nodes (no mention of zones yet).
But having one master and two data nodes will not cut it. You need to have 3 master-eligible nodes, so that if one node is out, the other two can still form a quorum and elect a new master, and your cluster will remain operational with two nodes. But in order for this to work, you need to set up your primary and replica shards in a way that any two of your nodes can hold your entire data.
Examples (for simplicity we have only one index):
1 primary, 2 replicas. Every node holds one shard which is 100% of the data
3 primaries, 1 replica. Every node will hold one primary and one replica (33% primary, 33% replica). Two nodes combined (which is the minimum to form a quorum as well) will hold all your data (and some more)
You can have more combinations but you get the idea.
As you can see, the shard configuration needs to go along with your number and type of nodes (master-eligible, data only etc).
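For reference, the second example above corresponds to index settings like these; a minimal sketch in Python, where the endpoint and index name are placeholders and authentication is omitted:

# Create an index with 3 primary shards and 1 replica each (second example above).
# The endpoint and index name are placeholders, not anything from the question.
import requests

ES = "https://your-cluster-endpoint"
settings = {"settings": {"number_of_shards": 3, "number_of_replicas": 1}}
resp = requests.put(f"{ES}/my-index", json=settings)
resp.raise_for_status()
print(resp.json())   # expect an acknowledgement such as {"acknowledged": true, ...}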
Now, if you add the availability zones, you take care of the problem of one zone being problematic. If your whole cluster was in one zone (3 nodes in one zone) and that zone had problems, your whole cluster would be out.
If you set up one master node and two data nodes (which are not master-eligible), having 3 AZs (or even 3 nodes) doesn't do much for resiliency: if the master goes out, your cluster cannot elect a new one and will be down until the master node is up again. With the same setup, if a data node goes out, things will keep working as long as your shards are configured with redundancy (meaning the two remaining nodes hold all the data between them).
Your questions should be covered by the following points.
If I specify 3 data nodes and 3 AZs, what does it actually mean?
This means that your data and replicas will be spread across 3 AZs, with no replica placed in the same AZ as its primary data. Check this link. For example, when you say you want 2 data nodes in 2 AZs, DN1 will be stored in (let's say) AZ1 and its replica in AZ2; DN2 will be in AZ2 and its replica in AZ1.
It looks like the other AZs should hold replicas, but then I don't understand why I see different amounts of free space on different nodes.
This is because when you give your AWS Elasticsearch domain some amount of storage, the cluster divides the specified storage space among all data nodes. If you specify 100G of storage on a cluster with 2 data nodes, it divides the space equally across the data nodes, i.e. two data nodes with 50G of available storage each.
Sometimes you will see more nodes than you specified on the cluster. It took me a while to understand this behaviour: when you update these configs on AWS ES, it takes some time to stabilize the cluster. So if you see more data or master nodes than expected, hold on for a while and wait for it to stabilize.
Thanks everyone for the help. To see how much space is available/allocated, run the following queries:
GET /_cat/allocation?v
GET /_cat/indices?v
GET /_cat/shards?v
So, if I create 3 nodes, I get 3 different nodes with separate storage; they are not replicas. Some data is stored on one node, some on another.
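If you'd rather script that check than run it in Kibana, a small sketch along the same lines (the endpoint is a placeholder; add whatever authentication your domain requires):

# Print per-node shard count and free disk, using the same _cat/allocation API.
import requests

ES = "https://your-cluster-endpoint"   # placeholder endpoint
rows = requests.get(f"{ES}/_cat/allocation?v&format=json").json()
for row in rows:
    print(row["node"], "shards:", row["shards"], "disk.avail:", row["disk.avail"])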

How does balancer work in HDFS?

Balancer iteratively moves replicas from DataNodes with higher utilization to DataNodes with lower utilization.
Will that affect the concept of rack awareness?
For example:
I have three machines placed in two racks, and data is placed following the concept of rack awareness.
What would happen if I add a new machine to the cluster and run the balancer command?
Rack awareness & data locality is a YARN concept. The HDFS balancer only cares about leveling out the Datanode usage.
If you have 3 machines, with 3 replicas by default, then every machine could be guaranteed to have 1 replica, therefore with 2 racks, you're practically guaranteed to have rack locality.
Node locality is more performant than rack awareness, anyway.
If you have 10 Gb/s intra-cluster network speed between nodes, data locality is a moot point. This is why AWS can still reasonably process data in S3, for example, where data-locality processing is not available.
If your question is how load balancing is used: Load balancing is helpful in spreading the load equally across the free nodes when a node is loaded above its threshold level.
Now, a cluster is considered balanced if, for each data node, the ratio of used space at the node to the total capacity of the node (known as the utilization of the node) differs from the ratio of used space in the cluster to the total capacity of the cluster (the utilization of the cluster) by no more than the threshold value.
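As a quick illustration of that condition (the numbers below are made up; 10% is the balancer's default threshold):

def is_balanced(node_used, node_capacity, cluster_used, cluster_capacity,
                threshold=0.10):
    node_util = node_used / node_capacity            # utilization of the node
    cluster_util = cluster_used / cluster_capacity   # utilization of the cluster
    return abs(node_util - cluster_util) <= threshold

print(is_balanced(70, 100, 550, 1000))   # False: node at 70% vs cluster at 55%
print(is_balanced(60, 100, 550, 1000))   # True: within the 10% threshold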
When you apply load balancing at runtime, it is called dynamic load balancing, and it can be realized in either a direct or an iterative manner, according to how the execution node is selected:
In the iterative methods, the final destination node is determined through several iteration steps.
In the direct methods, the final destination node is selected in one step.
Rack Awareness
Rack Awareness prevents losing data when an entire rack fails and allows making use of the bandwidth of multiple racks when reading a file.
On a multi-rack cluster, block replication is maintained with a policy that no more than one replica is placed on any one node and no more than two replicas are placed in the same rack, with the constraint that the number of racks used for block replication should always be less than the total number of block replicas.
For example,
When a new block is created, the first replica is placed on the local node, the second one is placed at a different rack, the third one is on a different node at the local rack.
When re-replicating a block, if the number of existing replicas is one, place the second one on a different rack.
When the number of existing replicas is two and the two replicas are on the same rack, place the third one on a different rack.
For reading, the name node first checks whether the client's computer is located in the cluster. If so, block locations are returned to the client from the closest data nodes.
This minimizes the write cost and maximizes read speed.
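A small sketch of the write-placement rules as described in this answer (it mirrors the wording above, not the actual HDFS BlockPlacementPolicy source; the nodes, racks and writer are made-up inputs):

import random

def place_replicas(nodes_by_rack, writer_node, writer_rack, replication=3):
    """nodes_by_rack: illustrative dict of rack -> list of node names."""
    replicas = [writer_node]                                   # 1st: local node
    remote = [n for rack, nodes in nodes_by_rack.items()
              if rack != writer_rack for n in nodes]
    if replication > 1 and remote:
        replicas.append(random.choice(remote))                 # 2nd: different rack
    local_rack = [n for n in nodes_by_rack[writer_rack] if n not in replicas]
    if replication > 2 and local_rack:
        replicas.append(random.choice(local_rack))             # 3rd: different node,
    return replicas                                            #      local rack

print(place_replicas({"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}, "n1", "rack1"))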

How to deal with split brain in a cluster with two nodes?

I am learning some basic concepts of cluster computing and I have some questions to ask.
According to this article:
If a cluster splits into two (or more) groups of nodes that can no longer communicate with each other (a.k.a. partitions), quorum is used to prevent resources from starting on more nodes than desired, which would risk data corruption.
A cluster has quorum when more than half of all known nodes are online in the same partition, or for the mathematically inclined, whenever the following equation is true:
total_nodes < 2 * active_nodes
For example, if a 5-node cluster split into 3- and 2-node partitions, the 3-node partition would have quorum and could continue serving resources. If a 6-node cluster split into two 3-node partitions, neither partition would have quorum; pacemaker's default behavior in such cases is to stop all resources, in order to prevent data corruption.
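Checking the quoted condition against those examples:

def has_quorum(total_nodes, active_nodes):
    return total_nodes < 2 * active_nodes

print(has_quorum(5, 3))   # True:  the 3-node partition of a 5-node cluster has quorum
print(has_quorum(5, 2))   # False: the 2-node partition does not
print(has_quorum(6, 3))   # False: neither half of a 6-node split has quorum
print(has_quorum(2, 1))   # False: a lone node of a two-node cluster never has quorum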
Two-node clusters are a special case.
By the above definition, a two-node cluster would only have quorum when both nodes are running. This would make the creation of a two-node cluster pointless.
Questions:
From the above I am somewhat confused: why can we not just stop all cluster resources, as in the 6-node cluster case? What is special about the two-node cluster?
You are correct that a two node cluster can only have quorum when they are in communication. Thus if the cluster was to split, using the default behavior, the resources would stop.
The solution is to not use the default behavior. Simply set Pacemaker to no-quorum-policy=ignore. This will instruct Pacemaker to continue to run resources even when quorum is lost.
...But wait, what happens if the cluster communication is broken but both nodes are still operational? Will they not consider their peers dead and both become active nodes? Then I have two primaries, potentially diverging data, or conflicts on my network, right? This issue is addressed via STONITH. Properly configured STONITH will ensure that only one node is ever active at a given time, essentially preventing split-brain from ever occurring.
An excellent article further explaining STONITH and its importance was written by LMB back in 2010: http://advogato.org/person/lmb/diary/105.html

Comparing huge trees over slow network

I have a tree-like data structure:
I have a list of work orders
each work order has several operations
each operation has several roles
each role has several resource nodes
Each work order, operation, role and resource node has a number of attributes.
I have two instances of such a data structure: master and slave. I wish to periodically update the slave and keep it in sync with the master. My question is: how do I do this really fast?
The problems are:
those two instances are huge
those two instances are on separate networks, connected by a low-throughput link
speed is a critical parameter
[edit] 4. I do not have access to the transaction log on the master, just the state of the master at this point in time (I only have read access on SQL views and that's it). [/edit]
What I was thinking was creating a Merkle tree on both sides by hashing together the node ID, the node attributes and the child nodes' hashes (bottom up, obviously).
And then comparing the trees by:
transmitting the list of top-level hashes over the network
determining which nodes are not equal
recursively repeating the process for mismatching nodes
Thus I get a list of nodes which are not in sync and then I update them.
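A minimal sketch of the hashing and comparison steps, assuming an illustrative Node structure (id, attributes, children) rather than the actual work-order schema:

import hashlib
from dataclasses import dataclass, field

@dataclass
class Node:
    id: str
    attributes: dict
    children: list = field(default_factory=list)

def merkle(node, digests):
    """Hash node id + attributes + child hashes, bottom up.
       Fills digests with {node_id: hex_digest} and returns this node's digest."""
    child_hashes = sorted(merkle(child, digests) for child in node.children)
    h = hashlib.sha256()
    h.update(node.id.encode())
    for key in sorted(node.attributes):
        h.update(f"{key}={node.attributes[key]}".encode())
    for ch in child_hashes:
        h.update(ch.encode())
    digests[node.id] = h.hexdigest()
    return digests[node.id]

def mismatched(master, slave, m_digests, s_digests):
    """Recurse only into subtrees whose hashes differ; this local walk stands in
       for the hash exchange that would happen over the slow network."""
    if m_digests[master.id] == s_digests.get(slave.id):
        return []
    out = [master.id]                  # differs in its own attributes or below
    slave_children = {child.id: child for child in slave.children}
    for mc in master.children:
        if mc.id in slave_children:
            out += mismatched(mc, slave_children[mc.id], m_digests, s_digests)
        else:
            out.append(mc.id)          # subtree missing on the slave entirely
    return out

# m_digests, s_digests = {}, {}
# merkle(master_root, m_digests); merkle(slave_root, s_digests)
# print(mismatched(master_root, slave_root, m_digests, s_digests))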
The problem I see here is that I have to recreate both Merkle trees every time I compare instances, which costs time.
So, I was wondering if there is any other algorithm which I can try out?
