Comparing huge trees over slow network - algorithm

I have a tree-like data structure:
I have a list of work orders
each work order has several operations
each operation has several roles
each role has several resource nodes
Each work order, operation, role and resource node has a number of attributes.
I have two instances of such a data structure: master and slave. I wish to periodically update the slave and keep it in sync with the master. My question is: how do I do it really fast?
The problems are:
1. those two instances are huge
2. those two instances are on separate networks, connected by a low-throughput network
3. speed is a critical parameter
4. [edit] I do not have access to the transaction log on the master, just the state of the master at this point in time (I have only read access on SQL views and that's it). [/edit]
What I was thinking was creating a Merkle tree on both sides by hashing together the node ID, the node attributes and the child nodes' hashes (bottom up, obviously).
And then comparing the trees by:
transmitting the list of top-level hashes over the network
determining nodes which are not equal
recursively repeating the process for mismatching nodes
Thus I get a list of nodes which are not in sync and then I update them.
The problem I see here is that I have to recreate both Merkle trees every time I compare instances, which costs time.
So, I was wondering if there is any other algorithm which I can try out?
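For what it's worth, here is a minimal sketch in Python of the Merkle-tree construction and comparison described above (the Node layout, the dict of attributes and the use of SHA-256 are my assumptions, not something given in the question):

import hashlib

class Node:
    # One work order / operation / role / resource node.
    def __init__(self, node_id, attributes, children=()):
        self.node_id = node_id
        self.attributes = attributes      # assumed: dict of attribute name -> value
        self.children = list(children)
        self.hash = None

def compute_hashes(node):
    # Bottom-up: hash the node ID, its attributes and its children's hashes together.
    h = hashlib.sha256()
    h.update(str(node.node_id).encode())
    for key in sorted(node.attributes):   # sorted for a deterministic digest
        h.update(key.encode())
        h.update(str(node.attributes[key]).encode())
    for child in node.children:
        h.update(compute_hashes(child))
    node.hash = h.digest()
    return node.hash

def collect_mismatches(master, slave, out_of_sync):
    # Recursively descend only into subtrees whose hashes differ.
    if master.hash == slave.hash:
        return
    out_of_sync.append(master.node_id)
    slave_children = {c.node_id: c for c in slave.children}
    for m_child in master.children:
        s_child = slave_children.get(m_child.node_id)
        if s_child is None:
            out_of_sync.append(m_child.node_id)   # subtree missing on the slave
        else:
            collect_mismatches(m_child, s_child, out_of_sync)

In the real setting only the hashes would cross the network, one level at a time; both trees are in one process here purely to show the recursion.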

Related

Rebalancing algorithm with restrictions

Please help in solving the following problem.
The following entities are given:
Application. Applications reside on storage and generate traffic through a service node.
Service. The service is divided into several nodes. Each node has access to local and/or shared storage.
Storage. This is where applications reside. It can be local (connected to only one service node) or shared by several nodes.
Rules:
Each application is placed on some particular storage, and that storage cannot be changed.
The service node for the application can be changed to another one as long as the new service node has access to the application's storage.
For example, if App resides on local storage of Node0, it can only be served by Node0. But if App resides on storage shared0, it can be served by Node0, Node1 or Node2.
The problem is to find an algorithm that rebalances applications between service nodes, given that all applications are already placed on their datastores, and that makes this rebalancing as fair as possible.
If we take for example shared2 storage, the solution seems trivial: we take the apps count for Node3 and Node4 and divide all apps equally between them.
But when it comes to shared1 it becomes more complicated, since Node2 also has access to the shared0 storage. So when rebalancing apps from the group [Node2, Node5] we also have to take into account apps from the group [Node0, Node1, Node2]. The groups [Node2, Node5] and [Node0, Node1, Node2] intersect, so rebalancing should be performed for all groups at once.
I suspect there should be a well-known working algorithm for this problem, but I still cannot find it.
I think the Hungarian Matching algorithm would fit your needs. However, it might be a simple enough problem to try your own approach.
If you separate all the unconnected graphs, you'll have some set of Shared storage units per graph, each set being associated with a collection of Apps. If you spread each of those Apps evenly across each Storage's associated Nodes, you would have some Nodes with more Apps than others. Those Nodes will be connected to multiple Shared storage units.
If all vacant Nodes are filled, there should always be a transitive relationship between any two Nodes within a connected graph such that an App on one can be decreased and an App on the other can be increased, even if some intermediate displacements are needed. So, if you iteratively move an App along the path from the heaviest Node to the lightest Node (shortcutting if you reach a vacant Node, and swapping Apps at intermediate Nodes as needed to continue along that path through one or more Shared storage units), you should be balanced once the counts of the heaviest and lightest Nodes differ by no more than one.
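For illustration, here is a greedy Python sketch of the direct-move part of this idea (the dict-based inputs are my assumptions, and multi-hop displacement chains through intermediate Nodes are not handled):

def rebalance(app_storage, storage_nodes, assignment):
    # app_storage:   app -> storage it resides on (fixed)
    # storage_nodes: storage -> set of service nodes with access to it
    #                (a local storage maps to a single node)
    # assignment:    app -> service node currently serving it (mutated in place)
    while True:
        load = {n: 0 for nodes in storage_nodes.values() for n in nodes}
        for node in assignment.values():
            load[node] += 1
        moved = False
        # try to move one app off the most loaded node onto the least loaded
        # node that can also reach that app's storage
        for node in sorted(load, key=load.get, reverse=True):
            for app, current in assignment.items():
                if current != node:
                    continue
                candidates = storage_nodes[app_storage[app]]
                target = min(candidates, key=lambda n: load[n])
                if load[target] + 1 < load[node]:   # move strictly improves balance
                    assignment[app] = target
                    moved = True
                    break
            if moved:
                break
        if not moved:
            return assignment

With a topology like the one in the question (e.g. storage_nodes = {"shared0": {"Node0", "Node1", "Node2"}, "shared1": {"Node2", "Node5"}, "shared2": {"Node3", "Node4"}}), each move strictly reduces the imbalance, so the loop terminates; the swap chains described above would still be needed to cover the remaining cases.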

Number of nodes AWS Elasticsearch

I read the documentation, but unfortunately I still don't understand one thing. While creating an AWS Elasticsearch domain, I need to choose "Number of nodes" in the "Data nodes" section.
If I specify 3 data nodes and 3 AZs, what does it actually mean?
I have two guesses:
I'll get 3 nodes, each with its own storage (EBS). One node is the master and the other 2 are replicas in different AZs, just copies of the master so data isn't lost if the master node breaks.
I'll get 3 nodes, each with its own storage (EBS). All of them will work independently and their storages will hold different data, so data can be processed by different nodes and stored on different storages at the same time.
It looks like the other AZs should hold replicas, but then I don't understand why I see different amounts of free space on different nodes.
Please explain how it works.
Many thanks for any info or links.
I haven't used AWS Elasticsearch, but I've used the Cloud Elasticsearch service.
When you use 3 AZs (availability zones), it means that your cluster will use 3 zones in order to make it resilient. If one zone has problems, then the nodes in that zone will have problems as well.
As the description section mentions, you need to specify multiples of 3 if you choose 3 AZs. If you have 3 nodes, then every AZ will have one node. If one zone has problems, then that node is out and the two remaining ones will have to pick up from there.
Now, in order to answer your question of what you get with these configurations: you can check it yourself. Use this via Kibana or any HTTP client:
GET _nodes
Check for the sections:
nodes.roles
nodes.attributes
In the various documentation, blog posts, etc. you will see that for production usage, 3 nodes and 3 AZs are a good starting point in order to have a resilient production cluster.
So let's take it step by step:
You need an odd number of master-eligible nodes in order to avoid the split-brain problem.
You need more than one node in your cluster in order to make it resilient (in case a node becomes unavailable).
By combining these two you have the minimum requirement of 3 nodes (no mention of zones yet).
But having one master and two data nodes will not cut it. You need to have 3 master-eligible nodes. So if one node is out, the other two can still form a quorum and elect a new master, and your cluster will remain operational with two nodes. But in order for this to work, you need to set up your primary shards and replica shards in a way that any two of your nodes can hold your entire data.
Examples (for simplicity we have only one index):
1 primary, 2 replicas. Every node holds one shard which is 100% of the data
3 primaries, 1 replica. Every node will hold one primary and one replica (33% primary, 33% replica). Two nodes combined (which is the minimum to form a quorum as well) will hold all your data (and some more)
You can have more combinations but you get the idea.
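For reference, the second example above (3 primaries, 1 replica) is just an index setting; it could be requested when creating the index via Kibana or any HTTP client (the index name my-index is a placeholder):

PUT my-index
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}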
As you can see, the shard configuration needs to go along with your number and type of nodes (master-eligible, data only etc).
Now, if you add the availability zones, you take care of the problem of one zone being problematic. If your whole cluster was in one zone (3 nodes in one zone), then if that zone had problems your whole cluster would be out.
If you set up one master node and two data nodes (which are not master-eligible), having 3 AZs (or even 3 nodes) doesn't do much for resiliency, since if the master goes out, your cluster cannot elect a new one and it will be down until a master node is up again. For the same setup, if a data node goes out, things will work fine as long as your shards are configured with redundancy (meaning that the two remaining nodes hold all the data between them).
Your questions should be covered by the following points.
If I specify 3 data nodes and 3 AZs, what does it actually mean?
This means that your data and its replicas will be spread across 3 AZs, with no replica placed in the same AZ as its primary. Check this link. For example, say you want 2 data nodes in 2 AZs: DN1 will be stored in (let's say) AZ1 and its replica will be stored in AZ2; DN2 will be in AZ2 and its replica will be in AZ1.
It looks like the other AZs should hold replicas, but then I don't understand why I see different amounts of free space on different nodes.
It is because when you give your AWS Elasticsearch cluster some amount of storage, the cluster divides the specified storage space among all data nodes. If you specify 100G of storage on a cluster with 2 data nodes, it divides the storage space equally across the data nodes, i.e. two data nodes with 50G of available storage space each.
Sometimes you will see more nodes than you specified for the cluster. It took me a while to understand this behaviour. The reason is that when you update these configs on AWS ES, it takes some time for the cluster to stabilize. So if you see more data or master nodes than expected, hold on for a while and wait for it to stabilize.
Thanks everyone for the help. To see how much space is available/allocated, run the following queries:
GET /_cat/allocation?v
GET /_cat/indices?v
GET /_cat/shards?v
So, if I create 3 nodes, then I get 3 different nodes with separate storage; they are not replicas. Some data is stored on one node, some data on another.

Create sets of minimal cardinality from set of pairs

I have a set of pairs of IDs like
(123;1765)
(1212;8977)...
I need to separate those pairs into n groups, each with an individual size (number of pairs). Those sets should have minimum cardinality (= there should be as few different IDs as possible in each group).
Are there any existing algorithms which solve this problem? I'm not sure where/how to search for it.
This is necessary because I'm currently working on the load balancing of one of my projects, and each node should load as few IDs as possible because of limited RAM (each ID is connected to a larger dataset).
Edit:
Some background:
Different nodes in a cluster have to compare datasets identified by IDs. Each comparison is a pair of IDs (compare dataset of ID1 with ID2). Each node gets a bunch of pairs to know which IDs it has to compare and loads the corresponding datasets into RAM. A master node divides a big bunch of pairs into smaller bunches and distributes them to the slave nodes. Because each node can only store a limited amount of datasets, those smaller bunches need to contain as few different IDs as possible. But the nodes have different amounts of RAM, so the groups with minimal cardinality should have different sizes.
The comparison is symmetric, so compare(ID1, ID2) is the same as compare(ID2, ID1), so each pair is unique. Which datasets need to be compared is defined by a client which sends those jobs to the master as a bunch of pairs of IDs.
An example:
A client wants the comparisons (1;2), (7;9), (9;105), (7;105), (2;4), (4;1) (usually there are many more comparisons, typically millions).
The client sends those pairs to the master, which has two registered slaves. Now the master needs to divide that stack of work into two groups, but the more different IDs are part of each group, the more datasets need to be loaded by the slaves (an ID corresponds to a specific dataset, remember?).
So ideally the master would create a group like ((1;2), (2;4), (4;1)) (it only contains 3 different IDs, so the slave only has to load 3 datasets) and ((7;9), (9;105), (7;105)) (again just three IDs) instead of:
((1;2), (9;105)...) and ((2;4), (7;105)...). Here both slaves need to load 4 or more IDs, and e.g. both slaves need to load the datasets no. 2 and 105.
This needs to be optimized somehow.
My first instinct is to say that perhaps this could be resolved with a special cluster analysis where you customize the aggregation and distance functions.
The cluster members would be pairs.
The cluster aggregate would be the set-theoretical union of all pairs in the cluster (this is instead of an average or median in the standard approach).
The distance function of any pair in comparison to the cluster would be the number of elements in the pair that are not found in the cluster aggregate (so the cardinality of the set difference; this replaces the Euclidean distance in the standard approach).
Some cluster algorithms have you set the number of desired clusters in advance, so you would set it to two.
And finally, because you need to balance things so that the cluster aggregates have the same number of elements, some further tweaking is needed, but it is still doable.
But, you say you will have millions of pairs to compare. The processing required for cluster analysis grows rapidly with the size of the input. In this situation, it is worth researching whether your problem is NP-hard or NP-complete. I'm not well versed in that, but I suspect it is, in which case a true optimum will always escape you.
But, if you discover that your problem is in fact NP-complete, then you can still optimize, you just won't be able to guarantee arrival at the global optimum in a reasonable amount of time. So, for instance, you can break your set of pairs into subsets and run an algorithm such as above on the subsets. That may still be an improvement.
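To make the idea concrete, here is a small greedy Python sketch that uses the aggregate/distance functions described above (the fill order and function names are my choices; it is a heuristic, not an optimal or scalable algorithm):

def group_pairs(pairs, group_sizes):
    # Fill the groups one at a time.  Each group keeps the union of the IDs it
    # already holds (the "aggregate"); the next pair chosen is the unassigned
    # pair that adds the fewest new IDs to that union (the "distance").
    remaining = list(pairs)
    groups = []
    for size in group_sizes:
        ids = set()
        chosen = []
        while remaining and len(chosen) < size:
            best = min(remaining, key=lambda p: len(set(p) - ids))
            remaining.remove(best)
            chosen.append(best)
            ids.update(best)
        groups.append((chosen, ids))
    return groups

# Example from the question, split for two slaves that take 3 pairs each:
pairs = [(1, 2), (7, 9), (9, 105), (7, 105), (2, 4), (4, 1)]
for chosen, ids in group_pairs(pairs, [3, 3]):
    print(chosen, "-> distinct IDs:", sorted(ids))

On this toy input it reproduces the grouping from the question; for millions of pairs the quadratic inner scan would have to be replaced with something smarter, for example by first breaking the pairs into subsets as suggested above.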

How does HAWQ split query plan into slices?

A query plan in HAWQ can be split into several slices which can be run independently. How does HAWQ split query plan into slices?
First, let us clarify the meaning of "slice". A slice is a sub-tree of the whole query plan: a tree of operators that can all run on the same node, meaning the node running the slice does not need to communicate with other nodes to exchange data.
So, if there are data-exchange requirements, we split the plan by adding a motion node.
As #ztao said, there are three kinds of motion nodes.
Gather. One node needs to gather data from all the nodes. Usually used in the top slice, which runs on the query dispatcher side. The dispatcher gathers all the results, does some operations, and returns the result back to the end user.
Broadcast. Data on one node needs to be broadcast to all the nodes. Usually used in a join between a small table and a large table: we can broadcast the small table's data to all nodes containing data of the large table, so a hash join can be executed next.
Redistribute. Data exists on multiple nodes following some distribution policy and needs to be redistributed according to a new policy. Usually used in a join between two large tables whose distribution keys are not the same; one table needs to be redistributed so that the two are collocated.
A motion node (Gather/Broadcast/Redistribute) is added in these different scenarios, splitting the query plan into slices that can run in parallel. For example, consider a nested loop join whose outer child is a SeqScan of table A and whose inner child is a SeqScan of table B. In the optimizer code, it is decided, based on cost, whether to insert a motion node (Broadcast or Redistribute) on the outer child or the inner child.
        NestLoop
       /        \
SeqScan A    Broadcast motion
                    |
                SeqScan B
Note that #slices is equal to #motion nodes + 1, so the plan above consists of two slices.
A motion node will be added to the plan whenever there is a need to redistribute data. Aggregations, joins, sorts, etc. will all generate motion nodes.
For example, a sort can be partially done on the segments, but the data on different segments is still unordered overall. We must redistribute the data to exactly one segment in the upper slice to do the merge sort.

How to decide order of a B-tree

B-trees are said to be particularly useful for huge amounts of data that cannot fit in main memory.
My question, then, is how do we decide the order of a B-tree, i.e. how many keys to store in a node, or how many children a node should have?
I have come across examples everywhere that use 4 or 5 keys per node. How does that solve the huge-data and disk-read problem?
Typically, you'd choose the order so that the resulting node is as large as possible while still fitting into the block device page size. If you're trying to build a B-tree for an on-disk database, you'd probably pick the order such that each node fits into a single disk page, thereby minimizing the number of disk reads and writes necessary to perform each operation. If you wanted to build an in-memory B-tree, you'd likely pick either the L2 or L3 cache line sizes as your target and try to fit as many keys as possible into a node without exceeding that size. In either case, you'd have to look up the specs to determine what size to use.
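To make the page-size arithmetic concrete, here is a rough back-of-the-envelope Python sketch (the 4 KiB page and the 8-byte key and pointer sizes are just example figures, not universal values):

def btree_order(page_size, key_size, pointer_size):
    # A node of order m stores m child pointers and m - 1 keys; pick the
    # largest m such that m * pointer_size + (m - 1) * key_size <= page_size.
    # Real nodes also carry a small header, so treat this as an upper bound.
    return (page_size + key_size) // (key_size + pointer_size)

print(btree_order(4096, 8, 8))   # -> 256 children per node, i.e. 255 keys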
Of course, you could always just experiment and try to determine this empirically as well. :-)
Hope this helps!
