How does HAWQ split a query plan into slices?

A query plan in HAWQ can be split into several slices which can run independently. How does HAWQ split a query plan into slices?

First, let us clarify the meaning of "slice". A slice is a sub-tree of the whole query plan: a tree of operators that can all run on the same node. Running on the same node means the node executing the slice does not need to communicate with other nodes to exchange data.
So, wherever there is a data exchange requirement, we split the plan by adding a motion node.
As #ztao said, there are three kinds of motion nodes.
Gather. One node gathers data from all the other nodes. Usually used in the top slice, which runs on the query dispatcher side: the dispatcher gathers all the results, performs some final operations, and returns the result to the end user.
Broadcast. Data on one node is broadcast to all the other nodes. Usually used in a join between a small table and a large table: we broadcast the small table's data to every node holding data of the large table, so a hash join can be executed next.
Redistribute. Data spread across multiple nodes under some distribution policy is redistributed according to a new policy. Usually used in a join between two large tables whose distribution keys differ: one table must be redistributed so that the two are collocated.

A motion node (Gather/Broadcast/Redistribute) is added for these different scenarios, splitting the query plan into slices so they can run in parallel. For example, consider a nested-loop join whose outer child is a SeqScan of table A and whose inner child is a SeqScan of table B. The optimizer decides, based on cost, whether to insert a motion node (Broadcast or Redistribute) under the outer child or the inner child.
          NestLoop
         /        \
  SeqScan A    Broadcast Motion
                      |
                 SeqScan B

Note that #slices is equal to #motion nodes + 1.
A motion node is added to the plan wherever data must be redistributed; aggregations, joins, sorts, and so on can all generate motion nodes.
For example, a sort can be partially done on the segments, but the data on different segments is still unordered relative to the other segments. We must send the partially sorted data to exactly one segment in the upper slice to do the merge step.
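To make the slice/motion relationship concrete, here is a toy Python model (illustrative names only, not HAWQ source code) showing that cutting the plan at every motion node yields #motions + 1 slices:

```python
# Toy model of a HAWQ-style plan tree: every motion node starts a new
# slice below it, so #slices == #motion nodes + 1.

class PlanNode:
    def __init__(self, name, is_motion=False, children=()):
        self.name = name
        self.is_motion = is_motion
        self.children = list(children)

def count_slices(root):
    """Count slices by counting motion nodes in the plan tree."""
    motions = 0
    stack = [root]
    while stack:
        node = stack.pop()
        if node.is_motion:
            motions += 1
        stack.extend(node.children)
    return motions + 1  # the top slice, plus one slice below each motion

# The nested-loop example from the text: B's scan is broadcast to A's nodes.
plan = PlanNode("NestLoop", children=[
    PlanNode("SeqScan A"),
    PlanNode("Broadcast Motion", is_motion=True,
             children=[PlanNode("SeqScan B")]),
])
print(count_slices(plan))  # 2: the slice under the motion, and the top slice
```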

Related

How to do small queries efficiently in ClickHouse

In our deployment, there are one thousand shards. Insertions are done via a distributed table with sharding key jumpConsistentHash(colX, 1000). When I query for rows with colX=... and turn on send_logs_level='trace', I see the query is sent to all shards and executed on each shard. This is limiting our QPS (queries per second). Checking the ClickHouse documentation, it states:
SELECT queries are sent to all the shards and work regardless of how data is distributed across the shards (they can be distributed completely randomly).
When you add a new shard, you don’t have to transfer the old data to it.
You can write new data with a heavier weight – the data will be distributed slightly unevenly, but queries will work correctly and efficiently.
You should be concerned about the sharding scheme in the following cases:
* Queries are used that require joining data (IN or JOIN) by a specific key. If data is sharded by this key, you can use local IN or JOIN instead of GLOBAL IN or GLOBAL JOIN, which is much more efficient.
* A large number of servers is used (hundreds or more) with a large number of small queries (queries of individual clients - websites, advertisers, or partners).
In order for the small queries to not affect the entire cluster, it makes sense to locate data for a single client on a single shard.
Alternatively, as we’ve done in Yandex.Metrica, you can set up bi-level sharding: divide the entire cluster into “layers”, where a layer may consist of multiple shards.
Data for a single client is located on a single layer, but shards can be added to a layer as necessary, and data is randomly distributed within them.
Distributed tables are created for each layer, and a single shared distributed table is created for global queries.
It seems there is a solution for small queries like ours (the second bullet above), but I am not clear on the details. Does it mean that when running a query with predicate colX=..., I need to find the corresponding "layer" that contains its rows and then query the distributed table for that layer?
Is there a way to query on the global distributed table for these small queries?
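One possible workaround (my own suggestion, not an official ClickHouse feature): since the sharding expression is jumpConsistentHash(colX, 1000), the client can compute the target shard itself and send the small query to that shard's local table only, avoiding the fan-out to all 1000 shards. The sketch below is the published jump consistent hash algorithm (Lamping & Veach), which ClickHouse's jumpConsistentHash is documented to implement for integer keys:

```python
# Jump consistent hash (Lamping & Veach, "A Fast, Minimal Memory,
# Consistent Hash Algorithm"). Assumes colX is already an integer key.

def jump_consistent_hash(key: int, num_buckets: int) -> int:
    key &= 0xFFFFFFFFFFFFFFFF          # emulate unsigned 64-bit arithmetic
    b, j = -1, 0
    while j < num_buckets:
        b = j
        key = (key * 2862933555777941757 + 1) & 0xFFFFFFFFFFFFFFFF
        j = int((b + 1) * ((1 << 31) / ((key >> 33) + 1)))
    return b

# Hypothetical client-side routing: resolve which of the 1000 shards
# holds the rows for a given colX value, then query only that shard.
shard = jump_consistent_hash(123456, 1000)
```

Whether this matches your cluster exactly depends on ClickHouse's implementation agreeing with the paper bit-for-bit, so it is worth verifying against a few known rows before relying on it.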

Create sets of minimal cardinality from set of pairs

I have a set of pairs of IDs like
(123;1765)
(1212;8977)...
I need to separate those pairs into n groups, each with an individual size (number of pairs). Those groups should have minimal cardinality (= there should be as few different IDs as possible in each group).
Are there any existing algorithms which solve this problem? I'm not sure where/how to search for it.
This is necessary, because I currently work on the load balancing of one of my projects and each node should have to load as few IDs as possible because of limited RAM (each ID is connected to a larger dataset).
Edit:
Some background:
Different nodes in a cluster have to compare datasets identified by IDs. Each comparison is a pair of IDs (compare dataset of ID1 with ID2). Each node gets a bunch of pairs to know which IDs it has to compare and loads the corresponding datasets into RAM. A master node divides a big bunch of pairs into smaller bunches and distributes them to the slave nodes. Because each node can only store a limited amount of datasets, those smaller bunches need to contain as few different IDs as possible. But the nodes have different amounts of RAM, so the groups with minimal cardinality should have different sizes.
The comparison is symmetric, so compare(ID1, ID2) is the same as compare(ID2, ID1), and each pair is unique. Which datasets need to be compared is defined by a client, which sends those jobs to the master as a bunch of pairs of IDs.
An example:
A client wants the comparisons (1;2), (7;9), (9;105), (7;105), (2;4), (4;1) (usually there would be many more comparisons, typically millions).
The client sends those pairs to the master, which has two registered slaves. Now the master needs to divide that stack of work into two groups, but the more different IDs each group contains, the more datasets the slaves need to load (each ID corresponds to a specific dataset, remember?).
So ideally the master would create groups like ((1;2), (2;4), (4;1)) (only 3 different IDs, so the slave only has to load 3 datasets) and ((7;9), (9;105), (7;105)) (again just three IDs), instead of, say,
((1;2), (9;105), ...) and ((2;4), (7;105), ...), where both slaves need to load 4 or more IDs, and both slaves need to load datasets no. 2 and 105.
This needs to be optimized somehow.
My first instinct is to say that perhaps this could be resolved with a special cluster analysis where you customize the aggregation and distance functions.
The cluster members would be pairs.
The cluster aggregate would be the set-theoretical union of all pairs in the cluster (this is instead of an average or median in the standard approach).
The distance function of any pair in comparison to the cluster would be the number of elements in the pair that are not found in the cluster aggregate (so the cardinality of the set difference; this replaces the Euclidean distance in the standard approach).
Some cluster algorithms have you set the number of desired clusters in advance, so you would set it to two.
And finally, because you need to balance things so that the cluster aggregates have the same number of elements, further tweaking is required, but it is still doable.
But, you say you will have millions of pairs to compare. The processing required for cluster analysis grows very quickly with the size of the input. In this situation, it is worth researching whether your problem is NP-hard. I'm not well versed in that, but I suspect it is, in which case a true optimum will always escape you.
If you discover that your problem is in fact NP-hard, you can still optimize; you just won't be able to guarantee arriving at the global optimum in a reasonable amount of time. For instance, you can break your set of pairs into subsets and run an algorithm such as the above on each subset. That may still be an improvement.
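Short of a full cluster analysis, a cheap greedy heuristic (just a sketch, with no optimality guarantee) is to grow each group by always taking the remaining pair that introduces the fewest IDs not already in the group:

```python
# Greedy heuristic: fill each group by repeatedly picking the remaining
# pair that adds the fewest new IDs to the group's ID set.
# O(n^2) in the number of pairs as written; fine for a sketch.

def group_pairs(pairs, group_sizes):
    remaining = list(pairs)
    groups = []
    for size in group_sizes:              # one target size per slave node
        group, ids = [], set()
        while remaining and len(group) < size:
            best = min(remaining, key=lambda p: len(set(p) - ids))
            remaining.remove(best)
            group.append(best)
            ids.update(best)
        groups.append(group)
    return groups

pairs = [(1, 2), (7, 9), (9, 105), (7, 105), (2, 4), (4, 1)]
groups = group_pairs(pairs, [3, 3])
# groups == [[(1, 2), (2, 4), (4, 1)], [(7, 9), (9, 105), (7, 105)]]
# i.e. each slave only needs to load 3 datasets, as in the example above.
```

On the example from the question this recovers the ideal split, but a greedy pass can get stuck on adversarial inputs, so treat it as a baseline to improve on.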

Strange replication in Cassandra

I have locally configured 3 nodes in one 'Test Cluster' of Cassandra. When I run them and create some keyspace or table, the keyspace or table appears on all three nodes.
The problem I'm dealing with is: when I import millions of rows from CSV into a table I have already built, all the data suddenly appears on all three nodes. I have the same data replicated over the three nodes.
As far as I understand, the data I'm importing should be replicated/distributed over the nodes only partially: one partition on the first node, the second on the third, the third on the second node, the fourth again on the first node, and so on.
Am I right, or am I missing something big?
Also, my write speed locally is about 10k rows / second for the multi-node cluster. Isn't that a little bit too low?
I want to create discussion so I can maybe learn something more from your experience and see where I'm messing things.
Thank you!
The number of nodes that data is written to in your cluster is determined by the replication factor for that keyspace. If you have 3 nodes and the data is being written to all the nodes, then this setting must be set to 3. If you only want the data to be replicated to two nodes, you'd set this value to two.
Your write speed will be affected by the consistency level you are specifying on the write. If you have it set to ALL then you have to wait until all the nodes that are going to write the data have written the data (in your case all 3 nodes based on your replication factor). Dropping your consistency level on the write will probably net you faster write times. There is a balance between your replication factor, write consistency level, and read consistency level that you can research further.
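The balance mentioned above can be made concrete with the standard Cassandra consistency arithmetic (plain Python, just illustrating the well-known W + R > RF rule):

```python
# Standard Cassandra tunable-consistency arithmetic: a read is guaranteed
# to see the latest write when (replicas written W) + (replicas read R)
# exceeds the replication factor RF.

def quorum(rf: int) -> int:
    return rf // 2 + 1

def strongly_consistent(w: int, r: int, rf: int) -> bool:
    return w + r > rf

rf = 3
print(quorum(rf))                                       # 2
print(strongly_consistent(quorum(rf), quorum(rf), rf))  # True: QUORUM writes + QUORUM reads
print(strongly_consistent(1, 1, rf))                    # False: ONE/ONE may read stale data
```

So with RF=3 you can drop the write level from ALL to QUORUM for faster writes and still get strong consistency, as long as reads also use QUORUM.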

Comparing huge trees over slow network

I have a tree like data structure:
I have a list of work orders
each work orders has several operations
each operation has several roles
each role has several resources nodes
Each work order, operation, role and resource node has a number of attributes.
I have two instances of such a data structure: master and slave. I wish to periodically update the slave and keep it in sync with the master. My question is: how do I do it really fast?
The problems are:
those two instances are huge
those two instances are on separate networks, connected by low throughput network
speed is critical parameter
[edit] 4. I do not have access to transaction log on master, just state of the master at this point in time (I have only read access on SQL views and that's it). [/edit]
What I was thinking of is creating a Merkle tree on both sides by hashing together the node ID, the node's attributes, and the child nodes' hashes (bottom up, obviously).
And then comparing the trees by:
transmitting the list of top-level hashes over the network
determining which nodes are not equal
recursively repeating the process for mismatching nodes
Thus I get a list of nodes which are not in sync and then I update them.
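A minimal Python sketch of this scheme (a toy model: nodes are plain dicts here, and the hash covers the node ID, its attributes, and its children's hashes):

```python
import hashlib

# Toy Merkle scheme: a node's hash covers its ID, its attributes, and its
# children's hashes (bottom up). For clarity node_hash recomputes hashes
# on every call; a real implementation would cache them.

def node_hash(node):
    h = hashlib.sha256()
    h.update(str(node["id"]).encode())
    for key in sorted(node.get("attrs", {})):
        h.update(f"{key}={node['attrs'][key]}".encode())
    for digest in sorted(node_hash(c) for c in node.get("children", [])):
        h.update(digest.encode())
    return h.hexdigest()

def diff(master, slave, stale):
    """Descend only into subtrees whose hashes disagree."""
    if node_hash(master) == node_hash(slave):
        return
    stale.append(master["id"])
    slave_children = {c["id"]: c for c in slave.get("children", [])}
    for child in master.get("children", []):
        if child["id"] in slave_children:
            diff(child, slave_children[child["id"]], stale)

master = {"id": 1, "attrs": {}, "children": [
    {"id": 2, "attrs": {"qty": 5}},
    {"id": 3, "attrs": {"qty": 8}},
]}
slave = {"id": 1, "attrs": {}, "children": [
    {"id": 2, "attrs": {"qty": 5}},
    {"id": 3, "attrs": {"qty": 9}},   # out of sync
]}
stale = []
diff(master, slave, stale)            # stale == [1, 3]
```

The in-sync subtree under node 2 is never descended into, which is exactly the bandwidth saving the scheme is after.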
The problem I see here is that I have to recreate both Merkle trees every time I compare instances, which costs time.
So, I was wondering if there is any other algorithm which I can try out?

Does a map-side join in Hadoop lose the advantage of data locality?

My question is about map-side joins in Hadoop.
I was reading Pro Hadoop the other day and did not understand the following sentence:
"The map-side join provides a framework for performing operations on multiple sorted
datasets. Although the individual map tasks in a join lose much of the advantage of data locality,
the overall job gains due to the potential for the elimination of the reduce phase and/or the
great reduction in the amount of data required for the reduce."
How can it lose the advantage of data locality if the sorted datasets are stored on HDFS? Won't the JobTracker in Hadoop run the task on a TaskTracker on the same node where the dataset's block is located?
Please correct my understanding.
The statement is correct. You do not lose all data locality, only part of it. Let's see how it works:
We usually distinguish the smaller and the bigger side of the join.
Partitions of the smaller side are distributed to the places where the corresponding bigger partitions are stored. As a result, we lose data locality for one of the joined datasets.
I don't know exactly what David means, but to me this is because there is only a map phase: the mapper has to bring the different tables together itself, so it cannot benefit from HDFS locality for all of its inputs.
This is the process followed in Map-side join:
Suppose we have two datasets R and S, where R is large and S is small enough to fit into main memory.
The smaller dataset S is loaded into main memory, and the records of R are streamed past it to match the joined pairs.
In this case, we achieve data locality for R but not for S.
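The per-mapper behavior can be sketched in plain Python (a toy model, not Hadoop API code): build a hash table from the small side S, then stream the large side R past it.

```python
# Toy model of what each map task does in a map-side (replicated) join:
# build phase loads the small dataset S into a hash table; probe phase
# streams the large dataset R (read locally from its HDFS block) past it.
# No reduce phase is needed.

def map_side_join(r_records, s_records, r_key, s_key):
    table = {}
    for s in s_records:                       # build: S fits in memory
        table.setdefault(s_key(s), []).append(s)
    for r in r_records:                       # probe: stream R
        for s in table.get(r_key(r), []):
            yield (r, s)

R = [("u1", "click"), ("u2", "view"), ("u1", "view")]   # large side (local block)
S = [("u1", "Alice"), ("u2", "Bob")]                    # small side (shipped to mapper)
joined = list(map_side_join(R, S, lambda r: r[0], lambda s: s[0]))
```

This makes the locality trade-off visible: R is read locally, but S had to be shipped to every mapper, which is the part of data locality that is lost.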
