Greenplum - master node is a bottleneck? - greenplum

I read the Greenplum architecture here https://gpdb.docs.pivotal.io/530/admin_guide/intro/arch_overview.html
This looks like one master node vs so many segment nodes?
Question 1: Is master node not a bottleneck as it is just one doing all the work for so many segments?
Question 2: Is it fair to compare the segment's work like work done by mapper (mapper as in MapReduce) and Master Node's work as reducer? If yes - then how does it handle this disproportion of number of instances ?

A1. No, the master is mostly idle. It handles client connections, generates query plans, monitors the segments for availability, and returns results back to the clients.
A2. No. The master is more similar to the NameNode, but it does even less than that: the NameNode keeps track of block locations, whereas the Greenplum master doesn't.
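This division of labor is easy to see from a query plan: Greenplum plans end in a "Gather Motion" node, where the segments stream their results up to the single master slice. A minimal sketch, assuming a master host `mdw`, database `mydb`, and table `sales` (all hypothetical names):

```shell
# Run EXPLAIN through the master; the scan and aggregate work is
# attributed to segment slices, and only the final Gather Motion
# (collecting the results) runs on the master.
psql -h mdw -d mydb -c "EXPLAIN SELECT count(*) FROM sales;"
```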

How to write to specific datanode in hdfs using pyspark

I have a requirement to write related data to the same HDFS datanodes, the way repartitioning on a column in pyspark brings similar data onto the same worker node; even the replicas should be on the same node.
For instance, we have a file, table1.csv
Id, data
1, A
1, B
2, C
2, D
And another file, table2.csv
Id, data
1, X
1, Y
2, Z
2, X1
Then datanode1 should only have (1,A),(1,B),(1,X),(1,Y)
and datanode2 should only have (2,C),(2,D),(2,Z),(2,X1)
And replication within datanodes.
It can be separate files as well based on keys. But each key should map it to a particular node.
I tried writing to HDFS with pyspark, but it just assigned the datanodes randomly when I checked with hdfs fsck.
I read about rack IDs and setting up a rack topology, but is there a way to select which rack data is stored on?
Any help is appreciated, I'm totally stuck.
KR
Alex
I maintain that without actually exposing the underlying problem this is not going to help you, but since you technically asked for a solution, here are a couple of ways to do what you want. They won't actually solve the underlying problem.
If you want to shift the problem to resource starvation:
Spark setting:
spark.locality.wait - technically doesn't solve your problem, but is likely to help you immediately, before you implement anything else I list here. This should be your go-to move before trying anything else, as it's trivial to try.
Pro: just wait until you get a node with the data. Cheap and fast to implement.
Con: Doesn't promise data locality; it just waits for a while in case the right nodes free up. It doesn't guarantee that when you run your job it will be placed on the nodes with the data.
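As a sketch of trying it (the 30s value and the script name are illustrative, not recommendations):

```shell
# Wait up to 30s for an executor on a node holding the data before
# falling back to a less-local placement (the default is 3s).
spark-submit \
  --conf spark.locality.wait=30s \
  my_job.py
```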
**yarn labels**
Use node labels to allocate your worker nodes to specific nodes.
Pro: This should ensure at least one copy of the data lands within a set of worker/data nodes. If subsequent jobs also use this node label, you should get good data locality. Technically it doesn't specify where data is written, but in practice YARN will write to the HDFS datanode it's running on first.
Con: You will create congestion on these nodes, may have to wait for other jobs to finish before you get allocated, or may carve these nodes into a queue that no other jobs can access, reducing the functional capacity of YARN. (HDFS will still work fine.)
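A sketch of the YARN side; the label name `pinned` and the hostnames are made up:

```shell
# Define a node label, then attach it to the chosen NodeManagers.
yarn rmadmin -addToClusterNodeLabels "pinned"
yarn rmadmin -replaceLabelsOnNode "worker1.example.com=pinned worker2.example.com=pinned"
```

Jobs would then request the label, e.g. through a queue with access to it, or (for Spark on YARN) via `spark.yarn.am.nodeLabelExpression` and `spark.yarn.executor.nodeLabelExpression`.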
Use Cluster federation
Ensures data lands inside a certain set of machines.
pro: A folder can be assigned to a set of data nodes
Con: You have to allocate another namenode, and although this satisfies your requirement, it doesn't mean you'll get data locality. A great example of something that fits the requirement but might not solve the problem. It doesn't guarantee that when you run your job it will be placed on the nodes with the data.
My-Data-is-Everywhere
hdfs dfs -setrep -w 10000 /path/to/the/file
Turn up replication for just the folder that contains this data equal to the number of nodes in the cluster.
Pro: All your data nodes have the data you need.
Con: You are wasting space. That's not necessarily bad, but it can't really be done for large amounts of data without eating into your cluster's capacity.
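Whichever approach you pick, you can check where the replicas actually ended up (the path is a placeholder):

```shell
# Print every block of the file and the datanodes holding each replica.
hdfs fsck /path/to/the/file -files -blocks -locations
```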
Whack-a-mole:
Turn off datanodes, until the data is replicated where you want it. Turn all other nodes back on.
Pro: You fulfill your requirement.
Con: It's very disruptive to anyone trying to use the cluster, and it doesn't guarantee that when you run your job it will be placed on the nodes with the data. Again, it kind of points out how silly the requirement is.
Racking-my-brain
Someone smarter than me might be able to develop a rack strategy for your cluster that would ensure data is always written to specific nodes, which you could then "hope" to be allocated to... I haven't fully developed the strategy in my mind, but some math genius could likely work it out.
You could also implement HBase and allocate region servers such that the data lands on the 3 servers, as this would technically fulfill your requirement (it would be on 3 servers and in HDFS).

How do I set failover on my netapp clusters?

I have two clusters of NetApp (main and dr), in each I have two nodes.
If one of the nodes in either cluster goes down, the other node kicks in and acts as a one-node cluster.
Now my question is, what happens when a whole cluster falls down due to problems of power supply?
I've heard about "Metro Cluster" but I want to ask if there is another option to do so.
It depends on what RPO you need. MetroCluster does synchronous replication of every write and thus provides zero RPO (no data loss).
On the other hand, you could use SnapMirror, which basically takes periodic snapshots and replicates them to the other cluster. As you can imagine, you should expect some data loss.
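For the asynchronous option, the relationship is created from the destination cluster. A hedged ONTAP 9 sketch; the SVM and volume names are invented, and the policy and schedule are just examples:

```shell
# Create and start an async SnapMirror relationship (run on the DR cluster).
snapmirror create -source-path svm_main:vol1 -destination-path svm_dr:vol1_dr \
    -type XDP -policy MirrorAllSnapshots -schedule hourly
snapmirror initialize -destination-path svm_dr:vol1_dr
```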

Clustered elasticsearch setup (two master nodes)

We are currently setting up an environment with two elasticsearch instances (clustered servers).
Since it's clustered, we need to make sure that data (indexes) are synched between the two instances.
We do not have the possibility to setup an additional (3rd) server/instance to act as the 'master'.
Therefore we have configured both instances as master and data nodes. So instance 1 is master & node and instance 2 is also master & node.
The synchronization works fine when both instances are up and running. But when one instance is down, the other keeps trying to connect to it, which obviously fails because the instance is down. The node that is still up then stops functioning as well, because it cannot connect to its 'master' node (the node that is down), even though it is itself master-eligible.
The following errors are logged in this case:
org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/2/no master];
org.elasticsearch.transport.ConnectTransportException: [xxxxx-xxxxx-2][xx.xx.xx.xx:9300] connect_exception
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: no further information: xx.xx.xx.xx/xx.xx.xx.xx:9300
In short: two elasticsearch master instances in a clustered setup. When one is down, the other one does not function because it can not connect to the 'master' instance.
Desired result: If one of the master instances is down, the other should continue functioning (without throwing errors).
Any recommendations on how to solve this, without having to setup an additional server that is the 'master' and the other 2 the 'slaves'?
Thanks
To elect a master, a minimum of 2 master-eligible nodes must be available to vote.
That's why you must have a minimum of 3 master-eligible nodes if you want your cluster to survive the loss of one node.
You can just add a small dedicated master node by setting all other roles to false.
This node can have very few resources.
As described in this post:
https://discuss.elastic.co/t/master-node-resource-requirement/84609
Dedicated master nodes need persistent storage, but not a lot of it. 1-2 CPU cores and 2-4GB RAM is often sufficient for smaller deployments. As dedicated master nodes do not store data you can also set the heap to a higher percentage (75%-80%) of total RAM that is recommended for data nodes.
If there is no option to add one more node, then you can set
minimum_master_nodes=1. This will keep your ES cluster up even if only 1 node is up, but it may lead to a split-brain issue, since we allow a single visible node to form the cluster.
In that scenario you have to restart the cluster to resolve the split brain.
I would suggest you upgrade to Elasticsearch 7.0 or above. There you can live with two master-eligible nodes, and the split-brain issue will not occur.
You should not have 2 master-eligible nodes in the cluster, as it's a very risky setup that can lead to a split-brain issue.
Master nodes don't require many resources, but as you have just two data nodes, you can still live without a dedicated master node (though be aware that this has downsides) to save the cost.
So simply remove the master role from one of the nodes and you should be good to go.
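Once the roles are set, you can confirm what each node thinks it is (assuming the default HTTP port on localhost; `*` in the master column marks the elected master):

```shell
# List each node's name, role codes, and which one is the elected master.
curl -s 'localhost:9200/_cat/nodes?v&h=name,node.role,master'
```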

Why RDBMS is said to be not partition tolerant?

Partition Tolerance - The system continues to operate as a whole even if individual servers fail or can't be reached.
Better definition from this link
Even if the connections between nodes are down, the other two promises (A & C) are kept.
Now consider that we have a master-slave model in both an RDBMS (Oracle) and MongoDB. I am not able to understand why an RDBMS is said to be not partition tolerant while Mongo is.
Consider I have 1 master and 2 slaves. If the master goes down in Mongo, a re-election selects one of the slaves as master, so that the system continues to operate.
Doesn't the same happen in RDBMS systems like Oracle/MySQL?
See this article about CAP theorem and MySQL.
Replication in MySQL Cluster is synchronous, meaning a transaction is not committed before replication happens. In this case your data should be consistent; however, the cluster may not be available for some clients in some cases after a partition occurs. It depends on the number of nodes and the arbitration process. So MySQL Cluster can be made partition tolerant.
Partition handling in one cluster:
- If there are not enough live nodes to serve all of the data stored: shut down. Serving a subset of user data (and risking data consistency) is not an option.
- If there are not enough failed or unreachable nodes to serve all of the data stored: continue and provide service. No other subset of nodes can be isolated from us and serving clients.
- If there are enough failed or unreachable nodes to serve all of the data stored: arbitrate. There could be another subset of nodes regrouped into a viable cluster out there.
Replication between 2 clusters is asynchronous.
Edit: MySQL can also be configured as a cluster, in which case it is CP; otherwise it is CA, and partition tolerance can be broken by having 2 masters.

How to explicitly define datanodes to store a particular given file in HDFS?

I want to write a script or something like .xml file which explicitly defines the datanodes in Hadoop cluster to store a particular file blocks.
for example:
Suppose there are 4 slave nodes and 1 master node (5 nodes total in the Hadoop cluster).
There are two files, file01 (size = 120 MB) and file02 (size = 160 MB). The default block size is 64 MB.
Now I want to store one of the two blocks of file01 on slave node 1 and the other on slave node 2.
Similarly, one of the three blocks of file02 on slave node 1, the second on slave node 3 and the third on slave node 4.
So, my question is: how can I do this?
Actually there is one method: make changes in the conf/slaves file every time you store a file,
but I don't want to do this.
So, is there another solution?
I hope I made my point clear.
Waiting for your kind response..!!!
There is no method to achieve what you are asking here - the namenode will replicate blocks to datanodes based upon rack configuration, replication factor and node availability, so even if you do manage to get a block onto two particular datanodes, when one of those nodes goes down the namenode will replicate the block to another node.
Your requirement is also assuming a replication factor of 1, which doesn't give you any data redundancy (which is a bad thing if you lose a data node).
Let the namenode manage block assignments and use the balancer periodically if you want to keep your cluster evenly distributed.
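The balancer is a one-line invocation; the 5% threshold here is just an example:

```shell
# Move blocks around until each datanode's utilization is within
# 5% of the cluster-wide average.
hdfs balancer -threshold 5
```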
NameNode is an ultimate authority to decide on the block placement.
There is Jira about the requirements to make this algorithm pluggable:
https://issues.apache.org/jira/browse/HDFS-385
but unfortunately it is in the 0.21 version, which is not production-ready (although it works reasonably well).
I would suggest plugging your algorithm into 0.21 if you are at the research stage, and then waiting for 0.23 to become production, or backporting the code to 0.20 if you really need it now.
