When we talk about NoSQL distributed database systems, we know that all of them can offer only two out of three of the CAP theorem's guarantees. For a distributed cluster, where network failures and node failures are inevitable, partition tolerance is a necessity, leaving us to choose between availability and consistency. So it is basically CP or AP.
My questions are:
Which category does Hadoop fall into?
Let's say I have a cluster with six nodes A, B, C, D, E and F. During a network failure, suppose nodes A, B, C and nodes D, E, F are split into two independent clusters.
Now, in a consistent and partition-tolerant (CP) model, since an update on node A won't replicate to node D, the consistency of the system won't allow users to update or read data until the network is up and running again, effectively making the database down.
Whereas an available and partition-tolerant (AP) system would allow the user of node D to see old data when an update is made at node A, but doesn't guarantee the user of node D the latest data. After some time, when the network is up and running again, it replicates the latest data from node A to node D and so allows the user of node D to view the latest data.
From the above two scenarios we can conclude that in an AP model there is no scope for the database going down; it allows users to write and read even during a failure and promises them the latest data once the network is up again. So why do people go for the consistent and partition-tolerant (CP) model? From my perspective, during a network failure AP has an advantage over CP, allowing users to read and write data while a CP database is down.
Is there any system that can provide all of CAP together, excluding the concept of Cassandra's eventual consistency?
When does a user choose availability over consistency and vice versa? Is there any database out there that allows the user to switch between CP and AP accordingly?
Thanks in advance :)
HDFS has a unique central decision point, the NameNode. As such it can only fall on the CP side, since taking down the NameNode takes down the entire HDFS system (no availability). Hadoop does not try to hide this:
The NameNode is a Single Point of Failure for the HDFS Cluster. HDFS is not currently a High Availability system. When the NameNode goes down, the file system goes offline. There is an optional SecondaryNameNode that can be hosted on a separate machine. It only creates checkpoints of the namespace by merging the edits file into the fsimage file and does not provide any real redundancy.
Since the decision of where to place data and where it can be read from is always handled by the NameNode, which maintains a consistent view in memory, HDFS is always consistent (C). It is also partition tolerant in that it can handle losing data nodes, subject to the replication factor and data topology strategies.
Is there any system that can provide CAP together?
Yes, such systems are often mentioned in Marketing and other non-technical publications.
When does a user choose availability over consistency and vice versa?
This is a business use-case decision. When availability is more important, they choose AP. When consistency is more important, they choose CP. In general, when money changes hands, consistency takes precedence; almost every other case favors availability.
Is there any database out there that allows the user to switch its choice accordingly between CP and AP?
Systems that allow you to modify both the write and the read quorums can be tuned to be either CP or AP, depending on the needs.
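As one illustration of that tuning, here is a minimal sketch using the DataStax Python driver against a hypothetical Cassandra keyspace with replication factor N = 3 (the keyspace, table and address are assumptions): choosing read and write quorums so that R + W > N gives CP-like behavior, while R = W = 1 leans AP.

```python
# Sketch only: tuning per-query consistency levels against a 3-replica keyspace.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("demo")   # hypothetical keyspace "demo"

# CP-leaning: quorum writes plus quorum reads (2 + 2 > 3) always overlap on at
# least one replica, so a read observes the latest acknowledged write.
write_cp = SimpleStatement(
    "INSERT INTO users (id, name) VALUES (1, 'alice')",
    consistency_level=ConsistencyLevel.QUORUM)
read_cp = SimpleStatement(
    "SELECT name FROM users WHERE id = 1",
    consistency_level=ConsistencyLevel.QUORUM)

# AP-leaning: single-replica reads (1 + 1 <= 3) stay available during a
# partition but may return stale data until the replicas converge.
read_ap = SimpleStatement(
    "SELECT name FROM users WHERE id = 1",
    consistency_level=ConsistencyLevel.ONE)

for stmt in (write_cp, read_cp, read_ap):
    session.execute(stmt)
```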
I have a requirement to write common data to the same HDFS data nodes, similar to how we repartition on a column in PySpark to bring similar data onto the same worker node; even the replicas should be on the same node.
For instance, we have a file, table1.csv
Id, data
1, A
1, B
2, C
2, D
And another table2.csv
Id, data
1, X
1, Y
2, Z
2, X1
Then datanode1 should only have (1,A),(1,B),(1,X),(1,Y)
and datanode2 should only have (2,C),(2,D),(2,Z),(2,X1)
And the replicas should also stay within those same datanodes.
It can be separate files as well, based on keys, but each key should map to a particular node.
I tried writing to HDFS with PySpark, but it just randomly assigned the datanodes when I checked with hdfs fsck.
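A minimal sketch of the kind of thing I tried (paths and options here are just placeholders):

```python
# Repartition on Id so rows with the same key end up in the same Spark task,
# then write the result to HDFS. Spark controls which worker processes each key,
# but HDFS still decides on its own which datanodes store the resulting blocks.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("colocate-by-id").getOrCreate()

df1 = spark.read.csv("hdfs:///data/table1.csv", header=True)
df2 = spark.read.csv("hdfs:///data/table2.csv", header=True)

df1.repartition("Id").write.mode("overwrite").csv("hdfs:///out/table1")
df2.repartition("Id").write.mode("overwrite").csv("hdfs:///out/table2")
```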
I read about rack IDs and setting a rack topology, but is there a way to select which rack to store data on?
Any help is appreciated, I'm totally stuck.
KR
Alex
I maintain that without actually explaining the underlying problem this is not going to help you, but since you technically asked for a solution, here are a couple of ways to do what you want; they won't actually solve the underlying problem.
If you want to shift the problem to resource starvation:
Spark setting:
spark.locality.wait - technically doesn't solve your problem but is actually likely to help you immediately, before you implement anything else I list here. This should be your go-to move before trying anything else, as it's trivial to try.
Pro: just wait until you get a node with the data. Cheap and fast to implement.
Con: Doesn't promise to solve data locality, it will just wait for a while in case the right nodes free up. It doesn't guarantee that when you run your job it will be placed on the nodes with the data.
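A minimal sketch of setting it (the 10s value is an arbitrary example, not a recommendation):

```python
# Raise the locality wait so the scheduler holds tasks back a bit longer in the
# hope that an executor on a node holding the data becomes free.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("locality-wait-example")
    .config("spark.locality.wait", "10s")        # default is 3s
    .config("spark.locality.wait.node", "10s")   # node-local wait specifically
    .getOrCreate()
)
```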
YARN node labels
Use node labels to allocate your worker nodes to specific nodes.
Pro: This should ensure at least one copy of the data lands within a set of worker nodes/data nodes. If subsequent jobs also use this node label you should get good data locality. Technically it doesn't specify where data is written, but since the writing tasks run on those nodes, HDFS will place the first replica on the local node.
Con: You will create congestion on these nodes, or may have to wait for other jobs to finish so you can get allocated, or you may carve these into a queue that no other jobs can access, reducing the functional capacity of YARN. (HDFS will still work fine.)
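A hedged sketch of submitting with node labels (the label name "colocated" and the queue name are assumptions for illustration):

```python
# Pin the application master and executors to YARN nodes carrying a given label,
# so the tasks (and their HDFS writes) land on that set of machines.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("node-label-example")
    .config("spark.yarn.queue", "colocated_queue")
    .config("spark.yarn.am.nodeLabelExpression", "colocated")
    .config("spark.yarn.executor.nodeLabelExpression", "colocated")
    .getOrCreate()
)
```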
Use Cluster federation
Ensures data lands inside a certain set of machines.
Pro: A folder can be assigned to a set of data nodes.
Con: You have to allocate another NameNode, and although this satisfies your requirement, it doesn't mean you'll get data locality. It's a great example of something that fits the requirement but might not solve the problem. It doesn't guarantee that when you run your job it will be placed on the nodes with the data.
My-Data-is-Everywhere
hdfs dfs -setrep -w <number-of-datanodes> /path/to/folder
Turn up the replication factor for just the folder that contains this data so it equals the number of data nodes in the cluster.
Pro: All your data nodes have the data you need.
Con: You are wasting space. Not necessarily bad, but it can't really be done for large amounts of data without eating into your cluster's capacity.
Whack-a-mole:
Turn off datanodes until the data is replicated where you want it, then turn all the other nodes back on.
Pro: You fulfill your requirement.
Con: It's very disruptive to anyone trying to use the cluster. It doesn't guarantee that when you run your job it will be placed on the nodes with the data. Again it kinda points out how silly your requirement is.
Racking-my-brain
Someone smarter than me might be able to develop a rack placement strategy in your cluster that would ensure data is always written to specific nodes that you could then "hope" you are allocated to... I haven't fully developed the strategy in my mind, but some math genius could likely work it out.
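For what it's worth, a rack topology script is just an executable that Hadoop invokes (via net.topology.script.file.name) with hostnames or IPs as arguments, expecting one rack path per line back. A minimal sketch, with an entirely made-up host-to-rack mapping:

```python
#!/usr/bin/env python3
# Illustrative rack topology script; the mapping below is invented.
import sys

RACKS = {
    "datanode1.example.com": "/rack-keys-1",
    "datanode2.example.com": "/rack-keys-2",
}

# Print one rack path per requested host, falling back to a default rack.
for host in sys.argv[1:]:
    print(RACKS.get(host, "/default-rack"))
```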
You could also implement HBase and allocate region servers such that the data lands on three specific servers, as this would technically fulfill your requirement (the data would be on those three servers and in HDFS).
I have two NetApp clusters (main and DR), and in each I have two nodes.
If one of the nodes in either cluster goes down, the other node kicks in and acts as a single-node cluster.
Now my question is: what happens when a whole cluster goes down due to a power-supply problem?
I've heard about "MetroCluster", but I want to ask if there is another option to achieve this.
It depends on what RPO you need. MetroCluster does synchronous replication of every write and thus provides a zero RPO (no data loss).
On the other hand, you could use SnapMirror, which basically takes periodic snapshots and stores them on the other cluster. As you can imagine, you should expect some data loss.
How is data consistency handled in a distributed cache using Oracle Coherence, where each cluster node is responsible only for a piece of the data?
I am also confused about the following:
Are the cluster nodes on different servers, each with its own local cache?
For instance, say I have node A with cache "a" and node B with cache "b"; is the database on a separate server D?
When there is an update, is the update first made on D and then written back to caches a and b? How does data consistency work?
An explanation in layman's terms will be helpful, as I am new to Oracle Coherence.
Thank you!
Coherence uses two different distribution mechanisms: full replication and data partitioning; each distributed cache is configured to use one of these. Most caches in most large systems use the partitioned model, because it scales very well, adding storage with each server and maintaining very high performance even up to hundreds of servers.
The Coherence software architecture is service based; when Coherence starts, it first creates a local service for managing clustering, and that service communicates over the network to locate and then join (or create, if it is the first server running) the cluster.
If you have any partitioned caches, then those are managed by partitioned cache service(s). A partitioned cache service coordinates across the cluster to manage the entirety of the partitioned cache. It does this dynamically, starting by dividing the responsibilities of data management evenly across all of the storage-enabled nodes. The data in the cache(s) is partitioned, which means "sliced up", so that some values will go to server 1, some values to server 2, etc. The data ownership model prevents any confusion about who owns what, so even if a message gets delayed on the network and ends up at the wrong server, no damage is done, and the system self-corrects. If a server dies, whatever data (slices) it was managing is backed up by one or more other servers, and the servers work together to ensure that new back-ups are made for any data that does not have the desired number of backups. It is a dynamic system.
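As a conceptual illustration only (this is not the Coherence API), the ownership decision boils down to hashing a key into a fixed number of partitions and spreading those partitions evenly over the storage-enabled members; the member names and partition count below are assumptions:

```python
# Every member runs the same deterministic calculation, so they all agree on
# which member owns a given key.
import zlib

PARTITION_COUNT = 257                          # fixed, cluster-wide partition count (assumed)
members = ["node-A", "node-B", "node-C"]       # hypothetical storage-enabled members

def owning_member(key: str) -> str:
    partition = zlib.crc32(key.encode()) % PARTITION_COUNT
    return members[partition % len(members)]   # partitions spread evenly over members

print(owning_member("customer:42"))            # reads and writes for this key route to one member
```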
There are several different APIs provided to an application, starting with an API as simple as using a hash map (in fact it is the Java Map API).
Data locality as defined by many Hadoop tutorial sites (i.e. https://techvidvan.com/tutorials/data-locality-in-hadoop-mapreduce/) states that: "Data locality in Hadoop is the process of moving the computation close to where the actual data resides instead of moving large data to computation. This minimizes overall network congestion."
I can understand having the node where the data resides process the computation for those data, instead of moving data around, would be efficient. However, what does it mean by "moving the computation close to where the actual data resides"? Does this mean that if the data sits in a server in Germany, it is better to use the server in France to do the computation on those data instead of using the server in Singapore to do the computation since France is closer to Germany than Singapore?
Typically people talk about this on a quite different scale, especially within a Hadoop context.
Suppose you have a cluster of 5 nodes, you store a file there and need to do a calculation on it.
With data locality you try to make the calculation happen on the node(s) where the data is stored (rather than for example the first node that has compute resources available).
This reduces network load.
It is good to realize that in many new infrastructures the network is not the bottleneck, so you will keep hearing more about the decoupling of compute and storage.
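A toy sketch of the preference involved (all node names are invented): given which nodes hold a block's replicas and which nodes currently have free compute slots, the scheduler prefers a node that has both, and only falls back to shipping the data when it has to.

```python
# Data locality as a scheduling preference: run the task where the data already is.
block_replicas = {"file1-block0": ["node2", "node4", "node5"]}
free_nodes = ["node1", "node4"]

def pick_node(block: str) -> str:
    local = [n for n in free_nodes if n in block_replicas[block]]
    return local[0] if local else free_nodes[0]   # fall back: copy data over the network

print(pick_node("file1-block0"))   # "node4": data-local, no network transfer needed
```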
I +1 Dennis Jaheruddin's answer, and just wanted to add -- you can actually see different locality levels in MR when you check job counters, in Job History UI for example.
HDFS and YARN are rack-aware, so it's not just a binary same-node-or-other-node distinction: in the job counters, Data-local means the task ran on the machine that held the actual data; Rack-local means the data wasn't local to the node running the task and needed to be copied, but was still on the same rack; and finally the Other local case, where the data was neither local nor on the same rack, so it had to be copied over two switches to the node that ran the computation.
Partition Tolerance - The system continues to operate as a whole even if individual servers fail or can't be reached.
Better definition from this link
Even if the connections between nodes are down, the other two promises (A & C) are kept.
Now consider that we have a master-slave model in both an RDBMS (Oracle) and MongoDB. I am not able to understand why the RDBMS is said to be not partition tolerant while Mongo is partition tolerant.
Consider I have one master and two slaves. In case the master goes down in Mongo, a re-election is done to select one of the slaves as master so that the system continues to operate.
Does the same not happen in RDBMS systems like Oracle/MySQL?
See this article about CAP theorem and MySQL.
Replication in MySQL Cluster is synchronous, meaning a transaction is not committed before replication happens. In this case your data should be consistent; however, the cluster may not be available for some clients in some cases after a partition occurs. It depends on the number of nodes and the arbitration process. So MySQL Cluster can be made partition tolerant.
Partition handling in one cluster:
If there are not enough live nodes to serve all of the data stored - shutdown
Serving a subset of user data (and risking data consistency) is not an option
If there are not enough failed or unreachable nodes to serve all of the data stored - continue and provide service
No other subset of nodes can be isolated from us and serving clients
If there are enough failed or unreachable nodes to serve all of the data stored - arbitrate.
There could be another subset of nodes regrouped into a viable cluster out there.
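A toy sketch of that decision (the flag names are invented): each surviving group checks whether it can serve all of the data and whether the unreachable group could also form a viable cluster.

```python
# Partition-handling decision as described above, evaluated by each surviving group.
def decide(live_can_serve_all: bool, unreachable_can_serve_all: bool) -> str:
    if not live_can_serve_all:
        return "shutdown"     # serving a subset of the data risks consistency
    if not unreachable_can_serve_all:
        return "continue"     # no other viable cluster can exist out there
    return "arbitrate"        # both sides could be viable, so ask the arbitrator

print(decide(True, True))     # -> "arbitrate"
```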
Replication between two clusters is asynchronous.
Edit: MySQL can also be configured as a cluster; in this case it is CP. Otherwise it is CA, and partition tolerance can be broken by having two masters.