Scaling a relational DB: Combining sharding and replication - sharding

I have been reading Lewis Lin's "The System Design Interview". In the example cases, when designing Twitter, he first talks about sharding the database. Then, to improve availability, he suggests replicating each shard with a main (write) replica and a few secondary (read) replicas. Writes flow to the write replica of each shard, and reads flow directly to the read replicas.
It makes sense to me so far.
He then suggests that we break up our system design into a write-optimized workflow and a read-optimized workflow, i.e. have some write application servers and some read application servers. Each workflow can be scaled independently.
My confusion comes here: at this point, he starts to draw the database diagram differently. Instead of drawing several shards, each with one write DB and two read DBs, he draws one "write DB shard cluster" and one "read DB shard cluster". My interpretation is that at this step nothing changed architecturally; only the way he draws it changed. Is this correct, or was there an architectural change in the last step? He doesn't explain it much. (A sketch of my current understanding is below.)
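Here is a minimal sketch of how I currently picture the routing (the hostnames and hashing scheme are made up, just to check my understanding): the shard is picked from the key, then writes go to that shard's main replica and reads to one of its read replicas.

    import hashlib
    import random

    # Hypothetical topology: 2 shards, each with one write primary and two read replicas.
    SHARDS = [
        {"primary": "shard0-write", "replicas": ["shard0-read1", "shard0-read2"]},
        {"primary": "shard1-write", "replicas": ["shard1-read1", "shard1-read2"]},
    ]

    def pick_shard(user_id: int) -> dict:
        # Hash the sharding key to choose a shard.
        digest = hashlib.md5(str(user_id).encode()).hexdigest()
        return SHARDS[int(digest, 16) % len(SHARDS)]

    def host_for(user_id: int, write: bool) -> str:
        shard = pick_shard(user_id)
        # Writes always hit the shard's primary; reads can hit any read replica.
        return shard["primary"] if write else random.choice(shard["replicas"])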

Related

How Yandex implemented 2-layered sharding

In the ClickHouse documentation, there is a mention of Yandex.Metrica implementing bi-level sharding:
"Alternatively, as we've done in Yandex.Metrica, you can set up bi-level sharding: divide the entire cluster into "layers", where a layer may consist of multiple shards. Data for a single client is located on a single layer, but shards can be added to a layer as necessary, and data is randomly distributed within them."
Is there a detailed implementation of this sharding scheme documented somewhere?
Logically, Yandex.Metrica has only one high-cardinality ID column that serves as the main sharding key.
By default, a SELECT from a table with the Distributed engine requests partial results from one replica of each shard.
If you have hundreds of servers or more, querying all shards (probably 1/2 or 1/3 of all servers) means a lot of network communication, which might introduce more latency than the actual query execution.
The reason for this behavior is that ClickHouse allows data to be written directly to shards (bypassing the Distributed engine and its configured sharding key), and the application that does so is not forced to comply with the sharding key of the Distributed table (it can choose differently to spread data more evenly, or for whatever other reason).
So the idea of bi-level sharding is to split a large cluster into smaller sub-clusters (10-20 servers each) and make most SELECT queries go through Distributed tables that are configured against the sub-clusters, thus requiring less network communication and lowering the impact of possible stragglers.
Global Distributed tables for the whole large cluster are also configured for some ad-hoc or overview-style queries, but those are not as frequent and have lower latency requirements.
This still leaves the application that writes data free to balance it arbitrarily between the shards forming a sub-cluster (by writing directly to them).
But to make this all work together, the applications that write and read data need a consistent mapping from whatever high-cardinality ID is used (CounterID in the case of Metrica) to the sub-cluster ID and the hostnames it consists of. Metrica stores this mapping in MySQL, but in other cases something else might be more applicable.
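As a rough sketch of such a mapping (the layer names, hosts and the in-memory dict are illustrative assumptions; Metrica keeps the real mapping in MySQL):

    # Hypothetical mapping from a high-cardinality ID (CounterID) to the
    # sub-cluster ("layer") that holds its data and the hosts in that layer.
    LAYER_HOSTS = {
        "layer_01": ["ch-01a", "ch-01b", "ch-01c"],
        "layer_02": ["ch-02a", "ch-02b", "ch-02c"],
    }
    COUNTER_TO_LAYER = {12345: "layer_01", 67890: "layer_02"}

    def hosts_for_counter(counter_id: int) -> list:
        # Writers and readers must consult the same mapping, so that queries
        # for one client only touch the sub-cluster that holds its data.
        return LAYER_HOSTS[COUNTER_TO_LAYER[counter_id]]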
An alternative approach is to use the "optimize_skip_unused_shards" setting, which makes SELECT queries that have a condition on the sharding key of the Distributed table skip the shards that are not supposed to have the data. It introduces the requirement that data be distributed between shards exactly as if it had been written through this Distributed table, or the report will not include misplaced data.
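For illustration, a query using that setting might look like this with the clickhouse-driver Python client (the hits_dist Distributed table, its CounterID sharding key and the hostname are assumptions):

    from clickhouse_driver import Client  # assumed client library

    client = Client("distributed-node")  # hypothetical host running the Distributed table
    rows = client.execute(
        "SELECT count() FROM hits_dist WHERE CounterID = 12345",
        settings={"optimize_skip_unused_shards": 1},
    )
    # With the setting enabled and a condition on the sharding key, only the shard
    # that would have received CounterID = 12345 via the Distributed table is queried.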

DataStax Cassandra - Spanning cluster nodes across Amazon regions

I am planning to launch three EC2 instances across Amazon regions, say Region-A, Region-B and Region-C.
Based on the above plan, each region acts as a datacenter in the cluster and has one node (correct me if I am wrong).
Using this infrastructure, can I attain the configuration below?
Replication Factor: 2
Write and Read Consistency Level: QUORUM
My basic intention is to achieve this: if two regions go down, I can survive with the remaining one region.
Please help me with your inputs.
Note: I am very new to Cassandra, so whatever input you give will be useful to me.
Thanks
If you have a replication factor of 2 and use a consistency level of QUORUM, you will not tolerate failure: a quorum over 2 replicas is 2/2 + 1 = 2 acks, so if a node goes down and you only get 1 ack, that's not a majority of responses.
If you deploy across multiple regions, each region is, as you mention, a DC in your cluster. Each individual DC is a complete replica of all your data, i.e. it will hold all the data for your keyspace. If you read/write at a LOCAL_* consistency level (e.g. LOCAL_ONE, LOCAL_QUORUM) within each region, then you can tolerate the loss of the other regions.
The number of replicas in each DC/region and the consistency level you are using to read/write in that DC will determine how much failure you can tolerate. If you are using QUORUM, that is a cross-DC consistency level: it requires a majority of acks from ALL replicas in your cluster across all DCs. If you lose 2 regions then it's unlikely that you will be getting a quorum of responses.
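A quick back-of-the-envelope check (assuming RF=2 in each of the three regions, since the question doesn't say whether the 2 is per DC or total):

    # QUORUM is computed over the sum of replication factors across all DCs.
    rf_per_dc = 2                        # assumption: RF=2 in each region
    dcs = 3
    total_replicas = rf_per_dc * dcs     # 6 replicas across the whole cluster
    quorum = total_replicas // 2 + 1     # 4 acks required
    left_after_losing_two_regions = rf_per_dc * 1   # 2 replicas remain
    print(quorum, left_after_losing_two_regions)    # 4 needed vs 2 available -> QUORUM fails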
Also, it's worth remembering that Cassandra can be made aware of the AZs it is deployed in within the region and can do its best to ensure replicas of your data are placed in multiple AZs. This will give you even better tolerance to failure.
If this were me and I didn't need a strong cross-DC consistency level (like QUORUM), I would have 4 nodes in each region, deployed across the AZs, with a replication factor of 3 in each region. I would then read/write at LOCAL_QUORUM or, preferably, LOCAL_ONE. If you go with LOCAL_ONE then you could have fewer replicas in each DC, e.g. a replication factor of 2 with LOCAL_ONE means you could tolerate the loss of 1 replica.
However, this would be more expensive than what you're initially suggesting, but (for me) that would be the minimum setup I would need if I wanted to be in multiple regions and tolerate the loss of 2. You could go with 3 nodes in each region if you really wanted to save costs.
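As a sketch of that kind of setup with the Python cassandra-driver (the keyspace, table, DC names and contact point are made-up examples): the keyspace uses NetworkTopologyStrategy with a per-region replication factor, and requests use a LOCAL_* consistency level so they stay within the local DC.

    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement
    from cassandra import ConsistencyLevel

    cluster = Cluster(["10.0.0.1"])   # contact point in the local region (placeholder)
    session = cluster.connect()

    # Replication factor of 3 in each DC/region; DC names must match your snitch config.
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS app
        WITH replication = {'class': 'NetworkTopologyStrategy',
                            'region_a': 3, 'region_b': 3, 'region_c': 3}
    """)

    # LOCAL_QUORUM only needs a majority of the replicas in the local DC,
    # so the loss of the other regions is tolerated.
    stmt = SimpleStatement(
        "SELECT * FROM app.users WHERE id = %s",
        consistency_level=ConsistencyLevel.LOCAL_QUORUM,
    )
    rows = session.execute(stmt, (42,))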

Redis failover and Partitioning?

I am using client-side partitioning on a 4-node Redis setup. The writes and reads are distributed among the nodes. Redis is used as a persistence layer for volatile data as well as a cache by different parts of the application. We also have a Cassandra deployment for persisting non-volatile data.
On Redis we peak at nearly 1k ops/sec (instantaneous_ops_per_sec). The load is expected to increase with time. There are many operations where we query for a non-existent key to check whether data is present for that key.
I want to achieve the following things:
1. Writes should fail over to something when a Redis node goes down.
2. There should be a backup for reading the data lost when the Redis node went down.
3. If we add more Redis nodes in the future (or a dead node comes back up), reads and writes should be redistributed consistently.
I am trying to figure out a suitable design to handle the above scenario. I have thought of the following options:
Create hot slaves for the existing nodes and swap them as and when a master goes down. This will not address the third point.
Write an application layer to persist data in both Redis and Cassandra, allowing a lazy-load path for reads when a Redis node goes down. This approach will have the overhead of writing to two stores.
Which is a better approach? Is there a suitable alternative to the above approaches?
A load of 1k ops/s is far below the capabilities of Redis. You would need to increase by two or more orders of magnitude before you came close to overloading it. If you aren't expecting to exceed 50-70,000 ops/second and are not exceeding your available single-node memory, I really wouldn't bother with sharding your data, as it is more effort than it is worth.
That said, I wouldn't do this sharding client-side. I'd look at something like Twemproxy/Nutcracker to do it for you. This provides a path to Redis Cluster as well as the ability to scale out connections, and provides transparent client-side support for failover scenarios.
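To illustrate the kind of key-to-node mapping a proxy like Twemproxy handles for you (a simplified consistent-hash ring in the spirit of ketama; the node names and number of virtual points are arbitrary), adding or removing a node only remaps the keys that landed on it rather than reshuffling everything:

    import bisect
    import hashlib

    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    class HashRing:
        """Simplified consistent-hash ring: adding or removing a node only
        remaps the keys that fell on that node, not the whole keyspace."""
        def __init__(self, nodes, vnodes=100):
            self.ring = sorted((_hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
            self.keys = [h for h, _ in self.ring]

        def node_for(self, key: str) -> str:
            idx = bisect.bisect(self.keys, _hash(key)) % len(self.ring)
            return self.ring[idx][1]

    ring = HashRing(["redis1", "redis2", "redis3", "redis4"])
    print(ring.node_for("user:1001"))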
To handle failover in the client you would want to set up two instances per slot (in your description, a write node) with one slaved to the other. Then you would run a Sentinel constellation to manage the failover.
Then you would need to have your client code connect to Sentinel to get the current master connectivity for each slot. This also means client code which can reconnect to the newly promoted master when a failover occurs. If you have load balancers available you can place your Redis nodes behind one or more (preferably two with failover) and eliminate the client reconnection requirements, but you would then need to implement a Sentinel script or monitor to update the load balancer configuration on failover.
For the Sentinel constellation a standard 3-node setup will work fine. If you do your load balancing with software on nodes you control, it would be best to have at least two Sentinel nodes on the load balancers to provide natural connectivity tests.
Given your description I would test out running a single master with multiple read slaves, and instead of hashing in client code, distribute reads to the slaves and writes to the master. This will provide a much simpler setup and likely less complex code on the client side. Scaling read slaves is easier and simpler, and as you describe it the vast majority of ops will be read requests, so it fits your described usage pattern precisely.
You would still need to use Sentinel to manage failover, but that complexity will still exist, resulting in a net decrease in code and code complexity. For a single master, Sentinel is almost trivial to set up; the caveats being code to either manage a load balancer or virtual IP, or to handle Sentinel discovery in the client code.
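For reference, Sentinel discovery in client code is roughly this with redis-py (the Sentinel hosts and the master name "mymaster" are placeholders):

    from redis.sentinel import Sentinel

    # Three-node Sentinel constellation; hosts are placeholders.
    sentinel = Sentinel(
        [("sentinel1", 26379), ("sentinel2", 26379), ("sentinel3", 26379)],
        socket_timeout=0.5,
    )

    # Sentinel tells the client which node is currently the master, so the
    # client automatically follows a failover to the newly promoted master.
    master = sentinel.master_for("mymaster", socket_timeout=0.5)
    replica = sentinel.slave_for("mymaster", socket_timeout=0.5)

    master.set("user:1001", "some-value")    # writes go to the current master
    print(replica.get("user:1001"))          # reads can go to a replica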
You are opening the distributed database Pandora's box here.
My best suggestion is: don't do it; don't implement your own Redis cluster unless you can afford losing data and/or taking some downtime.
If you can afford running on not-yet-production-ready software, my suggestion is to have a look at the official Redis Cluster implementation; if your requirements are low enough that you can roll your own cluster implementation, chances are you can afford to use Redis Cluster directly, which has a community behind it.
Have you considered looking at software other than Redis? Cassandra, Riak, DynamoDB and Hadoop are great examples of mature distributed databases that would do what you ask out of the box.

MongoDB capacity planning

I have an Oracle database with around 7 million records/day (~300 GB) and I want to switch to MongoDB.
To set up a POC, I'd like to know how many nodes I need. I think 2 shards, each a replica set of 3 nodes, will be enough, but I want to know your thoughts about it :)
I'd like to have an HA setup :)
Thanks in advance!
For MongoDB to work efficiently, you need to know your working set size. You need to know how much data 7 million records/day amounts to, because this is the active data that will need to stay in RAM for high performance.
Also, be very sure WHY you are migrating to Mongo. I'm guessing that in your case it is scalability,
but know your data well before doing so.
For your POC, keeping two shards means roughly 150 GB on each. If you have that much disk available, no problem.
Give some consideration to your shard keys: what fields does it make sense to shard your data set on? This will impact the decision of how many shards to deploy versus the capacity of each shard. You might go with relatively few shards, maybe two or three big, deep shards if your data can easily be segmented into halves or thirds, or several lighter, thinner shards if you can shard on a more diverse key.
It is relatively straightforward to upgrade from a MongoDB replica set configuration to a sharded cluster (each shard is actually a replica set). Rather than predetermining that sharding is the right solution to start with, I would think about what your reasons for sharding are (eg. will your application requirements outgrow the resources of a single machine; how much of your data set will be active working set for queries, etc).
It would be worth starting with replica sets and benchmarking this as part of planning your architecture and POC.
Some notes to get you started:
MongoDB's journaling, which is enabled by default as of 1.9.2, provides crash recovery and durability in the storage engine.
Replica sets are the building block for high availability, automatic failover, and data redundancy. Each replica set needs a minimum of three nodes (for example, three data nodes or two data nodes and an arbiter) to enable failover to a new primary via an election.
Sharding is useful for horizontal scaling once your data or writes exceed the resources of a single server.
Other considerations include planning your documents based on your application usage. For example, if your documents will be updated frequently and grow in size over time, you may want to consider manual padding to prevent excessive document moves.
If this is your first MongoDB project you should definitely read the FAQs on Replica Sets and Sharding with MongoDB, as well as for Application Developers.
Note that choosing a good shard key for your use case is an important consideration. A poor choice of shard key can lead to "hot spots" for data writes, or unbalanced shards if you plan to delete large amounts of data.
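If the POC does go the sharded-cluster route, enabling sharding on a collection looks roughly like this with PyMongo (the database and collection names and the hashed customer_id shard key are made-up examples; the connection must go through a mongos router):

    from pymongo import MongoClient

    # Connect to a mongos router of the sharded cluster (host is a placeholder).
    client = MongoClient("mongodb://mongos-host:27017")

    # Enable sharding for the database, then shard the collection on a
    # hashed key to spread writes and avoid hot spots on a monotonic key.
    client.admin.command("enableSharding", "poc")
    client.admin.command(
        "shardCollection", "poc.events", key={"customer_id": "hashed"}
    )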

HBase read high load

I'm in the research process for a NoSQL solution for our company's needs.
For now the search has narrowed to HBase. I've read a lot about architecture, performance, etc., but one thing is still unclear to me.
For example, if you have a 100-node cluster and one row gets 100,000 simultaneous requests, will all 100,000 requests hit only the one node where the row is stored? As I understand it, HBase replication is only for data backup (not for read load balancing), and there is no master/slave mechanism (like in MySQL)?
Regarding 100,000 concurrent requests for a single row: I think nothing handles that well currently. Under normal conditions it is simply not needed; clients are isolated from the DB anyway, so access is limited in this case (and probably cached).
Regarding storage and replication: first, there are at least two types of replication, and it is actually not HBase that handles it. HBase relies on HDFS, which is fault tolerant by nature. Read about the HBase master and HBase region server roles if you need to understand the details, but in general everything related to replication goes down to HDFS.
I guess 100,000 concurrent requests will not work very well on HBase; however, real-world scenarios seem to work quite well:
yfrog gets 10K requests per second, eBay chose it for the new version of their product search engine, and Facebook chose it for their messaging system.
You can also take a look at the hstack benchmarks on a more modest cluster.
HBase replication is not only for data backup but also for availability. As that does not seem to be the only point your question covers, I pointed you to the link where you can find more information. If you have specific questions about your schema design, you should start at the home page of the Apache-hosted project first of all. For the last question, about master/slave, the URL I sent still applies (and you can ask the HBase developers about it if you are unsure anyway): http://hbase.apache.org/replication.html
