How Yandex implemented two-layered sharding

In the ClickHouse documentation, there is a mention of Yandex.Metrica implementing bi-level sharding:
"Alternatively, as we've done in Yandex.Metrica, you can set up bi-level sharding: divide the entire cluster into "layers", where a layer may consist of multiple shards. Data for a single client is located on a single layer, but shards can be added to a layer as necessary, and data is randomly distributed within them."
Is there a detailed implementation of this sharding scheme documented somewhere?

Logically, Yandex.Metrica has only one high-cardinality ID column that serves as the main sharding key.
By default, a SELECT from a table with the Distributed engine requests partial results from one replica of each shard.
If you have hundreds of servers or more, querying all shards (probably 1/2 or 1/3 of all servers) means a lot of network communication, which might introduce more latency than the actual query execution.
The reason for this behavior is that ClickHouse allows writing data directly to shards (bypassing the Distributed engine and its configured sharding key), and an application that does so is not forced to comply with the sharding key of the Distributed table (it can choose differently to spread data more evenly, or for whatever other reason).
So the idea of bi-level sharding is to split a large cluster into smaller sub-clusters (10-20 servers each) and make most SELECT queries go through Distributed tables configured against those sub-clusters, thus requiring less network communication and lowering the impact of possible stragglers.
Global Distributed tables for the whole large cluster are also configured for ad-hoc or overview-style queries, but those are not so frequent and have looser latency requirements.
This still leaves the application that writes data free to balance it arbitrarily between the shards forming a sub-cluster (by writing directly to them).
But to make this all work together, applications that write and read data need a consistent mapping from whatever high-cardinality ID is used (CounterID in the case of Metrica) to the sub-cluster ID and the hostnames it consists of. Metrica stores this mapping in MySQL, but in other cases something else might be more applicable.
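For illustration, here is a minimal sketch of how a reader might use such a mapping with the clickhouse-driver Python client. The mapping is hard-coded here (Metrica keeps it in MySQL), and the hosts, table, and layer names are hypothetical:

```python
# Hypothetical sketch: route a per-client query to the sub-cluster ("layer") that
# holds this CounterID. In Metrica the mapping lives in MySQL; it is hard-coded here.
from clickhouse_driver import Client

COUNTER_TO_LAYER = {12345: "layer03", 67890: "layer07"}   # CounterID -> layer id
LAYER_HOSTS = {
    "layer03": "ch-layer03.example.com",   # any node of the layer that has the
    "layer07": "ch-layer07.example.com",   # layer-local Distributed table
}

def query_for_counter(counter_id: int, sql: str):
    layer = COUNTER_TO_LAYER[counter_id]   # consistent mapping shared by writers and readers
    client = Client(LAYER_HOSTS[layer])    # talk only to the sub-cluster, not the whole cluster
    return client.execute(sql, {"counter": counter_id})

# A per-client report now touches only the 10-20 servers of one layer.
rows = query_for_counter(
    12345,
    "SELECT count() FROM hits_layer_distributed WHERE CounterID = %(counter)s",
)
```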
An alternative approach is the "optimize_skip_unused_shards" setting, which makes SELECT queries that have a condition on the sharding key of the Distributed table skip shards that are not supposed to have the data. It introduces the requirement that data be distributed between shards exactly as if it had been written through this Distributed table, otherwise the report will not include any misplaced data.
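For comparison, a hedged sketch of that alternative with the clickhouse-driver Python client, assuming a hypothetical Distributed table hits_distributed sharded by CounterID:

```python
# Hypothetical sketch: query through the global Distributed table, but let
# optimize_skip_unused_shards prune shards that cannot contain CounterID = 12345.
from clickhouse_driver import Client

client = Client("any-node.example.com")  # hypothetical host that has the Distributed table

rows = client.execute(
    "SELECT count() FROM hits_distributed WHERE CounterID = %(counter)s",
    {"counter": 12345},                            # condition on the sharding key
    settings={"optimize_skip_unused_shards": 1},   # only correct if data matches the key
)
```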

Related

How can I shard requests using Aeron Cluster?

I'd like to understand the capabilities of Aeron Cluster with respect to sharding requests across different back-end cluster application instances. I am thinking of something similar to partitions in Kafka, where distinct back-end consumers process the workload in independent processes. There should be a partition key that defines how to find the partition, or it could be a consumer-provided hash, etc.
I read this article but it was not much help: https://aeroncookbook.com/aeron-cluster/on-sharding/
So far I have only been reading the documentation and the API documents.
I also read the Aeron Cookbook site: https://aeroncookbook.com/aeron-cluster/on-sharding/
Could someone provide an example of this, if it is possible? The cookbook does not really help much here because it poses a similar problem but with dependencies between the shards.
Aeron Cluster does not directly support sharding. Its primary goal is redundant copies of the same data across multiple nodes. Sharding would need to be layered on via your own application logic. One approach would be to run multiple clusters and use a key to partition data across the clusters, then within your client application run multiple cluster clients (one for each cluster) and select the appropriate client based on the data you are interacting with.
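A rough, language-agnostic sketch of that routing idea (shown in Python for brevity; a real application would hold one Aeron Cluster client, e.g. Java's AeronCluster, per cluster, and the endpoints here are hypothetical):

```python
# Pick the target cluster from a partition key, much like a Kafka producer picks a partition.
import hashlib

CLUSTER_ENDPOINTS = ["cluster-a:9002", "cluster-b:9002", "cluster-c:9002"]  # one Aeron Cluster each

def cluster_for_key(partition_key: str) -> str:
    # Stable hash of the key -> always the same cluster for the same key.
    digest = int(hashlib.sha1(partition_key.encode()).hexdigest(), 16)
    return CLUSTER_ENDPOINTS[digest % len(CLUSTER_ENDPOINTS)]

# The application keeps one connected cluster client per endpoint and selects it by key.
print(cluster_for_key("account-42"))
```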

If I shard my microservice data by end user/tenant and each server has a subset of the total data, how do I query for data across all servers?

Imagine a highly scalable architecture where each tenant is sharded and distributed by region and availability zone, and each server holds a subset of the total data. There is also redundancy: multiple physical shards host the same logical shard.
This works great: clients can use a map/reduce style to retrieve all data when handling requests that return extreme amounts of data, as long as they know all the logical shards a user is assigned to. This solves the problem of a single user's data exceeding the storage, memory, or compute capacity of any individual server.
My question then is, if the data for a noun microservice is isolated and sharded across multiple servers and every server hosts a different subset of users or tenants, how do I create a view of all the objects in the system? I've effectively denormalised for performance but that means there is extreme read amplification to see the total number of objects in the system.
If I wanted a GUI that shows all the noun objects of each microservice, and there are N physical shards and M noun services, I would need to make N×M requests to fetch all the data and sort it for presentation. That would be incredibly inefficient.
I'm thinking more from an administration GUI perspective. Nobody wants to log into X different microservices or microservice frontends to manage all the data in the system.
Is this a use case for data warehousing or a data lake?
Yes. Replicating data into a central repository (an Operational Data Store, Data Lake, or Data Warehouse) is a common pattern in microservice and multi-tenant application architectures.
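As a minimal sketch of that pattern (all names hypothetical), each noun service emits change events and a loader upserts them into one central store, so the admin GUI issues a single query instead of N×M fan-out requests:

```python
# Hypothetical sketch: SQLite stands in for the central ODS / warehouse.
import sqlite3

central = sqlite3.connect("ods.db")
central.execute(
    "CREATE TABLE IF NOT EXISTS objects ("
    " service TEXT, id TEXT, payload TEXT, PRIMARY KEY (service, id))"
)

def handle_change_event(event: dict) -> None:
    # Called for every create/update event emitted by any shard of any noun service.
    central.execute(
        "INSERT OR REPLACE INTO objects VALUES (?, ?, ?)",
        (event["service"], event["id"], event["payload"]),
    )
    central.commit()

handle_change_event({"service": "orders", "id": "42", "payload": '{"status": "open"}'})

# The admin GUI now queries the central copy once, with normal sorting and paging.
rows = central.execute("SELECT service, id FROM objects ORDER BY service, id").fetchall()
```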

How to do small queries efficiently in ClickHouse

In our deployment there are one thousand shards. Insertions are done via a Distributed table with the sharding expression jumpConsistentHash(colX, 1000). When I query for rows with colX=... and turn on send_logs_level='trace', I see that the query is sent to all shards and executed on each shard. This is limiting our QPS (queries per second). The ClickHouse documentation states:
SELECT queries are sent to all the shards and work regardless of how data is distributed across the shards (they can be distributed completely randomly).
When you add a new shard, you don’t have to transfer the old data to it.
You can write new data with a heavier weight – the data will be distributed slightly unevenly, but queries will work correctly and efficiently.
You should be concerned about the sharding scheme in the following cases:
* Queries are used that require joining data (IN or JOIN) by a specific key. If data is sharded by this key, you can use local IN or JOIN instead of GLOBAL IN or GLOBAL JOIN, which is much more efficient.
* A large number of servers is used (hundreds or more) with a large number of small queries (queries of individual clients - websites, advertisers, or partners).
In order for the small queries to not affect the entire cluster, it makes sense to locate data for a single client on a single shard.
Alternatively, as we’ve done in Yandex.Metrica, you can set up bi-level sharding: divide the entire cluster into “layers”, where a layer may consist of multiple shards.
Data for a single client is located on a single layer, but shards can be added to a layer as necessary, and data is randomly distributed within them.
Distributed tables are created for each layer, and a single shared distributed table is created for global queries.
It seems there is a solution for small queries like ours (the second bullet above), but I am not clear on what it means. Does it mean that when running a query with the predicate colX=..., I need to find the corresponding "layer" that contains its rows and then query the distributed table for that layer?
Is there a way to query the global distributed table for these small queries?
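One hedged possibility, following the optimize_skip_unused_shards approach described in the first answer on this page: since insertions already go through the Distributed table with jumpConsistentHash(colX, 1000), enabling that setting should let a query with an equality condition on colX be sent to a single shard rather than all of them. A sketch with the clickhouse-driver Python client (host and table names hypothetical):

```python
# Hypothetical sketch: query the global Distributed table but skip shards that
# cannot hold colX = 42 according to the table's sharding expression.
from clickhouse_driver import Client

client = Client("any-node.example.com")

rows = client.execute(
    "SELECT * FROM events_distributed WHERE colX = %(x)s",
    {"x": 42},
    settings={"optimize_skip_unused_shards": 1},
)
```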

MongoDB capacity planning

I have an Oracle database with around 7 million records/day (~300 GB) and I want to switch to MongoDB.
To set up a POC, I'd like to know how many nodes I need. I think 2 shards, each a replica set of 3 nodes, will be enough, but I want to know what you think :)
I'd like to have an HA setup :)
Thanks in advance!
For MongoDB to work efficiently, you need to know your working set size. You need to know how much data 7 million records/day amounts to, because this is the active data that will need to stay in RAM for high performance.
Also, be very sure WHY you are migrating to Mongo. I'm guessing that in your case it is scalability, but know your data well before doing so.
For your POC, keeping two shards means roughly 150 GB on each. If you have that much disk available, no problem.
Give some consideration to your shard keys: which fields does it make sense to shard your data set on? This will affect the decision of how many shards to deploy versus the capacity of each shard. You might go with relatively few shards, maybe two or three big, deep shards if your data can easily be segmented into halves or thirds, or several lighter, thinner shards if you can shard on a more diverse key.
It is relatively straightforward to upgrade from a MongoDB replica set configuration to a sharded cluster (each shard is actually a replica set). Rather than predetermining that sharding is the right solution to start with, I would think about what your reasons for sharding are (e.g. will your application requirements outgrow the resources of a single machine; how much of your data set will be the active working set for queries; etc.).
It would be worth starting with replica sets and benchmarking this as part of planning your architecture and POC.
Some notes to get you started:
MongoDB's journaling, which is enabled by default as of 1.9.2, provides crash recovery and durability in the storage engine.
Replica sets are the building block for high availability, automatic failover, and data redundancy. Each replica set needs a minimum of three nodes (for example, three data nodes or two data nodes and an arbiter) to enable failover to a new primary via an election.
Sharding is useful for horizontal scaling once your data or writes exceed the resources of a single server.
Other considerations include planning your documents based on your application usage. For example, if your documents will be updated frequently and grow in size over time, you may want to consider manual padding to prevent excessive document moves.
If this is your first MongoDB project you should definitely read the FAQs on Replica Sets and Sharding with MongoDB, as well as for Application Developers.
Note that choosing a good shard key for your use case is an important consideration. A poor choice of shard key can lead to "hot spots" for data writes, or unbalanced shards if you plan to delete large amounts of data.
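As a small illustration of shard key choice (database, collection, and field names are hypothetical), sharding a collection on a hashed key with pymongo looks roughly like this:

```python
# Hypothetical sketch: enable sharding and shard a collection on a hashed key so
# inserts spread across shards instead of creating a hot spot.
from pymongo import MongoClient

client = MongoClient("mongodb://mongos.example.com:27017")   # connect to a mongos router

client.admin.command("enableSharding", "metrics")
client.admin.command(
    "shardCollection",
    "metrics.events",
    key={"customerId": "hashed"},
)
```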

Distributed and replicated data storage for small amounts of data under Windows

We're looking for a good solution to a caching problem. We'd like to distribute a relatively small amount of data (perhaps tens of GBs) among a cluster of web servers such that:
The data is replicated to all nodes
The data is persistent
The data can be accessed locally
Our motivation for a caching solution is that we currently have a single point of failure: a SQL Server database. Unfortunately, we're unable to set up a fail-over cluster for this database. We're already using Memcached to a large extent, but we want to avoid the problem where, if a Memcached node goes down, we'd suddenly see a large number of cache misses and therefore a massive number of requests hitting one endpoint.
We'd prefer instead to have local persistent caches on each web server node so that the resulting load would be distributed. When a retrieval is made, it would pass through the following:
Check for data in Memcached. If it's not there...
Check for data in local persistent storage. If it's not there...
Retrieve data from the database.
When data changes, the cache key is invalidated at both caching layers.
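A minimal sketch of that read-through flow and dual invalidation, using pymemcache for the shared tier and SQLite as a stand-in for the local persistent tier (any per-node persistent store would do; names are hypothetical):

```python
# Hypothetical sketch of the three-step lookup and the dual cache invalidation.
import sqlite3
from pymemcache.client.base import Client as MemcacheClient

memcache = MemcacheClient(("localhost", 11211))
local_store = sqlite3.connect("local_cache.db")   # per-node persistent cache
local_store.execute("CREATE TABLE IF NOT EXISTS cache (k TEXT PRIMARY KEY, v TEXT)")

def get(key, load_from_database):
    value = memcache.get(key)                      # 1. shared Memcached tier
    if value is not None:
        return value
    row = local_store.execute("SELECT v FROM cache WHERE k = ?", (key,)).fetchone()
    if row is not None:                            # 2. local persistent tier
        return row[0]
    value = load_from_database(key)                # 3. database of record
    local_store.execute("INSERT OR REPLACE INTO cache VALUES (?, ?)", (key, value))
    local_store.commit()
    memcache.set(key, value)
    return value

def invalidate(key):
    # On data change, drop the key from both caching layers.
    memcache.delete(key)
    local_store.execute("DELETE FROM cache WHERE k = ?", (key,))
    local_store.commit()
```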
We've been looking at a bunch of potential solutions, but none of them seem to match exactly what we need:
CouchDB
This is pretty close; the data model we'd like to cache is very document-oriented. However, its replication model isn't exactly what we're looking for. It seems to me as though replication is an action you have to perform rather than a permanent relationship among nodes. You can set up continuous replication, but this doesn't persist between restarts.
Cassandra
This solution seems to be mostly geared toward those with large storage requirements. We have a large number of users, but only small amounts of data. Cassandra looks able to support any number of fail-over nodes, but 100% replication among nodes doesn't seem to be what it's intended for; it seems more geared toward distribution only.
SAN
One attractive idea is that we can store a bunch of files on a SAN or similar type of appliance. I haven't worked with these before, but it seems like this would still be a single point of failure; if the SAN goes down, we'd suddenly be going to the database for all cache misses.
DFS Replication
A simple Google search revealed this. It seems to do what we want; it synchronizes files across all nodes in a replication cluster. But the marketing text makes it look like it's more of a system for ensuring documents are copied to different office locations. Also, it has limits, like a file count maximum, that wouldn't work well for us.
Have any of you had similar requirements to ours and found a good solution that meets your needs?
We've been using Riak successfully in production for several months now for a problem that's somewhat similar to what you describe. We too have evaluated CouchDB and Cassandra before.
The advantage of Riak for this sort of problem, in my opinion, is that distribution and data replication are at the core of the system. You define how many replicas of the data you want across the cluster and it takes care of the rest (it's a bit more complicated than that, of course, but that's the essence). We went through adding nodes, removing nodes, and having nodes crash, and it has proven surprisingly resilient.
It's a lot like Couch in other respects: document-oriented, REST interface, written in Erlang.
You can check out Hazelcast.
It does not persist the data, but it provides a fail-over system: each node can have a number of backup nodes that hold copies of its data in case the node fails.
