How to do small queries efficiently in ClickHouse

In our deployment there are one thousand shards. Insertions are done via a distributed table with the sharding key jumpConsistentHash(colX, 1000). When I query for rows with colX=... and turn on send_logs_level='trace', I see the query is sent to all shards and executed on each shard. This is limiting our QPS (queries per second). Checking the ClickHouse documentation, it states:
SELECT queries are sent to all the shards and work regardless of how data is distributed across the shards (they can be distributed completely randomly).
When you add a new shard, you don’t have to transfer the old data to it.
You can write new data with a heavier weight – the data will be distributed slightly unevenly, but queries will work correctly and efficiently.
You should be concerned about the sharding scheme in the following cases:
* Queries are used that require joining data (IN or JOIN) by a specific key. If data is sharded by this key, you can use local IN or JOIN instead of GLOBAL IN or GLOBAL JOIN, which is much more efficient.
* A large number of servers is used (hundreds or more) with a large number of small queries (queries of individual clients - websites, advertisers, or partners).
In order for the small queries to not affect the entire cluster, it makes sense to locate data for a single client on a single shard.
Alternatively, as we’ve done in Yandex.Metrica, you can set up bi-level sharding: divide the entire cluster into “layers”, where a layer may consist of multiple shards.
Data for a single client is located on a single layer, but shards can be added to a layer as necessary, and data is randomly distributed within them.
Distributed tables are created for each layer, and a single shared distributed table is created for global queries.
It seems there is a solution for small queries like ours (the second bullet above), but I am not clear on the point. Does it mean that when running a query with the predicate colX=..., I need to find the corresponding "layer" that contains its rows and then query the distributed table for that layer?
Is there a way to run these small queries against the global distributed table?
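For reference, here is a minimal sketch of the setup described above, using the Python clickhouse-driver client; the cluster, database, and table names (everything except colX and the sharding expression) are hypothetical:

# Sketch only: cluster/database/table names are hypothetical.
from clickhouse_driver import Client

client = Client(host="clickhouse-node-1")

# Local table that exists on every shard (created ON CLUSTER in practice).
client.execute("""
    CREATE TABLE IF NOT EXISTS db.events_local
    (
        colX    UInt64,
        payload String
    )
    ENGINE = MergeTree
    ORDER BY colX
""")

# Distributed table that routes INSERTs by jumpConsistentHash(colX, 1000).
client.execute("""
    CREATE TABLE IF NOT EXISTS db.events_dist AS db.events_local
    ENGINE = Distributed(my_cluster, db, events_local, jumpConsistentHash(colX, 1000))
""")

# A point query like this currently fans out to all 1000 shards.
rows = client.execute(
    "SELECT * FROM db.events_dist WHERE colX = %(x)s",
    {"x": 42},
    settings={"send_logs_level": "trace"},
)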

Related

Proper way to populate cache from Cassandra

I want to have a memory cache layer in my application. To populate the cache with items, I have to get data from a large Cassandra table. A full "select all" is not recommended, because without partition keys it is a slow read operation. Instead, I can "predict" the partition keys by first reading another Cassandra table in full, which is relatively small. After reading that user table, I end up with a list of potential partition keys (userX, userY) that may or may not be present in the initial table, and I can then try to populate the cache by executing a select query for each potential key. That doesn't sound like a really good idea either.
So the question is: how do I properly populate a cache layer with data from a Cassandra DB?
The second option is preferred for warming up or pre-loading your cache.
Single-partition asynchronous queries from multiple client/app instances are much better than doing a full table scan. Asynchronous queries from lots of clients distribute the load efficiently across all nodes in the cluster, which is why they perform better.
It should be said that if you've got your data model right and you've sized your cluster correctly, you can achieve single-digit millisecond latencies. I work with a lot of large organisations who have a 95% SLA for 6-8ms reads. Cheers!
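As an illustration of that second option, here is a minimal sketch of warming a cache with single-partition asynchronous reads using the Python cassandra-driver; the keyspace, table, and column names are hypothetical:

# Sketch only: keyspace/table/column names are hypothetical.
from cassandra.cluster import Cluster

cluster = Cluster(["cassandra-node-1"])
session = cluster.connect("my_keyspace")

# A prepared statement bound to the partition key keeps every read single-partition.
select_user = session.prepare("SELECT user_id, profile FROM users WHERE user_id = ?")

candidate_keys = ["userX", "userY"]  # the "predicted" partition keys
cache = {}

# Fire the reads asynchronously so the load spreads across the cluster.
futures = [(key, session.execute_async(select_user, [key])) for key in candidate_keys]

for key, future in futures:
    row = future.result().one()
    if row is not None:          # a predicted key may legitimately be absent
        cache[key] = row.profile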

ElasticSearch - Should I Shard by Partition?

I have an ES cluster of 80 million documents with 4 data nodes and 3 master nodes. Searching in the cluster is pretty fast depending on the query, but it is always painfully slow to scroll when I need to pull millions of documents out at once.
I do have logical partitions in my data, and only search on a given partition at a time (client id). These partitions don't necessarily have an even distribution of documents, though: one partition may have 1 million documents while another has only 100k.
For this reason I never considered sharding by partition, since I'm certain it wouldn't give an even distribution.
Is my thinking correct or could I see faster query/scroll times by keeping partitioned data localized to a shard?
The outcome of routing depends on the use case, but if applied correctly it can make the difference between a hard-working cluster and a performant one.
With routing enabled, write and search operations will hit only the single shard that is relevant according to the routing parameter. This reduces the impact on many layers of the cluster: distribution of requests, network traffic, threads/IOPS on the nodes, merging the results, etc.
But the data will be distributed unevenly across the shards associated with the index. You'll potentially get highly loaded shards on the one side and barely used shards on the other. For the same reason, the optimal size for a shard (40-50GB) will be violated: for small shards there will be too much overhead handling the shard in comparison to the data it holds, and for large shards there will be too much data to search through.
In order to overcome this downside, there is another option for routed indices: increase the partition size. All routed requests will go to a larger partition, not a single shard but a subset of the available shards. This reduces the risk of imbalanced shards while still reducing the search impact. Just set index.routing_partition_size while creating the index to a value larger than 1 but lower than index.number_of_shards. Now the requests will be routed across the shards in the partition rather than one shard (basic routing) or all shards (no routing) in the index. It's a reasonable trade-off between route-optimized reading and balanced data distribution.
I see another potential improvement: with routing enabled, there will still be more than one logical partition (the client, in your case) per shard, and irrelevant data has to be visited while searching. That's why you should think about using index sorting in order to improve reading speed in the underlying segment files of a shard. This feature is available with or without _routing. Having all associated data stored together will help you reduce search time too, but it comes at the cost of writing speed because the documents have to be ordered for flush or segment merge operations.
Here is an example index creation request putting it all together, assuming your most common case is writing and reading data routed by client_id and mostly querying for a specific order_id:
PUT my_index
{
"settings": {
"index": {
"number_of_shards": 10,
"routing_partition_size: 2,
"sort.field": ["client_id", "order_id"],
"sort.order": ["asc", "desc"]
}
},
"mappings": {
...
}
}
This answer was written when 7.5 was the current version of Elasticsearch.
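A possible usage sketch with the Python Elasticsearch client is shown below; the index name, routing value, and field values are hypothetical, and the exact parameter names differ slightly between client versions (this uses the 8.x-style API):

# Sketch only: index name, routing value, and field values are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Writes must carry the routing value so documents land in the client's partition.
es.index(
    index="my_index",
    id="order-1001",
    routing="client-42",
    document={"client_id": "client-42", "order_id": 1001},
)

# The search is then routed to the 2-shard partition for client-42
# instead of fanning out to all 10 shards of the index.
resp = es.search(
    index="my_index",
    routing="client-42",
    query={"term": {"order_id": 1001}},
)
print(resp["hits"]["total"])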

How Yandex implemented 2-layered sharding

In the ClickHouse documentation, there is a mention of Yandex.Metrica implementing bi-level sharding.
"Alternatively, as we've done in Yandex.Metrica, you can set up bi-level sharding: divide the entire cluster into "layers", where a layer may consist of multiple shards. Data for a single client is located on a single layer, but shards can be added to a layer as necessary, and data is randomly distributed within them."
Is there a detailed implementation of this sharding scheme documented somewhere?
Logically, Yandex.Metrica has only one high-cardinality ID column that serves as the main sharding key.
By default, SELECTs from a table with the Distributed engine request partial results from one replica of each shard.
If you have hundreds of servers or more, querying all shards (probably 1/2 or 1/3 of all servers) means a lot of network communication, which might introduce more latency than the actual query execution.
The reason for this behavior is that ClickHouse allows writing data directly to shards (bypassing the Distributed engine and its configured sharding key), and the application that does so is not forced to comply with the sharding key of the Distributed table (it can choose a different scheme to spread data more evenly, or for whatever other reason).
So the idea of bi-level sharding is to split a large cluster into smaller sub-clusters (10-20 servers each) and make most SELECT queries go through Distributed tables that are configured against those sub-clusters, thus requiring less network communication and lowering the impact of possible stragglers.
A global Distributed table for the whole large cluster is also configured, for ad-hoc or overview-style queries, but those are not so frequent and have looser latency requirements.
This still leaves the application that writes data the freedom to balance it arbitrarily between the shards forming a sub-cluster (by writing directly to them).
But to make this all work together, the applications that write and read data need a consistent mapping from whatever high-cardinality ID is used (CounterID in the case of Metrica) to the sub-cluster ID and the hostnames it consists of. Metrica stores this mapping in MySQL, but in other cases something else might be more applicable.
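A minimal sketch of the read path under such a scheme might look like the following; the mapping store (a plain dict here instead of MySQL), the layer names, and the table names are all hypothetical:

# Sketch only: mapping store, layer names, and table names are hypothetical.
from clickhouse_driver import Client

# CounterID -> layer (sub-cluster) mapping; Metrica keeps this in MySQL,
# here it is just a dict for illustration.
counter_to_layer = {12345: "layer_07"}

def query_hits(counter_id, client):
    # Each layer has its own Distributed table (e.g. db.hits_layer_07);
    # db.hits_all is the global Distributed table for ad-hoc queries.
    layer = counter_to_layer.get(counter_id)
    table = f"db.hits_{layer}" if layer else "db.hits_all"
    return client.execute(
        f"SELECT count() FROM {table} WHERE CounterID = %(cid)s",
        {"cid": counter_id},
    )

client = Client(host="clickhouse-node-1")
print(query_hits(12345, client))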
An alternative approach is to use the optimize_skip_unused_shards setting, which makes SELECT queries that have a condition on the sharding key of the Distributed table skip the shards that are not supposed to have the data. It introduces the requirement that data be distributed between the shards exactly as if it had been written through this Distributed table, otherwise the report will not include misplaced data.
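Applied to the Distributed table from the original question, that alternative is a one-setting change on the read path; a minimal sketch with the Python clickhouse-driver client (table and column names are hypothetical, and the data must really be placed according to the sharding key):

# Sketch only: table/column names are hypothetical.
from clickhouse_driver import Client

client = Client(host="clickhouse-node-1")

# With the setting enabled, the condition on the sharding key (colX) lets the
# Distributed table prune shards instead of querying all of them.
rows = client.execute(
    "SELECT * FROM db.events_dist WHERE colX = %(x)s",
    {"x": 42},
    settings={"optimize_skip_unused_shards": 1},
)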

Scaling horizontally meaning

I am learning Elasticsearch, and in the documentation I came across this line:
Performing full SQL-style joins in a distributed system like
Elasticsearch is prohibitively expensive. Instead, Elasticsearch
offers two forms of join which are designed to scale horizontally.
Could someone please explain in layman's terms what the second sentence means?
As a preamble you might want to go through another thread on SO that explains horizontal vs vertical scaling.
Most of the time, an ES cluster is designed to grow horizontally, meaning that whenever your cluster starts to show some signs of weaknesses (slow queries, slow indexing, etc), all you need to do is add one or more nodes to your cluster and ES will spread the load on more hardware, and thus, lighten the burden on existing nodes. That's what horizontal scaling is all about and ES is perfectly designed for this given the way it partitions the indexes into shards that get assigned to the nodes in your cluster.
As you know, ES has no JOIN feature and they did it on purpose for the reason mentioned above (i.e. "prohibitively expensive"). There are four ways to model relationships in ES:
by denormalizing your data (preferred)
by using nested types
by using parent/child documents
by using application-side joins
The link you referred to, which introduces the nested, has_parent and has_child queries, is about the second and third bullet points above. Nested and parent/child documents have been designed in such a way as to take advantage as much as possible of the index/shard partitioning model that ES supports.
When using a nested field (1-N relationship), each element inside the nested array is just another hidden document under the hood and is stored in a shard somewhere in your cluster. When using a join field (1-N relationship), parent and child documents are also documents stored in your index within a shard located somewhere in your cluster. When your index grows (i.e. when you have more and more parent/child and/or nested data), you add nodes and the shards containing your documents get spread within the cluster transparently. This means that wherever your documents are stored, you can retrieve them as well as their related documents without having to perform expensive joins.
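To make the nested option concrete, here is a small sketch with the Python Elasticsearch client; the index and field names are hypothetical, and parameter names differ slightly between client versions (this uses the 8.x-style API):

# Sketch only: index and field names are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Each element of "orders" becomes a hidden sub-document stored in the same
# shard as its parent document, so no cross-shard join is needed at query time.
es.indices.create(
    index="customers",
    mappings={
        "properties": {
            "name": {"type": "keyword"},
            "orders": {
                "type": "nested",
                "properties": {
                    "order_id": {"type": "keyword"},
                    "amount": {"type": "double"},
                },
            },
        }
    },
)

resp = es.search(
    index="customers",
    query={
        "nested": {
            "path": "orders",
            "query": {"term": {"orders.order_id": "1001"}},
        }
    },
)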
You can find more information about horizontal scaling here.
In Elasticsearch terms, when you start two or more ES instances on the same network with the same cluster config, they will connect to each other and form a distributed cluster. So if you add one more computer or node, start an ES instance there, and keep the cluster config the same, that node will automatically attach to the existing cluster, and the data and request load will be shared. When you make a request to ES, whether it is a read or a write, each request can be processed in parallel, and the speed you get depends on the number of nodes and the number of shards each index has on them.
Get more information here

HBase: Create multiple tables or single table with many columns?

When does it make sense to create multiple tables as opposed to a single table with a large number of columns? I understand that typically tables have only a few column families (1-2) and that each column family can support 1000+ columns.
When does it make sense to create separate tables when HBase seems to perform well with a potentially large number of columns within a single table?
Before answering the question itself, let me first state some of the major factors that come into play. I am going to assume that the file system in use is HDFS.
A table is divided into non-overlapping partitions of the keyspace called regions.
The key-range -> region mapping is stored in a special single region table called meta.
The data in one HBase column family for a region is stored in a single HDFS directory. It's usually several files but for all intents and purposes, we can assume that a region's data for a column family is stored in a single file on HDFS called a StoreFile / HFile.
A StoreFile is essentially a sorted file containing KeyValues. A KeyValue logically represents the following, in order: (RowLength, RowKey, FamilyLength, FamilyName, Qualifier, Timestamp, Type). For example, if you have only two KVs in your region for a CF, where the row key is the same but the values are in two different columns, this is what the StoreFile will look like (except that it's actually byte-encoded, and metadata like lengths is also stored, as mentioned above):
Key1:Family1:Qualifier1:Timestamp1:Value1:Put
Key1:Family1:Qualifier2:Timestamp2:Value2:Put
The StoreFile is divided into blocks (default 64KB) and the key range contained in each data block is indexed by multi-level indexes. A random lookup inside a single block can be done using index + binary search. However, the scans have to go serially through a particular block after locating the starting position in the first block needed for scan.
HBase is a LSM-tree based database which means that it has an in-memory log (called Memstore) that is periodically flushed to the filesystem creating the StoreFiles. The Memstore is shared for all columns inside a single region for a particular column family.
There are several optimizations involved when reading/writing data from/to HBase, but the information given above holds true conceptually. Given the above, here are the pros of each approach, a single table with many columns versus multiple tables:
Single Table with multiple columns
Better on-disk compression due to prefix encoding, since all data for a key is stored together rather than in multiple files across tables. This also results in reduced disk activity due to the smaller data size.
Less load on the meta table because the total number of regions is going to be smaller. You'll have N regions for just one table rather than N*M regions for M tables. This means faster region lookups and lower contention on the meta table, which is a concern for large clusters.
Faster reads and lower IO amplification (causing less disk activity) when you need to read several columns for a single row key.
You get the advantage of row-level transactions, batching and other performance optimizations when writing to multiple columns for a single row key.
When to use this:
If you want to perform row level transactions across multiple columns, you have to put them in a single table.
Even when you don't need row-level transactions, but you often write to or query multiple columns for the same row key. A good rule of thumb is that if, on average, more than 20% of your columns have values for a single row, you should try to put them together in a single table (see the sketch after this list).
When you have too many columns.
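As a small illustration of the row-level batching point above, here is a sketch using the Python happybase client (which talks to HBase over Thrift); the table, column family, and column names are hypothetical:

# Sketch only: table, column family, and column names are hypothetical.
import happybase

connection = happybase.Connection("hbase-thrift-host")
table = connection.table("user_profile")

# All columns for row key "user123" live in the same region, so this single
# put is applied atomically at the row level.
table.put(
    b"user123",
    {
        b"d:name": b"Alice",
        b"d:email": b"alice@example.com",
        b"d:last_login": b"2024-01-01",
    },
)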
Multiple Tables
Faster scans for each table and lower IO amplification if the scans are mostly concerned with only one column (remember that scans read blocks sequentially, so they will also read columns they don't need).
Good logical separation of data, especially when you don't need to share row keys across columns. Have one table for one type of row keys.
When to use:
When there is a clear logical separation of data. For example, if your row key schema differs across different sets of columns, put those sets of columns in separate tables.
When only a small percentage of columns have values for a row key (Look below for a better approach).
When you want different storage configs for different sets of columns, e.g. TTL, compaction rate, blocking file counts, memstore size, etc. (look below for a better approach in this use case).
An alternative of sorts: Multiple CFs in single table
As you can see from above, there are pros to both approaches. The choice becomes really difficult in cases where you have the same row key structure for several columns (so you want to share the row key for storage efficiency, or need transactions across columns) but the data is very sparse (which means you write/read only a small percentage of the columns for a row key).
It seems like you need the best of both worlds in this case. That's where column families come in. If you can partition your column set into logical subsets where you mostly access/read/write only to a single subset, or you need storage level configs per subset (like TTL, Storage class, write heavy compaction schedule etc.), then you can make each subset a column family.
Since data for a particular column family is stored in single file (set of files), you get better locality while reading a subset of columns without slowing down the scans.
However, there is a catch:
Do not try to unnecessarily use column families. There is a cost associated with them, and HBase does not do well with 10+ CFs due to how region level write locks, monitoring etc. work in HBase. Use CFs only if you have a logical relationship between columns across CFs but you don't generally perform operations across CFs or need to have different storage configs for different CFs.
It's perfectly fine to use only a single CF containing all your columns if you share the row key schema across them, unless you have a very sparse data set, in which case you might need different CFs or different tables based on the points mentioned above.
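For completeness, here is a sketch of creating a single table with two column families that carry different storage configs (a TTL on one of them), using the Python happybase client; the table name, family names, and settings are hypothetical:

# Sketch only: table name, family names, and settings are hypothetical.
import happybase

connection = happybase.Connection("hbase-thrift-host")

connection.create_table(
    "user_data",
    {
        # Small, frequently read columns.
        "profile": dict(max_versions=1),
        # Sparse event columns that expire after 7 days (TTL in seconds).
        "events": dict(max_versions=1, time_to_live=604800),
    },
)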
