What is the advantage of using a B-tree in Aerospike for the primary index?

I was going through the Aerospike documentation and found that for the primary key, Aerospike uses hashing: the hash points to a B-tree, and the B-tree contains a pointer to the actual record.
As far as I know, Redis only uses hashing (for collision resolution it maintains a list per hash bucket), and the hash points to the actual record.
What is the advantage of the B-tree used by Aerospike? Doesn't it mean that accessing a record by its primary key takes O(log n) in Aerospike, while Redis takes only O(1)?
I may be wrong, but that's all I understood from the documentation. Can someone please shed more light on this topic?

I'm not sure of the point of the question, but here goes:
Actually, Aerospike's primary index is a distributed hash of red-black trees, with between 1 and 4096 sprigs per partition (see the partition-tree-sprigs config param).
There are 4096 logical partitions which are evenly distributed across the nodes of the cluster. The key identifying any record is a 20-byte digest produced by passing the (namespace, set, PK) 3-tuple through RIPEMD-160 (the client does that automatically for you). The record is consistently hashed to a specific partition, as bits in this digest are used to calculate the partition ID.
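As a rough illustration of how a digest maps to a partition, here is a minimal Python sketch. It is not Aerospike's actual implementation: the input encoding and the bytes used to derive the partition ID are assumptions for illustration, and hashlib's ripemd160 may be missing from some OpenSSL builds.

    import hashlib

    N_PARTITIONS = 4096  # fixed number of logical partitions in Aerospike

    def record_digest(namespace, set_name, pk):
        # Illustrative only: the answer above describes the digest as RIPEMD-160
        # over the (namespace, set, PK) tuple; real clients use a specific binary
        # encoding of the key.
        h = hashlib.new("ripemd160")          # may be unavailable in some builds
        h.update(f"{namespace}:{set_name}:{pk}".encode())
        return h.digest()                     # 20 bytes

    def partition_id(digest):
        # Take 12 bits of the digest to pick one of the 4096 partitions.
        # The exact byte/bit layout here is an assumption, not Aerospike's.
        return int.from_bytes(digest[:2], "little") & (N_PARTITIONS - 1)

    d = record_digest("test", "users", "user-42")
    print(partition_id(d))  # some value in [0, 4095]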
As opposed to Redis, which was designed to be a single-core, single-threaded application running on a single node, Aerospike was built as a distributed database. It's true that users can ad-hoc cluster Redis using application-side solutions or middleware. In the case of Aerospike, all the nodes in the cluster and all the clients share a partition map.
Since the client is aware of the cluster's partition map, it is always one hop away from the node holding the master partition (or a node holding the replica partition - this is controlled by the replica read policy). So, it's O(1) to the correct node in the cluster. Within that node it's O(1) to find the partition's rbTree, and then all operations are O(log n).
When there isn't a lot of data in a hash table (assuming you're right about the data structure used by Redis), finding a record should be O(1). But once there are more elements than slots in the hash table, lookups have to walk collision chains (linked lists), which is O(n) in the worst case. For the rbTree it's O(log n) for both the average and the worst case. Aerospike is intended to handle large data sets with predictable low latency, so the rbTree was more appropriate: the cost of getting a record stays the same regardless of the amount of data in the cluster.
Addition: as of Aerospike DB version 4.2, sprigs became much cheaper in terms of memory, and the limit of 4096 sprigs per partition has been removed. You can effectively turn each sprig into a depth-1 red-black tree by allocating enough sprigs, so O(log n) can be made virtually the same as O(1). For example, if you wanted an average tree depth of 1 for the sprigs, and you had a billion objects in your database, you'd set partition-tree-sprigs to 262144 (it has to be a power of 2), which would cost 848MB evenly distributed across the nodes (283MB per node in a 3-node cluster).
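To see how those numbers line up, here is a small back-of-the-envelope calculation; the 848MB cluster-wide overhead figure is taken from the answer above, not derived independently.

    import math

    objects = 1_000_000_000
    partitions = 4096
    sprigs_per_partition = 262_144            # must be a power of 2

    total_sprigs = partitions * sprigs_per_partition
    objects_per_sprig = objects / total_sprigs
    # Average red-black tree depth is roughly log2 of the objects per sprig.
    avg_depth = max(1, math.ceil(math.log2(max(objects_per_sprig, 1))))

    print(round(objects_per_sprig, 2))  # ~0.93 objects per sprig
    print(avg_depth)                    # ~1, i.e. effectively O(1) lookups

    cluster_overhead_mb = 848           # figure quoted in the answer
    print(round(cluster_overhead_mb / 3))  # ~283 MB per node in a 3-node cluster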

Related

Max value of number_of_routing_shards in Elasticsearch 6.x

What is the max recommended value of number_of_routing_shards for an index?
Can I specify a very high value like 30000? What are the side effects if I do so?
Shards are "slices" of an index created by elasticsearch to have flexibility to distribute indexed data. For example, among several datanodes.
Shards, in the low level are independent sets of lucene segments that work autonomously, which can be queried independently. This makes possible the high performance because search operations can be split into independent processes.
The more shards you have the more flexible becomes the storage assignment for a given index. This obviously has some caveats.
Distributed searches must wait each other to merge step-results into a consistent response. If there are many shards, the query must be sliced into more parts, (which has a computing overhead). The query is distributed to each shard, whose hashes match any of the current search (not all shards are necesary hit by every query) therefore the most busy (slower) shard, will define the overall performance of your search.
It's better to have a balanced number of indexes. Each index has a memory footprint that is stored in the cluster state. The more indexes you have the bigger the cluster state, the more time it takes to be shared among all cluster nodes.
The more shards an index has, the complexer it becomes, therefore the size taken to serialize it into the cluster state is bigger, slowing things down globally.
Setting number_of_routing_shards to a very high value like 30000 would give you an index that can be split into up to 30,000 shards (according to https://www.elastic.co/guide/en/elasticsearch/reference/6.x/indices-split-index.html), which is ... useless.
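For completeness, here is a hedged sketch of how number_of_routing_shards interacts with the _split API, using the plain REST endpoints through Python's requests library. The host, index names and the modest routing-shard value are made-up examples; check the documentation linked above for the exact constraints on split factors.

    import requests

    ES = "http://localhost:9200"   # assumed local cluster

    # number_of_routing_shards is fixed at creation time and caps what the
    # index can later be split into (target shard counts must divide it).
    requests.put(f"{ES}/logs_v1", json={
        "settings": {
            "index": {
                "number_of_shards": 1,
                "number_of_routing_shards": 30   # deliberately modest, not 30000
            }
        }
    })

    # The source index must be made read-only before it can be split.
    requests.put(f"{ES}/logs_v1/_settings",
                 json={"settings": {"index.blocks.write": True}})

    # Split 1 shard into 6 (6 is a factor of 30, so it is a legal target).
    requests.post(f"{ES}/logs_v1/_split/logs_v1_split", json={
        "settings": {"index.number_of_shards": 6}
    })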
As with all software tuning, recommended values vary with your:
use case
hardware (VM / network / disk ...)
metrics

How do partition size affect read/write performances in Cassandra?

I can partition my table into a small number of bigger partitions or many smaller partitions, but in my use case even the big partition is still small in size; it will never exceed 100MB. There will be millions of users reading from this table, so is there a risk of congestion when having so many users reading from a single partition?
I can imagine that splitting the read queries between several physical nodes is faster than reading from a single physical node, but does splitting read queries between several virtual nodes improve performance in the same way? The number of big partitions will exceed the number of physical nodes, so will spreading the data further through the virtual nodes with smaller partitions improve the read performance? Is the answer any different for updating partitions of counter tables?
So basically, what I need to know is if millions of users reading from the same partition (that is below 100MB in size) will introduce congestion. This is the answer that actually matters for my project. But I also want to know if spreading the data further (regular and counter tables), beyond the number of physical nodes through smaller partitions will increase the read/write performance.
Any reference links would be extremely appreciated since I'll be writing a report and referencing an article, journal or documentation is always preferred.
In my opinion, accessing the same partition (we are actually talking about a "row" in Cassandra 3.0) is not a problem. If the load on your cluster increases, you just need to add more nodes; this is the no-single-point-of-failure principle. Each node in your cluster is able to fulfill the user request (depending on your replication factor and read consistency level).
Also, if you know that a partition key is going to be accessed a lot, you can play with the key cache and row cache functionality of your table to avoid any disk access.
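A minimal sketch of that second point, assuming the DataStax Python driver and a keyspace/table named myks.events (both made up here); note the row cache also has to be enabled server-side (row_cache_size_in_mb in cassandra.yaml) for this to have any effect.

    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])   # assumed contact point
    session = cluster.connect()

    # Cache all partition keys and the first 100 rows of each partition in memory,
    # so repeated reads of a hot partition can skip disk access.
    session.execute("""
        ALTER TABLE myks.events
        WITH caching = {'keys': 'ALL', 'rows_per_partition': '100'}
    """)

    cluster.shutdown()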

Hbase table duplication

Is there a way to duplicate table data on every node of a cluster?
I need to do a performance test with the maximum degree of locality of the data.
By default, HBase distributes the data over a small fraction of the cluster nodes (on 1 or 2 nodes), probably because my data isn't very big (~2 GB).
I know that HBase is designed for much larger data sets, but in this case it is a requirement for me.
There are a lot of nice reads* about this (see the end of the post), but I'll try to explain it in my own words ;)
HBase is not responsible for data replication; Hadoop HDFS is, and by default it is configured with a replication factor of 3, which means all data will be stored on at least 3 nodes.
Data locality is a key aspect of good performance, but achieving maximum data locality is easy: you only need to colocate your HBase RegionServers (RS) with the Hadoop DataNodes (DN), so all your DNs should also have the RS role. Once you have that, HBase will automatically move the data where it's needed (on major compactions) to achieve data locality, and that's all: as long as each RS has the data of the regions it serves locally, you'll have data locality.
Even when you have the data replicated to multiple DNs, each region (and the rows it contains) will be served by just one RS; it doesn't matter whether you have a replication factor of 3, 10 or 100... Reading a row belonging to region #1 will always hit the same RS, and that will be the one hosting the region (which will read the data locally from HDFS because of data locality). If the RS hosting that region goes down, the region will be assigned to another RS automatically (because the data is also replicated to other DNs).
What you can do is split your table in such a way that each RS gets an even share of buckets of rows (regions), so that as many different RSs as possible work simultaneously when you read or write data, increasing your overall throughput as long as you don't always hit the same regions (called RegionServer hotspotting**).
Therefore, you should always start by ensuring that all the regions of your table are assigned to different RSs and that they receive the same volume of R/W requests. Once you've done that, you can split your table into more regions until you have an even number of regions on all the RSs of your cluster (you may need to assign them manually if you're not happy with the load balancer).
Just remember that even when you seem to have a perfect distribution of regions, you can still have poor performance if your data access pattern is not right (or is uneven) and doesn't reach all regions evenly; in the end, it all depends on your application.
(*) Recommended reads:
http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
http://www.larsgeorge.com/2010/05/hbase-file-locality-in-hdfs.html
(**) To avoid RS hotspotting we always design our tables to have non-monotonically increasing row keys, so rows 1, 2, 3 ... N are hosted on different regions; the common approach is to use MD5(id) + id as the row key. This approach has its own set of drawbacks: you cannot scan the first 10 rows, because they're salted.
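A minimal sketch of that salting scheme in Python; the hash prefix length and the separator are arbitrary choices for illustration, and any HBase client (happybase, for example) would then use the resulting string as the row key.

    import hashlib

    def salted_rowkey(record_id: str, prefix_len: int = 8) -> str:
        # Prefix the natural id with part of its MD5 so consecutive ids land in
        # different regions instead of all hitting the same RegionServer.
        salt = hashlib.md5(record_id.encode()).hexdigest()[:prefix_len]
        return f"{salt}_{record_id}"

    print(salted_rowkey("1"))   # 'c4ca4238_1'
    print(salted_rowkey("2"))   # 'c81e728d_2'
    # The drawback mentioned above: a plain scan no longer returns ids in order.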

ElasticSearch - Optimal number of Shards per node

I would appreciate it if someone could suggest the optimal number of shards per ES node for optimal performance, or provide a recommended way to arrive at the number of shards one should use, given the number of cores and the memory footprint.
I'm late to the party, but I just wanted to point out a couple of things:
The optimal number of shards per index is always 1. However, that provides no possibility of horizontal scale.
The optimal number of shards per node is always 1. However, then you cannot scale horizontally more than your current number of nodes.
The main point is that shards have an inherent cost to both indexing and querying. Each shard is actually a separate Lucene index. When you run a query, Elasticsearch must run that query against each shard, and then compile the individual shard results together to come up with a final result to send back. The benefit to sharding is that the index can be distributed across the nodes in a cluster for higher availability. In other words, it's a trade-off.
Finally, it should be noted that any more than 1 shard per node will introduce I/O considerations. Since each shard must be indexed and queried individually, a node with 2 or more shards would require 2 or more separate I/O operations, which can't be run at the same time. If you have SSDs on your nodes then the actual cost of this can be reduced, since all the I/O happens much quicker. Still, it's something to be aware of.
That, then, begs the question of why you would want more than one shard per node. The answer is planned scalability. The number of shards in an index is fixed. The only way to add more shards later is to recreate the index and reindex all the data. Depending on the size of your index, that may or may not be a big deal. At the time of writing, Stack Overflow's index is 203GB (see: https://stackexchange.com/performance). Recreating all of that data is kind of a big deal, so resharding would be a nightmare. If you have 3 nodes and a total of 6 shards, you can later scale out to up to 6 nodes easily without resharding.
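To make the 3-nodes/6-shards example concrete, here is a hedged sketch of creating such an index up front over the REST API (the host and index name are assumptions). Since the primary shard count is fixed, this is the decision you are locking in at creation time; replicas can be changed later.

    import requests

    ES = "http://localhost:9200"   # assumed cluster address

    # 6 primaries on a 3-node cluster: 2 shards per node today, with room to
    # scale out to 6 nodes later without reindexing. number_of_replicas can be
    # changed at any time; number_of_shards cannot.
    requests.put(f"{ES}/posts", json={
        "settings": {
            "number_of_shards": 6,
            "number_of_replicas": 1
        }
    })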
There are three situations to consider before sharding:
Situation 1) You want to use Elasticsearch with failover and high availability. Then you go for sharding.
In this case, you need to select the number of shards according to the number of nodes [ES instances] you want to use in production.
Say you want 3 nodes in production. Then you should choose 1 primary shard and 2 replicas for every index; don't choose more shards than you need.
Situation 2) Your current server can hold the current data, but because the data keeps growing, in the future you may run out of disk space or your server may no longer be able to handle that much data. In that case you need to configure more shards, like 2 or 3 (it's up to your requirements), for each index, but there needn't be any replicas.
Situation 3) This is the combination of situations 1 & 2, so you need to combine both configurations. Say your data grows dynamically and you also need high availability and failover. Then you configure an index with 2 shards and 1 replica. That way you can share data among nodes and get optimal performance..!
Note: The query is processed in each shard, then a map-reduce is performed on the results from all shards and the final result is returned to us. This map-reduce step is expensive, so the minimum number of shards gives optimal performance.
If you are using only one node in production, then one primary shard is the optimal number of shards for each index.
Hope it helps..!
Just got back from configuring some log storage for 10 TB so let's talk sharding :D
Node limitations
Main source: The definitive guide to elasticsearch
HEAP: 32 GB at most:
If the heap is less than 32 GB, the JVM can use compressed pointers, which saves a lot of memory: 4 bytes per pointer instead of 8 bytes.
HEAP: 50% of the server memory at most. The rest is left to filesystem caches (thus 64 GB servers are a common sweet spot):
Lucene makes good use of the filesystem caches, which are managed by the kernel. Without enough filesystem cache space, performance will suffer. Furthermore, the more memory is dedicated to the heap, the less is available for all your other fields using doc values.
[An index split in] N shards can spread the load over N servers:
1 shard can use all the processing power from 1 node (it's like an independent index). Operations on sharded indices are run concurrently on all shards and the result is aggregated.
Fewer shards are better (the ideal is 1 shard):
The overhead of sharding is significant. See this benchmark for numbers https://blog.trifork.com/2014/01/07/elasticsearch-how-many-shards/
Fewer servers are better (the ideal is 1 server with 1 shard):
The load on an index can only be split across nodes by sharding (a single shard is enough to use all the resources on a node). More shards allow you to use more servers, but more servers bring more overhead for data aggregation... There is no free lunch.
Configuration
Usage: A single big index
We put everything in a single big index and let elasticsearch do all the hard work relating to sharding data. There is no logic whatsoever in the application so it's easier to dev and maintain.
Let's suppose that we plan for the index to be at most 111 GB in the future and we've got 50 GB servers (25 GB heap) from our cloud provider.
That means we should have 5 shards (111 GB / 25 GB of heap ≈ 4.4, rounded up, so each shard stays no bigger than one node's heap).
Note: most people tend to overestimate their growth, so try to be realistic. For instance, this 111GB example is already a BIG index. For comparison, the Stack Overflow index is 430 GB (2016), and it's a top-50 site worldwide, made entirely of text written by millions of people.
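A tiny helper that captures the sizing rule implied by those numbers; the rule itself (one shard per heap-sized slice of data) is an inference from this example, not an official formula.

    import math

    def shards_for(index_size_gb: float, heap_per_node_gb: float) -> int:
        # Keep each shard no bigger than one node's heap.
        return math.ceil(index_size_gb / heap_per_node_gb)

    print(shards_for(111, 25))   # 5, as in the example above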
Usage: Index by time
When there is too much data for a single index, or it becomes too annoying to manage, the next step is to split the index by time period.
The most extreme example is logging applications (Logstash and Graylog), which use a new index every day.
The ideal configuration of a single shard per index makes perfect sense in this scenario. The index rotation period can be adjusted, if necessary, to keep the index smaller than the heap.
Special case: let's imagine a popular internet forum with monthly indices. 99% of requests hit the latest index. We have to set multiple shards (e.g. 3) to spread the load over multiple nodes. (Note: this is probably an unnecessary optimization. A 99% hit rate is unlikely in the real world, and the shard replicas could distribute part of the read-only load anyway.)
Usage: Going Exascale (just for the record)
Elasticsearch is magic. It's the easiest database to set up as a cluster, and it's one of the very few able to scale to many nodes (excluding Spanner).
It's possible to go exascale with hundreds of Elasticsearch nodes. There must be many indices and shards to spread the load over that many machines, and that takes an appropriate sharding configuration (possibly adjusted per index).
The final bit of magic is to tune elasticsearch routing to target specific nodes for specific operations.
It might also be a good idea to have more than one primary shard per node, depending on the use case. I found that bulk indexing was pretty slow and only one CPU core was used, so we had idle CPU power and very low I/O; hardware was definitely not the bottleneck. Thread pool stats showed that during indexing only one bulk thread was active. We have a lot of analyzers and a complex tokenizer (decomposition analysis of German words). Increasing the number of shards per node resulted in more bulk threads being active (one per shard on the node), and it dramatically improved indexing speed.
The number of primary shards and replicas depends on the following parameters:
Number of data nodes: The replica shards for a given primary shard are meant to be on different data nodes, which means that if there are 3 data nodes, DN1, DN2 and DN3, and the primary shard is on DN1, then the replica shards should be on DN2 and/or DN3. Hence the number of replicas should be less than the total number of data nodes.
Capacity of each data node: The size of a shard cannot exceed the size of the data node's disk, so the number of primary shards should be chosen according to the expected size of the given index.
Recovery mechanism in case of failure: If the data in the given index can be recovered quickly, 1 replica should be enough.
Performance requirements of the given index: Since sharding helps direct the client node to the appropriate shard to improve performance, the query parameters and the size of the data belonging to those query parameters should be considered when defining the number of primary shards.
These are the basic guidelines to follow; they should be tuned depending on the actual use case.
I have not tested this yet, but AWS has a good article about ES best practices. Look at the Choosing Instance Types and Testing parts.
Elastic.co recommends to:
[…] keep the number of shards per node below 20 per GB heap it has configured
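Applied to a typical node, that guideline works out as follows; the 30 GB heap used here is just an example figure consistent with the 32 GB ceiling mentioned earlier.

    heap_gb = 30                        # example heap size, below the 32 GB ceiling
    max_shards_per_node = 20 * heap_gb  # Elastic.co guideline: 20 shards per GB heap
    print(max_shards_per_node)          # 600 shards per node at most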

What index algorithm is MongoDB using? A binary tree?

I would like to know what kind of internal indexing algorithm MongoDB is using, because I have some data I want to store, and each document (row) has an id, which is probably a unique hash value (e.g. generated by md5() or another hash algorithm). So I would like to understand which hash method I should use to create the id, so that it is fast for MongoDB to index it. :)
Yes, MongoDB uses a B-tree. From the documentation:
An index is a data structure that collects information about the values of the specified fields in the documents of a collection. This data structure is used by Mongo's query optimizer to quickly sort through and order the documents in a collection. Formally speaking, these indexes are implemented as "B-Tree" indexes.
I suggest using MongoDB's ObjectId for the collection's _id, and not worrying about "how to create _id" at all, because that is a task for MongoDB, not for the developer. I suppose it's better to care about the schema, indexes, etc.
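For example, with PyMongo you normally don't generate _id at all; if you need one client-side, bson.ObjectId does it for you. The connection string, database and collection names below are made-up examples.

    from bson import ObjectId
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")  # assumed local server
    coll = client.mydb.users

    # Option 1: omit _id and let the driver assign an ObjectId automatically.
    coll.insert_one({"name": "alice"})

    # Option 2: generate it explicitly if you need the id before inserting.
    doc_id = ObjectId()
    coll.insert_one({"_id": doc_id, "name": "bob"})
    print(doc_id)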
For MongoDB 3.2+, the default storage engine is WiredTiger, and a B+ tree is used to store data.
WiredTiger maintains a table's data in memory using a data structure called a B-Tree ( B+ Tree to be specific), referring to the nodes of a B-Tree as pages. Internal pages carry only keys. The leaf pages store both keys and values.
An LSM tree can also be used to store data:
WiredTiger supports Log-Structured Merge Trees, where updates are buffered in small files that fit in cache for fast random updates, then automatically merged into larger files in the background so that read latency approaches that of traditional Btree files. LSM trees automatically create Bloom filters to avoid unnecessary reads from files that cannot contain matching keys.
LSM
Pros:
* More space efficient by using append-only writes and periodic compaction
* Better performance with fast-growing data and sequential writes
* No fragmentation penalty because of how SSTable files are written and updated
Cons:
* CPU overhead for compaction can meaningfully affect performance and efficiency if not tuned appropriately
* More tuning options increase flexibility but can seem complex to developers and operators
* Read/search performance can be optimized with the use of bloom filters

B+ Tree
Pros:
* Excellent performance with read-intensive workloads
Cons:
* Increased space overhead to deal with fragmentation
* Uses random writes, which causes slower create/insert behavior
* Concurrent writes may require locks, which slows write performance
* Scaling challenges, especially with >50% write transactions
Advice on choosing:
If you don't require extreme write throughput, a B-tree is likely the better choice: read throughput is better, and high volumes of writes can still be sustained.
If you have a workload that requires very high write throughput, LSM is the best choice.
Source:
LSM vs B tree
WiredTiger Doc
