I keep coming across these terms: high cardinality and low cardinality in Cassandra.
I don't understand what exactly they mean. what effects they have on queries and what is preferred.
Please explain with example since that will be be easy to follow.
The cardinality of X is nothing more than the number of elements that compose X.
In Cassandra the partition key cardinality is very important for partitioning data.
Since the partition key is responsible for the distribution of the data across the cluster, choosing a low cardinality key might lead to a situation in which your data are not distributed.
Imagine you have a cluster of 20 nodes storing comments -- the Replication Factor (RF) is 2. Each comment has it's own vote going from 1 to 5. Now, since you want to easily retrieve comments by votes, you might be tempted to choose vote as partition key.
CREATE TABLE comments(vote int, content text, id uuid, PRIMARY KEY(vote, id));
In this situation the only key responsible for data distribution is vote, which has a very low cardinality since it can contains only 5 values (1,2,3,4,5). This means that, in the best situation 5 different nodes will be the owners of the 5 different partitions (which are "all comments with vote 1" ... "all comments with vote 5"), and again in best situation, with a RF of 2, 10 different nodes will hold your data. As you can see you have a 20 nodes cluster which isn't used more than 50% in best situation.
Data distribution is very important, that's why partition key cardinality matters a lot
HTH,
Carlo
Related
I have a 3 tables structure, Customer, Invoice, InvoiceItem that I would like to try to move from the relational DB and store it in CosmosDB. Currently, there are quite intensive queries being run on the InvoiceItem table. This InvoiceItem table has up to 10 optional TagX columns that are basically text that might include the brand, group, type, or something that would group this InvoiceItem and make it searchable by saying (simplified):
SELECT * FROM InvoiceItem WHERE Tag1 LIKE '%shirt%' AND Tag2 LIKE '%training%'
A query like this on a multi-million table can take more than 8 minutes. We are working on the archiving strategy and indexes to speed up the process but it looked to me like CosmosDB could be worth trying in this case, since all of the data is write-once-read-many scenario.
Back to CosmosDB, how do I deal with those string tags in CosmosDB. As a start, I thought about having Invoice and InvoiceItem in the same partition with "type" property that would differ them. But then I cannot stick the tags anywhere so they would be easily searchable. Any ideas on how to set it up?
Thanks!
Textbook database performance issue caused by either lack of, or inefficient indexing.
With that many rows, index cardinality becomes important. You don't want to index the entire field, you only want to index the first n characters of the columns you're indexing, and only index columns you are searching, whether by join or direct where clauses.
The idea is to keep the indexes as small as possible, while still giving you the query performance you need.
With 18 million rows you probably want to start with an index cardinality of the square root of 18m.
That means to hit the index segment you need, you only want to search no more than 5000 index rows, each of which have 400-5000 rows in their segment, at least for sub-second result times.
indexing the first 3-4 letters would be a good starting point. Based on the square root of 18000000 being 4242 and the nearest exponent of 26(3) (assuming alpha characters only) overshooting that. Even if alpha-numeric, 3 characters is still a good starting point.
If the queries then run super fast, but the index takes forever to build, drop a character. This is called "index tuning". You pick a starting point and find the largest cardinality (lowest number of characters indexed) that gives you the performance you need.
If I'm way off because index performance in this DB is way off the mark of a relational db, you'll need to experiment.
As far as I'm concerned, a select query that takes more than a few seconds is unacceptable, except in rare cases. I once worked for a security company. Their license management system took minutes to pull large customers.
After indexing the tables correctly the largest customer took less than 2 seconds. I had to sift through a table with billions of rows for number of downloads, and some of these queries had 7 joins.
If that database can't do this with 18m rows, I'd seriously consider a migration to a better architecture, hardware, software or otherwise.
As index cardinality increases, the performance gains drop to negative as the index cardinality approaches table cardinality, as compared to no index.
As in all things in life, moderation. At the other end of the spectrum, an index with a cardinality of 2 is just about useless. Half of 8 minutes is 4 minutes, assuming a nearly equal distribution.... useless, so indexing a boolean field isn't a great thing to do, usually. There are few hard and fast rules though. Lots of edge cases. Experimentation is your friend.
I am creating a DynamoDB table to support an Alexa Skill for use as a podcast player. The way I envision the table is to use the episode number as the Partition Key and the PublicationDate as the optional Sort Key. I have two concerns about designing my table schema in this way.
First, say I wanted to query the table to get the latest episode - I'm not sure that I can do it in this fashion, as a query requires an equivalence operation on the Partition Key (episode = X), which I wouldn't know in advance. Am I correct in believing that a scan would be quite an expensive operation if the podcast has a large number of episodes (say more than 1000)?
I would need to look at each item in the table, compare its episode number (Partition Key value) to the previous returned Item and update a variable with the more recent Item each time one was found until all Items in the table were cycled through in this way.
Secondly, DynamoDB best practices say two things which work incongruently in my use-case (probably a sign that my design is flawed). First, the Partition Key should be unique or close to unique. Second, queries should be expected to be more or less uniformly dispersed amongst the keys. In my case, though, while the Partition Key would indeed be unique, I would expect the vast majority of queries to be targeting the latest Partition Key in the table, for the Item containing data for the latest podcast episode. What would be the impact on performance if, say for example, the skill gets 1000 queries on any given day all aimed at a single Partition Key?
Does anyone have a better table architecture solution for this type of data?
Thanks to everyone in advance!
Question 1:
First, say I wanted to query the table to get the latest episode - I'm
not sure that I can do it in this fashion, as a query requires an
equivalence operation on the Partition Key (episode = X), which I
wouldn't know in advance. Am I correct in believing that a scan would
be quite an expensive operation if the podcast has a large number of
episodes (say more than 1000)?
You are right that you would NOT be able to query for the latest episode because each episode is in their own Partition. Partitions are almost like different isolated tables so there is no way to query across all Partitions without Scanning (as you said).
Question 2:
Secondly, DynamoDB best practices say two things which work
incongruently in my use-case (probably a sign that my design is
flawed). First, the Partition Key should be unique or close to unique.
Second, queries should be expected to be more or less uniformly
dispersed amongst the keys. In my case, though, while the Partition
Key would indeed be unique, I would expect the vast majority of
queries to be targeting the latest Partition Key in the table, for the
Item containing data for the latest podcast episode. What would be the
impact on performance if, say for example, the skill gets 1000 queries
on any given day all aimed at a single Partition Key?
The issue here is two fold, AWS expects you to be reading (and writing) equally to each partition (or close to equally) so basically what is going to happen is you are going to pay for Write Units (and Read Units) on the partitions you are NOT using, even though you are not using them.
Exactly how much more that is going to run you is going to depend on the number of times you QUERY the database, however, Reading is much cheaper than writing and 1000 reads is basically nothing on a table with 1000 items. ie. You MIGHT be able to get away with it but it's not ideal.
Alternate Table Schema / Key Design
What other Queries will you make? ie. other than "Check for latest Episode"
How many Podcasts are added per day? week? year?
Are there multiple 'shows' or categories that could be used for Partition Keys that might have more even distribution and could be 'known'?
I'm trying to model a database of users. These users have various vital statistics: age, sex, height, weight, hair color, etc.
I want to be able to write queries like these:
get all users 5'1" to 6'0" tall with red hair who weigh more than 100 pounds
or
get all users who are men who are 6'0" are ages 31-37 and have black hair
How can I model my data in order to make these queries? Let's assume this database will hold billions of users. I can't think of an approach that wouldn't require me to make MANY requests or cluster the data on VERY few nodes.
EDIT:
Just a little more background, let's assume this thought problem is to build a dating website. The site should allow users to filter people based on the aforementioned criteria (age, sex, height, weight, hair, etc.). These filters are optional, and you can have as many as you want. This site has 2 billion users. Is that something that can be achieved through data modeling alone?
IF I UNDERSTAND THINGS CORRECTLY
If I have 2 billion users and I create both of the tables mentioned in the first answer (assuming options of male and female for sex, and blonde, brown, red for hair color), I will, for the first table, be putting at most 2 billion records on one node if everyone has blonde hair. Best case scenario, 2/3 billion records on three nodes. In the second case, I will be putting 2/5 billion records on each node in the best case with the same worst case. Am I wrong? Shouldn't the partition keys be more unique than that?
So if you are trying to model you data inside Cassandra then the general rule is that you need to make a table per query. There are also significant restrictions on what you can filter your query by. If you want to understand some of the restrictions I suggest you take a look at this post:
http://www.datastax.com/dev/blog/a-deep-look-to-the-cql-where-clause
or my long answer here:
cassandra - how to perform table query?
All of the above only applies if you are running fixed queries that are known ahead of time. If instead you are looking to perform some sort of analytical analysis on your data (it sounds like you might be) than I would look at using Spark in conjunction with Cassandra. This will provide you a fast tool to do in-memory processing of your data. If you look at using Datastax (Community or Enterprise) then Spark also has a connector that makes reading and writing data to and from Cassandra easy.
Edited with Additional Information
Based on the query "get all users 5'1" to 6'0" tall with red hair who weigh more than 100 pounds" you would need to build a table with following:
CREATE TABLE user_by_haircolor_weight_height (
haircolor text,
weight float,
height_in int,
user varchar,
PRIMARY KEY ((haircolor), weight, height_in)
);
You could then query this by:
SELECT * from user_by_haircolor_weight_height where haircolor='red' and weight>100 and height_in>61 and height_in<73;
For the query "get all users who are men who are 6'0" are ages 31-37 and have black hair" you would need to build a similar table with a
PRIMARY KEY ((haircolor, sex), height_in, age)
In the end if what you are trying to do is perform either ad-hoc or a set number analytics (i.e. can have a bit more latency than a straight CQL query) on the data stored in you cassandra table than I suggest you look at using Spark. If you need something a bit more real-time to handle ad-hoc queries you can look at using Solr to perform Lucene powered searches on your table.
my recommendation is :
1) keep main table with proper partition key, so that million records being spread across cluster, don't here use any cluster column which will cross row key limitation of 2gb etc.,
2) depending on query pattern you may better create additional tables(like index) as much as possible to keep inverted index data in it. coz write is cheap.
3) use multiple query to get what you need.
4) last option is, use DSE solr search capability.
Just to reiterate the end of the conversation:
"Your understanding is correct and you are correct in stating that partition keys should be more unique than that. Each partition had a maximum size of 2GB but a practical limit is lower. In practice you would want your data partitioned into far smaller chunks that the table above. Given the ad-hoc nature of your queries in your example I do not think you would be able to practically do this by data modelling alone. I would suggest looking at using a Solr index on a table. This would allow you a robust search capability. If you use Datastax you are even able to query this via CQL"
Cassandra alone is not a good candidate for this sort of complex filtering across a very large data set.
Can anyone explain how the cost is evaluated in an Oracle explain plan?
Is there any specific algorithm to determine the cost of the query?
For example: full table scans have higher cost, index scan lower... How does Oracle evaluate the cases for full table scan, index range scan, etc.?
This link is same as what I am asking: Question about Cost in Oracle Explain Plan
But can anyone explain with an example, we can find the cost by executing explain plan, but how does it work internally?
There are many, many specific algorithms for computing the cost. Far more than could realistically be discussed here. Jonathan Lewis has done an admirable job of walking through how the cost-based optimizer decides on the cost of a query in his book Cost-Based Oracle Fundamentals. If you're really interested, that's going to to be the best place to start.
It is a fallacy to assume that full table scans will have a higher cost than, say, an index scan. It depends on the optimizer's estimates of the number of rows in the table and the optimizer's estimates of the number of rows the query will return (which, in turn, depends on the optimizer's estimates of the selectivity of the various predicates), the relative cost of a sequential read vs. a serial read, the speed of the processor, the speed of the disk, the probability that blocks will be available in the buffer cache, your database's optimizer settings, your session's optimizer settings, the PARALLEL attribute of your tables and indexes, and a whole bunch of other factors (this is why it takes a book to really start to dive into this sort of thing). In general, Oracle will prefer a full table scan if your query is going to return a large fraction of the rows in your table and an index access if your query is going to return a small fraction of the rows in your table. And "small fraction" is generally much smaller than people initially estimate-- if you're returning 20-25% of the rows in a table, for example, you're almost always better off using a full table scan.
If you are trying to use the COST column in a query plan to determine whether the plan is "good" or "bad", you're probably going down the wrong path. The COST is only valid if the optimizer's estimates are accurate. But the most common reason that query plans would be incorrect is that the optimizer's estimates are incorrect (statistics are incorrect, Oracle's estimates of selectivity are incorrect, etc.). That means that if you see one plan for a query that has a cost of 6 and a plan for a different version of that query that has a cost of 6 million, it is entirely possible that the plan that has a cost of 6 million is more efficient because the plan with the low cost is incorrectly assuming that some step is going to return 1 row rather than 1 million rows.
You are much better served ignoring the COST column and focusing on the CARDINALITY column. CARDINALITY is the optimizer's estimate of the number of rows that are going to be returned at each step of the plan. CARDINALITY is something you can directly test and compare against. If, for example, you see a step in the plan that involves a full scan of table A with no predicates and you know that A has roughly 100,000 rows, it would be concerning if the optimizer's CARDINALITY estimate was either way too high or way too low. If it was estimating the cardinality to be 100 or 10,000,000 then the optimizer would almost certainly be either picking the table scan in error or feeding that data into a later step where its cost estimate would be grossly incorrect leading it to pick a poor join order or a poor join method. And it would probably indicate that the statistics on table A were incorrect. On the other hand, if you see that the cardinality estimates at each step is reasonably close to reality, there is a very good chance that Oracle has picked a reasonably good plan for the query.
Another place to get started on understanding the CBO's algorithms is this paper by Wolfgang Breitling. Jonathan Lewis's book is more detailed and more recent, but the paper is a good introduction.
In the 9i documentation Oracle produced an authoratative looking mathematical model for cost:
Cost = (#SRds * sreadtim +
#MRds * mreadtim +
#CPUCycles / cpuspeed ) / sreadtim
where:
#SRDs is the number of single block reads
#MRDs is the number of multi block reads
#CPUCycles is the number of CPU Cycles *)
sreadtim is the single block read time
mreadtim is the multi block read time
cpuspeed is the CPU cycles per second
So it gives a good idea of the factors which go into calculating cost. This was why Oracle introduced the capability to gather system statistics: to provide accurate values for CPU speed, etc
Now we fast forward to the equivalent 11g documentation and we find the maths has been replaced with a cursory explanation:
"Cost of the operation as estimated by the optimizer's query approach.
Cost is not determined for table access operations. The value of this
column does not have any particular unit of measurement; it is merely
a weighted value used to compare costs of execution plans. The value
of this column is a function of the CPU_COST and IO_COST columns."
I think this reflects the fact that cost just isn't a very reliable indicator of execution time. Jonathan Lewis recently posted a pertinent blog piece. He shows two similar looking queries; their explain plans are different but they have identical costs. Nevertheless when it comes to runtime one query performs considerably slower than the other. Read it here.
I'm interested to find out if there is a performance benefit to partitioning a numeric column that is often the target of a query. Currently I have a materialized view that contains ~50 million records. When using a regular b-tree index and searching by this numeric column I get a cost of 7 and query results in about 0.8 seconds (with non-primed cache). After adding a global hash partition (with 64 partitions) for that column I get a cost of 6 and query results in about 0.2 seconds (again with non-primed cache).
My first reaction is that the partitioned index has improved the performance of my query. However, I realize that this may just be a coincidence and could be totally dependent on the values being searched on, or others I'm not aware of. So my question is: is there a performance benefit to adding a global hash partition to a numeric column on a large table or is the cost of determining which index partitions to scan out-weighed by the cost of just doing a full range scan on a non-indexed partition?
I'm sure this, like many Oracle questions, can be answered with an "it depends." :) I'm interested in learning what factors I should consider to determine the benefits of each approach.
Thanks!
I'm pretty sure you have found this reference in your research - Partitioned Tables and Indexes. However I give a link to it if somebody is interested, this is a very good material about partitioning.
Straight to the point - Partitioned index just decomposes the index into pieces (16 in your situation) and spread the data depending on their hashed partitioning key. When you want to use it, Oracle "calculates" the hash of the key and determine in which section to continue with searching.
Knowing how index searching works, on really huge data I think it is better to choose the partitioned index in order to decrease the index tree you traverse (regular index). It really depends on the data, which is in the table (how regular index tree is composed) and is hashing and direct jump to lower node faster than regular tree traverse from the start node.
Finally, you must be more confident with the test results. If one technique gives better results on your exact data than some other don't worry to implement it.