Let's assume I have forum software and would like to sort the threads by the number of views each one has. The views would be stored in a counter.
Having experience with relational databases, I thought this would be simple to solve, but it turns out it's not. I have thought about creating one massive row with the columns being counters (and thus sorted), but since a single row can only be stored on a single node, this does not seem feasible (it defeats the point of using Cassandra).
How can I sort by counter column in Cassandra?
You can't sort big data. That's one of the fundamental assumptions.
The only things you can sort by in Cassandra are the things Cassandra uses to store its data: the row key and the column key.
Moving to NoSQL from normal SQL you have to drop the notion of being able to sort/join data. It's just (generally) not possible in Big Data implementations.
To update on this question:
Korya is correct that you cannot assume that every NoSQL or Big Data store is unable to sort (MongoDB is NoSQL and can sort).
Regarding Cassandra itself: you can sort on any element of your primary key that comes AFTER the partition key in a composite key:
Example:
Primary Key ((A),B,C,D);
A is your partition Key.
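B, C and D are the remaining parts of your composite key, and rows can be sorted on them ASC (the default) or DESC. If you want something naturally latest first (e.g. by time), you would specify DESC for that column in your schema. The clause takes this form (shown here with two ascending columns):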
WITH CLUSTERING ORDER BY (media_type_id ASC,media_id ASC);
As far as the question goes about counters:
You cannot sort by the counter inside Cassandra, because the counter would need to be part of the key, and the key is the unique, fixed identifier of the row rather than a value that changes with every view.
As Martin pointed out, the solution referenced in eBay's write-up uses two tables to keep track.
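To make the two-table idea concrete, here is a minimal in-memory sketch in Python. It is only one plausible reading of that approach, not necessarily what the eBay write-up does: one structure plays the role of the counter table, and a second one is kept sorted by view count so that "top N threads" is a plain ordered read. The class and method names (ViewRanking, record_view, top) are made up for illustration, and a real Cassandra version would also have to deal with the read-then-write not being atomic.

```python
import bisect


class ViewRanking:
    """In-memory stand-in for two tables: a counter table and a ranking table."""

    def __init__(self):
        self.views = {}     # "table 1": thread_id -> view count
        self.by_views = []  # "table 2": (view_count, thread_id), kept sorted ascending

    def record_view(self, thread_id):
        old = self.views.get(thread_id, 0)
        new = old + 1
        self.views[thread_id] = new
        if old:
            # Remove the stale ranking entry for the previous count.
            idx = bisect.bisect_left(self.by_views, (old, thread_id))
            del self.by_views[idx]
        # Insert the entry for the new count at its sorted position.
        bisect.insort(self.by_views, (new, thread_id))

    def top(self, n):
        """Return up to n thread ids, most viewed first."""
        if n <= 0:
            return []
        return [thread_id for _, thread_id in reversed(self.by_views[-n:])]


ranking = ViewRanking()
for thread in ["a", "b", "c", "a", "c", "a"]:
    ranking.record_view(thread)
print(ranking.top(2))
```

The point is that the sort order lives in the second structure's key, so every counter change becomes a delete plus an insert there.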
I am creating a DynamoDB table to support an Alexa Skill for use as a podcast player. The way I envision the table is to use the episode number as the Partition Key and the PublicationDate as the optional Sort Key. I have two concerns about designing my table schema in this way.
First, say I wanted to query the table to get the latest episode - I'm not sure that I can do it in this fashion, as a query requires an equivalence operation on the Partition Key (episode = X), which I wouldn't know in advance. Am I correct in believing that a scan would be quite an expensive operation if the podcast has a large number of episodes (say more than 1000)?
I would need to look at each item in the table, compare its episode number (Partition Key value) to the previous returned Item and update a variable with the more recent Item each time one was found until all Items in the table were cycled through in this way.
Secondly, DynamoDB best practices say two things which work incongruently in my use-case (probably a sign that my design is flawed). First, the Partition Key should be unique or close to unique. Second, queries should be expected to be more or less uniformly dispersed amongst the keys. In my case, though, while the Partition Key would indeed be unique, I would expect the vast majority of queries to be targeting the latest Partition Key in the table, for the Item containing data for the latest podcast episode. What would be the impact on performance if, say for example, the skill gets 1000 queries on any given day all aimed at a single Partition Key?
Does anyone have a better table architecture solution for this type of data?
Thanks to everyone in advance!
Question 1:
First, say I wanted to query the table to get the latest episode - I'm
not sure that I can do it in this fashion, as a query requires an
equivalence operation on the Partition Key (episode = X), which I
wouldn't know in advance. Am I correct in believing that a scan would
be quite an expensive operation if the podcast has a large number of
episodes (say more than 1000)?
You are right that you would NOT be able to query for the latest episode, because each episode is in its own Partition. Partitions are almost like separate, isolated tables, so there is no way to query across all Partitions without Scanning (as you said).
Question 2:
Secondly, DynamoDB best practices say two things which work
incongruently in my use-case (probably a sign that my design is
flawed). First, the Partition Key should be unique or close to unique.
Second, queries should be expected to be more or less uniformly
dispersed amongst the keys. In my case, though, while the Partition
Key would indeed be unique, I would expect the vast majority of
queries to be targeting the latest Partition Key in the table, for the
Item containing data for the latest podcast episode. What would be the
impact on performance if, say for example, the skill gets 1000 queries
on any given day all aimed at a single Partition Key?
The issue here is twofold. AWS expects you to read from (and write to) each partition equally, or close to equally, so what is going to happen is that you pay for Write Units (and Read Units) on partitions you are NOT using.
Exactly how much more that is going to cost you depends on how often you QUERY the database. However, reading is much cheaper than writing, and 1000 reads is basically nothing on a table with 1000 items, i.e. you MIGHT be able to get away with it, but it's not ideal.
Alternate Table Schema / Key Design
What other Queries will you make? ie. other than "Check for latest Episode"
How many Podcasts are added per day? week? year?
Are there multiple 'shows' or categories that could be used for Partition Keys that might have more even distribution and could be 'known'?
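If something like a show name turned out to be a usable Partition Key, with the episode number as the Sort Key, "latest episode" becomes an ordinary Query instead of a Scan. In DynamoDB terms that would be a Query on the partition key read in descending sort-key order (ScanIndexForward set to false) with a Limit of 1. The sketch below only simulates that layout in plain Python so the access pattern is visible; the class SortedPartitions, its methods and the show name "my-show" are hypothetical and not part of any AWS SDK.

```python
import bisect
from collections import defaultdict


class SortedPartitions:
    """Toy model of a table with a partition key and a sort key."""

    def __init__(self):
        self._keys = defaultdict(list)  # partition key -> sorted list of sort keys
        self._items = {}                # (partition key, sort key) -> item

    def put(self, partition_key, sort_key, item):
        if (partition_key, sort_key) not in self._items:
            bisect.insort(self._keys[partition_key], sort_key)
        self._items[(partition_key, sort_key)] = item

    def query_latest(self, partition_key, limit=1):
        """Read one partition in descending sort-key order, up to `limit` items."""
        keys = self._keys.get(partition_key, [])
        newest_first = list(reversed(keys))[:limit]
        return [self._items[(partition_key, k)] for k in newest_first]


table = SortedPartitions()
table.put("my-show", 1, {"title": "Pilot"})
table.put("my-show", 3, {"title": "Third"})
table.put("my-show", 2, {"title": "Second"})
print(table.query_latest("my-show"))
```

This keeps the "latest" lookup inside a single known partition, at the cost of concentrating one show's reads on that partition, so it only fits if the read volume is as small as described in the question.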
This is related to "cassandra time series modeling when time can go backward", but I think I have a better scenario to explain why the topic is important.
Imagine I have a simple table
CREATE TABLE measures(
key text,
measure_time timestamp,
value int,
PRIMARY KEY (key, measure_time))
WITH CLUSTERING ORDER BY (measure_time DESC);
The purpose of the clustering key is to have data arranged in decreasing timestamp order. This makes range-based queries very efficient, since for a given key they lead to sequential disk reads (which are intrinsically fast).
Many times I have seen suggestions to use a generated timeuuid as the timestamp value (using now()), which is obviously intrinsically ordered. But you can't always do that, and the exceptions seem very common to me. You can't use it if:
1) your user wants to query on the actual time when the measure was taken, not the time when the measure was written.
2) you use multiple writing threads
So, I want to understand what happens if I write data in an unordered fashion (with respect to measure_time column).
I have personally tested that if I insert timestamp-unordered values, Cassandra indeed reports them to me in a timestamp-ordered fashion when I run a select.
But what happens "under the hood"? In my opinion, it is impossible that data are still ordered on disk. At some point in fact data need to be flushed on disk. Imagine you flush a data set in the time range [0,10]. What if the next data set to flush has measures with timestamp=9? Are data re-arranged on disk? At what cost?
Hope I was clear, I couldn't find any explanation about this on Datastax site but I admit I'm quite a novice on Cassandra. Any pointers appreciated
Sure. Once written, an SSTable file is immutable, so your timestamp=9 will end up in another SSTable, and C* will have to merge and sort data from both SSTables if you request both timestamp=10 and timestamp=9. That is less efficient than reading from a single SSTable.
The Compaction process may merge those two SSTables into new single one. See http://www.datastax.com/dev/blog/when-to-use-leveled-compaction
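To picture what that merge amounts to, here is a small Python sketch. It is a simplification under stated assumptions: each SSTable is modelled as a list of (measure_time, value) pairs already sorted in descending measure_time order (matching the table's clustering order), and overwrites, tombstones and on-disk format are ignored. The function name read_partition is made up for illustration.

```python
import heapq


def read_partition(sstables):
    """Merge several runs, each sorted by measure_time DESC, into one sorted result."""
    return list(heapq.merge(*sstables, reverse=True))


# First flush covered the time range [0, 10]; a late write with timestamp 9
# arrived afterwards and was flushed into a second, separate run.
sstable_1 = [(10, 50), (8, 42), (5, 37), (0, 11)]
sstable_2 = [(9, 45)]

print(read_partition([sstable_1, sstable_2]))
```

Neither run is rewritten at read time; the ordering is restored by merging the sorted runs. Compaction can be thought of as doing the same merge once and writing the result out as a new single run.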
Also try to avoid very wide rows/partitions, which is what you will get if you have a lot of measurements (i.e. a lot of measure_time values) for a single key.
Following the pointers in an ebay tech blog and a datastax developers blog, I model some event log data in Cassandra 1.2. As a partition key, I use “ddmmyyhh|bucket”, where bucket is any number between 0 and the number of nodes in the cluster.
The Data model
cqlsh:Log> CREATE TABLE transactions (yymmddhh varchar, bucket int, rId int, created timeuuid, data map, PRIMARY KEY((yymmddhh, bucket), created) );
(rId identifies the resource that fired the event.)
(The map holds key-value pairs derived from JSON; the keys change, but not much.)
I assume that this translates into a composite primary/row key with X buckets per hour.
My column names are then timeuuids. Querying this data model works as expected (I can query time ranges).
The problem is the performance: the time to insert a new row increases continuously.
So I am doing something wrong, but I can't pinpoint the problem.
When I use the timeuuid as part of the row key, the performance remains stable at a high level, but this would prevent me from querying it (a query without the row key of course throws an error message about "filtering").
Any help? Thanks!
UPDATE
Switching from the map data type to predefined column names alleviates the problem. Insert times now seem to remain below about 0.005s per insert.
The core question remains:
How is my usage of the "map" datatype inefficient? And what would be an efficient way to handle thousands of inserts with only slight variation in the keys?
The keys I put into the map mostly remain the same. I understood the DataStax documentation (can't post a link due to reputation limitations, sorry, but it is easy to find) to say that each key creates an additional column. Or does it create one new column per "map"? That would be hard for me to believe.
I suggest you model your rows a little differently. The collections aren't very good to use in cases where you might end up with too many elements in them. The reason is a limitation in the Cassandra binary protocol which uses two bytes to represent the number of elements in a collection. This means that if your collection has more than 2^16 elements in it the size field will overflow and even though the server sends all of the elements back to the client, the client only sees the N % 2^16 first elements (so if you have 2^16 + 3 elements it will look to the client as if there are only 3 elements).
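The arithmetic of that overflow is easy to check. The snippet below is not Cassandra client code; it just packs an element count into a two-byte unsigned field, the way the size field is described above, and reads it back. The helper name size_seen_by_client is made up for illustration.

```python
import struct


def size_seen_by_client(actual_count):
    """Return the element count that survives a round trip through a 2-byte field."""
    # Only the low 16 bits fit into an unsigned short ("H"), big-endian (">").
    wire = struct.pack(">H", actual_count % 2**16)
    (seen,) = struct.unpack(">H", wire)
    return seen


print(size_seen_by_client(2**16 - 1))  # largest count that still fits
print(size_seen_by_client(2**16 + 3))  # wraps around
```

With 2^16 + 3 elements the field wraps to 3, which is the N % 2^16 behaviour described above.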
If there is no risk of getting that many elements into your collections, you can ignore this advice. I would not think that using collections gives you worse performance; I'm not really sure how that would happen.
CQL3 collections are basically just a hack on top of the storage model (and I don't mean hack in any negative sense). You can make a MAP-like row yourself that is not constrained by the above limitation:
CREATE TABLE transactions (
yymmddhh VARCHAR,
bucket INT,
created TIMEUUID,
rId INT,
key VARCHAR,
value VARCHAR,
PRIMARY KEY ((yymmddhh, bucket), created, rId, key)
)
(Notice that I moved rId and the map key into the primary key. I don't know what rId is, but I assume that this would be correct.)
This has two drawbacks compared to using a MAP: it requires you to reassemble the map when you query the data (you get back a row per map entry), and it uses a little more space, since C* will insert a few extra columns. The upside is that there is no problem with collections getting too big.
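Reassembling the map on the client is only a few lines. Here is a sketch in Python, assuming the query returns rows as (created, rId, key, value) tuples in that column order; the function name reassemble_maps and the sample values are made up for illustration, and no particular driver API is implied.

```python
from collections import defaultdict


def reassemble_maps(rows):
    """Group one-row-per-map-entry results back into one dict per (created, rId)."""
    maps = defaultdict(dict)
    for created, r_id, key, value in rows:
        maps[(created, r_id)][key] = value
    return dict(maps)


rows = [
    ("t1", 7, "status", "ok"),
    ("t1", 7, "user", "alice"),
    ("t2", 7, "status", "failed"),
]
print(reassemble_maps(rows))
```

Because created, rId and key are clustering columns, the entries of one event come back next to each other, so this grouping could also be done in a streaming fashion.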
In the end it depends a lot on how you want to query your data. Don't optimize for insertions, optimize for reads. For example: if you don't need to read back the whole map every time, but usually just read one or two keys from it, put the key in the partition/row key instead and have a separate partition/row per key (this assumes that the set of keys will be fixed so you know what to query for, so as I said: it depends a lot on how you want to query your data).
You also mentioned in a comment that the performance improved when you increased the number of buckets from three (0-2) to 300 (0-299). The reason for this is that you spread the load much more evenly throughout the cluster. When you have a partition/row key that is based on time, like your yymmddhh, there will always be a hot partition where all writes go (it moves throughout the day, but at any given moment it will hit only one node). You correctly added a smoothing factor with the bucket column/cell, but with only three values the likelihood of at least two ending up on the same physical node is too high. With three hundred you will have a much better spread.
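A rough back-of-the-envelope calculation shows the difference. It assumes each bucket's partition lands on a node uniformly and independently, which ignores replication and the actual token layout, and the six-node cluster is a made-up example. Under that assumption the expected fraction of nodes that receive none of the current hour's buckets is ((nodes - 1) / nodes) ^ buckets.

```python
from fractions import Fraction


def expected_idle_fraction(nodes, buckets):
    """Expected fraction of nodes receiving no bucket, assuming uniform independent placement."""
    return Fraction(nodes - 1, nodes) ** buckets


# Hypothetical six-node cluster.
print(float(expected_idle_fraction(6, 3)))    # three buckets
print(float(expected_idle_fraction(6, 300)))  # three hundred buckets
```

With three buckets more than half of the nodes are expected to take no writes in a given hour, while with three hundred the idle fraction is effectively zero.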
Use yymmddhh as the row key and bucket+timeUUID as the column name, where each bucket holds 20 (or some other fixed number of) records. The buckets can be managed using a counter column family.
How do I join two record sets using MapReduce? Most of the solutions, including those posted on SO, suggest that I emit the records keyed on the common key and, in the reducer, add them to, say, a HashMap and then take a cross product (e.g. Join of two datasets in Mapreduce/Hadoop).
This solution is very good and works for the majority of cases, but my case is rather different. I am dealing with data that has billions of records, and taking a cross product of two sets is impossible because in many cases the HashMap will end up holding a few million objects. So I run into a heap space error.
I need a much more efficient solution. The whole point of MR is to deal with very large amounts of data, so I want to know if there is any solution that can help me avoid this issue.
I don't know if this is still relevant for anyone, but I am facing a similar issue these days. My intention is to use a key-value store, most likely Cassandra, and use it for the cross product. This means:
When running on a line of type A, look for the key in Cassandra. If exists - merge A records into the existing value (B elements). If not - create a key, and add A elements as value.
When running on a line of type B, look for the key in Cassandra. If exists - merge B records into the existing value (A elements). If not - create a key, and add B elements as value.
This would require an additional server for Cassandra, and probably some disk space, but since I'm running in the cloud (Google's bdutil Hadoop framework), I don't think it should be much of a problem.
You should look into how Pig does skew joins. The idea is that if your data contains too many values with the same key (even if there is no data skew), you can create artificial keys and spread out the key distribution. This makes sure that each reducer gets fewer records than it otherwise would. For example, if you were to prefix "1" to 50% of the records with key "K1" and "2" to the other 50%, half of the records would end up on one reducer (1K1) and the other half on another (2K1).
If the distribution of the key values is not known beforehand, you could use some kind of sampling algorithm.
This article (xrds:article), in the subsection "An Example of the Tradeoff", describes a way (the first one) in which every single record is joined with all the other records of the input file. I wonder how that could be possible in MapReduce without passing the whole input file to a single mapper.
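The artificial-key idea can be sketched in a few lines of Python. This is a simplified illustration, not Pig's actual implementation: records on the large side get a random salt added to their key, and records on the other side are copied once per salt so that every salted key still meets all of its partners. All function names and the parameter num_salts are made up for this example.

```python
import random
from collections import defaultdict


def salt_large_side(records, num_salts, rng):
    """Give each (key, value) record one random salt, spreading a hot key over num_salts keys."""
    return [((rng.randrange(num_salts), key), value) for key, value in records]


def replicate_other_side(records, num_salts):
    """Copy each (key, value) record once per salt so it matches every salted variant."""
    return [((salt, key), value) for key, value in records for salt in range(num_salts)]


def join_salted(left_salted, right_replicated):
    """Join on the salted key, then strip the salt from the output."""
    right_by_key = defaultdict(list)
    for salted_key, value in right_replicated:
        right_by_key[salted_key].append(value)
    return [
        (salted_key[1], left_value, right_value)
        for salted_key, left_value in left_salted
        for right_value in right_by_key.get(salted_key, [])
    ]


left = [("K1", i) for i in range(4)]
right = [("K1", "x"), ("K1", "y")]
salted = salt_large_side(left, num_salts=2, rng=random.Random(0))
print(sorted(join_salted(salted, replicate_other_side(right, num_salts=2))))
```

The join result is the same as without salting; what changes is that the work for "K1" is split across num_salts reducer keys, in exchange for sending num_salts copies of the other side.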
There are three major types of joins (there are a few others out there) for MapReduce.
Reduce Side Join - For both data sets, you output the "foreign key" as the output key of the mapper. You use something like MultipleInputs to load two data sets at once. In the reducer, data from both data sets is brought together by foreign key, which allows you to do the join logic (like Cartesian product, perhaps) there. This is general purpose and will work for just about every situation.
Replicated Join - You push out the smaller data set into the DistributedCache. In each mapper, you load the smaller data set from there into memory. As records pass through the mapper, join the data up against the in-memory data set. This is what you suggest in your question. It should only be used when the smaller data set can be stored in memory.
Composite Join - This one is a bit niche because it needs to be set up. If two data sets are sorted and partitioned by the foreign key, then you can do a composite join using CompositeInputFormat. It basically does a merge-like operation that is pretty efficient.
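To show the shape of the first two join types without any Hadoop setup, here is a plain-Python sketch. It only imitates the data flow (grouping by foreign key stands in for the shuffle, a dict stands in for the DistributedCache copy); the function names and sample records are made up, and real jobs would use Mapper/Reducer classes instead.

```python
from collections import defaultdict
from itertools import product


def reduce_side_join(left, right):
    """Group both inputs by foreign key, then pair them up per key (inner join)."""
    # "Map" phase: emit every record under its foreign key, remembering its source.
    shuffled = defaultdict(lambda: ([], []))
    for key, value in left:
        shuffled[key][0].append(value)
    for key, value in right:
        shuffled[key][1].append(value)
    # "Reduce" phase: for each key, combine every left value with every right value.
    joined = []
    for key in sorted(shuffled):
        lefts, rights = shuffled[key]
        joined.extend((key, l, r) for l, r in product(lefts, rights))
    return joined


def replicated_join(big, small):
    """Hold the small input in memory and stream the big input past it."""
    lookup = defaultdict(list)
    for key, value in small:
        lookup[key].append(value)
    return [(key, b, s) for key, b in big for s in lookup.get(key, [])]


users = [(1, "ann"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (3, "cup")]
print(reduce_side_join(users, orders))
print(replicated_join(orders, users))
```

In the reduce-side version only the values of a single key are held together at once, whereas the replicated version needs the whole smaller data set in memory on every mapper.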
Shameless plug for my book MapReduce Design Patterns: there is a whole chapter on joins (chapter 5).
Check out the code examples for the book here: https://github.com/adamjshook/mapreducepatterns/tree/master/MRDP/src/main/java/mrdp/ch5