Is there a way to make RethinkDB generate the primary key automatically and ensure the key is in increasing order, like say 1 to n?
I know that when we insert a row into RethinkDB it automatically generates a primary key and returns it in generated_keys, but I want a primary key that increases linearly, say starting from 4000, or from 5000, and so on.
I am not aware of a way to do this in RethinkDB. I do know, however, that it is not a feature that would scale well in a cluster of DB servers, as it would introduce a bottleneck in the insert commands.
It is always possible to do it by hand, though, simply by providing an id field in the documents you insert into the table.
I am creating a DynamoDB table to support an Alexa Skill for use as a podcast player. The way I envision the table is to use the episode number as the Partition Key and the PublicationDate as the optional Sort Key. I have two concerns about designing my table schema in this way.
First, say I wanted to query the table to get the latest episode - I'm not sure that I can do it in this fashion, as a query requires an equivalence operation on the Partition Key (episode = X), which I wouldn't know in advance. Am I correct in believing that a scan would be quite an expensive operation if the podcast has a large number of episodes (say more than 1000)?
I would need to look at each Item in the table, compare its episode number (Partition Key value) to the previously returned Item, keep whichever is more recent, and repeat until every Item in the table had been examined.
Secondly, DynamoDB best practices say two things which work incongruently in my use-case (probably a sign that my design is flawed). First, the Partition Key should be unique or close to unique. Second, queries should be expected to be more or less uniformly dispersed amongst the keys. In my case, though, while the Partition Key would indeed be unique, I would expect the vast majority of queries to be targeting the latest Partition Key in the table, for the Item containing data for the latest podcast episode. What would be the impact on performance if, say for example, the skill gets 1000 queries on any given day all aimed at a single Partition Key?
Does anyone have a better table architecture solution for this type of data?
Thanks to everyone in advance!
Question 1:
First, say I wanted to query the table to get the latest episode - I'm not sure that I can do it in this fashion, as a query requires an equivalence operation on the Partition Key (episode = X), which I wouldn't know in advance. Am I correct in believing that a scan would be quite an expensive operation if the podcast has a large number of episodes (say more than 1000)?
You are right that you would NOT be able to query for the latest episode, because each episode is in its own Partition. Partitions are almost like separate, isolated tables, so there is no way to query across all Partitions without Scanning (as you said).
Question 2:
Secondly, DynamoDB best practices say two things which work incongruently in my use-case (probably a sign that my design is flawed). First, the Partition Key should be unique or close to unique. Second, queries should be expected to be more or less uniformly dispersed amongst the keys. In my case, though, while the Partition Key would indeed be unique, I would expect the vast majority of queries to be targeting the latest Partition Key in the table, for the Item containing data for the latest podcast episode. What would be the impact on performance if, say for example, the skill gets 1000 queries on any given day all aimed at a single Partition Key?
The issue here is twofold: AWS expects you to read (and write) roughly equally across partitions, so in effect you end up paying for Write Units (and Read Units) provisioned for partitions you are barely using.
Exactly how much more that will cost you depends on how often you query the table; however, reads are much cheaper than writes, and 1000 reads is basically nothing on a table with 1000 items. i.e. you MIGHT be able to get away with it, but it's not ideal.
Alternate Table Schema / Key Design
What other Queries will you make? ie. other than "Check for latest Episode"
How many Podcasts are added per day? week? year?
Are there multiple 'shows' or categories that could be used for Partition Keys that might have more even distribution and could be 'known'?
When adding new data through a form, my primary key sequence increases by 1.
However, if I delete a row and then insert new data, the sequence carries on from where it left off.
So, for example, my primary keys go 1, 2, 3, 4, 5, 6, 10 because of previously deleted rows.
I hope that makes sense.
SEQUENCE values in Oracle are guaranteed to be unique, but you cannot expect the values to form a contiguous sequence without any gaps.
Even if you would never delete any rows from the table, you're likely to see gaps at some point, because sequence values are cached (pre-reserved) between different transactions.
It is a SEQUENCE of numbers; it doesn't care whether you have actually used the "current value" or not.
Unlike MySQL's auto-increment, an Oracle sequence is not tied to a column; it is a separate object that you request values from (through your_sequence.nextval). To preserve uniqueness, it never takes back values and offers them again.
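As a minimal sketch (the sequence, table, and column names here are made up for illustration), asking a sequence for values looks like this:

-- Hypothetical sequence; CACHE 20 is the default and is one source of gaps.
CREATE SEQUENCE order_seq START WITH 1 INCREMENT BY 1 CACHE 20;

-- Every NEXTVAL call hands out a new, unique value; cached values that are
-- never used (rollback, instance restart) simply become gaps.
INSERT INTO orders (id, customer_name)
VALUES (order_seq.NEXTVAL, 'Alice');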
If you always want a dense sequence of IDs even after deletions, you would have to either
rearrange the IDs (read: change the IDs of the rows newer than the deleted one), or
keep the stored IDs as they are and compute a dense numbering at query time; without knowing your task, I would suggest using the DENSE_RANK analytic function for this, separating the real (in-table) IDs from the ranking of the rows (see the sketch below).
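A rough sketch of the DENSE_RANK approach, using a hypothetical table my_table with a gappy id column:

-- The stored id keeps its gaps; display_pos is a gap-free numbering
-- computed only at query time.
SELECT id,
       DENSE_RANK() OVER (ORDER BY id) AS display_pos
FROM my_table
ORDER BY id;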
I'm designing a table which has multiple foreign keys. What I did was create an extra column as the primary key, which works more as a sequential surrogate value, but I could also make the foreign keys a composite primary key.
So my question is about performance: is it better (at least in Oracle) to have a composite primary key than a surrogate key plus an index? What is better for my case?
Thanks!
As @Sylvain_Leroux points out, the term "better" is actually very ambiguous depending on your goals, because there are tradeoffs to both approaches.
Ensure Composite Key is Actually Unique
First of all, if you want to use a composite primary key out of the foreign keys, then you must be sure that the combination of the foreign keys will be truly unique for each record. Otherwise, of course, you won't be able to use them as a primary key. If instead you are describing using a composite key made up of the foreign keys plus a surrogate key, that's kind of the worst of both worlds and is generally frowned upon.
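Purely for illustration (all table and column names are invented, and the identity column assumes Oracle 12c or later), the two designs being compared look roughly like this:

-- Option A: composite primary key built from the foreign keys
CREATE TABLE enrollment_a (
    student_id  NUMBER NOT NULL REFERENCES students (id),
    course_id   NUMBER NOT NULL REFERENCES courses (id),
    enrolled_on DATE,
    CONSTRAINT enrollment_a_pk PRIMARY KEY (student_id, course_id)
);

-- Option B: surrogate primary key, with the foreign keys indexed separately
CREATE TABLE enrollment_b (
    enrollment_id NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    student_id    NUMBER NOT NULL REFERENCES students (id),
    course_id     NUMBER NOT NULL REFERENCES courses (id),
    enrolled_on   DATE
);
CREATE INDEX enrollment_b_student_ix ON enrollment_b (student_id);
CREATE INDEX enrollment_b_course_ix  ON enrollment_b (course_id);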
ETL Back Room Considerations
The choice you are considering is a common one in OLAP, where a designer must choose whether or not to use a surrogate key for the fact table or a composite key comprised of the keys of the dimension tables. This advice from page 487 of Ralph Kimball's The Data Warehouse Toolkit Third Edition would therefore apply to your situation (you can consider your table as being analogous to what he describes as a fact table, and the foreign keys are for tables that he refers to as dimensions):
Fact table surrogate keys have a number of uses in the ETL back room. First, as previously described, they can be used as the basis for backing out or resuming an interrupted load. Second, they provide immediate and unambiguous identification of a single fact row without needing to constrain multiple dimensions to fetch a unique row. Third, updates to fact table rows can be replaced by inserts plus deletes because the fact table surrogate key is now the actual key for the fact table. Thus, a row containing updated columns can now be inserted into the fact table without overwriting the row it is to replace. When all such insertions are complete, then the underlying old rows can be deleted in a single step. Fourth, the fact table surrogate key is an ideal parent key to be used in a parent/child design. The fact table surrogate key appears as a foreign key in the child, along with the parent's dimension foreign key.
Performance Considerations
From a performance perspective, if the table is clustered on its primary key (in Oracle, an index-organized table; an ordinary heap table makes no such guarantee), the records are stored on disk in primary-key order. That makes reads that look rows up by the foreign key(s) faster, but it can also make writes slower when they insert records anywhere other than at the end, because the DBMS has to physically make room at the right place (this is slightly oversimplified, since the DBMS has schemes to combat this, but they can be overwhelmed if such inserts are numerous enough).
If you were to use a surrogate key, the insert problem wouldn't be an issue, but of course in situations where you are looking up by foreign keys, you wouldn't get the advantage of having your data in order physically on the disk. Assuming you would put an index on each foreign key, then that would add some overhead to insert tasks because the DBMS has to update multiple indices.
All of this is only noticeable with large amounts of data and will not make much of a difference for a relatively small amount of data.
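In Oracle specifically, the "stored in primary-key order" behaviour described above comes from an index-organized table; a hedged sketch, again with invented names:

-- Rows live inside the primary-key B-tree itself, so they are kept
-- physically ordered by (student_id, course_id).
CREATE TABLE enrollment_iot (
    student_id  NUMBER NOT NULL,
    course_id   NUMBER NOT NULL,
    enrolled_on DATE,
    CONSTRAINT enrollment_iot_pk PRIMARY KEY (student_id, course_id)
) ORGANIZATION INDEX;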
Following the pointers in an eBay tech blog and a DataStax developers blog, I modeled some event log data in Cassandra 1.2. As the partition key, I use “ddmmyyhh|bucket”, where bucket is any number between 0 and the number of nodes in the cluster.
The Data model
cqlsh:Log> CREATE TABLE transactions (
    yymmddhh varchar,
    bucket int,
    rId int,
    created timeuuid,
    data map<text, text>,
    PRIMARY KEY ((yymmddhh, bucket), created)
);
(rId identifies the resource that fired the event.)
(the map holds key-value pairs derived from a JSON document; the keys change, but not much)
I assume that this translates into a composite primary/row key with X buckets per hour.
The column names are then timeuuids. Querying this data model works as expected (I can query time ranges).
The problem is the performance: the time to insert a new row increases continuously.
So I am doing something wrong, but I can't pinpoint the problem.
When I use the timeuuid as part of the row key, the performance remains stable at a high level, but this would prevent me from querying it (a query without the full row key of course throws an error message about "filtering").
Any help? Thanks!
UPDATE
Switching from the map data type to predefined column names alleviates the problem. Insert times now seem to stay below roughly 0.005 s per insert.
The core question remains:
How is my usage of the "map" datatype inefficient? And what would be an efficient way to do thousands of inserts with only slight variation in the keys?
The keys I put into the map mostly remain the same. I understood the DataStax documentation (can't post the link due to reputation limitations, sorry, but it is easy to find) to say that each key creates an additional column -- or does it create one new column per "map"? That would be hard for me to believe.
I suggest you model your rows a little differently. The collections aren't very good to use in cases where you might end up with too many elements in them. The reason is a limitation in the Cassandra binary protocol which uses two bytes to represent the number of elements in a collection. This means that if your collection has more than 2^16 elements in it the size field will overflow and even though the server sends all of the elements back to the client, the client only sees the N % 2^16 first elements (so if you have 2^16 + 3 elements it will look to the client as if there are only 3 elements).
If there is no risk of getting that many elements into your collections, you can ignore this advice. I would not think that using collections gives you worse performance, I'm not really sure how that would happen.
CQL3 collections are basically just a hack on top of the storage model (and I don't mean hack in any negative sense), you can make a MAP-like row that is not constrained by the above limitation yourself:
CREATE TABLE transactions (
yymmddhh VARCHAR,
bucket INT,
created TIMEUUID,
rId INT,
key VARCHAR,
value VARCHAR,
PRIMARY KEY ((yymmddhh, bucket), created, rId, key)
)
(Notice that I moved rId and the map key into the primary key; I don't know what rId is, but I assume this is correct.)
This has two drawbacks compared to using a MAP: it requires you to reassemble the map when you query the data (you get back one row per map entry), and it uses a little more space since C* will insert a few extra columns; but the upside is that there is no problem with collections growing too big.
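For example, reading one time bucket back would look something like this (the partition values are placeholders); the client then regroups the rows by created and rId to rebuild each event's "map":

-- Each map entry comes back as its own row.
SELECT created, rId, key, value
FROM transactions
WHERE yymmddhh = '13041215' AND bucket = 17;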
In the end it depends a lot on how you want to query your data. Don't optimize for insertions, optimize for reads. For example: if you don't need to read back the whole map every time, but usually just read one or two keys from it, put the key in the partition/row key instead and have a separate partition/row per key (this assumes that the set of keys will be fixed so you know what to query for, so as I said: it depends a lot on how you want to query your data).
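A sketch of that per-key variant, assuming the set of keys is known in advance (the table name is made up):

-- One partition per (time bucket, key): reading a single key never
-- touches the other keys' cells.
CREATE TABLE transactions_by_key (
    yymmddhh VARCHAR,
    bucket INT,
    key VARCHAR,
    created TIMEUUID,
    rId INT,
    value VARCHAR,
    PRIMARY KEY ((yymmddhh, bucket, key), created, rId)
);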
You also mentioned in a comment that the performance improved when you increased the number of buckets from three (0-2) to 300 (0-299). The reason for this is that you spread the load much more evenly throughout the cluster. When you have a partition/row key that is based on time, like your yymmddhh, there will always be a hot partition where all writes go (it moves throughout the day, but at any given moment it will hit only one node). You correctly added a smoothing factor with the bucket column/cell, but with only three values the likelihood of at least two ending up on the same physical node is too high. With three hundred you will have a much better spread.
Use yymmddhh as the row key and bucket+timeUUID as the column name, where each bucket holds 20 (or some fixed number of) records; the buckets can be managed using a counter column family.
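This suggestion is phrased in Thrift terms; one possible CQL rendering of it (my interpretation, reusing names from the question) would be:

-- Row key = yymmddhh; "column name" = (bucket, created): all of an hour's
-- events end up in one wide row, split across fixed-size buckets.
CREATE TABLE transactions_wide (
    yymmddhh VARCHAR,
    bucket INT,
    created TIMEUUID,
    rId INT,
    data VARCHAR,
    PRIMARY KEY (yymmddhh, bucket, created)
);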
I've been having some difficulty scaling up the application and decided to ask a question here.
Consider a relational database (say mysql). Let's say it allows users to make posts and these are stored in the post table (has fields: postid, posterid, data, timestamp). So, when you go to retrieve all posts by you sorted by recency, you simply get all posts with posterid = you and order by date. Simple enough.
This process will use timestamp as the index since it has the highest cardinality and correctly so. So, beyond looking into the indexes, it'll take literally 1 row fetch from disk to complete this task. Awesome!
But let's say 1 million more posts have been made (in the system) by other users since you last posted. Then, in order to get your latest post, the database would walk the timestamp index again, and we don't know how many posts have happened since then (or should we manually estimate and force a preferred key?). We would then have wasted looking at a million and one index entries just to fetch a single row.
Additionally, a set of posts from multiple arbitrary users would be one of the use cases, so I cannot make fields like userid_timestamp to create a sub-index.
Am I seeing this wrong? Or what must be changed fundamentally from the application to allow such operation to occur at least somewhat efficiently?
Indexing
If you have a query: ... WHERE posterid = you ORDER BY timestamp [DESC], then you need a composite index on {posterid, timestamp}.
Finding all posts of a given user is done by a range scan on the index's leading edge (posterid).
Finding a user's oldest/newest post can be done in a single index seek, whose cost is proportional to the B-Tree height, which is proportional to log(N), where N is the number of indexed rows.
To understand why, take a look at Anatomy of an SQL Index.
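As a concrete sketch with the post table from the question (only the index name is made up):

CREATE INDEX post_poster_ts ON post (posterid, `timestamp`);

-- Newest post of one user: a single seek to the end of that user's index range
SELECT *
FROM post
WHERE posterid = 123
ORDER BY `timestamp` DESC
LIMIT 1;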
Clustering
The leaves of a "normal" B-Tree index hold "pointers" (physical addresses) to the indexed rows, while the rows themselves reside in a separate data structure called the "table heap". The heap can be eliminated by storing the rows directly in the leaves of the B-Tree, which is called clustering. This has its pros and cons, but if you have one predominant kind of query, eliminating the table heap access through clustering is definitely something to consider.
In this particular case, the table could be created like this:
CREATE TABLE T (
posterid int,
`timestamp` DATETIME,
data VARCHAR(50),
PRIMARY KEY (posterid, `timestamp`)
);
MySQL/InnoDB clusters all its tables and uses the primary key as the clustering key. We haven't used the surrogate key (postid), since secondary indexes in clustered tables can be expensive and we already have a natural key. If you really need the surrogate key, consider making it an alternate key and keeping the clustering established through the natural key.
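If you do need postid, a sketch of keeping it as an alternate key while still clustering on the natural key might look like this:

CREATE TABLE T (
    postid INT NOT NULL,
    posterid INT,
    `timestamp` DATETIME,
    data VARCHAR(50),
    PRIMARY KEY (posterid, `timestamp`),  -- clustering key
    UNIQUE KEY post_ak (postid)           -- alternate (surrogate) key
);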
For queries like
where posterid = 5
order by timestamp
or
where posterid in (4, 578, 222299, ...etc...)
order by timestamp
make an index on (posterid, timestamp) and the database should pick it all by itself.
Edit: I just tried this with MySQL.
CREATE TABLE `posts` (
`id` INT(11) NOT NULL,
`ts` INT NOT NULL,
`data` VARCHAR(100) NULL DEFAULT NULL,
INDEX `id_ts` (`id`, `ts`),
INDEX `id` (`id`),
INDEX `ts` (`ts`),
INDEX `ts_id` (`ts`, `id`)
)
ENGINE=InnoDB
I filled it with a lot of data, and
explain
select * from posts where id = 5 order by ts
picks the id_ts index
Assuming you use hash tables to implement your database - yes. Hash tables are not ordered, and you have no way other than iterating over all elements to find the maximum.
However, if you use an ordered data structure, such as a B+ tree (which is actually pretty well optimized for disks and thus for databases), it is a different story.
You can store elements in your B+ tree ordered by user (primary comparator) and date (secondary comparator, descending). Once you have this data structure, finding the first element can be achieved in O(log(n)) disk seeks, by seeking to the first element matching the primary criterion (the user id).
I am not familiar with the internals of specific databases, but AFAIK some of them do allow you to create an index based on a B+ tree, and by doing so you can find the last post of a user more efficiently.
P.S.
To be exact, the concept of a "greatest" element, or of an ordering, is not defined in relational algebra: there is no max operator and no sort operator in strict relational algebra (though both exist in SQL). To get the max element of a table R with a single column a, one actually has to build the Cartesian product of the table with itself and subtract the dominated rows.
(Assuming set, and not multiset, semantics):
MAX = R \ Project(Select(R x R, R1.a < R2.a),R1.a)
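The same idea in SQL, assuming a single-column table R(a) and an engine that supports EXCEPT (some spell it MINUS):

-- Every value of a that is smaller than some other value is removed;
-- what remains is exactly the maximum.
SELECT a FROM R
EXCEPT
SELECT R1.a
FROM R AS R1, R AS R2
WHERE R1.a < R2.a;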