We are using Hibernate, JPA, and Spring, and our DB is Postgres 9. We use a sequence to auto-generate the primary key. What we have noticed is that the key skips 20 numbers when new records are inserted into those tables, even though the sequence is defined with an increment of 1. Why is Postgres incrementing the next value by 20? We do use a cache value of "20".
That's normal. You can tell Hibernate not to cache sequence values - at a performance cost to inserts - but this still doesn't mean you won't have sequence gaps.
I wrote more about this in an older answer - here.
Sequences have gaps. That's their nature. If they couldn't have gaps, you could only have one transaction inserting at a time.
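For illustration, a minimal Postgres sketch (the sequence name is made up) of why gaps appear even with an increment of 1: a value handed out inside a rolled-back transaction is never handed out again.

CREATE SEQUENCE demo_seq INCREMENT BY 1;

BEGIN;
SELECT nextval('demo_seq');  -- returns 1
ROLLBACK;                    -- the transaction is undone, but the sequence value is not returned

SELECT nextval('demo_seq');  -- returns 2, leaving a permanent gap at 1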
See:
CREATE SEQUENCE
Sequence manipulation functions
for details.
If you expect gapless sequences, you need to understand that you'll have to do all your inserts serially, with only one transaction able to do work at a time. To learn more, search for "postgresql gapless sequence". Relying on gapless sequences in the DB is usually a bad idea; instead, have your application construct the user-visible values when it fetches rows, using the row_number() window function or similar.
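As a rough sketch of that approach (table and column names are assumptions, not taken from the question):

SELECT id,
       row_number() OVER (ORDER BY id) AS display_number
FROM   invoices
ORDER  BY id;

The stored id may have gaps, but display_number is always a dense 1..n numbering computed at read time.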
Related:
Re-using deleted IDs
Is there a way to make RethinkDB generate a primary key automatically and ensure the key is in increasing order, like say 1 to n?
I know that when we insert a row into RethinkDB it automatically generates a primary key and returns a variable generated_keys, but I want a primary key which increases in a linear fashion, say starting from 4000 to n or 5000 to n, and so on.
I am not aware of a way to do this with RethinkDB. However, I know that it is not a feature that would scale well in a cluster of DB servers, as it would introduce a bottleneck in the insert commands.
It is always possible to do it by hand, however, simply by providing an id field in the documents you are inserting into the tables.
Is there some sort of performance difference for inserting, updating, or deleting data when you use the TEXT data type?
I went here and found this:
Tip: There is no performance difference among these three types, apart from increased storage space when using the blank-padded type, and a few extra CPU cycles to check the length when storing into a length-constrained column. While character(n) has performance advantages in some other database systems, there is no such advantage in PostgreSQL; in fact character(n) is usually the slowest of the three because of its additional storage costs. In most situations text or character varying should be used instead.
This makes me believe there should not be a performance difference, but my friend, who is much more experienced than I am, says inserts, updates, and deletes are slower for the TEXT data type.
I had a table that was partitioned with a trigger and function, and extremely heavily indexed, but the inserts did not go that slow.
Now I have another table, with 5 more columns all of which are text data type, the same exact trigger and function, no indexes, but the inserts are terribly slow.
From my experience, I think he is correct, but what do you guys think?
Edit #1:
I am uploading the same exact data, just the 2nd version has 5 more columns.
Edit #2:
By "Slow" I mean with the first scenario, I was able to insert 500 or more rows per second, but now I can only insert 20 rows per second.
Edit #3: I didn't add the indexes to the 2nd scenario like they are in the 1st scenario because indexes are supposed to slow down inserts, updates, and deletes, from my understanding.
Edit #4: I guarantee it is exactly the same data, because I'm the one uploading it. The only difference is, the 2nd scenario has 5 additional columns, all text data type.
Edit #5: Even when I removed all of the indexes on scenario 2 and left all of them on scenario 1, the inserts were still slower on scenario 2.
Edit #6: Both scenarios have the same exact trigger and function.
Edit #7:
I am using an ETL tool, Pentaho, to insert the data, so there is no way for me to show you the code being used to insert the data.
I think I might have had too many transformation steps in the ETL tool. When I tried to insert data in the same transformation as the steps that actually transform the data, it was massively slow, but when I simply inserted the already-transformed data into an empty table and then inserted the data from that table into the actual table I'm using, the inserts were much faster than scenario 1, at 4000 rows per second.
The only difference between scenario 1 and scenario 2, other than the increase in columns in scenario 2, is the number of steps in the ETL transformation. Scenario 2 has about 20 or more steps in the ETL transformation; in some cases, 50 or more.
I think I can solve my problem by reducing the number of transformation steps, or putting the transformed data into an empty table and then inserting the data from this table into the actual table I'm using.
PostgreSQL text and character varying are the same, with the exception of the (optional) length limit for the latter. They will perform identically.
The only reasons to prefer character varying are
you want to impose a length limit
you want to conform with the SQL standard
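A small sketch of the point above (table and column names are made up):

CREATE TABLE notes (
    body    text,          -- unlimited length
    title   varchar(200),  -- stored exactly like text, plus a length check on write
    remarks varchar        -- varchar without a limit behaves just like text
);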
When adding new data to a form, my primary key sequence increases by 1.
However, if I delete a record and replace it with new data, the sequence carries on.
So, for example, my primary keys go 1, 2, 3, 4, 5, 6, 10 because of previously deleted rows.
I hope that makes sense.
SEQUENCE values in Oracle are guaranteed to be unique, but you cannot expect the values to form a contiguous sequence without any gaps.
Even if you would never delete any rows from the table, you're likely to see gaps at some point, because sequence values are cached (pre-reserved) between different transactions.
It is a SEQUENCE of numbers; it doesn't care whether you have used the "current value" or not.
As opposed to MySQL, in Oracle the sequence is not tied to a column; it is a separate object that you can ask a value from (through your_sequence.nextval). To guarantee uniqueness, it never takes back values and offers them again.
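A tiny sketch of that (sequence, table, and column names are assumptions):

CREATE SEQUENCE orders_seq START WITH 1 INCREMENT BY 1 CACHE 20;

INSERT INTO orders (id, customer)
VALUES (orders_seq.NEXTVAL, 'Alice');  -- each call consumes a value; it is never reissued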
If you always want to have a dense sequence of IDs even after deletions, you would have to either
rearrange the IDs (read: change the IDs of the rows newer than the deleted one), or
(without knowing your task) use the DENSE_RANK analytic function when querying your dataset, separating the real (in-table) IDs from the ranking of the rows, as sketched below.
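A rough sketch of that second option, assuming a table named orders with an id column:

SELECT id,
       DENSE_RANK() OVER (ORDER BY id) AS display_position
FROM   orders;

The stored id keeps its gaps, while display_position is a dense, gap-free ranking computed at query time.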
Following the pointers in an ebay tech blog and a datastax developers blog, I model some event log data in Cassandra 1.2. As a partition key, I use “ddmmyyhh|bucket”, where bucket is any number between 0 and the number of nodes in the cluster.
The Data model
cqlsh:Log> CREATE TABLE transactions (
    yymmddhh varchar,
    bucket int,
    rId int,
    created timeuuid,
    data map<text, text>,
    PRIMARY KEY ((yymmddhh, bucket), created)
);
(rId identifies the resource that fired the event.)
(the map holds key-value pairs derived from a JSON document; the keys change, but not much)
I assume that this translates into a composite primary/row key with X buckets per hour.
My column names are then timeuuids. Querying this data model works as expected (I can query time ranges).
The problem is the performance: the time to insert a new row increases continuously.
So I am doing something wrong, but I can't pinpoint the problem.
When I use the timeuuid as a part of the row key, the performance remains stable at a high level, but this would prevent me from querying it (a query without the row key of course throws an error message about "filtering").
Any help? Thanks!
UPDATE
Switching from the map data type to predefined column names alleviates the problem. Insert times now seem to stay below 0.005 s per insert.
The core question remains:
How is my usage of the "map" datatype inefficient? And what would be an efficient way to do thousands of inserts with only slight variation in the keys?
The keys I put into the map mostly stay the same. I understood the DataStax documentation (can't post the link due to reputation limitations, sorry, but it is easy to find) to say that each key creates an additional column -- or does it create one new column per "map"? That would be... hard for me to believe.
I suggest you model your rows a little differently. The collections aren't very good to use in cases where you might end up with too many elements in them. The reason is a limitation in the Cassandra binary protocol which uses two bytes to represent the number of elements in a collection. This means that if your collection has more than 2^16 elements in it the size field will overflow and even though the server sends all of the elements back to the client, the client only sees the N % 2^16 first elements (so if you have 2^16 + 3 elements it will look to the client as if there are only 3 elements).
If there is no risk of getting that many elements into your collections, you can ignore this advice. I would not think that using collections gives you worse performance, I'm not really sure how that would happen.
CQL3 collections are basically just a hack on top of the storage model (and I don't mean hack in any negative sense); you can make a MAP-like row that is not constrained by the above limitation yourself:
CREATE TABLE transactions (
    yymmddhh VARCHAR,
    bucket INT,
    created TIMEUUID,
    rId INT,
    key VARCHAR,
    value VARCHAR,
    PRIMARY KEY ((yymmddhh, bucket), created, rId, key)
)
(Notice that I moved rId and the map key into the primary key. I don't know what rId is, but I assume that this is correct.)
This has two drawbacks over using a MAP: it requires you to reassemble the map when you query the data (you would get back a row per map entry), and it uses a little more space since C* will insert a few extra columns, but the upside is that there is no problem with getting too big collections.
In the end it depends a lot on how you want to query your data. Don't optimize for insertions, optimize for reads. For example: if you don't need to read back the whole map every time, but usually just read one or two keys from it, put the key in the partition/row key instead and have a separate partition/row per key (this assumes that the set of keys will be fixed so you know what to query for, so as I said: it depends a lot on how you want to query your data).
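As a hedged sketch of that last idea (table and column names are assumptions), promoting the map key into the partition key gives you one narrow partition per key:

CREATE TABLE transactions_by_key (
    yymmddhh VARCHAR,
    bucket INT,
    key VARCHAR,
    created TIMEUUID,
    rId INT,
    value VARCHAR,
    PRIMARY KEY ((yymmddhh, bucket, key), created)
);

Reading a single key is then one partition read, at the cost of issuing one query per key when you do need several of them.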
You also mentioned in a comment that the performance improved when you increased the number of buckets from three (0-2) to 300 (0-299). The reason for this is that you spread the load much more evenly throughout the cluster. When you have a partition/row key that is based on time, like your yymmddhh, there will always be a hot partition where all writes go (it moves throughout the day, but at any given moment it will hit only one node). You correctly added a smoothing factor with the bucket column/cell, but with only three values the likelihood of at least two ending up on the same physical node is too high. With three hundred you will have a much better spread.
Use yymmddhh as the row key and bucket+timeUUID as the column name, where each bucket holds 20 (or some fixed number of) records; the buckets can be managed using a counter column family, as sketched below.
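A hedged CQL sketch of that layout (table and column names are assumptions; the fixed bucket size of 20 is only an example):

CREATE TABLE transactions_by_hour (
    yymmddhh VARCHAR,
    bucket INT,
    created TIMEUUID,
    rId INT,
    data map<text, text>,
    PRIMARY KEY (yymmddhh, bucket, created)
);

CREATE TABLE bucket_counters (
    yymmddhh VARCHAR PRIMARY KEY,
    rows_in_current_bucket COUNTER
);

-- the application bumps the counter on each insert and moves on to the next
-- bucket once the current one reaches the fixed size (e.g. 20 rows):
UPDATE bucket_counters SET rows_in_current_bucket = rows_in_current_bucket + 1
WHERE yymmddhh = '13010112';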
Performance question about indexing large amounts of data. I have a large table (~30 million rows), with 4 of the columns indexed to allow for fast searching. Currently I set the indexes (indices?) up, then import my data. This takes roughly 4 hours, depending on the speed of the db server. Would it be quicker/more efficient to import the data first, and then perform the index building?
I'd temper af's answer by saying that it would probably be the case that "index first, insert after" would be slower than "insert first, index after" where you are inserting records into a table with a clustered index, but not inserting records in the natural order of that index. The reason being that for each insert, the data rows themselves would have to be ordered on disk.
As an example, consider a table with a clustered primary key on a uniqueidentifier field. The (nearly) random nature of a guid would mean that it is possible for one row to be added at the top of the data, causing all data in the current page to be shuffled along (and maybe data in lower pages too), but the next row added at the bottom. If the clustering was on, say, a datetime column, and you happened to be adding rows in date order, then the records would naturally be inserted in the correct order on disk and expensive data sorting/shuffling operations would not be needed.
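To make that concrete, a hedged SQL Server-style sketch (table and index names are made up):

-- clustered on a random GUID: new rows land at arbitrary positions,
-- causing page splits and shuffling
CREATE TABLE events_guid (
    id uniqueidentifier NOT NULL DEFAULT NEWID(),
    created datetime NOT NULL,
    payload varchar(200),
    CONSTRAINT pk_events_guid PRIMARY KEY CLUSTERED (id)
);

-- clustered on an increasing datetime: rows inserted in date order are
-- simply appended at the end of the clustered index
CREATE TABLE events_by_date (
    id uniqueidentifier NOT NULL DEFAULT NEWID(),
    created datetime NOT NULL,
    payload varchar(200)
);
CREATE CLUSTERED INDEX cx_events_by_date ON events_by_date (created);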
I'd back up Winston Smith's answer of "it depends", but suggest that your clustered index may be a significant factor in determining which strategy is faster for your current circumstances. You could even try not having a clustered index at all, and see what happens. Let me know?
Inserting data while indices are in place causes the DBMS to update them after every row. Because of this, it's usually faster to insert the data first and create the indices afterwards. Especially if there is that much data.
(However, it's always possible there are special circumstances which may cause different performance characteristics. Trying it is the only way to know for sure.)
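As a rough PostgreSQL-flavoured sketch of the "insert first, index after" approach (table, file, and index names are assumptions):

DROP INDEX IF EXISTS idx_big_table_col1;

COPY big_table FROM '/tmp/big_table.csv' WITH (FORMAT csv);  -- bulk load with no per-row index maintenance

CREATE INDEX idx_big_table_col1 ON big_table (col1);  -- one bulk index build at the end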
It will depend entirely on your particular data and indexing strategy. Any answer you get here is really a guess.
The only way to know for sure, is to try both and take appropriate measurements, which won't be difficult to do.