Oracle (PL/SQL): Is UPDATE RETURNING concurrent?

I'm using a table with a counter to ensure unique IDs on a child element.
I know it is usually better to use a sequence, but I can't use one because I have a lot of counters: a customer can create a couple of buckets, and each of them needs its own counter. They have to start with 1; it's a requirement, as my customer needs "human readable" keys.
I'm creating records (let's call them items) whose primary key is (bucket_id, num), where num is the counter value.
I need to guarantee that the bucket_id / num combination is unique (so using a sequence as the primary key won't fix my problem).
The creation of rows doesn't happen in PL/SQL, so I need to claim the number up front (by the way: gaps are not against the requirements).
My solution was:
UPDATE bucket
SET counter = counter + 1
WHERE id = param_id
RETURNING counter INTO num_forprikey;
PL/SQL returns num_forprikey so the item record can be created.
Question:
Will I always get unique num_forprikey even if the user concurrently asks for new items in a bucket?

Will I always get unique num_forprikey even if the user concurrently asks for new items in a bucket?
Yes, at least up to a point. The first user to issue that update gets a lock on the row, so no other user can successfully issue that same statement until user numero uno commits (or rolls back). So uniqueness is guaranteed.
Obviously, the caveat is concurrency. Your access to the row is serialized, so there is no way for two users to get a new key value simultaneously. This is not necessarily a problem; it depends on how many users you have creating new items and how often they do it. One user peeling off numbers in the same session won't notice a thing.
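For illustration, a minimal sketch of how the claiming call could be wrapped so the caller (which creates the item outside PL/SQL) receives the reserved number; the BUCKET table and its id/counter columns come from the question, everything else here is an assumption:
CREATE OR REPLACE FUNCTION claim_item_num (param_id IN bucket.id%TYPE)
  RETURN bucket.counter%TYPE
IS
  num_forprikey bucket.counter%TYPE;
BEGIN
  -- Locks the BUCKET row until the surrounding transaction commits or rolls back,
  -- which is exactly what serializes concurrent callers.
  UPDATE bucket
     SET counter = counter + 1
   WHERE id = param_id
  RETURNING counter INTO num_forprikey;

  RETURN num_forprikey;
END;
/
The caller then creates the item row with (param_id, num_forprikey) and commits; until that commit, other sessions asking for a number in the same bucket simply wait on the row lock.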

I seem to recall this problem from many years back, working on, of all things, an INGRES database. There were no sequences in those days, so a lot of effort was put into finding the best scaling solution for this problem by the top INGRES minds of the day. I was fortunate enough to be working alongside them, so even though my mind is pitifully smaller than any of theirs, proximity = residual effect and I retained something. This was one of the things. Let me see if I can remember.
1) For each counter you need a row in a work table.
2) Each time you need a number:
a) lock the row
b) update it
c) get its new value (you use RETURNING for this, which I avoid like the plague)
d) commit the update to release your lock on the row
The reason for the commit is to get some kind of scalability. There will always be a limit, but you do not serialize on getting a number for any period of time.
In the Oracle world we would improve the situation by using a function defined as an AUTONOMOUS_TRANSACTION in order to acquire the next number. If you think about it, this solution requires that gaps be allowed, which you said is OK. By committing the number update independently of the main transaction, you gain scalability but you introduce gaps.
You will have to accept the fact that your scalability will drop dramatically in this scenario. This is due to at least two reasons:
1) The update/select/commit sequence does its best to reduce the time during which the key row is locked, but it is still not zero. Under heavy load, you will serialize and eventually be limited.
2) You are committing on every key get. A commit is an expensive operation requiring many memory and file management actions on the part of the database. This will limit you as well.
In the end you are likely looking at a drop of three or more orders of magnitude in concurrent transaction load because you are not using sequences. I base this on my past experience.
But if your customer requires it, what can you do, right?
Good luck. I have not tested the code for syntax errors, I leave that to you.
create or replace function get_next_key (key_name_p in varchar2) return number is
  pragma autonomous_transaction;
  key_v key_table.key%type;
begin
  -- increment the counter row for this key name, then read back its new value
  update key_table set key = key + 1 where key_name = key_name_p;
  select key into key_v from key_table where key_name = key_name_p;
  commit;  -- autonomous transaction: releases the lock without touching the caller's transaction
  return (key_v);
end;
/
show errors

You can still use sequences, just use the row_number() analytic function to please your users. I described it here in more detail: http://rwijk.blogspot.com/2008/01/sequence-within-parent.html
Regards,
Rob.
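For illustration, a minimal sketch of that approach, assuming an ITEM table keyed by a plain sequence value plus a bucket_id and a creation timestamp (all names here are assumptions, not taken from the linked post):
select bucket_id,
       id,
       row_number() over (partition by bucket_id order by created_at) as num
from   item
order  by bucket_id, num;
The table keeps a globally unique surrogate key, while the per-bucket numbering starting at 1 is computed at query time rather than stored.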

I'd figure out how to make sequences work. It's the only guarantee, though an exception clause could be coded:
http://www.orafaq.com/forum/t/83382/0/ The benefit of sequences (and they could be dynamically created) is that you can specify NOCACHE and guarantee order.
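A minimal sketch of what dynamically creating a per-bucket sequence could look like; the procedure, the bucket_seq_<id> naming convention, and the NOCACHE/ORDER options are assumptions, not something the linked thread prescribes:
create or replace procedure create_bucket_sequence (param_bucket_id in number) is
begin
  -- one sequence per bucket, starting at 1, uncached so numbers come out in order
  execute immediate
    'create sequence bucket_seq_' || to_char(param_bucket_id) ||
    ' start with 1 increment by 1 nocache order';
end;
/
Note that running DDL from PL/SQL like this needs the CREATE SEQUENCE privilege granted directly to the owner, and a large number of buckets means an equally large number of sequences to manage.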

Related

Getting next sequence value in correct order

I have a function in the Oracle database that gets me the next value of the sequence. I also have the following PySpark code:
def get_next_seq_value():
QUERY = "SELECT SCHEMA.GET_NEXT_SEQ_VALUE FROM DUAL"
sqlContext.clearCache()
next_seq_value_df = sqlContext.read.format("jdbc").options(url=URL, driver=DRIVER, QUERY=QUERY, user=USER, password=PASSWORD).load().unpersist()
next_seq_value = next_seq_value_df.take(1)[0][0]
return next_seq_value
And I call this function from here:
array = []
for each_item in df_list:
    next_seq_value = get_next_seq_value().encode('utf-8').strip()
    array.append(next_seq_value)
The problem is the following: when I run it, the array looks like this:
['545671', '545672', '545673', '545694', '545695', '545696']
Why don't I see 545674 and 545675... it just skipped to '545694'. How do I make sure it calls the function in order?
Default sequence cache size is 20:
If you omit both CACHE and NOCACHE, then the database caches 20 sequence numbers by default.
So it looks like another session called nextval on your sequence between your calls.
In addition, judging from your code (QUERY = "SELECT SCHEMA.GET_NEXT_SEQ_VALUE FROM DUAL"), it looks like you wrapped your_sequence.nextval in the function GET_NEXT_SEQ_VALUE. That looks like overkill here: you get extra calls (SQL -> PL/SQL -> .nextval()) and extra overhead. You can just use select seq.nextval from dual or :x := seq.nextval;. And if you want to generate N values at once, you can use: select seq.nextval from dual connect by level <= 20;
Totally agree with both of the previous answers. I'm not sure what type of database architecture you're using, but I'd also like to point out that with Oracle RAC each cluster node instance will have a separate cache for the sequence too.
Eg:
node 1: sequence cache 101-120
node 2: sequence cache 121-140
node 3: sequence cache 141-160
So depending on which node happens to process a request the nextval might not be in sequential order, either.
The point is that when using sequences you should only count on the values being unique, not necessarily gap-free (eliminating the cache can impact performance severely), nor necessarily in sequential order, depending on your physical server architecture. If keeping things in sequential order no matter what is important, add a timestamp to your record in addition to the sequence counter.
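As a rough sketch of that last suggestion (the table and sequence names are made up, and the sequence-backed DEFAULT assumes Oracle 12c or later; on 11g you would populate id from a trigger or in the INSERT instead):
create sequence registration_seq;

create table registration (
  id         number    default registration_seq.nextval primary key,
  created_at timestamp default systimestamp not null
);

-- order by the timestamp when arrival order matters, by id when only uniqueness matters
select id, created_at from registration order by created_at;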
Your problem is apparently not the order of the sequence-generated IDs but the gaps.
When you decide to use sequences you generally must reckon with gaps.
If you use the default cache size of 20, you will lose on average 10 IDs at the end of each session.
You may reduce this with NOCACHE, but even then, if you call nextval and then roll back the transaction, that ID is lost, as the next transaction typically starts with a new nextval...

How Can I get the last inserted sequence value for respective to a web session in JSP and Oracle?

First of all, please do not treat this as a duplicate.
I have seen all the threads on this issue, but none matched my case.
I am developing an online registration system using JBOSS 6 and Oracle 11g. I want to give every registrant a unique form number, sequentially.
For this, I think Oracle's sequence_name.nextval on a primary key field is best for inserting a unique yet sequential number, and to retrieve the same value I would use sequence_name.currval. Up to this point I hope it's OK.
But will this hold up if two or more concurrent users submit the web form simultaneously? (I mean, will there be any overlap or interchange of values among the concurrent users?)
More precisely, is it session dependent?
Let me give two hypothetical situations so that matter becomes clearer.
Say there are two users, user1 and user2, trying to register at the same time, sitting in New York and Paris respectively. The max(form_no) is, say, 100 before they click the submit button. Now, in the code I have written, say
insert into member(....) values(seq_form_no.nextval,....).
Now, since the two users invoke the same query from two different terminals, will each get their own sequential ID, or will user1 get user2's or vice versa? I hope I made the issue clear. See, the sequence values will be unique, I know, but I want to associate the inserted IDs with their respective users.
Thanks in advance.
I'm not sure I understand. But simply said, a SEQUENCE ensures uniqueness of the generated numbers among concurrent transactions/connections. Unless the sequence was created with the CYCLE option, from within a transaction you can rely on a strictly monotonically increasing (resp. decreasing) numbering, but not on the absence of gaps (probably what you were expecting when talking about "sequential numbers").
Worth mentioning that sequence numbers never go backward. When someone acquires a value, it is "consumed" from the sequence and will never be returned to it (CYCLE aside), even if you roll back the current transaction.
From the doc (emphasis mine):
When a sequence number is generated, the sequence is incremented, independent of the transaction committing or rolling back. If two users concurrently increment the same sequence, then the sequence numbers each user acquires may have gaps, because sequence numbers are being generated by the other user. One user can never acquire the sequence number generated by another user. After a sequence value is generated by one user, that user can continue to access that value regardless of whether the sequence is incremented by another user.
My JSP is a little bit ... "rusty", but something like that will work as expected:
<sql:update dataSource="${ds}" var="result">
INSERT INTO member(....) values(seq_form_no.nextval,....)
</sql:update>
<sql:query dataSource="${ds}" var="last_inserted_member_id">
SELECT seq_form_no.currval FROM DUAL
</sql:query>

Postgres optimize UPDATE

I have to do a somewhat complicated data import. I need to do a number of UPDATEs, which currently update over 3 million rows in one query. Each of these queries takes about 30-45 seconds (some of them even 4-5 minutes). My question is whether I can speed this up, and where I can read about it: what kind of indexes, and on which columns, I can set to improve those updates. I don't need an exact answer, so I'm not showing the tables. I am looking for material to learn from.
Two things:
1) Post an EXPLAIN ANALYZE of your UPDATE query.
2) If your UPDATE does not need to be atomic, then you may want to consider breaking apart the number of rows affected by your UPDATE. To minimize the number of "lost rows" due to exceeding the Free Space Map, consider the following approach:
1. BEGIN;
2. UPDATE ... LIMIT N; or some predicate that limits the number of rows (e.g. WHERE username ilike 'a%');
3. COMMIT;
4. VACUUM table_being_updated;
5. Repeat steps 1-4 until all rows are updated.
6. ANALYZE table_being_updated;
I suspect you're updating every row in your table and don't need all rows to be visible with the new value at the end of a single transaction, so breaking the UPDATE up into smaller transactions as above is a good approach.
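As a rough illustration only (the table name big_table, the status column, and the id ranges are hypothetical; note that PostgreSQL's UPDATE has no LIMIT clause, so a predicate does the limiting), one such batch could look like:
begin;
update big_table
set status = 'imported'
where id >= 0 and id < 100000;   -- one key range per batch
commit;
vacuum big_table;
-- repeat with the next id range (100000 to 200000, and so on) until all rows are done, then run: analyze big_table;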
And yes, an INDEX on the relevant columns specified in the UPDATE's predicate will help dramatically. Again, post an EXPLAIN ANALYZE if you need further assistance.
If by "a number of UPDATEs" you mean one UPDATE command per updated row, then the problem is that all the target table's indexes will be updated and all constraints will be checked for each updated row. If that is the case, then try instead to update all rows with a single UPDATE:
update t
set a = t2.b
from t2
where t.id = t2.id
If the imported data is in a text file, then load it into a temp table first and update from there. See my answer here
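A minimal sketch of that idea, assuming the target table is t(id, a) as in the snippet above and the import file is tab-separated (the file path and staging-table name are made up):
create temp table t_import (id integer, b text);
copy t_import from '/tmp/import_data.txt';  -- or \copy from the client side
create index on t_import (id);
analyze t_import;

update t
set a = t_import.b
from t_import
where t.id = t_import.id;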

Would using partitions be a good idea in such a situation?

Context: Oracle 10 database.
In a rather large table (several million records) we recently started to see some performance trouble. The table has some special behaviours / conditions:
it's mostly write-once and then never gets changed again
during the first day or so the records are classified from 0..N (let's call that column class). Records might get reclassified several times during that first day
new entries are added with class 0, meaning "not yet classified"
every hour or so a process classifies the new records and gives them a new class from 1..N
all the readers are only interested in class 1
all records older than a day hardly ever change their class; records with class > 1 get cleaned up after a few days
Now, as most access is to class 1, that column is often involved in queries (class = 1), together with other conditions. We have an index on the class column, and further indexes on certain other columns.
To my question: we are now thinking of partitioning that table by class. As far as I have understood, this would make indexing/working with the data faster, as class = 1 is already separated from the rest of the data and therefore access to it is implicitly more efficient. Is this correct?
If you agree that this is a good idea I will further read into the topic!
Thanks
Cheers
Update 2010.11.30
Thank you very much for the input. I wasn't aware that it's an extra-cost option :) thanks for pointing that out (before I invest too much time into it). But besides the license issue, it appears to me that partitions aren't necessarily a good solution in this context.
What operations are experiencing slowness and have you been able to identify why those operations are slow?
If you partition by class, you will be slowing down the process of updating the class for a row. Since that would force a row to move from one partition to another, you'd be turning an update into a delete from the first partition and an insert into the second partition. If your hourly process is slow and it is slow because it takes time to find all the new records, the performance trade-off here may be quite reasonable. If your hourly process is slow because it takes time to compute what the new class should be and to update all the rows, on the other hand, that trade-off is probably a very poor idea.
Because partitioning is an extra cost option on top of the enterprise edition license, I would suggest making sure that you can't use some function-based indexes to get most of the performance improvements you're targeting at relatively little cost. If, for example, you had two function-based indexes
CREATE INDEX idx_new_entries
ON your_table( (CASE WHEN class = 0 THEN primary_key ELSE null END) );
CREATE INDEX idx_class1_entries
ON your_table( (CASE WHEN class = 1 THEN primary_key ELSE null END) );
along with a couple of views
CREATE VIEW vw_new_entries
AS
SELECT (CASE WHEN class = 0 THEN primary_key ELSE null END) primary_key,
<<list of columns>>
FROM your_table
WHERE class = 0
CREATE VIEW vw_class1_entries
AS
SELECT (CASE WHEN class = 1 THEN primary_key ELSE null END) primary_key,
<<list of columns>>
FROM your_table
WHERE class = 1
then any queries against the new views that filtered on the PRIMARY_KEY would use the function-based indexes which in turn would only index the appropriate rows in the underlying table. That may allow you to improve lookup performance without needing to resort to partitioning.
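For instance, a hypothetical lookup like the following (12345 is a placeholder key value) filters on the view's PRIMARY_KEY expression, which matches IDX_CLASS1_ENTRIES, so only the class = 1 rows are indexed and scanned:
SELECT *
FROM vw_class1_entries
WHERE primary_key = 12345;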
How big is the table in MB? What is the growth rate? Are you purging data, or do you plan to purge data? What indexes are on the table now? Can you give us a sample table definition? Partitioning is an extra license option. Have you verified that someone is actually going to pay for it?
And most importantly, please provide sample queries.
What you have provided is not enough information to base a decision on.
Yep, sounds like a good idea.
There are better alternatives to this, but an easy fix is a partition.

What would be the best algorithm to find an ID that is not used from a table that has the capacity to hold a million rows

To elaborate ..
a) A table (BIGTABLE) has a capacity to hold a million rows with a primary key as the ID (random and unique).
b) What algorithm can be used to arrive at an ID that has not been used so far? This number will be used to insert another row into table BIGTABLE.
Updated the question with more details..
c) This table already has about 100K rows and the primary key is not set as an identity.
d) Currently, a random number is generated as the primary key and a row is inserted into this table; if the insert fails, another random number is generated. The problem is that sometimes it goes into a loop: the random numbers generated are pretty random, but unfortunately they already exist in the table. So if we retry the random number generation after some time, it works.
e) The sybase rand() function is used to generate the random number.
Hope this addition to the question helps clarify some points.
The question is of course: why do you want a random ID?
One case where I encountered a similar requirement, was for client IDs of a webapp: the client identifies himself with his client ID (stored in a cookie), so it has to be hard to brute force guess another client's ID (because that would allow hijacking his data).
The solution I went with, was to combine a sequential int32 with a random int32 to obtain an int64 that I used as the client ID. In PostgreSQL:
CREATE FUNCTION lift(integer, integer) returns bigint AS $$
SELECT ($1::bigint << 31) + $2
$$ LANGUAGE SQL;
CREATE FUNCTION random_pos_int() RETURNS integer AS $$
select floor((lift(1,0) - 1)*random())::integer
$$ LANGUAGE sql;
ALTER TABLE client ALTER COLUMN id SET DEFAULT
lift((nextval('client_id_seq'::regclass))::integer, random_pos_int());
The generated IDs are 'half' random, while the other 'half' guarantees you cannot obtain the same ID twice:
select lift(1, random_pos_int()); => 3108167398
select lift(2, random_pos_int()); => 4673906795
select lift(3, random_pos_int()); => 7414644984
...
Why is the unique ID Random? Why not use IDENTITY?
How was the ID chosen for the existing rows?
The simplest thing to do is probably (Select Max(ID) from BIGTABLE) and then make sure your new "Random" ID is larger than that...
EDIT: Based on the added information I'd suggest that you're screwed.
If it's an option: Copy the table, then redefine it and use an Identity Column.
If, as another answer speculated, you do need a truly random Identifier: make your PK two fields. An Identity Field and then a random number.
If you simply can't change the table's structure, checking to see if the ID exists before trying the insert is probably your only recourse.
There isn't really a good algorithm for this. You can use this basic construct to find an unused id:
int id;
do {
    id = generateRandomId();
} while (doesIdAlreadyExist(id));
doSomethingWithNewId(id);
Your best bet is to make your key space big enough that the probability of collisions is extremely low, then don't worry about it. As mentioned, GUIDs will do this for you. Or, you can use a pure random number as long as it has enough bits.
This page has the formula for calculating the collision probability.
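For reference, the usual birthday-problem approximation: drawing n random IDs uniformly from a key space of size N, the probability of at least one collision is roughly p ≈ 1 − e^(−n(n−1)/(2N)). For example, a million rows (n = 10^6) in a 64-bit space (N = 2^64) gives p ≈ 2.7 × 10^-8, which is why "make the key space big enough and stop worrying" works in practice.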
A bit outside of the box.
Why not pre-generate your random numbers ahead of time? That way, when you insert a new row into bigtable, the check has already been made. That would make inserts into bigtable a constant time operation.
You will have to perform the checks eventually, but that could be offloaded to a second process that doesn’t involve the sensitive process of inserting into bigtable.
Or go generate a few billion random numbers, and delete the duplicates, then you won't have to worry for quite some time.
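A rough sketch of that pool idea in Sybase-style T-SQL; the key_pool table, its data type, and the background job that keeps it filled with already-checked random numbers are all assumptions:
create table key_pool (id numeric(12,0) primary key)

declare @claimed_id numeric(12,0)
begin transaction
    set rowcount 1
    select @claimed_id = id from key_pool holdlock
    set rowcount 0
    delete from key_pool where id = @claimed_id
    -- insert the new BIGTABLE row here, using @claimed_id as its primary key
commit transaction
The holdlock is meant to keep two concurrent sessions from handing out the same pooled id; depending on isolation settings, one of them may have to retry.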
Make the key field UNIQUE and IDENTITY and you won't have to worry about it.
If this is something you'll need to do often you will probably want to maintain a live (non-db) data structure to help you quickly answer this question. A 10-way tree would be good. When the app starts it populates the tree by reading the keys from the db, and then keeps it in sync with the various inserts and deletes made in the db. So long as your app is the only one updating the db the tree can be consulted very quickly when verifying that the next large random key is not already in use.
Pick a random number, check if it already exists, if so then keep trying until you hit one that doesn't.
Edit: Or
better yet, skip the check and just try to insert the row with different IDs until it works.
First question: is this a planned database or an already functional one? If it already has data inside, then the answer by bmdhacks is correct. If it is a planned database, here is the second question:
Does your primary key really need to be random? If the answer is yes, then use a function that creates a random ID from a known seed and a counter that tracks how many IDs have been created. Each ID created increments the counter.
If you keep the seed secret (i.e., have the seed called and declared private), then no one else should be able to predict the next ID.
If ID is purely random, there is no algorithm to find an unused ID in a similarly random fashion without brute forcing. However, as long as the bit-depth of your random unique id is reasonably large (say 64 bits), you're pretty safe from collisions with only a million rows. If it collides on insert, just try again.
Depending on your database you might have the option of using either a sequence (Oracle) or an autoincrement (MySQL, MS SQL, etc.). Or, as a last resort, do a select max(id) + 1 as the new ID - just be careful of concurrent requests so you don't end up with the same max ID twice - wrap it in a lock together with the upcoming insert statement.
I've seen this done so many times before via brute force, using random number generators, and it's always a bad idea. Generating a random number outside of the db and attempting to see if it exists will put a lot of strain on your app and database. And it could lead to 2 processes picking the same id.
Your best option is to use MySQL's autoincrement ability. Other databases have similar functionality. You are guaranteed a unique id and won't have issues with concurrency.
It is probably a bad idea to scan every value in that table every time looking for a unique value. I think the way to do this would be to have a value in another table, lock on that table, read the value, calculate the value of the next id, write the value of the next id, release the lock. You can then use the id you read with the confidence your current process is the only one holding that unique value. Not sure how well it scales.
Alternatively use a GUID for your ids, since each newly generated GUID is supposed to be unique.
Is it a requirement that the new ID also be random? If so, the best answer is just to loop over (randomize, test for existence) until you find one that doesn't exist.
If the data just happens to be random, but that isn't a strong constraint, you can just use SELECT MAX(idcolumn), increment in a way appropriate to the data, and use that as the primary key for your next record.
You need to do this atomically, so either lock the table or use some other concurrency control appropriate to your DB configuration and schema. Stored procs, table locks, row locks, SELECT...FOR UPDATE, whatever.
Note that in either approach you may need to handle failed transactions. You may theoretically get duplicate key issues in the first (though that's unlikely if your key space is sparsely populated), and you are likely to get deadlocks on some DBs with approaches like SELECT...FOR UPDATE. So be sure to check and restart the transaction on error.
First check if Max(ID) + 1 is not taken and use that.
If Max(ID) + 1 exceeds the maximum then select an ordered chunk at the top and start looping backwards looking for a hole. Repeat the chunks until you run out of numbers (in which case throw a big error).
if the "hole" is found then save the ID in another table and you can use that as the starting point for the next case to save looping.
Skipping the reasoning of the task itself, the only algorithm that
will give you an ID not in the table
that will be used to insert a new line in the table
will result in a table still having random unique IDs
is generating a random number and then checking if it's already used
The best algorithm in that case is to generate a random number and do a select to see if it exists, or just try to add it if your database errs out sanely. Depending on the range of your key versus how many records there are, this could take only a small amount of time. It can also spike, though, and isn't consistent at all.
Would it be possible to run some queries on the BigTable and see if there are any ranges that could be exploited? I.e. between 100,000 and 234,000 there are no IDs yet, so we could add IDs there?
Why not append the current date in seconds to your random number generator's output? That way the only way to get an identical ID is if two users are created in the same second and are given the same random number by your generator.
