Alternative to ORA_HASH? - oracle

We are working with a table in a 3rd party database that does not have a primary key but does have a unique index.
I have therefore been looking at using the ORA_HASH function to produce a de facto unique Id by passing in the values of the columns in the unique index.
Unfortunately, I can already see that we have a few collisions, which means that we can't derive a unique id using this method.
Is there an alternative to ORA_HASH that would provide a unique id for a unique input?
I suppose I could generate an Id using DBMS_CRYPTO.Hash but I'd ideally like to get a numeric value.
Edit
The added complication is that I then need to store these records in another (SQL Server) database and then compare the records from the original and the replica tables. So rank doesn't help me here since records can be added or deleted in the original table.

DBMS_CRYPTO.HASH could be used to generate a high-bit hash (high enough to give you a very low, but not zero, chance of collisions), but it returns 'RAW' not 'NUMBER'.
To guarantee no collisions ever, you need a one-to-one hash function. As far as I know, Oracle does not provide one.
A practical approach would be to create a new table to map unique keys to a newly generated primary key. E.g., unique value ("ABC",123, 888) maps to 838491 (where you generated 838491 using a sequence).
You'd have to update the mapping table periodically, to account for inserted rows, and that would be a pain, but it would let you generate your own PKs and keep track of them without a lot of complication.

Have you tried:
DBMS_UTILITY.GET_HASH_VALUE (
name VARCHAR2,
base NUMBER,
hash_size NUMBER)
RETURN NUMBER;

Related

Randomly generated public unique ids

Currently I'm generating unique ids for rows in my database using int and auto_increment. These ids are public facing so in the url you can see something like this https://example.com/path/1 or https://example.com/path/2
After talking with another engineer they've advised me that I should use randomly generated ids so that they're not guessable.
How can I generate a unique ID & without doing a forloop on the database each time to make sure it's unique? e.g. take stripe for example. All of their ids are price_sdfgsdfg or prod_iisdfgsdfg. Whats the best way to generate unique ids for rows like these?
Without knowing which language or database you're using, the simplest way is using uuids.
To prevent downloading all existing database unique keys, and then for looping over them all, simply just try to INSERT INTO whichever table you are using.
If the result fails (e.g. Exception), then the row is taken, continue.
If the result passes, break loop.
This only works when you have a column which is NOT NULL, and UNIQUE.
That's how I "know" without looping over the whole database of IDs, or downloading them into local memory, etc.
Using auto_increment wont lead to duplicates because when a SQL or no-SQL table is in use, it will be locked and given to the next available number in the queue, which is the beauty of databases.
SQL example (mySQL, SQLite, mariadb):
CREATE TABLE `my_db`.`my_table` ( `unique_id` INT NOT NULL , UNIQUE (`unique_id`)) ENGINE = InnoDB;`
Insert a unique_id
INSERT INTO `test` (`unique_id`) VALUES ('999999999');
Great, we have a row
INSERT INTO `test` (`unique_id`) VALUES ('999999999');
If not, then retry:
Error:
#1062 - Duplicate entry '999999999' for key 'unique_id'
If these are public URLs, and the content is sensitive, then I definitely do not recommend int's as someone can trivially guess 1 through 99999999... etc.
In any language, have a look at /dev/urandom.
In shell/bash scripts, I might use uuidgen:
9dccd646-043e-4984-9126-3060b4ced180
In Python, I'll use pandas:
df.set_index(pd.util.hash_pandas_object(df, encoding='utf8'), drop=True, inplace=True)
df.index.rename('hash', inplace=True)
Lastly, UUID's aren't perfect: they are only a-f 0-9 all lowercase, but they are easy to generate: every language has one.
In JavaScript you may want to check out some secure Open Source apps, for example, Jitsi: https://github.com/jitsi/js-utils/blob/master/random/roomNameGenerator.js where they conjugate word:
E.g. Satisfied-Global-Architectural-Bitter

DynamoDb delete with sort key

I have fields below in dynamo dB table
event_on -- string type
user_id -- number type
event name -- string type
Since this table may have multiple records for user_id and event_on is the single field which can be unique so I made it primary key and user_id as sort key
Now I want to delete the all records of a user, so My code is
response = dynamodb.delete_item(
TableName=events,
Key={
"user_id": {"N": str(userId)}
})
It throwing error
Exception occured An error occurred (ValidationException) when calling
the DeleteItem operation: The provided key element does not match the
schema
also is there anyway to delete with range
Can someone suggest me what should I have do with dynamodb table structure to make this code work
Thanks,
It sounds like you've modeled your data using a composite primary key, which means you have both a partition key and a sort key. Here's an example of what that looks like with some sample data.
In DynamoDB, the most efficient way to access items (aka "rows" in RDBMS language) is by specifying either the full primary key (getItem) or the partition key (query). If you want to search by any other attribute, you'll need to use the scan operation. Be very careful with scan, since it can be a costly way (both in performance and money) to access your data.
When it comes to deletion, you have a few options.
deleteItem - Deletes a single item in a table by primary key.
batchWriteItem - The BatchWriteItem operation puts or deletes multiple items in one or more tables. A single call to BatchWriteItem can write up to 16 MB of data, which can comprise as many as 25 put or delete requests
TimeToLive - You can utilize DynamoDBs Time To Live (TTL) feature to delete items you no longer need. Keep in mind that TTL only marks your items for deletion and actual deletion could take up to 48 hours.
In order to effectively use any of these options, you'll first need to identify which items you want to delete. Because you want to fetch using the value of the sort key alone, you have two options;
Use scan to find the items of interest. This is not ideal but is an option if you cannot change your data model.
Create a global secondary index (GSI) that swaps your partition key and sort key values. This pattern is called an inverted index. This would allow you to identify all items with a given user_id.
If you choose option 2, your data would look like this
This would allow you to fetch all item for a given user, which you could then delete using one of the methods I outlined above.
As you can see here, delete_item needs the primary key and not the sort key. You would have to do a full scan, and delete everything that contains the given sort key.
If you are created a DynamoDB table by the Primary key and sort key, you should provide both values to remove items from that table.
If the sort key was not added to the primary key on the table creation process, the record can be removed by the Primary key.
How I solved it.
Actually, I tried to not add the sort key when created the table. And I'm using indexes for sorting and getting items.

How to insert a unique random integer in SQLite?

Let's say I'm saving users in a database and I want each user to have a unique random ID (this isn't actually the case, just a simpler example). When I INSERT the user, is there a way to insert a unique random ID?
I know I can easily just do an auto-increment column so that each row would have a unique integer, but I need a random number for this system specifically.
Sample of what my standard insert query for a new user:
INSERT INTO 'Users' VALUES ('RandomID', 'Bleh', 'Bleh2') (random value here, 1, 2)
I was wondering the same and found an interesting answer in this article: use a pre-populated table.
You create before hand a table of n rows with unique random numbers, using any method (for example, by trying to insert random numbers in a UNIQUE field and silently failing when the number isn't unique, until you get the number of rows needed).
Once that table is created you are 100% certain that the numbers in it are unique, and you can simply use them sequentially and discarding them after use (or not).
When using this method you need to have some kind of alert system for when you approach the limit of rows in the pre-populated table, so that you can generate a new set of values anew.
Sqlite has random() which returns a random integer. But it may not be unique every time. You can append time stamp or row_id with it to get unique random number.
Based on this documentation for the C API function sqlite3_randomness(), it might be possible to make a table's primary key random by creating a dummy row and forcing it's primary key value to the largest possible ROWID. Any new rows after that should be random.
That said, I don't know that that behavior is contractual or just a current implementation detail. Use at your own risk.

Is there any way to generate an ID without a sequence?

Current application use JPA to auto generate table/entity id. Now a requirement wants to get a query to manually insert data in to the database using SQL queries
So the questions are:
Is it worth to create a sequence in this schema just for this little requirement?
If answer to 1 is no, then what could be a plan b?
Yes. A sequence is trivial - why would you not do it?
N/A
Few ways:
Use a UUID. UUIDs are pseudo-random, large alphanumeric strings which are guaranteed to be unique once generated.
Does the data have something unique? Like a timestamp, or IP address, etc? If so, use that
Combination of current timestamp + some less unique value in the data
Combination of current timestamp + some integer i that you keep incrementing
There are others (including generating a checksum, custom random numbers instead of UUIDs, etc) - but those have the possibility of overlaps, so not mentioning them.
Edit: Minor clarifications
Are you just doing a single data load into an empty table, and there are no other users concurrently inserting data? If so, you can just use ROWNUM to generate the IDs starting from 1, e.g.
INSERT INTO mytable
SELECT ROWNUM AS ID
,etc AS etc
FROM ...

What would be the best algorithm to find an ID that is not used from a table that has the capacity to hold a million rows

To elaborate ..
a) A table (BIGTABLE) has a capacity to hold a million rows with a primary Key as the ID. (random and unique)
b) What algorithm can be used to arrive at an ID that has not been used so far. This number will be used to insert another row into table BIGTABLE.
Updated the question with more details..
C) This table already has about 100 K rows and the primary key is not an set as identity.
d) Currently, a random number is generated as the primary key and a row inserted into this table, if the insert fails another random number is generated. the problem is sometimes it goes into a loop and the random numbers generated are pretty random, but unfortunately, They already exist in the table. so if we re try the random number generation number after some time it works.
e) The sybase rand() function is used to generate the random number.
Hope this addition to the question helps clarify some points.
The question is of course: why do you want a random ID?
One case where I encountered a similar requirement, was for client IDs of a webapp: the client identifies himself with his client ID (stored in a cookie), so it has to be hard to brute force guess another client's ID (because that would allow hijacking his data).
The solution I went with, was to combine a sequential int32 with a random int32 to obtain an int64 that I used as the client ID. In PostgreSQL:
CREATE FUNCTION lift(integer, integer) returns bigint AS $$
SELECT ($1::bigint << 31) + $2
$$ LANGUAGE SQL;
CREATE FUNCTION random_pos_int() RETURNS integer AS $$
select floor((lift(1,0) - 1)*random())::integer
$$ LANGUAGE sql;
ALTER TABLE client ALTER COLUMN id SET DEFAULT
lift((nextval('client_id_seq'::regclass))::integer, random_pos_int());
The generated IDs are 'half' random, while the other 'half' guarantees you cannot obtain the same ID twice:
select lift(1, random_pos_int()); => 3108167398
select lift(2, random_pos_int()); => 4673906795
select lift(3, random_pos_int()); => 7414644984
...
Why is the unique ID Random? Why not use IDENTITY?
How was the ID chosen for the existing rows.
The simplest thing to do is probably (Select Max(ID) from BIGTABLE) and then make sure your new "Random" ID is larger than that...
EDIT: Based on the added information I'd suggest that you're screwed.
If it's an option: Copy the table, then redefine it and use an Identity Column.
If, as another answer speculated, you do need a truly random Identifier: make your PK two fields. An Identity Field and then a random number.
If you simply can't change the tables structure checking to see if the id exists before trying the insert is probably your only recourse.
There isn't really a good algorithm for this. You can use this basic construct to find an unused id:
int id;
do {
id = generateRandomId();
} while (doesIdAlreadyExist(id));
doSomethingWithNewId(id);
Your best bet is to make your key space big enough that the probability of collisions is extremely low, then don't worry about it. As mentioned, GUIDs will do this for you. Or, you can use a pure random number as long as it has enough bits.
This page has the formula for calculating the collision probability.
A bit outside of the box.
Why not pre-generate your random numbers ahead of time? That way, when you insert a new row into bigtable, the check has already been made. That would make inserts into bigtable a constant time operation.
You will have to perform the checks eventually, but that could be offloaded to a second process that doesn’t involve the sensitive process of inserting into bigtable.
Or go generate a few billion random numbers, and delete the duplicates, then you won't have to worry for quite some time.
Make the key field UNIQUE and IDENTITY and you wont have to worry about it.
If this is something you'll need to do often you will probably want to maintain a live (non-db) data structure to help you quickly answer this question. A 10-way tree would be good. When the app starts it populates the tree by reading the keys from the db, and then keeps it in sync with the various inserts and deletes made in the db. So long as your app is the only one updating the db the tree can be consulted very quickly when verifying that the next large random key is not already in use.
Pick a random number, check if it already exists, if so then keep trying until you hit one that doesn't.
Edit: Or
better yet, skip the check and just try to insert the row with different IDs until it works.
First question: Is this a planned database or a already functional one. If it already has data inside then the answer by bmdhacks is correct. If it is a planned database here is the second question:
Does your primary key really need to be random? If the answer is yes then use a function to create a random id from with a known seed and a counter to know how many Ids have been created. Each Id created will increment the counter.
If you keep the seed secret (i.e., have the seed called and declared private) then no one else should be able to predict the next ID.
If ID is purely random, there is no algorithm to find an unused ID in a similarly random fashion without brute forcing. However, as long as the bit-depth of your random unique id is reasonably large (say 64 bits), you're pretty safe from collisions with only a million rows. If it collides on insert, just try again.
depending on your database you might have the option of either using a sequenser (oracle) or a autoincrement (mysql, ms sql, etc). Or last resort do a select max(id) + 1 as new id - just be carefull of concurrent requests so you don't end up with the same max-id twice - wrap it in a lock with the upcomming insert statement
I've seen this done so many times before via brute force, using random number generators, and it's always a bad idea. Generating a random number outside of the db and attempting to see if it exists will put a lot strain on your app and database. And it could lead to 2 processes picking the same id.
Your best option is to use MySQL's autoincrement ability. Other databases have similar functionality. You are guaranteed a unique id and won't have issues with concurrency.
It is probably a bad idea to scan every value in that table every time looking for a unique value. I think the way to do this would be to have a value in another table, lock on that table, read the value, calculate the value of the next id, write the value of the next id, release the lock. You can then use the id you read with the confidence your current process is the only one holding that unique value. Not sure how well it scales.
Alternatively use a GUID for your ids, since each newly generated GUID is supposed to be unique.
Is it a requirement that the new ID also be random? If so, the best answer is just to loop over (randomize, test for existence) until you find one that doesn't exist.
If the data just happens to be random, but that isn't a strong constraint, you can just use SELECT MAX(idcolumn), increment in a way appropriate to the data, and use that as the primary key for your next record.
You need to do this atomically, so either lock the table or use some other concurrency control appropriate to your DB configuration and schema. Stored procs, table locks, row locks, SELECT...FOR UPDATE, whatever.
Note that in either approach you may need to handle failed transactions. You may theoretically get duplicate key issues in the first (though that's unlikely if your key space is sparsely populated), and you are likely to get deadlocks on some DBs with approaches like SELECT...FOR UPDATE. So be sure to check and restart the transaction on error.
First check if Max(ID) + 1 is not taken and use that.
If Max(ID) + 1 exceeds the maximum then select an ordered chunk at the top and start looping backwards looking for a hole. Repeat the chunks until you run out of numbers (in which case throw a big error).
if the "hole" is found then save the ID in another table and you can use that as the starting point for the next case to save looping.
Skipping the reasoning of the task itself, the only algorithm that
will give you an ID not in the table
that will be used to insert a new line in the table
will result in a table still having random unique IDs
is generating a random number and then checking if it's already used
The best algorithm in that case is to generate a random number and do a select to see if it exists, or just try to add it if your database errs out sanely. Depending on the range of your key, vs, how many records there are, this could be a small amount of time. It also has the ability to spike and isn't consistent at all.
Would it be possible to run some queries on the BigTable and see if there are any ranges that could be exploited? ie. between 100,000 and 234,000 there are no ID's yet, so we could add ID's there?
Why not append your random number creator with the current date in seconds. This way the only way to have an identical ID is if two users are created at the same second and are given the same random number by your generator.

Resources