I'm writing a script which is supposed to merge some data from an SQL-based db. Each row has a long integer as a primary key (incremental). I was thinking about hashing these ids so that they'll somehow 'look' like the other ids already in my RethinkDB table. What I'm trying to achieve here is to avoid dups in case of an attempt to merge the same data again, but keeping the original integers as ids along with the generated ids of the data saved directly to RethinkDB's table feels weird.
Can I do that?
How does RethinkDB generate auto ids anyways?
And am I approaching this correctly..?
RethinkDB uses a string-encoding of 128 bit UUIDs (basically hashed integers).
The string format looks like this: "HHHHHHHH-HHHH-HHHH-HHHH-HHHHHHHHHHHH" where every 'H' is a hexadecimal digit of the 128 bit integer. The characters 0-9 and a-f (lower case) are used.
If you want to generate such UUIDs from an existing integer, I recommend hashing the integer first. This will give you an even distribution over the whole key space (this makes sharding easier and avoids hotspots).
As a second step you have to format the hash value in a string of the format shown above. If you don't have enough digits, it's fine to leave some of the last 'H' as constant 0.
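The two steps above (hash the integer, then format the hash) can be sketched in Python roughly as follows. This is only an illustration: the function name `id_from_sql_key`, the `namespace` string, and the choice of SHA-256 are assumptions, not anything RethinkDB prescribes.

```python
import hashlib
import uuid


def id_from_sql_key(sql_id: int, namespace: str = "legacy-sql") -> str:
    """Derive a UUID-shaped string from an integer primary key.

    The same input always yields the same output, so merging the same
    row twice produces the same id instead of a duplicate.
    """
    # Step 1: hash the integer to spread ids evenly over the key space.
    # The namespace prefix (hypothetical) keeps ids from different source
    # tables from colliding.
    digest = hashlib.sha256(f"{namespace}:{sql_id}".encode("utf-8")).digest()
    # Step 2: format the first 128 bits as lowercase
    # HHHHHHHH-HHHH-HHHH-HHHH-HHHHHHHHHHHH.
    return str(uuid.UUID(bytes=digest[:16]))


print(id_from_sql_key(42))
```

Note that this does not set the RFC 4122 version bits; if that matters, Python's standard `uuid.uuid5` is a name-based alternative that does.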
If you really want to go into the details of UUID generation, here are two links for further reading:
RFC 4122 "A Universally Unique IDentifier (UUID) URN Namespace" https://www.rfc-editor.org/rfc/rfc4122
RethinkDB's implementation of UUID generation and formatting https://github.com/rethinkdb/rethinkdb/blob/next/src/containers/uuid.cc
Currently I'm generating unique ids for rows in my database using int and auto_increment. These ids are public facing so in the url you can see something like this https://example.com/path/1 or https://example.com/path/2
After talking with another engineer they've advised me that I should use randomly generated ids so that they're not guessable.
How can I generate a unique ID without running a for loop over the database each time to make sure it's unique? Take Stripe, for example: all of their ids look like price_sdfgsdfg or prod_iisdfgsdfg. What's the best way to generate unique ids for rows like these?
Without knowing which language or database you're using, the simplest way is using uuids.
To prevent downloading all existing database unique keys, and then for looping over them all, simply just try to INSERT INTO whichever table you are using.
If the result fails (e.g. Exception), then the row is taken, continue.
If the result passes, break loop.
This only works when you have a column which is NOT NULL, and UNIQUE.
That's how I "know" without looping over the whole database of IDs, or downloading them into local memory, etc.
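The try-insert-and-retry loop described above could look something like this in Python with the standard `sqlite3` module. The table name `items`, the helper name, and the retry limit are assumptions made for the sketch.

```python
import sqlite3
import uuid


def insert_with_unique_id(conn, make_id=lambda: uuid.uuid4().hex, max_attempts=5):
    """Try candidate ids until one is accepted by the UNIQUE constraint."""
    for _ in range(max_attempts):
        candidate = make_id()
        try:
            # "with conn" commits on success and rolls back on error.
            with conn:
                conn.execute(
                    "INSERT INTO items (unique_id) VALUES (?)", (candidate,)
                )
            return candidate  # insert passed: break out of the loop
        except sqlite3.IntegrityError:
            continue  # id already taken: try another one
    raise RuntimeError("no free id found after several attempts")


conn = sqlite3.connect(":memory:")
# The column must be NOT NULL and UNIQUE for this approach to work.
conn.execute("CREATE TABLE items (unique_id TEXT NOT NULL UNIQUE)")
print(insert_with_unique_id(conn))
```

The database does the uniqueness check, so no existing ids ever have to be loaded into memory.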
Using auto_increment won't lead to duplicates because the database itself hands out the next available number atomically, even when several clients insert at the same time, which is the beauty of databases.
SQL example (MySQL/MariaDB syntax):
CREATE TABLE `test` ( `unique_id` INT NOT NULL, UNIQUE (`unique_id`) ) ENGINE = InnoDB;
Insert a unique_id
INSERT INTO `test` (`unique_id`) VALUES ('999999999');
Great, we have a row. Now try to insert the same value again:
INSERT INTO `test` (`unique_id`) VALUES ('999999999');
This time the insert fails with the error below, so pick a new id and retry:
#1062 - Duplicate entry '999999999' for key 'unique_id'
If these are public URLs, and the content is sensitive, then I definitely do not recommend ints, as someone can trivially guess 1 through 99999999... etc.
In any language, have a look at /dev/urandom.
In shell/bash scripts, I might use uuidgen:
9dccd646-043e-4984-9126-3060b4ced180
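In Python, I'll use pandas to derive an id from a hash of each row's contents:
import pandas as pd
# df is an existing DataFrame; index each row by a 64-bit hash of its index and values
df.set_index(pd.util.hash_pandas_object(df, encoding='utf8'), drop=True, inplace=True)
df.index.rename('hash', inplace=True)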
Lastly, UUIDs aren't perfect: they only use the characters a-f and 0-9, all lowercase, but they are easy to generate: every language has one.
In JavaScript you may want to check out some secure Open Source apps, for example, Jitsi: https://github.com/jitsi/js-utils/blob/master/random/roomNameGenerator.js where they combine words:
E.g. Satisfied-Global-Architectural-Bitter
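For the Stripe-style ids mentioned in the question, a minimal Python sketch using only the standard library might look like this. The helper names and the `price` prefix are illustrative assumptions; `secrets` draws from the operating system's secure random source.

```python
import secrets
import uuid


def random_uuid() -> str:
    """A random (version 4) UUID as a lowercase string."""
    return str(uuid.uuid4())


def prefixed_id(prefix: str, nbytes: int = 12) -> str:
    """A URL-safe random id with a readable prefix, e.g. 'price_...'."""
    # token_urlsafe encodes nbytes random bytes as URL-safe base64 text.
    return f"{prefix}_{secrets.token_urlsafe(nbytes)}"


print(random_uuid())
print(prefixed_id("price"))
```

Collisions are extremely unlikely at this length, but keeping the UNIQUE constraint on the column still catches them if they ever happen.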
I'm looking for an optimal way to search through millions of records for serial numbers, stored in a varchar column, that end with a specified string key.
I was using EndsWith, however performance is rather poor if several queries are sent.
Is there a better way to do it?
EDIT:
Since the search key is of variable length, I can't create a column that holds a cut-off value of the serial number. However, I've done some tests using Substring and Equals vs EndsWith, and I've reduced the execution time to 40% of that of EndsWith.
I'm still looking for better solution though :)
Unfortunately, searching for strings ending with a particular pattern is difficult on most databases+, because searching for string suffixes cannot use an index. This results in full table scans, which may be slow on tables with millions of rows.
If your database supports reverse indexes, add one for your string key column; otherwise, you can improve performance by simulating reverse indexes:
Add a column for storing your string key in reverse
If your RDBMS supports computed columns, add one for the reversed key
Otherwise, define a trigger that populates the reversed column from the key column
Create an index on the reversed column
Use the reversed column for your searches by passing in the reversed suffix that you are looking for.
For example, if you have data like this
key
-----------
01-02-3-xyz
07-12-8-abc
then the augmented table would have
key rev_key
----------- -----------
01-02-3-xyz zyx-3-20-10
07-12-8-abc cba-8-21-70
and your search for ENDS_WITH(key, '3-xyz') would ask for STARTS_WITH(rev_key, 'zyx-3'). Since string indexes speed up lookups by prefix, the "starts with" lookup would go much faster.
+ One notable exception is Oracle, which provides reverse key indexes specifically for situations like this.
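The simulated reverse index can be sketched in Python with the standard `sqlite3` module. The table and column names (`serials`, `serial`, `rev_serial`) are made up for the example, and the reversed value is filled in by application code here rather than by a computed column or trigger.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE serials (serial TEXT NOT NULL, rev_serial TEXT NOT NULL)")
# Index the reversed column so that prefix lookups on it can use the index.
conn.execute("CREATE INDEX idx_serials_rev ON serials (rev_serial)")


def insert_serial(conn, serial: str) -> None:
    # Store the value together with its reversed form.
    conn.execute(
        "INSERT INTO serials (serial, rev_serial) VALUES (?, ?)",
        (serial, serial[::-1]),
    )


def find_by_suffix(conn, suffix: str) -> list:
    # "ends with suffix" becomes "reversed value starts with reversed suffix",
    # written as a range comparison on the indexed column.
    prefix = suffix[::-1]
    rows = conn.execute(
        "SELECT serial FROM serials "
        "WHERE rev_serial >= ? AND rev_serial < ? ORDER BY serial",
        (prefix, prefix + "\uffff"),
    )
    return [row[0] for row in rows]


for s in ("01-02-3-xyz", "07-12-8-abc"):
    insert_serial(conn, s)
print(find_by_suffix(conn, "3-xyz"))
```

Whether the index is actually used depends on the database and its query planner, so it is worth checking the execution plan on real data.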
I'm working on a project that uses pkg_crypto to protect users' personal information. There are several thousand rows (which is expected to grow to maybe several tens of thousands), and whenever I use a WHERE or ORDER BY clause in a query, the whole table is decrypted before the results are returned. This takes several seconds for a single query, which is usable for development but will probably not be very good for the release.
Is there a way to create an index that will work on the encrypted columns without compromising security?
The inserts and selects look something like this (with iBatis):
insert:
INSERT INTO "USER_TABLE"
(
"ID"
,"LOGIN"
,"PASSWORD"
,"NAME"
,"EMAIL"
)
VALUES
(
user_table_seq.nextval
,#login#
,#password#
,pkg_crypto.encrypt(#name#, 'key')
,pkg_crypto.encrypt(#email#, 'key')
)
select:
SELECT
"ID"
,"LOGIN"
,"PASSWORD"
,pkg_crypto.decrypt("NAME", 'key') NAME
,pkg_crypto.decrypt("EMAIL", 'key') EMAIL
FROM "USER_TABLE"
WHERE pkg_crypto.decrypt("NAME", 'key') LIKE #name# || '%'
AND pkg_crypto.decrypt("EMAIL", 'key') LIKE '%' || #email#
I'll preemptively put out there that the password is hashed by the servlet before being passed to the db.
Do you need to use PKG_CRYPTO to encrypt the data (which, I'm assuming, is something you wrote that calls either DBMS_CRYPTO or DBMS_OBFUSCATION_TOOLKIT)? Oracle has a feature called transparent data encryption (TDE), though it is an extra-cost option, that would allow you to have Oracle transparently encrypt the data on disk, decrypt it when it's read from disk, and then use this sort of LIKE predicate on your data.
Substantially, the answer is No.
When each value is encrypted, it has a random IV (initialization vector) chosen to go with it. And this means that you cannot predict what is going into the index. If you re-encrypt the value (even with the same key), you will get a different result. Therefore, you cannot meaningfully use an index on the encrypted value because you cannot reproduce the encryption for the value you're searching for. The index would, in any case, only be useful for equality searches. The data would be in a random sequence.
You might do better with a hash value stored (as well as the encrypted value). If you hash the names with a known algorithm, then you can reproduce the hash value on demand and find the rows that match. But simply knowing the hash won't allow you (or an intruder) to determine the value that was hashed except through pre-computed 'rainbow tables'.
So, you cannot meaningfully index encrypted columns - not even for uniqueness (since the same value would be encrypted different ways by virtue of the random IV).
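The idea of storing a reproducible hash next to the encrypted value can be sketched in Python as follows. This sketch swaps the plain hash for a keyed hash (HMAC-SHA256), so that pre-computed tables are of no use without the key; the key value, the normalization step, and the in-memory `rows` list standing in for the table are all assumptions.

```python
import hashlib
import hmac

# Hypothetical secret, kept separate from the encryption key.
INDEX_KEY = b"replace-with-a-secret-key"


def lookup_hash(value: str, key: bytes = INDEX_KEY) -> str:
    """Deterministic keyed hash of a value, suitable for an indexed column."""
    normalized = value.strip().lower()
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


# Stand-in for the table: the hash column is indexable, the ciphertext is not.
rows = [
    {"id": 1, "name_hash": lookup_hash("Alice Smith"), "name_encrypted": b"<ciphertext>"},
    {"id": 2, "name_hash": lookup_hash("Bob Jones"), "name_encrypted": b"<ciphertext>"},
]


def find_by_name(rows, name: str) -> list:
    # Recompute the hash of the search term and compare for equality.
    wanted = lookup_hash(name)
    return [row["id"] for row in rows if row["name_hash"] == wanted]


print(find_by_name(rows, "alice smith"))
```

This only supports exact-match lookups; prefix or suffix searches such as the LIKE predicates in the question still cannot be served this way.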
I'm building a database for a very important state institution. So far, for primary key columns I've sometimes used numbers, particularly int32 or int64, and sometimes I've used char, but even then I stored only numeric characters. Since this is a very crucial task, I have to make it efficient in terms of both performance and convenience. Now I want to know if it makes any difference whether I use number or char for a PK column, considering the values in the char column will still be numeric characters.
P.S. You might wonder what the point is in using char if I store numeric characters anyway. The reason is that the PK column is composed of many parts, like 3 characters for country, 3 characters for province, 3 characters for city and 6 characters for a person. Besides that, I can perform string operations on those columns without worrying about explicit conversion. (I know there's implicit conversion in Oracle, but relying on it is discouraged.)
Neither as such.
Your primary key should be a surrogate. Int or big int, Guid if scaling is going to be an issue.
Then you should have country, province, city and person as a unique compound key. The 'intelligent' number thingy is rarely a good idea.
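As a rough illustration of a surrogate primary key combined with a unique compound key, here is a Python sketch using the standard `sqlite3` module. The table and column names are assumptions; in Oracle the surrogate would typically come from a sequence or identity column instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE person (
        id        INTEGER PRIMARY KEY,   -- surrogate key, assigned by the database
        country   TEXT NOT NULL,
        province  TEXT NOT NULL,
        city      TEXT NOT NULL,
        person_no TEXT NOT NULL,
        UNIQUE (country, province, city, person_no)  -- natural compound key
    )
    """
)


def add_person(conn, country, province, city, person_no) -> int:
    """Insert a person and return the generated surrogate id."""
    with conn:
        cur = conn.execute(
            "INSERT INTO person (country, province, city, person_no) "
            "VALUES (?, ?, ?, ?)",
            (country, province, city, person_no),
        )
    return cur.lastrowid


print(add_person(conn, "001", "002", "003", "000001"))
```

Other tables reference `id`, while the unique constraint still rejects a second row with the same country, province, city and person number.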
The integer data type is commonly used for primary keys because it gives better indexing performance than varchar. If you need another unique key, such as a combination of characters, you can add it beside the primary key and mark it as unique.
Another important reason, for me, to use generic data such as an integer or a GUID for the primary key is flexibility. For example, you could use the email address as the primary key of a user table. But if, a few days later, you need to change your application's rules so that one user can have more than one email address, it will be much more difficult to change your data structure.
Just wondering, does anyone here have a good idea for generating a nice order id?
For example:
832-28-394, which is quite a nice, formal-looking order id (rather than just using a database auto-increment number like ID=35).
The order id needs to look random so that a user cannot guess it.
E.g. 832-28-395 shouldn't exist, so there will always be some gap between ids.
Just like the account number on your bank card?
Cheers
If you are using .NET you can use System.Guid.NewGuid()
The auto-incremented IDs are stored as integer or long integer data. One of the reasons for this is that this format is compact, saving space, including in indexes, which typically include the primary key for use with joins and such.
If you wish to create a nice looking id following a particular format syntax, you'll need to manage the generation of the IDs yourself, and store these in a "regular" column not one that is auto-incremented.
I suggest you keep using "ugly looking" ids, be they auto-incremented or not, and format these values for display purposes only, using whatever format you may desire, including a format that uses the values from several columns. Depending on the database system you are using, you may be able to declare custom functions at the level of the database itself, allowing you to obtain the readily formatted value with a simple query such as:
SELECT MakeAFancyId(id_field), some_other_columns, ..
FROM ...
If you cannot use some built-in or custom function at the level of SQL, you'll need to format the value supplied by SQL (an integer of sorts), into the desired format, on the client-side, using the language associated with your UI / presentation framework.
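A client-side formatter of the kind described above could be as small as the following Python sketch. The function name and the 3-2-3 digit grouping (taken from the 832-28-394 example in the question) are assumptions.

```python
def format_order_id(order_id: int) -> str:
    """Format an integer id for display as NNN-NN-NNN (display only)."""
    if not 0 <= order_id < 100_000_000:
        raise ValueError("order_id must fit in 8 digits")
    digits = f"{order_id:08d}"  # left-pad with zeros to 8 digits
    return f"{digits[:3]}-{digits[3:5]}-{digits[5:]}"


print(format_order_id(83228394))
```

The stored key stays a plain integer; only the presentation layer ever sees the hyphenated form.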
I'd create something where the first eight digits are loosely in a pattern, and the third quartet looks random but is really a sort of checksum.
So, for example, the first eight digits increment based on the current seconds on the server clock.
The first three digits of the last four could be something like the sum of the first four digits, plus twice the sum of the second four, which gives a number of at most three digits (pad it with leading zeros). The final digit is calculated so that the sum of all 11 digits plus this last one is a multiple of 9.
This is slightly akin to how barcode numbers are verified. You can format the resulting 12 digits any way you want, although it is the first eight that are unique here.
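The checksum scheme described above can be sketched in Python like this. The function names are made up, and zero-padding the weighted sum to three digits is an assumption made so that every number comes out 12 digits long.

```python
def make_order_number(first_eight: str) -> str:
    """Append a 3-digit weighted sum and a final check digit to 8 digits."""
    if len(first_eight) != 8 or not first_eight.isdigit():
        raise ValueError("expected exactly 8 digits")
    digits = [int(c) for c in first_eight]
    # Sum of the first four digits plus twice the sum of the second four.
    weighted = sum(digits[:4]) + 2 * sum(digits[4:])
    body = first_eight + f"{weighted:03d}"
    # Final digit makes the sum of all 12 digits a multiple of 9.
    check = -sum(int(c) for c in body) % 9
    return body + str(check)


def is_valid(order_number: str) -> bool:
    """Recompute the last four digits and compare."""
    plain = order_number.replace("-", "")
    if len(plain) != 12 or not plain.isdigit():
        return False
    return make_order_number(plain[:8]) == plain


number = make_order_number("12345678")
print(f"{number[:4]}-{number[4:8]}-{number[8:]}")  # 1234-5678-0621
```

A mistyped or guessed number will usually fail `is_valid` without a database lookup, although the scheme is only a sanity check and not a secret.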
Hash the clock time.
Mod by 100,000 or something.
Format with hyphens.
Check for duplicates. If found, restart.
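Those four steps might be sketched in Python as below. The use of SHA-256, the modulus of 100,000,000 (to get eight digits), the 3-2-3 grouping, and the in-memory `existing` set standing in for the database check are all assumptions.

```python
import hashlib
import time


def candidate_from_clock(now_ns: int) -> str:
    """Hash a clock reading, reduce it to 8 digits, and add hyphens."""
    digest = hashlib.sha256(str(now_ns).encode("utf-8")).digest()
    number = int.from_bytes(digest[:8], "big") % 100_000_000
    digits = f"{number:08d}"
    return f"{digits[:3]}-{digits[3:5]}-{digits[5:]}"


def new_order_id(existing: set, clock=time.time_ns) -> str:
    """Generate candidates until one is not already in use."""
    while True:
        candidate = candidate_from_clock(clock())
        if candidate not in existing:  # duplicate check; restart if found
            existing.add(candidate)
            return candidate


used = set()
print(new_order_id(used))
```

In a real system the duplicate check would be a unique constraint in the database rather than a Python set.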
I would suggest using an auto-increment ID in the database to link tables and as a primary key. Integer fields are generally faster than string fields for indexing as well as searching.
You can keep the order number (which is for display) as a separate field in the order table. And whenever you plan to send or display a URL that contains the order ID (which is an auto-incremented number) to a user, you can encrypt it with some algorithm.
That serves both of your purposes.
But I suggest not making a string the primary key, though you can put a unique constraint on the order number that is going to be displayed.
Hope this helps.
Kalpak Luniya
I would suggest internally you keep the database derived primary key, which is auto-incremented.
For the visible order number, you will probably need a longer length than 8 characters, if you are using this for security.
If you are using Ruby, look at SecureRandom, which will generate sufficiently random strings to accommodate this. For example, you can use SecureRandom.hex(16), and it will give you a 32-character hex string (16 random bytes). I believe it can also give you base 64 strings, which will look weirder but be shorter.
Make sure this is not your only security on an order, as it may not be that hard to find a valid order number within your 8 digit code, especially if some of the digits are some sort of checksum.
For security reasons I suggest that you use a cryptographically secure random number generator. Also think about increasing the user id length: for example, with 1 million users and 100 million possible ids (8 digits), the probability of guessing a valid user id on the first try is 0.01, and about 69 tries are enough to raise that probability above 0.5.
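The arithmetic behind figures like these can be checked with a few lines of Python. This sketch assumes 1 million valid ids in a space of 100 million (8 digits), which is what a first-try probability of 0.01 implies, and treats each guess as independent; the function names are made up for illustration.

```python
import math


def guess_probability(valid_ids: int, id_space: int, tries: int) -> float:
    """Chance that at least one of `tries` independent guesses is valid."""
    p = valid_ids / id_space
    return 1 - (1 - p) ** tries


def tries_for_probability(valid_ids: int, id_space: int, target: float = 0.5) -> int:
    """Smallest number of independent guesses whose success chance reaches target."""
    p = valid_ids / id_space
    return math.ceil(math.log(1 - target) / math.log(1 - p))


print(guess_probability(1_000_000, 100_000_000, 1))
print(tries_for_probability(1_000_000, 100_000_000))
print(tries_for_probability(1_000_000, 10**12))
```

Under these assumptions the 0.5 mark is crossed at roughly 69 tries for 8-digit ids, while each extra digit of id length multiplies the number of guesses needed by about ten.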