I'm building a database of a very important state institution. So far, for Primary Key columns I've sometimes used numbers, particularly int32 or int64, and sometimes I've used char but still I stored only numeric characters. Since this a very crucial task I have to make it both performance and comfort-wise efficient. Now I want to know,if it makes any difference to use number or char for a PK column considering the values in the char column will still be numeric characters.
P.S.You might wonder what the point is in using char if I anyway store numeric characters. The reason is becuase the PK column is composed of many parts, like 3 characters for country, 3 characters for province,3 characters for city and 6 characters for a person. And besides that I can make string operations on those columns without worrying about explicit conversion.(I know there's implicit conversion in Oracle but it's discouraged to rely on it.)
Neither as such.
Your primary key should be a surrogate. Int or big int, Guid if scaling is going to be an issue.
Then you should have country, province, city and person as a unique compound key. The 'intelligent' number thingy is rarely a good idea.
integer data type commonly used for primary key because it give better performance for data indexing than varchar. If you need another unique key like combination of characters, you can add it beside primary key marked as unique.
Another important reason, for me, to use generic data for primary key like integer or GUID is for flexibility. For example, you can use email address as primary key of user table. but few days later, when you need to change your application rule that one user can have more than one email address. you will get it more difficult to change you data structure.
Related
We are working with a table in a 3rd party database that does not have a primary key but does have a unique index.
I have therefore been looking at using the ORA_HASH function to produce a de facto unique Id by passing in the values of the columns in the unique index.
Unfortunately, I can already see that we have a few collisions, which means that we can't derive a unique id using this method.
Is there an alternative to ORA_HASH that would provide a unique id for a unique input?
I suppose I could generate an Id using DBMS_CRYPTO.Hash but I'd ideally like to get a numeric value.
Edit
The added complication is that I then need to store these records in another (SQL Server) database and then compare the records from the original and the replica tables. So rank doesn't help me here since records can be added or deleted in the original table.
DBMS_CRYPTO.HASH could be used to generate a high-bit hash (high enough to give you a very low, but not zero, chance of collisions), but it returns 'RAW' not 'NUMBER'.
To guarantee no collisions ever, you need a one-to-one hash function. As far as I know, Oracle does not provide one.
A practical approach would be to create a new table to map unique keys to a newly generated primary key. E.g., unique value ("ABC",123, 888) maps to 838491 (where you generated 838491 using a sequence).
You'd have to update the mapping table periodically, to account for inserted rows, and that would be a pain, but it would let you generate your own PKs and keep track of them without a lot of complication.
Have you tried:
DBMS_UTILITY.GET_HASH_VALUE (
name VARCHAR2,
base NUMBER,
hash_size NUMBER)
RETURN NUMBER;
I'm writing a script which supposed to merge some data from sql-based db. Each row has a long-integer as a primary key (incremental). I was thinking about hashing these ids so that they'll somehow 'look' like the other ids already in my RethinkDB table. What I'm trying to achive here is to avoid dups in case of an attempt to merge the same data again, but keeping the original integers as ids along with the generated ids of the data saved directly to RethinkDB's table feels weird.
Can I do that?
How does RethinkDB generate auto ids anyways?
And am I approaching this correctly..?
RethinkDB uses a string-encoding of 128 bit UUIDs (basically hashed integers).
The string format looks like this: "HHHHHHHH-HHHH-HHHH-HHHH-HHHHHHHHHHHH" where every 'H' is a hexadecimal digit of the 128 bit integer. The characters 0-9 and a-f (lower case) are used.
If you want to generate such UUIDs from an existing integer, I recommend hashing the integer first. This will give you an even distribution over the whole key space (this makes sharding easier and avoids hotspots).
As a second step you have to format the hash value in a string of the format shown above. If you don't have enough digits, it's fine to leave some of the last 'H' as constant 0.
If you really want to go into the details of UUID generation, here are two links for further reading:
RFC 4122 "A Universally Unique IDentifier (UUID) URN Namespace" https://www.rfc-editor.org/rfc/rfc4122
RethinkDB's implementation of UUID generation and formatting https://github.com/rethinkdb/rethinkdb/blob/next/src/containers/uuid.cc
I have a column called "status" in PostgreSQL. First it used to be "status_id" of type integer. The values were kept on client, so there was no table on the server called statuses where I'd keep those statuses and then do inner join with the first table.
I used to send the ids of the statuses from the client (they had the names on the client). However, at some point I understood I'd better make the server hold those statuses. Not in a separate table but in the first one and I want to make them strings. So the initial table will have a status column of type string (varchar, to be more specific). I read it wouldn't be that slow.
In general, is it a good idea? I suppose it is because doing inner join (in case I'd keep statuses in the separate table) each time is expensive as well as sending ids from the client.
1) The only concern I have is that the column status should be of type char, not varchar. It should make it more effective I suppose. Is that so?
2) If the first case is correct then I'm not sure I'll be able to name all the statuses using exactly the same amount of characters, let's say, 5 characters. Some of them might be longer, some shorter. How can I solve this?
UPDATE:
It's not denationalization because I'm talking about 1 single table. There is no and has never been the second table called Statuses with the fields (id, status_name).
What I'm trying to convey is that I could use char(n) for status_name and also add index on it. Then it should be fast enough. However, it might be or not possible to name all the statuses with the certain (n) amount of characters and that's the only concern.
I don't think so using char or varchar instead integer is good idea. It is hard to expect how much slower it will be than integer PK, but this design will be slower - impact will be more terrible when you will join larger tables. If you can, use ENUM types instead.
http://www.postgresql.org/docs/9.2/static/datatype-enum.html
CREATE TYPE mood AS ENUM ('sad', 'ok', 'happy');
CREATE TABLE person (
name text,
current_mood mood
);
INSERT INTO person VALUES ('Moe', 'happy');
SELECT * FROM person WHERE current_mood = 'happy';
name | current_mood
------+--------------
Moe | happy
(1 row)
PostgreSQL varchar and char types are very similar. Internal implementation is same - char can be (it is paradox) little bit slower due addition by spaces.
I'd go one step further. Never use the outdated data type char(n), unless you know you have to (for compatibility or some rare exotic reason). The type is utterly useless in a modern database. Padding strings with blank characters is nonsense, and if you have to do it, you can do it in a cheaper fashion with rpad() on data retrieval.
SELECT rpad('short', 10) AS char_10_string;
varchar is basically the same as text and allows a length specifier: varchar(n). I generally use just text. If I need to limit the length, I use a CHECK constraint. Here's one example, why.
Whenever you can use a simple integer (or enum) instead, that's a bit smaller and faster in every respect. Consider #Pavel's answer for enum.
As for:
because doing inner join (...) each time is expensive
Well, it carries a small cost, but it's generally cheaper than redundantly saving text representation of the status instead of a much cheaper integer in the main table. That kind of rumor is spread by people having problems understanding the concept of database normalization. The enum type is a compromise here - for relatively static sets of values.
I'm looking for optimal way to search through millions of records that contain serial number saved as varchar column which ends with specified string key.
I was using EndsWith, however performance is rather poor if several queries are sent.
Is there a better way to do it?
EDIT:
Since search key is of variable length, I can't create column that holds cut-off value of serial number. However, I've done some tests with using Substring and Equals vs EndsWith and I've lowered down execution speed to 40% of the one of EndsWith.
I'm still looking for better solution though :)
Unfortunately, searching for strings ending with a particular pattern is difficult on most databases+, because searching for string suffixes cannot use an index. This results in full table scans, which may be slow on tables with millions of rows.
If your database supports reverse indexes, add one for your string key column; otherwise, you can improve performance by simulating reverse indexes:
Add a column for storing your string key in reverse
If your RDBMS supports computed columns, add one for the reversed key
Otherwise, define a trigger that populates the reversed column from the key column
Create an index on the reversed column
Use the reversed column for your searches by passing in the reversed suffix that you are looking for.
For example, if you have data like this
key
-----------
01-02-3-xyz
07-12-8-abc
then the augmented table would have
key rev_key
----------- -----------
01-02-3-xyz zyx-3-20-10
07-12-8-abc cba-8-21-70
and your search for ENDS_WITH(key, '3-xyz') would ask for STARTS_WITH(rev_key, 'zyx-3'). Since string indexes speed up lookups by prefix, the "starts with" lookup would go much faster.
+ One notable exception is Oracle, which provides reverse key indexes specifically for situations like this.
just wondering does anyone in here have good idea about generating nice order id?
for example
832-28-394, which show a quite nice and formal order id (rather than just use an database auto increment number like ID=35).
the order id need to look random so it can not be able to guess by user.
e.g. 832-28-395 (shoudnt exist) so there will always some gap between each id.
just like the account number for your bank card?
Cheers
If you are using .NET you can use System.Guid.NewGuid()
The auto-incremented IDs are stored as integer or long integer data. One of the reasons for this is that this format is compact, saving space, including in indexes which are typically inclusive a primary key for use with joins and such.
If you wish to create a nice looking id following a particular format syntax, you'll need to manage the generation of the IDs yourself, and store these in a "regular" column not one that is auto-incremented.
I suggest you keep using "ugly looking" ids, be they auto-incremented or not, and format these value for display purposes only, using whatever format you may desire, including some format that use the values from several columns. Depending on the database system you are using you may be able to declare custom functions, at the level of the database itself, allowing you to obtain the readily formatted value with a simple query (as in
SELECT MakeAFancyId(id_field), some_other_columns, ..
FROM ...
If you cannot use some built-in or custom function at the level of SQL, you'll need to format the value supplied by SQL (an integer of sorts), into the desired format, on the client-side, using the language associated with your UI / presentation framework.
I'd create something where the first eight numbers are loosely in a pattern, and a third quartet looks random but is really a sort of checksum.
So, for example, the first eight digits increment based on the current seconds on the server clock.
The last four could be something like the sum of the first four, plus twice the sum of the second four, which will give either a two or three digit number. The final digit is calculated so that the sum of all 11 digits plus this last one is a multiple of 9.
This is slightly akin to how barcode numbers are verified. You can format the resulting 12 digits any way you want, although it is the first eight that are unique here.
Hash the clock time.
Mod by 100,000 or something.
Format with hyphens.
Check for duplicates. If found, restart.
I would suggest using a autoincrement ID in the database to link tables and as a primary key. Integer fields are always faster than string fields for indexing and well as searching.
You can have the order number field (which is for display) as a different field in the order table which will be used to display. And whenever you are planning to send a URl to a user or display a URL to the user which has order ID (which is a autoincremented number) you can encrypt it with some algorithm.
Both your purpose will be solved.
But I suggest not to make string as primary key. Though you can have a unique constraint on the order number which is going to be displayed.
Hope this helps.
Kalpak Luniya
I would suggest internally you keep the database derived primary key, which is auto-incremented.
For the visible order number, you will probably need a longer length than 8 characters, if you are using this for security.
If you are using Ruby, look at SecureRandom, which will generate sufficiently random strings to accomodate this. For example, you can use SecureRandom.hex(16), and it will give you a 16 digit hex number. I believe it can also give you base 64 strings, which will look weirder but be shorter.
Make sure this is not your only security on an order, as it may not be that hard to find a valid order number within your 8 digit code, especially if some are some sort of checksum.
For security reasons i suggest that you should use Criptographicaly secure random number generator. Think about idea on icreasing User Id length -if you have 1 million users then the probability to gues User ID in first try is 0.01 and 67 tries to increase probability over 0.5