I use STANDARD_HASH in the manner below to hash credit card numbers. It returns hashes with 40 characters. This seems excessive for credit card numbers which have 16 digits. I would like to save space in my export. How can I create shorter hashes while still achieving these goals:
Have the same level of security and non-reversibility as
STANDARD_HASH
Keep the likelihood of two card numbers receiving the same hash very small (though if this happens a few times, it's OK)
Have the shortest possible hash result in terms of characters or space required when exporting to a CSV
Perform this operation while using as few database resources as possible
Perform this operation using read-only access to the database
If a method exists which achieves goals 2 and 3, then I expect that goal 1 could be achieved by using this method to hash the output of STANDARD_HASH.
SELECT STANDARD_HASH(TRIM(' 123456789123456789 ' )) FROM DUAL;
TRIM removes the spaces and then STANDARD_HASH returns a hash of length 64.
Here's the same example on db<>fiddle:
https://dbfiddle.uk/?rdbms=oracle_18&fiddle=7cd086f1b60f69eb3bc6f54d4a211844
The database version is "Oracle Database 18c Enterprise Edition".
That length of 64 is not the length of the result, but just how it displays. STANDARD_HASH returns a RAW value, that is displayed as hexadecimal.
You can convert this raw value into something usable using the UTL_RAW functions at https://docs.oracle.com/database/121/TTPLP/u_raw.htm#TTPLP71498
Eg
SELECT UTL_RAW.CAST_TO_VARCHAR2 (STANDARD_HASH(TRIM(' 123456789123456789 ' ))) FROM DUAL;
Note that when you try this in the fiddle, you’ll find a few ? that represent non-printable characters, so allow for that in your export.
Edit to add : STANDARD_HASH uses SHA1 by default - but that and MD5 have vulnerabilities - better to just add the extra parameter to STANDARD_HASH to use a longer SHA -see https://docs.oracle.com/database/121/SQLRF/functions183.htm#SQLRF55647
SELECT UTL_RAW.CAST_TO_VARCHAR2 (STANDARD_HASH(TRIM(' 123456789123456789 ' ), ‘SHA256’)) FROM DUAL;
Edit to address the 5 points :
it uses the same STANDARD_HASH so is the same
SHA1 is prone to collisions, so as above swap to SHA256 or higher
STANDARD_HASH uses industry-standard hashing algorithms. It is what it is. Be aware that by its very nature, hashing returns binary values, so it is your responsibility to convert them to appropriate format - eg for CSV files, you can convert to Base64 (see Base64 encoding and decoding in oracle )
and 5. No additional resources
Edit to respond to addition comments :
Yes, full SELECT you stated looks correct :
select utl_raw.cast_to_varchar2(utl_encode.base64_encode(
STANDARD_HASH(TRIM(' 123456789123456789 ' ), 'SHA1'))) FROM dual;
Base64 operates on groups of 3 bytes at a time, and appends "=" for each byte short. SHA1 hashes are always 20 bytes, so is always 1 byte short.
So offhand, you COULD trim that trailing "=" off - though I would advise against it (lean code beats premature optimisation). For example, if you subsequently decided to upgrade from SHA1 to SHA256, that generates hashes with a different number of bytes, and therefore potentially 0 or 2 "=" at the end, so weird bugs await.
Yes, "+" and "/" are valid characters in the Base64 output (along with 0-9, and upper-and lower- case letters - hence 64 characters in all, plus the =), but importantly commas and double-quotes are not - so yes, Base64 strings are safe to go into a CSV format.
FYI, a quick summary of Base64 (since I guess that you like me always like to have an overview of what I'm dealing with)
Base64 is used to translate a stream of binary data into printable strings. Now 3 bytes of binary data is 24 bits, which of course can be regarded as 4 lots of 6-bits (we can ignore the byte boundaries). Any collection of 6 bits has 2^6 = 64 possible values (hence the Base64 name), which are represented as 64 characters :
Upper-case letters
Lower case letters (so yes, case-sensitive).
digits 0-9
"+" and "/"
Hence each character in the Base64 output represents the next 6 bits of the binary data.
Related
I need to join two tables - one table has householdid which is CHAR30, which appears to have center alignment and the other householdid as numeric 20. I need to convert to the numeric 20 but when I do that it appears truncated, perhaps because of the strange alignment (not all of the 30 positions are actually needed).
When I try to keep the full 30 positions as a numeric I instead get a conversion to scientific notation so of course this will not work as a key id for later operations.
As long as the number is converted properly, it doesn't matter what format it has. A format just tells SAS how to show you the number. Behind the scenes, it is just a DOUBLE.
1.0 = 1 = 1e0
Now if you have converted to a number and cannot get a join, then look at the informat you used to read it in.
try
num_id = input(strip(char_id),best32.);
Strip removes leading and trailing blanks. The BEST32. INFORMAT tries its "best" to read the number up to 32 characters in length.
You cannot store a 20 digit number as a numeric in SAS. SAS stores all numbers as 8 byte floating point and so does not have enough bits to represent that many digits uniquely. You can ask SAS what is the largest integer it can represent exactly by using the CONSTANT() function.
1 data _null_;
2 x=constant('EXACTINT',8);
3 put x = comma32. ;
4 run;
x=9,007,199,254,740,992
Read and store your 20 and 30 digit strings as character variables.
Use the bestd32. format. Tends to work out pretty well for long key variables. Depending on the length of the variable, you can change 32 to whichever length you need.
Based on the comments under the original question, the only thing you can do is convert all ID fields to strings, and use the strings to do the joins. #Reeza suggested this in one of the comments but it should have been posted as an answer.
I assume you are pulling this information out of another database/system that allows for greater numeric precision then SAS does. If you don't convert the values to strings when they are read into SAS, then you run the risk of losing precision.
If you lose precision, the ID in SAS is likely to become very slightly different to the ID in the original system, which can cause problems when searching the original system for an ID obtained from SAS.
Be sure you don't read the numbers into SAS as numeric, then convert to string. If you do it this way you are still losing precision as soon as the numbers are stored in SAS as numeric variables.
I'm looking for a way to implement a hashing mechanism to hash an input (0 to 2^32 - 1) to a fixed possibly 12 character hash.
Background:
I have a transaction table, where the primary key is auto increment (max size is 2^32) and I have to show an invoice no to the client which has to be of decent characters length (I'm thinking 12) and so since the client shouldn't get id as 0000-0000-0001, I was thinking hashing is the best way to go.
The main requirement (that I can think of) is that many to one mapping should never take place, and should not be slow.
Would it be okay if I use a common hashing mechanism and then drop the extra characters. (md5 for example in php generates 32 character string)?
The way I understand, there is no need to be secure cryptographically, and so I can generate a custom hash if possible.
Similar links:
1) Symmetric Bijective Algorithm for Integers
2) Pseudo-random-looking one-to-one int32->int32 function
Using md5 and chopping off most of it is not a good idea, because there is no guarantee that you would get a unique cache. Besides, you have much easier alternatives available to you, because you have a lot more bits than you need.
Values in the range [0..232] need 32 bit (duh!). You have 12 printable characters, which give you 72 bits if you stay within Base-64 encoding range of characters. You don't even need that many characters - you can use three bits per character for the initial eight characters, and two bits per character for the last four digits. This way your 12 characters would stay in the range ['0'..'7'], and the last four would be in the range ['0'..'3']. Of course you are not bound to numeric digits - you could use letters for some groups of digits, to give it a more "randomized" appearance.
the id is auto increment, and I don't think that I should give invoice numbers as 000...001 and so on.
Start with least significant bits when you generate these representations, then proceed to least significant, or make an arbitrary (but fixed) map of which bits go to what digit in the 12-character representation. This way the IDs would not look sequential, but would remain fully reversible.
My program is a decoder for a binary protocol. One of the fields in that binary protocol is an encoded String. Each character in the String is printable, and represents an integral value. According to the spec of the protocol I'm decoding, the integral value it represents is taken from the following table, where all possible characters are listed:
Character Value
========= =====
0 0
1 1
2 2
3 3
[...]
: 10
; 11
< 12
= 13
[...]
B 18
So for example, the character = represents an integral 13.
My code was originally using ord to get the ASCII code for the character, and then subtracting 48 from that, like this:
def Decode(val)
val[0].ord - 48
end
...which works perfectly, assuming that val consists only of characters listed in that table (this is verified elsewhere).
However, in another question, I was told that:
You are asking for a Ruby way to use ord, where using it is against
the Ruby way.
It seems to me that ord is exactly what I need here, so I don't understand why using ord here is not a Rubyist way to do what I'm trying to do.
So my questions are:
First and foremost, what is the Rubyist way to write my function above?
Secondary, why is using ord here a non-Rubyist practice?
A note on encoding: This protocol which I'm decoding specifies precisely that these strings are ASCII encoded. No other encoding is possible here. Protocols like this are extremely common in my industry (stock & commodity markets).
I guess the Rubyistic way, and faster, to decode the string into an array of integers is the unpack method:
"=01:".unpack("C*").map {|v| v - 48}
>> [13, 0, 1, 10]
The unpack method, with "C*" param, converts each character to an 8-bit unsigned integer.
Probably ord is entirely safe and appropriate in your case, as the source data should always be encoded the same way. Especially if when reading the data you set the encoding to 'US-ASCII' (although the format used looks safe for 'ASCII-8BIT', 'UTF-8' and 'ISO-8859', which may be the point of it - it seems resilient to many conversions, and does not use all possible byte values). However, ord is intended to be used with character semantics, and technically you want byte semantics. With basic ASCII and variants there is no practical difference, all byte values below 128 are the same character code.
I would suggest using String#unpack as a general method for converting binary input to Ruby data types, but there is not an unpack code for "use this byte with an offset", so that becomes a two-part process.
For eg. if we have two strings 2 and 10, 10 will come first if we order lexicographically.
The very trivial sol will be to repeat a character n number of time.
eg. 2 can be encoded as aa
10 as aaaaaaaaaa
This way the lex order is same as the numeric one.
But, is there a more elegant way to do this?
When converting the numbers to strings make sure that all the strings have the same length, by appending 0s in the front if necessary. So 2 and 10 would be encoded as "02" and "10".
While kjampani's solution is probably the best and easiest in normal applications, another way which is more space-efficient is to prepend every string with its own length. Of course, you need to encode the length in a way which is also consistently sorted.
If you know all the strings are fairly short, you can just encode their length as a fixed-length base-X sequence, where X is the number of character codes you're willing to use (popular values are 64, 96, 255 and 256.) Note that you have to use the character codes in lexicographical order, so normal base64 won't work.
One variable-length order-preserving encoding is the one used by UTF-8. (Not UTF-8 directly, which has a couple of corner cases which will get in the way, but the same encoding technique. The order-preserving property of UTF-8 is occasionally really useful.) The full range of such compressed codes can encode values up to 42 bits long, with an average of five payload bits per byte. That's sufficient for pretty long strings; four terabyte long strings are pretty rare in the wild; but if you need longer, it's possible, too, by extending the size prefix over more than one byte.
Break the string into successive sub strings of letters and numbers and then sort by comparing each substring as an integer if it's an numeric string
"aaa2" ---> aaa + 2
"aaa1000" ---> aaa + 1000
aaa == aaa
Since they're equal, we continue:
1000 > 2
Hence, aaa1000 > aaa2.
I want to encrypt some info for a licensing system and I want the result to be able to be typed in by the user.
Update: This operation must be reversible (decrypt-able)
E.g.,
Encrypt ( ComputerID+ProductID) -> (any standard ASCII character that can be typed. Ideally maybe even just A-Z).
So far what I did was to convert the encrypted text to HEX (so it's any character from 0-F) but that doubles the number of characters.
I'm using VB6.
I'm thinking I'd do some operation on each pair of (Input$(x) and Key$(x)) and then do a MOD to keep it within a range of ascii values (maybe 0-9-A-Z)
Any suggestions of a good algorithm?
Look into Base64 "encryption."
Base 64 will convert a number into 64 different ASCII characters, verses hex which is only 16 different ASCII characters... Making Base64 more compact and what you are looking for.
EDIT:
Code to do this in VB6 is available here: http://www.nonhostile.com/howto-encode-decode-base64-vb6.asp
Per Fuzzy Lollipop, below, Base32 looks like an even better option. Bonus points if you can find an example of that.
EDIT: I found an example of Base32 for VB6 although I've not tried it yet. -Clay
encode the encrypted bytes in HEX, or Base32 or Base64
Do you want this to be reversible -- to recover the IDs from the encrypted text? If so then it matters how you combine the key and input strings.
Usually you'd XOR each byte pair (work with byte arrays to avoid Unicode issues), circulating on the key string if it's shorter than the input. You can then use Base N encoding (32, 64 etc) to generate the license string.
Both operations are reversible: you can recover the XORed strings from the Base N string, then XOR with the key again to get the original IDs.
If you don't care about reversing the operations, then any convolution of key and ID will do. XOR is just the simplest.