Convert a long character field to numeric, NOT scientific notation (SAS) - format

I need to join two tables - one table has householdid as CHAR30, which appears to have center alignment, and the other has householdid as numeric 20. I need to convert to the numeric 20, but when I do that it appears truncated, perhaps because of the strange alignment (not all of the 30 positions are actually needed).
When I try to keep the full 30 positions as a numeric, I instead get a conversion to scientific notation, so of course this will not work as a key ID for later operations.

As long as the number is converted properly, it doesn't matter what format it has. A format just tells SAS how to show you the number. Behind the scenes, it is just a DOUBLE.
1.0 = 1 = 1e0
Now if you have converted to a number and cannot get a join, then look at the informat you used to read it in.
try
num_id = input(strip(char_id),best32.);
Strip removes leading and trailing blanks. The BEST32. INFORMAT tries its "best" to read the number up to 32 characters in length.

You cannot store a 20 digit number as a numeric in SAS. SAS stores all numbers as 8 byte floating point and so does not have enough bits to represent that many digits uniquely. You can ask SAS what is the largest integer it can represent exactly by using the CONSTANT() function.
data _null_;
  x=constant('EXACTINT',8);
  put x = comma32. ;
run;
x=9,007,199,254,740,992
Read and store your 20 and 30 digit strings as character variables.
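To see the precision loss directly, here is a quick sketch in Python (used only for illustration; SAS numerics are the same 8-byte IEEE-754 doubles, and the ID value below is made up):

big_id = 12345678901234567890            # a hypothetical 20-digit household id
as_double = float(big_id)                 # what an 8-byte floating point value stores
print(as_double)                          # 1.2345678901234567e+19
print(int(as_double) == big_id)           # False - the original id is not recoverable
print(2 ** 53)                            # 9007199254740992, the EXACTINT limit shown above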

Use the bestd32. format. It tends to work out pretty well for long key variables. Depending on the length of the variable, you can change 32 to whichever length you need.

Based on the comments under the original question, the only thing you can do is convert all ID fields to strings and use the strings to do the joins. @Reeza suggested this in one of the comments but it should have been posted as an answer.
I assume you are pulling this information out of another database/system that allows for greater numeric precision than SAS does. If you don't convert the values to strings when they are read into SAS, then you run the risk of losing precision.
If you lose precision, the ID in SAS is likely to become very slightly different to the ID in the original system, which can cause problems when searching the original system for an ID obtained from SAS.
Be sure you don't read the numbers into SAS as numeric, then convert to string. If you do it this way you are still losing precision as soon as the numbers are stored in SAS as numeric variables.

Related

Shorter hashes of credit card numbers in Oracle

I use STANDARD_HASH in the manner below to hash credit card numbers. It returns hashes with 40 characters. This seems excessive for credit card numbers which have 16 digits. I would like to save space in my export. How can I create shorter hashes while still achieving these goals:
1. Have the same level of security and non-reversibility as STANDARD_HASH
2. Keep the likelihood of two card numbers receiving the same hash very small (though if this happens a few times, it's OK)
3. Have the shortest possible hash result in terms of characters or space required when exporting to a CSV
4. Perform this operation while using as few database resources as possible
5. Perform this operation using read-only access to the database
If a method exists which achieves goals 2 and 3, then I expect that goal 1 could be achieved by using this method to hash the output of STANDARD_HASH.
SELECT STANDARD_HASH(TRIM(' 123456789123456789 ' )) FROM DUAL;
TRIM removes the spaces and then STANDARD_HASH returns a hash of length 64.
Here's the same example on db<>fiddle:
https://dbfiddle.uk/?rdbms=oracle_18&fiddle=7cd086f1b60f69eb3bc6f54d4a211844
The database version is "Oracle Database 18c Enterprise Edition".
That length of 64 is not the length of the result, but just how it displays. STANDARD_HASH returns a RAW value, that is displayed as hexadecimal.
You can convert this raw value into something usable using the UTL_RAW functions at https://docs.oracle.com/database/121/TTPLP/u_raw.htm#TTPLP71498
E.g.
SELECT UTL_RAW.CAST_TO_VARCHAR2 (STANDARD_HASH(TRIM(' 123456789123456789 ' ))) FROM DUAL;
Note that when you try this in the fiddle, you’ll find a few ? that represent non-printable characters, so allow for that in your export.
Edit to add: STANDARD_HASH uses SHA1 by default - but that and MD5 have vulnerabilities - better to just add the extra parameter to STANDARD_HASH to use a longer SHA - see https://docs.oracle.com/database/121/SQLRF/functions183.htm#SQLRF55647
SELECT UTL_RAW.CAST_TO_VARCHAR2 (STANDARD_HASH(TRIM(' 123456789123456789 ' ), 'SHA256')) FROM DUAL;
Edit to address the 5 points:
1. It uses the same STANDARD_HASH, so the security is the same.
2. SHA1 is prone to collisions, so as above, swap to SHA256 or higher.
3. STANDARD_HASH uses industry-standard hashing algorithms. It is what it is. Be aware that by its very nature, hashing returns binary values, so it is your responsibility to convert them to an appropriate format - e.g. for CSV files, you can convert to Base64 (see Base64 encoding and decoding in oracle).
4 and 5. No additional resources.
Edit to respond to additional comments:
Yes, the full SELECT you stated looks correct:
select utl_raw.cast_to_varchar2(utl_encode.base64_encode(
STANDARD_HASH(TRIM(' 123456789123456789 ' ), 'SHA1'))) FROM dual;
Base64 operates on groups of 3 bytes at a time, and appends "=" for each byte short. SHA1 hashes are always 20 bytes, so the final group is always 1 byte short.
So offhand, you COULD trim that trailing "=" off - though I would advise against it (lean code beats premature optimisation). For example, if you subsequently decided to upgrade from SHA1 to SHA256, that generates hashes with a different number of bytes, and therefore potentially 0 or 2 "=" at the end, so weird bugs await.
Yes, "+" and "/" are valid characters in the Base64 output (along with 0-9, and upper-and lower- case letters - hence 64 characters in all, plus the =), but importantly commas and double-quotes are not - so yes, Base64 strings are safe to go into a CSV format.
FYI, a quick summary of Base64 (since I guess that you, like me, always like to have an overview of what you're dealing with).
Base64 is used to translate a stream of binary data into printable strings. Now 3 bytes of binary data is 24 bits, which of course can be regarded as 4 lots of 6 bits (we can ignore the byte boundaries). Any collection of 6 bits has 2^6 = 64 possible values (hence the Base64 name), which are represented as 64 characters:
Upper-case letters
Lower case letters (so yes, case-sensitive).
digits 0-9
"+" and "/"
Hence each character in the Base64 output represents the next 6 bits of the binary data.
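If you want to sanity-check the sizes discussed above, here is a small sketch in Python (illustration only; the Oracle calls behave the same way): a SHA-1 digest is 20 bytes, so its Base64 form is 28 characters ending in one '='; a SHA-256 digest is 32 bytes, giving 44 characters, also with one '='.

import base64
import hashlib

card = ' 123456789123456789 '.strip()     # same sample value as the SQL above
sha1_b64 = base64.b64encode(hashlib.sha1(card.encode()).digest()).decode()
sha256_b64 = base64.b64encode(hashlib.sha256(card.encode()).digest()).decode()
print(len(sha1_b64), sha1_b64)            # 28 characters, one trailing '='
print(len(sha256_b64), sha256_b64)        # 44 characters, one trailing '='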

Fastest algorithm to convert hexadecimal numbers into decimal form without using a fixed length variable to store the result

I want to write a program to convert hexadecimal numbers into their decimal forms without using a variable of fixed length to store the result because that would restrict the range of inputs that my program can work with.
Let's say I were to use a variable of type long long int to calculate, store and print the result. Doing so would limit the range of hexadecimal numbers that my program can handle to between 8000000000000001 and 7FFFFFFFFFFFFFFF. Anything outside this range would cause the variable to overflow.
I did write a program that calculates and stores the decimal result in a dynamically allocated string by performing carry and borrow operations but it runs much slower, even for numbers that are as big as 7FFFFFFFF!
Then I stumbled onto this site which could take numbers that are way outside the range of a 64 bit variable. I tried their converter with numbers much larger than 16^65 - 1 and still couldn't get it to overflow. It just kept on going and printing the result.
I figured that they must be using a much better algorithm for hex to decimal conversion, one that isn't limited to 64 bit values.
So far, Google's search results have only led me to algorithms that use some fixed-length variable for storing the result.
That's why I am here. I wanna know if such an algorithm exists and if it does, what is it?
Well, it sounds like you already did it when you wrote "a program that calculates and stores the decimal result in a dynamically allocated string by performing carry and borrow operations".
Converting from base 16 (hexadecimal) to base 10 means implementing multiplication and addition of numbers in a base 10^x representation. Then for each hex digit d, you calculate result = result*16 + d. When you're done you have the same number in a 10-based representation that is easy to write out as a decimal string.
There could be any number of reasons why your string-based method was slow. If you provide it, I'm sure someone could comment.
The most important trick for making it reasonably fast, though, is to pick the right base to convert to and from. I would probably do the multiplication and addition in base 10^9, so that each digit will be as large as possible while still fitting into a 32-bit integer, and process 7 hex digits at a time, which is as many as I can while only multiplying by single digits.
For every 7 hex digits, I'd convert them to a number d, and then do result = result * 16^7 + d.
Then I can get the 9 decimal digits for each resulting digit in base 10^9.
This process is pretty easy, since you only have to multiply by single digits. I'm sure there are faster, more complicated ways that recursively break the number into equal-sized pieces.
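Here is a rough sketch of that approach in Python (Python already has arbitrary-precision integers, so this is purely to illustrate the limb arithmetic): the number is kept as a list of base-10^9 digits, and 7 hex digits are folded in per step with result = result * 16^7 + d.

def hex_to_decimal(hex_str):
    # Digits ("limbs") in base 10^9, least significant first.
    BASE = 10 ** 9
    limbs = [0]
    for i in range(0, len(hex_str), 7):        # consume 7 hex digits at a time
        chunk = hex_str[i:i + 7]
        d = int(chunk, 16)                     # value of this group of hex digits
        mult = 16 ** len(chunk)                # 16^7 except possibly for the last group
        carry = d                              # compute result = result * mult + d, limb by limb
        for j in range(len(limbs)):
            carry += limbs[j] * mult
            limbs[j] = carry % BASE
            carry //= BASE
        while carry:
            limbs.append(carry % BASE)
            carry //= BASE
    # The top limb prints as-is; lower limbs are zero-padded to 9 decimal digits.
    return str(limbs[-1]) + ''.join('%09d' % limb for limb in reversed(limbs[:-1]))

print(hex_to_decimal('7FFFFFFFFFFFFFFF'))      # 9223372036854775807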

hashing mechanism to hash an input (0 to 2^32 - 1) to a fixed possibly 12 character hash

I'm looking for a way to implement a hashing mechanism to hash an input (0 to 2^32 - 1) to a fixed possibly 12 character hash.
Background:
I have a transaction table, where the primary key is auto increment (max size is 2^32), and I have to show an invoice no to the client which has to be of a decent character length (I'm thinking 12), and since the client shouldn't get an id like 0000-0000-0001, I was thinking hashing is the best way to go.
The main requirement (that I can think of) is that many to one mapping should never take place, and should not be slow.
Would it be okay if I use a common hashing mechanism and then drop the extra characters (md5, for example, in PHP generates a 32-character string)?
The way I understand, there is no need to be secure cryptographically, and so I can generate a custom hash if possible.
Similar links:
1) Symmetric Bijective Algorithm for Integers
2) Pseudo-random-looking one-to-one int32->int32 function
Using md5 and chopping off most of it is not a good idea, because there is no guarantee that you would get a unique hash. Besides, you have much easier alternatives available to you, because you have a lot more bits than you need.
Values in the range [0..2^32] need 32 bits (duh!). You have 12 printable characters, which give you 72 bits if you stay within the Base-64 encoding range of characters. You don't even need that many characters - you can use three bits per character for the initial eight characters, and two bits per character for the last four. This way the first eight characters would stay in the range ['0'..'7'], and the last four would be in the range ['0'..'3']. Of course you are not bound to numeric digits - you could use letters for some groups of digits, to give it a more "randomized" appearance.
the id is auto increment, and I don't think that I should give invoice numbers as 000...001 and so on.
Start with the least significant bits when you generate these representations, then proceed to the most significant, or make an arbitrary (but fixed) map of which bits go to what digit in the 12-character representation. This way the IDs would not look sequential, but would remain fully reversible.
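A minimal sketch of that bit-distribution idea in Python (the function names are made up for illustration): the first eight characters carry 3 bits each and the last four carry 2 bits each, taken from the least significant end, and the mapping is fully reversible.

def encode_invoice(n):
    assert 0 <= n < 2 ** 32
    chars = []
    for _ in range(8):                 # eight characters in '0'..'7' (3 bits each)
        chars.append(str(n & 0b111))
        n >>= 3
    for _ in range(4):                 # four characters in '0'..'3' (2 bits each)
        chars.append(str(n & 0b11))
        n >>= 2
    return ''.join(chars)

def decode_invoice(s):
    n = 0
    for ch in reversed(s[8:]):         # undo the 2-bit characters first
        n = (n << 2) | int(ch)
    for ch in reversed(s[:8]):         # then the 3-bit characters
        n = (n << 3) | int(ch)
    return n

print(encode_invoice(1))                               # '100000000000'
print(decode_invoice(encode_invoice(123456789)))       # 123456789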

Bash string compression

I'd like to know how I can compress a string into fewer characters using a shell script. The goal is to take a Mac's serial number and MAC address then compress those values into a 14 character string. I'm not sure if this is possible, but I'd like to hear if anyone has any suggestions.
Thank you
Your question is way too vague to result in a detailed answer.
Given your restriction of a 14 character string output, you won't be able to use "real" compression (like zip), due to the overhead. This leaves you with simple algorithms, like RLE or bit concatenation.
If by "string" you mean "printable string", i.e. only about 62 or so values are usable in a character (depending on the exact printable set you choose), then you have an additional space constraint.
A handy trick you could use with the MAC address part: since it belongs to an Apple device, you already know that the first three values (AA:BB:CC) are one of 297 combinations, so you could squeeze 6 characters' (plus 2 for the colons) worth of information into 2+ characters (depending on your output character set, see above).
The remaining three MAC address values are base-16 (0-9, A-F), so you could "compress" this information slightly as well.
A similar analysis can be done for the Mac serial number (which values can it take? how much space can be saved?).
The effort to do this in bash would be disproportionate though. I'd highly recommend a C (or other programming language) approach.
Cheating answer
Get someone at Apple to give you access to the database I'm assuming they have which matches devices' serial numbers to MAC addresses. Then you can just store the MAC address and look it up in the database whenever you need the serial number. The 64-bit MAC address can easily be stored in 12 characters with standard base64 encoding.
Frustrating answer
You have to make some unreliable assumptions just to make this approachable. You can fix the assumptions later, but I don't know if it would still fit in 14 characters. Personally, I have no idea why you want to save space by reprocessing the serial and MAC numbers, but here's how I'd start.
Simplifying assumptions
Apple will never use MAC address prefixes beyond the 297 combinations mentioned in Sir Athos' answer.
The "new" Mac serial number format in this article from
2010 is the only format Apple has used or ever will use.
Core concepts of encoding
You're taking something which could have n possible values and you're converting it into something else with n possible values.
There may be gaps in the original's possible values, such as if Apple cancels building a manufacturing plant after already assigning it a location code.
There may be gaps in your encoded form's possible values, perhaps in anticipation of Apple doing things that would fill the gaps.
Abstract integer encoding
Break apart the serial number into groups as "PPP Y W SSS CCCC" (like the article describes)
Make groups for the first 3 bytes and last 5 bytes of the MAC address.
Translate each group into a number from 0 to n-1 where n is the number of possible values for something in the group. As far as I can tell from the article, the values are n_P=36^3, n_Y=20, n_W=27, n_S=3^3, and n_C=36^4. The first 3 MAC bytes has 297 values and the last 5 have 2^(8*5)=2^40 values.
Set a variable, i, to the value of the first group's number.
For each remaining group's number, multiply i by the number of values possible for the group, and then add the number to i.
Base n encoding
1. Make a list of n characters that you want to use in your final output.
2. Print the character in your list at index i%n.
3. Subtract the modulus from the integer encoding and divide by n.
4. Repeat steps 2 and 3 until the integer becomes 0.
Result
This results in a total of 36^3 * 20 * 27 * 36 * 7 * 297 * 2^40 ~= 2 * 10^24 combinations. If you let n=64 for a custom base64 encoding (without any padding characters), then you can barely fit that into ceiling(log(2 * 10^24) / log(64)) = 14 characters. If you use all 95 printable ASCII characters, then you can fit it into ceiling(log(2 * 10^24) / log(95)) = 13 characters.
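Here is a rough Python sketch of the abstract integer encoding and the base n encoding described above (the group values and the alphabet below are only placeholders for illustration, not Apple's real formats):

import math
import string

ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase + '+/'   # n = 64

def pack(groups):
    # Mixed-radix packing: groups is a list of (value, radix) pairs, value in [0, radix).
    i = 0
    for value, radix in groups:
        assert 0 <= value < radix
        i = i * radix + value
    return i

def encode_base_n(i, alphabet=ALPHABET):
    n = len(alphabet)
    out = []
    while True:
        out.append(alphabet[i % n])    # emit the character for the current remainder
        i //= n
        if i == 0:
            return ''.join(out)

# Hypothetical example: an OUI index (0..296) plus the low 5 MAC bytes (0..2^40).
print(encode_base_n(pack([(42, 297), (0x123456789A, 2 ** 40)])))
# The character-count estimate from the Result section:
print(math.ceil(math.log(2e24) / math.log(64)))        # 14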
Fixing the assumptions
If you're trying to build something that uses this and are determined to make it work, here's what you need to do to make it solid, along with some tips.
Do the same analysis on every other serial number format you may care about. You might want to see if there's any redundant information between the serial and MAC numbers.
Figure out a way to distinguish between serial number formats. Adding an extra value at the end of the abstract integer encoding can enable you to track which format version it uses.
Think long and carefully about the format you're making. It's a lot easier to make changes before you're stuck with backwards compatibility.
If you can, use a language that's well suited for mapping between values, doing a lot of arithmetic, and handling big numbers. You may be able to do it in Bash, but it'd probably be easier in, say, Python.

How do I trim the zero value after decimal

As I tried to debug, I found that just as I type in
Dim value As Double
value = 0.90000
then hit enter, and it automatically converts to 0.9
Shouldn't it keep the precision of a Double in Visual Basic?
For my calculation, I absolutely need to show the precision.
If precision is required then the Currency data type is what you want to use.
There are at least two representations of your value in play. One is the value you see on the screen -- a string -- and one is the internal representation -- a binary value. In dealing with fractional values, the two are often not equivalent and where they aren't, it's because they can't be.
If you stick with doubles, VB will maintain 53 bits of mantissa throughout your calculations, no matter how they might appear when printed. If you transition through the string domain, say by saving to a file or DB and later retrieving, it often has to leave some of that precision behind. It's inevitable, because the interface between the two domains is not perfect. Some values that can be exactly represented as strings (or Decimals, that is, powers of ten) can't be exactly represented as fractional powers of 2.
This has nothing to do with VB, it's the nature of floating point. The best you can do is control where the rounding occurs. For this purpose your friend is the Format function, which controls how a value appears in string form.
? Format$(0.9, "0.00000") will show you an example.
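To see what the Double actually holds, here is a quick check in Python, which uses the same IEEE-754 doubles as VB's Double type (illustration only):

print('%.20f' % 0.9)        # 0.90000000000000002220 - the nearest double to 0.9
print(0.9 == 0.90000)       # True - both literals round to the same double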
You are getting what you see on the screen confused with what bits are being set in the Double to make that number.
VB is simply being "helpful" and knocking off the excess zeros. But for all intents and purposes,
0.9
is identical to
0.90000
If you don't believe me, try doing this comparison:
Debug.Print CDbl("0.9") = CDbl("0.90000")
As has already been said, displayed precision can be shown using the Format$() function, e.g.
Debug.Print Format$(0.9, "0.00000")
No, it shouldn't keep the precision. Binary floating point values don't retain this information... and it would be somewhat odd to do so, given that you're expressing the value in one base even though it's being represented in another.
I don't know whether VB6 has a decimal floating point type, but that's probably what you want - or a fixed point decimal type, perhaps. Certainly in .NET, System.Decimal has retained extra 0s from .NET 1.1 onwards. If this doesn't help you, you could think about remembering two integers - e.g. "90000" and "100000" in this case, so that the value you're representing is one integer divided by another, with the associated level of precision.
EDIT: I thought that Currency may be what you want, but according to this article, that's fixed at 4 decimal places, and you're trying to retain 5. You could potentially just multiply by 10, if you always want 5 decimal places - but it's an awkward thing to remember to do everywhere... and you'd have to work out how to format it appropriately. It would also always be 4 decimal places, I suspect, even if you'd specified fewer - so if you want "0.300" to be different to "0.3000" then Currency may not be appropriate. I'm entirely basing this on articles online though...
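As a sketch of the "two integers" idea (in Python, since VB6 has no direct equivalent; this only illustrates the concept), a decimal type that stores the digits and the scale separately keeps the trailing zeros you want to display:

from decimal import Decimal

a = Decimal('0.9')
b = Decimal('0.90000')
print(a == b)           # True  - numerically the same value
print(a, b)             # 0.9 0.90000 - but the stored precision differs
print(b.as_tuple())     # DecimalTuple(sign=0, digits=(9, 0, 0, 0, 0), exponent=-5)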
You can also enter the value as 0.9# instead. This helps avoid implicit coercion within an expression that may truncate the precision you expect. In most cases the compiler won't require this hint though because floating point literals default to Double (indeed, the IDE typically deletes the # symbol unless the value was an integer, e.g. 9#).
Contrast the results of these:
MsgBox TypeName(0.9)
MsgBox TypeName(0.9!)
MsgBox TypeName(0.9#)
