Bash string compression - macos

I'd like to know how I can compress a string into fewer characters using a shell script. The goal is to take a Mac's serial number and MAC address then compress those values into a 14 character string. I'm not sure if this is possible, but I'd like to hear if anyone has any suggestions.
Thank you

Your question is way too vague to result in a detailed answer.
Given your restriction of a 14 character string output, you won't be able to use "real" compression (like zip), due to the overhead. This leaves you with simple algorithms, like RLE or bit concatenation.
If by "string" you mean "printable string", i.e. only about 62 or so values are usable in a character (depending on the exact printable set you choose), then you have an additional space constraint.
A handy trick you could use with the MAC address part is, since it belongs to an Apple device, you already know that the first three values (AA:BB:CC) are one of 297 combinations, so you could save 6 characters (plus 2 for the colons) worth of information into 2+ characters (depending on your output character set, see above).
The remaining three MAC address values are base-16 (0-9, A-F), so you could "compress" this information slightly as well.
A similar analysis can be done for the Mac serial number (which values can it take? how much space can be saved?).
The effort to do this in bash would be disproportionate though. I'd highly recommend a C (or other programming language) approach.

Cheating answer
Get someone at Apple to give you access to the database I'm assuming they have which matches devices' serial numbers to MAC addresses. Then you can just store the MAC address and look it up in the database whenever you need the serial number. The 64-bit MAC address can easily be stored in 12 characters with standard base64 encoding.
Frustrating answer
You have to make some unreliable assumptions just to make this approachable. You can fix the assumptions later, but I don't know if it would still fit in 14 characters. Personally, I have no idea why you want to save space by reprocessing the serial and MAC numbers, but here's how I'd start.
Simplifying assumptions
Apple will never use MAC address prefixes beyond the 297 combinations mentioned in Sir Athos' answer.
The "new" Mac serial number format in this article from
2010 is the only format Apple has used or ever will use.
Core concepts of encoding
You're taking something which could have n possible values and you're converting it into something else with n possible values.
There may be gaps in the original's possible values, such as if Apple cancels building a manufacturing plant after already assigning it a location code.
There may be gaps in your encoded form's possible values, perhaps in anticipation of Apple doing things that would fill the gaps.
Abstract integer encoding
Break apart the serial number into groups as "PPP Y W SSS CCCC" (like the article describes)
Make groups for the first 3 bytes and last 5 bytes of the MAC address.
Translate each group into a number from 0 to n-1 where n is the number of possible values for something in the group. As far as I can tell from the article, the values are n_P=36^3, n_Y=20, n_W=27, n_S=3^3, and n_C=36^4. The first 3 MAC bytes has 297 values and the last 5 have 2^(8*5)=2^40 values.
Set a variable, i, to the value of the first group's number.
For each remaining group's number, multiply i by the number of values possible for the group, and then add the number to i.
Base n encoding
Make a list of n characters that you want to use in your final output.
Print the character in your list at index i%n.
Subtract the modulus from the integer encoding and divide by n.
Repeat 1 and 2 until the integer becomes 0.
Result
This results in a total of 36^3 * 20 * 27 * 36 * 7 * 297 * 2^40 ~= 2 * 10^24 combinations. If you let n=64 for a custom base64 encoding
(without any padding characters), then you can barely fit that into ceiling(log(2 * 10^24) / log(64)) = 14 characters. If you use all 95 printable ASCII characters, then you can fit it into ceiling(log(2 * 10^24) / log(95)) = 13 characters.
Fixing the assumptions
If you're trying to build something that uses this and are determined to make it work, here's what you need to do to make it solid, along with some tips.
Do the same analysis on every other serial number format you may care about. You might want to see if there's any redundant information between the serial and MAC numbers.
Figure out a way to detect between serial number formats. Adding an extra thing at the end of the abstract number encoding can enable you to track which version it uses.
Think long and careful about the format you're making. It's a lot easier to make changes before you're stuck with backwards compatibility.
If you can, use a language that's well suited for mapping between values, doing a lot of arithmetic, and handling big numbers. You may be able to do it in Bash, but it'd probably be easier in, say, Python.

Related

How to build unique file names based on its content?

I plan to build a unique file name based on its content. For example, by its SHA256 hash. Files with the same content must have the same name.
The easiest way is to convert hash to a hex string. A file name will be 32 bytes length * 2 = 64 characters. This is pretty long name to operate with. How to make it shorter?
I implemented a sort of "Base32" coding - a vocabulary string that includes digits and 22 letters. I use only five bits of every byte to build file name with 32 characters. Much better.
I am looking for a balance between file name length and low collision probability. If the number of files is expected to be less than 500K, how long should the filename be? 8? 16? 24? 32?
Is there any recommended method to build short unique filenames at all?
If you use an N-bit cryptographic hash on M files, you can estimate the probability of at least one collision to be M2/2N+1
For 500K files, that's about 1/2N-37
Using base32, 16 chars gives probability of collision 1/243 -- a few trillion to 1 odds.
If that won't do, then 24 chars gives 1/283.
If you're willing to check them all and re-generate on collision, then 8 chars is fine.
Number of collisions depend on the content of the files, the hash-algorithm and the length of the hash.
In general: The longer the hash-value is the less likely are collisions (if your content does not especially provoke collisions).
You cannot avoid the possibility of collisions unless you use the content as file-name (or a lossless compression of it).
To shorten the filenames you could allow more different characters for the file-name. (But we aware what characters your OS allows and which you are willing to use).
I would go for a kind of base32 encoding to avoid problems with filesystems that do not distinguish between upper and lower case character.

ASCII - Whats the point of it?

I always wanted to ask this, I know that ASCII uses numbers to represent characters like 65 = A
Whats the point? computer understand when i press A is A why we need to convert to 65?
You have it backwards: computers understand when you press an A because of codes like ASCII. Or rather, one part of the computer is able to tell another part of the computer that you pressed an A because they agree on conventions of binary signals like ASCII.
At its lowest level, each part of the computer "knows" that it is in one of two states - maybe off and on, maybe high voltage and low voltage, maybe two directions of magnetism, and so on. For convenience, we label these two states 0 and 1. We then build elaborate (and microscopic) sequences of machinery that each say "if this thing's a 1, then do this, if it's a 0 do this".
If we string a sequence of 1s and 0s together, we can write a number, like 1010; and we can make machinery that does maths with those numbers, like 1010 + 0001 = 1011. Alternatively, we can string a much longer sequence together to represent the brightness of pixels from the top left to bottom right of a screen, in order - a bitmap image. The computer doesn't "know" which sequences are numbers and which are images, we just tell it "draw the screen based on this sequence" and "calculate my wages based on this sequence".
If we want to represent not numbers or images, but text, we need to come up with a sequence of bits for each letter and symbol. It doesn't really matter what sequence we use, we just need to be consistent - we could say that 000001 is A, and as long as we remember that's what we chose, we can write programs that deal with text. ASCII is simply one of those mappings of sequences of bits to letters and symbols.
Note that A is not defined as "65" in ASCII, it's defined as the 7 bit sequence 1000001; it just happens that that's the same sequence of bits we generally use for the number 65. Note also that ASCII is a very old mapping, and almost never used directly in modern computers; it is however very influential, and a lot of more recent mappings are designed to use the same or similar sequences for the letters and symbols that it covers.

hashing mechanism to hash an input (0 to 2^32 - 1) to a fixed possibly 12 character hash

I'm looking for a way to implement a hashing mechanism to hash an input (0 to 2^32 - 1) to a fixed possibly 12 character hash.
Background:
I have a transaction table, where the primary key is auto increment (max size is 2^32) and I have to show an invoice no to the client which has to be of decent characters length (I'm thinking 12) and so since the client shouldn't get id as 0000-0000-0001, I was thinking hashing is the best way to go.
The main requirement (that I can think of) is that many to one mapping should never take place, and should not be slow.
Would it be okay if I use a common hashing mechanism and then drop the extra characters. (md5 for example in php generates 32 character string)?
The way I understand, there is no need to be secure cryptographically, and so I can generate a custom hash if possible.
Similar links:
1) Symmetric Bijective Algorithm for Integers
2) Pseudo-random-looking one-to-one int32->int32 function
Using md5 and chopping off most of it is not a good idea, because there is no guarantee that you would get a unique cache. Besides, you have much easier alternatives available to you, because you have a lot more bits than you need.
Values in the range [0..232] need 32 bit (duh!). You have 12 printable characters, which give you 72 bits if you stay within Base-64 encoding range of characters. You don't even need that many characters - you can use three bits per character for the initial eight characters, and two bits per character for the last four digits. This way your 12 characters would stay in the range ['0'..'7'], and the last four would be in the range ['0'..'3']. Of course you are not bound to numeric digits - you could use letters for some groups of digits, to give it a more "randomized" appearance.
the id is auto increment, and I don't think that I should give invoice numbers as 000...001 and so on.
Start with least significant bits when you generate these representations, then proceed to least significant, or make an arbitrary (but fixed) map of which bits go to what digit in the 12-character representation. This way the IDs would not look sequential, but would remain fully reversible.

What's the name of this algorithm/routine?

I am writing a utility class which converts strings from one alphabet to another, this is useful in situations where you have a target alphabet you wish to use, with a restriction on the number of characters available. For example, if you can use lower case letters and numbers, but only 12 characters its possible to compress a timestamp from the alphabet 01234567989 -: into abcdefghijklmnopqrstuvwxyz01234567989 so 2010-10-29 13:14:00 might become 5hhyo9v8mk6avy (19 charaters reduced to 16).
The class is designed to convert back and forth between alphabets, and also calculate the longest source string that can safely be stored in a target alphabet given a particular number of characters.
Was thinking of publishing this through Google code, however I'd obviously like other people to find it and use it - hence the question on what this is called. I've had to use this approach in two separate projects, with Bloomberg and a proprietary system, when you need to generate unique file names of a certain length, but want to keep some plaintext, so GUIDs aren't appropriate.
Your examples bear some similarity to a Dictionary coder with a fixed target and source dictionaries. Also worthwhile to look at is Fibonacci coding, which has a fixed target dictionary (of variable-length bits), which is variably targeted.
I think it also depends whether it is very important that your target alphabet has fixed width entries - if you allow for a fixed alphabet with variable length codes, your compression ratio will approach your entropy that much more optimally! If the source alphabet distribution is known in advance, a static Huffman tree could easily be generated.
Here is a simple algorithm:
Consider that you don't have to transmit the alphabet used for encoding. Also, you don't use (and transmit) the probabilities of the input symbols, as in standard compressions, so we just re-encode somehow the data.
In this case we can consider that the input data are in number represented with base equal to the cardinality of the input alphabet. We just have to change its representation to another base, that is a simple task.
EDITED example:
input alpabet: ABC, output alphabet: 0123456789
message ABAC will translate to 0102 in base 3, that is 11 (9 + 2) in base 10.
11 to base 10: 11
We could have a problem decoding it, because we don't know how many 0-es to use at the begining of the decoded result, so we have to use one of the modifications:
1) encode somehow in the stream the size of compressed data.
2) use a dummy 1 at the start of the stream: in this way our example will become:
10102 (base 3) = 81 + 9 + 2 = 92 (base 10).
Now after decoding we just have to ignore the first 1 (this also provides a basic error detection).
The main problem of this approach is that in most cases (GCD == 1) each new encoded character will completely change the output. This will be very inneficient and difficult to implement. We end up with arithmetic coding as the best solution (actually a simplified version of it).
You probably know about Base64 which does the same thing just usually the other way around. Too bad there are way too many Google results on BaseX or BaseN...

How to generate a verification code/number?

I'm working on an application where users have to make a call and type a verification number with the keypad of their phone.
I would like to be able to detect if the number they type is correct or not. The phone system does not have access to a list of valid numbers, but instead, it will validate the number against an algorithm (like a credit card number).
Here are some of the requirements :
It must be difficult to type a valid random code
It must be difficult to have a valid code if I make a typo (transposition of digits, wrong digit)
I must have a reasonable number of possible combinations (let's say 1M)
The code must be as short as possible, to avoid errors from the user
Given these requirements, how would you generate such a number?
EDIT :
#Haaked: The code has to be numerical because the user types it with its phone.
#matt b: On the first step, the code is displayed on a Web page, the second step is to call and type in the code. I don't know the user's phone number.
Followup : I've found several algorithms to check the validity of numbers (See this interesting Google Code project : checkDigits).
After some research, I think I'll go with the ISO 7064 Mod 97,10 formula. It seems pretty solid as it is used to validate IBAN (International Bank Account Number).
The formula is very simple:
Take a number : 123456
Apply the following formula to obtain the 2 digits checksum : mod(98 - mod(number * 100, 97), 97) => 76
Concat number and checksum to obtain the code => 12345676
To validate a code, verify that mod(code, 97) == 1
Test :
mod(12345676, 97) = 1 => GOOD
mod(21345676, 97) = 50 => BAD !
mod(12345678, 97) = 10 => BAD !
Apparently, this algorithm catches most of the errors.
Another interesting option was the Verhoeff algorithm. It has only one verification digit and is more difficult to implement (compared to the simple formula above).
For 1M combinations you'll need 6 digits. To make sure that there aren't any accidentally valid codes, I suggest 9 digits with a 1/1000 chance that a random code works. I'd also suggest using another digit (10 total) to perform an integrity check. As far as distribution patterns, random will suffice and the check digit will ensure that a single error will not result in a correct code.
Edit: Apparently I didn't fully read your request. Using a credit card number, you could perform a hash on it (MD5 or SHA1 or something similar). You then truncate at an appropriate spot (for example 9 characters) and convert to base 10. Then you add the check digit(s) and this should more or less work for your purposes.
You want to segment your code. Part of it should be a 16-bit CRC of the rest of the code.
If all you want is a verification number then just use a sequence number (assuming you have a single point of generation). That way you know you are not getting duplicates.
Then you prefix the sequence with a CRC-16 of that sequence number AND some private key. You can use anything for the private key, as long as you keep it private. Make it something big, at least a GUID, but it could be the text to War and Peace from project Gutenberg. Just needs to be secret and constant. Having a private key prevents people from being able to forge a key, but using a 16 bit CR makes it easier to break.
To validate you just split the number into its two parts, and then take a CRC-16 of the sequence number and the private key.
If you want to obscure the sequential portion more, then split the CRC in two parts. Put 3 digits at the front and 2 at the back of the sequence (zero pad so the length of the CRC is consistent).
This method allows you to start with smaller keys too. The first 10 keys will be 6 digits.
Does it have to be only numbers? You could create a random number between 1 and 1M (I'd suggest even higher though) and then Base32 encode it. The next thing you need to do is Hash that value (using a secret salt value) and base32 encode the hash. Then append the two strings together, perhaps separated by the dash.
That way, you can verify the incoming code algorithmically. You just take the left side of the code, hash it using your secret salt, and compare that value to the right side of the code.
I must have a reasonnable number of possible combinations (let's say 1M)
The code must be as short as possible, to avoid errors from the user
Well, if you want it to have at least one million combinations, then you need at least six digits. Is that short enough?
When you are creating the verification code, do you have access to the caller's phone number?
If so I would use the caller's phone number and run it through some sort of hashing function so that you can guarantee that the verification code you gave to the caller in step 1 is the same one that they are entering in step 2 (to make sure they aren't using a friend's validation code or they simply made a very lucky guess).
About the hashing, I'm not sure if it's possible to take a 10 digit number and come out with a hash result that would be < 10 digits (I guess you'd have to live with a certain amount of collision) but I think this would help ensure the user is who they say they are.
Of course this won't work if the phone number used in step 1 is different than the one they are calling from in step 2.
Assuming you already know how to detect which key the user hit, this should be doable reasonably easily. In the security world, there is the notion of a "one time" password. This is sometimes referred to as a "disposable password." Normally these are restricted to the (easily typable) ASCII values. So, [a-zA-z0-9] and a bunch of easily typable symbols. like comma, period, semi colon, and parenthesis. In your case, though, you'd probably want to limit the range to [0-9] and possibly include * and #.
I am unable to explain all the technical details of how these one-time codes are generated (or work) adequately. There is some intermediate math behind it, which I'd butcher without first reviewing it myself. Suffice it to say that you use an algorithm to generate a stream of one time passwords. No matter how mnay previous codes you know, the subsequent one should be impossibel to guess! In your case, you'll simply use each password on the list as the user's random code.
Rather than fail at explaining the details of the implementation myself, I'll direct you to a 9 page article where you can read up on it youself: https://www.grc.com/ppp.htm
It sounds like you have the unspoken requirement that it must be quickly determined, via algorithm, that the code is valid. This would rule out you simply handing out a list of one time pad numbers.
There are several ways people have done this in the past.
Make a public key and private key. Encode the numbers 0-999,999 using the private key, and hand out the results. You'll need to throw in some random numbers to make the result come out to the longer version, and you'll have to convert the result from base 64 to base 10. When you get a number entered, convert it back to base64, apply the private key, and see if the intereting numbers are under 1,000,000 (discard the random numbers).
Use a reversible hash function
Use the first million numbers from a PRN seeded at a specific value. The "checking" function can get the seed, and know that the next million values are good. It can either generate them each time and check one by one when a code is received, or on program startup store them all in a table, sorted, and then use binary search (maximum of compares) since one million integers is not a whole lot of space.
There are a bunch of other options, but these are common and easy to implement.
-Adam
You linked to the check digits project, and using the "encode" function seems like a good solution. It says:
encode may throw an exception if 'bad' data (e.g. non-numeric) is passed to it, while verify only returns true or false. The idea here is that encode normally gets it's data from 'trusted' internal sources (a database key for instance), so it should be pretty usual, in fact, exceptional that bad data is being passed in.
So it sounds like you could pass the encode function a database key (5 digits, for instance) and you could get a number out that would meet your requirements.

Resources