how to represent a n-byte array in less than 2*n characters - algorithm

given that a n-byte array can be represented as a 2*n character string using hex, is there a way to represent the n-byte array in less than 2*n characters?
for example, typically, an integer(int32) can be considered as a 4-byte array of data

The advantage of hex is that splitting an 8-bit byte into two equal halves is about the simplest thing you can do to map a byte to printable ASCII characters. More efficient methods consider multiple bytes as a block:
Base-64 uses 64 ASCII characters to represent 6 bits at a time. Every 3 bytes (i.e. 24 bits) are split into 4 6-bit base-64 digits, where the "digits" are:
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
(and if the input is not a multiple of 3 bytes long, a 65th character, "=", is used for padding at the end). Note that some variant forms of base-64 use different characters for the last two "digits".
Ascii85 is another representation, which is somewhat less well-known, but commonly used: it's often the way that binary data is encoded within PostScript and PDF files. This considers every 4 bytes (big-endian) as an unsigned integer, which is represented as a 5-digit number in base 85, with each base-85 digit encoded as ASCII code 33+n (i.e. "!" for 0, up to "u" for 84) - plus a special case where the single character "z" may be used (instead of "!!!!!") to represent 4 zero bytes.
(Why 85? Because 84^5 < 2^32 < 85^5.)
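As a rough illustration of the size difference, Python's standard base64 module provides both encodings. This is only a sketch; the 12-byte sample input is an arbitrary choice, picked so that it is a multiple of both 3 and 4 bytes.

```python
import base64

# 12 arbitrary sample bytes (a multiple of both 3 and 4, so no padding effects).
data = bytes(range(1, 13))

hex_text = data.hex()                              # 2 characters per byte
b64_text = base64.b64encode(data).decode("ascii")  # 4 characters per 3 bytes
a85_text = base64.a85encode(data).decode("ascii")  # 5 characters per 4 bytes

print(len(hex_text), len(b64_text), len(a85_text))  # 24 16 15

# Both encodings are reversible.
assert base64.b64decode(b64_text) == data
assert base64.a85decode(a85_text) == data
```

So for 12 bytes, hex needs 24 characters, base-64 needs 16 and Ascii85 needs 15.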

yes, using binary (in which case it takes n bytes, not surprisingly), or using any base higher than 16, a common one is base 64.

It might depend on the exact numbers you want to represent. For instance, the number 9223372036854775808, which requires 8 bytes to represent in binary, takes only 4 bytes in ASCII if you use the product of primes representation (which is "2^63").

How about base-64?
It all depends on what characters you're willing to use in your encoding (i.e. representation).

Base64 fits 6 bits in each character, which means that 3 bytes will fit in 4 characters.

Using 65536 of the roughly 90000 defined Unicode characters, you may represent a binary string of N bytes in N/2 characters.

Yes. Use more characters than just 0-9 and a-f. A single character (assuming 8-bit) can have 256 values, so you can represent an n-byte number in n characters.
If it needs to be printable, you can just choose some set of characters to represent various values. A good option is base-64 in that case.

Related

How many numbers can we store with 1 bit?

I want to know how many characters or numbers can I store in 1 bit only. It will be more helpful if you tell it in octal, hexadecimal.
I want to know how many characters or numbers can I store in 1 bit only.
It is not practical to use a single bit to store numbers or characters. However, you could say:
One integer provided that the integer is in the range 0 to 1.
One ASCII character provided that the character is either NUL (0x00) or SOH (0x01).
The bottom line is that a single bit has two states: 0 and 1. Any value domain with more than two values in the domain cannot be represented using a single bit.
It will be more helpful if you tell it in octal, hexadecimal.
That is not relevant to the problem. Octal and hexadecimal are different textual representations for numeric data. They make no difference to the meaning of the numbers, or (in most cases [1]) the way that you represent the numbers in a computer.
[1] The exception is when you are representing numbers as text; e.g. when you represent the number 42 in a text document as the character '4' followed by the character '2'.
A bit is a "binary digit", or a value from a set of size two. If you have one or more bits, you raise 2 to the power of the number of bits to get the number of distinct values they can hold. So, 2^1 gives 2. The field in Mathematics is called combinatorics.
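The rule "raise 2 to the power of the number of bits" can be checked with a few lines of Python. This is just a sketch; the function name distinct_values is made up for the illustration.

```python
def distinct_values(bits: int) -> int:
    """Number of different values that `bits` binary digits can hold."""
    return 2 ** bits

# 1 bit, 3 bits (one octal digit), 4 bits (one hex digit), 8 bits (one byte)
for bits in (1, 3, 4, 8):
    print(bits, distinct_values(bits))
```

One octal digit corresponds to 3 bits (8 values) and one hexadecimal digit to 4 bits (16 values), so a single bit is less than one digit in either notation.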

Encode string to a specified length using any algo

Is there a way to compress/encode a string to a specified length (8/10 characters)?
I have a combination of a secret key and a 16-digit numeric value, and I want to create a unique ID from the combination of both. Its length should be between 8 and 12, and it should not change if the combination is the same.
Please suggest a way.
If it's 16 decimal digits and your string can contain any characters, then sure. If you want ten characters out, then you'd need 40 different characters: 40^10 > 10^16. Or for nine characters out, you need 60 different characters: 60^9 > 10^16. E.g. some subset of the upper case letters, lower case letters, and digits (62 to choose 40 or 60 from). Then it is simply a matter of base conversion either way. Convert from base 10 to base 40 or 60, and then back.
Many languages already have Base-64 coding routines, which will get you to nine characters.
Eight is a problem, since you would need 100 characters (100^8 == 10^16), and there are only 95 printable ASCII characters.
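A minimal sketch of the base conversion in Python, assuming the full 62-character set of digits and ASCII letters as the alphabet (the alphabet and the function names are illustrative choices, not something fixed by the answer):

```python
import string

# Assumed alphabet: 62 symbols (digits, upper case, lower case).
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase


def encode(number: int, alphabet: str = ALPHABET) -> str:
    """Convert a non-negative integer to a string in base len(alphabet)."""
    if number < 0:
        raise ValueError("number must be non-negative")
    base = len(alphabet)
    digits = []
    while True:
        number, remainder = divmod(number, base)
        digits.append(alphabet[remainder])
        if number == 0:
            break
    return "".join(reversed(digits))


def decode(text: str, alphabet: str = ALPHABET) -> int:
    """Inverse of encode()."""
    base = len(alphabet)
    value = 0
    for ch in text:
        value = value * base + alphabet.index(ch)
    return value


largest = 9999999999999999          # largest 16-digit decimal number
print(len(encode(largest)))         # 9
assert decode(encode(largest)) == largest
```

Since 62^9 > 10^16, every 16-digit decimal number fits in at most nine of these characters.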
You could use a secure hash function, like sha512, and truncate the resulting hex string to the desired length.
If you want slightly more entropy per character, you can base64-encode the digest before truncating, instead of using the hex string.
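A sketch of the hash-and-truncate approach in Python. How the secret key and the number are combined (here joined with a colon), the URL-safe base64 variant, and the name short_id are all assumptions made for the example. Note that the result is not reversible, and truncation makes collisions possible, although unlikely at this length.

```python
import base64
import hashlib


def short_id(secret_key: str, number: int, length: int = 10) -> str:
    """Derive a fixed-length ID from a key and a number (hypothetical helper)."""
    # Assumption: key and number are simply joined with a separator.
    material = f"{secret_key}:{number}".encode("utf-8")
    digest = hashlib.sha512(material).digest()
    # Base64 carries 6 bits per character, versus 4 for hex.
    return base64.urlsafe_b64encode(digest).decode("ascii")[:length]


first = short_id("my-secret", 1234567890123456)
again = short_id("my-secret", 1234567890123456)
assert first == again      # the same combination always gives the same ID
assert len(first) == 10
```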

Compress many numbers into a string

I was wondering if there's a way to compress 20 or so large numbers (~10^8) into a string of a reasonable length. For instance, if the numbers were stored as hex and concatenated, it'd be at least 160 characters long. I wonder if there's a smart way to compress the numbers in and get them back out. I was thinking about having a sequence 0-9 as reference and let one part of the input string be a number <1024. That number is to be converted to binary, which serves as a mask, i.e. indicating which digits exist in the number. It's still not clear where to go on from here.
Are there any better alternatives?
Thanks
If these large numbers are of the same size in bytes, and if you always know the count of those numbers, there is an easy way to do it. You simply have an array of your bytes, and instead of reading them out as integers, you read them out as characters. Are you trying to obfuscate your values or just pack them to be easily transferred?
When I'm compacting a lot of values into one, reversible String, I usually go with base 64 conversion. This can really cut off quite a lot of the length from a String, but note that it may take up just as much memory in representing it.
Example
This number in decimal:
10000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
is the following in Base 64:
Yki8xQRRVqd403ldXJUT8Ungkh/A3Th2TMtNlpwLPYVgct2eE8MAn0bs4o/fv1bmo4oUNQa/9WtZ8gRE7IG+UHX+LniaQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Why you can't do this to an extreme level
Think about it for a second. Let's say you've got a number that is 10 bits long, and you want to represent it with 5 bits, so a 50% compression scheme. First, we work out how many possible numbers you can represent with 10 bits, which is:
2^10 = 1024
Okay, that's fine. How many numbers can we express with 5 bits?
2^5 = 32
So, you can only represent 32 different numbers with 5 bits, whereas you can represent 1024 numbers with 10 bits. For compression to work, there needs to be some mapping between the compressed value and the extracted value. Let's try and make that mapping happen:
Normal - Compressed
0 0
1 1
2 2
.. ...
31 31
32 ??
33 ??
34 ??
... ...
1023 ??
There is no mapping for most of the numbers that can be represented by the expanded value.
This is known as the Pigeonhole Principle. In this example the number of normal values (n = 1024) is greater than the number of compressed values (m = 32), so some compressed value would have to stand for more than one normal value, and the original could no longer be recovered unambiguously. (Thank you, Oli, for reminding me.)
You need to be much more descriptive about what you mean by "string" and "~10^8". Can your "string" contain any sequence of bytes? Or is it restricted to a subset of possible bytes? If so, how exactly is it restricted? What are the limits on your "large numbers"? What do they represent?
Numbers up to 10^8 can be represented in 27 bits. 20 of them would be 540 bits, which could be stored in a string of 68 bytes, if any sequence of bytes is permitted. If the contents of a string are limited, it will take more bits. If your range of numbers is larger, it will take more bits.
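The arithmetic above can be sketched in Python by treating the 20 numbers as consecutive 27-bit fields of one big integer. The function names and the big-endian layout are choices made for this example only.

```python
BITS = 27  # enough for values below 2^27 = 134217728, which covers 10^8


def pack(numbers: list[int]) -> bytes:
    """Concatenate the numbers as fixed-width 27-bit fields."""
    packed = 0
    for n in numbers:
        if not 0 <= n < (1 << BITS):
            raise ValueError(f"{n} does not fit in {BITS} bits")
        packed = (packed << BITS) | n
    size = (len(numbers) * BITS + 7) // 8   # round up to whole bytes
    return packed.to_bytes(size, "big")


def unpack(blob: bytes, count: int) -> list[int]:
    """Inverse of pack(); the count must be known in advance."""
    packed = int.from_bytes(blob, "big")
    mask = (1 << BITS) - 1
    numbers = []
    for _ in range(count):
        numbers.append(packed & mask)
        packed >>= BITS
    numbers.reverse()
    return numbers


values = [10**8 - i for i in range(20)]
blob = pack(values)
print(len(blob))                    # 68
assert unpack(blob, 20) == values
```

If the result has to be printable, the 68 bytes can then be passed through base-64, at the usual cost of 4 characters per 3 bytes.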
store all numbers as strings to a marisa trie: https://code.google.com/p/marisa-trie/
Base64 the resulting trie dictionary
It depends of course a lot on your input. But it is a possibility to build a (very) compact representation this way.

Can I pack 2 digits(32bit) in the 12 strings [0-9a-z]{13}

I wanna pack 2 32bit digit in to 12strings
example
123 456 -> abcdef0000000
The choice [0-9a-z] gives 36 symbols. With 12 such symbols, you get 36^12 = 4.7e18 different strings.
Three 32-bit numbers are the same as one 96-bit number, which has 2^96 = 7.9e28 different values.
There are more values than you can represent with that alphabet in 12 positions, so no, you cannot achieve the packing you want.
If you use [0-9a-zA-Z] you have 62 symbols. 62^12 = 3.2e21 - still short.
Given a number of symbols S and string length L, you can figure out how many bits of info you can store by solving S^L >= 2^B. For your case, B = 96. Let's use the Base64 alphabet, so S = 64 (this also makes the math easier). Then:
64^L >= 2^96
take log2 of both sides
L * log2(64) >= 96 * log2(2)
L * 6 >= 96
L >= 16
So with a Base64 alphabet, you can store 96 bits in 16 symbols.
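The same calculation can be done for any alphabet size. The sketch below uses exact integer arithmetic rather than logarithms to avoid rounding problems; the function name min_length is made up for the example.

```python
def min_length(symbols: int, bits: int) -> int:
    """Smallest L such that symbols**L >= 2**bits."""
    if symbols < 2:
        raise ValueError("need at least two symbols")
    length = 0
    capacity = 1
    while capacity < 2 ** bits:
        capacity *= symbols
        length += 1
    return length


print(min_length(64, 96))   # 16, matching the derivation above
print(min_length(62, 32))   # 6
print(min_length(36, 32))   # 7
```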
Assuming you mean 3 numbers:
Whether you can get them in the string depends on the language. You can't have ANY limitations on the character set, though. A 32 bit number is 4 bytes, three of them make 12 bytes. Zeros are a possibility--any language that won't let you store a zero byte in the string won't allow this. However, there are languages that don't care--Delphi comes to mind. It stores strings with length codes rather than terminators and thus you can put anything in a string.
Assuming you mean 2 numbers:
[0-9][A-Z] is 36 symbols. 36^6 < 2^32, it won't work.
[0-9][A-Z][a-z] is 62 symbols. 62^6 > 2^32, you can do it.
On the other hand, if you're talking strings that hold wide characters of some type even the three numbers can easily be packed in, although you'll have to use some non-ASCII characters to do the job. Pick any 256 symbols and you're good to go. You can even pick 2048 symbols (you'll find more than that if you look at the east Asian languages) and pack each number into three characters.
You are effectively trying to fit three 4-byte numbers into one 12-byte string. Yes, you can do this, but you will have to use all 256 ANSI symbols (some of which are unprintable, like #13, #7, #10), and it won't even be a "pack": the first 4 bytes will hold the first number, the next 4 bytes the second, and the last 4 bytes the third.
If you want to pack TWO 4-byte numbers into a READABLE (printable) string of length 12, you need to be able to stuff 4 bytes into 6 "digits" of a string; thankfully 12 is divisible by 2, so you don't have to deal with uneven pack lengths. Solving 256^4 <= X^6 gives X >= 41, meaning yes, if you use a Base64 alphabet (as in the comments to the previous answer) you can build such a string.
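A sketch of the two-number case in Python, assuming a 62-symbol alphabet of digits and ASCII letters and six characters per number (any alphabet with at least 41 symbols would do; all names here are illustrative):

```python
import string

# Assumed alphabet: 62 printable symbols, more than the 41 required.
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase
BASE = len(ALPHABET)
WIDTH = 6  # characters per 32-bit number


def encode_fixed(number: int) -> str:
    """Encode one 32-bit number as exactly WIDTH characters."""
    if not 0 <= number < 2 ** 32:
        raise ValueError("number must fit in 32 bits")
    chars = []
    for _ in range(WIDTH):
        number, remainder = divmod(number, BASE)
        chars.append(ALPHABET[remainder])
    return "".join(reversed(chars))


def decode_fixed(text: str) -> int:
    value = 0
    for ch in text:
        value = value * BASE + ALPHABET.index(ch)
    return value


def pack_pair(a: int, b: int) -> str:
    return encode_fixed(a) + encode_fixed(b)


def unpack_pair(text: str) -> tuple[int, int]:
    return decode_fixed(text[:WIDTH]), decode_fixed(text[WIDTH:])


packed = pack_pair(123, 456)
assert len(packed) == 12
assert unpack_pair(packed) == (123, 456)
```

Padding each number to a fixed width is what lets the decoder know where the first number ends and the second begins.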

How to encode a number as a string such that the lexicographic order of the generated string is in the same order as the numeric order

For eg. if we have two strings 2 and 10, 10 will come first if we order lexicographically.
The very trivial sol will be to repeat a character n number of time.
eg. 2 can be encoded as aa
10 as aaaaaaaaaa
This way the lex order is same as the numeric one.
But, is there a more elegant way to do this?
When converting the numbers to strings, make sure that all the strings have the same length by prepending 0s if necessary. So 2 and 10 would be encoded as "02" and "10".
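In Python, for instance, zero-padding could look like the sketch below. The width of 3 is an assumption and must be large enough for the largest number you expect.

```python
def pad(number: int, width: int) -> str:
    """Left-pad a non-negative integer with zeros to a fixed width."""
    return f"{number:0{width}d}"


numbers = [10, 2, 100, 33]
encoded = [pad(n, 3) for n in numbers]

# Sorting the strings now gives the same order as sorting the numbers.
assert sorted(encoded) == [pad(n, 3) for n in sorted(numbers)]
print(sorted(encoded))
```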
While kjampani's solution is probably the best and easiest in normal applications, another way which is more space-efficient is to prepend every string with its own length. Of course, you need to encode the length in a way which is also consistently sorted.
If you know all the strings are fairly short, you can just encode their length as a fixed-length base-X sequence, where X is the number of character codes you're willing to use (popular values are 64, 96, 255 and 256.) Note that you have to use the character codes in lexicographical order, so normal base64 won't work.
One variable-length order-preserving encoding is the one used by UTF-8. (Not UTF-8 directly, which has a couple of corner cases which will get in the way, but the same encoding technique. The order-preserving property of UTF-8 is occasionally really useful.) The full range of such compressed codes can encode values up to 42 bits long, with an average of five payload bits per byte. That's sufficient for pretty long strings; four terabyte long strings are pretty rare in the wild; but if you need longer, it's possible, too, by extending the size prefix over more than one byte.
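A simplified sketch of the length-prefix idea in Python. It is not the UTF-8-style scheme: it assumes non-negative integers of at most 26 decimal digits and uses a single letter 'a' to 'z' as the length prefix, which is an arbitrary choice for the example.

```python
def encode(number: int) -> str:
    """Prefix the decimal digits with one letter that encodes their count."""
    digits = str(number)
    if number < 0 or len(digits) > 26:
        raise ValueError("only non-negative integers of up to 26 digits")
    # 'a' for 1 digit, 'b' for 2 digits, ... so shorter numbers sort first.
    return chr(ord("a") + len(digits) - 1) + digits


def decode(text: str) -> int:
    return int(text[1:])


numbers = [2, 10, 9, 100, 0, 99999]
by_string = sorted(numbers, key=encode)
assert by_string == sorted(numbers)
print([encode(n) for n in by_string])
```

Numbers with fewer digits get a smaller prefix, and numbers with the same digit count compare digit by digit, so string order matches numeric order without padding everything to the maximum width.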
Break the string into successive substrings of letters and numbers, and then sort by comparing each substring as an integer if it's a numeric string:
"aaa2" ---> aaa + 2
"aaa1000" ---> aaa + 1000
aaa == aaa
Since they're equal, we continue:
1000 > 2
Hence, aaa1000 > aaa2.
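This comparison can be expressed as a sort key in Python. The sketch below splits on runs of digits with the standard re module; the name natural_key is made up for the example.

```python
import re


def natural_key(text: str) -> list:
    """Split into alternating non-digit and digit runs; digit runs become ints."""
    parts = re.split(r"(\d+)", text)
    # re.split with a capturing group puts the digit runs at odd indexes.
    return [int(part) if index % 2 else part for index, part in enumerate(parts)]


names = ["aaa1000", "aaa2", "aaa10"]
print(sorted(names, key=natural_key))   # ['aaa2', 'aaa10', 'aaa1000']
```

Because text and digit runs always alternate in the same positions, an integer is only ever compared with another integer and a string with another string.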

Resources