Base X string encoding - algorithm

I'm looking for a routine that will encode a string (stream of bytes) into an arbitrary base/alphabet (like base64 encoding but I get to choose the alphabet). I've seen a few routines that do base X encoding for a number, but not for a string.

Here is my implementation of the BaseX (BaseN) encoding algorithm: https://github.com/KvanTTT/BaseNcoding.
You can also experiment with different alphabets and parameters on the demo site: http://kvanttt.github.io/BaseNcoding/

Every algorithm I've seen (and written) for this takes a sequence of characters as the alphabet, does a divmod of the number by the length of that sequence, uses the remainder as the index into the sequence for the digit character, and feeds the quotient back into the process.
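A minimal Python sketch of that divmod loop, applied to a byte string by first treating the bytes as one big-endian integer. The function names `encode_base_n` and `decode_base_n` are made up for illustration, and representing leading zero bytes as repeated first-alphabet characters is an assumption borrowed from Base58-style encoders, not something the linked library is known to do.

```python
def encode_base_n(data: bytes, alphabet: str) -> str:
    """Encode bytes in an arbitrary alphabet using repeated divmod."""
    base = len(alphabet)
    if base < 2:
        raise ValueError("alphabet needs at least two characters")
    number = int.from_bytes(data, "big")
    digits = []
    while number > 0:
        # The remainder picks the digit; the quotient is fed back in.
        number, remainder = divmod(number, base)
        digits.append(alphabet[remainder])
    # Leading zero bytes vanish in the integer, so keep them explicitly.
    leading_zeros = len(data) - len(data.lstrip(b"\x00"))
    # Digits were produced lowest-first, so reverse them.
    return alphabet[0] * leading_zeros + "".join(reversed(digits))


def decode_base_n(text: str, alphabet: str) -> bytes:
    """Inverse of encode_base_n for the same alphabet."""
    base = len(alphabet)
    index = {ch: i for i, ch in enumerate(alphabet)}
    number = 0
    for ch in text:
        number = number * base + index[ch]
    leading_zeros = len(text) - len(text.lstrip(alphabet[0]))
    body = number.to_bytes((number.bit_length() + 7) // 8, "big")
    return b"\x00" * leading_zeros + body


if __name__ == "__main__":
    custom = "abcdefghijklmnopqrstuvwxyz234567"
    encoded = encode_base_n(b"hello", custom)
    print(encoded, decode_base_n(encoded, custom))
```

Treating the whole input as one integer is simple but costs quadratic time on long inputs; block-based schemes such as base64 avoid that by encoding a fixed number of bytes at a time.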

Related

Why does the SecureRandom#hex method double its length parameter n?

I came across the SecureRandom#hex method when trying to generate salts for passwords in a Ruby on Rails application. Why does it double the length parameter and insist that the returned string is even in length?
The method generates a random sequence of n bytes (cf. the random_bytes method), and then returns the base-16 representation of that sequence (which has two hex digits per byte).
This is also why the base64 and urlsafe_base64 methods return a string of length roughly 4n/3: they generate n bytes, and then perform the Base-64 encoding.
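The same length relationship can be checked outside Ruby. The sketch below uses Python's standard `secrets` and `base64` modules as an analogue; it is not how SecureRandom is implemented, and the variable names are illustrative only.

```python
import base64
import secrets

n = 16
raw = secrets.token_bytes(n)  # n random bytes

hex_text = raw.hex()  # two hex digits per byte
b64_text = base64.urlsafe_b64encode(raw).decode("ascii")  # 4 chars per 3 bytes

print(len(raw), len(hex_text), len(b64_text))
```

With padding included, the Base-64 length is 4 * ceil(n / 3), which is where "roughly 4n/3" comes from.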

How to encode a number as a string such that the lexicographic order of the generated string is in the same order as the numeric order

For example, given the two strings 2 and 10, 10 will come first if we order lexicographically.
The trivial solution is to repeat a character n times.
E.g., 2 can be encoded as aa
and 10 as aaaaaaaaaa
This way the lexicographic order is the same as the numeric one.
But, is there a more elegant way to do this?
When converting the numbers to strings, make sure that all the strings have the same length by prepending 0s if necessary. So 2 and 10 would be encoded as "02" and "10".
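A minimal sketch of the zero-padding approach in Python. The helper name `pad_number` and the choice to reject values that do not fit the width are assumptions made for illustration.

```python
def pad_number(value: int, width: int) -> str:
    """Render a non-negative integer as a fixed-width, zero-padded string."""
    if value < 0:
        raise ValueError("only non-negative integers are supported")
    text = str(value).zfill(width)
    if len(text) > width:
        # A wider string would break the ordering guarantee.
        raise ValueError("value does not fit in the requested width")
    return text


if __name__ == "__main__":
    print(sorted(pad_number(v, 2) for v in [10, 2]))
```

The width has to be chosen up front to cover the largest value that will ever be stored.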
While kjampani's solution is probably the best and easiest in normal applications, another way which is more space-efficient is to prepend every string with its own length. Of course, you need to encode the length in a way which is also consistently sorted.
If you know all the strings are fairly short, you can just encode their length as a fixed-length base-X sequence, where X is the number of character codes you're willing to use (popular values are 64, 96, 255 and 256.) Note that you have to use the character codes in lexicographical order, so normal base64 won't work.
One variable-length order-preserving encoding is the one used by UTF-8. (Not UTF-8 directly, which has a couple of corner cases which will get in the way, but the same encoding technique. The order-preserving property of UTF-8 is occasionally really useful.) The full range of such compressed codes can encode values up to 42 bits long, with an average of five payload bits per byte. That's sufficient for pretty long strings; four terabyte long strings are pretty rare in the wild; but if you need longer, it's possible, too, by extending the size prefix over more than one byte.
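A minimal sketch of the length-prefix idea, assuming the "short strings" case: the length is written as a fixed two-digit decimal prefix, so numbers up to 99 digits are supported. The helper name `encode_with_length` and the two-digit prefix are illustrative choices, not part of the answer above.

```python
def encode_with_length(value: int) -> str:
    """Prefix the decimal digits of a non-negative integer with their count."""
    if value < 0:
        raise ValueError("only non-negative integers are supported")
    digits = str(value)
    if len(digits) > 99:
        raise ValueError("too many digits for a two-digit length prefix")
    # Shorter numbers get a smaller prefix and therefore sort first;
    # numbers of equal length compare digit by digit.
    return f"{len(digits):02d}{digits}"


if __name__ == "__main__":
    print(encode_with_length(2), encode_with_length(10))
```

Here 2 becomes "012" and 10 becomes "0210", and the first string sorts before the second.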
Break the string into successive substrings of letters and numbers, then sort by comparing each substring as an integer if it's a numeric string.
"aaa2" ---> aaa + 2
"aaa1000" ---> aaa + 1000
aaa == aaa
Since they're equal, we continue:
1000 > 2
Hence, aaa1000 > aaa2.
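The comparison above can be sketched as a sort key in Python. The name `natural_key` is made up for illustration, and ranking numeric chunks before text chunks when the two kinds meet at the same position is an assumption the answer does not address.

```python
import re


def natural_key(text: str) -> list:
    """Split text into digit and non-digit runs; compare digit runs as integers."""
    parts = re.findall(r"[0-9]+|[^0-9]+", text)
    key = []
    for part in parts:
        if "0" <= part[0] <= "9":
            key.append((0, int(part), ""))  # numeric chunk
        else:
            key.append((1, 0, part))  # text chunk
    return key


if __name__ == "__main__":
    print(sorted(["aaa1000", "aaa2"], key=natural_key))
```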

How to get the ASCII code of a character in assembly language?

I need to read in a string and display it as an array of ASCII codes.
How can I implement this in assembly language?
In assembly language, characters are already encoded in ASCII (or Unicode or whatever). You work with characters as numbers.
What you need to be able to do is format numbers in their denary (decimal) representation for output. This is not specific to character codes.
There will almost certainly be library routines to do this, but it's not hard to do yourself. Basically, you write a loop which repeatedly extracts the lowest digit from the number (by taking the residue of the number modulo 10 - look for a MOD instruction), converts that into the character code for a digit (by adding 48) and adds it to a buffer, then divides the number by 10 to move on to the next digit. You repeat that until the number is zero.
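The loop is easier to see in a high-level language before translating it to assembly. Below is a minimal Python sketch of the same steps; the function names are made up for illustration. Two details are added that the description leaves implicit: the digits come out lowest-first, so the buffer has to be reversed, and an input of zero needs a special case because the loop body would never run.

```python
def to_decimal_chars(number: int) -> str:
    """Format a non-negative integer as decimal text, one digit at a time."""
    if number == 0:
        return "0"
    buffer = []
    while number > 0:
        digit = number % 10  # lowest digit (the MOD step)
        buffer.append(chr(digit + 48))  # 48 is the character code of "0"
        number //= 10  # move on to the next digit
    buffer.reverse()  # digits were collected lowest-first
    return "".join(buffer)


def ascii_codes(text: str) -> list:
    """Return the decimal character code of each character as a string."""
    return [to_decimal_chars(ord(ch)) for ch in text]


if __name__ == "__main__":
    print(ascii_codes("Hi"))
```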

How to represent an n-byte array in less than 2*n characters

Given that an n-byte array can be represented as a 2*n character string using hex, is there a way to represent the n-byte array in less than 2*n characters?
For example, an integer (int32) can typically be considered a 4-byte array of data.
The advantage of hex is that splitting an 8-bit byte into two equal halves is about the simplest thing you can do to map a byte to printable ASCII characters. More efficient methods consider multiple bytes as a block:
Base-64 uses 64 ASCII characters to represent 6 bits at a time. Every 3 bytes (i.e. 24 bits) are split into 4 6-bit base-64 digits, where the "digits" are:
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
(and if the input is not a multiple of 3 bytes long, a 65th character, "=", is used for padding at the end). Note that there are some variant forms of base-64 that use different characters for the last two "digits".
Ascii85 is another representation, which is somewhat less well-known, but commonly used: it's often the way that binary data is encoded within PostScript and PDF files. This considers every 4 bytes (big-endian) as an unsigned integer, which is represented as a 5-digit number in base 85, with each base-85 digit encoded as ASCII code 33+n (i.e. "!" for 0, up to "u" for 84) - plus a special case where the single character "z" may be used (instead of "!!!!!") to represent 4 zero bytes.
(Why 85? Because 84^5 < 2^32 < 85^5.)
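The three encodings can be compared directly with Python's standard `base64` module, which provides `b64encode` and `a85encode`. This is only a sketch to show the output lengths; the 12-byte sample input is arbitrary.

```python
import base64

data = bytes(range(1, 13))  # 12 arbitrary bytes

hex_text = data.hex()  # 2 characters per byte
b64_text = base64.b64encode(data).decode("ascii")  # 4 characters per 3 bytes
a85_text = base64.a85encode(data).decode("ascii")  # 5 characters per 4 bytes

print(len(hex_text), len(b64_text), len(a85_text))

# Five base-85 digits are the fewest that can hold any 32-bit value.
print(84**5 < 2**32 < 85**5)

# Four zero bytes collapse to the single character "z".
print(base64.a85encode(b"\x00\x00\x00\x00"))
```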
Yes: use binary (in which case it takes n bytes, not surprisingly), or any base higher than 16; a common one is base 64.
It might depend on the exact numbers you want to represent. For instance, the number 9223372036854775808, which requires 8 bytes to represent in binary, takes only 4 bytes in ASCII if you use the product-of-primes representation (which is "2^63").
How about base-64?
It all depends on what characters you're willing to use in your encoding (i.e. representation).
Base64 fits 6 bits in each character, which means that 3 bytes will fit in 4 characters.
Using 65536 of the roughly 90000 defined Unicode characters, you can represent an N-byte binary string in N/2 characters.
Yes. Use more characters than just 0-9 and a-f. A single character (assuming 8-bit) can have 256 values, so you can represent an n-byte number in n characters.
If it needs to be printable, you can just choose some set of characters to represent various values. A good option is base-64 in that case.

Encrypt printable text so result is still printable (can be typed)

I want to encrypt some info for a licensing system and I want the result to be able to be typed in by the user.
Update: This operation must be reversible (decrypt-able)
E.g.,
Encrypt ( ComputerID+ProductID) -> (any standard ASCII character that can be typed. Ideally maybe even just A-Z).
So far what I did was to convert the encrypted text to HEX (so it's any character from 0-F) but that doubles the number of characters.
I'm using VB6.
I'm thinking I'd do some operation on each pair of (Input$(x) and Key$(x)) and then do a MOD to keep it within a range of ascii values (maybe 0-9-A-Z)
Any suggestions of a good algorithm?
Look into Base64 "encryption."
Base64 represents data using 64 different ASCII characters, versus hex, which uses only 16. That makes Base64 more compact, which is what you are looking for.
EDIT:
Code to do this in VB6 is available here: http://www.nonhostile.com/howto-encode-decode-base64-vb6.asp
Per Fuzzy Lollipop, below, Base32 looks like an even better option. Bonus points if you can find an example of that.
EDIT: I found an example of Base32 for VB6 although I've not tried it yet. -Clay
Encode the encrypted bytes in hex, Base32 or Base64.
Do you want this to be reversible -- to recover the IDs from the encrypted text? If so then it matters how you combine the key and input strings.
Usually you'd XOR each byte pair (work with byte arrays to avoid Unicode issues), cycling through the key string if it's shorter than the input. You can then use Base N encoding (32, 64 etc.) to generate the license string.
Both operations are reversible: you can recover the XORed strings from the Base N string, then XOR with the key again to get the original IDs.
If you don't care about reversing the operations, then any convolution of key and ID will do. XOR is just the simplest.
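A minimal Python sketch of the XOR-then-Base-N approach described in this answer (the question targets VB6, so this only illustrates the steps). The function names, the sample IDs and the key are all hypothetical; Base32 is used so the result contains only A-Z, 2-7 and "=" padding.

```python
import base64
from itertools import cycle


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR each data byte with the key, cycling the key if it is shorter."""
    if not key:
        raise ValueError("key must not be empty")
    return bytes(b ^ k for b, k in zip(data, cycle(key)))


def make_license(ids: bytes, key: bytes) -> str:
    """XOR the IDs with the key, then Base32-encode the result."""
    return base64.b32encode(xor_bytes(ids, key)).decode("ascii")


def read_license(text: str, key: bytes) -> bytes:
    """Reverse make_license: Base32-decode, then XOR with the same key."""
    return xor_bytes(base64.b32decode(text), key)


if __name__ == "__main__":
    key = b"example-key"  # hypothetical key
    code = make_license(b"PC42|PROD7", key)  # hypothetical IDs
    print(code, read_license(code, key))
```

Base32 produces 8 characters for every 5 bytes, so it is longer than Base64 but shorter than hex, and it is case-insensitive to type.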
