How to convert from any large arbitrary base to another - algorithm

What I’d like to do is to convert a string from one "alphabet" to another, much like converting numbers between bases, but more abstract and with arbitrary digits.
For instance, converting "255" from the alphabet "0123456789" to the alphabet "0123456789ABCDEF" would result in "FF". One way to do this is to convert the input string into an integer, and then back again. Like so: (pseudocode)
int decode(string input, string alphabet) {
    int value = 0;
    for (i = 0; i < input.length; i++) {
        int index = alphabet.indexOf(input[i]);
        value += index * pow(alphabet.length, input.length - i - 1);
    }
    return value;
}
string encode(int value, string alphabet) {
    string encoded = "";
    while (value > 0) {
        int index = value % alphabet.length;
        encoded = alphabet[index] + encoded;
        value = floor(value / alphabet.length);
    }
    return encoded;
}
Such that decode("255", "0123456789") returns the integer 255, and encode(255, "0123456789ABCDEF") returns "FF".
This works for small alphabets, but I’d like to be able to use base 26 (all the uppercase letters) or base 52 (uppercase and lowercase) or base 62 (uppercase, lowercase and digits), and values that are potentially over a hundred digits. The algorithm above would, theoretically, work for such alphabets, but, in practice, I’m running into integer overflow because the numbers get so big so fast when you start doing 62^100.
What I’m wondering is if there is an algorithm to do a conversion like this without having to keep up with such gigantic integers? Perhaps a way to begin the output of the result before the entire input string has been processed?
My intuition tells me that it might be possible, but my math skills are lacking. Any help would be appreciated.
There are a few similar questions here on StackOverflow, but none seem to be exactly what I'm looking for.

A general way to store numbers in an arbitrary base is as an array of integers. Minimally, a number is denoted by a base and an array of int (or short or long, depending on the range of bases you want) representing its digits in that base.
Next, you need to implement multiplication in that arbitrary base.
After that you can implement conversion (clue: if x is the old base, calculate x, x^2, x^3, ..., in the new base. After that, multiply the digits from the old base by the corresponding powers and add everything up).

Java-like Pseudocode:
ArbitraryBaseNumber source = new ArbitraryBaseNumber(11, "103A");
ArbitraryBaseNumber target = new ArbitraryBaseNumber(3, "0");
for (int digit : source.getDigitListAsIntegers()) { // [1, 0, 3, 10]
    target.incrementBy(digit);
    if (not final digit) {
        target.multiplyBy(source.base);
    }
}
The challenge that remains, of course, is to implement ArbitraryBaseNumber, with incrementBy(int) and multiplyBy(int) methods. Essentially, to do that you do in code exactly what a schoolchild does when doing addition and long multiplication on paper. Google and you'll find examples.
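As a sketch of that schoolbook arithmetic in Python (the names here are illustrative, not a full ArbitraryBaseNumber class), using Horner's scheme — multiply the accumulator by the old base, then add each incoming digit, all on a digit array so no big integer is ever needed:

```python
def convert_base(digits, from_base, to_base):
    """Convert digit values (most significant first) between bases using
    only small-integer arithmetic on a digit array, never a big integer."""
    result = [0]  # digits in to_base, least significant first

    def multiply_by(num, factor):
        # Schoolbook long multiplication by a single small factor.
        carry = 0
        for i in range(len(num)):
            carry, num[i] = divmod(num[i] * factor + carry, to_base)
        while carry:
            carry, digit = divmod(carry, to_base)
            num.append(digit)

    def increment_by(num, value):
        # Schoolbook addition with carry propagation.
        i = 0
        while value:
            if i == len(num):
                num.append(0)
            value, num[i] = divmod(num[i] + value, to_base)
            i += 1

    for d in digits:  # Horner's scheme: acc = acc * old_base + digit
        multiply_by(result, from_base)
        increment_by(result, d)
    return list(reversed(result))

# "255" in base 10 -> [15, 15], i.e. "FF" in base 16
assert convert_base([2, 5, 5], 10, 16) == [15, 15]
```

The increment-then-multiply loop in the pseudocode above is the same computation with the multiply deferred to the next iteration; either way, each step only ever multiplies and adds small numbers.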

Related

How to create a unique hash that will match any strings permutations

Given a string abcd how can I create a unique hashing method that will hash those 4 characters to match bcad or any other permutation of the letters abcd?
Currently I have this code
long hashString(string a) {
    long hashed = 0;
    for (int i = 0; i < a.length(); i++) {
        hashed += a[i] * 7; // Multiplied by a prime to make the hash more unique?
    }
    return hashed;
}
Now this will not work because ad will hash the same as bc.
I know you can make it more unique by multiplying the position of the letter by the letter itself hashed += a[i] * i but then the string will not hash to its permutations.
Is it possible to create a hash that achieves this?
Edit
Some have suggested sorting the strings before you hash them. That is a valid answer, but the sorting would take O(n log n) time and I am looking for a hash function that runs in O(n) time.
I am looking to do this in O(1) memory.
Create an array of 26 integers, corresponding to letters a-z. Initialize it to 0. Scan the string from beginning to end, and increment the array element corresponding to the current letter. Note that up to this point the algorithm has O(n) time complexity and O(1) space complexity (since the array size is a constant).
Finally, hash the contents of the array using your favorite hash function.
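A sketch of that in Python (with the built-in `hash` standing in for your favorite hash function):

```python
def permutation_hash(s):
    # O(n) time, O(1) extra space: 26 fixed counters for letters a-z.
    counts = [0] * 26
    for ch in s:
        counts[ord(ch) - ord('a')] += 1
    # Any hash of the fixed-size histogram will do; Python's tuple hash here.
    return hash(tuple(counts))

# Every permutation of the same letters produces the same histogram:
assert permutation_hash("abcd") == permutation_hash("bcad")
```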
The basic thing you can do is sort the strings before applying the hash function. So, to compute the hash of "adbc" or "dcba" you instead compute the hash of "abcd".
If you want to make sure that there are no collisions in your hash function, then the only way is to have the hash result be a string. There are many more strings than there are 32-bit (or 64-bit) integers, so collisions are inevitable (though unlikely with a good hash function).
Easiest way to understand: sort the letters in the string, and then hash the resulting string.
Some variations on your original idea also work, like:
long hashString(string a) {
    long hashed = 0;
    for (int i = 0; i < a.length(); i++) {
        long t = a[i] * 16777619;
        hashed += t ^ (t >> 8);
    }
    return hashed;
}
I suppose you need a hash such that two anagrams will hash to the same value. I'd suggest you sort them first and use any common hash function such as MD5. I wrote the following code in Scala:
import java.security.MessageDigest
def hash(s: String) = {
  MessageDigest.getInstance("MD5").digest(s.sorted.getBytes)
}
Note in scala:
scala> "hello".sorted
res0: String = ehllo
scala> "cinema".sorted
res1: String = aceimn
Synopsis: store a histogram of the letters in the hash value.
Step 1: compute a histogram of the letters (since a histogram uniquely identifies the letters in the string without regard to the order of the letters).
int histogram[26] = {0};  // zero-initialize the counts
for (int i = 0; i < a.length(); i++)
    histogram[a[i] - 'a']++;
Step 2: pack the histogram into the hash value. You have several options here. Which option to choose depends on what sort of limitations you can put on the strings.
If you knew that each letter would appear no more than 3 times, then it takes 2 bits to represent the count, so you could create a 52-bit hash that's guaranteed to be unique.
If you're willing to use a 128-bit hash, then you've got 5 bits for 24 letters, and 4 bits for 2 letters (e.g. q and z). The 128-bit hash allows each letter to appear 31 times (15 times for q and z).
But if you want a fixed-size hash, say 16-bit, then you need to pack the histogram into those 16 bits in a way that reduces collisions. The easiest way to do that is to create a 26-byte message (one byte for each entry in the histogram, allowing each letter to appear up to 255 times), then take the 16-bit CRC of the message, using your favorite CRC generator.
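A sketch of that last variant in Python, assuming `binascii.crc_hqx` (the 16-bit CRC-CCITT in the standard library) as the CRC generator:

```python
import binascii

def histogram_hash16(s):
    # One byte per letter (counts capped at 255) -> 26-byte message -> 16-bit CRC.
    histogram = [0] * 26
    for ch in s:
        histogram[ord(ch) - ord('a')] += 1
    message = bytes(min(count, 255) for count in histogram)
    return binascii.crc_hqx(message, 0)

assert histogram_hash16("abcd") == histogram_hash16("dcba")
```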

Encode a byte array with an alphabet, output should look randomly distributed

I'm encoding binary data b1, b2, ... bn using an alphabet. But since the binary representations of the bs are more or less sequential, a simple mapping of bits to chars results in very similar strings. Example:
encode(b1) => "QXpB4IBsu0"
encode(b2) => "QXpB36Bsu0"
...
I'm looking for ways to make the output more "random", meaning more difficult to guess the input b when looking at the output string.
Some requirements:
For different bs, the output strings must be different. Encoding the same b multiple times does not necessarily have to result in the same output. As long as there are no collisions between the output strings of different input bs, everything is fine.
If it is of any importance: each b is around 50-60 bits. The alphabet contains 64 characters.
The encoding function should not produce larger output strings than the ones you get by just using a simple mapping from the bits of bs to chars of the alphabet (given the values above, this means ~10 characters for each b). So just using a hash function like SHA is not an option.
Possible solutions for this problem don't need to be "cryptographically secure". If someone invests enough time and effort to reconstruct the binary data, then so be it. But the goal is to make it as difficult as possible. It maybe helps that a decode function is not needed anyway.
What I am doing at the moment:
take the next 4 bits from the binary data, let's say xxxx
prepend 2 random bits r to get rrxxxx
lookup the corresponding char in the alphabet: val char = alphabet[rrxxxx] and add it to the result (this works because the alphabet's size is 64)
continue with step 1
This approach adds some noise to the output string; however, the size of the string is increased by 50% due to the random bits. I could add more noise by adding more random bits (rrrxxx or even rrrrxx), but the output would get larger and larger. One of the requirements I mentioned above is not to increase the size of the output string. Currently I'm only using this approach because I have no better idea.
As an alternative procedure, I thought about shuffling the bits of an input b before applying the alphabet. But since it must be guaranteed that different bs result in different strings, the shuffle function should use some kind of determinism (maybe a secret number as an argument) instead of being completely random. I wasn't able to come up with such a function.
I'm wondering if there is a better way, any hint is appreciated.
Basically, you need a reversible pseudo-random mapping from each possible 50-bit value to another 50-bit value. You can achieve this with a reversible Linear Congruential Generator (the kind used for some pseudo-random number generators).
When encoding, apply the LCG to your number in the forward direction, then encode with base64. If you need to decode, decode from base64, then apply the LCG in the opposite direction to get your original number back.
This answer contains some code for a reversible LCG. You'll need one with a period of 2^50. The constants used to define your LCG would be your secret numbers.
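A minimal sketch in Python (the constants here are illustrative placeholders, not secret-worthy): an LCG x -> (a*x + c) mod 2^50 is reversible whenever a is odd, because a then has a modular inverse.

```python
M = 1 << 50            # modulus: the full 50-bit range
A = 0x5DEECE66D        # multiplier; odd, hence invertible mod 2^50 (illustrative)
C = 0xB                # additive constant (illustrative)
A_INV = pow(A, -1, M)  # modular inverse of A (Python 3.8+)

def scramble(x):
    # Forward direction: apply before base64-encoding.
    return (A * x + C) % M

def unscramble(y):
    # Reverse direction: apply after base64-decoding.
    return A_INV * (y - C) % M

assert all(unscramble(scramble(x)) == x for x in (0, 1, 12345, M - 1))
```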
You want to use a multiplicative inverse. That will take the sequential key and transform it into a non-sequential number. There is a one-to-one relationship between the keys and their results. So no two numbers will create the same non-sequential key, and the process is reversible.
I have a small example, written in C#, that illustrates the process.
private void DoIt()
{
    const long m = 101;
    const long x = 387420489; // must be coprime to m
    var multInv = MultiplicativeInverse(x, m);
    var nums = new HashSet<long>();
    for (long i = 0; i < 100; ++i)
    {
        var encoded = i * x % m;
        var decoded = encoded * multInv % m;
        Console.WriteLine("{0} => {1} => {2}", i, encoded, decoded);
        if (!nums.Add(encoded))
        {
            Console.WriteLine("Duplicate");
        }
    }
}
private long MultiplicativeInverse(long x, long modulus)
{
    // Normalize into [0, modulus): the extended Euclidean step can return a negative coefficient.
    return (ExtendedEuclideanDivision(x, modulus).Item1 % modulus + modulus) % modulus;
}
private static Tuple<long, long> ExtendedEuclideanDivision(long a, long b)
{
    if (a < 0)
    {
        var result = ExtendedEuclideanDivision(-a, b);
        return Tuple.Create(-result.Item1, result.Item2);
    }
    if (b < 0)
    {
        var result = ExtendedEuclideanDivision(a, -b);
        return Tuple.Create(result.Item1, -result.Item2);
    }
    if (b == 0)
    {
        return Tuple.Create(1L, 0L);
    }
    var q = a / b;
    var r = a % b;
    var rslt = ExtendedEuclideanDivision(b, r);
    var s = rslt.Item1;
    var t = rslt.Item2;
    return Tuple.Create(t, s - q * t);
}
Code cribbed from the above-mentioned article, and supporting materials.
The idea, then, is to take your sequential number, compute the inverse, and then base-64 encode it. To reverse the process, base-64 decode the value you're given, run it through the reverse calculation, and you have the original number.
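The same process is only a few lines in Python 3.8+, where `pow(x, -1, m)` computes the multiplicative inverse directly:

```python
m = 101                # modulus: the number of possible keys
x = 387420489          # must be coprime to m
x_inv = pow(x, -1, m)  # multiplicative inverse of x mod m (Python 3.8+)

encoded = [i * x % m for i in range(100)]
decoded = [e * x_inv % m for e in encoded]

assert decoded == list(range(100))        # the mapping is reversible
assert len(set(encoded)) == len(encoded)  # and one-to-one: no duplicates
```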

Hash decryption

I have a hash decryption function. If the input is 664804774844, the output is agdpeew. I use modulo and division to find each letter index. But in the while loop I wrote i = 7 because I know the output string (agdpeew) size. How can I find i?
A decryption function:
var f = function (h) {
    var letters, result, i;
    i = 7;
    result = "";
    letters = "acdegilmnoprstuw";
    while (i) {
        result += letters[Math.floor(h % 37)];
        h = Math.floor(h / 37); // integer division; plain / would leave a float
        i--;
    }
    return result.split("").reverse().join("");
};
An encrypted function:
function hash(s) {
    var h = 7;
    var letters = "acdegilmnoprstuw";
    for (var i = 0; i < s.length; i++) {
        h = h * 37 + letters.indexOf(s[i]);
    }
    return h;
}
It depends on how you handle overflows. If your "encryption" function allows inputs long enough that h would overflow at some point then you are stuffed and your current method of decryption wouldn't work at all.
If you can guarantee no overflowing then your final h will be the sum of terms of the form (An)x^n where An is the nth letter in your sequence converted to a number via your indexof method (and x in this case is 37)
Your decryption basically takes the x^0 term (by using mod x) and then converts that. It then divides by x (using integer maths presumably) to lose the old x^0 term and get a new one to interpret.
This means that you can actually just keep doing this until your h gets back down to the initial seed (7 in this code, since that is what h starts as), and at that point you know you've dealt with all the characters.
An interesting note is that x just needs to be greater than the length of letters (because An must be less than x). A smaller x would allow more input characters before overflow.
If you are allowing overflow then you have no way to do this unless you know how long the input was, and even then it might be tricky. If your input is unlimited in length then you could have a 1000-character input, and with all those combinations it might seem there are a lot of possible values of h. In fact there are not: there are still only 2^32 possible outcomes (fewer with your algorithm), and if you have more than 2^32 possible inputs then you cannot possibly have a reversible function, because you must have at least 2 inputs that produce the same hash value.
This is why leppie says you cannot decrypt a hash value because you lose information in creating it that cannot be recovered. Unless you have constraints or some other information then you are stuck.
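Putting that together: because the encoder seeds h with 7, a decoder can peel off base-37 digits until h falls back to that seed, with no need to know i in advance. A Python sketch (assuming no overflow ever occurred):

```python
LETTERS = "acdegilmnoprstuw"

def make_hash(s):
    h = 7
    for ch in s:
        h = h * 37 + LETTERS.index(ch)
    return h

def unhash(h):
    # Peel off base-37 digits until only the initial seed (7) remains.
    out = []
    while h > 7:
        h, digit = divmod(h, 37)
        out.append(LETTERS[digit])
    return "".join(reversed(out))

assert make_hash("agdpeew") == 664804774844  # the example from the question
assert unhash(664804774844) == "agdpeew"
```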

reflexive hash?

Is there a class of hash algorithms, whether theoretical or practical, such that an algo in the class might be considered 'reflexive' according a definition given below:
hash1 = algo1 ( "input text 1" )
hash1 = algo1 ( "input text 1" + hash1 )
The + operator might be concatenation or any other specified operation to combine the output (hash1) back into the input ("input text 1") so that the algorithm (algo1) will produce exactly the same result. i.e. collision on input and input+output.
The + operator must combine the entirety of both inputs and the algo may not discard part of the input.
The algorithm must produce 128 bits of entropy in the output.
It may, but need not, be cryptographically hard to reverse the output back to one or both possible inputs.
I am not a mathematician, but a good answer might include a proof of why such a class of algorithms cannot exist. This is not an abstract question, however. I am genuinely interested in using such an algorithm in my system, if one does exist.
Sure, here's a trivial one:
def algo1(input):
    sum = 0
    for i in input:
        sum += ord(i)
    return chr(sum % 256) + chr(-sum % 256)
Concatenate the result and the "hash" doesn't change. It's pretty easy to come up with something similar when you can reverse the hash.
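A quick check of that claim (the function restated so the snippet runs on its own): the two output characters sum to 0 mod 256, so appending them leaves the character-sum, and hence the hash, unchanged.

```python
def algo1(text):
    total = sum(ord(ch) for ch in text)
    # chr(total % 256) and chr(-total % 256) have codes summing to 0 mod 256.
    return chr(total % 256) + chr(-total % 256)

h1 = algo1("input text 1")
assert algo1("input text 1" + h1) == h1  # reflexive, as defined in the question
```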
Yes, you can get this effect with a CRC.
What you need to do is:
Implement an algorithm that will find a sequence of N input bits leading from one given state (of the N-bit CRC accumulator) to another.
Compute the CRC of your input in the normal way. Note the final state (call it A)
Using the function implemented in (1), find a sequence of bits that lead from A to A. This sequence is your hash code. You can now append it to the input.
[Initial state] >- input string -> [A] >- hash -> [A] ...
Here is one way to find the hash. (Note: there is an error in the numbers in the CRC32 example, but the algorithm works.)
And here's an implementation in Java. Note: I've used a 32-bit CRC (smaller than the 64 you specify) because that's implemented in the standard library, but with third-party library code you can easily extend it to larger hashes.
public static byte[] hash(byte[] input) {
    CRC32 crc = new CRC32();
    crc.update(input);
    int reg = ~(int) crc.getValue();
    return delta(reg, reg);
}

public static void main(String[] args) {
    byte[] prefix = "Hello, World!".getBytes(Charsets.UTF_8);
    System.err.printf("%s => %s%n", Arrays.toString(prefix), Arrays.toString(hash(prefix)));
    byte[] suffix = hash(prefix);
    byte[] combined = ArrayUtils.addAll(prefix, suffix);
    System.err.printf("%s => %s%n", Arrays.toString(combined), Arrays.toString(hash(combined)));
}

private static byte[] delta(int from, int to) {
    ByteBuffer buf = ByteBuffer.allocate(8);
    buf.order(ByteOrder.LITTLE_ENDIAN);
    buf.putInt(from);
    buf.putInt(to);
    for (int i = 8; i-- > 4;) {
        int e = CRCINVINDEX[buf.get(i) & 0xff];
        buf.putInt(i - 3, buf.getInt(i - 3) ^ CRC32TAB[e]);
        buf.put(i - 4, (byte) (e ^ buf.get(i - 4)));
    }
    return Arrays.copyOfRange(buf.array(), 0, 4);
}

private static final int[] CRC32TAB = new int[0x100];
private static final int[] CRCINVINDEX = new int[0x100];

static {
    CRC32 crc = new CRC32();
    for (int b = 0; b < 0x100; ++b) {
        crc.update(~b);
        CRC32TAB[b] = 0xFF000000 ^ (int) crc.getValue();
        CRCINVINDEX[CRC32TAB[b] >>> 24] = b;
        crc.reset();
    }
}
Building on ephemiat's answer, I think you can do something like this:
Pick your favorite symmetric-key block cipher (e.g. AES). For concreteness, let's say that it operates on 128-bit blocks. For a given key K, denote the encryption and decryption functions by Enc(K, block) and Dec(K, block), respectively, so that block = Dec(K, Enc(K, block)) = Enc(K, Dec(K, block)).
Divide your input into an array of 128-bit blocks (padding as necessary). You can either choose a fixed key K or make it part of the input to the hash. In the following, we'll assume that it's fixed.
def hash(input):
    state = arbitrary 128-bit initialization vector
    for i = 1 to len(input) do
        state = state ^ Enc(K, input[i])
    return concatenate(state, Dec(K, state))
This function returns a 256-bit hash. It should not be too hard to verify that it satisfies the "reflexivity" condition, with one caveat: the inputs must be padded to a whole number of 128-bit blocks before the hash is adjoined. In other words, instead of hash(input) = hash(input + hash(input)) as originally specified, we have hash(input) = hash(input' + hash(input)), where input' is just the padded input. I hope this isn't too onerous.
Well, I can tell you that you won't get a proof of nonexistence. Here's an example:
operator+(a,b): compute a 64-bit hash of a and a 64-bit hash of b, and concatenate the bitstrings, returning a 128-bit hash.
algo1: for some 128-bit value, ignore the last 64 bits and compute some hash of the first 64.
Informally, any algo1 whose first step is the same 64-bit hash that + applies to its first operand will do. Maybe not as interesting a class as you were looking for, but it fits the bill. And it's not without real-world instances either: lots of password hashing algorithms truncate their input.
I'm pretty sure that such a "reflexive hash" function (if it did exist in more than the trivial sense) would not be a useful hash function in the normal sense.
For an example of a "trivial" reflexive hash function:
int hash(Object obj) { return 0; }

algorithm for generating a random numeric string, 10,000 chars in length?

Can be in any language or even pseudocode. I was asked this in an interview question, and was curious what you guys can come up with.
I think this is a trick question - the obvious answer of generating digits using a standard library routine is almost certainly flawed, if you want to generate every possible 10000 digit number with equal probability...
If an algorithmic random number generator maintains n bits of state, then clearly it can generate at most 2^n possible different output sequences, because there are only 2^n different initial configurations.
2^33219 < 10^10000 < 2^33220, so if your algorithm uses fewer than 33220 bits of internal state, it cannot possibly generate some of the 10^10000 possible 10000-digit (decimal) numbers.
Typical standard library random number generators won't use anything like this much internal state. Even the Mersenne Twister (the most frequently mentioned generator with a large state that I'm aware of) only keeps 624 32-bit words (= 19968 bits) of state.
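If the goal really is that every 10000-digit number be reachable, one hedge is to draw each digit from the operating system's entropy pool rather than a small fixed-state PRNG. In Python, for instance:

```python
import secrets

# Each digit comes from the OS entropy source, so no small internal PRNG
# state caps the set of reachable 10,000-digit strings.
digits = "".join(str(secrets.randbelow(10)) for _ in range(10000))

assert len(digits) == 10000
assert digits.isdigit()
```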
Just one of many ways. You can pass in any string of the alphabet of characters you want to use:
public class RandomUtils
{
    private static readonly Random random = new Random((int)DateTime.Now.Ticks);

    public static string GenerateRandomDigitString(int length)
    {
        const string digits = "1234567890";
        return GenerateRandomString(length, digits);
    }

    public static string GenerateRandomAlphaString(int length)
    {
        const string alpha = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
        return GenerateRandomString(length, alpha);
    }

    public static string GenerateRandomString(int length, string alphabet)
    {
        int maxlen = alphabet.Length;
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < length; i++)
        {
            sb.Append(alphabet[random.Next(0, maxlen)]);
        }
        return sb.ToString();
    }
}
Without additional requirements, this will work:
StringBuilder randomStr = new StringBuilder(10000);
Random rnd = new Random();
for (int i = 0; i < 10000; i++)
{
    // Random has no AsChar(); cast a random value in the full byte range instead.
    char randomChar = (char)rnd.Next(0, 256);
    randomStr.Append(randomChar);
}
This will result in unprintable characters and other unpleasantness. Using an ASCII encoding you can get letters, numbers and punctuation by sticking to the range 32 - 126, or by creating a random number between 0 and 94 and adding 32. Not sure which aspect they were looking for in the question.
BTW, no, I did not know the visible range off the top of my head; I looked it up on Wikipedia.
Generate a number in the range 0..9. Convert it to a digit. Stuff that into a string. Repeat 10000 times.
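Those four steps, sketched literally in Python:

```python
import random

pieces = []
for _ in range(10000):
    digit = random.randrange(10)  # a number in the range 0..9
    pieces.append(str(digit))     # convert it to a digit character
number_string = "".join(pieces)   # stuff them all into one string

assert len(number_string) == 10000 and number_string.isdigit()
```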
I always like saying computer random numbers are only ever pseudo-random. Anyway, your favourite language will invariably have a random library. Next, what is a numeric string? Valued 0-9 at each character? Well, let's start with that assumption. Then we can generate each digit as (int)(random * 10) (since random generators typically return floats) plus the ASCII offset for '0' (48), place all of these in a char buffer 10000 long, and convert it to a string.
Return a string containing 10,000 1s -- that's just as random as any other digit string of the same length.
I think the real question was to determine what the interviewer actually wanted. For example, random in what sense? Uncompressable? Random over multiple runs of the same algorithm? Etc.
You can start with a list of seed digits:
seeds = [4,9,3,1,2,5,5,4,4,8,4,3] # This should be relatively large
Then, use a counter to keep track of which digit was last used. This would be system-wide and shouldn't reset with the system:
def next_digit():
    counter = 0
    while True:
        yield counter
        counter += 1

pos_it = next_digit()
rand_it = next_digit()
Next, use an algorithm that uses modulus to determine the "next number":
def random_digit():
    position = next(pos_it) % len(seeds)
    digit = seeds[position] * next(rand_it)
    return digit % 10
Last, generate 10,000 of those digits.
output = ""
for i in range(10000):
    output += str(random_digit())
I believe that an ideal answer would use more prime numbers, but this should be pretty sufficient.

Resources