Encode a byte array with an alphabet, output should look randomly distributed - algorithm

I'm encoding binary data b1, b2, ... bn using an alphabet. But since the binary representations of the bs are more or less sequential, a simple mapping of bits to chars results in very similar strings. Example:
encode(b1) => "QXpB4IBsu0"
encode(b2) => "QXpB36Bsu0"
...
I'm looking for ways to make the output more "random", meaning more difficult to guess the input b when looking at the output string.
Some requirements:
For differenent bs, the output strings must be different. Encoding the same b multiple times does not necessarily have to result in the same output. As long as there are no collisions between the output strings of different input bs, everything is fine.
If it is of any importance: each b is around ~50-60 bits. The alphabet contains 64 characters.
The encoding function should not produce larger output strings than the ones you get by just using a simple mapping from the bits of bs to chars of the alphabet (given the values above, this means ~10 characters for each b). So just using a hash function like SHA is not an option.
Possible solutions for this problem don't need to be "cryptographically secure". If someone invests enough time and effort to reconstruct the binary data, then so be it. But the goal is to make it as difficult as possible. It maybe helps that a decode function is not needed anyway.
What I am doing at the moment:
take the next 4 bits from the binary data, let's say xxxx
prepend 2 random bits r to get rrxxxx
lookup the corresponding char in the alphabet: val char = alphabet[rrxxxx] and add it to the result (this works because the alphabet's size is 64)
continue with step 1
This appraoch adds some noise to the output string, however, the size of the string is increased by 50% due to the random bits. I could add more noise by adding more random bits (rrrxxx or even rrrrxx), but the output would get larger and larger. One of the requirements I mentioned above is not to increase the size of the output string. Currently I'm only using this approach because I have no better idea.
As an alternative procedure, I thought about shuffling the bits of an input b before applying the alphabet. But since it must be guaranteed that different bs result in different strings, the shuffle function should use some kind of determinism (maybe a secret number as an argument) instead of being completely random. I wasn't able to come up wih such a function.
I'm wondering if there is a better way, any hint is appreciated.

Basically, you need a reversible pseudo-random mapping from each possible 50-bit value to another 50-bit value. You can achieve this with a reversible Linear Congruential Generator (the kind used for some pseudo-random number generators).
When encoding, apply the LCG to your number in the forward direction, then encode with base64. If you need to decode, decode from base64, then apply the LCG in the opposite direction to get your original number back.
This answer contains some code for a reversible LCG. You'll need one with a period of 250. The constants used to define your LCG would be your secret numbers.

You want to use a multiplicative inverse. That will take the sequential key and transform it into a non-sequential number. There is a one-to-one relationship between the keys and their results. So no two numbers will create the same non-sequential key, and the process is reversible.
I have a small example, written in C#, that illustrates the process.
private void DoIt()
{
const long m = 101;
const long x = 387420489; // must be coprime to m
var multInv = MultiplicativeInverse(x, m);
var nums = new HashSet<long>();
for (long i = 0; i < 100; ++i)
{
var encoded = i*x%m;
var decoded = encoded*multInv%m;
Console.WriteLine("{0} => {1} => {2}", i, encoded, decoded);
if (!nums.Add(encoded))
{
Console.WriteLine("Duplicate");
}
}
}
private long MultiplicativeInverse(long x, long modulus)
{
return ExtendedEuclideanDivision(x, modulus).Item1%modulus;
}
private static Tuple<long, long> ExtendedEuclideanDivision(long a, long b)
{
if (a < 0)
{
var result = ExtendedEuclideanDivision(-a, b);
return Tuple.Create(-result.Item1, result.Item2);
}
if (b < 0)
{
var result = ExtendedEuclideanDivision(a, -b);
return Tuple.Create(result.Item1, -result.Item2);
}
if (b == 0)
{
return Tuple.Create(1L, 0L);
}
var q = a/b;
var r = a%b;
var rslt = ExtendedEuclideanDivision(b, r);
var s = rslt.Item1;
var t = rslt.Item2;
return Tuple.Create(t, s - q*t);
}
Code cribbed from the above-mentioned article, and supporting materials.
The idea, then, is to take your sequential number, compute the inverse, and then base-64 encode it. To reverse the process, base-64 decode the value you're given, run it through the reverse calculation, and you have the original number.

Related

Hash decryption

I have a hash decryption function. If input is 664804774844 output is agdpeew. I use modulo and division for finding letter index. But in while loop I written i = 7 becauce I know output string (agdpeew) size. How I can find i?
A decryption function:
var f = function (h) {
var letters, result, i;
i = 7;
result = "";
letters = "acdegilmnoprstuw";
while (i) {
Result += letters [parseInt (h % 37)];
h = h / 37;
i--;
}
return result.split("").reverse().join("");
};
An encrypted function:
hash (s) {
h = 7;
letters = "acdegilmnoprstuw";
for(i = 0; i < s.length; i++) {
h = (h * 37 + letters.indexOf(s[i]));
}
return h;
}
It depends on how you handle overflows. If your "encryption" function allows inputs long enough that h would overflow at some point then you are stuffed and your current method of decryption wouldn't work at all.
If you can guarantee no overflowing then your final h will be the sum of terms of the form (An)x^n where An is the nth letter in your sequence converted to a number via your indexof method (and x in this case is 37)
Your decryption basically takes the x^0 term (by using mod x) and then converts that. It then divides by x (using integer maths presumably) to lose the old x^0 term and get a new one to interpret.
This means that you can actually just keep doing this until your h is 0 and at that point you know you've dealt with all the characters.
An interesting note is that x just needs to be greater than length of letters (because An must be less than x). A smaller X would give more possible input characters before overflow.
If you are allowing overflow then you have no way to do this unless you know how long the input was. Even then it might be tricky. If your input is unlimited in length then you could have a 1000 character input and with all those combinations there are a lot of possible values of h. Though in fact there are not. There are still 2^32 possible outcomes (in fact less with your algorithm) and if you have more than 2^32 possible inputs then you cannot possibly have a reversible function because you must have at least 2 inputs that would match that hash value.
This is why leppie says you cannot decrypt a hash value because you lose information in creating it that cannot be recovered. Unless you have constraints or some other information then you are stuck.

Data structure for set of (non-disjoint) sets

I'm looking for a data structure that roughly corresponds to (in Java terms) Map<Set<int>, double>. Essentially a set of sets of labeled marbles, where each set of marbles is associated with a scalar. I want it to be able to efficiently handle the following operations:
Add a given integer to every set.
Remove every set that contains (or does not contain) a given integer, or at least set the associated double to 0.
Union two of the maps, adding together the doubles for sets that appear in both.
Multiply all of the doubles by a given double.
Rarely, iterate over the entire map.
under the following conditions:
The integers will fall within a constrained range (between 1 and 10,000 or so); the exact range will be known at compile-time.
Most of the integers within the range (80-90%) will never be used, but which ones will not be easily determinable until the end of the calculation.
The number of integers used will almost always still be over 100.
Many of the sets will be very similar, differing only by a few elements.
It may be possible to identify certain groups of integers that frequently appear only in sequential order: for example, if a set contains the integers 27 and 29 then it (almost?) certainly contains 28 as well.
It may be possible to identify these groups prior to running the calculation.
These groups would typically have 100 or so integers.
I've considered tries, but I don't see a good way to handle the "remove every set that contains a given integer" operation.
The purpose of this data structure would be to represent discrete random variables and permit addition, multiplication, and scalar multiplication operations on them. Each of these discrete random variables would ultimately have been created by applying these operations to a fixed (at compile-time) set of independent Bernoulli random variables (i.e. each takes the value 1 or 0 with some probability).
The systems being modeled are close to being representable as a time-inhomogeneous Markov chains (which would of course simplify this immensely) but, unfortunately, it is essential to track the duration since various transitions.
Here's a data structure, that can do all of your operations pretty efficiently:
I'm going to refer to it as a BitmapArray for this explanation.
Thinking about it, apparently for just the operations you have described a sorted array with bitmaps as keys and weights(your doubles) as values will be pretty efficient.
The bitmaps are what maintain membership in your set. Since you said the range of integers in the set are between 1-10,000, we can maintain information about any set with a bitmap of length 10,000.
It's gonna be tough sorting an array where the keys can be as big as 2^10000, but you can be smart about implementing the comparison function in the following way:
Iterate from left to right on the two bitmaps
XOR the bits on each index
Say you get a 1 at ith position
Whichever bitmap has 1 at ith position is greater
If you never get a 1 they're equal
I know this is still a slow comparison.
But not too slow, Here's a benchmark fiddle I did on bitmaps with length 10000.
This is in Javascript, if you're going to write in Java, it's going to perform even better.
function runTest() {
var num = document.getElementById("txtValue").value;
num = isNaN(num * 1) ? 0 : num * 1;
/*For integers in the range 1-10,000 the worst case for comparison are any equal integers which will cause the comparision to iterate over the whole BitArray*/
bitmap1 = convertToBitmap(10000, num);
bitmap2 = convertToBitmap(10000, num);
before = new Date().getMilliseconds();
var result = firstIsGreater(bitmap1, bitmap2, 10000);
after = new Date().getMilliseconds();
alert(result + " in time: " + (after-before) + " ms");
}
function convertToBitmap(size, number) {
var bits = new Array();
var q = number;
do {
bits.push(q % 2);
q = Math.floor(q / 2);
} while (q > 0);
xbitArray = new Array();
for (var i = 0; i < size; i++) {
xbitArray.push(0);
}
var j = xbitArray.length - 1;
for (var i = bits.length - 1; i >= 0; i--) {
xbitArray[j] = bits[i];
j--
}
return xbitArray;
}
function firstIsGreater(bitArray1, bitArray2, lengthOfArrays) {
for (var i = 0; i < lengthOfArrays; i++) {
if (bitArray1[i] ^ bitArray2[i]) {
if (bitArray1[i]) return true;
else return false;
}
}
return false;
}
document.getElementById("btnTest").onclick = function (e) {
runTest();
};
Also, remember that you only have to do this once, when building your BitmapArray (or while taking unions) and then it's going to become pretty efficient for the operations you'd do most often:
Note: N is the length of the BitmapArray.
Add integer to every set: Worst/best case O(N) time. Flip a 0 to 1 in each bitmap.
Remove every set that contains a given integer: Worst case O(N) time.
For each bitmap check the bit that represents the given integer, if 1 mark it's index.
Compress the array by deleting all marked indices.
If you're okay with just setting the weights to 0 it'll be even more efficient. This also makes it very easy if you want to remove all sets that have any element in a given set.
Union of two maps: Worst case O(N1+N2) time. Just like merging two sorted arrays, except you have to be smart about comparisons once more.
Multiply all of the doubles by a given double: Worst/best case O(N) time. Iterate and multiply each value by the input double.
Iterate over the BitmapArray: Worst/best case O(1) time for next element.

How to convert from any large arbitrary base to another

What I’d like to do is to convert a string from one "alphabet" to another, much like converting numbers between bases, but more abstract and with arbitrary digits.
For instance, converting "255" from the alphabet "0123456789" to the alphabet "0123456789ABCDEF" would result in "FF". One way to do this is to convert the input string into an integer, and then back again. Like so: (pseudocode)
int decode(string input, string alphabet) {
int value = 0;
for(i = 0; i < input.length; i++) {
int index = alphabet.indexOf(input[i]);
value += index * pow(alphabet.length, input.length - i - 1);
}
return value;
}
string encode(int value, string alphabet) {
string encoded = "";
while(value > 0) {
int index = value % alphabet.length;
encoded = alphabet[index] + encoded;
value = floor(value / alphabet.length);
}
return encoded;
}
Such that decode("255", "0123456789") returns the integer 255, and encode(255, "0123456789ABCDEF") returns "FF".
This works for small alphabets, but I’d like to be able to use base 26 (all the uppercase letters) or base 52 (uppercase and lowercase) or base 62 (uppercase, lowercase and digits), and values that are potentially over a hundred digits. The algorithm above would, theoretically, work for such alphabets, but, in practice, I’m running into integer overflow because the numbers get so big so fast when you start doing 62^100.
What I’m wondering is if there is an algorithm to do a conversion like this without having to keep up with such gigantic integers? Perhaps a way to begin the output of the result before the entire input string has been processed?
My intuition tells me that it might be possible, but my math skills are lacking. Any help would be appreciated.
There are a few similar questions here on StackOverflow, but none seem to be exactly what I'm looking for.
A general way to store numbers in an arbitrary base would be to store it as an array of integers. Minimally, a number would be denoted by a base and array of int (or short or long depending on the range of bases you want) representing different digits in that base.
Next, you need to implement multiplication in that arbitrary base.
After that you can implement conversion (clue: if x is the old base, calculate x, x^2, x^3,..., in the new base. After that, multiply digits from old base accordingly to these numbers and then add them up).
Java-like Pseudocode:
ArbitraryBaseNumber source = new ArbitraryBaseNumber(11,"103A");
ArbitraryBaseNumber target = new ArbitraryBaseNumber(3,"0");
for(int digit : base3Num.getDigitListAsIntegers()) { // [1,0,3,10]
target.incrementBy(digit);
if(not final digit) {
target.multiplyBy(source.base);
}
}
The challenge that remains, of course, is to implement ArbitraryBaseNumber, with incrementBy(int) and multiplyBy(int) methods. Essentially to do that, you do in code exactly what a schoolchild does when doing addition and long-multiplication on paper. Google and you'll find example.

reflexive hash?

Is there a class of hash algorithms, whether theoretical or practical, such that an algo in the class might be considered 'reflexive' according a definition given below:
hash1 = algo1 ( "input text 1" )
hash1 = algo1 ( "input text 1" + hash1 )
The + operator might be concatenation or any other specified operation to combine the output (hash1) back into the input ("input text 1") so that the algorithm (algo1) will produce exactly the same result. i.e. collision on input and input+output.
The + operator must combine the entirety of both inputs and the algo may not discard part of the input.
The algorithm must produce 128 bits of entropy in the output.
It may, but need not, be cryptographically hard to reverse the output back to one or both possible inputs.
I am not a mathematician, but a good answer might include a proof of why such a class of algorithms cannot exist. This is not an abstract question, however. I am genuinely interested in using such an algorithm in my system, if one does exist.
Sure, here's a trivial one:
def algo1(input):
sum = 0
for i in input:
sum += ord(i)
return chr(sum % 256) + chr(-sum % 256)
Concatenate the result and the "hash" doesn't change. It's pretty easy to come up with something similar when you can reverse the hash.
Yes, you can get this effect with a CRC.
What you need to do is:
Implement an algorithm that will find a sequence of N input bits leading from one given state (of the N-bit CRC accumulator) to another.
Compute the CRC of your input in the normal way. Note the final state (call it A)
Using the function implemented in (1), find a sequence of bits that lead from A to A. This sequence is your hash code. You can now append it to the input.
[Initial state] >- input string -> [A] >- hash -> [A] ...
Here is one way to find the hash. (Note: there is an error in the numbers in the CRC32 example, but the algorithm works.)
And here's an implementation in Java. Note: I've used a 32-bit CRC (smaller than the 64 you specify) because that's implemented in the standard library, but with third-party library code you can easily extend it to larger hashes.
public static byte[] hash(byte[] input) {
CRC32 crc = new CRC32();
crc.update(input);
int reg = ~ (int) crc.getValue();
return delta(reg, reg);
}
public static void main(String[] args) {
byte[] prefix = "Hello, World!".getBytes(Charsets.UTF_8);
System.err.printf("%s => %s%n", Arrays.toString(prefix), Arrays.toString(hash(prefix)));
byte[] suffix = hash(prefix);
byte[] combined = ArrayUtils.addAll(prefix, suffix);
System.err.printf("%s => %s%n", Arrays.toString(combined), Arrays.toString(hash(combined)));
}
private static byte[] delta(int from, int to) {
ByteBuffer buf = ByteBuffer.allocate(8);
buf.order(ByteOrder.LITTLE_ENDIAN);
buf.putInt(from);
buf.putInt(to);
for (int i = 8; i-- > 4;) {
int e = CRCINVINDEX[buf.get(i) & 0xff];
buf.putInt(i - 3, buf.getInt(i - 3) ^ CRC32TAB[e]);
buf.put(i - 4, (byte) (e ^ buf.get(i - 4)));
}
return Arrays.copyOfRange(buf.array(), 0, 4);
}
private static final int[] CRC32TAB = new int[0x100];
private static final int[] CRCINVINDEX = new int[0x100];
static {
CRC32 crc = new CRC32();
for (int b = 0; b < 0x100; ++ b) {
crc.update(~b);
CRC32TAB[b] = 0xFF000000 ^ (int) crc.getValue();
CRCINVINDEX[CRC32TAB[b] >>> 24] = b;
crc.reset();
}
}
Building on ephemiat's answer, I think you can do something like this:
Pick your favorite symmetric key block cipher (e.g.: AES) . For concreteness, let's say that it operates on 128-bit blocks. For a given key K, denote the encryption function and decryption function by Enc(K, block) and Dec(K, block), respectively, so that block = Dec(K, Enc(K, block)) = Enc(K, Dec(K, block)).
Divide your input into an array of 128-bit blocks (padding as necessary). You can either choose a fixed key K or make it part of the input to the hash. In the following, we'll assume that it's fixed.
def hash(input):
state = arbitrary 128-bit initialization vector
for i = 1 to len(input) do
state = state ^ Enc(K, input[i])
return concatenate(state, Dec(K, state))
This function returns a 256-bit hash. It should be not too hard to verify that it satisfies the "reflexivity" condition with one caveat -- the inputs must be padded to a whole number of 128-bit blocks before the hash is adjoined. In other words, instead of hash(input) = hash(input + hash(input)) as originally specified, we have hash(input) = hash(input' + hash(input)) where input' is just the padded input. I hope this isn't too onerous.
Well, I can tell you that you won't get a proof of nonexistence. Here's an example:
operator+(a,b): compute a 64-bit hash of a, a 64-bit hash of b, and concatenate the bitstrings, returning an 128-bit hash.
algo1: for some 128-bit value, ignore the last 64 bits and compute some hash of the first 64.
Informally, any algo1 that yields the first operator to + as its first step will do. Maybe not as interesting a class as you were looking for, but it fits the bill. And it's not without real-world instances either. Lots of password hashing algorithms truncate their input.
I'm pretty sure that such a "reflexive hash" function (if it did exist in more than the trivial sense) would not be a useful hash function in the normal sense.
For an example of a "trivial" reflexive hash function:
int hash(Object obj) { return 0; }

algorithm for generating a random numeric string, 10,000 chars in length?

Can be in any language or even pseudocode. I was asked this in an interview question, and was curious what you guys can come up with.
I think this is a trick question - the obvious answer of generating digits using a standard library routine is almost certainly flawed, if you want to generate every possible 10000 digit number with equal probability...
If an algorithmic random number generator maintains n bits of state, then clearly it can generate at most 2n possible different output sequences, because there are only 2n different initial configurations.
233219 < 1010000 < 233220, so if your algorithm uses less than 33220 bits of internal state, it cannot possibly generate some of the 1010000 possible 10000-digit (decimal) numbers.
Typical standard library random number generators won't use anything like this much internal state. Even the Mersenne Twister (the most frequently mentioned generator with a large state that I'm aware of) only keeps 624 32-bit words (= 19968 bits) of state.
Just one of many ways. You can pass in any string of the alphabet of characters you want to use:
public class RandomUtils
{
private static readonly Random random = new Random((int)DateTime.Now.Ticks);
public static string GenerateRandomDigitString(int length)
{
const string digits = "1234567890";
return GenerateRandomString(length, digits);
}
public static string GenerateRandomAlphaString(int length)
{
const string alpha = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
return GenerateRandomString(length, alpha);
}
public static string GenerateRandomString(int length, string alphabet)
{
int maxlen = alphabet.Length;
StringBuilder sb = new StringBuilder();
for (int i = 0; i < length; i++)
{
sb.Append(alphabet[random.Next(0, maxlen)]);
}
return sb.ToString();
}
}
Without additional requirements, this will work:
StringBuilder randomStr = new StringBuilder(10000);
Random rnd = new Random();
for(int i = 0; i<10000;i++)
{
char randomChar = rnd.AsChar();
randomStr[i] = randomChar;
}
This will result in unprintable characters and other unpleasentness. Using an ASCII encoder you can get letters, numbers and punctutaiton by sticking to the range 32 - 126. Or creating a random number between 0 and 94 and adding 32. Not sure which aspect they were looking for in the question.
BTW, No I did not know the visible range off the top of my head, I looked it up on wikipedia.
Generate a number in the range 0..9. Convert it to a digit. Stuff that into a string. Repeat 10000 times.
I always like saying Computer Random Numbers are always only pseudo-random. Anyway, your favourite language will invariably have a random library. Next what is a numeric string ? 0-9 valued for each character ? Well let's start with that assumption. So we can generate bytes between to Ascii codes of 0-9 with offset (48) and (int) random*10 (since random generators typically return floats). Then place these all in a char buffer 10000 long and convert to string.
Return a string containing 10,000 1s -- that's just as random as any other digit string of the same length.
I think the real question was to determine what the interviewer actually wanted. For example, random in what sense? Uncompressable? Random over multiple runs of the same algorithm? Etc.
You can start with a list of seed digits:
seeds = [4,9,3,1,2,5,5,4,4,8,4,3] # This should be relatively large
Then, use a counter to keep track of which digit was last used. This would be system-wide and shouldn't reset with the system:
def next_digit():
counter = 0
while True:
yield counter
counter += 1
pos_it = next_digit()
rand_it = next_digit()
Next, use an algorithm that uses modulus to determine the "next number":
def random_digit():
position = pos_it.next() % len(seeds)
digit = seeds[position] * rand_it.next()
return digit % 10
Last, generate 10,000 of those digits.
output = ""
for i in range(10000):
output = "%s%s" % (output, random_digit())
I believe that an ideal answer would use more prime numbers, but this should be pretty sufficient.

Resources