Algorithm to Map Strings to Short Replacements

I'm looking at ways to deterministically replace unique strings with unique and optimally short replacements. So I have a finite set of strings, and the best compression I could achieve so far is through an enumeration algorithm, where I order the input set and then replace the strings with an enumeration of char strings over an extended alphabet (a..z, A...Z, aa...zz, aA... zZ, a0...z9, Aa..., aaa...zaa, aaA...zaaA, ....).
This works wonderfully as far as compression is concerned, but has the severe drawback that it is not atomic on any given input string. Rather, its result depends on knowing all input strings right from the start, and on the ordering of the input set.
Does anybody know of an algorithm that has similar compression but doesn't require knowing all input strings upfront? Hashing, for example, would not work for me: depending on the size of the input set I'd need a hash length of 8-12 for the hashes to be unique, and that would be too long for replacements (currently, the replacement strings are 1-3 chars long for my use cases (<10,000 input strings)). Also, if the theoreticians among us know this is wasted effort, I would be interested to hear :-) .

You could use your enumeration scheme, but sorted by the order in which you first encounter the input strings.
For example, the first string you ever process can be mapped to "a".
The next distinct string would be mapped to "b", etc.
Every time you process a string, you'd need to look it up to see if it has already been mapped.
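A minimal sketch of this first-seen enumeration, assuming a simplified replacement alphabet of just a..z and A..Z (the question's scheme also mixes in digits). The names `short_code` and `Shortener` are made up for illustration.

```python
import string

ALPHABET = string.ascii_letters  # assumed alphabet: a..z then A..Z (52 symbols)


def short_code(index):
    """Map 0, 1, 2, ... to a, b, ..., Z, aa, ab, ... (bijective base 52)."""
    k = len(ALPHABET)
    index += 1
    chars = []
    while index > 0:
        index, rem = divmod(index - 1, k)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))


class Shortener:
    """Hands out codes in the order in which distinct strings are first seen."""

    def __init__(self):
        self._codes = {}

    def replace(self, s):
        # Look the string up first; only unseen strings get a new code.
        if s not in self._codes:
            self._codes[s] = short_code(len(self._codes))
        return self._codes[s]
```

The result still depends on the order of arrival, but not on knowing the whole input set in advance. The lookup table has to be kept (or persisted) for the mapping to stay stable across runs.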

"Optimally short" depends on the population of strings from which your samples are drawn. In the absence of systematic redundancy in the population, you will find that only a fraction of arbitrary strings can be compressed at all (e.g., consider trying to compress random bit strings).
If you can make assumptions about your data, such as "the strings are expected to be mainly composed of English words" then you can do something simple and effective based on letter frequency (e.g., for English, the relative frequency order is something like ETAOINSHRDLUGCY..., so you would want to use fewer bits to represent Es and more bits to represent uncommon letters like Q).
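One standard way to give frequent letters fewer bits is Huffman coding. The sketch below is only an illustration of that idea: it derives the frequencies from a sample text passed in by the caller rather than from a real English frequency table, and `huffman_codes` is a made-up name.

```python
import heapq
from collections import Counter


def huffman_codes(text):
    """Return {symbol: bit string}; frequent symbols get shorter codes."""
    if not text:
        return {}
    counts = Counter(text)
    # Heap entries: (weight, unique tie breaker, {symbol: code so far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    if len(heap) == 1:
        # A single distinct symbol still needs one bit.
        return {sym: "0" for sym in heap[0][2]}
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees, prefixing their codes with 0 and 1.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]
```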
Cheers.

Related

Is it possible to check if a short sequence of text is random or not?

Is it possible to check if a short sequence of text, e.g. two or three words, is random or not?
My first thought was to calculate the entropy on the string.
H("hello world") = 2.84535
H("sdzfjksher") = 3.12193
but any permutation of the chars in "hello world" will result in the same entropy while producing a random-looking string like "llloo ehrdw". Entropy-based methods work well on long strings such as running text. There you can also count single chars to determine whether it's a language. You can also use Zipf's law to check for real languages...
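For reference, the values above match the per-character Shannon entropy (in bits) of the character counts. A small sketch, with `shannon_entropy` as an illustrative name:

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Per-character Shannon entropy in bits, from character frequencies only."""
    counts = Counter(text)
    total = len(text)
    return -sum(c / total * math.log2(c / total) for c in counts.values())
```

Because only the counts enter the formula, the order of the characters has no influence on the result, which is exactly the weakness described above.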
The next method would be a lookup table of common words, like a normal English dictionary. The problem with this method is that you have to create a list of words first.
For example:
input string     result
------------------------------------------------------
"hello world"    matches 2 words
"helloworld"     random string
"lllooehrdw"     random string
"hello.world"    probably 2 words
"a.be.was"       probably 3 words (but this is probably a strange edge case)
So it's all about finding words here to compare them with your wordlist, which can be really hard.
Another problem with all these methods could be that they only detect certain languages or need to be trained for a certain language. Assume that we only want to handle English for now.
So is there any good method to do this, or do I need to accept false positives and false negatives?
You could count the frequency of characters used in the text and compare this with known character distributions in English and/or other languages. This will give an indication of the probability that the text is/resembles that language or not.
Sounds like you want to use the frequencies of the letters to see if a string is a word or random letters.
http://scottbryce.com/cryptograms/stats.htm
Combining statistics and wordlists sounds like the way to go to reduce false positives.

Arbitrary base conversion algorithm for (textually represented) integers

I am looking for a general algorithm that would convert from one (arbitrary) numerical base to another (also arbitrary) without storing the result in a large integer and performing arithmetic operations on it in between.
The algorithm I am looking for takes an array of numerical values in a given base (mostly a string of characters) and returns the result in the same form.
Thank you for help.
I would say it is not possible. For certain pairs of bases you can convert from one string to the other by just streaming the chars through (e.g. when both bases are powers of the same number, like octal->hex, where both are powers of 2), but for arbitrary bases it is not possible without arithmetic operations.
If you did it with strings/chars in between, it would still be big integer arithmetic; your integers would just be stored in an unusual (and unnecessarily large) format.
So you have just two choices: either reimplement the arithmetic operations on char-encoded numbers, or use a big integer library and take the convert(chars in base1 -> bigInt), convert(bigInt -> chars in base2) path.
It's computable, but it's not pretty.
Seriously, it'd probably be easier and faster to include one of the many bignum libraries or write your own.
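To illustrate the "arithmetic on char-encoded numbers" route, here is a sketch that repeatedly long-divides the digit array by the target base. The function name and the choice of a most-significant-first list of ints are assumptions for the example, not something from the question. Every intermediate value stays below from_base * to_base, so no big integer is ever stored.

```python
def convert_base(digits, from_base, to_base):
    """Convert a most-significant-first digit list from one base to another."""
    digits = list(digits)
    out = []
    while digits:
        remainder = 0
        quotient = []
        # One pass of schoolbook long division of `digits` by `to_base`.
        for d in digits:
            acc = remainder * from_base + d
            q, remainder = divmod(acc, to_base)
            if quotient or q:  # skip leading zeros of the quotient
                quotient.append(q)
        out.append(remainder)  # next least-significant output digit
        digits = quotient
    return out[::-1] or [0]
```

For example, convert_base([2, 5, 5], 10, 16) gives [15, 15], i.e. FF. The running time is quadratic in the number of digits, which is why a bignum library is usually the more comfortable choice.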

Repetition-based, pattern-based data compression algorithm

Suppose I have the following string:
ABCADCADCADABC
I want to compress it by finding repeating substrings.
What's an algorithm that gives the optimal compression?
In the above example it should return
AB*1 CAD*3 ABC*1
For comparison, a greedy algorithm might return
ABC*1 ADC*2 AD*1 ABC*1
Depending on whether you prefer fast and simple or high compression ratio you could take a look into the Lempel-Ziv-Welch (LZW) or Lempel-Ziv-Markov chain (LZMA) algorithms. They both keep dictionaries of recurring strings.
This sounds like a job for suffix arrays/trees!
http://en.wikipedia.org/wiki/Suffix_array
You can use a suffix array built over your string to figure out patterns that repeat. For instance, we can build a suffix array over your example as follows (I'm using $ as always coming after every letter, you can sort it so that $ comes before every letter ... either way will work):
ABCADCADCADABC$
ABC$
ADABC$
ADCADABC$
ADCADCADABC$
BCADCADCADABC$
BC$
CADABC$
CADCADABC$
CADCADCADABC$
C$
DABC$
DCADABC$
DCADCADABC$
$
From this, we can more easily see the common patterns in the string. Using the information in this suffix array representation, we can see that CAD is repeated 3x in a local area, and we'd likely use this as our choice for compression. ADC and DCA and so on are not as attractive because they compress less of the string.
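A naive sketch that reproduces the listing above (fine for short strings; real suffix array construction algorithms are much faster). It assumes, as in the example, that $ sorts after every letter; `suffix_array` and `lcp` are illustrative names.

```python
def suffix_array(text, sentinel="$"):
    """Return all suffixes of text + sentinel, sorted with the sentinel last."""
    s = text + sentinel
    last = chr(0x10FFFF)  # stand-in that sorts after every other character
    order = sorted(range(len(s)), key=lambda i: s[i:].replace(sentinel, last))
    return [s[i:] for i in order]


def lcp(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n
```

Neighbouring suffixes with a long common prefix point at repeated substrings; for instance, lcp applied to the adjacent entries CADCADABC$ and CADCADCADABC$ gives 6.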
http://en.wikipedia.org/wiki/Suffix_tree
Suffix trees are a more efficient way of doing the same task. Once you wrap your head around how to do something using suffix arrays, it's not too far of a jump to go on to suffix trees. In fact, this is used in popular compression algorithms including LZW and BWT (Bzip).
It may not be practically relevant, but for the particular question you ask there is a dynamic programming solution. If you have computed the optimum way to compress the strings of length 1, 2, 3...n-1 starting from the first character, then you can compute the optimum way to compress the string of length n starting from the first character by looking at the last k characters for each possibility k and seeing if they form a multiple of a simple string. If so, compute the cost of compressing the first n-k characters and then expressing the last k characters using a multiple of a string.
So in your example you would finish up by noticing that ABC is a multiple of itself, and that if you express it as ABC*1 you can use the answer you had already worked out for the first 11 characters (AB*1 CAD*3) to produce AB*1 CAD*3 ABC*1.
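A sketch of that dynamic programme. The question does not define what "optimal" means, so this assumes a cost of len(unit) + 1 per block (the unit's characters plus one slot for the repeat count); with a different cost model the result can differ. `best_repeat_blocks` is a made-up name.

```python
def best_repeat_blocks(s):
    """Return (cost, [(unit, count), ...]) minimising sum(len(unit) + 1)."""
    n = len(s)
    # best[i] holds the cheapest (cost, blocks) found for the prefix s[:i].
    best = [(0, [])] + [None] * n
    for i in range(1, n + 1):
        for k in range(1, i + 1):  # the last block covers s[i-k:i]
            tail = s[i - k:i]
            for u in range(1, k + 1):  # candidate unit length
                if k % u == 0 and tail == tail[:u] * (k // u):
                    cost = best[i - k][0] + u + 1
                    if best[i] is None or cost < best[i][0]:
                        best[i] = (cost, best[i - k][1] + [(tail[:u], k // u)])
                    break  # the shortest unit is the cheapest for this tail
    return best[n]
```

Under this assumed cost, the example string ABCADCADCADABC comes out as AB*1 CAD*3 ABC*1 with cost 11, whereas the greedy split from the question costs 15.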
Better still would be:
ABCAD(6,3)(3,11)
where (n,d) is a length and distance back of a match. So (6,3) copies six bytes starting from three bytes back. While that may sound a little odd, by the time it gets three bytes in, the next three bytes it needs have been copied. So CADCAD is appended. The (3,11) causes ABC to be appended.
This is called LZ77 compression. It is what is implemented by zip, gzip, and zlib using the deflate compressed data format. That format not only references previous string matches, but also uses Huffman compression on the literals (e.g. ABCAD) as well as the lengths and distances.
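To make the (length, distance) notation concrete, here is a small decoder sketch. The token format (plain strings for literals, 2-tuples for matches) is invented for this example and is not the deflate bit format.

```python
def lz77_decode(tokens):
    """Expand literals (str) and (length, distance) back-references."""
    out = []
    for tok in tokens:
        if isinstance(tok, str):
            out.extend(tok)
        else:
            length, distance = tok
            # Copy one symbol at a time so a match may overlap its own output.
            for _ in range(length):
                out.append(out[-distance])
    return "".join(out)
```

Decoding ["ABCAD", (6, 3), (3, 11)] with this gives back ABCADCADCADABC. The symbol-by-symbol copy is what makes the overlapping (6, 3) match work.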

Guessing the hash function?

I'd like to know which algorithm is employed. I strongly assume it's something simple and hopefully common. There's no lag in generating the results, for instance.
Input: any string
Output: 5 hex characters (0-F)
I have access to as many keys and results as I wish, but I don't know how exactly I could harness this to attack the function. Is there any method? If I knew any functions that converted to 5-chars to start with then I might be able to brute force for a salt or something.
I know for example that:
a=06a07
b=bfbb5
c=63447
(in case you have something in mind)
In normal use it converts random 32-char strings into 5-char strings.
The only way to derive a hash function from data is through brute force, perhaps combined with some cleverness. There are an infinite number of hash functions, and the good ones perform what is essentially one-way encryption, so it's a question of trial and error.
It's practically irrelevant that your function converts 32-character strings into 5-character hashes; the output is probably truncated. For fun, here are some perfectly legitimate examples, the last 3 of which are cryptographically terrible:
Use the MD5 hashing algorithm, which generates a 16-byte hash (32 hex characters), and use the 10th through the 14th characters.
Use the SHA-1 algorithm and take the last 5 characters.
If the input string is alphabetic, use the simple substitution A=1, B=2, C=3, ... and take the first 5 digits.
Find each character on your keyboard, measure its distance from the left edge in millimeters, and use every other digit, in reverse order, starting with the last one.
Create a stackoverflow user whose name is the 32-character string, divide 113 by the corresponding user ID number, and take the first 5 digits after the decimal. (But don't tell 'em I told you to do it!)
Depending on what you need this for, if you have access to as many keys and results as you wish, you might want to try a rainbow table approach. 5 hex chars is only about 1 million combinations (16^5 = 1,048,576). You should be able to brute-force generate a map of strings that match all of the resulting hashes in no time. Then you don't need to know the original string, just an equivalent string that generates the same hash, or you can brute-force entry by iterating over the roughly 1 million input strings.
Following on from a comment I just made to Pontus Gagge, suppose the hash algorithm is as follows:
Append some long, constant string to the input
Compute the SHA-256 hash of the result
Output the last 5 chars of the hash.
Then I'm pretty sure there's no computationally feasible way from your chosen-plaintext attack to figure out what the hashing function is. To even prove that SHA-256 is in use (assuming it's a good hash function, which as far as we currently know it is), I think you'd need to know the long string, which is only stored inside the "black box".
That said, if I knew any published 20-bit hash functions, then I'd be checking those first. But I don't know any: all the usual non-crypto string hashing functions are 32 bit, because that's the expected size of an integer type. You should perhaps compare your results to those of CRC, PJW, and BUZ hash on the same strings, as well as some variants of DJB hash with different primes, and any string hash functions built in to well-known programming languages, like java.lang.String.hashCode. It could be that the 5 output chars are selected from the 8 hex chars generated by one of those.
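A sketch of that kind of check: compute a few well-known digests for each known input and look for a 5-character hex window that matches every known output. The candidate list here (MD5, SHA-1, SHA-256, CRC-32 over the UTF-8 bytes, no salt) is just an assumption about what might be worth trying first, and the function names are made up.

```python
import hashlib
import zlib


def candidate_digests(text):
    """Hex digests of a few common algorithms (an assumed shortlist)."""
    data = text.encode("utf-8")
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
        "crc32": format(zlib.crc32(data), "08x"),
    }


def matching_windows(samples, width=5):
    """Return (algorithm, offset) pairs whose hex window fits every sample."""
    matches = None
    for text, expected in samples.items():
        found = set()
        for name, digest in candidate_digests(text).items():
            for offset in range(len(digest) - width + 1):
                if digest[offset:offset + width] == expected.lower():
                    found.add((name, offset))
        # Keep only the candidates that also matched all earlier samples.
        matches = found if matches is None else matches & found
    return sorted(matches or [])
```

It would be called with the known pairs, e.g. matching_windows({"a": "06a07", "b": "bfbb5", "c": "63447"}). An empty result only rules out these particular unsalted candidates; it says nothing about salted or custom functions.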
Beyond that (and any other well-known string hashes you can find), I'm out of ideas. To cryptanalyse a black box hash, you start by looking for correlations between the bits of the input and the bits of the output. This gives you clues what functions might be involved in the hash. But that's a huge subject and not one I'm familiar with.
This sounds mildly illicit.
Not to rain on your parade or anything, but if the implementors have done their work right, you wouldn't notice lags beyond a few tens of milliseconds on modern CPUs even with strong cryptographic hashes, and knowing the algorithm won't help you if they have used salt correctly. If you don't have access to the code or binaries, your only hope is a trivial mistake, whether caused by technical limitations or carelessness.
There is an uncountable infinity of potential (hash) functions for any given set of inputs and outputs, and if you have no clue better than an upper bound on their computational complexity (from the lag you detect), you have a very long search ahead of you...

Symmetric Bijective String Algorithm?

I'm looking for an algorithm that can do a one-to-one mapping of a string onto another string.
I want an algorithm that given an alphabet I can perform a symmetric mapping function.
For example:
Let's consider that I have the alphabet "A","B","C","D","E","F". I want something like F("ABC") = "CEA" and F("CEA") = "ABC" for every N letter permutation.
Surely, an algorithm like this exists. If you know of an algorithm, please post the name of it and I can research it. If I haven't been clear enough in my request, please let me know.
Thanks in advance.
Edit 1:
I should clarify that I want enough entropy so that F("ABC") would equal "CEA" and F("CEA") = "ABC" but then I do NOT want F("ABD") to equal "CEF". Notice how two input letters stayed the same and the two corresponding output letters stayed the same?
So a Caesar Cipher/ROT13 or shuffling the array would not be sufficient. However, I don't need any "real" security. Just enough entropy for the output of the function to appear random. Weak encryption algorithms welcome.
Just create an array of objects that contain 2 fields -- a letter, and a random number. Sort the array by the random numbers. This creates a mapping where the i-th letter of the alphabet now maps to the i-th letter in the array.
If simple transposition or substitution isn't quite enough, it sounds like you want to advance to a polyalphabetic cipher. The Vigenère cipher is extremely easy to implement in code, but is still difficult to break without using a computer.
I suggest the following.
Perform a dense coding of the input to non-negative integers - with an alphabet size of n and string length of m you can code the string into integers between zero and n^m - 1. In your example this would be the range [0,215]. Now perform a fixed involution on the encoded number and decode it again.
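A sketch of this encode, involute, decode idea. The involution is built by shuffling the range with a fixed seed and swapping neighbouring pairs; that particular construction, the seed, and all function names are assumptions for illustration. It keeps a table of n^m entries, so it only suits small alphabets and short strings, and it offers no real security.

```python
import random


def make_involution(size, seed=0):
    """Pair up 0..size-1 at random; table[table[x]] == x for every x."""
    idx = list(range(size))
    random.Random(seed).shuffle(idx)
    table = list(range(size))
    for a, b in zip(idx[0::2], idx[1::2]):
        table[a], table[b] = b, a
    return table


def encode(s, alphabet):
    """Read s as a number written in base len(alphabet)."""
    n = 0
    for ch in s:
        n = n * len(alphabet) + alphabet.index(ch)
    return n


def decode(n, length, alphabet):
    """Inverse of encode for strings of the given length."""
    out = []
    for _ in range(length):
        n, r = divmod(n, len(alphabet))
        out.append(alphabet[r])
    return "".join(reversed(out))


def symmetric_map(s, alphabet="ABCDEF", seed=0):
    # The table is rebuilt on every call to keep the sketch short.
    table = make_involution(len(alphabet) ** len(s), seed)
    return decode(table[encode(s, alphabet)], len(s), alphabet)
```

Applying symmetric_map twice returns the original string, and because the pairs are chosen at random, inputs that differ in one letter generally map to unrelated outputs.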
Take RC4, settle for some password, and you're done. (Not that this would be very safe.)
Take the set of all permutations of your alphabet, shuffle it, and map the first half of the set onto the second half. Bad for large alphabets, of course. :)
Nah, thought that over, I forgot about character repetitions. Maybe divide the input into chunks without repeating chars and apply my suggestion to all of those chunks.
I would restate your problem thus, and give you a strategy for that restatement:
"A substitution cypher where a change in input leads to a larger change in output".
The blocking of characters is irrelevant-- in the end, it's just mappings between numbers. I'll speak of letters here, but you can extend it to any block of n characters.
One of the easiest routes for this is a rotating substitution based on input. Since you already looked at the Vigenere cipher, it should be easy to understand. Instead of making the key be static, have it be dependent on the previous letter. That is, rotate through substitutions a different amount per each input.
The variable rotation satisfies the condition of making each small change push out to a larger change. Note that the algorithm will only push changes in one direction such that changes towards the end have smaller effects. You could run the algorithm both ways (front-to-back, then back-to-front) so that every letter of cleartext changed has the possibility of changing the entire string.
The internal rotation strategy elides the need for keys, while of course losing most of the cryptographic security. It makes sense in context, though, as you are aiming for entropy rather than security.
You can solve this problem with Format-preserving encryption.
A Java library can be found at https://github.com/EVGStudents/FPE.git. There you can define a regex and encrypt/decrypt string values matching this regex.

Resources