if i convert a file's contents into a single large number and express it as a mathematical expression, does it mean I have compressed the file? - algorithm

assuming the mathematical expression has less number of characters than the original number.
example-
20880467999847912034355032910578 can be expressed as (23^23 +10)
this looks like a good compression method. Will it work for compressing large files?
UPDATE- i didn't mean converting a file into a large binary number. lets say i have a text file and i replace all the characters in it with their ascii values. now i have a large number in the decimal number system. i can express it as a mathematical expression like in the example above.

The notion you're looking for is Kolmogorov complexity - it's a measure of how algorithmically incompressible a number is. See this wiki article for a rigorous definition and examples of such numbers.

If you take the contents of a file as a large binary number, and find an expression which evaluates to that number and can be stored more compactly than the number itself, then yes, you have compressed the file.
Unfortunately, for most files, you'll never find such an expression.
Simple logic (see the link posted by #OliCharlesworth) should convince you that it's impossible to find such an expression for all or even most files. Even for files which might have a suitable expression, finding it will be very, very difficult. If you want to convince yourself of this, try this challenge:
Take the following ASCII string:
"Holy Kolmogorov complexity, Batman! Compress this sucker down good and you'll get a pretty penny, my fine lad!"
Interpreted as a binary number, with the high-order digits coming first, that is: 2280899635869589768629811602006623364651019118009864206881173103187172975244099647369151382436996220022807793898568915685059542016541775658916080587423284053601554008368389985872997499032440860090224967472423163775276043175694884234152335588829534778866153948275745.
Try to find a polynomial which evaluates to that number. All the numbers used must be integral, and the total number of decimal digits appearing in the polynomial must be less than 80. If you succeed, I will send you a small cash prize by PayPal.

Yes, by definition. You have correctly defined compression as representing something larger with something smaller.
How do you propose to do this? How often will that work? There's the rub.

Related

Algorithm for one sign checksum

I am desperate in the search for an algorithm to create a checksum that is a maximum of two characters long and can recognize the confusion of characters in the input sequence. When testing different algorithms, such as Luhn, CRC24 or CRC32, the checksums were always longer than two characters. If I reduce the checksum to two or even one character, then no longer all commutations are recognized.
Does any of you know an algorithm that meets my needs? I already have a name with which I can continue my search. I would be very grateful for your help.
Taking that your data is alphanumeric, you want to detect all the permutations (in the perfect case), and you can afford to use the binary checksum (i.e. full 16 bits), my guess is that you should probably go with CRC-16 (as already suggested by #Paul Hankin in the comments), as it is more information-dense compared to check-digit algorithms like Luhn or Damm, and is more "generic" when it comes to possible types of errors.
Maybe something like CRC-CCITT (CRC-16-CCITT), you can give it a try here, to see how it works for you.

Arbitrary base conversion algorithm for (textually represented) integers

I am looking a general algorithm that would convert from one (arbitrary) numerical base to another (also arbitrary) without storing the result in a large integer and performing arithmetic operations on it in between.
The algorithm I am looking for takes an array of numerical values in a given base (that would mostly be a string of characters) and returns the result alike.
Thank you for help.
I would say it is not possible. For certain bases it would be possible to convert from one string to another, by just streaming the chars through (e.g. if one base is a multiple of the other, like octal->hex), but for arbitrary bases it is not possible without arithmetic operations.
If you would do it with strings/chars in between it would be still big integer arithmetic, but your integers were just in a (unnecessary big) unusual format.
So you have just the choice between: Either reprogram arithmetic operations with char encoded numbers, or do the step and use a big integer library and walk the convert(char(base1->bigInt), convert(bigInt->base2) path.
It's computable, but it's not pretty.
Seriously, it'd probably be easier and faster to include one of the many bignum libraries or write your own.

Repetition-based, pattern-based data compression algorithm

Suppose I have the following string:
ABCADCADCADABC
I want to compress it by finding repeating substrings.
What's an algorithm that gives the optimal compression?
In the above example it should return
AB*1 CAD*3 ABC*1
For comparison, a greedy algorithm might return
ABC*1 ADC*2 AD*1 ABC*1
Depending on whether you prefer fast and simple or high compression ratio you could take a look into the Lempel-Ziv-Welch (LZW) or Lempel-Ziv-Markov chain (LZMA) algorithms. They both keep dictionaries of recurring strings.
This sounds like a job for suffix arrays/trees!
http://en.wikipedia.org/wiki/Suffix_array
You can use a suffix array built over your string to figure out patterns that repeat. For instance, we can build a suffix array over your example as follows (I'm using $ as always coming after every letter, you can sort it so that $ comes before every letter ... either way will work):
ABCADCADCADABC$
ABC$
ADABC$
ADCADABC$
ADCADCADABC$
BCADCADCADABC$
BC$
CADABC$
CADCADABC$
CADCADCADABC$
C$
DABC$
DCADABC$
DCADCADABC$
$
From this, we can more easily see the common patterns in the string. Using the information in this suffix array representation, we can see that CAD is repeated 3x in a local area, and we'd likely use this as our choice for compression. ADC and DCA and so on are not as attractive because they compress less of the string.
http://en.wikipedia.org/wiki/Suffix_tree
Suffix trees are more efficient ways of doing the same task. Once you wrap your head around how to do something using suffix arrays, it's not too far of a jump to go onto suffix trees. In fact, this is used in popular compression algorithms including LZW 1 and BWT (Bzip) 2.
It may not be practically relevant, but for the particular question you ask there is a dynamic programming solution. If you have computed the optimum way to compress the strings of length 1, 2, 3...n-1 starting from the first character, then you can compute the optimum way to compress the string of length n starting from the first character by looking at the last k characters for each possibility k and seeing if they form a multiple of a simple string. If so, compute the cost of compressing the first n-k characters and then expressing the last k characters using a multiple of a string.
So in your example you would finish up by noticing that ABC was a multiple of itself, and that if you expressed this as ABC*1 you could use the answer you had already worked out for the first 11 characters of AB CAD*3 to produce AB*1 CAD*3 ABC*1
Better still would be:
ABCAD(6,3)(3,11)
where (n,d) is a length and distance back of a match. So (6,3) copies six bytes starting from three bytes back. While that may sound a little odd, by the time it gets three bytes in, the next three bytes it needs have been copied. So CADCAD is appended. The (3,11) causes ABC to be appended.
This is called LZ77 compression. It is what is implemented by zip, gzip, and zlib using the deflate compressed data format. That format not only references previous string matches, but also uses Huffman compression on the literals (e.g. ABCAD) as well as the lengths and distances.

Algorithm to Map Strings to Short Replacements

I'm looking at ways to deterministically replace unique strings with unique and optimally short replacements. So I have a finite set of strings, and the best compression I could achieve so far is through an enumeration algorithm, where I order the input set and then replace the strings with an enumeration of char strings over an extended alphabet (a..z, A...Z, aa...zz, aA... zZ, a0...z9, Aa..., aaa...zaa, aaA...zaaA, ....).
This works wonderfully as far as compression is concerned, but has the severe drawback that it is not atomic on any given input string. Rather, its result depends on knowing all input strings right from the start, and on the ordering of the input set.
Anybody knows of an algorithm that has similar compression but doesn't require knowing all input strings upfront?! Hashing for example would not work for me, as depending on the size of the input set I'd need a hash length of 8-12 for the hashes to be unique, and that would be too long as replacements (currently, the replacement strings are 1-3 chars long for my use cases (<10,000 input strings)). Also, if theoreticians among us know this is wasted effort, I would be interested to hear :-) .
You could use your enumeration scheme, but sorted by the order in which you first encounter the input strings.
For example, the first string you ever process can be mapped to "a".
The next distinct string would be mapped to "b", etc.
Every time you process a string, you'd need to look it up to see if it has already been mapped.
"Optimally short" depends on the population of strings from which your samples are drawn. In the absence of systematic redundancy in the population, you will find that only a fraction of arbitrary strings can be compressed at all (e.g., consider trying to compress random bit strings).
If you can make assumptions about your data, such as "the strings are expected to be mainly composed of English words" then you can do something simple and effective based on letter frequency (e.g., for English, the relative frequency order is something like ETAOINSHRDLUGCY..., so you would want to use fewer bits to represent Es and more bits to represent uncommon letters like Q).
Cheers.

Symmetric Bijective String Algorithm?

I'm looking for an algorithm that can do a one-to-one mapping of a string onto another string.
I want an algorithm that given an alphabet I can perform a symmetric mapping function.
For example:
Let's consider that I have the alphabet "A","B","C","D","E","F". I want something like F("ABC") = "CEA" and F("CEA") = "ABC" for every N letter permutation.
Surely, an algorithm like this exists. If you know of an algorithm, please post the name of it and I can research it. If I haven't been clear enough in my request, please let me know.
Thanks in advance.
Edit 1:
I should clarify that I want enough entropy so that F("ABC") would equal "CEA" and F("CEA") = "ABC" but then I do NOT want F("ABD") to equal "CEF". Notice how two input letters stayed the same and the two corresponding output letters stayed the same?
So a Caesar Cipher/ROT13 or shuffling the array would not be sufficient. However, I don't need any "real" security. Just enough entropy for the output of the function to appear random. Weak encryption algorithms welcome.
Just create an array of objects that contain 2 fields -- a letter, and a random number. Sort the array. By the random numbers. This creates a mapping where the i-th letter of the alphabet now maps to the i-th letter in the array.
If simple transposition or substitution isn't quite enough, it sounds like you want to advance to a polyalphabetic cipher. The Vigenère cipher is extremely easy to implement in code, but is still difficult to break without using a computer.
I suggest the following.
Perform a dense coding of the input to positive integers - with an alphabet size of n and string length of m you can code the string into integers between zero and n^m - 1. In your example this would be the range [0,215]. Now perform a fixed involution on the encoded number and decode it again.
Take RC4, settle for some password, and you're done. (Not that this would be very safe.)
Take the set of all permutations of your alphabet, shuffle it, and map the first half of the set onto the second half. Bad for large alphabets, of course. :)
Nah, thought that over, I forgot about character repetitions. Maybe divide the input into chunks without repeating chars and apply my suggestion to all of those chunks.
I would restate your problem thus, and give you a strategy for that restatement:
"A substitution cypher where a change in input leads to a larger change in output".
The blocking of characters is irrelevant-- in the end, it's just mappings between numbers. I'll speak of letters here, but you can extend it to any block of n characters.
One of the easiest routes for this is a rotating substitution based on input. Since you already looked at the Vigenere cipher, it should be easy to understand. Instead of making the key be static, have it be dependent on the previous letter. That is, rotate through substitutions a different amount per each input.
The variable rotation satisfies the condition of making each small change push out to a larger change. Note that the algorithm will only push changes in one direction such that changes towards the end have smaller effects. You could run the algorithm both ways (front-to-back, then back-to-front) so that every letter of cleartext changed has the possibility of changing the entire string.
The internal rotation strategy elides the need for keys, while of course losing of most of the cryptographic security. It makes sense in context, though, as you are aiming for entropy rather than security.
You can solve this problem with Format-preserving encryption.
One Java-Library can be found under https://github.com/EVGStudents/FPE.git. There you can define a Regex and encrypt/decrypt string values matching this regex.

Resources