What's the name of this algorithm/routine? - algorithm
I am writing a utility class which converts strings from one alphabet to another. This is useful in situations where you have a target alphabet you wish to use but a restriction on the number of characters available. For example, if you can use lower-case letters and numbers, but only 12 characters, it's possible to compress a timestamp from the alphabet "0123456789 -:" into "abcdefghijklmnopqrstuvwxyz0123456789", so 2010-10-29 13:14:00 might become 5hhyo9v8mk6avy (19 characters reduced to 14).
The class is designed to convert back and forth between alphabets, and also calculate the longest source string that can safely be stored in a target alphabet given a particular number of characters.
I was thinking of publishing this through Google Code; however, I'd obviously like other people to find and use it - hence the question of what this is called. I've had to use this approach in two separate projects, with Bloomberg and a proprietary system, where you need to generate unique file names of a certain length but want to keep some plaintext, so GUIDs aren't appropriate.
Your examples bear some similarity to a Dictionary coder with fixed source and target dictionaries. Also worth looking at is Fibonacci coding, which uses a fixed target dictionary of variable-length bit codes.
I think it also depends on whether it is important that your target alphabet has fixed-width entries - if you allow a fixed alphabet with variable-length codes, your compression ratio will approach the source entropy that much more closely! If the source alphabet's distribution is known in advance, a static Huffman tree can easily be generated.
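For example, a static Huffman code for a known distribution can be built in a few lines of Python. This is a minimal sketch (the symbol frequencies are made-up illustration values):

    import heapq

    def huffman_code(freqs):
        # freqs: dict of symbol -> frequency, assumed known in advance
        heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freqs.items())]
        heapq.heapify(heap)
        tiebreak = len(heap)
        while len(heap) > 1:
            f1, _, c1 = heapq.heappop(heap)   # two least frequent subtrees
            f2, _, c2 = heapq.heappop(heap)
            merged = {s: "0" + code for s, code in c1.items()}
            merged.update({s: "1" + code for s, code in c2.items()})
            heapq.heappush(heap, (f1 + f2, tiebreak, merged))
            tiebreak += 1
        return heap[0][2]

    print(huffman_code({"a": 50, "b": 25, "c": 15, "d": 10}))
    # e.g. {'a': '0', 'b': '10', 'd': '110', 'c': '111'}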
Here is a simple algorithm:
Consider that you don't have to transmit the alphabet used for encoding. Also, unlike standard compression schemes, you don't use (or transmit) the probabilities of the input symbols, so we are simply re-encoding the data.
In this case we can treat the input data as a number represented in a base equal to the cardinality of the input alphabet. We just have to change its representation to another base, which is a simple task.
EDITED example:
input alphabet: ABC, output alphabet: 0123456789
the message ABAC translates to 0102 in base 3, which is 11 (9 + 2) in base 10.
11 written in the output alphabet (base 10) is simply 11.
We could have a problem decoding it, because we don't know how many zeroes to put at the beginning of the decoded result, so we have to use one of these modifications:
1) encode the size of the data somewhere in the stream;
2) use a dummy 1 at the start of the stream. With the second option our example becomes:
10102 (base 3) = 81 + 9 + 2 = 92 (base 10).
Now, after decoding, we just ignore the leading 1 (this also provides basic error detection).
The main problem of this approach is that in most cases (whenever the sizes of the two alphabets have GCD == 1) each new input character completely changes the entire output. This is very inefficient and difficult to implement, and we end up with arithmetic coding as the best solution (this scheme is actually a simplified version of it).
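Here is the scheme above sketched in Python (the function names are mine): recode() treats the message as a base-len(src) number with the dummy leading 1, and decode() reverses it:

    def recode(msg, src, dst):
        # interpret msg as a number in base len(src), with a dummy leading
        # "1 digit" so that leading zero-symbols survive the round trip
        n = 1
        for ch in msg:
            n = n * len(src) + src.index(ch)
        out = ""
        while n > 0:
            n, r = divmod(n, len(dst))
            out = dst[r] + out
        return out

    def decode(msg, src, dst):
        # inverse: rebuild the number, then stop at the dummy leading 1
        n = 0
        for ch in msg:
            n = n * len(dst) + dst.index(ch)
        out = ""
        while n > 1:
            n, r = divmod(n, len(src))
            out = src[r] + out
        return out

    print(recode("ABAC", "ABC", "0123456789"))   # 92
    print(decode("92", "ABC", "0123456789"))     # ABAC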
You probably know about Base64, which does the same thing, just usually the other way around. Too bad there are way too many Google results for BaseX or BaseN...
Related
How to build unique file names based on their content?
I plan to build unique file names based on file content, for example from a SHA-256 hash: files with the same content must get the same name. The easiest way is to convert the hash to a hex string, but then a file name will be 32 bytes * 2 = 64 characters, which is a pretty long name to operate with. How can I make it shorter? I implemented a sort of "Base32" coding - a vocabulary string that includes digits and 22 letters - using only five bits of every byte to build a file name of 32 characters. Much better. I am looking for a balance between file-name length and a low collision probability. If the number of files is expected to be less than 500K, how long should the file name be? 8? 16? 24? 32 characters? Is there any recommended method for building short unique file names at all?
If you use an N-bit cryptographic hash on M files, the probability of at least one collision is approximately M^2 / 2^(N+1). For 500K files, that's about 1/2^(N-37). With base32 (5 bits per character), 16 chars gives N = 80 and a collision probability of about 1/2^43 - a few trillion to one odds. If that won't do, then 24 chars gives 1/2^83. If you're willing to check them all and re-generate on collision, then 8 chars is fine.
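A quick sanity check of those numbers, plus a possible name-building helper (short_name is my own name for it), assuming the M^2 / 2^(N+1) birthday approximation:

    import hashlib, base64, math

    M = 500_000                      # expected number of files
    for chars in (8, 16, 24, 32):
        bits = chars * 5             # base32 carries 5 bits per character
        p = M * M / 2 ** (bits + 1)  # birthday approximation M^2 / 2^(N+1)
        print("%2d chars -> %3d bits, p(collision) ~ 2^%.0f" % (chars, bits, math.log2(p)))

    # building the name itself: first `chars` characters of a base32-encoded SHA-256
    def short_name(data, chars=16):
        digest = hashlib.sha256(data).digest()
        return base64.b32encode(digest).decode("ascii").lower()[:chars]

    print(short_name(b"example file contents"))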
The number of collisions depends on the content of the files, the hash algorithm, and the length of the hash. In general, the longer the hash value, the less likely collisions are (provided your content does not specially provoke them). You cannot avoid the possibility of collisions altogether unless you use the content itself as the file name (or a lossless compression of it). To shorten the file names you could allow more distinct characters in the name (but be aware of which characters your OS allows and which you are willing to use). I would go for a kind of base32 encoding to avoid problems with file systems that do not distinguish between upper- and lower-case characters.
Bash string compression
I'd like to know how I can compress a string into fewer characters using a shell script. The goal is to take a Mac's serial number and MAC address, then compress those values into a 14-character string. I'm not sure if this is possible, but I'd like to hear if anyone has any suggestions. Thank you.
Your question is way too vague to result in a detailed answer. Given your restriction of a 14-character output, you won't be able to use "real" compression (like zip) due to the overhead. This leaves you with simple algorithms, like RLE or bit concatenation. If by "string" you mean "printable string", i.e. only about 62 or so values are usable per character (depending on the exact printable set you choose), then you have an additional space constraint.

A handy trick you could use with the MAC address part: since it belongs to an Apple device, you already know that the first three values (AA:BB:CC) are one of only 297 combinations, so you could pack 6 characters' (plus 2 colons') worth of information into 2 or so characters (depending on your output character set, see above). The remaining three MAC address values are base-16 (0-9, A-F), so that information can be "compressed" slightly as well. A similar analysis can be done for the Mac serial number (which values can it take? how much space can be saved?).

The effort to do this in bash would be disproportionate, though. I'd highly recommend a C (or other programming language) approach.
Cheating answer: get someone at Apple to give you access to the database I'm assuming they have which matches devices' serial numbers to MAC addresses. Then you can just store the MAC address and look it up in the database whenever you need the serial number. The 64-bit MAC address can easily be stored in 12 characters with standard base64 encoding.

Frustrating answer: you have to make some unreliable assumptions just to make this approachable. You can fix the assumptions later, but I don't know if it would still fit in 14 characters. Personally, I have no idea why you want to save space by reprocessing the serial and MAC numbers, but here's how I'd start.

Simplifying assumptions:
1) Apple will never use MAC address prefixes beyond the 297 combinations mentioned in Sir Athos' answer.
2) The "new" Mac serial number format in this article from 2010 is the only format Apple has used or ever will use.

Core concepts of encoding: you're taking something which could have n possible values and converting it into something else with n possible values. There may be gaps in the original's possible values, such as if Apple cancels building a manufacturing plant after already assigning it a location code. There may be gaps in your encoded form's possible values, perhaps in anticipation of Apple doing things that would fill the gaps.

Abstract integer encoding:
1) Break apart the serial number into groups as "PPP Y W SSS CCCC" (like the article describes).
2) Make groups for the first 3 bytes and last 5 bytes of the MAC address.
3) Translate each group into a number from 0 to n-1, where n is the number of possible values for the group. As far as I can tell from the article, the values are n_P=36^3, n_Y=20, n_W=27, n_S=3^3 and n_C=36^4. The first 3 MAC bytes have 297 possible values and the last 5 have 2^(8*5)=2^40 values.
4) Set a variable, i, to the value of the first group's number.
5) For each remaining group's number, multiply i by the number of values possible for that group, and then add the group's number to i.

Base n encoding:
1) Make a list of n characters that you want to use in your final output.
2) Print the character in your list at index i % n.
3) Subtract that remainder from i and divide by n.
4) Repeat steps 2 and 3 until i becomes 0.

Result: this gives a total of 36^3 * 20 * 27 * 36 * 7 * 297 * 2^40 ~= 2 * 10^24 combinations. If you let n = 64 for a custom base64 encoding (without any padding characters), then you can just barely fit that into ceiling(log(2 * 10^24) / log(64)) = 14 characters. If you use all 95 printable ASCII characters, then you can fit it into ceiling(log(2 * 10^24) / log(95)) = 13 characters.

Fixing the assumptions: if you're trying to build something that uses this and are determined to make it work, here's what you need to do to make it solid, along with some tips.
1) Do the same analysis on every other serial number format you may care about. You might also want to see if there's any redundant information between the serial and MAC numbers.
2) Figure out a way to distinguish between serial number formats; adding an extra value at the end of the abstract integer encoding lets you track which version it uses.
3) Think long and carefully about the format you're making - it's a lot easier to make changes before you're stuck with backwards compatibility.
4) If you can, use a language that's well suited to mapping between values, doing a lot of arithmetic, and handling big numbers. You may be able to do it in Bash, but it'd probably be easier in, say, Python.
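A sketch of the abstract-integer and base-n steps in Python; the group values below are invented placeholders, and pack/to_base_n are my own names, so treat this as an illustration of the arithmetic rather than a validated encoder:

    BASE36 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    OUT64 = BASE36 + BASE36[10:].lower() + "+-"   # a 64-symbol output alphabet

    def pack(groups):
        # groups: list of (value, radix) pairs, folded into one integer
        i = 0
        for value, radix in groups:
            assert 0 <= value < radix
            i = i * radix + value
        return i

    def to_base_n(i, alphabet):
        out = alphabet[0] if i == 0 else ""
        while i > 0:
            i, r = divmod(i, len(alphabet))
            out = alphabet[r] + out
        return out

    # invented example values: plant, year, week, unit, model,
    # MAC prefix index (0..296) and 40-bit MAC suffix
    groups = [(1234, 36**3), (3, 20), (12, 27), (5, 3**3),
              (9999, 36**4), (42, 297), (0xDEADBEEF42, 2**40)]
    print(to_base_n(pack(groups), OUT64))   # a short string over the 64-symbol alphabet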
Encode an array of integers to a short string
Problem: I want to compress an array of non-negative integers of non-fixed length (but it should be 300 to 400 elements), containing mostly 0s, some 1s, a few 2s. Although unlikely, it is also possible to have bigger numbers. For example, here is an array of 360 elements:

0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,
0,0,4,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,5,2,0,0,0,
0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,1,2,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.

Goal: the goal is to compress an array like this into the shortest possible encoding using letters and numbers. Ideally, something like: sd58x7y

What I've tried: I tried "delta encoding", using zero-length gaps to denote any value higher than 1. For example, {0,0,1,0,0,0,2,0,1} would be denoted as: 2,3,0,1. To decode it, one would read from left to right and write down "2 zeroes, a one, 3 zeroes, a one, 0 zeroes, a one (this adds to the previous one, giving a two), 1 zero, a one". To eliminate the need for delimiters (commas) and thus save more space, I tried to use one alphanumeric character to denote delta values of 0 to 34 (using 0 to y), while leaving the letter z to mean "35 plus the value of the next character" - I think this is called variable-length encoding or something like that. For example, if there are 40 zeroes in a row, I'd encode it as z5. That's as far as I got... the resulting string is still very long (about 20 characters for the example above). I would ideally want something like 8 characters or even shorter. Thanks for your time; any help or inspiration would be greatly appreciated!
Since your example contains long runs of zeroes, your first step (which it appears you have already taken) could be to use run-length encoding (RLE) to compress them. The output from this step would be a list of integers, starting with a run-length count of zeroes and then alternating between that and the non-zero values (a zero-run-length of 0 indicates successive non-zero values).

Second, you can encode your integers in a small number of bits, using a class of methods called universal codes. These methods generally compress small integers using fewer bits than larger integers, and also provide the ability to encode integers of any size (which is pretty spiffy...). You can tune the encoding to improve compression based on the exact distribution you expect.

You may also want to look into how JPEG-style encoding works: after DCT and quantization, the JPEG entropy-encoding problem seems similar to yours. Finally, if you want maximum compression, you might want to look up arithmetic coding, which can compress your data arbitrarily close to the statistical minimum entropy.

The above links explain how to compress to a stream of raw bits. To convert those to a string of letters and numbers, you will need to add another encoding step. As one commenter points out, you may want to look into base64 representation; or (for maximum efficiency with whatever alphabet is available) you could try using arithmetic compression "in reverse".

Additional notes on compression in general: the "shortest possible encoding" depends greatly on the exact properties of your data source. Effectively, any given compression technique describes a statistical model of the kind of data it compresses best. Also, once you set up an encoding based on the kind of data you expect, if you try to use it on data unlike that, the result may be an expansion rather than a compression. You can limit this expansion by providing an alternative, uncompressed format to be used in such cases...
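A rough sketch of the first two steps (RLE plus Elias gamma, my choice of universal code for illustration):

    def rle_zeros(values):
        # alternate between a count of zeroes and the next non-zero value
        out, run = [], 0
        for v in values:
            if v == 0:
                run += 1
            else:
                out.extend([run, v])
                run = 0
        out.append(run)             # trailing run of zeroes
        return out

    def elias_gamma(n):
        # universal code for non-negative integers (shifted by 1 so 0 is codable)
        b = bin(n + 1)[2:]
        return "0" * (len(b) - 1) + b

    data = [0, 0, 1, 0, 0, 0, 2, 0, 1]
    bits = "".join(elias_gamma(n) for n in rle_zeros(data))
    print(bits, "-", len(bits), "bits")
    # a final step would repack these bits into base-62/base-64 characters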
In your data you have: 339 0s (94.17% of the data), 14 1s (3.89%), 4 2s (1.11%), and one each of 3, 4 and 5 (0.28% each). Assuming that your numbers are independent of each other and that you have no other information about them, the total entropy of your data is 0.407 bits per number, that is about 146.42 bits overall (18.3 bytes). So it is impossible to encode it in 8 bytes.
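That bound can be reproduced directly:

    from collections import Counter
    from math import log2

    def entropy_bits(values):
        # Shannon entropy per symbol, assuming independent symbols
        counts = Counter(values)
        total = len(values)
        return -sum(c / total * log2(c / total) for c in counts.values())

    # the distribution quoted above: 339 zeroes, 14 ones, 4 twos, one each of 3, 4, 5
    data = [0] * 339 + [1] * 14 + [2] * 4 + [3, 4, 5]
    h = entropy_bits(data)
    print(h, h * len(data))   # ~0.407 bits per number, ~146.4 bits total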
Is there a hash function for binary data which produces closer hashes when the data is more similar?
I'm looking for something like a hash function, but one whose outputs are closer the closer two different inputs are. Something like:
f(1010101) = 0    # original hash
f(1010111) = 1    # very close to the original hash, as the inputs differ by one bit
f(0101010) = 9999 # not close to the original hash at all, as every bit differs
(Example outputs for demonstration purposes only.) All of the input data will be of the same length. I want to make comparisons between one file and lots of other files and be able to determine which other file has the fewest differences from it.
You may try the Levenshtein distance: http://en.wikipedia.org/wiki/Levenshtein_distance. Since it works on strings only, you can first convert your binary data to strings, for example: 0 -> "00000000", 1 -> "00000001".
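For reference, a standard dynamic-programming implementation of the distance:

    def levenshtein(a, b):
        # classic dynamic-programming edit distance
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    print(levenshtein("1010101", "1010111"))   # 1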
You might be interested in either simhashing or shingling. If you are only trying to detect similarity between documents, there are other techniques that may suit you better (like TF-IDF). The second link is part of a good book whose other chapters delve into general information-retrieval topics, including these other techniques.
You should not use a hash for this. You must compute signatures containing several characteristic values, like:
file name
file size
is binary / is ASCII only
date (if needed)
and some more complex ones, like:
variance of the byte values
average byte value
average length of runs of identical bits (in compressed files there are no long identical bit sequences)
...
Then you can compare signatures. But the most important thing is to know what kind of data is in these files. If it is images, the size and main color are more important. If it is sound, you could analyse only certain frequencies...
You might want to look at the source code of Unix utilities like cmp, or at the filecmp module in Python, and use those to try to work out a reasonable algorithm. In my uninformed opinion, calculating a hash is not likely to work well. First, it can be expensive to calculate a hash. Second, what you're trying to do sounds more like a job for encoding than for hashing; once you start thinking of it that way, it's not clear that it's even worth transforming the file at all. If you have some constraints, specifying them might be useful. For example, if all the files are the exact same length, that may simplify things. Or perhaps you are only interested in differences between bits in the same position, and not in things that are similar only if you compare bits in different positions (e.g., two files are identical except that one has everything shifted by three bits - should those be considered similar or not?).
You could calculate the population count of the XOR of the two files, which is exactly the number of bits that differ between them. So it does precisely what you asked for, with no approximations.
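In Python that is only a few lines (this reads both files into memory, so it suits modest sizes; bit_difference is my own name):

    def bit_difference(path_a, path_b):
        # number of differing bits between two equal-length files
        with open(path_a, "rb") as f:
            a = f.read()
        with open(path_b, "rb") as f:
            b = f.read()
        assert len(a) == len(b), "inputs must be the same length"
        x = int.from_bytes(a, "big") ^ int.from_bytes(b, "big")
        return bin(x).count("1")   # population count of the XOR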
You can represent your data as a binary vector of features and then use dimensionality reduction either with SVD or with random indexing.
What you're looking for is a file fingerprint of sorts. For plain text, something like Nilsimsa (http://ixazon.dynip.com/~cmeclax/nilsimsa.html) works reasonably well. There are a variety of names for this type of technique: fuzzy hashing, locality-sensitive hashing, distance-based hashing, dimensionality reduction, and a few others. Tools can generate a fixed-length or variable-length output, but the outputs are generally comparable (e.g. by Levenshtein distance) and similar inputs yield similar outputs. The Nilsimsa link above gives two similar spam messages; here are the example outputs:
773e2df0a02a319ec34a0b71d54029111da90838cbc20ecd3d2d4e18c25a3025 spam1
47182cf0802a11dec24a3b75d5042d310ca90838c9d20ecc3d610e98560a3645 spam2
* * ** *** * ** ** ** ** * ******* **** ** * * *
Spamsum and sdhash are more useful for arbitrary binary data. There are also algorithms specifically for images that will work regardless of whether the image is a JPG or a PNG; identical images in different formats wouldn't be matched by, e.g., spamsum.
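To make the locality-sensitive idea concrete, here is a toy simhash over byte n-grams - not Nilsimsa's actual algorithm, just a member of the same family:

    import hashlib

    def simhash(data, nbits=64, ngram=4):
        # toy locality-sensitive hash: similar inputs share most output bits
        weights = [0] * nbits
        for i in range(len(data) - ngram + 1):
            h = int.from_bytes(hashlib.md5(data[i:i + ngram]).digest()[:8], "big")
            for bit in range(nbits):
                weights[bit] += 1 if (h >> bit) & 1 else -1
        return sum(1 << b for b in range(nbits) if weights[b] > 0)

    a = simhash(b"the quick brown fox jumps over the lazy dog")
    b = simhash(b"the quick brown fox jumped over the lazy dog")
    print(bin(a ^ b).count("1"), "bits differ out of 64")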
algorithm to convert pathname to unique number
I want to convert a Windows pathname to a unique integer. E.g., for the pathname C:\temp\a.out, if I add up the ASCII values of all the characters, I get 1234 - but some other path could also generate the same number. So, what is the best way to generate unique numbers for different pathnames?
Look into hash functions. Make sure to consider the case-insensitive nature of most Windows filenames when performing the hash. Most likely, the language you are using provides a library function (or collection of functions) to take the hash of a string (or of raw data). SHA-1 is popular and has low collision rates. Here on Stack Overflow there are many questions pertaining to hash functions; to get you started, simply search for "hash function". This may be a useful question for your case: What is a performant string hashing function that results in a 32 bit integer with low collision rates?
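For example (path_to_number is a hypothetical helper; truncating the digest to 64 bits raises the collision odds accordingly):

    import hashlib

    def path_to_number(path, bits=64):
        # case-fold first, since Windows paths are case-insensitive
        digest = hashlib.sha1(path.lower().encode("utf-8")).digest()
        return int.from_bytes(digest, "big") >> (160 - bits)

    print(path_to_number(r"C:\temp\a.out"))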
There are more possible pathnames than integers, therefore you can't have true uniqueness. You could settle for something like an MD5 hash.
Perfect hashing
Yes, you'll need to use some kind of hash function, simply because the domain of your input is greater than the range of your output. In other words, there are almost certainly more valid pathnames than there are numbers representable in your target language's data type. So it will not be possible to completely avoid collisions. If this guarantee is essential to your application, you won't be able to do it by translation to integers.
How about something like this: use a hash (string -> n bits) for each directory level. Allotting 20 bits for each of 10 directory levels is clearly not going to scale, but maybe a telescoping allocation of bits would, under the assumption that the lowest directory level will be the most populated - e.g. if you have (from the root) /A/B/C/D/E/F, output an n-bit number where:
bits n/2 to n hash F
bits n/4 to n/2 hash E
bits n/8 to n/4 hash D
etc., etc.
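A sketch of that idea (telescoping_hash is my own name, and it assumes '/'-separated paths):

    import hashlib

    def telescoping_hash(path, nbits=64):
        # deepest component gets half the bits, the next one a quarter, ...
        parts = [p for p in path.split("/") if p]
        result, remaining = 0, nbits
        for part in reversed(parts):
            width = max(remaining // 2, 1)
            h = int.from_bytes(hashlib.md5(part.encode()).digest(), "big")
            result = (result << width) | (h % (1 << width))
            remaining -= width
            if remaining == 0:
                break
        return result

    print(hex(telescoping_hash("/A/B/C/D/E/F")))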
If this is on Unix, you could just grab the file's inode number: ls -i shows it on the command line, and the stat() call lets you retrieve it from a program. Soft links (when followed with stat()) resolve to the same inode, and hard links to the same file share an inode as well, so both show up as the same file; this may or may not be the behavior you want. I see a lot of folks talking about hashing. That could work, but theoretically, if your hash does anything more than compress out byte values that are not allowable in file names, then you could have clashes. If that is unacceptable for you, then your hash is always going to be nearly as many digits as the file name; at that point, you might as well just use the file name.
Jimmy said: "there are more possible pathnames than integers, therefore you can't have true uniqueness. You could settle for something like an MD5 hash."

I don't think there are more possible path names than integers. As a construction that creates a unique number from a pathname, we can convert each character to a two-digit number: a-z become 10-35, "." becomes 36, "/" becomes 37, and so on for other special characters (this assumes there are fewer than 90 distinct characters; otherwise we can move to a three-digit encoding).

home/nlucaroni/documents/cv.pdf
17242214372321301210272423183713241230221423292837123136251315

This forms a bijection between path names and a subset of the integers (if you count all integers, the surjective property fails, since not every number decodes to a valid path, but normally one doesn't care about that holding) - try to come up with a path that doesn't map to an integer. This number obviously doesn't fit in a 64-bit unsigned int (whose maximum is 18446744073709551615), so it's not practical, but that isn't the point of my response.
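That construction in Python, for completeness (CHARS covers only lower-case letters plus "." and "/"; extend it for a fuller character set):

    CHARS = "abcdefghijklmnopqrstuvwxyz./"   # extend with other special characters as needed

    def path_to_int(path):
        # two decimal digits per character, starting at 10, so every
        # code has the same width and decoding is unambiguous
        return int("".join(str(CHARS.index(c) + 10) for c in path))

    def int_to_path(n):
        s = str(n)
        return "".join(CHARS[int(s[i:i + 2]) - 10] for i in range(0, len(s), 2))

    n = path_to_int("home/nlucaroni/documents/cv.pdf")
    print(n)                # the 62-digit number shown above
    print(int_to_path(n))   # round-trips back to the path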
You can read here - Best way to determine if two path reference to same file in C# - how you can uniquely identify a path. You need three numbers (dwVolumeSerialNumber, nFileIndexHigh and nFileIndexLow); maybe you can combine those three numbers into a single new number with three times as many bits. See also: What are your favorite extension methods for C#? (codeplex.com/extensionoverflow)
To all the people saying "it's not possible because you have more possible paths than integers to store them in": no. The poster never specified an implementation language; some languages support arbitrary-length integers. Python, for example. Say we take the 32,000 character paths as the limit mentioned in one of the other comments. If we have 256 different characters to use with paths we get: Python 2.5.1 (r251:54863, May 18 2007, 16:56:43) [GCC 3.4.4 (cygming special, gdc 0.12, using dmd 0.125)] on cygwin Type "help", "copyright", "credits" or "license" for more information. >>> 32000L**256L 20815864389328798163850480654728171077230524494533409610638224700807216119346720596024478883464648369684843227908562015582767132496646929816279813211354641525848259018778440691546366699323167100945918841095379622423387354295096957733925002768876520583464697770622321657076833170056511209332449663781837603694136444406281042053396870977465916057756101739472373801429441421111406337458176000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000L >>> Notice how Python represents that just fine? Yes, there's probably a better way to do it, but that doesn't mean it's impossible. EDIT: rjack pointed out that it's actually 256^32000, not the other way around. Python still handles it just fine. The performance may leave something to be desired, but saying it's mathematically impossible is wrong.