How to elegantly compute the anagram signature of a word in ruby? - ruby

Arising out of this question, I'm looking for an elegant (ruby) way to compute the word signature suggested in this answer.
The idea suggested is to sort the letters in the word, and also run length encode repeated letters. So, for example "mississippi" first becomes "iiiimppssss", and then could be further shortened by encoding as "4impp4s".
I'm relatively new to ruby and though I could hack something together, I'm sure this is a one liner for somebody with more experience of ruby. I'd be interested to see people's approaches and improve my ruby knowledge.
edit: to clarify, performance of computing the signature doesn't much matter for my application. I'm looking to compute the signature so I can store it with each word in a large database of words (450K words), then query for words which have the same signature (i.e. all anagrams of a given word that are actual English words). Hence the focus on space. The 'elegant' part is just to satisfy my curiosity.

The fastest way to create a sorted list of the letters is this:
"mississippi".unpack("c*").sort.pack("c*")
It is quite a bit faster than split('') and join(). For comparison it is also best to pack the array back together into a String, so you don't have to compare arrays.
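If you want to check the speed claim yourself, here is a quick sketch using Ruby's built-in Benchmark module (the word and iteration count are arbitrary choices):
require 'benchmark'

word = "mississippi"
n = 100_000

Benchmark.bm(12) do |x|
  x.report("unpack/pack:") { n.times { word.unpack("c*").sort.pack("c*") } }
  x.report("split/join:")  { n.times { word.split('').sort.join } }
end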

I'm not much of a Ruby person either, but as I noted on the other comment this seems to work for the algorithm described.
s = "mississippi"
s.split('').sort.join.gsub(/(.)\1{2,}/) { |s| s.length.to_s + s[0,1] }
Of course, you'll want to make sure the word is lowercase, doesn't contain numbers, etc.
As requested, I'll try to explain the code. Please forgive me if I don't get all of the Ruby or regex terminology exactly right, but here goes.
I think the split/sort/join part is pretty straightforward. The interesting part for me starts at the call to gsub. This replaces each substring that matches the regular expression with the return value of the block that follows it. The regex matches any single character and captures it; that's the "(.)" part. The match then continues with the backreference "\1", which matches whatever character was just captured. We want that character to appear at least two more times, for a total of at least three occurrences, which is what the quantifier "{2,}" does.
When a match is found, the matching substring is passed to the block as an argument via the "|s|" part. Finally, we take the length of the matching substring as a string, append whichever character makes up that substring (they are all the same), and return the concatenated value. The returned value replaces the original matching substring. The whole process continues until nothing is left to match, since it's a global substitution on the original string.
I apologize if that's confusing. As is often the case, it's easier for me to visualize the solution than to explain it clearly.
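Putting the pieces together into a small helper, with the lowercase/letters-only guard mentioned above (the method name signature is just for illustration):
def signature(word)
  letters = word.downcase.gsub(/[^a-z]/, '')   # keep lowercase letters only
  sorted  = letters.chars.sort.join            # sorted letters
  sorted.gsub(/(.)\1{2,}/) { |run| run.length.to_s + run[0] }   # run-length encode runs of 3+
end

signature("mississippi")   # => "4impp4s"
signature("Mississippi!")  # => "4impp4s"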

I don't see an elegant solution. You could use the split message to get the characters into an array, but then once you've sorted the list I don't see a nice linear-time concatenate primitive to get back to a string. I'm surprised.
Incidentally, run-length encoding is almost certainly a waste of time. I'd have to see some very impressive measurements before I'd think it worth considering. If you avoid run-length encoding, you can anagrammatize any string, not just a string of letters. And if you know you have only letters and are trying to save space, you can pack them 5 bits to a letter.
---Irma Vep
EDIT: the other poster found join which I missed. Nice.
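As a sketch of the 5-bits-per-letter packing suggested above (assuming the signature contains only lowercase a-z, so each letter fits in 5 bits; the helper names are illustrative):
def pack_signature(sorted_letters)
  bits = sorted_letters.chars.map { |c| (c.ord - 'a'.ord).to_s(2).rjust(5, '0') }.join
  [bits].pack("B*")   # pack the bit string into bytes, most significant bit first
end

def unpack_signature(packed, length)
  bits = packed.unpack("B*").first[0, length * 5]
  bits.scan(/.{5}/).map { |b| ('a'.ord + b.to_i(2)).chr }.join
end

sig = "mississippi".chars.sort.join      # "iiiimppssss"
packed = pack_signature(sig)
packed.bytesize                          # => 7 bytes instead of 11
unpack_signature(packed, sig.length)     # => "iiiimppssss"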

Related

Split string into equal slices/chunks

I have a string of length N and I want to split it into equal parts of length L (assuming the last part might be shorter).
What I came up with is:
string.split('').each_slice(L).map(&:join)
but this is toooooo long (and too ugly, to be honest). Am I unable to read the documentation properly, or is there no built-in method to perform this task?
What about this?
string.scan(/.{1,#{L}}/)
The lower bound of 1 matters here: scan with an exact-length pattern such as /.{#{L}}/ would skip the last part of the text whenever it is shorter than L. If you would rather use split, you need to capture the chunk so that it is kept in the result:
string.split(/(.{1,#{L}})/).reject(&:empty?)
And here is the example if you want to keep word boundary:
string.split(/(.{1,#{L}})(?:\s|$)/)
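For comparison, here is a regex-free variant of the questioner's own approach (same idea, just using chars instead of split('')):
string = "abcdefgh"
L = 3
string.chars.each_slice(L).map(&:join)  # => ["abc", "def", "gh"]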

Is there a method to find the most specific pattern for a string?

I'm wondering whether there is a way to generate the most specific regular expression (if such a thing exists) that matches a given string. Here's an illustration of what I want the method to do:
str = "(17 + 31)"
find_pattern(str)
# => /^\(\d+ \+ \d+\)$/ (or something more specific)
My intuition was to use Regexp.new to accumulate the desired pattern by looping through str and checking for known patterns like \d, \s, and so on. I suspect there is an easier way to do this.
This is in essence a compression problem. The simplest way to match a list of known strings is to use the Regexp.union factory method, but that just tries each string in turn; it does not do anything "clever":
combined_rx = Regexp.union( "(17 + 31)", "(17 + 45)" )
=> /\(17\ \+\ 31\)|\(17\ \+\ 45\)/
This can still be useful to construct multi-stage validators, without you needing to write loops to check them all.
However, a generic pattern matcher that could figure out what you mean to match from examples is not really possible. There are too many ways in which you could consider strings to be similar or not. The closest I could think of would be genetic programming where you supply a large list of should match/should not match strings and the code guesses at the best regex by constructing random Regexp objects (a challenge in itself) and seeing how accurately they match and don't match your examples. The best matchers could be combined and mutated and tried again until you got 100% accuracy. This might be a fun project, but ultimately much more effort for most purposes than writing the regular expressions yourself from a description of the problem.
If your problem is heavily constrained - e.g. any example integer could always be replaced by \d+, any example space by \s+ etc, then you could work through the string replacing "matchable units", in fact using the same regular expressions checked in turn. E.g. if you match \A\d+ then consume the match from the string, and add \d+ to your regex. Then take the remainder of the string and look for next matching pattern. Working this way will have its limitations (you must know the full set of patterns you want to match in advance, and all examples would have to be unambiguous). However, it is more tractable than a genetic program.
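Here is a minimal sketch of that constrained approach; the token list and the find_pattern name are illustrative assumptions, not a standard API:
# Known "matchable units", checked in order; anything else is escaped literally.
TOKENS = [
  [/\A\d+/, '\d+'],   # runs of digits
  [/\A\s+/, '\s+'],   # runs of whitespace
  [/\A\w+/, '\w+'],   # runs of word characters
]

def find_pattern(str)
  pattern = ''
  until str.empty?
    regex, replacement = TOKENS.find { |rx, _| str =~ rx }
    if regex
      pattern += replacement
      str = str.sub(regex, '')
    else
      pattern += Regexp.escape(str[0])   # no unit matched: escape the literal character
      str = str[1..-1]
    end
  end
  Regexp.new("\\A#{pattern}\\z")
end

find_pattern("(17 + 31)")  # => /\A\(\d+\s+\+\s+\d+\)\z/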

Is it possible to check if a short sequence of text is random or not?

Is it possible to check if a short sequence of text, e.g. two or three words, is random or not?
My first thought was to calculate the entropy of the string.
H("hello world") = 2.84535
H("sdzfjksher") = 3.12193
but any permutation of the chars in "hello world" will result in the same entropy, yet produce a random-looking string like "llloo ehrdw". Entropy-based methods work great on long strings like text. There you can also count single chars to determine that it's a language, and you can use Zipf's law to check for real languages...
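For reference, the H(...) values above are just the Shannon entropy of the character distribution, which is easy to compute (a quick sketch; tally needs Ruby 2.7+):
def entropy(str)
  counts = str.chars.tally        # character frequencies
  len = str.length.to_f
  counts.values.sum { |c| -(c / len) * Math.log2(c / len) }
end

entropy("hello world")  # => ~2.845
entropy("sdzfjksher")   # => ~3.122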
The next method would be a lookup table of common words, like a normal English dictionary. The problem with this method is that you have to create the list of words first.
For example:
input string      result
------------------------------------------------------
"hello world"     matches 2 words
"helloworld"      random string
"lllooehrdw"      random string
"hello.world"     probably 2 words
"a.be.was"        probably 3 words (but this is probably a strange edge case)
So it's all about finding words here to compare them with your wordlist, which can be really hard.
Another problem with all these methods could be that they only detect certain languages or need to be trained for a certain language. Consider that we only want to use English for now.
So is there any good method to do this, or do I need to accept false positives and false negatives?
You could count the frequency of characters used in the text and compare this with known character distributions for English and/or other languages. This will give an indication of how likely it is that the text is (or resembles) that language.
Sounds like you want to use the frequencies of the letters to see whether a string is a word or random letters.
http://scottbryce.com/cryptograms/stats.htm
Combining statistics and wordlists sounds like the way to go to reduce false positives.
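A sketch of the letter-frequency comparison (the percentages are approximate English letter frequencies, and the scoring is a simple chi-squared-style distance, so lower means "more English-like"):
ENGLISH_FREQ = {
  "a" => 8.2, "b" => 1.5, "c" => 2.8, "d" => 4.3, "e" => 12.7, "f" => 2.2,
  "g" => 2.0, "h" => 6.1, "i" => 7.0, "j" => 0.2, "k" => 0.8, "l" => 4.0,
  "m" => 2.4, "n" => 6.7, "o" => 7.5, "p" => 1.9, "q" => 0.1, "r" => 6.0,
  "s" => 6.3, "t" => 9.1, "u" => 2.8, "v" => 1.0, "w" => 2.4, "x" => 0.2,
  "y" => 2.0, "z" => 0.1
}

def english_distance(str)
  letters = str.downcase.scan(/[a-z]/)
  return Float::INFINITY if letters.empty?
  counts = letters.tally
  ENGLISH_FREQ.sum do |ch, expected|
    observed = 100.0 * counts.fetch(ch, 0) / letters.size
    (observed - expected)**2 / expected
  end
end

english_distance("hello world")  # relatively small
english_distance("sdzfjksher")   # much larger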

Is there any algorithm to judge whether a string is meaningful?

The problem is, I have to scan an executable file and extract the strings for analysis, using strings.exe from Sysinternals. However, how do I distinguish the meaningful strings from the trivial ones? Is there any algorithm or approach to solve this problem (statistics? probability?).
for example:
strings extracted from strings.exe (a part of all the strings found):
S`A
waA
RmA
>rA
5xA
GetModuleHandleA
LocalFree
LoadLibraryA
LocalAlloc
GetCommandLineW
From empirical judgement, the last five strings are meaningful, and the first five are not.
So how can this be solved without using a dictionary such as a blacklist or whitelist?
Simple algorithm: break candidate strings into words at capital letters, whitespace, and digits, then compare the words against some dictionary.
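A sketch of that splitting step (the dictionary lookup is not shown, and the exact split points - capital letters, whitespace, digits, underscores - are an assumption):
def candidate_words(str)
  str.split(/(?=[A-Z])|[\s\d_]+/).reject(&:empty?)
end

candidate_words("GetModuleHandleA")  # => ["Get", "Module", "Handle", "A"]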
Use n-grams.
An n-gram model will tell you the probability that a word is meaningful. Read about Markov chains and n-grams (http://en.wikipedia.org/wiki/N-gram). Treat each letter as a state, and take a set of meaningful and a set of meaningless words. For example:
Meaningless words: B^^#, #AT
Normal words: BOOK, CAT
Create two language models for them (a trigram model will work best): http://en.wikipedia.org/wiki/Language_model
Now you can check which model is more likely to have generated a given word and pick the language model that assigns it the higher probability. This will satisfy your condition.
Remember that you need a set of meaningless words (I think around 1000 will be OK) as well as a set of meaningful ones.
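A minimal sketch of the character-trigram idea; the training lists, the padding characters, and the add-one smoothing constant are all illustrative choices:
class TrigramModel
  def initialize(words)
    @counts  = Hash.new(0)
    @context = Hash.new(0)
    words.each do |w|
      "^^#{w.downcase}$".chars.each_cons(3) do |a, b, c|
        @counts[[a, b, c]] += 1
        @context[[a, b]]   += 1
      end
    end
  end

  # Log-probability of the word under the model, with add-one smoothing
  def score(word)
    "^^#{word.downcase}$".chars.each_cons(3).sum do |a, b, c|
      Math.log((@counts[[a, b, c]] + 1.0) / (@context[[a, b]] + 256.0))
    end
  end
end

meaningful = TrigramModel.new(%w[book cat getmodulehandlea localfree loadlibrarya])
gibberish  = TrigramModel.new(["S`A", "waA", "RmA", ">rA", "5xA"])

word = "LocalAlloc"
puts meaningful.score(word) > gibberish.score(word) ? "meaningful" : "gibberish"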
Is there a definite rule for meaningful words? Or are they simply words from a dictionary?
If they are words from a dictionary, then you can use tries.
You can walk the trie character by character until the next char is capitalized; when it is capitalized, start from the beginning of the trie and look up the next word.
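A rough sketch of that idea, using nested Hashes as trie nodes and restarting the lookup at each capital letter (the word list and helper names are illustrative):
def build_trie(words)
  words.each_with_object({}) do |w, trie|
    node = w.chars.inject(trie) { |n, ch| n[ch] ||= {} }
    node[:end] = true
  end
end

def all_dictionary_words?(str, trie)
  str.split(/(?=[A-Z])/).reject(&:empty?).all? do |segment|
    node = segment.downcase.chars.inject(trie) { |n, ch| n && n[ch] }
    node && node[:end]
  end
end

trie = build_trie(%w[get module handle local free a])
all_dictionary_words?("GetModuleHandleA", trie)  # => true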
Just my 2 cents.
Ivar

Symmetric Bijective String Algorithm?

I'm looking for an algorithm that can do a one-to-one mapping of a string onto another string.
I want an algorithm such that, given an alphabet, I can perform a symmetric mapping function.
For example:
Let's consider that I have the alphabet "A","B","C","D","E","F". I want something like F("ABC") = "CEA" and F("CEA") = "ABC" for every N letter permutation.
Surely, an algorithm like this exists. If you know of an algorithm, please post the name of it and I can research it. If I haven't been clear enough in my request, please let me know.
Thanks in advance.
Edit 1:
I should clarify that I want enough entropy so that F("ABC") would equal "CEA" and F("CEA") = "ABC" but then I do NOT want F("ABD") to equal "CEF". Notice how two input letters stayed the same and the two corresponding output letters stayed the same?
So a Caesar Cipher/ROT13 or shuffling the array would not be sufficient. However, I don't need any "real" security. Just enough entropy for the output of the function to appear random. Weak encryption algorithms welcome.
Just create an array of objects that contain two fields: a letter and a random number. Sort the array by the random numbers. This creates a mapping where the i-th letter of the alphabet now maps to the i-th letter in the array.
If simple transposition or substitution isn't quite enough, it sounds like you want to advance to a polyalphabetic cipher. The Vigenère cipher is extremely easy to implement in code, but is still difficult to break without using a computer.
I suggest the following.
Perform a dense coding of the input into integers: with an alphabet size of n and a string length of m, you can code the string into an integer between zero and n^m - 1. In your example this would be the range [0, 215]. Now perform a fixed involution on the encoded number and decode it again.
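A sketch of that encode / involute / decode pipeline for the "A".."F" alphabet; the involution used here (complementing within the range) is just the simplest self-inverse choice, and any other involution on 0..n^m - 1 could be swapped in:
ALPHABET = %w[A B C D E F]

def encode(str)
  str.chars.inject(0) { |acc, ch| acc * ALPHABET.size + ALPHABET.index(ch) }
end

def decode(num, length)
  chars = []
  length.times do
    num, digit = num.divmod(ALPHABET.size)
    chars.unshift(ALPHABET[digit])
  end
  chars.join
end

def f(str)
  max = ALPHABET.size**str.length - 1    # 215 for length 3
  decode(max - encode(str), str.length)  # complementing is its own inverse
end

f("ABC")     # => "FED"
f(f("ABC"))  # => "ABC"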
Take RC4, settle for some password, and you're done. (Not that this would be very safe.)
Take the set of all permutations of your alphabet, shuffle it, and map the first half of the set onto the second half. Bad for large alphabets, of course. :)
Nah, thought that over, I forgot about character repetitions. Maybe divide the input into chunks without repeating chars and apply my suggestion to all of those chunks.
I would restate your problem thus, and give you a strategy for that restatement:
"A substitution cypher where a change in input leads to a larger change in output".
The blocking of characters is irrelevant-- in the end, it's just mappings between numbers. I'll speak of letters here, but you can extend it to any block of n characters.
One of the easiest routes for this is a rotating substitution based on input. Since you already looked at the Vigenere cipher, it should be easy to understand. Instead of making the key be static, have it be dependent on the previous letter. That is, rotate through substitutions a different amount per each input.
The variable rotation satisfies the condition of making each small change push out to a larger change. Note that the algorithm will only push changes in one direction such that changes towards the end have smaller effects. You could run the algorithm both ways (front-to-back, then back-to-front) so that every letter of cleartext changed has the possibility of changing the entire string.
The internal rotation strategy elides the need for keys, while of course losing most of the cryptographic security. It makes sense in context, though, as you are aiming for entropy rather than security.
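A rough sketch of the rotating-substitution idea, keying each shift off the previous plaintext letter; keying off the previous input rather than the previous output, and using a single forward pass, are simplifying assumptions relative to the scheme described above:
ALPHA = ("A".."Z").to_a

def rotate_encrypt(str)
  prev = 0
  str.chars.map do |ch|
    idx  = ALPHA.index(ch)
    out  = ALPHA[(idx + prev) % 26]   # shift by the previous plaintext letter
    prev = idx
    out
  end.join
end

def rotate_decrypt(str)
  prev = 0
  str.chars.map do |ch|
    idx  = (ALPHA.index(ch) - prev) % 26
    prev = idx
    ALPHA[idx]
  end.join
end

rotate_encrypt("HELLO")                  # => "HLPWZ"
rotate_decrypt(rotate_encrypt("HELLO"))  # => "HELLO"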
You can solve this problem with Format-preserving encryption.
A Java library can be found at https://github.com/EVGStudents/FPE.git. There you can define a regex and encrypt/decrypt string values matching that regex.
