The problem is, I have to scan executable file and find out the strings for analysis, use strings.exe from sysinternals. However, How to distinguish meaningful strings and the trivial strings, Is there any algorithm or thought to solve this problem(statistics? probability?).
for example:
extract strings from strings.exe(part of all strings)
S`A
waA
RmA
>rA
5xA
GetModuleHandleA
LocalFree
LoadLibraryA
LocalAlloc
GetCommandLineW
From empirical judgement, the last five strings is meaningful, and the first 5 ones are not.
So how to solve this problem, do not use a dictionary like black list or white list.
Simple algorithm: Break candidate strings into words on first caps/whitespace/digits, and then compare words against some dictionary.
use N-Grams
N-Gram will tell you what is the probability that word is meaningfull. Read about markov chains and n-grams (http://en.wikipedia.org/wiki/N-gram) . Treat each letter as state, and take the set of meaningfull and meaningless words. For example:
Meaningless word are B^^#, #AT
Normal words: BOOK, CAT
create two Language models for them (trigram will be the best) http://en.wikipedia.org/wiki/Language_model
and now you can check in which model word was probably generated and take language model with probability greater than in other one. this will satisfy your condition
remember that you need set of meaningless words ( i think around 1000 will be ok) and not meaningless
Is there a definite rule for meaningful words? Or are they simply words from dictionary?
If they are words from dictionary, then you can use trie's
you can look up a word until the next char is not capitalized. if its capitalized then start from beginning of the trie and look for the next word.
Just my 2 cents.
Ivar
Related
I have some text generated by some lousy OCR software.
The output contains mixture of words and space-separated characters, which should have been grouped into words. For example,
Expr e s s i o n Syntax
S u m m a r y o f T e r minology
should have been
Expression Syntax
Summary of Terminology
What algorithms can group characters into words?
If I program in Python, C#, Java, C or C++, what libraries provide the implementation of the algorithms?
Thanks.
Minimal approach:
In your input, remove the space before any single letter words. Mark the final words created as part of this somehow (prefix them with a symbol not in the input, for example).
Get a dictionary of English words, sorted longest to shortest.
For each marked word in your input, find the longest match and break that off as a word. Repeat on the characters left over in the original "word" until there's nothing left over. (In the case where there's no match just leave it alone.)
More sophisticated, overkill approach:
The problem of splitting words without spaces is a real-world problem in languages commonly written without spaces, such as Chinese and Japanese. I'm familiar with Japanese so I'll mainly speak with reference to that.
Typical approaches use a dictionary and a sequence model. The model is trained to learn transition properties between labels - part of speech tagging, combined with the dictionary, is used to figure out the relative likelihood of different potential places to split words. Then the most likely sequence of splits for a whole sentence is solved for using (for example) the Viterbi algorithm.
Creating a system like this is almost certainly overkill if you're just cleaning OCR data, but if you're interested it may be worth looking into.
A sample case where the more sophisticated approach will work and the simple one won't:
input: Playforthefunofit
simple output: Play forth efunofit (forth is longer than for)
sophistiated output: Play for the fun of it (forth efunofit is a low-frequency - that is, unnatural - transition, while for the is not)
You can work around the issue with the simple approach to some extent by adding common short-word sequences to your dictionary as units. For example, add forthe as a dictionary word, and split it in a post processing step.
Hope that helps - good luck!
Is it possible to check if a short sequence of text, e.g. two or three words, is random or not?
My first thought was to calculate the entropy on the string.
H("hello world") = 2.84535
H("sdzfjksher") = 3.12193
but any combination of the chars in "hello world" will result in the same entropy, but will create a random string like "llloo ehrdw". Entropy based methods works great on long strings like text. Here you can also count single chars to determinate that its a language. You can also use Zipfs Law here to check for real languages...
the next method would be a lookup table of common words, like a normal english dictionary. The problem with this method is to create a list of words first.
For example:
input string result
------------------------------------------------------
"hello world" matches 2 words
"helloworld" random string
"lllooehrdw" random string
"hello.world" probably 2 words
"a.be.was" probably 3 words (but this is probably a strange edge case)
So its all about finding words here to compare them with your wordlist, which can be really hard.
Another problem with all these methods could be, that they only detect certain languages or need to be trained to a certain language. Consider that we only want to use english for now.
So is there any good method to do this, or do i need to accept False Positives and False Negatives?
You could count the frequency of characters used in the text and compare this with known character distributions in English and/or other languages. This will give an indication of the probability that the text is/resembles that language or not.
Sounds like you want to use the frequencies of the letters to see if a string is a word or random letter.
http://scottbryce.com/cryptograms/stats.htm
Combining statistics and wordlists sounds like the way to go reduce false positives.
I have a set of pairs of character strings, e.g.:
abba - aba,
haha - aha,
baa - ba,
exb - esp,
xa - za
The second (right) string in the pair is somewhat similar to the first (left) string.
That is, a character from the first string can be represented by nothing, itself or a character from a small set of characters.
There's no simple rule for this character-to-character mapping, although there are some patterns.
Given several thousands of such string pairs, how do I deduce the transformation rules such that if I apply them to the left strings, I get the right strings?
The solution can be approximate, working correctly for, say, 80-95% of the strings.
Would you recommend to use some kind of a genetic algorithm? If so, how?
If you could align the characters, or rather groups of characters, you could work out tables saying that aa => a, bb => z, and so on. If you had such tables, you could align the characters using http://en.wikipedia.org/wiki/Dynamic_time_warping. One approach is therefore to guess an alignment (e.g. one for one, just as a starting point, or just align the first and last characters of each sequence), work out a translation table from that, use DTW to get a new alignment, work out a revised translation table, and iterate in that way. Perhaps you could wrap this up with enough maths to show that there is some measure of optimality or probability that such passes increase, climbing to a local maximum.
There is probably some way of doing this by modelling a Hidden Markov Model that generates both sequences simultaneously and then deriving rules from that model, but I would not chose this approach unless I was already familiar with HMMs and had software to use as a starting point that I was happy to modify.
You can use text to speech to create sound waves. then compare sound waves with other's and match them with percentages.
This is my theory how Google has such a advanced spell checker.
I'm looking at ways to deterministically replace unique strings with unique and optimally short replacements. So I have a finite set of strings, and the best compression I could achieve so far is through an enumeration algorithm, where I order the input set and then replace the strings with an enumeration of char strings over an extended alphabet (a..z, A...Z, aa...zz, aA... zZ, a0...z9, Aa..., aaa...zaa, aaA...zaaA, ....).
This works wonderfully as far as compression is concerned, but has the severe drawback that it is not atomic on any given input string. Rather, its result depends on knowing all input strings right from the start, and on the ordering of the input set.
Anybody knows of an algorithm that has similar compression but doesn't require knowing all input strings upfront?! Hashing for example would not work for me, as depending on the size of the input set I'd need a hash length of 8-12 for the hashes to be unique, and that would be too long as replacements (currently, the replacement strings are 1-3 chars long for my use cases (<10,000 input strings)). Also, if theoreticians among us know this is wasted effort, I would be interested to hear :-) .
You could use your enumeration scheme, but sorted by the order in which you first encounter the input strings.
For example, the first string you ever process can be mapped to "a".
The next distinct string would be mapped to "b", etc.
Every time you process a string, you'd need to look it up to see if it has already been mapped.
"Optimally short" depends on the population of strings from which your samples are drawn. In the absence of systematic redundancy in the population, you will find that only a fraction of arbitrary strings can be compressed at all (e.g., consider trying to compress random bit strings).
If you can make assumptions about your data, such as "the strings are expected to be mainly composed of English words" then you can do something simple and effective based on letter frequency (e.g., for English, the relative frequency order is something like ETAOINSHRDLUGCY..., so you would want to use fewer bits to represent Es and more bits to represent uncommon letters like Q).
Cheers.
Arising out of this question, I'm looking for an elegant (ruby) way to compute the word signature suggested in this answer.
The idea suggested is to sort the letters in the word, and also run length encode repeated letters. So, for example "mississippi" first becomes "iiiimppssss", and then could be further shortened by encoding as "4impp4s".
I'm relatively new to ruby and though I could hack something together, I'm sure this is a one liner for somebody with more experience of ruby. I'd be interested to see people's approaches and improve my ruby knowledge.
edit: to clarify, performance of computing the signature doesn't much matter for my application. I'm looking to compute the signature so I can store it with each word in a large database of words (450K words), then query for words which have the same signature (i.e. all anagrams of a given word, that are actual english words). Hence the focus on space. The 'elegant' part is just to satisfy my curiosity.
The fastest way to create a sorted list of the letters is this:
"mississippi".unpack("c*").sort.pack("c*")
It is quite a bit faster than split('') and join(). For comparison it is also best to pack the array back together into a String, so you dont have to compare arrays.
I'm not much of a Ruby person either, but as I noted on the other comment this seems to work for the algorithm described.
s = "mississippi"
s.split('').sort.join.gsub(/(.)\1{2,}/) { |s| s.length.to_s + s[0,1] }
Of course, you'll want to make sure the word is lowercase, doesn't contain numbers, etc.
As requested, I'll try to explain the code. Please forgive me if I don't get all of the Ruby or reg ex terminology correct, but here goes.
I think the split/sort/join part is pretty straightforward. The interesting part for me starts at the call to gsub. This will replace a substring that matches the regular expression with the return value from the block that follows it. The reg ex finds any character and creates a backreference. That's the "(.)" part. Then, we continue the matching process using the backreference "\1" that evaluates to whatever character was found by the first part of the match. We want that character to be found a minimum of two more times for a total minimum number of occurrences of three. This is done using the quantifier "{2,}".
If a match is found, the matching substring is then passed to the next block of code as an argument thanks to the "|s|" part. Finally, we use the string equivalent of the matching substring's length and append to it whatever character makes up that substring (they should all be the same) and return the concatenated value. The returned value replaces the original matching substring. The whole process continues until nothing is left to match since it's a global substitution on the original string.
I apologize if that's confusing. As is often the case, it's easier for me to visualize the solution than to explain it clearly.
I don't see an elegant solution. You could use the split message to get the characters into an array, but then once you've sorted the list I don't see a nice linear-time concatenate primitive to get back to a string. I'm surprised.
Incidentally, run-length encoding is almost certainly a waste of time. I'd have to see some very impressive measurements before I'd think it worth considering. If you avoid run-length encoding, you can anagrammatize any string, not just a string of letters. And if you know you have only letters and are trying to save space, you can pack them 5 bits to a letter.
---Irma Vep
EDIT: the other poster found join which I missed. Nice.