Word Pattern Finder - algorithm

Problem: find all words that follow a pattern (independently of the actual symbols used to define the pattern).
Almost identical to what this site does: http://design215.com/toolbox/wordpattern.php
Enter patterns like: ABCCDE
This will find words like "bloody",
"kitten", and "valley". The above pattern will NOT find words like
"fennel" or "hippie" because that would require the pattern to be
ABCCBE.
Please note: I need a version of that algorithm that does find words like "fennel" or "hippie" even with an ABCCDE pattern.
To complicate things further, it is possible to add known characters anywhere in the search pattern; for example, cBBX (where c is the known character) will yield cees, coof, cook, cool ...
What I've done so far: I found this answer (Pattern matching for strings independent from symbols) that solves my problem almost perfectly, but if I assign an integer to every word I need to compare, I will encounter two problems.
The first is the number of unique digits I can use. For example, if the pattern is XYZABCDEFG, the equivalent digit pattern would be 1 2 3 4 5 6 7 8 9 and then? 10? Consider that I would use the digit 0 to indicate a known character (for example, aBe --> 010 --> 10). Using hexadecimal digits would push the problem further out, but would not solve it.
The second problem is the maximum length of the pattern: a long in Java is at most 19 digits long, and I need no restriction on my patterns (although I don't think there exists a word with 20 different characters).
To solve those problems, I could store each digit of the pattern in an array, but then it becomes an array-to-array comparison instead of an integer comparison, which takes a lot more time to compute.
As a side note: depending on the algorithm used, what data structure would be best suited for storing the dictionary? I was thinking about using a hash map, converting each word into its digit-pattern equivalent (assuming no known characters) and using this number as a hash (of course, there would be a lot of collisions). That way, searching would first require matching the numeric pattern, and then scanning the results to find all the words that have the known characters in the right places (if present in the original search pattern).
Also, the dictionary is not static: words can be added and deleted.
EDIT:
This answer (https://stackoverflow.com/a/44604329/4452829) works fairly well and it's fast (testing for equal lengths before matching the patterns). The only problem is that I need a version of that algorithm that finds words like "fennel" or "hippie" even with an ABCCDE pattern.
I've already implemented a way to check for known characters.
EDIT 2:
Ok, by checking whether each character in the pattern is greater than or equal to the corresponding character in the current word (normalized as a temporary pattern), I am almost done: it correctly matches the search pattern ABCA with the word ABBA and correctly ignores the word ABAC. The last remaining problem is that if (for example) the pattern is ABBA, it will match the word ABAA, and that's not correct.
EDIT 3:
Meh, not pretty, but it seems to be working (I'm using Python because it's quick to prototype with). Also, the search pattern can be any sequence of symbols, using lowercase letters as fixed characters and everything else as wildcards; there is also no need to convert each word into an abstract pattern.
def match(pattern, fixed_chars, word):
    d = dict()
    if len(pattern) != len(word):
        return False
    if check_fixed_char(word, fixed_chars) is False:
        return False
    for i in range(0, len(pattern)):
        cp = pattern[i]
        cw = word[i]
        if cp in d:
            if d[cp] != cw:
                return False
        else:
            d[cp] = cw
            if cp > cw:
                return False
    return True
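For example, with a trivial stand-in for check_fixed_char just so the snippet runs on its own (it assumes fixed_chars is a dict mapping positions to required letters, which is not necessarily how the real helper works):

# Stand-in for the helper mentioned above; assumes fixed_chars maps
# a position to the letter that must appear there.
def check_fixed_char(word, fixed_chars):
    return all(word[i] == c for i, c in fixed_chars.items())

print(match('ABCA', {}, 'abba'))        # True
print(match('ABCA', {}, 'abac'))        # False
print(match('ABCCDE', {}, 'fennel'))    # True, the relaxed matching I need
print(match('ABBA', {}, 'abaa'))        # False
print(match('cBBX', {0: 'c'}, 'cook'))  # True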

A long time ago I wrote a program for solving cryptograms which was based on the same concept (generating word patterns such that "kitten" and "valley" both map to "abccde").
My technique did involve generating a sort of index of words by pattern.
The core abstraction function looks like:
#!/usr/bin/env python
import string

def abstract(word):
    '''find abstract word pattern
    dog or cat -> abc, book or feel -> abbc
    '''
    al = list(string.ascii_lowercase)
    d = dict()
    for i in word:
        if i not in d:
            d[i] = al.pop(0)
    return ''.join([d[i] for i in word])
From there building our index is pretty easy. Assume we have a file like /usr/share/dict/words (commonly found on Unix-like systems including MacOS X and Linux):
#!/usr/bin/python
words_by_pattern = dict()
words = set()
with open('/usr/share/dict/words') as f:
    for each in f:
        words.add(each.strip().lower())
for each in sorted(words):
    pattern = abstract(each)
    if pattern not in words_by_pattern:
        words_by_pattern[pattern] = list()
    words_by_pattern[pattern].append(each)
... that takes less than two seconds on my laptop for about 234,000 "words" (Although you might want to use a more refined or constrained word list for your application).
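Lookups against that index are then just a matter of abstracting the query pattern and, optionally, filtering by the known characters. A rough sketch (the fixed-character filtering is an addition for the question above, not part of the original cryptogram program):

def find(pattern, fixed_chars=None):
    # Words whose abstract pattern equals the abstracted query pattern;
    # fixed_chars (position -> letter) is an optional extra filter.
    candidates = words_by_pattern.get(abstract(pattern), [])
    if not fixed_chars:
        return candidates
    return [w for w in candidates
            if all(w[i] == c for i, c in fixed_chars.items())]

print(find('abccde')[:5])          # e.g. words shaped like "kitten" or "valley"
print(find('cbbx', {0: 'c'})[:5])  # the cBBX example from the question: cook, cool, ...

Note that this lookup still relies on exact pattern equality, so on its own it does not give the relaxed matching asked about in the question.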
Another interesting trick at this point is to find the patterns which are most unique (returns the fewest possible words). We can create a histogram of patterns thus:
histogram = [(len(words_by_pattern[x]),x) for x in words_by_pattern.keys()]
histogram.sort()
I find that this gives me:
8077 abcdef
7882 abcdefg
6373 abcde
6074 abcdefgh
3835 abcd
3765 abcdefghi
1794 abcdefghij
1175 abc
1159 abccde
925 abccdef
Note that abc, abcd, and abcde are all in the top ten. In other words the most common letter patterns for words include all of those with no repeats among 3 to 10 characters.
You can also look at the histogram of the histogram; in other words, how many patterns match only one word. For example, aabca only matches "eerie" and aabcb only matches "llama". There are over 48,000 patterns with only a single matching word and almost six thousand with just two words, and so on.
Note: I don't use digits; I use letters to create the pattern mappings.
I don't know if this helps with your project at all; but these are very simple snippets of code. (They're intentionally verbose.)

This can easily be achieved using regular expressions.
For example, the pattern below matches any word that has the ABCCDE pattern:
(?:([A-z])(?!\1)([A-z])(?!\1|\2)([A-z])(?=\3)([A-z])(?!\1|\2|\3|\5)([A-z])(?!\1|\2|\3|\5|\6)([A-z]))
And this one matches ABCCBE:
(?:([A-z])(?!\1)([A-z])(?!\1|\2)([A-z])(?=\3)([A-z])(?=\2)([A-z])(?!\1|\2|\3|\5|\6)([A-z]))
To cover both of the above patterns, you can use:
(?:([A-z])(?!\1)([A-z])(?!\1|\2)([A-z])(?=\3)([A-z])(?(?=\2)|(?!\1\2\3\5))([A-z])(?!\1|\2|\3|\5|\6)([A-z]))
Going down this path, your challenge would be generating the above regex pattern from the alphabetic notation you used.
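One possible sketch of that generation step in Python (my own, using the question's convention that lowercase letters are fixed characters; note that it deliberately omits the negative look-aheads, so it yields the relaxed matching the question ultimately asks for rather than the exact patterns above):

import re

def pattern_to_regex(pattern):
    # First occurrence of a wildcard symbol opens a capture group,
    # later occurrences become back-references; lowercase letters are literals.
    group_of = {}
    parts = []
    for ch in pattern:
        if ch.islower():
            parts.append(re.escape(ch))
        elif ch in group_of:
            parts.append('\\%d' % group_of[ch])
        else:
            group_of[ch] = len(group_of) + 1
            parts.append('([a-z])')
    return ''.join(parts)

print(pattern_to_regex('ABCCDE'))                                # ([a-z])([a-z])([a-z])\3([a-z])([a-z])
print(bool(re.fullmatch(pattern_to_regex('ABCCDE'), 'fennel')))  # True
print(bool(re.fullmatch(pattern_to_regex('cBBX'), 'cook')))      # True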
And please note that you may want to use the i Regex flag when using these if case-insensitivity is a requirement.
For more Regex info, take a look at:
Look-around
Back-referencing

Related

What algorithms can group characters into words?

I have some text generated by some lousy OCR software.
The output contains a mixture of words and space-separated characters, which should have been grouped into words. For example,
Expr e s s i o n Syntax
S u m m a r y o f T e r minology
should have been
Expression Syntax
Summary of Terminology
What algorithms can group characters into words?
If I program in Python, C#, Java, C or C++, what libraries provide the implementation of the algorithms?
Thanks.
Minimal approach:
In your input, remove the space before any single letter words. Mark the final words created as part of this somehow (prefix them with a symbol not in the input, for example).
Get a dictionary of English words, sorted longest to shortest.
For each marked word in your input, find the longest match and break that off as a word. Repeat on the characters left over in the original "word" until there's nothing left over. (In the case where there's no match just leave it alone.)
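One possible sketch of the longest-match step above, assuming the dictionary is a list already sorted longest to shortest (the word list here is a toy stand-in):

def split_marked_word(chunk, dictionary_longest_first):
    # Break the longest dictionary word out of the chunk, then recurse
    # on whatever is left on either side; leave unmatched pieces alone.
    for w in dictionary_longest_first:
        pos = chunk.find(w)
        if pos != -1:
            left, right = chunk[:pos], chunk[pos + len(w):]
            return ((split_marked_word(left, dictionary_longest_first) if left else [])
                    + [w]
                    + (split_marked_word(right, dictionary_longest_first) if right else []))
    return [chunk]

dictionary = sorted(['terminology', 'summary', 'expression', 'syntax', 'of'],
                    key=len, reverse=True)
print(split_marked_word('summaryofterminology', dictionary))  # ['summary', 'of', 'terminology']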
More sophisticated, overkill approach:
The problem of splitting words without spaces is a real-world problem in languages commonly written without spaces, such as Chinese and Japanese. I'm familiar with Japanese so I'll mainly speak with reference to that.
Typical approaches use a dictionary and a sequence model. The model is trained to learn transition properties between labels - part of speech tagging, combined with the dictionary, is used to figure out the relative likelihood of different potential places to split words. Then the most likely sequence of splits for a whole sentence is solved for using (for example) the Viterbi algorithm.
Creating a system like this is almost certainly overkill if you're just cleaning OCR data, but if you're interested it may be worth looking into.
A sample case where the more sophisticated approach will work and the simple one won't:
input: Playforthefunofit
simple output: Play forth efunofit (forth is longer than for)
sophisticated output: Play for the fun of it (forth efunofit is a low-frequency - that is, unnatural - transition, while for the is not)
You can work around the issue with the simple approach to some extent by adding common short-word sequences to your dictionary as units. For example, add forthe as a dictionary word, and split it in a post processing step.
Hope that helps - good luck!

How to improve performance so my program runs faster (Python)

Right now, my program takes more than 10 minutes to display all the possible words (if those words are in the file) that can be created from the given letters. The file contains more than 4000 words.
How can I make my program run faster while still using recursion, and without using any libraries, because I'm new to this?
If the user inputs the letters: b d o s y
then it will look up all the possible words in that file and produce:
b
d
boy
boys
by
the code:
words = set()

def found(word, file):
    ## Reads through file and tries
    ## to match given word in a line.
    with open(file, 'r') as rf:
        for line in rf.readlines():
            if line.strip() == word:
                return True
    return False

def scramble(r_letters, s_letters):
    ## Output every possible combination of a word.
    ## Each recursive call moves a letter from
    ## r_letters (remaining letters) to
    ## s_letters (scrambled letters)
    if s_letters:
        words.add(s_letters)
    for i in range(len(r_letters)):
        scramble(r_letters[:i] + r_letters[i+1:], s_letters + r_letters[i])

thesarus = input("Enter the name of the file containing all of the words: ")
letters = input("Please enter your letters separated by a space: ")
word = ''.join(letters.split(' '))
scramble(word, '')
ll = list(words)
ll.sort()
for word in ll:
    if found(word, thesarus):
        print(word)
Your program runs slowly because your algorithm is inefficient.
Since the question requires using recursion (to generate all the possible combinations), you can at least improve how you search the file.
Your code opens and reads through the file once for every candidate word. This is extremely inefficient.
The first solution that comes to mind is to read the file once and save each word in a set():
words_set = {line.strip() for line in open('somefile')}
or, less concisely:
words_set = set()
with open('somefile') as fp:
    for line in fp:
        words_set.add(line.strip())
Then, you just do:
if word in words_set:
    print(word)
I think there could be more efficient ways to do the whole program, but they don't require recursion.
Update
For the sake of discussion, I think it may be useful to also provide an algorithm which is better.
Your code generates all possible combinations, even if those are not likely to be part of a dictionary, in addition to the inefficient search in the file for each word.
A better solution involves storing the words in a more efficient way, such that it is much easier to tell if a particular combination exists or not. For example, you don't want to visit (in the file) all the words composed by characters not present in the list provided by the user.
There is a data structure which I believe to be quite effective for this kind of problem: the trie (or prefix tree). This data structure can be used to store the whole thesaurus file, in place of the set I suggested above.
Then, instead of generating all the possible combinations of letters, you just walk the tree with the available letters to find all the possible valid words.
So, for example, if your user enters h o m e x and there is no word starting with x in your thesaurus, you will not generate all the permutations starting with x, such as xe, xo, xh, xm, etc., saving a large amount of computation.
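A minimal sketch of that pruned search (toy word list and names of my own choosing, not the poster's code):

class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_word = False

def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_word = True

def search(node, remaining, prefix, results):
    # Extend the prefix with each remaining letter, but only along
    # branches that actually exist in the trie.
    if node.is_word:
        results.add(prefix)
    for i, ch in enumerate(remaining):
        child = node.children.get(ch)
        if child is not None:
            search(child, remaining[:i] + remaining[i + 1:], prefix + ch, results)

root = TrieNode()
for w in ('boy', 'boys', 'by', 'do', 'so'):  # stand-in for the thesaurus file
    insert(root, w)
results = set()
search(root, 'bdosy', '', results)
print(sorted(results))                       # ['boy', 'boys', 'by', 'do', 'so']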

Counting words from a mixed-language document

Given a set of lines containing Chinese characters, Latin-alphabet-based words or a mixture of both, I wanted to obtain the word count.
To wit:
this is just an example
这只是个例子
should give 10 words ideally; but of course, without access to a dictionary, 例子 would best be treated as two separate characters. Therefore, a count of 11 words/characters would also be an acceptable result here.
Obviously, wc -w is not going to work. It considers the 6 Chinese characters / 5 words as 1 "word", and returns a total of 6.
How do I proceed? I am open to trying different languages, though bash and python will be the quickest for me right now.
You should split the text on Unicode word boundaries, then count the elements which contain letters or ideographs. If you're working with Python, you could use the uniseg or nltk packages, for example. Another approach is to simply use Unicode-aware regexes but these will only break on simple word boundaries. Also see the question Split unicode string on word boundaries.
Note that you'll need a more complex dictionary-based solution for some languages. UAX #29 states:
For Thai, Lao, Khmer, Myanmar, and other scripts that do not typically use spaces between words, a good implementation should not depend on the default word boundary specification. It should use a more sophisticated mechanism, as is also required for line breaking. Ideographic scripts such as Japanese and Chinese are even more complex. Where Hangul text is written without spaces, the same applies. However, in the absence of a more sophisticated mechanism, the rules specified in this annex supply a well-defined default.
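If those packages are not an option, one crude approximation of the counting described above (my own simplification, not the uniseg/nltk API) is to count each run of alphabetic characters as one word and each CJK ideograph as one unit, which reproduces the count of 11 accepted above:

import unicodedata

def count_units(text):
    count = 0
    in_word = False
    for ch in text:
        if unicodedata.name(ch, '').startswith('CJK UNIFIED IDEOGRAPH'):
            count += 1           # each ideograph counts on its own
            in_word = False
        elif ch.isalpha():
            if not in_word:      # first letter of a Latin-script word
                count += 1
            in_word = True
        else:
            in_word = False
    return count

print(count_units('this is just an example'))  # 5
print(count_units('这只是个例子'))                # 6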
I thought about a quick hack since Chinese characters are 3 bytes long in UTF8:
(pseudocode)
for each character:
    if character (byte) begins with 1:
        add 1 to total chinese chars
    if it is a space:
        add 1 to total "normal" words
    if it is a newline:
        break
Then take total chinese chars / 3 + total words to get the sum for each line. This will give an erroneous count for the case of mixed languages, but should be a good start.
这是test
However, the above sentence will give a total of 2 (1 for each of the Chinese characters). A space between the two languages would be needed to give the correct count.
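For reference, a direct translation of that pseudocode into Python (with the same limitations noted above):

def quick_count(line):
    chinese_bytes = 0
    normal_words = 0
    for byte in line.encode('utf-8'):
        if byte & 0x80:          # byte begins with a 1 bit: part of a multi-byte character
            chinese_bytes += 1
        elif byte == ord(' '):
            normal_words += 1
        elif byte == ord('\n'):
            break
    return chinese_bytes // 3 + normal_words

print(quick_count('这是test'))  # 2, as noted above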

Automata with Kleene star

I'm learning about automata. Can you please help me understand how automata with Kleene closure work? Let's say I have the letters a, b, c and I need to find text matching a pattern with a Kleene star in it, like ab*bac; how will it work?
The question seems to be more about how an automaton would handle Kleene closure than what Kleene closure means.
With a simple regular expression, e.g., abc, it's pretty straightforward to design an automaton to recognize it. Each state essentially tells you where you are in the expression so far. State 0 means it's seen nothing yet. State 1 means it's seen a. State 2 means it's seen ab. Etc.
The difficulty with Kleene closure is that a pattern like ab*bc introduces ambiguity. Once the automaton has seen the a and is then faced with a b, it doesn't know whether that b is part of the b* or the literal b that follows it, and it won't know until it reads more symbols--maybe many more.
The simplistic answer is that the automaton simply has a state that literally means it doesn't know yet which path was taken.
In simple cases, you can build this automaton directly. In general cases, you usually build something called a non-deterministic finite automaton. You can either simulate the NDFA, or--if performance is critical--you can apply an algorithm that converts the NDFA to a deterministic one. The algorithm essentially generates all the ambiguous states for you.
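As a small illustration, here is one way to simulate such an NDFA for the question's ab*bac by tracking the set of states it could currently be in (the state numbering is my own sketch, not a formal construction):

# States: 0 = start, 1 = seen the leading 'a' (inside b*),
# 2 = seen the literal 'b', 3 = seen 'a', 4 = seen 'c' (accepting).
delta = {
    (0, 'a'): {1},
    (1, 'b'): {1, 2},  # the ambiguity: this 'b' may be part of b* or the literal b
    (2, 'a'): {3},
    (3, 'c'): {4},
}

def accepts(text):
    states = {0}
    for ch in text:
        states = set().union(*[delta.get((s, ch), set()) for s in states])
    return 4 in states

for s in ('abac', 'abbbbbbbbbbac', 'abbac', 'abaca', 'bac'):
    print(s, accepts(s))  # the first three match, the last two do not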
The Kleene star('*') means you can have as many occurrences of the character as you want (0 or more).
a* will match any number of a's.
(ab)* will match any number of the string "ab"
If you are trying to match an actual asterisk in an expression, the way you would write it depends entirely on the syntax of the regex you are working with. For the general case, the backwards slash \ is used as an escape character:
\* will match an asterisk.
For recognizing a pattern at the end, use concatenation:
(a U b)*c* will match any string that contains 0 or more 'c's at the end, preceded by any number of a's or b's.
For matching text that ends with a Kleene star, again, you can have 0 or more occurrences of the string:
ab(c)* - Possible matches: ab, abc, abcc, abccc, etc.
a(bc)* - Possible matches: a, abc, abcbc, abcbcbc, etc.
Your expression ab*bac in English would read something like:
a followed by 0 or more b followed by bac
Strings that would match the regular expression if used for a search:
abac
abbbbbbbbbbac
abbac
Strings that would not match:
abaca //added extra literal
bac //missing leading a
As stated in the previous answer, actually searching for a * would require an escape character, which is implementation-specific and requires knowledge of your language/library of choice.

Is there any algorithm to judge whether a string is meaningful

The problem is that I have to scan an executable file and extract the strings for analysis, using strings.exe from Sysinternals. However, how do I distinguish the meaningful strings from the trivial ones? Is there any algorithm or approach for solving this problem (statistics? probability?).
for example:
strings extracted from strings.exe (a subset of all the strings):
S`A
waA
RmA
>rA
5xA
GetModuleHandleA
LocalFree
LoadLibraryA
LocalAlloc
GetCommandLineW
From empirical judgement, the last five strings are meaningful and the first five are not.
So how do I solve this problem without using a dictionary such as a blacklist or whitelist?
Simple algorithm: break candidate strings into words at capital letters, whitespace, and digits, then compare the words against some dictionary.
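A quick sketch of that idea (the tiny word set below is just a stand-in for a real dictionary):

import re

def looks_meaningful(s, dictionary):
    # Break the string into pieces at capital letters (whitespace, digits and
    # punctuation act as separators too), then require every piece to be a word.
    parts = re.findall(r'[A-Za-z][a-z]*', s)
    return bool(parts) and all(p.lower() in dictionary for p in parts)

dictionary = {'get', 'module', 'handle', 'a', 'local', 'free', 'load',
              'library', 'alloc', 'command', 'line', 'w'}
print(looks_meaningful('GetModuleHandleA', dictionary))  # True
print(looks_meaningful('S`A', dictionary))               # False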
Use n-grams.
An n-gram model will tell you the probability that a word is meaningful. Read about Markov chains and n-grams (http://en.wikipedia.org/wiki/N-gram). Treat each letter as a state, and take a set of meaningful and a set of meaningless words. For example:
Meaningless words: B^^#, #AT
Normal words: BOOK, CAT
Create two language models for them (a trigram model will work best): http://en.wikipedia.org/wiki/Language_model
Now you can check under which model a word was more probably generated and pick the language model that gives the higher probability. That will satisfy your condition.
Remember that you need a set of meaningless words (I think around 1000 will be OK) and a set of meaningful ones.
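A rough sketch of those two character-trigram models with add-one smoothing (the tiny training sets and test strings below are placeholders; in practice you would use the ~1000-word sets mentioned above):

import math
from collections import Counter

def trigrams(word):
    padded = '^^' + word.lower() + '$'  # pad so short strings still yield trigrams
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def train(words):
    counts = Counter()
    for w in words:
        counts.update(trigrams(w))
    return counts

def log_prob(word, counts, vocab_size):
    total = sum(counts.values())
    # Add-one smoothing so unseen trigrams do not zero out the probability
    return sum(math.log((counts[t] + 1.0) / (total + vocab_size)) for t in trigrams(word))

meaningful = ['GetModuleHandleA', 'LocalFree', 'LoadLibraryA', 'LocalAlloc', 'GetCommandLineW']
meaningless = ['S`A', 'waA', 'RmA', '>rA', '5xA']
good, bad = train(meaningful), train(meaningless)
vocab = len(set(good) | set(bad))

for s in ('LocalSize', 'q#7A'):
    label = 'meaningful' if log_prob(s, good, vocab) > log_prob(s, bad, vocab) else 'meaningless'
    print(s, '->', label)  # with these toy sets: LocalSize -> meaningful, q#7A -> meaningless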
Is there a definite rule for meaningful words, or are they simply words from a dictionary?
If they are words from a dictionary, then you can use tries.
You can look up a word in the trie until the next character is capitalized; when you hit a capital, start again from the beginning of the trie and look for the next word.
Just my 2 cents.
Ivar
