how to improve performance to run faster (python) - performance

Right now, my program takes more than 10 minutes to display all the possible words (if those words are in the file) that can be created from the given letters. The file contains more than 4000 words.
How can I make my program run faster while still using recursion, and without using any libraries, because I'm new to this?
If the user inputs the letters: b d o s y
then it will look up all the possible words in that file and produce:
b
d
boy
boys
by
the code:
words = set()

def found(word, file):
    ## Reads through file and tries
    ## to match given word in a line.
    with open(file, 'r') as rf:
        for line in rf.readlines():
            if line.strip() == word:
                return True
    return False

def scramble(r_letters, s_letters):
    ## Output every possible combination of a word.
    ## Each recursive call moves a letter from
    ## r_letters (remaining letters) to
    ## s_letters (scrambled letters)
    if s_letters:
        words.add(s_letters)
    for i in range(len(r_letters)):
        scramble(r_letters[:i] + r_letters[i+1:], s_letters + r_letters[i])

thesarus = input("Enter the name of the file containing all of the words: ")
letters = input("Please enter your letters separated by a space: ")
word = ''.join(letters.split(' '))
scramble(word, '')
ll = list(words)
ll.sort()
for word in ll:
    if found(word, thesarus):
        print(word)

Your program runs slowly because your algorithm is inefficient.
Since the question requires you to use recursion (to generate all the possible combinations), you can at least improve how you search in the file.
Your code opens the file and reads through it looking for a single word, and it does this once for every generated word. This is extremely inefficient.
The first solution that comes to mind is to read the file once and save each word in a set():
words_set = {line.strip() for line in open('somefile')}
or also (less concise)
words_set = set()
with open('somefile') as fp:
    for line in fp:
        words_set.add(line.strip())
Then, you just do
if word in words_set:
    print(word)
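Put together, the lookup part could look something like this (a minimal sketch that keeps your scramble() function and the words set exactly as they are, and only replaces the per-word file search; the variable name thesaurus_file is just illustrative):

# Build the set once, then each membership test is O(1) on average.
thesaurus_file = input("Enter the name of the file containing all of the words: ")
letters = input("Please enter your letters separated by a space: ")

with open(thesaurus_file) as fp:
    words_set = {line.strip() for line in fp}

scramble(''.join(letters.split(' ')), '')

for word in sorted(words):
    if word in words_set:
        print(word)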
I think there could be more efficient ways to do the whole program, but they don't require recursion.
Update
For the sake of discussion, I think it may be useful to also provide an algorithm which is better.
Your code generates all possible combinations, even if those are not likely to be part of a dictionary, in addition to the inefficient search in the file for each word.
A better solution involves storing the words in a more efficient way, such that it is much easier to tell if a particular combination exists or not. For example, you don't want to visit (in the file) all the words composed by characters not present in the list provided by the user.
There is a data structure which I believe to be quite effective for this kind of problem: the trie (or prefix tree). This data structure can be used to store the whole thesaurus file, in place of the set that I suggested above.
Then, instead of generating all the possible combinations of letters, you just visit the tree with all the possible letters to find all the possible valid words.
So, for example, if your user enters h o m e x and you have no word starting with x in your thesaurus, you will not generate all the permutations starting with x, such as xe, xo, xh, xm, etc, saving a large amount of computations.
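Here is a minimal sketch of that idea (the class and function names are just illustrative, not a drop-in replacement for the code above):

class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_word = False

def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_word = True

def find_words(node, letters, prefix, results):
    # Only follow branches that exist in the trie, so impossible
    # prefixes (e.g. anything starting with 'x' above) are never explored.
    if node.is_word:
        results.add(prefix)
    for i, ch in enumerate(letters):
        child = node.children.get(ch)
        if child is not None:
            find_words(child, letters[:i] + letters[i+1:], prefix + ch, results)

# Usage sketch:
# root = TrieNode()
# for w in words_set:
#     insert(root, w)
# results = set()
# find_words(root, 'bdosy', '', results)

Because the recursion only follows branches that actually exist in the trie, the work done is bounded by the number of valid prefixes rather than by the number of permutations.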

Related

How to use machine learning to count words in text

Question:
Given a piece of text like "This is a test", how do I build a machine learning model that returns the number of word occurrences? For example, for this piece the word count is 4. After training, it should be possible to predict the word count of new text.
I know it is easy to write a program (like below pseudo code),
data: memory.punctuation['~', '`', '!', '#', '#', '$', '%', '^', '&', '*', ...]
f: count.word(text) -> count =
f: tokenize(text) --list-->
f: count.token(list, filter) where filter(token)<not in memory.punctuation> -> count
However, in this question we are required to use a machine learning algorithm. I wonder how a machine can learn the concept of counting (currently, we know machine learning is good at classification). Any ideas and suggestions? Thanks in advance.
Failures:
We can use something like word2vec (an encoder) to build word vectors. If we consider a seq2seq approach, we can train on pairs like This is a test <s> 4 <e> and This is very very long sentence and the word count is greater than ten <s> 4 1 <e> (4 1 representing the number 14). However, it does not work, since the attention model is used to produce similar vectors, for example for text translation (This is a test --> 这(this) 是(is) 一个(a) 测试(test)). It is hard to find a relationship between [this ...] and 4, which is an aggregated number (i.e. the model does not converge).
We know machine learning is good at classification. If we treat "4" as a class, the number of classes is infinite; if we use a trick and predict count/text.length instead, I have not found a model that fits even the training data set (the model does not converge). For example, if we use many short sentences to train the model, it will fail to predict the length of long sentences. And it may be related to an information paradox: we can encode the data in a book as a number 0.x and use a machine to mark a position on a rod that splits it into two parts with lengths a and b, where a/b = 0.x; but we cannot find such a machine.
What about a regression problem?
I think it would work quite well, and in the end it would output nearly whole numbers all the time.
Also, you can train a simple RNN to do the job, assuming you use a one-hot encoding and take the output from the last state.
If V_h is all zeros except at the space index (where it is 1), and V_x is as well, then the network will actually sum the spaces, and if c is 1 at the end, the output will be the number of words - for every length!
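To make the "sum the spaces" remark concrete, here is a tiny pure-Python sketch of a recurrent cell with hand-set weights (the names and the 2-symbol vocabulary are only illustrative; a real model would learn these weights, and the words-equals-spaces-plus-one assumption only holds for single-spaced text):

# One-hot vocabulary: index 0 is the space character, index 1 is "anything else".
def one_hot(ch, vocab_size=2):
    v = [0.0] * vocab_size
    v[0 if ch == ' ' else 1] = 1.0
    return v

V_x = [1.0, 0.0]   # input-to-hidden weights: react only to the space index
w_h = 1.0          # recurrent weight: carry the running sum forward

def count_words_rnn(text):
    h = 0.0
    for ch in text:
        x = one_hot(ch)
        h = w_h * h + sum(w * xi for w, xi in zip(V_x, x))
    return int(h) + 1  # words = spaces + 1 for single-spaced text

print(count_words_rnn("This is a test"))  # 4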
I think we can treat it as a classification problem with a character as the input and whether it is a word breaker as the output.
In other words, at some time point t, we output whether the input character at the same time point is a word breaker (YES) or not (NO). If yes, then increase the word count. If no, then read the next character.
In modern English I don't think there are very long words, so a simple RNN model should do, perhaps without concern for vanishing gradients.
Let me know what you think!
Use NLTK for counting words,
from nltk.tokenize import word_tokenize
text = "God is Great!"
word_count = len(word_tokenize(text))
print(word_count)

Word Pattern Finder

Problem: find all words that follow a pattern (independently from the actual symbols used to define the pattern).
Almost identical to what this site does: http://design215.com/toolbox/wordpattern.php
Enter patterns like: ABCCDE
This will find words like "bloody", "kitten", and "valley". The above pattern will NOT find words like "fennel" or "hippie" because that would require the pattern to be ABCCBE.
Please note: I need a version of that algorithm that does find words like "fennel" or "hippie" even with an ABCCDE pattern.
To complicate things further, there is the possibility to add known characters anywhere in the searching pattern, for example: cBBX (where c is the known character) will yield cees, coof, cook, cool ...
What I've done so far: I found this answer (Pattern matching for strings independent from symbols) that solves my problem almost perfectly, but if I assign an integer to every word I need to compare, I will encounter two problems.
The first is the number of unique digits I can use. For example, if the pattern is XYZABCDEFG, the equivalent digit pattern will be 1 2 3 4 5 6 7 8 9 and then? 10? Consider that I would use the digit 0 to indicate a known character (for example, aBe --> 010 --> 10). Using hexadecimal digits will move the problem further, but will not solve it.
The second problem is the maximum length of the pattern: a Long in Java is 19 digits long, and I need no restriction on my patterns (although I don't think there exists a word with 20 different characters).
To solve those problems, I could store each digit of the pattern in an array, but then it becomes an array-to-array comparison instead of an integer comparison, thus taking a lot more time to compute.
As a side note: according to the algorithm used, what data structure would be best suited for storing the dictionary? I was thinking about using a hash map, converting each word into its digit-pattern equivalent (assuming no known characters) and using this number as the hash (of course, there would be a lot of collisions). That way, searching would first require matching the numeric pattern, and then scanning the results to find all the words that have the known characters in the right places (if present in the original search pattern).
Also, the dictionary is not static: words can be added and deleted.
EDIT:
This answer (https://stackoverflow.com/a/44604329/4452829) works fairly well and it's fast (testing for equal lengths before matching the patterns). The only problem is that I need a version of that algorithm that find words like "fennel" or "hippie" even with an ABCCDE pattern.
I've already implemented a way to check for known characters.
EDIT 2:
Ok, by checking if each character in the pattern is greater or equal than the corresponding character in the current word (normalized as a temporary pattern) I am almost done: it correctly matches the search pattern ABCA with the word ABBA and it correctly ignores the word ABAC. The last problem remaining is that if (for example) the pattern is ABBA it will match the word ABAA, and that's not correct.
EDIT 3:
Meh, not pretty but it seems to be working (I'm using Python because it's fast to code with it). Also, the search pattern can be any sequence of symbols, using lowercase letters as fixed characters and everything else as wildcards; there is also no need to convert each word in an abstract pattern.
def match(pattern, fixed_chars, word):
    d = dict()
    if len(pattern) != len(word):
        return False
    if check_fixed_char(word, fixed_chars) is False:
        return False
    for i in range(0, len(pattern)):
        cp = pattern[i]
        cw = word[i]
        if cp in d:
            if d[cp] != cw:
                return False
        else:
            d[cp] = cw
        if cp > cw:
            return False
    return True
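check_fixed_char is not shown above (I only say that I've implemented it); a plausible stand-in, purely as a guess at its contract, could be:

def check_fixed_char(word, fixed_chars):
    # fixed_chars maps a position to the required known letter,
    # e.g. {0: 'c'} for the cBBX example above.
    return all(pos < len(word) and word[pos] == ch
               for pos, ch in fixed_chars.items())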
A long time ago I wrote a program for solving cryptograms which was based on the same concept (generating word patterns such that "kitten" and "valley" both map to "abccde").
My technique did involve generating a sort of index of words by pattern.
The core abstraction function looks like:
#!/usr/bin/env python
import string

def abstract(word):
    '''find abstract word pattern
    dog or cat -> abc, book or feel -> abbc
    '''
    al = list(string.ascii_lowercase)
    d = dict()
    for i in word:
        if i not in d:
            d[i] = al.pop(0)
    return ''.join([d[i] for i in word])
From there building our index is pretty easy. Assume we have a file like /usr/share/dict/words (commonly found on Unix-like systems including MacOS X and Linux):
#!/usr/bin/python
words_by_pattern = dict()
words = set()
with open('/usr/share/dict/words') as f:
    for each in f:
        words.add(each.strip().lower())
for each in sorted(words):
    pattern = abstract(each)
    if pattern not in words_by_pattern:
        words_by_pattern[pattern] = list()
    words_by_pattern[pattern].append(each)
... that takes less than two seconds on my laptop for about 234,000 "words" (Although you might want to use a more refined or constrained word list for your application).
Another interesting trick at this point is to find the patterns which are most unique (returns the fewest possible words). We can create a histogram of patterns thus:
histogram = [(len(words_by_pattern[x]),x) for x in words_by_pattern.keys()]
histogram.sort()
I find that this gives me:
8077 abcdef
7882 abcdefg
6373 abcde
6074 abcdefgh
3835 abcd
3765 abcdefghi
1794 abcdefghij
1175 abc
1159 abccde
925 abccdef
Note that abc, abcd, and abcde are all in the top ten. In other words, the most common letter patterns for words include all of those with no repeats among 3 to 10 characters.
You can also look at the histogram of the histogram, in other words how many patterns match only one word: for example, aabca only matches "eerie" and aabcb only matches "llama". There are over 48,000 patterns with only a single matching word, almost six thousand with just two words, and so on.
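That second-level histogram is a one-liner with collections.Counter (a small sketch building on the words_by_pattern index above):

from collections import Counter

# How many patterns have exactly 1 matching word, exactly 2, and so on.
pattern_sizes = Counter(len(matches) for matches in words_by_pattern.values())
print(pattern_sizes[1])  # patterns with a single matching word (e.g. aabca -> "eerie")
print(pattern_sizes[2])  # patterns with exactly two matching words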
Note: I don't use digits; I use letters to create the pattern mappings.
I don't know if this helps with your project at all, but these are very simple snippets of code (they're intentionally verbose).
This can easily be achieved through using Regular Expressions.
For example, the below pattern matches any word that has ABCCDE pattern:
(?:([A-z])(?!\1)([A-z])(?!\1|\2)([A-z])(?=\3)([A-z])(?!\1|\2|\3|\5)([A-z])(?!\1|\2|\3|\5|\6)([A-z]))
And this one matches ABCCBE:
(?:([A-z])(?!\1)([A-z])(?!\1|\2)([A-z])(?=\3)([A-z])(?=\2)([A-z])(?!\1|\2|\3|\5|\6)([A-z]))
To cover both of the above patterns, you can use:
(?:([A-z])(?!\1)([A-z])(?!\1|\2)([A-z])(?=\3)([A-z])(?(?=\2)|(?!\1\2\3\5))([A-z])(?!\1|\2|\3|\5|\6)([A-z]))
Going down this path, your challenge would be generating the above regex pattern from the alphabetic notation you used.
And please note that you may want to use the i Regex flag when using these if case-insensitivity is a requirement.
For more Regex info, take a look at:
Look-around
Back-referencing

Simple compression algorithm - how to avoid marker-strings?

I was trying to come up with a string compression algorithm for plaintext, e.g.
AAAAAAAABB -> A#8BB
where n symbols y are written out like
y#n
The problem is: what if I need to compress the string "A#8" ? That would confuse the decompression algorithm into thinking that the original input was "AAAAAAAA" instead of just "A#8".
How can I solve this problem? I was thinking of using a "marker" character instead of the #, but what if I wanted the algorithm to work with binary data? There is no marker character that can be used in that case, I suppose.
A simple solution is escaping: you could represent each # in the source by ##.
Every time you encounter a #, you look one character ahead and find either a number (repeat the previous character) or another # (it's literally a #).
A variant would be encoding each # as ##1, which would fit nicely into your current scheme and allow encoding n consecutive #s as ##n.
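A minimal sketch of the escaping idea (the function names and the run-length threshold are just illustrative choices, not part of your original scheme):

def compress(s):
    out = []
    i = 0
    while i < len(s):
        ch = s[i]
        run = 1
        while i + run < len(s) and s[i + run] == ch:
            run += 1
        if ch == '#':
            # Escape the marker itself: each '#' in the source becomes '##'.
            out.append('##' * run)
        elif run > 3:
            # Only worth encoding runs longer than the 'y#n' notation itself.
            out.append(ch + '#' + str(run))
        else:
            out.append(ch * run)
        i += run
    return ''.join(out)

def decompress(s):
    out = []
    i = 0
    while i < len(s):
        ch = s[i]
        if ch == '#':
            if s[i + 1] == '#':           # '##' is a literal '#'
                out.append('#')
                i += 2
            else:                         # '#<digits>' repeats the previous character
                j = i + 1
                while j < len(s) and s[j].isdigit():
                    j += 1
                out[-1] = out[-1] * int(s[i + 1:j])
                i = j
        else:
            out.append(ch)
            i += 1
    return ''.join(out)

print(compress('AAAAAAAABB'))        # A#8BB
print(decompress(compress('A#8')))   # A#8

The round trip works because a # in the compressed stream is unambiguous: followed by another # it is a literal, followed by digits it is a repeat count.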

What's the name of this programming pattern?

Often one needs to process a sequence of "chunks", which are read from a stream of "atoms", where each chunk consists of a variable number of atoms, and there is no way for the program to know that it has received a complete chunk until it reads the first atom of the next chunk (or the stream of atoms becomes exhausted).
A straightforward algorithm for doing this task would look like this:
LOOP FOREVER:
    SET x TO NEXT_ATOM
    IF DONE(x) OR START_OF_CHUNK(x):
        IF NOT EMPTY(accum):
            PROCESS(accum)
        END
        IF DONE(x):
            BREAK
        END
        RESET(accum)
    END
    ADD x TO accum
END
So, my question is this:
Is there a name for this general class of problems and/or for the programming pattern shown above?
The remainder of this post are just a couple of (reasonably realistic) examples of what's described abstractly above. (The examples are in Python, although they could be translated easily to any imperative language.)
The first one is a function to produce a run-length encoding of an input string. In this case, the "atoms" are individual characters, and the "chunks" are maximal runs of the same character. Therefore, the program does not know that it has reached the end of a run until it reads the first character in the following run.
def rle(s):
    '''Compute the run-length encoding of s.'''
    n = len(s)
    ret = []
    accum = 0
    v = object()  # unique sentinel; ensures first test against x succeeds
    i = 0
    while True:
        x = s[i] if i < n else None
        i += 1
        if x is None or x != v:
            if accum > 0:
                ret.append((accum, v))
            if x is None:
                break
            accum = 0
            v = x
        accum += 1
    return ret
The second example is a function that takes as argument a read handle to a FASTA-formatted file, and parses its contents. In this case, the atoms are lines of text. Each chunk consists of a specially-marked first line, called the "defline" (and distinguished by a '>' as its first character), followed by a variable number of lines containing stretches of nucleotide or protein sequence. Again, the code can detect the end of a chunk unambiguously only after reading the first atom (i.e. the defline) of the next chunk.
def read_fasta(fh):
    '''Read the contents of a FASTA-formatted file.'''
    ret = []
    accum = []
    while True:
        x = fh.readline()
        if x == '' or x.startswith('>'):
            if accum:
                ret.append((accum[0], ''.join(accum[1:])))
            if x == '':
                break
            accum = []
        accum.append(x.strip())
    return ret
The only thing I can think of is that it's a very simple LL(1) parser. You are parsing (in a very simple way) data from left to right, and you need to look ahead one value to know what is happening. See http://en.wikipedia.org/wiki/LL_parser
I implement this pattern regularly (esp. in conjunction with sorting) when aggregating simple statistics in indexing. I've never heard of a formal name but at our company internally we simply refer to it as "batching" or "grouping", after the SQL GROUP BY clause.
In our system batches are usually delimited by an extracted attribute (rather than a bald edge-driven predicate) which we call the batch or group key. By contrast your examples seem to check for explicit delimiters.
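In Python, that key-based variant is essentially what itertools.groupby gives you when the input is already sorted by the batch key (a small sketch with made-up records):

from itertools import groupby

# Records already sorted by their batch key (here: the user id).
records = [('alice', 3), ('alice', 5), ('bob', 1), ('carol', 2), ('carol', 7)]

for user, batch in groupby(records, key=lambda r: r[0]):
    total = sum(value for _, value in batch)
    print(user, total)   # one aggregate per batch: alice 8, bob 1, carol 9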
I believe that what you're describing is something called a streaming algorithm, an algorithm where the input is specified one element at a time until some stop condition is triggered. Streaming algorithms can be used to model algorithms where data is received over a network or from some device that generates data. Often, streaming algorithms assume that there is some fixed bound on the amount of memory that can be stored at any point in time, meaning that the algorithm needs to take special care to preserve important information while discarding useless data.
Many interesting algorithms in computer science fail to work in the streaming case, and there is a large class of algorithms specially designed to work on streams. For example, there are good streaming algorithms for finding the top k elements of a stream (see this question, for example), for randomly choosing k elements out of a stream, for finding elements that appear with high frequency in a stream, etc. One of the other answers to this question (from andrew cooke) mentions that this resembles LL parsing. Indeed, LL parsing (and many other parsing algorithms, such as LR parsing) are streaming algorithms for doing parsing, but they're special cases of the more general streaming algorithm framework.
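"Randomly choosing k elements out of a stream", for instance, is classic reservoir sampling, which fits the fixed-memory constraint; a short sketch:

import random

def reservoir_sample(stream, k):
    '''Keep a uniform random sample of k items from a stream of unknown length.'''
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Replace an existing item with probability k / (i + 1).
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1000), 5))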
Hope this helps!

How to shuffle the lines in a file without reading the whole file in advance?

What's a good algorithm to shuffle the lines in a file without reading the whole file in advance?
I guess it would look something like this: start reading the file line by line from the beginning, storing each line as you go, and deciding whether to print one of the lines stored so far (and then remove it from storage) or to do nothing and proceed to the next line.
Can someone verify / prove this and/or maybe post working (perl, python, etc.) code?
Related questions, but not looking at memory-efficient algorithms:
How can I shuffle the lines of a text file on the Unix command line or in a shell script?
How can I randomize the lines in a file using standard tools on Red Hat Linux?
How can I print the lines in STDIN in random order in Perl?
I cannot think of a way to shuffle the entire file randomly without somehow maintaining a list of what has already been written. I think if I had to do a memory-efficient shuffle, I would scan the file, building a list of offsets for the newlines. Once I have this list of newline offsets, I would randomly pick one of them, write the corresponding line to stdout, and then remove it from the list of offsets.
I am not familiar with perl, or python, but can demonstrate with php.
<?php
$offsets = array();
$f = fopen("file.txt", "r");
$offsets[] = ftell($f);
while (! feof($f))
{
    if (fgetc($f) == "\n") $offsets[] = ftell($f);
}
shuffle($offsets);
foreach ($offsets as $offset)
{
    fseek($f, $offset);
    echo fgets($f);
}
fclose($f);
?>
The only other option I can think of, if scanning the file for new lines is absolutely unacceptable, would be (I am not going to code this one out):
1. Determine the file size.
2. Create a list of offsets and lengths already written to stdout.
3. Loop until bytes_written == filesize:
4. Seek to a random offset that is not already in your list of already-written values.
5. Back up from that seek to the previous newline or the start of the file.
6. Display that line, and add it to the list of offsets and lengths written.
7. Go to 3.
The following algorithm is linear in the number of lines in your input file.
Preprocessing:
Find n (the total number of lines) by scanning for newlines (or whatever), but store the character positions marking the beginning and end of each line. So you'd have two vectors, say s and e, of size n, where the characters numbered from s[i] to e[i] in your input file form the i-th line. In C++ I'd use vector.
Randomly permute a vector of integers from 1 to n (in C++ it would be random_shuffle) and store it in a vector, say, p (e.g. 1 2 3 4 becomes p = [3 1 4 2]). This implies that line i of the new file is now line p[i] in the original file (i.e. in the above example, the 1st line of the new file is the 3rd line of the original file).
Main
Create a new file
Write the first line in the new file by reading the text in the original file between s[p[0]] and e[p[0]] and appending it to the new file.
Continue as in step 2 for all the other lines.
So the overall complexity is linear in the number of lines (since random_shuffle is linear) if you assume read/write & seeking in a file (incrementing the file pointer) are all constant time operations.
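A rough Python translation of that idea (it stores only the start offset of each line, since readline() finds the end on its own; the file names are just for illustration):

import random

def shuffle_lines(in_path, out_path):
    # Pass 1: record the byte offset of the start of every line.
    offsets = []
    with open(in_path, 'rb') as f:
        while True:
            pos = f.tell()
            line = f.readline()
            if not line:
                break
            offsets.append(pos)

        # Pass 2: visit the offsets in random order and copy each line out.
        random.shuffle(offsets)
        with open(out_path, 'wb') as out:
            for pos in offsets:
                f.seek(pos)
                out.write(f.readline())

# shuffle_lines('input.txt', 'shuffled.txt')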
You can create an array for N strings and read the first N lines of the file into this array. For the rest, you read one line, select one of the lines from the array at random, replace that line with the newly read one, and write the line taken from the array to the output file. This has the advantage that you don't need to iterate over the file twice. The disadvantage is that it will not create a very random output file, especially when N is low (for example, this algorithm can't move the last line more than N lines up in the output).
Edit
Just an example in python:
import sys
import random

CACHE_SIZE = 23
lines = {}
for l in sys.stdin:  # for a quick test, replace sys.stdin with e.g. ("%d\n" % n for n in range(200))
    i = random.randint(0, CACHE_SIZE - 1)
    old = lines.get(i)
    if old is not None:
        print(old, end='')
    lines[i] = l
for ignored, p in lines.items():
    print(p, end='')