I am trying to learn about the GADDAG data structure, developed by Steven A. Gordon. While I was reading the document here, I came across the following pseudocode example:
If pos <= 0 THEN {moving left:}
word <- L || word
...
I was unable to find out what this means by searching around, so I'm asking here: what does the || operator mean in this pseudocode?
Thank you!
From context, this appears to be string concatenation. The author mentions this in the paragraphs leading into the pseudocode:
In the GoOn procedure, the direction determines which side of the current word to concatenate the current letter to
This is also supported by the directionality implied in the pseudocode. If the position is at or below zero (that is, you're at or before the start of the word), you prepend the new letter to the front. If the position is greater than zero (that is, you're past the start of the word), you append the new letter to the end.
Apparently || is used in some languages to denote string concatenation, including PL/I and SQL.
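For illustration (a sketch in Python, not Gordon's actual code; the function name is mine), the two branches of the pseudocode simply build the word on opposite ends:

# Hypothetical helper showing the two concatenation directions;
# '||' in the paper's pseudocode is plain string concatenation.
def extend_word(word, letter, pos):
    if pos <= 0:
        return letter + word   # moving left: word <- L || word (prepend)
    else:
        return word + letter   # moving right: word <- word || L (append)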
I am trying to understand some ancient code from a DEC PDP10 written in BCPL. A sample of the code is as follows:
test scanner()=S.DOTNAME then
$( word1:=checklook.up(scan.info,S.SFUNC,"unknown Special function [:s]")
D7 of temp:=P1 of word1
scanner()
$) or D7 of temp:=SF.ACTION
What do the "D7 of temp" and "P1 of word1" constructs do in this case?
The unstoppable Martin Richards is continuing to add features to the BCPL language(a), despite the fact that so few people are aware of it(b). Only seven or so questions are tagged bcpl on Stack Overflow but don't get me wrong: I liked this language and I have fond memories of using it back in the '80s.
Some of the things added since the last time I used it are the sub-field operators SLCT and OF. As per the manual on Martin's own site:
An expression of the form K OF E accesses a field of consecutive bits in memory. K must be a manifest constant equal to SLCT length:shift:offset and E must yield a pointer, p say.
The field is contained entirely in the word at position p + offset. It has a bit length of length and is shift bits from the right hand end of the word. A length of zero is interpreted as the longest length possible consistent with shift and the word length of the implementation.
Hence it's a more fine-grained way of accessing parts of memory than just the ! "dereference entire word" operator in that it allows you to get at specific bits within a word.
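To make that concrete, here is a rough Python model of what K OF E computes (my sketch, based only on the manual's description quoted above, not actual BCPL semantics in every corner case):

WORD_BITS = 64   # assumption: the word length of the implementation

def field_of(mem, p, length, shift, offset):
    # Models "K OF E" where K = SLCT length:shift:offset and E yields the
    # pointer p (here: an index into a list of machine words).
    word = mem[p + offset]          # the field lives wholly in this word
    if length == 0:                 # 0 means: as long as shift allows
        length = WORD_BITS - shift
    mask = (1 << length) - 1
    return (word >> shift) & mask   # 'shift' bits from the right-hand end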
(a) Including, apparently, a version for the Raspberry Pi, which may finally give me an excuse to break out all those spare Pis I have lying around, and educate the kids about the "good old days".
(b) It was used for at least one MC6809 embedded system I worked on, and formed a non-trivial part of AmigaDOS many moons ago.
Right now, my program takes more than 10 minutes (LOL) to display all the possible words (if those words are in the file) that can be created from the given letters. The file contains more than 4000 words.
How can I make my program run faster, still using recursion, and without using any libraries? I'm new to this.
If the user inputs the letters: b d o s y
then it will look up all the possible words in that file that can be created:
b
d
boy
boys
by
The code:
words = set()

def found(word, file):
    ## Reads through file and tries
    ## to match given word in a line.
    with open(file, 'r') as rf:
        for line in rf.readlines():
            if line.strip() == word:
                return True
    return False

def scramble(r_letters, s_letters):
    ## Output every possible combination of a word.
    ## Each recursive call moves a letter from
    ## r_letters (remaining letters) to
    ## s_letters (scrambled letters)
    if s_letters:
        words.add(s_letters)
    for i in range(len(r_letters)):
        scramble(r_letters[:i] + r_letters[i+1:], s_letters + r_letters[i])

thesarus = input("Enter the name of the file containing all of the words: ")
letters = input("Please enter your letters separated by a space: ")
word = ''.join(letters.split(' '))
scramble(word, '')
ll = list(words)
ll.sort()
for word in ll:
    if found(word, thesarus):
        print(word)
Your program runs slowly because your algorithm is inefficient.
Since the question requires recursion (to generate all the possible combinations), you could at least improve how you search the file.
Your code opens the file and reads through it once for every single word you look up. This is extremely inefficient.
The first solution that comes to mind is to read the file once and save each word in a set():
words_set = {line.strip() for line in open('somefile')}
or, less concisely:
words_set = set()
with open('somefile') as fp:
    for line in fp:
        words_set.add(line.strip())
Then, you just do
if word in words_set:
    print(word)
I think there could be more efficient ways to do the whole program, but they don't require recursion.
Update
For the sake of discussion, I think it may be useful to also provide a better algorithm.
Your code generates all possible combinations, even those unlikely to be part of a dictionary, in addition to searching the file inefficiently for each word.
A better solution involves storing the words in a more efficient way, such that it is much easier to tell whether a particular combination exists. For example, you don't want to visit (in the file) all the words composed of characters not present in the list provided by the user.
There is a data structure which I believe to be quite effective for this kind of problem: the trie (or prefix tree). This data structure can be used to store the whole thesaurus file, in place of the set that I suggested above.
Then, instead of generating all the possible combinations of letters, you just visit the tree with all the possible letters to find all the possible valid words.
So, for example, if your user enters h o m e x and you have no word starting with x in your thesaurus, you will not generate any of the permutations starting with x, such as xe, xo, xh, xm, etc., saving a large amount of computation.
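Here is a minimal sketch of that idea (my code, not the asker's; the trie is a plain dict of dicts, and '$' is an assumed end-of-word marker):

def build_trie(filename):
    # One pass over the file; each word becomes a path of nested dicts.
    root = {}
    with open(filename) as fh:
        for line in fh:
            node = root
            for ch in line.strip():
                node = node.setdefault(ch, {})
            node['$'] = True   # end-of-word marker
    return root

def search(node, letters, prefix, results):
    # Recursively walk the trie with the remaining letters; branches that
    # match no prefix in the thesaurus are never explored.
    if '$' in node:
        results.add(prefix)
    for i, ch in enumerate(letters):
        if ch in node:   # prune: no stored word continues with ch
            search(node[ch], letters[:i] + letters[i+1:], prefix + ch, results)

Usage would look like: trie = build_trie(thesarus); results = set(); search(trie, 'bdosy', '', results). Note this still uses recursion, as the question requires, but only over branches that can actually lead to a word.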
Before I write something about the problem, I need to let you know:
This problem is my homework (I had about a week to turn in a working program)
I was working on this problem for about a week, every day, trying to figure out my own solution
I'm not asking for complete program; I need a general idea about the algorithm
Problem:
Given: a wordlist and a "grid", for example:
grid (X means any letter):
X X
XXXX
X X
XXXX
wordlist:
ccaa
baca
baaa
bbbb
You have to find an example "solution": is it possible to fit the words from the wordlist into the given grid? If there is at least one solution, print one (any correct one). If not, print a message that there is no possible solution. For the given example, there is a solution:
b c
baca
b a
baaa
It's hard for me to write everything that I've already tried (because English is not my native language and I also have a lot of papers with wrong ideas).
My naive algorithm works something like this:
The first word just needs the proper length, so find any (first?) word with the proper length (I'm going to use the given example grid and wordlist to demonstrate what I mean):
c X
cXXX
a X
aXXX
For the first common letter (at the crossing of two words), find any (first) word that fits the grid (i.e., has the proper length and the common letter at the proper position). If there is no such word, go back to (1) and take another first word. In the original example there is no word which starts with "c", so we go back to (1) and select the next word (this step repeats a few times until we have "bbbb" as the first word). Now we have:
b X
bXXX
b X
bXXX
Then we look for a word (or words) starting with "b", for example:
b X
baca
b X
bXXX
General process: try to find pairs of words which fit the given grid. If there are no such words, go back to the previous step and use another combination; if there is none, there is no solution.
Everything above is chaotic; I hope you understand at least the problem description. I wrote a draft of an algorithm, but I'm not sure whether it works or how to properly code it (in my case: C++). Moreover, there are cases (even in the example above) where we need to find a word that depends on 2 or more other words.
Maybe I just can't see something obvious, maybe I'm too stupid, maybe... Well, I really tried to solve this problem. I don't know English well enough to precisely describe what I think about this problem, so I can't put all my notes here (I tried to describe one idea and it was hard). Believe it or not, I've spent many long hours trying to figure out a solution and I have almost nothing...
If you can describe a solution, or give a hint how to solve this problem, I would really appreciate this.
The crossword problem is NP-Complete, so your best shot is brute force: just try all possibilities, and stop when a possibility is valid. Return failure when you have exhausted all possible solutions.
A reduction proving that this problem is NP-Complete can be found in section 3.3 of this article.
A brute-force solution using backtracking could be [pseudocode]:
solve(words, grid):
    if words is empty:
        if grid.isValid():
            return grid
        else:
            return None
    for each word in words:
        possibleSol <- grid.fillFirst(word)
        ret <- solve(words \ {word}, possibleSol)
        if ret != None:
            return ret
    return None
Here we assume fillFirst() is a function that fills the first space which was not already filled ["first" can actually be any consistent ordering of the empty spaces, but it must be consistent!], and isValid() returns a boolean indicating whether the given grid is a valid solution.
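For concreteness, here is a small runnable sketch of the same backtracking idea (my own encoding, not the answerer's: each slot is a list of (row, col) cells, and shared cells enforce the crossings):

def solve(slots, words, cells=None):
    # slots: list of word positions, each a list of (row, col) cells
    # words: set of unused words; cells: letters placed so far
    if cells is None:
        cells = {}
    if not slots:
        return dict(cells)   # every slot filled consistently: a solution
    slot, rest = slots[0], slots[1:]
    for w in list(words):
        if len(w) != len(slot):
            continue
        if any(cells.get(pos, ch) != ch for pos, ch in zip(slot, w)):
            continue   # conflicts with a letter from a crossing word
        saved = {pos: cells.get(pos) for pos in slot}
        for pos, ch in zip(slot, w):
            cells[pos] = ch
        result = solve(rest, words - {w}, cells)
        if result is not None:
            return result
        for pos, old in saved.items():   # undo this word (backtrack)
            if old is None:
                cells.pop(pos, None)
            else:
                cells[pos] = old
    return None

For the example grid, the slots would be the two columns [(0,0),(1,0),(2,0),(3,0)] and [(0,2),(1,2),(2,2),(3,2)] plus the two full rows, and solve(slots, {'ccaa', 'baca', 'baaa', 'bbbb'}) finds the solution printed in the question.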
I wrote a program this morning. Here is a slightly more efficient version in pseudocode:
# pseudo-code
solve(words, grid): solve(words, grid, None)

solve(words, grid, filledPositions):
    if words is empty:
        if grid is solved:
            return grid
        else:
            raise (no solution)
    for (current position) as the first possible word position in grid
            that is not in filledPositions:
        # note: a word position must have no letters before the word
        #   'before the word' means, e.g., to the left of a horizontal word
        # no letters may be placed over a ' '
        # no letters may be placed off the grid
        # note: a location may have two 'positions': one across, one down
        for each word in words:
            make a copy of grid
            try:
                fill grid copy, with the current word, at the current position
            except (cannot fill position):
                break
            try:
                return solve(words \ {word}, grid copy,
                             filledPositions + {current position})
            except (no solution):
                break
    raise (no solution)
Here is my code for fitting a word horizontally in the grid : http://codepad.org/4UXoLcjR
Here are some things I used from the STL:
http://www.cplusplus.com/reference/algorithm/remove_copy/
http://www.cplusplus.com/reference/stl/vector/
Often one needs to process a sequence of "chunks", which are read from a stream of "atoms", where each chunk consists of a variable number of atoms, and there is no way for the program to know that it has received a complete chunk until it reads the first atom of the next chunk (or the stream of atoms becomes exhausted).
A straightforward algorithm for doing this task would look like this:
LOOP FOREVER:
    SET x TO NEXT_ATOM
    IF DONE(x) OR START_OF_CHUNK(x):
        IF NOT EMPTY(accum):
            PROCESS(accum)
        END
        IF DONE(x):
            BREAK
        END
        RESET(accum)
    END
    ADD x TO accum
END
So, my question is this:
Is there a name for this general class of problems and/or for the programming pattern shown above?
The remainder of this post is just a couple of (reasonably realistic) examples of what's described abstractly above. (The examples are in Python, although they could be translated easily to any imperative language.)
The first one is a function to produce a run-length encoding of an input string. In this case, the "atoms" are individual characters, and the "chunks" are maximal runs of the same character. Therefore, the program does not know that it has reached the end of a run until it reads the first character in the following run.
def rle(s):
    '''Compute the run-length encoding of s.'''
    n = len(s)
    ret = []
    accum = 0
    v = object()   # unique sentinel; ensures first test against x succeeds
    i = 0
    while True:
        x = s[i] if i < n else None
        i += 1
        if x is None or x != v:
            if accum > 0:
                ret.append((accum, v))
            if x is None:
                break
            accum = 0
            v = x
        accum += 1
    return ret
The second example is a function that takes as argument a read handle to a FASTA-formatted file, and parses its contents. In this case, the atoms are lines of text. Each chunk consists of a specially-marked first line, called the "defline" (and distinguished by a '>' as its first character), followed by a variable number of lines containing stretches of nucleotide or protein sequence. Again, the code can detect the end of a chunk unambiguously only after reading the first atom (i.e. the defline) of the next chunk.
def read_fasta(fh):
    '''Read the contents of a FASTA-formatted file.'''
    ret = []
    accum = []
    while True:
        x = fh.readline()
        if x == '' or x.startswith('>'):
            if accum:
                ret.append((accum[0], ''.join(accum[1:])))
            if x == '':
                break
            accum = []
        accum.append(x.strip())
    return ret
The only thing I can think of is that it's a very simple LL(1) parser. You are parsing (in a very simple way) data from left to right, and you need to look ahead one value to know what is happening. See http://en.wikipedia.org/wiki/LL_parser
I implement this pattern regularly (esp. in conjunction with sorting) when aggregating simple statistics in indexing. I've never heard of a formal name but at our company internally we simply refer to it as "batching" or "grouping", after the SQL GROUP BY clause.
In our system batches are usually delimited by an extracted attribute (rather than a bald edge-driven predicate) which we call the batch or group key. By contrast your examples seem to check for explicit delimiters.
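For what it's worth (my illustration, not part of the original answer), Python ships this pattern as itertools.groupby, where the extracted attribute is the key function; the run-length encoder from the question collapses to:

from itertools import groupby

def rle_groupby(s):
    # Consecutive equal characters form one group; the character is the key.
    return [(len(list(run)), ch) for ch, run in groupby(s)]

print(rle_groupby('aaabcc'))   # [(3, 'a'), (1, 'b'), (2, 'c')]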
I believe that what you're describing is something called a streaming algorithm, an algorithm where the input is specified one element at a time until some stop condition is triggered. Streaming algorithms can be used to model algorithms where data is received over a network or from some device that generates data. Often, streaming algorithms assume that there is some fixed bound on the amount of memory that can be stored at any point in time, meaning that the algorithm needs to take special care to preserve important information while discarding useless data.
Many interesting algorithms in computer science fail to work in the streaming case, and there is a large class of algorithms specially designed to work on streams. For example, there are good streaming algorithms for finding the top k elements of a stream (see this question, for example), for randomly choosing k elements out of a stream, for finding elements that appear with high frequency in a stream, etc. One of the other answers to this question (from @andrew cooke) mentions that this resembles LL parsing. Indeed, LL parsing (and many other parsing algorithms, such as LR parsing) are streaming algorithms for doing parsing, but they're special cases of the more general streaming algorithm framework.
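As a concrete example of one of those (a sketch of classic reservoir sampling, "Algorithm R"; my code, not from the answer itself), choosing k uniformly random elements from a stream in O(k) memory looks like this:

import random

def reservoir_sample(stream, k):
    # Keep a uniform random sample of k items from a stream of unknown length.
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)        # fill the reservoir first
        else:
            j = random.randint(0, i)   # inclusive on both ends
            if j < k:
                sample[j] = item       # replace with probability k/(i+1)
    return sample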
Hope this helps!
To be up front, this is homework. That being said, it's extremely open ended and we've had almost zero guidance as to how to even begin thinking about this problem (or parallel algorithms in general). I'd like pointers in the right direction and not a full solution. Any reading that could help would be excellent as well.
I'm working on an efficient way to match the first occurrence of a pattern in a large amount of text using a parallel algorithm. The pattern is simple character matching, no regex involved. I've managed to come up with a possible way of finding all of the matches, but that then requires that I look through all of the matches and find the first one.
So the question is, will I have more success breaking the text up between processes and scanning that way? Or would it be best to have process-synchronized searching of some sort, where the j'th process searches for the j'th character of the pattern? If all processes then return true for their match, the processes would shift their positions along the pattern and move up again, continuing until all characters have been matched, then returning the index of the first match.
What I have so far is extremely basic, and more than likely does not work. I won't be implementing this, but any pointers would be appreciated.
With p processors, a text of length t, and a pattern of length L, with at most L processors used:
for i = 0 to t - L:
    for j = 0 to p:
        processor j compares text[i+j] to pattern[j]
    on a false match:
        all processors terminate the current comparison, i++
    on a true match by all processors:
        iterate p characters at a time until L characters have been compared
        if all L comparisons return true:
            return i (position of the pattern)
        else:
            i++
I am afraid that breaking up the pattern (one process per character) will not do. Generally speaking, early escaping is difficult, so you'd be better off breaking the text into chunks.
But let's first ask Herb Sutter to explain searching with parallel algorithms, on Dr. Dobb's. The idea is to use the non-uniformity of the distribution to get an early return. Of course Sutter is interested in any match, which is not the problem at hand, so let's adapt.
Here is my idea, let's say we have:
Text of length N
p Processors
A heuristic: max, the maximum number of characters a chunk should contain; probably an order of magnitude greater than M, the length of the pattern.
Now, what you want is to split your text into k equal chunks, where k is minimal and size(chunk) is maximal yet no greater than max.
Then, we have a classical Producer-Consumer pattern: the p processes are fed with the chunks of text, each process looking for the pattern in the chunk it receives.
The early escape is done by having a flag. You can either set the index of the chunk in which you found the pattern (and its position), or you can just set a boolean and store the result in the processes themselves (in which case you'll have to go through all the processes once they have stopped). The point is that each time a chunk is requested, the producer checks the flag and stops feeding the processes if a match has been found (since the processes have been given the chunks in order).
Let's have an example, with 3 processors:
[ 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 ]
x x
The chunks 6 and 8 both contain the string.
The producer will first feed 1, 2 and 3 to the processes, then each process will advance at its own rhythm (it depends on the similarity of the text searched and the pattern).
Let's say we find the pattern in 8 before we find it in 6. Then the process that was working on 7 finishes and tries to get another chunk, and the producer stops it: any further chunk would be irrelevant. Then the process working on 6 finishes, with a result, and thus we know that the first occurrence was in 6, and we have its position.
The key idea is that you don't want to look at the whole text! It's wasteful!
Given a pattern of length L, and searching in a string of length N over P processors, I would just split the string over the processors. Each processor would take a chunk of length N/P + L-1, with the last L-1 characters overlapping the string belonging to the next processor. Then each processor would perform Boyer-Moore (the two pre-processing tables would be shared). When each finishes, it will return the result to the first processor, which maintains a table:
Process   Index
1         -1
2         2
3         23
After all processes have responded (or with a bit of thought you can have an early escape), you return the first match. This should be on average O(N/(L*P) + P).
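A minimal sketch of this splitting scheme (mine, not the answerer's; Python's str.find stands in for Boyer-Moore, and a thread pool stands in for the P processors):

from concurrent.futures import ThreadPoolExecutor

def parallel_find(text, pattern, p=4):
    # Each worker searches a window of length ~N/p plus L-1 overlapping
    # characters, so a match straddling a chunk boundary is seen by exactly
    # one worker; the minimum hit is then the first occurrence overall.
    n, l = len(text), len(pattern)
    chunk = -(-n // p)   # ceil(n / p)
    def search(start):
        # str.find returns the absolute index within text, or -1.
        return text.find(pattern, start, start + chunk + l - 1)
    with ThreadPoolExecutor(max_workers=p) as pool:
        hits = [r for r in pool.map(search, range(0, n, chunk)) if r != -1]
    return min(hits) if hits else -1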
The approach of having the i'th processor matching the i'th character would require too much inter-process communication overhead.
EDIT: I realize you already have a solution, and are figuring out a way to avoid having to find all matches. Well, I don't really think this approach is necessary. You can come up with some early escape conditions, and they aren't that difficult, but I don't think they'll improve your performance that much in general (unless you have some additional knowledge about the distribution of matches in your text).