What's the name of this programming pattern? - algorithm

Often one needs to process a sequence of "chunks", which are read from a stream of "atoms", where each chunk consisting of a variable number of atoms, and there is no way for the program to know that it has received a complete chunk until it reads the first atom of the next chunk (or the stream of atoms becomes exhausted).
A straightforward algorithm for doing this task would look like this:
LOOP FOREVER:
SET x TO NEXT_ATOM
IF DONE(x) OR START_OF_CHUNK(x):
IF NOT EMPTY(accum):
PROCESS(accum)
END
if DONE(x):
BREAK
END
RESET(accum)
END
ADD x TO accum
END
So, my question is this:
Is there a name for this general class of problems and/or for the programming pattern shown above?
The remainder of this post are just a couple of (reasonably realistic) examples of what's described abstractly above. (The examples are in Python, although they could be translated easily to any imperative language.)
The first one is a function to produce a run-length encoding of an input string. In this case, the "atoms" are individual characters, and the "chunks" are maximal runs of the same character. Therefore, the program does not know that it has reached the end of a run until it reads the first character in the following run.
def rle(s):
'''Compute the run-length encoding of s.'''
n = len(s)
ret = []
accum = 0
v = object() # unique sentinel; ensures first test against x succeeds
i = 0
while True:
x = s[i] if i < n else None
i += 1
if x is None or x != v:
if accum > 0:
ret.append((accum, v))
if x is None:
break
accum = 0
v = x
accum += 1
return ret
The second example is a function that takes as argument a read handle to a FASTA-formatted file, and parses its contents. In this case, the atoms are lines of text. Each chunk consists of a specially-marked first line, called the "defline" (and distinguished by a '>' as its first character), followed by a variable number of lines containing stretches of nucleotide or protein sequence. Again, the code can detect the end of a chunk unambiguously only after reading the first atom (i.e. the defline) of the next chunk.
def read_fasta(fh):
'''Read the contents of a FASTA-formatted file.'''
ret = []
accum = []
while True:
x = fh.readline()
if x == '' or x.startswith('>'):
if accum:
ret.append((accum[0], ''.join(accum[1:])))
if x == '':
break
accum = []
accum.append(x.strip())
return ret

the only thing i can think of is that it's a very simple LL(1) parser. you are parsing (in a very simple way) data from left to right and you need to lookahead one value to know what is happening. see http://en.wikipedia.org/wiki/LL_parser

I implement this pattern regularly (esp. in conjunction with sorting) when aggregating simple statistics in indexing. I've never heard of a formal name but at our company internally we simply refer to it as "batching" or "grouping", after the SQL GROUP BY clause.
In our system batches are usually delimited by an extracted attribute (rather than a bald edge-driven predicate) which we call the batch or group key. By contrast your examples seem to check for explicit delimiters.

I believe that what you're describing is something called a streaming algorithm, an algorithm where the input is specified one element at a time until some stop condition is triggered. Streaming algorithms can be used to model algorithms where data is received over a network or from some device that generates data. Often, streaming algorithms assume that there is some fixed bound on the amount of memory that can be stored at any point in time, meaning that the algorithm needs to take special care to preserve important information while discarding useless data.
Many interesting algorithms in computer science fail to work in the streaming case, and there are a large class of algorithms specially designed to work for streams. For example, there are good streaming algorithms for finding the top k elements of a stream (see this question, for example), for randomly choosing k elements out of a stream, for finding elements that appear with high frequency in a stream, etc. One of the other answers to this question (from #andrew cooke) mentions that this resembles LL parsing. Indeed, LL parsing (and many other parsing algorithms, such as LR parsing) are streaming algorithms for doing parsing, but they're special cases of the more general streaming algorithm framework.
Hope this helps!

Related

What does a <- a || b mean in pseudocode?

I am trying to learn about the GADDAG data structure, developed by Steven A. Gordon. While I was reading the document here, I came across the following pseudocode example:
If pos <= 0 THEN {moving left:}
word <- L || word
...
I was unable to find what this means by searching around and I was wondering what it means.
Thank you!
From context, this appears to be string concatenation. The author mentions this in the paragraphs leading into the pseudocode:
In the GoOn procedure, the direction determines which side of the current word to concatenate the current letter to
This is also supported by the directionality implied in the pseudocode. If the position is below zero (that is, you're before the start of the word), you prepend the new letter to the front. If the position is greater than zero (that is, you're past the start of the word), you append the new letter to the end.
Apparently || is used in some languages to denote string concatenation, including PL/1 and SQL.

how to improve performance to run faster (python)

Right now, my program takes more than 10mins LOL try to display all the possible words (if those words are in the file) that can be created from the given letters. In that file, it has more than 4000+ words
How to make my program run faster by using recursion, and not using any libraries because i'm new to it.
if user input letters: b d o s y
then it will look up all the possible words in that file to create:
b
d
boy
boys
by
the code:
words = set()
def found(word, file):
## Reads through file and tries
## to match given word in a line.
with open(file, 'r') as rf:
for line in rf.readlines():
if line.strip() == word:
return True
return False
def scramble(r_letters, s_letters):
## Output every possible combination of a word.
## Each recursive call moves a letter from
## r_letters (remaining letters) to
## s_letters (scrambled letters)
if s_letters:
words.add(s_letters)
for i in range(len(r_letters)):
scramble(r_letters[:i] + r_letters[i+1:], s_letters + r_letters[i])
thesarus = input("Enter the name of the file containing all of the words: ")
letters = input("Please enter your letters separated by a space: ")
word = ''.join(letters.split(' '))
scramble(word, '')
ll = list(words)
ll.sort()
for word in ll:
if found(word, thesarus):
print(word)
You program runs slow because your algorithm is inefficient.
Since you require in the question to use recursion (to generate all the possible combinations), you could improve at least in how you search in the file.
Your code open the file and search a single word reading it for every word. This is extremely inefficient.
First solution it comes into my mind is to read the file once and save each word in a set()
words_set = {line.strip() for line in open('somefile')}
or also (less concise)
words_set = set()
with open('somefile') as fp:
for line in fp:
words_set.add(line.strip())
Then, you just do
if word in words_set:
print(word)
I think there could be more efficient ways to do the whole program, but they don't require recursion.
Update
For the sake of discussion, I think it may be useful to also provide an algorithm which is better.
Your code generates all possible combinations, even if those are not likely to be part of a dictionary, in addition to the inefficient search in the file for each word.
A better solution involves storing the words in a more efficient way, such that it is much easier to tell if a particular combination exists or not. For example, you don't want to visit (in the file) all the words composed by characters not present in the list provided by the user.
There is a data structure which I believe to be quite effective for this kind of problems: the trie (or prefix tree). This data structure can be used to store all the thesaurus file, in place of the set that I suggested above.
Then, instead of generating all the possible combinations of letters, you just visit the tree with all the possible letters to find all the possible valid words.
So, for example, if your user enters h o m e x and you have no word starting with x in your thesaurus, you will not generate all the permutations starting with x, such as xe, xo, xh, xm, etc, saving a large amount of computations.

I need help optimizing this compression algorithm I came up with on my own

I tried coming up with a compression algorithm. I do little bit about compression theories and so am aware that this scheme that I have come up with could very well never achieve compression at all.
Currently it works only for a string with no consecutive repeating letters/digits/symbols. Once properly established I hope to extrapolate it to binary data etc. But first the algorithm:
Assuming there are only 4 letters: a,b,c,d; we create a matrix/array corresponding to the letters. Whenever a letter is encountered, the corresponding index is incremented so that the index of the last letter encountered is always largest. We incremement an index by 2 if it was originally zero. If it was not originally zero then we increment it by 2+(the second largest element in the matrix). An example to clarify:
Array = [a,b,c,d]
Initial state = [0,0,0,0]
Letter = a
New state = [2,0,0,0]
Letter = b
New state = [2,4,0,0]
.
.c
.d
.
New state = [2,4,6,8]
Letter = a
New state = [12,4,6,8]
//Explanation for the above state: 12 because Largest - Second Largest - 2 = Old value
Letter = d
New state = [12,4,6,22]
and so on...
Decompression is just this logic in reverse.
A rudimentary implementation of compression (in python):
(This function is very rudimentary so not the best kind of code...I know. I can optimize it once I get the core algorithm correct.)
def compress(text):
matrix = [0]*95 #we are concerned with 95 printable chars for now
for i in text:
temp = copy.deepcopy(matrix)
temp.sort()
largest = temp[-1]
if matrix[ord(i)-32] == 0:
matrix[ord(i)-32] = largest+2
else:
matrix[ord(i)-32] = largest+matrix[ord(i)-32]+2
return matrix
The returned matrix is then used for decompression. Now comes the tricky part:
I can't really call this compression at all because each number in the matrix generated from the function are of the order of 10**200 for a string of length 50000. So storing the matrix actually takes more space than storing the original string. I know...totally useless. But I had hoped prior to doing all this that I can use the mathematical properties of a matrix to effectively represent it in some kind of mathematical shorthand. I have tried many possibilities and failed. Some things that I tried:
Rank of the matrix. Failed because not unique.
Denote using the mod function. Failed because either the quotient or the remainder
Store each integer as a generator using pickle.
Store the matrix as a bitmap file but then the integers are too large to be able to store as color codes.
Let me iterate again that the algorithm could be optimized. e.g. instead of adding 2 we could add 1 and proceed. But don't really result in any compression. Same for the code. Minor optimizations later...first I want to improve the main algorithm.
Furthermore, it is very likely that this product of a mediocre and idle mind like myself could never be able to achieve compression after all. In which case, I would then like your help and ideas on what this could probably be useful in.
TL;DR: Check coded parts which depict a compression algorithm. The compressed result is longer than the original string. Can this be fixed? If yes, how?
PS: I have the entire code on my PC. Will create a repo on github and upload in some time.
Compression is essentially a predictive process. Look for patterns in the input and use them to encode the more likely next character(s) more efficiently than the less likely. I can't see anything in your algorithm that tries to build a predictive model.

Adaptive IO Optimization Problem

Here is an interesting optimization problem that I think about for some days now:
In a system I read data from a slow IO device. I don't know beforehand how much data I need. The exact length is only known once I have read an entire package (think of it as it has some kind of end-symbol). Reading more data than required is not a problem except that it wastes time in IO.
Two constrains also come into play: Reads are very slow. Each byte I read costs. Also each read-request has a constant setup cost regardless of the number of bytes I read. This makes reading byte by byte costly. As a rule of thumb: the setup costs are roughly as expensive as a read of 5 bytes.
The packages I read are usually between 9 and 64 bytes, but there are rare occurrences larger or smaller packages. The entire range will be between 1 to 120 bytes.
Of course I know a little bit of my data: Packages come in sequences of identical sizes. I can classify three patterns here:
Sequences of reads with identical sizes:
A A A A A ...
Alternating sequences:
A B A B A B A B ...
And sequences of triples:
A B C A B C A B C ...
The special case of degenerated triples exist as well:
A A B A A B A A B ...
(A, B and C denote some package size between 1 and 120 here).
Question:
Based on the size of the previous packages, how do I predict the size of the next read request? I need something that adapts fast, uses little storage (lets say below 500 bytes) and is fast from a computational point of view as well.
Oh - and pre-generating some tables won't work because the statistic of read sizes can vary a lot with different devices I read from.
Any ideas?
You need to read at least 3 packages and at most 4 packages to identify the pattern.
Read 3 packages. If they are all same size, then the pattern is AAAAAA...
If they are all not the same size, read the 4th package. If 1=3 & 2=4, pattern is ABAB. Otherwise, pattern is ABCABC...
With that outline, it is probably a good idea to do a speculative read of 3 package sizes (something like 3*64 bytes at a single go).
I don't see a problem here.. But first, several questions:
1) Can you read the input asyncronously (e.g. separate thread, interrupt routine, etc)?
2) Do you have some free memory for a buffer?
3) If you've commanded a longer read, are you able to obtain first byte(s) before the whole packet is read?
If so (and I think in most cases it can be implemented), then you can just have a separate thread that reads them at highest possible speed and stores them in a buffer, with stalling when the buffer gets full, so that you normal process can use a synchronous getc() on that buffer.
EDIT: I see.. it's because of CRC or encryption? Well, then you could use some ideas from data compression:
Consider a simple adaptive algorithm of order N for M possible symbols:
int freqs[M][M][M]; // [a][b][c] : occurences of outcome "c" when prev vals were "a" and "b"
int prev[2]; // some history
int predict(){
int prediction = 0;
for (i = 1; i < M; i++)
if (freqs[prev[0]][prev[1]][i] > freqs[prev[0]][prev[1]][prediction])
prediction = i;
return prediction;
};
void add_outcome(int val){
if (freqs[prev[0]][prev[1]][val]++ > DECAY_LIMIT){
for (i = 0; i < M; i++)
freqs[prev[0]][prev[1]][i] >>= 1;
};
pred[0] = pred[1];
pred[1] = val;
};
freqs has to be an array of order N+1, and you have to remember N previsous values. N and DECAY_LIMIT have to be adjusted according to the statistics of the input. However, even they can be made adaptive (for example, if it producess too many misses, then the decay limit can be shortened).
The last problem would be the alphabet. Depending on the context, if there are several distinct sizes, you can create a one-to-one mapping to your symbols. If more, then you can use quantitization to limit the number of symbols. The whole algorithm can be written with pointer arithmetics, so that N and M won't be hardcoded.
Since reading is so slow, I suppose you can throw some CPU power at it so you can try to make an educated guess of how much to read.
That would be basically a predictor, that would have a model based on probabilities. It would generate a sample of predictions of the upcoming message size, and the cost of each. Then pick the message size that has the best expected cost.
Then when you find out the actual message size, use Bayes rule to update the model probabilities, and do it again.
Maybe this sounds complicated, but if the probabilities are stored as fixed-point fractions you won't have to deal with floating-point, so it may be not much code. I would use something like a Metropolis-Hastings algorithm as my basic simulator and bayesian update framework. (This is just an initial stab at thinking about it.)

First-Occurrence Parallel String Matching Algorithm

To be up front, this is homework. That being said, it's extremely open ended and we've had almost zero guidance as to how to even begin thinking about this problem (or parallel algorithms in general). I'd like pointers in the right direction and not a full solution. Any reading that could help would be excellent as well.
I'm working on an efficient way to match the first occurrence of a pattern in a large amount of text using a parallel algorithm. The pattern is simple character matching, no regex involved. I've managed to come up with a possible way of finding all of the matches, but that then requires that I look through all of the matches and find the first one.
So the question is, will I have more success breaking the text up between processes and scanning that way? Or would it be best to have process-synchronized searching of some sort where the j'th process searches for the j'th character of the pattern? If then all processes return true for their match, the processes would change their position in matching said pattern and move up again, continuing until all characters have been matched and then returning the index of the first match.
What I have so far is extremely basic, and more than likely does not work. I won't be implementing this, but any pointers would be appreciated.
With p processors, a text of length t, and a pattern of length L, and a ceiling of L processors used:
for i=0 to t-l:
for j=0 to p:
processor j compares the text[i+j] to pattern[i+j]
On false match:
all processors terminate current comparison, i++
On true match by all processors:
Iterate p characters at a time until L characters have been compared
If all L comparisons return true:
return i (position of pattern)
Else:
i++
I am afraid that breaking the string will not do.
Generally speaking, early escaping is difficult, so you'd be better off breaking the text in chunks.
But let's ask Herb Sutter to explain searching with parallel algorithms first on Dr Dobbs. The idea is to use the non-uniformity of the distribution to have an early return. Of course Sutter is interested in any match, which is not the problem at hand, so let's adapt.
Here is my idea, let's say we have:
Text of length N
p Processors
heuristic: max is the maximum number of characters a chunk should contain, probably an order of magnitude greater than M the length of the pattern.
Now, what you want is to split your text into k equal chunks, where k is is minimal and size(chunk) is maximal yet inferior to max.
Then, we have a classical Producer-Consumer pattern: the p processes are feeded with the chunks of text, each process looking for the pattern in the chunk it receives.
The early escape is done by having a flag. You can either set the index of the chunk in which you found the pattern (and its position), or you can just set a boolean, and store the result in the processes themselves (in which case you'll have to go through all the processes once they have stop). The point is that each time a chunk is requested, the producer checks the flag, and stop feeding the processes if a match has been found (since the processes have been given the chunks in order).
Let's have an example, with 3 processors:
[ 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 ]
x x
The chunks 6 and 8 both contain the string.
The producer will first feed 1, 2 and 3 to the processes, then each process will advance at its own rhythm (it depends on the similarity of the text searched and the pattern).
Let's say we find the pattern in 8 before we find it in 6. Then the process that was working on 7 ends and tries to get another chunk, the producer stops it --> it would be irrelevant. Then the process working on 6 ends, with a result, and thus we know that the first occurrence was in 6, and we have its position.
The key idea is that you don't want to look at the whole text! It's wasteful!
Given a pattern of length L, and searching in a string of length N over P processors I would just split the string over the processors. Each processor would take a chunk of length N/P + L-1, with the last L-1 overlapping the string belonging to the next processor. Then each processor would perform boyer moore (the two pre-processing tables would be shared). When each finishes, they will return the result to the first processor, which maintains a table
Process Index
1 -1
2 2
3 23
After all processes have responded (or with a bit of thought you can have an early escape), you return the first match. This should be on average O(N/(L*P) + P).
The approach of having the i'th processor matching the i'th character would require too much inter process communication overhead.
EDIT: I realize you already have a solution, and are figuring out a way without having to find all solutions. Well I don't really think this approach is necessary. You can come up with some early escape conditions, they aren't that difficult, but I don't think they'll improve your performance that much in general (unless you have some additional knowledge the distribution of matches in your text).

Resources