First-Occurrence Parallel String Matching Algorithm - algorithm

To be up front, this is homework. That being said, it's extremely open ended and we've had almost zero guidance as to how to even begin thinking about this problem (or parallel algorithms in general). I'd like pointers in the right direction and not a full solution. Any reading that could help would be excellent as well.
I'm working on an efficient way to match the first occurrence of a pattern in a large amount of text using a parallel algorithm. The pattern is simple character matching, no regex involved. I've managed to come up with a possible way of finding all of the matches, but that then requires that I look through all of the matches and find the first one.
So the question is, will I have more success breaking the text up between processes and scanning that way? Or would it be best to have process-synchronized searching of some sort where the j'th process searches for the j'th character of the pattern? If then all processes return true for their match, the processes would change their position in matching said pattern and move up again, continuing until all characters have been matched and then returning the index of the first match.
What I have so far is extremely basic, and more than likely does not work. I won't be implementing this, but any pointers would be appreciated.
With p processors, a text of length t, and a pattern of length L, and a ceiling of L processors used:
for i=0 to t-l:
for j=0 to p:
processor j compares the text[i+j] to pattern[i+j]
On false match:
all processors terminate current comparison, i++
On true match by all processors:
Iterate p characters at a time until L characters have been compared
If all L comparisons return true:
return i (position of pattern)
Else:
i++

I am afraid that breaking the string will not do.
Generally speaking, early escaping is difficult, so you'd be better off breaking the text in chunks.
But let's ask Herb Sutter to explain searching with parallel algorithms first on Dr Dobbs. The idea is to use the non-uniformity of the distribution to have an early return. Of course Sutter is interested in any match, which is not the problem at hand, so let's adapt.
Here is my idea, let's say we have:
Text of length N
p Processors
heuristic: max is the maximum number of characters a chunk should contain, probably an order of magnitude greater than M the length of the pattern.
Now, what you want is to split your text into k equal chunks, where k is is minimal and size(chunk) is maximal yet inferior to max.
Then, we have a classical Producer-Consumer pattern: the p processes are feeded with the chunks of text, each process looking for the pattern in the chunk it receives.
The early escape is done by having a flag. You can either set the index of the chunk in which you found the pattern (and its position), or you can just set a boolean, and store the result in the processes themselves (in which case you'll have to go through all the processes once they have stop). The point is that each time a chunk is requested, the producer checks the flag, and stop feeding the processes if a match has been found (since the processes have been given the chunks in order).
Let's have an example, with 3 processors:
[ 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 ]
x x
The chunks 6 and 8 both contain the string.
The producer will first feed 1, 2 and 3 to the processes, then each process will advance at its own rhythm (it depends on the similarity of the text searched and the pattern).
Let's say we find the pattern in 8 before we find it in 6. Then the process that was working on 7 ends and tries to get another chunk, the producer stops it --> it would be irrelevant. Then the process working on 6 ends, with a result, and thus we know that the first occurrence was in 6, and we have its position.
The key idea is that you don't want to look at the whole text! It's wasteful!

Given a pattern of length L, and searching in a string of length N over P processors I would just split the string over the processors. Each processor would take a chunk of length N/P + L-1, with the last L-1 overlapping the string belonging to the next processor. Then each processor would perform boyer moore (the two pre-processing tables would be shared). When each finishes, they will return the result to the first processor, which maintains a table
Process Index
1 -1
2 2
3 23
After all processes have responded (or with a bit of thought you can have an early escape), you return the first match. This should be on average O(N/(L*P) + P).
The approach of having the i'th processor matching the i'th character would require too much inter process communication overhead.
EDIT: I realize you already have a solution, and are figuring out a way without having to find all solutions. Well I don't really think this approach is necessary. You can come up with some early escape conditions, they aren't that difficult, but I don't think they'll improve your performance that much in general (unless you have some additional knowledge the distribution of matches in your text).

Related

How to use machine learning to count words in text

Question:
Given a piece of text like "This is a test"; how to build a machine learning model to get the number of word occurrences for example in this piece, word count is 4. After training, it is possible to predict text word count.
I know it is easy to write a program (like below pseudo code),
data: memory.punctuation['~', '`', '!', '#', '#', '$', '%', '^', '&', '*', ...]
f: count.word(text) -> count =
f: tokenize(text) --list-->
f: count.token(list, filter) where filter(token)<not in memory.punctuation> -> count
however in this question, we require to use machine learning algorithm. I wonder how machine can learn the concept of count (currently, we know machine learning is good at classification). Any idea and suggestions? Thanks in advance.
Failures:
We can use sth like word2vec (encoder) to build word vectors; if we consider seq2seq approach, we can train sth like This is a test <s> 4 <e> This is very very long sentence and the word count is greater than ten <s> 4 1 <e> (4 1 to represent the number 14). However, it does not work since attention model is used to get similar vector for example text translating (This is a test --> 这(this) 是(is) 一个(a) 测试(test)). It is hard to find relationship between [this ...] and 4 which is an aggregated number (i.e. model not convergent).
We know machine learning is good at classification. If we treat "4" as a class, the number of classes is infinite; if we do a tricky and use count/text.length as prediction, i have not got a model that fit even training data set (model not convergent); for example, if we use many short sentence to train the model, it will fail to predict long sentence length. And it may be related to an information paradox: we can encode data in a book as 0.x and use a machine to to mark a position on a rod to split it into 2 parts with length a and b, where a/b = 0.x; but we cannot find a machine.
What about a regression problem?
I think it would work quite well and that at the end it will output a nearly whole numbers all the time.
Also you can train a simple RNN to do the job, assuming you are using a hot one encoding and take an output from the last state.
If V_h is all zeros but the space index (which will be 1) and V_x as well, than the network will actually sum the spaces, and if c is 1 at the end so the output will be the number of words - For every length!
I think we can take it as a classification problem for a character being the input and if word breaker as the output.
In other words, at some time point t, we output whether the input character at the same time point is a word breaker (YES) or not (NO). If yes, then increase the word count. If no, then read the next character.
In modern English language I don't think there are going to be long words. So simple RNN model should do perhaps without the concern of vanishing gradient.
Let me know what you think!
Use NLTK for counting words,
from nltk.tokenize import word_tokenize
text = "God is Great!"
word_count = len(word_tokenize(text))
print(word_count)

Random number generation from 1 to 7

I was going through Google Interview Questions. to implement the random number generation from 1 to 7.
I did write a simple code, I would like to understand if in the interview this question asked to me and if I write the below code is it Acceptable or not?
import time
def generate_rand():
ret = str(time.time()) # time in second like, 12345.1234
ret = int(ret[-1])
if ret == 0 or ret == 1:
return 1
elif ret > 7:
ret = ret - 7
return ret
return ret
while 1:
print(generate_rand())
time.sleep(1) # Just to see the output in the STDOUT
(Since the question seems to ask for analysis of issues in the code and not a solution, I am not providing one. )
The answer is unacceptable because:
You need to wait for a second for each random number. Many applications need a few hundred at a time. (If the sleep is just for convenience, note that even a microsecond granularity will not yield true random numbers as the last microsecond will be monotonically increasing until 10us are reached. You may get more than a few calls done in a span of 10us and there will be a set of monotonically increasing pseudo-random numbers).
Random numbers have uniform distribution. Each element should have the same probability in theory. In this case, you skew 1 more (twice the probability for 0, 1) and 7 more (thrice the probability for 7, 8, 9) compared to the others in the range 2-6.
Typically answers to this sort of a question will try to get a large range of numbers and distribute the ranges evenly from 1-7. For example, the above method would have worked fine if u had wanted randomness from 1-5 as 10 is evenly divisible by 5. Note that this will only solve (2) above.
For (1), there are other sources of randomness, such as /dev/random on a Linux OS.
You haven't really specified the constraints of the problem you're trying to solve, but if it's from a collection of interview questions it seems likely that it might be something like this.
In any case, the answer shown would not be acceptable for the following reasons:
The distribution of the results is not uniform, even if the samples you read from time.time() are uniform.
The results from time.time() will probably not be uniform. The result depends on the time at which you make the call, and if your calls are not uniformly distributed in time then the results will probably not be uniformly distributed either. In the worst case, if you're trying to randomise an array on a very fast processor then you might complete the entire operation before the time changes, so the whole array would be filled with the same value. Or at least large chunks of it would be.
The changes to the random value are highly predictable and can be inferred from the speed at which your program runs. In the very-fast-computer case you'll get a bunch of x followed by a bunch of x+1, but even if the computer is much slower or the clock is more precise, you're likely to get aliasing patterns which behave in a similarly predictable way.
Since you take the time value in decimal, it's likely that the least significant digit doesn't visit all possible values uniformly. It's most likely a conversion from binary to some arbitrary number of decimal digits, and the distribution of the least significant digit can be quite uneven when that happens.
The code should be much simpler. It's a complicated solution with many special cases, which reflects a piecemeal approach to the problem rather than an understanding of the relevant principles. An ideal solution would make the behaviour self-evident without having to consider each case individually.
The last one would probably end the interview, I'm afraid. Perhaps not if you could tell a good story about how you got there.
You need to understand the pigeonhole principle to begin to develop a solution. It looks like you're reducing the time to its least significant decimal digit for possible values 0 to 9. Legal results are 1 to 7. If you have seven pigeonholes and ten pigeons then you can start by putting your first seven pigeons into one hole each, but then you have three pigeons left. There's nowhere that you can put the remaining three pigeons (provided you only use whole pigeons) such that every hole has the same number of pigeons.
The problem is that if you pick a pigeon at random and ask what hole it's in, the answer is more likely to be a hole with two pigeons than a hole with one. This is what's called "non-uniform", and it causes all sorts of problems, depending on what you need your random numbers for.
You would either need to figure out how to ensure that all holes are filled equally, or you would have to come up with an explanation for why it doesn't matter.
Typically the "doesn't matter" answer is that each hole has either a million or a million and one pigeons in it, and for the scale of problem you're working with the bias would be undetectable.
Using the same general architecture you've created, I would do something like this:
import time
def generate_rand():
ret = str(time.time()) # time in second like, 12345.1234
ret = ret % 8 # will return pseudorandom numbers 0-7
if ret == 0:
return 1 # or you could also return the result of another call to generate_rand()
return ret
while 1:
print(generate_rand())
time.sleep(1)

Huffman encoding with variable length symbols

I'm thinking of using a Huffman code to compress text, but with symbols of variable length (strings). For example (using an underscore as a space):
huffman-code | symbol
------------------------------------
00 | _
01 | E
100 | THE
101 | A
1100 | UP
1101 | DOWN
11100 | .
11101 |
1111...
(etc...)
How can I construct the frequency table? Obviously there are some overlapping issues, the sequence _TH would appear neary as often as THE, but would be useless in the table (both _ and THE have short huffman code).
Does such an algorithm exists? Does it have a special name? What would be the tricks to generate the frequency table? Do I need to tokenize the input? I did not found anything in the litterature / web. (All this make me think also of radix trees).
I was thinking of using an iterative process:
Generate an huffman tree for all symbols of length 1 to N
Remove from the tree all symbols with N>1 and below a certain count threshold
Regenerate a second huffman tree, but this time tokenizing the input with the previous one (probably using a radix tree for lookup)
Repeat to 1 until we converge (or for a few times)
But I can't figure out how can I prevent the problem of overlaps (_TH vs THE) with this.
As long as you tokenize the text properly you don't have to worry about the overlap problem. You can define each token to be a word (longest continuous stream of characters), punctuation symbol or a whitespace character (' ', '\t', \n'). Thus by definition the tokens/symbols do not overlap.
But using Huffman coding directly isn't ideal for compressing text since it cannot make use of the dependencies between the symbols. For e.g. 'q' is likely followed by 'u', 'qu' is likely followed by a vowel, 'thank' is likely followed by 'you' and so on. You may want to look into a high order encoder like 'LZ' which can exploit this redundancy, by converting the data into a sequence of lookup addresses, copy lengths, and deviating symbols. Here's an example of how LZ works. You can then apply Huffman coding on each of the three streams to further compress the data. DEFLATE algorithm works exactly this way.
This is not a complete solution.
Since you have to store both the sequence and the lookup table, maybe you can greedily pick symbols that minimize the storage cost.
Step 1: Store all the symbols of length at most k in a try and keep track of their counts
Step 2: For each probable symbol, calculate the space saved (or compression ratio).
Encode_length(symbol) = log(N) - log(count(symbol))
Space_saved(symbol) = length(symbol)*count(symbol) - Encode_length(symbol)*count(symbol) - (length(symbol)+Encode_length(symbol))
N is the total frequency of all symbols (which we don't know yet, maybe approximate?).
Step 3: Select the optimal symbol and subtract frequency of other symbols that overlap with it.
Step 4: If the whole sequence is not encoded yet pick the next optimal symbol (i.e. go to step 2)
NOTE: This is just a outline and it is neither complete nor computationally efficient. If you are looking for a practical quick solution you should use krjampani's solution. This answer is purely academical.

how to deal with a big text file(about 300M)

There's a text file(about 300M) and I need to count the ten most offen occurred words(some stop words are exclued). Test machine has 8 cores and Linux system, any programming language is welcome and can use open-source framework only(hadoop is not an option), I don't have any mutithread programming experince, where can I start from and how to give a solution cost as little time as possible?
300M is not a big deal, a matter of seconds for your task, even for single core processing in a high-level interpreted language like python if you do it right. Python has an advantage that it will make your word-counting programming very easy to code and debug, compared to many lower-level languages. If you still want to parallelize (even though it will only take a matter of seconds to run single-core in python), I'm sure somebody can post a quick-and-easy way to do it.
How to solve this problem with a good scalability:
The problem can be solved by 2 map-reduce steps:
Step 1:
map(word):
emit(word,1)
Combine + Reduce(word,list<k>):
emit(word,sum(list))
After this step you have a list of (word,#occurances)
Step 2:
map(word,k):
emit(word,k):
Combine + Reduce(word,k): //not a list, because each word has only 1 entry.
find top 10 and yield (word,k) for the top 10. //see appendix1 for details
In step 2 you must use a single reducer, The problem is still scalable, because it (the single reducer) has only 10*#mappers entries as input.
Solution for 300 MB file:
Practically, 300MB is not such a large file, so you can just create a histogram (on memory, with a tree/hash based map), and then output the top k values from it.
Using a map that supports concurrency, you can split the file into parts, and let each thread modify the when it needs. Note that if it cab actually be splitted efficiently is FS dependent, and sometimes a linear scan by one thread is mandatory.
Appendix1:
How to get top k:
Use a min heap and iterate the elements, the min heap will contain the highest K elements at all times.
Fill the heap with first k elements.
For each element e:
If e > min.heap():
remove the smallest element from the heap, and add e instead.
Also, more details in this thread
Assuming that you have 1 word per line, you can do the following in python
from collections import Counter
FILE = 'test.txt'
count = Counter()
with open(FILE) as f:
for w in f.readlines():
count[w.rstrip()] += 1
print count.most_common()[0:10]
Read the file and create a map [Word, count] of all occurring word as keys and the value are the number of occurrences of the words while you read it.
Any language should do the job.
After reading the File once, you have the map.
Then iterate through the map and remember the ten word with the highest count value

Algorithms: random unique string

I need to generate string that meets the following requirements:
it should be a unique string;
string length should be 8 characters;
it should contain 2 digits;
all symbols (non-digital characters) should be upper case.
I will store them in a data base after generation (they will be assigned to other entities).
My intention is to do something like this:
Generate 2 random values from 0 to 9—they will be used for digits in the string;
generate 6 random values from 0 to 25 and add them to 64—they will be used as 6 symbols;
concatenate everything into one string;
check if the string already exists in the data base; if not—repeat.
My concern with regard to that algorithm is that it doesn't guarantee a result in finite time (if there are already A LOT of values in the data base).
Question: could you please give advice on how to improve this algorithm to be more deterministic?
Thanks.
it should be unique string;
string length should be 8 characters;
it should contains 2 digits;
all symbols (non-digital characters) - should be upper case.
Assuming:
requirements #2 and #3 are exact (exactly 8 chars, exactly 2 digits) and not a minimum
the "symbols" in requirement #4 are the 26 capital letters A through Z
you would like an evenly-distributed random string
Then your proposed method has two issues. One is that the letters A - Z are ASCII 65 - 90, not 64 - 89. The other is that it doesn't distribute the numbers evenly within the possible string space. That can be remedied by doing the following:
Generate two different integers between 0 and 7, and sort them.
Generate 2 random numbers from 0 to 9.
Generate 6 random letters from A to Z.
Use the two different integers in step #1 as positions, and put the 2 numbers in those positions.
Put the 6 random letters in the remaining positions.
There are 28 possibilities for the two different integers ((8*8 - 8 duplicates) / 2 orderings), 266 possibilities for the letters, and 100 possibilities for the numbers, the total # of valid combinations being Ncomb = 864964172800 = 8.64 x 1011.
edit: If you want to avoid the database for storage, but still guarantee both uniqueness of strings and have them be cryptographically secure, your best bet is a cryptographically random bijection from a counter between 0 and Nmax <= Ncomb to a subset of the space of possible output strings. (Bijection meaning there is a one-to-one correspondence between the output string and the input counter.)
This is possible with Feistel networks, which are commonly used in hash functions and symmetric cryptography (including AES). You'd probably want to choose Nmax = 239 which is the largest power of 2 <= Ncomb, and use a 39-bit Feistel network, using a constant key you keep secret. You then plug in your counter to the Feistel network, and out comes another 39-bit number X, which you then transform into the corresponding string as follows:
Repeat the following step 6 times:
Take X mod 26, generate a capital letter, and set X = X / 26.
Take X mod 100 to generate your two digits, and set X = X / 100.
X will now be between 0 and 17 inclusive (239 / 266 / 100 = 17.796...). Map this number to two unique digit positions (probably easiest using a lookup table, since we're only talking 28 possibilities. If you had more, use Floyd's algorithm for generating a unique permutation, and use the variable-base technique of mod + integer divide instead of generating a random number).
Follow the random approach above, but use the numbers generated by this algorithm instead.
Alternatively, use 40-bit numbers, and if the output of your Feistel network is > Ncomb, then increment the counter and try again. This covers the entire string space at the cost of rejecting invalid numbers and having to re-execute the algorithm. (But you don't need a database to do this.)
But this isn't something to get into unless you know what you're doing.
Are these user passwords? If so, there are a couple of things you need to take into account:
You must avoid 0/O and I/1, which can easily be mistaken for each other.
You must avoid too many consecutive letters, which might spell out a rude word.
As far as 2 is concerned, you can avoid the problem by using LLNLLNLL as your pattern (L = letter, N = number).
If you need 1 million passwords out of a pool of 2.5 billion, you will certainly get clashes in your database, so you have to deal with them gracefully. But a simple retry is enough, if your random number generator is robust.
I don't see anything in your requirements that states that the string needs to be random. You could just do something like the following pseudocode:
for letters in ( 'AAAAAA' .. 'ZZZZZZ' ) {
for numbers in ( 00 .. 99 ) {
string = letters + numbers
}
}
This will create unique strings eight characters long, with two digits and six upper-case letters.
If you need randomly-generated strings, then you need to keep some kind of record of which strings have been previously generated, so you're going to have to hit a DB (or keep them all in memory, or write them to a textfile) and check against that list.
I think you're safe well into your tens of thousands of such ID's, and even after that you're most likely alright.
Now if you want some determinism, you can always force a password after a certain number of failures. Say after 50 failures, you select a password at random and increment a part of it by 1 until you get a free one.
I'm willing to bet money though that you'll never see the extra functionality kick in during your life time :)
Do it the other way around: generate one big random number that you will split up to obtain the individual characters:
long bigrandom = ...;
int firstDigit = bigRandom % 10;
int secondDigit = ( bigrandom / 10 ) % 10;
and so on.
Then you only store the random number in your database and not the string. Since there's a one-to-one relationship between the string and the number, this doesn't really make a difference.
However, when you try to insert a new value, and it's already in the databse, you can easily find the smallest unallocated number graeter than the originally generated number, and use that instead of the one you generated.
What you gain from this method is that you're guaranteed to find an available code relatively quickly, even when most codes are already allocated.
For one thing, your list of requirements doesn't state that string has to be necessary random, so you might consider something like database index.
If 'random' is a requirement, you can do a few improvements.
Store string as a number in database. Not sure how much this improves perfromance.
Do not store used strings at all. You can employ 'index' approach above, but convert integer number to a string in a seemingly random fashion (e.g., employing bit shift). Without much research, nobody will notice pattern.
E.g., if we have sequence 1, 2, 3, 4, ... and use cyclic binary shift right by 1 bit, it'll be turned into 4, 1, 5, 2, ... (assuming we have 3 bits only)
It doesn't have to be a shift too, it can be a permutation or any other 'randomization'.
The problem with your approach is clearly that while you have few records, you are very unlikely to get collisions but as your number of records grows the chance will increase until it becomes more likely than not that you'll get a collision. Eventually you will be hitting multiple collisions before you get a 'valid' result. Every time will require a table scan to determine if the code is valid, and the whole thing turns into a mess.
The simplest solution is to precalculate your codes.
Start with the first code 00AAAA, and increment to generate 00AAAB, 00AAAC ... 99ZZZZ. Insert them into a table in random order. When you need a new code, retrieve to top record unused record from the table (then mark it as used). It's not a huge table, as pointed out above - only a few million records.
You don't need to calculate any random numbers and generate strings for each user (already done)
You don't need to check whether anything has already been used, just get the next available
No chance of getting multiple collisions before finding something usable.
If you ever need more 'codes', just generate some more 'random' strings and append them to the table.

Resources