Sliding window searching algorithm - algorithm

I need a data structure and algorithm for keeping track of the status of the last N items I have seen. Each item has a status of Pass or Fail, and the system I am monitoring will be deemed to have failed if M items in a row have failed. Once the system is deemed to have failed, I then need to scan back through the history of data and find the last window of width W in which all items had a "good" status.
For example, with M = 4 and W = 3:
1 Good
2 Good
3 Good
4 Good
5 Good |
6 Good |- Window of size 3 where all are good.
7 Good |
8 Bad
9 Bad
10 Good
11 Good
12 Bad
13 Good
14 Bad
15 Bad
16 Bad
17 Bad <== System is deemed bad at this point So scan backwards to find "Good" window.
I know that this is going to end up in something like a regular expression search and have vague recollections of Knuth floating up out of the dark recesses of my memory, so could anyone point me towards a simple introduction on how to do this? Also, for what it is worth, I will be implementing this in C# .NET 3.5 on a Windows XP system seeing 3GB of RAM (and an i7 processor - sniff, the machine used to have Windows 7 and it does have 8GB of memory - but that is a story for TDWTF).
Finally, I will be scanning numbers of items in the 100,000s to millions in any given run of this system. I won't need to keep track of the entire run, just the subset of all items until a system failure occurs. When that happens I can dump all my collected data and start the process all over again. However, for each item I am tracking, I will have to keep at least the pass/fail status and a 10-character string. So I am looking for suggestions on how to collect and maintain this data in the system as well. Although I am tempted to say - "meh, it will all fit in memory even if the entire run passes with 100%, so it's off to an array for you!"

I know that this is going to end up in something like a regular expression search
The problem is, actually, much simpler. We can take advantage of the fact that we're searching for subsequences consisting only of bad results (or only good results).
Something like this should work
// how many consecutive bad results we have at this point
int consecutiveFailures = 0;
// same for good results
int consecutivePasses = 0;
for each result
    if result == 'pass' then
        consecutiveFailures = 0;
        ++consecutivePasses;
    else if result == 'fail' then
        consecutivePasses = 0;
        ++consecutiveFailures;
    end
    if consecutiveFailures == M
        // M consecutive failures, stop processing
        ...
    end
    if consecutivePasses >= W
        // record last set of W consecutive passes for later use
        ...
    end
end
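As a rough illustration of the two-counter idea, here is a minimal Python sketch (the asker's target is C#, so treat this only as an outline to port). The function name `monitor`, the boolean encoding of pass/fail, and the returned tuple shape are assumptions made for this example; they are not part of the original answer.

```python
def monitor(results, m, w):
    """Scan pass/fail results (True = pass) in a single pass.

    Returns (failure_index, good_window) once m consecutive failures are
    seen, where good_window is the (start, end) index pair of the most
    recent run of w consecutive passes, or None if there was no such run.
    Returns None if the system never fails. Indices are 0-based.
    """
    consecutive_failures = 0
    consecutive_passes = 0
    last_good_end = None  # index of the last item closing a window of w passes

    for i, passed in enumerate(results):
        if passed:
            consecutive_failures = 0
            consecutive_passes += 1
            if consecutive_passes >= w:
                last_good_end = i
        else:
            consecutive_passes = 0
            consecutive_failures += 1
            if consecutive_failures == m:
                if last_good_end is None:
                    return i, None
                return i, (last_good_end - w + 1, last_good_end)
    return None


# The example from the question: items 1-7 good, 8-9 bad, 10-11 good,
# 12 bad, 13 good, 14-17 bad.
history = [True] * 7 + [False, False, True, True, False, True] + [False] * 4
print(monitor(history, m=4, w=3))
```

Because the window position is recorded while scanning forward, no backward scan over the history is needed when the failure is detected.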

Related

Random number generation from 1 to 7

I was going through Google interview questions and came across one asking to implement random number generation from 1 to 7.
I wrote some simple code, and I would like to understand: if this question were asked in an interview and I wrote the code below, would it be acceptable or not?
import time

def generate_rand():
    ret = str(time.time())  # time in seconds, like 12345.1234
    ret = int(ret[-1])
    if ret == 0 or ret == 1:
        return 1
    elif ret > 7:
        ret = ret - 7
        return ret
    return ret

while 1:
    print(generate_rand())
    time.sleep(1)  # Just to see the output in the STDOUT
(Since the question seems to ask for analysis of issues in the code and not a solution, I am not providing one. )
The answer is unacceptable because:
1. You need to wait for a second for each random number. Many applications need a few hundred at a time. (If the sleep is just for convenience, note that even microsecond granularity will not yield truly random numbers: the last digit increases monotonically until 10 µs have elapsed, so several calls made within a span of 10 µs will produce a monotonically increasing set of pseudo-random numbers.)
2. Random numbers should have a uniform distribution: each element should have the same probability in theory. In this case the results are skewed. The digits 0, 1 and 8 all map to 1 (probability 3/10) and the digits 2 and 9 both map to 2 (probability 2/10), while each value from 3 to 7 gets only 1/10.
Typically, answers to this sort of question will try to get a large range of numbers and distribute the ranges evenly from 1-7. For example, the above method would have worked fine if you had wanted randomness from 1-5, as 10 is evenly divisible by 5. Note that this will only solve (2) above.
For (1), there are other sources of randomness, such as /dev/random on a Linux OS.
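To make the "large range, distributed evenly" idea concrete, here is a minimal Python sketch using rejection sampling on bytes from the operating system's entropy source. The function name `rand_1_to_7` is made up for this example, and the approach is one common option, not necessarily what an interviewer expects.

```python
import os


def rand_1_to_7():
    """Return an integer in 1..7 with equal probability for each value."""
    while True:
        value = os.urandom(1)[0]  # one byte from the OS: an integer in 0..255
        # 252 = 36 * 7, so the values 0..251 split into 7 equally sized groups.
        # Values 252..255 are discarded and a new byte is drawn.
        if value < 252:
            return value % 7 + 1


print([rand_1_to_7() for _ in range(10)])
```

Discarding the four leftover values is what removes the bias; on average fewer than 2% of the draws are thrown away.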
You haven't really specified the constraints of the problem you're trying to solve, but if it's from a collection of interview questions it seems likely that it might be something like this.
In any case, the answer shown would not be acceptable for the following reasons:
The distribution of the results is not uniform, even if the samples you read from time.time() are uniform.
The results from time.time() will probably not be uniform. The result depends on the time at which you make the call, and if your calls are not uniformly distributed in time then the results will probably not be uniformly distributed either. In the worst case, if you're trying to randomise an array on a very fast processor then you might complete the entire operation before the time changes, so the whole array would be filled with the same value. Or at least large chunks of it would be.
The changes to the random value are highly predictable and can be inferred from the speed at which your program runs. In the very-fast-computer case you'll get a bunch of x followed by a bunch of x+1, but even if the computer is much slower or the clock is more precise, you're likely to get aliasing patterns which behave in a similarly predictable way.
Since you take the time value in decimal, it's likely that the least significant digit doesn't visit all possible values uniformly. It's most likely a conversion from binary to some arbitrary number of decimal digits, and the distribution of the least significant digit can be quite uneven when that happens.
The code should be much simpler. It's a complicated solution with many special cases, which reflects a piecemeal approach to the problem rather than an understanding of the relevant principles. An ideal solution would make the behaviour self-evident without having to consider each case individually.
The last one would probably end the interview, I'm afraid. Perhaps not if you could tell a good story about how you got there.
You need to understand the pigeonhole principle to begin to develop a solution. It looks like you're reducing the time to its least significant decimal digit for possible values 0 to 9. Legal results are 1 to 7. If you have seven pigeonholes and ten pigeons then you can start by putting your first seven pigeons into one hole each, but then you have three pigeons left. There's nowhere that you can put the remaining three pigeons (provided you only use whole pigeons) such that every hole has the same number of pigeons.
The problem is that if you pick a pigeon at random and ask what hole it's in, the answer is more likely to be a hole with two pigeons than a hole with one. This is what's called "non-uniform", and it causes all sorts of problems, depending on what you need your random numbers for.
You would either need to figure out how to ensure that all holes are filled equally, or you would have to come up with an explanation for why it doesn't matter.
Typically the "doesn't matter" answer is that each hole has either a million or a million and one pigeons in it, and for the scale of problem you're working with the bias would be undetectable.
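The pigeonhole argument can be checked by simply counting. The short Python sketch below assigns the ten possible digits to seven holes with a modulo (the choice of `digit % 7 + 1` is just one illustrative mapping, an assumption for this example) and shows that some holes necessarily end up with two pigeons.

```python
from collections import Counter

# Ten pigeons (the digits 0..9) placed into seven holes (the results 1..7).
holes = Counter(digit % 7 + 1 for digit in range(10))

for hole in sorted(holes):
    print(hole, holes[hole])
```

Whatever mapping is used, ten is not a multiple of seven, so the counts can never all be equal.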
Using the same general architecture you've created, I would do something like this:
import time

def generate_rand():
    # Use the whole microsecond count as an integer; taking % 8 of a str would raise a TypeError
    ret = int(time.time() * 1000000)
    ret = ret % 8  # will return pseudorandom numbers 0-7
    if ret == 0:
        return 1  # or you could also return the result of another call to generate_rand()
    return ret

while 1:
    print(generate_rand())
    time.sleep(1)

how to deal with a big text file(about 300M)

There's a text file (about 300M) and I need to find the ten most frequently occurring words (some stop words are excluded). The test machine has 8 cores and runs Linux. Any programming language is welcome, but only open-source frameworks can be used (Hadoop is not an option). I don't have any multithreaded programming experience, so where can I start, and how can I build a solution that takes as little time as possible?
300M is not a big deal: it is a matter of seconds for your task, even for single-core processing in a high-level interpreted language like Python, if you do it right. Python has the advantage that it makes your word-counting program very easy to code and debug compared to many lower-level languages. If you still want to parallelize (even though it will only take a matter of seconds to run single-core in Python), I'm sure somebody can post a quick-and-easy way to do it.
How to solve this problem with a good scalability:
The problem can be solved by 2 map-reduce steps:
Step 1:
    map(word):
        emit(word,1)
    Combine + Reduce(word,list<k>):
        emit(word,sum(list))
After this step you have a list of (word,#occurrences)
Step 2:
    map(word,k):
        emit(word,k)
    Combine + Reduce(word,k): //not a list, because each word has only 1 entry.
        find top 10 and yield (word,k) for the top 10. //see appendix1 for details
In step 2 you must use a single reducer. The problem is still scalable, because the single reducer has only 10*#mappers entries as input.
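The two steps can be imitated in a single process to see how the data flows. This is only a sketch in plain Python, not a real map-reduce job: the function names `count_words` and `top_k_by_count`, the way the word list is pre-split into chunks, and the `partitions` parameter are all assumptions made for illustration.

```python
from collections import Counter
import heapq


def count_words(chunks):
    """Step 1: count each chunk locally (map + combine), then sum (reduce)."""
    total = Counter()
    for chunk in chunks:
        total.update(Counter(chunk))
    return total


def top_k_by_count(counts, k, partitions=4):
    """Step 2: each partition keeps its local top k, one reducer merges them."""
    items = list(counts.items())
    candidates = []
    for p in range(partitions):
        part = items[p::partitions]  # the entries this "mapper" is given
        candidates.extend(heapq.nlargest(k, part, key=lambda kv: kv[1]))
    # The single reducer sees at most k * partitions entries.
    return heapq.nlargest(k, candidates, key=lambda kv: kv[1])


chunks = [["a", "b", "a"], ["b", "a", "c"]]
print(top_k_by_count(count_words(chunks), k=2))
```

The local top-k step is safe only because, after step 1, every word appears in exactly one partition with its final count.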
Solution for 300 MB file:
Practically, 300MB is not such a large file, so you can just create a histogram (in memory, with a tree- or hash-based map), and then output the top k values from it.
Using a map that supports concurrency, you can split the file into parts and let each thread update the map as needed. Note that whether the file can actually be split efficiently is file-system dependent, and sometimes a linear scan by one thread is mandatory.
Appendix1:
How to get top k:
Use a min heap and iterate over the elements; the min heap will contain the highest k elements at all times.
Fill the heap with the first k elements.
For each remaining element e:
    If e > heap.min():
        remove the smallest element from the heap, and add e instead.
Also, more details in this thread
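The min-heap procedure from the appendix maps directly onto Python's standard `heapq` module. The sketch below is one possible rendering; the function name `largest_k` is made up for this example, and it works on plain comparable values (for words you would push `(count, word)` pairs instead).

```python
import heapq


def largest_k(elements, k):
    """Return the k largest elements, biggest first, using a size-k min heap."""
    heap = []
    for e in elements:
        if len(heap) < k:
            heapq.heappush(heap, e)      # fill the heap with the first k elements
        elif e > heap[0]:                # heap[0] is the smallest element kept so far
            heapq.heapreplace(heap, e)   # drop the smallest, add e instead
    return sorted(heap, reverse=True)


print(largest_k([5, 1, 9, 3, 7, 8, 2], 3))
```

The heap never holds more than k elements, so memory use is O(k) and the running time is O(n log k).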
Assuming that you have 1 word per line, you can do the following in python
from collections import Counter

FILE = 'test.txt'
count = Counter()
with open(FILE) as f:
    # Iterate over the file line by line instead of loading it all with readlines()
    for w in f:
        count[w.rstrip()] += 1
print(count.most_common(10))
Read the file and, while you read it, build a map [word, count] with each occurring word as the key and its number of occurrences as the value.
Any language should do the job.
After reading the file once, you have the map.
Then iterate through the map and remember the ten words with the highest count values.

time series simulation and logical checking with Matlab or with other tools

1) I have time series data and signals (indicators) whose values change over time.
My question:
2) I need to do logical checking all the time, e.g. if signal 1 and 2 happened around the same time (were equal to a certain value e.g.=1) then I need to know the exact time in order to check what happened next.
3) to complicate things,e.g. if signal 3 happened in some time range after signal 1 and signal 2 were equal to 1, I would like to check other things.
4)The time series is very long and I need to deal with it segment by segment.
Please advise how to write it without reinventing the wheel.
Is it recommended to write it in Matlab?, using a state machine? in C++?, using threads?
5) Does Matlab have a simulator ready for this kind of things?
How do I define the logical conditions in an efficient way?
6) Can I use data mining tools for this?
I saw this list of tools:
Data Mining open source tools
not sure where to start.
Thanks
The second and third question could be done like this in Matlab:
T = -range; % Assuming that t starts at 0.
for n = 1 : length(t)
    if signal1(n) == 1 && signal2(n) == 1
        T = t(n);
    end
    if t(n) - T < range && signal3(n) == 1
        if true % Placeholder: replace with the conditions you want checked (they could also go in the previous if statement).
            % Things you want executed if these conditions are met.
        end
    end
end
Using a lower-level programming language like C++ would probably make this run faster. And if the data is very long, it could also reduce memory use by loading one element of each array at a time.
Matlab has a simulator, called Simulink, but that is meant for more complicated problems, whereas you only want to do something conditionally.
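For readers working outside Matlab, the same single-pass check could look roughly like this in Python. The function name `find_events`, the parameter name `window` (standing in for `range` in the Matlab loop), and the list of returned time pairs are assumptions made for this sketch.

```python
def find_events(t, signal1, signal2, signal3, window):
    """Return (trigger_time, event_time) pairs.

    A trigger is any sample where signal1 and signal2 are both 1. An event
    is a sample where signal3 is 1 less than `window` time units after the
    most recent trigger.
    """
    events = []
    last_trigger = None  # time of the most recent sample with signal1 == signal2 == 1
    for now, s1, s2, s3 in zip(t, signal1, signal2, signal3):
        if s1 == 1 and s2 == 1:
            last_trigger = now
        if last_trigger is not None and now - last_trigger < window and s3 == 1:
            events.append((last_trigger, now))
    return events


t = [0, 1, 2, 3, 4, 5]
s1 = [0, 1, 0, 0, 0, 0]
s2 = [0, 1, 0, 0, 0, 0]
s3 = [1, 0, 1, 0, 0, 1]
print(find_events(t, s1, s2, s3, window=2))
```

Because the loop only looks at one sample at a time, the inputs can just as well be generators that read a long series segment by segment.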

What methods can I use to analyse and guess 4-bit checksum algorithm?

[Background Story]
I am working with a 5 year old user identification system, and I am trying to add IDs to the database. The problem I have is that the system that reads the ID numbers requires some sort of checksum, and no-one working here now has ever worked with it, so no-one knows how it works.
I have access to the list of existing IDs, which already have correct checksums. Also, as the checksum only has 16 possible values, I can create any ID I want and run it through the authentication system up to 16 times until I get the correct checksum (but this is quite time consuming)
[Question]
What methods can I use to help guess the checksum algorithm used for some data?
I have tried a few simple methods such as XORing and summing, but these have not worked.
So my question is: if I have data (in hexadecimal) like this:
data checksum
00029921 1
00013481 B
00026001 3
00004541 8
What methods can I use to work out what sort of checksum is used?
i.e. should I try sequential numbers such as 00029921,00029922,00029923,... or 00029911,00029921,00029931,... If I do this what patterns should I look for in the changing checksum?
Similarly, would comparing swapped digits tell me anything useful about the checksum?
i.e. 00013481 and 00031481
Is there anything else that could tell me something useful? What about inverting one bit, or maybe one hex digit?
I am assuming that this will be a common checksum algorithm, but I don't know where to start in testing it.
I have read the following links, but I am not sure if I can apply any of this to my case, as I don't think mine is a CRC.
stackoverflow.com/questions/149617/how-could-i-guess-a-checksum-algorithm
stackoverflow.com/questions/2896753/find-the-algorithm-that-generates-the-checksum
cosc.canterbury.ac.nz/greg.ewing/essays/CRC-Reverse-Engineering.html
[ANSWER]
I have now downloaded a much larger list of data, and it turned out to be simpler than I was expecting, but for completeness, here is what I did.
data:
00024901 A
00024911 B
00024921 C
00024931 D
00042811 A
00042871 0
00042881 1
00042891 2
00042901 A
00042921 C
00042961 0
00042971 1
00042981 2
00043021 4
00043031 5
00043041 6
00043051 7
00043061 8
00043071 9
00043081 A
00043101 3
00043111 4
00043121 5
00043141 7
00043151 8
00043161 9
00043171 A
00044291 E
From these, I could see that when just one digit was increased by some amount, the checksum was also increased by the same amount, as in:
00024901 A
00024911 B
Also, two digits swapped did not change the checksum:
00024901 A
00042901 A
This means that the polynomial value (the weight) for these two positions, at least, must be the same.
Finally, the checksum for 00000000 was A, so I calculated the sum of digits plus A mod 16:
((Σ x_i) + 0xA) mod 16
And this matched for all the values I had. Just to check that there was nothing sneaky going on with the first 3 digits that never changed in my data, I made up and tested some numbers as Eric suggested, and those all worked with this too!
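The formula is easy to check against the samples listed in this post. Here is a minimal Python sketch; the function name `checksum` is made up for this example, and it assumes each character of the ID is one hexadecimal digit.

```python
def checksum(hex_id):
    """Sum of the hex digits plus 0xA, modulo 16, as one uppercase hex digit."""
    digit_sum = sum(int(ch, 16) for ch in hex_id)
    return format((digit_sum + 0xA) % 16, "X")


# A few (data, checksum) pairs taken from the lists above.
samples = {
    "00029921": "1",
    "00013481": "B",
    "00024901": "A",
    "00042871": "0",
    "00044291": "E",
}
for data, expected in samples.items():
    print(data, checksum(data), expected)
```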
Many checksums I've seen use simple weighted values based on the position of the digits. For example, if the weights are 3, 5, 7, the checksum might be 3*c[0] + 5*c[1] + 7*c[2], then mod 10 for the result. (In your case, mod 16, since you have a 4-bit checksum.)
To check if this might be the case, I suggest that you feed some simple values into your system to get an answer:
1000000 = ?
0100000 = ?
0010000 = ?
... etc. If there are simple weights based on position, this may reveal it. Even if the algorithm is something different, feeding in nice, simple values and looking for patterns may be enlightening. As Matti suggested, you/we will likely need to see more samples before decoding the pattern.
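The probing idea can be sketched in Python as follows. Everything here is hypothetical: `recover_weights` is a made-up helper, and `hypothetical_oracle` stands in for the real authentication system, using the example weights 3, 5, 7 plus an arbitrary constant offset so that the sketch has something to query.

```python
def recover_weights(oracle, length, modulus=16):
    """Probe a checksum oracle with all-zero and single-1 inputs.

    If the checksum is a weighted digit sum plus a constant (mod modulus),
    the difference between each probe and the all-zero result is the weight
    of that position.
    """
    base = oracle([0] * length)
    weights = []
    for pos in range(length):
        probe = [0] * length
        probe[pos] = 1
        weights.append((oracle(probe) - base) % modulus)
    return weights


def hypothetical_oracle(digits):
    # Stand-in for the real system: weights 3, 5, 7 and a constant offset of 0xA.
    return (sum(w * d for w, d in zip([3, 5, 7], digits)) + 0xA) % 16


print(recover_weights(hypothetical_oracle, 3))
```

If the recovered weights fail to predict the checksum of other inputs, the algorithm is not a simple weighted sum and a different model is needed.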

First-Occurrence Parallel String Matching Algorithm

To be up front, this is homework. That being said, it's extremely open ended and we've had almost zero guidance as to how to even begin thinking about this problem (or parallel algorithms in general). I'd like pointers in the right direction and not a full solution. Any reading that could help would be excellent as well.
I'm working on an efficient way to match the first occurrence of a pattern in a large amount of text using a parallel algorithm. The pattern is simple character matching, no regex involved. I've managed to come up with a possible way of finding all of the matches, but that then requires that I look through all of the matches and find the first one.
So the question is, will I have more success breaking the text up between processes and scanning that way? Or would it be best to have process-synchronized searching of some sort where the j'th process searches for the j'th character of the pattern? If then all processes return true for their match, the processes would change their position in matching said pattern and move up again, continuing until all characters have been matched and then returning the index of the first match.
What I have so far is extremely basic, and more than likely does not work. I won't be implementing this, but any pointers would be appreciated.
With p processors, a text of length t, and a pattern of length L, and a ceiling of L processors used:
for i = 0 to t-L:
    for j = 0 to p-1:
        processor j compares text[i+j] to pattern[j]
    On false match:
        all processors terminate current comparison, i++
    On true match by all processors:
        Iterate p characters at a time until L characters have been compared
        If all L comparisons return true:
            return i (position of pattern)
        Else:
            i++
I am afraid that breaking the string will not do.
Generally speaking, early escaping is difficult, so you'd be better off breaking the text in chunks.
But let's ask Herb Sutter to explain searching with parallel algorithms first on Dr Dobbs. The idea is to use the non-uniformity of the distribution to have an early return. Of course Sutter is interested in any match, which is not the problem at hand, so let's adapt.
Here is my idea, let's say we have:
Text of length N
p Processors
heuristic: max is the maximum number of characters a chunk should contain, probably an order of magnitude greater than M, the length of the pattern.
Now, what you want is to split your text into k equal chunks, where k is minimal and size(chunk) is maximal yet smaller than max.
Then we have a classical Producer-Consumer pattern: the p processes are fed the chunks of text, each process looking for the pattern in the chunk it receives.
The early escape is done by having a flag. You can either set the index of the chunk in which you found the pattern (and its position), or you can just set a boolean and store the result in the processes themselves (in which case you'll have to go through all the processes once they have stopped). The point is that each time a chunk is requested, the producer checks the flag and stops feeding the processes if a match has been found (since the processes have been given the chunks in order).
Let's have an example, with 3 processors:
[ 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 ]
                      x       x
The chunks 6 and 8 both contain the string.
The producer will first feed 1, 2 and 3 to the processes, then each process will advance at its own rhythm (it depends on the similarity of the text searched and the pattern).
Let's say we find the pattern in 8 before we find it in 6. Then the process that was working on 7 ends and tries to get another chunk, the producer stops it --> it would be irrelevant. Then the process working on 6 ends, with a result, and thus we know that the first occurrence was in 6, and we have its position.
The key idea is that you don't want to look at the whole text! It's wasteful!
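The control flow described above can be sketched in Python with threads. Several things here are assumptions for illustration: the function name `first_occurrence`, the `chunk_size` and `workers` parameters, and the overlap of `len(pattern) - 1` characters between chunks (added so that a match straddling a chunk boundary is not missed). In CPython, threads mainly show the hand-out logic; real speedups would need processes or another runtime.

```python
import threading


def first_occurrence(text, pattern, chunk_size, workers=3):
    """Return the index of the first occurrence of a non-empty pattern, or -1."""
    overlap = len(pattern) - 1
    starts = iter(range(0, len(text), chunk_size))
    lock = threading.Lock()
    found = []  # match positions reported by the workers

    def next_chunk():
        # Producer: hand out chunk starts in order, stop once any match is known.
        with lock:
            if found:
                return None
            return next(starts, None)

    def worker():
        while True:
            start = next_chunk()
            if start is None:
                return
            end = min(len(text), start + chunk_size + overlap)
            pos = text.find(pattern, start, end)
            if pos != -1:
                with lock:
                    found.append(pos)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    # Chunks are handed out in order, so every chunk before a matching one
    # has already been assigned and searched by the time all workers finish.
    return min(found) if found else -1


print(first_occurrence("xxabxxabxx", "ab", chunk_size=3))
```

Taking the minimum at the end covers the case from the example where a later chunk reports its match before an earlier one does.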
Given a pattern of length L, and searching in a string of length N over P processors, I would just split the string over the processors. Each processor would take a chunk of length N/P + L-1, with the last L-1 characters overlapping the chunk belonging to the next processor. Then each processor would perform Boyer-Moore (the two pre-processing tables would be shared). When each finishes, it returns its result to the first processor, which maintains a table:
Process Index
1 -1
2 2
3 23
After all processes have responded (or with a bit of thought you can have an early escape), you return the first match. This should be on average O(N/(L*P) + P).
The approach of having the i'th processor matching the i'th character would require too much inter process communication overhead.
EDIT: I realize you already have a solution, and are trying to figure out a way to avoid finding all matches. Well, I don't really think that is necessary. You can come up with some early escape conditions, and they aren't that difficult, but I don't think they'll improve your performance that much in general (unless you have some additional knowledge about the distribution of matches in your text).
