Known algorithms for incremental longest common substring? - algorithm

Longest common substring is a well-known problem. I'm interested in identifying common substrings in a set of strings that is growing over time.
Use case: suppose I'm emitting log messages inside a tight loop. Many logging libraries offer a flag for suppressing duplicate messages, but what about similar messages? Contrived example:
for file in files {
    if !process_file(file) {
        warn!("Failed to process {}.", file);
    }
}
If files has, say, O(1) elements, this is all well & good. If it has O(10^6) elements, I'd like to see something like this:
Failed to process abc000000.txt.
Failed to process abc000001.txt.
Failed to process abc000002.txt.
[Repeated 912345 times]
in my log file.
I'm picturing a logging implementation that keeps track of longest common substring between the current & last log messages. If it spikes, it begins tracking all subsequent log messages. If more than n consecutive log messages display similar longest common substrings, it stops emitting them & starts a count.
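For concreteness, here is a rough Python sketch of what I have in mind (using difflib for the longest-match step; the class name and parameters are just placeholders, not an existing library):

from difflib import SequenceMatcher

class SimilarMessageSuppressor:
    """Suppress runs of log messages that share a long common substring
    with their predecessor, emitting a summary count instead."""

    def __init__(self, emit=print, min_match=0.6, threshold=5):
        self.emit = emit            # e.g. print, or a logger method
        self.min_match = min_match  # fraction of the message that must match
        self.threshold = threshold  # consecutive similar messages before suppressing
        self.last = None
        self.run = 0
        self.suppressed = 0

    def log(self, msg):
        similar = False
        if self.last is not None:
            matcher = SequenceMatcher(None, self.last, msg)
            match = matcher.find_longest_match(0, len(self.last), 0, len(msg))
            similar = match.size >= self.min_match * max(len(msg), 1)
        self.last = msg
        if not similar:
            self.flush()            # end of a similar run: report the count
            self.run = 0
        else:
            self.run += 1
        if self.run >= self.threshold:
            self.suppressed += 1    # stop emitting, start counting
        else:
            self.emit(msg)

    def flush(self):
        if self.suppressed:
            self.emit("[Repeated %d times]" % self.suppressed)
            self.suppressed = 0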
Any prior art, here?

Related

Split text message into chunks of 30 chars limit with suffixed (k/n)

There is a text-messaging service that sends messages to the user that can be at most 30 characters long. Create a function that splits the text into chunks so that it can be sent in multiple messages.
Each chunk can be:
up to 30 chars long
no word should be split in the middle.
each chunk has to have its order suffixed in the form of '(k/n)',
e.g "this is from chunk (1/2)", "this is the second chunk (2/2)"
if the text provided to the function is within the 30-char limit, no ordering should be suffixed.
A space is not considered part of a word.
No word will be longer than the chunk limit, including room for the suffix.
The text can have up to N = 1000 characters.
Solution: I am thinking DP/greedy might determine the number of chunks without the suffix (k/n), but how do I know the total number of suffixes beforehand?
In terms of the greedy approach that would be correct. More rigorously, we know by induction:
Base case:
Text message can fit into one message (<= 30 chars). In this case the greedy approach would be the same as the optimal solution (one message), thus it is correct.
Inductive step:
Let's assume that for a text that requires n messages to send, the greedy solution is optimal. Now let's prove this for the n+1 case. With the greedy approach, everything except for the final chunk fits into the first n messages, each of which reserves 5 characters for the message numbering. We are then left with <= 25 characters, and these fit into one message, which is optimal as shown by the base case.
--
Let's move forward and talk about how we would find out the number of suffixes we need. We have the case in which we need no suffixes (the text fits into one message); this happens if and only if the text is <= 30 characters.
However, if it is longer, this is where it gets tricky. We now know that each chunk must be <= 25 characters to leave room for the suffix. So we must iterate through the complete message we are trying to send and check the character at the 25-character mark. If it is a space, perfect: we increase the number of suffixes we need by one, discard the space and everything before it, and continue on our way. However, if it is not a space, the problem specifies that every word must be complete, so we have to backtrack until we find a space; that is where we cut, and the process continues as before. We repeat this until the entire message has been processed.
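A minimal Python sketch of that greedy procedure (the function name and the fixed-size suffix reservation are mine; like the argument above, it assumes single-digit chunk numbers):

def split_message(text, limit=30):
    """Greedy split: if the text fits in one message, send it as-is;
    otherwise reserve room for a ' (k/n)' suffix on every chunk and
    cut only at word boundaries."""
    if len(text) <= limit:
        return [text]
    reserve = len(" (9/9)")          # assumes single-digit chunk numbers
    budget = limit - reserve
    chunks, current = [], ""
    for word in text.split():        # the problem guarantees no word exceeds the budget
        candidate = word if not current else current + " " + word
        if len(candidate) <= budget:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    n = len(chunks)
    return ["%s (%d/%d)" % (chunk, i, n) for i, chunk in enumerate(chunks, 1)]

For example, split_message("this is from chunk this is the second chunk") returns ["this is from chunk this (1/2)", "is the second chunk (2/2)"], with every chunk within the 30-character limit.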

Performance of mapreduce with sorted keys

In Google's paper it states:
We guarantee that within a given partition, the intermediate
key/value pairs are processed in increasing key order. This ordering
guarantee makes it easy to generate a sorted output file per
partition, which is useful when the output file format needs to
support efficient random access lookups by key, or users of the output
find it convenient to have the data sorted.
I also tried a simple "wordcount" example with "mrJob" (Python) and it works as expected. In the output, all keys (words) are in alphabetical (ascending) order.
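For reference, a plain-Python stand-in for that experiment (not actual mrJob code; function names are mine) looks roughly like this -- each reducer receives its own disjoint partition of keys and sorts only that partition:

from collections import Counter

def map_phase(lines):
    # emit (word, 1) pairs in arbitrary order
    for line in lines:
        for word in line.split():
            yield word, 1

def partition(pairs, num_reducers):
    # assign each key to exactly one reducer
    parts = [[] for _ in range(num_reducers)]
    for word, one in pairs:
        parts[hash(word) % num_reducers].append((word, one))
    return parts

def reduce_phase(pairs):
    # each reducer sorts and counts only its own partition
    counts = Counter()
    for word, one in sorted(pairs):
        counts[word] += one
    return sorted(counts.items())

outputs = [reduce_phase(part)
           for part in partition(map_phase(["the quick brown fox", "the lazy dog"]), 2)]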
However, I don't understand why this is possible. MapReduce is for parallel processing, which means all sub-processes are independent. For example, a reducer calculates the frequency of a word, writes it to the output, and finishes. But to produce ordered output, a sub-process apparently has to wait for the others in order to compare keys before writing its own output. This means it cannot leverage the power of parallel processing.
So how do they solve it?
Thank you.

Parallel top ten algorithm for distributed data

This is an interview question. Suppose there are a few computers and each computer keeps a very large log file of visited URLs. Find the top ten most visited URLs.
For example: Suppose there are only 3 computers and we need the top two most visited URLs.
Computer A: url1, url2, url1, url3
Computer B: url4, url2, url1, url1
Computer C: url3, url4, url1, url3
url1 appears 5 times in all logs
url2 appears 2 times
url3 appears 3 times
url4 appears 2 times
So the answer is url1, url3
The log files are too large to fit in RAM or to copy over the network. As I understand it, it is also important to make the computation parallel and use all the given computers.
How would you solve it?
This is a pretty standard problem for which there is a well-known solution. You simply sort the log files on each computer by URL and then merge them through a priority queue of size k (the number of items you want) on the "master" computer. This technique has been around since the 1960s, and is still in use today (although slightly modified) in the form of MapReduce.
On each computer, extract the URL and the count from the log file, and sort by URL. Because the log files are larger than will fit into memory, you need to do an on-disk merge. That entails reading a chunk of the log file, sorting by URL, writing the chunk to disk. Reading the next chunk, sorting, writing to disk, etc. At some point, you have M log file chunks, each sorted. You can then do an M-way merge. But instead of writing items to disk, you present them, in sorted order (sorted by URL, that is), to the "master".
Each machine sorts its own log.
The "master" computer merges the data from the separate computers and does the top K selection. This is actually two problems, but can be combined into one.
The master creates two priority queues: one for the merge, and one for the top K selection. The first is of size N, where N is the number of computers it's merging data from. The second is of size K: the number of items you want to select. I use a min heap for this, as it's easy and reasonably fast.
To set up the merge queue, initialize the queue and get the first item from each of the "worker" computers. In the pseudo-code below, "get lowest item from merge queue" means getting the root item from the merge queue and then getting the next item from whichever worker computer presented that item. So if the queue contains [1, 2, 3], and the items came from computers B, C, A (in that order), then taking the lowest item would mean getting the next item from computer B and adding it to the priority queue.
The master then does the following:
working = get lowest item from merge queue
while (items left to merge)
{
    temp = get lowest item from merge queue
    while (temp.url == working.url)
    {
        working.count += temp.count
        temp = get lowest item from merge queue
    }
    // Now have merged counts for one url.
    if (topK.Count < desired_count)
    {
        // topK queue doesn't have enough items yet, so add this one.
        topK.Add(working)
    }
    else if (topK.Peek().count < working.count)
    {
        // the count for this url is larger than the smallest item on the heap:
        // replace the smallest on the heap with this one
        topK.RemoveRoot()
        topK.Add(working)
    }
    working = temp
}
// Here you need to check the last item:
if (topK.Count < desired_count)
{
    // topK queue isn't full yet, so add this one.
    topK.Add(working)
}
else if (topK.Peek().count < working.count)
{
    // the count for this url is larger than the smallest item on the heap:
    // replace the smallest on the heap with this one
    topK.RemoveRoot()
    topK.Add(working)
}
At this point, the topK queue has the K items with the highest counts.
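For illustration, a compact Python version of that merge-plus-heap step (my own sketch, not the pseudo-code above) could look like this, assuming each worker presents an iterator of (url, count) pairs already sorted by url:

import heapq
from itertools import groupby

def master_top_k(sorted_streams, k):
    """N-way merge of the workers' URL-sorted (url, count) streams,
    combining counts per URL and keeping the k largest totals
    in a size-k min-heap."""
    topk = []                                      # min-heap of (count, url)
    merged = heapq.merge(*sorted_streams)          # merge priority queue of size N
    for url, group in groupby(merged, key=lambda pair: pair[0]):
        total = sum(count for _, count in group)   # merged count for one url
        if len(topk) < k:
            heapq.heappush(topk, (total, url))
        elif topk[0][0] < total:
            heapq.heapreplace(topk, (total, url))  # replace smallest on the heap
    return sorted(topk, reverse=True)              # highest counts first

Fed the three logs from the question (after each computer counts and sorts its own URLs), this returns url1 with 5 and url3 with 3, matching the expected answer.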
So each computer has to do a merge sort, which is O(n log n), where n is the number of items in that computer's log. The merge on the master is O(n), where n is the sum of all the items from the individual computers. Picking the top k items is O(n log k), where n is the number of unique urls.
The sorts are done in parallel, of course, with each computer preparing its own sorted list. But the "merge" part of the sort is done at the same time the master computer is merging, so there is some coordination, and all machines are involved at that stage.
Given the scale of the log files and the generic nature of the question, this is quite a difficult problem to solve. I do not think that there is one best algorithm for all situations. It depends on the nature of the contents of the log files. For example, take the corner case in which all URLs are unique across all log files. In that case, basically any solution will take a long time to draw that conclusion (if it even gets that far...), and there is not even an answer to your question because there is no top ten.
I do not have a watertight algorithm that I can present, but I would explore a solution that uses histograms of hash values of the URLs as opposed to the URLs themselves. These histograms can be calculated by means of one-pass file reads, so the approach can deal with arbitrarily large log files. In pseudo-code, I would go for something like this:
Use a hash function with a limited target space (say 10,000 buckets; note that colliding hash values are expected) to calculate the hash value of each item in the log file, and count how many times each hash value occurs. Communicate the resulting histogram to a server. (It is probably also possible to avoid a central server altogether by multicasting the result to every other node, but I will stick with the more obvious server approach here.)
The server should merge the histograms and communicate the result back. Depending on the distribution of the URLs, there might be a number of clearly visible peaks already, containing the top-visited URLs.
Each of the nodes should then focus on the peaks in the histogram. It should go through its log file again and use an additional hash function (again with a limited target space) on those URLs whose first hash value falls in one of the peaks (the number of peaks to focus on is a parameter to be tuned, depending on the distribution of the URLs), calculating a second histogram from these new hash values. The result should be communicated to the server.
The server should merge the results again and analyse the new histogram versus the original histogram. Depending on clearly visible peaks, it might be able to draw conclusions about the two hash values of the top ten URLs already. Or it might have to instruct the machines to calculate more hash values with the second hash function, and probably after that go through a third pass of hash-calculations with yet another hash function. This has to continue until a conclusion can be drawn from the collective group of histograms what the hash values of the peak URLs are, and then the nodes can identify the different URLs from that.
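A minimal Python sketch of the per-node histogram pass (the function names, bucket count, and choice of hash are mine, not part of the answer):

import hashlib

BUCKETS = 10_000   # limited target space; colliding hash values are expected

def hash_histogram(urls, salt=b""):
    """One pass over a node's log: count how many URLs fall into each hash
    bucket. A different salt stands in for 'an additional hash function'
    on later passes."""
    hist = [0] * BUCKETS
    for url in urls:
        digest = hashlib.sha256(salt + url.encode()).digest()
        hist[int.from_bytes(digest[:8], "big") % BUCKETS] += 1
    return hist

def merge_histograms(histograms):
    """Server side: element-wise sum of the per-node histograms."""
    return [sum(column) for column in zip(*histograms)]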
Note that this mechanism will require tuning and optimization with regard to several aspects of the algorithm and hash-functions. It will also need orchestration by the server as to which calculations should be done at any time. It probably will also need to set some boundaries in order to conclude when no conclusion can be drawn, in other words when the "spectrum" of URL hash values is too flat to make it worth the effort to continue calculations.
This approach should work well if there is a clear distribution in the URLs though. I suspect that, practically speaking, the question only makes sense in that case anyway.
Assuming the conditions below are true:
You need the top n urls of m hosts.
You can't store the files in RAM
There is a master node
I would take the approach below:
Each node reads a portion of the file (i.e. MAX urls, where MAX could be, say, 1000 urls) and keeps an array arr[MAX] = {url, hits}.
When a node has read MAX urls off the file, it sends the list to the master node, and resumes reading until MAX urls is reached again.
When a node reaches EOF, it sends the remaining list of urls and an EOF flag to the master node.
When the master node receives a list of urls, it compares it with its latest list of urls and generates a new, updated one.
When the master node has received the EOF flag from every node and has finished reading its own file, the top n urls of the last version of its list are the ones we're looking for.
Or
A different approach that would relieve the master from doing all the work could be:
Every node reads its file and stores an array just as above, reading until EOF.
At EOF, each node sends the first url of its list and its number of hits to the master.
When the master has collected the first url and number of hits from each node, it generates a list. If the master has fewer than n urls, it asks the nodes to send the second one, and so on, until the master has the n urls sorted.
Pre-processing: each computer processes its complete log file and prepares a list of unique URLs with a count against each.
Getting top URLs:
Calculate URL counts at each computer
Run a collating process at a central (virtual) system
Send URLs with counts to the central system one by one, in descending order of count (i.e. from the top)
At the central system, collate the incoming URL details
Repeat until the sum of the latest incoming counts (one from each computer) is less than the count of the tenth URL in the master list -- a vital step for being absolutely certain
PS: you'll have the top ten URLs across systems, though not necessarily in the right order. To get the actual order you can reverse the collation: for each URL in the top ten, get its individual counts from the distributed computers and form the final order.
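One way to read that collation loop and its stopping rule in Python (my interpretation; the function name is made up) is:

import heapq
from collections import defaultdict

def collate_top_k(streams, k=10):
    """streams: one iterator of (count, url) pairs per computer, in
    descending count order. Stops once the sum of the most recent count
    from every computer is below the k-th best collated total, so no URL
    that has not yet appeared at all can still reach the top k."""
    totals = defaultdict(int)
    last = {}                          # most recent count seen per computer
    active = {i: iter(s) for i, s in enumerate(streams)}
    while active:
        for i in list(active):
            try:
                count, url = next(active[i])
            except StopIteration:
                last[i] = 0
                del active[i]
                continue
            last[i] = count
            totals[url] += count
        best = heapq.nlargest(k, totals.values())
        if len(best) == k and sum(last.values()) < best[-1]:
            break                      # the "vital step": safe to stop
    return heapq.nlargest(k, totals.items(), key=lambda item: item[1])

As the PS points out, the counts collected this way are only partial, so a final round of per-URL queries to the individual computers is still needed to get exact counts and the final order.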
On each node, count the number of occurrences of each URL.
Then use a sharding function to distribute each URL to the node that owns that key. Now each node holds a disjoint set of keys.
Each node then reduces again to get the number of occurrences of its URLs and finds its top N. Finally, each node sends only its top N urls to the master node, which finds the overall top N among the K*N items, where K is the number of nodes.
Eg: K=3
N1 - > url1,url2,url3,url1,url2
N2 - > url2,url4,url1,url5,url2
N3 - > url1,url4,url3,url1,url3
Step 1: Count the occurrence per url in each node.
N1 -> (url1,2),(url2,2),(url3,1)
N2 -> (url4,1),(url2,2),(url5,1),(url1,1)
N3 -> (url1,2),(url3,2),(url4,1)
Step 2: Sharding. Use a hash function (for simplicity, let it be url number % K).
N1 -> (url1,2),(url1,1),(url1,2),(url4,1),(url4,1)
N2 -> (url2,2),(url2,2),(url5,1)
N3 -> (url3,2),(url3,1)
Step 4: Find the number of occurrences of each key within the node again.
N1 -> (url1,5),(url4,2)
N2 -> (url2,4),(url5,1)
N3 -> (url3,3)
Step 5: Send only top N to master. Let N=1
Master -> (url1,5),(url2,4),(url3,3)
Sort the result and take the top 1 item, which is url1.
Step 1 is called a map-side reduce, and it is done to avoid the huge shuffle that would otherwise occur in Step 2.
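The whole scheme fits in a few lines of Python as a simulation on in-memory lists (function names are mine):

import heapq
from collections import Counter

def local_counts(urls):
    # Step 1: map-side reduce on each node
    return Counter(urls)

def shard(counts, k):
    # Step 2: route each (url, count) pair to the node that owns the url
    shards = [[] for _ in range(k)]
    for url, n in counts.items():
        shards[hash(url) % k].append((url, n))
    return shards

def combine_and_top_n(pairs, n):
    # Step 4 and the per-node part of Step 5
    totals = Counter()
    for url, count in pairs:
        totals[url] += count
    return totals.most_common(n)

def distributed_top_n(node_logs, n):
    k = len(node_logs)
    owned = [[] for _ in range(k)]
    for log in node_logs:
        for node, pairs in enumerate(shard(local_counts(log), k)):
            owned[node].extend(pairs)
    candidates = [p for node_pairs in owned for p in combine_and_top_n(node_pairs, n)]
    # Step 5 on the master: top n among the K*n candidates
    return heapq.nlargest(n, candidates, key=lambda p: p[1])

Running it on the three example logs above with n=1 returns [('url1', 5)].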
The description below is the idea for the solution; it is not pseudocode.
Suppose you have a collection of systems.
1) For each A in Collections(systems):
1.1) Run a daemonA on each computer that watches the log file for changes.
1.2) When a change is noticed, wake up AnalyzerThreadA.
1.3) If AnalyzerThreadA finds a URL using some regex, update localHashMapA with count++
(key = URL, value = count).
2) Push the topTen entries of localHashMapA to computerA, where the AnalyzeAll daemon will be running.
This step, the last one in each system, pushes its topTen entries to a master system, say computerA.
3) AnalyzeAll running in computerA will resolve duplicates and update count in masterHashMap of URLs.
4) Print the topTen from masterHashMap.

Dynamic Array with O(1) removal of any element

This question is about a data structure I thought of. It is a dynamic array, like std::vector<> in C++, except the removal algorithm is different.
In a normal dynamic array, when an element is removed, all the remaining elements must be shifted down, which is O(n), unless it's the last element, which will be O(1).
In this one, if any element is removed, it is replaced by the last element. This of course loses ordering of the elements. But now removal of any element is constant time.
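In code, the removal is just "overwrite with the last element, then pop the tail" -- a minimal Python sketch of the idea:

class SwapRemoveArray:
    """Dynamic array where remove_at(i) overwrites slot i with the last
    element and pops the tail, so removal is O(1) but order is not kept."""

    def __init__(self, items=()):
        self._data = list(items)

    def __len__(self):
        return len(self._data)

    def __getitem__(self, i):
        return self._data[i]

    def append(self, value):
        self._data.append(value)

    def remove_at(self, i):
        removed = self._data[i]
        self._data[i] = self._data[-1]   # move the last element into the hole
        self._data.pop()                 # shrink by one; O(1)
        return removed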
A list will have the same removal times, but this structure has random access. The only caveat is that you don't know what you're accessing, since the ordering could be jumbled, so what use is random access anyway? Plus, a list won't mess up any pointers/iterators to elements.
So meh, this structure seems rather useless except for the very specific task of strictly walking through elements and perhaps removing them along the way. A list can do the same, but this has better cache performance.
So, does this strange/useless structure have a name, and does it have any uses? Or just a nice little brain storm?
This idea is used in Knuth (Fisher–Yates) shuffle. An element picked at random is replaced with the last one in the array. Since what we want is a random permutation anyway, the reordering doesn't matter.
So, does this strange/useless structure have a name, and does it have any uses?
I've used something similar in simulations of multi-process systems.
In a scheduler for processes implemented as state machines, each process is either waiting for an external event, active or completed. The scheduler has an array of pointers to the processes.
Initially each process is active, and the scheduler keeps the index just past the last waiting process and the index of the first completed process; initially these are zero and the length of the array.
V-- waiting
[ A-active, B-active, C-active, D-active ]
completed --^
^- run
To step the process to its next state, the scheduler iterates over the array and runs each process in turn. If a process reports that it is waiting, it's swapped with the process after the last waiting process in the array.
V-- waiting
[ A-waiting, B-active, C-active, D-active ]
completed --^
^- run
If it reports that it has completed, it's swapped with the process just before the first completed process.
V-- waiting
[ A-waiting, D-active, C-active, B-completed ]
completed --^
^- run
So as the scheduler runs and processes transition from active to waiting or completed, the array becomes ordered with all the waiting processes at the start, all the active ones in the middle, and the completed ones at the end.
V-- waiting
[ A-waiting, C-waiting, D-active, B-completed ]
completed --^
^- run
After either a certain number of iterations, or when there are no more active processes, the completed processes are cleaned out of the array and external events are processed:
V-- waiting
[ A-waiting, C-waiting, D-completed, B-completed ]
completed --^
^- run == completed so stop
This is similar in that it uses swapping to remove items from a collection, but it removes items toward both ends, leaving the 'collection' of active items in the middle.
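A compact Python rendering of one such pass (the step callback and names are mine, not from the original scheduler):

def run_pass(procs, step):
    """One scheduler pass over the array described above. step(p) runs
    process p and returns 'waiting', 'active' or 'completed'. The array
    stays partitioned as [waiting | active | completed] purely by swaps."""
    waiting = 0              # index one past the last waiting process
    completed = len(procs)   # index of the first completed process
    run = 0
    while run < completed:
        state = step(procs[run])
        if state == "waiting":
            # swap into the waiting block; the process swapped back to
            # position `run` has already been run this pass, so advance
            procs[run], procs[waiting] = procs[waiting], procs[run]
            waiting += 1
            run += 1
        elif state == "completed":
            # swap into the completed block; the process swapped in
            # has not been run yet, so do not advance
            completed -= 1
            procs[run], procs[completed] = procs[completed], procs[run]
        else:                # still active
            run += 1
    return waiting, completed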
I remember using this method plenty of times before. But I don't know a name for it.
Simple example: in a computer game you are iterating over all the "bad guys" and calculating their movements etc. One thing that can happen to them is to disappear (their dead body has finished fading away and is 99% transparent now). At that point you remove it from the list the way you describe, and resume iterating without increasing the iteration counter.
Something similar is done in a binary heap when deleting an item; however, there the next step is to restore the heap property, which is O(log n).
I don't know of a name for it, but it is better than a list in certain cases.
In particular, this would be vastly superior to a singly or doubly linked list for very small data.
Because you store everything contiguously, there's no extra pointer overhead per element.
Hm, does that algorithm really have O(1) removal time?
That would mean that
Finding the element to remove is O(1)
Finding the last element (which will replace the deleted element) is O(1)
Finding the second-to-last element (the new "last" element) is O(1)
...which is not possible in any data structure I can come up with. Although a doubly linked list could fulfill these constraints, given that you've already got a pointer to the element to remove.
It's called a Set.

Log combing algorithm

We get these ~50GB data files consisting of 16-byte codes, and I want to find any code that occurs 1/2% of the time or more. Is there any way I can do that in a single pass over the data?
Edit: There are tons of codes - it's possible that every code is different.
EPILOGUE: I've selected Darius Bacon's as the best answer, because I think the best algorithm is a modification of the majority-element algorithm he linked to. The majority algorithm should be modifiable to use only a tiny amount of memory - like 201 codes to get 1/2%, I think. Basically you just walk the stream counting up to 201 distinct codes. As soon as you find 201 distinct codes, you drop one of each code (deduct 1 from the counters, forgetting anything that becomes 0). At the end, you have dropped at most N/201 times, so any code occurring more times than that must still be around.
But it's a two-pass algorithm, not one. You need a second pass to tally the counts of the candidates. It's actually easy to see that any solution to this problem must use at least two passes (the first batch of elements you load could all be different, and one of those codes could end up being exactly 1/2%).
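In Python, the candidate pass plus the tallying pass of that scheme looks roughly like this (just a sketch of the idea above; stream_factory is a placeholder for whatever re-reads the file):

from collections import Counter

def candidate_codes(stream, k=201):
    """One pass: keep at most k-1 counters. When a k-th distinct code
    appears, decrement every counter (and drop the new code), forgetting
    counters that reach zero. Any code occurring more than n/k times
    survives as a candidate."""
    counters = {}
    for code in stream:
        if code in counters:
            counters[code] += 1
        elif len(counters) < k - 1:
            counters[code] = 1
        else:
            for c in list(counters):     # "drop one of each code"
                counters[c] -= 1
                if counters[c] == 0:
                    del counters[c]
    return set(counters)

def frequent_codes(stream_factory, fraction=0.005, k=201):
    """Second pass: exact counts for the candidates only."""
    candidates = candidate_codes(stream_factory(), k)
    counts, total = Counter(), 0
    for code in stream_factory():
        total += 1
        if code in candidates:
            counts[code] += 1
    return {code: n for code, n in counts.items() if n >= fraction * total}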
Thanks for the help!
Metwally et al., Efficient Computation of Frequent and Top-k Elements in Data Streams (2005). There were some other relevant papers I read for my work at Yahoo that I can't find now; but this looks like a good start.
Edit: Ah, see this Brian Hayes article. It sketches an exact algorithm due to Demaine et al., with references. It does it in one pass with very little memory, yielding a set of items including the frequent ones you're looking for, if they exist. Getting the exact counts takes a (now-tractable) second pass.
This will depend on the distribution of the codes. If there is a small enough number of distinct codes, you can build a frequency distribution (http://en.wikipedia.org/wiki/Frequency_distribution) in core with a map. Otherwise you will probably have to build a histogram (http://en.wikipedia.org/wiki/Histogram) and then make multiple passes over the data, examining the frequencies of codes in each bucket.
Sort chunks of the file in memory, as if you were performing an external sort. Rather than writing out all of the sorted codes in each chunk, however, you can just write each distinct code and the number of occurrences in that chunk. Finally, merge these summary records to find the number of occurrences of each code.
This process scales to any size data, and it only makes one pass over the input data. Multiple merge passes may be required, depending on how many summary files you want to open at once.
Sorting the file allows you to count the number of occurrences of each code using a fixed amount of memory, regardless of the input size.
You also know the total number of codes (either by dividing the input size by a fixed code size, or by counting the number of variable length codes during the sorting pass in a more general problem).
So, you know the proportion of the input associated with each code.
This is basically the pipeline sort * | uniq -c
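A toy Python version of this summarize-then-merge idea (chunks kept as in-memory lists instead of temporary files; function names are mine):

import heapq
from collections import Counter
from itertools import groupby

def summarize_chunks(codes, chunk_size=1_000_000):
    """Sort/count each chunk and keep only (code, count) summary records."""
    summaries, chunk = [], []
    for code in codes:
        chunk.append(code)
        if len(chunk) == chunk_size:
            summaries.append(sorted(Counter(chunk).items()))
            chunk = []
    if chunk:
        summaries.append(sorted(Counter(chunk).items()))
    return summaries

def merge_summaries(summaries):
    """k-way merge of the sorted summaries, yielding the total count per code."""
    for code, records in groupby(heapq.merge(*summaries), key=lambda r: r[0]):
        yield code, sum(n for _, n in records)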
If every code appears just once, that's no problem; you just need to be able to count them.
That depends on how many different codes exist, and how much memory you have available.
My first idea would be to build a hash table of counters, with the codes as keys. Loop through the entire file, increasing the counter of the respective code, and counting the overall number. Finally, filter all keys with counters that exceed (* overall-counter 1/200).
If the files consist solely of 16-byte codes, and you know how large each file is, you can calculate the number of codes in each file. Then you can find the 0.5% threshold and follow any of the other suggestions to count the occurrences of each code, recording each one whose frequency crosses the threshold.
Do the contents of each file represent a single data set, or is there an arbitrary cutoff between files? In the latter case, and assuming a fairly constant distribution of codes over time, you can make your life simpler by splitting each file into smaller, more manageable chunks. As a bonus, you'll get preliminary results faster and can pipeline them into the next process earlier.
