Get unique words in a text stream - algorithm

At a given instant, can we find the number of unique words seen so far in a text stream?
One naive solution I can think of is using a hashmap to keep word counts.
But this requires keeping all the words (and their counts) in the hashmap, and for a long text stream that is a lot of words to maintain. Is there a way to cut down the space complexity of this?

You cannot get the number of distinct words exactly without paying the space complexity. However, you can get a reasonably good estimate by using the Flajolet-Martin approach, described on slide 20 of this slide deck.
Assuming the data stream consists of a universe of elements chosen from a set of size N, you can do the following steps, copied from the slides linked above.
Pick a hash function h that maps each of the N elements to at least log_2 (N) bits.
For each stream element a, let r(a) be the number of trailing 0's in h(a).
Record R = the maximum r(a) seen.
Estimated number of distinct elements = 2^R.
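For concreteness, here is a minimal Python sketch of those four steps, assuming MD5 as the hash function h and a single estimator (a real implementation would combine several independent hash functions, e.g. by taking medians of means, to reduce the variance):

import hashlib

def trailing_zeros(x):
    # Number of trailing zero bits in x; defined as 0 for x == 0 to avoid a degenerate R.
    if x == 0:
        return 0
    r = 0
    while x & 1 == 0:
        x >>= 1
        r += 1
    return r

def estimate_distinct(stream):
    # MD5 is only a stand-in for "a hash with enough bits"; any good hash works.
    R = 0
    for word in stream:
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        R = max(R, trailing_zeros(h))
    return 2 ** R

print(estimate_distinct("the quick brown fox jumps over the lazy dog the fox".split()))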

Related

Minimize space use with a randomized queue

I'm going through an online algorithms course, and sometimes it poses ungraded bonus challenges for which no answer is provided. This is one of them:
1. You are given a positive integer k.
2. You will read a series of strings from the standard input (a total of n strings; n is not known to you until after you have exhausted all the strings).
3. You can make use of a randomized queue, which has a basic API: size() returns the number of elements in the queue; enqueue(String) adds the string into the queue; and dequeue() removes and returns a string from inside the queue, chosen uniformly at random.
4. Read all the input and at the end print k strings chosen uniformly at random from the set of n strings.
5. Use a randomized queue no larger than k.
I cannot satisfy 4 and 5 at the same time. I can get the distribution of the output to be uniform if I fill the queue with n strings and then make k calls to dequeue(), or I can devise a scheme in which I have at most k elements in the queue at any point, but then the output is not uniform, since the strings read at the beginning end up having either a greater or a smaller chance of being part of the final chosen set (depending on the algorithm I choose).
If I knew n in advance I could assign a random ID between 0 and n to each string I read, and keep a list of the k smallest IDs and their respective strings (e.g. k_smallest); if a new string is assigned a random ID smaller than any of the k I already have, I could remove the largest element from k_smallest and add the new string to it. However, two problems arise: n is not known until after all strings have been read, and the randomized queue does not allow dequeuing the largest element, only a random one.
I am very curious about the solution. How can this be solved using space proportional to k and not n?
The Key:
You need to keep track of how many elements you have read so far.
Algo:
l : number of enqueue(..)-calls so far.
Take the first k elements and put them in your internal storage of size k (e.g. an array of size k). Set l := k.
For each enqueue(..) call after the first k, you need to decide whether to keep the new element. If you have already enqueued l elements, the new element is number l+1, so the probability with which we need to keep it is k/(l+1). If the random generator says keep it, remove a uniformly random one of the old elements and replace it with the new one. Then set l := l+1.
At any time your internal storage holds k values chosen uniformly at random from the l values enqueued so far. At the end, l == n.
P.s.
The algorithm is much more intuitive for k=1. So if you have problems getting the idea, think it through with the simplest case k=1.
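For what it's worth, here is a small Python sketch of this scheme (it is standard reservoir sampling), with the randomized queue replaced by a plain list of size k just for illustration:

import random

def sample_k(strings, k):
    # Keep the first k strings; afterwards keep the l-th string with probability k/l,
    # evicting a uniformly random resident when we do. (Here l counts the current
    # string too, because it is incremented before the decision.)
    reservoir = []
    l = 0
    for s in strings:
        l += 1
        if l <= k:
            reservoir.append(s)
        elif random.random() < k / l:
            reservoir[random.randrange(k)] = s
    return reservoir

print(sample_k((f"string{i}" for i in range(1000)), 5))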

Efficiently search for pairs of numbers in various rows

Imagine you have N distinct people and a record of where these people are; there are exactly M of these records.
For example
1,50,299
1,2,3,4,5,50,287
1,50,299
So you can see that 'person 1' is at the same place as 'person 50' three times. Here M = 3, obviously, since there are only 3 lines. My question is: given M of these lines and a threshold value (i.e. person A and person B have been at the same place more than threshold times), what do you suggest as the most efficient way of returning these co-occurrences?
So far I've built an N-by-N table and looped through each row, incrementing table(A, B) every time person A co-occurs with person B in a row. Obviously this is an awful approach and takes O(N^2) to O(N^3) depending on how you implement it. Any tips would be appreciated!
There is no need to create the table. Just create a hash/dictionary/whatever your language calls it. Then in pseudocode:
answer = []
for S in sets:
    for (i, j) in pairs from S:
        count[(i, j)]++
        if threshold == count[(i, j)]:
            answer.append((i, j))
If you have M sets of size K, the running time will be O(M*K^2).
If you want you can actually keep the list of intersecting sets in a data structure parallel to count without changing the big-O.
Furthermore the same algorithm can be readily implemented in a distributed way using a map-reduce. For the count you just have to emit a key of (i, j) and a value of 1. In the reduce you count them. Actually generating the list of sets is similar.
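A runnable Python version of the same counting idea, assuming the records arrive as iterables of person IDs (the function and variable names here are mine):

from collections import defaultdict
from itertools import combinations

def frequent_pairs(records, threshold):
    # Count every unordered pair that co-occurs in a record and report a pair
    # the moment its count reaches the threshold.
    count = defaultdict(int)
    answer = []
    for record in records:
        for pair in combinations(sorted(set(record)), 2):
            count[pair] += 1
            if count[pair] == threshold:
                answer.append(pair)
    return answer

records = [[1, 50, 299], [1, 2, 3, 4, 5, 50, 287], [1, 50, 299]]
print(frequent_pairs(records, 3))   # [(1, 50)]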
The relevant concept for your case is market basket analysis. In this context there are several algorithms; for example, the Apriori algorithm can be used for your case, specialized to itemsets of size 2.
Moreover, for finding association rules with specific supports and conditions (which for your case is the threshold value), LSH and min-hashing can be used as well.
You could use probability to speed it up, e.g. only check each pair with probability 1/50. That will give you a 50x speed-up. Then double-check any pairs whose count gets close enough to 1/50th of M.
To double-check those pairs, you can either go through the whole list again, or you can do it more efficiently with some clever kind of reverse indexing built as you go: e.g. encode each person's row indices into 64-bit integers, use binary-search / merge-sort style techniques to decide which 64-bit integers to compare, and use bit operations to compare them for matches. Other things to look up: reverse indexing and binary indexed range trees / Fenwick trees.
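A rough sketch of that sampled first pass, assuming the cutoff is taken relative to the threshold and the exact counter from the earlier answer is used for the follow-up double check; the 1/50 rate and the slack factor are arbitrary knobs, and note that all pairs are still enumerated, only the hash-table work is skipped:

import random
from collections import defaultdict
from itertools import combinations

def candidate_pairs(records, threshold, rate=1/50, slack=0.5):
    # First pass: only count a pair with probability `rate`.
    sampled = defaultdict(int)
    for record in records:
        for pair in combinations(sorted(set(record)), 2):
            if random.random() < rate:
                sampled[pair] += 1
    # Keep pairs whose sampled count is close enough to rate * threshold;
    # re-count exactly only these candidates in a second pass.
    cutoff = slack * rate * threshold
    return [pair for pair, c in sampled.items() if c >= cutoff]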

Minimum number of deletions for a given word to become a dictionary word

Given a dictionary as a hashtable, find the minimum number of deletions needed for a given word in order to make it match any word in the dictionary.
Is there some clever trick to solve this problem in less than exponential complexity (trying all possible combinations)?
For starters, suppose that you have a single word w in the hash table and that your word is x. You can delete letters from x to form w if and only if w is a subsequence of x, and in that case the number of letters you need to delete from x to form w is |x| - |w|. So certainly one option would be to just iterate over the hash table and, for each word, check whether that word is a subsequence of x, taking the best match you find across the table.
To analyze the runtime of this operation, let's suppose that there are n total words in your hash table and that their total length is L. A subsequence check against one word w takes O(|w| + |x|) time, so the whole scan takes O(L + n·|x|). The complexity of your initial approach is O(|x| · 2^|x|), because there are 2^|x| possible words you can make by deleting letters from x and you'll spend O(|x|) time processing each one. Depending on the size of your dictionary and the size of your word, one algorithm might be better than the other, so the runtime is O(min{L + n·|x|, |x| · 2^|x|}) if you take the better of the two approaches.
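A small Python sketch of that scan, assuming the dictionary is given as any iterable of words:

def is_subsequence(w, x):
    # True if w can be obtained from x by deleting letters.
    it = iter(x)
    return all(ch in it for ch in w)

def min_deletions(x, dictionary):
    # Smallest |x| - |w| over all dictionary words w that are subsequences of x,
    # or None if no dictionary word can be formed from x by deletions.
    best = None
    for w in dictionary:
        if is_subsequence(w, x):
            d = len(x) - len(w)
            if best is None or d < best:
                best = d
    return best

print(min_deletions("applex", {"apple", "ape", "axle"}))   # 1 (delete the 'x' to get "apple")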
You can build a trie and then see where your given word fits into it. The difference in depth between your word and the closest existing parent is the number of deletions required.

Counting distinct common subsequences for a given set of strings

I was going through this paper about counting the number of distinct common subsequences between two strings, which describes a DP approach for that problem. When there are more than two strings whose number of distinct common subsequences must be found, it might require a different approach. What I want to know is whether this task is achievable in less than exponential time, and how it can be done.
If you have an alphabet of size k, and m strings of size at most n, then (assuming that all individual math operations are O(1)) this problem is solvable with dynamic programming in time at most O(k·n^(m+1)) and memory O(k·n^m). Those are not tight bounds, and in practice performance and memory use should be significantly better than that. However, with long strings you will wind up needing big-integer arithmetic, which makes math operations not O(1). Still, it is polynomial.
Here is the trick in an unfortunately confusing sentence. We want to build up a series of tables listing, for each possible length of subsequence and each set of ways to pick one copy of a character from each string, the number of distinct subsequences there are whose minimal expression in each string ends at the chosen spot. If we do that, then the sum of all of those values is our final answer.
Here is an outline of how to do it (which you can do without understanding the above description).
For each string, build a transition table mapping (position in string, character) to the position of the next occurrence of that character. The tables should start with position 0 being before the first character. You can use -1 for running off of the end of the string.
Create a data structure that maps a list of integers, one per string, to another integer. This will be the count of subsequences of a fixed length whose shortest representation in each string ends at that set of positions.
Insert as the sole value (0, 0, ..., 0) -> 1 to represent the fact that there is 1 subsequence of length 0 and its shortest representation in each string ends at the start.
Set the total count of common subsequences to 0.
While that map is not empty:
    Add the sum of values in that map to the total count of common subsequences.
    Create a second map of the same type, with no data.
    For each key/value pair in the first map:
        For each possible character in your alphabet:
            Construct a new vector of integers to be a new key by taking each string, looking at the position, then taking the next position of that character. Of course if you run off of the end of the string, break out of the loop.
            If that key is not in your second map, insert it with value 0.
            Increase the value for that key in the second map by your current value in the current map. (Basically add the number of subsequences that just had this minimal character transition.)
    Copy the second data structure to the first.
The total count of distinct subsequences in common across all of the strings should now be correct.
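Here is one way the outline might look in Python; this is only a sketch, and it counts the empty subsequence as well (subtract 1 from the result if you don't want it):

from collections import defaultdict

def count_common_subsequences(strings):
    # Distinct subsequences common to all strings, including the empty one.
    alphabet = set().union(*map(set, strings))

    # Step 1: for each string, nxt[p][c] = 1-based position of the next c strictly
    # after position p (position 0 = before the first character); absent if c
    # never occurs again (the outline's -1).
    tables = []
    for s in strings:
        nxt = [None] * (len(s) + 1)
        future = {}
        for p in range(len(s), -1, -1):
            nxt[p] = dict(future)
            if p > 0:
                future[s[p - 1]] = p
        tables.append(nxt)

    # Steps 2-4: keys are tuples of minimal end positions, one per string;
    # values are counts of distinct subsequences whose shortest representation
    # ends exactly there.
    current = {tuple(0 for _ in strings): 1}
    total = 0
    while current:
        total += sum(current.values())
        next_level = defaultdict(int)
        for positions, count in current.items():
            for c in alphabet:
                new_positions = []
                for table, p in zip(tables, positions):
                    q = table[p].get(c)
                    if q is None:          # ran off the end of this string
                        break
                    new_positions.append(q)
                else:
                    next_level[tuple(new_positions)] += count
        current = next_level
    return total

print(count_common_subsequences(["ab", "ba"]))   # 3: "", "a", "b"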

Get the most unique texts from a group of texts

I have a number of texts, for example 100.
I would like to keep the 10 most unique among them. I built a 100x100 matrix where I compared each pair of texts with the Levenshtein algorithm.
Is there an algorithm to select the 10 most unique?
EDIT:
What I want is the N most unique texts, i.e. the N texts that maximize the distances among themselves, regardless of the 1st element of my set.
I want the most unique ones because I will publish these texts to the web and I want to avoid near duplicates.
A long comment rather than an answer ...
I don't think you've specified your requirement(s) clearly enough. How do you select the 1st element of your set of 10 strings? Is it the string with the largest distance from any other string (in which case you are looking for the largest element in your array), or the one with the largest distance from all the other strings (in which case you are looking for the largest row or column sum in the array)?
Moving on to the N (or 10 as you suggest) most distant strings, you have a number of choices.
You could select the N largest distances in the array. I suspect, not having seen your data, that it is likely that the string which is furthest from any other string may also be furthest away from several other strings too -- I mean you may find that several of the N largest entries in your array occur in the same row or column.
You could simply select the N strings with the largest row sums.
Or perhaps you are looking for a cluster of N strings which maximises the distance between all the strings in that cluster and all the strings in the remaining 100-N strings. This might lead you towards looking at, rather obviously, clustering algorithms.
I suggest you clarify your requirements and edit your question.
Since this looks like an eigenvalue problem, I would try running power iteration on the matrix and rejecting the 90 highest values from the resulting vector. Power iteration normally converges very fast, within about ten iterations. BTW: this solution assumes a similarity matrix. If the entries of your matrix are a measure of dissimilarity ("distance"), you might need to use their inverses instead.
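A sketch of that suggestion with NumPy, assuming sim is a symmetric 100x100 similarity matrix with non-negative entries and that "most unique" means the 10 texts with the smallest scores in the dominant eigenvector:

import numpy as np

def most_unique_indices(sim, keep=10, iterations=20):
    # Power iteration: repeatedly multiply a vector by the similarity matrix.
    # It converges to the dominant eigenvector; large entries mark texts that
    # are strongly similar to many others, so we keep the smallest ones.
    v = np.ones(sim.shape[0])
    for _ in range(iterations):
        v = sim @ v
        v /= np.linalg.norm(v)
    return np.argsort(v)[:keep]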
