Algorithm for generating a list of recurring pairs

Given a text file in the format below, where each line is a list of up to 50
names, write a program that produces a list of pairs of names which appear
together in at least fifty different lists.
Tyra,Miranda,Naomi,Adriana,Kate,Elle,Heidi
Daniela,Miranda,Irina,Alessandra,Gisele,Adriana
In the above sample, Miranda and Adriana appear together twice, but every
other pair appears together only once. With a threshold of two for this small
sample, it should return "Miranda,Adriana\n". An approximate solution may be
returned: the pairs it reports need only appear together at least 50 times
with high probability.
I was thinking of the following solution:
Generate a Map<Pair, Integer> pairToCountMap after reading through the file.
Iterate through the map, and print the pairs with counts >= 50.
Is there a better way to do this? The file could be very large, and I'm not sure what is meant by the approximate solution. Any links or resources would be much appreciated.

First let's assume that names are limited in length, so operations on them are constant time.
Your answer should be acceptable if it fits in memory. If you have N lines with m names each, your solution should take O(N*m*m) to complete.
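A minimal sketch of that in-memory approach (in Python rather than Java; the file name names.txt and the comma-separated line format are assumptions):

from collections import defaultdict
from itertools import combinations

THRESHOLD = 50
pair_counts = defaultdict(int)

with open("names.txt") as f:                          # hypothetical input file
    for line in f:
        names = sorted(set(line.strip().split(",")))  # dedupe, and fix an order for the pair keys
        for pair in combinations(names, 2):           # the O(m^2) pairs on this line
            pair_counts[pair] += 1

for (a, b), count in pair_counts.items():
    if count >= THRESHOLD:
        print(f"{a},{b}")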
If that data set doesn't fit in memory, you can write the pairs out to a file, sort that file with an external merge sort, then scan through it to count pairs. The running time of this is O(N*m*log(N*m)), but because the disk access is sequential it will run much faster in practice than that bound suggests.
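A rough sketch of this external variant, leaning on the Unix sort utility as a stand-in for a hand-rolled external merge sort (file names are placeholders):

import subprocess
from itertools import combinations

THRESHOLD = 50

# Pass 1: write one "a,b" line per co-occurring pair (with a < b).
with open("names.txt") as src, open("pairs.txt", "w") as out:
    for line in src:
        names = sorted(set(line.strip().split(",")))
        for a, b in combinations(names, 2):
            out.write(f"{a},{b}\n")

# Pass 2: sort the pair file on disk (Unix sort spills to temporary files as needed).
subprocess.run(["sort", "-o", "pairs.sorted.txt", "pairs.txt"], check=True)

# Pass 3: count runs of identical lines in the sorted file.
prev, run = None, 0
with open("pairs.sorted.txt") as f:
    for line in f:
        if line == prev:
            run += 1
        else:
            if prev is not None and run >= THRESHOLD:
                print(prev.strip())
            prev, run = line, 1
if prev is not None and run >= THRESHOLD:
    print(prev.strip())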
If you have a distributed cluster, then you could use a MapReduce. It would run very similarly to the last solution.
As for the statistics approach, my guess is that they mean running through the file to find the frequency of each name, and the number of lines with each number of names on them. If we assume that each line is a random assortment of names, statistics let us estimate how many co-occurrences to expect between any pair of common names. This will be roughly linear in the length of the file.

For each name you can obtain the list of the line numbers where it appears (use a hashtable keyed by name to store these lists), then for every pair of names get the size of the intersection of the corresponding line-number lists (for two increasing sequences this takes linear time).
Say the length of a name is bounded by a constant. Then, if you have N names and M lines, building the index is O(MN) and the final stage is O(N^2 * M).
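A sketch of that index-and-intersect idea, assuming the same comma-separated names.txt and threshold of 50 as in the question:

from collections import defaultdict
from itertools import combinations

THRESHOLD = 50
lines_for = defaultdict(list)        # name -> increasing list of line numbers

with open("names.txt") as f:
    for lineno, line in enumerate(f):
        for name in set(line.strip().split(",")):
            lines_for[name].append(lineno)

def intersection_size(xs, ys):
    """Linear-time merge of two increasing sequences, counting common values."""
    i = j = common = 0
    while i < len(xs) and j < len(ys):
        if xs[i] == ys[j]:
            common += 1
            i += 1
            j += 1
        elif xs[i] < ys[j]:
            i += 1
        else:
            j += 1
    return common

for a, b in combinations(sorted(lines_for), 2):
    if intersection_size(lines_for[a], lines_for[b]) >= THRESHOLD:
        print(f"{a},{b}")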

Related

Amount of arrays with unique numbers

I have been wondering whether there is a better solution to this problem:
Let's assume that there are n containers (they might not all have the same length), each holding some numbers. How many n-length arrays can be created by taking one element from every container? The numbers in a newly formed array must be unique (e.g. (2,3,3) cannot be created but (2,4,3) can).
Here is an example:
n=3
c1=(1,6,7)
c2=(1,6,7)
c3=(6,7)
The correct answer is 4, because we can create these four arrays: (1,6,7), (1,7,6), (6,1,7), (7,1,6).
Edit: None of the n containers contain duplicates and all the elements in the new arrays must have the same order as the order of the containers they belong to.
So my question is: Is there any better way to calculate the number of those arrays than just by generating every single possibility and checking if it has no repetitions?
You do not need to generate each possibility and then check whether it has repetitions - you can check before adding the would-be duplicate element, saving a lot of wasted work further down the line. But yes, given the requirement that "all the elements in the new arrays must have the same order as the order of the containers they belong to", you cannot simply count permutations or m-over-n combinations, which would have been much quicker (as there is a closed formula for those).
Therefore, the optimal algorithm is probably to use a backtracking approach with a set to avoid duplicates while building partial answers, and count the number of valid answers found.
The problem looks somewhat like counting possible solutions to a 1-dimensional sudoku: choose one element from each region, ensuring no duplicates. For many cases there may be 0 answers - imagine n=4, c=[[1,2],[2,3],[3,1],[2,3]]. In general, if there are fewer than k unique elements across some subset of k containers, no answer is possible.
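A minimal backtracking sketch along these lines, run on the example from the question:

def count_arrays(containers):
    """Count n-length arrays taking one element per container, all elements distinct."""
    n = len(containers)
    used = set()

    def backtrack(i):
        if i == n:
            return 1
        total = 0
        for x in containers[i]:
            if x not in used:            # prune before descending further
                used.add(x)
                total += backtrack(i + 1)
                used.remove(x)
        return total

    return backtrack(0)

print(count_arrays([(1, 6, 7), (1, 6, 7), (6, 7)]))   # -> 4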

A large file containing 1 million integers, what would be the fastest way to find the most occurring?

The basic approach would be to use an array or a hashmap to build a histogram of the numbers and select the most frequent.
In this case let's assume that all the numbers from the file cannot be loaded into the main memory.
One way I can think of is to sort using an external merge/quick sort and then calculate the frequencies chunk by chunk. Since the numbers are sorted, once the run of a given number ends we don't have to worry about it appearing again.
Is there a better and more efficient way to do this?
Well, a million isn't so much anymore, so let's assume we're talking about several billion integers.
In that case, I would suggest that you hash them and partition them into 2^N buckets (separate files or preallocated parts of the same file) using the top N bits of their hash values.
You would choose N so that the resulting buckets were highly likely to be small enough to process in memory.
You would then process each bucket by counting the occurrences of each unique value in a hash table or similar.
In the unlikely event that a bucket has too many unique values to fit in RAM, repartition using the next N bits of the hash and try again.
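A single-machine sketch of that bucketing scheme (the bucket count, file names, and one-integer-per-line input are assumptions; the low bits of the hash pick the bucket here instead of the top bits, which works just as well as long as the split is even):

from collections import Counter

N_BITS = 8                        # 2^8 = 256 buckets; choose so each bucket fits in RAM
NUM_BUCKETS = 1 << N_BITS

# Pass 1: partition the values into bucket files by hash. Python hashes an int to
# itself, so the mask below is effectively x mod NUM_BUCKETS; a real version would
# use a proper hash to even out skewed data.
buckets = [open(f"bucket_{b}.txt", "w") for b in range(NUM_BUCKETS)]
with open("numbers.txt") as f:                   # one integer per line (assumed format)
    for line in f:
        x = int(line)
        buckets[hash(x) & (NUM_BUCKETS - 1)].write(f"{x}\n")
for out in buckets:
    out.close()

# Pass 2: count each bucket independently; every value lands in exactly one bucket.
best_value, best_count = None, 0
for b in range(NUM_BUCKETS):
    with open(f"bucket_{b}.txt") as f:
        counts = Counter(int(line) for line in f)
    if counts:
        value, count = counts.most_common(1)[0]
        if count > best_count:
            best_value, best_count = value, count

print(best_value, best_count)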

Efficiently search for pairs of numbers in various rows

Imagine you have N distinct people and that you have a record of where these people are; there are exactly M such records.
For example
1,50,299
1,2,3,4,5,50,287
1,50,299
So you can see that 'person 1' is at the same place as 'person 50' three times. Here M = 3, obviously, since there are only 3 lines. My question is: given M of these lines and a threshold value (i.e. person A and B have been at the same place more than threshold times), what do you suggest as the most efficient way of returning these co-occurrences?
So far I've built an N by N table and looped through each row, incrementing table(i,j) every time person i co-occurs with person j in a row. Obviously this is an awful approach and takes O(n^2) to O(n^3) depending on how you implement it. Any tips would be appreciated!
There is no need to create the table. Just create a hash/dictionary/whatever your language calls it. Then in pseudocode:
from collections import defaultdict
from itertools import combinations

count = defaultdict(int)
answer = []
for S in sets:
    for i, j in combinations(sorted(S), 2):
        count[(i, j)] += 1
        if count[(i, j)] == threshold:
            answer.append((i, j))
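For example, with sets = [{1, 50, 299}, {1, 2, 3, 4, 5, 50, 287}, {1, 50, 299}] (the rows from the question) and threshold = 3, the loop finishes with answer == [(1, 50)]: persons 1 and 50 were at the same place three times.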
If you have M sets of size K, the running time will be O(M*K^2).
If you want you can actually keep the list of intersecting sets in a data structure parallel to count without changing the big-O.
Furthermore, the same algorithm can be readily implemented in a distributed way using MapReduce: for the count you just have to emit a key of (i, j) and a value of 1, and in the reduce step you sum them. Actually generating the list of intersecting sets works similarly.
The relevant concept for your case is Market Basket Analysis. In this context there are different algorithms; for example, the Apriori algorithm can be used for your case, restricted to itemsets of size 2. Moreover, for finding association rules with a specific support (which in your case is the threshold value), LSH and MinHash can be used as well.
You could use probability to speed it up, e.g. only count each pair occurrence with probability 1/50. That will give you roughly a 50x speed-up. Then double-check any pairs whose sampled count comes close enough to 1/50th of the threshold.
To double-check any pairs, you can either go through the whole list again, or you can double-check more efficiently if you build some clever kind of reverse index as you go. E.g. encode each person's row indices into 64-bit integers; you could use binary search / merge sort type techniques to see which 64-bit integers to compare, and use bit operations to compare 64-bit integers for matches. Other things to look up could be reverse indexing and binary indexed trees / Fenwick trees.
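A rough sketch of that sampling idea layered on the counting loop above (the candidate cutoff and its 0.5 safety factor are arbitrary choices for the sketch, not part of the answer):

import random
from collections import defaultdict
from itertools import combinations

SAMPLE_RATE = 1 / 50
sampled = defaultdict(int)
for S in sets:                                   # `sets` and `threshold` as in the loop above
    for i, j in combinations(sorted(S), 2):
        if random.random() < SAMPLE_RATE:        # skip the hash update ~49 times out of 50
            sampled[(i, j)] += 1

# Pairs whose sampled count comes anywhere near threshold * SAMPLE_RATE become
# candidates for an exact recount on a second pass over the data.
candidates = [p for p, c in sampled.items() if c >= 0.5 * threshold * SAMPLE_RATE]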

Find duplicate strings in a large file

A file contains a large number (e.g. 10 billion) of strings and you need to find the duplicate strings. You have N systems available. How will you find the duplicates?
erickson's answer is probably the one expected by whoever set this question.
You could use each of the N machines as a bucket in a hashtable:
for each string (say string number i in sequence), compute a hash function on it, h.
send the values of i and h to machine number n for storage, where n = h % N.
from each machine, retrieve a list of all hash values h for which more than one index was received, together with the list of indexes.
check the sets of strings with equal hash values, to see whether they're actually equal.
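A sketch of those four steps, simulating the N machines as in-memory buckets and sending each string along with its index so that step 4 needs no second lookup (N and strings.txt are placeholders):

from collections import defaultdict

N = 4                                               # number of machines (placeholder)
machines = [defaultdict(list) for _ in range(N)]    # machine n: hash value -> [(index, string)]

# Steps 1-2: hash each string and "send" it, with its index, to machine h % N.
with open("strings.txt") as f:                      # one string per line (assumed format)
    for i, line in enumerate(f):
        s = line.rstrip("\n")
        h = hash(s)
        machines[h % N][h].append((i, s))

# Steps 3-4: on each machine, look only at hash values received more than once,
# then confirm that the strings really are equal (not just hash collisions).
for bucket in machines:
    for h, group in bucket.items():
        if len(group) > 1:
            by_string = defaultdict(list)
            for i, s in group:
                by_string[s].append(i)
            for s, indexes in by_string.items():
                if len(indexes) > 1:
                    print("duplicate:", s, "at lines", indexes)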
To be honest, though, for 10 billion strings you could plausibly do this on one PC. The hashtable might occupy something like 80-120 GB with a 32-bit hash, depending on the exact hashtable implementation. If you're looking for an efficient solution, you have to be a bit more specific about what you mean by "machine", because it depends on how much storage each one has and on the relative cost of network communication.
Split the file into N pieces. On each machine, load as much of the piece into memory as you can, and sort the strings. Write these chunks to mass storage on that machine. On each machine, merge the chunks into a single stream, and then merge the stream from each machine into a stream that contains all of the strings in sorted order. Compare each string with the previous. If they are the same, it is a duplicate.
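A single-machine sketch of that sort-and-merge approach (the split across machines is omitted, and the sorted runs are kept in memory here instead of being written to mass storage):

import heapq

CHUNK_SIZE = 1_000_000
runs, chunk = [], []

# Sort fixed-size chunks and keep each one as a sorted run.
with open("strings.txt") as f:                   # one string per line (assumed format)
    for line in f:
        chunk.append(line)
        if len(chunk) >= CHUNK_SIZE:
            chunk.sort()
            runs.append(chunk)
            chunk = []
if chunk:
    chunk.sort()
    runs.append(chunk)

# Merge the runs into one sorted stream and compare each string with the previous one.
prev = None
for line in heapq.merge(*runs):
    if line == prev:
        print("duplicate:", line.rstrip("\n"))
    prev = line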

Generate sequence of integers in random order without constructing the whole list upfront [duplicate]

How can I generate the list of integers from 1 to N but in a random order, without ever constructing the whole list in memory?
(To be clear: Each number in the generated list must only appear once, so it must be the equivalent to creating the whole list in memory first, then shuffling.)
A very simple generator: 1 + ((power(r, x) - 1) mod p) will range over 1 to p for values of x from 1 to p, and will look random, where r and p are prime numbers and r <> p.
Not the whole list technically, but you could use a bit mask to record whether a number has already been selected. This takes a lot less storage than the number list itself.
Set all N bits to 0, then for each desired number:
use one of the normal linear congruent methods to select a number from 1 to N.
if that number has already been used, find the next highest unused (0 bit), with wrap.
set that number's bit to 1 and return it.
That way you're guaranteed only one use per number and relatively random results.
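A small sketch of that scheme, using Python's built-in random in place of a hand-rolled linear congruential generator (as described, the results are guaranteed unique but are not a uniformly random shuffle):

import random

def unique_random_sequence(n):
    """Yield each of 1..n exactly once, tracking used values in an n-bit mask."""
    used = 0                                 # bit k set means the value k+1 was already returned
    for _ in range(n):
        k = random.randrange(n)              # stand-in for the linear congruential pick
        while (used >> k) & 1:               # already used: take the next higher unused bit, with wrap
            k = (k + 1) % n
        used |= 1 << k
        yield k + 1

print(list(unique_random_sequence(10)))      # a permutation of 1..10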
It might help to specify the language you are looking for a solution in.
You could use a dynamic list to store the numbers generated so far, since you will need a record of which numbers you have already created. Every time you create a new number, check whether it is contained in the list; if it is, throw it away and try again.
The only possible way without such a list would be to use a number space large enough that a duplicate is unlikely, like a UUID, assuming the algorithm works correctly - but this doesn't guarantee that no duplicate is generated; it is just highly unlikely.
You will need at least half of the total list's memory, just to remember what you did already.
If memory is tight, you may try this:
Keep the results generated so far in a tree: generate a random number and insert it into the tree. If you cannot insert it (it is already there), generate another number and try again, and so on, until the tree is half full.
When the tree is half full, you invert it: construct a tree holding the numbers that you haven't used yet, then pick those in random order.
It has some overhead for keeping the tree structure, but it may help when your pointers are considerably smaller in size than your data is.
