Efficient list intersection algorithm

Given two lists (not necessarily sorted), what is the most efficient non-recursive algorithm to find the set intersection of those lists?
I don't believe I have access to hashing algorithms.

You could put all elements of the first list into a hash set. Then iterate over the second list and, for each of its elements, check the hash set to see whether it exists in the first list. If so, output it as an element of the intersection.
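For illustration, a minimal sketch of this in C++ using std::unordered_set (assuming hashing is available, which the OP says it may not be):
#include <unordered_set>
#include <vector>
using namespace std;

// Average O(n + m): build a hash set from the first list, then probe
// it with each element of the second.
vector<int> intersectHashed(const vector<int>& a, const vector<int>& b) {
    unordered_set<int> seen(a.begin(), a.end());
    vector<int> out;
    for (int x : b)
        if (seen.erase(x))     // erase returns 1 if x was present, so
            out.push_back(x);  // duplicates in b are emitted only once
    return out;
}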

You might want to take a look at Bloom filters. They are bit vectors that give a probabilistic answer whether an element is a member of a set. Set intersection can be implemented with a simple bitwise AND operation. If you have a large number of null intersections, the Bloom filter can help you eliminate those quickly. You'll still have to resort to one of the other algorithms mentioned here to compute the actual intersection, however.
http://en.wikipedia.org/wiki/Bloom_filter
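As a toy sketch of the bitwise-AND idea (not a production Bloom filter, which would use several independent hash functions; here two are crudely derived from std::hash, and the 1024-bit size is an arbitrary assumption):
#include <bitset>
#include <functional>
#include <vector>
using namespace std;

const size_t M = 1024;  // filter size in bits (arbitrary for the sketch)

bitset<M> makeFilter(const vector<int>& xs) {
    bitset<M> b;
    for (int x : xs) {
        size_t h = hash<int>{}(x);
        b.set(h % M);            // first "hash function"
        b.set((h / M + 1) % M);  // crude second one
    }
    return b;
}
// If (makeFilter(a) & makeFilter(b)).none(), the intersection is
// certainly empty; otherwise run an exact algorithm to confirm.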

Without hashing, I suppose you have two options:
The naive way is to compare each element against every element of the other list: O(n^2).
Another way is to sort both lists first and then iterate over them in tandem: O(n lg n) * 2 + 2 * O(n). A minimal C++ sketch of this second option follows.
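For instance (a sketch using the standard library; it sorts copies rather than the original lists):
#include <algorithm>
#include <iterator>
#include <vector>
using namespace std;

vector<int> intersectSorted(vector<int> a, vector<int> b) {
    sort(a.begin(), a.end());             // O(n lg n)
    sort(b.begin(), b.end());             // O(m lg m)
    vector<int> out;
    set_intersection(a.begin(), a.end(),  // single linear scan
                     b.begin(), b.end(),
                     back_inserter(out));
    return out;
}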

From the EViews feature list it seems that it supports complex merges and joins (if this is 'join' as in DB terminology, it will compute an intersection). Now dig through your documentation :-)
Additionally, EViews has its own user forum - why not ask there?

With set 1, build a binary search tree in O(n log n), then iterate over set 2 and search the BST for each of its m elements at O(log n) apiece, so the total is O(n log n) + O(m log n) ==> O((n + m) log n).
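A sketch of this using std::set, which is typically implemented as a balanced BST (red-black tree):
#include <set>
#include <vector>
using namespace std;

// Build a BST from set 1 in O(n log n), then look up each element of
// set 2 in O(log n), for O((n + m) log n) overall.
vector<int> intersectBST(const vector<int>& s1, const vector<int>& s2) {
    set<int> tree(s1.begin(), s1.end());
    vector<int> out;
    for (int x : s2)
        if (tree.count(x)) out.push_back(x);
    return out;
}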

In C++, the following can be tried using an STL map:
#include <map>
#include <vector>
using namespace std;

vector<int> set_intersection(const vector<int>& s1, const vector<int>& s2) {
    vector<int> ret;
    map<int, bool> store;
    for (size_t i = 0; i < s1.size(); i++) {
        store[s1[i]] = true;
    }
    for (size_t i = 0; i < s2.size(); i++) {
        // use count() rather than operator[], which would insert a
        // default entry into the map for every miss
        if (store.count(s2[i])) ret.push_back(s2[i]);
    }
    return ret;
}

Here is another possible solution I came up with; it takes O(n log n) time and no extra storage (the sort can be done in place). You can check it out here: https://gist.github.com/4455373
Here is how it works: assuming that the sets do not contain any repetition, merge all the sets into one and sort it. Then loop through the merged list and, at each index i, look at the window between i and i+n, where n is the number of sets. What we look for as we loop is a run of n equal values.
If the value at i equals the value at the end of that window, the element at i is repeated n times, once per set; and since no set contains repetitions, every set must contain that value, so we add it to the intersection. We then skip the index past the rest of the run, because none of those positions can start another run of length n. A rough C++ sketch follows.
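This is an interpretation of the gist's idea, not the author's code (and for simplicity it allocates a merged copy rather than working in place):
#include <algorithm>
#include <vector>
using namespace std;

// Intersect k duplicate-free sets: concatenate, sort, then emit each
// value that appears k times in a row.
vector<int> intersectAll(const vector<vector<int>>& sets) {
    if (sets.empty()) return {};
    vector<int> merged;
    for (const vector<int>& s : sets)
        merged.insert(merged.end(), s.begin(), s.end());
    sort(merged.begin(), merged.end());

    vector<int> result;
    size_t k = sets.size(), i = 0;
    while (i + k <= merged.size()) {
        if (merged[i] == merged[i + k - 1]) {  // a run of k equal values
            result.push_back(merged[i]);
            i += k;                            // skip past the whole run
        } else {
            ++i;
        }
    }
    return result;
}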

First, sort both lists using quicksort: O(n*log(n)). Then, compare the lists by browsing the lowest values first, and add the common values. For example, in Lua:
function findIntersection(l1, l2)
    local i, j = 1, 1
    local intersect = {}
    while i <= #l1 and j <= #l2 do
        if l1[i] == l2[j] then
            table.insert(intersect, l1[i])
            i, j = i + 1, j + 1
        elseif l1[i] > l2[j] then
            -- swap the lists (and indices) so that l1 always holds the
            -- smaller current value
            l1, l2 = l2, l1
            i, j = j, i
        else
            i = i + 1
        end
    end
    return intersect
end
which is O(max(n, m)) where n and m are the sizes of the lists.
EDIT: quicksort is recursive, as said in the comments, but it looks like there are non-recursive implementations

Using skip pointers and SSE instructions can improve list intersection efficiency.

Why not implement your own simple hash table or hash set? It's worth it to avoid an n log n intersection if your lists are large, as you say.
Since you know a bit about your data beforehand, you should be able to choose a good hash function.
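A minimal sketch of a hand-rolled open-addressing hash set in C++ (an illustration only: linear probing, fixed capacity, no resizing, so the capacity must exceed the number of insertions):
#include <vector>
using namespace std;

struct IntHashSet {
    vector<int> slot;
    vector<bool> used;
    IntHashSet(size_t cap) : slot(cap), used(cap, false) {}

    size_t probe(int x) const {
        size_t i = (size_t)(unsigned)x % slot.size();
        while (used[i] && slot[i] != x)
            i = (i + 1) % slot.size();  // linear probing
        return i;
    }
    void insert(int x) { size_t i = probe(x); slot[i] = x; used[i] = true; }
    bool contains(int x) const { return used[probe(x)]; }
};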

I second the "sets" idea. In JavaScript, you could use the first list to populate an object, using the list elements as property names. Then, for each element of the second list, check whether the corresponding property exists.

If there is built-in support for sets (as you call them in the title), there is usually an intersection method.
Anyway, as someone said, you can do it easily if the lists are sorted (I will not post code; someone already did). And if you can't use recursion, that's no problem: there are recursion-free quicksort implementations.

In PHP, something like
function intersect($X) { // $X is an array of arrays; returns the intersection of all the arrays
    $counts = array(); $result = array();
    foreach ($X as $x) {
        // assumes no duplicates within any single array
        foreach ($x as $y) {
            $counts[$y] = isset($counts[$y]) ? $counts[$y] + 1 : 1;
        }
    }
    foreach ($counts as $x => $count) {
        if ($count == count($X)) { $result[] = $x; }
    }
    return $result;
}

From the definition of Big-Oh notation:
T(N) = O(f(N)) if there are positive constants c and n₀ such that T(N) ≤ c·f(N) when N ≥ n₀.
In practice this means that if the two lists are relatively small, say fewer than 100 elements each, two nested for loops work just fine: loop over the first list and look for each element in the second.
In my case it works just fine because I won't have more than 10-20 elements in my lists.
However, a good solution is to sort the first list, O(n log n), sort the second, also O(n log n), and merge them, O(n); roughly speaking O(2n log n + n), i.e. still O(n log n), assuming the two lists are the same size.

Time: O(n), space: O(1) solution for identifying the point of intersection.
The two pointers find the point of intersection by switching to the other list's head each time they reach an end.
public ListNode getIntersectionNode(ListNode headA, ListNode headB) {
    ListNode pA = headA;
    ListNode pB = headB;
    while (pA != pB) {
        pA = pA == null ? headB : pA.next;
        pB = pB == null ? headA : pB.next;
    }
    return pA;
}
Thanks.
Edit
My interpretation of intersection is finding the point of intersection.
For example, for given lists A and B, A and B will "meet/intersect" at some node c1, and the algorithm above will return c1. As the OP stated that they have NO access to hashmaps or the like, I believe the OP is saying that the algorithm should have O(1) space complexity.
I got this idea from Leetcode some time ago, if interested: Intersection of Two Linked Lists.

Related

Find an algorithm for sorting integers with time complexity O(n + k*log(k))

Design an algorithm that sorts n integers where there are duplicates. The total number of different numbers is k. Your algorithm should have time complexity O(n + k*log(k)). The expected time is enough. For which values of k does the algorithm become linear?
I am not able to come up with a sorting algorithm for integers which satisfies the condition that it must be O(n + k*log(k)). I am not a very advanced programmer, but in the problem before this one I was supposed to come up with an algorithm for all numbers xi in a list, 0 ≤ xi ≤ m, such that the algorithm was O(n + m), where n was the number of elements in the list and m was the value of the biggest integer in the list. I solved that problem easily using counting sort, but I struggle with this one. The condition that makes it most difficult for me is the k*log(k) term inside the big-O notation; if that were n*log(n) instead, I would be able to use merge sort, right? But that's not possible now, so any ideas would be very helpful.
Thanks in advance!
Here is a possible solution:
Using a hash table, count the number of unique values and the number of duplicates of each value. This has a complexity of O(n).
Enumerate the hash table, storing the unique values into a temporary array. Complexity is O(k).
Sort this array with a standard algorithm such as mergesort: complexity is O(k*log(k)).
Create the resulting array by replicating each element of the sorted array of unique values the number of times stored in the hash table: complexity is O(n) + O(k).
The combined complexity is O(n + k*log(k)).
For example, if k is a small constant, sorting an array of n values converges toward linear time as n becomes larger and larger.
If during the first phase, where k is computed incrementally, it appears that k is not significantly smaller than n, drop the hash table and just sort the original array with a standard algorithm.
The runtime of O(n + k*log(k)) indicates (as addition in runtimes often does) that you have 2 subroutines, one which runs in O(n) and another that runs in O(k*log(k)).
You can first count the frequency of the elements in O(n) (for example in a HashMap; look this up if you're not familiar with it, it's very useful).
Then you sort just the unique elements, of which there are k. This sorting runs in O(k*log(k)); use any sorting algorithm you want.
At the end, replace each unique element by however many copies of it you counted, by looking this up in the map you created in step 1.
A possible Java solution can look like this:
import java.util.*;

public List<Integer> sortArrayWithDuplicates(List<Integer> arr) {
    // Collect distinct elements and count frequencies => O(n)
    Set<Integer> set = new HashSet<>(arr);
    Map<Integer, Integer> freqMap = new HashMap<>();
    for (Integer i : arr) {
        freqMap.put(i, freqMap.getOrDefault(i, 0) + 1);
    }
    List<Integer> withoutDups = new ArrayList<>(set);
    // Sorting => O(k*log(k)), as there are k different elements
    Collections.sort(withoutDups);  // Arrays.sort does not accept a List
    List<Integer> result = new ArrayList<>();
    for (Integer i : withoutDups) {
        int c = freqMap.get(i);
        for (int j = 0; j < c; j++) {
            result.add(i);
        }
    }
    return result;
}
The time complexity of the above code is O(n + k*log(k)), and the solution is along the same lines as the answer above.

Algorithm: finding the closest difference between two elements of an array

You have an array and a target. Find two elements of the array whose difference is closest to the target.
(i.e., find i, j such that array[i] - array[j] is as close as possible to target)
Attempt:
I use an ordered map (a C++ hash table, I assumed) to store each element of the array. This would be O(n).
Then I output the ordered elements to a new array (which is sorted in increasing order).
Next, I use two pointers.
int mini = INT_MAX, a, b;
for (int i = 0, j = ordered_array.size() - 1; i < j;) {
    int tmp = ordered_array[i] - ordered_array[j];
    if (abs(tmp - target) < mini) {
        mini = abs(tmp - target);
        a = i;
        b = j;
    }
    if (tmp == target) return {i, j};
    else if (tmp > target) j--;
    else i++;
}
return {a, b};
Question:
Does my algorithm still run in O(n)?
If the array is already sorted, there is an O(n) algorithm: let two pointers sweep up through the array; whenever the difference between the pointed-at elements is smaller than target, increase the upper one, and whenever it is larger than target, increase the lower one. While doing so, keep track of the best result found so far.
When you reach the end of the array, the overall best result has been found.
It is easy to see that this is really O(n): both pointers only move upwards, each step moves exactly one pointer, and the array has n elements, so after at most 2n steps it halts.
As already mentioned in some comments, if you need to sort the array first, you need the O(n log n) effort for sorting (unless you can use some radix sort or counting sort).
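A minimal C++ sketch of that sweep (assuming the array is sorted ascending, has at least two elements, and target >= 0):
#include <climits>
#include <cstdlib>
#include <utility>
#include <vector>
using namespace std;

pair<int,int> closestPair(const vector<int>& a, int target) {
    int n = (int)a.size();
    int lo = 0, hi = 1, bestLo = 0, bestHi = 1, best = INT_MAX;
    while (hi < n) {
        if (lo == hi) { ++hi; continue; }
        int diff = a[hi] - a[lo];
        if (abs(diff - target) < best) {
            best = abs(diff - target);
            bestLo = lo; bestHi = hi;
        }
        if (diff < target) ++hi;  // difference too small: raise the upper
        else ++lo;                // too large (or exact): raise the lower
    }
    return make_pair(bestLo, bestHi);  // indices into the sorted array
}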
Your searching phase is linear. The two-pointer approach is equivalent to this:
Make a copy of the sorted array.
Add `target` to every entry (shifting the values up or down).
Invert the shifted array's indexing.
Walk through both arrays in parallel, checking the absolute difference.
All stages are linear (and the inversion is implicit in your code).
P.S. Is a C++ hash map ordered? I doubt it. Sorted array creation is in general O(n log n) (except for special methods like counting or radix sort).

Very hard sorting algorithm problem - O(n) time - time complexity

Since the problem is long, I cannot describe it in the title.
Imagine that we have 2 unsorted integer arrays. Both arrays have length n and contain integers between 0 and n^765 (n to the power of 765 at most).
I want to compare both arrays and find out whether they contain any common integer value, within O(n) time complexity.
No duplicates are possible within the same array.
Any help and ideas are appreciated.
What you want is impossible. Each element will be stored in up to log(n^765) bits, which is O(log n). So simply reading the contents of both arrays will take O(n*log n).
If you have a constant upper bound on the value of each element, You can solve this in O(n) average time by storing the elements of one array in a hash table, and then checking if the elements of the other array are contained in it.
Edit:
The solution you may be looking for is to use radix sort to sort your data, after which you can easily check for duplicate elements. You would look at your numbers in base n and do 765 passes over your data, each pass using a bucket sort or counting sort to sort by a single digit (in base n). This process takes O(n) time in the worst case (assuming a constant upper bound on element size). Note that I doubt anyone would ever choose this over a hash table in practice.
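A sketch of one such pass-per-digit radix sort in C++ (a generic illustration with k passes; the question's k would be 765, and values are assumed to fit the integer type used):
#include <vector>
using namespace std;

// Sort values in [0, n^k) with k stable counting-sort passes over
// base-n digits: O(k*n) total, i.e. O(n) for constant k.
void radixSortBaseN(vector<long long>& a, int k) {
    long long n = (long long)a.size();
    vector<long long> out(a.size());
    long long div = 1;
    for (int pass = 0; pass < k; ++pass) {
        vector<long long> count(n + 1, 0);
        for (long long v : a)
            ++count[(v / div) % n + 1];
        for (long long d = 0; d < n; ++d)
            count[d + 1] += count[d];          // digit start offsets
        for (long long v : a)                  // stable scatter by digit
            out[count[(v / div) % n]++] = v;
        a.swap(out);
        div *= n;
    }
}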
By assuming multiplication and division are O(1):
Think about the numbers; you can write them as:
Number(i) = A0 * n^765 + A1 * n^764 + ... + A764 * n + A765.
To encode a number into this format, you just compute Number / n^i and Number % n^i; if you precompute n^1, n^2, n^3, ..., this can be done in O(n * 765) => O(n) for all numbers. The precomputation of the n^i can be done in O(i), and since i is at most 765, it is O(1) overall.
Now you can treat each Number(i) as an array: Number(i) = (A0, A1, ..., A765), and radix sort the items:
first compare all the A765's, then the A764's, and so on. All the Ai's are in the range 0..n, so for comparing them you can use counting sort (which is O(n)); your radix sort is therefore O(n * 765), which is O(n).
After the radix sort you have two sorted arrays, and you can simply find one common item in O(n), or use a merge (as in merge sort) to find every common item (not just one).
As a generalization, if the size of the input items is O(n^C), they can be sorted in O(n) (C is a fixed number). But because the overhead of this kind of sorting is big, quicksort and similar algorithms are usually preferred. A simple instance of this question can be found in the Introduction to Algorithms book, which asks: if the numbers are in the range (0..n^2), how do you sort them in O(n)?
Edit: to clarify how you can find common items in two sorted lists:
You have 2 sorted lists. Think of how, in merge sort, you merge two sorted lists into one: you move from the start of list 1 and list 2, advancing the head pointer of list 1 while head(list1) < head(list2), then doing the same for list 2, and so on. If there is a common item, the algorithm stops when it reaches it (before the end of the lists); otherwise it stops at the end of the two lists.
It's as easy as the code below:
public int FindSimilarityInSortedLists(List<int> list1, List<int> list2)
{
    int i = 0;
    int j = 0;
    while (i < list1.Count && j < list2.Count)
    {
        if (list1[i] == list2[j])
            return list1[i];
        if (list1[i] < list2[j])
            i++;
        else
            j++;
    }
    return -1; // not found
}
If memory were unlimited, you could simply create a hash table with the integers as keys and the number of times they are found as values. Then, to do your "fast" lookup, you simply query for an integer, discover whether it's contained in the hash table, and, if found, check whether the value is 1 or 2. That would take O(n) to load and O(1) per query.
I do not think you can do it in O(n).
You have to check, for each of the n values, whether it is in the other array. That is at least n comparison operations even if the other array had just 1 element; and since the other array has n elements as well, direct comparison gives O(n*n).

How to generate a permutation?

My question is: given a list L of length n, and an integer i such that 0 <= i < n!, how can you write a function perm(L, i) to produce the ith permutation of L in O(n) time? By the ith permutation I just mean the ith permutation in some implementation-defined ordering that must have the properties:
For any i and any 2 lists A and B, perm(A, i) and perm(B, i) must both map the jth element of A and B to an element in the same position for both A and B.
For any inputs (A, i), (A, j), perm(A, i) == perm(A, j) if and only if i == j.
NOTE: this is not homework. In fact, I solved this 2 years ago, but I've completely forgotten how, and it's killing me. Also, here is a broken attempt I made at a solution:
def perm(s, i):
    n = len(s)
    perm = [0]*n
    itCount = 0
    for elem in s:
        perm[i%n + itCount] = elem
        i = i / n
        n -= 1
        itCount += 1
    return perm
ALSO NOTE: the O(n) requirement is very important. Otherwise you could just generate the n! sized list of all permutations and just return its ith element.
def perm(sequence, index):
    sequence = list(sequence)
    result = []
    for x in xrange(len(sequence)):
        idx = index % len(sequence)
        index /= len(sequence)
        result.append(sequence[idx])
        # constant time non-order preserving removal
        sequence[idx] = sequence[-1]
        del sequence[-1]
    return result
Based on the algorithm for shuffling, but we take the least significant part of the number each time to decide which element to take, instead of a random number. Alternatively, consider it like the problem of converting to some arbitrary base, except that the base shrinks by one for each additional digit.
Could you use factoradics? You can find an illustration via this MSDN article.
Update: I wrote an extension of the MSDN algorithm that finds i'th permutation of n things taken r at a time, even if n != r.
A computationally minimalistic approach (written in C-style pseudocode):
function perm(list, i){
    for(a = list.length; a; a--){
        list.switch(a-1, i mod a);
        i = i / a;
    }
    return list;
}
Note that implementations relying on removing elements from the original list tend to run in O(n^2) time, or at best O(n*log(n)) given a special tree-style list implementation designed for quickly inserting and removing list elements.
The above code, rather than shrinking the original list and keeping it in order, just moves an element from the end into the vacant location. It still makes a perfect 1:1 mapping between index and permutation, just a slightly more scrambled one, in pure O(n) time.
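For concreteness, a runnable C++ rendering of the pseudocode above (the `switch` operation becomes std::swap):
#include <utility>
#include <vector>
using namespace std;

// Decode index i into the i-th permutation of list, in O(n).
vector<int> perm(vector<int> list, unsigned long long i) {
    for (size_t a = list.size(); a > 0; --a) {
        swap(list[a - 1], list[i % a]);
        i /= a;
    }
    return list;
}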
So, I think I finally solved it. Before I read any answers, I'll post my own here.
def perm(L, i):
    n = len(L)
    if n == 1:
        return L
    else:
        split = i % n
        return [L[split]] + perm(L[:split] + L[split+1:], i / n)
There are n! permutations. The first character can be chosen from L in n ways, and each of those choices leaves (n-1)! permutations among the remaining elements. So this idea is enough for establishing an order. In general, you figure out which block of permutations you are in, pick the appropriate element, and then recurse/loop on the smaller L.
The argument that this works correctly is by induction on the length of the sequence. (Sketch:) For length 1 it is trivial. For length n, you use the above observation to split the problem into n parts, each a question about an L' of length (n-1). By induction, all the L' are constructed correctly (and in linear time). Then it is clear we can use the induction hypothesis to construct a solution for length n.

Median Algorithm in O(log n)

How can we remove the median of a set with time complexity O(log n)? Any ideas?
If the set is sorted, finding the median requires O(1) item retrievals. If the items are in arbitrary sequence, it will not be possible to identify the median with certainty without examining the majority of the items. If one has examined most, but not all, of the items, that will allow one to guarantee that the median will be within some range [if the list contains duplicates, the upper and lower bounds may match], but examining the majority of the items in a list implies O(n) item retrievals.
If one has the information in a collection which is not fully ordered, but where certain ordering relationships are known, then the time required may require anywhere between O(1) and O(n) item retrievals, depending upon the nature of the known ordering relation.
For unsorted lists, repeatedly do O(n) partial sort until the element located at the median position is known. This is at least O(n), though.
Is there any information about the elements being sorted?
For a general, unsorted set, it is impossible to reliably find the median in better than O(n) time. You can find the median of a sorted set in O(1), or you can trivially sort the set yourself in O(n log n) time and then find the median in O(1), giving an O(n log n) algorithm overall. Or, finally, there are more clever median selection algorithms that work by partitioning instead of sorting and yield O(n) performance.
But if the set has no special properties and you are not allowed any pre-processing step, you will never get below O(n) by the simple fact that you will need to examine all of the elements at least once to ensure that your median is correct.
Here's a solution in Java, based on TreeSet:
public class SetWithMedian {
    private SortedSet<Integer> s = new TreeSet<Integer>();
    private Integer m = null;

    public boolean contains(int e) {
        return s.contains(e);
    }

    public Integer getMedian() {
        return m;
    }

    public void add(int e) {
        s.add(e);
        updateMedian();
    }

    public void remove(int e) {
        s.remove(e);
        updateMedian();
    }

    private void updateMedian() {
        if (s.size() == 0) {
            m = null;
        } else if (s.size() == 1) {
            m = s.first();
        } else {
            SortedSet<Integer> h = s.headSet(m);
            SortedSet<Integer> t = s.tailSet(m + 1);
            int x = 1 - s.size() % 2;
            if (h.size() < t.size() + x)
                m = t.first();
            else if (h.size() > t.size() + x)
                m = h.last();
        }
    }
}
Removing the median (i.e. "s.remove(s.getMedian())") takes O(log n) time.
Edit: To help understand the code, here's the invariant condition of the class attributes:
private boolean isGood() {
    if (s.isEmpty()) {
        return m == null;
    } else {
        return s.contains(m)
            && s.headSet(m).size() + s.size() % 2 == s.tailSet(m).size();
    }
}
In human-readable form:
If the set "s" is empty, then "m" must be null.
If the set "s" is not empty, then it must contain "m".
Let x be the number of elements strictly less than "m", and let y be the number of elements greater than or equal to "m". Then, if the total number of elements is even, x must be equal to y; otherwise, x+1 must be equal to y.
Try a red-black tree. It should work quite well, and with a binary search you get your O(log n). It also has O(log n) removal and insertion, and rebalancing is done in O(log n) as well.
As mentioned in previous answers, there is no way to find the median without touching every element of the data structure. If the algorithm you look for must be executed sequentially, then the best you can do is O(n). The deterministic selection algorithm (median-of-medians) or BFPRT algorithm will solve the problem with a worst case of O(n). You can find more about that here: http://en.wikipedia.org/wiki/Selection_algorithm#Linear_general_selection_algorithm_-_Median_of_Medians_algorithm
However, the median-of-medians algorithm can be made to run faster than O(n) by making it parallel. Due to its divide-and-conquer nature, the algorithm can be "easily" parallelized. For instance, when dividing the input array into groups of 5 elements, you could launch a thread for each sub-array, sort it, and find the median within that thread. When this step finishes, the threads are joined and the algorithm is run again with the newly formed array of medians.
Note that such a design would only be beneficial on really large data sets. The additional overhead of spawning threads and joining them makes it unfeasible for smaller sets. This has a bit of insight: http://www.umiacs.umd.edu/research/EXPAR/papers/3494/node18.html
Note that you can find asymptotically faster algorithms out there, however they are not practical enough for daily use. Your best bet is the already mentioned sequential median-of-medians algorithm.
Master Yoda's randomized algorithm has, of course, a minimum complexity of n like any other, an expected complexity of n (not log n), and a maximum complexity of n squared, like quicksort. It's still very good.
In practice, the "random" pivot choice might sometimes be a fixed location (without involving an RNG), because the initial array elements are known to be random enough (e.g. a random permutation of distinct values, or independent and identically distributed) or deduced from an approximate or exactly known distribution of input values.
I know one randomized algorithm with time complexity of O(n) in expectation.
Here is the algorithm:
Input: an array of n numbers A[1..n] (without loss of generality, we can assume n is even).
Output: the n/2-th element of the sorted array.
Algorithm (A[1..n], k = n/2):
Pick a pivot index p uniformly at random from 1..n.
Divide the array into 2 parts:
L - elements <= A[p]
R - elements > A[p]
If k == |L|, the largest element of L (the pivot value) is the median: stop.
If k < |L|, recurse on (L, k).
Else, recurse on (R, k - |L|).
Complexity:
O(n) in expectation.
The proof is all mathematical, about one page long. If you are interested, ping me.
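A minimal C++ sketch of this randomized selection (quickselect), written iteratively; expected O(n), worst case O(n^2):
#include <cstdlib>
#include <utility>
#include <vector>
using namespace std;

int quickselect(vector<int> a, size_t k) {  // k is a 0-based rank
    size_t lo = 0, hi = a.size() - 1;
    while (lo < hi) {
        swap(a[lo + rand() % (hi - lo + 1)], a[hi]);  // random pivot
        int pivot = a[hi];
        size_t store = lo;
        for (size_t i = lo; i < hi; ++i)       // Lomuto partition
            if (a[i] <= pivot) swap(a[i], a[store++]);
        swap(a[store], a[hi]);                 // pivot to final position
        if (k == store) return a[store];
        if (k < store) hi = store - 1; else lo = store + 1;
    }
    return a[lo];
}
// The median of n elements is quickselect(a, a.size() / 2).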
To expand on rwong's answer, here is some example code:
// partial_sort example
#include <iostream>
#include <algorithm>
#include <vector>
using namespace std;

int main() {
    int myints[] = {9,8,7,6,5,4,3,2,1};
    vector<int> myvector(myints, myints+9);
    vector<int>::iterator it;

    partial_sort(myvector.begin(), myvector.begin()+5, myvector.end());

    // print out content:
    cout << "myvector contains:";
    for (it = myvector.begin(); it != myvector.end(); ++it)
        cout << " " << *it;
    cout << endl;

    return 0;
}
Output:
myvector contains: 1 2 3 4 5 9 8 7 6
The element in the middle would be the median.
