Given k sorted inverted lists, I want an efficient algorithm to get the union of these k lists.
Each inverted list is a read-only array in memory, and each list contains integers in sorted order.
The result will be saved in a predefined array which is large enough. Is there any algorithm better than a k-way merge?
K-way merge is optimal. It takes O(n*log(k)) operations [where n is the number of elements in all lists combined].
It is easy to see it cannot be done better - as @jpalecek mentioned, otherwise you could sort any array in better than O(n*log(n)) by splitting it into chunks [inverted indexes] of size 1.
Note: This answer assumes it is important that the inverted indexes [the resulting array] be sorted. This assumption is true for most applications that use inverted indexes, especially in the Information-Retrieval area. This feature [sorted indexes] allows elegant and quick intersection of indexes.
Note that standard k-way merge allows duplicates; you will have to make sure that if an element appears in two lists, it is added only once [easy to do by simply checking the last element in the target array before adding].
If you don't need the resulting array to be sorted, the best approach would be using a hash table to mark which elements you have already seen. This way you can get O(n) time complexity (n being the total number of elements).
Something along the lines of (Perl):
my %seen;
my @merged = grep { exists $seen{$_} ? 0 : ($seen{$_} = 1) } map { @$_ } @inputs;
Related
I have to make an algorithm that matches items from two arrays. We are not allowed to sort either array first; we can only compare an item from array 1 with an item from array 2 (the comparisons being <, =, >). The output is two lists and they have the same order. I can think of ways to solve it in n(n+1)/2 comparisons. The goal is O(n log n). I have been banging my head against a wall trying to think of a way, but I can't. Can anyone give me a hint?
So to explain: the input is two arrays, e.g. A = [1,3,6,2,5,4] and B = [4,2,3,5,1,6], and the output is the two arrays with the same order. You cannot sort the arrays individually first or compare items within the same array. You can only compare items across lists, like A_1 < B_1, A_2 = B_3, A_4 < B_3.
Similar to quicksort:
Use a random A-element to partition B into smaller-B, equal-B and larger-B. Use its equal B-element to partition A. Recursively match smaller-A with smaller-B as well as larger-A with larger-B.
Just like quicksort, expected time is O(n log n) and worst case is O(n^2).
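A minimal Python sketch of that scheme, assuming distinct values and that B is a permutation of A; it returns the matching as (a, b) pairs, and only ever compares an A-element with a B-element:

import random

def match(A, B):
    if not A:
        return []
    a = random.choice(A)                      # random A-element as pivot
    # Partition B by comparing against the A-pivot.
    b_small = [b for b in B if b < a]
    b_large = [b for b in B if b > a]
    b_equal = next(b for b in B if not (b < a) and not (a < b))
    # Partition A by comparing against the matching B-element.
    a_small = [x for x in A if x < b_equal]
    a_large = [x for x in A if b_equal < x]
    # Recursively match smaller with smaller and larger with larger.
    return match(a_small, b_small) + [(a, b_equal)] + match(a_large, b_large)

Note that the pivot a itself ends up in neither recursive call, because it is neither less than nor greater than b_equal.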
Suppose I have an unsorted array P and its sorted equivalent P_Sorted. Suppose L and R refer to the left and right halves of P. Is there a way to recover L_Sorted and R_Sorted from P and P_Sorted in linear time without using extra memory?
For further clarification, during a recursive merge sort implementation L_Sorted and R_Sorted would be merged together to form P_Sorted, so I'm kinda looking to reverse the merge step.
In a merge sort, you divide the array into two halves recursively and merge them. So at the last merge, you have already sorted the left and right halves - they are sorted independently - which is why it is called divide and conquer.
Therefore, when doing a merge, you can just look at the sizes of the arrays being merged: if they are equal (even input size) or differ by 1 (odd input size), you are at the last merge. Then you could store those sorted arrays in some variable before merging them.
BUT if you are not allowed to touch the merge function, and you need to work only with the sorted array and the original array, I think the solution is not straightforward. I found a URL that poses this problem and a possible solution.
It seems feasible in linear time for very specific datasets:
If there is a way to tell the original position of each data element in the sorted list - for example, if these are records with a creation date and a name field, and the original array is in chronological order - then selecting from the sorted array the elements that fall in the first or second half can be done in a single scan, in linear time, with no space overhead.
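A sketch of that single scan in Python; the original_index helper is hypothetical (e.g. a record's position recovered from its creation date), and the two output lists stand in for whatever destination you have available:

def split_sorted(p_sorted, n, original_index):
    l_sorted, r_sorted = [], []
    for record in p_sorted:
        # One check per record: did it come from the left or right half of P?
        if original_index(record) < n // 2:
            l_sorted.append(record)
        else:
            r_sorted.append(record)
    return l_sorted, r_sorted  # both remain sorted, since p_sorted is scanned in order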
In the general case, sorting the left and right halves seems the most efficient way to get L_sorted and R_sorted, with or without P_sorted. The time complexity is O(n log n).
According to Wikipedia and other resources, quick sort happens to be a special case of sample sort, because we always choose 1 partitioning item, put it in its place, and continue the sort; so quick sort is sample sort where m (the number of partitioning items at each step) is 1. So, my question is: for 1 < m < n, does it have the same complexity as quick sort when it's not parallel?
The following is the algorithm for sample sort described on wikipedia.
1) Find splitters, values that break up the data into buckets, by sampling the data.
2) Use the sorted splitters to define buckets and place data in appropriate buckets.
3) Sort each of the buckets.
I am not exactly sure I understand this algorithm correctly, but I think we first find the partitioning item, put it in its place, and then look to the left and to the right to find more partitioning items there, and then recursively call the same function to partition each one of those m samples into m samples again - am I right? Because if so, it seems that sample sort performs the same as quick sort, because it simply does the same thing, except half of it iteratively (when looking for splitters) and half of it recursively.
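To make the three steps concrete, here is a rough Python sketch of sample sort as described above (the choices of m, the sampling, and the small-input cutoff are all illustrative):

import random

def sample_sort(data, m=3, cutoff=32):
    if len(data) <= cutoff:
        return sorted(data)                       # small input: any plain sort
    # 1) Find m splitters by sampling the data.
    splitters = sorted(random.sample(data, m))
    # 2) Place each element into one of the m + 1 buckets the splitters define.
    buckets = [[] for _ in range(m + 1)]
    for x in data:
        i = 0
        while i < m and x > splitters[i]:
            i += 1
        buckets[i].append(x)
    if max(len(b) for b in buckets) == len(data):
        return sorted(data)                       # degenerate split (e.g. all values equal)
    # 3) Sort each bucket (recursively here) and concatenate.
    return [x for b in buckets for x in sample_sort(b, m, cutoff)]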
They will have different complexity. When m > 1, the running time would be approximately C*N*log_(m+1)(N). The constant C will be large enough to make it slower than ordinary QuickSort, because there is no known algorithm that partitions a list into m + 1 buckets as efficiently as partitioning it into two buckets.
For example, normal QuickSort takes O(N) to partition the list into two subarrays. Assume that in the best case QuickSort perfectly chooses a value that splits the list into two buckets of the same size. Then:
C(N) = 2*C(N/2) + N  =>  C(N) = N*log2(N)
Let's assume that m = 2, meaning we need to partition the list into three subarrays. Assume that in the best case we can perfectly choose values that split the list into three buckets of the same size, but say the cost of the partition is O(3N). Then:
C(N) = 3*C(N/3) + 3N  =>  C(N) = 3*N*log3(N)
As you can see, 3*N*log3(N) > N*log2(N): by change of base, 3*N*log3(N) = (3 / log2(3)) * N*log2(N) ≈ 1.89 * N*log2(N), so the three-way version is slower by a constant factor.
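A quick numerical look at those two formulas (the 3N partition cost for the three-way split is the assumption made above, not a measured figure):

import math

# 2-way: C(N) = N * log2(N);  3-way with the assumed 3N partition cost: C(N) = 3 * N * log3(N)
for n in (10**3, 10**6, 10**9):
    c2 = n * math.log2(n)
    c3 = 3 * n * math.log(n, 3)
    print(f"N={n:>10}  two-way={c2:.3e}  three-way={c3:.3e}  ratio={c3 / c2:.2f}")
# The ratio is the constant 3 / log2(3) ≈ 1.89 for every N.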
I'm working on a program that takes in a bunch (y) of integers and then needs to return the x highest integers in order. This code needs to be as fast as possible, but at the moment I don't think I have the best algorithm.
My approach/algorithm so far is to create a sorted list of integers (high to low) that have already been input and then handle each item as it comes in. For the first x items, I maintain a sorted array of integers, and when each new item comes in, I figure out where it should be placed using a binary search. (I'm also considering just taking in the first x items and then quicksorting them, but I don't know if this is faster.) After the first x items have been sorted, I consider each remaining item by first seeing if it qualifies to enter the already sorted list of highest integers (by checking whether the new integer is greater than the integer at the end of the list); if it does, I add it to the sorted list via a binary search and remove the integer at the end of the list.
I was wondering if anyone had any advice as to how I can make this faster, or perhaps an entirely new approach that is faster than this. Thanks.
This is a partial sort:
The fastest implementation is Quicksort where you only recurse on ranges containing the bottom/top k elements.
In C++ you can just use std::partial_sort
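A rough Python sketch of that idea - a quicksort that stops recursing into any range lying entirely past the first x positions (the function name is mine; in C++, std::partial_sort with a reversed comparator does the same job):

import random

def partial_sort_desc(a, x):
    def go(lo, hi):
        # Skip ranges that start at or past x: they cannot affect the top x slots.
        if hi - lo <= 1 or lo >= x:
            return
        # Lomuto partition around a random pivot, larger elements first.
        p = random.randrange(lo, hi)
        a[p], a[hi - 1] = a[hi - 1], a[p]
        pivot, store = a[hi - 1], lo
        for i in range(lo, hi - 1):
            if a[i] > pivot:
                a[store], a[i] = a[i], a[store]
                store += 1
        a[store], a[hi - 1] = a[hi - 1], a[store]
        go(lo, store)
        go(store + 1, hi)
    go(0, len(a))
    return a[:x]  # the x highest values, in descending order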
If you use a heap-ordered tree data structure to store the integers, inserting a new integer takes no more than lg N comparisons and removing the maximum takes no more than 2 lg N comparisons. Thus, inserting y items would require no more than y lg N comparisons, and removing the top x items would require no more than 2x lg N comparisons. The Wikipedia entry has references to a range of implementations.
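For example, with Python's heapq standing in for the heap-ordered tree (a sketch; Python only provides a min-heap, so values are negated to get max-heap behaviour):

import heapq

def top_x(values, x):
    heap = []
    for v in values:                 # y insertions, about lg N comparisons each
        heapq.heappush(heap, -v)
    return [-heapq.heappop(heap)     # x removals of the maximum, about 2 lg N comparisons each
            for _ in range(min(x, len(heap)))]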
This is called a top-N sort. Here is a very simple and efficient scheme. No fancy data structures needed.
1) Keep a list of the highest x elements (it starts out empty).
2) Split your input into chunks of x * 10 items.
3) For each chunk, add the remembered list of the x highest items so far to it and sort it (e.g. quick sort).
4) Keep the x highest items; they form the new remembered list.
5) Go to 3) until all chunks are processed.
6) The remembered list is now your final result.
This is O(N) in the number of items and only requires a normal quick sort as a primitive.
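A direct Python transcription of those steps (using the built-in sort in place of a hand-rolled quicksort):

def top_n_chunked(items, x):
    best = []                                         # 1) remembered list of the highest x, starts empty
    chunk_size = 10 * x                               # 2) chunks of x * 10 items
    for start in range(0, len(items), chunk_size):
        chunk = items[start:start + chunk_size]
        merged = sorted(best + chunk, reverse=True)   # 3) add the remembered list and sort
        best = merged[:x]                             # 4) keep the x highest as the new remembered list
    return best                                       # 6) the remembered list is the final result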
You don't seem to need the top N items in sorted order. Because of this, you can solve this in linear time.
Find the Nth largest array element using linear-time selection. Return it and all array elements larger than it.
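A sketch of that in Python, using quickselect for the selection step (expected linear time; a worst-case-linear method such as median-of-medians could be dropped in instead):

import random

def top_n_unordered(a, n):
    a = list(a)
    lo, hi = 0, len(a)
    k = len(a) - n                            # index of the Nth largest in ascending order
    while hi - lo > 1:
        pivot = a[random.randrange(lo, hi)]
        # Three-way partition of the active range around the pivot.
        less = [v for v in a[lo:hi] if v < pivot]
        equal = [v for v in a[lo:hi] if v == pivot]
        greater = [v for v in a[lo:hi] if v > pivot]
        a[lo:hi] = less + equal + greater
        if k < lo + len(less):
            hi = lo + len(less)
        elif k < lo + len(less) + len(equal):
            break                             # a[k] is the pivot value itself
        else:
            lo += len(less) + len(equal)
    threshold = a[k]                          # the Nth largest element
    larger = [v for v in a if v > threshold]
    return larger + [threshold] * (n - len(larger))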
If I have N arrays, what is the best (time complexity; space is not important) way to find the common elements? You could just find 1 element and stop.
Edit: The elements are all Numbers.
Edit: These are unsorted. Please do not sort and scan.
This is not a homework problem. Somebody asked me this question a long time ago. He was using a hash to solve the problem and asked me if I had a better way.
Create a hash index, with elements as keys and counts as values. Loop through all values and update the counts in the index. Afterwards, run through the index and check which elements have count = N. Looking up an element in the index should be O(1), so combined with looping through all M elements this should be O(M).
If you want to keep order specific to a certain input array, loop over that array and test the element counts in the index in that order.
Some special cases:
if you know that the elements are (positive) integers with a maximum value that is not too high, you could just use a normal array as the "hash" index to keep counts, where the numbers are simply the array indices.
I've assumed that in each array each number occurs only once. Adapting it for more occurrences should be easy (set the i-th bit in the count for the i-th array, or only update the count if the current count is i-1).
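A Python sketch of that counting index, folding in the adaptation from the second special case (the count for a value only advances once per array):

from collections import defaultdict

def common_to_all(arrays):
    counts = defaultdict(int)
    for i, arr in enumerate(arrays):
        for value in arr:
            # Advance the count only the first time `value` is seen in array i.
            if counts[value] == i:
                counts[value] = i + 1
    n = len(arrays)
    # Elements whose count reached N appear in every array.
    return [value for value, c in counts.items() if c == n]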
EDIT: when I answered, the question did not yet include the part about "a better way" than hashing.
The most direct method is to intersect the first 2 arrays and then intersect that result with each of the remaining N-2 arrays.
If 'intersection' is not defined in the language in which you're working, or you require a more specific answer (i.e. you need the answer to 'how do you do the intersection'), then modify your question accordingly.
Without sorting, there isn't an optimized way to do this based on the information given (i.e. sorting and positioning all elements relative to each other, then iterating over the length of the arrays, checking for elements present in all the arrays at once).
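For completeness, a short Python sketch of that pairwise intersection, using sets as the 'intersection' primitive and stopping as soon as one common element is known (or the running intersection becomes empty):

def find_common_element(arrays):
    common = set(arrays[0])
    for arr in arrays[1:]:
        common &= set(arr)          # intersect the running result with the next array
        if not common:
            return None             # nothing can be common to all N arrays
    return next(iter(common))       # the question only needs one common element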
The question asks whether there is a better way than hashing. There is no better way (i.e. better time complexity) than a hash, since the time to hash each element is typically constant. Empirical performance is also favorable, particularly if the range of values can be mapped one-to-one to an array maintaining counts. The time is then proportional to the number of elements across all the arrays. Sorting will not give better complexity, since it still needs to visit each element at least once, and then there is the log N factor for sorting each array.
Back to hashing: from a performance standpoint, you will get the best empirical performance by not processing each array fully, but processing only a block of elements from each array before proceeding to the next array. This takes advantage of the CPU cache. It also results in fewer elements being hashed in favorable cases where the common elements appear in the same regions of the arrays (e.g. common elements at the start of all arrays). Worst-case behaviour is no worse than hashing each array in full - merely that all elements are hashed.
I don't think the approach suggested by catchmeifyoutry will work.
Let us say you have two arrays
1: {1,1,2,3,4,5}
2: {1,3,6,7}
then the answer should be 1 and 3. But if we use the hashtable approach, 1 will have count 3 and we will never find 1 in this situation.
Also, the problem becomes more complex if we have input something like this:
1: {1,1,1,2,3,4}
2: {1,1,5,6}
Here I think we should give the output 1, 1. The suggested approach fails in both cases.
Solution:
Read the first array and put its elements into a hashtable; if we find the same key again, don't increment the counter. Read the second array in the same manner. Now the hashtable holds the common elements, which have a count of 2.
But again, this approach will fail on the second input set I gave earlier.
I'd first start with the degenerate case, finding common elements between 2 arrays (more on this later). From there I'll have a collection of common values which I will use as an array itself and compare it against the next array. This check would be performed N-1 times or until the "carry" array of common elements drops to size 0.
One could speed this up, I'd imagine, by divide-and-conquer, splitting the N arrays into the end nodes of a tree. The next level up the tree is N/2 common-element arrays, and so on until you have an array at the top that is either filled or not. In either case, you'd have your answer.
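A small Python sketch of that tree-style reduction, with set intersection standing in for the two-array comparison:

def common_tree(arrays):
    level = [set(a) for a in arrays]            # leaves of the tree: one set per array
    while len(level) > 1:
        nxt = [level[i] & level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])               # odd array out is carried up unchanged
        level = nxt                             # next level: about half as many arrays
    return level[0]                             # either filled with common elements or empty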
Without sorting and scanning, the best operational speed you'll get for comparing 2 arrays for common elements is O(N^2).