How to augment a skip list so that we can efficiently extract the max value of a specific segment? [Skip list not sorted by value]

I have a problem I'm struggling with.
I have a skip list with elements:
element = (date, value)
The dates are the keys of the skip list, and hence the skip list is sorted by date.
How can I augment the skip list such that the function
Max(d1, d2) -> returns the largest value between dates d1 and d2
is as efficient as possible?
The values are integers.

With the skip list as described (no extra information stored in the nodes), the most efficient way is to iterate over each item from d1 to d2 and select the maximum. Because the skip list is ordered by date, you cannot assume anything about the order of the values: they might as well be randomly ordered. So you'll have to look at each one.
So it's O(log n) (on average: this is a skip list, after all) to find d1, and then it's O(range) to find the maximum element, where range is the number of items between d1 and d2, inclusive.
To implement this, add a function to the skip list that lets you iterate the list starting at an arbitrary element. You almost certainly already have a function that iterates over the entire list in order, so all you have to do is create a function that iterates over a range of keys (i.e. from a start key to an end key).
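A minimal sketch of that range iteration, assuming a bare-bones skip list written just for illustration (the names SkipList, insert and range_max are hypothetical, dates are ISO strings so that they compare in date order, and duplicate keys are not handled):

```python
import random


class Node:
    def __init__(self, key, value, height):
        self.key = key
        self.value = value
        self.forward = [None] * height  # forward[level] is the next node on that level


class SkipList:
    MAX_HEIGHT = 16

    def __init__(self, seed=0):
        self.head = Node(None, None, self.MAX_HEIGHT)  # sentinel, never compared
        self.rng = random.Random(seed)

    def _random_height(self):
        height = 1
        while height < self.MAX_HEIGHT and self.rng.random() < 0.5:
            height += 1
        return height

    def _predecessors(self, key):
        # For every level, the last node whose key is strictly smaller than `key`.
        update = [self.head] * self.MAX_HEIGHT
        node = self.head
        for level in reversed(range(self.MAX_HEIGHT)):
            while node.forward[level] is not None and node.forward[level].key < key:
                node = node.forward[level]
            update[level] = node
        return update

    def insert(self, key, value):
        update = self._predecessors(key)
        new = Node(key, value, self._random_height())
        for level in range(len(new.forward)):
            new.forward[level] = update[level].forward[level]
            update[level].forward[level] = new

    def range_max(self, d1, d2):
        # O(log n) expected to reach d1, then O(range) along the bottom level.
        node = self._predecessors(d1)[0].forward[0]
        best = None
        while node is not None and node.key <= d2:
            if best is None or node.value > best:
                best = node.value
            node = node.forward[0]
        return best  # None when no key lies in [d1, d2]
```

Both endpoints are treated as inclusive here, matching the "d1 to d2, inclusive" wording above.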

Related

fastest algorithm for sum queries in a range

Assume we have the following data, which consists of consecutive runs of 0s and 1s (the nature of the data is that there are very, very few 1s).
data =
[0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0]
so a huge number of zeros, and then possibly some ones (which indicate that some sort of an event is happening).
You want to query this data many times. The query is that given two indices i and j what is sum(data[i:j]). For example, sum_query(i=12, j=25) = 2 in above example.
Note that you have all these queries in advance.
What sort of a data structure can help me evaluate all the queries as fast as possible?
My initial thoughts:
Preprocess the data and obtain two shorter arrays: data_change and data_cumsum. data_change holds the indices where each run of 1s starts and where the following run of 0s starts, and so on. data_cumsum holds the corresponding cumulative sums up to the indices in data_change, i.e. data_cumsum[k] = sum(data[0:data_change[k]]).
In the above example, the preprocessing results in: data_change=[8,11,18,20,31,35] and data_cumsum=[0,3,3,5,5,9]
Then if a query comes in for i=12 and j=25, I do a binary search in the sorted data_change array to find the corresponding index for 12 and then for 25, which results in the 0-based indices bin_search(data_change, 12)=2 and bin_search(data_change, 25)=4.
Then I simply output the corresponding difference from the cumsum array: data_cumsum[4] - data_cumsum[2]. (I won't go into the detail of handling the situation where an endpoint of the query range falls in the middle of a run of 1s, but those cases can be handled easily with an if-statement.)
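One way the change-point idea could look in code, including the mid-run case. This is only a sketch: the helper names are made up, and the binary search uses "last change point at or before k" rather than the bin_search convention above, so the intermediate indices differ even though the results agree.

```python
from bisect import bisect_right


def build_change_points(data):
    # data_change: indices where the value flips; data_cumsum[k] = sum(data[0:data_change[k]]).
    data_change, data_cumsum = [], []
    total, previous = 0, 0
    for idx, bit in enumerate(data):
        if bit != previous:
            data_change.append(idx)
            data_cumsum.append(total)
            previous = bit
        total += bit
    return data_change, data_cumsum


def ones_before(data_change, data_cumsum, k):
    # Number of 1s in data[0:k].
    pos = bisect_right(data_change, k) - 1  # last change point <= k
    if pos < 0:
        return 0
    # Even positions start a run of 1s, odd positions start a run of 0s.
    inside_run_of_ones = pos % 2 == 0
    extra = k - data_change[pos] if inside_run_of_ones else 0
    return data_cumsum[pos] + extra


def sum_query(data_change, data_cumsum, i, j):
    # Equivalent to sum(data[i:j]) on the original array.
    return ones_before(data_change, data_cumsum, j) - ones_before(data_change, data_cumsum, i)
```

The space used is proportional to the number of runs rather than to the length of the data, and each query costs two binary searches.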
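With linear space, linear preprocessing, and constant query time, you can store an array of prefix sums. The i'th position gets the sum of the first i elements (so position 0 holds 0). To get the sum of elements i through j inclusive, counting from 1, take the difference sums[j] - sums[i-1]; for the half-open slice data[i:j] from the question, that becomes sums[j] - sums[i].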
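A short sketch of the prefix-sum array, using the question's half-open data[i:j] convention (the function names are illustrative only):

```python
from itertools import accumulate


def build_prefix_sums(data):
    # sums[k] holds sum(data[0:k]); sums[0] is 0 and len(sums) == len(data) + 1.
    return [0] + list(accumulate(data))


def sum_query(sums, i, j):
    # Sum of data[i:j] in O(1).
    return sums[j] - sums[i]


data = [0,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0]
sums = build_prefix_sums(data)
print(sum_query(sums, 12, 25))  # 2, as in the question's example
```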
I already gave an O(1) time, O(n) space answer. Here are some alternates that trade time for space.
1. Assuming that the number of 1s is O(log n) or better (say O(log n) for argument):
Store an array of ints representing the positions of the ones in the original array. So if the input is [1,0,0,0,1,0,1,1] then A = [0,4,6,7].
Given a query, use binary search on A for the start and end of the query in O(log(|A|)) = O(log(log(n))). If the element you're looking for isn't in A, find the smallest bigger index for the start and the largest smaller index for the end. E.g., for query (2,6) you'd return the indices for the 4 and the 6, which are (1,2). Then the answer is one more than the difference (here 2 - 1 + 1 = 2).
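As a sketch, the two binary searches map directly onto Python's bisect module. The query endpoints are inclusive here, as in the (2,6) example, and the function names are made up for illustration:

```python
from bisect import bisect_left, bisect_right


def positions_of_ones(data):
    return [idx for idx, bit in enumerate(data) if bit == 1]


def count_ones(positions, i, j):
    # Number of 1s at indices i..j inclusive.
    first = bisect_left(positions, i)       # smallest stored position >= i
    last = bisect_right(positions, j) - 1   # largest stored position <= j
    return last - first + 1                 # 0 when no stored position lies in range


A = positions_of_ones([1, 0, 0, 0, 1, 0, 1, 1])
print(A)                    # [0, 4, 6, 7]
print(count_ones(A, 2, 6))  # 2
```

When the range contains no ones, `last` ends up one below `first`, so the "difference plus one" rule yields 0 without a special case.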
2. Take advantage of knowing all the queries up front (as mentioned by the OP in a comment to my other answer). Say Q = (Q1, Q2, ..., Qm) is the set of queries.
Process the queries, storing a map from start and end indices to the query. E.g., if Q1 = (12,92) then our map would include {92 => Q1, 12 => Q1}. This takes O(m) time and O(m) space. Take note of the smallest start index and the largest end index.
Process the input data, starting with the smallest start index. Keep track of the running sum. For each index, check your map of queries. If the index is in the map, associate the current running sum with the appropriate query.
At the end, each query will have two sums associated with it. Their difference is the answer, provided the sum recorded at the start index excludes the element at that index and the sum recorded at the end index includes it (for an inclusive range).
Worst case analysis:
O(n) + O(m) time, O(m) space. However, this is across all queries. The amortized time cost per query is O(n/m). This is the same as my constant time solution (which required O(n) preprocessing).
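A sketch of the offline approach, under the assumptions that queries are (start, end) pairs with inclusive endpoints and start <= end. The name answer_offline is hypothetical, and for simplicity the scan starts at index 0 instead of the smallest start index:

```python
from collections import defaultdict


def answer_offline(data, queries):
    # Map each index to the queries that start or end there: O(m) time and space.
    starts = defaultdict(list)
    ends = defaultdict(list)
    for q, (start, end) in enumerate(queries):
        starts[start].append(q)
        ends[end].append(q)

    before = [0] * len(queries)   # running sum just before each query's start
    answers = [0] * len(queries)
    running = 0
    for idx, bit in enumerate(data):          # single O(n) pass over the data
        for q in starts.get(idx, ()):
            before[q] = running               # excludes data[idx]
        running += bit
        for q in ends.get(idx, ()):
            answers[q] = running - before[q]  # includes data[idx]
    return answers
```

Only O(m) extra space is needed, because no per-index prefix array is stored.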
I would probably go with something like this:
# boilerplate testdata
from itertools import chain, permutations
data = [0,0,0,0,0,0,0,1,1,1]
chained = list(chain(*permutations(data,5))) # increase 5 to 10 if you dare
Preprocessing:
frSet = frozenset([i for i in range(len(chained)) if chained[i]==1])
"Counting":
# O(min(len(frSet), len(frozenset(range(200,500)))))
summa = frSet.intersection(frozenset(range(200,500))) # use two sets for faster intersect
counted=len(summa)
"Sanity-Check"
print(sum([1 for x in frSet if x >= 200 and x<500]))
print(summa)
print(len(summa))
No edge cases needed: the intersection does everything you need, at the cost of slightly higher memory, since you store each index rather than ranges of ones. Performance depends on the intersection implementation.
This might be helpful: https://wiki.python.org/moin/TimeComplexity#set

Optimized solution for checking whether two-time interval arrays are overlapping or not?

Can you find out whether two arrays of time intervals overlap, in an optimized way?
Suppose input array A contains 10 elements, each with a start date and an end date, and similarly input array B contains 4 elements, each with a start date and an end date. Now find out whether A and B overlap.
Example 1:
Input:
A={[1,5],[7,10],[11,15]}; //Array A contains 3 elements, and each element has a start and end time.
B={[6,10],[1,5]};//Array B contains 2 elements, and each element has a start and end time.
Output: Yes // because A and B overlap at [6,10] and [1,5]
Example 2:
Input:
A={[1,5],[8,10],[11,15]}; //Array A contains 3 elements, and each element has a start and end time.
B={[5,8],[15,16]};//Array B contains 2 elements, and each element has a start and end time.
Output: No // because [5,8] and [15,16] do not overlap any interval in A
I know we can solve this problem by brute force, iterating over each element in B and comparing it with each element of A to check whether they overlap (A[i].start<=B[j].start and A[i].end>B[j].start). That takes O(N*M), where N is the length of array A and M is the length of B.
Can you please optimize the solution?
You can sort both arrays by start time. Then iterate through both arrays simultaneously with two pointers: compare the current interval of A with the current interval of B, and if each one starts before the other ends, you have found an overlap. Otherwise, advance the pointer whose interval ends first.
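A possible two-pointer sketch. It assumes intervals are (start, end) pairs with start < end, and that intervals which merely touch at an endpoint do not count as overlapping, which is what Example 2 implies. The function name is made up for illustration:

```python
def arrays_overlap(a, b):
    a = sorted(a)  # sort by start time (ties broken by end time)
    b = sorted(b)
    i = j = 0
    while i < len(a) and j < len(b):
        a_start, a_end = a[i]
        b_start, b_end = b[j]
        # Two intervals overlap when each one starts before the other ends.
        if a_start < b_end and b_start < a_end:
            return True
        # The interval that ends first cannot overlap anything later in the other list.
        if a_end <= b_end:
            i += 1
        else:
            j += 1
    return False


print(arrays_overlap([[1, 5], [7, 10], [11, 15]], [[6, 10], [1, 5]]))   # True
print(arrays_overlap([[1, 5], [8, 10], [11, 15]], [[5, 8], [15, 16]]))  # False
```

Sorting dominates the cost, so the whole check runs in O(N log N + M log M).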
Here is what you can do:
Build a segment tree over the time range covered by array A, where the first interval is (l1, r1), the second is (l2, r2), and so on.
For every interval (li, ri) in A, update the segment tree so that each element in the interval (li, ri) is set to 1. This can be done in O(log n) per interval using lazy propagation.
Now, for each interval (lj, rj) in B, query the segment tree for the sum of the range (lj, rj).
If the sum is 0, that range does not overlap anything in A; otherwise it overlaps.
Overall complexity is O(n log n).
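A compact sketch of the segment-tree idea, with several assumptions that are not in the answer above: endpoints are small non-negative integers, an interval (l, r) covers the unit cells l..r-1 (so touching intervals do not overlap, as in Example 2), and the names CoverTree and arrays_overlap_tree are hypothetical. Because cells only ever change from 0 to 1, the lazy "fully covered" flag never has to be pushed down to the children.

```python
class CoverTree:
    """Segment tree over cells 0..size-1 with range 'set to 1' and range sum."""

    def __init__(self, size):
        self.size = size
        self.total = [0] * (4 * size)     # covered cells in the node's range
        self.full = [False] * (4 * size)  # lazy flag: the whole range is covered

    def cover(self, left, right, node=1, lo=0, hi=None):
        if hi is None:
            hi = self.size - 1
        if right < lo or hi < left or self.full[node]:
            return
        if left <= lo and hi <= right:
            self.full[node] = True
            self.total[node] = hi - lo + 1
            return
        mid = (lo + hi) // 2
        self.cover(left, right, 2 * node, lo, mid)
        self.cover(left, right, 2 * node + 1, mid + 1, hi)
        self.total[node] = self.total[2 * node] + self.total[2 * node + 1]

    def covered(self, left, right, node=1, lo=0, hi=None):
        if hi is None:
            hi = self.size - 1
        if right < lo or hi < left:
            return 0
        if self.full[node]:
            return min(hi, right) - max(lo, left) + 1
        if left <= lo and hi <= right:
            return self.total[node]
        mid = (lo + hi) // 2
        return (self.covered(left, right, 2 * node, lo, mid)
                + self.covered(left, right, 2 * node + 1, mid + 1, hi))


def arrays_overlap_tree(a, b):
    size = max(end for _, end in list(a) + list(b))
    tree = CoverTree(size)
    for start, end in a:
        tree.cover(start, end - 1)
    return any(tree.covered(start, end - 1) > 0 for start, end in b)
```

For large or non-integer timestamps the endpoints would first need to be compressed to ranks, which this sketch leaves out.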

How to assign many subsets to their largest supersets?

My data has a large number of sets (a few million). Each set has between a few and several tens of thousands of integer members. Many of those sets are subsets of larger sets (there are many of those supersets). I'm trying to assign each subset to its largest superset.
Can anyone recommend an algorithm for this type of task?
There are many algorithms for generating all possible subsets of a set, but this type of approach is time-prohibitive given my data size (e.g. this paper or SO question).
Example of my data-set:
A {1, 2, 3}
B {1, 3}
C {2, 4}
D {2, 4, 9}
E {3, 5}
F {1, 2, 3, 7}
Expected answer: B and A are subsets of F (it's not important that B is also a subset of A); C is a subset of D; E remains unassigned.
Here's an idea that might work:
Build a table that maps each number to a sorted list of the sets containing it, sorted first by size with the largest first and then, among sets of equal size, arbitrarily but in some canonical order (say, alphabetically by set name). So in your example, you'd have a table that maps 1 to [F, A, B], 2 to [F, A, D, C], 3 to [F, A, B, E] and so on. This can be implemented to take O(n log n) time where n is the total size of the input.
For each set in the input:
Fetch the lists associated with each entry in that set. So for A, you'd get the lists associated with 1, 2, and 3. The total number of selects you'll issue in the runtime of the whole algorithm is O(n), so the runtime so far is O(n log n + n), which is still O(n log n).
Now walk down each list simultaneously. If a set is the first entry in all three lists, then it's the largest set that contains the input set. Output that association and continue with the next input set. If not, discard the head that comes first in the sort order (the largest set) among all the list heads and try again. Implementing this last bit is tricky, but you can store the heads of all lists in a heap and get (IIRC) something like O(n log k) overall runtime, where k is the maximum size of any individual set, so you can bound that at O(n log n) in the worst case.
So if I got everything straight, the runtime of the algorithm is overall O(n log n), which seems like probably as good as you're going to get for this problem.
Here is a python implementation of the algorithm:
from collections import defaultdict, deque
import heapq

def LargestSupersets(setlists):
    '''Computes, for each item in the input, the largest superset in the same input.

    setlists: A list of lists, each of which represents a set of items. Items must be hashable.
    '''
    # First, build a table that maps each element in any input setlist to a list of records
    # of the form (-size of setlist, index of setlist), one for each setlist that contains
    # the corresponding element
    element_to_entries = defaultdict(list)
    for idx, setlist in enumerate(setlists):
        entry = (-len(setlist), idx)  # cheesy way to make an entry that sorts properly -- largest first
        for element in setlist:
            element_to_entries[element].append(entry)

    # Within each entry, sort so that larger items come first, with ties broken arbitrarily by
    # the set's index
    for entries in element_to_entries.values():
        entries.sort()

    # Now build up the output by going over each setlist and walking over the entries list for
    # each element in the setlist. Since the entries list for each element is sorted largest to
    # smallest, the first entry we find that is in every entry set we pulled will be the largest
    # element of the input that contains each item in this setlist. We are guaranteed to eventually
    # find such an element because, at the very least, the item we're iterating on itself is in
    # each entries list.
    output = []
    for idx, setlist in enumerate(setlists):
        num_elements = len(setlist)
        buckets = [element_to_entries[element] for element in setlist]

        # We implement the search for an item that appears in every list by maintaining a heap and
        # a queue. We have the invariants that:
        #   1. The queue contains the n smallest items across all the buckets, in order
        #   2. The heap contains the smallest item from each bucket that has not already passed through
        #      the queue.
        smallest_entries_heap = []
        smallest_entries_deque = deque([], num_elements)
        for bucket_idx, bucket in enumerate(buckets):
            smallest_entries_heap.append((bucket[0], bucket_idx, 0))
        heapq.heapify(smallest_entries_heap)

        while (len(smallest_entries_deque) < num_elements or
               smallest_entries_deque[0] != smallest_entries_deque[num_elements - 1]):
            # First extract the next smallest entry in the queue ...
            (smallest_entry, bucket_idx, element_within_bucket_idx) = heapq.heappop(smallest_entries_heap)
            smallest_entries_deque.append(smallest_entry)

            # ... then add the next-smallest item from the bucket that we just removed an element from
            if element_within_bucket_idx + 1 < len(buckets[bucket_idx]):
                new_element = buckets[bucket_idx][element_within_bucket_idx + 1]
                heapq.heappush(smallest_entries_heap, (new_element, bucket_idx, element_within_bucket_idx + 1))

        output.append((idx, smallest_entries_deque[0][1]))
    return output
Note: don't trust my writeup too much here. I just thought of this algorithm right now, I haven't proved it correct or anything.
So you have millions of sets, with thousands of elements each. Just representing that dataset takes billions of integers. In your comparisons you'll quickly get to trillions of operations without even breaking a sweat.
Therefore I'll assume that you need a solution which will distribute across a lot of machines. That means thinking in terms of MapReduce (https://en.wikipedia.org/wiki/MapReduce), or rather a series of MapReduce steps:
1. Read the sets in, mapping them to k:v pairs of i: s where i is an element of the set s.
2. Receive a key, which is an integer i, along with the list of sets that contain it. Map them off to pairs (s1, s2): i where s1 <= s2 are both sets that include i. Do not forget to pair each set with itself!
3. For each pair (s1, s2), count the size k of the intersection, and send off pairs s1: k, s2: k. (Only send the second if s1 and s2 are different.)
4. For each set s, receive the set of supersets. If it is maximal, send off s: s. Otherwise send off t: s for every t that is a strict superset of s.
5. For each set s, receive the set of subsets, with s in the list only if it is maximal. If s is maximal, send off t: s for every t that is a subset of s.
6. For each set, we receive the set of maximal sets that it is a subset of. (There may be many.)
There are a lot of steps for this, but at its heart it requires repeated comparisons between pairs of sets with a common element for each common element. Potentially that is O(n * n * m) where n is the number of sets and m is the number of distinct elements that are in many sets.
Here is a simple suggestion for an algorithm that might give better results based on your numbers (n = 10^6 to 10^7 sets with m = 2 to 10^5 members, a lot of super/subsets). Of course it depends a lot on your data. Generally speaking, its complexity is much worse than that of the other proposed algorithms. Maybe you could process only the sets with fewer than X (e.g. 1000) members this way and use the other proposed methods for the rest.
1. Sort the sets by their size.
2. Remove the first (smallest) set and start comparing it against the others from behind (largest set first).
3. Stop as soon as you find a superset and create a relation. Just remove the set if no superset was found.
4. Repeat 2. and 3. for all but the last set.
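The four steps above can be sketched as follows. The function name and the dict-of-sets input format are assumptions made for illustration, and a set is only ever compared against sets that come later in the size ordering, so two equal sets are assigned in one direction only:

```python
def assign_to_largest_superset(named_sets):
    """named_sets: dict mapping a set's name to a Python set of integers.

    Returns a dict mapping each name to the name of its largest superset,
    or None when no other set contains it.
    """
    # Step 1: sort by size, smallest first.
    order = sorted(named_sets, key=lambda name: len(named_sets[name]))
    assignment = {}
    for pos, name in enumerate(order):
        assignment[name] = None
        # Steps 2 and 3: compare against the remaining sets, largest first,
        # and stop at the first superset found.
        for other_pos in range(len(order) - 1, pos, -1):
            other = order[other_pos]
            if named_sets[name] <= named_sets[other]:  # subset test
                assignment[name] = other
                break
    return assignment


example = {
    "A": {1, 2, 3}, "B": {1, 3}, "C": {2, 4},
    "D": {2, 4, 9}, "E": {3, 5}, "F": {1, 2, 3, 7},
}
print(assign_to_largest_superset(example))
```

On the question's example this assigns A and B to F and C to D, and leaves D, E and F unassigned. The worst case is quadratic in the number of sets, which is why the answer suggests restricting it to the smaller sets.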
If you're using Excel, you could structure it as follows:
1) Create a Cartesian layout as a two-way table that has all your data sets as titles on both the side and the top.
2) In a separate tab, create a row for each data set in the first column, along with a second column that counts the number of entries (e.g. F has 4), and then stack FIND(",") and MID formulas across the sheet to split out all the entries within each data set. Use the counter in the second column to do COUNTIF(">0"). Each entry you find can be your starting point in a subsequent FIND until it runs out of entries and just returns a blank.
3) Go back to your two-way table, and bring over the separate entries you just generated for your column titles (e.g. F is 1,2,3,7). Use an AND statement to then check that each entry in your left-hand column is in your top-row data set, using an OFFSET to your separate area and using your counter as the width for the OFFSET.

Binary search with gaps

Let's imagine two arrays like this:
[8,2,3,4,9,5,7]
[0,1,1,0,0,1,1]
How can I perform a binary search only on the numbers with a 1 below them, ignoring the rest?
I know this can be done in O(log n) comparisons, but my current method is slower because it has to go through all the 0s until it hits a 1.
If you hit a number with a 0 below, you need to scan in both directions for a number with a 1 below until you find it -- or the local search space is exhausted. As the scan for a 1 is linear, the ratio of 0s to 1s determines whether the resulting algorithm can still be faster than linear.
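A sketch of that scan-on-miss idea, simplified to scan in one direction only: from the midpoint it looks to the right for a flagged position, and if there is none it continues in the left half. It assumes the flagged numbers are in ascending order (as 2, 3, 5, 7 are in the example), and the function name is made up:

```python
def search_with_gaps(items, flags, target):
    """Binary search over the positions whose flag is 1; returns an index or -1."""
    lo, hi = 0, len(items)
    while lo < hi:
        mid = (lo + hi) // 2
        # Linear scan to the right for the nearest flagged position in [mid, hi).
        probe = mid
        while probe < hi and not flags[probe]:
            probe += 1
        if probe == hi:
            hi = mid  # nothing valid in [mid, hi): keep searching the left half
        elif items[probe] == target:
            return probe
        elif items[probe] < target:
            lo = probe + 1
        else:
            hi = mid  # positions mid..probe-1 are all unflagged
    return -1


items = [8, 2, 3, 4, 9, 5, 7]
flags = [0, 1, 1, 0, 0, 1, 1]
print(search_with_gaps(items, flags, 5))  # 5
print(search_with_gaps(items, flags, 4))  # -1, because 4 has a 0 below it
```

The number of comparisons against the target stays logarithmic, but the scans for a 1 are linear, so long runs of 0s dominate the running time.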
This question is very old, but I've just discovered a wonderful little trick to solve this problem in most cases where it comes up. I'm writing this answer so that I can refer to it elsewhere:
Fast Append, Delete, and Binary Search in a Sorted Array
The need to dynamically insert or delete items from a sorted collection, while preserving the ability to search, typically forces us to switch from a simple array representation using binary search to some kind of search tree -- a far more complicated data structure.
If you only need to insert at the end, however (i.e., you always insert a largest or smallest item), or you don't need to insert at all, then it's possible to use a much simpler data structure. It consists of:
A dynamic (resizable) array of items, the item array; and
A dynamic array of integers, the set array. The set array is used as a disjoint set data structure, using the single-array representation described here: How to properly implement disjoint set data structure for finding spanning forests in Python?
The two arrays are always the same size. As long as there have been no deletions, the item array just contains the items in sorted order, and the set array is full of singleton sets corresponding to those items.
If items have been deleted, though, items in the item array are only valid if there is a root set at the corresponding position in the set array. All sets that have been merged into a single root will be contiguous in the set array.
This data structure supports the required operations as follows:
Append (O(1))
To append a new largest item, just append the item to the item array, and append a new singleton set to the set array.
Delete (amortized effectively O(log N))
To delete a valid item, first call search to find the adjacent larger valid item. If there is no larger valid item, then just truncate both arrays to remove the item and all adjacent deleted items. Since merged sets are contiguous in the set array, this will leave both arrays in a consistent state.
Otherwise, merge the sets for the deleted item and adjacent item in the set array. If the deleted item's set is chosen as the new root, then move the adjacent item into the deleted item's position in the item array. Whichever position isn't chosen will be unused from now on, and can be nulled-out to release a reference if necessary.
If less than half of the item array is valid after a delete, then deleted items should be removed from the item array and the set array should be reset to an all-singleton state.
Search (amortized effectively O(log N))
Binary search proceeds normally, except that we need to find the representative item for every test position:
int find(item_array, set_array, itemToFind) {
    int pos = 0;
    int limit = item_array.length;
    while (pos < limit) {
        int testPos = pos + floor((limit-pos)/2);
        if (item_array[find_set(set_array, testPos)] < itemToFind) {
            pos = testPos + 1; //testPos is too low
        } else {
            limit = testPos; //testPos is not too low
        }
    }
    if (pos >= item_array.length) {
        return -1; //not found
    }
    pos = find_set(set_array, pos);
    return (item_array[pos] == itemToFind) ? pos : -1;
}

Which algorithm can rank list based on history

There are N unique items.
There are K sorted lists, each list consists of a small subset of the items, each list does not contain the same item more than once.
The input is an unsorted list of items.
The algorithm should sort the list based on the K sorted lists.
Here is an example:
There are 100 items : item1, item2, ..., item100
There are some ranked lists available: List1: Item1>Item2>Item12, List2: Item12>Item93>Item7, List3: Item1>Item3>Item97, List4: Item1>Item7>Item2
The input is: Item1, Item2, Item7 and Item98. The algorithm should sort the input based on those lists.
In terms of machine learning, I am looking for an algorithm that can predict the 'right' order of a list of items (AKA the active list) based on a training set of many partially ordered lists of items. Each partially ordered list might contain items that the active list does not contain.
Construct a directed acyclic graph (DAG) with the input elements as nodes, and define an edge from Itemi to Itemj if and only if Itemi appears immediately before Itemj in some list. Then you can obtain the desired order by doing a topological sort on the DAG.
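A sketch using the standard library's graphlib (Python 3.9+). One interpretation is assumed here that the answer does not spell out: each ranked list is first restricted to the items in the active list, and "immediately before" is then read within that restricted list. The function name rank_items is hypothetical.

```python
from graphlib import TopologicalSorter


def rank_items(ranked_lists, active_items):
    active = set(active_items)
    sorter = TopologicalSorter()
    for item in active_items:
        sorter.add(item)  # make sure items without any edge still appear
    for ranked in ranked_lists:
        kept = [item for item in ranked if item in active]
        for higher, lower in zip(kept, kept[1:]):
            sorter.add(lower, higher)  # `higher` must come before `lower`
    # static_order() raises graphlib.CycleError if the lists contradict each other.
    return list(sorter.static_order())


ranked_lists = [
    ["Item1", "Item2", "Item12"],
    ["Item12", "Item93", "Item7"],
    ["Item1", "Item3", "Item97"],
    ["Item1", "Item7", "Item2"],
]
print(rank_items(ranked_lists, ["Item1", "Item2", "Item7", "Item98"]))
```

For the question's input this places Item1 before Item7 and Item7 before Item2; Item98 appears in no list, so its position is arbitrary.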
I think what you mean is that the sorted lists define a partial ordering, yes? I.e. if Item1 appears before Item2 in one of the lists, it should be considered "bigger".
If this is correct, then the way to go is to first represent this in a more convenient form, e.g. a matrix M such that M[1][2]==1 if Item1 precedes Item2 in one of the lists. Then we have a simple comparator function:
def compare(X, Y):
    if M[X][Y] == 1:
        return 1   # X > Y
    elif M[Y][X] == 1:
        return -1  # Y > X
    else:
        return 0   # the elements are not comparable
We can now sort the output according to this comparator.
You might want to run the transitive closure (Warshall's algorithm) on this matrix before sorting, in case there are, for example, lists Item1>Item3 and Item3>Item2, but no list where Item2 would appear together with Item1. Transitive closure would allow one to deduce from the two lists that Item1 should precede Item2.
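A small sketch of Warshall's algorithm on such a 0/1 matrix (the function name is illustrative, and indices 0, 1, 2 stand for Item1, Item2, Item3):

```python
def transitive_closure(m):
    """Returns a new 0/1 matrix where closure[i][j] == 1 if j is reachable from i."""
    n = len(m)
    closure = [row[:] for row in m]
    for k in range(n):              # allow paths that pass through item k
        for i in range(n):
            if closure[i][k]:
                for j in range(n):
                    if closure[k][j]:
                        closure[i][j] = 1
    return closure


# Item1 > Item3 and Item3 > Item2, but no list relates Item1 and Item2 directly.
M = [
    [0, 0, 1],
    [0, 0, 0],
    [0, 1, 0],
]
closed = transitive_closure(M)
print(closed[0][1])  # 1: Item1 precedes Item2 by transitivity
```

The triple loop costs O(N^3) for N items, so it is practical only when N is moderate.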
I would compose a weighted graph from the input (number of links between A>B is the weight), put that into an N*N matrix, and perform the power-iteration (GIYF) on the matrix.

Resources