Say I have an array of arrays in Ruby,
[[100,300],
[400,500]]
that I'm building by adding successive lines of CSV data.
What's the best way, when adding a new subarray, to test whether the range covered by its two numbers overlaps the range covered by any previous subarray?
In other words, each subarray represents a linear range (100-300 and 400-500 in the example above). If I want an exception to be thrown when I try to add [499,501], for example, because it would overlap an existing range, how could I best test for this?
Since your subarrays are supposed to represent ranges, it might be a good idea to actually use an array of ranges instead of an array of arrays.
So your array becomes [100..300, 400..500].
We can easily define a method that checks whether two ranges overlap:
def overlap?(r1, r2)
  # two non-empty ranges overlap iff one contains the other's first element
  r1.include?(r2.begin) || r2.include?(r1.begin)
end
Now to check whether a range r overlaps with any range in your array of ranges, you just need to check whether overlap?(r, r2) is true for any r2 in the array of ranges:
def any_overlap?(r, ranges)
  ranges.any? do |r2|
    overlap?(r, r2)
  end
end
Which can be used like this:
any_overlap?(499..501, [100..300, 400..500])
#=> true
any_overlap?(599..601, [100..300, 400..500])
#=> false
Here any_overlap? takes O(n) time. So if you use any_overlap? every time you add a range to the array, the whole thing will be O(n**2).
However there's a way to do what you want without checking each range:
You add all the ranges to the array without checking for overlap, and only then check whether any range in the array overlaps any other. You can do this efficiently in O(n log n) time by sorting the array by the beginning of each range and then testing whether any two adjacent ranges overlap:
def any_overlap?(ranges)
  ranges.sort_by(&:begin).each_cons(2).any? do |r1, r2|
    overlap?(r1, r2)
  end
end
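If you need the check at insertion time instead of in one batch pass, the same sorted-neighbors idea works incrementally: keep the endpoints sorted and compare a new range only against its would-be neighbors. A rough Python sketch of that variation (the function name and the choice of ValueError are mine, not from the original answer):

import bisect

def add_range(starts, ends, lo, hi):
    # starts/ends are parallel lists of range endpoints, kept sorted.
    # Raises if the closed range [lo, hi] overlaps any stored range.
    i = bisect.bisect_right(starts, lo)
    if i > 0 and ends[i - 1] >= lo:          # previous range reaches into [lo, hi]
        raise ValueError("overlaps previous range")
    if i < len(starts) and starts[i] <= hi:  # next range starts inside [lo, hi]
        raise ValueError("overlaps next range")
    starts.insert(i, lo)
    ends.insert(i, hi)

starts, ends = [], []
add_range(starts, ends, 100, 300)
add_range(starts, ends, 400, 500)
add_range(starts, ends, 499, 501)   # raises ValueError: overlaps previous range

Each insertion then costs O(log n) for the search, plus the list insertion itself.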
Use the multi_range gem and call its overlaps? method to check whether there is any overlap:
MultiRange.new([100..300, 400..500]).overlaps?(499..501)
Note that you may need to transform your input from an array of arrays to an array of ranges first. A possible working example:
arrays = [[100, 300], [400, 500]]
subarray = [499, 501]
ranges = arrays.map { |a, b| a..b }
MultiRange.new(ranges).overlaps?(subarray[0]..subarray[1])
# => true
Related
Can you find whether two time-interval arrays are overlapping or not, in an optimized way?
Suppose input array A contains 10 elements, each with a start date and an end date. Similarly, input array B contains 4 elements, each with a start date and an end date. Now find whether A and B overlap.
Example 1:
Input:
A = {[1,5],[7,10],[11,15]}; // array A contains 3 elements, each with a start and end time
B = {[6,10],[1,5]}; // array B contains 2 elements, each with a start and end time
Output: Yes // because B's intervals [6,10] and [1,5] overlap intervals of A
Example 2:
Input:
A = {[1,5],[8,10],[11,15]}; // array A contains 3 elements, each with a start and end time
B = {[5,8],[15,16]}; // array B contains 2 elements, each with a start and end time
Output: No // because neither [5,8] nor [15,16] overlaps any interval of A
I know we can solve this problem by brute force, iterating over each element of B and comparing it with each element of A to check whether they overlap (A[i].start <= B[j].start and A[i].end > B[j].start); this takes O(N*M), where N is the length of array A and M is the length of B.
Can you please optimize the solution?
You can sort both arrays according to the start times. Then iterate through them simultaneously (use two pointers) and check whether the current pair of intervals overlaps, advancing whichever interval ends first. If a pair overlaps, you have found an overlap.
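A rough, non-authoritative Python sketch of that two-pointer approach (the function name is mine; the strict inequalities mirror the comparison in the question, so touching endpoints do not count as overlap):

def arrays_overlap(A, B):
    A = sorted(A)                         # sort both by start time
    B = sorted(B)
    i = j = 0
    while i < len(A) and j < len(B):
        a, b = A[i], B[j]
        if a[0] < b[1] and b[0] < a[1]:   # strict overlap test
            return True
        if a[1] <= b[1]:                  # advance the interval that ends first
            i += 1
        else:
            j += 1
    return False

print(arrays_overlap([(1,5),(7,10),(11,15)], [(6,10),(1,5)]))   # True
print(arrays_overlap([(1,5),(8,10),(11,15)], [(5,8),(15,16)]))  # False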
Here is what you can do:
1. Build a segment tree over the values of array A; say the first interval is (l1, r1), the second is (l2, r2), and so on.
2. For every interval (li, ri) in A, update the segment tree so that each element in the range (li, ri) is set to 1. With lazy propagation each update takes O(log n).
3. Now, for each interval (lj, rj) in B, query the segment tree for that range; the query returns the sum over (lj, rj).
4. If the sum is 0, that range is non-overlapping; otherwise it overlaps.
Overall complexity: O(n log n).
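A hedged Python sketch of this segment-tree approach. The class and the coordinate bound are my own illustrative choices, and I treat the intervals as half-open (marking cells l to r-1) so that touching endpoints do not count as overlap, matching Example 2 above; in practice you would compress coordinates first:

class SegmentTree:
    def __init__(self, size):
        self.size = size
        self.sums = [0] * (4 * size)
        self.lazy = [False] * (4 * size)    # pending "fill this range with 1s"

    def _apply(self, node, lo, hi):
        if self.lazy[node]:
            self.sums[node] = hi - lo + 1
            if lo != hi:                    # push the flag down to the children
                self.lazy[2 * node] = self.lazy[2 * node + 1] = True
            self.lazy[node] = False

    def update(self, l, r, node=1, lo=0, hi=None):
        if hi is None:
            hi = self.size - 1
        self._apply(node, lo, hi)
        if r < lo or hi < l:
            return
        if l <= lo and hi <= r:
            self.lazy[node] = True
            self._apply(node, lo, hi)
            return
        mid = (lo + hi) // 2
        self.update(l, r, 2 * node, lo, mid)
        self.update(l, r, 2 * node + 1, mid + 1, hi)
        self.sums[node] = self.sums[2 * node] + self.sums[2 * node + 1]

    def query(self, l, r, node=1, lo=0, hi=None):
        if hi is None:
            hi = self.size - 1
        self._apply(node, lo, hi)
        if r < lo or hi < l:
            return 0
        if l <= lo and hi <= r:
            return self.sums[node]
        mid = (lo + hi) // 2
        return (self.query(l, r, 2 * node, lo, mid)
                + self.query(l, r, 2 * node + 1, mid + 1, hi))

A = [(1, 5), (7, 10), (11, 15)]
B = [(5, 8), (15, 16)]
tree = SegmentTree(32)
for l, r in A:
    tree.update(l, r - 1)                            # half-open: mark l .. r-1
print(any(tree.query(l, r - 1) > 0 for l, r in B))   # False (Example 2)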
I'm looking for an efficient way to compare lists of numbers to see if they match at any rotation (comparing two circular lists).
When the lists have no duplicates, picking the smallest/largest value and rotating both lists before comparison works.
But when there may be many duplicate large values, this isn't so simple.
For example, the lists [9, 2, 0, 0, 9] and [0, 0, 9, 9, 2] match, whereas [9, 0, 2, 0, 9] doesn't (since the order is different).
Here's an example of an inefficient function that works:
def min_list_rotation(ls):
    # smallest rotation of ls, found by generating all rotations: O(n^2)
    return min(ls[i:] + ls[:i] for i in range(len(ls)))
# example use
ls_a = [9, 2, 0, 0, 9]
ls_b = [0, 0, 9, 9, 2]
print(min_list_rotation(ls_a) == min_list_rotation(ls_b))
This can be improved on for efficiency:
- check that the sorted lists match before running exhaustive tests.
- only test rotations that start with the minimum value (skipping matching values after that), effectively finding the minimum value with the smallest number after it (continuing, in case there are multiple matching next-smallest values).
- compare rotations without creating new lists each time.
However, it's still not a very efficient method, since it relies on checking many possibilities.
Is there a more efficient way to perform this comparison?
Related question:
Compare rotated lists in python
If you are looking for duplicates in a large number of lists, you could rotate each list to its lexicographically minimal string representation, then sort the list of lists or use a hash table to find duplicates. This canonicalisation step means that you don't need to compare every list with every other list. There are clever O(n) algorithms for finding the minimal rotation described at https://en.wikipedia.org/wiki/Lexicographically_minimal_string_rotation.
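For illustration, here is a hedged Python sketch of that O(n) least-rotation step, transcribed from the Booth's algorithm pseudocode on the Wikipedia page linked above (it works on lists of numbers as well as strings):

def least_rotation(seq):
    # Booth's algorithm: index of the lexicographically minimal rotation, O(n).
    n = len(seq)
    f = [-1] * (2 * n)      # failure function over the doubled sequence, as in KMP
    k = 0                   # start index of the least rotation found so far
    for j in range(1, 2 * n):
        sj = seq[j % n]
        i = f[j - k - 1]
        while i != -1 and sj != seq[(k + i + 1) % n]:
            if sj < seq[(k + i + 1) % n]:
                k = j - i - 1
            i = f[i]
        if sj != seq[(k + i + 1) % n]:
            if sj < seq[k % n]:      # i == -1 here
                k = j
            f[j - k] = -1
        else:
            f[j - k] = i + 1
    return k % n

def canonical(ls):
    # rotate the list to its minimal rotation, so equal circular lists compare equal
    k = least_rotation(ls)
    return ls[k:] + ls[:k]

print(canonical([9, 2, 0, 0, 9]) == canonical([0, 0, 9, 9, 2]))  # True
print(canonical([9, 2, 0, 0, 9]) == canonical([9, 0, 2, 0, 9]))  # False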
You almost have it.
You can do some kind of "normalization" or "canonicalisation" of each list independently of the others; then you only need to compare item by item (or, if you want, put them in a map, or in a set to eliminate duplicates, ...).
1. Take the minimum item which is not preceded by itself (in a circular way). In your example 92009, you should take the first 0 (not the second one).
2. If every item is the same (say 00000), you just keep that: 00000.
3. If the same minimum occurs several times, take the next item after each occurrence, keep the minimal one, and keep going until you find one unique path of minimums. Example: 90148301562 => you have 0148... and 0156... => you take 0148.
4. If you cannot separate the different paths (i.e. they stay equal forever), you have a repeating pattern; then it doesn't matter: you take any of them. Example: 014376501437650143765: you have the same pattern 0143765... It is like AAA, where A = 0143765.
5. When you have your lists in this form, it is easy to compare two of them.
How to do that efficiently:
Iterate over your list to find the minimums Mx (those not preceded by themselves). If you find several, keep all of them.
Then, from each minimum Mx, take the next item and keep only the minimal candidates. If you complete an entire cycle, you have a repeating pattern.
Except in the case of a repeating pattern, this yields the minimal rotation.
Hope it helps.
I would do this in expected O(N) time using a polynomial hash function to compute the hash of list A, and every cyclic shift of list B. Where a shift of list B has the same hash as list A, I'd compare the actual elements to see if they are equal.
The reason this is fast is that with polynomial hash functions (which are extremely common!), you can calculate the hash of each cyclic shift from the previous one in constant time, so you can calculate hashes for all of the cyclic shifts in O(N) time.
It works like this:
Let's say B has N elements; then the hash of B using prime P is:
Hb = 0;
for (i = 0; i < N; i++)
{
    Hb = Hb*P + B[i];
}
This is an optimized way to evaluate a polynomial in P (Horner's method), and is equivalent to:
Hb = 0;
for (i = 0; i < N; i++)
{
    Hb += B[i] * P^(N-1-i); // ^ is exponentiation, not XOR
}
Notice how every B[i] is multiplied by P^(N-1-i). If we shift B to the left by 1, then every B[i] is multiplied by an extra P, except the first one, which wraps around to the end and loses its factor of P^(N-1). Since multiplication distributes over addition, we can multiply all the components at once just by multiplying the whole hash by P, and then fix up the factor for the first element.
The hash of the left shift of B is just
Hb1 = Hb*P + B[0]*(1-(P^N))
The second left shift:
Hb2 = Hb1*P + B[1]*(1-(P^N))
and so on...
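Putting it together, a rough Python sketch of the whole comparison (the modulus M is my own stand-in for the fixed-width integer wrap-around you would get for free in C or Java):

def cyclic_equal(A, B, P=1000003, M=(1 << 61) - 1):
    # Compare A against every cyclic shift of B in expected O(N),
    # using the rolling update Hb1 = Hb*P + B[0]*(1 - P^N) described above.
    if len(A) != len(B):
        return False
    n = len(A)
    ha = hb = 0
    for i in range(n):
        ha = (ha * P + A[i]) % M
        hb = (hb * P + B[i]) % M
    pn = pow(P, n, M)
    for shift in range(n):
        # hb is the hash of B cyclically shifted left by `shift`
        if hb == ha and B[shift:] + B[:shift] == A:
            return True          # confirm with a real comparison (collisions!)
        hb = (hb * P + B[shift] * (1 - pn)) % M
    return False

print(cyclic_equal([9, 2, 0, 0, 9], [0, 0, 9, 9, 2]))   # True
print(cyclic_equal([9, 2, 0, 0, 9], [9, 0, 2, 0, 9]))   # False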
What is the most efficient way to determine whether a collection of sets is pairwise disjoint, i.e. to verify that the intersection of every pair of sets is empty? How efficiently can this be done?
The sets from a collection are pairwise disjoint if, and only if, the size of their union equals the sum of their sizes (this statement applies to finite sets):
def pairwise_disjoint(sets) -> bool:
    union = set().union(*sets)
    return len(union) == sum(map(len, sets))
This could be a one-liner, but readability counts.
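For example, with some made-up inputs:

print(pairwise_disjoint([{1, 2}, {3, 4}, {5}]))   # True: union has 5 elements
print(pairwise_disjoint([{1, 2}, {2, 3}]))        # False: 2 is shared, union has 3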
Expected linear time O(total number of elements):
def all_disjoint(sets):
    union = set()
    for s in sets:
        for x in s:
            if x in union:
                return False   # x was already seen in an earlier set
            union.add(x)
    return True
This is optimal under the assumption that your input is a collection of sets represented as some kind of unordered data structure (a hash table?), because then you need to look at every element at least once.
You can do much better by using a different representation for your sets. For example, by maintaining a global hash table that stores for each element the number of sets it is stored in, you can do all the set operations optimally and also check for disjointness in O(1).
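As a hedged sketch of that idea (the SetFamily class and its names are hypothetical, not from any library): keep a global count of how many sets contain each element, plus a count of elements shared by two or more sets, so the disjointness query itself is O(1).

from collections import Counter

class SetFamily:
    def __init__(self):
        self.counts = Counter()   # element -> number of sets containing it
        self.shared = 0           # elements contained in >= 2 sets

    def add_set(self, s):
        for x in s:
            self.counts[x] += 1
            if self.counts[x] == 2:
                self.shared += 1

    def all_disjoint(self):
        return self.shared == 0   # O(1) query

family = SetFamily()
family.add_set({1, 2, 3})
family.add_set({4, 5})
print(family.all_disjoint())   # True
family.add_set({5, 6})
print(family.all_disjoint())   # False: 5 is now shared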
Using Python as pseudo-code, the following tests each pair of sets for intersection only once:
def all_disjoint(sets):
    S = list(sets)
    while S:
        s = S.pop()                   # remove one set
        for t in S:                   # loop over the remaining ones
            if not s.isdisjoint(t):   # test for intersection
                return False
    return True
The number of intersection tests is the same as the number of edges in a complete graph with as many vertices as there are sets. It also exits early if any pair is found not to be disjoint.
This is a solution for calculating the median value in an array. I get the first three lines, duh ;), but the fourth line is where the magic is happening. Can someone explain how the sorted variable is being used next to those brackets, and why the other variable len is enclosed in parentheses and then brackets? It's almost like sorted is all of a sudden being used as an array? Thanks!
def median(array)
  sorted = array.sort
  len = sorted.length
  return ((sorted[(len - 1) / 2] + sorted[len / 2]) / 2.0).to_f
end
puts median([3,2,3,8,91])
puts median([2,8,3,11,-5])
puts median([4,3,8,11])
Consider this:
[1,2,2,3,4] and [1,2,3,4]. Both arrays are sorted, but they have an odd and an even number of elements respectively. That piece of code takes both cases into account.
sorted is indeed an array. You sort [2,3,1,4] and you get back [1,2,3,4]. Then you calculate the two middle indices, (len - 1) / 2 and len / 2 (which coincide for an odd number of elements), and average the values found there.
Yes, array.sort returns an array, and it is assigned to sorted. You can then access it via array indices.
If you have an odd number of elements, say 5 as in the example, the indices come out to be:
(len - 1) / 2 = (5 - 1) / 2 = 2
len / 2 = 5 / 2 = 2 (remember this is integer division, so the decimal gets truncated)
So you take the value at index 2 twice, add the two, and divide by 2, which is just the value at index 2.
If you have an even number of elements, say 4:
(len - 1) / 2 = (4 - 1) / 2 = 1 (integer division again, so the decimal gets truncated)
len / 2 = 4 / 2 = 2
So in this case you are effectively averaging the two middle elements, at indices 1 and 2, which is the definition of the median for an even number of elements.
It's almost like sorted is all of a sudden being used as an array?
Yes, it is. On line 2 it is initialized as an array with the same elements as the input, but in ascending order (the default sort order). On line 3, len is initialized with the length of the sorted array. So yes, sorted is being used as an array from then on, because that's what it is.
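For comparison, a rough Python translation of the same logic (my own sketch; // is Python's integer division, matching Ruby's truncating Integer division):

def median(array):
    s = sorted(array)
    n = len(s)
    # // truncates, so for odd n both indices are the single middle element
    return (s[(n - 1) // 2] + s[n // 2]) / 2.0

print(median([3, 2, 3, 8, 91]))   # 3.0  (odd count: both indices are 2)
print(median([4, 3, 8, 11]))      # 6.0  (even count: average of indices 1 and 2)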
I have got numbers in a specific range (usually from 0 to about 1000). An algorithm selects some numbers from this range (about 3 to 10 numbers). This selection happens quite often, and I need to check whether a permutation of the chosen numbers has already been selected.
e.g. one step selects [1, 10, 3, 18] and another [10, 18, 3, 1]; then the second selection can be discarded because it is a permutation of the first.
I need to do this check very fast. Right now I put all arrays in a hashmap and use a custom hash function that just sums up all the elements, so 1+10+3+18 = 32, and also 10+18+3+1 = 32. For equals I use a bitset to quickly check whether the same elements are in both sets (I do not need sorting when using the bitset, but it only works when the range of numbers is known and not too big).
This works OK, but it can generate lots of collisions, so the equals() method is called quite often. I was wondering whether there is a faster way to check for permutations?
Are there any good hash functions for permutations?
UPDATE
I have done a little benchmark: generate all combinations of numbers in the range 0 to 6 with array lengths 1 to 9. There are 3003 possibilities that are distinct up to permutation, and a good hash should generate close to this many different hashes (I use 32-bit numbers for the hash):
41 different hashes for just adding (so there are lots of collisions)
8 different hashes for XOR'ing values together
286 different hashes for multiplying
3003 different hashes for multiplying factors (R + 2e), as abc has suggested (using 1779033703 for R)
So abc's hash can be calculated very fast and is a lot better than all the rest. Thanks!
PS: I do not want to sort the values when I do not have to, because this would get too slow.
One potential candidate might be this:
Fix an odd integer R. For each element e you want to hash, compute the factor (R + 2*e). Then compute the product of all these factors. Finally, divide the product by 2 to get the hash.
The factor 2 in (R + 2e) guarantees that all factors are odd, which avoids the product ever becoming 0. The division by 2 at the end is because the product is always odd, so the division just removes a constant bit.
E.g. I chose R = 1779033703. This is an arbitrary choice; doing some experiments should show whether a given R is good or bad. Assume your values are [1, 10, 3, 18].
The product (computed using 32-bit ints) is
(R + 2) * (R + 20) * (R + 6) * (R + 36) = 3376724311
Hence the hash would be
3376724311/2 = 1688362155.
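A small Python sketch of this hash (masking with 0xFFFFFFFF is my stand-in for 32-bit overflow; the expected outputs come from the example above):

def perm_hash(values, R=1779033703):
    h = 1
    for e in values:
        h = (h * (R + 2 * e)) & 0xFFFFFFFF   # 32-bit wrap-around
    return h >> 1   # the product is always odd; drop the constant low bit

print(perm_hash([1, 10, 3, 18]))   # 1688362155, matching the example above
print(perm_hash([18, 3, 10, 1]))   # same hash: multiplication commutes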
Summing the elements is already one of the simplest things you could do, but I don't think it is a particularly good hash function with respect to pseudo-randomness.
If you sort your arrays before storing them or computing hashes, every good hash function will do.
If it's about speed: have you measured where the bottleneck is? If your hash function gives you a lot of collisions and you have to spend most of the time comparing the arrays bit by bit, the hash function is obviously not doing its job. Sorting plus a better hash might be the solution.
If I understand your question correctly, you want to test equality between sets where the items are not ordered. This is precisely what a Bloom filter will do for you. At the expense of a small number of false positives (in which case you'll need to fall back to a brute-force set comparison), you'll be able to compare such sets by checking whether their Bloom filters are equal.
The algebraic reason why this holds is that the OR operation is commutative. This holds for other semirings, too.
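A toy Python sketch of that idea (the filter size m, hash count k, and use of blake2b are my own arbitrary choices, not part of the original answer):

import hashlib

def bloom(values, m=128, k=3):
    bits = 0
    for v in values:
        for i in range(k):
            h = hashlib.blake2b(f"{i}:{v}".encode(), digest_size=8)
            bits |= 1 << (int.from_bytes(h.digest(), "big") % m)
    return bits   # OR is commutative, so element order doesn't matter

a = bloom([1, 10, 3, 18])
b = bloom([10, 18, 3, 1])
print(a == b)   # True; equal filters still require a brute-force confirmation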
Depending on how many collisions you get (the same hash, but not a permutation), you might presort the arrays while hashing them. In that case you can use a more aggressive kind of hash where you don't just add up the numbers but throw in some bit-magic as well, to get quite different hashes.
This is only beneficial if you get loads of unwanted collisions because the hash you are using now is too weak. If you hardly get any collisions, the method you are using seems fine.
I would suggest this:
1. Check whether the lengths of the two permutations are the same (if not, they are not equal).
2. Sort only one array. Instead of sorting the other array, iterate through its elements and search for the presence of each of them in the sorted array (compare only while the elements in the sorted array are smaller; do not iterate through the whole array). See the pseudo-code below.
Note: if your permutations can contain repeated numbers (e.g. [1,2,2,10]), you will need to remove elements from the sorted array as they are matched.
pseudo-code:
if length(arr1) <> length(arr2) return false;
sort(arr2);
for i = 1 to length(arr1) {
    elem = arr1[i];
    j = 1;
    while (j <= length(arr2) and arr2[j] < elem) j = j + 1;
    if (j > length(arr2) or arr2[j] <> elem) return false;
}
return true;
The idea is that instead of sorting the other array, we just try to match each of its elements against the sorted one.
You can probably reduce the collisions a lot by using the product as well as the sum of the terms:
1*10*3*18 = 540 and 10*18*3*1 = 540
so the sum-product hash would be [32, 540].
You still need to do something about collisions when they do happen, though.
I like using a string's default hash code (Java, C#; not sure about other languages): it generates pretty unique hash codes.
So first sort the array, then generate a unique string from it using some delimiter.
You can do the following (Java):
int[] arr = selectRandomNumbers();
Arrays.sort(arr);
int hash = (arr[0] + "," + arr[1] + "," + arr[2] + "," + arr[3]).hashCode();
If performance is an issue, you can replace the inefficient string concatenation with StringBuilder or String.format:
String.format("%d,%d,%d,%d", arr[0], arr[1], arr[2], arr[3]);
A string's hash code of course doesn't guarantee that two distinct strings have different hashes, but given this formatting, collisions should be extremely rare.
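For what it's worth, the analogous idea in Python (my own transliteration) can skip the string entirely by hashing an immutable sorted tuple:

arr = [1, 10, 3, 18]
h = hash(tuple(sorted(arr)))   # identical for every permutation of arr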