Finding similarity of sequences in Ruby

I want to find the similarity of two sequences in Ruby based solely on the quantity of shared values. The sequential position of the values should be irrelevant. Whether one sequence has values that the other does not should also be irrelevant. Levenshtein distance was suggested to me, but it computes the number of edits required to make the sequences identical. Here's a simple example of the flaw:
[1,2,3,4,5]
[2,3,4,5,6,7,8,9]
#Lev distance is 5
[1,2,3,4,5]
[6,7,8,9,10]
#Lev distance is 5
In a perfect world the first set would have much greater similarity than the second set. The crude, obvious solution is to use nested loops to check each value of the first sequence against each value of the second. Is there a more efficient way?

You can compute the intersection of a pair of arrays using the & operator, like this:
a = [1,2,3,4,5]
b = [2,3,4,5,6,7,8,9]
common = a & b # => [2, 3, 4, 5]
common.size # => 4
Is this what you are looking for?

If the sequences are sorted (or you sort them), all you have to do is walk down both lists, incrementing the similarity counter and popping off both values if they match. If they don't match, you pop off the smaller value, and continue until one list is empty. The complexity of this is O(n log n) for the sorting plus O(n) for the walk, where n is the sum of the lengths of the two lists.
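A minimal Ruby sketch of that walk might look like this (the method name similarity is my own; it sorts copies and advances indexes rather than popping values, but the walk is the same):

def similarity(a, b)
  x, y = a.sort, b.sort
  i = j = count = 0
  while i < x.size && j < y.size
    if x[i] == y[j]
      count += 1   # shared value found: advance past it in both lists
      i += 1
      j += 1
    elsif x[i] < y[j]
      i += 1       # drop the smaller value and keep walking
    else
      j += 1
    end
  end
  count
end

similarity([1,2,3,4,5], [2,3,4,5,6,7,8,9]) # => 4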
You could also loop through each list, counting the occurrences of each value (so you end up with a count for each distinct value). Then compare these counts, incrementing the similarity counter by the lesser count for each value.
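A rough sketch of that counting variant, assuming Ruby 2.7+ for Hash#tally (the method name is my own):

def similarity_by_counts(a, b)
  counts_a = a.tally   # e.g. [1,1,2].tally => {1=>2, 2=>1}
  counts_b = b.tally
  # For each value, the lesser of the two counts is the number of shared copies.
  counts_a.sum { |value, count| [count, counts_b.fetch(value, 0)].min }
end

Unlike the & intersection above, this also respects duplicates: similarity_by_counts([1,1,2], [1,1,1]) returns 2.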

Related

Count palindromic permutations ("mirrors") of an array

I've been trying to find a solution for this question:
Given an array of integers, count the distinct permutations that are palindromes ("mirrors"); that is, find the number of distinct ways that the array's elements can be rearranged so that they read the same way backward as forward. For example:
If the array is [1,1,2], then there is only one distinct palindromic permutation (namely [1,2,1]), so the desired result is 1.
If the array is [1,1,2,2], then there are two distinct palindromic permutations (namely [1,2,2,1] and [2,1,1,2]), so the desired result is 2.
If the array is [2,2,2,3,3], then there are two distinct palindromic permutations (namely [3,2,2,2,3] and [2,3,2,3,2]), so the desired result is 2.
I've been trying to solve this and have been stuck for quite a while, and can't find any solution online. Any help will be appreciated (I'm just starting out on algorithms & data structures).
My idea is to find the index of the median of the array (e.g., in example #1, the median is at index 1), move all numbers after it to before it (so, [1,2,1]), and check with two pointers (one at the end, one at the start) whether the mirrored values are equal.
However, this won't work if, say, example #1 were arr = [1,2,2], as doing the above would still give [1,2,2]. What I should have done in this case is move the 1 in between the 2s (a sort of median from the end, if that makes sense). Sort of like the above method, but in reverse(?)
Here is the general idea:
Count the frequency of each unique value.
If the array's length is odd, then exactly one frequency should be odd. If not, there are no mirrors. If so, that value will have to be placed in the center, and the number of mirrors is then equal to what you would get for the array with one occurrence of that value removed.
Now the array length is even. No frequencies should be odd, or else there are no mirrors. Now halve all those frequencies.
Determine how many permutations can be formed with those values and their (halved) frequencies. The formula is:
n! / (n1! * n2! * n3! * ... * nk!)
where n is the sum of all (halved) frequencies (i.e. half the size of the array), and n1, n2, ..., nk are the (halved) frequencies themselves.
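Here is a small Ruby sketch of that recipe, assuming integer array elements (both method names are my own):

def factorial(n)
  (1..n).reduce(1, :*)
end

def mirror_count(arr)
  freqs = arr.tally.values
  odd = freqs.count(&:odd?)
  # At most one odd frequency is allowed, and only when the length is odd.
  return 0 if odd > (arr.size.odd? ? 1 : 0)
  halves = freqs.map { |f| f / 2 }   # integer division drops the center element
  factorial(halves.sum) / halves.map { |h| factorial(h) }.reduce(1, :*)
end

mirror_count([1,1,2])     # => 1
mirror_count([1,1,2,2])   # => 2
mirror_count([2,2,2,3,3]) # => 2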

Searching in vector of pairs

I have a vector of pairs (datatype = double), where each pair is (a, b) with a < b. For a number x, I want to find the number of pairs in the vector where a <= x <= b.
Consider a vector size of about 10^6.
My Approach
Sort the vector of pairs and perform a lower_bound operation for x over the "a" values, then iterate from the start up to the lower bound and count the pairs whose "b" satisfies x <= b.
Time Complexity
O(N log N), where N is the vector size.
Issue
I have to perform this over a large number of queries, where this approach becomes inefficient. Is there a better solution that decreases the time complexity?
Sorry for my poor English and question formatting.
In addition to the previous answer, here's a suggestion for how to prepare the ranges to optimize the subsequent lookup. The idea boils down to precomputing the result for all significantly different input values, while being smart about values that don't differ significantly.
To illustrate what I mean, let's consider this sequence of ranges:
1, 3
1, 8
2, 4
2, 6
The prepared output structure then looks like this:
1, 2 -> 2
2, 3 -> 4
3, 4 -> 3
4, 6 -> 2
6, 8 -> 1
For any number in the range 1, 2, there are two matching ranges in the initial sequence. For any number in the range 2, 3, there are four matches, etc. Note that there are five ranges here now, because some of the input ranges partially overlapped. Since for every range here the end value is also the start value of the next range, the end value can be optimized out. The result then looks like a simple map:
1 -> 2
2 -> 4
3 -> 3
4 -> 2
6 -> 1
8 -> 0
Note here that the last range didn't have one following, so the explicit zero becomes necessary. For the values before the first, that is implied. In order to find the result for a value, just find the key that is less than or equal to that value. This is a simple O(log n) lookup.
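A rough Ruby sketch of building and querying such a map, with two parallel sorted arrays standing in for it (all names are mine; endpoint inclusivity is glossed over, exactly as in the example above):

# Build: +1 at each range start, -1 at each range end, then a running
# sum over the sorted distinct boundaries gives the match count.
def build_boundary_map(ranges)
  events = Hash.new(0)
  ranges.each { |a, b| events[a] += 1; events[b] -= 1 }
  keys = events.keys.sort
  running = 0
  counts = keys.map { |k| running += events[k] }
  [keys, counts]
end

# Query: find the rightmost key <= x and return its count.
def count_covering(keys, counts, x)
  i = (0...keys.size).bsearch { |j| keys[j] > x } || keys.size
  i.zero? ? 0 : counts[i - 1]
end

keys, counts = build_boundary_map([[1, 3], [1, 8], [2, 4], [2, 6]])
count_covering(keys, counts, 2.5) # => 4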
Firstly, if you just did a simple scan over the pairs, you would have O(n) complexity! The O(n log n) comes from sorting, and for a one-off operation this is just overhead. If you don't reuse the results, the plain scan might even be the best way to do it; even for a few queries, it might still beat sorting. Make sure you allow yourself to switch out the algorithm.
Anyhow, let's consider that you need to make many queries. Then, one relatively obvious step to improve things is to not iterate step-by-step after sorting. Instead, you can do a binary search for the lower bound. Simply partition the sequence into halves. The lower bound can be found in either half, which you can determine by looking at the middle element between the partitions. Recurse until you find the first element that cannot possibly contain the value you are searching for, because its start value is already greater.
Concerning the other direction, things are not that easy. Just because you sorted the ranges by the start value doesn't imply that the end values are sorted, too. Also, ranges that match and ranges that don't can be mixed in the sequence, so here you will have to perform a linear scan.
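A short Ruby sketch of that combination, sorting by start value and using bsearch for the lower bound (names are mine):

def count_matches(pairs_sorted_by_start, x)
  n = pairs_sorted_by_start.size
  # Index of the first pair whose start value is already greater than x.
  limit = (0...n).bsearch { |i| pairs_sorted_by_start[i][0] > x } || n
  # Linear scan over the end values, since they are not sorted.
  pairs_sorted_by_start[0...limit].count { |_a, b| x <= b }
end

pairs = [[1, 3], [1, 8], [2, 4], [2, 6]].sort
count_matches(pairs, 2.5) # => 4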
Lastly, some notes:
You could parallelize this algorithm using multithreading.
Depending on your number of searches M in your outer loop, you could also switch the outer loop with the inner one. That means that for every pair of the input vector, you check each of the M search values whether they fall within the range. This might be better, in particular when the M searches fit into the CPU cache.
This is a very typical problem for segment trees, binary indexed trees, and interval trees.
You have two operations to carry out on an array arr:
1. Range update: Add(a, b): for(int i = a; i <= b; ++i) arr[i]++
2. Point query : Query(x): return arr[x]
Alternatively, you can reformulate the problem slightly more cleverly:
1. Point Update: Add(a, b): arr[a]++; arr[b+1]--;
2. Range Query: Query(x): return sum(arr[0], arr[1] ..... arr[x]);
In each of the cases above, you have one O(n) operation and one O(1) operation.
For the second case, the query is essentially a prefix sum calculation. Binary Indexed Trees are especially efficient at this task.
Tutorial for Binary Indexed Trees
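A compact Ruby sketch of the second formulation backed by a Fenwick (binary indexed) tree, assuming the coordinates are already compressed to 1..size (the class and its method names are my own):

class FenwickTree
  def initialize(size)
    @tree = Array.new(size + 1, 0)
  end

  # Point update: add delta at 1-based index i, in O(log n).
  def add(i, delta)
    while i < @tree.size
      @tree[i] += delta
      i += i & -i
    end
  end

  # Range query: prefix sum over indices 1..i, in O(log n).
  def sum(i)
    total = 0
    while i > 0
      total += @tree[i]
      i -= i & -i
    end
    total
  end
end

ft = FenwickTree.new(10)
ft.add(2, 1); ft.add(6, -1)  # Add(2, 5): the range [2, 5] gets +1
ft.sum(4)                    # Query(4) => 1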
IMPORTANT IDEA: ARRAY COMPRESSION
You did mention that the vector size is about 10^6, so you may not be able to create an array indexed by the raw values. If you can collect the set of all the a's, b's and x's beforehand, then you can translate them into numbers from 1 to the size of the set.
SUPER CLEVER IDEA: MO's ALGORITHM
This only applies if you are allowed to solve the problem offline. That means you can take all the query points x as input, solve them in any order you like, store the solutions, and then print them in the correct order.
Please mention if this is your situation, and only then will I elaborate further on this. But Binary Indexed Trees are going to be more efficient than Mo's algorithm.
EDIT:
Because your interval values are of type double, you must convert them to integers before you use my solution. Let me give an example,
Intervals = (1.1 to 1.9), (1.4 to 2.1)
Query Points = 1.5, 2.0
Here, the points of interest are not all possible doubles, but just the numbers above: {1.1, 1.4, 1.5, 1.9, 2.0, 2.1}
If we map them into positive integers:
1.1 --> 1
1.4 --> 2
1.5 --> 3
1.9 --> 4
2.0 --> 5
2.1 --> 6
Then you could use segment trees/binary indexed trees.
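In Ruby, that compression step might look like this, assuming all interval endpoints and query points are known up front (and Ruby 2.6+ for to_h with a block):

points = [1.1, 1.9, 1.4, 2.1, 1.5, 2.0]
rank = points.uniq.sort.each_with_index.to_h { |v, i| [v, i + 1] }
rank # => {1.1=>1, 1.4=>2, 1.5=>3, 1.9=>4, 2.0=>5, 2.1=>6}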
For each pair (a, b) you can decompose it into a +1 event at a and a -1 event after b, so a running sum gives the number of ranges valid for a particular value. It then becomes a simple O(log n) lookup to see how many ranges encompass the search value.

Algorithms for dividing an array into n parts

In a recent campus Facebook interview I was asked to divide an array into 3 parts such that the sum in each part is roughly equal to sum/3.
My Approach
1. Sort the array.
2. Fill part[k] (starting with k = 0) until its sum reaches sum/3.
3. Then increment k and repeat the above step for part[k].
Is there a better algorithm for this, or is it an NP-hard problem?
This is a variant of the partition problem (see http://en.wikipedia.org/wiki/Partition_problem for details). In fact a solution to this can solve that one (take an array, pad with 0s, and then solve this problem) so this problem is NP hard.
There is a dynamic programming approach that is pseudo-polynomial. For each i from 0 to the size of the array, you keep track of all possible combinations of current sizes for the sub arrays, and their current sums. As long as there is a limited number of possible sums of subsets of the array, this runs acceptably fast.
The solution that I would have suggested is to just go for "good enough" closeness. First let's consider the simpler problem with all values positive. Then sort by value descending. Take that array in threes. Build up the three subsets by always adding the largest of the triple to the one with the smallest sum, the smallest to the one with the largest, and the middle to the middle. You will end up dividing the array evenly, and the difference will be no more than the value of the third smallest element.
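A small Ruby sketch of that greedy scheme for all-positive values (the method name is mine):

def split_in_three(values)
  sums = [0, 0, 0]
  parts = [[], [], []]
  values.sort.reverse.each_slice(3) do |triple|
    # Subsets ordered by current sum, smallest first: the largest value
    # of the triple goes to the smallest subset, and so on.
    order = [0, 1, 2].sort_by { |i| sums[i] }
    triple.each_with_index do |v, j|
      parts[order[j]] << v
      sums[order[j]] += v
    end
  end
  parts
end

split_in_three([1, 2, 3, 4, 5, 6, 7, 8, 9]) # sums come out as 16, 15, 14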
For the general case you can divide into positive and negative, use the above approach on each, and then brute force all combinations of a group of positives, a group of negatives, and the few leftover values in the middle that did not divide evenly.
Here are details on a dynamic programming solution, if you are interested. The running time and memory usage are O(n*(sum)^2), where n is the size of your array and sum is the sum of the absolute values of your array's values.
For each array index j from 1 to n, store all the possible values you can get for your 3 subset sums when you split array positions 1 to j into 3 subsets. Also, for each possibility, store one possible way to split the array to get those 3 sums. To extend this information from 1..j to 1..(j+1), simply take each possible combination of 3 sums for splitting 1..j and form the 3 new combinations you get when you add the (j+1)th array element to any one of the 3 subsets.
Finally, when you reach j = n, go through the set of all combinations of 3 subset sums for splitting positions 1 to n into 3 sets, and choose the one whose maximum deviation from sum/3 is minimized.
At first this may seem like O(n*(sum)^3) complexity, but for each j and each combination of the first 2 subset sums, the 3rd subset sum is uniquely determined (because you are not allowed to omit any elements of the array). Thus the complexity really is O(n*(sum)^2).
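A condensed Ruby sketch of that DP over reachable (sum1, sum2) pairs, with the third sum implied and the split reconstruction omitted for brevity (names are mine):

require 'set'

def min_max_deviation(arr)
  total = arr.sum
  states = Set[[0, 0]]  # reachable (part-1 sum, part-2 sum) pairs
  arr.each do |v|
    nxt = Set.new
    states.each do |s1, s2|
      nxt << [s1 + v, s2]   # v goes into part 1
      nxt << [s1, s2 + v]   # v goes into part 2
      nxt << [s1, s2]       # v goes into part 3
    end
    states = nxt
  end
  # Pick the split whose largest deviation from total/3 is smallest.
  states.map { |s1, s2|
    [s1, s2, total - s1 - s2].map { |s| (s - total / 3.0).abs }.max
  }.min
end

min_max_deviation([1, 2, 3, 4, 5, 6]) # => 0.0 (e.g. {1,6}, {2,5}, {3,4})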

Algorithm to generate a 'nearly sorted' or 'k sorted' list?

I want to generate some test data to test a function that merges 'k sorted' lists (lists where each element is at most k positions away from its correct sorted position) into a single fully sorted list. I have an approach that works, but I'm not sure how well randomized it is, and I feel there should be a simpler / more elegant way to do this. My current approach:
Generate n random elements paired with an integer index.
Sort random elements.
Set paired index for each element to its sorted position.
Work backwards through the elements, swapping each element with an element a random distance between 1 and k positions behind it in the list. Only swap with the target element if its paired index is its current index (this avoids swapping an element that is already out of place and moving it further than k positions away from where it should be).
Copy the perturbed elements out into another list.
Like I say, this works but I'm interested in alternative / better approaches.
I think you could just fill an array with random integers and then run quicksort on it with a custom stopping condition.
If in a particular quicksort recursion your start and end indexes are less than k apart, then just return instead of continuing to recur.
Because of how quicksort works, every number in the start..end interval belongs somewhere in that region; worst case is that array[start] might really belong at array[end] (or vice versa) in truly sorted order. So, assuring that start and end are no more than k apart should be sufficient.
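A minimal Ruby sketch of quicksort with that stopping condition, using Hoare partitioning and assuming k >= 1 (names are mine):

def nearly_sort!(arr, k, start = 0, stop = arr.size - 1)
  return if stop - start < k   # segment is short enough: leave it unsorted
  pivot = arr[(start + stop) / 2]
  i, j = start, stop
  while i <= j
    i += 1 while arr[i] < pivot
    j -= 1 while arr[j] > pivot
    if i <= j
      arr[i], arr[j] = arr[j], arr[i]
      i += 1
      j -= 1
    end
  end
  nearly_sort!(arr, k, start, j)
  nearly_sort!(arr, k, i, stop)
end

a = Array.new(20) { rand(100) }
nearly_sort!(a, 3)   # a is now (at most) 3-sorted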
You can generate an array of random numbers and then h-sort it as in Shellsort, but skip the last few sorting passes where h is less than k.
Step 1: Randomly permute disjoint segments of length k (e.g. 1 to k, k+1 to 2k, ...).
Step 2: Permute again over shifted segments (1+t to k+t, k+1+t to 2k+t, ...), where t is a number between 1 and k (preferably k/2), swapping only where the swaps don't break the k-sorted assumption.
Probably repeat step 2 multiple times with different t.
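Step 1 alone already yields a k-sorted list and is a one-liner in Ruby, since each element stays inside its own length-k segment:

arr = (1..20).to_a
k = 4
k_sorted = arr.each_slice(k).flat_map(&:shuffle)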
If I understand the problem, you want an algorithm to randomly pick a single k-sorted list of length n, uniformly selected from the universe U of all k-sorted lists of length n. (You will then run this algorithm m times to produce m lists as input test data.)
The first step is to count them: what is |U|, the size of U?
The next step is to enumerate them. Create any one-to-one mapping F between the integers (1,2,...,|U|) and k-sorted lists of length n.
Then randomly select an integer x between 1 and |U| inclusive, and then apply F(x) to get the list.

From the given array of numbers, find all groups of 3 numbers with sum N

Given an array of numbers:
1, 2, 8, 6, 9, 0, 4
We need to find all groups of three numbers that sum to a value N (say 11 in this example). Here, the possible groups of three are:
{1,2,8}, {1,4,6}, {0,2,9}
The first solution I could think of was O(n^3). Later I improved it a little, to O(n^2 log n), with this approach:
1. Sort the array.
2. Select any two numbers and perform a binary search for the third element.
Can it be improved further with some other approaches?
You can certainly do it in O(n^2): for each i in the array, test whether two other values sum to N-i.
You can test in O(n) whether two values in a sorted array sum to k by sweeping from both ends at once. If the sum of the two elements you're on is too big, decrement the "right-to-left" index to make it smaller. If the sum is too small, increment the "left-to-right" index to make it bigger. If there's a pair that works, you'll find them, and you perform at most 2*n iterations before you run out of road at one end or the other. You might need code to ignore the value you're using as i, depends what the rules are.
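A short Ruby sketch of the whole O(n^2) search; this variant fixes i and sweeps only the elements to its right, which finds each group once without extra index-skipping code (names are mine):

def triples_with_sum(nums, target)
  a = nums.sort
  found = []
  (0...a.size - 2).each do |i|
    lo, hi = i + 1, a.size - 1
    while lo < hi
      sum = a[i] + a[lo] + a[hi]
      if sum == target
        found << [a[i], a[lo], a[hi]]
        lo += 1
        hi -= 1
      elsif sum < target   # too small: move the left pointer right
        lo += 1
      else                 # too big: move the right pointer left
        hi -= 1
      end
    end
  end
  found
end

triples_with_sum([1, 2, 8, 6, 9, 0, 4], 11)
# => [[0, 2, 9], [1, 2, 8], [1, 4, 6]]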
You could instead use some kind of dynamic programming, working down from N, and you probably end up with time something like O(n*N) or so. Realistically I don't think that's any better: it looks like all your numbers are non-negative, so if n is much bigger than N then before you start you can quickly throw out any large values from the array, and also any duplicates beyond 3 copies of each value (or 2 copies, as long as you check whether 3*i == N before discarding the 3rd copy of i). After that step, n is O(N).
