How to find the lowest possible value of a series of ints? - algorithm

I have a sequence of integers (positive and negative) like this one:
12,-54,32,1,-2,-4,-8,12,56,-22,-21,4,17,35
And I need to find the worst result (smallest sum of values) possible over any contiguous subsequence of this sequence (and, of course, the start index and end index of that subsequence).
Is there a way of doing this that is not O(2^n) (computing all the possible subsequences one by one)?
For example, with this simple sequence:
1,2,-3,4,-6,4,-10,3,-2
The smallest sum of values would come from the subsequence:
-6,4,-10 (with start index 4 and end index 6)

The issue of finding the minimum can be transformed into a maximum search by changing sign of each item.
For the maximum subsequence there exist well-known algorithms, see e.g. here.
You can either transform your list and apply that algorithm, or slightly modify the algorithm itself (min instead of max, or minus instead of plus) so that it works with your original list.
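For illustration, a minimal sketch in Python: Kadane's algorithm with the comparison flipped so it tracks the minimum sum directly, returning the sum together with the start and end indices (the function name is my own):

```python
def min_subarray(seq):
    # Kadane's algorithm, flipped to track the minimum running sum.
    # Returns (sum, start_index, end_index) of the worst contiguous run.
    best_sum, best_start, best_end = seq[0], 0, 0
    cur_sum, cur_start = seq[0], 0
    for i in range(1, len(seq)):
        if cur_sum > 0:
            # A positive running prefix can only hurt a minimum: restart.
            cur_sum, cur_start = seq[i], i
        else:
            cur_sum += seq[i]
        if cur_sum < best_sum:
            best_sum, best_start, best_end = cur_sum, cur_start, i
    return best_sum, best_start, best_end

print(min_subarray([1, 2, -3, 4, -6, 4, -10, 3, -2]))  # (-12, 4, 6)
```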

Related

Count palindromic permutations ("mirrors") of an array

I've been trying to find a solution for this question:
Given an array of integers, count the distinct permutations that are palindromes ("mirrors"); that is, find the number of distinct ways that the array's elements can be rearranged so that they read the same way backward as forward. For example:
If the array is [1,1,2], then there is only one distinct palindromic permutation (namely [1,2,1]), so the desired result is 1.
If the array is [1,1,2,2], then there are two distinct palindromic permutations (namely [1,2,2,1] and [2,1,1,2]), so the desired result is 2.
If the array is [2,2,2,3,3], then there are two distinct palindromic permutations (namely [3,2,2,2,3] and [2,3,2,3,2]), so the desired result is 2.
I've been trying to solve this and have been stuck for quite a while, and I can't find any solution online. Any help will be appreciated (I'm just starting out on algorithms & data structures).
My idea is to find the index of the median of the array (e.g., in example #1, the median is at index 1), move all the numbers after it to before it (giving [1,2,1]), and check with two pointers (one at the end, one at the start) whether all the paired numbers are equal.
However, this won't work if, say, example #1 were arr = [1,2,2], as doing the above would still give 1,2,2. What I should have done in this case is move the 1 in between the 2s (a sort of median from the end, if that makes sense). Sort of like the above method, but in reverse(?)
Here is the general idea:
Count the frequency of each unique value.
If the array's length is odd, then exactly one frequency should be odd. If not, there are no mirrors. If so, that value will have to be placed in the center. The number of mirrors is then equal to what you would get for an array with one value less -- that value removed.
Now the array length is even. No frequencies should be odd, or else there are no mirrors. Now halve all those frequencies.
Determine how many permutations can be formed with those values and their (halved) frequencies. The formula is:
n! / (n1! * n2! * ... * nk!)
where n is the sum of all (halved) frequencies (i.e., half the size of the array), and n1, ..., nk are the (halved) frequencies themselves.
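A minimal sketch of this procedure in Python (the function name is mine; Counter and factorial do the bookkeeping):

```python
from collections import Counter
from math import factorial

def count_mirrors(arr):
    freq = Counter(arr)
    odd = sum(1 for v in freq.values() if v % 2 == 1)
    # An odd-length array allows exactly one odd frequency (the center
    # value); an even-length array allows none.
    if odd > len(arr) % 2:
        return 0
    # Halving each frequency also drops the single center element, if any.
    halves = [v // 2 for v in freq.values()]
    n = sum(halves)
    result = factorial(n)
    for h in halves:
        result //= factorial(h)
    return result

print(count_mirrors([1, 1, 2]))        # 1
print(count_mirrors([1, 1, 2, 2]))     # 2
print(count_mirrors([2, 2, 2, 3, 3]))  # 2
```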

Find medians in multiple sub-ranges of an unordered list

E.g. given an unordered list of N elements, find the medians for sub-ranges 0..100, 25..200, 400..1000, 10..500, ...
I don't see any better way than going through each sub-range and running the standard median-finding algorithm.
A simple example: [5 3 6 2 4]
The median for 0..3 is 5. (Not 4, since we are asking for the median of the first three elements of the original list.)
INTEGER ELEMENTS:
If your elements are integers, then the best way is to have a bucket for each number that lies in any of your sub-ranges, where each bucket counts how many times its associated integer appears in your input (for example, bucket[100] stores how many 100s there are in your input sequence). Basically you can achieve it in the following steps:
Create buckets for each number that lies in any of your sub-ranges.
Iterate through all elements; for each number n, if we have bucket[n], then bucket[n]++.
Compute the medians based on the aggregated counts stored in your buckets.
Put another way, suppose you have a sub-range [0, 10] and you would like to compute the median. The bucket approach basically computes how many 0s there are in your input, how many 1s there are, and so on. Suppose there are n numbers lying in the range [0, 10]; then the median is the (n/2)-th largest element, which can be identified by finding the i such that bucket[0] + bucket[1] + ... + bucket[i] is greater than or equal to n/2 while bucket[0] + ... + bucket[i-1] is less than n/2.
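A minimal sketch of that single-pass bucket idea, assuming the sub-ranges are ranges of integer values small enough to allocate one bucket per value (names are hypothetical):

```python
def bucket_median(values, lo, hi):
    # One bucket per integer value in [lo, hi]; count occurrences.
    buckets = [0] * (hi - lo + 1)
    n = 0
    for v in values:
        if lo <= v <= hi:
            buckets[v - lo] += 1
            n += 1
    if n == 0:
        return None
    # Walk the buckets until the running count reaches ceil(n / 2);
    # that bucket's value is the (lower) median.
    target = (n + 1) // 2
    running = 0
    for i, count in enumerate(buckets):
        running += count
        if running >= target:
            return lo + i

print(bucket_median([5, 3, 6, 2, 4], 0, 10))  # 4
```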
The nice thing about this is that even if your input elements are stored across multiple machines (i.e., the distributed case), each machine can maintain its own buckets, and only the aggregated counts need to pass through the intranet.
You can also use hierarchical buckets, which involves multiple passes. In each pass, bucket[i] counts the number of input elements that lie in a specific range (for example, [i * 2^K, (i+1) * 2^K]); you then narrow down the problem space by identifying which bucket the median lies in after each pass, decrease K by 1 in the next pass, and repeat until you can pinpoint the median exactly.
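A rough sketch of the same multi-pass idea, using a fixed number of buckets per pass rather than powers of two (the names and the bucket count are my own choices):

```python
def hierarchical_bucket_median(values, lo, hi, num_buckets=256):
    # One pass up front to learn the median's rank within [lo, hi].
    n = sum(1 for v in values if lo <= v <= hi)
    if n == 0:
        return None
    rank = (n + 1) // 2                # 1-based rank of the lower median
    while hi > lo:
        # Each pass re-reads the data, counting only the candidate range.
        width = max(1, (hi - lo + num_buckets) // num_buckets)
        counts = [0] * ((hi - lo) // width + 1)
        for v in values:
            if lo <= v <= hi:
                counts[(v - lo) // width] += 1
        # Narrow [lo, hi] to the bucket that contains the sought rank.
        for i, c in enumerate(counts):
            if rank <= c:
                new_lo = lo + i * width
                hi = min(hi, new_lo + width - 1)
                lo = new_lo
                break
            rank -= c
    return lo

print(hierarchical_bucket_median([5, 3, 6, 2, 4], 0, 10))  # 4
```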
FLOATING-POINT ELEMENTS:
The entire input fits into memory:
If your entire input fits into memory, first sorting the N elements and then finding the median of each sub-range is the best option. The linear-time heap solution also works well in this case if the number of sub-ranges is less than log N.
The entire input does not fit into memory, but is stored on a single machine:
Generally, an external sort requires about three disk scans. Therefore, if the number of sub-ranges is greater than or equal to 3, then first sorting the N elements and then finding the median of each sub-range by loading only the necessary elements from disk is the best choice. Otherwise, simply performing a scan for each sub-range and picking up the elements in that sub-range is better.
The input is stored across multiple machines:
Since the median is a holistic operator, meaning you cannot derive the final median of the entire input from the medians of parts of the input, this is a hard problem whose solution cannot be described in a few sentences, but there is research (see this as an example) focused on it.
I think that as the number of sub-ranges increases, you will very quickly find that it is quicker to sort and then retrieve the elements you want.
In practice, this is because there are highly optimized sort routines you can call.
In theory, and perhaps in practice too, this is because, since you are dealing with integers, you need not pay O(n log n) for a sort; see http://en.wikipedia.org/wiki/Integer_sorting.
If your data are in fact floating point and not NaNs, then a little bit-twiddling will in fact allow you to use integer sorting on them. From http://en.wikipedia.org/wiki/IEEE_754-1985#Comparing_floating-point_numbers: the binary representation has the special property that, excluding NaNs, any two numbers can be compared like sign-and-magnitude integers (although with modern computer processors this is no longer directly applicable): if the sign bits differ, the negative number precedes the positive number (except that negative zero and positive zero should be considered equal); otherwise, the relative order is the same as lexicographical order, but inverted for two negative numbers; endianness issues apply.
So you could check for NaNs and other funnies, pretend the floating-point numbers are sign-and-magnitude integers, subtract when negative to correct the ordering for negative numbers, then treat them as normal two's-complement signed integers, sort, and reverse the process.
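A sketch of that trick in Python, assuming 64-bit IEEE 754 doubles and no NaNs; this version uses the common flip-the-bits variant rather than subtraction, which yields the same ordering:

```python
import struct

def sort_key(x):
    # Reinterpret the IEEE 754 bits of a double as an unsigned 64-bit int.
    bits = struct.unpack('<Q', struct.pack('<d', x))[0]
    if bits & (1 << 63):
        # Negative: flip all bits, so larger magnitudes order lower.
        return bits ^ ((1 << 64) - 1)
    # Non-negative: set the sign bit, so these order above all negatives.
    return bits | (1 << 63)

values = [3.5, -0.25, 0.0, -7.125, 2.0]
print(sorted(values, key=sort_key))  # [-7.125, -0.25, 0.0, 2.0, 3.5]
```

The resulting keys are plain unsigned integers, so they can be fed to a radix sort or any other integer sort.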
My idea:
Sort the list into an array (using any appropriate sorting algorithm)
For each range, find the indices of the start and end of the range using binary search
Find the median by simply adding their indices and dividing by 2 (i.e. median of range [x,y] is arr[(x+y)/2])
Preprocessing time: O(n log n) for a generic sorting algorithm (like quick-sort) or the running time of the chosen sorting routine
Time per query: O(log n)
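A minimal sketch of the static case, treating the ranges as value ranges and using Python's bisect module for the binary searches (names are my own):

```python
import bisect

def range_median(sorted_vals, lo, hi):
    # Binary-search the boundary indices of the value range [lo, hi].
    left = bisect.bisect_left(sorted_vals, lo)
    right = bisect.bisect_right(sorted_vals, hi) - 1
    if right < left:
        return None                    # nothing falls inside [lo, hi]
    return sorted_vals[(left + right) // 2]

arr = sorted([5, 3, 6, 2, 4])          # O(n log n) preprocessing
print(range_median(arr, 2, 5))         # lower median of {2, 3, 4, 5} -> 3
```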
Dynamic list:
The above assumes that the list is static. If elements can freely be added or removed between queries, a modified Binary Search Tree could work, with each node keeping a count of the number of descendants it has. This will allow the same running time as above with a dynamic list.
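For the dynamic case, here is a bare-bones sketch of such a count-augmented BST (unbalanced, for brevity; a self-balancing variant such as an AVL or red-black tree would be needed to guarantee the O(log n) bounds):

```python
class Node:
    __slots__ = ("key", "left", "right", "size")
    def __init__(self, key):
        self.key, self.left, self.right, self.size = key, None, None, 1

def size(node):
    return node.size if node else 0

def insert(node, key):
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = insert(node.left, key)
    else:
        node.right = insert(node.right, key)
    node.size = 1 + size(node.left) + size(node.right)
    return node

def count_less(node, key):
    # Number of stored keys strictly less than `key`.
    if node is None:
        return 0
    if key <= node.key:
        return count_less(node.left, key)
    return size(node.left) + 1 + count_less(node.right, key)

def count_leq(node, key):
    # Number of stored keys less than or equal to `key`.
    if node is None:
        return 0
    if key < node.key:
        return count_leq(node.left, key)
    return size(node.left) + 1 + count_leq(node.right, key)

def select(node, k):
    # k-th smallest stored key (0-based).
    left_size = size(node.left)
    if k < left_size:
        return select(node.left, k)
    if k == left_size:
        return node.key
    return select(node.right, k - left_size - 1)

def tree_range_median(root, lo, hi):
    left, right = count_less(root, lo), count_leq(root, hi) - 1
    if right < left:
        return None
    return select(root, (left + right) // 2)

root = None
for x in [5, 3, 6, 2, 4]:
    root = insert(root, x)
print(tree_range_median(root, 2, 5))   # 3
```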
The answer is ultimately going to be "it depends". There are a variety of approaches, any one of which will probably be suitable for most of the cases you may encounter. The problem is that each is going to perform differently for different inputs. Where one performs better for one class of inputs, another will perform better for a different class of inputs.
As an example, the approach of sorting and then performing a binary search on the extremes of your ranges and then directly computing the median will be useful when the number of ranges you have to test is greater than log(N). On the other hand, if the number of ranges is smaller than log(N) it may be better to move elements of a given range to the beginning of the array and use a linear time selection algorithm to find the median.
All of this boils down to profiling to avoid premature optimization. If the approach you implement turns out to not be a bottleneck for your system's performance, figuring out how to improve it isn't going to be a useful exercise relative to streamlining those portions of your program which are bottlenecks.

Counting distinct common subsequences for a given set of strings

I was going through this paper about counting the number of distinct common subsequences between two strings, which describes a DP approach for doing so. Now, when there are more than two strings whose number of distinct common subsequences must be found, a different approach might be needed. What I want to know is whether this task is achievable in less than exponential time, and if so, how it can be done.
If you have an alphabet of size k and m strings of size at most n, then (assuming that all individual math operations are O(1)) this problem is solvable with dynamic programming in time at most O(k·n^(m+1)) and memory O(k·n^m). Those are not tight bounds, and in practice performance and memory usage should be significantly better than that. But in practice, with long strings, you will wind up needing big-integer arithmetic, which makes math operations not O(1). Still, it is polynomial.
Here is the trick in an unfortunately confusing sentence. We want to build up a series of tables listing, for each possible length of subsequence and each set of ways to pick one copy of a character from each string, the number of distinct subsequences there are whose minimal expression in each string ends at the chosen spot. If we do that, then the sum of all of those values is our final answer.
Here is an outline of how to do it (which you can do without understanding the above description).
For each string, build a transition table mapping (position in string, character) to the position of the next occurrence of that character. The tables should start with position 0 being before the first character. You can use -1 for running off of the end of the string.
Create a data structure that maps a tuple of integers (one per string) to another integer. This will be the count of subsequences of a fixed length whose shortest representation in each string ends at that set of positions.
Insert, as the sole entry, (0, 0, ..., 0) -> 1 to represent the fact that there is 1 subsequence of length 0, and its shortest representation in each string ends at the start.
Set the total count of common subsequences to 0.
While that map is not empty:
Add the sum of values in that map to the total count of common subsequences.
Create a second map of the same type, with no data.
For each key/value pair in the first map:
For each possible character in your alphabet:
Construct a new vector of integers to be a new key by taking, for each string, its current position and looking up the next position of that character. Of course, if you run off of the end of any string, break out of the loop.
If that key is not in your second map, insert it with value 0.
Increase the value for that key in the second map by your current value in the current map. (Basically add the number of subsequences that just had this minimal character transition.)
Copy the second data structure to the first.
The total count of distinct subsequences in common across all of the strings should now be correct.
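A rough Python rendering of that outline (all names are mine; note that, following the outline literally, the returned count includes the empty subsequence):

```python
from collections import defaultdict

def count_distinct_common_subsequences(strings):
    alphabet = sorted(set().union(*strings))

    # Step 1: per-string transition tables. nxt[s][p][c] is the position
    # just after the next occurrence of c in strings[s] at or beyond
    # position p, or -1 if c never occurs again (position 0 = before the
    # first character).
    nxt = []
    for t in strings:
        table = [None] * (len(t) + 1)
        table[len(t)] = {c: -1 for c in alphabet}
        for p in range(len(t) - 1, -1, -1):
            table[p] = dict(table[p + 1])
            table[p][t[p]] = p + 1
        nxt.append(table)

    # current maps (pos in string 1, ..., pos in string m) -> number of
    # distinct common subsequences whose minimal match ends there.
    current = {tuple([0] * len(strings)): 1}
    total = 0
    while current:
        total += sum(current.values())
        successors = defaultdict(int)
        for positions, count in current.items():
            for c in alphabet:
                new_positions = []
                for s, p in enumerate(positions):
                    q = nxt[s][p][c]
                    if q == -1:
                        break
                    new_positions.append(q)
                else:  # c occurs after the current position in every string
                    successors[tuple(new_positions)] += count
        current = successors
    return total

print(count_distinct_common_subsequences(["ab", "ba"]))  # 3: "", "a", "b"
```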

Sort sub-array minimally

This is an interview question.
Given an array of integers, write a method to find indices m and n such that if you sorted elements m through n, the entire array would be sorted. Minimize n - m; i.e., find the smallest such sequence.
Observation
The integers before m should be ascending and smaller than (or equal to) any integers after.
Algorithm
Start from the first element and stop at the first decrease (call this prefix sub-array SA).
Find the minimum of the elements after SA (call it MIN).
The start point m is just after the largest integer in SA that is smaller than (or equal to) MIN. (m is found.)
Complexity
O(N)
Do the same, symmetrically, to find n.
You need to keep track of four things:
1. The end of the sorted region at the beginning
2. The start of the sorted region at the end
3. The minimum number after the beginning region
4. The maximum number before the end region
Start by figuring out preliminary values for 1 and 2 by scanning the array from the start and from the end until you find a misplaced value in each direction.
Then scan everything after your preliminary 1 to find the minimum number. This is your 3. Find 4 in the same way.
Now backtrack through the start region of the array until you find the place where the minimum value belongs. This is the exact answer to 1, and also your m.
Find n in the same way, by backtracking through the end region to find where the maximum number belongs.
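Here is a sketch in Python of the approach both answers describe (names are my own; it returns the pair (m, n), or (None, None) for an already-sorted array):

```python
def min_window_to_sort(arr):
    n = len(arr)
    # 1./2. Preliminary ends of the sorted prefix and suffix.
    left = 0
    while left + 1 < n and arr[left] <= arr[left + 1]:
        left += 1
    if left == n - 1:
        return None, None              # already fully sorted
    right = n - 1
    while arr[right - 1] <= arr[right]:
        right -= 1
    # 3./4. Minimum after the prefix, maximum before the suffix.
    lo = min(arr[left + 1:])
    hi = max(arr[:right])
    # Backtrack: m is where `lo` belongs, end is where `hi` belongs.
    m = 0
    while arr[m] <= lo:
        m += 1
    end = n - 1
    while arr[end] >= hi:
        end -= 1
    return m, end

print(min_window_to_sort([1, 2, 4, 7, 10, 11, 7, 12, 6, 7, 16, 18, 19]))
# (3, 9): sorting elements 3 through 9 sorts the whole array
```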

Get the most unique texts from a group of texts

I have a number of texts, for example 100.
I would like to keep the 10 most unique among them. I made a 100x100 matrix comparing each text against every other with the Levenshtein algorithm.
Is there an algorithm to select the 10 most unique?
EDIT:
What I want is the N most unique texts: those that maximize the distance between the N texts, regardless of the 1st element of my set.
I want the most unique because I will publish these texts on the web and I want to avoid near-duplicates.
A long comment rather than an answer ...
I don't think you've specified your requirement(s) clearly enough. How do you select the 1st element of your set of 10 strings? Is it the string with the largest distance from any other string (in which case you are looking for the largest element in your array), or the one with the largest distance from all the other strings (in which case you are looking for the largest row or column sum in the array)?
Moving on to the N (or 10 as you suggest) most distant strings, you have a number of choices.
You could select the N largest distances in the array. I suspect, not having seen your data, that it is likely that the string which is furthest from any other string may also be furthest away from several other strings too -- I mean you may find that several of the N largest entries in your array occur in the same row or column.
You could simply select the N strings with the largest row sums.
Or perhaps you are looking for a cluster of N strings which maximises the distance between all the strings in that cluster and all the strings in the remaining 100-N strings. This might lead you towards looking at, rather obviously, clustering algorithms.
I suggest you clarify your requirements and edit your question.
Since this looks like an eigenvalue problem, I would try running the power iteration on the matrix and rejecting the 90 highest values from the resulting vector. The power iteration normally converges very fast, within about ten iterations. BTW: this solution assumes a similarity matrix. If the entries of your matrix are a measure of dissimilarity ("distance"), you might need to use their inverses instead.
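A sketch of that suggestion with NumPy, assuming sim is the 100x100 similarity matrix (the function name and iteration count are arbitrary choices):

```python
import numpy as np

def most_unique(sim, keep=10, iters=50):
    # Power iteration: repeatedly multiply and renormalize a vector until
    # it converges to the dominant eigenvector of the similarity matrix.
    v = np.ones(sim.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        v = sim @ v
        v /= np.linalg.norm(v)
    # Texts with the smallest entries are the least "central", i.e. the
    # most unique; reject the rest (the 90 highest, in the example above).
    return np.argsort(v)[:keep]
```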
