Computing a moving maximum [duplicate] - performance

Possible Duplicate:
Find the min number in all contiguous subarrays of size l of an array of size n
I have a (large) array of numeric data (size N) and would like to compute an array of running maximums with a fixed window size w.
More precisely, I can define a new array out[k-w+1] = max{data[k-w+1], ..., data[k]} for k >= w-1 (this assumes 0-based arrays, as in C++).
Is there a better way to do this than O(N log w)?
[I'm hoping there is a linear one in N without dependence on w, as for the moving average, but I cannot find it. For O(N log w) I think there is a way using a sorted data structure that does insert(), delete(), and extract_max() in O(log w) or less on a structure of size w -- a sorted binary tree, for example.]
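For reference, here is a rough sketch of the O(N log w) approach I have in mind (using the third-party sortedcontainers package as a stand-in for a sorted binary tree):

from sortedcontainers import SortedList  # stand-in for a balanced BST

def moving_maximum_nlogw(data, w):
    window = SortedList(data[:w])   # the current w elements, kept sorted
    out = [window[-1]]              # the largest element sits at the end
    for k in range(w, len(data)):
        window.remove(data[k - w])  # delete() in O(log w)
        window.add(data[k])         # insert() in O(log w)
        out.append(window[-1])      # extract_max() as a peek, O(log w)
    return out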
Thank you very much.

There is indeed an algorithm that can do this in O(N) time with no dependence on the window size w. The idea is to use a clever data structure that supports the following operations:
Enqueue, which adds a new element to the structure,
Dequeue, which removes the oldest element from the structure, and
Find-max, which returns (but does not remove) the maximum element from the structure.
This is essentially a queue data structure that supports access (but not removal) of the maximum element. Amazingly, as seen in this earlier question, it is possible to implement this data structure so that each of these operations runs in amortized O(1) time. As a result, if you use this structure to enqueue the first w elements, then repeatedly dequeue the oldest element, enqueue the next one, and call find-max as needed, it will take only O(n + Q) time, where Q is the number of queries you make. If you only query the maximum of each window once, this ends up being O(n), with no dependence on the window size.
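Here's a minimal sketch of the idea in Python. It uses the common monotonic-deque formulation rather than the two-stack queue from the linked question, but it achieves the same amortized O(1) bounds, since each index is appended and popped at most once:

from collections import deque

def moving_maximum(data, w):
    dq = deque()   # indices into data; their values stay in decreasing order
    out = []
    for k, x in enumerate(data):
        # An older element <= x can never be a window maximum again, so drop it.
        while dq and data[dq[-1]] <= x:
            dq.pop()
        dq.append(k)
        # Evict the front index once it slides out of the current window.
        if dq[0] <= k - w:
            dq.popleft()
        if k >= w - 1:
            out.append(data[dq[0]])  # front of the deque is the window maximum
    return out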
Hope this helps!

I'll demonstrate how to do it with the list:
L = [21, 17, 16, 7, 3, 9, 11, 18, 19, 5, 10, 23, 20, 15, 4, 14, 1, 2, 22, 13, 8, 12, 6]
with length N=23 and W = 4.
Make two new copies of your list:
L1 = [21, 17, 16, 7, 3, 9, 11, 18, 19, 5, 10, 23, 20, 15, 4, 14, 1, 2, 22, 13, 8, 12, 6]
L2 = [21, 17, 16, 7, 3, 9, 11, 18, 19, 5, 10, 23, 20, 15, 4, 14, 1, 2, 22, 13, 8, 12, 6]
Loop from i=0 to N-1. If i is not divisible by W, then replace L1[i] with max(L1[i],L1[i-1]).
L1 = [21, 21, 21, 21 | 3, 9, 11, 18 | 19, 19, 19, 23 | 20, 20, 20, 20 | 1, 2, 22, 22 | 8, 12, 12]
Loop from i=N-2 down to 0. If i+1 is not divisible by W, then replace L2[i] with max(L2[i], L2[i+1]).
L2 = [21, 17, 16, 7 | 18, 18, 18, 18 | 23, 23, 23, 23 | 20, 15, 14, 14 | 22, 22, 22, 13 | 12, 12, 6]
Make a list L3 of length N + 1 - W, so that L3[i] = max(L2[i], L1[i + W - 1])
L3 = [21, 17, 16, 11 | 18, 19, 19, 19 | 23, 23, 23, 23 | 20, 15, 14, 22 | 22, 22, 22, 13]
This list L3 is the moving maxima you seek: L2[i] is the maximum of the range between i and the next vertical line, while L1[i + W - 1] is the maximum of the range between that vertical line and i + W - 1.
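Here is the same three-pass procedure as a small Python sketch (the names L1, L2, L3 match the walkthrough above):

def moving_maximum_blocks(L, W):
    N = len(L)
    L1 = list(L)  # forward running max within each block of W
    L2 = list(L)  # backward running max within each block of W
    for i in range(1, N):
        if i % W != 0:
            L1[i] = max(L1[i], L1[i - 1])
    for i in range(N - 2, -1, -1):
        if (i + 1) % W != 0:
            L2[i] = max(L2[i], L2[i + 1])
    # Every window of size W spans at most two consecutive blocks,
    # so its maximum is the max of the two precomputed halves.
    return [max(L2[i], L1[i + W - 1]) for i in range(N + 1 - W)]

Calling moving_maximum_blocks(L, 4) on the list above reproduces L3.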

Related

Tackling the 'Small Data' Problem with a Distributed Computing Cluster?

I'm learning about Hadoop + MapReduce and Big Data and from my understanding it seems that the Hadoop ecosystem was mainly designed to analyze large amounts of data that's distributed on many servers. My problem is a bit different.
I have a relatively small amount of data (a file consisting of 1-10 million lines of numbers) which needs to be analyzed in millions of different ways. For example, consider the following dataset:
[1, 6, 7, 8, 10, 17, 19, 23, 27, 28, 28, 29, 29, 29, 29, 30, 32, 35, 36, 38]
[1, 3, 3, 4, 4, 5, 5, 10, 11, 12, 14, 16, 17, 18, 18, 20, 27, 28, 39, 40]
[2, 3, 7, 8, 10, 10, 12, 13, 14, 15, 15, 16, 17, 19, 27, 30, 32, 33, 34, 40]
[1, 9, 11, 13, 14, 15, 17, 17, 18, 18, 18, 19, 19, 23, 25, 26, 27, 31, 37, 39]
[5, 8, 8, 10, 14, 16, 16, 17, 20, 21, 22, 22, 23, 28, 29, 30, 32, 32, 33, 38]
[1, 1, 3, 3, 13, 17, 21, 24, 24, 25, 26, 26, 30, 31, 32, 35, 38, 39, 39, 39]
[1, 2, 4, 4, 5, 5, 10, 13, 14, 14, 14, 14, 15, 17, 28, 29, 29, 35, 37, 40]
[1, 2, 6, 8, 12, 13, 14, 15, 15, 15, 16, 22, 23, 24, 26, 30, 31, 36, 36, 40]
[3, 6, 7, 8, 8, 10, 10, 12, 13, 17, 17, 20, 21, 22, 33, 35, 35, 36, 39, 40]
[1, 3, 8, 8, 11, 11, 13, 18, 19, 19, 19, 23, 24, 25, 27, 33, 35, 37, 38, 40]
I need to analyze how frequently a number in each column (Column N) repeats itself a certain number of rows later (L rows later). For example, if we were analyzing Column A with 1L (1-Row-Later), the result would be as follows:
Note: The position does not need to match - so number can appear anywhere in the next row
Column: A N-Later: 1 Result: YES, NO, NO, NO, NO, YES, YES, NO, YES -> 4/9.
We would repeat the above analysis for each column separately and for every possible N-Later. In the above dataset, which only consists of 10 lines, that means a maximum N-Later of 9. But in a dataset of 1 million lines, the analysis (for each column) would be repeated 999,999 times.
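To pin down the core computation, here is a direct sketch of a single analysis in Python (the function name and signature are just for illustration):

def n_later_frequency(rows, col, L):
    # Fraction of rows whose value in column `col` appears anywhere
    # in the row L rows later.
    hits = total = 0
    for i in range(len(rows) - L):
        total += 1
        if rows[i][col] in rows[i + L]:
            hits += 1
    return hits / total

# On the 10 rows above, n_later_frequency(rows, 0, 1) returns 4/9.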
I looked into the MapReduce framework but it doesn't seem to cut it: it isn't an efficient fit for this problem, and it requires a great deal of work to convert the core code into a MapReduce-friendly structure.
As you can see in the above example, each analysis is independent of the others. For example, it is possible to analyze Column A separately from Column B, and to perform the 1L analyses separately from 2L, and so on. However, unlike Hadoop, where the data lives on separate machines, in our scenario each server needs access to all of the data to perform its "share" of the analysis.
I looked into possible solutions for this problem and it seems there are very few options: Ray, or building a custom application on top of YARN using Apache Twill. Apache Twill was moved to the Attic in 2020, which means that Ray is the only available option.
Is Ray the best way to tackle this problem, or are there other, better options? Ideally, the solution should automatically handle failover and distribute the processing load intelligently. For example, if we wanted to distribute the load to 20 machines, one way of doing so would be to divide the 999,999 N-Later analyses by 20 and let Machine A analyze 1L-49999L, Machine B 50000L-100000L, and so on. But then the load isn't distributed equally: analyzing 1L takes much longer than 500000L, since the latter involves only about half the rows (for 500000L the first row we analyze is row 500001, so we essentially omit the first 500K rows from the analysis).
It should also not require a great deal of modification to the core code (like MapReduce does).
I'm working with Java.
Thanks
Well, you are right: your scenario and your technology stack are not that well matched. Which raises the question: why not add something more relevant to your current stack? For instance, Redis.
It seems that your common action is mainly counting values, and you want to avoid over-calculation and make the workload more performant (e.g., by properly indexing your data). Given that this is one of the main features of Redis, it sounds logical to use it as the processor.
My suggestion:
Create a hash that uses the numeric value as the key and its count as the value. This way you will be able to run different calculations over those metrics while iterating your data set only once. Afterwards, you just need to pull the data from Redis by different criteria (per calculation or metric).
From this point, it's easy to save your calculated data back to your database and make it ready for direct querying. The overall process may be similar to this (a rough sketch follows the steps):
Scan data from file
Properly index it to redis (using hashmap)
Make desired calculations (over the indexed count)
Save it in your DB (as a digested data-set)
Flush Redis DB
Query your DB (as much as you like)
Follow the docs for both populating and retrieving data.
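As a rough sketch of the indexing step (shown in Python with the redis-py client for brevity; your Java stack could use Jedis, which exposes the same HINCRBY/HEXISTS commands -- the file name and key layout here are just illustrative):

import redis

r = redis.Redis()  # assumes a local Redis instance

# Index the data set once: one hash per row, mapping value -> count.
with open("numbers.txt") as f:  # hypothetical input file
    for row_idx, line in enumerate(f):
        for v in line.split():
            r.hincrby(f"row:{row_idx}", v, 1)

# A "does column value x reappear L rows later?" check is now a
# single hash lookup instead of a scan of the raw file:
def appears_l_later(value, row_idx, L):
    return r.hexists(f"row:{row_idx + L}", str(value))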

Optimal strategy for two player coin games

Two players take turns choosing one of the outer coins. At the end we calculate the difference between the scores the two players get, given that they play optimally.
The greedy strategy of taking the coin with the maximum value often does not lead to the best results in my case.
Now I developed an algorithm:
Sample: {9, 1, 15, 22, 4, 8}
1. We calculate the sum of the coins at even indices and the sum of the coins at odd indices.
2. Compare the two sums: (9+15+4) < (1+22+8), so the sum over odd indices is greater. We then pick the coin at an odd index; in our sample that would be 8.
3. The opponent, who plays optimally, will try to pick the greater coin, e.g. 9.
4. There is always a coin at an odd index left after the opponent has picked, so we keep picking the coins at odd indices; here that would be 1.
5. Looping the above steps, we end up with a difference of (8+1+22) - (9+15+4) = 3.
6. Vice versa if the sum over even indices is greater in step 2.
I have compared the results generated by my algorithm with a 2nd algorithm similar to the one below: https://www.geeksforgeeks.org/optimal-strategy-for-a-game-set-2/?ref=rp
The results agreed, until my test generated a long random array:
[6, 14, 6, 8, 6, 3, 14, 5, 18, 6, 19, 17, 10, 11, 14, 16, 15, 18, 7, 8, 6, 9, 0, 15, 7, 4, 19, 9, 5, 2, 0, 18, 2, 8, 19, 14, 4, 8, 11, 2, 6, 16, 16, 13, 10, 19, 6, 17, 13, 13, 15, 3, 18, 2, 14, 13, 3, 4, 2, 13, 17, 14, 3, 4, 14, 1, 15, 10, 2, 19, 2, 6, 16, 7, 16, 14, 7, 0, 9, 4, 9, 6, 15, 9, 3, 15, 11, 19, 7, 3, 18, 14, 11, 10, 2, 3, 7, 3, 18, 7, 7, 14, 6, 4, 6, 12, 4, 19, 15, 19, 17, 3, 3, 1, 9, 19, 12, 6, 7, 1, 6, 6, 19, 7, 15, 1, 1, 6]
My algorithm generated 26 as the result, while the 2nd algorithm generated 36.
Mine doesn't involve dynamic programming and requires less memory, whereas I also implemented the 2nd one with memoization.
This is confusing, since mine was correct for most test arrays until this one.
Any help would be appreciated!
If the array is of even length, your algorithm tries to produce a guaranteed win. You can prove that quite easily. But it doesn't necessarily produce the optimal win. In particular it won't find strategies where you want some coins that are on even indexes and others on odd indexes.
The following short example illustrates the point.
[10, 1, 1, 20, 1, 1]
Your algorithm will look at evens vs. odds, realize that 10+1+1 < 1+20+1, and take the last element first, guaranteeing a win by 10.
But you want both the 10 and the 20. The optimal strategy is therefore to take the 10, leaving 1, 1, 20, 1, 1. Whichever side the other person takes, you take the other end to get to 1, 20, 1; then whichever side the other takes, you take the 20. This results in you getting 10, 1, 20 and the other person getting 1, 1, 1, guaranteeing a win by 28.
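For reference, here is a minimal Python sketch of the interval DP that the second algorithm is based on; it returns the optimal score difference for the player who moves first:

from functools import lru_cache

def best_difference(coins):
    coins = tuple(coins)

    @lru_cache(maxsize=None)
    def diff(i, j):
        # Best achievable (current player - other player) on coins[i..j].
        if i > j:
            return 0
        return max(coins[i] - diff(i + 1, j),   # take the left coin
                   coins[j] - diff(i, j - 1))   # take the right coin

    return diff(0, len(coins) - 1)

best_difference([10, 1, 1, 20, 1, 1])
# => 28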

Remove n elements from array dynamically and add to another array

nums= [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]
new_array=[]
How do I grab the items divisible by 5, two at a time, and add them to a new array?
This is the desired result:
the new_array should now contain these values
[[5,10],[15,20],[25,30]]
Note: I want to do this without pushing them all into the array and then performing
array.each_slice(2). The process should happen dynamically.
Try this
new_array = nums.select { |x| x % 5 == 0 }.each_slice(2).entries
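# => [[5, 10], [15, 20], [25, 30]]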
No push involved.

How to Generate N random numbers from a SHA-256 Hash

I'm working on a "provably fair" site where, let's say, X participants enter into a drawing and we need to pick 1 overall winner first, but then ideally we also want to pick N sub-winners out of the X total.
(for the curious, the SHA-256 Hash will be the merkle tree root of a Bitcoin block at a pre-specified time)
So, given a SHA-256 hash, how do we generate N random numbers?
I think I know how to generate 1 random number (within Ruby's Fixnum range). According to this article: http://patshaughnessy.net/2014/1/9/how-big-is-a-bignum
The maximum Fixnum integer is: 4611686018427387903
Let's pluck the first Y characters of the SHA-256 hash. For the example, we can generate a hash ourselves instead of relying on a Bitcoin merkle root:
require 'digest'
d = Digest::SHA256.hexdigest('hello')
> "2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824"
Let's take the first 6 characters, or: 2cf24d
Convert this to base 10:
'2cf24d'.to_i(16)
> 2945613
We now have a unique Fixnum based on our merkle root.
With X participants, let's say 17, we decide the winner with:
2945613 % 17
> 6
So, assuming all entrants know their order of entry, entrant number 6 can prove that they should be the winner.
Now -- what would be the best way to similarly pick N sub-winners? Let's say each of these entrants should get a smaller but still somewhat valuable prize.
Why not just use the hash for the seed? The shuffle is deterministic: running it twice with the same seed gives the same ordering.
[*1..17].shuffle(random: Random.new(0x2cf24d))
# => [15, 5, 9, 7, 14, 3, 16, 12, 2, 1, 17, 4, 6, 13, 11, 10, 8]
[*1..17].shuffle(random: Random.new(0x2cf24d))
# => [15, 5, 9, 7, 14, 3, 16, 12, 2, 1, 17, 4, 6, 13, 11, 10, 8]
EDIT: This is dependent on the Ruby implementation though - I believe shuffle differs between JRuby and MRI, even though Random produces the same sequence. You could circumvent this by implementing shuffle yourself. See this question for more details. This workaround works consistently for me in both JRuby and MRI:
r = Random.new(0x2cf24d)
[*1..17].sort_by { r.rand }
# => [14, 11, 4, 10, 1, 3, 9, 13, 16, 17, 12, 5, 8, 2, 6, 7, 15]
r = Random.new(0x2cf24d)
[*1..17].sort_by { r.rand }
# => [14, 11, 4, 10, 1, 3, 9, 13, 16, 17, 12, 5, 8, 2, 6, 7, 15]

Retrieving elements from arrays according to an accumulating parameter

Assume that there are 2 arrays of elements and a function call will return elements from them. Each time a retrieval is performed, 8 elements are retrieved from array 1, while 2 are retrieved from array 2. The elements to be retrieved are indicated by a provided number. Assume that list 1 has 36 elements (0 through 35) and list 2 has 7; the situation will be like:
Assume the 2 arrays are:
array 1: 0, 1, 2, 3, 4, ..., 35
array 2: 0, 1, 2, 3, 4, 5, 6
number provided | elements from array 1          | elements from array 2
1               | 0, 1, 2, 3, 4, 5, 6, 7         | 0, 1
11              | 8, 9, 10, 11, 12, 13, 14, 15   | 2, 3
21              | 16, 17, 18, 19, 20, 21, 22, 23 | 4, 5
31              | 24, 25, 26, 27, 28, 29, 30, 31 | 6
40              | 32, 33, 34, 35                 | 0, 1
46              | 0, 1, 2, 3, 4, 5, 6, 7         | 2, 3
56              | 8, 9, 10, 11, 12, 13, 14, 15   | 4, 5
66              | 16, 17, 18, 19, 20, 21, 22, 23 | 6
75              | 24, 25, 26, 27, 28, 29, 30, 31 | 0, 1
85              | 32, 33, 34, 35                 | 2, 3
...
Each time a retrieval is done, the count of numbers returned is added to the last provided number to become the next provided number. If one of the lists is nearly exhausted (fewer elements remaining than its retrieval size), then only the remaining numbers are retrieved from that list, and the next retrieval starts from its index 0 again, as in the situations when numbers 31 and 40 are passed.
The question is: is there any way to determine the starting position in both arrays when a number is provided? E.g. when number 40 is given, I should start at 32 in list 1 and at 0 in list 2. In the above situation, list 1 is exhausted every 5th retrieval, while list 2 is exhausted every 4th retrieval, but since the provided number is based on the accumulated count of numbers retrieved, how can I determine where to start when a number is given?
I have been thinking about this for days and really feel frustrated about it. Thanks for any help!
There is a cycle, and one cycle will contain total_num numbers. We can get total_num from the code below:
from math import ceil, gcd

def get_one_cycle_numbers(a, b):
    n = ceil(len(a) / 8)  # array 1 is exhausted every n-th retrieval
    m = ceil(len(b) / 2)  # array 2 is exhausted every m-th retrieval
    g = gcd(n, m)
    # one cycle is lcm(n, m) = n * m / g retrievals, during which
    # array 1 is consumed m/g times and array 2 is consumed n/g times
    total_num = len(a) * (m // g) + len(b) * (n // g)
    return total_num
When we get the provided number num, we just take num = (num - 1) % total_num and simulate one cycle from the start.
PS: I hope I understood the question correctly.
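And a sketch of that simulation in Python (the default lengths and chunk sizes follow the example above; the function name is mine):

from math import ceil, gcd

def find_start_positions(num, len_a=36, len_b=7, step_a=8, step_b=2):
    n = ceil(len_a / step_a)  # retrievals to exhaust array 1
    m = ceil(len_b / step_b)  # retrievals to exhaust array 2
    g = gcd(n, m)
    total_num = len_a * (m // g) + len_b * (n // g)  # numbers per full cycle
    offset = (num - 1) % total_num  # position inside one cycle
    pos_a = pos_b = 0
    while offset > 0:
        take_a = min(step_a, len_a - pos_a)  # short chunk when nearly exhausted
        take_b = min(step_b, len_b - pos_b)
        offset -= take_a + take_b
        pos_a = (pos_a + take_a) % len_a
        pos_b = (pos_b + take_b) % len_b
    return pos_a, pos_b

find_start_positions(40)
# => (32, 0)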
