Tackling the 'Small Data' Problem with a Distributed Computing Cluster? - hadoop

I'm learning about Hadoop, MapReduce, and Big Data, and from my understanding the Hadoop ecosystem was designed mainly to analyze large amounts of data distributed across many servers. My problem is a bit different.
I have a relatively small amount of data (a file consisting of 1-10 million lines of numbers) which needs to be analyzed in millions of different ways. For example, consider the following dataset:
[1, 6, 7, 8, 10, 17, 19, 23, 27, 28, 28, 29, 29, 29, 29, 30, 32, 35, 36, 38]
[1, 3, 3, 4, 4, 5, 5, 10, 11, 12, 14, 16, 17, 18, 18, 20, 27, 28, 39, 40]
[2, 3, 7, 8, 10, 10, 12, 13, 14, 15, 15, 16, 17, 19, 27, 30, 32, 33, 34, 40]
[1, 9, 11, 13, 14, 15, 17, 17, 18, 18, 18, 19, 19, 23, 25, 26, 27, 31, 37, 39]
[5, 8, 8, 10, 14, 16, 16, 17, 20, 21, 22, 22, 23, 28, 29, 30, 32, 32, 33, 38]
[1, 1, 3, 3, 13, 17, 21, 24, 24, 25, 26, 26, 30, 31, 32, 35, 38, 39, 39, 39]
[1, 2, 4, 4, 5, 5, 10, 13, 14, 14, 14, 14, 15, 17, 28, 29, 29, 35, 37, 40]
[1, 2, 6, 8, 12, 13, 14, 15, 15, 15, 16, 22, 23, 24, 26, 30, 31, 36, 36, 40]
[3, 6, 7, 8, 8, 10, 10, 12, 13, 17, 17, 20, 21, 22, 33, 35, 35, 36, 39, 40]
[1, 3, 8, 8, 11, 11, 13, 18, 19, 19, 19, 23, 24, 25, 27, 33, 35, 37, 38, 40]
I need to analyze how frequently a number in each column (Column N) repeats itself a certain number of rows later (L rows later). For example, if we were analyzing Column A with 1L (1-Row-Later), the result would be as follows:
Note: The position does not need to match, so the number can appear anywhere in the next row.
Column: A N-Later: 1 Result: YES, NO, NO, NO, NO, YES, YES, NO, YES -> 4/9.
We would repeat the above analysis for each column separately and for the maximum number of N-Later steps. In the above dataset, which consists of only 10 lines, that means a maximum of 9 N-Later. But in a dataset of 1 million lines, the analysis (for each column) would be repeated 999,999 times.
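For concreteness, a single one of these analyses might look like the following minimal Java sketch (my illustration; the int[][] representation and method name are assumptions, not the asker's code):
public class NLater {
    // Fraction of rows whose value in column `col` appears anywhere L rows later.
    // For the 10-line dataset above, analyze(data, 0, 1) returns 4/9.
    static double analyze(int[][] rows, int col, int l) {
        int hits = 0, total = 0;
        for (int i = 0; i + l < rows.length; i++) {
            int value = rows[i][col];
            boolean found = false;
            for (int x : rows[i + l]) {
                if (x == value) { found = true; break; }
            }
            if (found) hits++;
            total++;
        }
        return (double) hits / total;
    }
}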
I looked into the MapReduce framework, but it doesn't seem like an efficient solution for this problem, and it would require a great deal of work to convert the core code into a MapReduce-friendly structure.
As you can see in the above example, each analysis is independent of the others. For example, it is possible to analyze Column A separately from Column B. It is also possible to perform the 1L analysis separately from 2L, and so on. However, unlike Hadoop, where the data lives on separate machines, in our scenario each server needs access to all of the data to perform its "share" of the analysis.
I looked into possible solutions for this problem, and it seems there are very few options: Ray, or building a custom application on top of YARN using Apache Twill. Apache Twill was moved to the Attic in 2020, which means that Ray is the only available option.
Is Ray the best way to tackle this problem, or are there other, better options? Ideally, the solution should automatically handle failover and distribute the processing load intelligently. For example, if we wanted to distribute the load to 20 machines, one way would be to divide the 999,999 N-Later analyses by 20 and let Machine A analyze 1L-50000L, Machine B 50001L-100000L, and so on. However, when you think about it, the load isn't being distributed equally, as it takes much longer to analyze 1L than 500000L: the latter covers only about half the number of rows (for 500000L the first row we analyze is row 500001, so we essentially omit the first 500K rows from the analysis).
It should also not require a great deal of modification to the core code (like MapReduce does).
I'm working with Java.
Thanks

Well, you are right: your scenario and your technological stack are not that well suited to each other. Which raises the question: why not add something more relevant to your current stack? For instance, Redis.
It seems your main operation is counting values, and you want to avoid recomputation and make the process more performant (e.g., by properly indexing your data). Since this is one of the main features of Redis, it sounds logical to use it as a processor.
My suggestion:
Create a hashmap that uses the numeric value as key and its count as value. This way you will be able to derive different calculations over those metrics while iterating over your dataset only once. Afterwards, you just pull the data from Redis by different criteria (per calculation or metric).
From this point it's easy to save your calculated data back to your database and make it ready for direct querying. The overall process may look like this (see the sketch after the list):
Scan data from file
Properly index it into Redis (using the hashmap)
Make desired calculations (over the indexed count)
Save it in your DB (as a digested data-set)
Flush Redis DB
Query your DB (as much as you like)
Follow the docs for both populating and retrieving data
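A minimal sketch of the first few steps using the Jedis client (the file name, key naming, and comma-separated input format are assumptions for illustration):
import java.io.BufferedReader;
import java.io.FileReader;
import redis.clients.jedis.Jedis;

public class RedisCountIndexer {
    public static void main(String[] args) throws Exception {
        // Hypothetical file name and Redis location; adjust to your setup.
        try (Jedis jedis = new Jedis("localhost", 6379);
             BufferedReader reader = new BufferedReader(new FileReader("data.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] values = line.split(",");
                for (int col = 0; col < values.length; col++) {
                    // One Redis hash per column: field = numeric value, value = its count.
                    jedis.hincrBy("col:" + col, values[col].trim(), 1);
                }
            }
            // Pull the digested counts for a column, e.g. column 0.
            System.out.println(jedis.hgetAll("col:0"));
        }
    }
}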

Related

Exact orthogonalization of vectors in Wolfram

What I have is a matrix; I need to orthogonalize its eigenvectors.
That is basically all I need, but in exact form.
So here is my Wolfram input:
(orthogonolize(eigenvectors({{146, 112, 78, 17, 122}, {112, 86, 60, 13, 94}, {78, 60, 42 , 9, 66}, {17, 13, 9, 2, 14}, {122, 94, 66, 14, 104}})))
That gives me float numbers, while I need the exact forms.
Any ways to fix this?
Wolfram Mathematica (not WolframAlpha, which is a completely different product with different rules and different results), given this
FullSimplify[Orthogonalize[Eigenvectors[{
{146, 112, 78, 17, 122}, {112, 86, 60, 13, 94}, {78, 60, 42 , 9, 66},
{17, 13, 9, 2, 14}, {122, 94, 66, 14, 104}}]]]
returns this exact form
{{Sqrt[121/342 + 52/(9*Sqrt[35587])], Sqrt[5/38 + 18/Sqrt[35587]],
Sqrt[25/342 + 64/(9*Sqrt[35587])], Sqrt[7/38 - 26/Sqrt[35587]]/3,
2*Sqrt[2/19 - 7/Sqrt[35587]]},
{-1/3*Sqrt[121/38 - 52/Sqrt[35587]], -Sqrt[5/38 - 18/Sqrt[35587]],
Sqrt[25/38 - 64/Sqrt[35587]]/3, -1/3*Sqrt[7/38 + 26/Sqrt[35587]],
Sqrt[8/19 + 28/Sqrt[35587]]},
{3/Sqrt[35], -Sqrt[5/7], 0, 0, 1/Sqrt[35]},
{-11/Sqrt[5110], -Sqrt[5/1022], 0, Sqrt[70/73], 4*Sqrt[2/2555]},
{-17/(3*Sqrt[2774]), -7/Sqrt[2774], Sqrt[146/19]/3, Sqrt[2/1387]/3, -9*Sqrt[2/1387]}}
Think of at least two different ways you can check that for correctness before you depend on it.
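For instance, here is a sketch of two such numeric checks (my illustration, with a as the original matrix and m as the orthogonalized result; both expressions should come out as zero matrices/vectors):
a = {{146, 112, 78, 17, 122}, {112, 86, 60, 13, 94},
     {78, 60, 42, 9, 66}, {17, 13, 9, 2, 14},
     {122, 94, 66, 14, 104}};
m = FullSimplify[Orthogonalize[Eigenvectors[a]]];

(* Check 1: the rows should be orthonormal. *)
Chop[N[m . Transpose[m], 30] - IdentityMatrix[5]]

(* Check 2: since a is symmetric, each row should still be an
   eigenvector of a, with eigenvalue given by its Rayleigh quotient. *)
Chop[N[a . # - (# . a . #) #, 30]] & /@ m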
The last three of those can be simplified somewhat
1/Sqrt[35]*{3,-5,0,0,1},
1/Sqrt[5110]*{-11,-5,0,70,8},
1/(3*Sqrt[2774])*{-17,-21,146,2,-54}
but I cannot yet see a way to simplify the first two to a third of their current size. Can anyone else see a way to do that? Please check these results very carefully.

Optimal strategy for two player coin games

Two players take turns choosing one of the outer coins. At the end we calculate the difference between the scores the two players get, given that they play optimally.
The greedy strategy of always taking the higher-valued outer coin often does not lead to the best result in my case.
So I developed an algorithm (sketched in code after the steps):
Sample: {9, 1, 15, 22, 4, 8}
1. We calculate the sum of the coins at even indexes and the sum of the coins at odd indexes.
2. We compare the two sums: (9+15+4) < (1+22+8), so the odd-index sum is greater. We then pick the outer coin with an odd index; in our sample that would be 8.
3. The opponent, who plays optimally, will try to pick the greater coin, e.g. 9.
4. There is always a coin at an odd index after the opponent finishes, so we keep picking the coins at odd indexes; that would be 1.
5. Looping the above steps, we get a difference of (8+1+22) - (9+15+4) = 3.
6. Vice versa if the even-index sum is greater in step 2.
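A minimal Java sketch of that strategy, assuming an even-length array (my illustration, not the asker's actual code):
public class ParityStrategy {
    // Guaranteed difference: the first player can always claim every coin of one
    // parity class, so the better class's surplus is a guaranteed win margin.
    static int guaranteedDifference(int[] coins) {
        int even = 0, odd = 0;
        for (int i = 0; i < coins.length; i++) {
            if (i % 2 == 0) even += coins[i];
            else odd += coins[i];
        }
        return Math.abs(even - odd);
    }

    public static void main(String[] args) {
        // The sample above: evens sum to 28, odds to 31, so the difference is 3.
        System.out.println(guaranteedDifference(new int[]{9, 1, 15, 22, 4, 8}));
    }
}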
I have compared the results generated by my algorithm with a second algorithm, similar to the one here: https://www.geeksforgeeks.org/optimal-strategy-for-a-game-set-2/?ref=rp
The results agreed, until my test generated a long random array:
[6, 14, 6, 8, 6, 3, 14, 5, 18, 6, 19, 17, 10, 11, 14, 16, 15, 18, 7, 8, 6, 9, 0, 15, 7, 4, 19, 9, 5, 2, 0, 18, 2, 8, 19, 14, 4, 8, 11, 2, 6, 16, 16, 13, 10, 19, 6, 17, 13, 13, 15, 3, 18, 2, 14, 13, 3, 4, 2, 13, 17, 14, 3, 4, 14, 1, 15, 10, 2, 19, 2, 6, 16, 7, 16, 14, 7, 0, 9, 4, 9, 6, 15, 9, 3, 15, 11, 19, 7, 3, 18, 14, 11, 10, 2, 3, 7, 3, 18, 7, 7, 14, 6, 4, 6, 12, 4, 19, 15, 19, 17, 3, 3, 1, 9, 19, 12, 6, 7, 1, 6, 6, 19, 7, 15, 1, 1, 6]
My algorithm produced 26 as the result, while the second algorithm produced 36.
Mine doesn't involve dynamic programming and requires less memory, whereas I implemented the second one with memoization.
This is confusing, since mine matched in most array cases until this one.
Any help would be appreciated!
If the array is of even length, your algorithm tries to produce a guaranteed win. You can prove that quite easily. But it doesn't necessarily produce the optimal win. In particular it won't find strategies where you want some coins that are on even indexes and others on odd indexes.
The following short example illustrates the point.
[10, 1, 1, 20, 1, 1]
Your algorithm will look at evens vs. odds, see that 10+1+1 < 1+20+1, and take the last element first, guaranteeing a win by 10.
But you want both the 10 and the 20. The optimal strategy is therefore to take the 10, leaving 1, 1, 20, 1, 1. Whichever side the other person takes, you take the other, getting to 1, 20, 1. Then whichever side the other takes, the 20 becomes an outer coin and you take it. You end up with 10, 1, 20 and the other person with 1, 1, 1, guaranteeing a win by 28.
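For comparison, a minimal Java sketch of the standard interval DP (the kind of algorithm the linked second reference uses; my illustration): diff[i][j] is the best score difference the player to move can force on coins i..j.
public class OptimalCoinGame {
    static int optimalDifference(int[] coins) {
        int n = coins.length;
        int[][] diff = new int[n][n];
        for (int i = 0; i < n; i++) diff[i][i] = coins[i]; // one coin left: take it
        for (int len = 2; len <= n; len++) {
            for (int i = 0; i + len - 1 < n; i++) {
                int j = i + len - 1;
                // Take the left or the right coin; the opponent's optimal
                // difference on the remaining interval is subtracted from ours.
                diff[i][j] = Math.max(coins[i] - diff[i + 1][j],
                                      coins[j] - diff[i][j - 1]);
            }
        }
        return diff[0][n - 1];
    }

    public static void main(String[] args) {
        // The example above: optimal play wins by 28, not 10.
        System.out.println(optimalDifference(new int[]{10, 1, 1, 20, 1, 1}));
    }
}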

How to post a "poster" to VKontakte social network wall

I have a script that can post text or an image to VK.com over the API, but I can't find a way to create a poster:
There is no information about posters on the official documentation page.
You have to add the parameter poster_bkg_id = {background_number_for_poster}. This parameter is not described in the documentation (credit to the author of https://github.com/VBIralo/vk-posters).
Backgrounds (poster_bkg_id)
Gradients
1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Artwork
11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 31, 32
Emoji Patterns
21, 22, 23, 24, 25, 26, 27, 28, 29, 30
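For illustration, a minimal Java sketch of a wall.post call carrying that parameter (wall.post is VK's standard wall-posting method; the token, owner id, and API version here are placeholders):
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class VkPosterPost {
    public static void main(String[] args) throws Exception {
        // Placeholders: substitute your own owner id, access token and API version.
        String params = "owner_id=12345"
                + "&message=" + URLEncoder.encode("Hello, world", StandardCharsets.UTF_8)
                + "&poster_bkg_id=3" // undocumented poster background, per the answer above
                + "&access_token=YOUR_ACCESS_TOKEN"
                + "&v=5.131";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://api.vk.com/method/wall.post"))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(params))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}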

Remove n elements from array dynamically and add to another array

nums= [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]
new_array=[]
How do I grab every two items divisible by 5 and add them to a new array?
This is the desired result:
the new_array should now contain these values
[[5,10],[15,20],[25,30]]
Note: I want to do this without pushing them all into the array and then performing
array.each_slice(2). The process should happen dynamically.
Try this
new_array = nums.select { |x| x % 5 == 0 }.each_slice(2).entries
No push involved.

Computing a moving maximum [duplicate]

Possible Duplicate:
Find the min number in all contiguous subarrays of size l of a array of size n
I have a (large) array of numeric data (size N) and would like to compute an array of running maximums with a fixed window size w.
More directly, I can define a new array out[k-w+1] = max{data[k-w+1,...,k]} for k >= w-1 (this assumes 0-based arrays, as in C++).
Is there a better way to do this than O(N log w)?
[I'm hoping there is a linear one in N without dependence on w, as for the moving average, but I cannot find it. For O(N log w) I think one can manage with a sorted data structure that supports insert(), delete(), and extract_max() in O(log w) or less on a structure of size w, like a sorted binary tree, for example.]
Thank you very much.
There is indeed an algorithm that can do this in O(N) time with no dependence on the window size w. The idea is to use a clever data structure that supports the following operations:
Enqueue, which adds a new element to the structure,
Dequeue, which removes the oldest element from the structure, and
Find-max, which returns (but does not remove) the maximum element from the structure.
This is essentially a queue data structure that supports access (but not removal) of the maximum element. Amazingly, as seen in this earlier question, it is possible to implement this data structure such that each of these operations runs in amortized O(1) time. As a result, if you use this structure to enqueue w elements, then continuously dequeue and enqueue another element into the structure while calling find-max as needed, it will take only O(n + Q) time, where Q is the number of queries you make. If you only care about the maximum of each window once, this ends up being O(n), with no dependence on the window size.
Hope this helps!
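For illustration, a minimal Java sketch of this approach using a deque of indices kept in decreasing value order, which is a common way to get the amortized O(1) behavior (names are my own):
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

public class SlidingWindowMax {
    // Returns out[i] = max(data[i..i+w-1]) for each window of size w.
    static int[] movingMax(int[] data, int w) {
        int n = data.length;
        int[] out = new int[n - w + 1];
        Deque<Integer> deque = new ArrayDeque<>(); // indices, values decreasing
        for (int i = 0; i < n; i++) {
            // Drop indices that have fallen out of the window.
            while (!deque.isEmpty() && deque.peekFirst() <= i - w) deque.pollFirst();
            // Drop smaller elements: they can never be a future maximum.
            while (!deque.isEmpty() && data[deque.peekLast()] <= data[i]) deque.pollLast();
            deque.addLast(i);
            if (i >= w - 1) out[i - w + 1] = data[deque.peekFirst()];
        }
        return out;
    }

    public static void main(String[] args) {
        int[] data = {21, 17, 16, 7, 3, 9, 11, 18};
        System.out.println(Arrays.toString(movingMax(data, 4))); // [21, 17, 16, 11, 18]
    }
}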
I'll demonstrate how to do it with the list:
L = [21, 17, 16, 7, 3, 9, 11, 18, 19, 5, 10, 23, 20, 15, 4, 14, 1, 2, 22, 13, 8, 12, 6]
with length N=23 and W = 4.
Make two new copies of your list:
L1 = [21, 17, 16, 7, 3, 9, 11, 18, 19, 5, 10, 23, 20, 15, 4, 14, 1, 2, 22, 13, 8, 12, 6]
L2 = [21, 17, 16, 7, 3, 9, 11, 18, 19, 5, 10, 23, 20, 15, 4, 14, 1, 2, 22, 13, 8, 12, 6]
Loop from i=0 to N-1. If i is not divisible by W, then replace L1[i] with max(L1[i],L1[i-1]).
L1 = [21, 21, 21, 21, | 3, 9, 11, 18, | 19, 19, 19, 23 | 20, 20, 20, 20 | 1, 2, 22, 22 | 8, 12, 12]
Loop from i=N-2 down to 0. If i+1 is not divisible by W, then replace L2[i] with max(L2[i], L2[i+1]).
L2 = [21, 17, 16, 7 | 18, 18, 18, 18 | 23, 23, 23, 23 | 20, 15, 14, 14 | 22, 22, 22, 13 | 12, 12, 6]
Make a list L3 of length N + 1 - W, so that L3[i] = max(L2[i], L1[i + W - 1])
L3 = [21, 17, 16, 11 | 18, 19, 19, 19 | 23, 23, 23, 23 | 20, 15, 14, 22 | 22, 22, 22, 13]
Then this list L3 contains the moving maxima you seek: L2[i] is the maximum of the range between i and the next vertical line, while L1[i + W - 1] is the maximum of the range between the vertical line and i + W - 1.
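A minimal Java sketch of this two-pass block method (my illustration):
import java.util.Arrays;

public class BlockMovingMax {
    // Returns out[i] = max(data[i..i+w-1]) using the two-pass block method above.
    static int[] movingMax(int[] data, int w) {
        int n = data.length;
        int[] l1 = data.clone(); // prefix maxima within each block of width w
        int[] l2 = data.clone(); // suffix maxima within each block of width w
        for (int i = 1; i < n; i++) {
            if (i % w != 0) l1[i] = Math.max(l1[i], l1[i - 1]);
        }
        for (int i = n - 2; i >= 0; i--) {
            if ((i + 1) % w != 0) l2[i] = Math.max(l2[i], l2[i + 1]);
        }
        int[] out = new int[n + 1 - w];
        for (int i = 0; i + w - 1 < n; i++) {
            out[i] = Math.max(l2[i], l1[i + w - 1]);
        }
        return out;
    }

    public static void main(String[] args) {
        int[] L = {21, 17, 16, 7, 3, 9, 11, 18, 19, 5, 10, 23, 20, 15, 4, 14,
                   1, 2, 22, 13, 8, 12, 6};
        // Prints the L3 list above: [21, 17, 16, 11, 18, 19, 19, 19, 23, ...]
        System.out.println(Arrays.toString(movingMax(L, 4)));
    }
}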
