How to execute this sorting process in pyspark?

How to execute this sorting process in pyspark? - sorting

I have tried map, mapValues and sort, but nothing works.
The question is described as follows:
"by the similarity (the second one in the value), if same, choose the user that has smallest ID (the first one in the value)."
And the list of Key-Value pair is :
[
(18, [(2, 0.5)]),
(30, [(19, 0.5), (6, 0.25)]),
(6, [(30, 0.25), (20, 0.2), (19, 0.2)]),
(19, [(30, 0.5), (8, 0.2), (6, 0.2)]),
(2, [(18, 0.5)]),
(26, [(9, 0.2)]),
(9, [(26, 0.2)])
]
I want to get:
[
(18, [(2, 0.5)]),
(30, [(19, 0.5), (6, 0.25)]),
(6, [(30, 0.25), (19, 0.2)]),
(19, [(30, 0.5), (6, 0.2)]),
(2, [(18, 0.5)]),
(26, [(9, 0.2)]),
(9, [(26, 0.2)])
]
Thank you a lot!

Pretty much straight forward. Just need to figure out the necessary transformations:
data = [(18, [(2, 0.5)]),
(30, [(19, 0.5), (6, 0.25)]),
(6, [(30, 0.25), (20, 0.2), (19, 0.2)]),
(19, [(30, 0.5), (8, 0.2), (6, 0.2)]),
(2, [(18, 0.5)]),
(26, [(9, 0.2)]),
(9, [(26, 0.2)])]
rdd1 = sc.parallelize(data)
rdd2 = rdd1.flatMapValues(lambda x:x)
rdd3 = rdd2.map(lambda x: ((x[0], x[1][1]),x[1][0]))
rdd4 = rdd3.reduceByKey(min)
rdd5 = rdd4.map(lambda x: (x[0][0], (x[1], x[0][1])))
rdd6 = rdd5.reduceByKey(lambda x,y: [x,y])
rdd6.collect()
[(9, (26, 0.2)),
(26, (9, 0.2)),
(18, (2, 0.5)),
(30, [(6, 0.25), (19, 0.5)]),
(2, (18, 0.5)),
(6, [(30, 0.25), (19, 0.2)]),
(19, [(30, 0.5), (6, 0.2)])]

Related

Assigning color to nodes in a graph such that no two neighbor nodes to it has same color

If you see the above graph, no nodes next to each other has the same color. I created a grid graph with diagonal edges across nodes using networkx python and applied greedy color to it.
greed = nx.coloring.greedy_color(G)
print(greed)
which gives the output
{(1, 1): 0, (1, 2): 1, (1, 3): 0, (1, 4): 1, (1, 5): 0, (1, 6): 1, (1, 7): 0, (1, 8): 1, (2, 1): 2, (2, 2): 3, (2, 3): 2, (2, 4): 3, (2, 5): 2, (2, 6): 3, (2, 7): 2, (2, 8): 3, (3, 1): 0, (3, 2): 1, (3, 3): 0, (3, 4): 1, (3, 5): 0, (3, 6): 1, (3, 7): 0, (3, 8): 1, (4, 1): 2, (4, 2): 3, (4, 3): 2, (4, 4): 3, (4, 5): 2, (4, 6): 3, (4, 7): 2, (4, 8): 3, (5, 1): 0, (5, 2): 1, (5, 3): 0, (5, 4): 1, (5, 5): 0, (5, 6): 1, (5, 7): 0, (5, 8): 1, (6, 1): 2, (6, 2): 3, (6, 3): 2, (6, 4): 3, (6, 5): 2, (6, 6): 3, (6, 7): 2, (6, 8): 3, (7, 1): 0, (7, 2): 1, (7, 3): 0, (7, 4): 1, (7, 5): 0, (7, 6): 1, (7, 7): 0, (7, 8): 1, (8, 1): 2, (8, 2): 3, (8, 3): 2, (8, 4): 3, (8, 5): 2, (8, 6): 3, (8, 7): 2, (8, 8): 3, (0, 1): 2, (0, 2): 3, (0, 3): 2, (0, 4): 3, (0, 5): 2, (0, 6): 3, (0, 7): 2, (0, 8): 3, (1, 0): 1, (1, 9): 0, (2, 0): 3, (2, 9): 2, (3, 0): 1, (3, 9): 0, (4, 0): 3, (4, 9): 2, (5, 0): 1, (5, 9): 0, (6, 0): 3, (6, 9): 2, (7, 0): 1, (7, 9): 0, (8, 0): 3, (8, 9): 2, (9, 1): 0, (9, 2): 1, (9, 3): 0, (9, 4): 1, (9, 5): 0, (9, 6): 1, (9, 7): 0, (9, 8): 1, (0, 0): 3, (0, 9): 2, (9, 0): 1, (9, 9): 0}
after sorting
{(0, 0): 3, (0, 1): 2, (0, 2): 3, (0, 3): 2, (0, 4): 3, (0, 5): 2, (0, 6): 3, (0, 7): 2, (0, 8): 3, (0, 9): 2, (1, 0): 1, (1, 1): 0, (1, 2): 1, (1, 3): 0, (1, 4): 1, (1, 5): 0, (1, 6): 1, (1, 7): 0, (1, 8): 1, (1, 9): 0, (2, 0): 3, (2, 1): 2, (2, 2): 3, (2, 3): 2, (2, 4): 3, (2, 5): 2, (2, 6): 3, (2, 7): 2, (2, 8): 3, (2, 9): 2, (3, 0): 1, (3, 1): 0, (3, 2): 1, (3, 3): 0, (3, 4): 1, (3, 5): 0, (3, 6): 1, (3, 7): 0, (3, 8): 1, (3, 9): 0, (4, 0): 3, (4, 1): 2, (4, 2): 3, (4, 3): 2, (4, 4): 3, (4, 5): 2, (4, 6): 3, (4, 7): 2, (4, 8): 3, (4, 9): 2, (5, 0): 1, (5, 1): 0, (5, 2): 1, (5, 3): 0, (5, 4): 1, (5, 5): 0, (5, 6): 1, (5, 7): 0, (5, 8): 1, (5, 9): 0, (6, 0): 3, (6, 1): 2, (6, 2): 3, (6, 3): 2, (6, 4): 3, (6, 5): 2, (6, 6): 3, (6, 7): 2, (6, 8): 3, (6, 9): 2, (7, 0): 1, (7, 1): 0, (7, 2): 1, (7, 3): 0, (7, 4): 1, (7, 5): 0, (7, 6): 1, (7, 7): 0, (7, 8): 1, (7, 9): 0, (8, 0): 3, (8, 1): 2, (8, 2): 3, (8, 3): 2, (8, 4): 3, (8, 5): 2, (8, 6): 3, (8, 7): 2, (8, 8): 3, (8, 9): 2, (9, 0): 1, (9, 1): 0, (9, 2): 1, (9, 3): 0, (9, 4): 1, (9, 5): 0, (9, 6): 1, (9, 7): 0, (9, 8): 1, (9, 9): 0}
But I want it to be in such a way that no two adjacent/neighbor nodes to a node should have the same color
In the above figure, (1,4) [green] has its neighbors (1,3) [red] and (1,5) [red]. In this case both nodes next to node (1,4) are red. But I want (1,3) and (1,5) in different colors. Can anyone tell me how to solve this problem?
I tried greedy color method from networkx to color in such a way that no two nodes adjacent to each other have the same color.

The problem is that you have an additional constraint that the coloring algorithm does not respect. You have two choice : change the algorithm to respect the constraint (hard), change the data (the graph) so that the constraints are integrated in it.
The second option is really easy to do here. All we have to do is add edges between nodes that should not be the same color (that is, nodes that share a common neighbor), color the graph.
Create a deep copy G2 of the graph G. As we will modify the graph to match the new constraints, we have to keep the original intact.
For every pair of nodes n_1, n_2 in G :
If they are adjacent, nothing to do.
If they share a common neighbor in G, add an edge (n_1, n_2) in G2
Color G2
For every node in G, set it's color to the color of the corresponding node in G2

Have you tried the Graph Coloring algorithm?
Step 1 − Arrange the vertices of the graph in some order.
Step 2 − Choose the first vertex and color it with the first color.
Step 3 − Choose the next vertex and color it with the lowest numbered color that has not been colored on any vertices adjacent to it. If all the adjacent vertices are colored with this color, assign a new color to it. Repeat this step until all the vertices are colored.
credits : https://www.tutorialspoint.com/the-graph-coloring

Sort by key then value which will then be grouped up...pyspark

So I'm trying to sort data in this format...
[((0, 4), 3), ((4, 0), 3), ((1, 6), 1), ((3, 2), 3), ((0, 5), 1)...
Ascending by key and then descending by value. I'm able to achieve this via...
test = test.sortBy(lambda x: (x[0], -x[1]))
which would give me based on shortened version above...
[((0, 4), 3), ((0, 5), 1), ((1, 6), 1), ((3, 2), 3), ((4, 0), 3)...
The problem I'm having is that after the sorting I no longer want the value but do need to retain the sort after grouping the data. So...
test = test.map(lambda x: (x[0][0],x[0][1]))
Gives me...
[(0, 4), (0, 5), (1, 6), (3, 2), (4, 0)...
Which is still in the order I need it but I need the elements to be grouped up by key. I then use this command...
test = test.groupByKey().map(lambda x: (x[0], list(x[1])))
But in the process I lose the sorting. Is there any way retain?

I managed to retain the order by changing the format of the tuple...
test = test.map(lambda x: (x[0][0],(x[0][1],x[1]))
test = test.groupByKey().map(lambda x: (x[0], sorted(list(x[1]), key=lambda x: (x[0],-x[1]))))
[(0, [(4, 3), (5, 1)] ...
which leaves me with the value (2nd element in the tuple) that I want to get rid of but took care of that too...
test = test.map(lambda x: (x[0], [e[0] for e in x[1]]))
Feels a bit hacky but not sure how else it could be done.

How to correlate all combination of arrays in an RDD?

I have an RDD from model.productFeatures() which returns an RDD in the form of (id, array("d", (...))). For example:
(1, array("d", (0, 1, 2)))
(2, array("d", (4, 3, 2)))
(3, array("d", (5, 3, 0)))
...
I would like to calculate the pairwise correlation between each array, then return for each id another id whose array has the highest correlation.

The first thing you need is to get all pairs of elements, except the "diagonal" where they're the same.
>>> rdd.cartesian(rdd).filter(lambda (x, y): x != y).collect()
[((1, array('d', [0.0, 1.0, 2.0])), (2, array('d', [4.0, 3.0, 2.0]))),
((1, array('d', [0.0, 1.0, 2.0])), (3, array('d', [5.0, 3.0, 0.0]))),
((2, array('d', [4.0, 3.0, 2.0])), (1, array('d', [0.0, 1.0, 2.0]))),
((3, array('d', [5.0, 3.0, 0.0])), (1, array('d', [0.0, 1.0, 2.0]))),
((2, array('d', [4.0, 3.0, 2.0])), (3, array('d', [5.0, 3.0, 0.0]))),
((3, array('d', [5.0, 3.0, 0.0])), (2, array('d', [4.0, 3.0, 2.0])))]
Then a function to calculate the correlation and rearrange to prepare for the last step. Let's assume by "correlation" you mean what is done by numpy.correlate.
def corr_pair(((id1, a1), (id2, a2))):
return id1, (id2, np.correlate(a1, a2)[0])
>>> rdd.cartesian(rdd).filter(lambda (p1, p2): p1 != p2).map(corr_pair).collect()
[(1, (2, 7.0)), (1, (3, 3.0)), (2, (1, 7.0)), (3, (1, 3.0)), (2, (3, 29.0)), (3, (2, 29.0))]
To get the 2nd ID with the maximum correlation with each 1st ID, you can use reduceByKey and always keep the bigger one:
def keep_higher((id1, c1), (id2, c2)):
if c1 > c2:
return id1, c1
else:
return id2, c2
>>> rdd.cartesian(rdd).filter(lambda (x, y): x != y).map(corr_pair).reduceByKey(keep_higher).collect()
[(1, (2, 7.0)), (2, (3, 29.0)), (3, (2, 29.0))]

I have a list of numbers, how to generate all unique k-partitions?

So if I had the numbers [1,2,2,3] and I want k=2 partitions I'd have [1][2,2,3], [1,2][2,3], [2,2][1,3], [2][1,2,3], [3][1,2,2], etc.

See an answer in Python at Code Review.

user3569's solution at Code Review produces five 2-tuples for the test case below, instead of exclusively 3-tuples. However, removing the frozenset() call for the returned tuples leads to the code returning exclusively 3-tuples. The revised code is as follows:
from itertools import chain, combinations
def subsets(arr):
""" Note this only returns non empty subsets of arr"""
return chain(*[combinations(arr,i + 1) for i,a in enumerate(arr)])
def k_subset(arr, k):
s_arr = sorted(arr)
return set([i for i in combinations(subsets(arr),k)
if sorted(chain(*i)) == s_arr])
s = k_subset([2,2,2,2,3,3,5],3)
for ss in sorted(s):
print(len(ss)," - ",ss)
As user3569 says "it runs pretty slow, but is fairly concise".
(EDIT: see below for Knuth's solution)
The output is:
3 - ((2,), (2,), (2, 2, 3, 3, 5))
3 - ((2,), (2, 2), (2, 3, 3, 5))
3 - ((2,), (2, 2, 2), (3, 3, 5))
3 - ((2,), (2, 2, 3), (2, 3, 5))
3 - ((2,), (2, 2, 5), (2, 3, 3))
3 - ((2,), (2, 3), (2, 2, 3, 5))
3 - ((2,), (2, 3, 3), (2, 2, 5))
3 - ((2,), (2, 3, 5), (2, 2, 3))
3 - ((2,), (2, 5), (2, 2, 3, 3))
3 - ((2,), (3,), (2, 2, 2, 3, 5))
3 - ((2,), (3, 3), (2, 2, 2, 5))
3 - ((2,), (3, 5), (2, 2, 2, 3))
3 - ((2,), (5,), (2, 2, 2, 3, 3))
3 - ((2, 2), (2, 2), (3, 3, 5))
3 - ((2, 2), (2, 3), (2, 3, 5))
3 - ((2, 2), (2, 5), (2, 3, 3))
3 - ((2, 2), (3, 3), (2, 2, 5))
3 - ((2, 2), (3, 5), (2, 2, 3))
3 - ((2, 3), (2, 2), (2, 3, 5))
3 - ((2, 3), (2, 3), (2, 2, 5))
3 - ((2, 3), (2, 5), (2, 2, 3))
3 - ((2, 3), (3, 5), (2, 2, 2))
3 - ((2, 5), (2, 2), (2, 3, 3))
3 - ((2, 5), (2, 3), (2, 2, 3))
3 - ((2, 5), (3, 3), (2, 2, 2))
3 - ((3,), (2, 2), (2, 2, 3, 5))
3 - ((3,), (2, 2, 2), (2, 3, 5))
3 - ((3,), (2, 2, 3), (2, 2, 5))
3 - ((3,), (2, 2, 5), (2, 2, 3))
3 - ((3,), (2, 3), (2, 2, 2, 5))
3 - ((3,), (2, 3, 5), (2, 2, 2))
3 - ((3,), (2, 5), (2, 2, 2, 3))
3 - ((3,), (3,), (2, 2, 2, 2, 5))
3 - ((3,), (3, 5), (2, 2, 2, 2))
3 - ((3,), (5,), (2, 2, 2, 2, 3))
3 - ((5,), (2, 2), (2, 2, 3, 3))
3 - ((5,), (2, 2, 2), (2, 3, 3))
3 - ((5,), (2, 2, 3), (2, 2, 3))
3 - ((5,), (2, 3), (2, 2, 2, 3))
3 - ((5,), (2, 3, 3), (2, 2, 2))
3 - ((5,), (3, 3), (2, 2, 2, 2))
Knuth's solution, as implemented by Adeel Zafar Soomro on the same Code Review page can be called as follows if no duplicates are desired:
s = algorithm_u([2,2,2,2,3,3,5],3)
ss = set(tuple(sorted(tuple(tuple(y) for y in x) for x in s)))
I haven't timed it, but Knuth's solution is visibly faster, even for this test case.
However, it returns 63 tuples rather than the 41 returned by user3569's solution. I haven't yet gone through the output closely enough to establish which output is correct.

Here's a version in Haskell:
import Data.List (nub, sort, permutations)
parts 0 = []
parts n = nub $ map sort $ [n] : [x:xs | x <- [1..n`div`2], xs <- parts(n - x)]
partition [] ys result = sort $ map sort result
partition (x:xs) ys result =
partition xs (drop x ys) (result ++ [take x ys])
partitions xs k =
let variations = filter (\x -> length x == k) $ parts (length xs)
in nub $ concat $ map (\x -> mapVariation x (nub $ permutations xs)) variations
where mapVariation variation = map (\x -> partition variation x [])
OUTPUT:
*Main> partitions [1,2,2,3] 2
[[[1],[2,2,3]],[[1,2,3],[2]],[[1,2,2],[3]],[[1,2],[2,3]],[[1,3],[2,2]]]

Python solution:
pip install PartitionSets
Then:
import partitionsets.partition
filter(lambda x: len(x) == k, partitionsets.partition.Partition(arr))
The PartitionSets implementation seems to be pretty fast however it's a pity you can't pass number of partitions as an argument, so you need to filter your k-set partitions from all subset partitions.
You may also want to look at:
similar topic on researchgate.

Opposite of a GROUP in apache pig latin?

Let's say I have the following input in apache pig:
(123, ( (1, 2), (3, 4) ) )
(666, ( (8, 9), (10, 11), (3, 4) ) )
and I want to convert these 2 rows into the following 7 rows:
(123, (1, 2) )
(123, (3, 4) )
(666, (8, 9) )
(666, (10, 11) )
(666, (3, 4) )
i.e. this is sorta 'doing the opposite of a GROUP'. Is this possible in pig latin?

Take a look at FLATTEN. It does what you probably need.
However, using your notation above, it looks like the list of tuples is a tuple. This should be a bag for this to work properly.
Instead of:
(123, ( (1, 2), (3, 4) ) )
(666, ( (8, 9), (10, 11), (3, 4) ) )
You should be representing your data as:
(123, { (1, 2), (3, 4) } )
(666, { (8, 9), (10, 11), (3, 4) } )
Then, once it is this form, you can do:
O = FOREACH grouped GENERATE $0, FLATTEN($1);

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

How to execute this sorting process in pyspark? - sorting

Related

Assigning color to nodes in a graph such that no two neighbor nodes to it has same color

Sort by key then value which will then be grouped up...pyspark

How to correlate all combination of arrays in an RDD?

I have a list of numbers, how to generate all unique k-partitions?

Opposite of a GROUP in apache pig latin?

Categories

Resources