Spark groupByKey Clarification - hadoop

I am trying to process some data and write the output so that the result is partitioned by a key and sorted by another parameter, say in ascending order. For example:
>>> data =sc.parallelize(range(10000))
>>> mapped = data.map(lambda x: (x%2,x))
>>> grouped = mapped.groupByKey().partitionBy(2).map(lambda x: x[1] ).saveAsTextFile("mymr-output")
$ hadoop fs -cat mymr-output/part-00000 |cut -c1-1000
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, 80, 82, 84, 86, 88, 90, 92, 94, 96, 98, 100, 102, 104, 106, 108, 110, 112, 114, 116, 118, 120, 122, 124, 126, 128, 130, 132, 134, 136, 138, 140, 142, 144, 146, 148, 150, 152, 154, 156, 158, 160, 162, 164, 166, 168, 170, 172, 174, 176, 178, 180, 182, 184, 186, 188, 190, 192, 194, 196, 198, 200, 202, 204, 206, 208, 210, 212, 214, 216, 218, 220, 222, 224, 226, 228, 230, 232, 234, 236, 238, 240, 242, 244, 246, 248, 250, 252, 254, 256, 258, 260, 262, 264, 266, 268, 270, 272, 274, 276, 278, 280, 282, 284, 286, 288, 290, 292, 294, 296, 298, 300, 302, 304, 306, 308, 310, 312, 314, 316, 318, 320, 322, 324, 326, 328, 330, 332, 334, 336, 338, 340, 342, 344, 346, 348, 350, 352, 354, 356, 358, 360, 362, 364, 366, 368, 370, 372, 374, 376, 378, 380, 382, 384, 386, 388, 390, 392, 394, 396, 398, 400, 402, 404, 406, 408, 410, 412, 414, 416, 418, 420,
$ hadoop fs -cat mymr-output/part-00001 |cut -c1-1000
[2049, 2051, 2053, 2055, 2057, 2059, 2061, 2063, 2065, 2067, 2069, 2071, 2073, 2075, 2077, 2079, 2081, 2083, 2085, 2087, 2089, 2091, 2093, 2095, 2097, 2099, 2101, 2103, 2105, 2107, 2109, 2111, 2113, 2115, 2117, 2119, 2121, 2123, 2125, 2127, 2129, 2131, 2133, 2135, 2137, 2139, 2141, 2143, 2145, 2147, 2149, 2151, 2153, 2155, 2157, 2159, 2161, 2163, 2165, 2167, 2169, 2171, 2173, 2175, 2177, 2179, 2181, 2183, 2185, 2187, 2189, 2191, 2193, 2195, 2197, 2199, 2201, 2203, 2205, 2207, 2209, 2211, 2213, 2215, 2217, 2219, 2221, 2223, 2225, 2227, 2229, 2231, 2233, 2235, 2237, 2239, 2241, 2243, 2245, 2247, 2249, 2251, 2253, 2255, 2257, 2259, 2261, 2263, 2265, 2267, 2269, 2271, 2273, 2275, 2277, 2279, 2281, 2283, 2285, 2287, 2289, 2291, 2293, 2295, 2297, 2299, 2301, 2303, 2305, 2307, 2309, 2311, 2313, 2315, 2317, 2319, 2321, 2323, 2325, 2327, 2329, 2331, 2333, 2335, 2337, 2339, 2341, 2343, 2345, 2347, 2349, 2351, 2353, 2355, 2357, 2359, 2361, 2363, 2365, 2367, 2369, 2371, 2373, 2375, 2377, 2379, 238
$
This is perfect: it satisfies my first criterion, which is to have the results partitioned by key. But I also want the result sorted. I tried sorted(), but it didn't work.
>>> grouped= sorted(mapped.groupByKey().partitionBy(2).map(lambda x: x[1] ))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'PipelinedRDD' object is not iterable
I don't want to use parallelize again, and go recursive. Any help would be greatly appreciated.
PS: I did go through this: Does groupByKey in Spark preserve the original order? but it didn't help.
Thanks,
Jeevan.

Yes, that's an RDD, not a Python object that you can sort as if it were a local collection. After groupByKey(), though, the value in each key-value tuple is a collection of numbers; is that what you want to sort? You can use mapValues(sorted), which calls sorted() on each value.
I realize it's a toy example, but be careful with groupByKey, as it has to hold all values for a key in memory. Also, it is not necessarily guaranteed that an RDD with 2 elements and 2 partitions puts 1 element in each partition. It's probable but not guaranteed.
PS: you should be able to replace map(lambda x: x[1]) with values(). It may be faster.
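As a rough illustration of what sorting each group's values amounts to, here is a local, Spark-free sketch. The helper name group_and_sort and the sample numbers are made up for this example; on a real RDD the corresponding call would presumably be mapped.groupByKey().mapValues(sorted).

```python
from collections import defaultdict


def group_and_sort(pairs):
    """Local stand-in for grouping by key and then sorting each group's values."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Sorting happens per key, on an ordinary Python list.
    return {key: sorted(values) for key, values in groups.items()}


# Same keying as in the question (x % 2), but with unsorted input values.
pairs = [(x % 2, x) for x in (7, 2, 9, 4, 1, 8)]
result = group_and_sort(pairs)
print(result)
```

The point is that sorted() is applied to each group's values rather than to the distributed collection as a whole, which is why calling sorted() on the RDD itself fails.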

Similar to what is said above, the value in each key-value pair is not a plain Python list but an iterable wrapper (in PySpark, a ResultIterable); you can confirm this by checking type(value). However, you can access a Python list via the member .data and call sort or sorted on that.
grouped = mapped.groupByKey().partitionBy(2).map(lambda x: sorted(x[1].data))

Related

How to decode socket response?

I am trying to decode a socket response from a web source, but I can decode only about 50% of the message.
This looks strange: if part of the socket response can be decompressed, the whole message should decompress, but maybe I am wrong. Any help will be greatly appreciated!
require "faye/websocket"
require "eventmachine"
require "permessage_deflate"
require 'zlib'
require 'stringio'
EM.run do
  ws = Faye::WebSocket::Client.new("wss://wss.winline.ru/data_ng?", [], :extensions => [PermessageDeflate])
  ws.on :open do |event|
    p [:open]
    ws.send "lang"
    ws.send "AQ=="
    ws.send "data"
    ws.send "WINLINE"
    ws.send "getdate"
    ws.send "no_fordata"
  end
  ws.on :message do |event|
    gz = Zlib::GzipReader.new(StringIO.new(event.data.pack('C*')))
    puts gz.read
  end
  ws.on :close do |event|
    p [:close, event.code, event.reason]
    ws = nil
  end
end
I receive the following data:
�$��$^���60.5
�$��$^��62.5'!�$i#�$^����31.5�I�$�6�$^���9.5�3�$^���11.5� �$^��10�3�$$W�$^����80:3ূ$�m���&ާ�$�m�id�rݧ�$�m�t
�$''�=�v�p��a��0ȣc�Cincinnati BearcatsDetroit Mercy Titans��}$1'':%���}$1''�G���}$1'n|
As you can see, some parts of the message are decodable and others are not.
Without decompression, I get this message from the socket:
[242, 204, 179, 144, 33, 112, 239, 219, 8, 118, 124, 89, 134, 218, 170, 173, 183, 173, 126, 59, 167, 131, 63, 254, 48, 115, 255, 123, 7, 161, 157, 64, 184, 79, 95, 202, 7, 203, 254, 172, 245, 49, 72, 63, 3, 5, 128, 62, 243, 77, 202, 4, 219, 80, 245, 92, 169, 237, 174, 85, 242, 165, 252, 180, 73, 54, 149, 19, 245, 163, 88, 212, 49, 186, 65, 111, 156, 215, 136, 53, 70, 54, 205, 33, 147, 72, 255, 36, 246, 167, 135, 253, 65, 163, 224, 36, 28, 78, 38, 227, 165, 7, 191, 246, 98, 255, 235, 195, 126, 55, 101, 38, 246, 68, 188, 252, 63, 92, 112, 93, 239, 36, 118, 2, 0]
I think this source uses pako to compress its socket data.

javascript: from a websocket I receive messages as zlib deflate: how to read OR "unflate" OR "deflate" (not inflate)

This question is about converting a zlib-deflated message received from a websocket into an array or raw text that I can apply JSON.parse to.
To be clear: in this question I use a websocket from a crypto exchange, but the question is about the received message, not about the crypto exchange.
The documentation says "please use zlib deflate".
Here is the JavaScript:
digifinexopen = '{"id":12312,"method":"trades.subscribe","params":["btc_usdt"]}';
digifinex_market_ws = new WebSocket("wss://openapi.digifinex.com/ws/v1/");
digifinex_market_ws.binaryType = "arraybuffer";
digifinex_market_ws.onmessage = event => digifinex_trades(event.data);
digifinex_market_ws.onopen = event => digifinex_market_ws.send(digifinexopen);
function digifinex_trades (jsonx) { console.log(jsonx); }
I have this in the log
object=>[[Int8Array]]: Int8Array(1129) 0 … 99]
0: 120
1: -38
I tried with <script src="https://cdnjs.cloudflare.com/ajax/libs/pako/2.0.4/pako.min.js" ...></script>
if I do pako.deflate(jsonx);
I get
object=> Uint8Array(78) [120, 156, 1, 67, 0, 188, 255, 120, 218, 4, 192, 177, 13, 196, 32, 12, 133, 225, 93, 254, 154, 6, 174, 243, 54, 39, 66, 17, 201, 74, 36, 63, 187, 66, 236, 158, 111, 179, 34, 222, 192, 158, 114, 111, 196, 82, 121, 98, 27, 229, 63, 75, 24, 170, 57, 151, 196, 105, 220, 23, 214, 199, 175, 143, 243, 5, 0, 0, 255, 255, 32, 108, 18, 108, 62, 68, 31,
If I add decoder = new TextDecoder("utf8"); and log(decoder.decode(jsonx)); I get
string=> x�E��xڜ��n\7����'5���*
���$pƋ Ȼ�*�֋�g��#����|�����������������v\�//�_������������
But how do I retrieve the array or raw data that I could pass to JSON.parse?
If I decompress your data twice, I get:
{"error":null,"result":{"status":"success"},"id":12312}
It looks like you compressed instead of decompressed. Use pako.inflate().
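To make the compress-versus-decompress mix-up concrete, here is a small round trip using Python's standard zlib module as a stand-in for pako (zlib.compress playing the role of pako.deflate, zlib.decompress that of pako.inflate). The payload is a made-up copy of the JSON shown above, not the actual exchange traffic, so treat this as a sketch of the idea only.

```python
import json
import zlib

# Hypothetical payload standing in for what the server sends.
payload = json.dumps(
    {"error": None, "result": {"status": "success"}, "id": 12312}
).encode("utf-8")

server_message = zlib.compress(payload)    # what arrives on the websocket
mistaken = zlib.compress(server_message)   # compressing again instead of inflating

# A single inflate of the doubly compressed bytes only undoes the extra step.
assert zlib.decompress(mistaken) == server_message

# Intended handling: inflate the received message once, then parse the JSON.
decoded = json.loads(zlib.decompress(server_message).decode("utf-8"))
print(decoded)
```

Both byte strings start with 120 (0x78), the usual first byte of a zlib stream, which matches the first byte of both arrays logged in the question.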

Python. Homework: Not working due to efficiency with large inputs

Homework: return the maximal sum of k consecutive elements in a list. I have tried the following 3, which work for 6 of the 7 tests by which the solution is verified. The 7th test is a very long input with a very large k value. I cannot put the input list in because the shown list is truncated due to its length. Here are the 3 methods I tried. Reiterating, each timed out, while the last one also gave me a SyntaxError.
Method 1: [verbose]
def arrayMaxConsecutiveSum(inputArray, k):
    sum_array = []
    # there are len(inputArray) - k + 1 windows of length k
    for i in range(len(inputArray) - k + 1):
        sum_array.append(sum(inputArray[i:i+k]))
    return max(sum_array)
Method 2: [one line = efficiency??]
def arrayMaxConsecutiveSum(inputArray, k):
    return max([sum(inputArray[i:i+k]) for i in range(len(inputArray) - k + 1)])
Method 3: Lambda call
def arrayMaxConsecutiveSum(inputArray, k):
    f = lambda data, n: [data[i:i+n] for i in range(len(data) - n + 1)]
    sum_array = [sum(val) for val in f(inputArray,k)]
    return max(sum_array)
Some examples of inputs and (correct) outputs:
IN:[2, 3, 5, 1, 6]
k: 2 OUT: 8
IN:[2, 4, 10, 1]
k: 2 OUT: 14
IN: [1, 3, 4, 2, 4, 2, 4]
k: 4 OUT: 13
Again, I would like to mention that I passed the other tests (6 was very long with a large k value as well[k was an order of magnitude smaller than 7's, however]) and just need to identify a method or a revision that would be more efficient/make these more efficient. Lastly, I would like to add that I attempted both 6 and 7 with the (truncated) inputs on IDLE3 and each produced a ValueError:
Traceback (most recent call last):
File "/Users/ryanflynn/arrmaxconsecsum.py", line 15, in <module>
962, 244, 390, 854, 406, 457, 160, 612, 693, 896, 800, 670, 776, 65, 81, 336, 305, 262, 877, 217, 50, 835, 307, 865, 774, 163, 556, 186, 734, 404, 610, 621, 538, 370, 153, 105, 816, 172, 149, 404, 634, 105, 74, 303, 304, 145, 592, 472, 778, 301, 480, 693, 954, 628, 355, 400, 327, 916, 458, 599, 157, 424, 957, 340, 51, 60, 688, 325, 456, 148, 189, 365, 358, 618, 462, 125, 863, 530, 942, 978, 898, 858, 671, 527, 877, 614, 826, 163, 380, 442, 68, 825, 978, 965, 562, 724, 553, 18, 554, 516, 694, 802, 650, 434, 520, 685, 581, 445, 441, 711, 757, 167, 594, 686, 993, 543, 694, 950, 812, 765, 483, 474, 961, 566, 224, 879, 403, 649, 27, 205, 841, 35, 35, 816, 723, 276, 984, 869, 502, 248, 695, 273, 689, 885, 157, 246, 684, 642, 172, 313, 683, 968, 29, 52, 915, 800, 608, 974, 266, 5, 252, 6, 15, 725, 788, 137, 200, 107, 173, 245, 753, 594, 47, 795, 477, 37, 904, 4, 781, 804, 352, 460, 244, 119, 410, 333, 187, 231, 48, 560, 771, 921, 595, 794, 925, 35, 312, 561, 173, 233, 669, 300, 73, 977, 977, 591, 322, 187, 199, 817, 386, 806, 625, 500, 1, 294, 40, 271, 306, 724, 713, 600, 126, 263, 591, 855, 976, 515, 850, 219, 118, 921, 522, 587, 498, 420, 724, 716],6886)
File "/Users/ryanflynn/arrmaxconsecsum.py", line 6, in arrayMaxConsecutiveSum
return max(sum_array)
ValueError: max() arg is an empty sequence
(Note: this used method 3.) I checked with print statements, and both f(inputArray,k) and sum_array were []. Any help would be appreciated :)
Try:
def arrayMaxConsecutiveSum(inputArray, k):
    S = sum(inputArray[:k])
    M = S
    for i in range(len(inputArray) - k):
        S += (inputArray[i+k] - inputArray[i])
        if M < S:
            M = S
    return M
S stands for sum and M stands for max.
This solution has a complexity of O(n), whereas yours has O(n*k).
You are summing k numbers n-k times, whereas I am summing 3 numbers n times.
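A note on the ValueError from the question: with method 3, the list of windows is empty exactly when the list is shorter than k, so the truncated input pasted into IDLE most likely had fewer than k = 6886 elements. The sketch below is one possible variant of the sliding-window approach with an explicit guard for that case; the function name and the error message are illustrative choices, not part of the original answer.

```python
def max_consecutive_sum(values, k):
    """Maximal sum of k consecutive elements, computed in O(n) time."""
    if k <= 0 or k > len(values):
        # No window of length k exists, so fail with a clear message.
        raise ValueError("k must be between 1 and len(values)")
    window = sum(values[:k])  # sum of the first window
    best = window
    for i in range(len(values) - k):
        # Slide by one: add the entering element, drop the leaving one.
        window += values[i + k] - values[i]
        best = max(best, window)
    return best


# The three examples from the question.
examples = [([2, 3, 5, 1, 6], 2), ([2, 4, 10, 1], 2), ([1, 3, 4, 2, 4, 2, 4], 4)]
results = [max_consecutive_sum(values, k) for values, k in examples]
print(results)  # [8, 14, 13]
```

Each step touches only two elements of the list, which is where the O(n) bound comes from.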

Faster way for randomly chosing a value from list and deleting the chosen value

I was wondering if there is a faster way to randomly choose a value from a list and delete it from the list so that it cannot be chosen again. The drawing continues until there are no values left.
The way I did it solved the problem, but it takes almost 8 seconds, so I'm wondering if there is a faster way. I am using Jupyter notebook through the Anaconda software. Since this goes through a server, could that be the problem?
This is what I did:
import random
import numpy as np
TotalNumbcol = 266
Column_Numbers = list(np.arange(1,TotalNumbcol+1,1)) # creating a list with all column numbers in it from which can be drawn.
#print Column_Numbers
ABC = Column_Numbers # Creating a variable for the len command in the for loop below, since the Column Numbers length will change.
Chosen_Columns = [[0] for i in range(0,len(Column_Numbers))]
for i in range(len(ABC)):
    RandChoiceCol = int(random.choice(Column_Numbers)) # choosing a random number from the Column_Numbers range
    Chosen_Columns[i]=(RandChoiceCol) # adding each randomly chosen column number to a list showing which column has been chosen.
    Column_Numbers = [x for x in Column_Numbers if x not in Chosen_Columns] # remove the chosen column number from Column_Numbers
print(Chosen_Columns)
print(Column_Numbers)
[21, 131, 145, 218, 153, 60, 201, 15, 158, 189, 230, 210, 18, 103, 69, 76, 226, 180, 67, 187, 238, 20, 157, 24, 48, 11, 47, 117, 101, 51, 122, 155, 109, 225, 86, 243, 146, 30, 58, 7, 66, 132, 22, 110, 1, 142, 234, 245, 266, 129, 232, 39, 184, 49, 114, 182, 162, 144, 92, 126, 5, 254, 150, 102, 135, 173, 36, 52, 42, 26, 228, 63, 17, 8, 163, 40, 78, 174, 222, 205, 183, 140, 221, 70, 125, 72, 247, 237, 64, 246, 185, 130, 248, 90, 197, 53, 107, 77, 108, 256, 207, 139, 176, 192, 2, 164, 4, 124, 241, 113, 188, 178, 235, 265, 190, 212, 99, 175, 79, 231, 257, 202, 50, 242, 181, 46, 161, 133, 104, 28, 251, 213, 204, 59, 149, 252, 179, 43, 137, 195, 160, 220, 119, 74, 87, 255, 98, 208, 105, 239, 170, 203, 167, 136, 250, 134, 32, 165, 229, 9, 258, 13, 141, 240, 262, 34, 227, 148, 41, 111, 54, 71, 61, 94, 249, 29, 75, 10, 193, 152, 73, 123, 65, 6, 116, 68, 91, 56, 25, 233, 156, 261, 35, 171, 211, 215, 186, 154, 138, 200, 44, 112, 57, 166, 120, 147, 89, 31, 106, 118, 199, 198, 81, 223, 83, 12, 214, 45, 121, 244, 95, 168, 55, 37, 206, 263, 93, 196, 115, 169, 217, 236, 82, 143, 96, 33, 209, 14, 100, 216, 128, 259, 219, 151, 16, 177, 159, 23, 38, 84, 80, 27, 19, 264, 62, 85, 127, 97, 224, 172, 191, 88, 253, 3, 260, 194]
[]
If there is a more efficient, time-saving way, please let me know.
Regards,
You may just shuffle the list in place instead of creating a new list and a new number every time:
Column_Numbers = list(np.arange(1,TotalNumbcol+1,1))
random.shuffle(Column_Numbers)
while Column_Numbers:
    rand = Column_Numbers.pop()
    print(rand)
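A variation on the same idea, sketched without NumPy: random.sample with a sample size equal to the population size returns a random permutation, so each column number is drawn exactly once. The variable names here are illustrative; only the count of 266 comes from the question.

```python
import random

total_columns = 266  # same count as in the question

# Sampling every element without replacement yields a random draw order.
draw_order = random.sample(range(1, total_columns + 1), total_columns)

remaining = list(draw_order)
chosen = []
while remaining:
    # pop() from the end of a list is O(1), unlike rebuilding the list each time.
    chosen.append(remaining.pop())

print(chosen[:10])
```

The original loop rebuilds Column_Numbers with a membership test against a list on every draw, which is roughly quadratic work per iteration; shuffling once and popping avoids that entirely.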

Union of two arrays equals the first array? [closed]

I am doing a problem from projecteuler.com. The question is this:
If we list all the natural numbers below 10 that are multiples of 3 or
5, we get 3, 5, 6 and 9. The sum of these multiples is 23. Find the
sum of all the multiples of 3 or 5 below 1000.
I thought of creating arrays of multiples for each 3 and 5 up to 1000 and taking the union of them, which doesn't leave me with duplicates (so I don't need to call array.uniq). What I've written is this:
def get_range(range, step)
  ret = []
  range.step(step) { |i| ret << i }
  return ret
end
p get_range(0..1000, 3) | get_range(0..1000, 5)
This comes out with this result:
[0, 3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45, 48, 51, 54, 57, 60, 63, 66, 69, 72, 75, 78, 81, 84, 87, 90, 93, 96, 99, 102, 105, 108, 111, 114, 117, 120, 123, 126, 129, 132, 135, 138, 141, 144, 147, 150, 153, 156, 159, 162, 165, 168, 171, 174, 177, 180, 183, 186, 189, 192, 195, 198, 201, 204, 207, 210, 213, 216, 219, 222, 225, 228, 231, 234, 237, 240, 243, 246, 249, 252, 255, 258, 261, 264, 267, 270, 273, 276, 279, 282, 285, 288, 291, 294, 297, 300, 303, 306, 309, 312, 315, 318, 321, 324, 327, 330, 333, 336, 339, 342, 345, 348, 351, 354, 357, 360, 363, 366, 369, 372, 375, 378, 381, 384, 387, 390, 393, 396, 399, 402, 405, 408, 411, 414, 417, 420, 423, 426, 429, 432, 435, 438, 441, 444, 447, 450, 453, 456, 459, 462, 465, 468, 471, 474, 477, 480, 483, 486, 489, 492, 495, 498, 501, 504, 507, 510, 513, 516, 519, 522, 525, 528, 531, 534, 537, 540, 543, 546, 549, 552, 555, 558, 561, 564, 567, 570, 573, 576, 579, 582, 585, 588, 591, 594, 597, 600, 603, 606, 609, 612, 615, 618, 621, 624, 627, 630, 633, 636, 639, 642, 645, 648, 651, 654, 657, 660, 663, 666, 669, 672, 675, 678, 681, 684, 687, 690, 693, 696, 699, 702, 705, 708, 711, 714, 717, 720, 723, 726, 729, 732, 735, 738, 741, 744, 747, 750, 753, 756, 759, 762, 765, 768, 771, 774, 777, 780, 783, 786, 789, 792, 795, 798, 801, 804, 807, 810, 813, 816, 819, 822, 825, 828, 831, 834, 837, 840, 843, 846, 849, 852, 855, 858, 861, 864, 867, 870, 873, 876, 879, 882, 885, 888, 891, 894, 897, 900, 903, 906, 909, 912, 915, 918, 921, 924, 927, 930, 933, 936, 939, 942, 945, 948, 951, 954, 957, 960, 963, 966, 969, 972, 975, 978, 981, 984, 987, 990, 993, 996, 999, 5, 10, 20, 25, 35, 40, 50, 55, 65, 70, 80, 85, 95, 100, 110, 115, 125, 130, 140, 145, 155, 160, 170, 175, 185, 190, 200, 205, 215, 220, 230, 235, 245, 250, 260, 265, 275, 280, 290, 295, 305, 310, 320, 325, 335, 340, 350, 355, 365, 370, 380, 385, 395, 400, 410, 415, 425, 430, 440, 445, 455, 460, 470, 475, 485, 490, 500, 505, 515, 520, 530, 535, 545, 550, 560, 565, 
575, 580, 590, 595, 605, 610, 620, 625, 635, 640, 650, 655, 665, 670, 680, 685, 695, 700, 710, 715, 725, 730, 740, 745, 755, 760, 770, 775, 785, 790, 800, 805, 815, 820, 830, 835, 845, 850, 860, 865, 875, 880, 890, 895, 905, 910, 920, 925, 935, 940, 950, 955, 965, 970, 980, 985, 995]
which is the first array. If I swap the order of the ranges, then I get the array with the multiples of 5. I tried something like this on IRB:
[1,3,5] | [3, 5, 7]
# => [1, 3, 5, 7]
Am I missing something, am I just going insane, or have I encountered a bug in Ruby?
Your array is correct. It contains the multiples of both 3 and 5. Look at the end: the multiples of 5 that are not already present follow the multiples of 3. It's just not sorted.
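The same effect can be reproduced in Python as a rough analogue, assuming an order-preserving union that keeps elements of the first list followed by the new elements of the second (which is how the output above behaves). dict.fromkeys is used here only to mimic that ordering; it is not the Ruby implementation. The ranges stop before 1000 because the problem asks for multiples below 1000.

```python
multiples_of_3 = list(range(0, 1000, 3))
multiples_of_5 = list(range(0, 1000, 5))

# dict.fromkeys keeps the first occurrence of each key in insertion order,
# which mimics an order-preserving union of the two lists.
union = list(dict.fromkeys(multiples_of_3 + multiples_of_5))

print(union[:5])    # starts with the multiples of 3
print(union[-3:])   # ends with multiples of 5 that are not multiples of 3
print(sum(union))   # 233168
```

The union looks like "the first array" at a glance only because its first few hundred entries are the multiples of 3; the extra multiples of 5 sit at the tail.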
