How to sort data in Hadoop MapReduce? - sorting

I am working on a program that has 4 MapReduce steps. The output of my first step is:
id value
1 20
2 3
3 9
4 36
I have about 1,000,000 IDs, and in the second step I must sort by value. The output of this step:
id value
4 36
1 20
3 9
2 3
How can I sort my data in MapReduce? Do I need to use TeraSort? If so, how do I use TeraSort in the second step of my program?
Thanks.

You don't need TeraSort for this. If you want to sort by value, make the value the key in your map function. For example, given
id value
1 20
2 3
3 9
4 36
5 3
the mapper emits (value, id), so after the shuffle/sort phase the pairs reach the reducer ordered by value:
key value
3 2
3 5
9 3
20 1
36 4
That is, the map output is <value, id> and the reduce input is <value, id>. If you want the id back in the first column, iterate over the reducer's values (the ids) and swap the pair again:
context.write(value, key);
Two caveats: the framework sorts keys in ascending order by default, so to get your descending output (36 first) you need a descending sort comparator (Job.setSortComparatorClass), and ids that share the same value (here 2 and 5) arrive in no guaranteed order.
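As a minimal sketch of the idea, here is plain Java simulating what the shuffle/sort phase does with value-as-key, not actual Hadoop code; the class and method names are made up for illustration, and the reversed TreeMap comparator stands in for a descending sort comparator:

```java
import java.util.*;

public class SortByValue {
    // Simulates "emit value as key": MapReduce's shuffle phase sorts by key,
    // so putting the value into the key position sorts records by value.
    static List<int[]> sortByValueDesc(Map<Integer, Integer> idToValue) {
        // Reversed comparator plays the role of Job.setSortComparatorClass
        // with a descending comparator in real Hadoop.
        TreeMap<Integer, List<Integer>> shuffled = new TreeMap<>(Comparator.reverseOrder());
        for (Map.Entry<Integer, Integer> e : idToValue.entrySet()) {
            shuffled.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        // "Reducer": swap back so the id is in the first column again.
        List<int[]> out = new ArrayList<>();
        for (Map.Entry<Integer, List<Integer>> e : shuffled.entrySet()) {
            for (int id : e.getValue()) {
                out.add(new int[]{id, e.getKey()});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> data = new LinkedHashMap<>();
        data.put(1, 20); data.put(2, 3); data.put(3, 9); data.put(4, 36);
        for (int[] row : sortByValueDesc(data)) {
            System.out.println(row[0] + " " + row[1]);
        }
        // prints:
        // 4 36
        // 1 20
        // 3 9
        // 2 3
    }
}
```

In real Hadoop the TreeMap does not exist; the framework itself performs this sort between map and reduce, which is why swapping key and value is enough.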

Related

Increase the numbers in APL

I have the following data:
a b c d
5 9 6 0
3 1 3 2
Characters in the first row, numbers in the second row.
How do I get the character corresponding to the highest number in the second row, and how do I increase the corresponding number in the second row? (For example, here, column b has the highest number, 9, so increase that number by 10%.)
I use Dyalog version 17.1.
With:
⎕←data←3 4⍴'a' 'b' 'c' 'd' 5 9 6 0 3 1 3 2
a b c d
5 9 6 0
3 1 3 2
You can extract the second row with:
2⌷data
5 9 6 0
Now grade it descending, that is, find the indices that would sort it from highest to lowest:
⍒2⌷data
2 3 1 4
The first number is the column we're looking for:
⊃⍒2⌷data
2
Now we can use this to extract the character from the first row:
data[⊂1,⊃⍒2⌷data]
b
But we only need the column index, not the actual character. The full index of the number we want to increase is:
2,⊃⍒2⌷data
2 2
Extracting the data to see that we got the right index:
data[⊂2,⊃⍒2⌷data]
9
Now we can either create a new array with the target value increased by 10%:
1.1×@(⊂2,⊃⍒2⌷data)⊢data
a b c d
5 9.9 6 0
3 1 3 2
Or change it in-place:
data[⊂2,⊃⍒2⌷data]×←1.1
data
a b c d
5 9.9 6 0
3 1 3 2

Filter Google Sheets' pivot table by comparison with column

I want to filter a pivot table with the following setup:
My Table:
Key Value1 Value2
1 23 a
2 33 b
3 1 c
4 5 d
My pivot table (simplified):
Key SUM of Value1 COUNTA of Value2
1 23 1
2 33 1
3 1 1
4 5 1
Grand Total 62 4
I now want to filter the pivot table by the values in this list:
Keys
1
2
4
So the resulting pivot table should look like this:
Key SUM of Value1 COUNTA of Value2
1 23 1
2 33 1
4 5 1
Grand Total 61 3
I thought this should be possible with a custom formula in the pivot filter, but it seems there is no way to reference the current pivot cell, e.g. to perform a lookup.
I created a simple example of this setup here: https://docs.google.com/spreadsheets/d/1GlQDYtW8v8ri5L68RhryTZxwTikV_NXZQlccSI6_7pU/edit?usp=sharing
Paste this formula into Filters!B1:
=ARRAYFORMULA(IFERROR(VLOOKUP(A1:A, Table!A1:C, {2,3}, 0), ))
It looks up each key in your filter list and pulls its Value1 and Value2 alongside; then build the resulting pivot table from that range instead (demo spreadsheet).

hadoop mapreduce - Retain specific entries after secondary sorting using composite key

I have done a secondary sort using a composite key (Hadoop MapReduce, Java). The sorted data looks like:
(asc) (desc) (asc)
id num price
1 10 9
1 10 10
1 8 7
2 10 9
2 8 12
(id, num) is the composite key.
The expected result is:
id num price
1 10 9
2 10 9
That is, for each id, keep the row with the largest num and, among rows sharing that largest num, the lowest price.
How should I write the reduce method to finish this step?
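Since the secondary sort already orders each id's records by num descending and price ascending, the reducer only has to emit the first record it sees for each id group. A plain-Java sketch of that logic (it simulates the grouped, sorted input rather than using the Hadoop API; names are illustrative):

```java
import java.util.*;

public class FirstPerGroup {
    // Input rows are (id, num, price), already in the order the
    // composite-key secondary sort delivers them:
    // id asc, num desc within id, price asc within (id, num).
    static List<int[]> firstPerId(List<int[]> sortedRows) {
        List<int[]> out = new ArrayList<>();
        int prevId = Integer.MIN_VALUE;
        for (int[] row : sortedRows) {
            if (row[0] != prevId) {   // first row of a new id group:
                out.add(row);         // largest num, lowest price for that num
                prevId = row[0];
            }                         // later rows in the group are skipped
        }
        return out;
    }

    public static void main(String[] args) {
        List<int[]> rows = Arrays.asList(
            new int[]{1, 10, 9}, new int[]{1, 10, 10}, new int[]{1, 8, 7},
            new int[]{2, 10, 9}, new int[]{2, 8, 12});
        for (int[] r : firstPerId(rows)) {
            System.out.println(r[0] + " " + r[1] + " " + r[2]);
        }
        // prints:
        // 1 10 9
        // 2 10 9
    }
}
```

In the actual reducer, with the grouping comparator set to compare only the id part of the composite key, each reduce call sees one id group, so it boils down to writing the first value from the Iterable and returning.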

Kibana - Limit "sum" metric on data table

In Kibana's Visualize screen I'm trying to create a "sum" metric, and I would like to limit the values of this sum (between 0.1 and X), but I can only limit the field values before the aggregation is performed.
I have designed a simple metric that aggregates a field called Timestamp_Med. I don't know how to apply the filter properly: it only applies to the field, not to the sum metric.
For example, if I have these rows:
ID Session_Id Timestamp_Med
1 1 5
2 1 7
3 1 3
4 2 6
5 3 0
6 3 2
7 4 15
8 5 4
9 6 0
10 6 4
My metric must sum the Timestamp_Med field, grouped by Session_Id:
Session_Id Sum(Timestamp_Med)
1 15
2 6
3 2
4 15
5 4
6 4
If I apply the filter (Timestamp_Med: [1 TO 5]), it keeps the original records with IDs 1, 3, 6, 8 and 10, and I get these per-session values for the metric:
Session_Id Sum(Timestamp_Med)
1 8
3 2
5 4
6 4
But I want to filter values in "Sum(Timestamp_Med)", not in "Timestamp_Med", and get:
Session_Id Sum(Timestamp_Med)
3 2
5 4
6 4
Is it possible to build this kind of filter?
Thank you.
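Kibana's classic visualization editor has no post-aggregation filter, but the underlying Elasticsearch query can do this with a bucket_selector pipeline aggregation, which drops buckets based on the value of a sibling metric. A sketch using the field names from the question (the upper bound 5 stands in for your X; this is a raw query, not something the visualize UI exposes):

```json
{
  "size": 0,
  "aggs": {
    "per_session": {
      "terms": { "field": "Session_Id" },
      "aggs": {
        "sum_med": { "sum": { "field": "Timestamp_Med" } },
        "sum_filter": {
          "bucket_selector": {
            "buckets_path": { "total": "sum_med" },
            "script": "params.total >= 0.1 && params.total <= 5"
          }
        }
      }
    }
  }
}
```

This keeps only the Session_Id buckets whose Sum(Timestamp_Med) falls inside the range, which matches the desired output above.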

Convert a space-separated file (each row = a vector) to SequenceFile

I created a large text file (4 GB) as follows.
0 1 2 3 2 1
3 6 2 0 6 4
3 0 6 3 0 0
1 6 7 3 9 4
Each row describes a vector, and each column holds one element of the vector; elements are separated by a single space.
Now I would like to run k-means clustering on all the vectors with Apache Mahout, but I get the error "not a SequenceFile".
How can I create a file whose format meets Mahout's requirements?
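Mahout's k-means expects a SequenceFile of VectorWritable values rather than plain text, so each line has to be parsed into a vector and written through Hadoop's SequenceFile writer. A plain-Java sketch of the parsing step; the SequenceFile/VectorWritable write, indicated in the comments, needs the Hadoop and Mahout jars on the classpath and is not compiled here:

```java
import java.util.*;

public class TextToVectors {
    // Parse one space-separated line into the double[] that would back
    // a Mahout DenseVector.
    static double[] parseLine(String line) {
        String[] parts = line.trim().split("\\s+");
        double[] v = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            v[i] = Double.parseDouble(parts[i]);
        }
        return v;
    }

    public static void main(String[] args) {
        String[] lines = { "0 1 2 3 2 1", "3 6 2 0 6 4" };
        for (String line : lines) {
            double[] vec = parseLine(line);
            System.out.println(Arrays.toString(vec));
            // With Hadoop/Mahout on the classpath you would instead do,
            // per line (sketch, not compiled here):
            //   writer.append(new Text(rowKey),
            //                 new VectorWritable(new DenseVector(vec)));
            // where writer is a SequenceFile.Writer created with
            // keyClass Text.class and valueClass VectorWritable.class.
        }
    }
}
```

Mahout's examples also include org.apache.mahout.clustering.conversion.InputDriver, which converts exactly this kind of space-separated text-vector input into the SequenceFile format k-means wants; check whether it is available in your Mahout version before writing the converter yourself.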
