hadoop mapreduce - Retain specific entries after secondary sorting using composite key - hadoop

I have done a secondary sorting using composite key (hadoop mapreduce java). The sorted data looks like:
(asec) (desc) (asec)
id num price
1 10 9
1 10 10
1 8 7
2 10 9
2 8 12
(id, num) is the composite key.
The expected result is:
id num price
1 10 9
2 10 9
That is, for each id, get the largest num and lowest price (if there are some same largest nums).
How should I write the reduce method to finish this step?

Related

Can you check for duplicates by taking the sum of the array and then the product of the array?

Let's say we have an array of size N with values from 1 to N inside it. We want to check if this array has any duplicates. My friend suggested two ways that I showed him were wrong:
Take the sum of the array and check it against the sum 1+2+3+...+N. I gave the example 1,1,4,4 which proves that this way is wrong since 1+1+4+4 = 1+2+3+4 despite there being duplicates in the array.
Next he suggested the same thing but with multiplication. i.e. check if the product of the elements in the array is equal to N!, but again this fails with an array like 2,2,3,2, where 2x2x3x2 = 1x2x3x4.
Finally, he suggested doing both checks, and if one of them fails, then there is a duplicate in the array. I can't help but feel that this is still incorrect, but I can't prove it to him by giving him an example of an array with duplicates that passes both checks. I understand that the burden of proof lies with him, not me, but I can't help but want to find an example where this doesn't work.
P.S. I understand there are many more efficient ways to solve such a problem, but we are trying to discuss this particular approach.
Is there a way to prove that doing both checks doesn't necessarily mean there are no duplicates?
Here's a counterexample: 1,3,3,3,4,6,7,8,10,10
Found by looking for a pair of composite numbers with factorizations that change the sum & count by the same amount.
I.e., 9 -> 3, 3 reduces the sum by 3 and increases the count by 1, and 10 -> 2, 5 does the same. So by converting 2,5 to 10 and 9 to 3,3, I leave both the sum and count unchanged. Also of course the product, since I'm replacing numbers with their factors & vice versa.
Here's a much longer one.
24 -> 2*3*4 increases the count by 2 and decreases the sum by 15
2*11 -> 22 decreases the count by 1 and increases the sum by 9
2*8 -> 16 decreases the count by 1 and increases the sum by 6.
We have a second 2 available because of the factorization of 24.
This gives us:
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24
Has the same sum, product, and count of elements as
1,3,3,4,4,5,6,7,9,10,12,13,14,15,16,16,17,18,19,20,21,22,22,23
In general you can find these by finding all factorizations of composite numbers, seeing how they change the sum & count (as above), and choosing changes in both directions (composite <-> factors) that cancel out.
I've just wrote a simple not very effective brute-force function. And it shows that there is for example
1 2 4 4 4 5 7 9 9
sequence that has the same sum and product as
1 2 3 4 5 6 7 8 9
For n = 10 there are more such sequences:
1 2 3 4 6 6 6 7 10 10
1 2 4 4 4 5 7 9 9 10
1 3 3 3 4 6 7 8 10 10
1 3 3 4 4 4 7 9 10 10
2 2 2 3 4 6 7 9 10 10
My write-only c++ code is here: https://ideone.com/2oRCbh

Algorithm to distribute evenly products value into care packages

i'm currently solving a problem that states:
A company filed for bankruptcy and decided to pay the employees with the last remaining valuable items in the company only if it can be distributed evenly among them so that all of them have at least received 1 item and that the difference between the employee carrying the most valuable items and the employee carrying the least valuable items can not exceed a certain value x;
Input:
First row contains number of employee;
Second row contains the x value so that the the difference between the employee carrying the most valuable items and the employee carrying the least valuable items can not exceed;
Third row contains all the items with their value;
Output:
First number is the least valuable basket of items value and the second is the most valuable basket;
Example:
Input:
5
4
2 5 3 11 4 3 1 15 7 8 10
Output:
13 15
Input:
5
4
1 1 1 11 1 3 1 2 7 8
Output:
NO (It's impossible to distribute evenly)
Input:
5
10
1 1 1 1
Output:
NO (It's impossible to distribute evenly)
My solution to resolve this problem taking the first input is to, sort the items in ascending or descending order so from
2 5 3 11 4 3 1 15 7 8 10 --> 1 2 3 3 4 5 7 8 10 11 15
then create an adjacency list or just store it in simple variables where we add the biggest number to the lowest basket while iterating the item values array
Element 0: 15
Element 1: 11 <- 3 (sum 14)
Element 2: 10 <- 3 (sum 13)
Element 3: 8 <- 4 <- 1 (sum 13)
Element 4: 7 <- 5 <- 2 (sum 14)
So that my solution will have O(nlogN + 2n), first part using merge sort and then finding max e min value, what do you guys think about this solution?

Sort string in Java

I need way to sort according to the name .
‏ According to the number of letters of the alphabet, the word starts from A to Z,
‏ it's mean you want to count how many a in the two word and the word who have the largest number of letter a, you want to put this word first (swap)
‏ And if their number of a is equal, you will compare the letter after it means b, and if the number of the word is equal, you will compare C, and this is what ... and he will tell you that this is the case Suppose that there are no students who are inspired by the same number of all letters in the same class
My Code contains a class contain a name type of string and main drive contain a array of objects
As I'm a C++ and Python Developer. I can't help you with the Java Code, but according to your query, I think Count Sort is the most suitable for this kind of problem because while sorting numbers it sorts all of them using their Digits.
Example
Input data: 1, 4, 1, 2, 7, 5, 2
Take a count array to store the count of each unique object.
Index: 0 1 2 3 4 5 6 7 8 9
Count: 0 2 2 0 1 1 0 1 0 0
Modify the count array such that each element at each index
stores the sum of previous counts.
Index: 0 1 2 3 4 5 6 7 8 9
Count: 0 2 4 4 5 6 6 7 7 7
The modified count array indicates the position of each object in
the output sequence.
Rotate the array clockwise for one time.
Index: 0 1 2 3 4 5 6 7 8 9
Count: 0 0 2 4 4 5 6 6 7 7
Output each object from the input sequence followed by
increasing its count by 1.
Process the input data: 1, 4, 1, 2, 7, 5, 2. Position of 1 is 0.
Put data 1 at index 0 in output. Increase count by 1 to place next data 1 at an index 1 greater than this index.
Above Example is Taken from https://www.geeksforgeeks.org/counting-sort/

How to sort data in map reduce hadoop?

I am working with a programme that has 4 MapReduce steps.the output of my first step is:
id value
1 20
2 3
3 9
4 36
I have about 1,000,000 IDs and in the second step i must sort the values.the output of this step:
id value
4 36
1 20
3 9
2 3
How can I sort my data in map reduce? Do I need to use terasort? If yes, how do I use terasort in second step of my programme?
Thanks.
If you want to sort according to value's, make it key in map function. i.e.
id value
1 20
2 3
3 9
4 36
5 3
(value) (key) in map function
output will be
key value
3 5
3 2
9 3
20 1
36 4
map<value, id> output key/value
reduce <value, id>
if you want id to be in the first column, this will work.
context.write(value, key);
Note that, id's are not going to be sorted

Array problem using if and do loop

This is my code:
data INDAT8; set INDAT6;
Array myarray{24,27};
goodgroups=0;
do i=2 to 24 by 2;
do j=2 to 27;
if myarray[i,j] gt 1 then myarray[i+1,j] = 'bad';
else if myarray[i,j] eq 1 and myarray[i+1,j] = 1 then myarray[i+1,j]= 'good';
end;
end;
run;
proc print data=INDAT8;
run;
Problem:
I have the data in this format- it is just an example: n=2
X Y info
2 1 good
2 4 bad
3 2 good
4 1 bad
4 4 good
6 2 good
6 3 good
Now, the above data is in sorted manner (total 7 rows). I need to make a group of 2 , 3 or 4 rows separately and generate a graph. In the above data, I made a group of 2 rows. The third row is left alone as there is no other column in 3rd row to form a group. A group can be formed only within the same row. NOT with other rows.
Now, I will check if both the rows have “good” in the info column or not. If both rows have “good” – the group formed is also good , otherwise bad. In the above example, 3rd /last group is “good” group. Rest are all bad group. Once I’m done with all the rows, I will calculate the total no. of Good groups formed/Total no. of groups.
In the above example, the output will be: Total no. of good groups/Total no. of groups => 1/3.
This is the case of n=2(size of group)
Now, for n=3, we make group of 3 rows and for n=4, we make a group of 4 rows and find the good /bad groups in a similar way. If all the rows in a group has “good” block—the result is good block, otherwise bad.
Example: n= 3
2 1 good
2 4 bad
2 6 good
3 2 good
4 1 good
4 4 good
4 6 good
6 2 good
6 3 good
In the above case, I left the 4th row and last 2 rows as I can’t make group of 3 rows with them. The first group result is “bad” and last group result is “good”.
Output: 1/ 2
For n= 4:
2 1 good
2 4 good
2 6 good
2 7 good
3 2 good
4 1 good
4 4 good
4 6 good
6 2 good
6 3 good
6 4 good
6 5 good
In this case, I make a group of 4 and finds the result. The 5th,6th,7th,8th row are left behind or ignored. I made 2 groups of 4 rows and both are “good” blocks.
Output: 2/2
So, After getting 3 output values from n=2 , n-3, and n=4 I will plot a graph of these values.
If you can help in any any language using array, if and do loop. it would be great.
I can change my code accordingly.
Update:
The answer for this doesn't have to be in sas. Since it is more algorithm-related than anything, I will accept suggestions in any language as long as they show how to accomplish this using arrays and do.
I am having trouble understanding your problem statement, but from what I can gather here is what I can suggest:
Place data into bins and the process the summary data.
Implementation 1
Assumption: You don't know what the range of the first column will be or distriution will be sparse
Create a hash table. The Key will be the item you are doing your grouping on. The value will be the count seen so far.
Proces each record. If the key already exists, increment the count (value for that key in the hash). Otherwise add the key and set the value to 1.
Continue until you have processed all records
Count the number of keys in the hash table and the number of values that are greater than your threshold.
Implementation 2
Assumption: You know the range of the first column and the distriution is reasonably dense
Create an array of integers with enough elements so the index can match the column value. Initialize all elements to zero. This array will hold your count for each item you are grouping on
Process each record. Examine value of first column. Increment corresponding index in array. (So if you have "2 1 good", do groupCount[2]++)
Continue until you have processed all records
Walk each element in the array. Count how many items are non zero (meaning they appeared at least once) and how many items meet your threshold.
You can use the same approach for gathering the good and bad counts.

Resources