MapReduce: extract the single row with the highest value - hadoop

This is the actual output from my reducer. The data is the title (key), then the month and the frequency of how many times books with that title were borrowed (value). Is there any way to get only the one row with the highest value? For example, I want to choose only the row with the highest frequency among lots of rows. If you know the way, please enlighten me. Thanks a lot.
"""E"" is for evidence [sound recording] / by Sue Grafton." 05 8
"""F"" is for fugitive [sound recording] / by Sue Grafton." 05 6
"""G"" is for Grafton : the world of Kinsey Millhone / Natalie Hevener Kaufman and Carol McGinnis Kay." 06 1
"""G"" is for gumshoe [text (large print)] / Sue Grafton." 09,10 1
"""Galapagos"" means ""tortoises"" / written and illustrated by Ruth Heller." 10,04,09 2
"""Git on board 09 1
"""God's banker"" / by Rupert Cornwell." 05,10,11 1
"""Gospodi-- spasi i usmiri Rossi︠i︡u"" : Nikolaĭ II 10,11 1
"""H"" is for homicide [sound recording] / by Sue Grafton." 12 4

Run a secondary MapReduce job that accepts as input the output of the first job. In the mapper, write (NullWritable, line) as the output pair, since you want to collect all lines at a single reducer but don't really care about a key otherwise. In the reducer, parse the number out of each line, keeping track of the current maximum value and its associated line; after looping over all values, write out the maximum line.
To improve the run time of this process, use setCombinerClass in your job configuration to apply this same reducer as a combiner, so each map task pre-selects its own local maximum.
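A minimal sketch of such a reducer (illustrative names, not from the original post; it assumes the frequency is the last tab-separated field of each line):
import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// The mapper emits (NullWritable, line) so every line reaches one reducer;
// the reducer scans all lines and keeps the one with the largest count.
public class MaxLineReducer
        extends Reducer<NullWritable, Text, NullWritable, Text> {

    @Override
    protected void reduce(NullWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        long max = Long.MIN_VALUE;
        String best = null;
        for (Text value : values) {
            String line = value.toString();
            // assumption: the frequency is the last tab-separated field
            long n = Long.parseLong(line.substring(line.lastIndexOf('\t') + 1));
            if (n > max) { max = n; best = line; }
        }
        if (best != null) {
            context.write(NullWritable.get(), new Text(best));
        }
    }
}
Because its input and output types both match the mapper's output, the same class can be registered with setCombinerClass as well.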

Related

Create values of new data frame variable based on other column values

I have a question about data set preparation. In a survey, the same people were asked about a number of different variables at two points of measurement. This resulted in a dataset in long format, i.e. the information from each participant is stored in two rows, each row representing that person's data at the respective time of measurement (see example). Each participant has an individual participation code; the same participation code thus indicates that the data come from the same person.
code   time   risk_perception
DB6M   1      6
DB6M   2      4
TH4D   1      2
TH4D   2      3
Now I would like to create a new variable "risk_perception.complete", which shows whether the information for each participant is complete. It could be that a person gave no information at either measurement time, or only at one of the two, so that values are missing (NAs). In the new variable I would like to check and code this information for each person: if the person has one or more NAs, a 0 should be coded there; if the person has no NAs, a 1 (see example).
code   time   risk_perception   risk_perception.complete
DB6M   1      6                 1
DB6M   2      4                 1
TH4D   1      2                 1
TH4D   2      3                 1
SU6H   1      NA                0
SU6H   2      3                 0
VG9S   1      NA                0
VG9S   2      NA                0
Can anyone tell me the best way to program this?
Here is a reproducible example:
data <- data.frame(
  code = c("AH6M","AH6M","BD7M","BD7M","SH9L","SH9L"),
  time = c(1,2,1,2,1,2),
  risk = c(6,7,NA,3,NA,NA))
Thank you in advance and best regards!
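One way to program this in base R (a sketch against the reproducible example above; risk.complete is just an illustrative name):
# For each participation code, flag 1 only if no risk value is missing
# anywhere in that person's rows, otherwise 0.
data$risk.complete <- ave(data$risk, data$code,
                          FUN = function(x) as.integer(!anyNA(x)))
data
#   code time risk risk.complete
# 1 AH6M    1    6             1
# 2 AH6M    2    7             1
# 3 BD7M    1   NA             0
# 4 BD7M    2    3             0
# 5 SH9L    1   NA             0
# 6 SH9L    2   NA             0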

Moving average conditioned in Google Sheets

I have a Google Sheet in which I need to calculate a moving average, conditioned on the 'ID', over the last 3 periods.
Any idea on how to do it?
Below is an example with the expected results (column "Mean Average (last 3)").
Regards!
ID value Mean Average (last 3)
1 12 12,00
1 19 12,00
1 19 15,50
1 18 16,67
1 13 18,67
2 11 11,00
2 18 11,00
2 15 14,50
2 17 14,67
2 11 16,67
3 11 11,00
3 16 11,00
3 10 13,50
3 11 12,33
I've got an answer that may work for you. Assuming that your sample data is in the range A4:C (see my sample sheet), try the following formula in column D, in the same row as your data headers.
={"Mean Avg";ArrayFormula(
IF(ROW(A4:A18)<ROW(A$4)+2,
C$4,
IF(NOT(EQ(A4:A18,OFFSET(A4:A18,-1,0))),
B4:B18,
IF(NOT(EQ(A4:A18,OFFSET(A4:A18,-2,0))),
B3:B18,
IF(NOT(EQ(A4:A18,OFFSET(A4:A18,-3,0))),
(B2:B17+B3:B18)/2,
(B1:B16+B2:B17+B3:B18)/3)))))}
The first IF checks whether the row is one of the first two data rows, to force the initial values.
The next IF checks whether the ID differs from the one in the row above, and forces the start of a new average with just the single current value. The next IF checks whether it is the second row of an ID series (NOT EQual to the ID two rows up), and if so, it also uses just the single value from the row above.
The next IF checks three rows up, and if the IDs differ, it averages the values from the two rows above.
Otherwise, this is at least the fourth data row in a series with the same ID, and the formula takes the values from the three rows above and averages them.
Due to the offsets, it seems quite sensitive to ranges, so it may need some tuning if you move it.
Let me know if this helps.
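If the array formula proves too fragile, here is a hedged per-row alternative (my own sketch, not from the original answer): enter it in D4 and fill down, assuming the same A4:C layout with contiguous IDs.
=IF(COUNTIF(A$4:A4,A4)=1, B4,
    AVERAGE(OFFSET(B4, -MIN(3, COUNTIF(A$4:A4,A4)-1), 0,
                   MIN(3, COUNTIF(A$4:A4,A4)-1), 1)))
COUNTIF counts how many rows of the current ID have appeared so far: the first row of an ID returns its own value, and later rows average the one, two, or three immediately preceding values.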

Bash: find all pair of lines such that the difference of their first field is less than a threshold

My problem is the following: I have a big file with many rows containing ordered numbers (repetitions are possible):
1
1.5
3
3.5
6
6
...
1504054
1504056
I would like to print all pairs of (zero-based) row numbers whose values differ by less than a given threshold thr. For instance, with thr=2.01, I want:
0 1
0 2
1 2
1 3
2 3
4 5
...
N-1 N
I wrote something in Python, but the file is huge and I think I need a smarter way to do this in bash.
Actually, the complete data structure also contains a second column with a string:
1 s0
1.5 s1
3 s2
3.5 s3
6 s4
6 s5
...
1504054 sN-1
1504056 sN
and, if easy to do, I would like to write in each row the pair of linked strings, possibly separated by "|":
s0|s1
s0|s2
s1|s2
s1|s3
s2|s3
s4|s5
...
sN-1|sN
Thanks for your help; I am not too familiar with bash.
In any language you can write a program implementing this pseudocode:
kept_rows = []
for line in infile:
    row = line.split(sep)
    new_kept_rows = []
    for kr in kept_rows:
        if abs(float(kr[0]) - float(row[0])) <= thr:
            print("|".join(kr[1:]) + "|" + "|".join(row[1:]))
            new_kept_rows.append(kr)
    new_kept_rows.append(row)    # the current row becomes a candidate too
    kept_rows = new_kept_rows
This program keeps only the few lines that can still match the condition; all others are freed from memory, so the memory footprint should remain small even for big files.
I would use awk because I'm comfortable with it, but Python would fit too (the pseudocode above is very close to working Python).
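For reference, a hedged awk sketch of the same sliding-window idea (my own illustration, not from the original answer; it assumes each line is "value name", sorted by value, with the threshold passed via -v):
awk -v thr=2.01 '
BEGIN { lo = 1 }
{
    # input is sorted, so rows whose value is more than thr below the
    # current value can never match again: drop them from the window
    while (lo < NR && $1 - val[lo] > thr) { delete val[lo]; delete name[lo]; lo++ }
    # pair the current row with every stored row still inside the window
    for (i = lo; i < NR; i++) print name[i] "|" $2
    val[NR] = $1; name[NR] = $2
}' file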

Using Powershell to sort through text file and combine like terms?

In this hypothetical world, let's say that there is a school that offers classes A through Z. In order to graduate, you merely have to accumulate a certain number of hours in certain courses. This is recorded in this way (number = hours, letter = class):
Sunday 01/10/15 - 2A, 5H, 3Z
Monday 01/11/15 - 2A, 5Z, 5B
Wednesday 01/15/15 - 2A, 4Y, 6R
At the end of the individual record for the student, it would combine all numbers that had a matching letter, thereby showing the hour total in each class:
Sunday 01/10/15 - 2A, 5H, 3Z
Monday 01/11/15 - 2A, 5Z, 5B
Wednesday 01/15/15 - 2A, 4Y, 6R
6A, 5B, 5H, 6R, 4Y, 8Z
While I am comfortable importing and sorting data, I am having a hard time figuring out how to script adding up all the numbers that share the same letter. If this data were hundreds of lines long and there were three instances of 2B among them, I would like it to pick those out and show me 6B. What am I missing? Am I using the wrong tool for the job?
What you want is a dictionary of sums; in PowerShell, that is a hash table. Every time you see a letter, look it up in your top-level hash table and see if it is there. If not, create a sum for that letter and add it to the hash table.
When you are done, you can take the keys of the hash table, sort them, and then print out all the letter/sum pairs as key-value pairs.
I will write some code for that, but I need to have a coffee first :).
And here it is...
$slst = ("2A","5H","3Z","2A","5Z","5B","2A","4Y","6R")
$sums = @{}
foreach($s in $slst)
{
    $skey = $s[1]          # the class letter
    $sval = $s[0] - 48     # char digit minus ASCII '0' (single-digit hours only)
    $sums[$skey] += $sval  # a missing key reads as $null, which adds as 0
}
$sums.GetEnumerator() | Sort-Object Name
Note there was a longer version here before, but someone pointed out that you don't need to initialize dictionary entries explicitly in PowerShell.
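If the hour counts could be more than one digit, here is a hedged variant of the loop body using a regex (my own illustration, not part of the original answer):
foreach($s in $slst)
{
    # e.g. "12B" means 12 hours of class B; assumes entries look like <digits><letter>
    if ($s -match '^(\d+)([A-Za-z])$')
    {
        $sums[$Matches[2]] += [int]$Matches[1]
    }
}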

Include only complete groups in panel regression using Stata

I have a panel data set, but not all individuals are present in all periods. When I run xtreg I see that there are between 1 and 4 observations per group, with a mean of 1.9. I'd like to include only those groups with 4 observations. Is there any way I can do this easily?
I understand that you want to include in your regression only those groups for which there are exactly 4 observations. If this is the case, then one solution is to count the number of observations per group and condition the regression using if:
clear all
set more off
webuse nlswork
xtset idcode
list idcode year in 1/50, sepby(idcode)
bysort idcode: gen counter = _N
xtreg ln_w grade age c.age#c.age ttl_exp c.ttl_exp#c.ttl_exp tenure ///
c.tenure#c.tenure 2.race not_smsa south if counter == 12, be
In this example the regression is restricted to groups with 12 observations. The xtreg command gives (among other things):
Number of obs = 1881
Number of groups = 158
which you can compare with the result of running the regression without the if:
Number of obs = 28091
Number of groups = 4697
As commented by @NickCox, if you don't mind losing observations you can drop or keep the (un)desired groups:
bysort idcode: drop if _N != 4
or
bysort idcode: keep if _N == 4
followed by an unconditional xtreg (i.e. with no if).
Notice that both approaches count missings, so you may need to account for that.
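A hedged sketch of one way to account for that, counting only rows where the dependent variable is non-missing (illustrative, not from the original answer):
bysort idcode: egen nonmiss = total(!missing(ln_w))
xtreg ln_w grade age c.age#c.age ttl_exp c.ttl_exp#c.ttl_exp tenure ///
    c.tenure#c.tenure 2.race not_smsa south if nonmiss == 12, be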
On the other hand, you might want to think about why you want to discard that data in your analysis.
