How to sort by key and value in mapreduce? - sorting

I have a text file:
10 1 15
10 12 30
10 9 45
10 9 40
10 15 55
12 9 0
12 7 18
12 10 1
9 1 1
9 2 1
9 0 1
14 5 5
And I would like to get this file as an output of my MapReduce job:
9 0 1
9 1 1
9 2 1
10 1 15
10 9 40
10 9 45
10 12 30
10 15 55
12 7 18
12 9 0
12 10 1
14 5 5
That is, it has to be sorted numerically by the 1st, 2nd and 3rd columns.
I use this script:
#!/bin/bash
IN_DIR="/user/cloudera/temp"
OUT_DIR="/user/cloudera/temp_out"
NUM_REDUCERS=1
hdfs dfs -rmr ${OUT_DIR} > /dev/null
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-D mapred.job.name="Parsing mista pages job 1 (parsing)" \
-D stream.num.map.output.key.fields=3 \
-D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapreduce.lib.partition.KeyFieldBasedComparator \
-D mapreduce.partition.keycomparator.options='-k1,1n -k2,2n -k3,3n' \
-D mapreduce.job.reduces=${NUM_REDUCERS} \
-mapper 'cat' \
-reducer 'cat' \
-input ${IN_DIR} \
-output ${OUT_DIR}
hdfs dfs -cat ${OUT_DIR}/* | head -100
And I get exactly what I want. But when I set NUM_REDUCERS=2 I get this output:
[cloudera@quickstart ~]$ hdfs dfs -cat /user/cloudera/temp_out/part-00000 | head -100
9 1 1
10 9 45
10 12 30
10 15 55
12 7 18
12 10 1
14 5 5
[cloudera@quickstart ~]$ hdfs dfs -cat /user/cloudera/temp_out/part-00001 | head -100
9 0 1
9 2 1
10 1 15
10 9 40
12 9 0
Why does the partitioner send rows with the same key (for example '9') to different reducers?
How can I force the partitioner to split the mapper output by key and sort it by value? For example, if I have 4 reducers, the reducer inputs should be:
reducer 1
9 0 1
9 1 1
9 2 1
reducer 2
10 1 15
10 9 40
10 9 45
10 12 30
10 15 55
reducer 3
12 7 18
12 9 0
12 10 1
reducer 4:
14 5 5

You can override the default Partitioner to send each key to a different reducer. Set the number of reducers to the number of distinct keys, so that each reducer deals with only one key. For example:
groupMap.put("9", 0);
groupMap.put("10", 1);
groupMap.put("12", 2);
groupMap.put("14", 3);
Add the -partitioner argument to use your own partitioner in the job.
I think it might work for you.
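Alternatively, Hadoop streaming ships with a KeyFieldBasedPartitioner, so you can partition on the first field only while the comparator still sorts on all three, without writing any Java. A sketch of the adjusted job (same paths, mapper and reducer as the script above; note the partitioner hashes the first field, so which particular reducer a given key lands on is not guaranteed to match the neat 1:1 layout shown in the question):
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-D stream.num.map.output.key.fields=3 \
-D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapreduce.lib.partition.KeyFieldBasedComparator \
-D mapreduce.partition.keycomparator.options='-k1,1n -k2,2n -k3,3n' \
-D mapreduce.partition.keypartitioner.options=-k1,1 \
-D mapreduce.job.reduces=4 \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
-mapper 'cat' \
-reducer 'cat' \
-input ${IN_DIR} \
-output ${OUT_DIR}
With -k1,1 as the partitioner options, all rows sharing a first field go to the same reducer, which is usually what "partition by key, sort by value" means in practice.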

Related

Relationship between columns - awk

I have a file with a structure more or less like this:
test:
1 2 3 4 5
2 4 5 0 0
6 4 5 0 0
7 8 9 10 11
8 10 11 0 0
12 10 11 0 0
13 10 11 0 0
14 2 3 4 5
15 10 11 0 0
16 2 3 4 5
17 2 3 4 5
What I want is: for each line, find the lines whose 2nd and 3rd columns match its 4th and 5th columns, and print its first column next to the matching line's first column, skipping matches whose first column equals the current line's 2nd column. It's a bit confusing, but it'd be like this:
1 6
7 12
7 13
7 15
14 6
16 6
17 6
I believe I'm almost there using this code:
cat test | awk 'NR==FNR {{a[$4" "$5]=a[$4" "$5]" "$1};next} $2" "$3 in a {print a[$2" "$3],$1}' - test
But the output that I get is:
1 14 16 17 2
1 14 16 17 6
7 8
7 12
7 13
7 15
Any help?
Thanks!
(elaborating on my comment)
This awk procedure uses the main action block to build a 2-D array representing the input table. The END block then makes pair-wise comparisons of each row against all others. The logic looks for pairs where the 4th and 5th entries of one row match the 2nd and 3rd entries of the other, but excludes a pair if the first row's 2nd entry equals the other row's 1st entry:
(input data is in a file named data.txt)
awk '
{
    # First pass over the rows: store every field in a 2-D array.
    for (col = 1; col <= NF; col++)
        table[NR, col] = $col
}
END {
    # Compare each row i against every row j.
    for (i = 1; i <= NR; i++)
        for (j = 1; j <= NR; j++)
            if (table[i,4] == table[j,2] && table[i,5] == table[j,3] && table[i,2] != table[j,1])
                print table[i,1] " " table[j,1]
}
' data.txt
Output:
1 6
7 12
7 13
7 15
14 6
16 6
17 6
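Your original two-pass attempt can also be fixed. A sketch (assuming the file is named test and reading it twice): the first pass stores each row's 1st and 2nd fields under its "$4 $5" key; the second pass prints one pair per line, skipping matches where the stored 2nd field equals the current row's 1st field. Note the pairs come out grouped by the second column rather than the first:
awk 'NR==FNR {
    key = $4 " " $5
    a[key] = (key in a ? a[key] ";" : "") $1 "," $2
    next
}
{
    key = $2 " " $3
    if (key in a) {
        n = split(a[key], pairs, ";")
        for (k = 1; k <= n; k++) {
            split(pairs[k], f, ",")         # f[1], f[2] = stored 1st and 2nd fields
            if (f[2] != $1) print f[1], $1  # the exclusion rule
        }
    }
}' test test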

Need help to find average, min and max values in shell script from text file (again)

This is an update to a question I posted before. I've gotten a little farther into this but need help with a new problem.
I'm working on a shell script right now. I need to loop through a text file and, for each line, find the average, max and min of its numbers, then print them in a chart with each line's name. This is the text file:
Experiment1 9 8 1 2 9 0 2 3 4 5
collect1 83 39 84 2 1 3 0 9
jump1 82 -1 9 26 8 9
exp2 22 0 7 1 0 7 3 2
jump2 88 7 6 5
taker1 5 5 44 2 3
This is my code so far. It should be working but it won't do any of the calculations. The first loop grabs the line of text and the second loop separates the name from the numbers; these two work. The third loop takes the numbers and does the calculations. It keeps giving me an error saying "expr: non integer argument". Why is it doing that? I shouldn't have any non-integer arguments.
#!/bin/bash
while read line
do
echo $line | while read first second
do
echo $first
echo $second
sum=0
max=0
min=0
len=0
for arg in $second
do
sum=`expr $sum + $arg`
if [ $min > $arg ]
then
set min=$arg
fi
if [ $max < $arg ]
then
set max=$arg
fi
len=`expr $len + 1`
done
avg=`expr $sum / $len`
echo $avg
echo $min
echo $max
done
done < mystats.txt
This is the desired output when you type "bash statcalc.sh -s name mystats.txt"
Experiment Name Average Max Min
collect1 27 84 0
exp2 5 22 0
Experiment1 3 9 0
jump1 21 82 -1
jump2 31 88 5
taker1 13 44 2
Using awk
awk '{if (NR==1) print "Experiment Name Average Max Min"; min=$2; max=$2; for (i=2; i<=NF; i++) {a[$1]+=$i; if ($i>max) max=$i; if ($i<min) min=$i} print $1, int(a[$1]/(NF-1)), max, min}'
Demo:
$awk '{if (NR==1) print "Experiment Name Average Max Min"; min=$2; max=$2; for (i=2; i<=NF; i++) {a[$1]+=$i; if ($i>max) max=$i; if ($i<min) min=$i} print $1, int(a[$1]/(NF-1)), max, min}' file.txt | column -t
Experiment Name Average Max Min
Experiment1 4 9 0
collect1 27 84 0
jump1 22 82 -1
exp2 5 22 0
jump2 26 88 5
taker1 11 44 2
$cat file.txt
Experiment1 9 8 1 2 9 0 2 3 4 5
collect1 83 39 84 2 1 3 0 9
jump1 82 -1 9 26 8 9
exp2 22 0 7 1 0 7 3 2
jump2 88 7 6 5
taker1 5 5 44 2 3
$
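For the record, the original bash loop can be repaired as well. A minimal sketch (integer arithmetic, reading mystats.txt as above): initialize min and max from the data rather than 0, compare with -lt/-gt (inside [ ], < and > are treated as redirections, which is why the original comparisons silently misbehaved), and assign variables directly instead of using set:
#!/bin/bash
{
    echo "Experiment Name Average Max Min"
    while read -r name numbers; do
        [ -z "$numbers" ] && continue          # skip blank lines
        sum=0; len=0; min=""; max=""
        for arg in $numbers; do
            sum=$((sum + arg))                 # arithmetic expansion instead of expr
            if [ -z "$min" ] || [ "$arg" -lt "$min" ]; then min=$arg; fi
            if [ -z "$max" ] || [ "$arg" -gt "$max" ]; then max=$arg; fi
            len=$((len + 1))
        done
        echo "$name $((sum / len)) $max $min"
    done < mystats.txt
} | column -t
The rows come out in file order; sort the data lines first if the chart must be alphabetical.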

hadoop mapreduce.partition.keypartitioner.options not working

I want all the records whose key has the same first field to be partitioned to the same reducer, for example the [ 11 * * * ] data.
But the keypartitioner does not seem to work, and I really don't know why.
The script run.sh is:
#!/usr/bin/sh
hadoop fs -rm -r /training/likang/tmp2
hadoop fs -rm /training/likang/tmp/testfile
hadoop fs -put testfile1 /training/likang/tmp/testfile
hadoop-streaming -D stream.map.output.field.separator="\t" \
-D stream.num.map.output.key.fields=2 \
-D map.output.key.field.separator="\t" \
-D mapreduce.partition.keypartitioner.options=-k1,1 \
-D mapreduce.job.maps=2 \
-D mapreduce.job.reduces=2 \
-D mapred.job.name="lk_filt_rid" \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
-input /training/likang/tmp/testfile \
-output /training/likang/tmp2 \
-mapper "cat" -reducer "cat"
hadoop fs -cat /training/likang/tmp2/part-00000
echo "------------------"
hadoop fs -cat /training/likang/tmp2/part-00001
The input file testfile1 is:
11 5 333 111
11 5 777 000
11 3 888 999
11 9 988 888
11 7 234 2342
11 5 4 4
15 9 230 134
12 8 232 834
15 77 220 000
15 33 256 399
11 5 999 888
15 9 222 111
14 88 372 233
15 9 66 77
11 5 821 221
11 0 11 11
15 0 22 22
12 0 33 33
14 0 44 44
The result shows that the [ 11 * * * ] data is not all sent to the same reducer... Does anybody know why? Thank you.
Now I know: it helps to delete this line:
-D map.output.key.field.separator="\t" \
After deleting this option the result is right, but the reason is even more confusing. The default value of map.output.key.field.separator seems to be just a tab, yet spelling it out explicitly breaks the job. A likely explanation: inside double quotes the shell does not turn \t into a tab, so Hadoop receives the literal two-character string \t as the separator; that string never occurs in the data, so the whole line becomes a single key field and -k1,1 no longer isolates the first column.
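If the separator must be spelled out, one possible workaround (an untested sketch) is bash's ANSI-C quoting, which expands $'\t' to a real tab before Hadoop ever sees it; note this requires running the script under bash rather than plain sh:
#!/bin/bash
# Sketch: $'\t' produces an actual tab character, unlike "\t".
hadoop-streaming -D stream.map.output.field.separator=$'\t' \
    -D stream.num.map.output.key.fields=2 \
    -D map.output.key.field.separator=$'\t' \
    -D mapreduce.partition.keypartitioner.options=-k1,1 \
    -D mapreduce.job.maps=2 \
    -D mapreduce.job.reduces=2 \
    -D mapred.job.name="lk_filt_rid" \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
    -input /training/likang/tmp/testfile \
    -output /training/likang/tmp2 \
    -mapper "cat" -reducer "cat"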

Replace repeated elements in a list with unique identifiers

I have a list like the below:
1 . Fred 1 6 78 8 09
1 1 Geni 1 4 68 9 34
2 . Sam 3 4 56 6 89
3 . Flit 2 4 56 8 34
3 4 Dog 2 5 67 8 78
3 . Pig 2 5 67 2 21
(except the real list is 40 million lines long).
There are repeated elements in the second column (i.e. the "."), and I want to replace these with unique identifiers (e.g. ".1", ".2", ".3" ... ".n").
I tried to do this with a bash loop / sed combination, but it didn't work...
Failed attempt:
for i in 1..4
do
sed -i "s_//._//."$i"_"$i""
done
(Essentially, I was trying to get sed to replace each nth "." with ".n", but this didn't work.)
Here's a way to do it with awk (assuming your file is called input):
$ awk '$2=="."{$2="."++counter}{print}' input
1 .1 Fred 1 6 78 8 09
1 1 Geni 1 4 68 9 34
2 .2 Sam 3 4 56 6 89
3 .3 Flit 2 4 56 8 34
3 4 Dog 2 5 67 8 78
3 .4 Pig 2 5 67 2 21
The awk program replaces the second column ($2) by a string formed by concatenating . and a pre-incremented counter (++counter) if the second column was exactly .. It then prints out all the columns it got (with $2 modified or not) ({print}).
Plain bash alternative:
c=1
while read -r a b line ; do
if [ "$b" == "." ] ; then
echo "$a ."$((c++))" $line"
else
echo "$a $b $line"
fi
done < input
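For reference, running this loop on the sample input should print the same result as the awk version above, since c only advances on lines whose second field is ".":
1 .1 Fred 1 6 78 8 09
1 1 Geni 1 4 68 9 34
2 .2 Sam 3 4 56 6 89
3 .3 Flit 2 4 56 8 34
3 4 Dog 2 5 67 8 78
3 .4 Pig 2 5 67 2 21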
Since your question is tagged sed and bash, here are a few examples for completeness.
Bash only
Use parameter expansion. The second column will be unique, but not sequential:
i=1; while read line; do echo ${line/\./.$((i++))}; done < input
1 .1 Fred 1 6 78 8 09
1 1 Geni 1 4 68 9 34
2 .3 Sam 3 4 56 6 89
3 .4 Flit 2 4 56 8 34
3 4 Dog 2 5 67 8 78
3 .6 Pig 2 5 67 2 21
Bash + sed
sed cannot increment variables; that has to be done externally.
For each line, increment $i if the line contains a ".", then let sed append $i after the ".":
i=0
while read line; do
[[ $line == *.* ]] && i=$((i+1))
sed "s#\.#.$i#" <<<"$line"
done < input
Output:
1 .1 Fred 1 6 78 8 09
1 1 Geni 1 4 68 9 34
2 .2 Sam 3 4 56 6 89
3 .3 Flit 2 4 56 8 34
3 4 Dog 2 5 67 8 78
3 .4 Pig 2 5 67 2 21
You can use this command (note that the counter advances on every line, matched or not, and the "." is replaced by a bare number rather than kept):
awk '{gsub(/\./,c++);print}' filename
Output:
1 0 Fred 1 6 78 8 09
1 1 Geni 1 4 68 9 34
2 2 Sam 3 4 56 6 89
3 3 Flit 2 4 56 8 34
3 4 Dog 2 5 67 8 78
3 5 Pig 2 5 67 2 21

Linux: GNU sort does not sort seq

Title sums it up.
$ echo `seq 0 10` `seq 5 15` | sort -n
0 1 2 3 4 5 6 7 8 9 10 5 6 7 8 9 10 11 12 13 14 15
Why doesn't this work?
Even if I don't use seq:
echo '0 1 2 3 4 5 6 7 8 9 10 5 6 7 8 9 10 11 12 13 14 15' | sort -n
0 1 2 3 4 5 6 7 8 9 10 5 6 7 8 9 10 11 12 13 14 15
And even ditching echo directly:
$ echo '0 1 2 3 4 5 6 7 8 9 10 5 6 7 8 9 10 11 12 13 14 15' > numbers
$ sort -n numbers
0 1 2 3 4 5 6 7 8 9 10 5 6 7 8 9 10 11 12 13 14 15
sort(1) sorts lines. You have to parse whitespace-delimited data yourself:
echo `seq 0 10` `seq 5 15` | tr " " "\n" | sort -n
Because you need newlines for sort:
$ echo `seq 0 10` `seq 5 15` | tr " " "\\n" | sort -n | tr "\\n" " "; echo ""
0 1 2 3 4 5 5 6 6 7 7 8 8 9 9 10 10 11 12 13 14 15
$
You have a single line of input. There is nothing to sort.
The command as you typed it results in the sequence of numbers being all passed to sort in one line. That's not what you want. Just pass the output of seq directly to sort:
(seq 0 10; seq 5 15) | sort -n
By the way, as you just found out, the construct
echo `command`
doesn't usually do what you expect and is redundant: it tells the shell to capture the output of command and pass it to echo, which prints it again with the lines collapsed into one. Just let the output of the command go through directly (unless you really mean to have it processed by echo, maybe to expand escape sequences, or to collapse everything to one line).
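A quick demonstration of the collapsing: the substitution captures seq's multi-line output, word splitting turns the newlines into word breaks, and echo rejoins the words with single spaces.
$ seq 0 3
0
1
2
3
$ echo `seq 0 3`
0 1 2 3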
