Multiple column sorting hadoop streaming (EMR) - sorting

I'm trying to sort each column of the mapper output differently. My output looks like this:
xx yy 2 4
xx yy 1 5
xx yy 5 39
xx yy 8 3
So the first 2 columns are text and the last 2 columns are numbers.
This is how I try to do this:
-D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapreduce.lib.partition.KeyFieldBasedComparator
-D "mapreduce.partition.keycomparator.options=-k1,2 -k3,3nr -k4,4nr"
It just doesn't sort numerically ... only alphabetically.
I also tried:
-D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapreduce.lib.partition.KeyFieldBasedComparator
-D mapreduce.partition.keycomparator.options='-k1,2 -k3,3nr -k4,4nr'
but got an error that -k3,3nr is not a valid parameter.
Ideas?
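For reference, an invocation along these lines is what I would expect to be needed, since the comparator can only sort on fields that are actually part of the streaming key; the jar path, mapper/reducer names and input/output paths below are placeholders, and the property names assume Hadoop 2.x:
hadoop jar /path/to/hadoop-streaming.jar \
    -D stream.map.output.field.separator=" " \
    -D stream.num.map.output.key.fields=4 \
    -D mapreduce.map.output.key.field.separator=" " \
    -D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapreduce.lib.partition.KeyFieldBasedComparator \
    -D "mapreduce.partition.keycomparator.options=-k1,2 -k3,3nr -k4,4nr" \
    -input /path/to/input \
    -output /path/to/output \
    -mapper my_mapper.py \
    -reducer my_reducer.py
The generic -D options have to come before the streaming-specific ones. Without the key-field settings, the key is just the text up to the first tab and is treated as a single field, so -k3,3nr and -k4,4nr have nothing numeric to act on. If rows sharing the same text prefix also need to reach the same reducer, a KeyFieldBasedPartitioner with -k1,2 partitioner options is usually added as well.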

Related

Shell script to read two columns in CSV and count how many unique values in column 2 per each unique value in column 1

I have a sheet that looks like
1 a
2 a
3 b
3 c
2 a
2 f
2 a
1 d
The output I need is this:
1 2
2 3
3 2
cut -f 1,2 | sort | uniq -c | sort
I tried the above but I am doing something wrong. New to shell scripting here.
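One way to attack this, assuming "how many unique values in column 2" means counting distinct (column 1, column 2) pairs and that the data sits in a hypothetical file sheet.txt, is to de-duplicate the pairs and count per key in awk:
awk '!seen[$1, $2]++ { count[$1]++ } END { for (k in count) print k, count[k] }' sheet.txt | sort -n
This is only a sketch: for the sample above it prints 2 for key 2 as well (its distinct column-2 values are a and f), so the counting rule may need adjusting if something else is intended.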

Combining multiple awk output statements into one line

I have some ASCII files I'm processing, with 35 columns each and a variable number of rows. I need to take the difference between two columns (N+1), and place the results into a duplicate ASCII file as column number 36. Then I need to take another column, divide it (row by row) by column 36, and place that result into the same duplicate ASCII file as column 37.
I've done similar processing in the past, but by writing a temp file for each awk command and reading each successive temp file back in to eventually create a final ASCII file, then deleting the temp files afterwards. I'm hoping there is an easier/faster method than having to create a bunch of temp files.
Below is an initial working processing step that the awk commands above would need to follow and fit into. This step gets the data from foo.txt, removes the header, and processes only the rows containing a particular, but varying, string.
cat foo.txt | tail -n +2 | awk '$17 ~ /^[F][0-9][0-9][0-9]$/' >> foo_new.txt
There's another processing step, for different data files, for which I would also need the 2 new columns discussed earlier. This step simply appends a unique file name, taken from what's being catted, as the last column of every row in a new ASCII file. This command is actually in a loop with varying input files, but I've simplified it here.
cat foo.txt | tail -n +2 | awk -v fname="$fname" '{print $0 OFS fname;}' >> foo_new.txt
An example of one of the foo.txt files.
20 0 5 F001
4 2 3 F002
12 4 8 F003
100 10 29 O001
Below is the desired example foo_new.txt, with the 2 requested columns of awk output as the last 2 columns. In this example, column 5 is the difference between columns 3 and 2, plus 1. Column 6 is the result of column 1 divided by column 5.
20 0 5 F001 6 3.3
4 2 3 F002 2 2.0
12 4 8 F003 5 2.4
Below is the second example foo_new.txt. The last column is an example of fname, which is computed in the shell script and passed to awk. I don't care whether the fname results in column 7 are at the end or placed between columns 4 and 5, so long as it works with the other awk statements.
20 0 5 F001 6 3.3 C1
4 2 3 F002 2 2.0 C2
12 4 8 F003 5 2.4 C3
My best attempt so far is below, but unfortunately it produces a file with the original output first and the added output below it. I'd like to have the added output appended as columns (#5 and #6).
cat foo.txt | tail -n +2 | awk '$17 ~ /^[F][0-9][0-9][0-9]$/' >> foo_new.txt
cat foo_new.txt | awk '{print $4=$3-$2+1, $5=$1/($3-$2+1)}' >> foo_new.txt
Consider an input file data with header line like this (based closely on your minimal example):
Col1 Col2 Col3 Col4
20 0 5 F001
4 2 3 F002
12 4 8 F003
100 10 29 O001
You want the output to contain a column 5 that is the value of $3 - $2 + 1 (column 3 minus column 2, plus 1), a column 6 that is the value of column 1 divided by column 5 (with 1 decimal place in the output), and a file name that is based on a variable fname passed to the script but has a unique value for each line. You only want lines where column 4 matches F followed by 3 digits, and you want to skip the first line. That can all be written directly in awk:
awk -v fname=C '
NR == 1 { next }
$4 ~ /^F[0-9][0-9][0-9]$/ { c5 = $3 - $2 + 1
c6 = sprintf("%.1f", $1 / c5)
print $0, c5, c6, fname NR
}' data
You could write that on one line too:
awk -v fname=C 'NR==1{next} $4~/^F[0-9][0-9][0-9]$/ { c5=$3-$2+1; print $0,c5,sprintf("%.1f",$1/c5), fname NR }' data
The output is:
20 0 5 F001 6 3.3 C2
4 2 3 F002 2 2.0 C3
12 4 8 F003 5 2.4 C4
Clearly, you could change the file name so that the counter starts from 0 or 1 by using counter++ or ++counter respectively in place of the NR in the print statement, and you could format it with leading zeros or whatever else you want with sprintf() again. If you want to drop the first line of each file, rather than just the first file, change the NR == 1 condition to FNR == 1 instead.
Note that this does not need the preprocessing provided by cat foo.txt | tail -n +2.
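If the original loop over input files with a shell-computed fname is still needed, the same awk body can sit inside it. A sketch, where the input file names and the fname assignment are only stand-ins for whatever the real script does:
for f in file1.txt file2.txt; do          # hypothetical input files
    fname=...                             # however the real script derives it
    awk -v fname="$fname" '
        FNR == 1 { next }                 # skip each file's header line
        $4 ~ /^F[0-9][0-9][0-9]$/ {
            c5 = $3 - $2 + 1
            print $0, c5, sprintf("%.1f", $1 / c5), fname
        }' "$f"
done >> foo_new.txt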
I need to take the difference between two columns (N+1), and place the results into a duplicate ascii file on column number 36. Then, I need to take another column, and divide it (row by row) by column 36, and place that result into the same duplicate ascii file in column 37.
That's just:
awk -vN=9 -vanother_column=10 '{ v36 = $N - $(N+1); print $0, v36, $another_column / v36 }' input_file.tsv
I guess your file has some header / special first line, so if it is the first line, preserve it:
awk ... 'NR==1{print $0, "36_header", "37_header"} NR>1{ ... the script above ... }'
Taking the first 3 columns from the example you presented, and substituting N=2 and another_column=1, we get the following script:
# recreate input file
cat <<EOF |
20 0 5
4 2 3
12 4 8
100 10 29
EOF
tr -s ' ' |
tr ' ' '\t' > input_file.tsv
awk -vOFS=$'\t' -vFS=$'\t' -vN=2 -vanother_column=1 '{ tmp = $(N + 1) - $N; print $0, tmp, $another_column / tmp }' input_file.tsv
and it will output:
20 0 5 5 4
4 2 3 1 4
12 4 8 4 3
100 10 29 19 5.26316
This script:
awk -vOFS=$'\t' -vFS=$'\t' -vN=2 -vanother_column=1 '{ tmp = $(N + 1) - $N + 1; print $0, tmp, sprintf("%.1f", $another_column / tmp) }' input_file.tsv
produces output closer, I think, to what you want:
20 0 5 6 3.3
4 2 3 2 2.0
12 4 8 5 2.4
100 10 29 20 5.0
And I guess that by (N+1) you meant "the difference between two columns, with 1 added".

Vertically divide an array so we get minimum splits

I am thinking about the following problem.
I can have an array of strings like
Col1 Col2 Col3 Col4
aa aa aa aa
aaa aaa aaaaa aaa
aaaa aaaaaaa aa a
...........................
Actually it is a CSV file, and I need to find a way to divide it vertically into one or more files. The condition for splitting is that no file may contain a row that exceeds a certain number of bytes. For simplicity we can rewrite that array with the lengths:
Col1 Col2 Col3 Col4
2 2 2 2
3 3 5 3
4 7 2 1
...........................
And let's say the limit is 10, i.e. if > 9 we should split. So if we split into 2 files [Col1, Col2, Col3] and [Col4] this will not satisfy the condition because the first file will contain 3 + 3 + 5 > 9 in the second row and 4 + 7 + 2 > 9 in the third row. If we split into [Col1, Col2] and [Col3, Col4] this will not satisfy the condition because the first file will contain 4 + 7 > 9 in the third row. So we are splitting this into 3 files like [Col1], [Col2, Col3] and [Col4]. Now every file is correct and looks like:
File1 | File2 | File3
------------------------------
Col1 | Col2 Col3 | Col4
2 | 2 2 | 2
3 | 3 5 | 3
4 | 7 2 | 1
...............................
So it should split from left to right, giving as many columns as possible to the left file. The problem is that this file can be huge and I don't want to read it into memory, so the initial file is read line by line, and somehow I should determine the set of column indexes at which to split. Is that possible at all? I hope I described the problem well enough for you to understand it.
Generally, awk is quite good at handling large CSV files.
You could try something like this to retrieve the max length for each column and then decide how to split.
Let's say the file.txt contains
Col1;Col2;Col3;Col4
aa;aa;aa;aa
aaa;aaa;aaaaa;aaa
aaaa;aaaaaaa;aa;a
(Assuming Windows-style quoting.) Running the following:
> awk -F";" "NR>1{for (i=1; i<=NF; i++) max[i]=(length($i)>max[i]?length($i):max[i])} END {for (i=1; i<=NF; i++) printf \"%d%s\", max[i], (i==NF?RS:FS)}" file.txt
will output:
4;7;5;3
Could you try this on your real data set?
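If it helps, the split points themselves can also be computed in one pass by tracking, for every contiguous column range, the largest row total seen, and then grouping columns greedily from the left. A sketch along those lines, assuming the same semicolon-separated file.txt with a header line and a limit of 9 data bytes per row and per output file (both assumptions to adjust):
awk -F';' '
NR > 1 {
    # for every contiguous column range i..j, remember the largest row total seen so far
    for (i = 1; i <= NF; i++) {
        sum = 0
        for (j = i; j <= NF; j++) {
            sum += length($j)
            if (sum > grpmax[i, j]) grpmax[i, j] = sum
        }
    }
    nf = NF
}
END {
    limit = 9
    start = 1
    while (start <= nf) {
        end = start
        # extend the current output file as far to the right as the limit allows
        while (end < nf && grpmax[start, end + 1] <= limit) end++
        printf "file gets columns %d-%d\n", start, end
        start = end + 1
    }
}' file.txt
On the sample data this prints columns 1-1, 2-3 and 4-4, i.e. the [Col1], [Col2, Col3], [Col4] split described in the question. The per-row work is quadratic in the number of columns, which is fine for a few dozen columns, and a single column that is too wide on its own still ends up alone in its own file. Separator bytes are not counted here; add them to sum if they matter.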

Using awk to get the maximum value of a column, for each unique value of another column

So I have a file such as:
10 1 abc
10 2 def
10 3 ghi
20 4 elm
20 5 nop
20 6 qrs
30 3 tuv
I would like to get the maximum value of the second column for each value of the first column, i.e.:
10 3 ghi
20 6 qrs
30 3 tuv
How can I do this using awk or similar Unix commands?
You can use awk:
awk '$2>max[$1]{max[$1]=$2; row[$1]=$0} END{for (i in row) print row[i]}' file
Output:
10 3 ghi
20 6 qrs
30 3 tuv
Explanation:
The awk command uses an associative array max, with $1 as the key and $2 as the value. Whenever we encounter a $2 greater than the value already stored in max for that key, we update the entry and store the whole row in another associative array, row, under the same key. Finally, in the END section, we simply iterate over the associative array row and print each stored row.
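One caveat worth noting: if the second column can be zero or negative, the comparison against an uninitialised max entry silently drops a key's first row. A slightly more defensive variant of the same idea:
awk '!($1 in max) || $2 > max[$1] { max[$1] = $2; row[$1] = $0 } END { for (i in row) print row[i] }' file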
A shorter alternative with sort:
$ sort -k1,1 -k2,2nr file | sort -u -k1,1
10 3 ghi
20 6 qrs
30 3 tuv
Sort by field one, then by field two (numeric, reverse) so that the max for each key is at the top of its group; the second sort then picks the first line for each key.

How to group the information with a comma separator in the input?

My file is myInfo.txt, under the current directory DIR="$(pwd)".
It contains:
1000 at num 2049 28 2068100
1000 at num 2049 28 2623200
1000 at num 2049 28 2833000
1000 at num 2049 28 3499700
1000 at num 2051 28 2453500
1000 at num 2051 28 2969400
1000 at num 2051 28 3071300
1000 at num 2051 28 3838200
Now I run this bash script:
DIR="$(pwd)";
array=(2049 2051);
for k in "${array[@]}"; do
grep "at num ${k}" myInfo.txt | cut -d' ' -f 6 > ${DIR}/Info/nums/${k}.out
done
which groups the 6th-column information of each row (2068100, 2623200, ...) into the files 2049.out and 2051.out respectively, under the folder ${DIR}/Info/nums/.
My question is: Can I use comma separator like follows to get the same functionality as before:
for k in "${array[@]}"; do
grep "at num ${k}" myInfo.txt | cut -d',' -f 6 > ${DIR}/Info/nums/${k}.out
done
I tried to re-generate the myInfo.txt to satisfy the above command:
1000,at num 2049,28,2068100
1000,at num 2049,28,2623200
1000,at num 2049,28,2833000
1000,at num 2049,28,3499700
1000,at num 2051 28,2453500
1000,at num 2051 28,2969400
1000,at num 2051 28,3071300
1000,at num 2051 28,3838200
and tried to group the information in the same way as before. But it seems that cut -d',' -f 6 does not give the same result as cut -d' ' -f 6.
I wonder: is "cut -d',' -f 6" valid here? If it is valid, what format should I use when re-generating the myInfo.txt file? Thank you.
You can fix your problem in (at least) two ways:
Either you replace each space in myInfo.txt with a comma, and not just some, or you use the 4th column now (because when using the , as the delimiter, each column is separated by a comma).
In any case, you should fix up your file so that your comma separation is consistent across all lines (right now you sometimes have 3, sometimes 2 commas).
If your input record is structured like this:
1000,at num 2049,28,2068100
Then you need
cut -d',' -f 4
to extract the 4th column.
However if you want to use:
cut -d',' -f 6
then the input record should be formatted like this:
1000,at,num,2049,28,2068100
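To make the comma version behave like the original space version end to end, one option is to turn every space into a comma once and adjust the grep pattern accordingly, since "at num" becomes "at,num". A sketch, where myInfoCsv.txt is just a hypothetical name for the converted file and array/DIR are as in the question:
tr ' ' ',' < myInfo.txt > myInfoCsv.txt
for k in "${array[@]}"; do
    grep "at,num,${k}" myInfoCsv.txt | cut -d',' -f 6 > "${DIR}/Info/nums/${k}.out"
done
With every space converted, each line has six comma-separated fields, so cut -d',' -f 6 picks out the same number column that cut -d' ' -f 6 did before.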
