How to use awk and sed to compute statistics on data - bash

I have a raw data file and I want to generate an output file, both of which are shown below. The rule is: column 1 of the output is column 2 of the raw file, and column 2 of the output is the mean of the last-column values over all raw-data lines whose column 2 matches. For example, value1 in the output is (25 + 24.3846 + 13.8972 + 1.33333 + 1) / 5.
#Raw data
4 compiler-compiler 100000 99975 1 25
4 compiler-compiler 100000 99683 13 24.3846
4 compiler-compiler 100000 93649 457 13.8972
4 compiler-compiler 100000 99764 177 1.33333
4 compiler-compiler 100000 99999 1 1
4 compiler-sunflow 100000 99999 1 1
4 compiler-sunflow 100000 99674 11 29.6364
4 compiler-sunflow 100000 93467 423 15.4444
4 compiler-sunflow 100000 99694 159 1.92453
4 compiler-sunflow 100000 99938 4 15.5
4 compress 100000 99997 1 3
4 compress 100000 99653 10 34.7
4 compress 100000 93639 454 14.011
4 compress 100000 99666 173 1.93064
4 compress 100000 99978 4 5.5
4 serial 100000 99998 1 2
4 serial 100000 99932 6 11.3333
4 serial 100000 93068 460 15.0696
4 serial 100000 99264 206 3.57282
4 serial 100000 99997 3 1
4 sunflow 100000 99998 1 2
4 sunflow 100000 99546 18 25.2222
4 sunflow 100000 93387 481 13.7484
4 sunflow 100000 99752 189 1.31217
4 sunflow 100000 99974 4 6.5
4 xml-transfomer 100000 99994 1 6
4 xml-transfomer 100000 99964 3 12
4 xml-transfomer 100000 93621 463 13.7775
4 xml-transfomer 100000 99540 199 2.31156
4 xml-transfomer 100000 99986 2 7
4 xml-validation 100000 99996 1 4
4 xml-validation 100000 99563 16 27.3125
4 xml-validation 100000 93748 451 13.8625
4 xml-validation 100000 99716 190 1.49474
4 xml-validation 100000 99979 3 7
#Output data
compiler-compiler value1
....
xml-transfomer value2
xml-validation value3
I think the commands awk & sed can do this, but I do not know how to get there.

sed cannot be used here since it does not support math operations. It's a job for awk:
awk 'NR>1{c[$2]++;s[$2]+=$(NF)}END{for(i in c){print i,s[i]/c[i]}}' input.txt
Explanation:
NR>1 { c[$2]++; s[$2]+=$(NF) }
NR>1 means that the following block is executed for every line except the first. $2 is the value of the second column. NF is the number of fields per line, so $(NF) is the value of the last column. c and s are associative arrays: c counts the occurrences of each $2 value, and s accumulates the total of the last-column values, grouped by $2.
END {for(i in c){print i,s[i]/c[i]}}
END means the following action will take place after the last line of input has been processed. The for loop iterates through c and outputs the name and the mean for all indexes in c.
Output:
xml-validation 10.7339
compiler-compiler 13.123
serial 6.59514
sunflow 9.75655
xml-transfomer 8.21781
compiler-sunflow 12.7011
compress 11.8283
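As a sanity check against the raw data above: for compiler-compiler the mean is (25 + 24.3846 + 13.8972 + 1.33333 + 1) / 5 = 65.61513 / 5 ≈ 13.123, which matches the compiler-compiler line in the output.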
Note that you have no influence on the output order when using an associative array this way, because awk's for-in traversal order is unspecified. If you care about the output order you might use the following command:
awk 'NR>1 && $2!=n && c {print n,t/c; c=t=0} NR>1{n=$2; c++; t+=$(NF)} END{if(c) print n,t/c}'
This command does not use associative arrays; it prints the stats for a group as soon as $2 changes, with the END block flushing the final group. Note that this requires the input to be sorted by $2.
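If the input were not already grouped by $2, a sketch of the full pipeline (assuming, as the answer's NR>1 does, that the first line of input.txt is a header) keeps the header out of the sort:
( head -n 1 input.txt; tail -n +2 input.txt | sort -k2,2 ) |
awk 'NR>1 && $2!=n && c {print n,t/c; c=t=0} NR>1{n=$2; c++; t+=$(NF)} END{if(c) print n,t/c}'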

Related

How to use this awk command without affecting the header

Good night. I have these two files:
File 1 - with phenotype information; the first column holds the IDs, and the original file has 400 rows:
ID a b c d
215 2 25 13.8354303 15.2841303
222 2 25.2 15.8507278 17.2994278
216 2 28.2 13.0482192 14.4969192
223 11 15.4 9.2714745 11.6494745
File 2 - with SNP information; the original file has 400 lines and 42,000 characters per line:
ID t u j l
215 2 0 2 1
222 2 0 1 1
216 2 0 2 1
223 2 0 2 2
217 2 0 2 1
218 0 2 0 2
And I need to remove from file 2 the individuals that do not appear in file 1, for example:
ID t u j l
215 2 0 2 1
222 2 0 1 1
216 2 0 2 1
223 2 0 2 2
I used this code:
awk 'NR==FNR{a[$1]; next} $1 in a{print $0}' file1 file2 > file3
and I can get this output(file 3):
215 2 0 2 1
222 2 0 1 1
216 2 0 2 1
223 2 0 2 2
but I lose the header. How do I keep the header?
To keep the header of the second file, add a condition{action} like this:
awk 'NR==FNR {a[$1]; next}
FNR==1 {print $0; next} # <= this will print the header of file2.
$1 in a {print $0}' file1 file2
NR holds the total record number, while FNR is the per-file record number: it counts the records of the file currently being processed. The next statements are also important, so that awk moves on to the next record instead of trying the remaining patterns and actions on the current one.
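To see the difference concretely, this small illustration (hypothetical, just for inspection) prints all three variables while awk reads both files:
awk '{print FILENAME, "NR=" NR, "FNR=" FNR}' file1 file2
With the samples above, the first line of file2 prints as file2 NR=6 FNR=1: NR keeps counting across files while FNR restarts, which is why NR==FNR is true only while file1 is being read.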

Max of all columns based on distinct first column

How do I change the code if I have more than two columns? Let's say the data is like this:
ifile.dat
1 10 15
3 34 20
1 4 22
3 32 33
5 3 46
2 2 98
4 20 100
3 13 23
4 50 65
1 40 76
2 20 22
How do I achieve this?
ofile.dat
1 40 76
2 20 98
3 34 33
4 50 100
5 3 46
I mean the maximum of each column, grouped by the first column. Thanks.
Here is what I have tried (on a sample file with 13 columns), but the highest value does not come up this way:
cat input.txt | sort -k1,1 -k2,2nr -k3,3nr -k4,4nr -k5,5nr -k6,6nr -k7,7nr -k8,8nr -k9,9nr -k10,10nr -nrk11,11 -nrk12,12 -nrk13,13 | sort -k1,1 -u
Instead of relying on sort, you could switch over to something more robust like GNU awk (the arrays-of-arrays, PROCINFO["sorted_in"], and asorti used below are gawk extensions):
awk 'BEGIN{PROCINFO["sorted_in"] = "@val_num_asc"} {for(i=2;i<=NF;++i) if (a[$1][i]<$i){a[$1][i]=$i}} END{n=asorti(a, asorted); for(j in asorted){col1=asorted[j]; print col1, a[col1][2], a[col1][3]}}' input.txt
That's a mouthful. It breaks down like:
Before processing the file, set PROCINFO["sorted_in"] to "@val_num_asc" so that for-in traversal visits array elements in ascending numeric order of their values: (BEGIN{PROCINFO["sorted_in"] = "@val_num_asc"})
Loop through each field in the record for(i=2;i<=NF;++i)
Test whether the current value is greater than the value stored in the array indexed by the first field's value and the field position; if it is greater, store it in place of whatever is currently held there: (if (a[$1][i]<$i){a[$1][i]=$i})
When done processing all of the records, sort the array into new array asorted: (END{n=asorti(a, asorted);)
Iterate through the sorted key list and print each key together with its stored maxima: (for(j in asorted){col1=asorted[j]; print col1, a[col1][2], a[col1][3]})
There may be a more elegant way to do this in awk, but this will do the trick.
Example:
:~$ cat test
1 10 15
3 34 20
1 4 22
3 32 33
5 3 46
2 2 98
4 20 100
3 13 23
4 50 65
1 40 76
2 20 22
:~$ awk 'BEGIN{PROCINFO["sorted_in"] = "@val_num_asc"} {for(i=2;i<=NF;++i) if (a[$1][i]<$i){a[$1][i]=$i}} END{n=asorti(a, asorted); for(j in asorted){col1=asorted[j]; print col1, a[col1][2], a[col1][3]}}' test
1 40 76
2 20 98
3 34 33
4 50 100
5 3 46
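If GNU-awk-only features are a concern, a portable variant (my own sketch, not part of the original answer) keys a flat array on the pair ($1, column index), handles any number of columns, and leaves the final ordering to sort; it assumes every row has the same number of fields:
awk '{
  seen[$1]                                   # remember each distinct key (value unused)
  for (i = 2; i <= NF; i++)
    if (!(($1, i) in max) || $i + 0 > max[$1, i] + 0)
      max[$1, i] = $i                        # running per-column maximum for this key
  if (NF > cols) cols = NF
} END {
  for (k in seen) {
    line = k
    for (i = 2; i <= cols; i++) line = line " " max[k, i]
    print line
  }
}' ifile.dat | sort -n -k1,1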

How to subtract every nth from (n+3)th line in awk?

I have 4-column data files with approximately 100 lines. I'd like to subtract the 4th-column value of every nth line from that of the (n+3)th line and print the difference in a new column ($5). The column values do not follow a regular pattern.
My sample file:
cat input
1 2 3 20
1 2 3 10
1 2 3 5
1 2 3 20
1 2 3 30
1 2 3 40
1 2 3 .
1 2 3 .
1 2 3 . (and so on)
Output should be:
1 2 3 20 0 #(20-20)
1 2 3 10 20 #(30-10)
1 2 3 5 35 #(40-5)
1 2 3 20 ? #(. - 20)
1 2 3 30 ? #(. - 30)
1 2 3 40 ? #(. - 40)
1 2 3 .
1 2 3 .
1 2 3 . (and so on)
How can I do this in awk?
Thank you
For this I think the easiest thing is to read through the file twice. The first time (the NR==FNR block) we save all the 4th-column values in an array indexed by line number. The second block is executed during the second pass and creates a 5th column with the desired calculation (checking first to make sure we don't read past the end of the file).
$ cat input
1 2 3 20
1 2 3 10
1 2 3 5
1 2 3 20
1 2 3 30
1 2 3 40
$ awk 'NR==FNR{a[NR]=$4; last=NR; next} {$5 = (FNR+3 <= last ? a[FNR+3] - $4 : "")}1' input input
1 2 3 20 0
1 2 3 10 20
1 2 3 5 35
1 2 3 20
1 2 3 30
1 2 3 40
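If reading the file twice is inconvenient (for example, when the data arrives on a pipe), a single pass that buffers the lines should also work. This is my own sketch, checked only against the sample above:
awk '{line[NR] = $0; v[NR] = $4}
     NR > 3 {print line[NR-3], ($4 ~ /^[0-9]+(\.[0-9]+)?$/ ? $4 - v[NR-3] : "?")}
     END {for (i = NR - 2; i <= NR; i++) if (i in line) print line[i]}' input
Once line n+3 has been read, line n can be emitted with its difference appended; the END block flushes the last three buffered lines, which have no partner three lines ahead.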
You can also do this using tac + awk + tac: reversing the file turns "three lines ahead" into "three lines behind", so a single forward pass can look back at values it has already stored:
tac input |
awk '{a[NR]=$4} NR>3 { $5 = (a[NR-3] ~ /^[0-9]+$/ ? a[NR-3] - $4 : "?") } 1' |
tac | column -t
1 2 3 20 0
1 2 3 10 20
1 2 3 5 35
1 2 3 20 ?
1 2 3 30 ?
1 2 3 40 ?
1 2 3 .
1 2 3 .
1 2 3 .

adding columns for specified rows & dividing by the number of rows using awk

So I'm really new to using Linux and shell commands; help would really be appreciated!
I have a file of 1050 rows and 8 columns. Example:
anger 1 0 5 101 13 2 somesentenceofwords
anger 2 0 5 101 23 3 somesentenceofwords
anger 3 0 3 101 35 3 somesentenceofwords
anger 4 0 2 101 23 3 somesentenceofwords
arch 5 0 3 101 34 12 somesentenceofwords
arch 6 0 2 101 45 23 somesentenceofwords
arch 7 0 2 101 23 12 somesentenceofwords
hand 8 9 0 101 32 21 somesentenceofwords
hand 9 0 2 101 23 12 somesentenceofwords
What I want to do is: whenever the first column is the same across a group of rows, output the sum of the 6th column over those rows divided by the number of rows (essentially the group average).
So in the example, since the first 4 rows are all anger, I want the average of column 6 over all rows with anger in column 1: (13 + 23 + 35 + 23) / 4 = 23.5. It would then do the same for arch, then hand, and so on.
Example output:
anger 23.5
arch 34
hand 27.5
I tried this just to see if I could make it work for a single group, where column 1 equals a specific string, but I couldn't even get that to work.
$ awk '{if($1="anger"){sum+=$6} {print sum}}' filename
Is this possible?
Using awk (note that in your attempt $1="anger" is an assignment; the equality test is $1=="anger"). This version also remembers the order in which keys first appear, so the output follows the input order:
awk '!($1 in s){b[++i]=$1; s[$1]=0} {c[$1]++; s[$1]+=$6}
END{for (k=1; k<=i; k++) printf "%s %.1f\n", b[k], s[b[k]]/c[b[k]]}' file
anger 23.5
arch 34.0
hand 27.5
Pretty straightforward with awk:
$ awk '{a[$1]+=$6;b[$1]++}END{for (i in a) print i,a[i]/b[i]}' file
hand 27.5
arch 34
anger 23.5
How does this work?
The block {a[$1]+=$6;b[$1]++} is executed for every line that is read. We build two maps, one storing the running sum for each key and one storing the count for each key.
The block END{for (i in a) print i,a[i]/b[i]} is executed after all lines are read. We iterate over the keys of the first map and print each key together with the sum divided by the count (i.e. the mean).
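If you need the groups in a fixed order rather than awk's arbitrary for-in order, one simple option (a sketch, not from the original answer) is to sort the result afterwards:
awk '{a[$1]+=$6; b[$1]++} END{for (i in a) print i, a[i]/b[i]}' file | sort -k1,1
The first answer above achieves ordered output differently, by remembering the order in which keys first appear.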

Unix sort using unknown delimiter (last column)

My data looks like this:
Adelaide Crows 5 2 3 0 450 455 460.67 8
Essendon 5 5 0 0 622 352 955.88 20
Fremantle 5 3 2 0 439 428 598.50 12
As you can tell, there is a mixture of spaces and tabs. I need to sort by the last column, descending, so the output looks like this:
Essendon 5 5 0 0 622 352 955.88 20
Fremantle 5 3 2 0 439 428 598.50 12
Adelaide Crows 5 2 3 0 450 455 460.67 8
The whole data consists of all AFL teams.
How can I achieve this using sort? Am I right in attempting to use the $ character to address the end of the line? I also need the second-last column as a secondary key, so that ties in the last column are broken by the second-last column. Code so far:
sort -n -t$'\t' -k 9,9 -k 8,8 tmp
How do I take into account that the football team names themselves contain whitespace?
Here is the file being sorted (filename: 'tmp').
You could copy the last field into the first position using awk first, then sort by the first field, then get rid of the added field using cut:
awk '{print($NF" "$0)}' sample.txt | sort -k1,1 -n -r -t' ' | cut -f2- -d' '
Port Adelaide 5 5 0 0 573 386 916.05 20
Essendon 5 5 0 0 622 352 955.88 20
Sydney Swans 5 4 1 0 533 428 681.68 16
Hawthorn 5 4 1 0 596 453 620.64 16
Richmond 5 3 2 0 499 445 579.68 12
..
..
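The command above orders only by the last column. If the second-last column should break ties, as the question asks, a similar sketch (my own, using the same sample.txt) prepends both fields, sorts on both, and then drops them:
awk '{print $NF, $(NF-1), $0}' sample.txt | sort -k1,1nr -k2,2nr | cut -d' ' -f3-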
