Top awk result cut off with count condition - sorting

I have a list of student records, grades, that I want to sort by GPA, returning the top 5 results. For some reason count<=7 and below cuts off the top result. I can't figure out why that is.
Also, is there a more elegant way to remove the first column after sorting than piping the results from sort back into awk?
user@machine:~> awk '{ if (count<=7) print $3, $0; count++; }' grades | sort -nr | awk '{ print $2 " " $3 " " $4 " " $5 }'
Ahmad Rashid 3.74 MBA
James Davis 3.71 ECE
Sam Chu 3.68 ECE
John Doe 3.54 ECE
Arun Roy 3.06 SS
James Adam 2.77 CS
Al Davis 2.63 CS
Rick Marsh 2.34 CS
user@machine:~> awk '{ if (count<=8) print $3, $0; count++; }' grades | sort -nr | awk '{ print $2 " " $3 " " $4 " " $5 }'
Art Pohm 4.00 ECE
Ahmad Rashid 3.74 MBA
James Davis 3.71 ECE
Sam Chu 3.68 ECE
John Doe 3.54 ECE
Arun Roy 3.06 SS
James Adam 2.77 CS
Al Davis 2.63 CS
Rick Marsh 2.34 CS
grades:
John Doe 3.54 ECE
James Davis 3.71 ECE
Al Davis 2.63 CS
Ahmad Rashid 3.74 MBA
Sam Chu 3.68 ECE
Arun Roy 3.06 SS
Rick Marsh 2.34 CS
James Adam 2.77 CS
Art Pohm 4.00 ECE
John Clark 2.68 ECE
Nabeel Ali 3.56 EE
Tom Nelson 3.81 ECE
Pat King 2.77 SS
Jake Zulu 3.00 CS
John Lee 2.64 EE
Sunil Raj 3.36 ECE
Charles Right 3.31 EECS
Diane Rover 3.87 ECE
Aziz Inan 3.75 EECS
Lu John 3.06 CS
Lee Chow 3.74 EE
Adam Giles 2.54 SS
Andy John 3.98 EECS

You actually do not need awk in this case; Unix sort will sort numerically by column. (As for why the top result gets cut off: your first awk runs before sort, so count<=7 passes only the first 8 lines of the unsorted file, and Art Pohm sits on line 9 of grades, so he never reaches sort at all. count<=8 passes 9 lines and picks him up.)
Given your input:
$ sort -k 3 -nr grades
Art Pohm 4.00 ECE
Andy John 3.98 EECS
Diane Rover 3.87 ECE
Tom Nelson 3.81 ECE
Aziz Inan 3.75 EECS
Lee Chow 3.74 EE
Ahmad Rashid 3.74 MBA
James Davis 3.71 ECE
Sam Chu 3.68 ECE
Nabeel Ali 3.56 EE
John Doe 3.54 ECE
Sunil Raj 3.36 ECE
Charles Right 3.31 EECS
Lu John 3.06 CS
Arun Roy 3.06 SS
Jake Zulu 3.00 CS
Pat King 2.77 SS
James Adam 2.77 CS
John Clark 2.68 ECE
John Lee 2.64 EE
Al Davis 2.63 CS
Adam Giles 2.54 SS
Rick Marsh 2.34 CS
Then just use head:
$ count=7
$ sort -k 3 -nr grades | head -n $count
Art Pohm 4.00 ECE
Andy John 3.98 EECS
Diane Rover 3.87 ECE
Tom Nelson 3.81 ECE
Aziz Inan 3.75 EECS
Lee Chow 3.74 EE
Ahmad Rashid 3.74 MBA
If you want to use gawk, you can collect the lines into an array and traverse it in sorted order via an index. You might do something along these lines:
awk -v count=7 'function sort_by_num(i1, v1, i2, v2) {
    return (v2 - v1)
}
{
    lines[NR] = $0
    idx[NR] = $3
}
END {
    asorti(idx, si, "sort_by_num")
    for (n = 1; n <= count; ++n) {
        print lines[si[n]]
    }
}' grades
Art Pohm 4.00 ECE
Andy John 3.98 EECS
Diane Rover 3.87 ECE
Tom Nelson 3.81 ECE
Aziz Inan 3.75 EECS
Ahmad Rashid 3.74 MBA
Lee Chow 3.74 EE
Note the difference in sort order between sort and the comparison function we defined in gawk for the last two lines (the two 3.74 GPAs). If tie order matters, your comparison function needs to decide what should happen with equal GPA values: gawk's sort is stable by default, while sort performs additional last-resort comparisons on the rest of the line. (You can also add the -s switch to make sort stable, and the output is then identical.)
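To see that last point in action, make the sort stable and the tie order of the two 3.74 entries matches the gawk output above:
$ sort -s -k 3 -nr grades | head -n 7
Art Pohm 4.00 ECE
Andy John 3.98 EECS
Diane Rover 3.87 ECE
Tom Nelson 3.81 ECE
Aziz Inan 3.75 EECS
Ahmad Rashid 3.74 MBA
Lee Chow 3.74 EE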

How can I append new lines in a CSV file and modify them in Unix

I am new to Unix in general and starting to learn shell scripting. I am working with a CSV file with the below sample rows (it's a large CSV file with 4 entries for each item):
Table 1
Item ID Time Available Location
0001 02/02/2021 08:00 Y NJ
0001 02/02/2021 09:00 N UT
0001 02/02/2021 10:00 Y AZ
0001 02/02/2021 11:00 Y CA
0002 02/02/2021 08:00 Y NJ
0002 02/02/2021 09:00 N UT
0002 02/02/2021 10:00 Y AZ
0002 02/02/2021 11:00 Y CA
I have another CSV with a bunch of item IDs as follows:
Table 2
Item ID Item_Name Item_Aux_ID Item_Aux_name
1001 IT_1 3323 IT_Aux_1
1002 IT_2 3325 IT_Aux_2
1003 IT_3 3328 IT_Aux_3
1010 IT_4 3333 IT_Aux_4
I would like to create new entries in the first CSV file (one entry for each Item in the second CSV file). Each new entry should be the same as the first row of the Table1 with the Item ID replaced appropriately. The expected output would be:
Table 1
Item ID Time Available Location
0001 02/02/2021 08:00 Y NJ
0001 02/02/2021 09:00 N UT
0001 02/02/2021 10:00 Y AZ
0001 02/02/2021 11:00 Y CA
0002 02/02/2021 08:00 Y NJ
0002 02/02/2021 09:00 N UT
0002 02/02/2021 10:00 Y AZ
0002 02/02/2021 11:00 Y CA
1001 02/02/2021 08:00 Y NJ
1002 02/02/2021 08:00 Y NJ
1003 02/02/2021 08:00 Y NJ
1010 02/02/2021 08:00 Y NJ
How do I write a script to achieve the above in Unix? Thanks in advance.
One awk idea:
awk '
NR==3 { # 1st file: skip 1st two lines (the header rows) then ...
copyline=$0 # make a copy of the 3rd line and ...
nextfile # skip to the next file
}
FNR>2 { # 2nd file: skip 1st two lines (the header rows) and ...
# replace the 1st field of variable "copyline" with 1st field of current input line and ...
# print the modified "copyline" to stdout
print gensub(/^[^[:space:]]*/,$1,1,copyline)
}
' file1.csv file2.csv
Comments removed:
awk '
NR==3 { copyline=$0; nextfile }
FNR>2 { print gensub(/^[^[:space:]]*/,$1,1,copyline) }
' file1.csv file2.csv
Collapsed further into a one-liner:
awk 'NR==3{copyline=$0;nextfile}FNR>2{print gensub(/^[^[:space:]]*/,$1,1,copyline)}' file1.csv file2.csv
This generates:
1001 02/02/2021 08:00 Y NJ
1002 02/02/2021 08:00 Y NJ
1003 02/02/2021 08:00 Y NJ
1010 02/02/2021 08:00 Y NJ
Once OP is satisfied with the output, and assuming the desire is to append the output to the first file, then ...
# change this:
' file1.csv file2.csv
# to this:
' file1.csv file2.csv >> file1.csv
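As an aside, gensub() is a GNU awk (gawk) extension. Also, while appending to an input file works here (the >> redirection does not truncate, and awk is finished with file1.csv after line 3 thanks to nextfile), a more cautious variant stages the new rows in a temporary file first (a sketch; new_rows.tmp is an arbitrary name):
awk 'NR==3{copyline=$0;nextfile}FNR>2{print gensub(/^[^[:space:]]*/,$1,1,copyline)}' file1.csv file2.csv > new_rows.tmp
cat new_rows.tmp >> file1.csv && rm new_rows.tmp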

AWK select columns in file2 based on partial header match in file1

I have a file ("File1") with ~40-80k columns and ~10k rows. The column headers in File1 are comprised of a unique identifier (e.g. "4b_1.04:") followed by a description (e.g. "Colname_3"). File2 contains a list of unique identifiers (i.e. not an exact match). Is there a way to extract columns from File1 using a list of column headers in File2 based on a partial match?
For example:
"File1"
patient_ID,response,0_4: Number of Variants,0_6: Number of CDS Variants,3_2.83: Colname_1,3_8.5102: Colname_2,4b_1.04: Colname_3,4_1.0: Colname_4,4_7.7101: Colname_5
ID_237.vcf,Benefit,13008,4343,0.65,1.23,0.17,2.57,4.22
ID_841.vcf,Benefit,15127,2468,0.9,0.68,2.39,1.8,1.6
ID_767.vcf,Benefit,5190,3261,0.73,1.16,1.99,0.79,1.17
ID_263.vcf,Benefit,16888,9548,0.61,1.66,0.73,2.42,1.55
ID_179.vcf,Benefit,3545,842,0.22,0.67,0.48,3.9,3.95
ID_408.vcf,Benefit,1427,4583,0.92,0.76,0.17,0.8,1.27
ID_850.vcf,Benefit,13835,4682,0.8,1.21,0.05,1.74,4.61
ID_856.vcf,Benefit,8939,8435,0.31,0.99,2.5,1.36,0.74
ID_328.vcf,Benefit,14220,8481,0.23,0.22,0.79,0.14,1.08
ID_704.vcf,Benefit,18145,914,0.66,1.69,0.17,0.4,3.13
ID_828.vcf,No_Benefit,4798,8163,0.74,0.89,1.04,1.68,1.29
ID_16.vcf,No_Benefit,6472,528,0.47,1.5,1.74,0.19,3.54
ID_380.vcf,No_Benefit,9827,8359,0.86,1.59,2.41,0.11,3.71
ID_559.vcf,No_Benefit,10247,9150,0.68,0.78,1.02,0.69,1.31
ID_466.vcf,No_Benefit,11092,4078,0.16,0.03,0.4,1.51,2.86
ID_925.vcf,No_Benefit,4809,2908,0.01,1.49,2.32,2.35,4.58
ID_573.vcf,No_Benefit,4341,4307,0.87,0.14,2.63,1.35,3.54
ID_497.vcf,No_Benefit,18279,663,0.1,1.06,2.96,1.98,4.22
ID_830.vcf,No_Benefit,18505,456,0.31,0.25,1.96,3.01,4.6
ID_665.vcf,No_Benefit,15072,2962,0.43,1.35,0.76,0.68,1.47
"File2"
patient_ID
response
0_4:
0_6:
4b_1.04:
3_2.83:
3_8.5102:
NB. The identifiers in File2 are in a different order to the column headers in File1, and the delimiter in File1 is a tab, not a comma (do tabs get converted to spaces when copy-pasting from SO?).
My attempt:
awk 'NR==FNR{T[$1]=NR; next} FNR==1 {MX=NR-1; for (i=1; i<=NF; i++) if ($i in T) C[T[$i]] = i } {for (j=1; j<=MX; j++) printf "%s%s", $C[j], (j==MX)?RS:"\t" }' File2 <(tr -s "," "\t" < File1)
Unfortunately, this prints the 'partial' header - I want the full header - and appears to struggle with File2 being in a different order to File1.
Expected outcome (awk 'BEGIN{FS=","; OFS="\t"}{print $1, $2, $3, $4, $7, $5, $6}' File1):
patient_ID response 0_4: Number of Variants 0_6: Number of CDS Variants 4b_1.04: Colname_3 3_2.83: Colname_1 3_8.5102: Colname_2
ID_237.vcf Benefit 13008 4343 0.17 0.65 1.23
ID_841.vcf Benefit 15127 2468 2.39 0.9 0.68
ID_767.vcf Benefit 5190 3261 1.99 0.73 1.16
ID_263.vcf Benefit 16888 9548 0.73 0.61 1.66
ID_179.vcf Benefit 3545 842 0.48 0.22 0.67
ID_408.vcf Benefit 1427 4583 0.17 0.92 0.76
ID_850.vcf Benefit 13835 4682 0.05 0.8 1.21
ID_856.vcf Benefit 8939 8435 2.5 0.31 0.99
ID_328.vcf Benefit 14220 8481 0.79 0.23 0.22
ID_704.vcf Benefit 18145 914 0.17 0.66 1.69
ID_828.vcf No_Benefit 4798 8163 1.04 0.74 0.89
ID_16.vcf No_Benefit 6472 528 1.74 0.47 1.5
ID_380.vcf No_Benefit 9827 8359 2.41 0.86 1.59
ID_559.vcf No_Benefit 10247 9150 1.02 0.68 0.78
ID_466.vcf No_Benefit 11092 4078 0.4 0.16 0.03
ID_925.vcf No_Benefit 4809 2908 2.32 0.01 1.49
ID_573.vcf No_Benefit 4341 4307 2.63 0.87 0.14
ID_497.vcf No_Benefit 18279 663 2.96 0.1 1.06
ID_830.vcf No_Benefit 18505 456 1.96 0.31 0.25
ID_665.vcf No_Benefit 15072 2962 0.76 0.43 1.35
Would you please try the following:
awk -F"\t" '
NR==FNR { # handle File2
partial[FNR] = $i # create a list of desired header (partial)
len = FNR # array lenth of "partial"
next
}
FNR==1 { # handle header line of File1
ofs = line = ""
for (j = 1; j <= len; j++) {
for (i = 1; i <= NF; i++) {
if (index($i, partial[j]) == 1) { # test the partial match
header[++n] = i # if match, store the position
line = line ofs $i
ofs = "\t"
}
}
}
print line # print the desired header (full)
}
FNR>1 { # handle body lines of File1
ofs = line = ""
for (i = 1; i <= n; i++) { # positions of desired columns
line = line ofs $header[i]
ofs = "\t"
}
print line
}
' File2 File1
Output:
patient_ID response 0_4: Number of Variants 0_6: Number of CDS Variants 4b_1.04: Colname_3 3_2.83: Colname_1 3_8.5102: Colname_2
ID_237.vcf Benefit 13008 4343 0.17 0.65 1.23
ID_841.vcf Benefit 15127 2468 2.39 0.9 0.68
ID_767.vcf Benefit 5190 3261 1.99 0.73 1.16
ID_263.vcf Benefit 16888 9548 0.73 0.61 1.66
ID_179.vcf Benefit 3545 842 0.48 0.22 0.67
ID_408.vcf Benefit 1427 4583 0.17 0.92 0.76
ID_850.vcf Benefit 13835 4682 0.05 0.8 1.21
ID_856.vcf Benefit 8939 8435 2.5 0.31 0.99
ID_328.vcf Benefit 14220 8481 0.79 0.23 0.22
ID_704.vcf Benefit 18145 914 0.17 0.66 1.69
ID_828.vcf No_Benefit 4798 8163 1.04 0.74 0.89
ID_16.vcf No_Benefit 6472 528 1.74 0.47 1.5
ID_380.vcf No_Benefit 9827 8359 2.41 0.86 1.59
ID_559.vcf No_Benefit 10247 9150 1.02 0.68 0.78
ID_466.vcf No_Benefit 11092 4078 0.4 0.16 0.03
ID_925.vcf No_Benefit 4809 2908 2.32 0.01 1.49
ID_573.vcf No_Benefit 4341 4307 2.63 0.87 0.14
ID_497.vcf No_Benefit 18279 663 2.96 0.1 1.06
ID_830.vcf No_Benefit 18505 456 1.96 0.31 0.25
ID_665.vcf No_Benefit 15072 2962 0.76 0.43 1.35
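If you want to run this against the comma-separated sample exactly as pasted above (rather than the real tab-delimited file), one option is to reuse the process-substitution trick from the question to turn the commas into tabs first (a sketch; assumes a shell with <(...) support such as bash, and no embedded commas in the data):
awk -F"\t" ' ... same script as above ... ' File2 <(tr ',' '\t' < File1)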

bash/awk: remove duplicate columns after merging of several files

I am using the following function, written in my bash script, in order to merge many files (containing multi-column data) into one big summary chart with all of the fused data:
table_fuse () {
paste -d'\t' "${rescore}"/*.csv >> "${rescore}"/results_2PROTS_CNE_strategy3.csv | column -t -s$'\t'
}
Taking two files as an example, this routine would produce the following concatenated chart as the result of the merging:
# file 1                          # file 2
Lig dG(10V1) dG(rmsd) Lig dG(10V2) dG(rmsd)
lig1 -6.78 0.32 lig1 -7.04 0.20
lig2 -5.56 0.14 lig2 -5.79 0.45
lig3 -7.30 0.78 lig3 -7.28 0.71
lig4 -7.98 0.44 lig4 -7.87 0.42
lig5 -6.78 0.28 lig5 -6.75 0.31
lig6 -6.24 0.24 lig6 -6.24 0.24
lig7 -7.44 0.40 lig7 -7.42 0.39
lig8 -4.62 0.41 lig8 -5.19 0.11
lig9 -7.26 0.16 lig9 -7.30 0.13
Since both files share the same first column (Lig), how would it be possible to remove (substitute with " ") all repeats of this column across the fused files, while keeping only the Lig column from the first CSV?
EDIT: As per OP's comments, to cover the [Ll]ig, [Ll]ig0123, or [Ll]ig(abcd) formats in the file, adding the following solution here.
awk '{first=$1;gsub(/[Ll]ig([0-9]+)?(\([a-zA-Z]+\))?/,"");print first,$0}' Input_file
With awk you could try the following, considering that you want to remove only the duplicated lig(digits) values here.
awk '{first=$1;gsub(/[Ll]ig([0-9]+)?/,"");print first,$0}' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
{
    first=$1                     ##Setting first column value to variable first here.
    gsub(/[Ll]ig([0-9]+)?/,"")   ##Globally substituting L/lig plus optional digits with NULL in the whole line.
    print first,$0               ##Printing first and the current line here.
}
' Input_file ##Mentioning Input_file name here.
It's not hard to replace repeats of phrases. What exactly works for your case depends on the precise input file format; but something like
sed 's/^\([^ ]*\)\( .* \)\1 /\1\2 /' file
would get rid of any repeat of the first-column token.
Perhaps a better solution is to use a more sophisticated merge tool, though. A simple Awk or Python script could take care of removing the first token from every file except the first while merging.
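For instance, a minimal awk sketch along those lines (an assumption: every input file contributes exactly three tab-separated columns, so the repeated Lig column lands at fields 4, 7, 10, and so on):
paste -d'\t' "${rescore}"/*.csv | awk 'BEGIN{FS=OFS="\t"} {
    out = $1                 # keep the Lig column from the first file
    for (i = 2; i <= NF; i++)
        if ((i - 1) % 3)     # skip fields 4, 7, 10, ... (the repeated Lig columns)
            out = out OFS $i
    print out
}'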
join would appear to be a solution, at least for the sample data provided by OP ...
Sample input data:
$ cat file1
Lig dG(10V1) dG(rmsd)
lig1 -6.78 0.32
lig2 -5.56 0.14
lig3 -7.30 0.78
lig4 -7.98 0.44
lig5 -6.78 0.28
lig6 -6.24 0.24
lig7 -7.44 0.40
lig8 -4.62 0.41
lig9 -7.26 0.16
$ cat file2
Lig dG(10V2) dG(rmsd)
lig1 -7.04 0.20
lig2 -5.79 0.45
lig3 -7.28 0.71
lig4 -7.87 0.42
lig5 -6.75 0.31
lig6 -6.24 0.24
lig7 -7.42 0.39
lig8 -5.19 0.11
lig9 -7.30 0.13
We can join these two files on the first column (aka field) like such:
$ join -j1 file1 file2
Lig dG(10V1) dG(rmsd) dG(10V2) dG(rmsd)
lig1 -6.78 0.32 -7.04 0.20
lig2 -5.56 0.14 -5.79 0.45
lig3 -7.30 0.78 -7.28 0.71
lig4 -7.98 0.44 -7.87 0.42
lig5 -6.78 0.28 -6.75 0.31
lig6 -6.24 0.24 -6.24 0.24
lig7 -7.44 0.40 -7.42 0.39
lig8 -4.62 0.41 -5.19 0.11
lig9 -7.26 0.16 -7.30 0.13
For more than 2 files some sort of repetitive/looping method would be needed to repeatedly join a new file into the mix.
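A rough sketch of such a loop (assumptions: every file is sorted on its first column, which join requires, and the file names merged.txt/merged.tmp are illustrative):
cp file1 merged.txt
for f in file2 file3; do
    join -j1 merged.txt "$f" > merged.tmp && mv merged.tmp merged.txt
done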

awk find the common rows to two files and combine the rows to a row in a third file

I am new to awk and the shell generally. I want to manipulate some files: find the rows common to two files based on a column,
and write the combination of the row from file1 and the row from file2 as a row in file3.
I have checked many proposed solutions online, which brought me to the following results.
The file structure and the commands I used are as follows.
file1.tab
name level regno dept sex
john 900 123 csc male
debby 800 378 mth male
ken 800 234 csc male
sol 700 923 mth female
dare 900 273 phy male
olanna 800 283 csc female
olumba 400 245 phy male
petrus 800 284 mth female
file2.tab
regno grade
234 A
283 D
123 A
273 B
I was able to get file3.tab with this command
awk 'NR==FNR{a[$1];next} $3 in a {print $0}' file2.tab file1.tab > file3.tab
file3.tab
name level regno dept sex
john 900 123 csc male
ken 800 234 csc male
dare 900 273 phy male
olanna 800 283 csc female
But what I want is the whole file1 row with the file2 row attached to it, like this:
name level regno dept sex regno grade
john 900 123 csc male 123 A
ken 800 234 csc male 234 A
dare 900 273 phy male 273 B
olanna 800 283 csc female 283 D
Secondly, I also want to get file3.tab in this format
name level regno dept sex grade
john 900 123 csc male A
debby 800 378 mth male NA
ken 800 234 csc male A
sol 700 923 mth female NA
dare 900 273 phy male B
olanna 800 283 csc female D
olumba 400 245 phy male NA
petrus 800 284 mth female NA
I used this command
awk 'FNR==NR{a[$1]=$1;next}{print $0, "\t" (($3 in a)? a[$1]:"NA")}' file2.tab file1.tab > file3-2.tab
But what I got is this, and the grades from file2.tab are not showing:
name level regno dept sex
john 900 123 csc male
debby 800 378 mth male NA
ken 800 234 csc male
sol 700 923 mth female NA
dare 900 273 phy male
olanna 800 283 csc female
olumba 400 245 phy male NA
petrus 800 284 mth female NA
All files are tab-delimited.
Kindly help me resolve these.
Your attempt stored the regno as the value (a[$1]=$1) instead of the grade ($2), and then looked up a[$1] (file1's name column) rather than a[$3], so matched rows printed an empty string. You can use this awk command to achieve your output:
awk 'BEGIN{FS=OFS="\t"} NR==FNR{a[$1]=$2;next} {
print $0, ($3 in a ? a[$3] : "NA")}' file2.tab file1.tab
name level regno dept sex grade
john 900 123 csc male A
debby 800 378 mth male NA
ken 800 234 csc male A
sol 700 923 mth female NA
dare 900 273 phy male B
olanna 800 283 csc female D
olumba 400 245 phy male NA
petrus 800 284 mth female NA
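That covers the second format. For the first format (matching rows only, with the whole file2 row appended), a small variation that stores the entire file2 line does the job; a sketch along the same lines (the header row comes along for free, because file1's "regno" header field is itself a key in the array):
awk 'BEGIN{FS=OFS="\t"} NR==FNR{a[$1]=$0;next} $3 in a{print $0, a[$3]}' file2.tab file1.tab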

count and print the unique number of strings

I have a text file as shown below. I would like to count the unique number of connections of each person in the first and second columns. The third column contains the ID numbers of the first-column persons, and the fourth column the ID numbers of the second-column persons.
susan ali 156 294
susan ali 156 294
susan anna 156 67
rex rex 432 564
rex rex 432 564
philip sama 543 22
For example, susan has two connections: with ali and with anna. susan's ID is 156; ali's and anna's IDs are 294 and 67, respectively. In the output, the last column is the number of connections of each person. Total connections are the sum of the connections of each person.
Your help would be appreciated!!
output:
susan 156 :- ali 294 anna 67 2
rex 432 :- rex 564 1
philip 543 :- sama 22 1
ali 294 :- susan 156 1
anna 67 :- susan 156 1
rex 564 :- rex 432 1
sama 22 :- philip 543 1
Total connections:-8
A simple cat ztest.txt | sort -k1,2 | uniq -c does the trick, but since you want it formatted, you can use awk like this:
awk '{ print $2 " :- " $4 " connected to " $3 " :- " $5 " -- count: " $1 }'
Full command:
$ cat ztest.txt | sort -k1,2 | uniq -c | awk '{ print $2 " :- " $4 " connected to " $3 " :- " $5 " -- count: " $1 }'
Output:
philip :- 543 connected to sama :- 22 -- count: 1
rex :- 432 connected to rex :- 564 -- count: 2
susan :- 156 connected to ali :- 294 -- count: 2
susan :- 156 connected to anna :- 67 -- count: 1
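If you want output shaped exactly like the format requested in the question (one line per person listing all of their unique partners, plus a grand total), here is an awk sketch along those lines (note that for-in traversal order is unspecified in awk, so lines may print in a different order than shown; persons are identified by their name/ID pair, and first- and second-column roles are reported separately, as in the expected output):
awk '{
    # count each direction of a pair once, ignoring duplicate input lines
    if (!seen[$1, $3, $2, $4]++) {
        fwd[$1 " " $3] = fwd[$1 " " $3] " " $2 " " $4   # partners of the column-1 person
        nfwd[$1 " " $3]++
    }
    if (!rseen[$2, $4, $1, $3]++) {
        rev[$2 " " $4] = rev[$2 " " $4] " " $1 " " $3   # partners of the column-2 person
        nrev[$2 " " $4]++
    }
}
END {
    for (p in fwd) { print p " :-" fwd[p], nfwd[p]; total += nfwd[p] }
    for (p in rev) { print p " :-" rev[p], nrev[p]; total += nrev[p] }
    print "Total connections:-" total
}' ztest.txt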
