Select the biggest value and print the line - sorting

I need some help with text manipulation.
I do have data like this:
29554 31109 "ENSG00000243485.1" 1555
29554 31097 "ENSG00000243485.1" 1543
29554 30039 "ENSG00000243485.1" 485
30564 30667 "ENSG00000243485.1" 103
30267 30667 "ENSG00000243485.1" 400
30976 31109 "ENSG00000243485.1" 133
89295 133566 "ENSG00000238009.2" 44271
89295 120932 "ENSG00000238009.2" 31637
120775 120932 "ENSG00000238009.2" 157
112700 112804 "ENSG00000238009.2" 104
92091 92240 "ENSG00000238009.2" 149
28269867 28269929 "ENSG00000248451.1" 62
28270383 28270486 "ENSG00000248451.1" 103
28273195 28273372 "ENSG00000248451.1" 177
28275308 28275354 "ENSG00000248451.1" 46
.....................
I have to print the line with the biggest value per group.
The group name is in column 3 and the values are in column 4.
As I imagine it should go like this:
1. Separating groups from each other;
2. Selecting biggest value;
3. Printing the whole line.
Preferred output for the example should be:
29554 31109 "ENSG00000243485.1" 1555
89295 133566 "ENSG00000238009.2" 44271
28273195 28273372 "ENSG00000248451.1" 177
Hope someone could help me with this in awk or sed.

You only need to pass through the file once with awk:
awk '
$4 > val[$3] {val[$3] = $4; line[$3] = $0}
END {for (grp in line) print line[grp]}
' filename
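The `for (grp in line)` loop visits groups in an unspecified order, so the output may not follow the input order shown in the question. A minimal sketch that keeps first-appearance order, assuming the four-column layout of the sample (group in field 3, value in field 4); the function name `max_per_group` and the `order` array are illustrative, not part of the answer above:

```shell
# Hypothetical helper: print the line with the largest 4th field per group
# (3rd field), in the order each group first appears in the input.
max_per_group() {
    awk '
    # first line of a group: remember the group and seed its maximum
    !($3 in val) { order[++n] = $3; val[$3] = $4; line[$3] = $0; next }
    # later line with a larger value: replace the stored line
    $4 + 0 > val[$3] + 0 { val[$3] = $4; line[$3] = $0 }
    END { for (i = 1; i <= n; i++) print line[order[i]] }
    ' "$@"
}

# Usage: max_per_group filename
```

Seeding the maximum from the first line of each group also keeps this working if a group only has negative values.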

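This should do in bash and awk:
# "groups" rather than GROUPS: bash reserves GROUPS, and assignments to it have no effect
groups=$(cut -d' ' -f3 datafile | uniq) # list of groups (assumes each group's lines are adjacent)
for f in $groups # unquoted so the list splits into one word per group
do
# print line if 4th field is max
awk -v "grp=$f" '$3 == grp && $4 > max {max=$4; line=$0} END {print line}' datafile
done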

This might work for you:
cat -n file | sort -k4,4 -k5,5nr | sort -u -k4,4 | sort -n | cut -f2-
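The same pipeline, broken into steps as a sketch. It assumes the four-column layout of the sample, so that after `cat -n` the group is field 4 and the value is field 5. The `-s` (stable) flag is an addition here, supported by GNU and BSD `sort` but not required by POSIX, so that the line kept by `-u` is reliably the first one of each group. The function name is illustrative:

```shell
# Hypothetical helper: largest value per group using only cat, sort and cut.
max_per_group_sorted() {
    cat -n "$1" |            # 1. prefix each line with its line number and a tab
    sort -k4,4 -k5,5nr |     # 2. group by name, largest value first within a group
    sort -s -u -k4,4 |       # 3. keep only the first line of each group
    sort -n |                # 4. restore the original line order
    cut -f2-                 # 5. drop the line numbers again
}

# Usage: max_per_group_sorted file
```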

Related

A UNIX Command to Find the Name of the Student who has the Second Highest Score

I am new to Unix Programming. Could you please help me to solve the question.
For example, If the input file has the below content
RollNo Name Score
234 ABC 70
567 QWE 12
457 RTE 56
234 XYZ 80
456 ERT 45
The output will be
ABC
I tried something like this
sort -k3,3 -rn -t" " | head -n2 | awk '{print $2}'
Using awk
awk 'NR>1{arr[$3]=$2} END {n=asorti(arr,arr_sorted); print arr[arr_sorted[n-1]]}'
Demo:
$cat file.txt
RollNo Name Score
234 ABC 70
567 QWE 12
457 RTE 56
234 XYZ 80
456 ERT 45
$awk 'NR>1{arr[$3]=$2} END {n=asorti(arr,arr_sorted); print arr[arr_sorted[n-1]]}' file.txt
ABC
$
Explanation:
NR>1 --> Skip the first record (the header)
{arr[$3]=$2} --> Create an associative array with marks as index and name as value
END --> Run the following block after the last record has been read
n=asorti(arr,arr_sorted) --> Sort the indices of arr (i.e. the marks) into arr_sorted; n = number of elements in the array. asorti is a gawk extension and by default compares the indices as strings, so this relies on all marks having the same number of digits
print arr[arr_sorted[n-1]] --> arr_sorted[n-1] is the second-largest mark; print the name stored under it in arr
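Your attempt is nearly correct; it needs just one more step.
head -n2 keeps the two highest scores, so add tail -n1 to keep only the second of them:
sort -k3,3 -rn -t" " file.txt | head -n2 | tail -n1 | awk '{print $2}'
The header line sorts last here because its non-numeric Score field counts as 0.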
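Both approaches can be tripped up by tied scores or by marks with different numbers of digits. A minimal sketch that sorts numerically and picks the second distinct score, assuming a one-line header and whitespace-separated RollNo, Name and Score columns as in the question; the function name `second_highest` is made up for illustration:

```shell
# Hypothetical helper: print the name(s) holding the second-highest distinct score.
second_highest() {
    tail -n +2 "$1" |      # drop the header line
    sort -k3,3nr |         # highest score first, compared numerically
    awk 'NR == 1 || $3 != prev { rank++; prev = $3 }   # new distinct score, new rank
         rank == 2 { print $2 }                        # every name sharing rank 2
         rank > 2  { exit }'
}

# Usage: second_highest file.txt
```

If several students share the second-highest score, all of their names are printed.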

Bash Shell: How do I sort by values on last column, but ignoring the header of a file?

file
ID First_Name Last_Name(s) Average_Winter_Grade
323 Popa Arianna 10
317 Tabarcea Andreea 5.24
326 Balan Ionut 9.935
327 Balan Tudor-Emanuel 8.4
329 Lungu Iulian-Gabriel 7.78
365 Brailean Mircea 7.615
365 Popescu Anca-Maria 7.38
398 Acatrinei Andrei 8
How do I sort it by last column, except for the header ?
This is what file should look like after the changes:
ID First_Name Last_Name(s) Average_Winter_Grade
323 Popa Arianna 10
326 Balan Ionut 9.935
327 Balan Tudor-Emanuel 8.4
398 Acatrinei Andrei 8
329 Lungu Iulian-Gabriel 7.78
365 Brailean Mircea 7.615
365 Popescu Anca-Maria 7.38
317 Tabarcea Andreea 5.24
If it's always 4th column:
head -n 1 file; tail -n +2 file | sort -n -r -k 4,4
If all you know is that it's the last column:
head -n 1 file; tail -n +2 file | awk '{print $NF,$0}' | sort -n -r | cut -f2- -d' '
You'd like to just sort by the last column, but sort doesn't allow you to do that easily. So rewrite the data with the column to be sorted at the beginning of each line:
Ignoring the header for the moment (although this will often work by itself):
awk '{print $NF, $0 | "sort -nr" }' input | cut -d ' ' -f 2-
If you do need to keep the header out of the sort (eg, it's getting mixed in with the rows), you can do things like:
< input awk 'NR==1; NR>1 {print $NF, $0 | "sort -nr | cut -d\" \" -f 2-" }'
or
awk 'NR==1{ print " ", $0} NR>1 {print $NF, $0 | "sort -nr" }' OFS=\; input | cut -d \; -f 2-
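The decorate, sort, undecorate idea above can also be wrapped so that the header never enters the sort at all. A sketch, assuming a single header line and space-separated columns; `sort_by_last_keep_header` is a made-up name, and `LC_ALL=C` is set only so that `.` is treated as the decimal point:

```shell
# Hypothetical helper: sort rows by the last column (numeric, descending),
# leaving the first line (the header) in place.
sort_by_last_keep_header() {
    {
        IFS= read -r header && printf '%s\n' "$header"   # pass the header through untouched
        awk '{ print $NF, $0 }' |                        # prepend the sort key to each row
            LC_ALL=C sort -k1,1nr |                      # sort on that key only
            cut -d' ' -f2-                               # remove the key again
    } < "$1"
}

# Usage: sort_by_last_keep_header file
```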

Select first two columns from tab-delimited text file and substitute with '_' character

I have a sample input file as follows
RF00001 1c2x C 3 118 77.20 1.6e-20 1 119 f29242
RF00001 1ffk 9 1 121 77.40 1.4e-20 1 119 8e2511
RF00001 1jj2 9 1 121 77.40 1.4e-20 1 119 f29242
RF00001 1k73 B 1 121 77.40 1.4e-20 1 119 8484c0
RF00001 1k8a B 1 121 77.40 1.4e-20 1 119 93c090
RF00001 1k9m B 1 121 77.40 1.4e-20 1 119 ebeb30
RF00001 1kc8 B 1 121 77.40 1.4e-20 1 119 bdc000
I need to extract the second and third columns from the text file and substitute the tab with '_'
Desired output file :
1c2x_C
1ffk_9
1jj2_9
1k73_B
1k8a_B
1k9m_B
1kc8_B
I am able to print the two columns by :
awk -F" " '{ print $2,$3 }' input.txt
but unable to substitute the tab with '_' with the following command
awk -F" " '{ print $2,'_',$3 }' input.txt
Could you please try following.
awk '{print $2"_"$3}' Input_file
2nd solution:
awk 'BEGIN{OFS="_"} {print $2,$3}' Input_file
3rd solution: Adding a sed solution.
sed -E 's/[^ ]* +([^ ]*) +([^ ]*).*/\1_\2/' Input_file
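If the file really is tab-delimited, as the title says, `cut` and `tr` are enough. A small sketch under that assumption (it will not work on space-separated input); the function name is illustrative:

```shell
# Hypothetical helper: join columns 2 and 3 of a tab-delimited file with "_".
join_cols_2_3() {
    cut -f2,3 "$1" |   # cut splits on tabs by default
    tr '\t' '_'        # the one remaining tab becomes an underscore
}

# Usage: join_cols_2_3 input.txt
```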

get subset of table based on unique column values

Hi - I am looking for a bash/awk/sed solution to get subsets of a table based on unique column values. For example if I have:
chrom1 333
chrom1 343
chrom2 380
chrom2 501
chrom1 342
chrom3 102
I want to be able to split this table into 3:
chrom1 333
chrom1 343
chrom1 342
chrom2 380
chrom2 501
chrom3 102
I know how to do this in R using the split command, but I am specifically looking for a bash/awk/sed solution.
Thanks
I don’t know if this awk is of any use, but it will create 3 separate files based on the unique column values:
awk '{print >> $1; close($1)}' file
alternative awk which keeps the original order of records within each block
$ awk '{a[$1]=a[$1]?a[$1] ORS $0:$0}
END{for(k in a) print a[k] ORS ORS}' file
generates
chrom1 333
chrom1 343
chrom1 342
chrom2 380
chrom2 501
chrom3 102
There are 2 trailing empty lines after each block, but they are not displayed in the formatted output.
Using sort and awk:
sort -k1,1 file | awk 'NR>1 && p != $1{print ORS} {p=$1} 1'
EDIT: If you want to keep original order of records from input file then use:
awk -v ORS='\n\n' '!($1 in a){a[$1]=$0; ind[++i]=$1; next}
{a[$1]=a[$1] RS $0}
END{for(k=1; k<=i; k++) print a[ind[k]]}' file
create input list file.txt
(
cat << EOF
chrom1 333
chrom1 343
chrom2 380
chrom2 501
chrom1 342
chrom3 102
EOF
) > file.txt
transformation
cut -d" " -f1 file.txt | sort -u | while read -r c
do
# the trailing space keeps chrom1 from also matching chrom10
grep "^$c " file.txt | sort
echo
done
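To get real subsets rather than one stream with blank separators, the per-key append idea can write into a dedicated directory. A sketch, assuming the first column contains no `/` and that the output directory starts out empty (rerunning appends to existing files); `split_by_first_col` and the `.txt` suffix are choices made for this example:

```shell
# Hypothetical helper: write one file per distinct first-column value.
split_by_first_col() {
    infile=$1
    outdir=$2
    mkdir -p "$outdir" &&
    awk -v dir="$outdir" '
    {
        out = dir "/" $1 ".txt"
        print >> out     # appending keeps the input order within each group
        close(out)       # avoid running out of open file descriptors
    }' "$infile"
}

# Usage: split_by_first_col file.txt subsets
```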

retrieve and add two numbers of files

In my file I have following structure :-
A | 12 | 10
B | 90 | 112
C | 54 | 34
What I have to do is I have to add column 2 and column 3 and print the result with column 1.
output:-
A | 22
B | 202
C | 88
I can retrieve the two columns but don't know how to add them.
What I did is :-
cut -d ' | ' -f3,5 myfile.txt
How to add those columns and display.
A Bash solution:
#!/bin/bash
while IFS="|" read f1 f2 f3
do
echo $f1 "|" $((f2+f3))
done < file
You can do this easily with awk.
awk '{print $1, "|", $3+$5}' myfile.txt
This will probably work: with the default whitespace splitting, the "|" separators count as fields 2 and 4.
You can do this with awk:
awk 'BEGIN{FS="|"; OFS="| "} {print $1 OFS $2+$3}' input_filename
Input:
A | 12 | 10
B | 90 | 112
C | 54 | 34
Output:
A | 22
B | 202
C | 88
Explanation:
awk: invoke the awk tool
BEGIN{...}: do things before starting to read lines from the file
FS="|": FS stands for Field Separator. Think of it as the delimiter that separates each line of your file into fields
OFS="| ": OFS stands for Output Field Separator. Same idea as above, but for output. FS =/= OFS in this case due to formatting
{print $1 OFS $2+$3}: For each line that awk reads, print the first field (the letter), followed by a delimiter specified by OFS, then the sum of field 2 and field 3.
input_filename: awk accepts the input file name as an argument here.
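A variant sketch that lets the field separator absorb the spaces around each `|`, so the fields hold only the name and the numbers. It assumes lines of the form `name | n1 | n2`; the function name `sum_cols` is made up for illustration:

```shell
# Hypothetical helper: print "name | sum" for lines shaped like "name | n1 | n2".
sum_cols() {
    # FS is a regular expression: a pipe plus any spaces on either side of it
    awk -F' *[|] *' '{ print $1 " | " ($2 + $3) }' "$@"
}

# Usage: sum_cols input_filename
```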
