how to sort values of 1st column and print only corresponding values of third column using awk - sorting

I tried to sort the file using awk '{print $0|"sort -t',' -nk1 "}' but I want to print only the third column of the sorted file. Input file:
1 4 7 9
9 7 4 1
4 6 8 9
1 2 3 4
5 4 5 2
the expected output:
3
7
8
5
4

Try this -
sort file|awk '{print $3}'
3
7
8
5
4

Two simple ways "to print only the third column of the sorted file":
with tr + cut command:
sort -n file | tr -s ' ' | cut -d' ' -f3
with awk:
sort -n file | awk '{print $3}'

Well, since I already started doing it using GNU awk's asorti:
$ awk '
{ a[$1 "," $3]=$3 }            # get the data to a hash (*)
END {
    n=asorti(a)                # sort a by the index
    for(i=1;i<=n;i++) {        # for each ordered index
        split(a[i],b,",")      # split and
        print b[2]             # print the latter part
    }
}' file
3
7
8
5
4
(*) If using only $1 as the key there will be a collision when $1=1. For this data, using $1 "," $3 as the key won't produce a collision and will sort the $3 values as well. However, there could still be a collision. The correct way would be to keep a count of the keys and have a nested for loop to print those out (or to keep keys and values in separate arrays indexed with NR). That is left as an exercise; a rough sketch of the NR-indexed variant follows.
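For reference, here is a minimal GNU awk sketch of that NR-indexed variant, traversing the key array in numeric order of its values via PROCINFO["sorted_in"]; it is only a sketch, and the relative order of rows that share the same key is not guaranteed:
gawk '
{ key[NR] = $1; val[NR] = $3 }                # keep keys and values in parallel arrays
END {
    PROCINFO["sorted_in"] = "@val_num_asc"    # traverse key[] by its numeric values, ascending
    for (i in key)                            # i comes out ordered by key[i]
        print val[i]
}' file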

Related

piping commands of awk and sed is too slow! any ideas on how to make it work faster?

I am trying to convert a file containing a column with scaffold numbers and another one with corresponding individual sites into a bed file which lists sites in ranges. For example, this file ($indiv.txt):
SCAFF SITE
1 1
1 2
1 3
1 4
1 5
3 1
3 2
3 34
3 35
3 36
should be converted into $indiv.bed:
SCAFF SITE-START SITE-END
1 1 5
3 1 2
3 34 36
Currently, I am using the following code but it is super slow so I wanted to ask if anybody could come up with a quicker way??
COMMAND:
for scaff in $(awk '{print $1}' $indiv.txt | uniq)
do
awk -v I=$scaff '$1 == I { print $2 }' $indiv.txt | awk 'NR==1{first=$1;last=$1;next} $1 == last+1 {last=$1;next} {print first,last;first=$1;last=first} END{print first,last}' | sed "s/^/$scaff\t/" >> $indiv.bed
done
DESCRIPTION:
awk '{print $1}' $indiv.txt | uniq #outputs a list with the unique scaffold numbers
awk -v I=$scaff '$1 == I { print $2 }' $indiv.txt #extracts the values from column 2 if the value in the first column equals the variable $scaff
awk 'NR==1{first=$1;last=$1;next} $1 == last+1 {last=$1;next} {print first,last;first=$1;last=first} END{print first,last}' #converts the list of sequential numbers into ranges as described here: https://stackoverflow.com/questions/26809668/collapse-sequential-numbers-to-ranges-in-bash
sed "s/^/$scaff\t/" >> $indiv.bed #adds a column with the respective scaffold number and then outputs the file into $indiv.bed
Thanks a lot in advance!
Calling several programs for each line of the input must be slow. It's usually better to find a way how to process all the lines in one call.
I'd reach for Perl:
tail -n+2 indiv.txt \
| sort -u -nk1,1 -nk2,2 \
| perl -ane 'END {print " $F[1]"}
             next if $p[0] == $F[0] && $F[1] == $p[1] + 1;
             print " $p[1]\n@F";
             } continue { @p = @F;' > indiv.bed
The first two lines sort the input so that the groups are always adjacent (might be unnecessary if your input is already sorted that way); Perl then reads the lines, -a splits each line into the @F array, and the @p array is used to keep the previous line: if the current line has the same first element and the second element is greater by 1, we go to the continue section, which just stores the current line into @p. Otherwise, we print the last element of the previous section and the first line of the current one. The END block is responsible for printing the last element of the last section.
The output is different from yours for sections that have only a single member.
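If you prefer to stay in awk, a minimal single-pass sketch of the same idea (assuming indiv.txt is already sorted by scaffold and site, as in the sample) could look like this:
awk 'NR==1 { print "SCAFF", "SITE-START", "SITE-END"; next }   # keep a header line
     $1==scaff && $2==last+1 { last=$2; next }                 # extend the current run
     NR>2 { print scaff, first, last }                         # close the previous run
     { scaff=$1; first=$2; last=$2 }                           # start a new run
     END { if (NR>1) print scaff, first, last }' indiv.txt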

Optimizing grep -f piping commands [duplicate]

I have two files.
file1 has some keys that have abc in the second column:
et1 abc
et2 abc
et55 abc
file2 has those key values in its last column, plus some other numbers I need to add up:
1 2 3 4 5 et1
5 5 5 5 5 et100
3 3 3 3 3 et55
5 5 5 5 4 et1
6 6 6 6 3 et1
For the keys extracted from file1, I need to add up the corresponding column 5 values where the key matches. file2 itself is very large.
This command seems to be working but it is very slow:
egrep -isr "abc" file1.tcl | awk '{print $1}' | grep -vwf /dev/stdin file2.tcl | awk '{tl+=$5} END {print tl}'
How would I go about optimizing the pipe? Also, what am I doing wrong with grep -f? Is it generally not recommended to do something like this?
Edit: Expected output is the sum of all column 5 values in file2 when the column 6 key is present in file1.
Edit 2: Expected output: since file1 has the keys et1, et2 and et55, adding up column 5 of file2 for the matching rows 1, 3, 4 and 5 gives the expected output [5+3+4+3=15].
Use a single awk to read file1 into the keys of an array. Then when reading file2, add $5 to a total variable when $6 is in the array.
awk 'NR==FNR {if ($2 == "abc") a[$1] = 0;
next}
$6 in a {total += $5}
END { print total }
' file1.tcl file2.tcl
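For the sample files above this prints the expected total:
15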
Could you please try the following, which reads file2.tcl first and uses fewer loops. Since your expected output is not completely clear, I haven't fully tested it.
awk 'FNR==NR{a[$NF]+=$(NF-1);next} $2=="abc"{print $1,a[$1]+0}' file2.tcl file1.tcl
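Note that for the sample files above this variant prints one sum per key from file1.tcl rather than a single total:
et1 12
et2 0
et55 3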

Line differences with element location in shell script

Input:
file1.txt
abc 1 2 3 4
file2.txt
abc 1 2 5 6
Expected output:
difference is
3
5
at location 3
I am able to track the differences using:
comm -3 file1.txt file2.txt | uniq -c | awk '{print $4}' | uniq
But not able to track the element location.
Could you guys please suggest the shell script to track the element location?
With perl, and Path::Class from CPAN for convenience
perl -MPath::Class -MList::Util=first -e '
    @f1 = split " ", file(shift)->slurp;
    @f2 = split " ", file(shift)->slurp;
    $idx = first {$f1[$_] ne $f2[$_]} 0..$#f1;
    printf "difference is\n%s\n%s\nat index %d\n", $f1[$idx], $f2[$idx], $idx;
' file{1,2}.txt
difference is
3
5
at index 3
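An awk-only sketch of the same idea (assuming each file is a single line with the same number of fields, and reporting the 0-based index like the Perl answer):
awk 'NR==FNR { for (i=1; i<=NF; i++) a[i]=$i; next }     # remember the fields of file1
     { for (i=1; i<=NF; i++)
         if ($i != a[i]) {                               # first field that differs
             printf "difference is\n%s\n%s\nat index %d\n", a[i], $i, i-1
             exit
         }
     }' file1.txt file2.txt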

Bash retrieve column number from column name

Is there a better way (such as a one liner in AWK) where I can get the column number in a table with headings from a column name? I want to be able to process a column independent of what the column number actually is (such as when another column is added the script will not need to change).
For example, given the following table in "table.tsv":
ID Value Target Not Used
1 5 9 11
2 4 8 12
3 6 7 10
I can do a sort on the "Target" column using:
#!/bin/bash
(IFS=$'\t'; read -r; printf "%s\n" "$REPLY"; i=0; for col in $REPLY; do
((++i))
[ "$col" == "Target" ] && break
done; sort -t$'\t' "-k$i,${i}n") < table.tsv
Is there a way to do it without the for loop (or at least clean it up a little)?
The expected output of the given script is:
ID Value Target Not Used
3 6 7 10
2 4 8 12
1 5 9 11
However, I was trying to give an example of what I was trying to do. I want to pass/filter my table through several programs so the headings and all columns should be preserved: just have processing occur at each step.
In pseudo code, what I would like to do is:
print headings from stdin
i=$(magic to determine column position given "Target")
sort -t$'\t' "-k$i,${i}n" # or whatever processing is required on that column
another alternative with a lot of pipes
$ head -1 table | tr -s ' ' '\n' | nl -nln | grep "Target" | cut -f1
extract first row, transpose, number lines, find column name, extract number
Or, awk to the rescue!
$ awk -v RS='\t' '/Target/{print NR; exit}' file.tsv
3
Here is an awk alternative:
awk -F '\t' -v col='Target' 'NR==1{for (i=1; i<=NF; i++) if ($i == col){c=i; break}}
{print $c}' file
EDIT: To print column number only:
awk -F '\t' -v col='Target' 'NR==1{for (i=1; i<=NF; i++) if ($i==col) {print i;exit}}' file
3
$ awk -v name='Target' '{for (i=1;i<=NF;i++) if ($i==name) print i; exit}' file
3
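Putting the one-liners above to work, a minimal sketch of the pseudo code from the question (header preserved, column located by name, body sorted on that column) might look like this; it reads table.tsv directly rather than stdin:
#!/bin/bash
# locate the "Target" column in the header, then sort the body on it numerically
i=$(awk -F '\t' -v col='Target' 'NR==1{for (j=1; j<=NF; j++) if ($j==col) {print j; exit}}' table.tsv)
head -1 table.tsv
tail -n+2 table.tsv | sort -t$'\t' "-k$i,${i}n"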

How can I use awk to sort columns by the last value of a column?

I have a file like this (with hundreds of lines and columns)
1 2 3
4 5 6
7 88 9
and I would like to re-order the columns based on the last line's values (or a specific line's values)
1 3 2
4 6 5
7 9 88
How can I use awk (or other) to accomplish this task?
Thank you in advance for your help
EDIT: I would like to thank everybody and to apologize if I wasn't clear enough.
What I would like to do is:
take a line (for example the last one);
reorder the columns of the matrix using the sorted values of the chosen line to determine the order.
So the last line is 7 88 9, which sorted is 7 9 88; the three columns therefore have to be reordered so that, in this case, the last two columns are swapped.
A more generic four-column example, based on the last line again:
Input:
1 2 3 4
4 5 6 7
7 88.0 9 -3
Output:
4 1 3 2
7 4 6 5
-3 7 9 88.0
Here's a quick, dirty and improvable solution: (edited because OP clarified that numbers are floating point).
$ cat test.dat
1 2 3
4 5 6
.07 .88 -.09
$ awk "{print $(printf '$%d%.0s\n' \
$(i=0; for x in $(tail -n1 test.dat); do
echo $((++i)) $x
done |
sort -k2g) | paste -sd,)}" test.dat
3 1 2
6 4 5
-.09 .07 .88
To see what's going on there (or at least part of it):
$ echo "{print $(printf '$%d%.0s\n' \
$(i=0; for x in $(tail -n1 test.dat); do
echo $((++i)) $x
done |
sort -k2g) | paste -sd,)}" test.dat
{print $3,$1,$2} test.dat
To make it work for an arbitrary line, replace tail -n1 with tail -n+$L|head -n1
This problem can be elegantly solved using GNU awk's array sorting feature. GNU awk allows you to control array traversal using PROCINFO. So two passes of the file are required, the first pass to split the last record into an array and the second pass to loop through the indices of the array in value order and output fields based on indices. The code below probably explains it better than I do.
awk 'BEGIN{PROCINFO["sorted_in"] = "@val_num_asc"};
     NR == FNR {for (x in arr) delete arr[x]; split($0, arr)};
     NR != FNR {sep=""; for (x in arr) {printf sep""$x; sep=" "} print ""}' file.txt file.txt
4 1 3 2
7 4 6 5
-3 7 9 88.0
Update:
Create a file called transpose.awk like this:
{
for (i=1; i<=NF; i++) {
a[NR,i] = $i
}
}
NF>p { p = NF }
END {
for(j=1; j<=p; j++) {
str=a[1,j]
for(i=2; i<=NR; i++){
str=str OFS a[i,j];
}
print str
}
}
Now here is the script that should do work for you:
awk -f transpose.awk file | sort -n -k $(awk 'END{print NR}' file) | awk -f transpose.awk
1 3 2
4 6 5
7 9 88
I am using transpose.awk twice here: once to transpose rows to columns, then a numeric sort on the last column (which corresponds to the original last row, hence the row count used as the sort key), and then a second transpose back. It may not be the most efficient solution, but it works for the OP's requirements.
transposing awk script courtesy of @ghostdog74, from An efficient way to transpose a file in Bash
