How to use awk to count the number of a specified digit in a certain column - bash

How can I count matching values in a certain column?
I have a file (wm.csv).
I ran this command to extract the target values from certain columns: tail -n +497 wm.csv | awk -F"," '$2=="2" {print $3" "$4}'
which gives the following output, the data I want:
hit 2
hit 2
hit 2
hit 2
miss
hit 2
hit 2
hit 2
hit 2
hit 2
hit 2
hit 2
incorrect 1
hit 2
hit 2
hit 2
I want to count the number of "2"s in the second column in order to do some simple math: the count of "2"s divided by the total number of rows. In this case that would be 14 (fourteen "2"s in the second column) / 16 (total number of rows).
The following is the command I tried, but it does not work:
tail -n +497 wm.csv | awk -F"," '$2=="2" {count=0;} { if ($4 == "2") count+=1 } {print $3,$4,$count }'
thanks

Taking the posted data as the input file:
$ awk '$2==2{c++} END{print NR,c,c/NR}' file
16 14 0.875

awk '($0 ~ "hit 2"){count += 1} END{print count, FNR, count/FNR}' sample.csv
14 16 0.875
I use ~ to check whether the whole line ($0) matches "hit 2"; if it does, the counter is incremented by 1. FNR is the per-file record number, so at the end it equals the total number of lines.
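If you would rather compute the ratio directly from the original wm.csv instead of from the extracted output, here is a minimal sketch (assuming the same column layout as in your tail/awk command, i.e. $2 is the filter column and $4 is where the "2" appears):
tail -n +497 wm.csv | awk -F"," '
    $2 == "2" { total++                   # every row that passes the original filter
                if ($4 == "2") hits++ }   # rows whose 4th column is "2"
    END { print hits, total, hits/total }'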

Modify values of one column based on values of another column on a line-by-line basis

I'm looking to use bash/awk/sed in order to modify a document.
The document contains multiple columns. Column 5 currently has the value "A" in every row. Column 6 consists of increasing numbers. I'm attempting a script that goes through the document line by line, checks the value of column 6, and, if the value is greater than a certain integer (specifically 275), changes the value of column 5 on that same line to "B".
while IFS="" read -r line ; do
    awk 'BEGIN {FS = " "}'
    Num=$(awk '{print $6}' original.txt)
    if [ $Num > 275 ] ; then
        awk '{ gsub("A","B",$5) }'
    fi
done < original.txt >> edited.txt
For the above, I've tried setting the residueNum variable both inside and outside of the while loop.
I've also tried using a for loop and cat:
awk 'BEGIN {FS = " "}' original.txt
Num=$(awk '{print $6}' heterodimer_P49913/unrelaxed_model_1.pdb)
integer=275
for data in $Num ; do
    if [ $data > $integer ] ; then
        ##Change value in other column to "B" for all lines containing column 6 values greater than "integer"
    fi
done
Thanks in advance.
GNU AWK does not need an external while loop (it has an implicit one); if you need further explanation, read the awk info page. Let the content of file.txt be
1 2 3 4 A 100
1 2 3 4 A 275
1 2 3 4 A 300
and the task to be
checks the value of Column 6, if the value is greater than a certain
integer (specifically 275) the value of Column 5 in that same line is
changed to "B".
then it might be done using GNU AWK in the following way:
awk '$6>275{$5="B"}{print}' file.txt
which gives output
1 2 3 4 A 100
1 2 3 4 A 275
1 2 3 4 B 300
Explanation: the action that sets the value of the 5th field ($5) to B is applied conditionally, only to rows where the value of the 6th field is greater than 275. The print action is applied unconditionally to all lines. Observe that the change, if applied, happens before printing.
(tested in GNU Awk 5.0.1)
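If you want the result in edited.txt as in your original loop, just redirect the output. Note that assigning to $5 makes awk rebuild that record using the output field separator (a single space by default), so runs of whitespace between columns are collapsed on the lines that were changed:
awk '$6>275{$5="B"}{print}' original.txt > edited.txt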

Piping awk and sed commands is too slow! Any ideas on how to make it faster?

I am trying to convert a file containing a column with scaffold numbers and another one with corresponding individual sites into a bed file which lists sites in ranges. For example, this file ($indiv.txt):
SCAFF SITE
1 1
1 2
1 3
1 4
1 5
3 1
3 2
3 34
3 35
3 36
should be converted into $indiv.bed:
SCAFF SITE-START SITE-END
1 1 5
3 1 2
3 34 36
Currently I am using the following code, but it is super slow, so I wanted to ask if anybody could come up with a quicker way?
COMMAND:
for scaff in $(awk '{print $1}' $indiv.txt | uniq)
do
awk -v I=$scaff '$1 == I { print $2 }' $indiv.txt | awk 'NR==1{first=$1;last=$1;next} $1 == last+1 {last=$1;next} {print first,last;first=$1;last=first} END{print first,last}' | sed "s/^/$scaff\t/" >> $indiv.bed
done
DESCRIPTION:
awk '{print $1}' $indiv.txt | uniq #outputs a list with the unique scaffold numbers
awk -v I=$scaff '$1 == I { print $2 }' $indiv.txt #extracts the values from column 2 if the value in the first column equals the variable $scaff
awk 'NR==1{first=$1;last=$1;next} $1 == last+1 {last=$1;next} {print first,last;first=$1;last=first} END{print first,last}' #converts the list of sequential numbers into ranges as described here: https://stackoverflow.com/questions/26809668/collapse-sequential-numbers-to-ranges-in-bash
sed "s/^/$scaff\t/" >> $indiv.bed #adds a column with the respective scaffold number and then outputs the file into $indiv.bed
Thanks a lot in advance!
Calling several programs for each line of the input is bound to be slow. It's usually better to find a way to process all the lines in one call.
I'd reach for Perl:
tail -n+2 indiv.txt \
| sort -u -nk1,1 -nk2,2 \
| perl -ane 'END {print " $F[1]"}
             next if $p[0] == $F[0] && $F[1] == $p[1] + 1;
             print " $p[1]\n@F";
             } continue { @p = @F;' > indiv.bed
The first two lines sort the input so that the groups are always adjacent (this might be unnecessary if your input is already sorted that way); Perl then reads the lines, -a splits each line into the @F array, and the @p array is used to keep the previous line: if the current line has the same first element and its second element is greater by 1, we go to the continue section, which just stores the current line into @p. Otherwise, we print the last element of the previous section and the first line of the current one. The END block is responsible for printing the last element of the last section.
The output is different from yours for sections that have only a single member.
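For comparison, the same grouping can be done in a single awk pass without the per-scaffold loop. A minimal sketch, assuming the input is already sorted by scaffold and then by site as in the example (the header line is skipped first):
tail -n +2 indiv.txt | awk '
    $1 != scaff || $2 != last+1 {                # new scaffold or a gap in the site numbers
        if (NR > 1) print scaff, first, last     # close the previous range
        scaff = $1; first = $2
    }
    { last = $2 }
    END { print scaff, first, last }' > indiv.bed
The fields are printed space-separated here; add -v OFS='\t' after awk if you need tab-separated output as in your sed step.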

Assign sequential numbers to the 1st column of data, restarting from 1 at each blank line, using awk and/or sed

I have a big data file consisting of blocks of xy data; the blocks are separated by a blank line. I want to change all the x values to sequential numbers, starting from 1 again for each new block. The number of rows within each block can differ.
input:
165168 14653
5131655 51365
155615 1356

13651625 13651
12 51
55165 51656

64 64
651456 546546
desired output:
1 14653
2 51365
3 1356

1 13651
2 51
3 51656

1 64
2 546546
I would use:
$ awk '!NF{i=0; print; next} {print ++i, $2}' file
1 14653
2 51365
3 1356

1 13651
2 51
3 51656

1 64
2 546546
Explanation
It is a matter of keeping a counter i and resetting it appropriately.
!NF{i=0; print; next} if there are no fields, that is, if the line is empty, print an empty line and reset the counter.
{print ++i, $2} otherwise, increment the counter and print it together with the 2nd field.
Maybe even
awk '!NF { n=NR } NF { $1=NR-n } 1' file
So on an empty line, we set n to the current line number. On nonempty lines, we change the first field to the current line number minus n. Print all lines.
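Running this variant on the same input gives the same numbering; the trailing 1 prints every line, and the blank separator lines pass through unchanged because the NF block never touches them:
$ awk '!NF { n=NR } NF { $1=NR-n } 1' file
1 14653
2 51365
3 1356

1 13651
2 51
3 51656

1 64
2 546546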

Exclude a defined pattern using awk

I have a file with two columns and want to print the first column only if a certain pattern is not found in the second column. The file can be, for example:
3 0.
5 0.
4 1.
3 1.
10 0.
and I want to print the values in the first column only when the second column is not the number 1., i.e.
3
5
10
I know that to print the first column I can use
awk '{print $1}' fileInput >> fileOutput
Is it possible to have an if block somewhere?
In general, you just need to indicate what pattern you don't want to match:
awk '! /pattern/' file
In this specific case, where you want to print the 1st column of lines whose 2nd column is not "1.", you can say:
$ awk '$2 != "1." {print $1}' file
3
5
10
When the condition is met, {print $1} is performed, so you get the first column of the file.
In this special case, because the 1 evaluates to true and the 0 to false, you can do:
awk '!$2 { print $1 }' file
3
5
10
The part before the { } is the condition under which the commands are executed. In this case, !$2 means "column 2 is not true", i.e. column 2 is false (awk treats a field like "0." as the number 0).
Edit: this remains the case even with the trailing dot. In fact, all three of these solutions work:
bash-4.2$ cat file
3 0.
5 0.
4 1.
3 1.
10 0.
bash-4.2$ awk '!$2 { print $1 }' file # treat column 2 as a boolean
3
5
10
bash-4.2$ awk '$2 != "1." {print $1}' file # treat column 2 as a string
3
5
10
bash-4.2$ awk '$2 != 1 {print $1}' file # treat column 2 as a number
3
5
10

Divide the first entry by the last entry of each row in a file using awk

I have a file with varying row lengths:
120 2 3 4 5 9 0.003
220 2 3 4 0.004
320 2 3 5 6 7 8 8 0.009
I want the output to consist of a single column with entries like:
120/0.003
220/0.004
320/0.009
That is, I want to divide the first column by the last column of each row.
How can I achieve this using awk?
To show the output of the division operation:
$ awk '{ printf $1 "/" $NF "=" ; print ($1/$NF)}' infile
120/0.003=40000
220/0.004=55000
320/0.009=35555.6
awk will split its input based on the value of FS, which by default is any sequence of whitespace. That means you can get at the first and last column by referring to $1 and $NF. NF is the number of fields in the current line or record.
So to tell awk to print the first and last column do something like this:
awk '{ print $1 "/" $NF }' infile
Output:
120/0.003
220/0.004
320/0.009
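If you also want to control the precision of the result, a small variation with an explicit printf format works (the %.2f below is just an example):
$ awk '{ printf "%s/%s=%.2f\n", $1, $NF, $1/$NF }' infile
120/0.003=40000.00
220/0.004=55000.00
320/0.009=35555.56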
