awk numbered columns and ignore errors - bash

The following works well and sums all 2nd-column values for S_nn rows; the goal is to add the numbers in the 2nd column.
awk -F "," '/s_/ {cons = cons + $2} END {print cons}' G.csv
How can I change this to add only when nn is between N1 and N2, e.g. s_23 and s_24?
Also, is it possible to count the value as 1 if a line has junk instead of a number in the 2nd column?
S_22, 1
S_23, 0
S_24, 1
S_25, 1
S_26, ?
Sample input: sum s_24 to s_26
Sample output: 1+1+1=3 (the last 1 stands in for the error row)

The solution is rather simple: all you need to do is perform a numeric test.
awk -v start=24 -v stop=26 '
BEGIN { FS="[_,]" }
(start <= $2 ) && ($2 <= stop) { s = s + (($3==$3+0)?$3:1) }
END{ print s+0 }' <file>
which outputs
3
How does it work:
The first line defines the start and stop values (passed in via awk -v variables).
The BEGIN statement redefines the field separator as _ or ,, so we now have 3 fields per line.
The second line checks whether field 2 (the number) lies between start and stop; if so, the sum is performed.
Field 3 is tested for being a number with the condition $3==$3+0; if this test fails, the value is assumed to be 1.
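The numeric test can be checked in isolation; a minimal sketch (the ? row is the junk case):
echo "S_26, ?" | awk -F "[_,]" '{ v = ($3==$3+0) ? "number" : "junk"; print v }'
junk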
If you want to see the numbers printed, you can do:
awk -v start=24 -v stop=26 '
BEGIN{ FS="[_,]" }
(start <= $2 ) && ($2 <= stop) {
v = ($3==$3+0)?$3:1
s = s + v
printf "%s%d", (c++?"+":""), v
}
END{ printf "=%d\n", s }' <file>
output:
1+1+1=3
The printf statement always prints "+" followed by the value, except the first time. This is done by keeping track of a counter c, whose default value is zero. The expression (c++?"+":"") determines whether we are printing the first entry or not. c++ returns the current value of c and afterwards sets c to c+1; this is called a post-increment operator. Thus, the first time, c=0, so (c++?"+":"") returns "" and sets c to 1. The second time, (c++?"+":"") returns "+" and sets c to 2.
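A quick standalone illustration of the post-increment behaviour:
awk 'BEGIN { c = 0
             v = (c++ ? "+" : ""); print "1st: [" v "] c=" c
             v = (c++ ? "+" : ""); print "2nd: [" v "] c=" c }'
1st: [] c=1
2nd: [+] c=2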

Related

bash script to read values inside every file and compare them

I want to plot some data from a spray simulation. There is a variable called vaporpenetrationlength, which describes the distance from the injector to the position where the mass fraction is 0.1%. The simulation created many folders, one for each time step. Inside each of those folders there is one file which contains the mass fraction and the distance.
I want to create a script which goes through all the time-step folders, searches inside this one file, and prints out the distance where the 0.1% was measured and the time step it belongs to.
I found a script, but I don't understand it because I just started to learn shell scripting.
Could someone please help me step by step in building such a script? I am interested in learning it, and therefore I want to understand every line of the code.
Thanks in advance :)
This little script outputs Time, Length and Mass as tab-separated columns, based on the value of the "mass fraction":
printf '%s\t%s\t%s\n' 'Time' 'Length' 'Mass'
awk '
BEGIN { FS = OFS = "\t"}
FNR == 1 {
n = split(FILENAME,path,"/")
time = sprintf("%0.7f",path[n-1])
}
NF != 2 {next}
0.001 <= $2 && $2 < 0.00101 { print time,$1,$2 }
' postProcessing/singleGraphVapPen/*/*
remark: In fact, printing the header could be done within the awk program, but doing it with a separate printf command allows you to post-process the output of awk (for example if you need to sort the times and/or lengths and/or masses).
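For instance, to sort the data rows by time while keeping the header line first (a sketch; '...' abbreviates the awk program above):
printf '%s\t%s\t%s\n' 'Time' 'Length' 'Mass'
awk '...' postProcessing/singleGraphVapPen/*/* | sort -n -k1,1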
notes:
FNR == 1 is true for the first line of each input file. In the corresponding block, I extract the time value from the directory name (see the sketch after these notes).
NF != 2 {next} filters out the gnuplot commands that are at the beginning of the input files. In words, this statement means: "if the number of (tab-delimited) fields in the line isn't 2, then skip it".
0.001 <= $2 && $2 < 0.00101 selects the lines based on the value of their second field, which is referred to as yheptane in your script. I don't know the margin of error of your "0.1% of mass fraction", so I chose convenient conditions for the sample output below.
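The directory-name extraction can be tried in isolation (the path literal stands in for FILENAME):
awk 'BEGIN { n = split("postProcessing/singleGraphVapPen/0.00015/data", path, "/")
             print path[n-1] }'
0.00015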
With the sample data, the output will be:
Time Length Mass
0.0001500 0.0895768 0.00100839
0.0002000 0.102057 0.00100301
0.0002000 0.0877939 0.00100832
0.0003500 0.0827694 0.00100114
0.0009000 0.0657509 0.00100015
0.0015000 0.0501911 0.00100016
0.0016500 0.0469495 0.00100594
0.0018000 0.0436538 0.00100853
0.0021500 0.0369005 0.00100809
0.0023000 0.100328 0.00100751
As an aside, here's a script for replacing your original code:
#!/bin/bash
set -- postProcessing/singleGraphVapPen/*/*
if ! [ -f VapPen.txt ]
then
{
printf '%s\t%s\n' 'Time [s]' 'VapPen [m]'
awk '
BEGIN {FS = OFS = "\t"}
FNR == 1 {
if (NR > 1)
print time,vappen
vappen = 0
n = split(FILENAME,path,"/")
time = sprintf("%0.7f",path[n-1])
}
NF != 2 {next}
$2 >= 0.001 { vappen = $1 }
END { if (NR) print time,vappen }
' "$#" |
sort -n -k1,1
} > VapPen.txt
fi
gnuplot -e '
set title "Verdunstungspenetration";
set xlabel "Zeit [s]";
set ylabel "Verdunstungspenetrationslänge [m]";
set grid;
plot "VapPen.txt" using 1:2 with linespoints title "Vapor penetraion 0,1% mass";
pause -1 "Hit return to continue";
'
With the provided data, it reduces the execution time from several minutes to 0.15s on my computer.

Retrieve entire column to a new file if it matches a list from another file

I have a huge file, and I need to retrieve specific columns from file1 (~200000 rows and ~1000 columns) if they match the list in file2. (I prefer Bash over R.)
For example, my dummy data files are as follows:
file1
gene s1 s2 s3 s4 s5
a 1 2 1 2 1
b 2 3 2 3 3
c 1 1 1 1 1
d 1 1 2 2 2
and file2
sample
s4
s3
s7
s8
My desired output is
gene s3 s4
a 1 2
b 2 3
c 1 1
d 2 2
Likewise, I have 3 different file2 lists, and I have to pick different samples from the same file1 into a new file each time.
I would be very grateful if you guys can provide me with your valuable suggestions.
P.S.: I am a biologist; I have very little coding experience.
Regards,
Ateeq
$ cat file1
gene s1 s2 s3 s4 s5
a 1 2 1 2 1
b 2 3 2 3 3
c 1 1 1 1 1
d 1 1 2 2 2
$ cat file2
gene
s4
s3
s8
s7
$ cat a
awk '
NR == FNR {
columns[ NR ] = $0
printf "%s\t", $0
next
}
FNR == 1 {
print ""
split( $0, headers )
for (x = 1 ; x <= length(headers) ; x++ )
{
aheaders[ headers[x]] = x
}
next
}
{
for ( x = 1 ; x <= length( columns ) ; x++ )
{
if (length( aheaders[ columns[x] ] ) == 0 )
printf "N/A\t"
else
printf "%s\t" , $aheaders[ columns[x] ]
}
print ""
}
' "$@"
$ ./a file2 file1 | column -t
gene s4 s3 s8 s7
a 2 1 N/A N/A
b 3 2 N/A N/A
c 1 1 N/A N/A
d 2 2 N/A N/A
The above should get you on your way. It's an extremely optimistic program and no negative testing was performed.
Awk is a tool that applies a set of commands to every line of every file that matches an expression. In general, an awk script is a series of pairs of the form:
<pattern> { <commands> }
There are three such pairs above. Each needs a little explanation:
NR == FNR {
columns[ NR ] = $0
printf "%s\t", $0
next
}
NR == FNR is an awk'ism. NR is the record number and FNR is the record number within the current file. NR always increases, but FNR resets to 1 when awk starts parsing the next file. So NR==FNR is an idiom that is only true while parsing the first file.
I've designed the awk program to read the columns file first (you are calling this file2). File2 has a list of columns to output. As you can see, we are storing each line in the first file (file2) into an array called columns. We are also printing the columns out as we read them. In order to avoid newlines after each column name (since we want all the column headers to be on the same line), we use printf which doesn't output a newline (as opposed to print which does).
The 'next' at the end of the stanza tells awk to read the next line in the file without processing any of the other stanzas. After all, we just want to read the first file.
In summary, the first stanza remembers the column names (and order) and prints them out on a single line (without a newline).
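The NR==FNR idiom is easy to see in isolation (f1 and f2 are throwaway files):
$ printf 'a\nb\n' > f1; printf 'x\ny\n' > f2
$ awk 'NR == FNR { print "file 1:", $0; next } { print "file 2:", $0 }' f1 f2
file 1: a
file 1: b
file 2: x
file 2: y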
The second "stanza":
FNR == 1 {
print ""
split( $0, headers )
for (x = 1 ; x <= length(headers) ; x++ )
{
aheaders[ headers[x]] = x
}
next
}
FNR==1 will match on the first line of any file. Due to the next in the previous stanza, we'll only hit this stanza when we are on the first line of the second file (file1). The first print "" statement adds the newline that was missing from the first stanza. Now the line with the column headers is complete.
The split command takes its first parameter, $0 (the current line), and splits it according to whitespace. We know the current line is the first line of file1 and has the column headers in it. The split command writes into the array named by the second parameter, headers. Now headers[1] = "gene", headers[2] = "s1", headers[3] = "s2", and so on.
We're going to need to map the column names to column numbers. The next bit of code takes each header value and creates an aheaders entry. aheaders is an associative array that maps column header names to column numbers.
aheaders["gene"] = 1
aheaders["s1"] = 2
aheaders["s2"] = 3
aheaders["s3"] = 4
aheaders["s4"] = 5
aheaders["s5"] = 6
When we're done making the aheaders array, the next command tells awk to skip to the next line of the input. From this point on, only the third stanza is going to have a true condition.
{
for ( x = 1 ; x <= length( columns ) ; x++ )
{
if (length( aheaders[ columns[x] ] ) == 0 )
printf "N/A\t"
else
printf "%s\t" , $aheaders[ columns[x] ]
}
print ""
}
The third stanza has no explicit pattern, so awk treats it as always true. Therefore this last stanza is executed for every line of the second file.
At this point, we want to print the columns that are specified in the columns array. We walk through each element of the array in order. The first time through the loop, columns[1] = "gene". This gives us:
printf "%s\t" , $aheaders[ "gene" ]
And since aheaders["gene"] = 1 this gives us:
printf "%s\t" , $1
And awk understands $1 to be the first field (or column) in the input line. Thus the first column is passed to printf which outputs the value with a tab (\t) appended.
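This dynamic field access ($expr) can be tried directly; a toy example:
$ echo 'a b c' | awk '{ n = 3; print $n }'
c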
The loop then executes another time with x=2 and columns[2]="s4". This results in the following print executing:
printf "%s\t" , $5
This prints the fifth column followed by a tab. The next iteration:
columns[3] = "s3"
aheaders["s3"] = 4
Which results in:
printf "%s\t" , $4
That is, the fourth field is output.
The next iteration we hit a failure situation:
columns[4] = "s8"
aheaders["s8"] = ''
In this case, the length( aheaders[ columns[x] ] ) == 0 is true so we just print out a placeholder - something to tell the operator their input may be invalid:
printf "N/A\t"
The same is output when we process the last columns[x] value "s7".
Now, since there are no more entries in columns, the loop exits and we hit the final print:
print ""
The empty string is provided to print because print by itself defaults to print $0 - the entire line.
At this point, awk reads the next line out of file1 and hits the third block again (and only that one). Thus awk continues until the second file is completely read.

Awk script to provide sum of certain columns in another column based on criteria

Need your help.
I have one file with data as shown below.
Data for both of the scenarios below is present in a single file, and I want the expected output in that same file if possible.
Scenario 1:
If a value in the first column, DocumentNo, appears only once and
the second column, Line, has the value 10, then I would like to sum columns 3, 4, 5 and 6 (Taxablevalue, IGSTAmount, CGSTAm and SGSTAmo) and place this sum in the eighth column, InvoiceValue:
example data:
DocumentNo|Line|Taxablevalue|IGSTAmount|CGSTAm|SGSTAmo|OthTa|InvoiceValue
262881894|10|10000|0|900|900||
Scenario 2:
If we have multiple rows with identical values in the first column, DocumentNo, and unique values in the second column, LineN, then I would like to sum the values of columns 3, 4, 5 and 6 (Taxablevalue, IGSTAmount, CGSTAm and SGSTAmo) across all those rows and place this sum in the eighth column, InvoiceValue, of each line.
example data:
DocumentNo|LineN|Taxablevalue|IGSTAmo|CGSTAmo|SGSTAmou|OthTa|InvoValue
262881894|10|10000|0|900|900||
262881894|20|15000|0|1350|1350||
262881894|30|20000|0|1800|1800||
Expected output Scenario 1:
DocumentNo|Line|Taxablevalue|IGSTAmount|CGSTAm|SGSTAmo|OthTa|InvoiceValue
262881894|10|10000|0|900|900||11800
Expected output Scenario 2:
Invoice Value = 10000+15000+20000+0+0+0+900+1350+1800+900+1350+1800 =
53100
DocumentNo|LineN|Taxablevalue|IGSTAmo|CGSTAmo|SGSTAmou|OthTa|InvoValue
262881894|10|10000|0|900|900||53100
262881894|20|15000|0|1350|1350||53100
262881894|30|20000|0|1800|1800||53100
Below is the code I tried, but I am not able to figure out how to put the summed values into the last column (InvoValue):
awk '{a[$1]+=$3;b[$1]+=$4;c[$1]+=$5;d[$1]+=$6;}
END {for(i in a) { print " " a[i] " " b[i] " " c[i] " " d[i];}}' File
Below is the output I'm getting from that code. Sadly it does not match my expected output:
0 0 0 0
I would do it in two passes. (As an aside, your attempt sums zeros because it uses awk's default whitespace field separator on pipe-delimited data, so $3 through $6 are empty.)
On the first pass I would build a dictionary s holding the sum of columns 3, 4, 5 and 6 for each document number.
On the second pass I would replace the value in the InvoValue column.
Here's an example input data.txt:
DocumentNo|LineN|Taxablevalue|IGSTAmo|CGSTAmo|SGSTAmou|OthTa|InvoValue
262881894|10|10000|0|900|900||
262881894|20|15000|0|1350|1350||
262881894|30|20000|0|1800|1800||
262881895|10|10000|0|900|900||
Here is the command:
gawk 'BEGIN { OFS=FS="|" } NR == FNR { s[$1] += $3+$4+$5+$6; next } FNR!=1 { $8 = s[$1] } 1;' data.txt data.txt
Here is the output:
DocumentNo|LineN|Taxablevalue|IGSTAmo|CGSTAmo|SGSTAmou|OthTa|InvoValue
262881894|10|10000|0|900|900||53100
262881894|20|15000|0|1350|1350||53100
262881894|30|20000|0|1800|1800||53100
262881895|10|10000|0|900|900||11800
Note that I ignored column 2 altogether. You might need to modify my answer if you want to account for the LineN.
To ensure that all pairs (DocumentNo, LineN) are unique and occur only once, you could add this error detection:
if (met[$1 FS $2]) print "ERROR: " $1 " " $2;
met[$1 FS $2] = 1;
So the updated command with error detection would be:
gawk 'BEGIN { OFS=FS="|" } NR == FNR { if (met[$1 FS $2]) print "ERROR: " $1 " " $2; met[$1 FS $2] = 1; s[$1] += $3+$4+$5+$6; next } FNR!=1 { $8 = s[$1] } 1;' data.txt data.txt

Bash script - How to loop through rows in a CSV file

I am working with a huge CSV file (filename.csv) that contains a single column. I want to read the current row and compare it with the value of the previous row. If it is greater than or equal, continue comparing; if the value of the current cell is smaller than that of the previous row, divide the value of the current cell by the value of the previous cell, print the result of the division, and exit. With the following example data, I want my bash script to divide 327 by 340, print 0.961765 to the console, and exit.
338
338
339
340
327
301
299
284
284
283
283
283
282
282
282
283
I tried it with the following awk and it works perfectly fine.
awk '$1 < val {print $1/val; exit} {val=$1}' filename.csv
However, since I want to include around 7 conditional statements (if-elses), I wanted to do it with a somewhat cleaner bash script, and here is my approach. I am not that used to awk, to be honest, and that's why I prefer bash.
#!/bin/bash
FileName="filename.csv"
# Test when to stop looping
STOP=1
# to find the number of columns
NumCol=`sed 's/[^,]//g' $FileName | wc -c`; let "NumCol+=1"
# Loop until the current cell is less than the count+1
while [ "$STOP" -lt "$NumCol" ]; do
cat $FileName | cut -d, -f$STOP
let "STOP+=1"
done
How can we loop through the values and add conditional statements?
PS: the criteria for my if-else statements are: if the value ($1/val) is >=0.85 and <=0.9, print A; else if it is >=0.7 and <=0.8, print B; if it is >=0.5 and <=0.6, print C; otherwise print D.
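For reference, those exact ranges translate directly into chained if/elses in awk (a minimal sketch; values outside all three ranges print D):
awk '$1 < val {
  r = $1 / val
  if      (r >= 0.85 && r <= 0.9) print "A"
  else if (r >= 0.7  && r <= 0.8) print "B"
  else if (r >= 0.5  && r <= 0.6) print "C"
  else                            print "D"
  exit
}
{ val = $1 }' filename.csv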
Here's one in GNU awk using switch, because I haven't used it in a while:
awk '
$1<p {
s=sprintf("%.1f",$1/p)
switch(s) {
case "0.9": # if comparing to values ranged [0.9-1.0[ use /0.9/
print "A" # ... in which case (no pun) you don't need sprintf
break
case "0.8":
print "B"
break
case "0.7":
print "c"
break
default:
print "D"
}
exit
}
{ p=$1 }' file
D
Other awks using if:
awk '
$1<p {
# s=sprintf("%.1f",$1/p) # s is not rounded anymore
s=$1/p
# if(s==0.9) # if you want rounding,
# print "A" # uncomment and edit all ifs to resemble
if(s~/0.9/)
print "A"
else if(s~/0.8/)
print "B"
else if(s~/0.7/)
print "c"
else
print "D"
exit
}
{ p=$1 }' file
A
This is an alternative approach, based on the previously described comparison of $1/val with the fixed numbers 0.9, 0.7 and 0.6.
This solution will not work with ranges like ($1/val) >= 0.85 and <= 0.9, as clarified later.
awk 'BEGIN{crit[0.9]="A";crit[0.7]="B";crit[0.6]="C"} \
$1 < val{ss=substr($1/val,1,3);if(ss in crit) {print crit[ss]} else {print "D"};exit}{val=$1}' file
A
This technique is based on checking whether the truncated value of $1/val is a key of a predefined array loaded with the corresponding messages.
Let me expand the code for better understanding:
awk 'BEGIN{crit[0.9]="A";crit[0.7]="B";crit[0.6]="C"} # define the criteria array: your criteria values are used as keys, and the values are the messages you want to print
$1 < val{
ss=substr($1/val,1,3); #gets the first three chars of the result $1/val
if(ss in crit) { #checks if the first three chars is a key of the array crit declared in begin
print crit[ss] # if it is, print its value
}
else {
print "D" # if it is not, print D
};
exit
}
{val=$1}' file
Using substr we get the first three chars of the result $1/val:
for $1/val = 0.961765 using substr($1/val,1,3) returns 0.9
If you want to make comparisons based on two decimals, like 0.96, then change the substr call accordingly: substr($1/val,1,4).
In this case you need to provide the matching comparison entries in the crit array, i.e. crit[0.96]="A".
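A standalone sketch of the two-decimal variant (the numbers mirror the example above):
awk 'BEGIN { crit[0.96] = "A"
             ss = substr(327/340, 1, 4)      # ss = "0.96"
             msg = (ss in crit) ? crit[ss] : "D"
             print msg }'
A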

Uniq in awk; removing duplicate values in a column using awk

I have a large datafile in the following format below:
ENST00000371026 WDR78,WDR78,WDR78, WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 2,
ENST00000371023 WDR32 WD repeat domain 32 isoform 2
ENST00000400908 RERE,KIAA0458, atrophin-1 like protein isoform a,Homo sapiens mRNA for KIAA0458 protein, partial cds.,
The columns are tab separated. Multiple values within columns are comma separated. I would like to remove the duplicate values in the second column to result in something like this:
ENST00000371026 WDR78 WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 2,
ENST00000371023 WDR32 WD repeat domain 32 isoform 2
ENST00000400908 RERE,KIAA0458 atrophin-1 like protein isoform a,Homo sapiens mRNA for KIAA0458 protein, partial cds.,
I tried the following code below but it doesn't seem to remove the duplicate values.
awk '
BEGIN { FS="\t" } ;
{
split($2, valueArray,",");
j=0;
for (i in valueArray)
{
if (!( valueArray[i] in duplicateArray))
{
duplicateArray[j] = valueArray[i];
j++;
}
};
printf $1 "\t";
for (j in duplicateArray)
{
if (duplicateArray[j]) {
printf duplicateArray[j] ",";
}
}
printf "\t";
print $3
}' knownGeneFromUCSC.txt
How can I remove the duplicates in column 2 correctly?
Your script acts only on the second record (line) in the file because of NR==2. I took it out, but it may be what you intend. If so, you should put it back.
The in operator checks for the presence of the index, not the value, so I made duplicateArray an associative array* that uses the values from valueArray as its indices. This saves having to iterate over both arrays in a loop within a loop.
The split statement sees "WDR78,WDR78,WDR78," as four fields rather than three because of the trailing comma (the fourth field is empty), so I added an if to keep it from printing that null value, which would otherwise result in ",WDR78," being printed.
* In reality all arrays in AWK are associative.
awk '
BEGIN { FS="\t" } ;
{
split($2, valueArray,",");
j=0;
for (i in valueArray)
{
if (!(valueArray[i] in duplicateArray))
{
duplicateArray[valueArray[i]] = 1
}
};
printf $1 "\t";
for (j in duplicateArray)
{
if (j) # prevents printing an extra comma
{
printf j ",";
}
}
printf "\t";
print $3
delete duplicateArray # for non-gawk, use split("", duplicateArray)
}'
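As a quick standalone check of the in-operator semantics mentioned above:
awk 'BEGIN { a["WDR78"] = 1
             print ("WDR78" in a)   # prints 1: the index exists
             print (1 in a) }'      # prints 0: 1 is a value, not an index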
Perl:
perl -F'\t' -lane'
$F[1] = join ",", grep !$_{$_}++, split ",", $F[1];
print join "\t", #F; %_ = ();
' infile
awk:
awk -F'\t' '{
n = split($2, t, ","); _2 = x
split(x, _) # use delete _ if supported
for (i = 0; ++i <= n;)
_[t[i]]++ || (_2 = _2 ? _2 "," t[i] : t[i])
$2 = _2
}-3' OFS='\t' infile
Line 4 of the awk script (the for loop) is what preserves the original order of the values in the second field while the unique values are being filtered.
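To see why, compare the two traversal styles (a toy sketch; the for (i in t) order is implementation-defined):
awk 'BEGIN { n = split("a,b,c", t, ",")
             for (i in t) printf "%s ", t[i]; print ""               # order unspecified
             for (i = 1; i <= n; i++) printf "%s ", t[i]; print "" }' # always: a b c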
Sorry, I know you asked about awk... but Perl makes this much more simple:
$ perl -n -e ' @t = split(/\t/);
%t2 = map { $_ => 1 } split(/,/,$t[1]);
$t[1] = join(",",keys %t2);
print join("\t",@t); ' knownGeneFromUCSC.txt
Pure Bash 4.0 (one associative array):
declare -a part # parts of a line
declare -a part2 # parts of the 2nd column
declare -A check # used to remember items in part2
while read line ; do
part=( $line ) # split line using whitespaces
IFS=',' # separator is comma
part2=( ${part[1]} ) # split the 2nd column using comma
if [ ${#part2[@]} -gt 1 ] ; then # more than 1 field in the 2nd column?
check=() # empty check array
new2='' # empty new 2. column
for item in ${part2[@]} ; do
(( check[$item]++ )) # remember items in the 2nd column
if [ ${check[$item]} -eq 1 ] ; then # not yet seen?
new2=$new2,$item # add to new 2. column
fi
done
part[1]=${new2#,} # remove leading comma
fi
IFS=$'\t' # separator for the output
echo "${part[*]}" # rebuild line
done < "$infile"
