How to get a substring in awk - bash

This is one line of the input file:
FOO BAR 0.40 0.20 0.40 0.50 0.60 0.80 0.50 0.50 0.50 -43.00 100010101101110101000111010
And an awk command that checks whether a certain position in column 13 is a "1" or a "0".
Something like:
awk -v values="${values}" '{if (substr($13,1,1)==1) printf values,$1,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13}' foo.txt > bar.txt
The values variable works, but in the example above I just want to check whether the first bit is equal to "1".
EDIT
Ok, so I guess I wasn't very clear in my question. The "$13" in the substr call is in fact the bitstring. So this awk command should pass through all the lines in foo.txt that have a "1" at position "1" of the bitstring in column "$13". Hope this clarifies things.
EDIT 2
Ok, let me break it down real easy. The code above is an example; the input line is one of MANY lines, so not all lines have a 1 at position 8. I've double-checked that the positions I test contain both values, so in any case I should get some output. The thing is that it doesn't find any "1"s at the positions I choose in any line, but when I tell it to look for a "0" it returns all lines.

$ cat file
FOO BAR 0.40 0.20 0.40 0.50 0.60 0.80 0.50 0.50 0.50 -43.00 100010101101110101000111010
FOO BAR 0.40 0.20 0.40 0.50 0.60 0.80 1.50 1.50 1.50 -42.00 100010111101110101000111010
$ awk 'substr($13,8,1)==1{ print "1->"$0 } substr($13,8,1)==0{ print "0->"$0 }' file
0->FOO BAR 0.40 0.20 0.40 0.50 0.60 0.80 0.50 0.50 0.50 -43.00 100010101101110101000111010
1->FOO BAR 0.40 0.20 0.40 0.50 0.60 0.80 1.50 1.50 1.50 -42.00 100010111101110101000111010
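
If that test behaves as expected, the same condition drops straight into the original command. A sketch with the bit position parameterized (the pos variable is an assumption added here, not part of the original command):
awk -v values="${values}" -v pos=1 'substr($13,pos,1)=="1" { printf values, $1,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13 }' foo.txt > bar.txt
Comparing against the quoted string "1" sidesteps any string-versus-number ambiguity in the substr() result.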

bash/awk: remove duplicate columns after merging of several files

I am using the following function in my bash script to merge many files (containing multi-column data) into one big summary chart with all of the fused data:
table_fuse () {
    paste -d'\t' "${rescore}"/*.csv >> "${rescore}"/results_2PROTS_CNE_strategy3.csv | column -t -s$'\t'
}
Taking two files as an example, this routine would produce the following concatenated chart as the result of the merging:
# file 1              # file 2
Lig dG(10V1) dG(rmsd) Lig dG(10V2) dG(rmsd)
lig1 -6.78 0.32 lig1 -7.04 0.20
lig2 -5.56 0.14 lig2 -5.79 0.45
lig3 -7.30 0.78 lig3 -7.28 0.71
lig4 -7.98 0.44 lig4 -7.87 0.42
lig5 -6.78 0.28 lig5 -6.75 0.31
lig6 -6.24 0.24 lig6 -6.24 0.24
lig7 -7.44 0.40 lig7 -7.42 0.39
lig8 -4.62 0.41 lig8 -5.19 0.11
lig9 -7.26 0.16 lig9 -7.30 0.13
Since both files share the same first column (Lig), how would it be possible to remove (substitute with " ") all repeats of this column in the fused file, while keeping only the Lig column from the first CSV?
EDIT: As per OP's comments, to cover the [Ll]ig, [Ll]ig0123, or [Ll]ig(abcd) formats in the file, adding the following solution here.
awk '{first=$1;gsub(/[Ll]ig([0-9]+)?(\([-a-zA-Z]+\))?/,"");print first,$0}' Input_file
With awk you could try the following, considering that you want to remove only the lig(digits) duplicate values here.
awk '{first=$1;gsub(/[Ll]ig([0-9]+)?/,"");print first,$0}' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
{
first=$1 ##Setting first column value to first here.
gsub(/[Ll]ig([0-9]+)?/,"") ##Globally substituting L/lig digits(optional) with NULL in whole line.
print first,$0 ##printing first and current line here.
}
' Input_file ##mentioning Input_file name here.
It's not hard to replace repeats of phrases. What exactly works for your case depends on the precise input file format; but something like
sed 's/^\([^ ]*\)\( .* \)\1 /\1\2 /' file
would get rid of any repeat of the token in the first column.
Perhaps a better solution is to use a more sophisticated merge tool, though. A simple Awk or Python script could take care of removing the first token from every file except the first while merging.
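For instance, a minimal awk sketch along those lines, assuming whitespace-separated files with identical row counts and order (the file names are placeholders):
awk '
FNR == 1 { filenum++ }                              # FNR resets per file, so this detects each new file
filenum == 1 { line[FNR] = $0; rows = FNR; next }   # keep the first file whole
{ sub(/^[^ \t]+[ \t]+/, ""); line[FNR] = line[FNR] "\t" $0 }  # drop the first token, append the rest
END { for (i = 1; i <= rows; i++) print line[i] }
' file1.csv file2.csv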
join would appear to be a solution, at least for the sample data provided by OP ...
Sample input data:
$ cat file1
Lig dG(10V1) dG(rmsd)
lig1 -6.78 0.32
lig2 -5.56 0.14
lig3 -7.30 0.78
lig4 -7.98 0.44
lig5 -6.78 0.28
lig6 -6.24 0.24
lig7 -7.44 0.40
lig8 -4.62 0.41
lig9 -7.26 0.16
$ cat file2
Lig dG(10V2) dG(rmsd)
lig1 -7.04 0.20
lig2 -5.79 0.45
lig3 -7.28 0.71
lig4 -7.87 0.42
lig5 -6.75 0.31
lig6 -6.24 0.24
lig7 -7.42 0.39
lig8 -5.19 0.11
lig9 -7.30 0.13
We can join these two files on the first column (aka field) like so:
$ join -j1 file1 file2
Lig dG(10V1) dG(rmsd) dG(10V2) dG(rmsd)
lig1 -6.78 0.32 -7.04 0.20
lig2 -5.56 0.14 -5.79 0.45
lig3 -7.30 0.78 -7.28 0.71
lig4 -7.98 0.44 -7.87 0.42
lig5 -6.78 0.28 -6.75 0.31
lig6 -6.24 0.24 -6.24 0.24
lig7 -7.44 0.40 -7.42 0.39
lig8 -4.62 0.41 -5.19 0.11
lig9 -7.26 0.16 -7.30 0.13
For more than 2 files some sort of repetitive/looping method would be needed to repeatedly join a new file into the mix.
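A rough sketch of that loop (the file names and temp-file handling are assumptions; join also expects its inputs sorted on the join field, which the sample data already is):
out=$(mktemp)
cp file1 "$out"
for f in file2 file3; do          # fold each remaining file into the running result
    tmp=$(mktemp)
    join -j1 "$out" "$f" > "$tmp"
    mv "$tmp" "$out"
done
cat "$out"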

Why does if condition print zero value for non-zero condition?

I am unable to get the required output from my bash code.
I have a text file:
1 0.00 0.00
2 0.00 0.00
3 0.00 0.08
4 0.00 0.00
5 0.04 0.00
6 0.00 0.00
7 -3.00 0.00
8 0.00 0.00
The required output should be only non-zero values:
0.08
0.04
-3.0
This is my code:
z=0.00
while read line
do
    line_o="$line"
    openx_de=`echo $line_o | awk -F' ' '{print $2,$3}'`
    IFS=' ' read -ra od <<< "$openx_de"
    for i in "${od[@]}"; do
        if [ $i != $z ]
        then
            echo "openx_default_value is $i"
        fi
    done
done < /openx.txt
but it also gets the zero values.
To get only the nonzero values from columns 2 and 3, try:
$ awk '$2+0!=0{print $2} $3+0!=0{print $3}' openx.txt
0.08
0.04
-3.00
How it works:
$2+0 != 0 {print $2} tests to see if the second column is nonzero. If it is nonzero, then the print statement is executed.
We want to do a numeric comparison between the second column, $2, and zero. To tell awk to treat $2 as a number, we first add zero to it and then we do the comparison.
The same is done for column 3.
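A quick illustration of why the +0 matters (the input value here is just an assumed example):
$ echo '0.0' | awk '{ print ($1 == "0.00"), ($1+0 == 0) }'
0 1
As a string, "0.0" differs from "0.00", but numerically both are zero; adding 0 forces the numeric comparison.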
Using column names
Consider this input file:
$ cat openx2.txt
n first second
1 0.00 0.00
2 0.00 0.00
3 0.00 0.08
4 0.00 0.00
5 0.04 0.00
6 0.00 0.00
7 -3.00 0.00
8 0.00 0.00
To print the column name with each value found, try:
$ awk 'NR==1{two=$2; three=$3; next} $2+0!=0{print two,$2} $3+0!=0{print three,$3}' openx2.txt
second 0.08
first 0.04
first -3.00
Another option: blank out the first column, delete the literal 0.00 fields, and print any line that still has fields left (note that in /0.00/ the dot is a regex metacharacter, so it matches any character, not just a period):
awk '{$1=""}{gsub(/0.00/,"")}NF{$1=$1;print}' file
0.08
0.04
-3.00

Combine rows with the same name in each column using bash

I have a file like the following (but with 52 columns and 4,000 rows):
1NA2 1NB2 2RA2 2RB2
Vibrionaceae 0.22 0.25 0.36 1.02
Bacillaceae 2.0 1.76 0.55 0.23
Enterobacteriaceae 0.55 0.52 2.40 1.23
Vibrionaceae 0.22 0.25 0.36 1.02
Bacillaceae 2.0 1.76 0.55 0.23
Enterobacteriaceae 0.55 0.52 2.40 1.23
And I want it to look like this:
1NA2 1NB2 2RA2 2RB2
Vibrionaceae 0.44 0.50 0.72 2.04
Bacillaceae 4.0 3.52 1.10 0.46
Enterobacteriaceae 1.10 1.04 4.80 2.46
edit: I'm sorry, I don't want to delete the remaining rows and columns. Every row name is repeated several times, so I want each name to appear only once, with the total in every column.
I have tried the following:
awk '{a[$1]+=$2}END{for(i in a) print i,a[i]}' file
but it only sums the first data column, and I want it to work for all 52 columns.
With GNU awk and a 2D array:
awk 'NR==1
NR>1{
for(i=2; i<=NF; i++){
a[$1][i]+=$i
}
}
END{
for(i in a){
printf("%-19s", i)
for(j=2; j<=NF; j++){
printf("%.2f ", a[i][j])
}
print ""
}
}' file
or as a one-liner:
awk 'NR==1; NR>1{for(i=2; i<=NF; i++){a[$1][i]+=$i}} END{for(i in a){printf("%-19s", i); for(j=2; j<=NF; j++){printf("%.2f ", a[i][j])} print ""}}' file
Output:
1NA2 1NB2 2RA2 2RB2
Bacillaceae 4.00 3.52 1.10 0.46
Vibrionaceae 0.44 0.50 0.72 2.04
Enterobacteriaceae 1.10 1.04 4.80 2.46
NR is the line number
NF is the number of fields in a row
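Note that true multidimensional arrays (a[x][y]) are a GNU awk extension. A portable sketch using the classic (i,j) subscript form, which also preserves the first-seen row order (same assumed input layout; like the answer above, it reads NF in the END block, so every row must have the same number of fields):
awk '
NR == 1                                   # print the header line unchanged
NR > 1 {
    for (i = 2; i <= NF; i++) a[$1, i] += $i
    if (!seen[$1]++) names[++n] = $1      # remember first-seen order of the row names
}
END {
    for (k = 1; k <= n; k++) {
        printf("%-19s", names[k])
        for (i = 2; i <= NF; i++) printf("%.2f ", a[names[k], i])
        print ""
    }
}' file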

iostat & steal time

I am trying to catch some data from iostat output:
# iostat -m
avg-cpu: %user %nice %system %iowait %steal %idle
9.92 0.00 14.17 0.01 0.00 75.90
Device: tps MB_read/s MB_wrtn/s MB_read MB_wrtn
sda 6.08 0.00 0.04 2533 261072
dm-0 1.12 0.00 0.00 1290 30622
dm-1 0.00 0.00 0.00 1 0
dm-2 1.22 0.00 0.00 0 33735
dm-3 7.22 0.00 0.03 1213 196713
How can I match the "0.00" value in the %steal column?
Numbers aren't separated by a tab or a constant number of spaces.
Also the value can be 3 digits 0.00 or 4 digits 45.00, etc.
Any idea how to match it using bash?
Try this, using awk:
iostat | awk 'NR==3 { print $5 }'
NR==3 will operate on the third line, and $5 prints column 5. Verify that the proper column is being selected by playing around with the number; e.g., with your output, print $4 should yield 0.01.
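If you would rather not depend on a fixed line number (the header position can vary between iostat versions), a sketch that locates the %steal column by its header instead:
# the header row has a leading "avg-cpu:" token, so the value sits one field to the left
iostat | awk '/avg-cpu/ { for (i = 1; i <= NF; i++) if ($i == "%steal") col = i - 1; getline; print $col }'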

awk search and calculate standard deviation different results

I am working on taking the output of sar and calculating the standard deviation of a column. I can do this successfully with a file that contains only that single column. However, when I calculate the same column from the full file, filtering out the 'bad' lines (like the title lines and the average lines) inside awk, it gives me a different value.
Here are the files I am performing this on:
/tmp/saru.tmp
# cat /tmp/saru.tmp
Linux 2.6.32-279.el6.x86_64 (progserver) 09/06/2012 _x86_64_ (4 CPU)
11:09:01 PM CPU %user %nice %system %iowait %steal %idle
11:10:01 PM all 0.01 0.00 0.05 0.01 0.00 99.93
11:11:01 PM all 0.01 0.00 0.06 0.00 0.00 99.92
11:12:01 PM all 0.01 0.00 0.05 0.01 0.00 99.93
11:13:01 PM all 0.01 0.00 0.05 0.00 0.00 99.93
11:14:01 PM all 0.01 0.00 0.04 0.00 0.00 99.95
11:15:01 PM all 0.01 0.00 0.06 0.00 0.00 99.92
11:16:01 PM all 0.01 0.00 2.64 0.01 0.01 97.33
11:17:01 PM all 0.02 0.00 21.96 0.00 0.08 77.94
11:18:01 PM all 0.02 0.00 21.99 0.00 0.08 77.91
11:19:01 PM all 0.02 0.00 22.10 0.00 0.09 77.78
11:20:01 PM all 0.02 0.00 22.06 0.00 0.09 77.83
11:21:01 PM all 0.02 0.00 22.10 0.03 0.11 77.75
11:22:01 PM all 0.01 0.00 21.94 0.00 0.09 77.95
11:23:01 PM all 0.02 0.00 22.15 0.00 0.10 77.73
11:24:01 PM all 0.02 0.00 22.02 0.00 0.09 77.87
11:25:01 PM all 0.02 0.00 22.03 0.00 0.13 77.82
11:26:01 PM all 0.02 0.00 21.96 0.01 0.14 77.86
11:27:01 PM all 0.02 0.00 22.00 0.00 0.09 77.89
11:28:01 PM all 0.02 0.00 21.91 0.00 0.09 77.98
11:29:01 PM all 0.03 0.00 22.02 0.02 0.08 77.85
11:30:01 PM all 0.14 0.00 22.23 0.01 0.13 77.48
11:31:01 PM all 0.02 0.00 22.26 0.00 0.16 77.56
11:32:01 PM all 0.03 0.00 22.04 0.01 0.10 77.83
Average: all 0.02 0.00 15.29 0.01 0.07 84.61
/tmp/sarustriped.tmp
# cat /tmp/sarustriped.tmp
0.05
0.06
0.05
0.05
0.04
0.06
2.64
21.96
21.99
22.10
22.06
22.10
21.94
22.15
22.02
22.03
21.96
22.00
21.91
22.02
22.23
22.26
22.04
The Calculation based on /tmp/saru.tmp:
# awk '$1~/^[01]/ && $6~/^[0-9]/ {sum+=$6; array[NR]=$6} END {for(x=1;x<=NR;x++){sumsq+=((array[x]-(sum/NR))**2);}print sqrt(sumsq/NR)}' /tmp/saru.tmp
10.7126
The Calculation based on /tmp/sarustriped.tmp ( the correct one )
# awk '{sum+=$1; array[NR]=$1} END {for(x=1;x<=NR;x++){sumsq+=((array[x]-(sum/NR))**2);}print sqrt(sumsq/NR)}' /tmp/sarustriped.tmp
9.96397
Could someone tell me why these results are different, and is there a way to get the correct result with a single awk command? I am trying to do this for performance, so avoiding a separate command like grep or a second awk invocation is preferable.
Thanks!
UPDATE
so I tried this ...
awk '
$1~/^[01]/ && $6~/^[0-9]/ {
numrec += 1
sum += $6
array[numrec] = $6
}
END {
for(x=1; x<=numrec; x++)
sumsq += ((array[x]-(sum/numrec))^2)
print sqrt(sumsq/numrec)
}
' saru.tmp
and it works correctly for the sar -u output I was working with. I do not see why it would not work with other 'lists'. To make it short: trying to work with sar -r column 5, it is giving a wrong answer again. The output is 1.68891 but the actual deviation is .107374. This is the same command that worked with sar -u. If you need files I can provide them. (Not sure how to add a new 'full' comment, so I just edited the old one. Thanks!)
I think the bug is that your first awk command (the one that operates on saru.tmp) counts every input line in NR, including the skipped ones, so when you do the math using NR your result depends on the number of skipped lines. When you remove all of the invalid lines beforehand, the two programs agree. So in the first command, you should divide by the number of matching lines rather than NR.
How about this?
awk '
$1 ~ /^[01]/ && $6~/^[0-9]/ {
numrec += 1
sum += $6
array[numrec] = $6
}
END {
for(x=1; x<=numrec; x++)
sumsq += (array[x]-(sum/numrec))^2
print sqrt(sumsq/numrec)
}
' saru.tmp
For debugging problems like this, the simplest technique is to print out some basic data. You might print the number of items, and the sum of the values, and the sum of the squares of the values (or sum of the squares of the deviations from the mean). This will likely tell you what's different between the two runs. Sometimes, it might help to print out the values you're accumulating as you're accumulating the data. If I had to guess, I'd suspect you are counting inappropriate lines (blanks, or the decoration lines), so the counts are different (and maybe the sums too).
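For example, a quick diagnostic along those lines, reusing the same filter as the command above:
awk '$1~/^[01]/ && $6~/^[0-9]/ { n++; sum += $6; sumsq += $6*$6 }
END { printf "count=%d sum=%g sumsq=%g mean=%g\n", n, sum, sumsq, sum/n }' saru.tmp
Comparing these counters between the two variants shows immediately whether extra lines are being counted.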
I have a couple of (non-standard) programs to do the calculations. Given the 23 relevant lines from the multi-column output in a file data, I ran:
$ colnum -c 6 data | pstats
# Count = 23
# Sum(x1) = 3.557200e+02
# Sum(x2) = 7.785051e+03
# Mean = 1.546609e+01
# Std Dev = 1.018790e+01
# Variance = 1.037934e+02
# Min = 4.000000e-02
# Max = 2.226000e+01
$
The standard deviation here is the sample standard deviation rather than the population standard deviation; the difference is dividing by (N-1) for the sample and N for the population.
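If it helps to see the two side by side, a one-liner computing both flavors from the same accumulators (same filter assumed):
awk '$1~/^[01]/ && $6~/^[0-9]/ { n++; s += $6; ss += $6*$6 }
END { m = s/n; print "population:", sqrt(ss/n - m*m); print "sample:", sqrt((ss - n*m*m)/(n-1)) }' saru.tmp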
