Replacing the value of a specific field in a table-like string stored as a bash variable - bash

I am looking for a way to replace (with 0) a specific value (1043252782) in a "table-like" string stored as a bash variable. The output of echo "$var" looks like this:
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 090 060 045 Pre-fail Always - 1043252782
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
After the replacement echo "$var" should look like this:
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 090 060 045 Pre-fail Always - 0
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
Is there a way to do this without saving the content of $var to a file, manipulating it directly within bash (a shell script)?
Maybe with awk? I can select the value in the 10th field of the second record with awk and pattern matching ("7 Seek_Error_Rate ....") like this:
echo "$var" | awk '/^ 7/{print $10}'
Maybe there is some way of doing it with awk (or another CLI tool) to replace the value and store the result back into $var? Also, the value changes over time, but the structure remains the same (some record, with the value in the 10th field).

You can change a specific string directly in the shell:
var=${var/1043252782/0}
To replace the final number of the second line, you could use awk or sed:
var=$(awk 'NR==2 { sub(/[0-9]+$/,0) }1' <<<"$var")
var=$(sed '2s/[0-9][0-9]*$/0/' <<<"$var")
If you don't know which line it will be, you can match a known string:
var=$(awk '/Seek_Error_Rate/{ sub(/[0-9]+$/,0) }1' <<<"$var")
var=$(sed '/Seek_Error_Rate/s/[0-9][0-9]*$/0/' <<<"$var")
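Since the value changes over time but is always the 10th field of that record, a variant is to overwrite the field itself rather than matching the literal number; a small sketch (note that assigning to a field makes awk rebuild the line with single spaces between fields):
var=$(awk '$2=="Seek_Error_Rate" {$10=0} 1' <<<"$var")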

You can use a here-string to feed the variable as input to awk.
Use sub() to perform a regular expression replacement.
var=$(awk '{sub(/1043252782$/, "0")}1' <<<"$var")

Using sed
$ var=$(sed '/1043252782$/s//0/' <<< "$var")
$ echo "$var"
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 090 060 045 Pre-fail Always - 0
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0

If you don't want to ruin the formatting of tabs and spaces ({m,g}wk below meaning mawk or gawk):
{m,g}wk NF=NF FS=' 1043252782$' OFS=' 0'
:
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 090 060 045 Pre-fail Always - 0
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
Or doing the whole file in one single shot:
awk NF=NF FS=' 1043252782\n' OFS=' 0\n' RS='^$' ORS=
awk NF=NF FS=' 1043252782\n' OFS=' 0\n' RS= -- (this might work too, but I'm not well versed in the side effects of a blank RS)

Related

How to use the value in a file as input for a calculation in awk - in bash?

I'm trying to work out whether the count for each row is more than a certain value: 30% of the total counts.
Within a for loop, I've obtained the percentage with awk '$1=($1/100)*30' ${i}_counts > ${i}_percentage-value; that's a single number, and the output contains only that.
How do I make the "value is greater than" comparison for each row of ${i}_counts against ${i}_percentage-value?
In other words, how can I use the number inside a file as a numerical value for a math operation?
Data:
data.csv (an extract)
SampleID ASV Count
1000A ASV_1216 14
1000A ASV_12580 150
1000A ASV_12691 260
1000A ASV_135 434
1000A ASV_147 79
1000A ASV_15 287
1000A ASV_16 361
1000A ASV_184 8
1000A ASV_19 42
samples-ID-short
1000A
1000B
1000C
So for each sample ID there are many ASVs, and the number varies a lot: say 50 ASVs for 1000A, 120 for 1000B, and so on. Every ASV_## has a count, and my code calculates the total of the counts, works out the 30% value for each sample, and reports which ASV_## are greater than that 30%. Ultimately, it should report 0 for <30% and 1 for >30%.
Here's my code so far:
for i in $(cat samplesID-short)
do
grep ${i} data.csv | cut -d , -f3 - > ${i}_count_sample
grep ${i} data.csv | cut -d , -f2 - > ${i}_ASV
awk '{ sum += $1; } END { print sum; }' ${i}_count_sample > ${i}_counts
awk '$1=($1/100)*30' ${i}_counts > ${i}_percentage-value
#I was thinking about replicating the numeric value for the entire column and making the "greater than" comparison, but the number of repetitions depends on the ASV count for each sample, and they are always different.
wc -l ${i}_ASV > n
for (( c=1; c<=n; c++)) ; do echo ${i}_percentage-value ; done
paste <(sed 's/^[[:blank:]]*//' ${i}_ASV) ${i}_count_sample ${i}_percentage-value > ${i}_tmp;
awk 'BEGIN{OFS="\t"}{if($2 >= $3) print $1}' ${i}_tmp > ${i}_is30;
#How the output should be:
paste <(sed 's/^[[:blank:]]*//' ${i}_ASV) ${i}_count_sample ${i}_counts ${i}_percentage-value ${i}_is30 > ${i}_summary_nh
echo -e "ASV_ID\tASV_in_sample\ttotal_ASVs_inSample\ttreshold_for_30%\tASV_over30%" | cat - ${i}_summary_nh > ${i}_summary
rm ${i}_count_sample ${i}_counts ${i}_percentage-value ${i}_ASV ${i}_summary_nh ${i}_is30
done &
You can filter on a column based on a value, e.g.:
$ awk '$3>300' data.csv
SampleID ASV Count
1000A ASV_135 434
1000A ASV_16 361
You can use >= for greater than or equal to.
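If the threshold already sits in ${i}_percentage-value as a single number, another option is to read it into a shell variable and pass it to awk with -v; a minimal sketch, assuming data.csv is whitespace-separated as shown above and $i holds the current SampleID from your loop:
thr=$(< "${i}_percentage-value")     # the file contains one number: 30% of the sample's total
awk -v t="$thr" -v id="$i" '$1 == id && $3 >= t' data.csv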
It looks like your script is overcomplicating matters.
This should work:
$ awk 'NR==1 || $3>$1*3/10' file
SampleID ASV Count
1000A ASV_135 434
1000A ASV_16 361
or, with the indicator column
$ awk 'NR==1{print $0, "Ind"} NR>1{print $0, ($3>$1*3/10)}' file | column -t
SampleID ASV Count Ind
1000A ASV_1216 14 0
1000A ASV_12580 150 0
1000A ASV_12691 260 0
1000A ASV_135 434 1
1000A ASV_147 79 0
1000A ASV_15 287 0
1000A ASV_16 361 1
1000A ASV_184 8 0
1000A ASV_19 42 0
Would you please try the following:
awk -v OFS="\t" '
NR==FNR { # this block is executed in the 1st pass only
if (FNR > 1) sum[$1] += $3
# accumulate the "count" for each "SampleID"
next
}
# the following block is executed in the 2nd pass only
FNR > 1 { # skip the header line
if ($1 != prev_id) {
# SampleID has changed: update the output filename and print the header line
if (outfile) close(outfile)
# close previous outfile
outfile = $1 "_summary"
print "ASV_ID", "ASV_in_sample", "total_ASVs_inSample", "treshold_for_30%", "ASV_over30%" >> outfile
prev_id = $1
}
mark = ($3 > sum[$1] * 0.3) ? 1 : 0
# set the mark to "1" if the "Count" exceeds 30% of sum
print $2, $3, sum[$1], sum[$1] * 0.3, mark >> outfile
# append the line to the summary file
}
' data.csv data.csv
data.csv:
SampleID ASV Count
1000A ASV_1216 14
1000A ASV_12580 150
1000A ASV_12691 260
1000A ASV_135 434
1000A ASV_147 79
1000A ASV_15 287
1000A ASV_16 361
1000A ASV_184 8
1000A ASV_19 42
1000B ASV_1 90
1000B ASV_2 90
1000B ASV_3 20
1000C ASV_4 100
1000C ASV_5 10
1000C ASV_6 10
In the following output examples, the last field ASV_over30% indicates 1 if the count exceeds 30% of the sum value.
1000A_summary:
ASV_ID ASV_in_sample total_ASVs_inSample treshold_for_30% ASV_over30%
ASV_1216 14 1635 490.5 0
ASV_12580 150 1635 490.5 0
ASV_12691 260 1635 490.5 0
ASV_135 434 1635 490.5 0
ASV_147 79 1635 490.5 0
ASV_15 287 1635 490.5 0
ASV_16 361 1635 490.5 0
ASV_184 8 1635 490.5 0
ASV_19 42 1635 490.5 0
1000B_summary:
ASV_ID ASV_in_sample total_ASVs_inSample treshold_for_30% ASV_over30%
ASV_1 90 200 60 1
ASV_2 90 200 60 1
ASV_3 20 200 60 0
1000C_summary:
ASV_ID ASV_in_sample total_ASVs_inSample treshold_for_30% ASV_over30%
ASV_4 100 120 36 1
ASV_5 10 120 36 0
ASV_6 10 120 36 0
[Explanations]
When calculating a total or average of the input data, we need to read through to the end of the data. If we also want to print each input record together with that value (or other information based on it), we need a trick. There are two common methods:
1. Store the whole input in memory.
2. Read the input data twice.
As awk is well suited to reading multiple files and changing the procedure depending on the order of the files, I have picked the 2nd method.
The condition NR==FNR is TRUE only while the 1st file is being read.
We calculate the per-sample sum of the Count field within this block as the 1st pass.
The next statement at the end of the block skips the rest of the code.
When the 1st file is done, the script reads the 2nd file, which is of course the same as the 1st file.
While reading the 2nd file, the condition NR==FNR no longer holds and the 1st block is skipped.
The 2nd block reads the input again, opening a file for the output, reading the input data line by line, and adding information such as the sum obtained in the 1st pass.
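For comparison, a minimal sketch of the 1st method (buffer the records in memory and make a single pass), printing to stdout rather than to per-sample files:
awk -v OFS="\t" '
NR > 1 { line[NR] = $0; sum[$1] += $3 }   # buffer each data line and accumulate per-sample totals
END {
    for (i = 2; i <= NR; i++) {
        split(line[i], f)                 # re-split the stored record into its fields
        print f[2], f[3], sum[f[1]], sum[f[1]] * 0.3, (f[3] > sum[f[1]] * 0.3 ? 1 : 0)
    }
}' data.csv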

process second column if first column matches

I just want the second column to be multiplied by exp(3) if the first column matches the parameter I define.
cat inputfile.i
100 2
200 3
300 1
100 5
200 2
300 3
I want the output to be:
100 2
200 60.25
300 1
100 5
200 40.17
300 3
I tried this code:
awk ' $1 == "200" {print $2*exp(3)}' inputfile
but nothing actually shows
You are not printing the unmatched lines, and you don't need to quote numbers:
$ awk '$1==200{$2*=exp(3)}1' file
100 2
200 60.2566
300 1
100 5
200 40.1711
300 3
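If the value to match should come from a shell variable rather than being hard-coded, you can pass it in with -v; a small sketch (the name key is just an example):
key=200
awk -v key="$key" '$1 == key { $2 *= exp(3) } 1' inputfile.i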
Is there a difference between inputfile.i and inputfile?
Anyway, here is my solution for you:
awk '$1 == 200 {printf "%s %.2f\n",$1,$2*exp(3)};$1 != 200 {print $0}' inputfile.i
100 2
200 60.26
300 1
100 5
200 40.17
300 3

Divide column values of different files by a constant then output one minus the other

I have two files of the form
file1:
#fileheader1
0 123
1 456
2 789
3 999
4 112
5 131
6 415
etc.
file2:
#fileheader2
0 442
1 232
2 542
3 559
4 888
5 231
6 322
etc.
How can I take the second column of each, divide it by a constant, subtract one result from the other, and then output a new third file with the new values?
I want the output file to have the form
#outputheader
0 123/c-422/k
1 456/c-232/k
2 789/c-542/k
etc.
where c and k are numbers I can plug into the script
I have seen this question: subtract columns from different files with awk
But I don't know how to do this with awk by myself. Does anyone know how to do this, or could you explain what is going on in the linked question so I can try to modify it?
I'd write:
awk -v c=10 -v k=20 ' ;# pass values to awk variables
/^#/ {next} ;# skip headers
FNR==NR {val[$1]=$2; next} ;# store values from file1
$1 in val {print $1, (val[$1]/c - $2/k)} ;# perform the calc and print
' file1 file2
output
0 -9.8
1 34
2 51.8
3 71.95
4 -33.2
5 1.55
6 25.4
etc. 0
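To also emit a header line and write the result to a new third file, as the question asks, one possible variant (a sketch; file1, file2, file3 and the "#outputheader" text are taken from the question):
awk -v c=10 -v k=20 '
BEGIN {print "#outputheader"}          ;# write the new header first
/^#/ {next}                            ;# skip the input file headers
FNR==NR {val[$1]=$2; next}             ;# store values from file1
$1 in val {print $1, val[$1]/c - $2/k} ;# perform the calc and print
' file1 file2 > file3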

Search for a value in a file and remove subsequent lines

I'm developing a shell script but I am stuck with the below part.
I have the file sample.txt:
S.No Sub1 Sub2
1 100 200
2 100 200
3 100 200
4 100 200
5 100 200
6 100 200
7 100 200
I want to search the S.No column in sample.txt. For example, if I'm searching for the value 5, I need only the rows up to 5; I don't want the rows where the value in S.No is larger than 5.
The output, output.txt, must look like this:
S.No Sub1 Sub2
1 100 200
2 100 200
3 100 200
4 100 200
5 100 200
Print the first line and any other line where the first field is less than or equal to 5:
$ awk 'NR==1||$1<=5' file
S.No Sub1 Sub2
1 100 200
2 100 200
3 100 200
4 100 200
5 100 200
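To avoid hard-coding the cutoff, the same idea works with the value passed in as an awk variable; a sketch using the file names from the question:
n=5
awk -v n="$n" 'NR==1 || $1<=n' sample.txt > output.txt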
Using perl:
perl -ane 'print if $F[0]<=5' file
And the sed solution
n=5
sed "/^$n[[:space:]]/q" filename
The sed q command quits after printing the current line, so everything after the matching line is dropped.
The suggested awk relies on column 1 being numerically sorted. A generic awk that fulfills the question title would be:
gawk -v p=5 '$1==p {print; exit} {print}' file
However, in this situation, sed is better IMO. Use -i to modify the input file.
sed '6q' sample.txt > output.txt
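For example, with GNU sed the in-place form would be the following (a sketch; note it truncates sample.txt itself rather than writing output.txt):
sed -i '6q' sample.txt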

In bash, how could I add integers with leading zeroes and maintain a specified buffer

For example, I want to count from 001 to 100. Meaning the zero buffer would start off with 2, 1, then eventually 0 when it reaches 100 or more.
ex:
001
002
...
010
011
...
098
099
100
I could do this if the numbers had a predefined number of zeroes, with printf "%02d" $i. But that's static, not dynamic, and would not work in my example.
If by static versus dynamic you mean that you'd like to be able to use a variable for the width, you can do this:
$ padtowidth=3
$ for i in 0 {8..11} {98..101}; do printf "%0*d\n" $padtowidth $i; done
000
008
009
010
011
098
099
100
101
The asterisk is replaced by the value of the variable it corresponds to in the argument list ($padtowidth in this case).
Otherwise, the only reason your example doesn't work is that you used "2" (perhaps as if it were the maximum padding to apply) when it should be "3" (as in my example), since that value is the resulting total width, not the pad-only width.
If your system has it, try seq with the -w (--equal-width) option:
$ seq -s, -w 1 10
01,02,03,04,05,06,07,08,09,10
$ for i in `seq -w 95 105` ; do echo -n " $i" ; done
095 096 097 098 099 100 101 102 103 104 105
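For the 001-to-100 case from the question, that is simply:
$ seq -w 1 100
001
002
...
099
100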
In Bash version 4 (check with bash --version) you can use brace expansion. Putting a 0 before either limit forces the numbers to be padded with zeros:
echo {01..100} # 001 002 003 ...
echo {03..100..3} # 003 006 009 ...
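To get one number per line, as in the question's example, the same expansion can be fed to printf; a small sketch:
printf '%s\n' {001..100}   # prints 001, 002, ... 100, one per line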
#!/bin/bash
max=100;
for ((i=1;i<=$max;i++)); do
printf "%0*d\n" ${#max} $i
done
The code above will auto-pad your numbers with the correct number of 0's based upon how many digits the max/terminal value contains. All you need to do is change the max variable and it will handle the rest.
Examples:
max=10
01
02
03
04
05
06
07
08
09
10
max=100
001
002
003
004
005
006
...
097
098
099
100
max=1000
0001
0002
0003
0004
0005
0006
...
0997
0998
0999
1000
# jot is available on FreeBSD, Mac OS X, ...
jot -s " " -w '%03d' 5
jot -s " " -w '%03d' 10
jot -s " " -w '%03d' 50
jot -s " " -w '%03d' 100
If you need to pad values up to a variable number with variable padding:
values_count=514
padding_width=5
for i in 0 `seq 1 $(($values_count - 1))`; do printf "%0*d\n" $padding_width $i; done
This would print out 00000, 00001, ... 00513.
(I didn't find any of the current answers meeting my need)
