I'm trying to find the mean of several numbers in a file, taken from the lines that contain "<Overall>".
My code:
awk -v file=$file '{if ($1~"<Overall>") {rating+=$1; count++;}} {rating=rating/count; print file, rating;}}' $file | sed 's/<Overall>//'
I'm getting
awk: cmd. line:1: (FILENAME=[file] FNR=1) fatal: division by zero attempted
for every file. I can't see why count would be zero if the file does contain a line such as "<Overall>5".
EDIT:
Sample from the (very large) input file, as requested:
<Author>RW53
<Content>Location! Location? view from room of nearby freeway
<Date>Dec 26, 2008
<No. Reader>-1
<No. Helpful>-1
<Overall>3
<Value>4
<Rooms>3
<Location>2
<Cleanliness>4
<Check in / front desk>3
<Service>-1
<Business service>-1
Expected output:
[filename] X
Where X is the average of all the lines containing <Overall>
Use awk as below:
awk -F'<Overall>' 'NF==2 {sum+=$2; count++}
END{printf "[%s] %s\n",FILENAME,(count?sum/count:0)}' file
For an input file named input-file containing two <Overall> clauses, like this:
<Author>RW53
<Content>Location! Location? view from room of nearby freeway
<Date>Dec 26, 2008
<No. Reader>-1
<No. Helpful>-1
<Overall>3
<Value>4
<Rooms>3
<Location>2
<Cleanliness>4
<Check in / front desk>3
<Service>-1
<Business service>-1
<Overall>2
Running it produces,
[input-file] 2.5
The part -F'<Overall>' splits input lines with <Overall> as the delimiter, so only the lines containing <Overall> end up with two fields. On those lines, the number after the tag is $2, which is added to the sum variable, while count tracks how many such lines were seen.
The END clause gets executed after all lines are processed. It prints the filename using the awk special variable FILENAME, which retains the name of the file being processed, and the average is calculated iff the count is not zero.
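A quick way to check this behaviour is to run the same program against a hypothetical two-line sample file (file name input-file is illustrative):

```shell
# Build a minimal sample and compute the average of its <Overall> values.
printf '<Overall>3\n<Overall>2\n' > input-file
awk -F'<Overall>' 'NF==2 {sum+=$2; count++}
END{printf "[%s] %s\n",FILENAME,(count?sum/count:0)}' input-file
# prints: [input-file] 2.5
```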
You aren't waiting until you've completely read the file to compute the average rating. This is simpler if you use patterns rather than an if statement. You also need to remove <Overall> before you attempt to increment rating.
awk '$1 ~ /<Overall>/ {sub("<Overall>", "", $1); rating+=$1; count++;}
END {rating=rating/(count?count:1); print FILENAME, rating;}' "$file"
(Answer has been updated to fix a typo in the call to sub and to correctly avoid dividing by 0.)
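A sanity check of this approach on a hypothetical sample file (sample.txt and its contents are made up for illustration):

```shell
# Two <Overall> lines (4 and 2) and one unrelated line; average should be 3.
printf '<Overall>4\n<Value>1\n<Overall>2\n' > sample.txt
awk '$1 ~ /<Overall>/ {sub("<Overall>", "", $1); rating+=$1; count++;}
END {rating=rating/(count?count:1); print FILENAME, rating;}' sample.txt
# prints: sample.txt 3
```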
awk -F '>' '
# the field separator is >
# for lines that contain <Overall>
/<Overall>/ {
# evaluate the sum and increment counter
Rate+=$2;Count++}
# at the end of input
END{
# print the average.
printf( "[%s] %f\n", FILENAME, Rate / ( Count + ( ! Count ) ) )
}
' ${File}
# one liner
awk -F '>' '/<Overall>/{r+=$2;c++}END{printf("[%s] %f\n",FILENAME,r/(c+(!c)))}' ${File}
Note:
( c + ( ! c ) ) uses a side effect of logical NOT (!): !c evaluates to 1 if c = 0 and to 0 otherwise. So when c = 0 it adds 1, and otherwise it adds 0, ensuring a divisor of at least 1.
This assumes the full file reflects the sample content.
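The guard is easy to verify on its own:

```shell
# !c is 1 when c is 0 and 0 otherwise, so c+(!c) can never be 0.
awk 'BEGIN{ c=0; print c+(!c); c=5; print c+(!c) }'
# prints 1, then 5
```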
I have a file (in.txt) with the following columns:
# DM Sigma Time (s) Sample Downfact
78.20 7.36 134.200512 2096883 70
78.20 7.21 144.099904 2251561 70
78.20 9.99 148.872384 2326131 150
78.20 10.77 283.249664 4425776 45
I want to write a bash script to divide all values in column 'Time' by 0.5867, with a precision of 2 decimal places, and print the resulting values to another file out.txt.
I tried using bc/awk but it gives this error.
awk: cmd. line:1: fatal: division by zero attempted
awk: fatal: cannot open file `file' for reading (No such file or directory)
Could someone help me with this? Thanks.
This is the bash script that I attempted:
cat in.txt | while read DM Sigma Time Sample Downfact; do
echo "$DM $Sigma $Time $Sample $Downfact"
pperiod = 0.5867
awk -v n=$Time 'BEGIN {printf "%.2f\n", (n/$pperiod)}'
#echo "scale=2 ; $Time / $pperiod" | bc
#echo "$subint" > out.txt
done
I expected the script to divide column 'Time' with pperiod and get the result with a precision of 2 decimal places. This result should be printed to a file named out.txt
Lots of issues with current awk code:
need to pass in the value of the $pperiod variable
need to reference the Time column by its position ($3 in this case)
BEGIN{} block is applied before any input lines are processed and has nothing to do with processing of actual input lines
there is no code to perform processing on actual input lines
need to decide what to do in the case of a divide by zero scenario (in this case we'll default answer to 0.00)
NOTE: current code generates divide by zero error because $pperiod is an undefined (awk) variable which in turn defaults to 0
additionally, pperiod = 0.5867 is invalid bash syntax
One idea for fixing current issues:
pperiod=0.5867
awk -v pp="${pperiod}" 'NR>1 {printf "%.2f\n", (pp==0 ? 0 : ($3/pp))}' in.txt > out.txt
Where:
-v pp="${pperiod}" - assign awk variable pp the value of the bash variable "${pperiod}"
NR>1 - skip header line
NR>1 {printf "%.2f\n" ...} - for each input line other than the header line, print the result of dividing the Time column (aka $3) by the value of the awk variable pp (which holds the value of the bash variable "${pperiod}")
(pp==0 ? 0 : ($3/pp)) - if pp is equal to 0 we print 0, else we print the result of $3/pp (this keeps us from generating a divide by zero error)
NOTE: this also eliminates the need for the cat|while loop
This generates:
$ cat out.txt
228.74
245.61
253.75
482.78
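The whole thing can be reproduced end-to-end; here a shortened in.txt (header plus the first data line only) stands in for the real input:

```shell
# Recreate a one-row version of in.txt and run the fixed command.
printf '# DM Sigma Time (s) Sample Downfact\n78.20 7.36 134.200512 2096883 70\n' > in.txt
pperiod=0.5867
awk -v pp="${pperiod}" 'NR>1 {printf "%.2f\n", (pp==0 ? 0 : ($3/pp))}' in.txt > out.txt
cat out.txt
# 134.200512 / 0.5867 rounds to 228.74
```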
I want to remove specific sequence in the list with IDs and extract sequence from large fasta file.
input test.fasta file:
>GHAT8X
MKFNDIRNDGHEDCFNNIIFASKLSSHKNVLKLTGCCLETRIPVIVFESVKNRTLADHIYQNQPHFEPLLLSQRLRIAVHIANAIAYLHIGFSRPILHRKIRPSRIFLDEGYIAKLFDFSLSVSIPEGETCVKDKVTGTMGFLAPEYI
>GHAMNO
MRLIGCCLETENPVLVFEYVEYGTLADRIYHPRQPNFEPVTCSLRLKIAMEIAYGIAYLHVAFSRPIVFRNVKPSNILFQEQSVAKLFDFSYSESIPEGETRIRGRVMGTFGYLPPEYIATGDCNEKCDVYSFGMLLLELLTGQRAVD
>GHAXM6
MYSCLGAIKNSGKEDKEKCIMRNGKNLLENLISSFNDGETHIKDAIPIGIMGFVATEYVTTGDYNEKCDVFSFGVLLLVLLTGQKLYSIDEAGDRHWLLNRVKKHIECNTFDEIVDPVIREELCIQSSEKDKQVQAFVELAVKCVSES
seqid_len.txt file:
GHAT8X 25
GHAMNO 26
GHAXM6 20
Expected output:
>GHAT8X
SSHKNVLKLTGCCLETRIPVIVFESVKNRTLADHIYQNQPHFEPLLLSQRLRIAVHIANA
IAYLHIGFSRPILHRKIRPSRIFLDEGYIAKLFDFSLSVSIPEGETCVKDKVTGTMGFLA
PEYI
>GHAMNO
ADRIYHPRQPNFEPVTCSLRLKIAMEIAYGIAYLHVAFSRPIVFRNVKPSNILFQEQSVA
KLFDFSYSESIPEGETRIRGRVMGTFGYLPPEYIATGDCNEKCDVYSFGMLLLELLTGQR
AVD
>GHAXM6
MRNGKNLLENLISSFNDGETHIKDAIPIGIMGFVATEYVTTGDYNEKCDVFSFGVLLLVL
LTGQKLYSIDEAGDRHWLLNRVKKHIECNTFDEIVDPVIREELCIQSSEKDKQVQAFVEL
AVKCVSES
I tried:
sed 's/_/|/g' seqid_len.txt | while read line;do grep -i -A1 ${line%%[1-9]*} test.fasta | seqkit subseq -r ${line##[a-z]* }:-1 ; done
I'm only getting the GHAT8X 25 and GHAMNO 26 sequences out, and renaming the header does not work.
Any correction on this or any python solution would be really helpful.
Have a great weekend.
Thanks
Would you please try the following:
#!/bin/bash
awk 'NR==FNR {a[">" $1] = $2 + 0; next} # create an array which maps the header to the starting position of the sequence
$0 in a { # the header matches an array index
start = a[$0] # get the starting position
print # print the header
getline # read the sequence line
print substr($0, start) # print the sequence by removing the beginnings
}
' seqid_len.txt test.fasta | fold -w 60 # wrap the output within 60 columns
Output:
>GHAT8X
SSHKNVLKLTGCCLETRIPVIVFESVKNRTLADHIYQNQPHFEPLLLSQRLRIAVHIANA
IAYLHIGFSRPILHRKIRPSRIFLDEGYIAKLFDFSLSVSIPEGETCVKDKVTGTMGFLA
PEYI
>GHAMNO
ADRIYHPRQPNFEPVTCSLRLKIAMEIAYGIAYLHVAFSRPIVFRNVKPSNILFQEQSVA
KLFDFSYSESIPEGETRIRGRVMGTFGYLPPEYIATGDCNEKCDVYSFGMLLLELLTGQR
AVD
>GHAXM6
IMRNGKNLLENLISSFNDGETHIKDAIPIGIMGFVATEYVTTGDYNEKCDVFSFGVLLLV
LLTGQKLYSIDEAGDRHWLLNRVKKHIECNTFDEIVDPVIREELCIQSSEKDKQVQAFVE
LAVKCVSES
You'll see the 3rd sequence starts with IMR.., one column shifted compared with your expected MRN... If the 3rd one is correct and the 1st and the 2nd sequences should be fixed, tweak the calculation $2 + 0 as $2 + 1.
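The off-by-one question comes down to awk's substr() being 1-based, which is easy to check in isolation (the string here is just the start of the first sample sequence):

```shell
# substr(s, n) keeps the string from the n-th character onward, 1-based.
awk 'BEGIN{ print substr("MKFND", 2) }'
# prints KFND
```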
I have a text file like
-7.9 8.5 1235 125478.30 9632
32656.12 0.0 0.0 2365.12 h
254 3365 12543 kk l423
My aim is to replace the position where 2365.12 occurs with another variable.
This will happen in a loop for more number of iterations.
Initially I used sed to replace by quoting the value, but I had problems with the occurrences: the same value might appear earlier in the same line, but I want exactly this position of the value to be replaced.
Hence I used the following command, that will take the postions into count to replace.
sed -i "3s/^\(.\{32\}\)\(.\{7\}\)\(.*\)/\1$variable\3/" outputdata.txt
Now my problem here is: when a value with fewer digits is substituted in one loop iteration, the following values shift left, and during the next iteration the position count no longer lines up with the next value.
Hence it would be so great if anyone can help me out with a way in which:
- I can replace only the set of positions that are non-space, or
- pad the remaining part of the line with spaces every time a shorter value is substituted, or
- some other method wherein I can point to that value and replace it.
The number of spaces separating the values can be different.
If I understand your question correctly, what you are seeking is this :
$ awk -v myvariable='XXXXX' '{ if( $4 == "2365.12" ) { $4 = myvariable }; print $0}' input.txt
-7.9 8.5 1235 125478.30 9632
32656.12 0.0 0.0 XXXXX h
254 3365 12543 kk l423
Please let me know if it helps.
Additionally, if you want this to happen in a specific line, you can use :
awk -v myvariable='XXXXX' '{ if( $4 == "2365.12" && NR == 2 ) { $4 = myvariable }; print $0}' input.txt
where NR is the line number.
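One caveat worth knowing with this approach (my note, not part of the answer above): assigning to any field makes awk rebuild the whole line joined by OFS (a single space by default), so the original column spacing is not preserved:

```shell
# Field assignment rebuilds $0 with single spaces between fields.
echo 'a   b    c' | awk '{ $2="X"; print }'
# prints: a X c
```

Given that the question cares about fixed column positions, this may or may not be acceptable.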
Just use printf, for example:
$ for x in 1 1000; do echo "foo [$(printf '%7s' $x)] bar"; done
foo [ 1] bar
foo [ 1000] bar
In your case it would mean replacing $variable with $(printf '%7s' $variable).
I'm trying to write a script to pull the integers out of 4 files that store temperature readings from 4 industrial freezers. This is a hobby script; it generates the general readouts I wanted. However, when I try to generate a SUM of the temperature readings, I get the following printout in the file, and my goal is to print only the end SUM, not the individual numbers in a vertical format.
Any help would be greatly appreciated; here's my code:
grep -o "[0.00-9.99]" "/location/$value-1.txt" | awk '{ SUM += $1; print $1} END { print SUM }' >> "/location/$value-1.txt"
here is what I am getting in return
Morningtemp:17.28
Noontemp:17.01
Lowtemp:17.00 Hightemp:18.72
1
7
.
2
8
1
7
.
0
1
1
7
.
0
0
1
8
.
7
2
53
It does generate the SUM I don't need the already listed numbers, just the SUM total
Why not stick with AWK completely? Code:
$ cat > summer.awk
{
while(match($0,/[0-9]+\.[0-9]+/)) # while matches on record
{
sum+=substr($0, RSTART, RLENGTH) # extract matches and sum them
$0=substr($0, RSTART + RLENGTH) # reset to start after previous match
count++ # count matches
}
}
END {
print sum"/"count"="sum/count # print stuff
}
Data:
$ cat > data.txt
Morningtemp:17.28
Noontemp:17.01
Lowtemp:17.00 Hightemp:18.72
Run:
$ awk -f summer.awk data.txt
70.01/4=17.5025
It might work in the winter too.
The regex in grep -o "[0.00-9.99]" "/location/$value-1.txt" is equivalent to [0-9.], but you're probably looking for numbers in the range 0.00 to 9.99. For that, you need a different regex:
grep -o "[0-9]\.[0-9][0-9]" "/location/$value-1.txt"
That looks for a digit, a dot, and two more digits. It was almost tempting to use [.] in place of \.; it would also work. A plain . would not; that would select entries such as 0X87.
Note that the pattern shown ([0-9]\.[0-9][0-9]) will match 192.16.24.231 twice (2.16 and 4.23). If that's not what you want, you have to be a lot more precise. OTOH, it may not matter in the slightest for the actual data you have. If you'd want it to match 192.16 and 24.231 (or .24 and .231), you have to refine your regex.
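The double match is straightforward to demonstrate:

```shell
# The pattern finds two fragments inside an IP-like string.
printf '192.16.24.231\n' | grep -o '[0-9]\.[0-9][0-9]'
# prints 2.16, then 4.23
```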
Your command structure:
grep … filename | awk '…' >> filename
is living dangerously. In the example, it is 'OK' (but there's a huge grimace on my face as I type 'OK') because the awk script doesn't write anything to the file until grep has read it all. But change the >> to > and you have an empty input, or have awk write material before the grep is complete and suddenly it gets very tricky to determine what happens (it depends, in part, on what awk writes to the end of the file).
I have a file like this:
91052011868;Export Equi_Fort Postal;EXPORT;23/02/2015;1;0;0
91052011868;Sof_equi_Fort_Email_am_%yyyy%%mm%%dd%;EMAIL;19/02/2015;1;0;0
91052011868;Sof_trav_Fort_Email_am_%yyyy%%mm%%dd%;EMAIL;19/02/2015;1;0;0
91052151371;Export Trav_faible temoin;EXPORT;12/02/2015;1;0;0
91052182019;Export Deme_fort temoin;EXPORT;24/02/2015;1;0;0
91052199517;Sof_voya_Faible_Email_pm;EMAIL;22/01/2015;1;0;0
91052199517;Sof_voya_Faible_Email_Relance_pm;EMAIL;26/01/2015;1;0;0
91052262558;Sof_deme_faible_Email_am;EMAIL;26/01/2015;1;0;1
91052265940;Sof_trav_Faible_Email_am_%yyyy%%mm%%dd%;EMAIL;13/02/2015;1;0;0
91052265940;Sof_trav_Faible_Email_Relance_am_%yyyy%%mm%%dd%;EMAIL;17/02/2015;1;0;0
91052265940;Sof_voya_Faible_Email_am_%yyyy%%mm%%dd%;EMAIL;13/02/2015;1;0;0
91052265940;Sof_voya_Faible_Email_Relance_am_%yyyy%%mm%%dd%;EMAIL;16/02/2015;1;0;0
91052531428;Export Trav_faible temoin;EXPORT;11/02/2015;1;0;0
91052547697;Export Deme_Faible Postal;EXPORT;27/02/2015;1;0;0
91052562398;Export Deme_faible temoin;EXPORT;18/02/2015;1;0;0
I want to know all the lines where the number of occurrences of the first column's value is greater than 1 but strictly less than 3.
91052199517;Sof_voya_Faible_Email_pm;EMAIL;22/01/2015;1;0;0
91052199517;Sof_voya_Faible_Email_Relance_pm;EMAIL;26/01/2015;1;0;0
I did the part below but it doesn't work...
sort file | awk 'NR==FNR{a[$1]++;next;}{ if (a[$1] > 0 && a[$1] <1 )print $0;}' file file
Why?
If what you want is to print all those lines whose first field appears twice, you can use this:
$ awk -F";" 'FNR==NR{a[$1]++; next} a[$1]==2' file file
91052199517;Sof_voya_Faible_Email_pm;EMAIL;22/01/2015;1;0;0
91052199517;Sof_voya_Faible_Email_Relance_pm;EMAIL;26/01/2015;1;0;0
This sets the field separator to the semi colon and then reads the file twice:
- the first time to count how many times the 1st field appears (a[$1]++)
- the second time to print those lines matching the condition a[$1]==2, that is, the lines whose first field appears twice throughout the file.
If you wanted those indexes appearing between 2 and 4 times, you could use the following syntax on the second block:
a[$1]>=2 && a[$1]<=4
Why wasn't your approach working?
Because your condition says:
if (a[$1] > 0 && a[$1] <1 )
which of course will never happen, since a[$1] is an integer and no integer is bigger than 0 and smaller than 1.
Note my proposed solution uses the same idea, only in a slightly more idiomatic way: there is no need for an explicit if condition, nor for print $0; printing the current line is exactly what awk does when a condition evaluates as True.
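The two-pass idea can be seen on a tiny made-up file (f.txt and its contents are illustrative): the first pass only counts keys, the second prints lines whose key was seen exactly twice.

```shell
# Key 'a' appears twice, 'b' once; only the 'a' lines should print.
printf 'a;1\na;2\nb;3\n' > f.txt
awk -F';' 'FNR==NR{a[$1]++; next} a[$1]==2' f.txt f.txt
# prints a;1 then a;2
```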