I'm trying to write a script to pull the numbers out of 4 files that store temperature readings from 4 industrial freezers. This is a hobby script and it generates the general readouts I wanted. However, when I try to generate a SUM of the temperature readings, I get the printout below in the file; my goal is to print only the final SUM, not the individual digits written out vertically.
Any help would be greatly appreciated; here's my code:
grep -o "[0.00-9.99]" "/location/$value-1.txt" | awk '{ SUM += $1; print $1} END { print SUM }' >> "/location/$value-1.txt"
here is what I am getting in return
Morningtemp:17.28
Noontemp:17.01
Lowtemp:17.00 Hightemp:18.72
1
7
.
2
8
1
7
.
0
1
1
7
.
0
0
1
8
.
7
2
53
It does generate the SUM, but I don't need the individual numbers listed above it, just the SUM total.
Why not stick with AWK completely? Code:
$ cat > summer.awk
{
    while (match($0, /[0-9]+\.[0-9]+/))    # while there are matches on the record
    {
        sum += substr($0, RSTART, RLENGTH) # extract each match and add it to the sum
        $0 = substr($0, RSTART + RLENGTH)  # restart after the previous match
        count++                            # count matches
    }
}
END {
    print sum"/"count"="sum/count          # print sum, count and average
}
Data:
$ cat > data.txt
Morningtemp:17.28
Noontemp:17.01
Lowtemp:17.00 Hightemp:18.72
Run:
$ awk -f summer.awk data.txt
70.01/4=17.5025
It might work in the winter too.
The regex in grep -o "[0.00-9.99]" "/location/$value-1.txt" is equivalent to [0-9.], but you're probably looking for numbers in the range 0.00 to 9.99. For that, you need a different regex:
grep -o "[0-9]\.[0-9][0-9]" "/location/$value-1.txt"
That looks for a digit, a dot, and two more digits. It was almost tempting to use [.] in place of \.; it would also work. A plain . would not; that would select entries such as 0X87.
Note that the pattern shown ([0-9]\.[0-9][0-9]) will match 192.16.24.231 twice (2.16 and 4.23). If that's not what you want, you have to be a lot more precise. OTOH, it may not matter in the slightest for the actual data you have. If you'd want it to match 192.16 and 24.231 (or .24 and .231), you have to refine your regex.
Your command structure:
grep … filename | awk '…' >> filename
is living dangerously. In the example, it is 'OK' (but there's a huge grimace on my face as I type 'OK') because the awk script doesn't write anything to the file until grep has read it all. But change the >> to > and you have an empty input, or have awk write material before the grep is complete and suddenly it gets very tricky to determine what happens (it depends, in part, on what awk writes to the end of the file).
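Putting those pieces together, here is a minimal sketch of a pipeline that prints only the total. It assumes the readings can have more than one digit before the decimal point (hence the extended regex rather than the single-digit one above), and it writes to a separate output file so the input is never read and written in the same pipeline; the .sum filename is just an illustration:
grep -Eo '[0-9]+\.[0-9]+' "/location/$value-1.txt" |
awk '{ sum += $1 } END { print sum }' > "/location/$value-1.sum"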
Related
My AWK script processes each log file in the folder "${results}", looking for a pattern (a number that occurs on the first line of the ranking table) and printing it on one line together with the filename of the log:
awk '$1=="1"{sub(/.*\//,"",FILENAME); sub(/\.log/,"",FILENAME); printf("%s: %s\n", FILENAME, $2)}' "${results}"/*_rep"${i}".log
Here is the format of each log file, from which the number
-9.14
should be taken
AutoDock Vina v1.2.3
#################################################################
# If you used AutoDock Vina in your work, please cite: #
# #
# J. Eberhardt, D. Santos-Martins, A. F. Tillack, and S. Forli #
# AutoDock Vina 1.2.0: New Docking Methods, Expanded Force #
# Field, and Python Bindings, J. Chem. Inf. Model. (2021) #
# DOI 10.1021/acs.jcim.1c00203 #
# #
# O. Trott, A. J. Olson, #
# AutoDock Vina: improving the speed and accuracy of docking #
# with a new scoring function, efficient optimization and #
# multithreading, J. Comp. Chem. (2010) #
# DOI 10.1002/jcc.21334 #
# #
# Please see https://github.com/ccsb-scripps/AutoDock-Vina for #
# more information. #
#################################################################
Scoring function : vina
Rigid receptor: /home/gleb/Desktop/dolce_vita/temp/nsp5holoHIE.pdbqt
Ligand: /home/gleb/Desktop/dolce_vita/temp/active2322.pdbqt
Grid center: X 11.106 Y 0.659 Z 18.363
Grid size : X 18 Y 18 Z 18
Grid space : 0.375
Exhaustiveness: 48
CPU: 48
Verbosity: 1
Computing Vina grid ... done.
Performing docking (random seed: -1717804037) ...
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
mode | affinity | dist from best mode
| (kcal/mol) | rmsd l.b.| rmsd u.b.
-----+------------+----------+----------
1 -9.14 0 0
2 -9.109 2.002 2.79
3 -9.006 1.772 2.315
4 -8.925 2 2.744
5 -8.882 3.592 8.189
6 -8.803 1.564 2.092
7 -8.507 4.014 7.308
8 -8.36 2.489 8.193
9 -8.356 2.529 8.104
10 -8.33 1.408 3.841
It works OK for a moderate number of input log files (tested with up to 50k logs), but fails with a large number of input logs (e.g. 130k logs), producing the following error:
./dolche_finito.sh: line 124: /usr/bin/awk: Argument list too long
How could I adapt the AWK script to be able to process any number of input logs?
If you get a /usr/bin/awk: Argument list too long then you'll have to control the number of "files" that you supply to awk; the standard way to do that efficiently is:
results=. # ???
i=00001 # ???
output= # ???
find "$results" -type f -name "*_rep$i.log" -exec awk '
FNR == 1 {
filename = FILENAME
sub(/.*\//,"",filename)
sub(/\.[^.]*$/,"",filename)
}
$1 == 1 { printf "%s: %s\n", filename, $2 }
' {} + |
LC_ALL=C sort -t':' -k2,2g > "$results"/ranking_"$output"_rep"$i".csv
edit: appended the rest of the pipeline as asked in the comments
note: you might need to specify other predicates to the find command if you don't want it to search the sub-folders of $results recursively
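With GNU find, for example, adding -maxdepth 1 keeps the search restricted to $results itself (the awk body is elided here):
find "$results" -maxdepth 1 -type f -name "*_rep$i.log" -exec awk '...' {} +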
Note that your error message:
./dolche_finito.sh: line 124: /usr/bin/awk: Argument list too long
is from your shell interpreting line 124 in your shell script, not from awk - you just happen to be calling awk at that line but it could be any other tool and you'd get the same error. Google ARG_MAX for more information on it.
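For reference, you can query that limit on your system with getconf (on many Linux systems it is around 2 MiB):
$ getconf ARG_MAX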
Assuming printf is a builtin on your system:
printf '%s\0' "${results}"/*_rep"${i}".log |
xargs -0 awk '...'
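Plugged into the one-liner from the question, that would look something like this (a sketch; the awk body is the one from the question, unchanged):
printf '%s\0' "${results}"/*_rep"${i}".log |
xargs -0 awk '$1=="1"{sub(/.*\//,"",FILENAME); sub(/\.log/,"",FILENAME); printf("%s: %s\n", FILENAME, $2)}'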
or if you need awk to process all input files in one call for some reason and your file names don't contain newlines:
printf '%s' "${results}"/*_rep"${i}".log |
awk '
NR==FNR {
ARGV[ARGC++] = $0
next
}
...
'
If you're using GNU awk or some other awk that can process NUL characters as the RS and your input file names might contain newlines then you could do:
printf '%s\0' "${results}"/*_rep"${i}".log |
awk '
NR==FNR {
ARGV[ARGC++] = $0
next
}
...
' RS='\0' - RS='\n'
When using GNU AWK you may alter ARGC and ARGV to tell GNU AWK to read additional files. Consider the following simple example: let the content of filelist.txt be
file1.txt
file2.txt
file3.txt
and the content of these files be uno, dos and tres respectively. Then
awk 'FNR==NR{ARGV[NR+1]=$0;ARGC+=1;next}{print FILENAME,$0}' filelist.txt
gives output
file1.txt uno
file2.txt dos
file3.txt tres
Explanation: while reading the first file, i.e. while the row number within the file (FNR) equals the global row number (NR), I add the line to ARGV under the key "row number plus one" (ARGV[1] is already filelist.txt) and increase ARGC by 1; next then skips any further actions. For the other files, I print the filename followed by the whole line.
(tested in GNU Awk 5.0.1)
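Applied to the question, the same trick could look something like this (a sketch: loglist.txt is a hypothetical scratch file, and printf is assumed to be a shell builtin, so the glob never hits the ARG_MAX limit):
printf '%s\n' "${results}"/*_rep"${i}".log > loglist.txt
awk 'FNR==NR{ARGV[ARGC++]=$0; next}
     $1=="1"{f=FILENAME; sub(/.*\//,"",f); sub(/\.log$/,"",f); printf "%s: %s\n", f, $2}' loglist.txt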
I've recently been working on some lab assignments, and in order to collect and analyze the results properly I prepared a bash script to automate the job. It was my first attempt at creating such a script, so it is not perfect, and my question is about improving it.
Example output of the program is shown below, but I would like to make the script more general so it can be used for other purposes.
>>> VARIANT 1 <<<
Random number generator seed is 0xea3495cc76b34acc
Generate matrix 128 x 128 (16 KiB)
Performing 1024 random walks of 4096 steps.
> Total instructions: 170620482
> Instructions per cycle: 3.386
Time elapsed: 0.042127 seconds
Walks accrued elements worth: 534351478
All the data I want to collect is always on different lines. My first attempt was to run the same program twice (or more times, depending on the amount of data) and then use grep on each run to extract the data I need by looking for a keyword. That is very inefficient, since it should be possible to parse the whole output of a single run, but I could not come up with a way to do it. At the moment the script is:
#!/bin/bash
write() {
o1=$(./progname args | grep "Time" | grep -o -E '[0-9]+.[0-9]+')
o2=$(./progname args | grep "cycle" | grep -o -E '[0-9]+.[0-9]+')
o3=$(./progname args | grep "Total" | grep -o -E '[0-9]+.[0-9]+')
echo "$1 $o1 $o2 $o3"
}
for ((i = 1; i <= 10; i++)); do
write $i >> times.dat
done
It is worth mentioning that echoing the results on one line is crucial, as I am using gnuplot later and having the data in columns is perfect for that. Sample output should be:
1 0.019306 3.369 170620476
2 0.019559 3.375 170620475
3 0.021971 3.334 170620478
4 0.020536 3.378 170620480
5 0.019692 3.390 170620475
6 0.020833 3.375 170620477
7 0.019951 3.450 170620477
8 0.019417 3.381 170620476
9 0.020105 3.374 170620476
10 0.020255 3.402 170620475
My question is: how could I improve the script to collect such data in just one program execution?
You could use awk here to get the values into an array and later access them by index 0, 1 and 2, in case you want to do this in a single command.
myarr=($(your_program args | awk '/Total/{print $NF;next} /cycle/{print $NF;next} /Time/{print $(NF-1)}'))
OR use the following to force all the elements onto a single line, so they do not end up on separate lines even if the command substitution is double-quoted to preserve newlines.
myarr=($(your_program args | awk '/Total/{val=$NF;next} /cycle/{val=(val?val OFS:"")$NF;next} /Time/{print val OFS $(NF-1)}'))
Explanation: here is a detailed explanation of the first awk program above.
awk ' ##Starting awk program from here.
/Total/{ ##Checking if a line has Total keyword in it then do following.
print $NF ##Printing last field of that line which has Total in it here.
next ##next keyword will skip all further statements from here.
}
/cycle/{ ##Checking if a line has cycle in it then do following.
print $NF ##Printing last field of that line which has cycle in it here.
next ##next keyword will skip all further statements from here.
}
/Time/{ ##Checking if a line has Time in it then do following.
print $(NF-1) ##Printing 2nd last field of that line which has Time in it here.
}'
To access the individual items you could use echo ${myarr[0]}, echo ${myarr[1]} and echo ${myarr[2]} for Total, cycle and Time respectively.
Example to access all elements by loop in case you need:
for i in "${myarr[#]}"
do
echo $i
done
You can execute your program once and save the output in a variable.
o0=$(./progname args)
Then you can grep that saved string as many times as needed, like this:
o1=$(echo "$o0" | grep "Time" | grep -o -E '[0-9]+.[0-9]+')
Assumptions:
each of the 3 search patterns (Time, cycle, Total) occurs just once in a set of output from ./progname
format of ./progname output is always the same (ie, same number of space-separated items for each line of output)
I've created my own progname script that just does an echo of the sample output:
$ cat progname
echo ">>> VARIANT 1 <<<
Random number generator seed is 0xea3495cc76b34acc
Generate matrix 128 x 128 (16 KiB)
Performing 1024 random walks of 4096 steps.
> Total instructions: 170620482
> Instructions per cycle: 3.386
Time elapsed: 0.042127 seconds
Walks accrued elements worth: 534351478"
One awk solution to parse and print the desired values:
$ i=1
$ ./progname | awk -v i=${i} ' # assign awk variable "i" = ${i}
/Time/ { o1 = $3 } # o1 = field 3 of line that contains string "Time"
/cycle/ { o2 = $5 } # o2 = field 5 of line that contains string "cycle"
/Total/ { o3 = $4 } # o3 = field 4 of line that contains string "Total"
END { printf "%s %s %s %s\n", i, o1, o2, o3 } # print 4x variables to stdout
'
1 0.042127 3.386 170620482
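Dropped into the loop from the question, that might look like this (a sketch; ./progname args stands in for the real invocation):
for ((i = 1; i <= 10; i++)); do
    ./progname args | awk -v i="$i" '
        /Time/  { o1 = $3 }
        /cycle/ { o2 = $5 }
        /Total/ { o3 = $4 }
        END     { printf "%s %s %s %s\n", i, o1, o2, o3 }'
done > times.dat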
I am writing a script for automatically computing average runtime.
First I need to run $ time ./foo.py 100 times and save the output to the file time.txt (this part works):
$ for i in `seq 100`; do { time ./foo.py; } 2>> time.txt; done
Output looks as follows
time ./foo.py
real 0m0,030s
user 0m0,030s
sys 0m0,000s
[...]
Runtimes from different scripts are in the same file. Each entry starts with time ./foo.py, followed by 100 "triplets" of real, user and sys.
Now, if possible, I would love to have the script automatically compute the average runtime for each tested file by using all 100 "triplets", and neatly returning only one "mean triplet".
I have thought about maybe using awk to calculate the mean, like this
awk '{ total += $2 } END { print total/NR }' time.txt
But the command would need to be adapted to fit my needs - after all, only the parts after the , (e.g. ,030s) may be used for computation and the s would also need to be disregarded.
Since I do not know how to achieve this objective, I thought to ask the community.
Any help is greatly appreciated.
It's easier if you tell time to output the time info in POSIX format:
awk '/^real/ { totalReal += $2 } /^user/ { totalUser += $2 } /^sys/ { totalSys += $2 } END { print "realAvg " totalReal/(NR/4) "\n" "userAvg " totalUser/(NR/4) "\n" "sysAvg " totalSys/(NR/4) }' time.txt
Prints output as follows:
realAvg 12.62
userAvg 27
sysAvg 3.8
Explanation:
Basically, tell awk to go through each line in the file and, if the line starts with real, add its value to the totalReal variable; the same for user and sys. In other words, keep a running total for each of the three "types".
At the end, simply print the three running totals, each divided by the number of lines divided by 4. This is because you want each "set" of 4 lines to count as one instance, and awk's NR just counts the number of lines.
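For reference, POSIX-format timings can be requested with bash's time -p, which prints plain real 0.03 style lines, three per run. Below is a sketch of the collection loop adjusted accordingly, plus a variant of the averaging script that counts the real lines instead of assuming four lines per repetition (it also assumes the locale prints a dot as the decimal separator; the question's sample shows a comma, which awk would not read as a number):
for i in $(seq 100); do { time -p ./foo.py; } 2>> time.txt; done
awk '/^real/ { totalReal += $2; runs++ }
     /^user/ { totalUser += $2 }
     /^sys/  { totalSys  += $2 }
     END { printf "realAvg %g\nuserAvg %g\nsysAvg %g\n", totalReal/runs, totalUser/runs, totalSys/runs }' time.txt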
Suppose I have a file containing numbers like:
1 4 7
2 5 8
and I want to add 1 to all these numbers, making the output like:
2 5 8
3 6 9
Is there a simple one-line command (e.g. awk) to achieve this?
Try the following:
awk '{for(i=1;i<=NF;i++){$i=$i+1}} 1' Input_file
EDIT: As per the OP's request for a solution without a loop, here is one (written for the shown sample only).
With the number of fields hardcoded:
awk -v RS='[ \n]' '{ORS=NR%3==0?"\n":" ";print $0+1}' Input_file
OR
Without hardcoding number of fields.
awk -v RS='[ \n]' -v col=$(awk 'FNR==1{print NF}' Input_file) '{ORS=NR%col==0?"\n":" ";print $0+1}' Input_file
Explanation: In the first EDIT solution I have hardcoded the number of fields by writing 3 there; in the OR solution I instead create a variable named col that reads only the very first line of Input_file to get the number of fields. As for the code itself: I set the record separator to a space or a newline, so each number is its own record and can be incremented without a loop; each value is printed with a space after 1 is added to it, and a newline is printed only when the record number is exactly divisible by col (which is why the number of fields is captured in the -v col section).
In native bash (no awk or other external tool needed):
#!/usr/bin/env bash
while read -r -a nums; do # read a line into an array, splitting on spaces
out=( ) # initialize an empty output array for that line
for num in "${nums[#]}"; do # iterate over the input array...
out+=( "$(( num + 1 ))" ) # ...and add n+1 to the output array.
done
printf '%s\n' "${out[*]}" # then print that output array with a newline following
done <in.txt >out.txt # with input from in.txt and output to out.txt
You can do this using gnu awk:
awk -v RS="[[:space:]]+" '{$0++; ORS=RT} 1' file
2 5 8
3 6 9
If you don't mind Perl:
perl -pe 's/(\d+)/$1+1/eg' file
Substitute any number composed of one or more digits (\d+) with that number ($1) plus 1. /e means to execute the replacement calculation, and /g means globally throughout the file.
As mentioned in the comments, the above only works for positive integers - per the OP's original sample file. If you wanted it to work with negative numbers, decimals and still retain text and spacing, you could go for something like this:
perl -pe 's/([-]?[.0-9]+)/$1+1/eg' file
Input file
Some column headers # words
1 4 7 # a comment
2 5 cat dog # spacing and stray words
+5 0 # plus sign
-7 4 # minus sign
+1000.6 # positive decimal
-21.789 # negative decimal
Output
Some column headers # words
2 5 8 # a comment
3 6 cat dog # spacing and stray words
+6 1 # plus sign
-6 5 # minus sign
+1001.6 # positive decimal
-20.789 # negative decimal
So I have a series of scripts that generate intermediary text files along the way as a means of storing information across different scripts. Essentially the scripts detect rows within data that have been approved by the user for removal. The line numbers that are to be removed from the source file are stored in a file.
For example, say I have a source data file like this:
a1,b1,c1,d1
a2,b2,c2,d2
a3,b3,c3,d3
a4,b4,c4,d4
a5,b5,c5,d5
a6,b6,c6,d6
a7,b7,c7,d7
And the intermediary file would contain something like this:
1 3 4 5 6
Which would result, when the script is run, in an output data file as follows:
a2,b2,c2,d2
a7,b7,c7,d7
This all works fine; there is nothing to fix in this code. The problem is that when I'm dealing with actual data files, there are sometimes literally thousands of numbers stored in the intermediary file for removal. This means I can't use a loop, because it would take a massive amount of time, and my current method of using sed gets overloaded with an error: too many arguments. Many of the line numbers are consecutive, so here's where I get to my question:
Is there a way in bash or awk to detect whether a series of space-separated numbers are consecutive?
I can sort out everything beyond that; I'm just stumped on how to do this in one step (or a series of steps). My plan, if I can detect consecutive values, is to change the intermediary file from:
1 3 4 5 6
To:
1 3-6
And then I'll be able to write code that will run on each range of values in a more manageable way.
If possible I'd like to avoid looping through each value and checking individually whether or not it's one step above the previous value, since I'm dealing with tens of thousands of numbers in a list.
If this is not possible in bash/awk, is there another way to accomplish this task to reduce the overall number of arguments passed to my script and greatly reduce the chances of encountering an error for too many arguments?
What about this?
BEGIN {
    getline < "intermediate.txt"       # read the line numbers to skip
    split($0, skippedlines, " ")
    skipindex = 1
}
{
    if (skippedlines[skipindex] == NR) # current line is marked for removal
        ++skipindex;
    else
        print
}
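A sketch of how that might be invoked, assuming the script above is saved as skip.awk, the line numbers live in intermediate.txt as in the getline call, and source.txt/output.txt are placeholder names:
awk -f skip.awk source.txt > output.txt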
Use cat, join, and cut:
File infile:
a1,b1,c1,d1
a2,b2,c2,d2
a3,b3,c3,d3
a4,b4,c4,d4
a5,b5,c5,d5
a6,b6,c6,d6
a7,b7,c7,d7
File ids (one line number per line):
1
3
4
5
6
Removal of selected lines:
$ join -v 2 ids <(cat -n infile) | cut -f 2 -d ' '
a2,b2,c2,d2
a7,b7,c7,d7
What's going on:
First, the initial file receives an id on each line, with cat -n infile;
then, the resulting file is joined on the first column with the file holding the ids;
only non-matching lines from second file are printed -- join -v 2;
the first column, with the ids, is removed;
and, it's a neat shell one-liner (:
In case your file with ids is written as a single line, you can still use the above one-liner, simply adding a translation of the file with ids, as follows:
$ join -v 2 <(tr ' ' '\n' < ids) <(cat -n infile) | cut -f 2 -d ' '
#jmihalicza's answer nicely uses awk to solve the whole problem of selecting the lines from source file that match those in the intermediate file. For completeness, the following awk program reduces the list of individual line numbers to ranges, where possible, which I think answers the original question:
{
    for (j = 1; j <= NF; j++) {
        lin[i++] = $j;                   # collect every line number into lin[]
    }
}
END {
    start = lin[0];
    j = 1;
    while (j <= i) {
        end = start
        while (lin[j] == (lin[j-1]+1)) { # extend the range while the numbers stay consecutive
            end = lin[j++];
        }
        if ((end+0) > (start+0)) {
            printf "%d-%d ",start,end    # emit a range such as 3-6
        } else {
            printf "%d ",start           # emit a lone number
        }
        start = lin[j++];
    }
}
Given this script, which I've called merge.awk, and a file testlin.txt as follows:
1 3 4 5 6 9 10 11 13 15
... we can do this:
$ awk -f merge.awk <testlin.txt
1 3-6 9-11 13 15
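Once the ranges exist, they map directly onto sed address ranges. A possible follow-up sketch (source_file is a placeholder; the first sed strips the trailing space merge.awk leaves and turns 1 3-6 into 1d;3,6d and so on, and the second sed applies that script):
$ awk -f merge.awk <testlin.txt | sed 's/ *$//; s/-/,/g; s/ /d;/g; s/$/d/' | sed -f - source_file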
This might work for you (GNU sed):
sed -r 's/\S+/&d/g;s/\s+/\n/g' intermediate_file | sed -f - source_file
Change the intermediate file into a sed script.
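For example, given the intermediate file 1 3 4 5 6 from the question, the first sed emits the script below, which the second sed then runs against the source file, deleting those line numbers:
1d
3d
4d
5d
6d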