Print lines after a pattern in several files - bash

I have a file with the same pattern several times.
Something like:
time
12:00
12:32
23:22
time
10:32
1:32
15:45
I want to print the lines after each occurrence of the pattern (time in the example) into several files. The number of lines after the pattern is constant.
I found I can get the first part of my question with awk '/time/ {x=NR+3;next}(NR<=x){print}' filename
But I have no idea how to output each chunk into different files.
EDIT
My files are a bit more complex than my original question.
They have the following format.
4
gen
C -4.141000 -0.098000 0.773000
H -4.528000 -0.437000 -0.197000
H -4.267000 0.997000 0.808000
H -4.777000 -0.521000 1.563000
4
gen
C -4.414000 -0.398000 4.773000
H -4.382000 -0.455000 -4.197000
H -4.267000 0.973000 2.808000
H -4.333000 -0.000000 1.636000
I want to print the lines after
4
gen
EDIT 2
My expected output is x files, where x is the number of times the pattern occurs.
From my second example, I want two files:
C -4.141000 -0.098000 0.773000
H -4.528000 -0.437000 -0.197000
H -4.267000 0.997000 0.808000
H -4.777000 -0.521000 1.563000
and
C -4.414000 -0.398000 4.773000
H -4.382000 -0.455000 -4.197000
H -4.267000 0.973000 2.808000
H -4.333000 -0.000000 1.636000

You can use this awk command:
awk '/time/{close(out); out="output" ++i; next} {print > out}' file
This awk command builds the output file name out from the fixed prefix output and a counter i, which is incremented every time a time line is read. All subsequent lines are redirected to that file. It is good practice to close each file once you are done with it, so that awk does not run out of open file descriptors when there are many chunks.
PS: If you want time line also in output then remove next in above command.
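As a self-contained check of the approach above, the following sketch recreates the question's sample input in a scratch directory and splits it. The names times.txt and chunkN.txt are placeholders chosen for this example, not names from the question.

```shell
# Work in a scratch directory so the generated files are easy to inspect.
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# Recreate the sample input from the question.
printf '%s\n' time 12:00 12:32 23:22 time 10:32 1:32 15:45 > times.txt

# On each "time" line: close the previous chunk (if any) and start a new one.
# Every other line is written to the current chunk file.
awk '
  /^time$/  { if (out != "") close(out); out = sprintf("chunk%d.txt", ++i); next }
  out != "" { print > out }
' times.txt

# Show each chunk under its file name.
for f in chunk*.txt; do
  echo "== $f"
  cat "$f"
done
```

Anchoring the pattern as /^time$/ is a choice made here so that a data line that merely contains the word does not start a new chunk.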

The revised "4/gen" requirements are somewhat ambiguous, but the following script (which is just a variant of @anubhava's) conforms with those that are given and can easily be modified to deal with various edge cases:
awk '
/^ *4 *$/ {close(out); out=0; next}
/^ *gen *$/ && out==0 {out = "output." ++i; next}
out {print > out} '
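Since the question says the number of lines after the pattern is constant, another option is to let the count line drive the split instead of matching "4" and "gen" literally. This is only a sketch under the assumption that every block starts with a line holding the number of data lines, followed by exactly one comment line; frames.xyz and frameN.txt are made-up names.

```shell
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# The sample data from the question's first EDIT.
cat > frames.xyz <<'EOF'
4
gen
C -4.141000 -0.098000 0.773000
H -4.528000 -0.437000 -0.197000
H -4.267000 0.997000 0.808000
H -4.777000 -0.521000 1.563000
4
gen
C -4.414000 -0.398000 4.773000
H -4.382000 -0.455000 -4.197000
H -4.267000 0.973000 2.808000
H -4.333000 -0.000000 1.636000
EOF

awk '
  # Not inside a block: this line is the count of data lines that follow.
  left == 0 { left = $1 + 0; skip = 1; out = sprintf("frame%d.txt", ++k); next }
  # The line right after the count is the comment line ("gen"); drop it.
  skip      { skip = 0; next }
  # Data lines: write them out and count down to the end of the block.
            { print > out; if (--left == 0) close(out) }
' frames.xyz

cat frame1.txt
```

Because the count is read from the file, the same command should also cope with blocks of different sizes, as long as the count line is accurate.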

I found another answer from anubhava here How to print 5 consecutive lines after a pattern in file using awk
and with head and for loop I solved my problem:
for i in {1..23}; do grep -A25 "gen" allTemp | tail -n 25 > xyz$i; head -n -27 allTemp > allTemp2; cp allTemp2 allTemp; done
I know that file allTemp has 23 occurrences of gen.
head removes the lines I printed to xyz$i, as well as the two lines I don't want, and writes the remainder to allTemp2, which then replaces allTemp.

Related

AWK: subset randomly and without replacement a string in every row of a file

So I need to subset 10 characters from all strings in a particular column of a file, randomly and without repetition (i.e. I want to avoid drawing a character from any given index more than once).
For the sake of simplicity, let's say I have the following string:
ABCDEFGHIJKLMN
For which I should obtain, for example, this result:
DAKLFCHGBI
Notice that no letter occurs twice, which means that no position is extracted more than once.
For this other string:
CCCCCCCCCCCCGG
Analogously, I should never find more than two "G" characters in the output (otherwise it would mean that a "G" character has been sampled more than once), e.g.:
CCGCCCCCCC
Or, in other words, I want to shuffle all characters from each string, and keep the first 10. This can be easily achieved in bash using:
echo "ABCDEFGHIJKLMN" | fold -w1 | shuf -n10 | tr -d '\n'
However, since I need to perform this many times on dozens of files with over a hundred thousand lines each, this is way too slow. So looking around, I've arrived at the following awk code, which seems to work fine whenever the strings are passed to it one by one, e.g.:
awk '{srand(); len=length($1); for(i=1;i<=10;) {k=int(rand()*len)+1; if(!(k in N)) {N[k]; printf "%s", substr($1,k,1); i++}} print ""}' <(echo "ABCDEFGHIJKLMN")
But when I input the following file with a string on each row, awk hangs and the output gets truncated on the second line:
echo "ABCDEFGHIJKLMN" > file.txt
echo "CCCCCCCCCCCCGG" >> file.txt
awk '{srand(); len=length($1); for(i=1;i<=10;) {k=int(rand()*len)+1; if(!(k in N)) {N[k]; printf "%s", substr($1,k,1); i++}} print ""}' file.txt
This other version of the code which samples characters from the string with repetition works fine, so it looks like the issue lies in the part which populates the N array, but I'm not proficient in awk so I'm a bit stuck:
awk '{srand(); len=length($1); for(i=1;i<=10;i++) {k=int(rand()*len)+1; printf "%s", substr($1,k,1)} print ""}'
Can anyone help?
In case this matters: my actual file is more complex than the examples provided here, with several other columns, and unlike the ones in this example, its strings may have different lengths.
Thanks in advance for your time :)
EDIT:
As mentioned in the comments, I managed to make it work by calling srand() only once and clearing the N array after each row (so that it is empty before the next row is processed):
awk 'BEGIN{srand()} {len=length($1); for(i=1;i<=10;) {k=int(rand()*len)+1; if(!(k in N)) {N[k]; printf "%s", substr($1,k,1); i++}} split("", N); print ""}' file.txt
Do note however that if the string in $1 is shorter than 10, this will get stuck in an infinite loop, so make sure that all strings are always longer than the subset target size. The alternative solution provided by Andre Wildberg in the comments doesn't carry this issue.
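One way to avoid that infinite loop is to cap the number of draws at the string length. The following is a minimal sketch of that idea rather than a tested drop-in replacement; size, seen and the file names are names chosen for the example.

```shell
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# Two strings from the question plus one that is shorter than the target size.
printf '%s\n' ABCDEFGHIJKLMN CCCCCCCCCCCCGG SHORT > file.txt

awk -v size=10 'BEGIN { srand() }
{
  len = length($1)
  n = (len < size ? len : size)   # never ask for more characters than exist
  split("", seen)                 # forget the positions used on the previous row
  out = ""
  for (i = 0; i < n; ) {
    k = int(rand() * len) + 1     # random position in 1..len
    if (!(k in seen)) { seen[k]; out = out substr($1, k, 1); i++ }
  }
  print out
}' file.txt > sampled.txt

cat sampled.txt
```

For a short string this simply returns all of its characters in random order.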
I would use GNU AWK for this task in the following way. Let the content of file.txt be
ABCDEFGHIJKLMN
CCCCCCCCCCCCGG
then
awk 'function comp_func(i1, v1, i2, v2){return rand()-0.5}BEGIN{FPAT=".";PROCINFO["sorted_in"]="comp_func"}{s="";patsplit($0,arr);for(i in arr){s = s arr[i]};print substr(s,1,10)}' file.txt
might give output
NGLHCKEIMJ
CCCCCCCCGG
Explanation: I use a custom array traversal control function (comp_func) that decides at random which element is considered greater; 0.5 is subtracted because rand() returns values from 0 to 1. For each line, the array arr is populated with the characters of the line and then traversed in random order to build the string s of shuffled characters; substr is then used to take the first 10 characters. You might add a counter that terminates the for loop early if your lines are very long compared to the number of characters to select.
(tested in GNU Awk 5.0.1)
Iteratively construct a substring of the remaining letters.
Tested with
awk version 20121220
GNU Awk 4.2.1, API: 2.0
GNU Awk 5.2.1, API 3.2
mawk 1.3.4 20200120
% awk -v size=10 'BEGIN{srand()} {n=length($0); a=$0; x=0;
for(i=1; i<=n; i++){x++; na=length(a); rnd = int(rand() * na + 1)
printf("%s", substr(a, rnd, 1))
a=substr(a, 1, rnd - 1)""substr(a, rnd + 1, na)
if(x >= size){break}}
print ""}' file.txt
CJFMKHNDLA
CGCCCCCCCC
In consecutive iterative runs remember to check if srand works the way you expect in your version of awk. If in doubt use $RANDOM or, better, /dev/urandom.
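To illustrate that advice: srand() without an argument seeds from the time of day, so two runs started within the same second will typically produce the same sample. The sketch below passes in a seed read from /dev/urandom instead; seed and sample are names chosen for this example.

```shell
# Read 4 random bytes as one unsigned integer and keep only the digits.
seed=$(od -An -N4 -tu4 /dev/urandom | tr -dc '0-9')

sample=$(printf '%s\n' ABCDEFGHIJKLMN |
  awk -v seed="$seed" -v size=10 'BEGIN { srand(seed) }
  {
    a = $0; out = ""
    # Draw one remaining character at a time and remove it from the pool.
    for (x = 0; x < size && length(a) > 0; x++) {
      rnd = int(rand() * length(a)) + 1
      out = out substr(a, rnd, 1)
      a = substr(a, 1, rnd - 1) substr(a, rnd + 1)
    }
    print out
  }')

echo "seed=$seed sample=$sample"
```

Printing the seed alongside the result makes a particular run reproducible later by passing the same value again.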
If you don't need to stay strictly within awk, jot makes it very easy.
Say you want 20 random characters between "A" (ASCII 65) and "N" (ASCII 78), including repeats of the same character:
jot -s '' -c -r 20 65 78
ANNKECLDMLMNCLGDIGNL

Processing of the data from a big number of input files

My AWK script processes each log file in the folder "${results}". It looks for a pattern (the number on the first line of the ranking table) and prints it on one line together with the name of the log file:
awk '$1=="1"{sub(/.*\//,"",FILENAME); sub(/\.log/,"",FILENAME); printf("%s: %s\n", FILENAME, $2)}' "${results}"/*_rep"${i}".log
Here is the format of each log file, from which the number
-9.14
should be taken
AutoDock Vina v1.2.3
#################################################################
# If you used AutoDock Vina in your work, please cite: #
# #
# J. Eberhardt, D. Santos-Martins, A. F. Tillack, and S. Forli #
# AutoDock Vina 1.2.0: New Docking Methods, Expanded Force #
# Field, and Python Bindings, J. Chem. Inf. Model. (2021) #
# DOI 10.1021/acs.jcim.1c00203 #
# #
# O. Trott, A. J. Olson, #
# AutoDock Vina: improving the speed and accuracy of docking #
# with a new scoring function, efficient optimization and #
# multithreading, J. Comp. Chem. (2010) #
# DOI 10.1002/jcc.21334 #
# #
# Please see https://github.com/ccsb-scripps/AutoDock-Vina for #
# more information. #
#################################################################
Scoring function : vina
Rigid receptor: /home/gleb/Desktop/dolce_vita/temp/nsp5holoHIE.pdbqt
Ligand: /home/gleb/Desktop/dolce_vita/temp/active2322.pdbqt
Grid center: X 11.106 Y 0.659 Z 18.363
Grid size : X 18 Y 18 Z 18
Grid space : 0.375
Exhaustiveness: 48
CPU: 48
Verbosity: 1
Computing Vina grid ... done.
Performing docking (random seed: -1717804037) ...
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
mode | affinity | dist from best mode
| (kcal/mol) | rmsd l.b.| rmsd u.b.
-----+------------+----------+----------
1 -9.14 0 0
2 -9.109 2.002 2.79
3 -9.006 1.772 2.315
4 -8.925 2 2.744
5 -8.882 3.592 8.189
6 -8.803 1.564 2.092
7 -8.507 4.014 7.308
8 -8.36 2.489 8.193
9 -8.356 2.529 8.104
10 -8.33 1.408 3.841
It works for a moderate number of input log files (tested with up to 50k logs), but fails for a large number of logs (e.g. 130k), producing the following error:
./dolche_finito.sh: line 124: /usr/bin/awk: Argument list too long
How could I adapt the AWK script so that it can process any number of input logs?
If you get a /usr/bin/awk: Argument list too long then you'll have to control the number of "files" that you supply to awk; the standard way to do that efficiently is:
results=. # ???
i=00001 # ???
output= # ???
find "$results" -type f -name "*_rep$i.log" -exec awk '
FNR == 1 {
filename = FILENAME
sub(/.*\//,"",filename)
sub(/\.[^.]*$/,"",filename)
}
$1 == 1 { printf "%s: %s\n", filename, $2 }
' {} + |
LC_ALL=C sort -t':' -k2,2g > "$results"/ranking_"$output"_rep"$i".csv
edit: appended the rest of the chain as asked in comment
note: you might need to specify other predicates to the find command if you don't want it to search the sub-folders of $results recursively
Note that your error message:
./dolche_finito.sh: line 124: /usr/bin/awk: Argument list too long
is from your shell interpreting line 124 in your shell script, not from awk - you just happen to be calling awk at that line but it could be any other tool and you'd get the same error. Google ARG_MAX for more information on it.
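As a quick way to see the limit involved, the sketch below prints ARG_MAX and a rough estimate of how many bytes a glob expands to. The scratch directory and file names are invented for the example, and the estimate ignores the environment and per-argument overhead.

```shell
# Upper bound, in bytes, on the arguments plus environment of a new process.
arg_max=$(getconf ARG_MAX)
echo "ARG_MAX on this system: $arg_max bytes"

# Hypothetical stand-in for the real results folder.
results=$(mktemp -d)
touch "$results/a_rep1.log" "$results/b_rep1.log"

# printf is a shell builtin here, so the expanded glob never goes through exec
# and is not subject to the limit; wc counts the bytes the names would occupy.
needed=$(( $(printf '%s\0' "$results"/*_rep1.log | wc -c) ))
echo "These file names need about $needed bytes"
```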
Assuming printf is a builtin on your system:
printf '%s\0' "${results}"/*_rep"${i}".log |
xargs -0 awk '...'
or if you need awk to process all input files in one call for some reason and your file names don't contain newlines:
printf '%s\n' "${results}"/*_rep"${i}".log |
awk '
NR==FNR {
ARGV[ARGC++] = $0
next
}
...
' -
If you're using GNU awk or some other awk that can process NUL characters as the RS and your input file names might contain newlines then you could do:
printf '%s\0' "${results}"/*_rep"${i}".log |
awk '
NR==FNR {
ARGV[ARGC++] = $0
next
}
...
' RS='\0' - RS='\n'
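For reference, here is the printf | xargs -0 variant with the elided awk program filled in, run against a few fabricated log files. Treat it as a sketch: the ligand names and scores are made up, ranking.txt is a placeholder, and xargs -0 is a widely available extension rather than part of POSIX.

```shell
results=$(mktemp -d)
i=1

# Fabricate three tiny logs that contain only the lines that matter here.
n=0
for name in ligA ligB ligC; do
  n=$((n + 1))
  printf '%s\n' 'mode | affinity' "1 -9.$n 0 0" "2 -8.$n 2.0 2.8" \
    > "$results/${name}_rep$i.log"
done

# The builtin printf emits NUL-separated names; xargs splits them into as many
# awk invocations as needed to stay under the argument-size limit.
printf '%s\0' "$results"/*_rep"$i".log |
  xargs -0 awk '
    FNR == 1  { name = FILENAME; sub(/.*\//, "", name); sub(/\.log$/, "", name) }
    $1 == "1" { printf "%s: %s\n", name, $2 }
  ' > "$results/ranking.txt"

cat "$results/ranking.txt"
```

Working on a copy of FILENAME, as the find-based answer does, avoids modifying the built-in variable.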
With GNU AWK you can alter ARGC and ARGV to make it read additional files. Consider the following simple example. Let the content of filelist.txt be
file1.txt
file2.txt
file3.txt
and let the content of these files be uno, dos and tres respectively. Then
awk 'FNR==NR{ARGV[NR+1]=$0;ARGC+=1;next}{print FILENAME,$0}' filelist.txt
gives output
file1.txt uno
file2.txt dos
file3.txt tres
Explanation: while reading the first file, i.e. while the line number within the file (FNR) equals the global line number (NR), I store each line in ARGV under the key NR+1 (ARGV[1] is already filelist.txt) and increase ARGC by 1. next then tells GNU AWK to move on to the next line, so no other action is taken. For all other files I print the file name followed by the whole line.
(tested in GNU Awk 5.0.1)

Faster way to extract data from large file

I have a file containing about 40000 frames of Cartesian coordinates of 28 atoms. I need to extract the coordinates of atoms 21 to 27 from each frame.
I tried using a bash script with a for loop.
for i in {0..39999}
do
cat $1 | grep -A 27 "frame $i " | tail -n 6 | awk '{print $2, $3, $4}' >> new_coors.xyz
done
The data have the following form:
28
-1373.82296 frame 0 xyz file generated by terachem
Re 1.6345663991 0.9571586961 0.3920887712
N 0.7107677071 -1.0248027788 0.5007181135
N -0.3626961076 1.1948218124 -0.4621264246
C -1.1299268126 0.0792071086 -0.5595954110
C -0.5157993503 -1.1509115191 -0.0469223696
C 1.3354467762 -2.1017253883 1.0125736017
C 0.7611763218 -3.3742177216 0.9821756556
C -1.1378354025 -2.4089069492 -0.1199253156
C -0.4944655989 -3.5108477831 0.4043826684
C -0.8597552614 2.3604180994 -0.9043060625
C -2.1340008843 2.4846545826 -1.4451933224
C -2.4023114639 0.1449111237 -1.0888703147
C -2.9292779079 1.3528434658 -1.5302429615
H 2.3226814021 -1.9233467458 1.4602019023
H 1.3128699342 -4.2076373780 1.3768411246
H -2.1105470176 -2.5059031902 -0.5582958817
H -0.9564415355 -4.4988963635 0.3544299401
H -0.1913951275 3.2219343258 -0.8231465989
H -2.4436044324 3.4620639189 -1.7693069306
H -3.0306593902 -0.7362803011 -1.1626515622
H -3.9523215784 1.4136948699 -1.9142814745
C 3.3621999538 0.4972227756 1.1031860016
O 4.3763020637 0.2022266109 1.5735343064
C 2.2906331057 2.7428149541 0.0483795630
O 2.6669163864 3.8206298898 -0.1683800650
C 1.0351398442 1.4995168190 2.1137684156
O 0.6510904387 1.8559680025 3.1601927094
Cl 2.2433490373 0.2064711824 -1.9226174036
It works, but it takes an enormous amount of time.
In the future I will be working with larger files. Is there a faster way to do this?
Your program is slow because it re-reads the input file on every iteration of the for loop. You can do everything while reading the file a single time by using awk instead:
awk '/frame/{c=0;next}{c++}(c>20 && c<27){ print $2,$3,$4 }' input > output
This answer assumes the following form of data:
frame ???
??? x y z ???
??? x y z ???
...
frame ???
??? x y z ???
??? x y z ???
...
The solution checks whether a line contains the word frame. If so, it sets the atom counter c to zero and skips to the next line. From that point forward, it increases the counter for every new line it reads. If the counter is between 20 and 27 (exclusive), it prints the coordinates.
You can now easily expand on this: Assume you want the same atoms but only from frame 1000 till 1500. You can do this by introducing a frame-counter fc
awk '/frame/{fc++;c=0;next}{c++}(fc>=1000 && fc <=1500) && (c>20 && c<27){ print $2,$3,$4 }' input > output
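The same counter logic can be tried on a tiny made-up trajectory, with the atom range passed in as variables. This is a sketch, not the answer's exact command: first, last and started are names introduced here, the bounds are inclusive, and it assumes last does not exceed the number of atoms per frame.

```shell
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# Tiny stand-in for the real file: 2 frames of 3 atoms each.
cat > input.xyz <<'EOF'
3
-1.0 frame 0 xyz file
A 1 2 3
B 4 5 6
C 7 8 9
3
-2.0 frame 1 xyz file
A 10 11 12
B 13 14 15
C 16 17 18
EOF

awk -v first=2 -v last=3 '
  /frame/ { c = 0; started = 1; next }   # header line: restart the atom counter
  started { c++ }                        # count every line after a header
  c >= first && c <= last { print $2, $3, $4 }
' input.xyz > output.xyz

cat output.xyz
```

With first=2 and last=3 this keeps the coordinates of atoms B and C from both frames.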
If the frame numbers in the file are already in sorted order, e.g. they run from 0 to 39999 in this order, then maybe something like this could do the job (not tested, since we don't have a sample input file, as Jepessen suggested):
cat $1 | grep -A 27 -E "frame [0-9]+ " | \
awk '{if ($2 == "frame") n = 0; if (n++ > 20) print $2, $3, $4}' > new_coors.xyz
(The code above is deliberately verbose so that it is easier to understand and closer to your existing script. If you need a more compact solution, check kvantour's answer.)
You could perhaps use 2 passes of grep, rather than thousands?
Assuming you want the lines 21-27 after every frame, and you don't want to record the frame number itself, the following phrase should get the lines you want, which you can then 'tidy' with awk:
grep -A27 ' frame ' | grep -B6 -e '^--$'
If you also wanted the frame numbers (I see no evidence), or you really want to restrict the range of frame numbers, you could do that with tee and >( grep 'frame') to generate a second file that you would then need to re-merge. If you added -n to grep then you could easily merge sort the files on line number.
Another way to restrict the frame number without doing multiple passes would be a more complex grep expression that describes the range of numbers (-E because life is too short for backslashes):
-E ' frame ([0-9]{1,4}|[0-3][0-9]{4}) '

Having SUM issues with a bash script

I'm trying to write a script to pull the numbers out of 4 files that store temperature readings from 4 industrial freezers. This is a hobby script, and it generates the general readouts I wanted. However, when I try to generate a SUM of the temperature readings, I get the following printout in the file. My goal is to print only the final SUM, not the individual numbers printed out vertically.
Any help would be greatly appreciated; here's my code:
grep -o "[0.00-9.99]" "/location/$value-1.txt" | awk '{ SUM += $1; print $1} END { print SUM }' >> "/location/$value-1.txt"
here is what I am getting in return
Morningtemp:17.28
Noontemp:17.01
Lowtemp:17.00 Hightemp:18.72
1
7
.
2
8
1
7
.
0
1
1
7
.
0
0
1
8
.
7
2
53
It does generate the SUM, but I don't need the individual numbers listed, just the total.
Why not stick with AWK completely? Code:
$ cat > summer.awk
{
while(match($0,/[0-9]+\.[0-9]+/)) # while matches on record
{
sum+=substr($0, RSTART, RLENGTH) # extract matches and sum them
$0=substr($0, RSTART + RLENGTH) # reset to start after previous match
count++ # count matches
}
}
END {
print sum"/"count"="sum/count # print stuff
}
Data:
$ cat > data.txt
Morningtemp:17.28
Noontemp:17.01
Lowtemp:17.00 Hightemp:18.72
Run:
$ awk -f summer.awk data.txt
70.01/4=17.5025
It might work in the winter too.
The regex in grep -o "[0.00-9.99]" "/location/$value-1.txt" is equivalent to [0-9.], but you're probably looking for numbers in the range 0.00 to 9.99. For that, you need a different regex:
grep -o "[0-9]\.[0-9][0-9]" "/location/$value-1.txt"
That looks for a digit, a dot, and two more digits. It was almost tempting to use [.] in place of \.; it would also work. A plain . would not; that would select entries such as 0X87.
Note that the pattern shown ([0-9]\.[0-9][0-9]) will match 192.16.24.231 twice (2.16 and 4.23). If that's not what you want, you have to be a lot more precise. OTOH, it may not matter in the slightest for the actual data you have. If you'd want it to match 192.16 and 24.231 (or .24 and .231), you have to refine your regex.
Your command structure:
grep … filename | awk '…' >> filename
is living dangerously. In the example, it is 'OK' (but there's a huge grimace on my face as I type 'OK') because the awk script doesn't write anything to the file until grep has read it all. But change the >> to > and you have an empty input, or have awk write material before the grep is complete and suddenly it gets very tricky to determine what happens (it depends, in part, on what awk writes to the end of the file).
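A less fragile arrangement is to finish reading the file before anything is written back to it. The sketch below does that by capturing the sum in a shell variable first; the temporary file, the Total: label and the total variable are all invented for the example, and the regex assumes readings of the form digits, dot, digits.

```shell
# Hypothetical stand-in for /location/$value-1.txt.
file=$(mktemp)
printf '%s\n' 'Morningtemp:17.28' 'Noontemp:17.01' 'Lowtemp:17.00 Hightemp:18.72' > "$file"

# The command substitution completes (grep has read the whole file)
# before the append on the next line touches the file.
total=$(grep -oE '[0-9]+\.[0-9]+' "$file" | awk '{ sum += $1 } END { printf "%.2f\n", sum }')
printf 'Total:%s\n' "$total" >> "$file"

cat "$file"
```

Dropping the print $1 from the awk body is what removes the column of individual values; only the END block produces output.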

How to read a file located in several folders and subfolders in Bash Shell

There are several files named TESTFILE, located in the directories ~/main1/sub1, ~/main1/sub2, ~/main1/sub3, ..., ~/main2/sub1, ~/main2/sub2, ..., ~/mainX/subY, where mainX is a main folder and subY is a subfolder inside it. The TESTFILE in each main folder/subfolder combination follows the same pattern, but the data in each is unique.
Now here's what I want to do:
I want to read a specific number in the TESTFILE for each ~/mainX/subY.
I want to create a text file where every line has the following format [mainX][space][subY][space][value read from TESTFILE]
Some information about TESTFILE and the data I want to get:
It is an OSZICAR file from VASP, a DFT program
The number of lines in OSZICAR varies in different folder-subfolder combination
The information I want to get is always located in the last two lines of the file
The last two lines always look like this:
DAV: 2 -0.942521930239E+01 0.27889E-09 -0.79991E-13 864 0.312E-06
10 F= -.94252193E+01 E0= -.94252193E+01 d E =-.717252E-07
Or in general, the last two lines pattern is:
DAV: a b c d e f
g F= h E0= i d E = j
where the literal parts (DAV:, F=, E0= and d E =) do not change, and the values I want to get are a, g, h, i and j
Some information about main folder mainX and sub-folder subY:
The names of the folders mainX and subY are all real numbers.
How I want the output to be:
Suppose mainX={0.12, 0.20, 0.34, 0.7} and subY={1.10, 2.30, 4.50, 1.00, 2.78}, and the last two lines of ~/0.12/1.10/OSZICAR is the example above, my output file should contain:
0.12 1.10 2 10 -.94252193E+01 -.94252193E+01 -.717252E-07
...
0.7 2.30 2 10 -.94252193E+01 -.94252193E+01 -.717252E-07
...
mainX subY a g h i j
How do I do this in the simplest way possible? I'm reading grep, awk, sed and I'm very overwhelmed.
You could do this using some for loops in bash:
for m in ~/main*/; do
main=$(basename "$m")
for s in "$m"sub*/; do
sub=$(basename "$s")
num=$(tail -n2 "${s}TESTFILE" | awk -F'[ =]+' 'NR==1{s=$2;next}{print s,$1,$3,$5,$8}')
echo "$main $sub $num"
done
done > output_file
I have modified the command to extract the data from your file. It uses tail to read the last two lines of the file. The lines are passed to awk, where they are split into fields using any number of spaces and = signs together as the field separator. The second field from the first of the two lines is saved to the variable s. next skips to the next line, then the columns that you are interested in are printed.
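Because the question says the real folders are named with numbers rather than main*/sub*, here is the same idea run end to end in a scratch tree. It is a sketch under assumptions: root stands in for the home directory, summary.txt is a placeholder, both OSZICAR files are fabricated from the sample lines, and those lines are assumed to have no leading blanks.

```shell
# Build a scratch tree with two number-named folder pairs.
root=$(mktemp -d)
mkdir -p "$root/0.12/1.10" "$root/0.7/2.30"
for d in "$root"/*/*/; do
  printf '%s\n' \
    'DAV: 1 -0.9E+01 0.1E-05 -0.1E-08 864 0.3E-04' \
    'DAV: 2 -0.942521930239E+01 0.27889E-09 -0.79991E-13 864 0.312E-06' \
    '10 F= -.94252193E+01 E0= -.94252193E+01 d E =-.717252E-07' > "${d}OSZICAR"
done

# One glob visits every main/sub pair; the folder names come from the path.
for f in "$root"/*/*/OSZICAR; do
  sub=${f%/OSZICAR}    # .../main/sub
  main=${sub%/*}       # .../main
  vals=$(tail -n 2 "$f" |
    awk -F'[ =]+' 'NR == 1 { a = $2; next } { print a, $1, $3, $5, $8 }')
  printf '%s %s %s\n' "${main##*/}" "${sub##*/}" "$vals"
done > "$root/summary.txt"

cat "$root/summary.txt"
```

Using parameter expansion instead of basename avoids starting two extra processes per file, which can add up with many folders.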
Your question is not very clear - specifically on how to extract the value from TESTFILE, but this is something like what you want:
#!/bin/bash
for X in {1..100}; do
for Y in {1..100}; do
directory="main${X}/sub${Y}"
echo Checking $directory
if [ -f "${directory}/TESTFILE" ]; then
something=$(grep something "${directory}/TESTFILE")
echo main${X} sub${Y} $something
fi
done
done

Resources