extract file names from a folder based on conditions - bash

I have a folder which has files with the following contents.
ATOM 9 CE1 PHE A 1 70.635 -26.989 98.805 1.00 39.17 C
ATOM 10 CE2 PHE A 1 69.915 -26.416 100.989 1.00 42.21 C
ATOM 11 CZ PHE A 1 -69.816 26.271 -99.622 1.00 40.62 C
ATOM 12 N PRO A 2 -69.795 30.848 101.863 1.00 44.44 N
In some files, the 7th column appears as follows, with the next (negative) value run together with it:
ATOM 9 CE1 PHE A 1 70.635-26.989 98.805 1.00 39.17 C
ATOM 10 CE2 PHE A 1 69.915-26.416 100.989 1.00 42.21 C
ATOM 11 CZ PHE A 1 -69.816-26.271 -99.622 1.00 40.62 C
ATOM 12 N PRO A 2 -69.795-30.848 101.863 1.00 44.44 N
I would like to extract the name of files which have the above type of lines. What is the easy way to do this?

Referring to Erik E. Lorenz's answer, you can simply do
grep -l '\s-\?[0-9.]\+-[0-9.]\+\s' dir/*
From the grep manpage:
-l
(The letter ell.) Write only the names of files containing selected
lines to standard output. Pathnames are written once per file searched.
If the standard input is searched, a pathname of (standard input) will
be written, in the POSIX locale. In other locales, standard input may be
replaced by something more appropriate in those locales.
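A quick breakdown of that pattern (it relies on GNU grep's BRE extensions \s, \? and \+):
\s         whitespace before the first number
-\?        an optional leading minus sign
[0-9.]\+   the first number (digits and decimal point)
-          the minus sign glued onto the second number
[0-9.]\+   the second number
\s         whitespace after the pair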

A combination of grep and cut works for me:
grep -H -m 1 '\s-\?[0-9.]\+-[0-9.]\+\s' dir/* | cut -d: -f1
This performs the following steps:
for every file in dir/*, find the first match (-m 1) of two adjacent numbers separated by only a dash
print it with the filename prepended (-H). Should be the default anyway.
extract the file name using cut
This is fast since it only looks for the first matching line. If there are other places with two adjacent numbers, consider changing the regex.
Edit:
This doesn't match scientific notation and may falsely report contents such as '.-.', for example in comments. If you're dealing with either of those, you have to expand the regex.
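If you need a stricter pattern, one possible tightening (an illustrative sketch, assuming every coordinate has digits on both sides of the decimal point) would be:
grep -l '\s-\?[0-9]\+\.[0-9]\+-[0-9]\+\.[0-9]\+\s' dir/*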

awk 'NF > 10 && $1 ~ /^[[:upper:]]+$/ && $2 ~ /^[[:digit:]]+/ { print FILENAME; nextfile }' *
This prints the names of files that have more than 10 fields, where the first field is all uppercase letters and the second field begins with digits.

Using GNU awk for nextfile:
awk '$7 ~ /[0-9]-[0-9]/{print FILENAME; nextfile}' *
or, more efficiently, since you only need to test the first line of each file if all lines in a given file have the same format:
awk 'FNR==1{if ($7 ~ /[0-9]-[0-9]/) print FILENAME; nextfile}' *
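If your awk lacks nextfile, the same first-line test can be written portably (a sketch; it still reads every line of every file, so it is less efficient on large files):
awk 'FNR==1 && $7 ~ /[0-9]-[0-9]/{print FILENAME}' *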

Related

Processing of the data from a big number of input files

My AWK script processes each log file in the folder "${results}"; in each log it looks for a pattern (a number that occurs on the first line of the ranking table) and prints it on one line together with the filename of the log:
awk '$1=="1"{sub(/.*\//,"",FILENAME); sub(/\.log/,"",FILENAME); printf("%s: %s\n", FILENAME, $2)}' "${results}"/*_rep"${i}".log
Here is the format of each log file, from which the number
-9.14
should be taken
AutoDock Vina v1.2.3
#################################################################
# If you used AutoDock Vina in your work, please cite: #
# #
# J. Eberhardt, D. Santos-Martins, A. F. Tillack, and S. Forli #
# AutoDock Vina 1.2.0: New Docking Methods, Expanded Force #
# Field, and Python Bindings, J. Chem. Inf. Model. (2021) #
# DOI 10.1021/acs.jcim.1c00203 #
# #
# O. Trott, A. J. Olson, #
# AutoDock Vina: improving the speed and accuracy of docking #
# with a new scoring function, efficient optimization and #
# multithreading, J. Comp. Chem. (2010) #
# DOI 10.1002/jcc.21334 #
# #
# Please see https://github.com/ccsb-scripps/AutoDock-Vina for #
# more information. #
#################################################################
Scoring function : vina
Rigid receptor: /home/gleb/Desktop/dolce_vita/temp/nsp5holoHIE.pdbqt
Ligand: /home/gleb/Desktop/dolce_vita/temp/active2322.pdbqt
Grid center: X 11.106 Y 0.659 Z 18.363
Grid size : X 18 Y 18 Z 18
Grid space : 0.375
Exhaustiveness: 48
CPU: 48
Verbosity: 1
Computing Vina grid ... done.
Performing docking (random seed: -1717804037) ...
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
mode | affinity | dist from best mode
| (kcal/mol) | rmsd l.b.| rmsd u.b.
-----+------------+----------+----------
1 -9.14 0 0
2 -9.109 2.002 2.79
3 -9.006 1.772 2.315
4 -8.925 2 2.744
5 -8.882 3.592 8.189
6 -8.803 1.564 2.092
7 -8.507 4.014 7.308
8 -8.36 2.489 8.193
9 -8.356 2.529 8.104
10 -8.33 1.408 3.841
It works OK for a moderate number of input log files (tested for up to 50k logs), but fails for a large number of input logs (e.g. 130k logs), producing the following error:
./dolche_finito.sh: line 124: /usr/bin/awk: Argument list too long
How could I adapt the AWK script to be able to process any number of input logs?
If you get a /usr/bin/awk: Argument list too long then you'll have to control the number of "files" that you supply to awk; the standard way to do that efficiently is:
results=. # ???
i=00001 # ???
output= # ???
find "$results" -type f -name "*_rep$i.log" -exec awk '
FNR == 1 {
filename = FILENAME
sub(/.*\//,"",filename)
sub(/\.[^.]*$/,"",filename)
}
$1 == 1 { printf "%s: %s\n", filename, $2 }
' {} + |
LC_ALL=C sort -t':' -k2,2g > "$results"/ranking_"$output"_rep"$i".csv
edit: appended the rest of the chain as asked in comment
note: you might need to specify other predicates to the find command if you don't want it to search the sub-folders of $results recursively
Note that your error message:
./dolche_finito.sh: line 124: /usr/bin/awk: Argument list too long
is from your shell interpreting line 124 in your shell script, not from awk - you just happen to be calling awk at that line but it could be any other tool and you'd get the same error. Google ARG_MAX for more information on it.
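You can also query the limit directly on most systems:
getconf ARG_MAX   # upper bound, in bytes, on the argument list plus environment passed to exec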
Assuming printf is a builtin on your system:
printf '%s\0' "${results}"/*_rep"${i}".log |
xargs -0 awk '...'
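For this specific task that could look like the following sketch, reusing the filename handling from the find variant above (xargs may split the list across several awk invocations, which is fine here since the program works per file):
printf '%s\0' "${results}"/*_rep"${i}".log |
xargs -0 awk '
FNR == 1 {                 # once per file: strip directory and extension from the name
    filename = FILENAME
    sub(/.*\//,"",filename)
    sub(/\.[^.]*$/,"",filename)
}
$1 == 1 { printf "%s: %s\n", filename, $2 }   # first row of the ranking table
'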
or if you need awk to process all input files in one call for some reason and your file names don't contain newlines:
printf '%s\n' "${results}"/*_rep"${i}".log |
awk '
NR==FNR {
ARGV[ARGC++] = $0
next
}
...
'
If you're using GNU awk or some other awk that can process NUL characters as the RS and your input file names might contain newlines then you could do:
printf '%s\0' "${results}"/*_rep"${i}".log |
awk '
NR==FNR {
ARGV[ARGC++] = $0
next
}
...
' RS='\0' - RS='\n'
When using GNU AWK you might alter ARGC and ARGV to instruct GNU AWK to read additional files. Consider the following simple example: let filelist.txt contain
file1.txt
file2.txt
file3.txt
and the contents of these files be uno, dos and tres respectively; then
awk 'FNR==NR{ARGV[NR+1]=$0;ARGC+=1;next}{print FILENAME,$0}' filelist.txt
gives output
file1.txt uno
file2.txt dos
file3.txt tres
Explanation: while reading the first file, i.e. where the row number within the file (FNR) equals the global row number (NR), I add the line to ARGV under the key "row number plus one" (ARGV[1] is already filelist.txt) and increase ARGC by 1, then instruct GNU AWK to go to the next line so that no other action is taken. For the other files I print the filename followed by the whole line.
(tested in GNU Awk 5.0.1)

Faster way to extract data from large file

I have a file containing about 40000 frames of Cartesian coordinates for 28 atoms. I need to extract the coordinates of atoms 21 to 27 from each frame.
I tried using a bash script with a for-loop.
for i in {0..39999}
do
cat $1 | grep -A 27 "frame $i " | tail -n 6 | awk '{print $2, $3, $4}' >> new_coors.xyz
done
The data have the following form:
28
-1373.82296 frame 0 xyz file generated by terachem
Re 1.6345663991 0.9571586961 0.3920887712
N 0.7107677071 -1.0248027788 0.5007181135
N -0.3626961076 1.1948218124 -0.4621264246
C -1.1299268126 0.0792071086 -0.5595954110
C -0.5157993503 -1.1509115191 -0.0469223696
C 1.3354467762 -2.1017253883 1.0125736017
C 0.7611763218 -3.3742177216 0.9821756556
C -1.1378354025 -2.4089069492 -0.1199253156
C -0.4944655989 -3.5108477831 0.4043826684
C -0.8597552614 2.3604180994 -0.9043060625
C -2.1340008843 2.4846545826 -1.4451933224
C -2.4023114639 0.1449111237 -1.0888703147
C -2.9292779079 1.3528434658 -1.5302429615
H 2.3226814021 -1.9233467458 1.4602019023
H 1.3128699342 -4.2076373780 1.3768411246
H -2.1105470176 -2.5059031902 -0.5582958817
H -0.9564415355 -4.4988963635 0.3544299401
H -0.1913951275 3.2219343258 -0.8231465989
H -2.4436044324 3.4620639189 -1.7693069306
H -3.0306593902 -0.7362803011 -1.1626515622
H -3.9523215784 1.4136948699 -1.9142814745
C 3.3621999538 0.4972227756 1.1031860016
O 4.3763020637 0.2022266109 1.5735343064
C 2.2906331057 2.7428149541 0.0483795630
O 2.6669163864 3.8206298898 -0.1683800650
C 1.0351398442 1.4995168190 2.1137684156
O 0.6510904387 1.8559680025 3.1601927094
Cl 2.2433490373 0.2064711824 -1.9226174036
It works, but it takes an enormous amount of time.
In the future I will be working with larger files. Is there a faster way to do that?
The reason your program is slow is that you keep re-reading your input file over and over in your for-loop. You can do everything by reading your file a single time with awk instead:
awk '/frame/{c=0;next}{c++}(c>20 && c<27){ print $2,$3,$4 }' input > output
This answer assumes the following form of data:
frame ???
??? x y z ???
??? x y z ???
...
frame ???
??? x y z ???
??? x y z ???
...
The solution checks whether a line contains the word frame. If so, it resets the atom counter c to zero and skips to the next line. From that point forward, it increments the counter for every line it reads. If the counter is between 20 and 27 (exclusive), it prints the coordinates.
You can now easily expand on this: assume you want the same atoms but only from frame 1000 till 1500. You can do this by introducing a frame counter fc:
awk '/frame/{fc++;c=0;next}{c++}(fc>=1000 && fc <=1500) && (c>20 && c<27){ print $2,$3,$4 }' input > output
If frame numbers in the file are already in sorted order, e.g. they are numbered 0 - 39999 in this order, then maybe something like this could do the job (not tested, since we don't have a sample input file, as Jepessen suggested):
cat $1 | grep --no-group-separator -A 27 -E "frame [0-9]+ " | \
awk '{if ($2 == "frame") n = 0; if (n++ > 20) print $2, $3, $4}' > new_coors.xyz
(The code above is deliberately verbose, to be easier to understand and closer to your existing script. If you need a more compact solution, check kvantour's answer.)
You could perhaps use 2 passes of grep, rather than thousands?
Assuming you want the lines 21-27 after every frame, and you don't want to record the frame number itself, the following pipeline should get the lines you want, which you can then 'tidy' with awk:
grep -A27 ' frame ' | grep -B6 '-----'
If you also wanted the frame numbers (I see no evidence), or you really want to restrict the range of frame numbers, you could do that with tee and >( grep 'frame') to generate a second file that you would then need to re-merge. If you added -n to grep then you could easily merge sort the files on line number.
Another way to restrict the frame number without doing multiple passes would be a more complex grep expression that describes the range of numbers (-E because life is too short for backslashes):
-E ' frame ([0-9]{1,4}|[0-3][0-9]{4}) '

Getting blocks of strings between specific characters in text file "bash"

Can someone help me find code for copying all the lines between the string 'X 0' (X = H, He, ...) and the nearest '****'? I use bash for programming.
H 0
S 3 1.00 0.000000000000
0.1873113696D+02 0.3349460434D 01
0.2825394365D+01 0.2347269535D+00
0.6401216923D+00 0.8137573261D+00
S 1 1.00 0.000000000000
0.1612777588D+00 0.1000000000D+01
****
He 0
S 3 1.00 0.000000000000
0.3842163400D+02 0.4013973935D 01
0.5778030000D+01 0.2612460970D+00
0.1241774000D+01 0.7931846246D+00
S 1 1.00 0.000000000000
0.2979640000D+00 0.1000000000D+01
****
I want to do this for each "X 0" (X = H, He, ...) specifically, obtaining an isolated block of text like the following for each "X 0":
H 0
S 3 1.00 0.000000000000
0.1873113696D+02 0.3349460434D 01
0.2825394365D+01 0.2347269535D+00
0.6401216923D+00 0.8137573261D+00
S 1 1.00 0.000000000000
0.1612777588D+00 0.1000000000D+01
****
and
He 0
S 3 1.00 0.000000000000
0.3842163400D+02 0.4013973935D 01
0.5778030000D+01 0.2612460970D+00
0.1241774000D+01 0.7931846246D+00
S 1 1.00 0.000000000000
0.2979640000D+00 0.1000000000D+01
****
So I think I have to find a way to do it using the string containing "X 0".
I was trying to use grep -A2000 'H 0' filename.txt | grep -B2000 -m8 '****' filename.txt >> filenameH.txt but it's not so useful for the other examples of X, just for the first.
Using awk:
awk '/^[^ ]+ 0$/{p=1;++c}/^\*\*\*\*$/{print >>FILENAME c;p=0}p{print >> FILENAME c}' file
The script creates as many files as there are blocks matching the patterns /^[^ ]+ 0$/ and /^\*\*\*\*$/. The file index starts at 1.
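For reference, with the sample input saved as file, this writes the H block (up to and including its ****) to file1 and the He block to file2.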
If the records are separated by 4 stars (needs gawk):
$ awk -v RS='\\*\\*\\*\\*\n' '$1~/^He?$/{printf "%s", $0 RT > FILENAME $1}' file
This will only extract the H and He records. If you don't want to restrict it, just remove the condition before the curly brace (it is equivalent to $1=="H" || $1=="He").
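With the sample data saved as file, this writes the H record to fileH and the He record to fileHe, each ending with its **** separator.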

Print lines after a pattern in several files

I have a file with the same pattern several times.
Something like:
time
12:00
12:32
23:22
time
10:32
1:32
15:45
I want to print the lines after the pattern (in the example, time) into several files.
The number of lines after the pattern is constant.
I found I can get the first part of my question with awk '/time/ {x=NR+3;next}(NR<=x){print}' filename
But I have no idea how to output each chunk into different files.
EDIT
My files are a bit more complex than my original question.
They have the following format.
4
gen
C -4.141000 -0.098000 0.773000
H -4.528000 -0.437000 -0.197000
H -4.267000 0.997000 0.808000
H -4.777000 -0.521000 1.563000
4
gen
C -4.414000 -0.398000 4.773000
H -4.382000 -0.455000 -4.197000
H -4.267000 0.973000 2.808000
H -4.333000 -0.000000 1.636000
I want to print the lines after
4
gen
EDIT 2
My expected output is x files, where x is the number of occurrences of the pattern.
From my second example, I want two files:
C -4.141000 -0.098000 0.773000
H -4.528000 -0.437000 -0.197000
H -4.267000 0.997000 0.808000
H -4.777000 -0.521000 1.563000
and
C -4.414000 -0.398000 4.773000
H -4.382000 -0.455000 -4.197000
H -4.267000 0.973000 2.808000
H -4.333000 -0.000000 1.636000
You can use this awk command:
awk '/time/{close(out); out="output" ++i; next} {print > out}' file
This awk command creates a variable out from the fixed prefix output and a counter i which gets incremented every time we see a time line. All subsequent lines are redirected to this output file. It is good practice to close these file handles to avoid leaking resources.
PS: If you want the time line in the output as well, remove next from the above command.
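For reference, with the sample time input the main command creates output1 holding the three times after the first time line and output2 holding the three after the second.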
The revised "4/gen" requirements are somewhat ambiguous but the following script (which is just a variant of @anubhava's) conforms with those that are given and can easily be modified to deal with various edge cases:
awk '
/^ *4 *$/ {close(out); out=0; next}
/^ *gen *$/ && out==0 {out = "output." ++i; next}
out {print > out} '
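With the 4/gen sample this writes the first coordinate block to output.1 and the second to output.2.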
I found another answer from anubhava here How to print 5 consecutive lines after a pattern in file using awk
and with head and a for loop I solved my problem:
for i in {1..23}; do grep -A25 "gen" allTemp | tail -n 25 > xyz$i; head -n -27 allTemp > allTemp2; cp allTemp2 allTemp; done
I know that file allTemp has 23 occurrences of gen.
head will remove the lines I printed to xyz$i as well as the two lines I don't want, and it will output a new file, allTemp2.

Analysis of numerical data

Using a bash script I'd like to perform some analysis of a big log.txt file consisting of a large number of lines, each of which is in the following format:
PHE 233,R PHE 233,0.0,0.0,0.0,-0.07884,0.0296770011962,0.00209848087911,0.023555,0.0757544518494,0.00535664866078,-0.065675,0.0859064571205,0.00607450383776,0.0,0.0,0.0,-0.12096,0.0486756448339,0.00344188785407
TYR 234,R TYR 234,0.0,0.0,0.0,-1.25531,0.629561517169,0.0445167217964,-0.004085,0.179779219531,0.0127123105246,0.169925,0.199097411774,0.0140783129982,-0.06675426,0.0227214659046,0.00160665026196,-1.15622426,0.59309226863,0.0419379565017
GLY 235,R GLY 235,0.0,0.0,0.0,-0.039345,0.0259211491836,0.00183290203639,-0.053115,0.0245550763591,0.00173630610061,0.098535,0.0441429357316,0.00312137691973,0.0,0.0,0.0,0.006075,0.0208364914273,0.00147336243844
THR 236,R THR 236,0.0,0.0,0.0,-0.03241,0.0100624003101,0.000711519149426,-0.115375,0.0590932684407,0.00417852508369,0.116505,0.0563931731241,0.00398759951286,0.0,0.0,0.0,-0.03128,0.0262172004608,0.00185383602295
From each such line of log.txt I need to take only the first, second and last terms and paste them into a new log file, final_log.txt. In the above case that would be
PHE 233 0.00344188785407
TYR 234 0.0419379565017
THR 236 0.00185383602295
Most importantly, because typical logs consist of a large number of lines, in the new txt file I'd like to sort the lines according to the value of the last term, applying a chosen threshold to them. That is, from log.txt I'd like to select and paste into final_log.txt only those lines where the number in the last column is equal to or higher than the defined threshold. I'd be very thankful for any solution to this non-trivial (for me!) problem.
Gleb
Through awk,
awk -F'[ ,]' '{print $1" "$2" "$NF}' file
OR
$ awk -F'[ ,]' '{print $1,$2,$NF}' file
PHE 233 0.00344188785407
TYR 234 0.0419379565017
GLY 235 0.00147336243844
THR 236 0.00185383602295
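The question also asks for sorting by the last column and keeping only lines at or above a threshold; a minimal sketch of that (assuming GNU sort for -g, an illustrative threshold of 0.003, and highest values first):
awk -F'[ ,]' -v thr=0.003 '$NF+0 >= thr {print $1, $2, $NF}' log.txt |
sort -k3,3gr > final_log.txt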
