bash + awk: extract specific information from ensemble of filles - bash

I am using bash script to extract some information from log files located within the directory and save the summary in the separate file.
In the bottom of each log file, there is a table like:
mode | affinity | dist from best mode
| (kcal/mol) | rmsd l.b.| rmsd u.b.
-----+------------+----------+----------
1 -6.961 0 0
2 -6.797 2.908 4.673
3 -6.639 27.93 30.19
4 -6.204 2.949 6.422
5 -6.111 24.92 28.55
6 -6.058 2.836 7.608
7 -5.986 6.448 10.53
8 -5.95 19.32 23.99
9 -5.927 27.63 30.04
10 -5.916 27.17 31.29
11 -5.895 25.88 30.23
12 -5.835 26.24 30.36
from this I need to take only the value from the second column of the first line (-6.961) and add it together with the name of the log as one string in new ranking_${output}.log
log_name -6.961
so for 5 processed logs it should be something like:
# ranking_${output}.log
log_name1 -X.XXX
log_name2 -X.XXX
log_name3 -X.XXX
log_name4 -X.XXX
log_name5 -X.XXX
Here is a simple bash workflow, which takes ALL THE LINES from ranking table and saves it together with the name of the LOG file:
#!/bin/bash
home="$PWD"
#folder contained all *.log files
results="${home}"/results
# loop each log file and take its name + all the ranking table
for log in ${results}/*.log; do
log_name=$(basename "$log" .log)
echo "$log_name" >> ${results}/ranking_${output}.log
cat $log | tail -n 12 >> ${results}/ranking_${output}.log
done
Could you suggest me an AWK routine which would select only the top value located on the first line of each table?
This is an AWK example that I had used for another format, which does not work there:
awk -F', *' 'FNR==2 {f=FILENAME;
sub(/.*\//,"",f);
sub(/_.*/ ,"",f);
printf("%s: %s\n", f, $5) }' ${results}/*.log >> ${results}/ranking_${output}.log

With awk. If first column contains 1 print filename and second column to file output:
awk '$1=="1"{print FILENAME, $2}' *.log > output
Update to remove path and suffix (.log):
awk '$1=="1"{sub(/.*\//,"",FILENAME); sub(/\.log/,"",FILENAME); print FILENAME, $2}'

Related

Processing of the data from a big number of input files

My AWK script processes each log file from the folder "${results}, from which it looks for a pattern (a number occurred on the first line of ranking table) and then print it in one line together with the filename of the log:
awk '$1=="1"{sub(/.*\//,"",FILENAME); sub(/\.log/,"",FILENAME); printf("%s: %s\n", FILENAME, $2)}' "${results}"/*_rep"${i}".log
Here is the format of each log file, from which the number
-9.14
should be taken
AutoDock Vina v1.2.3
#################################################################
# If you used AutoDock Vina in your work, please cite: #
# #
# J. Eberhardt, D. Santos-Martins, A. F. Tillack, and S. Forli #
# AutoDock Vina 1.2.0: New Docking Methods, Expanded Force #
# Field, and Python Bindings, J. Chem. Inf. Model. (2021) #
# DOI 10.1021/acs.jcim.1c00203 #
# #
# O. Trott, A. J. Olson, #
# AutoDock Vina: improving the speed and accuracy of docking #
# with a new scoring function, efficient optimization and #
# multithreading, J. Comp. Chem. (2010) #
# DOI 10.1002/jcc.21334 #
# #
# Please see https://github.com/ccsb-scripps/AutoDock-Vina for #
# more information. #
#################################################################
Scoring function : vina
Rigid receptor: /home/gleb/Desktop/dolce_vita/temp/nsp5holoHIE.pdbqt
Ligand: /home/gleb/Desktop/dolce_vita/temp/active2322.pdbqt
Grid center: X 11.106 Y 0.659 Z 18.363
Grid size : X 18 Y 18 Z 18
Grid space : 0.375
Exhaustiveness: 48
CPU: 48
Verbosity: 1
Computing Vina grid ... done.
Performing docking (random seed: -1717804037) ...
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
mode | affinity | dist from best mode
| (kcal/mol) | rmsd l.b.| rmsd u.b.
-----+------------+----------+----------
1 -9.14 0 0
2 -9.109 2.002 2.79
3 -9.006 1.772 2.315
4 -8.925 2 2.744
5 -8.882 3.592 8.189
6 -8.803 1.564 2.092
7 -8.507 4.014 7.308
8 -8.36 2.489 8.193
9 -8.356 2.529 8.104
10 -8.33 1.408 3.841
It works OK for a moderate number of input log files (tested for up to 50k logs), but does not work for the case of big number of the input logs (e.g. with 130k logs), producing the following error:
./dolche_finito.sh: line 124: /usr/bin/awk: Argument list too long
How could I adapt the AWK script to be able processing any number of input logs?
If you get a /usr/bin/awk: Argument list too long then you'll have to control the number of "files" that you supply to awk; the standard way to do that efficiently is:
results=. # ???
i=00001 # ???
output= # ???
find "$results" -type f -name "*_rep$i.log" -exec awk '
FNR == 1 {
filename = FILENAME
sub(/.*\//,"",filename)
sub(/\.[^.]*$/,"",filename)
}
$1 == 1 { printf "%s: %s\n", filename, $2 }
' {} + |
LC_ALL=C sort -t':' -k2,2g > "$results"/ranking_"$output"_rep"$i".csv
edit: appended the rest of the chain as asked in comment
note: you might need to specify other predicates to the find command if you don't want it to search the sub-folders of $results recursively
Note that your error message:
./dolche_finito.sh: line 124: /usr/bin/awk: Argument list too long
is from your shell interpreting line 124 in your shell script, not from awk - you just happen to be calling awk at that line but it could be any other tool and you'd get the same error. Google ARG_MAX for more information on it.
Assuming printf is a builtin on your system:
printf '%s\0' "${results}"/*_rep"${i}".log |
xargs -0 awk '...'
or if you need awk to process all input files in one call for some reason and your file names don't contain newlines:
printf '%s' "${results}"/*_rep"${i}".log |
awk '
NR==FNR {
ARGV[ARGC++] = $0
next
}
...
'
If you're using GNU awk or some other awk that can process NUL characters as the RS and your input file names might contain newlines then you could do:
printf '%s\0' "${results}"/*_rep"${i}".log |
awk '
NR==FNR {
ARGV[ARGC++] = $0
next
}
...
' RS='\0' - RS='\n'
When using GNU AWK you might alter ARGC and ARGV to command GNU AWK to read additional files, consider following simple example, let filelist.txt content be
file1.txt
file2.txt
file3.txt
and content of these files to be respectively uno, dos, tres then
awk 'FNR==NR{ARGV[NR+1]=$0;ARGC+=1;next}{print FILENAME,$0}' filelist.txt
gives output
file1.txt uno
file2.txt dos
file3.txt tres
Explanation: when reading first file i.e. where number of row in file (FNR) is equal number of row globally (NR) I add to ARGV line as value under key being number of row plus one, as ARGV[1] is already filelist.txt and I increase ARGC by 1, I instruct GNU AWK to then go to next line so no other action is undertaken. For other files I print filename followed by whole line.
(tested in GNU Awk 5.0.1)

Convert text file to csv using shell script

I have a file like this
InputFile.txt
JOB JOB_A
Source C://files/InputFile
Resource 0 AC
User Guest
ExitCode 0 Success
EndJob
JOB JOB_B
Source C://files/
Resource 1 AD
User Current
ExitCode 1 Fail
EndJob
JOB JOB_C
Source C://files/Input/
Resource 3 AE
User Guest2
ExitCode 0 Success
EndJob
I have to convert the above file to a csv file as below
How to convert it using shell scripting?
I used awk.
The separator is a tabulator because it's more common than a comma in the CSV format.
If you want a coma, you can simply change the \t -> ,.
cat InputFile.txt | \
awk '
BEGIN{print "Source\tResource\tUser\tExitCode"}
/^JOB/{i=0}
/^\s/{
i++;
match($0,/\s*[a-zA-Z]* /);
a[i]=substr($0,RLENGTH+RPOS)}
/^EndJob/{for(i=1;i<5;i++) printf "%s\t",a[i];print ""}'
The first line BEGIN writes header.
The second line matches /JOB/ and only sets an iterator i as zero.
The third line matches the blank on the start of a line and fills array a with values (it count on strict count and order of rows).
The fourth part of the awk script matches EndJob and prints stored values.
Output:
Source
Resource
User
ExitCode
C://files/InputFile
0 AC
Guest
0 Success
C://files/
1 AD
Current
1 Fail
C://files/Input/
3 AE
Guest2
0 Success
Script using associative array:
You can change the script so that uses strict Source, Resource, User, and ExitCode values from $1 (first record) of lines, but it would be a little longer, and this input file doesn't need it.
cat InputFile.txt | \
awk '
BEGIN{
h[1]="Source";
h[2]="Resource";
h[3]="User";
h[4]="ExitCode";
for(i=1;i<5;i++) printf "%s\t",h[i];print ""}
/^\s/{
i++;
match($0,/\s*[a-zA-Z]* /);
a[$1]=substr($0,RLENGTH+RPOS)}
/^EndJob/{for(i=1;i<5;i++) printf "%s\t",a[h[i]];print ""}'
with sed ... dont know if the order in the InputFile.txt is always the same
as Source, Resource, User, ExitCode, but if it is
declare delimiter=";"
sed -Ez "s/[^\n]*(Source|Resource|User) ([^\n]*)\n/\2${delimiter}/g;s/[ \t]*ExitCode //g;s/[^\n]*JOB[^\n]*\n//gi;s/^/Source${delimiter}Resource${delimiter}User${delimiter}ExitCode\n/" < InputFile.txt > output.csv

Grep all content of File 1 from File 2

This is regarding grepping all the Thread IDs which are mentioned in one file from the thread dump file in unix.
I also require at least 5 lines below each thread id from thread dump while grepping.
Like below:-
MAX_CPU_PID_TD_Ids.out:
1001
1003
MAX_CPU_PID_TD.txt:
............TDID=1001..................
Line 1
Line 2
Line 3
Line 4
Line 5
Line 6
Line 7
............TDID=1002...................
Line 1
Line 2
Line 3
Line 4
Line 5
Line 6
Line 7
...........TDID=1003......................
Line 1
Line 2
Line 3
Line 4
Line 5
Line 6
Line 7
Output should contain :-
............TDID=1001..................
Line 1
Line 2
Line 3
Line 4
Line 5
...........TDID=1003......................
Line 1
Line 2
Line 3
Line 4
Line 5
If possible I would like to have the above output in the mail body.
I have tried the below code but it sends me the thread IDs in the body with thread dump file as an attachment
How ever I would like to have the description of each thread id in the body of the mail only
JAVA_HOME=/u01/oracle/products/jdk
MAX_CPU_PID=`ps -eo pid,ppid,cmd,%mem,%cpu --sort=-%mem | head -2 | sed -n '1!p' | awk '{print $1}'`
ps -eLo pid,ppid,tid,pcpu,comm | grep $MAX_CPU_PID > MAX_CPU_PID_SubProcess.out
cat MAX_CPU_PID_SubProcess.out | awk '{ print "pccpu: "$4" pid: "$1" ppid: "$2" ttid: "$3" comm: "$5}' |sort -n > MAX_CPU_PID_SubProcess_Sorted_temp1.out
rm MAX_CPU_PID_SubProcess.out
sort -k 2n MAX_CPU_PID_SubProcess_Sorted_temp1.out > MAX_CPU_PID_SubProcess_Sorted_temp2.out
rm MAX_CPU_PID_SubProcess_Sorted_temp1.out
awk '{a[i++]=$0}END{for(j=i-1;j>=0;j--)print a[j];}' MAX_CPU_PID_SubProcess_Sorted_temp2.out > MAX_CPU_PID_SubProcess_Sorted_temp3.out
rm MAX_CPU_PID_SubProcess_Sorted_temp2.out
awk '($2 > 15 ) ' MAX_CPU_PID_SubProcess_Sorted_temp3.out > MAX_CPU_PID_SubProcess_Sorted_Highest_Consuming.out
rm MAX_CPU_PID_SubProcess_Sorted_temp3.out
awk '{ print $8 }' MAX_CPU_PID_SubProcess_Sorted_Highest_Consuming.out > MAX_CPU_PID_SubProcess_Sorted_temp4.out
( echo "obase=16" ; cat MAX_CPU_PID_SubProcess_Sorted_temp4.out ) | bc > MAX_CPU_PID_TD_Ids_temp.out
rm MAX_CPU_PID_SubProcess_Sorted_temp4.out
$JAVA_HOME/bin/jstack -l $MAX_CPU_PID > MAX_CPU_PID_TD.txt
#grep -i -A 10 'error' data
awk 'BEGIN{print "The below thread IDs from the attached thread dump of OUD1 server are causing the highest CPU utilization. Please Analyze it further\n"}1' MAX_CPU_PID_TD_Ids_temp.out > MAX_CPU_PID_TD_Ids.out
rm MAX_CPU_PID_TD_Ids_temp.out
tr -cd "[:print:]\n" < MAX_CPU_PID_TD_Ids.out | mailx -s "OUD1 MAX CPU Utilization Analysis" -a MAX_CPU_PID_TD.txt <My Mail ID>
Answer for the first part: How to extract the lines.
The solution with grep -F -f MAX_CPU_PID_TD_Ids.out -A 5 MAX_CPU_PID_TD.txt as proposed in a comment is much simpler, but it may fail if the lines Line 1 etc can contain the values from MAX_CPU_PID_TD_Ids.out. It may also print a non-matching TDID= line if there are not enough lines after the previous matching line.
For the grep solution it may be better to create a file with patterns like ...TDID=1001....
The following script will print the matching lines ...TDID=XYZ... and at most the following 5 lines. It will stop after fewer lines if a new ...TDID=XYZ... is found.
For simplicity an empty line is printed before every ...TDID=XYZ... line, i.e. also before the first one.
awk 'NR==FNR {ids[$1]=1;next} # from the first file save all IDs as array keys
/\.\.\.TDID=/ {
sel = 0; # stop any previous output
id=gensub(/\.*TDID=([^.]*)\.*/,"\\1",1); # extract ID
if(id in ids) { # select if ID is present in array
print "" # empty line as separator
sel = 1;
}
count = 0; # counter to limit number of lines
}
sel { # selected for output?
print;
count++;
if(count > 5) { # stop after ...TDID= + 5 more lines (change the number if necessary)
sel = 0
}
}' MAX_CPU_PID_TD_Ids.out MAX_CPU_PID_TD.txt > MAX_CPU_PID_TD.extract
Apart from the first empty line, this script produces the expected output from the example input as shown in the question. If it does not work with the real input or if there are additional requirements, update the question to show the problematic input and the expected output or the additional requirements.
Answer for the second part: Mail formatting
To get the resulting data into the mail body you simply have to pipe it into mailx instead of specifying the file as an attachment.
( tr -cd "[:print:]\n" < MAX_CPU_PID_TD_Ids.out ; cat MAX_CPU_PID_TD.extract ) | mailx -s "OUD1 MAX CPU Utilization Analysis" <My Mail ID>

Loop Script from Input File

I have a reference file with device names in them. For example WABEL8499IPM101. I'm using this script to set the base name (without the last 3 digits) to look at the reference file and see what is already used. If 101 is used it will create a file for me with 102, 103 if I request 2 total. I'm looking to use an input file to run it multiple times. I'm also trying to figure out how to start at 101 if there isn't a name found when searching the reference file
I would like to loop this using an input file instead of manually entering bash test.sh WABEL8499IPM 2 each time. I would like to be able to build an input file of all the names that need compared and then output. It would also be nice that if there isn't a match that it starts creating names at WABEL8499IPM101 instead of just WABEL8499IPM1.
Input file example:
ColumnA (BASE NAME) ColumnB (QUANTITY)
WABEL8499IPM 2
Script:
SRCFILE="~/Desktop/deviceinfo.csv"
LOGDIR="~/Desktop/"
LOGFILE="$LOGDIR/DeviceNames.csv"
# base name, such as "WABEL8499IPM"
device_name=$1
# quantity, such as "2"
quantityNum=$2
# the largest in sequence, such as "WABEL8499IPM108"
max_sequence_name=$(cat $SRCFILE | grep -o -e "$device_name[0-9]*" | sort --reverse | head -n 1)
# extract the last 3digit number (such as "108") from max_sequence_name
max_sequence_num=$(echo $max_sequence_name | rev | cut -c 1-3 | rev)
# create new sequence_name
# such as ["WABEL8499IPM109", "WABEL8499IPM110"]
array_new_sequence_name=()
for i in $(seq 1 $quantityNum);
do
cnum=$((max_sequence_num + i))
array_new_sequence_name+=($(echo $device_name$cnum))
done
#CODE FOR CREATING OUTPUT FILE HERE
#for fn in ${array_new_sequence_name[#]}; do touch $fn; done;
# write log
for sqn in ${array_new_sequence_name[#]};
do
echo $sqn >> $LOGFILE
done
Usage:
bash test.sh WABEL8499IPM 2
Result in the log file:
WABEL8499IPM109
WABEL8499IPM110
Just wrap a loop around your code instead of assuming the args come in on the command line.
SRCFILE="~/Desktop/deviceinfo.csv"
LOGDIR="~/Desktop/"
LOGFILE="$LOGDIR/DeviceNames.csv"
while read device_name quantityNum
do max_sequence_name=$( grep -o -e "$device_name[0-9]*" $SRCFILE |
sort --reverse | head -n 1)
max_sequence_num=${max_sequence_name: -3}
array_new_sequence_name=()
for i in $(seq 1 $quantityNum)
do cnum=$((max_sequence_num + i))
array_new_sequence_name+=("$device_name$cnum")
done
for sqn in ${array_new_sequence_name[#]};
do echo $sqn >> $LOGFILE
done
done < input.file
I'd maybe pass the input file as the parameter now.

Make cat command to operate recursively looping through a directory

I have a large directory of data files which I am in the process of manipulating to get them in a desired format. They each begin and end 15 lines too soon, meaning I need to strip the first 15 lines off one file and paste them to the end of the previous file in the sequence.
To begin, I have written the following code to separate the relevant data into easy chunks:
#!/bin/bash
destination='media/user/directory/'
for file1 in `ls $destination*.ascii`
do
echo $file1
file2="${file1}.end"
file3="${file1}.snip"
sed -e '16,$d' $file1 > $file2
sed -e '1,15d' $file1 > $file3
done
This worked perfectly, so the next step is the worlds simplest cat command:
cat $file3 $file2 > outfile
However, what I need to do is to stitch file2 to the previous file3. Look at this screenshot of the directory for better understanding.
See how these files are all sequential over time:
*_20090412T235945_20090413T235944_* ### April 13
*_20090413T235945_20090414T235944_* ### April 14
So I need to take the 15 lines snipped off the April 14 example above and paste it to the end of the April 13 example.
This doesn't have to be part of the original code, in fact it would be probably best if it weren't. I was just hoping someone would be able to help me get this going.
Thanks in advance! If there is anything I have been unclear about and needs further explanation please let me know.
"I need to strip the first 15 lines off one file and paste them to the end of the previous file in the sequence."
If I understand what you want correctly, it can be done with one line of code:
awk 'NR==1 || FNR==16{close(f); f=FILENAME ".new"} {print>f}' file1 file2 file3
When this has run, the files file1.new, file2.new, and file3.new will be in the new form with the lines transferred. Of course, you are not limited to three files: you may specify as many as you like on the command line.
Example
To keep our example short, let's just strip the first 2 lines instead of 15. Consider these test files:
$ cat file1
1
2
3
$ cat file2
4
5
6
7
8
$ cat file3
9
10
11
12
13
14
15
Here is the result of running our command:
$ awk 'NR==1 || FNR==3{close(f); f=FILENAME ".new"} {print>f}' file1 file2 file3
$ cat file1.new
1
2
3
4
5
$ cat file2.new
6
7
8
9
10
$ cat file3.new
11
12
13
14
15
As you can see, the first two lines of each file have been transferred to the preceding file.
How it works
awk implicitly reads each file line-by-line. The job of our code is to choose which new file a line should be written to based on its line number. The variable f will contain the name of the file that we are writing to.
NR==1 || FNR==16{f=FILENAME ".new"}
When we are reading the first line of the first file, NR==1, or when we are reading the 16th line of whatever file we are on, FNR==16, we update f to be the name of the current file with .new added to the end.
For the short example, which transferred 2 lines instead of 15, we used the same code but with FNR==16 replaced with FNR==3.
print>f
This prints the current line to file f.
(If this was a shell script, we would use >>. This is not a shell script. This is awk.)
Using a glob to specify the file names
destination='media/user/directory/'
awk 'NR==1 || FNR==16{close(f); f=FILENAME ".new"} {print>f}' "$destination"*.ascii
Your task is not that difficult at all. You want to gather a list of all _end files in the directory (using a for loop and globbing, NOT looping on the results of ls). Once you have all the end files, you simply parse the dates using parameter expansion w/substing removal say into d1 and d2 for date1 and date2 in:
stuff_20090413T235945_20090414T235944_end
| d1 | | d2 |
then you simply subtract 1 from d1 into say date0 or d0 and then construct a previous filename out of d0 and d1 using _snip instead of _end. Then just test for the existence of the previous _snip filename, and if it exists, paste your info from the current _end file to the previous _snip file. e.g.
#!/bin/bash
for i in *end; do ## find all _end files
d1="${i#*stuff_}" ## isolate first date in filename
d1="${d1%%T*}"
d2="${i%T*}" ## isolate second date
d2="${d2##*_}"
d0=$((d1 - 1)) ## subtract 1 from first, get snip d1
prev="${i/$d1/$d0}" ## create previous 'snip' filename
prev="${prev/$d2/$d1}"
prev="${prev%end}snip"
if [ -f "$prev" ] ## test that prev snip file exists
then
printf "paste to : %s\n" "$prev"
printf " from : %s\n\n" "$i"
fi
done
Test Input Files
$ ls -1
stuff_20090413T235945_20090414T235944_end
stuff_20090413T235945_20090414T235944_snip
stuff_20090414T235945_20090415T235944_end
stuff_20090414T235945_20090415T235944_snip
stuff_20090415T235945_20090416T235944_end
stuff_20090415T235945_20090416T235944_snip
stuff_20090416T235945_20090417T235944_end
stuff_20090416T235945_20090417T235944_snip
stuff_20090417T235945_20090418T235944_end
stuff_20090417T235945_20090418T235944_snip
stuff_20090418T235945_20090419T235944_end
stuff_20090418T235945_20090419T235944_snip
Example Use/Output
$ bash endsnip.sh
paste to : stuff_20090413T235945_20090414T235944_snip
from : stuff_20090414T235945_20090415T235944_end
paste to : stuff_20090414T235945_20090415T235944_snip
from : stuff_20090415T235945_20090416T235944_end
paste to : stuff_20090415T235945_20090416T235944_snip
from : stuff_20090416T235945_20090417T235944_end
paste to : stuff_20090416T235945_20090417T235944_snip
from : stuff_20090417T235945_20090418T235944_end
paste to : stuff_20090417T235945_20090418T235944_snip
from : stuff_20090418T235945_20090419T235944_end
(of course replace stuff_ with your actual prefix)
Let me know if you have questions.
You could store the previous $file3 value in a variable (and do a check if it is not the first run with -z check):
#!/bin/bash
destination='media/user/directory/'
prev=""
for file1 in $destination*.ascii
do
echo $file1
file2="${file1}.end"
file3="${file1}.snip"
sed -e '16,$d' $file1 > $file2
sed -e '1,15d' $file1 > $file3
if [ -z "$prev" ]; then
cat $prev $file2 > outfile
fi
prev=$file3
done

Resources