Auto-insert blank lines in `tail -f` - shell

Having a log file such as:
[DEBUG][2016-06-24 11:10:10,064][DataSourceImpl] - [line A...]
[DEBUG][2016-06-24 11:10:10,069][DataSourceImpl] - [line B...]
[DEBUG][2016-06-24 11:10:12,112][DataSourceImpl] - [line C...]
which is under tail -f real-time monitoring, is it possible to auto-insert (via a command we would pipe to the tail) "blank lines" after, let's say, 2 seconds of inactivity?
Expected result:
[DEBUG][2016-06-24 11:10:10,064][DataSourceImpl] - [line A...]
[DEBUG][2016-06-24 11:10:10,069][DataSourceImpl] - [line B...]
---
[DEBUG][2016-06-24 11:10:12,112][DataSourceImpl] - [line C...]
(because there is a gap of more than 2 seconds between 2 successive lines).

awk -F'[][\\- ,:]+' '1'
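The above will split fields on ], [, -, space, comma, and :, so that each field is numbered as shown below ($1 is the empty string before the leading [):
[DEBUG][2016-06-24 11:10:10,064][DataSourceImpl] - [line A...]
 22222  3333 44 55 66 77 88 999 ...
You can then convert the time fields to milliseconds and use that to measure the time difference between consecutive records. The fields cannot simply be concatenated, because the resulting number jumps at every minute and hour boundary (11:10:59,500 followed by 11:11:00,000 would look like a gap of 40500). A gap that spans midnight is not detected by this version:
tail -f input.log | awk -F'[][\\- ,:]+' '
    { curr = (($6 * 60 + $7) * 60 + $8) * 1000 + $9 }  # milliseconds since midnight
    NR > 1 && curr - prev > 2000 { print "" }          # print empty line if more than two
                                                       # seconds have passed since last record
    { prev = curr } 1'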
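For comparison, the same timestamp-gap idea can be sketched in Python, parsing the full date and time so that minute, hour and day boundaries need no special handling. This is only a sketch: the function name with_gap_markers and the regular expression are illustrative assumptions based on the log format shown in the question.

```python
import re
from datetime import datetime, timedelta

# Matches the second bracketed field, e.g. [2016-06-24 11:10:10,064]
STAMP = re.compile(r"\]\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\]")

def with_gap_markers(lines, gap=timedelta(seconds=2), marker=""):
    """Yield each line; yield `marker` first when the line's timestamp is
    more than `gap` later than the previous timestamped line."""
    prev = None
    for line in lines:
        line = line.rstrip("\n")
        match = STAMP.search(line)
        if match:
            curr = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
            if prev is not None and curr - prev > gap:
                yield marker
            prev = curr
        yield line
```

Fed from a pipe, e.g. by looping over with_gap_markers(sys.stdin) and printing each item with flush=True, this behaves like the awk version: the marker only appears once the next record arrives.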

tail has no such feature. If you want, you could implement a program or script that checks the last line of the file; something like (pseudocode):
previous_last_line = last line of your file
while (sleep 2 seconds)
{
    last_line = last line of your file
    if (last_line == previous_last_line)
        print newline
    else
        print lines since previous_last_line
    previous_last_line = last_line
}
Two remarks:
This will give you no output for 2 seconds at a time. You could check the last line more often and keep a timestamp, but that requires more code.
This depends on all lines being unique, which is reasonable in your case since every line contains a timestamp.
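One way to avoid both the polling delay and the uniqueness requirement is to read the output of tail -f line by line and use a timeout while waiting for the next line. The following Python sketch does that with a reader thread and a queue; the names insert_gaps, idle and separator are hypothetical, and this is one possible approach rather than the method proposed above.

```python
import queue
import threading

_EOF = object()  # sentinel marking the end of the input

def insert_gaps(lines, write, idle=2.0, separator="\n"):
    """Pass each line to write(); after `idle` seconds without input,
    write `separator` once."""
    pending = queue.Queue()

    def reader():
        # Runs in the background so the main loop can wait with a timeout.
        for line in lines:
            pending.put(line)
        pending.put(_EOF)

    threading.Thread(target=reader, daemon=True).start()
    printed_since_gap = False
    while True:
        try:
            item = pending.get(timeout=idle)
        except queue.Empty:
            if printed_since_gap:  # mark each quiet period only once
                write(separator)
                printed_since_gap = False
            continue
        if item is _EOF:
            return
        write(item)
        printed_since_gap = True
```

A script that ends with insert_gaps(sys.stdin, lambda s: print(s, end="", flush=True)) could then be used as tail -f input.log | python3 gaps.py, where gaps.py is a hypothetical file name. Unlike the timestamp comparison, the separator appears as soon as the log has been quiet for 2 seconds.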

Related

Processing of the data from a big number of input files

My AWK script processes each log file in the folder "${results}", looks for a pattern (the number in the first row of the ranking table), and prints it on one line together with the file name of the log:
awk '$1=="1"{sub(/.*\//,"",FILENAME); sub(/\.log/,"",FILENAME); printf("%s: %s\n", FILENAME, $2)}' "${results}"/*_rep"${i}".log
Here is the format of each log file, from which the number -9.14 should be taken:
AutoDock Vina v1.2.3
#################################################################
# If you used AutoDock Vina in your work, please cite: #
# #
# J. Eberhardt, D. Santos-Martins, A. F. Tillack, and S. Forli #
# AutoDock Vina 1.2.0: New Docking Methods, Expanded Force #
# Field, and Python Bindings, J. Chem. Inf. Model. (2021) #
# DOI 10.1021/acs.jcim.1c00203 #
# #
# O. Trott, A. J. Olson, #
# AutoDock Vina: improving the speed and accuracy of docking #
# with a new scoring function, efficient optimization and #
# multithreading, J. Comp. Chem. (2010) #
# DOI 10.1002/jcc.21334 #
# #
# Please see https://github.com/ccsb-scripps/AutoDock-Vina for #
# more information. #
#################################################################
Scoring function : vina
Rigid receptor: /home/gleb/Desktop/dolce_vita/temp/nsp5holoHIE.pdbqt
Ligand: /home/gleb/Desktop/dolce_vita/temp/active2322.pdbqt
Grid center: X 11.106 Y 0.659 Z 18.363
Grid size : X 18 Y 18 Z 18
Grid space : 0.375
Exhaustiveness: 48
CPU: 48
Verbosity: 1
Computing Vina grid ... done.
Performing docking (random seed: -1717804037) ...
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
mode | affinity | dist from best mode
| (kcal/mol) | rmsd l.b.| rmsd u.b.
-----+------------+----------+----------
1 -9.14 0 0
2 -9.109 2.002 2.79
3 -9.006 1.772 2.315
4 -8.925 2 2.744
5 -8.882 3.592 8.189
6 -8.803 1.564 2.092
7 -8.507 4.014 7.308
8 -8.36 2.489 8.193
9 -8.356 2.529 8.104
10 -8.33 1.408 3.841
It works OK for a moderate number of input log files (tested with up to 50k logs), but fails for a large number of input logs (e.g. 130k), producing the following error:
./dolche_finito.sh: line 124: /usr/bin/awk: Argument list too long
How could I adapt the AWK script so that it can process any number of input logs?
If you get a /usr/bin/awk: Argument list too long then you'll have to control the number of "files" that you supply to awk; the standard way to do that efficiently is:
results=. # ???
i=00001 # ???
output= # ???
find "$results" -type f -name "*_rep$i.log" -exec awk '
    FNR == 1 {
        filename = FILENAME
        sub(/.*\//,"",filename)
        sub(/\.[^.]*$/,"",filename)
    }
    $1 == 1 { printf "%s: %s\n", filename, $2 }
' {} + |
LC_ALL=C sort -t':' -k2,2g > "$results"/ranking_"$output"_rep"$i".csv
edit: appended the rest of the chain as asked in comment
note: you might need to specify other predicates to the find command if you don't want it to search the sub-folders of $results recursively
Note that your error message:
./dolche_finito.sh: line 124: /usr/bin/awk: Argument list too long
is from your shell interpreting line 124 in your shell script, not from awk - you just happen to be calling awk at that line but it could be any other tool and you'd get the same error. Google ARG_MAX for more information on it.
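To see the limit in question on a given machine, it can be queried from Python as in the small sketch below. The helper name arg_max is hypothetical; os.sysconf is only available on Unix-like systems.

```python
import os

def arg_max():
    """Return the system limit (in bytes) on the combined size of the
    argument list and environment passed to a new process."""
    return os.sysconf("SC_ARG_MAX")

print(arg_max())
```

The same value is reported by getconf ARG_MAX in the shell.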
Assuming printf is a builtin on your system:
printf '%s\0' "${results}"/*_rep"${i}".log |
xargs -0 awk '...'
or if you need awk to process all input files in one call for some reason and your file names don't contain newlines:
printf '%s\n' "${results}"/*_rep"${i}".log |
awk '
NR==FNR {
ARGV[ARGC++] = $0
next
}
...
'
If you're using GNU awk or some other awk that can process NUL characters as the RS and your input file names might contain newlines then you could do:
printf '%s\0' "${results}"/*_rep"${i}".log |
awk '
NR==FNR {
ARGV[ARGC++] = $0
next
}
...
' RS='\0' - RS='\n'
When using GNU AWK you can alter ARGC and ARGV to make GNU AWK read additional files. Consider the following simple example. Let the content of filelist.txt be
file1.txt
file2.txt
file3.txt
and let the content of these files be uno, dos and tres respectively; then
awk 'FNR==NR{ARGV[NR+1]=$0;ARGC+=1;next}{print FILENAME,$0}' filelist.txt
gives output
file1.txt uno
file2.txt dos
file3.txt tres
Explanation: while reading the first file, i.e. while the record number within the file (FNR) equals the global record number (NR), I store each line in ARGV under the key NR+1 (ARGV[1] is already filelist.txt) and increase ARGC by 1. I then tell GNU AWK to go to the next line, so no other action is taken. For all other files I print the file name followed by the whole line.
(tested in GNU Awk 5.0.1)
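Another way to sidestep the limit is to expand the file pattern inside the program that reads the logs, so that no long argument list is ever passed to a new process. A minimal Python sketch of that idea follows; the function name best_affinities is hypothetical, and it assumes that the first line whose first field is 1 is the top row of the ranking table.

```python
from pathlib import Path

def best_affinities(results_dir, rep):
    """Yield (name, affinity) for every *_rep<rep>.log file in results_dir."""
    for path in sorted(Path(results_dir).glob(f"*_rep{rep}.log")):
        with path.open() as log:
            for line in log:
                fields = line.split()
                if len(fields) >= 2 and fields[0] == "1":
                    # path.stem is the file name without directory and .log
                    yield path.stem, fields[1]
                    break  # mode 1 is the first row of the ranking table
```

Each pair can then be printed as name: affinity and sorted, as in the shell pipeline.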

Removing the beginnings of sequences in a FASTA file using a list of IDs and sizes

I want to remove the beginning of specific sequences, given a list of IDs with lengths, and extract the remaining sequence from a large FASTA file.
input test.fasta file:
>GHAT8X
MKFNDIRNDGHEDCFNNIIFASKLSSHKNVLKLTGCCLETRIPVIVFESVKNRTLADHIYQNQPHFEPLLLSQRLRIAVHIANAIAYLHIGFSRPILHRKIRPSRIFLDEGYIAKLFDFSLSVSIPEGETCVKDKVTGTMGFLAPEYI
>GHAMNO
MRLIGCCLETENPVLVFEYVEYGTLADRIYHPRQPNFEPVTCSLRLKIAMEIAYGIAYLHVAFSRPIVFRNVKPSNILFQEQSVAKLFDFSYSESIPEGETRIRGRVMGTFGYLPPEYIATGDCNEKCDVYSFGMLLLELLTGQRAVD
>GHAXM6
MYSCLGAIKNSGKEDKEKCIMRNGKNLLENLISSFNDGETHIKDAIPIGIMGFVATEYVTTGDYNEKCDVFSFGVLLLVLLTGQKLYSIDEAGDRHWLLNRVKKHIECNTFDEIVDPVIREELCIQSSEKDKQVQAFVELAVKCVSES
seqid_len.txt file:
GHAT8X 25
GHAMNO 26
GHAXM6 20
Expected output:
>GHAT8X
SSHKNVLKLTGCCLETRIPVIVFESVKNRTLADHIYQNQPHFEPLLLSQRLRIAVHIANA
IAYLHIGFSRPILHRKIRPSRIFLDEGYIAKLFDFSLSVSIPEGETCVKDKVTGTMGFLA
PEYI
>GHAMNO
ADRIYHPRQPNFEPVTCSLRLKIAMEIAYGIAYLHVAFSRPIVFRNVKPSNILFQEQSVA
KLFDFSYSESIPEGETRIRGRVMGTFGYLPPEYIATGDCNEKCDVYSFGMLLLELLTGQR
AVD
>GHAXM6
MRNGKNLLENLISSFNDGETHIKDAIPIGIMGFVATEYVTTGDYNEKCDVFSFGVLLLVL
LTGQKLYSIDEAGDRHWLLNRVKKHIECNTFDEIVDPVIREELCIQSSEKDKQVQAFVEL
AVKCVSES
I tried:
sed 's/_/|/g' seqid_len.txt | while read line;do grep -i -A1 ${line%%[1-9]*} test.fasta | seqkit subseq -r ${line##[a-z]* }:-1 ; done
I only get the sequences for GHAT8X 25 and GHAMNO 26. Also, renaming the header does not work.
Any correction to this, or any Python solution, would be really helpful.
Have a great weekend.
Thanks
Would you please try the following:
#!/bin/bash
awk 'NR==FNR {a[">" $1] = $2 + 0; next}    # create an array which maps the header to the starting position of the sequence
    $0 in a {                               # the header matches an array index
        start = a[$0]                       # get the starting position
        print                               # print the header
        getline                             # read the sequence line
        print substr($0, start)             # print the sequence by removing the beginnings
    }
' seqid_len.txt test.fasta | fold -w 60     # wrap the output within 60 columns
Output:
>GHAT8X
SSHKNVLKLTGCCLETRIPVIVFESVKNRTLADHIYQNQPHFEPLLLSQRLRIAVHIANA
IAYLHIGFSRPILHRKIRPSRIFLDEGYIAKLFDFSLSVSIPEGETCVKDKVTGTMGFLA
PEYI
>GHAMNO
ADRIYHPRQPNFEPVTCSLRLKIAMEIAYGIAYLHVAFSRPIVFRNVKPSNILFQEQSVA
KLFDFSYSESIPEGETRIRGRVMGTFGYLPPEYIATGDCNEKCDVYSFGMLLLELLTGQR
AVD
>GHAXM6
IMRNGKNLLENLISSFNDGETHIKDAIPIGIMGFVATEYVTTGDYNEKCDVFSFGVLLLV
LLTGQKLYSIDEAGDRHWLLNRVKKHIECNTFDEIVDPVIREELCIQSSEKDKQVQAFVE
LAVKCVSES
You'll see that the 3rd sequence starts with IMR.., shifted by one column compared with your expected MRN... If the 3rd one is correct and the 1st and 2nd sequences should be fixed instead, change the calculation $2 + 0 to $2 + 1.
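Since the question also asks about Python, here is a rough equivalent as a sketch. The names read_starts and trim_fasta are hypothetical. It makes the same assumptions as the awk script: each sequence sits on a single line, and the number in seqid_len.txt is the 1-based position of the first character to keep.

```python
def read_starts(lines):
    """Parse 'ID start' pairs such as 'GHAT8X 25' into a dict."""
    return {seq_id: int(pos)
            for seq_id, pos in (line.split() for line in lines if line.strip())}

def trim_fasta(fasta_lines, starts, width=60):
    """Yield the header and the wrapped, trimmed sequence for every ID in starts."""
    header = None
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            header = line[1:]
            if header in starts:
                yield line
        elif header in starts:
            # Drop everything before the 1-based start position.
            seq = line[starts[header] - 1:]
            for i in range(0, len(seq), width):
                yield seq[i:i + width]
```

With both files opened for reading, trim_fasta(fasta_file, read_starts(id_file)) yields the output lines one at a time, so a large FASTA file does not have to fit in memory.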

Unable to parse the log file using Shell and python

I am trying to parse a log file using a shell or Python script. I tried awk and sed, but had no luck. Can someone help me resolve this? Below are the input and the expected output.
Input:
customer1:123
SRE:1
clientID:1
Error=1
customer1:124
SRE:1
clientID:1
Error=2
customer1:125
SRE:1
clientID:1
Error=3
customer1:126
SRE:1
clientID:1
Error=4
Output:
Customer | Error
123 1
124 2
125 3
126 4
It's usual to show some of your work, or what you've tried so far, but here's a rough guess at what you're looking for.
tmp$ awk -F: '/^customer1:/ {CUST=$2} ; /^Error/ {split($0,a,"=") ; print CUST, a[2]} ' t
Or breaking down by line:
tmp$ awk -F: '\
> /^customer1:/ {CUST=$2} ; \
> /^Error/ {split($0,a,"=") ; print CUST, a[2]} \
> ' t
123 1
124 2
125 3
126 4
The first line
/^customer1:/ {CUST=$2} ;
This does two things: it matches lines that start (^ means start) with customer1:, and it saves the second field in CUST. Those lines are automatically split on : because we said -F: at the start of our command.
/^Error/ {split($0,a,"=") ; print CUST, a[2]} ;
This matches lines that start with Error, splits those lines into array a on the delimiter "=", and then prints the last value of CUST followed by the second element of a, i.e. the value after the = on the Error line.
Hopefully that all makes sense. It's worth reading an awk tutorial like https://www.grymoire.com/Unix/Awk.html
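As the question mentions Python as well, the same logic can be sketched there: remember the most recent customer ID and emit it whenever an Error line is seen. The function name customer_errors is hypothetical.

```python
def customer_errors(lines):
    """Yield (customer_id, error_code) pairs from the log lines."""
    customer = None
    for line in lines:
        line = line.strip()
        if line.startswith("customer1:"):
            customer = line.split(":", 1)[1]   # remember the latest customer ID
        elif line.startswith("Error="):
            yield customer, line.split("=", 1)[1]
```

Looping over customer_errors(open("t")) and printing each pair reproduces the two columns shown above, with t being the input file name used in the awk example.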

Grep all content of File 1 from File 2

This is about grepping, from a thread dump file on Unix, all the thread IDs that are listed in another file.
I also need at least 5 lines below each thread ID from the thread dump.
For example:
MAX_CPU_PID_TD_Ids.out:
1001
1003
MAX_CPU_PID_TD.txt:
............TDID=1001..................
Line 1
Line 2
Line 3
Line 4
Line 5
Line 6
Line 7
............TDID=1002...................
Line 1
Line 2
Line 3
Line 4
Line 5
Line 6
Line 7
...........TDID=1003......................
Line 1
Line 2
Line 3
Line 4
Line 5
Line 6
Line 7
Output should contain :-
............TDID=1001..................
Line 1
Line 2
Line 3
Line 4
Line 5
...........TDID=1003......................
Line 1
Line 2
Line 3
Line 4
Line 5
If possible, I would like to have the above output in the mail body.
I have tried the code below, but it sends me the thread IDs in the body with the thread dump file as an attachment.
However, I would like to have only the description of each thread ID in the body of the mail.
JAVA_HOME=/u01/oracle/products/jdk
MAX_CPU_PID=`ps -eo pid,ppid,cmd,%mem,%cpu --sort=-%mem | head -2 | sed -n '1!p' | awk '{print $1}'`
ps -eLo pid,ppid,tid,pcpu,comm | grep $MAX_CPU_PID > MAX_CPU_PID_SubProcess.out
cat MAX_CPU_PID_SubProcess.out | awk '{ print "pccpu: "$4" pid: "$1" ppid: "$2" ttid: "$3" comm: "$5}' |sort -n > MAX_CPU_PID_SubProcess_Sorted_temp1.out
rm MAX_CPU_PID_SubProcess.out
sort -k 2n MAX_CPU_PID_SubProcess_Sorted_temp1.out > MAX_CPU_PID_SubProcess_Sorted_temp2.out
rm MAX_CPU_PID_SubProcess_Sorted_temp1.out
awk '{a[i++]=$0}END{for(j=i-1;j>=0;j--)print a[j];}' MAX_CPU_PID_SubProcess_Sorted_temp2.out > MAX_CPU_PID_SubProcess_Sorted_temp3.out
rm MAX_CPU_PID_SubProcess_Sorted_temp2.out
awk '($2 > 15 ) ' MAX_CPU_PID_SubProcess_Sorted_temp3.out > MAX_CPU_PID_SubProcess_Sorted_Highest_Consuming.out
rm MAX_CPU_PID_SubProcess_Sorted_temp3.out
awk '{ print $8 }' MAX_CPU_PID_SubProcess_Sorted_Highest_Consuming.out > MAX_CPU_PID_SubProcess_Sorted_temp4.out
( echo "obase=16" ; cat MAX_CPU_PID_SubProcess_Sorted_temp4.out ) | bc > MAX_CPU_PID_TD_Ids_temp.out
rm MAX_CPU_PID_SubProcess_Sorted_temp4.out
$JAVA_HOME/bin/jstack -l $MAX_CPU_PID > MAX_CPU_PID_TD.txt
#grep -i -A 10 'error' data
awk 'BEGIN{print "The below thread IDs from the attached thread dump of OUD1 server are causing the highest CPU utilization. Please Analyze it further\n"}1' MAX_CPU_PID_TD_Ids_temp.out > MAX_CPU_PID_TD_Ids.out
rm MAX_CPU_PID_TD_Ids_temp.out
tr -cd "[:print:]\n" < MAX_CPU_PID_TD_Ids.out | mailx -s "OUD1 MAX CPU Utilization Analysis" -a MAX_CPU_PID_TD.txt <My Mail ID>
Answer for the first part: How to extract the lines.
The solution with grep -F -f MAX_CPU_PID_TD_Ids.out -A 5 MAX_CPU_PID_TD.txt as proposed in a comment is much simpler, but it may fail if the lines Line 1 etc can contain the values from MAX_CPU_PID_TD_Ids.out. It may also print a non-matching TDID= line if there are not enough lines after the previous matching line.
For the grep solution it may be better to create a file with patterns like ...TDID=1001....
The following script will print the matching lines ...TDID=XYZ... and at most the following 5 lines. It will stop after fewer lines if a new ...TDID=XYZ... is found.
For simplicity an empty line is printed before every ...TDID=XYZ... line, i.e. also before the first one.
awk 'NR==FNR {ids[$1]=1;next}    # from the first file save all IDs as array keys
/\.\.\.TDID=/ {
    sel = 0;                     # stop any previous output
    id=gensub(/\.*TDID=([^.]*)\.*/,"\\1",1);  # extract ID
    if(id in ids) {              # select if ID is present in array
        print ""                 # empty line as separator
        sel = 1;
    }
    count = 0;                   # counter to limit number of lines
}
sel {                            # selected for output?
    print;
    count++;
    if(count > 5) {              # stop after ...TDID= + 5 more lines (change the number if necessary)
        sel = 0
    }
}' MAX_CPU_PID_TD_Ids.out MAX_CPU_PID_TD.txt > MAX_CPU_PID_TD.extract
Apart from the first empty line, this script produces the expected output from the example input as shown in the question. If it does not work with the real input or if there are additional requirements, update the question to show the problematic input and the expected output or the additional requirements.
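The selection logic can also be sketched in Python, which may be easier to adapt if the real dump format differs. The function name extract_threads and the TDID regular expression are assumptions based on the example input in the question. Unlike the awk script, this sketch prints no empty separator line.

```python
import re

TDID = re.compile(r"TDID=([^.]+)")  # ID between "TDID=" and the next dot

def extract_threads(ids, dump_lines, after=5):
    """Yield each matching TDID line plus at most `after` following lines."""
    wanted = {i.strip() for i in ids}
    remaining = 0  # lines still to print after the current header
    for line in dump_lines:
        line = line.rstrip("\n")
        match = TDID.search(line)
        if match:
            selected = match.group(1) in wanted
            remaining = after if selected else 0
            if selected:
                yield line
        elif remaining > 0:
            yield line
            remaining -= 1
```

As in the awk version, output for a thread stops early when the next TDID line is reached.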
Answer for the second part: Mail formatting
To get the resulting data into the mail body you simply have to pipe it into mailx instead of specifying the file as an attachment.
( tr -cd "[:print:]\n" < MAX_CPU_PID_TD_Ids.out ; cat MAX_CPU_PID_TD.extract ) | mailx -s "OUD1 MAX CPU Utilization Analysis" <My Mail ID>

Bash: reading same lines in two files in nested loop

I'm trying to calculate confidence intervals from several files: some contain lines with means, and others contain lines with values (one per line). I'm trying to read one line from the file that contains the means, and all the lines from another file (because I have to do some computations). Here is what I've done (of course it's not working):
parameters="some value to move from a file to another one"
while read avg; do
for row in mypath/*_${parameters}*.dat; do
for value in $( awk '{ print $2; }' ${row}); do
read all the lines in first_file.dat (I need only the second column)
read the first line in avg.dat
combine data and calculate the confidence interval
done
done
done < avg.dat
** file avg.dat (not necessarily 100 lines) **
.99
2.34
5.41
...
...
2.88
** firstfile.dat in mypath (100 lines) **
0 13.77
1 2
2 63.123
3 21.109
...
...
99 1.05
** secondfile.dat in mypath (100 lines) **
0 8.56
1 91.663
2 19
3 0
...
...
99 4.34
The first line of avg.dat refers to the firstfile.dat in mypath, the second line of avg.dat refers to the secondfile.dat in mypath, etc... So, in the example above, I have to do some computation using .99 (from avg.dat) with all the numbers in the second column of firstfile.dat. Same with 2.34 and secondfile.dat.
I can't reach my objective because I can't find a way to move to the next line in avg.dat when I've finished reading a file in mypath. Instead I read the first line in avg.dat and all the files in mypath, then the second line in avg.dat and, again, all the files in mypath, etc. Can you help me find a solution? Thank you all!
In bash I would do this:
exec 3<avg.dat
shopt -s extglob
for file in !(avg).dat; do
    read -u 3 avg
    while read value; do
        : # do stuff with $value and $avg (the ":" no-op keeps the loop body valid)
    done < <(cut -f 2 -d " " "$file")
done
exec 3<&- # close the file descriptor
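The pairing of the n-th mean with the n-th data file can also be written with zip in Python, as sketched below. The question does not say how the interval is computed, so the formula here is an assumption: a normal-approximation interval, mean ± z·s/√n with z = 1.96, where s is the sample standard deviation of the values. The function names are hypothetical.

```python
import math
import statistics

def second_column(path):
    """Return the second whitespace-separated column of a file as floats."""
    with open(path) as handle:
        return [float(line.split()[1]) for line in handle if line.strip()]

def confidence_intervals(means, datasets, z=1.96):
    """Pair the n-th mean with the n-th dataset and return (low, high) tuples."""
    intervals = []
    for mean, values in zip(means, datasets):
        # Assumed formula: half-width = z * sample stdev / sqrt(n)
        half_width = z * statistics.stdev(values) / math.sqrt(len(values))
        intervals.append((mean - half_width, mean + half_width))
    return intervals
```

Here means would be the floats read from avg.dat and datasets a list built with second_column for each data file, in the same order as the lines of avg.dat. That ordering is something the caller has to guarantee.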
