Manipulating contents of a file for determining deadlock - bash

I've been working on a short assignment to detect deadlocks in a database soon after they occur. For this I will be regularly monitoring the process details, which I dump to a text file.
After some manipulation, the file (a miniature version) to be monitored looks like this:
INPUT FILE:-
SPID BLOCKING SPID
23 50
71 46
50 60
60 96
This means that process 23 is being blocked by process 50, which in turn is blocked by 60, and so on.
So in this case the process chain dependency is 23->50->60->96, and the other is 71->46.
In short, I want the final culprit SPIDs (in this case 96 and 46) along with the chain counts (4 and 2).
OUTPUT FILE:-
CULPRIT SPID CHAIN_COUNT
1) 96 4
2) 46 2
For this, the entire file will have to be looped over, comparing column 2 against every column 1 in the file. This has to be repeated until there is no match for column 2.
I need to achieve this via awk or sed, although I believe it could also be done with a linked list.
Any kind of suggestion is welcome.
Edit: @JS, just one more question. For the sample file:
SPID BLOCKING SPID
45 11
12 34
34 35
23 60
71 45
60 71
OUTPUT I RECEIVE:
71->45
23->60->71
12->34->35
OUTPUT I AM SUPPOSED TO GET:
23->60->71->45
12->34->35
I hope you got what I am trying to say.

Let's try to solve this in two steps. Step one is where we create the chain from your input file (assuming it might be useful for you to go back and check the trail):
awk '
NR > 1 {
    # If this line's blocked SPID was recorded earlier as another SPID's blocker,
    # extend that chain and re-key it under the current SPID.
    for (value in values) {
        if ($1 == values[value]) {
            chain[$1] = chain[value] "->" $2
            values[$1] = $2
            delete chain[value]
            delete values[value]
            next
        }
    }
    # Otherwise start a new chain for this SPID.
    values[$1] = $2
    chain[$1] = $1 "->" $2
}
END {
    # Join partial chains: if an existing chain ends in this SPID,
    # extend it with this SPID's blocker.
    for (left in values) {
        for (c in chain) {
            x = split(chain[c], t, /->/)
            if (left == t[x]) {
                chain[left] = chain[c] "->" values[left]
                delete chain[c]
            }
        }
    }
    for (trail in chain) print chain[trail]
}' file
For your input file, this gives the output as:
23->60->71->45->11
12->34->35
Now we pipe the output of the above script to another awk for formatting.
awk -F'->' '
BEGIN {
    OFS = "\t"
    print "CULPRIT SPID", "CHAIN_COUNT"
}
{
    # The last field is the culprit; the number of fields is the chain length.
    print $NF, NF
}'
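In other words, the two steps chain together on one command line. A sketch, assuming the step-one program is saved as chain.awk (a hypothetical name) and your input is in file:
awk -f chain.awk file | awk -F'->' '
BEGIN {
    OFS = "\t"
    print "CULPRIT SPID", "CHAIN_COUNT"
}
{ print $NF, NF }'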
This will give the output as:
CULPRIT SPID CHAIN_COUNT
11 5
35 3
As I said before, that assumes you need to look at the trail. If the trail is of no use to you, then you can simply do the following:
awk '
BEGIN {
    OFS = "\t"
    print "CULPRIT SPID", "CHAIN_COUNT"
}
NR > 1 {
    # Same chain building as before, but the links are separated by FS (a space).
    for (value in values) {
        if ($1 == values[value]) {
            chain[$1] = chain[value] FS $2
            values[$1] = $2
            delete chain[value]
            next
        }
    }
    values[$1] = $2
    chain[$1] = $1 FS $2
}
END {
    # Join partial chains whose last SPID matches an unresolved entry.
    for (left in values) {
        for (c in chain) {
            x = split(chain[c], t)
            if (left == t[x]) {
                chain[left] = chain[c] FS values[left]
                delete chain[c]
            }
        }
    }
    # The last element of each chain is the culprit; the element count is the chain length.
    for (v in chain) {
        num = split(chain[v], culprit)
        print culprit[num], num
    }
}' file
Output will be:
CULPRIT SPID CHAIN_COUNT
11 5
35 3
You can re-direct it to an output file as you please.
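For example, a minimal sketch, assuming the combined script above is saved as culprits.awk and the monitored dump is spids.txt (both hypothetical names):
awk -f culprits.awk spids.txt > deadlock_report.txt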

Related

How to transpose data when a match is found in an array

I want to convert XML data to CSV format.
The XML data to test is available in this question.
I need to extract the data between the tags <t> </t>.
Note: the data in the XML file might differ, and the headers will differ as well.
In this XML there are 3 headers -> NAME,AGE,COURSE
I have used the command below to get the data in horizontal format, all on one line:
awk -F'(</*t>|</*t>)' 'NF>1{for(i=2;i<NF; i=i+2) printf("%s%s", $i, (i+1==NF)?ORS:OFS)}' OFS=',' demo.xml
After running the above command, below is the output:
"NAME","Vikas","Vijay","Vilas","AGE","24","34","35","COURSE","MCA","MMS","MBA"
How I am trying to implement the logic:
Take a parameter from the user for how many header values there will be.
In the above XML there are 3 headers -> NAME,AGE,COURSE
header_count=3
So 3 headers means each header comes with 3 values, i.e. HEADER + 3 values.
NAME,Vikas,Prabhas,Arjun -> this will be transposed as below
Output :
NAME,
Vikas,
Prabhas,
Arjun,
The same goes for the header value AGE -> AGE + 3 values.
AGE,25,34,21 will be transposed vertically:
AGE
25
34
21
The same goes for the header value COURSE -> COURSE + 3 values.
COURSE,MCA,MBA,MMS will be transposed vertically:
COURSE
MCA
MBA
MMS
Expected output after combining all the data for NAME, AGE, COURSE:
NAME,AGE,Course
Vikas,"25",MCA
Prabhash,"34",MBA
Arjun,"21",MMS
With your shown samples, please try the following. Written and tested in GNU awk; it should work in any awk.
awk '
BEGIN {
    FS = OFS = ","
}
{
    for (i = 1; i <= NF; i++) {
        # Whenever a header field is seen, switch which array gets filled.
        if ($i ~ /^"NAME"$/) {
            found1 = 1
            found2 = found3 = ""
        }
        if ($i ~ /^"AGE"$/) {
            found2 = 1
            found1 = found3 = ""
        }
        if ($i ~ /^"COURSE"$/) {
            found3 = 1
            found1 = found2 = ""
        }
        if (found1) { name[++count1] = $i }
        if (found2) { age[++count2] = $i }
        if (found3) { course[++count3] = $i }
    }
}
END {
    # Print as many rows as the longest column, filling gaps with NA.
    if (count1 >= count2 && count1 >= count3) { max = count1 }
    if (count2 >= count1 && count2 >= count3) { max = count2 }
    if (count3 >= count1 && count3 >= count2) { max = count3 }
    for (i = 1; i <= max; i++) {
        print (name[i] ? name[i] : "NA", age[i] ? age[i] : "NA", course[i] ? course[i] : "NA")
    }
}
' Input_file
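If you want the user-supplied header_count the question describes instead of hard-coded header names, here is a generic sketch. It assumes the flattened one-line output shown earlier has been saved to a hypothetical file flattened.csv, and that every header is followed by exactly header_count values:
awk -F',' -v header_count=3 '
{
    # Fields arrive in groups of (1 header + header_count values).
    group_size = header_count + 1
    groups = NF / group_size
    for (g = 0; g < groups; g++)
        for (v = 0; v < group_size; v++)
            cell[v, g] = $(g * group_size + v + 1)
    # Row v holds the v-th entry of every group; row 0 is the header row.
    for (v = 0; v < group_size; v++) {
        line = cell[v, 0]
        for (g = 1; g < groups; g++)
            line = line "," cell[v, g]
        print line
    }
}' flattened.csv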

How to read non-ASCII characters in a file as ASCII using awk

My input file:
000000000 vélIstine IOBAN 00000004960
000000000 shankargu kumar 00000000040
TTTTTTTTT 0000000200000000050000000000000000000000
Whenever I have a non-ASCII character in the file, as above,
my code snippet below does not calculate the sum (d_amt_sum+=substr($0,27,10)) properly: sometimes it skips that row, and sometimes it gives an incorrect value, returning 49 instead of 496 for substr($0,27,10).
Besides that, I want to know how to add a print statement inside awk; for example, I need to print the value of substr($0,27,10) inside the if block. How do I do that?
set -A out_result -- `LC_ALL=en_US.UTF-8 awk '
BEGIN {
    d_amt_sum = d_rec_count = d_trailer_out_amt_sum = d_trailer_rec_count = 0
}
{
    if (substr($0,1,9) != "TTTTTTTTT") {
        d_amt_sum += substr($0,27,10)
        d_rec_count += 1
    }
    else if (substr($0,1,9) == "TTTTTTTTT") {
        d_trailer_out_amt_sum += substr($0,39,12)
        d_trailer_rec_count += substr($0,31,8)
    }
}
END {
    print d_amt_sum, d_rec_count, d_trailer_out_amt_sum, d_trailer_rec_count
}' ${OUTDIR}/${OUT_FILE}`
Expected output
500,2,500,2
You have a logic error in the ordering of the if/else statements, and another error in checking a 1-character length against a 9-character length. Fixing both gives:
awk '{
    k = substr($0, 1, 9)
    if (k == "TTTTTTTTT") {
        d_trailer_out_amt_sum += substr($0, 39, 12)
        d_trailer_rec_count += substr($0, 31, 8)
    }
    else if (k != "999999999") {
        d_amt_sum += substr($0, 27, 10)
        d_rec_count += 1
    }
}
END { print d_amt_sum, d_rec_count, d_trailer_out_amt_sum, d_trailer_rec_count }' file
500 2 500 2
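As for the side question about adding a print statement: print and printf are ordinary awk statements, so they can go anywhere inside the if block. A minimal sketch (the amt variable name is just for illustration):
awk '{
    if (substr($0, 1, 9) != "TTTTTTTTT") {
        amt = substr($0, 27, 10)
        printf "line %d: amount field = [%s]\n", NR, amt   # debug print of the slice being summed
        d_amt_sum += amt
        d_rec_count += 1
    }
}
END { print d_amt_sum, d_rec_count }' file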

AWK performance while processing big files

I have an awk script that I use to calculate how much time some transactions take to complete. The script gets the unique ID of each transaction and stores the minimum and maximum timestamp for each one. Then it calculates the difference, and at the end it shows those results that are over 60 seconds.
It works very well with a couple hundred thousand lines (200k), but it takes much more time in the real world. I tested it several times and it takes about 15 minutes to process about 28 million lines. Can I consider this good performance, or is it possible to improve it?
I'm open to any kind of suggestion.
Here is the complete code:
zgrep -E "\(([a-z0-9]){15,}:" /path/to/very/big/log | awk '{
    gsub("[()]|:.*","",$4)                            #just removing ugly chars
    ++cont
    min=$4"min"                                       #key for the minimum value of the current transaction
    max=$4"max"                                       #same as previous, just for readability
    split($2,secs,/[:,]/)                             #split hours, minutes and seconds
    seconds = 3600*secs[1] + 60*secs[2] + secs[3]     #turn everything into seconds
    if(arr[min] > seconds || arr[min] == 0)
        arr[min]=seconds
    if(arr[max] < seconds)
        arr[max]=seconds
    dif=arr[max] - arr[min]
    if(dif > 60)
        result[$4] = dif
}
END{
    for(x in result)
        print x" - "result[x]
    print ":Processed "cont" lines"
}'
You don't need to calculate the dif every time you read a record. Just do it once in the END section.
You don't need that cont variable; just use NR.
You don't need to populate min and max via separate string-concatenated keys; string concatenation is slow in awk.
You shouldn't change $4, as that forces the record to be recompiled.
Try this:
awk '{
    name = $4
    gsub(/[()]|:.*/,"",name)                      #just removing ugly chars
    split($2,secs,/[:,]/)                         #split hours, minutes and seconds
    seconds = 3600*secs[1] + 60*secs[2] + secs[3] #turn everything into seconds
    if (!(name in min)) {                         #first time we see this transaction
        min[name] = max[name] = seconds
    }
    else {
        if (min[name] > seconds) {
            min[name] = seconds
        }
        if (max[name] < seconds) {
            max[name] = seconds
        }
    }
}
END {
    for (name in min) {
        diff = max[name] - min[name]
        if (diff > 60) {
            print name, "-", diff
        }
    }
    print ":Processed", NR, "lines"
}'
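If you want to see where the time actually goes before changing anything, a rough timing harness like this can compare the pieces separately (old.awk and new.awk are hypothetical files holding the original and the rewritten programs):
time zgrep -E "\(([a-z0-9]){15,}:" /path/to/very/big/log > /dev/null                     # cost of zgrep alone
time zgrep -E "\(([a-z0-9]){15,}:" /path/to/very/big/log | awk -f old.awk > /dev/null    # zgrep + original script
time zgrep -E "\(([a-z0-9]){15,}:" /path/to/very/big/log | awk -f new.awk > /dev/null    # zgrep + rewritten script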
After making some tests, and with the suggestions given by Ed Morton (both for code improvement and for performance testing), I found that the bottleneck was the zgrep command. Here is an example that does several things:
Check whether we have a transaction line (first if)
Clean the transaction id
Check whether it has already been registered (second if) by checking whether it is in the array
If it is not registered, check whether it is the appropriate type of transaction, and if so register the timestamp in seconds
If it is already registered, save the new timestamp as the maximum
After all that, it performs the necessary operations to calculate the time difference
Thank you very much to all who helped me.
zcat /veryBigLog.gz | awk '
{
    if ($4 ~ /^\([[:alnum:]]/) {
        name = $4; gsub(/[()]|:.*/, "", name)
        if (!(name in min)) {
            if ($0 ~ /TypeOFTransaction/) {
                split($2, secs, /[:,]/)
                seconds = 3600*secs[1] + 60*secs[2] + secs[3]
                max[name] = min[name] = seconds
                print length(min) " new " name " start at " seconds
            }
        } else {
            split($2, secs, /[:,]/)
            seconds = 3600*secs[1] + 60*secs[2] + secs[3]
            if (max[name] < seconds) max[name] = seconds
            print name " new max " max[name]
        }
    }
}
END {
    for (x in min) {
        dif = max[x] - min[x]
        print max[x] " max - min " min[x] " : " dif
    }
    print "Processed " NR " Records"
    print "Found " length(min) " MOs"
}'

Awk Calc Avg Rows Below Certain Line

I'm having trouble using awk to calculate an average of specific numbers in a column BELOW a specific text identifier. I have two columns of data and I'm trying to start the average at a common identifier that repeats, which is 01/1991. So awk should start the average at each line beginning with 01/1991 and use that line plus the next 21 lines, for a total of 22 rows per average (one per year for 1991-2012). The desired output is an average for each TextID/Name entry over all the Januarys (01) for the years 1991-2012, shown below:
TextID/Name 1
Avg: 50.34
TextID/Name 2
Avg: 45.67
TextID/Name 3
Avg: 39.97
...
sample data:
TextID/Name 1
01/1991, 57.67
01/1992, 56.43
01/1993, 49.41
..
01/2012, 39.88
TextID/Name 2
01/1991, 45.66
01/1992, 34.77
01/1993, 56.21
..
01/2012, 42.11
TextID/Name 3
01/1991, 32.22
01/1992, 23.71
01/1993, 29.55
..
01/2012, 35.10
continues with the same data for TextID/Name 4
I'm getting an answer using the code shown below, but the average starts being calculated BEFORE the specific identifier line, not on and below that line (01/1991).
awk '$1="01/1991" {sum+=$2} (NR%22==0){avg=sum/22;print"Average: "avg;sum=0;next}' myfile
Thanks; explanations of the solution are greatly appreciated! I have edited the original question with more description - thank you again.
If you look at your file, the first field is "01/1991," with a comma at the end, not "01/1991" (and note that $1="01/1991" is an assignment, not the comparison $1=="01/1991", so that pattern matches every line). Also, NR%22==0 looks at line numbers divisible by 22, not at the 22 lines after the point you care about.
You can do something like this instead:
awk '
BEGIN { l = -1 }
$1 == "01/1991," {         # start of a block: count the next 22 lines
    l = 22
    s = 0
}
l > 0  { s += $2; l-- }    # accumulate while the counter is positive
l == 0 { print s/22; l-- } # counter exhausted: print the average once
'
It has a counter l that it sets to the number of lines to count; it then sums up that many lines.
You may want to consider simply summing all lines from one 01/1991 to the next though, which might be more robust; see the sketch below.
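For instance, a minimal sketch of that more robust idea, keyed on the TextID/Name header lines rather than on a fixed count of 22 rows, and averaging every 01/yyyy line inside each block (assuming the data looks like the sample above):
awk '
function report() { if (n) printf "%s\nAvg: %.2f\n", name, sum / n }
/^TextID\/Name/  { report(); name = $0; sum = n = 0; next }   # new block: report the previous one
/^01\/[0-9]+,/   { sum += $2; n++ }                           # accumulate the January values
END              { report() }                                 # report the final block
' myfile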
If you're allowed to use Perl instead of Awk, you could do:
#!/usr/bin/env perl
use strict;
use warnings;

my $have_started = 0;
my $count = 0;
my $sum = 0;

while (<>) {
    # Start summing values at the first 01/1991 line
    $have_started = 1 if /01\/1991,/;

    # Grab the value after the date and comma, and add it to the running sum
    if ($have_started && /\d+\/\d+,\s+([\d.]+)/) {
        $count++;
        $sum += $1;
    }
}
print "Average of all values = " . $sum / $count . "\n";
Run it like so:
$ cat your-text-file.txt | above-perl-script.pl

Comparing many files in Bash

I'm trying to automate a task at work that I normally do by hand: taking database output of the permissions of multiple users and comparing them to see what they have in common. I have a script right now that uses comm and paste, but it's not giving me all the output I'd like.
Part of the problem is that comm only deals with two files at once, and I need to compare at least three to find a trend. I also need to determine whether two out of the three have something in common while the third one doesn't (so comparing the output of two comm commands doesn't work). I need the results as comma-separated values so they can be imported into Excel. Each user gets a column, and at the end is a listing of everything they have in common. comm would work perfectly if it could compare more than two files (and show two-out-of-three comparisons).
In addition to the code I have to clean all the extra cruft off the raw CSV file, here's what I have so far for comparing four users. It's highly inefficient, but it's what I know.
cat foo1 | sort > foo5
cat foo2 | sort > foo6
cat foo3 | sort > foo7
cat foo4 | sort > foo8
comm foo5 foo6 > foomp
comm foo7 foo8 > foomp2
paste foomp foomp2 > output2
sed 's/[\t]/,/g' output2 > output4.csv
cat output4.csv
Right now this outputs two users, their similarities and differences, then does the same for another two users and pastes the results together. This works better than doing it by hand, but I know I could be doing more.
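As a side note, the same comm/paste idea can be written without the temporary sorted copies by using bash process substitution; this is just a tidier sketch of the equivalent pipeline and does not lift the two-files-at-a-time limitation:
comm <(sort foo1) <(sort foo2) > foomp
comm <(sort foo3) <(sort foo4) > foomp2
paste foomp foomp2 | tr '\t' ',' > output4.csv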
An example input file would be something like:
User1
Active Directory
Internet
S: Drive
Sales Records
User2
Active Directory
Internet
Pricing Lookup
S: Drive
User3
Active Directory
Internet
Novell
Sales Records
where they have AD and Internet in common, two out of three have Sales Records access and S: Drive permission, and only one each has Novell and Pricing Lookup access.
Can someone give me a hand in what I'm missing?
Using GNU AWK (gawk) you can print a table that shows how multiple users' permissions correlate. You could also do the same thing in any language that supports associative arrays (hashes), such as Bash 4, Python, Perl, etc.
#!/usr/bin/awk -f
{
    # Record which permission appears in which file (user), and track the longest name.
    array[FILENAME, $0] = $0
    perms[$0] = $0
    if (length($0) > maxplen) {
        maxplen = length($0)
    }
    users[FILENAME] = FILENAME
}
END {
    pcount = asort(perms)      # GNU awk: sort permissions and users alphabetically
    ucount = asort(users)
    maxplen += 2
    colwidth = 8
    printf("%*s", maxplen, "")
    for (u = 1; u <= ucount; u++) {
        printf("%-*s", colwidth, users[u])
    }
    printf("\n")
    for (p = 1; p <= pcount; p++) {
        printf("%-*s", maxplen, perms[p])
        for (u = 1; u <= ucount; u++) {
            if (array[users[u], perms[p]]) {
                printf("%-*s", colwidth, " X")
            } else {
                printf("%-*s", colwidth, "")
            }
        }
        printf("\n")
    }
}
Save this file, perhaps calling it "correlate", then set it to be executable:
$ chmod u+x correlate
Then, assuming that the filenames correspond to the usernames or are otherwise meaningful (your examples are "user1" through "user3" so that works well), you can run it like this:
$ ./correlate user*
and you would get the following output based on your sample input:
                  user1   user2   user3
Active Directory   X       X       X
Internet           X       X       X
Novell                             X
Pricing Lookup             X
S: Drive           X       X
Sales Records      X               X
Edit:
This version doesn't use asort() and so it should work on non-GNU versions of AWK. The disadvantage is that the order of rows and columns is unpredictable.
#!/usr/bin/awk -f
{
    array[FILENAME, $0] = $0
    perms[$0] = $0
    if (length($0) > maxplen) {
        maxplen = length($0)
    }
    users[FILENAME] = FILENAME
}
END {
    maxplen += 2
    colwidth = 8
    printf("%*s", maxplen, "")
    for (u in users) {
        printf("%-*s", colwidth, u)
    }
    printf("\n")
    for (p in perms) {
        printf("%-*s", maxplen, p)
        for (u in users) {
            if (array[u, p]) {
                printf("%-*s", colwidth, " X")
            } else {
                printf("%-*s", colwidth, "")
            }
        }
        printf("\n")
    }
}
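If you need the comma-separated form the question mentions for importing into Excel, rather than a fixed-width table, a CSV-flavoured variant could look like the sketch below (assuming no permission names contain commas; like the version just above, row and column order is unpredictable):
#!/usr/bin/awk -f
# CSV variant: one row per permission, one column per user file, "X" marks membership.
{
    array[FILENAME, $0] = 1
    perms[$0] = 1
    users[FILENAME] = 1
}
END {
    printf("permission")
    for (u in users) printf(",%s", u)
    printf("\n")
    for (p in perms) {
        printf("%s", p)
        for (u in users) printf(",%s", ((u, p) in array) ? "X" : "")
        printf("\n")
    }
}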
You can use the diff3 program. From the man page:
diff3 - compare three files line by line
Given your sample inputs, above, running diff3 results in:
====
1:3,4c
  S: Drive
  Sales Records
2:3,4c
  Pricing Lookup
  S: Drive
3:3,4c
  Novell
  Sales Records
Does this get you any closer to what you're looking for?
I would use the strings command to remove any binary content from the files, cat them together, then use uniq -c on the concatenated file to get a count of occurrences of each string.
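A sketch of that idea (uniq -c only counts adjacent duplicates, so the stream is sorted first; user1, user2 and user3 are the per-user permission files):
strings user1 user2 user3 | sort | uniq -c | sort -rn
A count of 3 means all three users have that permission, 2 means two of the three have it, and so on.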
