Merging sums of numbers from different files and deleting select duplicate lines - bash

I've checked other threads here on merging, but they seem to be mostly about merging text, and not quite what I needed, or at least I couldn't figure out a way to connect their solutions to my own problem.
Problem
I have 10+ input files, each consisting of two columns of numbers (think of them as x,y data points for a graph). Goals:
Merge these files into 1 file for plotting
For any duplicate x values in the merge, add their respective y-values together, then print one line with x in field 1 and the added y-values in field 2.
Consider this example for 3 files:
y1.dat
25 16
27 18
y2.dat
24 10
27 9
y3.dat
24 2
29 3
According to my goals above, I should be able to merge them into one file with output:
final.dat
24 12
25 16
27 27
29 3
Attempt
So far, I have the following:
#!/bin/bash
loops=3
for i in `seq $loops`; do
    if [ $i == 1 ]; then
        cp -f y$i.dat final.dat
    else
        awk 'NR==FNR { arr[NR] = $1; p[NR] = $2; next } {
            for (n in arr) {
                if ($1 == arr[n]) {
                    print $1, p[n] + $2
                    n++
                }
            }
            print $1, $2
        }' final.dat y$i.dat >> final.dat
    fi
done
Output:
25 16
27 18
24 10
27 27
27 9
24 12
24 2
29 3
On closer inspection, it's clear I have duplicates of the original x-values.
The problem is that my script prints all the x-values first, and only then can I add them together for my output. However, I don't know how to go back and remove the lines with the old x-values that I needed in order to make the addition.
If I blindly use uniq, I can't control whether the old x-values or the new x-value gets deleted. With awk '!duplicate[$1]++' the order of deletion reversed over the loop, so it deletes the right lines on the first pass but the wrong ones after that.
Been at this for a long time, would appreciate any help. Thank you!

I am assuming you already merged all the files into a single one before making the calculation. Once that's done, the script is as simple as:
awk '{ if ( $1 != "" ) { coord[$1]+=$2 } } END { for ( k in coord ) { print k " " coord[k] } }' input.txt
Hope it helps!
Edit: How does this work?
if ( $1 != "" ) { coord[$1]+=$2 }
This line gets executed for each line in your input. It first checks whether there is a value for X, otherwise it simply ignores the line; this helps to skip empty lines should your file have any. The block which gets executed, coord[$1]+=$2, is the heart of the script: it builds a dictionary with X as the key of each entry while adding up every Y value found for that X.
END { for ( k in coord ) { print k " " coord[k] } }
This block executes after awk has iterated over all the lines in your file. It simply grabs each key from the dictionary and prints it, then a space, and finally the sum of all the values which were found, in other words the value for that specific key.
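One caveat worth noting (not part of the answer above): for (k in coord) iterates in an unspecified order, and you don't actually need a separate merge step, since awk accepts several input files. A minimal sketch combining both points, assuming the data files match y*.dat:
awk '$1 != "" { coord[$1] += $2 } END { for (k in coord) print k, coord[k] }' y*.dat | sort -n > final.dat
This produces the sorted final.dat from the example: 24 12, 25 16, 27 27, 29 3.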

Using a Perl one-liner
> cat y1.dat
25 16
27 18
> cat y2.dat
24 10
27 9
> cat y3.dat
24 2
29 3
> perl -lane ' $kv{$F[0]}+=$F[1]; END { print "$_ $kv{$_}" for(sort keys %kv) }' y*dat
24 12
25 16
27 27
29 3
>
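A small caveat, not raised in the answer: sort keys %kv sorts the keys as strings, which happens to be fine for these x values but would misorder values with different digit counts (e.g. 192 before 24). Switching to a numeric sort is a small change; a sketch:
perl -lane ' $kv{$F[0]}+=$F[1]; END { print "$_ $kv{$_}" for(sort { $a <=> $b } keys %kv) }' y*dat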

Related

bash shell script for conditional assignment

I have a shell script that gives me a text file output in the following format:
OUTPUT.TXT
FirmA
58
FirmB
58
FirmC
58
FirmD
58
FirmE
58
This output is good, i.e. a YES that my job completed as expected, since the value for all of the firms is 58.
So I used to take a count of '58' in this text file to automatically tell in a RESULT job that everything worked out well.
Now there seems to be a bug due to which the output sometimes comes out like below:
OUTPUT.TXT
FirmA
58
FirmB
58
FirmC
61
FirmD
58
FirmE
61
which is impacting my count (only 3 counts of 58 instead of the expected 5), and hence my RESULT job states that it FAILED, i.e. a NO.
But actually the job has worked fine as long as the value stays within 58 to 61 for each firm.
So how can I ensure that, if the value is >=58 and <=61 for each of these five firms, the job is treated as having worked as expected?
My simple one-liner to check the count in OUTPUT.TXT:
grep -cow 58 "OUTPUT.TXT"
Try Awk for simple jobs like this. You can learn enough in an hour to solve these problems yourself easily.
awk '(NR % 3 == 2) && ($1 < 58 || $1 > 61)' OUTPUT.TXT
This checks every third line, starting from the second, and prints any which are not in the range 58 to 61.
It would not be hard to extend the script to remember the string from the previous line. In fact, let's do that.
awk '(NR % 3 == 1) { firm = $0; next }
(NR % 3 == 2) && ($1 < 58 || $1 > 61) { print NR ":" firm, $0 }' OUTPUT.TXT
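If the RESULT job only needs a plain YES/NO rather than the offending lines, the same check can be turned into an exit status; a sketch, keeping the every-third-line layout assumed above:
awk '(NR % 3 == 2) && ($1 < 58 || $1 > 61) { bad = 1 } END { exit bad }' OUTPUT.TXT && echo YES || echo NO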
You might also want to check how many you get of each. But let's just make a separate script for that.
awk '(NR % 3 == 2) { ++a[$1] }
END { for (k in a) print k, a[k] }' OUTPUT.TXT
The Stack Overflow awk tag info page has links to learning materials etc.

Converting second pattern to millisecond in awk

I have a file containing values with an 's' pattern, and I need to convert them into 'ms' by multiplying by 1000. I am unable to do it. Please help me.
file.txt
First launch 1
App: +1s170ms
First launch 2
App: +186ms
First launch 3
App: +1s171ms
First launch 4
App: +1s484ms
First launch 5
App: +1s227ms
First launch 6
App: +204ms
First launch 7
App: +1s180ms
First launch 8
App: +1s177ms
First launch 9
App: +1s183ms
First launch 10
App: +1s155ms
My code:
awk 'BEGIN { FS="[: ]+" }
/:/ && $2 ~ /ms$/ { vals[$1] = vals[$1] OFS $2+0; next }
END {
    for (key in vals)
        print key vals[key]
}' file.txt
Expected output:
App 1170 186 1171 1484 1227 204 1180 1177 1183 1155
Output Coming:
App 1 186 1 1 1 204 1 1 1 1
How can I convert the 's' part of the pattern to 'ms' when a seconds value appears?
What I will try to do here is explain it a bit more generically and then apply it to your case.
Question: I have a string of the form 123a456b7c8d where the numbers are numeric integral values of any length and the letters are corresponding units. I also have conversion factors to convert from unit a,b,c,d to unit f. How can I convert this to a single quantity of unit f?
Example: from 1s183ms to 1183ms
Strategy:
create per string a set of key-value pairs 'a' => 123,'b' => 456, 'c' => 7 and 'd' => 8
multiply each value with the correct conversion factor
add the numbers together
Assume we use awk and the key-value pairs are stored in array a with the key as an index.
Extract key-value pairs from str:
function extract(str,a, t,k,v) {
    delete a; t=str;
    while (t != "") {
        v=t+0; match(t,/[a-zA-Z]+/); k=substr(t,RSTART,RLENGTH);
        t=substr(t,RSTART+RLENGTH);
        a[k]=v
    }
    return
}
Convert and sum: here we assume we have an array f which contains the conversion factors:
function convert(a,f, t,k) {
    t=0; for (k in a) t += a[k] * f[k]
    return t
}
The full code (for the example of the OP)
# set conversion factors
BEGIN{ f["s"]=1000; f["ms"]=1 }
# print first word
BEGIN{ printf "App:" }
# extract string and print
/^App/ { extract($2,a); printf OFS "%dms", convert(a,f) }
END { printf ORS }
which outputs:
App: 1170ms 186ms 1171ms 1484ms 1227ms 204ms 1180ms 1177ms 1183ms 1155ms
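Since the block above is a standalone awk program rather than a shell one-liner, it would be run from a file; assuming it is saved under a name such as convert.awk (a name chosen here for illustration), the invocation would be:
awk -f convert.awk file.txt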
perl -n -e '$s=0; ($s)=/(\d+)s/; ($ms)=/(\d+)ms/;
s/^(\w+):/push @{$vals{$1}}, $ms+$s*1000/e;
eof && print "$_: @{$vals{$_}}\n" for keys %vals;' file
perl -n doesn't print anything as it loops through the input.
$s and $ms are captured from those fields; $s is reset to zero first in case there is no seconds part.
s///e stuffs the %vals hash with a list of numbers in ms for each key (App, in this case).
eof && executes the subsequent code after the end of the file.
print "$_: @{$vals{$_}}\n" for keys %vals prints the %vals hash as the OP wants.
App: 1170 186 1171 1484 1227 204 1180 1177 1183 1155

Calculate the average over a number of columns

I am trying to create a script which calculates the average over a number of rows.
This number would depend on the number of samples that I have, which varies.
An example of these files is here:
24 1 2.505
24 2 0.728
24 3 0.681
48 1 2.856
48 2 2.839
48 3 2.942
96 1 13.040
96 2 12.922
96 3 13.130
192 1 50.629
192 2 51.506
192 3 51.016
The average is calculated on the 3rd column, and the second column indicates the number of samples, 3 in this particular case.
Therefore, I should obtain 4 values here:
one average value per 3 rows.
I have tried something like:
count=3;
total=0;
for i in $( awk '{ print $3; }' ${file} )
do
for j in 1 2 3
do
total=$(echo $total+$i | bc )
done
echo "scale=2; $total / $count" | bc
done
But it is not giving me the right answer, instead I think it calculates an average per each group of three rows.
Expected output
24 1.3046
48 2.879
96 13.0306
192 51.0503
You can use the following awk script:
awk '{t[$2]+=$3;n[$2]++}END{for(i in t){print i,t[i]/n[i]}}' file
Output:
1 17.2575
2 16.9988
3 16.9423
This is better explained as a multiline script with comments in it:
# On every line of input
{
# sum up the value of the 3rd column in an array t
# which is is indexed by the 2nd column
t[$2]+=$3
# Increment the number of lines having the same value of
# the 2nd column
n[$2]++
}
# At the end of input
END {
# Iterate through the array t
for(i in t){
# Print the number of samples along with the average
print i,t[i]/n[i]
}
}
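Note that this script groups by the second column (the sample index), which is why its output differs from the expected output in the question. A variant keyed on the first column instead, piped through sort -n so the groups come out in numeric order, could look like this (a sketch, not from the original answer):
awk '{ sum[$1] += $3; cnt[$1]++ } END { for (k in sum) print k, sum[k]/cnt[k] }' file | sort -n
For the sample data this prints 24 1.30467, 48 2.879, 96 13.0307 and 192 51.0503.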
Apparently I brought a third view to the problem. In awk:
$ awk 'NR>1 && $1!=p{print p, s/c; c=s=0} {s+=$3;c++;p=$1} END {print p, s/c}' file
24 1.30467
48 2.879
96 13.0307
192 51.0503
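One assumption worth flagging (mine, not the answerer's): this sequential approach relies on rows with the same first-column value being adjacent, as they are in the sample. If the input might not be grouped, you could sort it first:
sort -n -k1,1 file | awk 'NR>1 && $1!=p{print p, s/c; c=s=0} {s+=$3;c++;p=$1} END {print p, s/c}'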

bash: find pattern in one file and apply some code for each pattern found

I created a script that will auto-login to a router and check the current CPU load; if the load exceeds a certain threshold, it needs to print the current CPU value to standard output.
I would like to search the script output for a certain pattern (the value 80 in this case, which is the threshold for high CPU load) and then, for each instance of the pattern, check whether the current value is greater than 80 or not; if true, it should print the 5 lines before the pattern followed by the line with the pattern itself.
Question 1: how do I loop over each instance of the pattern and apply some code to each of them separately?
Question 2: How do I print n lines before the pattern followed by x lines after the pattern?
For example, I used awk to search for the pattern "health" and print 6 lines after it, as below:
awk '/health/{x=NR+6}(NR<=x){print}' ./logs/CpuCheck.log
I would like to do the same for the pattern "80", and this time print 5 lines before it and one line after, but only if $3 (representing the current CPU load) exceeds the value 80.
Below is the output of the auto-login script (file name: CpuCheck.log)
ABCD-> show health xxxxxxxxxx
* - current value exceeds threshold
1 Min 1 Hr 1 Hr
Cpu Limit Curr Avg Avg Max
-----------------+-------+------+------+-----+----
01 80 39 36 36 47
WXYZ-> show health xxxxxxxxxx
* - current value exceeds threshold
1 Min 1 Hr 1 Hr
Cpu Limit Curr Avg Avg Max
-----------------+-------+------+------+-----+----
01 80 29 31 31 43
Thanks in advance for the help
Rather than use awk, you could use the -B and -A switches to grep, which print a number of lines before and after a pattern is matched:
grep -E -B 5 -A 1 '^[0-9]+[[:space:]]+80[[:space:]]+(100|9[0-9]|8[1-9])' CpuCheck.log
The pattern matches lines which start with some numbers, followed by spaces, followed by 80, followed by a number between 81 and 100. The -E switch enables extended regular expressions (EREs), which are needed if you want to use the + character to mean "one or more". If your version of grep doesn't support EREs, you can instead use the slightly more verbose \{1,\} syntax:
grep -B 5 -A 1 '^[0-9]\{1,\}[[:space:]]\{1,\}80[[:space:]]\{1,\}\(100\|9[0-9]\|8[1-9]\)' CpuCheck.log
If grep isn't an option, one alternative would be to use awk. The easiest way would be to store all of the lines in a buffer:
awk 'f-->0;{a[NR]=$0}/^[0-9]+[[:space:]]+80[[:space:]]+(100|9[0-9]|8[1-9])/{for(i=NR-5;i<=NR;++i)print i, a[i];f=1}'
This stores every line in an array a. When the third column is greater than 80, it prints the previous 5 lines from the array. It also sets the flag f to 1, so that f-->0 is true for the next line, causing it to be printed.
Originally I had opted for a comparison $3>80 instead of the regular expression but this isn't a good idea due to the varying format of the lines.
If the log file is really big, meaning that reading the whole thing into memory is unfeasible, you could implement a circular buffer so that only the previous 5 lines were stored, or alternatively, read the file twice.
Unfortunately, awk is stream-oriented and doesn't have a simple way to get the lines before the current line. But that doesn't mean it isn't possible:
awk '
BEGIN {
    bufferSize = 6;
}
{
    buffer[NR % bufferSize] = $0;
}
$2 == 80 && $3 > 80 {
    # print the five lines before the match and the line with the match
    for (i = 1; i <= bufferSize; i++) {
        print buffer[(NR + i) % bufferSize];
    }
}
' ./logs/CpuCheck.log
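A small refinement (my addition, not part of the answer): if a match occurs within the first five lines of the file, the buffer still contains empty slots, which the version above prints as blank lines. A guard on NR avoids that; a sketch:
awk '
BEGIN { bufferSize = 6 }
{ buffer[NR % bufferSize] = $0 }
$2 == 80 && $3 > 80 {
    # start from the oldest line actually stored
    start = (NR > bufferSize) ? NR - bufferSize + 1 : 1
    for (i = start; i <= NR; i++) print buffer[i % bufferSize]
}
' ./logs/CpuCheck.log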
I think the easiest way is with awk, reading the file twice.
This should use essentially zero memory except whatever is used to store the line numbers.
If there is only one occurrence
awk 'NR==FNR&&$2=="80"{to=NR+1;from=NR-5}NR!=FNR&&FNR<=to&&FNR>=from' file{,}
If there is more than one occurrence
awk 'NR==FNR&&$2=="80"{to[++x]=NR+1;from[x]=NR-5}
NR!=FNR{for(i in to)if(FNR<=to[i]&&FNR>=from[i]){print;next}}' file{,}
Input/output
Input
1
2
3
4
5
6
7
8
9
10
11
12
01 80 39 36 36 47
13
14
15
16
17
01 80 39 36 36 47
18
19
20
Output
8
9
10
11
12
01 80 39 36 36 47
13
14
15
16
17
01 80 39 36 36 47
18
How it works
NR==FNR&&$2=="80"{to[++x]=NR+1;from[x]=NR-5}
In the first file, if the second field is 80, set to and from to the record number plus or minus whatever offsets you want.
Increment the occurrence variable x.
NR!=FNR
In the second file
for(i in to)
For each occurrence
if(FNR<=to[i]&&FNR>=from[i]){print;next}
If the current record number (in this file) is between this occurrence's to and from, then print the line. The next prevents the line from being printed multiple times if occurrences of the pattern are close together.
file{,}
Use the file twice as two arguments. The brace expansion {,} makes file{,} expand to file file.
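So for the OP's log, the two-pass invocation is equivalent to writing the file name twice explicitly, e.g.:
awk 'NR==FNR&&$2=="80"{to[++x]=NR+1;from[x]=NR-5}
NR!=FNR{for(i in to)if(FNR<=to[i]&&FNR>=from[i]){print;next}}' ./logs/CpuCheck.log ./logs/CpuCheck.log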

Awk--For loop by comparing two files

I have two big files.
File 1 looks like following:
10 2864001 2864012
10 5942987 5943316
File 2 looks like following:
10 2864000 28
10 2864001 28
10 2864002 28
10 2864003 27
10 2864004 28
10 2864005 26
10 2864006 26
10 2864007 26
10 2864008 26
10 2864009 26
10 2864010 26
10 2864011 26
10 2864012 26
So I want to create a for loop in such a way that:
the first column of File 1 must match the first column of File 2, AND
the loop starts by matching the second column of File 1 with the second column of File 2, AND
the third column of File 2 is summed until the third column of File 1 matches the second column of File 2.
So the output of the above example should be the sum of the third column of File 2 for the first line of File 1, which is 347. I tried to use NR and FNR but I have not been able to do it so far. Could you please help me to generate an awk script?
Thank you so much
Transcribed, so there may be typos:
awk '
BEGIN { lastFNR=0; acount=0; FIRST="T" }
FNR < lastFNR { FIRST="F"; aindex=0; next }
FIRST=="T" {
    sta[acount] = $2
    fna[acount] = $3
    acount += 1
    lastFNR = FNR
}
FIRST=="F" && $2 >= sta[aindex] && $2 <= fna[aindex] {
    sum[aindex] += $3
    lastFNR = FNR
}
FIRST=="F" && $2 > fna[aindex] {
    aindex += 1
    if (aindex > acount) { FIRST="E" }
}
END {
    for (aindex=0; aindex<acount; aindex++) {
        print sta[aindex], "through", fna[aindex], "totals", sum[aindex]
    }
}
' file1 file2
You could try
awk -f s.awk file1 file2
where s.awk is
NR==FNR {
    a[$1,$2]=$3
    next
}
($1,$2) in a {
    do {
        s+=$3
        if ((getline)+0 < 1) break
    } while ($2 != a[$1,$2])
    print s
}
{ s=0 }
output:
319
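For reference, checking against the data shown (this check is not part of the original answer): 319 is the sum of the third column of File 2 for positions 2864001 through 2864012, i.e. 28+28+27+28 plus eight times 26 = 111+208 = 319; the figure of 347 mentioned in the question would additionally include the 28 at position 2864000.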
