I have a shell script that gives me a text file output in the following format:
OUTPUT.TXT
FirmA
58
FirmB
58
FirmC
58
FirmD
58
FirmE
58
This output is good, or a YES that my job completed as expected, since the value for all of the firms is 58.
So I count the occurrences of '58' in this text file to automatically tell, in a RESULT job, that everything worked out well.
Now there seems to be a bug due to which the output sometimes comes out like below:
OUTPUT.TXT
FirmA
58
FirmB
58
FirmC
61
FirmD
58
FirmE
61
which is impacting my count (only 3 occurrences of 58 instead of the expected 5), and hence my RESULT job states that it FAILED, or a NO.
But actually the job has worked fine as long as the value stays within 58 to 61 for each firm.
So how can I ensure that, as long as the value is >= 58 and <= 61 for each of these five firms, the job is treated as having worked as expected?
My simple one-liner to check the count in the OUTPUT.TXT file:
grep -cow 58 "OUTPUT.TXT"
Try Awk for simple jobs like this. You can learn enough in an hour to solve these problems yourself easily.
awk '(NR % 2 == 0) && ($1 < 58 || $1 > 61)' OUTPUT.TXT
This checks every second line (firm names and values alternate, so the values sit on the even lines) and prints any which are not in the range 58 to 61.
It would not be hard to extend the script to remember the string from the previous line. In fact, let's do that.
awk '(NR % 2 == 1) { firm = $0; next }
($1 < 58 || $1 > 61) { print NR ":" firm, $0 }' OUTPUT.TXT
You might also want to check how many you get of each. But let's just make a separate script for that.
awk '(NR % 2 == 0) { ++a[$1] }
END { for (k in a) print k, a[k] }' OUTPUT.TXT
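If the RESULT job just needs a single YES or NO on stdout (an assumption on my part about how your RESULT job consumes the check), the same range test can drive one final verdict:
awk '(NR % 2 == 0) && ($1 < 58 || $1 > 61) { bad = 1 }
END { print (bad ? "NO" : "YES") }' OUTPUT.TXT
This prints YES only when every firm's value lies within 58 to 61.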
The Stack Overflow awk tag info page has links to learning materials etc.
So, I have a text file separated by tabs that looks like:
contig141 hit293 2939 71 293 alksjdflksdf
contig141 hit339 9393 71 302 kljdkfjsjdfksdf
contig124 hit993 9239 55 274 laksjdfkls
contig124 hit101 9333 66 287 aslkdjfalkdjsfkjlsdf
contig124 hit205 4856 123 301 ksdjflksjdfskldjfeiedfdwe
contig132 hit003 2290 58 290 jsdfishfoisodncklsn
contig133 hit100 1889 21 107 sijhfdshfdjhsdjkdfjf
For each contig, I want to subtract the 4th column from the 5th column, and compare the differences. For the contig with the largest difference, I would like to print the entire row to a new file. I'm thinking of a nested loop, but I can't figure out how to do it.
I'm thinking:
Loop over each row of the file.
set variable a = string in first column of the first row
while: the string in the first column of the next row is equal to a,
take the difference of the 4th and 5th column
compare the differences among all rows of that contig
output the row with the greatest difference to a new file
So you would compare the differences between the 4th and 5th column for contig141, and output the line with the greatest difference. Repeat for contig124, etc. etc.
The efficient way to address this is with awk, redirecting the output to a new file. In fact, any time you start thinking "I need to do X with a field...", your first thought should be awk (it is the Swiss Army knife of text processing). You can do what you need with:
awk '{
    diff = $5 - $4
    if (!($1 in contig) || diff > contig[$1]) {
        contig[$1] = diff
        rec[$1] = $0
    }
}
END { for (i in contig) print rec[i] }' file > newfile
Above you simply calculate the difference between the 5th and 4th fields, saving it in diff. Then check whether this is the first record for the contig, or whether the difference is a new maximum for the array element of contig indexed by the first field; if so, update contig[$1] = diff, saving the max difference for that contig, and save the record (line) in rec[$1], also indexed by the first field. (The !($1 in contig) test guarantees the first record for each contig is always kept, even if its difference is zero or negative.) Then, using the END rule, you simply output the max record for each contig.
Example Use/Output
Showing what would be redirected to the new file, you have
$ awk '{
> diff = $5 - $4
> if (!($1 in contig) || diff > contig[$1]) {
> contig[$1] = diff
> rec[$1] = $0
> }
> }
> END { for (i in contig) print rec[i] }' file
contig132 hit003 2290 58 290 jsdfishfoisodncklsn
contig141 hit339 9393 71 302 kljdkfjsjdfksdf
contig133 hit100 1889 21 107 sijhfdshfdjhsdjkdfjf
contig124 hit101 9333 66 287 aslkdjfalkdjsfkjlsdf
You can pipe to sort if you need the contigs in sorted order before redirecting to the new file, e.g.
$ awk '{
diff = $5 - $4
if (!($1 in contig) || diff > contig[$1]) {
contig[$1] = diff
rec[$1] = $0
}
}
END { for (i in contig) print rec[i] }' file | sort
contig124 hit101 9333 66 287 aslkdjfalkdjsfkjlsdf
contig132 hit003 2290 58 290 jsdfishfoisodncklsn
contig133 hit100 1889 21 107 sijhfdshfdjhsdjkdfjf
contig141 hit339 9393 71 302 kljdkfjsjdfksdf
I have a big file whose entries are like this.
Input:
1113
1113456
11134567
12345
1734
123
194567
From these entries, I need to find the minimum set of prefixes that can represent all of them.
Expected output:
1113
123
1734
194567
If we have 1113 then there is no need to use 1113456 or 11134567.
Things I have tried:
I can use grep -v ^123 and compare with the input file, storing the unique results in the output file. If I use a while loop, I don't know how I can delete the entries from the input file itself.
I will assume that the input file is:
790234
790835
795023
79788
7985904
7902713
791
7987
7988
709576
749576
7902712
790856
79780
798599
791453
791454
791455
791456
791457
791458
791459
791460
You can use
awk '!(prev && $0~prev){prev = "^" $0; print}' <(sort file)
Returns
709576
749576
790234
7902712
7902713
790835
790856
791
795023
79780
79788
7985904
798599
7987
7988
How does it work? First it sorts the file using lexicographic order (1 < 10 < 2). Then it keeps the current minimal prefix and checks whether the following lines match it. If they do, they are skipped. If a line doesn't match, the minimal prefix is updated and the line is printed.
Let's say that input is
71
82
710
First it orders the lines and input becomes (lexicographic sort : 71 < 710 < 82) :
71
710
82
The first line is printed because the awk variable prev is not yet set, so the condition !(prev && $0~prev) is met; prev becomes ^71. On the next row, 710 matches the regexp ^71, so the line is skipped and prev stays ^71. On the next row, 82 does not match ^71, the condition !(prev && $0~prev) is met again, the line is printed, and prev is set to ^82.
You may use this awk command, which relies on index() instead of a regex (index($1, n) == 1 is true exactly when n is a leading prefix of $1):
awk '{
n = (n != "" && index($1, n) == 1 ? n : $1)
}
p != n {
print p = n
}' <(sort file)
1113
123
1734
194567
$ awk 'NR==1 || (index($0,n)!=1){n=$0; print}' <(sort file)
1113
123
1734
194567
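If you want to sanity-check any of these results, here is one way, assuming you saved the selected prefixes to a file called prefixes.txt (my name, not from the question): anchor each prefix and look for entries that no prefix covers.
grep -v -f <(sed 's/^/^/' prefixes.txt) file
This prints nothing when every entry in file starts with one of the chosen prefixes.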
I've checked other threads here on merging, but they seem to be mostly about merging text, which is not quite what I needed, or at least I couldn't figure out a way to connect their solutions to my own problem.
Problem
I have 10+ input files, each consisting of two columns of numbers (think of them as x,y data points for a graph). Goals:
Merge these files into 1 file for plotting
For any duplicate x values in the merge, add their respective y-values together, then print one line with x in field 1 and the added y-values in field 2.
Consider this example for 3 files:
y1.dat
25 16
27 18
y2.dat
24 10
27 9
y3.dat
24 2
29 3
According to my goals above, I should be able to merge them into one file with output:
final.dat
24 12
25 16
27 27
29 3
Attempt
So far, I have the following:
#!/bin/bash
loops=3
for i in `seq $loops`; do
if [ $i == 1 ]; then
cp -f y$i.dat final.dat
else
awk 'NR==FNR { arr[NR] = $1; p[NR] = $2; next } {
for (n in arr) {
if ($1 == arr[n]) {
print $1, p[n] + $2
n++
}
}
print $1, $2
}' final.dat y$i.dat >> final.dat
fi
done
Output:
25 16
27 18
24 10
27 27
27 9
24 12
24 2
29 3
On closer inspection, it's clear I have duplicates of the original x-values.
The problem is that my script prints all the x-values first, and only then can I add them together for my output. However, I don't know how to go back and remove the lines with the old x-values that I needed to make the addition.
If I blindly use uniq, I don't know whether the old x-values or the new x-value gets deleted. With awk '!duplicate[$1]++' the order of deletion was reversed across loop iterations, so it deletes the right lines on the first pass but the wrong ones after that.
Been at this for a long time, would appreciate any help. Thank you!
I am assuming you have already merged all the files into a single one before making the calculation. Once that's done, the script is as simple as:
awk '{ if ( $1 != "" ) { coord[$1]+=$2 } } END { for ( k in coord ) { print k " " coord[k] } }' input.txt
Hope it helps!
Edit: how does this work?
if ( $1 != "" ) { coord[$1]+=$2 }
This line gets executed for each line in your input. It first checks whether there is a value for x, and otherwise simply ignores the line; this helps to skip empty lines, should your file have any. The block that gets executed, coord[$1]+=$2, is the heart of the script: it builds a dictionary with x as the key of each entry, adding up every value of y found for that x along the way.
END { for ( k in coord ) { print k " " coord[k] } }
This block executes after awk has iterated over all the lines in your file. It simply grabs each key from the dictionary and prints it, then a space, and finally the sum of all the values that were found for that specific key.
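One caveat the above doesn't mention: for ( k in coord ) visits keys in an unspecified order in awk, so if final.dat has to be in ascending x order for plotting, pipe through sort -n, for example (assuming the shell glob y*.dat picks up all your input files):
awk '{ if ( $1 != "" ) { coord[$1]+=$2 } } END { for ( k in coord ) { print k " " coord[k] } }' y*.dat | sort -n > final.dat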
Using a Perl one-liner:
> cat y1.dat
25 16
27 18
> cat y2.dat
24 10
27 9
> cat y3.dat
24 2
29 3
> perl -lane ' $kv{$F[0]}+=$F[1]; END { print "$_ $kv{$_}" for(sort keys %kv) }' y*dat
24 12
25 16
27 27
29 3
>
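A small caveat, not in the original answer: sort keys %kv compares keys as strings, which happens to coincide with numeric order for these x values; for arbitrary x values you would want a numeric sort, e.g.
perl -lane ' $kv{$F[0]}+=$F[1]; END { print "$_ $kv{$_}" for (sort { $a <=> $b } keys %kv) }' y*dat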
Hello, I use the following awk code to split a file:
BEGIN{body=0}
!body && /^\/\/$/ {body=1}                # everything after the // line is the body
body && /^\[/ {print > "first_"FILENAME}  # bracketed lines go to the first file
body && /^pos/{$1="";print > "second_"FILENAME}
body && /^[01]+/ {print > "third_"FILENAME}
body && /^\[[0-9]+\]/ {                   # [integer] lines: also extract the number for the fourth file
print > "first_"FILENAME
print substr($0, 2, index($0,"]")-2) > "fourth_"FILENAME
}
The file looks like this:
header
//
SeqT: {"POS-s":174.683, "time":0.0130084}
SeqT: {"POS-s":431.49, "time":0.0221447}
[2.04545e+2]:0.00843832,469:0.0109533):0.00657864,((((872:0.00120503,((980:0.0001);
[29]:((962:0.000580339,930:0.000580339):0.00543993);
absolute:
gthcont: 5 4 2 1 3 4 543 5 67 657 78 67 8 5645 6
01010010101010101010101010101011111100011
1111010010010101010101010111101000100000
00000000000000011001100101010010101011111
The problem is that in the rule for the fourth file, print substr($0, 2, index($0,"]")-2) > "fourth_"FILENAME, numbers written in scientific notation with an e do not get through. It works only as long as the number is written without it. How can I change the awk so it also captures numbers like 2.7e+7?
The problem is you're trying to match E notation when your regex is looking for integers only.
Instead of:
/^\[[0-9]+\]/
use something like:
/^\[[0-9]+(\.[0-9]+(e[+-]?[0-9]+)?)?\]/
This will match positive integers, floats, and E notation wrapped in square brackets at the start of the line.
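Dropped back into your splitting script, only the pattern of the fourth rule changes (a sketch; the substr() extraction is untouched, since it already copies everything between the brackets, whatever the number format):
body && /^\[[0-9]+(\.[0-9]+(e[+-]?[0-9]+)?)?\]/ {
print > "first_"FILENAME
print substr($0, 2, index($0,"]")-2) > "fourth_"FILENAME
}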
I'm trying to write a Bash script that reads files with several columns of data, multiplies each value in the second column by the value in the third column on the same line, and adds the results of all those multiplications together.
For example if the file looked like this:
Column 1 Column 2 Column 3 Column 4
genome 1 30 500
genome 2 27 500
genome 3 83 500
...
The script should multiply 1*30 to give 30, then 2*27 to give 54 (and add that to 30), then 3*83 to give 249 (and add that to 84), etc.
I've been trying to use awk to parse the input file but am unsure of how to get the operation to proceed line by line. Right now it stops after the first line is read and the operations on the variables are performed.
Here's what I've written so far:
for file in fileone filetwo
do
set -- $(awk '/genome/ {print $2,$3}' $file.hist)
var1=$1
var2=$2
var3=$((var1*var2))
total=$((total+var3))
echo var1 \= $var1
echo var2 \= $var2
echo var3 \= $var3
echo total \= $total
done
I tried placing a "while read" loop around everything but could not get the variables to update with each line. I think I'm going about this the wrong way!
I'm very new to Linux and Bash scripting so any help would be greatly appreciated!
That's because awk reads the entire file and runs its program on each line. So the output you get from awk '/genome/ {print $2,$3}' $file.hist will look like
1 30
2 27
3 83
and so on, which means in the bash script, the set command makes the following variable assignments:
$1 = 1
$2 = 30
$3 = 2
$4 = 27
$5 = 3
$6 = 83
etc. But you only use $1 and $2 in your script, meaning that the rest of the file's contents - everything after the first line - is discarded.
Honestly, unless you're doing this just to learn how to use bash, I'd say just do it in awk. Since awk automatically runs over every line in the file, it'll be easy to multiply columns 2 and 3 and keep a running total.
awk '{ total += $2 * $3 } ENDFILE { print total; total = 0 }' fileone filetwo
Here ENDFILE is a GNU awk special pattern that means "run this next block at the end of each input file, not at each line."
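Note that ENDFILE is a gawk extension; if the script might ever run under mawk or a BSD awk, a portable sketch detects the file boundary with FNR instead:
awk 'FNR == 1 && NR > 1 { print total; total = 0 }
{ total += $2 * $3 }
END { print total }' fileone filetwo
FNR resets to 1 at the start of each input file while NR keeps counting, so the first pattern fires exactly when a new file begins, and the END rule flushes the total for the last file.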
If you are doing this for educational purposes, let me say this: the only thing you need to know about doing arithmetic in bash is that you should never do arithmetic in bash :-P Seriously though, when you want to manipulate numbers, bash is one of the least well-adapted tools for that job. But if you really want to know, I can edit this to include some information on how you could do this task primarily in bash.
I agree that awk is in general better suited for this kind of work, but if you are curious what a pure bash implementation would look like:
for f in file1 file2; do
    total=0
    # keep only the 2nd and 3rd whitespace-separated fields of each line
    while read -r _ x y _; do
        ((total += x * y))
    done < "$f"
    echo "$total"   # for the example rows this would print 333 (30 + 54 + 249)
done