Grep a pattern and ignore others - bash

I have an output file that contains lines of this form:
Auxiliary excitation energy for root 3: (variable value)
It appears a large number of times in the output, but I only want to grep the last occurrence.
I'm a beginner in bash and I don't understand the "tail" command yet...
Here is what I wrote:
for nn in 0.00000001 0.4 1.0; do
  for w in 0.0 0.001 0.01 0.025 0.05 0.075 0.1 0.125 0.15 0.175 0.2 0.225 0.25 0.275 0.3 0.325 0.35 0.375 0.4 0.425 0.45 0.475 0.5; do
    a=`grep ' Auxiliary excitation energy for root 3: ' $nn"_"$w.out`
    echo $w" "${a:47:16} >> data_$nn.dat
  done
done
With $nn and $w parameters.
But with this grep I only get the first match. How can I get only the last one?
data example :
line 1 Auxiliary excitation energy for root 3: 0.75588889
line 2 Auxiliary excitation energy for root 3: 0.74981555
line 3 Auxiliary excitation energy for root 3: 0.74891111
line 4 Auxiliary excitation energy for root 3: 0.86745155
My command greps line 1; I would like to grep the last line that contains my pattern, which is line 4 in this example.

To get the last match, you can use:
grep ... | tail -n 1
Where ... are your grep parameters. So your script would read (with a little cleanup):
for nn in 0.00000001 0.4 1.0; do
  for w in 0.0 0.001 0.01 0.025 0.05 0.075 0.1 0.125 0.15 0.175 0.2 0.225 0.25 0.275 0.3 0.325 0.35 0.375 0.4 0.425 0.45 0.475 0.5; do
    a=$( grep ' Auxiliary excitation energy for root 3: ' $nn"_"$w.out | tail -n 1 )
    echo $w" "${a:47:16} >> data_$nn.dat
  done
done
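
As an aside, on very large output files you can avoid scanning past the last match by reversing the file first; a minimal sketch, assuming GNU tac is available:
a=$( tac "${nn}_${w}.out" | grep -m 1 ' Auxiliary excitation energy for root 3: ' )
Here tac prints the file last line first, and grep -m 1 stops at the first (i.e. originally last) match.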

Related

find lowest value and index of floating point array awk, sed, sort

I have the following array:
echo $array
0.3 0.3 0.4 0.4 0.4 0.4 0.4 0.4 0.3 0.3 0.2 0.4 0.4 0.4 0.4 0.5 0.5 0.4 0.2
I have written code to sort the values and also get the index numbers:
echo $array | tr -s ' ' '\n' | awk '{print($0" "NR)}' | sort -g -k1,1
0.2 11
0.2 19
0.3 1
0.3 10
0.3 2
0.3 9
0.4 12
0.4 13
0.4 14
0.4 15
0.4 18
0.4 3
0.4 4
0.4 5
0.4 6
0.4 7
0.4 8
0.5 16
0.5 17
I am having a difficult time extracting only the rows which have the lowest value in the first column (i.e., the lowest values in the array, overall). For example, the desired final product for the above example would be:
0.2 11
0.2 19
It should handle both a single lowest-value index and multiple ones. The solution does not need to use awk, sort, or sed at all - anything would work (this is just as far as I have gotten with the task).
Print lines only while the value in the first column stays the same as on the first line:
echo $array | tr -s ' ' '\n' | awk '{print($0" "NR)}' | sort -g -k1,1 |
awk 'length(last) == 0 || last == $1 { last=$1; print; }'
Notes:
It's best to always quote variable expansions: echo "$array".
If you don't quote $array, you could just use printf "%s\n" $array.
You could use nl to number the lines (but the column order would be different).
Using the asort() function in awk (GNU awk):
awk '{split($0,a); for (i in a) a[i]=a[i]" "i; n=asort(a); for (i = 1; i <= 2; i++) print a[i]} '
Demo:
$ echo $array
0.3 0.3 0.4 0.4 0.4 0.4 0.4 0.4 0.3 0.3 0.2 0.4 0.4 0.4 0.4 0.5 0.5 0.4 0.2
$ echo $array | awk '{split($0,a); for (i in a) a[i]=a[i]" "i; n=asort(a); for (i = 1; i <= 2; i++) print a[i]} '
0.2 11
0.2 19
$
Explanation:
{split($0,a); -- Initialize array a from the input record
for (i in a) a[i]=a[i]" "i; -- Append each element's position to its value
n=asort(a); -- Call the sort function and store the number of elements in variable n
for (i = 1; i <= 2; i++) -- Loop over the first 2 elements of the array
print a[i]}
Documentation on asort()
P.S.: storing the number of elements in n was not actually required.
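
For reference, a single-pass alternative that neither sorts nor assumes exactly two minima; a sketch, assuming as above that the array sits on one space-separated line:
echo $array | tr -s ' ' '\n' |
awk 'NR == 1 || $1 < min { min = $1; out = $1 " " NR; next }   # new minimum: start the result over
     $1 == min { out = out "\n" $1 " " NR }                    # tie with the minimum: append "value index"
     END { print out }'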

Passing variable grep command

A given text file (say foo*.txt) contains data as follows:
1 g = 0.54 0.00
2 g = 0.32 0.00
3 g = 0.45 0.00
...
5000 g = 0.5 0.00
Basically, I want to extract 10 lines before and after the matching line (including the matching line). The matching line contains 59 characters that contain strings, spaces and numbers.
I have a script as follow:
#!/usr/bin/bash
for file in file*.txt; do
    var=$(command_to_extract_var) # 59 characters containing strings, spaces and numbers
    # to get this var, I use grep and head
    grep -C 10 "$var" "$file"
done > bar.csv
Running the script with bash -x script_name.sh gives the following:
+ for file in 'foo*.txt'
++ grep 'match_pattern' foo1.txt
++ awk '{print $6}'
++ head -n1
++ grep '[0-9]'
+ basis=150
++ grep 'match_pattern' foo1.txt
++ tail -n1
++ awk '{print $3}'
+ number=25
++ grep '[0-9] f = ' foo.txt
++ tail -n150
This is followed by a number of lines (even up to 1000) like
001 h = 0.000000000000000E+00 e = 3.543218084205956E+00
Finally,
File name too long
+ final=
+ grep -C 10 '' foo1.txt
The output I expect is (one column from each file):
0.54 0.62 0.36 ... 0.45
0.32 3.25 0.89 ... 0.25
0.45 0.96 0.14 ... 0.14
... .... .... ... 0.96
0.25 0.00 7.23 ... 0.77
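
The bash -x trace shows the variable ends up empty (final=), so the final grep runs with an empty pattern; without seeing command_to_extract_var, only a sketch of the intended extraction is possible. Assuming it can be made to print exactly the one 59-character line, quoting it and treating it as a fixed string avoids both word splitting and regex surprises:
#!/usr/bin/bash
# Sketch only: command_to_extract_var is the (unshown) command from the question
# and must print exactly one line; -F makes grep treat the 59 characters literally,
# and -- guards against a pattern that happens to start with a dash.
for file in foo*.txt; do
    var=$(command_to_extract_var)
    grep -F -C 10 -- "$var" "$file"
done > bar.csv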

Wildcard symbol with grep -F

I have the following file
0 0
0 0.001
0 0.032
0 0.1241
0 0.2241
0 0.42
0.0142 0
0.0234 0
0.01429 0.01282
0.001 0.224
0.098 0.367
0.129 0
0.123 0.01282
0.149 0.16
0.1345 0.216
0.293 0
0.2439 0.01316
0.2549 0.1316
0.2354 0.5
0.3345 0
0.3456 0.0116
0.3462 0.316
0.3632 0.416
0.429 0
0.42439 0.016
0.4234 0.3
0.5 0
0.5 0.33
0.5 0.5
Notice that the two columns are sorted ascending, first by the first column and then by the second one. The minimum value is 0 and the maximum is 0.5.
I would like to count the number of lines that are:
0 0
and store that number in a file called "0_0". In this case, this file should contain "1".
Then, the same for those that are:
0 0.0*
For example,
0 0.032
And call it "0_0.0" (it should contain "2"), and do the same for all combinations, considering only the first decimal digit (0 0.1*, 0 0.2* ... 0.0* 0, 0.0* 0.0* ... 0.5 0.5).
I am using this loop:
for i in 0 0.0 0.1 0.2 0.3 0.4 0.5
do
  for j in 0 0.0 0.1 0.2 0.3 0.4 0.5
  do
    grep -F ""$i" "$j"" file | wc -l > "$i"_"$j"
  done
done
rm 0_0 #this 0_0 output is badly done, the good way is with the next command, which accepts \n
pcregrep -M "0 0\n" file | wc -l > 0_0
The problem is that, for example, the line
0.0142 0
will not be matched in the iteration "0.0 0", since there are digits after the "0.0". Removing the -F option from grep in order to match all numbers that start with "0.0" will not work either, since the dot would then be treated as a wildcard; for example, in the iteration "0.1 0" the line
0.0142 0
would be counted, because 0.0142 matches 0<anything>1.
I hope I am making myself clear!
Is there any way to include a wildcard symbol with grep -F, like in:
for i in 0 0.0 0.1 0.2 0.3 0.4 0.5
do
  for j in 0 0.0 0.1 0.2 0.3 0.4 0.5
  do
    grep -F ""$i"* "$j"*" file | wc -l > "$i"_"$j"
  done
done
(Please notice the asterisks after the variables in the grep command).
Thank you!
Don't use shell loops just to manipulate text; that's what the guys who invented shell also invented awk to do. See why-is-using-a-shell-loop-to-process-text-considered-bad-practice.
It sounds like all you need is:
awk '{cnt[substr($1,1,3)"_"substr($2,1,3)]++} END{ for (pair in cnt) {print cnt[pair] > pair; close(pair)} }' file
That will be vastly more efficient than your nested shell loops approach.
Here are the file names it will create and the count each will contain (shown with a modified version that prints the mapping to stdout instead):
$ awk '{cnt[substr($1,1,3)"_"substr($2,1,3)]++} END{for (pair in cnt) print pair "\t" cnt[pair]}' file
0.0_0.3 1
0_0.4 1
0.5_0 1
0.2_0.5 1
0.4_0.3 1
0.0_0 2
0.1_0.0 1
0.3_0 1
0.1_0.1 1
0.1_0.2 1
0.3_0.0 1
0_0 1
0.1_0 1
0.5_0.3 1
0.4_0 1
0.3_0.3 1
0.2_0.0 1
0_0.0 2
0.5_0.5 1
0.3_0.4 1
0.2_0.1 1
0.0_0.0 1
0_0.1 1
0_0.2 1
0.4_0.0 1
0.2_0 1
0.0_0.2 1
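
To answer the literal question as well: grep -F has no wildcards, but plain grep (BRE) works in the nested loops if the dots are escaped so they stay literal; a sketch, far slower than the awk one-liner above and assuming each line has exactly two space-separated fields:
for i in 0 0.0 0.1 0.2 0.3 0.4 0.5; do
  for j in 0 0.0 0.1 0.2 0.3 0.4 0.5; do
    # Escape the dots so grep matches them literally rather than as "any character";
    # the anchors plus [0-9]* allow any further decimal digits in each field.
    pi=$(printf '%s\n' "$i" | sed 's/\./\\./g')
    pj=$(printf '%s\n' "$j" | sed 's/\./\\./g')
    grep -c "^${pi}[0-9]* ${pj}[0-9]*\$" file > "${i}_${j}"
  done
done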

awk condition always TRUE in a loop [duplicate]

This question already has answers here:
How do I use shell variables in an awk script?
(7 answers)
Closed 7 years ago.
Good morning,
I'm sorry this question will seem trivial to some. It has been driving me mad for hours. My problem is the following:
I have these two files:
head <input file>
SNP CHR BP A1 A2 OR P
chr1:751343 1 751343 A T 0.85 0.01
chr1:751756 1 751756 T C 1.17 0.01
rs3094315 1 752566 A G 1.14 0.0093
rs3131972 1 752721 A G 0.88 0.009
rs3131971 1 752894 T C 0.87 0.01
chr1:753405 1 753405 A C 1.17 0.01
chr1:753425 1 753425 T C 0.87 0.0097
rs2073814 1 753474 G C 1.14 0.009
rs2073813 1 753541 A G 0.85 0.0095
and
head <interval file>
1 112667912 114334946
1 116220516 117220516
1 160997252 161997252
1 198231312 199231314
2 60408994 61408994
2 64868452 65868452
2 99649474 100719272
2 190599907 191599907
2 203245673 204245673
2 203374196 204374196
I would like to use a bash script to remove all lines from the input file in which the BP column lies within an interval specified in the interval file and in which the CHR column matches the first column of the interval file.
Here is the code I've been working with (although a simpler solution would be welcomed):
while read interval; do
    chr=$(echo $interval | awk '{print $1}')
    START=$(echo $interval | awk '{print $2}')
    STOP=$(echo $interval | awk '{print $3}')
    awk '$2!=$chr {print} $2==$chr && ($3<$START || $3>$STOP) {print}' < input_file > tmp
    mv tmp <input file>
done <
My problem is that no lines are removed from the input file, even though the command
awk '$2==1 && ($3>112667912 && $3<114334946) {print}' < input_file | wc -l
returns >4000 lines, so the lines clearly are in the input file.
Thank you very much for your help.
You can try perl instead of awk. The reason is that in perl you can build a hash of arrays to hold the interval file's data and look it up more easily while processing your input, like:
perl -lane '
    $. == 1 && next;
    @F == 3 && do {
        push @{$h{$F[0]}}, [@F[1..2]];
        next;
    };
    @F == 7 && do {
        $ok = 1;
        if (exists $h{$F[1]}) {
            for (@{$h{$F[1]}}) {
                if ($F[2] > $_->[0] and $F[2] < $_->[1]) {
                    $ok = 0;
                    last;
                }
            }
        }
        printf qq|%s\n|, $_ if $ok;
    };
' interval input
$. == 1 skips the header of the interval file. @F checks the number of columns, and the push builds the hash of arrays.
Your test data is not ideal because no line would be filtered out, so I changed it to:
SNP CHR BP A1 A2 OR P
chr1:751343 1 751343 A T 0.85 0.01
chr1:751756 1 112667922 T C 1.17 0.01
rs3094315 1 752566 A G 1.14 0.0093
rs3131972 1 752721 A G 0.88 0.009
rs3131971 1 752894 T C 0.87 0.01
chr1:753405 2 753405 A C 1.17 0.01
chr1:753425 1 753425 T C 0.87 0.0097
rs2073814 1 199231312 G C 1.14 0.009
rs2073813 2 204245670 A G 0.85 0.0095
Running it then yields:
SNP CHR BP A1 A2 OR P
chr1:751343 1 751343 A T 0.85 0.01
rs3094315 1 752566 A G 1.14 0.0093
rs3131972 1 752721 A G 0.88 0.009
rs3131971 1 752894 T C 0.87 0.01
chr1:753405 2 753405 A C 1.17 0.01
chr1:753425 1 753425 T C 0.87 0.0097
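
For completeness, the reason the original awk condition is always true: inside the single quotes the shell never expands $chr, $START or $STOP, so awk sees uninitialized variables; $chr therefore means field number 0, i.e. the whole record, and $2 != $0 holds for every data line, so every line is printed. A sketch of the minimal fix within the asker's own loop, passing the shell values in with -v (the interval file name is assumed here, since it is cut off in the question):
while read -r chr START STOP; do
    # -v hands the shell values to awk as awk variables; the default action prints
    # lines that are on another chromosome or whose BP lies outside the interval.
    awk -v chr="$chr" -v start="$START" -v stop="$STOP" \
        '$2 != chr || $3 < start || $3 > stop' input_file > tmp && mv tmp input_file
done < interval_file    # hypothetical file name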

Matching numbers in two different files using awk

I have two files (f1 and f2), both made of three columns, of different lengths. I would like to create a new file of four columns in the following way:
f1         f2
1 2 0.2    1 4 0.3
1 3 0.5    1 5 0.2
1 4 0.2    2 3 0.6
2 2 0.5
2 3 0.9
If the numbers in the first two columns are present in both files, then print those two numbers followed by the third number from each file (e.g. both contain "1 4", so f3 should contain "1 4 0.2 0.3"); otherwise, if the first two numbers are missing from f2, just print a zero in the fourth column.
The complete results of these example should be
f3
1 2 0.2 0
1 3 0.5 0
1 4 0.2 0.3
2 2 0.5 0
2 3 0.9 0.6
The script that I wrote is the following:
awk '{str1=$1; str2=$2; str3=$3;
      getline < "f2";
      if ($1==str1 && $2==str2)
          print str1,str2,str3,$3 > "f3";
      else
          print str1,str2,str3,0 > "f3";
     }' f1
but it only checks whether the same two numbers appear on the same row (it does not go through the whole f2 file), giving as result:
1 2 0.2 0
1 3 0.5 0
1 4 0.2 0
2 2 0.5 0
2 3 0.9 0
This awk should work:
awk 'FNR==NR{a[$1,$2]=$3;next} {print $0, (a[$1,$2])? a[$1,$2]:0}' f2 f1
1 2 0.2 0
1 3 0.5 0
1 4 0.2 0.3
2 2 0.5 0
2 3 0.9 0.6
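
The same idea spelled out with comments, in case the idiom is unfamiliar (a sketch; the output file name f3 is taken from the question, and the ($1,$2) in a test additionally handles a stored value of 0 correctly):
awk '
    FNR == NR { a[$1,$2] = $3; next }             # first file (f2): remember column 3 keyed by columns 1 and 2
    { print $0, (($1,$2) in a) ? a[$1,$2] : 0 }   # second file (f1): append the stored value, or 0 if absent
' f2 f1 > f3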
