Passing a variable to a grep command - bash

A given text file (say one matching foo*.txt) contains data as follows:
1 g = 0.54 0.00
2 g = 0.32 0.00
3 g = 0.45 0.00
...
5000 g = 0.5 0.00
Basically, I want to extract the 10 lines before and after each matching line (including the matching line itself). The matching line is 59 characters long, containing letters, spaces and numbers.
I have a script as follow:
#!/usr/bin/bash
for file in foo*.txt;
do
var=$(command_to_extract_var) # 59 characters containing strings, spaces and numbers
# to get this var, I use grep and head
grep -C 10 "$var" "$file"
done > bar.csv
Running the script with bash -x script_name.sh gives the following:
+ for file in 'foo*.txt'
++ grep 'match_pattern' foo1.txt
++ awk '{print $6}'
++ head -n1
++ grep '[0-9]'
+ basis=150
++ grep 'match_pattern' foo1.txt
++ tail -n1
++ awk '{print $3}'
+ number=25
++ grep '[0-9] f = ' foo.txt
++ tail -n150
This is followed by a number of lines (sometimes up to 1000) like:
001 h = 0.000000000000000E+00 e = 3.543218084205956E+00
Finally,
File name too long
+ final=
+ grep -C 10 '' foo1.txt
The output I expect is (one column from each file):
0.54 0.62 0.36 ... 0.45
0.32 3.25 0.89 ... 0.25
0.45 0.96 0.14 ... 0.14
... .... .... ... 0.96
0.25 0.00 7.23 ... 0.77
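A likely fix, sketched under the assumption that the extracted pattern should be matched literally (command_to_extract_var stands in for the asker's grep/awk/head pipeline): quote the variable, use grep -F so the spaces and any regex metacharacters in the 59-character string cannot misfire, and skip the grep when the pattern comes back empty, since grep -C 10 '' matches every line, which is exactly what the trace shows once final= is empty:
#!/usr/bin/bash
for file in foo*.txt; do
    var=$(command_to_extract_var "$file")  # placeholder for the real pipeline
    if [ -n "$var" ]; then                 # guard: an empty pattern matches everything
        grep -F -C 10 -e "$var" "$file"    # -F treats the pattern as a fixed string
    fi
done > bar.csv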

How to merge two tab-separated files and predefine formatting of missing values?

I am trying to merge two unsorted tab separated files by a column of partially overlapping identifiers (gene#) with the option of predefining missing values and keeping the order of the first table.
When using paste on my two example tables, missing values end up as empty space.
cat file1
c3 100 300 gene4
c1 300 400 gene1
c13 600 700 gene2
cat file2
gene1 4.2 0.001
gene4 1.05 0.5
paste file1 file2
c3 100 300 gene4 gene1 4.2 0.001
c1 300 400 gene1 gene4 1.05 0.5
c13 600 700 gene2
As you can see, the result unsurprisingly leaves unmatched lines incomplete. Is there a way to keep the order of file1 and fill in lines like the third as follows:
c3 100 300 gene4 gene4 1.05 0.5
c1 300 400 gene1 gene1 4.2 0.001
c13 600 700 gene2 NA 1 1
I assume one way could be to build an awk conditional construct. It would be great if you could point me in the right direction.
With awk please try the following:
awk 'FNR==NR {a[$1]=$1; b[$1]=$2; c[$1]=$3; next}
{if (!a[$4]) {a[$4]="N/A"; b[$4]=1; c[$4]=1}
printf "%s %s %s %s\n", $0, a[$4], b[$4], c[$4]}
' file2 file1
which yields:
c3 100 300 gene4 gene4 1.05 0.5
c1 300 400 gene1 gene1 4.2 0.001
c13 600 700 gene2 N/A 1 1
[Explanations]
In the 1st line, FNR==NR { command; next} is an idiom to execute the command only when reading the 1st file in the argument list ("file2" in this case). Then it creates maps (aka associative arrays) to associate values in "file2" to genes
as:
gene1 => gene1 (with array a)
gene1 => 4.2 (with array b)
gene1 => 0.001 (with array c)
gene4 => gene4 (with array a)
gene4 => 1.05 (with array b)
gene4 => 0.5 (with array c)
"file2" does not need to be sorted.
The following lines are executed only when reading the 2nd file ("file1") because these lines are skipped when reading the 1st file due to the next statement.
The line {if (!a[$4]) ... is a fallback that assigns default values when the associative array a[gene] is undefined (meaning the gene is not found in "file2").
The final line prints the contents of "file1" followed by the associated values via the gene.
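One caveat worth noting (an edge case, not triggered by the sample data): !a[$4] is also true when the stored value is an empty string or 0, because awk treats those as false. Testing key existence with the in operator avoids that:
awk 'FNR==NR {a[$1]=$1; b[$1]=$2; c[$1]=$3; next}
{if (!($4 in a)) {a[$4]="N/A"; b[$4]=1; c[$4]=1}
printf "%s %s %s %s\n", $0, a[$4], b[$4], c[$4]}
' file2 file1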
You can use join:
join -e NA -o '1.1 1.2 1.3 1.4 1.5 2.1 2.2 2.3' -a 1 -1 5 -2 1 <(nl -w1 -s ' ' file1 | sort -k 5) <(sort -k 1 file2) | sed 's/NA\sNA$/1 1/' | sort -n | cut -d ' ' -f 2-
-e NA — replace all missing values with NA
-o ... — output format (field is specified using <file>.<field>)
-a 1 — Keep every line from the left file
-1 5, -2 1 — Fields used to join the files
file1, file2 — The files
nl -w1 -s ' ' file1 — file1 with numbered lines
<(sort -k X fileN) — File N ready to be joined on column X
s/NA\sNA$/1 1/ — Replace NA NA at the end of a line with 1 1
| sort -n | cut -d ' ' -f 2- — restore the original line order and remove the line-number column
The example above uses spaces on output. To use tabs, append | tr ' ' '\t':
join -e NA -o '1.1 1.2 1.3 1.4 2.1 2.2 2.3' -a 1 -1 4 -2 1 file1 file2 | sed 's/NA\sNA$/1 1/' | tr ' ' '\t'
In the paste output, the incomplete lines end with a TAB character. Fix this with:
paste file1 file2 | sed 's/\t$/\tNA\t1\t1/g'
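Applied to the sample files, this should produce (expected output; note that paste pairs lines by position rather than by gene, so this variant only fills in the missing fields):
c3 100 300 gene4 gene1 4.2 0.001
c1 300 400 gene1 gene4 1.05 0.5
c13 600 700 gene2 NA 1 1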

Manipulation of data with BASH

I have a file full of lines like the following:
8408292236 48.04 46n20.13 12e43.22 1.00 3.1 MCSTI 0.4897 108 0.10 0.10 20 41 84EV01978
The first two digits in the first column represent the year, and I need to substitute them with the full year while retaining the rest of the line. I tried various methods with awk and sed but couldn't get them working.
My latest attempt is the following:
while read line
a=20
b=19
do
awk '{ if ("$1 | cut -c1-2" == "0"); then {print $a$1}; }' > test.txt
done < catalog.txt
The final output should be:
198408292236 48.04 46n20.13 12e43.22 1.00 3.1 MCSTI 0.4897 108 0.10 0.10 20 41 84EV01978
Any ideas on how I can do this? Any help would be greatly appreciated.
You have a conceptual problem: how would you know whether 15 refers to 2015 or 1915? Otherwise it's quite easy:
#!/bin/bash
for num in 8408292236 8508292236 0408292236 1508292236
do
prefix=${num:0:2} # get the first two digits
if [ $prefix -ge 20 ]; then # assuming 20-99 refer to the 1900s and 00-19 to the 2000s
echo "19$num"
else
echo "20$num"
fi
done
This will prefix 19 if the first field starts with 16 or more, and will prefix 20 otherwise. I think this is what you need.
awk ' { if ($1 > 1600000000 ) print "19" $0 ; else print "20" $0 ; }' catalog.txt
A solution using just bash:
while read A; do
PREFIX=20
if [ ${A:0:2} -gt 15 ] ; then
PREFIX=19
fi
echo "${PREFIX}${A}"
done
sed -e 's/^[0-3]/20\0/' -e 's/^[4-9]/19\0/'
This will prepend 20 to each line starting with 0 to 3 and 19 to each line starting with 4 to 9, so 40-99 are taken as 19xx and 00-39 as 20xx.
Adapt to your needs.
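Note that \0 in the replacement is an extension (a synonym for the whole match) in some sed implementations; the portable form is &. A portable usage sketch on the catalog file:
sed -e 's/^[0-3]/20&/' -e 's/^[4-9]/19&/' catalog.txt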
Use awk ternary operator.
$ cat catalog.txt
8408292236 48.04 46n20.13 12e43.22 1.00 3.1 MCSTI 0.4897 108 0.10 0.10 20 41 84EV01978
0408292236 48.04 46n20.13 12e43.22 1.00 3.1 MCSTI 0.4897 108 0.10 0.10 20 41 84EV01978
$ awk '$1=substr($1,1,2)<15 ? "20"$1 : "19"$1' catalog.txt
198408292236 48.04 46n20.13 12e43.22 1.00 3.1 MCSTI 0.4897 108 0.10 0.10 20 41 84EV01978
200408292236 48.04 46n20.13 12e43.22 1.00 3.1 MCSTI 0.4897 108 0.10 0.10 20 41 84EV01978
Explanation
Let’s use the ternary operator.
expr ? action1 : action2
It's pretty straightforward: if expr is true, then action1 is performed/evaluated; if not, action2.
Field one must be prefixed with 19 or 20 depending on the value of its first two characters, substr($1,1,2).
NOTE: This only works if your data does not contain years below 1915.
At this point, we just need to change the first field $1 to our needs: $1=substr($1,1,2)<15 ? "20"$1 : "19"$1
Check: http://klashxx.github.io/ternary-operator/
Simply in bash (10# forces base 10 so that a prefix like 08 is not parsed as invalid octal inside $((...))):
while read i; do echo "$((10#${i:0:2}<20?20:19))$i"; done <file.txt
shorter in perl:
perl -pe'print substr($_,0,2)<20?20:19' file.txt
even shorter in sed:
sed 's/^[01]/20&/;t;s/^/19/' file.txt

awk condition always TRUE in a loop [duplicate]

This question already has answers here:
How do I use shell variables in an awk script?
(7 answers)
Closed 7 years ago.
Good morning,
I'm sorry this question will seem trivial to some. It has been driving me mad for hours. My problem is the following:
I have these two files:
head <input file>
SNP CHR BP A1 A2 OR P
chr1:751343 1 751343 A T 0.85 0.01
chr1:751756 1 751756 T C 1.17 0.01
rs3094315 1 752566 A G 1.14 0.0093
rs3131972 1 752721 A G 0.88 0.009
rs3131971 1 752894 T C 0.87 0.01
chr1:753405 1 753405 A C 1.17 0.01
chr1:753425 1 753425 T C 0.87 0.0097
rs2073814 1 753474 G C 1.14 0.009
rs2073813 1 753541 A G 0.85 0.0095
and
head <interval file>
1 112667912 114334946
1 116220516 117220516
1 160997252 161997252
1 198231312 199231314
2 60408994 61408994
2 64868452 65868452
2 99649474 100719272
2 190599907 191599907
2 203245673 204245673
2 203374196 204374196
I would like to use a bash script to remove from the input file all lines in which the BP column lies within an interval specified in the interval file and whose CHR column matches the first column of the interval file.
Here is the code I've been working with (although a simpler solution would be welcomed):
while read interval; do
chr=$(echo $interval | awk '{print $1}')
START=$(echo $interval | awk '{print $2}')
STOP=$(echo $interval | awk '{print $3}')
awk '$2!=$chr {print} $2==$chr && ($3<$START || $3>$STOP) {print}' < input_file > tmp
mv tmp <input file>
done < <interval file>
My problem is that no lines are removed from the input file. Even if the command
awk '$2==1 && ($3>112667912 && $3<114334946) {print}' < input_file | wc -l
returns >4000 lines, so the lines clearly are in the input file.
Thank you very much for your help.
You can try with perl instead of awk. The reason is that in perl you can create a hash of arrays to save the data of the interval file, and extract it more easily when processing your input, like:
perl -lane '
$. == 1 && next;
@F == 3 && do {
push @{$h{$F[0]}}, [@F[1..2]];
next;
};
@F == 7 && do {
$ok = 1;
if (exists $h{$F[1]}) {
for (@{$h{$F[1]}}) {
if ($F[2] > $_->[0] and $F[2] < $_->[1]) {
$ok = 0;
last;
}
}
}
printf qq|%s\n|, $_ if $ok;
};
' interval input
$. == 1 && next skips the first line read; @F checks the number of columns, and the push creates the hash of arrays.
Your test data is not ideal because no line would be filtered out, so I changed it to:
SNP CHR BP A1 A2 OR P
chr1:751343 1 751343 A T 0.85 0.01
chr1:751756 1 112667922 T C 1.17 0.01
rs3094315 1 752566 A G 1.14 0.0093
rs3131972 1 752721 A G 0.88 0.009
rs3131971 1 752894 T C 0.87 0.01
chr1:753405 2 753405 A C 1.17 0.01
chr1:753425 1 753425 T C 0.87 0.0097
rs2073814 1 199231312 G C 1.14 0.009
rs2073813 2 204245670 A G 0.85 0.0095
So you can run it and get as result:
SNP CHR BP A1 A2 OR P
chr1:751343 1 751343 A T 0.85 0.01
rs3094315 1 752566 A G 1.14 0.0093
rs3131972 1 752721 A G 0.88 0.009
rs3131971 1 752894 T C 0.87 0.01
chr1:753405 2 753405 A C 1.17 0.01
chr1:753425 1 753425 T C 0.87 0.0097
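For completeness, the shell loop in the question can be repaired directly. Inside single quotes, awk never sees the shell variables: $chr is parsed as awk's field $chr, and since the awk variable chr is unset (numerically 0) that means $0, so $2!=$chr compares field 2 against the whole line and is almost always true. Passing the values in with -v, as the linked duplicate explains, makes the condition real. A sketch, untested, with the file names as placeholders:
while read -r chr start stop; do
    # keep the header, plus every line on another chromosome or outside the interval
    awk -v chr="$chr" -v start="$start" -v stop="$stop" \
        'NR==1 || $2 != chr || $3 < start || $3 > stop' input_file > tmp
    mv tmp input_file
done < interval_file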

Add different value to each column in array

How can I add a different value to each column in a bash script?
Example: three functions f1(x), f2(x), f3(x) tabulated over x.
test.dat:
# x f1 f2 f3
1 0.1 0.01 0.001
2 0.2 0.02 0.002
3 0.3 0.03 0.003
Now I want to add a different offset value to each function:
values = 1 2 3
Desired result:
# x f1 f2 f3
1 1.1 2.01 3.001
2 1.2 2.02 3.002
3 1.3 2.03 3.003
So the first column should be unaffected; otherwise the offset is added.
I tried this, but it doesn't work:
declare -a energy_array=( 1 2 3 )
for (( i=0 ; i < ${#energy_array[@]} ; i++ ))
do
local energy=${energy_array[${i}]}
cat "test.dat" \
| awk -v "offset=${energy}" \
'{ for(j=2; j<NF;j++) printf "%s",$j+offset OFS; if (NF) printf "%s",$NF; printf ORS} '
done
You can try the following:
declare -a energy_array=( 1 2 3 )
awk -v offset="${energy_array[*]}" \
'BEGIN { n = split(offset, a) }
NR > 1 {
    for (j = 2; j <= NF; j++)
        $j = $j + a[j-1]
    print; next
}1' test.dat
With output:
# x f1 f2 f3
1 1.1 2.01 3.001
2 1.2 2.02 3.002
3 1.3 2.03 3.003
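The key step is "${energy_array[*]}", which flattens the array into the single string "1 2 3"; split() inside awk then rebuilds it as a[1]=1, a[2]=2, a[3]=3, so column j receives offset a[j-1]. A quick sanity check of that round trip (a sketch):
awk -v offset="1 2 3" 'BEGIN { n=split(offset,a); for (i=1; i<=n; i++) print i, a[i] }'
which prints 1 1, 2 2 and 3 3.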

The simplest way to join 2 files using bash so that both of their keys appear in the result

I have 2 input files
file1
A 0.01
B 0.09
D 0.05
F 0.08
file2
A 0.03
C 0.01
D 0.04
E 0.09
The output I want is
A 0.01 0.03
B 0.09 NULL
C NULL 0.01
D 0.05 0.04
E NULL 0.09
F 0.08 NULL
The best that I can do is
join -t' ' -a 1 -a 2 -1 1 -2 1 -o 1.1,1.2,2.2 file1 file2
which doesn't give me what I want
You can write:
join -t $'\t' -a 1 -a 2 -1 1 -2 1 -e NULL -o 0,1.2,2.2 file1 file2
where I've made these changes:
In the output format, I changed 1.1 ("first column of file #1") to 0 ("join field"), so that values from file #2 can show up in the first field when necessary. (Specifically, so that C and E will.)
I added the -e option to specify a value (NULL) for missing/empty fields.
I used $'\t', which Bash converts to a tab, instead of typing an actual tab. I find this easier to use than a tab in the middle of the command. But if you disagree, and the actual tab is working for you, then by all means, you can keep using it. :-)
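One caveat: join expects both inputs to be sorted on the join field. The sample files here already are; for unsorted data, sort them on the fly (a sketch):
join -t $'\t' -a 1 -a 2 -e NULL -o 0,1.2,2.2 <(sort -k1,1 file1) <(sort -k1,1 file2)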
