using awk to print header name and a substring - bash

I am trying to use this code to print a header with the gene name and then pull a substring based on its location, but it doesn't work:
>output_file
cat input_file | while read row; do
echo $row > temp
geneName=`awk '{print $1}' tmp`
startPos=`awk '{print $2}' tmp`
endPOs=`awk '{print $3}' tmp`
for i in temp; do
echo ">${geneName}" >> genes_fasta ;
echo "awk '{val=substr($0,${startPos},${endPOs});print val}' fasta" >> genes_fasta
done
done
input_file
nad5_exon1 250405 250551
nad5_exon2 251490 251884
nad5_exon3 195620 195641
nad5_exon4 154254 155469
nad5_exon5 156319 156548
fasta
atgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgc............
and this is my incorrect output file:
>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta
>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta
>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta
>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta
>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta
>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta
The output should look like this:
>name1
atgcatgcatgcatgcatgcat
>name2
tgcatgcatgcatgcat
>name3
gcatgcatgcatgcatgcat
>namen....

You can do this with a single call to awk, which will be orders of magnitude more efficient than looping in a shell script and calling awk four times per iteration. Since you have bash, you can use command substitution to pass the contents of fasta to an awk variable and then output the heading followed by the substring spanning the beginning through ending characters of your fasta file.
For example:
awk -v fasta="$(<fasta)" '{print ">" $1; print substr(fasta,$2,$3-$2+1)}' input
or using getline within the BEGIN rule:
awk 'BEGIN{getline fasta<"fasta"}
{print ">" $1; print substr(fasta,$2,$3-$2+1)}' input
Example Input Files
Note: the beginning and ending values have been reduced to fit within the 129 characters of your example:
$ cat input
rad5_exon1 1 17
rad5_exon2 23 51
rad5_exon3 110 127
rad5_exon4 38 62
rad5_exon5 59 79
and the first 129 characters of your example fasta:
$ cat fasta
atgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgc
Example Use/Output
$ awk -v fasta="$(<fasta)" '{print ">" $1; print substr(fasta,$2,$3-$2+1)}' input
>rad5_exon1
atgcatgcatgcatgca
>rad5_exon2
gcatgcatgcatgcatgcatgcatgcatg
>rad5_exon3
tgcatgcatgcatgcatg
>rad5_exon4
tgcatgcatgcatgcatgcatgcat
>rad5_exon5
gcatgcatgcatgcatgcatg
Look things over and let me know if I understood your question's requirements. Also let me know if you have further questions on the solution.

If I'm understanding correctly, how about:
awk 'NR==FNR {fasta = fasta $0; next}
{
printf(">%s %s\n", $1, substr(fasta, $2, $3 - $2 + 1))
}' fasta input_file > genes_fasta
It first reads the fasta file and stores the sequence in the variable fasta.
Then it reads input_file line by line and extracts the substring of fasta starting at $2 with length $3 - $2 + 1. (Note that the 3rd argument to the substr function is a length, not an end position.)
Hope this helps.
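To see the length semantics in isolation, here is a toy one-liner (the string is made up, not from the question's data):

```shell
# substr(s, start, length): start is 1-indexed and the third argument
# is a length, not an end position
echo "abcdefghij" | awk '{print substr($0, 3, 4)}'
# → cdef   (characters 3 through 6)
```

So to extract an inclusive range from startPos to endPos, the length must be endPos - startPos + 1.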

Made it work!
This is the script for pulling substrings from a fasta file:
cat genes_and_bounderies1 | while read row; do
echo "$row" > temp
geneName=$(awk '{print $1}' temp)
startPos=$(awk '{print $2}' temp)
endPos=$(awk '{print $3}' temp)
length=$((endPos - startPos + 1))  # +1 so the end position is included
echo ">${geneName}" >> genes_fasta
awk -v S="$startPos" -v L="$length" '{print substr($0,S,L)}' unwraped_${fasta} >> genes_fasta
done
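For comparison, the single-pass awk approach from the answers does the same job without a temp file; a self-contained sketch with made-up file names (bounds.txt, seq.txt) and a tiny inline sample:

```shell
# Synthetic stand-ins for the coordinates file and the one-line sequence file
printf 'g1 1 4\ng2 5 8\n' > bounds.txt
printf 'atgcatgcatgc\n' > seq.txt

# First file stores the sequence; second file drives header + inclusive range
awk 'NR==FNR {seq = seq $0; next}
     {print ">" $1; print substr(seq, $2, $3 - $2 + 1)}' seq.txt bounds.txt
# → >g1
#   atgc
#   >g2
#   atgc
```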

Related

Compare two files (file1 & file2) and add one column from file2 to file1 if the first columns of the two files match

I have two files (file1 and file2)
file1
ABC=14.2.0.7.SAMPLE=git.xyz/plugins/gitiles/+/refs/heads/clientpatch/abc/patch142007
DEF=14.3.0.5.SAMPLE=git.xyz/plugins/gitiles/+/refs/heads/clientpatch/def/patch143005
DEF=14.3.0.5.SAMPLE2=git.calypso/plugins/gitiles/+/refs/heads/clientpatch/def/patch14300-calib
HIJ=12.0.0.0.Sp3.SAMPLE3=git.xyz/plugins/gitiles/+/refs/heads/clientpatch/hij/patch120000sp3
MNO=16.1.0.28.SAMPLE=git.xyz/plugins/gitiles/+/refs/heads/clientpatch/mno/patch161028
.......(150 lines)
file2
IJK = open
ABC = closed
PQR = closed
DEF = open
HIJ = open
LMN = closed
MNO = closed
PQR = open
......(> 150 lines)
output file
ABC=14.2.0.7.SAMPLE=git.xyz/plugins/gitiles/+/refs/heads/client/abc/patch142007=closed
DEF=14.3.0.5.SAMPLE=git.xyz/plugins/gitiles/+/refs/heads/client/def/patch143005=open
DEF=14.3.0.5.SAMPLE2=git.xyz/plugins/gitiles/+/refs/heads/client/def/patch14300-calib=open
HIJ=12.0.0.0.Sp3.SAMPLE3=git.xyz/plugins/gitiles/+/refs/heads/client/hij/patch120000sp3=open
MNO=16.1.0.28.SAMPLE=git.xyz/plugins/gitiles/+/refs/heads/client/mno/patch161028=closed
I have tried the following script, but it is not giving me any output. It's not even printing anything. No errors.
while IFS= read -r line
do
key1=`echo $line | awk -F "=" '{print $1}'` < file1
key2=`echo $line | awk -F "=" '{print $2}'` < file1
key3=`echo $line | awk -F "=" '{print $3}'` < file1
key4=`echo $line | awk -F "=" '{print $1}'` < file2
value3=`echo $line | awk -F "=" '{print $2}'` < file2
if [ "$key1" == "$key4" ]; then
echo "$key1=$key2=$key3=$value3"
fi
done
Giving a brief description of how the code should work:
The code should compare the first columns of the two files (file1 and file2). If the names match, it should give me the output file listed above; otherwise it should go to the next line. I should get output whether my two files are sorted or unsorted.
Help will be appreciated. Thank you
Or another approach with awk that stores the file2 values in an array and then appends the correct state to the appropriate line in file1:
awk -F' = ' 'NR==FNR {a[$1]=$2; next} {print $0"="a[$1]}' file2 FS="=" file1
Example Use/Output
$ awk -F' = ' 'NR==FNR {a[$1]=$2; next} {print $0"="a[$1]}' file2 FS="=" file1
ABC=14.2.0.7.SAMPLE=git.xyz/plugins/gitiles/+/refs/heads/clientpatch/abc/patch142007=closed
DEF=14.3.0.5.SAMPLE=git.xyz/plugins/gitiles/+/refs/heads/clientpatch/def/patch143005=open
DEF=14.3.0.5.SAMPLE2=git.calypso/plugins/gitiles/+/refs/heads/clientpatch/def/patch14300-calib=open
HIJ=12.0.0.0.Sp3.SAMPLE3=git.xyz/plugins/gitiles/+/refs/heads/clientpatch/hij/patch120000sp3=open
MNO=16.1.0.28.SAMPLE=git.xyz/plugins/gitiles/+/refs/heads/clientpatch/mno/patch161028=closed
Could you please try following.
awk '
BEGIN{
OFS="="
}
FNR==NR{
a[$1]=$NF
next
}
($1 in a){
print $0,a[$1]
}
' file2 FS="=" file1
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section from here.
OFS="=" ##Setting OFS as = here for all lines.
}
FNR==NR{ ##Checking condition if FNR==NR which will be TRUE when file2 is being read.
a[$1]=$NF ##Creating an array a with index $1 and value is last field.
next ##next will skip all further statements from here.
}
($1 in a){ ##Checking condition if $1 of current line is present in array a then do following.
print $0,a[$1] ##Printing current line and value of array a with index $1.
}
' file2 FS="=" file1 ##Mentioning input files file2 and file1 and setting FS="=" for file1 here.
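The bare FS="=" between the two file names is what lets one invocation parse the files with different separators; here is a minimal reproduction with inline sample files (the names f1 and f2 and their contents are illustrative):

```shell
# f2 uses " = " as its separator, f1 uses "="
printf 'ABC = closed\nDEF = open\n' > f2
printf 'ABC=1.0=path/a\nDEF=2.0=path/b\n' > f1

# -F' = ' applies while reading f2; the FS="=" assignment takes effect for f1
awk -F' = ' 'NR==FNR {a[$1]=$2; next} {print $0 "=" a[$1]}' f2 FS="=" f1
# → ABC=1.0=path/a=closed
#   DEF=2.0=path/b=open
```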

Splitting csv file into multiple files with 2 columns in each file

I am trying to split a file (testfile.csv) that contains the following:
1,2,4,5,6,7,8,9
a,b,c,d,e,f,g,h
q,w,e,r,t,y,u,i
a,s,d,f,g,h,j,k
z,x,c,v,b,n,m,z
into a file
1,2
a,b
q,w
a,s
z,x
and another file
4,5
c,d
e,r
d,f
c,v
but I cannot seem to do that in awk using an iterative solution.
awk -F, '{print $1, $2}'
awk -F, '{print $3, $4}'
does it for me but I would like a looping solution.
I tried
awk -F, '{ for (i=1;i< NF;i+=2) print $i, $(i+1) }' testfile.csv
but it gives me a single column. It appears that I am iterating over the first row and then moving on to the second row, skipping every other element of that specific row.
You can use cut:
$ cut -d, -f1,2 file > file_1
$ cut -d, -f3,4 file > file_2
If you are going to use awk be sure to set the OFS so that the columns remain a CSV file:
$ awk 'BEGIN{FS=OFS=","}
{print $1,$2 >"f1"; print $3,$4 > "f2"}' file
$ cat f1
1,2
a,b
q,w
a,s
z,x
$ cat f2
4,5
c,d
e,r
d,f
c,v
Is there a quick and dirty way of renaming the resulting files after the first-row value of each file's first column (like the first file would be 1.csv, the second file would be 4.csv)?
awk 'BEGIN{FS=OFS=","}
FNR==1 {n1=$1 ".csv"; n2=$3 ".csv"}
{print $1,$2 >n1; print $3,$4 > n2}' file
awk -F, '{ for (i=1; i < NF; i+=2) print $i, $(i+1) > (i ".csv")}' tes.csv
works for me. I was trying to get the output in bash, which was all jumbled up.
It's do-able in bash, but it will be much slower than awk:
f=testfile.csv
IFS=, read -ra first < <(head -1 "$f")
for ((i = 0; i < (${#first[@]} + 1) / 2; i++)); do
slice_file="${f%.csv}$((i+1)).csv"
cut -d, -f"$((2 * i + 1))-$((2 * (i + 1)))" "$f" > "$slice_file"
done
with sed:
sed -r '
h
s/(.,.),.*/\1/w file1.txt
g
s/.,.,(.,.),.*/\1/w file2.txt' file.txt

Multiple if statements in awk

I have a file that looks like
01/11/2015;998978000000;4890********3290;5735;ITUNES.COM/BILL;LU;Cross_border_rub;4065;17;915;INSUFF FUNDS;51;0;
There are 13 semicolon separated columns.
I'm trying to calculate 9 columns for all lines:
awk -F ';' -vOFS=';' '{ gsub(",", ".", $9); print }' file |
awk -F ';' '$0 = NR-1";"$0' |
awk -F ';' -vOFS=';' '{bar[$1]=$1;a[$1]=$2;b[$1]=$3;c[$1]=$4;d[$1]=$5;e[$1]=$6;f[$1]=$7;g[$1]=$8;h[$1]=$9;k[$1]=$10;l[$1]=$11;l[$1]=$12;m[$1]=$13;p[$1]=$14;};
if($7="International") {income=0.0162*h[i]+0.0425*h[i]};
else if($7="Domestic") {income=0.0188*h[i]};
else if($7="Cross_border_rub") {income=0.0162*h[i]+0.025*h[i]}
END{for(i in bar) print income";"a[i],b[i],c[i],d[i],e[i],f[i],g[i],h[i],k[i],l[i],m[i],p[i]}'
How exactly do multiple if statements correctly work in awk?
awk to the rescue!
You don't need the multiple awk invocations; you can consolidate them into one:
$ awk -F';' -v OFS=';' '{gsub(",", ".", $9)}
$7=="International" {income=(0.0162+0.0425)*$9}
$7=="Domestic" {income=0.0188*$9}
$7=="Cross_border_rub" {income=(0.0162+0.025)*$9}
# note: for any other $7 value, the previous line's income is carried over
{print income, NR-1, $0}' file
Test with your file, since you didn't provide enough of a sample to test with.
Perhaps it's better if you just assign the rate:
$ awk -F';' -v OFS=';' '{gsub(",", ".", $9); rate=0}
$7=="International" {rate=0.0162+0.0425}
$7=="Domestic" {rate=0.0188}
$7=="Cross_border_rub" {rate=0.0162+0.025}
{print rate*$9, NR-1, $0}' file
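As a quick sanity check of the rate-based version, here are two synthetic semicolon records (every field except $7 and $9 is a placeholder):

```shell
printf 'a;b;c;d;e;f;Domestic;h;100;\na;b;c;d;e;f;Other;h;100;\n' > tx.txt
awk -F';' -v OFS=';' '{rate=0}
  $7=="Domestic" {rate=0.0188}
  {print rate*$9, NR-1, $0}' tx.txt
# → 1.88;0;a;b;c;d;e;f;Domestic;h;100;
#   0;1;a;b;c;d;e;f;Other;h;100;
```

Because rate is reset to 0 on every line, unmatched $7 values yield an income of 0 instead of silently reusing the previous line's value.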

awk print something if column is empty

I am trying out one script in which a file [ file.txt ] has so many columns like
abc|pqr|lmn|123
pqr|xzy|321|azy
lee|cha| |325
xyz| |abc|123
I would like to get the column values in a bash script using the awk command; if a column is empty, it should print "blank", else print the column value.
I have tried the possibilities below, but they are not working:
cat file.txt | awk -F "|" {'print $2'} | sed -e 's/^$/blank/' // Using awk and sed
cat file.txt | awk -F "|" '!$2 {print "blank"} '
cat file.txt | awk -F "|" '{if ($2 =="" ) print "blank" } '
Please let me know how we can do that using awk or any other bash tools.
Thanks
I think what you're looking for is
awk -F '|' '{print match($2, /[^ ]/) ? $2 : "blank"}' file.txt
match(str, regex) returns the position in str of the first match of regex, or 0 if there is no match. So in this case, it will return a non-zero value if there is some non-blank character in field 2. Note that in awk, the index of the first character in a string is 1, not 0.
Here, I'm assuming that you're interested only in a single column.
If you wanted to be able to specify the replacement string from a bash variable, the best solution would be to pass the bash variable into the awk program using the -v switch:
awk -F '|' -v blank="$replacement" \
'{print match($2, /[^ ]/) ? $2 : blank}' file.txt
This mechanism avoids problems with escaping metacharacters.
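A short run over two of the sample rows, fed inline instead of from file.txt:

```shell
# $2 is "pqr" on the first row and a lone space on the second
printf 'abc|pqr|lmn|123\nxyz| |abc|123\n' |
awk -F'|' '{print match($2, /[^ ]/) ? $2 : "blank"}'
# → pqr
#   blank
```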
You can do it using this sed script:
sed -r 's/\| +\|/\|blank\|/g' File
abc|pqr|lmn|123
pqr|xzy|321|azy
lee|cha|blank|325
xyz|blank|abc|123
If you don't want the |:
sed -r 's/\| +\|/\|blank\|/g; s/\|/ /g' File
abc pqr lmn 123
pqr xzy 321 azy
lee cha blank 325
xyz blank abc 123
Else with awk:
awk '{gsub(/\| +\|/,"|blank|")}1' File
abc|pqr|lmn|123
pqr|xzy|321|azy
lee|cha|blank|325
xyz|blank|abc|123
You can use awk like this:
awk 'BEGIN{FS=OFS="|"} {for (i=1; i<=NF; i++) if ($i ~ /^ *$/) $i="blank"} 1' file
abc|pqr|lmn|123
pqr|xzy|321|azy
lee|cha|blank|325
xyz|blank|abc|123

Awk and head not identifying columns properly

Here is my code, which I want to use to separate 3 columns from hist.txt into 2 separate files: hist1.dat with the first and second columns, and hist2.dat with the first and third columns. The columns in hist.txt may be separated by more than one space. I want to save in histogram1.dat and histogram2.dat the first n lines, up to the last nonzero value.
The script creates histogram1.dat correctly, but histogram2.dat contains all the lines from hist2.dat.
hist.txt is like :
http://pastebin.com/JqgSKZrP
#!/bin/bash
sed 's/\t/ /g' hist.txt | awk '{print $1 " " $2;}' > hist1.dat
sed 's/\t/ /g' hist.txt | awk '{print $1 " " $3;}' > hist2.dat
head -n $( awk 'BEGIN {last=1}; {if($2!=0) last=NR};END {print last}' hist1.dat) hist1.dat > histogram1.dat
head -n $( awk 'BEGIN {last=1}; {if($2!=0) last=NR};END {print last}' hist2.dat) hist2.dat > histogram2.dat
What is the cause of this problem? Might it be due to some special restriction of head?
Thanks.
For your first histogram, try
awk '$2 ~ /000000/{exit}{print $1, $2}' hist.txt
and for your second:
awk '$3 ~ /000000/{exit}{print $1, $3}' hist.txt
Hope I understood you correctly...
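The pastebin data isn't reproduced above, so the sketch below assumes rows of the form "bin value value" with zeros written as 0.000000; it shows the early-exit behavior on synthetic data:

```shell
# Assumed layout: column 1 = bin, columns 2 and 3 = histogram values
printf '1 2.5 0.000000\n2 0.000000 3.1\n3 1.0 1.0\n' > hist.txt
awk '$2 ~ /000000/{exit}{print $1, $2}' hist.txt
# → 1 2.5   (row 2 matches /000000/ in $2, so awk exits there)
```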
