sed - extract data between 2 words from different parts of .txt by combining 3 sed commands - bash

I have multiple .txt files with info like this:
"commercial_name":"THE OUTBACK","contact_name":"JEFF","contact_person":"MANAGER","working_place"
There is a lot of garbage before and after the given snippet.
I want to get results like this:
THE OUTBACK,JEFF,MANAGER
All on the same line for each .txt file, with a new line for the next .txt.
I am doing it with 3 different sed commands:
sed -n 's:.*"commercial_name"\(.*\)"contact_name".*:\1:p' *.txt
sed -n 's:.*"contact_name"\(.*\)"contact_person".*:\1:p' *.txt
sed -n 's:.*"contact_person"\(.*\)"working_place".*:\1:p' *.txt
even if I combine these 3, the result is:
:"THE OUTBACK",
-all commercial names 1 line for each .txt
:"JEFF",
-all contact names 1 line for each .txt
:"MANAGER",
-all contact person 1 line for each .txt
I want to extract all the info in the same line:
THE OUTBACK,JEFF,MANAGER
then the info for the next .txt in the next line
and so on.

You may use this awk:
awk 'BEGIN {
FS=OFS=","
}
{
gsub(/"/, "")
for(i=1; i<=NF; ++i) {
if (split($i, entry, ":") == 2)
map[entry[1]] = entry[2]
}
print map["commercial_name"], map["contact_name"], map["contact_person"]
}' file
THE OUTBACK,JEFF,MANAGER

With awk, we set FS and OFS separately:
awk -v FS=',|:' -v OFS=',' '{print $2,$4,$6}' file
"THE OUTBACK","JEFF","MANAGER"
and gsub for removing double quotes:
awk -v FS=',|:' -v OFS=',' '{gsub(/"/, "")} {print $2,$4,$6}' file
THE OUTBACK,JEFF,MANAGER
Why print $2,$4,$6?
Ed Morton gives a detailed explanation in this question:
converting regex to sed or grep regex
Using Ed's code, you can see each field with a for loop (RT is specific to GNU awk):
awk -v FS=',|:' -v OFS=',' '{gsub(/"/, "")} {for (i=1; i<=NF;i++) print "Record", NR, "Field", i, ": " $i;}{print RT}' file
Record,1,Field,1,: commercial_name
Record,1,Field,2,: THE OUTBACK
Record,1,Field,3,: contact_name
Record,1,Field,4,: JEFF
Record,1,Field,5,: contact_person
Record,1,Field,6,: MANAGER
Record,1,Field,7,: working_place
In this case, we are interested in fields 2, 4 and 6:
{print $2,$4,$6}
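Since the question asked about combining the three sed commands, here is a minimal sketch of a single sed substitution with three capture groups. It assumes the three keys always appear next to each other in this order and that the values contain no embedded double quotes. The function name extract_contact and the sample line are made up for illustration.

```shell
# Hypothetical sample line standing in for the contents of one .txt file.
line='junk,"commercial_name":"THE OUTBACK","contact_name":"JEFF","contact_person":"MANAGER","working_place":"x",junk'

# Capture the three quoted values in one substitution and join them with commas.
# With no file arguments, sed reads standard input.
extract_contact() {
  sed -n 's/.*"commercial_name":"\([^"]*\)","contact_name":"\([^"]*\)","contact_person":"\([^"]*\)".*/\1,\2,\3/p' "$@"
}

printf '%s\n' "$line" | extract_contact
```

Called as extract_contact *.txt, it should print one line per matching input line, which gives one line per file if each file holds a single such record.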

using awk to print header name and a substring

I tried using this code to print a gene name as a header and then pull a substring based on its location, but it doesn't work:
>output_file
cat input_file | while read row; do
echo $row > temp
geneName=`awk '{print $1}' tmp`
startPos=`awk '{print $2}' tmp`
endPOs=`awk '{print $3}' tmp`
for i in temp; do
echo ">${geneName}" >> genes_fasta ;
echo "awk '{val=substr($0,${startPos},${endPOs});print val}' fasta" >> genes_fasta
done
done
input_file
nad5_exon1 250405 250551
nad5_exon2 251490 251884
nad5_exon3 195620 195641
nad5_exon4 154254 155469
nad5_exon5 156319 156548
fasta
atgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgc............
and this is my wrong output file
>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta
>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta
>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta
>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta
>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta
>
awk '{val=substr(pull_genes.sh,,);print val}' unwraped_carm_mt.fasta
output should look like that:
>name1
atgcatgcatgcatgcatgcat
>name2
tgcatgcatgcatgcat
>name3
gcatgcatgcatgcatgcat
>namen....
You can do this with a single call to awk which will be orders of magnitude more efficient than looping in a shell script and calling awk 4-times per-iteration. Since you have bash, you can simply use command substitution and redirect the contents of fasta to an awk variable and then simply output the heading and the substring containing the beginning through ending characters from your fasta file.
For example:
awk -v fasta="$(<fasta)" '{print ">" $1; print substr(fasta,$2,$3-$2+1)}' input
or using getline within the BEGIN rule:
awk 'BEGIN{getline fasta<"fasta"}
{print ">" $1; print substr(fasta,$2,$3-$2+1)}' input
Example Input Files
Note: the beginning and ending values have been reduced to fit within the 129 characters of your example:
$ cat input
rad5_exon1 1 17
rad5_exon2 23 51
rad5_exon3 110 127
rad5_exon4 38 62
rad5_exon5 59 79
and the first 129-characters of your example fasta
$ cat fasta
atgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgcatgc
Example Use/Output
$ awk -v fasta="$(<fasta)" '{print ">" $1; print substr(fasta,$2,$3-$2+1)}' input
>rad5_exon1
atgcatgcatgcatgca
>rad5_exon2
gcatgcatgcatgcatgcatgcatgcatg
>rad5_exon3
tgcatgcatgcatgcatg
>rad5_exon4
tgcatgcatgcatgcatgcatgcat
>rad5_exon5
gcatgcatgcatgcatgcatg
Look things over and let me know if I understood your requirements. Also let me know if you have further questions about the solution.
If I'm understanding correctly, how about:
awk 'NR==FNR {fasta = fasta $0; next}
{
printf(">%s\n%s\n", $1, substr(fasta, $2, $3 - $2 + 1))
}' fasta input_file > genes_fasta
It first reads fasta file and stores the sequence in a variable fasta.
Then it reads input_file line by line, extracts the substring of fasta starting at $2 and of length $3 - $2 + 1. (Note that the 3rd argument to substr function is length, not endpos.)
Hope this helps.
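To illustrate the point about the third argument of substr being a length, here is a small sketch. The helper name slice and the 12-character sequence are hypothetical. Positions are 1-based and the end position is treated as inclusive.

```shell
# Hypothetical short sequence; awk string positions start at 1.
dna='atgcatgcatgc'

# slice SEQ START END: print characters START..END inclusive.
# substr takes a length, so the end position is converted with END - START + 1.
slice() {
  awk -v s="$1" -v b="$2" -v e="$3" 'BEGIN { print substr(s, b, e - b + 1) }'
}

slice "$dna" 3 6
```

Passing the end position directly as the third argument would return 6 characters here instead of the 4 between positions 3 and 6.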
made it work!
this is the script for pulling substrings from a fasta file
cat genes_and_bounderies1 | while read row; do
echo $row > temp
geneName=`awk '{print $1}' temp`
startPos=`awk '{print $2}' temp`
endPos=`awk '{print $3}' temp`
length=$(expr $endPos - $startPos)
for i in temp; do
echo ">${geneName}" >> genes_fasta
awk -v S=$startPos -v L=$length '{print substr($0,S,L)}' unwraped_${fasta} >> genes_fasta
done
done

Gawk Line removal, Splitter is :

Is it possible to move certain columns from one .txt file into another .txt file?
I have a .txt that contains:
USERID:ORDER#:IP:PHONE:ADDRESS:POSTCODE
USERID:ORDER#:IP:PHONE:ADDRESS:POSTCODE
With gawk I want to extract ADDRESS & POSTCODE columns into another .txt, so for this given file the output should be:
ADDRESS1:POSTCODE1
ADDRESS2:POSTCODE2
etc.
This is a classic AWK transform. You want to use "-F :" to specify that the input is delimited by ":" and print a new ":" on output:
awk -F: '{ print $5 ":" $6 }' <input.txt >output.txt
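A variation on the same idea, as a sketch: setting OFS to ":" lets print insert the separator itself, which scales better when more columns are needed. The sample data and the function name last_two are made up for illustration.

```shell
# Hypothetical two-line sample standing in for input.txt.
sample='u1:o1:10.0.0.1:555:ADDRESS1:POSTCODE1
u2:o2:10.0.0.2:556:ADDRESS2:POSTCODE2'

# Split on ":" and join the printed fields with ":" via OFS.
# With no file arguments, awk reads standard input.
last_two() {
  awk -F: -v OFS=: '{ print $5, $6 }' "$@"
}

printf '%s\n' "$sample" | last_two
```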
Try that:
awk -F: '{printf "%s:%s ",$5,$6}' ex.txt
input is
USERID:ORDER#:IP:PHONE:ADDRESS1:POSTCODE1
USERID:ORDER#:IP:PHONE:ADDRESS2:POSTCODE2
output is (on one line if I understand correctly)
ADDRESS1:POSTCODE1 ADDRESS2:POSTCODE2
The only flaw is that the output ends with a trailing space and has no final newline.
This can be fixed with the slightly more complex (but still readable):
awk -F: 'BEGIN {z=0;} {if (z==1) { printf " "; } ; z=1; printf "%s:%s",$5,$6} END{printf"\n"}' ex.txt
awk -F: 'NR==1 {print $5"1:"$6"1"};NR==2 {print $5"2:"$6"2"}' file
ADDRESS1:POSTCODE1
ADDRESS2:POSTCODE2

Multiple if statements in awk

I have a file that looks like
01/11/2015;998978000000;4890********3290;5735;ITUNES.COM/BILL;LU;Cross_border_rub;4065;17;915;INSUFF FUNDS;51;0;
There are 13 semicolon separated columns.
I'm trying to calculate 9 columns for all lines:
awk -F ';' -vOFS=';' '{ gsub(",", ".", $9); print }' file |
awk -F ';' '$0 = NR-1";"$0' |
awk -F ';' -vOFS=';' '{bar[$1]=$1;a[$1]=$2;b[$1]=$3;c[$1]=$4;d[$1]=$5;e[$1]=$6;f[$1]=$7;g[$1]=$8;h[$1]=$9;k[$1]=$10;l[$1]=$11;l[$1]=$12;m[$1]=$13;p[$1]=$14;};
if($7="International") {income=0.0162*h[i]+0.0425*h[i]};
else if($7="Domestic") {income=0.0188*h[i]};
else if($7="Cross_border_rub") {income=0.0162*h[i]+0.025*h[i]}
END{for(i in bar) print income";"a[i],b[i],c[i],d[i],e[i],f[i],g[i],h[i],k[i],l[i],m[i],p[i]}'
How exactly do multiple if statements correctly work in awk?
awk to the rescue!
You don't need multiple awk invocations; they can be consolidated into one:
$ awk -F';' -v OFS=';' '{gsub(",", ".", $9)}
$7=="International" {income=(0.0162+0.0425)*$9}
$7=="Domestic" {income=0.0188*$9}
$7=="Cross_border_rub" {income=(0.0162+0.025)*$9}
# note: for any other $7 value, income keeps its value from the previous line
{print income, NR-1, $0}' file
Test this with your file, since you didn't provide enough of a sample to test.
It is perhaps better to just assign the rate:
$ awk -F';' -v OFS=';' '{gsub(",", ".", $9); rate=0}
$7=="International" {rate=0.0162+0.0425}
$7=="Domestic" {rate=0.0188}
$7=="Cross_border_rub" {rate=0.0162+0.025}
{print rate*$9, NR-1, $0}' file
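To answer the literal question about if statements: inside an action block, awk uses a C-style if / else if / else chain, and comparison is ==. A single = assigns to the field, as in the question's code. A minimal sketch using the sample line from the question follows. The function name income_for is made up, and falling back to a rate of 0 for unknown values is an assumption.

```shell
# Sample line copied from the question; field 7 is the type, field 9 the amount.
line='01/11/2015;998978000000;4890********3290;5735;ITUNES.COM/BILL;LU;Cross_border_rub;4065;17;915;INSUFF FUNDS;51;0;'

# Pick a rate with an if / else if / else chain, then print rate * amount.
income_for() {
  awk -F';' '{
    gsub(",", ".", $9)                 # accept a decimal comma in the amount
    if ($7 == "International")         rate = 0.0162 + 0.0425
    else if ($7 == "Domestic")         rate = 0.0188
    else if ($7 == "Cross_border_rub") rate = 0.0162 + 0.025
    else                               rate = 0   # assumed fallback
    print rate * $9
  }' "$@"
}

printf '%s\n' "$line" | income_for
```

Because the chain ends in an else, rate is reset on every line and never carries over from the previous record.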

awk OFS not producing expected value

I have a file
[root@nmk ~]# cat file
abc>
sssd>
were>
I run both these variations of the awk commands
[root@nmk ~]# cat file | awk -F\> ' { print $1}' OFS=','
abc
sssd
were
[root@nmk ~]# cat file | awk -F\> ' BEGIN { OFS=","} { print $1}'
abc
sssd
were
[root@nmk ~]#
But my expected output is
abc,sssd,were
What's missing in my commands ?
You're just a bit confused about the meaning/use of FS, OFS, RS and ORS. Take another look at the man page. I think this is what you were trying to do:
$ awk -F'>' -v ORS=',' '{print $1}' file
abc,sssd,were,$
but this is probably closer to the output you really want:
$ awk -F'>' '{rec = rec (NR>1?",":"") $1} END{print rec}' file
abc,sssd,were
or if you don't want to buffer the whole output as a string:
$ awk -F'>' '{printf "%s%s", (NR>1?",":""), $1} END{print ""}' file
abc,sssd,were
awk -F\> -v ORS="" 'NR>1{print ","$1;next}{print $1}' file
to print newline at the end:
awk -F\> -v ORS="" 'NR>1{print ","$1;next}{print $1} END{print "\n"}' file
output:
abc,sssd,were
Each line of input in awk is a record, so what you want to set is the Output Record Separator, ORS. The OFS variable holds the Output Field Separator, which is used to separate different parts of each line.
Since you are setting the input field separator, FS, to >, and OFS to ,, an easy way to see how these work is to add something on each line of your file after the >:
awk 'BEGIN { FS=">"; OFS=","} {$1=$1} 1' <<<$'abc>def\nsssd>dsss\nwere>wolf'
abc,def
sssd,dsss
were,wolf
So you want to set the ORS. The default record separator is newline, so whatever you set ORS to effectively replaces the newlines in the input. But that means that if the last line of input ends with a newline, which is usually the case, that last line will also get a copy of your new ORS:
awk 'BEGIN { FS=">"; ORS=","} 1' <<<$'abc>def\nsssd>dsss\nwere>wolf'
abc>def,sssd>dsss,were>wolf,
It also won't get a newline at all, because that newline was interpreted as an input record separator and turned into the output record separator - it became the final comma.
So you have to be a little more explicit about what you're trying to do:
awk 'BEGIN { FS=">" } # split input on >
(NR>1) { printf "," } # if not the first line, print a ,
{ printf "%s", $1 } # print the first field (everything up to the first >)
END { printf "\n" } # add a newline at the end
' <<<$'abc>\nsssd>\nwere>'
Which outputs this:
abc,sssd,were
Through sed,
$ sed ':a;N;$!ba;s/>\n/,/g;s/>$//' file
abc,sssd,were
Through Perl,
$ perl -00pe 's/>\n(?=.)/,/g;s/>$//' file
abc,sssd,were

How to print a range of columns in a CSV in AWK? [duplicate]

With awk, I can print any column within a CSV, e.g., this will print the 10th column in file.csv.
awk -F, '{ print $10 }' file.csv
If I need to print columns 5-10, including the comma, I only know this way:
awk -F, '{ print $5","$6","$7","$8","$9","$10 }' file.csv
This method is not so good if I want to print many columns. Is there a simpler syntax for printing a range of columns in a CSV in awk?
The standard way to do this in awk is using a for loop:
awk -v s=5 -v e=10 'BEGIN{FS=OFS=","}{for (i=s; i<=e; ++i) printf "%s%s", $i, (i<e?OFS:ORS)}' file
However, if your delimiter is simple (as in your example), you may prefer to use cut:
cut -d, -f5-10 file
Perl deserves a mention (using -a to enable autosplit mode):
perl -F, -lane '$"=","; print "@F[4..9]"' file
You can use a loop in awk to print columns from 5 to 10:
awk -F, '{ for (i=5; i<=10; i++) print $i }' file.csv
Keep in mind that print puts each column on its own line. To print them on the same line, separated by OFS and ending with a newline, use printf:
awk -F, -v OFS=, '{ for (i=5; i<=10; i++) printf("%s%s", $i, (i<10 ? OFS : ORS)) }' file.csv
With GNU awk for gensub():
$ cat file
a,b,c,d,e,f,g,h,i,j,k,l,m
$
$ awk -v s=5 -v n=6 '{ print gensub("(([^,]+,){"s-1"})(([^,]+,){"n-1"}[^,]+).*","\\3","") }' file
e,f,g,h,i,j
s is the start position and n is the number of fields to print from that point on. Or if you prefer to specify start and end:
$ awk -v s=5 -v e=10 '{ print gensub("(([^,]+,){"s-1"})(([^,]+,){"e-s"}[^,]+).*","\\3","") }' file
e,f,g,h,i,j
Note that this will only work with single-character field separators since it relies on being able to negate the FS in a character class.
