awk or other shell to convert delimited list into a table

awk or other shell to convert delimited list into a table - shell

So what I have is a huge csv akin to this:
Pool1,Shard1,Event1,10
Pool1,Shard1,Event2,20
Pool1,Shard2,Event1,30
Pool1,Shard2,Event4,40
Pool2,Shard1,Event3,50
etc
Which is not ealisy readable. Eith there being only 4 types of events I'm useing spreadsheets to convert this into the following:
Pool1,Shard1,10,20,,
Pool1,Shard2,30,,,40
Pool2,Shard1,,,50,
Only events are limited to 4, pools and shards can be indefinite really. But the events may be missing from the lines - not all pools/shards have all 4 events every day.
So I tried doing this within an awk in the shell script that gathers the csv in the first place, but I'm failing spectacuraly, no working code can even be shown since it's producing zero results.
Basically I tried sorting the CSV reading the first two fields of a row, comparing to previous row and if matching comparing the third field to a set array of event strings then storing the fouth field in a variable respective to the event, and one the first two fileds are not matching - finally print the whole line including variables.
Sorry for the one-liner, testing and experimenting directly in the command line. It's embarassing, it does nothing.
awk -F, '{if (a==$1&&b==$2) {if ($3=="Event1") {r=$4} ; if ($3=="Event2") {d=$4} ; if ($3=="Event3") {t=$4} ; if ($3=="Event4") {p=$4}} else {printf $a","$b","$r","$d","$p","$t"\n"; a=$1 ; b=$2 ; if ($3=="Event1") {r=$4} ; if ($3=="Event2") {d=$4} ; if ($3=="Event3") {t=$4} ; if ($3=="Event4") {p=$4} ; a=$1; b=$2}} END {printf "\n"}'

You could simply use an assoc array: awk -F, -f parse.awk input.csv with parse.awk being:
{
sub(/Event/, "", $3);
res[$1","$2][$3]=$4;
}
END {
for (name in res) {
printf("%s,%s,%s,%s,%s\n", name, res[name][1], res[name][2], res[name][3], res[name][4])
}
}
Order could be confused by awk, but my test output is:
Pool2,Shard1,,,50,
Pool1,Shard1,10,20,,
Pool1,Shard2,30,,,40
PS: Please use an editor to write awk source code. Your one-liner is really hard to read. Since I used a different approach, I did not even try do get it "right"... ;)

$ cat tst.awk
BEGIN { FS=OFS="," }
{ key = $1 OFS $2 }
key != prev {
if ( NR>1 ) {
print prev, f["Event1"], f["Event2"], f["Event3"], f["Event4"]
delete f
}
prev = key
}
{ f[$3] = $4 }
END { print key, f["Event1"], f["Event2"], f["Event3"], f["Event4"] }
$ sort file | awk -f tst.awk
Pool1,Shard1,10,20,,
Pool1,Shard2,30,,,40
Pool2,Shard1,,,50,

Related

Reverse complement SOME sequences in fasta file

I've been reading lots of helpful posts about reverse complementing sequences, but I've got what seems to be an unusual request. I'm working in bash and I have DNA sequences in fasta format in my stdout that I'd like to pass on down the pipe. The seemingly unusual bit is that I'm trying to reverse complement SOME of those sequences, so that the output has all the sequences in the same direction (for multiple sequence alignment later).
My fasta headers end in either "C" or "+". I'd like to reverse complement the ones that end in "C". Here's a little subset:
>chr1:86214203-86220231+
CTGGTGGTACAGCTACATTGTACCATAAAACTTATTCATATTAAAACTTA
TTTATATGTACCTCAAAAGATTAAACTGGGAGATAAGGTGTGGCATTTTT
>chr1:84518073-84524089C
caccttagagataatgaagtatattcagaatgtagaacattctataagac
aactgacccaatatcttttaaaaagtcaatgccatgttaaaaataaaaag
I know there are lots of ways to reverse complement out there, like:
echo ACCTTGAAA | tr ACGTacgt TGCAtgca | rev
and
seqtk seq -r in.fa > out.fa
But I'm not sure how to do this for only those sequences that have a C at the end of the header. I think awk or sed is probably the ticket, but I'm at a loss as to how to actually code it. I can get the sequence headers with awk, like:
awk '/^>/ { print $0 }'
>chr1:84518073-84524089C
>chr1:86214203-86220231+
But if someone could help me figure out how to turn that awk statement into one that asks "if the last character in the header has a C, do this!" that would be great!
Edited to add:
I was so tired when I made this post, I apologize for not including my desired output. Here is what I'd like to output to look like, using my little example:
>chr1:86214203-86220231+
CTGGTGGTACAGCTACATTGTACCATAAAACTTATTCATATTAAAACTTA
TTTATATGTACCTCAAAAGATTAAACTGGGAGATAAGGTGTGGCATTTTT
>chr1:84518073-84524089C
ctttttatttttaacatggcattgactttttaaaagatattgggtcagtt
gtcttatagaatgttctacattctgaatatacttcattatctctaaggtg
You can see the sequence that ends in + is unchanged, but the sequence with a header that ends in C is reverse complemented.
Thanks!

An earlier answer (by Ed Morton) uses a self-contained awk procedure to selectively reverse-complement sequences following a comment line ending with "C". Although I think that to be the best approach, I will offer an alternative approach that might have wider applicability.
The procedure here uses awk's system() function to send data extracted from the fasta file in awk to the shell where the sequence can be processed by any of the many shell applications existing for sequence manipulation.
I have defined an awk user function to pass the isolated sequence from awk to the shell. It can be called from any part of the awk procedure:
function processSeq(s)
{system("echo \"" s "\" | tr ACGTacgt TGCAtgca | rev ");}
The argument of the system function is a string containing the command you would type into terminal to achieve the desired outcome (in this case I've used one of the example reverse-complement routines mentioned in the question). The parts to note are the correct escaping of quote marks that are to appear in the shell command, and the variable s that will be substituted for the sequence string assigned to it when the function is called. The value of s is concatenated with the strings quoted before and after it in the argument to system() shown above.
isolating the required sequences
The rest of the procedure addresses how to achieve:
"if the last character in the header has a C, do this"
Before making use of shell applications, awk needs to isolate the part(s) of the file to process. In general terms, awk employs one or more pattern/action blocks where only records (lines by default) that match a given pattern are processed by the subsequent action commands. For example, the following illustrative procedure performs the action of printing the whole line print $0 if the pattern /^>/ && /C$/ is true for that line (where /^>/ looks for ">" at the start of a line and /C$/ looks for "C" at the end of the same line.:
/^>/ && /C$/{ print $0 }
For the current needs, the sequence begins on the next record (line) after any record beginning with > and ending with C. One way of referencing that next line is to set a variable (named line in my example) when the C line is encountered and establishing a later pattern for the record with numerical value one more than line variable.
Because fasta sequences may extend over several lines, we have to accumulate several successive lines following a C title line. I have achieved this by concatenating each line following the C title line until a record beginning with > is encountered again (or until the end of the file is reached, using the END block).
In order that sequence lines following a non-C title line are ignored, I have used a variable named flag with values of either "do" or "ignore" set when a title record is encountered.
The call to a the custom function processSeq() that employs the system() command, is made at the beginning of a C title action block if the variable seq holds an accumulated sequence (and in the END block for relevant sequences that occur at the end of the file where there will be no title line).
Test file and procedure
A modified version of your example fasta was used to test the procedure. It contains an extra relevant C record with three and-a-bit lines instead of two, and an extra irrelevant + record.
seq.fasta:
>chr1:86214203-86220231+
CTGGTGGTACAGCTACATTGTACCATAAAACTTATTCATATTAAAACTTA
TTTATATGTACCTCAAAAGATTAAACTGGGAGATAAGGTGTGGCATTTTT
>chr1:84518073-84524089C
caccttagagataatgaagtatattcagaatgtagaacattctataagac
aactgacccaatatcttttaaaaagtcaatgccatgttaaaaataaaaag
>chr1:86214203-86220231+
CTGGTGGTACAGCTACATTGTACCATAAAACTTATTCATATTAAAACTTA
TTTATATGTACCTCAAAAGATTAAACTGGGAGATAAGGTGTGGCATTTTT
>chranotherC
aatgaagtatattcagaatgtagaacattaactgacccgccatgttaatc
aatatctataagaccttttaaaaagcaccttagagattcaataaagtcag
gaagtatattcagaatgtagaacattaactgactaagaccttttaacatg
gcattgact
procedure
awk '
/^>/ && /C$/{
if (length(seq)>0) {processSeq(seq); seq="";}
line=NR; print $0; flag="do"; next;
}
/^>/ {line=NR; flag="ignore"}
NR>1 && NR==(line+1) && (flag=="do"){seq=seq $0; line=NR; next}
function processSeq(s)
{system("echo \"" s "\" | tr ACGTacgt TGCAtgca | rev ");}
END { if (length(seq)>0) processSeq(seq);}
' seq.fasta
output
>chr1:84518073-84524089C
ctttttatttttaacatggcattgactttttaaaagatattgggtcagttgtcttatagaatgttctacattctgaatatacttcattatctctaaggtg
>chranotherC
agtcaatgccatgttaaaaggtcttagtcagttaatgttctacattctgaatatacttcctgactttattgaatctctaaggtgctttttaaaaggtcttatagatattgattaacatggcgggtcagttaatgttctacattctgaatatacttcatt
Tested using GNU Awk 5.1.0 on a Raspberry Pi 400.
performance note
Because calling sytstem() creates a sub shell, this process will be slower than a self-contained awk procedure. It might be useful where existing shell routines are available or tricky to reproduce with custom awk routines.
Edit: modification to include unaltered + records
This version has some repetition of earlier blocks, with minor changes, to handle printing of the lines that are not to be reverse-complemented (the changes should be self-explanatory if the main explanations were understood)
awk '
/^>/ && /C$/{
if (length(seq)>0 && flag=="do") {processSeq(seq)} else {print seq} seq="";line=NR; print $0; flag="do"; next;
}
/^>/ {if (length(seq)>0 && flag=="do") {processSeq(seq)} else {print seq} seq=""; print $0; line=NR; flag="ignore"}
NR>1 && NR==(line+1){seq=seq $0; line=NR; next}
function processSeq(s)
{system("echo \"" s "\" | tr ACGTacgt TGCAtgca | rev ");}
END { if (length(seq)>0 && flag=="do") {processSeq(seq)} else {print seq}}
' seq.fasta

Using any awk:
$ cat tst.awk
/^>/ {
if ( NR > 1 ) {
prt()
}
head = $0
tail = ""
next
}
{ tail = ( tail == "" ? "" : tail ORS ) $0 }
END { prt() }
function prt( type) {
type = substr(head,length(head),1)
tail = ( type == "C" ? rev( tr( tail, "ACGTacgt TGCAtgca" ) ) : tail )
print head ORS tail
}
function tr(oldStr,trStr, i,lgth,char,newStr) {
if ( !_trSeen[trStr]++ ) {
lgth = (length(trStr) - 1) / 2
for ( i=1; i<=lgth; i++ ) {
_trMap[trStr,substr(trStr,i,1)] = substr(trStr,lgth+1+i,1)
}
}
lgth = length(oldStr)
for (i=1; i<=lgth; i++) {
char = substr(oldStr,i,1)
newStr = newStr ( (trStr,char) in _trMap ? _trMap[trStr,char] : char )
}
return newStr
}
function rev(oldStr, i,lgth,char,newStr) {
lgth = length(oldStr)
for ( i=1; i<=lgth; i++ ) {
char = substr(oldStr,i,1)
newStr = char newStr
}
return newStr
}
$ awk -f tst.awk file
>chr1:86214203-86220231+
CTGGTGGTACAGCTACATTGTACCATAAAACTTATTCATATTAAAACTTA
TTTATATGTACCTCAAAAGATTAAACTGGGAGATAAGGTGTGGCATTTTT
>chr1:84518073-84524089C
ctttttatttttaacatggcattgactttttaaaagatattgggtcagtt
gtcttatagaatgttctacattctgaatatacttcattatctctaaggtg

This might work for you (GNU sed):
sed -nE ':a;p;/^>.*C$/!b
:b;n;/^>/ba;s/^/\n/;y/ACGTacgt/TGCAtgca/
:c;tc;/\n$/{s///p;bb};s/(.*)\n(.)/\2\1\n/;tc' file
Print the current line and then inspect it.
If the line does not begin with > and end with C, bail out and repeat.
Otherwise, fetch the next line and if it begins with >, repeat the above line.
Otherwise, insert a newline (to use as a pivot point when reversing the line), complement the code of the line using a translation command. Then set about reversing the line, character by character until the inserted newline makes its way to the end of the line.
Remove the newline, print the result and repeat the line above.
N.B. The n command will terminate the script when it is executed after the last line has been read.
Since the OP has amended the ouput, another solution is when the whole of the sequence is complemented and then reversed. Here is another solution that I believe follows these criteria.
sed -nE ':a;p;/^>.*C$/!b
:b;n;/^>/!{H;$!bb};x;y/ACGTacgt\n/TGCAtgca%/;s/%/\n/
:c;tc;s/\n$//;td;s/(.*)\n(.)/\2\1\n/;tc
:d;y/%/\n/;p;z;x;$!ba' file

Split command output into separate variables

I'm trying to use a bash script with macos time machine.
I'm trying to read the properties of the time machine backup destinations and then split the destinations into variables if there are multiple destinations.
from there I can use the ID to make a backup.
I'm having trouble splitting the output into their own variables.
rawdstinfo=$(tmutil destinationinfo)
echo "$rawdstinfo"
> ==================================================
Name : USB HDD
Kind : Local
Mount Point : /Volumes/USB HDD
ID : 317BD93D-7D90-494C-9D5F-9013B25D1345
====================================================
Name : TM TEST
Kind : Local
Mount Point : /Volumes/TM TEST
ID : 4648083B-2A11-42BC-A8E0-D95917053D27
I was thinking of counting the ================================================== and then trying to split the variable based on them but i'm not having any luck.
any help would be greatly appreciated.
Thanks
PS:
to make it clear what i would like to achieve, I would like to send each destination drive to an object. From there I can compare the mount point names (which has been selected earlier in the script), to then get the "destination ID" within that object so I can then use it with the other tmutil commands such as
#start a TM backup
sudo tmutil startbackup --destination $DESTINATIONID
#remove Migration HDD as a destination
sudo tmutil removedestination $DESTINATIONID

I like to use awk for parsing delimited flat files. I copied the tmutil output from your question and pasted it into a file I named testdata.txt since I'm not doing this on a Mac. Make sure the number of equal signs in the record separators actually match what tmutil produces.
Here is the awk portion of the solution which goes into a file I named timemachine_variables.awk:
function ltrim(s) { sub(/^[ \t\r\n]+/, "", s); return s }
function rtrim(s) { sub(/[ \t\r\n]+$/, "", s); return s }
function trim(s) { return rtrim(ltrim(s)); }
BEGIN {
RS="====================================================\n";
FS=":|\n"
}
{
i=FNR-1
}
(FNR>1 && $1 ~ /Name/) {print "Name["i"]="trim($2)}
(FNR>1 && $3 ~ /Kind/) {print "Kind["i"]="trim($4)}
(FNR>1 && $5 ~ /Mount Point/) {print "Mount_Point["i"]="trim($6)}
(FNR>1 && $7 ~ /ID/) {print "ID["i"]="trim($8)}
The functions at the beginning are to trim leading or trailing white spaces off any fields. I split the records based on the equals sign separators and the fields based on the colon ":" character. FNR is gawk's internal variable for the current record number that we're looking at. Since the output apparently begins with a bar of equal signs, the first record is empty so I am using FNR > 1 as a condition to exclude it. Then I have gawk print code which will become array assignments for bash. In your example, this should be gawk's output:
$ gawk -f timemachine_variables.awk testdata.txt
Name[1]=USB HDD
Kind[1]=Local
Mount_Point[1]=/Volumes/USB HDD
ID[1]=317BD93D-7D90-494C-9D5F-9013B25D1345
Name[2]=TM TEST
Kind[2]=Local
Mount_Point[2]=/Volumes/TM TEST
ID[2]=4648083B-2A11-42BC-A8E0-D95917053D27
In your BASH script, declare the arrays from the gawk script's output:
$ declare $(gawk -f timemachine_variables.awk testdata.txt)
You should now have BASH arrays for each drive:
$ echo ${ID[2]}
4648083B-2A11-42BC-A8E0-D95917053D27
UPDATE: The original awk script that I posted does not work on the Mac because BSD awk does not support multi-character separators. I'm leaving it here because it works for gawk, and comparing the two scripts may help others who are looking for a way to achieve multi-character separator behavior in BSD awk.
Instead of changing the default record separator which is the end of the line, I set my own counter of i to 0, and then increment it every time the whole record starts and ends with one or more equal signs. Since awk now views each line as its own record and the field separator is still ":", the name we are trying to match is always in $1 and the value is always in $2.
function ltrim(s) { sub(/^[ \t\r\n]+/, "", s); return s }
function rtrim(s) { sub(/[ \t\r\n]+$/, "", s); return s }
function trim(s) { return rtrim(ltrim(s)); }
BEGIN {
FS=":";
i=0;
}
($0 ~ /^=+$/) {i++;}
($1 ~ /Name/) {print "Name["i"]="trim($2)}
($1 ~ /Kind/) {print "Kind["i"]="trim($2)}
($1 ~ /Mount Point/) {print "Mount_Point["i"]="trim($2)}
($1 ~ /ID/) {print "ID["i"]="trim($2)}

Split CSV into two files based on column matching values in an array in bash / posh

I have a input CSV that I would like to split into two CSV files. If the value of column 4 matches any value in WLTarray it should go in output file 1, if it doesn't it should go in output file 2.
WLTarray:
"22532" "79994" "18809" "21032"
input CSV file:
header1,header2,header3,header4,header5,header6,header7,header8
"83","6344324","585677","22532","Entitlements","BX","22532:718","36721"
"83","1223432","616454","79994","Compliance Stuff","DR","79994:64703","206134"
"83","162217","616454","83223","Data Enrichment","IEO","83223:64701","206475"
"83","267216","616457","79994","Compliance Engine","ABC","79994:64703","206020"
output CSV file1:
header1,header2,header3,header4,header5,header6,header7,header8
"83","6344324","585677","22532","Entitlements","BX","22532:718","36721"
"83","1223432","616454","79994","Compliance Stuff","DR","79994:64703","206134"
"83","267216","616457","79994","Compliance Engine","ABC","79994:64703","206020"
output CSV file2:
header1,header2,header3,header4,header5,header6,header7,header8
"83","162217","616454","83223","Data Enrichment","IEO","83223:64701","206475"
I've been looking at awk to filter this (python & perl not an option in my environment) but I think there is probably a much smarter way:
declare -a WLTarray=("22532" "79994" "18809" "21032")
for WLTvalue in "${WLTarray[#]}" #Everything in the WLTarray will go to $filename-WLT.tmp
do
awk -F, '($4=='$WLTvalue'){print}' $filename.tmp >> $filename-WLT.tmp #move the lines to the WLT file
# now filter to remove non matching values? why not just move the rows entirely?
done

With regular awk you can make use of split and substr (to handle double-quote removal for comparison) and split the csv file as you indicate. For example you can use:
awk 'BEGIN { FS=","; s="22532 79994 18809 21032"
split (s,a," ") # split s into array a
for (i in a) # loop over each index in a
b[a[i]]=1 # use value in a as index for b
}
FNR == 1 { # first record, write header to both output files
print $0 > "output1.csv"
print $0 > "output2.csv"
next
}
substr($4,2,length($4)-2) in b { # 4th field w/o quotes in b?
print $0 > "output1.csv" # write to output1.csv
next
}
{ print $0 > "output2.csv" } # otherwise write to output2.csv
' input.csv
Where:
in the BEGIN {...} rule you set the field separator (FS) to break on comma, and split the string containing your desired output1.csv field 4 matches into the array a, then loops over the values in a using them for the indexes in array b (to allow a simple i in b check);
the first rule is applied to the first records in the file (the header line) which is simply written out to both output files;
the next rule removes the double-quotes surrounding field-4 and then checks if the number in field-4 matches an index in array b. If so the record is written to output1.csv otherwise it is written to output2.csv.
Example Input File
$ cat input.csv
header1,header2,header3,header4,header5,header6,header7,header8
"83","6344324","585677","22532","Entitlements","BX","22532:718","36721"
"83","1223432","616454","79994","Compliance Stuff","DR","79994:64703","206134"
"83","162217","616454","83223","Data Enrichment","IEO","83223:64701","206475"
"83","267216","616457","79994","Compliance Engine","ABC","79994:64703","206020"
Resulting Output Files
$ cat output1.csv
header1,header2,header3,header4,header5,header6,header7,header8
"83","6344324","585677","22532","Entitlements","BX","22532:718","36721"
"83","1223432","616454","79994","Compliance Stuff","DR","79994:64703","206134"
"83","267216","616457","79994","Compliance Engine","ABC","79994:64703","206020"
$ cat output2.csv
header1,header2,header3,header4,header5,header6,header7,header8
"83","162217","616454","83223","Data Enrichment","IEO","83223:64701","206475"

You can use gawk like this:
test.awk
#!/usr/bin/gawk -f
BEGIN {
split("22532 79994 18809 21032", a)
for(i in a) {
WLTarray[a[i]]
}
FPAT="[^\",]+"
}
NR > 1 {
if ($4 in WLTarray) {
print >> "output1.csv"
} else {
print >> "output2.csv"
}
}
Make it executable and run it like this:
chmod +x test.awk
./test.awk input.csv

using grep with a filter file as input was the simplest answer.
declare -a WLTarray=("22532" "79994" "18809" "21032")
for WLTvalue in "${WLTarray[#]}"
do
awkstring="'\$4 == "\"\\\"$WLTvalue\\\"\"" {print}'"
eval "awk -F, $awkstring input.csv >> output.WLT.csv"
done
grep -v -x -f output.WLT.csv input.csv > output.NonWLT.csv

AWK split for multiple delimiters lines

I'm trying to split a file using AWK one-line but the code below that I came with is not working properly.
awk '
BEGIN { idx=0; file="original_file.split." }
/^REC_DELIMITER.(HIGH|TOP)$/ { idx++ }
/^REC_DELIMITER.TOP$/,/^REC_DELIMITER.(HIGH|TOP)$/ { print > file sprintf("%03d", idx) }
' original_file
Test file is "original_file":
REC_DELIMITER.TOP
lineA1
lineA2
lineA3
REC_DELIMITER.HIGH
lineB1
lineB2
lineB3
REC_DELIMITER.TOP
lineC1
lineC2
lineC3
REC_DELIMITER.HIGH
lineD1
lineD2
lineD3
AWK code above is for REC_DELIMITER.TOP and it is giving me these files:
original_file.split.001:
REC_DELIMITER.TOP
original_file.split.003:
REC_DELIMITER.TOP
however, I'm trying to get this:
original_file.split.001:
REC_DELIMITER.TOP
lineA1
lineA2
lineA3
original_file.split.003:
REC_DELIMITER.TOP
lineC1
lineC2
lineC3
There will be other record delimiters, and when needed, we can run for them like REC_DELIMITER.HIGH, this way getting files like below:
original_file.split.002:
REC_DELIMITER.HIGH
lineB1
lineB2
lineB3
original_file.split.004:
REC_DELIMITER.HIGH
lineD1
lineD2
lineD3
Any help guys is very appreciate, I have been trying to get this working past few days and AWK code above is the best I was able to get. I need now help from AWK masters. :)
Thank you!

You can try something like this:
awk '
/REC_DELIMITER\.TOP/ {
a=1
b=0
file = sprintf (FILENAME".split.%03d",++n)
}
/REC_DELIMITER\.HIGH/ {
b=1
a=0
file = sprintf (FILENAME".split.%03d",++n)
}
a {
print $0 > file
}
b {
print $0 > file
}' file

You need something like this (untested):
awk -v dtype="TOP" '
BEGIN { dbase = "^REC_DELIMITER\\."; delim = dbase dtype "$" }
$0 ~ dbase { inBlock=0 }
$0 ~ delim { inBlock=1; idx++ }
inBlock { print > sprintf("original_file.split.%03d", idx) }
' original_file

awk -vRS=REC_DELIMITER '/^.TOP\n/{print RS $0 > sprintf("original_file.split.%03d",n)};!++n' original_file
(Give or take an extra newline at the end.)
Generally, when input is supposed to be treated as a series of multi-line records with a special line as delimiter, the most direct approach is to set RS (and often ORS) to that delimiter.
Normally you'd want to add newlines to its beginning and/or end, but this case is a little special so it's easier without them.
Edited to add: You need GNU Awk for this. Standard Awk considers only the first character of RS.

I made some changes so the different delimiters go to the their own file, even when they occur later in the file. make a file like splitter.awk with the contents below, the chmod +x it and run it with ./splitter.awk original_file
#!/usr/bin/awk -f
BEGIN {
idx=0;
file="original_file.split.";
out=""
}
{
if($0 ~ /^REC_DELIMITER.(TOP|HIGH)/){
if (!cnt[$0]) {
cnt[$0] = ++idx;
}
out=cnt[$0];
}
print > file sprintf("%03d", out)
}

I'm not very used to AWK, however, plasticide's answer put me towards right direction and I finally got AWK script working as requirements.
In below code, first IF turn echo to 0 if a demilier is found. Second IF turn echo to 1 if the wanted delimiter is found, then the want ones are are split from file.
I know regex could be something like /^(REC_(DELIMITER\.(TOP|HIGH|LOW)|NO_CATEGORY)$/ but since regex is created dynamically via shellscript that reads from an specific file a list of delimiters, it will look more like in AWK below.
awk 'BEGIN {
idx=0; echo=1; file="original_file.split."
}
{
#All the delimiters to consider in given file
if($0 ~ /^(REC_DELIMITER.TOP|REC_DELIMITER.HIGH|REC_DELIMITER.LOW|REC_NO_CATEGORY)$/) {
echo=0
}
#Delimiters that should actually be pulled
if($0 ~ /^(REC_DELIMITER.HIGH|REC_DELIMITER.LOW)$/ {
idx++; echo=1
}
#Print to a file is match wanted delimmiter
if(echo) {
print > file idx
}
}' original_file
Thank you all. I really appreciate it very much.

Shell script to combine three files using AWK

I have three files G_P_map.txt, G_S_map.txt and S_P_map.txt. I have to combine these three files using awk. The example contents are the following -
(G_P_map.txt contains)
test21g|A-CZ|1mos
test21g|A-CZ|2mos
...
(G_S_map.txt contains)
nwtestn5|A-CZ
nwtestn6|A-CZ
...
(S_P_map.txt contains)
3mos|nwtestn5
4mos|nwtestn6
Expected Output :
1mos, 3mos
2mos, 4mos
Here is the code which I tried. I was able to combine the first two, but I couldn't do along with the third one.
awk -F"|" 'NR==FNR {file1[$1]=$1; next} {$2=file[$1]; print}' G_S_map.txt S_P_map.txt
Any ideas/help is much appreciated. Thanks in advance!

I would look at a combination of join and cut.

GNU AWK (gawk) 4 has BEGINFILE and ENDFILE which would be perfect for this. However, the gawk manual includes a function that will provide this functionality for most versions of AWK.
#!/usr/bin/awk
BEGIN {
FS = "|"
}
function beginfile(ignoreme) {
files++
}
function endfile(ignoreme) {
# endfile() would be defined here if we were using it
}
FILENAME != _oldfilename \
{
if (_oldfilename != "")
endfile(_oldfilename)
_oldfilename = FILENAME
beginfile(FILENAME)
}
END { endfile(FILENAME) }
files == 1 { # save all the key, value pairs from file 1
file1[$2] = $3
next
}
files == 2 { # save all the key, value pairs from file 2
file2[$1] = $2
next
}
files == 3 { # perform the lookup and output
print file1[file2[$2]], $1
}
# Place the regular END block here, if needed. It would be in addition to the one above (there can be more than one)
Call the script like this:
./scriptname G_P_map.txt G_S_map.txt S_P_map.txt

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

awk or other shell to convert delimited list into a table - shell

Related

Reverse complement SOME sequences in fasta file

Split command output into separate variables

Split CSV into two files based on column matching values in an array in bash / posh

AWK split for multiple delimiters lines

Shell script to combine three files using AWK

Categories

Resources