issue for condition on unique raws in bash - bash

I want to print rows of a table in a file, the issue is when I use a readline the reprint me the result several times, here is my input file
aa ,DEC ,file1.txt
aa ,CHAR ,file1.txt
cc ,CHAR ,file1.txt
dd ,DEC ,file2.txt
bb ,DEC ,file3.txt
bb ,CHAR ,file3.txt
cc ,DEC ,file1.txt
Here is the result I want to have:
printed in file1.txt
aa#DEC,CHAR
cc#CHAR,DEC
printed in file2.txt
dd#DEC
printed in file3.txt
bb#DEC,CHAR
here is it my attempt :
(cat input.txt|while read line
do
table=`echo $line|cut -d"," -f1
variable=`echo $line|cut -d"," -f2
file=`echo $line|cut -d"," -f3
echo ${table}#${variable},
done ) > ${file}

This can be done in a single pass gnu awk like this:
awk -F ' *, *' '{
map[$3][$1] = (map[$3][$1] == "" ? "" : map[$3][$1] ",") $2
}
END {
for (f in map)
for (d in map[f])
print d "#" map[f][d] > f
}' file
This will populate this data:
=== file1.txt ===
aa#DEC,CHAR
cc#CHAR,DEC
=== file2.txt ===
dd#DEC
=== file3.txt ===
bb#DEC,CHAR

With your shown samples, could you please try following, written and tested in shown samples in GNU awk.
awk '
{
sub(/^,/,"",$3)
}
FNR==NR{
sub(/^,/,"",$2)
arr[$1,$3]=(arr[$1,$3]?arr[$1,$3]",":"")$2
next
}
(($1,$3) in arr){
close(outputFile)
outputFile=$3
print $1"#"arr[$1,$3] >> (outputFile)
delete arr[$1,$3]
}
' Input_file Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
{
sub(/^,/,"",$3) ##Substituting starting comma in 3rd field with NULL.
}
FNR==NR{ ##Checking condition FNR==NR will be true when first time Input_file is being read.
sub(/^,/,"",$2) ##Substituting starting comma with NULL in 2nd field.
arr[$1,$3]=(arr[$1,$3]?arr[$1,$3]",":"")$2
##Creating arr with index of 1st and 3rd fields, which has 2nd field as value.
next ##next will skip all further statements from here.
}
(($1,$3) in arr){ ##Checking condition if 1st and 3rd fields are in arr then do following.
close(outputFile) ##Closing output file, to avoid "too many opened files" error.
outputFile=$3 ##Setting outputFile with value of 3rd field.
print $1"#"arr[$1,$3] >> (outputFile)
##printing 1st field # arr value and output it to outputFile here.
delete arr[$1,$3] ##Deleting array element with index of 1st and 3rd field here.
}
' Input_file Input_file ##Mentioning Input_file 2 times here.

You have several errors in your code. You can use the built-in read to split on a comma, and the parentheses are completely unnecessary.
while IFS=, read -r table variable file
do
echo "${table}#${variable}," >>"$file"
done< input.txt
Using $file in a redirect after done is an error; the shell wants to open the file handle to redirect to before file is defined. But as per your requirements, each line should go to a different `file.
Notice also quoting fixes and the omission of the useless cat.
Wrapping fields with the same value onto the same line would be comfortably easy with an Awk postprocessor, but then you might as well do all of this in Awk, as in the other answer you already received.

Related

awk from file using echo and output to file

A.txt contains:
/*333*/
asdfasdfadfg
sadfasdfasgadas
###
/*555*/
hfawehfihohawe
aweihfiwahif
aiwehfwwh
###
/*777*/
jawejfiawjia
ajwiejfjeiie
eiuehhawefjj
###
B.txt contains:
555
777
I want to create the loop, for each string found in B.txt, then output the '/*'[the string] until right before the first '###' met to each own file (the string name is also used as file name).
So based on the sample above, the result should be :
555.txt, which contains:
/*555*/
hfawehfihohawe
aweihfiwahif
aiwehfwwh
and 777.txt, which contains:
/*777*/
jawejfiawjia
ajwiejfjeiie
eiuehhawefjj
I tried this script but it outputs nothing:
for i in `cat B.txt`; do echo $i | awk '/{print "/*"$1}/{flag=1} /###/{flag=0} flag' A.txt > $i.txt; done
Thank you in advance
With your shown samples, please try following awk code. Written and tested in GNU awk should work in any awk.
awk '
FNR==NR{
if($0~/^\/\*/){
line=$0
gsub(/^\/\*|\*\/$/,"",line)
arr[++count]=$0
arr1[line]=count
next
}
arr[count]=(arr[count]?arr[count] ORS:"") $0
next
}
($0 in arr1){
outputFile=$0".txt"
print arr[arr1[$0]] >> (outputFile)
close(outputFile)
}
' file1 file2
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when file1 is being read.
if($0~/^\/\*/){ ##Checking condition if current line starts with /* then do following.
line=$0 ##Setting $0 to line variable here.
gsub(/^\/\*|\*\/$/,"",line) ##using gsub to globally substitute starting /* and ending */ with NULL in line here.
arr[++count]=$0 ##Creating arr with index of ++count and value is $0.
arr1[line]=count ##Creating arr1 with index of line and value of count.
next ##next will skip all further statements from here.
}
arr[count]=(arr[count]?arr[count] ORS:"") $0 ##Creating arr with index of count and keep appending values of same count values with current line value.
next ##next will skip all further statements from here.
}
($0 in arr1){ ##checking if current line is present in arr1 then do following.
outputFile=$0".txt" ##Creating outputFile with current line .txt value here.
print arr[arr1[$0]] >> (outputFile) ##Printing arr value with index of arr1[$0] to outputFile.
close(outputFile) ##Closing outputFile in backend to avoid too many opened files error.
}
' file1 file2 ##Mentioning Input_file names here.
Making a few alterations to your code provides the desired outcome with the example data provided:
while read -r f
do
awk -v var="/[*]$f[*]/" '$0 ~ var {flag=1} /###/{flag=0} flag' A.txt > "$f".txt
done < B.txt
cat 555.txt
/*555*/
hfawehfihohawe
aweihfiwahif
aiwehfwwh
cat 777.txt
jawejfiawjia
ajwiejfjeiie
eiuehhawefjj
Does this solve your problem?
Here is another awk solution for this:
awk '
FNR == NR {
map["/*" $0 "*/"] = $0
next
}
$0 in map {
fn = map[$0] ".txt"
}
/^###$/ {
close(fn)
fn = ""
}
fn {print > fn}' B.txt A.txt

Regex pattern as variable in AWK

Let's say I have a file with multiple fields and field 1 needs to be filtered for 2 conditions. I was thinking of turning those conditions into a regex pattern and pass them as variables to the awk statement. For some reason, they are not filtering out the records at all. Here is my attempt that runs fine, but doesn't filter out the results per conditions, except when fed directly into awk without variable assignment.
regex1="/abc|def/"; # match first field for abc or def;
regex2="/123|567/"; # and also match the first field for 123 or 567;
cat file_name \
| awk -v pat1="${regex1}" -v pat2="${regex2}" 'BEGIN{FS=OFS="\t"} {if ( ($1~pat1) && ($1~pat2) ) print $0}'
Update: Fixed a syntax error related to missing parenthesis for the if conditions in the awk. (I had it fixed in the code I ran).
Sample data
abc:567 1
egf:888 2
Expected output
abc:567 1
The problem is that I am getting all the results instead of the ones that satisfy the 2 regex for field 1
Note that the match needs to be wildcarded instead of exact match. Meaning 567 as defined in the regex pattern should also match on 567_1 if available.
It seems like the way to implement what you want to do would be:
awk -F'\t' '
($1 ~ /abc|def/) &&
($1 ~ /123|567/)
' file
or probably more robustly:
awk -F'\t' '
{ split($1,a,/:/) }
(a[1] ~ /abc|def/) &&
(a[2] ~ /123|567/)
' file
What's wrong with that?
EDIT here is me running the OPs code before and after fixing the inclusion of regexp delimiters (/) in the dynamic regexp strings:
$ cat tst.sh
#!/usr/bin/env bash
regex1="/abc|def/"; #--match first field for abc or def;
regex2="/123|567/"; #--and also match the first field for 123 or 567;
cat file_name \
| awk -v pat1="${regex1}" -v pat2="${regex2}" 'BEGIN{FS=OFS="\t"} $1~pat1 && $1~pat2'
echo "###################"
regex1="abc|def"; #--match first field for abc or def;
regex2="123|567"; #--and also match the first field for 123 or 567;
cat file_name \
| awk -v pat1="${regex1}" -v pat2="${regex2}" 'BEGIN{FS=OFS="\t"} $1~pat1 && $1~pat2'
$
$ ./tst.sh
###################
abc:567 1
EDIT: Since OP has changed the samples, so adding this solution here, this will work for partial matches also, again written and tested with shown samples in GNU awk.
awk -F':|[[:space:]]+' -v var1="abc|def" -v var2="123|567" '
BEGIN{
num=split(var1,arr1,"|")
split(var2,arr2,"|")
for(i=1;i<=num;i++){
reg1[arr1[i]]
reg2[arr2[i]]
}
}
{
for(i in reg1){
if(index($1,i)){
for(j in reg2){
if(index($2,j)){ print; next }
}
}
}
}
' Input_file
Let's say following is an Input_file:
cat Input_file
abc_2:567_3 1
egf:888 2
Now after running above code we will get abc_2:567_3 1 in output.
With your shown samples only, could you please try following. Written and tested in GNU awk. Give your values which you you want to look for in 1st column in var1 and those which you want to look in 2nd field in var2 variables respectively with pipe delimiter in it.
awk -F':|[[:space:]]+' -v var1="abc|def" -v var2="123|567" '
BEGIN{
num=split(var1,arr1,"|")
split(var2,arr2,"|")
for(i=1;i<=num;i++){
reg1[arr1[i]]
reg2[arr2[i]]
}
}
($1 in reg1) && ($2 in reg2)
' Input_file
Explanation: Adding detailed explanation for above.
awk -F':|[[:space:]]+' -v var1="abc|def" -v var2="123|567" ' ##Starting awk program from here.
##Setting field separator as colon or spaces, setting var1 and var2 values here.
BEGIN{ ##Starting BEGIN section of this program from here.
num=split(var1,arr1,"|") ##Splitting var1 to arr1 here.
split(var2,arr2,"|") ##Splitting var2 to arr2 here.
for(i=1;i<=num;i++){ ##Running for loop from 1 to till value of num here.
reg1[arr1[i]] ##Creating reg1 with index of arr1 value here.
reg2[arr2[i]] ##Creating reg1 with index of arr2 value here.
}
}
($1 in reg1) && ($2 in reg2) ##Checking condition if 1st field is present in reg1 AND in reg2 then print that line.
' Input_file ##Mentioning Input_file name here.

replace names in fasta

I want to change the sequence names in a fasta file according a text file containing new names. I found several approaches but seqkit made a good impression, anyway I can´t get it running. Replace key with value by key-value file
The fasta file seq.fa looks like
>BC1
ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
>BC2
TGCATGCATGCATGCATGCATGCATGCATGCATGCATGCG
GCATGCATGCATGCATGCATGCATGCATGCATGCG
>BC3
GCATGCATGCATGCATGCATGCATGCATGCATGCCCCCCC
TGCATGCATGCATG
and the ref.txt tab delimited text file like
BC1 1234
BC2 1235
BC3 1236
using siqkit in Git Bash runs trough the file but doesn´t change the names.
seqkit replace -p' (.+)$' -r' {kv}' -k ref.txt seq.fa --keep-key
I´m used to r and new to bash and can´t find the bug but guess I need to adjust for tab and _ ?
As in the example https://bioinf.shenwei.me/seqkit/usage/#replace part 7. Replace key with value by key-value file the sequence name is tab delimited and only the second part is replaced.
Advise how to adjust the code?
Desired outcome should look like: Replacing BC1 by the number in the text file 1234
>1234
ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
>1235
TGCATGCATGCATGCATGCATGCATGCATGCATGCATGCG
GCATGCATGCATGCATGCATGCATGCATGCATGCG
>1236
GCATGCATGCATGCATGCATGCATGCATGCATGCCCCCCC
TGCATGCATGCATG
could you please try following.
awk '
FNR==NR{
a[$1]=$2
next
}
($2 in a) && /^>/{
print ">"a[$2]
next
}
1
' ref.txt FS="[> ]" seq.fa
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program here.
FNR==NR{ ##FNR==NR is condition which will be TRUE when 1st Input_file named ref.txt will be read.
a[$1]=$2 ##Creating an array named a whose index is $1 and value is $2 of current line.
next ##next will skip all further statements from here.
} ##Closing BLOCK for FNR==NR condition here.
($2 in a) && /^>/{ ##Checking condition if $2 of current line is present in array a and starts with > then do following.
print ">"a[$2] ##Printing > and value of array a whose index is $2.
next ##next will skip all further statements from here.
}
1 ##Mentioning 1 will print the lines(those which are NOT starting with > in Input_file seq.fa)
' ref.txt FS="[> ]" seq.fa ##Mentioning Input_file names here and setting FS= either space or > for Input_file seq.fa here.
EDIT: As per OP's comment need to add >1234_1 occurrence number too in output so adding following code now.
awk '
FNR==NR{
a[$1]=$2
b[$1]=++c[$2]
next
}
($2 in a) && /^>/{
print ">"a[$2]"_"b[$2]
next
}
1
' ref.txt FS="[> ]" seq.fa
awk solution that doesn't require GNU awk:
awk 'NR==FNR{a[$1]=$2;next}
NF==2{$2=a[$2]; print ">" $2;next}
1' FS='\t' ref.txt FS='>' seq.fa
The first statement is filling the array a with the content of the tab delimited file ref.txt.
The second statement prints all lines of the second files seq.fa with 2 fields given the > as field delimiter.
The last statement prints all lines of that same file.

How do I join lines using space and comma

I have the file that contains content like:
IP
111
22
25
I want to print the output in the format IP 111,22,25.
I have tried tr ' ' , but its not working
Welcome to paste
$ paste -sd " ," file
IP 111,22,25
Normally what paste does is it writes to standard output lines consisting of sequentially corresponding lines of each given file, separated by a <tab>-character. The option -s does it differently. It states to paste each line of the files sequentially with a <tab>-character as a delimiter. When using the -d flag, you can give a list of delimiters to be used instead of the <tab>-character. Here I gave as a list " ," indicating, use space and then only commas.
In pure Bash:
# Read file into array
mapfile -t lines < infile
# Print to string, comma-separated from second element on
printf -v str '%s %s' "${lines[0]}" "$(IFS=,; echo "${lines[*]:1}")"
# Print
echo "$str"
Output:
IP 111,22,25
I'd go with:
{ read a; read b; read c; read d; } < file
echo "$a $b,$c,$d"
This will also work:
xargs printf "%s %s,%s,%s" < file
Try cat file.txt | tr '\n' ',' | sed "s/IP,/IP /g"
tr deletes new lines, sed changes IP,111,22,25 into IP 111,22,25
The following awk script will do the requested:
awk 'BEGIN{OFS=","} FNR==1{first=$0;next} {val=val?val OFS $0:$0} END{print first FS val}' Input_file
Explanation: Adding explanation for above code now.
awk ' ##Starting awk program here.
BEGIN{ ##Starting BEGIN section here of awk program.
OFS="," ##Setting OFS as comma, output field separator.
} ##Closing BEGIN section of awk here.
FNR==1{ ##Checking if line is first line then do following.
first=$0 ##Creating variable first whose value is current first line.
next ##next keyword is awk out of the box keyword which skips all further statements from here.
} ##Closing FNR==1 BLOCK here.
{ ##This BLOCK will be executed for all lines apart from 1st line.
val=val?val OFS $0:$0 ##Creating variable val whose values will be keep concatenating its own value.
}
END{ ##Mentioning awk END block here.
print first FS val ##Printing variable first FS(field separator) and variable val value here.
}' Input_file ##Mentioning Input_file name here which is getting processed by awk.
Using Perl
$ cat captain.txt
IP
111
22
25
$ perl -0777 -ne ' #k=split(/\s+/); print $k[0]," ",join(",",#k[1..$#k]) ' captain.txt
IP 111,22,25
$

grep a string from a specific block of text

Some help required please...
I have a block of text in a file on my Linux machine like this;
Block.1:\
:Value1=something:\
:Value2=something_else:\
:Value3=something_other:
Block.2:\
:Value1=something:\
:Value2=something_else:\
:Value3=something_other:
Block.n:\
:Value1=something:\
:Value2=something_else:\
:Value3=something_other:
How can I use grep (and/or possibly awk?) to pluck out e.g Value2 from Block.2 only?
Blocks won't always be ordered sequentially (they have arbitary names) but will always be unique.
Colon and backslash positions are absolute.
TIA, Rob.
Following awk may help you in same.
awk -F"=" '/^Block\.2/{flag=1} flag && /Value2/{print $2;flag=""}' Input_file
Output will be as follows.
something_else:\
In case you want to print full line of value2 in block2 then change from print $2 to print in above code.
Explanation: Adding explanation of above code too now.
awk -F"=" ' ##Creating field separator as = for each line of Input_file.
/Block\.2/{ ##Checking condition if a line is having string Block.2, here I have escaped . to refrain its special meaning, if condition is TRUE then do follow:
flag=1 ##Setting variable flag value as 1, which indicates that flag is TRUE.
}
flag && /Value2/{ ##Checking condition if flag value is TRUE and line is having string Value2 in it then do following:
print $2; ##Printing 2nd field of the current line.
flag="" ##Nullifying the variable flag now.
}
' Input_file ##Mentioning the Input_file name here.
$ cat tst.awk
BEGIN { FS="[:=]" }
NF==2 { f = ($1 == "Block.2" ? 1 : 0) }
f && ($2 == "Value2") { print $3 }
$ awk -f tst.awk file
something_else
grep -A 2 "Block.2" | tail -1 | cut -d= -f2
explanation :
grep -A look for a pattern and prints 2 more lines (till Value2)
tail -1 gets the last line (the one with Value2)
cut use "=" as a field separator and prints second field

Resources