Append delimiters for implied blank fields - shell
I am looking for a simple solution that gives every line of a file (a CSV file) the same number of commas.
Example file:
1,1
A,B,C,D,E,F
2,2,
3,3,3,
4,4,4,4
expected:
1,1,,,,
A,B,C,D,E,F
2,2,,,,
3,3,3,,,
4,4,4,4,,
The line with the largest number of commas has 5 commas in this case (line #2), so I want to append commas to all the other lines so that each line has the same number (i.e. 5 commas).
Using awk:
$ awk 'BEGIN{FS=OFS=","} {$6=$6} 1' file
1,1,,,,
A,B,C,D,E,F
2,2,,,,
3,3,3,,,
4,4,4,4,,
As you can see above, in this approach the max. number of fields must be hardcoded in the command.
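The field index need not be a literal, though; the same one-liner works with the count passed in as a variable (a minimal variation, assuming the file still tops out at 6 fields -- awk extends shorter records with OFS separators when you assign to a field past NF):
$ awk -v n=6 'BEGIN{FS=OFS=","} {$n=$n} 1' file
1,1,,,,
A,B,C,D,E,F
2,2,,,,
3,3,3,,,
4,4,4,4,,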
Another take on making all lines in a CSV file have the same number of fields. The number of fields need not be known: the maximum is calculated, and a substring of the needed commas is appended to each record, e.g.
awk -F, -v max=0 '{
    lines[n++] = $0             # store lines indexed by line number
    fields[lines[n-1]] = NF     # store number of fields indexed by $0
    if (NF > max)               # find max NF value
        max = NF
}
END {
    for(i=0;i<max;i++)          # form string with max commas
        commastr=commastr","
    for(i=0;i<n;i++)            # append needed substring of commas per line
        printf "%s%s\n", lines[i], substr(commastr,1,max-fields[lines[i]])
}' file
Example Use/Output
Pasting at the command-line, you would receive:
$ awk -F, -v max=0 '{
>     lines[n++] = $0             # store lines indexed by line number
>     fields[lines[n-1]] = NF     # store number of fields indexed by $0
>     if (NF > max)               # find max NF value
>         max = NF
> }
> END {
>     for(i=0;i<max;i++)          # form string with max commas
>         commastr=commastr","
>     for(i=0;i<n;i++)            # append needed substring of commas per line
>         printf "%s%s\n", lines[i], substr(commastr,1,max-fields[lines[i]])
> }' file
1,1,,,,
A,B,C,D,E,F
2,2,,,,
3,3,3,,,
4,4,4,4,,
Could you please try following, a more generic way. This code will work even when the number of fields is not the same across your Input_file: it first reads the file to get the maximum number of fields, and then, reading the file a second time, it resets the fields (because OFS is set to ",", if the current line has fewer fields than the nf value, that many commas are appended to that line). An enhanced version of @oguz ismail's answer.
awk '
BEGIN{
FS=OFS=","
}
FNR==NR{
nf=nf>NF?nf:NF
next
}
{
$nf=$nf
}
1
' Input_file Input_file
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section of awk program from here.
FS=OFS="," ##Setting FS and OFS as comma for all lines here.
}
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when first time Input_file is being read.
nf=nf>NF?nf:NF ##Creating variable nf whose value is set per the condition: if nf is greater than NF then keep nf as it is, else set nf to NF, so nf tracks the maximum field count seen.
next ##next will skip all further statements from here.
}
{
$nf=$nf ##Mentioning $nf=$nf will rebuild the current line's value and will add comma(s) at the end of the line if NF is less than nf.
}
1 ##1 will print edited/non-edited lines here.
' Input_file Input_file ##Mentioning Input_file names here.
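A quick sanity check after running either approach (a sketch, assuming the output is piped straight into a field count -- a single number from sort -u confirms every line now has the same number of fields):
$ awk 'BEGIN{FS=OFS=","} FNR==NR{nf=nf>NF?nf:NF; next} {$nf=$nf} 1' Input_file Input_file | awk -F, '{print NF}' | sort -u
6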
Related
Merge rows with same value and every 100 lines in csv file using command
I have a csv file like below:
http://www.a.com/1,apple
http://www.a.com/2,apple
http://www.a.com/3,apple
http://www.a.com/4,apple
...
http://www.z.com/1,flower
http://www.z.com/2,flower
http://www.z.com/3,flower
...
I want to combine the csv file into a new csv file like below:
"http://www.a.com/1
http://www.a.com/2
http://www.a.com/3
http://www.a.com/4
",apple
"http://www.z.com/1
http://www.z.com/2
http://www.z.com/3
http://www.z.com/4
...
http://www.z.com/100
",flower
"http://www.z.com/101
http://www.z.com/102
http://www.z.com/103
http://www.z.com/104
...
http://www.z.com/200
",flower
I want each cell in the first column to hold at most 100 http urls; the second-column value they share appears in the corresponding cell. Is there a very simple command pattern to achieve this idea? I used the command below:
awk '{if(NR%100!=0)ORS="\t";else ORS="\n"}1' test.csv > result.csv
$ awk -F, '$2!=p || n==100 {if(NR!=1) print "\"," p; printf "\""; p=$2; n=0} {print $1; n+=1} END {print "\"," p}' test.csv
"http://www.a.com/1
http://www.a.com/2
http://www.a.com/3
http://www.a.com/4
",apple
"http://www.z.com/1
http://www.z.com/2
http://www.z.com/3
",flower
First set the field separator to the comma (-F,). Then:
If the second field changes ($2!=p) or if we already printed 100 lines in the current batch (n==100):
- if it is not the first line, print a double quote, a comma, the previous second field and a newline,
- print a double quote,
- store the new second field in variable p for later comparisons,
- reset line counter n.
For all lines, print the first field and increment line counter n.
At the end, print a double quote, a comma and the last value of the second field.
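The batch size need not be hardcoded either; passing it as a variable makes the n==100 branch easy to exercise (a minimal variation of the script above, shown with a batch limit of 2 on an assumed seven-line sample of four apple rows and three flower rows):
$ awk -F, -v m=2 '$2!=p || n==m {if(NR!=1) print "\"," p; printf "\""; p=$2; n=0} {print $1; n+=1} END {print "\"," p}' test.csv
"http://www.a.com/1
http://www.a.com/2
",apple
"http://www.a.com/3
http://www.a.com/4
",apple
"http://www.z.com/1
http://www.z.com/2
",flower
"http://www.z.com/3
",flower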
1st solution: With your shown samples, please try following awk code.
awk '
BEGIN{
  s1="\""
  FS=OFS=","
}
prev!=$2 && prev{
  print s1 val s1,prev
  val=""
}
{
  val=(val?val ORS:"")$1
  prev=$2
}
END{
  if(val){
    print s1 val s1,prev
  }
}
' Input_file
2nd solution: In case your Input_file is NOT sorted by its 2nd column, then try following sort + awk code.
sort -t, -k2 Input_file |
awk '
BEGIN{
  s1="\""
  FS=OFS=","
}
prev!=$2 && prev{
  print s1 val s1,prev
  val=""
}
{
  val=(val?val ORS:"")$1
  prev=$2
}
END{
  if(val){
    print s1 val s1,prev
  }
}
'
Output will be as follows:
"http://www.a.com/1
http://www.a.com/2
http://www.a.com/3
http://www.a.com/4",apple
"http://www.z.com/1
http://www.z.com/2
http://www.z.com/3",flower
Given:
$ cat file
http://www.a.com/1,apple
http://www.a.com/2,apple
http://www.a.com/3,apple
http://www.a.com/4,apple
http://www.z.com/1,flower
http://www.z.com/2,flower
http://www.z.com/3,flower
Here is a two pass awk to do this:
awk -F, '
FNR==NR{seen[$2]=FNR; next}
seen[$2]==FNR{
  printf("\"%s%s\"\n,%s\n",data,$1,$2)
  data=""
  next
}
{data=data sprintf("%s\n",$1)}
' file file
If you want to print either at the change of the $2 value or at some fixed line interval (like 100) you can do:
awk -F, -v n=100 '
FNR==NR{seen[$2]=FNR; next}
seen[$2]==FNR || FNR%n==0{
  printf("\"%s%s\"\n,%s\n",data,$1,$2)
  data=""
  next
}
{data=data sprintf("%s\n",$1)}
' file file
Either prints:
"http://www.a.com/1
http://www.a.com/2
http://www.a.com/3
http://www.a.com/4"
,apple
"http://www.z.com/1
http://www.z.com/2
http://www.z.com/3"
,flower
divide each column by max value/last value
I have a matrix like this:
A 25 27 50
B 35 37 475
C 75 78 80
D 99 88 76
0 234 230 681
The last row is the sum of all elements in the column - and it is also the maximum value. What I would like to get is the matrix in which each value is divided by the last value in its column (e.g. for the first number in column 2, I would want "25/234="):
A 0.106837606837607 0.117391304347826 0.073421439060206
B 0.14957264957265 0.160869565217391 0.697503671071953
C 0.320512820512821 0.339130434782609 0.117474302496329
D 0.423076923076923 0.382608695652174 0.11160058737151
An answer in another thread gives an acceptable result for one column, but I was not able to loop it over all columns.
$ awk 'FNR==NR{max=($2+0>max)?$2:max;next} {print $1,$2/max}' file file
(this answer was provided here: normalize column data with maximum value of that column)
I would be grateful for any help!
In addition to the great approaches by @RavinderSingh13, you can also isolate the last line in the input file with, e.g., tail -n1 Input_file and then use the split() command in the BEGIN rule to separate the values. You can then make a single pass through the file with awk to update the values as you indicate. In the end, you can pipe the output to head -n-1 to remove the unneeded final row, e.g.
awk -v lline="$(tail -n1 Input_file)" '
BEGIN { split(lline,a," ") }
{
    printf "%s", $1
    for(i=2; i<=NF; i++)
        printf " %.15lf", $i/a[i]
    print ""
}
' Input_file | head -n-1
Example Use/Output
$ awk -v lline="$(tail -n1 Input_file)" '
> BEGIN { split(lline,a," ") }
> {
>     printf "%s", $1
>     for(i=2; i<=NF; i++)
>         printf " %.15lf", $i/a[i]
>     print ""
> }
> ' Input_file | head -n-1
A 0.106837606837607 0.117391304347826 0.073421439060206
B 0.149572649572650 0.160869565217391 0.697503671071953
C 0.320512820512821 0.339130434782609 0.117474302496329
D 0.423076923076923 0.382608695652174 0.111600587371512
(note: this presumes you don't have trailing blank lines in your file and you really don't have blank lines between every row. If you do, let me know)
The differences between the approaches are largely negligible. In each case you are making a total of 3 passes through the file: here with tail, awk and then head; in the other case with wc and then two passes with awk. Let either of us know if you have questions.
1st solution: Could you please try following, written and tested with shown samples in GNU awk. With exactly 15 floating points as per OP's shown samples:
awk -v lines=$(wc -l < Input_file) '
FNR==NR{
  if(FNR==lines){
    for(i=2;i<=NF;i++){ arr[i]=$i }
  }
  next
}
FNR<lines{
  for(i=2;i<=NF;i++){
    $i=sprintf("%0.15f",(arr[i]?$i/arr[i]:"NaN"))
  }
  print
}
' Input_file Input_file
2nd solution: If you don't care about the floating points being that specific, then try following.
awk -v lines=$(wc -l < Input_file) '
FNR==NR && FNR==lines{
  for(i=2;i<=NF;i++){ arr[i]=$i }
  next
}
FNR<lines && FNR!=NR{
  for(i=2;i<=NF;i++){
    $i=(arr[i]?$i/arr[i]:"NaN")
  }
  print
}
' Input_file Input_file
OR (placing the condition FNR==lines inside the FNR==NR condition):
awk -v lines=$(wc -l < Input_file) '
FNR==NR{
  if(FNR==lines){
    for(i=2;i<=NF;i++){ arr[i]=$i }
  }
  next
}
FNR<lines{
  for(i=2;i<=NF;i++){
    $i=(arr[i]?$i/arr[i]:"NaN")
  }
  print
}
' Input_file Input_file
Explanation: Adding detailed explanation for above.
awk -v lines=$(wc -l < Input_file) ' ##Starting awk program from here, creating lines variable which has total number of lines of Input_file in it.
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when first time Input_file is being read.
  if(FNR==lines){ ##Checking if FNR is equal to lines, then do following.
    for(i=2;i<=NF;i++){ arr[i]=$i } ##Traversing through all fields of the current (last) line and creating an array arr with index i and value of the current field.
  }
  next ##next will skip all further statements from here.
}
FNR<lines{ ##Checking condition if current line number is lesser than lines; this will execute when 2nd time Input_file is being read.
  for(i=2;i<=NF;i++){ $i=sprintf("%0.15f",(arr[i]?$i/arr[i]:"NaN")) } ##Traversing through all fields and saving the current field divided by arr's value for that field, with 15 floating points, into the current field.
  print ##Printing current line here.
}
' Input_file Input_file ##Mentioning Input_file names here.
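Since the last row holds the column sums, a quick check on either solution's output is that every numeric column should sum to (roughly) 1 -- a sketch, assuming the normalized output was saved to a hypothetical file named normalized_output:
$ awk '{for(i=2;i<=NF;i++) s[i]+=$i} END{for(i=2;i<=NF;i++) printf " %.2f", s[i]; print ""}' normalized_output
 1.00 1.00 1.00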
awk to input column data from one file to another based on a match
Objective
I am trying to fill out $9 (booking ref) and $10 (client) in file1.csv with information pulled from $2 (booking ref) and $3 (client) of file2.csv, using "CAMPAIGN ID" ($5 in file1.csv and $1 in file2.csv). So where I have a match between the two files based on "CAMPAIGN ID", I want to print the columns of file2.csv into the matching rows of file1.csv.
File1.csv
INVOICE,CLIENT,PLATFORM,CAMPAIGN NAME,CAMPAIGN ID,IMPS,TFS,PRICE,Booking Ref,client
BOB-UK,clientname1,platform_1,campaign1,20572431,5383594,0.05,2692.18,,
BOB-UK,clientname2,platform_1,campaign2,20589101,4932821,0.05,2463.641,,
BOB-UK,clientname1,platform_1,campaign3,23030494,4795549,0.05,2394.777,,
BOB-UK,clientname1,platform_1,campaign4,22973424,5844194,0.05,2925.21,,
BOB-UK,clientname1,platform_1,campaign5,21489000,4251031,0.05,2122.552,,
BOB-UK,clientname1,platform_1,campaign6,23150347,3123945,0.05,1561.197,,
BOB-UK,clientname3,platform_1,campaign7,23194965,2503875,0.05,1254.194,,
BOB-UK,clientname3,platform_1,campaign8,20578983,1522448,0.05,765.1224,,
BOB-UK,clientname3,platform_1,campaign9,22243554,920166,0.05,463.0083,,
BOB-UK,clientname1,platform_1,campaign10,20572149,118865,0.05,52.94325,,
BOB-UK,clientname2,platform_1,campaign11,23077785,28077,0.05,14.40385,,
BOB-UK,clientname2,platform_1,campaign12,21811100,5439,0.05,5.27195,,
File2.csv
CAMPAIGN ID,Booking Ref,client
20572431,ref1,1
21489000,ref2,1
23030494,ref3,1
22973424,ref4,1
23150347,ref5,1
20572149,ref6,1
20578983,ref7,2
22243554,ref8,2
20589101,ref9,3
23077785,ref10,3
21811100,ref11,3
23194965,ref12,3
Desired Output
INVOICE,CLIENT,PLATFORM,CAMPAIGN NAME,CAMPAIGN ID,IMPS,TFS,PRICE,Booking Ref,client
BOB-UK,clientname1,platform_1,campaign1,20572431,5383594,0.05,2692.18,ref1,1
BOB-UK,clientname2,platform_1,campaign2,20589101,4932821,0.05,2463.641,ref9,3
BOB-UK,clientname1,platform_1,campaign3,23030494,4795549,0.05,2394.777,ref3,1
BOB-UK,clientname1,platform_1,campaign4,22973424,5844194,0.05,2925.21,ref4,1
BOB-UK,clientname1,platform_1,campaign5,21489000,4251031,0.05,2122.552,ref2,1
BOB-UK,clientname1,platform_1,campaign6,23150347,3123945,0.05,1561.197,ref5,1
BOB-UK,clientname3,platform_1,campaign7,23194965,2503875,0.05,1254.194,ref12,3
BOB-UK,clientname3,platform_1,campaign8,20578983,1522448,0.05,765.1224,ref7,2
BOB-UK,clientname3,platform_1,campaign9,22243554,920166,0.05,463.0083,ref8,2
BOB-UK,clientname1,platform_1,campaign10,20572149,118865,0.05,52.94325,ref6,1
BOB-UK,clientname2,platform_1,campaign11,23077785,28077,0.05,14.40385,ref10,3
BOB-UK,clientname2,platform_1,campaign12,21811100,5439,0.05,5.27195,ref11,3
What I've tried
From the research I've done online, this appears to be possible using awk and join (How to merge two files using AWK? got me the closest out of what I found online). I've tried various awk codes I've found online and I can't seem to get them to achieve my goal. Below is the code I've been trying to massage into working that gets me the closest. At current the code is set up to try to populate just the booking ref, as I presume I can just rinse and repeat it for the client column. With this code I was able to get it to populate the booking ref, but it required me to move CAMPAIGN ID to $1, and all it did was replace the values.
NOTE: The order of file1.csv won't sync with file2.csv. All rows may be in a different order, as shown in this example.
Current code
awk -F"," -v OFS=',' 'BEGIN { while (getline < "file2.csv") { f[$1] = $2; } } { print $0, f[$1] }' file1.csv
Can someone confirm where I'm going wrong with this code? I've tried altering the columns in it - and in the file - without success. Maybe it's just how I'm understanding the code itself.
Like this:
awk 'BEGIN{FS=OFS=","} NR==FNR{r[$1]=$2;c[$1]=$3;next} NR>1{$9=r[$5];$10=c[$5]} 1' \
    file2.csv file1.csv
Explanation in multi line form:
# Set input and output field delimiter to ,
BEGIN{
    FS=OFS=","
}
# Total row number is the same as the row number in file
# as long as we are reading the first file, file2.csv
NR==FNR{
    # Store booking ref and client id indexed by campaign id
    r[$1]=$2
    c[$1]=$3
    # Skip blocks below
    next
}
# From here code runs only on file1.csv
NR>1{
    # Set booking ref and client id according to the campaign id
    # in field 5
    $9=r[$5]
    $10=c[$5]
}
# Print the modified line of file1.csv (includes the header line)
{
    print
}
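For comparison, the same lookup can be sketched with coreutils join instead of awk -- a hedged sketch, assuming bash (for process substitution) and that the campaign IDs, all 8 digits here, compare consistently under the default lexicographic sort. The header is carried over separately, and rows come out in campaign-ID order rather than the original file order:
$ head -n1 file1.csv
$ join -t, -1 5 -2 1 -o 1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,2.2,2.3 \
    <(tail -n +2 file1.csv | sort -t, -k5,5) \
    <(tail -n +2 file2.csv | sort -t, -k1,1)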
Could you please try following.
awk '
BEGIN{
  FS=OFS=","
  print "INVOICE,CLIENT,PLATFORM,CAMPAIGN NAME,CAMPAIGN ID,IMPS,TFS,PRICE,Booking Ref,client"
}
FNR==NR && FNR>1{
  val=$1
  $1=""
  sub(/^,/,"")
  a[val]=$0
  next
}
($5 in a) && FNR>1{
  sub(/,*$/,"")
  print $0,a[$5]
}
' file2.csv file1.csv
Explanation: Adding explanation for above code.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section of code here.
  FS=OFS="," ##Setting FS(field separator) and OFS(output field separator) as comma here.
  print "INVOICE,CLIENT,PLATFORM,CAMPAIGN NAME,CAMPAIGN ID,IMPS,TFS,PRICE,Booking Ref,client" ##Printing the header line here.
} ##Closing BEGIN section of this program now.
FNR==NR && FNR>1{ ##Checking condition FNR==NR which will be TRUE when file2.csv is being read, skipping its header line.
  val=$1 ##Creating variable val whose value is $1 here.
  $1="" ##Nullifying $1 here.
  sub(/^,/,"") ##Substituting initial comma with NULL in this line.
  a[val]=$0 ##Creating an array a whose index is val and value is $0.
  next ##next will skip all further statements from here.
} ##Closing BLOCK for condition FNR==NR here.
($5 in a) && FNR>1{ ##Checking if $5 is present in array a; this condition will be checked when file1.csv is being read.
  sub(/,*$/,"") ##Substituting all commas at the end of the line with NULL here.
  print $0,a[$5] ##Printing current line and value of array a with index $5 here.
} ##Closing BLOCK for above ($5 in a) condition here.
' file2.csv file1.csv ##Mentioning Input_file names here.
Output will be as follows.
INVOICE,CLIENT,PLATFORM,CAMPAIGN NAME,CAMPAIGN ID,IMPS,TFS,PRICE,Booking Ref,client
BOB-UK,clientname1,platform_1,campaign1,20572431,5383594,0.05,2692.18,ref1,1
BOB-UK,clientname2,platform_1,campaign2,20589101,4932821,0.05,2463.641,ref9,3
BOB-UK,clientname1,platform_1,campaign3,23030494,4795549,0.05,2394.777,ref3,1
BOB-UK,clientname1,platform_1,campaign4,22973424,5844194,0.05,2925.21,ref4,1
BOB-UK,clientname1,platform_1,campaign5,21489000,4251031,0.05,2122.552,ref2,1
BOB-UK,clientname1,platform_1,campaign6,23150347,3123945,0.05,1561.197,ref5,1
BOB-UK,clientname3,platform_1,campaign7,23194965,2503875,0.05,1254.194,ref12,3
BOB-UK,clientname3,platform_1,campaign8,20578983,1522448,0.05,765.1224,ref7,2
BOB-UK,clientname3,platform_1,campaign9,22243554,920166,0.05,463.0083,ref8,2
BOB-UK,clientname1,platform_1,campaign10,20572149,118865,0.05,52.94325,ref6,1
BOB-UK,clientname2,platform_1,campaign11,23077785,28077,0.05,14.40385,ref10,3
BOB-UK,clientname2,platform_1,campaign12,21811100,5439,0.05,5.27195,ref11,3
Add unique value from first column before each group
I have following file contents:
T12 19/11/19 2000
T12 18/12/19 2040

T15 19/11/19 2000
T15 18/12/19 2080
How to get following output with awk, bash, etc.? I searched for similar examples but didn't find one so far:
T12
19/11/19 2000
18/12/19 2040

T15
19/11/19 2000
18/12/19 2080
Thanks, S
Could you please try following. This code will print output in the same order in which the first field occurs in Input_file.
awk '
!a[$1]++ && NF{
  b[++count]=$1
}
NF{
  val=$1
  $1=""
  sub(/^ +/,"")
  c[val]=(c[val]?c[val] ORS:"")$0
}
END{
  for(i=1;i<=count;i++){
    print b[i] ORS c[b[i]]
  }
}
' Input_file
Output will be as follows.
T12
19/11/19 2000
18/12/19 2040
T15
19/11/19 2000
18/12/19 2080
Explanation: Adding detailed explanation for above code here.
awk ' ##Starting awk program from here.
!a[$1]++ && NF{ ##Checking condition if $1 is NOT present in array a and line is NOT NULL then do following.
  b[++count]=$1 ##Creating an array named b whose index is variable count(every time its value increases cursor comes here) and its value is first field of current line.
} ##Closing BLOCK for this condition now.
NF{ ##Checking condition if a line is NOT NULL then do following.
  val=$1 ##Creating variable named val whose value is $1 of current line.
  $1="" ##Nullifying $1 here of current line.
  sub(/^ +/,"") ##Substituting initial space with NULL now in line.
  c[val]=(c[val]?c[val] ORS:"")$0 ##Creating an array c whose index is variable val and whose value keeps concatenating to its own value with ORS in between.
} ##Closing BLOCK for this condition here.
END{ ##Starting END block for this awk program here.
  for(i=1;i<=count;i++){ ##Starting a for loop which runs from i=1 till value of variable count.
    print b[i] ORS c[b[i]] ##Printing array b whose index is i and array c whose index is array b's value with index i.
  }
} ##Closing this program's END block here.
' Input_file ##Mentioning Input_file name here.
Here is a quick awk:
$ awk 'BEGIN{RS="";ORS="\n\n"}{printf "%s\n",$1; gsub($1" +",""); print}' file
How does it work?
Awk knows the concepts of records and fields. Files are split into records, where consecutive records are split by the record separator RS. Each record is split into fields, where consecutive fields are split by the field separator FS. By default, the record separator RS is set to the <newline> character (\n), and thus each record is a line. The record separator has the following definition:
RS: The first character of the string value of RS shall be the input record separator; a <newline> by default. If RS contains more than one character, the results are unspecified. If RS is null, then records are separated by sequences consisting of a <newline> plus one or more blank lines; leading or trailing blank lines shall not result in empty records at the beginning or end of the input; and a <newline> shall always be a field separator, no matter what the value of FS is.
So with the file format you give, we can define the records based on RS="". By default, the field separator is set to any sequence of blanks. So $1 will point to that particular word we want on the separate line. So we print it with printf, and then we remove any reference to it with gsub.
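A minimal demonstration of this paragraph mode on the question's own input -- with RS="" each blank-line-separated block becomes one record, and the embedded newlines act as field separators:
$ printf 'T12 19/11/19 2000\nT12 18/12/19 2040\n\nT15 19/11/19 2000\nT15 18/12/19 2080\n' |
> awk 'BEGIN{RS=""}{print "record " NR ": " NF " fields, first field " $1}'
record 1: 6 fields, first field T12
record 2: 6 fields, first field T15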
awk is very flexible and provides a number of ways to solve the same problem. The answers you have already are excellent. Another way to approach the problem is to simply keep a single variable that holds the current field 1 as its value (unset by default). When the first field changes, you output it as the new heading. Otherwise you output the 2nd and 3rd fields. If a blank line is encountered, simply output a newline.
awk -v h= '
NF < 3 {print ""; next}
$1 != h {h=$1; print $1}
{printf "%s %s\n", $2, $3}
' file
Above are the 3 rules. If the line is empty (checked with a number of fields less than three, NF < 3), output a newline and skip to the next record. The second rule checks whether the first field differs from the current heading variable h -- if so, set h to the new heading and output it. All non-empty records have their 2nd and 3rd fields output.
Result
Just paste the command above at the command line and you will get the desired result, e.g.
$ awk -v h= '
> NF < 3 {print ""; next}
> $1 != h {h=$1; print $1}
> {printf "%s %s\n", $2, $3}
> ' file
T12
19/11/19 2000
18/12/19 2040

T15
19/11/19 2000
18/12/19 2080
replace names in fasta
I want to change the sequence names in a fasta file according to a text file containing the new names. I found several approaches, and seqkit made a good impression; anyway, I can't get it running.
Replace key with value by key-value file
The fasta file seq.fa looks like
>BC1
ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
>BC2
TGCATGCATGCATGCATGCATGCATGCATGCATGCATGCG
GCATGCATGCATGCATGCATGCATGCATGCATGCG
>BC3
GCATGCATGCATGCATGCATGCATGCATGCATGCCCCCCC
TGCATGCATGCATG
and the tab-delimited text file ref.txt like
BC1 1234
BC2 1235
BC3 1236
Using seqkit in Git Bash runs through the file but doesn't change the names:
seqkit replace -p' (.+)$' -r' {kv}' -k ref.txt seq.fa --keep-key
I'm used to R and new to bash, and can't find the bug, but guess I need to adjust for tab and _? As in the example https://bioinf.shenwei.me/seqkit/usage/#replace, part 7, "Replace key with value by key-value file", the sequence name is tab-delimited and only the second part is replaced. Advice on how to adjust the code?
Desired outcome (replacing BC1 by the number in the text file, 1234):
>1234
ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
>1235
TGCATGCATGCATGCATGCATGCATGCATGCATGCATGCG
GCATGCATGCATGCATGCATGCATGCATGCATGCG
>1236
GCATGCATGCATGCATGCATGCATGCATGCATGCCCCCCC
TGCATGCATGCATG
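(A likely culprit in the seqkit attempt, offered as an unverified hunch: the pattern ' (.+)$' only captures text after a space, but headers such as >BC1 contain no space, so the key-value lookup never gets a key. Assuming seqkit's replace subcommand works as the linked usage page describes, capturing the whole ID should let {kv} fire -- a sketch, not tested here:
seqkit replace -p '^(\S+)$' -r '{kv}' -k ref.txt --keep-key seq.fa
The awk answers below sidestep seqkit entirely.)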
Could you please try following.
awk '
FNR==NR{
  a[$1]=$2
  next
}
($2 in a) && /^>/{
  print ">"a[$2]
  next
}
1
' ref.txt FS="[> ]" seq.fa
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program here.
FNR==NR{ ##FNR==NR is a condition which will be TRUE when the 1st Input_file named ref.txt is being read.
  a[$1]=$2 ##Creating an array named a whose index is $1 and value is $2 of current line.
  next ##next will skip all further statements from here.
} ##Closing BLOCK for FNR==NR condition here.
($2 in a) && /^>/{ ##Checking condition if $2 of current line is present in array a and line starts with >, then do following.
  print ">"a[$2] ##Printing > and value of array a whose index is $2.
  next ##next will skip all further statements from here.
}
1 ##Mentioning 1 will print the lines (those which are NOT starting with > in Input_file seq.fa).
' ref.txt FS="[> ]" seq.fa ##Mentioning Input_file names here and setting FS to either space or > for Input_file seq.fa.
EDIT: As per OP's comment, an occurrence number is needed too in the output (e.g. >1234_1), so adding following code now.
awk '
FNR==NR{
  a[$1]=$2
  b[$1]=++c[$2]
  next
}
($2 in a) && /^>/{
  print ">"a[$2]"_"b[$2]
  next
}
1
' ref.txt FS="[> ]" seq.fa
awk solution that doesn't require GNU awk:
awk 'NR==FNR{a[$1]=$2;next} NF==2{$2=a[$2]; print ">" $2;next} 1' FS='\t' ref.txt FS='>' seq.fa
The first statement fills the array a with the content of the tab-delimited file ref.txt. The second statement rewrites the header lines of the second file, seq.fa -- with > as the field delimiter, those are the lines with 2 fields. The last statement prints all remaining lines of that same file.
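Example Use/Output -- pasting at the command line should rename the three records as requested:
$ awk 'NR==FNR{a[$1]=$2;next} NF==2{$2=a[$2]; print ">" $2;next} 1' FS='\t' ref.txt FS='>' seq.fa
>1234
ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
>1235
TGCATGCATGCATGCATGCATGCATGCATGCATGCATGCG
GCATGCATGCATGCATGCATGCATGCATGCATGCG
>1236
GCATGCATGCATGCATGCATGCATGCATGCATGCCCCCCC
TGCATGCATGCATG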