How to use dynamic field value for gsub? - shell

I have a use case where I need to replace the values of certain fields with some string. The field value has to be picked up from config file at runtime and should replace each character in that field with 'X'.
Input:
Hello~|*World Good~|*Bye
Output:
Hello~|*XXXXX Good~|*XXX
To do this I am using below command
awk -F "~\|\*" -v OFS="~|*" '{gsub(/[a-zA-Z0-9]/,"X",$ordinal_position)}1' $temp_directory/$file_basename
Here I would like to use ordinal_position variable where I will pass the field number.
I have already tried below command but it is not working.
awk -F '~\|\*' -v var="$"25 -v OFS='~|*' '{gsub(/[a-zA-Z0-9]/,"X",var)}1' $temp_directory/$file_basename

Pass the field number as an integer and precede the variable name with a $ (or enclose in $() for better readability) in the awk program for referencing that field. Like:
awk -v var=25 '{ gsub(/regex/, "replacement", $var) } 1' file
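Applied to the delimiter from the question, this could look like the sketch below. The function name mask_field and the variable name pos are made up for illustration. The field separator is written as ~[|][*] so that | and * are matched literally without backslash escaping. Note that it masks every letter and digit in the chosen field.

```shell
# mask_field N: replace each letter/digit in field N with X (hypothetical helper).
mask_field() {
  awk -F '~[|][*]' -v OFS='~|*' -v pos="$1" '
    { gsub(/[a-zA-Z0-9]/, "X", $pos) }   # $pos is the field whose number is stored in pos
    1                                     # print the (possibly rebuilt) record
  '
}

printf 'Hello~|*World Good~|*Bye\n' | mask_field 2
# prints: Hello~|*XXXXX XXXX~|*Bye
```

Because the field number arrives through -v, it can come from a config file or a shell variable without touching the awk program itself.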

You could try the following. List the numbers of all the fields you want to change in the awk variable named fields; the rest is handled by the program. The OP's samples change the 2nd and 3rd fields, so fields is set to 2,3 here, and it can be adjusted as needed. Written and tested with the shown samples in GNU awk.
awk -v fields="2,3" '
BEGIN{
FS=OFS="|"
num=split(fields,fieldIn,",")
for(i=1;i<=num;i++){
arrayfieldsIn[fieldIn[i]]
}
}
function fieldChange(field_number){
delete array
num=split($field_number,array," ")
gsub(/[a-zA-Z0-9]/,"X",array[1])
for(i=2;i<=num;i++){
val=val array[i]
}
$field_number=array[1] " " val
val=""
}
{
for(j=1;j<=NF;j++){
if(j in arrayfieldsIn){
fieldChange(j)
}
}
}
1
' Input_file
Explanation: Adding detailed explanation for above.
awk -v fields="2,3" ' ##Starting awk program from here and setting value of variable fields with value of 2,3.
BEGIN{ ##Starting BEGIN section of this program here.
FS=OFS="|" ##Setting FS and OFS values as | here.
num=split(fields,fieldIn,",") ##Splitting fields variable into array fieldIn and delimited with comma here.
for(i=1;i<=num;i++){ ##Starting for loop from 1 to till value of num here.
arrayfieldsIn[fieldIn[i]] ##Creating array arrayfieldsIn with index fieldIn here.
}
}
function fieldChange(field_number){ ##Creating function here for changing field values.
delete array ##Deleting array here.
num=split($field_number,array," ") ##Splitting field_number into array with delimiter as space here.
gsub(/[a-zA-Z0-9]/,"X",array[1]) ##Globally substituting alphabets and digits with X in array[1] here.
for(i=2;i<=num;i++){ ##Running for loop from 2 to till num here.
val=val array[i] ##Creating variable val which has array value here.
}
$field_number=array[1] " " val ##Setting field_number to array value and val here.
val="" ##Nullify val here.
}
{
for(j=1;j<=NF;j++){ ##Running loop till value of NF here.
if(j in arrayfieldsIn){ ##Checking if j is present in array then do following.
fieldChange(j) ##Calling fieldChange with variable j here.
}
}
}
1 ##1 will print line here.
' Input_file ##Mentioning Input_file name here.
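A shorter variant of the same idea is sketched below, under two assumptions that differ from the answer above: it splits on the full ~|* delimiter from the question, and it masks every letter and digit in each listed field rather than only the first space-separated word. The names mask_fields, want and f are illustrative only.

```shell
# mask_fields LIST: mask every field whose number appears in the comma-separated LIST.
mask_fields() {
  awk -F '~[|][*]' -v OFS='~|*' -v fields="$1" '
    BEGIN { n = split(fields, want, ",") }
    {
      for (k = 1; k <= n; k++) {
        f = want[k] + 0                 # force a numeric field index
        if (f >= 1 && f <= NF)          # ignore numbers outside this record
          gsub(/[a-zA-Z0-9]/, "X", $f)
      }
      print
    }
  '
}

printf 'Hello~|*World Good~|*Bye\n' | mask_fields 2,3
# prints: Hello~|*XXXXX XXXX~|*XXX
```

Looping over the requested field numbers, instead of over all NF fields, avoids the membership array and leaves records untouched when no listed field exists.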

Related

Add Extra Strings Based on count of fields- Sed/Awk

I have data in below format in a text file.
null,"ABC:MNO"
"hjgy","ABC:PQR"
"mn","qwe","ABC:WER"
"mn","qwe","mno","ABC:WER"
All rows should have 4 fields, like row 4. I want the data in below format.
"","","","ABC:MNO"
"hjgy","","","ABC:PQR"
"mn","qwe","","ABC:WER"
"mn","qwe","mno","ABC:WER"
If the row starts with null then null should be replace by "","","",
If there are only 2 fields then "","", should be added after 1st string .
If there are 3 fields then "", should be added after 2nd string
If there are 4 fields then do nothing.
I am able to handle 1st scenario by using sed 's/null/\"\",\"\",\"\"/' test.txt
But I dont know how to handle next 2 scenarios.
Regards.
With perl:
$ perl -pe 's/^null,/"","","",/; s/.*,\K/q("",) x (3 - tr|,||)/e' ip.txt
"","","","ABC:MNO"
"hjgy","","","ABC:PQR"
"mn","qwe","","ABC:WER"
"mn","qwe","mno","ABC:WER"
s/^null,/"","","",/ take care of null field first
.*,\K matches till last , in the line
\K is helpful to avoid having to put this matching portion back
3 - tr|,|| will give you how many fields are missing (tr return value is number of occurrences of , here)
q("",) here q() is used to represent single quoted string, so that escaping " isn't needed
x is the string replication operator
e flag allows you to use Perl code in replacement section
If rows starting with null, will always have two fields, then you can also use:
perl -pe 's/.*,\K/q("",) x (3 - tr|,||)/e; s/^null,/"",/'
Similar logic with awk:
awk -v q='"",' 'BEGIN{FS=OFS=","} {sub(/^null,/, q q q);
c=4-NF; while (c--) $NF = q $NF} 1'
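The same padding logic can also be written with the target field count as a parameter. This is only a sketch: the function name pad_before_last and the variable want are invented here, and it assumes that no quoted field contains an embedded comma.

```shell
# pad_before_last [N]: insert empty "" fields before the last field until the row has N fields.
pad_before_last() {
  awk -v want="${1:-4}" '
    BEGIN { FS = OFS = ","; q = "\"\"" }
    {
      if ($1 == "null") $1 = q              # a bare null becomes an empty quoted field
      n = NF
      if (n < want) {
        last = $n                           # remember the final field
        for (i = n; i < want; i++) $i = q   # fill positions n..want-1 with ""
        $want = last                        # put the final field back at the end
      }
      print
    }
  '
}

printf '%s\n' 'null,"ABC:MNO"' '"mn","qwe","ABC:WER"' | pad_before_last 4
# prints:
# "","","","ABC:MNO"
# "mn","qwe","","ABC:WER"
```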
With your shown samples only, please try following.
awk '
BEGIN{
FS=OFS=","
}
{
sub(/^null/,"\"\",\"\",\"\"")
}
NF==2{
$1=$1",\"\",\"\""
}
NF==3{
$2=$2",\"\""
}
1' Input_file
OR make " as a variable and one could try following too:
awk -v s1="\"\"" '
BEGIN{
FS=OFS=","
}
{
sub(/^null/,s1 "," s1","s1)
}
NF==2{
$1=$1"," s1 "," s1
}
NF==3{
$2=$2"," s1
}
1' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section of this program from here.
FS=OFS="," ##Setting FS and OFS to comma here.
}
{
sub(/^null/,"\"\",\"\",\"\"") ##Substituting a leading null with "","","" in current line.
}
NF==2{ ##If number of fields are 2 then do following.
$1=$1",\"\",\"\"" ##Adding ,"","" after 1st field value here.
}
NF==3{ ##If number of fields are 3 here then do following.
$2=$2",\"\"" ##Adding ,"" after 2nd field value here.
}
1 ##Printing current line here.
' Input_file ##Mentioning Input_file name here.
A solution using awk:
awk -F "," 'BEGIN{ OFS=FS }
{ gsub(/^ /,"",$1)
if($1=="null") print "\x22\x22","\x22\x22","\x22\x22", $2
else if(NF==2) print $1,"\x22\x22","\x22\x22",$2
else if(NF==3) print $1,$2,"\x22\x22",$3
else print $0 }' input
This might work for you (GNU sed):
sed 's/^\s*null,/"",/;:a;ta;s/,/&/3;t;s/.*,/&"",/;ta' file
If the line begins with null replace that field by an empty one i.e. "",.
Reset the substitute success flag by going back to :a using ta (this will only be the case when the first field is null and has been substituted).
If the 3rd field separator exists then all done.
Otherwise, insert an empty field before the last field separator and repeat.

Awk for loop not searching all fields

I'm trying to
print the first 3 columns
find all fields with "Eury_gr1_" and print them to the 4th column
if there are no "Eury_gr1_" in the whole row print 0 in the 4th column.
Input looks like the below named "final_pcs_mod_test.csv":
PC_00001,143,143.0,Eury_gr2_(111),Eury_gr5_(19),Unk_unclust_(1),Eury_gr1_(6),MAV_eury_(6)
PC_00004,137,137.0,Eury_gr6_(20),Eury_gr11_(24),Eury_gr14_(24),Eury_gr8_(8),Eury_gr12_(13)
PC_00027,109,109.0,Eury_gr1_(80),MAV_eury_(8)
The desired output will look like the below named "eury1":
PC_00001,143,143.0,Eury_gr1_(6)
PC_00004,137,137.0,0
PC_00027,109,109.0,Eury_gr1_(80)
The command I'm using is:
awk 'BEGIN {FS=","};{for(i=4;i<=NF;i++){if($i~/^Eury_gr1_/){a=$i} else {a="0"}} print $1,$2,$3,a}' final_pcs_mod_test.csv > eury1
The actual output is:
PC_00001,143,143.0,0
PC_00004,137,137.0,0
PC_00027,109,109.0,Eury_gr1_(80)
As you can see the first row is missing a "Eury_gr1_" entry. Looks like the code is only looking in the first specified column and not searching all columns as I want. I've been messing around with for(i=4;i<=4;i++) etc... but so far cannot seem to get it to find entries in the last columns of the input. The whole input file has a max of 17 columns. What am I doing wrong?
Could you please try following, written and tested with shown samples in GNU awk. Output will be same as shown samples.
awk '
BEGIN{
FS=OFS=","
}
{
for(i=4;i<=NF;i++){
if($i~/Eury_gr1_\([0-9]+\)/){
found=(found?found OFS:"")$i
}
}
if(found==""){ $4="0" }
else { $4=found }
print $1,$2,$3,$4
found=""
}' Input_file
OR
awk '
BEGIN{
FS=OFS=","
}
{
for(i=1;i<=NF;i++){
if(i<=3){
val1=(val1?val1 OFS:"")$i
}
else if(i>3){
if($i~/Eury_gr1_\([0-9]+\)/){
found=(found?found OFS:"")$i
}
}
}
if(found==""){ $4="0" }
else { $4=found }
print val1,$4
found=val1=""
}' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section from here of this program.
FS=OFS="," ##Setting field separator and output field separator to comma here.
}
{
for(i=1;i<=NF;i++){ ##Traversing through all the fields of current line here.
if(i<=3){ ##Checking condition if field number of lesser than or equal to 3 then do following.
val1=(val1?val1 OFS:"")$i ##Creating val1 and keep adding values there.
}
else if(i>3){ ##else if field number is greater than 3 then do following.
if($i~/Eury_gr1_\([0-9]+\)/){ ##Checking if current field is Eury_gr1_(digits) then do following.
found=(found?found OFS:"")$i ##Creating variable found and keep adding values there.
}
}
}
if(found==""){ $4="0" } ##Checking condition if found is NULL then make 4th field as zero.
else { $4=found } ##else set found value to 4th field here.
print val1,$4 ##Printing val1 and 4th field here.
found=val1="" ##Nullifying val1 and found here.
}' Input_file ##Mentioning Input_file name here.
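Fix for the OP's attempt: in the original loop, the else branch resets a to "0" for every later field that does not match, so a match survives only when it is in the last field. Setting the value only on a match, and falling back to 0 at print time, avoids that. This keeps only one Eury_gr1_ entry per line (the last one found); to collect all occurrences, please refer to my solution above.
awk '
BEGIN{
FS=OFS=","
}
{
for(i=4;i<=NF;i++){
if($i~/^Eury_gr1_\([0-9]+\)$/){ a1=$i }
}
print $1,$2,$3,(a1==""?"0":a1)
a1=""
}' Input_file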
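Another way to avoid the overwrite problem is to set the default before the loop and stop at the first match. The sketch below assumes only the first Eury_gr1_ entry per row is wanted; first_gr1 and hit are illustrative names.

```shell
# first_gr1: print columns 1-3 plus the first Eury_gr1_ field, or 0 if there is none.
first_gr1() {
  awk '
    BEGIN { FS = OFS = "," }
    {
      hit = "0"                                      # default when nothing matches
      for (i = 4; i <= NF; i++)
        if ($i ~ /^Eury_gr1_/) { hit = $i; break }   # stop at the first match
      print $1, $2, $3, hit
    }
  '
}

printf '%s\n' 'PC_00001,143,143.0,Eury_gr2_(111),Eury_gr1_(6),MAV_eury_(6)' | first_gr1
# prints: PC_00001,143,143.0,Eury_gr1_(6)
```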

Retrieve data from a file using patterns and annotate it with its filename

I have a file called bin.001.fasta looking like this:
>contig_655
GGCGGTTATTTAGTATCTGCCACTCAGCCTCGCTATTATGCGAAATTTGAGGGCAGGAGGAAACCATGAC
AGTAGTCAAGTGCGACAAGC
>contig_866
CCCAGACCTTTCAGTTGTTGGGTGGGGTGGGTGCTGACCGCTGGTGAGGGCTCGACGGCGCCCATCCTGG
CTAGTTGAAC
...
What I wanna do is to get a new file, where the 1st column is retrieved contig IDs and the 2nd column is the filename without .fasta:
contig_655 bin.001
contig_866 bin.001
Any ideas how to make it ?
Could you please try following.
awk -F'>' '
FNR==1{
split(FILENAME,array,".")
file=array[1]"."array[2]
}
/^>/{
print $2,file
}
' Input_file
OR more generic if your Input_file has more than 2 dots then run following.
awk -F'>' '
FNR==1{
match(FILENAME,/.*\./)
file=substr(FILENAME,RSTART,RLENGTH-1)
}
/^>/{
print $2,file
}
' Input_file
Explanation: Adding detailed explanation for above code.
awk -F'>' ' ##Starting awk program from here and setting field separator as > here for all lines.
FNR==1{ ##Checking condition if this is first line then do following.
split(FILENAME,array,".") ##Splitting filename which is passed to this awk program into an array named array with delimiter .
file=array[1]"."array[2] ##Creating variable file whose value is 1st and 2nd element of array with DOT in between as per OP shown sample.
}
/^>/{ ##Checking condition if a line starts with > then do following.
print $2,file ##Printing 2nd field and variable file value here.
}
' Input_file ##Mentioning Input_file name here.
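As a further sketch, the extension can be removed with sub() instead of split() or match(), which also copes with file names given with a directory path. The function name contigs_with_name is made up for this example, and it assumes the part to drop is everything from the last dot onwards.

```shell
# contigs_with_name FILE...: print "contig_id basename-without-extension" per header line.
contigs_with_name() {
  awk '
    FNR == 1 {
      name = FILENAME
      sub(/^.*\//, "", name)      # drop any leading directory part
      sub(/\.[^.]*$/, "", name)   # drop the last extension, e.g. .fasta
    }
    /^>/ { print substr($0, 2), name }   # header line without the leading >
  ' "$@"
}

# usage: contigs_with_name bin.001.fasta
```

Since name is recomputed whenever FNR is 1, several FASTA files can be passed in one call and each contig is annotated with its own file.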

Append delimiters for implied blank fields

I am looking for a simple solution to have for each line the same number of commas in file (CSV file)
e.g.
example of file:
1,1
A,B,C,D,E,F
2,2,
3,3,3,
4,4,4,4
expected:
1,1,,,,
A,B,C,D,E,F
2,2,,,,
3,3,3,,,
4,4,4,4,,
the line with the largest number of commas has 5 commas in this case (line #2). so, I want to add other commas in all lines to have the same number for each line (i.e. 5 commas)
Using awk:
$ awk 'BEGIN{FS=OFS=","} {$6=$6} 1' file
1,1,,,,
A,B,C,D,E,F
2,2,,,,
3,3,3,,,
4,4,4,4,,
As you can see above, in this approach the max. number of fields must be hardcoded in the command.
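If the field count is known but should not be baked into the program text, it can be passed in with -v. A minimal sketch, where pad_to and n are names chosen for illustration:

```shell
# pad_to N: append commas so that every line has at least N comma-separated fields.
pad_to() {
  awk -v n="$1" '
    BEGIN { FS = OFS = "," }
    NF < n { $n = "" }   # assigning past NF creates the missing empty fields
    1
  '
}

printf '1,1\nA,B,C,D,E,F\n2,2,\n' | pad_to 6
# prints:
# 1,1,,,,
# A,B,C,D,E,F
# 2,2,,,,
```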
Another take on making all lines in a CSV file have the same number of fields, where the number of fields need not be known in advance. The maximum field count is computed while reading, and a substring of the needed commas is appended to each record, e.g.
awk -F, -v max=0 '{
lines[n++] = $0 # store lines indexed by line number
fields[lines[n-1]] = NF # store number of field indexed by $0
if (NF > max) # find max NF value
max = NF
}
END {
for(i=0;i<max;i++) # form string with max commas
commastr=commastr","
for(i=0;i<n;i++) # loop appended substring of commas
printf "%s%s\n", lines[i], substr(commastr,1,max-fields[lines[i]])
}' file
Example Use/Output
Pasting at the command-line, you would receive:
$ awk -F, -v max=0 '{
> lines[n++] = $0 # store lines indexed by line number
> fields[lines[n-1]] = NF # store number of field indexed by $0
> if (NF > max) # find max NF value
> max = NF
> }
> END {
> for(i=0;i<max;i++) # form string with max commas
> commastr=commastr","
> for(i=0;i<n;i++) # loop appended substring of commas
> printf "%s%s\n", lines[i], substr(commastr,1,max-fields[lines[i]])
> }' file
1,1,,,,
A,B,C,D,E,F
2,2,,,,
3,3,3,,,
4,4,4,4,,
You could try the following, more generic approach, which works even when the number of fields is not known in advance. It reads Input_file twice: the first pass finds the maximum number of fields in the whole file, and the second pass assigns $nf=$nf on each line. Because OFS is set to a comma, any line with fewer than nf fields gets the missing commas appended. This is an enhanced version of @oguz ismail's answer.
awk '
BEGIN{
FS=OFS=","
}
FNR==NR{
nf=nf>NF?nf:NF
next
}
{
$nf=$nf
}
1
' Input_file Input_file
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section of awk program from here.
FS=OFS="," ##Setting FS and OFS as comma for all lines here.
}
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when first time Input_file is being read.
nf=nf>NF?nf:NF ##Setting variable nf to the larger of nf and NF: if nf is greater than NF then keep it as it is, else set it to NF.
next ##next will skip all further statements from here.
}
{
$nf=$nf ##Mentioning $nf=$nf will reset current lines value and will add comma(s) at last of line if NF is lesser than nf.
}
1 ##1 will print edited/non-edited lines here.
' Input_file Input_file ##Mentioning Input_file names here.

gsub with awk on a file by name

I'm trying to learn how to use awk with gsub for a particular field, but passing the name, not the number of the column on this data:
especievalida,nom059
Rhizophora mangle,Amenazada (A)
Avicennia germinans,Amenazada (A)
Laguncularia racemosa,Amenazada (A)
Cedrela odorata,Sujeta a protección especial (Pr)
Litsea glaucescens,En peligro de extinción (P)
Conocarpus erectus,Amenazada (A)
Magnolia schiedeana,Amenazada (A)
Carpinus caroliniana,Amenazada (A)
Ostrya virginiana,Sujeta a protección especial (Pr)
I tried
awk -F, -v OFS="," '{gsub("\\(.*\\)", "", $2 ) ; print $0}'
removes everything between parentheses on the second ($2) column; but I'd really like to be able to pass "nom059" to the expression, to get the same result
When reading the first line of your input file (the header line) build an array (f[] below) that maps the field name to the field number. Then you can access the fields by just using their names as an index to f[] to get their numbers and then contents:
$ cat tst.awk
BEGIN {
FS = OFS = ","
}
NR==1 {
for (i=1; i<=NF; i++) {
f[$i] = i
}
}
{
gsub(/\(.*\)/,"",$(f["nom059"]))
print
}
$ awk -f tst.awk file
especievalida,nom059
Rhizophora mangle,Amenazada
Avicennia germinans,Amenazada
Laguncularia racemosa,Amenazada
Cedrela odorata,Sujeta a protección especial
Litsea glaucescens,En peligro de extinción
Conocarpus erectus,Amenazada
Magnolia schiedeana,Amenazada
Carpinus caroliniana,Amenazada
Ostrya virginiana,Sujeta a protección especial
By the way, read https://www.gnu.org/software/gawk/manual/gawk.html#Computed-Regexps for why you should be using gsub(/.../ (a constant or literal regexp) instead of gsub("..." (a dynamic or computed regexp).
Could you please try following. I have made an awk variable named header_value where you could mention field name on which you want to use gsub.
awk -v header_value="nom059" '
BEGIN{
FS=OFS=","
}
FNR==1{
for(i=1;i<=NF;i++){
if($i==header_value){
field_value=i
}
}
print
next
}
{
gsub(/\(.*\)/, "",$field_value)
}
1
' Input_file
Explanation: Adding explanation of above code.
awk -v header_value="nom059" ' ##Starting awk program here and creating a variable named header_value whose value is set as nom059.
BEGIN{ ##Starting BEGIN section of this program here.
FS=OFS="," ##Setting FS and OFS value as comma here.
} ##Closing BEGIN section here.
FNR==1{ ##Checking condition if FNR==1, line is 1st line then do following.
for(i=1;i<=NF;i++){ ##Starting a for loop which starts from i=1 to till value of NF.
if($i==header_value){ ##checking condition if any field value is equal to variable header_value then do following.
field_value=i ##Creating variable field_value whose value is variable i value.
}
}
print ##Printing 1st line here.
next ##next will skip all further statements from here.
}
{
gsub(/\(.*\)/, "",$field_value) ##Now using gsub to Globally substituting everything between ( to ) with NULL in all lines.
}
1 ##Mentioning 1 will print edited/non-edited line.
' Input_file ##Mentioning Input_file name here.
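One gap in header lookups like this is a misspelled column name: the index stays empty, $field_value then refers to $0, and the substitution runs on the whole line. The sketch below adds a guard for that case. The names strip_parens, col and idx are invented here, and the regex is changed on purpose so that it also removes the space before the parenthesis.

```shell
# strip_parens NAME: remove " (...)" groups from the column whose header is NAME.
strip_parens() {
  awk -v col="$1" '
    BEGIN { FS = OFS = "," }
    NR == 1 {
      for (i = 1; i <= NF; i++) if ($i == col) idx = i
      if (!idx) { print "no such column: " col > "/dev/stderr"; exit 1 }
      print; next                         # header goes out unchanged
    }
    { gsub(/ *\([^)]*\)/, "", $idx) }     # drop each parenthesised group in that column
    1
  '
}

printf '%s\n' 'especievalida,nom059' 'Rhizophora mangle,Amenazada (A)' | strip_parens nom059
# prints:
# especievalida,nom059
# Rhizophora mangle,Amenazada
```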
