Add Extra Strings Based on count of fields- Sed/Awk - shell

I have data in below format in a text file.
null,"ABC:MNO"
"hjgy","ABC:PQR"
"mn","qwe","ABC:WER"
"mn","qwe","mno","ABC:WER"
All rows should have 4 fields like row 4. I want the data in the format below.
"","","","ABC:MNO"
"hjgy","","","ABC:PQR"
"mn","qwe","","ABC:WER"
"mn","qwe","mno","ABC:WER"
If the row starts with null, then null should be replaced by "","","",
If there are only 2 fields, then "","", should be added after the 1st string.
If there are 3 fields, then "", should be added after the 2nd string.
If there are 4 fields, then do nothing.
I am able to handle the 1st scenario by using sed 's/null/\"\",\"\",\"\"/' test.txt
But I don't know how to handle the next 2 scenarios.
Regards.

With perl:
$ perl -pe 's/^null,/"","","",/; s/.*,\K/q("",) x (3 - tr|,||)/e' ip.txt
"","","","ABC:MNO"
"hjgy","","","ABC:PQR"
"mn","qwe","","ABC:WER"
"mn","qwe","mno","ABC:WER"
s/^null,/"","","",/ take care of null field first
.*,\K matches till last , in the line
\K is helpful to avoid having to put this matching portion back
3 - tr|,|| will give you how many fields are missing (tr return value is number of occurrences of , here)
q("",) here q() is used to represent single quoted string, so that escaping " isn't needed
x is the string replication operator
e flag allows you to use Perl code in replacement section
If rows starting with null always have exactly two fields, then you can also use:
perl -pe 's/.*,\K/q("",) x (3 - tr|,||)/e; s/^null,/"",/'
Similar logic with awk:
awk -v q='"",' 'BEGIN{FS=OFS=","} {sub(/^null,/, q q q);
c=4-NF; while (c--) $NF = q $NF} 1'
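The same idea can be generalized to any target width. The following is only a minimal sketch, assuming a hypothetical helper name pad_fields and assuming the last field must always stay last. Unlike the while (c--) loop above, rows that already have the target number of fields (or more) pass through unchanged instead of looping.

```shell
# pad_fields N: hypothetical helper (not from the answers above).
# Pads each comma-separated row on stdin to N fields by inserting
# empty quoted fields ("") just before the last field.
pad_fields() {
  awk -v n="${1:-4}" -v q='""' '
    BEGIN { FS = OFS = "," }
    {
      sub(/^null,/, q ",")                        # a leading null becomes one empty field
      last = $NF                                  # the last field always stays last
      out = ""
      for (i = 1; i < NF; i++) out = out $i OFS   # keep the existing leading fields
      for (i = NF; i < n; i++) out = out q OFS    # add the missing empty fields
      print out last
    }'
}

printf '%s\n' 'null,"ABC:MNO"' '"hjgy","ABC:PQR"' '"mn","qwe","ABC:WER"' | pad_fields 4
```

Passing the width as an argument keeps the program usable if the required number of fields changes later.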

With your shown samples only, please try following.
awk '
BEGIN{
FS=OFS=","
}
{
sub(/^null/,"\"\",\"\",\"\"")
}
NF==2{
$1=$1",\"\",\"\""
}
NF==3{
$2=$2",\"\""
}
1' Input_file
Or keep "" in a variable and try the following:
awk -v s1="\"\"" '
BEGIN{
FS=OFS=","
}
{
sub(/^null/,s1 "," s1","s1)
}
NF==2{
$1=$1"," s1 "," s1
}
NF==3{
$2=$2"," s1
}
1' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section of this program from here.
FS=OFS="," ##Setting FS and OFS to comma here.
}
{
sub(/^null/,"\"\",\"\",\"\"") ##Substituting a leading null with "","","" in the current line.
}
NF==2{ ##If number of fields are 2 then do following.
$1=$1",\"\",\"\"" ##Adding ,"","" after 1st field value here.
}
NF==3{ ##If number of fields are 3 here then do following.
$2=$2",\"\"" ##Adding ,"" after 2nd field value here.
}
1 ##Printing current line here.
' Input_file ##Mentioning Input_file name here.

A solution using awk:
awk -F "," 'BEGIN{ OFS=FS }
{ gsub(/^ /,"",$1)
if($1=="null") print "\x22\x22","\x22\x22","\x22\x22", $2
else if(NF==2) print $1,"\x22\x22","\x22\x22",$2
else if(NF==3) print $1,$2,"\x22\x22",$3
else print $0 }' input

This might work for you (GNU sed):
sed 's/^\s*null,/"",/;:a;ta;s/,/&/3;t;s/.*,/&"",/;ta' file
If the line begins with null replace that field by an empty one i.e. "",.
Reset the substitution success flag by branching back to :a with ta (the branch is taken only when the first field was null and has been substituted).
If a 3rd field separator exists, the line is complete.
Otherwise, insert an empty field after the last field separator (i.e. before the last field) and repeat.

Related

Regex pattern as variable in AWK

Let's say I have a file with multiple fields and field 1 needs to be filtered for 2 conditions. I was thinking of turning those conditions into a regex pattern and pass them as variables to the awk statement. For some reason, they are not filtering out the records at all. Here is my attempt that runs fine, but doesn't filter out the results per conditions, except when fed directly into awk without variable assignment.
regex1="/abc|def/"; # match first field for abc or def;
regex2="/123|567/"; # and also match the first field for 123 or 567;
cat file_name \
| awk -v pat1="${regex1}" -v pat2="${regex2}" 'BEGIN{FS=OFS="\t"} {if ( ($1~pat1) && ($1~pat2) ) print $0}'
Update: Fixed a syntax error related to missing parenthesis for the if conditions in the awk. (I had it fixed in the code I ran).
Sample data
abc:567 1
egf:888 2
Expected output
abc:567 1
The problem is that I am getting all the results instead of the ones that satisfy the 2 regex for field 1
Note that the match needs to be wildcarded instead of exact match. Meaning 567 as defined in the regex pattern should also match on 567_1 if available.
It seems like the way to implement what you want to do would be:
awk -F'\t' '
($1 ~ /abc|def/) &&
($1 ~ /123|567/)
' file
or probably more robustly:
awk -F'\t' '
{ split($1,a,/:/) }
(a[1] ~ /abc|def/) &&
(a[2] ~ /123|567/)
' file
What's wrong with that?
EDIT here is me running the OPs code before and after fixing the inclusion of regexp delimiters (/) in the dynamic regexp strings:
$ cat tst.sh
#!/usr/bin/env bash
regex1="/abc|def/"; #--match first field for abc or def;
regex2="/123|567/"; #--and also match the first field for 123 or 567;
cat file_name \
| awk -v pat1="${regex1}" -v pat2="${regex2}" 'BEGIN{FS=OFS="\t"} $1~pat1 && $1~pat2'
echo "###################"
regex1="abc|def"; #--match first field for abc or def;
regex2="123|567"; #--and also match the first field for 123 or 567;
cat file_name \
| awk -v pat1="${regex1}" -v pat2="${regex2}" 'BEGIN{FS=OFS="\t"} $1~pat1 && $1~pat2'
$
$ ./tst.sh
###################
abc:567 1
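The difference between the two runs can be reproduced in isolation. This is a small sketch, assuming a hypothetical wrapper name filter_first_field and tab-separated input on stdin. A dynamic regexp passed with -v is a plain string, so surrounding slashes become literal characters of the pattern: "/abc|def/" means "/abc" or "def/".

```shell
# filter_first_field PAT1 PAT2: hypothetical wrapper for illustration.
# Prints tab-separated rows whose first field matches both patterns.
filter_first_field() {
  awk -v pat1="$1" -v pat2="$2" 'BEGIN { FS = OFS = "\t" } $1 ~ pat1 && $1 ~ pat2'
}

# Without slashes: the strings are used as regexps directly.
printf 'abc:567\t1\negf:888\t2\n' | filter_first_field 'abc|def' '123|567'

# With slashes: the slashes are part of the pattern, so nothing matches here.
printf 'abc:567\t1\negf:888\t2\n' | filter_first_field '/abc|def/' '/123|567/'
```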
EDIT: Since the OP has changed the samples, I am adding this solution, which also works for partial matches. Written and tested with the shown samples in GNU awk.
awk -F':|[[:space:]]+' -v var1="abc|def" -v var2="123|567" '
BEGIN{
num=split(var1,arr1,"|")
split(var2,arr2,"|")
for(i=1;i<=num;i++){
reg1[arr1[i]]
reg2[arr2[i]]
}
}
{
for(i in reg1){
if(index($1,i)){
for(j in reg2){
if(index($2,j)){ print; next }
}
}
}
}
' Input_file
Let's say following is an Input_file:
cat Input_file
abc_2:567_3 1
egf:888 2
Now after running above code we will get abc_2:567_3 1 in output.
With your shown samples only, could you please try the following? Written and tested in GNU awk. Put the values you want to look for in the 1st field in var1 and those you want to look for in the 2nd field in var2, each list delimited by pipes.
awk -F':|[[:space:]]+' -v var1="abc|def" -v var2="123|567" '
BEGIN{
num=split(var1,arr1,"|")
split(var2,arr2,"|")
for(i=1;i<=num;i++){
reg1[arr1[i]]
reg2[arr2[i]]
}
}
($1 in reg1) && ($2 in reg2)
' Input_file
Explanation: Adding detailed explanation for above.
awk -F':|[[:space:]]+' -v var1="abc|def" -v var2="123|567" ' ##Starting awk program from here.
##Setting field separator as colon or spaces, setting var1 and var2 values here.
BEGIN{ ##Starting BEGIN section of this program from here.
num=split(var1,arr1,"|") ##Splitting var1 to arr1 here.
split(var2,arr2,"|") ##Splitting var2 to arr2 here.
for(i=1;i<=num;i++){ ##Running for loop from 1 to till value of num here.
reg1[arr1[i]] ##Creating reg1 with index of arr1 value here.
reg2[arr2[i]] ##Creating reg2 with index of arr2 value here.
}
}
($1 in reg1) && ($2 in reg2) ##Checking condition if 1st field is present in reg1 AND 2nd field is present in reg2 then print that line.
' Input_file ##Mentioning Input_file name here.
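Note that the loop above uses num, the element count of var1, for both arrays, so it only fills reg2 completely when both lists have the same length. A possible variation, sketched under the assumption that the two lists may differ in length (match_exact, set1 and set2 are hypothetical names), builds each lookup array from its own count:

```shell
# match_exact LIST1 LIST2: hypothetical helper for illustration.
# Prints rows whose 1st field is exactly one of LIST1 and whose
# 2nd field is exactly one of LIST2 (both pipe-delimited lists).
match_exact() {
  awk -F':|[[:space:]]+' -v var1="$1" -v var2="$2" '
    BEGIN {
      n1 = split(var1, a1, "|"); for (i = 1; i <= n1; i++) set1[a1[i]]   # first list
      n2 = split(var2, a2, "|"); for (i = 1; i <= n2; i++) set2[a2[i]]   # second list
    }
    ($1 in set1) && ($2 in set2)'
}

printf 'abc:567\t1\negf:888\t2\n' | match_exact 'abc|def|ghi' '123|567'
```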

How to use dynamic field value for gsub?

I have a use case where I need to replace the values of certain fields with some string. The field value has to be picked up from config file at runtime and should replace each character in that field with 'X'.
Input:
Hello~|*World Good~|*Bye
Output:
Hello~|*XXXXX Good~|*XXX
To do this I am using below command
awk -F "~\|\*" -v OFS="~|*" '{gsub(/[a-zA-Z0-9]/,"X",$ordinal_position)}1' $temp_directory/$file_basename
Here I would like to use ordinal_position variable where I will pass the field number.
I have already tried below command but it is not working.
awk -F '~\|\*' -v var="$"25 -v OFS='~|*' '{gsub(/[a-zA-Z0-9]/,"X",var)}1' $temp_directory/$file_basename
Pass the field number as an integer and precede the variable name with a $ (or enclose in $() for better readability) in the awk program for referencing that field. Like:
awk -v var=25 '{ gsub(/regex/, "replacement", $var) } 1' file
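Applied to the separator from the question, this could look like the sketch below. It assumes a hypothetical wrapper name mask_field and reads input on stdin. The separator is written as the bracket expressions ~[|][*] so that | and * are taken literally without backslashes.

```shell
# mask_field N: hypothetical wrapper for illustration.
# Replaces every letter and digit in field N with X; fields are separated by ~|*
mask_field() {
  awk -F '~[|][*]' -v OFS='~|*' -v pos="$1" '
    { gsub(/[a-zA-Z0-9]/, "X", $pos) }   # $pos is the field whose number is stored in pos
    1'
}

printf 'Hello~|*World Good~|*Bye\n' | mask_field 2
```

Here the whole second field (World Good) is masked, because everything between two ~|* separators counts as one field.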
Could you please try the following? In the awk variable named fields you can list all the fields you want to change, and the solution takes care of the rest (the OP's samples change the 2nd and 3rd fields, so 2,3 is used here; change the values as needed). Written and tested with the shown samples in GNU awk.
awk -v fields="2,3" '
BEGIN{
FS=OFS="|"
num=split(fields,fieldIn,",")
for(i=1;i<=num;i++){
arrayfieldsIn[fieldIn[i]]
}
}
function fieldChange(field_number){
delete array
num=split($field_number,array," ")
gsub(/[a-zA-Z0-9]/,"X",array[1])
for(i=2;i<=num;i++){
val=val array[i]
}
$field_number=array[1] " " val
val=""
}
{
for(j=1;j<=NF;j++){
if(j in arrayfieldsIn){
fieldChange(j)
}
}
}
1
' Input_file
Explanation: Adding detailed explanation for above.
awk -v fields="2,3" ' ##Starting awk program from here and setting value of variable fields with value of 2,3.
BEGIN{ ##Starting BEGIN section of this program here.
FS=OFS="|" ##Setting FS and OFS values as | here.
num=split(fields,fieldIn,",") ##Splitting fields variable into array fieldIn and delimited with comma here.
for(i=1;i<=num;i++){ ##Starting for loop from 1 to till value of num here.
arrayfieldsIn[fieldIn[i]] ##Creating array arrayfieldsIn with index fieldIn here.
}
}
function fieldChange(field_number){ ##Creating function here for changing field values.
delete array ##Deleting array here.
num=split($field_number,array," ") ##Splitting field_number into array with delimiter as space here.
gsub(/[a-zA-Z0-9]/,"X",array[1]) ##Globally substituting alphabets and digits with X in array[1] here.
for(i=2;i<=num;i++){ ##Running for loop from 2 to till num here.
val=val array[i] ##Creating variable val which has array value here.
}
$field_number=array[1] " " val ##Setting field_number to array value and val here.
val="" ##Nullify val here.
}
{
for(j=1;j<=NF;j++){ ##Running loop till value of NF here.
if(j in arrayfieldsIn){ ##Checking if j is present in array then do following.
fieldChange(j) ##Calling fieldChange with variable j here.
}
}
}
1 ##1 will print line here.
' Input_file ##Mentioning Input_file name here.

Use sed (or similar) to remove anything between repeating patterns

I'm essentially trying to "tidy" a lot of data in a CSV. I don't need any of the information that's in "quotes".
Tried sed 's/".*"/""/' but it removes the commas if there's more than one section together.
I would like to get from this:
1,2,"a",4,"b","c",5
To this:
1,2,,4,,,5
Is there a sed wizard who can help? :)
You may use
sed 's/"[^"]*"//g' file > newfile
See online sed demo:
s='1,2,"a",4,"b","c",5'
sed 's/"[^"]*"//g' <<< "$s"
# => 1,2,,4,,,5
Details
The "[^"]*" pattern matches ", then 0 or more characters other than ", and then ". The matches are removed since RHS is empty. g flag makes it match all occurrences on each line.
Could you please try following.
awk -v s1="\"" 'BEGIN{FS=OFS=","} {for(i=1;i<=NF;i++){if($i~s1){$i=""}}} 1' Input_file
Non-one liner form of solution is:
awk -v s1="\"" '
BEGIN{
FS=OFS=","
}
{
for(i=1;i<=NF;i++){
if($i~s1){
$i=""
}
}
}
1
' Input_file
Detailed explanation:
awk -v s1="\"" ' ##Starting awk program from here and mentioning variable s1 whose value is "
BEGIN{ ##Starting BEGIN section of this code here.
FS=OFS="," ##Setting field separator and output field separator as comma(,) here.
}
{
for(i=1;i<=NF;i++){ ##Starting a for loop which traverse through all fields of current line.
if($i~s1){ ##Checking if current field has " in it if yes then do following.
$i="" ##Nullifying current field value here.
}
}
}
1 ##Mentioning 1 will print edited/non-edited line here.
' Input_file ##Mentioning Input_file name here.
With Perl:
perl -p -e 's/".*?"//g' file
? forces * to be non-greedy.
Output:
1,2,,4,,,5

gsub with awk on a file by name

I'm trying to learn how to use awk with gsub for a particular field, but passing the name, not the number of the column on this data:
especievalida,nom059
Rhizophora mangle,Amenazada (A)
Avicennia germinans,Amenazada (A)
Laguncularia racemosa,Amenazada (A)
Cedrela odorata,Sujeta a protección especial (Pr)
Litsea glaucescens,En peligro de extinción (P)
Conocarpus erectus,Amenazada (A)
Magnolia schiedeana,Amenazada (A)
Carpinus caroliniana,Amenazada (A)
Ostrya virginiana,Sujeta a protección especial (Pr)
I tried
awk -F, -v OFS="," '{gsub("\\(.*\\)", "", $2 ) ; print $0}'
removes everything between parentheses on the second ($2) column; but I'd really like to be able to pass "nom059" to the expression, to get the same result
When reading the first line of your input file (the header line) build an array (f[] below) that maps the field name to the field number. Then you can access the fields by just using their names as an index to f[] to get their numbers and then contents:
$ cat tst.awk
BEGIN {
FS = OFS = ","
}
NR==1 {
for (i=1; i<=NF; i++) {
f[$i] = i
}
}
{
gsub(/\(.*\)/,"",$(f["nom059"]))
print
}
$ awk -f tst.awk file
especievalida,nom059
Rhizophora mangle,Amenazada
Avicennia germinans,Amenazada
Laguncularia racemosa,Amenazada
Cedrela odorata,Sujeta a protección especial
Litsea glaucescens,En peligro de extinción
Conocarpus erectus,Amenazada
Magnolia schiedeana,Amenazada
Carpinus caroliniana,Amenazada
Ostrya virginiana,Sujeta a protección especial
By the way, read https://www.gnu.org/software/gawk/manual/gawk.html#Computed-Regexps for why you should be using gsub(/.../ (a constant or literal regexp) instead of gsub("..." (a dynamic or computed regexp).
Could you please try following. I have made an awk variable named header_value where you could mention field name on which you want to use gsub.
awk -v header_value="nom059" '
BEGIN{
FS=OFS=","
}
FNR==1{
for(i=1;i<=NF;i++){
if($i==header_value){
field_value=i
}
}
print
next
}
{
gsub(/\(.*\)/, "",$field_value)
}
1
' Input_file
Explanation: Adding explanation of above code.
awk -v header_value="nom059" ' ##Starting awk program here and creating a variable named header_value whose value is set as nom059.
BEGIN{ ##Starting BEGIN section of this program here.
FS=OFS="," ##Setting FS and OFS value as comma here.
} ##Closing BEGIN section here.
FNR==1{ ##Checking condition if FNR==1, line is 1st line then do following.
for(i=1;i<=NF;i++){ ##Starting a for loop which starts from i=1 to till value of NF.
if($i==header_value){ ##checking condition if any field value is equal to variable header_value then do following.
field_value=i ##Creating variable field_value whose value is variable i value.
}
}
print ##Printing 1st line here.
next ##next will skip all further statements from here.
}
{
gsub(/\(.*\)/, "",$field_value) ##Now using gsub to Globally substituting everything between ( to ) with NULL in all lines.
}
1 ##Mentioning 1 will print edited/non-edited line.
' Input_file ##Mentioning Input_file name here.
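The two approaches above can be combined: map header names to field numbers once, then look up the name passed in from the shell. This is only a sketch under some assumptions: strip_parens_in_column is a hypothetical name, input arrives on stdin, and the regexp also removes the spaces before the opening parenthesis (an addition not present in the answers above). It stops with an error if the header does not contain the requested name.

```shell
# strip_parens_in_column NAME: hypothetical helper for illustration.
# Removes " (...)" from the CSV column whose header equals NAME.
strip_parens_in_column() {
  awk -v col="$1" '
    BEGIN { FS = OFS = "," }
    NR == 1 {
      for (i = 1; i <= NF; i++) f[$i] = i          # header name -> field number
      if (!(col in f)) {
        print "no such column: " col | "cat 1>&2"  # report on stderr
        exit 1
      }
      print
      next
    }
    { gsub(/ *\(.*\)/, "", $(f[col])); print }'
}

printf '%s\n' 'especievalida,nom059' 'Rhizophora mangle,Amenazada (A)' | strip_parens_in_column nom059
```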

grep a string from a specific block of text

Some help required please...
I have a block of text in a file on my Linux machine like this;
Block.1:\
:Value1=something:\
:Value2=something_else:\
:Value3=something_other:
Block.2:\
:Value1=something:\
:Value2=something_else:\
:Value3=something_other:
Block.n:\
:Value1=something:\
:Value2=something_else:\
:Value3=something_other:
How can I use grep (and/or possibly awk?) to pluck out e.g Value2 from Block.2 only?
Blocks won't always be ordered sequentially (they have arbitary names) but will always be unique.
Colon and backslash positions are absolute.
TIA, Rob.
Following awk may help you in same.
awk -F"=" '/^Block\.2/{flag=1} flag && /Value2/{print $2;flag=""}' Input_file
Output will be as follows.
something_else:\
In case you want to print full line of value2 in block2 then change from print $2 to print in above code.
Explanation: Adding explanation of above code too now.
awk -F"=" ' ##Creating field separator as = for each line of Input_file.
/^Block\.2/{ ##Checking condition if a line starts with string Block.2, here I have escaped . to remove its special meaning, if condition is TRUE then do following:
flag=1 ##Setting variable flag value as 1, which indicates that flag is TRUE.
}
flag && /Value2/{ ##Checking condition if flag value is TRUE and line is having string Value2 in it then do following:
print $2; ##Printing 2nd field of the current line.
flag="" ##Nullifying the variable flag now.
}
' Input_file ##Mentioning the Input_file name here.
$ cat tst.awk
BEGIN { FS="[:=]" }
NF==2 { f = ($1 == "Block.2" ? 1 : 0) }
f && ($2 == "Value2") { print $3 }
$ awk -f tst.awk file
something_else
grep -A 2 "Block.2" file | tail -1 | cut -d= -f2
Explanation:
grep -A 2 looks for the pattern and also prints the 2 lines after the match (up to Value2)
tail -1 gets the last line (the one with Value2)
cut uses "=" as the field separator and prints the second field
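Since block names are arbitrary and the wanted key may not always be two lines below the header, a parameterized variant of the tst.awk logic may be more flexible. The following is a sketch only, assuming the file layout shown in the question; get_block_value is a hypothetical name.

```shell
# get_block_value BLOCK KEY FILE: hypothetical helper for illustration.
# Prints the value of KEY inside the block named BLOCK.
get_block_value() {
  awk -F'[:=]' -v blk="$1" -v key="$2" '
    NF == 2 { in_block = ($1 == blk) }              # header line such as Block.2:\
    in_block && $2 == key { print $3; exit }        # :Key=value:\ splits into "", Key, value, ...
  ' "$3"
}

# Demo on a temporary copy of the sample data.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Block.1:\
:Value1=a1:\
:Value2=a2:\
:Value3=a3:
Block.2:\
:Value1=something:\
:Value2=something_else:\
:Value3=something_other:
EOF
get_block_value Block.2 Value2 "$sample"
rm -f "$sample"
```

Comparing $1 with == instead of a regexp means Block.2 is not confused with names such as Block.20 or BlockX2.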
