Compare a file with two separate lookup files using awk - shell

Basically, I want to check whether the strings in lookup_1 and lookup_2 exist in my xyz.txt file and, if they do, perform an action on them and redirect the result to an output file. My current code substitutes every occurrence of the strings in lookup_1, even when they appear as a substring of a longer word, but I need it to substitute only on a whole-word match.
Can you please help me tweak the code to achieve this?
code
awk '
FNR==NR { if ($0 in lookups)
next
lookups[$0]=$0
for (i=1;i<=NF;i++) {
oldstr=$i
newstr=""
while (oldstr) {
len=length(oldstr)
newstr=newstr substr(oldstr,1,1) substr("##",1,len-1)
oldstr=substr(oldstr,4)
}
ndx=index(lookups[$0],$i)
lookups[$0]=substr(lookups[$0],1,ndx-1) newstr substr(lookups[$0],ndx+length($i))
}
next
}
{ for (i in lookups) {
ndx=index($0,i)
while (ndx > 0) {
$0=substr($0,1,ndx-1) lookups[i] substr($0,ndx+length(lookups[i]))
ndx=index($0,i)
}
}
print
}
' lookup_1 xyz.txt > output.txt
lookup_1
ha
achine
skhatw
at
ree
ter
man
dun
lookup_2
United States
CDEXX123X
Institution
xyz.txt
[1] [hamilton] This is a demo file
Demo file is currently being reviewed by user ter
[2] [ter] This is a demo file
Demo file is currently being edited by user skhatw
Internal Machine's Change Request being processed. Approved by user mandeep
Institution code is 'CDEXX123X' where country is United States
current output
[1] [h#milton] This is a demo file
Demo file is currently being reviewed by user t##
[2] [t##] This is a demo file
Demo file is currently being edited by user skh#tw
Internal Ma##i##'s Ch#nge Request being processed. Approved by user m##deep
Institution code is 'CDEXX123X' where country is United States
desired output
[1] [hamilton] This is a demo file
Demo file is currently being reviewed by user t##
[2] [t##] This is a demo file
Demo file is currently being edited by user s##a##
Internal Machine's Change Request being processed. Approved by user mandeep
I##t##u##o# code is 'C##X##2##' where country is U##t## S##t##

We can make a couple changes to the current code:
feed the results of cat lookup_1 lookup_2 into awk such that it looks like a single file to awk (see last line of new code)
use word boundary flags (\< and \>) to build regexes with which to perform the replacements (see 2nd half of new code)
The new code:
awk '
# the FNR==NR block of code remains the same
FNR==NR { if ($0 in lookups)
              next
          lookups[$0]=$0
          for (i=1;i<=NF;i++) {
              oldstr=$i
              newstr=""
              while (oldstr) {
                    len=length(oldstr)
                    newstr=newstr substr(oldstr,1,1) substr("##",1,len-1)
                    oldstr=substr(oldstr,4)
              }
              ndx=index(lookups[$0],$i)
              lookups[$0]=substr(lookups[$0],1,ndx-1) newstr substr(lookups[$0],ndx+length($i))
          }
          next
        }
# complete rewrite of the following block to perform replacements based on a regex using word boundaries
        { for (i in lookups) {
              regex= "\\<" i "\\>" # build regex
              gsub(regex,lookups[i]) # replace strings that match regex
          }
          print
        }
' <(cat lookup_1 lookup_2) xyz.txt # combine lookup_1/lookup_2 into a single stream so both files are processed under the FNR==NR block of code
This generates:
[1] [hamilton] This is a demo file
Demo file is currently being reviewed by user t##
[2] [t##] This is a demo file
Demo file is currently being edited by user s##a##
Internal Machine's Change Request being processed. Approved by user mandeep
I##t##u##o# code is 'C##X##2##' where country is U##t## S##t##
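The masking rule in the FNR==NR block (keep the first character of every three-character chunk and replace the rest with #) can be checked in isolation. A minimal sketch; the helper name mask_word is made up for illustration and is not part of the code above:

```shell
# mask_word is a hypothetical helper, not part of the original script.
# It applies the same chunk-of-three masking as the FNR==NR block.
mask_word() {
    awk -v word="$1" 'BEGIN {
        masked = ""
        while (word != "") {
            # first character of the chunk, then up to two "#" characters
            masked = masked substr(word, 1, 1) substr("##", 1, length(word) - 1)
            word = substr(word, 4)   # move on to the next chunk
        }
        print masked
    }'
}

mask_word skhatw        # prints s##a##
mask_word Institution   # prints I##t##u##o#
```

Running the helper on a few lookup values is a quick way to confirm the expected replacement strings before looking at the matching logic.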
NOTES:
the word-boundary operators (\< and \>) are GNU awk extensions that match the empty string at the start and at the end of a word, respectively; in GNU awk a word is defined as a sequence of letters, digits and underscores; see GNU awk - regex operators for more details
all of the sample lookup values start and end with a word character (United States merely has a space in the middle), so this new code works as desired
your previous question includes lookup values that cannot be considered as an awk 'word' (eg, #vanti Finserv Co., 11:11 - Capital, MS&CO(NY)) in which case this new code may fail to replace these new lookup values
for lookup values that contain non-word characters it's not clear how you would define 'whole word match' as you would also need to determine when a non-word character (eg, #) is to be treated as part of a lookup string vs being treated as a word boundary
If you need to replace lookup values that contain (awk) non-word characters you could try replacing the word-boundary characters with \W, though this then causes problems for the lookup values that are (awk) 'words'.
One possible workaround may be to run a dual set of regex matches for each lookup value, eg:
awk '
FNR==NR { ... no changes to this block of code ... }
        { for (i in lookups) {
              regex= "\\<" i "\\>"
              gsub(regex,lookups[i])
              regex= "\\W" i "\\W"
              gsub(regex,lookups[i])
          }
          print
        }
' <(cat lookup_1 lookup_2) xyz.txt
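You'll need to determine if the 2nd regex breaks your 'whole word match' requirement. Note that \W consumes one character on each side of the match, so gsub() overwrites those neighbouring characters as well, and a lookup value at the very start or end of a line is not matched by it.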
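One way to sidestep both the regex metacharacters and the \W side effects is to search for the lookup value as a literal string with index() and only then inspect the characters on either side. The following is a minimal sketch under stated assumptions, not the answer's method: the helper name replace_whole_words is hypothetical, it handles a single lookup value per call, and it treats a match as a 'whole word' when neither neighbouring character is a letter, digit or underscore.

```shell
# replace_whole_words OLD NEW  (hypothetical helper; reads text on stdin)
# Replaces literal occurrences of OLD that are not glued to a word character.
replace_whole_words() {
    awk -v old="$1" -v new="$2" '
    function is_word_char(c) { return c ~ /^[A-Za-z0-9_]$/ }
    {
        out = ""; rest = $0; prev = ""
        while ((pos = index(rest, old)) > 0) {
            before = (pos > 1) ? substr(rest, pos - 1, 1) : prev
            after  = substr(rest, pos + length(old), 1)
            if (!is_word_char(before) && !is_word_char(after)) {
                # whole-word hit: copy the text before it, then the replacement
                out  = out substr(rest, 1, pos - 1) new
                prev = substr(old, length(old), 1)
                rest = substr(rest, pos + length(old))
            } else {
                # partial hit: keep one character and search again after it
                out  = out substr(rest, 1, pos)
                prev = substr(rest, pos, 1)
                rest = substr(rest, pos + 1)
            }
        }
        print out rest
    }'
}

printf '%s\n' "[hamilton] user ter, mandeep" | replace_whole_words ter 't##'
# prints [hamilton] user t##, mandeep
```

Because the value is never compiled into a regex, characters such as ( and & in a lookup value need no escaping. Values passed with -v still go through awk's backslash-escape processing, so a lookup value containing a backslash would need extra care.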

Related

Search duplicates in a column, add value

Convert file input.csv.
id,location_id,organization_id,service_id,name,title,email,department
36,,,22,Joe Smith,third-party,john.smith#example.org,third-party Applications
18,11,,,Dave Genesy,Head of office,,
14,9,,,David Genesy,Library Director,,
22,14,,,Andres Espinoza, Manager Commanding Officer,,
(Done!) Need to update column name. Name format: first letter of name/surname uppercase and all other letters lowercase.
(Done!) Need to update column email with domain #abc.Email format: first letter from name and full surname, lowercase
(Not done) Emails with the same ID should contain numbers. Example: Name Max Houston, email mhouston1#examples.com etc.
#!/bin/bash
inputfile="accounts.csv"
echo "id,location_id,organization_id,service_id,name,title,email,department" > accounts_new.csv
while IFS="," read -r rec_column1 rec_column2 rec_column3 rec_column4 rec_column5 rec_column6 rec_column7 rec_column8
do
surnametemp="${rec_column5:0:1}$(echo $rec_column5 | awk '{print $2}')"
namesurname=$(echo $rec_column5 | sed 's! .!\U&!g')
echo $rec_column1","$rec_column2","$rec_column3","$rec_column4","$namesurname","$rec_column6",""${surnametemp,,}#abc.com"","$rec_column8 >>accounts_new.csv
done < <(tail -n +2 $inputfile)
How can I do that?
Outputfile
id,location_id,organization_id,service_id,name,title,email,department
14,9,,,Dave Genesy,Library Director,dgenesy#abc.com,
14,9,,,David Genesy,Library Director,dgenesy2#abc.com,
15,9,,,maria Kramer,Library Divisions Manager,mkramer#abc.com,
26,18,,,Sharon Petersen,Administrator,spetersen#abc.com,
27,19,,,Shen Petersen,Administrator,spetersen2#abc.com,
Task specification
This task would be much easier if specified otherwise:
add email iterator to every email
or
add email iterator to second,third... occurrence
But it was specified:
add email iterator to every email if email is used multiple times.
This specification requires double iteration through lines, thus making this task more difficult.
The right tool
My rule of thumb is: use standard shell tools (grep, sed, etc.) for simple tasks, awk for moderate tasks and python for complicated tasks. In this case (two passes over the lines) I would use python. However, the question carries no python tag, so I used awk.
Solution
<accounts.csv \
gawk -vFPAT='[^,]*|"[^"]*"' \
'
BEGIN {
    OFS = ","
};
{
    if ($7 == "") {
        split($5,name," ");
        firstname = substr(tolower(name[1]),1,1);
        lastname = tolower(name[2]);
        domain="#abc.com";
        $7=firstname "." lastname domain;
    };
    emailcounts[$7]++;
    immutables[++iter]=$1","$2","$3","$4","$5","$6","$8;
    emails[iter]=$7;
}
END {
    # loop by index: for (iter in immutables) visits keys in an unspecified order
    rows = iter;
    for (iter = 1; iter <= rows; iter++) {
        if (emailcounts[emails[iter]] > 1) {
            emailiter[emails[iter]]++;
            email=gensub(/#/, emailiter[emails[iter]]"#", "g", emails[iter]);
        } else {
            email=emails[iter]
        };
        print immutables[iter], email
    }
}'
Results
id,location_id,organization_id,service_id,name,title,department,email
36,,,22,Joe Smith,third-party,third-party Applications,john.smith#example.org
18,11,,,Dave Genesy,Head of office,,d.genesy1#abc.com
14,9,,,David Genesy,Library Director,,d.genesy2#abc.com
22,14,,,Andres Espinoza,"Manager, Commanding Officer",,a.espinoza#abc.com
Explanation
-vFPAT='[^,]*|"[^"]*"': read the CSV, treating a double-quoted string as a single field
$7=firstname "." lastname domain: fill in the email field when it is empty
emailcounts[$7]++: count email occurrences
iter: row counter, so that the rows can be indexed in input order
immutables[++iter]=$1","$2","$3","$4","$5","$6","$8: save the non-email fields for the second loop
emails[iter]=$7: save the email for the second loop
the for loop in END: iterate over the saved rows
if (emailcounts[emails[iter]] > 1): change the email only if it occurs more than once
emailiter[emails[iter]]++: increase the per-email iterator
email=gensub(/#/, emailiter[emails[iter]]"#", "g", emails[iter]): add the iterator to the email
print immutables[iter], email: print the row
With the input (mailscsv) file as:
id,location_id,organization_id,service_id,name,title,email,department
14,9,,,Dave Genesy,Library Director,dgenesy#abc.com,
14,9,,,David Genesy,Library Director,dgenesy#abc.com,
15,9,,,maria Kramer,Library Divisions Manager,mkramer#abc.com,
26,18,,,Sharon Petersen,Administrator,spetersen#abc.com,
27,19,,,Shen Petersen,Administrator,spetersen2#abc.com,
You can use awk and so:
awk -F, ' NR>1 { mails[$7]+=1;if ( mails[$7] > 1 ) { OFS=",";split($7,mail1,"#");$7=mail1[1]mails[$7]"#"mail1[2] } else { $0=$0 } }1' mailscsv
Set the field delimiter to , and build an array mails keyed by email address, incrementing its count every time the address is encountered. If the address has been seen more than once, split it on "#" into another array, mail1, and set $7 to mail1[1] (the part before #), followed by the current count for that address, then "#" and mail1[2] (the part after #). If this is the first occurrence of the address, simply leave the line as is. The trailing 1 prints the line.
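The one-liner above numbers an address only from its second occurrence onwards. If every occurrence of a repeated address should get a number, the "double iteration" can also be done by handing the same file to awk twice. A minimal sketch, assuming plain comma-separated input without quoted commas, the email in column 7 and the # separator used in the sample data here; the function name number_duplicate_emails is made up for illustration:

```shell
# number_duplicate_emails FILE  (hypothetical helper)
# Pass 1 counts each address, pass 2 numbers the addresses seen more than once.
number_duplicate_emails() {
    awk -F, -v OFS=, -v sep="#" '
    NR == FNR {                      # first pass over the file: count addresses
        if (FNR > 1) count[$7]++     # skip the header line
        next
    }
    FNR > 1 && count[$7] > 1 {       # second pass: address occurs several times
        n  = ++seen[$7]              # running number for this address
        at = index($7, sep)
        if (at > 0) $7 = substr($7, 1, at - 1) n substr($7, at)
    }
    { print }
    ' "$1" "$1"
}
```

Called as number_duplicate_emails mailscsv, this should turn the two dgenesy#abc.com rows into dgenesy1#abc.com and dgenesy2#abc.com and leave unique addresses untouched. Reading the file twice keeps memory use flat, at the cost of not working on a pipe.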

Bash/Linux: Merge rows on match; add last field

I have a set of wireless stats from various branches in the organization:
branchA,171
branchA_guests,1020
branchB,2019
branchB_guests,3409
There are 2 entries for each branch: 1st is internal wifi usage, the next is guest usage. I'd like to merge them into a single total as we don't care whether it's guests or staff ...etc.
Desired output should be:
branchA,1191
branchB,5428
The input file has a header and some markdown so it has to identify a match, not just assume the next line is related --- though the data could be cleaned first, it is my opinion that a match would make this more bulletproof.
Here is my approach: Remove the _guests and tally:
# file: tally.awk
BEGIN {
    FS = OFS = ","
}
{
    sub(/_guests/, "", $1) # Remove _guests
    stat[$1] += $2 # Tally
}
END {
    for (branch in stat) {
        printf "%s,%d\n", branch, stat[branch]
    }
}
Running the script:
awk -f tally.awk data.txt
Notes
In the BEGIN pattern, I set the field separator (FS) and output field separator (OFS) both to a comma
Next, for each line, I remove the _guests part and tally the count
Finally, at the end of the file, I print out the counts
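The question mentions that the real file also contains a header and some markdown. A variant of the same tally that skips such lines and prints the branches in first-seen order might look like the sketch below. It assumes that a data line has exactly two comma-separated fields with an all-digit second field; the function name tally_branches is hypothetical.

```shell
# tally_branches  (hypothetical helper; reads "name,count" lines on stdin)
tally_branches() {
    awk -F, '
    NF == 2 && $2 ~ /^[0-9]+$/ {          # ignore header, markdown, blank lines
        name = $1
        sub(/_guests$/, "", name)         # fold the guest line into its branch
        if (!(name in total)) order[++n] = name
        total[name] += $2
    }
    END {
        # for (name in total) has no defined order, so replay the saved order
        for (i = 1; i <= n; i++) printf "%s,%d\n", order[i], total[order[i]]
    }'
}

printf '%s\n' '# Wireless stats' 'branch,clients' \
    'branchA,171' 'branchA_guests,1020' \
    'branchB,2019' 'branchB_guests,3409' | tally_branches
# prints branchA,1191 then branchB,5428
```

Anchoring the pattern as _guests$ avoids touching a branch whose name merely contains that text somewhere in the middle.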

Replace string with text only when a given text precedes it

I have about one hundred Markdown files that contain snippets of Latex like this:
<div latex="true" class="task" id="Task">
(#) Delete the fourth patterns from your .teach file and your .data files. Remember to change the second line in each so that Tlearn knows there are now only three patterns.
- They should look like [#fig:dataTeach]
</div>
I'd like to replace the <div> tags with pseudotags that are easier to read, like this:
<task>
(#) Delete the fourth patterns from your .teach file and your .data files. Remember to change the second line in each so that Tlearn knows there are now only three patterns.
- They should look like [#fig:dataTeach]
</task>
This would be trivial if all my <div> tags were marking 'tasks', but I have similar divs for 'journal' and 'highlight'. I need a process that will change the </div> to </task> only when the preceding <div> has the class or id 'task', and likewise for 'journal' and 'highlight'.
Having looked around Stack Overflow for a while, I find many examples of multiline search and replace that do almost what I want to do, but the syntax (particularly for sed) is so difficult to untangle I can't adapt it for the above case. My next option is to write a bash script to loop through line by line, but I have a feeling this might be too fragile.
Cheers
Ian
The following awk command works generically, under the following assumptions:
All opening and closing div tags are on their own lines.
Attributes all use "-quoting.
The new tag name is derived from the value of the class attribute only (this could be generalized if the rules were clearer).
awk -F ' class="' '
/^<div / && NF > 1 { tag=$2; sub("\".*", "", tag); printf "<%s>\n", tag; next }
/^<\/div>/ && tag != "" { printf "</%s>\n", tag; tag=""; next }
1
' file
-F ' class="' effectively splits each line into before (field 1, $1) and after (field 2, $2) the class attribute, if present. Only lines that have such an attribute will therefore have more than 1 field (NF > 1).
Processing the opening div tag:
Pattern /^<div / && NF > 1 therefore only matches lines that start with (^) <div and (&&) contain a class attribute (NF > 1)
tag=$2; sub("\".*", "", tag) extracts the class attribute value from the 2nd field, by replacing everything from the first " (the closing " of the attribute value) with the empty string, effectively retaining the attribute value only in variable tag.
printf "<%s>\n", tag prints the attribute value as the replacement opening tag.
next skips the rest of the script and moves to the next input line.
Processing the closing div tag:
/^<\/div>/ && tag != "" matches the closing div tag, assuming that a class attribute value was found in the previous opening tag (tag != "").
printf "</%s>\n", tag prints the new closing tag.
tag="" resets the most recent replacement tag so that any subsequent div elements that do not have class attributes don't accidentally get renamed too.
next skips the rest of the script and moves to the next input line.
All other lines:
1 simply prints all other lines as-is. (1 is a common Awk shorthand for { print }: Pattern 1, interpreted as a Boolean, is by definition true, and a pattern without associated action { ... } prints the input line by default).
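If only the three classes named in the question should be renamed, and divs may be nested, the single tag variable can be swapped for a small stack. This is a sketch under the same assumptions as above (div tags on their own lines, double-quoted attributes); the function name rename_known_divs and the names variable are made up for illustration.

```shell
# rename_known_divs  (hypothetical helper; reads Markdown on stdin)
rename_known_divs() {
    awk -v names="task journal highlight" '
    BEGIN { n = split(names, list, " "); for (i = 1; i <= n; i++) known[list[i]] = 1 }
    /^<div[ >]/ {
        tag = ""
        if (match($0, / class="[^"]*"/)) {
            # skip the 8 characters of [ class="] and drop the closing quote
            value = substr($0, RSTART + 8, RLENGTH - 9)
            if (value in known) tag = value
        }
        stack[++depth] = tag               # remember what this div became
        if (tag != "") { print "<" tag ">"; next }
    }
    /^<\/div>/ && depth > 0 {
        tag = stack[depth--]               # the div this close belongs to
        if (tag != "") { print "</" tag ">"; next }
    }
    { print }
    '
}
```

A div whose class is not in the list is pushed as an empty entry, so its closing tag is printed unchanged and does not steal the closing tag of an enclosing task.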
No loop needed. Just pipe the files though this...
sed '/Task/s/<div.*>/<task>/g;s/<\/div>/<\/task>/g'
/Task/ at the beginning makes sed apply the first substitution only to lines that contain Task.
s/NAME/NEWNAME/ replaces the first match of NAME on a line with NEWNAME.
Adding .* extends the match to everything from that point up to the last > on the line.
Last but not least, g stands for global and replaces every match on the line, not just the first.
The second command (after ;) replaces </div> with </task>. It has no address of its own, so it applies to every line, including the closing tags of divs that are not tasks. Because / is used by sed itself as the delimiter here, the slash in </div> has to be escaped with a \ (backslash).
Here you go. The output of your file will look like this....
<task>
(#) Delete the fourth patterns from your .teach file and your .data files. Remember to change the second line in each so that Tlearn knows there are now only three patterns.
- They should look like [#fig:dataTeach]
</task>
This might work for you (GNU sed):
v='task|journal|highlight'
sed -ri '/^<div/{:a;N;/^<\/div/M!ba;s/^<.*class="('$v')"[^>]*(.*<\/)div/<\1\2\1/}' file1 file2 file3 ...
This stores the div statements in the pattern space and then substitutes (or not) the required values depending on the shell variable set beforehand.
N.B. the alternatives are stored in the shell variable v separated by |
This should do the trick:
$msys\bin\sed -En "s/<div latex=\"true\" class=\"task\" id=\"Task\">/<task>/;T;{:a;N;s/<\/div>/<\/task>/;Ta;p;}" input.txt
These are the building blocks, in case you want to adapt it:
make a loop:{:a;
it ends when the second replacement triggers: s/<\/div>/<\/task>/;Ta;
only start it, if the first replacement triggered:
s/<div latex=\"true\" class=\"task\" id=\"Task\">/<task>/;T;
inside the loop just collect lines into pattern space:N;
at the end of the loop just print:p;}
called with extended regular expressions and without default-printing
(mine is a windows/msys sed, just so you know):$msys\bin\sed -En

Using awk to format text

I'm having a hard time understanding how to achieve what I want using awk, and after searching for quite some time I couldn't find the solution I'm looking for.
I have an input text that looks like this:
Some text (possibly containing text within parenthesis).
Some other text
Another line (with something here) with some text
(
Element 4
)
Another line
(
Element 1, span 1 to
Element 5, span 4
)
Another Line
I want to properly format the weird lines between ' (' and ')'. The expected output is as follows:
Some text (possibly containing text within parenthesis).
Some other text
Another line (with something here) with some text
(Element 4)
Another line
(Element 1, span 1 to Element 5, span 4)
Another Line
Looking up on stack overflow I found this :
How to select lines between two marker patterns which may occur multiple times with awk/sed
So what I'm using now is echo $text | awk '/ \(/{flag=1;next}/\)/{flag=0}flag'
Which almost works except it filters out the non-matching lines, here's the output produced by this very last command:
(Element 4)
(Element 1, span 1 to Element 5, span 4)
Does anyone know how to do this? I'm open to any suggestion, including not using awk if you know a better tool.
Bonus point if you teach me how to remove syntax highlighting on my question code blocks :)
Thanks a billion times
Edit: OK, so I accepted @EdMorton's solution as he provided something using awk (well, GNU awk). However, I'm currently using @aaron's sed voodoo incantations with great success and will probably continue doing so until I hit anything new on that specific use case.
I strongly suggest reading EdMorton's explanation; the last paragraph made my day. If anyone passing by has good resources regarding awk/sed they can share, feel free to do so in the comments.
Here's how I would do it with GNU sed :
s/^\s*(/(/;/^(/{:l N;/)/b e;b l;:e s/\n//g}
Which, for those who don't speak gibberish, means :
remove the leading spaces from lines that start with spaces and an opening bracket
test if the line now start with an opening bracket. If that's the case, do the following :
mark this spot as the label l, which denotes the start of a loop
add a line from the input to the pattern space
test if you now have a closing bracket in your pattern space
if so, jump to the label e
(if not) jump to the label l
mark this spot as the label e, which denotes the end of the code
remove the linefeeds from the pattern space
(implicitly print the pattern space, whether it has been modified or not)
This can probably be refined, but it does the trick :
$ echo """Some text (possibly containing text within parenthesis).
Some other text
Another line (with something here) with some text
(
Element 4
)
Another line
(
Element 1, span 1 to
Element 5, span 4
)
Another Line """ | sed 's/^\s*(/(/;/^(/{:l N;/)/b e;b l;:e s/\n//g}'
Some text (possibly containing text within parenthesis).
Some other text
Another line (with something here) with some text
(Element 4)
Another line
(Element 1, span 1 to Element 5, span 4)
Another Line
Edit : if you can disable history expansion (set +H), this sed command is nicer : s/^\s*(/(/;/^(/{:l N;/)/!b l;s/\n//g}
sed is for simple substitutions on individual lines, that is all. If you try to do anything else with it then you are using constructs that became obsolete in the mid-1970s when awk was invented, are almost certainly non-portable and inefficient, are always just a pile of indecipherable arcane runes, and are used today just for mental exercise.
The following uses GNU awk for multi-char RS, RT and the \s shorthand for [[:space:]] and works by simply isolating the (...) strings and then doing whatever you want with them:
$ cat tst.awk
BEGIN {
    RS="[(][^)]+[)]" # a regexp for the string you want to isolate in RT
    ORS="" # disable appending of newlines so we print as-is
}
{
    gsub(/\n[[:blank:]]+$/,"\n") # remove any blanks before RT at the start of each line
    sub(/\(\s+/,"(",RT) # remove spaces after ( in RT
    sub(/\s+\)/,")",RT) # remove spaces before ) in RT
    gsub(/\s+/," ",RT) # compress each chain of spaces to one blank char in RT
    print $0 RT # print the result
}
$ awk -f tst.awk file
Some text (possibly containing text within parenthesis).
Some other text
Another line (with something here) with some text
(Element 4)
Another line
(Element 1, span 1 to Element 5, span 4)
Another Line
If you're considering using a sed solution for this also consider how you would enhance it if/when you have the slightest requirements change. Any change to the above awk code would be trivial and obvious while a change to the equivalent sed code would require first sacrificing a goat under a blood moon then breaking out your copy of the Rosetta Stone...
It's doable in awk, and maybe there's a slicker way than this. It looks for lines between and including those containing only blanks and either an open or close parenthesis, and processes them specially. Everything else it just prints:
awk '/^ *\( *$/,/^ *\) *$/ {
         sub(/^ */, "");
         sub(/ *$/, "");
         if ($1 ~ /[()]/) hold = hold $1; else hold = hold " " $0
         if ($0 ~ /\)/) {
             sub(/\( /, "(", hold)
             sub(/ \)/, ")", hold)
             print hold
             hold = ""
         }
         next
     }
     { print }' data
The variable hold is initially empty.
The first pair of sub calls strip leading and trailing blanks (copying the data from the question, there's a blank after span 1 to). The if adds the ( or ) to hold without a space, or the line to hold after a space. If the close parenthesis is present, remove the space after the open parenthesis and before the close parenthesis, print hold, and reset hold to empty. Always skip the rest of the script with next. The rest of the script is { print } — print unconditionally, often written 1 by minimalists.
The file data is copy'n'paste from the data in the question.
Output:
Some text (possibly containing text within parenthesis).
Some other text
Another line (with something here) with some text
(Element 4)
Another line
(Element 1, span 1 to Element 5, span 4)
Another Line
The 'Another Line' (with capital L) has a trailing blank because the data in the question does.
With awk
$ cat fmt.awk
function rem_wsp(s) { # remove white spaces
    gsub(/[\t ]/, "", s)
    return s
}
function beg() {return rem_wsp($0)=="("}
function end() {return rem_wsp($0)==")"}
function dump_block() {
    print "(" block ")"
}
beg() {
    in_block = 1
    next
}
end() {
    dump_block()
    in_block = block = ""
    next
}
in_block {
    sep = (length(block)>0) ? " " : "" # no separator before the first line of a block
    block = block sep $0
    next
}
{
    print
}
END {
    if (in_block) dump_block()
}
Usage:
$ awk -f fmt.awk file.dat

How can I read a CSV file if only non-empty fields are wrapped by double quotes?

I'm trying to read a CSV file in a Bash script. I achieved that successfully using gawk and specifying FPAT like:
gawk -v LOGFILE="${LOGFILE}" 'BEGIN {
FPAT = "([^,]+)|(\"[^\"]+\")"
}
NR == 1{
# doing some logic with header
}
NR >= 2{
# doing some logic with fields
}' <filename>
The problem here is, the file contains data like:
"RAM","31st street, Bengaluru, India",,,,"7865431234",,"VALID"
Now, with this data I'm getting wrong data because it is ignoring commas, which is giving me wrong position number of extracted data.
For example, it is telling "7865431234" is present at 3rd position whereas it is at 6th.
Can anyone suggest the changes to get the correct position of fields?
Your FPAT requires each field to contain at least one character, but you want to recognize empty fields with zero characters. Add an alternative to FPAT that allows zero characters:
gawk 'BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")|" }
{ printf "%d:%d:", NR, NF; for (i = 1; i <= NF; i++) printf("[%s]", $i); print "" }'
Note the extra | at the end of FPAT. The action simply identifies the record number, the number of fields, and surrounds the value of each field with square brackets.
When your data string is provided to that script, the output is:
1:8:["RAM"]["31st street, Bengaluru, India"][][][]["7865431234"][]["VALID"]
That shows the four empty fields quite clearly.
Now all you have to do is deal with:
"Mr ""Manipulator"", the Artisan","29th Street, Delhi, India",,,"",,,"INVALID"
where there are double quotes inside the quoted value. That's not dreadfully hard to manage:
gawk 'BEGIN { FPAT = "([^,]+)|(\"([^\"]|\"\")*\")[^,]*|" }
{ printf "%d:%d:", NR, NF; for (i = 1; i <= NF; i++) printf("%d[%s]", i, $i); print "" }' "$@"
The FPAT says that a field is:
a sequence of non-commas,
or it is a field started with a double quote, containing zero or more instances of either:
a non-quote, or
two double quotes
followed by a double quote and optional non-comma data
or it is empty
Note that the 'optional non-comma data' should be empty, and only appears in malformed CSV data.
Given input data:
"RAM","31st street, Bengaluru, India",,,,"7865431234",,"VALID"
"Mr ""Manipulator"", the Artisan","29th Street, Delhi, India",,,,,,"INVALID"
"Some","","Empty","",Fields "" Wrapped,"",in quotes
"Malformed" CSV,Data,"Note it has data after" a close quote,"and before a comma,",,"INVALID"
This produces:
1:8:1["RAM"]2["31st street, Bengaluru, India"]3[]4[]5[]6["7865431234"]7[]8["VALID"]
2:8:1["Mr ""Manipulator"", the Artisan"]2["29th Street, Delhi, India"]3[]4[]5[]6[]7[]8["INVALID"]
3:7:1["Some"]2[""]3["Empty"]4[""]5[Fields "" Wrapped]6[""]7[in quotes]
4:6:1["Malformed" CSV]2[Data]3["Note it has data after" a close quote]4["and before a comma,"]5[]6["INVALID"]
Note that the field numbers are included as a prefix to the bracketed data (so I tweaked the print format slightly).
About the only format this doesn't handle is one where newlines can be embedded in the data for a field — by the nature of the line-based input, it assumes that no field is split over multiple lines. (It also means it won't properly recognize a field that starts with a double quote and doesn't have a matching double quote before the end of the line. I suppose you could add an alternative to recognize that. It would be better just to make the data right.)
Note the advice in Sobrique's answer to use a tool designed to handle CSV for handling CSV. That is generally a good idea, and the more complex the sets of variations you have to deal with, the better an idea it is. This is close to as complicated a regex as you should consider using. Also note that although RFC 4180 defines a version of CSV formally and rigorously, there are multiple programs (including MS Office) that handle different but related formats.
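Where gawk's FPAT is not available, roughly the same field rules can be applied in POSIX awk by walking each line character by character and tracking whether the scan is inside double quotes. This is a swapped-in technique rather than the FPAT approach itself, and only a minimal sketch: it assumes no field spans several lines, keeps the surrounding quotes like the FPAT version does, and the names csv_fields and split_csv are made up for illustration.

```shell
# csv_fields  (hypothetical helper; reads CSV on stdin)
# Prints record number, field count, then each field as N[value].
csv_fields() {
    awk '
    # Split one CSV line into out[1..n]; a comma only ends a field outside quotes.
    function split_csv(line, out,    n, i, c, in_quotes, field) {
        n = 0; field = ""; in_quotes = 0
        for (i = 1; i <= length(line); i++) {
            c = substr(line, i, 1)
            if (c == "\"") in_quotes = !in_quotes   # a doubled quote toggles twice
            if (c == "," && !in_quotes) { out[++n] = field; field = "" }
            else field = field c
        }
        out[++n] = field
        return n
    }
    {
        n = split_csv($0, f)
        printf "%d:%d:", NR, n
        for (i = 1; i <= n; i++) printf "%d[%s]", i, f[i]
        print ""
    }'
}

printf '%s\n' '"RAM","31st street, Bengaluru, India",,,,"7865431234",,"VALID"' | csv_fields
# prints 1:8:1["RAM"]2["31st street, Bengaluru, India"]3[]4[]5[]6["7865431234"]7[]8["VALID"]
```

The scan is slower than a regex-based split on large files, and like the FPAT version it does nothing sensible with an unbalanced quote, so the advice about using a real CSV parser still applies.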
If you have CSV that needs parsing, then whilst you can usually hack it with a regex, it's far easier to use a parser.
Something like this:
#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV -> new;
open ( my $input, '<', 'flarg.csv' ) or die $!;
while ( my $row = $csv -> getline ( $input ) ) {
    if ( $. == 1 ) {
        # do first row stuff;
        print "Header: ", join( ",", @$row ), "\n";
    }
    else {
        # one field per line, with a newline after the last field of the row
        print join( "\n", @$row ), "\n";
    }
}
Or simpler yet - use Text::ParseWords which is core.
#!/usr/bin/env perl
use strict;
use warnings;
use Text::ParseWords;
while ( my $line = <DATA> ) {
    my @fields = parse_line(',', 1, $line);
    print join "\n", @fields;
}
__DATA__
"RAM","31st street, Bengaluru, India",,,,"7865431234",,"VALID"

Resources