Convert file input.csv.
id,location_id,organization_id,service_id,name,title,email,department
36,,,22,Joe Smith,third-party,john.smith#example.org,third-party Applications
18,11,,,Dave Genesy,Head of office,,
14,9,,,David Genesy,Library Director,,
22,14,,,Andres Espinoza,"Manager, Commanding Officer",,
(Done!) Need to update the name column. Name format: first letter of the name and surname uppercase, all other letters lowercase.
(Done!) Need to update the email column with the domain #abc. Email format: first letter of the name plus the full surname, lowercase.
(Not done) Duplicate emails should contain numbers. Example: name Max Houston, email mhouston1#examples.com, etc.
#!/bin/bash
inputfile="accounts.csv"
echo "id,location_id,organization_id,service_id,name,title,email,department" > accounts_new.csv
while IFS="," read -r rec_column1 rec_column2 rec_column3 rec_column4 rec_column5 rec_column6 rec_column7 rec_column8
do
    # first letter of the first name + full surname, e.g. "jsmith"
    surnametemp="${rec_column5:0:1}$(echo "$rec_column5" | awk '{print $2}')"
    # uppercase the letter following each space (GNU sed \U)
    namesurname=$(echo "$rec_column5" | sed 's! .!\U&!g')
    echo "$rec_column1,$rec_column2,$rec_column3,$rec_column4,$namesurname,$rec_column6,${surnametemp,,}#abc.com,$rec_column8" >> accounts_new.csv
done < <(tail -n +2 "$inputfile")
How can I do that?
Expected output file:
id,location_id,organization_id,service_id,name,title,email,department
14,9,,,Dave Genesy,Library Director,dgenesy#abc.com,
14,9,,,David Genesy,Library Director,dgenesy2#abc.com,
15,9,,,maria Kramer,Library Divisions Manager,mkramer#abc.com,
26,18,,,Sharon Petersen,Administrator,spetersen#abc.com,
27,19,,,Shen Petersen,Administrator,spetersen2#abc.com,
Task specification
This task would be much easier if it were specified otherwise:
add an email iterator to every email
or
add an email iterator to the second, third, ... occurrence
But it was specified:
add an email iterator to every email if that email is used multiple times.
This specification requires two passes over the lines, which makes the task more difficult.
The right tool
My rule of thumb is: use plain shell tools (grep, sed, etc.) for simple tasks, awk for moderate tasks, and python for complicated tasks. In this case (two passes over the lines) I would use python. However, there was no python tag in the problem specification, so I used awk.
Solution
<accounts.csv \
gawk -vFPAT='[^,]*|"[^"]*"' \
'
BEGIN {
    OFS = ","
}
{
    if ($7 == "") {
        split($5, name, " ")
        firstname = substr(tolower(name[1]), 1, 1)
        lastname = tolower(name[2])
        domain = "#abc.com"
        $7 = firstname "." lastname domain
    }
    emailcounts[$7]++
    immutables[++iter] = $1","$2","$3","$4","$5","$6","$8
    emails[iter] = $7
}
END {
    for (iter in immutables) {
        if (emailcounts[emails[iter]] > 1) {
            emailiter[emails[iter]]++
            email = gensub(/#/, emailiter[emails[iter]]"#", "g", emails[iter])
        } else {
            email = emails[iter]
        }
        print immutables[iter], email
    }
}'
Results
id,location_id,organization_id,service_id,name,title,department,email
36,,,22,Joe Smith,third-party,third-party Applications,john.smith#example.org
18,11,,,Dave Genesy,Head of office,,d.genesy1#abc.com
14,9,,,David Genesy,Library Director,,d.genesy2#abc.com
22,14,,,Andres Espinoza,"Manager, Commanding Officer",,a.espinoza#abc.com
Explanation
-vFPAT='[^,]*|"[^"]*"' parse the CSV, honoring quoted fields
$7=firstname "." lastname domain substitute the email field
emailcounts[$7]++ count email occurrences
iter iterator to preserve order
immutables[++iter]=$1","$2","$3","$4","$5","$6","$8 save the non-email fields for the second loop
emails[iter]=$7 save the email for the second loop
for (iter in immutables) iterate over the keys of the immutables array
if (emailcounts[emails[iter]] > 1) change the email if it occurs more than once
emailiter[emails[iter]]++ increase the email iterator
email=gensub(/#/, emailiter[emails[iter]]"#", "g", emails[iter]) insert the iterator into the email
print immutables[iter], email print the record
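The duplicate-numbering logic in the END block can be illustrated in isolation with a minimal sketch (plain awk, one address per line; the sample addresses are made up):

```shell
printf '%s\n' 'jsmith#abc.com' 'mkramer#abc.com' 'jsmith#abc.com' |
awk '
{ count[$1]++; line[NR] = $1 }        # first pass: count occurrences
END {
  for (i = 1; i <= NR; i++) {         # second pass: number duplicates
    e = line[i]
    if (count[e] > 1) {
      seen[e]++
      sub(/#/, seen[e] "#", e)        # jsmith#... -> jsmith1#...
    }
    print e
  }
}'
# prints:
# jsmith1#abc.com
# mkramer#abc.com
# jsmith2#abc.com
```

Only addresses that occur more than once get a number; unique addresses pass through unchanged, exactly as the specification requires.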
With the input file (mails.csv) as:
id,location_id,organization_id,service_id,name,title,email,department
14,9,,,Dave Genesy,Library Director,dgenesy#abc.com,
14,9,,,David Genesy,Library Director,dgenesy#abc.com,
15,9,,,maria Kramer,Library Divisions Manager,mkramer#abc.com,
26,18,,,Sharon Petersen,Administrator,spetersen#abc.com,
27,19,,,Shen Petersen,Administrator,spetersen2#abc.com,
You can use awk like so:
awk -F, 'NR>1 { mails[$7]+=1;if ( mails[$7] > 1 ) { OFS=",";split($7,mail1,"#");$7=mail1[1]mails[$7]"#"mail1[2] } else { $0=$0 } }1' mails.csv
Set the field delimiter to , and create an array keyed by email address, incrementing the count every time an address is encountered. If there is more than one occurrence of the address, split it on "#" into another array, mail1. Set $7 to the first element of mail1 (the part before the #), followed by the current count for that address, then "#" and the second element of mail1 (the part after the #). If there is only one occurrence of the address, the else branch leaves the line as is. The final 1 prints every line.
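To sanity-check the one-liner, it can be fed a slice of the sample input directly (here via printf instead of a file):

```shell
printf '%s\n' \
  'id,location_id,organization_id,service_id,name,title,email,department' \
  '14,9,,,Dave Genesy,Library Director,dgenesy#abc.com,' \
  '14,9,,,David Genesy,Library Director,dgenesy#abc.com,' |
awk -F, 'NR>1 { mails[$7]+=1;if ( mails[$7] > 1 ) { OFS=",";split($7,mail1,"#");$7=mail1[1]mails[$7]"#"mail1[2] } else { $0=$0 } }1'
# the second dgenesy#abc.com becomes dgenesy2#abc.com
```

Note that the first occurrence keeps its original address; only repeats are numbered, matching the desired output above.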
I have a set of wireless stats from various branches in the organization:
branchA,171
branchA_guests,1020
branchB,2019
branchB_guests,3409
There are 2 entries for each branch: 1st is internal wifi usage, the next is guest usage. I'd like to merge them into a single total as we don't care whether it's guests or staff ...etc.
Desired output should be:
branchA,1191
branchB,5428
The input file has a header and some markdown, so the script has to identify a match rather than assume the next line is related. Though the data could be cleaned first, I think matching makes this more bulletproof.
Here is my approach: Remove the _guests and tally:
# file: tally.awk
BEGIN {
FS = OFS = ","
}
{
sub(/_guests/, "", $1) # Remove _guests
stat[$1] += $2 # Tally
}
END {
for (branch in stat) {
printf "%s,%d\n", branch, stat[branch]
}
}
Running the script:
awk -f tally.awk data.txt
Notes
In the BEGIN block, I set the field separator (FS) and the output field separator (OFS) to a comma.
Next, for each line, I remove the _guests part and tally the count.
Finally, at the end of the file, I print out the counts.
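Since the real file contains a header and some markdown, one way to make the script more robust is to tally only lines whose second field is numeric. This is a sketch; the guard pattern is an assumption about what the extra lines look like (the output is piped through sort because awk's for-in order is unspecified):

```shell
printf '%s\n' 'branch,count' 'branchA,171' 'branchA_guests,1020' 'branchB,2019' 'branchB_guests,3409' |
awk '
BEGIN { FS = OFS = "," }
$2 ~ /^[0-9]+$/ {             # skip the header and any markdown lines
  sub(/_guests/, "", $1)      # Remove _guests
  stat[$1] += $2              # Tally
}
END {
  for (branch in stat) printf "%s,%d\n", branch, stat[branch]
}' | sort
# branchA,1191
# branchB,5428
```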
I have about one hundred Markdown files that contain snippets of Latex like this:
<div latex="true" class="task" id="Task">
(#) Delete the fourth patterns from your .teach file and your .data files. Remember to change the second line in each so that Tlearn knows there are now only three patterns.
- They should look like [#fig:dataTeach]
</div>
I'd like to replace the <div> tags with pseudotags that are easier to read, like this:
<task>
(#) Delete the fourth patterns from your .teach file and your .data files. Remember to change the second line in each so that Tlearn knows there are now only three patterns.
- They should look like [#fig:dataTeach]
</task>
This would be trivial if all my <div> tags were marking 'tasks', but I have similar divs for 'journal' and 'highlight'. I need a process that will change the </div> to </task> only when the preceding <div> has the class or id 'task', and likewise for 'journal' and 'highlight'.
Having looked around Stack Overflow for a while, I find many examples of multiline search and replace that do almost what I want to do, but the syntax (particularly for sed) is so difficult to untangle I can't adapt it for the above case. My next option is to write a bash script to loop through line by line, but I have a feeling this might be too fragile.
Cheers
Ian
The following awk command works generically, under the following assumptions:
All opening and closing div tags are on their own lines.
Attributes all use "-quoting.
The new tag name is derived from the value of the class attribute only (this could be generalized if the rules were clearer).
awk -F ' class="' '
/^<div / && NF > 1 { tag=$2; sub("\".*", "", tag); printf "<%s>\n", tag; next }
/^<\/div>/ && tag != "" { printf "</%s>\n", tag; tag=""; next }
1
' file
-F ' class="' effectively splits each line into before (field 1, $1) and after (field 2, $2) the class attribute, if present. Only lines that have such an attribute will therefore have more than 1 field (NF > 1).
Processing the opening div tag:
Pattern /^<div / && NF > 1 therefore only matches lines that start with (^) <div and (&&) contain a class attribute (NF > 1).
tag=$2; sub("\".*", "", tag) extracts the class attribute value from the 2nd field, by replacing everything from the first " (the closing " of the attribute value) with the empty string, effectively retaining the attribute value only in variable tag.
printf "<%s>\n", tag prints the attribute value as the replacement opening tag.
next skips the rest of the script and moves to the next input line.
Processing the closing div tag:
/^<\/div>/ && tag != "" matches the closing div tag, assuming that a class attribute value was found in the previous opening tag (tag != "").
printf "</%s>\n", tag prints the new closing tag.
tag="" resets the most recent replacement tag so that any subsequent div elements that do not have class attributes don't accidentally get renamed too.
next skips the rest of the script and moves to the next input line.
All other lines:
1 simply prints all other lines as-is. (1 is a common Awk shorthand for { print }: Pattern 1, interpreted as a Boolean, is by definition true, and a pattern without associated action { ... } prints the input line by default).
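To sanity-check the command, here it is applied to a small inline sample (the sample block is made up; a line without a div is included to show it passes through untouched):

```shell
awk -F ' class="' '
  /^<div / && NF > 1 { tag=$2; sub("\".*", "", tag); printf "<%s>\n", tag; next }
  /^<\/div>/ && tag != "" { printf "</%s>\n", tag; tag=""; next }
  1
' <<'EOF'
<div latex="true" class="task" id="Task">
Do the thing.
</div>
Untouched line.
EOF
# <task>
# Do the thing.
# </task>
# Untouched line.
```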
No loop needed. Just pipe the files through this...
sed '/Task/s/<div.*>/<task>/g;s/<\/div>/<\/task>/g'
/Task at the beginning makes sed edit only lines containing the string Task.
With s/NAME/NEWNAME/ you replace one piece of text with another.
Adding .* matches all remaining text from that point on.
Last but not least, g stands for global and applies the edit to every match on the line.
The second command (after the ;) replaces </div> with </task>; it is part of the same sed invocation as before. The difference this time is that the / (slash) character would be interpreted by sed itself if not declared otherwise, which is achieved with a \ (backslash). Note that this second substitution applies to every </div> in the file, so it only gives the desired result if all divs are task divs.
Here you go. The output for your file will look like this:
<task>
(#) Delete the fourth patterns from your .teach file and your .data files. Remember to change the second line in each so that Tlearn knows there are now only three patterns.
- They should look like [#fig:dataTeach]
</task>
This might work for you (GNU sed):
v='task|journal|highlight'
sed -ri '/^<div/{:a;N;/^<\/div/M!ba;s/^<.*class="('$v')"[^>]*(.*<\/)div/<\1\2\1/}' file1 file2 file3 ...
This stores the div statements in the pattern space and then substitutes (or not) the required values depending on the shell variable set beforehand.
N.B. the alternatives are stored in the shell variable v separated by |
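A quick check of the command on a throwaway sample file (GNU sed is required for -r and the M flag; the file name sample.md and its contents are just for illustration):

```shell
v='task|journal|highlight'
cat > sample.md <<'EOF'
<div latex="true" class="journal" id="Journal">
Some reflection.
</div>
EOF
sed -ri '/^<div/{:a;N;/^<\/div/M!ba;s/^<.*class="('$v')"[^>]*(.*<\/)div/<\1\2\1/}' sample.md
cat sample.md
# <journal>
# Some reflection.
# </journal>
```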
This should do the trick:
$msys\bin\sed -En "s/<div latex=\"true\" class=\"task\" id=\"Task\">/<task>/;T;{:a;N;s/<\/div>/<\/task>/;Ta;p;}" input.txt
These are the building blocks, in case you want to adapt it:
make a loop: {:a;
it ends when the second replacement triggers: s/<\/div>/<\/task>/;Ta;
only start it if the first replacement triggered: s/<div latex=\"true\" class=\"task\" id=\"Task\">/<task>/;T;
inside the loop, just collect lines into the pattern space: N;
at the end of the loop, just print: p;}
It is called with extended regular expressions (-E) and without default printing (-n)
(mine is a Windows/msys sed, just so you know): $msys\bin\sed -En
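On Linux the same script runs with a plain GNU sed invocation. A quick demonstration on an inline sample (the block and the extra trailing line are made up; -n suppresses everything except the rewritten blocks):

```shell
sed -En "s/<div latex=\"true\" class=\"task\" id=\"Task\">/<task>/;T;{:a;N;s/<\/div>/<\/task>/;Ta;p;}" <<'EOF'
<div latex="true" class="task" id="Task">
Body line.
</div>
Other text that is not printed.
EOF
# <task>
# Body line.
# </task>
```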