Bash/Sed/Awk - parsing CSV from column ==x until column==x again - bash

I've got a rather large set of CSVs that I need to parse. Most of it is extremely easy; however, I've got some 'group' objects with embedded objects that I need to extract correctly.
The file looks something like this
Test_GroupA,Group,-,-,-,-,NodeA,,-,
,,,,,,NodeB,,,
,,,,,,NodeC,,,
,,,,,,NodeD,,,
,,,,,,NodeE,,,
Test_GroupB,Group,-,-,-,-,NodeA,,-,
,,,,,,NodeB,,,
,,,,,,NodeC,,,
,,,,,,NodeX,,,
,,,,,,NodeE,,,
,,,,,,NodeF,,,
So, as you can see, I need something along the lines of:
awk -F"[,|]" '{if ($2=="Group")
then - pseudo code->
print "create group",$1
print "add member in $7 to group found in $1 of first row"
continue until you reach next $2=="Group"), then loop
This is perplexing me greatly :)
Edit::
It seems a lot of the values are somewhat bogus and contain '-' when they're blank instead of just being ,,
Something like
sed 's/\,\-\,/\,\,/g'
should replace them I'd think, however I think I need a leading wildcard.
New example:
grp-ext-test-test,Group,-,-,-,-,Net_10.10.10.10,,-,
,,,,,,Net_10.101.10.10,,,
,,,,,,ws-ext-test-10.102,,,
,,,,,,ws-ext-test-10.103,,,
,,,,,,ws-ext-test-10.104,,,
,,,,,,ws-ext-test-10.105,,,
,,,,,,ws-ext-test-10.106,,,
,,,,,,ws-ext-test-10.107,,,
,,,,,,ws-ext-test-10.108,,,
,,,,,,ws-ext-test-10.108,,,
Running the new string on it only produces:
create group grp-ext-test-test

You could try something like this and adapt it as required:
awk -F, '$2=="Group"{g=$1; print "create group",g}{print "add " $7 " to " g}' file
Output:
create group Test_GroupA
add NodeA to Test_GroupA
add NodeB to Test_GroupA
add NodeC to Test_GroupA
add NodeD to Test_GroupA
add NodeE to Test_GroupA
create group Test_GroupB
add NodeA to Test_GroupB
add NodeB to Test_GroupB
add NodeC to Test_GroupB
add NodeX to Test_GroupB
add NodeE to Test_GroupB
add NodeF to Test_GroupB
---edit---
To check if the contents of $7 are valid you could try something like:
awk -F, '$2=="Group"{ g=$1; print "create group",g } $7!="-"{print "add " $7 " to " g}' file
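As a rough sketch of how the two commands above could be combined, the function below also skips rows whose seventh field is empty or the "-" placeholder, and ignores member rows that appear before any group header. The function name emit_group_commands is hypothetical, and the sketch assumes no field contains an embedded comma.

```shell
# Hypothetical helper: print "create group" / "add ... to ..." lines.
# Assumes plain comma-separated input with no quoted commas.
emit_group_commands() {
  awk -F, '
    # A header row starts a new group.
    $2 == "Group" { group = $1; print "create group", group }
    # Any row (header included) may carry a member in field 7.
    group != "" && $7 != "" && $7 != "-" { print "add " $7 " to " group }
  ' "$@"
}
```

It can be called as emit_group_commands file, or fed from a pipe when no file name is given.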

Related

Bash: compare if a value is contained within an interval

I have a tab-separated text file that contains 2 columns, like this:
1227637 1298347
1347879 1356788
1389993 1399847
... ...
Now I have some values from an analysis, and I'd like to check whether these values are contained in the intervals in my text file.
For example if I have 1227659, which is contained in the first interval, I'd like the bash-script to print to std out something like:
1227659 is contained between 1227637 and 1298347
Thanks.
How about:
awk -v x=1227659 '
$1<x && x<$2 {print x, "is contained between", $1, "and", $2}
' intervals.txt
1227659 is contained between 1227637 and 1298347
If you want either end of the interval to be inclusive, change < to <= accordingly. If you want to stop after the first match (this only makes sense if the intervals can overlap), add ; exit before the closing curly brace }.
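A minimal sketch of the inclusive, stop-at-first-match variant described above, wrapped in a shell function. The name find_interval and the non-zero exit status on "no match" are assumptions added for illustration, not part of the original answer.

```shell
# Hypothetical helper: find_interval VALUE FILE
# FILE is assumed to hold tab-separated "start<TAB>end" lines.
find_interval() {
  awk -F'\t' -v x="$1" '
    # Both ends inclusive; stop at the first matching interval.
    $1 <= x && x <= $2 {
      print x, "is contained between", $1, "and", $2
      found = 1
      exit
    }
    # Exit status 0 if a match was printed, 1 otherwise.
    END { exit !found }
  ' "$2"
}
```

Returning a status makes the function usable in a shell condition, for example: if find_interval 1227659 intervals.txt; then ...; fi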

Search duplicates in a column, add value

Convert file input.csv.
id,location_id,organization_id,service_id,name,title,email,department
36,,,22,Joe Smith,third-party,john.smith#example.org,third-party Applications
18,11,,,Dave Genesy,Head of office,,
14,9,,,David Genesy,Library Director,,
22,14,,,Andres Espinoza, Manager Commanding Officer,,
(Done!) Update the name column. Name format: first letter of the name and of the surname uppercase, all other letters lowercase.
(Done!) Update the email column with the domain #abc.com. Email format: first letter of the name plus the full surname, lowercase.
(Not done) Emails with the same ID should contain numbers. Example: name Max Houston, email mhouston1#examples.com, etc.
#!/bin/bash
inputfile="accounts.csv"
echo "id,location_id,organization_id,service_id,name,title,email,department" > accounts_new.csv
while IFS="," read -r rec_column1 rec_column2 rec_column3 rec_column4 rec_column5 rec_column6 rec_column7 rec_column8
do
surnametemp="${rec_column5:0:1}$(echo $rec_column5 | awk '{print $2}')"
namesurname=$(echo $rec_column5 | sed 's! .!\U&!g')
echo $rec_column1","$rec_column2","$rec_column3","$rec_column4","$namesurname","$rec_column6",""${surnametemp,,}#abc.com"","$rec_column8 >>accounts_new.csv
done < <(tail -n +2 $inputfile)
How can I do that?
Output file
id,location_id,organization_id,service_id,name,title,email,department
14,9,,,Dave Genesy,Library Director,dgenesy#abc.com,
14,9,,,David Genesy,Library Director,dgenesy2#abc.com,
15,9,,,maria Kramer,Library Divisions Manager,mkramer#abc.com,
26,18,,,Sharon Petersen,Administrator,spetersen#abc.com,
27,19,,,Shen Petersen,Administrator,spetersen2#abc.com,
Task specification
This task would be much easier if specified otherwise:
add email iterator to every email
or
add email iterator to second,third... occurrence
But it was specified:
add email iterator to every email if email is used multiple times.
This specification requires double iteration through lines, thus making this task more difficult.
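One way to picture the double iteration is to let awk read the same file twice: the first pass counts each email, the second pass numbers only the emails that occur more than once. This is only a sketch under simplifying assumptions: the function name number_duplicate_emails is hypothetical, the email is assumed to be already filled in as column 7, and fields are assumed not to contain quoted commas.

```shell
# Hypothetical helper: number_duplicate_emails FILE
# Reads FILE twice; assumes a header line and the email in column 7.
number_duplicate_emails() {
  awk -F, -v OFS=, '
    # Pass 1 (NR == FNR only while reading the first copy): count emails.
    NR == FNR { if (FNR > 1) count[$7]++; next }
    # Pass 2: add a running number before "#" when the email is repeated.
    FNR > 1 && count[$7] > 1 { n = ++seen[$7]; sub(/#/, n "#", $7) }
    { print }
  ' "$1" "$1"
}
```

Reading the file twice avoids holding every row in memory, at the cost of requiring a real file rather than a pipe.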
The right tool
My rule of thumb is: use standard shell tools (grep, sed, etc.) for simple tasks, awk for moderate tasks, and python for complicated tasks. In this case (double iteration over lines) I would use python. However, there was no python tag on the question, so I used awk.
Solution
<accounts.csv \
gawk -vFPAT='[^,]*|"[^"]*"' \
'
BEGIN {
    OFS = ","
}
{
    # Build an email only when the field is empty.
    if ($7 == "") {
        split($5, name, " ")
        firstname = substr(tolower(name[1]), 1, 1)
        lastname = tolower(name[2])
        domain = "#abc.com"
        $7 = firstname "." lastname domain
    }
    emailcounts[$7]++
    # Save the row for the second loop, in input order.
    immutables[++iter] = $1 "," $2 "," $3 "," $4 "," $5 "," $6 "," $8
    emails[iter] = $7
}
END {
    # A counted loop keeps input order; "for (i in array)" order is unspecified.
    for (i = 1; i <= iter; i++) {
        if (emailcounts[emails[i]] > 1) {
            emailiter[emails[i]]++
            email = gensub(/#/, emailiter[emails[i]] "#", "g", emails[i])
        } else {
            email = emails[i]
        }
        print immutables[i], email
    }
}'
Results
id,location_id,organization_id,service_id,name,title,department,email
36,,,22,Joe Smith,third-party,third-party Applications,john.smith#example.org
18,11,,,Dave Genesy,Head of office,,d.genesy1#abc.com
14,9,,,David Genesy,Library Director,,d.genesy2#abc.com
22,14,,,Andres Espinoza,"Manager, Commanding Officer",,a.espinoza#abc.com
Explanation
-vFPAT='[^,]*|"[^"]*"': read CSV, treating a double-quoted string as one field
$7 = firstname "." lastname domain: fill in the email field when it is empty
emailcounts[$7]++: count email occurrences
iter: row counter, used to preserve input order
immutables[++iter] = $1 "," $2 "," $3 "," $4 "," $5 "," $6 "," $8: save the non-email fields for the second loop
emails[iter] = $7: save the email for the second loop
for (i = 1; i <= iter; i++): walk the saved rows in input order
if (emailcounts[emails[i]] > 1): change the email only if it occurs more than once
emailiter[emails[i]]++: increase the per-email iterator
email = gensub(/#/, emailiter[emails[i]] "#", "g", emails[i]): insert the iterator before the #
print immutables[i], email: print the row with the email as the last column
With the input (mailscsv) file as:
id,location_id,organization_id,service_id,name,title,email,department
14,9,,,Dave Genesy,Library Director,dgenesy#abc.com,
14,9,,,David Genesy,Library Director,dgenesy#abc.com,
15,9,,,maria Kramer,Library Divisions Manager,mkramer#abc.com,
26,18,,,Sharon Petersen,Administrator,spetersen#abc.com,
27,19,,,Shen Petersen,Administrator,spetersen2#abc.com,
You can use awk like so:
awk -F, ' NR>1 { mails[$7]+=1;if ( mails[$7] > 1 ) { OFS=",";split($7,mail1,"#");$7=mail1[1]mails[$7]"#"mail1[2] } else { $0=$0 } }1' mailscsv
Set the field delimiter to , and create an array mails keyed by email address, incrementing its count every time the address is encountered. If the address has been seen more than once, split it on "#" into another array, mail1. Then set $7 to the first element of mail1 (the part before #), followed by the current count for that address, then "#" and the second element of mail1 (the part after #). OFS is set to , at that point so that the rebuilt line keeps its commas. If the address has been seen only once, simply leave the whole line as is. The trailing 1 prints the line.

Bash/Linux: Merge rows on match; add last field

I have a set of wireless stats from various branches in the organization:
branchA,171
branchA_guests,1020
branchB,2019
branchB_guests,3409
There are 2 entries for each branch: 1st is internal wifi usage, the next is guest usage. I'd like to merge them into a single total as we don't care whether it's guests or staff ...etc.
Desired output should be:
branchA,1191
branchB,5428
The input file has a header and some markdown so it has to identify a match, not just assume the next line is related --- though the data could be cleaned first, it is my opinion that a match would make this more bulletproof.
Here is my approach: Remove the _guests and tally:
# file: tally.awk
BEGIN {
    FS = OFS = ","
}
{
    sub(/_guests/, "", $1)  # Remove _guests
    stat[$1] += $2          # Tally
}
END {
    for (branch in stat) {
        printf "%s,%d\n", branch, stat[branch]
    }
}
Running the script:
awk -f tally.awk data.txt
Notes
In the BEGIN pattern, I set the field separator (FS) and output field separator (OFS) both to a comma
Next, for each line, I remove the _guests part and tally the count
Finally, at the end of the file, I print out the counts
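Since the question mentions that the real input also contains a header and some markdown, here is a hedged sketch of the same tally that only counts lines of the form name,number and sorts the result. The function name tally_branches and the "exactly two fields, second one all digits" rule are assumptions about what a data line looks like.

```shell
# Hypothetical helper: tally_branches [FILE...]
# Assumes data lines look like "name,digits"; everything else is ignored.
tally_branches() {
  awk -F, '
    NF == 2 && $2 ~ /^[0-9]+$/ {
      name = $1
      sub(/_guests$/, "", name)   # fold guest usage into the branch
      total[name] += $2
    }
    END { for (name in total) print name "," total[name] }
  ' "$@" | sort
}
```

The final sort is there because awk does not guarantee any particular order for a for-in loop over an array.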

Replace string with text only when a given text precedes it

I have about one hundred Markdown files that contain snippets of LaTeX like this:
<div latex="true" class="task" id="Task">
(#) Delete the fourth patterns from your .teach file and your .data files. Remember to change the second line in each so that Tlearn knows there are now only three patterns.
- They should look like [#fig:dataTeach]
</div>
I'd like to replace the <div> tags with pseudotags that are easier to read, like this:
<task>
(#) Delete the fourth patterns from your .teach file and your .data files. Remember to change the second line in each so that Tlearn knows there are now only three patterns.
- They should look like [#fig:dataTeach]
</task>
This would be trivial if all my <div> tags were marking 'tasks', but I have similar divs for 'journal' and 'highlight'. I need a process that will change the </div> to </task> only when the preceding <div> has the class or id 'task', and likewise for 'journal' and 'highlight'.
Having looked around Stack Overflow for a while, I find many examples of multiline search and replace that do almost what I want to do, but the syntax (particularly for sed) is so difficult to untangle I can't adapt it for the above case. My next option is to write a bash script to loop through line by line, but I have a feeling this might be too fragile.
Cheers
Ian
The following awk command works generically, under the following assumptions:
All opening and closing div tags are on their own lines.
Attributes all use "-quoting.
The new tag name is derived from the value of the class attribute only (this could be generalized if the rules were clearer).
awk -F ' class="' '
/^<div / && NF > 1 { tag=$2; sub("\".*", "", tag); printf "<%s>\n", tag; next }
/^<\/div>/ && tag != "" { printf "</%s>\n", tag; tag=""; next }
1
' file
-F ' class="' effectively splits each line into before (field 1, $1) and after (field 2, $2) the class attribute, if present. Only lines that have such an attribute will therefore have more than 1 field (NF > 1).
Processing the opening div tag:
Pattern /^<div / && NF > 1 therefore only matches lines that start with (^) <div and (&&) contain a class attribute (NF > 1)
tag=$2; sub("\".*", "", tag) extracts the class attribute value from the 2nd field, by replacing everything from the first " (the closing " of the attribute value) with the empty string, effectively retaining the attribute value only in variable tag.
printf "<%s>\n", tag prints the attribute value as the replacement opening tag.
next skips the rest of the script and moves to the next input line.
Processing the closing div tag:
/^<\/div>/ && tag != "" matches the closing div tag, assuming that a class attribute value was found in the previous opening tag (tag != "").
printf "</%s>\n", tag prints the new closing tag.
tag="" resets the most recent replacement tag so that any subsequent div elements that do not have class attributes don't accidentally get renamed too.
next skips the rest of the script and moves to the next input line.
All other lines:
1 simply prints all other lines as-is. (1 is a common Awk shorthand for { print }: Pattern 1, interpreted as a Boolean, is by definition true, and a pattern without associated action { ... } prints the input line by default).
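The steps above can be sketched as a shell function that only renames divs whose class is one of the three names from the question, leaving any other div untouched. The function name rename_divs and the hard-coded class list are assumptions for illustration; the same caveats apply (tags on their own lines, double-quoted attributes, no nested divs).

```shell
# Hypothetical helper: rename_divs [FILE...]
# Renames <div class="task|journal|highlight"> ... </div> pairs only.
rename_divs() {
  awk -F ' class="' '
    /^<div / && NF > 1 {
      tag = $2
      sub(/".*/, "", tag)               # keep only the class value
      if (tag ~ /^(task|journal|highlight)$/) {
        print "<" tag ">"
        next
      }
      tag = ""                          # unknown class: leave this div alone
    }
    /^<\/div>/ && tag != "" { print "</" tag ">"; tag = ""; next }
    1
  ' "$@"
}
```

To process many files, the output would need to be written to a temporary file per input and moved back, since plain awk does not edit in place.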
No loop needed. Just pipe the files though this...
sed '/Task/s/<div.*>/<task>/g;s/<\/div>/<\/task>/g'
/Task/ at the beginning makes sed apply the first substitution only to lines that contain Task.
With s/NAME/NEWNAME/ you replace the first match of NAME on a line with NEWNAME.
Adding .* extends the match up to the last > on the line, so the whole tag is replaced.
Last but not least, g stands for global and replaces every match on the line.
The second command (after ;) has no address, so it replaces </div> with </task> on every line. It is part of the same sed script as before. The difference this time is that sed itself uses / (slash) as the delimiter, so a literal slash has to be escaped. This is achieved with a \ (backslash).
Here you go. The output of your file will look like this....
<task>
(#) Delete the fourth patterns from your .teach file and your .data files. Remember to change the second line in each so that Tlearn knows there are now only three patterns.
- They should look like [#fig:dataTeach]
</task>
This might work for you (GNU sed):
v='task|journal|highlight'
sed -ri '/^<div/{:a;N;/^<\/div/M!ba;s/^<.*class="('$v')"[^>]*(.*<\/)div/<\1\2\1/}' file1 file2 file3 ...
This stores the div statements in the pattern space and then substitutes (or not) the required values depending on the shell variable set beforehand.
N.B. the alternatives are stored in the shell variable v separated by |
This should do the trick:
$msys\bin\sed -En "s/<div latex=\"true\" class=\"task\" id=\"Task\">/<task>/;T;{:a;N;s/<\/div>/<\/task>/;Ta;p;}" input.txt
These are the building blocks, in case you want to adapt it:
make a loop: {:a;
it ends when the second replacement triggers: s/<\/div>/<\/task>/;Ta;
only start it if the first replacement triggered: s/<div latex=\"true\" class=\"task\" id=\"Task\">/<task>/;T;
inside the loop just collect lines into pattern space: N;
at the end of the loop just print: p;}
called with extended regular expressions and without default printing (mine is a Windows/msys sed, just so you know): $msys\bin\sed -En

Grep for displaying count of multiple strings in a single file

Another question: can I get the count of items that are unique? In my previous case I just took a simple instance. My business requirement is as follows.
I have strings like the below:
happy=7
happy=5
happy=5,
Basically, I will be using a regex to search for the word happy, something like "happy=*". I need the output as "count of happy =2", as there is one duplicate instance.
Use awk:
awk '/happy/{ happy+=1 } /sad/ {sad += 1 }
END { print "happy =", happy+0, "sad = ", sad+0 }'
Note that like grep -c, this does not count occurrences of each word but the number of lines that match each word.
You're better off using something like perl or awk, where you can increment counters based on conditional statements.
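For the unique-count requirement in the question, a rough awk sketch could remember each distinct word=number match and count it once. The function name count_unique_matches is hypothetical, and the sketch assumes the value after = is a run of digits and that the word contains no regex metacharacters.

```shell
# Hypothetical helper: count_unique_matches WORD < input
# Counts distinct "WORD=digits" strings; assumes WORD has no regex metacharacters.
count_unique_matches() {
  awk -v word="$1" '
    {
      line = $0
      # Walk through every match on the line, not just the first.
      while (match(line, word "=[0-9]+")) {
        hit = substr(line, RSTART, RLENGTH)
        if (!(hit in seen)) { seen[hit] = 1; n++ }
        line = substr(line, RSTART + RLENGTH)
      }
    }
    END { print "count of " word " = " (n + 0) }
  '
}
```

Unlike the line-based counters above, this counts distinct matched strings, so happy=5 appearing twice contributes only once.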

Resources