awk store a pattern result to a shell array variable - bash

I am trying to store the result of a pattern matched by awk to a shell array variable. Here's a simplified example of the same:
#!/bin/bash
declare -a array1=()
declare -a array2=()
READ_FILE="directory1/read_file.csv"
WRITE_FILE="directory2/results.csv"
#variable for counting array index
count1=0
count2=0
#
#
# need help with line below
# $2 below is the second set of characters which is a floating point number
awk -F 'string1_to_search' '{$array1[count1++] = $2}' $READ_FILE
awk -F 'string2_to_search' '{$array2[count2++] = $2}' $READ_FILE
#count++ indicates post increment of count variable
#do something with the array
.
.
#end
Any suggestions would be helpful.

Something roughly like this, then?
awk '/string1_to_search/ {
count["id1"]++; sum["id1"] += $2 }
/string2_too/ {
count["id2"]++; sum["id2"] += $2 }
# ...
END { for (k in count) printf("%s: sum %f/count %i = avg %f\n", k, sum[k], count[k], sum[k]/count[k]) }' inputfile
I seem to recall there was a clever way to calculate a rolling variance without keeping the entire input set in memory; or else just collect the values space-separated value["id"] = value["id"] " " $2 and split into a list and loop over it near the end. Alternatively, simplify this to only examine one search string at a time and run it multiple times (let's hope then the input isn't very big). Or switch to Perl, which will easily let you collect lists of lists and other nested structures.
Obviously break out common functionality into separate functions so you don't have repeated code ... I suppose it's actually clearer like this, but if you find bugs, or need other changes, you only want to have to change one place in the code.
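As an aside, the rolling-variance trick alluded to above is Welford's online algorithm; here is a minimal sketch for a single search string, assuming the value of interest is in $2 and using the sample (n-1) variance:
awk '/string1_to_search/ {
n++
delta = $2 - mean
mean += delta / n
m2 += delta * ($2 - mean)
}
END { if (n > 1) printf("count %d, mean %f, variance %f\n", n, mean, m2 / (n - 1)) }' inputfile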

Another method is to have awk print the matched number and capture the output into a bash array, like this:
mapfile -t array1 < <( awk -F 'string1_to_search' '{print $2}' "$READ_FILE" )
Later, to compute the mean, variance and SD, we can use the bc tool from within bash.
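For instance, a rough sketch of that follow-up step, assuming array1 has been filled by the mapfile line above (bc -l is used so the arithmetic is done in floating point):
n=${#array1[@]}
sum=0
for v in "${array1[@]}"; do sum=$(echo "$sum + $v" | bc -l); done
mean=$(echo "$sum / $n" | bc -l)
ss=0
for v in "${array1[@]}"; do ss=$(echo "$ss + ($v - $mean)^2" | bc -l); done
variance=$(echo "$ss / ($n - 1)" | bc -l)
sd=$(echo "sqrt($variance)" | bc -l)
echo "mean=$mean variance=$variance sd=$sd"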

Related

Convert a bash array into an awk array

I have an array in bash and want to use this array in an awk script. How can I pass the array from bash to awk?
The keys of the awk array should be the indices of the bash array. For simplicity, we can assume that the bash array is dense, that is, the array is not sparse like a=([3]=x [5]=y).
The elements inside the array can have any value. Besides strange unicode symbols and ascii control characters they may contain spaces or even newlines. Also, there might be empty ("") entries which should be retained. As an example consider the following array:
a=(AB " C D " $'E\nF\tG' "¼ẞ🍕" "")
Extending approach #1 provided by Socowi, it is possible to address the shortcoming that he identified by using the awk split function. Note that this solution does not use stdin - it uses command line options - leaving awk free to process stdin, files, etc.
The solution converts the 'a' bash array into the 'a' awk array, passing the data through an intermediate file named in the awk variable AVF (via process substitution). This works around the bash limitation that prevents NUL from being stored in a string.
a=(AB " C D " $'E\nF\tG' "¼ẞ🍕" "")
awk -v AVF=<(printf '%s\0' "${a[@]}") '
BEGIN {
# Temporary RS to allow reading the array with a single read.
saveRS=RS
RS=""
getline AV < AVF
RS = saveRS
na=split(AV, a, "\\0")
# Remove trailing empty element (printf adds a trailing separator).
delete a[na]
na-- ; for (i=1 ; i<=na ; i++ ) print "AV#", i, "=" a[i]
}{
# Use a[x]
}
'
Output:
1 AB
2 C D
3 E
F G
4 ¼ẞ🍕
5
Previous solution: for practical reasons, using the '\001' character as separator makes the script much easier (any other character sequence that is known not to appear in the array would also work). Bash command substitution does not allow the NUL character. Hopefully this is not a major issue, as this control character does not appear in normal text data; I believe it is possible to work around it, but I'm not sure how.
The solution converts the 'a' bash array into the 'a' awk array, using the intermediate awk variable 'AV'.
a=(AB " C D " $'E\nF\tG' "¼ẞ🍕" "")
awk -v AV="$(printf '%s\1' "${a[@]}")" '
BEGIN {
na=split(AV, a, "\\1")
# Remove trailing empty element (printf adds a trailing separator).
delete a[na]
na-- ; for (i=1 ; i<=na ; i++ ) print "AV#", i, "=" a[i]
}
{
# Use a[x]
}
'
Approach 1: Reading in awk
Since the array elements can contain any character but the null byte (\0) we have to delimit them by \0. This is done with printf. For simplicity we assume that the array has at least one entry.
Due to the \0 we can no longer pass the string to awk as an argument but have to use (or emulate) a file instead. We then read that file in awk using \0 as the record separator RS (may require GNU awk).
awk 'BEGIN {RS="\0"} {a[n++]=$0; next}' <(printf %s\\0 "${a[@]}")
This reliably constructs the awk array a from the bash array a. The length of a is stored in n.
This approach is ugly when you actually want to use it. There is no simple step-by-step instruction on how to incorporate this approach into your existing awk script. Normally, your awk script would read another file afterwards, therefore you have to change the record separator RS after the array file was read. This can be done with NR>FNR. However, if your awk script already reads multiple files and relies on something like NR==FNR things get complicated.
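One hedged, GNU-awk-only way to handle that switch is to reset RS per input file in a BEGINFILE block (datafile here is a placeholder for whatever your script reads next):
gawk '
BEGINFILE { RS = (++filenum == 1 ? "\0" : "\n") }  # NUL-delimited array stream first, then normal lines
NR == FNR { a[n++] = $0; next }                    # slurp the bash array
{ print "data line " FNR ": " $0 }                 # regular processing of datafile
' <(printf '%s\0' "${a[@]}") datafile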
Approach 2: Generating awk Code with bash
Instead of parsing the array in awk we hard-code the array by generating awk code. This code will be injected at the beginning of an existing awk script and initialize the array. This approach also supports sparse arrays and associative arrays and should work with all awk versions, not only GNU.
For the code generation we have to correctly quote all strings. For example, the code generator echo "a[0]=${a[0]}" would fail if ${a[0]} was ", resulting in the broken code a[0]=""". POSIX awk supports octal escape sequences (\012) which can encode all bytes. We simply encode everything. That way we cannot forget any special symbols (even though the generated code is a bit inefficient).
octString() {
printf %s "$*" | od -bvAn | tr ' ' '\\' | tr -d '\n'
}
arrayToAwk() {
printf 'BEGIN{'
n=0
for key in "${!a[@]}"; do
printf 'a["%s"]="%s";' "$(octString "$key")" "$(octString "${a[$key]}")"
((n++))
done
echo "n=$n}"
}
The function arrayToAwk converts the bash array a (can be sparse or associative) into a BEGIN block. After inserting the generated code block at the beginning of your existing awk program you can use the awk array a anywhere inside awk without having to adapt anything (assuming that the variable names a and n were unused before). n is the size of the awk array a.
For awk commands of the form awk ... 'program' ... use
awk ... "$(arrayToAwk)"'program' ...
For big arrays this might result in the error Argument list too long. You can circumvent this problem using a program file:
awk ... -f <(arrayToAwk; echo 'program') ...
For awk commands of the form awk ... -f progfile ... use
awk ... -f <(arrayToAwk; cat progfile) ...
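A small end-to-end illustration with the sample array from the question (since the generated block and the dump below are both BEGIN blocks, awk exits without waiting for input):
a=(AB " C D " $'E\nF\tG' "¼ẞ🍕" "")
awk -f <(arrayToAwk; echo 'BEGIN { for (i = 0; i < n; i++) print i ": " a[i] }')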
I'd like to point out that this can be extremely simple if you do not mind using ARGV and deleting all the non-file arguments. One way:
>cat awk_script.sh
#!/bin/awk -f
BEGIN{
i=1
while(ARGV[i] != "--" && i < ARGC) {
print ARGV[i]
delete ARGV[i]
i++
}
if(i < ARGC)
delete ARGV[i]
} {
print "File 1 contains at 1",$1
}
Then run it with:
>./awk_script.sh "${a[@]}" -- file1
AB
C D
E
F G
¼ẞ�
File 1 contains at 1 a
Obviously I'm missing some symbols.
Note that while I like this method, it assumes -- is not in the array, as pointed out by Oguz Ismail. They give a great alternate solution: pass the length of your list as the first argument (sketched after the one-liner note below).
This can be reduced to a one-liner of the form
awk 'BEGIN{... get and delete first arguments ...}{process files}END{if wanted}' "${a[@]}" -- file1 file2...
but it will become unreadable very quickly.
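A hedged sketch of that length-prefix alternative (the script and variable names here are made up): the first argument is the element count, the next that many arguments are the array, and everything after that is treated as files:
>cat awk_script2.sh
#!/bin/awk -f
BEGIN {
n = ARGV[1]; delete ARGV[1]
for (i = 1; i <= n; i++) {
arr[i] = ARGV[i + 1]     # copy the next n arguments into an awk array
delete ARGV[i + 1]       # so awk does not try to open them as input files
}
} {
print "File 1 contains at 1",$1
}
Then run it with:
>./awk_script2.sh "${#a[@]}" "${a[@]}" file1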

How do I do a for loop with 2 arrays in shell script?

I have to first declare two arrays which I also need help with.
Originally, it's two single variables.
day=$(hadoop fs -ls -R /user/hive/* |
awk '/filename.txt.gz/' |
tail -1 |
date -d $(echo `awk '{print $6}'`) '+%b %-d' |
tr -d ' ')
time_stamp=$(hadoop fs -ls -R /user/hive/* |
awk '/filename.txt.gz/' |
tail -1 |
awk '{ print $7 }')
Now instead of tail -1, I need tail -5. So first, how do I make these two arrays?
Second question, how do I make a for loop with each value from the paired values of $day and $time_stamp? I can't use array_combine because I need to perform actions on each array separately. Thanks
You are collecting the data into strings, not arrays. But additionally, your code should probably be refactored significantly -- as a general rule of thumb, if something happens in Awk, most of the rest should also happen in Awk.
You assign to an array with variable=(values of array) and to get the values from a subprocess, it's variable=($(command to produce values)).
Here's a first attempt at refactoring your code.
# Avoid repeated code -- break this out into a function
extract_field () {
hadoop fs -ls -R /user/hive/* |
# Get rid of the tail and the repeated Awk
# Notice backslashes in regex
# Pass in the field to extract as a parameter
awk -v field="$1" '/filename\.txt\.gz/ { d[++i]=$field }
END { for(j=i-4; j<=i; ++j) print d[j] }'
}
day=($(extract_field 6 |
# Refactor accordingly
# And if you don't want a space in the format string, don't put a space in the format string in the first place
xargs -I {} date -d {} '+%b%-d'))
time_stamp=($(extract_field 7))
I'm highly skeptical of the arrangement to call the Hadoop command twice, though. Perhaps just extract fields 6 and 7 in a single go and then post-process the results to get them into two separate arrays. Something like this instead then?
combined=($(hadoop fs -ls -R /user/hive/* |
awk '/filename\.txt\.gz/ { d[++i]=$6 " " $7 }
END { for(j=i-4; j<=i; ++j) print d[j] }'))
for ((i=0; i<"${#combined[@]}"; ++i)); do
day[$i]="$(date -d "${combined[i]% *}" +'%b%-d')"
time_stamp[$i]="${combined[i]#* }"
done
unset combined
The statement that you need to handle the dates and times independently from each other sounds suspicious; if you can find a way to avoid doing that, perhaps after all don't split combined into two separate arrays. The code above reveals how to extract the date and the time from a value in combined (the mechanism is called parameter substitution). It also obviously demonstrates how to loop over the indices in an array.
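If the unquoted $(...) word splitting worries you (the values here happen to be space-free, but in general it can bite), a hedged variant fills both arrays directly from the awk output, one line at a time:
day=() time_stamp=()
while read -r d t; do
day+=("$(date -d "$d" '+%b%-d')")
time_stamp+=("$t")
done < <(hadoop fs -ls -R /user/hive/* |
awk '/filename\.txt\.gz/ { d[++i]=$6 " " $7 }
END { for(j=i-4; j<=i; ++j) print d[j] }')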

Initialize an Array inside AWK Command and use the Array to Print using AWK

I'm trying to do a comparison of the data in 2 files and print certain output from it.
My main objective here is to initialize an array containing some values inside the same awk statement and use it for some printing purpose.
Below is the command I am using, which I feel has some syntactical error.
Please help with the awk part: how I should define the array, and how I can use it inside awk.
Command tried -
paste -d "|" filedata.txt tabdata.txt | awk -F '|' '{array=("RE_LOG_ID" "FILE_RUN_ID" "FH_RECORDTYPE" "FILECATEGORY")}' '{c=NF/2;for(i=1;i<=c;i++)if($i!=$(i+c))printf "%s|%s|%s|%s\n",$1,${array[i]},$i,$(i+c)}'
SAMPLE INPUT FILE
filedata.txt
A|1|2|3
B|2|3|4
tabdata.txt
A|1|4|3
B|2|3|7
So the output I want is:
A|FH_RECORDTYPE|2|4
B|FILECATEGORY|4|7
The output comprises the differences:
PRIMARYKEY|COLUMNNAME|FILE1DATA|FILE2DATA
I want the array to be initialized inside awk as array=("RE_LOG_ID" "FILE_RUN_ID" "FH_RECORDTYPE" "FILECATEGORY"), corresponding to the column names.
The column name is fetched from the array when ($i!=$(i+c)): whichever "i"th position mismatches, I will print the "i"th element from the array.
Finding the differences works perfectly if I remove the array part from my command, but I want to initialize an array containing the column names and print from it too within the same awk statement.
Just i need help how to incorporate the Array Part within AWK.
Unfortunately arrays in AWK cannot be assigned as you expect. As an alternative, you can use split function like:
split("RE_LOG_ID FILE_RUN_ID FH_RECORDTYPE FILECATEGORY", array, " ")
(The explicit third argument " " is needed because FS has been overridden to '|'.)
Then your command will look like:
paste -d "|" filedata.txt tabdata.txt | awk -F '|' '
BEGIN {split("RE_LOG_ID FILE_RUN_ID FH_RECORDTYPE FILECATEGORY", array, " ")}
{
c= NF/2;
for(i=1; i<=c; i++)
if ($i != $(i+c))
printf "%s|%s|%s|%s\n", $1, array[i], $i, $(i+c);
}'
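If you'd rather not hard-code the names inside the script body, a hedged variant passes them in through an awk variable (colnames is just a name chosen here) and splits that instead:
paste -d "|" filedata.txt tabdata.txt | awk -F '|' -v colnames="RE_LOG_ID FILE_RUN_ID FH_RECORDTYPE FILECATEGORY" '
BEGIN {split(colnames, array, " ")}
{
c= NF/2;
for(i=1; i<=c; i++)
if ($i != $(i+c))
printf "%s|%s|%s|%s\n", $1, array[i], $i, $(i+c);
}'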

Efficient substring parsing of large fixed length text file in Bash

I have a large text file (millions of records) of fixed length data and need to extract unique substrings and create a number of arrays with those values. I have a working version, however I'm wondering if performance can be improved since I need to run the script iteratively.
$_file5 looks like:
138000010065011417865201710152017102122
138000010067710416865201710152017102133
138000010131490417865201710152017102124
138000010142349413865201710152017102154
138400010142356417865201710152017102165
130000101694334417865201710152017102176
Here is what I have so far:
while IFS='' read -r line || [[ -n "$line" ]]; do
_in=0
_set=${line:15:6}
_startDate=${line:21:8}
_id="$_account-$_set-$_startDate"
for element in "${_subsets[#]}"; do
if [[ $element == "$_set" ]]; then
_in=1
break
fi
done
# If we find a new one and it's not 504721
if [ $_in -eq 0 ] && [ $_set != "504721" ] ; then
_subsets=("${_subsets[#]}" "$_set")
_ids=("${_ids[#]}" "$_id")
fi
done < $_file5
And this yields:
_subsets=("417865","416865","413865")
_ids=("9899-417865-20171015", "9899-416865-20171015", "9899-413865-20171015")
I'm not sure if sed or awk would be better here and can't find a way to implement either. Thanks.
EDIT: Benchmark Tests
So I benchmarked my original solution against the two provided. I ran this over 10 times and all results were similar to those below.
# Bash read
real 0m8.423s
user 0m8.115s
sys 0m0.307s
# Using sort -u (#randomir)
real 0m0.719s
user 0m0.693s
sys 0m0.041s
# Using awk (#shellter)
real 0m0.159s
user 0m0.152s
sys 0m0.007s
Looks like awk wins this one. Regardless, the performance improvement from my original code is substantial. Thank you both for your contributions.
I don't think you can beat the performance of sort -u with bash loops (except in corner cases, as this one turned out to be, see footnote✻).
To reduce the list of strings you have in file to a list of unique strings (set), based on a substring:
sort -k1.16,1.21 -u file >set
Then, to filter-out the unwanted id, 504721, starting at position 16, you can use grep -v:
grep -vE '.{15}504721' set
Finally, reformat the remaining lines and store them in arrays with cut/sed/awk/bash.
So, to populate the _subsets array, for example:
$ _subsets=($(sort -k1.16,1.21 -u file | grep -vE '.{15}504721' | cut -c16-21))
$ printf "%s\n" "${_subsets[#]}"
413865
416865
417865
or, to populate the _ids array:
$ _ids=($(sort -k1.16,1.21 -u file | grep -vE '.{15}504721' | sed -E 's/^.{15}(.{6})(.{8}).*/9899-\1-\2/'))
$ printf "%s\n" "${_ids[#]}"
9899-413865-20171015
9899-416865-20171015
9899-417865-20171015
✻ If the input file is huge, but it contains only a small number (~40) of unique elements (for the relevant field), then it makes perfect sense for the awk solution to be faster. sort needs to sort a huge file (O(N*logN)), then filter the dupes (O(N)), all for a large N. On the other hand, awk needs to pass through the large input only once, checking for dupes along the way via set membership testing. Since the set of uniques is small, membership testing takes only O(1) (on average, but for such a small set, practically constant even in worst case), making the overall time O(N).
In case there were fewer dupes, awk would have O(N*log(N)) amortized, and O(N^2) worst case, not to mention the higher constant per-instruction overhead.
In short: you have to know what your data looks like before choosing the right tool for the job.
Here's an awk solution embedded in a bash script:
#!/bin/bash
fn_parser() {
awk '
BEGIN{ _account="9899" }
{ _set=substr($0,16,6)
_startDate=substr($0,22,8)
#dbg print "#dbg:_set=" _set "\t_startDate=" _startDate
if (_set != "504721") {
_id= _account "-" _set"-" _startDate
ids[_id] = _id
sets[_set]=_set
}
}
END {
printf "_subsets=("
for (s in sets) { printf("%s\"%s\"" , (commaCtr++ ? "," : ""), sets[s]) }
print ");"
printf "_ids=("
for (i in ids) { printf("%s\"%s\"" , (commaCtr2++ ? "," : ""), ids[i]) }
print ")"
}
' "${@}"
}
#dbg set -vx
eval $( echo $(fn_parser *.txt) )
echo "_subsets="$_subsets
echo "_ids="$_ids
output
_subsets=413865,417865,416865
_ids=9899-416865-20171015,9899-413865-20171015,9899-417865-20171015
I believe this would be the same output your script would get if you did an echo on your variable names.
I didn't see that _account was being extracted from your file, and assume it is passed in from a previous step in your batch. But until I know if that is a critical piece, I'll have to come back to figuring out how to pass in var to a function that calls awk.
People won't like using eval, but hopefully no one will embed /bin/rm -rf / into your data set ;-)
I use the eval so that the data extracted is available via the shell variables. You can uncomment the #dbg before the eval line to see how the code is executing in the "layers" of function, eval, var=value assignments.
Hopefully, you see how the awk script is a transcription of your code into awk.
It does rely on the fact that arrays can contain only 1 copy of a key/value pair.
I'd really appreciate if you post timings for all solutions submitted. (You could reduce the file size by 1/2 and still have a good test). Be sure to run each version several times, and discard the first run.
IHTH
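If the eval bothers you, a hedged variation is to change the END block to print plain tagged lines instead of shell syntax and collect them on the bash side; the shell half of that might look like this (it assumes the END block is rewritten to print lines such as "set 417865" and "id 9899-417865-20171015"):
# hypothetical END block replacing the one above:
#   END { for (s in sets) print "set", s; for (i in ids) print "id", i }
_subsets=() ; _ids=()
while read -r tag value; do
case $tag in
set) _subsets+=("$value") ;;
id) _ids+=("$value") ;;
esac
done < <(fn_parser *.txt)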

Bash script processing too slow

I have the following script where I'm parsing 2 CSV files to find a MATCH. The files have 10000 lines each. But the processing is taking a long time!!! Is this normal?
My script:
#!/bin/bash
IFS=$'\n'
CSV_FILE1=$1;
CSV_FILE2=$2;
sort -t';' $CSV_FILE1 >> Sorted_CSV1
sort -t';' $CSV_FILE2 >> Sorted_CSV2
echo "PATH1 ; NAME1 ; SIZE1 ; CKSUM1 ; PATH2 ; NAME2 ; SIZE2 ; CKSUM2" >> 'mapping.csv'
while read lineCSV1 #Parse 1st CSV file
do
PATH1=`echo $lineCSV1 | awk '{print $1}'`
NAME1=`echo $lineCSV1 | awk '{print $3}'`
SIZE1=`echo $lineCSV1 | awk '{print $7}'`
CKSUM1=`echo $lineCSV1 | awk '{print $9}'`
while read lineCSV2 #Parse 2nd CSV file
do
PATH2=`echo $lineCSV2 | awk '{print $1}'`
NAME2=`echo $lineCSV2 | awk '{print $3}'`
SIZE2=`echo $lineCSV2 | awk '{print $7}'`
CKSUM2=`echo $lineCSV2 | awk '{print $9}'`
# Test if NAM1 MATCHS NAME2
if [[ $NAME1 == $NAME2 ]]; then
#Test checksum OF THE MATCHING NAME
if [[ $CKSUM1 != $CKSUM2 ]]; then
#MAPPING OF THE MATCHING LINES
echo $PATH1 ';' $NAME1 ';' $SIZE1 ';' $CKSUM1 ';' $PATH2 ';' $NAME2 ';' $SIZE2 ';' $CKSUM2 >> 'mapping.csv'
fi
break #When its a match break the while loop and go the the next Row of the 1st CSV File
fi
done < Sorted_CSV2 #Done CSV2
done < Sorted_CSV1 #Done CSV1
This is quadratic in the input size. Also, see Tom Fenech's comment: you are calling awk several times inside a loop inside another loop. Instead of using awk to extract the fields of every line, try setting the IFS shell variable to ";" and reading the fields directly with read:
IFS=";"
while read FIELD11 FIELD12 FIELD13; do
while read FIELD21 FIELD22 FIELD23; do
...
done <Sorted_CSV2
done <Sorted_CSV1
Though, this would still be O(N^2) and very inefficient. It seems you are matching the 2 files on a common field. This task is easier and faster to accomplish using the join command line utility, which reduces the order from O(N^2) to O(N).
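A hedged sketch of that join approach, assuming ';'-separated files with the name in field 2 and the checksum in field 4 (adjust the field numbers and the -o list to your real layout):
sort -t';' -k2,2 "$CSV_FILE1" > Sorted_CSV1
sort -t';' -k2,2 "$CSV_FILE2" > Sorted_CSV2
# join on the name field, emit both records' fields, then keep only checksum mismatches
join -t';' -1 2 -2 2 -o 1.1,1.2,1.3,1.4,2.1,2.3,2.4 Sorted_CSV1 Sorted_CSV2 |
awk -F';' '$4 != $7' >> mapping.csv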
Whenever you say "Does this file/data list/table have something that matches this file/data list/table?", you should think of associative arrays (sometimes called hashes).
An associative array is keyed by a particular value and each key is associated with a value. The nice thing is that finding a key is extremely fast.
In your loop of a loop, you have 10,000 lines in each file. Your outer loop executes 10,000 times. Your inner loop may execute 10,000 times for each and every line in your first file. That's 10,000 x 10,000 times you go through that inner loop. That's potentially looping 100 million times through that inner loop. Think you can see why your program might be a little slow?
In this day and age, having a 10,000 member associative array isn't that bad. (Imagine doing this back in 1980 on a MS-DOS system with 256K. It just wouldn't work). So, let's go through the first file, create a 10,000 member associative array, and then go through the second file looking for matching lines.
Bash 4.x has associative arrays, but I only have Bash 3.2 on my system, so I can't really give you an answer in Bash.
Besides, sometimes Bash isn't the answer to a particular issue. Bash can be a bit slow and the syntax can be error prone. Awk might be faster, but many versions don't have associative arrays. This is really a job for a higher level scripting language like Python or Perl.
Since I can't do a Bash answer, here's a Perl answer. Maybe this will help. Or, maybe this will inspire someone who has Bash 4.x can give an answer in Bash.
Basically, I open the first file and create an associative array keyed by the checksum. If this is a sha1 checksum, it should be unique for all files (unless they're an exact match). If you don't have a sha1 checksum, you'll need to massage the structure a wee bit, but it's pretty much the same idea.
Once I have the associative array figured out, I then open file #2 and simply see if the checksum already exists in the file. If it does, I know I have a matching line, and print out the two matches.
I have to loop 10,000 times in the first file, and 10,000 times in the second. That's only 20,000 loops instead of 100 million, roughly 5,000 times less looping, which means the program will run on the order of 5,000 times faster. So, if it takes 2 full days for your program to run with a double loop, an associative array solution could finish in well under a minute.
#! /usr/bin/env perl
#
use strict;
use warnings;
use autodie;
use feature qw(say);
use constant {
FILE1 => "file1.txt",
FILE2 => "file2.txt",
MATCHING => "csv_matches.txt",
};
#
# Open the first file and create the associative array
#
my %file_data;
open my $fh1, "<", FILE1;
while ( my $line = <$fh1> ) {
chomp $line;
my ( $path, $blah, $name, $bather, $yadda, $tl_dr, $size, $etc, $check_sum ) = split /\s+/, $line;
#
# The main key is "check_sum" which **should** be unique, especially if it's a sha1
#
$file_data{$check_sum}->{PATH} = $path;
$file_data{$check_sum}->{NAME} = $name;
$file_data{$check_sum}->{SIZE} = $size;
}
close $fh1;
#
# Now, we have the associative array keyed by the data we want to match, read file 2
#
open my $fh2, "<", FILE2;
open my $csv_fh, ">", MATCHING;
while ( my $line = <$fh2> ) {
chomp $line;
my ( $path, $blah, $name, $bather, $yadda, $tl_dr, $size, $etc, $check_sum ) = split /\s+/, $line;
#
# If there is a matching checksum in file1, we know we have a matching entry
#
if ( exists $file_data{$check_sum} ) {
printf {$csv_fh} "%s;%s;%s;%s;%s;%s\n",
$file_data{$check_sum}->{PATH}, $file_data{$check_sum}->{NAME}, $file_data{$check_sum}->{SIZE},
$path, $name, $size;
}
}
close $fh2;
close $csv_fh;
BUGS
(A good manpage always list issues!)
This assumes one match per file. If you have multiple duplicates in file1 or file2, you will only pick up the last one.
This assumes a sha256 or equivalent checksum. With such a checksum, it is extremely unlikely that two files will have the same checksum unless they actually match. A 16-bit checksum from the historic sum command may have collisions.
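For anyone on Bash 4+, here is a hedged sketch of the same associative-array idea in bash itself (field positions mirror the Perl split above: name in column 3, size in 7, checksum in 9; file1.txt, file2.txt and csv_matches.txt are the same placeholder names):
#!/bin/bash
declare -A path_by_sum name_by_sum size_by_sum
# Pass 1: index file1 by checksum
while read -r path _ name _ _ _ size _ check_sum _; do
path_by_sum[$check_sum]=$path
name_by_sum[$check_sum]=$name
size_by_sum[$check_sum]=$size
done < file1.txt
# Pass 2: look each file2 line up in the index
while read -r path _ name _ _ _ size _ check_sum _; do
if [[ -n ${path_by_sum[$check_sum]+set} ]]; then
printf '%s;%s;%s;%s;%s;%s\n' \
"${path_by_sum[$check_sum]}" "${name_by_sum[$check_sum]}" "${size_by_sum[$check_sum]}" \
"$path" "$name" "$size"
fi
done < file2.txt > csv_matches.txt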
Although a proper database engine would make a much better tool for this, it is still very well possible to do it with awk.
The trick is to sort your data, so that records with the same name are grouped together. Then a single pass from top to bottom is enough to find the matches. This can be done in linear time.
In detail:
Insert two columns in both CSV files
Make sure every line starts with the name. Also add a number (either 1 or 2) which denotes from which file the line originates. We will need this when we merge the two files together.
awk -F';' '{ print $2 ";1;" $0 }' csvfile1 > tmpfile1
awk -F';' '{ print $2 ";2;" $0 }' csvfile2 > tmpfile2
Concatenate the files, then sort the lines
sort tmpfile1 tmpfile2 > tmpfile3
Scan the result, report the mismatches
awk -F';' -f scan.awk tmpfile3
Where scan.awk contains:
BEGIN {
origin = 3;
}
$1 == name && $2 > origin && $6 != checksum {
print record;
}
{
name = $1;
origin = $2;
checksum = $6;
sub(/^[^;]*;.;/, "");
record = $0;
}
Putting it all together
Crammed together into a Bash oneliner, without explicit temporary files:
(awk -F';' '{print $2";1;"$0}' csvfile1 ; awk -F';' '{print $2";2;"$0}' csvfile2) | sort | awk -F';' 'BEGIN{origin=3}$1==name&&$2>origin&&$6!=checksum{print record}{name=$1;origin=$2;checksum=$6;sub(/^[^;]*;.;/,"");record=$0;}'
Notes:
If the same name appears more than once in csvfile1, then all but the last one are ignored.
If the same name appears more than once in csvfile2, then all but the first one are ignored.
