How can I use awk to remove duplicate entries in the same field with data separated with commas? - bash

I am trying to call awk from a bash script to remove duplicate data entries of a field in a file.
Data Example in file1
data1 a,b,c,d,d,d,c,e
data2 a,b,b,c
Desired Output:
data1 a,b,c,d,e
data2 a,b,c
First I removed the first column to only have the second remaining.
cut --complement -d$'\t' -f1 file1 &> file2
This worked fine, and now I just have the following in file2:
a,b,c,d,d,d,c,e
a,b,b,c
So then I tried this code that I found but do not understand well:
awk '{
for(i=1; i<=NF; i++)
printf "%s", (!seen[$1]++? (i==1?"":FS) $i: "" )
delete seen; print ""
}' file2
The problem is that this code was written for a space delimiter, while mine is now a comma delimiter with a variable number of values on each row. The code just prints the file as is and I can see no difference. I also tried to make the FS a comma by doing this, to no avail:
printf "%s", (!seen[$1]++? (i==1?"":FS=",") $i: ""

This is similar to the code you found.
awk -F'[ ,]' '
{
s = $1 " " $2
seen[$2]++
for (i=3; i<=NF; i++)
if (!seen[$i]++) s = s "," $i
print s
delete seen
}
' data-file
-F'[ ,]' - split input lines on spaces and commas
s = ... - we could use printf like the code you found, but building a string is less typing
!seen[x]++ is a common idiom - it returns true only the first time x is seen
to avoid special-casing when to print a comma (as your sample code does with spaces), we simply add $2 to the print string and set seen[$2]
then for the remaining columns (3 .. NF), we add comma and column if it hasn't been seen before
delete seen - clear the array for the next line
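As a tiny standalone illustration of the !seen[x]++ idiom mentioned above (not part of the answer's script):
$ printf 'a\nb\na\nc\nb\n' | awk '!seen[$0]++'
a
b
c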

That code is essentially right; you just need to specify the comma delimiter and change $1 to $i.
$ awk -F ',' '{
for(i=1; i<=NF; i++)
printf "%s", (!seen[$i]++? (i==1?"":FS) $i: "" )
delete seen; print ""
}' /tmp/file1
data1 a,b,c,d,e
data2 a,b,c
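If you'd rather not run cut first, something along these lines (a sketch, assuming the label and the comma list in file1 are tab-separated) should handle the whole file in one pass:
awk -F'\t' '{
    n = split($2, vals, ",")                  # break the comma list apart
    out = ""
    delete seen                               # reset per line (use split("", seen) on very old awks)
    for (i = 1; i <= n; i++)
        if (!seen[vals[i]]++)                 # keep only the first occurrence of each value
            out = out (out == "" ? "" : ",") vals[i]
    print $1 "\t" out
}' file1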

Using GNU sed if applicable
$ sed -E ':a;s/((\<[^,]*\>).*),\2/\1/;ta' input_file
data1 a,b,c,d,e
data2 a,b,c
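My reading of how that sed command works, spread out as a script file (run as sed -E -f dedup.sed input_file; the behavior should be identical):
# dedup.sed -- repeat the substitution until no duplicated word remains on the line
:a
# capture a word (\2) plus everything up to a later ",word", keeping only \1,
# which drops the repeated ",word"
s/((\<[^,]*\>).*),\2/\1/
# if a substitution was made, branch back to :a and try again
ta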

I did something similar recently: sanitizing the output of the GNU prime-factoring program when it prints out every single copy of a bunch of small primes:
gawk -Mbe '
BEGIN {
__+=__+=__+=(__+=___=_+=__=____=_^=_<_)-+-++_
__+=__^=!(___=__-=_+=_++)
for (_; _<=___; _+=__) {
if ((_%++__)*(_%(__+--__))) {
print ____*=_^_
}
}
} | gfactor | sanitize_gnu_factor
58870952193946852435332666506835273111444209706677713:
7^7
11^11
13^13
17^17
116471448967943114621777995869564336419122830800496825559417754612566153180027:
7^7
11^11
13^13
17^17
19^19
2431978363071055324951111475877083878108827552605151765803537946846931963403343871776360412541253748541645309:
7^7
11^11
13^13
17^17
19^19
23^23
6244557167645217304114386952069758950402417741892127946837837979333340639740318438767128131418285303492993082345658543853142417309747238004933649896921:
7^7
11^11
13^13
17^17
19^19
23^23
29^29
823543:
7^7
234966429149994773:
7^7
11^11
71165482274405729335192792293569:
7^7
11^11
13^13
The core sanitizer does basically the same thing, intra-row duplicate removal:
sanitize_gnu_factor() # i implemented it as a shell function
{
mawk -Wi -- '
BEGIN {
______ = "[ ]+"
___= _+= _^=__*=____ = FS
_______ = FS = "[ \v"(OFS = "\f\r\t")"]+"
FS = ____
} {
if (/ is prime$/) {
print; next
} else if (___==NF) {
$NF = " - - - - - - - \140\140\140"\
"PRIME\140\140\140 - - - - - - - "
} else {
split("",_____)
_ = NF
do { _____[$_]++ } while(--_<(_*_))
delete _____[""]
sub("$"," ")
_^=_<_
for (__ in _____) {
if (+_<+(___=_____[__])) {
sub(" "(__)"( "(__)")+ ",
sprintf(" %\47.f^%\47.f ",__,___))
} }
___ = _+=_^=__*=_<_
FS = _______
$__ = $__
FS = ____ } } NF = NF' |
mawk -Wi -- '
/ is prime$/ { print
next } /[=]/ { gsub("="," ")
} $(_^=(_<_)) = \
(___=length(__=$_))<(_+=_++)^(_+--_) \
?__: sprintf("%.*s......%s } %\47.f dgts ",
_^=++_,__, substr(__,++___-_),--___)' FS='[:]' OFS=':'
}
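Stripped of the obfuscation, my reading of the core duplicate-collapsing step is roughly the following (a heavily simplified sketch that ignores the original's number formatting and one-factor-per-line layout; names are mine):
# for each "N: p p p q q ..." line from factor, fold repeated primes into p^count
awk -F': ' '
/ is prime$/ { print; next }            # pass "N is prime" lines through untouched
{
    n = split($2, f, " ")
    delete cnt
    for (i = 1; i <= n; i++) cnt[f[i]]++
    out = ""
    for (p in cnt)                      # note: order of p is arbitrary in this sketch
        out = out (out == "" ? "" : " ") (cnt[p] > 1 ? p "^" cnt[p] : p)
    print $1 ": " out
}'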

Related

bash - add columns to csv rewrite headers with prefix filename

I'd prefer a solution that uses bash rather than converting to a dataframe in python, etc as the files are quite big
I have a folder of CSVs that I'd like to merge into one CSV. The CSVs all have the same header save a few exceptions so I need to rewrite the name of each added column with the filename as a prefix to keep track of which file the column came from.
head file1.csv file2.csv
==> file1.csv <==
id,max,mean,90
2870316.0,111.77777777777777
2870317.0,63.888888888888886
2870318.0,73.6
2870319.0,83.88888888888889
==> file2.csv <==
ogc_fid,id,_sum
"1","2870316",9.98795110916615
"2","2870317",12.3311055738527
"3","2870318",9.81535963468479
"4","2870319",7.77729743926775
The id column of each file might be in a different "datatype" but in every file the id matches the line number. For example, line 2 is always id 2870316.
Anticipated output:
file1_id,file1_90,file2_ogc_fid,file2_id,file2__sum
2870316.0,111.77777777777777,"1","2870316",9.98795110916615
2870317.0,63.888888888888886,"2","2870317",12.3311055738527
2870318.0,73.6,"3","2870318",9.81535963468479
2870319.0,83.88888888888889,"4","2870319",7.77729743926775
I'm not quite sure how to do this, but I think I'd use the paste command at some point. I'm surprised that I couldn't find a similar question on Stack Overflow, but I guess it's not that common to have CSVs with the same id on the same line number.
edit:
I figured out the first part.
paste -d , * > ../rasterjointest.txt achieves what I want but the header needs to be replaced
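One way to finish that paste-based approach could look like this (a sketch with hypothetical file names: the prefixed header line is built separately, and tail drops the pasted original headers):
{
  # build one header line, prefixing each column name with its source file's name
  for f in file1.csv file2.csv; do
    head -n1 "$f" |
      awk -v p="${f%.csv}" 'BEGIN{FS=OFS=","} {for (i=1; i<=NF; i++) $i = p "_" $i; print}'
  done | paste -sd, -
  # then the data rows: paste everything and drop the combined header line
  paste -d, file1.csv file2.csv | tail -n +2
} > merged.csv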
$ cat tst.awk
BEGIN { FS=OFS="," }
FNR==1 {
fname = FILENAME
sub(/\.[^.]+$/,"",fname)
for (i=1; i<=NF; i++) {
$i = fname "_" $i
}
}
{ row[FNR] = (NR==FNR ? "" : row[FNR] OFS) $0 }
END {
for (rowNr=1; rowNr<=FNR; rowNr++) {
print row[rowNr]
}
}
$ awk -f tst.awk file1.csv file2.csv
file1_id,file1_max,file1_mean,file1_90,file2_ogc_fid,file2_id,file2__sum
2870316.0,111.77777777777777,"1","2870316",9.98795110916615
2870317.0,63.888888888888886,"2","2870317",12.3311055738527
2870318.0,73.6,"3","2870318",9.81535963468479
2870319.0,83.88888888888889,"4","2870319",7.77729743926775
To use minimal memory in awk:
$ cat tst.awk
BEGIN {
FS=OFS=","
for (fileNr=1; fileNr<ARGC; fileNr++) {
filename = ARGV[fileNr]
if ( (getline < filename) > 0 ) {
fname = filename
sub(/\.[^.]+$/,"",fname)
for (i=1; i<=NF; i++) {
$i = fname "_" $i
}
}
row = (fileNr==1 ? "" : row OFS) $0
}
print row
exit
}
$ awk -f tst.awk file1.csv file2.csv; paste -d, file1.csv file2.csv | tail -n +2
file1_id,file1_max,file1_mean,file1_90,file2_ogc_fid,file2_id,file2__sum
2870316.0,111.77777777777777,"1","2870316",9.98795110916615
2870317.0,63.888888888888886,"2","2870317",12.3311055738527
2870318.0,73.6,"3","2870318",9.81535963468479
2870319.0,83.88888888888889,"4","2870319",7.77729743926775

How to find common columns and their records from two files using awk

I have two files:
File 1:
id|name|address|country
1|abc|efg|xyz
2|asd|dfg|uio
File 2(only headers):
id|name|country
Now, I want an output like:
OUTPUT:
id|name|country
1|abc|xyz
2|asd|uio
Basically, I have a user record file (file1) and a header file (file2). Now I want to extract from file1 only those columns whose headers match the ones in the header file.
I want to do this using awk or bash.
I tried using:
awk 'BEGIN { OFS="..."} FNR==NR { a[(FNR"")] = $0; next } { print a[(FNR"")], $0 > "test.txt"}' header.txt file.txt
and have no idea what to do next.
Thank You
The following awk may help you with the same.
awk -F"|" 'FNR==NR{for(i=1;i<=NF;i++){a[$i]};next} FNR==1 && FNR!=NR{for(j=1;j<=NF;j++){if($j in a){b[++p]=j}}} {for(o=1;o<=p;o++){printf("%s%s",$b[o],o==p?ORS:OFS)}}' OFS="|" File2 File1
Adding a non-one-liner form of the solution too now.
awk -F"|" '
FNR==NR{
for(i=1;i<=NF;i++){
a[$i]};
next}
FNR==1 && FNR!=NR{
for(j=1;j<=NF;j++){
if($j in a){ b[++p]=j }}
}
{
for(o=1;o<=p;o++){
printf("%s%s",$b[o],o==p?ORS:OFS)}
}
' OFS="|" File2 File1
Edit by Ed Morton: FWIW here's the same script written with normal indenting/spacing and a couple of more meaningful variable names:
BEGIN { FS=OFS="|" }
NR==FNR {
for (i=1; i<=NF; i++) {
names[$i]
}
next
}
FNR==1 {
for (i=1; i<=NF; i++) {
if ($i in names) {
f[++numFlds] = i
}
}
}
{
for (i=1; i<=numFlds; i++) {
printf "%s%s", $(f[i]), (i<numFlds ? OFS : ORS)
}
}
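If that rewrite is saved as, say, common_cols.awk (file name just for illustration), it is run the same way as the one-liner, with the header file first:
awk -f common_cols.awk File2 File1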
with (lots of) unix pipes, as Doug McIlroy intended...
$ function p() { sed 1q "$1" | tr '|' '\n' | cat -n | sort -k2; }
$ cut -d'|' -f"$(join -j2 <(p header) <(p file) | sort -k2n | cut -d' ' -f3 | paste -sd,)" file
id|name|country
1|abc|xyz
2|asd|uio
Solution using bash>4:
IFS='|' headers1=($(head -n1 $file1))
IFS='|' headers2=($(head -n1 $file2))
IFS=$'\n'
# find idxes we want to output, ie. mapping of headers1 to headers2
idx=()
for i in $(seq 0 $((${#headers2[@]}-1))); do
for j in $(seq 0 $((${#headers1[@]}-1))); do
if [ "${headers2[$i]}" == "${headers1[$j]}" ]; then
idx+=($j)
break
fi
done
done
# idx=(0 1 3) for example
# simple join output function from https://stackoverflow.com/questions/1527049/join-elements-of-an-array
join_by() { local IFS="$1"; shift; echo "$*"; }
# first line - output headers
join_by '|' "${headers2[#]}"
isfirst=true
while IFS='|' read -a vals; do
# ignore first (header line)
if $isfirst; then
isfirst=false
continue;
fi;
# filter from line only columns with idx indices
tmp=()
for i in "${idx[@]}"; do
tmp+=("${vals[$i]}")
done
# join output with '|'
join_by '|' "${tmp[@]}"
done < $file1
This one respects the order of columns in file1, changed the order:
$ cat file1
id|country|name
The awk:
$ awk '
BEGIN { FS=OFS="|" }
NR==1 { # file1
n=split($0,a)
next
}
NR==2 { # file2 header
for(i=1;i<=NF;i++)
b[$i]=i
}
{ # output part
for(i=1;i<=n;i++)
printf "%s%s", $b[a[i]], (i==n?ORS:OFS)
}' file1 file2
id|country|name
1|xyz|abc
2|uio|asd
(An earlier revision of this answer has another version that uses cut for the output.)
This is similar to RavinderSingh13's solution, in that it first reads the headers from the shorter file, and then decides which columns to keep from the longer file based on the headers on the first line of it.
It however does the output differently. Instead of constructing a string, it shifts the columns to the left if it does not want to include a particular field.
BEGIN { FS = OFS = "|" }
# read headers from first file
NR == FNR { for (i = 1; i <= NF; ++i) header[$i]; next }
# mark fields in second file as "selected" if the header corresponds
# to a header in the first file
FNR == 1 {
for (i = 1; i <= NF; ++i)
select[i] = ($i in header)
}
{
skip = 0
pos = 1
for (i = 1; i <= NF; ++i)
if (!select[i]) { # we don't want this field
++skip
$pos = $(pos + skip) # shift fields left
} else
++pos
NF -= skip # adjust number of fields
print
}
Running this:
$ mawk -f script.awk file2 file1
id|name|country
1|abc|xyz
2|asd|uio

AWK: increment a field based on values from previous line

Given the following input for AWK:
10;20;20
8;41;41
15;52;52
How could I increase/decrease the values so that:
$1 = remains unchanged
$2 = $2 of previous line + $1 of previous line + 1
$3 = $3 of previous line + $1 of previous line + 1
So the desired output would be:
10;20;20
8;31;31
15;40;40
I need to auto-increment and loop over the lines,
using associative arrays, but it's confusing for me.
Surely, this doesn't work as desired:
#!/bin/awk -f
BEGIN { FS = ";" }
{
print ln, st, of
ln=$1
st=$2 + ln + 1
of=$3 + ln + 1
}
With awk:
awk -F";" -v OFS=";" 'NR!=1{ $2=a[2]+a[1]+1; $3=a[3]+a[1]+1 } { split($0,a,FS) } 1' file
We split the line into an array so that, when processing the next line, we can use the values stored from the previous one.
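Spelled out as a script with the ordering made explicit (this is the same logic as the one-liner above: split() runs after the fields are rewritten, so a[] always holds the line as it was printed):
# increment.awk -- run as: awk -f increment.awk file
BEGIN { FS = OFS = ";" }
NR != 1 {                   # every line except the first
    $2 = a[2] + a[1] + 1    # previous line's $2 + previous line's $1 + 1
    $3 = a[3] + a[1] + 1
}
{ split($0, a, FS) }        # remember this (possibly rewritten) line for the next one
1                           # print the current line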
test
10;20;20
8;31;31
15;40;40
The following awk may help you with the same.
awk -F";" '
FNR==1{
val=$1;
val1=$2;
val2=$3;
print;
next
}
{
$2=val+val1+1;
$3=val+val2+1;
print;
val=$1;
val1=$2;
val2=$3;
}' OFS=";" Input_file
For your given Input_file, output will be as follows.
10;20;20
8;31;31
15;40;40
awk 'BEGIN{
FS = OFS = ";"
}
FNR>1{
$2 = p2 + p1 + 1
$3 = p3 + p1 + 1
}
{
p1=$1; p2=$2; p3=$3
}1
' infile
Input:
$ cat infile
10;20;20
8;41;41
15;52;52
Output:
$ awk 'BEGIN{FS=OFS=";"}FNR>1{$2=p2+p1+1; $3=p3+p1+1 }{p1=$1; p2=$2; p3=$3}1' infile
10;20;20
8;31;31
15;40;40
Or store only fields of your interest
awk -v myfields="2,3" '
BEGIN{
FS=OFS=";";
split(myfields,t,/,/)
}
{
for(i in t)
{
if(FNR>1)
{
$(t[i]) = a[t[i]] + a[1] + 1
}
a[t[i]] = $(t[i])
}
a[1] = $1
}1' infile

Unix/Bash: Uniq on a cell

I have a tab-separated fileA where the 12th column (starting from 1) contain several comma separated identifiers. Some of them in the same row, however, can occur more than once:
GO:0042302, GO:0042302, GO:0042302
GO:0004386,GO:0005524,GO:0006281, GO:0004386,GO:0005524,GO:0006281
....
....
(some with a whitespace after the comma, some without).
I would like to only get the unique identifiers and remove the multiples for each row in the 12th column:
GO:0042302
GO:0004386,GO:0005524,GO:0006281
....
....
Here is what I have so far:
for row in `fileA`
do
cut -f12 $row | sed "s/,/\n/" | sort | uniq | paste fileA - | \
awk 'BEGIN {OFS=FS="\t"}{print $1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $13}'
done > out
The idea was to go over the rows one at a time, cut out the 12th column, replace all commas with newlines, then sort and uniq to get rid of duplicates, paste it back, and print the columns in the right order, skipping the original identifier column.
However, this does not seem to work. Any ideas?
Just for completeness, and because I personally prefer Perl over Awk for this sort of thing, here's a Perl one-liner solution:
perl -F'\t' -le '%u=();@k=split/,/,$F[11];@u{@k}=@k;$F[11]=join",",sort keys%u;print join"\t",@F'
Explanation:
-F'\t' Loop over input lines, splitting each one into fields at tabs
-l automatically remove newlines from input and append on output
-e get code to execute from the next argument instead of standard input
%u = (); # clear out the hash variable %u
@k = split /,/, $F[11]; # Split 12th field (1st is 0) on comma into array @k
@u{@k} = @k; # Copy the contents of @k into %u as key/value pairs
Because hash keys are unique, that last step means that the keys of %u are now a deduplicated copy of @k.
$F[11] = join ",", sort keys %u; # replace the 12th field with the sorted unique list
print join "\t", @F; # and print out the modified line
If I understand you correctly, then with awk:
awk -F '\t' 'BEGIN { OFS = FS } { delete b; n = split($12, a, /, */); $12 = ""; for(i = 1; i <= n; ++i) { if(!(a[i] in b)) { b[a[i]]; $12 = $12 a[i] "," } } sub(/,$/, "", $12); print }' filename
This works as follows:
BEGIN { OFS = FS } # output FS same as input FS
{
delete b # clear dirty table from last pass
n = split($12, a, /, */) # split 12th field into tokens,
$12 = "" # then clear it out for reassembly
for(i = 1; i <= n; ++i) { # wade through those tokens
if(!(a[i] in b)) { # those that haven't been seen yet:
b[a[i]] # remember that they were seen
$12 = $12 a[i] "," # append to result
}
}
sub(/,$/, "", $12) # remove trailing comma from resulting field
print # print the transformed line
}
The delete b; has been POSIX-conforming for only a short while, so if you're working with an old, old awk and it fails for you, see @MarkReed's comment for another way that ancient awks should accept.
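For reference, the usual portable ways to empty an array on such old awks look like this (either one can stand in for the bare delete b):
split("", b)                 # reuse split() with an empty string to reset b
for (k in b) delete b[k]     # or delete the elements one at a time (POSIX)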
Using field 2 instead of field 12:
$ cat tst.awk
BEGIN{ FS=OFS="\t" }
{
split($2,f,/ *, */)
$2 = ""
delete seen
for (i=1;i in f;i++) {
if ( !seen[f[i]]++ ) {
$2 = $2 (i>1?",":"") f[i]
}
}
print
}
$ cat file
a,a,a GO:0042302, GO:0042302, GO:0042302 b,b,b
c,c,c GO:0004386,GO:0005524,GO:0006281, GO:0004386,GO:0005524,GO:0006281 d,d,d
$ awk -f tst.awk file
a,a,a GO:0042302 b,b,b
c,c,c GO:0004386,GO:0005524,GO:0006281 d,d,d
If your awk doesn't support delete seen you can use split("",seen).
Using this awk:
awk -F '\t' -v OFS='\t' '{
delete seen;
s="";                       # reset the output string for each line
split($12, a, /[,; ]+/);
for (i=1; i<=length(a); i++) {
if (!(a[i] in seen)) {
seen[a[i]];
s=sprintf("%s%s,", s, a[i])
}
}
$12=s} 1' file
GO:0042302,
GO:0004386,GO:0005524,GO:0006281,
In your example data, a comma followed by a space separates the repeated groups within the 12th field. Everything after that is merely a repeat of the first group. The subfields appear to already be in sorted order.
GO:0042302, GO:0042302, GO:0042302
^^^dup1^^^ ^^^dup2^^^
GO:0004386,GO:0005524,GO:0006281, GO:0004386,GO:0005524,GO:0006281
^^^^^^^^^^^^^^^dup1^^^^^^^^^^^^^
Based on that, you could simply keep the first of the subfields and toss the rest:
awk -F"\t" '{sub(/, .*/, "", $12)} 1' fileA
If instead you can have different sets of repeated subfields, where the keys are not sorted, like this:
GO:0042302, GO:0042302, GO:0042302, GO:0062122,GO:0055000, GO:0055001, GO:0062122,GO:0055000
GO:0004386,GO:0005524,GO:0006281, GO:0005525, GO:0004386,GO:0005524,GO:0006281
If you were stuck with the default macOS awk, you could introduce sort/uniq functions in an awk executable script:
#!/usr/bin/awk -f
BEGIN { FS = OFS = "\t" }
{
c = uniq(a, split($12, a, /, |,/))
sort(a, c)
s = a[1]
for(i=2; i<=c; i++) { s = s "," a[i] }
$12 = s
}
47 # any nonzero pattern with no action prints the modified line
# take an indexed arr as from split and de-dup it
function uniq(arr, len, i, uarr) {
for(i=len; i>=1; i--) { uarr[arr[i]] }
delete arr
for(k in uarr) { arr[++i] = k }
return( i )
}
# slightly modified from
# http://rosettacode.org/wiki/Sorting_algorithms/Bubble_sort#AWK
function sort(arr, len, haschanged, tmp, i)
{
haschanged = 1
while( haschanged==1 ) {
haschanged = 0
for(i=1; i<=(len-1); i++) {
if( arr[i] > arr[i+1] ) {
tmp = arr[i]
arr[i] = arr[i + 1]
arr[i + 1] = tmp
haschanged = 1
}
}
}
}
If you had GNU-awk, I think you could swap out the sort(a, c) call with asort(a), and drop the bubble-sort local function completely.
I get the following for the 12th field:
GO:0042302,GO:0055000,GO:0055001,GO:0062122
GO:0004386,GO:0005524,GO:0005525,GO:0006281
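A rough gawk-only sketch of that suggestion (my own untested adaptation, not the author's code; asorti() on a lookup array replaces both uniq() and the bubble sort):
#!/usr/bin/gawk -f
BEGIN { FS = OFS = "\t" }
{
    n = split($12, a, /, |,/)        # split the 12th field on "," or ", "
    delete dup
    for (i = 1; i <= n; i++) dup[a[i]]
    m = asorti(dup, sorted)          # sorted[1..m] = unique subfields, in order
    s = sorted[1]
    for (i = 2; i <= m; i++) s = s "," sorted[i]
    $12 = s
    print
}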

Sorting dates by groups

Here is a sample of my data with 4 columns and comma delimiter.
1,A,2009-01-01,2009-07-15
1,A,2009-07-10,2009-07-12
2,B,2009-01-01,2009-07-15
2,B,2009-07-10,2010-12-15
3,C,2009-01-01,2009-07-15
3,C,2009-07-15,2010-12-15
3,C,2010-12-15,2014-07-07
4,D,2009-06-01,2009-07-15
4,D,2009-07-21,2012-12-15
5,E,2011-04-23,2012-10-19
The first 2 columns are grouped. I want the minimum date from the third column, and the maximum date from the fourth column, for each group.
Then I will pick the first line for each first 2 column combination.
Desired output
1,A,2009-01-01,2009-07-15
2,B,2009-01-01,2010-12-15
3,C,2009-01-01,2014-07-07
4,D,2009-06-01,2012-12-15
5,E,2011-04-23,2012-10-19
I have tried the following code, but it is not working. I get close, but not the max date.
cat exam |sort -t, -nk1 -k2,3 -k4,4r |sort -t, -uk1,2
Would prefer an easy one-liner like above.
sort datafile |
awk -F, -v OFS=, '
{key = $1 FS $2}
key != prev {prev = key; min[key] = $3}
{max[key] = ($4 > max[key]) ? $4 : max[key]}
END {for (key in min) print key, min[key], max[key]}
' |
sort
1,A,2009-01-01,2009-07-15
2,B,2009-01-01,2010-12-15
3,C,2009-01-01,2014-07-07
4,D,2009-06-01,2012-12-15
5,E,2011-04-23,2012-10-19
When you pre-sort, you are guaranteed that the minimum col3 date will occur on the first line of a new group. Then you just need to find the maximum col4 date.
The final sort is required because iterating over the keys of an awk hash is unordered. You can do this sorting in (g)awk with:
END {
n = asorti(min, sortedkeys)
for (i=1; i<=n; i++)
print sortedkeys[i], min[sortedkeys[i]], max[sortedkeys[i]]
}
#!/usr/bin/awk -f
BEGIN { FS = OFS = "," }
{
sub(/[[:blank:]]*<br>$/, "")
key = $1 FS $2
if (!a[key]) {
a[key] = $3
b[key] = $4
keys[++k] = key
} else if ($3 < a[key]) {
a[key] = $3
} else if ($4 > b[key]) {
b[key] = $4
}
}
END {
for (i = 1; i <= k; ++i) {
key = keys[i]
print key, a[key], b[key] " <br>"
}
}
Usage:
awk -f script.awk file
Output:
1,A,2009-01-01,2009-07-15 <br>
2,B,2009-01-01,2010-12-15 <br>
3,C,2009-01-01,2014-07-07 <br>
4,D,2009-06-01,2012-12-15 <br>
5,E,2011-04-23,2012-10-19 <br>
Of course you can add print statements before and after the loop to print the other two <br>'s:
END {
print "<br>"
for (i = 1; i <= k; ++i) {
key = keys[i]
print key, a[key], b[key] " <br>"
}
print "<br>"
}
You want a "one-liner"?
paste -d, \
<(cat exam|sort -t, -nk1,2 -k4 |cut -d, -f1-3) \
<(cat exam|sort -t, -nk1,2 -k4r |cut -d, -f4 ) |
uniq -w4
The key idea is to sort the data once by field 3 asc, and independently by field 4 desc. Then you just have to merge corresponding lines (cut and paste). Finally uniq is used to keep only the first row for each pair of identical first two columns. This is the weak point here, as I assume 4 characters max for comparison. You either have to adjust to your needs, or somehow normalize data for those two columns in order to have a fixed width here when using your actual data.
EDIT: A probably better option is to replace uniq by a simple awk filter:
paste -d, \
<(cat exam|sort -t, -nk1,2 -k4 |cut -d, -f1-3) \
<(cat exam|sort -t, -nk1,2 -k4r |cut -d, -f4 ) |
awk -F , '$1","$2 != last { print; last=$1","$2 }'
On my system (Debian GNU/Linux Wheezy), both produce the same result:
1,A,2009-01-01,2009-07-15<br>
2,B,2009-01-01,2010-12-15<br>
3,C,2009-01-01,2014-07-07<br>
4,D,2009-06-01,2012-12-15 <br>
5,E,2011-04-23,2012-10-19<br>
