Remove duplicates from a CSV using bash / awk

I have a CSV file in this format:
"id-1"|"A"
"id-2"|"C"
"id-1"|"B"
"id-1"|"D"
"id-2"|"B"
"id-3"|"A"
"id-3"|"A"
"id-1"|"B"
I want to group by the unique ids in the first column and concatenate the types into a single row, like this:
"id-1"|"A:B:D"
"id-2"|"B:C"
"id-3"|"A"
I found that awk does a great job of handling such scenarios, but all I could achieve is this:
"id-1"|"A":"B":"D":"B"
"id-2"|"B":"C"
"id-3"|"A":"A"
I used this command:
awk -F "|" '{if(a[$1])a[$1]=a[$1]":"$2; else a[$1]=$2;}END{for (i in a)print i, a[i];}' OFS="|" file
How can I remove the duplicates and also handle the formatting of the second column types?

quick fix:
$ awk -F "|" '!seen[$0]++{if(a[$1])a[$1]=a[$1]":"$2; else a[$1]=$2;}END{for (i in a)print i, a[i];}' OFS="|" file
"id-1"|"A":"B":"D"
"id-2"|"C":"B"
"id-3"|"A"
!seen[$0]++ is true only if the line has not been seen before.
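The idiom works because seen[$0]++ post-increments: it yields 0 (false) the first time a given line is read and a non-zero count on every repeat. A minimal illustration on throwaway input:
$ printf '%s\n' a b a c b | awk '!seen[$0]++'
a
b
c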
If the second column's values should all sit within a single pair of double quotes:
$ awk -v dq='"' 'BEGIN{FS=OFS="|"}
!seen[$0]++{a[$1]=a[$1] ? a[$1]":"$2 : $2}
END{for (i in a){gsub(dq,"",a[i]); print i, dq a[i] dq}}' file
"id-1"|"A:B:D"
"id-2"|"C:B"
"id-3"|"A"

With GNU awk for true multi-dimensional arrays and gensub() and sorted_in:
$ awk -F'|' '
{ a[$1][gensub(/"/,"","g",$2)] }
END {
PROCINFO["sorted_in"] = "@ind_str_asc"
for (i in a) {
c = 0
for (j in a[i]) {
printf "%s%s", (c++ ? ":" : i "|\""), j
}
print "\""
}
}
' file
"id-1"|"A:B:D"
"id-2"|"B:C"
"id-3"|"A"
The output rows and columns will both be string-sorted (i.e. alphabetically by characters) in ascending order.
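If a different ordering is wanted, gawk accepts other values here; for example, setting
PROCINFO["sorted_in"] = "@ind_str_desc"
in the END block instead reverses both the order of the output rows and the order of the values within each row.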

Short GNU datamash + tr solution:
datamash -st'|' -g1 unique 2 <file | tr ',' ':'
The output:
"id-1"|"A":"B":"D"
"id-2"|"B":"C"
"id-3"|"A"
----------
If the between-item double quotes should be eliminated, use the following alternative:
datamash -st'|' -g1 unique 2 <file | sed 's/","/:/g'
The output:
"id-1"|"A:B:D"
"id-2"|"B:C"
"id-3"|"A"

For the sample input, the following will work, though the output is unsorted.
One-liner
# using two arrays (recommended)
awk 'BEGIN{FS=OFS="|"}!seen[$1,$2]++{a[$1] = ($1 in a ? a[$1] ":" : "") $2}END{for(i in a)print i,a[i]}' infile
# using regexp
awk 'BEGIN{FS=OFS="|"}{ a[$1] = $1 in a ? ( a[$1] ~ ("(^|:)"$2"(:|$)") ? a[$1] : a[$1]":"$2 ) : $2}END{for(i in a)print i,a[i]}' infile
Test Results:
$ cat infile
"id-1"|"A"
"id-2"|"C"
"id-1"|"B"
"id-1"|"D"
"id-2"|"B"
"id-3"|"A"
"id-3"|"A"
"id-1"|"B"
$ awk 'BEGIN{FS=OFS="|"}!seen[$1,$2]++{a[$1] = ($1 in a ? a[$1] ":" : "") $2}END{for(i in a)print i,a[i]}' infile
"id-1"|"A":"B":"D"
"id-2"|"C":"B"
"id-3"|"A"
$ awk 'BEGIN{FS=OFS="|"}{ a[$1] = $1 in a ? ( a[$1] ~ ("(^|:)"$2"(:|$)") ? a[$1] : a[$1]":"$2 ) : $2}END{for(i in a)print i,a[i]}' infile
"id-1"|"A":"B":"D"
"id-2"|"C":"B"
"id-3"|"A"
More readable:
Using regexp
awk 'BEGIN{
FS=OFS="|"
}
{
a[$1] = $1 in a ? (a[$1] ~ ("(^|:)" $2 "(:|$)") ? a[$1] : a[$1] ":" $2) : $2
}
END{
for(i in a)
print i,a[i]
}
' infile
Using two arrays
awk 'BEGIN{
FS=OFS="|"
}
!seen[$1,$2]++{
a[$1] = ($1 in a ? a[$1] ":" : "") $2
}
END{
for(i in a)
print i,a[i]
}' infile
Note: you could also use !seen[$0]++, which uses the entire line as the index. But if, in your real data, you want to key on particular columns, you may prefer !seen[$1,$2]++; here column 1 and column 2 are used as the index, as illustrated below.
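A small made-up example of the difference, using hypothetical three-column lines where only the third column differs:
$ printf '"id-9"|"A"|"x"\n"id-9"|"A"|"y"\n' | awk -F'|' '!seen[$0]++'
"id-9"|"A"|"x"
"id-9"|"A"|"y"
$ printf '"id-9"|"A"|"x"\n"id-9"|"A"|"y"\n' | awk -F'|' '!seen[$1,$2]++'
"id-9"|"A"|"x"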

awk + sort solution:
awk -F'|' '{ gsub(/"/,"",$2); a[$1]=b[$1]++? a[$1]":"$2:$2 }
END{ for(i in a) printf "%s|\"%s\"\n",i,a[i] }' <(sort -u file)
The output:
"id-1"|"A:B:D"
"id-2"|"B:C"
"id-3"|"A"

Related

Shell/Bash Script to reverse the text in every other column?

I'm currently trying to write a bash/shell script that pulls data from a .csv, reverses all the string values in every other column, and outputs to a new CSV. I have a script that grabs every other column, but I'm not sure how to reverse the strings in those columns.
awk 'BEGIN{FS=","} {s=$NF;
for (i=1; i<=NF; i+=2)
printf ("%s%c", $i, i + 2 <= NF ? "," : "\n")
}' input.csv > output.csv
awk to the rescue!
$ seq 100 141 | pr -6ats, |
awk -F, 'function rev(x) {r="";
for(j=length(x);j;j--) r=r substr(x,j,1);
return r}
BEGIN {OFS=FS}
{for(i=1;i<NF;i+=2) $i=rev($i)}1'
001,101,201,103,401,105
601,107,801,109,011,111
211,113,411,115,611,117
811,119,021,121,221,123
421,125,621,127,821,129
031,131,231,133,431,135
631,137,831,139,041,141
$ cat file
abc,def,ghi,klm
$ rev file
mlk,ihg,fed,cba
$ rev file |
awk 'BEGIN{FS=OFS=","} NR==FNR{split($0,a); next} {for (i=1; i<=NF; i+=2) $i=a[NF-i+1]} 1' - file
cba,def,ihg,klm

How do I use awk/sed to merge a field across multiple rows based on matching column values?

I am working with a CSV in bash, and attempting to merge the data in the 2nd column by matched data in the 3rd column.
My code works but the information in the other columns ends up just getting repeated instead of properly copied.
awk -F',' -v OFS=',' '{
env_name=$1
app_name=$4
lob_name=$5
if ($3 in a) {
a[$3] = a[$3]" "$2;
} else {
a[$3] = $2;
}
}
END { for (i in a) print env_name, i, a[i], app_name, lob_name}' input.tmp > output.tmp
This:
A,1,B,C,D
A,2,B,C,D
A,3,E,F,G
A,4,X,Y,Z
A,5,E,F,G
Should become this:
A,1 2,B,C,D
A,3 5,E,F,G
A,4,X,Y,Z
But instead we are getting this:
A,1 2,B,C,D
A,3 5,E,C,D
A,4,X,C,D
Your grouping key should be every field except the second:
$ awk -F, 'BEGIN {SUBSEP=OFS=FS}
{k=$1 FS $3 FS $4 FS $5; a[k]=(k in a)?a[k]" "$2:$2}
END {for(k in a) {split(k,p); print p[1],a[k],p[2],p[3],p[4]}}' file
A,1 2,B,C,D
A,3 5,E,F,G
A,4,X,Y,Z
Perhaps it can be simplified a bit:
$ awk 'BEGIN {OFS=FS=","}
{v=$2; $2=""; k=$0; a[k]=(k in a?a[k]" "v:v)}
END {for(k in a) {$0=k; $2=a[k]; print}}' file
sed + sort + awk
$ sed 's/,/+/3;s/,/+/3' merge_csv | sort -t, -k3 | awk -F, -v OFS=, ' { if($3==p) { a=a b " "; } if(p!=$3 && NR>1) { print $1,a b,p; a="" } b=$2; p=$3 } END { print $1,a b,p } ' | tr '+' ','
A,1 2,B,C,D
A,3 5,E,F,G
A,4,X,Y,Z
$
If Perl is an option, you can try this
$ perl -F, -lane '$x=join(",",@F[-3,-2,-1]); @t=@{$kv{$x}}; push(@t,$F[1]); $kv{$x}=[@t]; END { for(keys %kv) { print "A,",join(" ",@{$kv{$_}}),",$_" }}' merge_csv
A,1 2,B,C,D
A,4,X,Y,Z
A,3 5,E,F,G
$
Input file:
$ cat merge_csv
A,1,B,C,D
A,2,B,C,D
A,3,E,F,G
A,4,X,Y,Z
A,5,E,F,G
$

print multiple fields if multiple pattern matches

I have a comma-delimited file like the one below:
0,category=a,type=b,value=1
1,category=c,type=b,.....,original_value=0
2,category=b,type=c,....,original_value=1,....,corrected_value=3
A line in the file can contain:
(1) only 'value'
(2) only 'original_value'
(3) both 'original_value' and 'corrected_value'
The values can be in any column.
The following awk command I wrote can only print one field after a pattern match:
cat file | awk -F, 'BEGIN{OFS=","} /value/ { for (x=1;x<=NF;x++) if ($x~"value") {print $2,$3,$(x)} }' | sort -u
Current Output:
category=a,type=b,value=1
category=b,type=c,corrected_value=3
category=b,type=c,original_value=1
category=c,type=b,original_value=0
How do I print two fields (columns) of a line if two pattern matches occur? In this case, if both original_value and corrected_value exist.
Expected Output:
category=a,type=b,value=1
category=b,type=c,original_value=1,corrected_value=3
category=c,type=b,original_value=0
Bash Version: 4.3.11
You can use this awk command:
awk 'BEGIN{FS=OFS=","} {printf "%s%s%s", $2,OFS,$3; for(i=4; i<=NF; i++)
if ($i ~ /value/) printf "%s%s", OFS,$i; print ""}' file
category=a,type=b,value=1
category=c,type=b,original_value=0
category=b,type=c,original_value=1,corrected_value=3
Similar to @anubhava's answer, but does not rely on the category or type being in a particular column:
awk -F, '
BEGIN { pattern = "^(category|type|value|original_value|corrected_value)" }
{
sep = ""
for (i=1; i<=NF; i++) {
if ($i ~ pattern) {
printf "%s%s", sep, $i
sep = ","
}
}
print ""
}
' file
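For the sample input this should print (rows in input order, only the matching fields kept):
category=a,type=b,value=1
category=c,type=b,original_value=0
category=b,type=c,original_value=1,corrected_value=3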

How to convert space separated key value data into CSV format in bash?

I am working on some data files where the data consists of key/value pairs separated by spaces.
The data in the files is inconsistent: the keys and values are not always all present, but the keys will always be Table, count and size.
The example below has table_name, count and size information:
cat sample1.txt
Table SCOTT.TABLE1 count 3889 size 300
Table SCOTT.TABLE2 count 7744
Table SCOTT.TABLE3 count 2622
Table SCOTT.TABLE4 size 2773 count 22
Table SCOTT.TABLE5 size 21
The file below has just the table_name, with no count or size data:
cat sample2.txt
Table SCOTT.TABLE1
Table SCOTT.TABLE2
Table SCOTT.TABLE3
Table SCOTT.TABLE4
Table SCOTT.TABLE5
So I am trying to convert these files into CSV format using the following:
cat <file_name> | awk -F' ' 'BEGIN { RS="\n"; print"Table,Count,Size";OFS="," } NR > 1 { print a["Table"], a["count"], a["size"]; delete a; next }{ a[$1]=$2 }{ a[$3]=$4 }{ a[$5]=$6 }'
cat sample1.txt | awk -F' ' 'BEGIN { RS="\n"; print"Table,Count,Size";OFS="," }
NR > 1 { print a["Table"], a["count"], a["size"]; delete a; next }
{ a[$1]=$2 }{ a[$3]=$4 }{ a[$5]=$6 }'
Table,Count,Size
SCOTT.TABLE1,3889,300
,,
,,
,,
And for the second sample
cat sample2.txt | awk -F' ' 'BEGIN { RS="\n"; print"Table,Count,Size";OFS="," } NR > 1 { print a["Table"], a["count"], a["size"]; delete a; next }{ a[$1]=$2 }{ a[$3]=$4 }{ a[$5]=$6 }'
Table,Count,Size
SCOTT.TABLE1,,
,,
,,
,,
But the expected output is as follows:
For sample1.txt
TABLE,count,size
SCOTT.TABLE1,3889,300
SCOTT.TABLE2,7744,
SCOTT.TABLE3,2622,
SCOTT.TABLE4,22,2773
SCOTT.TABLE5,,21
For sample2.txt
Table,Count,Size
SCOTT.TABLE1,,
SCOTT.TABLE2,,
SCOTT.TABLE3,,
SCOTT.TABLE4,,
SCOTT.TABLE5,,
Thanks in advance.
awk to the rescue!
$ awk -v OFS=',' '{for(i=1;i<NF;i+=2)
{if(!($i in c)){c[$i];cols[++k]=$i};
v[NR,$i]=$(i+1)}}
END{for(i=1;i<=k;i++) printf "%s", cols[i] OFS;
print "";
for(i=1;i<=NR;i++)
{for(j=1;j<=k;j++) printf "%s", v[i,cols[j]] OFS;
print ""}}' file
Table,count,size,
SCOTT.TABLE1,3889,300,
SCOTT.TABLE2,7744,,
SCOTT.TABLE3,2622,,
SCOTT.TABLE4,22,2773,
SCOTT.TABLE5,,21,
If you have gawk, you can simplify it further with PROCINFO["sorted_in"]; see the sketch below.
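A gawk-only sketch of that idea (note the columns are then string-sorted rather than kept in order of first appearance, though the result happens to be the same for this sample):
$ gawk -v OFS=',' '{for(i=1;i<NF;i+=2){c[$i]; v[NR,$i]=$(i+1)}}
END{PROCINFO["sorted_in"]="@ind_str_asc"
    for(col in c) printf "%s", col OFS
    print ""
    for(r=1;r<=NR;r++){
        for(col in c) printf "%s", v[r,col] OFS
        print ""}}' file
Table,count,size,
SCOTT.TABLE1,3889,300,
SCOTT.TABLE2,7744,,
SCOTT.TABLE3,2622,,
SCOTT.TABLE4,22,2773,
SCOTT.TABLE5,,21,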
UPDATE: For the revised question, the header needs to be known in advance since some keys might be completely missing. This simplifies the problem, and the following script should do the trick:
$ awk -v header='Table,count,size' \
'BEGIN{OFS=","; n=split(header,h,OFS); print header}
{for(i=1; i<NF; i+=2) v[NR,$i]=$(i+1)}
END{for(i=1; i<=NR; i++)
{printf "%s", v[i,h[1]];
for(j=2; j<=n; j++) printf "%s", OFS v[i,h[j]];
print ""}}' file
Here is an inelegant but fast and comprehensible solution:
awk 'BEGIN{OFS=",";print "TABLE,count,size"}
{
t=$2
if($3=="count"){
c=$4
s=$6
}
else{
s=$4
c=$6
}
print t,c,s
}' 1.txt
output:
TABLE,count,size
SCOTT.TABLE1,3889,300
SCOTT.TABLE2,7744,
SCOTT.TABLE3,2622,
SCOTT.TABLE4,22,2773
SCOTT.TABLE5,,21
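Note that this relies on the layout of the samples: the table name is always field 2, and the first key after it (field 3) is either count or size; input with keys in other positions would need extra branches.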

find unique lines based on one field only [duplicate]

I would like to print unique lines based on the first field, keeping the first occurrence of each line and removing the other, duplicate occurrences.
Input.csv
10,15-10-2014,abc
20,12-10-2014,bcd
10,09-10-2014,def
40,06-10-2014,ghi
10,15-10-2014,abc
Desired Output:
10,15-10-2014,abc
20,12-10-2014,bcd
40,06-10-2014,ghi
I have tried the command below, but it is incomplete:
awk 'BEGIN { FS = OFS = "," } { !seen[$1]++ } END { for ( i in seen) print $0}' Input.csv
Looking for your suggestions ...
You put your test for "seen" in the action part of the script instead of the condition part. Change it to:
awk -F, '!seen[$1]++' Input.csv
Yes, that's the whole script:
$ cat Input.csv
10,15-10-2014,abc
20,12-10-2014,bcd
10,09-10-2014,def
40,06-10-2014,ghi
10,15-10-2014,abc
$
$ awk -F, '!seen[$1]++' Input.csv
10,15-10-2014,abc
20,12-10-2014,bcd
40,06-10-2014,ghi
This should give you what you want:
awk -F, '{ if (!($1 in a)) a[$1] = $0; } END '{ for (i in a) print a[i]}' input.csv
There is a typo in the syntax there; it should be:
awk -F, '{ if (!($1 in a)) a[$1] = $0 } END { for (i in a) print a[i] }' input.csv
