find unique lines based on one field only [duplicate] - bash

I would like to print unique lines based on the first field, keeping the first occurrence of each line and removing the other, duplicate occurrences.
Input.csv
10,15-10-2014,abc
20,12-10-2014,bcd
10,09-10-2014,def
40,06-10-2014,ghi
10,15-10-2014,abc
Desired Output:
10,15-10-2014,abc
20,12-10-2014,bcd
40,06-10-2014,ghi
I have tried the command below, but it is incomplete:
awk 'BEGIN { FS = OFS = "," } { !seen[$1]++ } END { for ( i in seen) print $0}' Input.csv
Looking for your suggestions ...

You put your test for "seen" in the action part of the script instead of the condition part. Change it to:
awk -F, '!seen[$1]++' Input.csv
Yes, that's the whole script:
$ cat Input.csv
10,15-10-2014,abc
20,12-10-2014,bcd
10,09-10-2014,def
40,06-10-2014,ghi
10,15-10-2014,abc
$
$ awk -F, '!seen[$1]++' Input.csv
10,15-10-2014,abc
20,12-10-2014,bcd
40,06-10-2014,ghi
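
For the record, !seen[$1]++ works because a pattern with no action prints the record when the pattern is true, and seen[$1]++ post-increments, so it returns 0 (false) only the first time a given key appears. A commented long-hand equivalent of the one-liner, just as a sketch:
awk -F',' '
{
    if (!seen[$1]) {   # true only the first time this $1 value appears
        print          # print the whole record (awk'\''s default action)
    }
    seen[$1]++         # remember the key so later duplicates are skipped
}' Input.csv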

This should give you what you want:
awk -F, '{ if (!($1 in a)) a[$1] = $0; } END '{ for (i in a) print a[i]}' input.csv

There is a typo in that syntax: the closing quote lands before END instead of at the end of the script, and the corrected version still needs -F,. It should be:
awk -F, '{ if (!($1 in a)) a[$1] = $0 } END { for (i in a) print a[i] }' input.csv
One caveat: for (i in a) visits keys in an unspecified order, so unlike the !seen[$1]++ approach this does not necessarily preserve the input line order.


awk output to file based on filter

I have a big CSV file that I need to cut into different pieces based on the value in one of the columns. My input file dataset.csv is something like this:
NOTE: edited to clarify that the data is comma-delimited, with no spaces.
action,action_type, Result
up,1,stringA
down,1,strinB
left,2,stringC
So, to split by action_type I simply do (I need the whole matching line in the resulting file):
awk -F, '$2 ~ /^1$/ {print}' dataset.csv >> 1_dataset.csv
awk -F, '$2 ~ /^2$/ {print}' dataset.csv >> 2_dataset.csv
This works as expected, but I am basically traversing my original dataset twice. My original dataset is about 5 GB and I have 30 action_type categories. I need to do this every day, so I need to script the thing to run on its own efficiently.
I tried the following but it does not work:
# This is a file called myFilter.awk
{
action_type=$2;
if (action_type=="1") print $0 >> 1_dataset.csv;
else if (action_type=="2") print $0 >> 2_dataset.csv;
}
Then I run it as:
awk -f myFilter.awk dataset.csv
But I get nothing. Literally nothing, not even errors, which sort of tells me that my code is simply not matching anything or my print / pipe statement is wrong.
You may try this awk to do this in a single command:
awk -F, 'NR > 1{fn = $2 "_dataset.csv"; print >> fn; close(fn)}' file
With GNU awk to handle many concurrently open files and without replicating the header line in each output file:
awk -F',' '{print > ($2 "_dataset.csv")}' dataset.csv
or if you also want the header line to show up in each output file then with GNU awk:
awk -F',' '
NR==1 { hdr = $0; next }
!seen[$2]++ { print hdr > ($2 "_dataset.csv") }
{ print > ($2 "_dataset.csv") }
' dataset.csv
or the same with any awk (note the >> plus close(): closing after each record keeps the script under POSIX awk's open-file limit, and after a close() the file must be reopened with >>, since > would truncate it and lose the earlier lines):
awk -F',' '
NR==1 { hdr = $0; next }
{ out = $2 "_dataset.csv" }
!seen[$2]++ { print hdr > out }
{ print >> out; close(out) }
' dataset.csv
As currently coded the input field separator has not been defined.
Current:
$ cat myfilter.awk
{
action_type=$2;
if (action_type=="1") print $0 >> 1_dataset.csv;
else if (action_type=="2") print $0 >> 2_dataset.csv;
}
Invocation:
$ awk -f myfilter.awk dataset.csv
There are a couple of ways to address this (note also that the output filenames must be quoted strings; unquoted, 1_dataset.csv is not valid awk):
$ awk -v FS="," -f myfilter.awk dataset.csv
or
$ cat myfilter.awk
BEGIN {FS=","}
{
action_type=$2
if (action_type=="1") print $0 >> 1_dataset.csv;
else if (action_type=="2") print $0 >> 2_dataset.csv;
}
$ awk -f myfilter.awk dataset.csv
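
Putting the separator fix together with the dynamic-filename idea from the earlier answers, here is a minimal one-pass sketch of the filter script covering all 30 categories (assuming the category is always in column 2 and the first line is a header):
$ cat myFilter.awk
BEGIN { FS = "," }
NR > 1 {                      # skip the header line
    out = $2 "_dataset.csv"   # build the output name from the category
    print >> out              # append, so earlier lines survive close()
    close(out)                # portable: stay under awk's open-file limit
}
$ awk -f myFilter.awk dataset.csv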

How do I use awk/sed to merge a field across multiple rows based on matching column values?

I am working with a CSV in bash, and attempting to merge the data in the 2nd column by matched data in the 3rd column.
My code works but the information in the other columns ends up just getting repeated instead of properly copied.
awk -F',' -v OFS=',' '{
env_name=$1
app_name=$4
lob_name=$5
if ($3 in a) {
a[$3] = a[$3]" "$2;
} else {
a[$3] = $2;
}
}
END { for (i in a) print env_name, i, a[i], app_name, lob_name}' input.tmp > output.tmp
This:
A,1,B,C,D
A,2,B,C,D
A,3,E,F,G
A,4,X,Y,Z
A,5,E,F,G
Should become this:
A,1 2,B,C,D
A,3 5,E,F,G
A,4,X,Y,Z
But instead we are getting this:
A,1 2,B,C,D
A,3 5,E,C,D
A,4,X,C,D
Your grouping key should be every field except the second:
$ awk -F, 'BEGIN {SUBSEP=OFS=FS}
{k=$1 FS $3 FS $4 FS $5; a[k]=(k in a)?a[k]" "$2:$2}
END {for(k in a) {split(k,p); print p[1],a[k],p[2],p[3],p[4]}}' file
A,1 2,B,C,D
A,3 5,E,F,G
A,4,X,Y,Z
Perhaps it can be simplified a bit:
$ awk 'BEGIN {OFS=FS=","}
{v=$2; $2=""; k=$0; a[k]=(k in a?a[k]" "v:v)}
END {for(k in a) {$0=k; $2=a[k]; print}}' file
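
The trick in that simplified version: assigning to $2 forces awk to rebuild $0 with OFS, so the record itself, with the second field blanked, doubles as the grouping key; the END block then reloads each key into $0 and drops the merged values back into $2. The same script with comments, as a sketch:
awk 'BEGIN {OFS=FS=","}
{
    v = $2                            # save the value to be merged
    $2 = ""                           # blank it; $0 is rebuilt using OFS
    k = $0                            # the rest of the record is the key
    a[k] = (k in a ? a[k] " " v : v)  # append to any earlier values
}
END {
    for (k in a) { $0 = k; $2 = a[k]; print }
}' file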
sed + sort + awk
$ sed 's/,/+/3;s/,/+/3' merge_csv | sort -t, -k3 | awk -F, -v OFS=, ' { if($3==p) { a=a b " "; } if(p!=$3 && NR>1) { print $1,a b,p; a="" } b=$2; p=$3 } END { print $1,a b,p } ' | tr '+' ','
A,1 2,B,C,D
A,3 5,E,F,G
A,4,X,Y,Z
$
If Perl is an option, you can try this
$ perl -F, -lane '$x=join(",",@F[-3,-2,-1]); @t=@{$kv{$x}};push(@t,$F[1]);$kv{$x}=[@t]; END { for(keys %kv) { print "A,",join(" ",@{$kv{$_}}),",$_" }} ' merge_csv
A,1 2,B,C,D
A,4,X,Y,Z
A,3 5,E,F,G
$
Input file:
$ cat merge_csv
A,1,B,C,D
A,2,B,C,D
A,3,E,F,G
A,4,X,Y,Z
A,5,E,F,G
$

Remove duplicate from csv using bash / awk

I have a csv file with the format:
"id-1"|"A"
"id-2"|"C"
"id-1"|"B"
"id-1"|"D"
"id-2"|"B"
"id-3"|"A"
"id-3"|"A"
"id-1"|"B"
I want to group by the unique ids in the first column and concatenate the types into a single row, like this:
"id-1"|"A:B:D"
"id-2"|"B:C"
"id-3"|"A"
I found awk does a great job in handling such scenarios. But all I could achieve is this:
"id-1"|"A":"B":"D":"B"
"id-2"|"B":"C"
"id-3"|"A":"A"
I used this command:
awk -F "|" '{if(a[$1])a[$1]=a[$1]":"$2; else a[$1]=$2;}END{for (i in a)print i, a[i];}' OFS="|" file
How can I remove the duplicates and also handle the formatting of the second column types?
quick fix:
$ awk -F "|" '!seen[$0]++{if(a[$1])a[$1]=a[$1]":"$2; else a[$1]=$2;}END{for (i in a)print i, a[i];}' OFS="|" file
"id-1"|"A":"B":"D"
"id-2"|"C":"B"
"id-3"|"A"
!seen[$0]++ will be true only if the line was not already seen
If the second column's values should all sit within a single pair of double quotes:
$ awk -v dq='"' 'BEGIN{FS=OFS="|"}
!seen[$0]++{a[$1]=a[$1] ? a[$1]":"$2 : $2}
END{for (i in a){gsub(dq,"",a[i]); print i, dq a[i] dq}}' file
"id-1"|"A:B:D"
"id-2"|"C:B"
"id-3"|"A"
With GNU awk for true multi-dimensional arrays and gensub() and sorted_in:
$ awk -F'|' '
{ a[$1][gensub(/"/,"","g",$2)] }
END {
PROCINFO["sorted_in"] = "#ind_str_asc"
for (i in a) {
c = 0
for (j in a[i]) {
printf "%s%s", (c++ ? ":" : i "|\""), j
}
print "\""
}
}
' file
"id-1"|"A:B:D"
"id-2"|"B:C"
"id-3"|"A"
The output rows and columns will both be string-sorted (i.e. alphabetically by characters) in ascending order.
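
PROCINFO["sorted_in"] is gawk-specific: it sets the order in which for (i in a) visits indices, and "@ind_str_asc" means ascending string order of the indices. A tiny standalone demo of the knob, as a sketch:
$ gawk 'BEGIN {
    a["b"]; a["a"]; a["c"]
    PROCINFO["sorted_in"] = "@ind_str_asc"   # visit indices in string order
    for (i in a) printf "%s ", i
    print ""
}'
a b c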
Short GNU datamash + tr solution:
datamash -st'|' -g1 unique 2 <file | tr ',' ':'
The output:
"id-1"|"A":"B":"D"
"id-2"|"B":"C"
"id-3"|"A"
----------
If the between-item double quotes should be eliminated, use the following alternative:
datamash -st'|' -g1 unique 2 <file | sed 's/","/:/g'
The output:
"id-1"|"A:B:D"
"id-2"|"B:C"
"id-3"|"A"
For the sample input, the below will work, but the output is unsorted.
One-liner
# using two arrays (recommended)
awk 'BEGIN{FS=OFS="|"}!seen[$1,$2]++{a[$1] = ($1 in a ? a[$1] ":" : "") $2}END{for(i in a)print i,a[i]}' infile
# using regexp
awk 'BEGIN{FS=OFS="|"}{ a[$1] = $1 in a ? ( a[$1] ~ ("(^|:)"$2"(:|$)") ? a[$1] : a[$1]":"$2 ) : $2}END{for(i in a)print i,a[i]}' infile
Test Results:
$ cat infile
"id-1"|"A"
"id-2"|"C"
"id-1"|"B"
"id-1"|"D"
"id-2"|"B"
"id-3"|"A"
"id-3"|"A"
"id-1"|"B"
$ awk 'BEGIN{FS=OFS="|"}!seen[$1,$2]++{a[$1] = ($1 in a ? a[$1] ":" : "") $2}END{for(i in a)print i,a[i]}' infile
"id-1"|"A":"B":"D"
"id-2"|"C":"B"
"id-3"|"A"
$ awk 'BEGIN{FS=OFS="|"}{ a[$1] = $1 in a ? ( a[$1] ~ ("(^|:)"$2"(:|$)") ? a[$1] : a[$1]":"$2 ) : $2}END{for(i in a)print i,a[i]}' infile
"id-1"|"A":"B":"D"
"id-2"|"C":"B"
"id-3"|"A"
More readable:
Using regexp
awk 'BEGIN{
FS=OFS="|"
}
{
a[$1] =$1 in a ?(a[$1] ~ ("(^|:)"$2"(:|$)") ? a[$1] : a[$1]":"$2):$2
}
END{
for(i in a)
print i,a[i]
}
' infile
Using two arrays
awk 'BEGIN{
FS=OFS="|"
}
!seen[$1,$2]++{
a[$1] = ($1 in a ? a[$1] ":" : "") $2
}
END{
for(i in a)
print i,a[i]
}' infile
Note: you can also use !seen[$0]++, which uses the entire line as the index. But if, in your real data,
you want to key on particular columns, you may prefer !seen[$1,$2]++,
where columns 1 and 2 are used as the index (awk joins them with SUBSEP into a single string key).
awk + sort solution:
awk -F'|' '{ gsub(/"/,"",$2); a[$1]=b[$1]++? a[$1]":"$2:$2 }
END{ for(i in a) printf "%s|\"%s\"\n",i,a[i] }' <(sort -u file)
The output:
"id-1"|"A:B:D"
"id-2"|"B:C"
"id-3"|"A"

How to get percentage of a column based on key in unix

I have a table like:
Student_name,Subject
Ram,Maths
Ram,Science
Arjun,Maths
Arjun,Science
Arjun,Social
Arjun,Social
Output: I need to report only students whose 'Social' subject percentage is more than 49%.
Final output
Arjun, social, 50
Temporary output (backend):
Student_name,Subject,Percentage(group by student name)
Ram,Maths,50
Ram,Science,50
Arjun,Maths,25
Arjun,Science,25
Arjun,Social,50
I have tried the awk commands below, but the percentage is computed over all subjects, irrespective of the grouping by student name.
awk -F, '{x++;}{a[$1,$2]++;}END{for (i in a)print i, a[i],(a[i]/x)*100;}' OFS=, test1.csv > output2.dat
awk -F, '$2=="Science" && $3>=49{ print $1}' output2.dat
Also, can we get it in a single awk command?
Try the following awk as well (with -F, added so the comma-separated input splits correctly); it provides the output in the same order as the data in Input_file.
awk -F, 'FNR>1 && FNR==NR{a[$1]++;b[$1]=$0;next} FNR==1 && FNR!=NR{print $0,"percentage";next}($1 in b){print $0"\t"100/a[$1]"%"}' Input_file Input_file
EDIT: Adding a non-one-liner form of the solution too.
awk -F, '
FNR>1 && FNR==NR{
a[$1]++;
b[$1]=$0;
next
}
FNR==1 && FNR!= NR{
print $0,"percentage";
next
}
($1 in b){
print $0"\t"100/a[$1]"%"
}
' Input_file Input_file
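The engine of this answer is awk's two-pass idiom: FNR resets to 1 for each input file while NR keeps counting, so FNR==NR is true only while the first copy of Input_file is read. Stripped down to just that mechanism (a sketch, not the full answer):
awk -F, '
FNR == NR {                 # pass 1: true only for the first copy of the file
    if (FNR > 1) n[$1]++    # count rows per student, skipping the header
    next
}
FNR == 1 { print $0, "percentage"; next }   # pass 2: reprint the header
{ print $0 "\t" 100/n[$1] "%" }             # pass 2: each row is 1/n of its student
' Input_file Input_file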
EDIT 1: Adding a new solution as per the OP's changed requirement.
awk -F, '
FNR>1 && FNR==NR{
a[$1]++;
b[$1]=b[$1]?b[$1] ORS $0:$0;
c[$1,$2];
next
}
FNR==1 && FNR!= NR{
print $0,"percentage";
next
}
($1 in b){
if($2=="Science" && (100/a[$1])>49){
print b[$1]
}
}
' Input_file Input_file
GNU awk solution:
awk -F, 'NR==1{ print $0,"Percentage" }NR>1{ a[$1][$2]++ }
END{
for(i in a) for(j in a[i]) print i,j,(a[i][j]/length(a[i])*100"%")
}' OFS=',' test1.csv | column -t
The output:
Student_name,Subject,Percentage
Ram,Maths,50%
Ram,Science,50%
Arjun,Social,66.6667%
Arjun,Maths,33.3333%
Arjun,Science,33.3333%
Use a Numeric Comparison
You can do this with a very simple numeric comparison against the third field:
$ awk '$3 > 49 {print}' /tmp/input
Student_name Subject Percentage(group by student name)
Ram Maths 50%
Ram Science 50%
For this comparison, AWK coerces the number to a string (50% is not a numeric string), so the comparison treats 50% the same as 50. As a nice byproduct, if the third field doesn't contain any numbers then it does a string comparison; the header line compares greater than "49" as a string, so it matches, too.
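A quick sanity check of that coercion behaviour (a hypothetical one-liner, not from the original answer):
$ awk 'BEGIN { print ("50%" > 49), ("abc" > 49), (50 > 49) }'
1 1 1
The first two are string comparisons against "49" ("5" and "a" both sort after "4"); only the last is a numeric comparison.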

AWK - using element on next record GETLINE?

I have a problem with this basic data:
DP;DG
67;
;10
;14
;14
;18
;18
;22
;65
68;
;0
;9
;25
;25
70;
that I'd like to transform into this kind of output:
DP;DG
67;
;10
;14
;14
;18
;18
;22
;65;x
68;
;0
;9
;25
;25;x
70;
The "x" value comes if on the next line $1 exists or if $2 is null. From my understanding, I've to use getline but I don't get the way!
I've tried the following code:
#!/bin/bash
file2=tmp.csv
file3=fin.csv
awk 'BEGIN {FS=OFS=";"}
{
print $0;
getline;
if($2="") {print $0";x"}
else {print $0}
}' $file2 > $file3
It seemed easy. I won't go into the result; it was totally different from what I expected.
Any clue? Is getline necessary for this problem?
OK, I continued to test some code:
#!/bin/bash
file2=tmp.csv
file3=fin.csv
awk 'BEGIN {FS=OFS=";"}
{
getline var
if (var ~ /.*;$/) {
print $0";x";
print var;
}
else {
print $0;
print var;
}
}' $file2 > $file3
It's much better, but still, not all the lines that should be marked are... I don't get why.
An alternative one-pass version:
$ awk -F\; 'NR>1 {printf "%s\n", (f && $2<0?";x":"")}
{f=$1<0; printf "%s", $0}
END {print ""}' file
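A note on the cryptic $1<0 and $2<0 tests, which are really emptiness checks: comparing an empty field with 0 falls back to a string comparison ("" sorts before "0", so it is true), while a non-empty numeric field compares numerically (false for this data). The same logic spelled out, as a sketch:
awk -F';' '
NR > 1 {
    # terminate the PREVIOUS line: append ";x" when that line was a
    # detail line (f set) and the current line is a master ($2 empty)
    printf "%s\n", (f && $2 == "" ? ";x" : "")
}
{
    f = ($1 == "")    # detail lines have an empty first field
    printf "%s", $0   # print the current line, holding back its newline
}
END { print "" }      # close off the final line
' file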
Give this one-liner a try (the trailing 7 is just a true condition, shorthand for {print}):
awk -F';' 'NR==FNR{if($1>0||!$2)a[NR-1];next}FNR in a{$0=$0";x"}7' file file
or
awk -F';' 'NR==FNR{if($1~/\S/||$2).....
