I have a CSV file with column-wise data, like:
EvtsUpdated,IR23488670,15920221,ESTIMATED
EvtsUpdated,IR23488676,11014018,ESTIMATED
EvtsUpdated,IR23488700,7273867,ESTIMATED
EvtsUpdated,IR23486360,7273881,ESTIMATED
EvtsUpdated,IR23488670,7273807,ESTIMATED
EvtsUpdated,IR23488670,9738420,ESTIMATED
EvtsUpdated,IR23488670,7273845,ESTIMATED
EvtsUpdated,IR23488676,12149463,ESTIMATED
and I just want to find all the duplicate rows, ignoring one column (column 3). The output should be like:
EvtsUpdated,IR23488670,15920221,ESTIMATED
EvtsUpdated,IR23488676,11014018,ESTIMATED
EvtsUpdated,IR23488700,7273867,ESTIMATED
EvtsUpdated,IR23488670,7273807,ESTIMATED
EvtsUpdated,IR23488670,9738420,ESTIMATED
EvtsUpdated,IR23488670,7273845,ESTIMATED
EvtsUpdated,IR23488676,12149463,ESTIMATED
I tried it by first removing column 3 into a second file using
cut --complement -f 3 -d, filename > secondfile
and then tried an awk command like awk -F, '{if(FNR==NR){print}}' secondfile
but as I don't have complete knowledge of awk, I'm not able to do it.
You can use an awk array to store the count of each combination of the key columns and use it to identify duplicates.
awk -F "," '{ row[$1 FS $2 FS $4]++ ; rec[$0","NR] = $1 FS $2 FS $4 }
END{ for ( key in rec ) { if (row[rec[key]] > 1) { print key } } }' filename | sort -t',' -k5,5n | cut -f1-4 -d','
An additional sort on the appended record number (NR) was required to restore the original ordering expected in your output, since the for (key in rec) loop iterates in no particular order.
Note: in the output you show, the row with IR23488700 is treated as a duplicate even though it is not.
I did the same by first cutting out the 3rd column (which may differ between otherwise identical rows) and then running awk '++A[$0]==2' file. Thanks for your help.
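For reference, a minimal sketch of that two-step approach (assuming GNU cut for --complement); note that column 3 is dropped by cut and only the second occurrence of each duplicated key is printed:
# remove column 3, then print each remaining line the moment its count reaches 2
cut --complement -f 3 -d, file | awk '++A[$0] == 2'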
I need to print 2 columns after a specific string (in my case it is 64). There can be multiple instances of 64 within the same CSV row; however, the next instance will not occur within 3 columns of the previous occurrence. The output for each instance should go on its own line, and only unique entries should be printed. The problem is that the specific string does not fall in the same column for all rows. Every row has this kind of dynamic data, and there is no header in the CSV. Say the below is the input file (it's just a sample; the actual file has approx. 300 columns and 5 million rows):
00:TEST,123453103279586,ABC,XYZ,123,456,65,906,06149,NIL TS21,1,64,906,06149,NIL TS22,1,64,916,06149,NIL BS20,1,64,926,06149,NIL BS30,1,64,906,06149,NIL CAML,1,ORIG,0,TERM,1,1,1,6422222222
00:TEST,123458131344169,ABC,XYZ,123,456,OCCF,1,1,1,64,857,19066,NIL TS21,1,64,857,19066,NIL TS22,1,64,857,19066,NIL BS20,1,64,857,19067,NIL BS30,1,64,857,19068,NIL PSS,1,E2 EPSDATA,GRANTED,NONE,1,N,N,256000,5
00:TEST,123458131016844,ABC,XYZ,123,456,HOLD,,1,64,938,36843,NIL TS21,1,64,938,36841,NIL TS22,1,64,938,36823,NIL BS20,1,64,938,36843,NIL BS30,1,64,938,36843,NIL CAML,1,ORIG,0,TERM,00,50000,N,N,N,N
00:TEST,123453102914690,ABC,XYZ,123,456,HOLD,,1,PBS,TS11,64,938,64126,NIL TS21,1,64,938,64126,NIL TS22,1,64,938,64126,NIL BS20,1,64,938,64226,NIL BS30,1,64,938,64326,NIL CAML,1,ORIG,0,TERM,1,1,1,6422222222,2222,R
Output required (only unique entries):
64,906,06149
64,857,19066
64,857,19067
64,857,19068
64,938,36843
64,938,36841
64,938,36823
64,938,36843
64,938,36843
64,938,64326
There are no performance-related concerns. I have tried to search many threads but could not find anything closely related. Please help.
We can use a pipe of two commands: the first puts each 64 at the start of a line, and the second prints the first three columns whenever it sees a leading 64.
sed 's/,64[,\n]/\n64,/g' file | awk -F, '/^64/ { print $1 FS $2 FS $3 }'
There are ways of doing this with a single awk command, but this felt quick and easy to me.
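As an illustration of that (a sketch of my own, not the original answer's code), one single-awk way is to treat ",64," itself as the field separator and print the first two comma-separated values of every piece after the first; uniqueness would still need the array treatment shown further below:
# hypothetical sketch: split each record on the literal ",64," separator
awk -F',64,' -v OFS=, '{ for (i = 2; i <= NF; i++) { split($i, f, ","); print 64, f[1], f[2] } }' file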
Though the sample data from the question contains redundant lines, karakfa (see below) reminds me that the question asks for unique output. The sed + awk version below uses the keys of an associative array to keep track of records that have already been printed.
sed 's/,64[,\n]/\n64,/g' file | awk -F, 'BEGIN { split("",a) } /^64/ && !((x=$1 FS $2 FS $3) in a) { a[x]=1; print x }'
gawk:
awk -F, '{for(i=0;++i<=NF;){if($i=="64")a=4;if(--a>0)s=s?s","$i:$i;if(a==1){print s;s=""}}}' file
Sed for fun
sed -n -e 's/$/,n,n,n/' -e ':a' -e 'G;s/[[:blank:],]\(64,.*\)\(\n\)$/\2\1/;s/.*\(\n\)\(64\([[:blank:],][^[:blank:],]\{1,\}\)\{2\}\)\([[:blank:],][^[:blank:],]\{1,\}\)\{3\}\([[:blank:],].*\)\{0,1\}$/\1\2\1\5/;s/^.*\n\(.*\n\)/\1/;/^64.*\n/P;s///;ta' YourFile | sort -u
assuming columns are separated by blank space or comma
a sort -u is needed for uniqueness (it is possible in sed, but it would mean adding another "simple" action of the same kind in this case)
awk to the rescue!
$ awk -F, '{for(i=1;i<=NF;i++)
              if($i==64)
                {k=$i FS $(++i) FS $(++i);
                 if (!a[k]++)
                    print k
                }
            }' file
64,906,06149
64,916,06149
64,926,06149
64,857,19066
64,857,19067
64,857,19068
64,938,36843
64,938,36841
64,938,36823
64,938,64126
64,938,64226
64,938,64326
ps. your sample output doesn't match the given input.
So I've looked around now for a few hours but haven't found anything helpful.
I want to sort through a file that has a large number of lines formatted like
Values1, values2, values3, values4, values5, values6,
but I want to return only the lines that are uniquely related to
Values1, values2, values3, values6
As in, I have multiple instances of Values1, values2, values3, values6 whose only difference is values4, values5, and I don't want to return all of those, just one instance of the line (preferably the line with the largest values of values4, values5, but that's not a big deal).
I have tried using
uniq -s ##
but that doesn't work because the lengths of my values are variable.
I have also tried
sort -u -k 1,3
but that doesn't seem to work either.
Mainly, my issue is that my values are variable in length. I'm not that concerned with sorting by values6, but it would be nice.
Any help would be greatly appreciated.
With awk, you can print the first time the "key" is seen:
awk '
{ key = $1 OFS $2 OFS $3 OFS $6 }
!seen[key]++
' file
The magic !seen[key]++ is an awk idiom. It is true only the first time that key is encountered: seen[key] is zero (false) on first sight, so the negation is true and the line is printed; the post-increment then makes the stored value non-zero, so the test is false for every subsequent encounter of that key.
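For illustration, running the idiom against a few made-up lines (hypothetical data in the same comma-plus-space layout as the question) keeps only the first line of each key:
$ printf 'a, b, c, 10, 20, K1,\na, b, c, 30, 40, K1,\nd, e, f, 10, 20, K2,\n' |
  awk '{ key = $1 OFS $2 OFS $3 OFS $6 } !seen[key]++'
a, b, c, 10, 20, K1,
d, e, f, 10, 20, K2,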
alternative to awk
cut -d" " -f1-3,6 filename | sort -u
extract only required fields, sort unique
If you absolutely mustn't use the very clean cut method suggested by karakfa, then with a CSV file as input you could use uniq -f <num>, which skips the first <num> fields when comparing lines for uniqueness.
Since uniq expects blanks as separators, we need to change the separator and also reorder the columns to meet your requirements.
sed 's/,/\t/g' textfile.csv | awk '{ print $4,$5,$1,$2,$3,$6}' | \
sort -k3,6 | uniq -f 2 | \
awk 'BEGIN{OFS=",";} { print $3,$4,$5,$1,$2,$6}'
This way, only the $4 and $5 values of the first line of each group (after sorting) will be printed.
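If you want that kept line to be the one with the largest values4/values5 (the question's stated preference), one possible tweak, assuming those two fields are numeric, is to add reversed numeric sort keys on the reordered fields 1 and 2 after the grouping key:
sed 's/,/\t/g' textfile.csv | awk '{ print $4,$5,$1,$2,$3,$6}' | \
sort -k3,6 -k1,1nr -k2,2nr | uniq -f 2 | \
awk 'BEGIN{OFS=",";} { print $3,$4,$5,$1,$2,$6}'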