Get all duplicate records in a CSV if one column is different - bash

I have a CSV file with column-wise data, like
EvtsUpdated,IR23488670,15920221,ESTIMATED
EvtsUpdated,IR23488676,11014018,ESTIMATED
EvtsUpdated,IR23488700,7273867,ESTIMATED
EvtsUpdated,IR23486360,7273881,ESTIMATED
EvtsUpdated,IR23488670,7273807,ESTIMATED
EvtsUpdated,IR23488670,9738420,ESTIMATED
EvtsUpdated,IR23488670,7273845,ESTIMATED
EvtsUpdated,IR23488676,12149463,ESTIMATED
and I just want to find all the duplicate rows, ignoring one column (column 3). The output should be like
EvtsUpdated,IR23488670,15920221,ESTIMATED
EvtsUpdated,IR23488676,11014018,ESTIMATED
EvtsUpdated,IR23488700,7273867,ESTIMATED
EvtsUpdated,IR23488670,7273807,ESTIMATED
EvtsUpdated,IR23488670,9738420,ESTIMATED
EvtsUpdated,IR23488670,7273845,ESTIMATED
EvtsUpdated,IR23488676,12149463,ESTIMATED
I tried it by first cutting out column 3 into another file using
cut --complement -f 3 -d, filename
and then I tried an awk command on that file, like
awk -F, '{if(FNR==NR){print}}' secondfile
but as I don't have complete knowledge of awk, I wasn't able to get it working.

You can use awk arrays to store the count of each group of columns to identify duplicates.
awk -F "," '{row[$1$2$4]++ ; rec[$0","NR] = $1$2$4 }
END{ for ( key in rec ) { if (row[rec[key]] > 1) { print key } } }' filename | sort -t',' -k5 | cut -f1-4 -d','
An additional sort (numeric, on the appended line number) was required to restore the original ordering expected in your output.
Note: in your expected output, the row with IR23488700 is included even though it is not a duplicate.

I did it by first cutting out the 3rd column (the one that may differ) and then running awk '++A[$0]==2' file on the result. Thanks for your help.
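For completeness, a two-pass awk sketch along the same lines that prints all the duplicate rows (ignoring column 3) directly from the original file, with no temporary file and preserving order; filename is read twice:
awk -F, 'NR==FNR { count[$1 FS $2 FS $4]++; next }
         count[$1 FS $2 FS $4] > 1' filename filename
Like the answer above, it leaves out the IR23488700 row, since that key appears only once.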

Related

Add prefix to all rows and columns efficiently

My aim is to add a prefix to all rows and columns returned from an SQL query (all rows of the same column should take the same prefix). The way I am doing it at the moment is
echo "$(<my_sql_query> | awk '$0="prefixA_"$0' |
awk '$2="prefixB_"$2' |
awk '$3="prefixC_"$3' |
awk '$4="prefixD_"$4')"
The script above does exactly what I want but what I would like to know is whether there is faster way of doing it.
If you are willing to go with an echo + awk solution, you could do it in a single awk where all the values are prefixed in one shot. I am not sure about your query, but this assumes fields are separated by spaces only.
echo "$<my_sql-query>" |
awk '{$0="prefixA_"$0;$2="prefixB_"$2;$3="prefixC_"$3;$4="prefixD_"$4} 1'
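For illustration, with hypothetical space-separated input a b c d e the one-liner gives:
$ echo "a b c d e" |
  awk '{$0="prefixA_"$0;$2="prefixB_"$2;$3="prefixC_"$3;$4="prefixD_"$4} 1'
prefixA_a prefixB_b prefixC_c prefixD_d e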
EDIT: Adding a generic solution here, where we can pass the field numbers and their respective prefixes to a function and have them added to those fields. Fair warning: it is not tested much because no samples were given.
echo "$<my_sql-query>" |
awk '
function addPrefix(fieldNumbers,fieldValues){
num=split(fieldNumbers,arr1,"#")
split(fieldValues,arr2,"#")
for(i=1;i<=num;i++){
$arr1[i]=arr2[i]$arr1[i]
}
}
addPrefix("1#2#3#4","prefixA_#prefixB_#prefixC_#prefixD_")
1'

print 3 consecutive columns after a specific string from a CSV

I need to print the 2 columns after a specific string (in my case it is 64), together with the string itself, i.e. 3 columns in total. There can be multiple instances of 64 within the same CSV row; however, the next instance will not occur within 3 columns of the previous occurrence. The output for each instance should be on its own line, and unique. The problem is that the specific string does not fall in the same column for all rows. All rows contain somewhat dynamic data and there is no header in the CSV. Say the input file is as below (it's just a sample; the actual file has approx 300 columns and 5 million rows):
00:TEST,123453103279586,ABC,XYZ,123,456,65,906,06149,NIL TS21,1,64,906,06149,NIL TS22,1,64,916,06149,NIL BS20,1,64,926,06149,NIL BS30,1,64,906,06149,NIL CAML,1,ORIG,0,TERM,1,1,1,6422222222
00:TEST,123458131344169,ABC,XYZ,123,456,OCCF,1,1,1,64,857,19066,NIL TS21,1,64,857,19066,NIL TS22,1,64,857,19066,NIL BS20,1,64,857,19067,NIL BS30,1,64,857,19068,NIL PSS,1,E2 EPSDATA,GRANTED,NONE,1,N,N,256000,5
00:TEST,123458131016844,ABC,XYZ,123,456,HOLD,,1,64,938,36843,NIL TS21,1,64,938,36841,NIL TS22,1,64,938,36823,NIL BS20,1,64,938,36843,NIL BS30,1,64,938,36843,NIL CAML,1,ORIG,0,TERM,00,50000,N,N,N,N
00:TEST,123453102914690,ABC,XYZ,123,456,HOLD,,1,PBS,TS11,64,938,64126,NIL TS21,1,64,938,64126,NIL TS22,1,64,938,64126,NIL BS20,1,64,938,64226,NIL BS30,1,64,938,64326,NIL CAML,1,ORIG,0,TERM,1,1,1,6422222222,2222,R
Output required (only unique entries):
64,906,06149
64,857,19066
64,857,19067
64,857,19068
64,938,36843
64,938,36841
64,938,36823
64,938,36843
64,938,36843
64,938,64326
There are no performance-related concerns. I have searched many threads but could not find anything closely related. Please help.
We can use a pipe of two commands: the first puts each 64 at the start of a line, and the second prints the first three columns whenever it sees a leading 64.
sed 's/,64[,\n]/\n64,/g' file | awk -F, '/^64/ { print $1 FS $2 FS $3 }'
There are ways of doing this with a single awk command, but this felt quick and easy to me.
Though the sample data from the question contains redundant lines, karakfa (see below) reminds me that the question speaks of a "unique data" requirement. This version uses the keys of an associative array to keep track of duplicate records.
sed 's/,64[,\n]/\n64,/g' file | awk -F, 'BEGIN { split("",a) } /^64/ && !((x=$1 FS $2 FS $3) in a) { a[x]=1; print x }'
gawk:
awk -F, '{for(i=0;++i<=NF;){if($i=="64")a=4;if(--a>0)s=s?s","$i:$i;if(a==1){print s;s=""}}}' file
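The same countdown logic written out with comments, for readability (a sketch; like the one-liner above, it prints every match and does not de-duplicate):
awk -F, '{
  for (i = 0; ++i <= NF; ) {
    if ($i == "64") a = 4               # start a countdown covering 64 and the next 2 fields
    if (--a > 0) s = s ? s "," $i : $i  # while the countdown runs, collect the current field
    if (a == 1) { print s; s = "" }     # countdown done: print the triple and reset
  }
}' file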
Sed for fun
sed -n -e 's/$/,n,n,n/' -e ':a' -e 'G;s/[[:blank:],]\(64,.*\)\(\n\)$/\2\1/;s/.*\(\n\)\(64\([[:blank:],][^[:blank:],]\{1,\}\)\{2\}\)\([[:blank:],][^[:blank:],]\{1,\}\)\{3\}\([[:blank:],].*\)\{0,1\}$/\1\2\1\5/;s/^.*\n\(.*\n\)/\1/;/^64.*\n/P;s///;ta' YourFile | sort -u
assuming columns are separated by blank space or comma
a sort -u is needed for uniqueness (possible in sed too, but it would mean adding another "simple" action of the same kind)
awk to the rescue!
$ awk -F, '{for(i=1;i<=NF;i++)
              if($i==64)
                {k=$i FS $(++i) FS $(++i);   # build "64,<next>,<next2>"; ++i skips past the used fields
                 if (!a[k]++)                # only the first time this triple is seen
                    print k
                }
           }' file
64,906,06149
64,916,06149
64,926,06149
64,857,19066
64,857,19067
64,857,19068
64,938,36843
64,938,36841
64,938,36823
64,938,64126
64,938,64226
64,938,64326
ps. your sample output doesn't match the given input.

find uniq lines in file, but ignore certain columns

So I've looked around now for a few hours but haven't found anything helpful.
I want to sort through a file that has a large number of lines formatted like
Values1, values2, values3, values4, values5, values6,
but I want to return only the lines that are uniquely related to
Values1, values2, values3, values6
That is, I have multiple instances of Values1, values2, values3, values6 whose only difference is values4, values5, and I don't want to return all of those, just one instance of the line (preferably the line with the largest values4, values5, but that's not a big deal).
I have tried using
uniq -s ##
but that doesn't work because my values lengths are variable.
I have also tried
sort -u -k 1,3
but that doesn't seem to work either.
Mainly my issue is that my values are variable in length. I'm not that concerned with sorting by values6, but it would be nice.
Any help would be greatly appreciated.
With awk, you can print the first time the "key" is seen:
awk '
{ key = $1 OFS $2 OFS $3 OFS $6 }
!seen[key]++
' file
The magic !seen[key]++ is an awk idiom. It is true only the first time that key is encountered; it then increments the count for that key so the expression is false on any subsequent encounter.
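For illustration, with a hypothetical file laid out like the question's data (a sketch):
$ cat file
a, b, c, 10, 20, z,
a, b, c, 30, 40, z,
a, b, d, 10, 20, w,
$ awk '{ key = $1 OFS $2 OFS $3 OFS $6 } !seen[key]++' file
a, b, c, 10, 20, z,
a, b, d, 10, 20, w,
Only the first line of each (field 1, 2, 3, 6) combination survives.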
alternative to awk
cut -d" " -f1-3,6 filename | sort -u
extract only the required fields, then sort unique (note this prints just those four fields, not the whole lines)
If you absolutely can't use the very clean cut method suggested by karakfa, then with a CSV file as input you could use uniq -f <num>, which skips the first <num> fields for the uniqueness comparison.
Since uniq expects blanks as separators we need to change this and also reorder the columns to meet your requirements.
sed 's/,/\t/g' textfile.csv | awk '{ print $4,$5,$1,$2,$3,$6}' | \
sort -k3,6 | uniq -f 2 | \
awk 'BEGIN{OFS=",";} { print $3,$4,$5,$1,$2,$6}'
This way, only the values of $4 and $5 from the first line (after sorting) are printed for each key.
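If you do want to keep, for each key, the line with the largest values4 (as the question mentions in passing), here is a hedged awk sketch; it assumes whitespace-separated fields as in the awk answer above and that values4 alone decides which line wins, and it prints the results in arbitrary order:
awk '
{ key = $1 OFS $2 OFS $3 OFS $6 }
!(key in best) || $4+0 > max[key] { max[key] = $4+0; best[key] = $0 }  # remember the line with the biggest field 4 per key
END { for (k in best) print best[k] }
' file
The $4+0 forces a numeric comparison, so trailing commas in the sample data do not matter.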

Bash/Shell: How to remove duplicates from csv file by columns?

I have a CSV separated with ;. I need to remove lines where the content of the 2nd and 3rd columns is not unique, and deliver the result to standard output.
Example input:
irrelevant;data1;data2;irrelevant;irrelevant
irrelevant;data3;data4;irrelevant;irrelevant
irrelevant;data5;data6;irrelevant;irrelevant
irrelevant;data7;data8;irrelevant;irrelevant
irrelevant;data1;data2;irrelevant;irrelevant
irrelevant;data9;data0;irrelevant;irrelevant
irrelevant;data1;data2;irrelevant;irrelevant
irrelevant;data3;data4;irrelevant;irrelevant
Desired output
irrelevant;data5;data6;irrelevant;irrelevant
irrelevant;data7;data8;irrelevant;irrelevant
irrelevant;data9;data0;irrelevant;irrelevant
I have found solutions where only the first line of each group is printed to the output:
sort -u -t ";" -k2,1 file
but this is not enough.
I have tried to use uniq -u but I can't find a way to check only a few columns.
Using awk:
awk -F';' '!seen[$2,$3]++{data[$2,$3]=$0}
END{for (i in seen) if (seen[i]==1) print data[i]}' file
irrelevant;data5;data6;irrelevant;irrelevant
irrelevant;data7;data8;irrelevant;irrelevant
irrelevant;data9;data0;irrelevant;irrelevant
Explanation: if the $2,$3 combination doesn't yet exist in the seen array, the whole record is stored in the data array under the key $2,$3. Every time a $2,$3 entry is found, its counter is incremented. At the end, only the entries whose counter equals 1 are printed.
If order is important and if you can use perl then:
perl -F";" -lane '
$key = #F[1,2];
$uniq{$key}++ or push #rec, [$key, $_]
}{
print $_->[1] for grep { $uniq{$_->[0]} == 1 } #rec' file
irrelevant;data5;data6;irrelevant;irrelevant
irrelevant;data7;data8;irrelevant;irrelevant
irrelevant;data9;data0;irrelevant;irrelevant
We use column 2 and column 3 to create a composite key. We build an array of arrays by pushing the key and the line onto @rec for the first occurrence of each key.
In the END block, we check whether that occurrence was the only one. If so, we print the line.
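If you prefer to stay in awk and still preserve the input order, a two-pass sketch (the file is read twice):
awk -F';' 'NR==FNR { count[$2,$3]++; next } count[$2,$3]==1' file file
This produces the same three lines in their original order.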
awk '!a[$0]++' file_input > file_output
This worked for me. It compares whole lines.

Output the first duplicate in a csv file

How do I output only the first line of each set of duplicates in a CSV file?
For example, if I have:
00:0D:67:24:D7:25,1,-34,123,135
00:0D:67:24:D7:25,1,-84,567,654
00:0D:67:24:D7:26,1,-83,456,234
00:0D:67:24:D7:26,1,-86,123,124
00:0D:67:24:D7:2C,1,-56,245,134
00:0D:67:24:D7:2C,1,-83,442,123
00:18:E7:EB:BC:A9,5,-70,123,136
00:18:E7:EB:BC:A9,5,-90,986,545
00:22:A4:25:A8:F9,6,-81,124,234
00:22:A4:25:A8:F9,6,-90,456,654
64:0F:28:D9:6E:F9,1,-67,789,766
64:0F:28:D9:6E:F9,1,-85,765,123
74:9D:DC:CB:73:89,10,-70,253,777
I want my output to look like this:
00:0D:67:24:D7:25,1,-34,123,135
00:0D:67:24:D7:26,1,-83,456,234
00:0D:67:24:D7:2C,1,-56,245,134
00:18:E7:EB:BC:A9,5,-70,123,136
00:22:A4:25:A8:F9,6,-81,124,234
64:0F:28:D9:6E:F9,1,-67,789,766
74:9D:DC:CB:73:89,10,-70,253,777
I was thinking along the lines of first outputting the first line of the CSV file, something like awk (code that outputs first row) >> file.csv, then comparing the first field of that row to the first field of the next row; if they are the same, check the next row. When it reaches a new row, the code outputs that new, different row, again with awk (code that outputs) >> file.csv, and it repeats until the check is complete.
I'm kind of new to bash coding, but I love it so far. I'm currently parsing a CSV file and I need some help. Thanks, everyone.
Using awk:
awk -F, '!a[$1]++' file.csv
awk builds an array where the 1st column is the key and the value is a count of the number of times that key has been seen. !a[$1]++ is true only on the first occurrence of a given 1st column, and hence only the first line for each key gets printed.
If I understand what you're getting at you want something like this:
prev_field=""
while read line
do
current_field=$(echo $line | cut -d ',' -f 1)
[[ $current_field != $prev_field ]] && echo $line
prev_field=$current_field
done < "stuff.csv"
Where stuff.csv is the name of your file. That's assuming what you're trying to do is take the first field of each CSV row and print only its first occurrence; if that's the case, I think your expected output may be missing a few lines.
Using uniq:
sort lines.csv | uniq -w 17
Provided your first column is a fixed width (17 characters). lines.csv is a file with your original input.
perl -F, -lane '$x{$F[0]}++;print if($x{$F[0]}==1)' your_file
If you want to change the file in place:
perl -i -F, -lane '$x{$F[0]}++;print if($x{$F[0]}==1)' your_file
