Bash/Shell: How to remove duplicates from csv file by columns?

Bash/Shell: How to remove duplicates from csv file by columns? - bash

I have a csv separated with ;. I need to remove lines where content of 2nd and 3rd column is not unique, and deliver the material to the standard output.
Example input:
irrelevant;data1;data2;irrelevant;irrelevant
irrelevant;data3;data4;irrelevant;irrelevant
irrelevant;data5;data6;irrelevant;irrelevant
irrelevant;data7;data8;irrelevant;irrelevant
irrelevant;data1;data2;irrelevant;irrelevant
irrelevant;data9;data0;irrelevant;irrelevant
irrelevant;data1;data2;irrelevant;irrelevant
irrelevant;data3;data4;irrelevant;irrelevant
Desired output
irrelevant;data5;data6;irrelevant;irrelevant
irrelevant;data7;data8;irrelevant;irrelevant
irrelevant;data9;data0;irrelevant;irrelevant
I have found solutions where only first line is printed to the output:
sort -u -t ";" -k2,1 file
but this is not enough.
I have tried to use uniq -u but I can't find a way to check only a few columns.

Using awk:
awk -F';' '!seen[$2,$3]++{data[$2,$3]=$0}
END{for (i in seen) if (seen[i]==1) print data[i]}' file
irrelevant;data5;data6;irrelevant;irrelevant
irrelevant;data7;data8;irrelevant;irrelevant
irrelevant;data9;data0;irrelevant;irrelevant
Explanation: If $2,$3 combination doesn't exist in seen array then a new entry with key of $2,$3 is stored in data array with whole record. Every time $2,$3 entry is found a counter for $2,$3 is incremented. Then in the end those entries with counter==1 are printed.

If order is important and if you can use perl then:
perl -F";" -lane '
$key = #F[1,2];
$uniq{$key}++ or push #rec, [$key, $_]
}{
print $_->[1] for grep { $uniq{$_->[0]} == 1 } #rec' file
irrelevant;data5;data6;irrelevant;irrelevant
irrelevant;data7;data8;irrelevant;irrelevant
irrelevant;data9;data0;irrelevant;irrelevant
We use column2 and column3 to create composite key. We create array of array by pushing the key and the line to array rec for the first occurrence of the line.
In the END block, we check if that occurrence is the only occurrence. If so, we go ahead and print the line.

awk '!a[$0]++' file_input > file_output
This worked for me. It compares whole lines.

Related

Formatting output using awk

I've a file with following content:
A 28713.64 27736.1000
B 9835.32
C 38548.96
Now, i need to check if the last row in the first column is 'C', then the value of first row in third column should be printed in the third column against 'C'.
Expected Output:
A 28713.64 27736.1000
B 9835.32
C 38548.96 27736.1000
I tried below, but it's not working:
awk '{if ($1 == "C") ; print $1,$2,$3}' file_name
Any help is most welcome!!!

This works for the given example:
awk 'NR==1{v=$3}$1=="C"{$0=$0 FS v}7' file|column -t
If you want to append the 3rd column value from A row to C row, change NR==1 into $1=="A"
The column -t part is just for making output pretty. :-)

EDIT: As per OP's comment OP is looking for very first line and looking to match C string at very last line of Input_file, if this is the case then one should try following.
awk '
FNR==1{
value=$NF
print
next
}
prev{
print prev
}
{
prev=$0
prev_first=$1
}
END{
if(prev_first=="C"){
print prev,value
}
else{
print
}
}' file | column -t
Assuming that your actual Input_file is same as shown samples and you want to pick value from 1st column whose value is A.
awk '$1=="A" && FNR==1{value=$NF} $1=="C"{print $0,value;next} 1' Input_file| column -t
Output will be as follows.
A 28713.64 27736.1000
B 9835.32
C 38548.96 27736.1000

POSIX dictates that "assigning to a nonexistent field (for example, $(NF+2)=5) shall increase the value of NF; create any intervening fields with the uninitialized value; and cause the value of $0 to be recomputed, with the fields being separated by the value of OFS."
So...
awk 'NR==1{x=$3} $1=="C"{$3=x} 1' input.txt
Note that the output is not formatted well, but that's likely the case with most of the solutions here. You could pipe the output through column, as Ravinder suggested. Or you could control things precisely by printing your data with printf.
awk 'NR==1{x=$3} $1=="C"{$3=x} {printf "%-2s%-26s%s\n",$1,$2,$3}' input.txt
If your lines can be expressed in a printf format, you'll be able to avoid the unpredictability of column -t and save the overhead of a pipe.

sed squeeze multiple occurrence of word

I have text file with lines like below:
this is the code ;rfc1234;rfc1234
this is the code ;rfc1234;rfc1234;rfc1234;rfc1234
How can I squeeze the the repeating words in file to single word like below:
this is the code ;rfc1234
this is the code ;rfc1234
I tried 'tr' command but it's limited to squeezing characters only

with sed for arbitrary repeated strings prefixed with ;
$ sed -E 's/(;[^;]+)(\1)+/\1/g' file
or, if you want to delete everything after the first token without checking whether they match the preceding one or not
$ sed -E 's/(\S);.*/\1/' file
Explanation
(;[^;]+) is to capture a string starting with semicolon
(\1)+ followed by the same captured string one or more times
/\1/g replace the whole chain with one instance, and repeat

Following awk may help here. It will look for all items in last column of you Input_file and will keep only unique values in it.
awk '{num=split($NF,array,";");for(i=1;i<=num;i++){if(!array1[array[i]]++){val=val?val ";" array[i]:array[i]}};NF--;print $0";"val;val="";delete array;delete array1}' Input_file
Adding a non-one liner form of solution too now.
awk '
{
num=split($NF,array,";");
for(i=1;i<=num;i++){
if(!array1[array[i]]++){
val=val?val ";" array[i]:array[i]}
};
NF--;
print $0";"val;
val="";
delete array;
delete array1
}' Input_file
Explanation:
awk '
{
num=split($NF,array,";"); ##Creating a variable named num whose value is length of array named array, which is created on last field of line with ; as a delimiter.
for(i=1;i<=num;i++){ ##Starting a for loop from i=1 to till value of num each time increment i as 1.
if(!array1[array[i]]++){ ##Chrcking here a condition if array named array1 index is value of array[i] is NOT coming more than 1 value then do following.
val=val?val ";" array[i]:array[i]}##Creating a variable named val here whose value is array[i] value and keep concatenating its own value of it.
};
NF--; ##Reducing the value of NF(number of fields) in current line to remove the last field from it.
print $0";"val; ##Printing the current line(without last field) ; and then value of val here.
val=""; ##Nullifying variable val here.
delete array; ##Deleting array named array here.
delete array1 ##Deleting array named array1 here.
}' Input_file ##Mentioning Input_file name here.

I started playing around with s/(.+)\1/\1/g. It seemed to work with perl (even found the is_is_) but didn't quite take me there:
$ perl -pe 's/(.+)\1+/\1/g' file
this the code ;rfc1234
this the code ;rfc1234;rfc1234

sed 's/\(;[^;]*\).*/\1/' file

You can use the below command to achieve this:-
echo "this is the code ;rfc1234;rfc1234" | sed 's/rfc1234//2g'
echo "this is the code ;rfc1234;rfc1234;rfc1234;rfc1234" | sed 's/rfc1234//2g'
or
sed 's/rfc1234//2g' yourfile.txt

This might work for you (GNU sed):
sed -r ':a;s/(\S+)\1+/\1/g;ta' file
The regex is repeated until only the first pattern remains.

print 3 consecutive column after specific string from CSV

I need to print 2 columns after specific string (in my case it is 64). There can be multiple instances of 64 within same CSV row, however next instance will not occur within 3 columns of previous occurrence. Output of each instance should be in next line and unique. The problem is, the specific string does not fall in same column for all rows. All row is having kind of dynamic data and there is no header for CSV. Let say, below is input file (its just a sample, actual file is having approx 300 columns & 5 Million raws):
00:TEST,123453103279586,ABC,XYZ,123,456,65,906,06149,NIL TS21,1,64,906,06149,NIL TS22,1,64,916,06149,NIL BS20,1,64,926,06149,NIL BS30,1,64,906,06149,NIL CAML,1,ORIG,0,TERM,1,1,1,6422222222
00:TEST,123458131344169,ABC,XYZ,123,456,OCCF,1,1,1,64,857,19066,NIL TS21,1,64,857,19066,NIL TS22,1,64,857,19066,NIL BS20,1,64,857,19067,NIL BS30,1,64,857,19068,NIL PSS,1,E2 EPSDATA,GRANTED,NONE,1,N,N,256000,5
00:TEST,123458131016844,ABC,XYZ,123,456,HOLD,,1,64,938,36843,NIL TS21,1,64,938,36841,NIL TS22,1,64,938,36823,NIL BS20,1,64,938,36843,NIL BS30,1,64,938,36843,NIL CAML,1,ORIG,0,TERM,00,50000,N,N,N,N
00:TEST,123453102914690,ABC,XYZ,123,456,HOLD,,1,PBS,TS11,64,938,64126,NIL TS21,1,64,938,64126,NIL TS22,1,64,938,64126,NIL BS20,1,64,938,64226,NIL BS30,1,64,938,64326,NIL CAML,1,ORIG,0,TERM,1,1,1,6422222222,2222,R
Output required(only unique entries):
64,906,06149
64,857,19066
64,857,19067
64,857,19068
64,938,36843
64,938,36841
64,938,36823
64,938,36843
64,938,36843
64,938,64326
There is no performance related concerns. I have tried to search many threads but could not get anything near related. Please help.

We can use a pipe of two commands... first to put the 64's leading on a line and a second to print first three columns if we see a leading 64.
sed 's/,64[,\n]/\n64,/g' | awk -F, '/^64/ { print $1 FS $2 FS $3 }'
There are ways of doing this with a single awk command, but this felt quick and easy to me.
Though the sample data from the question contains redundant lines, karakfa (see below) reminds me that the question speaks of a "unique data" requirement. This version uses the keys of an associative array to keep track of duplicate records.
sed 's/,64[,\n]/\n64,/g' | awk -F, 'BEGIN { split("",a) } /^64/ && !((x=$1 FS $2 FS $3) in a) { a[x]=1; print x }'

gawk:
awk -F, '{for(i=0;++i<=NF;){if($i=="64")a=4;if(--a>0)s=s?s","$i:$i;if(a==1){print s;s=""}}}' file

Sed for fun
sed -n -e 's/$/,n,n,n/' -e ':a' -e 'G;s/[[:blank:],]\(64,.*\)\(\n\)$/\2\1/;s/.*\(\n\)\(64\([[:blank:],][^[:blank:],]\{1,\}\)\{2\}\)\([[:blank:],][^[:blank:],]\{1,\}\)\{3\}\([[:blank:],].*\)\{0,1\}$/\1\2\1\5/;s/^.*\n\(.*\n\)/\1/;/^64.*\n/P;s///;ta' YourFile | sort -u
assuming column are separated by blank space or comma
need a sort -u for uniq (possible in sed but a new "simple" action of the same kind to add in this case)

awk to the rescue!
$ awk -F, '{for(i=1;i<=NF;i++)
if($i==64)
{k=$i FS $(++i) FS $(++i);
if (!a[k]++)
print k
}
}' file
64,906,06149
64,916,06149
64,926,06149
64,857,19066
64,857,19067
64,857,19068
64,938,36843
64,938,36841
64,938,36823
64,938,64126
64,938,64226
64,938,64326
ps. your sample output doesn't match the given input.

Extracting field from last row of given table using sed

I would like to write a bash script to extract a field in the last row of a table. I will illustrate by example. I have a text file containing tables with space delimited fields like ...
Table 1 (foobar)
num flag name comments
1 ON Frank this guy is frank
2 OFF Sarah she is tall
3 ON Ahmed who knows him
Table 2 (foobar)
num flag name comments
1 ON Mike he is short
2 OFF Ahmed his name is listed twice
I want to extract the first field in the last row of Table1, which is 3. Ideally I would like to be able to use any given table's title to do this. There are guaranteed carriage returns between each table. What would be the best way to accomplish this, preferably using sed and grep?

Awk is perfect for this, print the first field in the last row for each record:
$ awk '!$1{print a}{a=$1}END{print a}' file
3
2
Just from the first record:
$ awk '!$1{print a;exit}{a=$1}' file
3
Edit:
For a given table title:
$ awk -v t="Table 1" '$0~t{f=1}!$1&&f{print a;f=0}{a=$1}END{if (f) print a}' file
3
$ awk -v t="Table 2" '$0~t{f=1}!$1&&f{print a;f=0}{a=$1}END{if (f) print a}' file
2

This sed line seems to work for your sample.
table='Table 2'
sed -n "/$table"'/{n;n;:next;h;n;/^$/b last;$b last;b next;:last;g;s/^\s*\(\S*\).*/\1/p;}' file
Explanation: When we find a line matching the table name in $table, we skip that line, and the next (the field labels). Starting at :next we push the current line into the hold space, get the next line and see if it is blank or the end of the file, if not we go back to :next, push the current line into hold and get another. If it is blank or EOF, we skip to :last, pull the hold space (the last line of the table) into pattern space, chop out all but the first field and print it.

Just read each block as a record with each line as a field and then print the first sub-field of the last field of whichever record you care about:
$ awk -v RS= -F'\n' '/^Table 1/{split($NF,a," "); print a[1]}' file
3
$ awk -v RS= -F'\n' '/^Table 2/{split($NF,a," "); print a[1]}' file
2

Better tool to that is awk!
Here is a kind legible code:
awk '{
if(NR==1) {
row=$0;
next;
}
if($0=="") {
$0=row;
print $1;
} else {
row=$0;
}
} END {
if(row!="") {
$0=row;
print $1;
}
}' input.txt

Output the first duplicate in a csv file

How do i output the first duplicate of a csv file?
for example if i have:
00:0D:67:24:D7:25,1,-34,123,135
00:0D:67:24:D7:25,1,-84,567,654
00:0D:67:24:D7:26,1,-83,456,234
00:0D:67:24:D7:26,1,-86,123,124
00:0D:67:24:D7:2C,1,-56,245,134
00:0D:67:24:D7:2C,1,-83,442,123
00:18:E7:EB:BC:A9,5,-70,123,136
00:18:E7:EB:BC:A9,5,-90,986,545
00:22:A4:25:A8:F9,6,-81,124,234
00:22:A4:25:A8:F9,6,-90,456,654
64:0F:28:D9:6E:F9,1,-67,789,766
64:0F:28:D9:6E:F9,1,-85,765,123
74:9D:DC:CB:73:89,10,-70,253,777
i want my output to look like this:
00:0D:67:24:D7:25,1,-34,123,135
00:0D:67:24:D7:26,1,-83,456,234
00:0D:67:24:D7:2C,1,-56,245,134
00:18:E7:EB:BC:A9,5,-70,123,136
00:22:A4:25:A8:F9,6,-81,124,234
64:0F:28:D9:6E:F9,1,-67,789,766
74:9D:DC:CB:73:89,10,-70,253,777
i was thinking along the lines of first outputting the first line of the csv file so like awk (code that outputs first row) >> file.csv then compare the first field of the row to the first field of the next row, if they are the same, check the next row. Until it comes to a new row, the code will output the new different row so again awk (code that outputs) >> file.csv and it will repeat until the check is complete
im kinda of new to bash coding, but i love it so far, im currently phrasing a csv file and i need some help. Thanks everyone

Using awk:
awk -F, '!a[$1]++' file.csv
awk forms an array where the 1st column is the key and the value is the count of no. of times the particular key is present. '!a[$1]++' will be true only when the 1st occurence of the 1st column, and hence the first occurrence of the line gets printed.

If I understand what you're getting at you want something like this:
prev_field=""
while read line
do
current_field=$(echo $line | cut -d ',' -f 1)
[[ $current_field != $prev_field ]] && echo $line
prev_field=$current_field
done < "stuff.csv"
Where stuff.csv is the name of your file. That's assuming that what you're trying to do is take the first field in the csv row and only print the first unique occurrence of it, which if that's the case I think your output may be missing a few.

Using uniq:
sort lines.csv | uniq -w 17
Provided your first column is fixed size (17). lines.csv is a file with your original input.

perl -F, -lane '$x{$F[0]}++;print if($x{$F[0]}==1)' your_file
if you want to change the file inplace:
perl -i -F, -lane '$x{$F[0]}++;print if($x{$F[0]}==1)' your_file

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Bash/Shell: How to remove duplicates from csv file by columns? - bash

awk '!a[$0]++' file_input > file_output This worked for me. It compares whole lines.

Related

Formatting output using awk

sed squeeze multiple occurrence of word

print 3 consecutive column after specific string from CSV

Extracting field from last row of given table using sed

Output the first duplicate in a csv file

Categories

Resources