deleting duplicate columns from csv file - shell

I've got perfmon outputting to a csv and I need to delete any repeated columns, e.g.
COL1, Col2, Col3, COL1, Col4, Col5
When columns repeat, it's almost always the same column, but it doesn't happen every time. What I've got so far are a couple of manual steps:
When the column count is greater than it should be, I output each column header on its own line:
head -n1 < output.csv|sed 's/,/\n/g'
Then, when I know which column numbers are guilty, I delete manually, e.g.:
cut -d"," --complement -f5,11 < output.csv > output2.csv
If somebody can point me in the right direction I'd be grateful!
Updated to give a rough example of output.csv contents; it should be familiar to anyone who's used perfmon:
"COLUMN1","Column2","Column3","COLUMN1","Column4"
"1","1","1","1","1"
"a","b","c","a","d"
"x","dd","ffd","x","ef"
I need to delete the repeated COLUMN1 (4th col)
Just to be clear, I'm trying to think of a way of automatically going into output.csv and deleting repeated columns, without having to tell it which columns to delete as in my manual method above. Thanks!

Try this awk (not really a one-liner). It handles more than one duplicated column, and it checks only the title (first row) to decide which columns are duplicated; your example works that way too.
awk script (one-liner version):
awk -F, 'NR==1{for(i=1;i<=NF;i++)if(!($i in v)){ v[$i];t[i]}}{s=""; for(i=1;i<=NF;i++)if(i in t)s=s sprintf("%s,",$i);if(s){sub(/,$/,"",s);print s}} ' file
clear version (same script):
awk -F, 'NR==1{
    # keep column i only if its header $i has not been seen before
    for(i=1;i<=NF;i++)
        if(!($i in v)){v[$i];t[i]}
}
{
    # rebuild each row from the kept columns only
    s=""
    for(i=1;i<=NF;i++)
        if(i in t)
            s=s sprintf("%s,",$i)
    if(s){
        sub(/,$/,"",s)
        print s
    }
}' file
Example (note that I created two duplicated columns):
kent$ cat file
COL1,COL2,COL3,COL1,COL4,COL2
1,2,3,1,4,2
a1,a2,a3,a1,a4,a2
b1,b2,b3,b1,b4,b2
d1,d2,d3,d1,d4,d2
kent$ awk -F, 'NR==1{
for(i=1;i<=NF;i++)
if(!($i in v)){v[$i];t[i]}
}
{s=""
for(i=1;i<=NF;i++)
if(i in t)
s=s sprintf("%s,",$i)
if(s){
sub(/,$/,"",s)
print s
}
} ' file
COL1,COL2,COL3,COL4
1,2,3,4
a1,a2,a3,a4
b1,b2,b3,b4
d1,d2,d3,d4

Related

How to select the minimum value, including exponential values, for each ID based on the fourth column?

Can you please tell me how to select the rows with the minimum value (including values in exponential notation) in the fourth column, grouped by the first column, in Linux?
Original file
ID,y,z,p-value
1,a,b,0.22
1,a,b,5e-10
1,a,b,1.2e-10
2,c,d,0.06
2,c,d,0.003
2,c,d,3e-7
3,e,f,0.002
3,e,f,2e-8
3,e,f,1.0
The file I want is as below.
ID,y,z,p-value
1,a,b,1.2e-10
2,c,d,3e-7
3,e,f,2e-8
Actually this worked fine, so thanks to everybody!
tail -n +2 original_file > txt
sort -t, -k 4g txt | awk -F, '!visited[$1]++' | sort -k2,2 -k3,3 >> final_file
You can do it fairly easily in awk just by keeping the current record with the minimum 4th field for a given 1st field. You have to handle outputting the header row and storing the first data record to begin the comparison, which you can do by operating on the first record, FNR==1 (or NR==1 if only one file is processed), and on the second, FNR==2.
You store the first minimum in an array indexed by the first field and save the initial record while handling the 2nd record. Then it is just a matter of checking whether the first field differs from the one seen before; if so, output the minimum record for the previous group and keep going until you run out of records (note: this presumes the first fields appear in increasing order, as they do in your file). Finally, the END rule outputs the record for the last group.
You can put that together as follows:
awk -F, '
FNR==1 {print; next}
FNR==2 {rec=$0; m[$1]=$4; next}
{
    if ($1 in m) {
        if ($4 < m[$1]) {
            rec=$0
            m[$1]=$4
        }
    }
    else {
        print rec
        rec=$0
        m[$1]=$4
    }
}
END {
    print rec
}' file
(where your data is in the file file)
If your first field is not in increasing order, then you will need to save the current minimum record in an array as well (e.g. turn rec into an array indexed by the first field, holding the whole record as its value). You would then delay looping over both arrays until the END rule to output the minimum record for each first field; a rough sketch follows the example below.
Example Use/Output
You can update the filename to match the filename containing your data, and then to test, all you need to do is select-copy the awk expression and middle-mouse paste it into an xterm in the directory containing your file, e.g.
$ awk -F, '
> FNR==1 {print; next}
> FNR==2 {rec=$0; m[$1]=$4; next}
> {
> if ($1 in m) {
> if ($4 < m[$1]) {
> rec=$0
> m[$1]=$4
> }
> }
> else {
> print rec
> rec=$0
> m[$1]=$4
> }
> }
> END {
> print rec
> }' file
ID,y,z,p-value
1,a,b,1.2e-10
2,c,d,3e-7
3,e,f,2e-8
Look things over and let me know if you have questions.
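Here is that rough sketch of the variant for first fields that are not in increasing order (my own illustration, not part of the answer above): keep the per-ID minimum and the record that holds it in arrays, and print everything from the END rule. Note the for-in loop does not guarantee output order, so pipe through sort if you need the groups ordered.
awk -F, '
FNR==1 { print; next }                  # pass the header through
!($1 in m) || $4+0 < m[$1]+0 {          # first record for this ID, or a smaller p-value
    m[$1]   = $4                        # current minimum for this ID
    rec[$1] = $0                        # record holding that minimum
}
END {
    for (k in rec) print rec[k]         # order not guaranteed
}' file
On the sample data this prints the same four lines as above, apart from possible reordering of the ID groups.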
A non-awk approach, using GNU datamash:
$ datamash -H -f -t, -g1 min 4 < input.txt | cut -d, -f1-4
ID,y,z,p-value
1,a,b,1.2e-10
2,c,d,3e-7
3,e,f,2e-8
(The cut is needed because with the -f option datamash adds a fifth column that's a duplicate of the 4th; without -f it would just show the first and fourth column values. Minor annoyance.)
This does require that your data is sorted on the first column like in your sample.
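For reference, a quick sketch of the same call without -f (and therefore without the cut); datamash then prints only the grouped column and the minimum, under header names it generates itself:
# Without --full (-f), only the group key and the min are printed.
datamash -H -t, -g1 min 4 < input.txt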

Get all the duplicate records in a CSV if a column is different

I have a CSV file with column-wise data, like
EvtsUpdated,IR23488670,15920221,ESTIMATED
EvtsUpdated,IR23488676,11014018,ESTIMATED
EvtsUpdated,IR23488700,7273867,ESTIMATED
EvtsUpdated,IR23486360,7273881,ESTIMATED
EvtsUpdated,IR23488670,7273807,ESTIMATED
EvtsUpdated,IR23488670,9738420,ESTIMATED
EvtsUpdated,IR23488670,7273845,ESTIMATED
EvtsUpdated,IR23488676,12149463,ESTIMATED
and I just want to find all the duplicate rows, ignoring one column (column 3). The output should be like
EvtsUpdated,IR23488670,15920221,ESTIMATED
EvtsUpdated,IR23488676,11014018,ESTIMATED
EvtsUpdated,IR23488700,7273867,ESTIMATED
EvtsUpdated,IR23488670,7273807,ESTIMATED
EvtsUpdated,IR23488670,9738420,ESTIMATED
EvtsUpdated,IR23488670,7273845,ESTIMATED
EvtsUpdated,IR23488676,12149463,ESTIMATED
I tried it by first writing all columns except column 3 to another file using
cut --complement -f 3 -d, filename
and then I tried using an awk command like awk -F, '{if(FNR==NR){print}}' secondfile.
As I don't have complete knowledge of awk, I'm not able to do it.
You can use awk arrays to store the count of each group of columns to identify duplicates.
awk -F "," '{row[$1$2$4]++ ; rec[$0","NR] = $1$2$4 }
END{ for ( key in rec ) { if (row[rec[key]] > 1) { print key } } }' filename | sort -t',' -k5 | cut -f1-4 -d','
An additional sort was required to maintain the original ordering expected in your output.
Note: in the output you show, the row with IR23488700 is treated as a duplicate even though it is not.
I did the same by first cutting out the 3rd column (which may differ) and then running the awk '++A[$0]==2' command on the result. Thanks for your help!
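For reference, a minimal sketch of that two-step approach (reduced.csv is just an illustrative name for the intermediate file; ++A[$0]==2 prints a reduced line the second time it is seen):
cut --complement -f3 -d, filename > reduced.csv   # drop the volatile 3rd column
awk '++A[$0]==2' reduced.csv                      # print each line on its second occurrence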

extract columns from multiple .csv files and merge them into one

I have three files from which I want to extract some columns and paste them in a new file. The files don't necessarily have the same number of lines. They are sorted on the values in their first column.
File 1 has the following structure:
col1;col2;col3;col4
SAMPLE-1;1;1;1
SAMPLE-2;1;1;1
SAMPLE-3;1;1;1
SAMPLE-4;1;1;1
This file is separated by ";" instead of ",".
File 2 has the following structure:
col5,col6,col7,col8
SAMPLE-1_OTHER_INFO,2,2,2
SAMPLE-2_OTHER_INFO,2,2,2
SAMPLE-3_OTHER_INFO,2,2,2
File 3 has the following structure:
col9,col10,col11,col12
SAMPLE-1_OTHER_INFO,3,3,3
SAMPLE-2_OTHER_INFO,3,3,3
SAMPLE-3_OTHER_INFO,3,3,3
The output file (summary.csv) should look like this:
col1,col2,col4,col6,col7,col10,col12
SAMPLE-1,1,1,2,2,3,3
SAMPLE-2,1,1,2,2,3,3
SAMPLE-3,1,1,2,2,3,3
SAMPLE-4,1,1,,,,
Basically the first columns of all three files contain the sample identifier. 'col1' of file1 should be the first column of the output file. The identifiers in col1 should then be matched with those in col5 and col9 of file2 and file3. The '_OTHER_INFO' part should not be taken into account when doing the comparison.
If there is a match, the col6, col7, col10 and col12 values of file 2 and file 3 should be added.
If there is no match, the line should still be in the output file, but the last four columns should be empty (like in this case 'SAMPLE-4')
I was planning to perform this action with awk or the 'cut/paste' command. However I don't know how I should look for a match between the values in col1, col5 and col9.
Try the following and let me know if this helps you.
awk 'BEGIN{
  FS=";"                          # file1 is ;-separated
}
FNR==1{
  f++                             # count which file we are reading
}
f==1 && FNR>1{
  a[$1]=$2","$4                   # file1: keep col2 and col4, keyed by sample
  next
}
f>1 && FNR==1 {
  FS=","                          # file2 and file3 are ,-separated
}
f==2 && FNR>1{
  sub(/_.*/,"",$1)                # strip _OTHER_INFO from the key
  b[$1]=$2","$3                   # file2: keep col6 and col7
  next
}
f==3 && FNR>1{
  sub(/_.*/,"",$1)
  c[$1]=$2","$4                   # file3: keep col10 and col12
  next
}
END{
  print "col1,col2,col4,col6,col7,col10,col12"
  for(i in a){
    printf("%s,%s,%s,%s\n",i,a[i],b[i]?b[i]:",",c[i]?c[i]:",")
  }
}
' file1 file2 file3
Will try to add an explanation too in some time.
EDIT1: adding a one-liner form of solution too.
awk 'BEGIN{FS=";"}FNR==1{f++} f==1 && FNR>1{;a[$1]=$2","$4;next} f>1 && FNR==1{FS=","} f==2&&FNR>1{sub(/_.*/,"",$1);b[$1]=$2","$3;next} f==3&&FNR>1{sub(/_.*/,"",$1);c[$1]=$2","$4;next} END{print "col1,col2,col4,col6,col7,col10,col12";for(i in a){printf("%s,%s,%s,%s\n",i,a[i],b[i]?b[i]:",",c[i]?c[i]:",")}}' file1 file2 file3
join + sed trick (for sorted input files):
join -t, -j1 -a1 -o1.1,1.2,1.4,2.2,2.3 <(tr ';' ',' < file1) <(sed 's/_[^,]*//g' file2) \
  | join -t, - -a1 -o1.1,1.2,1.3,1.4,1.5,2.2,2.4 <(sed 's/_[^,]*//g' file3)
The output:
SAMPLE-1,1,1,2,2,3,3
SAMPLE-2,1,1,2,2,3,3
SAMPLE-3,1,1,2,2,3,3
SAMPLE-4,1,1,,,,

CSV: join rows that have the same ID

I have a CSV file like this
1,A,abc
2,A,def
1,B,smthing
1,A,ghk
5,C,smthing
Now I want to join all the rows that have the same value in column 2. In this case that means the rows whose second field is A. The resulting file should be
1,A,abcdef,ghk
3,B,smthing
5,C,smthing
I'm trying with awk, and I can get the second and the third fields, but not the whole file, like this:
awk -F, '{a[$2]=a[$2]?a[$2]$3:$3;}END{for (i in a)print i","a[i];}' old_file.csv > new_file.csv
Update
I solved my problem with 2 commands. First, create new_file.csv (command above).
The second command joins old_file with new_file:
awk -F, 'NR == FNR {a[$1] = $2;} NR != FNR && a[$2] {print $1","$2","a[$2];}' new_file.csv old_file.csv > last_file.csv
The last_file.csv looks like this
1,A,abcdefghk
2,A,abcdefghk
1,B,smthing
1,A,abcdefghk
5,C,smthing
So, how should I make a better command from those 2 commands?
Thank you!
One awk is enough:
awk 'NR==FNR{a[$2]=a[$2]==""?$3:a[$2] $3;next}{$3=a[$2]}1' FS=, OFS=, file file
1,A,abcdefghk
2,A,abcdefghk
1,B,smthing
1,A,abcdefghk
5,C,smthing
Explanation
NR==FNR{a[$2]=a[$2]==""?$3:a[$2] $3;next} merges the records into array a (the key is column 2).
$3=a[$2] reads the input file again and replaces column 3 with the merged value.
To also remove the duplicate records (by column 2), keeping the first one:
awk 'NR==FNR{a[$2]=a[$2]==""?$3:a[$2] $3;next}!b[$2]++{$3=a[$2];print}' FS=, OFS=, file file
1,A,abcdefghk
1,B,smthing
5,C,smthing

How to remove several columns and the field separators at once in AWK?

I have a big file with several thousand columns. I want to delete some specific columns and the field separators at once with AWK in Bash.
I can delete one column at a time with this oneliner (column 3 will be deleted and its corresponding field separator):
awk -vkf=3 -vFS="\t" -vOFS="\t" '{for(i=kf; i<NF;i++){ $i=$(i+1);}; NF--; print}' < Big_File
However, I want to delete several columns at once... Can someone help me figure this out?
You can pass the list of columns to be deleted from the shell to awk like this:
awk -vkf="3,5,11" ...
then, in the awk program, parse it into an array:
split(kf,kf_array,",")
and then go through all the columns, test whether each particular column is in kf_array, and skip it if so.
Another possibility is to call your one-liner several times :-)
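A rough sketch of that second suggestion, using the column numbers 3, 5 and 11 from the example above (Smaller_File is just an illustrative output name). Deleting from the highest column number down keeps the indices used by the later passes from shifting:
# one pass of the single-column one-liner per column to delete
del1col='{for(i=kf;i<NF;i++) $i=$(i+1); NF--; print}'
awk -v kf=11 -v FS='\t' -v OFS='\t' "$del1col" < Big_File \
  | awk -v kf=5 -v FS='\t' -v OFS='\t' "$del1col" \
  | awk -v kf=3 -v FS='\t' -v OFS='\t' "$del1col" > Smaller_File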
Here is an implementation of Kamil's idea:
awk -v remove="3,8,5" '
BEGIN {
    OFS=FS="\t"
    split(remove,a,",")
    for (i in a) b[a[i]]=1
}
{
    j=1
    for (i=1;i<=NF;++i) {
        if (!(i in b)) {
            $j=$i
            ++j
        }
    }
    NF=j-1
    print
}
'
If you can use cut instead of awk, this one is easier with cut:
e.g. this obtains columns 1,3, and from 50 on from file:
cut -f1,3,50- file
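If it's easier to name the columns to drop rather than the ones to keep, GNU cut's --complement (the same flag used in the first question above) works too; e.g. to drop columns 3, 5 and 8:
cut --complement -f3,5,8 file    # GNU cut; tab is the default delimiter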
Something like this should work:
awk -F'\t' -v remove='3|8|5' '
{
    rec = ofs = ""
    for (i=1;i<=NF;i++) {
        if (i !~ "^(" remove ")$" ) {
            rec = rec ofs $i
            ofs = FS
        }
    }
    print rec
}
' file
