CSV join some rows that have the same id - ruby

I have a CSV file like this:
1,A,abc
2,A,def
1,B,smthing
1,A,ghk
5,C,smthing
Now I want to join all the rows that have the same value in column 2. In this case those are the rows whose second field is A. The resulting file should be:
1,A,abcdefghk
1,B,smthing
5,C,smthing
I'm trying with awk; I can get the second and third fields, but not the whole line, like this:
awk -F, '{a[$2]=a[$2]?a[$2]$3:$3;}END{for (i in a)print i","a[i];}' old_file.csv > new_file.csv
Update
I solved my problem with two commands. The first creates new_file.csv (the command above).
The second command joins old_file with new_file:
awk -F, 'NR == FNR {a[$1] = $2;} NR != FNR && a[$2] {print $1","$2","a[$2];}' new_file.csv old_file.csv > last_file.csv
The last_file.csv looks like this:
1,A,abcdefghk
2,A,abcdefghk
1,B,smthing
1,A,abcdefghk
5,C,smthing
So, how can I combine those two commands into a single, better one?
Thank you!

One awk is enough:
awk 'NR==FNR{a[$2]=a[$2]==""?$3:a[$2] $3;next}{$3=a[$2]}1' FS=, OFS=, file file
1,A,abcdefghk
2,A,abcdefghk
1,B,smthing
1,A,abcdefghk
5,C,smthing
Explanation
NR==FNR{a[$2]=a[$2]==""?$3:a[$2] $3;next} merges the records into array a on the first pass (the key is column 2).
$3=a[$2] reads the input file again and replaces column 3 with the merged value.
Add a condition to remove the duplicate records (by column 2), keeping the first one:
awk 'NR==FNR{a[$2]=a[$2]==""?$3:a[$2] $3;next}!b[$2]++{$3=a[$2];print}' FS=, OFS=, file file
1,A,abcdefghk
1,B,smthing
5,C,smthing
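If reading the file twice is a problem (for example, the input comes from a pipe), here is a single-pass sketch that buffers everything in memory and prints in first-seen order (the first, order and val array names are just illustrative):
awk 'BEGIN{FS=OFS=","}
!($2 in first) {first[$2] = $1; order[++n] = $2}   # remember the first id and the key order
{val[$2] = val[$2] $3}                             # concatenate column 3 per key
END {for (i = 1; i <= n; i++) {k = order[i]; print first[k], k, val[k]}}' old_file.csv
1,A,abcdefghk
1,B,smthing
5,C,smthing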

Related

How to use awk or sed to split a file into 2 parts based on a list of value?

I have two files. One of them is the main file, which contains a lot of columns. The other contains information about a list of samples. Now I would like to split the main file into two parts based on whether the sample (row) is in the list in the second file or not. Currently I use code like this
awk 'NR == FNR {a[$1]; next} !($1 in a)' $i.list $i > no_in_list.$i
to exclude the samples which are in the list, but I was wondering if it is possible to also keep the samples which are in the list.
You could use the print >> "file" action to print to a specified file instead of the standard output (tested with GNU awk):
awk -v nil="no_in_list.$i" -v il="in_list.$i" '
NR == FNR {a[$1]; next}
!($1 in a) {print >> nil}
($1 in a) {print >> il}' $i.list $i
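Side note: inside awk, unlike the shell, print > file truncates the file only on the first write and appends for the rest of the run, so a plain > works here too. A sketch of the same split with a single print and a computed file name:
awk -v nil="no_in_list.$i" -v il="in_list.$i" '
NR == FNR {a[$1]; next}
{print > (($1 in a) ? il : nil)}' $i.list $i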

How to merge set of columns based on a common field

How can I do the following in bash?
a) Remove columns 1, 2 and 6.
b) Rows are identified by the field 'packetId'; there can be one or two rows with the same 'packetId'. If there are two rows with the same packetId, append the last field of the second row to the first row.
c) If there is only one row for a 'packetId', ignore that row and do not print it.
Input
SequenceId,TimeStamp,packetId,size,secondaryid,eventType,randomfield,Source,Destination,SystemTime
1,3:41:24,1,100,xyz,event1,abc,S1,D1,1586989874
2,3:41:25,1,100,xyz,event2,abc,S1,D1,1586989877
3,3:41:26,2,100,xyz,event1,abc,S1,D1,1586989879
4,3:41:26,3,100,xyz,event1,abc,S1,D1,1586989871
5,3:41:26,3,100,xyz,event2,abc,S1,D1,1586989879
Output
packetId,size,secondaryid,randomfield,Source,Destination,SystemTime,OtherSystemTime
1,100,xyz,abc,S1,D1,1586989874,1586989877
3,100,xyz,abc,S1,D1,1586989871,1586989879
You can do it all in awk, but it's simpler to use cut to first remove the fields you don't care about:
$ cut -d, -f3-5,7- input.csv |
awk -F, 'NR == 1 { print $0 ",OtherSystemTime"; next }
{ if ($1 in seen) print seen[$1] "," $NF; else seen[$1] = $0 }'
packetId,size,secondaryid,randomfield,Source,Destination,SystemTime,OtherSystemTime
1,100,xyz,abc,S1,D1,1586989874,1586989877
3,100,xyz,abc,S1,D1,1586989871,1586989879
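For reference, a sketch of the all-awk variant mentioned above; it builds the reduced record itself instead of piping through cut (the rec and seen names are just illustrative):
awk 'BEGIN{FS=OFS=","}
{rec = $3 OFS $4 OFS $5 OFS $7 OFS $8 OFS $9 OFS $10}   # drop fields 1, 2 and 6
NR == 1 {print rec, "OtherSystemTime"; next}            # extend the header
$3 in seen {print seen[$3], $10; next}                  # 2nd row for this packetId
{seen[$3] = rec}' input.csv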

Script for changing CSV into Key Value (KV) format

I have a CSV file with data as below:
row_identifier,DBNAME,tblsps_name,Cur_size,Max_size,Used,Free,Percentage
tablespace,MRETF,RERETOSB15_DATA,51200,45600,14284,31316,31
tablespace,MRETF,SPOTLIGHT_DATA,500,2000,259,1741,13
tablespace,MRETF,DDLAUDITING,25,25,2,23,8
I want the output in the following format:
tablespace,MRETF,tblsps_name:RERETOSB15_DATA,Cur_size:51200,Max_size:45600,Used:14284,Free:31316,Percentage:31
tablespace,MRETF,tblsps_name:SPOTLIGHT_DATA,Cur_size:500,Max_size:2000,Used:259,Free:1741,Percentage:13
and so on..
Is this possible to get the output like the above key:value format?
Next time at least pretend that you tried something ;-)
awk -F"," 'FNR > 1 {print $1","$2",tblsps_name:"$3",Cur_size:"$4",Max_size:"$5",Used:"$6",Free:"$7",Percentage:"$8}' your.csv
-F"," is field separator, FNR > 1 skip first header line, $1 is first column and so on

Compare two columns of different files and add new column if it matches

I would like to compare the first two columns of two files; if they match, print yes, else print no.
input.txt
123,apple,type1
123,apple,type2
456,orange,type1
6567,kiwi,type2
333,banana,type1
123,apple,type2
qualified.txt
123,apple,type4
6567,kiwi,type2
output.txt
123,apple,type1,yes
123,apple,type2,yes
456,orange,type1,no
6567,kiwi,type2,yes
333,banana,type1,no
123,apple,type2,yes
I was using the command below to split the data, and then I would add one more column based on the result.
But input.txt has duplicates (in the 1st column), so that method does not work, and the file size is huge.
Can we get output.txt with an awk one-liner?
comm -2 -3 input.txt qualified.txt
$ awk -F, 'NR==FNR {a[$1 FS $2];next} {print $0 FS (($1 FS $2) in a?"yes":"no")}' qual input
123,apple,type1,yes
123,apple,type2,yes
456,orange,type1,no
6567,kiwi,type2,yes
333,banana,type1,no
123,apple,type2,yes
Explained:
NR==FNR { # for the first file
a[$1 FS $2];next # acknowledge the existence of qualified 1st and 2nd field pairs
}
{
print $0 FS ($1 FS $2 in a?"yes":"no") # output input row and "yes" or "no"
} # depending on whether key found in array a
No need to redefine the OFS as $0 isn't modified and doesn't get rebuilt.
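To see why: awk only rebuilds $0 (joining the fields with OFS) when a field is assigned to, which a quick demonstration shows:
echo 'a,b,c' | awk -F, '{$1=$1} 1'                  # prints "a b c" (default OFS is a space)
echo 'a,b,c' | awk 'BEGIN{FS=OFS=","} {$1=$1} 1'    # prints "a,b,c"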
You can use awk logic for this as below. Not sure why you mention a one-liner awk command though.
awk -v FS="," -v OFS="," 'FNR==NR{map[$1]=$2;next} {if($1 in map == 0) {$0=$0FS"no"} else {$0=$0FS"yes"}}1' qualified.txt input.txt
123,apple,type1,yes
123,apple,type2,yes
456,orange,type1,no
6567,kiwi,type2,yes
333,banana,type1,no
123,apple,type2,yes
The logic is:
FNR==NR parses the first file, qualified.txt, and stores its column 2 entries in an array, with the first column as the index.
Then, for each line in the 2nd file, {if($1 in map == 0) {$0=$0FS"no"} else {$0=$0FS"yes"}}1 appends "no" if the entry in column 1 is not in the array, and "yes" otherwise.
-v FS="," -v OFS="," set the input and output field separators
It looks like all you need is:
awk 'BEGIN{FS=OFS=","} NR==FNR{a[$1];next} {print $0, ($1 in a ? "yes" : "no")}' qualified.txt input.txt
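Note that this simplification keys on column 1 alone; the sample data happens to make that equivalent, but if the match should really require both of the first two columns, key on both, as in the first answer above:
awk 'BEGIN{FS=OFS=","} NR==FNR{a[$1 FS $2];next} {print $0, (($1 FS $2) in a ? "yes" : "no")}' qualified.txt input.txt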

awk: Remove duplicates and create a new csv file

I have following CSV file:
1393036,293296,68,59,Mithridates,ny,io
10155431,14595886,1807,135860,Riemogerz,ny,id
10767895,5749707,2402,1716,Nickispeaki,ny,uk
1536088,6390442,1301,109160,Ds02006,ny,ru
353,291765,434,434,Lar,ny,en,en-N
19332,7401441,296,352647,WikiDreamer,ny,fr
7142,7221255,298,78928,WikiDreamer Bot,ny,fi
417258,1507888,409,7709,Dmitri Lytov,ny,ru
7198454,15101351,5604,853415,Ffffnm,cdo,zh
1536088,6390442,1301,109160,Ds02006,ny,ru
353,291765,434,434,Lar,ny,en
I want to remove the duplicates based on the value of the first column. If there is more than one record with the same value, I want to keep only one in the new file.
I started with the following, which finds the duplicates, but I want to create a new file instead of just printing:
sort input.csv | awk 'NR == 1 {p=$1; next} p == $1 { print $1 " is duplicated"} {p=$1}' FS=","
Not 100% sure what you want, but this keeps only the last record when there are duplicates:
awk -F, '{a[$1]=$0} END {for (i in a) print a[i]}' file > newfile
cat newfile
417258,1507888,409,7709,Dmitri Lytov,ny,ru
7198454,15101351,5604,853415,Ffffnm,cdo,zh
7142,7221255,298,78928,WikiDreamer Bot,ny,fi
10767895,5749707,2402,1716,Nickispeaki,ny,uk
1536088,6390442,1301,109160,Ds02006,ny,ru
1393036,293296,68,59,Mithridates,ny,io
353,291765,434,434,Lar,ny,en
10155431,14595886,1807,135860,Riemogerz,ny,id
19332,7401441,296,352647,WikiDreamer,ny,fr
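Note that for (i in a) iterates in an unspecified order. If you want to keep the last record per key but preserve the original line order, a two-pass sketch (reading the file twice):
awk -F, 'NR == FNR {last[$1] = NR; next}   # first pass: line number of the last occurrence per key
FNR == last[$1]' file file                 # second pass: print a line only at that last occurrence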
If it's not important which record to keep, as long as field 1 is unique, this keeps the first hit when there are several duplicates:
awk -F, '!a[$1]++' file > newfile
cat newfile
1393036,293296,68,59,Mithridates,ny,io
10155431,14595886,1807,135860,Riemogerz,ny,id
10767895,5749707,2402,1716,Nickispeaki,ny,uk
1536088,6390442,1301,109160,Ds02006,ny,ru
353,291765,434,434,Lar,ny,en,en-N
19332,7401441,296,352647,WikiDreamer,ny,fr
7142,7221255,298,78928,WikiDreamer Bot,ny,fi
417258,1507888,409,7709,Dmitri Lytov,ny,ru
7198454,15101351,5604,853415,Ffffnm,cdo,zh
To write the duplicated keys to a new file:
awk -F, '++a[$1]==2 {print $1}' file > newfile
cat newfile
1536088
353
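If you want the whole duplicated records rather than just the keys, drop the {print $1} so the default action prints the full line on the second occurrence:
awk -F, '++a[$1]==2' file > newfile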
This will keep only the first entry for a given first-column value:
awk -F, '!(seen[$1]++)' file > newfile
