awk: Remove duplicates and create a new csv file - bash

I have the following CSV file:
1393036,293296,68,59,Mithridates,ny,io
10155431,14595886,1807,135860,Riemogerz,ny,id
10767895,5749707,2402,1716,Nickispeaki,ny,uk
1536088,6390442,1301,109160,Ds02006,ny,ru
353,291765,434,434,Lar,ny,en,en-N
19332,7401441,296,352647,WikiDreamer,ny,fr
7142,7221255,298,78928,WikiDreamer Bot,ny,fi
417258,1507888,409,7709,Dmitri Lytov,ny,ru
7198454,15101351,5604,853415,Ffffnm,cdo,zh
1536088,6390442,1301,109160,Ds02006,ny,ru
353,291765,434,434,Lar,ny,en
I want to remove duplicates based on the value of the first column. If more than one record has the same value, I want to keep only one of them in the new file.
I started with the following, which finds the duplicates, but I want to create a new file instead of just printing:
sort input.csv | awk 'NR == 1 {p=$1; next} p == $1 { print $1 " is duplicated"} {p=$1}' FS=","

Not 100% sure what you want, but this keeps only the last record when several share the same first field:
awk -F, '{a[$1]=$0} END {for (i in a) print a[i]}' file > newfile
cat newfile
417258,1507888,409,7709,Dmitri Lytov,ny,ru
7198454,15101351,5604,853415,Ffffnm,cdo,zh
7142,7221255,298,78928,WikiDreamer Bot,ny,fi
10767895,5749707,2402,1716,Nickispeaki,ny,uk
1536088,6390442,1301,109160,Ds02006,ny,ru
1393036,293296,68,59,Mithridates,ny,io
353,291765,434,434,Lar,ny,en
10155431,14595886,1807,135860,Riemogerz,ny,id
19332,7401441,296,352647,WikiDreamer,ny,fr
Use this if it does not matter which record is kept, as long as field 1 is unique.
This keeps the first record when several share the same first field:
awk -F, '!a[$1]++' file > newfile
cat newfile
1393036,293296,68,59,Mithridates,ny,io
10155431,14595886,1807,135860,Riemogerz,ny,id
10767895,5749707,2402,1716,Nickispeaki,ny,uk
1536088,6390442,1301,109160,Ds02006,ny,ru
353,291765,434,434,Lar,ny,en,en-N
19332,7401441,296,352647,WikiDreamer,ny,fr
7142,7221255,298,78928,WikiDreamer Bot,ny,fi
417258,1507888,409,7709,Dmitri Lytov,ny,ru
7198454,15101351,5604,853415,Ffffnm,cdo,zh
To write the duplicated keys to a new file:
awk -F, '++a[$1]==2 {print $1}' file > newfile
cat newfile
1536088
353
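The keep-last command above prints records in whatever order for (i in a) visits them, which awk leaves unspecified. If the original order matters, a minimal sketch along these lines may help. It assumes the whole file fits in memory; the names sample.csv, dedup_last.csv, order and last are made up for this illustration:

```shell
# Sample rows taken from the question (file names are hypothetical).
printf '%s\n' \
  '1536088,6390442,1301,109160,Ds02006,ny,ru' \
  '353,291765,434,434,Lar,ny,en,en-N' \
  '19332,7401441,296,352647,WikiDreamer,ny,fr' \
  '1536088,6390442,1301,109160,Ds02006,ny,ru' \
  '353,291765,434,434,Lar,ny,en' > sample.csv

awk -F, '
  !($1 in last) { order[++n] = $1 }   # remember each key the first time it appears
  { last[$1] = $0 }                   # a later record overwrites an earlier one
  END { for (i = 1; i <= n; i++) print last[order[i]] }
' sample.csv > dedup_last.csv
```

Each key is printed at the position of its first occurrence, but with the content of its last occurrence.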

This will show only the first entry for a given first column value:
awk -F, '!(seen[$1]++)' file > newfile

Related

Split csv file based on column from command line

I have some data in a CSV file of the form:
ID,DATE,EARNING
1,12 May 2018,5
1,13 May 2018,15
2,12 May 2018,25
I want to split this into multiple files such that file_1_May_report contains:
ID,DATE,EARNING
1,12 May 2018,5
1,13 May 2018,15
and another file file_2_May_report that contains:
ID,DATE,EARNING
2,12 May 2018,25
I have tried:
awk -F, '{print >> $1}' input.csv
However, I only get one file, named 1, containing a single record: the last record in the input file. How do I get it to split into multiple files based on ID?
You may use this awk:
awk -F, 'NR==1{hdr=$0; next} !seen[$1]++{fn="file_" $1 "_May_report"; print hdr > fn} {print > fn}' input.csv
Or with a more readable format:
awk -F, 'NR == 1 {
hdr = $0
next
}
!seen[$1]++ {
fn = "file_" $1 "_May_report"
print hdr > fn
}
{
print > fn
}' input.csv
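In the command above, fn is only updated when an ID is seen for the first time, so it appears to rely on the rows being grouped by ID. A hedged variant that recomputes the file name for every record, and so should also cope with interleaved IDs, might look like this (earnings.csv and its row order are made up for the illustration):

```shell
# Same data as the question, but with the IDs interleaved (hypothetical file name).
printf '%s\n' \
  'ID,DATE,EARNING' \
  '1,12 May 2018,5' \
  '2,12 May 2018,25' \
  '1,13 May 2018,15' > earnings.csv

awk -F, '
  NR == 1 { hdr = $0; next }          # keep the header line for every output file
  {
    fn = "file_" $1 "_May_report"     # recompute the target name for each record
    if (!seen[$1]++) print hdr > fn   # first record for this ID: write the header
    print > fn                        # the file stays open, so this appends
  }
' earnings.csv
```

With a very large number of distinct IDs, the open files may need to be closed along the way, since awk implementations limit how many files can be open at once.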

Compare two columns of different files and add new column if it matches

I would like to compare the first two columns of two files: if they match, print yes, otherwise no.
input.txt
123,apple,type1
123,apple,type2
456,orange,type1
6567,kiwi,type2
333,banana,type1
123,apple,type2
qualified.txt
123,apple,type4
6567,kiwi,type2
output.txt
123,apple,type1,yes
123,apple,type2,yes
456,orange,type1,no
6567,kiwi,type2,yes
333,banana,type1,no
123,apple,type2,yes
I was using the command below to split the data, and then adding one more column based on the result.
Now input.txt has duplicates (in the 1st column), so this method no longer works; the file is also huge.
Can we get output.txt with an awk one-liner?
comm -2 -3 input.txt qualified.txt
$ awk -F, 'NR==FNR {a[$1 FS $2];next} {print $0 FS (($1 FS $2) in a?"yes":"no")}' qual input
123,apple,type1,yes
123,apple,type2,yes
456,orange,type1,no
6567,kiwi,type2,yes
333,banana,type1,no
123,apple,type2,yes
Explained:
NR==FNR { # for the first file
a[$1 FS $2];next # acknowledge the existence of qualified 1st and 2nd field pairs
}
{
print $0 FS ($1 FS $2 in a?"yes":"no") # output input row and "yes" or "no"
} # depending on whether key found in array a
No need to redefine the OFS as $0 isn't modified and doesn't get rebuilt.
You can use awk logic for this, as below. Not sure why you mention a one-liner awk command, though.
awk -v FS="," -v OFS="," 'FNR==NR{map[$1]=$2;next} {if($1 in map == 0) {$0=$0FS"no"} else {$0=$0FS"yes"}}1' qualified.txt input.txt
123,apple,type1,yes
123,apple,type2,yes
456,orange,type1,no
6567,kiwi,type2,yes
333,banana,type1,no
123,apple,type2,yes
The logic is
The condition FNR==NR is true only while the first file, qualified.txt, is being read; that block stores column 2 in the array map, indexed by column 1.
Then, for each line of the second file, {if($1 in map == 0) {$0=$0FS"no"} else {$0=$0FS"yes"}}1 appends no if column 1 is not an index of the array, and yes otherwise.
-v FS="," -v OFS="," set the input and output field separators.
It looks like all you need is:
awk 'BEGIN{FS=OFS=","} NR==FNR{a[$1];next} {print $0, ($1 in a ? "yes" : "no")}' qualified.txt input.txt
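The commands above that index the array on $1 alone would also report yes when column 1 matches but column 2 differs. Since the question asks about the first two columns, a sketch using awk's multi-subscript keys may be closer to the requirement. The file names ending in _demo and the 123,pear row are made up for this illustration:

```shell
# Hypothetical demo files; the 123,pear row is invented to show a column-2 mismatch.
printf '%s\n' '123,apple,type4' '6567,kiwi,type2' > qualified_demo.txt
printf '%s\n' \
  '123,apple,type1' \
  '123,pear,type1' \
  '456,orange,type1' \
  '6567,kiwi,type2' > input_demo.txt

awk 'BEGIN { FS = OFS = "," }
     NR == FNR { ok[$1, $2]; next }                  # first file: record each (col1, col2) pair
     { print $0, (($1, $2) in ok ? "yes" : "no") }   # second file: look the pair up
' qualified_demo.txt input_demo.txt > output_demo.txt
```

The subscript ($1, $2) is joined with SUBSEP internally, so a comma inside the data cannot make two different pairs collide.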

splitting file1 using AWK and then name the new files based on lines from file2

File 1 contains long strings of data delimited by <PhotoField1>, repeated many times.
Example:
<PhotoField1>alidkfjaeijwoeij<PhotoField1>akdfjalskdfasd<PhotoField1>
File 2 has a list of IDs that I want to use to name the files
Example:
A00565415
A00505050
A54531245
I have an AWK command to parse each string between <PhotoField1> from File1 into its own file, but it only labels the files temp with numbers:
awk -v RS="<PhotoField1>" '{ print $0 > "temp" NR }' File1.xml
I need to replace the temp* part with a line from the 2nd file,
so the new files would be named A00565415, A00505050, A54531245, etc.
- It would be excellent if I could add a .txt to the end of the file names: A54531245.txt
The awk command works great for separating the data into different files, but I need to be able to name them based on the File2 list.
awk 'NR==FNR{fname[NR]=$0".txt";next} {print > fname[FNR]}' File2.list RS="<PhotoField1>" File1.xml
You can use this awk:
awk -v RS="<PhotoField1>|\n" 'FNR==NR{a[NR]=$0; next}
NF{ print $0 > a[FNR] ".txt" }' file2 file1
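If File1 starts with the tag, as in the example, the first record is empty, so picking the name by FNR appears to shift every name by one. A sketch that counts only non-empty records with its own counter may avoid that. It assumes an awk that accepts a multi-character RS (gawk, mawk and busybox awk do; strict POSIX only defines a single character), and the names ids.list, photos.xml, name, n and out are hypothetical:

```shell
# Hypothetical demo files built from the question's examples.
printf '%s\n' 'A00565415' 'A00505050' > ids.list
printf '%s\n' '<PhotoField1>alidkfjaeijwoeij<PhotoField1>akdfjalskdfasd<PhotoField1>' > photos.xml

awk '
  NR == FNR { name[NR] = $0 ".txt"; next }   # first file: one output name per line
  NF {                                       # second file: skip empty records
    n++
    if (n in name) {                         # ignore chunks that have no ID left
      out = name[n]
      print > out
      close(out)                             # each file is written once, so close it right away
    }
  }
' ids.list RS='<PhotoField1>' photos.xml
```

The RS assignment sits between the two file names, so the ID list is still read line by line while the XML file is split on the tag.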

Redirect output of one command to different files in shell script

I have a tab-separated string.
I want to copy one column to one file and the remaining columns to another file in one go, because the string can change in between if I use two different commands.
I tried:
tab_seperated_string | awk -F"\t" '{ print $2"\t"$3"\t"$4"\t"$5} {print $1}'
Columns 2, 3, 4 and 5 should go to one file and column 1 should go to another file.
You can do it like this:
tab_seperated_string | awk -F"\t" '{print $2,$3,$4,$5 > "file2"; print $1 > "file1"}' OFS="\t"
It will then save data to two different files.
By setting OFS to \t, you do not need all the \t in the print statement.
Here is another way if you have many fields that go to one file and first field to another:
awk -F"\t" '{print $1 > "file1"; sub(/[^\t]+\t/,""); print $0 > "file2"}' OFS="\t"
The sub(/[^\t]+\t/,"") removes first field and first tab.
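One edge case worth noting: because the pattern is not anchored and requires at least one character, a line whose first field is empty would lose its second field instead. A hedged variant anchors the pattern and allows an empty first field. The input lines here are made up, and file1 and file2 follow the names used above:

```shell
# Two hypothetical tab-separated lines; the second has an empty first field.
printf 'k1\ta\tb\tc\td\n\tw\tx\ty\tz\n' |
awk -F'\t' '{
  print $1 > "file1"          # first column
  sub(/^[^\t]*\t/, "")        # drop the first field and its tab, even when the field is empty
  print > "file2"             # everything that is left
}'
```

Because sub() edits $0 directly, the remaining tabs are kept as they are and OFS does not need to be set.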

Split file into different parts based on the data using awk

I need to split the data in file 1 based on its value in $4 using awk. The target file names should be taken from a mapping file 2.
File 1
text;text;text;AB;text
text;text;text;AB;text
text;text;text;CD;text
text;text;text;CD;text
text;text;text;EF;text
text;text;text;EF;text
File 2
AB;valid
CD;not_valid
EF;not_specified
Desired output where the file names are the value of $2 in file 2.
File valid
text;text;text;AB;text
text;text;text;AB;text
File not_valid
text;text;text;CD;text
text;text;text;CD;text
File not_specified
text;text;text;EF;text
text;text;text;EF;text
Any suggestions on how to perform the split?
Using awk:
awk -F';' 'FNR==NR {a[$1]=$2;next} $4 in a {print > a[$4]}
$4 != p {if (p) close(a[p]); p=$4}' file2 file1
It seems that just the first part of the code will work:
awk -F';' 'FNR==NR {a[$1]=$2;next} $4 in a {print > a[$4]}' file2 file1
So why is the second half of the code:
$4 != p {if (p) close(a[p]); p=$4}
needed? Thanks!
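A likely reason, offered as an assumption rather than the answerer's stated intent: awk keeps every file opened with > open until it exits or close() is called, and implementations limit how many files can be open at once. Closing the previous file whenever $4 changes keeps a single output file open. Reopening a closed file with > truncates it, though, so that pattern relies on file1 being grouped by $4. A sketch that should tolerate ungrouped input uses >> instead; map.txt, data.txt and their rows are hypothetical:

```shell
# Hypothetical demo files; the AB rows are deliberately not adjacent.
printf '%s\n' 'AB;valid' 'CD;not_valid' > map.txt
printf '%s\n' 't1;t;t;AB;t' 't2;t;t;CD;t' 't3;t;t;AB;t' > data.txt
rm -f valid not_valid          # ">>" appends, so remove leftovers from earlier runs

awk -F';' '
  FNR == NR { a[$1] = $2; next }   # mapping file: key -> output file name
  $4 in a {
    out = a[$4]
    print >> out                   # append, so reopening never truncates
    close(out)                     # at most one output file is open at a time
  }
' map.txt data.txt
```

Closing after every line is simple but slow on large inputs; closing only when $4 changes, as in the answer, is the cheaper choice when the input is already grouped.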
