Split csv file based on column from command line - shell

I have some data in a CSV file of the form:
ID,DATE,EARNING
1,12 May 2018,5
1,13 May 2018,15
2,12 May 2018,25
I want to split this into multiple files such that file_1_May_report contains:
ID,DATE,EARNING
1,12 May 2018,5
1,13 May 2018,15
and another file file_2_May_report that contains:
ID,DATE,EARNING
2,12 May 2018,25
I have tried:
awk -F, '{print >> $1}' input.csv
However, I only get one file named 1, containing only the last record of the input file. How do I get it to split into multiple files based on ID?

You may use this awk:
awk -F, 'NR==1{hdr=$0; next} !seen[$1]++{fn="file_" $1 "_May_report"; print hdr > fn} {print > fn}' input.csv
Or with a more readable format:
awk -F, 'NR == 1 {
    hdr = $0
    next
}
!seen[$1]++ {
    fn = "file_" $1 "_May_report"
    print hdr > fn
}
{
    print > fn
}' input.csv
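If you are not using GNU awk and there are many distinct IDs, you can hit the limit on simultaneously open files. A portable variation (an untested sketch, same naming scheme as above) closes each output file after writing to it:
awk -F, 'NR == 1 { hdr = $0; next }
{ fn = "file_" $1 "_May_report" }
!seen[$1]++ { print hdr > fn }   # ">" truncates a stale file on the first write
{ print >> fn; close(fn) }       # then reopen in append mode per record
' input.csv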

Related

awk output to file based on filter

I have a big CSV file that I need to cut into different pieces based on the value in one of the columns. My input file dataset.csv is something like this:
NOTE: edited to clarify that the fields are comma-separated, with no spaces.
action,action_type,Result
up,1,stringA
down,1,stringB
left,2,stringC
So, to split by action_type I simply do (I need the whole matching line in the resulting file):
awk -F, '$2 ~ /^1$/ {print}' dataset.csv >> 1_dataset.csv
awk -F, '$2 ~ /^2$/ {print}' dataset.csv >> 2_dataset.csv
This works as expected, but I am basically traversing my original dataset twice. My original dataset is about 5 GB and I have 30 action_type categories. I need to do this every day, so I need to script the thing to run on its own efficiently.
I tried the following but it does not work:
# This is a file called myFilter.awk
{
action_type=$2;
if (action_type=="1") print $0 >> 1_dataset.csv;
else if (action_type=="2") print $0 >> 2_dataset.csv;
}
Then I run it as:
awk -f myFilter.awk dataset.csv
But I get nothing. Literally nothing, not even errors. Which sort of tells me that my code is simply not matching anything, or my print/pipe statement is wrong.
You may try this awk to do it in a single pass:
awk -F, 'NR > 1{fn = $2 "_dataset.csv"; print >> fn; close(fn)}' file
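Note that NR > 1 skips the header line, and because the output files are opened in append mode (>>), files left over from an earlier run should be removed first, e.g. (assuming the action_type values are numeric):
rm -f [0-9]*_dataset.csv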
With GNU awk to handle many concurrently open files and without replicating the header line in each output file:
awk -F',' '{print > ($2 "_dataset.csv")}' dataset.csv
or if you also want the header line to show up in each output file then with GNU awk:
awk -F',' '
NR==1 { hdr = $0; next }
!seen[$2]++ { print hdr > ($2 "_dataset.csv") }
{ print > ($2 "_dataset.csv") }
' dataset.csv
or the same with any awk:
awk -F',' '
NR==1 { hdr = $0; next }
{ out = $2 "_dataset.csv" }
!seen[$2]++ { print hdr > out }
{ print >> out; close(out) }
' dataset.csv
As currently coded, the input field separator has not been defined.
Current:
$ cat myFilter.awk
{
action_type=$2;
if (action_type=="1") print $0 >> 1_dataset.csv;
else if (action_type=="2") print $0 >> 2_dataset.csv;
}
Invocation:
$ awk -f myFilter.awk dataset.csv
There are a couple of ways to address this (note that the output file names must also be quoted string literals such as "1_dataset.csv"; otherwise awk tries to evaluate them as expressions):
$ awk -v FS="," -f myFilter.awk dataset.csv
or
$ cat myFilter.awk
BEGIN {FS=","}
{
action_type=$2
if (action_type=="1") print $0 >> "1_dataset.csv";
else if (action_type=="2") print $0 >> "2_dataset.csv";
}
$ awk -f myFilter.awk dataset.csv
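With the field separator defined and the file names quoted, the sample data splits as expected; a hypothetical session:
$ awk -f myFilter.awk dataset.csv
$ cat 1_dataset.csv
up,1,stringA
down,1,stringB
$ cat 2_dataset.csv
left,2,stringC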

Compare two columns of different files and add new column if it matches

I would like to compare the first two columns of two files; if they match, print yes, else print no.
input.txt
123,apple,type1
123,apple,type2
456,orange,type1
6567,kiwi,type2
333,banana,type1
123,apple,type2
qualified.txt
123,apple,type4
6567,kiwi,type2
output.txt
123,apple,type1,yes
123,apple,type2,yes
456,orange,type1,no
6567,kiwi,type2,yes
333,banana,type1,no
123,apple,type2,yes
I was using the command below to split the data, and then I would add one more column based on the result:
comm -2 -3 input.txt qualified.txt
Now, input.txt has duplicates (in the 1st column), so this method does not work; also, the file is huge.
Can we get output.txt with an awk one-liner?
$ awk -F, 'NR==FNR {a[$1 FS $2];next} {print $0 FS (($1 FS $2) in a?"yes":"no")}' qualified.txt input.txt
123,apple,type1,yes
123,apple,type2,yes
456,orange,type1,no
6567,kiwi,type2,yes
333,banana,type1,no
123,apple,type2,yes
Explained:
NR==FNR {                # for the first file
    a[$1 FS $2]; next    # acknowledge the existence of qualified 1st and 2nd field pairs
}
{
    print $0 FS (($1 FS $2) in a ? "yes" : "no")   # output the input row plus "yes" or "no",
}                        # depending on whether the key is found in array a
No need to redefine the OFS as $0 isn't modified and doesn't get rebuilt.
You can use awk logic for this as below. Not sure why you mention a one-liner awk command, though.
awk -v FS="," -v OFS="," 'FNR==NR {map[$1]=$2; next} {if (($1 in map) == 0) {$0 = $0 FS "no"} else {$0 = $0 FS "yes"}} 1' qualified.txt input.txt
123,apple,type1,yes
123,apple,type2,yes
456,orange,type1,no
6567,kiwi,type2,yes
333,banana,type1,no
123,apple,type2,yes
The logic is:
The condition FNR==NR selects the first file, qualified.txt, and stores its column-2 values in the array map, with column 1 as the index.
Then, for each line of the 2nd file, {if (($1 in map) == 0) {$0 = $0 FS "no"} else {$0 = $0 FS "yes"}} 1 appends the string no if the entry in column 1 is not in the array, and yes otherwise.
-v FS="," -v OFS="," set the input and output field separators.
It looks like all you need is:
awk 'BEGIN{FS=OFS=","} NR==FNR{a[$1];next} {print $0, ($1 in a ? "yes" : "no")}' qualified.txt input.txt
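That keys the lookup on the first column only, which suffices for this data since each ID maps to a single fruit. If the match really has to cover both of the first two columns, as the question states, key on both fields; a sketch in the same style:
awk 'BEGIN{FS=OFS=","} NR==FNR{a[$1 FS $2];next} {print $0, (($1 FS $2) in a ? "yes" : "no")}' qualified.txt input.txt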

awk: Remove duplicates and create a new csv file

I have following CSV file:
1393036,293296,68,59,Mithridates,ny,io
10155431,14595886,1807,135860,Riemogerz,ny,id
10767895,5749707,2402,1716,Nickispeaki,ny,uk
1536088,6390442,1301,109160,Ds02006,ny,ru
353,291765,434,434,Lar,ny,en,en-N
19332,7401441,296,352647,WikiDreamer,ny,fr
7142,7221255,298,78928,WikiDreamer Bot,ny,fi
417258,1507888,409,7709,Dmitri Lytov,ny,ru
7198454,15101351,5604,853415,Ffffnm,cdo,zh
1536088,6390442,1301,109160,Ds02006,ny,ru
353,291765,434,434,Lar,ny,en
I want to remove the duplicates based on the value of the first column. If there is more than one record with the same value, I want to keep only one in the new file.
I started with the following, which finds the duplicates, but I want to create a new file instead of just printing:
sort input.csv | awk 'NR == 1 {p=$1; next} p == $1 { print $1 " is duplicated"} {p=$1}' FS=","
Not 100% sure what you'd like, but this will keep only the last record when there are duplicates:
awk -F, '{a[$1]=$0} END {for (i in a) print a[i]}' file > newfile
cat newfile
417258,1507888,409,7709,Dmitri Lytov,ny,ru
7198454,15101351,5604,853415,Ffffnm,cdo,zh
7142,7221255,298,78928,WikiDreamer Bot,ny,fi
10767895,5749707,2402,1716,Nickispeaki,ny,uk
1536088,6390442,1301,109160,Ds02006,ny,ru
1393036,293296,68,59,Mithridates,ny,io
353,291765,434,434,Lar,ny,en
10155431,14595886,1807,135860,Riemogerz,ny,id
19332,7401441,296,352647,WikiDreamer,ny,fr
If it's not important which record you keep, as long as field 1 is unique, this will keep the first hit when there are several duplicates:
awk -F, '!a[$1]++' file > newfile
cat newfile
1393036,293296,68,59,Mithridates,ny,io
10155431,14595886,1807,135860,Riemogerz,ny,id
10767895,5749707,2402,1716,Nickispeaki,ny,uk
1536088,6390442,1301,109160,Ds02006,ny,ru
353,291765,434,434,Lar,ny,en,en-N
19332,7401441,296,352647,WikiDreamer,ny,fr
7142,7221255,298,78928,WikiDreamer Bot,ny,fi
417258,1507888,409,7709,Dmitri Lytov,ny,ru
7198454,15101351,5604,853415,Ffffnm,cdo,zh
To get the duplicated keys into a new file:
awk -F, '++a[$1]==2 {print $1}' file > newfile
cat newfile
1536088
353
This will show only the first entry for a given first column value:
awk -F, '!(seen[$1]++)' file > newfile
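If awk is not a requirement and any one record per key will do, plain sort can also deduplicate on the first field (it keeps the first line per key in sort order):
sort -t, -u -k1,1 file > newfile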

Split file into different parts based on the data using awk

I need to split the data in file 1 based on the value in $4, using awk. The target file names should be taken from a mapping file, file 2.
File 1
text;text;text;AB;text
text;text;text;AB;text
text;text;text;CD;text
text;text;text;CD;text
text;text;text;EF;text
text;text;text;EF;text
File 2
AB;valid
CD;not_valid
EF;not_specified
Desired output where the file names are the value of $2 in file 2.
File valid
text;text;text;AB;text
text;text;text;AB;text
File not_valid
text;text;text;CD;text
text;text;text;CD;text
File not_specified
text;text;text;EF;text
text;text;text;EF;text
Any suggestions on how to perform the split?
Using awk:
awk -F';' 'FNR==NR {a[$1]=$2;next} $4 in a {print > a[$4]}
$4 != p {if (p) close(a[p]); p=$4}' file2 file1
It seems that just the first part of the code will work:
awk -F';' 'FNR==NR {a[$1]=$2;next} $4 in a {print > a[$4]}' file2 file1
So why is the last part of the code:
$4 != p {if (p) close(a[p]); p=$4}
needed? Thanks!
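The close() matters when there are many distinct $4 values: awk implementations other than GNU awk limit how many output files can be open at once, so without it a large input can eventually fail with an error along the lines of "too many open files". Closing the previous group's file whenever $4 changes keeps at most one output file open at a time, which works here because file 1 is grouped on $4.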

AWK split file by separator and count

I have a large 220 MB file. The file is divided into groups by a horizontal row of dashes ("---"). This is what I have so far:
cat test.list | awk -v ORS="" -v RS="-------------------------------------------------------------------------------" '{print $0;}'
How do I take this and print to a new file every 1000 matches?
Is there another way to do this? I looked at split and csplit, but the "----" rows do not occur predictably, so I have to match them and then split on a count of the matches.
I would like the output files to contain groups of 1000 matches each.
To output the first 1000 records to outputfile0, the next to outputfile1, etc., just do:
awk 'NR%1000 == 1{ file = "outputfile" i++ } { print > file }' ORS= RS=------ test.list
(Note that I truncated the dashes in RS for simplicity.)
Unfortunately, using a value of RS that is more than a single character produces unspecified results, so the above cannot be the solution. Perhaps something like twalberg's solution is required:
awk '/^-+$/ { if (!(c%1000)) count+=1; c+=1; next }
{print > ("outputfile" count)}' c=1 count=1 test.list
Not tested, but something along these lines might work:
awk 'BEGIN {fileno=1; matchcount=0}
/^-------/ { if (++matchcount == 1000) { ++fileno; matchcount=0 } }
{ print $0 > ("output_file_" fileno) }' < test.list
It might be cleaner to put all that in, say, split.awk and use awk -f split.awk test.list instead...
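For example, a sketch of that split.awk with the corrected syntax from above:
$ cat split.awk
BEGIN { fileno = 1; matchcount = 0 }
/^-------/ { if (++matchcount == 1000) { ++fileno; matchcount = 0 } }
{ print $0 > ("output_file_" fileno) }
$ awk -f split.awk test.list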
