awk output to file based on filter - bash

I have a big CSV file that I need to cut into different pieces based on the value in one of the columns. My input file dataset.csv is something like this:
NOTE: edited to clarify that the data is comma-separated, with no spaces.
action,action_type,Result
up,1,stringA
down,1,stringB
left,2,stringC
So, to split by action_type I simply do (I need the whole matching line in the resulting file):
awk -F, '$2 ~ /^1$/ {print}' dataset.csv >> 1_dataset.csv
awk -F, '$2 ~ /^2$/ {print}' dataset.csv >> 2_dataset.csv
This works as expected, but I am basically traversing my original dataset twice. My original dataset is about 5 GB and I have 30 action_type categories. I need to do this every day, so I need to script the thing to run on its own efficiently.
I tried the following but it does not work:
# This is a file called myFilter.awk
{
action_type=$2;
if (action_type=="1") print $0 >> 1_dataset.csv;
else if (action_type=="2") print $0 >> 2_dataset.csv;
}
Then I run it as:
awk -f myFilter.awk dataset.csv
But I get nothing. Literally nothing, not even errors. Which sort of tells me that my code is simply not matching anything, or my print / redirection statement is wrong.

You may try this awk to do it all in a single command. NR > 1 skips the header line, and closing each output file after every write keeps awk under the per-process open-file limit (which matters with 30 categories on a non-GNU awk), at the cost of some speed:
awk -F, 'NR > 1{fn = $2 "_dataset.csv"; print >> fn; close(fn)}' dataset.csv

With GNU awk (which can handle many concurrently open files), and without replicating the header line in each output file:
awk -F',' '{print > ($2 "_dataset.csv")}' dataset.csv
or if you also want the header line to show up in each output file then with GNU awk:
awk -F',' '
NR==1 { hdr = $0; next }
!seen[$2]++ { print hdr > ($2 "_dataset.csv") }
{ print > ($2 "_dataset.csv") }
' dataset.csv
or the same with any awk:
awk -F',' '
NR==1 { hdr = $0; next }
{ out = $2 "_dataset.csv" }
!seen[$2]++ { print hdr > out }
{ print >> out; close(out) }
' dataset.csv
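Run on the three-line sample above, either header-aware variant should leave two files like this (a quick check; the contents are inferred from the sample data):
$ cat 1_dataset.csv
action,action_type,Result
up,1,stringA
down,1,stringB
$ cat 2_dataset.csv
action,action_type,Result
left,2,stringC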

As currently coded, the input field separator has not been defined. Note also that output file names in awk must be quoted strings, e.g. print $0 >> "1_dataset.csv"; unquoted, they are not treated as file names.
Current:
$ cat myfilter.awk
{
action_type=$2;
if (action_type=="1") print $0 >> 1_dataset.csv;
else if (action_type=="2") print $0 >> 2_dataset.csv;
}
Invocation:
$ awk -f myfilter.awk dataset.csv
There are a couple of ways to supply the field separator (in either case, quote the file names inside the script as shown below):
$ awk -v FS="," -f myfilter.awk dataset.csv
or
$ cat myfilter.awk
BEGIN {FS=","}
{
action_type=$2
if (action_type=="1") print $0 >> "1_dataset.csv";
else if (action_type=="2") print $0 >> "2_dataset.csv";
}
$ awk -f myfilter.awk dataset.csv
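If you'd rather not hard-code one branch per action_type (the question mentions 30 of them), the filter script can be made data-driven. A sketch combining the fixes above with the per-value file names from the earlier answers:
# myFilter.awk - one output file per action_type value, header in each (sketch)
BEGIN { FS = "," }
NR == 1 { hdr = $0; next }         # remember the header line
{ out = $2 "_dataset.csv" }        # build the file name from column 2
!seen[$2]++ { print hdr > out }    # write the header once per new file
{ print >> out; close(out) }       # append the record; close() stays under the fd limit
Run it as before: awk -f myFilter.awk dataset.csv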

Related

Is there a way to double print a file in bash? [duplicate]

I'm new to Linux and programming. My problem is the following: I have a file with 3 columns. I want to swap the first and the last column and print the result to the prompt AND to a new file, in one line. So far I can swap the columns and print to the prompt OR to a file.
$ awk -F, ' { t = $1; $1 = $3; $3 = t; print; } ' OFS=, liste.csv
This is my baseline for printing to the prompt, but it seems impossible to also print to a new file in the same command line.
Does anyone have an idea?
Here are some examples that didn't work:
$ awk -F, ' { t = $1; $1 = $3; $3 = t; print; } ' OFS=, liste.csv | >liste2.csv
$ printf "$(sudo awk -F, ' { t = $1; $1 = $3; $3 = t; print; } ' OFS=, liste.csv > liste2.csv)"
$ cat $(sudo awk -F, ' { t = $1; $1 = $3; $3 = t; print; } ' OFS=, liste.csv > liste2.csv)
I think you catch the drift of what I ask.
Thanks!
Use the tee command as mentioned in How to redirect output to a file and stdout.
Or redirect it within awk itself by adding print > "liste2.csv" alongside the existing print that writes to stdout.
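For example, the second option could look like this (a sketch built on the questioner's own swap; liste2.csv is the questioner's target file):
awk -F, -v OFS=, '{ t = $1; $1 = $3; $3 = t; print; print > "liste2.csv" }' liste.csv
The first print writes to stdout; the second writes the same record to liste2.csv.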
print it to prompt AND to a new file in one line
This sounds like a task for tee. Assuming
awk -F, ' { t = $1; $1 = $3; $3 = t; print; } ' OFS=, liste.csv
produces the correct output on standard output, this
awk -F, ' { t = $1; $1 = $3; $3 = t; print; } ' OFS=, liste.csv | tee liste2.csv
should write to both liste2.csv and standard output.
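If liste2.csv should be appended to rather than overwritten on repeat runs, tee's -a flag does that:
awk -F, ' { t = $1; $1 = $3; $3 = t; print; } ' OFS=, liste.csv | tee -a liste2.csv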

Split csv file based on column from command line

I have some data in a CSV file of the form:
ID,DATE,EARNING
1,12 May 2018,5
1,13 May 2018,15
2,12 May 2018,25
I want to split this into multiple files such that file_1_May_report contains:
ID,DATE,EARNING
1,12 May 2018,5
1,13 May 2018,15
and another file file_2_May_report that contains:
ID,DATE,EARNING
2,12 May 2018,25
I have tried:
awk -F, '{print >> $1}' input.csv
However, I only get one file, named 1, containing a single record - the last record in the input file. How do I get it to split into multiple files based on ID?
You may use this awk:
awk -F, 'NR==1{hdr=$0; next} !seen[$1]++{fn="file_" $1 "_May_report"; print hdr > fn} {print > fn}' input.csv
Or with a more readable format:
awk -F, 'NR == 1 {
hdr = $0
next
}
!seen[$1]++ {
fn = "file_" $1 "_May_report"
print hdr > fn
}
{
print > fn
}' input.csv
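On the sample input above, this should leave exactly the two files the question asks for:
$ cat file_1_May_report
ID,DATE,EARNING
1,12 May 2018,5
1,13 May 2018,15
$ cat file_2_May_report
ID,DATE,EARNING
2,12 May 2018,25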

gawk nextfile with compressed input

I'm trying to use awk's nextfile statement with multiple gzipped input files.
I've googled this before posting, but it looks like I'm the only one who wants to do this :D
This is what I need to do:
awk '
BEGIN{
print "start",strftime();
}
/match/{
print FILENAME,"->",$0
count++
nextfile
}
END{
print count
print "stop",strftime();
}
' /var/log/*.2015-01-23.gz
Sadly, awk can't read gzipped files by itself, so I have to use zcat, and I've modified my syntax as follows:
zcat /var/log/*.2015-01-23.gz | awk '
BEGIN{
print "start",strftime();
}
/match/{
print FILENAME,"->",$0
count++
nextfile
}
END{
print count
print "stop",strftime();
}
'
But this way the nextfile statement won't work, because awk sees just one input stream.
My awk version:
# awk --version
GNU Awk 3.1.7
Note: what is shown is a summary of what I need to do in the END action, so don't propose zgrep or anything else. I need awk.
Note 2: the files will be processed together.
Thanks
You can fetch the files one by one, hold the name of the current file in a shell variable, and print it from awk:
#!/bin/bash
for LOGFILE in /var/log/*.2015-01-23.gz
do
zcat "$LOGFILE" |
awk '
BEGIN{
print "start",strftime();
}
/match/{
print "'$LOGFILE'","->",$0
last
}
END{
print "stop",strftime();
}
'
done
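The original END block also kept a match count across all files; with one awk run per file, that bookkeeping has to move into the shell. A sketch (variable names are mine, not from the original answer):
#!/bin/bash
total=0
for LOGFILE in /var/log/*.2015-01-23.gz
do
  # exit after the first match mirrors nextfile; c+0 prints 0 when nothing matched
  n=$(zcat "$LOGFILE" | awk '/match/{c++; exit} END{print c+0}')
  total=$((total + n))
done
echo "count: $total"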

awk OFS not producing expected value

I have a file
[root@nmk ~]# cat file
abc>
sssd>
were>
I run both of these variations of the awk command:
[root@nmk ~]# cat file | awk -F\> ' { print $1}' OFS=','
abc
sssd
were
[root@nmk ~]# cat file | awk -F\> ' BEGIN { OFS=","} { print $1}'
abc
sssd
were
[root@nmk ~]#
But my expected output is
abc,sssd,were
What's missing in my commands?
You're just a bit confused about the meaning/use of FS, OFS, RS and ORS. Take another look at the man page. I think this is what you were trying to do:
$ awk -F'>' -v ORS=',' '{print $1}' file
abc,sssd,were,$
but this is probably closer to the output you really want:
$ awk -F'>' '{rec = rec (NR>1?",":"") $1} END{print rec}' file
abc,sssd,were
or if you don't want to buffer the whole output as a string:
$ awk -F'>' '{printf "%s%s", (NR>1?",":""), $1} END{print ""}' file
abc,sssd,were
awk -F\> -v ORS="" 'NR>1{print ","$1;next}{print $1}' file
To print a newline at the end:
awk -F\> -v ORS="" 'NR>1{print ","$1;next}{print $1} END{print "\n"}' file
output:
abc,sssd,were
Each line of input in awk is a record, so what you want to set is the Output Record Separator, ORS. The OFS variable holds the Output Field Separator, which is used to separate different parts of each line.
Since you are setting the input field separator, FS, to >, and OFS to ,, an easy way to see how these work is to add something on each line of your file after the >:
awk 'BEGIN { FS=">"; OFS=","} {$1=$1} 1' <<<$'abc>def\nsssd>dsss\nwere>wolf'
abc,def
sssd,dsss
were,wolf
So you want to set the ORS. The default record separator is newline, so whatever you set ORS to effectively replaces the newlines in the input. But that means that if the last line of input ends with a newline - which is usually the case - that last line will also get a copy of your new ORS:
awk 'BEGIN { FS=">"; ORS=","} 1' <<<$'abc>def\nsssd>dsss\nwere>wolf'
abc>def,sssd>dsss,were>wolf,
The output also won't end with a newline at all, because the final newline was interpreted as an input record separator and turned into the output record separator - it became the trailing comma.
So you have to be a little more explicit about what you're trying to do:
awk 'BEGIN { FS=">" } # split input on >
(NR>1) { printf "," } # if not the first line, print a ,
{ printf "%s", $1 } # print the first field (everything up to the first >)
END { printf "\n" } # add a newline at the end
' <<<$'abc>\nsssd>\nwere>'
Which outputs this:
abc,sssd,were
Through sed,
$ sed ':a;N;$!ba;s/>\n/,/g;s/>$//' file
abc,sssd,were
Through Perl,
$ perl -00pe 's/>\n(?=.)/,/g;s/>$//' file
abc,sssd,were

How to replace full column with the last value?

I'm trying to take the last value in the third column of a CSV file and then replace the whole third column with that value.
I've been trying this:
var=$(tail -n 1 math_ready.csv | awk -F"," '{print $3}'); awk -F, '{$3="$var";}1' OFS=, math_ready.csv > math1.csv
But it's not working and I don't understand why...
Please help!
awk '
BEGIN { ARGV[2]=ARGV[1]; ARGC++; FS=OFS="," }  # queue the same file twice
NR==FNR { last = $3; next }                    # pass 1: remember the final $3
{ $3 = last; print }                           # pass 2: overwrite $3 everywhere
' math_ready.csv > math1.csv
The main problem with your script was trying to access a shell variable ($var) inside your awk script. Awk is not shell; it is a completely separate language/tool with its own namespace and variables. You cannot directly access a shell variable in awk, just like you couldn't access it in C. To access the VALUE of a shell variable you'd do:
shellvar=27
awk -v awkvar="$shellvar" 'BEGIN{ print awkvar }'
Some additional cleanup:
When FS and OFS have the same value, don't assign them each to that value separately, use BEGIN{ FS=OFS="," } instead for clarity and maintainability.
Do not initialize variables AFTER the script that uses them unless you have a very specific reason to do so. Use awk -F... -v OFS=... 'script' to initialize those variables to separate values, not awk -F... 'script' OFS=...; it's very unnatural to initialize variables in the argument list AFTER the code that uses them, and variables initialized there are not yet set when the BEGIN section executes, which can cause bugs.
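A quick way to see the difference (a minimal check, not part of the original answer):
$ awk -v OFS=',' 'BEGIN{ print (OFS == "," ? "comma" : "default") }'
comma
$ awk 'BEGIN{ print (OFS == "," ? "comma" : "default") }' OFS=','
default
The trailing assignment only takes effect once awk starts reading input, which is after BEGIN has already run.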
A shell variable is not expanded inside awk. You can do this instead:
awk -F, -v var="$var" '{ $3 = var } 1' OFS=, math_ready.csv > math1.csv
And you can probably simplify your code with this:
awk -F, 'NR == FNR { r = $3; next } { $3 = r } 1' OFS=, math_ready.csv math_ready.csv > math1.csv
Example input:
1,2,1
1,2,2
1,2,3
1,2,4
1,2,5
Output:
1,2,5
1,2,5
1,2,5
1,2,5
1,2,5
Try this one-liner. It doesn't depend on the column count:
var=$(tail -1 sample.csv | perl -ne 'm/([^,]+)$/; print "$1";'); while read -r line; do echo "$line" | perl -pe "s/[^,]*\$/$var/"; done < sample.csv
cat sample.csv
24,1,2,30,12
33,4,5,61,3333
66,7,8,91111,1
76,10,11,32,678
Output:
24,1,2,30,678
33,4,5,61,678
66,7,8,91111,678
76,10,11,32,678
