convert first column in a csv file from timestamp to year-month format - bash

Trying to convert the first column in a csv file from a unix timestamp to a date in year-month format.
Tried date -d @number '+%Y-%m' and awk, but awk doesn't recognize @ when the two are used together.
Extract from the csv file:
1556113878,60662402644292
1554090396,59547403093308
Expected output:
2019-04,60662402644292
2019-03,59547403093308

If you have GNU awk (sometimes called gawk), try:
gawk -F, '{print strftime("%Y-%m", $1),$2}' OFS=, file.csv
For example, consider this input file:
$ cat file.csv
1556113878,60662402644292
1554090396,59547403093308
Our command produces this output:
$ gawk -F, '{print strftime("%Y-%m", $1),$2}' OFS=, file.csv
2019-04,60662402644292
2019-03,59547403093308
On many Linux systems, GNU awk is the default. On others, like Ubuntu, it is not, but it can be easily installed: sudo apt-get install gawk. On macOS, GNU awk can be installed via Homebrew.
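If installing gawk is not an option, awk itself can shell out to date(1) for the conversion. This is only a sketch: it assumes GNU date's -d @SECONDS syntax (BSD/macOS date would need date -r SECONDS instead), and it spawns one date process per input line, so it is much slower than strftime on large files:
awk -F, '{
    cmd = "date -d @" $1 " +%Y-%m"    # BSD/macOS: cmd = "date -r " $1 " +%Y-%m"
    cmd | getline d                   # read the formatted date from the command
    close(cmd)
    print d "," $2
}' file.csv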

If you don't have GNU AWK, you may have a system Ruby, in which case you can do this:
$ ruby -F, -ane \
'$F[0] = Time.at($F[0].to_i).strftime("%Y-%m"); print $F.join(",")' FILE
2019-04,60662402644292
2019-04,59547403093308
Further explanation:
Unlike Perl's POSIX::strftime, system Ruby should ship with the Time module. Thus my choice of Ruby.
The command-line options: -F, sets the field separator, as in AWK; -n loops over the input without auto-printing, as in sed; -a turns on AWK-like auto-split; -e supplies the program on the command line, as in sed.
$F is similar to AWK's $0 and $F[0] is similar to AWK's $1. $F[0].to_i converts the Epoch time string in the first field to an integer.

Related

Convert GNU awk command to default macOS awk command

Given a file containing many lines such as, e.g.:
Z|X|20210903|07:00:00|S|33|27.71||
With wanted output of, e.g.:
Z|X|20210903|07:00:00|S|33|27.71|||03-09-2021 07:00:00
This GNU awk command works:
gawk -F'|' '{dt = gensub(/(....)(..)(..)/,"\\3-\\2-\\1",1,$3); print $0"|"dt,$4}' infile > outfile
However, I need this to work under macOS with the version of awk that is installed by default, and it produces the following error:
awk: calling undefined function gensub
input record number 1, file
source line number 1
I'm assuming the default version of awk in macOS is too old and doesn't support the gensub function.
Note that I have tried numerous other string functions to no avail. awk programming is not in my area of expertise and I arrived at the GNU awk command above through a fair amount of googling, but my google-fu was unsuccessful in finding something that works with macOS awk.
Can the above GNU awk command be rewritten to work with the default version of awk in, e.g., macOS Catalina and if so how?
Would you please try the following:
awk -F'|' '{dt=substr($3,7,2) "-" substr($3,5,2) "-" substr($3,1,4); print $0 "|" dt, $4}' infile > outfile
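For a quick check against the sample record from the question (the here-string is just for the demo and assumes bash or zsh):
$ awk -F'|' '{dt=substr($3,7,2) "-" substr($3,5,2) "-" substr($3,1,4); print $0 "|" dt, $4}' <<<'Z|X|20210903|07:00:00|S|33|27.71||'
Z|X|20210903|07:00:00|S|33|27.71|||03-09-2021 07:00:00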
Using perl instead of gawk:
$ perl -lne '
my @F = split /[|]/, $_, -1;
my $dt = ($F[2] =~ s/(....)(..)(..)/$3-$2-$1/r);
print join("|", @F, "$dt $F[3]")' <<<"Z|X|20210903|07:00:00|S|33|27.71||"
Z|X|20210903|07:00:00|S|33|27.71|||03-09-2021 07:00:00

awk command inside for loop to read and write multiple files [duplicate]

I am learning awk and I would like to know if there is an option to write changes to a file, similar to sed, where I would use the -i option to save modifications to a file.
I do understand that I could use redirection to write the changes. However, is there an option in awk to do that?
GNU Awk 4.1.0 (released in 2013) and later has the option of "inplace" file editing:
[...] The "inplace" extension, built using the new facility, can be used to simulate the GNU "sed -i" feature. [...]
Example usage:
$ gawk -i inplace '{ gsub(/foo/, "bar") }; { print }' file1 file2 file3
To keep the backup:
$ gawk -i inplace -v INPLACE_SUFFIX=.bak '{ gsub(/foo/, "bar") }
> { print }' file1 file2 file3
Unless you have GNU awk 4.1.0 or later...
You won't have an option like sed's -i, so instead do:
$ awk '{print $0}' file > tmp && mv tmp file
Note: sed's -i is not magic either; it also creates a temporary file, sed just handles it for you.
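A slightly more defensive form of the same temp-file idiom, as a sketch (mktemp is assumed to be available, and the gsub program is just an example), so that a failed awk run does not clobber the original file:
tmp=$(mktemp) &&
awk '{ gsub(/foo/, "bar"); print }' file > "$tmp" &&
mv "$tmp" file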
As of GNU awk 4.1.0...
GNU awk added this functionality in version 4.1.0 (released 10/05/2013). It is not as straightforward as just giving the -i option, as described in the release notes:
The new -i option (from xgawk) is used for loading awk library files. This differs from -f in that the first non-option argument
is treated as a script.
You need to use the bundled inplace.awk include file to invoke the extension properly like so:
$ cat file
123 abc
456 def
789 hij
$ gawk -i inplace '{print $1}' file
$ cat file
123
456
789
The variable INPLACE_SUFFIX can be used to specify the extension for a backup file:
$ gawk -i inplace -v INPLACE_SUFFIX=.bak '{print $1}' file
$ cat file
123
456
789
$ cat file.bak
123 abc
456 def
789 hij
I am happy this feature has been added, but to me the implementation isn't very awkish, as awk's power comes from the conciseness of the language, and -i inplace is 8 characters too long, IMO.
See the GNU awk manual for the official word.
just a little hack that works
echo "$(awk '{awk code}' file)" > file
@sudo_O has the right answer.
This can't work:
someprocess < file > file
The shell performs the redirections before handing control over to someprocess. The > redirection truncates the file to zero size. Therefore, by the time someprocess is launched and wants to read from the file, there is no data left to read.
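A quick demonstration of that truncation (the exact behaviour can vary between cat implementations, so this is only illustrative):
$ printf 'hello\n' > file
$ cat < file > file
$ wc -c file
0 file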
An alternative is to use sponge:
awk '{print $0}' your_file | sponge your_file
Where you replace '{print $0}' by your awk script and your_file by the name of the file you want to edit in place.
sponge absorbs entirely the input before saving it to the file.
The following won't work:
echo $(awk '{awk code}' file) > file
but this should:
echo "$(awk '{awk code}' file)" > file
In case you want an awk-only solution that avoids a temporary file and works with awk versions other than gawk 4.1.0+ (note that it buffers the whole file in memory):
awk '{a[b++]=$0} END {for(c=0;c<b;c++)print a[c]>ARGV[1]}' file

"grep" a csv file including multi-lines fields?

file.csv:
XA90;"standard"
XA100;"this is
the multi-line"
XA110;"other standard"
I want to grep the "XA100" entry like this:
grep XA100 file.csv
to obtain this result:
XA100;"this is
the multi-line"
but grep returns only one line:
XA100;"this is
The file contains 3 entries.
The "XA100" entry contains a multi-line field.
And grep doesn't seem to be the right tool to "grep" a CSV file containing multi-line fields.
Do you know a way to do the job?
Edit: the real-world file contains many columns. The searched term can be in any column (not necessarily at the beginning of the line, nor at the beginning of a field). All fields are enclosed in ". Any field can span multiple lines, from 1 to any number, and this cannot be predicted.
Give this line a try:
awk '/^XA100;/{p=1}p;p&&/"$/{p=0}' file
It sets a flag p when a line starts with XA100;, prints every line while the flag is set, and clears the flag on a line that ends with a closing quote.
I extended your example a bit:
kent$ cat f
XA90;"standard"
XA100;"this is
the
multi-
line"
XA110;"other standard"
kent$ awk '/^XA100;/{p=1}p;p&&/"$/{p=0}' f
XA100;"this is
the
multi-
line"
In the comments you mention: in the real-world file, each line starts with ". I assume they also end with ", so a closing quote followed by a newline can serve as the record separator, and present you this:
Test file:
$ cat file
"single line"
"multi-
lined"
Code and outputs:
$ awk 'BEGIN{RS=ORS="\"\n"} /single/' file
"single line"
$ awk 'BEGIN{RS=ORS="\"\n"} /m/' file
"multi-
lined"
You can also parametrize the search:
$ awk -v s="multi" 'BEGIN{RS=ORS="\"\n"} match($0,s)' file
"multi-
lined"
try:
Solution 1:
awk -v RS="XA" 'NR==3{gsub(/$\n$/,"");print RS $0}' Input_file
This makes the record separator the string XA (note that a multi-character RS requires GNU awk), looks for the 3rd record, and globally substitutes the trailing newline (the extra line at the end of the record) with nothing. It then prints the record separator followed by the current record.
Solution 2:
awk '/XA100/{print;getline;while($0 !~ /^XA/){print;getline}}' Input_file
This looks for the string XA100, prints the current line, then uses getline to move to the next line; the while loop keeps printing lines until it reaches a line starting with XA.
If this file was exported from MS-Excel or similar then lines end with \r\n while the newlines inside quotes are just \ns so then all you need is:
$ awk -v RS='\r\n' '/XA100/' file
XA100;"this is
the multi-line"
The above uses GNU awk for multi-char RS. On some platforms, e.g. cygwin, you'll have to add -v BINMODE=3 so gawk sees the \rs rather than them getting stripped by underlying C primitives.
Otherwise, it's extremely hard to parse CSV files in general without a real CSV parser (which awk currently doesn't have but is in the works for GNU awk) but you could do this (again with GNU awk for multi-char RS):
$ cat file
XA90;"standard"
XA100;"this is
the multi-line"
XA110;"other standard"
$ awk -v RS="\"[^\"]*\"" -v ORS= '{gsub(/\n/," ",RT); print $0 RT}' file
XA90;"standard"
XA100;"this is the multi-line"
XA110;"other standard"
This replaces all newlines within quotes with blanks, so the data can then be processed as a regular one-line-per-record file.
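Once the quoted newlines have been flattened that way, the original grep-style search works again, for example (same awk command, just quoted with single quotes):
$ awk -v RS='"[^"]*"' -v ORS= '{gsub(/\n/," ",RT); print $0 RT}' file | grep XA100
XA100;"this is the multi-line"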
Using PS's response, this works for the small example:
sed 's/^X/\n&/' file.csv | awk -v RS= '/XA100/ {print}'
For my real-world CSV file, with many columns, with the searched term anywhere, with an unknown number of multi-line fields, with " characters escaped as "", with continuation lines beginning with ", and with all fields enclosed in ", this works. Note the exclusion of a second " character in the sed part:
sed 's/^"[^"]/\n&/' file.csv | awk -v RS= '/RESEARCH_TERM/ {print}'
This works because the first column of any entry cannot start with "". The first column always looks like "XXXXXXXXX", where X is any character but ".
Thank you all for so many responses; other solutions may also work depending on the CSV file format you use.

Awk double-slash record separator

I am trying to separate RECORDS of a file based on the string, "//".
What I've tried is:
awk -v RS="//" '{ print "******************************************\n\n"$0 }' myFile.gb
Where the "******" etc. is just a trace to show me where the records are split.
However, the file also contains single / characters, and my trace ****** is being printed at those as well, meaning that awk is interpreting them as my record separator too.
How can I get awk to split records only on //?
UPDATE: I am running on Unix (the one that comes with OS X)
I found a temporary solution, being:
sed s/"\/\/"/"*"/g | awk -v RS="*" ...
But there must be a better way, especially with the massive files I am working with.
On a Mac, awk version 20070501 does not support multi-character RS. Here's an illustration using such an awk, and a comparison (on the same machine) with gawk:
$ /usr/bin/awk --version
awk version 20070501
$ /usr/bin/awk -v RS="//" '{print NR ":" $0}' <<< x//y//z
1:x
2:
3:y
4:
5:z
$ gawk -v RS="//" '{print NR ":" $0}' <<< x//y//z
1:x
2:y
3:z
If you cannot find a suitable awk, then pick a better character than *. For example, if tabs are acceptable, and if your shell supports $'...', then you could use this incantation of sed:
sed $'s,//,\t,g'
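Plugged into your original pipeline, that would look like this (a sketch; it assumes the data itself contains no literal tab characters):
sed $'s,//,\t,g' myFile.gb | awk -v RS='\t' '{ print "******************************************\n\n" $0 }'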

Using grep to pull a series of random numbers from a known line

I have a simple scalar file producing strings like...
bpred_2lev.ras_rate.PP 0.9413 # RAS prediction rate (i.e., RAS hits/used RAS)
Once I use grep to find this line in output.txt, is there a way I can directly grab the "0.9413" portion? I am attempting to make a csv file and just need whatever value is generated.
Thanks in advance.
There are several ways to combine finding and extracting into a single command:
awk (POSIX-compliant)
awk '$1 == "bpred_2lev.ras_rate.PP" { print $2 }' file
sed (GNU sed or BSD/OSX sed)
sed -En 's/^bpred_2lev\.ras_rate\.PP +([^ ]+).*$/\1/p' file
GNU grep
grep -Po '^bpred_2lev\.ras_rate\.PP +\K[^ ]+' file
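All three print 0.9413 for the sample line. Since the goal is to build a csv file, the extracted value can be appended directly, e.g. (results.csv is just an illustrative name):
awk '$1 == "bpred_2lev.ras_rate.PP" { print $2 }' output.txt >> results.csv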
You can use awk like this:
grep <your_search_criteria> output.txt | awk '{ print $2 }'
