Awk double-slash record separator - bash

I am trying to separate RECORDS of a file based on the string "//".
What I've tried is:
awk -v RS="//" '{ print "******************************************\n\n"$0 }' myFile.gb
Where the "******" etc, is just a trace to show me that the record is split.
However, the file also contains single / characters, and my trace (******) is printed at those as well, meaning that awk is also treating them as my record separator.
How can I get awk to split records only on //?
UPDATE: I am running on Unix (the one that comes with OS X)
I found a temporary solution, being:
sed s/"\/\/"/"*"/g | awk -v RS="*" ...
But there must be a better way, especially with massive files that I am working with.

On a Mac, awk version 20070501 does not support multi-character RS. Here's an illustration using such an awk, and a comparison (on the same machine) with gawk:
$ /usr/bin/awk --version
awk version 20070501
$ /usr/bin/awk -v RS="//" '{print NR ":" $0}' <<< x//y//z
1:x
2:
3:y
4:
5:z
$ gawk -v RS="//" '{print NR ":" $0}' <<< x//y//z
1:x
2:y
3:z
If you cannot find a suitable awk, then pick a better character than *. For example, if tabs are acceptable, and if your shell supports $'...', then you could use this incantation of sed:
sed $'s,//,\t,g'
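Putting it together, the whole pipeline might look like this (a sketch, assuming no literal tabs already occur in myFile.gb, since any such tab would also split a record):
sed $'s,//,\t,g' myFile.gb | awk -v RS='\t' '{ print "******************************************\n\n" $0 }'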

What's wrong with repeating entries in awk print statements?

I was trying to answer this other question, about how to repeat an existing column.
I thought this to be fairly easy, just by doing something like:
awk '{print $0 $2}'
This, however, only seems to print $0.
So, I decided to do some more tests:
awk '{print $0 $0}'     # prints the entire line only once
awk '{print $1 $1 $1}'  # prints the first entry only once
awk '{print $2 $1 $0}'  # prints the first entry, followed by the entire line (the second part is not printed)
...
And having a look at the results, I have the impression that awk is more or less checking what it has already printed and refusing to print it a second time.
Why is that?
I'm using awk from my Windows Subsystem for Linux (WSL), more exactly the Ubuntu app from Canonical. This is the result of awk --version:
GNU Awk 5.0.1, API: 2.0 (GNU MPFR 4.0.2, GNU MP 6.2.0)
Copyright (C) 1989, 1991-2019 Free Software Foundation.
awk '{print $0 $0}'  # prints the entire line only once
awk '{print $0 $2}'  # prints only $0
All of these results are due to the presence of DOS line breaks (\r) in your file. Because of each \r, the terminal returns to the beginning of the line, so the second copy overwrites the first and you see only one line in the output.
You can remove \r using tr or sed like this:
tr -d '\r' < file > file.new
sed -i.bak $'s/\\r$//' file
Or you can ask awk to treat \r\n as the record separator (note: multi-character RS requires gnu-awk):
awk -v RS='\r\n' '{print $0, $0}' file
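If you want to confirm the carriage returns are really there before fixing anything, you can make them visible; with GNU coreutils, for example, cat -A shows each DOS line ending as ^M$ (od -c file | head works on any POSIX system):
cat -A file | head -2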

Convert GNU awk command to default macOS awk command

Given a file containing many lines such as, e.g.:
Z|X|20210903|07:00:00|S|33|27.71||
With wanted output of, e.g.:
Z|X|20210903|07:00:00|S|33|27.71|||03-09-2021 07:00:00
This GNU awk command works:
gawk -F'|' '{dt = gensub(/(....)(..)(..)/,"\\3-\\2-\\1",1,$3); print $0"|"dt,$4}' infile > outfile
However, I need this to work under macOS with the version of awk that is installed by default, and it produces the following error:
awk: calling undefined function gensub
input record number 1, file
source line number 1
I'm assuming the default version of awk in macOS is too old and doesn't support the gensub function.
Note that I have tried numerous other string functions to no avail. awk programming is not in my area of expertise and I arrived at the GNU awk command above through a fair amount of googling, but my google-fu was unsuccessful in trying to get something to work with macOS awk.
Can the above GNU awk command be rewritten to work with the default version of awk in, e.g., macOS Catalina and if so how?
Would you please try the following:
awk -F'|' '{dt=substr($3,7,2) "-" substr($3,5,2) "-" substr($3,1,4); print $0 "|" dt, $4}' infile > outfile
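For example, feeding the sample record from the question through this command (plain POSIX awk, so the stock macOS awk works) reproduces the wanted output:
$ echo 'Z|X|20210903|07:00:00|S|33|27.71||' | awk -F'|' '{dt=substr($3,7,2) "-" substr($3,5,2) "-" substr($3,1,4); print $0 "|" dt, $4}'
Z|X|20210903|07:00:00|S|33|27.71|||03-09-2021 07:00:00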
Using perl instead of gawk:
$ perl -lne '
my @F = split /[|]/, $_, -1;
my $dt = ($F[2] =~ s/(....)(..)(..)/$3-$2-$1/r);
print join("|", #F, "$dt $F[3]")' <<<"Z|X|20210903|07:00:00|S|33|27.71||"
Z|X|20210903|07:00:00|S|33|27.71|||03-09-2021 07:00:00

convert first column in a csv file from timestamp to year-month format

Trying to convert the first column in a CSV file from a Unix timestamp to a date (year-month) format.
Tried date -d @number '+%Y-%m' and awk, but awk doesn't recognize the @ when the two are used together.
Extract from a csv file :
1556113878,60662402644292
1554090396,59547403093308
Expected O/p
2019-04,60662402644292
2019-03,59547403093308
If you have GNU awk (sometimes called gawk), try:
gawk -F, '{print strftime("%Y-%m", $1),$2}' OFS=, file.csv
For example, consider this input file:
$ cat file.csv
1556113878,60662402644292
1554090396,59547403093308
Our command produces this output:
$ gawk -F, '{print strftime("%Y-%m", $1),$2}' OFS=, file.csv
2019-04,60662402644292
2019-03,59547403093308
On many Linux systems, GNU awk is the default. On others like Ubuntu, it is not, but it can be easily installed: sudo apt-get install gawk. On macOS, GNU awk can be installed via Homebrew.
If you don't have GNU AWK, you may have a system Ruby, in which case you can do this:
$ ruby -F, -ane \
'$F[0] = Time.at($F[0].to_i).strftime("%Y-%m"); print $F.join(",")' FILE
2019-04,60662402644292
2019-04,59547403093308
Further explanation:
Unlike Perl's POSIX::strftime, system Ruby should ship with the Time module. Thus my choice of Ruby.
The command-line options: -F, is the same as in AWK; -n is the same as in sed; -a turns on AWK-like auto-split; -e is the same as in sed.
$F is the array of fields for the current line, so $F[0] is similar to AWK's $1. $F[0].to_i converts the epoch-time string in the first field to an integer.
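If you have neither gawk nor Ruby, plain awk can shell out to date for each line instead (a sketch, assuming GNU date, which accepts @epoch; BSD/macOS date spells this date -r epoch):
awk -F, '{cmd = "date -d @" $1 " +%Y-%m"; cmd | getline d; close(cmd); print d "," $2}' file.csv
This also shows where the @ from the original attempt belongs: inside the string handed to date, not in the awk program itself. It forks one date per line, though, so it is slow on large files.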

"grep" a csv file including multi-lines fields?

file.csv:
XA90;"standard"
XA100;"this is
the multi-line"
XA110;"other standard"
I want to grep the "XA100" entry like this:
grep XA100 file.csv
to obtain this result:
XA100;"this is
the multi-line"
but grep returns only one line:
XA100;"this is
file.csv contains 3 entries.
The "XA100" entry contains a multi-line field.
And grep doesn't seem to be the right tool to "grep" a CSV file containing multi-line fields.
Do you know a way to do the job?
Edit: the real-world file contains many columns. The searched-for term can be in any column (not at the beginning of the line, nor at the beginning of a field). All fields are encapsulated by ". Any field can contain a multi-line value, from 1 line to any number, and this cannot be predicted.
Give this line a try:
awk '/^XA100;/{p=1}p;p&&/"$/{p=0}' file
I extended your example a bit:
kent$ cat f
XA90;"standard"
XA100;"this is
the
multi-
line"
XA110;"other standard"
kent$ awk '/^XA100;/{p=1}p;p&&/"$/{p=0}' f
XA100;"this is
the
multi-
line"
In the comments you mention: in the real-world file, each line starts with ". I assume they also end with ", and present you this:
Test file:
$ cat file
"single line"
"multi-
lined"
Code and outputs:
$ awk 'BEGIN{RS=ORS="\"\n"} /single/' file
"single line"
$ awk 'BEGIN{RS=ORS="\"\n"} /m/' file
"multi-
lined"
You can also parametrize the search:
$ awk -v s="multi" 'BEGIN{RS=ORS="\"\n"} match($0,s)' file
"multi-
lined"
try:
Solution 1:
awk -v RS="XA" 'NR==3{gsub(/$\n$/,"");print RS $0}' Input_file
This makes the record separator the string XA (GNU awk is required for a multi-character RS), looks for the 3rd record, and globally substitutes away the trailing newline at the end of the record. Then it prints the record separator followed by the current record.
Solution 2:
awk '/XA100/{print;getline;while($0 !~ /^XA/){print;getline}}' Input_file
This looks for the string XA100, prints the current line, and uses getline to move to the next line; the while loop then keeps printing and reading lines until it reaches a line that starts with XA.
If this file was exported from MS-Excel or similar, then lines end with \r\n while the newlines inside quotes are just \ns, so all you need is:
$ awk -v RS='\r\n' '/XA100/' file
XA100;"this is
the multi-line"
The above uses GNU awk for multi-char RS. On some platforms, e.g. cygwin, you'll have to add -v BINMODE=3 so gawk sees the \rs rather than them getting stripped by underlying C primitives.
Otherwise, it's extremely hard to parse CSV files in general without a real CSV parser (which awk currently doesn't have, though one is in the works for GNU awk), but you could do this (again with GNU awk for multi-char RS):
$ cat file
XA90;"standard"
XA100;"this is
the multi-line"
XA110;"other standard"
$ awk -v RS="\"[^\"]*\"" -v ORS= '{gsub(/\n/," ",RT); print $0 RT}' file
XA90;"standard"
XA100;"this is the multi-line"
XA110;"other standard"
to replace all newlines within quotes with blank chars and then process it as regular 1-line-per-record file.
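(RT is a gawk-specific variable holding the input text that matched RS for the current record, i.e. the whole quoted field here, so gsub(/\n/," ",RT) flattens only the quoted part.)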
Using PS response, this works for the small example:
sed 's/^X/\n&/' file.csv | awk -v RS= '/XA100/ {print}'
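This works because the sed command inserts a blank line before each line starting with X, and an empty RS (RS=) puts awk into paragraph mode, where records are separated by blank lines.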
For my real-world CSV file, with many columns, with the searched-for term possibly anywhere, with an unknown number of multi-line fields, with " characters escaped as "", with multi-line continuation lines beginning with ", and with all fields encapsulated by ", this works. Note the exclusion of a second " character in the sed part:
sed 's/^"[^"]/\n&/' file.csv | awk -v RS= '/RESEARCH_TERM/ {print}'
This works because the first column of an entry cannot start with "". The first column always looks like "XXXXXXXXX", where X is any character but ".
Thank you all for so many responses; other solutions may work depending on the CSV file format you use.

Using a multi-character field separator in awk on Solaris

I wish to use a string (BIRCH) as a field delimiter in awk to print the second field. I am trying the following command:
cat tmp.log|awk -FBirch '{ print $2}'
The following output is printed:
irch2014/06/23,04:36:45,3,1401503,xml-harlan,P12345-1,temp,0a653356353635635,temp,L,Success
Desired output:
2014/06/23,04:36:45,3,1401503,xml-harlan,P12345-1,temp,0a653356353635635,temp,L,Success
Contents of tmp.log file.
-bash-3.2# cat tmp.log
Dec 05 13:49:23 [x.x.x.x.180.100] business-log-dev/int [TEST][0x80000001][business-log][info] mpgw(Test): trans(8497187)[request][10.x.x.x]:
Birch2014/06/23,04:36:45,3,1401503,xml-harlan,P12345-1,temp,0a653356353635635,temp,L,Success
Am I doing something wrong?
OS: Solaris10
Shell: Bash
I tried the command suggested in one of the answers below. I am getting the desired output, but with an extra empty line at the top. How can this be eliminated from the output?
-bash-3.2# /usr/xpg4/bin/awk -FBirch '{print $2}' tmp.log
2014/06/23,04:36:45,3,1401503,xml-harlan,P12345-1,temp,0a653356353635635,temp,L,Success
Originally, I suggested putting quotes around "Birch" (-F'Birch') but actually, I don't think that should make any difference.
I'm not at all experienced working with Solaris but you may want to also try using nawk ("new awk") instead of awk.
nawk -FBirch '{print $2}' file
If this works, you may want to consider creating an alias so that you always use the newer version of awk with more features.
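For example (a sketch; put it in ~/.profile or wherever your aliases live):
alias awk=nawk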
You may also want to try using the version of awk in the /usr/xpg4/bin directory, which is a POSIX compliant implementation so should support multi-character FS:
/usr/xpg4/bin/awk -FBirch '{print $2}' file
If you only want to print lines which have more than one field, you can add a condition:
/usr/xpg4/bin/awk -FBirch 'NF>1{print $2}' file
This only prints the second field when there is more than one field, which also eliminates the empty line at the top: the first line of tmp.log contains no Birch, so its $2 is empty.
From the man page of the default awk on solaris usr/bin/awk
-Fc Uses the character c as the field separator
(FS) character. See the discussion of FS
below.
As you can see, Solaris awk takes only a single character as the field separator.
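That matches the output you saw: -FBirch silently used only the first character, B, as the separator, so $2 began right after the first B, producing irch2014/06/23,....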
Also in the man page is the description of split:
split(s, a, fs)
Split the string s into array elements a[1], a[2], ...
a[n], and returns n. The separation is done with the
regular expression fs or with the field separator FS if
fs is not given.
As you can see, here it takes a regular expression as the separator, so we can use:
awk 'split($0,a,"Birch"){print a[2]}' file
to print the second field after splitting on Birch.
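If some lines contain no Birch at all, split still returns 1 for them, the pattern is still true, and an empty a[2] prints as a blank line; requiring more than one piece mirrors the NF>1 guard above:
awk 'split($0,a,"Birch") > 1 {print a[2]}' tmp.log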
