What's wrong with repeating entries in awk print statements? - windows

I was trying to answer this other question, about how to repeat an existing column.
I thought this to be fairly easy, just by doing something like:
awk '{print $0 $2}'
This, however, only seems to print $0.
So, I decided to do some more tests:
awk '{print $0 $0}' // prints the entire line only once
awk '{print $1 $1 $1}' // prints the first entry only once
awk '{print $2 $1 $0}' // prints the first entry, followed
// by the entire line
// (the second part is not printed)
...
Looking at the results, I get the impression that awk is more or less checking what it has already printed and refuses to print it a second time.
Why is that?
I'm using awk from the Windows Subsystem for Linux (WSL), more precisely the Ubuntu app from Canonical. This is the result of awk --version:
GNU Awk 5.0.1, API: 2.0 (GNU MPFR 4.0.2, GNU MP 6.2.0)
Copyright (C) 1989, 1991-2019 Free Software Foundation.

awk '{print $0 $0}' // prints the entire line only once
awk '{print $0 $2}' // prints only $0
All these are due to the presence of DOS line breaks (\r) in your file. The \r that ends up inside $0 sends the cursor back to the beginning of the line, so whatever is printed after it overwrites the line from the start; the two parts overlap and you only get to see one of them in the output.
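You can confirm this by making the carriage returns visible first (file stands for your input file):
cat -v file        # DOS line breaks show up as ^M at the end of each line
od -c file | head  # or look for \r in the octal dump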
You can remove \r using tr or sed like this:
tr -d '\r' < file > file.new
sed -i.bak $'s/\\r$//' file
Or you can ask awk to treat \r\n as the record separator (note: a multi-character RS needs gnu-awk):
awk -v RS='\r\n' '{print $0, $0}' file


Move lines in file using awk/sed

Hi, my files look like:
>ID.1
GGAACACGACATCCTGCAGGGTTAAAAAAGAAAAAATCAGTAAAAGTACTGGA
>ID.2
GGAATACCACATCCCGCAGGGTTAAAAAAGAAAAAATCAGTAACAGTACTGGA
and I want to move the lines so that line 1 swaps with 3, and line 2 swaps with 4.
>ID.2
GGAATACCACATCCCGCAGGGTTAAAAAAGAAAAAATCAGTAACAGTACTGGA
>ID.1
GGAACACGACATCCTGCAGGGTTAAAAAAGAAAAAATCAGTAAAAGTACTGGA
I have thought about using cut to send the lines into other files and then bring them all back in the desired order using paste, but is there a solution using awk/sed?
EDIT: The file always has 4 lines (2 FASTA entries), no more.
For such a simple case, as #Ed_Morton mentioned, you can just swap the two equal-sized halves with the head and tail commands:
$ tail -2 test.txt; head -2 test.txt
>ID.2
GGAATACCACATCCCGCAGGGTTAAAAAAGAAAAAATCAGTAACAGTACTGGA
>ID.1
GGAACACGACATCCTGCAGGGTTAAAAAAGAAAAAATCAGTAAAAGTACTGGA
Generic solution with GNU tac to reverse contents:
$ tac -bs'>' ip.txt
>ID.2
GGAATACCACATCCCGCAGGGTTAAAAAAGAAAAAATCAGTAACAGTACTGGA
>ID.1
GGAACACGACATCCTGCAGGGTTAAAAAAGAAAAAATCAGTAAAAGTACTGGA
By default tac reverses the input line by line, but you can customize the record separator.
Here, I'm assuming > can safely be used as a unique separator (passed via the -s option). The -b option puts the separator before its record in the output.
Using ed (in-place editing):
# move 3rd to 4th lines to the top
printf '3,4m0\nwq\n' | ed -s ip.txt
# move the last two lines to the top
printf -- '-1,$m0\nwq\n' | ed -s ip.txt
Using sed:
sed '1h;2H;1,2d;4G'
Store the first line in the hold space;
Add the second line to the hold space;
Don't print the first two lines;
Before printing the fourth line, append the hold space to it (i.e. append the 1st and 2nd line).
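Putting it together on the sample input (using file.txt as a placeholder name for the file):
$ sed '1h;2H;1,2d;4G' file.txt
>ID.2
GGAATACCACATCCCGCAGGGTTAAAAAAGAAAAAATCAGTAACAGTACTGGA
>ID.1
GGAACACGACATCCTGCAGGGTTAAAAAAGAAAAAATCAGTAAAAGTACTGGA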
The GNU AWK manual has an example of swapping two lines using getline. Since you know that
The file always has 4 lines (2 FASTA entries), no more.
you might care only about the case where the number of lines is evenly divisible by 4 and use getline in the following way. Let the content of file.txt be
>ID.1
GGAACACGACATCCTGCAGGGTTAAAAAAGAAAAAATCAGTAAAAGTACTGGA
>ID.2
GGAATACCACATCCCGCAGGGTTAAAAAAGAAAAAATCAGTAACAGTACTGGA
then
awk '{line1=$0;getline line2;getline line3;getline line4;printf "%s\n%s\n%s\n%s\n",line3,line4,line1,line2}' file.txt
gives output
>ID.2
GGAATACCACATCCCGCAGGGTTAAAAAAGAAAAAATCAGTAACAGTACTGGA
>ID.1
GGAACACGACATCCTGCAGGGTTAAAAAAGAAAAAATCAGTAAAAGTACTGGA
Explanation: store the current line ($0) in the variable line1, then read the next line into line2, the next into line3 and the next into line4; then use printf with 4 placeholders (%s), each followed by a newline (\n), filled in the order your requirement asks for.
(tested in GNU Awk 5.0.1)
GNU sed:
sed -zE 's/([^\n]*\n[^\n]*\n)([^\n]*\n[^\n]*\n?)/\2\1/' file
A Perl:
perl -0777 -pe 's/(.*\R.*\R)(.*\R.*\R?)/\2\1/' file
A ruby:
ruby -ne 'BEGIN{lines=[]}
lines<<$_
END{puts lines[2...4]+lines[0...2] }' file
Paste and awk:
paste -s file | awk -F'\t' '{print $3, $4, $1, $2}' OFS='\n'
A POSIX pipe (paste joins each pair of lines with a tab, nl numbers the pairs, sort -nr reverses their order, cut drops the numbers, and tr turns the tabs back into newlines):
paste -sd'\t\n' file | nl | sort -nr | cut -f 2- | tr '\t' '\n'
This seems to work:
awk -F'\n' '{print $3, $4, $1, $2}' OFS='\n' RS= ORS='\n\n' file.txt

awk expression that works on awk v4.0.2 but it does not on >= 4.2.1

I have this awk command:
echo www.host.com |awk -F. '{$1="";OFS="." ; print $0}' | sed 's/^.//'
which what it does is to get the domain from the hostname:
host.com
That command works on CentOS 7 (awk 4.0.2), but it does not work on Ubuntu 19.04 (awk 4.2.1) nor on Alpine (gawk 5.0.1); the output is:
host com
How could I fix that awk expression so it works in recent awk versions?
For your provided samples, could you please try the following. It matches everything from the very first . to the end of the line and then prints what comes after that first dot.
echo www.host.com | awk 'match($0,/\..*/){print substr($0,RSTART+1,RLENGTH-1)}'
OP's code fix: in case the OP wants to stay close to their own attempted code, the following may help. There are 2 points here: 1st, there is no need to pipe awk's output through any other command; 2nd, FS and OFS should be set once in the BEGIN section instead of on every line, as the original command does.
echo www.host.com | awk 'BEGIN{FS=OFS="."} {$1="";sub(/\./,"");print}'
To get the domain, use:
$ echo www.host.com | awk 'BEGIN{FS=OFS="."}{print $(NF-1),$NF}'
host.com
Explained:
awk '
BEGIN {                # before processing the data
    FS=OFS="."         # set input and output delimiters to .
}
{
    print $(NF-1),$NF  # then print the next-to-last and last fields
}'
It also works if you have arbitrarily long fqdns:
$ echo if.you.have.arbitrarily.long.fqdns.example.com |
awk 'BEGIN{FS=OFS="."}{print $(NF-1),$NF}'
example.com
And yeah, funny, your version really works with 4.0.2. And awk version 20121220.
Update:
Updated with some content checking features, see comments. Are there domains that go higher than three levels?:
$ echo and.with.peculiar.fqdns.like.co.uk |
awk '
BEGIN {
    FS=OFS="."
    pecs["co\034uk"]   # \034 is awk's SUBSEP, matching the ($(NF-1),$NF) key used below
}
{
    print (($(NF-1),$NF) in pecs?$(NF-2) OFS:"")$(NF-1),$NF
}'
like.co.uk
You got 2 very good answers on awk, but I believe this should be handled with cut because of the simplicity it offers in getting all fields starting from a known position:
echo 'www.host.com' | cut -d. -f2-
host.com
Options used are:
-d.: Set delimiter as .
-f2-: Extract all the fields starting from position 2
What you are observing was a bug in GNU awk which was fixed in release 4.2.1. The changelog states:
2014-08-12 Arnold D. Robbins
OFS being set should rebuild $0 using previous OFS if $0 needs to be
rebuilt. Thanks to Mike Brennan for pointing this out.
awk.h (rebuild_record): Declare.
eval.c (set_OFS): If not being called from var_init(), check if $0 needs rebuilding. If so, parse the record fully and rebuild it. Make OFS point to a separate copy of the new OFS for next time, since OFS_node->var_value->stptr was
already updated at this point.
field.c (rebuild_record): Is now extern instead of static. Use OFS and OFSlen instead of the value of OFS_node.
When reading the code in the OP, it states:
awk -F. '{$1="";OFS="." ; print $0}'
which, according to POSIX does the following:
-F.: set the field separator FS to represent the <dot>-character
read a record
Perform field splitting with FS="."
$1="": redefine field 1 and rebuild the record $0 using OFS. At this time, OFS is still a single space, so if the record $0 was www.foo.com it now reads _foo_com (underscores represent spaces). Note that the record is rebuilt but not re-split, so the number of fields does not change.
OFS=".": redefine the output field separator OFS to be the <dot>-character. This is where the bug happened: GNU awk knew that a rebuild was needed, but performed it with the new OFS instead of the old one.
print $0: print the record $0, which is now _foo_com.
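To see the rebuild behaviour on its own (this is just an illustration, not the OP's command): assigning any field after changing OFS forces $0 to be rejoined with the new separator.
$ echo 'www.host.com' | awk -F. '{OFS="-"; $1=$1; print $0}'
www-host-com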
The minimal change to your program would be:
awk -F. '{OFS="."; $1=""; print $0}'
The clean change would be:
awk 'BEGIN{FS=OFS="."}{$1="";print $0}'
The perfect change would be to replace the awk and sed with the cut solution from Anubahuva.
If you have the hostname in a shell variable, you can use parameter expansion to strip everything up to and including the first dot:
var=www.foo.com
echo ${var#*.}

Awk double-slash record separator

I am trying to separate RECORDS of a file based on the string, "//".
What I've tried is:
awk -v RS="//" '{ print "******************************************\n\n"$0 }' myFile.gb
Where the "******" etc, is just a trace to show me that the record is split.
However, the file also contains single / characters (by themselves), and my trace ****** is being printed at those as well, meaning that awk is interpreting them as my record separator too.
How can I get awk to only split records on // ????
UPDATE: I am running on Unix (the one that comes with OS X)
I found a temporary solution, being:
sed s/"\/\/"/"*"/g | awk -v RS="*" ...
But there must be a better way, especially with massive files that I am working with.
On a Mac, awk version 20070501 does not support multi-character RS. Here's an illustration using such an awk, and a comparison (on the same machine) with gawk:
$ /usr/bin/awk --version
awk version 20070501
$ /usr/bin/awk -v RS="//" '{print NR ":" $0}' <<< x//y//z
1:x
2:
3:y
4:
5:z
$ gawk -v RS="//" '{print NR ":" $0}' <<< x//y//z
1:x
2:y
3:z
If you cannot find a suitable awk, then pick a better character than *. For example, if tabs are acceptable, and if your shell supports $'...', then you could use this incantation of sed:
sed $'s,//,\t,g'
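Combined with the command from the question, the workaround would then look something like this (a sketch, assuming literal tab characters never occur in myFile.gb):
sed $'s,//,\t,g' myFile.gb | awk -v RS=$'\t' '{ print "******************************************\n\n"$0 }'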

Unix Output of command to text file

I'm reading from a file called IMSI.txt using the following command:
$ awk 'NR>2' IMSI.txt | awk '{print $NF}'
I need the output of this command to go to a new file called NEW.txt
So i did this :
$ awk 'NR>2' IMSI.txt | awk '{print $NF}' > NEW.txt
This worked fine, but when I open the file, the output from the command is all on the same line.
The newlines are being neglected.
As an example, if I get this output in the console
222
111
333
I open the text file and I get
222111333
How can I fix that?
Thank you for your help :)
PS: I am using Cygwin on Windows
I am guessing your (Windows-y) editor wants to see a carriage return + linefeed at the end of each line, rather than the bare linefeed that awk outputs. Change your print to this:
print $NF "\r"
so it looks like this altogether:
awk 'NR>2 {print $NF "\r"}' IMSI.txt
Simply set ORS to "\r\n", which makes awk generate DOS line endings for every output record. I believe this is the most natural solution:
awk -v ORS="\r\n" '{print $NF}' > NEW.txt
Tested on a virtual XP system with Cygwin.
From Awk's manual:
ORS The output record separator, by default a newline.
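Applied to the question's pipeline, the whole thing would look something like this (a sketch folding the NR>2 filter into the same command):
awk -v ORS='\r\n' 'NR>2{print $NF}' IMSI.txt > NEW.txt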

Add blank column using awk or sed

I have a file with the following structure (comma delimited)
116,1,89458180,17,FFFF,0403254F98
I want to add a blank column on the 4th field such that it becomes
116,1,89458180,,17,FFFF,0403254F98
Any inputs as to how to do this using awk or sed if possible ?
thank you
Assuming that none of the fields contain embedded commas, you can restate the task as replacing the third comma with two commas. This is just:
sed 's/,/,,/3'
With the example line from the file:
$ echo "116,1,89458180,17,FFFF,0403254F98" | sed 's/,/,,/3'
116,1,89458180,,17,FFFF,0403254F98
You can use this awk,
awk -F, '$4="," $4' OFS=, yourfile
(OR)
awk -F, '$4=FS$4' OFS=, yourfile
If you want to add 6th and 8th field,
awk -F, '{$4=FS$4; $1=FS$1; $6=FS$6}1' OFS=, yourfile
Through awk
$ echo '116,1,89458180,17,FFFF,0403254F98' | awk -F, -v OFS="," '{print $1,$2,$3,","$4,$5,$6}'
116,1,89458180,,17,FFFF,0403254F98
It prints an extra , after the third field (fields delimited by ,).
Through GNU sed
$ echo 116,1,89458180,17,FFFF,0403254F98| sed -r 's/^([^,]*,[^,]*,[^,]*)(.*)$/\1,\2/'
116,1,89458180,,17,FFFF,0403254F98
It captures all the characters up to (but not including) the third , into one group. The characters from the third , to the end of the line are stored in another group. In the replacement, we just add a , between these two captured groups.
Through Basic sed
$ echo 116,1,89458180,17,FFFF,0403254F98| sed 's/^\([^,]*,[^,]*,[^,]*\)\(.*\)$/\1,\2/'
116,1,89458180,,17,FFFF,0403254F98
echo 116,1,89458180,17,FFFF,0403254F98|awk -F',' '{print $1","$2","$3",,"$4","$5","$6}'
Non-awk
t="116,1,89458180,17,FFFF,0403254F98"
echo $(echo $t|cut -d, -f1-3),,$(echo $t|cut -d, -f4-)
You can use the awk command below to achieve that. Replace $3 with whichever field you want the blank column inserted in front of.
awk -F, '{$3="" FS $3;}1' OFS=, filename
sed -e 's/\([^,]*,\)\{3\}/&,/' YourFile
replace the sequence of 3 [content (non-comma) then comma] by itself followed by an extra comma
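For example, with the sample line:
$ echo 116,1,89458180,17,FFFF,0403254F98 | sed -e 's/\([^,]*,\)\{3\}/&,/'
116,1,89458180,,17,FFFF,0403254F98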
