awk command inside for loop to read and write multiple files [duplicate] - bash

I am learning awk and I would like to know if there is an option to write changes to file, similar to sed where I would use -i option to save modifications to a file.
I do understand that I could use redirection to write changes. However is there an option in awk to do that?

In GNU Awk 4.1.0 (released 2013) and later, it has the option of "inplace" file editing:
[...] The "inplace" extension, built using the new facility, can be used to simulate the GNU "sed -i" feature. [...]
Example usage:
$ gawk -i inplace '{ gsub(/foo/, "bar") }; { print }' file1 file2 file3
To keep the backup:
$ gawk -i inplace -v INPLACE_SUFFIX=.bak '{ gsub(/foo/, "bar") }
> { print }' file1 file2 file3

Unless you have GNU awk 4.1.0 or later...
You won't have such an option as sed's -i option so instead do:
$ awk '{print $0}' file > tmp && mv tmp file
Note: the -i is not magic, it is also creating a temporary file sed just handles it for you.
As of GNU awk 4.1.0...
GNU awk added this functionality in version 4.1.0 (released 10/05/2013). It is not as straight forwards as just giving the -i option as described in the released notes:
The new -i option (from xgawk) is used for loading awk library files. This differs from -f in that the first non-option argument
is treated as a script.
You need to use the bundled inplace.awk include file to invoke the extension properly like so:
$ cat file
123 abc
456 def
789 hij
$ gawk -i inplace '{print $1}' file
$ cat file
123
456
789
The variable INPLACE_SUFFIX can be used to specify the extension for a backup file:
$ gawk -i inplace -v INPLACE_SUFFIX=.bak '{print $1}' file
$ cat file
123
456
789
$ cat file.bak
123 abc
456 def
789 hij
I am happy this feature has been added but to me, the implementation isn't very awkish as the power comes from the conciseness of the language and -i inplace is 8 characters too long i.m.o.
Here is a link to the manual for the official word.

just a little hack that works
echo "$(awk '{awk code}' file)" > file

#sudo_O has the right answer.
This can't work:
someprocess < file > file
The shell performs the redirections before handing control over to someprocess (redirections). The > redirection will truncate the file to zero size (redirecting output). Therefore, by the time someprocess gets launched and wants to read from the file, there is no data for it to read.

An alternative is to use sponge:
awk '{print $0}' your_file | sponge your_file
Where you replace '{print $0}' by your awk script and your_file by the name of the file you want to edit in place.
sponge absorbs entirely the input before saving it to the file.

following won't work
echo $(awk '{awk code}' file) > file
this should work
echo "$(awk '{awk code}' file)" > file

In case you want an awk-only solution without creating a temporary file and usable with version!=(gawk 4.1.0):
awk '{a[b++]=$0} END {for(c=0;c<=b;c++)print a[c]>ARGV[1]}' file

Related

How to remove duplicate lines in a file?

I understand that the general approach is to use something like
$ sort file1.txt | uniq > file2.txt
But I was wondering if there was a way to do this without needing separate source and destination files, even if it means it can't be a one-liner.
Simply use the -o and -u options of sort:
sort -o file -u file
You don't need even to use a pipe for another command, such as uniq.
With GNU awk for "inplace" editing:
awk -i inplace '!seen[$0]++' file1.txt
As with all tools (except ed which requires the whole file to be read into memory first) that support "inplace" editing (sed -i, perl -i, ruby -i, etc.) this uses a temp file behind the scenes.
With any awk you can do the following with no temp files used but about twice the memory used instead:
awk '!seen[$0]++{a[++n]=$0} END{for (i=1;i<=n;i++) print a[i] > FILENAME}' file
With Perl's -i:
perl -i -lne 'print unless $seen{$_}++' original.file
-i changes the file "in place";
-n reads the input line by line, running the code for each line;
-l removes newlines from input and adds them to print;
The %seen hash idiom is described in perlfaq4.
A common idiom is:
temp=$(mktemp)
some_pipeline < original.file > "$temp" && mv "$temp" original.file
The && is important: if the pipeline fails, then the original file won't be overwritten with (perhaps) garbage.
The Linux moreutils package contains a program that encapsulates this away:
some_pipeline < original.file | sponge original.file

How to remove consecutive repeating characters from every line?

I have the below lines in a file
Acanthocephala;Palaeacanthocephala;Polymorphida;Polymorphidae;;Profilicollis;Profilicollis_altmani;
Acanthocephala;Eoacanthocephala;Neoechinorhynchida;Neoechinorhynchidae;;;;
Acanthocephala;;;;;;;
Acanthocephala;Palaeacanthocephala;Polymorphida;Polymorphidae;;Polymorphus;;
and I want to remove the repeating semi-colon characters from all lines to look like below (note- there are repeating semi-colons in the middle of some of the above lines too)
Acanthocephala;Palaeacanthocephala;Polymorphida;Polymorphidae;Profilicollis;Profilicollis_altmani;
Acanthocephala;Eoacanthocephala;Neoechinorhynchida;Neoechinorhynchidae;
Acanthocephala;
Acanthocephala;Palaeacanthocephala;Polymorphida;Polymorphidae;Polymorphus;
I would appreciate if someone could kindly share a bash one-liner to accomplish this.
You can use tr with "squeeze":
tr -s ';' < infile
perl -p -e 's/;+/;/g' myfile # writes output to stdout
or
perl -p -i -e 's/;+/;/g' myfile # does an in-place edit
If you want to edit the file itself:
printf "%s\n" 'g/;;/s/;\{2,\}/;/g' w | ed -s foo.txt
If you want to pipe a modified copy of the file to something else and leave the original unchanged:
sed 's/;\{2,\}/;/g' foo.txt | whatever
These replace runs of 2 or more semicolons with single ones.
could be solved easily by substitutions.
I add an awk solution by playing with the FS/OFS variable:
awk -F';+' -v OFS=';' '$1=$1' file
or
awk -F';+' -v OFS=';' '($1=$1)||1' file
Here's a sed version of alaniwi's answer:
sed 's/;\+/;/g' myfile # Write output to stdout
or
sed -i 's/;\+/;/g' myfile # Edit the file in-place

Using grep to pull a series of random numbers from a known line

I have a simple scalar file producing strings like...
bpred_2lev.ras_rate.PP 0.9413 # RAS prediction rate (i.e., RAS hits/used RAS)
Once I use grep to find this line in the output.txt, is there a way I can directly grab the "0.9413" portion? I am attempting to make a cvs file and just need whatever value is generated.
Thanks in advance.
There are several ways to combine finding and extracting into a single command:
awk (POSIX-compliant)
awk '$1 == "bpred_2lev.ras_rate.PP" { print $2 }' file
sed (GNU sed or BSD/OSX sed)
sed -En 's/^bpred_2lev\.ras_rate\.PP +([^ ]+).*$/\1/p' file
GNU grep
grep -Po '^bpred_2lev\.ras_rate\.PP +\K[^ ]+' file
You can use awk like this:
grep <your_search_criteria> output.txt | awk '{ print $2 }'

Extract string between two patterns (inclusive) while conserving the format

I have a file in the following format
cat test.txt
id1,PPLLTOMaaaaaaaaaaaJACK
id2,PPLRTOMbbbbbbbbbbbJACK
id3,PPLRTOMcccccccccccJACK
I am trying to identify and print the string between TOM and JACK including these two strings, while maintaining the first column FS=,
Desired output:
id1,TOMaaaaaaaaaaaJACK
id2,TOMbbbbbbbbbbbJACK
id3,TOMcccccccccccJACK
So far I have tried gsub:
awk -F"," 'gsub(/.*TOM|JACK.*/,"",$2) && !_[$0]++' test.txt > out.txt
and have the following output
id1 aaaaaaaaaaa
id2 bbbbbbbbbbb
id3 ccccccccccc
As you can see I am getting close but not able to include TOM and JACK patterns in my output. Plus I am also losing the original FS. What am I doing wrong?
Any help will be appreciated.
You are changing a field ($2) which causes awk to reconstruct the record using the value of OFS as the field separator and so in this case changing the commas to spaces.
Never use _ as a variable name - using a name with no meaning is just slightly better than using a name with the wrong meaning, just pick a name that means something which, in this case is seen but idk what you are trying to do when using that in this context.
gsub() and sub() do not support capture groups so you either need to use match()+substr():
$ awk 'BEGIN{FS=OFS=","} match($2,/TOM.*JACK/){$2=substr($2,RSTART,RLENGTH)} 1' file
id1,TOMaaaaaaaaaaaJACK
id2,TOMbbbbbbbbbbbJACK
id3,TOMcccccccccccJACK
or use GNU awk for the 3rd arg to match()
$ gawk 'BEGIN{FS=OFS=","} match($2,/TOM.*JACK/,a){$2=a[0]} 1' file
id1,TOMaaaaaaaaaaaJACK
id2,TOMbbbbbbbbbbbJACK
id3,TOMcccccccccccJACK
or for gensub():
$ gawk 'BEGIN{FS=OFS=","} {$2=gensub(/.*(TOM.*JACK).*/,"\\1","",$2)} 1' file
id1,TOMaaaaaaaaaaaJACK
id2,TOMbbbbbbbbbbbJACK
id3,TOMcccccccccccJACK
The main difference between the match() and gensub() solutions is how they would behave if TOM appeared twice on the line:
$ cat file
id1,PPLLfooTOMbarTOMaaaaaaaaaaaJACK
id2,PPLRTOMbbbbbbbbbbbJACKfooJACKbar
id3,PPLRfooTOMbarTOMcccccccccccJACKfooJACKbar
$
$ awk 'BEGIN{FS=OFS=","} match($2,/TOM.*JACK/,a){$2=a[0]} 1' file
id1,TOMbarTOMaaaaaaaaaaaJACK
id2,TOMbbbbbbbbbbbJACKfooJACK
id3,TOMbarTOMcccccccccccJACKfooJACK
$
$ awk 'BEGIN{FS=OFS=","} {$2=gensub(/.*(TOM.*JACK).*/,"\\1","",$2)} 1' file
id1,TOMaaaaaaaaaaaJACK
id2,TOMbbbbbbbbbbbJACKfooJACK
id3,TOMcccccccccccJACKfooJACK
and just to show one way of stopping at the first instead of the last JACK on the line:
$ awk 'BEGIN{FS=OFS=","} match($2,/TOM.*JACK/,a){$2=gensub(/(JACK).*/,"\\1","",a[0])} 1' file
id1,TOMbarTOMaaaaaaaaaaaJACK
id2,TOMbbbbbbbbbbbJACK
id3,TOMbarTOMcccccccccccJACK
Use capture groups to save the parts of the line you want to keep. Here's how to do it with sed
sed 's/^\([^,]*,\).*\(TOM.*JACK\).*/\1\2/' <test.txt > out.txt
Do you mean to do the following?
$ cat test.txt
id1,PPLLTOMaaaaaaaaaaaJACKABCD
id2,PPLRTOMbbbbbbbbbbbJACKDFCC
id3,PPLRTOMcccccccccccJACKSDER
$ cat test.txt | sed -e 's/,.*TOM/,TOM/g' | sed -e 's/JACK.*/JACK/g'
id1,TOMaaaaaaaaaaaJACK
id2,TOMbbbbbbbbbbbJACK
id3,TOMcccccccccccJACK
$
This should work as long as the TOM and JACK do not repeat themselves.
sed 's/\(.*,\).*\(TOM.*JACK\).*/\1\2/' <oldfile >newfile
Output:
id1,TOMaaaaaaaaaaaJACK
id2,TOMbbbbbbbbbbbJACK
id3,TOMcccccccccccJACK

How to increment number in a file

I have one file with the date like below,let say file name is file1.txt:
2013-12-29,1
Here I have to increment the number by 1, so it should be 1+1=2 like..
2013-12-29,2
I tried to use 'sed' to replace and must be with variables only.
oldnum=`cut -d ',' -f2 file1.txt`
newnum=`expr $oldnum + 1`
sed -i 's\$oldnum\$newnum\g' file1.txt
But I get an error from sed syntax, is there any way for this. Thanks in advance.
Sed needs forward slashes, not back slashes. There are multiple interesting issues with your use of '\'s actually, but the quick fix should be (use double quotes too, as you see below):
oldnum=`cut -d ',' -f2 file1.txt`
newnum=`expr $oldnum + 1`
sed -i "s/$oldnum\$/$newnum/g" file1.txt
However, I question whether sed is really the right tool for the job in this case. A more complete single tool ranging from awk to perl to python might work better in the long run.
Note that I used a $ end-of-line match to ensure you didn't replace 2012 with 2022, which I don't think you wanted.
usually I would like to use awk to do jobs like this
following is the code might work
awk -F',' '{printf("%s\t%d\n",$1,$2+1)}' file1.txt
Here is how to do it with awk
awk -F, '{$2=$2+1}1' OFS=, file1.txt
2013-12-29,2
or more simply (this will file if value is -1)
awk -F, '$2=$2+1' OFS=, file1.txt
To make a change to the change to the file, save it somewhere else (tmp in the example below) and then move it back to the original name:
awk -F, '{$2=$2+1}1' OFS=, file1.txt >tmp && mv tmp file1.txt
Or using GNU awk, you can do this to skip temp file:
awk -i include -F, '{$2=$2+1}1' OFS=, file1.txt
Another, single line, way would be
expr cat /tmp/file 2>/dev/null + 1 >/tmp/file
this works if the file doesn't exist or if the file doesnt contain a valid number - in both cases the file is (re)created with a value of 1
awk is the best for your problem, but you can also do the calculation in shell
In case you have more than one rows, I am using loop here
#!/bin/bash
IFS=,
while read DATE NUM
do
echo $DATE,$((NUM+1))
done < file1.txt
Bash one liner option with BC. Sample:
$ echo 3 > test
$ echo 1 + $(<test) | bc > test
$ cat test
4
Also works:
bc <<< "1 + $(<test)" > test

Resources