Remove duplicate lines and overwrite file in same command - bash

I'm trying to remove duplicate lines from a file and update that same file. It seems I have to write to a new file and then replace the original. Is this the only way?
This empties the file:
awk '!seen[$0]++' .gitignore > .gitignore
This works, but needs the extra step:
awk '!seen[$0]++' .gitignore > .gitignore_new && mv .gitignore_new .gitignore

Redirecting to the same output file as the input file, like:
awk '!seen[$0]++' .gitignore > .gitignore
will end with an empty file. This is because with the > operator, the shell opens and truncates the file before the command even gets executed, so you lose all your data.
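As a quick, hedged demonstration with a throwaway file (so you don't risk a real .gitignore):
printf 'a\nb\na\n' > demo.txt   # throwaway sample file with one duplicate line
awk '!seen[$0]++' demo.txt > demo.txt
cat demo.txt                    # prints nothing: the shell truncated demo.txt before awk ever read it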
With newer versions of GNU awk you can use the -i inplace option to edit the file in place:
awk -i inplace '!seen[$0]++' .gitignore
If you don't have a recent version of GNU awk, you'll need to create a temporary file:
awk '!seen[$0]++' .gitignore > .gitignore.tmp
mv .gitignore.tmp .gitignore
Another alternative is to use the sponge program from moreutils:
awk '!seen[$0]++' .gitignore | sponge .gitignore
sponge soaks up all of standard input and only opens the output file after that, so the input file is left intact until the new contents are written.

Thomas, I believe the problem is that you are reading from and writing to the same file in the same command. This is why you must write to a temporary file first.
The > does overwrite, so you are using the correct redirect operator:
Redirect output from a command to a file on disk. Note: if the file already exists, it will be erased and overwritten without warning, so be careful.
Example: ps -ax > processes.txt uses the ps command to get a list of processes running on the system and stores the output in a file named processes.txt.

Yes, because if you don't, the shell will open the file descriptor and truncate .gitignore even before the awk process has started.

Related

Remove duplicate lines from files recursively

I have a directory with a bunch of csv files. I want to remove the duplicate lines from all the files.
I have tried the awk solution, but it seems a bit tedious to do it for each and every file:
awk '!x[$0]++' file.csv
Even if I do
awk '!x[$0]++' *
I will lose the file names. Is there a way to remove duplicates from all the files using just one command or script?
Just to clarify
If there are 3 files in the directory, then the output should contain 3 files, each deduplicated independently. After running the command or script the same folder should contain 3 files, each with unique entries.
for f in dir/*; do
    awk '!a[$0]++' "$f" > "$f.uniq"
done
To overwrite the existing files, change it to awk '!a[$0]++' "$f" > "$f.uniq" && mv "$f.uniq" "$f" (after testing!). For true recursion into subdirectories, see the find-based sketch below.
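Since the question asks about doing this recursively, here is a hedged sketch using find; the dir path and the *.csv filter are assumptions, adjust as needed:
find dir -type f -name '*.csv' -print0 |
while IFS= read -r -d '' f; do
    awk '!a[$0]++' "$f" > "$f.uniq" && mv "$f.uniq" "$f"
done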
With GNU awk for "inplace" editing and automatic open/close management of output files:
awk -i inplace '!seen[FILENAME,$0]++' *.csv
This will create new files, with suffix .new, that have only unique lines:
gawk '!x[$0]++{print>(FILENAME".new")}' *.csv
How it works
!x[$0]++
This is a condition. It evaluates to true only if the current line, $0, has not been seen before.
print >(FILENAME".new")
If the condition evaluates to true, then this print statement is executed. It writes the current line to a file whose name is the name of the current file, FILENAME, followed by the string .new.
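For instance, with a hypothetical file a.csv that contains a duplicate line:
printf 'x\ny\nx\n' > a.csv
gawk '!x[$0]++{print>(FILENAME".new")}' a.csv
cat a.csv.new    # prints x and y only; a.csv itself is left untouched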

Bash: why can't a file be overwritten without the use of a temp file?

The standard procedure for overwriting a file is usually the following:
awk '{print $2*3}' file > tmpFile
mv tmpFile file
However, this is sometimes a bit of a hassle, because one must then remove the temp file after it is no longer being used.
So, why is it not possible to do this in the following way (without the need of a temp file) :
awk '{print $2*3}' file > file
The reason I ask is because I know that it is possible to append to a file as so:
awk '{print $2*3}' file >> file
So if appending to a file using >>, as shown above, works fine, why can't one overwrite a file in the same way? Why are the two commands so different?
Moreover, does there exist a way of bypassing the need for a temp file (perhaps in a fashion similar to the 2nd excerpt), or is the first excerpt the only way?
NOTE: the awk command is irrelevant, it can be replaced by any other command
Using a temp file is a good idea, because you can never be sure whether the entire file will be read into memory. If the command starts writing before it has finished reading, you might get a different result than you expected.
When appending, the command always works through the entire existing file before adding new content, so no part of the original is left unread.
Probably not a great idea (reading from and writing to the same file), but if you insist on doing it, you could use the shell's read-write redirection operator <> (here on the output, as 1<>, so the file is not truncated up front):
gawk '{print $2*3}' file 1<> file
Note that this neither removes leftover old bytes if the output is shorter than the input, nor protects the not-yet-read part of the file if the output runs ahead of it, so treat it as a curiosity.
There's a tool for everything. You can use sponge.
awk '{print $2*3}' file | sponge file
You can get it from the moreutils package. The man page reads:
NAME
sponge - soak up standard input and write to a file
SYNOPSIS
sed '...' file | grep '...' | sponge [-a] file
DESCRIPTION
sponge reads standard input and writes it out to the specified file.
Unlike a shell redirect, sponge soaks up all its input before writing the
output file. This allows constructing pipelines that read from and write
to the same file.
sponge preserves the permissions of the output file if it already exists.
When possible, sponge creates or updates the output file atomically by
renaming a temp file into place. (This cannot be done if TMPDIR is not in
the same filesystem.)
If the output file is a special file or symlink, the data will be written
to it, non-atomically.
If no file is specified, sponge outputs to stdout.
OPTIONS
-a
Replace the file with a new file that contains the file's original
content, with the standard input appended to it. This is done
atomically when possible.
AUTHOR
Colin Watson and Tollef Fog Heen
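As a quick hedged illustration of the -a option described above (the filenames here are made up):
sort -u new_entries.txt | sponge -a combined.log    # soak up the deduplicated input, then append it to combined.log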
If you happen to be on a Mac, you can emulate a copy & paste operation to do in-place edits indirectly, without a temp file:
awk '{ . . . }' file | LC_ALL=C pbcopy ; LC_ALL=C pbpaste > file
I don't know what the equivalent commands are for Linux or other platforms. Avoid this if your file is over 500 MB in size.
You can also use this for perl or python etc., since the "pasteboard copy" is simply reading in contents via /dev/stdin.
This is only a convenience shortcut and doesn't guarantee atomic operations whatsoever.

how to delete string from text file and save as same file?

I'm trying to write a bash function to delete a garbage string (including the quotation marks and parentheses) from a text file and resave the file with the same filename.
I got it to work when I save it as a new file...
function fixfile()
{
sed 's/(\"garbagestringhere\")//g' /Users/peter/.emacs.d/recent-addresses > /Users/peter/.emacs.d/recent-addresses-fixed;
}
...but when I save it with the same filename as before, it wipes the file:
function fixfile()
{
sed 's/(\"garbagestringhere\")//g' /Users/peter/.emacs.d/recent-addresses > /Users/peter/.emacs.d/recent-addresses;
}
How do I fix this?
For GNU sed, man sed will reveal the -i option:
-i[SUFFIX], --in-place[=SUFFIX]
edit files in place (makes backup if SUFFIX supplied)
If you are on OS X (or BSD) you can employ (see this reference answer):
sed -i '' 's/whatever//g' file
#     ^ note the space: BSD sed takes the (here empty) backup suffix as a separate argument
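A hedged example of keeping a backup while editing in place (generic file name; the .bak suffix is arbitrary):
sed -i.bak 's/(\"garbagestringhere\")//g' file     # GNU sed: the suffix must be attached to -i
sed -i .bak 's/(\"garbagestringhere\")//g' file    # BSD/macOS sed: the suffix is a separate argument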
Even though in your particular case I suggest the use of sed -i, for a more general approach you can consider the use of sponge from moreutils.
sed '...' file | grep '...' | awk '...' | sponge file
Sponge reads standard input and writes it out to the specified file. Unlike a shell redirect, sponge soaks up all its input before opening the output file. This allows constructing pipelines that read from and write to the same file.
If no output file is specified, sponge outputs to stdout.
Alternatively you can use a temporary file with a random (or pseudorandom) file name that you delete before you return from your function, as sketched below.
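A hedged sketch of that approach, assuming mktemp is available and reusing the path from the question:
function fixfile()
{
    local f=/Users/peter/.emacs.d/recent-addresses
    local tmp
    tmp=$(mktemp) || return 1
    sed 's/(\"garbagestringhere\")//g' "$f" > "$tmp" && mv "$tmp" "$f"
    rm -f "$tmp"    # no-op if the mv succeeded; cleans up if sed failed
}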

overwrite contents of a file: alternative to `>`?

I often find myself stringing together a series of shell commands, ultimately with the goal of replacing the contents of a file. However, when using >, the shell opens the original file for writing, so you lose all its contents.
For lack of a better term, is there a "lazy evaluation" version of > that will wait until all the previous commands have been executed before opening the file for writing?
Currently I'm using:
somecommand file.txt | ... | ... > tmp.txt && rm file.txt && mv tmp.txt file.txt
Which is quite ugly.
sponge will help here:
(Quoting from the manpage)
NAME
sponge - soak up standard input and write to a file
SYNOPSIS
sed '...' file | grep '...' | sponge file
DESCRIPTION
sponge reads standard input and writes it out to the specified file.
Unlike a shell redirect, sponge soaks up all its input before opening
the output file. This allows constructing pipelines that read from and
write to the same file.
It also creates the output file atomically by renaming a temp file into
place, and preserves the permissions of the output file if it already
exists. If the output file is a special file or symlink, the data will
be written to it.
If no output file is specified, sponge outputs to stdout.
See also: Can I read and write to the same file in Linux without overwriting it? on unix.SE

how to write finding output to same file using awk command

awk '/^nameserver/ && !modif { printf("nameserver 127.0.0.1\n"); modif=1 } {print}' testfile.txt
It displays the output, but I want to write the output to the same file, in my example testfile.txt.
Not possible per se. You need a second temporary file because you can't read and overwrite the same file. Something like:
awk '(PROGRAM)' testfile.txt > testfile.tmp && mv testfile.tmp testfile.txt
The mktemp program is useful for generating unique temporary file names.
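For example, a hedged sketch using mktemp (keeping (PROGRAM) as a placeholder for your actual awk script):
tmp=$(mktemp) &&
awk '(PROGRAM)' testfile.txt > "$tmp" &&
mv "$tmp" testfile.txt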
There are some hacks for avoiding a temporary file, but they rely mostly on caching and read buffers and quickly get unstable for larger files.
Since GNU Awk 4.1.0, there is the "inplace" extension, so you can do:
$ gawk -i inplace '{ gsub(/foo/, "bar") }; { print }' file1 file2 file3
To keep a backup copy of original files, try this:
$ gawk -i inplace -v INPLACE_SUFFIX=.bak '{ gsub(/foo/, "bar") }
> { print }' file1 file2 file3
This can be used to simulate the GNU sed -i feature.
See: Enabling In-Place File Editing
Despite the fact that using a temp file is correct, I don't like it because:
you have to be sure not to erase another temp file (yes, you can use mktemp, it's a pretty useful tool)
you have to take care of deleting it (or moving it like thiton said), including when your script crashes or stops before the end (so deleting temp files at the end of the script is not that wise)
it generates I/O on disk (OK, not that much, but we can make it lighter)
So my method to avoid a temp file is simple:
my_output="$(awk '(PROGRAM)' source_file)"
echo "$my_output" > source_file
Note the use of double quotes both when grabbing the output from the awk command and when using echo (if you don't, you won't keep the newlines).
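A minimal illustration of why the quotes around the variable matter:
var=$(printf 'line1\nline2\n')
echo $var      # unquoted: word splitting collapses the newline -> line1 line2
echo "$var"    # quoted: the embedded newline is preserved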
Had to make an account when seeing 'awk' and 'not possible' in one sentence. Here is an awk-only solution without creating a temporary file:
awk '{a[b++]=$0} END {for(c=0;c<b;c++) print a[c] > ARGV[1]}' file
You can also use sponge from moreutils.
For example
awk '!a[$0]++' file|sponge file
removes duplicate lines and
awk '{$2=10*$2}1' file|sponge file
multiplies the second column by 10.
Alternatively, include a redirection in the print statement of your awk program so that the output goes to a new file. Here total is a calculated value:
print $total, total >> "new_file"
This inline writing worked for me: redirect the output from print back to the original file. Like the other buffer-dependent tricks above, this is only safe for small inputs, since the redirection truncates the file as soon as print first runs.
echo "1" > test.txt
awk '{$1++; print> "test.txt"}' test.txt
cat test.txt
#$> 2
