Bash - why can't a file be overwritten without the use of a temp file?

The standard procedure for overwriting a file is usually the following:
awk '{print $2*3}' file > tmpFile
cat tmpFile > file
However, this can be a bit of a hassle, because the temp file then has to be removed once it is no longer needed.
So, why is it not possible to do this in the following way (without the need of a temp file):
awk '{print $2*3}' file > file
The reason I ask is that I know it is possible to append to a file, like so:
awk '{print $2*3}' file >> file
So if appending to a file with >> as shown above works fine, why can't one overwrite a file in the same way? Why are the two redirections so different?
Moreover, does there exist a way of bypassing the need for a temp file (perhaps in a fashion similar to the 2nd excerpt), or is the first excerpt the only way?
NOTE: the awk command is irrelevant; it can be replaced by any other command

Using a temp file is a good idea because you can never be sure whether the entire file will be read into memory first. If the file is opened for writing before it has been read, you may get a different result than you expected.
When appending, the existing content is never touched and new data only goes onto the end of the file, so there is never a part of the original left unread that could be lost.
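For reference, a minimal sketch of the usual temp-file pattern (mktemp picks the temp file name; the && ensures the original is only replaced if the command succeeded):
tmpFile=$(mktemp) || exit 1
awk '{print $2*3}' file > "$tmpFile" && mv "$tmpFile" file
Using mv also means there is no leftover temp file to clean up afterwards.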

Probably not a great idea (reading and writing the same file), but if you insist on doing it, you could use the <> read/write redirection operator, which opens the file without truncating it:
gawk '{print $2*3}' file 1<> file
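For illustration (my own example, not part of the original answer): since the file is never truncated, leftover bytes from the old content can remain at the end whenever the new output is shorter than the original.
printf '1 2\n3 4\n' > file
gawk '{print $2*3}' file 1<> file     # overwrites from offset 0, never truncates
cat file                              # 6 and 12, possibly followed by remnants of the old content
So this only behaves as expected when the output is at least as long as the input, which is one more reason to prefer a temp file or sponge.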

There's a tool for everything. You can use sponge.
awk '{print $2*3}' file | sponge file
You can get it from the moreutils package. The man page reads:
NAME
sponge - soak up standard input and write to a file
SYNOPSIS
sed '...' file | grep '...' | sponge [-a] file
DESCRIPTION
sponge reads standard input and writes it out to the specified file.
Unlike a shell redirect, sponge soaks up all its input before writing the
output file. This allows constructing pipelines that read from and write
to the same file.
sponge preserves the permissions of the output file if it already exists.
When possible, sponge creates or updates the output file atomically by
renaming a temp file into place. (This cannot be done if TMPDIR is not in
the same filesystem.)
If the output file is a special file or symlink, the data will be written
to it, non-atomically.
If no file is specified, sponge outputs to stdout.
OPTIONS
-a
Replace the file with a new file that contains the file's original
content, with the standard input appended to it. This is done
atomically when possible.
AUTHOR
Colin Watson and Tollef Fog Heen
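For example, to get the effect of the >> case from the question (reading a file and appending the result to that same file), the -a flag can be used; a sketch, assuming moreutils is installed:
awk '{print $2*3}' file | sponge -a file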

If you happen to be on a Mac, you can emulate a copy & paste operation to do in-place edits indirectly without a temp file:
awk '{ ... }' file | LC_ALL=C pbcopy; LC_ALL=C pbpaste > file
I don't know what the equivalent commands are for Linux or other platforms. Avoid this if your file is over 500 MB in size.
You can also use this with perl or python etc., since the pasteboard copy is simply reading its contents via /dev/stdin.
This is only a convenience shortcut and doesn't guarantee atomic operations whatsoever.

Related

Remove duplicate lines and overwrite file in same command

I'm trying to remove duplicate lines from a file and update the file. For some reason I have to write it to a new file and replace it. Is this the only way?
awk '!seen[$0]++' .gitignore > .gitignore
awk '!seen[$0]++' .gitignore > .gitignore_new && mv .gitignore_new .gitignore
Redirecting to the same file you use as input, like:
awk '!seen[$0]++' .gitignore > .gitignore
will leave you with an empty file. This is because with the > operator the shell opens and truncates the file before the command gets executed, meaning you lose all your data.
With newer versions of GNU awk you can use the -i inplace option to edit the file in place:
awk -i inplace '!seen[$0]++' .gitignore
If you don't have a recent version of GNU awk, you'll need to create a temporary file:
awk '!seen[$0]++' .gitignore > .gitignore.tmp
mv .gitignore.tmp .gitignore
Another alternative is to use the sponge program from moreutils:
awk '!seen[$0]++' .gitignore | sponge .gitignore
sponge soaks up all of standard input and only opens the output file after that. This effectively keeps the input file intact until it is time to write to it.
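A quick illustration of the -i inplace route (my own example; it needs GNU awk 4.1 or later):
printf 'a\nb\na\n' > .gitignore
awk -i inplace '!seen[$0]++' .gitignore
cat .gitignore     # a and b remain, the duplicate line is gone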
Thomas, I believe the problem is that you are reading from the file and writing to it in the same command. This is why you must write to a temporary file first.
The > does overwrite, so you are using the correct redirect operator:
Redirect output from a command to a file on disk. Note: if the file already exists, it will be erased and overwritten without warning, so be careful.
Example: ps -ax > processes.txt uses the ps command to get a list of processes running on the system and stores the output in a file named processes.txt.
Yes, because if you don't, the shell will open the file descriptor and truncate .gitignore before the awk process has even started.
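You can watch that ordering happen (my own illustration, not from the thread):
printf 'a\na\nb\n' > demo.txt
awk '!seen[$0]++' demo.txt > demo.txt
wc -c demo.txt     # prints 0 demo.txt: the shell truncated the file before awk ever read a byte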

how to delete string from text file and save as same file?

I'm trying to write a bash function to delete a garbage string (including the quotation marks and parentheses) from a text file and resave the file with the same filename.
I got it to work when I save it as a new file...
function fixfile()
{
sed 's/(\"garbagestringhere\")//g' /Users/peter/.emacs.d/recent-addresses > /Users/peter/.emacs.d/recent-addresses-fixed;
}
...but when I save it with the same filename as before, it wipes the file:
function fixfile()
{
sed 's/(\"garbagestringhere\")//g' /Users/peter/.emacs.d/recent-addresses > /Users/peter/.emacs.d/recent-addresses;
}
How do I fix this?
For GNU sed, man sed will reveal the -i option:
-i[SUFFIX], --in-place[=SUFFIX]
edit files in place (makes backup if SUFFIX supplied)
If you are on OS X (or BSD) you can use (see this reference answer):
sed -i '' 's/whatever//g' file
#      ^ note the space before the empty backup suffix
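Applied to the function from the question, the in-place version might look like this (a sketch; use sed -i '' on OS X/BSD as noted above):
function fixfile()
{
sed -i 's/(\"garbagestringhere\")//g' /Users/peter/.emacs.d/recent-addresses;
}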
Even though in your particular case I suggest using sed -i, for a more general approach you can consider using sponge from moreutils.
sed '...' file | grep '...' | awk '...' | sponge file
sponge reads standard input and writes it out to the specified file. Unlike a shell redirect, sponge soaks up all its input before opening the output file. This allows constructing pipelines that read from and write to the same file.
If no output file is specified, sponge outputs to stdout.
(Get it from here, or from here...)
Alternatively, you can use a temporary file that you delete before you exit from your function: a temporary file with a random (or pseudorandom) name, for example one created by mktemp.
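A sketch of that approach (mktemp and the variable name are my own choices):
function fixfile()
{
local tmp
tmp=$(mktemp) || return 1
sed 's/(\"garbagestringhere\")//g' /Users/peter/.emacs.d/recent-addresses > "$tmp" &&
mv "$tmp" /Users/peter/.emacs.d/recent-addresses
}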

overwrite contents of a file: alternative to `>`?

I often find myself stringing together a series of shell commands, ultimately with the goal of replacing the contents of a file. However, when using >, the shell opens (and truncates) the original file for writing, so you lose all the contents.
For lack of a better term, is there a "lazy evaluation" version of > that will wait until all the previous commands have been executed before opening the file for writing?
Currently I'm using:
somecommand file.txt | ... | ... > tmp.txt && rm file.txt && mv tmp.txt file.txt
Which is quite ugly.
sponge will help here:
(Quoting from the manpage)
NAME
sponge - soak up standard input and write to a file
SYNOPSIS
sed '...' file | grep '...' | sponge file
DESCRIPTION
sponge reads standard input and writes it out to the specified file.
Unlike a shell redirect, sponge soaks up all its input before opening
the output file. This allows constructing pipelines that read from and
write to the same file.
It also creates the output file atomically by renaming a temp file into
place, and preserves the permissions of the output file if it already
exists. If the output file is a special file or symlink, the data will
be written to it.
If no output file is specified, sponge outputs to stdout.
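Applied to the pipeline from the question, that would be something like:
somecommand file.txt | ... | ... | sponge file.txt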
See also: Can I read and write to the same file in Linux without overwriting it? on unix.SE

Shell - saving contents of file to variable then outputting the variable

First off, I'm really bad at shell, as you'll notice :)
Now then, I have the following task: the script gets two arguments (fileName, N). If the number of lines in the file is greater than N, then I need to cut it down to the last N lines and overwrite the contents of the file with them.
I thought of saving the contents of the file into a variable, then just cat-ing that to the file. However, for some reason it's not working.
I have problems with saving the last N lines to a variable.
This is how I tried doing it:
lastNLines=`tail -$2 $1`
cat $lastNLines > $1
Your lastNLines is not a filename. cat takes filenames. You also cannot open the input file for writing, because the shell truncates it before tail can get to it, which is why you need to use a temporary file.
However, if you insist on not using a temporary file, here's a non-portable solution:
tail -n$2 $1 | sponge $1
You may need to install moreutils for sponge.
The arguments cat takes are file names, not the content.
Instead, you can use a temp file, like this:
tail -$2 $1 > $1._tmp
mv $1._tmp $1
To save the content to a variable, you can do what you already included in your question, or:
lastNLines=`cat $1`
(after the mv command, of course)
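If you prefer the variable approach from your question, a minimal sketch (printf instead of cat, since cat expects file names, not content):
lastNLines=$(tail -n "$2" "$1")
printf '%s\n' "$lastNLines" > "$1"
The command substitution finishes reading the file before the > redirection truncates it, so no temp file is needed; just be aware that command substitution strips trailing newlines.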

Best way to modify a file when using pipes?

I often have shell programming tasks where I run into this pattern:
cat file | some_script > file
This is unsafe - cat may not have read in the entire file before some_script starts writing to it. I don't really want to write the result to a temporary file (it's slow, and I don't want the added complication of thinking up a unique new name).
Perhaps there is a standard shell command that will buffer a whole stream until EOF is reached? Something like:
cat file | bufferUntilEOF | script > file
Ideas?
Like many others, I like to use temporary files. I use the shell process ID as part of the temporary name so that if multiple copies of the script are running at the same time, they won't conflict. Finally, I only overwrite the original file if the script succeeds (using boolean operator short-circuiting; it's a little dense but very nice for simple command lines). Putting that all together, it would look like:
some_script < file > smscrpt.$$ && mv smscrpt.$$ file
This will leave the temporary file if the command fails. If you want to clean up on error, you can change that to:
some_script < file > smscrpt.$$ && mv smscrpt.$$ file || rm smscrpt.$$
BTW, I got rid of the poor use of cat and replaced it with input redirection.
You're looking for sponge.
Using a temporary file is the correct solution here. When you use a redirection like '>', it is handled by the shell, and no matter how many commands are in your pipeline, the shell is free to truncate the output file before any command is executed (during pipeline setup).
Another option is just to read the file into a variable:
file_contents=$(cat file)
echo "$file_contents" | script1 | script2 > file
Using mktemp(1) or tempfile(1) saves you the expense of having to think up a unique filename.
In response to the OP's question above about using sponge without external dependencies, and building on #D.Shawley's answer, you can have the effect of sponge with only a dependency on gawk, which is not uncommon on Unix or Unix-like systems:
cat foo | gawk -voutfn=foo '{lines[NR]=$0;} END {if(NR>0){print lines[1]>outfn;} for(i=2;i<=NR;++i) print lines[i] >> outfn;}'
The print guarded by NR>0 is what truncates the output file: the first line is written with >, and the remaining lines are appended with >>.
To use this in a shell script, change -voutfn=foo to -voutfn="$1" or whatever syntax your shell uses for filename arguments. For example:
#!/bin/bash
cat "$1" | gawk -voutfn="$1" '{lines[NR]=$0;} END {if(NR>0){print lines[1]>outfn;} for(i=2;i<=NR;++i) print lines[i] >> outfn;}'
Note that, unlike real sponge, this may be limited to the size of RAM. sponge actually buffers in a temporary file if necessary.
Using a temporary file is IMO better than attempting to buffer the data in the pipeline.
It almost defeats the purpose of pipelines to buffer them.
I think you need to use mktemp. Something like this will work:
FILE=example-input.txt
TMP=`mktemp`
some_script <"$FILE" >"$TMP"
mv "$TMP" "$FILE"
I think that the best way is to use a temp file. However, if you want another approach, you can use something like awk to buffer up the input into memory before your application starts receiving input. The following script will buffer all of the input into the lines array before it starts to output it to the next consumer in the pipeline.
{ lines[NR] = $0; }
END {
for (line_no=1; line_no<=NR; ++line_no) {
print lines[line_no];
}
}
You can collapse it into a one-liner if you want:
cat file | awk '{lines[NR]=$0;} END {for(i=1;i<=NR;++i) print lines[i];}' > file
With all of that, I would still recommend using a temporary file for the output and then overwriting the original file with it. Note that the buffering only guards against the downstream command writing too early; it does not stop the shell from truncating file when it sets up the > redirection, so redirecting straight back to the same file is still a race.
