overwrite contents of a file: alternative to `>`? - bash

I often find myself stringing together a series of shell commands, ultimately with the goal of replacing the contents of a file. However, using > opens the original file for writing right away, so you lose all the contents.
For lack of a better term, is there a "lazy evaluation" version of > that will wait until all the previous commands have been executed before opening the file for writing?
Currently I'm using:
somecommand file.txt | ... | ... > tmp.txt && rm file.txt && mv tmp.txt file.txt
Which is quite ugly.
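(The rm there is admittedly redundant - mv will overwrite the destination - so it can be trimmed to the following, but it still feels clumsy:)
somecommand file.txt | ... | ... > tmp.txt && mv tmp.txt file.txt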

sponge will help here:
(Quoting from the manpage)
NAME
sponge - soak up standard input and write to a file
SYNOPSIS
sed '...' file | grep '...' | sponge file
DESCRIPTION
sponge reads standard input and writes it out to the specified file.
Unlike a shell redirect, sponge soaks up all its input before opening
the output file. This allows constructing pipelines that read from and
write to the same file.
It also creates the output file atomically by renaming a temp file into
place, and preserves the permissions of the output file if it already
exists. If the output file is a special file or symlink, the data will
be written to it.
If no output file is specified, sponge outputs to stdout.
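Applied to the kind of pipeline in the question, that looks like this (grep and sort are only stand-in filters for whatever stages you actually chain together):
grep -v '^#' file.txt | sort | sponge file.txt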
See also: Can I read and write to the same file in Linux without overwriting it? on unix.SE

Related

How can I redirect output of a `sed` and `tr` pipe and overwrite the input file? [duplicate]

I would like to run a find and replace on an HTML file through the command line.
My command looks something like this:
sed -e s/STRING_TO_REPLACE/STRING_TO_REPLACE_IT/g index.html > index.html
When I run this and look at the file afterward, it is empty. It deleted the contents of my file.
When I run this after restoring the file again:
sed -e s/STRING_TO_REPLACE/STRING_TO_REPLACE_IT/g index.html
The stdout is the contents of the file, and the find and replace has been executed.
Why is this happening?
When the shell sees > index.html in the command line it opens the file index.html for writing, wiping off all its previous contents.
To fix this you need to pass the -i option to sed to make the changes inline and create a backup of the original file before it does the changes in-place:
sed -i.bak s/STRING_TO_REPLACE/STRING_TO_REPLACE_IT/g index.html
Without the .bak suffix, the command will fail on some platforms, such as Mac OS X.
An alternative, useful, pattern is:
sed -e 'script script' index.html > index.html.tmp && mv index.html.tmp index.html
That has much the same effect, without using the -i option, and additionally means that, if the sed script fails for some reason, the input file isn't clobbered. Further, if the edit is successful, there's no backup file left lying around. This sort of idiom can be useful in Makefiles.
Quite a lot of seds have the -i option, but not all of them; the posix sed is one which doesn't. If you're aiming for portability, therefore, it's best avoided.
sed -i 's/STRING_TO_REPLACE/STRING_TO_REPLACE_IT/g' index.html
This does a global in-place substitution on the file index.html. Quoting the string prevents problems with whitespace in the query and replacement.
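For example, with whitespace in both strings (the phrases are made up), the quoting keeps everything as a single argument to sed:
sed -i 's/Hello old world/Hello new world/g' index.html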
use sed's -i option, e.g.
sed -i.bak -e s/STRING_TO_REPLACE/REPLACE_WITH/g index.html
To change multiple files (and saving a backup of each as *.bak):
perl -p -i -e "s/\|/x/g" *
will take all files in the directory and replace | with x.
This is called a “Perl pie” (easy as pie).
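The same one-liner in a backup-friendly form, keeping a .bak copy of each file and touching only HTML files (the glob and suffix are just examples):
perl -p -i.bak -e "s/\|/x/g" *.html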
You should try using the option -i for in-place editing.
Warning: this is a dangerous method! It abuses the I/O buffers in Linux, and with specific buffering options it manages to work on small files. It is an interesting curiosity, but don't use it for a real situation!
Besides the -i option of sed
you can use the tee utility.
From man:
tee - read from standard input and write to standard output and files
So, the solution would be:
sed s/STRING_TO_REPLACE/STRING_TO_REPLACE_IT/g index.html | tee | tee index.html
-- here the tee is repeated to make sure that the pipeline is buffered. Then all commands in the pipeline are blocked until they get some input to work on. Each command in the pipeline starts when the upstream commands have written 1 buffer of bytes (the size is defined somewhere) to the input of the command. So the last command tee index.html, which opens the file for writing and therefore empties it, runs after the upstream pipeline has finished and the output is in the buffer within the pipeline.
Most likely the following won't work:
sed s/STRING_TO_REPLACE/STRING_TO_REPLACE_IT/g index.html | tee index.html
-- it will run both commands of the pipeline at the same time without any blocking. (Without blocking the pipeline should pass the bytes line by line instead of buffer by buffer. Same as when you run cat | sed s/bar/GGG/. Without blocking it's more interactive and usually pipelines of just 2 commands run without buffering and blocking. Longer pipelines are buffered.) The tee index.html will open the file for writing and it will be emptied. However, if you turn the buffering always on, the second version will work too.
sed -i.bak "s#https.*\.com#$pub_url#g" MyHTMLFile.html
If you have a link to be added, try this. Search for the URL as above (starting with https and ending with .com here) and replace it with a URL string. I have used a variable $pub_url here. The s means substitute and the g means global replacement.
It works!
The problem with the command
sed 'code' file > file
is that file is truncated by the shell before sed actually gets to process it. As a result, you get an empty file.
The sed way to do this is to use -i to edit in place, as other answers suggested. However, this is not always what you want. -i will create a temporary file that will then be used to replace the original file. This is problematic if your original file was a link (the link will be replaced by a regular file). If you need to preserve links, you can use a temporary variable to store the output of sed before writing it back to the file, like this:
tmp=$(sed 'code' file); echo -n "$tmp" > file
Better yet, use printf instead of echo since echo is likely to process \\ as \ in some shells (e.g. dash):
tmp=$(sed 'code' file); printf "%s" "$tmp" > file
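Wrapped up as a small helper (the name replace_in_place is made up, not a standard command), the same pattern reads:
# replace_in_place SED_SCRIPT FILE
replace_in_place() {
    tmp=$(sed "$1" "$2") || return    # run sed first; give up without touching FILE if it fails
    printf "%s\n" "$tmp" > "$2"       # write back; the \n restores the final newline that $(...) strips
}
replace_in_place 's/STRING_TO_REPLACE/STRING_TO_REPLACE_IT/g' index.html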
And the ed answer:
printf "%s\n" '1,$s/STRING_TO_REPLACE/STRING_TO_REPLACE_IT/g' w q | ed index.html
To reiterate what codaddict answered, the shell handles the redirection first, wiping out the index.html file, and then the shell invokes the sed command, passing it a now empty file.
I was searching for the option where I can define a line range, and found the answer. For example, I want to change host1 to host2 on lines 36-57.
sed '36,57 s/host1/host2/g' myfile.txt > myfile1.txt
You can use the gi flags as well to ignore character case.
sed '30,40 s/version/story/gi' myfile.txt > myfile1.txt
With all due respect to the above correct answers, it's always a good idea to "dry run" scripts like that, so that you don't corrupt your file and have to start again from scratch.
Just get your script to spill the output to the command line instead of writing it to the file, for example, like this:
sed -e s/STRING_TO_REPLACE/STRING_TO_REPLACE_IT/g index.html
OR
less index.html | sed -e s/STRING_TO_REPLACE/STRING_TO_REPLACE_IT/g
This way you can see and check the output of the command without getting your file truncated.
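If you want more than an eyeball check, diffing the original against the command's output shows exactly what would change (process substitution here assumes bash, zsh, or ksh):
diff index.html <(sed -e s/STRING_TO_REPLACE/STRING_TO_REPLACE_IT/g index.html)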

Bash - why can't a file be overwritten without the use of a temp file?

The standard procedure for overwriting a file is usually the following:
awk '{print $2*3}' file > tmpFile
mv tmpFile file
However, sometimes this proves to be a bit of a hassle, because one must then remove the temp file once it is no longer being used.
So, why is it not possible to do this in the following way (without the need of a temp file) :
awk '{print $2*3}' file > file
The reason I ask is that I know it is possible to append to a file like so:
awk '{print $2*3}' file >> file
So if appending to a file using >> as shown above works fine, why can't one overwrite a file in the same way? Why are the two operators so different?
Moreover, does there exist a way of bypassing the need for a temp file (perhaps in a fashion similar to the 2nd excerpt), or is the first excerpt the only way?
NOTE: the awk command is irrelevant; it can be replaced by any other command.
Using a temp file is a good idea because you can never be sure if the entire file will be read into memory. If you try to write it before it was read, then you might get a different result than you might have expected.
When using append, the command always goes through the entire file before adding new content, so there never remains a part of the file to be read.
Probably not a great idea (trying to read & write to same file), but if you insist on doing it, you could use the <> operator.
gawk '{print $2*3}' -- <> file
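For reference, <> opens the file for reading and writing without truncating it, so nothing is wiped up front; the flip side is that if the new output is shorter than the old contents, the stale tail of the file is left in place. A tiny demonstration (file is just the question's file name):
printf 'new first line\n' 1<> file    # overwrites the start of file in place; old bytes beyond that remain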
There's a tool for everything. You can use sponge.
awk '{print $2*3}' file | sponge file
You can get it from the moreutils package. The man page reads:
NAME
sponge - soak up standard input and write to a file
SYNOPSIS
sed '...' file | grep '...' | sponge [-a] file
DESCRIPTION
sponge reads standard input and writes it out to the specified file.
Unlike a shell redirect, sponge soaks up all its input before writing the
output file. This allows constructing pipelines that read from and write
to the same file.
sponge preserves the permissions of the output file if it already exists.
When possible, sponge creates or updates the output file atomically by
renaming a temp file into place. (This cannot be done if TMPDIR is not in
the same filesystem.)
If the output file is a special file or symlink, the data will be written
to it, non-atomically.
If no file is specified, sponge outputs to stdout.
OPTIONS
-a
Replace the file with a new file that contains the file's original
content, with the standard input appended to it. This is done
atomically when possible.
AUTHOR
Colin Watson and Tollef Fog Heen
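As a quick illustration of the -a option (the file names are made up):
grep 'ERROR' today.log | sponge -a errors.log    # append the matches without clobbering errors.log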
If you happen to be on a Mac, you can emulate a copy & paste operation to do in-place edits indirectly without a temp file:
awk '{ . . . }' file | LC_ALL=C pbcopy ; LC_ALL=C pbpaste > file
I don't know what the equivalent commands are for Linux or other platforms. Avoid this if your file is over 500 MB in size.
You can also use this for perl or python etc., since "pasteboard copy" is simply reading in contents via /dev/stdin.
This is only a convenience shortcut and doesn't guarantee atomic operations whatsoever.

How to delete a string from a text file and save as the same file?

I'm trying to write a bash function to delete a garbage string (including the quotation marks and parentheses) from a text file and resave the file with the same filename.
I got it to work when I save it as a new file...
function fixfile()
{
sed 's/(\"garbagestringhere\")//g' /Users/peter/.emacs.d/recent-addresses > /Users/peter/.emacs.d/recent-addresses-fixed;
}
...but when I save it with the same filename as before, it wipes the file:
function fixfile()
{
sed 's/(\"garbagestringhere\")//g' /Users/peter/.emacs.d/recent-addresses > /Users/peter/.emacs.d/recent-addresses;
}
How do I fix this?
For GNU sed, man sed will reveal the -i option:
-i[SUFFIX], --in-place[=SUFFIX]
edit files in place (makes backup if SUFFIX supplied)
If you are on OS X (or BSD) you can employ (see this reference answer)
sed -i '' 's/whatever//g' file
#     ^ note the space
Even if in your particular case I suggest the use of sed -i,
in a more general approach you can consider the use of sponge
from moreutils.
sed '...' file | grep '...' | awk '...' | sponge file
Sponge reads standard input and writes it out to the specified file. Unlike a shell redirect, sponge soaks up all its input before opening the output file. This allows constructing pipelines that read from and write to the same file.
If no output file is specified, sponge outputs to stdout.
Alternatively you can use a temporary file that you delete before you exit from your function.
Use a temporary file with a random (or pseudorandom) filename.
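A sketch of that temporary-file variant, using mktemp to pick the random name (the sed expression and paths are the ones from the question):
function fixfile()
{
    tmp=$(mktemp) || return
    sed 's/(\"garbagestringhere\")//g' /Users/peter/.emacs.d/recent-addresses > "$tmp" \
        && mv "$tmp" /Users/peter/.emacs.d/recent-addresses \
        || rm -f "$tmp"    # clean up the temp file if anything went wrong
}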

Why might this cause the file to be empty?

TMP="$$.FILE"
#Process puts contents into TMP
cat "$TMP" | sort | head > "$TMP"
I already made sure the file was not empty to begin with. Without the > "$TMP", it outputs something, but when it's stored again into the same file, it's empty. What might be the cause?
You cannot write to and read from a file at the same time. Here is roughly what happens:
> "$TMP" causes file to be opened for writing, which also truncates the file.
cat "$TMP" reads from now blank file.
File stays empty.
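A quick way to watch that happen (grep stands in for any filter; demo.txt is a made-up file name):
printf 'one\ntwo\nthree\n' > demo.txt
grep . demo.txt > demo.txt    # the shell truncates demo.txt before grep ever reads it
wc -c demo.txt                # prints: 0 demo.txt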
Commands that purport to modify a file in place in fact perform a bit of temp file shuffling under the covers. For example, sed -i will process an input file and save the results to input.tmp, then do mv input.tmp input at the end to overwrite the original. You should follow that model.
Those processes all get run in parallel, so the head command is truncating the file before cat has a chance to read it.
To get the result you want, you need to write the sort output to a different file then move that over the original.
cat "$TMP" | sort | head > "$TMP".new
mv "$TMP".new "$TMP"
The last redirection truncates the file which the first command reads, before anything really happens. So what happens is that cat tries to read a file which the redirection on head has already truncated. That is what causes the issue here; the > operator is a shell operator which means "truncate this file right away, and then have the process write its standard output into it".
On a related note, you don't need cat here.
Try this instead:
TMP="$$.FILE"
sort <"$TMP" | head > "$TMP.tmp"
mv "$TMP.tmp" "$TMP"

Best way to modify a file when using pipes?

I often have shell programming tasks where I run into this pattern:
cat file | some_script > file
This is unsafe - cat may not have read in the entire file before some_script starts writing to it. I don't really want to write the result to a temporary file (it's slow, and I don't want the added complication of thinking up a unique new name).
Perhaps there is a standard shell command that will buffer a whole stream until EOF is reached? Something like:
cat file | bufferUntilEOF | script > file
Ideas?
Like many others, I like to use temporary files. I use the shell process-id as part of the temporary name so that if multiple copies of the script are running at the same time, they won't conflict. Finally, I then only overwrite the original file if the script succeeds (using boolean operator short-circuiting - it's a little dense but very nice for simple command lines). Putting that all together, it would look like:
some_script < file > smscrpt.$$ && mv smscrpt.$$ file
This will leave the temporary file if the command fails. If you want to clean up on error, you can change that to:
some_script < file > smscrpt.$$ && mv smscrpt.$$ file || rm smscrpt.$$
BTW, I got rid of the poor use of cat and replaced it with input redirection.
You're looking for sponge.
Using a temporary file is the correct solution here. When you use a redirection like '>', it is handled by the shell, and no matter how many commands are in your pipeline, the shell is free to delete and overwrite the output file before any command is executed (during pipeline setup).
Another option is just to read the file into a variable:
file_contents=$(cat file)
echo "$file_contents" | script1 | script2 > file
Using mktemp(1) or tempfile(1) saves you the expense of having to think up a unique filename.
In response to the OP's question above about using sponge without external dependencies, and building on D. Shawley's answer, you can get the effect of sponge with only a dependency on gawk, which is not uncommon on Unix or Unix-like systems:
cat foo | gawk -voutfn=foo '{lines[NR]=$0;} END {if(NR>0){print lines[1]>outfn;} for(i=2;i<=NR;++i) print lines[i] >> outfn;}'
The first print uses a single >, which truncates the output file before the remaining lines are appended; the check for NR>0 simply skips that step when there is no input.
To use this in a shell script, change -voutfn=foo to -voutfn="$1" or whatever syntax your shell uses for filename arguments. For example:
#!/bin/bash
cat "$1" | gawk -voutfn="$1" '{lines[NR]=$0;} END {if(NR>0){print lines[1]>outfn;} for(i=2;i<=NR;++i) print lines[i] >> outfn;}'
Note that, unlike real sponge, this may be limited to the size of RAM. sponge actually buffers in a temporary file if necessary.
Using a temporary file is IMO better than attempting to buffer the data in the pipeline.
It almost defeats the purpose of pipelines to buffer them.
I think you need to use mktemp. Something like this will work:
FILE=example-input.txt
TMP=`mktemp`
some_script <"$FILE" >"$TMP"
mv "$TMP" "$FILE"
I think that the best way is to use a temp file. However, if you want another approach, you can use something like awk to buffer up the input into memory before your application starts receiving input. The following script will buffer all of the input into the lines array before it starts to output it to the next consumer in the pipeline.
{ lines[NR] = $0; }
END {
for (line_no=1; line_no<=NR; ++line_no) {
print lines[line_no];
}
}
You can collapse it into a one-liner if you want:
cat file | awk '{lines[NR]=$0;} END {for(i=1;i<=NR;++i) print lines[i];}' > file
With all of that, I would still recommend using a temporary file for the output and then overwriting the original file with it.
