Best way to modify a file when using pipes? - bash

I often have shell programming tasks where I run into this pattern:
cat file | some_script > file
This is unsafe - cat may not have read in the entire file before some_script starts writing to it. I don't really want to write the result to a temporary file (it's slow, and I don't want the added complication of thinking up a unique new name).
Perhaps there is a standard shell command that will buffer a whole stream until EOF is reached? Something like:
cat file | bufferUntilEOF | script > file
Ideas?

Like many others, I like to use temporary files. I use the shell process-id as part of the temporary name so that if multiple copies of the script are running at the same time, they won't conflict. Finally, I only overwrite the original file if the script succeeds (using boolean operator short-circuiting - it's a little dense but very nice for simple command lines). Putting that all together, it would look like:
some_script < file > smscrpt.$$ && mv smscrpt.$$ file
This will leave the temporary file if the command fails. If you want to clean up on error, you can change that to:
some_script < file > smscrpt.$$ && mv smscrpt.$$ file || rm smscrpt.$$
BTW, I got rid of the poor use of cat and replaced it with input redirection.

You're looking for sponge.
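Applied to the pattern in the question, that might look like this (a sketch; it assumes the moreutils package, which provides sponge, is installed):
some_script < file | sponge file
sponge soaks up all of its input before it opens file for writing, so the race in the question goes away.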

Using a temporary file is the correct solution here. When you use a redirection like '>', it is handled by the shell, and no matter how many commands are in your pipeline, the shell is free to delete and overwrite the output file before any command is executed (during pipeline setup).
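A quick way to see this in action (a throwaway sketch on a scratch file):
printf 'hello\nworld\n' > scratch.txt
cat scratch.txt | tr a-z A-Z > scratch.txt   # the shell truncates scratch.txt while setting up the pipeline
cat scratch.txt                              # prints nothing - the file is now empty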

Another option is just to read the file into a variable:
file_contents=$(cat file)
echo "$file_contents" | script1 | script2 > file

Using mktemp(1) or tempfile(1) saves you the expense of having to think up a unique filename.
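For example, a sketch of that pattern with automatic cleanup (the trap removes the temp file even if some_script fails):
tmp=$(mktemp) || exit 1
trap 'rm -f "$tmp"' EXIT
some_script < file > "$tmp" && mv "$tmp" file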

In response to the OP's question above about using sponge without external dependencies, and building on @D.Shawley's answer, you can get the effect of sponge with only a dependency on gawk, which is not uncommon on Unix or Unix-like systems:
cat foo | gawk -voutfn=foo '{lines[NR]=$0;} END {if(NR>0){print lines[1]>outfn;} for(i=2;i<=NR;++i) print lines[i] >> outfn;}'
The first print, with its > redirection, is what truncates and rewrites the file; the NR>0 check keeps an empty input from writing a stray blank line.
To use this in a shell script, change -voutfn=foo to -voutfn="$1" or whatever syntax your shell uses for filename arguments. For example:
#!/bin/bash
cat "$1" | gawk -voutfn="$1" '{lines[NR]=$0;} END {if(NR>0){print lines[1]>outfn;} for(i=2;i<=NR;++i) print lines[i] >> outfn;}'
Note that, unlike real sponge, this may be limited to the size of RAM. sponge actually buffers in a temporary file if necessary.

Using a temporary file is IMO better than attempting to buffer the data in the pipeline.
It almost defeats the purpose of pipelines to buffer them.

I think you need to use mktemp. Something like this will work:
FILE=example-input.txt
TMP=`mktemp`
some_script <"$FILE" >"$TMP"
mv "$TMP" "$FILE"

I think that the best way is to use a temp file. However, if you want another approach, you can use something like awk to buffer up the input into memory before your application starts receiving input. The following script will buffer all of the input into the lines array before it starts to output it to the next consumer in the pipeline.
{ lines[NR] = $0; }
END {
    for (line_no = 1; line_no <= NR; ++line_no) {
        print lines[line_no];
    }
}
You can collapse it into a one-liner if you want:
cat file | awk '{lines[NR]=$0;} END {for(i=1;i<=NR;++i) print lines[i];}' > file
With all of that, I would still recommend using a temporary file for the output and then overwriting the original file with it.

Related

Unable to modify file as part of entry point command

My Dockerfile's entry point CMD executes a shell script to modify a local file based on an environment variable before executing my application (Flask). The shell script is like so:
cat static/login.html | sed "s/some_match/some_substitute/g" > static/login.html
However, I am finding that the resulting file is zero bytes. Any ideas what might be going on?
Thanks.
I can reproduce your problem, but not explain it. Maybe other answers follow.
But here is a solution which fixes the problem:
Use a different file name for the output than for the input.
cat input.txt | sed "s/a/b/g" > input2.txt
Alternatively, use the -i.bak option.
sed -i.bak "s/b/a/g" input.txt
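Applied to the file in the question, the in-place form might look like this (a sketch; SOME_VAR stands in for whatever environment variable drives the substitution):
sed -i.bak "s/some_match/${SOME_VAR}/g" static/login.html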
I can only speculate about what exactly is going on:
Maybe the output file of your pipeline is opened for writing (and truncated, since the redirection is not appending) before the input is read.
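That speculation is easy to confirm on a throwaway copy of the file (a sketch; the copy is just so the real file is not harmed):
printf '<html>some_match</html>\n' > login-copy.html
cat login-copy.html | sed "s/some_match/some_substitute/g" > login-copy.html
wc -c login-copy.html   # reports 0 bytes: the > redirection truncated the file before cat read it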

Duplicate stdin to stdout

I am looking for a bash one-liner that duplicates stdin to stdout without interleaving. The only solution I have found so far is to use tee, but that produces interleaved output. What do I mean by this:
If e.g. a file f reads
a
b
I would like to execute
cat f | HERE_BE_COMMAND
to obtain
a
b
a
b
If I use tee - as the command, the output typically looks something like
a
a
b
b
Any suggestions for a clean solution?
Clarification
The cat f command is just an example of where the input can come from. In reality, it is a command that can (should) only be executed once. I also want to refrain from using temporary files, as the processed data is sort of sensitive and temporary files are always error-prone when the executed command gets interrupted. Furthermore, I am not interested in a solution that involves additional scripts (as stated above, it should be a one-liner) or preparatory commands that need to be executed prior to the actual duplication command.
Solution 1:
<command_which_produces_output> | { a="$(</dev/stdin)"; echo "$a"; echo "$a"; }
In this way, you're saving the content from the standard input in a (choose a better name please), and then echo'ing twice.
Notice $(</dev/stdin) is a similar but more efficient way to do $(cat /dev/stdin).
Solution 2:
Use tee in the following way:
<command_which_produces_output> | tee >(echo "$(</dev/stdin)")
Here, you're firstly writing to the standard output (that's what tee does), and also writing to a FIFO file created by process substitution:
>(echo "$(</dev/stdin)")
See for example the file it creates in my system:
$ echo >(echo "$(</dev/stdin)")
/dev/fd/63
Now, the echo "$(</dev/stdin)" part is just the way I found to read the entire file first and only then print it. It echoes the content read from the process substitution's standard input, but only once all the input has been read (unlike cat, which prints line by line).
Store the second copy in a temp file.
cat f | tee /tmp/showlater
cat /tmp/showlater
rm /tmp/showlater
Update:
As noted in the comments (@j.a.), the solution above will need to be adjusted to the OP's real needs: calling it is easier from a function, and you have to decide what to do about errors in your initial command and in the tee/cat/rm steps.
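A sketch of what such a function might look like (show_twice is a hypothetical name; error handling is kept minimal):
show_twice() {
    local tmp
    tmp=$(mktemp) || return 1   # give up if no temp file can be created
    tee "$tmp"                  # first copy, streamed as it arrives
    cat "$tmp"                  # second copy, once the input is complete
    rm -f "$tmp"
}
<command_which_produces_output> | show_twice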
I recommend tee /dev/stdout.
cat f | tee /dev/stdout
One possible solution I found is the following awk command:
awk '{d[NR] = $0} END {for (i=1;i<=NR;i++) print d[i]; for (i=1;i<=NR;i++) print d[i]}'
However, I feel there must be a more "canonical" way of doing this.
A simple bash script? But this will store all of stdin in memory; why not store the output in a file and read the file twice if you need to?
full=""
while read line
do
echo "$line"
full="$full$line\n"
done
printf $full
The best way would be to store the output in a file and show it later on. Using tee has the advantage of showing the output as it comes:
if tmpfile=$(mktemp); then
    commands | tee "$tmpfile"
    cat "$tmpfile"
    rm "$tmpfile"
else
    echo "Error creating temporary file" >&2
    exit 1
fi
If the amount of output is limited, you can do this:
output=$(commands); echo "$output"; echo "$output"
(Echoing twice keeps the two copies on separate lines; a single echo "$output$output" would glue the last line of the first copy onto the first line of the second, because command substitution strips the trailing newline.)
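One small variation (a sketch, not from the original answer): printf reuses its format string for each argument, so it can emit both copies in a single call:
output=$(commands); printf '%s\n' "$output" "$output"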

Bash- why can't a file be overwritten without the use of a temp file?

The standard procedure for overwriting a file is usually the following:
awk '{print $2*3}' file > tmpFile
cat tmpFile > file
However, sometimes this poses to be a bit of a hassle because then one must remove the temp file after it is no longer being used.
So, why is it not possible to do this in the following way (without the need of a temp file):
awk '{print $2*3}' file > file
The reason I ask is because I know that it is possible to append to a file as so:
awk '{print $2*3}' file >> file
So if appending to a file, using >> as shown above, works fine, why can't one overwrite a file in the same way? Why are the two commands so different?
Moreover, does there exist a way of bypassing the need for a temp file (perhaps in a fashion similar to the 2nd excerpt), or is the first excerpt the only way?
NOTE: the awk command is irrelevant; it can be replaced by any other command.
Using a temp file is a good idea because you can never be sure the whole file has been read into memory before something starts writing to it. With >, the shell truncates the file while it sets up the redirection, so the command ends up reading a file that has already been emptied and you get a different result than you expected.
When using append (>>), the file is opened without being truncated, so the original content is still there for the command to read; the new output simply goes on the end.
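A quick way to see the difference on a scratch file (a disposable sketch):
printf '1 2\n3 4\n' > file
awk '{print $2*3}' file > file    # file is now empty: > truncated it before awk ever read it
printf '1 2\n3 4\n' > file
awk '{print $2*3}' file >> file   # original lines survive; the results are appended
cat file                          # 1 2, 3 4, 6, 12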
Probably not a great idea (trying to read & write to the same file), but if you insist on doing it, you could use the <> redirection operator, which opens a file for both reading and writing without truncating it:
gawk '{print $2*3}' -- < file 1<> file
Here stdin is an ordinary read of file, while 1<> file opens the same file read-write on stdout without truncating it, so gawk's output overwrites the file from the beginning.
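On a scratch file, that looks like this (a sketch, which also shows the leftover-bytes caveat):
printf '10 20 30\n' > file
gawk '{print $2*3}' -- < file 1<> file
cat file    # "60", then "20 30" left over from the original line, since nothing truncated the file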
There's a tool for everything. You can use sponge.
awk '{print $2*3}' file | sponge file
You can get it from the moreutils package. The man page reads:
NAME
sponge - soak up standard input and write to a file
SYNOPSIS
sed '...' file | grep '...' | sponge [-a] file
DESCRIPTION
sponge reads standard input and writes it out to the specified file.
Unlike a shell redirect, sponge soaks up all its input before writing the
output file. This allows constructing pipelines that read from and write
to the same file.
sponge preserves the permissions of the output file if it already exists.
When possible, sponge creates or updates the output file atomically by
renaming a temp file into place. (This cannot be done if TMPDIR is not in
the same filesystem.)
If the output file is a special file or symlink, the data will be written
to it, non-atomically.
If no file is specified, sponge outputs to stdout.
OPTIONS
-a
Replace the file with a new file that contains the file's original
content, with the standard input appended to it. This is done
atomically when possible.
AUTHOR
Colin Watson and Tollef Fog Heen
If you happen to be on a Mac, you can emulate a copy & paste operation to do in-place edits indirectly without a temp file:
awk '{ . . . }' file | LC_ALL=C pbcopy ; LC_ALL=C pbpaste > file
I don't know what the equivalent commands are for Linux or other platforms. Avoid this if your file is over 500 MB in size.
You can also use this for perl or python etc., since "pasteboard copy" simply reads its contents via /dev/stdin.
This is only a convenience shortcut and doesn't guarantee atomic operations whatsoever.

Shell - saving contents of file to variable then outputting the variable

First off, I'm really bad at shell, as you'll notice :)
Now then, I have the following task: the script gets two arguments (fileName, N). If the number of lines in the file is greater than N, then I need to cut the last N lines and overwrite the contents of the file with them.
I thought of saving the contents of the file into a variable, then just cat-ing that to the file. However for some reason it's not working.
I have problems with saving the last N lines to a variable.
This is how I tried doing it:
lastNLines=`tail -$2 $1`
cat $lastNLines > $1
Your lastNLines is not a filename. cat takes filenames. You also cannot open the input file for writing, because the shell truncates it before tail can get to it, which is why you need to use a temporary file.
However, if you insist on not using a temporary file, here's a non-portable solution:
tail -n$2 $1 | sponge $1
You may need to install moreutils for sponge.
The arguments cat takes are file names, not the content.
Instead, you can use a temp file, like this:
tail -$2 $1 > $1._tmp
mv $1._tmp $1
To save the content to a variable, you can do what you already included in your question, or:
lastNLines=`cat $1`
(after the mv command, of course)

Why might this cause the file to be empty?

TMP="$$.FILE"
#Process puts contents into TMP
cat "$TMP" | sort | head > "$TMP"
I already made sure the file was not empty to begin with. Without the > "$TMP", it outputs something, but when it's stored again into the same file, it's empty. What might be the cause?
You cannot write to and read from a file at the same time. Here is roughly what happens:
> "$TMP" causes file to be opened for writing, which also truncates the file.
cat "$TMP" reads from now blank file.
File stays empty.
Commands that purport to modify a file in place in fact perform a bit of temp file shuffling under the covers. For example, sed -i will process an input file, save the results to a temporary file (think input.tmp), then mv that temp file over input at the end to overwrite the original. You should follow that model.
Those processes all get run in parallel, so the redirection on the head command truncates the file before cat has a chance to read it.
To get the result you want, you need to write the sort output to a different file then move that over the original.
cat "$TMP" | sort | head > "$TMP".new
mv "$TMP".new "$TMP"
The final redirection truncates the file that the first command reads, before anything really happens. So cat tries to read a file that the redirection on head has already truncated. This is what causes the issue here; the > operator is a shell operator which means "truncate this file right away and then have the process write its standard output into the file."
On a related note, you don't need cat here.
Try this instead:
TMP="$$.FILE"
sort <"$TMP" | head > "$TMP.tmp"
mv "$TMP.tmp" "$TMP"
