Why doesn't piping to the same file work on some platforms? - bash

In cygwin, the following code works fine
$ cat junk
bat
bat
bat
$ cat junk | sort -k1,1 |tr 'b' 'z' > junk
$ cat junk
zat
zat
zat
But in the Linux shell (GNU/Linux), it seems that overwriting doesn't work:
[41] othershell: cat junk
cat
cat
cat
[42] othershell: cat junk |sort -k1,1 |tr 'c' 'z'
zat
zat
zat
[43] othershell: cat junk |sort -k1,1 |tr 'c' 'z' > junk
[44] othershell: cat junk
Both environments run BASH.
I am asking because, after doing some text manipulation, this caveat sometimes forces me to create a temporary file. I know that in Perl you can pass the -i flag to overwrite the original file after some operations. Is there a foolproof way to overwrite the file in a Unix pipeline that I am not aware of?

Four main points here:
"Useless use of cat." Don't do that.
You're not actually sorting anything with sort. Don't do that.
Your pipeline doesn't say what you think it does. Don't do that.
You're trying to over-write a file in-place while reading from it. Don't do that.
One of the reasons you are getting inconsistent behavior is that you are piping to a process that has redirection, rather than redirecting the output of the pipeline as a whole. The difference is subtle, but important.
What you want is to create a compound command with Command Grouping, so that you can redirect the input and output of the whole pipeline. In your case, this should work properly:
{ sort -k1,1 | tr 'c' 'z'; } < junk > sorted_junk
Please note that without anything to sort, you might as well skip the sort command too. Then your command can be run without the need for command grouping:
tr 'c' 'z' < junk > sorted_junk
Keep redirections and pipelines as simple as possible. It makes debugging your scripts much easier.
However, if you still want to abuse the pipeline for some reason, you could use the sponge utility from the moreutils package. The man page says:
sponge reads standard input and writes it out to the specified
file. Unlike a shell redirect, sponge soaks up all its input before
opening the output file. This allows constructing pipelines that read
from and write to the same file.
So, your original command line can be re-written like this:
cat junk | sort -k1,1 | tr 'c' 'z' | sponge junk
and since junk will not be overwritten until sponge receives EOF from the pipeline, you will get the results you were expecting.

In general this can be expected to break. The processes in a pipeline are all started up in parallel, so the > junk at the end of the line will usually truncate your input file before the process at the head of the pipeline has finished (or even started) reading from it.
Even if bash under Cygwin lets you get away with this, you shouldn't rely on it. The general solution is to redirect to a temporary file and then rename it when the pipeline is complete.
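A minimal sketch of that temporary-file approach, assuming mktemp is available. The scratch directory and the variable names (workdir, tmpfile, result) are illustrative and not part of the question:

```shell
# Work in a scratch directory so the demonstration touches nothing else.
workdir=$(mktemp -d)
printf 'cat\ncat\ncat\n' > "$workdir/junk"

# Create the temporary file next to the target so the final mv is a
# rename within one filesystem rather than a copy.
tmpfile=$(mktemp "$workdir/junk.XXXXXX")

# Replace the original only if the pipeline reported success
# (by default that is the exit status of its last command, tr).
if sort -k1,1 "$workdir/junk" | tr 'c' 'z' > "$tmpfile"; then
    mv "$tmpfile" "$workdir/junk"
else
    rm -f "$tmpfile"
fi

result=$(cat "$workdir/junk")
printf '%s\n' "$result"
rm -rf "$workdir"
```

Note that after the mv the file carries the temporary file's permissions and ownership, which may differ from those of the original.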

If you want to edit that file, you can just use an editor.
ex junk << EOF
%!(sort -k1,1 |tr 'b' 'z')
x
EOF

Overwriting the same file in a pipeline is not advised, because if you make a mistake you can't get the file back (unless you have a backup or it is under version control).
This happens because the input and output in a pipeline are automatically buffered (which gives the impression that it works), but the commands actually run in parallel. Different platforms may buffer the output differently (depending on the settings), so on some you end up with an empty file (because the file is created at the start), and on others with a half-finished file.
The solution is to use a method in which the file is only overwritten once EOF has been reached and all the input has been buffered and processed.
This can be achieved by:
Using a utility that soaks up all its input before opening the output file.
This can be done with sponge (the opposite of unbuffer from the expect package).
Avoiding I/O redirection syntax (which can create the empty file before the command starts).
For example, using tee (note that tee also truncates the file as soon as it starts, so this relies on timing and is not guaranteed to work):
cat junk | sort | tee junk
This would only work with sort, because sort has to read all of its input before it can write anything. So if your pipeline doesn't use sort, add one.
Another tool that can be used is stdbuf, which modifies the buffering of a command's standard streams and lets you specify the buffer size.
Using a text processor that can edit files in place (such as sed or ex).
Example:
$ ex -s +'%!sort -k1' -cxa myfile.txt
$ sed -i '' s/foo/bar/g myfile.txt

Using the following simple script, you can make it work like you want to:
$ cat junk | sort -k1,1 |tr 'b' 'z' | overwrite_file.sh junk
overwrite_file.sh
#!/usr/bin/env bash
OUT=$(cat -)
FILENAME="$*"
echo "$OUT" | tee "$FILENAME"
Note that if you don't want the updated file to be sent to stdout, you can use this approach instead:
overwrite_file_no_output.sh
#!/usr/bin/env bash
OUT=$(cat -)
FILENAME="$*"
echo "$OUT" > "$FILENAME"

Related

How can I redirect output of a `sed` and `tr` pipe and overwrite the input file? [duplicate]

I would like to run a find and replace on an HTML file through the command line.
My command looks something like this:
sed -e s/STRING_TO_REPLACE/STRING_TO_REPLACE_IT/g index.html > index.html
When I run this and look at the file afterward, it is empty. It deleted the contents of my file.
When I run this after restoring the file again:
sed -e s/STRING_TO_REPLACE/STRING_TO_REPLACE_IT/g index.html
The stdout is the contents of the file, and the find and replace has been executed.
Why is this happening?
When the shell sees > index.html in the command line it opens the file index.html for writing, wiping off all its previous contents.
To fix this you need to pass the -i option to sed to make the changes inline and create a backup of the original file before it does the changes in-place:
sed -i.bak s/STRING_TO_REPLACE/STRING_TO_REPLACE_IT/g index.html
Without the .bak the command will fail on some platforms, such as Mac OSX.
An alternative, useful, pattern is:
sed -e 'script script' index.html > index.html.tmp && mv index.html.tmp index.html
That has much the same effect, without using the -i option, and additionally means that, if the sed script fails for some reason, the input file isn't clobbered. Further, if the edit is successful, there's no backup file left lying around. This sort of idiom can be useful in Makefiles.
Quite a lot of seds have the -i option, but not all of them; the posix sed is one which doesn't. If you're aiming for portability, therefore, it's best avoided.
sed -i 's/STRING_TO_REPLACE/STRING_TO_REPLACE_IT/g' index.html
This does a global in-place substitution on the file index.html. Quoting the string prevents problems with whitespace in the query and replacement.
use sed's -i option, e.g.
sed -i bak -e s/STRING_TO_REPLACE/REPLACE_WITH/g index.html
To change multiple files (saving a backup of each as *.bak):
perl -p -i.bak -e "s/\|/x/g" *
This takes every file in the directory and replaces | with x.
This is called a “Perl pie” (easy as pie).
You should try using the option -i for in-place editing.
Warning: this is a dangerous method! It abuses the i/o buffers in linux and with specific options of buffering it manages to work on small files. It is an interesting curiosity. But don't use it for a real situation!
Besides the -i option of sed
you can use the tee utility.
From man:
tee - read from standard input and write to standard output and files
So, the solution would be:
sed s/STRING_TO_REPLACE/STRING_TO_REPLACE_IT/g index.html | tee | tee index.html
-- here the tee is repeated to make sure that the pipeline is buffered. Then all commands in the pipeline are blocked until they get some input to work on. Each command in the pipeline starts when the upstream commands have written 1 buffer of bytes (the size is defined somewhere) to the input of the command. So the last command tee index.html, which opens the file for writing and therefore empties it, runs after the upstream pipeline has finished and the output is in the buffer within the pipeline.
Most likely the following won't work:
sed s/STRING_TO_REPLACE/STRING_TO_REPLACE_IT/g index.html | tee index.html
-- it will run both commands of the pipeline at the same time without any blocking. (Without blocking the pipeline should pass the bytes line by line instead of buffer by buffer. Same as when you run cat | sed s/bar/GGG/. Without blocking it's more interactive and usually pipelines of just 2 commands run without buffering and blocking. Longer pipelines are buffered.) The tee index.html will open the file for writing and it will be emptied. However, if you turn the buffering always on, the second version will work too.
sed -i.bak "s#https.*\.com#$pub_url#g" MyHTMLFile.html
If you have a link to be added, try this. It searches for the URL as above (starting with https and ending with .com here) and replaces it with a URL string. I have used a variable $pub_url here. The s means substitute and the g means global replacement.
It works!
The problem with the command
sed 'code' file > file
is that file is truncated by the shell before sed actually gets to process it. As a result, you get an empty file.
The sed way to do this is to use -i to edit in place, as other answers suggested. However, this is not always what you want. -i will create a temporary file that will then be used to replace the original file. This is problematic if your original file was a link (the link will be replaced by a regular file). If you need to preserve links, you can use a temporary variable to store the output of sed before writing it back to the file, like this:
tmp=$(sed 'code' file); echo -n "$tmp" > file
Better yet, use printf instead of echo since echo is likely to process \\ as \ in some shells (e.g. dash):
tmp=$(sed 'code' file); printf "%s" "$tmp" > file
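A minimal sketch of the variable approach applied to a symbolic link, to illustrate that the link survives. The file names real.txt and link.txt and the host1/host2 substitution are made up for the demonstration. Command substitution strips trailing newlines (and cannot hold NUL bytes), so the sketch assumes a text file ending in a single newline and adds that newline back:

```shell
workdir=$(mktemp -d)
printf 'host1\nhost1\n' > "$workdir/real.txt"
ln -s real.txt "$workdir/link.txt"

# Read through the link into a variable, then write back through the link.
# Command substitution strips trailing newlines, so one is added back here.
tmp=$(sed 's/host1/host2/' "$workdir/link.txt")
printf '%s\n' "$tmp" > "$workdir/link.txt"

# The redirection wrote into the link's target, so the link is still a link.
if [ -L "$workdir/link.txt" ]; then link_kept=yes; else link_kept=no; fi
target_content=$(cat "$workdir/real.txt")
echo "link kept: $link_kept"
rm -rf "$workdir"
```

Because the whole output is held in memory, this is only reasonable for files of modest size.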
And the ed answer:
printf "%s\n" '1,$s/STRING_TO_REPLACE/STRING_TO_REPLACE_IT/g' w q | ed index.html
To reiterate what codaddict answered, the shell handles the redirection first, wiping out the "input.html" file, and then the shell invokes the "sed" command passing it a now empty file.
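The order of events can be seen in a small sketch. Here true stands in for any command: it never reads or writes anything, yet the file is emptied, because the shell performs the redirection itself before the command runs. The scratch directory and the file contents are illustrative:

```shell
workdir=$(mktemp -d)
printf 'some content\n' > "$workdir/index.html"

# "true" does no I/O at all; the truncation is done by the shell when it
# opens index.html for writing while setting up the redirection.
true > "$workdir/index.html"

size=$(wc -c < "$workdir/index.html" | tr -d ' ')
echo "size after redirect: $size bytes"
rm -rf "$workdir"
```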
I was searching for the option where I can define the line range and found the answer. For example I want to change host1 to host2 from line 36-57.
sed '36,57 s/host1/host2/g' myfile.txt > myfile1.txt
You can use gi option as well to ignore the character case.
sed '30,40 s/version/story/gi' myfile.txt > myfile1.txt
With all due respect to the above correct answers, it's always a good idea to "dry run" scripts like that, so that you don't corrupt your file and have to start again from scratch.
Just get your script to spill the output to the command line instead of writing it to the file, for example, like that:
sed -e s/STRING_TO_REPLACE/STRING_TO_REPLACE_IT/g index.html
OR
less index.html | sed -e s/STRING_TO_REPLACE/STRING_TO_REPLACE_IT/g
This way you can see and check the output of the command without getting your file truncated.

Send output from `split` utility to stdout

From this question, I found the split utility, which takes a file and splits it into evenly sized chunks. By default, it outputs these chunks to new files, but I'd like to get it to output them to stdout, separated by a newline (or an arbitrary delimiter). Is this possible?
I tried cat testfile.txt | split -b 128 - /dev/stdout
which fails with the error split: /dev/stdoutaa: Permission denied.
Looking at the help text, it seems this tells split to use /dev/stdout as a prefix for the filename, not to write to /dev/stdout itself. It does not indicate any option to write directly to a single file with a delimiter. Is there a way I can trick split into doing this, or is there a different utility that accomplishes the behavior I want?
It's not clear exactly what you want to do, but perhaps the --filter option to split will help out:
--filter=COMMAND
write to shell COMMAND; file name is $FILE
Maybe you can use that directly. For example, this will read a file 10 bytes at a time, passing each chunk through the tr command:
split -b 10 --filter 'tr "[:lower:]" "[:upper:]"' afile
If you really want to emit a stream on stdout that has separators between chunks, you could do something like:
split -b 10 --filter 'dd 2> /dev/null; echo ---sep---' afile
If afile is a file in my current directory that looks like:
the quick brown fox jumped over the lazy dog.
Then the above command will result in:
the quick ---sep---
brown fox ---sep---
jumped ove---sep---
r the lazy---sep---
dog.
---sep---
From info page :
`--filter=COMMAND'
With this option, rather than simply writing to each output file,
write through a pipe to the specified shell COMMAND for each
output file. COMMAND should use the $FILE environment variable,
which is set to a different output file name for each invocation
of the command.
split -b 128 --filter='cat ; echo ' inputfile
Here is one way of doing it. Each chunk of up to 128 characters is read into the variable "var".
You may use your preferred delimiter to print or use it for further processing.
#!/bin/bash
cat yourTextFile | while read -r -n 128 var ; do
printf '\n%s' "$var"
done
You may use it as below at command line:
while read -r -n 128 var ; do printf '\n%s' "$var" ; done < yourTextFile
No, the utility will not write anything to standard output. The standard specification of it says specifically that standard output is not used.
If you used split, you would need to concatenate the created files, inserting a delimiter in between them.
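A minimal sketch of that split-then-concatenate approach, using only the -b option and a scratch directory. The chunk. prefix, the ---sep--- delimiter and the sample text are illustrative choices:

```shell
workdir=$(mktemp -d)
printf 'the quick brown fox jumped over the lazy dog.\n' > "$workdir/afile"

# Let split write its pieces into the scratch directory (chunk.aa, chunk.ab, ...).
split -b 10 "$workdir/afile" "$workdir/chunk."

# The glob expands in sorted order, which matches the order split created
# the pieces in; print each piece followed by the delimiter.
joined=$(for piece in "$workdir"/chunk.*; do
    cat "$piece"
    printf '%s\n' '---sep---'
done)
printf '%s\n' "$joined"
rm -rf "$workdir"
```

This does use temporary files, so it only suits cases where writing the chunks to disk is acceptable.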
If you just want to insert a delimiter every N th line, you may use GNU sed:
$ sed '0~3a\-----\' file
This inserts a line containing ----- every 3rd line.
To divide the file into chunks, separated by newlines, and write to stdout, use fold:
cat yourfile.txt | fold -w 128
...will write to stdout in "chunks" of 128 chars.

Duplicate stdin to stdout

I am looking for a bash one-liner that duplicates stdin to stdout without interleaving. The only solution I have found so far is to use tee, but that produces interleaved output. What do I mean by this:
If e.g. a file f reads
a
b
I would like to execute
cat f | HERE_BE_COMMAND
to obtain
a
b
a
b
If I use tee - as the command, the output typically looks something like
a
a
b
b
Any suggestions for a clean solution?
Clarification
The cat f command is just an example of where the input can come from. In reality, it is a command that can (should) only be executed once. I also want to refrain from using temporary files, as the processed data is sort of sensitive and temporary files are always error-prone when the executed command gets interrupted. Furthermore, I am not interested in a solution that involves additional scripts (as stated above, it should be a one-liner) or preparatory commands that need to be executed prior to the actual duplication command.
Solution 1:
<command_which_produces_output> | { a="$(</dev/stdin)"; echo "$a"; echo "$a"; }
In this way, you're saving the content from the standard input in a (choose a better name please), and then echo'ing twice.
Notice $(</dev/stdin) is a similar but more efficient way to do $(cat /dev/stdin).
Solution 2:
Use tee in the following way:
<command_which_produces_output> | tee >(echo "$(</dev/stdin)")
Here, you're firstly writing to the standard output (that's what tee does), and also writing to a FIFO file created by process substitution:
>(echo "$(</dev/stdin)")
See for example the file it creates in my system:
$ echo >(echo "$(</dev/stdin)")
/dev/fd/63
Now, the echo "$(</dev/stdin)" part is just the way I found to firstly read the entire file before printing it. It echo'es the content read from the process substitution's standard input, but once all the input is read (not like cat that prints line by line).
Store the second input in a temp file.
cat f | tee /tmp/showlater
cat /tmp/showlater
rm /tmp/showlater
Update:
As shown in the comments (@j.a.), the solution above will need to be adjusted to the OP's real needs. Calling it will be easier in a function, and you need to decide what to do with errors in your initial commands and in the tee/cat/rm steps.
I recommend tee /dev/stdout.
cat f | tee /dev/stdout
One possible solution I found is the following awk command:
awk '{d[NR] = $0} END {for (i=1;i<=NR;i++) print d[i]; for (i=1;i<=NR;i++) print d[i]}'
However, I feel there must be a more "canonical" way of doing this using.
A simple bash script?
But this will store all of stdin, so why not store the output in a file and read the file twice if you need to?
full=""
while IFS= read -r line
do
echo "$line"
full="$full$line"$'\n'
done
printf '%s' "$full"
The best way would be to store the output in a file and show it later on. Using tee has the advantage of showing the output as it comes:
if tmpfile=$(mktemp); then
commands | tee "$tmpfile"
cat "$tmpfile"
rm "$tmpfile"
else
echo "Error creating temporary file" >&2
exit 1
fi
If the amount of output is limited, you can do this:
output=$(commands); printf '%s\n%s\n' "$output" "$output"

Best practices for managing changes to files within a script

I have a Bash script which performs many actions on a file, for example:
cp input.txt file.tmp1
sed (code) file.tmp1 > file.tmp2
sed (code) file.tmp2 > file.tmp3
sed (code) file.tmp3 > file.tmp4
sed (code) file.tmp4 > file.tmp5
sed (code) file.tmp5 > file.tmp6
sed (code) file.tmp6 > file.tmp7
cp file.tmp7 output.txt
In this way:
The original file is unchanged.
I can check the files changes at each stage, just to make sure my code did not do anything wrong.
However, this does not seem like an ideal way to handle the files.
Is there a better way to do this?
Is there any tool which can help inspect the changes, just to see if anything unusual was introduced?
Working on a temporary file is a fine idea, but you should use mktemp(1) to make your temporary file safely.
While there's nothing wrong with using multiple files for multiple passes, consider using mktemp -d to create a temporary directory for all your files to ensure you never overwrite anything the user cares about.
But if you're never going to look at the intermediate files, multiple passes can be handled like this:
sed (code) input.txt | sed (code) | sed (code) | sed (code) | ...
sed (code) > output.txt
If one fails, they all fail, which can make for easier error handling. There are no temporary files to remove when you're finished.
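One caveat, sketched below under the assumption that the script runs in bash: by default the exit status of a pipeline is that of its last command only, so a failure in an earlier stage goes unnoticed unless the pipefail option is set. Here false stands in for a failing sed stage, and the variable names are illustrative:

```shell
# Default behaviour: only the last command's status is reported.
status_default=0
false | cat || status_default=$?

# With pipefail, the pipeline fails if any stage fails.
set -o pipefail
status_pipefail=0
false | cat || status_pipefail=$?
set +o pipefail

echo "default: $status_default, pipefail: $status_pipefail"
```

With pipefail enabled, a construct such as `pipeline > output.tmp && mv output.tmp output.txt` only replaces the output when every stage succeeded.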
If you'd like to inspect the pipeline for errors, tee will help you. It copies its input both to its standard output and to a file, used like:
sed (code) input.txt | sed (code) | tee state-of-pipe.txt | sed (code) | ...
sed (code) > output.txt
You can inspect the changes by using diff -u input.txt output.txt. diff(1) is a line-wise differences program, and the -u unified output is pretty easy to read. wdiff(1) is a word-wise differences program, which might be more useful for some cases.
And xxdiff(1) is a superb GUI interface for inspecting the differences between two files -- it will go to some effort to show you individually changed characters. (It is also fantastic for handling CVS- and SVN-style conflict files, but that's another matter completely.)
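The staged approach and the diff check can be combined in a short sketch. It assumes mktemp -d is available; the two sed expressions and the file contents are placeholders for the real passes:

```shell
workdir=$(mktemp -d)
printf 'alpha\nbeta\n' > "$workdir/input.txt"

# Two illustrative passes, each writing its own stage file.
sed 's/alpha/ALPHA/' "$workdir/input.txt" > "$workdir/stage1.txt"
sed 's/beta/BETA/' "$workdir/stage1.txt" > "$workdir/stage2.txt"

# diff exits with status 1 when the files differ, which is expected here,
# so that status is deliberately ignored.
changes=$(diff -u "$workdir/stage1.txt" "$workdir/stage2.txt" || true)
printf '%s\n' "$changes"

# Keep the final stage as the output; the input file was never modified.
cp "$workdir/stage2.txt" "$workdir/output.txt"
final=$(cat "$workdir/output.txt")
original=$(cat "$workdir/input.txt")
rm -rf "$workdir"
```

Comparing adjacent stages rather than only input and output makes it easier to see which pass introduced an unexpected change.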
A more effective way would be to use pipes. E.g.:
cat input.txt | sed ... | ... | sed ... > output.txt
The problem is that you cannot check the changes made at the different stages.

Best way to modify a file when using pipes?

I often have shell programming tasks where I run into this pattern:
cat file | some_script > file
This is unsafe - cat may not have read in the entire file before some_script starts writing to it. I don't really want to write the result to a temporary file (it's slow, and I don't want the added complication of thinking up a unique new name).
Perhaps there is a standard shell command that will buffer a whole stream until EOF is reached? Something like:
cat file | bufferUntilEOF | script > file
Ideas?
Like many others, I like to use temporary files. I use the shell process-id as part of the temporary name so that if multiple copies of the script are running at the same time, they won't conflict. Finally, I then only overwrite the original file if the script succeeds (using boolean operator short-circuiting - it's a little dense but very nice for simple command lines). Putting that all together, it would look like:
some_script < file > smscrpt.$$ && mv smscrpt.$$ file
This will leave the temporary file if the command fails. If you want to clean up on error, you can change that to:
some_script < file > smscrpt.$$ && mv smscrpt.$$ file || rm smscrpt.$$
BTW, I got rid of the poor use of cat and replaced it with input redirection.
You're looking for sponge.
Using a temporary file is the correct solution here. When you use a redirection like '>', it is handled by the shell, and no matter how many commands are in your pipeline, the shell is free to delete and overwrite the output file before any command is executed (during pipeline setup).
Another option is just to read the file into a variable:
file_contents=$(cat file)
echo "$file_contents" | script1 | script2 > file
Using mktemp(1) or tempfile(1) saves you the expense of having to think up a unique filename.
In response to the OP's question above about using sponge without external dependencies, and building on @D.Shawley's answer, you can have the effect of sponge with only a dependency on gawk, which is not uncommon on Unix or Unix-like systems:
cat foo | gawk -voutfn=foo '{lines[NR]=$0;} END {if(NR>0){print lines[1]>outfn;} for(i=2;i<=NR;++i) print lines[i] >> outfn;}'
The check for NR>0 is to truncate the input file.
To use this in a shell script, change -voutfn=foo to -voutfn="$1" or whatever syntax your shell uses for filename arguments. For example:
#!/bin/bash
cat "$1" | gawk -voutfn="$1" '{lines[NR]=$0;} END {if(NR>0){print lines[1]>outfn;} for(i=2;i<=NR;++i) print lines[i] >> outfn;}'
Note that, unlike real sponge, this may be limited to the size of RAM. sponge actually buffers in a temporary file if necessary.
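A sponge-like helper that buffers in a temporary file rather than in memory can be sketched in plain shell. This is not sponge itself. The function name soak and its variables are hypothetical, and the sketch assumes mktemp is available:

```shell
# soak FILE: hypothetical helper that copies all of stdin to a temporary
# file first, and only then writes that copy into FILE.
soak() {
    soak_tmp=$(mktemp) || return 1
    if cat > "$soak_tmp"; then
        # Writing with a redirection keeps FILE's existing inode, so its
        # permissions stay as they were and a symlink stays a symlink.
        cat "$soak_tmp" > "$1"
        soak_status=$?
    else
        soak_status=1
    fi
    rm -f "$soak_tmp"
    return "$soak_status"
}

workdir=$(mktemp -d)
printf 'cat\ncat\ncat\n' > "$workdir/junk"

# junk is only truncated after soak has seen EOF on its input,
# that is, after sort and tr have finished.
sort -k1,1 "$workdir/junk" | tr 'c' 'z' | soak "$workdir/junk"

result=$(cat "$workdir/junk")
printf '%s\n' "$result"
rm -rf "$workdir"
```

The final write-back is not atomic, so an interruption during that last step can still leave a partial file.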
Using a temporary file is IMO better than attempting to buffer the data in the pipeline.
It almost defeats the purpose of pipelines to buffer them.
I think you need to use mktemp. Something like this will work:
FILE=example-input.txt
TMP=`mktemp`
some_script <"$FILE" >"$TMP"
mv "$TMP" "$FILE"
I think that the best way is to use a temp file. However, if you want another approach, you can use something like awk to buffer the input in memory before your application starts receiving it. The following script will buffer all of the input into the lines array before it starts to output it to the next consumer in the pipeline.
{ lines[NR] = $0; }
END {
for (line_no=1; line_no<=NR; ++line_no) {
print lines[line_no];
}
}
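You can collapse it into a one-liner if you want. Note that redirecting the result straight back with > file would still truncate file when the shell sets up the pipeline, so send the output to a different file:
cat file | awk '{lines[NR]=$0;} END {for(i=1;i<=NR;++i) print lines[i];}' > file.new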
With all of that, I would still recommend using a temporary file for the output and then overwriting the original file with it.

Resources