bash cat behavior on file versus variables storing file contents - bash

I have a file file1 with the following contents:
Z
X
Y
I can use cat to view the file:
$ cat file1
Z
X
Y
I can sort the file:
$ sort -k1,1 file1
X
Y
Z
I can sort it and store the output in a variable:
sorted_file1=$(sort -k1,1 file1)
But when I try to use cat on the variable sorted_file1 I get an error:
$ cat "$sorted_file1"
cat: X
Y
Z: No such file or directory
I can use echo and it looks about right, but it behaves strangely in my scripts:
$ echo "$sorted_file1"
X
Y
Z
Why does this happen? How does storing the output of a command change how cat interprets it?
Is there a better way to store the output of shell commands within variables to avoid issues like this?

cat operates on files. Your invocation of cat (cat "$sorted_file1") expands to the same as cat $'X\nY\nZ', and of course there's no file of that name, hence the error you see.
Shell variables are not files. If you need to make their values available like files, you need to use echo to create a stream:
echo "$sorted_file1" | cat # portable, STDIN
cat <(echo "$sorted_file1") # Bash, file
cat <<<"$sorted_file1" # Bash, STDIN
(obviously cat is pointless here, but the principle applies to other programs that expect their input from files or STDIN).
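To make the point concrete, here is a minimal sketch of handing a variable's value to programs that read STDIN. The helper name lines_in_value is made up for this illustration, and printf is used instead of echo only because it behaves the same way across shells.

```shell
# lines_in_value VALUE: count the lines held in a shell variable by
# streaming the value to wc on stdin (no file is involved).
# The function name is hypothetical, chosen for this example.
lines_in_value() {
  printf '%s\n' "$1" | wc -l | tr -d ' '
}

# Same data as in the question, sorted and captured in a variable.
sorted_file1=$(printf 'Z\nX\nY\n' | sort -k1,1)

# Stream the value to any program that reads stdin.
printf '%s\n' "$sorted_file1" | head -n 1
lines_in_value "$sorted_file1"
```

The variable is passed as data on a stream, never as a file name, which is exactly what cat "$sorted_file1" got wrong.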

You're mixing two concepts, files and variables. Both of these hold data, but they do so in different ways.
I will assume you know what a file is. A variable is like a little data store.
You generally use variables to store little bits of data that you may want to change, use immediately and don't mind losing when your script/program ends.
And you generally use files to store large amounts of data that you want to keep around after your script/program ends.
I believe what you want to do here is sort the file, and store the output in another file. To do this, you need to use redirection, like this:
sort -k1,1 file1 > sorted_file1
What this does is sort the file and then write the result into a file called "sorted_file1". Then if you do your regular cat sorted_file1 you will see the sorted contents, as you expect.
You can read a bit more about it here.

Related

Sort files in directory then execute command on each one of them

I have a directory containing files numbered like this
1>chr1:2111-1111_mask.txt
1>chr1:2111-1111_mask2.txt
1>chr1:2111-1111_mask3.txt
2>chr2:345-678_mask.txt
2>chr2:345-678_mask2.txt
2>chr2:345-678_mask3.txt
100>chr19:444-555_mask.txt
100>chr19:444-555_mask2.txt
100>chr19:444-555_mask3.txt
each file contains a name like >chr1:2111-1111 in the first line and a series of characters in the second line.
I need to sort the files in this directory numerically, using the number before the > as a guide, then execute the command for each one of the files with _mask3 and using.
I have this code
ls ./"$INPUT"_temp/*_mask3.txt | sort -n | for f in ./"$INPUT"_temp/*_mask3.txt
do
read FILE
Do something with each file and list the results in output file including the name of the string
done
It works, but when I check the list of the strings inside the output file they are like this
>chr19:444-555
>chr1:2111-1111
>chr2:345-678
why?
So... I'm not sure what "it works" means here, given what your question states.
It seems like you have two problems.
Your files are not in sorted order
The file names have the leading digits removed
Addressing 1, your command ls ./"$INPUT"_temp/*_mask3.txt | sort -n | for f in ./"$INPUT"_temp/*_mask3.txt here doesn't make a whole lot of sense. You are getting a list of files from ls, and then piping that to sort. That probably gives you the output you are looking for, but then you pipe that to for, which doesn't make any sense.
In fact you can rewrite your entire script to
for f in ./"$INPUT"_temp/*_mask3.txt
do
read FILE
Do something with each file and list the results in output file including the name of the string
done
And you'll have the exact same output. To get this sorted you could do something like:
for f in `ls ./"$INPUT"_temp/*_mask3.txt | sort -n`
do
read FILE
Do something with each file and list the results in output file including the name of the string
done
As for the unexpected truncation, that > character in your file names is significant in your bash shell, since it directs the stdout of the preceding command to a specified file. You'll need to ensure that whenever you use the variable $f from your loop you put quotes around it ("$f"), to keep bash from misinterpreting the file name as a command > file type of thing.
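One thing worth checking is that sort -n only looks at a number at the very start of each line, so sorting full paths that begin with ./ may not order the files by their leading number. A minimal sketch that sorts the bare names instead, assuming no file name contains a newline and that at least one file matches (list_mask3_sorted and the scratch directory are made up for this example):

```shell
# list_mask3_sorted DIR: print the *_mask3.txt names found in DIR,
# ordered by the number in front of the '>' character.
# Runs in a subshell so the caller's working directory is untouched.
list_mask3_sorted() {
  ( cd "$1" && printf '%s\n' *_mask3.txt | sort -n )
}

# Demo in a scratch directory with file names shaped like the question's.
demo_dir=$(mktemp -d)
for name in '100>chr19:444-555' '1>chr1:2111-1111' '2>chr2:345-678'; do
  printf '>%s\nACGT\n' "${name#*'>'}" > "$demo_dir/${name}_mask3.txt"
done

# Read one sorted name per line and always quote it when using it.
list_mask3_sorted "$demo_dir" | while IFS= read -r name; do
  head -n 1 "$demo_dir/$name"
done

rm -r "$demo_dir"
```

Reading the sorted names with while read keeps each name intact, whereas for f in `ls ...` splits on whitespace.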

Duplicate stdin to stdout

I am looking for a bash one-liner that duplicates stdin to stdout without interleaving. The only solution I have found so far is to use tee, but that does produce interleaved output. What do I mean by this:
If e.g. a file f reads
a
b
I would like to execute
cat f | HERE_BE_COMMAND
to obtain
a
b
a
b
If I use tee - as the command, the output typically looks something like
a
a
b
b
Any suggestions for a clean solution?
Clarification
The cat f command is just an example of where the input can come from. In reality, it is a command that can (should) only be executed once. I also want to refrain from using temporary files, as the processed data is sort of sensitive and temporary files are always error-prone when the executed command gets interrupted. Furthermore, I am not interested in a solution that involves additional scripts (as stated above, it should be a one-liner) or preparatory commands that need to be executed prior to the actual duplication command.
Solution 1:
<command_which_produces_output> | { a="$(</dev/stdin)"; echo "$a"; echo "$a"; }
In this way, you're saving the content from the standard input in a (choose a better name please), and then echo'ing twice.
Notice $(</dev/stdin) is a similar but more efficient way to do $(cat /dev/stdin).
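The same idea can be wrapped in a small function. This is only a sketch: dup_stdin is a made-up name, the whole input is held in memory, and command substitution drops trailing newlines (and cannot hold NUL bytes), so it suits text output of moderate size.

```shell
# dup_stdin: read all of stdin first, then print it twice, so the two
# copies cannot interleave. The function name is hypothetical.
dup_stdin() {
  data=$(cat)              # blocks until EOF; trailing newlines are stripped
  printf '%s\n' "$data"    # first copy
  printf '%s\n' "$data"    # second copy
}

printf 'a\nb\n' | dup_stdin
```

Because nothing is printed until EOF has been seen, the producing command runs exactly once and no temporary file is involved.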
Solution 2:
Use tee in the following way:
<command_which_produces_output> | tee >(echo "$(</dev/stdin)")
Here, you're firstly writing to the standard output (that's what tee does), and also writing to a FIFO file created by process substitution:
>(echo "$(</dev/stdin)")
See for example the file it creates in my system:
$ echo >(echo "$(</dev/stdin)")
/dev/fd/63
Now, the echo "$(</dev/stdin)" part is just the way I found to firstly read the entire file before printing it. It echo'es the content read from the process substitution's standard input, but once all the input is read (not like cat that prints line by line).
Store the second input in a temp file.
cat f | tee /tmp/showlater
cat /tmp/showlater
rm /tmp/showlater
Update:
As shown in the comments (@j.a.), the solution above will need to be adjusted to the OP's real needs. Calling it will be easier in a function, and what do you want to do with errors in your initial commands and in the tee/cat/rm?
I recommend tee /dev/stdout.
cat f | tee /dev/stdout
One possible solution I found is the following awk command:
awk '{d[NR] = $0} END {for (i=1;i<=NR;i++) print d[i]; for (i=1;i<=NR;i++) print d[i]}'
However, I feel there must be a more "canonical" way of doing this using.
A simple bash script?
But this will store all of stdin in memory; why not store the output in a file and read the file back if you need both copies?
full=""
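# IFS= and -r keep leading whitespace and backslashes intact
while IFS= read -r line
do
    echo "$line"
    # append a real newline, not the two characters \ and n
    full="$full$line"$'\n'
done
# use a fixed format string so % or \ in the data is printed literally
printf '%s' "$full"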
The best way would be to store the output in a file and show it later on. Using tee has the advantage of showing the output as it comes:
if tmpfile=$(mktemp); then
commands | tee "$tmpfile"
cat "$tmpfile"
rm "$tmpfile"
else
echo "Error creating temporary file" >&2
exit 1
fi
If the amount of output is limited, you can do this:
output=$(commands); echo "$output"; echo "$output"

Shell - saving contents of file to variable then outputting the variable

First off, I'm really bad at shell, as you'll notice :)
Now then, I have the following task: The script gets two arguments (fileName, N). If the number of lines in the file is greater than N, then I need to cut the last N lines, then overwrite the contents of the file with it.
I thought of saving the contents of the file into a variable, then just cat-ing that to the file. However for some reason it's not working.
I have problems with saving the last N lines to a variable.
This is how I tried doing it:
lastNLines=`tail -$2 $1`
cat $lastNLines > $1
Your lastNLines is not a filename. cat takes filenames. You also cannot open the input file for writing, because the shell truncates it before tail can get to it, which is why you need to use a temporary file.
However, if you insist on not using a temporary file, here's a non-portable solution:
tail -n "$2" "$1" | sponge "$1"
You may need to install moreutils for sponge.
The arguments cat takes are file names, not the content.
Instead, you can use a temp file, like this:
tail -n "$2" "$1" > "$1._tmp"
mv "$1._tmp" "$1"
To save the content to a variable, you can do what you already included in your question, or:
lastNLines=`cat $1`
(after the mv command, of course)
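Putting the pieces of the task together, a minimal sketch might look like the following. The function name keep_last_lines is made up, and it assumes mktemp accepts a template argument, as the GNU and BSD versions do. Note that the replaced file gets the temporary file's permissions.

```shell
# keep_last_lines FILE N: if FILE has more than N lines, replace its
# contents with its last N lines, going through a temporary file.
# The function name is hypothetical.
keep_last_lines() {
  file=$1
  n=$2
  total=$(($(wc -l < "$file")))       # arithmetic strips wc's padding
  [ "$total" -gt "$n" ] || return 0   # nothing to do for short files
  tmp=$(mktemp "${file}.XXXXXX") || return 1
  if tail -n "$n" "$file" > "$tmp"; then
    mv "$tmp" "$file"                 # only replace the file on success
  else
    rm -f "$tmp"
    return 1
  fi
}
```

In a script that gets the two arguments this would be called as keep_last_lines "$1" "$2".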

Why piping to the same file doesn't work on some platforms?

In cygwin, the following code works fine
$ cat junk
bat
bat
bat
$ cat junk | sort -k1,1 |tr 'b' 'z' > junk
$ cat junk
zat
zat
zat
But in the linux shell(GNU/Linux), it seems that overwriting doesn't work
[41] othershell: cat junk
cat
cat
cat
[42] othershell: cat junk |sort -k1,1 |tr 'c' 'z'
zat
zat
zat
[43] othershell: cat junk |sort -k1,1 |tr 'c' 'z' > junk
[44] othershell: cat junk
Both environments run BASH.
I am asking this because sometimes after I do text manipulation, because of this caveat, I am forced to make the tmp file. But I know in Perl, you can give "i" flag to overwrite the original file after some operations/manipulations. I just want to ask if there is any foolproof method in unix pipeline to overwrite the file that I am not aware of.
Four main points here:
"Useless use of cat." Don't do that.
You're not actually sorting anything with sort. Don't do that.
Your pipeline doesn't say what you think it does. Don't do that.
You're trying to over-write a file in-place while reading from it. Don't do that.
One of the reasons you are getting inconsistent behavior is that you are piping to a process that has redirection, rather than redirecting the output of the pipeline as a whole. The difference is subtle, but important.
What you want is to create a compound command with Command Grouping, so that you can redirect the input and output of the whole pipeline. In your case, this should work properly:
{ sort -k1,1 | tr 'c' 'z'; } < junk > sorted_junk
Please note that without anything to sort, you might as well skip the sort command too. Then your command can be run without the need for command grouping:
tr 'c' 'z' < junk > sorted_junk
Keep redirections and pipelines as simple as possible. It makes debugging your scripts much easier.
However, if you still want to abuse the pipeline for some reason, you could use the sponge utility from the moreutils package. The man page says:
sponge reads standard input and writes it out to the specified
file. Unlike a shell redirect, sponge soaks up all its input before
opening the output file. This allows constructing pipelines that read
from and write to the same file.
So, your original command line can be re-written like this:
cat junk | sort -k1,1 | tr 'c' 'z' | sponge junk
and since junk will not be overwritten until sponge receives EOF from the pipeline, you will get the results you were expecting.
In general this can be expected to break. The processes in a pipeline are all started up in parallel, so the > junk at the end of the line will usually truncate your input file before the process at the head of the pipeline has finished (or even started) reading from it.
Even if bash under Cygwin lets you get away with this, you shouldn't rely on it. The general solution is to redirect to a temporary file and then rename it when the pipeline is complete.
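A minimal sketch of that temporary-file-then-rename pattern, wrapped in a helper. The name rewrite_in_place is made up for this example, and mktemp with a template argument is assumed to be available.

```shell
# rewrite_in_place FILE COMMAND [ARGS...]: run COMMAND with FILE on
# stdin, collect its stdout in a temporary file next to FILE, and
# rename the temporary file over FILE only if COMMAND succeeded.
# The function name is hypothetical.
rewrite_in_place() {
  target=$1
  shift
  tmp=$(mktemp "${target}.XXXXXX") || return 1
  if "$@" < "$target" > "$tmp"; then
    mv "$tmp" "$target"
  else
    rm -f "$tmp"      # leave the original untouched on failure
    return 1
  fi
}

# Demo on a scratch copy of the question's data.
demo_junk=$(mktemp)
printf 'cat\ncat\ncat\n' > "$demo_junk"
rewrite_in_place "$demo_junk" tr 'c' 'z'
cat "$demo_junk"
rm -f "$demo_junk"
```

A whole pipeline can be passed as one command, for example rewrite_in_place junk sh -c "sort -k1,1 | tr 'c' 'z'".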
If you want to edit that file, you can just use an editor.
ex junk << EOF
%!(sort -k1,1 |tr 'b' 'z')
x
EOF
Overwriting the same file in a pipeline is not advised, because if you make a mistake you can't get the file back (unless you have a backup or it's under version control).
This happens because the input and output in a pipeline are automatically buffered (which gives you the impression that it works), but the commands are actually running in parallel. Different platforms may buffer the output differently (based on their settings), so on some you end up with an empty file (because the file is created at the start), and on others with a half-finished file.
The solution is to use a method where the file is only overwritten once EOF has been reached and the input has been fully buffered and processed.
This can be achieved by:
Using a utility which soaks up all its input before opening the output file.
This can be done with sponge (the opposite of unbuffer from the expect package).
Avoiding I/O redirection syntax (which can create the empty file before starting the command).
For example using tee, which opens the output file itself instead of having the shell do it:
cat junk | sort | tee junk
This only appears to work with sort, because sort needs all of its input before it prints anything. Note that tee still truncates junk as soon as it starts, so this races with cat reading the file and is not reliable.
Another tool which can be used is stdbuf which modifies buffering operations for its standard streams where you can specify the buffer size.
Using a text processor which can edit files in place (such as sed or ex).
Example:
$ ex -s +'%!sort -k1' -cxa myfile.txt
$ sed -i '' s/foo/bar/g myfile.txt # BSD/macOS sed; with GNU sed, drop the '' after -i
Using the following simple script, you can make it work like you want to:
$ cat junk | sort -k1,1 |tr 'b' 'z' | overwrite_file.sh junk
overwrite_file.sh
#!/usr/bin/env bash
OUT=$(cat -)
FILENAME="$*"
echo "$OUT" | tee "$FILENAME"
Note that if you don't want the updated file to be sent to stdout, you can use this approach instead:
overwrite_file_no_output.sh
#!/usr/bin/env bash
OUT=$(cat -)
FILENAME="$*"
echo "$OUT" > "$FILENAME"

Best way to modify a file when using pipes?

I often have shell programming tasks where I run into this pattern:
cat file | some_script > file
This is unsafe - cat may not have read in the entire file before some_script starts writing to it. I don't really want to write the result to a temporary file (it's slow, and I don't want the added complication of thinking up a unique new name).
Perhaps there is a standard shell command that will buffer a whole stream until EOF is reached? Something like:
cat file | bufferUntilEOF | script > file
Ideas?
Like many others, I like to use temporary files. I use the shell process-id as part of the temporary name so that if multiple copies of the script are running at the same time, they won't conflict. Finally, I then only overwrite the original file if the script succeeds (using boolean operator short-circuiting - it's a little dense but very nice for simple command lines). Putting that all together, it would look like:
some_script < file > smscrpt.$$ && mv smscrpt.$$ file
This will leave the temporary file if the command fails. If you want to clean up on error, you can change that to:
some_script < file > smscrpt.$$ && mv smscrpt.$$ file || rm smscrpt.$$
BTW, I got rid of the poor use of cat and replaced it with input redirection.
You're looking for sponge.
Using a temporary file is the correct solution here. When you use a redirection like '>', it is handled by the shell, and no matter how many commands are in your pipeline, the shell is free to delete and overwrite the output file before any command is executed (during pipeline setup).
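A small sketch shows that the truncation is done by the shell and not by the command. It uses true, which neither reads nor writes anything; the function name truncation_demo is made up for this example.

```shell
# truncation_demo: show that '>' empties the target file even when the
# command on the left writes nothing at all.
# The function name is hypothetical.
truncation_demo() {
  demo_file=$(mktemp) || return 1
  printf 'cat\ncat\ncat\n' > "$demo_file"
  true > "$demo_file"     # the shell opens and truncates the file first
  if [ -s "$demo_file" ]; then
    echo "file still has content"
  else
    echo "file is now empty"
  fi
  rm -f "$demo_file"
}

truncation_demo
```

This is why cat file | some_script > file can lose data: the file may already be empty by the time cat opens it.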
Another option is just to read the file into a variable:
file_contents=$(cat file)
echo "$file_contents" | script1 | script2 > file
Using mktemp(1) or tempfile(1) saves you the expense of having to think up a unique filename.
In response to the OP's question above about using sponge without external dependencies, and building on @D.Shawley's answer, you can have the effect of sponge with only a dependency on gawk, which is not uncommon on Unix or Unix-like systems:
cat foo | gawk -voutfn=foo '{lines[NR]=$0;} END {if(NR>0){print lines[1]>outfn;} for(i=2;i<=NR;++i) print lines[i] >> outfn;}'
The first print uses > so that the output file is truncated; the check for NR>0 avoids writing a blank line when there is no input.
To use this in a shell script, change -voutfn=foo to -voutfn="$1" or whatever syntax your shell uses for filename arguments. For example:
#!/bin/bash
cat "$1" | gawk -voutfn="$1" '{lines[NR]=$0;} END {if(NR>0){print lines[1]>outfn;} for(i=2;i<=NR;++i) print lines[i] >> outfn;}'
Note that, unlike real sponge, this may be limited to the size of RAM. sponge actually buffers in a temporary file if necessary.
Using a temporary file is IMO better than attempting to buffer the data in the pipeline.
It almost defeats the purpose of pipelines to buffer them.
I think you need to use mktemp. Something like this will work:
FILE=example-input.txt
TMP=`mktemp`
some_script <"$FILE" >"$TMP"
mv "$TMP" "$FILE"
I think that the best way is to use a temp file. However, if you want another approach, you can use something like awk to buffer up the input into memory before your application starts receiving input. The following script will buffer all of the input into the lines array before it starts to output it to the next consumer in the pipeline.
{ lines[NR] = $0; }
END {
for (line_no=1; line_no<=NR; ++line_no) {
print lines[line_no];
}
}
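You can collapse it into a one-liner if you want, though note that the shell still opens > file (truncating it) when the pipeline is set up, so this can race with cat reading the file: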
cat file | awk '{lines[NR]=$0;} END {for(i=1;i<=NR;++i) print lines[i];}' > file
With all of that, I would still recommend using a temporary file for the output and then overwriting the original file with it.
