Redirection is used to redirect stdout/stdin/stderr!
Ex: ls > log.txt.
Pipes are used to give the output of a command as input to another command.
Ex: ls | grep file.txt
Why exactly are these two operators doing the same thing?
Why not just write ls > grep to pass the output through? Isn't this just another kind of redirection?
I realize Linux is "Do one thing and do it well", so there has to be more of a logical reason that I'm missing.
You do need a differentiating syntax feature - and using > vs. | will do just fine.
If you used > in both scenarios, how would you know whether
ls > grep
is trying to write to a file named grep or send input to the grep command?
grep is perhaps not the best example, as you may then be tempted to disambiguate by the presence of grep's mandatory arguments; however, (optionally) argument-less commands do exist, such as column.
that other guy offers another example in the comments: test may refer to a test output file or to the argument-less invocation of the standard test command.
Another way of looking at it:
Your suggestion is essentially to use > as a generic send-output-somewhere operator, irrespective of the type of target (file vs. command).
However, that only shifts the need for disambiguation, and then you have to disambiguate when specifying the target - is it a file to output to or a command to run?
Given that the shell also has an implicit disambiguation feature when it comes to the first token of a simple command - foo [...] only ever invokes a command - differentiating at the level of the operator - > for outputting to files, | for sending to commands - is the sensible choice.
This would actually make > do two things, open a file or run a new program, depending on what the operand is. (Ignoring the ambiguity when the argument is the name of an executable file: do we overwrite it or run it?)
bash and some other shells provide additional syntax (process substitution) that does technically replace the need for |, although not in a way that you would choose to use it over a pipe. For instance, you can write
ls > >(grep regex)
>(...) is treated as the "name" of a file (in fact, you can run echo >(true) to see what that file name is), whose contents are provided to the enclosed command as input. So now, instead of a single operator | that handles connecting output from A to the input of B, you have one operator > to redirect output, and another operator to redirect input.
It's also symmetrical:
grep regex < <(ls)
# or grep regex <(ls), since grep can read from standard input or a named file
<(...) is the "name" of an input file whose contents come from the output of the enclosed command.
The benefit of process substitution (and its underlying basis, named pipes) shows when you want one process to write to many processes:
command1 | tee >(command2) >(command3) >(command4)
or for one process to read from many processes:
diff <(command1) <(command2)
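As a minimal sketch of the reading side, the following runs diff on the output of two commands and feeds grep through < <(...). It assumes bash; the file names a.txt and b.txt and their contents are made up for illustration.

```bash
# Throwaway directory with two hypothetical input files.
workdir=$(mktemp -d)
printf '%s\n' cherry apple banana > "$workdir/a.txt"
printf '%s\n' banana apple date > "$workdir/b.txt"

# Each <(...) expands to a path (such as /dev/fd/63) that diff opens like a file.
differences=$(diff <(sort "$workdir/a.txt") <(sort "$workdir/b.txt"))
printf '%s\n' "$differences"

# Same filtering as "ls | grep": stdin is redirected from a process substitution.
listing=$(grep '\.txt$' < <(ls "$workdir"))
printf '%s\n' "$listing"
```

Both forms block until the enclosed commands finish writing, so the results are complete by the time the next line runs.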
They are not doing the same job. If you were to take that example:
ls > grep
This is taking the output of ls and writing it to a file called grep.
Now if you were to do something like:
ls | grep '\.txt$'
This will take the output of ls and grep for any txt files. They in no way provide the same outcome.
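The difference can be checked in a scratch directory. This is only a sketch; the file names are invented, and it runs in a temporary directory so that the file named grep it creates does no harm.

```bash
# Work in a throwaway directory so nothing real is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch notes.txt report.txt image.png

# Redirection: the word after > is always a file name, so this creates ./grep.
ls > grep

# Pipe: the word after | is always a command, so this runs grep as a filter.
matches=$(ls | grep '\.txt$')

if [ -f grep ]; then echo "a regular file named grep now exists"; fi
printf 'pipe output:\n%s\n' "$matches"
```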
Here's my issue: I have a bunch of fastq.gz files and I need to determine the number of lines in each one (this is not the issue), and from that line count derive a threshold value, stored in a variable that is used further down in the same loop. I browsed but cannot find how to do it. Here's what I have so far:
for file in *R1.fastq*; do
var=echo $(zcat "$file" | $((`wc -l`/400000)))
for i in *Bacter*; do
awk -v var1=$var '{if($2 >= var1) print $0}' ${i} | wc -l >> bacter-filtered.txt
done
done
I get the error message: -bash: 14850508/400000: No such file or directory
any help would be greatly appreciated !
The problem is in the line
var=echo $(zcat "$file" | $((`wc -l`/400000)))
There are a bunch of shell syntax elements here combined in ways that don't connect up with each other. To keep things straight, I'd recommend splitting it into two separate operations:
lines=$(zcat "$file" | wc -l)
var=$((lines/400000))
(You may also have to do something about the output to bacter-filtered.txt -- it's just going to contain a bunch of numbers, with no identifications of which ones come from which files. Also since it always appends, if you run this twice you'll have the output from both runs stuck together. You might want to replace all those appends with a single > bacter-filtered.txt after the last done, so the whole output just gets stored directly.)
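Putting those suggestions together, the loop might look roughly like this. It is a sketch on made-up data: the sample files, the divisor of 4 (the question uses 400000) and the tab-separated output columns are all assumptions, and gzip -dc is used in place of zcat because it does the same job here.

```bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample data: an 8-line "fastq" and a small two-column table.
printf 'line%s\n' 1 2 3 4 5 6 7 8 | gzip > sample_R1.fastq.gz
printf '%s\n' 'a 1' 'b 2' 'c 5' > sample_Bacter.txt

divisor=4   # assumption for the toy data; the question divides by 400000

for file in *R1.fastq*; do
  lines=$(gzip -dc "$file" | wc -l)   # count lines of the decompressed file
  threshold=$((lines / divisor))      # integer division, done separately
  for i in *Bacter*; do
    # Count rows whose second column reaches the threshold.
    count=$(awk -v min="$threshold" '$2 >= min { n++ } END { print n + 0 }' "$i")
    printf '%s\t%s\t%s\n' "$file" "$i" "$count"
  done
done > bacter-filtered.txt   # one redirection for the whole loop, no appending

cat bacter-filtered.txt
```

Labelling each count with the two file names addresses the point about unidentified numbers, and the single > means a second run replaces the file instead of appending to it.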
What's wrong with the original? Well, let's start with this:
zcat "$file" | $((`wc -l`/400000))
Unless I completely misunderstand, the purpose here is to extract $file (with zcat), count lines in the result (with wc -l), and divide that by 400000. But since the output of zcat isn't piped directly to wc, it's piped to a complex expression involving wc, it's somewhat ambiguous what should happen, and is actually different under different shells. In zsh, it does something completely different from that: it lets wc read from the script's stdin (generally your Terminal), divides the result from that by 400000, and then pipes the output of zcat to that ... number?
In bash, it does something closer to what you want: wc actually does read from the output of zcat, so the second part of the pipe essentially turns into:
... | $((14850508/400000))
Now, what I'd expect to happen at this point (and happens in my tests) is that it should evaluate $((14850508/400000)) into 37, giving:
... | 37
which will then try to execute 37 as a command (because it's part of a pipeline, and therefore is supposed to be a command). But for some reason it's apparently not evaluating the division and just trying to execute 14850508/400000 as a command. Which doesn't really work any better or worse than 37, so I guess it doesn't matter much.
So that's where the error is coming from, but there's actually another layer of confusion in the original line. Suppose that internal pipeline was fixed so that it properly output "37" (rather than trying to execute it). The outer structure would then be:
var=echo $(cmdthatprints37)
The $( ) basically means "run the command inside, and substitute its output into the command line here", so that would evaluate to:
var=echo 37
...which, in shell syntax, means "run the command 37 with var set to "echo" in its environment".
The solution here would be simple. The echo is messing everything up so remove it:
var=$(cmdthatprints37)
...which evaluates to:
var=37
...which is what you want. Except that, as I said above, it'd be better to split it up and do the command bits and the math separately rather than getting them mixed up.
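The "assignment in front of a command" behaviour is easy to see in isolation. In this small sketch the variable name greeting is arbitrary, and sh -c stands in for whatever command follows the assignment.

```bash
unset greeting

# name=value before a command sets the variable only in that command's environment.
inside=$(greeting=hello sh -c 'printf "%s" "$greeting"')

# The calling shell never got the variable.
outside=${greeting:-unset}

echo "inside the command: $inside"
echo "in the calling shell: $outside"
```

This is why var=echo 37 does not assign anything useful: the shell treats 37 as the command and var=echo as its temporary environment.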
BTW, I'd also recommend some additional double-quoting of shell variables; shellcheck.net will be happy to point out where.
I'm piping a command's output to be used as arguments for an executable:
command | xargs -d '\n' "executable"
When the command yields sufficiently many lines of output I can see that the executable is run multiple times with sub-pages of the data each run. This is problematic because the state in the executable assumes that each invocation is independent from the next.
Is it possible to force the "command" to feed the entire output in a single go to the executable?
Don't use xargs, use $(...) to substitute the output into the command line.
IFS=$'\n' # this is analogous to -d '\n' in xargs
set -o noglob # prevent wildcard expansion when substituting command output
executable $(command)
However, this could get an error if the output of command is too long. xargs splits it up into multiple invocations to prevent this. But if you really require everything to be in one invocation, the error is the way to tell that this isn't possible, and prevents the incorrect results due to multiple invocations.
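A quick way to convince yourself that this yields one argument per line is to substitute stand-ins for both sides. In the sketch below, count_args and produce are hypothetical placeholders for executable and command; the settings are confined to a subshell so they do not leak into the rest of the script.

```bash
# Hypothetical stand-in for "executable": reports how many arguments it received.
count_args() { printf '%s\n' "$#"; }

# Hypothetical stand-in for "command": three lines, one with a space, one with a glob.
produce() { printf '%s\n' 'first item' 'second*' 'third'; }

received=$(
  IFS=$'\n'        # split the substituted output on newlines only
  set -o noglob    # keep "second*" literal instead of expanding it
  count_args $(produce)
)
echo "arguments received in one invocation: $received"
```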
Is it possible to force the "command" to feed the entire output in a single go to the executable?
Yes and no.
To run the executable only once, you can use
command | bash -c 'mapfile -t a; executable "${a[@]}"'
However, this might fail if you exceed ARG_MAX of your system. A program invocation together with its arguments and environment variables must be smaller than ARG_MAX bytes. (On Linux there is even an additional restriction limiting the size of each single argument). There is no way around this.
You can check your ARG_MAX using getconf ARG_MAX or xargs --show-limits < /dev/null. This website compiled a nice list of the values on various systems.
If you are barely over the maximum and don't need environment variables, you can clear the environment to make some space.
command | env -i bash -c 'mapfile -t a; executable "${a[@]}"'
Other than that, there is no way but to run executable multiple times or to modify it, preferably so that it reads lines from stdin instead of taking arguments. That way you can write
command | executable
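To see what the mapfile variant hands over, here is a sketch in which printf plays the role of executable and a hypothetical produce function plays the role of command. It needs bash 4 or later for mapfile.

```bash
# Hypothetical stand-in for "command".
produce() { printf '%s\n' 'first item' 'second*' 'third'; }

# mapfile reads every line of stdin into array a; "${a[@]}" then expands to
# one argument per line, with spaces and glob characters left untouched.
summary=$(produce | bash -c 'mapfile -t a; printf "%s|" "${#a[@]}" "${a[@]}"')
echo "$summary"
```

The first field is the number of array elements, followed by each element exactly as it was read.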
First I create 3 files:
$ touch alpha bravo carlos
Then I want to save the list to a file:
$ ls > info.txt
However, info.txt always shows up in its own contents:
$ cat info.txt
alpha
bravo
carlos
info.txt
It looks like the redirection operator creates my info.txt first.
In this case, my question is: how can I save my list of files before info.txt is created?
The main question is about the redirection operator. Why does it act first, and how can I delay it so that my task completes first? Please use the example above to answer it.
When you redirect a command's output to a file, the shell opens a file handle to the destination file, then runs the command in a child process whose standard output is connected to this file handle. There is no way to change this order, but you can redirect to a file in a different directory if you don't want the ls output to include the new file.
ls >/tmp/info.txt
mv /tmp/info.txt ./
In a production script, you should make sure that the file name is unique and unpredictable.
t=$(mktemp -t lstemp.XXXXXXXXXX) || exit
trap 'rm -f "$t"' INT HUP
ls >"$t"
mv "$t" ./info.txt
Alternatively, capture the output into a variable, and then write that variable to a file.
files=$(ls)
echo "$files" >info.txt
As an aside, probably don't use ls in scripts. If you want a list of files in the current directory
printf '%s\n' *
does that.
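The ordering can be observed side by side. This sketch recreates the three files from the question in a temporary directory and compares a direct redirection with the capture-first approach.

```bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch alpha bravo carlos

# The shell creates info.txt before ls starts, so ls lists it too.
ls > info.txt
direct=$(cat info.txt)
rm info.txt

# Capturing first means info.txt does not exist yet while ls runs.
files=$(ls)
printf '%s\n' "$files" > info.txt
captured=$(cat info.txt)

printf 'direct redirection:\n%s\n\n' "$direct"
printf 'capture, then write:\n%s\n' "$captured"
```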
One simple approach is to save your command output to a variable, like this:
ls_output="$(ls)"
and then write the value of that variable to the file, using any of these commands:
printf '%s\n' "$ls_output" > info.txt
cat <<< "$ls_output" > info.txt
echo "$ls_output" > info.txt
Some caveats with this approach:
Bash variables can't contain null bytes. If the output of the command includes a null byte, that byte and everything after it will be discarded.
In the specific case of ls, though, this shouldn't be an issue, because the output of ls should never contain a null byte.
$(...) removes trailing newlines. The above compensates for this by adding a newline while creating info.txt, but if the command output ends with multiple newlines, then the above will effectively collapse them into a single newline.
In the specific case of ls, this could happen if a filename ends with a newline — very unusual, and unlikely to be intentional, but nonetheless possible.
Since the above adds a newline while creating info.txt, it will put a newline there even if the command output doesn't end with a newline.
In the specific case of ls, this shouldn't be an issue, because the output of ls should always end with a newline.
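The trailing-newline caveat can be demonstrated with a made-up command whose output ends in three newlines; printf stands in for the command here.

```bash
# The hypothetical command prints "one" followed by two empty lines.
original_lines=$(printf 'one\n\n\n' | wc -l)

# Command substitution strips every trailing newline...
captured=$(printf 'one\n\n\n')

# ...and writing the variable back adds exactly one.
rewritten_lines=$(printf '%s\n' "$captured" | wc -l)

echo "lines in original output: $((original_lines))"
echo "lines after capture and rewrite: $((rewritten_lines))"
```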
If you want to avoid the above issues, another approach is to save your command output to a temporary file in a different directory, and then move it to the right place; for example:
tmpfile="$(mktemp)"
ls > "$tmpfile"
mv -- "$tmpfile" info.txt
. . . which obviously has different caveats (e.g., it requires access to write to a different directory), but should work on most systems.
One way to do what you want is to exclude the info.txt file from the ls output.
If you can rename the list file to .info.txt then it's as simple as:
ls >.info.txt
ls doesn't list files whose names start with . by default.
If you can't rename the list file but you've got GNU ls then you can use:
ls --ignore=info.txt >info.txt
Failing that, you can use:
ls | grep -v '^info\.txt$' >info.txt
All of the above options have the advantage that you can safely run them after the list file has been created.
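As a sketch of that re-run safety, the grep -v variant can be executed twice in a scratch directory containing the three files from the question; both runs should produce the same list.

```bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch alpha bravo carlos

# First run: info.txt is created by the redirection but filtered out of the list.
ls | grep -v '^info\.txt$' > info.txt
first=$(cat info.txt)

# Second run: info.txt already exists and is still excluded.
ls | grep -v '^info\.txt$' > info.txt
second=$(cat info.txt)

printf '%s\n' "$second"
```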
Another general approach is to capture the output of ls with one command and save it to the list file with a second command. As others have pointed out, temporary files and shell variables are two specific ways to capture the output. Another way, if you've got the moreutils package installed, is to use the sponge utility:
ls | sponge info.txt
Finally, note that you may not be able to reliably extract the list of files from info.txt if it contains plain ls output. See ParsingLs - Greg's Wiki for more information.
I'm having difficulty grasping how pipes work. Initially I thought of them as per the title but I couldn't get a simple example to work e.g.
mkdir temp
cd temp
echo "rubbish" > txtfile
ls | cat
I'm wondering why it returns the output from 'ls' rather than the output of 'cat txtfile' (i.e. "rubbish"). I've read many pipe tutorials but none of them seem to go beyond "STDOUT of LHS becomes STDIN for RHS", and I'm left wondering what the STDIN of the RHS is. Does it become the first argument? Where does it slot in when the RHS of the pipe has options or more than one argument? Is there any kind of macro substitution taking place, or is my thinking wide of the mark?
Edit: I'm still none the wiser 5 comments later. I'll certainly take a look at Roadowl's pv utility but for now if I type
ls | cut -c 2-4
I get
xtf
which I'd expect. So, does cut take its input from stdin but cat doesn't?
Edit2: I stuck the question up on askubuntu (I originally put it up here by mistake). The answer there https://askubuntu.com/questions/1316848/does-output-from-lhs-of-pipe-become-an-arg-for-rhs-of-pipe throws a bit more light on it.
Edit3: While reading the answers here and on Ask Ubuntu, and the links therein, it struck me (again) how woeful bash (& cohorts) are. It's almost like they're designed to trip you up. I only started using bash a couple of months back and every time I write a script I have to read endless web pages to get it to work or discover where I'm going wrong. Take a simple [[ $1=="..." ]] condition. You forget the spaces round the operator and the else condition might wipe some files you want without so much as a warning. Yes, you can do great things with it without a lot of typing but at times it's like using a tightrope to get from skyscraper A to skyscraper B to avoid using 2 lifts. What's up with good old C code like cat(ls())? That said, thanks to everyone who contributed.
I guess you meant that, when running
ls | cat
ls should output txtfile, which should then be handed to the cat command as a file to read.
But what happens in the background is different:
First, your shell creates a pipe using the pipe(int pipefd[2]) system call. This pipe has two ends: one for reading and one for writing.
While the ls command is executing, it writes its output to the write end of the pipe, and cat simultaneously reads from the read end.
So here, the STDOUT of ls is the write end of the pipe, whereas the STDIN of cat is the read end.
While reading from the pipe, cat treats the data as a stream of bytes, not as the name of a file.
So basically, cat prints whatever arrives as a stream of bytes.
Read about pipe() over here : pipe(2) — Linux manual page
ls | cut -c 2-4
Here, cut reads its standard input, gets the line txtfile, takes characters 2 to 4 from it, producing xtf, and prints that on standard output. That's what the command line option tells it to do.
ls | cat
Here, cat reads its standard input, gets the line txtfile, and prints that on standard output, unchanged. That's what cat does. If there were further lines, it would do the same for those.
Both read standard input unless one or more file names are given as arguments. That standard input is connected to the terminal (the same one where you enter the command line), unless you use pipes or redirections to change that.
So, run the command cut -c 2-4, and enter the line abcdefghijkl, and it will print out bcd. Because without any arguments, it reads its standard input, which is the terminal, by default. Similarly for running just cat, you'll get back the same line you entered.
Running ls | cut -c 2-4 changes where the standard input comes from, but it doesn't create any new command line arguments (other than the -c and 2-4 you gave). Command line arguments are not the same as the standard input.
So, echo txtfile | cat is not the same as running cat txtfile, any more than running echo txtfile | cut -c 2-4 is the same as running cut -c 2-4 txtfile. For some reason, you seem to expect the pipe should work differently for cat than it does for cut.
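The distinction can be tried out directly with the setup from the question. The last line of this sketch uses xargs, which is not part of the original example; it is included only to show one standard way of turning standard input into arguments when that is what you actually want.

```bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1
echo "rubbish" > txtfile

from_stdin=$(ls | cat)        # cat copies the bytes it reads: the text "txtfile"
from_arg=$(cat txtfile)       # cat opens the file named by its argument
cut_out=$(ls | cut -c 2-4)    # cut slices the text "txtfile" itself
via_xargs=$(ls | xargs cat)   # xargs converts stdin lines into arguments for cat

echo "ls | cat        -> $from_stdin"
echo "cat txtfile     -> $from_arg"
echo "ls | cut -c 2-4 -> $cut_out"
echo "ls | xargs cat  -> $via_xargs"
```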
I have this script:
#!/bin/bash
FASTQFILES=~/Programs/ncbi-blast-2.2.29+/DB_files/*.fastq
FASTAFILES=~/Programs/ncbi-blast-2.2.29+/DB_files/*.fasta
clear
for file in $FASTQFILES
do cat $FASTQFILES | perl -e '$i=0;while(<>){if(/^\@/&&$i==0){s/^\@/\>/;print;}elsif($i==1){print;$i=-3}$i++;}' > ~/Programs/ncbi-blast-2.2.29+/DB_files/"${FASTQFILES%.*}.fasta"
mv $FASTAFILES ~/Programs/ncbi-blast-2.2.29+/db/
done
I'm trying to get it to grab the files defined in $FASTQFILES, do the .fastq to .fasta conversion, name the output with the same filename as the input, and move it to a new folder. E.g., ~/./DB_files/HELLO.fastq should give a converted ~/./db/HELLO.fasta
The problem is that the output of the conversion is a properly formatted hidden file called .fasta in the first folder instead of the expected one named HELLO.fasta. So there is nothing to mv. I think I'm messing up in the ${FASTQFILES%.*}.fasta argument but I can't seem to fix it.
I see three problems:
One part of your trouble is that you use cat $FASTQFILES instead of cat $file.
You also need to fix the I/O redirection at the end of that line. Since $file already holds the full path, redirect to > "${file%.fastq}.fasta" rather than prefixing the directory again.
The mv command needs to be executed outside the loop.
In fact, when processing a single file at a time, you don't need to use cat at all (UUOC — Useless Use Of Cat). Simply provide "$file" as an argument to the Perl script.