Redirecting two files to standard input - bash

There are several unix commands that are designed to operate on two files. Commonly, such commands allow the contents of one of the "files" to be read from standard input by using a single dash in place of the file name.
I just came across a technique that seems to allow both files to be read from standard input:
comm -12 <(sort file1) <(sort file2)
My initial disbelieving reaction was, "That shouldn't work. Standard input will just have the concatenation of both files. The command won't be able to tell the files apart or even realize that it has been given the contents of two files."
Of course, this construction does work. I've tested it with both comm and diff using bash 3.2.51 on cygwin 1.7.7. I'm curious how and why it works:
Why does this work?
Is this a Bash extension, or is this straight Bourne shell functionality?
This works on my system, but will this technique work on other platforms? (In other words, will scripts written using this technique be portable?)

Bash, Korn shell (ksh93, anyway) and Z shell all support process substitution. These appear as files to the utility. Try this:
$ bash -c 'echo <(echo)'
/dev/fd/63
$ ksh -c 'echo <(echo)'
/dev/fd/4
$ zsh -c 'echo <(echo)'
/proc/self/fd/12
You'll see file descriptors similar to the ones shown.
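To convince yourself that the two substituted "files" really stay separate (rather than being concatenated on standard input), you can run the question's command on throwaway data; the file names and contents here are arbitrary:
printf 'b\na\nc\n' > file1
printf 'c\nd\nb\n' > file2
comm -12 <(sort file1) <(sort file2)    # prints only the common lines: b and c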

This is a Bash extension (process substitution), not standard Bourne shell functionality. <(sort file1) opens a pipe with the output of the sort file1 command, gives the pipe a temporary file name, and passes that temporary file name on the comm command line.
You can see how it works by getting echo to tell you what's being passed to the program:
echo <(sort file1) <(sort file2)
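If you need the same effect in a shell without process substitution (plain Bourne/POSIX sh), the plumbing can be spelled out with explicit named pipes. This is only a sketch of roughly what <(...) automates (bash uses /dev/fd entries or temporary FIFOs internally); the FIFO paths are arbitrary:
mkfifo /tmp/sorted1 /tmp/sorted2
sort file1 > /tmp/sorted1 &    # each writer blocks until comm opens its FIFO
sort file2 > /tmp/sorted2 &
comm -12 /tmp/sorted1 /tmp/sorted2
rm /tmp/sorted1 /tmp/sorted2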

Related

xargs -a [file] mv -t [new-directory] gives me mv: cannot stat `filename*': No such file or directory error

I have been trying to run this command (that I have run before in a different directory), and everything I've read on the message boards has not solved my unknown issue.
Of note: 1) the files exist in this directory; 2) I have proper permissions to move these files around; 3) I have run this exact line of code before and it has worked; 4) I tried listing files with and without '*' to capture all the files (see below); 5) I also tried to list each file as 'Sample1', but that did not work.
xargs -a [filename.txt] mv -t [new-directory]
I have file beginnings (I have ~5 files for each beginning), and I want to move all the files associated with that beginning.
Example: Sample1.bam Sample1.sorted.bam, etc
The lines in the file are listed as such:
Sample1*
Sample2*
Sample3* ...etc.
What am I doing incorrectly and how can I fix it?
TIA!
When you execute a command using 'xargs', the arguments are passed directly to the called program ('mv' in your case). Wildcard patterns in the input are not expanded: 'Sample1*' is passed as-is to 'mv', which issues an error message about not having a file named 'Sample1*'.
To get file name expansion, you want to use the shell. One way to handle this situation is
xargs -a FILENAME.TXT -I__ sh -c "mv -t NEW-FOLDER -- __"
Security note: the code provides some protection against command-line injection (e.g., a file name starting with '-'). However, other attacks are still possible. A safer version is
cat FILENAME.txt | grep '^[A-Za-z0-9][A-Za-z0-9._-]*$' | xargs -I__ sh -c "mv -t NEW-FOLDER -- __"
which limits the input to names built from alphanumerics (plus '.', '_' and '-'). The grep pattern can be extended as needed, for example to allow the trailing '*' used in the question's input.
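For comparison, here is a plain shell-loop sketch of the same idea (not from the answer above, just an illustration), assuming FILENAME.txt holds one pattern per line as in the question:
while IFS= read -r pattern; do
    # leaving $pattern unquoted is deliberate: the shell expands Sample1* etc.;
    # an unmatched pattern is passed through literally and mv will complain
    mv -t NEW-FOLDER -- $pattern
done < FILENAME.txt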
With GNU Parallel you would do something like:
cat FILENAME.txt | parallel mv {} NEW-FOLDER
One of the benefits of GNU Parallel is that it deals correctly with file names like:
My brother's 12" records cost > $1000.txt

using cat in a bash script is very slow

I have very big text files (~50,000) over which I have to do some text processing, basically running multiple grep commands.
When I run it manually it returns in an instant, but when I do the same in a bash script it takes a lot of time. What am I doing wrong in the bash script below? I pass the names of the files as command-line arguments to the script.
Example Input data :
BUSINESS^GFR^GNevil
PERSONAL^GUK^GSheila
Output that should come in a file - BUSINESS^GFR^GNevil
It starts printing out the whole file on the terminal after quite a while. How do I suppress that?
#!/bin/bash
cat $2 | grep BUSINESS
Do NOT use cat with a program that can read the file itself.
It slows things down and you lose functionality:
grep BUSINESS test | grep '^GFR|^GDE'
Or you can do it like this with awk:
awk '/BUSINESS/ && /^GFR|^GDE/' test
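As for the other part of the question (stopping the matches from scrolling on the terminal), redirect the script's output to a file. A minimal sketch of the rewritten script, assuming the input file is still passed as the second argument as in the original script, and with business_only.txt as an arbitrary output name:
#!/bin/bash
# read the file directly with grep (no cat) and write the matches to a file
# instead of the terminal
grep BUSINESS "$2" > business_only.txt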

GNU 'ls' command not outputting the same over a pipe [duplicate]

When I execute the command ls on my system, I get the following output:
System:~ user# ls
asd goodfile testfile this is a test file
However, when I pipe ls to another program (such as cat or gawk), the following is output:
System:~ user# ls | cat
asd
goodfile
testfile
this is a test file
How do I get ls to read the terminal size and output the same over a pipe as it does when printing directly to the terminal?
This question has been solved.
Since I'm using bash, I used the following to achieve the desired output:
System:~ user# ls -C -w "$(tput cols)" | cat
Use ls -C to get columnar output again.
When ls detects that its output isn't a terminal, it assumes that its output is being processed by some other process that wants to parse it, so it switches to -1 (one-entry-per-line) mode to make parsing easier. To make it format in columns as when it's outputting directly to a terminal, use -C to switch back to column mode.
(Note, you may also have to use --color if you care about color output, which is also normally suppressed by outputting to a pipe.)
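For example, if you do want the colors to survive the pipe, you can force them and hand the output to a pager that understands the escape codes; a sketch assuming GNU ls and less:
ls -C --color=always -w "$(tput cols)" | less -R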
Maybe -x "list entries by lines instead of by columns" with possible -w "assume screen width instead of current value" is what you need.
When the output goes to a pipe or non-terminal, the output format is like ls -1. If you want the columnar output, use ls -C instead.
The reason for the discrepancy is that it is usually easier to parse one-line-per-file output in shell scripts.
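The check that ls performs internally is essentially "is standard output a terminal?"; you can reproduce it in the shell with the -t test (a small illustration of the idea, not ls's actual code):
if [ -t 1 ]; then
    echo "stdout is a terminal ($(tput cols) columns): columnar output"
else
    echo "stdout is a pipe or file: one entry per line"
fi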

awk: Output to different processes

I have an awk script which splits a big file into several files by some condition. Then I run another script over each file in parallel.
awk -f script.awk -v DEST_FOLDER=tmp input.file
find tmp/ -name "*.part" | xargs -P $ALLOWED_CPUS --replace --verbose /bin/bash -c "./process.sh {}"
The question is: is there any way to run ./process.sh:
before the first script is done, because process.sh processes a file line by line (its lines are too long to be passed to xargs directly);
each new file has a header (added in script.awk) that should be processed before the rest of the file;
the number of parallel processes must be limited;
GNU Parallel and inotifywait are not options;
assume the destination folder is empty and the file names are unknown.
The purpose of the optimization is to avoid waiting until awk is done while some files are already ready to be processed.
Once you have created a file, you can pass the filename to a process' or script's input:
awk '{print name_of_created_file | "./process.sh &"}'
& sends process.sh to the background, so that they can run in parallel. However, this is a gawk extension and not POSIX. Check the manual
You basically give the answer yourself: GNU Parallel + inotifywait will work.
Since you are not allowed to use inotifywait, you can make your own substitute for it. If you are allowed to write your own script, you are also allowed to run GNU Parallel (as that is just a script).
So something like this:
awk -f script.awk -v DEST_FOLDER=tmp input.file &
sleep 1
record file sizes of files in tmp
while tmp is not empty do
    for files in tmp:
        if file size is unchanged: print file
        record new file size
    sleep 1
done | parallel 'process {}; rm {}'
It is assumed that awk will produce some output within one second. If that takes longer, adjust the sleeps accordingly.
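For completeness, a minimal runnable bash sketch of that polling loop; it assumes the part files match tmp/*.part (as in the question's find command) and that $ALLOWED_CPUS is set, and it uses wc -c as the size check and a one-second poll interval as arbitrary choices:
awk -f script.awk -v DEST_FOLDER=tmp input.file &
awk_pid=$!
sleep 1

{
    declare -A prev emitted
    while :; do
        pending=0
        for f in tmp/*.part; do
            [ -e "$f" ] || continue               # glob matched nothing yet
            [ -n "${emitted[$f]}" ] && continue   # already handed off
            cur=$(wc -c < "$f")
            if [ "${prev[$f]:-}" = "$cur" ]; then
                emitted[$f]=1                     # size stable: assume the file is complete
                printf '%s\n' "$f"
            else
                prev[$f]=$cur
                pending=1
            fi
        done
        # stop once awk has exited and nothing is still waiting to stabilize
        kill -0 "$awk_pid" 2>/dev/null || [ "$pending" -ne 0 ] || break
        sleep 1
    done
} | parallel -j "$ALLOWED_CPUS" './process.sh {}; rm {}'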

Why doesn't piping to the same file work on some platforms?

In cygwin, the following code works fine
$ cat junk
bat
bat
bat
$ cat junk | sort -k1,1 |tr 'b' 'z' > junk
$ cat junk
zat
zat
zat
But in the Linux shell (GNU/Linux), it seems that overwriting doesn't work:
[41] othershell: cat junk
cat
cat
cat
[42] othershell: cat junk |sort -k1,1 |tr 'c' 'z'
zat
zat
zat
[43] othershell: cat junk |sort -k1,1 |tr 'c' 'z' > junk
[44] othershell: cat junk
Both environments run BASH.
I am asking this because sometimes, after I do some text manipulation, this caveat forces me to make a tmp file. But I know that in Perl you can give the -i flag to overwrite the original file after some operations/manipulations. I just want to ask if there is any foolproof method in a unix pipeline to overwrite the file that I am not aware of.
Four main points here:
"Useless use of cat." Don't do that.
You're not actually sorting anything with sort. Don't do that.
Your pipeline doesn't say what you think it does. Don't do that.
You're trying to over-write a file in-place while reading from it. Don't do that.
One of the reasons you are getting inconsistent behavior is that you are piping to a process that has redirection, rather than redirecting the output of the pipeline as a whole. The difference is subtle, but important.
What you want is to create a compound command with Command Grouping, so that you can redirect the input and output of the whole pipeline. In your case, this should work properly:
{ sort -k1,1 | tr 'c' 'z'; } < junk > sorted_junk
Please note that without anything to sort, you might as well skip the sort command too. Then your command can be run without the need for command grouping:
tr 'c' 'z' < junk > sorted_junk
Keep redirections and pipelines as simple as possible. It makes debugging your scripts much easier.
However, if you still want to abuse the pipeline for some reason, you could use the sponge utility from the moreutils package. The man page says:
sponge reads standard input and writes it out to the specified
file. Unlike a shell redirect, sponge soaks up all its input before
opening the output file. This allows constructing pipelines that read
from and write to the same file.
So, your original command line can be re-written like this:
cat junk | sort -k1,1 | tr 'c' 'z' | sponge junk
and since junk will not be overwritten until sponge receives EOF from the pipeline, you will get the results you were expecting.
In general this can be expected to break. The processes in a pipeline are all started up in parallel, so the > junk at the end of the line will usually truncate your input file before the process at the head of the pipeline has finished (or even started) reading from it.
Even if bash under Cygwin lets you get away with this, you shouldn't rely on it. The general solution is to redirect to a temporary file and then rename it when the pipeline is complete.
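A minimal sketch of that temp-file approach for the command in the question (junk.tmp is an arbitrary temporary name):
sort -k1,1 junk | tr 'c' 'z' > junk.tmp && mv junk.tmp junk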
If you want to edit that file, you can just use an editor:
ex junk << EOF
%!(sort -k1,1 |tr 'b' 'z')
x
EOF
Overwriting the same file in a pipeline is not advised, because when you make a mistake you can't get the original back (unless you have a backup or it's under version control).
This happens because the input and output in the pipeline are buffered (which gives you the impression that it works), but the commands actually run in parallel. Different platforms buffer the output differently (based on the settings), so on some you end up with an empty file (because the output file is created at the start), on others with a half-finished file.
The solution is to use some method where the file is only overwritten after the full input has been buffered and processed, i.e. on EOF.
This can be achieved by:
Use a utility which soaks up all its input before opening the output file.
This can be done with sponge (as opposed to unbuffer from the expect package).
Avoid the I/O redirection syntax (which creates the empty file before the command even starts).
For example, use tee (which buffers its standard streams):
cat junk | sort | tee junk
This only works with sort, because sort has to read all of its input before it can produce any output. So if your command doesn't use sort, add one.
Another tool which can be used is stdbuf, which runs a command with modified buffering for its standard streams, letting you specify the buffer size.
Use a text processor which can edit files in place (such as sed or ex).
Example:
$ ex -s +'%!sort -k1' -cxa myfile.txt
$ sed -i '' s/foo/bar/g myfile.txt
Using the following simple script, you can make it work like you want to:
$ cat junk | sort -k1,1 |tr 'b' 'z' | overwrite_file.sh junk
overwrite_file.sh
#!/usr/bin/env bash
OUT=$(cat -)
FILENAME="$*"
echo "$OUT" | tee "$FILENAME"
Note that if you don't want the updated file to be sent to stdout, you can use this approach instead:
overwrite_file_no_output.sh
#!/usr/bin/env bash
OUT=$(cat -)
FILENAME="$*"
echo "$OUT" > "$FILENAME"

Resources