Bash - Redirection with wildcards

I'm trying to do redirection with wildcards. Something like:
./TEST* < ./INPUT* > OUTPUT
Anyone have any recommendations? Thanks.

Say you have the following 5 files: TEST1, TEST2, INPUT1, INPUT2, and OUTPUT. The command line
./TEST* < ./INPUT* > OUTPUT
will expand to
./TEST1 ./TEST2 < ./INPUT1 ./INPUT2 > OUTPUT.
In other words, you will run the command ./TEST1 with 2 arguments (./TEST2, ./INPUT2), with its input redirected from ./INPUT1 and its output redirected to OUTPUT.
To address what you are probably trying to do, you can only specify a single file using input redirection. To send input to TEST from both of the INPUT* files, you would need to use something like the following, using process substitution:
./TEST1 < <(cat ./INPUT*) > OUTPUT
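Equivalently (assuming TEST1 only reads from standard input), you could simply pipe the concatenated input files in:
cat ./INPUT* | ./TEST1 > OUTPUT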
To run each of the programs that matches TEST* on all the input files that match INPUT*, use the following loop. It collects the output of all the commands and puts them into a single file OUTPUT.
for test in ./TEST*; do
cat ./INPUT* | "$test"
done > OUTPUT

There is a program called TEST* that has to get input redirected into it from files called INPUT*, but the thing is there are many TEST programs and they all have a different number, e.g. TEST678. What I'm trying to do is push all of the INPUT files into all of the TEST programs.
You can write:
for program in TEST* # e.g., program == 'TEST678'
do
suffix="${program#TEST}" # e.g., suffix == '678'
input="INPUT$suffix" # e.g., input == 'INPUT678'
"./$program" < "$input" # e.g., run './TEST678 < INPUT678'
done > OUTPUT
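If some TEST program has no matching INPUT file, the redirection will fail on that iteration. A small guard you could add right after the input="INPUT$suffix" line (an assumption about the desired behavior, not part of the original answer):
[ -e "$input" ] || continue # skip programs with no matching INPUT file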

for test in ./TEST*; do
for inp in ./INPUT*; do
"$test" < "$inp" >> OUTPUT
done
done

Related

Passing arguments to a shell script via stdin multiple times

I have a script StartProcess.sh that accepts two options in stdin - 3 and a filename test.xml.
If I run the below script, it executes correctly, and waits again for the input.
I want some way to pass 3 and test.xml n times to StartProcess.sh. How do I achieve this?
./StartProcess.sh << STDIN -o other --options
3
test.xml
STDIN
You can run a loop to pass the arguments as many times as needed and feed the script over a pipeline. That way, the script is launched only once and the arguments are sent over stdin any number of times of your choosing:
count=3
for (( iter = 0; iter < count; iter++ )); do
printf '%s\n' "3" "test.xml"
done | ./StartProcess.sh
But I'm not fully sure if you wanted to pass the literal string test.xml as an argument or the content of the file.
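If the literal strings are what the script reads, a shorter sketch of the same idea (assuming ./StartProcess.sh consumes the two lines from stdin on every pass) is:
printf '3\ntest.xml\n%.0s' {1..3} | ./StartProcess.sh
Here printf repeats its format once per argument, and %.0s consumes each of the three arguments without printing anything, so the two input lines are emitted three times.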

bash script to modify and extract information

I am creating a bash script to modify and summarize information with grep and sed. But it gets stuck.
#!/bin/bash
# This script extracts some basic information
# from text files and prints it to screen.
#
# Usage: ./myscript.sh </path/to/text-file>
#Extract lines starting with ">#HWI"
ONLY=`grep -v ^\>#HWI`
#replaces A and G with R in lines
ONLYR=`sed -e s/A/R/g -e s/G/R/g $ONLY`
grep R $ONLYR | wc -l
The correct way to write a shell script to do what you seem to be trying to do is:
awk '
!/^>#HWI/ {
gsub(/[AG]/,"R")
if (/R/) {
++cnt
}
}
END { print cnt+0 }
' "$@"
Just put that in the file myscript.sh and execute it as you do today.
To be clear - the bulk of the above code is an awk script, the shell script part is the first and last lines where the shell just calls awk and passes it the input file names.
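For example, with a hypothetical input file:
./myscript.sh reads.txt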
If you WANT to have intermediate variables then you can create/print them with:
awk '
!/^>#HWI/ {
only = $0
onlyR = only
gsub(/[AG]/,"R",onlyR)
print "only:", only
print "onlyR:", onlyR
if (onlyR ~ /R/) {
++cnt
}
}
END { print cnt+0 }
' "$@"
The above will work robustly, portably, and efficiently on all UNIX systems.
First of all, as @fedorqui commented, you're not providing grep with a source of input against which it will perform line matching.
Second, there are some problems in your script, which will result in unwanted behavior in the future, when you decide to manipulate some data:
Store matching lines in an array, or a file from which you'll later read values. The variable ONLY is not the right data structure for the task.
By convention, environment variables (PATH, EDITOR, SHELL, ...) and internal shell variables (BASH_VERSION, RANDOM, ...) are fully capitalized. All other variable names should be lowercase. Since
variable names are case-sensitive, this convention avoids accidentally overriding environmental and internal variables.
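A minimal illustration of the risk (hypothetical values, not from the script in question):
PATH="some text"    # clobbers the real PATH; external commands may stop resolving
path="some text"    # a lowercase name stays out of the shell's way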
Here's a better version of your script, considering these points, but with an open question regarding what you were trying to do in the last line, grep R $ONLYR | wc -l:
#!/bin/bash
# This script extracts some basic information
# from text files and prints it to screen.
#
# Usage: ./myscript.sh </path/to/text-file>
input_file=$1
# Read lines not matching the provided regex, from $input_file
mapfile -t only < <(grep -v '^\>#HWI' "$input_file")
#replaces A and G with R in lines
for ((i = 0; i < ${#only[@]}; i++)); do
only[i]="${only[i]//[AG]/R}"
done
# DEBUG
printf '%s\n' "Here are the lines, after replace:"
printf '%s\n' "${only[@]}"
# I'm not sure what you were trying to do here. Am I guessing right that you wanted
# to count the number of R's in ALL lines ?
# grep R $ONLYR | wc -l
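If counting was indeed the goal, here is a sketch of what could replace that commented-out line, depending on whether you want the number of lines containing an R after the replacement or the total number of R characters:
# lines containing at least one R
printf '%s\n' "${only[@]}" | grep -c 'R'
# total number of R characters across all lines
printf '%s\n' "${only[@]}" | grep -o 'R' | wc -l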

Synchronized Output With Bash's Process Substitution

I have to call an inflexible external tool multiple times; it takes as arguments some input data and an output file to which it will write the processed data, for example:
some_prog() { echo "modified_$1" > "$2"; }
For varying input, I want to call some_prog, filter the output and write the output of all calls into the same file "out_file". Additionally, I want to add a header line to the output file before each call of some_prog. Given the following dummy filter:
slow_filter() {
read input; sleep "0.000$(($RANDOM % 10))"; echo "filtered_$input"
}
I wrote the following code:
rm -f out_file
for input in test_input{1..8}; do
echo "#Header_for_$input" >> "out_file"
some_prog $input >( slow_filter >> "out_file" )
done
However, this will produce an out_file like this:
#Header_for_test_input1
#Header_for_test_input2
#Header_for_test_input3
#Header_for_test_input4
#Header_for_test_input5
#Header_for_test_input6
#Header_for_test_input7
#Header_for_test_input8
filtered_modified_test_input4
filtered_modified_test_input1
filtered_modified_test_input2
filtered_modified_test_input5
filtered_modified_test_input6
filtered_modified_test_input3
filtered_modified_test_input8
filtered_modified_test_input7
The output I expected was:
#Header_for_test_input1
filtered_modified_test_input1
#Header_for_test_input2
filtered_modified_test_input2
#Header_for_test_input3
filtered_modified_test_input3
#Header_for_test_input4
filtered_modified_test_input4
#Header_for_test_input5
filtered_modified_test_input5
#Header_for_test_input6
filtered_modified_test_input6
#Header_for_test_input7
filtered_modified_test_input7
#Header_for_test_input8
filtered_modified_test_input8
I realized that the >( ) process substitution forks the shell. Is there a way to synchronize the output of the subshells? Or is there another elegant solution to this problem? I want to avoid the obvious approach of writing to different files in each iteration because, in my code, the for loop has a few 100,000 iterations.
Write the header inside the process substitution, specifically in a command group with the filter so that the concatenated output is written to out_file as one stream.
rm -f out_file
for input in test_input{1..8}; do
some_prog "$input" >( { echo "#Header_for_$input"; slow_filter; } >> "out_file" )
done
As process substitution is truly asynchronous and there doesn't appear to be a way to wait for it to complete before executing the next iteration of the loop, I would use an explicit named pipe.
rm -f out_file pipe
mkfifo pipe
for input in test_input{1..8}; do
some_prog "$input" pipe &
echo "#Header_for_$input" >> out_file
slow_filter < pipe >> out_file
done
(If some_prog doesn't work with a named pipe for some reason, you can use a regular file. In that case, you shouldn't run the command in the background.)
Since chepner's approach using a named pipe seems to be very slow in my "real world script" (about 10 times slower than this solution), the easiest and safest way to achieve what I want seems to be a temporary file:
rm -f out_file
tmp_file="$(mktemp --tmpdir my_temp_XXXXX.tmp)"
for input in test_input{1..8}; do
some_prog "$input" "$tmp_file"
{
echo "#Header_for_$input"
slow_filter < "$tmp_file"
} >> out_file
done
rm "$tmp_file"
This way, the temporary file tmp_file gets overwritten in each iteration such that it can be kept in memory if the system's temp directory is a RAM disk.
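If you want to be sure of that on Linux, you could point mktemp at a tmpfs mount such as /dev/shm (assuming your system has one):
tmp_file="$(mktemp --tmpdir=/dev/shm my_temp_XXXXX.tmp)"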

How do I prepend to a stream in Bash?

Suppose I have the following command in bash:
one | two
one runs for a long time producing a stream of output and two performs a quick operation on each line of that stream, but two doesn't work at all unless the first value it reads tells it how many values to read per line. one does not output that value, but I know what it is in advance (let's say it's 15). I want to send a 15\n through the pipe before the output of one. I do not want to modify one or two.
My first thought was to do:
echo "$(echo 15; one)" | two
That gives me the correct output, but it doesn't stream through the pipe at all until the command one finishes. I want the output to start streaming right away through the pipe, since it takes a long time to execute (months).
I also tried:
echo 15; one | two
Which, of course, outputs 15, but doesn't send it through the pipe to two.
Is there a way in bash to pass '15\n' through the pipe and then start streaming the output of one through the same pipe?
You just need the shell grouping construct:
{ echo 15; one; } | two
The spaces around the braces and the trailing semicolon are required.
To test:
one() { sleep 5; echo done; }
two() { while read line; do date "+%T - $line"; done; }
{ printf "%s\n" 1 2 3; one; } | two
16:29:53 - 1
16:29:53 - 2
16:29:53 - 3
16:29:58 - done
Use command grouping:
{ echo 15; one; } | two
Done!
You could do this with sed:
Example 'one' script, emits one line per second to show it's line buffered and running.
#!/bin/bash
while true; do
echo "TICK $(date)"
sleep 1
done
Then pipe that through this sed command. Note that for your specific example, 'ArbitraryText' would be the number of fields; I used ArbitraryText so it's obvious that this is the inserted text. On OS X, -l makes the output line-buffered; with GNU sed I believe the equivalent flag is -u.
$ ./one | sed -l '1i\
> ArbitraryText
> '
What this does is it instructs sed to insert one line before processing the rest of your file, everything else will pass through untouched.
The end result is processed line-by-line without chunk buffering (or, waiting for the input script to finish)
ArbitraryText
TICK Fri Jun 28 13:26:56 PDT 2013
...etc
You should be able to then pipe that into 'two' as you would normally.
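With GNU sed, a sketch of the equivalent (assuming a version that accepts the one-line form of the i command) would be:
./one | sed -u '1i ArbitraryText' | two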

fetching files in a loop as an input to a script

I have a bunch of files and a script which I run on them. That script takes 2 files as an input and all files are in this format: a.txt1 a.txt2
Now the script I use is like this: foo.sh a.txt1 a.txt2
I have to run this script on 250 pairs (e.g. a1.txt1 a1.txt2 to a250.txt1 a250.txt2).
I am doing this manually by entering file names. I was wondering if there is any way to automate this process. All these pairs are in the same folder; is there a way to loop the process over all pairs?
I hope I made it clear.
Thank you.
To be specific, these are some sample file names:
T39_C.txt2
T39_D.txt1
T39_D.txt2
T40_A.txt1
T40_A.txt2
T40_B.txt1
T40_B.txt2
T40_C.txt1
T40_C.txt2
T40_D.txt1
T40_D.txt2
unmatched.txt1
unmatched.txt2
WT11_A.txt1
WT11_A.txt2
WT11_B.txt1
WT11_B.txt2
WT11_C.txt1
Assuming all files are in pairs (i.e., <something>.txt1 and <something>.txt2), then you can do something like this:
1. #!/bin/bash
2.
3. for txt1 in *.txt1; do
4. txt2="${txt1%1}2"
5. # work on $txt1 and $txt2
6. done
In line 3, we use a shell glob to grab all files ending with .txt1. In line 4, we use a parameter substitution to remove the final 1 and replace it with a 2. The real work is done in line 5.
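For the question's concrete case, line 5 would simply be the call to the script, for instance (assuming foo.sh is executable and sits in the same folder):
./foo.sh "$txt1" "$txt2"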
#FOR EACH FILE IN THE CURRENT DIRECTORY EXCEPT FOR THE FILES WITH .txt2
for i in $(ls | sort | grep -v '\.txt2')
do
#THE FIRST .txt1 file is $i
first="$i"
#THE SECOND IS THE SAME EXCEPT WITH .txt2, SO WE REPLACE THE STRING
second=`echo "$i" | sed 's/\.txt1/.txt2/g'`
#WE MAKE THE ASSUMPTION FOO.SH WILL ERROR OUT IF NOT PASSED TWO PARAMETERS
if ! bash foo.sh "$first" "$second"; then
echo "Problem running against $first $second"
else
echo "Ran against $first $second"
fi
done
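A somewhat safer variant of the same idea, relying on a glob instead of parsing ls output (a sketch, assuming every .txt1 file has a .txt2 partner):
for first in *.txt1; do
second="${first%.txt1}.txt2"
if ! bash foo.sh "$first" "$second"; then
echo "Problem running against $first $second"
else
echo "Ran against $first $second"
fi
done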
