Using a subshell for parameter substitution with diff - bash

I'm writing a shell script, and in an effort to make it shorter and easier to read, I'm trying to use nested subshells to pass parameters to diff.
Here's what I have:
if diff -iy '$(sort '$(awk 'BEGIN { FS = "|" } ; {print $1}' new-participants-by-state.csv)' '$(awk 'BEGIN { FS = "|" } ; {print $1}' current-participants-by-state.csv)')' > /dev/null;
then
    echo There is no difference between the files. > ./participants-by-state-results.txt;
else
    diff -iy '$(sort '$(awk 'BEGIN { FS = "|" } ; {print $1}' new-participants-by-state.csv)' '$(awk 'BEGIN { FS = "|" } ; {print $1}' current-participants-by-state.csv)')' > ./participants-by-state-results.txt;
fi
When I run the script, I keep getting diff: extra operand 'AL'
I'd appreciate any insight into why this is failing. I think I'm pretty close. Thanks!

Your code is unreadable because the lines are so long:
if diff -iy '$(sort '$(awk 'BEGIN { FS = "|" } ; {print $1}' new-participants-by-state.csv)' \
                    '$(awk 'BEGIN { FS = "|" } ; {print $1}' current-participants-by-state.csv)')' \
        > /dev/null;
then
    echo There is no difference between the files. > ./participants-by-state-results.txt;
else
    diff -iy '$(sort '$(awk 'BEGIN { FS = "|" } ; {print $1}' new-participants-by-state.csv)' \
                     '$(awk 'BEGIN { FS = "|" } ; {print $1}' current-participants-by-state.csv)')' \
        > ./participants-by-state-results.txt;
fi
Repeating whole commands like that is also fairly nasty. You also have major problems with your use of single quotes; you only have one sort in each set of commands, apparently operating on the combined outputs of two identical awk commands (whereas you probably need two separate sorts, one for the output of each awk command); you're not using the -F option to awk when you could; you are repeating the gargantuan file names all over the place; and finally, it appears that you are probably wanting to use process substitution, but not actually doing so.
Let's take a step back and formulate the question clearly.
Given two files (new-participants-by-state.csv and current-participants-by-state.csv) find the first pipe-separated field on each line of each file, sort the lists of those fields, and compare the results of the two sorted lists.
If there are no differences, write a message into the output file participants-by-state-results.txt; otherwise, list the differences in the output file.
So, we could use:
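oldfile="current-participants-by-state.csv"
newfile="new-participants-by-state.csv"
outfile="participants-by-state-results.txt"
tmpfile="${TMPDIR:-/tmp}/participants.$$"
awk -F'|' '{print $1}' "$oldfile" | sort > "$tmpfile.1"
awk -F'|' '{print $1}' "$newfile" | sort > "$tmpfile.2"
if diff -iy "$tmpfile.1" "$tmpfile.2" > "$outfile"
then echo "There is no difference between the files" > "$outfile"
fi
rm -f "$tmpfile".?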
If this was going to be the final script, we'd want to put trap handling in place so that the temporary files are not left around unless the script is killed dead with SIGKILL.
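A minimal sketch of that trap handling, assuming mktemp is available. The scratch directory, the sample data and the variable names are made up for illustration; they are not from the original script.

```shell
# Scratch directory that is removed however the script ends (SIGKILL excepted).
workdir=$(mktemp -d) || exit 1
trap 'rm -rf "$workdir"' EXIT        # runs on any normal exit
trap 'exit 1' HUP INT TERM           # turn signals into an exit, so the EXIT trap fires

# Hypothetical sample data standing in for the two participant files.
printf 'AL|12\nTX|40\n' > "$workdir/old.csv"
printf 'tx|41\nal|13\n' > "$workdir/new.csv"

# First pipe-separated field of each file, sorted case-insensitively.
awk -F'|' '{print $1}' "$workdir/old.csv" | sort -f > "$workdir/old.sorted"
awk -F'|' '{print $1}' "$workdir/new.csv" | sort -f > "$workdir/new.sorted"

if diff -i "$workdir/old.sorted" "$workdir/new.sorted" > /dev/null
then result="no difference"
else result="different"
fi
echo "$result"
```

Putting the cleanup in a single EXIT trap, and making the signal traps just exit, keeps the removal logic in one place.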
However, we can now use process substitution to avoid the temporary files:
if diff -iy <(awk -F'|' '{print $1}' "$oldfile" | sort) \
            <(awk -F'|' '{print $1}' "$newfile" | sort) > "$outfile"
then echo "There is no difference between the files" > "$outfile"
fi
Note how the code carefully preserves symmetries where there are symmetries. Note the use of shortish variable names to avoid the repetition of long file names. Note that the diff command is run just once, not twice - throwing away results which are needed later is not very sensible.
You could compress the output I/O redirection even more using:
{
if diff -iy <(awk -F'|' '{print $1}' "$oldfile" | sort) \
            <(awk -F'|' '{print $1}' "$newfile" | sort)
then echo "There is no difference between the files"
fi
} > "$outfile"
That sends the standard output of the enclosed commands to the file.
Of course, CSV might not be the appropriate nomenclature if the files are pipe-separated rather than comma-separated, but that's another matter altogether.
I'm also assuming that the status from diff -iy works as suggested by the original script; I've not validated that usage of the diff command.

There are several problems here.
First, you're putting various arguments in single-quotes, which prevents any interpretation being done on them (for example, $(....) doesn't do anything special inside single-quotes). You're probably thinking of double-quotes, but those aren't what you want either.
Which brings us to the second problem, that diff and sort expect to be given filenames as arguments, and they operate on the data in those files; you're trying to pass the data directly as arguments, which doesn't work (and I suspect that's the origin of the error you're getting: diff expects exactly two filenames, you're passing more than two participant names, and AL happened to be third on the list and hence the one that diff panicked on). The usual way to do this is to use intermediate files (and multiple lines in the script), but bash actually has a way of doing this without either of those: process substitution. Essentially, what it does is run one command with output (or input, but we need output in this case) sent to a named pipe; then it passes the name of the pipe as an argument to another command. For example, diff <(command1) <(command2) will give you the differences between the outputs of command1 and command2. Note that since this is a bash-only feature, you must start the script with #!/bin/bash, not #!/bin/sh.
Third, there's a missing close-parenthesis that makes it a little hard to tell what's supposed to happen. Are both files supposed to be sorted before the comparison, or only the new-participants file?
Fourth, since the final comparison ignores case (-i), you'd better use a case-insensitive sort (-f) as well.
Finally, you're doing all of the processing twice if there are any differences. I'd recommend running the comparison once into a file, then if there were no differences just ignore/overwrite the (empty) file.
Oh, and just a stylistic thing: you don't need semicolons at the end of lines in bash. You only need semicolons if you're putting more than one command on the same line (and a few other cases like before then in an if statement).
Anyway, here's my rewrite:
#!/bin/bash
if diff -iy <(awk 'BEGIN { FS = "|" } ; {print $1}' new-participants-by-state.csv | sort -f) <(awk 'BEGIN { FS = "|" } ; {print $1}' current-participants-by-state.csv | sort -f) >./participants-by-state-results.txt; then
    echo "There is no difference between the files." > ./participants-by-state-results.txt
fi
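To see what process substitution actually hands to the command, here is a small bash-only sketch (the sample words are made up). Each <(...) is replaced by a path, commonly something like /dev/fd/63, which diff then opens as if it were an ordinary file.

```shell
#!/bin/bash
# What the command sees: one path per substitution, not the data itself.
paths=$(echo <(true) <(true))
echo "$paths"

# Compare two unsorted lists case-insensitively after sorting each one.
if diff -i <(printf 'TX\nal\n' | sort -f) <(printf 'AL\ntx\n' | sort -f) > /dev/null
then verdict="same"
else verdict="different"
fi
echo "$verdict"
```

The exact paths printed by the first echo depend on the system, so nothing should rely on their values.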


Iterative replacement of substrings in bash

I'm trying to write a simple script to make several replacements in a big text file. I have a "map" file which contains the records to be searched and replaced, one pair per line, separated by a space, and an "input" file where I need the changes to be made. The example files and the script I wrote are below.
Map file
new_0 old_0
new_1 old_1
new_2 old_2
new_3 old_3
new_4 old_4
Input file
while read -r mapline ; do
mapf1=`awk 'BEGIN {FS=" "} {print $1}' <<< "$mapline"`
mapf2=`awk 'BEGIN {FS=" "} {print $2}' <<< "$mapline"`
for line in $(cat "input") ; do
if [[ "${line}" == *"${mapf2}"* ]] ; then
sed "s/${mapf2}/${mapf1}/g" <<< "${line}"
fi
done < "input"
done < "map"
The thing is that the searches and replaces are made correctly, but I can't find a way to save the output of each iteration and work over it in the next. So, my output looks like this:
Yet, the desired output would look like this:
Can anyone shed some light on these dark waters? Thanks in advance!
Improving the existing script
Use "$()" instead of ``. It supports whitespace and is easier to read.
Don't execute sed for each line. sed already loops over all lines and is faster than a loop in bash.
The adapted script:
text="$(< input)"
while read -r mapline; do
mapf1="$(awk 'BEGIN {FS=" "} {print $1}' <<< "$mapline")"
mapf2="$(awk 'BEGIN {FS=" "} {print $2}' <<< "$mapline")"
text="$(sed "s/${mapf2}/${mapf1}/g" <<< "$text")"
done < "map"
echo "$text"
The variable $text contains the complete input file and is modified in each iteration. The output of this script is the file after all replacements were done.
Alternative approach
Convert the map file into a pattern for sed and execute sed just once using that pattern.
pattern="$(sed 's#\(.*\) \(.*\)#s/\2/\1/g#' map)"
sed "$pattern" input
The first command is the conversion step. The file
new_0 old_0
new_1 old_1
will result in the pattern
s/old_0/new_0/g
s/old_1/new_1/g
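Running the conversion on that two-line map shows both the generated sed script and its effect. This is only a sketch: the contents of the input file below are made up, since the original sample input is not shown.

```shell
# Two-line map from the example above, plus a hypothetical input file.
printf '%s\n' 'new_0 old_0' 'new_1 old_1' > map
printf '%s\n' 'x (old_0) y' '(old_1) z' > input

# Conversion step: each "new old" row becomes one s/old/new/g command.
pattern="$(sed 's#\(.*\) \(.*\)#s/\2/\1/g#' map)"
printf '%s\n' "$pattern"

# A single sed run applies every generated command to every line.
result="$(sed "$pattern" input)"
printf '%s\n' "$result"
```

Two caveats follow from how the script is built: the old values are used as regular expressions, so a / or other special character in the map would need escaping, and a key such as old_1 would also match inside old_10.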
It is possible in GNU Awk as follows,
awk 'FNR==NR{hash[$2]=$1; next} \
{for (i=1; i<=NF; i++)\
{for(key in hash) \
{if (match ($i,key)) {$i=sprintf("(%s)",hash[key]); break}}}print}' \
map-file FS='[()]' OFS= input-file
produces an output as,
Another in Gnu awk, using split and ternary operator(s):
$ awk '
NR==FNR { a[$2]=$1; next }
{
    n=split($0,b,"[()]")
    for(i=1;i<=n;i++)
        printf "%s%s",(i%2 ? b[i] : (b[i] in a? "(" a[b[i]] ")":"")),(i==n?ORS:"")
}' map foo
First, the map is read into a hash. When processing the file, each record is split on ( and ). Every other piece (i%2==0) was inside parentheses, so it could be in the map. While printing, a ternary operator tests whether that piece is a key in a; when it is, the replacement is output in parentheses.

for loop and if statements in awk

I am a biologist that is starting to have to learn some elementary scripting skills to deal with large DNA sequence data sets. So please go easy on me. I am doing this all in bash. I have a file with my data formatted like this:
What I need to do is loop through this file and write all the sequences from the same sample into their own file. Just to be clear, these sequences come from samples 25 and 9. So my idea was to use awk to reformat my file in the following way:
then pipe this into another awk if statement to say "if sample=$i then write out that entire line to a file named sample.$i" Here is my code so far:
a=`ls /scratch/tkchafin/data/raw | wc -l`;
mkdir /scratch/tkchafin/data/phylogenetics
for ((i=0; i<=$((c)); i++)); do
awk 'ORS=NR%2?"_":"\n"' $1 | awk -F_ '{if($4==$i) print}' >> /scratch/tkchafin/data/phylogenetics/sample.$i
done
I understand this is not working because $i is in single quotes so bash is not recognizing it. I know awk has a -v option for passing external variables to it, but I don't know how I would apply that in this case. I tried to move the for loop inside the awk statement but this does not produce the desired result either. Any help would be much appreciated.
You can have awk write directly to the desired output file, without a shell loop:
awk -F_ '(NR % 2) == 1 { line1 = $0; fn="/scratch/tkchafin/data/phylogenetics/sample."$4; }
(NR % 2) == 0 { print line1"_"$0 > fn; }' "$1"
But to show how you would use -v in your version, it would be:
for ((i=0; i<=$((c)); i++)); do
    awk 'ORS=NR%2?"_":"\n"' "$1" | awk -F_ -v i="$i" '$4 == i' >> "/scratch/tkchafin/data/phylogenetics/sample.$i"
done
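A self-contained sketch of the -v version, for trying it out on a small scale. The read names and sequences are hypothetical; the only assumption carried over from the question is that the sample number is the fourth underscore-separated field.

```shell
# Hypothetical reads: a header line followed by its sequence line.
printf '%s\n' '>r1_x_y_25' 'ACGT' '>r2_x_y_9' 'GGCC' '>r3_x_y_25' 'TTAA' > reads.txt

for i in 9 25; do
    # Join each header with its sequence, then keep the lines for sample $i.
    # -v i="$i" copies the shell variable into an awk variable named i.
    awk 'ORS=NR%2?"_":"\n"' reads.txt |
        awk -F_ -v i="$i" '$4 == i' > "sample.$i"
done
cat sample.25
```

Because -v does the assignment before the program runs, the awk program itself can stay in single quotes and never needs to see the shell's $i.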

Print text between two lines (from list of line numbers in file) in Unix [closed]

I have a sample file which has thousands of lines.
I want to print text between two line numbers in that file. I don't want to input line numbers manually, rather I have a file which contains list of line numbers between which text has to be printed.
Example : linenumbers.txt
345|789
999|1056
1522|1366
3523|3562
I need a shell script which will read line numbers from this file and print the text between each range of lines into a separate (new) file.
That is, it should print lines between 345 and 789 into a new file, say File1.txt, and print text between lines 999 and 1056 into a new file, say File2.txt, and so on.
Considering your target file has only thousands of lines, here is a quick and dirty solution:
awk -F'|' '{system("sed -n \""$1","$2"p\" targetFile > file"NR)}' linenumbers.txt
targetFile is your file containing thousands of lines.
The one-liner does not require linenumbers.txt to be sorted.
The one-liner allows the line ranges in linenumbers.txt to overlap.
After running the command above, you will have n files named file1 to filen, where n is the number of rows in linenumbers.txt. You can change the file name pattern as you like.
Here's one way using GNU awk. Run like:
awk -f script.awk numbers.txt file.txt
Contents of script.awk:
# set the field separator
BEGIN {
    FS="|"
}
# for the first file in the arguments list
FNR==NR {
    # add the row number and field one as keys to a multidimensional array with
    # a value of field two
    a[NR][$1]=$2
    # skip processing the rest of the code
    next
}
# for the second file in the arguments list
{
    # for every element in the array's first dimension
    for (i in a) {
        # for every element in the second dimension
        for (j in a[i]) {
            # ensure that the first field is treated numerically
            j+=0
            # if the line number is greater than or equal to the first field
            # and smaller than or equal to the second field
            if (FNR>=j && FNR<=a[i][j]) {
                # print the line to a file with the suffix of the first file's
                # line number (the first dimension)
                print > "File" i
            }
        }
    }
}
Alternatively, here's the one-liner:
awk -F "|" 'FNR==NR { a[NR][$1]=$2; next } { for (i in a) for (j in a[i]) { j+=0; if (FNR>=j && FNR<=a[i][j]) print > "File" i } }' numbers.txt file.txt
If you have an 'old' awk, here's the version with compatibility. Run like:
awk -f script.awk numbers.txt file.txt
Contents of script.awk:
# set the field separator
BEGIN {
    FS="|"
}
# for the first file in the arguments list
FNR==NR {
    # add the row number and field one as a key to a pseudo-multidimensional
    # array with a value of field two
    a[NR,$1]=$2
    # skip processing the rest of the code
    next
}
# for the second file in the arguments list
{
    # for every element in the array
    for (i in a) {
        # split the element in to another array
        # b[1] is the row number and b[2] is the first field
        split(i,b,SUBSEP)
        # if the line number is greater than or equal to the first field
        # and smaller than or equal to the second field
        if (FNR>=b[2] && FNR<=a[i]) {
            # print the line to a file with the suffix of the first file's
            # line number (the first pseudo-dimension)
            print > "File" b[1]
        }
    }
}
Alternatively, here's the one-liner:
awk -F "|" 'FNR==NR { a[NR,$1]=$2; next } { for (i in a) { split(i,b,SUBSEP); if (FNR>=b[2] && FNR<=a[i]) print > "File" b[1] } }' numbers.txt file.txt
I would use sed to process the sample data file because it is simple and swift. This requires a mechanism for converting the line numbers file into the appropriate sed script. There are many ways to do this.
One way uses sed to convert the set of line numbers into a sed script. If everything was going to standard output, this would be trivial. With the output needing to go to different files, we need a line number for each line in the line numbers file. One way to give line numbers is the nl command. Another possibility would be to use pr -n -l1. The same sed command line works with both:
nl linenumbers.txt |
sed 's/ *\([0-9]*\)[^0-9]*\([0-9]*\)|\([0-9]*\)/\2,\3w file\1.txt/'
For the given data file, that generates:
345,789w file1.txt
999,1056w file2.txt
1522,1366w file3.txt
3523,3562w file4.txt
Another option would be to have awk generate the sed script:
awk -F'|' '{ printf "%d,%dw file%d.txt\n", $1, $2, NR }' linenumbers.txt
If your version of sed will allow you to read its script from standard input with -f - (GNU sed does; BSD sed does not), then you can convert the line numbers file into a sed script on the fly, and use that to parse the sample data (sample.txt below stands for the sample data file):
awk -F'|' '{ printf "%d,%dw file%d.txt\n", $1, $2, NR }' linenumbers.txt |
sed -n -f - sample.txt
If your system supports /dev/stdin, you can use one of:
awk -F'|' '{ printf "%d,%dw file%d.txt\n", $1, $2, NR }' linenumbers.txt |
sed -n -f /dev/stdin sample.txt
awk -F'|' '{ printf "%d,%dw file%d.txt\n", $1, $2, NR }' linenumbers.txt |
sed -n -f /dev/fd/0 sample.txt
Failing that, use an explicit script file:
awk -F'|' '{ printf "%d,%dw file%d.txt\n", $1, $2, NR }' linenumbers.txt > sed.script
sed -n -f sed.script sample.txt
rm -f sed.script
Strictly, you should deal with ensuring the temporary file name is unique (mktemp) and removed even if the script is interrupted (trap):
tmp=$(mktemp sed.script.XXXXXX)
trap "rm -f $tmp; exit 1" 0 1 2 3 13 15
awk -F'|' '{ printf "%d,%dw file%d.txt\n", $1, $2, NR }' linenumbers.txt > $tmp
sed -n -f $tmp sample.txt   # sample.txt: the sample data file
rm -f $tmp
trap 0
The final trap 0 allows your script to exit successfully; omit it, and your script will always exit with status 1.
I've ignored Perl and Python; either could be used for this in a single command. The file management is just fiddly enough that using sed seems simpler. You could also use just awk, either with a first awk script writing an awk script to do the heavy duty work (trivial extension of the above), or having a single awk process read both files and produce the required output (harder, but far from impossible).
If nothing else, this shows that there are many possible ways of doing the job. If this is a one-off exercise, it really doesn't matter very much which you choose. If you will be doing this repeatedly, then choose the mechanism that you like. If you're worried about performance, measure. It is likely that converting the line numbers into a command script is a negligible cost; processing the sample data with the command script is where the time is taken. I would expect sed to excel at that point; I've not measured to confirm that it does.
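For anyone who wants to try the awk-generates-sed approach on a small scale, here is a sketch with made-up data: a ten-line sample file and two ranges. It relies on sed's w command taking the file name directly, with no > in front of it. The file names sample.txt and sed.script are placeholders.

```shell
# Hypothetical data: a ten-line sample file and two line ranges.
awk 'BEGIN { for (i = 1; i <= 10; i++) print "line " i }' > sample.txt
printf '%s\n' '2|4' '7|8' > linenumbers.txt

# Turn each "start|end" row into a sed "w" command, one output file per row.
awk -F'|' '{ printf "%d,%dw file%d.txt\n", $1, $2, NR }' linenumbers.txt > sed.script
cat sed.script

# -n suppresses normal output; only the w commands write anything.
sed -n -f sed.script sample.txt
cat file1.txt
```

After the run, file1.txt should hold lines 2 to 4 of the sample and file2.txt lines 7 and 8.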
You could do the following
while IFS=\| read start end ; do
echo "sed -n '$start,${end}p;${end}q;' $somefile > $somefile-$start-$end"
done < $linenumbers
Run the script with sh; for the sample ranges it prints:
sed -n '345,789p;789q;' afile > afile-345-789
sed -n '999,1056p;1056q;' afile > afile-999-1056
sed -n '1522,1366p;1366q;' afile > afile-1522-1366
sed -n '3523,3562p;3562q;' afile > afile-3523-3562
Then, when you're happy with the generated commands, pipe the script's output into sh.
EDIT Added William's excellent points on style and correctness.
EDIT Explanation
The basic idea is to get a script to generate a series of shell commands that can be checked for correctness first before being executed by "| sh".
sed -n '345,789p;789q;' means: run sed without echoing each line (-n), with two commands. The first prints (p) lines 345 to 789; the second quits (q) at line 789. Quitting on the last line saves sed from reading the rest of the input file.
The while loop reads from the $linenumbers file using read. Given more than one variable name, read populates each with a field from the input; fields are usually separated by whitespace, and if there are too few variable names, read puts the remaining data into the last one.
You can type the following at your shell prompt to see that behaviour.
ls -l | while read first rest ; do
echo $first XXXX $rest
done
Try adding another variable, second, to the above to see what happens; it should be obvious.
The problem is that your data is delimited by | characters. That's where William's suggestion of IFS=\| helps: with IFS changed for the read, the input is split on | and we get the desired result.
Others can feel free to edit, correct and expand.
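Both behaviours of read can be checked without touching any files. This is a small sketch with made-up input; the variable names are arbitrary.

```shell
# Default IFS: fields split on whitespace; the last variable takes the remainder.
first_rest=$(echo 'alpha beta gamma delta' |
    { read -r first rest; echo "$first XXXX $rest"; })
echo "$first_rest"

# IFS='|' applies to the read only: "start|end" rows are split on the pipe character.
ranges=$(printf '%s\n' '345|789' '999|1056' |
    while IFS='|' read -r start end; do
        echo "from $start to $end"
    done)
echo "$ranges"
```

Setting IFS as a prefix to read, rather than as a separate assignment, leaves the rest of the script's word splitting unchanged.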
To extract the first field from 345|789 you can e.g use awk
awk -F'|' '{print $1}'
Combine that with the answers received from your other question and you will have a solution.
This might work for you (GNU sed):
sed -r 's/(.*)\|(.*)/\1,\2w file-\1-\2.txt/' | sed -nf - file

Explode to Array

I put together this shell script to do two things:
Change the delimiters in a data file ('::' to ',' in this case)
Select the columns I want and append them to a new file
It works but I want a better way to do this. I specifically want to find an alternative method for exploding each line into an array. Using command line arguments doesn't seem like the way to go. ANY COMMENTS ARE WELCOME.
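#!/bin/bash
# Takes :: separated file as 1st parameters
SOURCE=$1
# create csv target file (assumed here: the source name with a .csv extension)
TARGET="${SOURCE%.*}.csv"
touch $TARGET
echo "#userId,itemId" > $TARGET
IFS=","
while read LINE
do
# Replaces all matches of :: with a ,
CSV_LINE=${LINE//::/,}
set -- $CSV_LINE
echo "$1,$2" >> $TARGET
done < $SOURCE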
Instead of set, you can use an array:
arr=($CSV_LINE)
echo "${arr[0]},${arr[1]}"
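One way to explode a line into an array without changing IFS for the whole script is read -a, which is bash-specific. This is a sketch; the sample line is made up.

```shell
#!/bin/bash
# Hypothetical input line with "::" as the delimiter.
line='42::1007::5::978300760'

# Swap :: for , and split on commas into an array.
# The IFS assignment is a prefix to read, so it applies to this command only.
IFS=, read -r -a arr <<< "${line//::/,}"

pair="${arr[0]},${arr[1]}"
echo "$pair"
echo "${#arr[@]}"
```

Unlike set --, this leaves the script's positional parameters alone, and unlike an unquoted arr=($CSV_LINE) it does not subject the fields to filename expansion.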
The following would print columns 1 and 2 from infile.dat. Replace $1, $2 with
a comma-separated list of the numbered columns you do want.
awk 'BEGIN { FS="::"; OFS="," } { print $1, $2 }' infile.dat > infile.csv
Perl probably has a 1 liner to do it.
Awk can probably do it easily too.
My first reaction is a combination of awk and sed:
Sed to convert the delimiters
Awk to process specific columns
cat inputfile | sed -e 's/::/,/g' | awk -F, '{print $1, $2}'
# Or, to avoid a UUOC award (and prolong the life of your keyboard by 3 characters):
sed -e 's/::/,/g' inputfile | awk -F, '{print $1, $2}'
awk is indeed the right tool for the job here, it's a simple one-liner.
$ cat
$ awk -F:: -v OFS=, '{$1=$1;print;print $2,$3 >> "altfile"}'
$ cat altfile

Can I chain multiple commands and make all of them take the same input from stdin?

In bash, is there a way to chain multiple commands, all taking the same input from stdin? That is, one command reads stdin, does some processing, writes the output to a file. The next command in the chain gets the same input as what the first command got. And so on.
For example, consider a large text file to be split into multiple files by filtering the content. Something like this:
cat food_expenses.txt | grep "coffee" > coffee.txt | grep "tea" > tea.txt | grep "honey cake" > cake.txt
This obviously does not work, because the second grep gets the first grep's output, not the original text file. I tried inserting tee's but that does not help. Is there some bash magic that can cause the first grep to send its input to the pipe, not the output?
And by the way, splitting a file was a simple example. Consider splitting (filtering by pattern search) a continuous live text stream coming over a network and writing the output to different named pipes or sockets. I would like to know if there is an easy way to do it using a shell script.
(This question is a cleaned-up version of my earlier one, based on responses that pointed out its unclearness.)
For this example, you should use awk as semiuseless suggests.
But in general to have N arbitrary programs read a copy of a single input stream, you can use tee and bash's process output substitution operator:
tee <food_expenses.txt \
>(grep "coffee" >coffee.txt) \
>(grep "tea" >tea.txt) \
>(grep "honey cake" >cake.txt)
Note that >(command) is a bash extension.
The obvious question is why do you want to do this within one command ?
If you don't want to write a script, and you want to run stuff in parallel, bash supports the concepts of subshells, and these can run in parallel. By putting your command in brackets, you can run your greps (or whatever) concurrently e.g.
$ (grep coffee food_expenses.txt > coffee.txt) & (grep tea food_expenses.txt > tea.txt) & wait
Note that in the above your cat may be redundant since grep takes an input file argument.
You can (instead) play around with redirecting output through different streams. You're not limited to stdout/stderr but can assign new streams as required. I can't advise more on this other than direct you to examples here
I like Stephen's idea of using awk instead of grep.
It ain't pretty, but here's a command that uses output redirection to keep all data flowing through stdout:
cat food.txt |
awk '/coffee/ {print $0 > "/dev/stderr"} {print $0}' 2> coffee.txt |
awk '/tea/ {print $0 > "/dev/stderr"} {print $0}' 2> tea.txt
As you can see, it uses awk to send all lines matching 'coffee' to stderr, and all lines regardless of content to stdout. Then stderr is fed to a file, and the process repeats with 'tea'.
If you wanted to filter out content at each step, you might use this:
cat food.txt |
awk '/coffee/ {print $0 > "/dev/stderr"} $0 !~ /coffee/ {print $0}' 2> coffee.txt |
awk '/tea/ {print $0 > "/dev/stderr"} $0 !~ /tea/ {print $0}' 2> tea.txt
You could use awk to split into up to two files:
awk '/Coffee/ { print "Coffee" } /Tea/ { print "Tea" > "/dev/stderr" }' inputfile > coffee.file.txt 2> tea.file.txt
I am unclear why the filtering needs to be done in different steps. A single awk program can scan all the incoming lines, and dispatch the appropriate lines to individual files. This is a very simple dispatch that can feed multiple secondary commands (i.e. persistent processes that monitor the output files for new input, or the files could be sockets that are setup ahead of time and written to by the awk process.).
If there is a reason to have every filter see every line, then just remove the "next;" statements, and every filter will see every line.
$ cat split.awk
/^coffee/ {
    print $0 >> "/tmp/coffee.txt" ;
    next;
}
/^tea/ {
    print $0 >> "/tmp/tea.txt" ;
    next;
}
{ # default
    print $0 >> "/tmp/other.txt" ;
}
END {}
Here are two bash scripts without awk. The second one doesn't even use grep!
With grep:
tail -F food_expenses.txt | \
while read line
do
    for word in "coffee" "tea" "honey cake"
    do
        if [[ $line != ${line#*$word*} ]]; then
            echo "$line"|grep "$word" >> ${word#* }.txt # use the last word in $word for the filename (i.e. cake.txt for "honey cake")
        fi
    done
done
Without grep:
tail -F food_expenses.txt | \
while read line
do
    for word in "coffee" "tea" "honey cake"
    do
        if [[ $line != ${line#*$word*} ]]; then # does the line contain the word?
            echo "$line" >> ${word#* }.txt # use the last word in $word for the filename (i.e. cake.txt for "honey cake")
        fi
    done
done
Here's an AWK method:
awk 'BEGIN {
    list = "coffee tea";
    split(list, patterns)
}
{
    for (pattern in patterns) {
        if ($0 ~ patterns[pattern]) {
            print > patterns[pattern] ".txt"
        }
    }
}' food_expenses.txt
Working with patterns which include spaces remains to be resolved.
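One possible way around that limitation, as a sketch: separate the patterns with a character that cannot occur in them (| is assumed here) and match them as fixed strings with index(). The sample lines and the choice to turn spaces into underscores in the file names are mine, not part of the answer above.

```shell
# Hypothetical expense lines.
printf '%s\n' 'coffee 3.50' 'green tea 2.00' 'honey cake 4.25' > food_expenses.txt

awk 'BEGIN {
    # Split on | so that a pattern may contain spaces.
    n = split("coffee|tea|honey cake", patterns, "|")
}
{
    for (k = 1; k <= n; k++) {
        if (index($0, patterns[k])) {      # fixed-string match, no regex
            out = patterns[k] ".txt"
            gsub(/ /, "_", out)            # "honey cake" -> honey_cake.txt
            print > out
        }
    }
}' food_expenses.txt
cat honey_cake.txt
```

Every line is tested against every pattern, so a line that mentions both coffee and tea ends up in both files.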
You can probably write a simple AWK script to do this in one shot. Can you describe the format of your file a little more?
Is it space/comma separated?
do you have the item descriptions on a specific 'column' where columns are defined by some separator like space, comma or something else?
If you can afford multiple grep runs this will work,
grep coffee food_expenses.txt > coffee.txt
grep tea food_expenses.txt > tea.txt
and, so on.
Assuming that your input is not infinite (as in the case of a network stream that you never plan on closing) I might consider using a subshell to put the data into a temp file, and then a series of other subshells to read it. I haven't tested this, but maybe it would look something like this
( cat inputstream > tempfile )
( grep tea tempfile > tea.txt )
( grep coffee tempfile > coffee.txt )
I'm not certain of an elegant solution to the file getting too large if your input stream is not bounded in size however.
