Can I chain multiple commands and make all of them take the same input from stdin? - bash

In bash, is there a way to chain multiple commands, all taking the same input from stdin? That is, one command reads stdin, does some processing, writes the output to a file. The next command in the chain gets the same input as what the first command got. And so on.
For example, consider a large text file to be split into multiple files by filtering the content. Something like this:
cat food_expenses.txt | grep "coffee" > coffee.txt | grep "tea" > tea.txt | grep "honey cake" > cake.txt
This obviously does not work, because the second grep gets the first grep's output, not the original text file. I tried inserting tee's but that does not help. Is there some bash magic that can cause the first grep to send its input to the pipe, not the output?
And by the way, splitting a file was a simple example. Consider splitting (filtering by pattern search) a continuous live text stream coming over a network and writing the output to different named pipes or sockets. I would like to know if there is an easy way to do it using a shell script.
(This question is a cleaned-up version of my earlier one, based on responses that pointed out that it was unclear.)

For this example, you should use awk as semiuseless suggests.
But in general, to have N arbitrary programs each read a copy of a single input stream, you can use tee and bash's process substitution operator:
tee <food_expenses.txt \
>(grep "coffee" >coffee.txt) \
>(grep "tea" >tea.txt) \
>(grep "honey cake" >cake.txt)
Note that >(command) is a bash extension.
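Keep in mind that tee also copies everything it reads to its own standard output, so the command above additionally echoes the whole file to the terminal; if you only want the per-pattern files, the same command can simply discard that copy:
tee <food_expenses.txt \
>(grep "coffee" >coffee.txt) \
>(grep "tea" >tea.txt) \
>(grep "honey cake" >cake.txt) \
>/dev/null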

The obvious question is: why do you want to do this within one command?
If you don't want to write a script, and you want to run stuff in parallel, bash supports the concept of subshells, and these can run in parallel. By putting your command in brackets, you can run your greps (or whatever) concurrently, e.g.
$ (grep coffee food_expenses.txt > coffee.txt) && (grep tea food_expenses.txt > tea.txt)
Note that in the above your cat may be redundant since grep takes an input file argument.
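Note also that && runs the two subshells one after the other (the second only if the first succeeds); if you actually want them to run concurrently, background each one with & and wait for both, e.g.
( grep coffee food_expenses.txt > coffee.txt ) &
( grep tea food_expenses.txt > tea.txt ) &
wait   # block until both background jobs have finished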
You can (instead) play around with redirecting output through different streams. You're not limited to stdout/stderr but can assign new streams as required. I can't advise more on this other than to direct you to examples here
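As a small illustration of assigning an extra stream, this sketch opens file descriptor 3 for a second output file (the descriptor number and file name are arbitrary):
exec 3> other.txt        # open fd 3, writing to other.txt
echo "this goes to stdout"
echo "this goes to fd 3" >&3
exec 3>&-                # close fd 3 again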

I like Stephen's idea of using awk instead of grep.
It ain't pretty, but here's a command that uses output redirection to keep all data flowing through stdout:
cat food.txt |
awk '/coffee/ {print $0 > "/dev/stderr"} {print $0}' 2> coffee.txt |
awk '/tea/ {print $0 > "/dev/stderr"} {print $0}' 2> tea.txt
As you can see, it uses awk to send all lines matching 'coffee' to stderr, and all lines regardless of content to stdout. Then stderr is fed to a file, and the process repeats with 'tea'.
If you wanted to filter out content at each step, you might use this:
cat food.txt |
awk '/coffee/ {print $0 > "/dev/stderr"} $0 !~ /coffee/ {print $0}' 2> coffee.txt |
awk '/tea/ {print $0 > "/dev/stderr"} $0 !~ /tea/ {print $0}' 2> tea.txt

You could use awk to split into up to two files:
awk '/Coffee/ { print } /Tea/ { print > "/dev/stderr" }' inputfile > coffee.file.txt 2> tea.file.txt

I am unclear on why the filtering needs to be done in separate steps. A single awk program can scan all the incoming lines and dispatch the appropriate lines to individual files. This is a very simple dispatch that can feed multiple secondary commands (i.e. persistent processes that monitor the output files for new input), or the files could be sockets that are set up ahead of time and written to by the awk process.
If there is a reason to have every filter see every line, then just remove the "next;" statements, and every filter will see every line.
$ cat split.awk
BEGIN {}
/^coffee/ {
    print $0 >> "/tmp/coffee.txt" ;
    next;
}
/^tea/ {
    print $0 >> "/tmp/tea.txt" ;
    next;
}
{ # default
    print $0 >> "/tmp/other.txt" ;
}
END {}
$
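To actually run the script above, feed it the input as a file argument or on stdin, for example:
awk -f split.awk food_expenses.txt
tail -f some_stream.log | awk -f split.awk   # for a live stream; the log name is just a placeholder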

Here are two bash scripts without awk. The second one doesn't even use grep!
With grep:
#!/bin/bash
tail -F food_expenses.txt | \
while read line
do
    for word in "coffee" "tea" "honey cake"
    do
        if [[ $line != ${line#*$word*} ]]
        then
            echo "$line" | grep "$word" >> ${word#* }.txt # use the last word in $word for the filename (i.e. cake.txt for "honey cake")
        fi
    done
done
Without grep:
#!/bin/bash
tail -F food_expenses.txt | \
while read line
do
    for word in "coffee" "tea" "honey cake"
    do
        if [[ $line != ${line#*$word*} ]] # does the line contain the word?
        then
            echo "$line" >> ${word#* }.txt # use the last word in $word for the filename (i.e. cake.txt for "honey cake")
        fi
    done
done
Edit:
Here's an AWK method:
awk 'BEGIN {
    list = "coffee tea";
    split(list, patterns)
}
{
    for (pattern in patterns) {
        if ($0 ~ patterns[pattern]) {
            print > (patterns[pattern] ".txt")
        }
    }
}' food_expenses.txt
Working with patterns which include spaces remains to be resolved.
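One possible way around that limitation, sketched below, is to split the pattern list on a delimiter other than space so that multi-word patterns like "honey cake" survive; the ";" separator here is just an illustrative choice:
awk 'BEGIN {
    list = "coffee;tea;honey cake";
    n = split(list, patterns, ";")
}
{
    for (i = 1; i <= n; i++) {
        if ($0 ~ patterns[i]) {
            print > (patterns[i] ".txt")
        }
    }
}' food_expenses.txt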

You can probably write a simple AWK script to do this in one shot. Can you describe the format of your file a little more?
Is it space/comma separated?
Do you have the item descriptions in a specific 'column', where columns are defined by some separator like space, comma or something else?
If you can afford multiple grep runs, this will work:
grep coffee food_expenses.txt > coffee.txt
grep tea food_expenses.txt > tea.txt
and so on.

Assuming that your input is not infinite (unlike a network stream that you never plan on closing), I might consider using a subshell to put the data into a temp file, and then a series of other subshells to read it. I haven't tested this, but maybe it would look something like this
( cat inputstream > tempfile )
( grep tea tempfile > tea.txt )
( grep coffee tempfile > coffee.txt )
I'm not certain of an elegant solution to the file getting too large, however, if your input stream is not bounded in size.
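If you do go the temp-file route, a slightly safer sketch uses mktemp for the temporary file and cleans it up afterwards (this still assumes the input eventually ends):
tmp=$(mktemp) || exit 1
trap 'rm -f "$tmp"' EXIT
cat inputstream > "$tmp"
grep tea "$tmp" > tea.txt
grep coffee "$tmp" > coffee.txt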

Related

Redirecting multiline output to multiple files

I have a list of URLs, and would like to identify what is a directory and what is not:
https://www.example.com/folder/
https://www.example.com/folder9/
https://www.example.com/folder/file.sh
https://www.example.com/folder/text
I can use grep -e /$ to find which is which, but I'd like to do an inline command where I can redirect the output based on that logic.
I understand that awk may have the answer here, but don't have enough experience in awk to do this.
Something like:
cat urls | if /$ matches write to folders.txt else write to files.txt
I could drop it all to a file then read it twice but when it gets to thousands of lines I feel that would be inefficient.
Yes, awk is a great choice for this:
awk '/\/$/ { print > "folders.txt"; next }
{ print > "files.txt" }' urls.txt
/\/$/ { print > "folders.txt"; next } if the line ends with a /, write it to folders.txt and skip to the next line
{ print > "files.txt" } write all other lines to files.txt
You may want to use the expression /\/[[:space:]]*$/ instead of /\/$/ in case you have trailing spaces in your file.
All you need is:
awk '{print > ((/\/$/ ? "folders" : "files")".txt")}' urls.txt
With coreutils, grep and bash process substitution:
<urls tee >(grep '/$' > folders.txt) >(grep -v '/$' > files.txt) > /dev/null

How to quickly delete the lines in a file that contain items from a list in another file in BASH?

I have a file called words.txt containing a list of words. I also have a file called file.txt containing a sentence per line. I need to quickly delete any lines in file.txt that contain one of the lines from words.txt, but only if the match is found somewhere between { and }.
E.g. file.txt:
Once upon a time there was a cat.
{The cat} lived in the forest.
The {cat really liked to} eat mice.
E.g. words.txt:
cat
mice
Example output:
Once upon a time there was a cat.
The second and third lines are removed because "cat" is found on those two lines and the match lies between { and }.
The following script successfully does this task:
while read -r line
do
sed -i "/{.*$line.*}/d" file.txt
done < words.txt
This script is very slow. Sometimes words.txt contains several thousand items, so the while loop takes several minutes. I attempted to use the sed -f option, which seems to allow reading a file, but I cannot find any manuals explaining how to use this.
How can I improve the speed of the script?
An awk solution:
awk 'NR==FNR{a["{[^{}]*"$0"[^{}]*}"]++;next}{for(i in a)if($0~i)next;b[j++]=$0}END{printf "">FILENAME;for(i=0;i in b;++i)print b[i]>FILENAME}' words.txt file.txt
It modifies file.txt in place, leaving it with the expected output.
Once upon a time there was a cat.
Uncondensed version:
awk '
NR == FNR {
    a["{[^{}]*" $0 "[^{}]*}"]++
    next
}
{
    for (i in a)
        if ($0 ~ i)
            next
    b[j++] = $0
}
END {
    printf "" > FILENAME
    for (i = 0; i in b; ++i)
        print b[i] > FILENAME
}
' words.txt file.txt
If the files are expected to get so large that awk may not be able to buffer the kept lines, we can only print to stdout; we may not be able to modify the file directly:
awk '
NR == FNR {
    a["{[^{}]*" $0 "[^{}]*}"]++
    next
}
{
    for (i in a)
        if ($0 ~ i)
            next
}
1
' words.txt file.txt
You can use grep to match one file against the patterns from another file like this:
grep -vf words.txt file.txt
I think that using the grep command should be way faster. For example:
grep -f words.txt -v file.txt
The -f option makes grep use the words.txt file as a list of matching patterns.
The -v option inverts the matching, i.e. keeps the lines that do not match any of the patterns.
It doesn't solve the {} constraint, but that is easily avoidable, for example by adding the braces to the pattern file (or in a temporary file created at runtime).
I think this should work for you:
sed -e 's/.*/{.*&.*}/' words.txt | grep -vf- file.txt > out ; mv out file.txt
This basically just rewrites the words.txt patterns on the fly and feeds them to grep as a pattern file.
In pure native bash (4.x):
#!/bin/bash
# ^-- MUST start with a /bin/bash shebang, NOT /bin/sh
readarray -t words <words.txt            # read words into array
IFS='|'                                  # use | as delimiter when expanding $*
words_re="[{].*(${words[*]}).*[}]"       # form a regex matching all words
while read -r; do                        # for each line in file...
    if ! [[ $REPLY =~ $words_re ]]; then # ...check whether it matches...
        printf '%s\n' "$REPLY"           # ...and print it if not.
    fi
done <file.txt
Native bash is somewhat slower than awk, but this still is a single-pass solution (O(n+m), whereas the sed -i approach was O(n*m)), making it vastly faster than any iterative approach.
You could do this in two steps:
Wrap each word in words.txt with {.* and .*}:
awk '{ print "{.*" $0 ".*}" }' words.txt > wrapped.txt
Use grep with inverse match:
grep -v -f wrapped.txt file.txt
This would be particularly useful if words.txt is very large, as a pure-awk approach (storing all the entries of words.txt in an array) would require a lot of memory.
If you would prefer a one-liner and would like to skip creating the intermediate file, you could do this:
awk '{ print "{.*" $0 ".*}" }' words.txt | grep -v -f - file.txt
The - is a placeholder which tells grep to use stdin
update
If the size of words.txt isn't too big, you could do the whole thing in awk:
awk 'NR==FNR{a[$0]++;next}{p=1;for(i in a){if ($0 ~ "{.*" i ".*}") { p=0; break}}}p' words.txt file.txt
expanded:
awk 'NR==FNR { a[$0]++; next }
{
    p=1
    for (i in a) {
        if ($0 ~ "{.*" i ".*}") { p=0; break }
    }
}p' words.txt file.txt
The first block builds an array containing each line in words.txt. The second block runs for every line in file.txt. A flag p controls whether the line is printed. If the line matches any of the patterns, p is set to false (0). When the bare p after the last block evaluates to true, the default action occurs, which is to print the line.

Print text between two lines (from list of line numbers in file) in Unix [closed]

I have a sample file which has thousands of lines.
I want to print text between two line numbers in that file. I don't want to input line numbers manually, rather I have a file which contains list of line numbers between which text has to be printed.
Example : linenumbers.txt
345|789
999|1056
1522|1366
3523|3562
I need a shell script which will read line numbers from this file and print the text between each range of lines into a separate (new) file.
That is, it should print lines between 345 and 789 into a new file, say File1.txt, and print text between lines 999 and 1056 into a new file, say File2.txt, and so on.
Considering your target file has only thousands of lines, here is a quick and dirty solution:
awk -F'|' '{system("sed -n \""$1","$2"p\" targetFile > file"NR)}' linenumbers.txt
Here targetFile is your file containing thousands of lines.
The one-liner does not require your linenumbers.txt to be sorted.
The one-liner allows line ranges in your linenumbers.txt to overlap.
After running the command above, you will have n files named file1 through filen, where n is the number of rows in linenumbers.txt. You can change the filename pattern as you want.
Here's one way using GNU awk. Run like:
awk -f script.awk numbers.txt file.txt
Contents of script.awk:
BEGIN {
    # set the field separator
    FS="|"
}
# for the first file in the arguments list
FNR==NR {
    # add the row number and field one as keys to a multidimensional array with
    # a value of field two
    a[NR][$1]=$2
    # skip processing the rest of the code
    next
}
# for the second file in the arguments list
{
    # for every element in the array's first dimension
    for (i in a) {
        # for every element in the second dimension
        for (j in a[i]) {
            # ensure that the first field is treated numerically
            j+=0
            # if the line number is greater than the first field
            # and smaller than the second field
            if (FNR>=j && FNR<=a[i][j]) {
                # print the line to a file with the suffix of the first file's
                # line number (the first dimension)
                print > "File" i
            }
        }
    }
}
Alternatively, here's the one-liner:
awk -F "|" 'FNR==NR { a[NR][$1]=$2; next } { for (i in a) for (j in a[i]) { j+=0; if (FNR>=j && FNR<=a[i][j]) print > "File" i } }' numbers.txt file.txt
If you have an 'old' awk, here's the version with compatibility. Run like:
awk -f script.awk numbers.txt file.txt
Contents of script.awk:
BEGIN {
    # set the field separator
    FS="|"
}
# for the first file in the arguments list
FNR==NR {
    # add the row number and field one as a key to a pseudo-multidimensional
    # array with a value of field two
    a[NR,$1]=$2
    # skip processing the rest of the code
    next
}
# for the second file in the arguments list
{
    # for every element in the array
    for (i in a) {
        # split the element into another array
        # b[1] is the row number and b[2] is the first field
        split(i,b,SUBSEP)
        # if the line number is greater than the first field
        # and smaller than the second field
        if (FNR>=b[2] && FNR<=a[i]) {
            # print the line to a file with the suffix of the first file's
            # line number (the first pseudo-dimension)
            print > "File" b[1]
        }
    }
}
Alternatively, here's the one-liner:
awk -F "|" 'FNR==NR { a[NR,$1]=$2; next } { for (i in a) { split(i,b,SUBSEP); if (FNR>=b[2] && FNR<=a[i]) print > "File" b[1] } }' numbers.txt file.txt
I would use sed to process the sample data file because it is simple and swift. This requires a mechanism for converting the line numbers file into the appropriate sed script. There are many ways to do this.
One way uses sed to convert the set of line numbers into a sed script. If everything was going to standard output, this would be trivial. With the output needing to go to different files, we need a line number for each line in the line numbers file. One way to give line numbers is the nl command. Another possibility would be to use pr -n -l1. The same sed command line works with both:
nl linenumbers.txt |
sed 's/ *\([0-9]*\)[^0-9]*\([0-9]*\)|\([0-9]*\)/\2,\3w file\1.txt/'
For the given data file, that generates:
345,789w file1.txt
999,1056w file2.txt
1522,1366w file3.txt
3523,3562w file4.txt
Another option would be to have awk generate the sed script:
awk -F'|' '{ printf "%d,%dw file%d.txt\n", $1, $2, NR }' linenumbers.txt
If your version of sed will allow you to read its script from standard input with -f - (GNU sed does; BSD sed does not), then you can convert the line numbers file into a sed script on the fly, and use that to parse the sample data:
awk -F'|' '{ printf "%d,%dw file%d.txt\n", $1, $2, NR }' linenumbers.txt |
sed -n -f - sample.data
If your system supports /dev/stdin, you can use one of:
awk -F'|' '{ printf "%d,%dw file%d.txt\n", $1, $2, NR }' linenumbers.txt |
sed -n -f /dev/stdin sample.data
awk -F'|' '{ printf "%d,%dw file%d.txt\n", $1, $2, NR }' linenumbers.txt |
sed -n -f /dev/fd/0 sample.data
Failing that, use an explicit script file:
awk -F'|' '{ printf "%d,%dw file%d.txt\n", $1, $2, NR }' linenumbers.txt > sed.script
sed -n -f sed.script sample.data
rm -f sed.script
Strictly, you should deal with ensuring the temporary file name is unique (mktemp) and removed even if the script is interrupted (trap):
tmp=$(mktemp sed.script.XXXXXX)
trap "rm -f $tmp; exit 1" 0 1 2 3 13 15
awk -F'|' '{ printf "%d,%dw file%d.txt\n", $1, $2, NR }' linenumbers.txt > $tmp
sed -n -f $tmp sample.data
rm -f $tmp
trap 0
The final trap 0 allows your script to exit successfully; omit it, and your script will always exit with status 1.
I've ignored Perl and Python; either could be used for this in a single command. The file management is just fiddly enough that using sed seems simpler. You could also use just awk, either with a first awk script writing an awk script to do the heavy duty work (trivial extension of the above), or having a single awk process read both files and produce the required output (harder, but far from impossible).
If nothing else, this shows that there are many possible ways of doing the job. If this is a one-off exercise, it really doesn't matter very much which you choose. If you will be doing this repeatedly, then choose the mechanism that you like. If you're worried about performance, measure. It is likely that converting the line numbers into a command script is a negligible cost; processing the sample data with the command script is where the time is taken. I would expect sed to excel at that point; I've not measured to confirm that it does.
You could do the following
# myscript.sh
linenumbers="linenumbers.txt"
somefile="afile"
while IFS=\| read start end ; do
    echo "sed -n '$start,${end}p;${end}q;' $somefile > $somefile-$start-$end"
done < $linenumbers
Run it like so: sh myscript.sh
sed -n '345,789p;789q;' afile > afile-345-789
sed -n '999,1056p;1056q;' afile > afile-999-1056
sed -n '1522,1366p;1366q;' afile > afile-1522-1366
sed -n '3523,3562p;3562q;' afile > afile-3523-3562
Then, when you're happy, do sh myscript.sh | sh
EDIT Added William's excellent points on style and correctness.
EDIT Explanation
The basic idea is to get a script to generate a series of shell commands that can be checked for correctness first before being executed by "| sh".
sed -n '345,789p;789q;' means: use sed and don't echo each line (-n); there are two commands, the first saying from line 345 to 789 p(rint) the lines, and the second saying at line 789 q(uit) - by quitting on the last line you save having sed read the rest of the input file.
The while loop reads from the $linenumbers file using read. When read is given more than one variable name, it populates each with a field from the input; fields are usually separated by spaces, and if there are too few variable names, read puts the remaining data into the last variable name.
You can put the following in at your shell prompt to understand that behaviour.
ls -l | while read first rest ; do
echo $first XXXX $rest
done
Try adding another variable, second, to the above to see what happens then; it should be obvious.
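A sketch of that variation, for reference (the exact output depends on your directory listing):
ls -l | while read first second rest ; do
    echo $first XXXX $second YYYY $rest
done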
The problem is that your data is delimited by |s, and that's where William's suggestion of IFS=\| comes in: when reading the input, IFS has been changed, so the input is now separated by |s and we get the desired result.
Others can feel free to edit, correct and expand.
To extract the first field from 345|789 you can, for example, use awk:
awk -F'|' '{print $1}'
Combine that with the answers received from your other question and you will have a solution.
This might work for you (GNU sed):
sed -r 's/(.*)\|(.*)/\1,\2w file-\1-\2.txt/' linenumbers.txt | sed -nf - file

Calling an executable program using awk

I have a program in C that I want to call by using awk in shell scripting. How can I do something like this?
From the AWK man page:
system(cmd)
executes cmd and returns its exit status
The GNU AWK manual also has a section that, in part, describes the system function and provides an example:
system("date | mail -s 'awk run done' root")
A much more robust way would be to use awk's getline function to read a command's output from a pipe into a variable. In the form cmd | getline result, cmd is run, then its output is piped to getline. It returns 1 if it got output, 0 at EOF, and -1 on failure.
First construct the command to run in a variable in the BEGIN clause if the command is not dependent on the contents of the file, e.g. a simple date or an ls.
A simple example of the above would be
awk 'BEGIN {
    cmd = "ls -lrth"
    while ( ( cmd | getline result ) > 0 ) {
        print result
    }
    close(cmd);
}'
When the command to run is built from the columnar content of a file, you generate the cmd string in the main {..} block as below. E.g. consider a file whose $2 contains the name of another file, and you want $2 to be replaced with the md5sum hash of that file. You can do:
awk '{ cmd = "md5sum "$2
    while ( ( cmd | getline md5result ) > 0 ) {
        $2 = md5result
    }
    close(cmd);
}1'
Another frequent use of external commands in awk is date processing, when your awk does not support time functions like mktime() and strftime() out of the box.
Consider a case where you have a Unix epoch timestamp stored in a column and you want to convert it to a human-readable date format. Assuming GNU date is available:
awk '{ cmd = "date -d @" $1 " +\"%d-%m-%Y %H:%M:%S\""
    while ( ( cmd | getline fmtDate) > 0 ) {
        $1 = fmtDate
    }
    close(cmd);
}1'
For an input line such as
1572608319 foo bar zoo
the above command produces the output
01-11-2019 07:38:39 foo bar zoo
The command can be tailored to modify the date fields in any of the columns of a given line. Note that -d is a GNU-specific extension; the *BSD variants support -f (though it is not exactly equivalent to -d).
More information about getline can be found in the AllAboutGetline article on the awk.freeshell.org page.
There are several ways.
awk has a system() function that will run a shell command:
system("cmd")
You can print to a pipe:
print "blah" | "cmd"
You can have awk construct commands, and pipe all the output to the shell:
awk 'some script' | sh
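As a sketch of that last pattern (the file list and the generated mv commands are purely illustrative), awk prints one shell command per input line and sh executes them; drop the final | sh to inspect the commands first:
awk '{ printf "mv %s %s.bak\n", $1, $1 }' filelist.txt | sh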
Something as simple as this will work
awk 'BEGIN{system("echo hello")}'
and
awk 'BEGIN { system("date"); close("date")}'
I use the power of awk to delete some of my stopped docker containers. Observe carefully how I construct the cmd string first before passing it to system.
docker ps -a | awk '$3 ~ "/bin/clish" { cmd="docker rm "$1;system(cmd)}'
Here, I match the 3rd column against the pattern "/bin/clish", then extract the container ID from the first column to construct my cmd string, and pass that to system.
It really depends :) One of the handy linux core utils (info coreutils) is xargs. If you are using awk you probably have a more involved use-case in mind - your question is not very detailed.
printf "1 2\n3 4" | awk '{ print $2 }' | xargs touch
Will execute touch 2 4. Here touch could be replaced by your program. More info at info xargs and man xargs (really, read these).
I believe you would like to replace touch with your program.
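For instance, if your compiled C program were called ./myprog (a hypothetical name), the same pipeline becomes:
printf "1 2\n3 4" | awk '{ print $2 }' | xargs ./myprog
# equivalent to running: ./myprog 2 4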
Breakdown of the aforementioned script:
printf "1 2\n3 4"
# Output:
1 2
3 4
# The pipe (|) makes the output of the left command the input of
# the right command (simplified)
printf "1 2\n3 4" | awk '{ print $2 }'
# Output (of the awk command):
2
4
# xargs will execute a command with arguments. The arguments
# are made up by taking the input to xargs (in this case the output
# of the awk command, which is "2 4").
printf "1 2\n3 4" | awk '{ print $2 }' | xargs touch
# No output, but executes: `touch 2 4` which will create (or update
# timestamp if the files already exist) files with the name "2" and "4"
Update In the original answer, I used echo instead of printf. However, printf is the better and more portable alternative as was pointed out by a comment (where great links with discussions can be found).
#!/usr/bin/awk -f
BEGIN {
    command = "ls -lh"
    command | getline
}
This runs "ls -lh" from an awk script; note that a single getline like this reads only the first line of the command's output (into $0).
You can easily call a command with parameters via the system function.
For example, to kill processes corresponding to a certain string (there are other ways, of course):
ps aux | grep my_searched_string | awk '{system("kill " $2)}'
I was able to get this done via the method below:
cat ../logs/em2.log.1 | grep -i 192.168.21.15 | awk '{system("date"); print $1}'
awk has a function called system that enables you to execute any shell command from within awk.

Using a subshell for parameter substitution with diff

I'm writing a shell script, and in an effort to make it shorter and easier to read, I'm trying to use nested subshells to pass parameters to diff.
Here's what I have:
if
diff -iy '$(sort '$(awk 'BEGIN { FS = "|" } ; {print $1}' new-participants-by-state.csv)' '$(awk 'BEGIN { FS = "|" } ; {print $1}' current-participants-by-state.csv)')' > /dev/null;
then
echo There is no difference between the files. > ./participants-by-state-results.txt;
else
diff -iy '$(sort '$(awk 'BEGIN { FS = "|" } ; {print $1}' new-participants-by-state.csv)' '$(awk 'BEGIN { FS = "|" } ; {print $1}' current-participants-by-state.csv)')' > ./participants-by-state-results.txt;
fi
When I run the script, I keep getting diff: extra operand 'AL'
I'd appreciate any insight into why this is failing. I think I'm pretty close. Thanks!
Your code is unreadable because the lines are so long:
if diff -iy '$(sort '$(awk 'BEGIN { FS = "|" } ; {print $1}' new-participants-by-state.csv)' \
'$(awk 'BEGIN { FS = "|" } ; {print $1}' current-participants-by-state.csv)')' \
> /dev/null;
then
echo There is no difference between the files. > ./participants-by-state-results.txt;
else
diff -iy '$(sort '$(awk 'BEGIN { FS = "|" } ; {print $1}' new-participants-by-state.csv)' \
'$(awk 'BEGIN { FS = "|" } ; {print $1}' current-participants-by-state.csv)')' \
> ./participants-by-state-results.txt;
fi
Repeating whole commands like that is also fairly nasty. You also have major problems with your use of single quotes; you only have one sort in each set of commands, apparently operating on the combined outputs of two identical awk commands (whereas you probably need two separate sorts, one for the output of each awk command); you're not using the -F option to awk when you could; you are repeating the gargantuan file names all over the place; and finally, it appears that you are probably wanting to use process substitution, but not actually doing so.
Let's take a step back and formulate the question clearly.
Given two files (new-participants-by-state.csv and current-participants-by-state.csv) find the first pipe-separated field on each line of each file, sort the lists of those fields, and compare the results of the two sorted lists.
If there are no differences, write a message into the output file participants-by-state-results.txt; otherwise, list the differences in the output file.
So, we could use:
oldfile='current-participants-by-state.csv'
newfile='new-participants-by-state.csv'
outfile='participants-by-state-results.txt'
tmpfile=${TMPDIR:-/tmp}/xx.$$
awk -F'|' '{print $1}' $oldfile | sort > $tmpfile.1
awk -F'|' '{print $1}' $newfile | sort > $tmpfile.2
if diff -iy $tmpfile.1 $tmpfile.2 > $outfile
then echo "There is no difference between the files" > $outfile
fi
rm -f $tmpfile.?
If this was going to be the final script, we'd want to put trap handling in place so that the temporary files are not left around unless the script is killed dead with SIGKILL.
However, we can now use process substitution to avoid the temporary files:
oldfile='current-participants-by-state.csv'
newfile='new-participants-by-state.csv'
outfile='participants-by-state-results.txt'
if diff -iy <(awk -F'|' '{print $1}' $oldfile | sort) \
<(awk -F'|' '{print $1}' $newfile | sort) > $outfile
then echo "There is no difference between the files" > $outfile
fi
Note how the code carefully preserves symmetries where there are symmetries. Note the use of shortish variable names to avoid the repetition of long file names. Note that the diff command is run just once, not twice - throwing away results which are needed later is not very sensible.
You could compress the output I/O redirection even more using:
{
    if diff -iy <(awk -F'|' '{print $1}' $oldfile | sort) \
                <(awk -F'|' '{print $1}' $newfile | sort)
    then echo "There is no difference between the files"
    fi
} > $outfile
That sends the standard output of the enclosed commands to the file.
Of course, CSV might not be the appropriate nomenclature if the files are pipe-separated rather than comma-separated, but that's another matter altogether.
I'm also assuming that the status from diff -iy works as suggested by the original script; I've not validated that usage of the diff command.
There are several problems here.
First, you're putting various arguments in single-quotes, which prevents any interpretation being done on them (for example, $(....) doesn't do anything special inside single-quotes). You're probably thinking of double-quotes, but those aren't what you want either.
Which brings us to the second problem, that diff and sort expect to be given filenames as arguments, and they operate on the data in those files; you're trying to pass the data directly as arguments, which doesn't work (and I suspect that's the origin of the error you're getting: diff expects exactly two filenames, you're passing more than two participant names, and AL happened to be third on the list and hence the one that diff panicked on). The usual way to do this is to use intermediate files (and multiple lines in the script), but bash actually has a way of doing this without either of those: process substitution. Essentially, what it does is run one command with output (or input, but we need output in this case) sent to a named pipe; then it passes the name of the pipe as an argument to another command. For example, diff <(command1) <(command2) will give you the differences between the outputs of command1 and command2. Note that since this is a bash-only feature, you must start the script with #!/bin/bash, not #!/bin/sh.
Third, there's a missing close-parenthesis that makes it a little hard to tell what's supposed to happen. Are both files supposed to be sorted before the comparison, or only the new-participants file?
Fourth, since the final comparison ignores case (-i), you'd better use a case-insensitive sort (-f) as well.
Finally, you're doing all of the processing twice if there are any differences. I'd recommend running the comparison once into a file, then if there were no differences just ignore/overwrite the (empty) file.
Oh, and just a stylistic thing: you don't need semicolons at the end of lines in bash. You only need semicolons if you're putting more than one command on the same line (and a few other cases like before then in an if statement).
Anyway, here's my rewrite:
#!/bin/bash
if
diff -iy <(awk 'BEGIN { FS = "|" } ; {print $1}' new-participants-by-state.csv | sort -f) \
         <(awk 'BEGIN { FS = "|" } ; {print $1}' current-participants-by-state.csv | sort -f) \
         > ./participants-by-state-results.txt
then
echo "There is no difference between the files." > ./participants-by-state-results.txt
fi
