Performance Issue with While and Read - bash

I have a file with many lines, each containing a comma. For each line I want to delete the comma and everything after it, and also remove from the remaining text every character that appears after the comma. I have a bash script which does this, but it isn't fast enough.
Input:
hello world, def
Output:
hllo worl
My slow script:
#!/bin/bash
while read line; do
values="${line#*, }"
phrase="${line%, *}"
echo "${phrase//[$values]}"
done < "$1"
I want to improve the performance.
Any suggestions?

Using Perl
$ perl -F',' -lane '$F[0] =~ s/[$F[1]]//g; print $F[0]' file
hlloworl
If you don't want to count the space after the comma:
$ perl -F',\s*' -lane '$F[0] =~ s/[$F[1]]//g; print $F[0]' file
hllo worl
Perl excels at text manipulation like this, so I'd expect this to be pretty quick.

Getting rid of the while loop should give your code a boost; most programs take a file as input and do the line-by-line reading for you.
You can replace your program with the following and report the times:
cut -d"," -f1 < file
You can try with awk, changing the field separator to ,:
awk 'BEGIN {FS=","}; {print $1}' file
You could also try sed (with the modifications suggested by @Qualia):
sed -r -i "s/,.*//g" file
Beware, though, that the -i flag edits your file in place; if that is not the desired effect, you can just do:
sed -r "s/,.*//g" file

An AWK solution (edited taking inspiration from @glenn jackman's perl solution):
awk -F", " '{ gsub("["$2"]",""); print $1 }' "$1"
With this sort of line processing, it's often better to use a compiled solution. I would use Haskell for its expressiveness:
-- answer.hs
import Data.List(nub, delete)
import Data.Char(isSpace)
main = interact (unlines . (map perLine) . lines)
perLine = strSetDiff . break (==',')
strSetDiff (s, ',':' ':sub) = filter (`notElem` sub) s
strSetDiff (s, _) = s
Compile with the command ghc -O2 answer.hs.
This breaks each line into two lists s and sub at the first comma, strips the leading ", " from sub, and then filters s to remove characters that are elements of sub. If there is no comma, the result is the whole line.
This assumes a space always follows the comma; otherwise remove the ' ': and replace (`notElem` sub) with (`notElem` (dropWhile isSpace sub)).
Time taken for an 80000 line file consisting of 10 lines repeated 8000 times:
$ time ./answer <infile >outfile
0.38s user 0.00s system 99% cpu 0.386 total
$ time [glenn jackman's perl]
0.68s user 0.00s system 99% cpu 0.691 total
$ time awk -F", " '{ gsub("["$2"]",""); print $1 }' infile > outfile
0.85s user 0.04s system 99% cpu 0.897 total
$ time ./ElBarajas.sh infile > outfile
2.77s user 0.32s system 99% cpu 3.105 total
Personally, I'm willing to admit defeat - the perl solution seems best to me.

Related

Stream filter large number of lines that are specified by line number from stdin

I have a huge xz compressed text file huge.txt.xz with millions of lines that is too large to keep around uncompressed (60GB).
I would like to quickly filter/select a large number of lines (~1000s) from that huge text file into a file filtered.txt. The line numbers to select could for example be specified in a separate text file select.txt with a format as follows:
10
14
...
1499
15858
Overall, I envisage a shell command as follows where "TO BE DETERMINED" is the command I'm looking for:
xz -dcq huge.txt.xz | "TO BE DETERMINED" select.txt >filtered.txt
I've managed to find an awk program from a closely related question that almost does the job - the only problem being that it takes a file name instead of reading from stdin. Unfortunately, I don't really understand the awk script and don't know enough awk to alter it in such a way to work in this case.
This is what works right now with the disadvantage of having a 60GB file lie around rather than streaming:
xz -dcq huge.txt.xz >huge.txt
awk '!firstfile_proceed { nums[$1]; next }
(FNR in nums)' select.txt firstfile_proceed=1 >filtered.txt
Inspiration: https://unix.stackexchange.com/questions/612680/remove-lines-with-specific-line-number-specified-in-a-file
Keeping with OP's current idea:
xz -dcq huge.txt.xz | awk '!firstfile_proceed { nums[$1]; next } (FNR in nums)' select.txt firstfile_proceed=1 -
Where the - (at the end of the line) tells awk to read from stdin (in this case the output from xz that's being piped to the awk call).
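As a minimal illustration of the - convention (toy data and a hypothetical demo_select.txt, not the OP's files):
printf '1\n3\n' > demo_select.txt      # tiny "select" file for the demo
seq 5 | awk '!firstfile_proceed { nums[$1]; next } (FNR in nums)' demo_select.txt firstfile_proceed=1 -
# prints 1 and 3, i.e. lines 1 and 3 of the piped stream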
Another way to do this (replaces all of the above code):
awk '
FNR==NR { nums[$1]; next } # process first file
FNR in nums # process subsequent file(s)
' select.txt <(xz -dcq huge.txt.xz)
Comments removed and cut down to a 'one-liner':
awk 'FNR==NR {nums[$1];next} FNR in nums' select.txt <(xz -dcq huge.txt.xz)
Adding some logic to implement Ed Morton's comment (exit processing once FNR > largest value from select.txt):
awk '
# process first file
FNR==NR { nums[$1]
maxFNR= ($1>maxFNR ? $1 : maxFNR)
next
}
# process subsequent file(s):
FNR > maxFNR { exit }
FNR in nums
' select.txt <(xz -dcq huge.txt.xz)
NOTES:
keeping in mind we're talking about scanning millions of lines of input ...
FNR > maxFNR will obviously add some cpu/processing time to the overall operation (though less time than FNR in nums)
if the operation routinely needs to pull rows from, say, the last 25% of the file then FNR > maxFNR is likely providing little benefit (and probably slowing down the operation)
if the operation routinely finds all desired rows in, say, the first 50% of the file then FNR > maxFNR is probably worth the cpu/processing time to keep from scanning the entire input stream (then again, the xz operation, on the entire file, is likely the biggest time consumer)
net result: the additional FNR > maxFNR test may speed up or slow down the overall process depending on how much of the input stream needs to be processed in a typical run; OP would need to run some tests to see if there's a (noticeable) difference in overall runtime
To clarify my previous comment, I'll show a simple reproducible sample:
linelist content:
10
15858
14
1499
To simulate a long input, I'll use seq -w 100000000.
Comparing the sed solution with my suggestion, we have:
#!/bin/bash
time (
sed 's/$/p/' linelist > selector
seq -w 100000000 | sed -nf selector
)
time (
sort -n linelist | sed '$!{s/$/p/};$s/$/{p;q}/' > my_selector
seq -w 100000000 | sed -nf my_selector
)
output:
000000010
000000014
000001499
000015858
real 1m23.375s
user 1m38.004s
sys 0m1.337s
000000010
000000014
000001499
000015858
real 0m0.013s
user 0m0.014s
sys 0m0.002s
Comparing my solution with awk:
#!/bin/bash
time (
awk '
# process first file
FNR==NR { nums[$1]
maxFNR= ($1>maxFNR ? $1 : maxFNR)
next
}
# process subsequent file(s):
FNR > maxFNR { exit }
FNR in nums
' linelist <(seq -w 100000000)
)
time (
sort -n linelist | sed '$!{s/$/p/};$s/$/{p;q}/' > my_selector
sed -nf my_selector <(seq -w 100000000)
)
output:
000000010
000000014
000001499
000015858
real 0m0.023s
user 0m0.020s
sys 0m0.001s
000000010
000000014
000001499
000015858
real 0m0.017s
user 0m0.007s
sys 0m0.001s
My conclusion: sed using q is comparable with the awk solution. For readability and maintainability I prefer the awk solution.
Anyway, this test is simplistic and only useful for small comparisons. I don't know, for example, what the result would be if I tested this against the real compressed file, with heavy disk I/O.
EDIT by Ed Morton:
Any speed test that results in all output values that are less than a second is a bad test because:
In general no-one cares if X runs in 0.1 or 0.2 secs, they're both fast enough unless being called in a large loop, and
Things like cache-ing can impact the results, and
Often a script that runs faster for a small input set where execution speed doesn't matter will run slower for a large input set where execution speed DOES matter (e.g. if the script that's slower for the small input spends time setting up data structures that will allow it to run faster for the larger)
The problem with the above example is that it only prints 4 lines rather than the thousands of lines the OP said they'd have to select, so it doesn't exercise the difference that makes the sed solution much slower than the awk one: the sed solution has to test every target line number against every line of input, while the awk solution just does a single hash lookup on the current line number. It's an O(N) vs O(1) algorithm per line of the input file.
Here's a better example showing printing every 100th line from a 1000000 line file (i.e. will select 1000 lines) rather than just 4 lines from any size file:
$ cat tst_awk.sh
#!/usr/bin/env bash
n=1000000
m=100
awk -v n="$n" -v m="$m" 'BEGIN{for (i=1; i<=n; i+=m) print i}' > linelist
seq "$n" |
awk '
FNR==NR {
nums[$1]
maxFNR = $1
next
}
FNR in nums {
print
if ( FNR == maxFNR ) {
exit
}
}
' linelist -
$ cat tst_sed.sh
#!/usr/bin/env bash
n=1000000
m=100
awk -v n="$n" -v m="$m" 'BEGIN{for (i=1; i<=n; i+=m) print i}' > linelist
sed '$!{s/$/p/};$s/$/{p;q}/' linelist > my_selector
seq "$n" |
sed -nf my_selector
$ time ./tst_awk.sh > ou.awk
real 0m0.376s
user 0m0.311s
sys 0m0.061s
$ time ./tst_sed.sh > ou.sed
real 0m33.757s
user 0m33.576s
sys 0m0.045s
As you can see the awk solution ran 2 orders of magnitude faster than the sed one, and they produced the same output:
$ diff ou.awk ou.sed
$
If I make the input file bigger and select 10,000 lines from it by setting:
n=10000000
m=1000
in each script, which is probably more realistic for the OP's usage, the difference becomes really impressive:
$ time ./tst_awk.sh > ou.awk
real 0m2.474s
user 0m2.843s
sys 0m0.122s
$ time ./tst_sed.sh > ou.sed
real 5m31.539s
user 5m31.669s
sys 0m0.183s
i.e. awk runs in 2.5 seconds while sed takes 5.5 minutes!
If you have a file of line numbers, add p to the end of each and run it as a sed script.
If linelist contains
10
14
1499
15858
then sed 's/$/p/' linelist > selector creates
10p
14p
1499p
15858p
then
$: for n in {1..1500}; do echo $n; done | sed -nf selector
10
14
1499
I didn't send enough lines through to match 15858 so that one didn't print.
This works the same with a decompression from a file.
$: tar xOzf x.tgz | sed -nf selector
10
14
1499
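The OP's xz stream can be filtered the same way (a sketch reusing the selector file built above):
xz -dcq huge.txt.xz | sed -nf selector > filtered.txt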

Why is my awk script much slower than the head+tail script?

I want to split a huge file (big.txt) by given line numbers. For example, if the given numbers are 10 15 30, I will get 4 files: lines 1-10, 11-15, 16-30, and 31 to EOF of big.txt.
Solving the problem is not a challenge for me; I wrote 3 different solutions. However, I cannot explain the performance: why is the awk script the slowest? (GNU Awk)
For big.txt, I just did seq 1500000000 > big.txt (about 1.5 billion lines, ~15 GB)
first, the head and tail:
INPUT_FILE="big.txt" # The input file
LINE_NUMBERS=( 400000 700000 1200000 ) # Given line numbers
START=0 # The offset to calculate lines
IDX=1 # The index used in the name of generated files: file1, file2 ...
for i in "${LINE_NUMBERS[#]}"
do
# Extract the lines
head -n $i "$INPUT_FILE" | tail -n +$START > "file$IDX.txt"
#
(( IDX++ ))
START=$(( i+1 ))
done
# Extract the last given line - last line in the file
tail -n +$START "$INPUT_FILE" > "file$IDX.txt"
The 2nd: sed:
INPUT_FILE="big.txt" # The input file
LINE_NUMBERS=( 400000 700000 1200000 ) # Given line numbers
START=1 # The offset to calculate lines
IDX=1 # The index used in the name of generated files: file1, file2 ...
for i in "${LINE_NUMBERS[#]}"
do
T=$(( i+1 ))
# Extract the lines using sed command
sed -n -e " $START, $i p" -e "$T q" "$INPUT_FILE" > "file$IDX.txt"
(( IDX++ ))
START=$T
done
# Extract the last given line - last line in the file
sed -n "$START, $ p" "$INPUT_FILE" > "file$IDX.txt"
the last, awk
awk -v nums="400000 700000 1200000" 'BEGIN{c=split(nums,a)} {
for(i=1; i<=c; i++){
if( NR<=a[i] ){
print > "file" i ".txt"
next
}
}
print > "file" c+1 ".txt"
}' big.txt
From my testing (using time command), the head+tail is the fastest:
real 73.48
user 1.42
sys 17.62
the sed one:
real 144.75
user 105.68
sys 15.58
the awk one:
real 234.21
user 187.92
sys 3.98
The awk script goes through the file only once; why is it so much slower than the other two? Also, I thought head and tail would be the slowest solution; how come they are so fast? I guess it might have something to do with awk's redirection (print > file)?
Can someone explain it to me? Thank you.
Can awk be faster than head and tail for this?
No, it will be slower, at least for a reasonable number of chunks on a large input file, because awk reads every line and does some work with it. head and tail, on the other hand, mostly just scan for newline characters without doing anything else until they reach the line number given as an argument; then they no longer have to read line by line and decide what to do, they simply dump the content, similar to cat.
If we increase the number of chunks, i.e. the array of splitting line numbers gets larger and larger, then we reach a point where the cost of spawning many head and tail processes overcomes the cost of one awk process, and from that point on awk will be faster.
awk script improvement
This awk is slow because of that loop! Just think that for the last output file, for every line to print, we run 4 iterations before we print the line. Of course the time complexity still remains linear in the input, but all these checks and assignments have costs that become observable as the input grows. It can be much improved, e.g. like this:
> cat tst.awk
BEGIN {            # first line of each output chunk,
    a[1]           # matching the split points 400000 700000 1200000
    a[400001]
    a[700001]
    a[1200001]
}
NR in a {          # reached the start of a new chunk: switch output files
    close(out)
    out = "file" ++i ".txt"
}
{ print > out }
Here we do only one NR-in-array test per line; apart from that, we almost only print.
awk -f tst.awk big.txt
Testing
Here is some basic testing. I made a file, not huge, with 5.2M lines.
> wc -l big.txt
5288558 big.txt
Now, with that loop, it really matters where you split the file! If most of the rows have to be written to the last chunks, that means more iterations per line, and it is slower:
> head -1 test.sh
awk -v nums="100000 200000 300000" 'BEGIN{c=split(nums,a)} {
> time sh test.sh
real 0m10.960s
user 0m10.823s
sys 0m0.066s
If most rows go to the first file (that means one iteration and then next), it becomes faster!
> head -1 test.sh
awk -v nums="5000000 5100000 5200000" 'BEGIN{c=split(nums,a)} {
> time sh test.sh
real 0m6.914s
user 0m6.838s
sys 0m0.043s
With the above modification it should be fast enough regardless of the cut points.
> time awk -f tst.awk big.txt
real 0m4.270s
user 0m4.185s
sys 0m0.048s
For awk, each line requires a loop, comparisons, and building the filename. Perhaps awk also does the extra work of parsing each line into fields.
You may want to try the following experiments (both sketched below):
try mawk (a fast implementation of awk) and check whether it is much faster.
remove print > "file" i ".txt" and see how much time that saves.
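A rough sketch of those two experiments (file names as in the question; the redirection target is parenthesized for portability, and timings will of course vary):
# 1) the same script, run with mawk instead of gawk
time mawk -v nums="400000 700000 1200000" 'BEGIN{c=split(nums,a)} {
  for(i=1; i<=c; i++){ if( NR<=a[i] ){ print > ("file" i ".txt"); next } }
  print > ("file" c+1 ".txt")
}' big.txt
# 2) the same loop with the prints removed, to isolate the cost of the output redirection
time awk -v nums="400000 700000 1200000" 'BEGIN{c=split(nums,a)} {
  for(i=1; i<=c; i++){ if( NR<=a[i] ) next }
}' big.txt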

How to write a bash script that dumps itself out to stdout (for use as a help file)?

Sometimes I want a bash script that's mostly a help file. There are probably better ways to do things, but sometimes I want to just have a file called "awk_help" that I run, and it dumps my awk notes to the terminal.
How can I do this easily?
Another idea: use #!/bin/cat -- this will literally answer the title of your question, since the shebang line will be displayed as well.
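A minimal sketch of that idea (hypothetical awk_help file; note the shebang itself is printed too):
#!/bin/cat
# awk_help (hypothetical): running ./awk_help just prints this whole file, shebang included
awk -F ',' '{print $NF}' filename    # print the last comma-separated field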
Turns out it can be done as pretty much a one-liner, thanks to @CharlesDuffy for the suggestions!
Just put the following at the top of the file, and you're done:
cat "$BASH_SOURCE" | grep -v EZREMOVEHEADER
So for my awk_help example, it'd be:
cat "$BASH_SOURCE" | grep -v EZREMOVEHEADER
# Basic form of all awk commands
awk search pattern { program actions }
# advanced awk
awk 'BEGIN {init} search1 {actions} search2 {actions} END { final actions }' file
# awk boolean example for matching "(me OR you) OR (john AND ! doe)"
awk '( /me|you/ ) || (/john/ && ! /doe/ )' /path/to/file
# awk - print # of lines in file
awk 'END {print NR,"coins"}' coins.txt
# Sum up gold ounces in column 2, and find out value at $425/ounce
awk '/gold/ {ounces += $2} END {print "value = $" 425*ounces}' coins.txt
# Print the last column of each line in a file, using a comma (instead of space) as a field separator:
awk -F ',' '{print $NF}' filename
# Sum the values in the first column and pretty-print the values and then the total:
awk '{s+=$1; print $1} END {print "--------"; print s}' filename
# functions available
length($0) > 72, toupper,tolower
# count the # of times the word PASSED shows up in the file /tmp/out
cat /tmp/out | awk 'BEGIN {X=0} /PASSED/{X+=1; print $1 X}'
# awk regex operators
https://www.gnu.org/software/gawk/manual/html_node/Regexp-Operators.html
I found another solution that works on Mac/Linux and works exactly as one would hope.
Just use the following as your "shebang" line, and it'll output everything from line 2 on down:
test.sh
#!/usr/bin/tail -n+2
hi there
how are you
Running this gives you what you'd expect:
$ ./test.sh
hi there
how are you
And another possible solution: just use less, and that way your file will open in a searchable viewer.
#!/usr/bin/less
And this way you can grep it for something too, e.g.
$ ./test.sh | grep something
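A small hedged sketch of such a notes file (hypothetical name notes.sh; the less path may differ per system). When stdout is not a terminal, less simply passes the text through, so piping into grep still works:
#!/usr/bin/less
# notes.sh (hypothetical): opens in less when run from a terminal;
# when piped (./notes.sh | grep pattern) less passes the text through like cat
awk -F ',' '{print $NF}' filename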

Iterative replacement of substrings in bash

I'm trying to write a simple script to make several replacements in a big text file. I have a "map" file which contains the records to be searched and replaced, one per line, separated by a space, and an "input" file where I need the changes to be made. The example files and the script I wrote are below.
Map file
new_0 old_0
new_1 old_1
new_2 old_2
new_3 old_3
new_4 old_4
Input file
itsa(old_0)single(old_2)string(old_1)with(old_5)ocurrences(old_4)ofthe(old_3)records
Script
#!/bin/bash
while read -r mapline ; do
mapf1=`awk 'BEGIN {FS=" "} {print $1}' <<< "$mapline"`
mapf2=`awk 'BEGIN {FS=" "} {print $2}' <<< "$mapline"`
for line in $(cat "input") ; do
if [[ "${line}" == *"${mapf2}"* ]] ; then
sed "s/${mapf2}/${mapf1}/g" <<< "${line}"
fi
done < "input"
done < "map"
The thing is that the searches and replaces are made correctly, but I can't find a way to save the output of each iteration and work on it in the next. So my output looks like this:
itsa(new_0)single(old_2)string(old_1)withocurrences(old_4)ofthe(old_3)records
itsa(old_0)single(old_2)string(new_1)withocurrences(old_4)ofthe(old_3)records
itsa(old_0)single(new_2)string(old_1)withocurrences(old_4)ofthe(old_3)records
itsa(old_0)single(old_2)string(old_1)withocurrences(old_4)ofthe(new_3)records
itsa(old_0)single(old_2)string(old_1)withocurrences(new_4)ofthe(old_3)records
Yet, the desired output would look like this:
itsa(new_0)single(new_2)string(new_1)withocurrences(new_4)ofthe(new_3)records
Can anyone shed some light on these dark waters? Thanks in advance!
Improving the existing script
Improvements:
Use "$()" instead of ``. It supports whitespace and is easier to read.
Don't execute sed for each line. sed already loops over all lines and is faster than a loop in bash.
The adapted script:
text="$(< input)"
while read -r mapline; do
mapf1="$(awk 'BEGIN {FS=" "} {print $1}' <<< "$mapline")"
mapf2="$(awk 'BEGIN {FS=" "} {print $2}' <<< "$mapline")"
text="$(sed "s/${mapf2}/${mapf1}/g" <<< "$text")"
done < "map"
echo "$text"
The variable $text contains the complete input file and is modified in each iteration. The output of this script is the file after all replacements were done.
Alternative approach
Convert the map file into a pattern for sed and execute sed just once using that pattern.
pattern="$(sed 's#\(.*\) \(.*\)#s/\2/\1/g#' map)"
sed "$pattern" input
The first command is the conversion step. The file
new_0 old_0
new_1 old_1
...
will result in the pattern
s/old_0/new_0/g
s/old_1/new_1/g
...
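If you prefer to avoid the intermediate variable, the generated script can also be fed to sed via -f and process substitution (a sketch; assumes bash and that the map entries contain no characters special to sed):
sed -f <(sed 's#\(.*\) \(.*\)#s/\2/\1/g#' map) input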
It is possible in GNU Awk as follows:
awk 'FNR==NR{hash[$2]=$1; next} \
{for (i=1; i<=NF; i++) \
{for (key in hash) \
{if (match($i, key)) {$i = sprintf("(%s)", hash[key]); break}}} print}' \
map-file FS='[()]' OFS= input-file
produces an output as,
itsa(new_0)single(new_2)string(new_1)withold_5ocurrences(new_4)ofthe(new_3)records
Another in Gnu awk, using split and ternary operator(s):
$ awk '
NR==FNR { a[$2]=$1; next }
{
n=split($0,b,"[()]")
for(i=1;i<=n;i++)
printf "%s%s",(i%2 ? b[i] : (b[i] in a? "(" a[b[i]] ")":"")),(i==n?ORS:"")
}' map foo
itsa(new_0)single(new_2)string(new_1)withocurrences(new_4)ofthe(new_3)records
First the map is read into a hash. When processing the file, each record is split on ( and ); every even-numbered piece (i%2==0) may be a key in the map. While printing, the ternary operator tests whether the piece is found in a, and when there is a match the replacement is output in parentheses.

How to quickly delete the lines in a file that contain items from a list in another file in BASH?

I have a file called words.txt containing a list of words. I also have a file called file.txt containing a sentence per line. I need to quickly delete any lines in file.txt that contain one of the lines from words.txt, but only if the match is found somewhere between { and }.
E.g. file.txt:
Once upon a time there was a cat.
{The cat} lived in the forest.
The {cat really liked to} eat mice.
E.g. words.txt:
cat
mice
Example output:
Once upon a time there was a cat.
The second and third lines are removed because "cat" is found on them and it appears between { and }.
The following script successfully does this task:
while read -r line
do
sed -i "/{.*$line.*}/d" file.txt
done < words.txt
This script is very slow. Sometimes words.txt contains several thousand items, so the while loop takes several minutes. I attempted to use the sed -f option, which seems to allow reading a file, but I cannot find any manuals explaining how to use this.
How can I improve the speed of the script?
An awk solution:
awk 'NR==FNR{a["{[^{}]*"$0"[^{}]*}"]++;next}{for(i in a)if($0~i)next;b[j++]=$0}END{printf "">FILENAME;for(i=0;i in b;++i)print b[i]>FILENAME}' words.txt file.txt
It rewrites file.txt in place so that it contains the expected output.
Once upon a time there was a cat.
Uncondensed version:
awk '
NR == FNR {
a["{[^{}]*" $0 "[^{}]*}"]++
next
}
{
for (i in a)
if ($0 ~ i)
next
b[j++] = $0
}
END {
printf "" > FILENAME
for (i = 0; i in b; ++i)
print b[i] > FILENAME
}
' words.txt file.txt
If the files are expected to get so large that awk may not be able to hold everything in memory, we can only write to stdout (see the usage sketch after the block); we may not be able to modify the file in place:
awk '
NR == FNR {
a["{[^{}]*" $0 "[^{}]*}"]++
next
}
{
for (i in a)
if ($0 ~ i)
next
}
1
' words.txt file.txt
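A usage sketch for that stdout variant (same program condensed to one line; file names as in the question): write to a temporary file and replace the original only if awk succeeds:
awk 'NR==FNR{a["{[^{}]*"$0"[^{}]*}"]++;next}{for(i in a)if($0~i)next}1' words.txt file.txt > file.txt.tmp &&
mv file.txt.tmp file.txt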
You can use grep to match the two files like this:
grep -vf words.txt file.txt
I think that using the grep command should be way faster. For example:
grep -f words.txt -v file.txt
The -f option makes grep use the words.txt file as a list of patterns to match.
The -v option inverts the matching, i.e. it keeps lines that do not match any of the patterns.
It doesn't solve the {} constraint, but that is easily addressed, for example by adding the braces to the pattern file (or to a temporary file created at runtime).
I think this should work for you:
sed -e 's/.*/{.*&.*}/' words.txt | grep -vf- file.txt > out ; mv out file.txt
This basically just modifies the words.txt file on the fly and uses it as a word file for grep.
In pure native bash (4.x):
#!/usr/bin/env bash
# ^-- MUST start with a /bin/bash shebang, NOT /bin/sh
readarray -t words <words.txt # read words into array
IFS='|' # use | as delimiter when expanding $*
words_re="[{].*(${words[*]}).*[}]" # form a regex matching all words
while read -r; do # for each line in file...
if ! [[ $REPLY =~ $words_re ]]; then # ...check whether it matches...
printf '%s\n' "$REPLY" # ...and print it if not.
fi
done <file.txt
Native bash is somewhat slower than awk, but this still is a single-pass solution (O(n+m), whereas the sed -i approach was O(n*m)), making it vastly faster than any iterative approach.
You could do this in two steps:
Wrap each word in words.txt with {.* and .*}:
awk '{ print "{.*" $0 ".*}" }' words.txt > wrapped.txt
Use grep with inverse match:
grep -v -f wrapped.txt file.txt
This would be particularly useful if words.txt is very large, as a pure-awk approach (storing all the entries of words.txt in an array) would require a lot of memory.
If you would prefer a one-liner and would like to skip creating the intermediate file, you could do this:
awk '{ print "{.*" $0 ".*}" }' words.txt | grep -v -f - file.txt
The - is a placeholder that tells grep to read the patterns from stdin.
update
If the size of words.txt isn't too big, you could do the whole thing in awk:
awk 'NR==FNR{a[$0]++;next}{p=1;for(i in a){if ($0 ~ "{.*" i ".*}") { p=0; break}}}p' words.txt file.txt
expanded:
awk 'NR==FNR { a[$0]++; next }
{
p=1
for (i in a) {
if ($0 ~ "{.*" i ".*}") { p=0; break }
}
}p' words.txt file.txt
The first block builds an array containing each line of words.txt. The second block runs for every line of file.txt. A flag p controls whether the line is printed: if the line matches one of the patterns, p is set to 0 (false). When the bare p after the block evaluates to true, the default action occurs, which is to print the line.
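A tiny standalone illustration of that flag idiom (toy data, independent of the OP's files):
printf '%s\n' 'keep me' '{the cat} goes' 'keep too' |
awk '{ p=1; if ($0 ~ "{.*cat.*}") p=0 } p'
# prints "keep me" and "keep too"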
