How to quickly delete the lines in a file that contain items from a list in another file in BASH? - bash

I have a file called words.txt containing a list of words. I also have a file called file.txt containing a sentence per line. I need to quickly delete any lines in file.txt that contain one of the lines from words.txt, but only if the match is found somewhere between { and }.
E.g. file.txt:
Once upon a time there was a cat.
{The cat} lived in the forest.
The {cat really liked to} eat mice.
E.g. words.txt:
cat
mice
Example output:
Once upon a time there was a cat.
The second and third lines are removed because "cat" is found on those lines and the word is also between { and }.
The following script successfully does this task:
while read -r line
do
    sed -i "/{.*$line.*}/d" file.txt
done < words.txt
This script is very slow. Sometimes words.txt contains several thousand items, so the while loop takes several minutes. I attempted to use the sed -f option, which seems to allow reading the script from a file, but I cannot find any documentation explaining how to use it.
How can I improve the speed of the script?
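For reference, one hedged way to use sed -f here is to generate one delete command per word and feed the result to sed as a script; "-f -" (read the script from stdin) is a GNU sed feature, and this sketch assumes the words contain no regex metacharacters, | or / characters:
sed 's|.*|/{.*&.*}/d|' words.txt | sed -f - file.txt > file.tmp && mv file.tmp file.txt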

An awk solution:
awk 'NR==FNR{a["{[^{}]*"$0"[^{}]*}"]++;next}{for(i in a)if($0~i)next;b[j++]=$0}END{printf "">FILENAME;for(i=0;i in b;++i)print b[i]>FILENAME}' words.txt file.txt
It modifies file.txt in place, leaving the expected output:
Once upon a time there was a cat.
Uncondensed version:
awk '
NR == FNR {
    a["{[^{}]*" $0 "[^{}]*}"]++
    next
}
{
    for (i in a)
        if ($0 ~ i)
            next
    b[j++] = $0
}
END {
    printf "" > FILENAME
    for (i = 0; i in b; ++i)
        print b[i] > FILENAME
}
' words.txt file.txt
If file.txt may be too large for awk to buffer in the b array, we can write the result to stdout instead; in that case we cannot modify the file directly:
awk '
NR == FNR {
    a["{[^{}]*" $0 "[^{}]*}"]++
    next
}
{
    for (i in a)
        if ($0 ~ i)
            next
}
1
' words.txt file.txt

You can use grep to match one file against patterns taken from another file like this:
grep -vf words.txt file.txt

I think that using the grep command should be way faster. For example:
grep -f words.txt -v file.txt
The -f option makes grep use the words.txt file as a list of patterns to match.
The -v option inverts the matching, i.e. it keeps the lines that do not match any of the patterns.
It doesn't solve the {} constraint, but that is easily avoidable, for example by adding the brackets to the pattern file (or in a temporary file created at runtime).
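For example, a minimal sketch that adds the braces on the fly via process substitution (assuming the words contain no regex metacharacters), so no temporary file is left behind:
grep -v -f <(sed 's/.*/{.*&.*}/' words.txt) file.txt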

I think this should work for you:
sed -e 's/.*/{.*&.*}/' words.txt | grep -vf- file.txt > out ; mv out file.txt
This transforms the contents of words.txt on the fly (without changing the file itself) and uses the result as the pattern file for grep.

In pure native bash (4.x):
#!/bin/bash
# ^-- MUST be run with bash (4.x), NOT /bin/sh
readarray -t words <words.txt        # read words into an array
IFS='|'                              # use | as the delimiter when expanding ${words[*]}
words_re="[{].*(${words[*]}).*[}]"   # form a single regex matching all words
while read -r; do                    # for each line in file.txt...
    if ! [[ $REPLY =~ $words_re ]]; then   # ...check whether it matches...
        printf '%s\n' "$REPLY"             # ...and print it if not.
    fi
done <file.txt
Native bash is somewhat slower than awk, but this still is a single-pass solution (O(n+m), whereas the sed -i approach was O(n*m)), making it vastly faster than any iterative approach.

You could do this in two steps:
Wrap each word in words.txt with {.* and .*}:
awk '{ print "{.*" $0 ".*}" }' words.txt > wrapped.txt
Use grep with inverse match:
grep -v -f wrapped.txt file.txt
This would be particularly useful if words.txt is very large, as a pure-awk approach (storing all the entries of words.txt in an array) would require a lot of memory.
If you would prefer a one-liner and would like to skip creating the intermediate file, you could do this:
awk '{ print "{.*" $0 ".*}" }' words.txt | grep -v -f - file.txt
The - is a placeholder which tells grep to read the patterns from stdin.
Update:
If the size of words.txt isn't too big, you could do the whole thing in awk:
awk 'NR==FNR{a[$0]++;next}{p=1;for(i in a){if ($0 ~ "{.*" i ".*}") { p=0; break}}}p' words.txt file.txt
expanded:
awk 'NR==FNR { a[$0]++; next }
{
    p=1
    for (i in a) {
        if ($0 ~ "{.*" i ".*}") { p=0; break }
    }
}p' words.txt file.txt
The first block builds an array containing each line in words.txt. The second block runs for every line in file.txt; the flag p controls whether the line is printed, and it is set to false as soon as the line matches one of the patterns. The trailing p acts as a pattern: when it evaluates to true, awk's default action occurs, which is to print the line.

Related

Using grep -f to find the patterns themselves that have matches

I'm trying to give grep a pattern file (through -f), but I want to learn which patterns are matching something in the search file.
For example, given 1.txt:
a/(.*)
b/(.*)
b/c/(.*)
b/foo/(.*)
d/(.*)
e/(.*)
and 2.txt:
a/
a/foo/bar/
b/foo/
d/foo/
The patterns from 1.txt that match something in 2.txt (omitting the (.*) suffix) are as follows:
a/
b/
b/foo/
d/
How can I "find the list of patterns that have a match"?
EDIT: I'm only looking for a prefix match but I think the question is interesting enough for general pattern matching.
EDIT: Since a for-loop based solution was given, I should say I'm not looking at calling grep 10000 times. :) The working solution I already have (listed below) is pretty slow:
for line in "${file1_arr[@]}"; do
    if ! grep -qE "^$line(.*)\$" 2.txt; then
        echo "$line"
    fi
done
Ideally I'm looking for a single grep call or so with less overhead.
In awk:
$ awk 'NR==FNR{a[$0]=FNR;next}{for(i in a)if($0 ~ i)print i,$0}' 1.txt 2.txt
a/(.*) a/
a/(.*) a/foo/bar
b/(.*) b/foo
d/(.*) d/foo
Explained:
$ awk '
NR==FNR { # process first file
a[$0]=FNR # hash regex, store record number just in case
next # process next record
}
{ # process second file
for(i in a) # loop every entry in 1.txt
if($0 ~ i) # if regex matches record
print i,$0} # print all matching regex and record
' 1.txt 2.txt
Edit: To output each regex just once (as shown in the expected output) you could delete the regex from a once it's been used; that way it won't get matched and output more than once:
$ awk '
NR==FNR { a[$0]; next }
{
    for(i in a)
        if($0 ~ i) {
            print i
            delete a[i]    # a deleted regex won't get matched again
        }
}' 1.txt 2.txt
vendor/cloud.google.com/go/compute/metadata/(.*)$
vendor/cloud.google.com/go/compute/(.*)$
vendor/cloud.google.com/go/(.*)$
vendor/cloud.google.com/(.*)$
vendor/github.com/Azure/azure-sdk-for-go/arm/dns/(.*)$
vendor/github.com/Azure/azure-sdk-for-go/arm/(.*)$
vendor/github.com/Azure/azure-sdk-for-go/(.*)$
vendor/github.com/Azure/(.*)$
vendor/github.com/(.*)$
Also, my test showed about a 60% reduction in runtime (on a mini laptop, from 1:16 down to 29 s) with this modification for GNU awk (using the data you provided in the comments, file1.txt and file2.txt):
$ awk '
BEGIN {
    FS="."                      # . splits the url
}
NR==FNR { a[$1][$0]; next }     # we index on the first part of the url
{
    for(i in a[$1])             # search space decreased
        if($0 ~ i) {
            print i
            delete a[$1][i]
        }
}' file1.txt file2.txt
The speedup comes from decreasing the search space: the start of each string, up to the first period, is used as the key of the hash, i.e.:
FS="." # split at first .
...
a[vendor/github][vendor/github.com/Azure/(.*)$] # example of a hash
...
for(i in a[$1]) # search space decreased
Now it does not have to search the whole hash for a matching regex. It would probably be better still to use FS="/" ; a[$1 FS $2], but this was just a quick test (see the sketch below).
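A rough sketch of that variant, assuming GNU awk 4+ (arrays of arrays) and that every pattern and every line has at least two /-separated components, as in the vendor/... data above:
awk '
BEGIN { FS = "/" }
NR == FNR { a[$1 FS $2][$0]; next }    # index the patterns by their first two path components
($1 FS $2) in a {
    for (i in a[$1 FS $2])             # only scan patterns that share this prefix
        if ($0 ~ i) {
            print i
            delete a[$1 FS $2][i]
        }
}' file1.txt file2.txt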
The following script:
#!/usr/bin/env bash
lines=$(wc -l < 1.txt)
for (( i=1; i<=$lines; i++ )); do
line=$(sed -n "$i"p 1.txt)
line=$(sed "s/\/(.*)$//" <<< "$line")
grep -E "$line" 2.txt 1>/dev/null && echo "$line"
done
prints lines in 1.txt that matched in 2.txt:
a
b
b/foo
d
comments:
# gets a single line from 1.txt
line=$(sed -n "$i"p 1.txt)
# removes trailing pattern /(.*) from $line variable
line=$(sed "s/\/(.*)$//" <<< "$line")
# if $line matches in 2.txt, print $line
grep -E "$line" 2.txt 1>/dev/null && echo "$line"
I tried the awk and sed based solutions, and I realized I can do this much faster using bash's builtin regexp engine if I read both files in memory.
Here's basically it.
text="$(cat 2.txt)" # read 2.txt
while read -r line; do # for each 'line' from 1.txt
re=[^\b]*${line} # prepend ^ or \b to the pattern
if [[ "$text" =~ $re ]]; then # match the pattern to 2.txt
echo "${line}" # if there's a match, print the pattern
fi
done < <(cat "1.txt")
Since this doesn't spawn any extra processes and just does it in-memory, I suspect this is quite efficient. My benchmarks with the files I linked under James' answer show 8-9 seconds for this.
I don't see a solution with grep, but sed is an alternative to awk.
With sed I would like to see patterns like b/foo/.* in 1.txt, but I will show a solution based on the (.*).
The purpose of the first command is to construct sed commands that replace an input line with the regular expression it matches. The generated lines must look like
sed -rn 's#b/c/(.*)#b/c/#p' 2.txt
and this can be done with
# generate the sed commands
sed 's/\(.*\)\(([.][*])\)/s#\1\2#\1#p/' 1.txt
# and feed them in via process substitution, resulting in
sed -rnf <(sed 's/\(.*\)\(([.][*])\)/s#\1\2#\1#p/' 1.txt) 2.txt | sort -u
The solution is a bit difficult to read, caused by the layout of 1.txt, where I would have preferred lines like b/foo/.*.
The above commands have two bugs:
When the match is only on part of the line, the non-matched part is shown in the output. This can be fixed by matching the leading garbage as well:
# Use lines like 's#.*b/foo(.*)#b/foo#p'
sed -rnf <(sed 's/\(.*\)\(([.][*])\)/s#.*\1\2#\1#p/' 1.txt) 2.txt| sort -u
The second bug is that strings in 2.txt that have two matches will be matched only once (the first match edits the line in the stream).
This can be fixed by adding a unique marker (I will use \a) to the matching lines and repeating the input lines in the output (with \n&).
The output can then be recovered by looking for the \a markers.
sed -rnf <(sed 's/\(.*\)\(([.][*])\)/s#.*\1\2#\\a\1\\n\&#p/' 1.txt) 2.txt|
sed -rn '/\a/ s/.(.*)/\1/p' | sort -u
EDIT:
The work-around with a marker and restoring the original input is not needed when you follow a different approach.
In sed you can print something to stdout without changing the stream.
One possibility (slow for this situation) is using
sed '/something/ eecho "something" '
Another possibility is using the "x" command (which eXchanges the pattern space with the hold buffer). You actually want to have a sed script with commands like
\%a/% {h;s%.*%a/%p;x}
\%b/% {h;s%.*%b/%p;x}
\%b/c/% {h;s%.*%b/c/%p;x}
\%b/foo/% {h;s%.*%b/foo/%p;x}
\%d/% {h;s%.*%d/%p;x}
\%e/% {h;s%.*%e/%p;x}
Using above method the sed solution simplifies into
sed -nf <(
sed 's#([.][*])##; s#.*#\\%&% {h;s%.*%&%p;x} #' 1.txt
) 2.txt | sort -u
When the file 1.txt does not change often, you might want to preprocess that file.
sed 's#([.][*])##; s#.*#\\%&% {h;s%.*%&%p;x} #' 1.txt > /tmp/sed.in
sed -nf /tmp/sed.in 2.txt | sort -u

How can I find unique characters per line of input?

Is there any way to extract the unique characters of each line?
I know I can find the unique lines of a file using
sort -u file
I would like to determine the unique characters of each line (something like sort -u for each line).
To clarify: given this input:
111223234213
111111111111
123123123213
121212122212
I would like to get this output:
1234
1
123
12
Using sed
sed ':;s/\(.\)\(.*\)\1/\1\2/;t' file
Basically what it does is capture a character and check if it appears anywhere else on the line. It also captures all the characters between the two occurrences.
Then it replaces all of that, including the second occurrence, with just the first occurrence followed by whatever was in between.
t is a test that jumps back to the : label if the previous command was successful. This repeats until the s/// command fails, meaning only unique characters remain.
; just separates commands.
1234
1
123
12
Keeps order as well.
It doesn't get things in the original order, but this awk one-liner seems to work:
awk '{for(i=1;i<=length($0);i++){a[substr($0,i,1)]=1} for(i in a){printf("%s",i)} print "";delete a}' input.txt
Split apart for easier reading, it could be stand-alone like this:
#!/usr/bin/awk -f
{
# Step through the line, assigning each character as a key.
# Repeated keys overwrite each other.
for(i=1;i<=length($0);i++) {
a[substr($0,i,1)]=1;
}
# Print items in the array.
for(i in a) {
printf("%s",i);
}
# Print a newline after we've gone through our items.
print "";
# Get ready for the next line.
delete a;
}
Of course, the same concept can be implemented pretty easily in pure bash as well:
#!/usr/bin/env bash
while read s; do
declare -A a
while [ -n "$s" ]; do
a[${s:0:1}]=1
s=${s:1}
done
printf "%s" "${!a[#]}"
echo ""
unset a
done < input.txt
Note that this depends on bash 4, due to the associative array. And this one does get things in the original order, because bash does a better job of keeping array keys in order than awk.
And I think you've got a solution using sed from Jose, though it has a bunch of extra pipe-fitting involved. :)
The last tool you mentioned was grep. I'm pretty sure you can't do this in traditional grep, but perhaps some brave soul might be able to construct a perl-regexp variant (i.e. grep -P) using -o and lookarounds. They'd need more coffee than is in me right now though.
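For what it's worth, here is a hedged sketch of that grep -P idea (GNU grep built with PCRE support assumed). The negative lookahead keeps only the last occurrence of each character, so the characters come out in order of last appearance rather than first, and paste glues them back onto one line:
while IFS= read -r line; do
    grep -oP '(.)(?!.*\1)' <<< "$line" | paste -s -d '\0' -
done < input.txt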
One way using perl:
perl -F -lane 'print do { my %seen; grep { !$seen{$_}++ } @F }' file
Results:
1234
1
123
12
Another solution:
while read line; do
    grep -o . <<< "$line" | sort -u | paste -s -d '\0' -
done < file
grep -o . converts the line into one character per line
sort -u sorts the characters and removes repeated ones
paste -s -d '\0' - joins the characters back into a single line
- as a filename argument tells paste to read standard input.
This awk should work:
awk -F '' '{delete a; for(i=1; i<=NF; i++) a[$i]; for (j in a) printf "%s", j; print ""}' file
1234
1
123
12
Here:
-F '' will break the record char by char, giving us a single character in $1, $2, etc.
Note: For non-gnu awk use:
awk 'BEGIN{FS=""} {delete a; for(i=1; i<=NF; i++) a[$i];
for (j in a) printf "%s", j; print ""}' file
This might work for you (GNU sed):
sed 's/\B/\n/g;s/.*/echo "&"|sort -u/e;s/\n//g' file
Split each line into a series of lines. Unique sort those lines. Combine the result back into a single line.
Unique and sorted alternative to the others, using sed and gnu tools:
sed 's/\(.\)/\1\n/g' file | sort | uniq
which produces one character per line; if you want those all on one line, just do:
sed 's/\(.\)/\1\n/g' file | sort | uniq | sed ':a;N;$!ba;s/\n//g;'
This has the advantage of showing the characters in sorted order, rather than order of appearance.
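If per-line results are wanted, as in the question, a hedged per-line variant of the same idea (GNU sed assumed, for \n in the replacement) would be:
while IFS= read -r line; do
    sed 's/\(.\)/\1\n/g' <<< "$line" | sort -u | tr -d '\n'
    echo
done < file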

Print text between two lines (from list of line numbers in file) in Unix [closed]

I have a sample file which has thousands of lines.
I want to print text between two line numbers in that file. I don't want to input line numbers manually, rather I have a file which contains list of line numbers between which text has to be printed.
Example : linenumbers.txt
345|789
999|1056
1522|1366
3523|3562
I need a shell script which will read line numbers from this file and print the text between each range of lines into a separate (new) file.
That is, it should print lines between 345 and 789 into a new file, say File1.txt, and print text between lines 999 and 1056 into a new file, say File2.txt, and so on.
Considering your target file has only thousands of lines, here is a quick and dirty solution:
awk -F'|' '{system("sed -n \""$1","$2"p\" targetFile > file"NR)}' linenumbers.txt
targetFile is your file containing thousands of lines.
The one-liner does not require your linenumbers.txt to be sorted.
The one-liner allows line ranges in your linenumbers.txt to overlap.
After running the command above, you will have n files named file1 through filen, where n is the number of rows in linenumbers.txt. You can change the filename pattern as you want.
Here's one way using GNU awk. Run like:
awk -f script.awk numbers.txt file.txt
Contents of script.awk:
BEGIN {
    # set the field separator
    FS="|"
}
# for the first file in the arguments list
FNR==NR {
    # add the row number and field one as keys to a multidimensional array with
    # a value of field two
    a[NR][$1]=$2
    # skip processing the rest of the code
    next
}
# for the second file in the arguments list
{
    # for every element in the array's first dimension
    for (i in a) {
        # for every element in the second dimension
        for (j in a[i]) {
            # ensure that the first field is treated numerically
            j+=0
            # if the line number is greater than the first field
            # and smaller than the second field
            if (FNR>=j && FNR<=a[i][j]) {
                # print the line to a file with the suffix of the first file's
                # line number (the first dimension)
                print > ("File" i)
            }
        }
    }
}
Alternatively, here's the one-liner:
awk -F "|" 'FNR==NR { a[NR][$1]=$2; next } { for (i in a) for (j in a[i]) { j+=0; if (FNR>=j && FNR<=a[i][j]) print > "File" i } }' numbers.txt file.txt
If you have an 'old' awk, here's the version with compatibility. Run like:
awk -f script.awk numbers.txt file.txt
Contents of script.awk:
BEGIN {
    # set the field separator
    FS="|"
}
# for the first file in the arguments list
FNR==NR {
    # add the row number and field one as a key to a pseudo-multidimensional
    # array with a value of field two
    a[NR,$1]=$2
    # skip processing the rest of the code
    next
}
# for the second file in the arguments list
{
    # for every element in the array
    for (i in a) {
        # split the element into another array
        # b[1] is the row number and b[2] is the first field
        split(i,b,SUBSEP)
        # if the line number is greater than the first field
        # and smaller than the second field
        if (FNR>=b[2] && FNR<=a[i]) {
            # print the line to a file with the suffix of the first file's
            # line number (the first pseudo-dimension)
            print > ("File" b[1])
        }
    }
}
Alternatively, here's the one-liner:
awk -F "|" 'FNR==NR { a[NR,$1]=$2; next } { for (i in a) { split(i,b,SUBSEP); if (FNR>=b[2] && FNR<=a[i]) print > "File" b[1] } }' numbers.txt file.txt
I would use sed to process the sample data file because it is simple and swift. This requires a mechanism for converting the line numbers file into the appropriate sed script. There are many ways to do this.
One way uses sed to convert the set of line numbers into a sed script. If everything was going to standard output, this would be trivial. With the output needing to go to different files, we need a line number for each line in the line numbers file. One way to give line numbers is the nl command. Another possibility would be to use pr -n -l1. The same sed command line works with both:
nl linenumbers.txt |
sed 's/ *\([0-9]*\)[^0-9]*\([0-9]*\)|\([0-9]*\)/\2,\3w file\1.txt/'
For the given data file, that generates:
345,789w file1.txt
999,1056w file2.txt
1522,1366w file3.txt
3523,3562w file4.txt
Another option would be to have awk generate the sed script:
awk -F'|' '{ printf "%d,%dw file%d.txt\n", $1, $2, NR }' linenumbers.txt
If your version of sed will allow you to read its script from standard input with -f - (GNU sed does; BSD sed does not), then you can convert the line numbers file into a sed script on the fly, and use that to parse the sample data:
awk -F'|' '{ printf "%d,%dw file%d.txt\n", $1, $2, NR }' linenumbers.txt |
sed -n -f - sample.data
If your system supports /dev/stdin, you can use one of:
awk -F'|' '{ printf "%d,%dw file%d.txt\n", $1, $2, NR }' linenumbers.txt |
sed -n -f /dev/stdin sample.data
awk -F'|' '{ printf "%d,%dw file%d.txt\n", $1, $2, NR }' linenumbers.txt |
sed -n -f /dev/fd/0 sample.data
Failing that, use an explicit script file:
awk -F'|' '{ printf "%d,%dw file%d.txt\n", $1, $2, NR }' linenumbers.txt > sed.script
sed -n -f sed.script sample.data
rm -f sed.script
Strictly, you should deal with ensuring the temporary file name is unique (mktemp) and removed even if the script is interrupted (trap):
tmp=$(mktemp sed.script.XXXXXX)
trap "rm -f $tmp; exit 1" 0 1 2 3 13 15
awk -F'|' '{ printf "%d,%dw file%d.txt\n", $1, $2, NR }' linenumbers.txt > $tmp
sed -n -f $tmp sample.data
rm -f $tmp
trap 0
The final trap 0 allows your script to exit successfully; omit it, and your script will always exit with status 1.
I've ignored Perl and Python; either could be used for this in a single command. The file management is just fiddly enough that using sed seems simpler. You could also use just awk, either with a first awk script writing an awk script to do the heavy duty work (trivial extension of the above), or having a single awk process read both files and produce the required output (harder, but far from impossible).
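As a rough sketch of the single-awk-process variant mentioned above (it essentially mirrors the earlier awk answers, checking every range for every input line):
awk -F'|' '
NR == FNR { start[NR] = $1; end[NR] = $2; n = NR; next }   # load the line-number ranges
{
    for (i = 1; i <= n; i++)
        if (FNR >= start[i] && FNR <= end[i])
            print > ("file" i ".txt")                      # one output file per range
}' linenumbers.txt sample.data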
If nothing else, this shows that there are many possible ways of doing the job. If this is a one-off exercise, it really doesn't matter very much which you choose. If you will be doing this repeatedly, then choose the mechanism that you like. If you're worried about performance, measure. It is likely that converting the line numbers into a command script is a negligible cost; processing the sample data with the command script is where the time is taken. I would expect sed to excel at that point; I've not measured to confirm that it does.
You could do the following:
# myscript.sh
linenumbers="linenumbers.txt"
somefile="afile"
while IFS=\| read start end ; do
    echo "sed -n '$start,${end}p;${end}q;' $somefile > $somefile-$start-$end"
done < $linenumbers
Run it like so: sh myscript.sh
sed -n '345,789p;789q;' afile > afile-345-789
sed -n '999,1056p;1056q;' afile > afile-999-1056
sed -n '1522,1366p;1366q;' afile > afile-1522-1366
sed -n '3523,3562p;3562q;' afile > afile-3523-3562
Then, when you're happy with the generated commands, run sh myscript.sh | sh
EDIT Added William's excellent points on style and correctness.
EDIT Explanation
The basic idea is to get a script to generate a series of shell commands that can be checked for correctness first before being executed by "| sh".
sed -n '345,789p;789q;' means: use sed and don't echo each line (-n); there are two commands, the first saying p(rint) the lines from 345 to 789, and the second saying q(uit) at line 789 - by quitting on the last line you save sed from reading the rest of the input file.
The while loop reads from the $linenumbers file using read. When read is given more than one variable name, it populates each with a field from the input; fields are usually separated by whitespace, and if there are too few variable names, read puts the remaining data into the last variable.
You can run the following at your shell prompt to see that behaviour:
ls -l | while read first rest ; do
echo $first XXXX $rest
done
Try adding another variable, second, to the above to see what happens; it should be obvious.
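For instance, a quick sketch of that experiment:
ls -l | while read first second rest ; do
    echo "$first XXXX $second XXXX $rest"
done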
The problem is that your data is delimited by |s, and that's where William's suggestion of IFS=\| comes in: when reading the input with IFS changed, the fields are now separated by |s and we get the desired result.
Others should feel free to edit, correct and expand.
To extract the first field from 345|789 you can e.g. use awk:
awk -F'|' '{print $1}'
Combine that with the answers received from your other question and you will have a solution.
This might work for you (GNU sed):
sed -r 's/(.*)\|(.*)/\1,\2w file-\1-\2.txt/' linenumbers.txt | sed -nf - file

Count how many times each word from a word list appears in a file?

I have a file, list.txt, which contains a list of words. I want to check how many times each word appears in another file, file1.txt, then output the results. A simple output of all of the numbers is sufficient, as I can manually add them to list.txt with a spreadsheet program, but if the script adds the numbers at the end of each line in list.txt, that is even better, e.g.:
bear 3
fish 15
I have tried this, but it does not work:
cat list.txt | grep -c file1.txt
You can do this in a loop that reads a single word at a time from a word-list file, and then counts the instances in a data file. For example:
while read; do
echo -n "$REPLY "
fgrep -ow "$REPLY" data.txt | wc -l
done < <(sort -u word_list.txt)
The "secret sauce" consists of:
using the implicit REPLY variable;
using process substitution to collect words from the word-list file; and
ensuring that you are grepping for whole words in the data file.
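If you would rather have the counts written next to the words (as the question suggests), a hedged variant of the same loop could redirect its output to a new file; list_with_counts.txt is just an illustrative name:
while read -r word; do
    printf '%s %d\n' "$word" "$(fgrep -ow "$word" data.txt | wc -l)"
done < <(sort -u word_list.txt) > list_with_counts.txt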
This awk method only has to pass through each file once:
awk '
# read the words in list.txt
NR == FNR {count[$1]=0; next}
# process file1.txt
{
for (i=1; i<=NF; i++)
if ($i in count)
count[$i]++
}
# output the results
END {
for (word in count)
print word, count[word]
}
' list.txt file1.txt
This might work for you (GNU sed):
tr -s ' ' '\n' < file1.txt |
sort |
uniq -c |
sed -e '1i\s|.*|& 0|' -e 's/\s*\(\S*\)\s\(\S*\)\s*/s|\\<\2\\>.*|\2 \1|/' |
sed -f - list.txt
Explanation:
Split file1.txt into words
Sort the words
Count the words
Create a sed script to match the words (initially zero out each word)
Run the above script against the list.txt
Single-line command:
cat file1.txt |tr " " "\n"|sort|uniq -c |sort -n -r -k 1 |grep -w -f list.txt
The last part of the command tells grep to read the words to match from a file (-f option) and to match whole words (-w), i.e. if list.txt contains car, grep should ignore carriage.
However, keep in mind that your view of a whole word and grep's view might differ. For example, although car will not match carriage, it will match car-wash; notice that "-" is considered a word boundary. grep takes anything except letters, numbers and underscores as a word boundary, which should not be a problem as this conforms to the accepted definition of a word in English.

Can I chain multiple commands and make all of them take the same input from stdin?

In bash, is there a way to chain multiple commands, all taking the same input from stdin? That is, one command reads stdin, does some processing, writes the output to a file. The next command in the chain gets the same input as what the first command got. And so on.
For example, consider a large text file to be split into multiple files by filtering the content. Something like this:
cat food_expenses.txt | grep "coffee" > coffee.txt | grep "tea" > tea.txt | grep "honey cake" > cake.txt
This obviously does not work, because the second grep gets the first grep's output, not the original text file. I tried inserting tee's but that does not help. Is there some bash magic that can cause the first grep to send its input to the pipe, not the output?
And by the way, splitting a file was a simple example. Consider splitting (filering by pattern search) a continuous live text stream coming over a network and writing the output to different named pipes or sockets. I would like to know if there is an easy way to do it using a shell script.
(This question is a cleaned up version of my earlier one , based on responses that pointed out the unclearness)
For this example, you should use awk as semiuseless suggests.
But in general to have N arbitrary programs read a copy of a single input stream, you can use tee and bash's process output substitution operator:
tee <food_expenses.txt \
>(grep "coffee" >coffee.txt) \
>(grep "tea" >tea.txt) \
>(grep "honey cake" >cake.txt)
Note that >(command) is a bash extension.
The obvious question is: why do you want to do this within one command?
If you don't want to write a script, and you want to run stuff in parallel, bash supports the concepts of subshells, and these can run in parallel. By putting your command in brackets, you can run your greps (or whatever) concurrently e.g.
$ (grep coffee food_expenses.txt > coffee.txt) && (grep tea food_expenses.txt > tea.txt)
Note that in the above your cat may be redundant since grep takes an input file argument.
You can (instead) play around with redirecting output through different streams. You're not limited to stdout/stderr but can assign new streams as required. I can't advise more on this other than to direct you to examples here.
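As a hedged sketch of that idea, extra file descriptors 3 and 4 can carry the coffee/tea copies while stdout stays free for other use:
exec 3> coffee.txt 4> tea.txt
while IFS= read -r line; do
    [[ $line == *coffee* ]] && printf '%s\n' "$line" >&3
    [[ $line == *tea* ]] && printf '%s\n' "$line" >&4
done < food_expenses.txt
exec 3>&- 4>&-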
I like Stephen's idea of using awk instead of grep.
It ain't pretty, but here's a command that uses output redirection to keep all data flowing through stdout:
cat food.txt |
    awk '/coffee/ {print $0 > "/dev/stderr"} {print $0}' 2> coffee.txt |
    awk '/tea/ {print $0 > "/dev/stderr"} {print $0}' 2> tea.txt
As you can see, it uses awk to send all lines matching 'coffee' to stderr, and all lines regardless of content to stdout. Then stderr is fed to a file, and the process repeats with 'tea'.
If you wanted to filter out content at each step, you might use this:
cat food.txt |
    awk '/coffee/ {print $0 > "/dev/stderr"} $0 !~ /coffee/ {print $0}' 2> coffee.txt |
    awk '/tea/ {print $0 > "/dev/stderr"} $0 !~ /tea/ {print $0}' 2> tea.txt
You could use awk to split into up to two files:
awk '/Coffee/ { print "Coffee" } /Tea/ { print "Tea" > "/dev/stderr" }' inputfile > coffee.file.txt 2> tea.file.txt
I am unclear why the filtering needs to be done in different steps. A single awk program can scan all the incoming lines and dispatch the appropriate lines to individual files. This is a very simple dispatch that can feed multiple secondary commands (i.e. persistent processes that monitor the output files for new input, or the files could be sockets that are set up ahead of time and written to by the awk process).
If there is a reason to have every filter see every line, then just remove the "next;" statements, and every filter will see every line.
$ cat split.awk
BEGIN{}
/^coffee/ {
print $0 >> "/tmp/coffee.txt" ;
next;
}
/^tea/ {
print $0 >> "/tmp/tea.txt" ;
next;
}
{ # default
print $0 >> "/tmp/other.txt" ;
}
END {}
$
Here are two bash scripts without awk. The second one doesn't even use grep!
With grep:
#!/bin/bash
tail -F food_expenses.txt | \
while read line
do
for word in "coffee" "tea" "honey cake"
do
if [[ $line != ${line#*$word*} ]]
then
echo "$line"|grep "$word" >> ${word#* }.txt # use the last word in $word for the filename (i.e. cake.txt for "honey cake")
fi
done
done
Without grep:
#!/bin/bash
tail -F food_expenses.txt | \
while read line
do
for word in "coffee" "tea" "honey cake"
do
if [[ $line != ${line#*$word*} ]] # does the line contain the word?
then
echo "$line" >> ${word#* }.txt # use the last word in $word for the filename (i.e. cake.txt for "honey cake")
fi
done
done;
Edit:
Here's an AWK method:
awk 'BEGIN {
    list = "coffee tea";
    split(list, patterns)
}
{
    for (pattern in patterns) {
        if ($0 ~ patterns[pattern]) {
            print > (patterns[pattern] ".txt")
        }
    }
}' food_expenses.txt
Working with patterns which include spaces remains to be resolved.
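One hedged way around the space problem is to separate the patterns with a character that cannot appear in them (here |) and split on that instead:
awk 'BEGIN {
    list = "coffee|tea|honey cake"
    n = split(list, patterns, "|")
}
{
    for (i = 1; i <= n; i++)
        if ($0 ~ patterns[i])
            print > (patterns[i] ".txt")
}' food_expenses.txt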
You can probably write a simple AWK script to do this in one shot. Can you describe the format of your file a little more?
Is it space/comma separated?
Do you have the item descriptions in a specific 'column', where columns are defined by some separator like space, comma or something else?
If you can afford multiple grep runs this will work:
grep coffee food_expenses.txt > coffee.txt
grep tea food_expenses.txt > tea.txt
and so on.
Assuming that your input is not infinite (as in the case of a network stream that you never plan on closing), I might consider using a subshell to put the data into a temp file, and then a series of other subshells to read it. I haven't tested this, but maybe it would look something like this:
{ cat inputstream > tempfile; }
{ grep tea tempfile > tea.txt; }
{ grep coffee tempfile > coffee.txt; }
I'm not certain of an elegant solution to the file getting too large if your input stream is not bounded in size, however.

Resources