I know egrep has a very useful way of ANDing two expressions together by using:
egrep "pattern1.*pattern2"|egrep "pattern2.*pattern1" filename.txt|wc -l
However, is there an easy way to use egrep's AND operator when searching for three expressions? The number of permutations grows rapidly as you add extra expressions.
I know of another way of going about it using sort|uniq -d, but I am looking for a simpler solution.
EDIT:
My current way of searching yields five total results:
#!/bin/bash
pid=$$
# for each term, keep only the two leading numeric fields (record ids) of matching lines
grep -i "angio" rtrans.txt|sort|uniq|egrep -o "^[0-9]+ [0-9]+ " > /tmp/$pid.1.tmp
grep -i "cardio" rtrans.txt|sort|uniq|egrep -o "^[0-9]+ [0-9]+ " > /tmp/$pid.2.tmp
grep -i "pulmonary" rtrans.txt|sort|uniq|egrep -o "^[0-9]+ [0-9]+ " > /tmp/$pid.3.tmp
# ids present in both the angio and cardio lists ...
cat /tmp/$pid.1.tmp /tmp/$pid.2.tmp|sort|uniq -d > /tmp/$pid.4.tmp
# ... and also in the pulmonary list
cat /tmp/$pid.4.tmp /tmp/$pid.3.tmp|sort|uniq -d > /tmp/$pid.5.tmp
# fetch the matching documents
egrep -o "^[0-9]+ [0-9]+ " /tmp/$pid.5.tmp|getDoc.mps > /tmp/$pid.6.tmp
head -10 /tmp/$pid.6.tmp
mumps#debianMumpsISR:~/Medline2012$ AngioAndCardioAndPulmonary.script
1514 Structural composition of central pulmonary arteries. Growth potential after surgical shunts.
1517 Patterns of pulmonary arterial anatomy and blood supply in complex congenital heart disease
with pulmonary atresia
3034 Controlled reperfusion following regional ischemia.
3481 Anaesthetic management for oophorectomy in pulmonary lymphangiomyomatosis.
3547 A comparison of methods for limiting myocardial infarct expansion during acute reperfusion--
primary role of unload
While:
mumps#debianMumpsISR:~/Medline2012$ grep "angio" rtrans.txt|grep "cardio" rtrans.txt|grep "pulmonary" rtrans.txt|wc -l
185
yields 185 lines of text because each grep in that pipeline is given rtrans.txt as an argument, so the output of the earlier greps is ignored and only the final search for "pulmonary" determines the count, not all three searches.
How about:
grep "pattern1" file|grep "pattern2"|grep "pattern3"
This will give the lines that contain p1, p2 and p3, in any order.
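If the list of terms keeps growing, one way to avoid typing an ever-longer pipeline is a small helper that builds the chain of greps in a loop. This is only a sketch of mine, not part of the answer above; the grep_and name is made up:
#!/bin/bash
# grep_and FILE PATTERN...  -- print the lines of FILE that contain every pattern, in any order
grep_and() {
    local file=$1; shift
    local result
    result=$(cat -- "$file") || return 1
    for pat in "$@"; do
        result=$(grep -e "$pat" <<< "$result") || return 1
    done
    printf '%s\n' "$result"
}
# e.g.: grep_and rtrans.txt angio cardio pulmonary | wc -l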
The approach of Kent with
grep "pattern1" file|grep "pattern2"|grep "pattern3"
is correct and it should be faster. Just for the record, I wanted to post an alternative which uses egrep to do the same without piping:
egrep "pattern1.*pattern2|pattern2.*pattern1"
which looks for p1 followed by p2 or p2 followed by p1.
The original question is about why his egrep command didn't work.
egrep "pattern1.*pattern2"|egrep "pattern2.*pattern1" filename.txt|wc -l
Kent and Stanislav are correct in pointing out the mistake: filename.txt belongs with the first egrep, not the second. But this doesn't address the original problem.
Bob's "current way" (4 years ago) was a multi-command approach to grep out different keywords on different lines. In other words, his script was looking for a set of lines containing any of his search terms. The other proposed solutions would only result in lines containing all of his search terms, which does not appear to be his intent.
Instead, he could use a single egrep call to look for any of the terms, like this:
egrep -e 'pattern1|pattern2' filename.txt
Related
I am having problems trying to count the number of times specific patterns appear in a file (let's call it B). I have a file with 30 patterns (let's call it A), and I want to know how many lines of B contain each pattern.
With only one pattern it is quite simple:
grep "pattern" file | wc -l
But with a whole file of patterns I am not able to figure out how to do it. I already tried this:
grep -f "fileA" "fileB" | wc -l
However, that gives me the total number of matching lines across all patterns, not a count for each one of them (which is what I want).
Thank you so much.
Count matches per literal string
If you simply want to know how often each pattern appears and each of your patterns is a fixed string (not a regex), use:
grep -oFf needles.txt haystack.txt | sort | uniq -c
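A quick sanity check with throwaway files (the names and contents are just an example of mine):
printf 'cat\ndog\n' > needles.txt
printf 'cat and dog\ndog dog\nbird\n' > haystack.txt
grep -oFf needles.txt haystack.txt | sort | uniq -c
#      1 cat
#      3 dog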
Count matching lines per literal string
Note that the above is slightly different from your formulation "I want to know how many lines contain that pattern", as one line can contain multiple matches. If you really have to count matching lines per pattern instead of matches per pattern, things get a little trickier:
grep -noFf needles.txt haystack.txt | sort -u | cut -d: -f2- | sort | uniq -c
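On the same throwaway files as above, this variant counts the two matches on the "dog dog" line only once:
grep -noFf needles.txt haystack.txt | sort -u | cut -d: -f2- | sort | uniq -c
#      1 cat
#      2 dog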
Count matching lines per regex
If the patterns are regexes, you probably have to iterate over the patterns, as grep's output only tells you that (at least) one pattern matched, but not which one.
# this will be very slow if you have many patterns
while IFS= read -r pattern; do
    printf '%8d %s\n' "$(grep -ce "$pattern" haystack.txt)" "$pattern"
done < needles.txt
... or use a different tool/language like awk or perl.
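For what it's worth, here is one possible single-pass awk sketch (my own, assuming one extended regex per line in needles.txt; note that awk's ~ operator uses EREs, which may differ slightly from grep's default BREs):
awk 'NR==FNR { pat[++n] = $0; next }
     { for (i = 1; i <= n; i++) if ($0 ~ pat[i]) count[i]++ }
     END { for (i = 1; i <= n; i++) printf "%8d %s\n", count[i], pat[i] }' needles.txt haystack.txt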
Note on overlapping matches
You did not formulate any precise requirements, so I went with the simplest solutions for each case. The first two solutions and the last solution behave differently in case multiple patterns match (part of) the same substring.
grep -f needles.txt matches each substring at most once, so some matches might be "missed" (what counts as "missed" depends on your requirements), whereas iterating grep -e pattern1; grep -e pattern2; ... might match the same substring multiple times.
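If I recall GNU grep's behaviour correctly, -o reports non-overlapping matches from left to right, so a tiny made-up example (patterns ab and bc against the line abc) shows the difference:
grep -oFf <(printf 'ab\nbc\n') <<< abc   # prints only "ab"; the overlapping "bc" is skipped
grep -oF ab <<< abc                      # prints "ab"
grep -oF bc <<< abc                      # prints "bc", so iterating per pattern finds both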
So, I wrote a bad shell script (according to several questions, one of which I asked) and now I am wondering which way to go to perform the same, or similar, task(s).
I honestly have no clue about which tool may be best for what I need to achieve and I hope that, by understanding how to rewrite this piece of code, it will be easier to understand which way to go.
There we go:
# read reference file line by line
while read -r linE;
do
# field 2 will be grepped
pSeq=`echo $linE | cut -f2 -d" "`
# field 1 will be used as filename to store the grepped things
fName=`echo $linE | cut -f1 -d" "`
# grep the thing in a very big file
grep -i -B1 -A2 "^"$pSeq a_very_big_file.txt | sed 's/^--$//g' | awk 'NF' > $dir$fName".txt"
# grep the same thing in another very big file and store it in the same file as above
grep -i -B1 -A2 "^"$pSeq another_very_big_file.txt | sed 's/^--$//g' | awk 'NF' >> $dir$fName".txt"
done < reference_file.csv
At this point I am wondering... how can I achieve the same result without using a while loop to read through reference_file.csv? What is the best way to go about solving problems like this?
EDIT: when I mention the two very_big_files, I am talking about files > 5 GB.
EDIT II: these should be the format of the files:
reference_file.csv:
object pattern
oj1 ptt1
oj2 ptt2
... ...
ojN pttN
a_very_big_file and another_very_big_file:
>head1
ptt1asequenceofcharacters
+
asequenceofcharacters
>head2
ptt1anothersequenceofcharacters
+
anothersequenceofcharacters
>headN
pttNathirdsequenceofcharacters
+
athirdsequenceofcharacters
Basically, I search for each pattern in the two files, and for every match I need the line above it and the two lines below it. Of course, not all the lines in the two files match the patterns in reference_file.csv.
Global Maxima
Efficient bash scripts are typically very creative and not something you can achieve by incrementally improving a naive solution.
The most important part of finding efficient solutions is to know your data. Every restriction you can make allows optimizations. Some examples that can make a huge difference:
- The input is sorted or data in different files has the same order.
- The elements in a list are unique.
- One of the files to be processed is way bigger than the others.
- The symbol X never appears in the input or only appears at special places.
- The order of the output does not matter.
When I try to find an efficient solution, my first goal is to make it work without an explicit loop. For this, I need to know the available tools. Then comes the creative part of combining these tools. To me, this is like assembling a jigsaw puzzle without knowing the final picture. A typical mistake here is similar to the XY problem: After you assembled some pieces, you might be fooled into thinking you'd know the final picture and search for a piece Y that does not exist in your toolbox. Frustrated, you implement Y yourself (typically by using a loop) and ruin the solution.
If there is no right piece for your current approach, either use a different approach or give up on bash and use a better scripting/programming language.
Local Maxima
Even though you might not be able to get the best solution by improving a bad solution, you still can improve it. For this you don't need to be very creative if you know some basic anti-patterns and their better alternatives. Here are some typical examples from your script:
Some of these might seem very small, but starting a new process is way more expensive than one might suppose. Inside a loop, the cost of starting a process is multiplied by the number of iterations.
Extract multiple fields from a line
Instead of calling cut once for each individual field, like this:
while read -r line; do
    field1=$(echo "$line" | cut -f1 -d" ")
    field2=$(echo "$line" | cut -f2 -d" ")
    ...
done < file
let read split the line into its fields for you:
while read -r field1 field2 otherFields; do
...
done < file
Combinations of grep, sed, awk
Everything grep (in its basic form) can do, sed can do better. And everything sed can do, awk can do better. If you have a pipe of these tools you can combine them into a single call.
Some examples of (in your case) equivalent commands, one per line:
sed 's/^--$//g' | awk 'NF'
sed '/^--$/d'
grep -vFxe--
grep -i -B1 -A2 "^$pSeq" | sed 's/^--$//g' | awk 'NF'
awk "/^$pSeq/"' {print last; c=3} c>0; {last=$0; c--}'
Multiple grep on the same file
You want to read files at most once, especially if they are big. With grep -f you can search multiple patterns in a single run over one file. If you just wanted to get all matches, you would replace your entire loop with
grep -i -B1 -A2 -f <(cut -f2 -d' ' reference_file | sed 's/^/^/') \
a_very_big_file another_very_big_file
But since you have to store different matches in different files ... (see next point)
Know when to give up and switch to another language
Dynamic output files
Your loop generates multiple output files. The typical command line utils like cut, grep and so on only generate one output stream. I know only one standard tool that generates a variable number of output files: split. But split partitions its input by position (line or byte counts), not by content. Therefore, a non-loop solution for your problem seems unlikely. However, you can optimize the loop by rewriting it in a different language, e.g. awk.
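As a tiny illustration of why awk can take over here (the data and file names are mine, not from the question): awk can pick its output file per record, which cut, grep and split cannot.
printf 'a 1\nb 2\na 3\n' | awk '{ print $2 > ($1 ".txt") }'
# creates a.txt containing 1 and 3, and b.txt containing 2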
Loops in awk are faster ...
time awk 'BEGIN{for(i=0;i<1000000;++i) print i}' >/dev/null # takes 0.2s
time for ((i=0;i<1000000;++i)); do echo $i; done >/dev/null # takes 3.3s
seq 1000000 > 1M
time awk '{print}' 1M >/dev/null # takes 0.1s
time while read -r l; do echo "$l"; done <1M >/dev/null # takes 5.4s
... but the main speedup will come from something different. awk has everything you need built into it, so you don't have to start new processes. Also ... (see next point)
Iterate the biggest file
Reduce the number of times you have to read the biggest files. So instead of iterating reference_file and reading both big files over and over, iterate over the big files once while holding reference_file in memory.
Final script
To replace your script, you can try the following awk script. This assumes that ...
- the filenames (first column) in reference_file are unique
- the two big files do not contain > except for the header
- the patterns (second column) in reference_file are not prefixes of each other.
If this is not the case, simply remove the break.
awk -v dir="$dir" '
# first file (reference_file): remember the output file name and pattern of each entry
FNR==NR {max++; file[max]=$1; pat[max]=$2; next}
# the big files are read with RS=">" and FS="\n":
# each ">" block is one record, and $2 is the line right after the header
{
    for (i=1;i<=max;i++)
        if ($2~"^"pat[i]) {
            printf ">%s", $0 > dir"/"file[i]
            break
        }
}' reference_file RS=\> FS=\\n a_very_big_file another_very_big_file
I am a Bash & Terminal NEWBIE. I have been given the task of counting the number of entries of a specific area code using a single-line Bash terminal command. Can you please point me in the right direction to achieve this goal? I've been using a bash scripting cheat sheet, but I'm not familiar enough with bash commands to create a script that iterates and counts the number of times [213] appears in the file:
If you are looking for the string 123 anywhere in the file, then:
grep -c 123 file # counts 123 4123 41235 etc
If you are looking for the "word" 123, then:
grep -wc 123 file # counts 123 /123/ #123# etc., but not 1234 4123 ...
If you want multiple occurrences of the word on the same line to be counted separately, then use the -o option:
grep -ow 123 file | wc -l
See also:
Confused about word boundary on Unix & Linux Stack Exchange
grep -o '213' filename | wc -l
In the future, you should try searching for general forms of your command. You would have found a number of similar questions.
See man grep. grep has a count option.
So you want to run grep -c 213 file.
The following awk may help you here too (it will look for the string 213 anywhere in the line(s) of Input_file):
awk '/213/{count++} END{print count}' Input_file
In case you want to count only those lines which contain exactly 213 and nothing else, use the following:
awk '/^213$/{count++} END{print count}' Input_file
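One optional tweak (my addition, not from the answer above): if nothing matches, count is never set and print outputs an empty line; adding +0 forces a numeric 0.
awk '/213/{count++} END{print count+0}' Input_file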
I am writing a Unix command to get the lines matching abcd at positions 87-90, and for the lines matching this criterion it should give me positions 10-15, 124-128, and 250-265. I tried something like this:
grep -h abcd sample.txt |cut -c 10-15,cut -c 124-128,cut -c 250-260
Though this is syntactically wrong, I hope it conveys what I am trying to achieve. Could you help me concatenate all the results from the multiple cuts?
cut -c accepts a list of characters. As described in the man page, "each list is made up of one range, or many ranges separated by commas."
grep -h abcd sample.txt | cut -c 10-15,124-128,250-260
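If the abcd really has to sit exactly at columns 87-90, as the question states, grep alone won't enforce that; a single awk call could do both the positional test and the extraction. This is just a sketch of mine (the column ranges are copied from the attempted command, including 250-260):
awk 'substr($0, 87, 4) == "abcd" {
    print substr($0, 10, 6), substr($0, 124, 5), substr($0, 250, 11)
}' sample.txt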
This is my first time posting on here so bear with me please.
I received a bash assignment but my professor is completely unhelpful and so are his notes.
Our assignment is to filter and print out palindromes from a file. In this case, the file is:
/usr/share/dict/words
The word lengths range from 3 to 45, and I am supposed to keep only words made of lowercase letters (the dictionary given has special characters and uppercase as well as lowercase letters, e.g. "-dkas-das"), so something like "q-evvavve-q" might read as a palindrome, but I shouldn't be getting that as a proper result.
Anyway, I can get it to filter words of a given length and return them (though it does not restrict to lowercase only):
grep "^...$" /usr/share/dict/words |
grep "\(.\).\1"
And I can use similar lines for 5-letter words, 7-letter words, and so on:
grep "^.....$" /usr/share/dict/words |
grep "\(.\)\(.\).\2\1"
But the prof does not want that. We are supposed to use a loop. I get the concept but I don't know the syntax, and like I said, the notes are very unhelpful.
What I tried was setting variables x=... and y=.. and in a while loop, having x=$x$y but that didn't work (syntax error) and neither did x+=..
Any help is appreciated. Even getting my non-lowercase letters filtered out.
Thanks!
EDIT:
If you're providing a solution or a hint to a solution, the simplest method is preferred.
Preferably one that uses 2 grep statements and a loop.
Thanks again.
Like this:
for word in `grep -E '^[a-z]{3,45}$' /usr/share/dict/words`;
do [ $word == `echo $word | rev` ] && echo $word;
done;
Output using my dictionary:
aha
bib
bob
boob
...
wow
Update
As pointed out in the comments, reading in most of the dictionary into a variable in the for loop might not be the most efficient, and risks triggering errors in some shells. Here's an updated version:
grep -E '^[a-z]{3,45}$' /usr/share/dict/words | while read -r word;
do [ $word == `echo $word | rev` ] && echo $word;
done;
Why use grep? Bash will happily do that for you:
#!/bin/bash
is_pal() {
    local w=$1
    while (( ${#w} > 1 )); do
        [[ ${w:0:1} = ${w: -1} ]] || return 1
        w=${w:1:-1}
    done
}
while read word; do
    is_pal "$word" && echo "$word"
done
Save this as banana, chmod +x banana and enjoy:
./banana < /usr/share/dict/words
If you only want to keep the words with at least three characters:
grep ... /usr/share/dict/words | ./banana
If you only want to keep the words that only contain lowercase and have at least three letters:
grep '^[[:lower:]]\{3,\}$' /usr/share/dict/words | ./banana
The multiple greps are wasteful. You can simply do
grep -E '^([a-z])[a-z]\1$' /usr/share/dict/words
in one fell swoop, and similarly, put the expressions on grep's standard input like this:
echo '^([a-z])[a-z]\1$
^([a-z])([a-z])\2\1$
^([a-z])([a-z])[a-z]\2\1$' | grep -E -f - /usr/share/dict/words
However, regular grep does not permit backreferences beyond \9. With grep -P you can use double-digit backreferences, too.
The following script constructs the entire expression in a loop. Unfortunately, grep -P does not allow for the -f option, so we build a big thumpin' variable to hold the pattern. Then we can actually also simplify to a single pattern of the form ^(.)(?:.|(.)(?:.|(.)....\3)?\2)?\1$, except we use [a-z] instead of . to restrict to just lowercase.
head=''
tail=''
for i in $(seq 1 22); do
    head="$head([a-z])(?:[a-z]|"
    tail="\\$i${tail:+)?}$tail"
done
grep -P "^${head%|})?$tail$" /usr/share/dict/words
The single grep should be a lot faster than individually invoking grep 22 or 43 times on the large input file. If you want to sort by length, just add that as a filter at the end of the pipeline; it should still be way faster than multiple passes over the entire dictionary.
The expression ${tail:+)?} evaluates to a closing parenthesis and question mark only when tail is non-empty, which is a convenient way to force the \1 back-reference to be non-optional. Somewhat similarly, ${head%|} trims the final alternation operator from the ultimate value of $head.
Ok, here is something to get you started:
I suggest using the plan you have above; just generate the number of "." characters using a for loop.
This question will explain how to make a for loop from 3 to 45:
How do I iterate over a range of numbers defined by variables in Bash?
for i in {3..45};
do
    # put your code from above here
done
Now you just need to figure out how to build a string of "i" dots for your first grep and you are done.
Also, look into sed; it can nuke the non-lowercase entries for you.
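To make the plan more concrete, here is a sketch of mine (not part of this answer) that builds, for each length, the back-reference pattern the asker was writing by hand, and keeps a separate grep for the lowercase-only filter. Plain grep only supports back-references \1 through \9, so this only reaches length 19; beyond that you would need something like the grep -P approach shown in another answer above.
#!/bin/bash
for len in $(seq 3 19); do
    half=$(( len / 2 ))
    head=''
    tail=''
    for (( g = 1; g <= half; g++ )); do
        head="${head}\\(.\\)"        # one capture group per character of the first half
        tail="\\${g}${tail}"         # matching back-references in reverse order
    done
    (( len % 2 )) && head="${head}."   # odd lengths keep a free middle character
    grep '^[a-z]*$' /usr/share/dict/words | grep "^${head}${tail}\$"
done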
Another solution that uses a Perl-compatible regular expression (PCRE) with recursion, heavily inspired by this answer:
grep -P '^(?:([a-z])(?=[a-z]*(\1(?(2)\2))$))++[a-z]?\2?$' /usr/share/dict/words