Multi-line grep with positive and negative filtering - bash

I need to grep for a multi-line string that doesn't include one string, but does include others. This is what I'm searching for in some HTML files:
<not-this>
<this> . . . </this>
</not-this>
In other words, I want to find files that contain <this> and </this> on the same line, but only where that line is not surrounded by <not-this> tags on the lines before and/or after. Here is some shorthand logic for what I want to do:
grep 'this' && '/this' && !('not-this')
I've seen answers with the following...
grep -Er -C 2 '.*this.*this.*' . | grep -Ev 'not-this'
...but this just erases the line(s) containing the "not" portion, and displays the other lines. What I'd like is for it to not pull those results at all if "not-this" is found within a line or two of "this".
Is there a way to accomplish this?
P.S. I'm using Ubuntu and gnome-terminal.

It sounds like an awk script might work better here:
$ cat input.txt
<not-this>
<this>BAD! DO NOT PRINT!</this>
</not-this>
<yes-this>
<this>YES! PRINT ME!</this>
</yes-this>
$ cat not-this.awk
BEGIN {
notThis=0
}
/<not-this>/ {notThis=1}
/<\/not-this>/ {notThis=0}
/<this>.*<\/this>/ {if (notThis==0) print}
$ awk -f not-this.awk input.txt
<this>YES! PRINT ME!</this>
Or, if you'd prefer, you can squeeze this awk script onto one long line:
$ awk 'BEGIN {notThis=0} /<not-this>/ {notThis=1} /<\/not-this>/ {notThis=0} /<this>.*<\/this>/ {if (notThis==0) print}' input.txt
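Since the question asks for the files rather than the lines, the same flag idea can be extended to several files. The following is only a sketch: the function name list_clean_matches is made up for illustration, and it assumes the <not-this> tags sit on their own lines as in the sample. The flag is reset at the start of each file so that an unclosed <not-this> in one file does not leak into the next.

```shell
# Hypothetical helper: print the name of every file that has a
# <this>...</this> line outside of a <not-this> block.
list_clean_matches() {
  awk '
    FNR == 1        { notThis = 0 }   # new file: reset the state
    /<not-this>/    { notThis = 1 }   # entering a block to ignore
    /<\/not-this>/  { notThis = 0 }   # leaving the block
    /<this>.*<\/this>/ && !notThis && !(FILENAME in seen) {
      seen[FILENAME]                  # report each file only once
      print FILENAME
    }
  ' "$@"
}
```

It could be called as list_clean_matches *.html, or fed from find for a recursive search.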

Related

How to print both the grep pattern and the resulting matched line on the same line?

I have two files File01 and File02.
File01, looks like this:
BU24DRAFT_430534
BU24DRAFT_488391
BU24DRAFT_488386
BU24DRAFT_417707
BU24DRAFT_417704
BU24DRAFT_488335
BU24DRAFT_429509
BU24DRAFT_210092
BU24DRAFT_229465
BU24DRAFT_498094
BU24DRAFT_416051
BU24DRAFT_482795
BU24DRAFT_4305
BU24DRAFT_10621
BU24DRAFT_4883
File02, looks like this:
XP_033390445.1_uncharacterized_protein_BU24DRAFT_430534_Aaosphaeria_arxii_CBS_175.79
XP_033390442.1_uncharacterized_protein_BU24DRAFT_488391_Aaosphaeria_arxii_CBS_175.79
XP_033390437.1_uncharacterized_protein_BU24DRAFT_488386_Aaosphaeria_arxii_CBS_175.79
XP_033390400.1_uncharacterized_protein_BU24DRAFT_417707_Aaosphaeria_arxii_CBS_175.79
XP_033390397.1_uncharacterized_protein_BU24DRAFT_417704_Aaosphaeria_arxii_CBS_175.79
XP_033390371.1_uncharacterized_protein_BU24DRAFT_488335_Aaosphaeria_arxii_CBS_175.79
XP_033376581.1_uncharacterized_protein_BU24DRAFT_429509_Aaosphaeria_arxii_CBS_175.79
XP_033376580.1_uncharacterized_protein_BU24DRAFT_210092_Aaosphaeria_arxii_CBS_175.79
XP_033376578.1_uncharacterized_protein_BU24DRAFT_229465,_partial_Aaosphaeria_arxii_CBS_175.79
XP_033376577.1_uncharacterized_protein_BU24DRAFT_498094,_partial_Aaosphaeria_arxii_CBS_175.79
XP_033376576.1_uncharacterized_protein_BU24DRAFT_416051,_partial_Aaosphaeria_arxii_CBS_175.79
XP_033376575.1_uncharacterized_protein_BU24DRAFT_482795,_partial_Aaosphaeria_arxii_CBS_175.79
Using the string from File01, via grep, I would like to identify the lines in File02 that match and with this information generate a file that would look like this:
XP_033390442.1_uncharacterized_protein_BU24DRAFT_488391_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_488391
XP_033390437.1_uncharacterized_protein_BU24DRAFT_488386_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_488386
XP_033390400.1_uncharacterized_protein_BU24DRAFT_417707_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_417707
XP_033390397.1_uncharacterized_protein_BU24DRAFT_417704_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_417704
XP_033390371.1_uncharacterized_protein_BU24DRAFT_488335_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_488335
XP_033376581.1_uncharacterized_protein_BU24DRAFT_429509_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_429509
XP_033376580.1_uncharacterized_protein_BU24DRAFT_210092_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_210092
XP_033376578.1_uncharacterized_protein_BU24DRAFT_229465,_partial_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_229465
XP_033376577.1_uncharacterized_protein_BU24DRAFT_498094,_partial_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_498094
XP_033376576.1_uncharacterized_protein_BU24DRAFT_416051,_partial_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_416051
XP_033376575.1_uncharacterized_protein_BU24DRAFT_482795,_partial_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_482795
I tried generating such file using the following code:
while read r;do CMD01=$(echo $r);CMD02=$(grep $r File01); echo "$CMD02 $CMD01";done < File02 | awk '(NR>1) && ($2 > 2 ) '
The problem I run into is that I obtain extra matching lines:
XP_033390445.1_uncharacterized_protein_BU24DRAFT_430534_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_4305
XP_033390371.1_uncharacterized_protein_BU24DRAFT_488335_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_4883
Where, for example, the string: BU24DRAFT_4305 is wrongly recognizing the string: XP_033390445.1_uncharacterized_protein_BU24DRAFT_430534_Aaosphaeria_arxii_CBS_175.79
This result is incorrect. The string in File01 must match a string in File02 that has a complete version of File01's string
Any ideas that could help me will be appreciated.
For the updated sample input and full-matching requirement and assuming you never have any regexp metacharacters in file1 and that the matching strings in file2 are never at the start or end of the line:
$ awk 'NR==FNR{strs[$0]; next} {for (str in strs) if ($0 ~ ("[^[:alnum:]]"str"[^[:alnum:]]")) print $0, str}' file1 file2
XP_033390445.1_uncharacterized_protein_BU24DRAFT_430534_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_430534
XP_033390442.1_uncharacterized_protein_BU24DRAFT_488391_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_488391
XP_033390437.1_uncharacterized_protein_BU24DRAFT_488386_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_488386
XP_033390400.1_uncharacterized_protein_BU24DRAFT_417707_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_417707
XP_033390397.1_uncharacterized_protein_BU24DRAFT_417704_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_417704
XP_033390371.1_uncharacterized_protein_BU24DRAFT_488335_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_488335
XP_033376581.1_uncharacterized_protein_BU24DRAFT_429509_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_429509
XP_033376580.1_uncharacterized_protein_BU24DRAFT_210092_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_210092
XP_033376578.1_uncharacterized_protein_BU24DRAFT_229465,_partial_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_229465
XP_033376577.1_uncharacterized_protein_BU24DRAFT_498094,_partial_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_498094
XP_033376576.1_uncharacterized_protein_BU24DRAFT_416051,_partial_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_416051
XP_033376575.1_uncharacterized_protein_BU24DRAFT_482795,_partial_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_482795
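If the two assumptions above (no regexp metacharacters, match never at the start or end of the line) are a concern, the boundary check can be done with index() and substr() instead of a dynamic regexp. This is a sketch, not a drop-in replacement: the function name match_whole_ids is made up, and it only inspects the first occurrence of each string on a line.

```shell
# Hypothetical helper: $1 = file of IDs (one per line), $2 = file to search.
match_whole_ids() {
  awk '
    NR == FNR { strs[$0]; next }          # first file: remember every ID
    {
      for (str in strs) {
        pos = index($0, str)              # literal search, no regexp involved
        if (!pos) continue
        # Characters right before and right after the match ("" at line edges).
        before = (pos > 1) ? substr($0, pos - 1, 1) : ""
        after  = substr($0, pos + length(str), 1)
        if (before !~ /[[:alnum:]]/ && after !~ /[[:alnum:]]/)
          print $0, str
      }
    }
  ' "$1" "$2"
}
```

With the sample data, BU24DRAFT_4305 is found inside BU24DRAFT_430534 but rejected because the character after it is a digit.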
Original answer doing partial matching:
The correct approach is 1 call to awk:
$ awk 'NR==FNR{strs[$0]; next} {for (str in strs) if (index($0,str)) print $0, str}' file1 file2
XP_033376575.1_uncharacterized_protein_BU24DRAFT_482795,_partial_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_482795
XP_033376576.1_uncharacterized_protein_BU24DRAFT_416051,_partial_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_416051
XP_033376577.1_uncharacterized_protein_BU24DRAFT_498094,_partial_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_498094
XP_033376578.1_uncharacterized_protein_BU24DRAFT_229465,_partial_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_229465
XP_033376580.1_uncharacterized_protein_BU24DRAFT_210092_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_210092
XP_033376581.1_uncharacterized_protein_BU24DRAFT_429509_Aaosphaeria_arxii_CBS_175.79 BU24DRAFT_429509
See https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice and https://mywiki.wooledge.org/Quotes for some of the issues with the script in your question.
So, it looks like yours mostly works. A lot of what you are doing here is unnecessary. Here is your script broken into multiple lines for readability:
while read r; do
CMD01=$(echo $r)
CMD02=$(grep $r zztest01)
echo "$CMD02 $CMD01"
done < <(head zztest) | awk '(NR>1) && ($2 > 2 ) '
First, CMD01=$(echo $r): This is really the same (or intended to be) as CMD01="$r" so kind of useless.
Then, < <(head zztest): You are using head to output the contents of the file. This actually works just as well with a simple redirection like this: < zztest.
Last, | awk '(NR>1) && ($2 > 2 ) ': This appears to just be some sort of logic on whether we are going to print anything or not.
Here is a simplified version:
while read r; do
CMD02=$(grep "$r" zztest01) && echo "$CMD02 $r"
done < zztest
Explanation
CMD02=$(grep "$r" zztest01) && echo "$CMD02 $r": The main part of this is really two commands separated by &&. This means execute the second command only if the first one succeeded. grep returns a "failure" exit code if it does not find what it is looking for. So, if grep does not find a match, echo will not run.
The output of grep will be stored in the variable $CMD02. Then, you will echo that along with $r for each match.
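To see the exit-status gating in isolation, here is a small sketch. The function name lookup is made up for illustration; -F is added on the assumption that the keys are plain strings rather than regular expressions.

```shell
# Hypothetical helper: print "<matching line(s)> <key>" only when grep finds the key.
lookup() {
  # $1 = key, $2 = file to search
  local hit
  # Declaring and assigning separately keeps grep's exit status
  # (with "local hit=$(...)" the status would be that of "local").
  hit=$(grep -F -- "$1" "$2") && echo "$hit $1"
}
```

Calling lookup with a key that is not in the file prints nothing and returns a non-zero status, which is the same behaviour the loop relies on.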
If you really want to keep this on one line like the original:
while read r; do CMD02=$(grep "$r" zztest01) && echo "$CMD02 $r"; done < zztest
Update
If you want to avoid partial matches as Ed asked, you can change the grep to this: grep "$r[^0-9]" zztest01. This avoids a match if there is a trailing digit right after the match string (which is an assumption based on the sample data, and it also means the match cannot be at the very end of the line).
While not explicit in the question, it seems that each pattern should only match a single line in the input file (File02).
Based on this observation, it is possible to improve the performance of the solution from Ed Morton:
awk '
NR==FNR{strs[$0]; next}
{ for (str in strs) if (index($0,str)) { print $0, str ; delete strs[str]; next } }
' file1 file2
For large files with many patterns, this can reduce the runtime by about a factor of 4.

Update version number in property file using bash

I am new to bash scripting and I need help with awk. I have a property file with a version inside and I want to update it.
version=1.1.1.0
and I use awk to do that
file="version.properties"
awk -F'["]' -v OFS='"' '/version=/{
split($4,a,".");
$4=a[1]"."a[2]"."a[3]"."a[4]+1
}
;1' $file > newFile && mv newFile $file
but I am getting a strange result: version="1.1.1.0""...1
Could someone help me please with this.
You mentioned in your comment you want to update the file in place. You can do that in a one-liner with perl:
perl -pe '/^version=/ and s/(\d+\.\d+\.\d+\.)(\d+)/$1 . ($2+1)/e' -i version.properties
Explanation
-e is followed by a script to run. With -p and -i, the effect is to run that script on each line, and modify the file in place if the script changes anything.
The script itself, broken down for explanation, is:
/^version=/ and # Do the following on lines starting with `version=`
s/ # Make a replacement on those lines
(\d+\.\d+\.\d+\.)(\d+)/ # Match x.y.z.w, and set $1 = `x.y.z.` and $2 = `w`
$1 . ($2+1)/ # Replace x.y.z.w with a copy of $1, followed by w+1
e # This tells Perl the replacement is Perl code rather
# than a text string.
Example run
$ cat foo.txt
version=1.1.1.2
$ perl -pe '/^version=/ and s/(\d+\.\d+\.\d+\.)(\d+)/$1 . ($2+1)/e' -i foo.txt
$ cat foo.txt
version=1.1.1.3
This is not the best way, but here's one fix.
Test case
I am assuming the input file has at least one line that is exactly version=1.1.1.0.
$ awk -F'["]' -v OFS='"' '/version=/{
> split($4,a,".");
> $4=a[1]"."a[2]"."a[3]"."a[4]+1
> }
> ;1' <<<'version=1.1.1.0'
Output:
version=1.1.1.0"""...1
The """ is because you are assigning to field 4 ($4). When you do that, awk adds field separators (OFS) between fields 1 and 2, 2 and 3, and 3 and 4. Three OFS => """, in your example.
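The effect is easy to reproduce with a one-field record. This is a minimal demonstration, with the variable name rebuilt chosen only for illustration:

```shell
# Assigning to a field beyond NF makes awk create the missing (empty) fields
# and rebuild the record with OFS between each pair of adjacent fields.
rebuilt=$(echo 'a' | awk -v OFS='"' '{ $4 = "x"; print }')
echo "$rebuilt"   # a"""x
```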
Minimal change
$ awk -F'["]' -v OFS='"' '/version=/{
split($1,a,".");
$1=a[1]"."a[2]"."a[3]"."a[4]+1;
print
}
' <<<'version=1.1.1.0'
version=1.1.1.1
Two changes:
Change $4 to $1
Since the input field separator (-F) is ["], $4 is whatever would be after the third " (if there were any in the input). Therefore, split($4, ...) splits an empty field. The contents of the line, before the first " (if any), are in $1.
print at the end instead of ;1
The 1 after the closing curly brace is the next condition, and there is no action specified. The default action is to print the current line, as modified, so the 1 triggers printing. Instead, just print within your action when you are done processing. That way your action is self-contained. (Of course, if you needed to do other processing, you might want to print later, after that processing.)
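An alternative that sidesteps the quote separator altogether is to split on the dot, so that the last field is the number to increment. This is a sketch under the assumption that the line looks like version=1.1.1.0 (no quotes) and that the last component is a plain integer; the function name bump_version is made up.

```shell
# Hypothetical helper: increment the last dot-separated component of the
# "version=" line in the properties file given as $1.
bump_version() {
  local tmp
  tmp=$(mktemp) || return 1
  # With FS and OFS both ".", assigning to $NF rebuilds the line with dots.
  # The trailing 1 prints every line, changed or not.
  awk 'BEGIN { FS = OFS = "." } /^version=/ { $NF = $NF + 1 } 1' "$1" > "$tmp" &&
    cat "$tmp" > "$1"       # write back in place, keeping the file's permissions
  rm -f "$tmp"
}
```

Running bump_version version.properties on a file containing version=1.1.1.0 leaves version=1.1.1.1 in it and keeps all other lines untouched.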
You can use the = as the delimiter, like this:
awk -F= -v v=1.0.1 '$1=="version"{printf "version=\"%s\"\n", v}' file.properties

Print line after the match in grep [duplicate]

I'm trying to get the current track running from 'cmus-remote -Q'
It's always underneath this line
tag genre Various
<some track>
Now, I need to keep it simple because I want to add it to my i3 bar. I used
cmus-remote -Q | grep -A 1 "tag genre"
but that greps the 'tag' line AND the line underneath.
I want ONLY the line underneath.
With sed:
sed -n '/tag genre/{n;p}'
Output:
$ cmus-remote -Q | sed -n '/tag genre/{n;p}'
<some track>
If you want to use grep as the tool for this, you can achieve it by adding another segment to your pipeline:
cmus-remote -Q | grep -A 1 "tag genre" | grep -v "tag genre"
This will fail in cases where the string you're searching for is on two lines in a row. You'll have to define what behaviour you want in that case if we're going to program something sensible for it.
Another possibility would be to use a tool like awk, which allows for greater complexity in the line selection:
cmus-remote -Q | awk '/tag genre/ { getline; print }'
This searches for the string, then gets the next line, then prints it.
Another possibility would be to do this in bash alone:
while read line; do
[[ $line =~ tag\ genre ]] && read line && echo "$line"
done < <(cmus-remote -Q)
This implements the same functionality as the awk script, only using no external tools at all. It's likely slower than the awk script.
You can use awk instead of grep:
awk 'p{print; p=0} /tag genre/{p=1}' file
<some track>
/tag genre/{p=1} - sets a flag p=1 when it encounters tag genre in a line.
p{print; p=0} when p is non-zero then it prints a line and resets p to 0.
I'd suggest using awk:
awk 'seen && seen--; /tag genre/ { seen = 1 }'
when seen is true, print the line.
when seen is true, decrement the value, so it will no longer be true after the desired number of lines are printed
when the pattern matches, set seen to the number of lines to be printed
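The counter makes it simple to print more than one following line. Here is a sketch with the pattern and the line count passed in as parameters; the function name lines_after is made up, and the pattern is treated as an awk regular expression.

```shell
# Hypothetical helper: print the N lines that follow each line matching a pattern.
lines_after() {
  # $1 = pattern (awk regexp), $2 = number of following lines to print
  awk -v pat="$1" -v n="$2" '
    seen && seen--        # while the counter is positive: print, then count down
    $0 ~ pat { seen = n } # on a match, (re)arm the counter
  '
}
```

For the original problem this would be used as cmus-remote -Q | lines_after 'tag genre' 1.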

Grep for Keyword1Keyword2 but not Keyword1TEXTKeyword2 - Very large grep

I want to be able to grep for exact-match results, without outputting lines that have text in between my search words. For example:
egrep -i "^cat|^dog" list.txt >> startswith.txt
egrep -i "home$|house$" startswith.txt >> final.txt
I want this to return any matches for cathome, cathouse, doghome, doghouse; but not return cathasahome, catneedsahouse, etc. Take note that the files would be way too big for me to go through and say ^word1word2$ in every combination.
Is there a way to do this within grep or egrep?
Use some grouping to specify both parts of your pattern. The anchors (^ and $) will apply to the groups.
$ cat list.txt
cathome
cathouse
catindahouse
dogindahome
doghouse
doghome
$ egrep -i "^(dog|cat)(home|house)$" list.txt
cathome
cathouse
doghouse
doghome
You could try the same thing in Perl regex mode, with non-capturing groups (since you don't care about capturing them):
$ grep -Pi "^(?:dog|cat)(?:home|house)$" list.txt
No idea if that'll make a difference either way, but doesn't hurt to try.
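If the start and end words live in files rather than in your head, the grouped pattern can be generated instead of typed. This is a sketch under two assumptions that are not in the question: the words are stored one per line in two files, and none of them contains regexp metacharacters. The function name build_pattern is made up.

```shell
# Hypothetical helper: build ^(a|b|...)(x|y|...)$ from two word-list files.
build_pattern() {
  # $1 = file of starting words, $2 = file of ending words (one word per line)
  # paste -s joins all lines of a file into one, using | as the delimiter.
  printf '^(%s)(%s)$' "$(paste -sd'|' "$1")" "$(paste -sd'|' "$2")"
}
```

It would then be used as grep -Ei "$(build_pattern starts.txt ends.txt)" list.txt, where starts.txt and ends.txt are placeholder names.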
You didn't provide any sample input or expected output so this is an untested guess but this is probably what you're looking for:
awk '
BEGIN {
split("cat dog",beg)
split("home house",end)
for (i in beg)
for (j in end)
matches[beg[i] end[j]]
}
tolower($0) in matches
' file
e.g.:
$ cat file
acathome
CatHome
catinhouse
CATHOUSE
doghomes
dogHOME
dogathouse
DOGhouse
$ awk '
BEGIN {
split("cat dog",beg)
split("home house",end)
for (i in beg)
for (j in end)
matches[beg[i] end[j]]
}
tolower($0) in matches
' file
CatHome
CATHOUSE
dogHOME
DOGhouse

Bash command to extract characters in a string

I want to write a small script to generate the location of a file in an NGINX cache directory.
The format of the path is:
/path/to/nginx/cache/d8/40/32/13febd65d65112badd0aa90a15d84032
Note the last 6 characters: d8 40 32, are represented in the path.
As an input I give the md5 hash (13febd65d65112badd0aa90a15d84032) and I want to generate the output: d8/40/32/13febd65d65112badd0aa90a15d84032
I'm sure sed or awk will be handy, but I don't know yet how...
This awk can make it:
awk 'BEGIN{FS=""; OFS="/"}{print $(NF-5)$(NF-4), $(NF-3)$(NF-2), $(NF-1)$NF, $0}'
Explanation
BEGIN{FS=""; OFS="/"}. FS="" sets the input field separator to be "", so that every char will be a different field. OFS="/" sets the output field separator as /, for print matters.
print ... $(NF-1)$NF, $0 prints the penultimate field and the last one all together; then, the whole string. The comma is "filled" with the OFS, which is /.
Test
$ awk 'BEGIN{FS=""; OFS="/"}{print $(NF-5)$(NF-4), $(NF-3)$(NF-2), $(NF-1)$NF, $0}' <<< "13febd65d65112badd0aa90a15d84032"
d8/40/32/13febd65d65112badd0aa90a15d84032
Or with a file:
$ cat a
13febd65d65112badd0aa90a15d84032
13febd65d65112badd0aa90a15f1f2f3
$ awk 'BEGIN{FS=""; OFS="/"}{print $(NF-5)$(NF-4), $(NF-3)$(NF-2), $(NF-1)$NF, $0}' a
d8/40/32/13febd65d65112badd0aa90a15d84032
f1/f2/f3/13febd65d65112badd0aa90a15f1f2f3
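Splitting on an empty FS is an extension supported by gawk and several other awks rather than something POSIX specifies, so on an unfamiliar system a substr()-based variant may be safer. A sketch, assuming each input line is exactly one hash; the function name cache_subpath is made up:

```shell
# Hypothetical helper: read hashes on stdin, print xx/yy/zz/<hash> built
# from the last six characters of each hash.
cache_subpath() {
  awk '{
    n = length($0)
    # Three 2-character slices covering positions n-5 .. n.
    print substr($0, n - 5, 2) "/" substr($0, n - 3, 2) "/" substr($0, n - 1, 2) "/" $0
  }'
}
```

For example, echo 13febd65d65112badd0aa90a15d84032 | cache_subpath gives the same d8/40/32/... line as above.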
With sed:
echo '13febd65d65112badd0aa90a15d84032' | \
sed -n 's/\(.*\([0-9a-f]\{2\}\)\([0-9a-f]\{2\}\)\([0-9a-f]\{2\}\)\)$/\2\/\3\/\4\/\1/p;'
With GNU sed you can simplify the pattern further using the -r option, so that {} and () no longer need escaping. Using ~ as the regex delimiter lets you use the path separator / without escaping it:
sed -nr 's~(.*([0-9a-f]{2})([0-9a-f]{2})([0-9a-f]{2}))$~\2/\3/\4/\1~p;'
Output:
d8/40/32/13febd65d65112badd0aa90a15d84032
Explained simply, the pattern matches:
(all (n-5 - n-4) (n-3 - n-2) (n-1 - n-0))
that is, the whole hash as group 1 and its last three character pairs as groups 2, 3 and 4, and replaces it with
\2/\3/\4/\1
You can use a regular expression to separate each of the last 3 bytes from the rest of the hash.
hash=13febd65d65112badd0aa90a15d84032
[[ $hash =~ (..)(..)(..)$ ]]
new_path="/path/to/nginx/cache/${BASH_REMATCH[1]}/${BASH_REMATCH[2]}/${BASH_REMATCH[3]}/$hash"
Base="/path/to/nginx/cache/"
echo '13febd65d65112badd0aa90a15d84032' | \
sed "s|\(.*\(..\)\(..\)\(..\)\)|${Base}\2/\3/\4/\1|"
# or
# sed "s|.*\(..\)\(..\)\(..\)$|${Base}\1/\2/\3/&|"
This assumes the input is a valid MD5 string and nothing else.
First of all - thanks to all of the responders - this was extremely quick!
I also did my own scripting meantime, and came up with this solution:
Run this script with a parameter of the URL you're looking for (www.example.com/article/76232?q=hello for example)
#!/bin/bash
path=$1
md5=$(echo -n "$path" | md5sum | cut -f1 -d' ')
p3=${md5: -2:2}
p2=${md5: -4:2}
p1=${md5: -6:2}
echo "/path/to/nginx/cache/$p1/$p2/$p3/$md5"
This assumes the NGINX cache has a key structure of 2:2:2.

Resources