SED usage and Regular expression - shell

Can anyone explain the following command?
What happens if the same code is executed using grep?
sed 's/^.*-S[[:space:]]*\([^[:space:]]*\).*$/\1/'

As a sed command, it replaces a line containing something like -S paradox (possibly with text on either side) with just paradox. (There must be a space after paradox, or the end of the line, for exactly the string paradox to be printed; if there are non-space characters immediately after the word, then those are included in the output too, up to the first space.) For example, the input line someprog -x painter -S paradox file2 file93 yields paradox.
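A quick way to check this behaviour is to feed the command a few sample lines. The sketch below uses the example line from the answer plus two made-up lines (the -Sparadox,extra line and the line without -S are illustrative assumptions, not taken from the question).

```shell
# The substitution from the question, stored once so every sample uses the same script.
script='s/^.*-S[[:space:]]*\([^[:space:]]*\).*$/\1/'

# Example line from the answer: only the word after -S remains.
out1=$(printf '%s\n' 'someprog -x painter -S paradox file2 file93' | sed "$script")
echo "$out1"    # paradox

# Hypothetical line: no space after -S, and non-space characters glued to the word.
out2=$(printf '%s\n' 'someprog -Sparadox,extra file2' | sed "$script")
echo "$out2"    # paradox,extra

# Hypothetical line without -S: the regex does not match, so sed prints it unchanged.
out3=$(printf '%s\n' 'someprog file2' | sed "$script")
echo "$out3"    # someprog file2
```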
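If you pass the same expression to grep, the whole string, including the leading s/ and the trailing /\1/, is treated as the regex. The ^ and the $ are then no longer at the start and end of the pattern, so they lose their special meanings, and grep looks for a line such as: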
schemas/^semicolon-S(comma)$/(comma)/gratitude
In the grep context, the \( and \) do remember a pattern (and in the example line, that pattern corresponds to (comma) — to confuse you, and me). The \1 then refers to the previously remembered string, the second (comma) in the example. You could drop all the parentheses in the sample line and it would be selected. If your version of grep supports the -o option to output only the text that matches, you can see more clearly which parts of the sample line match the regex.
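The grep behaviour can be tried in the same way. This is a minimal sketch that assumes a grep supporting -o (GNU and BSD grep both do); the variable names are only for illustration.

```shell
# The same string, now used as a grep basic regular expression.
pattern='s/^.*-S[[:space:]]*\([^[:space:]]*\).*$/\1/'
sample='schemas/^semicolon-S(comma)$/(comma)/gratitude'

# -o prints only the part of the line that matched.
match=$(printf '%s\n' "$sample" | grep -o "$pattern")
echo "$match"    # s/^semicolon-S(comma)$/(comma)/

# The same line with the parentheses dropped is still selected (-c counts matching lines).
count=$(printf '%s\n' 'schemas/^semicolon-Scomma$/comma/gratitude' | grep -c "$pattern")
echo "$count"    # 1
```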

Related

Remove word from url

I need to remove /%(tenant_id)s from this source:
https://ext.an1.test.dev:8776/v3/%(tenant_id)s
To make it look like this:
https://ext.an1.test.dev:8776/v3
I'm trying through sed, but unsuccessfully.
curl ....... | jq -r .endpoints[].url | grep '8776/v3' | sed -e 's/[/%(tenant_id)s] //g'
I get it again:
https://ext.an1.test.dev:8776/v3/%(tenant_id)s
You seem to be confused about the meaning of square brackets.
curl ....... |
jq -r '.endpoints[].url' |
sed -n '\;8776/v3;s;/%.*;;p'
fixes the incorrect regex, loses the useless grep, and somewhat simplifies the processing by switching to a different delimiter. To protect against (fairly unlikely) shell wildcard matches on the text in the jq search expression, I also added single quotes around that.
In some more detail, sed -n avoids printing input lines, and the address expression \;8776/v3; selects only input lines which match the regex 8776/v3; we use ; as the delimiter around the regex, which (somewhat obscurely) requires the starting delimiter to be backslashed. Then, we perform the substitution: again, we use ; as the delimiter so that slashes and percent signs in the regex do not need to be escaped. The p flag on the substitution causes sed to print lines where the substitution was performed successfully; we remove the g flag, as we don't expect more than one match per input line. The substitution replaces everything after the first occurrence of /% with nothing.
(Equivalently, with slash delimiters, you would have to backslash all literal slashes: sed -n '/8776\/v3/s/\/%.*//p'.)
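To try the address-plus-substitution form without a live endpoint, the curl and jq stages can be replaced by a fixed list of URLs. In this sketch the second and third URLs are hypothetical, added only to show which lines are filtered out.

```shell
# Stand-in for the curl | jq output: one URL per line.
urls='https://ext.an1.test.dev:8776/v3/%(tenant_id)s
https://ext.an1.test.dev:8774/v2.1
https://ext.an1.test.dev:8776/v3'

# \;8776/v3; selects lines; s;/%.*;;p strips from /% onward and prints only on success.
# The second URL fails the address test; the third has no /% so nothing is substituted.
result=$(printf '%s\n' "$urls" | sed -n '\;8776/v3;s;/%.*;;p')
echo "$result"    # https://ext.an1.test.dev:8776/v3
```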
For the record, square brackets in regular expressions form a character class; the expression [abc] matches a single character which can be one of a, b, or c. Perhaps review the tips on the Stack Overflow regex tag info page for a quick refresher on this and other common beginner mistakes.
Besides the incorrect square brackets, your regex specified a space after s, which is unlikely to be there. Other than that, your regex should work fine if you are sure the string you want to remove is always exactly /%(tenant_id)s. (Many regex dialects require round parentheses to be escaped, but sed without -E or -r is not one of those.)
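The difference between a character class and a literal string is easy to see on a small input. The sketch below is illustrative only: the abcab string is made up, and the second command is the original attempt with the brackets and the trailing space removed.

```shell
# A bracket expression matches ONE character from the set, so each a or b is replaced on its own.
class_demo=$(printf '%s\n' 'abcab' | sed 's/[ab]/X/g')
echo "$class_demo"    # XXcXX

# Without brackets and without the trailing space, the literal string is removed.
# ';' is the delimiter so the slash needs no escaping; ( and ) are literal in a basic regex.
url='https://ext.an1.test.dev:8776/v3/%(tenant_id)s'
fixed=$(printf '%s\n' "$url" | sed 's;/%(tenant_id)s;;')
echo "$fixed"    # https://ext.an1.test.dev:8776/v3
```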
If you've managed to get the address into a variable then one parameter expansion idea:
$ myaddr='https://ext.an1.test.dev:8776/v3/%(tenant_id)s'
$ echo "${myaddr%/*}"
https://ext.an1.test.dev:8776/v3
$ mynewaddr="${myaddr%/*}"
$ echo "${mynewaddr}"
https://ext.an1.test.dev:8776/v3

Extract text between two special characters

Trying to extract the text between the special characters "\ and \" through sed
Ex: "\hell##$\"},
expected output : hell##$
You can do it quite easily with using a capture-group and backreference with basic regular-expressions:
sed 's/^["][\]\([^\]*\).*$/\1/'
Explanation
Normal substitution sed 's/find/replace/', where
find is ^["][\] a double-quote and \ before beginning the capture \(...\) which contains [^\]* (zero or more characters not a \), the closing of the capture \) and then .*$ the remainder of the string;
replace is \1 (the first backreference) containing the text captured between \(...\).
(note: if your "\ doesn't begin the string, remove the first '^' anchor)
Example
$ echo '"\hell##$\"},' | sed 's/^["][\]\([^\]*\).*$/\1/'
hell##$
Look things over and let me know if you have questions.
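One detail about dropping the ^ anchor may be worth seeing in action: without it, any text before the "\ marker stays in the output. The sketch below uses a hypothetical input with a key: prefix and shows one possible way (adding ^.* in front) to discard that prefix as well.

```shell
# Hypothetical input where the "\ marker is not at the start of the line.
line='key: "\hell##$\"},'

# Anchor removed, as the note suggests: the match starts at "\ and the prefix is left in place.
kept_prefix=$(printf '%s\n' "$line" | sed 's/["][\]\([^\]*\).*$/\1/')
echo "$kept_prefix"    # key: hell##$

# Adding ^.* in front also consumes whatever precedes the "\ marker.
only_text=$(printf '%s\n' "$line" | sed 's/^.*["][\]\([^\]*\).*$/\1/')
echo "$only_text"    # hell##$
```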
This might work for you (GNU sed):
sed -nE '/"\\[^\\]*\\+([^\\"][^\\]*\\+)*"/{s/"\\/\n/;s/.*\n//;s/\\"/\n/;P;D}' file
The solution comes in two parts:
Firstly, a regexp to determine whether a pair of two characters exists. This can be tricky as a negated class is insufficient because edge cases can easily defeat a simplistic approach.
Secondly, once a pair of characters does exist, the text between them must be extracted piecemeal.

Using sed to substitute text around dynamic filename

I'm trying to figure out the best method for substituting text in a BASH script. Sed seems to be the best option, but correct me if I'm wrong.
What I'd like to do is take every instance of images/< filename >.png in a file, and add surrounding text - {{media("images/.< filename >.png")}}. The following code is the closest I've been able to get:
sed -i -e 's:images/.*.png:{{media("images/.*.png")}}:g' file.html
How can I make this happen?
In sed, & in the substitution will be replaced with the matched string, so if we can assume no spaces in a filename and a word boundary before and after each, this does what you want:
s:\bimages/\S*\.png\b:{{media("&")}}:g
Apart from doing the substitution, there are a couple issues with your code worth mentioning:
images/.*.png will match images/foo.png, but it will also match images/foopng. Don't forget to escape regex characters: images/.*\.png.
sed quantifiers are always greedy. Suppose you had this input:
Foo images/bar.png baz images/qux.png quux
In this case, the expression images/.*\.png would match everything from the first images to the last .png. The solution above avoids this by using \S instead of . to match only non-whitespace characters.
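Both points can be checked on the sample line. This sketch swaps in [^[:space:]]* as a POSIX spelling of \S* and leaves out the \b word boundaries, so that it does not depend on GNU sed; that substitution is an assumption made for portability, not part of the answer.

```shell
input='Foo images/bar.png baz images/qux.png quux'

# Greedy .* runs from the first images/ to the last .png, so both names land in one match.
greedy=$(printf '%s\n' "$input" | sed 's:images/.*\.png:[&]:')
echo "$greedy"    # Foo [images/bar.png baz images/qux.png] quux

# A non-whitespace class cannot cross the spaces; & inserts each matched path.
wrapped=$(printf '%s\n' "$input" | sed 's:images/[^[:space:]]*\.png:{{media("&")}}:g')
echo "$wrapped"    # Foo {{media("images/bar.png")}} baz {{media("images/qux.png")}} quux
```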

Need to diff two text files in linux with some patterns in filelines

File A contains
Test-1.2-3
Test1-2.2-3
Test2-4.2-3
File B contains
Test1
Expected output should be
Test-1.2-3
Test2-4.2-3
diff A B doesn't work as expected.
Kindly let me know if there are any solutions.
Using grep:
grep -vf B A
-f FILE, --file=FILE
Obtain patterns from FILE, one per line. The empty file
contains zero patterns, and therefore matches nothing.
-v, --invert-match
Invert the sense of matching, to select non-matching lines.
Edit:
Optionally, you may want to use the -w option if you want a more precise match on "words" only which seems to be your case from your example since your match is followed by '-'. As DevSolar points out, you may also want to use the -F option to prevent input patterns from your file B to be interpreted as regular expressions.
grep -vFwf B A
-w, --word-regexp
Select only those lines containing matches that form whole
words. The test is that the matching substring must either be
at the beginning of the line, or preceded by a non-word
constituent character. Similarly, it must be either at the end
of the line or followed by a non-word constituent character.
Word-constituent characters are letters, digits, and the
underscore.
-F, --fixed-strings
Interpret PATTERN as a list of fixed strings (rather than regular
expressions), separated by newlines, any of which is to be matched.
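To see what -w changes, the sketch below recreates files A and B in a scratch directory and adds one hypothetical line, Test10-5.2-3, that is not in the question. Without -w that line is dropped too, because Test1 matches inside Test10.

```shell
# Work in a scratch directory so no real files are touched.
dir=$(mktemp -d)
printf '%s\n' 'Test-1.2-3' 'Test1-2.2-3' 'Test2-4.2-3' 'Test10-5.2-3' > "$dir/A"
printf '%s\n' 'Test1' > "$dir/B"

# Plain -vf: Test1 also matches inside Test10, so that line is excluded as well.
plain=$(grep -vf "$dir/B" "$dir/A")
printf '%s\n' "$plain"

# -F (fixed strings) and -w (whole words): Test10 no longer counts as a match for Test1.
words=$(grep -vFwf "$dir/B" "$dir/A")
printf '%s\n' "$words"

rm -rf "$dir"
```

The first command prints Test-1.2-3 and Test2-4.2-3; the second prints those two lines plus Test10-5.2-3.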
To complement Julien Lopez's helpful answer:
If you want to ensure that lines from File B only match at the beginning of lines from File A, you can prepend ^ to each line from file B, using sed:
grep -vf <(sed 's/^/^/' fileB) fileA
grep, which by default interprets its search strings as BREs (basic regular expressions), then interprets the ^ as the beginning-of-line anchor.
If the lines in File B may contain characters that are regex metacharacters (such as ^, *, ?, ...) but should be treated as literals, you must escape them first:
grep -vf <(sed 's/[^^]/[&]/g; s/\^/\\^/g; s/^/^/' fileB) fileA
An explanation of this grim-looking - but generically robust - sed command can be found in this answer of mine.
Note:
Assumes bash, ksh, or zsh due to use of <(...), a process substitution, which makes the output from sed act as if it were provided via a file.
The sed command s/^/^/ looks like it won't do anything, but the first ^, in the regex part of the call, is the beginning-of-line anchor[1], whereas the second ^, in the substitution part of the call, is a literal to place at the beginning of the line (which will later itself be interpreted as the beginning-of-line anchor in the context of grep).
[1] Strictly speaking, to sed it is the beginning-of-pattern-space anchor, because it is possible to read multiple lines at once with sed, in which case ^ refers to the beginning of the pattern space (input buffer) as a whole, not to individual lines.
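In a shell without process substitution, the same idea can be sketched by writing the generated patterns to a temporary file first. The file contents below are hypothetical, chosen so that the pattern line contains a metacharacter (the dot) and the effect of escaping is visible.

```shell
dir=$(mktemp -d)
printf '%s\n' 'a.c-1' 'abc-2' 'x a.c' > "$dir/fileA"
printf '%s\n' 'a.c' > "$dir/fileB"

# Anchor only: ^a.c still treats the dot as "any character", so abc-2 is excluded too.
sed 's/^/^/' "$dir/fileB" > "$dir/anchored"
anchored=$(grep -vf "$dir/anchored" "$dir/fileA")
printf '%s\n' "$anchored"

# Escape, then anchor: each character becomes a one-character bracket expression.
sed 's/[^^]/[&]/g; s/\^/\\^/g; s/^/^/' "$dir/fileB" > "$dir/escaped"
pat=$(cat "$dir/escaped")
printf '%s\n' "$pat"    # ^[a][.][c]
escaped=$(grep -vf "$dir/escaped" "$dir/fileA")
printf '%s\n' "$escaped"

rm -rf "$dir"
```

With the anchor alone only x a.c survives; with escaping, abc-2 is kept as well, because the dot now matches only a literal dot.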

how to use grep to search for 2 or more parentheses and words capitalization

I'm trying to search for a couple of strings in a path that includes about 80 txt files.
I'm trying to search for !!, ??, ;, capitalization, and parentheses.
I'm also trying to find lines with more than 4 capitalized words, but I don't know how to do that.
Here is what I did:
grep -lr '!!\|??\|;\|(.*(' path
Can someone help me with it?
Here is a sample input:
file1.txt:
ryan went over there !!
file2.txt:
am I going there??
file3.txt:
how about I GO TO THE PARK TODAY and not TOMORROW
file4.txt:
This is (not) (valid)
file5.txt:
to go; or not to go
the output should be something like this:
path/file1.txt
path/file2.txt
path/file3.txt
path/file5.txt
Try this regex:
grep -Er '\?\?|!!|\(.+\).+\(.+\)|([A-Z]+\b.){4,}|;' /path/to/files/*.txt
Output:
./1.txt:ryan went over there !!
./2.txt:am I going there??
./3.txt:how about I GO TO THE PARK TODAY and not TOMORROW
./4.txt:This is (not) (valid)
./5.txt:to go; or not to go
grep -Elr will output:
./1.txt
./2.txt
./3.txt
./4.txt
./5.txt
The regex searches for:
??
!!
() used at least twice on a line
Four or more capitalized words on a line
;
grep -lr '!!\|??\|;\|(.*(' path
is what you want. (.*( will match a line containing (at least) two open parentheses with arbitrary text in between.
For readability, you might try
grep -lr -e '!!' -e '??' -e ';' -e '(.*(' path
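The multiple -e form can be tried against the sample files from the question. This is a sketch in a scratch directory; the cd and sort are there only to make the listing short and its order predictable.

```shell
dir=$(mktemp -d)
# Sample files from the question.
printf '%s\n' 'ryan went over there !!' > "$dir/file1.txt"
printf '%s\n' 'am I going there??' > "$dir/file2.txt"
printf '%s\n' 'how about I GO TO THE PARK TODAY and not TOMORROW' > "$dir/file3.txt"
printf '%s\n' 'This is (not) (valid)' > "$dir/file4.txt"
printf '%s\n' 'to go; or not to go' > "$dir/file5.txt"

# One -e per pattern; in a basic regex ?, ( and ) are literal characters.
found=$(cd "$dir" && grep -lr -e '!!' -e '??' -e ';' -e '(.*(' . | sort)
printf '%s\n' "$found"

rm -rf "$dir"
```

This lists file1.txt, file2.txt, file4.txt and file5.txt; file3.txt is not listed because none of these four patterns looks at capitalization.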
Your notation is off. In modern grep, you need to backslash the braces, just like you backslash the vertical bar for alternation. More conveniently, you might want to switch to grep -E for backslashless syntax; but then you will need \( to match a literal opening parenthesis.
But either way, inside the braces, there can only be a maximum of two numbers: the lower and the upper bound for the number of repetitions.
However, in this case, because there is no limiting context, \({2} will match the first two of an arbitrarily large number of opening parentheses. In other words, \({2,4} will not fail to match if there are more than four parens (though the actual match will end after four, as you will be able to see e.g. with grep -o). If you need to limit to no longer than four, you will need to supply some sort of trailing context, such as ($|[^(]).
To find a line containing more than one but fewer than five nonadjacent opening parens, try something like this with grep -E:
^[^(]*(\([^(]*){2,4}$
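Both the anchored pattern and the unanchored interval can be checked with grep -E. The test lines below are hypothetical, except for the (not) (valid) line from the question.

```shell
# Lines with 0, 1, 2 and 5 opening parentheses.
lines='no parens here
one (paren) only
This is (not) (valid)
(a) (b) (c) (d) (e)'

# Anchored at both ends: the whole line must contain between 2 and 4 opening parens.
bounded=$(printf '%s\n' "$lines" | grep -E '^[^(]*(\([^(]*){2,4}$')
echo "$bounded"    # This is (not) (valid)

# Unanchored: \({2,4} still matches inside a run of five, stopping after four.
partial=$(printf '%s\n' '(((((' | grep -oE '\({2,4}')
echo "$partial"    # ((((
```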
