Grep for URL parsing - bash script programming - bash

I am trying to learn some bash scripting and I can't understand how to use grep to split a URL, for example:
blabla1.com
blabla2.gov
blabla3.fr
I just want to keep com, gov and fr (without the '.' character) and ignore what comes before the '.'.
Thanks in advance ..

Grep is a tool for matching text. You need something else if you want to transform text. If you have the values in question in a bash variable, then what you ask is pretty easy:
authority=blabla.com
# Here's the important bit:
domain=${authority/*./}
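echo "$domain"
The funny syntax in the middle evaluates to the result of a pattern substitution on the value of the variable authority: the longest match of the glob *. (everything up to and including the last dot) is replaced with nothing.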
If you're trying to do this on lines of a file, then the sed program is your friend:
sed 's/.*\.//' < input.file
This is again a pattern substitution, but sed uses regular expression patterns, whereas bash uses shell glob patterns.
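For the variable case, a prefix-removal expansion is a common alternative to the substitution form shown above. A minimal sketch, assuming the value normally contains at least one dot; the function name tld_of is made up for illustration:

```shell
#!/usr/bin/env bash
# tld_of is a hypothetical helper, not part of the original answer.
tld_of() {
  local authority=$1
  # ${var##pattern} removes the longest prefix matching the glob "*.",
  # i.e. everything up to and including the last dot.
  printf '%s\n' "${authority##*.}"
}

for host in blabla1.com blabla2.gov blabla3.fr; do
  tld_of "$host"
done
```

If the value has no dot at all, the expansion leaves it unchanged.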

grep -E -o '[^.]+$' < input
-o instructs grep to print only the matching part of the line
-E switches on extended regexp which is needed for + quantifier
[^.]+$ matches one or more characters that are not a dot, anchored at the end of the line

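Try this way (note that \b also matches at the end of a purely alphabetic name, so this relies on the part before the dot ending in a digit, as in the sample input):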
grep -o -E '[a-z]{2,3}\b' input > output
-o, --only-matching: Print only the matched (non-empty) parts of a matching line, with each such part on a separate output line.
$ cat input
blabla1.com
blabla2.gov
blabla3.fr
$ cat output
com
gov
fr

$ cut -d. -f2 file
com
gov
fr
If that's not all you need, post some more truly representative input and expected output so we can help you find the right solution.
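If the input may contain more than one dot per line, cut -d. -f2 returns the second label rather than the last one. A sketch that always takes the last dot-separated field, assuming one host name per line; last_label is a hypothetical name:

```shell
#!/usr/bin/env bash
# last_label is a hypothetical helper; it reads the files given as
# arguments, or standard input when there are none.
last_label() {
  # -F. splits each line on dots; $NF is the last field
  awk -F. '{ print $NF }' "$@"
}

printf '%s\n' blabla1.com www.blabla2.gov a.b.blabla3.fr | last_label
```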

Related

How to convert separators using regex in bash

How do I modify my bash file to achieve the expected result shown below ?
#!/bin/bash
filename=$1
var="$(<$filename)" | tr -d '\n'
sed -i 's/;/,/g' $var
Convert this input file
a,b;c^d"e}
f;g,h!;i8j-
To this output file
a,b,c,d,e,f,g,h,i,j
How to convert separators using regex in bash
You would, well, literally, do exactly that: convert any of the separators using a regex. This consists of the following steps:
most importantly, figure out the exact definition of what constitutes a "separator"
writing a regex for it
writing an algorithm for it
running and testing the code
For example, assuming a separator is a sequence of any of the characters \n,;^"}!8-, you could do:
sed -zi 's/[\n,;^"}!8-]\+/,/g; s/,$/\n/' input_file
Alternatively, when -z is not available in your sed, first convert the newlines with tr '\n' , and pipe the result to sed. The second expression replaces the trailing , with a newline.
Additionally, in your code:
var is unset on the sed line, because each part of a | pipeline runs in a subshell.
var=$(<$filename) contains the contents of the file, whereas sed wants a filename as argument, not file contents.
var=.... | ... pipes the result of the assignment to tr. The output of an assignment is empty, so that line produces nothing, and its output is unused.
Remember to check bash scripts with shellcheck.
For a somewhat portable solution, maybe try
tr -cs A-Za-z , <input_file | sed '$s/,$/\n/' >output_file
The use of \n to force a final newline is still not entirely reliable; there are some sed versions which interpret the sequence as a literal n.
You'd move output_file back on top of input_file after this command if you want to replace the original.
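Putting the points above together, the script from the question could be reworked along these lines. This is only a sketch under the same assumption as the tr command (every run of non-letters counts as one separator); convert_separators is a made-up name:

```shell
#!/usr/bin/env bash
# convert_separators is a hypothetical helper: it prints the letters of a
# file joined by commas on a single line.
convert_separators() {
  local file=$1 joined
  # tr -c takes the complement of A-Za-z, -s squeezes each run into one comma.
  joined=$(tr -cs 'A-Za-z' ',' < "$file")
  joined=${joined#,}   # drop a leading comma, if the file started with a separator
  joined=${joined%,}   # drop the trailing comma produced by the final newline
  printf '%s\n' "$joined"
}

# Only run when a file name is passed, e.g. ./script.sh input_file
if [ "$#" -gt 0 ]; then
  convert_separators "$1"
fi
```

Because the result is written to standard output, the input file is never modified in place; redirect to a new file and move it over the original if that is what you want.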

Read the word after a specific word on the same line when there is no space between them

How can I extract a word that comes after a specific word in bash ? More precisely, I have a file which has a line which looks like this:
Demo.txt
IN=../files/d
out=../files/d
dataload
name
I want to read "d" from the lines above.
sed -n '/\/files\// s~.*/files/\([^.]*\)\..*~\1~p' file
This code works if the line contains a ".", for example:
IN=../files/d.txt
so it prints "d".
Here "d" has no "." as an end delimiter, so I want to read until the end of the line.
Input:
Demo.txt
IN=../files/d
out=../files/d
dataload
name
Expected output:
d
d
The code should be in bash.
You could use GNU grep with PCRE:
grep -oP '/files/\K[^.]+' file
The -P flag makes grep use PCRE, the -o makes it display only the matched part rather than the full line, and the \K in the regex omits what precedes from the displayed matched part.
Alternatively, if you don't have access to GNU grep, the following perl command will have the same effect:
perl -nle 'print $& if m{/files/\K[^.]+}' file
This sed variant should work for you:
sed -n '/\/files\// s~.*/files/\([^.]*\).*~\1~p' file
d
d
The only change from the earlier sed command is that it no longer requires \. right after the first capture group.
If you'd rather chain simple tools than write a single command, you can use the following (it matches exactly one character after /files/, which is enough for the sample):
grep -Eo "/files/." Demo.txt | cut -d/ -f3
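The same extraction can also be sketched in pure bash with parameter expansion, which handles names both with and without an extension. The function name names_after_files is hypothetical, and the sketch assumes the wanted name ends at the first dot or at the end of the line:

```shell
#!/usr/bin/env bash
# names_after_files is a hypothetical helper; it filters standard input.
names_after_files() {
  local line rest
  while IFS= read -r line; do
    case $line in
      */files/*)
        rest=${line##*/files/}        # drop everything through the last "/files/"
        printf '%s\n' "${rest%%.*}"   # cut at the first dot, if there is one
        ;;
    esac
  done
}

printf '%s\n' 'IN=../files/d' 'out=../files/d.txt' 'dataload' 'name' | names_after_files
```

Lines without /files/ are skipped, like with sed -n and the p flag.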

Text processing in bash - extracting information between multiple HTML tags and outputting it into CSV format [duplicate]

I can't figure out how to tell sed that the dot should match a newline:
echo -e "one\ntwo\nthree" | sed 's/one.*two/one/m'
I expect to get:
one
three
Instead I get the original:
one
two
three
sed is a line-based tool. I don't think there is an option for this.
You can use h/H (hold) and g/G (get).
$ echo -e 'one\ntwo\nthree' | sed -n '1h;1!H;${g;s/one.*two/one/p}'
one
three
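The hold-space one-liner above is dense, so here is the same script spread over several lines with comments. This is just an annotated sketch of that command; slurp_replace is a hypothetical wrapper name:

```shell
#!/usr/bin/env bash
# slurp_replace is a hypothetical wrapper around the one-liner above.
slurp_replace() {
  sed -n '
    # first line: copy it into the hold space
    1h
    # every other line: append a newline plus the line to the hold space
    1!H
    # last line: fetch the accumulated text, substitute across newlines, print
    ${
      g
      s/one.*two/one/p
    }
  ' "$@"
}

printf '%s\n' one two three | slurp_replace
```

Because of -n and the p flag, nothing is printed when the substitution does not match.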
Maybe you should try vim
:%s/one\_.*two/one/g
If you use GNU sed, you can match any character, including line break characters, with a plain dot. The manual describes it as follows:
.
Matches any character, including newline.
All you need to use is a -z option:
echo -e "one\ntwo\nthree" | sed -z 's/one.*two/one/'
# => one
# three
However, one.*two might not be what you need since * is always greedy in POSIX regex patterns. So, one.*two will match the leftmost one, then any 0 or more chars as many as possible, and then the rightmost two. If you need to remove one, then any 0+ chars as few as possible, and then the leftmost two, you will have to use perl:
perl -i -0777 -pe 's/one.*?two//sg' file # Non-Unicode version
perl -i -CSD -Mutf8 -0777 -pe 's/one.*?two//sg' file # S&R in a UTF8 file
The -0777 option enables slurp mode so that the file is read as a whole rather than line by line (a bare -0 only switches the record separator to the NUL character, which has the same effect only for files that contain no NUL bytes), -i enables in-place file modification, the s flag makes . match any character including line breaks, and .*? matches any 0 or more characters, as few as possible, because *? is non-greedy. The -CSD -Mutf8 part makes sure your input is decoded and the output re-encoded correctly.
You can use python this way:
$ echo -e "one\ntwo\nthree" | python3 -c 'import re, sys; s=sys.stdin.read(); s=re.sub("(?s)one.*two", "one", s); print(s, end="")'
one
three
$
This reads all of standard input (sys.stdin.read()), substitutes "one" for the match of "one.*two" with the dot-matches-all setting enabled (the (?s) at the start of the regular expression), and prints the modified string (end="" keeps print from adding an extra newline).
This might work for you:
<<<$'one\ntwo\nthree' sed '/two/d'
or
<<<$'one\ntwo\nthree' sed '2d'
or
<<<$'one\ntwo\nthree' sed 'n;d'
or
<<<$'one\ntwo\nthree' sed 'N;N;s/two.//'
sed does match all characters (including \n) with a dot, but it has usually already stripped the \n as part of its cycle, so the newline is no longer present in the pattern space to be matched.
Only certain commands (N,H and G) preserve newlines in the pattern/hold space.
N appends a newline to the pattern space and then appends the next line.
H does exactly the same except it acts on the hold space.
G appends a newline to the pattern space and then appends whatever is in the hold space too.
The hold space is empty until you place something in it so:
sed G file
will insert an empty line after each line.
sed 'G;G' file
will insert 2 empty lines etc etc.
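A small sketch of the G and N behaviour described above; double_space and join_pairs are hypothetical helper names:

```shell
#!/usr/bin/env bash
# double_space and join_pairs are hypothetical helper names.
double_space() {
  # G appends a newline plus the (empty) hold space, so every line gains a blank line
  sed G "$@"
}

join_pairs() {
  # N appends a newline plus the next input line; that newline is then visible to s///
  sed 'N;s/\n/ /' "$@"
}

printf '%s\n' one two | double_space
printf '%s\n' one two three four | join_pairs
```

The second function only works because N has pulled the newline into the pattern space; without N the s command would never see it.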
How about two sed calls:
(get rid of the 'two' first, then get rid of the blank line)
$ echo -e 'one\ntwo\nthree' | sed 's/two//' | sed '/^$/d'
one
three
Actually, I prefer Perl for one-liners over Python:
$ echo -e 'one\ntwo\nthree' | perl -pe 's/two\n//'
one
three
The discussion below is based on GNU sed.
sed operates line by line, so there is no option to make the dot match a newline directly. However, there are tricks that achieve the same effect. You can use a loop of sorts to put all the text in the pattern space, and then do the operation.
To put everything in the pattern space, use:
:a;N;$!ba;
To make "dot match newline" indirectly, you use:
(\n|.)
So the result is:
root@u1804:~# echo -e "one\ntwo\nthree" | sed -r ':a;N;$!ba;s/one(\n|.)*two/one/'
one
three
root@u1804:~#
Note that in this case, (\n|.) matches newlines as well as all other characters. See the example below:
root@u1804:~# echo -e "oneXXXXXX\nXXXXXXtwo\nthree" | sed -r ':a;N;$!ba;s/one(\n|.)*two/one/'
one
three
root@u1804:~#
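When the text fits comfortably in a shell variable, bash's own pattern substitution is another way to match across line breaks, because the glob * also matches newlines. A minimal sketch, not taken from the answers above; collapse_one_two is a made-up name:

```shell
#!/usr/bin/env bash
# collapse_one_two is a hypothetical helper; it reads all of standard input.
collapse_one_two() {
  local text
  text=$(cat)   # note: command substitution strips trailing newlines
  # In ${var/pattern/replacement} the glob * matches newlines too,
  # and the longest match is replaced.
  printf '%s\n' "${text/one*two/one}"
}

printf '%s\n' one two three | collapse_one_two
```

Like the sed versions, this is greedy: the match runs from the first "one" to the last "two".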

Bash sed replace with exact match of a text in a file

I have a file pattern.txt which is composed of one very long line of complicated code (~8200 chars).
This code can be found in multiple files inside multiple directories.
I can easily identify a list of these files using
grep -rli 'uniquepartofthecode' *
My question is how to replace it, matching the exact text from the file.
I tried to do:
var=$(cat pattern.txt)
sed -i "s/$var//g" targetfile.txt
but I got the following error :
sed: -e expression #1, char 96: unknown option to `s'
sed is interpreting my $var content as a regular expression, I would like it to just match the exact text.
The pattern.txt content could be more or less any combination of characters, so I'm afraid I cannot escape every character reliably.
Is there a solution using sed ? Or should I use another tool for that ?
EDIT:
I tried using the solution from "Is it possible to escape regex metacharacters reliably with sed" to make a proper regex pattern from my text file.
the overall process is:
var=$(cat pattern.txt)
searchEscaped=$(sed 's/[^^]/[&]/g; s/\^/\\^/g' <<<"$var")
sed -n "s/$searchEscaped/foo/p" <<<"$var" # if ok, echoes 'foo'
This last command displays "foo". $searchEscaped seems to be properly escaped.
Though, this is not returning anything (it should display foo + the rest of the file without the matched part):
sed -n "s/$searchEscaped/foo/p" targetfile.txt
I think that the best solution is to not use regular expressions at all and resort to string replacement.
One way to do this is using perl:
$ echo "$string_to_replace"
some other stuff abc$^%!# some more
$ echo "$search"
abc$^%!#
$ perl -spe '$len = length $search; $n = 0; # restart the scan on every line
while (($pos = index($_, $search, $n)) > -1) {
    substr($_, $pos, $len) = "replacement";
    $n = $pos + length "replacement"; # continue after the inserted text
}' <<<"$string_to_replace" -- -search="$search"
some other stuff replacement some more
The -p switch tells perl to loop through each line of the variable $string_to_replace (which could easily be replaced by a file). -s allows options to be passed to the script - in this case, I've passed a shell variable containing the search string.
For each line of the file, the while loop runs through all of the matches of the search string. substr is used on the left hand of the assignment to replace a substring of $_, which refers to the current line being processed.
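If perl is not an option, bash itself can do a literal replacement: quoting the search string inside ${var//pattern/} makes bash treat it as plain text rather than as a glob. The sketch below deletes every occurrence, as the sed attempt in the question tried to do. remove_literal is a made-up name, and the sketch assumes the search text sits on a single line, as described in the question:

```shell
#!/usr/bin/env bash
# remove_literal is a hypothetical helper: it deletes every literal
# occurrence of its first argument from each line of standard input.
remove_literal() {
  local search=$1 line result
  while IFS= read -r line || [ -n "$line" ]; do
    # Quoting "$search" inside the expansion makes bash match it literally
    # instead of interpreting it as a glob pattern.
    result=${line//"$search"/}
    printf '%s\n' "$result"
  done
}

search='abc$^%!#'
printf '%s\n' 'some other stuff abc$^%!# some more' | remove_literal "$search"
```

With the file names from the question, a call might look like remove_literal "$(cat pattern.txt)" < targetfile.txt > targetfile.new, followed by moving the new file over the original.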

Extract all characters after a match - shell script

I need to extract all characters after a pattern match.
For example ,
NAME=John
Age=16
I need to extract all characters after "=". Output should be like
John
16
I can't use perl or Jython for this because of some restrictions.
I tried with grep, but this is as far as I got:
echo "NAME=John" | grep -o -P '=.{0,}'
You were pretty close:
grep -oP '(?<=\w=)\w+' file
does it.
Explanation
It looks for any word that follows word= and prints it.
-o stands for "Print only the matched (non-empty) parts of a matching line, with each such part on a separate output line".
-P stands for "Interpret PATTERN as a Perl regular expression".
(?<=\w=)\w+ uses a lookbehind: it matches \w+ only when it directly follows a word character and =. More info in "Regex tutorial - Lookahead" and in this nice explanation by sudo_O.
Test
$ cat file
NAME=John
Age=16
$ grep -oP '(?<=\w=)\w+' file
John
16
One sed solution
sed -ne 's/.*=//gp' <filename>
another awk solution
awk -F= '$0=$2' <filename>
Explanation:
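In sed we remove everything from the beginning of the line through the last = and print the rest.
In awk we split the line into fields separated by =; the assignment $0=$2 then replaces the whole line with the second field, and awk prints the line because the assigned value is true. A value that is empty or 0 would not be printed; awk -F= '{print $2}' avoids that.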
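For completeness, the same extraction can be sketched in pure bash with a prefix-removal expansion, which needs no external tool. values_after_equals is a hypothetical name, and the sketch assumes one KEY=value pair per line:

```shell
#!/usr/bin/env bash
# values_after_equals is a hypothetical helper; it filters standard input.
values_after_equals() {
  local line
  while IFS= read -r line; do
    case $line in
      # ${line#*=} strips the shortest prefix ending in "=", i.e. through the first "="
      *=*) printf '%s\n' "${line#*=}" ;;
    esac
  done
}

printf '%s\n' 'NAME=John' 'Age=16' | values_after_equals
```

Unlike the sed command, which removes everything through the last =, this keeps everything after the first =, and it also prints a value of 0.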

Resources