grep multiple patterns with regex - bash

Here is the text:
this is text this is text this is text this is text pattern_abc"00a"this is text this is text this is text this is textthis is text this is text pattern_def"001b"this is text this is text
In the output, I would like:
00a
001b
Note: the values I look for are of arbitrary length and content.
I use 2 expressions:
exp_1 = grep -oP "(?<=pattern_abc\")[^\"]*"
exp_2 = grep -oP "(?<=pattern_def\")[^\"]*"
egrep does not work (I get "egrep: egrep can only use the egrep pattern syntax").
I tried:
cat test | exp_1 && exp_2
cat test | (exp_1 && exp_2)
cat test | exp_1 | exp_2
cat test | (exp_1 | exp_2)
and lastly:
grep -oP "((?<=pattern_abc\")[^\"]* \| (?<=pattern_def\")[^\"]*)" test
grep -oP "((?<=pattern_abc\")[^\"]* | (?<=pattern_def\")[^\"]*)" test
Any idea?
Thank you very much!

You can use this grep:
grep -oP "(?<=pattern_(abc|def)\")[^\"]*" file

You can use awk like this (with " as the field separator, the quoted values land in the even-numbered fields):
awk -F\" '{for (i=2;i<NF;i+=2) print $i}' file
00a
001b
If the pattern_* prefix is important, you can use this GNU awk (GNU-only because of the regex RS):
awk -v RS="pattern_(abc|def)" -F\" 'NR>1{print $2}' file
00a
001b

And another method, using grep with the Perl-regex option:
$ grep -oP '\"\K[^\"]*(?="this)' file
00a
001b
It works only if the string you want to match is followed by "this.
OR
You could use the command below, which combines the two search patterns; \K discards everything matched so far from the reported match, acting like a variable-length lookbehind:
$ grep -oP 'pattern_abc"\K[^"]*|pattern_def"\K[^"]*' file
00a
001b
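If -P is not available (or you are stuck with egrep, which rejects Perl syntax, as the error message in the question shows), an ERE fallback is to match the prefix too and strip it afterwards (an untested sketch):
$ grep -oE 'pattern_(abc|def)"[^"]*' file | sed 's/.*"//'
00a
001b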

Related

remove whitespace from piped output

In a text file I have some tags with the notation :foo. To get an overview of my tags in the file, I want to get a listing of all these tags.
This is done via
grep -o -e ":[a-z]*\( \|$\)" file.txt | sort | uniq
Now I get duplicates because of the whitespace or newline character at the end.
:movie <-- only newline
:movie <-- whitespace and newline
:read
:read
I want to avoid the duplicates, but I could not figure out how. I tried appending | tr -d '[:space:]', but that only concatenates all of the piped output...
Example of file.txt:
Avengers: Infinity War :movie
Yojimbo 1961 :movie nippon
Some test lines (there is a trailing space after the first :space; you can see it if you highlight the data with your mouse):
$ cat file
with :space
with :space too
without :space
test: this
With grep, sort and uniq:
$ grep -o ":[a-z]\+" file | sort | uniq
:space
With awk (well, gawk and mawk at least):
$ awk 'BEGIN{RS="[" FS "|" RS "]+"}/:[a-z]/&&!a[$0]++' file
:space
Each word is its own record, and we pick the first instance of every colon-starting word. RS="[" FS "|" RS "]+" could be written otherwise, but this form emphasizes that records are split on any combination of FS and RS.
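To see what that combined separator expands to with the default FS and RS, you can print it from a BEGIN block (a quick check, not part of the original answer):
$ awk 'BEGIN{print "[" FS "|" RS "]+"}'
[ |
]+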
You can use Perl regexp and word matching:
grep -oP ':\w+' file.txt | sort | uniq
or, just match non-space characters:
grep -o ':[^ ]*' file.txt | sort | uniq
Since you haven't provided a sample Input_file, I couldn't test this (and I don't have zsh here). Try the following and let me know if it helps.
awk '/:[a-z]*/{sub(/ +$/,"");} !a[$0]++' Input_file | sort
You can try with sed
sed 's/.*\(:[a-z]*\).*/\1/' file.txt | sort | uniq
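As a side note, each sort | uniq pipeline above can be shortened to sort -u, since no duplicate counts are needed, e.g.:
grep -oP ':\w+' file.txt | sort -u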

Match multiple patterns with grep and print only the matched patterns

I have a file that looks like
..<long-text>..."field1":"some-value"...<long-text>...."field2":"some-value"...
..<long-text>..."field1":"some-value"...<long-text>...."field2":"some-value"...
..<long-text>..."field1":"some-value"...<long-text>...."field2":"some-value"...
I want to extract field1 and field2 from each line of the file in bash, with field1 and field2 appearing on the same line for each input line. So the output should look like:
"field1":"some-value" "field2":"some-value"
"field1":"some-value" "field2":"some-value"
"field1":"some-value" "field2":"some-value"
I wrote a grep expression like -
grep -E '"field1":"[a-z]*".*"field2":"[a-z]*"' -o
But because of the .* in between, it prints all the text between those two expressions as well. I also tried:
grep -E '"field1":"[a-z]*"|"field2":"[a-z]*"' -o
But this outputs all the field1s and field2s on separate lines.
How do I get the expected output?
You can use grep with awk to format the result:
grep -oE '"(field1|field2)":"[^"]*"' file | awk 'NR%2{p=$0; next} {print p, $0}'
"field1":"some-value" "field2":"some-value"
"field1":"some-value" "field2":"some-value"
"field1":"some-value" "field2":"some-value"
use sed:
echo abcdef | sed 's/\(.\).*\(.\)/\1\2/'
# yields: af
for your situation:
sed 's/.*\("field1":"[a-z]*"\).*\("field2":"[a-z]*"\).*/\1 \2/' yourfile
if some lines don't match at all, then do your grep first, e.g.,
grep -Eo '"field1":"[a-z]*".*"field2":"[a-z]*"' yourfile |
sed 's/.*\("field1":"[a-z]*"\).*\("field2":"[a-z]*"\).*/\1 \2/'

Extracting multiple lines of data between two delimiters

I have a log file containing multiple lines of data. I need to extract all the lines between the delimiters and save them to the output file.
input.log
Some data
<delim_begin>ABC<delim_end>
some data
<delim_begin>DEF<delim_end>
some data
The output.log file should look like
ABC
DEF
I tried this code, but it does not work; it prints all the content of input.log (without -n, sed passes through the non-matching lines too):
sed 's/<delim_begin>\(.*\)<delim_end>/\1/g' input.log > output.log
Using awk you can do it with a custom field separator (the NF>2 guard skips lines that do not contain both delimiters):
awk -F '<(delim_begin|delim_end)>' 'NF>2{print $2}' file
ABC
DEF
Using grep -P (PCRE):
grep -oP '(?<=<delim_begin>).*(?=<delim_end>)' file
ABC
DEF
A sed alternative; -n suppresses the default printing and p prints only the lines where the substitution succeeded:
$ sed -nr 's/<delim_begin>(.*)<delim_end>/\1/p' file
ABC
DEF
This should do it (though it prints an empty line for every input line without delimiters):
cat file | awk -F '<(delim_begin|delim_end)>' '{print $2}'
You can use this command:
cat file | grep "<delim_begin>.*<delim_end>" | sed 's/<delim_begin>//g' | sed 's/<delim_end>//' > output.log

bash (grep|awk|sed) - Extract domains from a file

I need to extract domains from a file.
domains.txt:
eofjoejfej fjpejfe http://ejej.dm1.com dêkkde
ojdoed www.dm2.fr doejd eojd oedj eojdeo
http://dm3.org ieodhjied oejd oejdeo jd
ozjpdj eojdoê jdeojde jdejkd http://dm4.nu/
io d oed 234585 http://jehrhr.dm5.net/hjrehr
[2014-05-31 04:05] eohjpeo jdpiehd pe dpeoe www.dm6.uk/jehr
I need to get:
dm1.com
dm2.fr
dm3.org
dm4.nu
dm5.net
dm6.uk
Try this sed command:
$ sed -r 's/.*(dm[^\.]*\.[^/ ]*).*/\1/g' file
dm1.com
dm2.fr
dm3.org
dm4.nu
dm5.net
dm6.uk
This is a bit long, but should work:
grep -oE "http[^ ]*|www[^ ]*" file | sed -e 's|http://||g' -e 's/^www\.//g' -e 's|/.*$||g' -re 's/^.*\.([^\.]+\.[^\.]+$)/\1/g'
Output:
dm1.com
dm2.fr
dm3.org
dm4.nu
dm5.net
dm6.uk
Unrefined method using grep and sed:
grep -oE '[[:alnum:]]+[.][[:alnum:]_.-]+' file | sed 's/www.//'
Outputs:
ejej.dm1.com
dm2.fr
dm3.org
dm4.nu
jehrhr.dm5.net
dm6.uk
An answer with gawk:
LC_ALL=C gawk -v RS="[[:space:]]+" -v FS="." '
{
# Remove the http prefix if it exists
sub( /http:[/][/]/, "" )
# Remove the path
sub( /[/].*$/, "" )
# Does it look like a domain?
if ( /^([[:alnum:]]+[.])+[[:alnum:]]+$/ ) {
# Print the last 2 components of the domain name
print $(NF-1) "." $NF
}
}' file
Some notes:
Using RS="[[:space:]]+" allows us to process each group of letters independently.
LC_ALL=C forces [[:alnum:]] to be ASCII-only (this is not necessary any more with gawk 4+).
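For the sample domains.txt this should print (my reading of the script against the question's data, not output taken from the original answer):
dm1.com
dm2.fr
dm3.org
dm4.nu
dm5.net
dm6.uk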
To be able to remove subdomains you have to validate them first, because cutting columns blindly would mangle multi-part TLDs. This takes three steps.
Step 1: clean domains.txt
grep -oiE '([a-zA-Z0-9][a-zA-Z0-9-]{1,61}\.){1,}(\.?[a-zA-Z]{2,}){1,}' domains.txt | sed -r 's:(^\.*?(www|ftp|ftps|ftpes|sftp|pop|pop3|smtp|imap|http|https)[^.]*?\.|^\.\.?)::gi' | sort -u > capture
Contents of capture:
ejej.dm1.com
dm2.fr
dm3.org
dm4.nu
jehrhr.dm5.net
dm6.uk
Step 2: download and filter TLD list:
wget https://raw.githubusercontent.com/publicsuffix/list/master/public_suffix_list.dat
grep -v "//" public_suffix_list.dat | sed '/^$/d; /#/d' | grep -v -P "[^a-z0-9_.-]" | sed 's/^\.//' | awk '{print "." $1}' | sort -u > tlds.txt
So far you have two lists (capture and tlds.txt)
Step 3: Download and run this python script:
wget https://raw.githubusercontent.com/maravento/blackweb/master/bwupdate/tools/parse_domain_tld.py && chmod +x parse_domain_tld.py && python parse_domain_tld.py | sort -u
Output:
dm1.com
dm2.fr
dm3.org
dm4.nu
dm5.net
dm6.uk
Source: blackweb
This can be useful:
grep -Pho "(?<=http://)[^(\"|'|[:space:])]*" file.txt | sed 's/www.//g' | grep -Eo '[[:alnum:]]{1,}\.[[:alnum:]]{1,}[.]{0,1}[[:alnum:]]{0,}' | sort | uniq
The first grep matches URLs like 'http://www.example.com' (optionally enclosed in single or double quotes) but extracts only the host part. Then sed removes 'www.'. The second grep extracts domain names as dot-separated blocks of alphanumeric characters. Finally the output is sorted and deduplicated so each domain appears only once.

bash scripting removing optional <Integer><colon> prefix

I have a list whose content looks like this:
1:NetworkManager-0.9.9.0-28.git20131003.fc20.x86_64
avahi-0.6.31-21.fc20.x86_64
2:irqbalance-1.0.7-1.fc20.x86_64
abrt-addon-kerneloops-2.1.12-2.fc20.x86_64
mdadm-3.3-4.fc20.x86_64
I need to remove the N: prefix but leave the rest of the string as is.
I have tried:
cat service-rpmu.list | sed -ne "s/#[#:]\+://p" > end.list
cat service-rpmu.list | egrep -o '#[#:]+' > end.list
both result in an empty end.list
(The N: just denotes an epoch version.)
With sed:
sed 's/^[0-9]\+://' your.file
Output:
NetworkManager-0.9.9.0-28.git20131003.fc20.x86_64
avahi-0.6.31-21.fc20.x86_64
irqbalance-1.0.7-1.fc20.x86_64
abrt-addon-kerneloops-2.1.12-2.fc20.x86_64
mdadm-3.3-4.fc20.x86_64
By the way, your list looks like the output of a grep command with the -n option. If that is the case, just omit -n there. It is also likely that your whole task can be done with a single sed command.
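For instance, the same substitution in ERE form (GNU sed -E), writing straight to the output file from the question (a sketch, not from the original answer):
sed -E 's/^[0-9]+://' service-rpmu.list > end.list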
awk -F: '{ sub(/^.*:/,""); print}' sample
Here is another way with awk:
awk -F: '{print $NF}' service-rpmu.list
