bash (grep|awk|sed) - Extract domains from a file - bash

I need to extract domains from a file.
domains.txt:
eofjoejfej fjpejfe http://ejej.dm1.com dêkkde
ojdoed www.dm2.fr doejd eojd oedj eojdeo
http://dm3.org ieodhjied oejd oejdeo jd
ozjpdj eojdoê jdeojde jdejkd http://dm4.nu/
io d oed 234585 http://jehrhr.dm5.net/hjrehr
[2014-05-31 04:05] eohjpeo jdpiehd pe dpeoe www.dm6.uk/jehr
I need to get:
dm1.com
dm2.fr
dm3.org
dm4.nu
dm5.net
dm6.uk

Try this sed command,
$ sed -r 's/.*(dm[^\.]*\.[^/ ]*).*/\1/g' file
dm1.com
dm2.fr
dm3.org
dm4.nu
dm5.net
dm6.uk

This is a bit long, but should work:
grep -oE "http[^ ]*|www[^ ]*" file | sed -e 's|http://||g' -e 's/^www\.//g' -e 's|/.*$||g' -re 's/^.*\.([^\.]+\.[^\.]+$)/\1/g'
Output:
dm1.com
dm2.fr
dm3.org
dm4.nu
dm5.net
dm6.uk

Unrefined method using grep and sed:
grep -oE '[[:alnum:]]+[.][[:alnum:]_.-]+' file | sed 's/^www\.//'
Outputs:
ejej.dm1.com
dm2.fr
dm3.org
dm4.nu
jehrhr.dm5.net
dm6.uk

An answer with gawk:
LC_ALL=C gawk -v RS="[[:space:]]+" -v FS="." '
{
# Remove the http prefix if it exists
sub( /http:[/][/]/, "" )
# Remove the path
sub( /[/].*$/, "" )
# Does it look like a domain?
if ( /^([[:alnum:]]+[.])+[[:alnum:]]+$/ ) {
# Print the last 2 components of the domain name
print $(NF-1) "." $NF
}
}' file
Some notes:
Using RS="[[:space:]]+" lets us process each whitespace-separated word as a separate record.
LC_ALL=C forces [[:alnum:]] to be ASCII-only (this is not necessary any more with gawk 4+).
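The same split, clean, filter, print idea can be sketched with standard tools instead of gawk. This is only an illustration: the function name last_two_labels is made up here, and like the gawk answer it keeps just the last two labels, so a multi-label suffix such as co.uk would be cut incorrectly.

```shell
#!/usr/bin/env bash
# Sketch: print the last two dot-separated labels of every host-like word.
# last_two_labels is a hypothetical helper name, not part of any answer above.
last_two_labels() {
  tr -s '[:space:]' '\n' |                          # one word per line
    sed -e 's|^https\{0,1\}://||' -e 's|/.*$||' |   # drop scheme, then path
    grep -E '^([[:alnum:]-]+\.)+[[:alpha:]]+$' |    # keep words that look like host names
    awk -F. '{ print $(NF-1) "." $NF }'             # last two labels
}

printf '%s\n' 'abc http://ejej.dm1.com xyz' 'foo www.dm2.fr bar' 'x http://dm4.nu/' |
  last_two_labels
# dm1.com
# dm2.fr
# dm4.nu
```

Reading from standard input keeps the helper usable both on a file (last_two_labels < domains.txt) and at the end of a pipeline.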

To strip subdomains reliably you have to validate the names against a list of known TLDs first, because simply cutting a fixed number of dot-separated columns would also cut into the TLDs. This takes three steps.
Step 1: clean domains.txt
grep -oiE '([a-zA-Z0-9][a-zA-Z0-9-]{1,61}\.){1,}(\.?[a-zA-Z]{2,}){1,}' domains.txt | sed -r 's:(^\.*?(www|ftp|ftps|ftpes|sftp|pop|pop3|smtp|imap|http|https)[^.]*?\.|^\.\.?)::gi' | sort -u > capture
Contents of capture:
ejej.dm1.com
dm2.fr
dm3.org
dm4.nu
jehrhr.dm5.net
dm6.uk
Step 2: download and filter TLD list:
wget https://raw.githubusercontent.com/publicsuffix/list/master/public_suffix_list.dat
grep -v "//" public_suffix_list.dat | sed '/^$/d; /#/d' | grep -v -P "[^a-z0-9_.-]" | sed 's/^\.//' | awk '{print "." $1}' | sort -u > tlds.txt
You now have two lists (capture and tlds.txt).
Step 3: Download and run this python script:
wget https://raw.githubusercontent.com/maravento/blackweb/master/bwupdate/tools/parse_domain_tld.py && chmod +x parse_domain_tld.py && python parse_domain_tld.py | sort -u
out:
dm1.com
dm2.fr
dm3.org
dm4.nu
dm5.net
dm6.uk
Source: blackweb
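The idea behind the suffix-list approach can be sketched in a few lines of bash. This is not the linked Python script; it is a minimal illustration that assumes a tiny hard-coded suffix list (the real one would come from tlds.txt), and the name registrable_domain is made up for the example.

```shell
#!/usr/bin/env bash
# Sketch: reduce a host name to "one label + longest known suffix".
# The suffix list is a small hard-coded sample, not the real public suffix list.
suffixes=(com fr org co.uk uk)

registrable_domain() {
  local host=$1 best='' s
  for s in "${suffixes[@]}"; do
    # A suffix only counts on a label boundary; the longest match wins.
    if [[ $host == *".$s" && ${#s} -gt ${#best} ]]; then
      best=$s
    fi
  done
  [[ -n $best ]] || return 1         # no known suffix: report failure
  local rest=${host%".$best"}        # everything left of the suffix
  printf '%s.%s\n' "${rest##*.}" "$best"
}

registrable_domain ejej.dm1.com      # dm1.com
registrable_domain www.dm6.co.uk     # dm6.co.uk
```

Picking the longest matching suffix is what keeps co.uk intact, where a plain "last two labels" rule would return co.uk itself.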

This can be useful:
grep -Pho "(?<=http://)[^(\"|'|[:space:])]*" file.txt | sed 's/^www\.//' | grep -Eo '[[:alnum:]]{1,}\.[[:alnum:]]{1,}[.]{0,1}[[:alnum:]]{0,}' | sort | uniq
The first grep extracts whatever follows http://, stopping at a quote or whitespace, so a URL such as 'http://www.example.com' enclosed in single or double quotes is matched without the scheme. sed then removes a leading 'www.'. The second grep keeps two or three alphanumeric blocks separated by '.'. Finally, sort | uniq orders the output so that each domain appears only once.

Related

Remove everything before a string with bash?

I'm doing this with ffmpeg :
ffmpeg -i /Users/petaire/GDrive/Taff/ASI/Bash/testFolder/SilenceAndBlack.mp4 -af silencedetect=d=2 -f null - 2>&1 | grep silence_duration
And my output is :
[silencedetect @ 0x7f9e6940eba0] silence_end: 25.92 | silence_duration: 25.936
But I only want to keep the duration number, so I'm trying to remove everything before the last number.
I've never understood anything about sed/awk & co, so I don't know the best way to do that. I thought grep would be powerful enough, but it doesn't seem so.
Any idea?
Using awk to print the last field:
$ awk '{print $NF}'
Test it:
$ echo "[silencedetect @ 0x7f9e6940eba0] silence_end: 25.92 | silence_duration: 25.936"| awk '{print $NF}'
25.936
Or use sed to delete everything up to and including the last space:
$ ... | sed 's/.* //'
You can change your grep command to
grep -oP '(?<=silence_duration: )\S+'
which prints only the field that follows the label you searched for.
To remove everything before the last number, you can use
grep -o "[^ ]*$"
Another option, grep -o with cut:
$ echo '[silencedetect @ 0x7f9e6940eba0] silence_end: 25.92 | silence_duration: 25.936' \
| grep -o 'silence_duration: [0-9]*\.[0-9]*' | cut -d ' ' -f 2
25.936
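Since the question asks about bash itself, the same trimming can also be done with parameter expansion and no external command. A small sketch, assuming the line is already in a variable; the helper name after_label is made up for the example:

```shell
#!/usr/bin/env bash
# Sketch: strip everything before a value using only parameter expansion.
line='silence_end: 25.92 | silence_duration: 25.936'

# "##* " removes the longest prefix that ends in a space.
echo "${line##* }"                      # 25.936

# Hypothetical helper: print what follows the last occurrence of a label.
after_label() {
  printf '%s\n' "${1##*"$2"}"
}

after_label "$line" 'silence_duration: '   # 25.936
after_label "$line" 'silence_end: '        # 25.92 | silence_duration: 25.936
```

The expansion works on one string at a time, so for multi-line ffmpeg output it would sit inside a while read -r line loop.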

Find unique words

Suppose there is one file.txt in which below content text is written:
ABC/xyz
ABC/xyz/rst
EFG/ghi
I need to write a shell script that can extract the first unique word before the first /.
So as output, I want ABC and EFG to be written in one file.
You can extract the first word with cut (slash as delimiter), then pipe to sort with the -u (for "unique") option:
$ cut -d '/' -f 1 file.txt | sort -u
ABC
EFG
To get the output into a file, just redirect by appending > filename to the command. (Or pipe to tee filename to see the output and get it in a file.)
Try this :
cat file.txt | tr -s "/" ' ' | awk -F " " '{print $1}' | sort | uniq > outfile.txt
Another interesting variation:
awk -F'/' '{print $1 |" sort -u" }' file.txt > outfile.txt
Not that it matters here, but being able to pipe and redirect within awk can be very handy.
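Another easy way (uniq only removes adjacent duplicates, so this relies on equal prefixes being next to each other in the file):
cut -d"/" -f1 file.txt|uniq > out.txt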
You can use a mix of cut and sort like so:
cut -d '/' -f 1 file.txt | sort -u > newfile.txt
The cut command grabs everything up to the first slash / on each line.
The sort -u command sorts the result and removes any duplicate strings, and the redirection writes it to newfile.txt.
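If the original order matters, or the duplicates are not adjacent, awk can do the de-duplication in a single pass without sorting. A short sketch; the function name first_unique is made up here:

```shell
#!/usr/bin/env bash
# Sketch: print each distinct first "/"-separated field once, in first-seen order.
# seen[$1]++ is 0 (false) the first time a key appears, so !seen[$1]++ is true only then.
first_unique() {
  awk -F/ '!seen[$1]++ { print $1 }' "$@"
}

printf '%s\n' ABC/xyz EFG/ghi ABC/xyz/rst | first_unique
# ABC
# EFG
```

Unlike a bare uniq, this also handles input where the repeated prefixes are separated by other lines.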

Create name/value pairs based on file output

I'd like to format the output of cat myFile.txt in the form of:
app1=19
app2=7
app3=20
app4=19
Using some combination of piping output through various commands.
What would be easiest way to achieve this?
I've tried using cut -f2 but this does not change the output, which is odd.
Here is the basic command/file output:
[user@hostname ~]$ cat myFile.txt
1402483560882 app1 19
1402483560882 app2 7
1402483560882 app3 20
1402483560882 app4 19
Based on your input:
awk '{ print $2 "=" $3 }' myFile.txt
Output
app1=19
app2=7
app3=20
app4=19
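Another solution, using sed and cut. For the input as shown, with no leading whitespace, the name is the second '='-separated field; if the lines begin with whitespace, an empty first field appears and -f 3- is needed instead:
cat myFile.txt | sed 's/ \+/=/g' | cut -f 2- -d '='
Or using tr and cut:
cat myFile.txt | tr -s ' ' '=' | cut -f 2- -d '='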
You could try this sed oneliner also,
$ sed 's/^\s*[^ ]*\s\([^ ]*\)\s*\(.*\)$/\1=\2/g' file
app1=19
app2=7
app3=20
app4=19
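As for why cut -f2 left the output unchanged: cut splits on TAB by default, and a line that contains no delimiter is printed whole, so space-separated lines pass through untouched. A pure bash alternative lets read do the splitting on whitespace. This is a sketch and the function name to_pairs is made up:

```shell
#!/usr/bin/env bash
# Sketch: turn "timestamp name value" lines into name=value pairs.
to_pairs() {
  local _ts name value
  # read splits on runs of spaces/tabs; the first field (timestamp) is discarded.
  while read -r _ts name value; do
    if [[ -n $name ]]; then
      printf '%s=%s\n' "$name" "$value"
    fi
  done
}

printf '%s\n' '1402483560882 app1 19' '1402483560882 app2 7' | to_pairs
# app1=19
# app2=7
```

With the file from the question this would be called as to_pairs < myFile.txt.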

grep multiple pattern with regex

Here is the text:
this is text this is text this is text this is text pattern_abc"00a"this is text this is text this is text this is textthis is text this is text pattern_def"001b"this is text this is text
in the output, I would like:
00a
001b
note: The values I look for are of random length and contents
I use 2 expressions:
exp_1 = grep -oP "(?<=pattern_abc\")[^\"]*"
exp_2 = grep -oP "(?<=pattern_def\")[^\"]*"
egrep does not work (I got "egrep: egrep can only use the egrep pattern syntax")
I try:
cat test | exp_1 && exp_2
cat test | (exp_1 && exp_2)
cat test | exp_1 | exp_2
cat test | (exp_1 | exp_2)
and lastly:
grep -oP "((?<=pattern_abc\")[^\"]* \| (?<=pattern_def\")[^\"]*)" test
grep -oP "((?<=pattern_abc\")[^\"]* | (?<=pattern_def\")[^\"]*)" test
Any idea?
thank you very much !
You can use this grep,
grep -oP "(?<=pattern_(abc|def)\")[^\"]*" file
You can use awk like this:
awk -F\" '{for (i=2;i<NF;i+=2) print $i}' file
00a
001b
If the pattern_* prefix is important, you can use this command, which needs an awk that accepts a regular expression as RS, such as GNU awk:
awk -v RS="pattern_(abc|def)" -F\" 'NR>1{print $2}' file
00a
001b
And another method through grep with Perl-regex option,
$ grep -oP '\"\K[^\"]*(?="this)' file
00a
001b
It works only if the string you want to match is followed by "this.
OR
You could use the below command which combines the two search patterns,
$ grep -oP 'pattern_abc"\K[^"]*|pattern_def"\K[^"]*' file
00a
001b
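Where grep has no -P option, a similar result can be had with an extended regex and cut, provided grep supports -o (GNU and BSD grep do). A sketch under that assumption; the function name quoted_values is made up:

```shell
#!/usr/bin/env bash
# Sketch: match "pattern_abc" or "pattern_def" plus the quoted value,
# then keep only what sits between the first pair of double quotes.
quoted_values() {
  grep -oE 'pattern_(abc|def)"[^"]*' "$@" | cut -d '"' -f 2
}

printf '%s\n' 'text pattern_abc"00a"this is text pattern_def"001b"this is text' |
  quoted_values
# 00a
# 001b
```

Each grep -o match looks like pattern_abc"00a, so the second quote-delimited field is exactly the value.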

How do I sed/grep the last word in a filename?

I have a couple of filenames for different languages. I need to grep or sed just the language part. I am using gconftool-2 -R / and want to pipe a command to bring out just the letters with the language.
active = file.so,sv.xml
active = file.so,en_GB.xml
active = file.so,en_US.xml
I need the sv and en_GB part of the file. How can I do that in the most effective way? I am thinking of something like gconftool-2 -R / | sed -n -e '/active =/p' -e '/\.$/' but then I get stuck as I don't know how to print just the part I need and not the whole line.
awk -F. '{print $(NF-1)}'
NF is the number of fields, awk counts from 1 so the 2nd to last field is NF-1.
The -F. says that fields are separated by "." rather than whitespace
How about using simple cut
cut -d. -f3 filename
Test:
[jaypal:~/Temp] cat filename
active = file.so.sv.xml
active = file.so.en_GB.xml
active = file.so.en_US.xml
[jaypal:~/Temp] cut -d. -f3 filename
sv
en_GB
en_US
Based on the updated input:
[jaypal:~/Temp] cat filename
active = file.so,sv.xml
active = file.so,en_GB.xml
active = file.so,en_US.xml
[jaypal:~/Temp] cut -d, -f2 filename | sed 's/\..*//g'
sv
en_GB
en_US
OR
Using awk:
[jaypal:~/Temp] awk -F[,.] '{print $3}' filename
sv
en_GB
en_US
[jaypal:~/Temp] awk -F[,.] '{print $(NF-1)}' filename
sv
en_GB
en_US
OR
Using grep and tr:
[jaypal:~/Temp] egrep -o ",\<.[^\.]*\>" filename | tr -d ,
sv
en_GB
en_US
awk would be my main tool for this task but since that has already been proposed, I'll add a solution using cut instead
cut -d. -f3
i.e. use . as delimiter and select the third field.
Since you tagged the question with bash, I'll add a pure bash solution as well:
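#!/usr/bin/env bash
# Split each line on "." and print the third field (array index 2).
# IFS is set only for the read command, so the rest of the script is unaffected.
while IFS=. read -r -a fields; do
    echo "${fields[2]}"
done < file_name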
Try:
gconftool-2 -R / | grep '^active = ' | sed 's,\.[^.]\+$,,; s,.*\.,,'
The first sed command says to remove a dot followed by everything not a dot until the end of line; the second one says to remove everything until the last dot.
This might work for you:
gconftool-2 -R / | sed -n 's/^active.*,\([^.]*\).*/\1/p'
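For the comma-separated form of the input, the language code can also be cut out with two parameter expansions in bash. A minimal sketch, assuming each line has the shape shown in the question; the function name lang_of is made up:

```shell
#!/usr/bin/env bash
# Sketch: extract the part between the last comma and the following dot.
lang_of() {
  local tail=${1##*,}            # drop everything through the last comma
  printf '%s\n' "${tail%%.*}"    # drop everything from the first dot on
}

lang_of 'active = file.so,sv.xml'       # sv
lang_of 'active = file.so,en_GB.xml'    # en_GB
```

To apply it to the real data, the lines from gconftool-2 -R / would be filtered with grep '^active = ' and fed to the function one at a time in a while read -r loop.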

Resources