Parse an href tag in a website with bash shell

I have a web page with one URL inside it, in an href attribute.
I need to parse the page and keep only the "href" value.
The page contains just one "href", and its tag has no class name.
I am using a bash shell with curl.
So far, I have tried this:
curl http://MyWebsite | grep "href=" | cut -d '>' -f4 | cut -d '<' -f1
but I get no result. I am a novice with the bash shell.
Does anyone have an idea? Thanks for your answers.

If you want to keep the href= part:
curl -s http://MyWebsite | grep -E -io 'href="[^\"]+"'
If you only want the URL, without the href= part:
curl -s http://MyWebsite | grep -E -io 'href="[^\"]+"' | awk -F\" '{print$2}'
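For reference, here is a small sketch that wraps the same idea in a function and also accepts single-quoted values. The function name extract_hrefs and the sample HTML are made up for illustration; in practice you would pipe curl -s http://MyWebsite into it. It assumes the href value is quoted.

```shell
#!/usr/bin/env bash
# extract_hrefs (hypothetical helper): read HTML on stdin and print each
# quoted href value on its own line. Unquoted values are not handled.
extract_hrefs() {
  grep -Eio "href=(\"[^\"]*\"|'[^']*')" |
    sed -E "s/^[^=]*=[\"']//; s/[\"']\$//"
}

# Made-up sample standing in for: curl -s http://MyWebsite
sample='<html><body><p>See <a href="http://example.com/page">this</a></p></body></html>'
printf '%s\n' "$sample" | extract_hrefs
```

The grep step isolates each href="..." token, and the sed step strips everything up to the opening quote and then the closing quote.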

I know that there is only a single href, but just in case... you can also extract URLs from all anchors inside an HTML document with sed and grep:
curl -s http://MyWebsite | grep -o '<a .*href=.*>' | sed -e 's/<a /\n<a /g' | sed -e 's/<a .*href=['"'"'"]//' -e 's/["'"'"'].*$//' -e '/^$/ d'

Related

Using sed to replace tabs if input is not guaranteed to contain tabs?

I'm trying to extract a list of names from a website using sed, but I'm not sure how to go about replacing the tab characters separating them.
This code:
curl -s "https://namnidag.se/?year=2022&month=9&day=12" | sed -nE -e "s#<div class='names'>([^<]*)</div>#\1#p" | html2text
gives me the names for September 12th, but they are separated by a tab character:
Åsa Åslög
If I change the sed script to replace tabs with comma and space, like this:
curl -s "https://namnidag.se/?year=2022&month=9&day=12" | sed -nE -e "s#<div class='names'>([^<]*)</div>#\1#" -e 's/\t/, /p' | html2text
it works as expected:
Åsa, Åslög
However, if I try on a day that only has one name, such as September 13th:
curl -s "https://namnidag.se/?year=2022&month=9&day=13" | sed -nE -e "s#<div class='names'>([^<]*)</div>#\1#" -e 's/\t/, /p' | html2text
I get no output; the first sed script without the tab replacement works fine in this case though. What am I doing wrong here?
I'm using GNU sed 4.8, if that helps.
Thanks!

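You need to remove the p flag from the tab substitution: with -n, that flag prints the line only when a tab was actually replaced, so a day with a single name prints nothing. Keep the p on the first substitution and replace the tab in a second sed: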
curl -s "https://namnidag.se/?year=2022&month=9&day=12" | sed -nE -e "s#<div class='names'>([^<]*)</div>#\1#p" | sed -e 's/\t/, /'
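If you prefer a single sed call, one possible sketch is to print with an explicit p command after both substitutions, so the line is printed whether or not a tab was found. The sample lines below are made up to stand in for the page's markup, the function name names is hypothetical, and \t in the pattern relies on GNU sed.

```shell
#!/usr/bin/env bash
# Made-up sample lines standing in for the page's markup (GNU sed needed for \t).
two=$(printf "<div class='names'>Åsa\tÅslög</div>")
one="<div class='names'>Åsa</div>"

names() {
  # Only act on the line holding the names; print it whether or not
  # the tab substitution matched.
  sed -nE "/<div class='names'>/{
    s#<div class='names'>([^<]*)</div>#\1#
    s/\t/, /g
    p
  }"
}

printf '%s\n' "$two" | names   # two names, joined with ", "
printf '%s\n' "$one" | names   # single name, still printed
```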

curl -s "https://namnidag.se/?year=2022&month=9&day=12" > f1
cat > ed1 <<EOF
71W f2
q
EOF
ed -s f1 < ed1
cat f2 | tail -c +20 | head -c -6 > file
rm -v ./ed1
rm -v ./f2
This will give you the names, whether there are two of them or not; and if there are, you can just separate them with cut.

Sed output a value between two matching strings in a url

I have multiple urls as input
https://drive.google.com/a/domain.com/file/d/1OR9QLGsxiLrJIz3JAdbQRACd-G9ZfL3O/view?usp=drivesdk
https://drive.google.com/a/domain.com/file/d/1sEWMFqGW9p2qT-8VIoBesPlVJ4xvOzXD/view?usp=drivesdk
How can I create a sed command that returns only the file ID?
desired output:
1OR9QLGsxiLrJIz3JAdbQRACd-G9ZfL3O
1sEWMFqGW9p2qT-8VIoBesPlVJ4xvOzXD
Looks like I need to start between /d/ and stop at /view but I'm not quite sure how to do that.
I've tried: sed -e 's/d\(.*\)\/view/\1/'

I was able to do this with cut -d '/' -f 8
also awk -F/ '{print $8}' file worked, thanks!

Your command was almost right:
# Wrong
sed -e 's/d\(.*\)\/view/\1/'
# Better: remove the unmatched parts too, including the / after the d
sed -e 's/.*d\/\(.*\)\/view.*/\1/'
# Better still: use # as the delimiter to make the command easier to read
sed -e 's#.*d/\(.*\)/view.*#\1#'
# Alternative: use cut when you don't know which field /d/ is in
some_stream | grep -Eo '/d/.*/view' | cut -d/ -f3
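A slightly stricter variant, as a sketch: anchor on the full /d/ segment and capture only characters that are not slashes, so the match cannot run past the ID. The function name file_id is made up; the URLs are the ones from the question.

```shell
#!/usr/bin/env bash
# file_id (hypothetical helper): print the path segment between /d/ and /view
# for each URL on stdin; lines without that shape are skipped (-n plus p).
file_id() {
  sed -nE 's#.*/d/([^/]+)/view.*#\1#p'
}

printf '%s\n' \
  'https://drive.google.com/a/domain.com/file/d/1OR9QLGsxiLrJIz3JAdbQRACd-G9ZfL3O/view?usp=drivesdk' \
  'https://drive.google.com/a/domain.com/file/d/1sEWMFqGW9p2qT-8VIoBesPlVJ4xvOzXD/view?usp=drivesdk' |
  file_id
```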

How to write a script that will use regex to output only the heading and paragraph text from the http://example.com website

I am a beginner in scripting and I am working on bash scripting for my work.
For this task I tried the sed command, which didn't work.

For your problem, the following would work:
#!/bin/bash
curl -s http://example.com/ | grep -P "\s*\<h1\>.*\<\/h1\>" |sed -n 's:.*<h1>\(.*\)</h1>.*:\1:p'
curl -s http://example.com/ | grep -P "\s*\<p\>.*\<\/p\>" |sed -n 's:.*<p>\(.*\)</p>.*:\1:p'
The first line fetches the page with curl, uses grep to select the <h1>..</h1> line (assuming there is only one, as in your example), and uses sed to extract the first capturing group, \(.*\), via \1.
The second line does the same for the <p> tag.
I could cram these two lines into one grep, but these will work fine!
Edit:
If a <p> element ends on a different line from where it starts, the above won't work; you may have to use pcregrep:
curl -s http://example.com/ | pcregrep -M "\s*\<p\>(\n|.)*\<\/p\>"
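If pcregrep is not available, another possible approach is to flatten the newlines before matching. The sketch below uses a made-up sample page and a hypothetical helper named tag_text. It only copes with plain elements that have no attributes and no nested tags, so treat it as an illustration, not a general HTML parser.

```shell
#!/usr/bin/env bash
# Made-up sample shaped like a simple page; stands in for: curl -s http://example.com/
html='<html>
<body>
<h1>Example Domain</h1>
<p>This domain is for use in
    illustrative examples.</p>
</body>
</html>'

# tag_text (hypothetical helper): $1 is a tag name, HTML comes on stdin.
# Newlines are flattened first so an element split across lines still matches.
# Elements with attributes or nested tags are not handled.
tag_text() {
  tr '\n' ' ' |
    grep -oE "<$1>[^<]*</$1>" |
    sed -E "s#</?$1>##g; s/^ +//; s/ +\$//; s/ +/ /g"
}

printf '%s\n' "$html" | tag_text h1
printf '%s\n' "$html" | tag_text p
```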

You can use the following one-liner:
curl -s http://example.com/ | sed -n '2,$p' > /tmp/tempfile && cat /tmp/tempfile | xmllint --xpath '/html/head/title/text()' - && echo ; cat /tmp/tempfile | xmllint --xpath '/html/body/div/p/text()' -
This uses xmllint's --xpath option to extract the text within the <title> and <p> tags.

Retrieve Domain name from a PHP variable

I have a PHP config file which I retrieved from SSH.
Here is the sample config file in PHP :
<?php
$url_root='https://google.fr';
$document_root='/usr/share/nginx/html';
The command I use to retrieve the url :
grep -oE '\$url_root=.*;' conf.php | tail -1 | sed 's/$url_root=//g;s/;//g'
Output:
'https://google.fr'
But I expect to retrieve only google.fr
Then I need to run this command line over ssh, like this:
domain=$(ssh -oStrictHostKeyChecking=no root@127.0.0.1 '
COMMAND HERE;
')

To accommodate unpredictable data (that is, complete URLs that include other routes or files, not only domain names), I would go for:
your_str='https://google.fr/somedir/someotherdir/index.html'
echo "$your_str" | cut -d'/' -f3
Output:
google.fr
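In your ssh command (the trailing tr removes the closing quote left over from the PHP string, which cut would otherwise keep):
'grep -oE '\''\$url_root=.*;'\'' conf.php | tail -1 | sed '\''s/$url_root=//g;s/;//g'\'' | cut -d'\''/'\'' -f3 | tr -d "'\''"'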

Try this:
DOMAIN_NAME=$(grep -oE '\$url_root=.*;' conf.php | tail -1| sed "s/\$url_root='//g;s/^[a-z]*:\/\///g;s/';//")
echo "Domain name is: $DOMAIN_NAME";
# ssh user@$DOMAIN_NAME etc...
The portion of code s/^[a-z]*:\/\///g; looks for zero or more lowercase letters at the start of the string followed by :// (the URL scheme, such as https://) and removes that prefix if it exists.
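As another option, here is a sketch that splits on the single quotes with awk and then takes the host with cut. The helper name url_root_domain is made up, and the sample config is the one from the question.

```shell
#!/usr/bin/env bash
# Sample config copied from the question.
conf=$(cat <<'EOF'
<?php
$url_root='https://google.fr';
$document_root='/usr/share/nginx/html';
EOF
)

# url_root_domain (hypothetical helper): PHP config on stdin, host name on stdout.
url_root_domain() {
  # Split on single quotes: field 2 is the quoted value. The last $url_root
  # line wins, which mirrors the tail -1 in the question.
  awk -F"'" '/^[ \t]*\$url_root=/ { url = $2 } END { if (url != "") print url }' |
    cut -d/ -f3   # scheme://host/path -> host
}

printf '%s\n' "$conf" | url_root_domain
```

Over ssh, one way to sidestep the nested quoting is to run only cat on the remote side and parse locally, for example: ssh root@127.0.0.1 'cat conf.php' | url_root_domain (file name as in the question).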

How to determine the latest major and full kernel version string as compactly as possible

So what I'm intending to do here is to determine both the latest major and the full kernel version string as compactly as possible (without a zillion pipes to grep).
I'm already quite content with the result but if anybody has any ideas how to squash the first line even the slightest it'd be very awesome (it has to work when there are no minor patches as well).
The index of kernel.org is only 36kB compared to the 136kB of that of http://www.kernel.org/pub/linux/kernel/v3.x/ so that's why I'm using it:
_major=$(curl -s http://www.kernel.org/ -o /tmp/kernel && cat /tmp/kernel | grep -A1 mainline | tail -1 | cut -d ">" -f3 | cut -d "<" -f1)
pkgver=${_major}.$(cat /tmp/kernel | grep ${_major} | head -1 | cut -d "." -f6)

It's just a thought exercise at this stage as the real answer is in the comments above, but here are some possible improvements.
Original:
_major=$(curl -s http://www.kernel.org/ -o /tmp/kernel && cat /tmp/kernel | grep -A1 mainline | tail -1 | cut -d ">" -f3 | cut -d "<" -f1)
Use tee instead of cat:
_major=$(curl -s http://www.kernel.org/ | tee /tmp/kernel | grep -A1 mainline | tail -1 | cut -d ">" -f3 | cut -d "<" -f1)
Use sed to minimise the number of pipes, and to make the command unreadable
_major=$(curl -s http://www.kernel.org/ | tee /tmp/kernel | sed -n '/ainl/,/<\/s/ s|.*>\([0-9\.]*\)</st.*|\1|p')
Cheap tricks: shorten the URL
_major=$(curl -s kernel.org | tee /tmp/kernel | sed -n '/ainl/,/<\/s/ s|.*>\([0-9\.]*\)</st.*|\1|p')

kernel.org provides a plaintext listing of all the current versions at https://www.kernel.org/finger_banner
For mainline:
curl -s https://www.kernel.org/finger_banner | grep mainline | awk '{print $NF}'
For latest stable:
curl -s https://www.kernel.org/finger_banner | grep -m1 stable | awk '{print $NF}'
The mainline and latest stable versions will never be EOL, but other versions often are, so the above awk commands will not work correctly for all versions. A general solution as a bash function:
latest_kernel() {
curl -s https://www.kernel.org/finger_banner | grep -m1 "$1" | sed -r 's/^.+: +([^ ]+)( .+)?$/\1/'
}
Examples:
$ latest_kernel mainline
4.18-rc2
$ latest_kernel stable
4.17.3
$ latest_kernel 4.16
4.16.18
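To see why the sed version is more general without hitting the network, here is a sketch that runs the same pipeline on a banner excerpt. The excerpt is an assumption: the versions come from the examples above, but the exact wording of the real file, and the trailing (EOL) marker on end-of-life entries, are guesses based on the regex. The function name version_from_banner is made up.

```shell
#!/usr/bin/env bash
# Assumed banner excerpt; the versions come from the examples above and the
# exact wording of the real file may differ.
banner='The latest mainline version of the Linux kernel is:    4.18-rc2
The latest stable version of the Linux kernel is:      4.17.3
The latest stable 4.16 version of the Linux kernel is: 4.16.18 (EOL)'

# Same pipeline as latest_kernel, but reading the banner from stdin.
version_from_banner() {
  grep -m1 -- "$1" | sed -E 's/^.+: +([^ ]+)( .+)?$/\1/'
}

printf '%s\n' "$banner" | version_from_banner 4.16            # version only
printf '%s\n' "$banner" | grep -m1 4.16 | awk '{print $NF}'   # last field is the marker, not the version
```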

You've got a useless use of cat. You can replace:
cat /tmp/kernel | grep -A1 mainline
with simply:
grep -A1 mainline /tmp/kernel
In your case, you don't even need the file at all. Curl by default will emit to standard output, so you can just do:
curl -s http://www.kernel.org/ | grep -A1 mainline

Expanding on @Justin Brewer's answer, you probably want to know when a kernel is EOL since this is useful information... the following single awk command preserves all this information for you.
latest_kernel() {
curl -s https://www.kernel.org/finger_banner |awk -F ':' -v search="$1" '{if ($1 ~ search) {gsub(/^[ ]+/, "", $2); print $2}}'
}
-F ':' -- field separator because everything after the : is the version string.
-v search="$1" -- pass search string as an awk internal variable
if statement -- check if field $1 matches the search string
gsub -- in-place modify of field $2 to strip leading spaces
Then just print field $2 for any matching records (I presume your search string will only match the left-hand side of one line... if it is important to exit after the first match, use print $2; exit)
Search string can include spaces, etc. Use of awk variables and matching with ~ variable instead of pattern-matching '.../'"$1"'/...' avoids the need to exit single-quote mode and avoids syntax errors where the search string contains "/".
