How to write a script that will use regex to output only the heading and paragraph text from the http://example.com website - bash

I am a beginner in scripting and I am working on a bash script for my work.
For this task I tried the sed command, which didn't work.

For your problem, the following would work:
#!/bin/bash
curl -s http://example.com/ | grep -P "\s*\<h1\>.*\<\/h1\>" |sed -n 's:.*<h1>\(.*\)</h1>.*:\1:p'
curl -s http://example.com/ | grep -P "\s*\<p\>.*\<\/p\>" |sed -n 's:.*<p>\(.*\)</p>.*:\1:p'
The first line fetches the page with curl, uses grep to select the line containing <h1>...</h1> (assuming there is only one, as in your example), and then uses sed to extract the first capturing group, \(.*\), and print it via \1.
The second line does the same for the <p> tag.
Both could be crammed into a single grep, but these two lines work fine.
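As a minimal sketch of the same sed substitution, the two lines can be folded into one small function that takes the tag name as an argument. The function name extract_tag and the sample HTML are assumptions made for illustration; the sample only stands in for the output of curl -s http://example.com/ so the snippet runs without network access.

```shell
#!/bin/bash
# extract_tag TAG: read HTML on stdin and print the text between <TAG> and
# </TAG>, provided both tags sit on the same line (hypothetical helper).
extract_tag() {
  local tag=$1
  sed -n "s:.*<${tag}>\(.*\)</${tag}>.*:\1:p"
}

# Stand-in for the page content (assumed, not fetched from the real site).
sample='<html>
<body>
<h1>Example Domain</h1>
<p>This domain is for use in examples.</p>
</body>
</html>'

printf '%s\n' "$sample" | extract_tag h1
printf '%s\n' "$sample" | extract_tag p
```

With a live page, the printf would be replaced by curl -s http://example.com/ piped into extract_tag.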
Edit:
If a <p> element spans several lines, the commands above won't match it; in that case you may have to use pcregrep:
curl -s http://example.com/ | pcregrep -M "\s*\<p\>(\n|.)*\<\/p\>"
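If pcregrep is not installed, a sed address range is a different technique that can cover the multi-line case. The sketch below is an alternative to the pcregrep call, not an equivalent of it: it prints every line from one containing <p> through the next one containing </p>, then strips the tags. The function name paragraph_text and the sample input are assumptions. One known limit: when <p> and </p> sit on the same line, the range keeps running until a later line contains </p>.

```shell
# paragraph_text: print the text of <p>...</p> blocks that span several
# lines (hypothetical helper; reads HTML on stdin).
paragraph_text() {
  sed -n '/<p>/,/<\/p>/p' | sed 's/<[^>]*>//g'
}

# Assumed sample with a paragraph broken across two lines.
sample='<div>
<p>This paragraph starts here
and ends on the next line.</p>
</div>'

printf '%s\n' "$sample" | paragraph_text
```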

You can use the following one liner :
curl -s http://example.com/ | sed -n '2,$p' > /tmp/tempfile && cat /tmp/tempfile | xmllint --xpath '/html/head/title/text()' - && echo ; cat /tmp/tempfile | xmllint --xpath '/html/body/div/p/text()' -
This uses xmllint's --xpath option to extract the text inside the <title> and <p> elements. The sed -n '2,$p' step drops the first line of the response before it is parsed.

Related

Using sed to replace tabs if input is not guaranteed to contain tabs?

I'm trying to extract a list of names from a website using sed, but I'm not sure how to go about replacing the tab characters separating them.
This code:
curl -s "https://namnidag.se/?year=2022&month=9&day=12" | sed -nE -e "s#<div class='names'>([^<]*)</div>#\1#p" | html2text
gives me the names for September 12th, but they are separated by a tab character:
Åsa Åslög
If I change the sed script to replace tabs with comma and space, like this:
curl -s "https://namnidag.se/?year=2022&month=9&day=12" | sed -nE -e "s#<div class='names'>([^<]*)</div>#\1#" -e 's/\t/, /p' | html2text
it works as expected:
Åsa, Åslög
However, if I try on a day that only has one name, such as September 13th:
curl -s "https://namnidag.se/?year=2022&month=9&day=13" | sed -nE -e "s#<div class='names'>([^<]*)</div>#\1#" -e 's/\t/, /p' | html2text
I get no output; the first sed script without the tab replacement works fine in this case though. What am I doing wrong here?
I'm using GNU sed 4.8, if that helps.
Thanks!
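You need to remove the p flag from the tab substitution. With -n, sed prints a line only when an s command carrying the p flag actually makes a substitution, so a line without a tab produces no output. Keep p on the first substitution and replace the tab in a separate sed call: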
curl -s "https://namnidag.se/?year=2022&month=9&day=12" | sed -nE -e "s#<div class='names'>([^<]*)</div>#\1#p" | sed -e 's/\t/, /'
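The effect of the p flag can be reproduced offline. In this sketch the two sample strings stand in for what the first substitution leaves behind; the single name is a placeholder and the function names are made up for the demonstration. The tab is written with bash's $'...' quoting so the script does not rely on sed understanding \t.

```shell
# Sample lines (assumed): one day with two tab-separated names, one with a single name.
two_names=$'Åsa\tÅslög'
one_name='Sture'   # placeholder for a day with only one name

# Broken variant: with -n, a line is printed only if the tab substitution succeeds.
tab_to_comma_broken() { sed -n -e $'s/\t/, /p'; }

# Fixed variant: no -n and no p flag, so every line is printed, changed or not.
tab_to_comma() { sed -e $'s/\t/, /g'; }

printf '%s\n' "$two_names" | tab_to_comma          # Åsa, Åslög
printf '%s\n' "$one_name"  | tab_to_comma_broken   # prints nothing
printf '%s\n' "$one_name"  | tab_to_comma          # Sture
```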
curl -s "https://namnidag.se/?year=2022&month=9&day=12" > f1
cat > ed1 <<EOF
71W f2
q
EOF
ed -s f1 < ed1
cat f2 | tail -c +20 | head -c -6 > file
rm -v ./ed1
rm -v ./f2
This will give you the names, whether there are two of them or not; and if there are, you can just separate them with cut.

User input into variables and grep a file for pattern

Hi!
So I am trying to run a script which looks for a string pattern.
For example, from a file I want to find 2 words, located separately
"I like toast, toast is amazing. Bread is just toast before it was toasted."
I want to invoke it from the command line using something like this:
./myscript.sh myfile.txt "toast bread"
My code so far:
text_file=$1
keyword_first=$2
keyword_second=$3
find_keyword=$(cat $text_file | grep -w "$keyword_first""$keyword_second" )
echo $find_keyword
I have tried a few different ways. Directly from the command line I can make it run using:
cat myfile.txt | grep -E 'toast|bread'
I'm trying to put the user input into variables and use the variables to grep the file
You seem to be looking simply for
grep -E "$2|$3" "$1"
What works on the command line will also work in a script, though you will need to switch to double quotes for the shell to replace variables inside the quotes.
In this case, the -E option can be replaced with multiple -e options, too.
grep -e "$2" -e "$3" "$1"
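A small self-contained sketch of the multiple -e form, wrapped in a function so the positional parameters play the same roles as in the script. The function name match_either, the temporary file, and its three lines are assumptions for the demonstration.

```shell
#!/bin/bash
# match_either FILE WORD1 WORD2: print the lines of FILE that contain
# either word (hypothetical helper; case-sensitive like plain grep).
match_either() {
  grep -e "$2" -e "$3" -- "$1"
}

# Assumed demo file, removed again when the script exits.
demo=$(mktemp)
trap 'rm -f "$demo"' EXIT
printf '%s\n' \
  'I like toast, toast is amazing.' \
  'Bread is just toast before it was toasted.' \
  'Nothing relevant on this line.' > "$demo"

# Prints the first two lines: both contain "toast".
match_either "$demo" toast bread
```

Because the variables are double-quoted, a keyword containing spaces is still passed to grep as one pattern.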
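You can pipe to grep twice. Note that this prints only the lines that contain both words, whereas grep -E 'toast|bread' prints lines that contain either one:
find_keyword=$(grep -w "$keyword_first" "$text_file" | grep -w "$keyword_second")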
Note that your search word "bread" is not found because the string contains the uppercase "Bread". If you want to find the words regardless of this, you should use the case-insensitive option -i for grep:
find_keyword=$(grep -w -i "$keyword_first" "$text_file" | grep -w -i "$keyword_second")
In a full script:
#!/bin/bash
#
# usage: ./myscript.sh myfile.txt "toast" "bread"
text_file=$1
keyword_first=$2
keyword_second=$3
# Quote the variables so file names or keywords with spaces survive intact.
find_keyword=$(grep -w -i "$keyword_first" "$text_file" | grep -w -i "$keyword_second")
echo "$find_keyword"
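To make the behaviour of the chained greps concrete, here is a sketch run against a made-up three-line file. The helper name match_both and the file contents are assumptions. Only the line that contains both words as whole words, ignoring case, is printed.

```shell
#!/bin/bash
# match_both FILE WORD1 WORD2: print the lines of FILE that contain both
# words as whole words, ignoring case (hypothetical helper).
match_both() {
  grep -w -i -- "$2" "$1" | grep -w -i -- "$3"
}

# Assumed demo file, removed again when the script exits.
demo=$(mktemp)
trap 'rm -f "$demo"' EXIT
printf '%s\n' \
  'I like toast, toast is amazing.' \
  'Bread is just toast before it was toasted.' \
  'Nothing relevant on this line.' > "$demo"

# Prints only the second line: it has both "Bread" and "toast".
match_both "$demo" toast bread
```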

How to use sed to extract text from a webpage

Hey, I'm using a combination of sed and curl to extract some text from the webpage example.com.
Here is my code:
curl -s http://example.com | sed -n -e 's/.*<h1>\(.*\)<\/h1>.*<p>\(This.*\)<\/p>/\1 \n \2/p'
however, I don't get any output. What could I be doing wrong?
Although sed is generally not the right tool for extracting text from web pages, it may be sufficient for simple tasks. sed is line-oriented, so each line is handled separately; a pattern that expects <h1> and <p> on the same line never matches when they sit on different lines.
If you really want to do it with sed, this will give you some output:
curl -s http://example.com | sed -n -e 's/.*<h1>\(.*\)<\/h1>/\1 \n/p' -e 's/<p>\(This.*\)/\1 \n/p'
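The difference between the two approaches can be checked without network access. In this sketch the page variable is an assumed stand-in for the relevant part of the response, and the function names are made up. The single pattern spanning both tags prints nothing because no one line contains both elements, while one expression per tag prints each piece.

```shell
# Assumed stand-in for the relevant part of the page.
page='<div>
    <h1>Example Domain</h1>
    <p>This domain is for use in illustrative examples.</p>
</div>'

# One pattern covering both tags: never matches, since sed sees one line at a time.
one_pattern() {
  sed -n 's/.*<h1>\(.*\)<\/h1>.*<p>\(This.*\)<\/p>/\1 \2/p'
}

# One expression per tag: each line is tested against both substitutions.
per_tag() {
  sed -n -e 's/.*<h1>\(.*\)<\/h1>.*/\1/p' -e 's/.*<p>\(This.*\)<\/p>.*/\1/p'
}

printf '%s\n' "$page" | one_pattern   # prints nothing
printf '%s\n' "$page" | per_tag       # prints the heading, then the paragraph
```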

Parsing an href tag in a website with bash shell

I have a web page with a single URL inside, in an href attribute.
I need to parse the page and keep the href value.
There is just one href on this page, and it has no class name.
I'm using a bash shell with curl.
For now, I tried this:
curl http://MyWebsite | grep "href=" | cut -d '>' -f4 | cut -d '<' -f1
but I get no result. I'm a novice with the bash shell.
Does anyone have an idea? Thanks for your answers.
If you want to keep the href= part
curl -s http://MyWebsite | grep -E -io 'href="[^\"]+"'
If you only want URL without the href=
curl -s http://MyWebsite | grep -E -io 'href="[^\"]+"' | awk -F\" '{print$2}'
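The same grep and awk pair can be tried on a literal string first. In this sketch the HTML snippet, its URL, and the function name extract_href are assumptions; grep -o prints only the matching part, and awk then splits it on the double quote and keeps the second field.

```shell
# extract_href: read HTML on stdin and print the value of each
# double-quoted href attribute (hypothetical helper).
extract_href() {
  grep -E -io 'href="[^"]+"' | awk -F'"' '{ print $2 }'
}

# Assumed sample page fragment with a single link.
html='<p>More information: <a href="https://www.example.org/docs">docs</a></p>'

printf '%s\n' "$html" | extract_href
```

This only handles double-quoted attribute values; single-quoted or unquoted values would need a different pattern.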
I know that there is only a single href, but just in case... you can also extract URLs from all anchors inside an HTML document with sed and grep:
curl -s http://MyWebsite | grep -o '<a .*href=.*>' | sed -e 's/<a /\n<a /g' | sed -e 's/<a .*href=['"'"'"]//' -e 's/["'"'"'].*$//' -e '/^$/ d'

Removing text in unix shell

Sorry, I'm pretty new to coding. I'm just trying to remove the CST that follows the end of the string. The final output that I'm trying to get says "Sunset: 4:38 PM CST". Exclude the quotation marks.
Here is the code that I'm using within the shell.
curl http://m.wund.com/US/MN/Winona.html | grep 'Sunset' | sed -e :a -e 's/<[^>]*>//g;/</N;//ba' | sed -e 's/Sunset/Sunset: /g' | sed -e 's/PST//g'
Just change:
... | sed -e 's/PST//g'
to
... | sed -e 's/CST//g'
You might also want to invoke curl -s instead of plain curl to suppress the progress output.
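A slightly stricter variation, sketched on a literal string: anchoring the pattern to the end of the line removes the zone abbreviation together with the space in front of it, and leaves any other occurrence of the same letters alone. The function name strip_zone and the sample line are assumptions.

```shell
# strip_zone ZONE: remove a trailing " ZONE" from each input line
# (hypothetical helper; ZONE is a time zone abbreviation such as CST).
strip_zone() {
  sed -e "s/ $1\$//"
}

# Assumed sample of what the earlier stages of the pipeline produce.
printf '%s\n' 'Sunset: 4:38 PM CST' | strip_zone CST
```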
