How to use sed to extract text from a webpage - bash

Hey, I'm using a combination of curl and sed to extract some text from the webpage example.com. Here is my code:
curl -s http://example.com | sed -n -e 's/.*<h1>\(.*\)<\/h1>.*<p>\(This.*\)<\/p>/\1 \n \2/p'
However, I don't get any output. What could I be doing wrong?

Although sed is generally not the right tool for extracting text from web pages, it may be sufficient for simple tasks. sed is a line-oriented tool, so each line is handled separately, and a pattern that spans two input lines will never match.
If you really want to do it with sed, this will give you some output:
curl -s http://example.com | sed -n -e 's/.*<h1>\(.*\)<\/h1>/\1 \n/p' -e 's/<p>\(This.*\)/\1 \n/p'
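The line-oriented behavior is easy to demonstrate with a synthetic two-line input (the tags and text here are placeholders, not the real page):

```shell
# sed reads input line by line, so a single substitution cannot match
# text that is split across two lines:
printf '<h1>t</h1>\n<p>This x</p>\n' |
  sed -n 's/.*<h1>\(.*\)<\/h1>.*<p>\(This.*\)<\/p>/\1 \2/p'   # no output

# the same pattern fires once everything is on one line:
printf '<h1>t</h1><p>This x</p>\n' |
  sed -n 's/.*<h1>\(.*\)<\/h1>.*<p>\(This.*\)<\/p>/\1 \2/p'   # prints "t This x"
```

This is why the original one-liner produces nothing when the <h1> and <p> are on different lines of the downloaded HTML.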

Related

Bash script generates line break - Why?

I made a small bash script which gets the current air pressure from a website and writes it into a variable. After the air pressure I would like to add the current date and write everything into a text file. The target should be a kind of CSV file.
My problem: I always get a line break between the air pressure and the date. Attempts to remove the line break with sed or tr '\n' have failed.
My second guess: wget finishes "too late" and the echo has already run. So I tried it with && between all commands. Same result.
The operating system is Linux. Where is my thinking error?
I can't get any further right now. Thanks in advance.
Sven
PS: These are my first attempts with sed. This can certainly be written more nicely ;)
#!/bin/bash
luftdruck=$(wget 'https://www.fg-wetter.de/aktuelle-messwerte/' -O aktuell.html && cat aktuell.html | grep -A 0 hPa | sed -e 's/<[^>]*>//g' | sed -e '/^-/d' | sed -e '/title/d' | sed -e 's/ hPa//g')
datum=$(date)
echo -e "${luftdruck} ${datum}" >> ausgabe.txt
Replace sed -e 's/ hPa//g') with sed -e 's/ hPa//g' | dos2unix) to replace the trailing carriage return (DOS/Windows) with a line feed (Unix/Linux).
The HTML file you download uses Windows line endings (carriage return \r + line feed \n). I assume your script only removes the \n, but the editor you are using to view the file shows the \r as a line break.
Therefore, you could pipe everything through tr -d '\r\n', which would remove all line breaks.
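A quick way to see the effect, using a hypothetical pressure value with a Windows line ending:

```shell
# the value ends in \r\n (Windows line ending); tr deletes both characters
printf '1013\r\n' | tr -d '\r\n'   # prints 1013 with no line break at all
```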
But there is a better alternative: Extract only the important part instead of whole lines.
luftdruck=$(
wget 'https://www.fg-wetter.de/aktuelle-messwerte/' -O - |
grep -o '[^>]*hPa' | tr -dc 0-9.
)
echo "$luftdruck $(date)" >> ausgabe.txt

How to write a script that will use regex to output only the heading and paragraph text from the http://example.com website

I am a beginner in scripting and I am working on bash scripting for my work.
For this task I tried the sed command, which didn't work.
For your problem, the following would work:
#!/bin/bash
curl -s http://example.com/ | grep -P "\s*\<h1\>.*\<\/h1\>" |sed -n 's:.*<h1>\(.*\)</h1>.*:\1:p'
curl -s http://example.com/ | grep -P "\s*\<p\>.*\<\/p\>" |sed -n 's:.*<p>\(.*\)</p>.*:\1:p'
The first line scrapes the page via curl and grep for the <h1>..</h1> part (assuming there's only one, as in your example) and uses sed to extract the first capturing group ( \(.*\) ) via \1.
The second line does the same, but for the <p> tag.
I could cram these two lines into one grep, but they'll work fine as they are!
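To see the capturing group in isolation, here it is run on example.com's known <h1> text (note that with : as the sed delimiter, the / in </h1> needs no escaping):

```shell
printf '<h1>Example Domain</h1>\n' |
  sed -n 's:.*<h1>\(.*\)</h1>.*:\1:p'   # \1 holds the text between the tags
```

This prints just "Example Domain", with the surrounding markup discarded.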
Edit:
If a <p> tag ends on a different line, the above wouldn't work; you may have to use pcregrep:
curl -s http://example.com/ | pcregrep -M "\s*\<p\>(\n|.)*\<\/p\>"
You can use the following one liner :
curl -s http://example.com/ | sed -n '2,$p' > /tmp/tempfile &&
  cat /tmp/tempfile | xmllint --xpath '/html/head/title/text()' - && echo
cat /tmp/tempfile | xmllint --xpath '/html/body/div/p/text()' -
The sed -n '2,$p' drops the first line (the doctype declaration, which xmllint's XML parser would otherwise reject); xmllint's --xpath option then extracts the text within the <title> and <p> tags.

Appending query result from a shell script to a csv file

So I'm querying to get my GPS location.
The shell script command is like this:
curl -s http://whatismycountry.com/ | sed -n 's/.*Coordinates \(.*\)<.*/\1/p'
Then to save the coordinates to a .csv file I write:
curl -s http://whatismycountry.com/ | sed -n 's/.*Coordinates \(.*\)<.*/\1/p' | sed -e 's/.*\[\([^ ]*\) \([^]]*\)\].*/\1,\2/' > cordinates.csv
which gives me a .csv file with the coordinates.
(image of the .csv file pattern)
Now the query runs in an infinite loop, and the intent is that every time it queries, it should save the new coordinates to the next block below.
(something like this)
How do I write the regex part of the previous command to do that in the script?
Thanks for any help. I'm totally a noob at regex.
This has nothing to do with regex. You should use >> to append to the file instead of >, which rewrites the file every time.
So your command becomes:
curl -s http://whatismycountry.com/ |\
sed -n 's/.*Coordinates \(.*\)<.*/\1/p' |\
sed -e 's/.*\[\([^ ]*\) \([^]]*\)\].*/\1,\2/' >> cordinates.csv
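The difference between the two redirections is easy to demonstrate with a throwaway file:

```shell
tmp=$(mktemp)
echo 'first'  >  "$tmp"   # > truncates: the file now holds only "first"
echo 'second' >> "$tmp"   # >> appends: "second" is added below it
cat "$tmp"                # shows both lines
rm -f "$tmp"
```

Inside your infinite loop, >> is what keeps each new coordinate pair from overwriting the last one.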

wget grep sed to extract links and save them to a file

I need to download all page links from http://en.wikipedia.org/wiki/Meme and save them to a file all with one command.
First time using the command line, so I'm unsure of the exact commands, flags, etc. to use. I only have a general idea of what to do and had to search around to learn what href means.
wget http://en.wikipedia.org/wiki/Meme -O links.txt | grep 'href=".*"' | sed -e 's/^.*href=".*".*$/\1/'
The output of the links in the file does not need to be in any specific format.
Using gnu grep:
grep -Po '(?<=href=")[^"]*' links.txt
or with wget
wget http://en.wikipedia.org/wiki/Meme -q -O - |grep -Po '(?<=href=")[^"]*'
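The (?<=href=") part is a lookbehind: it anchors the match just after href=" without including it in the output, so only the URL itself is printed. For example, on a made-up anchor tag:

```shell
printf '<a href="/wiki/Meme">Meme</a>\n' |
  grep -Po '(?<=href=")[^"]*'   # prints /wiki/Meme
```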
You could use wget's spider mode. See this SO answer for an example.
wget http://en.wikipedia.org/wiki/Meme -O links.txt | sed -n 's/.*href="\([^"]*\)".*/\1/p'
but this only takes one href per line; if there is more than one, the others are lost (same as with your original line). You also forgot to include a group (\( -> \)) in your original sed pattern, so \1 refers to nothing.
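One way around the one-match-per-line limit (a sketch, not the only option) is to split the input so each tag lands on its own line before sed runs:

```shell
printf '<a href="a.html">1</a> <a href="b.html">2</a>\n' |
  tr '>' '\n' |                          # break the input at every >, one tag per line
  sed -n 's/.*href="\([^"]*\)".*/\1/p'   # prints a.html, then b.html
```

With both anchors originally on one line, the plain sed version would have printed only the last href.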

Unable to get sed to replace commas with a word in my CSV

Hello, I am using bash to create a CSV file by extracting data from an HTML file using grep. The problem is that after getting the data, when I use sed to take the commas out of it and put in a word like My_com, it goes crazy on me. Here is my code:
time=$(grep -oP 'data-context-item-time=.*.data-context-item-views' index.html \
| cut -d'"' -f2)
title=$(grep -oP 'data-context-item-title=.*.data-context-item-id' index.html |\
cut -d'"' -f2)
sed "s/,/\My_commoms/g" $title
echo "$user,$views,$time,$title" >> test
I keep getting this error:
sed: can't read Flipping: No such file or directory
sed: can't read the: No such file or directory
and so on.
Any advice on what's wrong with my code?
You can't use sed on text directly on the command line like that; sed expects file names, so it is reading your text as file names. Try this for your second-to-last line:
echo $title | sed 's/,/My_com/g'
That way sed sees the text on a file (stdin in this case). Also note that I've used single quotes in the argument to sed; in this case I don't think it makes any difference, but in general it is good practice to make sure bash doesn't mess with the command at all.
If you don't want to use the echo | sed chain, you might also be able to rewrite it like this:
sed 's/,/My_com/g' <<< "$title"
I think that only works in bash, not dash etc. This is called a 'here-string': bash passes the text to the right of the <<< to the command on its stdin, so you get the same effect.
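Putting it together with a hypothetical $title value like the one in your error messages:

```shell
title='Flipping the switch, again'   # hypothetical value for illustration
echo "$title" | sed 's/,/My_com/g'   # prints: Flipping the switchMy_com again
```

The comma is replaced in the text itself, and sed never tries to open the words as files.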
