Bash wget filter specific word

I want to filter a specific word from a website using wget. The word I want to filter out is "hPa", together with its value.
See: https://www.foreca.de/Deutschland/Berlin/Berlin
I can't find useful information on how to filter out a specific string. This is what I've tried so far:
#!/bin/bash
LAST=$(wget -l1 https://www.foreca.de/Deutschland/Berlin/Berlin -O - | sed -e 'hPa')
echo $LAST
Thanks for helping me out.

A fully fledged solution using XPath:
Command:
$ saxon-lint --html --xpath '//div[contains(text(), "hPa")]/text()' \
'https://www.foreca.de/Deutschland/Berlin/Berlin'
Output:
1026 hPa
Notes:
Don't parse HTML with regex; use a proper XML/HTML parser as we do here. Check: Using regular expressions with HTML tags
Check https://github.com/sputnick-dev/saxon-lint (my own project)
If what I wrote bores you and you just want a quick and dirty command, even if it's evil, then use:
curl -s https://www.foreca.de/Deutschland/Berlin/Berlin | grep -oP '\d+\s+hPa'
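To put that back into the shape of the original script, a minimal sketch assuming the page keeps rendering the pressure as a number followed by "hPa" (this is screen scraping, so it may break whenever the markup changes):
#!/bin/bash
# Fragile screen scraping: grab the first "<number> hPa" match from the page.
LAST=$(curl -s 'https://www.foreca.de/Deutschland/Berlin/Berlin' | grep -oP '\d+\s*hPa' | head -n 1)
echo "$LAST"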

Grep title of a page which is written with spaces [duplicate]

I am trying to get the meta title of some websites. Some people write the title like
<title>AllHeart Web INC, IT Services Digital Solutions Technology
</title>
some like
<title>AllHeart Web INC, IT Services Digital Solutions Technology</title>
and some like
<title>
AllHeart Web INC, IT Services Digital Solutions Technology
</title>
There are more variants, but my current focus is on the three above.
I wrote a simple command, but it only captures the second variant, and I am not sure how I can grep the other ways:
curl -s https://allheartweb.com/ | grep -o '<title>.*</title>'
I also made an attempt (very bad, I guess) where I grep the line numbers, like
% curl -s https://allheartweb.com/ | grep -n '<title>'
7:<title>AllHeart Web INC, IT Services Digital Solutions Technology
% curl -s https://allheartweb.com/ | grep -n '</title>'
8:</title>
and then store them and loop over those lines to get the title, which I guess is a bad idea.
How can I get the title in all these cases?
Try this:
curl -s https://allheartweb.com/ | tr -d '\n' | grep -m 1 -oP '(?<=<title>).+?(?=</title>)'
You can remove newlines from the HTML via tr because they have no meaning inside the title. The grep then returns the first match of the shortest string enclosed in <title> and </title>.
This is quite a simple approach, of course. xmllint would be better, but it's not available on all platforms by default.
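If xmllint is available, its HTML parser handles all three variants; a minimal sketch (the 2>/dev/null hides the parser's complaints about real-world markup):
curl -s https://allheartweb.com/ | xmllint --html --xpath '//title/text()' - 2>/dev/null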
grep is not a very good tool for matching multiple lines, since it processes input line by line. You can hack around that by flattening the incoming text into one line, like
curl -s https://allheartweb.com/ | xargs | grep -o -E "<title>.*</title>"
This is probably what you want. (Note that xargs will choke on unbalanced quote characters in the HTML; tr -d '\n', as above, is the safer way to flatten.)
Try this sed:
curl -s https://allheartweb.com/ | sed -n "{/<title>/,/<\/title>/p}"
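One caveat: a sed range never ends on the line it starts on, so a one-line <title>...</title> makes that command print to the end of the file. awk's range operator can begin and end on the same record, so a hedged variant that also strips the tags and joins multi-line titles:
curl -s https://allheartweb.com/ | awk '/<title>/,/<\/title>/' | sed -e 's/.*<title>//' -e 's/<\/title>.*//' | tr -d '\n'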

how can I scrape data from reddit (in bash)

I want to scrape titles and dates from http://www.reddit.com/r/movies.json in bash.
wget -q -O - "http://www.reddit.com/r/movies.json" | grep -Po '(?<="title": ").*?(?=",)' | sed 's/\\"/"/g'
I have the titles, but I don't know how to add the dates. Can someone help?
As the extension suggests, it is a JSON (application/json) file, so grep and sed are poorly suited for working with it, as they are built around regular expressions. If you are allowed to install tools, jq should be handy here. Try using your system package manager to install it; if that succeeds, you should get a pretty-printed version of movies.json by doing
wget -q -O - "http://www.reddit.com/r/movies.json" | jq .
and then finding where the interesting values are placed, which should allow you to grab them. See the jq Cheat Sheet for examples of jq usage. If you are limited to already-installed tools, I suggest taking a look at the json module of Python.
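For instance, a sketch assuming the usual reddit listing layout (each post under .data.children[].data), printing every title with its UTC creation timestamp:
wget -q -O - "http://www.reddit.com/r/movies.json" | jq -r '.data.children[].data | "\(.created_utc)\t\(.title)"'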

Get title of an RSS feed with bash

How can I get the title of an RSS feed with Bash? Say I want to get the most recent article from MacRumors. Their RSS feed link is http://feeds.macrumors.com/MacRumors-All. How can I get the most recent article title with Bash?
An alternative to xmllint is xmlstarlet and so:
curl -s http://feeds.macrumors.com/MacRumors-All | xmlstarlet sel -t -m "/rss/channel/item[1]" -v "title"
Use the xmlstarlet sel command to select the XPath we are looking for, and then use -v to display a specific element's value.
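A sketch extending this to pull several fields from the newest item, assuming the feed keeps its standard RSS 2.0 layout (each -v/-n pair prints one value on its own line):
curl -s http://feeds.macrumors.com/MacRumors-All | xmlstarlet sel -t -m "/rss/channel/item[1]" -v "title" -n -v "pubDate" -n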
You can combine curl and an XPath expression (here, using xmllint), and rely on the fact that the feed is in reverse chronological order:
curl -s http://feeds.macrumors.com/MacRumors-All | xmllint --xpath '/rss/channel/item[1]/title/text()' -
See How to execute XPath one-liners from shell? for other ways to evaluate XPath.
In particular, if you have an older xmllint without --xpath, you may be able to use the technique suggested by this wrapper:
echo 'cat /rss/channel/item[1]/title/text()' | xmllint --shell <(curl -s http://feeds.macrumors.com/MacRumors-All)

reverse geocoding in bash

I have a GPS unit which extracts longitude and latitude and outputs it as a Google Maps link:
http://maps.googleapis.com/maps/api/geocode/xml?latlng=51.601154,-0.404765&sensor=false
From this I'd like to call it via curl and display the "short_name" on line 20:
"short_name" : "Northwood",
so I'd just like to be left with
Northwood
so something like
curl -s "http://maps.googleapis.com/maps/api/geocode/xml?latlng=51.601154,-0.404765&sensor=false" | sed short_name
Mmmm, this is kind of quick and dirty:
curl -s "http://maps.googleapis.com/maps/api/geocode/json?latlng=40.714224,-73.961452&sensor=false" | grep -B 1 "route" | awk -F'"' '/short_name/ {print $4}'
Bedford Avenue
It looks for the line before the line with "route" in it, then finds the word "short_name" and prints the 4th field, using " as the field separator. Really you should use a JSON parser though!
Notes:
This doesn't require you to install anything.
I look for the word "route" in the JSON because you seem to want the road name - you could equally look for anything else you choose.
This isn't a very robust solution as Google may not always give you a route, but I guess other programs/solutions won't work then either!
You can play with my solution by successively removing parts from the right hand end of the pipeline to see what each phase produces.
EDITED
Mmm, you have changed from JSON to XML, I see... well, this parses out what you want, but I note you are now looking for a locality, whereas before you were looking for a route or road name. Which do you want?
curl -s "http://maps.googleapis.com/maps/api/geocode/xml?latlng=51.601154,-0.404765&sensor=false" | grep -B1 locality | grep short_name | head -1 | sed -e 's/<\/.*//' -e 's/.*>//'
The grep -B1 looks for the line before the line containing "locality". The grep short_name then gets the locality's short name. The head -1 discards all but the first locality if there is more than one. The sed commands remove the <> XML delimiters.
This isn't text, it's structured JSON. You don't want the value after the colon on line 12; you want the value of short_name in the address_component with type 'route' from the result.
You could do this with jsawk or python, but it's easier to get it from XML output with xmlstarlet, which is lighter than python and more available than jsawk. Install xmlstarlet and try:
curl -s 'http://maps.googleapis.com/maps/api/geocode/xml?latlng=40.714224,-73.961452&sensor=false' \
| xmlstarlet sel -t -v '/GeocodeResponse/result/address_component[type="route"]/short_name'
This is much more robust than trying to parse JSON as plaintext.
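If you do want to stay with the JSON endpoint, a sketch with jq, assuming the standard Geocoding response layout (results[].address_components[] with a types array on each component):
curl -s 'http://maps.googleapis.com/maps/api/geocode/json?latlng=40.714224,-73.961452&sensor=false' | jq -r '.results[0].address_components[] | select(.types | index("route")) | .short_name'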
The following seems to work, assuming you always have the short_name at line 12:
curl -s 'http://maps.googleapis.com/maps/api/geocode/json?latlng=40.714224,-73.961452&sensor=false' | sed -n -e '12s/^.*: "\([a-zA-Z ]*\)",/\1/p'
or, if you are using the XML API and want to trap the short_name on line 19:
curl -s 'http://maps.googleapis.com/maps/api/geocode/xml?latlng=51.601154,-0.404765&sensor=false' | sed -n -e '19s/.*<short_name>\([a-zA-Z ]*\)<\/short_name>.*/\1/p'

Extract a specific string from a curl'd result

Given this curl command:
curl --user-agent "fogent" --silent -o page.html "http://www.google.com/search?q=insansiate"
(The spelling is intentionally incorrect; I want to grab Google's suggestion as my result.)
I want to be able to either grep into the page.html file, perhaps with grep -oE, or pipe it right from curl and never store a file.
The result should be: 'instantiate'
I need only the word 'instantiate', or whatever phrase Google is auto-correcting to.
Here is the basic HTML that is returned:
<span class=spell style="color:#cc0000">Did you mean: </span><a href="/search?hl=en&ie=UTF-8&&sa=X&ei=VEMUTMDqGoOINraK3NwL&ved=0CB0QBSgA&q=instantiate&spell=1"class=spell><b><i>instantiate</i></b></a> <span class=std>Top 2 results shown</span>
So perhaps from/to of the string below, which I hope is unique enough to cover all my bases:
class=spell><b><i>instantiate</i></b></a>
I keep running into issues with greedy grep; perhaps I should run it through an HTML prettifier first to get a line break or 50 in there. I don't know of any simple way to do that in bash, which is what I would ideally like this to be in. I really don't want to deal with firing up Perl and making sure I have the correct module.
Any suggestions? Thank you.
As I'm sure you're aware, screen scraping is a delicate business. This command sequence is no exception, since it relies on the specific structure of the page, which could change at any time without notice.
grep -o 'Did you mean:\([^>]*>\)\{5\}' page.html | sed 's/.*<i>\([^<]*\)<.*/\1/'
In a pipe:
curl --user-agent "fogent" --silent "http://www.google.com/search?q=insansiate" | grep -o 'Did you mean:\([^>]*>\)\{5\}' | sed 's/.*<i>\([^<]*\)<.*/\1/'
This relies on finding five ">" characters between "Did you mean:" and the "</i>" after the word you're looking for.
Have you considered other methods of getting spelling suggestions or are you specifically interested in what Google provides?
If you have ispell or aspell installed, you can do:
echo insansiate | ispell -a
and parse the result.
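In ispell's -a (pipe) mode, suggestion lines start with "&" and list the candidates after a colon; a sketch grabbing the first suggestion (aspell speaks the same protocol via aspell -a):
echo insansiate | ispell -a | awk -F': ' '/^&/ { split($2, s, ", "); print s[1]; exit }'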
xidel is a great utility for scraping web pages; it supports retrieving pages and extracting information in various query languages (CSS selectors, XPath).
In the case at hand, the simple CSS selector a.spell will do the trick.
xidel --user-agent "fogent" "http://google.com/search?q=insansiate" -e 'a.spell'
Note how xidel does its own page retrieval, so no need for curl in this case.
If, however, you needed curl for more exotic retrieval options, here's how you'd combine the two tools (line break for readability):
curl --user-agent "fogent" --silent "http://google.com/search?q=insansiate" |
xidel - -e 'a.spell'
Another option: curl --> tidy -asxml --> xmlstarlet sel.
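A sketch of that pipeline: tidy converts the tag soup to well-formed XHTML, and the -N mapping is needed because tidy puts its output in the XHTML namespace (warnings are sent to /dev/null):
curl --user-agent "fogent" --silent "http://www.google.com/search?q=insansiate" | tidy -asxml -q 2>/dev/null | xmlstarlet sel -N x="http://www.w3.org/1999/xhtml" -t -v '//x:a[@class="spell"]'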
Edit: Sorry, I did not see your notice about Perl. For reference anyway:
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
my $arg = shift // 'insansiate';
my $lwp = LWP::UserAgent->new(agent => 'Mozilla');
my $c = $lwp->get("http://www.google.com/search?q=$arg");
die $c->status_line unless $c->is_success;
# Split the page on colons and scan each chunk for the <b><i>...</i></b> suggestion.
my @content = split(/:/, $c->content);
for (@content) {
    if (m;<b><i>(.+)</i></b>;) {
        print "$1\n";
        exit;
    }
}
Running:
> perl google.pl
instantiate
> perl google.pl disconect
disconnect
