How to scrape Wikipedia GPS latitude/longitude? - shell

I have been wondering how it is possible to scrape information from Wikipedia. For example, I have a list of world cities and want to obtain their approximate latitude and longitude. Take Miami as an example. When I run curl https://en.wikipedia.org/wiki/Miami | grep -E '(latitude|longitude)', somewhere in the HTML there is a tag like the one below.
<span class="latitude">25°46′31″N</span> <span class="longitude">80°12′31″W</span>
I know I can extract it with some regex string, but I speak a very poor regexish. Can some of you help me on this?

With xidel and xpath:
$ xidel -se '
concat(
(//span[@class="latitude"]/text())[1],
" ",
(//span[@class="longitude"]/text())[1]
)
' 'https://en.wikipedia.org/wiki/Miami'
Output
25°46′31″N 80°12′31″W
Or
saxon-lint --html --xpath '<XPATH EXP>' <URL>
If you want most known tools:
curl -s 'https://en.wikipedia.org/wiki/Miami' > Miami.html
xmlstarlet format -H Miami.html 2>/dev/null | sponge Miami.html
xmlstarlet sel -t -v '<XPATH EXP>' Miami.html
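For example, filling in the placeholder with the span classes from the question (a sketch; it assumes those classes are still present in the page):
curl -s 'https://en.wikipedia.org/wiki/Miami' > Miami.html
xmlstarlet format -H Miami.html 2>/dev/null | sponge Miami.html
# first latitude span, a space, then the first longitude span
xmlstarlet sel -t -v '(//span[@class="latitude"])[1]' -o ' ' -v '(//span[@class="longitude"])[1]' -n Miami.html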
Not mentioned above, but regexes are not the right tool for parsing HTML.

You can't parse HTML with RegEx. Please use an HTML-parser like xidel instead:
$ xidel -s "https://en.wikipedia.org/wiki/Miami" -e '
(//span[@class="geo-dms"])[1],
(//span[@class="geo-dec"])[1],
(//span[@class="geo"])[1],
replace((//span[@class="geo"])[1],";",())
'
25°46′31″N 80°12′31″W
25.775163°N 80.208615°W
25.775163; -80.208615
25.775163 -80.208615
Take your pick.
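Since the question mentions a whole list of cities, a minimal loop over such a list could look like this (cities.txt is a hypothetical file, one Wikipedia article title per line):
# print each city next to its decimal coordinates, tab-separated
while IFS= read -r city; do
coords=$(xidel -s "https://en.wikipedia.org/wiki/$city" -e '(//span[@class="geo"])[1]')
printf '%s\t%s\n' "$city" "$coords"
done < cities.txt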

Related

Bash sed command issue

I'm trying to further parse an output file I generated using an additional grep command. The code that I'm currently using is:
#!/bin/bash
# fetches the links of the movie's imdb pages for a given actor
# fullname="USER INPUT"
read -p "Enter fullname: " fullname
if [ "$fullname" = "Charlie Chaplin" ]; then
code="nm0000122"
else
code="nm0000050"
fi
curl "https://www.imdb.com/name/$code/#actor" | grep -Eo 'href="/title/[^"]*' |
sed 's#^.*href=\"/#https://www.imdb.com/#g' |
sort -u > imdb_links.txt
# parses each of the links in the links file and gets the details for
# each movie. This is followed by the cleaning process
for i in $(cat imdb_links.txt)
do
curl $i |
html2text |
sed -n '/Sign_In/,$p'|
sed -n '/YOUR RATING/q;p' |
head -n-1 |
tail -n+2
done > imdb_all.txt
The sample generated output is:
EN
⁰
* Fully supported
* English (United States)
* Partially_supported
* Français (Canada)
* Français (France)
* Deutsch (Deutschland)
* हिंदी (भारत)
* Italiano (Italia)
* Português (Brasil)
* Español (España)
* Español (México)
****** Duck Soup ******
* 19331933
* Not_RatedNot Rated
* 1h 9m
IMDb RATING
7.8/10
How do I change the code to further parse the output so that I get only the data from the title of the movie up until the IMDb rating (in this case, from the line that contains the title 'Duck Soup' up until the end)?
Here is the code:
#!/bin/bash
# fullname="USER INPUT"
read -p "Enter fullname: " fullname
if [ "$fullname" = "Charlie Chaplin" ]; then
code="nm0000122"
else
code="nm0000050"
fi
rm -f imdb_links.txt
curl "https://www.imdb.com/name/$code/#actor" |
grep -Eo 'href="/title/[^"]*' |
sed 's#^href="#https://www.imdb.com#g' |
sort -u |
while read link; do
# uncomment the next line to save links into file:
#echo "$link" >>imdb_links.txt
curl "$link" |
html2text -utf8 |
sed -n '/Sign_In/,/YOUR RATING/ p' |
sed -n '$d; /^\*\{6\}.*\*\{6\}$/,$ p'
done >imdb_all.txt
Please(!) have a look at the following urls on why it's a really bad idea to parse HTML with sed:
RegEx match open tags except XHTML self-contained tags
Using regular expressions to parse HTML: why not?
Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms
The thing you're trying to do can be done with the HTML/XML/JSON parser xidel and with just 1 call!
In this example I'll use the IMDb page of Charlie Chaplin as the source.
Extract all 94 "Actor" IMDb movie URLs:
$ xidel -s "https://www.imdb.com/name/nm0000122" -e '
//div[@id="filmo-head-actor"]/following-sibling::div[1]//a/@href
'
/title/tt0061523/?ref_=nm_flmg_act_1
/title/tt0050598/?ref_=nm_flmg_act_2
/title/tt0044837/?ref_=nm_flmg_act_3
[...]
/title/tt0004288/?ref_=nm_flmg_act_94
There's no need to save these to a text-file. Just use -f (--follow) instead of -e and xidel will open all of them.
For the individual movie urls you could parse the HTML to get the text-nodes you want...
$ xidel -s "https://www.imdb.com/title/tt0061523/?ref_=nm_flmg_act_1" -e '
//h1,
//div[@class="sc-94726ce4-3 eSKKHi"]/ul/li[1]/span,
//div[@class="sc-94726ce4-3 eSKKHi"]/ul/li[3],
(//div[@class="sc-7ab21ed2-2 kYEdvH"])[1]
'
A Countess from Hong Kong
1967
2h
6.0/10
...but with those class-names I'd say that's a rather fragile endeavor. Instead I'd recommend parsing the JSON at the top of the HTML source within the <script>-node:
$ xidel -s "https://www.imdb.com/title/tt0061523/?ref_=nm_flmg_act_1" -e '
parse-json(//script[@type="application/ld+json"])/(
name,
datePublished,
duration,
aggregateRating/ratingValue
)
'
A Countess from Hong Kong
1967-03-15
PT2H
6
...or to get a similar output as above:
$ xidel -s "https://www.imdb.com/title/tt0061523/?ref_=nm_flmg_act_1" -e '
parse-json(//script[@type="application/ld+json"])/(
name,
year-from-date(date(datePublished)),
substring(lower-case(duration),3),
format-number(aggregateRating/ratingValue,"#.0")||"/10"
)
'
A Countess from Hong Kong
1967
2h
6.0/10
All combined:
$ xidel -s "https://www.imdb.com/name/nm0000122" \
-f '//div[@id="filmo-head-actor"]/following-sibling::div[1]//a/@href' \
-e '
parse-json(//script[@type="application/ld+json"])/(
name,
year-from-date(date(datePublished)),
substring(lower-case(duration),3),
format-number(aggregateRating/ratingValue,"#.0")||"/10"
)
'
A Countess from Hong Kong
1967
2h
6.0/10
A King in New York
1957
1h50m
7.0/10
Limelight
1952
2h17m
8.0/10
[...]
Making a Living
1914
11m
5.5/10

How to extract text with sed or grep and regular expression json

Hello I am using curl to get some info which I need to clean up.
This is from curl command:
{"ip":"000.000.000.000","country":"Italy","city":"Milan","longitude":9.1889,"latitude":45.4707, etc..
I would need to get "Ita" as output, that is, the first three letters of the country.
After reading sed JSON regular expression, I tried to adapt it, resulting in
sed -e 's/^.*"country":"[a-zA-Z]{3}".*$/\1/
but this won't work.
Can you please help?
Using jq, you can do:
curl .... | jq -r '.country[0:3]'
If you need to set the country to the first 3 chars,
jq '.country = .country[0:3]'
Some fairly advanced bash that reads both values into shell variables:
{
read country
read city
} < <(
curl ... |
jq -r '.country[0:3], .city[0:3]'
)
Then:
$ echo "$country $city"
Ita Mil
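If you'd rather repair the sed attempt itself (jq is the safer tool for JSON, but for this one fixed key a regex works), the original command is missing -E for the {3} quantifier, a capture group for \1 to refer to, and the closing quote. A corrected sketch, assuming the JSON arrives on one line as in the sample:
# capture the first three letters of the country value and print only them
curl .... | sed -E 's/.*"country":"([A-Za-z]{3})[^"]*".*/\1/'
This prints Ita for the sample line above.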

Iterate over string without space as separator in bash-shell

I've been trying to parse some output of xmllint for hours, but I can't get it to work the way I need.
Output of xmllint --xpath "//fub/@name" menu.xml:
name="Kessel" name="Lager" name="Puffer " name="Boiler.sen"
name="Boiler.jun" name="HK Senior" name="HK Junior" name="Fbh"
name="Solar" name="F.Wärme" name="Sys "
Now I need to separate all the names (including spaces) and get them into separate variables.
My approach was this:
fubNames=$(xmllint --xpath "//fub/@name" menu.xml | sed 's/name=//g')
for name in $fubNames
do
echo $name
done
but this does not work out, because the for-loop splits the string on spaces.
I need the names with spaces (note: some names have a space at the end).
Does anyone know how to do this properly?
I suggest:
xmllint --xpath "//fub/@name" menu.xml | grep -o '"[^"]*"' | while IFS= read -r name; do echo "$name"; done
grep approach:
xmllint --xpath "//fub/@name" menu.xml | grep -Po 'name=\K\"([^"]+)\"'
The output:
"Kessel"
"Lager"
"Puffer "
"Boiler.sen"
"Boiler.jun"
"HK Senior"
"HK Junior"
"Fbh"
"Solar"
"F.Wärme"
"Sys "
-P option allows Perl regular expressions
-o option tells grep to print only the matched parts
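Since the goal was to get the names into separate variables, here is a sketch that builds on the grep -o approach above and reads them into a bash array:
# strip the quotes and read one name per line into an array, spaces preserved
mapfile -t fubNames < <(xmllint --xpath "//fub/@name" menu.xml | grep -o '"[^"]*"' | sed 's/"//g')
for name in "${fubNames[@]}"; do
printf '[%s]\n' "$name"   # the brackets make trailing spaces visible
done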

I can't figure out how to extract a string in bash

I am trying to make a bash script that will download a YouTube page, find the latest video and extract its URL. I have the part that downloads the page, but I cannot figure out how to isolate the text with the URL.
I have this to download the page
curl -s https://www.youtube.com/user/h3h3Productions/videos > YoutubePage.txt
which will save it to a file.
But I cannot figure out how to isolate the single part of a div.
The div is
<a class="yt-uix-sessionlink yt-uix-tile-link spf-link yt-ui-ellipsis yt-ui-ellipsis-2" dir="ltr" title="Why I'm Unlisting the Leafyishere Rant" aria-describedby="description-id-877692" data-sessionlink="ei=a2lSV9zEI9PJ-wODjKuICg&feature=c4-videos-u&ved=CD4QvxsiEwicpteI1I3NAhXT5H4KHQPGCqEomxw" href="/watch?v=q6TNODqcHWA">Why I'm Unlisting the Leafyishere Rant</a>
And I need to isolate the href at the end, but I cannot figure out how to do this with grep or sed.
With sed:
sed -n 's/<a [^>]*>/\n&/g;s/.*<a.*href="\([^"]*\)".*/\1/p' YoutubePage.txt
To just extract the video hrefs:
$ sed -n 's/<a [^>]*>/\n&/g;s/.*<a.*href="\(\/watch\?[^"]*\)".*/\1/p' YoutubePage.txt
/watch?v=q6TNODqcHWA
/watch?v=q6TNODqcHWA
/watch?v=ix4mTekl3MM
/watch?v=ix4mTekl3MM
/watch?v=fEGVOysbC8w
/watch?v=fEGVOysbC8w
...
To omit repeated lines:
$ sed -n 's/<a [^>]*>/\n&/g;s/.*<a.*href="\(\/watch\?[^"]*\)".*/\1/p' YoutubePage.txt | sort | uniq
/watch?v=2QOx7vmjV2E
/watch?v=4UNLhoePqqQ
/watch?v=5IoTGVeqwjw
/watch?v=8qwxYaZhUGA
/watch?v=AemSBOsfhc0
/watch?v=CrKkjXMYFzs
...
You can also pipe it to your curl command:
curl -s https://www.youtube.com/user/h3h3Productions/videos | sed -n 's/<a [^>]*>/\n&/g;s/.*<a.*href="\(\/watch\?[^"]*\)".*/\1/p' | sort | uniq
You can use lynx, which is a terminal browser but has a -dump mode that outputs the parsed HTML as text, with URLs extracted. This makes it easier to grep for the URL:
lynx -dump 'https://www.youtube.com/user/h3h3Productions/videos' \
| sed -n '/\/watch?/s/^ *[0-9]*\. *//p'
This will output something like:
https://www.youtube.com/watch?v=EBbLPnQ-CEw
https://www.youtube.com/watch?v=2QOx7vmjV2E
...
Breakdown:
-n ' # Disable auto printing
/\/watch?/ # Match lines with /watch?
s/^ *[0-9]*\. *// # Remove leading index: " 123. https://..." ->
# "https://..."
p # Print line if all the above have not failed.
'
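An HTML-parser alternative in the spirit of the xidel answers above (a sketch; it assumes the video links are still served as plain <a href="/watch?..."> anchors in the page source, which YouTube does not guarantee for its script-heavy pages):
# list every distinct /watch? href found in the page
xidel -s 'https://www.youtube.com/user/h3h3Productions/videos' \
-e 'distinct-values(//a[contains(@href,"/watch?")]/@href)'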

extract substring from lines using grep, awk, sed, etc.

I have a file with many lines like:
lily weisy
I want to extract www.youtube.com/user/airuike and lily weisy, and then I also want to separate airuike from www.youtube.com/user/
so I want to get 3 strings: www.youtube.com/user/airuike, airuike and lily weisy
How can I achieve this? Thanks.
do this:
sed -e 's/.*href="\([^"]*\)".*>\([^<]*\)<.*/link:\1 name:\2/' < data
will give you the first part. But I'm not sure what you are doing with it after this.
Since it is HTML, and HTML should be parsed with an HTML parser and not with grep/sed/awk, you could use the pattern-matching function of my Xidel.
xidel yourfile.html -e '<a class="yt-uix-sessionlink yt-user-name " dir="ltr">{$link := @href, $user := substring-after($link, "www.youtube.com/user/"), $name:=text()}</a>*'
Or if you want a CSV like result:
xidel yourfile.html -e '<a class="yt-uix-sessionlink yt-user-name " dir="ltr">{string-join((@href, substring-after(@href, "www.youtube.com/user/"), text()), ", ")}</a>*' --hide-variable-names
It is kind of sad that you also want the airuike string; otherwise it could be as simple as
xidel /yourfile.html -e '{$name}*'
(and you were supposed to be able to use xidel '{$name}*', but it seems I haven't thought the syntax through. Just one error check and it is breaking everything. )
$ awk '{split($0,a,/(["<>]|:\/\/)/); u=a[4]; sub(/.*\//,"",a[4]); print u,a[4],a[12]}' file
www.youtube.com/user/airuike airuike lily weisy
I think something like this must work
while read line
do
href=$(echo $line | grep -o 'http[^"]*')
user=$(echo $href | grep -o '[^/]*$')
text=$(echo $line | grep -o '[^>]*<\/a>$' | grep -o '^[^<]*')
echo href: $href
echo user: $user
echo text: $text
done < yourfile
Regular expressions basics: http://en.wikipedia.org/wiki/Regular_expression#POSIX_Basic_Regular_Expressions
Upd: checked and fixed
