I'm wgetting a webpage's source code and then using pup to grab the <meta> tag that I need. Now I want to print only the value of the content attribute.
In this case, the output I want is: https://example.com/my/folder/first.jpg?foo=bar
# wget page to /tmp/output.html
IMAGE_URL=$(cat /tmp/output.html | pup 'meta[property*="og:image"]')
The output of echo $IMAGE_URL is:
<meta property="og:image" content="https://example.com/my/folder/first.jpg?foo=bar">
wget -O /tmp/output.html --user-agent="user-agent: Whatever..." https://example.com/somewhere
IMAGE_URL=$(cat /tmp/output.html | pup --plain 'meta[property*="og:image"]' | sed -n 's/.*content=\"\([^"]*\)".*/\1/p')
You can use attr{content} to get only the value of the content attribute.
wget -O /tmp/output.html --user-agent="user-agent: Whatever..." https://example.com/somewhere
IMAGE_URL=$(cat /tmp/output.html | pup 'meta[property*="og:image"] attr{content}')
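With that, echo "$IMAGE_URL" should print just the value asked for in the question, i.e. https://example.com/my/folder/first.jpg?foo=bar.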
I'm trying to further parse an output file I generated using an additional grep command. The code that I'm currently using is:
#!/bin/bash
# fetches the links of the movie's imdb pages for a given actor
# fullname="USER INPUT"
read -p "Enter fullname: " fullname
if [ "$fullname" = "Charlie Chaplin" ];
code="nm0000122"
then
code="nm0000050"
fi
curl "https://www.imdb.com/name/$code/#actor" | grep -Eo
'href="/title/[^"]*' | sed 's#^.*href=\"/#https://www.imdb.com/#g' |
sort -u > imdb_links.txt
# parses each of the links in the links text file and gets the details for
# each of the movies. This is followed by the cleaning process.
for i in $(cat imdb_links.txt)
do
curl $i |
html2text |
sed -n '/Sign_In/,$p'|
sed -n '/YOUR RATING/q;p' |
head -n-1 |
tail -n+2
done > imdb_all.txt
The sample generated output is:
EN
⁰
* Fully supported
* English (United States)
* Partially_supported
* Français (Canada)
* Français (France)
* Deutsch (Deutschland)
* हिंदी (à¤à¤¾à¤°à¤¤)
* Italiano (Italia)
* Português (Brasil)
* Español (España)
* Español (México)
****** Duck Soup ******
* 19331933
* Not_RatedNot Rated
* 1h 9m
IMDb RATING
7.8/10
How do I change the code to further parse the output so that I get only the data from the title of the movie up until the IMDb rating (in this case, from the line that contains the title 'Duck Soup' to the end)?
Here is the code:
#!/bin/bash
# fullname="USER INPUT"
read -p "Enter fullname: " fullname
if [ "$fullname" = "Charlie Chaplin" ]; then
code="nm0000122"
else
code="nm0000050"
fi
rm -f imdb_links.txt
curl "https://www.imdb.com/name/$code/#actor" |
grep -Eo 'href="/title/[^"]*' |
sed 's#^href="#https://www.imdb.com#g' |
sort -u |
while read link; do
# uncomment the next line to save links into file:
#echo "$link" >>imdb_links.txt
curl "$link" |
html2text -utf8 |
sed -n '/Sign_In/,/YOUR RATING/ p' |
sed -n '$d; /^\*\{6\}.*\*\{6\}$/,$ p'
done >imdb_all.txt
Please(!) have a look at the following URLs on why it's a really bad idea to parse HTML with sed:
RegEx match open tags except XHTML self-contained tags
Using regular expressions to parse HTML: why not?
Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms
What you're trying to do can be done with the HTML/XML/JSON parser xidel in just one call!
In this example I'll use the IMDb page of Charlie Chaplin as the source.
Extract all 94 "Actor" IMDb movie URLs:
$ xidel -s "https://www.imdb.com/name/nm0000122" -e '
//div[#id="filmo-head-actor"]/following-sibling::div[1]//a/#href
'
/title/tt0061523/?ref_=nm_flmg_act_1
/title/tt0050598/?ref_=nm_flmg_act_2
/title/tt0044837/?ref_=nm_flmg_act_3
[...]
/title/tt0004288/?ref_=nm_flmg_act_94
There's no need to save these to a text-file. Just use -f (--follow) instead of -e and xidel will open all of them.
For the individual movie URLs you could parse the HTML to get the text nodes you want...
$ xidel -s "https://www.imdb.com/title/tt0061523/?ref_=nm_flmg_act_1" -e '
//h1,
//div[#class="sc-94726ce4-3 eSKKHi"]/ul/li[1]/span,
//div[#class="sc-94726ce4-3 eSKKHi"]/ul/li[3],
(//div[#class="sc-7ab21ed2-2 kYEdvH"])[1]
'
A Countess from Hong Kong
1967
2h
6.0/10
...but with those class names I'd say that's a rather fragile endeavor. Instead I'd recommend parsing the JSON at the top of the HTML source, inside the <script> node:
$ xidel -s "https://www.imdb.com/title/tt0061523/?ref_=nm_flmg_act_1" -e '
parse-json(//script[@type="application/ld+json"])/(
name,
datePublished,
duration,
aggregateRating/ratingValue
)
'
A Countess from Hong Kong
1967-03-15
PT2H
6
...or, to get output similar to the above:
$ xidel -s "https://www.imdb.com/title/tt0061523/?ref_=nm_flmg_act_1" -e '
parse-json(//script[@type="application/ld+json"])/(
name,
year-from-date(date(datePublished)),
substring(lower-case(duration),3),
format-number(aggregateRating/ratingValue,"#.0")||"/10"
)
'
A Countess from Hong Kong
1967
2h
6.0/10
All combined:
$ xidel -s "https://www.imdb.com/name/nm0000122" \
-f '//div[#id="filmo-head-actor"]/following-sibling::div[1]//a/#href' \
-e '
parse-json(//script[@type="application/ld+json"])/(
name,
year-from-date(date(datePublished)),
substring(lower-case(duration),3),
format-number(aggregateRating/ratingValue,"#.0")||"/10"
)
'
A Countess from Hong Kong
1967
2h
6.0/10
A King in New York
1957
1h50m
7.0/10
Limelight
1952
2h17m
8.0/10
[...]
Making a Living
1914
11m
5.5/10
I am trying to make a bash script that will download a YouTube page, find the latest video and get its URL. I have the part that downloads the page, but I cannot figure out how to isolate the text containing the URL.
I have this to download the page
curl -s https://www.youtube.com/user/h3h3Productions/videos > YoutubePage.txt
which will save it to a file.
But I cannot figure out how to isolate the single part of a div.
The div is
<a class="yt-uix-sessionlink yt-uix-tile-link spf-link yt-ui-ellipsis yt-ui-ellipsis-2" dir="ltr" title="Why I'm Unlisting the Leafyishere Rant" aria-describedby="description-id-877692" data-sessionlink="ei=a2lSV9zEI9PJ-wODjKuICg&feature=c4-videos-u&ved=CD4QvxsiEwicpteI1I3NAhXT5H4KHQPGCqEomxw" href="/watch?v=q6TNODqcHWA">Why I'm Unlisting the Leafyishere Rant</a>
And I need to isolate the href at the end, but I cannot figure out how to do this with grep or sed.
With sed:
sed -n 's/<a [^>]*>/\n&/g;s/.*<a.*href="\([^"]*\)".*/\1/p' YoutubePage.txt
To extract just the video hrefs:
$ sed -n 's/<a [^>]*>/\n&/g;s/.*<a.*href="\(\/watch\?[^"]*\)".*/\1/p' YoutubePage.txt
/watch?v=q6TNODqcHWA
/watch?v=q6TNODqcHWA
/watch?v=ix4mTekl3MM
/watch?v=ix4mTekl3MM
/watch?v=fEGVOysbC8w
/watch?v=fEGVOysbC8w
...
To omit repeated lines:
$ sed -n 's/<a [^>]*>/\n&/g;s/.*<a.*href="\(\/watch\?[^"]*\)".*/\1/p' YoutubePage.txt | sort | uniq
/watch?v=2QOx7vmjV2E
/watch?v=4UNLhoePqqQ
/watch?v=5IoTGVeqwjw
/watch?v=8qwxYaZhUGA
/watch?v=AemSBOsfhc0
/watch?v=CrKkjXMYFzs
...
You can also pipe your curl command directly into it:
curl -s https://www.youtube.com/user/h3h3Productions/videos | sed -n 's/<a [^>]*>/\n&/g;s/.*<a.*href="\(\/watch\?[^"]*\)".*/\1/p' | sort | uniq
You can use lynx, which is a terminal browser but has a -dump mode that outputs the parsed HTML as text with the URLs extracted. This makes it easier to grep for the URLs:
lynx -dump 'https://www.youtube.com/user/h3h3Productions/videos' \
| sed -n '/\/watch?/s/^ *[0-9]*\. *//p'
This will output something like:
https://www.youtube.com/watch?v=EBbLPnQ-CEw
https://www.youtube.com/watch?v=2QOx7vmjV2E
...
Breakdown:
-n ' # Disable auto printing
/\/watch?/ # Match lines with /watch?
s/^ *[0-9]*\. *// # Remove leading index: " 123. https://..." ->
# "https://..."
p # Print line if all the above have not failed.
'
I want to parse my website, search for the <iframe> tag and get the URL (the src="" attribute).
I tried it like this:
url=`wget -O - http://my-url.com/site 2>&1 | grep iframe`
echo $url
With this, I get the whole HTML line:
<iframe src="//player.vimeo.com/video/AAAAAAAA?title=0&byline=0&portrait=0" width="480" height="360" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen></iframe> </div>
Well, how can I now parse out just the URL?
I tried a few sed commands, but couldn't make it work :( Here's what I tried:
wget -O - http://myurl.com/ 2>&1 | grep iframe | sed "s/<iframe src/\\n<iframe src/g"
Kind regards,
Matt ;)
sed -n '/<iframe/s/^.*<iframe src="\([^"]*\)".*/\1/p'
You don't need grep; sed's pattern matching can do that on its own. The capture group \(...\) picks out the URL inside the quotes of the src attribute.
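For example, plugged into the wget call from the question (a sketch; the placeholder URL is the one from the question):
url=$(wget -O - http://my-url.com/site 2>/dev/null \
      | sed -n '/<iframe/s/^.*<iframe src="\([^"]*\)".*/\1/p')
echo "$url"
# should print //player.vimeo.com/video/AAAAAAAA?title=0&byline=0&portrait=0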
You don't need sed, cut is sufficient:
~$ url='<iframe src="//player.vimeo.com/video/AAAAAAAA?title=0&byline=0&portrait=0" width="480" height="360" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen></iframe> </div>'
~$ echo $url|cut -d'"' -f 2
//player.vimeo.com/video/AAAAAAAA?title=0&byline=0&portrait=0
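Hooked up to the original wget, that might look something like this (a sketch, assuming src is the first quoted value on the iframe line, as in the example above):
url=$(wget -O - http://my-url.com/site 2>/dev/null | grep iframe | cut -d'"' -f2)
echo "$url"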
I'm trying to write a very basic benchmarking script which will load random pages from a website, starting with the home page.
I will be using curl to grab the contents of the page, but then I want to load a random next page from it as well. Could someone give me a bit of shell code that will get the URL from a random <a href> in the output of the curl command?
Here's what I came up with:
curl <url> 2> /dev/null | egrep "a href=" | sed 's/.*<a href="//' | \
cut -d '"' -f 1-1 | while read i; do echo "`expr $RANDOM % 1000`:$i"; done | \
sort -n | sed 's/[0-9]*://' | head -1
Replace <url> with the URL of the page you are trying to get a link from.
EDIT:
It might be easier to make a script called getrandomurl.sh containing:
#!/bin/sh
curl $1 2> /dev/null | egrep "a href=" | sed 's/.*<a href="//' | \
cut -d '"' -f 1-1 | while read i; do echo "`expr $RANDOM % 1000`:$i"; done | \
sort -n | sed 's/[0-9]*://' | head -1
and run it like ./getrandomurl.sh http://stackoverflow.com or something.
Using both lynx and bash arrays:
hrefs=($(lynx -dump http://www.google.com |
sed -e '0,/^References/{d;n};s/.* \(http\)/\1/'))
echo ${hrefs[$(( $RANDOM % ${#hrefs[@]} ))]}
Not a curl solution, but I think it's more effective for this task.
I would suggest using the Perl WWW::Mechanize module for this. For example, to dump all links from a page, use something like this:
use WWW::Mechanize;
$mech = WWW::Mechanize->new();
$mech->get("URL");
$mech->dump_links(undef, 'absolute' => 1);
Note URL should be replaced with the wanted page.
Then either continue within Perl; the following follows a random link on the page:
$number_of_links = "" . #{$mech->links()};
$mech->follow_link( n => int(rand($number_of_links)) )
Or use the dump_links version above to get the URLs and process them further in the shell, e.g. to get a random URL (if the above script is called get_urls.pl):
./get_urls.pl | shuf | while read; do
# Url is now in the $REPLY variable
echo "$REPLY"
done
Using pup
A flexible solution to getting all the links on a page is to use pup to specify CSS selectors. For instance, I can grab all the links (<a> tags) from my blog using:
curl https://jlericson.com/ 2>/dev/null \
| pup 'a attr{href}'
The attr{href} at the end outputs only the href attribute. If you run that command, you'll notice several links aren't to posts on my site, but to my email address and Twitter account.
If I want to get just the blog post links, I can be a bit more choosy:
curl https://jlericson.com/ 2> /dev/null \
| pup 'a.post-link attr{href}'
That grabs only links with class='post-link', which are the links to my posts.
Now we can select a random line of output:
curl https://jlericson.com/ 2> /dev/null \
| pup 'a.post-link attr{href}' \
| shuf | head -1
The shuf command mixes the lines like a deck of cards and head -1 draws the top card off the deck. (Or the first line, if you prefer.)
My links are all relative, so I will want to prepend the domain using sed:
curl https://jlericson.com/ 2> /dev/null \
| pup 'a.post-link attr{href}' \
| sed -e 's|/|https://jlericson.com/|' \
| shuf | head -1
The sed command replaces the first / with https://jlericson.com/, turning each relative link into an absolute URL.
I might also want to include the text of the link. That gets a bit tricky because pup doesn't support two output functions. But it does support outputting to JSON, which can be read with jq:
curl https://jlericson.com/ 2> /dev/null \
| pup 'a.post-link json{}' \
| jq -r '.[] | [.text,.href] | @tsv' \
| sed -e 's|/|https://jlericson.com/|' \
| shuf | head -1
This produces tab-separated output, which may or may not be what you want.
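If it isn't, one possible follow-up (my sketch, not part of the original pipeline) is to split the two fields back into shell variables:
curl https://jlericson.com/ 2> /dev/null \
    | pup 'a.post-link json{}' \
    | jq -r '.[] | [.text,.href] | @tsv' \
    | shuf | head -1 \
    | while IFS=$'\t' read -r text href; do
        # $text holds the link text, $href the relative URL
        echo "$text -> https://jlericson.com$href"
      done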
I have a file named foo. That file contains some text (shown below). Can you please tell me how I can get the string "I have not created a home page." into a variable? I was using the command variable=`cat foo | cut -d ">" -f 3`. It outputs "I have not created a home page." with lots of newlines in it. Please let me know if you can tell me a way to obtain the string without any newlines. Thanks a lot.
<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html>
<META HTTP-EQUIV="resource-type" CONTENT="document">
</HEAD>
<BODY>
I have not created a home page.
</BODY>
</HTML>
cut is the wrong tool. Use awk:
cat >> _.awk << "EOF"
/<BODY>/ { found=1; next }
/<\/BODY>/ && found==1 { exit 0 }
found==1 { if ($1) print $0 }
EOF
awk -f _.awk foo
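To get the string into a variable, as asked in the question, wrap it in command substitution (which also drops the trailing newline):
variable=$(awk -f _.awk foo)
echo "$variable"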
Ideally you should use a real XML parser, such as a DOM parser.
cat foo | grep "^[^<]". To assign it to a variable:
v=`cat foo | grep "^[^<]"`
{ xmlstarlet sel -N html='http://www.w3.org/1999/xhtml' -t -m //html:body -v . <(tidy -asxml input.html) | tr -d '\n' ; } 2> /dev/null