Get random link from page using Shell - shell

I'm trying to write a very basic benchmarking script which will load random pages from a website, starting with the home page.
I will be using curl to grab the contents of the page, but then I want to load a random next page from that as well. Could someone give me a bit of shell code that will get the URL from a random a href in the output of the curl command?

Here's what I came up with:
curl <url> 2> /dev/null | egrep "a href=" | sed 's/.*<a href="//' | \
cut -d '"' -f 1-1 | while read i; do echo "`expr $RANDOM % 1000`:$i"; done | \
sort -n | sed 's/[0-9]*://' | head -1
Replace <url> with the page you are trying to get a link from.
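If GNU coreutils is available, shuf can do the random pick in one step; a minimal sketch of the same idea:
# same extraction as above, but let shuf -n 1 choose one random href (GNU coreutils)
curl <url> 2> /dev/null | egrep "a href=" | sed 's/.*<a href="//' | \
cut -d '"' -f 1 | shuf -n 1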
EDIT:
It might be easier to make a script called getrandomurl.sh containing:
#!/bin/sh
curl "$1" 2> /dev/null | egrep "a href=" | sed 's/.*<a href="//' | \
cut -d '"' -f 1-1 | while read i; do echo "`expr $RANDOM % 1000`:$i"; done | \
sort -n | sed 's/[0-9]*://' | head -1
and run it like ./getrandomurl.sh http://stackoverflow.com or something.

Using both lynx and bash arrays:
hrefs=($(lynx -dump http://www.google.com |
sed -e '0,/^References/{d;n};s/.* \(http\)/\1/'))
echo ${hrefs[$(( $RANDOM % ${#hrefs[@]} ))]}

Not a curl solution, but I think more effective given the task.
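Wrapped as a small bash function (the same pipeline as above; random_href is just an illustrative name):
# pick a random absolute link from the page given as $1
random_href() {
    local hrefs
    hrefs=($(lynx -dump "$1" |
        sed -e '0,/^References/{d;n};s/.* \(http\)/\1/'))
    echo "${hrefs[RANDOM % ${#hrefs[@]}]}"
}
random_href http://www.google.com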
I would suggest using the Perl WWW::Mechanize module for this. For example, to dump all links from a page, use something like this:
use WWW::Mechanize;
$mech = WWW::Mechanize->new();
$mech->get("URL");
$mech->dump_links(undef, 'absolute' => 1);
Note that URL should be replaced with the desired page.
Then either continue within Perl; the following follows a random link on the page:
$number_of_links = "" . @{$mech->links()};
$mech->follow_link( n => int(rand($number_of_links)) );
Or use the dump_links version above to get the URLs and process them further within the shell, e.g. to get a random URL (if the above script is called get_urls.pl):
./get_urls.pl | shuf | while read; do
# Url is now in the $REPLY variable
echo "$REPLY"
done

Using pup
A flexible solution to getting all the links on a page is to use pup to specify CSS selectors. For instance, I can grab all the links (<a> tags) from my blog using:
curl https://jlericson.com/ 2>/dev/null \
| pup 'a attr{href}'
The attr{href} at the end outputs only the href attribute. If you run that command, you'll notice several links aren't to posts on my site, but to my email address and Twitter account.
If I want to get just the blog post links, I can be a bit more choosy:
curl https://jlericson.com/ 2> /dev/null \
| pup 'a.post-link attr{href}'
That grabs only links with class='post-link', which are the links to my posts.
Now we can select a random line of output:
curl https://jlericson.com/ 2> /dev/null \
| pup 'a.post-link attr{href}' \
| shuf | head -1
The shuf command mixes the lines like a deck of cards and head -1 draws the top card off the deck. (Or the first line, if you prefer.)
My links are all relative, so I will want to append the domain using sed:
curl https://jlericson.com/ 2> /dev/null \
| pup 'a.post-link attr{href}' \
| sed -e 's|/|https://jlericson.com/|' \
| shuf | head -1
The sed command replaces the first / with the site's base URL.
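If a page mixes relative and absolute links, anchoring the pattern to the start of the line keeps the absolute ones intact (a small variation on the same command):
curl https://jlericson.com/ 2> /dev/null \
| pup 'a.post-link attr{href}' \
| sed -e 's|^/|https://jlericson.com/|' \
| shuf | head -1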
I might also want to include the text of the link. That gets a bit tricky because pup doesn't support two output functions. But it does support outputting to JSON, which can be read with jq:
curl https://jlericson.com/ 2> /dev/null \
| pup 'a.post-link json{}' \
| jq -r '.[] | [.text,.href] | @tsv' \
| sed -e 's|/|https://jlericson.com/|' \
| shuf | head -1
This produces tab-separated output, which may or may not be what you want.
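If it is, the two fields can be read back into shell variables; a quick sketch using the same pipeline:
curl https://jlericson.com/ 2> /dev/null \
| pup 'a.post-link json{}' \
| jq -r '.[] | [.text,.href] | @tsv' \
| shuf | head -1 \
| while IFS=$'\t' read -r text href; do
    # text and href are now separate variables
    echo "$text -> https://jlericson.com$href"
done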

Related

Bash sed command issue

I'm trying to further parse an output file I generated using an additional grep command. The code that I'm currently using is:
#!/bin/bash
# fetches the links of the movie's IMDb pages for a given actor
# fullname="USER INPUT"
read -p "Enter fullname: " fullname
if [ "$fullname" = "Charlie Chaplin" ];
then
code="nm0000122"
else
code="nm0000050"
fi
curl "https://www.imdb.com/name/$code/#actor" |
grep -Eo 'href="/title/[^"]*' |
sed 's#^.*href="/#https://www.imdb.com/#g' |
sort -u > imdb_links.txt
# parses each of the links in the link text file and gets the details for
# each of the movies. This is followed by the cleaning process
for i in $(cat imdb_links.txt)
do
curl $i |
html2text |
sed -n '/Sign_In/,$p' |
sed -n '/YOUR RATING/q;p' |
head -n-1 |
tail -n+2
done > imdb_all.txt
The sample generated output is:
EN
⁰
* Fully supported
* English (United States)
* Partially_supported
* Français (Canada)
* Français (France)
* Deutsch (Deutschland)
* हिंदी (भारत)
* Italiano (Italia)
* Português (Brasil)
* Español (España)
* Español (México)
****** Duck Soup ******
* 19331933
* Not_RatedNot Rated
* 1h 9m
IMDb RATING
7.8/10
How do I change the code to further parse the output so that I get only the data from the movie title up to the IMDb rating (in this case, from the line that contains the title 'Duck Soup' to the end)?
Here is the code:
#!/bin/bash
# fullname="USER INPUT"
read -p "Enter fullname: " fullname
if [ "$fullname" = "Charlie Chaplin" ]; then
code="nm0000122"
else
code="nm0000050"
fi
rm -f imdb_links.txt
curl "https://www.imdb.com/name/$code/#actor" |
grep -Eo 'href="/title/[^"]*' |
sed 's#^href="#https://www.imdb.com#g' |
sort -u |
while read link; do
# uncomment the next line to save links into file:
#echo "$link" >>imdb_links.txt
curl "$link" |
html2text -utf8 |
sed -n '/Sign_In/,/YOUR RATING/ p' |
sed -n '$d; /^\*\{6\}.*\*\{6\}$/,$ p'
done >imdb_all.txt
Please(!) have a look at the following URLs on why it's a really bad idea to parse HTML with sed:
RegEx match open tags except XHTML self-contained tags
Using regular expressions to parse HTML: why not?
Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms
The thing you're trying to do can be done with the HTML/XML/JSON parser xidel, and with just one call!
In this example I'll use the IMDB of Charlie Chaplin as source.
Extract all 94 "Actor" IMDb movie URLs:
$ xidel -s "https://www.imdb.com/name/nm0000122" -e '
//div[@id="filmo-head-actor"]/following-sibling::div[1]//a/@href
'
/title/tt0061523/?ref_=nm_flmg_act_1
/title/tt0050598/?ref_=nm_flmg_act_2
/title/tt0044837/?ref_=nm_flmg_act_3
[...]
/title/tt0004288/?ref_=nm_flmg_act_94
There's no need to save these to a text-file. Just use -f (--follow) instead of -e and xidel will open all of them.
For the individual movie URLs you could parse the HTML to get the text nodes you want...
$ xidel -s "https://www.imdb.com/title/tt0061523/?ref_=nm_flmg_act_1" -e '
//h1,
//div[@class="sc-94726ce4-3 eSKKHi"]/ul/li[1]/span,
//div[@class="sc-94726ce4-3 eSKKHi"]/ul/li[3],
(//div[@class="sc-7ab21ed2-2 kYEdvH"])[1]
'
A Countess from Hong Kong
1967
2h
6.0/10
...but with those class names I'd say that's a rather fragile endeavor. Instead I'd recommend parsing the JSON at the top of the HTML source, inside the <script> node:
$ xidel -s "https://www.imdb.com/title/tt0061523/?ref_=nm_flmg_act_1" -e '
parse-json(//script[@type="application/ld+json"])/(
name,
datePublished,
duration,
aggregateRating/ratingValue
)
'
A Countess from Hong Kong
1967-03-15
PT2H
6
...or to get a similar output as above:
$ xidel -s "https://www.imdb.com/title/tt0061523/?ref_=nm_flmg_act_1" -e '
parse-json(//script[@type="application/ld+json"])/(
name,
year-from-date(date(datePublished)),
substring(lower-case(duration),3),
format-number(aggregateRating/ratingValue,"#.0")||"/10"
)
'
A Countess from Hong Kong
1967
2h
6.0/10
All combined:
$ xidel -s "https://www.imdb.com/name/nm0000122" \
-f '//div[@id="filmo-head-actor"]/following-sibling::div[1]//a/@href' \
-e '
parse-json(//script[@type="application/ld+json"])/(
name,
year-from-date(date(datePublished)),
substring(lower-case(duration),3),
format-number(aggregateRating/ratingValue,"#.0")||"/10"
)
'
A Countess from Hong Kong
1967
2h
6.0/10
A King in New York
1957
1h50m
7.0/10
Limelight
1952
2h17m
8.0/10
[...]
Making a Living
1914
11m
5.5/10

Bash: replace specific text with its translation

There is a huge file; in it I want to replace all the text between '=' and '\n' with its translation. Here is an example:
input:
screen.LIGHT_COLOR=Lighting Color
screen.LIGHT_M=Light (Morning)
screen.AMBIENT_M=Ambient (Morning)
output:
screen.LIGHT_COLOR=Цвет Освещения
screen.LIGHT_M=Свет (Утро)
screen.AMBIENT_M=Эмбиент (Утро)
All I have managed to do until now is to extract and translate the targeted text.
while IFS= read -r line
do
echo $line | cut -d= -f2- | trans -b en:ru
done < file.txt
output:
Цвет Освещения
Свет (Утро)
Эмбиент (Утро)
*trans is short for translate-shell. It is slow, but does the job. -b for brief translation; en:ru means English to Russian.
If you have any suggestions or solutions I'll be glad to know, thanks!
Edit, in case someone needs it:
After discovering translate-shell's limitations I ended up going with @TaylorG.'s suggestion. It seems that translate-shell allows around 110 requests per some period of time, and processing each line separately results in 1300 requests, which breaks the script.
Long story short, it is faster to pack all the data into a single request. It's possible to reduce processing time from a couple of minutes to mere seconds. Sorry for the messy code; it's only my third day with this:
cut -s -d = -f 1 en_US.lang > option_en.txt
cut -s -d = -f 2 en_US.lang > value_en.txt
# merge lines
sed ':a; N; $!ba; s/\n/ :: /g' value_en.txt > value_en_block.txt
trans -b en:ru -i value_en_block.txt -o value_ru_block.txt
sed 's/ :: /\n/g' value_ru_block.txt > value_ru.txt
paste -d = option_en.txt value_ru.txt > ru_RU.lang
# remove temporary files
rm option_en.txt value_en.txt value_en_block.txt value_ru.txt value_ru_block.txt
Thanks Taylor G., Armali and every commentator
Using a pipe in a large loop is expensive. You can try the following instead.
cut -s -d = -f 1 file.txt > name.txt
cut -s -d = -f 2- file.txt | trans -b en:ru > translate.txt
paste -d = name.txt translate.txt
It should be much faster than your current script. I'm not sure how your trans method is written; it needs to be updated to process batch input if it doesn't already, e.g. using a while loop like the one below.
trans() {
while read -r line; do
# do translate and print result
done
}
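A minimal sketch of such a line-by-line wrapper (I've named it trans_lines here so it doesn't shadow the real trans binary; it still pays the per-line cost, which is why the batch pipeline above is preferable):
# translate each line read from stdin, one request per line
trans_lines() {
while IFS= read -r line; do
trans -b en:ru "$line"
done
}
cut -s -d = -f 2- file.txt | trans_lines > translate.txt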
You already did most of the work, though it can be optimized a bit. What's missing is just to output the first part of the line up to the equal sign together with the translation:
while IFS== read left right
do echo $left=`trans -b en:ru <<<$right`
done <file.txt

I can't figure out how to extract a string in bash

I am trying to make a bash script that will download a YouTube page, see the latest video, and find its URL. I have the part to download the page, except I cannot figure out how to isolate the text with the URL.
I have this to download the page
curl -s https://www.youtube.com/user/h3h3Productions/videos > YoutubePage.txt
which will save it to a file.
But I cannot figure out how to isolate the single part of a div.
The div is
<a class="yt-uix-sessionlink yt-uix-tile-link spf-link yt-ui-ellipsis yt-ui-ellipsis-2" dir="ltr" title="Why I'm Unlisting the Leafyishere Rant" aria-describedby="description-id-877692" data-sessionlink="ei=a2lSV9zEI9PJ-wODjKuICg&feature=c4-videos-u&ved=CD4QvxsiEwicpteI1I3NAhXT5H4KHQPGCqEomxw" href="/watch?v=q6TNODqcHWA">Why I'm Unlisting the Leafyishere Rant</a>
And I need to isolate the href at the end, but I cannot figure out how to do this with grep or sed.
With sed:
sed -n 's/<a [^>]*>/\n&/g;s/.*<a.*href="\([^"]*\)".*/\1/p' YoutubePage.txt
To extract just the video hrefs:
$ sed -n 's/<a [^>]*>/\n&/g;s/.*<a.*href="\(\/watch\?[^"]*\)".*/\1/p' YoutubePage.txt
/watch?v=q6TNODqcHWA
/watch?v=q6TNODqcHWA
/watch?v=ix4mTekl3MM
/watch?v=ix4mTekl3MM
/watch?v=fEGVOysbC8w
/watch?v=fEGVOysbC8w
...
To omit repeated lines:
$ sed -n 's/<a [^>]*>/\n&/g;s/.*<a.*href="\(\/watch\?[^"]*\)".*/\1/p' YoutubePage.txt | sort | uniq
/watch?v=2QOx7vmjV2E
/watch?v=4UNLhoePqqQ
/watch?v=5IoTGVeqwjw
/watch?v=8qwxYaZhUGA
/watch?v=AemSBOsfhc0
/watch?v=CrKkjXMYFzs
...
You can also pipe the output of curl straight into it:
curl -s https://www.youtube.com/user/h3h3Productions/videos | sed -n 's/<a [^>]*>/\n&/g;s/.*<a.*href="\(\/watch\?[^"]*\)".*/\1/p' | sort | uniq
You can use lynx, a terminal browser which has a -dump mode that outputs the parsed HTML as text with the URLs extracted. This makes it easier to grep for the URL:
lynx -dump 'https://www.youtube.com/user/h3h3Productions/videos' \
| sed -n '/\/watch?/s/^ *[0-9]*\. *//p'
This will output something like:
https://www.youtube.com/watch?v=EBbLPnQ-CEw
https://www.youtube.com/watch?v=2QOx7vmjV2E
...
Breakdown:
-n ' # Disable auto printing
/\/watch?/ # Match lines with /watch?
s/^ *[0-9]*\. *// # Remove leading index: " 123. https://..." ->
# "https://..."
p # Print line if all the above have not failed.
'
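Since the channel's videos page lists the newest uploads first, taking the first match should give the latest video's URL (assuming that ordering holds):
lynx -dump 'https://www.youtube.com/user/h3h3Productions/videos' \
| sed -n '/\/watch?/s/^ *[0-9]*\. *//p' \
| head -n 1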

Leveraging graphviz to create a network weathermap configuration

Given a generated list of nodes and links, is there a way I can use dot or some other tool from the graphviz package to create coordinates for those nodes such that I in turn can use that information to generate a configuration file for network weathermap?
The answer turned out to be simple: calling dot or the other tools without an output argument prints the information I wanted to stdout.
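For example, the plain output format lists one node per line with its layout coordinates, which is easy to pick apart with awk (graph.dot stands in for your generated graph file):
# each "node" line is: node <name> <x> <y> <width> <height> <label> ...
dot -Tplain graph.dot | awk '$1 == "node" { print $2, $3, $4 }'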
I wrote this shell script to make a graph from an MRTG config file, but decided not to pursue the weathermap part, as the results were too cluttered:
grep -P '^SetEnv.*MRTG_INT_IP="..*" MRTG_INT_DESCR=".*"' $1 | grep -v 'MRTG_INT_IP="127.' | grep -v 'MRTG_INT_IP="10.255.' |\
sed \
-e 's/SetEnv\[\(.*\.switch\.hapro\.no_.*\)]: MRTG_INT_IP="\(.*\)" MRTG_INT_DESCR="\(.*\)"/\1 \2 \3/' \
-e 's/\//_/g' |\
sort -t/ -k 1 -n -k 2 -n -k 3 -n -k 4 |\
gawk '
BEGIN { print "graph '$2' {"; }
{
graph[overlap=false];
v = "'$2'"
print v " -- " $3
}
END { print "}" }'
Thought I would share this in case someone else found it useful in the future.
I used the script like ./mkconf ../switch/mrtg.1c.conf 1c | dot -Tpng > test.png

modify the contents of a file without a temp file

I have the following log file which contains lines like this
1345447800561|FINE|blah#13|txReq
1345447800561|FINE|blah#13|Req
1345447800561|FINE|blah#13|rxReq
1345447800561|FINE|blah#14|txReq
1345447800561|FINE|blah#15|Req
I am trying to extract the first field from each line and, depending on whether it belongs to blah#13, blah#14, or blah#15, I am creating the corresponding files using the following script, which seems quite inefficient in terms of the number of temp files created. Any suggestions on how I can optimize it?
cat newLog | grep -i "org.arl.unet.maca.blah#13" >> maca13
cat newLog | grep -i "org.arl.unet.maca.blah#14" >> maca14
cat newLog | grep -i "org.arl.unet.maca.blah#15" >> maca15
cat maca10 | grep -i "txReq" >> maca10TxFrameNtf_temp
exec<blah10TxFrameNtf_temp
while read line
do
echo $line | cut -d '|' -f 1 >>maca10TxFrameNtf
done
cat maca10 | grep -i "Req" >> maca10RxFrameNtf_temp
while read line
do
echo $line | cut -d '|' -f 1 >>maca10TxFrameNtf
done
rm -rf *_temp
Something like this?
for m in org.arl.unet.maca.blah#13 org.arl.unet.maca.blah#14 org.arl.unet.maca.blah#15
do
grep -i "$m" newLog | grep "txReq" | cut -d' ' -f1 > log.$m
done
I've found it useful at times to use ex instead of grep/sed to modify text files in place without using temps... it saves the trouble of worrying about uniqueness and writability of the temp file and its directory, etc. Plus it just seemed cleaner.
In ksh I would use a code block with the edit commands and just pipe that into ex ...
{
# Any edit command that would work at the colon prompt of a vi editor will work
# This one was just a text substitution that would replace all contents of the line
# at line number ${NUMBER} with the word DATABASE ... which strangely enough was
# necessary at one time lol
# The wq is the "write/quit" command as you would enter it at the vi colon prompt
# which are essentially ex commands.
print "${NUMBER}s/.*/DATABASE/"
print "wq"
} | ex filename > /dev/null 2>&1
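The same thing works in bash with printf feeding the commands to ex (NUMBER and filename are placeholders, as above):
printf '%s\n' "${NUMBER}s/.*/DATABASE/" "wq" | ex filename > /dev/null 2>&1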
