Shell script: Parse URL for iFrame and get iFrame URL - bash

I want to fetch a page from my website, find the <iframe> tag, and extract its URL (the src attribute).
I tried it like this:
url=`wget -O - http://my-url.com/site 2>&1 | grep iframe`
echo $url
With this, I get the whole HTML line:
<iframe src="//player.vimeo.com/video/AAAAAAAA?title=0&byline=0&portrait=0" width="480" height="360" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen></iframe> </div>
How can I now extract just the URL from that line?
I tried a few sed expressions, but couldn't get it to work :( Here's what I tried:
wget -O - http://myurl.com/ 2>&1 | grep iframe | sed "s/<iframe src/\\n<iframe src/g"
Kind regards,
Matt ;)

sed -n '/<iframe/s/^.*<iframe src="\([^"]*\)".*/\1/p'
You don't need grep; sed's own pattern matching (the /<iframe/ address) selects the line. The capture group \(...\) then picks out the URL inside the quotes of the src attribute.
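As a minimal sketch of how that sed expression could be wired into a script: the helper name iframe_src is made up for illustration, and the sample line is a shortened copy of the one in the question. Note that wget writes its progress messages to stderr, so piping with 2>&1 mixes them into the HTML; wget's -q option keeps the pipe clean.

```shell
# Hypothetical helper (not from the original answer): read HTML on stdin and
# print the src value of every line that contains an <iframe src="..."> tag.
iframe_src() {
  sed -n '/<iframe/s/^.*<iframe src="\([^"]*\)".*/\1/p'
}

# Sample line from the question, shortened.
printf '%s\n' '<iframe src="//player.vimeo.com/video/AAAAAAAA?title=0&byline=0&portrait=0" width="480" height="360"></iframe> </div>' | iframe_src

# Against a live page (the URL is the question's placeholder):
#   wget -qO - http://my-url.com/site | iframe_src
```

The expression assumes src is the first attribute after <iframe, as in the question's markup; a page that puts other attributes first would need a looser pattern.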

You don't need sed, cut is sufficient:
~$ url='<iframe src="//player.vimeo.com/video/AAAAAAAA?title=0&byline=0&portrait=0" width="480" height="360" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen></iframe> </div>'
~$ echo $url|cut -d'"' -f 2
//player.vimeo.com/video/AAAAAAAA?title=0&byline=0&portrait=0
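Two caveats with the cut approach: it returns the second quote-delimited field, so it only works when src is the first quoted attribute on the line, and the result here is a protocol-relative URL (it starts with //). The sketch below shows one way to turn such a value into an absolute URL. The helper name absolutize and the choice of https as the scheme are assumptions for illustration, not part of the answer.

```shell
# Hypothetical helper: prefix a protocol-relative URL (//host/path) with a scheme.
# Assumption: https is the scheme you want; other URLs pass through unchanged.
absolutize() {
  case "$1" in
    //*) printf 'https:%s\n' "$1" ;;
    *)   printf '%s\n' "$1" ;;
  esac
}

line='<iframe src="//player.vimeo.com/video/AAAAAAAA?title=0&byline=0&portrait=0" width="480"></iframe>'

# Quote the variable so the shell does not split or glob the HTML line.
src=$(printf '%s\n' "$line" | cut -d'"' -f2)
absolutize "$src"
```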

Related

Bash + Pup printing only attribute

I'm fetching a web page's source with wget, then using pup to grab the <meta> tag that I need. Now I want to print only the value of the content attribute.
In this case, the output I want is: https://example.com/my/folder/first.jpg?foo=bar
# wget page to /tmp/output.html
IMAGE_URL=$(cat /tmp/output.html | pup 'meta[property*="og:image"]')
The output of echo $IMAGE_URL is:
<meta property="og:image" content="https://example.com/my/folder/first.jpg?foo=bar">
wget -O /tmp/output.html --user-agent="user-agent: Whatever..." https://example.com/somewhere
IMAGE_URL=$(cat /tmp/output.html | pup --plain 'meta[property*="og:image"]' | sed -n 's/.*content=\"\([^"]*\)".*/\1/p')
You can use attr{content} to only get the content of the attribute.
wget -O /tmp/output.html --user-agent="user-agent: Whatever..." https://example.com/somewhere
IMAGE_URL=$(cat /tmp/output.html | pup 'meta[property*="og:image"] attr{content}')

How do I get a selection from the output of a grep

I have the following text in a file:
<img id="img_1" style="display: none" src="Logs/P2P2014-04-10_14-24-49.txt"/></span></div></div><script type="text/javascript">document.getElementById('duration').innerHTML = "Finished in <strong>1m31.846s seconds</strong>";</script><script type="text/javascript">document.getElementById('totals').innerHTML = "1
I want to obtain the value of the src attribute, i.e. Logs/P2P2014-04-10_14-24-49.txt, and store it in a variable in Ruby or similar.
I tried:
text = `grep 'Logs\/.*txt\"'`
But that returns the entire damn line instead of only the text. How do I get this done?
Try to use
text=$(grep -o 'Logs\/.*txt\"')
It should return only the matching part of the line.
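Because grep -o prints exactly what the pattern matched, a pattern that ends in \" also prints the closing quote. A minimal sketch that stops before the quote, assuming the path never contains a double quote (the helper name log_path is made up for illustration):

```shell
# Hypothetical helper: read text on stdin and print only the Logs/...txt path.
# [^"]* keeps the match inside the attribute value, so no quote is printed.
log_path() {
  grep -o 'Logs/[^"]*\.txt'
}

printf '%s\n' '<img id="img_1" style="display: none" src="Logs/P2P2014-04-10_14-24-49.txt"/></span>' | log_path
```

Using [^"]* instead of .* also prevents the match from running on to a later txt" further along the same line.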
With Nokogiri, the problem is easy to solve:
require 'nokogiri'
doc = Nokogiri::HTML.parse <<-html
<img id="img_1" style="display: none" src="Logs/P2P2014-04-10_14-24-49.txt"/></span></div></div>
html
doc.at('#img_1')['src'] # => "Logs/P2P2014-04-10_14-24-49.txt"
Read tutorials to understand and learn Nokogiri.
Using sed:
sed -n 's/.*src="\([^"]*\)".*/\1/p' file
Using GNU grep, if it supports the -P option:
grep -Po '(?<=src=")[^"]*' file

extract pattern with text editors

I have a URL source page like:
href="http://path/to/file.bz2">german.txt.bz2</a> (2,371,487 bytes)</td>
<td><a rel="nofollow" class="external text" href="http://a/web/page/">American cities</a></td>
<td><a rel="nofollow" class="external text" href="http://another/page/to.bz2">us_cities.txt.bz2</a> (77,081 bytes)</td>
<td><a rel="nofollow" class="external text" href="http://other/page/to/file.bz2">test.txt.bz2</a> (7,158,285 bytes)</td>
<td>World's largest test password collection!<br />Created by <a rel="nofollow" class="external text" href="http://page/web.com/">Matt Weir</a>
I want to use a text-processing tool like sed or awk to extract exactly the URLs that end in .bz2,
like:
http://path/to/file.bz2
http://another/page/to.bz2
http://other/page/to/file.bz2
Could you help me?
Sed and grep:
sed 's/.*href=\"\(.*\)\".*/\1/g' file | grep -oP '.*\.bz2$'
$ sed -n 's/.*href="\([^"]*\.bz2\)".*/\1/p' file
http://path/to/file.bz2
http://another/page/to.bz2
http://other/page/to/file.bz2
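The sed commands above print at most one URL per input line. If a page might put several links on one line, one possible sketch is to let grep -o split out every matching href first and then strip the attribute wrapper. The helper name bz2_links is made up for illustration, and grep -o is assumed to be available (it is in GNU and BSD grep).

```shell
# Hypothetical helper: read HTML on stdin and print every href value that
# ends in .bz2, one per line, even when several links share a line.
bz2_links() {
  grep -o 'href="[^"]*\.bz2"' | sed 's/^href="//; s/"$//'
}

printf '%s\n' \
  '<td><a href="http://another/page/to.bz2">us_cities.txt.bz2</a> <a href="http://a/web/page/">American cities</a></td>' \
  '<td><a href="http://other/page/to/file.bz2">test.txt.bz2</a></td>' | bz2_links
```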
Use a proper parser. For example, using xsh:
open :F html input.html ;
for //a/@href['bz2' = xsh:matches(., '\.bz2$')]
echo (.) ;

BASH - Select All Code Between A Multiline Div

I have a div on all of my eCommerce site's pages holding SEO content. I'd like to count the number of words in that div. It's for diagnosing empty pages in a large crawl.
The div always starts as follows:
<div class="box fct-seo fct-text
It then contains <h1>, <p> and <a> tags.
It then, obviously, closes with </div>.
How can I, using sed, awk, wc, etc., take all the code between the start of the div and its closing </div> and count how many words occur? If it's 90% accurate, I'm happy.
You'd somehow have to tell it to stop scanning before the first closing </div> it finds.
Here's an example page to work with:
http://www.zando.co.za/women/shoes/
Much appreciated.
-P
When it gets more complicated (like divs nested within that div), the regex approach won't work anymore and you need an HTML parser, like in my Xidel. Then you can find the text
either with css:
xidel http://www.zando.co.za/women/shoes/ -e 'css(".fct-seo")' | wc -w
or pattern matching:
xidel http://www.zando.co.za/women/shoes/ -e '<div class="box fct-seo fct-text">{.}</div>' | wc -w
It will also only print the text, not the html tags. (if you/someone wanted them, you could add the --printed-node-format xml option)
In a Perl one-liner you can use the .. operator to specify the patterns that match the beginning and end of the region you're interested in:
$ perl -wne 'print if /<div class="box fct-seo fct-text/ .. /<\/div>/' shoes.html
You can then count the words with wc -w:
$ perl -wne 'print if /<div class="box fct-seo fct-text/ .. /<\/div>/' shoes.html | wc -w
If counting the ‘words’ in the HTML tags themselves is affecting the numbers enough to affect the accuracy, you can remove those from the count with something like:
$ perl -wne 'next unless /<div class="box fct-seo fct-text/ .. /<\/div>/; s/<.*?>//g; print' shoes.html | wc -w
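The same start/end range idea can be sketched without Perl by using awk's range pattern (start,end) and letting sed drop the tags. This is only an approximation under the question's assumptions: it stops at the first </div> after the opening line, and it only strips tags that begin and end on the same line. The helper name count_div_words is made up for illustration.

```shell
# Hypothetical helper: read HTML on stdin and print the number of words
# between the SEO div's opening line and the first </div> that follows.
count_div_words() {
  awk '/<div class="box fct-seo fct-text/,/<\/div>/' |  # keep only the div's lines
    sed 's/<[^>]*>//g' |                                # drop single-line tags
    wc -w |
    awk '{print $1}'                                    # trim wc's padding
}

count_div_words <<'EOF'
<p>ignore me</p>
<div class="box fct-seo fct-text">
<h1>Womens Shoes</h1>
<p>Shop the <a href="/x">latest styles</a> online.</p>
</div>
<p>also ignored</p>
EOF
```

On a saved page this would be run as count_div_words < shoes.html.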
Try:
grep -Pzo '(?<=<div)(.*?\n)*?.*?(?=</div)' -n inputFile.html | sed 's/^[^>]*>//'

Shell: Extract some code from HTML

I have the following code snippet from a HTML file:
<div id="rwImages_hidden" style="display:none;">
<img src="http://example.com/images/I/520z3AjKzHL._SL500_AA300_.jpg" style="display:none;"/>
<img src="http://example.com/images/I/519z3AjKzHL._SL75_AA30_.jpg" style="display:none;"/>
<img src="http://example.com/images/I/31F-sI61AyL._SL75_AA30_.jpg" style="display:none;"/>
<img src="http://example.com/images/I/71k-DIrs-8L._AA30_.jpg" style="display:none;"/>
<img src="http://example.com/images/I/61CCOS0NGyL._AA30_.jpg" style="display:none;"/>
</div>
I want to extract the code
520z3AjKzHL
519z3AjKzHL
31F-sI61AyL
71k-DIrs-8L
61CCOS0NGyL
from the HTML.
Please note that <img src="" style="display:none;"/> must be used, because there are other similar URLs in the HTML file but I only want the ones between <img src="" style="display:none;"/>.
My Code is:
cat HTML | grep -Po '(?<img src="http://example.com/images/I/).*?(?=.jpg" style="display:none;"/>)'
Something seems to be wrong.
You can solve it by using positive look ahead / look behind:
cat HTML | grep -Po "(?<=<img src=\"http://example.com/images/I/).*?(?=\._.*.jpg\" style=\"display:none;\"/>)"
Regexp breakdown:
.*? match all characters reluctantly
(?<=<img src=...ges/I/) preceded by <img .../I/
(?=\._...ne;\"/>) followed by ._...ne;\"/>
I assume you were looking for a lookbehind to start, which is what was throwing the error.
(?<=foo) not (?<foo).
This gives the result you specified, but I do not know whether you need everything up to the .jpg extension or not:
cat HTML | grep -Po '(?<=img src="http://example.com/images/I/)[^.]*'
Up until and excluding the JPG would be:
cat HTML | grep -Po '(?<=img src="http://example.com/images/I/).*(?=\.jpg)'
And if you consider gawk as being a valid bash solution:
awk -F'[/|\._]' -v img='/<img src="" style="display:none;"\/>/' '/img/{print $7}' file
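Where grep has no -P support, a plain sed substitution can serve as a rough equivalent. The sketch below is only one possible way to do it: it keeps the requirement that the tag ends with style="display:none;"/> and captures everything between /images/I/ and the first dot. The helper name image_codes is made up for illustration.

```shell
# Hypothetical helper: read HTML on stdin and print the image code from each
# <img src="http://example.com/images/I/CODE.<suffix>.jpg" style="display:none;"/> line.
# | is used as the sed delimiter so the slashes in the URL need no escaping.
image_codes() {
  sed -n 's|.*<img src="http://example\.com/images/I/\([^.]*\)\..*\.jpg" style="display:none;"/>.*|\1|p'
}

printf '%s\n' \
  '<div id="rwImages_hidden" style="display:none;">' \
  '<img src="http://example.com/images/I/520z3AjKzHL._SL500_AA300_.jpg" style="display:none;"/>' \
  '<img src="http://example.com/images/I/71k-DIrs-8L._AA30_.jpg" style="display:none;"/>' \
  '</div>' | image_codes
```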
