I have the following text in a file:
<img id="img_1" style="display: none" src="Logs/P2P2014-04-10_14-24-49.txt"/></span></div></div><script type="text/javascript">document.getElementById('duration').innerHTML = "Finished in <strong>1m31.846s seconds</strong>";</script><script type="text/javascript">document.getElementById('totals').innerHTML = "1
I want to extract the value of the src attribute, i.e. Logs/P2P2014-04-10_14-24-49.txt, and put it into a variable (in Ruby or similar). I tried:
text = `grep 'Logs\/.*txt\"'`
But that returns the entire line instead of only the matched text. How do I get this done?
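Try:
text=$(grep -o 'Logs/[^"]*\.txt' file)
The -o option makes grep print only the matching part of each line. Using [^"]* instead of .* stops the match at the closing quote, so the quote itself is not included in the result.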
Using Nokogiri, the problem is easy to solve:
require 'nokogiri'
doc = Nokogiri::HTML.parse <<-html
<img id="img_1" style="display: none" src="Logs/P2P2014-04-10_14-24-49.txt"/></span></div></div>
html
doc.at('#img_1')['src'] # => "Logs/P2P2014-04-10_14-24-49.txt"
Read the Nokogiri tutorials to learn more about it.
Using sed:
sed -n 's/.*src="\([^"]*\)".*/\1/p' file
Using GNU grep, if it supports the -P option:
grep -Po '(?<=src=")[^"]*' file
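The sed approach above can be wrapped in a small function so that the result ends up in a shell variable. This is only a minimal sketch, assuming a POSIX shell; extract_src and sample are hypothetical names, and the sample line is a shortened copy of the input from the question.

```shell
# extract_src (hypothetical helper): read lines on stdin and print the value
# of a src="..." attribute. Because the leading .* is greedy, this is the
# last src attribute on each line.
extract_src() {
    sed -n 's/.*src="\([^"]*\)".*/\1/p'
}

# Shortened sample line modeled on the question's input.
sample='<img id="img_1" style="display: none" src="Logs/P2P2014-04-10_14-24-49.txt"/></span>'

text=$(printf '%s\n' "$sample" | extract_src)
printf '%s\n' "$text"
```

Lines without a src attribute produce no output, because -n suppresses the default printing and the p flag only fires when the substitution succeeds.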
Related
I want to parse my website, find the <iframe> tag and get its URL (the src attribute).
I tried it like this:
url=`wget -O - http://my-url.com/site 2>&1 | grep iframe`
echo $url
With this, I get the whole HTML line:
<iframe src="//player.vimeo.com/video/AAAAAAAA?title=0&byline=0&portrait=0" width="480" height="360" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen></iframe> </div>
So how can I extract the URL from it?
I tried a few sed expressions, but couldn't get it to work :( Here's what I tried:
wget -O - http://myurl.com/ 2>&1 | grep iframe | sed "s/<iframe src/\\n<iframe src/g"
Kind regards,
Matt ;)
sed -n '/<iframe/s/^.*<iframe src="\([^"]*\)".*/\1/p'
You don't need grep, sed pattern matching can do that. Then you use a capture group with \(...\) to pick out the URL inside the quotes in the src attribute.
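To see the address and the capture group working together, here is a small sketch that feeds two lines to the same sed command. The first line (an img tag with logo.png) is made up for illustration; the second is a shortened version of the line from the question.

```shell
# Two sample lines: only the second one contains an <iframe>.
# The <img> line is a hypothetical addition to show that it is skipped.
page='<img src="logo.png" alt="logo">
<iframe src="//player.vimeo.com/video/AAAAAAAA?title=0&byline=0&portrait=0" width="480" height="360"></iframe> </div>'

# /<iframe/ restricts the substitution to lines containing an iframe;
# \( \) captures the src value and \1 replaces the whole line with it.
url=$(printf '%s\n' "$page" | sed -n '/<iframe/s/^.*<iframe src="\([^"]*\)".*/\1/p')
printf '%s\n' "$url"
```

The img line is never printed: it does not match the /<iframe/ address, and -n suppresses sed's default output.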
You don't need sed; cut is sufficient, as long as src is the first quoted attribute on the line:
~$ url='<iframe src="//player.vimeo.com/video/AAAAAAAA?title=0&byline=0&portrait=0" width="480" height="360" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen></iframe> </div>'
~$ echo "$url" | cut -d'"' -f2
//player.vimeo.com/video/AAAAAAAA?title=0&byline=0&portrait=0
I have the HTML source of a page like:
href="http://path/to/file.bz2">german.txt.bz2</a> (2,371,487 bytes)</td>
<td><a rel="nofollow" class="external text" href="http://a/web/page/">American cities</a></td>
<td><a rel="nofollow" class="external text" href="http://another/page/to.bz2">us_cities.txt.bz2</a> (77,081 bytes)</td>
<td><a rel="nofollow" class="external text" href="http://other/page/to/file.bz2">test.txt.bz2</a> (7,158,285 bytes)</td>
<td>World's largest test password collection!<br />Created by <a rel="nofollow" class="external text" href="http://page/web.com/">Matt Weir</a>
I want to use a text-processing tool like sed or awk to extract exactly the URLs that end in .bz2, like:
http://path/to/file.bz2
http://another/page/to.bz2
http://other/page/to/file.bz2
Could you help me?
Sed and grep:
sed 's/.*href=\"\(.*\)\".*/\1/g' file | grep -oP '.*\.bz2$'
$ sed -n 's/.*href="\([^"]*\.bz2\)".*/\1/p' file
http://path/to/file.bz2
http://another/page/to.bz2
http://other/page/to/file.bz2
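The sed command above prints at most one URL per input line. If several links can share a line, one possible variation is to let grep -o split the matches onto separate lines first. A sketch, assuming a grep that supports -o (GNU and BSD grep both do); the single-line layout of the sample is an assumption, while the URLs come from the question.

```shell
# Sample with three links on one line (hypothetical layout).
html='<td><a href="http://path/to/file.bz2">german.txt.bz2</a> <a href="http://a/web/page/">American cities</a> <a href="http://another/page/to.bz2">us_cities.txt.bz2</a></td>'

# Step 1: grep -o prints every href="...bz2" match on its own line.
# Step 2: sed strips the leading href=" and the closing quote.
links=$(printf '%s\n' "$html" \
    | grep -o 'href="[^"]*\.bz2"' \
    | sed 's/^href="//; s/"$//')
printf '%s\n' "$links"
```

The [^"]* part cannot cross a quote, so the link to http://a/web/page/ is skipped rather than being merged with a later .bz2 link.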
Use a proper parser. For example, using xsh:
open :F html input.html ;
for //a/@href['bz2' = xsh:matches(., '\.bz2$')]
echo (.) ;
I have a div on all of my eCommerce site's pages holding SEO content. I'd like to count the number of words in that div. It's for diagnosing empty pages in a large crawl.
The div always starts as follows:
<div class="box fct-seo fct-text
It then contains <h1>, <p> and <a> tags, and it then, obviously, closes with </div>.
How can I, using sed, awk, wc, etc., take all the code between the start of the div and its closing </div> and count how many words occur? If it's 90% accurate, I'm happy.
You'd somehow have to tell it to stop scanning before the first closing </div> it finds.
Here's an example page to work with:
http://www.zando.co.za/women/shoes/
Much appreciated.
-P
When it gets more complicated (like divs nested within that div), the regex approach won't work anymore and you need an HTML parser, like the one in my Xidel. Then you can find the text
either with CSS:
xidel http://www.zando.co.za/women/shoes/ -e 'css(".fct-seo")' | wc -w
or pattern matching:
xidel http://www.zando.co.za/women/shoes/ -e '<div class="box fct-seo fct-text">{.}</div>' | wc -w
It will also print only the text, not the HTML tags. (If you wanted them, you could add the --printed-node-format xml option.)
In a Perl one-liner you can use the .. operator to specify the patterns that match the beginning and end of the region you're interested in:
$ perl -wne 'print if /<div class="box fct-seo fct-text/ .. /<\/div>/' shoes.html
You can then count the words with wc -w:
$ perl -wne 'print if /<div class="box fct-seo fct-text/ .. /<\/div>/' shoes.html | wc -w
If counting the ‘words’ in the HTML tags themselves is affecting the numbers enough to affect the accuracy, you can remove those from the count with something like:
$ perl -wne 'next unless /<div class="box fct-seo fct-text/ .. /<\/div>/; s/<.*?>//g; print' shoes.html | wc -w
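The same range idea can be sketched with sed alone, for cases where Perl is not at hand. This is an illustration on a made-up stand-in for shoes.html, not the real page; like the Perl version, it stops at the first closing </div>, so nested divs cut the region short.

```shell
# Small stand-in for shoes.html (hypothetical content).
page='<div class="header">Menu</div>
<div class="box fct-seo fct-text">
<h1>Women shoes</h1>
<p>Browse our <a href="/heels/">high heels</a> today.</p>
</div>
<div class="footer">Contact us</div>'

# 1) keep only the lines from the opening div up to the first </div> after it
# 2) drop anything that looks like a tag
# 3) count the remaining words
count=$(printf '%s\n' "$page" \
    | sed -n '/<div class="box fct-seo fct-text/,/<\/div>/p' \
    | sed 's/<[^>]*>//g' \
    | wc -w)
# Some wc implementations pad the number with spaces; this normalizes it.
count=$((count))
echo "$count"
```

Tags that span several lines would survive the second sed, so the count is an approximation, which matches the 90% accuracy the question asks for.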
Try:
grep -Pzo '(?<=<div)(.*?\n)*?.*?(?=</div)' -n inputFile.html | sed 's/^[^>]*>//'
I have a file with many lines like:
lily weisy
I want to extract www.youtube.com/user/airuike and lily weisy, and I also want to separate airuike from www.youtube.com/user/,
so that I get three strings: www.youtube.com/user/airuike, airuike and lily weisy.
How can I achieve this? Thanks.
Do this:
sed -e 's/.*href="\([^"]*\)".*>\([^<]*\)<.*/link:\1 name:\2/' < data
This will give you the first part, but I'm not sure what you want to do with it after that.
Since it is HTML, and HTML should be parsed with an HTML parser and not with grep/sed/awk, you could use the pattern matching function of my Xidel.
xidel yourfile.html -e '<a class="yt-uix-sessionlink yt-user-name " dir="ltr">{$link := @href, $user := substring-after($link, "www.youtube.com/user/"), $name:=text()}</a>*'
Or if you want a CSV-like result:
xidel yourfile.html -e '<a class="yt-uix-sessionlink yt-user-name " dir="ltr">{string-join((@href, substring-after(@href, "www.youtube.com/user/"), text()), ", ")}</a>*' --hide-variable-names
It is a pity that you also want the airuike string; otherwise it could be as simple as
xidel /yourfile.html -e '{$name}*'
(and you were supposed to be able to use xidel '{$name}*', but it seems I haven't thought the syntax through. Just one error check and it is breaking everything. )
$ awk '{split($0,a,/(["<>]|:\/\/)/); u=a[4]; sub(/.*\//,"",a[4]); print u,a[4],a[12]}' file
www.youtube.com/user/airuike airuike lily weisy
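Something like this should work:
while IFS= read -r line
do
    # URL: everything from "http" up to the closing quote
    href=$(echo "$line" | grep -o 'http[^"]*')
    # user: the part of the URL after the last slash
    user=$(echo "$href" | grep -o '[^/]*$')
    # text: the anchor text right before the final </a>
    text=$(echo "$line" | grep -o '[^>]*<\/a>$' | grep -o '^[^<]*')
    echo "href: $href"
    echo "user: $user"
    echo "text: $text"
done < yourfile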
Regular expressions basics: http://en.wikipedia.org/wiki/Regular_expression#POSIX_Basic_Regular_Expressions
Upd: checked and fixed
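For comparison, the three strings can also be pulled out of a single line with sed and a parameter expansion instead of chained grep calls. This is only a sketch: the input line below is hypothetical (its attribute layout is an assumption), and link, user and name are illustrative variable names.

```shell
# Hypothetical input line; the exact attributes are an assumption.
line='<a href="http://www.youtube.com/user/airuike" class="yt-user-name" dir="ltr">lily weisy</a>'

# link: the href value without its scheme (http://, https://, ...)
link=$(printf '%s\n' "$line" | sed -n 's/.*href="[a-z]*:\/\/\([^"]*\)".*/\1/p')

# user: everything after the last slash of the link
user=${link##*/}

# name: the text between the last > and the closing </a>
name=$(printf '%s\n' "$line" | sed -n 's/.*>\([^<]*\)<\/a>.*/\1/p')

printf '%s\n' "$link" "$user" "$name"
```

The ${link##*/} expansion is plain POSIX shell, so no extra process is needed for the user name.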
What sed command, in a Mac shell script, would replace all occurrences of the string "fox" with the entire content of myFile.txt?
myFile.txt contains HTML with line breaks and all kinds of characters. An example would be:
</div>
</div>
<br>
<div id="container2">
<div class="question" onclick="javascript:show('answer2')";>
Thanks!
EDIT 1
This is my actual code:
sed -i.bkp '/Q/{
s/Q//g
r /Users/ericbrotto/Desktop/question.txt
}' $file
When I run it I get:
sed in place editing only works for regular files.
And in my files the Q is replaced by a ton of Chinese characters (!). Bizarre!
You can use the r command. When you find a 'fox' in the input...
/fox/{
...replace it with nothing...
s/fox//g
...and read the input file:
r f.html
}
If you have a file such as:
$ cat file.txt
the
quick
brown
fox
jumps
over
the lazy dog
fox dog
the result is:
$ sed '/fox/{
s/fox//g
r f.html
}' file.txt
the
quick
brown
</div>
</div>
<br>
<div id="container2">
<div class="question" onclick="javascript:show('answer2')";>
jumps
over
the lazy dog
dog
</div>
</div>
<br>
<div id="container2">
<div class="question" onclick="javascript:show('answer2')";>
EDIT: to alter the file being processed, just pass the -i flag to sed:
sed -i '/fox/{
s/fox//g
r f.html
}' file.txt
Some sed versions (such as mine) require you to pass an extension to the -i flag; it is used as the suffix of a backup file holding the old content of the file:
sed -i.bkp '/fox/{
s/fox//g
r f.html
}' file.txt
And here is much the same thing as a one-liner, which is also compatible with a Makefile (note that it deletes the whole matching line rather than only the word fox):
sed -i -e '/fox/{r f.html' -e 'd}' file.txt
Ultimately, what I went with is a lot simpler than many of the solutions I found online:
str=xxxx
sed -e "/$str/r FileB" -e "/$str/d" FileA
Supports templating like so:
str=xxxx
sed -e "/$str/r $fileToInsert" -e "/$str/d" $fileToModify
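Since the -i flag behaves differently between sed versions, one way to sidestep it is to write to a temporary file and move it over the original. The following is a sketch under that assumption; replace_line_with_file is a hypothetical helper, and the files are created in a scratch directory purely for the demonstration. The check for a regular file mirrors the error message quoted in the question.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
printf '%s\n' 'the quick brown' 'fox' 'jumps over' > "$workdir/file.txt"
printf '%s\n' '<br>' '<div id="container2">' > "$workdir/f.html"

# replace_line_with_file MARKER INSERT TARGET (hypothetical helper)
# Replaces every line of TARGET matching MARKER with the contents of INSERT.
# MARKER is used as a sed regex, and INSERT must not contain newlines.
replace_line_with_file() {
    marker=$1 insert=$2 target=$3
    [ -f "$target" ] || { echo "not a regular file: $target" >&2; return 1; }
    tmp=$(mktemp) || return 1
    # r queues INSERT for output; d then deletes the marker line itself.
    sed -e "/$marker/r $insert" -e "/$marker/d" "$target" > "$tmp" &&
        mv "$tmp" "$target"
}

replace_line_with_file fox "$workdir/f.html" "$workdir/file.txt"
result=$(cat "$workdir/file.txt")
printf '%s\n' "$result"
rm -rf "$workdir"
```

Moving the temporary file over the target replaces it with a new file, so permissions and hard links of the original are not preserved; where that matters, cat "$tmp" > "$target" is a possible alternative.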
Another method (a minor variation on the other solutions):
If your filenames are also variables (e.g. $file is f.html and the file you are updating is $targetFile):
sed -e "/fox/ {" -e "r $file" -e "d" -e "}" -i "$targetFile"