Xidel: How to printout the node instead of the value? - xpath

HTML example:
<p class="si-price" data-price="910000.00">Rp 910.000</p>
In xidel, I run this:
xidel -se '(//p[@class="si-price"])[1]' 'https://www.anekalogam.co.id/id'
What I want is 910000, the value of the data-price attribute, instead of Rp 910.000.
Can I do that?

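Select the data-price attribute instead of the element. Without the [1] predicate, this prints the attribute of every matching p:
xidel -se '(//p[@class="si-price"])/@data-price' 'https://www.anekalogam.co.id/id'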
Output:
905000.00
1750000.00
2595000.00
4300000.00
8550000.00
21250000.00
42400000.00
84000000.00
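The question asked for 910000 rather than 910000.00. As a minimal sketch, assuming xidel is not at hand and the markup looks exactly like the sample line in the question, the attribute can be pulled out with sed and the decimals dropped with shell parameter expansion. The variable names html, price and price_int are made up for this illustration, and a regex like this is fragile on real pages.

```shell
# Sample markup copied from the question.
html='<p class="si-price" data-price="910000.00">Rp 910.000</p>'

# Capture whatever sits between the quotes of data-price="...".
price=$(printf '%s\n' "$html" | sed -n 's/.*data-price="\([^"]*\)".*/\1/p')

# Remove the shortest suffix that starts with a dot, i.e. the ".00" part.
price_int=${price%.*}

printf '%s\n' "$price_int"   # prints 910000
```

The same ${price%.*} expansion can be applied to each line of xidel's output if xidel is doing the extraction.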

Related

How to scrape Wikipedia GPS latitude/longitude?

I have been wondering how to scrape information from Wikipedia. For example, I have a list of world cities and want to obtain their approximate latitude and longitude. Take Miami as an example. When I run curl https://en.wikipedia.org/wiki/Miami | grep -E '(latitude|longitude)', somewhere in the HTML there is markup like this:
<span class="latitude">25°46′31″N</span> <span class="longitude">80°12′31″W</span>
I know I can extract it with a regular expression, but my regex skills are very poor. Can anyone help me with this?
With xidel and XPath:
$ xidel -se '
concat(
(//span[@class="latitude"]/text())[1],
" ",
(//span[@class="longitude"]/text())[1]
)
' 'https://en.wikipedia.org/wiki/Miami'
Output
25°46′31″N 80°12′31″W
Or
saxon-lint --html --xpath '<XPATH EXP>' <URL>
If you prefer more widely known tools:
curl -s 'https://en.wikipedia.org/wiki/Miami' > Miami.html
xmlstarlet format -H Miami.html 2>/dev/null | sponge Miami.html
xmlstarlet sel -t -v '<XPATH EXP>' Miami.html
Although you did not ask: regular expressions are not the right tool for parsing HTML.
You can't parse HTML with RegEx. Please use an HTML-parser like xidel instead:
$ xidel -s "https://en.wikipedia.org/wiki/Miami" -e '
(//span[@class="geo-dms"])[1],
(//span[@class="geo-dec"])[1],
(//span[@class="geo"])[1],
replace((//span[@class="geo"])[1],";",())
'
25°46′31″N 80°12′31″W
25.775163°N 80.208615°W
25.775163; -80.208615
25.775163 -80.208615
Take your pick.
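If latitude and longitude are needed as separate values, the "geo" form shown above is the easiest one to split. A small sketch, assuming the string has exactly the "lat; lon" shape printed above (the variable names geo, lat and lon are chosen for this example only):

```shell
# One line of output in the "geo" format shown above.
geo='25.775163; -80.208615'

# Remove the longest suffix starting at the first ';' to keep the latitude.
lat=${geo%%;*}

# Remove the shortest prefix ending in '; ' to keep the longitude.
lon=${geo#*"; "}

printf 'lat=%s lon=%s\n' "$lat" "$lon"   # prints lat=25.775163 lon=-80.208615
```

Plain parameter expansion avoids spawning cut or awk for every city in a long list.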

How to replace any text between html tags

I have text between HTML tags. For example:
<td>vip</td>
The text between the <td></td> tags can be anything.
How can I remove whatever text is between these tags and put other text in its place?
I need to do it via bash/shell.
How can I do this?
First of all, I tried to extract this text, but without success:
sed -n "/<td>/,/<\/td>/p" test.txt
As a result I get <td>vip</td>, but according to the documentation I should get only vip.
You can try this:
sed -i -e 's/\(<td>\).*\(<\/td>\)/<td>TEXT_TO_REPLACE_BY<\/td>/g' test.txt
Note that this only works for <td> tags. It replaces the whole <td>...</td> span, tags included, and writes the tags back around TEXT_TO_REPLACE_BY. Because .* is greedy, a line with several cells is replaced from the first <td> to the last </td>.
You can use this to get the value vip:
sed -e 's,.*<td>\([^<]*\)</td>.*,\1,g'
If your input file looks the same as the example shown, the following may help too.
echo "<td>vip</td>" | awk -F"[><]" '{print $3}'
This prints the tag with echo, lets awk split the line on > and <, and prints the third field, which is the text you asked for.
d=$'<td>vip</td>\n<table>vip</table>\n<td>more data here</td>'
echo "$d"
<td>vip</td>
<table>vip</table>
<td>more data here</td>
awk '/<td>/{match($0,/(<.*>)(.*)(<\/.*>)/,t);print t[1] "something" t[3];next}1' <<<"$d"
<td>something</td>
<table>vip</table>
<td>something</td>
awk '/<table>/{match($0,/(<.*>)(.*)(<\/.*>)/,t);print t[1] "something" t[3];next}1' <<<"$d"
<td>vip</td>
<table>something</table>
<td>more data here</td>
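The three-argument match() used above is a GNU awk extension. As a portable sketch of the same replacement, here is a small sed wrapper; the function name replace_td is invented for this example, and it assumes the replacement text contains no '/', '&' or backslash, since those are special to sed.

```shell
# Replace the text of every <td>...</td> cell read from standard input.
# $1: replacement text (assumed to contain no '/', '&' or backslash).
replace_td() {
  sed "s/<td>[^<]*<\/td>/<td>$1<\/td>/g"
}

# Same sample data as above; only the <td> lines change.
printf '%s\n' '<td>vip</td>' '<table>vip</table>' '<td>more data here</td>' |
  replace_td something
```

Using [^<]* instead of .* keeps each match inside a single cell, so two cells on one line are replaced separately.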

Shell script: Parse URL for iFrame and get iFrame URL

I want to parse my website, search for the <iframe> tag and get its URL (the src attribute).
I tried it like this:
url=`wget -O - http://my-url.com/site 2>&1 | grep iframe`
echo $url
With this, I get the whole HTML line:
<iframe src="//player.vimeo.com/video/AAAAAAAA?title=0&byline=0&portrait=0" width="480" height="360" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen></iframe> </div>
So how can I extract the URL now?
I tried a few sed expressions, but none of them worked. Here's what I tried:
wget -O - http://myurl.com/ 2>&1 | grep iframe | sed "s/<iframe src/\\n<iframe src/g"
Kind regards,
Matt ;)
sed -n '/<iframe/s/^.*<iframe src="\([^"]*\)".*/\1/p'
You don't need grep; sed's own pattern matching can select the line. A capture group \(...\) then picks out the URL inside the quotes of the src attribute.
You don't need sed, cut is sufficient:
~$ url='<iframe src="//player.vimeo.com/video/AAAAAAAA?title=0&byline=0&portrait=0" width="480" height="360" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen></iframe> </div>'
~$ echo "$url" | cut -d'"' -f 2
//player.vimeo.com/video/AAAAAAAA?title=0&byline=0&portrait=0
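The extracted value is protocol-relative (it starts with //), so tools such as wget or curl cannot fetch it as is. A sketch of one way to extract and normalise it, assuming https is the scheme you want (that choice is an assumption, not something stated in the question):

```shell
# The HTML line from the question.
line='<iframe src="//player.vimeo.com/video/AAAAAAAA?title=0&byline=0&portrait=0" width="480" height="360" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen></iframe> </div>'

# Capture the src value of the iframe tag.
src=$(printf '%s\n' "$line" | sed -n 's/.*<iframe[^>]*src="\([^"]*\)".*/\1/p')

# Prefix a scheme when the URL is protocol-relative.
case $src in
  //*) src="https:$src" ;;
esac

printf '%s\n' "$src"
```

Allowing [^>]* between <iframe and src= means the pattern still matches when src is not the first attribute.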

How do I get a selection from the output of a grep

I have the following text in a file :
<img id="img_1" style="display: none" src="Logs/P2P2014-04-10_14-24-49.txt"/></span></div></div><script type="text/javascript">document.getElementById('duration').innerHTML = "Finished in <strong>1m31.846s seconds</strong>";</script><script type="text/javascript">document.getElementById('totals').innerHTML = "1
What I want is the part after src, i.e. Logs/P2P2014-04-10_14-24-49.txt, stored in a variable in Ruby or similar.
I tried:
text = `grep 'Logs\/.*txt\"'`
But that returns the entire line instead of only the matching text. How do I get this done?
Try using the -o option, with a pattern that stops before the closing quote:
text=$(grep -o 'Logs/[^"]*\.txt' file)
It returns only the matching part of the line.
Using Nokogiri, the problem is easy to solve:
require 'nokogiri'
doc = Nokogiri::HTML.parse <<-html
<img id="img_1" style="display: none" src="Logs/P2P2014-04-10_14-24-49.txt"/></span></div></div>
html
doc.at('#img_1')['src'] # => "Logs/P2P2014-04-10_14-24-49.txt"
Read tutorials to understand and learn Nokogiri.
Using sed
sed -n 's/.*src="\([^"]*\)".*/\1/p' file
Using GNU grep, if it supports the -P option:
grep -Po '(?<=src=")[^"]*' file
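Why the answers above use [^"]* rather than .* is easiest to see on a line with more than one quoted .txt path. A short sketch with a made-up sample line (the file names run1.txt and notes.txt are invented for illustration):

```shell
# Hypothetical line containing two quoted paths that end in .txt.
line='<img src="Logs/run1.txt"/><a href="notes.txt">notes</a>'

# Greedy: .* runs on to the last txt" on the line.
greedy=$(printf '%s\n' "$line" | grep -o 'Logs/.*txt"')

# Bounded: [^"]* stops at the first closing quote.
bounded=$(printf '%s\n' "$line" | grep -o 'Logs/[^"]*')

printf '%s\n' "$greedy" "$bounded"
```

The first printed line is Logs/run1.txt"/><a href="notes.txt" and the second is Logs/run1.txt.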

extract substring from lines using grep, awk,sed or etc

I have a file with many lines like:
lily weisy
I want to extract www.youtube.com/user/airuike and lily weisy, and I also want to separate airuike from www.youtube.com/user/.
So I want to get 3 strings: www.youtube.com/user/airuike, airuike and lily weisy.
How can I achieve this? Thanks.
Do this:
sed -e 's/.*href="\([^"]*\)".*>\([^<]*\)<.*/link:\1 name:\2/' < data
This gives you the first part, but I'm not sure what you want to do with it afterwards.
Since it is HTML, and HTML should be parsed with an HTML parser rather than with grep/sed/awk, you could use the pattern matching feature of my Xidel:
xidel yourfile.html -e '<a class="yt-uix-sessionlink yt-user-name " dir="ltr">{$link := @href, $user := substring-after($link, "www.youtube.com/user/"), $name:=text()}</a>*'
Or, if you want a CSV-like result:
xidel yourfile.html -e '<a class="yt-uix-sessionlink yt-user-name " dir="ltr">{string-join((@href, substring-after(@href, "www.youtube.com/user/"), text()), ", ")}</a>*' --hide-variable-names
It is a pity that you also want the airuike string; otherwise it could be as simple as
xidel /yourfile.html -e '{$name}*'
(and you were supposed to be able to use xidel '{$name}*', but it seems I haven't thought the syntax through. Just one error check and it is breaking everything. )
$ awk '{split($0,a,/(["<>]|:\/\/)/); u=a[4]; sub(/.*\//,"",a[4]); print u,a[4],a[12]}' file
www.youtube.com/user/airuike airuike lily weisy
I think something like this should work:
while IFS= read -r line
do
href=$(echo "$line" | grep -o 'http[^"]*')
user=$(echo "$href" | grep -o '[^/]*$')
text=$(echo "$line" | grep -o '[^>]*<\/a>$' | grep -o '^[^<]*')
echo "href: $href"
echo "user: $user"
echo "text: $text"
done < yourfile
Regular expressions basics: http://en.wikipedia.org/wiki/Regular_expression#POSIX_Basic_Regular_Expressions
Upd: checked and fixed
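Once the link is in a variable, the user name can also be split off without another grep. A minimal sketch using shell parameter expansion, with the URL string from the question (the variable names href, user and prefix are illustrative only):

```shell
# The link as given in the question.
href='www.youtube.com/user/airuike'

# Remove the longest prefix ending in '/' to keep the last path segment.
user=${href##*/}

# Remove that segment from the end to keep the leading part.
prefix=${href%"$user"}

printf '%s\n' "$href" "$user" "$prefix"
```

This prints www.youtube.com/user/airuike, airuike and www.youtube.com/user/ on separate lines, and it assumes the link has no trailing slash.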
