Find Only 1st Level XPath Children

I have the following HTML code:
<div>
<div>
<span>test1</span>
</div>
<span>test2</span>
<span>test3</span>
<span>test4</span>
<div>
<span>test5</span>
</div>
<span>test6</span>
</div>
How can I select all span elements that are direct children of the first div (the elements with innerText test2, test3, test4 and test6)?

This XPath will get you what you want:
'//span[not(parent::div[parent::div])]'
xmllint --html --xpath '//span[not(parent::div[parent::div])]' test.html | sed -re 's%(</[^>]+>)%\1\n%g'
<span>test2</span>
<span>test3</span>
<span>test4</span>
<span>test6</span>
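
As a hedged alternative to the negated-parent predicate above, the same spans can be reached with plain child steps. This is a minimal sketch, assuming xmllint is installed and that its HTML parser wraps the fragment in html and body elements (which is why the path starts at /html/body). The one-line sample document is the question's markup with the line breaks removed.

```shell
#!/bin/sh
# Sample document from the question, collapsed onto one line.
html='<div><div><span>test1</span></div><span>test2</span><span>test3</span><span>test4</span><div><span>test5</span></div><span>test6</span></div>'

# Each "/" step descends exactly one level, so only spans whose parent is
# the outermost div are selected; spans inside the nested divs are skipped.
# Assumption: xmllint --html adds the implied <html><body> wrappers.
if command -v xmllint >/dev/null 2>&1; then
  result=$(printf '%s\n' "$html" | xmllint --html --xpath '/html/body/div/span' - 2>/dev/null)
  printf '%s\n' "$result"
else
  result=''
  echo 'xmllint not found; skipping' >&2
fi
```

Whether the matched nodes are printed on one line or on separate lines depends on the xmllint version, which is why the original answer pipes the output through sed.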

Related

How to read text with space in xpath

I have a list box with the value "This is_a_test"
(there is a space after the This)
$x('//li[contains(@class,"myClass")][text()="This is_a_test"]')
When I run it, I get an empty list []
I also tried
$x('//li[contains(@class,"myClass")][text()="This<b></b> <b></b>is_a_test"]')
What do I need to change in my expression?
The XML
<li .... >
"This"
<b></b>
<b></b>
"is_a_test"
</li>

The browser "shows" a space, but on the command line you get
xmllint --html --xpath "//li[contains(@class,'myclass')]/text()" test.html
"This"
"is_a_test"
So the result contains 3 newlines, which are also part of the text() output.
After removing the newlines from the HTML, this XPath works (reversing quotes for simplicity):
echo '<li class="myclass">"This"<b></b> <b></b>"is_a_test"</li>' | \
xmllint --html --xpath "//li[contains(@class,'myclass')][.='\"This\" \"is_a_test\"']/text()" -
Result:
"This" "is_a_test"
Please note the dot (.) instead of text(). The dot compares against the string value of the whole li element (all of its descendant text joined together), while text() selects the individual text nodes, so that comparison only succeeds if a single text node equals the whole string.
It's not easy to represent newlines in an XPath expression. Also, you may want to check this answer for more info on the difference between the dot and text().
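
One way to sidestep the newline problem without editing the HTML is the XPath 1.0 function normalize-space(), which trims the string value and collapses every run of whitespace (newlines included) into a single space. The following is a minimal sketch, assuming xmllint is installed. The sample markup is an assumption: it follows the question's structure but with no literal quote characters in the text and a surrounding ul added for illustration.

```shell
#!/bin/sh
# Assumed sample: text and empty <b> elements on separate lines.
html='<ul><li class="myClass">This
<b></b>
<b></b>
is_a_test</li></ul>'

# normalize-space(.) collapses the newlines between the text nodes into
# single spaces, so the comparison string can be written on one line.
if command -v xmllint >/dev/null 2>&1; then
  match=$(printf '%s\n' "$html" |
    xmllint --html --xpath "//li[contains(@class,'myClass')][normalize-space(.)='This is_a_test']" - 2>/dev/null)
  printf '%s\n' "$match"
else
  match=''
  echo 'xmllint not found; skipping' >&2
fi
```

The same predicate should also work in the browser console with $x(...), since normalize-space() is part of XPath 1.0.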

Deleting <a> tag in the middle of other tags

I have several lines in html files that look like this:
<div class="thumb tright">
<div class="thumbinner" style="width:302px;">
<a href="https://example.com/en/File:Tools_my_settings.png" class="image">
<img alt="" src="images_en/thumb/0/0a/tool_settings.png/9dd94c2d99eea9.png" width="300" height="110" class="thumbimage" srcset="/my/en/images_en/thumb/0/0a/my_settings.png/450px-my_settings.png 1.5x, /31/en/images_en/thumb/0/0a/my_settings.png/600px-my_settings.png 2x"/>
</a>
<div class="thumbcaption">
<div class="magnify">
</div>
Tool settings
</div>
</div>
</div>Tools Features - So Far
I need to delete the following <a href> tag and the corresponding closing tag </a> immediately after the .png 2x"/> text element.
...
In the end I need the lines to look like this:
<div class="thumb tright">
<div class="thumbinner" style="width:302px;">
<img alt="" src="images_en/thumb/0/0a/tool_settings.png/9dd94c2d99eea9.png" width="300" height="110" class="thumbimage" srcset="/my/en/images_en/thumb/0/0a/my_settings.png/450px-my_settings.png 1.5x, /31/en/images_en/thumb/0/0a/my_settings.png/600px-my_settings.png 2x"/>
<div class="thumbcaption">
<div class="magnify">
</div>
Tool settings
</div>
</div>
</div>Tools Features - So Far
All files contain the same pattern: <a href="https://choopy.com/en/File:...
This is what I have tried:
find /var/www/clients/client1/web2/web/lms_docs/ -type f -print0 | xargs -0 sed 's/<a\shref="https:\/\/choopy.com\/en\/File:([--:\w?#%&+~#=]*[a-z])\.png"\sclass="image">//g'
but it doesn't do anything, and I don't know how to delete the corresponding closing tag </a>.

This deletes every <a href> of class image that points to https://...com, together with the corresponding </a>:
find /var/www/clients/client1/web2/web/lms_docs/ -type f -print0 | xargs -0 sed '/<a href=\"https:\/\/.*\.com\/en\/File:.*\" class=\"image\">/,/<\/a>/{ /<a href=\"https:\/\/.*\.com\/en\/File:.*\" class=\"image\">/d; /<\/a>/d}'
And this one is for a specific domain, here https://example.com:
find /var/www/clients/client1/web2/web/lms_docs/ -type f -print0 | xargs -0 sed '/<a href=\"https:\/\/example\.com\/en\/File:.*\" class=\"image\">/,/<\/a>/{ /<a href=\"https:\/\/example\.com\/en\/File:.*\" class=\"image\">/d; /<\/a>/d}'
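It works like this: the address range "/ /,/ /" matches all lines from the <a href...> with class image up to the corresponding </a>.
Then, within the matched block "{ }", the same two patterns are matched again and those lines are deleted with "d".
Note that d deletes whole lines, so this relies on each tag sitting on a line of its own, and that sed only prints the result to standard output unless it is told to edit the files in place (-i in GNU sed).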
More info: section 4.24
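
A minimal sketch of a variant of the range approach above, assuming the example.com URL pattern from the question. Instead of deleting whole lines, it strips only the two tags inside the matched range and then drops any line left empty, so it also copes with a tag that shares its line with other markup. The sample input and the file name in src are made up for illustration, and # is used as the regex delimiter to avoid escaping every slash.

```shell
#!/bin/sh
# Made-up sample: one image link to remove, one unrelated link to keep.
html='<div class="thumbinner">
<a href="https://example.com/en/File:Tools_my_settings.png" class="image">
<img alt="" src="tool_settings.png"/>
</a>
<a href="/other">keep me</a>
</div>'

# Range: from the opening image link to the next </a>.
# Inside the range, remove just the tags, then delete lines that became empty.
cleaned=$(printf '%s\n' "$html" | sed '\#<a href="https://example\.com/en/File:[^"]*" class="image">#,\#</a>#{
  s#<a href="https://example\.com/en/File:[^"]*" class="image">##
  s#</a>##
  /^[[:space:]]*$/d
}')
printf '%s\n' "$cleaned"
```

Because the range ends at the first </a> after the opening tag, the unrelated link on the following line is left untouched.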

Multiple occurrences in sed substitution

I am trying to retrieve some data within a specific div tag in my html file.
My current html code is in the following format.
<div class = "class0">
<div class = "class1">
<div class = "class2">
some text some text
</div>
Some more text
</div>
Too much text
</div>
When I try to extract the text from just the div with class2, using the bash code
sed -e ':a;N;$!ba
s/[[:space:]]\+/ /g
s/.*<div class\="class2">\(.*\).*/\1/g' test.html > out.html
the output HTML file contains
some text some text </div> Some more text </div> Too much text
I want all the data after the first </div> to be removed, but instead only the final one is being removed.
Can someone please explain my mistake?

You could do this in awk:
awk '/class2/,/<\/div>/ {a[++i]=$0}END{for (j=2;j<i;++j) print a[j]}' file
Between the lines that match /class2/ and /<\/div>/, write the contents to an array. At the end of the file loop through the array, skipping the first and last lines.
Instead of making an array, you could check for the first and last lines using a regular expression:
awk '/class2/,/<\/div>/ {if (!/class2|<\/div>/) print}' file
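
The same selection can also be sketched with a flag variable instead of a range pattern, which makes it explicit why the two tag lines are not printed. This is a minimal sketch using the sample markup from the question; the variable name f is arbitrary.

```shell
#!/bin/sh
# Sample input from the question, one tag or text run per line.
html='<div class = "class0">
<div class = "class1">
<div class = "class2">
some text some text
</div>
Some more text
</div>
Too much text
</div>'

# f is 1 only between the class2 line and the next </div>.
# Rule order matters: a closing tag clears f before the print rule runs,
# and the class2 line sets f after it, so neither tag line is printed.
inner=$(printf '%s\n' "$html" | awk '
  /<\/div>/ { f = 0 }
  f         { print }
  /class2/  { f = 1 }
')
printf '%s\n' "$inner"
```

Like the range version, this assumes the opening tag, the text and the closing tag are on separate lines.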

This works for retrieving the text inside the div class = "class2" tags:
#!/bin/bash
htmlcode='
<div class = "class0">
<div class = "class1">
<div class = "class2">
some text some text
</div>
Some more text
</div>
Too much text
</div>
'
echo $htmlcode |
sed -e's,<,\
<,g' |
grep 'div class = "class2"' |
sed -e's,>,>\
,g'|
grep -v 'div class = "class2"'
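
For comparison, the sed attempt in the question most likely keeps the trailing text because .* is greedy: nothing in \(.*\).* ties the end of the capture group to the first </div>, so the group runs to the end of the input. A minimal sketch of one way to narrow it, assuming the class2 div contains plain text and no nested tags, is to capture only characters that are not "<". Here tr joins the lines, standing in for the N loop used in the question.

```shell
#!/bin/sh
html='<div class = "class0">
<div class = "class1">
<div class = "class2">
some text some text
</div>
Some more text
</div>
Too much text
</div>'

# tr squeezes every run of whitespace (newlines included) into one space.
# [^<]* cannot run past the next tag, so the capture stops at the first
# </div>; the trailing [^< ] keeps the capture from ending in a space.
text=$(printf '%s\n' "$html" | tr -s '[:space:]' ' ' |
  sed 's/.*<div class = "class2"> *\([^<]*[^< ]\) *<.*/\1/')
printf '%s\n' "$text"
```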

Parse HTML snippet with awk

I am trying to parse an HTML document with awk.
The document contains several <div class="p_header_bottom"></div> blocks
<div class="p_header_bottom">
<span class="fl_r"></span>
287,489 people
</div>
<div class="p_header_bottom">
<span class="fl_r"></span>
5 links
</div>
I am using
awk '/<div class="p_header_bottom">/,/<\/div>/'
to receive all such div's.
How can I get the number 287,489 from the first one?
In fact, awk '/<\/span>/,/people/' doesn't work correctly.

With gawk (a regular expression as RS is an extension to POSIX awk), and assuming that the only digits and commas within each <div> </div> block occur in the numeric portion of interest:
awk -v RS='<[/]?div[^>]*>' '/span/ && /people/{gsub(/[^[:digit:],]/, ""); print}' file.txt
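
Where a regular-expression record separator is not available, a range pattern gives a similar result. This is a minimal sketch, assuming (as in the sample from the question) that the number and the word "people" sit on the same line inside the div.

```shell
#!/bin/sh
# Sample input from the question.
html='<div class="p_header_bottom">
<span class="fl_r"></span>
287,489 people
</div>
<div class="p_header_bottom">
<span class="fl_r"></span>
5 links
</div>'

# Look only at lines inside a p_header_bottom block; on the line that
# mentions "people", strip everything except digits and commas.
count=$(printf '%s\n' "$html" | awk '
  /<div class="p_header_bottom">/,/<\/div>/ {
    if ($0 ~ /people/) {
      gsub(/[^0-9,]/, "")
      print
    }
  }')
printf '%s\n' "$count"
```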

Shell: Extract some code from HTML

I have the following code snippet from a HTML file:
<div id="rwImages_hidden" style="display:none;">
<img src="http://example.com/images/I/520z3AjKzHL._SL500_AA300_.jpg" style="display:none;"/>
<img src="http://example.com/images/I/519z3AjKzHL._SL75_AA30_.jpg" style="display:none;"/>
<img src="http://example.com/images/I/31F-sI61AyL._SL75_AA30_.jpg" style="display:none;"/>
<img src="http://example.com/images/I/71k-DIrs-8L._AA30_.jpg" style="display:none;"/>
<img src="http://example.com/images/I/61CCOS0NGyL._AA30_.jpg" style="display:none;"/>
</div>
I want to extract the code
520z3AjKzHL
519z3AjKzHL
31F-sI61AyL
71k-DIrs-8L
61CCOS0NGyL
from the HTML.
Please note that: <img src="" style="display:none;"/> must be used because there are other similar urls in HTML file but I only what the ones between <img src="" style="display:none;"/>.
My Code is:
cat HTML | grep -Po '(?<img src="http://example.com/images/I/).*?(?=.jpg" style="display:none;"/>)'
Something seems to be wrong.

You can solve it by using positive look ahead / look behind:
cat HTML | grep -Po "(?<=<img src=\"http://example.com/images/I/).*?(?=\._.*\.jpg\" style=\"display:none;\"/>)"
Regexp breakdown:
.*? match all characters reluctantly
(?<=<img src=...ges/I/) preceded by <img .../I/
(?=\._...ne;\"/>) followed by ._...ne;\"/>

I assume you were looking for a lookbehind to start, which is what was throwing the error.
(?<=foo) not (?<foo).
This gives the result you specified, but I do not know whether you need everything up to the .jpg or not:
cat HTML | grep -Po '(?<=img src="http://example.com/images/I/)[^.]*'
Everything up to, but excluding, the .jpg would be:
cat HTML | grep -Po '(?<=img src="http://example.com/images/I/).*(?=\.jpg)'

And if you consider awk a valid bash solution (splitting each matching line on "/", "." and "_" so that the code lands in the seventh field):
awk -F'[/._]' '/<img src="[^"]*" style="display:none;"\/>/ {print $7}' file
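
On systems where grep lacks -P (Perl-compatible lookarounds), a sed capture group can do the same job. This is a minimal sketch, assuming each img tag sits on one line as in the question. The last sample line is a made-up decoy without the style attribute, added to show that it is skipped.

```shell
#!/bin/sh
html='<div id="rwImages_hidden" style="display:none;">
<img src="http://example.com/images/I/520z3AjKzHL._SL500_AA300_.jpg" style="display:none;"/>
<img src="http://example.com/images/I/519z3AjKzHL._SL75_AA30_.jpg" style="display:none;"/>
<img src="http://example.com/images/I/31F-sI61AyL._SL75_AA30_.jpg" style="display:none;"/>
<img src="http://example.com/images/I/71k-DIrs-8L._AA30_.jpg" style="display:none;"/>
<img src="http://example.com/images/I/61CCOS0NGyL._AA30_.jpg" style="display:none;"/>
</div>
<img src="http://example.com/images/I/DECOY000000._AA30_.jpg" alt="other"/>'

# -n plus the p flag prints only lines where the substitution succeeded.
# \([^.]*\) captures everything between ".../I/" and the first dot.
codes=$(printf '%s\n' "$html" |
  sed -n 's|.*<img src="http://example\.com/images/I/\([^.]*\)\..*\.jpg" style="display:none;"/>.*|\1|p')
printf '%s\n' "$codes"
```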
