Match tag inside tag using bash - bash

I have this html
<article class="article column large-12 small-12 article--nyheter">
<a class="article__link" href="/nyheter/14343208/">
<div class="article__content">
<h2 class="article__title t54 tm24">Person har falt ned bratt terreng - luftambulanse er på vei</h2>
</div>
</a>
</article>
<article class="article column large-6 small-6 article--nyheter">
<a class="article__link" href="/nyheter/14341466/">
<figure class="image image__responsive" style="padding-bottom:42.075%;">
<img class="image__img lazyload" itemprop="image" title="" alt="" src="data:image/gif;base64,R0lGODlhEAAJAIAAAP///wAAACH5BAEAAAAALAAAAAAQAAkAAAIKhI+py+0Po5yUFQA7" />
</figure>
<div class="article__content">
<h2 class="article__title t34 tm24">Vil styrke innsatsen mot vold i nære relasjoner</h2>
</div>
</a>
</article>
I want to get only those HTML tags, in this case article tags, that have a child img tag inside them.
I have this sed command:
sed -n '/<article class.*article--nyheter/,/<\/article>/p' onlyArticlesWithOutSpace.html > test.html
Now what I am trying to achieve is to get only those article tags that have an img tag inside them.
The output I want would be this:
<article class="article column large-6 small-6 article--nyheter">
<a class="article__link" href="/nyheter/14341466/">
<figure class="image image__responsive" style="padding-bottom:42.075%;">
<img class="image__img lazyload" itemprop="image" title="" alt="" src="data:image/gif;base64,R0lGODlhEAAJAIAAAP///wAAACH5BAEAAAAALAAAAAAQAAkAAAIKhI+py+0Po5yUFQA7" />
</figure>
<div class="article__content">
<h2 class="article__title t34 tm24">Vil styrke innsatsen mot vold i nære relasjoner</h2>
</div>
</a>
</article>
I cannot use any XML/HTML parser; I am only looking to use sed, grep, awk, etc.

Careful: parsing XML with sed seems like a good idea, but it is not!
Thanks to Cyrus's comment for pointing to a good reference.
Anyway, you could try this:
sed -ne '/<article/{ :a; N; /<\/article/ ! ba ; /<img/p ; }'
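The same filter can be sketched in awk, which some find easier to read than the sed loop: buffer each block from the opening <article line to the closing </article> line and print the buffer only if an <img line was seen inside it. The temporary file and the shortened sample content are assumptions for illustration, and like the sed version this relies on the tags sitting on separate lines.

```shell
# Hypothetical sample input; the real file would be onlyArticlesWithOutSpace.html.
articles=$(mktemp)
cat > "$articles" <<'EOF'
<article class="article column large-12 small-12 article--nyheter">
<a class="article__link" href="/nyheter/14343208/">
<h2 class="article__title">No image here</h2>
</a>
</article>
<article class="article column large-6 small-6 article--nyheter">
<a class="article__link" href="/nyheter/14341466/">
<img class="image__img lazyload" src="pic.gif" />
<h2 class="article__title">Has an image</h2>
</a>
</article>
EOF

# Buffer every <article>...</article> block; print it only if it contains <img.
filtered=$(awk '
  /<article/    { buf = ""; inside = 1; has_img = 0 }
  inside        { buf = buf $0 ORS; if (/<img/) has_img = 1 }
  /<\/article>/ { if (inside && has_img) printf "%s", buf; inside = 0 }
' "$articles")

printf '%s\n' "$filtered"
```

Only the second article is printed, because the first block never sets has_img.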

Related

Get html tags and loop over them using bash

I have this input file
<article class="article column large-12 small-12 article--nyheter">
<a class="article__link" href="/nyheter/14341175/">
<figure class="image image__responsive" style="padding-bottom:42.075%;">
<img class="image__img lazyload" itemprop="image" title="" alt="" />
</figure>
<div class="article__content">
<h2 class="article__title t54 tm24">Mann savnet etter å ha blitt angrepet av to haier</h2>
</div>
</a>
</article>
<article class="article column large-6 small-6 article--nyheter">
<a class="article__link" href="/nyheter/14315514/">
<figure class="image image__responsive" style="padding-bottom:42.075%;">
<img class="image__img lazyload" itemprop="image" title="" alt="" />
</figure>
<div class="article__content">
<h2 class="article__title t34 tm24">Flere blir 100 år – Tordis (102) tror hun
vet noe av oppskriften</h2>
</div>
</a>
</article>
<article class="article column large-6 small-6 article--nyheter">
<a class="article__link" href="/nyheter/14336393/">
<figure class="image image__responsive" style="padding-bottom:42.075%;">
<img class="image__img lazyload" itemprop="image" title="" alt="" />
</figure>
<div class="article__content">
<h2 class="article__title t34 tm24">To menn i 30-årene tiltalt for grov voldtekt av mann tidlig
i 20-årene</h2>
</div>
</a>
</article>
These are three article tags. I want to create three files, one per article tag, each containing the information from that tag. An example of the output would be:
news1.txt
http://example.no/news/123456789/af31/4928 // link (a tag)
Franzens djevelsk gode forstadsfabel // name (h2 tag)
http://imgs.example.no/pics/0291817572384234.jpg // image (img tag)
2021-10-27 10:45 // date (when the file was created)
Can this be done in bash using sed, awk?
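One possible approach, as a sketch only: let awk collect the link, title and image of each article and write them to a numbered file when the closing </article> is reached. The sample input, the src attributes on the img tags (the input above has none), the attr_value helper and the temporary output directory are all assumptions for illustration. The href is written as found, so a relative link stays relative.

```shell
# Hypothetical sample with the same shape as the input above.
workdir=$(mktemp -d)
cat > "$workdir/input.html" <<'EOF'
<article class="article">
<a class="article__link" href="/nyheter/1/">
<img class="image__img" src="/img/1.jpg" />
<h2 class="article__title">First title</h2>
</a>
</article>
<article class="article">
<a class="article__link" href="/nyheter/2/">
<img class="image__img" src="/img/2.jpg" />
<h2 class="article__title">Second title
wrapped over two lines</h2>
</a>
</article>
EOF

now=$(date '+%Y-%m-%d %H:%M')   # creation date written into every file

awk -v dir="$workdir" -v now="$now" '
  # Hypothetical helper: value of attribute attr (e.g. "href") in string s.
  function attr_value(s, attr,    re) {
    re = attr "=\"[^\"]*\""
    if (!match(s, re)) return ""
    return substr(s, RSTART + length(attr) + 2, RLENGTH - length(attr) - 3)
  }
  /<article/ { link = title = img = ""; in_title = 0 }
  /<a /      { link = attr_value($0, "href") }
  /<img /    { img  = attr_value($0, "src") }
  /<h2/      { in_title = 1 }
  in_title   {
    text = $0
    gsub(/<[^>]*>/, "", text)                 # strip tags, keep the text
    title = (title == "" ? text : title " " text)
    if (/<\/h2>/) in_title = 0                # a title may span several lines
  }
  /<\/article>/ {
    out = dir "/news" (++n) ".txt"
    print link  > out
    print title > out
    print img   > out
    print now   > out
    close(out)
  }
' "$workdir/input.html"

cat "$workdir/news2.txt"
```

Each newsN.txt then holds four lines: link, title, image and date. A title that wraps over two lines, as in the second article, is joined with a single space.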

Catching partial part of text with XPath

I have been having some difficulties finding an XPath for the following HTML:
<div>
<p> pppppppp
<span class="rollover-people">
<a class="rollover-people-link">pppppp</a>
<span class="rollover-people-block">
<span class="rollover-block">
<span>
<img src="/someAddress" width="100" height="100" alt>
<a>xxxx</a>
<a>xxxxx</a>
</span>
</span>
</span>
</span>pppppppp
</p>ppppppppp
<div>
So basically I need everything inside the <p> up to <span class="rollover-people-block">. In other words, I want the <p> but not the <span class="rollover-people-block">. Is that even possible? Keep in mind that the <p> is repeated more than once in the page.
This is something close to what you are looking for:
//p//text()[not(ancestor::span[@class='rollover-people-block'])]
This will get all the text nodes under p excluding the ones which are under span class='rollover-people-block'.
Sample html:
<!DOCTYPE html>
<html>
<body>
<div>
<p> A
<span class="rollover-people">
<a class="rollover-people-link">B</a>
<span class="rollover-people-block">
<span class="rollover-block">
<span>
<img src="/someAddress" width="100" height="100" alt>
<a>c</a>
<a>d</a>
</span>
</span>
</span>
</span>E
</p>f
<p> G
<span class="rollover-people">
<a class="rollover-people-link">H</a>
<span class="rollover-people-block">
<span class="rollover-block">
<span>
<img src="/someAddress" width="100" height="100" alt>
<a>i</a>
<a>j</a>
</span>
</span>
</span>
</span>K
</p>l
<div>
</body>
</html>
xpath output:

Reverse order of html with awk via line swapping

Basically every week I have to reverse the following snippet
<!-- Homepage Slider Begin -->
<div class="container-fluid">
<div class="single-item-home hidden-xs">
<div class="slide slide--has-caption">
<a href="/1">
<img src="/sliders/1_example.jpg">
</a>
</div>
<div class="slide slide--has-caption">
<a href="/2">
<img src="/sliders/2_example.jpg">
</a>
</div>
<div class="slide slide--has-caption">
<a href="/3">
<img src="/sliders/3_example.jpg">
</a>
</div>
<div class="slide slide--has-caption">
<a href="/4">
<img src="/sliders/4_example.jpg">
</a>
</div>
</div>
</div>
<!-- Homepage Slider End -->
Basically, I want to write an awk script, run from a cron job, that swaps lines 4-8 with lines 22-26 and lines 10-14 with lines 16-20. However, I can only seem to find a way to swap single lines, not blocks of lines.
Is this even possible with awk, or is it just silly?
You can use awk. The script below
awk 'NR==FNR{line[i++]=$0}
END{
for(j=0;j<i;j++){
if(j>=3 && j<=7){
print line[j+18];
continue;
}
else if(j>=21 && j<=25){
print line[j-18];
continue;
}
else if(j>=9 && j<=13){
print line[j+6];
continue;
}
else if(j>=15 && j<=19){
print line[j-6];
continue;
}
print line[j];
}
}' file
will do what you want.
Sample Output
<!-- Homepage Slider Begin -->
<div class="container-fluid">
<div class="single-item-home hidden-xs">
<div class="slide slide--has-caption">
<a href="/4">
<img src="/sliders/4_example.jpg">
</a>
</div>
<div class="slide slide--has-caption">
<a href="/3">
<img src="/sliders/3_example.jpg">
</a>
</div>
<div class="slide slide--has-caption">
<a href="/2">
<img src="/sliders/2_example.jpg">
</a>
</div>
<div class="slide slide--has-caption">
<a href="/1">
<img src="/sliders/1_example.jpg">
</a>
</div>
</div>
</div>
<!-- Homepage Slider End -->
Note: I leave the array-bounds check up to you. If the content of the file is static, you may not need it.
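As a sketch of what such a bounds check could look like, the swap can be wrapped in a small shell function that takes the two line ranges as arguments and refuses to run if they do not fit the file. The function name swap_blocks and the demo file are assumptions for illustration, not part of the script above.

```shell
# swap_blocks FILE A_START A_END B_START B_END
# Swaps two equal-length, non-overlapping line ranges (1-based, inclusive),
# with range A located before range B.
swap_blocks() {
  awk -v a1="$2" -v a2="$3" -v b1="$4" -v b2="$5" '
    { line[NR] = $0 }
    END {
      # Bounds check: ranges must exist in the file, be ordered, and match in length.
      if (a1 < 1 || a1 > a2 || a2 >= b1 || b2 > NR || (a2 - a1) != (b2 - b1)) {
        print "swap_blocks: invalid ranges for " NR " lines" > "/dev/stderr"
        exit 1
      }
      for (i = 1; i <= NR; i++) {
        if      (i >= a1 && i <= a2) print line[i + (b1 - a1)]
        else if (i >= b1 && i <= b2) print line[i - (b1 - a1)]
        else                         print line[i]
      }
    }
  ' "$1"
}

# Demo on a six-line file: swap lines 1-2 with lines 5-6.
demo=$(mktemp)
printf '%s\n' one two three four five six > "$demo"
swapped=$(swap_blocks "$demo" 1 2 5 6)
printf '%s\n' "$swapped"
```

For the slider file in the question, this would be called twice, once with 4 8 22 26 and once on the result with 10 14 16 20. A range that runs past the end of the file makes the function print an error and return a non-zero status instead of emitting empty lines.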
This doesn't care how many lines are in each block or where they start/end in the file and it doesn't require you to store the whole file in memory (though most of the file is the "slides" which DO need to be stored so that's probably a non-issue):
$ cat tst.awk
/<div class="slide/ { inSlide=1; slide="" }
inSlide {
slide = slide $0 ORS
if ( /<\/div>/ ) {
slides[++numSlides] = slide
inSlide = 0
}
next
}
/<\/div>/ {
for (slideNr=numSlides; slideNr>=1; slideNr--) {
printf "%s", slides[slideNr]
}
numSlides = 0
}
NF
$ awk -f tst.awk file
<!-- Homepage Slider Begin -->
<div class="container-fluid">
<div class="single-item-home hidden-xs">
<div class="slide slide--has-caption">
<a href="/4">
<img src="/sliders/4_example.jpg">
</a>
</div>
<div class="slide slide--has-caption">
<a href="/3">
<img src="/sliders/3_example.jpg">
</a>
</div>
<div class="slide slide--has-caption">
<a href="/2">
<img src="/sliders/2_example.jpg">
</a>
</div>
<div class="slide slide--has-caption">
<a href="/1">
<img src="/sliders/1_example.jpg">
</a>
</div>
</div>
</div>
<!-- Homepage Slider End -->
perl -e '@f=<>; print @f[0..2,21..25,8,15..19,14,9..13,20,3..7,26..$#f]' ip.html
The -e option passes Perl code on the command line itself.
@f=<> reads the contents of the file (passed as a command-line argument) into an array,
and the print then emits the lines in the required order (indices start at 0; $#f gives the last index of array @f).
This is a solution where you define the output order in the BEGIN section, and the lines are printed in that order:
$ cat > preordered.awk
BEGIN {
split("1,2,3,22,23,24,25,26,9,16,17,18,19,20,15,10,11,12,13,14,21,4,5,6,7,8",a,",")
}
{
b[(NR in a?a[NR]:NR)]=$0
}
END {
PROCINFO["sorted_in"]="@ind_num_asc"
for(i in b)
print b[i]
}
Give it a go:
$ awk -f preordered.awk file
<!-- Homepage Slider Begin -->
<div class="container-fluid">
<div class="single-item-home hidden-xs">
<div class="slide slide--has-caption">
<a href="/4">
<img src="/sliders/4_example.jpg">
</a>
</div>
...

Using xpath, I can't seem to be able to find a text node

So, I am building a web crawler for one site's comment section, and I have run into a problem: I can't seem to find the text node for a comment's content. This is how the web page's element looks:
<div class="comments"> // this is the whole comments section
<div class="comment"> // this is where the p is located
<div class="comment-top">
<div class="comment-nr">208. PROTAS</div>
<div class="comment-info">
<div class="comment-time">2015-06-30 13:00</div>
<div class="comment-ip">IP: 178.250.32.165</div>
<div class="comment-vert1">
<a href="javascript:comr(24470645,'p')">
<img src="http://img.lrytas.lt/css2/img/com-good.jpg" alt="">
</a> <span id="cy_24470645"> </span>
</div>
<div class="comment-vert2">
<a href="javascript:comr(24470645,'m')">
<img src="http://img.lrytas.lt/css2/img/com-bad.jpg" alt="">
</a> <span id="cn_24470645"> </span>
</div>
</div>
</div>
<p class="text-13 no-intend">Test text</p> // I need to get this comments content
</div>
I tried a lot of XPaths, like:
*/div[contains(@class, "comment")]/p/text()
/p[contains(@class, "text-13 no-intend")]/text()
etc.
But I can't seem to locate it.
Would appreciate any help.
How about this:
//div[@class = 'comments']/div[@class = 'comment'][1]/p/text()

Right way to list image banners

What is the right way to use 'ul'? Is it okay to use 'ul' to list image banners? For example, I have 3 image banners with titles, all floated left. I run into this situation often, and the approach I came up with is the first markup below, using 'ul'.
Is it okay to use the markup below:
<section class="banners">
<ul>
<li>
<figure>
<a href="#">
<img src="" width="" height="" alt="" />
</a>
</figure>
title here
</li>
<li>
<figure>
<a href="#">
<img src="" width="" height="" alt="" />
</a>
</figure>
title here
</li>
<li>
<figure>
<a href="#">
<img src="" width="" height="" alt="" />
</a>
</figure>
title here
</li>
</ul>
</section>
or should I use:
<section class="banners">
<figure>
<a href="#">
<img src="" width="" height="" alt="" />
</a>
<figcaption>
title here
</figcaption>
</figure>
<figure>
<a href="#">
<img src="" width="" height="" alt="" />
</a>
<figcaption>
title here
</figcaption>
</figure>
<figure>
<a href="#">
<img src="" width="" height="" alt="" />
</a>
<figcaption>
title here
</figcaption>
</figure>
</section>
Do they both represent semantic coding?
This is the sample of the image banner
Since the HTML5 spec is so mercurial and the semantics don't seem to play a major role practically, it's hard to say. However, based on your image, it looks like this is a navigation section. As such, you would want to section it with <nav>.
<ul> spec: http://www.w3.org/TR/html5/grouping-content.html#the-ul-element
<figure> spec: http://www.w3.org/TR/html5/grouping-content.html#the-figure-element
I don't think that these are much help. They are both used for grouping content. The order does not matter for <ul>.
From what I've read, it seems to me that the purpose of <figure> is for annotations of a document -- describing related images, etc. The spec specifically says that these could be moved elsewhere, like an appendix, but that doesn't seem to apply to your situation.
I don't think that <figure> is appropriate here. Instead, use <nav>. You can use the <ul> for styling if you need -- it doesn't provide much semantic meaning (just a somewhat generic grouping content element).
