I have this input file:
<article class="article column large-12 small-12 article--nyheter">
<a class="article__link" href="/nyheter/14341175/">
<figure class="image image__responsive" style="padding-bottom:42.075%;">
<img class="image__img lazyload" itemprop="image" title="" alt="" />
</figure>
<div class="article__content">
<h2 class="article__title t54 tm24">Mann savnet etter å ha blitt angrepet av to haier</h2>
</div>
</a>
</article>
<article class="article column large-6 small-6 article--nyheter">
<a class="article__link" href="/nyheter/14315514/">
<figure class="image image__responsive" style="padding-bottom:42.075%;">
<img class="image__img lazyload" itemprop="image" title="" alt="" />
</figure>
<div class="article__content">
<h2 class="article__title t34 tm24">Flere blir 100 år – Tordis (102) tror hun
vet noe av oppskriften</h2>
</div>
</a>
</article>
<article class="article column large-6 small-6 article--nyheter">
<a class="article__link" href="/nyheter/14336393/">
<figure class="image image__responsive" style="padding-bottom:42.075%;">
<img class="image__img lazyload" itemprop="image" title="" alt="" />
</figure>
<div class="article__content">
<h2 class="article__title t34 tm24">To menn i 30-årene tiltalt for grov voldtekt av mann tidlig
i 20-årene</h2>
</div>
</a>
</article>
These are three article tags. I want to create three files, one per article tag, each containing information extracted from that tag. An example of the output would be:
news1.txt
http://example.no/news/123456789/af31/4928 // link (a tag)
Franzens djevelsk gode forstadsfabel // name (h2 tag)
http://imgs.example.no/pics/0291817572384234.jpg // image (img tag)
2021-10-27 10:45 // date (when the file was created)
Can this be done in bash using sed or awk?
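Not sed/awk, but for comparison here is a minimal sketch in Python using only the standard library's html.parser. It assumes the image URL sits in a src or data-src attribute (the sample img tags carry neither, so the image line comes out empty for them), and the names ArticleParser, parse_articles, format_article and write_news_files are made up for this example. base_url is whatever site prefix the relative links should be resolved against.

```python
from datetime import datetime
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urljoin


class ArticleParser(HTMLParser):
    """Collect link, headline and image URL for each <article> element."""

    def __init__(self):
        super().__init__()
        self.articles = []      # one dict per finished <article>
        self._current = None    # the article being filled in, if any
        self._in_title = False  # True while inside the article's <h2>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "article":
            self._current = {"link": "", "title": "", "image": ""}
        elif self._current is None:
            return
        elif tag == "a" and not self._current["link"]:
            self._current["link"] = attrs.get("href") or ""
        elif tag == "img":
            # Assumption: the URL lives in src or data-src (lazy loading).
            self._current["image"] = attrs.get("src") or attrs.get("data-src") or ""
        elif tag == "h2":
            self._in_title = True

    def handle_data(self, data):
        if self._in_title and self._current is not None:
            self._current["title"] += data

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_title = False
        elif tag == "article" and self._current is not None:
            # Headlines may wrap over several lines; collapse the whitespace.
            self._current["title"] = " ".join(self._current["title"].split())
            self.articles.append(self._current)
            self._current = None


def parse_articles(html):
    """Return a list of {'link', 'title', 'image'} dicts, one per article."""
    parser = ArticleParser()
    parser.feed(html)
    parser.close()
    return parser.articles


def format_article(article, base_url, created):
    """Return the four output lines: link, headline, image, timestamp."""
    image = urljoin(base_url, article["image"]) if article["image"] else ""
    return "\n".join([
        urljoin(base_url, article["link"]),
        article["title"],
        image,
        created.strftime("%Y-%m-%d %H:%M"),
    ]) + "\n"


def write_news_files(html, base_url, out_dir="."):
    """Write news1.txt, news2.txt, ... into out_dir and return the paths."""
    paths = []
    for number, article in enumerate(parse_articles(html), start=1):
        path = Path(out_dir) / f"news{number}.txt"
        text = format_article(article, base_url, datetime.now())
        path.write_text(text, encoding="utf-8")
        paths.append(path)
    return paths
```

Calling write_news_files(open("input.html", encoding="utf-8").read(), "http://example.no") would then produce one newsN.txt per article; input.html is a placeholder file name.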
I have been having some difficulties finding an XPath for the following HTML:
<div>
<p> pppppppp
<span class="rollover-people">
<a class="rollover-people-link">pppppp</a>
<span class="rollover-people-block">
<span class="rollover-block">
<span>
<img src="/someAddress" width="100" height="100" alt>
<a>xxxx</a>
<a>xxxxx</a>
</span>
</span>
</span>
</span>pppppppp
</p>ppppppppp
</div>
So basically I need everything inside the <p> up to <span class="rollover-people-block">. In other words, I want the <p> but not the <span class="rollover-people-block">. Is that even possible? Keep in mind that the <p> is repeated more than once on the page.
This is something close to what you are looking for:
//p//text()[not(ancestor::span[@class='rollover-people-block'])]
This gets all the text nodes under p, excluding those that sit under the span with class='rollover-people-block'.
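If no XPath engine is at hand, the same selection rule can be approximated in plain Python. This is only a sketch built on the standard library's html.parser, not an XPath evaluation: the names ParagraphTextParser and paragraph_texts are invented for the example, and unlike the XPath it drops whitespace-only text nodes and strips the surrounding whitespace.

```python
from html.parser import HTMLParser


class ParagraphTextParser(HTMLParser):
    """Collect text inside <p>, skipping one excluded <span> subtree."""

    def __init__(self, excluded_class):
        super().__init__()
        self.excluded_class = excluded_class
        self.texts = []
        self._p_depth = 0       # number of currently open <p> elements
        self._span_stack = []   # per open <span>: is it the excluded one?

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._p_depth += 1
        elif tag == "span":
            # Exact match on the class attribute, like @class='...' in XPath.
            is_excluded = dict(attrs).get("class") == self.excluded_class
            self._span_stack.append(is_excluded)

    def handle_endtag(self, tag):
        if tag == "p" and self._p_depth:
            self._p_depth -= 1
        elif tag == "span" and self._span_stack:
            self._span_stack.pop()

    def handle_data(self, data):
        # True if any open <span> ancestor is the excluded one.
        inside_excluded = any(self._span_stack)
        # Unlike the XPath, whitespace-only text nodes are dropped here.
        if self._p_depth and not inside_excluded and data.strip():
            self.texts.append(data.strip())


def paragraph_texts(html, excluded_class="rollover-people-block"):
    """Return the stripped text nodes under <p>, minus the excluded span."""
    parser = ParagraphTextParser(excluded_class)
    parser.feed(html)
    parser.close()
    return parser.texts
```

For markup shaped like the question's snippet, the text before the span, the link text and the text after the span are kept, while everything inside rollover-people-block is skipped.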
Sample html:
<!DOCTYPE html>
<html>
<body>
<div>
<p> A
<span class="rollover-people">
<a class="rollover-people-link">B</a>
<span class="rollover-people-block">
<span class="rollover-block">
<span>
<img src="/someAddress" width="100" height="100" alt>
<a>c</a>
<a>d</a>
</span>
</span>
</span>
</span>E
</p>f
<p> G
<span class="rollover-people">
<a class="rollover-people-link">H</a>
<span class="rollover-people-block">
<span class="rollover-block">
<span>
<img src="/someAddress" width="100" height="100" alt>
<a>i</a>
<a>j</a>
</span>
</span>
</span>
</span>K
</p>l
</div>
</body>
</html>
xpath output:
Basically every week I have to reverse the following snippet
<!-- Homepage Slider Begin -->
<div class="container-fluid">
<div class="single-item-home hidden-xs">
<div class="slide slide--has-caption">
<a href="/1">
<img src="/sliders/1_example.jpg">
</a>
</div>
<div class="slide slide--has-caption">
<a href="/2">
<img src="/sliders/2_example.jpg">
</a>
</div>
<div class="slide slide--has-caption">
<a href="/3">
<img src="/sliders/3_example.jpg">
</a>
</div>
<div class="slide slide--has-caption">
<a href="/4">
<img src="/sliders/4_example.jpg">
</a>
</div>
</div>
</div>
<!-- Homepage Slider End -->
Basically, I want to write an awk script, run from a cron job, that swaps lines 4-8 with lines 22-26 and lines 10-14 with lines 16-20. However, I can only seem to find a way to swap single lines, not blocks of lines.
Is this even possible with awk, or is it just silly?
You can use awk. The script below
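awk 'NR==FNR{line[i++]=$0}          # store every line, 0-indexed
END{
    for(j=0;j<i;j++){
        if(j>=3 && j<=7){            # lines 4-8 are replaced by lines 22-26
            print line[j+18];
            continue;
        }
        else if(j>=21 && j<=25){     # lines 22-26 are replaced by lines 4-8
            print line[j-18];
            continue;
        }
        else if(j>=9 && j<=13){      # lines 10-14 are replaced by lines 16-20
            print line[j+6];
            continue;
        }
        else if(j>=15 && j<=19){     # lines 16-20 are replaced by lines 10-14
            print line[j-6];
            continue;
        }
        print line[j];               # every other line is printed unchanged
    }
}' file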
will do what you want.
Sample Output
<!-- Homepage Slider Begin -->
<div class="container-fluid">
<div class="single-item-home hidden-xs">
<div class="slide slide--has-caption">
<a href="/4">
<img src="/sliders/4_example.jpg">
</a>
</div>
<div class="slide slide--has-caption">
<a href="/3">
<img src="/sliders/3_example.jpg">
</a>
</div>
<div class="slide slide--has-caption">
<a href="/2">
<img src="/sliders/2_example.jpg">
</a>
</div>
<div class="slide slide--has-caption">
<a href="/1">
<img src="/sliders/1_example.jpg">
</a>
</div>
</div>
</div>
<!-- Homepage Slider End -->
Note: I leave the array-bounds check up to you. If the content of the file is static, you may not need it.
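For comparison, here is a rough Python sketch of the same block swap with the bounds check included. It is not a translation of the awk script above; swap_blocks and reverse_four_slides are hypothetical helper names, and the ranges are 1-based and inclusive to match the line numbers in the question.

```python
def swap_blocks(lines, first, second):
    """Return a new list with two 1-based, inclusive line ranges exchanged."""
    (a_start, a_end), (b_start, b_end) = sorted([first, second])
    # Bounds check: both ranges must lie inside the file and must not overlap.
    if not (1 <= a_start <= a_end < b_start <= b_end <= len(lines)):
        raise ValueError("ranges must be within the file and must not overlap")
    return (
        lines[:a_start - 1]            # everything before the first block
        + lines[b_start - 1:b_end]     # second block moves up
        + lines[a_end:b_start - 1]     # lines between the blocks stay put
        + lines[a_start - 1:a_end]     # first block moves down
        + lines[b_end:]                # everything after the second block
    )


def reverse_four_slides(lines):
    """Apply the two swaps from the question: 4-8 with 22-26, 10-14 with 16-20."""
    lines = swap_blocks(lines, (4, 8), (22, 26))
    return swap_blocks(lines, (10, 14), (16, 20))
```

Because the function slices rather than indexes, the two blocks do not have to be the same length, and a range that runs past the end of the file raises ValueError instead of silently printing empty lines.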
This doesn't care how many lines are in each block or where they start/end in the file and it doesn't require you to store the whole file in memory (though most of the file is the "slides" which DO need to be stored so that's probably a non-issue):
$ cat tst.awk
/<div class="slide/ { inSlide=1; slide="" }
inSlide {
    slide = slide $0 ORS
    if ( /<\/div>/ ) {
        slides[++numSlides] = slide
        inSlide = 0
    }
    next
}
/<\/div>/ {
    for (slideNr=numSlides; slideNr>=1; slideNr--) {
        printf "%s", slides[slideNr]
    }
    numSlides = 0
}
NF
.
$ awk -f tst.awk file
<!-- Homepage Slider Begin -->
<div class="container-fluid">
<div class="single-item-home hidden-xs">
<div class="slide slide--has-caption">
<a href="/4">
<img src="/sliders/4_example.jpg">
</a>
</div>
<div class="slide slide--has-caption">
<a href="/3">
<img src="/sliders/3_example.jpg">
</a>
</div>
<div class="slide slide--has-caption">
<a href="/2">
<img src="/sliders/2_example.jpg">
</a>
</div>
<div class="slide slide--has-caption">
<a href="/1">
<img src="/sliders/1_example.jpg">
</a>
</div>
</div>
</div>
<!-- Homepage Slider End -->
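The same idea (buffer each slide block, then emit the buffered blocks in reverse when the enclosing div closes) might look like this in Python. This is a sketch that mirrors the logic of tst.awk as I read it, with a hypothetical function name reverse_slides; like the awk version it relies on each slide starting with a line containing <div class="slide and ending at the next line containing </div>.

```python
def reverse_slides(lines):
    """Return the lines with consecutive slide blocks in reverse order.

    Blank lines outside slides are dropped, as the final NF rule does in awk.
    """
    out = []
    slides = []      # completed slide blocks, each a list of lines
    current = None   # lines of the slide being read, or None outside a slide

    for line in lines:
        if current is None and '<div class="slide' in line:
            current = []
        if current is not None:
            current.append(line)
            if "</div>" in line:        # the slide's own closing tag
                slides.append(current)
                current = None
            continue
        if "</div>" in line:            # closing tag of the slide container
            for slide in reversed(slides):
                out.extend(slide)
            slides = []
        if line.strip():                # skip blank lines between slides
            out.append(line)
    return out
```

As with the awk script, no line numbers are hard-coded, so adding a fifth slide or changing the number of lines per slide needs no change to the code.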
perl -e '@f=<>; print @f[0..2,21..25,8,15..19,14,9..13,20,3..7,26..$#f]' ip.html
The -e option passes Perl code on the command line itself.
@f=<> reads the contents of the file (passed as a command-line argument) into an array.
The lines are then printed in the required order (indexes start from 0; $#f gives the last index of array @f).
This is a solution where you define the output order in the BEGIN section, and the lines are printed in that order:
$ cat > preordered.awk
BEGIN {
    split("1,2,3,22,23,24,25,26,9,16,17,18,19,20,15,10,11,12,13,14,21,4,5,6,7,8",a,",")
}
{
    b[(NR in a?a[NR]:NR)]=$0
}
END {
    PROCINFO["sorted_in"]="@ind_num_asc"   # gawk-specific: traverse b in numeric index order
    for(i in b)
        print b[i]
}
Give it a go:
$ awk -f preordered.awk file
<!-- Homepage Slider Begin -->
<div class="container-fluid">
<div class="single-item-home hidden-xs">
<div class="slide slide--has-caption">
<a href="/4">
<img src="/sliders/4_example.jpg">
</a>
</div>
...
So, I am building a web crawler for one site's comment section, and I have run into a problem: I can't seem to find the text node that holds a comment's content. This is what the page's markup looks like:
<div class="comments"> // this is the whole comments section
<div class="comment"> // this is where the p is located
<div class="comment-top">
<div class="comment-nr">208. PROTAS</div>
<div class="comment-info">
<div class="comment-time">2015-06-30 13:00</div>
<div class="comment-ip">IP: 178.250.32.165</div>
<div class="comment-vert1">
<a href="javascript:comr(24470645,'p')">
<img src="http://img.lrytas.lt/css2/img/com-good.jpg" alt="">
</a> <span id="cy_24470645"> </span>
</div>
<div class="comment-vert2">
<a href="javascript:comr(24470645,'m')">
<img src="http://img.lrytas.lt/css2/img/com-bad.jpg" alt="">
</a> <span id="cn_24470645"> </span>
</div>
</div>
</div>
<p class="text-13 no-intend">Test text</p> // I need to get this comments content
</div>
I tried a lot of XPaths, like:
*/div[contains(@class, "comment")]/p/text()
/p[contains(@class, "text-13 no-intend")]/text()
etc.
But I can't seem to locate it.
I would appreciate any help.
How about this:
//div[@class = 'comments']/div[@class = 'comment'][1]/p/text()
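A quick way to try this kind of path is Python's xml.etree.ElementTree, which supports a limited subset of XPath. The sketch below makes several assumptions: the markup is a trimmed, well-formed version of the question's snippet (ElementTree needs valid XML, so the unclosed img tags are left out), the second comment is invented for illustration, the [1] predicate is dropped so that every comment is returned, and text() is replaced by the element's .text attribute because ElementTree paths select elements only. comment_texts is a made-up helper name.

```python
import xml.etree.ElementTree as ET


def comment_texts(markup):
    """Return the text of the <p> in every div.comment under div.comments."""
    # Wrap the fragment so that div.comments is a descendant of the root,
    # which lets the ".//" search find it.
    root = ET.fromstring("<root>" + markup + "</root>")
    paragraphs = root.findall(
        ".//div[@class='comments']/div[@class='comment']/p"
    )
    return [(p.text or "").strip() for p in paragraphs]


# Trimmed sample; the second comment is hypothetical.
SAMPLE = """
<div class="comments">
  <div class="comment">
    <div class="comment-top">
      <div class="comment-nr">208. PROTAS</div>
    </div>
    <p class="text-13 no-intend">Test text</p>
  </div>
  <div class="comment">
    <p class="text-13 no-intend">Second comment</p>
  </div>
</div>
"""
```

Taking element 0 of comment_texts(SAMPLE) gives the first comment only, which is what the [1] in the full XPath selects.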
I would like to ask what the right way to use 'ul' is. Is it okay to use 'ul' to list some image banners? For example, I have 3 image banners with titles, all floated left. I run into this situation all the time, and the approach I came up with is the first markup, using 'ul'.
Is it okay to use the markup below:
<section class="banners">
<ul>
<li>
<figure>
<a href="#">
<img src="" width="" height="" alt="" />
</a>
</figure>
title here
</li>
<li>
<figure>
<a href="#">
<img src="" width="" height="" alt="" />
</a>
</figure>
title here
</li>
<li>
<figure>
<a href="#">
<img src="" width="" height="" alt="" />
</a>
</figure>
title here
</li>
</ul>
</section>
or should I use:
<section class="banners">
<figure>
<a href="#">
<img src="" width="" height="" alt="" />
</a>
<figcaption>
title here
</figcaption>
</figure>
<figure>
<a href="#">
<img src="" width="" height="" alt="" />
</a>
<figcaption>
title here
</figcaption>
</figure>
<figure>
<a href="#">
<img src="" width="" height="" alt="" />
</a>
<figcaption>
title here
</figcaption>
</figure>
</section>
Do they both represent semantic coding?
This is the sample of the image banner
Since the HTML5 spec is so mercurial and the semantics don't seem to play a major role practically, it's hard to say. However, based on your image, it looks like this is a navigation section. As such, you would want to section it with <nav>.
<ul> spec: http://www.w3.org/TR/html5/grouping-content.html#the-ul-element
<figure> spec: http://www.w3.org/TR/html5/grouping-content.html#the-figure-element
I don't think that these are much help. They are both used for grouping content. The order does not matter for <ul>.
From what I've read, it seems to me that the purpose of <figure> is for annotations of a document -- describing related images, etc. The spec specifically says that these could be moved elsewhere, like an appendix, but that doesn't seem to apply to your situation.
I don't think that <figure> is appropriate here. Instead, use <nav>. You can use the <ul> for styling if you need to -- it doesn't provide much semantic meaning (it is just a somewhat generic grouping content element).