Build a text from html using xpath - xpath

I receive an html like that below from a server. I rebuild the textual part by using the XPath exp #"//text()" and appending the "nodeContent" value to a string. The code is something like this:
for (int i=2; i<[resultXPathQuery count]; i++) {
[mytext appendString:[[resultXPathQuery objectAtIndex:i] objectForKey:#"nodeContent"]];
[mytext appendString:#"\n"];
}
I obtain:
Line 1
line 2
line 3
line 4
How could I build the textual part also considering the empty node?
I would to obtain:
Line 1
line 2
line 3
line 4
<html><head><title>A title</title><style type="text/css">
ol{margin:0;padding:0}p{margin:0}
.c0{font-size:12pt;background-color:#ffffff;font-family:Times New Roman}
.c6{width:432.0pt;background-color:#ffffff;padding:72.0pt 90.0pt 72.0pt 90.0pt}
.c7{color:#aaaaaa;font-family:Times New Roman}
.c3{color:#0000ee;text-decoration:underline}
.c5{color:inherit;text-decoration:inherit}
.c2{font-size:12pt;font-family:Times New Roman}
.c4{height:12pt}.c1{direction:ltr}
body{color:#000000;font-size:12pt;font-family:Times New Roman}
h1{padding-top:12.0pt;line-height:1.0;text-align:left;color:#000000;font-size:24pt;font- family:Times New Roman;font-weight:bold;padding-bottom:12.0pt}
h2{padding-top:11.25pt;line-height:1.0;text-align:left;color:#000000;font-size:18pt;font-family:Times New Roman;font-weight:bold;padding-bottom:11.25pt}
h3{padding-top:12.0pt;line-height:1.0;text-align:left;color:#000000;font-size:14pt;font-family:Times New Roman;font-weight:bold;padding-bottom:12.0pt}
h4{padding-top:12.75pt;line-height:1.0;text-align:left;color:#000000;font-size:12pt;font-family:Times New Roman;font-weight:bold;padding-bottom:12.75pt}
h5{padding-top:12.75pt;line-height:1.0;text-align:left;color:#000000;font-size:9pt;font-family:Times New Roman;font-weight:bold;padding-bottom:12.75pt}
h6{padding-top:18.0pt;line-height:1.0;text-align:left;color:#000000;font-size:8pt;font-family:Times New Roman;font-weight:bold;padding-bottom:18.0pt}</style>
</head>
<body class="c6">
<p class="c1"><span class="c2">A title</span></p>
<p class="c1 c4"><span class="c2"></span></p>
<p class="c4 c1"><span class="c2"></span></p>
<p class="c1"><span class="c7">Line 1</span></p>
<p class="c1"><span class="c7">line 2</span></p>
<p class="c4 c1"><span class="c7"></span></p>
<p class="c1"><span class="c7">line 3</span></p>
<p class="c4 c1"><span class="c7"></span></p>
<p class="c4 c1"><span class="c7"></span></p>
<p class="c3 c2"><span class="c1"></span></p>
<p class="c1"><span class="c7">line 4</span></p>
</body></html>
EDIT
Really, I noticed that the html can be more "complicated", so it's not enough selecting all the span elements or p elements. Moreover, more span elements can appear in the same p element, so in that case I have not to create a new line in my string.
This is the body of a more complicated returned html:
<body class="c13">
<p class="c5"><span>gfgfgfd</span></p>
<p class="c1"><span></span></p>
<p class="c5 c10"><span>ghhgfhgfh hghg hgkfhjgk ghjgkh ghjgjhg gjhjg gjhj gjhgjhgjhg gfhjkgjg jghjgfhjgf fghfj jghfj fghjggf jhgjgjgkjg</span></p>
<p class="c1 c10"><span></span></p>
<p class="c4"><span>gfgfgfd</span></p>
<p class="c4"><span>f</span></p>
<p class="c4">
<span>gfdgfdg</span>
<span class="c7">hg</span></p>
<p class="c4"><span class="c7">ghgfhgfh</span></p>
<p class="c4"><span class="c7">gfhgfhgf</span></p>
<p class="c5">
<span class="c7">hgfh </span>
<span class="c0">gfdgfg</span></p>
<p class="c5"><span class="c0">fgfdgfdgfd</span></p>
<p class="c5"><span class="c0">gdfgdfgfd</span></p>
<p class="c5"><span class="c0">gfgf</span></p>
<p class="c1"><span class="c0"></span></p>
<p class="c5"><span class="c0 c8"><a class="c12" href="http://www.google.com">www.google.com</a></span></p>
<p class="c1"><span class="c0"></span></p>
<p class="c5"><span class="c0">fgfdgfdg</span></p>
<p class="c5">
<span class="c0">fgffgfdgfg</span>
<span class="c0 c11">gfgfdgfd fgd fd</span>
<span class="c0">fdgfdg</span></p>
<p class="c5"><span class="c0">fgfdgfdgf</span></p>
<p class="c5"><span class="c0">gfd</span></p>
<p class="c5"><span class="c0">gfgf</span></p>
<p class="c1"><span class="c0"></span></p>
<p class="c5"><span class="c0 c8"><a class="c12" href="mailto:….">...</a></span></p>
<p class="c1"><span class="c0"></span></p>
<ol class="c9" start="1">
<li class="c3"><span class="c0">gfgfd</span></li>
<li class="c3"><span class="c0">gfdgfd</span></li>
<li class="c3"><span class="c0">gfdgfd</span></li>
<li class="c3"><span class="c0">gdfgfd</span></li>
</ol>
<p class="c1"><span class="c0"></span></p>
<p class="c5"><span class="c0">hgfhgf</span></p>
<p class="c5"><span class="c0">gfhgfh</span></p>
<p class="c5"><span class="c0">hgfhgf</span></p>
<p class="c1"><span class="c0"></span></p>
<ol class="c2" start="1">
<li class="c3"><span class="c0">gfhg</span></li>
<li class="c3"><span class="c0">hgfh</span></li>
<li class="c3"><span class="c0">hgf</span></li>
</ol>
<p class="c1"><span class="c0"></span></p>
<h1 class="c5 c15"><a name="h.kafwflosthlg"></a><span class="c7 c14">hgfhgfh</span></h1>
<p class="c1"><span class="c6"></span></p>
<p class="c1"><span class="c6"></span></p>
<p class="c1"><span class="c6"></span></p>
</body>
I'd need an XPath expression that selects p, h1, h2,..., h6, li elements, and considers the inner textual part in such way that new line and empty lines are properly detected.

For the example above you can use //span which will return all the <span> elements regardless of their contents. It looks like you are doing some other filtering also because //text() should also return your CSS block and A Title from the <title> and first <span>.

I would rather use a regex for this one:
Grab all the content between the body tags (you can also do that with XPath)
Replace </p> by </p>\n
Strip tags

Related

Get html tags and loop over them using bash

I have this input file
<article class="article column large-12 small-12 article--nyheter">
<a class="article__link" href="/nyheter/14341175/">
<figure class="image image__responsive" style="padding-bottom:42.075%;">
<img class="image__img lazyload" itemprop="image" title="" alt="" />
</figure>
<div class="article__content">
<h2 class="article__title t54 tm24">Mann savnet etter å ha blitt angrepet av to haier</h2>
</div>
</a>
</article>
<article class="article column large-6 small-6 article--nyheter">
<a class="article__link" href="/nyheter/14315514/">
<figure class="image image__responsive" style="padding-bottom:42.075%;">
<img class="image__img lazyload" itemprop="image" title="" alt="" />
</figure>
<div class="article__content">
<h2 class="article__title t34 tm24">Flere blir 100 år – Tordis (102) tror hun
vet noe av oppskriften</h2>
</div>
</a>
</article>
<article class="article column large-6 small-6 article--nyheter">
<a class="article__link" href="/nyheter/14336393/">
<figure class="image image__responsive" style="padding-bottom:42.075%;">
<img class="image__img lazyload" itemprop="image" title="" alt="" />
</figure>
<div class="article__content">
<h2 class="article__title t34 tm24">To menn i 30-Ã¥rene tiltalt for grov voldtekt av mann tidlig
i 20-Ã¥rene</h2>
</div>
</a>
</article>
These are three article tags. I want to make three files from these three article tags having info from them. Example of the output would be
news1.txt
http://example.no/news/123456789/af31/4928 // link (a tag)
Franzens djevelsk gode forstadsfabel // name (h2 tag)
http://imgs.example.no/pics/0291817572384234.jpg // image (img tag)
2021-10-27 10:45 // date (when the file was created)
Can this be done in bash using sed, awk?

Catching partial part of text with XPath

I have been having some difficulties finding an XPath for the following H
<div>
<p> pppppppp
<span class="rollover-people">
<a class="rollover-people-link">pppppp</a>
<span class="rollover-people-block">
<span class="rollover-block">
<span>
<img src="/someAddress" width="100" height="100" alt>
<a>xxxx</a>
<a>xxxxx</a>
</span>
</span>
</span>
</span>pppppppp
</p>ppppppppp
<div>
So basically I need everything inside the <p> up to <span class="rollover-people-block">. In another word, I want <p> but not <span class="rollover-people-block">. Is that even possible? Keep in mind, the <p> gets repeated more than once in the page.
This is what something closure you are looking for.
//p//text()[not(ancestor::span[#class='rollover-people-block'])]
This will get all the text nodes under p excluding the ones which are under span class='rollover-people-block'.
Sample html:
<!DOCTYPE html>
<html>
<body>
<div>
<p> A
<span class="rollover-people">
<a class="rollover-people-link">B</a>
<span class="rollover-people-block">
<span class="rollover-block">
<span>
<img src="/someAddress" width="100" height="100" alt>
<a>c</a>
<a>d</a>
</span>
</span>
</span>
</span>E
</p>f
<p> G
<span class="rollover-people">
<a class="rollover-people-link">H</a>
<span class="rollover-people-block">
<span class="rollover-block">
<span>
<img src="/someAddress" width="100" height="100" alt>
<a>i</a>
<a>j</a>
</span>
</span>
</span>
</span>K
</p>l
<div>
</body>
</html>
xpath output:

Bootstrap 4: hide overflow in card-text

I have a bootstrap 4 card in which I want to hide the overflow of the subtitle (and show "..."). How can I achieve this? If possible with pure bootstrap code...
<div class="card-block p-1">
<p class="card-title">Test object</p>
<p class="card-subtitle text-muted">Added by Someone with a long name</p>
<p class="card-text mx-auto text-center"><span class="badge badge-pill badge-danger">€ 800</span></p>
</div>
Just use the text-truncate util class..
<div class="card">
<div class="card-block p-1">
<p class="card-title">Test object</p>
<p class="card-subtitle text-muted text-truncate">Added by Someone with a long name</p>
<p class="card-text mx-auto text-center"><span class="badge badge-pill badge-danger">€ 800</span></p>
</div>
</div>
https://www.codeply.com/go/bZufg6X1So

extract text without <div> and <p> with xpath

<tr><td class=term>1st param</td>
<td>PUTIN
<div class='info-icon'>
<a href='#' onmouseover='show_pd(351);' onmouseout='hide_pd(351);' id='info-icon-351'></a>
</div>
<div id='pd-351' style='display: none; position: absolute;'>
<b>СПРАВКА</b>
<br /><br />
<P align=justify><NOBR><STRONG>ABS</STRONG></NOBR>bla-bla-bla text</P>
<P align=justify>bla-bla-bla text 2</P>
<P align=justify>bla-bla-bla text 3</P>
<P align=justify>bla-bla-bla text 4</P>
</div>
</td>
I need extract only "PUTIN".
Now I'm on
//td[#class="term"][contains(text(), "1st param")]/following-sibling::td/[not(self::p)]
With some adjustments to your XML following XPath
//td[#class="term"][contains(text(), "1st param")]/following-sibling::td/node()[1]
has the output PUTIN
Adjustments were to change <td class=term> into <td class="term"> and all <P align=justify> into <P align="justify"> (maybe not necessary for your settings but was required for the XPath evaluator I just used).

Using schema.org branchOf with itemref

I have a company page that has all the local branches listed on it.
On the page header, I have an itemType of Organization defined.
Each LocalBusiness (Hotel) is further down the page.
For each Hotel, I'm trying to add the branchOf property using a meta tag, but both Yandex and Google Snippets comes up blank for this attribute. Is it possible to do this way?
<div itemscope itemtype="http://schema.org/Organization" id="schema-organization">
<meta itemprop="description" content="blah blah blah" />
<a href="/" itemprop="url">
<h1 itemprop="name">The Hotel Chain</h1>
</a>
<div itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
<div itemprop="addressLocality">new york city</div>
<meta itemprop="addressRegion" content="NY" />
</div>
</div>
<!-- snip -->
<div itemscope itemtype="http://schema.org/Hotel">
<meta itemprop="branchOf" itemref="schema-organization" />
<h2 itemprop="name">Hotel Location 1</h2>
Get directions on Google Maps
<div itemprop="geo" itemscope itemtype="http://schema.org/GeoCoordinates">
<meta itemprop="latitude" content="40.7450605" />
<meta itemprop="longitude" content="-73.98301879999997" />
</div>
<div class="wrap-address clearfix" itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
<ul class="ul-reset">
<li><span itemprop="streetAddress">1234 Some Street</span> (between 3rd Ave & Lexington Ave)</li>
<li>
<span itemprop="addressLocality">New York City</span>,
<span itemprop="addressRegion">NY</span>
<span itemprop="postalCode">10016</span>
</li>
</ul>
</div>
<ul>
<li><strong>Phone:</strong><span itemprop="telephone">555-555-5555</span></li>
<li><strong>Fax:</strong><span itemprop="faxNumber">555-555-5555</span></li>
</ul>
<ul>
<li>
Information:
info#hotellocation1.com
</li>
<li itemprop="contactPoint" itemscope itemtype="http://schema.org/ContactPoint">
<span itemprop="contactType">Reservations</span>:
reservations#hotellocation1.com
</li>
<li itemprop="contactPoint" itemscope itemtype="http://schema.org/ContactPoint">
<span itemprop="contactType">Concierge</span>:
Concierge#hotellocation1.com
</li>
</ul>
</div>
About itemref:
it has to be specified on elements with itemscope
it is used to reference other properties (= itemprop in Microdata)
So this means for you:
move itemref to the Hotel
move itemprop="branchOf" to the Organization
Minimal example:
<div itemprop="branchOf" itemscope itemtype="http://schema.org/Organization" id="schema-organization">
<h1 itemprop="name">The Hotel Chain</h1>
</div>
<div itemscope itemtype="http://schema.org/Hotel" itemref="schema-organization">
<h2 itemprop="name">Hotel Location 1</h2>
</div>

Resources