How to get text between two strings with special characters in Ruby?

I have a string (@description) that contains HTML code, and I want to extract the content between two elements. It looks something like this:
<b>Content title</b><br/>
*All the content I want to extract*
<a href="javascript:print()">
I've managed to do something like this:
@want = @description.match(/Content title(.*?)javascript:print()/m)[1].strip
But this solution is far from perfect, as I get some unwanted characters in my @want string.
Thanks for your help
Edit:
As requested in the comments, here is the full code:
I'm already parsing an HTML document, and the following code:
@description = @doc.at_css(".entry-content").to_s
puts @description
returns:
<div class="post-body entry-content">
<img alt="Photo title" height="333" src="http://photourl.com" width="500"><br><br><div style="text-align: justify;">
Some text</div>
<b>More text</b><br><b>More text</b><br><br><ul>
<li>Numered item</li>
<li>Numered item</li>
<li>Numered item</li>
</ul>
<br><b>Content Title</b><br>
Some text<br><br>
Some text(with links and images)<br>
Some text(with links and images)<br>
Some text(with links and images)<br>
<br><br><img src="http://url.com/photo.jpg">
<div style="clear: both;"></div>
</div>
The text can include more paragraphs, links, images, etc. but it always starts with the "Content Title" part and ends with the javascript reference.
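A side note on the regex attempt above, as a minimal sketch: in a Ruby regex the `()` in `javascript:print()` is an empty capture group rather than literal parentheses, and matching on the bare words leaves fragments of the surrounding tags in the capture. Escaping the complete delimiters with `Regexp.escape` avoids both problems. The sample string and the variable names below are made up for illustration; they are not the asker's real data.

```ruby
# Hypothetical sample input; the real @description comes from Nokogiri.
description = <<~HTML
  <b>Content Title</b><br>
  Some text<br>
  <a href="javascript:print()">Print</a>
HTML

# Regexp.escape turns "(" and ")" into literal characters.
start_marker = Regexp.escape('<b>Content Title</b>')
end_marker   = Regexp.escape('<a href="javascript:print()">')

# /m lets "." match newlines; the lazy (.*?) stops at the first end marker.
pattern = /#{start_marker}(.*?)#{end_marker}/m
want = description[pattern, 1].to_s.strip

puts want
```

This still treats HTML as a flat string, so it only holds while the two delimiters appear exactly as written.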

This XPath expression selects all (sibling) nodes between the nodes $vStart and $vEnd:
$vStart/following-sibling::node()
[count(.|$vEnd/preceding-sibling::node())
=
count($vEnd/preceding-sibling::node())
]
To obtain the full XPath expression to use in your specific case, simply substitute $vStart with:
/*/b[. = 'Content Title']
and substitute $vEnd with:
/*/a[@href = 'javascript:print()']
The final XPath expression after the substitutions is:
/*/b[. = 'Content Title']/following-sibling::node()
[count(.|/*/a[@href = 'javascript:print()']/preceding-sibling::node())
=
count(/*/a[@href = 'javascript:print()']/preceding-sibling::node())
]
Explanation:
This is a simple corollary of the Kayessian formula for the intersection of two nodesets $ns1 and $ns2:
$ns1[count(.|$ns2) = count($ns2)]
In our case, the set of all nodes between the nodes $vStart and $vEnd is the intersection of two node-sets: all following siblings of $vStart and all preceding siblings of $vEnd.
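The counting trick behind the formula can be sketched on plain Ruby arrays, where `Array#|` plays the role of the XPath union operator. This is only an illustration of the idea, not an XPath evaluation; the method name and the sibling list are hypothetical.

```ruby
# Kayessian-style intersection on plain arrays: a node of ns1 is kept when
# adding it to ns2 does not make ns2 any larger, i.e. it is already in ns2.
# Assumes ns2 contains no duplicates, as an XPath node-set would not.
def kayessian_intersection(ns1, ns2)
  ns1.select { |node| ([node] | ns2).size == ns2.size }
end

# Hypothetical list of sibling nodes, in document order.
siblings = %w[b br_1 text_1 br_2 text_2 a div]

start_index = siblings.index('b')
end_index   = siblings.index('a')

following_start = siblings.drop(start_index + 1) # following siblings of the start node
preceding_end   = siblings.take(end_index)       # preceding siblings of the end node

between = kayessian_intersection(following_start, preceding_end)
p between
```

Only the nodes that are both after the start node and before the end node survive the filter.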
XSLT - based verification:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:variable name="vStart" select="/*/b[. = 'Content Title']"/>
<xsl:variable name="vEnd" select="/*/a[@href = 'javascript:print()']"/>
<xsl:template match="/">
<xsl:copy-of select=
"$vStart/following-sibling::node()
[count(.|$vEnd/preceding-sibling::node())
=
count($vEnd/preceding-sibling::node())
]
"/>
==============
<xsl:copy-of select=
"/*/b[. = 'Content Title']/following-sibling::node()
[count(.|/*/a[@href = 'javascript:print()']/preceding-sibling::node())
=
count(/*/a[@href = 'javascript:print()']/preceding-sibling::node())
]
"/>
</xsl:template>
</xsl:stylesheet>
When this transformation is applied on the provided XML document (converted to a well-formed XML document):
<div class="post-body entry-content">
<a href="http://www.photourl">
<img alt="Photo title" height="333" src="http://photourl.com" width="500"/>
</a>
<br />
<br />
<div style="text-align: justify;">
Some text</div>
<b>More text</b>
<br />
<b>More text</b>
<br />
<br />
<ul>
<li>Numered item</li>
<li>Numered item</li>
<li>Numered item</li>
</ul>
<br />
<b>Content Title</b>
<br />
Some text
<br />
<br />
Some text(with links and images)
<br />
Some text(with links and images)
<br />
Some text(with links and images)
<br />
<br />
<br />
<a href="javascript:print()">
<img src="http://url.com/photo.jpg"/>
</a>
<div style="clear: both;"></div>
</div>
the two XPath expressions (with and without variable references) are evaluated and the nodes selected in each case, conveniently delimited, are copied to the output:
<br/>
Some text
<br/>
<br/>
Some text(with links and images)
<br/>
Some text(with links and images)
<br/>
Some text(with links and images)
<br/>
<br/>
<br/>
==============
<br/>
Some text
<br/>
<br/>
Some text(with links and images)
<br/>
Some text(with links and images)
<br/>
Some text(with links and images)
<br/>
<br/>
<br/>

To test your HTML, I added the missing tags around your code and pasted it into a file:
xmllint --html --xpath '/html/body/div/text()' /tmp/l.html
Output:
Some text
Some text
Some text
Some text
Now you can use an XPath module in Ruby and reuse the XPath expression.
You will find many examples by searching Stack Overflow.
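As a rough sketch of what that could look like, the snippet below uses REXML, assuming the rexml gem is available (it ships with standard Ruby distributions). The asker's code uses Nokogiri, which accepts XPath as well; REXML is used here only to keep the example free of extra dependencies. The XML excerpt is a hypothetical, well-formed cut of the page.

```ruby
require 'rexml/document'

# Hypothetical, well-formed excerpt of the page.
xml = <<~XML
  <div class="post-body entry-content">
    <b>Content Title</b>
    <br/>
    Some text
    <br/>
    <a href="javascript:print()">Print</a>
  </div>
XML

doc = REXML::Document.new(xml)

# Select the text nodes that are direct children of the div,
# then drop the whitespace-only ones.
texts = REXML::XPath.match(doc, '/div/text()')
                    .map { |node| node.value.strip }
                    .reject(&:empty?)

p texts
```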

Related

How to exclude from a contains() query all the information from a child class and after some sibling text?

<root>
<a></a>
<b></b>
<c></c>
<a></a>
<d></d>
<e></e>
<a></a>
<a></a>
</root>
In an XML document, how can I exclude from a contains() search all the information from nodes after <d>,
so that I only get results from:
<a></a>
<b></b>
<c></c>
<a></a>
<d></d>
I can't just take the first two results for <a> and the first for <b> and <c>, because sometimes a value will exist only after the <d>.
I have this code that is working:
//div[contains(@class,'class searched')]/*[contains(text(), 'Text Searched')] | //div[contains(@class,'class searched')]/*[not(contains(@class,'class excluded'))]/*[contains(text(), 'Text Searched')]
Thanks for your help :)
EDIT for clarity:
<div Class="TopClass">
<div Class="ClassA">
<div Class="ClassB">
<h3> Text Researched</h3>
<u1 Class="ClassC">
<h3> Text Researched</h3>
</u1>
</div>
</div>
<h4>Other Text</h4>
<div Class="ClassA">
<div Class="ClassB">
<h3> Text Researched</h3>
<u1 Class="ClassC">
<h3> Text Researched</h3>
</u1>
</div>
</div>
I would like to get only the "Text Researched" that is between ClassB and ClassC and that is above the "Other Text". Sometimes the "Text Researched" will only appear below the "Other Text", and I don't want to get that result, so a [1] will not work there. Also, the <h3> and <h4> are used elsewhere in the code.
Given this HTML:
<div class="TopClass">
<div class="ClassA">
<div class="ClassB">
<h3> Text Researched 1</h3>
<u1 class="ClassC">
<h3> Text Researched 2</h3>
</u1>
</div>
</div>
<h4>Other Text</h4>
<div class="ClassA">
<div class="ClassB">
<h3> Text Researched 3</h3>
<u1 class="ClassC">
<h3> Text Researched 4</h3>
</u1>
</div>
</div>
</div>
This XPath expression restricts the search to the ClassA div that is followed by the "Other Text" h4. Without the contains() predicate it returns the text of the first two h3 elements; with it, only the matching one:
//div[@class="ClassA" and following-sibling::h4[.="Other Text"]]//h3[contains(.,"Text Researched 1")]/text()
Result:
echo -e 'cat //div[@class="ClassA" and following-sibling::h4[.="Other Text"]]//h3[contains(.,"Text Researched 1")]/text()\nbye' | xmllint --shell test.html
/ > cat //div[@class="ClassA" and following-sibling::h4[.="Other Text"]]//h3[contains(.,"Text Researched 1")]/text()
-------
Text Researched 1
/ > bye

Rich Snippets : Microdata itemprop out of the itemtype?

I've recently decided to update a website by adding rich snippets - microdata.
The thing is, I'm a newbie to this kind of thing and I have a small question about it.
I'm trying to define the Organization as you can see from the code below:
<div class="block-content" itemscope itemtype="http://schema.org/Organization">
<p itemprop="name">SOME ORGANIZATION</p>
<p itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
<span itemprop="streetAddress">Manufacture Street no 4</span>,
<span itemprop="postalCode">4556210</span><br />
<span itemprop="addressLocality">CityVille</span>,
<span itemprop="addressCountry">SnippetsLand</span></p>
<hr>
<p itemprop="telephone">0444 330 226</p>
<hr>
<p>info@snippets.com</p>
</div>
Now, my problem is the following: I'd also like to tag the logo in order to make a complete Organization profile, but the logo is in the header of my page, while the div I've posted above is in the footer, and the style/layout of the page doesn't permit me to add the logo here and also make it visible.
So, how can I solve this thing? What's the best solution?
Thanks.
You can use the itemref attribute.
Give your logo in the header an id and add the corresponding itemprop:
<img src="acme-logo.png" alt="ACME Inc." itemprop="logo" id="logo" />
Now add itemref="logo" to your div in the footer:
<div class="block-content" itemscope itemtype="http://schema.org/Organization" itemref="logo">
…
</div>
If this is not possible in your case, you could "duplicate" the logo so that it’s included in your div, but not visible. Microdata allows meta and link elements in the body for this case. You should use the link element, as http://schema.org/Organization expects a URL for the logo property. (Alternatively, add it via meta as a separate ImageObject.)
<div class="block-content" itemscope itemtype="http://schema.org/Organization">
…
<link itemprop="logo" href="logo.png" />
…
</div>
Side note: I don’t think that you are using the hr element correctly in your example. If you simply want to display a horizontal line, you should use CSS (e.g. border-top on the p) instead.
Dan, you could simply add in the logo schema with this code:
<img itemprop="logo" src="http://www.example.com/logo.png" />
So in your example, you could simply tag it as:
<div class="block-content" itemscope itemtype="http://schema.org/Organization">
<p itemprop="name">SOME ORGANIZATION</p>
<img itemprop="logo" src="http://www.example.com/logo.png" />
<p itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
<span itemprop="streetAddress">Manufacture Street no 4</span>,
<span itemprop="postalCode">4556210</span><br />
<span itemprop="addressLocality">CityVille</span>,
<span itemprop="addressCountry">SnippetsLand</span></p>
<hr>
<p itemprop="telephone">0444 330 226</p>
<hr>
<p>info@snippets.com</p>
</div>
I believe that should work for your particular case, and you wouldn't have to mark up the logo separately. Note that an img element is rendered like any other image, so hide it with CSS if it must not be visible in the footer. Hope that helps.

xpath - how to find an embedded li with an input element inside it?

Given this HTML:
<li class="check_boxes input optional" id="activity_roles_input">
<fieldset class="choices">
<legend class="label"><label>Roles</label></legend>
<input id="activity_roles_none" name="activity[role_ids][]" type="hidden" value="" />
<ol class="choices-group">
<li class="choice">
<label for="activity_role_ids_104">
<input id="activity_role_ids_104" name="activity[role_ids][]" type="checkbox" value="104" />Language Therapist
</label>
</li>
<li class="choice">
<label for="activity_role_ids_103">
<input id="activity_role_ids_103" name="activity[role_ids][]" type="checkbox" value="103" />Speech Therapist
</label>
</li>
</ol>
</fieldset>
</li>
I am trying to use Selenium and xpath with it.
I am trying to select the first 'checkbox' input element link.
I am having problems selecting the element.
I cannot use the database ID (104), as this is for repeated tests with new IDs each time. I need to select the first input checkbox, based on it having the text "Language Therapist".
I have tried:
xpath=(//li[contains(@id,'activity_roles_input')])//input
and
xpath=(//li[contains(@id,'activity_roles_input')])//contains('Language Therapist")
but it is not finding the element.
When I do:
xpath=(//li[contains(@id,'activity_roles_input')])
it gets to the input set. The problem I am having is selecting the first input checkbox control for 'Language Therapist'.
First, find any <li> containing the text, and then look among its descendants for the first checkbox.
xpath=(//li[contains(., "Language Therapist")]/descendant::input[@type="checkbox"][1])
(From Michael)
The above worked for me. In the end I actually used
xpath=(//li[contains(@id,'activity_roles_input')]/descendant::input[@type="checkbox"][1])
because I liked identifying the element by its CSS ID.
An interesting thing to notice when I run this small XSL against your XML:
XSL:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="text"/>
<xsl:template match="/">
<xsl:for-each select="//li[@id ='activity_roles_input']">
<xsl:value-of select="."/>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
Output:
Roles
Language Therapist
Speech Therapist
You have
xpath=(//li[contains(@id,'activity_roles_input')])//input
Shouldn't that be
xpath=(//li[contains(@id,'activity_roles_input')]//input)
or rather
xpath=(//li[@id='activity_roles_input']//input)
?
xpath=(//li[@id='activity_roles_input']//input[1])

Using Watir to verify strike tag exists in html

I have the HTML below, and I am using Watir to try to verify that March does not have a strike tag and that April, May, and June do. I'm pretty sure XPath is the key to my answer, but I have failed to come up with the right solution. Any help is greatly appreciated.
<div class="availability">
Available:
<ul>
<li><span class="month available">March</span></li>
<li><span class="month unavailable"><strike>April</strike></span></li>
<li><span class="month unavailable"><strike>May</strike></span></li>
<li><span class="month unavailable"><strike>June</strike></span></li>
</ul>
</div>
If you are using watir-webdriver, you can do:
#Create an array of the strike elements
months_with_strike = browser.elements(:tag_name, 'strike').collect(&:text)
#Determine if the specified month is in the array
months_with_strike.include?('March')
#=> false
months_with_strike.include?('April')
#=> true
Alternatively, if you only want to check for a single element:
browser.element(:tag_name => 'strike', :text => 'March').exists?
#=> false
browser.element(:tag_name => 'strike', :text => 'April').exists?
#=> true
The important part is that you can get custom elements by using the :tag_name as a locator.
Note: I would think this should also work in watir-classic, but for some reason I am getting exceptions.
Use (assuming the initial context node is the parent of the div element):
div/ul/li/span[not(strike)]
This selects any span element that doesn't have a strike child (and is a child of a li that is a child of a ul that is a child of a div that is a child of the initial context node).
And use:
div/ul/li/span[strike]
This selects any span element that has a strike child (and is a child of a li that is a child of a ul that is a child of a div that is a child of the initial context node).
XSLT - based verification:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="/">
<xsl:copy-of select="div/ul/li/span[not(strike)]"/>
==============
<xsl:copy-of select="div/ul/li/span[strike]"/>
</xsl:template>
</xsl:stylesheet>
When this transformation is applied to the provided XML document:
<div class="availability">
Available:
<ul>
<li><span class="month available">March</span></li>
<li><span class="month unavailable"><strike>April</strike></span></li>
<li><span class="month unavailable"><strike>May</strike></span></li>
<li><span class="month unavailable"><strike>June</strike></span></li>
</ul>
</div>
the two XPath expressions are evaluated and the results (selected nodes) are copied to the output, delimited by a visually distinctive delimiter string:
<span class="month available">March</span>
==============
<span class="month unavailable">
<strike>April</strike>
</span>
<span class="month unavailable">
<strike>May</strike>
</span>
<span class="month unavailable">
<strike>June</strike>
</span>
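The same two expressions can also be tried from plain Ruby without a browser. The following is a sketch that assumes the rexml gem is available (it ships with standard Ruby distributions) and that the markup is well-formed XML; the variable names are illustrative. In a real Watir test you would run the check against the browser instead.

```ruby
require 'rexml/document'

html = <<~HTML
  <div class="availability">
    Available:
    <ul>
      <li><span class="month available">March</span></li>
      <li><span class="month unavailable"><strike>April</strike></span></li>
      <li><span class="month unavailable"><strike>May</strike></span></li>
      <li><span class="month unavailable"><strike>June</strike></span></li>
    </ul>
  </div>
HTML

doc = REXML::Document.new(html)

# The document node is the context node, i.e. the parent of the div.
without_strike = REXML::XPath.match(doc, 'div/ul/li/span[not(strike)]')
with_strike    = REXML::XPath.match(doc, 'div/ul/li/span[strike]')

# Month names, read from the span itself or from its strike child.
available   = without_strike.map(&:text)
unavailable = with_strike.map { |span| span.elements['strike'].text }

p available
p unavailable
```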

XPath expression to select self, preceding and following nodes

I'd like to select the following HTML in a document, based on the content of TARGET, i.e. if TARGET matches, select everything. However, I'm not sure where to go after: id('page')/x:div/span/a='TARGET'. How do I use parent, child, and sibling expressions to get the containing div, the a preceding that div, and the two br tags following the div?
<a></a>
<div>
<br />
<span>
<a>TARGET</a>
<a></a>
<span>
<span>
<a></a>
</span>
<a></a>
<span></span>
</span>
<span>
<a></a>
</span>
</span>
</div>
<br />
<br />
Use a single XPath like:
"//*[
(self::a and following-sibling::*[1][self::div and span/a='TARGET']) or
(self::div and span/a='TARGET') or
(self::br and preceding-sibling::*[1][self::div and span/a='TARGET']) or
(self::br and preceding-sibling::*[2][self::div and span/a='TARGET'])
]"
Do note that your document is not well formed due to unclosed br tags. Moreover, I didn't include any namespace, which you can add if necessary.
You should probably first find all matching divs (I'm not sure exactly which conditions should be met):
//div[span[a[text()="TARGET"]]][preceding-sibling::*[1][name()="a"]][following-sibling::*[1][name()="br"]]
after that - all related elements for each div:
./preceding-sibling::a[1]
./following-sibling::br[1]
./following-sibling::br[2]
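The two-step approach (find the div, then collect its neighbours) could look roughly like this in Ruby. It is a sketch that assumes the rexml gem is available and uses REXML's element navigation in place of the three relative XPath expressions; the `<root>` wrapper is hypothetical and the fragment is trimmed so that it parses as XML.

```ruby
require 'rexml/document'

# The fragment from the question, trimmed and wrapped in a hypothetical
# <root> element so that it is a well-formed document.
xml = <<~XML
  <root>
    <a></a>
    <div>
      <br />
      <span>
        <a>TARGET</a>
        <a></a>
      </span>
    </div>
    <br />
    <br />
  </root>
XML

doc = REXML::Document.new(xml)

# Step 1: the div whose span contains an a with the text TARGET.
div = REXML::XPath.first(doc, "//div[span/a='TARGET']")

# Step 2: the a directly before it and up to two br elements directly after it.
selected = []
if div
  before = div.previous_element
  selected << before if before && before.name == 'a'
  selected << div
  node = div.next_element
  2.times do
    break unless node && node.name == 'br'
    selected << node
    node = node.next_element
  end
end

p selected.map(&:name)
```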

Resources