How to skip paragraphs with comments in XPath expression? - xpath

I'm trying to scrape websites like this with the following Xpath expression:
.//div[@class="tresc"]/p[not(starts-with(text(), "<!--"))]
The thing is that the first paragraph is a comment section, so I'd like to skip it:
<!--[if gte mso 9]><xml>
<w:WordDocument>
<w:View>Normal</w:View>
<w:Zoom>0</w:Zoom>
<w:HyphenationZone>21</w:HyphenationZone>
<w:PunctuationKerning />
<w:ValidateAgainstSchemas />
<w:SaveIfXMLInvalid>false</w:SaveIfXMLInvalid>
<w:IgnoreMixedContent>false</w:IgnoreMixedContent>
<w:AlwaysShowPlaceholderText>false</w:AlwaysShowPlaceholderText>
<w:Compatibility>
<w:BreakWrappedTables />
<w:SnapToGridInCell />
<w:WrapTextWithPunct />
<w:UseAsianBreakRules />
<w:DontGrowAutofit />
</w:Compatibility>
<w:BrowserLevel>MicrosoftInternetExplorer4</w:BrowserLevel>
</w:WordDocument>
</xml><![endif]-->
Unfortunately, my expression does not skip the paragraph with comments. Anyone know what I'm doing wrong?

Comments are not part of text(), they constitute a node of their own: comment(). To exclude p's that contain comments, use
p[not(comment())]
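To see why the text() test never fires, here is a minimal Python sketch using only the standard library (Python 3.8+). The SAMPLE markup is a hypothetical, simplified stand-in for the scraped page, and has_comment is an illustrative helper that plays the role of the comment() predicate, since ElementTree's limited XPath does not support comment() itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the scraped page (assumption).
SAMPLE = (
    '<div class="tresc">'
    "<p><!--[if gte mso 9]> Word export junk <![endif]--></p>"
    "<p>First real paragraph.</p>"
    "</div>"
)

# By default ElementTree drops comments; insert_comments=True keeps them
# in the tree as child nodes whose tag is the ET.Comment factory.
parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
root = ET.fromstring(SAMPLE, parser=parser)


def has_comment(element):
    """Rough equivalent of the XPath predicate [comment()]."""
    return any(child.tag is ET.Comment for child in element)


first_p = root.find("p")
# The comment is a child node of <p>, not part of its text content,
# which is why starts-with(text(), "<!--") cannot match it.
print(repr(first_p.text))

# Rough equivalent of p[not(comment())].
kept = [p for p in root.findall("p") if not has_comment(p)]
print([p.text for p in kept])  # ['First real paragraph.']
```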

Related

Xpath query expression: summing up a attribute over a condition

<Cities>
<city>
<name />
<country />
<population asof = "2019" />
<total> 2918695</total>
<Average_age> 28 </Average_age>
</city>
<city>
<name />
<country />
<population asof = "2020" />
<total> 78805467 </total>
<Average_age> 32 </Average_age>
</city>
</Cities>
I want to build an XPath query which returns the total population of cities where asof is higher than 2018.
Try this XPath-1.0 expression:
sum(/Cities/city[population/@asof > 2018]/total)
Or, another, less specific, version:
sum(//city[population/@asof > 2018]/total)
The expression to select the population elements whose asof attribute is greater than 2018 would be:
//population[@asof > '2018']
In your sample, <total> is a sibling of <population>, not a child (despite your indentation), so append following-sibling::total to the expression; if it were a child, you would append /total instead.
Following the first approach, the XPath continues as:
//population[@asof > '2018']/following-sibling::total
Add /text() at the end to get the text inside the desired <total> tag. Additionally, if you want the sum of the populations, you can put the whole expression inside the sum() function. The expression inside sum() would be:
//population[@asof > '2018']/following-sibling::total/text()
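As a cross-check on the arithmetic, the same filter-then-sum logic can be sketched in plain Python. ElementTree's XPath subset has no sum() and no numeric comparisons, so the predicate is applied in Python instead. The third city (asof 2017) is a hypothetical addition, included only to show that the filter actually excludes something.

```python
import xml.etree.ElementTree as ET

# The two cities from the question, plus one hypothetical older entry.
CITIES = """<Cities>
  <city><population asof="2019"/><total> 2918695</total></city>
  <city><population asof="2020"/><total> 78805467 </total></city>
  <city><population asof="2017"/><total> 100 </total></city>
</Cities>"""

root = ET.fromstring(CITIES)

# Mirrors sum(/Cities/city[population/@asof > 2018]/total):
# keep cities whose population/@asof exceeds 2018, then add up <total>.
total = sum(
    int(city.findtext("total").strip())
    for city in root.findall("city")
    if int(city.find("population").get("asof")) > 2018
)

print(total)  # 81724162
```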

XMLUNIT 2 using comparison with ignore element order with diffbuilder and namespaces fails

I am trying to use DiffBuilder to ignore XML elements order when comparing two .xml files but it fails. I have tried every possible combination and read many articles before posting this question.
For example:
<Data:Keys>
<Data:Value Key="1" Name="Example1" />
<Data:Value Key="2" Name="Example2" />
<Data:Value Key="3" Name="Example3" />
</Data:Keys>
<Data:Keys>
<Data:Value Key="2" Name="Example2" />
<Data:Value Key="1" Name="Example1" />
<Data:Value Key="3" Name="Example3" />
</Data:Keys>
I want these two treated as the same XML. Notice that the elements are empty; they have only attributes.
What I did so far:
def diff = DiffBuilder.compare(Input.fromString(xmlIN))
.withTest(Input.fromString(xmlOUT))
.ignoreComments()
.ignoreWhitespace()
.checkForSimilar()
.withNodeMatcher(new DefaultNodeMatcher(ElementSelectors.conditionalBuilder()
.whenElementIsNamed("Data:Keys").thenUse(ElementSelectors.byXPath("./Data:Value",
ElementSelectors.byNameAndText))
.elseUse(ElementSelectors.byName)
.build()))
But it fails every time. I don't know if the issue is the namespace, or that the elements are empty.
Any help will be appreciated. Thank you in advance.
If you aim to match the Data:Value tags with each other by their attributes, you should start with this:
.withNodeMatcher(new DefaultNodeMatcher(ElementSelectors.conditionalBuilder()
.whenElementIsNamed("Data:Value")
Since that tag doesn't have any text, byNameAndText won't work. You can only work with names and attributes. My advice is to do it like this:
.thenUse(ElementSelectors.byNameAndAttributes("Key"))
or
.thenUse(ElementSelectors.byNameAndAllAttributes())
//equivalent
.thenUse(ElementSelectors.byNameAndAttributes("Key", "Name"))
As for issues with namespaces, checkForSimilar() should output SIMILAR, which means they are not DIFFERENT, so this is what you need. If you didn't use checkForSimilar(), the differences in namespaces would be output as DIFFERENT.
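The idea behind matching by name and attributes can be illustrated outside XMLUnit. The following Python sketch is not the XMLUnit API; it is a swapped-in, order-insensitive comparison that pairs Data:Value elements by their Key attribute. The namespace URI is hypothetical, because the question's snippets do not show the xmlns:Data declaration.

```python
import xml.etree.ElementTree as ET

NS = "urn:example:data"  # hypothetical namespace URI (not in the question)

XML_IN = f"""<Data:Keys xmlns:Data="{NS}">
  <Data:Value Key="1" Name="Example1" />
  <Data:Value Key="2" Name="Example2" />
  <Data:Value Key="3" Name="Example3" />
</Data:Keys>"""

XML_OUT = f"""<Data:Keys xmlns:Data="{NS}">
  <Data:Value Key="2" Name="Example2" />
  <Data:Value Key="1" Name="Example1" />
  <Data:Value Key="3" Name="Example3" />
</Data:Keys>"""


def keyed_values(xml_text):
    """Index each Data:Value element by its Key attribute."""
    root = ET.fromstring(xml_text)
    return {
        value.get("Key"): dict(value.attrib)
        for value in root.findall(f"{{{NS}}}Value")
    }


# Because elements are paired by Key rather than by position,
# reordering the children does not affect the comparison.
same = keyed_values(XML_IN) == keyed_values(XML_OUT)
print(same)  # True
```

Keying on a single attribute assumes Key values are unique within the parent; with duplicates, a multiset comparison would be needed instead.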

XPath remove single node (via Saxon CLI)

I want to remove a node from an XML file (using SaxonHE9-8-0-11J):
<project name="Build">
<property name="src" value="src/main/resources" />
<property name="target" value="target/classes" />
<condition property="target.exists">
<available file="target" />
</condition>
</project>
Apparently there are two ways I can do this:
XPath 1: using a not function
XPath 2: using an except clause
But both simply return the entire node-set.
With a not function:
saxonb-xquery -s:test.xml -qs:'*[not(local-name()="condition")]'
With an except clause:
saxonb-xquery -s:test.xml -qs:'* except condition'
With -explain switch the queries are:
<query>
<body>
<filterExpression>
<axis name="child" nodeTest="element()"/>
<operator op="ne (on empty return true())">
<functionCall name="local-name">
<dot/>
</functionCall>
<literal value="condition" type="xs:string"/>
</operator>
</filterExpression>
</body>
</query>
and
<query>
<body>
<operator op="except">
<axis name="child" nodeTest="element()"/>
<path>
<root/>
<axis name="descendant" nodeTest="element(condition, xs:anyType)"/>
</path>
</operator>
</body>
</query>
In general, XPath selects nodes from one or more input documents; it doesn't allow you to construct new ones. For that you need XSLT or XQuery. Removing the condition child of the project root, if that is what you want to achieve, is therefore a job for XSLT or XQuery. With XPath, even if you use /*/(* except condition), you get all children except the condition element, but as a sequence, not wrapped in a root element.
So with XQuery you could use
/*/element {node-name()} { * except condition }
as a compact but generic way to reconstruct any root with all child elements except the condition: https://xqueryfiddle.liberty-development.net/948Fn5b
Whether you can pass such an expression through a command-line shell is a different problem. On Windows, in both a PowerShell window and the cmd shell, it works for me to use
-qs:"/*/element {node-name()} { * except condition }"
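The select-versus-construct distinction can be sketched in Python's standard library as well. This is only an analogue of the XQuery constructor, not Saxon's behaviour: it builds a fresh root with the same name and copies over every child element except condition. Like the XQuery above, it does not carry over the root's attributes.

```python
import xml.etree.ElementTree as ET

PROJECT = """<project name="Build">
  <property name="src" value="src/main/resources" />
  <property name="target" value="target/classes" />
  <condition property="target.exists">
    <available file="target" />
  </condition>
</project>"""

root = ET.fromstring(PROJECT)

# Selecting alone (like the XPath '* except condition') yields a flat list.
selected = [child for child in root if child.tag != "condition"]

# Constructing a new root (like 'element {node-name()} { ... }')
# wraps that selection back into a document.
new_root = ET.Element(root.tag)
new_root.extend(selected)

print(ET.tostring(new_root, encoding="unicode"))
```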

JSTL: How to print a newline

How can I print a newline using JSTL or EL ("\n", "\r\n", or "\n\r"; which is the right one to be understood by a browser)? I really want to print a newline (not a <BR>), since I need to place it in a JavaScript section of an HTML file.
Try the xml entities for this:
&#10; for a newline and &#13; for a carriage return.
The simple solution is just not to use JSTL/EL:
<% out.print("\n"); %>
Even simpler:
<%= '\n' %>
You are asking the wrong question in your main post; the real problem only appears in a comment you added later (perhaps you should edit your post to reflect the information in the comment?):
I need to put a \n after a // <![CDATA[ to end the comment before actual JS code starts.
The easiest way to fix your issue is to comment out the CDATA using a block comment like this:
/* <![CDATA[ */
This will allow you to continue your code on the same line and it will not be part of the comment.
Example
/* <![CDATA[ */ var foo = "var";alert( foo ); /* ]]> */
Try the below concept and it works.
<c:set var="String1" value="line1 line2 line3 line4" />
<c:set var="String2" value="${fn:split(String1, ' ')}" />
<c:set var= "new" value="<br />" />
<c:out value="${String2[0]}${new}" escapeXml="false" />
<c:out value="${String2[1]}${new}" escapeXml="false" />
<c:out value="${String2[2]}${new}" escapeXml="false" />
<c:out value="${String2[3]}${new}" escapeXml="false" />

Shortest match in Regex [duplicate]

This question already has answers here:
Find shortest matches between two strings
(4 answers)
Closed 3 years ago.
This is my regex:
/<strong>.*ingredients.*<\/ul>/im
Assuming the source code:
<strong>Contest closes on Thursday May 10th 2012 at 9pm PST</strong></div>
<br />
<br />
<br />
* I am not affiliated with Blue Marble Brands or Ines Rosales Tortas in any way. I am not sponsored by them and did not receive any compensation to write this post...I just simply think the Tortas are wonderful!<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<img border="0" height="480" mea="true" src="http://1.bp.blogspot.com/-35J5vNrXkqE/T6htXTafrmI/AAAAAAAAA5E/g2mtiuSpSmw/s640/food+003.JPG" width="640" /></div>
<br />
<strong><span style="font-size: large;">Ingredients:</span></strong><br />
<ul>
<li>Ines Rosales Rosemary and Thyme Tortas</li>
<li>Pizza Sauce (ready made in a jar)</li>
<li>Roma Tomatoes</li>
<li>Roasted Red Peppers </li>
<li>Marinated Artichoke Hearts</li>
<li>Olives (I used Pitted Spanish Manzanilla Olives)</li>
<li>Daiya Vegan Mozzarella Cheese</li>
</ul>
<span style="font-size: large;"><strong>Directions:</strong></span><br />
<br />
Spread small amount of pizza sauce over Torta.
The regex is greedy and grabs everything from <strong>Contest... to </ul>, but the shortest match should yield <strong><span style="font-size: large;">Ingredients...</ul>
this is my gist: https://gist.github.com/3660370
::EDIT::
Please allow flexibility between the strong tag and ingredients, and between ingredients and ul.
Try this:
/<strong><span.*ingredients.*<\/ul>/im
Please refrain from regex-ing html. Use Nokogiri or a similar library instead.
This should work:
/(?!<strong>.*<strong>.*<\/ul>)<strong>.*?ingredients.*?<\/ul>/im
Basically, the regex uses a negative lookahead to avoid matching multiple <strong> tags before </ul>, like this: (?!<strong>.*<strong>.*<\/ul>)
I think this is what you're looking for:
/<strong>(?:(?!<strong>).)*ingredients.*?<\/ul>/im
Replacing the first .* with (?:(?!<strong>).)* allows it to match anything except another <strong> tag before it finds ingredients. After that, the non-greedy .*? causes it to stop matching at the first instance of </ul> it sees. (Your sample only contains the one <UL> element, but I'm assuming the real data could have more.)
The usual warnings apply: there are many ways this regex can be fooled even in perfectly valid HTML, to say nothing of the dreck we usually see out there.
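The difference between the greedy pattern and the (?:(?!<strong>).)* version can be checked with a short Python sketch. Note one translation assumption: the original patterns use Ruby-style /im flags, where m lets the dot match newlines, so the Python equivalent is re.I | re.S. The SAMPLE string is a shortened, hypothetical version of the page, with a second <ul> added to show where the lazy .*? stops.

```python
import re

# Shortened, hypothetical version of the source page (assumption).
SAMPLE = (
    "<strong>Contest closes on Thursday</strong></div>\n"
    "<br />\n"
    "<strong><span>Ingredients:</span></strong><br />\n"
    "<ul>\n<li>Roma Tomatoes</li>\n</ul>\n"
    "<strong>Directions:</strong>\n"
    "<ul><li>Spread sauce</li></ul>"
)

FLAGS = re.I | re.S  # Ruby's /im: case-insensitive, dot matches newline

greedy = re.compile(r"<strong>.*ingredients.*</ul>", FLAGS)
# (?:(?!<strong>).)* consumes one character at a time, but refuses to
# step over the start of another <strong> tag.
tempered = re.compile(r"<strong>(?:(?!<strong>).)*ingredients.*?</ul>", FLAGS)

greedy_match = greedy.search(SAMPLE).group(0)
short_match = tempered.search(SAMPLE).group(0)

print(greedy_match.startswith("<strong>Contest"))  # True
print(short_match.startswith("<strong><span>Ingredients"))  # True
print("Directions" in short_match)  # False
```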
