Altova XMLspy 2014: Multi-line xpath in XSD 1.1 assertions - xpath

In Altova XMLSpy 2014, in an XSD 1.1 document, if I add an assertion, I can enter an XPath 2.0 expression for the "test" attribute of the assertion, but only ONE line is shown. How can I enter a multi-line XPath in an assertion?
Of course, I can enter a multi-line xpath in text view. But I'm using a graphical tool to edit my XSD files easily, so I would like to edit complex xpath expressions graphically (in schema view).
In other components (for example, in annotations) I can press Ctrl+Enter to insert multiple lines. I can't do that in assertions.
Even worse, if I enter a multi-line xpath assertion in text view, and I change to schema view ("Schema Overview" or "Content Model View") and try to edit the xpath, then the multi-line xpath is shown as only one line.
Multi-line xpath in assertions is required for advanced (complex) node checking. For example, the following xpath:
every $symbol in symbols/symbol satisfies
  every $state in states/state satisfies
    some $tran in transition-function/transition satisfies
      $tran/@current-symbol eq $symbol
      and $tran/@current-state eq $state
can be easily understood only with a multi-line format.
XPath 2.0 is close to being a programming language, and it is very useful for checking relations between node values. As in any programming language, expressions can be long and complex, so multi-line editing is essential.
Am I perhaps missing some setup option that enables it?
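For readers less familiar with XPath 2.0 quantifiers, the nested every/some expression above can be read as "for every symbol and every state, at least one transition matches both". A rough Python equivalent is sketched below. The sample document, its root element name and the function name are assumptions made up for illustration; only the element and attribute names inside the paths come from the expression in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance document; the structure is guessed from the paths
# used in the assertion (symbols/symbol, states/state, ...).
SAMPLE = """<machine>
  <symbols><symbol>0</symbol><symbol>1</symbol></symbols>
  <states><state>q0</state><state>q1</state></states>
  <transition-function>
    <transition current-state="q0" current-symbol="0"/>
    <transition current-state="q0" current-symbol="1"/>
    <transition current-state="q1" current-symbol="0"/>
    <transition current-state="q1" current-symbol="1"/>
  </transition-function>
</machine>"""


def every_pair_has_transition(machine):
    """Mirror of: every $symbol ... every $state ... some $tran ... satisfies."""
    symbols = [s.text for s in machine.findall("symbols/symbol")]
    states = [s.text for s in machine.findall("states/state")]
    transitions = machine.findall("transition-function/transition")
    return all(                      # every $symbol, every $state
        any(                         # some $tran
            t.get("current-symbol") == symbol
            and t.get("current-state") == state
            for t in transitions
        )
        for symbol in symbols
        for state in states
    )


complete = ET.fromstring(SAMPLE)
result_complete = every_pair_has_transition(complete)

# Remove one transition so that one (symbol, state) pair is left uncovered.
incomplete = ET.fromstring(SAMPLE)
tf = incomplete.find("transition-function")
tf.remove(tf.findall("transition")[-1])
result_incomplete = every_pair_has_transition(incomplete)
```

The three levels of nesting are what make a one-line edit box painful: the same structure flattened onto one line is much harder to check by eye.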

Related

Evaluate xpath selector to get text in p- and li-tags

To automatically replace keywords with links based on a list of keyword-link pairs, I need to get the text inside paragraphs (p) and list items (li) that is not already linked, not inside a script, and not manually excluded. This is for use in Drupal's Alinks module.
I modified the existing XPath selector as follows and would like feedback on whether it is efficient or could be improved:
//*[p or li]//text()[not(ancestor::a) and not(ancestor::script) and not(ancestor::*[@data-alink-ignore])]
The xpath is meant to work with any html5 content, also with self closing tags (not well-formed xml) -- that's the way the module was designed, and it works quite well.
To select text nodes that are descendants of p or li elements and are not descendants of a or script elements, you can use this XPath 1.0 expression:
//*[self::p|self::li]
//text()[
not(ancestor::a|ancestor::script|ancestor::*[@data-alink-ignore])
]
Your XPath expression is invalid. You are missing a / before text(). So a valid expression would be
//*[p or li]/text()[not(ancestor::a) and not(ancestor::script) and not(ancestor::*[@data-alink-ignore])]
But without an XML source file it is impossible to tell if this expression would match your desired node.
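To make the intended selection concrete, here is a minimal Python sketch of the same filter: collect text inside p or li, skipping anything under a, script, or an element carrying data-alink-ignore. It is not the Alinks module's code. The sample markup and the function name are invented for illustration, and the sketch assumes well-formed input, which the real module does not require. The standard library's ElementTree has no ancestor axis, so the ancestor checks are carried down as flags while walking the tree.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample input.
SAMPLE = (
    '<div>'
    '<p>Buy <a href="/x">linked</a> apples</p>'
    '<ul><li>pears</li><li data-alink-ignore="">skip me</li></ul>'
    '<script>var p = 1;</script>'
    '<span>outside</span>'
    '</div>'
)


def linkable_text(elem, inside_target=False, excluded=False):
    """Return text found inside p/li but not under a, script or an ignored element."""
    excluded = (
        excluded
        or elem.tag in ("a", "script")
        or "data-alink-ignore" in elem.attrib
    )
    inside_target = inside_target or elem.tag in ("p", "li")
    keep = inside_target and not excluded

    found = []
    if keep and elem.text and elem.text.strip():
        found.append(elem.text.strip())
    for child in elem:
        found.extend(linkable_text(child, inside_target, excluded))
        # A child's tail text sits after the child, so it belongs to elem.
        if keep and child.tail and child.tail.strip():
            found.append(child.tail.strip())
    return found


texts = linkable_text(ET.fromstring(SAMPLE))
```

With this sample, the text before and after the link is kept while the link text, the ignored list item, the script body and the span outside any p or li are skipped.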

XPath - How to get image source from xml

Hello, I have this XML:
<item>
<title> Something for title»</title>
<link>some url</link>
<description><![CDATA[<div class="feed-description"><div class="feed-image"><img src="pictureUrl.jpg" /></div>text for desc</div>]]></description>
<pubDate>Thu, 11 Jun 2015 16:50:16 +0300</pubDate>
</item>
I try to get the img src with the path //description//div[@class='feed-description']//div[@class='feed-image']//img/@src but it doesn't work.
Is there any solution?
A CDATA section escapes its contents. In other words, CDATA prevents its contents from being parsed as markup when the rest of the document is parsed. So the <div>s in there are not seen as XML elements, only as flat text. The <description> element has no element children ... only a single text child. As such, XPath can't select any <div> descendant of <description> because none exists in the parsed XML tree.
What to do?
If your XPath environment supports XPath 3.0, you could use parse-xml() to turn the flat text into a tree, then use XPath to select //div[@class='feed-description']//div[@class='feed-image']//img/@src from the resulting tree.
Otherwise, your best workaround may be to use primitive string-processing functions like substring-before(), substring-after(), or matches(). (The latter uses regular expressions and requires XPath 2.0.) Of course, many people will tell you not to use regular expressions to analyze markup like XML and HTML. For good reason: in the general case, it's very difficult to do it right (with regexes or with plain string searches). But for very restricted cases where the input is highly predictable, and in the absence of better tools, it can be the best tool for a less-than-ideal job.
For example, for the data shown in your question, you could use
substring-before(substring-after(//description, 'img src="'), '"')
In this case, the inner call substring-after(//description, 'img src="') returns pictureUrl.jpg" /></div>text for desc</div>, of which the substring before " is pictureUrl.jpg.
This isn't really robust, for example it'll fail if there's a space between src and =. But if the exact formatting is predictable, you'll be OK.
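If the XPath is being driven from a host language, both ideas can be applied there instead. The sketch below uses Python's standard library as an assumed host environment (the question does not say which one is in use): it first confirms that the description element has no element children, then either re-parses the CDATA text as its own document (the same idea as parse-xml()) or slices the string (the same idea as substring-after/substring-before). The title text is simplified and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

ITEM = """<item>
<title>Something for title</title>
<link>some url</link>
<description><![CDATA[<div class="feed-description"><div class="feed-image"><img src="pictureUrl.jpg" /></div>text for desc</div>]]></description>
<pubDate>Thu, 11 Jun 2015 16:50:16 +0300</pubDate>
</item>"""

item = ET.fromstring(ITEM)
description = item.find("description")

# The CDATA content arrives as flat text; <description> has no child elements.
child_count = len(description)
flat_text = description.text

# Approach 1: parse the flat text as a second document, then query that tree.
# This only works while the embedded markup is well-formed XML.
fragment = ET.fromstring(flat_text)
img = fragment.find(".//div[@class='feed-image']//img")
src_from_tree = img.get("src")

# Approach 2: plain string slicing, equivalent to
# substring-before(substring-after(//description, 'img src="'), '"').
src_from_string = flat_text.partition('img src="')[2].partition('"')[0]
```

The first approach survives formatting changes such as extra whitespace around the = sign; the second shares the fragility described above.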

XPath on Wikipedia Summary

I'm currently trying to extract the blurb, or summary from any given Wikipedia page, using XPath. Now, there are many places online where this has already been done: http://jimblackler.net/blog/?p=13, How to use XPath or xgrep to find information in Wikipedia?.
But, when I try to use similar XPath expressions, on a variety of pages, the returned results are strange. For the sake of this question, let's assume I'm trying to retrieve the very first paragraph in the printable Wikipedia page on Boston: http://en.wikipedia.org/w/index.php?title=Boston&printable=yes.
When I try to use this expression /html/body/div[@id='content']/div[@id='bodyContent']//p, only the last four words of the paragraph, "in the United States.", are returned.
Actually, the expression used above could be simplified to //div/p, but the results are the same.
Strangely, the links I linked to previously seem to use similar methods and return great results; originally, I imagined this was due to Wikipedia changing the formatting of their pages in recent years, but honestly, I can't seem to find what's wrong with both the expressions.
Does anyone have any idea about this?
When I try to use this expression
/html/body/div[@id='content']/div[@id='bodyContent']//p, only the
last four words of the paragraph, "in the United States.", are
returned.
There are a few problems here:
1. The XML document is in a default namespace. Writing XPath expressions to select nodes in a document that is in a default namespace is the most frequently asked question about XPath -- search for "XPath and default namespace". In short, any unprefixed name will most probably cause nothing to be selected. One must register the default namespace and associate a specific prefix with this namespace. Then any element name in the XPath expression must be written with this prefix. So, the expression above will become:
/x:html/x:body/x:div[@id='content']/x:div[@id='bodyContent']//x:p
where the "x:" prefix is associated with the "http://www.w3.org/1999/xhtml" namespace.
2. Even the above expression doesn't select (only) the wanted node. In order to select only the first x:p from the above, the XPath expression must be specified as (note the brackets):
(/x:html/x:body/x:div[@id='content']/x:div[@id='bodyContent']//x:p)[1]
3. As you want the text of the paragraph, an easy way to do this is to use the standard XPath function string():
string((/x:html/x:body/x:div[@id='content']/x:div[@id='bodyContent']//x:p)[1])
When this XPath expression is evaluated, I get the text of the paragraph -- for example in the XPath Visualizer I wrote some years ago.
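How the prefix gets registered depends on the host environment, which the question does not name. As one possible illustration, the sketch below uses Python's standard-library ElementTree, where the prefix-to-namespace mapping is passed as a dictionary. The small XHTML document is a made-up stand-in for the Wikipedia page, and ElementTree only supports a subset of XPath, so taking the first match and computing the string value are done in Python rather than with (...)[1] and string().

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the printable page; note the default namespace.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<div id="content"><div id="bodyContent">
<p><b>Boston</b> is a city <a href="/wiki/United_States">in the United States.</a></p>
<p>Second paragraph.</p>
</div></div>
</body></html>"""

# Associate the "x" prefix with the XHTML namespace.
NS = {"x": "http://www.w3.org/1999/xhtml"}

root = ET.fromstring(XHTML)

# Unprefixed names select nothing, because every element is in the namespace.
unprefixed = root.findall(".//p")

# Prefixed names match; root is the html element, hence the relative path.
path = "./x:body/x:div[@id='content']/x:div[@id='bodyContent']//x:p"
paragraphs = root.findall(path, NS)

# Equivalent of string((...)[1]): first match, all descendant text joined.
first_paragraph = "".join(paragraphs[0].itertext())
```

Joining all descendant text matters here: reading only a single text node of the paragraph would return a fragment rather than the whole sentence.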

xpath expression to select text from link

I have such content of html file:
<a class="bf" title="Link to book" href="/book/229920/">book name</a>
Help me to construct xpath expression to get link text (book name).
I try to use /a, but expression evaluates without results.
If the context is the entire document you should probably use // instead of /. Also you may (not sure about that) need to get down one more level to retrieve the text.
I think it should look like this
//a/text()
EDIT: As Tomalak pointed out it's text() not text
Have you tried
//a
?
More specific is better:
//a[@class='bf' and starts-with(@href, '/book/')]
Note that this selects the <a> element. In your host environment it's easy to extract the text value of that node via standard DOM methods (like the .textContent property).
To select the actual text node, see the other answers in this thread.
It depends also on the rest of your document. If you use // in the beginning all the matching nodes will be returned, which might be too many results in case you have other links in your document.
Apart from that a possible xpath expression is //a/text().
The /a you tried only returns the a-tag itself, if it is the root element. To get the link text you need to append the /text() part.
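The answers above can be tried out with a short script. This sketch assumes Python's standard-library ElementTree as the host environment and wraps the link from the question in a made-up html/body skeleton with one extra link. ElementTree's XPath subset has no starts-with() and no "and" inside predicates, so the href check is done in Python.

```python
import xml.etree.ElementTree as ET

# The link from the question inside a hypothetical page skeleton.
PAGE = (
    '<html><body>'
    '<a class="bf" title="Link to book" href="/book/229920/">book name</a>'
    '<a href="/about/">about</a>'
    '</body></html>'
)

root = ET.fromstring(PAGE)

# Like //a: every link in the document, which may be more than wanted.
all_links = root.findall(".//a")

# Like //a[@class='bf' and starts-with(@href, '/book/')]:
# the class test is in the path, the href test is applied in Python.
book_links = [
    a for a in root.findall(".//a[@class='bf']")
    if a.get("href", "").startswith("/book/")
]

# Like //a/text() for the selected element: its text content.
link_text = book_links[0].text
```

Selecting the element first and reading its text afterwards matches the advice to use the host environment's own accessor for the text value.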

html 4.0 entities in XPATH queries

I don't know exactly why the xpath expression:
//h3[text()='Foo &rsaquo; Bar']
doesn't match:
<h3>Foo &rsaquo; Bar</h3>
Does that seem right? How do I query for that markup?
XPath does not define any special escape sequences. When XPath is used within XSLT (e.g. in attributes of elements of an XSLT document), the escape sequences are processed by the XML processor that reads the stylesheet. If you use XPath in non-XML context (e.g. from Java or C# or other language) via a library, and your XPath query is a string literal in that language, you won't get any escape processing aside from that which the language itself usually does.
If this is C# or Java, this should work:
String xpath = "//h3[text()='Foo \u203A Bar']";
...
As a side note, it wouldn't work in XSLT either, as XSLT uses XML, which doesn't define a character entity &rsaquo; - it only defines &lt;, &gt;, &quot;, &apos; and &amp;. You'd have to either use &#x203A;, or define the character entity yourself in the DOCTYPE declaration of the XSLT stylesheet.
From the XPath specification:
XPath operates on the abstract, logical structure of an XML document, rather than its surface syntax
… so unless you are using the query inside (as opposed to "to query") a language that resolves that entity (perhaps XSLT with a DTD that includes the entity (if that is possible, I'm far from an XSLT expert)), I wouldn't expect it to work.
Use a literal character or an escape sequence recognized by whatever language you are using XPath from.
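The same point can be shown from a host language. The sketch below assumes Python as that language and builds a small document whose h3 contains the actual character U+203A (written as a numeric character reference, since plain XML does not know &rsaquo;). The helper name and the sample document are illustrative only. A query string that still contains the unresolved text "&rsaquo;" finds nothing, while the literal character, or the entity resolved by the host language, does match.

```python
import html
import xml.etree.ElementTree as ET

# After parsing, the h3 text contains the single character U+203A.
doc = ET.fromstring("<body><h3>Foo &#8250; Bar</h3></body>")


def find_h3(root, wanted):
    """Return the h3 elements whose text equals the given string exactly."""
    return [h for h in root.iter("h3") if h.text == wanted]


# Literal character via the host language's own escape sequence.
with_literal = find_h3(doc, "Foo \u203a Bar")

# The entity name is just nine ordinary characters to the query.
with_entity_name = find_h3(doc, "Foo &rsaquo; Bar")

# Resolving the HTML entity in the host language first makes it match again.
with_unescaped = find_h3(doc, html.unescape("Foo &rsaquo; Bar"))

# U+203A in hexadecimal is 8250 in decimal, so \u203A and &#8250; agree.
code_point = ord(html.unescape("&rsaquo;"))
```

Note the hexadecimal/decimal distinction: a Java or C# \u escape takes the hex value 203A, whereas 8250 is the decimal form used in &#8250;.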

Resources