Below is the XML file:
<Countries>
<Country>
<Name>India</Name>
<Capital>New Delhi</Capital>
</Country>
<Country>
<Name>USA</Name>
<Capital>Washington DC</Capital>
</Country>
<Country>
<Name>England</Name>
<Capital>London</Capital>
</Country>
<Country>
<Name>Japan</Name>
<Capital>Tokyo</Capital>
</Country>
<Country>
<Name>Srilanka</Name>
<Capital>Colombo</Capital>
</Country>
</Countries>
I have stored it in BaseX, an XML database. If I had stored this data in a relational database, I would have written a simple SELECT statement to retrieve it from the table. For example:
select name, capital from country
and got back both columns for every row. How can the same be done using XQuery?
In a relational database (which you graphically describe as a "plain database") every query takes tables as its input and produces a table as its output. In an XML database, the input is XML and the output is XML. So you need to describe the XML you want to produce. Once you have done that, the answer to your question is yes: you can certainly write an XQuery to produce that output.
It seems you want a sequence of all Name and Capital elements:
/Countries/Country/(Name|Capital)
The result produced by this query is:
<Name>India</Name>
<Capital>New Delhi</Capital>
<Name>USA</Name>
<Capital>Washington DC</Capital>
<Name>England</Name>
<Capital>London</Capital>
<Name>Japan</Name>
<Capital>Tokyo</Capital>
<Name>Srilanka</Name>
<Capital>Colombo</Capital>
If you want the output without the element markup, try this:
for $x in doc("Country")/Countries/Country
return string-join( ($x/Name, $x/Capital), " - ")
The output will be -
`India - New Delhi USA - Washington DC England - London Japan - Tokyo Srilanka - Colombo`
If you want each result on its own line, you can wrap it in an element:
for $x in doc("Country")/Countries/Country
return <li>{string-join( ($x/Name, $x/Capital), " - ")}</li>
Instead of <li> you can use any other relevant tag.
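For comparison with the SQL projection in the question, here is a minimal Python sketch of the same idea. It uses only the standard library's xml.etree.ElementTree, not BaseX or XQuery, and assumes the document is available as a string (shortened here to three countries). The names xml_text and rows are illustrative.

```python
import xml.etree.ElementTree as ET

# Assumption: a shortened copy of the question's document, held in a string.
xml_text = """
<Countries>
  <Country><Name>India</Name><Capital>New Delhi</Capital></Country>
  <Country><Name>USA</Name><Capital>Washington DC</Capital></Country>
  <Country><Name>Japan</Name><Capital>Tokyo</Capital></Country>
</Countries>
""".strip()

root = ET.fromstring(xml_text)

# One (name, capital) tuple per <Country>, similar in spirit to
# "select name, capital from country".
rows = [
    (country.findtext("Name"), country.findtext("Capital"))
    for country in root.findall("Country")
]

for name, capital in rows:
    print(f"{name} - {capital}")
```

Each tuple plays the role of one result row; the loop prints it in the same "name - capital" shape as the string-join query.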
I have an XML file with the following entries:
<country>
<province id="prov-cid-cia-Greece-3" country="GR">
<name>Attiki</name>
<area>3808</area>
<population>3522769</population>
<city id="cty-Greece-Athens" is_country_cap="yes" country="GR" province="prov-cid-cia-Greece-3">
<name>Athens</name>
<longitude>23.7167</longitude>
<latitude>37.9667</latitude>
<population year="81">885737</population>
<located_at watertype="sea" sea="sea-Mittelmeer"/>
</city>
</province>
</country>
However, there are also city nodes that do not have a province as a parent:
<country>
<city id="stadt-Shkoder-AL-AL" country="AL">
<name>Shkoder</name>
<longitude>19.2</longitude>
<latitude>42.2</latitude>
<population year="87">62000</population>
<located_at watertype="lake" lake="lake-Skutarisee"/>
</city>
</country>
In short, I want to select all city nodes, but my current query selects only cities that do not have a province as a parent:
query = f"//country/city[@is_country_cap = \"yes\" and ./located_at[@watertype]]/name/text()"
How can I also include the /country/province/city path in my query? I have tried:
query = f"//country/(city | ./province/city)[@is_country_cap = \"yes\" and ./located_at[@watertype]]/name/text()"
without success; I get an error.
You can match for all <city> elements that have a parent of <country> or <province>. Then, in a second predicate, add your other requirements like this:
//city[parent::country or parent::province][@is_country_cap = 'yes' and located_at[@watertype]]/name
Or, in your language:
query = f"//city[parent::country or parent::province][@is_country_cap = \"yes\" and located_at[@watertype]]/name/text()"
Maybe this is of some help to you.
Your mistake was using the | operator instead of the keyword or. In XPath, the | operator computes the union of node-sets; it is not a logical OR as in C.
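The same selection can be sketched in Python with the standard library. This is only an illustration under stated assumptions: the question does not say which XPath library it uses, and xml.etree.ElementTree supports only a small XPath subset (no parent:: axis, no | operator), so the two parent shapes are collected with two separate paths. The <world> wrapper element and the trimmed sample data are hypothetical.

```python
import xml.etree.ElementTree as ET

# <world> is a hypothetical wrapper so both fragments fit in one document.
xml_text = """
<world>
  <country>
    <province id="prov-cid-cia-Greece-3" country="GR">
      <name>Attiki</name>
      <city id="cty-Greece-Athens" is_country_cap="yes" country="GR">
        <name>Athens</name>
        <located_at watertype="sea" sea="sea-Mittelmeer"/>
      </city>
    </province>
  </country>
  <country>
    <city id="stadt-Shkoder-AL-AL" country="AL">
      <name>Shkoder</name>
      <located_at watertype="lake" lake="lake-Skutarisee"/>
    </city>
  </country>
</world>
""".strip()

root = ET.fromstring(xml_text)

# Gather cities directly under a country and cities under a province.
cities = root.findall("./country/city") + root.findall("./country/province/city")

# Keep capitals that have a <located_at> child carrying a watertype attribute.
capitals_on_water = [
    city.findtext("name")
    for city in cities
    if city.get("is_country_cap") == "yes"
    and city.find("located_at[@watertype]") is not None
]
print(capitals_on_water)
```

Filtering in Python after a simple path keeps the sketch within what ElementTree supports; with a full XPath engine, the single expression given above does the same job.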
I have an XML of the form:
<articleslist>
<articles>
<originalId>507948</originalId>
<title>Hogan Lovells Training Contract</title>
<slug>hogan-lovells-training-contract</slug>
<metaTitle>Hogan Lovells Training Contract</metaTitle>
<metaDescription>Find out about the Hogan Lovells Training Contract and Application Process</metaDescription>
<language>en</language>
<disableAds>false</disableAds>
<shortUrl>false</shortUrl>
<category_slug>law</category_slug>
<subcategory_slug>industry</subcategory_slug>
<updatedAt>2021-03-15T18:38:51.058+00:00</updatedAt>
<createdAt>2018-11-29T06:42:51.665+00:00</createdAt>
</articles>
</articleslist>
I'm able to select the row values with the XPATH //articles.
How can I select the child properties of articles (i.e. the column headings), so I get back a list of the form:
originalId
title
slug
etc...
Depends on your XPath version.
In XPath 2.0 it's simply //articles/*/name()
In 1.0 it's not possible because there's no such data type as a "sequence of strings". You would have to return the set of elements as //articles/*, and then extract their names in the calling program.
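As a sketch of "extract their names in the calling program", here is how it might look in Python with the standard library's xml.etree.ElementTree. The sample is a shortened copy of the question's XML, and column_names is an illustrative name.

```python
import xml.etree.ElementTree as ET

# Assumption: a shortened copy of the question's document.
xml_text = """
<articleslist>
  <articles>
    <originalId>507948</originalId>
    <title>Hogan Lovells Training Contract</title>
    <slug>hogan-lovells-training-contract</slug>
    <language>en</language>
  </articles>
</articleslist>
""".strip()

root = ET.fromstring(xml_text)

# Select the child elements (like //articles/*), then read each element's name.
column_names = [child.tag for child in root.findall(".//articles/*")]
print(column_names)
```

The path returns elements, and the host language turns them into a list of strings, which is the step XPath 1.0 itself cannot express.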
I have posted sample XML and the expected output below. Kindly help me get this result.
Sample XML
<root>
<A id="1">
<B id="2"/>
<C id="2"/>
</A>
</root>
Expected output:
<A id="1"/>
You can formulate this query in several ways:
Find elements that have a matching attribute, only descending all the time:
//*[@id=1]
Find the attribute, then ascend a step:
//@id[.=1]/..
Use the fn:id($id) function, provided the document is validated and the attribute is declared as an ID:
/id('1')
I don't think what you're after is possible in XPath alone. XPath cannot select a node without its children, so the result would always include the nodes B and C in your case.
You could achieve this with XQuery. I'm not sure this is what you want, but here is an example that builds a new node from an existing node stored in the $doc variable.
declare variable $doc := <root><A id="1"><B id="2"/><C id="2"/></A></root>;
element {fn:node-name($doc/*)} {$doc/*/@*}
The above returns <A id="1"></A>.
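The same idea, building a new element from the name and attributes of an existing one, can be sketched in Python with the standard library. This is an illustration of the technique, not the XQuery above; the names source and shallow are illustrative.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<root><A id="1"><B id="2"/><C id="2"/></A></root>')

# First element in document order whose id attribute is "1".
source = root.find(".//*[@id='1']")

# New element with the same name and a copy of the attributes, but no children.
shallow = ET.Element(source.tag, dict(source.attrib))

print(ET.tostring(shallow, encoding="unicode"))
```

The original tree is left untouched; only the copy lacks the B and C children.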
Is this what you are looking for?
//*[@id='1']/parent::* , similar to //*[@id='1']/..
If you want to verify that the parent is root:
//*[@id='1']/parent::root
https://en.wikipedia.org/wiki/XPath
If you need not just the parent but some enclosing element with a given attribute, read about axis specifiers and use the ancestor:: axis.
I am trying to build a simple search-engine using HtmlAgilityPack and Xpath with C# (.NET 4).
I want to find every node containing a user-defined search word, but I can't seem to get the XPath right.
For Example:
<HTML>
<BODY>
<H1>Mr T for president</H1>
<div>We believe the new president should be</div>
<div>the awsome Mr T</div>
<div>
<H2>Mr T replies:</H2>
<p>I pity the fool who doesn't vote</p>
<p>for Mr T</p>
</div>
</BODY>
</HTML>
If the specified searchword is "Mr T" I'd want the following nodes: <H1>, The second <div>, <H2> and the second <p>.
I have tried numerous variants of doc.DocumentNode.SelectNodes("//text()[contains(., "+ searchword +")]"); but I always seem to wind up with every single node in the entire DOM.
Any hints to point me in the right direction would be much appreciated.
Use:
//*[text()[contains(., 'Mr T')]]
This selects all elements in the XML document that have a text-node child which contains the string 'Mr T'.
This can also be written shorter as:
//text()[contains(., 'Mr T')]/..
This selects the parent(s) of any text node that contains the string 'Mr T'.
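To make "elements that have a text-node child containing the string" concrete, here is a Python sketch using only the standard library. It is not HtmlAgilityPack code. In xml.etree.ElementTree, an element's direct text nodes are its own .text plus the .tail of each of its children, so the function checks exactly those. The function name elements_with_text is illustrative, and the sample is the question's markup, which happens to be well-formed XML.

```python
import xml.etree.ElementTree as ET

html_text = """<HTML><BODY>
<H1>Mr T for president</H1>
<div>We believe the new president should be</div>
<div>the awsome Mr T</div>
<div>
<H2>Mr T replies:</H2>
<p>I pity the fool who doesn't vote</p>
<p>for Mr T</p>
</div>
</BODY></HTML>"""


def elements_with_text(root, needle):
    """Return elements that have a direct text node containing needle."""
    hits = []
    for element in root.iter():
        # Direct text nodes: the element's leading text and each child's tail.
        own_text = [element.text or ""] + [child.tail or "" for child in element]
        if any(needle in chunk for chunk in own_text):
            hits.append(element)
    return hits


root = ET.fromstring(html_text)
matching_tags = [element.tag for element in elements_with_text(root, "Mr T")]
print(matching_tags)
```

Because only direct text nodes are inspected, ancestors such as BODY or the wrapping div are not reported merely because a descendant contains the word.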
In XPath, to find elements containing a specific keyword, use the following format (where keyword is the word you want to search for):
//*[text()[contains(., 'keyword')]]
Use the same format in C#, where keyword is a string variable:
doc.DocumentNode.SelectNodes("//*[text()[contains(., '" + keyword + "')]]");
Use the following:
doc.DocumentNode.SelectNodes("//*[contains(text()[1], '" + searchword + "')]")
This selects all elements (*) whose first text child (text()[1]) contains the searchword.
Case-insensitive solution:
var xpathForFindText =
"//*[text()[contains(translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), '" + lowerFocusKwd + "')]]";
var result=doc.DocumentNode.SelectNodes(xpathForFindText);
Note: lowerFocusKwd must not contain an apostrophe ('), because that would make the XPath expression malformed.
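One common workaround for the quoting problem, sketched here in Python for brevity, is to build the XPath string literal with a helper. XPath 1.0 has no escape sequence inside string literals, so a value containing both quote characters has to be assembled with concat(). The helper name xpath_literal is hypothetical; the same logic could be ported to C#.

```python
def xpath_literal(value: str) -> str:
    """Return an XPath 1.0 expression that evaluates to the given string."""
    if "'" not in value:
        return f"'{value}'"
    if '"' not in value:
        return f'"{value}"'
    # Both quote characters occur: split on apostrophes and stitch the pieces
    # together with concat(), writing each apostrophe as the literal "'".
    pieces = []
    for index, part in enumerate(value.split("'")):
        if index > 0:
            pieces.append('"\'"')
        if part:
            pieces.append(f"'{part}'")
    return "concat(" + ", ".join(pieces) + ")"


keyword = "Mr T's"
query = "//*[text()[contains(., " + xpath_literal(keyword) + ")]]"
print(query)
```

The helper returns the complete literal including its quotes, so the surrounding query string should not add quotes of its own.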
I am working on extracting text from HTML documents and storing it in a database. I am using the webharvest tool to extract the content, but I am stuck at one point. Inside webharvest I use an XQuery expression to extract the data. The HTML document that I am parsing is as follows:
<td><a name="hw">HELLOWORLD</a>Hello world</td>
I need to extract "Hello world" text from the above html script.
I have tried extracting the text in this fashion:
$hw := data($item//a[@name='hw']/text())
However what I always get is "HELLOWORLD" instead of "Hello world".
Is there a way to extract "Hello world"? Please help.
What if I want to do it this way:
<td>
<a name="hw1">HELLOWORLD1</a>Hello world1
<a name="hw2">HELLOWORLD2</a>Hello world2
<a name="hw3">HELLOWORLD3</a>Hello world3
</td>
I would like to extract the text Hello world2 that lies between hw2 and hw3. I would rather not use text()[3]; is there some way to extract the text between /a[@name='hw2'] and /a[@name='hw3']?
Your xpath is selecting the text of the a nodes, not the text of the td nodes:
$item//a[@name='hw']/text()
Change it to this:
$item[a/@name='hw']/text()
Update (following comments and update to question):
This XPath selects the second text node from an $item that has an a child whose name attribute is set to hw:
$item[a/@name='hw']//text()[2]
I would not want to use text()[3] but
is there some way I could extract the
text out between /a[@name='hw2'] and
/a[@name='hw3'].
If there is just one text node between the two <a> elements, then the following would be quite simple:
/a[@name='hw3']/preceding::text()[1]
If there is more than one text node between the two elements, then you need to express the intersection of all text nodes following the first element with all text nodes preceding the second element. The formula for the intersection of two node-sets (known as the Kaysian method of intersection) is:
$ns1[count(.|$ns2) = count($ns2)]
So, just replace in the above expression $ns1 with:
/a[@name='hw2']/following-sibling::text()
and $ns2 with:
/a[@name='hw3']/preceding-sibling::text()
Lastly, if you really have XQuery (or XPath 2), then this is simply:
/a[@name='hw2']/following-sibling::text()
intersect
/a[@name='hw3']/preceding-sibling::text()
This handles your expanded case, while letting you select by attribute value rather than position:
let $item :=
<td>
<a name="hw1">HELLOWORLD1</a>Hello world1
<a name="hw2">HELLOWORLD2</a>Hello world2
<a name="hw3">HELLOWORLD3</a>Hello world3
</td>
return $item//node()[./preceding-sibling::a/@name = "hw2"][1]
This gets the first node that has a preceding-sibling "a" element with a name attribute of "hw2".
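For readers working outside XQuery, the same "text right after a given anchor" lookup can be sketched in Python with the standard library. In xml.etree.ElementTree, the text that follows an element's end tag, up to the next sibling, is stored in that element's .tail attribute, so no positional index is needed. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

td = ET.fromstring("""<td>
<a name="hw1">HELLOWORLD1</a>Hello world1
<a name="hw2">HELLOWORLD2</a>Hello world2
<a name="hw3">HELLOWORLD3</a>Hello world3
</td>""")

# Select the anchor by attribute value rather than by position.
anchor = td.find("a[@name='hw2']")

# .tail holds the text between this element's end tag and the next sibling.
text_after_hw2 = (anchor.tail or "").strip()
print(text_after_hw2)
```

This covers the case of a single text node between the two anchors; if other nodes could sit between them, the intersection approach described above is the more general tool.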