eXist-db: XQuery and documents with XInclude

I'm embarking on a new project with eXist. We'll be storing a few hundred TEI XML documents that represent manuscripts. A number of the things we want to capture are repetitive, mainly people and places. My colleague has asked the TEI community about strategies for representing what we want to capture, and using XInclude was suggested as a way of reducing duplication.
I've had a quick play with adding an XInclude to a document, and the serialized XML does render the included XML file. However, the included text was missing from the results of an XQuery. I notice in the eXist docs (http://exist-db.org/exist/apps/doc/xinclude.xml) that:
eXist-db expands XIncludes at serialization time, which means that the
query engine will see the XInclude tags before they are expanded. You
therefore cannot query across XIncludes - unless you create your own
code (e.g. an XQuery function) for it. We would certainly like to
support queries over xincluded content in the future though.
What is the best practice for querying files that use XInclude?
I'm wondering whether I should have a 'job' that serializes the source TEI XML files to expand the XIncludes and stores the expanded files in a separate collection. In that case, would file:serialize be the correct function for this task?
We are at the start of the project, so any advice is appreciated.

Can you describe what kind of query you tried that was missing the text?
Generally, since the files referenced via XInclude are well-formed XML documents, you can use collections (folders) to organise your queries in eXist-db. So instead of for $search in doc("mydoc.xml") you could use for $search in collection('/app/mydata')/*
A more elaborate answer would follow the attribute of the unexpanded XInclude statement in the source document and find the matching element in the target, but it's difficult to abstract that without a concrete MWE.
Have you tried creating a temporary, expanded fragment in a let clause and querying that instead of the stored XML?
Beware of namespaces!
Hope this helps, and greetings to Sebastiaan.
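The "expand first, then query" idea can be illustrated outside eXist. What follows is only a sketch in Python rather than XQuery, using the standard library's xml.etree.ElementInclude; the file names, the in-memory DOCS dictionary and the TEI content are all made up for illustration. It shows that a query over the unexpanded tree finds nothing, while the same query over the expanded copy does.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

XI = "http://www.w3.org/2001/XInclude"
TEI = "http://www.tei-c.org/ns/1.0"

# Hypothetical stand-in for a collection: document name -> XML source.
DOCS = {
    "people.xml": f'<person xmlns="{TEI}" xml:id="p1"><persName>Ada</persName></person>',
    "ms1.xml": f'<TEI xmlns="{TEI}" xmlns:xi="{XI}"><xi:include href="people.xml"/></TEI>',
}

def loader(href, parse, encoding=None):
    """Resolve an xi:include href against the in-memory DOCS dictionary."""
    if parse == "xml":
        return ET.fromstring(DOCS[href])
    return DOCS[href]

def expand(name):
    """Return a parsed copy of the document with its XIncludes expanded."""
    root = ET.fromstring(DOCS[name])
    ElementInclude.include(root, loader=loader)
    return root

def person_names(root):
    """Namespace-aware query: the text of every tei:persName in the tree."""
    return [e.text for e in root.iter(f"{{{TEI}}}persName")]

unexpanded = person_names(ET.fromstring(DOCS["ms1.xml"]))
expanded = person_names(expand("ms1.xml"))
```

Note that the query has to use the TEI namespace, which is the same pitfall the answer warns about. In eXist itself, the counterpart would be your own XQuery function that follows each xi:include and substitutes the referenced content, as the documentation quoted in the question suggests.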

Related

What is the essential difference between Document and Collection in YAML syntax?

Warning: This is more a philosophical question than a practical one, but I think it is worth asking and answering in a practical context (a forum like Stack Overflow rather than the Software Engineering Stack Exchange site), given how YAML is actually used and how its specification has evolved and gained features over time. Let's ask:
As opposed to formats/languages/protocols such as JSON, the YAML format allows you (according to this link, that seems pretty official, or at least accurate and reliable source to understand the YAML specification) to embed multiple 'Documents' within one file/stream, using the three-dashes marking ("---").
If so, it's hard to ignore the fact that the concept of a 'Document' in YAML is no longer an external definition or "meta" directive that helps the human or parser organize multiple distinct documents alongside each other (similar to the way file systems define the concept of a "file" to organize different files, while each file does not necessarily know that it is a file, or that it is part of a file system that wraps it, AFAIK).
However, when YAML allows multi-Document files that gather collections of Documents in a single YAML file (perhaps in a way analogous to HTTP pipelining), the concept of a Document de facto receives a new, wider definition as part of the YAML grammar and its productions, and not just as an assistive concept that helps describe the specification.
If so, with Document being part of the language itself, what is the added value of this data structure compared to the existing, familiar and well-used data structure of a Collection (an array of items)?
I'm asking because I've seen in this link (here) a snippet (the second example) which describes a YAML stream that is actually a collection of logs. For some reason, the author of the example chose to present each log as a separate "Document" (separated with three dashes), gathered together in the same YAML file, instead of writing a file that has a "Collection" of logs represented as an array. Why did he choose to do this? Is this choice fitting, correct, ideal?
I can speculate that the added value of the distinction between a Document and a Collection becomes relevant when using more advanced features of the YAML grammar, such as Anchors, Tags and References. I guess every Document provides a guarantee that all these identifiers will form a unique set, with no collisions or duplicates among them. Am I right? And if so, is this the only advantage, or are there more justifications for the existence of these two rather similar data structures?
My best guess for now is to see a Document as a "meta"-Collection that is stricter and lacks high-level logic, or to see the two as different layers of collection schemes. Is this a correct, accurate way to view it?
And even if I am right, why, in the above example (the logs document from the link), where there is no use or expected use of duplicates, collisions, identifiers/anchors or compound structures at all, does the author still choose to represent the collection's items as separate documents? Is this just a poorly chosen example? Or maybe I'm missing something, and this is a redundancy in the specification, or syntactic sugar that evolved out of practical needs?
Because the example appears on a website that looks serious, with official information written by professionals who have dealt with the essence of the language and the theory and philosophy behind it (as opposed to practical uses in the wild), and in light of the other meticulous examples I have seen there, I prefer not to assume that the example is simply imperfect; there may be a good reason to write it this way rather than another in this specific case.
First, let's look at the technical difference between the list of documents in a YAML stream and a YAML sequence (which is a collection of ordered items). For this, I'll discuss YAML tags, which are an advanced feature so I'll provide a quick overview:
YAML nodes can have tags, such as !!str (the official tag for string values) or !dice (a local tag that can be interpreted by your application but is unknown to others). This applies to all nodes: scalars, mappings and sequences. Nodes that do not have such a tag set in the source will be assigned the non-specific tag ?, except for quoted scalars, which get ! instead. These non-specific tags are later resolved to specific tags, which determines the kind of data structure the node will be deserialized into.
YAML implementations in scripting languages, such as PyYAML, usually only implement resolution by looking at the node's value. For example, a scalar node containing true will become a boolean value, 42 will become an integer, and droggeljug will become a string.
YAML implementations for languages with static types, however, do this differently. For example, assume you deserialize your YAML into a Java class:
public class Config {
    String name;
    int count;
}
Assume the YAML is:
name: 42
count: five
The 42 will become a String despite the fact that it looks like a number. Likewise, five will generate an error because it is not a number; it won't be deserialized into a string. This means that the path to the node, not its content, defines how it will be deserialized.
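The two resolution strategies can be contrasted with a toy example. This is only a sketch in Python, not a YAML parser and not how any particular library is implemented; the SCHEMA dictionary is a hypothetical stand-in for the Config class, and the function names are made up.

```python
def resolve_by_content(text):
    """Guess a type from the scalar's text alone, as dynamic-language loaders tend to do."""
    if text in ("true", "false"):
        return text == "true"
    try:
        return int(text)
    except ValueError:
        return text

# Hypothetical target schema mirroring the Config class: key -> expected type.
SCHEMA = {"name": str, "count": int}

def resolve_by_path(key, text):
    """Let the position of the node (here, its key) decide the type."""
    return SCHEMA[key](text)

by_content = {"name": resolve_by_content("42"), "count": resolve_by_content("five")}
name_by_path = resolve_by_path("name", "42")

try:
    resolve_by_path("count", "five")
    count_error = None
except ValueError as exc:
    count_error = exc  # "five" cannot become an int, so the whole load fails
```

Content-based resolution turns 42 into an integer and five into a string, whereas path-based resolution keeps 42 as a string and rejects five.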
What does this have to do with documents? Well, the YAML spec says:
Resolving the tag of a node must only depend on the following three parameters: (1) the non-specific tag of the node, (2) the path leading from the root to the node and (3) the content (and hence the kind) of the node.
So, the technical difference is: If you put your data into a single document with a collection at the top, the YAML processor is allowed to take into account the position of the data in the top-level collection when resolving a tag. However, when you put your data in different documents, the YAML processor must not depend on the position of the document in the YAML stream for resolving the tag.
What does this mean in practice? It means that YAML documents are structurally disjoint from one another. Whether a YAML document is valid or not must not depend on any preceding or succeeding documents. Consequently, even when deserialization runs into a semantic problem (such as with the five above) in one document, a following document may still be deserialized successfully.
The goal of this design is to be able to concatenate arbitrary YAML documents together without altering their semantics: A middleware component may, without understanding the semantics of the YAML documents, collect multiple streams together or split up a single stream. As long as they are syntactically correct, stream splitting and merging are sound operations that do not invalidate a YAML document even if another document is structurally invalid.
This design primarily focuses on sending and receiving data over networks. Of course, nowadays YAML is primarily used as a configuration language. This is why this feature is seldom used and of rather little importance.
Edit: (Reply to comment)
What about edge cases like a string-tagged Document that starts with a folded string, making even the following "---" and "..." just characters of the overall string?
That is not the case, see rules l-bare-document and c-forbidden. A line containing un-indented ... not followed by non-whitespace will always end a document if one is open.
Moreover, ... doesn't do anything if no document is open. This ensures that a stream merger can always append ... to a document to ensure that the current document is closed, but no additional one is created.
--- has widely been adopted as separator between YAML documents (and, perhaps more prominently, between YAML front matter and content in tools like Jekyll) where ... would have been more appropriate, particularly in Jekyll. This gives the false impression that --- should be used by tooling to separate documents, when in reality ... is the syntactic element designed for that use-case.
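The merging behaviour described above can be sketched as a few lines of text handling. This is a hedged illustration in Python with a made-up function name, not part of any YAML library: it treats each input as opaque text and only appends the document end marker, relying on the point made above that ... closes an open document and creates no new one.

```python
def merge_streams(streams):
    """Concatenate YAML streams without parsing them.

    Each stream is terminated with a line containing only "...", so whatever
    document was open at the end of one input is closed before the next begins.
    """
    parts = []
    for stream in streams:
        if not stream.endswith("\n"):
            stream += "\n"  # the marker must start on its own, un-indented line
        parts.append(stream + "...\n")
    return "".join(parts)

merged = merge_streams(["level: info\nmsg: started\n", "level: warn\nmsg: disk low"])
```

The merger never needs to know what the documents mean, which is the property the answer attributes to the design.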

Partial Indexing of an XML file (Bleve)

I am evaluating a couple different libraries to see which one will best fit what I need.
Right now I am looking at Bleve, but I am happy to use any library.
I am looking to index full files, except for specific ones which are in XML format. For those I only want Bleve to index specific tags, as most of the tags are worthless to search. I am trying to evaluate whether this is possible but, being new to Bleve, I am not sure what part I need to customize.
The documentation is very good, but I can't seem to find this answer. All I need is an explanation with keywords and steps; no code is required. I just need a push, as I have spent hours spinning my wheels with Google searches and I am getting nowhere.
There are probably many ways to approach this. Here's one.
Bleve indexes documents which are collections of key/value metadata pairs.
In your case, a document could be represented by 2 key/value pairs: name of .xml file (to uniquely identify the document) and content of the file.
type Doc struct {
    Name string
    Body string
}
The issue is that the body is XML and Bleve doesn't support XML out of the box.
A way to address it would be to pre-process the XML file by stripping unwanted tags and content. You can do this using the encoding/xml package from the standard library.
For an example of a similar task, see the code at https://github.com/blevesearch/fosdem-search/
There they index a file in a custom format (https://github.com/blevesearch/fosdem-search/blob/master/fosdem.ical) by parsing it into a format they can submit to Bleve for indexing (https://github.com/blevesearch/fosdem-search/blob/master/ical.go).
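The pre-processing step can be sketched independently of Bleve. The following uses Python's xml.etree.ElementTree purely for illustration (in a Go program the same role would be played by encoding/xml); the tag names in WANTED and the sample record are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical set of tags worth searching; everything else is dropped.
WANTED = {"title", "summary"}

def extract_body(xml_text, wanted=WANTED):
    """Return the text of the wanted elements only, joined into one string."""
    root = ET.fromstring(xml_text)
    parts = []
    for elem in root.iter():
        if elem.tag in wanted:
            text = "".join(elem.itertext()).strip()
            if text:
                parts.append(text)
    return " ".join(parts)

def make_doc(name, xml_text):
    """Build the Name/Body pair that would be handed to the indexer."""
    return {"Name": name, "Body": extract_body(xml_text)}

SAMPLE = (
    "<record><id>17</id><title>Go search</title>"
    "<meta>internal</meta><summary>Indexing <b>XML</b> files</summary></record>"
)
doc = make_doc("record17.xml", SAMPLE)
```

Only the reduced Body reaches the index, so the indexer itself needs no knowledge of XML.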

Need help rewriting XQuery to avoid expanded tree cache full error in MarkLogic

I am new to XQuery and MarkLogic.
I am trying to update documents in MarkLogic and get the expanded tree cache full error.
Just to get the work done I have increased the expanded tree cache but that is not recommended.
I would like to tune this query so that it does not need to simultaneously cache as much XML.
Here is my query
I have uploaded my query as an image because it did not look good when I pasted it into the editor. If anyone knows a better way, please suggest it.
Thanks in advance.
Expanded tree cache errors can be caused by executing queries that select too many XML nodes at once. In your example, this is likely the culprit: /tx:AttVal[tx:AttributeName/text()=$attributeName].
It's possible that calling text() is the source of your problem (and text() is probably not what you mean anyway; see this blog), causing MarkLogic to evaluate that function on all these nodes. Simply using /tx:AttVal[tx:AttributeName=$attributeName] may solve your problem.
Next, I would consider adding a path range index on /tx:AttVal/tx:AttributeName and querying those nodes using cts:search and cts:path-range-query. This will be substantially faster than plain XPath without a range index. It's also possible to use XPath with a range index: MarkLogic will automatically optimize the XPath expression to use the range index. However, there can be reasons why it doesn't optimize the expression correctly, and you would want to check that using xdmp:plan.
Also note that the general best practice recommendation for XML in MarkLogic is to use "semantic XML". E.g., when you mean an attribute, use an attribute: <some-node AttributeName="AttVal"/>. MarkLogic's indexes are optimized out of the box for semantic XML design. However, if you have no option but to work with XML that isn't, then that's what path range indexes were designed for.
I've just solved exactly this scenario. There are two things I did:
I put the node-replace and node-insert type calls (that is, any calls that modify the XML structure) into a separate module and then called that module using xdmp:invoke, passing in any parameters required, like this:
let $update := xdmp:invoke("/app/lib/update-attribute-node.xqy",
    (xs:QName("newValue"), $new),
    <options xmlns="xdmp:eval">
        <modules>{xdmp:modules-database()}</modules>
    </options>)
The reason this works is that the call to xdmp:invoke happens in its own transaction, and once it completes, the memory is cleared up. If you don't do this, then each time you call the update or insert function it will not actually do the write until the end, in a single transaction, meaning your memory will fill up pretty quickly.
Any time I needed to loop over paths in MarkLogic (or documents or whatever they are called - I've only been using MarkLogic for a few days) and there are a large number of them I processed them only a few at a time like below. I came up with an elaborate way of skipping and taking only a batch of documents at a time, but you can do it in any number of ways.
let $whatever := xdmp:directory("/whatever/")[$start to $end]
I also put this into a separate module so that it is processed immediately and not in a single transaction.
Putting all expensive calls into separate modules and taking only a subset of large data sets at a time helped me solve my expanded tree cache full errors.
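The "skip and take" bookkeeping behind $start and $end is simple to get wrong at the last batch. As a rough sketch (in Python, with a made-up function name, and assuming 1-based inclusive positions as in an XQuery predicate like [$start to $end]), the ranges could be computed like this:

```python
def batch_ranges(total, size):
    """Return 1-based inclusive (start, end) pairs covering `total` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [
        (start, min(start + size - 1, total))
        for start in range(1, total + 1, size)
    ]

ranges = batch_ranges(10, 4)
```

Each pair would then be passed to the separately invoked module, so that only one batch of documents is held in memory per transaction.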

How to uniquely identify two objects on the same page having the same URL

I have two objects on the same page but in different locations (tabs), and I want to verify each of those objects separately.
I can't uniquely identify either of the objects because they have the same properties.
These objects clearly are unique to a point because they have completely different text, which means that you will be able to create an object description that matches only one of them. My suggestion would be to look for the object by using its text property: one of them will always have "Top Ranking"; for the other you will need to turn the text into a regular expression, something like "Participants \(\d+\)" (the parentheses are literal characters in the text, so they must be escaped).
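The escaping matters because unescaped parentheses form a group rather than matching literal characters. A quick check of the pattern, sketched here in Python's re module (the testing tool's own regular expression dialect may differ in details, and the sample texts are assumed):

```python
import re

# Literal parentheses around the participant count must be escaped.
PARTICIPANTS = re.compile(r"Participants \(\d+\)")

def is_participants_tab(text):
    """True only for text of the form 'Participants (<number>)'."""
    return PARTICIPANTS.fullmatch(text) is not None

# Without escaping, (\d+) is a capture group and the literal "(" is never matched.
unescaped = re.fullmatch(r"Participants (\d+)", "Participants (12)")
```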
I am assuming that this next answer is unlikely to be possible so saved it for after the answer you are likely to use but the best solution would of course be to get someone with access to give these elements ids for you to search for. This will in the long term be much easier for you to maintain and not using text will allow this test to run in any language.
Manaysah, do these objects have different indexes? Use the object spy and determine which index they have, the ordinal identifier index may be a solution to your problem. You could also try adding an innertext object property if possible, using a wildcard for the number inside the () as it appears dynamic.
Try using XPath for the objects; the XPath will definitely be different.

Dynamically updating RDF File

Is it possible to update an RDF file dynamically from user-generated input through a web form? The exact scenario would be SKOS concept definitions being created and updated through user input to HTML forms.
I was considering XPath, but is there a better, generally accepted or best-practice way of doing this kind of thing?
For this type of thing there are IMO two approaches:
1 - Using Named Graphs in a Triple Store
Rather than editing an actual fixed file, you use a Graph which is stored as a named graph in a Triple Store that supports triple-level updates (i.e. you can change individual Triples in a Graph). For example, you could use a store like Virtuoso or a Jena-based store (Jena SDB/TDB) to do this; basically any store that supports the SPARUL language or has its own equivalent.
2 - Using a fixed RDF file and altering it
From your mention of XPath, I assume that you are intending to store your file as RDF/XML. While XPath would potentially work for this, it's going to be dependent on the exact serialization of your file and may get very complex. If your app is going to allow users to submit and edit their own files, then there'll be no guarantees about how the RDF has been serialized into RDF/XML, so your XPath expressions might not work. If you control all the serialization and processing of the RDF/XML, then you can keep it in a format that your XPath will work on.
From my point of view, the simplest way to take this approach is to load the file into memory using an appropriate RDF library, manipulate it in memory, and then persist the whole thing back to disk when the user is done (or at regular intervals, or whatever is appropriate to your application).

Resources