Need help rewriting XQuery to avoid expanded tree cache full error in MarkLogic

I am new to XQuery and MarkLogic.
I am trying to update documents in MarkLogic and get the expanded tree cache full error (XDMP-EXPNTREECACHEFULL).
Just to get the work done I have increased the expanded tree cache size, but that is not recommended.
I would like to tune this query so that it does not need to simultaneously cache as much XML.
Here is my query.
I have uploaded my query as an image because it was not so pretty when I pasted it into the editor. If anyone knows a better way, please suggest.
Thanks in advance.

Expanded tree cache errors can be caused by executing queries that select too many XML nodes at once. In your example, this is likely the culprit: /tx:AttVal[tx:AttributeName/text()=$attributeName].
It's possible that calling text() is the source of your problem (and text() is probably not what you mean anyway - see this blog), since it forces MarkLogic to evaluate that function on all of those nodes; simply using /tx:AttVal[tx:AttributeName=$attributeName] may solve your problem.
Next I would consider adding a path range index on /tx:AttVal/tx:AttributeName and querying those nodes using cts:search with cts:path-range-query. This will be substantially faster than plain XPath without a range index. It's also possible to use XPath with a range index: MarkLogic will automatically optimize the XPath expression to use the range index; however, there can be reasons it doesn't optimize the expression correctly, and you would want to check that using xdmp:plan.
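A rough sketch of that approach (the tx namespace URI and attribute value below are placeholders, and this assumes a string-typed path range index on /tx:AttVal/tx:AttributeName has already been configured, with tx registered as a path namespace):

declare namespace tx = "http://example.com/tx";  (: placeholder URI :)

let $attributeName := "color"  (: placeholder value :)
return
  cts:search(fn:doc(),
    cts:path-range-query("/tx:AttVal/tx:AttributeName", "=", $attributeName))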
Also note that the general best-practice recommendation for XML in MarkLogic is to use "semantic XML". E.g., when you mean an attribute, use an attribute: <some-node AttributeName="AttVal"/>. MarkLogic's indexes are optimized out of the box for semantic XML design. However, if you have no option but to work with XML that isn't semantic, that's what path range indexes were designed for.
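For instance, contrast the generic name/value markup from your query with a hypothetical semantic equivalent (the tx:AttributeValue element name here is a guess):

<tx:AttVal>
  <tx:AttributeName>color</tx:AttributeName>
  <tx:AttributeValue>red</tx:AttributeValue>
</tx:AttVal>

(: a hypothetical semantic equivalent :)
<item color="red"/>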

I've just solved exactly this scenario. There are two things I did.
I put the node-replace and node-insert type calls (that is, any calls that modify the XML structure) into a separate module, and then called that module using xdmp:invoke, passing in any parameters required, like this:
let $update := xdmp:invoke("/app/lib/update-attribute-node.xqy",
  (xs:QName("newValue"), $new),
  <options xmlns="xdmp:eval"><modules>{xdmp:modules-database()}</modules></options>)
The reason this works is that the call to xdmp:invoke happens in its own transaction, and once it completes, the memory is cleared up. If you don't do this, then each time you call the update or insert function it will not actually do the write until the end, in a single transaction, meaning your memory will fill up pretty quickly.
Any time I needed to loop over documents in MarkLogic (I've only been using MarkLogic for a few days) and there were a large number of them, I processed them only a few at a time, like below. I came up with an elaborate way of skipping and taking only a batch of documents at a time, but you can do it in any number of ways.
let $whatever := xdmp:directory("/whatever/")[$start to $end]
I also put this into a separate module so that it is processed immediately and not in a single transaction.
Putting all expensive calls into separate modules and taking only a subset of large data sets at a time helped me solve my expanded tree cache full errors.
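Putting the two together, here is a rough, hypothetical sketch of such a batching driver; the module path, directory URI, and batch size are all made up, and the invoked module is assumed to declare matching external variables and do the xdmp:directory(...)[$start to $end] selection itself:

let $batch-size := 100
let $total := xdmp:estimate(cts:search(fn:doc(), cts:directory-query("/whatever/")))
for $batch in 1 to xs:integer(fn:ceiling($total div $batch-size))
let $start := ($batch - 1) * $batch-size + 1
let $end := fn:min(($batch * $batch-size, $total))
return
  (: different-transaction isolation commits each batch separately :)
  xdmp:invoke("/app/lib/process-batch.xqy",
    (xs:QName("start"), $start, xs:QName("end"), $end),
    <options xmlns="xdmp:eval"><isolation>different-transaction</isolation></options>)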

Related

How can I improve Marklogic 7 performance on the following: /*[fn:name()="something"]

I have a basic query:
/*[fn:name()="something"]
(1) MarkLogic 7 is taking multiple seconds; is there an index that I can add to make this query faster?
(2) Which in memory limits should be increased to improve performance?
(3) Are there other ways to improve the performance with a different query but get exactly the same result?
Try using fn:node-name instead. I believe that's optimized. You'll need to handle namespaces properly, and that's part of why it can be optimized while fn:name can't.
/*[fn:node-name(.)=fn:QName("","something")]
The following two XPaths should be exactly the same:
/theNameOfMyElement
and
/*[fn:name()="theNameOfMyElement"]
The latter adds an unnecessary and costly qualifier. First off, the * matches every element, and then the predicate has to be evaluated against each one instead of matching the named element directly. Several other problems exist with that approach.
If the first query is still taking a long time, use cts:search, which is much faster because it searches against indexes. The queries above can be written like this:
cts:search(/theNameOfMyElement, ())
Where the second parameter (empty sequence) can be a qualifying cts:query.
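For example, a hypothetical qualifying word query (the search term is made up):

cts:search(/theNameOfMyElement, cts:word-query("some term"))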
If namespaces are giving you fits, you can just do:
/*:theNameOfMyElement
/*[fn:name()="something"] seems like very bad practice to me. Use /something instead.
EDIT
After seeing that the other answer got accepted, I've been trying to think of what scenario you must be trying to solve if that solution worked and mine didn't. I'm still very certain that there is a faster way by just using XPath the way it was designed to work.
After some thought, I've decided your "real" scenario must either involve a dynamic element name, or you may be trying to see if the element name matches one of a sequence of names.
I have drawn up a sample, with its output provided below, that demonstrates how you could handle both without using the qualifier based on fn:node-name:
let $xml as element(sample) :=
  <sample>
    <wrapper>
      <product>
        <entry>
          <red>$1.00</red>
          <yellow>$3.00</yellow>
          <blue>$4.50</blue>
        </entry>
      </product>
    </wrapper>
  </sample>
let $type as xs:string := "product"
return $xml/wrapper/xdmp:unpath($type)/entry/(red|yellow)
(: Returns
<red>$1.00</red>
<yellow>$3.00</yellow>
:)
In addition to the other good suggestions, consider applying pagination. MarkLogic can identify interesting content from indexes fast, but pulling up actual content from disk is relatively slow. And sending over all results at once could mean trying to hold all results (potentially billions) in memory before sending a reply over the wire. Pagination allows pulling up results in batches, which keeps memory usage low, and potentially allows parallelization as well.
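A minimal sketch of that pattern, reusing the query above (page number and size are arbitrary):

let $page := 3
let $page-size := 50
let $start := ($page - 1) * $page-size + 1
let $end := $page * $page-size
(: cts:and-query(()) matches everything; the range predicate pulls only one page :)
return cts:search(/theNameOfMyElement, cts:and-query(()))[$start to $end]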
HTH!

In XQuery, how do I avoid deteriorating performance in a paged simple query?

I have an ML database with a few tens of thousands of documents in it, and a query that returns some simple calculated values for either all or a subset of those documents. The document count has grown to the point that the "all documents" option no longer reliably runs without timing out, and it is only going to get worse as the document count grows. The obvious solution is for the client application to use the other form and paginate the results. It's an offline batch process, so overall speed isn't an issue - we'd just like to keep individual requests sane.
The paged version of the query is very simple:
declare namespace ns = "http://some.namespace/here";
declare variable $fromCount external;
declare variable $toCount external;
<response> {
for $doc in (fn:doc()/ns:entity)[$fromCount to $toCount]
return
<doc> omitted for brevity </doc>
} </response>
The problem is that the query gets slower the further through the document set the requested page is, presumably because it has to load every document in order, check whether it's the right type, and iterate until it has found $fromCount ns:entity elements before it even begins building the response.
One wrinkle is that there are other types of document in the database, so just using fn:doc isn't a realistic option (although they are in different directories, so xdmp:directory() might be an option; something I'll look into).
There also isn't currently an index on the ns:entity element; would that help? It's always the root node of a document, and the documents are quite large, so I'm concerned about the size of the index. Also, (the slow part of) this query isn't interested in the value of the element, just that it exists.
I thought about using the search: API for its built-in paging, but it seems overkill for a query that is intended to match all documents; surely it's possible to manually construct the query that search:search() would build internally.
It seems like what I really need is an efficient list of all root nodes of a certain type in the database. Does MarkLogic maintain such a thing? If not, would an index solve the problem?
Edit: It turns out that the answer in my case is to use the xdmp:directory() option, since ML apparently stores a fast, in-memory list of all documents. Still, if there is a more general solution to the problem, it's bound to be of interest, so I'll leave the question here.
Your analysis is correct:
presumably because it has to load every document in order, check whether it's the right type, and iterate until it has found $fromCount ns:entity elements before it even begins building the response
The usual answer is cts:search plus the "unfiltered" option. You found that xdmp:directory was faster, but you should still be able to measure pagination times as O(n), even if the scale is smaller. See http://docs.marklogic.com/guide/performance/unfiltered#chapter - basically, the database is guarding against returning false positives unless you tell it not to.
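A hedged sketch of the unfiltered variant of the query above, using the same external variables:

declare namespace ns = "http://some.namespace/here";
declare variable $fromCount external;
declare variable $toCount external;

<response> {
  for $doc in cts:search(/ns:entity, cts:and-query(()), "unfiltered")[$fromCount to $toCount]
  return <doc> omitted for brevity </doc>
} </response>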
Another approach might be to use cts:uris and its limit option, but this might require managing pagination state in terms of start values rather than page counts. For example, if the last item on page 1 was "cat", you would use "cat" as the start value when calling cts:uris for the next page. You could still use pagination start-stop values, too. That would still be O(n), but at a much smaller scale.
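For instance, a rough sketch; this requires the URI lexicon to be enabled, and the directory and page size are made up:

let $page-size := 100
let $last-uri := "/entities/0100.xml"  (: hypothetical last URI from the previous page :)
return
  cts:uris($last-uri, fn:concat("limit=", $page-size + 1),
    cts:directory-query("/entities/"))[2 to $page-size + 1]  (: drop the start URI itself :)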

MongoDB find and remove - the fastest way

I have a quick question: what is the fastest way to grab and delete an object from a Mongo collection? Here is the code I currently have:
$cursor = $coll->find()->sort(array('created' => 1))->limit(1);
$obj = $cursor->getNext();
$coll->remove(array('name' => $obj['name']));
As you can see above, it grabs one document from the database and deletes it (so it isn't processed again). However fast this may be, I need it to perform faster. The challenge is that we have multiple processes doing this and processing what they have found, but sometimes two or more of the processes grab the same document, resulting in duplicates. Basically, I need to make it so a document can only be grabbed once. So any ideas would be much appreciated.
Peter,
It's hard to say what the best solution is here without understanding all the context, but one approach you could use is findAndModify. This will query for a single document, return it, and also apply an update to it.
You could use this to find a document to process and simultaneously modify a "status" field to mark it as being processed, so that other workers can recognize it as such and ignore it.
There is an example here that may be useful:
http://docs.mongodb.org/manual/reference/command/findAndModify/
Use the findAndRemove function as documented here:
http://api.mongodb.org/java/current/com/mongodb/DBCollection.html
The findAndRemove function retrieves an object from the Mongo database and deletes it in a single (atomic) operation.
findAndRemove(query, sort[, options], callback)
The query object is used to retrieve the object from the database (see collection.find()).
The sort parameter is used to sort the results (in case many were found).
I'm adding a new answer to highlight the fact that, as commented by @peterscodeproblems on the accepted answer, the native way to do this in MongoDB right now is to use:
findAndModify(query=<document>, remove=True)
As pointed out by the documentation.
As it is native and atomic, I expect this to be the fastest way to do this.
I am new to MongoDB and not entirely sure what your query is trying to do, but here is how I would do it:
# suppose database is staging
# suppose collection is data
use staging
db.data.remove(<your_query_criteria>)
where <your_query_criteria> is a map and can contain any search criteria you want.
Not sure if this would help you.

XPath Query in JMeter

I'm currently working with JMeter in order to stress test one of our systems before release. Through this, I need to simulate users clicking links on the webpage presented to them. I've decided to extract these links with an XPath Post-Processor.
Here's my problem:
I have an XPath expression that looks something like this:
//div[@data-attrib="foo"]//a//@href
However I need to extract a specific child for each thread (user). I want to do something like this:
//div[@data-attrib="foo"]//a[position()=n]//@href
(n being the current index)
My question:
Is there a way to make this query work, so that I'm able to extract a new index of the expression for each thread?
Also, as I mentioned, I'm using JMeter. JMeter creates a variable for each of the resulting nodes of an XPath query. However, it names them "VarName_n" and doesn't store them as a traditional array. Does anyone know how I can dynamically pick one of these variables, if possible? This would also solve my problem.
Thanks in advance :)
EDIT:
Nested variables are apparently not supported, so in order to dynamically refer to variables named "VarName_1", "VarName_2" and so forth, this can be used:
${__BeanShell(vars.get("VarName_${n}"))}
Where "n" is an integer. So if n == 1, this will get the value of the variable named "VarName_1".
If the "n" integer changes during a single thread, the ForEach controller is designed specifically for this purpose.
For the first question -- use:
(//div[@data-attrib="foo"]//a)[position()=$n]/@href
where $n must be substituted with a specific integer.
Here we also assume that //div[@data-attrib="foo"] selects a single div element.
Do note that the XPath pseudo-operator // typically results in very slow evaluation (a complete subtree is searched) and also in other confusing problems (this is why the brackets are needed in the above expression).
It is recommended to avoid using // whenever the structure of the document is known and a complete, concrete path can be specified.
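For example, if the page structure is known, a concrete rooted path like the following (the structure here is hypothetical) avoids those full-subtree scans:

/html/body/div[@data-attrib="foo"]/ul/li/a/@href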
As for the second question, it is not clear. Please, provide an example.

Dynamically updating RDF File

Is it possible to update an RDF file dynamically from user-generated input through a webform? The exact scenario would be SKOS concept definitions being created and updated through user input to HTML forms.
I was considering XPath, but is there a better / generally accepted / best-practice way of doing this kind of thing?
For this type of thing there are IMO two approaches:
1 - Using Named Graphs in a Triple Store
Rather than editing an actual fixed file, you use a graph which is stored as a named graph in a triple store that supports triple-level updates (i.e. you can change individual triples in a graph). For example, you could use a store like Virtuoso or a Jena-based store (Jena SDB/TDB) to do this - basically any store that supports the SPARUL language or has its own equivalent.
2 - Using a fixed RDF file and altering it
From your mention of XPath, I assume that you are intending to store your file as RDF/XML. While XPath would potentially work for this, it's going to be dependent on the exact serialization of your file and may get very complex. If your app is going to allow users to submit and edit their own files, then there will be no guarantees over how the RDF has been serialized into RDF/XML, so your XPath expressions might not work. If you control all the serialization and processing of the RDF/XML, then you can keep it in a format that your XPath will work on.
From my point of view, the simplest way to do this approach is to load the file into memory using an appropriate RDF library, manipulate it in memory, and then persist the whole thing back to disk when the user is done (or at regular intervals, or whatever is appropriate to your application).
