Execute query lazily in OrientDB

In our current project we need to find the cheapest paths in an almost fully connected graph that can contain many edges per vertex pair.
We developed a plugin containing functions:
for a special traversal of this graph that reduces recurrences of similar paths during TRAVERSE execution. We will refer to it as search()
for efficient extraction of the desired information from the results of such traversals. We will refer to it as extract()
for extracting the best N records according to a target parameter without a costly ORDER BY. We will refer to it as best()
But the resulting query still performs unsatisfactorily on the full data set.
So we decided to modify the search() function so that it visits the best edges first and prunes paths that definitely lead to an undesired result, using the current state of the best() function.
The overall solution is effectively a flexible implementation of the branch and bound method.
The resulting query (omitting the extract() step) should look like
SELECT best(path, <limit>) FROM (
TRAVERSE search(<params>) FROM #<starting_point>
WHILE <conditions on intermediate vertices>
) WHERE <conditions on result elements>
We very much want to keep this form so that we can adapt the conditions under WHILE and WHERE to the task at hand. The path field is generated by search() and contains all the information best() needs to proceed.
The trouble is that the best() function is executed strictly after the search() function, so search() cannot prune non-optimal branches according to results already evaluated by best().
So the question is:
Is there a way to pipeline results from the TRAVERSE step to the SELECT step so that later paths are TRAVERSEd by search() only after earlier paths have been handled by SELECT with best()?

The query execution in this case will be streamed. If you add a
System.out.println()
or put a breakpoint in your functions, you'll see that the invocation sequence will be
search
best
search
best
search
...
You can use a ThreadLocal object (http://docs.oracle.com/javase/7/docs/api/java/lang/ThreadLocal.html)
to store some context data and share it between the two functions, or you can use the OCommandContext (the last parameter of the OSQLFunction.execute() method) to store context information.
You can use context.getVariable() and context.setVariable() for this.
The contexts of the two queries (the parent and the inner query) are different, but they should be linked by a parent/child relationship, so you should be able to retrieve one from the other using OCommandContext.getParent().
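The ThreadLocal suggestion can be sketched in plain Java without any OrientDB classes. Everything named here (SharedBoundSketch, Best, shouldExpand) is hypothetical and only stands in for the real best() and search() functions. The sketch assumes that both functions run on the same thread and that edge costs are non-negative, so the cost of a partial path is a lower bound on the cost of any completion.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

final class SharedBoundSketch {
    // Hypothetical shared state: cost of the worst path currently kept by best().
    private static final ThreadLocal<Double> BOUND =
            ThreadLocal.withInitial(() -> Double.POSITIVE_INFINITY);

    /** Clears the bound, e.g. before a new query starts on this thread. */
    public static void reset() {
        BOUND.remove();
    }

    /** Stand-in for best(): keeps the N cheapest costs and publishes the bound. */
    public static final class Best {
        private final int limit;
        // Max-heap, so the most expensive kept cost is always on top.
        private final PriorityQueue<Double> kept =
                new PriorityQueue<>(Comparator.reverseOrder());

        public Best(int limit) {
            this.limit = limit;
        }

        public void accept(double cost) {
            kept.add(cost);
            if (kept.size() > limit) {
                kept.poll(); // drop the most expensive candidate
            }
            if (kept.size() == limit) {
                BOUND.set(kept.peek()); // only paths cheaper than this can still qualify
            }
        }
    }

    /** Stand-in for the pruning test inside search(). */
    public static boolean shouldExpand(double partialCost) {
        return partialCost < BOUND.get();
    }

    public static void main(String[] args) {
        reset();
        Best best = new Best(2);
        System.out.println(shouldExpand(100.0)); // true: nothing collected yet
        best.accept(5.0);
        best.accept(3.0);
        System.out.println(shouldExpand(7.0));   // false: cannot beat the kept paths
        System.out.println(shouldExpand(4.0));   // true: could replace the path costing 5.0
    }
}
```

Because the calls are interleaved (search, best, search, ...), every result that best() accepts tightens the bound before search() expands the next path. The same value could be stored in the OCommandContext instead of a ThreadLocal.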


Does the order in which BooleanPredicateClausesStep filters are chained matter?

I have the following method that creates a BooleanPredicateClausesStep to do a query with.
private BooleanPredicateClausesStep<?> getJournalAndSpatialSearchCriteria(GeoFilter geoFilter, SearchPredicateFactory factory, Boolean includeJournalsWithStatusFinished) {
SearchPredicate journalLocationMustResideWithinRadius = getJournalsContainedWithinRadiusPredicate(geoFilter, factory);
SearchPredicate mustOrShouldBeOfStatus = getSubmissionStatusConditionPredicate(includeJournalsWithStatusFinished, factory);
return factory.bool()
.filter( journalLocationMustResideWithinRadius )
.filter( factory.match().field( "deleted" ).matching( "false" ) )
.filter( mustOrShouldBeOfStatus )
.filter( factory.match().field( "containsHarvestEntry" ).matching( "true" ) )
.filter( factory.match().field( "grownOutdoors" ).matching( "true" ) );
}
It contains one spatial predicate that checks whether journals fall within a predefined circular geographical area. All the other filters are simple ones that only check whether a certain field matches a value.
My question is: are all these filters applied sequentially or all at once? To put it differently: would Lucene first fetch all of the objects that fall within the defined geographical area before checking whether they are deleted, or does it check both simultaneously? The Hibernate Search documentation doesn't say anything about the order in which filters are processed.
In short: no, the order in which filters are declared within a single boolean predicate doesn't matter. The results will be the same regardless of order, and performance will most likely be the same regardless of order.
Detailed answer:
The order of clauses in a given boolean predicate doesn't matter in the sense that the predicate will match the same documents regardless of order.
Regarding implementation, it's a bit complex, but roughly speaking filters are turned into DocIdSetIterators which are then combined, so that the combined iterator only goes through documents returned by all iterators. That means each iterator will be incremented one after the other.
However, there are optimizations that allow iterators to "skip through" to the document matched by the previous iterator, so order might matter, but for performance only: if you have a filter that is quick and matches very few documents, it's better to execute it first.
But... Lucene often has knowledge of the "cost" of each filter/iterator, and will automatically change the order of iterators to execute the less costly ones first (see org.apache.lucene.search.ConjunctionDISI#createConjunction in Lucene 8.11).
All that to say: even for performance, while order matters internally, it shouldn't matter to you. So, don't even think about it :)
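The "skip through" behaviour and the cost-based reordering described above can be illustrated with a toy conjunction over sorted document ids. This is a simplified sketch, not Lucene's actual code: the DocIds class, its cost() estimate (simply the number of ids) and the intersect() method are all made up for the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class ConjunctionSketch {
    /** Toy iterator over a sorted array of document ids. */
    public static final class DocIds {
        public static final int NO_MORE = Integer.MAX_VALUE;
        private final int[] ids;
        private int pos = 0;

        public DocIds(int... sortedIds) {
            this.ids = sortedIds;
        }

        /** Assumed cost estimate: the number of documents this filter matches. */
        public long cost() {
            return ids.length;
        }

        /** Moves forward to the first id >= target, or returns NO_MORE. */
        public int advance(int target) {
            while (pos < ids.length && ids[pos] < target) {
                pos++;
            }
            return pos < ids.length ? ids[pos] : NO_MORE;
        }
    }

    /** Intersects the filters; requires at least one filter. */
    public static List<Integer> intersect(DocIds... filters) {
        DocIds[] sorted = filters.clone();
        // Cheapest filter leads, whatever order the caller declared them in.
        Arrays.sort(sorted, Comparator.comparingLong(DocIds::cost));
        List<Integer> hits = new ArrayList<>();
        int candidate = sorted[0].advance(0);
        while (candidate != DocIds.NO_MORE) {
            boolean allMatch = true;
            for (int i = 1; i < sorted.length; i++) {
                int doc = sorted[i].advance(candidate);
                if (doc != candidate) {
                    // This filter is already past the candidate: let the lead skip ahead.
                    candidate = sorted[0].advance(doc);
                    allMatch = false;
                    break;
                }
            }
            if (allMatch) {
                hits.add(candidate);
                candidate = sorted[0].advance(candidate + 1);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(intersect(
                new DocIds(1, 3, 5, 7, 9),
                new DocIds(3, 9),
                new DocIds(0, 3, 4, 9, 10))); // [3, 9]
    }
}
```

Since the filters are sorted by cost before iteration starts, the declaration order changes neither the result nor the amount of work done, which is the point made in the answer.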

Choosing specific element in XPath

I have two elements with the same name, "reason". When I use //*:reason/text() it returns both of them, but I need only the first one (not the one inside "details"). Please help.
<xml xmlns:gob="http://osb.yes.co.il/GoblinAudit">
<fault>
<ctx:fault xmlns:ctx="http://www.bea.com/wli/sb/context">
<ctx:errorCode>BEA-382500</ctx:errorCode>
<ctx:reason>OSB Service Callout action received SOAP Fault response</ctx:reason>
<ctx:details>
<ns0:ReceivedFaultDetail xmlns:ns0="http://www.bea.com/wli/sb/stages/transform/config">
<ns0:faultcode xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">soapenv:Server</ns0:faultcode>
<ns0:faultstring>BEA-380001: Internal Server Error</ns0:faultstring>
<ns0:detail>
<con:fault xmlns:con="http://www.bea.com/wli/sb/context">
<con:errorCode>BEA-380001</con:errorCode>
<con:reason>Internal Server Error</con:reason>
<con:location>
<con:node>RouteTo_FinancialControllerBS</con:node>
<con:path>response-pipeline</con:path>
</con:location>
</con:fault>
</ns0:detail>
</ns0:ReceivedFaultDetail>
</ctx:details>
<ctx:location>
<ctx:node>PipelinePairNode2</ctx:node>
<ctx:pipeline>PipelinePairNode2_request</ctx:pipeline>
<ctx:stage>set maintain offer</ctx:stage>
<ctx:path>request-pipeline</ctx:path>
</ctx:location>
</ctx:fault>
</fault>
</xml>
You are using the // qualifier, which descends into every subtree and finds all occurrences of reason. You can try to be more specific about the path:
//fault/*:fault/*:reason/text()
This will match only the outer reason, not the inner one.
"...but i need the first one"
You can use a position index to get the first matched reason element:
(//*:reason)[1]/text()
"(not the one inside "details")"
The above can be expressed as finding a reason element that doesn't have a details ancestor:
//*:reason[not(ancestor::*:details)]/text()
For a large XML document, using a more specific path, i.e. avoiding // at the beginning, would result in a more efficient XPath:
/xml/fault/*:fault/*:reason/text()
But for a small XML, it's just a matter of personal preference, since the improvement is likely to be negligible.
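The expressions above can be tried with the XPath engine that ships with the JDK, as a rough sketch. That engine implements XPath 1.0, which has no *:name wildcard, so the sketch uses local-name() as an equivalent test. The class name FirstReasonSketch and the shortened XML (with "outer" and "inner" as placeholder texts) are assumptions made for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class FirstReasonSketch {
    // Shortened version of the fault document from the question.
    static final String XML =
            "<xml><fault>"
            + "<ctx:fault xmlns:ctx=\"http://www.bea.com/wli/sb/context\">"
            + "<ctx:reason>outer</ctx:reason>"
            + "<ctx:details>"
            + "<con:fault xmlns:con=\"http://www.bea.com/wli/sb/context\">"
            + "<con:reason>inner</con:reason>"
            + "</con:fault>"
            + "</ctx:details>"
            + "</ctx:fault></fault></xml>";

    /** Returns the text content of every element selected by the expression. */
    public static List<String> select(String expr) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // treat ctx: and con: as namespace prefixes
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expr, doc, XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent());
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Both reason elements.
        System.out.println(select("//*[local-name()='reason']"));
        // Only the first one in document order.
        System.out.println(select("(//*[local-name()='reason'])[1]"));
        // Only the one that is not inside details.
        System.out.println(select(
                "//*[local-name()='reason'][not(ancestor::*[local-name()='details'])]"));
    }
}
```

With this sample document the first call returns both texts and the other two return only the outer reason.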

xpath - matching value of child in current node with value of element in parent

Edit: I think I found the answer, but I'll leave the question open for a bit to see if someone has a correction/improvement.
I'm using XPath in Talend's ETL tool. I have XML like this:
<root>
<employee>
<benefits>
<benefit>
<benefitname>CDE</benefitname>
<benefit_start>2/3/2004</benefit_start>
</benefit>
<benefit>
<benefitname>ABC</benefitname>
<benefit_start>1/1/2001</benefit_start>
</benefit>
</benefits>
<dependent>
<benefits>
<benefit>
<benefitname>ABC</benefitname>
</benefit>
</dependent>
When parsing benefits for dependents, I want to get elements present in the employee's
benefit element. So in the example above, I want to get 1/1/2001 for the dependent's
start date. I want 1/1/2001, not 2/3/2004, because the dependent's benefit has benefitname ABC, matching the employee's benefit with the same benefitname.
What XPath, relative to /root/employee/dependent/benefits/benefit, will yield the value of
benefit_start for the benefit under the parent employee that has the same benefit name as the
dependent benefit name? (Note that I don't know ahead of time what the literal value will be; I can't just look for 'ABC', I have to match whatever value is in the dependent's benefitname element.)
I'm trying:
../../../benefits/benefit[benefitname=??what??]/benefit_start
I don't know how to refer to the current node's ancestor in the middle of
the XPath (since I think "." at the point where I have ??what?? will refer to
the benefit node of the employee/benefits).
EDIT: I think what I want is "current()/benefitname" where the ??what?? is. It seems to work with Saxon; I haven't tried it in the ETL tool yet.
Your XML is malformed, and I don't think you've described your situation very well (the XPath you're trying has a bunch of ../../s at the beginning, but you haven't said what the context node is, whether you're iterating through certain nodes, or what).
Supposing the current context node were an employee element, you could select benefit_starts that match dependent benefits with
benefits/benefit[benefitname = ../../dependent/benefits/benefit/benefitname]
/benefit_start
If the current context node is a benefit element in a dependent section, and you want to get the corresponding benefit_start for just the current benefit element, you can do:
../../../benefits/benefit[benefitname = current()/benefitname]/benefit_start
Which is what I think you've already discovered.
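current() is defined by XSLT rather than by XPath itself, so a standalone XPath engine may not offer it. One possible workaround, sketched here with the JDK's javax.xml.xpath API, is to read the name from the context node first and pass it back in as an XPath variable. The class name, the variable name $name and the repaired, well-formed version of the sample XML are assumptions made for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class MatchParentBenefitSketch {
    // Well-formed guess at the structure the question intended.
    static final String XML =
            "<root><employee>"
            + "<benefits>"
            + "<benefit><benefitname>CDE</benefitname>"
            + "<benefit_start>2/3/2004</benefit_start></benefit>"
            + "<benefit><benefitname>ABC</benefitname>"
            + "<benefit_start>1/1/2001</benefit_start></benefit>"
            + "</benefits>"
            + "<dependent><benefits>"
            + "<benefit><benefitname>ABC</benefitname></benefit>"
            + "</benefits></dependent>"
            + "</employee></root>";

    /** Returns the employee's benefit_start matching the dependent's benefitname. */
    public static String startDateForDependentBenefit() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            XPath xpath = XPathFactory.newInstance().newXPath();

            // The context node: the benefit element under dependent.
            Node context = (Node) xpath.evaluate(
                    "/root/employee/dependent/benefits/benefit", doc, XPathConstants.NODE);

            // Step 1: read the value that current()/benefitname would have supplied.
            String name = xpath.evaluate("benefitname", context);

            // Step 2: expose it to the second expression as the variable $name.
            xpath.setXPathVariableResolver(
                    qname -> "name".equals(qname.getLocalPart()) ? name : null);
            return xpath.evaluate(
                    "../../../benefits/benefit[benefitname = $name]/benefit_start", context);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(startDateForDependentBenefit()); // 1/1/2001
    }
}
```

Binding the value as a variable, rather than concatenating it into the expression string, also avoids quoting problems if a benefit name contains an apostrophe.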

Interpreting Cascading dot diagrams

Can someone explain how to read these diagrams? I understand the flow from head to tail, but I am specifically wondering about how to read the field (bracket) transitions between ellipses (Pipes/Taps).
By way of example, take the Fields following the Every Pipe in the image. As far as I can tell, the first Field set, i.e. [{2}:'token', 'count'], is what goes into the next Pipe/Tap, but what is the significance of the second Field set, [{1}: 'token']?
Is this the field set that went into the previous Pipe above? Is there a programmatic significance to the second bracket, i.e. are we able to access it within that pipe with particular Cascading code (in the case where the second Fields set is greater than the first)?
(source: cascading.org)
The second field set represents which fields are available for subsequent operations in that map or reduce.
In your example above, in the reduce step, since you grouped by 'token', only 'token' is available for subsequent aggregations (Everys) in that reduce step. You could, for example, add another aggregation that outputs the average token length, but you could not yet use an aggregation that relies on 'count'.
The reason for this behaviour is that subsequent aggregations on the same group happen in parallel. Thus, the Count won't be completed to feed into any other aggregations you chained on.

Is there "positional distinct" in XPath

Consider a query
//item[value='testvalue']/ancestor::container[1]
If item appears several times inside a container, we get several hits, so the same container should supposedly appear several times in the results. The results are nodes, right? If I apply distinct-values to them, they stop being nodes, and the function technically returns values, losing positional information. Is there an operation (a refactoring, a function) that keeps the result as nodes while excluding duplicate hits?
"Is there an operation (a refactoring, a function) that keeps the result as nodes while excluding duplicate hits?"
By definition, the XPath operator / performs deduplication, therefore:
//item[value='testvalue']/ancestor::container[1]
doesn't select two identical nodes.
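A quick way to check this is to count the selected nodes, here with the JDK's built-in XPath 1.0 engine. The class name DedupSketch and the sample XML, in which one container holds two matching items, are made up for the illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class DedupSketch {
    // Container 'a' holds two matching items, container 'b' holds none.
    static final String XML =
            "<root>"
            + "<container id='a'>"
            + "<item><value>testvalue</value></item>"
            + "<item><value>testvalue</value></item>"
            + "</container>"
            + "<container id='b'>"
            + "<item><value>other</value></item>"
            + "</container>"
            + "</root>";

    /** Returns how many nodes the expression selects in the sample document. */
    public static int count(String expr) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expr, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Two items match the predicate...
        System.out.println(count("//item[value='testvalue']")); // 2
        // ...but they share one ancestor, which is selected only once.
        System.out.println(count("//item[value='testvalue']/ancestor::container[1]")); // 1
    }
}
```

The result is still a set of nodes, so positional information is kept and no distinct-values call is needed.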
