Performance improvement - Alternative for a map in xquery - xpath

I am using cts:element-value-co-occurences for a query which returns a big list of values. Since I am giving "map" as an option, those occurences are copied into a map(an example given below).
Returned Map for the query
map:map(
<map:map xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:map="http://marklogic.com/xdmp/map">
<map:entry key="doc">
<map:value xsi:type="map:map">
<map:map>
<map:entry key="/data/fb/www.abcdefgh.com#form123456665364#thread123456968765#post987986513213_65434360536840613_66445444">
<map:value xsi:type="xs:float">0.289406</map:value>
</map:entry>
</map:map>
</map:value>
</map:entry>
</map:map>
)
As you notice, my map key is "doc", the occurences returned are copied into this key "doc" as sub-map.
The sub-map's "<map:entry key=" has a big URL as the key.
When one occurence/result is returned for the query, its good. The problem is when hundred's of result is returned, the performance is getting very worse. Is there any alternative for this? I am only concerned with the value "0.289406" than the big key.
can i get the value alone directly into any xml element as like :-
<doc>
<valu>0.289406</valu>
<valu>-0.23456</valu>
<valu>0.3665</valu>
</doc>
instead of using a map or any iterations??

Hundreds of results should be fine. On a regular basis I work with thousands to millions of results in maps.
So I suspect something else is going on. If you can come up with a reproducible test case, someone may be able to spot the problem.

Related

Spring Data JPA fetching list always returns at least a single result

I've noticed a slight problem with how my API is working where I'm using Spring Data JPA.
My query looks something along the lines of:
#Query("SELECT p.id AS id, COUNT(l) AS likes FROM Post p LEFT JOIN Like l ON l.post = p WHERE p.location.id = ?1")
My actual query is bigger, this this contains everything necessary to explain what the issue is. This query will return a list, but assume the location does not exist, it should return null or an empty list, correct? Oh, how wrong you are, my sweet summer child!
This query will instead always return a list of at least one element, regardless of whether or not there are any posts linked to said location.
[{"id": null, "likes": 0}]
That is what the result looks like when serialized to JSON. I am not quite sure what to do about this little predicament, as I obviously don't want to return a list with faulty data, but needing to use processing to filter out duds also seems dumb and unnecessary.
Is there any way to prevent this that I've yet to find? If it is of any relevance, I am using projections currently for my responses.
What I've tried so far:
Adding a not null condition for fields. Does not work, ignored by COUNT.
Adding constraints to all fields #NotNull. Does not work, will still become null.
For what it's worth, I've tried different kinds of joins, though anything but LEFT JOIN doesn't make much sense.
I haven't been able to find any other case which resembles this either, although it most likely exists, but is drowned out by everything else. I'm not quite sure what can be done in this regard, so I'm curious if it's just a quirk with the framework, or if there is an actual solution.
It might be possible to solve through native queries, but I would prefer not to use them.
I'm no SQL expert but I believe that a left join will give you this result if the ID does not exist.
Have you run the query in your DB? Doesn't it give you one row in your result set for IDs that do not exist?
I believe this is intended to say there is a 0 match.
You might want to validate your query before running it. Meaning checking that the location exists first.
As the issue is inherently due to a COUNT and CASE keyword in my real query, resulting in there always being at least one row, and I can't find any method of doing this automatically, the solution I've used is the following:
List<Item> items = repository.customQuery(id);
if (0 < items.size() && null == items.get(0).getId()) {
items.remove(0);
}
The first condition is arbitrary as I know there is always at least one entry, but is done just as a safety measure. A try-catch block would do the trick as well. In the case where you use a primitive int instead of Integer, you'd need to initialize the value in the constructor to something which would normally never be present in the database, such as -1.
If anyone knows of a better method, I'd love to know about it.

How to ignore "stop words" while sorting in MarkLogic?

Is there any way to ignore "stop words" while sorting.
For example:
I have words like
dixit
singla
the marklogic
On sorting in descending order the result should be
singla, the marklogic, dixit
As in the above example the is ignored.
Any way to achieve this?
Update:
Stop word can occur at any place.
for example
the MarkLogic
MarkLogic is the best
the MarkLogic is awesome
while sorting should not consider any stop word in the text.
Above is just a small example to describe the problem.
In actual I am using search:search API.
For sorting, I am using sort-order search options.
The element on which I have to perform sorting is dynamic. There are approx 30-35 elements.
Is there any way to customize the collation at this level like to configure some words (stop words) which will be ignored while sorting.
There is no standard collation URI that is going to do this for you (at least none that I've ever seen). You can do it dynamically, of course, by sorting on the result of a function invocation, but if you want it done efficiently at scale (and available to search:search), then you need to materialize the sortable string into your document. I've often done this as an attribute on the element:
<title sortable="Great Gatsby, The">The Great Gatsby</title>
Then you put a range index on the title/#sortable attribute.
You can also use the "envelope pattern" where materialized metadata like this is maintained in its own section of the document with the original kept in its own section. For things like this, I think it's a bit more elegant to decorate the elements directly, to keep the context.
If I understand your question correctly you're trying to get rid of the definite article when sorting your result-set.
In order to do this you need to use some additional functions and create a 'sort' criteria. My solution would look like this (I'm also including some sample documents so that you can test this just by copy-pasting):
(:
xdmp:document-insert("/peter.xml", <person><firstName>Peter</firstName><lastName>O'Toole</lastName><age>60</age></person>);
xdmp:document-insert("/john.xml", <person><firstName>John</firstName><lastName>Adams</lastName><age>18</age></person>);
xdmp:document-insert("/simon.xml", <person><firstName>Simon</firstName><lastName>Petrov</lastName><age>22</age></person>);
xdmp:document-insert("/mark.xml", <person><firstName>Mark</firstName><lastName>the Lord</lastName><age>25</age></person>);
:)
for $person in /person
let $sort := fn:reverse(fn:tokenize($person/lastName, ' '))[1]
order by $sort
(: return $person :)
return $person/lastName/text()
Notice that now the sort order is going to be
- Adams
- the Lord
- O'Toole
- Petrov
I hope this will help.

How to check, whether an index is empty in Elasticsearch?

In a java application I want to find out, whether my index is empty or not (containing zero documents, or containing at least one documents).
I have the feeling, that counting all documents (by sending a search request with size=0) is not readable and might not be optimal in performance.
Is there a dedicated way for to check, whether an index is empty?
You could do that with the following Java code:
IndicesStatsResponse indicesStatsResponse = client().admin().indices().prepareStats(INDEX).get();
indicesStatsResponse.getIndices().get(INDEX).getTotal().docs.getCount());
where client is org.elasticsearch.client.Client and INDEX is the String, representing the name of the index.
Interesting is, that some simple test shows, that in average, this request is faster by 20%, than doing MatchAllDocsQuery. But I'm not sure if this is a correct test. Maybe I will do a proper one later

What's the term of data structure where field is missing if no value?

My data looks like this:
[
{"name":"Jimmy H.","title":"Mr."},
{"name": "Janice H."}
]
So, if field does not have value, then also the field name is missing. What's the proper term for that?
EDIT:
Basically I'm looking for a term that differentiates structure above from structure where every field name (even without value) is guaranteed to exist in every record.
The one in the example is a combination of well known structures, it seems indeed an array of maps. At least, in JavaScript it would be an array of objects, but those objects behave like maps the way they are used in the example.

Solr query conundrum

I've recently swapped from using Lucene for Sitecore to Solr.
For the most part it has been smooth, but the way I was writing some queries (using Sitecore.ContentSearch.Linq) abstraction now don't seem to be compatible.
Specifically, I have a situation where I've got "global" content and "regional" content, like so:
Home (000)
X
Y
Z
Regions (ID: 111)
Region 1 (ID: 221)
A
B
Region 2 (ID: 222)
D
My code worked on Lucene, but now doesn't on Solr. It should find all "global" and a single region's content, excluding all other region's content. So as an example, if the user's current region was Region 1, I'd want the query to return content X, Y, Z, A, B.
Sitecore's Item Crawler has a field for each item in the index called "_path" which is a multivalued string field of IDs, so as an example, Region 1's _path field value would be [000, 111, 221 ].
When I write this using the Linq abstraction it comes out as below which doesn't return results.
-_path:(111) OR _path:(221)
But _path:(111) does return result. Mind blown.
When I use the Solr interface and wrap each side of the OR in extra brackets like below (which I'd consider redundant) it works! Mind blown v2.
(-_path:(111)) OR (_path:(221))
Firstly, what's the difference between those queries?
Secondly, my real problem is I can't add these extra brackets as I'm working in an abstraction Linq so the brackets will be "optimized" out.
Any advice would be awesome! Cheers.
The problem here is, lucene's negative queries don't work like you think they do. They only remove results from what has been found. -_path:111 doesn't find all documents which aren't in 111, it doesn't find anything at all. It only removes results. So you are finding all results with path "221", then removing any that also have path "111", which from your heirarchy, I assume is all of them. See my answer here for a bit more on that topic.
The OR makes it seem like it ought to work, but really -_path:(111) OR _path:(221) is the same as -_path:(111) _path:(221). The moral here is: Don't use Lucene's AND/OR/NOT syntax, if you can help it. Use +/-. +/- syntax actually expresses how the query operates, AND/OR/NOT doesn't. It attempts to shoehorn it into a different, SQL-like retrieval model and leads to some unexpected behavior like this.
So, what about: (-_path:(111)) OR (_path:(221))
Well, first, does it actually work? Or does it just get some results?
If it just gets some results, but just seems to get the same results as _path:221: The reason is -_path:111 gets no results, so your query is, in practice, something like: (nothing) OR (_path:221), which is equivalent to _path:221
If it really does get the results you expect (I'm guessing it probably does): Something is translating your query into something like: (*:* -_path:111) (_path:221). Solr does have some logic along these lines, though I'm not quite sure in this case. Essentially, it puts a match-all in front of any lonely negative queries it finds, allowing them to do what you were expecting. If the implicit *:* makes you nervous about performance, well, it should. But lucene is an inverted index, it does well with finding matches on a term quickly. Getting everything that doesn't match goes against the grain of that retrieval model, and will pretty much have to do a full scan of the index.

Resources