How to check whether an index is empty in Elasticsearch? - elasticsearch

In a Java application, I want to find out whether my index is empty (zero documents) or contains at least one document.
I have the feeling that counting all documents (by sending a search request with size=0) is not very readable and might not be optimal for performance.
Is there a dedicated way to check whether an index is empty?

You could do that with the following Java code:
IndicesStatsResponse indicesStatsResponse = client().admin().indices().prepareStats(INDEX).get();
long docCount = indicesStatsResponse.getIndices().get(INDEX).getTotal().docs.getCount();
boolean indexIsEmpty = docCount == 0;
where client() returns an org.elasticsearch.client.Client and INDEX is a String holding the name of the index.
Interestingly, a simple test suggests that on average this request is about 20% faster than running a MatchAllDocsQuery. I'm not sure the test was sound, though; I may do a proper one later.

Related

How to ignore "stop words" while sorting in MarkLogic?

Is there any way to ignore "stop words" while sorting?
For example:
I have words like
dixit
singla
the marklogic
When sorting in descending order, the result should be
singla, the marklogic, dixit
In the above example, "the" is ignored.
Is there any way to achieve this?
Update:
A stop word can occur anywhere in the text.
For example:
the MarkLogic
MarkLogic is the best
the MarkLogic is awesome
Sorting should not consider any stop word in the text.
The above is just a small example to describe the problem.
In practice, I am using the search:search API.
For sorting, I am using the sort-order search options.
The element on which I have to sort is dynamic; there are approximately 30-35 candidate elements.
Is there any way to customize the collation at this level, for example to configure some words (stop words) that will be ignored while sorting?
There is no standard collation URI that is going to do this for you (at least none that I've ever seen). You can do it dynamically, of course, by sorting on the result of a function invocation, but if you want it done efficiently at scale (and available to search:search), then you need to materialize the sortable string into your document. I've often done this as an attribute on the element:
<title sortable="Great Gatsby, The">The Great Gatsby</title>
Then you put a range index on the title/@sortable attribute.
You can also use the "envelope pattern" where materialized metadata like this is maintained in its own section of the document with the original kept in its own section. For things like this, I think it's a bit more elegant to decorate the elements directly, to keep the context.
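To make the idea of a materialized sortable string concrete, here is a minimal sketch in plain Java (not MarkLogic code). It assumes a variation on the attribute example above: instead of moving the article to the end, it drops every stop word, which matches the question's update that stop words can occur anywhere. The class name, method names, and stop-word list are hypothetical; in practice the key would be computed at ingest time and stored next to the original value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class SortKeys {
    // Builds a lowercase sort key with all stop words removed.
    static String sortKey(String text, Set<String> stopWords) {
        StringBuilder key = new StringBuilder();
        for (String word : text.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (word.isEmpty() || stopWords.contains(word)) {
                continue; // skip stop words wherever they occur
            }
            if (key.length() > 0) {
                key.append(' ');
            }
            key.append(word);
        }
        return key.toString();
    }

    // Sorts the original values in descending order of their sort keys.
    static List<String> sortDescending(List<String> values, Set<String> stopWords) {
        List<String> sorted = new ArrayList<>(values);
        sorted.sort(Comparator.comparing((String v) -> sortKey(v, stopWords)).reversed());
        return sorted;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "is"));
        System.out.println(sortDescending(
                Arrays.asList("dixit", "singla", "the marklogic"), stopWords));
    }
}
```

With the three values from the question and "the" as a stop word, the descending order comes out as singla, the marklogic, dixit, which is the order the question asks for.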
If I understand your question correctly, you're trying to ignore the definite article when sorting your result set.
To do this, you need some additional functions to build a sort criterion. My solution would look like this (I'm also including some sample documents, so you can test it by copy-pasting):
(:
xdmp:document-insert("/peter.xml", <person><firstName>Peter</firstName><lastName>O'Toole</lastName><age>60</age></person>);
xdmp:document-insert("/john.xml", <person><firstName>John</firstName><lastName>Adams</lastName><age>18</age></person>);
xdmp:document-insert("/simon.xml", <person><firstName>Simon</firstName><lastName>Petrov</lastName><age>22</age></person>);
xdmp:document-insert("/mark.xml", <person><firstName>Mark</firstName><lastName>the Lord</lastName><age>25</age></person>);
:)
for $person in /person
let $sort := fn:reverse(fn:tokenize($person/lastName, ' '))[1]
order by $sort
(: return $person :)
return $person/lastName/text()
Notice that, because $sort is the last whitespace-separated word of lastName, the sort order is now going to be
- Adams
- the Lord
- O'Toole
- Petrov
I hope this will help.

Solr query conundrum

I've recently switched from Lucene to Solr for Sitecore.
For the most part it has been smooth, but some queries I wrote using the Sitecore.ContentSearch.Linq abstraction no longer seem to be compatible.
Specifically, I have a situation where I've got "global" content and "regional" content, like so:
Home (ID: 000)
    X
    Y
    Z
    Regions (ID: 111)
        Region 1 (ID: 221)
            A
            B
        Region 2 (ID: 222)
            D
My code worked on Lucene, but now doesn't on Solr. It should find all "global" content and a single region's content, excluding all other regions' content. So as an example, if the user's current region was Region 1, I'd want the query to return content X, Y, Z, A, B.
Sitecore's Item Crawler has a field for each item in the index called "_path" which is a multivalued string field of IDs, so as an example, Region 1's _path field value would be [000, 111, 221 ].
When I write this using the Linq abstraction, it comes out as below, which doesn't return any results.
-_path:(111) OR _path:(221)
But _path:(111) does return results. Mind blown.
When I use the Solr interface and wrap each side of the OR in extra brackets like below (which I'd consider redundant) it works! Mind blown v2.
(-_path:(111)) OR (_path:(221))
Firstly, what's the difference between those queries?
Secondly, my real problem is that I can't add these extra brackets, as I'm working through the Linq abstraction, where the brackets will be "optimized" out.
Any advice would be awesome! Cheers.
The problem here is that Lucene's negative queries don't work the way you think they do. They only remove results from what has been found. -_path:111 doesn't find all documents which aren't in 111; it doesn't find anything at all. It only removes results. So you are finding all results with path "221", then removing any that also have path "111", which, from your hierarchy, I assume is all of them. See my answer here for a bit more on that topic.
The OR makes it seem like it ought to work, but really -_path:(111) OR _path:(221) is the same as -_path:(111) _path:(221). The moral here is: don't use Lucene's AND/OR/NOT syntax if you can help it. Use +/-. The +/- syntax actually expresses how the query operates; AND/OR/NOT doesn't. The AND/OR/NOT syntax attempts to shoehorn the query into a different, SQL-like retrieval model, which leads to unexpected behavior like this.
So, what about: (-_path:(111)) OR (_path:(221))
Well, first, does it actually work? Or does it just get some results?
If it gets only the same results as _path:221: the reason is that -_path:111 gets no results, so your query is, in practice, something like (nothing) OR (_path:221), which is equivalent to _path:221.
If it really does get the results you expect (I'm guessing it probably does): something is translating your query into something like (*:* -_path:111) (_path:221). Solr does have some logic along these lines, though I'm not quite sure it applies in this case. Essentially, it puts a match-all in front of any lonely negative queries it finds, allowing them to do what you were expecting. If the implicit *:* makes you nervous about performance, well, it should. Lucene is an inverted index, so it does well at finding matches on a term quickly. Getting everything that doesn't match goes against the grain of that retrieval model, and will pretty much have to do a full scan of the index.
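The difference between the two forms can be illustrated with a small set-based model in plain Java. This is only a sketch of the semantics described above, not Lucene or Solr code: the class, its methods, and the simplified _path values (which omit each item's own ID) are assumptions modelled on the tree in the question.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class NegativeQueryDemo {
    // Hypothetical index: item name -> simplified _path values.
    static Map<String, List<String>> index() {
        Map<String, List<String>> docs = new LinkedHashMap<>();
        docs.put("X", Arrays.asList("000"));
        docs.put("Y", Arrays.asList("000"));
        docs.put("Z", Arrays.asList("000"));
        docs.put("A", Arrays.asList("000", "111", "221"));
        docs.put("B", Arrays.asList("000", "111", "221"));
        docs.put("D", Arrays.asList("000", "111", "222"));
        return docs;
    }

    // Items whose _path contains the given ID (a positive term query).
    static Set<String> matching(String pathId) {
        Set<String> hits = new LinkedHashSet<>();
        for (Map.Entry<String, List<String>> doc : index().entrySet()) {
            if (doc.getValue().contains(pathId)) {
                hits.add(doc.getKey());
            }
        }
        return hits;
    }

    // Models -_path:111 _path:221: find 221, then remove anything in 111.
    static Set<String> flatQuery() {
        Set<String> hits = matching("221");
        hits.removeAll(matching("111"));
        return hits;
    }

    // Models (*:* -_path:111) (_path:221): everything outside 111, plus 221.
    static Set<String> nestedQuery() {
        Set<String> hits = new LinkedHashSet<>(index().keySet()); // the *:* part
        hits.removeAll(matching("111"));
        hits.addAll(matching("221"));
        return hits;
    }

    public static void main(String[] args) {
        System.out.println("flat:   " + flatQuery());
        System.out.println("nested: " + nestedQuery());
    }
}
```

In this model the flat query returns nothing, because every item under Region 1 is also under Regions (111), while the nested form returns X, Y, Z, A, B.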

ElasticSearch [1.4.4]: Range search on numeric index with non-numeric input

I am using the Elasticsearch Java API to create indexes and write search queries.
The indexes are created on various fields. One of the indexed fields is numeric (integer).
The input we get is a string, and we have to search all fields for it. To search the numeric field we are using the
QueryBuilders.rangeQuery() method.
But when it encounters a non-integer value in the "to" or "from" field, it throws
SearchPhaseExecutionException[Failed to execute phase [query].
nested: NumberFormatException[For input string: \"30y\"]
How can I avoid this? It's fine if we do not get any search results, but I want to avoid this exception, as there can be cases where we get non-integer input.
Another option is to check all the input tokens, which I want to avoid because it adds another level of checking that will impact performance.
Is there any way I can accomplish this with the Elasticsearch API?
"Another option is to check all the input tokens, which I want to avoid because it will add another level of check which will impact performance."
Checking and validating user input is something you should always do, whatever your performance requirements are. If you don't, you unnecessarily expose your cluster to unknown future threats, and also to known ones that have been causing damage lately and that can have a much worse impact on your cluster and/or business than a few milliseconds spent cleaning up user input. Elasticsearch is flexible and can do wonders, but you have to play nice with it, too.
That being said, if you really want to avoid secure coding best practices, you can use the following query that won't bark if the input data is not compliant.
{
"query": {
"simple_query_string": {
"query": "numfield:[10y TO *]"
}
}
}
simple_query_string is the equivalent of query_string but is much more permissive with the input and will never throw an exception.
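For the validation route recommended earlier in this answer, a minimal sketch in plain Java could look like the following. The class and method names are hypothetical; the idea is simply to parse each bound before it ever reaches QueryBuilders.rangeQuery(), and to skip the numeric clause when parsing fails.

```java
import java.util.OptionalInt;

class RangeInput {
    // Returns the parsed integer, or an empty OptionalInt for input such as "30y".
    static OptionalInt parseBound(String raw) {
        if (raw == null) {
            return OptionalInt.empty();
        }
        try {
            return OptionalInt.of(Integer.parseInt(raw.trim()));
        } catch (NumberFormatException e) {
            return OptionalInt.empty(); // not an integer: caller skips the range clause
        }
    }

    public static void main(String[] args) {
        System.out.println(parseBound("30").isPresent());  // valid bound
        System.out.println(parseBound("30y").isPresent()); // invalid bound
    }
}
```

The caller would add the range clause for the numeric field only when both bounds are present, and search the remaining fields as before.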

Trouble with facet counts

I'm attempting to use ElasticSearch for analytics, specifically to track "top content" for a hand-rolled Rails CMS. The requirement is quite a bit more complicated than keeping a counter for each piece of content. I won't get into the details of the problem right now, as I can't seem to get even the basics working.
My problem is this: I'm using facets and the counts aren't what I expect them to be. For example:
Query:
{"facets":{"el_ids":{"terms":{"field":"el_id","size":1,"all_terms":false,"order":"count"}}}}
Result:
{"el_ids":{"_type":"terms","missing":0,"total":16672,"other":16657,"terms":[{"term":"quis","count":15}]}}
OK, great: the piece of content with id "quis" had 15 hits, and since the order is count, it should be my top piece of content. Now let's get the top 5 pieces of content.
Query:
{"facets":{"el_ids":{"terms":{"field":"el_id","size":5,"all_terms":false,"order":"count"}}}}
Result (just the facet):
[
{"term":"qgz9","count":26},
{"term":"quis","count":15},
{"term":"hnqn","count":15},
{"term":"higp","count":15},
{"term":"csns","count":15}
]
Huh? So the piece of content with id "qgz9" had more hits, with 26? Why wasn't it the top result in the first query?
OK, let's get the top 100 now.
Query:
{"facets":{"el_ids":{"terms":{"field":"el_id","size":100,"all_terms":false,"order":"count"}}}}
Results (just the facet):
[
{"term":"qgz9","count":43},
{"term":"difc","count":37},
{"term":"zryp","count":31},
{"term":"u65r","count":31},
{"term":"sxsi","count":31},
...
]
So now "qgz9" has 43 hits instead of 26? How can that be? I can assure you there's nothing happening in the background modifying the index. If I repeat these queries, I get the same results.
As I repeat this process of increasing the result size, counts continue to change and new content ids emerge at the top. Can someone explain to me what I'm doing wrong or where my understanding of how this works is flawed?
It turns out that this is a known issue:
...the way top N facets work now is by getting the top N from each shard, and merging the results. This can give inaccurate results.
By default, my index was being created with 5 shards. By changing this so the index only has a single shard, the counts behave in line with my expectations. Another workaround would be to always set size to a value greater than the number of expected facets and peel off the top N results.
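The effect of taking the top N from each shard and merging can be reproduced with a small model in plain Java. This is a sketch of the behavior quoted above, not Elasticsearch code; the two shards and their term counts are made-up data chosen so that the true top term is never first on any single shard.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ShardedTopN {
    // Top n entries of one count table, highest count first, ties broken by term.
    static LinkedHashMap<String, Integer> topN(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        LinkedHashMap<String, Integer> top = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : entries.subList(0, Math.min(n, entries.size()))) {
            top.put(e.getKey(), e.getValue());
        }
        return top;
    }

    // Each shard reports only its own top n; the merged counts miss the rest.
    static LinkedHashMap<String, Integer> mergedTopN(List<Map<String, Integer>> shards, int n) {
        Map<String, Integer> merged = new HashMap<>();
        for (Map<String, Integer> shard : shards) {
            topN(shard, n).forEach((term, count) -> merged.merge(term, count, Integer::sum));
        }
        return topN(merged, n);
    }

    // Hypothetical data: term "b" has 16 hits overall but is second on both shards.
    static List<Map<String, Integer>> sampleShards() {
        Map<String, Integer> shard1 = new HashMap<>();
        shard1.put("a", 10);
        shard1.put("b", 9);
        Map<String, Integer> shard2 = new HashMap<>();
        shard2.put("c", 8);
        shard2.put("b", 7);
        List<Map<String, Integer>> shards = new ArrayList<>();
        shards.add(shard1);
        shards.add(shard2);
        return shards;
    }

    public static void main(String[] args) {
        System.out.println(mergedTopN(sampleShards(), 1)); // {a=10}
        System.out.println(mergedTopN(sampleShards(), 2)); // {b=16, a=10}
    }
}
```

With size 1 the merged result reports "a" with 10 hits; with size 2 the term "b" appears at the top with 16. This is the same pattern as in the question, where a larger size brings new terms to the top and raises the counts of terms already seen.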

Data structure to support lookup based on full key or part of key

I need to be able to look up entries based on the full key or part of the key.
For example, I might store keys like (10,20,30,40), (11,12,30,40), (12,20,30,40).
I want to be able to search for 10,20,30,40 or 20,30,40.
What is the best data structure for achieving this? "Best" here means fastest lookup time.
Our programming language is Java. Any pointers to open source projects would be appreciated.
Thanks in advance.
If those were the actual numbers I'd be working with, I'd use an array where a given index contains an array of all records that contain the index. If the actual numbers were larger, I'd use a hash table employed the same way.
So the structure would look like (empty indexes elided, in the case of the array implementation):
10 => ((10,20,30,40)),
11 => ((11,12,30,40)),
12 => ((11,12,30,40), (12,20,30,40)),
20 => ((10,20,30,40), (12,20,30,40)),
30 => ((10,20,30,40), (11,12,30,40), (12,20,30,40)),
40 => ((10,20,30,40), (11,12,30,40), (12,20,30,40)),
It's not clear to me whether your searches are inclusive (OR-based) or exclusive (AND-based), but either way you look up the record groups for each element of the search set; for the inclusive search you find their union, and for the exclusive search you find their intersection.
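A minimal sketch of the hash-table variant of this structure, in Java, is shown below. The class and method names are hypothetical. Each key element maps to the set of full keys that contain it; an AND search intersects those sets and an OR search unions them.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class PartialKeyIndex {
    // element -> all full keys that contain that element
    private final Map<Integer, Set<List<Integer>>> byElement = new HashMap<>();

    void add(List<Integer> key) {
        for (Integer element : key) {
            byElement.computeIfAbsent(element, e -> new LinkedHashSet<>()).add(key);
        }
    }

    // Exclusive (AND) search: keys that contain every given element.
    Set<List<Integer>> containingAll(List<Integer> elements) {
        Set<List<Integer>> result = new LinkedHashSet<>();
        boolean first = true;
        for (Integer element : elements) {
            Set<List<Integer>> group =
                    byElement.getOrDefault(element, Collections.<List<Integer>>emptySet());
            if (first) {
                result.addAll(group);
                first = false;
            } else {
                result.retainAll(group); // intersection
            }
        }
        return result;
    }

    // Inclusive (OR) search: keys that contain at least one given element.
    Set<List<Integer>> containingAny(List<Integer> elements) {
        Set<List<Integer>> result = new LinkedHashSet<>();
        for (Integer element : elements) {
            result.addAll(byElement.getOrDefault(element, Collections.<List<Integer>>emptySet()));
        }
        return result;
    }

    public static void main(String[] args) {
        PartialKeyIndex index = new PartialKeyIndex();
        index.add(Arrays.asList(10, 20, 30, 40));
        index.add(Arrays.asList(11, 12, 30, 40));
        index.add(Arrays.asList(12, 20, 30, 40));
        System.out.println(index.containingAll(Arrays.asList(20, 30, 40)));
    }
}
```

Note that this matches on membership only; it does not check that the searched elements appear in the same order or position as in the stored key.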
Since you seem to care about retrieval time over other concerns (such as space), I suggest you use a hashtable and enter your items several times, once per subkey. So you'd put("10,20,30,40",mydata), then put("20,30,40",mydata) and so on (of course this would be wrapped in a method; you're not going to call put manually that many times).
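A rough Java sketch of this approach follows. It assumes that "subkey" means every suffix of the comma-separated key, which is what the put("20,30,40", ...) example suggests; the class name and the choice to keep a list of values per subkey are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SuffixKeyTable<V> {
    private final Map<String, List<V>> table = new HashMap<>();

    // Registers the value under the full key and under every suffix of it.
    void put(String fullKey, V value) {
        String[] parts = fullKey.split(",");
        for (int start = 0; start < parts.length; start++) {
            String suffix = String.join(",", Arrays.copyOfRange(parts, start, parts.length));
            table.computeIfAbsent(suffix, k -> new ArrayList<>()).add(value);
        }
    }

    // A single hash lookup, whether the key is full or partial.
    List<V> get(String key) {
        return table.getOrDefault(key, Collections.<V>emptyList());
    }

    public static void main(String[] args) {
        SuffixKeyTable<String> table = new SuffixKeyTable<>();
        table.put("10,20,30,40", "A");
        table.put("11,12,30,40", "B");
        table.put("12,20,30,40", "C");
        System.out.println(table.get("20,30,40"));
    }
}
```

Lookups stay a single hash probe, at the cost of storing each value once per suffix.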
Use a tree structure. Here is an open source project that might help ... written in Java :-)
http://suggesttree.sourceforge.net/
