Can I use an elasticsearch facet (or aggregate) to get e.g. the average of the maximum value in each document, rather than the average of all values across all documents?
While you can use a script_field for this, I'd recommend determining the maximum at index time and then storing it in a separate field to facet/aggregate on. It'll be faster, and the memory requirement for doing the facet will be much smaller.
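As a rough sketch of the index-time approach (the field names `values` and `max_value` and the helper `with_max` are made up for illustration, not taken from the question), you could enrich each document before indexing it and then run a plain `avg` aggregation on the new field:

```python
def with_max(doc, source_field="values", target_field="max_value"):
    """Return a copy of doc with the maximum of source_field stored in target_field.

    Both field names are hypothetical; use whatever your mapping defines.
    """
    enriched = dict(doc)
    values = doc.get(source_field) or []
    if values:
        # Computed once at index time, so no script runs at query time.
        enriched[target_field] = max(values)
    return enriched


# Request body for the search API: average of the per-document maxima.
avg_of_max_request = {
    "size": 0,
    "aggs": {"avg_of_max": {"avg": {"field": "max_value"}}},
}

doc_to_index = with_max({"id": 1, "values": [3, 9, 4]})
```

Documents with an empty array simply get no `max_value` field, so they do not take part in the average.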
I'm creating an application using MarkLogic 8 and the search API. I need to create facets based on MarkLogic-defined collections. However, instead of the facet count tallying the number of fragments (documents) that contain at least one occurrence of the searched keyword, I need the facet count to reflect the total number of times the keyword appears in all documents in the collection.
Right now, I'm using search:search() to process the query and return a search:response element with the facet option enabled.
In the MarkLogic documentation, I've been looking at cts:frequency() which says:
"If you want the total frequency instead of the fragment-based frequency (that is, the total number of occurences of the value in the items specified in the cts:query option of the lexicon API), you must specify the item-frequency option to the lexicon API value input to cts:frequency."
But, I can't get that to work.
I've tried running a query like this in query console, but it times out.
    cts:element-values(QName("http://www.tei-c.org/ns/1.0", "TEI"),
      "", "item-frequency",
      cts:and-query((
        fn:collection("KirchlicheDogmatik/volume4/part3"),
        cts:word-query("lehre"))))
The issue is probably that you have a range index on <TEI>, which contains the entire document. Range indexes are memory-mapped, so you have essentially forced the complete text contents of your database into memory. It's hard to say exactly what's going on, but it's probably struggling to inspect the values (range indexes are designed for smaller atomic values) and possibly swapping to disk.
MarkLogic has great documentation on its indexing, so I'd recommend starting there for a better understanding of how to use the indexes: https://docs.marklogic.com/guide/concepts/indexing#id_51573
Note that even using the item-frequency option, results (or counts) are not guaranteed to be one-to-one with the "total number of times the keyword appears." It will report the number of "items" matching - in your example it would report on the number of <TEI> elements matching.
The problem of getting an exact count of terms matching a query across the whole database is actually quite hard. To get exact matching values within a document, you would need to use cts:highlight or cts:walk, which requires loading the whole document into memory. That typically works fine for a subset of documents, but ultimately to get an accurate value for the entire database, you would need to load the entire database into memory and process every document.
Nearly any approach to getting a term match count requires some kind of approximation and depends heavily on your markup. For example, if you index <p> (or even better <s>) elements, it would be possible to construct a query that uses indexes to count the number of matching paragraphs (or sentences), but that would still load an incredibly large amount of data into memory and keep it there. This is technically feasible if you are willing to allocate enough memory (and/or enough servers), but it hardly seems worth it.
I am using Elasticsearch 6.2, and I have some queries that analyze a massive number of documents. I am sorting on one field in the index. Elasticsearch examines 10,000 documents (the default configuration value) and then returns them paginated.
I tried to read the documentation, but I cannot find any information on whether the database applies the sorting before or after it analyzes the documents from the index.
In other words, is the sort applied directly during the index analysis, or are the documents sorted once they have been analyzed? If the latter is correct, what kind of sort does Elasticsearch apply during the scan?
Thanks a lot.
Sorting, aggregations, and access to field values in scripts requires
a different data access pattern. Instead of looking up the term and
finding documents, we need to be able to look up the document and find
the terms that it has in a field.
This quote from the Elasticsearch reference documentation implies to me that sorting happens on the non-analyzed level, but I also decided to double-check and run some tests.
Elasticsearch can sort on non-analyzed fields, e.g. keyword. Those fields use doc values for sorting, and after testing I can say that it sorts on the original, un-analyzed values according to the codes representing the characters (digits, then uppercase letters, then lowercase letters).
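To illustrate the ordering I observed, here is a small Python stand-in (this does not call Elasticsearch; it only assumes, as my tests suggested, that keyword values are compared by their character codes, which for UTF-8 gives the same order as comparing the encoded bytes):

```python
# Hypothetical keyword values, as they might be stored in a keyword field.
values = ["banana", "Apple", "10", "apple", "Banana"]

# Python compares str values by code point: digits < uppercase < lowercase.
by_code_point = sorted(values)

# Comparing the UTF-8 encoded bytes gives the same order.
by_utf8_bytes = sorted(values, key=lambda s: s.encode("utf-8"))

print(by_code_point)
```

So "Banana" sorts before "apple", which is usually the first thing that surprises people when they sort on a keyword field.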
It's also possible to sort on text fields, with some caveats and tuning (for example, you need to enable fielddata, since text fields do not support doc_values).
In this case the documents are sorted according to the analyzed values. Of course, a lot depends on the analysis pipeline, since it can transform the text in various ways. Also, just as a reminder:
Fielddata can consume a lot of heap space, especially when loading
high cardinality text fields. Once fielddata has been loaded into the
heap, it remains there for the lifetime of the segment. Also, loading
fielddata is an expensive process which can cause users to experience
latency hits. This is why fielddata is disabled by default.
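For reference, a minimal sketch of the two setups, written as Python dicts that you would send as request bodies. The type name `doc`, the field name `title` and the sub-field name `raw` are placeholders I picked, not anything from the question:

```python
def sortable_title_mapping(enable_fielddata=False):
    """Index-creation body with a text field and a keyword sub-field (6.x style, with a mapping type)."""
    title = {
        "type": "text",
        # Keyword sub-field: sorted via doc values on the original, un-analyzed value.
        "fields": {"raw": {"type": "keyword"}},
    }
    if enable_fielddata:
        # Needed only if you want to sort on the analyzed text field itself; costs heap.
        title["fielddata"] = True
    return {"mappings": {"doc": {"properties": {"title": title}}}}


def sort_request(field, order="asc"):
    """Search body that sorts all documents on the given field."""
    return {"query": {"match_all": {}}, "sort": [{field: {"order": order}}]}


mapping = sortable_title_mapping()
by_keyword = sort_request("title.raw")  # un-analyzed values
by_text = sort_request("title")         # analyzed values, requires fielddata
```

Unless you really need the analyzed ordering, sorting on the keyword sub-field avoids the fielddata cost described in the quote.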
A field in an index I'm building gets regularly appended to. I'd like to be able to query elasticsearch to count the number of current items. Is it possible to query for the length of an array field? I can write something to bring back the field and count the items but some of them have a large number of entries and so am looking for something that is done in place in ES.
Here's what I recommend. Perform a terms aggregation over _uid. Then perform another aggregation over all the fields in the array field and sum the doc_counts.
But how well this works really depends on the number of records you need to query; over a large number of records it can be an expensive operation.
Another option is to store the count of array elements in a separate field and query it directly. Given that you are already storing large arrays in your documents, an extra integer field for the size seems a fair trade-off. In case you need the count to filter records, I would recommend using a scripted filter as explained here
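A minimal sketch of the "store the count" option, assuming hypothetical field names `items` for the array and `item_count` for the stored size. The helper runs in your application right before you index or update the document:

```python
def with_item_count(doc, array_field="items", count_field="item_count"):
    """Return a copy of doc with the current array length stored next to the array.

    Field names are illustrative only.
    """
    enriched = dict(doc)
    enriched[count_field] = len(doc.get(array_field) or [])
    return enriched


def min_count_request(minimum, count_field="item_count"):
    """Search body that keeps only documents whose stored count is at least minimum."""
    return {
        "query": {
            "bool": {"filter": [{"range": {count_field: {"gte": minimum}}}]}
        }
    }


doc_to_index = with_item_count({"id": 7, "items": ["a", "b", "c"]})
request_body = min_count_request(2)
```

The count then comes back with the document like any other field, and filtering on it is an ordinary range filter rather than a script. The catch is that every append to the array has to update the count as well.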
It is common to have elastic indices with dates, in particular from something like logstash.
So for example, you have indices like foo-2016.05.01, foo-2016.05.02, etc...
When doing a time range query for data, what is the cost of querying indexes that I already know won't have data for that time range?
For example, suppose the time range query only asks for data from 2016.05.02, but I also include the foo-2016.05.01 index in my query.
Is that basically a quick one-op per index where the index knows it has no data in that date range, or will doing this be costly to performance? I'm hoping not only to know the yes/no answer, but to understand why it behaves the way it does.
Short version: it's probably going to be expensive. The cost will be n where n is the number of distinct field values for the date data. If all entries in the index had an identical date field value, it'd be a cheap query of 1 check (and would be pointless since it'd be a binary "all or nothing" response at that point). Of course, the reality is usually that every single doc has a unique date field value (which is incrementing such as in a log), depending on how granular the date is (assuming here that the time is included to seconds or milliseconds). Elasticsearch will check each aggregated, unique date field value of the included indices to try and find documents that match on the field by satisfying the predicates of the range query. This is the nature of the inverted index (indexing documents by their fields).
An easy way to improve performance is to change the Range Query to a Range Filter, which caches results and improves performance for requests beyond the first one. Of course, this is only valuable if you're repeating the same range filter over time (the cache is read more than it is written), and if the range is not part of scoring the documents (that is to say, those in range are not more valuable than those not in range when returning a set of both - also known as "boosting").
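In newer versions the "filter" form is written as a range clause inside the filter section of a bool query. A sketch of such a request body, built in Python, might look like this (the field name `@timestamp` is an assumption based on the usual logstash default; substitute your own date field):

```python
from datetime import date, timedelta


def day_filter_request(day, field="@timestamp"):
    """Search body that selects one calendar day in filter context (no scoring)."""
    next_day = day + timedelta(days=1)
    return {
        "query": {
            "bool": {
                "filter": [
                    # Half-open interval: from the start of day up to, but not including, the next day.
                    {"range": {field: {"gte": day.isoformat(), "lt": next_day.isoformat()}}}
                ]
            }
        }
    }


request_body = day_filter_request(date(2016, 5, 2))
```

Because the clause is in filter context it does not contribute to the score, which is what allows its result to be cached and reused.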
Another way to improve performance is by convention. If you query by day, store each day in its own rolling index and then do pre-search logic to select the indexes to query. This eliminates the need for the filter or query entirely.
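The pre-search logic can be very small. Here is a sketch, assuming daily indices named like the foo-2016.05.01 example in the question (the helper name `indices_for_range` is mine):

```python
from datetime import date, timedelta


def indices_for_range(prefix, start, end):
    """List the daily index names that cover the inclusive date range start..end."""
    if end < start:
        return []
    days = (end - start).days
    return [
        f"{prefix}-{(start + timedelta(days=i)).strftime('%Y.%m.%d')}"
        for i in range(days + 1)
    ]


names = indices_for_range("foo", date(2016, 5, 1), date(2016, 5, 3))
# Comma-separated form, as used in the index part of a search URL.
target = ",".join(names)
```

If some days might have no index at all, the search API's `ignore_unavailable` parameter is worth a look so that a missing index does not fail the whole request.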
Elasticsearch doesn't care about the index name (that includes the date) and it doesn't automagically exclude that index from your range query. It will query all the shards (a copy - be it replica or primary) of all the indices specified in the query. Period.
Kibana, on the other hand, knows based on the time range selected to query specific indices only.
If you know your range will not make sense on some indices, then exclude those from the query before creating the query.
A common approach for the logging use case, when the current day is the one most frequently queried, is to create an alias. Give it a meaningful name - like today - that will always point to today's index. Another common need with time-based indices is a retention period. For these two tasks - managing the aliases and deleting the "expired" indices - you can use Curator.
If you mostly care about the current day, query that alias, and the days before today are left out automatically.
If not, then work out from the range which indices to query before you run the query.
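If you'd rather move the alias yourself than use Curator, the body for the `_aliases` endpoint could be built roughly like this. This is a sketch: it assumes the daily naming from the question and assumes the alias currently sits on yesterday's index; the helper name is made up:

```python
from datetime import date, timedelta


def roll_today_alias(prefix, today, alias="today"):
    """Body for POST /_aliases that moves the alias from yesterday's index to today's."""
    fmt = "%Y.%m.%d"
    yesterday = today - timedelta(days=1)
    return {
        "actions": [
            # Assumes the alias is currently attached to yesterday's index.
            {"remove": {"index": f"{prefix}-{yesterday.strftime(fmt)}", "alias": alias}},
            {"add": {"index": f"{prefix}-{today.strftime(fmt)}", "alias": alias}},
        ]
    }


body = roll_today_alias("foo", date(2016, 5, 2))
```

Both actions go in one request, so there is no moment where the alias points at nothing or at two days at once.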
Elasticsearch takes the length of a document into account when ranking (they call this field normalization). The default behavior is to rank shorter matching documents higher than longer matching documents.
Is there any way to turn off or modify field normalization at query time? I am aware of the index-time omit_norms option, but I would prefer not to reindex everything just to try this out.
Also, instead of simply turning off field normalization, I wanted to try out a few things. I would like to take field length into account, but not as heavily as elasticsearch currently does. With the default behavior, a document will rank 2 times higher than a document which is two times longer. I wanted to try a non-linear relationship between ranking and length.