Can elasticsearch create histograms by X occurrences of a field?

I'm not seeing how this would be done, but is it possible to have a facet that uses an interval to give stats for every X occurrences of a field? As an example, if the field net held a sequence of numbers ordered by date like:
1,2,3,4,5,6,7
and I set the interval to 2, I would like to get back a histogram like:
count: 2, value: 3
count: 2, value: 7
count: 2, value: 11
...

Elasticsearch doesn't support such an operation out of the box. It's possible to write such a facet, but it's not very practical, since it would require writing a quite complex custom facet processor and possibly controlling the way records are split into shards (so-called routing).
In elasticsearch, any operation that relies on the global order of elements is somewhat problematic from the architectural perspective. Elasticsearch splits records into shards, and most operations, including searching and facet calculation, occur on shards; the results of these shard-level operations are then collected and merged into a global result. This is basically a map/reduce architecture, and it is the key to elasticsearch's horizontal scalability.
An optimal implementation of your facet would require changing routing in such a way that records are split into shards based on their order rather than on the hash code of their id. Alternatively, it could be done by limiting the shard-level phase to just extracting the field values and performing the actual facet calculation in the merge phase. The latter approach seems more practical, but at the same time it is not much different from simply extracting the field values for all records and doing the calculation on the client side, which is exactly what I would suggest doing here. Just extract all values using the desired sort order and calculate all stats on the client. If the number of records in your index is large, you can use the Scroll API to retrieve all records over multiple requests.
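For example, here is a minimal client-side sketch of that approach, assuming an index named myindex with fields date and net (all names here are hypothetical) and using the Scroll API over plain HTTP:

import requests

ES = "http://localhost:9200"
INTERVAL = 2  # X: how many consecutive values go into one bucket

# Open a scroll context, sorted by date, fetching only the "net" field.
resp = requests.post(f"{ES}/myindex/_search", params={"scroll": "2m"}, json={
    "size": 1000,
    "sort": [{"date": "asc"}],
    "_source": ["net"],
}).json()

values = []
while resp["hits"]["hits"]:
    values.extend(hit["_source"]["net"] for hit in resp["hits"]["hits"])
    # Fetch the next batch for the same scroll context.
    resp = requests.post(f"{ES}/_search/scroll", json={
        "scroll": "2m",
        "scroll_id": resp["_scroll_id"],
    }).json()

# Sum every INTERVAL consecutive values into one histogram bucket.
buckets = [
    {"count": len(chunk), "value": sum(chunk)}
    for chunk in (values[i:i + INTERVAL] for i in range(0, len(values), INTERVAL))
]
print(buckets)  # e.g. [{'count': 2, 'value': 3}, {'count': 2, 'value': 7}, ...]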

Related

Elastic search calculation with data from different indexes

Good day, everyone. I have a somewhat strange use case for Elasticsearch.
There are two different indexes, and each index contains one data type.
The first type contains the following fields relevant to this case:
keyword (text, keyword),
URL (text, keyword),
position (number).
The second type contains these fields:
keyword (text, keyword),
numberValue (number).
I need to do the following:
1. Group the data from the first index by URL.
2. For each object in a group, calculate a new metric (metric A) with this simple formula: position * numberValue * Param.
3. For each group, calculate the sum of the metric A values computed in step 2.
4. Order the resulting groups descending by the sums from step 3.
5. Take some interval of the resulting groups.
Param is a parameter I need to supply for the calculation; it is not stored in Elasticsearch.
The algorithm is not difficult, but the data lives in different indices, and I don't know how to do this fast; I would prefer to do it at the Elasticsearch level.
I don't know how to build an efficient search or data-processing pipeline for this case.
I use ES version 6.2.3, if that is important.
Please give me some advice on how to implement this algorithm.
Reading point 2, you seem to assume keyword is some sort of primary key. Elasticsearch is not an RDB and can only reason over one document at a time, so unless numberValue and position are (indexed) fields of the same document, you can't combine them.
The rest of the items should be achievable with the help of aggregations; see the sketch below.
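To illustrate, here is a hedged sketch of steps 1-5 once position and numberValue live on the same document (the merged-index name, the field names, and the Param value are all assumptions, not your actual mapping):

import requests

resp = requests.post("http://localhost:9200/merged-index/_search", json={
    "size": 0,
    "aggs": {
        "by_url": {
            "terms": {
                "field": "URL.keyword",
                "size": 50,                          # step 5: take an interval of groups
                "order": {"metric_a_sum": "desc"},   # step 4: order groups by the sum
            },
            "aggs": {
                "metric_a_sum": {                    # steps 2-3: sum position*numberValue*Param
                    "sum": {
                        "script": {
                            "source": "doc['position'].value * doc['numberValue'].value * params.Param",
                            "params": {"Param": 1.5},  # hypothetical Param value
                        }
                    }
                }
            }
        }
    }
}).json()

for bucket in resp["aggregations"]["by_url"]["buckets"]:
    print(bucket["key"], bucket["metric_a_sum"]["value"])

Since Param is injected as a script parameter, you can change it per request without reindexing.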

Is there a way to change the Search API facet count to show a total word count instead of the count of matching fragments (documents)?

I'm creating an application using MarkLogic 8 and the Search API. I need to create facets based on MarkLogic-defined collections, but instead of the facet count giving a tally of the number of fragments (documents) that contain the searched keyword (however many times it occurs), I need the facet count to reflect the total number of times the keyword appears across all documents in the collection.
Right now, I'm using search:search() to process the query and return a response element with the facet option enabled.
In the MarkLogic documentation, I've been looking at cts:frequency(), which says:
"If you want the total frequency instead of the fragment-based frequency (that is, the total number of occurrences of the value in the items specified in the cts:query option of the lexicon API), you must specify the item-frequency option to the lexicon API value input to cts:frequency."
But, I can't get that to work.
I've tried running a query like this in query console, but it times out.
cts:element-values(
  fn:QName("http://www.tei-c.org/ns/1.0", "TEI"),
  "",
  "item-frequency",
  cts:and-query((
    cts:collection-query("KirchlicheDogmatik/volume4/part3"),
    cts:word-query("lehre"))))
The issue is probably that you have a range index on <TEI>, which contains the entire document. Range indexes are memory-mapped, so you have essentially forced the complete text contents of your database into memory. It's hard to say exactly what's going on, but it's probably struggling to inspect the values (range indexes are designed for smaller atomic values) and possibly swapping to disk.
MarkLogic has great documentation on its indexing, so I'd recommend starting there for a better understanding on how to use them: https://docs.marklogic.com/guide/concepts/indexing#id_51573
Note that even using the item-frequency option, results (or counts) are not guaranteed to be one-to-one with the "total number of times the keyword appears." It will report the number of "items" matching; in your example it would report the number of <TEI> elements matching.
The problem of getting an exact count of terms matching a query across the whole database is actually quite hard. To get exact matching values within a document, you would need to use cts:highlight or cts:walk, which requires loading the whole document into memory. That typically works fine for a subset of documents, but ultimately to get an accurate value for the entire database, you would need to load the entire database into memory and process every document.
Nearly any approach to getting a term match count requires some kind of approximation and depends heavily on your markup. For example, if you index <p> (or even better <s>) elements, it would be possible to construct a query that uses indexes to count the number of matching paragraphs (or sentences), but that would still load an incredibly large amount of data into memory and keep it there. This is technically feasible if you are willing to allocate enough memory (and/or enough servers), but it hardly seems worth it.

Kibana, filter on count greater than or equal to X

I'm using Kibana to visualize some (Elasticsearch) data but I'd like to filter out all the results with "Count" less than 1000 (X).
I am using a Y-axis with a "count" aggregation; this is the count I'd like to filter on. I tried adding in a min_document_count as suggested by several online resources, but this didn't change anything. Any help would be greatly appreciated.
My entire Kibana "data" tab: [screenshot omitted]
Using min_doc_count with order: ascending does not work as you would expect.
TL;DR: Increasing shard_size and/or shard_min_doc_count should do the trick.
Why the aggregation is empty
As stated by the documentation:
"The min_doc_count criterion is only applied after merging local terms statistics of all shards."
This means that when you use a terms aggregation with the parameters size and min_doc_count and ascending order, Elasticsearch retrieves the size least frequent terms in your data set and then filters this list to keep only the terms with doc_count >= min_doc_count.
If you want an example, given this dataset:
terms | doc_count
----------------
lorem | 3315
ipsum | 2487
dolor | 1484
sit | 1057
amet | 875
conse | 684
adip | 124
elit | 86
If you perform the aggregation with size=3 and min_doc_count=100, Elasticsearch will first compute the 3 least frequent terms:
conse: 684
adip : 124
elit : 86
and then filter for doc_count >= 100, so the final result would be:
conse: 684
adip : 124
Even though you would expect "amet" (doc_count=875) to appear in the list, Elasticsearch loses this term while computing the shard results and cannot retrieve it at the end.
In your case, you have so many terms with doc_count<1000 that they fill your list; then, after the filtering phase, you have no results left.
Why is Elasticsearch behaving like this?
Everybody would like to apply a filter and then sort the results. We were able to do that with older datastores, and it was nice. But Elasticsearch is designed to scale, so by default it turns off some of the magic that was used before.
Why? Because with large datasets it would break.
For instance, imagine that you have 800,000 different terms in your index and the data is distributed over several shards, which may themselves be distributed over different machines (at most one machine per shard).
When requesting terms with doc_count>1000, each machine has to compute several hundred thousand counters (more than 200,000, since some occurrences of a term can be in one shard, others in another, etc.). And even if a shard has seen a term only once, that term may have been seen 999 times by the other shards, so no shard can drop the information before the results are merged. That means more than 1 million counters have to be sent over the network. This is quite heavy, especially if it is done often.
So, by default, Elasticsearch will:
Compute doc_count for each term in each shard.
Not apply a filter on doc_count at the shard level (a loss in terms of speed and resource usage, but better for accuracy): no shard_min_doc_count.
Send the size * 1.5 + 10 (shard_size) terms to one node. These will be the least frequent terms if the order is ascending, the most frequent otherwise.
Merge the counters on this node.
Apply the min_doc_count filter.
Return the size most/least frequent results.
Could it be simple for once?
Yes, sure. As I said, this behavior is only the default; if you do not have a huge dataset, you can tune those parameters :)
Solution
If you are not OK with any loss of accuracy:
Increase the shard_size parameter to be greater than [your number of terms with a doc_count below your threshold] + [the number of values you want if you want exact results].
If you want all the results with doc_count>=1000, set it to the cardinality of the field (the number of different terms); but then I do not see the point of order: ascending.
This has a massive memory impact if you have many terms, and a network impact if you have multiple ES nodes.
If you are OK with some (often minor) loss of accuracy:
Set shard_size between that sum and [the number of values you want if you want exact results]. This is useful if you want more speed or if you do not have enough RAM to perform the exact computation. The right value depends on your dataset.
Use the shard_min_doc_count parameter of the terms aggregation to partially pre-filter the less frequent values. This is an efficient way to filter your data, especially if the terms are randomly distributed between your shards (the default) and/or you do not have a lot of shards. (See the request sketch below.)
You can also put all your data in one shard. There is no loss of accuracy, but it is bad for performance and scaling. Still, you may not need the full power of ES if you have a small dataset.
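For instance, here is a hedged sketch of the underlying terms aggregation request with both knobs set (the index name my-index and the field my_field.keyword are hypothetical, and the values are only a starting point):

import requests

resp = requests.post("http://localhost:9200/my-index/_search", json={
    "size": 0,
    "aggs": {
        "by_term": {
            "terms": {
                "field": "my_field.keyword",
                "size": 50,
                "order": {"_count": "asc"},
                "min_doc_count": 1000,        # final filter, applied after the merge
                "shard_size": 5000,           # how many terms each shard reports
                "shard_min_doc_count": 1000,  # pre-filter rare terms on each shard
            }
        }
    }
}).json()
print(resp["aggregations"]["by_term"]["buckets"])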
NB: Ordering a terms aggregation by ascending doc count is deprecated (because being accurate costs a lot in time and hardware), and it will most likely be removed in the future.
PS: You should add the Elasticsearch request generated by Kibana; it is often useful when Kibana returns data, but not the data you want. You can find it in the "Request" tab when you click on the arrow that should be below your graph in your screenshot (ex: http://imgur.com/a/dMCWE).

Performance of Elastic Time Range Queries against Out of Range Indexes

It is common to have elastic indices with dates, in particular from something like logstash.
So for example, you have indices like foo-2016.05.01, foo-2016.05.02, etc...
When doing a time range query for data, what is the cost of querying indexes that I already know won't have data for that time range?
So for example, suppose a time range query only asks for data from 2016.05.02, but I also include the foo-2016.05.01 index in my query.
Is that basically a quick single operation per index, where the index knows it has no data in that date range, or will doing this be costly to performance? I'm hoping not only to learn the yes/no answer, but to understand why it behaves the way it does.
Short version: it's probably going to be expensive. The cost will be n, where n is the number of distinct field values for the date data. If all entries in the index had an identical date field value, it'd be a cheap query of one check (and a pointless one, since it'd be a binary "all or nothing" response at that point). In reality, every single doc usually has a unique date field value (which is incrementing, such as in a log), depending on how granular the date is (assuming here that the time is included to seconds or milliseconds). Elasticsearch will check each aggregated, unique date field value of the included indices to try to find documents that match on the field by satisfying the predicates of the range query. This is the nature of the inverted index (indexing documents by their fields).
An easy way to improve performance is to change the Range Query to a Range Filter, which caches results and improves performance for requests beyond the first one. Of course, this is only valuable if you're repeating the same range filter over time (the cache is read more than it is written), and if the range is not part of scoring the documents (that is to say, those in range are not more valuable than those not in range when returning a set of both - also known as "boosting").
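Note that in later Elasticsearch versions, filters were folded into the bool query, so the cache-friendly, non-scoring form of this is a range clause inside a bool filter. A hedged sketch, with index and field names assumed:

import requests

resp = requests.post("http://localhost:9200/foo-2016.05.02/_search", json={
    "query": {
        "bool": {
            # "filter" clauses do not influence scoring and can be cached,
            # unlike the same range placed in a scoring "must" clause.
            "filter": [
                {"range": {"@timestamp": {
                    "gte": "2016-05-02T00:00:00",
                    "lt": "2016-05-03T00:00:00",
                }}}
            ]
        }
    }
}).json()
print(resp["hits"]["total"])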
Another way to improve performance is by convention. If you query by day, store each day in its own rolling index and then do pre-search logic to select the indexes to query. This eliminates the need for the filter or query entirely.
Elasticsearch doesn't care about the index name (even though it includes the date), and it doesn't automagically exclude that index from your range query. It will query all the shards (one copy of each, be it replica or primary) of all the indices specified in the query. Period.
Kibana, on the other hand, knows based on the time range selected to query specific indices only.
If you know your range will not make sense on some indices, then exclude those from the query before creating the query.
A common approach for the logging use case, when the current day is queried most frequently, is to create an alias. Give it a significant name, like today, that always points to today's index. Another common concern with time-based indices is the retention period. For these two tasks, managing the aliases and deleting the expired indices, you can use Curator.
If most of the time you care about the current day, use that alias and you get rid of the days before today.
If not, then filter the indices to be queried based on the range before deciding which indices to run the query on.
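As a simple illustration of that pre-search filtering, here is a sketch that builds the list of daily index names for a requested range before sending the query (the foo- prefix comes from the question; the endpoint and field name are assumptions):

from datetime import date, timedelta
import requests

def indices_for_range(prefix, start, end):
    """Yield one daily index name, e.g. foo-2016.05.02, per day in [start, end]."""
    day = start
    while day <= end:
        yield f"{prefix}-{day.strftime('%Y.%m.%d')}"
        day += timedelta(days=1)

# Query only the indices that can contain the requested time range.
names = ",".join(indices_for_range("foo", date(2016, 5, 2), date(2016, 5, 2)))
resp = requests.post(f"http://localhost:9200/{names}/_search", json={
    "query": {"range": {"@timestamp": {
        "gte": "2016-05-02T00:00:00", "lt": "2016-05-03T00:00:00"}}}
}).json()
print(resp["hits"]["total"])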

Paging in Elasticsearch when results have equal scores

Is it possible to implement reliable paging of elasticsearch search results if multiple documents have equal scores?
I'm experimenting with custom scoring in elasticsearch. Many of the scoring expressions I try yield result sets where many documents have equal scores. They seem to come in the same order each time I try, but can it be guaranteed?
AFAIU it can't, especially not if there is more than one shard in the cluster. Documents with equal scores with respect to a given elasticsearch query are returned in a random, non-deterministic order that can change between invocations of the same query, even if the underlying database does not change (and therefore paging is unreliable), unless one of the following holds:
I use function_score to guarantee that the score is unique for each document (e.g. by using a unique number field).
I use sort and guarantee that the sorting defines a total order (e.g. by using a unique field as fallback if everything else is equal).
Can anyone confirm (and maybe point at some reference)?
Does this change if I know that there is only one primary shard without any replicas (see this other, similar question: Inconsistent ordering of results across primary/replica for documents with equivalent score)? E.g. if I guarantee that there is one shard AND there is no change in the database between two invocations of the same query, will that query return results in the same order?
What are other alternatives (if any)?
I ended up using an additional sort in cases where equal scores are likely to happen, for example when searching by product category. This additional sort could be an id, a creation date, or similar. The setup is 2 servers, 3 shards and 1 replica.
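For reference, a minimal sketch of such a tiebreaker sort, assuming an index named products with a unique id field (both names hypothetical); adding a unique fallback field makes the sort a total order, so pages stay stable between invocations:

import requests

resp = requests.post("http://localhost:9200/products/_search", json={
    "query": {"match": {"category": "shoes"}},  # hypothetical query
    "sort": [
        "_score",        # primary: relevance, descending by default
        {"id": "asc"},   # tiebreaker: a unique field yields a total order
    ],
    "from": 0,
    "size": 20,          # page through results with from/size as usual
}).json()
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_id"])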
