In the Elasticsearch documentation for the Cardinality Aggregation, under the heading "Pre-computed hashes", I see the following:
On string fields that have a high cardinality, it might be faster to
store the hash of your field values in your index and then run the
cardinality aggregation on this field. This can either be done by
providing hash values from client-side or by letting Elasticsearch
compute hash values for you by using the mapper-murmur3 plugin.
Pre-computing hashes is usually only useful on very large and/or
high-cardinality fields as it saves CPU and memory. However, on
numeric fields, hashing is very fast and storing the original values
requires as much or less memory than storing the hashes. This is also
true on low-cardinality string fields, especially given that those
have an optimization in order to make sure that hashes are computed at
most once per unique value per segment.
I'm curious about the part where it says, "[this can be done] by providing hash values from client-side," because it doesn't elaborate at all on that point, but goes on to discuss numeric fields.
If I wanted to pre-compute hashes on the client, would using something like xxhash and putting the result in an appropriate number field be sufficient? (And, of course, having cardinality target that field.) Or would I need to use another type of field for the hash value?
Pre-computing hashes for high-cardinality string fields will speed up the cardinality aggregation, because hashes don't have to be computed in real-time. No need to do it on numeric fields, though!
For string fields, they advise using the mapper-murmur3 plugin. The plugin stores the hash of each value in a dedicated murmur3 sub-field (not a regular numeric field), and it is that hashed sub-field you then target in your cardinality aggregation.
I've personally seen 10x+ improvements when computing the cardinality of high-cardinality string fields with pre-computed hashes. Worth a try!
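For illustration, here is a minimal sketch of the plugin route; it assumes the mapper-murmur3 plugin is installed, and the index and field names (my-index, user_id) are made up. The mapping keeps the original keyword field and adds a hashed sub-field:

PUT my-index
{
  "mappings": {
    "properties": {
      "user_id": {
        "type": "keyword",
        "fields": {
          "hash": { "type": "murmur3" }
        }
      }
    }
  }
}

The cardinality aggregation then targets the hashed sub-field instead of the original field:

GET my-index/_search
{
  "size": 0,
  "aggs": {
    "distinct_users": {
      "cardinality": { "field": "user_id.hash" }
    }
  }
}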
Related
I'm coming from a long-term SQL background; NoSQL (and Elasticsearch) is very new to me.
An engineer on my team is constructing a new index for document storage, and they have mapped all short/int/long values to strings for use in term queries.
This surprised me, as a SQL index with a SmallInt/Int/BigInt key will perform much better than the same set of values turned into a VarChar(X) and indexed accordingly.
I was pointed to this article: https://www.elastic.co/guide/en/elasticsearch/reference/current/number.html
Which has this comment:
Consider mapping a numeric identifier as a keyword if:
You don’t plan to search for the identifier data using range queries.
Fast retrieval is important. term query searches on keyword fields are often faster than term searches on numeric fields.
I'm happy to take this at face value, but I don't understand why it is so.
Assuming an exact-match query (e.g. ID = 100), can anyone speak to the mechanics of Elasticsearch (or NoSQL in general) that would explain why a query against a stringified numeric value is faster than a query against the numeric value directly?
Basically, keywords are stored in the inverted index and the lookup is really fast, which makes keyword the ideal type for term/terms queries (i.e. exact matches).
Numeric values, however, are stored in BKD trees (since ES 5 / Lucene 6), which are better suited than the inverted index for numeric values and are optimized for range-like queries.
The downside is that looking up an exact numeric value in a BKD tree is less performant than looking up a term in the inverted index.
So the takeaway is this: if your IDs are numeric and you plan on querying them in ranges, map them with a numeric type such as integer. But if you plan on matching your IDs in a term/exact-match fashion, then store them as strings with the keyword type.
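To make that concrete, here is a minimal sketch of the keyword approach; the index name products and the field product_id are made up for illustration:

PUT products
{
  "mappings": {
    "properties": {
      "product_id": { "type": "keyword" }
    }
  }
}

GET products/_search
{
  "query": {
    "term": {
      "product_id": "100"
    }
  }
}

If you ever need range queries on the same identifier as well, mapping it both ways (keyword for exact lookups plus a numeric type for ranges) via multi-fields is an option.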
I read the following in the Elasticsearch docs.
https://www.elastic.co/guide/en/elasticsearch/reference/5.4/search-request-sort.html#_memory_considerations
When sorting, the relevant sorted field values are loaded into memory. This means that per shard, there should be enough memory to contain them.
This is different from my understanding of sorting. I thought that some data types, keyword for example, should already be sorted, since Elasticsearch creates an index on them. These already-sorted fields should not need to be loaded into memory to be sorted again.
So is my understanding right?
An index in a relational database means a B*-tree, and that is indeed sorted.
An index in Elasticsearch is where you store your data; we used to compare that to a table in the relational world, but for various reasons this is not really true, so let's not use it as a direct comparison. Except for the index-time sorting Val mentioned above, an index is not stored as a sorted data structure based on a specific field. However, some fields can be used efficiently for sorting (such as numeric data types or not-analyzed text), and this is where the memory consideration above comes into play.
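For reference, index-time sorting looks roughly like the sketch below; it is only available from Elasticsearch 6.0 onwards (the docs quoted above are for 5.4), and the index and field names are made up. Query-time sorting on a field with doc_values works in either case:

PUT logs
{
  "settings": {
    "index": {
      "sort.field": "timestamp",
      "sort.order": "desc"
    }
  },
  "mappings": {
    "properties": {
      "timestamp": { "type": "date" }
    }
  }
}

GET logs/_search
{
  "sort": [
    { "timestamp": "desc" }
  ]
}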
I am trying to understand how a Lucene index works. I understand the concept of the inverted index and how dictionary storage optimizations can help load the entire dictionary into main memory for faster queries.
Suppose I have a String type field which has only four distinct values for the 200 billion documents indexed in Lucene. This field is a Stored field.
If I change the field to a Byte or Int type to represent the 4 distinct values and re-index and store all 200 billion documents, what would the storage and query optimizations from this data type change be, if any?
Please also suggest whether there is a test I could run on my laptop to get a sense of this.
As far as I know, a document in Lucene consists of a simple list of field-value pairs. A field must have at least one value, but any field can contain multiple values. Similarly, a single string value may be converted into multiple values by the analysis process.
Lucene doesn’t care if the values are strings or numbers or dates. All
values are just treated as opaque bytes.
For more information, please see this document.
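If you want a rough laptop-scale test, one low-effort option is to go through Elasticsearch (which sits on top of Lucene): create two small indices that map the field as keyword and as byte respectively, bulk-index the same documents into both, and compare the on-disk sizes. A minimal sketch follows; the index and field names are made up, and a handful of documents obviously will not reproduce 200-billion-document behaviour:

PUT status-as-keyword
{
  "mappings": {
    "properties": { "status": { "type": "keyword" } }
  }
}

PUT status-as-byte
{
  "mappings": {
    "properties": { "status": { "type": "byte" } }
  }
}

POST _bulk
{ "index": { "_index": "status-as-keyword" } }
{ "status": "2" }
{ "index": { "_index": "status-as-byte" } }
{ "status": 2 }

GET _cat/indices/status-as-*?v&h=index,docs.count,store.size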
I'm using Lucene.Net to index and search. My data contains some numeric fields.
How can I sort search results by the sum or product of multiple numeric fields?
The simplest solution is to index an extra field with the calculated value and then sort by that.
This is a very common technique in "NoSQL" stores, i.e. denormalise and store whatever extra values are needed to optimise query-time performance/capabilities.
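In Elasticsearch terms the same idea looks like the sketch below; the same pattern applies in Lucene.Net, i.e. compute the value in your client code and add it as an extra field before writing the document. The index and field names here are made up:

PUT orders
{
  "mappings": {
    "properties": {
      "price":    { "type": "double" },
      "quantity": { "type": "integer" },
      "total":    { "type": "double" }
    }
  }
}

PUT orders/_doc/1
{ "price": 9.99, "quantity": 3, "total": 29.97 }

GET orders/_search
{
  "sort": [ { "total": "desc" } ]
}

Here total is computed client-side as price * quantity before the document is indexed, so the sort never has to touch price or quantity at query time.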
Can I use an Elasticsearch facet (or aggregation) to get, e.g., the average of the maximum value in each document, rather than the average of all values across all documents?
While you can use a script_field for this, I'd recommend determining the maximum at index time and then storing it in a separate field to facet/aggregate on. It'll be faster, and the memory requirement for doing the facet will be much smaller.
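As a minimal sketch of the index-time approach (index and field names are made up): store the per-document maximum in its own field and aggregate on that.

PUT sensors/_doc/1
{ "readings": [3, 7, 2], "max_reading": 7 }

PUT sensors/_doc/2
{ "readings": [5, 1], "max_reading": 5 }

GET sensors/_search
{
  "size": 0,
  "aggs": {
    "avg_of_max": {
      "avg": { "field": "max_reading" }
    }
  }
}

This returns 6 (the average of 7 and 5), rather than the 3.6 you would get by averaging all five readings across both documents.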