I'm using Lucene.Net to index and search. My data contains some numeric fields.
How can I sort search results by the sum or product of multiple numeric fields?
The simplest solution is to index an extra field containing the calculated value, then sort by that.
This is a very common technique in "no-sql" stores, i.e. denormalise and store whatever extra values are needed to optimise query-time performance and capabilities.
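A minimal sketch of the denormalisation idea (the field names are made up for illustration; in Lucene.Net the computed value would go into its own numeric field on the Document before it is indexed):

```python
# Sketch: denormalise a computed value into an extra field at index time.
# Field names ("price", "quantity", "sort_value") are invented here.

def build_doc(price, quantity):
    doc = {"price": price, "quantity": quantity}
    doc["sort_value"] = price * quantity  # precomputed product to sort on
    return doc

docs = [build_doc(10.0, 3), build_doc(4.5, 10), build_doc(20.0, 1)]
# At query time, sort on the stored field instead of recomputing per hit:
ranked = sorted(docs, key=lambda d: d["sort_value"], reverse=True)
print([d["sort_value"] for d in ranked])  # [45.0, 30.0, 20.0]
```

The trade-off is the usual one: a little extra work and storage at index time buys a cheap, sortable field at query time.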
In the Elasticsearch documentation for the Cardinality Aggregation, under the heading "Pre-computed hashes", I see the following:
On string fields that have a high cardinality, it might be faster to store the hash of your field values in your index and then run the cardinality aggregation on this field. This can either be done by providing hash values from client-side or by letting Elasticsearch compute hash values for you by using the mapper-murmur3 plugin.

Pre-computing hashes is usually only useful on very large and/or high-cardinality fields as it saves CPU and memory. However, on numeric fields, hashing is very fast and storing the original values requires as much or less memory than storing the hashes. This is also true on low-cardinality string fields, especially given that those have an optimization in order to make sure that hashes are computed at most once per unique value per segment.
I'm curious about the part where it says, "[this can be done] by providing hash values from client-side," because it doesn't elaborate at all on that point, but goes on to discuss numeric fields.
If I wanted to pre-compute hashes on the client, would using something like xxhash and putting the result in an appropriate number field be sufficient? (And, of course, having cardinality target that field.) Or would I need to use another type of field for the hash value?
Pre-computing hashes for high-cardinality string fields will speed up the cardinality aggregation, because the hashes no longer have to be computed at query time. There is no need to do it on numeric fields, though, since hashing those is already very fast.
For string fields, the documentation advises using the mapper-murmur3 plugin. If you compute the hashes client-side instead, you can store them as strings in a keyword field (not a numeric field type) and run your cardinality aggregation on that field.
I've personally seen 10x+ improvements when computing the cardinality of high-cardinality string fields with pre-computed hashes. Worth a try!
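A minimal client-side sketch of the idea. The field name `user_hash` and the truncated-SHA-1 stand-in for xxhash/murmur3 are my own assumptions for illustration, not from the docs:

```python
import hashlib

def hash64(value: str) -> int:
    # Truncated SHA-1 as a stdlib stand-in for xxhash/murmur3; masked to
    # 63 bits so the result would fit a signed 64-bit numeric field.
    digest = hashlib.sha1(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") & 0x7FFF_FFFF_FFFF_FFFF

values = ["user-a", "user-b", "user-a", "user-c"]
hashes = [hash64(v) for v in values]  # index each into e.g. a "user_hash" field
print(len(set(hashes)))  # 3 distinct values -- what cardinality would estimate
```

The key point is that the hash is computed once at index time per value, rather than for every matching document on every aggregation request.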
I read the following in the Elasticsearch docs:
https://www.elastic.co/guide/en/elasticsearch/reference/5.4/search-request-sort.html#_memory_considerations
When sorting, the relevant sorted field values are loaded into memory. This means that per shard, there should be enough memory to contain them.
This differs from my understanding of sorting. I thought that some datatypes, keyword for example, would already be sorted, since Elasticsearch creates an index on them; fields that are already sorted should not need to be loaded into memory to be sorted again.
Is my understanding right?
An index in a relational database means a B*-tree, and that is indeed sorted.
An index in Elasticsearch is where you store your data; it used to be compared to a table in the relational world, but for various reasons that comparison is not really accurate, so let's not use it directly. Except for the index-time sorting Val mentioned above, an index is not stored as a data structure sorted on a specific field. However, some fields can be used efficiently for sorting (such as numeric data types or not-analyzed text), and this is where the memory consideration above comes into play.
I understand the concept of an inverted index and how dictionary storage optimization can help load the entire dictionary into main memory for faster queries.
I am trying to understand how the Lucene index works.
Suppose I have a String type field which has only four distinct values for the 200 billion documents indexed in Lucene. This field is a Stored field.
Suppose I change the field to a Byte or Int type to represent the 4 distinct values, then re-index and store all 200 billion documents.
What would the storage and query optimizations from this data type change be, if any?
Please also suggest a test I could run on my laptop to get a sense of the difference.
As far as I know, a document in Lucene consists of a simple list of field-value pairs. A field must have at least one value, but any field can contain multiple values. Similarly, a single string value may be converted into multiple values by the analysis process.
Lucene doesn’t care if the values are strings or numbers or dates. All values are just treated as opaque bytes.
For more information, please see this document.
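As a rough laptop-scale sanity check of the raw sizes involved, here is a back-of-envelope sketch. The status strings are invented, and the estimate deliberately ignores Lucene's compression and doc-values encoding, which will shrink both cases considerably:

```python
# Naive raw-byte comparison: 4 distinct strings vs. a 1-byte code per doc.
statuses = ["PENDING", "ACTIVE", "CLOSED", "ARCHIVED"]  # invented values
n_docs = 200_000_000_000

avg_str = sum(len(s) for s in statuses) / len(statuses)  # avg UTF-8 bytes
as_strings = int(avg_str * n_docs)  # bytes if each doc stores the raw string
as_bytes = 1 * n_docs  # one byte per doc suffices for 4 distinct values

print(as_strings // as_bytes)  # rough size ratio
```

In practice the real on-disk difference may be far smaller than this naive ratio suggests, because low-cardinality fields are typically dictionary-encoded (each document effectively stores a tiny ordinal, whichever type you map).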
A field in an index I'm building gets appended to regularly. I'd like to be able to query Elasticsearch for the number of current items. Is it possible to query for the length of an array field? I could bring back the field and count the items client-side, but some documents have a large number of entries, so I'm looking for something that is done in place in ES.
Here's what I recommend: perform a terms aggregation over _uid, then another aggregation over the array field, and sum the doc_counts.
But whether this is practical really depends on the number of records you need to query; it can be an expensive operation.
Another option is to store the count of array elements as a separate field and query it directly. Given that you are already storing large arrays in your documents, an extra integer field for the size seems a fair trade-off. If you need the count to filter records, I would recommend using a scripted filter as explained here.
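A sketch of keeping such a count field in sync at index time (the field names `items` and `item_count` are made up for illustration):

```python
# Sketch: update an integer count field whenever the array field changes,
# so queries can read or filter on the count directly instead of counting.

def append_item(doc, item):
    doc.setdefault("items", []).append(item)
    doc["item_count"] = len(doc["items"])  # kept in sync on every append
    return doc

doc = {}
for item in ["a", "b", "c"]:
    append_item(doc, item)
print(doc["item_count"])  # 3
```

Since the document is re-indexed on every append anyway, updating the count costs essentially nothing extra.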
I want to sort documents randomly but give preference to a specific field.
I tried a dynamic random field using the RandomSortField field type, but sorting ignores scoring, so the boost factor becomes irrelevant in my case.
Sorting with multiple conditions does not work either:
sort=random_82423 asc,rating desc
Thanks in advance.