Say I have a field that can only have a finite set of values.
Would it not be more efficient (index-wise, and/or storage-wise) to store it as some kind of ENUM?
Is there some such possibility in elasticsearch?
An example would be the names of the states in a state machine.
Yes, it would. When you index full-text fields, Elasticsearch also indexes information such as the length of the field and the position and frequency of each term in the field.
This information is irrelevant to ENUM values and can be excluded completely.
In fact, if you map your field as {"index": "not_analyzed"}, then besides storing the exact value you provide without analyzing it, Elasticsearch also disables storage of the extra information mentioned above.
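As a rough illustration of the mapping described above, here is a minimal sketch written as Python dictionaries. The field name `state` is hypothetical. The first mapping uses the legacy `not_analyzed` string syntax from the answer; the second assumes Elasticsearch 5.x or later, where the `keyword` type plays that role.

```python
import json

# Legacy (pre-5.x) syntax, as in the answer above: the value is indexed
# verbatim, without analysis. The field name "state" is hypothetical.
legacy_state_mapping = {
    "properties": {
        "state": {"type": "string", "index": "not_analyzed"}
    }
}

# Assumed equivalent on Elasticsearch 5.x and later, where the
# "keyword" type replaced not_analyzed strings.
keyword_state_mapping = {
    "properties": {
        "state": {"type": "keyword"}
    }
}

print(json.dumps(keyword_state_mapping, indent=2))
```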
In your app, use a hash map such as { "enumVal1" => 1, "enumVal2" => 2, "enumValX" => 3 } and store only the mapped values in ES; this can save space.
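A minimal sketch of that application-side mapping in Python, using the same example values. The helper names `encode_state` and `decode_state` and the field name `state` are hypothetical.

```python
# Application-side lookup tables; only the integer codes are sent to ES.
STATE_TO_CODE = {"enumVal1": 1, "enumVal2": 2, "enumValX": 3}
CODE_TO_STATE = {code: name for name, code in STATE_TO_CODE.items()}


def encode_state(name: str) -> int:
    """Translate an enum name into the compact code that gets indexed."""
    return STATE_TO_CODE[name]


def decode_state(code: int) -> str:
    """Translate a code read back from a document into its enum name."""
    return CODE_TO_STATE[code]


# The document that would be indexed carries the code, not the name.
doc = {"state": encode_state("enumVal2")}
print(doc, decode_state(doc["state"]))
```

The trade-off is that raw documents are no longer self-describing, so every consumer of the index needs the same lookup table.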
We have an ES index with a field that stores its data as an array. In this field we include the original text, plus the text without any punctuation, special characters, etc. The problem is that, when searching on the field, the multiple values appear to be skewing the score.
For example, if we search on the term 'up', the document with the array ['up, up and away', 'up up and away'] scores higher with a multi_match (which we are using because we may search more than one field) than the document whose array is simply ['up'].
In the end, I guess what I am looking for is a score that emulates calculating a score for each item in the array and returning me the highest. I believe in this case, comparing 'up' to 'Up' and 'Up, Up and Away' will give me a higher score for 'Up'.
With my research, I believe I may need to do custom scoring on this field...? If that is true, am I looking at "score_mode": "max" as what I want?
I think you slightly over-engineered your index. You don't need to create duplicate fields for the same information, or remove punctuation and lowercase the fields yourself.
I'd recommend reading about Elasticsearch token filters and how to create multiple analyzers for the same field.
For your exact use case, a document sample would certainly help. But in any case, looking at what you are dealing with: index your array of strings with the default analyzer and with a custom one that you build yourself. Then you can use the same field, but with different analyzers (differently processed text), to control your score.
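One way this could look, sketched as Python dictionaries. This is an illustration under assumptions, not the asker's actual setup: the field name `title`, the sub-field name `folded`, and the analyzer name `folded_text` are hypothetical, and the typeless mapping syntax assumes Elasticsearch 7.x or later.

```python
import json

# Index settings plus mapping: one field, analyzed two ways.
index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                # Hypothetical custom analyzer built from built-in parts.
                "folded_text": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "asciifolding"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text",  # default (standard) analyzer
                "fields": {
                    # Same source text, processed by the custom analyzer.
                    "folded": {"type": "text", "analyzer": "folded_text"}
                },
            }
        }
    },
}

# Query body that searches both views of the field.
search_body = {
    "query": {
        "multi_match": {
            "query": "up",
            "fields": ["title", "title.folded"],
        }
    }
}

print(json.dumps(search_body, indent=2))
```

With this layout the document stores each string once, and the analyzers do the punctuation and case handling at index time.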
Good day:
I have an indexed field called amount, which is of string type. The value of amount can be either one or 1. Say, in this example, we have amount=1 in an indexed document, but I try to search for one: Elasticsearch will not return the document unless I put 1 in the search query. Any thoughts on how I can get this to work? I'm thinking a tokenizer is what's needed.
Thanks.
You probably don't want this for sevenmillionfourhundredfifteenthousandtwohundredfourteen and the like, but only for a small number of values.
At index time I would convert everything to a proper number and store it in a numeric field, which then even allows sorting, if you need it. Apart from this, I would use synonyms at index and at query time to map everything to the digit strings, but in a general text field that is searched by default.
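The index-time conversion could be sketched like this in Python, for a deliberately small vocabulary. The table `WORD_TO_DIGIT` and the function `normalize_amount` are hypothetical names; the same function would be applied to the user's search term before querying the numeric field.

```python
# Small, explicit vocabulary: only the words we actually expect to see.
WORD_TO_DIGIT = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
    "ten": "10",
}


def normalize_amount(value: str) -> int:
    """Return the integer for a digit string or a known number word.

    Raises ValueError for anything outside the small vocabulary.
    """
    token = value.strip().lower()
    return int(WORD_TO_DIGIT.get(token, token))


# Both spellings end up as the same numeric value.
print(normalize_amount("one"), normalize_amount("1"))
```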
Does Elasticsearch keep an order of multi-value fields?
I.e. if I've put following values into fields:
{
"values": ["one", "two", "three"],
"values_original": ["1", "2", "3"]
}
(Given that fields are not analyzed)
Can I be sure that the contents of lists will always be returned in the same order I put it there?
In the example above, I want to make sure that "one" on first position in "values" will always correspond to "1" in "values_original" etc.
I could keep it also as nested objects, i.e.
{
  "values": [
    {"original": "1", "new": "one"},
    {"original": "2", "new": "two"},
    {"original": "3", "new": "three"}
  ]
}
but I want to avoid the overhead.
If it is guaranteed that order of values in multi-value field is preserved, then my approach of keeping two parallel multi-valued fields will work.
I found out the answer.
Yes, I can rely on Elasticsearch to keep the order of values in a multi-value field within a document. (However, when I perform a search, Elasticsearch has no information about the position at which a certain term occurred.)
According to documentation:
When you get a document back from Elasticsearch, any arrays will be in the same order as when you indexed the document. The _source field that you get back contains exactly the same JSON document that you indexed.
However, arrays are indexed—made searchable—as multivalue fields, which are unordered. At search time, you can’t refer to “the first
element” or “the last element.” Rather, think of an array as a bag of
values.
https://www.elastic.co/guide/en/elasticsearch/guide/current/complex-core-fields.html#_multivalue_fields
Elasticsearch - The Definitive Guide says the following:
When you get a document back from Elasticsearch, any arrays will be in the same order as when you indexed the document. The _source field that you get back contains exactly the same JSON document that you indexed.
However, arrays are indexed—made searchable—as multivalue fields, which are unordered. At search time, you can’t refer to “the first element” or “the last element.” Rather, think of an array as a bag of values.
So it seems that order is preserved in the _source that you get back, but not in the indexed (searchable) values.
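Since the arrays in the returned _source keep their order, the two parallel fields from the question can be paired up on the client side. A minimal Python sketch, assuming `source` holds the _source of a fetched document:

```python
# The _source of a fetched document, using the example from the question.
source = {
    "values": ["one", "two", "three"],
    "values_original": ["1", "2", "3"],
}

# Pair elements by position; this relies on _source preserving array order.
pairs = list(zip(source["values_original"], source["values"]))

for original, new in pairs:
    print(original, "->", new)
```

This pairing only works on retrieved documents. A search cannot express "the first element", so a query that must match a specific original/new pair would still need nested objects.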
I have some data indexed in elasticsearch, in _source I have a field to store file size:
{"file_size":"25.2MB"}
{"file_size":"2GB"}
{"file_size":"800KB"}
Currently the mapping of this field is string. I want to do search with sorting by file_size. I guess I need change the mapping to integer and do re-index.
How can I calculate the size in bytes and re-index them as integer?
Elasticsearch does not support reindexing a field in place, because documents in a Lucene index are immutable. Internally, every document needs to be fetched, changed, and indexed again, and the old copy removed. It doesn't matter whether you need to change the mapping or change the data.
So, on the practical side, the straightforward way is:
Create new index with proper mapping
Fetch all your documents from old index
Change your file_size field to integer according to any logic you need
Index documents to new index
Drop old index after full migration
So the application side will contain additional logic to transform the data from human-readable strings to Long, plus standard ES driver functionality. To speed this process up, consider using scroll-scan for reads and the bulk API for writes. For the future, I recommend using aliases so that you can migrate your data seamlessly.
If you can't make server-side changes for some reason, you can potentially add a new field with the proper mapping and fire off ES-side updates with scripted partial updates. Or try your luck with an experimental plugin.
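The transformation step could be sketched as follows in Python. This is a sketch under assumptions: units are treated as binary multiples (1KB = 1024 bytes), the function names and the index name `files_v2` are hypothetical, and the sample hits only imitate the shape of search results. The dictionaries it yields use the `_index` / `_id` / `_source` layout that bulk helpers in common client libraries accept.

```python
import re
from decimal import Decimal

# Assumption: binary units. Switch to powers of 1000 if your data uses those.
UNIT_FACTORS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}
SIZE_PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([KMGT]?B)\s*$", re.IGNORECASE)


def size_to_bytes(text: str) -> int:
    """Convert a string such as '25.2MB' to a whole number of bytes."""
    match = SIZE_PATTERN.match(text)
    if match is None:
        raise ValueError(f"unrecognised file size: {text!r}")
    number, unit = match.groups()
    # Decimal avoids float rounding; int() drops any fractional byte.
    return int(Decimal(number) * UNIT_FACTORS[unit.upper()])


def to_new_index_actions(hits, new_index):
    """Yield one bulk-style action per hit, with file_size as a number."""
    for hit in hits:
        source = dict(hit["_source"])
        source["file_size"] = size_to_bytes(source["file_size"])
        yield {"_index": new_index, "_id": hit["_id"], "_source": source}


# Stand-in for documents read from the old index.
old_hits = [
    {"_id": "1", "_source": {"file_size": "25.2MB"}},
    {"_id": "2", "_source": {"file_size": "2GB"}},
    {"_id": "3", "_source": {"file_size": "800KB"}},
]
actions = list(to_new_index_actions(old_hits, "files_v2"))
print(actions)
```

Note that 2GB is 2147483648 bytes, which no longer fits a signed 32-bit integer, so the new field would need a `long` mapping rather than `integer`.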
Why not sort by the keyword sub-field?
Just add this:
{
  "sort": {
    "file_size.keyword": {
      "order": "asc"
    }
  }
}
Note that this only sorts by string value, so if the data is 2.5GB, 1KB, 5KB, the result will be 1KB, 2.5GB, 5KB.
I think you have to convert the sizes to bytes first; once they are all in the same format, sorting is easy.
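The difference between the two orderings can be reproduced locally with a few lines of Python. The `file_size_bytes` field is hypothetical and assumes the sizes were converted with 1KB = 1024 bytes.

```python
# Sample documents with both the display string and a numeric byte count.
docs = [
    {"file_size": "2.5GB", "file_size_bytes": 2684354560},
    {"file_size": "1KB", "file_size_bytes": 1024},
    {"file_size": "5KB", "file_size_bytes": 5120},
]

# Lexicographic order, which is what sorting on a keyword field gives.
by_string = [d["file_size"] for d in sorted(docs, key=lambda d: d["file_size"])]

# Numeric order, which is what sorting on a byte count gives.
by_bytes = [d["file_size"] for d in sorted(docs, key=lambda d: d["file_size_bytes"])]

print(by_string)  # ['1KB', '2.5GB', '5KB']
print(by_bytes)   # ['1KB', '5KB', '2.5GB']
```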
Suppose I have a field "epoch_date" that will be sorted often when I do Elastic Search queries. How should I map this field? Right now, I just have stored: yes. Should I index it even though this field will not count towards the relevancy scoring? What should I add to this field if I intend to sort on this field often, so it will be more efficient?
{
  "tweet" : {
    "properties" : {
      "epoch_date" : {
        "type" : "integer",
        "store" : "yes"
      }
    }
  }
}
There's nothing you need to change to sort on the field given your mapping. You can only sort on a field if it's indexed, and the default is "index":"yes" for numerics and dates. You cannot set a numeric type to analyzed, since there's no text to analyze. Also, it is better to use the date type for a date instead of integer.
Sorting can be memory-expensive if the field you are sorting on has a lot of unique terms. Just make sure you have enough memory for it. Also, keep in mind that by sorting on a specific field you throw away the relevance ranking, which is a big part of what a search engine is all about.
Whether you want to store the field too doesn't have anything to do with sorting, but just with the way you retrieve it in order to return it together with your search results. If you use the _source field (default behaviour) there's no reason to store specific fields. If you ask for specific fields using the fields option when querying, then the stored fields would be retrieved directly from lucene rather than extracted from the _source field parsing the json.
An index is used for efficient sorting. So yes, you want to index the field.
As for making it "more efficient", I'd kindly advise you to first check your results and see whether they're fast enough. With the limited info you provided, I don't see a reason to expect it to be inefficient.
If you intend to filter on the field as well (date ranges?), be sure to use filters instead of queries wherever you expect the same filter to be used often. This is because filters can be cached efficiently.
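A request body that combines a range filter with a sort on the field might look like the sketch below, written as a Python dictionary. It assumes the `bool` query with a `filter` clause available in Elasticsearch 2.x and later (older versions used a different filter syntax), and the epoch bounds are made-up values.

```python
import json

# Range restriction expressed as a filter (no scoring), plus an explicit sort.
search_body = {
    "query": {
        "bool": {
            "filter": [
                # Hypothetical bounds, in seconds since the epoch.
                {"range": {"epoch_date": {"gte": 1400000000, "lt": 1500000000}}}
            ]
        }
    },
    "sort": [{"epoch_date": {"order": "desc"}}],
}

print(json.dumps(search_body, indent=2))
```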