Does Elasticsearch preserve the order of values in multi-value fields?
For example, if I put the following values into two fields:
{
"values": ["one", "two", "three"],
"values_original": ["1", "2", "3"]
}
(Given that fields are not analyzed)
Can I be sure that the contents of the lists will always be returned in the same order I put them in?
In the example above, I want to make sure that "one" in the first position of "values" will always correspond to "1" in "values_original", and so on.
I could also store them as nested objects, i.e.
{
"values": [
{"original": "1", "new": "one"},
{"original":"2", "new":"two"},
{"original":"3","new":"three"}
]
}
but I want to avoid the overhead.
If the order of values in a multi-value field is guaranteed to be preserved, then my approach of keeping two parallel multi-valued fields will work.
I found out the answer.
Yes, I can rely on Elasticsearch to keep the order of values in a multi-value field within a document. (However, when I perform a search, Elasticsearch has no information about the position a given term had in the array.)
According to the documentation:
When you get a document back from Elasticsearch, any arrays will be in the same order as when you indexed the document. The _source field that you get back contains exactly the same JSON document that you indexed.
However, arrays are indexed—made searchable—as multivalue fields, which are unordered. At search time, you can’t refer to “the first element” or “the last element.” Rather, think of an array as a bag of values.
https://www.elastic.co/guide/en/elasticsearch/guide/current/complex-core-fields.html#_multivalue_fields
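To make the distinction concrete, here is a minimal Python sketch. It assumes you already have the `_source` of a hit as a plain dict; the helper name `pair_parallel_fields` is made up for illustration and no Elasticsearch client is involved. Pairing by position works on what you get back, while the frozenset is only a rough mental model of what the index knows at search time.

```python
# Hypothetical _source payload, shaped like the document in the question.
source = {
    "values": ["one", "two", "three"],
    "values_original": ["1", "2", "3"],
}


def pair_parallel_fields(doc, original_key="values_original", new_key="values"):
    """Pair up two parallel arrays from a returned _source by position."""
    originals = doc[original_key]
    news = doc[new_key]
    if len(originals) != len(news):
        # Parallel arrays only make sense if they stay the same length.
        raise ValueError("parallel arrays have different lengths")
    return list(zip(originals, news))


pairs = pair_parallel_fields(source)

# At search time only the presence of values is known, not their position,
# so an unordered collection is a closer model of the indexed field.
indexed_view = frozenset(source["values"])
```

So the parallel-array approach is fine for reading documents back, but a query cannot ask for "the value at position 0".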
Elasticsearch - The Definitive Guide says the same thing, in the passage already quoted above.
So it seems that order is preserved in the stored _source, but not in the indexed field.
I have documents with a "bag.contents" field (indexed as text with a .keyword sub-field) that contains a comma-separated list of the items in the bag. Below are some samples:
`Apple, Apple, Apple`
`Apple, Orange`
`Car, Apple` <--
`Orange`
`Bus` <--
`Grape, Car` <--
`Car, Bus` <--
The desired query result is every document that contains at least one item other than 'Apple', 'Orange' or 'Grape', as marked by the arrows above.
I'm sure the DSL is a combination of must and not, but after 20 or so iterations it seems very difficult to get Elasticsearch to return the correct result set. The closest I get is a result set of documents that contain none of those three items.
It is also worth noting that this field in the original document is a JSON array and Kibana shows it as a single field with the elements as a comma-separated field. I suspect this may be complicating it.
1 - If it shows up as a single field, it is probably not indexed as an array. Make sure the document you index is formed properly, i.e. it needs to be
{ "contents": ["apple","orange","grape"]}
and not
{"contents": "apple,orange,grape"}
2 - Regarding the query: if you know all the possible terms at query time, you can build a terms_set query from every term except apple, orange and grape. The terms_set query lets you control the minimum number of matches required (1 in your case).
If you don't know all the possible terms, consider creating a separate field that indexes every item other than apple, orange and grape, and query against that field.
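Both suggestions can be sketched in Python. This is only an illustration under assumptions: the field name `contents_other` and the helper names are made up, the documents are plain dicts rather than calls to a real client, and the `terms_set` body with a constant `minimum_should_match_script` of 1 is my reading of the "min matches" advice above, so check it against your Elasticsearch version.

```python
ALLOWED = {"Apple", "Orange", "Grape"}


def other_items(contents, allowed=ALLOWED):
    """Return the items that are not in `allowed`, deduplicated, in first-seen order."""
    seen = []
    for item in contents:
        if item not in allowed and item not in seen:
            seen.append(item)
    return seen


def prepare_document(contents):
    """Build the document to index, adding the hypothetical extra field."""
    return {"contents": list(contents), "contents_other": other_items(contents)}


def terms_set_query(field, candidate_terms):
    """Query body matching documents that contain at least one candidate term."""
    return {
        "query": {
            "terms_set": {
                field: {
                    "terms": sorted(candidate_terms),
                    # Constant script: one matching term is enough.
                    "minimum_should_match_script": {"source": "1"},
                }
            }
        }
    }


# The sample bags from the question, as real arrays.
samples = [
    ["Apple", "Apple", "Apple"],
    ["Apple", "Orange"],
    ["Car", "Apple"],
    ["Orange"],
    ["Bus"],
    ["Grape", "Car"],
    ["Car", "Bus"],
]
flagged = [bag for bag in samples if other_items(bag)]
```

With the extra field in place, an exists query on `contents_other` should pick out the arrowed documents, since an empty array is treated as having no value.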
My use case: I have a field called subjects in an Elasticsearch index, and it is a list that holds multiple values. For example, one doc has ['subject one', 'subject two', 'subject three'] in subjects, and another doc has ['one test', 'one example', 'two'] in the same field. When I search for subject one in that field, I should get the first document first since it is the most relevant, but I get the second doc first, even though I am sorting the results by _score.
Basically, when the user searches for multiple terms and all of them are present in one document's field, that document should be listed first. This works fine for plain text fields, but not for array fields. My real list field holds more data than this example.
Is there any way to achieve this using one of the ES similarity mechanisms, such as BM25?
Thank you
Say I have a field that can only have a finite set of values.
Would it not be more efficient (index-wise, and/or storage-wise) to store it as some kind of ENUM?
Is there some such possibility in elasticsearch?
An example would be the names of the states in a state machine.
Yes it would. When you index full text fields, Elasticsearch also indexes information like the length of the field, and the position and frequency of each term in the field.
These are irrelevant to ENUM values, and can be excluded completely.
In fact, if you map your field as {"index": "not_analyzed"} then, besides storing the exact value that you provide without trying to analyze it, it also disables storage of the extra info that I mentioned above.
In your app, keep a hash map such as { "enumVal1" => 1, "enumVal2" => 2, "enumValX" => 3 } and store only the mapped values in ES; this can save space.
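A minimal Python sketch of that hash-map idea, assuming the mapping lives in application code and is applied before indexing and after reading. The field name `state` and the helper names are hypothetical.

```python
# Application-side enum mapping, using the placeholder names from the answer.
STATE_TO_CODE = {"enumVal1": 1, "enumVal2": 2, "enumValX": 3}
CODE_TO_STATE = {code: name for name, code in STATE_TO_CODE.items()}


def encode_state(name):
    """Translate an enum name into the compact code stored in the index."""
    return STATE_TO_CODE[name]


def decode_state(code):
    """Translate a stored code back into the enum name."""
    return CODE_TO_STATE[code]


def encode_document(doc, field="state"):
    """Return a copy of `doc` with the enum field replaced by its code."""
    encoded = dict(doc)
    encoded[field] = encode_state(doc[field])
    return encoded


indexed = encode_document({"id": 7, "state": "enumVal2"})
```

The trade-off is that the index is no longer self-describing: anything that reads the data directly, Kibana for instance, sees the codes rather than the names.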
Suppose I have a field "epoch_date" that I will often sort on in my Elasticsearch queries. How should I map this field? Right now, I just have stored: yes. Should I index it even though this field will not count towards relevancy scoring? What should I add to the mapping so that the frequent sorting on this field is more efficient?
{
"tweet" : {
"properties" : {
"epoch_date" : {
"type" : "integer",
"store" : "yes"
}
}
}
}
There's nothing you need to change in your mapping to sort on the field. You can only sort on a field if it's indexed, and the default is "index":"yes" for numerics and dates. You cannot set a numeric type to analyzed, since there's no text to analyze. Also, it's better to use the date type for a date instead of integer.
Sorting can be memory expensive if the field you are sorting on has a lot of unique terms. Just make sure you have enough memory for it. Also, keep in mind that by sorting on a specific field you throw away the relevance ranking, which is a big part of what a search engine is all about.
Whether you store the field has nothing to do with sorting. It only affects how the field is retrieved when it is returned with your search results. If you use the _source field (the default behaviour) there's no reason to store specific fields. If you ask for specific fields using the fields option when querying, the stored fields are retrieved directly from Lucene rather than extracted by parsing the JSON in the _source field.
An index is used for efficient sorting. So YES, you want to create an index for the field.
As to needing it to be "more efficient", I'd kindly advise you to first check your results and see if they're fast enough. I don't see a reason beforehand (with the limited info you provided) to think it wouldn't be efficient.
If you intend to filter on the field as well (date ranges?), be sure to use filters instead of queries for any condition you expect to reuse often. This is because filters can be cached efficiently.
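Putting those points together, here is a hedged Python sketch that only builds request bodies as dicts. It assumes the field is mapped as a date, that timestamps are epoch seconds, and that the cluster is a version that accepts a `range` clause inside `bool`/`filter`; older releases expressed filters differently. The helper names are made up.

```python
from datetime import datetime, timezone


def epoch_to_iso(epoch_seconds):
    """Convert epoch seconds to an ISO 8601 UTC string for a date field."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).isoformat()


def sorted_search_body(field="epoch_date", newest_first=True, since=None):
    """Build a search body that sorts on `field`, optionally filtered by a date range."""
    body = {"sort": [{field: {"order": "desc" if newest_first else "asc"}}]}
    if since is not None:
        # A filter clause does not contribute to scoring, which is fine
        # here because the sort replaces relevance ranking anyway.
        body["query"] = {"bool": {"filter": [{"range": {field: {"gte": since}}}]}}
    return body


body = sorted_search_body(since=epoch_to_iso(0))
```

Nothing here needs "store": the sort and the range filter both work off the indexed values.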
I have multiple Solr instances with separate schemas.
I need to get a multivalued field back in sorted order, e.g. by type: train_station, airport, city_district, and so on:
q=köln&sort=query({!v="type:(airport OR train_station)"}) desc
I would like to see airport type document before train_station type. For now I am always getting train_station type at the top.
How should I write the query?
You are getting train_stations at the top because of the IDF.
A quick hack to fix it would be to use a range query (which has the advantage of having constant scores) and query boosts: q=köln&sort=query({!v="type:([airport TO airport]^3 OR [train_station TO train_station]^2)"}) desc.
This way, documents which have airport in their type field will have a score of 3, documents which have train_station in their type field will have a score of 2, and documents which have both airport and train_station in their type field will have a score of 2+3=5 (up to a multiplicative constant).
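The arithmetic can be checked with a small Python simulation. It does not talk to Solr; it just assumes each matching clause contributes its boost as a constant score, as described above, and the sample documents are invented.

```python
# Boosts from the range-query hack: airport^3, train_station^2.
BOOSTS = {"airport": 3, "train_station": 2}


def sort_score(types, boosts=BOOSTS):
    """Sum the boosts of every boosted value present in the multivalued field."""
    return sum(boost for value, boost in boosts.items() if value in types)


# Invented sample documents for illustration.
docs = [
    {"id": "A", "type": ["train_station"]},
    {"id": "B", "type": ["airport"]},
    {"id": "C", "type": ["airport", "train_station"]},
    {"id": "D", "type": ["city_district"]},
]

# Equivalent of "sort=... desc": highest score first.
ranked = [d["id"] for d in sorted(docs, key=lambda d: sort_score(d["type"]), reverse=True)]
```

Note that a document carrying both types outranks a pure airport document, which may or may not be what you want.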
A more elegant (and effective) way of doing this would be to write a custom query parser (or even a function query).
You can sort on a function only if it returns a single value per document. You definitely can't sort on a multiValued field or any field that is tokenized. Seems like you would need a function that returns "airport" if the field contains "airport" (even if it contains "train station") and "train station" if it contains "train station" but not "airport", and then sort on that.
Another option would be to handle this at index time. Add a field called "airport_train_station_sort" that returns 1 if the field contains "airport", 2 if the field contains "train station" but NOT airport, and 3 if it contains neither. Then simply sort on that field.
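A minimal Python sketch of that index-time field, assuming the documents are prepared in application code before being sent to Solr and that the stored token is `train_station`, as in the question.

```python
def airport_train_station_sort(types):
    """Compute the single-valued sort key described above.

    1 = has airport, 2 = has train_station but no airport, 3 = neither.
    """
    if "airport" in types:
        return 1
    if "train_station" in types:
        return 2
    return 3


def with_sort_key(doc):
    """Return a copy of `doc` with the sort key added as its own field."""
    enriched = dict(doc)
    enriched["airport_train_station_sort"] = airport_train_station_sort(doc["type"])
    return enriched


enriched = with_sort_key({"id": "cgn", "type": ["train_station", "airport"]})
```

Sorting ascending on this field then puts airports first without relying on scores at all.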
You cannot solve this problem inside Solr. Check the documentation: Solr does not sort on multivalued fields. Older versions of Solr let you try, but the results were undefined and unpredictable.
You either change your schema and put this sort data into single value indexed fields, or you need to make several queries, first for airports, then city districts, then train stations.
To order the items within the field itself, you have to either index them in the order you want or do post-processing. Solr's sort only sorts docs!
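The post-processing route could look like this in Python. The priority list is an assumption made up for the example, and values that are not in the list are pushed to the end in alphabetical order.

```python
# Hypothetical display priority for the values of the type field.
TYPE_ORDER = ["airport", "train_station", "city_district"]


def order_values(values, priority=TYPE_ORDER):
    """Reorder the values of a multivalued field by a fixed priority list."""
    rank = {name: position for position, name in enumerate(priority)}
    # Unknown values get the lowest rank and are ordered alphabetically.
    return sorted(values, key=lambda value: (rank.get(value, len(priority)), value))


ordered = order_values(["city_district", "train_station", "airport"])
```

The same function can be applied before indexing instead, if you would rather store the values already in the order you want.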