Query a text/keyword field in Elasticsearch that contains at least one item not matching a set - elasticsearch

I have a document has a "bag.contents" field (indexed as text with a .keyword derivative) that contains a comma separated list of items contained in it. Below are some samples:
`Apple, Apple, Apple`
`Apple, Orange`
`Car, Apple` <--
`Orange`
`Bus` <--
`Grape, Car` <--
'Car, Bus` <--
The desired query results should be all documents where there is at least one instance of something other than 'Apple', 'Orange', 'Grape', as per the arrows above.
I'm sure the DSL is a combination of must and not but after 20 or so iterations it seems very difficult to get Elasticsearch to return the correct result set short of one that doesn't contain any of those 3 things.
It is also worth noting that this field in the original document is a JSON array and Kibana shows it as a single field with the elements as a comma-separated field. I suspect this may be complicating it.

1 - If it is showing up as single field, probably its not indexed as array - Please make sure document to index is formed properly. i.e, you need it to be
{ "contents": ["apple","orange","grape"]}
and not
{"contents": "apple,orange,grape"}
2- Regarding query - if you know all the terms possible while doing query- you can form a term_set query with all other terms but apple , orange and grape. termset query allows to control min matches required ( 1 in your case)
If you dont know all possible terms , may be create a separate field for indexing all other words minus apple orange and grape and query against that field.

Related

elasticseach similarity mechanism in array field

My usecase is I have a field called subjects in elasticsearch index which is a list. This field will be having multiple values. For example one doc has ['subject one', 'subject two', 'subject three'] in field subjects, another doc has ['one test', 'one example', 'two'] in field name. So when I search for subject one in field name, I should get the first document first since it is most relevant, but I was getting the second doc first, even though I am sorting the result by _score.
Basically what I want is for when the user searches multiple search terms, and if all the search terms are present in one documents field then that document should get listed first. For text fields and all, it works fine, But for array fields, it didn't. my list field has more data.
Is there anyway that we can achieve this using any ES similarity mechanisms like BM25..
Thank you

Elasticsearch - match by all terms but full field must be matched

I'm trying to improve search on my service but get stuck on complex queries.
I need to match some documents by terms but return only documents that contains all of provided terms in any order and contains only this terms.
So for example, lets take movie titles:
"Jurassic Park"
"Lost World: Jurassic Park"
"Jurassic Park III"
When I type "Park Jurassic" I want only first document to be returned because it contains both words and nothing more.
This is silly example of complex problem but I've simplified it.
I tried with terms queries, match etc but I don't know how to check if entire field was matched.
So in short it must match all tokens in any order.
Field is mapped as text and also as keyword.
You tested the terms set query?
Returns documents that contain a minimum number of exact terms in a
provided field.
The terms_set query is the same as the terms query, except you can
define the number of matching terms required to return a document.

analyzed field vs doc_values: true field

We have an elasticsearch that contains over half a billion documents that each have a url field that stores a URL.
The url field mapping currently has the settings:
{
index: not_analyzed
doc_values: true
...
}
We want our users to be able to search URLs, or portions of URLs without having to use wildcards.
For example, taking the URL with path: /part1/user#site/part2/part3.ext
They should be able to bring back a matching document by searching:
part3.ext
user#site
part1
part2/part3.ext
The way I see it, we have two options:
Implement an analysed version of this field (which can no longer have doc_values: true) and do match querying instead of wildcards. This would also require using a custom analyser to leverage the pattern tokeniser to make the extracted terms correct (the standard tokeniser would split user#site into user and site).
Go through our database and for each document create a new field that is a list of URL parts. This field could have doc_values: true still so would be stored off-heap, and we could do term querying on exact field values instead of wildcards.
My question is this:
Which is better for performance: having a list of variable lengths that has doc_values on, or having an analysed field? (ie: option 1 or option 2) OR is there an option 3 that would be even better yet?!
Thanks for your help!
Your question is about a field where you need doc_values but can not index with keyword-analyzer.
You did not mention why you need doc_values. But you did mention that you currently not search in this field.
So I guess that the name of the search-field do not have to be the same: you can copy the field value in an other field which is only for search ( "store": false ). For this new field you can use the pattern-analyzer or pattern-tokenizer for your use case.
It seems that no-one has actually performance tested the two options, so I did.
I took a sample of 10 million documents and created two new indices:
An index with an analysed field that was setup as suggested in the other answer.
An index with a string field that would store all permutations of URL segmentation.
I ran an enrichment process over the second index to populate the fields. The field values on the first index were created when I re-indexed the sample data from my main index.
Then I created a set of gatling tests to run against the indices and compared the gatling results and netdata (https://github.com/firehol/netdata) landscape for each.
The results were as follows:
Regarding the netadata landscape: The analysed field showed a spike - although only a small one - on all elastic nodes. The not_analysed list field tests didn't even register.
It is worth mentioning that enriching the list field with URL segmentation permutations bloated the index by about 80% in our case. So there's a trade off - you never need to do wildcard searches for exact sub-segment matching on URLs, but you'll need a lot more disk to do it.
Update
Don't do this. Go for doc_values. Doing anything with analyzed strings that have a massive number of possible terms will mean massive field data that will, eventually, never fit in the amount of memory you can allocate it.

Elasticsearch multi term search

I am using Elasticsearch to allow a user to type in a term to search. I have the following property 'name' I'd like to search, for instance:
'name': 'The car is black'
I'd like to have this document returned if the following is used to search black car or car black.
I've tried doing a bool must and doing multiple terms ['black', 'car'] but it seems like it only works if the entire string is a match.
So what I'd really like to do is more of a, does the term contain both words in any order.
Can someone please get me on the right track? I've been banging my head on this one for a while.
If it seems like it only works if the entire string is a match, first make sure that in index mapping your string property name is analysed, i.e. mapping for this property doesn't contain "index": "not_analyzed". If it isn't so, you'll need to reindex your index in order to be able to search for tokens rather than for the whole phrase only.
Once you're sure your strings are analysed you can use:
Terms query with "minimum_should_match" parameter equalling to the number of words entered.
Bool query with must clause containing term queries per each word.
Common terms query which has a nice clean syntax for this purpose (you don't need to break down input string and construct more complex query structure in your app like with previous two) in addition to taking a smarter approach to stopwords analysing.

Sorting Solr multivalue fields based on field values

I have multiple Solr instances with separate schemas.
I need to receive multivalue field in sorted order, e.g. by type: train_station, airport, city_district, and so on:
q=köln&sort=query({!v="type:(airport OR train_station)"}) desc
I would like to see airport type document before train_station type. For now I am always getting train_station type at the top.
How should I write the query?
You are getting train_stations at the top because of the IDF.
A quick hack to fix it would be to use a range query (which has the advantage of having constant scores) and query boosts: q=köln&sort=query({!v="type:([airport TO airport]^3 OR [train_station TO train_station]^2)"}) desc.
This way, documents which have airport in their type field will have a score of 3, documents which have train_station in their type field will have a score of 2 and documents which have airport and train_station in their field type will have a score of 2+3=5 (to a multiplicative constant).
A more elegant (and effective) way of doing this would be to write a custom query parser (or even a function query).
You can sort on a function only if it returns a single value per document. You definitely can't sort on a multiValued field or any field that is tokenized. Seems like you would need a function that returns "airport" if the field contains "airport" (even if it contains "train station") and "train station" if it contains "train station" but not "airport", and then sort on that.
Another option would be to handle this at index time. Add a field called "airport_train_station_sort" that returns 1 if the field contains "airport", 2 if the field contains "train station" but NOT airport, and 3 if it contains neither. Then simply sort on that field.
You cannot solve this problem inside SOLR. Check the documentation, SOLR does not sort multivalued fields. Older versions of SOLR let you try, but the results were undefined and unpredictable.
You either change your schema and put this sort data into single value indexed fields, or you need to make several queries, first for airports, then city districts, then train stations.
To order items within the field itself you have to either index it in order you want, or do post processing. Solr's sort will sort only docs!

Resources