Efficiently match texts contained in a query text - elasticsearch

I have a set of documents S in an index, where each document D has a text field D.text. I want to use a text query Q to find the documents whose texts are contained in the query Q.
An example:
A set S contains documents D1, D2 and D3 with the texts "Stranger Things", "special effects are top-notch", "entertaining and always keeps me on the edge", respectively.
A text query Q: "I think 'Stranger Things' is one of the best shows on Netflix. The acting is superb, the plot is intriguing, and the special effects are top-notch."
I have to find documents D1 and D2, because the texts D1.text and D2.text are in Q.
So far I am using the PhraseMatcher class from spaCy, which efficiently matches large terminology lists against a text. However, every time I need to build a large terminology list in memory (the set S has more than 100,000 documents and can grow even bigger) in order to query around 100 texts for these matches. I get requests to do this a few times a second.
Is there any way to perform this query in Elasticsearch?

Elasticsearch, with the default mapping, will split the texts in D1, D2 and D3 into terms (words separated by whitespace) and index those terms. You can split your query Q into terms in the same way and search for those terms in your index, which will return only the documents that contain one or more of the searched terms.
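A minimal Python sketch of that idea, under some assumptions: the field is called `text` as in the question, the request body would be sent with whatever client you already use, and the helper names `build_match_query` and `is_contained` are hypothetical. Because a plain match query returns any document that shares at least one term with Q (D3 shares "on", "the" and "and" with the example query), the sketch adds a client-side check that a candidate's full text really occurs in Q as a run of consecutive words. The regex tokenizer only approximates what the default analyzer does.

```python
import re


def tokenize(text):
    """Lowercase and split on non-word characters (rough stand-in for the default analyzer)."""
    return re.findall(r"\w+", text.lower())


def build_match_query(query_text, field="text"):
    """Request body for a match query; by default any shared term is enough to match."""
    return {"query": {"match": {field: {"query": query_text}}}}


def is_contained(doc_text, query_text):
    """True if the document's tokens appear in the query as a contiguous run."""
    doc_tokens = tokenize(doc_text)
    query_tokens = tokenize(query_text)
    n = len(doc_tokens)
    if n == 0:
        return False
    return any(
        query_tokens[i:i + n] == doc_tokens
        for i in range(len(query_tokens) - n + 1)
    )
```

The match query narrows the candidates on the server; `is_contained` would then be applied to the returned hits to drop partial matches such as D3.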

Related

Query a text/keyword field in Elasticsearch that contains at least one item not matching a set

I have documents with a "bag.contents" field (indexed as text with a .keyword sub-field) that contains a comma-separated list of the items in the bag. Below are some samples:
`Apple, Apple, Apple`
`Apple, Orange`
`Car, Apple` <--
`Orange`
`Bus` <--
`Grape, Car` <--
`Car, Bus` <--
The desired query results should be all documents where there is at least one instance of something other than 'Apple', 'Orange', 'Grape', as per the arrows above.
I'm sure the DSL is a combination of must and not but after 20 or so iterations it seems very difficult to get Elasticsearch to return the correct result set short of one that doesn't contain any of those 3 things.
It is also worth noting that this field in the original document is a JSON array and Kibana shows it as a single field with the elements as a comma-separated field. I suspect this may be complicating it.
1 - If it shows up as a single field, it is probably not indexed as an array. Please make sure the document you index is formed properly, i.e. it needs to be
{ "contents": ["apple","orange","grape"]}
and not
{"contents": "apple,orange,grape"}
2 - Regarding the query: if you know all the possible terms at query time, you can build a terms_set query with all the other terms except apple, orange and grape. A terms_set query lets you control the minimum number of matches required (1 in your case).
If you don't know all the possible terms, you could create a separate field that indexes every word other than apple, orange and grape, and query against that field.
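As a rough Python sketch of the terms_set suggestion: it assumes the field is indexed as a real array so that `bag.contents.keyword` holds one value per item, and that the full vocabulary of items is known up front. The helper names and the `known_items` argument are hypothetical; `has_other_item` only mirrors the intended semantics locally so the expected result set can be checked.

```python
ALLOWED = {"Apple", "Orange", "Grape"}


def build_other_items_query(known_items, allowed=ALLOWED, field="bag.contents.keyword"):
    """terms_set body matching documents with at least one item outside `allowed`."""
    other_terms = sorted(set(known_items) - set(allowed))
    return {
        "query": {
            "terms_set": {
                field: {
                    "terms": other_terms,
                    # A constant script: one matching term is enough.
                    "minimum_should_match_script": {"source": "1"},
                }
            }
        }
    }


def has_other_item(contents, allowed=ALLOWED):
    """Local check: does the list contain anything that is not allowed?"""
    return any(item not in allowed for item in contents)
```

With a minimum of one match, this behaves much like a plain terms query over the "other" items; terms_set becomes more useful when the required number of matches varies.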

Elasticsearch - match by all terms but full field must be matched

I'm trying to improve search on my service but got stuck on complex queries.
I need to match documents by terms, but return only documents that contain all of the provided terms, in any order, and contain only these terms.
So for example, let's take movie titles:
"Jurassic Park"
"Lost World: Jurassic Park"
"Jurassic Park III"
When I type "Park Jurassic" I want only the first document to be returned, because it contains both words and nothing more.
This is a silly example of a complex problem, but I've simplified it.
I tried terms queries, match queries, etc., but I don't know how to check whether the entire field was matched.
So in short it must match all tokens in any order.
Field is mapped as text and also as keyword.
Have you tried the terms_set query? The documentation describes it as follows:
"Returns documents that contain a minimum number of exact terms in a provided field. The terms_set query is the same as the terms query, except you can define the number of matching terms required to return a document."
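One possible way to apply this to the "Jurassic Park" example, sketched in Python under several assumptions: terms_set with `params.num_terms` requires every typed word to be present, and an extra sub-field of type token_count (here called `title.length`, a hypothetical name that would have to be added to the mapping) is used to reject titles with additional words. Lowercasing and whitespace splitting only approximate the default analyzer, and titles that repeat a word are not handled.

```python
def build_exact_token_set_query(user_input, field="title", count_field="title.length"):
    """Body matching titles that contain all typed words and no others."""
    # terms_set is a term-level query, so the terms must look like indexed tokens.
    terms = sorted(set(user_input.lower().split()))
    return {
        "query": {
            "bool": {
                "filter": [
                    {
                        "terms_set": {
                            field: {
                                "terms": terms,
                                # Require all supplied terms to match.
                                "minimum_should_match_script": {
                                    "source": "params.num_terms"
                                },
                            }
                        }
                    },
                    # Assumed token_count sub-field: title length in tokens.
                    {"term": {count_field: len(terms)}},
                ]
            }
        }
    }
```

For "Park Jurassic" this asks for both tokens plus a token count of 2, which would exclude "Lost World: Jurassic Park" and "Jurassic Park III".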

Elasticsearch show what is matched from a query

I'm implementing a sort of "natural language" search assistant. I have a form with a number of select fields. The list of options in each field can be pretty lengthy. So rather than having to select each item individually, I'm adding a text input box where people can just type what they're looking for and the app will suggest possible searches, based on the options in the select dropdowns.
Let's say my options are:
Color: red, blue, black, yellow, green
Size: very small, kinda medium, super large
Shape: round, square, oblong, cylindrical
Year: 2007, 2008, 2009, 2010
If you typed in "2007 very small star-spangled", the text input would suggest "Search all 2007 very small widgets for 'star-spangled'". It understood that "2007" and "very small" were select options in the form, and that "star-spangled" was not, and suggested a search where "2007" and "very small" are selected, and then left the "star-spangled" bit for a plaintext search.
What I'm working on right now is parsing the search query and picking out the bits that fit into the select fields. I have all the options in Elasticsearch. I was thinking of searching each type individually to see if it matches anything in the search query. That seems straightforward to me. I can easily find matches. However, I don't know which part of the query actually matches each type, which I need in order to find out that e.g. "star-spangled" is the part that didn't match options.
So, in the end, I need to know that only the "2007" substring matched the year, only the "very small" substring matched the size, and "star-spangled" didn't match anything.
My first thought is to split the query into word-grams (e.g. "2007", "2007 very", "2007 very small", "2007 very small star-spangled", "very", "very small", "very small star-spangled", "small", "small star-spangled", "star-spangled") and search each option for each gram. Then I would know for sure which gram matched. However, this could obviously get resource intensive pretty quickly. Also, I know Elasticsearch can do that sort of search internally much faster.
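For illustration, the word-gram idea can be sketched in a few lines of Python. `word_grams` produces the grams in the order listed above; `split_query` is a hypothetical helper that does exact, case-insensitive lookups against an in-memory copy of the options (no fuzzy matching, no Elasticsearch call), preferring the longest gram at each position and collecting the words that match nothing.

```python
def word_grams(query):
    """All contiguous word sequences of the query, grouped by starting word."""
    words = query.split()
    return [
        " ".join(words[i:j])
        for i in range(len(words))
        for j in range(i + 1, len(words) + 1)
    ]


def split_query(query, options):
    """Return ({field: matched option}, leftover free text)."""
    lookup = {
        option.lower(): field
        for field, values in options.items()
        for option in values
    }
    words = query.split()
    selected, leftover = {}, []
    i = 0
    while i < len(words):
        # Try the longest gram starting at word i first.
        for j in range(len(words), i, -1):
            gram = " ".join(words[i:j]).lower()
            if gram in lookup:
                selected[lookup[gram]] = gram
                i = j
                break
        else:
            leftover.append(words[i])
            i += 1
    return selected, " ".join(leftover)
```

With the options from the example, "2007 very small star-spangled" splits into Year "2007", Size "very small" and the free text "star-spangled".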
So what I really need is to be able to perform a search and, along with the results, get back which part of the original query actually matched. So if I searched, "2007 verr small" (intentional misspelling) and did a fuzzy search of sizes, passing the entire query string, and I get the "Very Small" size back as a result, it would indicate that "verr small" is the part of the query that matched that size.
Any idea of how to do that? Or possibly some other solutions?
I could do the search and parse the results to see which bits match the string. Though I could see that being resource intensive as well. And if I'm doing a fuzzy search, it wouldn't necessarily be clear which part of the query triggered a match in the result.
I was also thinking that highlighting might work for this, but I don't know enough about Elasticsearch to know for sure.
EDIT: I tested this out using highlighting. It's so close to working. The highlight field comes back with the part of the string that matches. However, it only shows the part of the result that matches. It doesn't show the part of the query that matches. So if I want to allow for fuzzy searches, the highlight field won't match the original query and I won't be able to tell which part of the query matched. For example, a query of "very smaal" will return the size "Very Small", but the highlight field will show <em>very</em> <em>small</em>, not <em>very</em> <em>smaal</em>.
There are two types of queries in Elasticsearch: match queries and filtered queries. A match query matches your terms against the documents and finds all the relevant documents, each with a relevance score. For example, when you search for "help fixing javascript problem", you are interested in all documents that contain one or more of the search terms.
On the other hand, with a filtered query a document either matches or does not match; there is no relevance score. For example, if you want all the products built in the year 2007, you need a filtered query. All the products built in 2007 have the same score, and all other years are excluded from the result.
In my opinion, your problem should be handled with a filtered query.
When using a filtered query, each filter normally has its own corresponding input in the UI, as on eBay's search page.
If I have understood your requirement correctly, you want to include all those filters in a single search box. In my opinion, this is nearly impossible to implement, because you have no way to parse the user input and decide which word corresponds to which filter.
If you want to go down the filter path, it's better to introduce a corresponding UI field for each filter.
If you want to stick to a single search box, then don't implement the filter functionality and stick to Elasticsearch's multi_match query. You can match the input terms across multiple fields, but you won't be able to filter out (exclude) results; instead you get a relevance score.
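To make the distinction concrete, here is a small Python sketch of a request body that combines both styles in one bool query: the free text goes into a scored multi_match clause, and the selected options go into non-scoring filter clauses. The field names (`title`, `description`, `year`) and the helper name are assumptions for illustration only.

```python
def build_search(free_text, filters, text_fields=("title", "description")):
    """bool query: scored multi_match in `must`, exact yes/no checks in `filter`."""
    return {
        "query": {
            "bool": {
                # Contributes to the relevance score.
                "must": [
                    {"multi_match": {"query": free_text, "fields": list(text_fields)}}
                ],
                # Does not affect the score; documents either pass or are excluded.
                "filter": [
                    {"term": {field: value}} for field, value in filters.items()
                ],
            }
        }
    }
```

For instance, `build_search("star-spangled", {"year": 2007})` would score documents on the free text while keeping only those with `year` equal to 2007.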

Elasticsearch - Match long query text to short field

Standard match involves providing a small query term or phrase and matching it against a larger blob of text stored in a document. I want to do the reverse - my query will be a large blob of text, like a paragraph, and I want to rank the relevance of documents that contain a full name (i.e. John Smith). I want to supply a paragraph and determine which document's full name is most likely to be contained in that paragraph. What's the best way to do this?
Answering my own question: The Percolate API allows you to store queries and run documents against them: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-percolate-query.html
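A rough Python sketch of the three request bodies involved, assuming an index whose `query` field is mapped as `percolator` and whose paragraph text lives in a field called `body` (both names, and the helper names, are illustrative). Each full name is stored once as a match_phrase query; at search time the paragraph is passed as the document to percolate, and the stored queries that match it are returned.

```python
def percolator_mapping(text_field="body"):
    """Index mapping: one percolator field plus the text field the queries refer to."""
    return {
        "mappings": {
            "properties": {
                "query": {"type": "percolator"},
                text_field: {"type": "text"},
            }
        }
    }


def stored_name_query(full_name, text_field="body"):
    """Document to index: a phrase query for one full name."""
    return {"query": {"match_phrase": {text_field: full_name}}}


def percolate_request(paragraph, text_field="body"):
    """Search body: find the stored queries that match this paragraph."""
    return {
        "query": {
            "percolate": {
                "field": "query",
                "document": {text_field: paragraph},
            }
        }
    }
```

The same pattern would fit the case of many short texts matched against one long query text: the short texts are indexed as stored queries, and the long text is percolated against them.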

Elasticsearch multi term search

I am using Elasticsearch to allow a user to type in a term to search. I have the following property 'name' I'd like to search, for instance:
'name': 'The car is black'
I'd like this document to be returned when the search is either black car or car black.
I've tried a bool must with multiple terms ['black', 'car'], but it seems to work only if the entire string is a match.
So what I'd really like is more of a "does the field contain both words, in any order" check.
Can someone please get me on the right track? I've been banging my head on this one for a while.
If it seems to work only when the entire string matches, first make sure that the string property name is analysed in the index mapping, i.e. that the mapping for this property doesn't contain "index": "not_analyzed". If it does, you'll need to reindex in order to search for tokens rather than only for the whole phrase.
Once you're sure your strings are analysed, you can use one of the following:
Terms query with the "minimum_should_match" parameter equal to the number of words entered.
Bool query with a must clause containing a term query for each word.
Common terms query, which has a nice clean syntax for this purpose (you don't need to break down the input string and build a more complex query structure in your app, as with the previous two) and also takes a smarter approach to stopword analysis.
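The second option, a bool query with one term clause per word, could be built like this in Python. This is a sketch assuming the field is called `name` and is analysed with an analyzer that lowercases tokens, which is why the input is lowercased before being placed into term queries (term queries are not analysed). `contains_all_words` is a hypothetical local check that mirrors the intended behaviour.

```python
def build_all_words_query(user_input, field="name"):
    """bool/must body requiring every typed word to be present, in any order."""
    words = user_input.lower().split()
    return {
        "query": {
            "bool": {
                "must": [{"term": {field: word}} for word in words]
            }
        }
    }


def contains_all_words(field_value, user_input):
    """Local equivalent: every typed word occurs among the field's words."""
    tokens = set(field_value.lower().split())
    return all(word in tokens for word in user_input.lower().split())
```

Both "black car" and "car black" produce the same two must clauses in a different order, so either input matches 'The car is black'.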
