Elasticsearch query too many results

Elasticsearch query too many results - elasticsearch

I'm tring to set up a simple search that would return me simple results with a custom ordering, the ordering i get back is fine based on a custom score.
The problem is that for this query
"query": {
"query_string": {
"query": query_term,
"fields": ["name_auto"],
}
}
NOTE: name_auto is an Edge N gram field on elastics
I always get a result set also if the query does not make any sense.
Example:
I have an elastcisearch index populated with the name of all the android applications.
If i search for face i get back all the results related to it ordered by number of comments on the play store, menans [facebook, facebook messenger, ...]
The problem is that when i query for something like facesomeuselesschars i still get the same results as before but fore sure there is nothing that match "someuselesschars".
Can anybody help about

ElasticSearch will always return results that match your query, even if the score of those results are poor. Your query for 'facesomeuselesschars' will match anything that has 'face' in it because of your ngrams (e.g. the first four characters of your query will be match multiple tokens in your index).
The rest of the characters in your query will simply lower the score of the returned match, but not prevent it from being returned.
If you want to set a minimum score that a result must reach, you can use the min_score parameter.

Related

Why does elastic search analyze a document 2 times?

From what I've understood, When I index a document say:
PUT <index>/_doc/1
{
"title":"black white fox cat"
}
Elastic search analyzes this via a standard analyzer and turns the title into an array of tokens.
But then when I search for this document let's say
POST <index>/_search
{
"query":
{
"match":
{
"title":"black"
}
}
}
It analyzez again via the same analyzer, isn't that inefficient?

It's not efficient, its necessary step to provide the search results.
let me explain under the hood, how search and index process works.
Index tokenize the text based on data type, and configured analyzer and index the tokens into the inverted index.
Search terms again is tokenised based on the query type(no tokens in case of term family of queries), and search generated tokens into the inverted index created at index time(step-1).
Tokens match process(matching index time tokens in the inverted index to the tokens generated at the query time), is what finds the matches documents and provides the search results, normally this tokens match is a exact string match process, with the exception in some cases like (prefix query, wildcard query etc). and as its a exact string match, its very fast and optimized process.
There are various use-cases, like when you use the keywords data type, text is not analyzed and when you use term level queries search time analysis doesn't happen.
Now, important thing to not is that during search time also same analyzer used at index time, otherwise it would end up generating different token which not produce match in step-3 Described earlier.

How to query for alternative spellings and representations of words in elasticsearch?

I'm using elasticsearch to query on the theme field in documents. For example:
[
{ theme: 'landcover' },
{ theme: 'land cover' },
{ theme: 'land-cover' },
etc
]
I would like to specify a search of the term landcover that matches all these documents. How do I do this?
So far I've tried using the fuzziness operator in a match search, and also a fuzzy query. However neither of these approaches seems to work, which surprised me because my understanding of fuzzy searches is that they would provide a means of inexact matching.
What am I missing? From the docs I see that fuzziness definitely looks for close approximations to a search term:
When querying text or keyword fields, fuzziness is interpreted as a Levenshtein Edit Distance — the number of one character changes that need to be made to one string to make it the same as another string.
I would consider 'landcover' and 'land cover' to be close. Is this not the case? (this is the first I have heard of Levenshtein Edit Distance so I don't know what extra/less characters mean in terms of this measurement).
An example of a match query that this doesn't seem to work:
{
query: {
match: {
'theme': {
query: 'landcover'
fuzziness: 'AUTO' // I've tried 2, '2', 6, '6', etc.
},
},
},
}
// When the term is 'land-cover' and fuzziness is auto, then 'land cover' is matched. But 'landcover' is not
And an example of a 'fuzzy' query that doesn't seem to work:
{
query: {
fuzzy: {
'theme': {
value: query,
fuzziness: 'AUTO', // Tried other values
},
},
},
}
// When the term is 'land-cover' and fuzziness is auto, then 'landcover' is matched. But 'land cover' is not. So works almost opposite to the match query in this regard
(NOTE - these queries are converted to JSON and do run and return sensible results, just the fuzziness doesn't seem to work as I would have expected)
Looking around StackOverflow, I see some questions that seem to indicate that querying an index is in some way related to how the index is created - i.e. that i cannot just run adhoc queries on any index that already exists and expect results. Is this correct? (sorry - I'm new to elasticsearch and I'm querying an index that already exists).
This answer seems related (how to find near matches for a search term): https://stackoverflow.com/a/55772800/3114742 - mentions that I should do something referred to as 'field mapping' prior to indexing data. but then the example query doesn't include the fuzziness operator. So in this case I'm confused as to what the point of the fuzziness operator is actually for.

Looking more into the documentation I've found the following:
Elasticsearch uses the concept of an 'index' rather than a database. But from the perspective of someone familiar with CouchDB and MongoDB, which are both JSON stores, there is definitely some similarity between a CouchDB database and an Elasticsearch index. Although the elasticsearch index is not an authoritative data storage in itself (it's 'built' from a source of data).
For a given index called, for example, my-index. you can insert JSON strings (documents) into my-index by PUTting to Elasticsearch:
PUT /... '{... json string ...}'
The JSON string can come directly from a JSON store (Mongo, Couch, etc.) or be cobbled together from a variety of sources. I guess.
Elasticsearch will process the document on insert and append to the inverted tree. For text fields this means K:V pairs will be created from JSON document text, with the keys being fragments of the text, and the values being references to where that text fragment is found in the source (the JSON document).
In other words, when inserting documents into an Elasticsearch index, the content is 'analyzed' to create K:V pairs that are added to the index.
I guess, then, that searching Elasticsearch means looking up search terms that are keys in the index, and comparing the values (the source of the key) to the source defined in the search (I think), and returning the source document where a search term is present for a particular field.
So:
Text is analyzed on insertion to an index
Queries are analyzed (using the same analyzer that was used to create the index)
So in my case (as mentioned above) the default analyzer is good enough to create indices that allow for basic fuzzy matching (i.e. in the match query, "land-cover" is matched to "land cover", and in the fuzzy query, "land-cover" is matched to "landcover" - I have no idea why these match differently!)
But to improve on the search results, I think I need to adjust the analyzer / tokenizer both when inserting documents into an index, and for when parsing queries to apply to an index.
My understanding of the analysis/tokenization is that this is the configuration by which inverted indexes are built from source documents. i.e. defining what the keys of the inverted index will be. As far as I can tell there is no magic in searching the index. search terms have to match keys in the inverted index otherwise there will be no results.
I'm still not sure what fuzziness is actually doing in this context.
So in short, querying elasticsearch seems to require a 'holistic perspective' over both how source data is indexed, and how queries are designed.
As a disclaimer,though, I'm not exactly an authoritative answer on this subject with less than one day of elasticsearch experience, so a better answer would still be appreciated!

How do I match a partial result with Elastic Search?

I'm trying to find out how to properly write my query in order to do a LIKE query with ElasticSearch.
Let's say I have a record of firstname and I want to find every one where there is ma in it.
So I've tried multiple things but none are working. Here is a list :
{"match": {"text": ".*ma.*"}}
{"match": {"text": "*ma*"}}
{"match":{"text"{"query":"ma","fuzziness":"AUTO","prefix_length":1}}}
Do you have an idea of how to do that or where am I missing something?

You might look into using the N-Gram tokenizer to split your documents' tokens up into their substrings.
This will allow you to search against the index with the "partial" matches you're describing.
Bear in mind that this will affect how your documents are tokenized for search so, if you are using other types of analysis for other parts of your application, you may want to create additional fields for your N-Gram tokenized values (or even create a separate index for them).
As a rule of thumb, always try to optimize your index for the queries you want to perform, rather than trying to solve your search problems at query time.

Return a list of search results with results related to user first with ElasticSearch or Neo4j

I'm trying to choose a database/search engine to return a list of results which shows any results the user has a relationship with first, then others after. Similar to the way Facebook works where you search a business name and one's you have liked appear first then others after?
I've seen this question which is similar to what I need but I believe it only show's results for that user: How can ElasticSearch be used to implement social search?
Is this possible with either ElasticSearch, Neo4j or anything else?

Elasticsearch can certainly do this.
Results are returned from Elasticsearch based on the score, which basically means the better the match the bigger the score.
You could use the "bool" query to specify your query as a "must" and then the user match as a "should". Optionally you might want to add a "boost" to the should query so it scores highest if matched.
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-bool-query.html

Elasticsearch multi term search

I am using Elasticsearch to allow a user to type in a term to search. I have the following property 'name' I'd like to search, for instance:
'name': 'The car is black'
I'd like to have this document returned if the following is used to search black car or car black.
I've tried doing a bool must and doing multiple terms ['black', 'car'] but it seems like it only works if the entire string is a match.
So what I'd really like to do is more of a, does the term contain both words in any order.
Can someone please get me on the right track? I've been banging my head on this one for a while.

If it seems like it only works if the entire string is a match, first make sure that in index mapping your string property name is analysed, i.e. mapping for this property doesn't contain "index": "not_analyzed". If it isn't so, you'll need to reindex your index in order to be able to search for tokens rather than for the whole phrase only.
Once you're sure your strings are analysed you can use:
Terms query with "minimum_should_match" parameter equalling to the number of words entered.
Bool query with must clause containing term queries per each word.
Common terms query which has a nice clean syntax for this purpose (you don't need to break down input string and construct more complex query structure in your app like with previous two) in addition to taking a smarter approach to stopwords analysing.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio