Unexpected Match query scoring on a FirstMiddleLast field - elasticsearch

I am using a match query to search a fullName field which contains names in (first [middle] last) format. I have two documents, one with "Brady Holt" as the fullName and the other as "Brad von Holdt". When I search for "brady holt", the document with "Brad von Holdt" is scored higher than the document with "Brady Holt" even though it is an exact match. I would expect the document with "Brady Holt" to have the highest score. I am guessing it has something to do with the 'von' middle name causing the score to be higher?
These are my documents:
[
{
"id": 509631,
"fullName": "Brad von Holdt"
},
{
"id": 55425,
"fullName": "Brady Holt"
}
]
This is my query:
{
"query": {
"match": {
"fullName": {
"query": "brady holt",
"fuzziness": 1.0,
"prefix_length": 3,
"operator": "and"
}
}
}
}
This is the query result:
"hits": [
{
"_index": "demo",
"_type": "person",
"_id": "509631",
"_score": 2.4942014,
"_source": {
"id": 509631,
"fullName": "Brad von Holdt"
}
},
{
"_index": "demo",
"_type": "person",
"_id": "55425",
"_score": 2.1395948,
"_source": {
"id": 55425,
"fullName": "Brady Holt"
}
}
]

A good read on how Elasticsearch does scoring, and how to manipulate relevancy, can be found in the Elasticsearch Guide: What is Relevance?. In particular, you may want to experiment with the explain functionality of a search query.
The shortest answer for you here is that a hit's score is built from per-term TF/IDF calculations, combined across the matching terms. The number of matching terms affects which documents are matched, but it's the individual term scores that determine a document's score. Your query doesn't have an "exact" match, per se: it has multiple matching terms, the scores of which are calculated independently.
Tuning relevancy can be a bit of a subtle art, and depends a lot on how the fields are being analyzed, the overall frequency distributions of various terms, the queries you're running, and even how you're sharding and distributing the index within a cluster (different shards will have different term frequencies).
(It may also be relevant, so to speak, that your example has two spellings of "Holt" and "Holdt".)
In any case, getting familiar with explain functionality and the underlying scoring mechanics is a helpful next step for you here.
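For example, the query from the question can be rerun with "explain": true to get a per-term breakdown of each document's score (a sketch reusing the field name and parameters from the question):

```json
{
  "explain": true,
  "query": {
    "match": {
      "fullName": {
        "query": "brady holt",
        "fuzziness": 1.0,
        "prefix_length": 3,
        "operator": "and"
      }
    }
  }
}
```

Each hit then carries an _explanation tree showing how the fuzzy expansions of "brady" and "holt" were scored against each document.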
Also, if you want an exact phrase match, you should read the ES guide on Phrase Matching.
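As a sketch, a minimal phrase query for this case might look like the following; note that a phrase match is strict, so it would match "Brady Holt" but neither fuzzy variants nor "Brad von Holdt":

```json
{
  "query": {
    "match_phrase": {
      "fullName": "brady holt"
    }
  }
}
```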

Related

ElasticSearch - Phrase match on whole document? Not just one specific field

Is there a way I can use elastic match_phrase on an entire document? Not just one specific field.
We want the user to be able to enter a search term with quotes, and do a phrase match anywhere in the document.
{
"size": 20,
"from": 0,
"query": {
"match_phrase": {
"my_column_name": "I want to search for this exact phrase"
}
}
}
Currently, I have only found phrase matching for specific fields. I must specify the fields to do the phrase matching within.
Our documents have hundreds of fields, so I don't think it's feasible to manually enter the 600+ fields into every match_phrase query. The resultant JSON would be huge.
You can use a multi-match query with type phrase that runs a match_phrase query on each field and uses the _score from the best field. See phrase and phrase_prefix.
If no fields are provided, the multi_match query defaults to the
index.query.default_field index settings, which in turn defaults to *.
This extracts all fields in the mapping that are eligible to term queries and filters the metadata fields. All extracted fields are then
combined to build a query.
Adding a working example with index data, search query and search result
Index data:
{
"name":"John",
"cost":55,
"title":"Will Smith"
}
{
"name":"Will Smith",
"cost":55,
"title":"book"
}
Search Query:
{
"query": {
"multi_match": {
"query": "Will Smith",
"type": "phrase"
}
}
}
Search Result:
"hits": [
{
"_index": "64519840",
"_type": "_doc",
"_id": "1",
"_score": 1.2199391,
"_source": {
"name": "Will Smith",
"cost": 55,
"title": "book"
}
},
{
"_index": "64519840",
"_type": "_doc",
"_id": "2",
"_score": 1.2199391,
"_source": {
"name": "John",
"cost": 55,
"title": "Will Smith"
}
}
]
You can also use * in the multi_match query's fields parameter, which will search all the available fields in the document. But it will slow the query down, since you are searching the whole document.
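For instance, the query above with the wildcard spelled out explicitly (a sketch; in recent versions this is equivalent to omitting fields altogether):

```json
{
  "query": {
    "multi_match": {
      "query": "Will Smith",
      "type": "phrase",
      "fields": ["*"]
    }
  }
}
```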

Is there any way to match similar match in Elastic Search

I have a large Elasticsearch index that I am searching with the query below:
{"size": 1000, "query": {"query_string": {"query": "( string1 )"}}}
Say string1 = product. If someone accidentally types prduct (forgetting an o), is there a way to search for that as well?
{"size": 1000, "query": {"query_string": {"query": "( prdct )"}}} should also return the results for product.
You can use a fuzzy query, which returns documents containing terms similar to the search term. Refer to this blog for a detailed explanation of fuzzy queries.
Since prdct is two edits away from product, the fuzziness parameter matters here. With "fuzziness": "AUTO", the number of edits allowed depends on the term length:
0..2 characters = must match exactly
3..5 characters = one edit allowed
more than 5 characters = two edits allowed
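As a sketch, the AUTO setting looks like this — though note that for a five-character term like prdct, AUTO allows only one edit, so it would not reach product (two edits away); an explicit fuzziness of 2 is needed in that case:

```json
{
  "query": {
    "fuzzy": {
      "title": {
        "value": "prdct",
        "fuzziness": "AUTO"
      }
    }
  }
}
```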
Index Data:
{
"title":"product"
}
{
"title":"prdct"
}
Search Query:
{
"query": {
"fuzzy": {
"title": {
"value": "prdct",
"fuzziness": 2,
"transpositions":true,
"boost": 5
}
}
}
}
Search Result:
"hits": [
{
"_index": "my-index1",
"_type": "_doc",
"_id": "2",
"_score": 3.465736,
"_source": {
"title": "prdct"
}
},
{
"_index": "my-index1",
"_type": "_doc",
"_id": "1",
"_score": 2.0794415,
"_source": {
"title": "product"
}
}
]
There are many solutions to this problem:
Suggestions (did you mean X instead).
Fuzziness (edits from your original search term).
Partial matching with autocomplete (if someone types "pr" and you provide the available search terms, they can click on the correct results right away) or n-grams (matching groups of letters).
All of those have tradeoffs in index / search overhead as well as the classic precision / recall problem.
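As an illustration of the n-gram option, a minimal edge_ngram autocomplete setup might look like the following (index and field names are hypothetical; 7.x-style mappings; the search_analyzer keeps query terms from being n-grammed at search time):

```json
PUT my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "autocomplete_tokenizer",
          "filter": ["lowercase"]
        }
      },
      "tokenizer": {
        "autocomplete_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": ["letter", "digit"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "standard"
      }
    }
  }
}
```

With this mapping, "product" is indexed as pr, pro, prod, ..., so a plain match query for a prefix like "pr" finds it without fuzziness.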

Count of "actual hits" (not just matching docs) for arbitrary queries in Elasticsearch

This one really frustrates me. I tried to find a solution for quite a long time, but wherever I try to find questions from people asking for the same, they either want something a little different (like here or here or here) or don't get an answer that solves the problem (like here).
What I need
I want to know how many hits my search has in total, independently of the type of query used. I am not talking about the number of hits you always get from ES, which is the number of documents found for that query, but rather the number of occurrences of document features matching my query.
For example, I could have two documents with a text field "description", both containing the word hero, but one of them containing it twice.
Like in this minimal example here:
Index mapping:
PUT /sample
{
"settings": {
"index" : {
"number_of_shards" : 1,
"number_of_replicas" : 0
}
},
"mappings": {
"doc": {
"properties": {
"name": { "type": "keyword" },
"description": { "type": "text" }
}
}
}
}
Two sample documents:
POST /sample/doc
{
"name": "Jack Beauregard",
"description": "An aging hero"
}
POST /sample/doc
{
"name": "Master Splinter",
"description": "This rat is a hero, a real hero!"
}
...and the query:
POST /sample/_search
{
"query": {
"match": { "description": "hero" }
},
"_source": false
}
... which gives me:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.22396864,
"hits": [
{
"_index": "sample",
"_type": "doc",
"_id": "hoDsm2oB22SyyA49oDe_",
"_score": 0.22396864
},
{
"_index": "sample",
"_type": "doc",
"_id": "h4Dsm2oB22SyyA49xDf8",
"_score": 0.22227617
}
]
}
}
So there are two hits ("total": 2), which is correct, because the query matches two documents. BUT I want to know how many times my query matched inside each document (or the sum of this), which would be 3 in this example, because the second document contains the search term twice.
IMPORTANT: This is just a simple example. But I want this to work for any type of query and any mapping, also nested documents with inner_hits and all.
I didn't expect this to be so difficult, because it must be information ES comes across during search anyway, right? I mean, it ranks documents with more hits inside them higher, so why can't I get the count of these hits?
I am tempted to call them "inner hits", but that is the name of a different ES feature (see below).
What I tried / could try (but it's ugly)
I could use highlighting (which I do anyway) and try to make the highlighter generate one highlight for each "inner match" (without combining them), then post-process the complete set of search results and count all the highlights --> Of course, this is very ugly, because (1) I don't really want to post-process my results and (2) I'd have to fetch all results by setting size to a high enough value, when actually I only want the number of results requested by the client. This would be a lot of overhead!
The feature inner_hits sounds very promising, but it just means that you can handle the hits inside nested documents independently, e.g. to get a highlighting for each of them. I use this for my nested docs already, but it doesn't solve this problem because (1) it stays at the inner-hit level and (2) I want this to work with non-nested queries, too.
Is there a way to achieve this in a generic way for arbitrary queries? I'd be most thankful for any suggestions. I'm even down for solving it by tinkering with the ranking or using script fields, anything.
Thanks a lot in advance!
I would definitely not recommend this for any kind of practical use due to the awful performance, but this data is technically available in the term frequency calculation in the results from the explain API. See What is Relevance? for a conceptual explanation and Explain API for usage.
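For reference, a sketch of an explain call against one of the example documents (the ID is taken from the sample response above):

```json
POST /sample/doc/h4Dsm2oB22SyyA49xDf8/_explain
{
  "query": {
    "match": { "description": "hero" }
  }
}
```

The explanation tree in the response includes the raw term frequency (the freq / termFreq value), which for the second sample document would be 2 — the per-document count the question is after, though extracting it this way means one extra request per hit.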

Percentage of matched terms in Elasticsearch

I am using elasticsearch to find similar documents. Below is the query I am using:
{
"query": {
"more_like_this":{
"like": {
"_index": "docs",
"_type": "pdfs",
"_id": "pdf_1"
},
"min_term_freq": 1,
"min_doc_freq": 1,
"max_query_terms": 50,
"minimum_should_match": "50%"
}
}
}
I am extracting the text from PDF and storing in my index "docs". Below are the mappings for type "pdfs":
{
"properties": {
"content":{
"type": "string",
"analyzer": "my_analyzer"
}
}
}
In the result sets I am getting similar documents with their scores. Based on what I have read so far it is not possible to calculate percentage similarity based on score so I am not trying to do that. I am trying to figure out if it is possible to know:
"Out of 50 query terms from the source document how many terms are
matched in a document? or percentage of terms matched?"
As you can see, in my query I am specifying minimum_should_match as 50%, so I assume that Elasticsearch is filtering the documents somewhere based on what percentage of terms match in a document. I want to get that percentage. I am fairly new to Elasticsearch. So far I have gone through the documentation but couldn't find out how to do it.
Any pointer/help is appreciated!

Directions on how to index words and annotate with their type (entity, etc) and then Elasticsearch/w.e. returns these words with the annotations?

I'm trying to build a very simple NLP chat (I could even say pseudo-NLP?), where I want to identify a fixed subset of intentions (verbs, sentiments) and entities (products, etc)
It's a kind of entity identification or named-entity recognition, but I'm not sure I need a full-fledged NER solution for what I want to achieve. I don't care if the person types cars instead of car: they HAVE to type the EXACT word. So there is no need to deal with language processing here.
It doesn't need to identify and classify the words; I'm just looking for a way that, when I search a phrase, returns all results that contain each word of it.
I want to index something like:
want [type: intent]
buy [type: intent]
computer [type: entity]
car [type: entity]
Then the user will type:
I want to buy a car.
Then I send this phrase to ElasticSearch/Solr/w.e. and it should return me something like below (it doesn't have to be structured like that, but each word should come with its type):
[
{"word":"want", "type":"intent"},
{"word":"buy", "type":"intent"},
{"word":"car", "type":"entity"}
]
The approach I came with was Indexing each word as:
{
"word": "car",
"type": "entity"
}
{
"word": "buy",
"type": "intent"
}
And then I provide the whole phrase, searching by "word". But I have had no success so far, because Elasticsearch doesn't return any of the words, even though the phrase contains words that are indexed.
Any insights/ideas/tips to keep this using one of the main search engines?
If I do need to use a dedicated NER solution, what would be the approach to annotate words like this, without needing to worry about fixing typos, multiple languages, etc.? I want to return results only if the person types the intents and entities exactly as they are, so not an advanced NLP solution.
Curiously I didn't find much about this on google.
I created a basic index and indexed some documents like this
PUT nlpindex/mytype/1
{
"word": "buy",
"type": "intent"
}
I used query string to search for all the words that appear in a phrase
GET nlpindex/_search
{
"query": {
"query_string": {
"query": "I want to buy a car",
"default_field": "word"
}
}
}
By default the operator is OR, so it will search for every single word in the phrase in the word field.
This is the results I get
"hits": [
{
"_index": "nlpindex",
"_type": "mytype",
"_id": "1",
"_score": 0.09427826,
"_source": {
"word": "car",
"type": "entity"
}
},
{
"_index": "nlpindex",
"_type": "mytype",
"_id": "4",
"_score": 0.09427826,
"_source": {
"word": "want",
"type": "intent"
}
},
{
"_index": "nlpindex",
"_type": "mytype",
"_id": "3",
"_score": 0.09427826,
"_source": {
"word": "buy",
"type": "intent"
}
}
]
Does this help?
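A stricter alternative, assuming the word field were mapped as keyword (with terms indexed in lowercase), is a terms query over the tokenized phrase — terms queries perform no analysis, so each word must match exactly, which fits the "EXACT word" requirement:

```json
{
  "query": {
    "terms": {
      "word": ["i", "want", "to", "buy", "a", "car"]
    }
  }
}
```

The client would split and lowercase the user's phrase before building the query.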
