Elasticsearch Fuzzy Phrases

I have the following query to add fuzziness to my search. However, I now realize that the match query doesn't consider the order of the words in the search string the way match_phrase does, and I can't get match_phrase to give me results with fuzziness. Is there a way to tell match to consider the order and distance between words?
{
"query": {
"match": {
"content": {
"query": "some search terms like this",
"fuzziness": 1,
"operator": "and"
}
}
}
}

I eventually figured out that I needed to use a combination of span queries, which allow a great deal of fine-tuning of fuzziness and slop. I needed to add a function that manually tokenizes my phrases and adds each token to the "clauses" array programmatically:
{"query":
{
"span_near": {
"clauses": [
{
"span_multi": {
"match": {
"fuzzy": {
"content": {
"fuzziness": "2",
"value": "word"
}
}
}
}
},
{
"span_multi": {
"match": {
"fuzzy": {
"content": {
"fuzziness": "2",
"value": "another"
}
}
}
}
}
],
"slop": 1,
"in_order": "true"
}
}
}

@econgineer Excellent post.
I wanted to try this for an ES query we are working on, but I am too lazy to keep writing the JSON by hand.
I think this code works. It prints with json.dumps rather than pprint, because pprint emits Python-style single quotes, which jq rejects.
import json
from collections import defaultdict

# Auto-vivifying dict: accessing a missing key creates another nested dict.
nested_dict = lambda: defaultdict(nested_dict)

query = nested_dict()
span_near = query['query']['span_near']
span_near['clauses'] = []
# slop and in_order are span_near parameters, so they live inside it.
span_near['slop'] = 1
span_near['in_order'] = True
words = ['what', 'is', 'this']
for w in words:
    nest = nested_dict()
    # "msg" is the field being searched; each word becomes one fuzzy clause.
    nest['span_multi']['match']['fuzzy']['msg']['value'] = w
    nest['span_multi']['match']['fuzzy']['msg']['fuzziness'] = '2'
    span_near['clauses'].append(nest)
# json.dumps serializes the defaultdicts as plain JSON objects.
print(json.dumps(query, indent=2, sort_keys=True))
If you save the output as t2.json and beautify it with
jq '.' t2.json
You should see something like
{
  "query": {
    "span_near": {
      "clauses": [
        {
          "span_multi": {
            "match": {
              "fuzzy": {
                "msg": {
                  "fuzziness": "2",
                  "value": "what"
                }
              }
            }
          }
        },
        {
          "span_multi": {
            "match": {
              "fuzzy": {
                "msg": {
                  "fuzziness": "2",
                  "value": "is"
                }
              }
            }
          }
        },
        {
          "span_multi": {
            "match": {
              "fuzzy": {
                "msg": {
                  "fuzziness": "2",
                  "value": "this"
                }
              }
            }
          }
        }
      ],
      "in_order": true,
      "slop": 1
    }
  }
}
And then to query ES it is just a normal
curl --silent -H 'Content-Type: application/json' My_ES_Server:9200/INDEX/_search -d @t2.json
Many thanks for the initial guidance; I hope someone else finds this useful.

Indeed, an excellent question and answer.
I'm surprised that this 'fuzzy phrase match' isn't supported out of the box.
Here's tested Node.js code that generates the fuzzy phrase match (multi-clause) query block. It is written for a multi search (msearch), but it should work just the same with a single search.
Usage:
const queryBody = [
{ index: 'YOUR_INDEX' },
createESFuzzyPhraseQueryBlock('YOUR PHRASE', 'YOUR_FIELD_NAME', 2)
];
client.msearch({
body: queryBody
})
Functions:
const createESFuzzyPhraseClauseBlock = (word, esFieldName, fuzziness) => {
const clauseBlock = {
"span_multi": {
"match": {
"fuzzy": {
[esFieldName]: {
"fuzziness": fuzziness,
"value": word
}
}
}
}
};
return clauseBlock;
};
const createESFuzzyPhraseQueryBlock = (phrase, esFieldName, fuzziness) => {
const clauses = phrase.split(' ').map(word => createESFuzzyPhraseClauseBlock(word, esFieldName, fuzziness));
const queryBlock =
{
"query":
{
"span_near": {
"clauses": clauses,
"slop": 1,
"in_order": "true"
}
}
};
return queryBlock;
};

Consider also mixing the queries. For me, the basic query looked like this: for phrases of length 2 I used a prefix query, and for the rest I used a match query with fuzziness set to AUTO.
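The mix described above could be sketched as a small helper that picks the query type from the input. This is only an illustration, not the poster's actual code: `build_mixed_query` is a hypothetical name, "length 2" is read here as "at most two characters", the field is assumed to be `content` as in the question, and the input is lowercased on the assumption that the field's analyzer lowercases terms (prefix queries are not analyzed).

```python
import json


def build_mixed_query(phrase, field="content"):
    """Return a prefix query for very short input, else a fuzzy match query.

    Assumption: "length 2" means "at most two characters".
    """
    phrase = phrase.strip()
    if len(phrase) <= 2:
        # Too short for fuzzy matching to help; match by prefix instead.
        # Lowercased because prefix queries bypass the analyzer (assumption:
        # the indexed terms are lowercase).
        return {"query": {"prefix": {field: phrase.lower()}}}
    # Longer input: let Elasticsearch choose the edit distance per term.
    return {
        "query": {
            "match": {
                field: {
                    "query": phrase,
                    "fuzziness": "AUTO",
                    "operator": "and",
                }
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_mixed_query("El"), indent=2))
    print(json.dumps(build_mixed_query("some search terms"), indent=2))
```

Note that the match branch still ignores word order, so it complements the span_near approach above rather than replacing it.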

Related

How to transform a Kibana query to `elasticsearch_dsl` query

I have a query
GET index/_search
{
"query": {
"bool": {
"should": [
{
"match": {
"key1": "value"
}
},
{
"wildcard": {
"key2": "*match*"
}
}
]
}
}
}
I want to make the same call with elasticsearch_dsl package
I tried with
s = Search(index=index).query({
"bool": {
"should": [
{
"match": {
"key1": "value"
}
},
{
"wildcard": {
"key2": "*match*"
}
}
]
}
})
s.using(self.client).scan()
But the results are not the same. Am I missing something here?
Is there a way to represent my query with elasticsearch_dsl?
I tried this, with no results:
s = Search(index=index).query('wildcard', key2='*match*').query('match', key1=value)
s.using(self.client).scan()
It seems to me that you forgot the stars in the query:
s = Search(index=index).query('wildcard', key='*match*').query('match', key=value)
This query worked for me:
# Chained .query() calls are ANDed (bool/must), not ORed.
s = Search(index=index).query('match', key1=value) \
    .query('wildcard', key2='*match*') \
    .source(fields)
Also, in my experience, if the key contains an underscore (like key_1), Elasticsearch behaves differently and the query returns results that do not match it. So try to choose keys that do not contain underscores.

Search for documents with exactly different fields values

I'm adding documents with the following structure:
{
"proposta": {
"matriculaIndicacao": 654321,
"filial": 100,
"cpf": "12345678901",
"idStatus": "3",
"status": "Reprovada",
"dadosPessoais": {
"nome": "John Five",
"dataNascimento": "1980-12-01",
"email": "fulanodasilva@fulano.com.br",
"emailValidado": true,
"telefoneCelular": "11 99876-9999",
"telefoneCelularValidado": true,
"telefoneResidencial": "11 2211-1122",
"idGenero": "1",
"genero": "M"
}
}
}
I'm trying to perform a search with multiple field values.
I can successfully search for a document with a specific cpf attribute using the following search:
{
"query": {
"term" : {
"proposta.cpf" : "23798770823"
}
}
}
But now I need to add an AND clause, like
{
"query": {
"term" : {
"proposta.cpf" : "23798770823"
,"proposta.dadosPessoais.dataNascimento": "1980-12-01"
}
}
}
but it's returning an error message.
P.S: If possible I would like to perform a search where if the field doesn't exist, it returns the document that matches only the proposta.cpf field.
I really appreciate any help.
The idea is to combine your constraints within a bool/should query
{
"query": {
"bool": {
"should": [
{
"term": {
"proposta.cpf": "23798770823"
}
},
{
"term": {
"proposta.dadosPessoais.dataNascimento": "1980-12-01"
}
}
]
}
}
}
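Keep in mind that `should` clauses are ORed. If the cpf must always match, and the date should only be enforced on documents that actually have that field (as the P.S. asks), one possible shape is a `must` on the cpf plus a nested `bool` that accepts either the matching date or the absence of the field. A minimal sketch under those assumptions; `build_cpf_query` and `DATE_FIELD` are hypothetical names used only for illustration:

```python
import json

DATE_FIELD = "proposta.dadosPessoais.dataNascimento"


def build_cpf_query(cpf, birth_date):
    """cpf must match; the date must match only when the field exists."""
    date_or_missing = {
        "bool": {
            "should": [
                {"term": {DATE_FIELD: birth_date}},
                # Also accept documents that have no such field at all.
                {"bool": {"must_not": {"exists": {"field": DATE_FIELD}}}},
            ],
            # At least one of the two alternatives above has to hold.
            "minimum_should_match": 1,
        }
    }
    return {
        "query": {
            "bool": {
                "must": [
                    {"term": {"proposta.cpf": cpf}},
                    date_or_missing,
                ]
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_cpf_query("23798770823", "1980-12-01"), indent=2))
```

For a strict AND on both fields, the two `term` clauses from the answer could simply be placed under `must` instead of `should`.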

Boolean AND with exact matches in Elasticsearch

In our Elasticsearch collection of products, we have an array of hashes called "nutrients". A partial example of the data would be:
"_source": {
"quantity": "150.0",
"id": 1001,
"barcode": "7610809001066",
"nutrients": [
{
"per_hundred": "1010.0",
"name_fr": "Énergie",
"per_portion": "758.0",
"name_de": "Energie",
"per_day": "9.0",
"name_it": "Energia",
"name_en": "Energy"
},
{
"per_hundred": "242.0",
"name_fr": "Énergie (kCal)",
"per_portion": "181.0",
"name_de": "Energie (kCal)",
"per_day": "9.0",
"name_it": "Energia (kCal)",
"name_en": "Energy (kCal)"
},
{
"per_hundred": "18.0",
"name_fr": "Matières grasses",
"per_portion": "13.5",
"name_de": "Fett",
"per_day": "19.0",
"name_it": "Grassi",
"name_en": "Fat"
},
In the search, we are trying to bring back products based on an exact match of two of the fields contained in the nutrients array. What I am finding is that the conditions seem to be combined with OR rather than AND.
The two attempts have been:
"query": {
"bool": {
"must": [
{ "match": { "nutrients.name_fr": "Énergie" } },
{ "match": { "nutrients.per_hundred": "242.0" } }
]
}
}
}
and
"query": {
"filtered": {
"filter": {
"and": [
{ "term": { "nutrients.name_fr": "Énergie" } },
{ "term": { "nutrients.per_hundred": "242.0" } }
]
}
}
}
Both of these do bring back entries with Énergie and 242.0, but they also match entries with a different name_fr, e.g.:
{
"per_hundred": "242.0",
"name_fr": "Acide folique",
"per_portion": "96.0",
"name_de": "Folsäure",
"per_day": "48.0",
"name_it": "Acido folico",
"name_en": "Folic acid"
},
They also match inexactly, e.g. on "Énergie (kCal)" when we want to match only "Énergie".
On your first problem:
You have to make the nutrients field nested, so that each object inside it can be queried on its own (see Elasticsearch Nested Objects).
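As a rough illustration of that suggestion, the mapping and query might look like the following. This is a sketch, not a drop-in fix: it assumes a recent Elasticsearch version with typeless mappings (the `filtered` query in the question suggests a much older one), and it assumes `name_fr` and `per_hundred` are mapped as `keyword` so that `term` compares the whole value, which should also keep "Énergie (kCal)" from matching "Énergie". `NUTRIENTS_MAPPING` and `nutrient_query` are made-up names.

```python
import json

# Assumed mapping: each nutrient is indexed as its own hidden document,
# so conditions are evaluated per nutrient instead of across the array.
NUTRIENTS_MAPPING = {
    "mappings": {
        "properties": {
            "nutrients": {
                "type": "nested",
                "properties": {
                    "name_fr": {"type": "keyword"},
                    "per_hundred": {"type": "keyword"},
                },
            }
        }
    }
}


def nutrient_query(name_fr, per_hundred):
    """Match products that have ONE nutrient carrying both values."""
    return {
        "query": {
            "nested": {
                "path": "nutrients",
                "query": {
                    "bool": {
                        "must": [
                            {"term": {"nutrients.name_fr": name_fr}},
                            {"term": {"nutrients.per_hundred": per_hundred}},
                        ]
                    }
                },
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(NUTRIENTS_MAPPING, indent=2, ensure_ascii=False))
    print(json.dumps(nutrient_query("Énergie", "242.0"), indent=2, ensure_ascii=False))
```

Changing a field to `nested` requires reindexing the existing documents.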

Elasticsearch: Get report of unmatched should elements in a bool query

I'm looking for a way to get a report of unmatched should queries and display it.
For instance I have two user objects
User 1:
{
"username": "user1"
"docType": "user"
"level": "Professor"
"discipline": "Sciences"
"sub-discipline": "Mathematical"
}
User 2:
{
"username": "user1"
"docType": "user"
"level": "Professor"
"discipline": "Sciences"
"subDiscipline": "Physics"
}
When I do a bool query where the matching discipline is in must query and the sub-discipline is in the should query
bool:
must: [{
term: { "doc.docType": "user" }
},{
term: { "doc.level": "professor" }
},{
term: { "doc.discipline": "sciences" }
}],
should: [{
term: { "subDiscipline": "physics" }
}]
How can I get the unmatched elements in my result, like this:
Result 1: user1 match 100%
Result 2: user2 match 70% (unmatched subDiscipline "physics")
I had a look at the Explain API, but its output doesn't seem to be intended for this use case and looks very complicated to parse.
You will need to use named queries for this.
Along the same lines, create a bool query like the one below:
{
"query": {
"bool": {
"must": [
{
"match": {
"SourceName": {
"query": "CNN",
"_name": "sourceMatch"
}
}
},
{
"match": {
"author": {
"query": "qbox.io",
"_name": "author"
}
}
}
]
}
}
}
In the response, each hit has a matched_queries field that lists which named queries matched.
You can use this information to build the stats you are looking for.
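Turning `matched_queries` into such a report could look roughly like this. It is a sketch only: `unmatched_report` is a hypothetical helper, the sample hits are invented, and the percentage is a plain ratio of matched to total named queries rather than the weighting used in the question.

```python
def unmatched_report(hits, query_names):
    """For each hit, list the named queries that did not match."""
    if not query_names:
        raise ValueError("query_names must not be empty")
    report = []
    for hit in hits:
        # matched_queries holds the _name of every named query that matched.
        matched = set(hit.get("matched_queries", []))
        missing = [name for name in query_names if name not in matched]
        percent = 100 * (len(query_names) - len(missing)) // len(query_names)
        report.append(
            {"id": hit["_id"], "match_percent": percent, "unmatched": missing}
        )
    return report


if __name__ == "__main__":
    # Invented sample hits, shaped like response["hits"]["hits"].
    sample_hits = [
        {"_id": "1", "matched_queries": ["level", "discipline", "subDiscipline"]},
        {"_id": "2", "matched_queries": ["level", "discipline"]},
    ]
    names = ["level", "discipline", "subDiscipline"]
    for row in unmatched_report(sample_hits, names):
        print(row)
```

Each clause in the bool query, including the ones under `should`, needs its own `_name` for this to work.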

Querystring search on array elements in Elastic Search

I'm trying to learn elasticsearch with a simple example application, that lists quotations associated with people. The example mapping might look like:
{
"people" : {
"properties" : {
"name" : { "type" : "string"},
"quotations" : { "type" : "string" }
}
}
}
Some example data might look like:
{ "name" : "Mr A",
"quotations" : [ "quotation one, this and that and these"
, "quotation two, those and that"]
}
{ "name" : "Mr B",
"quotations" : [ "quotation three, this and that"
, "quotation four, those and these"]
}
I would like to be able to use the querystring api on individual quotations, and return the people who match. For instance, I might want to find people who have a quotation that contains (this AND these) - which should return "Mr A" but not "Mr B", and so on. How can I achieve this?
EDIT1:
Andrei's answer below seems to work, with data values now looking like:
{"name":"Mr A","quotations":[{"value" : "quotation one, this and that and these"}, {"value" : "quotation two, those and that"}]}
However, I can't seem to get a query_string query to work. The following produces no results:
{
"query": {
"nested": {
"path": "quotations",
"query": {
"query_string": {
"default_field": "quotations",
"query": "quotations.value:this AND these"
}
}
}
}
}
Is there a way to get a query_string query working with a nested object?
Edit2: Yes it is, see Andrei's answer.
To achieve that, you need to look at nested objects, so that the query runs against the individual values of the nested object rather than against a flattened list of values. For example:
{
"mappings": {
"people": {
"properties": {
"name": {
"type": "string"
},
"quotations": {
"type": "nested",
"properties": {
"value": {
"type": "string"
}
}
}
}
}
}
}
Values:
{"name":"Mr A","quotations":[{"value": "quotation one, this and that and these"}, {"value": "quotation two, those and that"}]}
{"name":"Mr B","quotations":[{"value": "quotation three, this and that"}, {"value": "quotation four, those and these"}]}
Query:
{
"query": {
"nested": {
"path": "quotations",
"query": {
"bool": {
"must": [
{ "match": {"quotations.value": "this"}},
{ "match": {"quotations.value": "these"}}
]
}
}
}
}
}
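For the query_string variant asked about in the edit, one thing that may be worth trying is to point `default_field` at the full nested field (`quotations.value`) rather than at `quotations`. In the query from the question, only `this` is tied to `quotations.value`, while `these` falls back to the default field. A sketch of that idea, with `nested_query_string` as a hypothetical helper name:

```python
import json


def nested_query_string(path, field, expression):
    """Wrap a query_string expression in a nested query on `path`."""
    # Terms without an explicit field prefix are searched in this field.
    full_field = f"{path}.{field}"
    return {
        "query": {
            "nested": {
                "path": path,
                "query": {
                    "query_string": {
                        "default_field": full_field,
                        "query": expression,
                    }
                },
            }
        }
    }


if __name__ == "__main__":
    body = nested_query_string("quotations", "value", "this AND these")
    print(json.dumps(body, indent=2))
```

With the sample data above, the intent is that only "Mr A" is returned, since only his first quotation contains both words.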
Unfortunately there is no good way to do that. Quoting the (archived) Elasticsearch guide:
https://web.archive.org/web/20141021073225/http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/complex-core-fields.html
When you get a document back from Elasticsearch, any arrays will be in
the same order as when you indexed the document. The _source field
that you get back contains exactly the same JSON document that you
indexed.
However, arrays are indexed — made searchable — as multi-value fields,
which are unordered. At search time you can’t refer to “the first
element” or “the last element”. Rather think of an array as a bag of
values.
In other words, it is always considering all values in the array.
This will return only Mr A
{
"query": {
"match": {
"quotations": {
"query": "quotation one",
"operator": "AND"
}
}
}
}
But this will return both Mr A & Mr B:
{
"query": {
"match": {
"quotations": {
"query": "this these",
"operator": "AND"
}
}
}
}
If scripting is enabled, this should work:
"script": {
"inline": "for (element in _source.quotations) { if (element.contains('this') && element.contains('these')) { return true; } }; return false;"
}
