ElasticSearch: Partial/Exact Scoring with edge_ngram & fuzziness

In ElasticSearch I am trying to get correct scoring using edge_ngram with fuzziness. I would like exact matches to have the highest score and partial matches to have lower scores. Below are my setup and scoring results.
settings: {
  number_of_shards: 1,
  analysis: {
    filter: {
      ngram_filter: {
        type: 'edge_ngram',
        min_gram: 2,
        max_gram: 20
      }
    },
    analyzer: {
      ngram_analyzer: {
        type: 'custom',
        tokenizer: 'standard',
        filter: [
          'lowercase',
          'ngram_filter'
        ]
      }
    }
  }
},
mappings: [{
  name: 'voter',
  _all: {
    type: 'string',
    index_analyzer: 'ngram_analyzer',
    search_analyzer: 'standard'
  },
  properties: {
    last: {
      type: 'string',
      required: true,
      include_in_all: true,
      term_vector: 'yes',
      index_analyzer: 'ngram_analyzer',
      search_analyzer: 'standard'
    },
    first: {
      type: 'string',
      required: true,
      include_in_all: true,
      term_vector: 'yes',
      index_analyzer: 'ngram_analyzer',
      search_analyzer: 'standard'
    }
  }
}]
After indexing a document with the first name "Michael" via POST, I run the query below, changing the query string to "Michael", "Michae", "Micha", "Mich", "Mic", and "Mi".
GET voter/voter/_search
{
  "query": {
    "match": {
      "_all": {
        "query": "Michael",
        "fuzziness": 2,
        "prefix_length": 1
      }
    }
  }
}
My score results are:
-"Michael": 0.19535106
-"Michae": 0.2242768
-"Micha": 0.24513611
-"Mich": 0.22340237
-"Mic": 0.21408978
-"Mi": 0.15438235
As you can see, the scores are not what I expected: I would like "Michael" to have the highest score and "Mi" the lowest.
Any help would be appreciated!

One way to approach this problem would be to add a raw version of the text field to your mapping, like this:
last: {
  type: 'string',
  required: true,
  include_in_all: true,
  term_vector: 'yes',
  index_analyzer: 'ngram_analyzer',
  search_analyzer: 'standard',
  fields: {
    raw: {
      type: 'string' // <--- indexed with the standard analyzer
    }
  }
},
first: {
  type: 'string',
  required: true,
  include_in_all: true,
  term_vector: 'yes',
  index_analyzer: 'ngram_analyzer',
  search_analyzer: 'standard',
  fields: {
    raw: {
      type: 'string' // <--- indexed with the standard analyzer
    }
  }
},
You could also make the raw sub-field an exact match by setting index: 'not_analyzed'.
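For example, a minimal sketch of what that sub-field could look like (the name raw is just a convention, and not_analyzed is ES 1.x syntax):
first: {
  type: 'string',
  index_analyzer: 'ngram_analyzer',
  search_analyzer: 'standard',
  fields: {
    raw: {
      type: 'string',
      index: 'not_analyzed' // kept as a single untouched token for exact matching
    }
  }
}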
Then you can query like this
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "_all": {
              "query": "Michael",
              "fuzziness": 2,
              "prefix_length": 1
            }
          }
        },
        {
          "match": {
            "last.raw": {
              "query": "Michael",
              "boost": 5
            }
          }
        },
        {
          "match": {
            "first.raw": {
              "query": "Michael",
              "boost": 5
            }
          }
        }
      ]
    }
  }
}
Documents that match more clauses will be scored higher.
You can adjust the boost values according to your requirements.

Related

elasticsearch nested query returns only last 3 results

We have the following Elasticsearch mapping:
{
  index: 'data',
  body: {
    settings: {
      analysis: {
        analyzer: {
          lowerCase: {
            tokenizer: 'whitespace',
            filter: ['lowercase']
          }
        }
      }
    },
    mappings: {
      // used for the _all field
      _default_: {
        index_analyzer: 'lowerCase'
      },
      entry: {
        properties: {
          id: { type: 'string', analyzer: 'lowerCase' },
          type: { type: 'string', analyzer: 'lowerCase' },
          name: { type: 'string', analyzer: 'lowerCase' },
          blobIds: {
            type: 'nested',
            properties: {
              id: { type: 'string' },
              filename: { type: 'string', analyzer: 'lowerCase' }
            }
          }
        }
      }
    }
  }
}
and a sample document that looks like the following:
{
  "id": "5f02e9dae252732912749e13",
  "type": "test_type",
  "name": "test_name",
  "creationTimestamp": "2020-07-06T09:07:38.775Z",
  "blobIds": [
    {
      "id": "5f02e9dbe252732912749e18",
      "filename": "test1.csv"
    },
    {
      "id": "5f02e9dbe252732912749e1c",
      "filename": "test2.txt"
    },
    // removed in-between elements for simplicity
    {
      "id": "5f02e9dbe252732912749e1e",
      "filename": "test3.csv"
    },
    {
      "id": "5f02e9dbe252732912749e58",
      "filename": "test4.txt"
    },
    {
      "id": "5f02e9dbe252732912749e5a",
      "filename": "test5.csv"
    },
    {
      "id": "5f02e9dbe252732912749e5d",
      "filename": "test6.txt"
    }
  ]
}
I have the following ES query, which selects documents within a certain time range based on the creationTimestamp field and then filters the nested blobIds field based on a user query that should match the blobIds.filename field.
{
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "creationTimestamp": {
                  "gte": "2020-07-01T09:07:38.775Z",
                  "lte": "2020-07-07T09:07:40.147Z"
                }
              }
            },
            {
              "nested": {
                "path": [
                  "blobIds"
                ],
                "query": {
                  "query_string": {
                    "fields": [
                      "blobIds.filename"
                    ],
                    "query": "*"
                  }
                },
                // returns the actual blobId hit
                // and not the whole array
                "inner_hits": {}
              }
            },
            {
              "query": {
                "query_string": {
                  "query": "+type:*test_type* +name:*test_name*"
                }
              }
            }
          ]
        }
      }
    }
  },
  "sort": [
    {
      "creationTimestamp": {
        "order": "asc"
      },
      "id": {
        "order": "asc"
      }
    }
  ]
}
The above entry clearly matches the query. However, something seems to be wrong with the returned inner_hits: I always get only the last 3 blobIds elements instead of the whole array of 24 elements, as can be seen below.
{
  "name": "test_name",
  "creationTimestamp": "2020-07-06T09:07:38.775Z",
  "id": "5f02e9dae252732912749e13",
  "type": "test_type",
  "blobIds": [
    {
      "id": "5f02e9dbe252732912749e5d",
      "filename": "test4.txt"
    },
    {
      "id": "5f02e9dbe252732912749e5a",
      "filename": "test5.csv"
    },
    {
      "id": "5f02e9dbe252732912749e58",
      "filename": "test6.txt"
    }
  ]
}
I find it very strange since I'm only doing a simple * query.
I'm using Elasticsearch v1.7 and cannot upgrade at the moment.
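One thing worth checking (an assumption on my part, since the post doesn't mention it): inner_hits returns only the top three matching nested documents by default, so the truncation to 3 elements may simply be that default. Raising its size should return more, for example:
"inner_hits": {
  "size": 100
}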

Elasticsearch multi-match not returning all results when providing empty string

I have a total of 1783 records, and I want ES to return all of them when no multi_match query is provided (searchObject.query = '').
I can manage that by passing an empty array to query.bool.should, so in theory I could build the ES object below conditionally based on the searchObject.query value, but I'm not sure that's a good idea.
{
  _source: [
    'id',
    'event',
    'description',
    'element',
    'date'
  ],
  track_total_hits: true,
  query: {
    bool: {
      should: [{
        multi_match: {
          query: searchObject.query,
          fields: ["element", "description", "nar.*", "title", "identifier"]
        }
      }],
      filter: []
    }
  },
  highlight: { fields: { '*': {} } },
  sort: [],
  from: 0,
  size: 10
}
Any suggestions?
You can append a match_all to the should:
{
  "_source": [
    "id",
    "event",
    "description",
    "element",
    "date"
  ],
  "track_total_hits": true,
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "query": "",
            "fields": [
              "line",
              "element",
              "description",
              "nar.*",
              "title",
              "identifier"
            ]
          }
        },
        {
          "match_all": {}
        }
      ],
      "filter": []
    }
  },
  "highlight": {
    "fields": {
      "*": {}
    }
  },
  "sort": [],
  "from": 0,
  "size": 10
}
That's what match_all is usually for. IMHO the empty string should be checked before you send the ES request; I'm assuming it comes from an autocomplete box or something similar.
This is controlled by the match query's zero_terms_query property. Just add it to your multi_match block: "zero_terms_query": "all".
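A minimal sketch of that change, reusing the field list from the question (with zero_terms_query set to all, an empty query string behaves like match_all instead of matching nothing):
{
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "query": "",
            "fields": ["element", "description", "nar.*", "title", "identifier"],
            "zero_terms_query": "all"
          }
        }
      ],
      "filter": []
    }
  }
}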

elasticsearch: use aggregated value for filtering

I'm using a nested mapping (below) that represents a "task" and has a nested array of "requests" recording progress towards that task.
I'm trying to find all tasks which have not made progress, i.e. all documents for which the "max" aggregation on the nested objects is empty. This requires being able to filter on the result of an aggregation, and that's where I'm stuck.
I can order by the results of the aggregation, but I can't find a way to filter.
Is there such a capability?
mapping:
mapping = {
  properties: {
    'prefix' => {
      type: "string",
      store: true,
      index: "not_analyzed"
    },
    'last_marker' => {
      type: "string",
      store: true,
      index: "not_analyzed"
    },
    'start_time' => {
      type: "date",
      store: true,
      index: "not_analyzed"
    },
    'end_time' => {
      type: "date",
      store: true,
      index: "not_analyzed"
    },
    'obj_count' => {
      type: "long",
      store: true,
      index: "not_analyzed"
    },
    'requests' => {
      type: 'nested',
      include_in_parent: true,
      'properties' => {
        'start_time' => {
          type: "date",
          store: true,
          index: "not_analyzed"
        },
        'end_time' => {
          type: "date",
          store: true,
          index: "not_analyzed"
        },
        'amz_req_id' => {
          type: "string",
          store: true,
          index: "not_analyzed"
        },
        'last_marker' => {
          type: "string",
          store: true,
          index: "not_analyzed"
        }
      }
    }
  }
}
The query that orders by the aggregation (I'm still looking for a way to filter):
{
  "size": 0,
  "aggs": {
    "pending_prefix": {
      "terms": {
        "field": "prefix",
        "order": { "max_date": "asc" },
        "size": 20000
      },
      "aggs": {
        "max_date": {
          "max": {
            "field": "requests.end_time"
          }
        }
      }
    }
  }
}
This is like a HAVING clause in SQL terms. It is not possible in the current Elasticsearch release.
It should become possible in the upcoming 2.0 release with the newly introduced pipeline aggregations.
More: https://www.elastic.co/blog/out-of-this-world-aggregations
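For illustration, a rough sketch of what that could look like with a bucket_selector pipeline aggregation (ES 2.0+ syntax; the cutoff value and script are placeholders, not from the original question):
{
  "size": 0,
  "aggs": {
    "pending_prefix": {
      "terms": {
        "field": "prefix",
        "size": 20000
      },
      "aggs": {
        "max_date": {
          "max": { "field": "requests.end_time" }
        },
        "stalled_only": {
          "bucket_selector": {
            "buckets_path": { "maxDate": "max_date" },
            "script": "maxDate < 1420070400000"
          }
        }
      }
    }
  }
}
The bucket_selector runs per terms bucket and keeps only the buckets whose script returns true, which is the HAVING-like behaviour described above.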

Elasticsearch how to use multi_match with wildcard

I have a User object with properties Name and Surname. I want to search these fields using one query, and I found multi_match in the documentation, but I don't know how to properly use that with a wildcard. Is it possible?
I tried with a multi_match query but it didn't work:
{
  "query": {
    "multi_match": {
      "query": "*mar*",
      "fields": [
        "user.name",
        "user.surname"
      ]
    }
  }
}
Alternatively you could use a query_string query with wildcards.
"query": {
"query_string": {
"query": "*mar*",
"fields": ["user.name", "user.surname"]
}
}
This will be slower than using an nGram filter at index-time (see my other answer), but if you are looking for a quick and dirty solution...
Also I am not sure about your mapping, but if you are using user.name instead of name your mapping needs to look like this:
"your_type_name_here": {
"properties": {
"user": {
"type": "object",
"properties": {
"name": {
"type": "string"
},
"surname": {
"type": "string"
}
}
}
}
}
Such a query worked for me:
{
  "query": {
    "filtered": {
      "query": {
        "match_all": {}
      },
      "filter": {
        "bool": {
          "should": [
            { "query": { "wildcard": { "user.name": { "value": "*mar*" } } } },
            { "query": { "wildcard": { "user.surname": { "value": "*mar*" } } } }
          ]
        }
      }
    }
  }
}
Similar to what you are doing, except that in my case there could be different masks for different fields.
I just did this now:
GET _search
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "theDate": {
              "gte": "2014-01-01",
              "lte": "2014-12-31"
            }
          }
        },
        {
          "match": {
            "Country": "USA"
          }
        }
      ],
      "should": [
        {
          "wildcard": { "Id_A": "0*" }
        },
        {
          "wildcard": { "Id_B": "0*" }
        }
      ],
      "minimum_number_should_match": 1
    }
  }
}
Similar to the suggestion above, but this is simpler and worked for me:
{
  "query": {
    "bool": {
      "must": [
        {
          "wildcard": { "processname.keyword": "*system*" }
        },
        {
          "wildcard": { "username": "*admin*" }
        },
        {
          "wildcard": { "device_name": "*10*" }
        }
      ]
    }
  }
}
I would not use wildcards; they will not scale well. You are asking a lot of the search engine at query time. You can use the nGram filter to do the processing at index time instead of search time.
See this discussion on the nGram filter.
After indexing the name and surname correctly (change your mapping; there are examples in the above link) you can use multi_match without wildcards and get the expected results.
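A rough sketch of that index-time approach for the fields in the question (analyzer names and gram sizes are illustrative, assuming ES 1.x syntax):
{
  "settings": {
    "analysis": {
      "filter": {
        "name_ngrams": { "type": "nGram", "min_gram": 2, "max_gram": 20 }
      },
      "analyzer": {
        "ngram_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "name_ngrams"]
        }
      }
    }
  },
  "mappings": {
    "user": {
      "properties": {
        "name": {
          "type": "string",
          "index_analyzer": "ngram_analyzer",
          "search_analyzer": "standard"
        },
        "surname": {
          "type": "string",
          "index_analyzer": "ngram_analyzer",
          "search_analyzer": "standard"
        }
      }
    }
  }
}
With the nGrams generated at index time, a plain multi_match for "mar" on name and surname matches without any wildcards.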
description: {
  type: 'keyword',
  normalizer: 'useLowercase',
},
product: {
  type: 'object',
  properties: {
    name: {
      type: 'keyword',
      normalizer: 'useLowercase',
    },
  },
},
activity: {
  type: 'object',
  properties: {
    name: {
      type: 'keyword',
      normalizer: 'useLowercase',
    },
  },
},
query:
query: {
  bool: {
    must: [
      {
        bool: {
          should: [
            {
              wildcard: {
                description: {
                  value: `*${value ? value : ''}*`,
                  boost: 1.0,
                  rewrite: 'constant_score',
                },
              },
            },
            {
              wildcard: {
                'product.name': {
                  value: `*${value ? value : ''}*`,
                  boost: 1.0,
                  rewrite: 'constant_score',
                },
              },
            },
            {
              wildcard: {
                'activity.name': {
                  value: `*${value ? value : ''}*`,
                  boost: 1.0,
                  rewrite: 'constant_score',
                },
              },
            },
          ],
        },
      },
      {
        match: {
          recordStatus: RecordStatus.Active,
        },
      },
      {
        bool: {
          must_not: [
            {
              term: {
                'user.id': req.currentUser?.id,
              },
            },
          ],
        },
      },
      {
        bool: {
          should: tags
            ? tags.map((name: string) => {
                return {
                  nested: {
                    path: 'tags',
                    query: {
                      match: {
                        'tags.name': name,
                      },
                    },
                  },
                };
              })
            : [],
        },
      },
    ],
    filter: {
      bool: {
        must_not: {
          terms: {
            id: existingIds ? existingIds : [],
          },
        },
      },
    },
  },
},
sort: [
  {
    updatedAt: {
      order: 'desc',
    },
  },
],

ElasticSearch filter boosting based on field value

I have the following query:
{
  "from": 0,
  "query": {
    "custom_filters_score": {
      "filters": [
        {
          "boost": 1.5,
          "filter": {
            "term": {
              "format": "test1"
            }
          }
        },
        {
          "boost": 1.5,
          "filter": {
            "term": {
              "format": "test2"
            }
          }
        }
      ],
      "query": {
        "bool": {
          "must": {
            "query_string": {
              "analyzer": "query_default",
              "fields": [
                "title^5",
                "description^2",
                "indexable_content"
              ],
              "query": "blah"
            }
          },
          "should": []
        }
      }
    }
  },
  "size": 50
}
This should boost documents that have {"format": "test1"} in them, if I'm reading the documentation correctly.
However, using explain tells me that "custom score, no filter match, product of:" is the outcome, and the score of the returned documents that match the filter isn't changed by this filter.
What am I doing wrong?
Edit: here's the schema:
mapping:
  edition:
    _all: { enabled: true }
    properties:
      title: { type: string, index: analyzed }
      description: { type: string, index: analyzed }
      format: { type: string, index: not_analyzed, include_in_all: false }
      section: { type: string, index: not_analyzed, include_in_all: false }
      subsection: { type: string, index: not_analyzed, include_in_all: false }
      subsubsection: { type: string, index: not_analyzed, include_in_all: false }
      link: { type: string, index: not_analyzed, include_in_all: false }
      indexable_content: { type: string, index: analyzed }
and let's assume a typical document is like:
{
  "format": "test1",
  "title": "blah",
  "description": "blah",
  "indexable_content": "keywords here",
  "section": "section",
  "subsection": "child-section",
  "link": "/section/child-section/blah"
}
If it says "no filter match", it means the document didn't match any of the filters in your query. The most likely reason for this is that the records that match your query don't have the term "test1" in them. Unfortunately, you didn't provide the mapping and test data, so it's difficult to tell for sure what's going on there.
Try running this query to see if you can actually find any records that match your search criteria and should be boosted:
{
  "from": 0,
  "query": {
    "bool": {
      "must": [{
        "query_string": {
          "analyzer": "query_default",
          "fields": ["title^5", "description^2", "indexable_content"],
          "query": "blah"
        }
      }, {
        "term": {
          "format": "test1"
        }
      }]
    }
  },
  "size": 50
}
Your query looks fine, and based on the provided information it should work: https://gist.github.com/4448954
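As a side note (not from the original answer): custom_filters_score was removed in Elasticsearch 1.0 in favour of the function_score query, so on newer versions the same boosting would be expressed roughly like this sketch:
{
  "from": 0,
  "size": 50,
  "query": {
    "function_score": {
      "query": {
        "query_string": {
          "analyzer": "query_default",
          "fields": ["title^5", "description^2", "indexable_content"],
          "query": "blah"
        }
      },
      "functions": [
        { "filter": { "term": { "format": "test1" } }, "weight": 1.5 },
        { "filter": { "term": { "format": "test2" } }, "weight": 1.5 }
      ],
      "score_mode": "multiply"
    }
  }
}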
