Elasticsearch edgeNGram analyzer/tokenizer fuzzy query matching

We have an Accounts table that we search for similar records in, using a fuzzy query with an edgeNGram analyzer on multiple fields. Our setup:
Settings
{
settings: {
analysis: {
analyzer: {
edge_n_gram_analyzer: {
tokenizer: "whitespace",
filter: ["lowercase", "ednge_gram_filter"]
}
},
filter: {
edge_ngram_filter: {
type: "edgeNGram",
min_gram: 2,
max_gram: 10
}
}
}
}
}
Mapping
{
mappings: {
document_type: {
properties: {
uid: {
type: "text",
analyzer: "edge_n_gram_analyzer"
},
shop_name: {
type: "text",
analyzer: "edge_n_gram_analyzer"
},
seller_name: {
type: "text",
analyzer: "edge_n_gram_analyzer"
},
...
...
...
locale_id: {
type: "integer"
}
}
}
}
}
Query
{
body: {
query: {
bool: {
must: [
{
bool: {
should: [
{
fuzzy: {
uid: {
value: "antonline",
boost: 1.0,
fuzziness: 2,
prefix_length: 0,
max_expansions: 100
}
}
},
{
fuzzy: {
seller_name: {
value: "antonline",
boost: 1.0,
fuzziness: 2,
prefix_length: 0,
max_expansions: 100
}
}
},
{
fuzzy: {
shop_name: {
value: "antonline",
boost: 1.0,
fuzziness: 2,
prefix_length: 0,
max_expansions: 100
}
}
}
]
}
}
],
must_not: [
{
term: {
locale_id: {
value: 7
}
}
}
]
}
}
}
}
The above example finds different variations of the string 'antonline', such as "antonline", "sanjonline", "tanonline", "kotonline", "htonline", and "awmonline". However, it doesn't match strings with punctuation, like antonline.com, or even antonlinecom without the dot. We tried different types of tokenizers but nothing helped.
How could we achieve the search results we expect?

I resolved that issue by removing every character that matches this regex:
[.,'\"\-+:~\^!?*\\]
Do the removal both while building the index and while searching.
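If you prefer to keep that cleanup inside Elasticsearch rather than in application code, one option is a pattern_replace character filter attached to the same analyzer, so the characters are stripped at index time and whenever the query string goes through the analyzer. A minimal sketch reusing the settings from the question (the char_filter name punctuation_strip is made up):
{
settings: {
analysis: {
char_filter: {
punctuation_strip: {
type: "pattern_replace",
pattern: "[.,'\"\\-+:~^!?*\\\\]",
replacement: ""
}
},
analyzer: {
edge_n_gram_analyzer: {
tokenizer: "whitespace",
char_filter: ["punctuation_strip"],
filter: ["lowercase", "edge_ngram_filter"]
}
},
filter: {
edge_ngram_filter: {
type: "edgeNGram",
min_gram: 2,
max_gram: 10
}
}
}
}
}
Note that fuzzy is a term-level query, so its input is not passed through the analyzer; the search term itself may still need the same cleanup in application code before it is sent.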

Related

Is there a way to write a boolean query with different conditions in Elasticsearch?

I've set up Elasticsearch with a Node.js server and need a working boolean query that checks different conditions in the search. How can I do that?
I am using mongoosastic (query DSL) with Node.js and the following query to get the results.
Mapping function
async function mapMongoToElastic() {
return new Promise((resolve, reject) => {
console.log("---------Mapping gets created---------");
Product.createMapping(
{
mappings: {
product: {
properties: {
ArtNumber: { type: "text" },
Title: {
type: "text",
fields: {
keyword: {
type: "keyword"
}
}
},
DisplayPrice: { type: "double" },
CF_Deliverytime: { type: "text" },
Description_Short: { type: "text" },
MainCat: {
type: "text",
fields: {
keyword: {
type: "keyword"
}
}
},
ItemCat: {
type: "text",
fields: {
keyword: {
type: "keyword"
}
}
},
Deeplink1: { type: "text" },
Img_url: { type: "text" },
CF_img_url2: { type: "text" },
CF_img_url3: { type: "text" },
CF_Availability: { type: "text" },
CF_Productrating: {
type: "text",
fields: {
keyword: {
type: "keyword"
}
}
},
CF_Discount: {
type: "text",
fields: {
keyword: {
type: "keyword"
}
}
},
Shop: {
type: "text",
fields: {
keyword: {
type: "keyword"
}
}
}
}
}
}
},
function(err, mapping) {
if (err) {
console.log("error creating mapping (you can safely ignore this)");
console.log(err);
resolve(err);
} else {
console.log("X - ElasticSearch Mapping Created");
resolve(mapping);
}
}
);
});
}
Query function
async function asyncSearchWithOutFilter(query, from, to) {
return new Promise((resolve, reject) => {
Product.esSearch(
{
from: from,
size: to,
query: {
multi_match: {
query: query.suche,
fields: [ "Title^10", "ItemCat^5" ]
}
},
aggs: {
mainCats: {
terms: { field: "MainCat.keyword" }
},
itemCats: {
terms: { field: "ItemCat.keyword" }
},
itemShops: {
terms: {
field: "Shop.keyword"
}
}
}
},
{},
async (err, results) => {
if (err) throw err;
let res = await results;
/* console.log("-------------Total Hits---------------");
console.log(res.hits.total);
console.log("-----------------------------------------");
console.log("-------------Shops---------------");
console.log(res.aggregations.itemShops.buckets);
console.log("-----------------------------------------");
console.log("-------------Item-Categories---------------");
console.log(res.aggregations.itemCats.buckets);
console.log("-----------------------------------------"); */
resolve(res);
}
);
});
}
Expected results:
- Query for "TV"
Results:
- Products with "TV" in the Title
- If the Category also contains "TV", rank the product higher.
Problem:
"Smart-TV-Controller" is also listed when searching for "TV", but that is not expected when someone is searching for a "TV".
Any help is appreciated.
It seems like you are trying to get an exact match. You already have a keyword sub-field for both of the fields Title and ItemCat, so use the keyword fields in the multi_match query instead.
query: {
multi_match: {
query: query.suche,
fields: [ "Title.keyword^10", "ItemCat.keyword^5" ]
}
}
If you are not looking for an exact match on Title, then another way is to set the fields as below:
fields: [ "Title^10", "ItemCat.keyword^5" ]

Elasticsearch: Order by date field (descending): gauss or field_value_factor?

I have an issue concerning modifying a document's score according to its creation date. I have tried the gauss function and field_value_factor.
The first one is (the whole query clause):
#search_definition[:query] = {
function_score:{
query: {
bool: {
must: [
{
query_string: {
query: <query_term>,
fields: %w( field_1^2
field_2^3
...
field_n^2),
analyze_wildcard: true,
auto_generate_phrase_queries: false,
analyzer: 'brazilian',
default_operator: 'AND'
}
}
],
filter: {
bool: {
should: [
{ term: {"boolean_field": false}},
{ terms: {"array_field_1": options[:key].ids}},
{ term: {"array_field_2.id": options[:key].id}}
]
}
}
}
},
gauss:{
date_field: {
scale: "1d",
decay: "0.5"
}
}
}
}
With this configuration, I am telling Elasticsearch that the most recent documents must get a higher score. When I execute the query, the result is exactly the opposite: the oldest documents are returned first. Even if I change the origin to
origin: "2010-05-01 00:00:00"
which is the date of the first document, the oldest ones are still retrieved first. What am I doing wrong?
With field_value_factor things are better, but still not what I am looking for (the whole query clause is):
#search_definition[:query] = {
function_score:{
query: {
bool: {
must: [
{
query_string: {
query: <query_term>,
fields: %w( field_1^2
field_2^3
...
field_n^2),
analyze_wildcard: true,
auto_generate_phrase_queries: false,
analyzer: 'brazilian',
default_operator: 'AND'
}
}
],
filter: {
bool: {
should: [
{ term: {"boolean_field": false}},
{ terms: {"array_field_1": options[:key].ids}},
{ term: {"array_field_2.id": options[:key].id}}
]
}
}
}
},
field_value_factor: {
field: "date_field",
factor : 100,
modifier: "sqrt"
}
}
}
With this other configuration, documents from 2016 and 2015 are returned first; however, there are plenty of documents from 2016 that receive a lower score than some from 2015, even with the "sqrt" modifier and factor: 100.
I suppose the gauss function would be the appropriate solution. How can I invert the gauss result? Or how can I increase the field_value_factor so that 2016 comes before 2015?
Thanks a lot,
Guilherme
You might want to try putting the gauss function inside the functions parameter and giving it a weight, as in the following query. I also think the scale is too low, which could be making a lot of documents score close to zero; I have increased decay to 0.8 and given a higher weight to recent documents. You could also use the explain API to see how the scoring is done (a small sketch follows at the end of this answer).
{
"function_score": {
query: {
bool: {
must: [{
query_string: {
query: <query_term>,
fields: %w( field_1^2 field_2^3
... field_n^2 ),
analyze_wildcard: true,
auto_generate_phrase_queries: false,
analyzer: 'brazilian',
default_operator: 'AND'
}
}],
filter: {
bool: {
should: [{
term: {
"boolean_field": false
}
}, {
terms: {
"array_field_1": options[: key].ids
}
}, {
term: {
"array_field_2.id": options[: key].id
}
}]
}
}
}
},
"functions": [{
"gauss": {
"date_field": {
"origin": "now"
"scale": "30d",
"decay": "0.8"
}
},
"weight": 20
}]
}
}
Also, the origin should be the latest date, so rather than origin: "2010-05-01 00:00:00", try
origin: "2016-05-01 00:00:00"
Does this help?
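For reference, the explain output mentioned above can be requested by adding the explain flag to the search body. Below is a minimal sketch, assuming a hypothetical index name my_index and the same date_field as above:
GET /my_index/_search
{
"explain": true,
"query": {
"function_score": {
"query": { "match_all": {} },
"functions": [{
"gauss": {
"date_field": { "origin": "now", "scale": "30d", "decay": "0.8" }
},
"weight": 20
}]
}
}
}
Each hit then carries an _explanation tree showing how the query score and the gauss weight were combined.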

Elasticsearch searching and sorting across 2 models

I have 2 models: Products and Skus, where a Product has one or more Skus, and a Sku belongs to exactly one Product. They have the following columns:
Product: id, title, content, category_id
Sku: id, product_id, price
I'd like to be able to display 48 products per page across various search and sort configurations, but I'm having trouble translating this to Elasticsearch.
For example, it's not clear to me how I would search on title while sorting the relevant results by the lowest-priced Sku for each Product. I've tried a few different things, and the closest has been to index everything as belonging to the Sku and then search like so:
size: '48',
aggs: {
group_by_product: {
terms: { field: 'product_id' }
}
},
filter: {
and: [{
bool: {
must: { range: { price: { gte: 0, lte: 50 } } }
}
},{
bool: {
must: { terms: { category_id: [ 1, 2, 3, 4, 5, 6 ] } }
}
}]
},
query: {
fuzzy_like_this: {
fields: [ 'title', 'content' ],
like_text: 'Chair',
fuzziness: 1
}
}
But this gives 48 matching Skus, many of which belong to the same Product, so my pagination is off if I try to combine them after the search.
What would be the best way to handle this use case?
Update
Trying with the nested method, using the following structure:
{
size: '48',
query: {
bool: {
should: {
fuzzy_like_this: {
fields: [ 'title' ],
like_text: 'chair',
fuzziness: 1
}
},
must: {
nested: {
path: 'skus',
query: {
bool: {
must: { range: { price: { gte: 0, lte: 100 } } }
}
}
}
}
}
},
sort: {
_score: 'asc',
'skus.price': {
nested_path: 'skus',
nested_filter: { range: { 'skus.price': { gte: 0, lte: 100 } } },
order: 'asc',
mode: 'min'
}
}
}
This is likely closer, but I'm still not sure how to format it. The above gives products ordered by price, but seems to completely disregard the search field.
Since paginating aggregation results is not possible, the approach of including the SKUs inside the product document is a good one, and I would go with nested objects, depending on the requirements of your queries (a mapping sketch follows the example query below).
As an example query:
GET /product/test/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"title": {
"query": "whatever",
"fuzziness": 1,
"prefix_length": 3
}
}
},
{
"nested": {
"path": "skus",
"query": {
"range": {
"skus.price": {
"gte": 11,
"lte": 50
}
}
}
}
}
]
}
},
"sort": [
{
"skus.price": {
"nested_path": "skus",
"order": "asc",
"mode": "min"
}
}
]
}
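For the query above to work, skus has to be mapped as a nested field on the product documents. A minimal mapping sketch under that assumption (the index and type names mirror the example request; the field types are illustrative):
PUT /product
{
"mappings": {
"test": {
"properties": {
"title": { "type": "string" },
"content": { "type": "string" },
"category_id": { "type": "integer" },
"skus": {
"type": "nested",
"properties": {
"price": { "type": "double" }
}
}
}
}
}
}
With this mapping each product is a single document, so the 48-per-page pagination works on products directly while the nested sort picks the minimum skus.price per product.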

Stemming and highlighting for phrase search

My Elasticsearch index is full of large English-text documents. When I search for "it is rare", I get 20 hits with that exact phrase, and when I search for "it is rarely" I get a different 10. How can I get all 30 hits at once?
I've tried creating a multi-field with the english analyzer (below), but if I search in that field then I only get results for parts of the phrase (e.g., documents matching it or is or rare) instead of the whole phrase.
"mappings" : {
...
"text" : {
"type" : "string",
"fields" : {
"english" : {
"type" : "string",
"store" : true,
"term_vector" : "with_positions_offsets_payloads",
"analyzer" : "english"
}
}
},
...
Figured it out!
- Store two fields: one for the text content (text) and a sub-field with the English-stemmed words (text.english).
- Create a custom analyzer, based on the default English analyzer, which doesn't strip stop words.
- Highlight both fields and check each one when displaying results to the user (see the sketch after the query below).
Here's my index configuration:
{
mappings: {
documents: {
properties: {
title: { type: 'string' },
text: {
type: 'string',
term_vector: 'with_positions_offsets_payloads',
fields: {
english: {
type: 'string',
analyzer: 'english_nostop',
term_vector: 'with_positions_offsets_payloads',
store: true
}
}
}
}
}
},
settings: {
analysis: {
filter: {
english_stemmer: {
type: 'stemmer',
language: 'english'
},
english_possessive_stemmer: {
type: 'stemmer',
language: 'possessive_english'
}
},
analyzer: {
english_nostop: {
tokenizer: 'standard',
filter: [
'english_possessive_stemmer',
'lowercase',
'english_stemmer'
]
}
}
}
}
}
And here's what a query looks like:
{
query: {
query_string: {
query: <query>,
fields: ['text.english'],
analyzer: 'english_nostop'
}
},
highlight: {
fields: {
'text.english': {},
'text': {}
}
}
}
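For the third step, the display side simply prefers whichever field produced a highlight and falls back to the stored text. A minimal Node.js-style sketch, assuming a standard hits.hits entry (the helper name bestHighlight is made up):
// Pick the best available highlight for a hit, preferring the stemmed sub-field.
function bestHighlight(hit) {
  const hl = hit.highlight || {};
  const fragments = hl['text.english'] || hl['text'];
  // Fall back to the raw field when neither field produced a highlight.
  return fragments && fragments.length > 0 ? fragments.join(' … ') : hit._source.text;
}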

Multiple types in Elasticsearch Type Filter

I have a filtered query like this
query: {
filtered: {
query: {
bool: {
should: [{multi_match: {
query: #query,
fields: ['title', 'content']
}
},{fuzzy: {
content: {
value: #query,
min_similarity: '1d',
}
}}]
}
},
filter: {
and: [
type: {
value: #type
}]
}}}
That works fine if #type is a string, but does not work if #type is an array. How can I search for multiple types?
This worked, but I'm not happy with it:
filter: {
or: [
{ type: { value: 'blog'} },
{ type: { value: 'category'} },
{ type: { value: 'miscellaneous'} }
]
}
I'd love to accept a better answer.
You can easily specify multiple types in your search request's URL, e.g. http://localhost:9200/twitter/tweet,user/_search, or with type in the header if using _msearch, as documented here.
These are then added as filters for you by Elasticsearch.
Also, you usually want to use bool to combine filters, for the reasons described in this article: all about elasticsearch filter bitsets
This worked for me:
Within the filter parameter, wrap the multiple type queries as should clauses of a bool query, e.g.:
{
"query": {
"bool": {
"must": {
"term": { "foo": "bar" }
},
"filter": {
"bool": {
"should": [
{ "type": { "value": "myType" } },
{ "type": { "value": "myOtherType" } }
]
}
}
}
}
}
Suggested by dadoonet in the Elasticsearch GitHub issue "Support multiple types in Type Query".
