Combining terms with synonyms - ElasticSearch

I am new to Elasticsearch and have a synonym analyzer in place, which looks like this:
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "graph_synonyms": {
            "type": "synonym_graph",
            "synonyms": [
              "gowns, dresses",
              "backpacks, bags",
              "coats, jackets"
            ]
          }
        },
        "analyzer": {
          "search_time_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "graph_synonyms"
            ]
          }
        }
      }
    }
  }
}
And the mapping looks like this:
{
  "properties": {
    "category": {
      "type": "text",
      "search_analyzer": "search_time_analyzer",
      "fields": {
        "no_synonyms": {
          "type": "text"
        }
      }
    }
  }
}
If I search for gowns, I get proper results for both gowns and dresses.
The problem is that if I search for red gowns (the system does not have any red gowns), the expected behavior is to search for red dresses and return those results. Instead, it returns results for gowns and dresses irrespective of the color.
I want to configure the system so that it considers both the terms and their respective synonyms, if any, and then returns the results.
For reference, this is what my search query looks like:
"query": {
  "bool": {
    "should": [
      {
        "multi_match": {
          "boost": 300,
          "query": term,
          "type": "cross_fields",
          "operator": "or",
          "fields": ["bu.keyword^10", "bu^10", "category.keyword^8", "category^8", "category.no_synonyms^8", "brand.keyword^7", "brand^7", "colors.keyword^2", "colors^2", "size.keyword", "size", "hash.keyword^2", "hash^2", "name"]
        }
      }
    ]
  }
}
Sample document:
_source: {
productId: '12345',
name: 'RUFFLE FLORAL TRIM COTTON MAXI DRESS',
brand: [ 'self-portrait' ],
mainImage: 'http://test.jpg',
description: 'Self-portrait presents this maxi dress, crafted from cotton, to offer your off-duty ensembles an elegant update. Trimmed with ruffled broderie details, this piece is an effortless showcase of modern femininity.',
status: 'active',
bu: [ 'womenswear' ],
category: [ 'dresses', 'gowns' ],
tier1: [],
tier2: [],
colors: [ 'WHITE' ],
size: [ '4', '6', '8', '10' ],
hash: [
'ballgown', 'cotton',
'effortless', 'elegant',
'floral', 'jar',
'maxi', 'modern',
'off-duty', 'ruffle',
'ruffled', '1',
'2', 'crafted'
],
styleCode: '211274856'
}
How can I achieve the desired output? Any help would be appreciated. Thanks

You can configure an index-time analyzer instead of a search-time analyzer, like below:
{
  "properties": {
    "category": {
      "type": "text",
      "analyzer": "search_time_analyzer",
      "fields": {
        "no_synonyms": {
          "type": "text"
        }
      }
    }
  }
}
Once you are done with the index mapping change, reindex your data and try the query below.
Please note that I have changed the operator to and and the analyzer to standard:
{
  "query": {
    "multi_match": {
      "boost": 300,
      "query": "gowns red",
      "analyzer": "standard",
      "type": "cross_fields",
      "operator": "and",
      "fields": [
        "category",
        "colors"
      ]
    }
  }
}
Why your current query is not working:
Indexing:
Your current index mapping indexes data with the standard analyzer, so it will not index any synonym values for your category field.
Searching:
Your current query has the operator or, so if you search for red gowns it builds a query like red OR gowns OR dresses, which gives you results irrespective of the color. Also, if you change the operator to and in the existing configuration, it will return zero results, because it builds a query like red AND gowns AND dresses.
Solution: Once you make the changes I suggested, synonyms will be indexed for the category field as well, and the query will work with the and operator. So if you try the query gowns red, it builds a query like gowns AND red. It will match because the category field has both values, gowns and dresses, due to synonyms applied at index time.
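To verify the expansion, you can run the analyzer through the _analyze API before reindexing (the index name my_index is a placeholder for your own index):

```
GET /my_index/_analyze
{
  "analyzer": "search_time_analyzer",
  "text": "red gowns"
}
```

The returned token stream should contain red, gowns and dresses, which is why a document whose category contains dresses and whose colors contain red can satisfy gowns AND red once the synonyms are applied at index time.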

Related

ElasticSearch autocomplete doesn't work with the middle words

Using Python elasticsearch-dsl:
class Record(Document):
    tags = Keyword()
    tags_suggest = Completion(preserve_position_increments=False)

    def clean(self):
        self.tags_suggest = {
            "input": self.tags
        }

    class Index:
        name = 'my-index'
        settings = {
            "number_of_shards": 2,
        }
When I index
r1 = Record(tags=['my favourite tag', 'my hated tag'])
r2 = Record(tags=['my good tag', 'my bad tag'])
And when I try to use autocomplete with the word in the middle:
dsl = Record.search()
dsl = dsl.suggest("auto_complete", "favo", completion={"field": "tags_suggest"})
search_response = dsl.execute()
for option in search_response.suggest.auto_complete[0].options:
    print(option.to_dict())
It won't return anything, but it will when I search "my favo". Any good practices to fix that (make it return 'my favourite tag' when I request suggestions for "favo")?
Check Mapping
Search in Elasticsearch also depends on how you index your data, so I would suggest having a look at the index mapping with the below query:
curl -X GET "elasticsearch.url:port/index_name/_mapping?pretty"
You need to check how the data is being inserted, i.e. whether any analyzer or tokenizer is used when storing it. If you have not specified any analyzer, Elasticsearch uses the standard analyzer by default, and it will produce the terms accordingly.
As per your use case, you need to apply an appropriate analyzer, tokenizer & filters. For example, in one case I had to support like-style queries and implemented an ngram token filter for that.
Solution
As I can see, you are using a suggester. The suggest feature suggests similar-looking terms based on a provided text, using a suggester.
If you want to achieve autocomplete, I would suggest using search_as_you_type.
I tried to reproduce your use case, and below is something which worked for me.
Create Index
PUT /test1?pretty
{
  "mappings": {
    "properties": {
      "tags": {
        "type": "search_as_you_type"
      }
    }
  }
}
Indexing data
POST test1/_doc?pretty
{
  "tags": "my favourite tag"
}
POST test1/_doc?pretty
{
  "tags": "my hated tag"
}
POST test1/_doc?pretty
{
  "tags": "my good tag"
}
POST test1/_doc?pretty
{
  "tags": "my bad tag"
}
Query with your keyword
GET /test1/_search?pretty
{
  "query": {
    "multi_match": {
      "query": "my",
      "type": "bool_prefix",
      "fields": [
        "tags",
        "tags._2gram",
        "tags._3gram"
      ]
    }
  }
}
GET /test1/_search?pretty
{
  "query": {
    "multi_match": {
      "query": "bad",
      "type": "bool_prefix",
      "fields": [
        "tags",
        "tags._2gram",
        "tags._3gram"
      ]
    }
  }
}
GET /test1/_search?pretty
{
  "query": {
    "multi_match": {
      "query": "fav",
      "type": "bool_prefix",
      "fields": [
        "tags",
        "tags._2gram",
        "tags._3gram"
      ]
    }
  }
}
You can achieve this by setting the preserve_position_increments parameter to false in your mapping:
"tags_completion": {
  "type": "completion",
  "analyzer": "simple",
  "preserve_separators": false,
  "preserve_position_increments": false,
  "max_input_length": 50
}
You can query it in console like this:
GET /_search
{
  "suggest": {
    "my-suggester": {
      "prefix": "favou",
      "completion": {
        "field": "tags_completion",
        "skip_duplicates": true,
        "fuzzy": {
          "fuzziness": 1
        }
      }
    }
  }
}

Sort similar data by property

I have the following data:
[
  {
    "DocumentId": "85",
    "figureText": "General Seat Assembly - DBL",
    "descriptionShort": "Seat Assembly - DBL",
    "partNumber": "1012626-001FG05",
    "itemNumeric": "5"
  },
  {
    "DocumentId": "85",
    "figureText": "General Seat Assembly - DBL",
    "descriptionShort": "Seat Assembly - DBL",
    "partNumber": "1012626-001FG05",
    "itemNumeric": "45"
  }
]
I use the following query to get data:
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "DocumentId": "85"
        }
      },
      "should": [
        {
          "match": {
            "figureText": {
              "boost": 5,
              "query": "General Seat Assembly - DBL",
              "operator": "or"
            }
          }
        },
        {
          "match": {
            "descriptionShort": {
              "boost": 4,
              "query": "Seat Assembly - DBL",
              "operator": "or"
            }
          }
        },
        {
          "term": {
            "partNumber": {
              "boost": 1,
              "value": "1012626-001FG05"
            }
          }
        }
      ]
    }
  }
}
Currently, it returns the item with "itemNumeric" = 45, and I would like to get itemNumeric = "5" (the lowest).
Is there a way to do that? I tried with "sort":[{"itemNumeric":"desc"}]
Thx
Looking at your comment, you can resolve the issue in two ways.
Solution 1: Update your mapping so that your current query works as expected:
PUT my_index/_mapping/_doc
{
  "properties": {
    "itemNumeric": {
      "type": "text",
      "fielddata": true
    }
  }
}
Solution 2: Check the mapping of your itemNumeric field. If your mapping has been created dynamically, your itemNumeric field would be a multi-field:
"itemNumeric": {
  "type": "text",
  "fields": {
    "keyword": {
      "type": "keyword"
    }
  }
}
In this case, you can apply your sorting logic to the itemNumeric.keyword field.
"sort":[{"itemNumeric.keyword":"desc"}]
In Elasticsearch, whenever you have text data, it is always recommended to create two fields for it: one of type text, so that you can apply full-text queries, and one of type keyword, so that you can use it for sorting or aggregation operations.
Solution 1 is not recommended, as the official ES documentation gives the reason below:
Fielddata is disabled on text fields by default. Set fielddata=true on
[your_field_name] in order to load fielddata in memory by uninverting
the inverted index. Note that this can however use significant memory.
I'd suggest reading about multi-fields and fielddata so that you have more clarity on what's happening.
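Putting it together, a sketch of the full request with Solution 2 (note two caveats: to get the lowest itemNumeric first, the order should be asc rather than desc, and sorting on a keyword field is lexicographic, so "45" sorts before "5"; remapping the field as an integer would give true numeric ordering):

```
{
  "query": {
    "bool": {
      "must": {
        "match": { "DocumentId": "85" }
      }
    }
  },
  "sort": [
    { "itemNumeric.keyword": "asc" }
  ]
}
```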

Getting results for multi_match cross_fields query in elasticsearch with custom analyzer

I have an elastic search 5.3 server with products.
Each product has a 14 digit product code that has to be searchable by the following rules. The complete code should match as well as a search term with only the last 9 digits, the last 6, the last 5 or the last 4 digits.
In order to achieve this I created a custom analyser which creates the appropriate tokens at index time using the pattern capture token filter. This seems to be working correctly. The _analyse API shows that the correct terms are created.
To fetch the documents from elastic search I'm using a multi_match cross_fields bool query to search a number of fields simultaneously.
When I have a query string with a part that matches a product code and a part that matches any of the other fields, no results are returned; but when I search for each part separately, the appropriate results are returned. Also, when I have multiple parts spanning any of the fields except the product code, the correct results are returned.
My mapping and analyzer:
PUT /store
{
  "mappings": {
    "products": {
      "properties": {
        "productCode": {
          "analyzer": "ProductCode",
          "search_analyzer": "standard",
          "type": "text"
        },
        "description": {
          "type": "text"
        },
        "remarks": {
          "type": "text"
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "filter": {
        "ProductCodeNGram": {
          "type": "pattern_capture",
          "preserve_original": "true",
          "patterns": [
            "\\d{5}(\\d{9})",
            "\\d{8}(\\d{6})",
            "\\d{9}(\\d{5})",
            "\\d{10}(\\d{4})"
          ]
        }
      },
      "analyzer": {
        "ProductCode": {
          "filter": ["ProductCodeNGram"],
          "type": "custom",
          "preserve_original": "true",
          "tokenizer": "standard"
        }
      }
    }
  }
}
The query
GET /store/products/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "[query_string]",
            "fields": ["productCode", "description", "remarks"],
            "type": "cross_fields",
            "operator": "and"
          }
        }
      ]
    }
  }
}
Sample data
POST /store/products
{
  "productCode": "999999123456789",
  "description": "Foo bar",
  "remarks": "Foobar"
}
The following query strings all return one result:
"456789", "foo", "foobar", "foo foobar".
But the query_string "foo 456789" returns no results.
I am very curious as to why the last search does not return any results. I am convinced that it should.
The problem is that you are doing a cross_fields query over fields with different analyzers. cross_fields only works for fields using the same analyzer; in fact, it groups the fields by analyzer before doing the cross-field matching. You can find more information in this documentation:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-multi-match-query.html#_literal_cross_field_literal_and_analysis
Although cross_fields needs the same analyzer across the fields it operates on, I've had luck using the tie_breaker parameter to allow other fields (that use different analyzers) to be weighed for the total score.
This has the added benefit of allowing per-field boosting to be calculated in the final score, too.
Here's an example using your query:
GET /store/products/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "[query_string]",
            "fields": ["productCode", "description", "remarks"],
            "type": "cross_fields",
            "tie_breaker": 1 # You may need to tweak this
          }
        }
      ]
    }
  }
}
I also removed the operator field, as I believe using the "AND" operator will cause fields that don't have the same analyzer to be scored inappropriately.
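Another workaround, sketched here on the assumption that splitting the clauses is acceptable for your scoring: keep the same-analyzer fields together in one cross_fields clause and match productCode in a separate clause, combined in a bool should, so no cross_fields group mixes analyzers:

```
GET /store/products/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "query": "foo 456789",
            "fields": ["description", "remarks"],
            "type": "cross_fields"
          }
        },
        {
          "match": {
            "productCode": "foo 456789"
          }
        }
      ]
    }
  }
}
```

With this shape, "foo" can match in the description/remarks group while "456789" matches the product-code tokens, though the relevance trade-offs differ from a single cross_fields clause.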

How do I specify a different analyzer at query time with Elasticsearch?

I would like to use a different analyzer at query time to compose my query.
I read that this is possible, from the documentation "Controlling Analysis":
[...] the full sequence at search time:
The analyzer defined in the query itself, else
The search_analyzer defined in the field mapping, else
The analyzer defined in the field mapping, else
The analyzer named default_search in the index settings, which defaults to
The analyzer named default in the index settings, which defaults to
The standard analyzer
But I don't know how to compose the query in order to specify different analyzers for different clauses:
"query": {
  "bool": {
    "must": [
      {
        "match": { "my_field": "My query" },
        "<ANALYZER>": <ANALYZER_1>
      }
    ],
    "should": [
      {
        "match": { "my_field": "My query" },
        "<ANALYZER>": <ANALYZER_2>
      }
    ]
  }
}
I know that I can index two or more different fields, but I have strong secondary memory constraints and I can't index the same information N times.
Thank you
If you haven't yet, you first need to add the custom analyzers to your index settings.
Note: if the index exists and is running, make sure to close it first.
POST /my_index/_close
Then map the custom analyzers to the settings endpoint.
PUT /my_index/_settings
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_analyzer1": {
          "type": "standard",
          "stopwords_path": "stopwords/stopwords.txt"
        },
        "custom_analyzer2": {
          "type": "standard",
          "stopwords": ["stop", "words"]
        }
      }
    }
  }
}
Open the index again.
POST /my_index/_open
Now you can query your index with the new analyzers.
GET /my_index/_search
{
  "query": {
    "bool": {
      "should": [{
        "match": {
          "field_1": {
            "query": "Hello world",
            "analyzer": "custom_analyzer1"
          }
        }
      }],
      "must": [{
        "match": {
          "field_2": {
            "query": "Stop words can be tough",
            "analyzer": "custom_analyzer2"
          }
        }
      }]
    }
  }
}
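To sanity-check what either analyzer produces before querying, you can feed sample text to the _analyze API (assuming the settings above were applied to my_index):

```
GET /my_index/_analyze
{
  "analyzer": "custom_analyzer2",
  "text": "Stop words can be tough"
}
```

The response lists the tokens the analyzer emits, so you can confirm the stopwords are being removed as expected.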

Best way to index arbitrary attribute value pairs on elastic search

I am trying to index documents on elastic search, which have attribute value pairs. Example documents:
{
  id: 1,
  name: "metamorphosis",
  author: "franz kafka"
}
{
  id: 2,
  name: "techcorp laptop model x",
  type: "computer",
  memorygb: 4
}
{
  id: 3,
  name: "ss2014 formal shoe x",
  color: "black",
  size: 42,
  price: 124.99
}
Then, I need queries like:
1. "author" EQUALS "franz kafka"
2. "type" EQUALS "computer" AND "memorygb" GREATER THAN 4
3. "color" EQUALS "black" OR ("size" EQUALS 42 AND price LESS THAN 200.00)
What is the best way to store these documents for efficiently querying them? Should I store them exactly as shown in the examples? Or should I store them like:
{
  fields: [
    { "type": "computer" },
    { "memorygb": 4 }
  ]
}
or like:
{
  fields: [
    { "key": "type", "value": "computer" },
    { "key": "memorygb", "value": 4 }
  ]
}
And how should I map my indices for being able to perform both my equality and range queries?
If someone is still looking for an answer, I wrote a post about how to index arbitrary data into Elasticsearch and then search by specific fields and values, all without blowing up your index mapping.
The post: http://smnh.me/indexing-and-searching-arbitrary-json-data-using-elasticsearch/
In short, you will need to create the special index described in the post. Then you will need to flatten your data using the flattenData function https://gist.github.com/smnh/30f96028511e1440b7b02ea559858af4. The flattened data can then be safely indexed into the Elasticsearch index.
For example:
flattenData({
  id: 1,
  name: "metamorphosis",
  author: "franz kafka"
});
Will produce:
[
  {
    "key": "id",
    "type": "long",
    "key_type": "id.long",
    "value_long": 1
  },
  {
    "key": "name",
    "type": "string",
    "key_type": "name.string",
    "value_string": "metamorphosis"
  },
  {
    "key": "author",
    "type": "string",
    "key_type": "author.string",
    "value_string": "franz kafka"
  }
]
And
flattenData({
  id: 2,
  name: "techcorp laptop model x",
  type: "computer",
  memorygb: 4
});
Will produce:
[
  {
    "key": "id",
    "type": "long",
    "key_type": "id.long",
    "value_long": 2
  },
  {
    "key": "name",
    "type": "string",
    "key_type": "name.string",
    "value_string": "techcorp laptop model x"
  },
  {
    "key": "type",
    "type": "string",
    "key_type": "type.string",
    "value_string": "computer"
  },
  {
    "key": "memorygb",
    "type": "long",
    "key_type": "memorygb.long",
    "value_long": 4
  }
]
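The transformation above can be sketched in a few lines of Python. This is a minimal re-implementation of the idea, not the author's flattenData gist: it handles dicts and scalar leaves only (lists are left out for brevity), and the type names ("long", "string", etc.) follow the examples above.

```python
# Minimal sketch of the flattening idea: every leaf value becomes an
# entry carrying its key path, a coarse type, and a typed value_* field.

def flatten_data(obj, prefix=""):
    """Flatten a (possibly nested) dict into flatData-style entries."""
    entries = []
    for key, value in obj.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested objects, joining key paths with a dot.
            entries.extend(flatten_data(value, prefix=f"{path}."))
        elif isinstance(value, bool):
            # bool must be checked before int (bool is a subclass of int).
            entries.append({"key": path, "type": "boolean",
                            "key_type": f"{path}.boolean", "value_boolean": value})
        elif isinstance(value, int):
            entries.append({"key": path, "type": "long",
                            "key_type": f"{path}.long", "value_long": value})
        elif isinstance(value, float):
            entries.append({"key": path, "type": "double",
                            "key_type": f"{path}.double", "value_double": value})
        else:
            entries.append({"key": path, "type": "string",
                            "key_type": f"{path}.string", "value_string": str(value)})
    return entries

print(flatten_data({"id": 1, "name": "metamorphosis", "author": "franz kafka"}))
```

Each entry is then a uniform object, so the index mapping only needs the fixed flatData fields, no matter what attributes the source documents carry.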
Then you can build Elasticsearch queries to query your data. Every query should specify both the key and the type of value. If you are unsure which keys or types the index has, you can run an aggregation to find out; this is also discussed in the post.
For example, to find a document where author == "franz kafka" you need to execute the following query:
{
  "query": {
    "nested": {
      "path": "flatData",
      "query": {
        "bool": {
          "must": [
            {"term": {"flatData.key": "author"}},
            {"match": {"flatData.value_string": "franz kafka"}}
          ]
        }
      }
    }
  }
}
To find documents where type == "computer" and memorygb > 4 you need to execute the following query:
{
  "query": {
    "bool": {
      "must": [
        {
          "nested": {
            "path": "flatData",
            "query": {
              "bool": {
                "must": [
                  {"term": {"flatData.key": "type"}},
                  {"match": {"flatData.value_string": "computer"}}
                ]
              }
            }
          }
        },
        {
          "nested": {
            "path": "flatData",
            "query": {
              "bool": {
                "must": [
                  {"term": {"flatData.key": "memorygb"}},
                  {"range": {"flatData.value_long": {"gt": 4}}}
                ]
              }
            }
          }
        }
      ]
    }
  }
}
Here, because we want the same document to match both conditions, we use an outer bool query with a must clause wrapping two nested queries.
Elasticsearch is a schema-less data store that allows dynamic indexing of new attributes, and there is no performance impact in having optional fields. Your first mapping is absolutely fine, and you can build boolean queries around your dynamic attributes.
There is no inherent performance benefit in making them nested fields; they will be flattened on indexing anyway, like fields.type, fields.memorygb, etc.
On the contrary, your last mapping, where you try to store the data as key-value pairs, will have a performance impact, since you will have to query on 2 different indexed fields, i.e. where key='memorygb' and value=4.
Have a look at the documentation about dynamic mapping:
One of the most important features of Elasticsearch is its ability to be schema-less. There is no performance overhead if an object is
dynamic, the ability
to turn it off is provided as a safety mechanism so "malformed"
objects won’t, by mistake, index data that we do not wish to be
indexed.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-object-type.html
You need a filtered query: use a range query together with a match query.
