Sort similar data by property - elasticsearch

I have the following data:
[
{
DocumentId": "85",
"figureText": "General Seat Assembly - DBL",
"descriptionShort": "Seat Assembly - DBL",
"partNumber": "1012626-001FG05",
"itemNumeric": "5"
},
{
DocumentId": "85",
"figureText": "General Seat Assembly - DBL",
"descriptionShort": "Seat Assembly - DBL",
"partNumber": "1012626-001FG05",
"itemNumeric": "45"
}
]
I use the following query to get data:
{
"query": {
"bool": {
"must": {
"match": {
"DocumentId": "85"
}
},
"should": [
{
"match": {
"figureText": {
"boost": 5,
"query": "General Seat Assembly - DBL",
"operator": "or"
}
}
},
{
"match": {
"descriptionShort": {
"boost": 4,
"query": "Seat Assembly - DBL",
"operator": "or"
}
}
},
{
"term": {
"partNumber": {
"boost": 1,
"value": "1012626-001FG05"
}
}
}
]
}
}
}
Currently, it will returns the item with "itemNumeric" = 45 and I would like to get itemNumeric = "5" (the lowest).
Is a tips exists to do that ? I tried with "sort":[{"itemNumeric":"desc"}]
Thx

Looking at your comment, you can resolve the issue in two ways.
Solution 1: Updating your mapping, so that your query would work as expected now
PUT my_index/_mapping/_doc
{
"properties": {
"itemNumeric": {
"type": "text",
"fielddata": true
}
}
}
Solution 2: Check the mapping of your itemNumeric field in case if your mapping has been created dynamically, you field itemNumeric would be multi-field.
"itemNumeric": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
In this case you can have your sorting logic applied on itemNumeric.keyword field.
"sort":[{"itemNumeric.keyword":"desc"}]
In elasticsearch, whenever you have text data, it is always recommended to have two fields created for it. One of type text so that you can apply full text queries and other of type keyword so that you can use if to implement sorting or any aggregation operations.
Solution 1 is not recommended as ES official documentation mentions below reason
Fielddata is disabled on text fields by default. Set fielddata=true on
[your_field_name] in order to load fielddata in memory by uninverting
the inverted index. Note that this can however use significant memory.
I'd suggest to read about multi-field and fielddata so that you will have more clarity on what's happening.

Related

Combining terms with synonyms - ElasticSearch

I am new to Elasticsearch and have a synonym analyzer in place which looks like-
{
"settings": {
"index": {
"analysis": {
"filter": {
"graph_synonyms": {
"type": "synonym_graph",
"synonyms": [
"gowns, dresses",
"backpacks, bags",
"coats, jackets"
]
}
},
"analyzer": {
"search_time_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase",
"graph_synonyms"
]
}
}
}
}
}
}
And the mapping looks like-
{
"properties": {
"category": {
"type": "text",
"search_analyzer": "search_time_analyzer",
"fields": {
"no_synonyms": {
"type": "text"
}
}
}
}
}
If I search for gowns, it gives me proper results for both gowns as well as dresses.
But the problem is if I search for red gowns, (the system does not have any red gowns) the expected behavior is to search for red dresses and return those results. But instead, it returns results of gowns and dresses irrespective of the color.
I would want to configure the system such that it considers both the terms and their respective synonyms if any and then return the results.
For reference, this is what my search query looks like-
"query":
{
"bool":
{
should:
[
{
"multi_match":
{
"boost": 300,
"query": term,
"type": "cross_fields",
"operator": "or",
"fields": ["bu.keyword^10", "bu^10", "category.keyword^8", "category^8", "category.no_synonyms^8", "brand.keyword^7", "brand^7", "colors.keyword^2", "colors^2", "size.keyword", "size", "hash.keyword^2", "hash^2", "name"]
}
}
]
}
}
Sample document:
_source: {
productId: '12345',
name: 'RUFFLE FLORAL TRIM COTTON MAXI DRESS',
brand: [ 'self-portrait' ],
mainImage: 'http://test.jpg',
description: 'Self-portrait presents this maxi dress, crafted from cotton, to offer your off-duty ensembles an elegant update. Trimmed with ruffled broderie details, this piece is an effortless showcase of modern femininity.',
status: 'active',
bu: [ 'womenswear' ],
category: [ 'dresses', 'gowns' ],
tier1: [],
tier2: [],
colors: [ 'WHITE' ],
size: [ '4', '6', '8', '10' ],
hash: [
'ballgown', 'cotton',
'effortless', 'elegant',
'floral', 'jar',
'maxi', 'modern',
'off-duty', 'ruffle',
'ruffled', '1',
'2', 'crafted'
],
styleCode: '211274856'
}
How can I achieve the desired output? Any help would be appreciated. Thanks
You can configured index time analyzer insted of search time analyzer like below:
{
"properties": {
"category": {
"type": "text",
"analyzer": "search_time_analyzer",
"fields": {
"no_synonyms": {
"type": "text"
}
}
}
}
}
Once you done with index mapping change, reindex your data and try below query:
Please note that I have changed operator to and and analyzer to standard:
{
"query": {
"multi_match": {
"boost": 300,
"query": "gowns red",
"analyzer": "standard",
"type": "cross_fields",
"operator": "and",
"fields": [
"category",
"colors"
]
}
}
}
Why your current query is not working:
Inexing:
Your current index mapping indexing data with standard analyzer so it will not index any of your category with synonyms values.
Searching:
Your current query have operator or so if you search for red gowns then it will create query like red OR gowns OR dresses and it will giving you result irrespective of the color. Also, if you change operator to and in existing configuration then it will return zero result as it will create query like red AND gowns AND dresses.
Solution: Once you done changes as i suggsted it will index synonyms for category field as well and it will work with and operator. So if you try query gowns red then it will create query like gowns AND red. It will match because category field have both values gowns and dresses due to synonyms applied at index time.

ElasticSearch autocomplete doesn't work with the middle words

Using python elasticsearch-dsl:
class Record(Document):
tags = Keyword()
tags_suggest = Completion(preserve_position_increments=False)
def clean(self):
self.tags_suggest = {
"input": self.tags
}
class Index:
name = 'my-index'
settings = {
"number_of_shards": 2,
}
When I index
r1 = Record(tags=['my favourite tag', 'my hated tag'])
r2 = Record(tags=['my good tag', 'my bad tag'])
And when I try to use autocomplete with the word in the middle:
dsl = Record.search()
dsl = dsl.suggest("auto_complete", "favo", completion={"field": "tags_suggest"})
search_response = dsl.execute()
for option in search_response.suggest.auto_complete[0].options:
print(option.to_dict())
It won't return anything, but it will when I search "my favo". Any good practices to fix that (make it return 'my favourite tag' when I request suggestions for "favo")?
Check Mapping
Search in Elasticsearch, Is also depends on how you are indexing your data. I would suggest to have look on index mapping with the below query:
curl -X GET "elasticsearch.url:port/index_name/_mapping?pretty"
You need to check how data is being inserted like is it using any analyzer or tokeninzer to save data. If you have not specified any analyzer elasticsearch default uses standard analyzer. It will produce the terms accordingly.
As per your use case you need to apply analyzer, tokens & filters. Here is the one Example where i have to use like query and implemented ngram token filter.
Solution
As i can see you are using suggester, The suggest feature suggests similar looking terms based on a provided text by using a suggester.
If you want to achieve autocomplete, I would suggest to use search as you type.
I tried to reproduce your use case and below is something which worked for me.
Create Index
PUT /test1?pretty
{
"mappings": {
"properties": {
"tags": {
"type": "search_as_you_type"
}
}
}
}
Indexing data
POST test1/_doc?pretty
{
"tags":"my favourite tag"
}
POST test1/_doc?pretty
{
"tags":"my hated tag"
}
POST test1/_doc?pretty
{
"tags":"my good tag"
}
POST test1/_doc?pretty
{
"tags":"my bad tag"
}
Query with your keyword
GET /test1/_search?pretty
{
"query": {
"multi_match": {
"query": "my",
"type": "bool_prefix",
"fields": [
"tags",
"tags._2gram",
"tags._3gram"
]
}
}
}
GET /test1/_search?pretty
{
"query": {
"multi_match": {
"query": "bad",
"type": "bool_prefix",
"fields": [
"tags",
"tags._2gram",
"tags._3gram"
]
}
}
}
GET /test1/_search?pretty
{
"query": {
"multi_match": {
"query": "fav",
"type": "bool_prefix",
"fields": [
"tags",
"tags._2gram",
"tags._3gram"
]
}
}
}
You can achive this by setting preserve_position_increments parameter to false in your mappings.
"tags_completion": {
"type": "completion",
"analyzer": "simple",
"preserve_separators": false,
"preserve_position_increments": false,
"max_input_length": 50
}
You can query it in console like this:
GET /_search
{
"suggest" : {
"my-suggester": {
"prefix": "favou",
"completion": {
"field": "tags_completion",
"skip_duplicates": true,
"fuzzy": {
"fuzziness": 1
}
}
}
}
}
}

How to correctly query inside of terms aggregate values in elasticsearch, using include and regex?

How do you filter out/search in aggregate results efficiently?
Imagine you have 1 million documents in elastic search. In those documents, you have a multi_field (keyword, text) tags:
{
...
tags: ['Race', 'Racing', 'Mountain Bike', 'Horizontal'],
...
},
{
...
tags: ['Tracey Chapman', 'Silverfish', 'Blue'],
...
},
{
...
tags: ['Surfing', 'Race', 'Disgrace'],
...
},
You can use these values as filters, (facets), against a query to pull only the documents that contain this tag:
...
"filter": [
{
"terms": {
"tags": [
"Race"
]
}
},
...
]
But you want the user to be able to query for possible tag filters. So if the user types, race the return should show (from previous example), ['Race', 'Tracey Chapman', 'Disgrace']. That way, the user can query for a filter to use. In order to accomplish this, I had to use aggregates:
{
"aggs": {
"topics": {
"terms": {
"field": "tags",
"include": ".*[Rr][Aa][Cc][Ee].*", // I have to dynamically form this
"size": 6
}
}
},
"size": 0
}
This gives me exactly what I need! But it is slow, very slow. I've tried adding the execution_hint, it does not help me.
You may think, "Just use a query before the aggregate!" But the issue is that it'll pull all values for all documents in that query. Meaning, you can be displaying tags that are completely unrelated. If I queried for race before the aggregate, and did not use the include regex, I would end up with all those other values, like 'Horizontal', etc...
How can I rewrite this aggregation to work faster? Is there a better way to write this? Do I really have to make a separate index just for values? (sad face) Seems like this would be a common issue but have found no answers through documentation and googling.
You certainly don't need a separate index just for the values...
Here's my take on it:
What you're doing with the regex is essentially what should've been done by a tokenizer -- i.e. constructing substrings (or N-grams) such that they can be targeted later.
This means that the keyword Race will need to be tokenized into the n-grams ["rac", "race", "ace"]. (It doesn't really make sense to go any lower than 3 characters -- most autocomplete libraries choose to ignore fewer than 3 characters because the possible matches balloon too quickly.)
Elasticsearch offers the N-gram tokenizer but we'll need to increase the default index-level setting called max_ngram_diff from 1 to (arbitrarily) 10 because we want to catch as many ngrams as is reasonable:
PUT tagindex
{
"settings": {
"index": {
"max_ngram_diff": 10
},
"analysis": {
"analyzer": {
"my_ngrams_analyzer": {
"tokenizer": "my_ngrams",
"filter": [ "lowercase" ]
}
},
"tokenizer": {
"my_ngrams": {
"type": "ngram",
"min_gram": 3,
"max_gram": 10,
"token_chars": [ "letter", "digit" ]
}
}
}
},
{ "mappings": ... } --> see below
}
When your tags field is a list of keywords, it's simply not possible to aggregate on that field without resorting to the include option which can be either exact matches or a regex (which you're already using). Now, we cannot guarantee exact matches but we also don't want to regex! So that's why we need to use a nested list which'll treat each tag separately.
Now, nested lists are expected to contain objects so
{
"tags": ["Race", "Racing", "Mountain Bike", "Horizontal"]
}
will need to be converted to
{
"tags": [
{ "tag": "Race" },
{ "tag": "Racing" },
{ "tag": "Mountain Bike" },
{ "tag": "Horizontal" }
]
}
After that we'll proceed with the multi field mapping, keeping the original tags intact but also adding a .tokenized field to search on and a .keyword field to aggregate on:
"index": { ... },
"analysis": { ... },
"mappings": {
"properties": {
"tags": {
"type": "nested",
"properties": {
"tag": {
"type": "text",
"fields": {
"tokenized": {
"type": "text",
"analyzer": "my_ngrams_analyzer"
},
"keyword": {
"type": "keyword"
}
}
}
}
}
}
}
We'll then add our adjusted tags docs:
POST tagindex/_doc
{"tags":[{"tag":"Race"},{"tag":"Racing"},{"tag":"Mountain Bike"},{"tag":"Horizontal"}]}
POST tagindex/_doc
{"tags":[{"tag":"Tracey Chapman"},{"tag":"Silverfish"},{"tag":"Blue"}]}
POST tagindex/_doc
{"tags":[{"tag":"Surfing"},{"tag":"Race"},{"tag":"Disgrace"}]}
and apply a nested filter terms aggregation:
GET tagindex/_search
{
"aggs": {
"topics_parent": {
"nested": {
"path": "tags"
},
"aggs": {
"topics": {
"filter": {
"term": {
"tags.tag.tokenized": "race"
}
},
"aggs": {
"topics": {
"terms": {
"field": "tags.tag.keyword",
"size": 100
}
}
}
}
}
}
},
"size": 0
}
yielding
{
...
"topics_parent" : {
...
"topics" : {
...
"topics" : {
...
"buckets" : [
{
"key" : "Race",
"doc_count" : 2
},
{
"key" : "Disgrace",
"doc_count" : 1
},
{
"key" : "Tracey Chapman",
"doc_count" : 1
}
]
}
}
}
}
Caveats
in order for this to work, you'll have to reindex
ngrams will increase the storage footprint -- depending on how many tags-per-doc you have, it may become a concern
nested fields are internally treated as "separate documents" so this affects the disk space too
P.S.: This is an interesting use case. Let me know how the implementation went!

How to specify or target a field from a specific document type in queries or filters in Elasticsearch?

Given:
Documents of two different types, let's say 'product' and 'category', are indexed to the same Elasticsearch index.
Both document types have a field 'tags'.
Problem:
I want to build a query that returns results of both types, but the documents of type 'product' are allowed to have tags 'X' and 'Y', and the documents of type 'category' are only allowed to have tag 'Z'. How can I achieve this? It appears I can't use product.tags and category.tags since then ES will look for documents' product/category field, which is not what I intend.
Note:
While for the example above there might be some kind of workaround, I'm looking for a general way to target or specify fields of a specific document type when writing queries. I basically want to 'namespace' the field names used in my query so only documents of the type I want to work with are considered.
I think field aliasing would be the best answer for you, but it's not possible.
Instead you can use "copy_to" but I it probably affects index size:
DELETE /test
PUT /test
{
"mappings": {
"product" : {
"properties": {
"tags": { "type": "string", "copy_to": "ptags" },
"ptags": { "type": "string" }
}
},
"category" : {
"properties": {
"tags": { "type": "string", "copy_to": "ctags" },
"ctags": { "type": "string" }
}
}
}
}
PUT /test/product/1
{ "tags":"X" }
PUT /test/product/2
{ "tags":"Y" }
PUT /test/category/1
{ "tags":"Z" }
And you can query one of fields or many of them:
GET /test/product,category/_search
{
"query": {
"term": {
"ptags": {
"value": "x"
}
}
}
}
GET /test/product,category/_search
{
"query": {
"multi_match": {
"query": "x",
"fields": [ "ctags", "ptags" ]
}
}
}

Elastic search multiple terms in a dictionary

I have mapping like:
"profile": {
"properties": {
"educations": {
"properties": {
"university": {
"type": "string"
},
"graduation_year": {
"type": "string"
}
}
}
}
}
which obviously holds the educations history of people. Each person can have multiple educations. What I want to do is search for people who graduated from "SFU" in "2012". To do that I am using filtered search:
"filtered": {
"filter": {
"and": [
{
"term": {
"educations.university": "SFU"
}
},
{
"term": {
"educations.graduation_year": "2012"
}
}
]
}
But what this query does is to find the documents who have "SFU" and "2012" in their education, so this document would match, which is wrong:
educations[0] = {"university": "SFU", "graduation_year": 2000}
educations[1] = {"university": "UBC", "graduation_year": 2012}
Is there anyway I could filter both terms on each education?
You need to define nested type for educations and use nested filter to filter it, or Elasticsearch will internally flattens inner objects into a single object, and return the wrong results.
You can refer here for detail explainations and samples:
http://www.elasticsearch.org/blog/managing-relations-inside-elasticsearch/
http://www.spacevatican.org/2012/6/3/fun-with-elasticsearch-s-children-and-nested-documents/

Resources