How I can get the distinct result? - elasticsearch

What I am trying to do is the query to elastic search (ver 6.4), to get the unique search result (named eids). I made a query as below. What I'd like to do is first text search from both 2 fields called eLabel and pLabel, and get the distinct result called eid. But actually the result is not aggregated, showing redundant ids from 0 to over 20. How I can adjust the query?
{
"query": {
"multi_match": {
"query": "Brazil Capital",
"fields": [
"eLabel",
"pLabel"
]
}
},
"size": 200,
"_source": [
"eid",
"eLabel"
],
"aggs": {
"eids": {
"terms": {
"field": "eid"
}
}
}
}
my current mappings are as follows.
eid : id of entity
eLabel: entity label (ex, Brazil)
prop_id: property id of the entity (eid)
pLabel: the label of the property (ex, is the capital of, is located at ...)
"mappings": {
"entity": {
"properties": {
"eLabel": {
"type": "text" ,
"index_options": "docs" ,
"analyzer": "my_analyzer"
} ,
"eid": {
"type": "keyword"
} ,
"subclass": {
"type": "boolean"
} ,
"pLabel": {
"type": "text" ,
"index_options": "docs" ,
"analyzer": "my_analyzer"
} ,
"prop_id": {
"type": "keyword"
} ,
"pType": {
"type": "keyword"
} ,
"way": {
"type": "keyword"
} ,
"chain": {
"type": "integer"
} ,
"siteKey": {
"type": "keyword"
},
"version": {
"type": "integer"
},
"docId": {
"type": "integer"
}
}
}
}

Based on your comment, you can make use of the below query using Bool. Don't think anything is wrong with aggregation query, just replace the query you have with the bool query I've mentioned and I think it would suffice.
When you make use of multi_match query, it would retrieve even if the document has eLabel = "Rio is capital of brazil" & pLabel = "something else entirely here"
POST <your_index_name>/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"eLabel": "capital"
}
},
{
"match": {
"pLabel": "brazil"
}
}
]
}
},
"size": 200,
"_source": [
"eid",
"eLabel"
],
"aggs": {
"eids": {
"terms": {
"field": "eid"
}
}
}
}
Note that if you only want the values of eid and do not want the documents, you can set "size":0 in the above query. That way you'd only have aggregation results returned.
Let me know if this helps!!

Related

How can I get parent with all children in one query

I have following mapping:
PUT /test_products
{
"mappings": {
"_doc": {
"properties": {
"type": {
"type": "keyword"
},
"name": {
"type": "text"
},
"entity_id": {
"type": "integer"
},
"weighted": {
"type": "integer"
}
"product_relation": {
"type": "join",
"relations": {
"window": "simple"
}
}
}
}
}
}
I want to get "window" products with all "simple"s but only where one or more "simple"s have property "weighted" = 1
I wrote following query:
GET test_products/_search
{
"query": {
"has_child": {
"type": "simple",
"query": {
"term": {
"weighted": 1
}
},
"inner_hits": {}
}
}
}
But I've got "window"s with "simple"s which are match to the term. In other words I want to filter "window"s list by "simple"'s option and get all matched "window"s with all their "simple"s. Is it possible without "nested" in one query? Or I have to do some queries?
OK. Luckily, I need to get only one "window" product with all it's children by it's ID, so I found parent_id query which can helps me with this task.
Now I have following query:
GET test_products/_search
{
"query": {
"parent_id": {
"type": "simple",
"id": "window-1"
}
}
}
Unfortunately, I have to execute 2 queries (has_child and then parent_id) instead of one but it's OK for me.

ElasticSearch - Fuzzy and strict match with multiple fields

We want to leverage ElasticSearch to find us similar objects.
Lets say I have an Object with 4 fields:
product_name, seller_name, seller_phone, platform_id.
Similar products can have different product names and seller names across different platforms (fuzzy match).
While, phone is strict and a single variation might cause yield a wrong record (strict match).
What were trying to create is a query that will:
Take into account all fields we have for current record and OR
between them.
Mandate platform_id is the one I want to specific look at. (AND)
Fuzzy the product_name and seller_name
Strictly match the phone number or ignore it in the OR between the fields.
If I would write it in pseudo code, I would write something like:
((product_name like 'some_product_name') OR (seller_name like
'some_seller_name') OR (seller_phone = 'some_phone')) AND (platform_id
= 123)
To do exact match on seller_phone i am indexing this field without ngram analyzers along with fuzzy_query for product_name and seller_name
Mapping
PUT index111
{
"settings": {
"analysis": {
"analyzer": {
"edge_n_gram_analyzer": {
"tokenizer": "whitespace",
"filter" : ["lowercase", "ednge_gram_filter"]
}
},
"filter": {
"ednge_gram_filter" : {
"type" : "NGram",
"min_gram" : 2,
"max_gram": 10
}
}
}
},
"mappings": {
"document_type" : {
"properties": {
"product_name" : {
"type": "text",
"analyzer": "edge_n_gram_analyzer"
},
"seller_name" : {
"type": "text",
"analyzer": "edge_n_gram_analyzer"
},
"seller_phone" : {
"type": "text"
},
"platform_id" : {
"type": "text"
}
}
}
}
}
Index documents
POST index111/document_type
{
"product_name":"macbok",
"seller_name":"apple",
"seller_phone":"9988",
"platform_id":"123"
}
For following pseudo sql query
((product_name like 'some_product_name') OR (seller_name like 'some_seller_name') OR (seller_phone = 'some_phone')) AND (platform_id = 123)
Elastic Query
POST index111/_search
{
"query": {
"bool": {
"must": [
{
"term": {
"platform_id": {
"value": "123"
}
}
},
{
"bool": {
"should": [{
"fuzzy": {
"product_name": {
"value": "macbouk",
"boost": 1.0,
"fuzziness": 2,
"prefix_length": 0,
"max_expansions": 100
}
}
},
{
"fuzzy": {
"seller_name": {
"value": "apdle",
"boost": 1.0,
"fuzziness": 2,
"prefix_length": 0,
"max_expansions": 100
}
}
},
{
"term": {
"seller_phone": {
"value": "9988"
}
}
}
]
}
}]
}
}
}
Hope this helps

Range Query on a score returned by match Query in Elastic Search

Suppose I have a set of documents like :-
{
"Name":"Random String 1"
"Type":"Keyword"
"City":"Lousiana"
"Quantity":"10"
}
Now I want to implement a full text search using an N-gram analyazer on the field Name and City.
After that , I want to filter only the results returned with
"_score" :<Query Score Returned by ES>
greater than 1.2 (Maybe By Range Query Aggregation Method)
And after that apply term aggregation method on the property: "Type" and then return the top results in each bucket by using "top_hits" aggregation method.
How can I do so ?
I've been able to implement everything apart from the Range Query on score returned by a search query.
if you want to score the documents organically then i you can use min_score in query to filter the matched documents for the score.
for ngram analyer i added whitespace tokenizer and a lowercase filter
Mappings
PUT index1
{
"settings": {
"analysis": {
"analyzer": {
"edge_n_gram_analyzer": {
"tokenizer": "whitespace",
"filter" : ["lowercase", "ednge_gram_filter"]
}
},
"filter": {
"ednge_gram_filter" : {
"type" : "NGram",
"min_gram" : 2,
"max_gram": 10
}
}
}
},
"mappings": {
"document_type" : {
"properties": {
"Name" : {
"type": "text",
"analyzer": "edge_n_gram_analyzer"
},
"City" : {
"type": "text",
"analyzer": "edge_n_gram_analyzer"
},
"Type" : {
"type": "keyword"
}
}
}
}
}
Index Document
POST index1/document_type
{
"Name":"Random String 1",
"Type":"Keyword",
"City":"Lousiana",
"Quantity":"10"
}
Query
POST index1/_search
{
"min_score": 1.2,
"size": 0,
"query": {
"bool": {
"should": [
{
"term": {
"Name": {
"value": "string"
}
}
},
{
"term": {
"City": {
"value": "string"
}
}
}
]
}
},
"aggs": {
"type_terms": {
"terms": {
"field": "Type",
"size": 10
},
"aggs": {
"type_term_top_hits": {
"top_hits": {
"size": 10
}
}
}
}
}
}
Hope this helps

Raw nested aggregation

I would like to create a raw nested aggregation in ElasticSearch, but I'm enable to get it working.
My documents look like this :
{
"_index": "items",
"_type": "frame_spec",
"_id": "19770602001",
"_score": 1,
"_source": {
"item_type_name": "frame_spec",
"status": "published",
"creation_date": "2016-02-18T11:19:15Z",
"last_change_date": "2016-02-18T11:19:15Z",
"publishing_date": "2016-02-18T11:19:15Z",
"attributes": [
{
"brand": "Sun"
},
{
"model": "Sunglasses1"
},
{
"eyesize": "56"
},
{
"opc": "19770602001"
},
{
"madein": "UNITED KINGDOM"
}
]
}
}
What I want to do is to aggregate based on one of the attributes. I can't do a normal aggregation with "attributes.model" (for example) because some of them contain spaces. So I've tried using the "raw" property but it appears that ES considers it as a normal property and does not return any result.
This is what I've tried :
{
"size": 0,
"aggs": {
"brand": {
"terms": {
"field": "attributes.brand.raw"
}
}
}
}
But I have no result.
Have you any solution I could use for this problem ?
You should use a dynamic_template in your mapping that will catch all attributes.* string fields and create a raw sub-field for all of them. For other types than string, you don't really need raw fields. You need to delete your current index and then recreate it with this:
DELETE items
PUT items
{
"mappings": {
"frame_spec": {
"dynamic_templates": [
{
"strings": {
"match_mapping_type": "string",
"path_match": "attributes.*",
"mapping": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed",
"ignore_above": 256
}
}
}
}
}
]
}
}
}
After that, you need to re-populate your index and then you'll be able to run this:
POST /items/_search
{
"size": 0,
"aggs": {
"brand": {
"terms": {
"field": "attributes.brand.raw"
}
}
}
}

Elasticsearch Query aggregated by unique substrings (email domain)

I have an elasticsearch query that queries over an index and then aggregates based on a specific field sender_not_analyzed. I then use a term aggregation on that same field sender_not_analyzed which returns buckets for the top "senders". My query is currently:
{
"size": 0,
"query": {
"regexp": {
"sender_not_analyzed": ".*[#].*"
}
},
"aggs": {
"sender-stats": {
"terms": {
"field": "sender_not_analyzed"
}
}
}
}
which returns buckets that look like:
"aggregations": {
"sender-stats": {
"buckets": [
{
"key": "<Mike <mike#fizzbuzz.com>#MISSING_DOMAIN>",
"doc_count": 5017
},
{
"key": "jon.doe#foo.com",
"doc_count": 3963
},
{
"key": "jane.doe#foo.com",
"doc_count": 2857
},
{
"key": "jon.doe#bar.com",
"doc_count":1544
}
How can I write an aggregation such that I get single bucket for each unique email domain, eg foo.com would have a doc_count of (3963 + 2857) 6820? Can I accomplish this with a regex aggregation or do I need to write some kind of custom analyzer to split the string at the # to the end of string?
This is pretty late, but I think this can be done by using pattern_replace char filter, you capture the domain name with regex, This is my setup
POST email_index
{
"settings": {
"analysis": {
"analyzer": {
"my_custom_analyzer": {
"char_filter": [
"domain"
],
"tokenizer": "keyword",
"filter": [
"lowercase",
"asciifolding"
]
}
},
"char_filter": {
"domain": {
"type": "pattern_replace",
"pattern": ".*#(.*)",
"replacement": "$1"
}
}
}
},
"mappings": {
"your_type": {
"properties": {
"domain": {
"type": "string",
"analyzer": "my_custom_analyzer"
},
"sender_not_analyzed": {
"type": "string",
"index": "not_analyzed",
"copy_to": "domain"
}
}
}
}
}
Here domain char filter will capture the domain name, we need to use keyword tokenizer to get the domain as it is, I am using lowercase filter but it is up to you if you want to use it or not. Using copy_to parameter to copy the value of the sender_not_analyzed to domain field, although _source field won't be modified to include this value but we can query it.
GET email_index/_search
{
"size": 0,
"query": {
"regexp": {
"sender_not_analyzed": ".*[#].*"
}
},
"aggs": {
"sender-stats": {
"terms": {
"field": "domain"
}
}
}
}
This will give you desired result.

Resources