Range Query on a score returned by match Query in Elastic Search - elasticsearch

Suppose I have a set of documents like :-
{
"Name":"Random String 1"
"Type":"Keyword"
"City":"Lousiana"
"Quantity":"10"
}
Now I want to implement a full text search using an N-gram analyazer on the field Name and City.
After that , I want to filter only the results returned with
"_score" :<Query Score Returned by ES>
greater than 1.2 (Maybe By Range Query Aggregation Method)
And after that apply term aggregation method on the property: "Type" and then return the top results in each bucket by using "top_hits" aggregation method.
How can I do so ?
I've been able to implement everything apart from the Range Query on score returned by a search query.

if you want to score the documents organically then i you can use min_score in query to filter the matched documents for the score.
for ngram analyer i added whitespace tokenizer and a lowercase filter
Mappings
PUT index1
{
"settings": {
"analysis": {
"analyzer": {
"edge_n_gram_analyzer": {
"tokenizer": "whitespace",
"filter" : ["lowercase", "ednge_gram_filter"]
}
},
"filter": {
"ednge_gram_filter" : {
"type" : "NGram",
"min_gram" : 2,
"max_gram": 10
}
}
}
},
"mappings": {
"document_type" : {
"properties": {
"Name" : {
"type": "text",
"analyzer": "edge_n_gram_analyzer"
},
"City" : {
"type": "text",
"analyzer": "edge_n_gram_analyzer"
},
"Type" : {
"type": "keyword"
}
}
}
}
}
Index Document
POST index1/document_type
{
"Name":"Random String 1",
"Type":"Keyword",
"City":"Lousiana",
"Quantity":"10"
}
Query
POST index1/_search
{
"min_score": 1.2,
"size": 0,
"query": {
"bool": {
"should": [
{
"term": {
"Name": {
"value": "string"
}
}
},
{
"term": {
"City": {
"value": "string"
}
}
}
]
}
},
"aggs": {
"type_terms": {
"terms": {
"field": "Type",
"size": 10
},
"aggs": {
"type_term_top_hits": {
"top_hits": {
"size": 10
}
}
}
}
}
}
Hope this helps

Related

How I can get the distinct result?

What I am trying to do is the query to elastic search (ver 6.4), to get the unique search result (named eids). I made a query as below. What I'd like to do is first text search from both 2 fields called eLabel and pLabel, and get the distinct result called eid. But actually the result is not aggregated, showing redundant ids from 0 to over 20. How I can adjust the query?
{
"query": {
"multi_match": {
"query": "Brazil Capital",
"fields": [
"eLabel",
"pLabel"
]
}
},
"size": 200,
"_source": [
"eid",
"eLabel"
],
"aggs": {
"eids": {
"terms": {
"field": "eid"
}
}
}
}
my current mappings are as follows.
eid : id of entity
eLabel: entity label (ex, Brazil)
prop_id: property id of the entity (eid)
pLabel: the label of the property (ex, is the capital of, is located at ...)
"mappings": {
"entity": {
"properties": {
"eLabel": {
"type": "text" ,
"index_options": "docs" ,
"analyzer": "my_analyzer"
} ,
"eid": {
"type": "keyword"
} ,
"subclass": {
"type": "boolean"
} ,
"pLabel": {
"type": "text" ,
"index_options": "docs" ,
"analyzer": "my_analyzer"
} ,
"prop_id": {
"type": "keyword"
} ,
"pType": {
"type": "keyword"
} ,
"way": {
"type": "keyword"
} ,
"chain": {
"type": "integer"
} ,
"siteKey": {
"type": "keyword"
},
"version": {
"type": "integer"
},
"docId": {
"type": "integer"
}
}
}
}
Based on your comment, you can make use of the below query using Bool. Don't think anything is wrong with aggregation query, just replace the query you have with the bool query I've mentioned and I think it would suffice.
When you make use of multi_match query, it would retrieve even if the document has eLabel = "Rio is capital of brazil" & pLabel = "something else entirely here"
POST <your_index_name>/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"eLabel": "capital"
}
},
{
"match": {
"pLabel": "brazil"
}
}
]
}
},
"size": 200,
"_source": [
"eid",
"eLabel"
],
"aggs": {
"eids": {
"terms": {
"field": "eid"
}
}
}
}
Note that if you only want the values of eid and do not want the documents, you can set "size":0 in the above query. That way you'd only have aggregation results returned.
Let me know if this helps!!

ElasticSearch - Fuzzy and strict match with multiple fields

We want to leverage ElasticSearch to find us similar objects.
Lets say I have an Object with 4 fields:
product_name, seller_name, seller_phone, platform_id.
Similar products can have different product names and seller names across different platforms (fuzzy match).
While, phone is strict and a single variation might cause yield a wrong record (strict match).
What were trying to create is a query that will:
Take into account all fields we have for current record and OR
between them.
Mandate platform_id is the one I want to specific look at. (AND)
Fuzzy the product_name and seller_name
Strictly match the phone number or ignore it in the OR between the fields.
If I would write it in pseudo code, I would write something like:
((product_name like 'some_product_name') OR (seller_name like
'some_seller_name') OR (seller_phone = 'some_phone')) AND (platform_id
= 123)
To do exact match on seller_phone i am indexing this field without ngram analyzers along with fuzzy_query for product_name and seller_name
Mapping
PUT index111
{
"settings": {
"analysis": {
"analyzer": {
"edge_n_gram_analyzer": {
"tokenizer": "whitespace",
"filter" : ["lowercase", "ednge_gram_filter"]
}
},
"filter": {
"ednge_gram_filter" : {
"type" : "NGram",
"min_gram" : 2,
"max_gram": 10
}
}
}
},
"mappings": {
"document_type" : {
"properties": {
"product_name" : {
"type": "text",
"analyzer": "edge_n_gram_analyzer"
},
"seller_name" : {
"type": "text",
"analyzer": "edge_n_gram_analyzer"
},
"seller_phone" : {
"type": "text"
},
"platform_id" : {
"type": "text"
}
}
}
}
}
Index documents
POST index111/document_type
{
"product_name":"macbok",
"seller_name":"apple",
"seller_phone":"9988",
"platform_id":"123"
}
For following pseudo sql query
((product_name like 'some_product_name') OR (seller_name like 'some_seller_name') OR (seller_phone = 'some_phone')) AND (platform_id = 123)
Elastic Query
POST index111/_search
{
"query": {
"bool": {
"must": [
{
"term": {
"platform_id": {
"value": "123"
}
}
},
{
"bool": {
"should": [{
"fuzzy": {
"product_name": {
"value": "macbouk",
"boost": 1.0,
"fuzziness": 2,
"prefix_length": 0,
"max_expansions": 100
}
}
},
{
"fuzzy": {
"seller_name": {
"value": "apdle",
"boost": 1.0,
"fuzziness": 2,
"prefix_length": 0,
"max_expansions": 100
}
}
},
{
"term": {
"seller_phone": {
"value": "9988"
}
}
}
]
}
}]
}
}
}
Hope this helps

Elasticsearch fielddata - should I use it?

Given an index with documents that have a brand property, we need to create a term aggregation that is case insensitive.
Index definition
Please note that the use of fielddata
PUT demo_products
{
"settings": {
"analysis": {
"analyzer": {
"my_custom_analyzer": {
"type": "custom",
"tokenizer": "keyword",
"filter": [
"lowercase"
]
}
}
}
},
"mappings": {
"product": {
"properties": {
"brand": {
"type": "text",
"analyzer": "my_custom_analyzer",
"fielddata": true,
}
}
}
}
}
Data
POST demo_products/product
{
"brand": "New York Jets"
}
POST demo_products/product
{
"brand": "new york jets"
}
POST demo_products/product
{
"brand": "Washington Redskins"
}
Query
GET demo_products/product/_search
{
"size": 0,
"aggs": {
"brand_facet": {
"terms": {
"field": "brand"
}
}
}
}
Result
"aggregations": {
"brand_facet": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "new york jets",
"doc_count": 2
},
{
"key": "washington redskins",
"doc_count": 1
}
]
}
}
If we use keyword instead of text we end up the 2 buckets for New York Jets because of the differences in casing.
We're concerned about the performance implications by using fielddata. However if fielddata is disabled we get the dreaded "Fielddata is disabled on text fields by default."
Any other tips to resolve this - or should we not be so concerned about fielddate?
Starting with ES 5.2 (out today), you can use normalizers with keyword fields in order to (e.g.) lowercase the value.
The role of normalizers is a bit like analyzers for text fields, though what you can do with them is more restrained, but that would probably help with the issue you're facing.
You'd create the index like this:
PUT demo_products
{
"settings": {
"analysis": {
"normalizer": {
"my_normalizer": {
"type": "custom",
"filter": [ "lowercase" ]
}
}
}
},
"mappings": {
"product": {
"properties": {
"brand": {
"type": "keyword",
"normalizer": "my_normalizer"
}
}
}
}
}
And your query would return this:
"aggregations" : {
"brand_facet" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "new york jets",
"doc_count" : 2
},
{
"key" : "washington redskins",
"doc_count" : 1
}
]
}
}
Best of both worlds!
You can lowercase the aggregation at query time if you use a script. It won't perform as well as a normalized keyword field, but is still quite fast in my experience. For example, your query would be:
GET demo_products/product/_search
{
"size": 0,
"aggs": {
"brand_facet": {
"terms": {
"script": "doc['brand'].value.toLowerCase()"
}
}
}
}

Elasticsearch partial matching of a number field

I'm trying to partially match a number field. In the query below, I'd like id (which is defined as a long) to match any document which starts with 419. (so 4191 should match, as should 419534 but not 123419)
{
"size": 20,
"from": 0,
"sort": [{
"customerName": "asc"
}],
"query": {
"bool": {
"must": [{
"bool": {
"should": [{
"term": {
"id": 419
}
}]
}
}]
}
}
}
Anyone got a neat solution to use in my query?
To avoid a egde ngram, you could declare a not analyzed text sub field in your id mapping :
"mappings": {
"default": {
"properties": {
"id": {
"type": "integer",
"fields": {
"prefixed": {
"type": "string",
"index": "not_analyzed"
}
}
},
...
}
}
}
and use a prefix query against that field:
"query": {
"prefix" : {
"id.prefixed" : { "value" : 419 }
}
}
in a string field you can use wildcard query:
{
"query" : {
"wildcard" : { "id" : "*419*" }
}
}

ElasticSearch query on tags

I am trying to crack the elasticsearch query language, and so far I'm not doing very good.
I've got the following mapping for my documents.
{
"mappings": {
"jsondoc": {
"properties": {
"header" : {
"type" : "nested",
"properties" : {
"plainText" : { "type" : "string" },
"title" : { "type" : "string" },
"year" : { "type" : "string" },
"pages" : { "type" : "string" }
}
},
"sentences": {
"type": "nested",
"properties": {
"id": { "type": "integer" },
"text": { "type": "string" },
"tokens": { "type": "nested" },
"rhetoricalClass": { "type": "string" },
"babelSynsetsOcc": {
"type": "nested",
"properties" : {
"id" : { "type" : "integer" },
"text" : { "type" : "string" },
"synsetID" : { "type" : "string" }
}
}
}
}
}
}
}
}
It mainly resembles a JSON file referring to a pdf document.
I have been trying to make queries with aggregations and so far is going great. I've gotten to the point of grouping by (aggregating) rhetoricalClass, get the total number of repetitions of babelSynsetsOcc.synsetID. Heck, even the same query even by grouping the whole result by header.year
But, right now, I am struggling with filtering the documents that contain a term and doing the same query.
So, how could I make a query such that grouping by rhetoricalClass and only taking into account those documents whose field header.plainText contains either ["Computational", "Compositional", "Semantics"]. I mean contain instead of equal!.
If I were to make a rough translation to SQL it would be something similar to
SELECT count(sentences.babelSynsetsOcc.synsetID)
FROM jsondoc
WHERE header.plainText like '%Computational%' OR header.plainText like '%Compositional%' OR header.plainText like '%Sematics%'
GROUP BY sentences.rhetoricalClass
WHERE clauses are just standard structured queries, so they translate to queries in Elasticsearch.
GROUP BY and HAVING loosely translate to aggregations in Elasticsearch's DSL. Functions like count, min max, and sum are a function of GROUP BY and it's therefore also an aggregation.
The fact that you're using nested objects may be necessary, but it adds an extra layer to each part that touches them. If those nested objects are not arrays, then do not use nested; use object in that case.
I would probably look at translating your query to:
{
"query": {
"nested": {
"path": "header",
"query": {
"bool": {
"should": [
{
"match": {
"header.plainText" : "Computational"
}
},
{
"match": {
"header.plainText" : "Compositional"
}
},
{
"match": {
"header.plainText" : "Semantics"
}
}
]
}
}
}
}
}
Alternatively, it could be rewritten as this, which is a little less obvious of its intent:
{
"query": {
"nested": {
"path": "header",
"query": {
"match": {
"header.plainText": "Computational Compositional Semantics"
}
}
}
}
}
The aggregation would then be:
{
"aggs": {
"nested_sentences": {
"nested": {
"path": "sentences"
},
"group_by_rhetorical_class": {
"terms": {
"field": "sentences.rhetoricalClass",
"size": 10
},
"aggs": {
"nested_babel": {
"path": "sentences.babelSynsetsOcc"
},
"aggs": {
"count_synset_id": {
"count": {
"field": "sentences.babelSynsetsOcc.synsetID"
}
}
}
}
}
}
}
}
Now, if you combine them and throw away hits (since you're just looking for the aggregated result), then it looks like this:
{
"size": 0,
"query": {
"nested": {
"path": "header",
"query": {
"match": {
"header.plainText": "Computational Compositional Semantics"
}
}
}
},
"aggs": {
"nested_sentences": {
"nested": {
"path": "sentences"
},
"group_by_rhetorical_class": {
"terms": {
"field": "sentences.rhetoricalClass",
"size": 10
},
"aggs": {
"nested_babel": {
"path": "sentences.babelSynsetsOcc"
},
"aggs": {
"count_synset_id": {
"count": {
"field": "sentences.babelSynsetsOcc.synsetID"
}
}
}
}
}
}
}
}

Resources