Elasticsearch - Cardinality over Full Field Value

I have a document that looks like this:
{
"_id":"some_id_value",
"_source":{
"client":{
"name":"x"
},
"project":{
"name":"x November 2016"
}
}
}
I am attempting to perform a query that will fetch me the count of unique project names for each client. For this, I am using a query with cardinality over the project.name. I am sure that there are only 4 unique project names for this particular client. However, when I run my query, I get a count of 5, which I know is wrong.
The project names all contain the name of the client. For instance, if a client is "X", project names will be "X Testing November 2016", or "X Jan 2016", etc. I don't know if that is a consideration.
This is the mapping for the document type
{
"mappings":{
"vma_docs":{
"properties":{
"client":{
"properties":{
"contact":{
"type":"string"
},
"name":{
"type":"string"
}
}
},
"project":{
"properties":{
"end_date":{
"format":"yyyy-MM-dd",
"type":"date"
},
"project_type":{
"type":"string"
},
"name":{
"type":"string"
},
"project_manager":{
"index":"not_analyzed",
"type":"string"
},
"start_date":{
"format":"yyyy-MM-dd",
"type":"date"
}
}
}
}
}
}
}
This is my search query
{
"fields":[
"client.name",
"project.name"
],
"query":{
"bool":{
"must":{
"match":{
"client.name":{
"operator":"and",
"query":"ABC systems"
}
}
}
}
},
"aggs":{
"num_projects":{
"cardinality":{
"field":"project.name"
}
}
},
"size":5
}
These are the results I get (I have posted only 2 hits for the sake of brevity). Note that the num_projects aggregation returns 5, but it should return 4, which is the total number of distinct projects.
{
"hits":{
"hits":[
{
"_score":5.8553367,
"_type":"vma_docs",
"_id":"AVTMIM9IBwwoAW3mzgKz",
"fields":{
"project.name":[
"ABC"
],
"client.name":[
"ABC systems Pvt Ltd"
]
},
"_index":"vma"
},
{
"_score":5.8553367,
"_type":"vma_docs",
"_id":"AVTMIM9YBwwoAW3mzgK2",
"fields":{
"project.name":[
"ABC"
],
"client.name":[
"ABC systems Pvt Ltd"
]
},
"_index":"vma"
}
],
"total":18,
"max_score":5.8553367
},
"_shards":{
"successful":5,
"failed":0,
"total":5
},
"took":4,
"aggregations":{
"num_projects":{
"value":5
}
},
"timed_out":false
}
FYI: The project names are ABC, ABC Nov 2016, ABC retest November, ABC Mobile App

You need the following mapping for your project.name field:
{
"mappings": {
"vma_docs": {
"properties": {
"client": {
"properties": {
"contact": {
"type": "string"
},
"name": {
"type": "string"
}
}
},
"project": {
"properties": {
"end_date": {
"format": "yyyy-MM-dd",
"type": "date"
},
"project_type": {
"type": "string"
},
"name": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
},
"project_manager": {
"index": "not_analyzed",
"type": "string"
},
"start_date": {
"format": "yyyy-MM-dd",
"type": "date"
}
}
}
}
}
}
}
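This adds a subfield called raw: the value indexed into project.name is also indexed into project.name.raw, but as-is, without being tokenized or analyzed. The cardinality aggregation on the analyzed project.name field counts distinct tokens rather than distinct full names, which is why the number comes out too high. The query you need to use is then: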
{
"fields": [
"client.name",
"project.name"
],
"query": {
"bool": {
"must": {
"match": {
"client.name": {
"operator": "and",
"query": "ABC systems"
}
}
}
}
},
"aggs": {
"num_projects": {
"cardinality": {
"field": "project.name.raw"
}
}
},
"size": 5
}
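To see why the analyzed field inflates the count, here is a minimal Python sketch. It assumes a simplified stand-in for the analyzer (lowercase and split on whitespace; the real analyzer does more) and uses the four project names from the question. The function names are hypothetical. The exact number a cluster reports depends on the analyzer and the matching documents; the point is only that tokens, not whole names, get counted.

```python
def simple_tokens(value):
    # Simplified stand-in for an analyzer (assumption): lowercase, split on whitespace.
    return value.lower().split()


def distinct_analyzed(values):
    # On an analyzed field, distinct values are distinct tokens across all names.
    return len({tok for v in values for tok in simple_tokens(v)})


def distinct_raw(values):
    # On a not_analyzed subfield, each full name is a single value.
    return len(set(values))


project_names = ["ABC", "ABC Nov 2016", "ABC retest November", "ABC Mobile App"]
print(distinct_analyzed(project_names))  # counts tokens
print(distinct_raw(project_names))       # counts whole names
```

Aggregating on project.name.raw corresponds to the second function: one value per full project name.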

Related

How I can get the distinct result?

I am querying Elasticsearch (version 6.4) and want a list of unique results (the eid values). I first run a text search over two fields, eLabel and pLabel, and then want the distinct eid values of the matching documents. However, the result is not aggregated: the hits contain redundant ids, each appearing anywhere from 0 to over 20 times. How can I adjust the query below?
{
"query": {
"multi_match": {
"query": "Brazil Capital",
"fields": [
"eLabel",
"pLabel"
]
}
},
"size": 200,
"_source": [
"eid",
"eLabel"
],
"aggs": {
"eids": {
"terms": {
"field": "eid"
}
}
}
}
My current mappings are as follows.
eid : id of entity
eLabel: entity label (ex, Brazil)
prop_id: property id of the entity (eid)
pLabel: the label of the property (ex, is the capital of, is located at ...)
"mappings": {
"entity": {
"properties": {
"eLabel": {
"type": "text" ,
"index_options": "docs" ,
"analyzer": "my_analyzer"
} ,
"eid": {
"type": "keyword"
} ,
"subclass": {
"type": "boolean"
} ,
"pLabel": {
"type": "text" ,
"index_options": "docs" ,
"analyzer": "my_analyzer"
} ,
"prop_id": {
"type": "keyword"
} ,
"pType": {
"type": "keyword"
} ,
"way": {
"type": "keyword"
} ,
"chain": {
"type": "integer"
} ,
"siteKey": {
"type": "keyword"
},
"version": {
"type": "integer"
},
"docId": {
"type": "integer"
}
}
}
}
Based on your comment, you can use the bool query below. I don't think anything is wrong with the aggregation itself; just replace the query you have with the bool query I've shown, and that should suffice.
The problem with multi_match is that a document matches as soon as the terms appear in any one of the fields, so it would be retrieved even if it had eLabel = "Rio is capital of brazil" and pLabel = "something else entirely here".
POST <your_index_name>/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"eLabel": "capital"
}
},
{
"match": {
"pLabel": "brazil"
}
}
]
}
},
"size": 200,
"_source": [
"eid",
"eLabel"
],
"aggs": {
"eids": {
"terms": {
"field": "eid"
}
}
}
}
Note that if you only want the values of eid and do not want the documents, you can set "size":0 in the above query. That way you'd only have aggregation results returned.
Let me know if this helps!!
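As a rough sketch of the "size": 0 variant, the Python below builds the request body and pulls the distinct eid values out of the aggregation part of a response. The helper names and the sample response (including its eid keys) are made up for illustration. One thing worth knowing: a terms aggregation returns only the top 10 buckets unless its own "size" is set, so the sketch sets it explicitly.

```python
def distinct_eids_body(e_label_text, p_label_text, max_buckets=200):
    # "size": 0 suppresses the hits; only the aggregation result is returned.
    return {
        "size": 0,
        "query": {
            "bool": {
                "must": [
                    {"match": {"eLabel": e_label_text}},
                    {"match": {"pLabel": p_label_text}},
                ]
            }
        },
        # The terms aggregation yields one bucket per distinct eid.
        "aggs": {"eids": {"terms": {"field": "eid", "size": max_buckets}}},
    }


def eids_from_response(response):
    # Each bucket key is one distinct eid among the matching documents.
    return [bucket["key"] for bucket in response["aggregations"]["eids"]["buckets"]]


# Hand-written response fragment, for illustration only.
sample_response = {
    "aggregations": {
        "eids": {"buckets": [{"key": "e1", "doc_count": 3}, {"key": "e2", "doc_count": 1}]}
    }
}
body = distinct_eids_body("capital", "brazil")
print(eids_from_response(sample_response))
```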

How to select fields after aggregation in Elastic Search 2.3

I have the following schema for an index:
PUT
"mappings": {
"event": {
"properties": {
"#timestamp": { "type": "date", "doc_values": true},
"partner_id": { "type": "integer", "doc_values": true},
"event_id": { "type": "integer", "doc_values": true},
"count": { "type": "integer", "doc_values": true, "index": "no" },
"device_id": { "type": "string", "index":"not_analyzed","doc_values":true },
"product_id": { "type": "integer", "doc_values": true}
}
}
}
I need a result equivalent to the following SQL query:
SELECT product_id, device_id, sum(count) FROM index WHERE partner_id=5 AND timestamp<=end_date AND timestamp>=start_date GROUP BY device_id,product_id having sum(count)>1;
I am able to get that result with the following Elasticsearch query:
GET
{
"store": true,
"size":0,
"aggs":{
"matching_events":{
"filter":{
"bool":{
"must":[
{
"term":{
"partner_id":5
}
},
{
"range":{
"#timestamp":{
"from":1470904000,
"to":1470904999
}
}
}
]
}
},
"aggs":{
"group_by_productid": {
"terms":{
"field":"product_id"
},
"aggs":{
"group_by_device_id":{
"terms":{
"field":"device_id"
},
"aggs":{
"total_count":{
"sum":{
"field":"count"
}
},
"sales_bucket_filter":{
"bucket_selector":{
"buckets_path":{
"totalCount":"total_count"
},
"script": {"inline": "totalCount > 1"}
}
}
}
}}
}
}
}
}
}
However, where the summed count is <= 1, the query still returns empty buckets keyed by product_id. Out of 40 million groups, only about 100k satisfy the condition, so I get a huge result set back, most of which is useless. How can I select only particular fields after the aggregation? I tried this, but it does not work: "fields": ["aggregations.matching_events.group_by_productid.group_by_device_id.buckets.key"]
Edit:
I have the following set of data:
device id                    Partner Id    Count
db63te2bd38672921ffw27t82    367           3
db63te2bd38672921ffw27t82    272           1
I got this output:
{
"took":6,
"timed_out":false,
"_shards":{
"total":5,
"successful":5,
"failed":0
},
"hits":{
"total":7,
"max_score":0.0,
"hits":[
]
},
"aggregations":{
"matching_events":{
"doc_count":5,
"group_by_productid":{
"doc_count_error_upper_bound":0,
"sum_other_doc_count":0,
"buckets":[
{
"key":367,
"doc_count":3,
"group_by_device_id":{
"doc_count_error_upper_bound":0,
"sum_other_doc_count":0,
"buckets":[
{
"key":"db63te2bd38672921ffw27t82",
"doc_count":3,
"total_count":{
"value":3.0
}
}
]
}
},
{
"key":272,
"doc_count":1,
"group_by_device_id":{
"doc_count_error_upper_bound":0,
"sum_other_doc_count":0,
"buckets":[
]
}
}
]
}
}
}
}
As you can see, the bucket with key 272 is empty, which makes sense, but shouldn't this bucket be removed from the result set altogether?
I've just found out that there is a fairly recent issue and PR that add a _bucket_count path to the buckets_path option, so that an aggregation can filter its parent bucket based on the number of buckets another aggregation has. In other words, if _bucket_count is 0, a bucket_selector on the parent aggregation can remove that bucket.
This is the GitHub issue: https://github.com/elastic/elasticsearch/issues/19553
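Where a version with that feature is not available, one workaround is to drop the empty parent buckets on the client once the response arrives. The Python sketch below does that for a trimmed copy of the output shown in the question. Note that this is a swapped-in client-side technique, not the bucket_selector approach from the issue, and it does not reduce what Elasticsearch sends back; the function name is hypothetical.

```python
def prune_empty_product_buckets(response):
    # Keep only product buckets that still contain at least one device bucket.
    # The response dict is modified in place and also returned.
    agg = response["aggregations"]["matching_events"]["group_by_productid"]
    agg["buckets"] = [
        bucket for bucket in agg["buckets"]
        if bucket["group_by_device_id"]["buckets"]
    ]
    return response


# Trimmed copy of the aggregation output from the question.
sample = {
    "aggregations": {
        "matching_events": {
            "doc_count": 5,
            "group_by_productid": {
                "buckets": [
                    {
                        "key": 367,
                        "doc_count": 3,
                        "group_by_device_id": {
                            "buckets": [
                                {
                                    "key": "db63te2bd38672921ffw27t82",
                                    "doc_count": 3,
                                    "total_count": {"value": 3.0},
                                }
                            ]
                        },
                    },
                    {"key": 272, "doc_count": 1, "group_by_device_id": {"buckets": []}},
                ]
            },
        }
    }
}

pruned = prune_empty_product_buckets(sample)
remaining = pruned["aggregations"]["matching_events"]["group_by_productid"]["buckets"]
print([bucket["key"] for bucket in remaining])
```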

Multi-level nesting in elastic search

I have the below structure (small part of a very large elastic-search document)
sample: {
{
"md5sum":"4002cbda13066720513d1c9d55dba809",
"id":1,
"sha256sum":"1c6e77ec49413bf7043af2058f147fb147c4ee741fb478872f072d063f2338c5",
"sha1sum":"ba1e6e9a849fb4e13e92b33d023d40a0f105f908",
"created_at":"2016-02-02T14:25:19+00:00",
"updated_at":"2016-02-11T20:43:22+00:00",
"file_size":188416,
"type":{
"name":"EXE"
},
"tags":[
],
"sampleSources":[
{
"filename":"4002cbda13066720513d1c9d55dba809",
"source":{
"name":"default"
}
},
{
"filename":"4002cbda13066720332513d1c9d55dba809",
"source":{
"name":"default"
}
}
]
}
}
I would like to filter documents by the 'name' contained within sample.sampleSources.source.
I tried the below queries
curl -XGET "http://localhost:9200/app/sample/_search?pretty" -d {query}
where, {query} is
{
"query":{
"nested":{
"path":"sample.sampleSources",
"query":{
"nested":{
"path":"sample.sampleSources.source",
"query":{
"match":{
"sample.sampleSources.source.name":"default"
}
}
}
}
}
}
}
However, it does not return any results. In some parts of my document the nesting goes deeper than this. Can someone please guide me on how to formulate this query so that it works for all cases?
EDIT 1
Mappings:
{
"app":{
"mappings":{
"sample":{
"sampleSources":{
"type":"nested",
"properties":{
"filename":{
"type":"string"
},
"source":{
"type":"nested",
"properties":{
"name":{
"type":"string"
}
}
}
}
}
}
EDIT 2
The solution posted by Waldemar Neto below works well for a match query, but not for a wildcard or a regexp query.
Can you please guide me? I need the wildcard and regexp queries to work as well.
I tried it here using your examples and it works fine.
Take a look at my data.
mapping:
PUT /app
{
"mappings": {
"sample": {
"properties": {
"sampleSources": {
"type": "nested",
"properties": {
"source": {
"type": "nested"
}
}
}
}
}
}
}
indexed data
POST /app/sample
{
"md5sum": "4002cbda13066720513d1c9d55dba809",
"id": 1,
"sha256sum": "1c6e77ec49413bf7043af2058f147fb147c4ee741fb478872f072d063f2338c5",
"sha1sum": "ba1e6e9a849fb4e13e92b33d023d40a0f105f908",
"created_at": "2016-02-02T14:25:19+00:00",
"updated_at": "2016-02-11T20:43:22+00:00",
"file_size": 188416,
"type": {
"name": "EXE"
},
"tags": [],
"sampleSources": [
{
"filename": "4002cbda13066720513d1c9d55dba809",
"source": {
"name": "default"
}
},
{
"filename": "4002cbda13066720332513d1c9d55dba809",
"source": {
"name": "default"
}
}
]
}
Search query
GET /app/sample/_search
{
"query": {
"nested": {
"path": "sampleSources.source",
"query": {
"match": {
"sampleSources.source.name": "default"
}
}
}
}
}
Example using wildcard
GET /app/sample/_search
{
"query": {
"nested": {
"path": "sampleSources.source",
"query": {
"wildcard": {
"sampleSources.source.name": {
"value": "*aul*"
}
}
}
}
}
}
The only difference I saw was in the path: you don't need to put sample (the type) in the nested path, only the inner objects.
Test it and give me feedback.
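To make the path rule concrete, here is a small Python sketch of a helper that builds the request body for a match, wildcard, or regexp query inside a nested path. The helper name and its parameters are hypothetical; the body it produces has the same shape as the queries above. Keep in mind that wildcard and regexp are term-level queries, so the pattern is compared against the indexed terms and is not analyzed first.

```python
def nested_query(path, field, value, kind="match"):
    # path and field start at the first nested object, without the type name.
    if kind == "match":
        inner = {"match": {field: value}}
    elif kind in ("wildcard", "regexp"):
        inner = {kind: {field: {"value": value}}}
    else:
        raise ValueError("unsupported query kind: " + kind)
    return {"query": {"nested": {"path": path, "query": inner}}}


match_body = nested_query("sampleSources.source", "sampleSources.source.name", "default")
wildcard_body = nested_query(
    "sampleSources.source", "sampleSources.source.name", "*aul*", kind="wildcard"
)
print(wildcard_body)
```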

Partial word search - ElasticSearch 1.7.2

I've been trying to build a search module for an application, using ElasticSearch. Below is the Index Structure I've constructed from sample code I read from other StackOverflow posts.
{
"megacorp4":{
"settings":{
"analysis":{
"analyzer":{
"my_analyzer":{
"type":"custom",
"tokenizer":"my_ngram_tokenizer",
"filter":[
"my_ngram_filter"
]
}
},
"filter":{
"my_ngram_filter":{
"type":"edgeNGram",
"min_gram":3,
"max_gram":15
}
},
"tokenizer":{
"my_ngram_tokenizer":{
"type":"edgeNGram",
"min_gram":3,
"max_gram":15
}
}
},
"mappings":{
"employee":{
"properties":{
"about":{
"type":"string",
"analyzer":"my_analyzer"
},
"age":{
"type":"long"
},
"first_name":{
"type":"string"
},
"interests":{
"type":"string",
"analyzer":"my_analyzer"
},
"last_name":{
"type":"string"
}
}
}
}
}
}
}
Below are the records I inserted to test the search functionality
[
{
"first_name":"John",
"last_name":"Smith",
"age":25,
"about":"I love to go rock climbing",
"interests":[
"sports",
"music"
]
},
{
"first_name":"Douglas",
"last_name":"Fir",
"age":35,
"about":"I like to build album climb cabinets",
"interests":[
"forestry",
"music"
]
},
{
"first_name":"Jane",
"last_name":"Smith",
"age":32,
"about":"I like to collect rock albums",
"interests":[
"music"
]
}
]
I ran a search on the 'about' column, both using API (through POSTMAN) and in the Python client as follows :
API Query:
localhost:9200/megacorp4/_search?q=climb
Python Query :
from elasticsearch import Elasticsearch
from pprint import pprint
es = Elasticsearch()
res = es.search(index="megacorp4", body={"query": {"match": {'about':"climb"}}})
pprint(res)
I only get exact matches; the record containing 'climbing' is not in the output. When I replace 'climb' with 'climb*' in the query, I get both records, the one with 'climb' and the one with 'climbing'. However, I don't want to use the '*' wildcard approach.
I've also tried the built-in 'english', 'standard' and 'ngram' analyzers, but nothing seemed to work.
I need help implementing partial-word search in full text.
Thanks in advance.
Use this mapping instead:
DELETE test
PUT /test
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"my_ngram_filter"
]
}
},
"filter": {
"my_ngram_filter": {
"type": "edgeNGram",
"min_gram": 3,
"max_gram": 15
}
}
}
},
"mappings": {
"employee": {
"properties": {
"about": {
"type": "string",
"analyzer": "my_analyzer"
},
"age": {
"type": "long"
},
"first_name": {
"type": "string"
},
"interests": {
"type": "string",
"analyzer": "my_analyzer"
},
"last_name": {
"type": "string"
}
}
}
}
}
POST /test/employee/_bulk
{"index":{}}
{"first_name":"John","last_name":"Smith","age":25,"about":"I love to go rock climbing","interests":["sports","music"]}
{"index":{}}
{"first_name":"Douglas","last_name":"Fir","age":35,"about":"I like to build album climb cabinets","interests":["forestry","music"]}
{"index":{}}
{"first_name":"Jane","last_name":"Smith","age":32,"about":"I like to collect rock albums","interests":["music"]}
GET /test/_search?q=about:climb
GET /test/_search
{
"query": {
"query_string": {
"query": "about:climb"
}
}
}
GET /test/_search
{
"query": {
"match": {
"about": "climb"
}
}
}
Two changes:
1. You need another closing curly bracket for the settings part.
2. Replace your custom tokenizer (which does not help, since you already have the edgeNGram filter) with another one; my suggestion is the standard tokenizer.
As for the ?q=climb part: by default this searches the _all field, which is analyzed with the standard analyzer and not with your custom one.
So, the correct query is localhost:9200/megacorp4/_search?q=about:climb.
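A minimal Python sketch of what the edgeNGram filter does to each token may help show why 'climb' then matches 'climbing'. It assumes the min_gram and max_gram values from the mapping above and uses a whitespace split as a simplified stand-in for the standard tokenizer; the function names are hypothetical.

```python
def edge_ngrams(token, min_gram=3, max_gram=15):
    # Prefixes of the token from min_gram up to max_gram characters.
    # Tokens shorter than min_gram produce nothing in this sketch.
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


def analyze(text):
    # Stand-in for the tokenizer (assumption): split on whitespace only.
    grams = []
    for token in text.split():
        grams.extend(edge_ngrams(token))
    return grams


indexed = analyze("I love to go rock climbing")
searched = analyze("climb")
print(edge_ngrams("climbing"))
print(set(indexed) & set(searched))  # terms shared by the document and the query
```

Because 'climbing' is indexed with the prefix 'climb' among its terms, a match query on the about field for 'climb' finds it without any wildcard.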

Highlight hits within attachment content with ElasticSearch

I'm having trouble getting ElasticSearch to highlight hits within attachment content indexed using the elasticsearch-mapper-attachments.
My data at /stuff/file looks like this:
{
"id": "string",
"name": "string",
"attachment": "...base 64 encoded file"
}
My mapper configuration put to /stuff/file/_mapper looks like this:
{
"file" : {
"properties" : {
"attachment" : {
"type" : "attachment",
"path" : "full",
"fields": {
"name": { "store": true },
"title": { "store": true },
"content": { "store": true },
"attachment": {
"type": "string",
"term_vector": "with_positions_offsets",
"store": true
}
}
}
}
}
}
And when I query it at /stuff/_mapper/file I get this returned:
{
"stuff":{
"mappings":{
"file":{
"properties":{
"attachment":{
"type":"attachment",
"path":"full",
"fields":{
"attachment":{
"type":"string"
},
"author":{
"type":"string"
},
"title":{
"type":"string"
},
"name":{
"type":"string"
},
"date":{
"type":"date",
"format":"dateOptionalTime"
},
"keywords":{
"type":"string"
},
"content_type":{
"type":"string"
},
"content_length":{
"type":"integer"
},
"language":{
"type":"string"
}
}
},
"id":{
"type":"string"
},
"name":{
"type":"string"
}
}
}
}
}
}
And my query looks like this:
{
"size": 30,
"query": {
"multi_match": {
"query": "{{.Query}}",
"operator": "and",
"fields": ["id", "name^4", "attachment"],
"fuzziness": "AUTO",
"minimum_should_match": "80%"
}
},
"highlight" : {
"fields" : {
"attachment": { }
}
}
}
When I search for a term that is in the attachment, it returns the correct result, but there is no highlighting. There was a similar question from a few years ago that swapped attachment for file in a few places, but there were comments saying this has changed again. What is the right configuration to get highlighting to work?
It turns out that you can't overwrite an existing mapper configuration using PUT. You need to delete the existing configuration first (I actually had to delete the entire index; a DELETE on the configuration alone didn't seem to have any effect). Once the mapper configuration was actually updated, highlighting worked fine.
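One way to spot this situation early is to compare the mapping you sent with the mapping the server returns. The Python sketch below is a hypothetical helper that lists the settings present in the requested mapping but absent or different in the returned one. The two dicts are trimmed from the mappings in the question, where the returned attachment sub-field lacks both term_vector and store.

```python
def missing_settings(requested, actual, path=""):
    # Recursively collect dotted paths that were requested but are
    # absent or different in the mapping the server reports back.
    missing = []
    for key, value in requested.items():
        here = path + "." + key if path else key
        if key not in actual:
            missing.append(here)
        elif isinstance(value, dict) and isinstance(actual[key], dict):
            missing.extend(missing_settings(value, actual[key], here))
        elif actual[key] != value:
            missing.append(here)
    return missing


# Trimmed from the question: what was sent versus what came back.
requested = {
    "attachment": {
        "type": "string",
        "term_vector": "with_positions_offsets",
        "store": True,
    }
}
actual = {"attachment": {"type": "string"}}
print(missing_settings(requested, actual))
```

A non-empty result suggests the update was not applied and the index needs to be recreated with the new mapping.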
