How can we do a case-insensitive cardinality aggregation? - elasticsearch

We can use cardinality to get a distinct count on a field. However, cardinality is case sensitive: emails like user#x.com, User#x.com and USER#x.com count as 3 emails, but I need them to count as a single email.
This is the aggregation I am using:
"aggs" : {
  "emails" : {
    "cardinality" : {
      "field" : "emails.keyword"
    }
  }
}
I would need something like:
"aggs" : {
  "emails" : {
    "cardinality" : {
      "field" : "emails.keyword",
      "casesensitive": false ????
    }
  }
}
How can we make a cardinality aggregation case insensitive?

Although I would go with Val's suggestion, here is a query that may be useful if you do not control the mapping. It uses a custom script in the cardinality aggregation to lowercase each value before it is counted.
Aggregation Query:
POST <your_index_name>/_search
{
  "size": 0,
  "aggs": {
    "email_count": {
      "cardinality": {
        "script": {
          "source": "doc['emails.keyword'].toString().toLowerCase()"
        }
      }
    }
  }
}
Note that you can find more details on scripting in the aforementioned link.
Hope this helps!
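
To illustrate the idea behind the scripted aggregation, here is a minimal Python sketch. The helper names `cardinality_body` and `distinct_case_insensitive` are hypothetical, and the local set-based count only mimics what the lowercasing script is meant to achieve; it does not call Elasticsearch.

```python
import json


def cardinality_body(field: str) -> dict:
    """Build a search body whose cardinality aggregation lowercases values in a script."""
    script = f"doc['{field}'].toString().toLowerCase()"
    return {
        "size": 0,  # return only the aggregation, no hits
        "aggs": {
            "email_count": {
                "cardinality": {"script": {"source": script}}
            }
        },
    }


def distinct_case_insensitive(values) -> int:
    """Local stand-in for the aggregation: count values that differ after lowercasing."""
    return len({v.lower() for v in values})


# Sample values taken from the question.
emails = ["user#x.com", "User#x.com", "USER#x.com"]
body = cardinality_body("emails.keyword")

print(json.dumps(body, indent=2))
print(len(set(emails)))                   # case-sensitive distinct count
print(distinct_case_insensitive(emails))  # case-insensitive distinct count
```

Keep in mind that the real cardinality aggregation returns an approximate count, so on large data sets it may differ slightly from an exact set-based count like this one.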

Related

Elasticsearch group by count for a particular field

I have an Elasticsearch index with the following documents:
{
  "id": 1,
  "mainid": "497940311988134801282012-04-10"
}
{
  "id": 2,
  "mainid": "497940311988134801282012-04-10"
}
I am looking for a query similar to this one, shown against an example MySQL table:
id  mainid
1   497940311988134801282012-04-10
2   497940311988134801282012-04-10
3   497940311988134801282012-04-10
4   something different
select id, mainid, count(mainid) as county from wfcharges group by mainid, id having county > 1;
As far as I can tell, there is no count aggregate function available in Elasticsearch, so I am stuck here. This is what I have tried. Any suggestions or online resources are welcome. Thanks.
GET /wfcharges/_search
{
  "aggs" : {
    "countfield" : {
      "count" : { "field" : "mainid" }
    }
  }
}
I think you'd want to use the terms aggregation. It groups documents by term and returns a count for each term. Look at the linked URL for an example.
In your case, it would look like this:
GET /wfcharges/_search
{
  "aggs" : {
    "countfield" : {
      "terms" : { "field" : "mainid" }
    }
  }
}
This query is going to be exactly what you need:
GET /wfcharges/_search
{
  "aggs": {
    "countfield": {
      "terms": {
        "field": "mainid",
        "min_doc_count": 2
      }
    }
  }
}
It aggregates by the mainid field and requires each returned bucket to contain at least 2 documents (that is, more than 1).
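
As a rough illustration of what the terms aggregation with min_doc_count computes, here is a small Python sketch that reproduces the grouping locally on the rows from the example table. The function name `terms_buckets` is hypothetical and nothing here talks to Elasticsearch.

```python
from collections import Counter


def terms_buckets(docs, field, min_doc_count=1):
    """Group docs by `field` and keep only groups with at least `min_doc_count` docs."""
    counts = Counter(d[field] for d in docs if field in d)
    # Order by count descending, then by key, so the output is deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [{"key": k, "doc_count": c} for k, c in ordered if c >= min_doc_count]


# Rows from the example table in the question.
docs = [
    {"id": 1, "mainid": "497940311988134801282012-04-10"},
    {"id": 2, "mainid": "497940311988134801282012-04-10"},
    {"id": 3, "mainid": "497940311988134801282012-04-10"},
    {"id": 4, "mainid": "something different"},
]

for bucket in terms_buckets(docs, "mainid", min_doc_count=2):
    print(bucket["key"], bucket["doc_count"])
```

With min_doc_count set to 2, the "something different" row falls out because its group holds a single document, which corresponds to the having county > 1 clause of the SQL query.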

How to get the most used words in Elasticsearch?

I am using a terms aggregation on Elasticsearch to get the most used words in an index with 380607390 (380 million) documents, and I receive a timeout in my application.
The aggregated field is a text field with the simple analyzer (the field holds post content).
My question is:
Is the terms aggregation the correct aggregation for this, given such a large content field?
{
  "aggs" : {
    "keywords" : {
      "terms" : { "field" : "post_content" }
    }
  }
}
You can try this using min_doc_count. You would of course not want the words that have been used only once, twice or thrice.
You can set min_doc_count as per your requirement. This should reduce the time.
{
  "aggs" : {
    "keywords" : {
      "terms" : {
        "field" : "post_content",
        "min_doc_count": 5 //----->Set it as per your need
      }
    }
  }
}
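
To make the effect of min_doc_count on a word count concrete, here is a small Python sketch. It assumes a crude ASCII-only stand-in for the simple analyzer (lowercase, split on non-letters); `simple_analyze`, `keyword_buckets` and the sample posts are hypothetical. It counts each word once per post, since a terms bucket reports the number of documents containing the term rather than the total number of occurrences.

```python
import re
from collections import Counter


def simple_analyze(text):
    """Approximate the simple analyzer: lowercase and split on non-letters (ASCII only)."""
    return re.findall(r"[a-z]+", text.lower())


def keyword_buckets(posts, min_doc_count=1, size=10):
    """Return (term, doc_count) pairs, counting each term at most once per post."""
    doc_counts = Counter()
    for post in posts:
        doc_counts.update(set(simple_analyze(post)))
    # Order by doc_count descending, then by term, and drop rare terms.
    ordered = sorted(doc_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(t, c) for t, c in ordered if c >= min_doc_count][:size]


posts = [
    "The cat sat. The cat slept.",
    "A cat and a dog",
    "The dog barked",
]

for term, count in keyword_buckets(posts, min_doc_count=2):
    print(term, count)
```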

Elasticsearch integer range query is not working

I have a field hcc_member_id of type integer. I want to perform a range query on this field. I tried the queries given in the ES documentation, but they do not seem to work. No matter what the query is, it always returns the same response.
I think I am doing things in a wrong way but not able to identify the problem. Any help is good.
You should use POST instead of GET. Otherwise your JSON will be ignored.
Furthermore, you should add a "query" field to your JSON
(without it you will get something like No parser for element [range]):
{
  "query": {
    "range": {
      "hcc_member_id": {
        "gte": 1000
      }
    }
  }
}
This query works for me.
EDIT: it only works with POST, not GET.
{
  "query" : {
    "range" : {
      "hcc_member_id" : {
        "gte" : 1000
      }
    }
  }
}
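
As a sketch of how the two points above (send the body with POST, wrap the range clause in "query") fit together, the following Python snippet builds such a request with the standard library. The base URL http://localhost:9200, the index name my_index and the helper name `build_range_search` are assumptions for illustration; the request is only constructed here, not sent.

```python
import json
from urllib.request import Request


def build_range_search(base_url, index, field, gte):
    """Build a POST _search request whose body wraps the range clause in "query"."""
    body = {"query": {"range": {field: {"gte": gte}}}}
    return Request(
        url=f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",  # make sure the JSON body travels with the request
    )


# Hypothetical host and index name; the field name comes from the question.
req = build_range_search("http://localhost:9200", "my_index", "hcc_member_id", 1000)

print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# To actually send it: urllib.request.urlopen(req)
```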

Ordering terms aggregation buckets by sub-aggregation result values

I have two questions about the query seen on this capture:
1). How do I order by value in the sum_category field in the results? I use respsize again in the query but it's not correct, as you can see below.
2). Even if I make only an aggregation, why do all the documents come with the result? I mean, if I make a group by query in SQL it retrieves only grouped data, but Elasticsearch retrieves all documents as if I made a normal search query. How do I skip them?
Try this:
{
  "query" : {
    "match_all" : {}
  },
  "size" : 0,
  "aggs" : {
    "categories" : {
      "terms" : {
        "field" : "category",
        "size" : 999999,
        "order" : {
          "sum_category" : "desc"
        }
      },
      "aggs" : {
        "sum_category" : {
          "sum" : {
            "field" : "respsize"
          }
        }
      }
    }
  }
}
1). See the note in (2) for what your sort is doing. As for ordering the categories by the value of sum_category, see the order portion. There appears to be an old and closed issue related to that, https://github.com/elastic/elasticsearch/issues/4643, but it worked fine for me with v1.5.2 of Elasticsearch.
2). Although you did not specify a match_all query, I think that is effectively what you are getting results for, so the sort you specified is actually being applied to those hits. To avoid getting them back, I added the size: 0 portion.
Do you want buckets for all the categories? I noticed you did not specify size for the main aggregation; that is what the size: 999999 portion is for.
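
For readers who think in SQL terms, the query above behaves roughly like a GROUP BY on category with a SUM over respsize, ordered by that sum. Here is a local Python sketch of that behavior; the function name `categories_by_sum` and the sample documents are hypothetical.

```python
from collections import defaultdict


def categories_by_sum(docs):
    """Group docs by category, sum respsize per group, order by the sum descending."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for d in docs:
        totals[d["category"]] += d["respsize"]
        counts[d["category"]] += 1
    buckets = [
        {"key": k, "doc_count": counts[k], "sum_category": totals[k]}
        for k in totals
    ]
    # Mirrors "order": {"sum_category": "desc"} in the terms aggregation.
    return sorted(buckets, key=lambda b: b["sum_category"], reverse=True)


# Hypothetical sample documents.
docs = [
    {"category": "images", "respsize": 100},
    {"category": "html", "respsize": 40},
    {"category": "images", "respsize": 50},
    {"category": "css", "respsize": 60},
]

for bucket in categories_by_sum(docs):
    print(bucket["key"], bucket["doc_count"], bucket["sum_category"])
```

Only the buckets are returned here, which is the local equivalent of setting size to 0 so that no individual hits come back.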

How to use Lucene SpanQuery in Elasticsearch

For my project, I thought of using the Span Near Query of Elasticsearch, with the constraint that certain tokens may have to be searched with fuzziness. I was able to generate a set of SpanQuery (org.apache.lucene.search.spans.SpanQuery) objects, some with fuzziness enabled and some without. I couldn't figure out how to use this set of SpanQuery objects in the Elasticsearch spanNearQuery.
Can someone point me to the right samples or docs? And is there any way to construct an ES SpanNearQueryBuilder with some clauses fuzzy enabled?
You can wrap a fuzzy query in a span query with the Span Multi Term Query:
{
  "span_near" : {
    "clauses" : [
      { "span_term" : { "field" : "value1" } },
      { "span_multi" : {
          "match" : {
            "fuzzy" : { "field" : { "value" : "value2" } }
          }
        }
      }
    ],
    ...
  }
}
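
When the list of tokens is generated in code, the span_near body can be assembled programmatically. The following Python sketch is one possible way to do that, assuming each token comes with a flag saying whether it should be matched fuzzily. The helper name `span_near_body`, the slop value and the "AUTO" fuzziness setting are illustrative choices, not something taken from the answer above.

```python
import json


def span_near_body(field, tokens, slop=0, in_order=True):
    """Build a span_near query; fuzzy tokens are wrapped in span_multi."""
    clauses = []
    for text, fuzzy in tokens:
        if fuzzy:
            # A fuzzy query is a multi-term query, so it has to go through span_multi.
            clauses.append({
                "span_multi": {
                    "match": {"fuzzy": {field: {"value": text, "fuzziness": "AUTO"}}}
                }
            })
        else:
            clauses.append({"span_term": {field: text}})
    return {"span_near": {"clauses": clauses, "slop": slop, "in_order": in_order}}


# All clauses of a span_near query must target the same field.
body = span_near_body("field", [("value1", False), ("value2", True)], slop=1)
print(json.dumps({"query": body}, indent=2))
```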
