Elasticsearch group by count for a particular field - elasticsearch

I have an Elasticsearch index with the following documents:
{
  "id": 1,
  "mainid": "497940311988134801282012-04-10"
}
{
  "id": 2,
  "mainid": "497940311988134801282012-04-10"
}
I am looking for a query similar to the following, given this example MySQL table:
id mainid
1 497940311988134801282012-04-10
2 497940311988134801282012-04-10
3 497940311988134801282012-04-10
4 something different
select id, mainid, count(mainid) as county from wfcharges group by mainid, id having county > 1;
Since there is no count aggregate function available in Elasticsearch, I am stuck here. This is what I have tried so far. Any suggestions or online resources? Thanks.
GET /wfcharges/_search
{
  "aggs": {
    "countfield": {
      "count": { "field": "mainid" }
    }
  }
}

I think you'd want to use the terms aggregation. This will group by similar terms and return a count for each term. Look at the linked URL for an example.
In your case, it would look like this:
GET /wfcharges/_search
{
  "aggs": {
    "countfield": {
      "terms": { "field": "mainid" }
    }
  }
}

This query is going to be exactly what you need:
GET /wfcharges/_search
{
  "aggs": {
    "countfield": {
      "terms": {
        "field": "mainid",
        "min_doc_count": 2
      }
    }
  }
}
It will aggregate on the mainid field and require a minimum document count of 2 per bucket (i.e. more than 1).
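With the example table above (three rows sharing the same mainid, and assuming mainid is indexed as a single, not-analyzed/keyword term), the response would contain a terms result roughly along these lines; the "something different" value is dropped because its document count of 1 is below min_doc_count:
"aggregations": {
  "countfield": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": "497940311988134801282012-04-10",
        "doc_count": 3
      }
    ]
  }
}
Each bucket key plays the role of mainid in the SQL GROUP BY, and doc_count is the equivalent of count(mainid).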

Related

Run a subquery for each of the filtered elasticsearch documents

I have an index named employees with the following structure:
{
id: integer,
name: text,
age: integer,
cityId: integer,
resumeText: text <--------- parsed resume text
}
I want to search for employees matching certain criteria, e.g. age > 40, resumeText containing a specific skill, or the employee belonging to a certain city. For the requirement so far I have the following query:
{
  "query": {
    "bool": {
      "should": [
        { "term": { "cityId": 2990 } },
        { "match": { "resumeText": "marketing" } },
        { "match": { "resumeText": "critical thinking" } }
      ],
      "filter": {
        "range": {
          "age": { "gte": 40 }
        }
      }
    }
  }
}
This gives me the expected results, but I also want to know, among the returned documents/employees, which ones have a resumeText containing the mentioned skills. E.g. in the response I want to see that one document matched "critical thinking", another employee matched both skills, and a third employee didn't match any skill (it was returned based on the other filters only).
What changes do I need to make to get the desired result?
Can aggregation help?
Can we run a script for EACH filtered document to compute the desired result (a sub-query per document)?
Any other approach?
Yes, you can use aggregation.
Refer to this.
You can build buckets counting how many resumes match each skill you are looking for.
GET employees/_search
{
  "size": 0,
  "aggs": {
    "messages": {
      "filters": {
        "filters": {
          "marketing_resume_count": { "match": { "resumeText": "marketing" } },
          "thinking_resume_count": { "match": { "resumeText": "thinking" } }
        }
      }
    }
  }
}
To extend this to your use case, you can add a query section to the request as below:
GET employees/_search
{
  "size": 0,
  "query": {
    "match": {
      "region": "AM"
    }
  },
  "aggs": {
    "messages": {
      "filters": {
        "filters": {
          "marketing_resume_count": { "match": { "resumeText": "marketing" } },
          "thinking_resume_count": { "match": { "resumeText": "thinking" } }
        }
      }
    }
  }
}
You can use a range query to handle the gte and lte conditions. You can refer to this for a range query example; it can be used in place of the query section above.
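For instance, here is a sketch combining the age filter from the question with the same filters aggregation (field names are taken from the question; adjust them to your mapping):
GET employees/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": {
        "range": {
          "age": { "gte": 40 }
        }
      }
    }
  },
  "aggs": {
    "messages": {
      "filters": {
        "filters": {
          "marketing_resume_count": { "match": { "resumeText": "marketing" } },
          "thinking_resume_count": { "match": { "resumeText": "thinking" } }
        }
      }
    }
  }
}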

How can we do a case-insensitive cardinality aggregation?

We can use cardinality to get a distinct count on a field; however, the cardinality is case sensitive, meaning that if we have emails like user@x.com, User@x.com and USER@x.com, these will count as 3 emails, whereas I need them to count as a single email.
This is the aggregation I am using:
"aggs" : {
"emails" : {
"cardinality" : {
"field" : "emails.keyword"
}
}
}
I would need something like:
"aggs" : {
"emails" : {
"cardinality" : {
"field" : "emails.keyword",
"casesensitive": false ????
}
}
}
How can we make a cardinality aggregation case-insensitive?
Although I would go with Val's suggestion, here is a query that may be useful if you do not have control over the mapping; it makes use of a custom script in the cardinality aggregation.
Aggregation Query:
POST <your_index_name>/_search
{
  "size": 0,
  "aggs": {
    "email_count": {
      "cardinality": {
        "script": {
          "source": "doc['emails.keyword'].toString().toLowerCase()"
        }
      }
    }
  }
}
Note that you will find more details on scripting in the aforementioned link.
Hope this helps!
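For completeness, if you do control the mapping, a common alternative (a sketch under that assumption, not necessarily the suggestion referred to above) is to index the emails into a keyword field with a lowercase normalizer, so the plain cardinality aggregation already sees lowercased values:
PUT <your_index_name>
{
  "settings": {
    "analysis": {
      "normalizer": {
        "lowercase_normalizer": {
          "type": "custom",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "emails": {
        "type": "keyword",
        "normalizer": "lowercase_normalizer"
      }
    }
  }
}
The original cardinality aggregation can then target "emails" directly, without a script. (The mappings block above assumes a typeless mapping, i.e. Elasticsearch 7+; adjust for older versions.)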

Spring Data MongoDB aggregation match

After asking a question to understand a bit more about the aggregation framework in MongoDB, I finally found a way to do the aggregation I need (thanks to a StackExchange user).
So basically here is a document from my collection:
{
  "_id" : ObjectId("s4dcsd5s4d6c54s6d"),
  "items" : [
    { type : "TYPE_1", text : "blablabla" },
    { type : "TYPE_2", text : "blablabla" },
    { type : "TYPE_3", text : "blablabla" },
    { type : "TYPE_1", text : "blablabla" },
    { type : "TYPE_2", text : "blablabla" },
    { type : "TYPE_1", text : "blablabla" }
  ]
}
The idea is to keep only some elements of my collection (excluding TYPE_2 and TYPE_3). In fact I have more than 30 types and 6 of them are not allowed, but I kept this example simple.
So the aggregation command in command line is this one:
db.history.aggregate([{
$match: {
_id: ObjectId("s4dcsd5s4d6c54s6d")
}
}, {
$unwind: '$items'
}, {
$match: {
'items.type': { '$nin': [ "TYPE_2" , "TYPE_3"] }
}
},
{ $limit: 10 }
]);
With this I am able to retrieve the 10 items of this document that do not match TYPE_2 or TYPE_3.
However, when I use Spring Data there is no output. I looked a bit at the examples to build mine, but it is still not working.
So I did:
Aggregation aggregation = newAggregation(
match(Criteria.where("id").is(myID)),
unwind("items"),
match(Criteria.where("items.type").nin(ignoreditemstype)),
limit(3),
skip(offsetLong)
);
AggregationResults<PersonnalHistory> results = mongAccess.getOperation().aggregate(query,
"items", PersonnalHistory.class);
PersonnalHistory is annotated with @Document(collection = "history") and its id with the @Id annotation.
ignoreditemstype is a list containing TYPE_2 and TYPE_3.
Here is what I get from the toString() of the aggregation:
{
"aggregate" : "__collection__" ,
"pipeline" : [
{ "$match": { "id" : "s4dcsd5s4d6c54s6d"} },
{ "$unwind": "$items"},
{ "$match": { "items.type": { "$nin" : [ "TYPE_2" , "TYPE_3" ] } } },
{ "$limit" : 3},
{ "$skip" : 0 }
]
}
I tried a lot of things (to get at least some output :) ), like removing the id match or the nin:
aggregation = newAggregation(
unwind("items"),
match(Criteria.where("items.type").nin(ignoreditemstype)),
limit(3),
skip(offsetLong)
);
aggregation = newAggregation(
match(Criteria.where("id").is(myid)),
unwind("items")
);
For information, when I do a simple query like:
query.addCriteria(Criteria.where("id").is(myID));
My document is returned. However, it has thousands of items, so I just want the first 15 (in fact, the first 15 are the 15 most recently added).
Do you maybe see what I am doing wrong?
Yeah, it looks like you are passing a plain String while it is expecting an ObjectId:
Aggregation aggregation = newAggregation(
match(Criteria.where("_id").is(new ObjectId(myID))),
unwind("items"),
match(Criteria.where("items.type").nin(ignoreditemstype)),
limit(3),
skip(offsetLong)
);
Now, the question is why it works with the simple query; my answer would be that the spring-data driver is not that mature, at least not with the aggregation pipeline.
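As a side note, here is a minimal sketch of how such a pipeline is usually executed with MongoTemplate (the collection name "history" and the result class come from the question; the mongoTemplate bean and the surrounding wiring are assumptions):
import java.util.List;
import org.bson.types.ObjectId;
import org.springframework.data.mongodb.core.MongoTemplate;
import org.springframework.data.mongodb.core.aggregation.Aggregation;
import org.springframework.data.mongodb.core.aggregation.AggregationResults;
import org.springframework.data.mongodb.core.query.Criteria;
import static org.springframework.data.mongodb.core.aggregation.Aggregation.*;

// Build the pipeline: match the document, unwind items, drop ignored types, page the result
Aggregation aggregation = newAggregation(
    match(Criteria.where("_id").is(new ObjectId(myID))),
    unwind("items"),
    match(Criteria.where("items.type").nin(ignoreditemstype)),
    limit(3),
    skip(offsetLong)
);

// Run it against the "history" collection and map each resulting document
AggregationResults<PersonnalHistory> results =
    mongoTemplate.aggregate(aggregation, "history", PersonnalHistory.class);
List<PersonnalHistory> mapped = results.getMappedResults();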

How to get the most used words in Elasticsearch?

I am using a terms aggregation in Elasticsearch to get the most used words in an index with 380,607,390 (380 million) documents, and I get a timeout in my application.
The aggregated field is a text field with a simple analyzer (the field holds post content).
My question is:
Is the terms aggregation the correct aggregation for this, with such a large content field?
{
  "aggs": {
    "keywords": {
      "terms": { "field": "post_content" }
    }
  }
}
You can try this using min_doc_count. You would of course not want the words that have been used just once or twice or thrice... You can set min_doc_count as per your requirement. This should definitely reduce the time.
{
  "aggs": {
    "keywords": {
      "terms": {
        "field": "post_content",
        "min_doc_count": 5   // set as per your need
      }
    }
  }
}
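One additional caveat: depending on your Elasticsearch version, running a terms aggregation on an analyzed text field also requires fielddata to be enabled on that field (or a keyword sub-field to aggregate on instead), and fielddata on a field holding full post content is very memory-hungry. A sketch of that mapping flag, if it applies to your version (index name is a placeholder):
PUT <your_index>/_mapping
{
  "properties": {
    "post_content": {
      "type": "text",
      "fielddata": true
    }
  }
}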

Query elasticsearch to find docs that don't have a key

I have logs like:
{
  "a": "XXX",
  "b": "YYY",
  "token": "acquired"
}
Also, I have logs that do not have this token key set. Kibana's terms panel shows that they exist by listing them under Missing fields (3047). How can I query all docs that do not have the token key set?
You can query in ES for missing fields:
From: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/_dealing_with_null_values.html
GET /my_index/posts/_search
{
  "query": {
    "filtered": {
      "filter": {
        "missing": { "field": "token" }
      }
    }
  }
}
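Note that the filtered query and the missing filter shown above come from older Elasticsearch versions; in Elasticsearch 5.x and later the equivalent is a bool query with must_not wrapping an exists query:
GET /my_index/_search
{
  "query": {
    "bool": {
      "must_not": {
        "exists": { "field": "token" }
      }
    }
  }
}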
