Performance of terms aggregation

I have an index with 10,000,000 documents in it. Elasticsearch is configured to use 5 shards, no replicas.
Each document has two fields:
Field tags, containing three tags. There are currently 82 user-defined tags in my index.
Field singleTag, containing only one tag out of the same set of 82 possible values. (this field is used for comparison only)
I want to query the first ten tags (according to alphabetical order). This takes around 300ms for field tags (and only takes 150ms for field singleTag). How can I increase the performance of this query?
A sample document:
{
"tags": ["player", "ballsports", "goals"],
"singleTag": "football"
}
My index definition:
{
"settings": {
"number_of_shards": "5",
"number_of_replicas": "0"
},
"mappings" : {
"issue": {
"properties": {
"tags": {
"index": "not_analyzed",
"type": "string"
},
"singleTag": {
"index": "not_analyzed",
"type": "string"
}
}
}
}
}
The current (too slow) query:
{
"size" : 0,
"aggregations" : {
"tagsAggregation" : {
"terms" : {
"field" : "tags",
"size" : 10,
"order" : {
"_term" : "asc"
}
}
}
}
}
I'm currently using Elasticsearch 2.4.4 but solutions including Elasticsearch 5 are also welcome. The performance in Elasticsearch 5 is similar for the first request, but caching is much better.

Related

Elasticsearch - Is it possible to create histograms without having the field indexed

I came across the following passage in the documentation:
https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-disk-usage.html
For instance if you have a numeric field called foo that you need to run histograms on but that you never need to filter on, you can safely disable indexing on this field in your mappings:
PUT index
{
"mappings": {
"properties": {
"foo": {
"type": "integer",
"index": false
}
}
}
}
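Does this mean that aggregations such as histograms can be computed even though the field is NOT indexed?
Yes, that's correct. Aggregations read the field's doc values rather than the inverted index, and doc values stay enabled when "index" is false. It's easy to test: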
Create the index:
PUT index
{
"mappings": {
"properties": {
"foo": {
"type": "integer",
"index": false
}
}
}
}
Index a sample document:
PUT index/_doc/1
{
"foo": 23
}
Run a histogram aggregation:
POST index/_search
{
"aggs": {
"histo": {
"histogram": {
"field": "foo",
"interval": 10
}
}
}
}
Results:
"aggregations" : {
"histo" : {
"buckets" : [
{
"key" : 20.0,
"doc_count" : 1
}
]
}
}
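The bucket key 20.0 in the result above follows from how the histogram aggregation rounds each value down to a multiple of the interval. The sketch below reproduces that arithmetic outside Elasticsearch. It is an illustration only: the function name histogram_buckets is made up for this example, and it works on a plain list of numbers rather than on an index.

```python
import math
from collections import Counter


def histogram_buckets(values, interval, offset=0.0):
    """Count values per bucket, keyed like the histogram aggregation:
    key = floor((value - offset) / interval) * interval + offset
    """
    counts = Counter(
        math.floor((v - offset) / interval) * interval + offset
        for v in values
    )
    # Return buckets sorted by key, mirroring the shape of the response above.
    return [{"key": float(k), "doc_count": c} for k, c in sorted(counts.items())]


# The single sample document has foo = 23 and the interval is 10.
print(histogram_buckets([23], interval=10))
```

With foo = 23 and an interval of 10, the value falls into the bucket keyed 20.0, which matches the response shown above.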

How to get nested aggregations buckets using java high level REST client Elasticsearch

I have some nested fields, of which I want to calculate all distinct values, for example:
"authors": {
  "type": "nested",
  "properties": {
    "first_name": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword",
          "ignore_above": 256
        }
      }
    },
    "last_name": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword",
          "ignore_above": 256
        }
      }
    }
  }
}
Suppose I need all unique first names, so I add an aggregation like this:
GET /statementmetadataindex/data/_search?size=0
{
"aggs": {
"distinct_authors": {
"nested": {
"path": "authors"
},
"aggs": {
"distinct_first_names": {
"terms": {
"field": "authors.first_name.keyword"
}
}
}
}
}
}
which returns an aggregation like this:
"aggregations" : {
"distinct_authors" : {
"doc_count" : 20292,
"distinct_first_names" : {
"doc_count_error_upper_bound" : 4761,
"sum_other_doc_count" : 124467,
"buckets" : [
{
"key" : "Charles",
"doc_count" : 48411
},
{
"key" : "Rudyard",
"doc_count" : 30954
}
]
}
}
}
Now, I am using Nested aggregation builder in the java code like this :
NestedAggregationBuilder uniqueAuthors=AggregationBuilders.nested("distinct_authors", "authors");
TermsAggregationBuilder distinct_first_name= AggregationBuilders.terms("distinct_first_names")
.field("authors.first_name.keyword").size(size);
uniqueAuthors.subAggregation(distinct_first_name);
and I usually get the aggregation like this from the response:
Terms distinct_authornames=aggregations.get("distinct_authors");
However, the buckets that I need are in the sub-aggregation "distinct_first_names" inside "distinct_authors". How do I parse the aggregation result to get the buckets with the unique first names?
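Try this (not tested):
// The top-level aggregation is a nested aggregation, not a terms aggregation
Nested distinct_authornames = aggregations.get("distinct_authors");
// The terms buckets live in the sub-aggregation of the nested aggregation
Terms distinct_first_names = distinct_authornames.getAggregations().get("distinct_first_names");
for (Terms.Bucket bucket : distinct_first_names.getBuckets()) {
    // getDocCount() returns a long; print it directly instead of casting to int
    System.out.println(bucket.getDocCount());
    System.out.println(bucket.getKeyAsString());
}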
Hope this helps
I figured out the solution quite a long time ago, but didn't realise it was working because I kept getting an exception for an unrelated reason. The following works well:
Nested distinct_authorsOuter=aggregations.get("distinct_authors");
Aggregations distinct_authors_aggs=distinct_authorsOuter.getAggregations();
Terms distinct_firstNames= distinct_authors_aggs.get("distinct_first_names");

Word Cloud in Elasticsearch 5

I was able to build a word cloud in an older Elasticsearch version using terms aggregations. I now want to build a word cloud from post content in ES 5, and I am using the query below.
"aggs": {
"tagcloud": {
"terms": {
"field": "content.raw",
"size": 10
}
}
}
My mapping looks like this:
"content": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
}
}
}
But the result is not a word cloud as expected. It groups identical posts (the whole post text) and returns a list like the one below:
"buckets": [
{
"key" : "This car is awesome.",
"doc_count" : 199
},
..
..
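How can I get a word cloud of individual terms instead?
The keyword type behaves much like a string field with "index": "not_analyzed" in older versions: the whole string is indexed as a single term, so you can only match (and aggregate on) the exact value.
In your case, I think you need to use a field that is analyzed and tokenized, such as the content field itself. However, you need to make sure that the fielddata option is set to true on that field; otherwise the server returns an exception. Note that fielddata is loaded into heap memory, so it can be expensive on large text fields.
Therefore your mapping should look like this: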
"content": {
"fielddata" : true,
"type": "text"
}
and aggregation
"aggs": {
"tagcloud": {
"terms": {
"field": "content",
"size": 10
}
}
}
As a result you should see something like the following (the exact terms depend on the analyzer you choose):
"buckets": [
{
"key" : "this",
"doc_count" : 199
},
{
"key" : "car",
"doc_count" : 199
},
{
"key" : "is",
"doc_count" : 199
},
{
"key" : "awesome",
"doc_count" : 199
},
...
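To make the difference between the two field types concrete, here is a small sketch that imitates a terms aggregation over the same posts, once with the whole string as a single term (like keyword) and once with a tokenized field. It is only an approximation: analyzed_terms is a simplified stand-in for an analyzer (lowercasing and splitting on non-word characters), and all function names are made up for this example.

```python
import re
from collections import Counter


def keyword_terms(text):
    # A keyword field indexes the whole string as one term.
    return [text]


def analyzed_terms(text):
    # Rough stand-in for an analyzer: lowercase, split on non-word characters.
    return re.findall(r"\w+", text.lower())


def terms_aggregation(docs, to_terms, size=10):
    counts = Counter()
    for doc in docs:
        # doc_count counts documents, so each term counts once per document.
        for term in dict.fromkeys(to_terms(doc)):
            counts[term] += 1
    return counts.most_common(size)


posts = ["This car is awesome."] * 199
print(terms_aggregation(posts, keyword_terms))   # one bucket for the whole post
print(terms_aggregation(posts, analyzed_terms))  # one bucket per word
```

The first call yields a single bucket keyed by the full post text, which is the behaviour described in the question; the second yields one bucket per word, which is what a word cloud needs.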

Elasticsearch apply search filter on aggregation

I have an Elasticsearch (v2.3) backend storing IP addresses in multiple indices. The mapping of the ip document type looks like this:
{
"ip" : {
"properties" : {
"ip": { "type" : "string", "index" : "not_analyzed" },
"categories": { "type" : "string", "index" : "not_analyzed" }
}
}
}
My goal is to group all ip documents by their unique ip field so that I can apply an operation to the categories (and all other fields) of every record sharing that ip.
There is an easy way to do it: aggregate all unique ip values with the aggregation below, then iterate over each result in my script and issue another search query per ip.
{
    'size': 0,
    'aggs': {
        'uniq': {
            'terms': { 'field': 'ip', 'size': 0 }
        }
    }
}
But this is not very efficient. Is there a way to do it in a single search query?
I've seen a workaround in the question "Elasticsearch filter document group by field" that uses a top_hits aggregation:
{
"size": 0,
"aggs":{
"uniq":{
"terms": {
"field": "ip",
"size": 0
},
"aggs": {
"tops": {
"top_hits": {
"size": 10
}
}
}
}
}
}
However, I can't set the top_hits size to 0, which is what I want, because I would like it to handle cases where the same ip appears in N different indices.
I've taken a look at pipeline aggregations, but they do not seem to be able to perform raw searches.
Thanks for your help!
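For what it's worth, here is a sketch of how the response of the terms + top_hits workaround could be post-processed in a script, merging the categories of all returned records per ip. It is not a replacement for an unlimited top_hits: it only compares each bucket's doc_count with the number of hits actually returned, so truncated groups can be detected and handled separately. The function name and the sample response are invented for illustration; only the general response layout (aggregations, buckets, hits.hits, _source) is taken from Elasticsearch.

```python
def merge_categories_by_ip(response):
    """Merge categories per ip from a terms + top_hits response.

    Returns (merged, truncated): merged maps ip -> sorted categories,
    truncated lists ips whose bucket holds more documents than top_hits returned.
    """
    merged, truncated = {}, []
    for bucket in response["aggregations"]["uniq"]["buckets"]:
        hits = bucket["tops"]["hits"]["hits"]
        categories = set()
        for hit in hits:
            value = hit["_source"].get("categories", [])
            # The field may hold a single string or a list of strings.
            categories.update([value] if isinstance(value, str) else value)
        merged[bucket["key"]] = sorted(categories)
        if bucket["doc_count"] > len(hits):
            truncated.append(bucket["key"])
    return merged, truncated


# Hand-written sample response (illustrative data only).
sample = {"aggregations": {"uniq": {"buckets": [
    {"key": "10.0.0.1", "doc_count": 2, "tops": {"hits": {"hits": [
        {"_index": "idx-a", "_source": {"ip": "10.0.0.1", "categories": ["spam"]}},
        {"_index": "idx-b", "_source": {"ip": "10.0.0.1", "categories": "scanner"}},
    ]}}},
    {"key": "10.0.0.2", "doc_count": 3, "tops": {"hits": {"hits": [
        {"_index": "idx-a", "_source": {"ip": "10.0.0.2", "categories": ["proxy"]}},
    ]}}},
]}}}

print(merge_categories_by_ip(sample))
```

Any ip reported as truncated would still need a follow-up query, so this reduces the number of extra searches rather than eliminating them.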

Saving variable types under a single key in elasticsearch?

I have a bunch of documents coming in from fluentd, and I'm saving them to Elasticsearch with fluent-plugin-elasticsearch.
Some of those documents have a string under the key "name" and some have an object.
Example
{
"name": "foo"
}
and
{
"name": {
"en": "foo",
"fi": "bar"
}
}
These documents are of the same type as far as my application is concerned, and they are saved to the same Elasticsearch index.
But Elasticsearch has an issue with this. When the second document is saved, it throws this error:
org.elasticsearch.index.mapper.MapperParsingException: failed to parse [name]
This seems to happen because Elasticsearch has mapped the key "name" as type string when it saw the first document. I can see this using curl http://localhost:9200/fluentd-[tagname]/_mapping, and it obviously doesn't like it when I try to save an object under that key afterwards.
So is there any way to work around this in Elasticsearch?
I cannot control the incoming documents, and there are multiple keys with variable types, not just "name". So I cannot write a single hack for that key only.
This is pretty annoying, since those documents are left out of Elasticsearch completely and sent to /dev/null.
If this is completely impossible, is it possible to at least save those documents to a file or something so I don't lose them?
Here's my template for the fluentd-* indices:
{
"fluentd_template": {
"template": "fluentd-*",
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0,
"index": {
"query": {
"default_field": "msg"
},
"analysis" : {
"analyzer" : {
"default" : {
"type" : "keyword"
}
}
}
}
},
"mappings": {
"_default_": {
"_all": {
"enabled": false
},
"_source": {
"compress": true
},
"properties": {
"#timestamp": {
"type": "date",
"index": "not_analyzed"
}
}
}
}
}
}
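One possible direction, offered only as a sketch and not taken from the thread: normalize the documents before they reach Elasticsearch so that a key never holds a string in one document and an object in another. The code below first detects which top-level keys have mixed types in a batch, then wraps scalar values under a sub-key. The function names and the "value" sub-key are hypothetical, and where such a step would run in a fluentd pipeline is left open.

```python
def find_mixed_type_keys(docs):
    """Return top-level keys that hold an object in some docs and a scalar in others."""
    kinds_by_key = {}
    for doc in docs:
        for key, value in doc.items():
            kind = "object" if isinstance(value, dict) else "scalar"
            kinds_by_key.setdefault(key, set()).add(kind)
    return sorted(key for key, kinds in kinds_by_key.items() if len(kinds) > 1)


def wrap_scalars(doc, keys, subkey="value"):
    """Wrap scalar values of the given keys as {subkey: value}; objects pass through."""
    return {
        key: ({subkey: value} if key in keys and not isinstance(value, dict) else value)
        for key, value in doc.items()
    }


docs = [
    {"name": "foo"},
    {"name": {"en": "foo", "fi": "bar"}},
]
conflicts = find_mixed_type_keys(docs)
print(conflicts)
print([wrap_scalars(doc, conflicts) for doc in docs])
```

After this step every "name" value is an object, so both sample documents fit one mapping. The trade-off is that queries against the former string value have to target the sub-key instead.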

Resources