How can I get options for filtering by a field directly from elasticsearch? - elasticsearch

I want to populate a filtering field based on the data I have indexed inside Elasticsearch. How can I retrieve this data? For example, my documents inside index "test" and type "doc" could be
{"id":1, "tag":"foo", "name":"foothing"}
{"id":2, "tag":"bar", "name":"barthing"}
{"id":3, "tag":"foo", "name":"something"}
{"id":4, "tag":"quux", "name":"quuxthing"}
I'm looking for something like GET /test/doc/_magic?q=tag that would return [foo,bar,quux] from my data. I don't know what this is called, or whether it is even possible. I don't want to load every index entry into memory and do this programmatically: I have millions of documents in the index, with around a hundred different tags.
Is this possible with ES?

Yes, that's possible. This is called a terms aggregation.
You can do it like this:
GET /test/doc/_search
{
  "size": 0,
  "aggs" : {
    "tags" : {
      "terms" : {
        "field" : "tag.keyword",
        "size": 100
      }
    }
  }
}
Note that depending on the cardinality of your tag field, you can increase/decrease the size setting (10 by default).
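To turn the response into the list of filter options, you only need the bucket keys. Here is a minimal Python sketch, assuming the response has already been parsed into a dict; the function name tag_options and the sample response are made up for illustration, while "tags" matches the aggregation name used in the query above.

```python
def tag_options(response, agg_name="tags"):
    """Return the bucket keys of a terms aggregation, in the order returned."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return [bucket["key"] for bucket in buckets]


# Hand-written sample in the shape of a terms aggregation response
# (not real output from a cluster).
sample_response = {
    "aggregations": {
        "tags": {
            "buckets": [
                {"key": "foo", "doc_count": 2},
                {"key": "bar", "doc_count": 1},
                {"key": "quux", "doc_count": 1},
            ]
        }
    }
}

options = tag_options(sample_response)
print(options)
```

Each bucket also carries a doc_count, which can be handy if you want to show the number of matching documents next to each option.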

Related

Creating histogram in Elasticsearch

I have an index with several documents. Every document has an "id" field, and there can be several documents for each id, just as a store can have many transactions for each customer. I want to know how many documents there are per id.
For instance, I want to get something like: "There are 5 ids with 1 document, 10 ids with 2 documents", and so on.
How can I write that aggregation in Elasticsearch?
I believe this would be a classic terms aggregation. Something along these lines should work for you:
GET /_search
{
  "aggs" : {
    "ids" : {
      "terms" : { "field" : "id" }
    }
  }
}
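The terms aggregation gives the number of documents per id, while the question asks how many ids share each document count. One option is to tally the doc_count values on the client. This is a rough Python sketch, assuming the buckets list has been read from the parsed response and that the aggregation's size was large enough to return every id (it returns only the top 10 buckets by default); the function name and sample data are invented for illustration.

```python
from collections import Counter


def ids_per_document_count(buckets):
    """Map each doc_count to the number of ids (buckets) that have it."""
    return dict(Counter(bucket["doc_count"] for bucket in buckets))


# Hand-written sample of response["aggregations"]["ids"]["buckets"].
sample_buckets = [
    {"key": "a", "doc_count": 1},
    {"key": "b", "doc_count": 2},
    {"key": "c", "doc_count": 1},
    {"key": "d", "doc_count": 3},
]

histogram = ids_per_document_count(sample_buckets)
for doc_count, id_count in sorted(histogram.items()):
    print(f"There are {id_count} ids with {doc_count} document(s)")
```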

Selecting all the results from a bucket using TopHits aggregation

I am using a TopHits aggregation under a Terms aggregation to fetch the records, as shown in the query below.
{
  "aggregations" : {
    "group by" : {
      "terms" : {
        "field" : "City"
      },
      "aggregations" : {
        "top" : {
          "top_hits" : {
            "size" : 200
          }
        }
      }
    }
  }
}
I want to fetch all the records in each bucket instead of only the top 200, but as the value of size increases, the query time also increases for the same indexed data (the same number of records).
So I cannot set size to an arbitrarily large number, because it hurts query time.
Is there any way to achieve the same thing efficiently?
Thanks.
In Elasticsearch, size defaults to 10 documents; to get more documents back you have to increase it.
That comes at a cost, as the documentation explains:
Deep pagination with from and size (e.g. ?size=10&from=10000) is very inefficient, because each shard has to retrieve and sort its top from + size results (10,010 in this example), and these are then re-sorted just to return 10 results. This process has to be repeated for every page requested.
In this case you should use the scroll API instead, because:
The scroll API keeps track of which results have already been returned and so is able to return sorted results more efficiently than with deep pagination. However, sorting results (which happens by default) still has a cost.
In your case you should use scan and scroll, as below:
curl -s -XGET 'localhost:9200/logs/syslogs/_search?scroll=10m&search_type=scan' -d '{
  "aggregations": {
    "group by": {
      "terms": {
        "field": "City"
      },
      "aggregations": {
        "top": {
          "top_hits": {
            "size": 200
          }
        }
      }
    }
  }
}'
The query above returns a scroll id; pass that scroll id to the scroll endpoint as below:
curl -XGET 'localhost:9200/_search/scroll?scroll=1m' -d 'scroll id '
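The scroll request is meant to be repeated until a page comes back with no hits, each time using the scroll id from the latest response. Below is a rough Python sketch of that loop. It assumes responses are parsed into dicts with the usual _scroll_id and hits.hits fields; fetch_next is a hypothetical callable standing in for the HTTP call to /_search/scroll, and the fake pages are invented so the example runs without a cluster.

```python
def iterate_scroll(first_response, fetch_next):
    """Yield every hit from a scroll, following scroll ids until a page is empty.

    fetch_next is a hypothetical callable: it takes a scroll id and returns
    the next parsed response (in real use, a call to /_search/scroll).
    """
    # With search_type=scan the first response carries no hits, only a
    # scroll id; with a plain scroll it may already contain the first page.
    yield from first_response.get("hits", {}).get("hits", [])
    scroll_id = first_response["_scroll_id"]
    while True:
        page = fetch_next(scroll_id)
        hits = page["hits"]["hits"]
        if not hits:
            break
        yield from hits
        # Always continue with the most recently returned scroll id.
        scroll_id = page["_scroll_id"]


# Fake pages keyed by scroll id, standing in for a real cluster.
pages = {
    "id-0": {"_scroll_id": "id-1", "hits": {"hits": [{"_id": "1"}, {"_id": "2"}]}},
    "id-1": {"_scroll_id": "id-2", "hits": {"hits": [{"_id": "3"}]}},
    "id-2": {"_scroll_id": "id-3", "hits": {"hits": []}},
}
first = {"_scroll_id": "id-0", "hits": {"hits": []}}

collected = [hit["_id"] for hit in iterate_scroll(first, pages.__getitem__)]
print(collected)
```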

How do I use doc_count in an aggregations range query in ElasticSearch 1.0

I have a bunch of user generated events in my ES cluster. Each event contains the user's UUID.
I'm trying to write a query that buckets users into low, medium and high activity based on the number of events each user generates.
I'm using this query to get the number of events generated by each user:
{
"aggs" : {
"users" : {
"terms" : { "field" : "user_id.raw" }
}
}
}
This works fine, but I need to further bucket the results into ranges based on each bucket's "doc_count", so that I can sort each user into a low, medium or high activity bucket.
I tried a bunch of ways to access the doc_count field from a sub-aggregation but never managed to get it to work. I figured this would be a fairly common use case, but I can't seem to crack it, so any help would be much appreciated.
I have updated https://github.com/elasticsearch/elasticsearch/issues/4983?_pjax=%23js-repo-pjax-container with this issue as well.
It looks like a minor enhancement to the aggregation framework, but it would be really useful.
You can probably do something like:
{
  "aggs" : {
    "tally" : {
      "sum" : {
        "script": "1"
      }
    },
    "aggs" : {
      //refer to tally here as the value would be same as doc_count
    }
  }
}
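Until doc_count can be referenced inside the aggregation itself, one fallback is to do the low/medium/high split on the client, using the buckets returned by the terms aggregation on user_id.raw. This is only a sketch: the function names and the thresholds (10 and 100 events) are hypothetical and should be replaced with whatever cut-offs suit your data. It also assumes the terms aggregation was asked for enough buckets to cover all users.

```python
def activity_level(doc_count, medium_from=10, high_from=100):
    """Classify an event count; the default thresholds are made-up examples."""
    if doc_count >= high_from:
        return "high"
    if doc_count >= medium_from:
        return "medium"
    return "low"


def bucket_users(buckets, medium_from=10, high_from=100):
    """Group user ids from terms-aggregation buckets by activity level."""
    levels = {"low": [], "medium": [], "high": []}
    for bucket in buckets:
        level = activity_level(bucket["doc_count"], medium_from, high_from)
        levels[level].append(bucket["key"])
    return levels


# Hand-written sample of response["aggregations"]["users"]["buckets"].
sample_buckets = [
    {"key": "user-a", "doc_count": 250},
    {"key": "user-b", "doc_count": 42},
    {"key": "user-c", "doc_count": 3},
]

levels = bucket_users(sample_buckets)
print(levels)
```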

Can I specify the result fields in elasticsearch query?

In my dataset, a document contains 20+ fields with nested objects. Most of them are long text fields. These fields are important for full-text search but we only need to show the title, short-description and Id in output.
Is it possible to specify the output fields in ElasticSearch for a full text query? (like projection in MongoDB)
I think you're looking for the fields property of a search request:
Allows to selectively load specific fields for each document
represented by a search hit. Defaults to load the internal _source
field.
{
  "fields" : ["user", "postDate"],
  "query" : {
    "term" : { "user" : "kimchy" }
  }
}
The fields will automatically load stored fields (store mapping set to
yes), or, if not stored, will load the _source and extract it from it
(allowing to return nested document object).
Take care: as of Elasticsearch 1.0.0.RC1, the values returned via fields are always lists.
If you need the result to be a long instead of a list of longs (which might be a single-value list for you most of the time), you can select the fields with _source instead:
{
  "_source" : ["field1", "field2", ...],
  "query" : {
    "term" : { "user" : "kimchy" }
  }
}
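For the case in the question, a small Python sketch of building such a request body and reading the projected documents back might look like the following. The helper names are invented, the field names "title", "short-description" and "id" are guesses at how the question's fields are spelled in the mapping, and the sample response is hand-written rather than real output.

```python
import json


def projected_search(query, source_fields):
    """Build a search body that returns only the given _source fields."""
    return {"_source": list(source_fields), "query": query}


def rows_from_hits(response):
    """Collect the (already filtered) _source of every hit."""
    return [hit["_source"] for hit in response["hits"]["hits"]]


# Field names are assumptions based on the question's description.
body = projected_search(
    {"term": {"user": "kimchy"}},
    ["title", "short-description", "id"],
)
print(json.dumps(body, indent=2))

# Hand-written sample response containing only the requested fields.
sample_response = {
    "hits": {
        "hits": [
            {"_id": "1", "_source": {"title": "First", "short-description": "A", "id": 1}},
            {"_id": "2", "_source": {"title": "Second", "short-description": "B", "id": 2}},
        ]
    }
}
rows = rows_from_hits(sample_response)
```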

How to perform a date range elasticsearch query given multiple dates per document?

I'm using ElasticSearch to index forum threads and reply posts. Each post has a date field associated with it. I'd like to perform a query that includes a date range which will return threads that contain posts matching a date range. I've looked at using a nested mapping but the docs say the feature is experimental and may lead to inaccurate results.
What's the best way to accomplish this? I'm using the Java API.
You haven't said much about your data structure, but I'm inferring from your question that you have post objects which contain a date field, and presumably a thread_id field, i.e. some way of identifying which thread a post belongs to?
Do you also have a thread object, or is your thread_id sufficient?
Either way, your stated goal is to return a list of threads which have posts in a particular date range. This means that you need to group your threads (rather than returning the same thread_id multiple times for each post in the date range).
This grouping can be done by using facets.
So the query in JSON would look like this:
curl -XGET 'http://127.0.0.1:9200/posts/post/_search?pretty=1&search_type=count' -d '
{
  "facets" : {
    "thread_id" : {
      "terms" : {
        "size" : 20,
        "field" : "thread_id"
      }
    }
  },
  "query" : {
    "filtered" : {
      "query" : {
        "text" : {
          "content" : "any keywords to match"
        }
      },
      "filter" : {
        "numeric_range" : {
          "date" : {
            "lt" : "2011-02-01",
            "gte" : "2011-01-01"
          }
        }
      }
    }
  }
}
'
Note:
I'm using search_type=count because I don't actually want the posts returned, just the thread_ids
I've specified that I want the 20 most frequently encountered thread_ids (size: 20). The default would be 10
I'm using a numeric_range for the date field because dates typically have many distinct values, and the numeric_range filter uses a different approach to the range filter, making it perform better in this situation
If your thread_ids look like how-to-perform-a-date-range-elasticsearch-query then you can use these values directly. But if you have a separate thread object, then you can use the multi-get API to retrieve the threads by id
Your thread_id field should be mapped as { "index": "not_analyzed" } so that the whole value is treated as a single term, rather than being analyzed into separate terms
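If you do have separate thread objects, the step from the facet result to a multi-get request could be sketched like this in Python. It assumes the terms facet response has been parsed into a dict with the usual "term"/"count" entries, and that the resulting body is sent to an _mget URL that already names the index and type of your thread documents (which the question does not specify). The function name and sample data are invented for illustration.

```python
def mget_body_from_facet(response, facet_name="thread_id"):
    """Build a multi-get body ({"ids": [...]}) from a terms facet result."""
    terms = response["facets"][facet_name]["terms"]
    return {"ids": [entry["term"] for entry in terms]}


# Hand-written sample in the shape of a terms facet response.
sample_response = {
    "facets": {
        "thread_id": {
            "_type": "terms",
            "terms": [
                {"term": "thread-17", "count": 5},
                {"term": "thread-4", "count": 2},
            ],
        }
    }
}

mget_body = mget_body_from_facet(sample_response)
print(mget_body)
```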
