How to perform a date range elasticsearch query given multiple dates per document?

I'm using Elasticsearch to index forum threads and reply posts. Each post has a date field associated with it. I'd like to run a query with a date range that returns the threads containing at least one post within that range. I've looked at using a nested mapping, but the docs say the feature is experimental and may lead to inaccurate results.
What's the best way to accomplish this? I'm using the Java API.

You haven't said much about your data structure, but I'm inferring from your question that you have post objects which contain a date field and, presumably, a thread_id field, i.e. some way of identifying which thread a post belongs to.
Do you also have a thread object, or is your thread_id sufficient?
Either way, your stated goal is to return a list of threads which have posts in a particular date range. This means that you need to group the matching posts by thread (rather than returning the same thread_id once for every post in the date range).
This grouping can be done by using facets.
So the query in JSON would look like this:
curl -XGET 'http://127.0.0.1:9200/posts/post/_search?pretty=1&search_type=count' -d '
{
  "facets" : {
    "thread_id" : {
      "terms" : {
        "size" : 20,
        "field" : "thread_id"
      }
    }
  },
  "query" : {
    "filtered" : {
      "query" : {
        "text" : {
          "content" : "any keywords to match"
        }
      },
      "filter" : {
        "numeric_range" : {
          "date" : {
            "lt" : "2011-02-01",
            "gte" : "2011-01-01"
          }
        }
      }
    }
  }
}
'
Notes:
I'm using search_type=count because I don't actually want the posts returned, just the thread_ids.
I've specified that I want the 20 most frequently encountered thread_ids (size: 20). The default would be 10.
I'm using a numeric_range filter for the date field because dates typically have many distinct values. Unlike the range filter, numeric_range loads the field values into memory and checks each document's value against the bounds, which costs more memory but performs better in this situation.
If your thread_ids look like how-to-perform-a-date-range-elasticsearch-query then you can use these values directly. But if you have a separate thread object, then you can use the multi-get API to retrieve the threads.
Your thread_id field should be mapped as { "index": "not_analyzed" } so that the whole value is treated as a single term, rather than being analyzed into separate terms.
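Once the response comes back, the thread ids still have to be pulled out of the facet section. Here is a rough Python sketch of that step. It assumes the terms facet response layout (a facets object whose terms list holds term/count pairs); the sample response and the helper name thread_ids_from_facet are made up for illustration and are not part of any client library.

```python
def thread_ids_from_facet(response, facet_name="thread_id"):
    """Return (thread_id, matching_post_count) pairs from a terms facet."""
    facet = response.get("facets", {}).get(facet_name, {})
    return [(entry["term"], entry["count"]) for entry in facet.get("terms", [])]


# Hypothetical response body, trimmed to the parts used here.
sample_facet_response = {
    "hits": {"total": 5, "hits": []},
    "facets": {
        "thread_id": {
            "_type": "terms",
            "terms": [
                {"term": "how-to-perform-a-date-range-elasticsearch-query", "count": 3},
                {"term": "another-thread", "count": 2},
            ],
        }
    },
}

for thread_id, post_count in thread_ids_from_facet(sample_facet_response):
    print(thread_id, post_count)
```

The count next to each thread_id is the number of posts in that thread that matched both the keywords and the date range, so it can also be used to rank the threads.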

Related

Creating histogram in Elasticsearch

I have an index with several documents. Each document has an "id" field, and I want to know how many documents there are per id. There can be several documents for each id, just as in any store there can be many transactions for each customer.
Meaning for instance, I want to get something like: "There are 5 ids with 1 document. There are 10 ids with 2 documents" and so on.
How can I write that aggregation in Elasticsearch?
I believe this would be a classic terms aggregation. Something along these lines should work for you:
GET /_search
{
  "aggs" : {
    "ids" : {
      "terms" : { "field" : "id" }
    }
  }
}
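The terms aggregation gives one bucket per id with its doc_count. The "5 ids with 1 document, 10 ids with 2 documents" summary from the question can then be derived from those buckets on the client side. A minimal Python sketch, assuming the usual aggregations/buckets response layout; the sample response and the function name ids_per_doc_count are hypothetical. Keep in mind that terms only returns the top buckets (10 by default), so the size would need to be large enough to cover every id for the summary to be complete.

```python
from collections import Counter


def ids_per_doc_count(response, agg_name="ids"):
    """Map each doc_count to the number of ids that have that many documents."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return dict(Counter(bucket["doc_count"] for bucket in buckets))


# Hypothetical response body, trimmed to the aggregation part.
sample_ids_response = {
    "aggregations": {
        "ids": {
            "buckets": [
                {"key": "a", "doc_count": 2},
                {"key": "b", "doc_count": 1},
                {"key": "c", "doc_count": 2},
                {"key": "d", "doc_count": 1},
                {"key": "e", "doc_count": 1},
            ]
        }
    }
}

histogram = ids_per_doc_count(sample_ids_response)
for doc_count in sorted(histogram):
    print(f"There are {histogram[doc_count]} ids with {doc_count} document(s)")
```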

How can I get options for filtering by a field directly from elasticsearch?

I want to populate a filtering field based on the data I have indexed inside Elasticsearch. How can I retrieve this data? For example, my documents inside index "test" and type "doc" could be
{"id":1, "tag":"foo", "name":"foothing"}
{"id":2, "tag":"bar", "name":"barthing"}
{"id":3, "tag":"foo", "name":"something"}
{"id":4, "tag":"quux", "name":"quuxthing"}
I'm looking for something like GET /test/doc/_magic?q=tag that would return [foo,bar,quux] from my data. I don't know what this is called or even possible. I don't want to get all index entries into memory and do this programmatically, I have millions of documents in the index with around a hundred different tags.
Is this possible with ES?
Yes, that's possible; this is called a terms aggregation.
You can do it like this:
GET /test/doc/_search
{
  "size": 0,
  "aggs" : {
    "tags" : {
      "terms" : {
        "field" : "tag.keyword",
        "size": 100
      }
    }
  }
}
Note that depending on the cardinality of your tag field, you can increase/decrease the size setting (10 by default).
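To turn the response into the plain list the question asks for, only the bucket keys are needed. A small Python sketch, assuming the usual aggregations/buckets response layout; the sample response is hand-written to match the four example documents, and tag_options is a hypothetical helper name.

```python
def tag_options(response, agg_name="tags"):
    """Return the distinct tag values, in the order the buckets were returned."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return [bucket["key"] for bucket in buckets]


# Hypothetical response for the four example documents.
sample_tags_response = {
    "aggregations": {
        "tags": {
            "buckets": [
                {"key": "foo", "doc_count": 2},
                {"key": "bar", "doc_count": 1},
                {"key": "quux", "doc_count": 1},
            ]
        }
    }
}

print(tag_options(sample_tags_response))  # ['foo', 'bar', 'quux']
```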

How to run Elasticsearch completion suggester query on limited set of documents

I'm using a completion suggester in Elasticsearch on a single field. The type contains documents of several users. Is there a way to limit the returned suggestions to documents that match a specific query?
I'm currently using this query:
{
  "name" : {
    "text" : "Peter",
    "completion" : {
      "field" : "name_suggest"
    }
  }
}
Is there a way to combine this query with a different one, e.g.
{
  "query": {
    "term" : {
      "user_id" : "590c5bd2819c3e225c990b48"
    }
  }
}
Have a look at the context suggester, which is a specialized completion suggester with filtering capabilities. Keep in mind, however, that this is still not a regular query filter.
You can specify both the query and the suggester in the same request, like this:
{
  "query": {
    "term" : {
      "user_id" : "590c5bd2819c3e225c990b48"
    }
  },
  "suggest": {
    "name" : {
      "text" : "Peter",
      "completion" : {
        "field" : "name_suggest"
      }
    }
  }
}
I have a similar use case, and I've posted my question on the Elasticsearch forum.
From what I've read so far, I don't think you can limit documents with the completion suggester. It essentially builds a finite state transducer (prefix tree) at index time; this makes it fast, but you lose the flexibility of filtering on additional fields. I don't think the context suggester would work in your case (let me know if I am wrong), because the cardinality of user_id is very high.
I think edge-ngram partial matching is more flexible and might actually work in your use case.
Let me know what you end up implementing.
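To make the edge-ngram idea a bit more concrete: each word is indexed as its leading prefixes, so a typed prefix becomes an ordinary term match that can be combined with a regular filter such as the user_id term. The following is a plain Python illustration of which tokens would be produced, not Elasticsearch code. The function names, the whitespace tokenization, the lowercasing and the min_gram/max_gram defaults are all assumptions for the sake of the example.

```python
def edge_ngrams(token, min_gram=1, max_gram=10):
    """Return the leading prefixes of a token, from min_gram to max_gram characters."""
    token = token.lower()
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


def matches_prefix(typed, indexed_name, min_gram=1, max_gram=10):
    """Check whether the typed text equals an edge n-gram of any word in the name."""
    grams = {
        gram
        for word in indexed_name.split()
        for gram in edge_ngrams(word, min_gram, max_gram)
    }
    return typed.lower() in grams


print(edge_ngrams("Peter", min_gram=2, max_gram=4))  # ['pe', 'pet', 'pete']
print(matches_prefix("Pet", "Peter Parker"))         # True
print(matches_prefix("ark", "Peter Parker"))         # False
```

Because only prefixes are indexed, "ark" does not match "Parker", which is the behaviour you would expect from an autocomplete box.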

Queries vs Filters - Order of execution

I've read this question, and a colleague of mine has made me doubt my understanding:
In a filtered query, when is the filter applied? Before or after executing the query? When is the result cached?
If the filter is applied beforehand, wouldn't it be a good thing to duplicate the query part in the filters?
If the filter is applied afterward, then I'm having trouble understanding what is cached.
Luckily, ES provides two types of filters for you to work with:
{
  "query" : {
    "field" : { "title" : "Catch-22" }
  },
  "filter" : {
    "term" : { "year" : 1961 }
  }
}
{
  "query": {
    "filtered" : {
      "query" : {
        "field" : { "title" : "Catch-22" }
      },
      "filter" : {
        "term" : { "year" : 1961 }
      }
    }
  }
}
In the first case, filters are applied to all documents found by the query. In the second case, the documents are filtered before the query runs. This yields better performance.
Quoted from: http://www.packtpub.com/elasticsearch-server-for-fast-scalable-flexible-search-solution/book
As for the cache, I'm not sure how the caching mechanism of filters works.
My guess would be:
First case: since the filter is applied to the set of results returned by the query, the cache is specific to that result set.
Second case: the filter is applied first, so the cache is stored for the indices you checked against. This cache is more reusable because it does not depend on the content of the query, but it comes at a larger memory cost and a slower first query (before the cache is populated).
Let me explain how a search query is executed.
First, there is always a complete set of documents that you are searching in.
If a filter is included with the search query, it simply makes that set smaller. Filter results are cached, so running the same filter again reuses the cached result.
The query text is then matched against this smaller set.
As for your doubt: duplicating the query in the filters would only increase the overhead of the cache mechanism. There are many guidelines on what to put in a filter and what to leave in the query; it comes down to whether a clause should affect relevance.
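The idea in the answers above (filter first, cache the filter result, then score only what is left) can be sketched as a toy model in Python. This is not how Elasticsearch is implemented internally; the data, the cache keyed by a string, and the names search, year_1961 and title_score are all invented to illustrate the order of operations.

```python
def search(docs, filter_cache, filter_key, filter_fn, score_fn):
    """Apply a cached filter first, then score only the surviving documents."""
    if filter_key not in filter_cache:
        # First use of this filter: evaluate it once and remember the matching ids.
        filter_cache[filter_key] = {
            doc_id for doc_id, doc in docs.items() if filter_fn(doc)
        }
    candidates = filter_cache[filter_key]
    scored = [(score_fn(docs[doc_id]), doc_id) for doc_id in candidates]
    # Keep only documents the query actually matched, best score first.
    return sorted((pair for pair in scored if pair[0] > 0), reverse=True)


docs = {
    1: {"title": "Catch-22", "year": 1961},
    2: {"title": "Catch-22 revisited", "year": 1999},
    3: {"title": "Franny and Zooey", "year": 1961},
}


def year_1961(doc):
    return doc["year"] == 1961


def title_score(doc):
    return 1.0 if "Catch-22" in doc["title"] else 0.0


cache = {}
hits = search(docs, cache, "year:1961", year_1961, title_score)
print(hits)   # [(1.0, 1)]
print(cache)  # {'year:1961': {1, 3}}
```

Note that the cached entry depends only on the filter, not on the query text, which is why a second search with a different title but the same year filter could reuse it.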

elastic search faceted query returns incorrect count

I need help with aggregate / faceted queries in Elasticsearch. I have used a faceted query to group the results, but I'm not getting grouped results with the correct counts.
Please suggest how to get grouped results from Elasticsearch.
{
  "query" : {
    "query_string" : {"query" : "pared_cat_id:1"}
  },
  "facets" : {
    "subcategory" : {
      "terms" : {
        "field": "sub_cat_id",
        "size" : 50,
        "order" : "term",
        "all_terms" : true
      }
    }
  },
  "from" : 0,
  "size": 50
}
Trying to get grouped results for sub category id for passed parent category id.
"query_string" : {"query" : "pared_cat_id:1"} } ,
This is applied to the overall data and not to the facet counts.
For this you need to use a facet query, in which you can specify the same condition that you are specifying in the main query string.
So the facet counts being shown to you now are based on the results without applying "query_string" : {"query" : "pared_cat_id:1"}, i.e. on the whole data. In case you want facet counts after applying "query_string" : {"query" : "pared_cat_id:1"}, provide it in the facet query.
Elasticsearch faceting queries work very well in terms of accuracy; at least, I have not seen any problem yet.
Just a few questions:
Is this field a string or numeric? Can you give an example?
Have you applied any custom mapping, or have you used the default "standard" analyzer?
Please state the kind of inaccuracy, e.g. "aa" should have a count of 100 but shows 50, or is it some other kind of inaccuracy?
An Elasticsearch facets query can return incorrect counts if the number of shards is greater than 1. In any case, facets are now deprecated and will be removed in a future release, so you are encouraged to migrate to aggregations instead.
I suggest that you take a look at this blog post, in which Alex Brasetvik gives a good description, along with some examples, of how to use the aggregations feature properly.
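As a rough idea of what the migration could look like for the request in the question, here is a Python sketch that builds an aggregation-based body. The mapping of the facet options is my own reading and should be checked against the documentation for your version: "order": "term" is expressed as an order on "_term" (newer releases call this "_key"), and "all_terms": true is approximated with "min_doc_count": 0. The helper name terms_facet_to_agg is hypothetical.

```python
def terms_facet_to_agg(name, field, size, all_terms=False):
    """Build a terms aggregation roughly equivalent to a terms facet."""
    terms = {
        "field": field,
        "size": size,
        # Assumption: sort buckets by term, as "order": "term" did for the facet.
        "order": {"_term": "asc"},
    }
    if all_terms:
        # Assumption: also return terms that have no matching documents.
        terms["min_doc_count"] = 0
    return {name: {"terms": terms}}


request_body = {
    "query": {"query_string": {"query": "pared_cat_id:1"}},
    "aggs": terms_facet_to_agg("subcategory", "sub_cat_id", 50, all_terms=True),
    "from": 0,
    "size": 50,
}

print(request_body["aggs"])
```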
