Elasticsearch - HAVING clause equivalent

I have written a faceted query which returns faceted results (the equivalent of GROUP BY in the SQL world).
Now I would like to get only the faceted results where the count is greater than a particular number (the equivalent of a HAVING clause in SQL).
Any suggestions?
Update: added the query.
I need only the locations where the count is greater than 5. For example, US has 7, UK has 5, and the rest have 3 each, so I want to return only US and UK in the result.
"facets":
{
"locations":
{
"terms":
{
"field": "location"
},
"facet_filter":
{
"terms": { "location": [ "US", "UK", "DE", "FR", "JP" ]}
}
}
}

HAVING clauses are not implemented in Elasticsearch yet; you have to handle that client-side.
See https://github.com/elastic/elasticsearch/issues/8110
There are plans to add it, but it has not been done as of May 2015.
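For later versions: aggregations have since replaced facets, and a terms aggregation's min_doc_count parameter covers the count-threshold flavor of HAVING (min_doc_count: 6 keeps only buckets with a count greater than 5); the bucket_selector pipeline aggregation (2.0+) handles arbitrary HAVING-style conditions. A minimal sketch with the field from the question:
"aggs": {
  "locations": {
    "terms": {
      "field": "location",
      "min_doc_count": 6
    }
  }
}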

Related

Sorting a set of results with pre-ordered items

I have a list of pre-ordered items (ordered by score ASC) like:
[{
  "id": "id2",
  "score": 1
}, {
  "id": "id12",
  "score": 1
}, {
  "id": "id8",
  "score": 1.4
}, {
  "id": "id9",
  "score": 1.4
}, {
  "id": "id14",
  "score": 1.75
}, {
  ...
}]
Let's say I have an Elasticsearch index with a massive number of items. Note that there's no "score" field in the indexed documents.
Now I want Elasticsearch to return only those items with ids in the said list. OK, this part is easy. I'm now stuck at sorting the result: I need the result to be sorted exactly as in my pre-ordered list above.
Any suggestion for how to achieve that?
I'm not a native English speaker, so sorry for my grammar and word choice.
As of version 7.4, Elasticsearch offers the pinned query, which promotes selected documents to rank higher than those matching a given query. In your case this search query should return what you want:
GET /_search
{
  "query": {
    "pinned": {
      "ids": ["id2", "id12", "id8"],
      "organic": {
        ... other queries ...
      }
    }
  }
}
For more information you can check the official Elasticsearch documentation on the pinned query.
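Note that pinned documents are returned in the order their ids are listed. If you want only the items from your list and nothing else, one option is to use an ids query as the organic part; a sketch, assuming the ids from the question:
GET /_search
{
  "query": {
    "pinned": {
      "ids": ["id2", "id12", "id8", "id9", "id14"],
      "organic": {
        "ids": { "values": ["id2", "id12", "id8", "id9", "id14"] }
      }
    }
  }
}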

Elasticsearch query speed up with repeated used terms query filter

I need to find the number of co-occurrences between each single tag and a fixed set of tags taken as a whole. I have 10,000 different single tags, and there are 10k tags inside the fixed set. I loop through all single tags within the fixed-set-of-tags context and a fixed time range. I have 1 billion documents in total inside the index, spread over 20 shards.
Here is the Elasticsearch query (Elasticsearch 6.6.0):
es.search(index=index, size=0, body={
  "query": {
    "bool": {
      "filter": [
        {"range": {
          "created_time": {
            "gte": fixed_start_time,
            "lte": fixed_end_time,
            "format": "yyyy-MM-dd-HH"
          }}},
        {"term": {"tags": dynamic_single_tag}},
        {"terms": {"tags": {
          "index": "fixed_set_tags_list",
          "id": 2,
          "type": "twitter",
          "path": "tag_list"
        }}}
      ]
    }
  },
  "aggs": {
    "by_month": {
      "date_histogram": {
        "field": "created_time",
        "interval": "month",
        "min_doc_count": 0,
        "extended_bounds": {
          "min": two_month_start_time,
          "max": start_month_start_time
        }
      }
    }
  }
})
My question: is there any way to cache the fixed 10k-tag terms query and the time-range filter inside Elasticsearch so as to speed up the query? The query above takes 1.5s for a single tag.
What you are seeing is normal behavior for Elasticsearch aggregations (actually, pretty good performance given that you have 1 billion documents).
There are a couple of options you may consider: using a batch of filter aggregations, re-indexing with a subset of documents, and downloading the data out of Elasticsearch and computing the co-occurrences offline.
But it is probably worth first trying to send those 10K queries and seeing whether Elasticsearch's built-in caching kicks in.
Let me explain in a bit more detail each of these options.
Using filter aggregation
First, let's outline what the original ES query does:
filters documents with created_time in a certain time window;
filters documents containing the desired tag dynamic_single_tag;
filters documents that have at least one tag from the list fixed_set_tags_list;
counts how many such documents there are per month in a certain time period.
The performance is a problem because we've got 10K tags to run such queries for.
What we can do here is move the filter on dynamic_single_tag out of the query and into the aggregations:
POST myindex/_doc/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "terms": { ... } }
      ]
    }
  },
  "aggs": {
    "by tag C": {
      "filter": {
        "term": {
          "tags": "C"     <== here's the filter
        }
      },
      "aggs": {
        "by month": {
          "date_histogram": {
            "field": "created_time",
            "interval": "month",
            "min_doc_count": 0,
            "extended_bounds": {
              "min": "2019-01-01",
              "max": "2019-02-01"
            }
          }
        }
      }
    }
  }
}
The result will look something like this:
"aggregations" : {
"by tag C" : {
"doc_count" : 2,
"by month" : {
"buckets" : [
{
"key_as_string" : "2019-01-01T00:00:00.000Z",
"key" : 1546300800000,
"doc_count" : 2
},
{
"key_as_string" : "2019-02-01T00:00:00.000Z",
"key" : 1548979200000,
"doc_count" : 0
}
]
}
}
Now, if you are asking how this helps performance, here is the trick: add more such filter aggregations, one for each tag: "by tag D", "by tag E", and so on.
The improvement comes from doing "batch" requests, combining many initial requests into one. It might not be practical to put all 10K of them in one query, but even batches of 100 tags per query can be a game changer.
(Side note: roughly the same behavior can be achieved via a terms aggregation with the include filter parameter; see the sketch below.)
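A sketch of that variant, reusing the fields from the query above; include keeps only the listed tags as buckets (the tag values here are hypothetical):
"aggs": {
  "by_tags": {
    "terms": {
      "field": "tags",
      "include": ["C", "D", "E"],
      "size": 100
    },
    "aggs": {
      "by_month": {
        "date_histogram": {
          "field": "created_time",
          "interval": "month",
          "min_doc_count": 0
        }
      }
    }
  }
}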
This method of course requires getting your hands dirty and writing a more complex query, but it comes in handy if you need to run such queries at arbitrary times with zero preparation.
Re-indexing the documents
The idea behind the second method is to reduce the set of documents beforehand, via the reindex API. A reindex query might look like this:
POST _reindex
{
  "source": {
    "index": "myindex",
    "type": "_doc",
    "query": {
      "bool": {
        "filter": [
          {
            "range": {
              "created_time": {
                "gte": "fixed_start_time",
                "lte": "fixed_end_time",
                "format": "yyyy-MM-dd-HH"
              }
            }
          },
          {
            "terms": {
              "tags": {
                "index": "fixed_set_tags_list",
                "id": 2,
                "type": "twitter",
                "path": "tag_list"
              }
            }
          }
        ]
      }
    }
  },
  "dest": {
    "index": "myindex_reduced"
  }
}
This query will create a new index, myindex_reduced, containing only the documents that satisfy the first two filter clauses.
At this point, the original query can be run without those two clauses.
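A sketch of that follow-up query against the reduced index (the tag value is hypothetical):
POST myindex_reduced/_doc/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "tags": "C" } }
      ]
    }
  },
  "aggs": {
    "by_month": {
      "date_histogram": {
        "field": "created_time",
        "interval": "month",
        "min_doc_count": 0
      }
    }
  }
}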
The speed-up here comes from limiting the number of documents; the smaller that set, the bigger the gain. So if fixed_set_tags_list leaves you with a small portion of the 1 billion documents, this is an option you should definitely try.
Downloading data and processing outside Elasticsearch
To be honest, this use case looks more like a job for pandas. If data analytics is your thing, I would suggest using the scroll API to extract the data to disk and then processing it with an arbitrary script.
In Python it can be as simple as the scan() helper of the elasticsearch library.
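A minimal sketch, assuming the index and field names from the question (the concrete date-range values are placeholders):
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan
import csv

es = Elasticsearch()

# Pull only the fields we need, for every document in the time window;
# scan() wraps the scroll API and streams the hits.
query = {
    "query": {
        "range": {
            "created_time": {
                "gte": "2019-01-01-00",
                "lte": "2019-03-01-00",
                "format": "yyyy-MM-dd-HH"
            }
        }
    }
}

with open("docs.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["created_time", "tags"])
    for hit in scan(es, index="myindex", query=query, _source=["created_time", "tags"]):
        src = hit["_source"]
        writer.writerow([src["created_time"], ",".join(src.get("tags", []))])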
Why not try the brute-force approach?
Elasticsearch will already try to help you via the request cache. It is applied only to pure-aggregation queries (size: 0), so it should work in your case.
But it will not, because the content of the query is always different (the whole JSON of the query is used as the cache key, and we have a new tag in every query). A different level of caching will come into play.
Elasticsearch relies heavily on the filesystem cache, which means that under the hood the more frequently accessed blocks of the filesystem get cached (practically loaded into RAM). For the end user this means that "warming up" comes gradually, with a volume of similar requests.
In your case, aggregations and filtering occur on two fields: created_time and tags. This means that after doing maybe 10 or 100 requests with different tags, the response time should drop from 1.5s to something more bearable.
To demonstrate the point, here is a Vegeta plot from my study of Elasticsearch performance under the same heavy-aggregation query sent at a fixed RPS:
As you can see, initially the request was taking ~10s, and after 100 requests it diminished to a brilliant 200ms.
I would definitely suggest trying this "brute force" approach: if it works, great; if it does not, it cost nothing.
Hope that helps!

Elasticsearch ignoring single-letter words

I'm a beginner with Elasticsearch. I have an application that uses Elasticsearch to look for ingredients in a given food or fruit.
I'm facing a problem with scoring when the user types, for example, "Vitamine d".
Elasticsearch gives the best score to the phrase "vitamine", even though the phrase "Vitamine D" exists and should normally score highest.
It seems that if the second word is just one letter ("d" in my case), Elasticsearch ignores it.
I tried another example, "vitamine b12", and got the correct score.
Here is the query that the application sends to the server:
{
  "from": 0,
  "size": 5,
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "constNomFr": {
              "query": "vitamine d"
            }
          }
        }
      ],
      "should": [
        {
          "prefix": {
            "constNomFr": {
              "value": "vitamine d",
              "boost": 2
            }
          }
        }
      ]
    }
  },
  "_source": {
    "excludes": [
      "alimentDtos"
    ]
  }
}
What could I modify to make it work?
Thank you so much.
If you can identify your ingredients, I recommend indexing them in a separate field, "ingredients", setting its type to keyword. That way you can use a term filter, and you can even run aggregations.
You may already have your documents indexed that way; in that case, if you are using the default mapping, just run your query against your_field_name.keyword.
If you don't have your ingredients indexed as an array, then you should take a look at the Elasticsearch analyzers to choose or build the right one.
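A sketch of the exact-match variant, assuming a text field constNomFr with the default keyword sub-field; note that keyword fields are not analyzed, so the term must match exactly, including case:
{
  "query": {
    "term": {
      "constNomFr.keyword": "vitamine d"
    }
  }
}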

Calculate counts of hits of several subqueries inside one query to Elasticsearch

I have 3 fields in a document that I need to match. I'd like to identify which of those 3 fields have any matches.
More specifically, I'd like to find out whether the given wildcard query matches only one field across the document set or matches several fields. If the wildcard query matches only, say, field1, then I can conclude that the given wildcard query applies only to field1. If it matches two or three fields, then I cannot draw such a conclusion, and I'll wait for more characters to be entered by the user to narrow the search.
I've written the following query that matches all 3 fields:
{
  "query": {
    "bool": {
      "should": [
        { "wildcard": { "field1": "*R*" } },
        { "wildcard": { "field2": "*R*" } },
        { "wildcard": { "field3": "*R*" } }
      ]
    }
  },
  "size": 0
}
It returns the total count of documents that match on any of those fields. Now I'd like to know if it's possible to receive a separate count for each subquery. This can be achieved by sending 3 separate requests, but I'd like to minimize the number of requests to Elasticsearch.
I've tried bool and dis_max queries but could not find a solution.
UPDATE
Using named queries I've built the following query:
{
  "query": {
    "bool": {
      "should": [
        { "wildcard": { "field1": { "value": "*R*", "_name": "query1" } } },
        { "wildcard": { "field2": { "value": "*R*", "_name": "query2" } } },
        { "wildcard": { "field3": { "value": "*R*", "_name": "query3" } } }
      ]
    }
  },
  "size": 1
}
This query returns the single result with the best score. By default, the score is higher when more fields match in the same document. So if the found document was matched on two or three fields, that already answers my initial question. However, if it was matched on a single field, say field1, that does not guarantee there are no other documents matched on field2 or field3, so it's still not a solution.
Do I have to send 3 requests, searching each field separately, to solve my problem?
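For what it's worth, a single request can return one count per field by moving each wildcard into its own filter aggregation; a sketch using the field names above, where each aggregation's doc_count is the number of documents matching that field's wildcard:
{
  "size": 0,
  "aggs": {
    "field1_matches": { "filter": { "wildcard": { "field1": "*R*" } } },
    "field2_matches": { "filter": { "wildcard": { "field2": "*R*" } } },
    "field3_matches": { "filter": { "wildcard": { "field3": "*R*" } } }
  }
}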

Significant Terms Aggregation of "flat" structures

I'm currently prototyping a product recommendation system using the Elasticsearch Significant Terms aggregation. So far I haven't found a good example that deals with "flat" JSON structures of sales (here: the itemId) coming from a relational database, such as mine:
Document 1
{
  "lineItemId": 1,
  "lineNo": 1,
  "itemId": 1,
  "productId": 1234,
  "userId": 4711,
  "salesQuantity": 2,
  "productPrice": 0.99,
  "salesGross": 1.98,
  "salesTimestamp": 1234567890
}
Document 2
{
  "lineItemId": 1,
  "lineNo": 2,
  "itemId": 1,
  "productId": 1235,
  "userId": 4711,
  "salesQuantity": 1,
  "productPrice": 5.99,
  "salesGross": 5.99,
  "salesTimestamp": 1234567890
}
I have around 1.5 million of these documents in my Elasticsearch index. A lineItem is part of a sale (identified by itemId), and a sale can consist of one or more lineItems. What I would like to get is, say, the 5 most uncommonly common products that were bought in conjunction with the sale of one specific productId.
The MovieLens example (https://www.elastic.co/guide/en/elasticsearch/guide/current/_significant_terms_demo.html) deals with data in the structure of
{
  "movie": [122, 185, 231, 292, 316, 329, 355, 356, 362, 364, 370, 377,
            420, 466, 480, 520, 539, 586, 588, 589, 594, 616],
  "user": 1
}
so it's unfortunately not really useful to me. I'd be very glad for an example or a suggestion using my "flat" structures. Thanks a lot in advance.
It sounds like you're trying to build an item-based recommender. Apache Mahout has tools to help with collaborative filtering (formerly the Taste project).
There is also a Taste plugin for Elasticsearch 1.5.x which I believe can work with data like yours to produce item-based recommendations.
(Note: This plugin uses Rivers which were deprecated in Elasticsearch 1.5, so I'd check with the authors about plans to support more recent versions of Elasticsearch before adopting this suggestion.)
Since I don't have the amount of data that you do, I can't test this at scale, but try the following:
First, get the list of itemIds for sales that contain the productId you want to find related products for:
{
  "query": {
    "filtered": {
      "filter": {
        "term": {
          "productId": 1234
        }
      }
    }
  },
  "fields": [
    "itemId"
  ]
}
Then, using this list, create this query:
GET /sales/sales/_search?search_type=count
{
  "query": {
    "filtered": {
      "filter": {
        "terms": {
          "itemId": [1, 2, 3, 4, 5, 6, 7, 11]
        }
      }
    }
  },
  "aggs": {
    "most_sig": {
      "significant_terms": {
        "field": "productId",
        "size": 0
      }
    }
  }
}
If I understand correctly, you have a doc per order line item. What you want is a single doc per order. The order doc should have an array of productIds (or an array of line-item objects that each include a productId field).
That way, when you query for orders containing product X, the significant_terms aggregation should find that product Y is uncommonly common in those orders.
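A sketch of what that could look like, with a hypothetical orders index built from the two example documents above:
{
  "itemId": 1,
  "userId": 4711,
  "productIds": [1234, 1235],
  "salesTimestamp": 1234567890
}
and the corresponding recommendation query:
GET /orders/_search?search_type=count
{
  "query": {
    "term": { "productIds": 1234 }
  },
  "aggs": {
    "also_bought": {
      "significant_terms": {
        "field": "productIds"
      }
    }
  }
}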
