Elastic search - change the relevance according to an external factor

Elastic search - change the relevance according to an external factor - elasticsearch

My use-case is a bit complicated so I'm simplifying it by using products and purchases:
The application has a big database with varies tables, among them - products and purchases (many to many: user_id:product_id).
Elastic has an index for the products only, as this is the only entity needed an advanced/high scale search.
What I'm trying to achieve is as following:
The more times the current user bought a product, the more relevant I want it to be.
The tricky part is the fact that Elastic has an index of the products only, not the purchases.
I can execute a query in the DB and get the info of how many times a user bought a product, and pass the results to Elastic, the question is how to do it.
Thanks.

If you can produce a reasonably-bounded purchase history for each searcher, you could implement this inside a bool query using a list of optional should block term queries
E.g.
"bool": {
"must": [ <existing query logic> ],
"should": [
{
"term": { "product_id": 654321 },
"boost": 3 <e.g. Purchased 3 times>
},
{
...
}
]
}
As a heads-up, evaluating large numbers of these optional Boolean clauses will degrade your query performance, so you might also consider using a rescore request to apply your boosting logic to only, say, the top 100 unboosted search hits, if that would satisfy your requirement.

Related

Elasticsearch Multi Index Query and Filter

I have 2 indexes, one that stores data about an event and one that stores the availability of that event. I am trying to create a single query that gets events by a query but only returns ones that are available, and I am having difficulty doing so.
The events index stores
{
"id" : "152ce52d-e975-4ebd-849a-0a12f535e644",
"createdAt" : 1.5519999143126902E12,
"description" : "A very not so concise description",
"geoHash" : "dnh00x6x5",
"name" : "a name",
...etc...
}
The availability index stores availability like so:
{
"eventId" : "152ce52d-e975-4ebd-849a-0a12f535e644",
"maxGuests" : 8,
"availability" : {
"lte" : "2019-10-18T22:15:00.000Z",
"gte" : "2019-10-18T02:30:00.000Z"
}
}
I am trying to create a query like below, but what I can't figure out is how to filter by listings that meet the criteria in the events index AND are available in the availability index.
GET events,availability/_search
{
"size": 5,
"from": 0,
"_source": [
"id"
],
"query": {
"bool": {
"must": [
{
"geo_distance": {
"distance": "25mi",
"geoHash": {
"lat": 34.0389,
"lon": -84.3826
}
}
}
],
"should": [],
"filter":[
{
"range" : {
"availability" : {
"gte" : "2019-10-31",
"lte" : "2020-11-01",
"relation" : "within"
}
}
}
]
}
}
}
--
The reason I want to only do one query is that the client is expecting a certain specified number of events. If I filter out the unavailable events after I get the event data then I am likely to be left with fewer events than the client expected and would need to do yet another search to fill the gap.
Also, of course, I could merge the two indices so that the event also stores the availability info, but I originally set them up this way because the availability info may have hundreds or thousands of entries per event.

What you want to accomplish is an equivalent of a foreign key of SQL (join). There is no way to have exactly what you want, meaning to filter documents from index A by querying an index B. Your options are:
As you've mentioned solve it on application level (although this causes other problems for you, so it's not a solution).
Merge the data in one index and have duplicated event informatin. Although it seems expensive, the duplication of data in a NoSQL database is to be expected. If you need a relational model then maybe you should use a SQL solution.
Use parent/child (join datatype). The problem here is that you will need to have the data in the same index overall. Moreover, parent and child will be stored in the same shard as well.
One approach to this (a bit more complex though) that I believe would work for you is to use the nested datatype, which actually is a more compact approach for the solution number 2 (combine your data in one index, but save root information only once). Make events be at the root and availability appear as nested. When you want to add one availability you can use the update api, and when you query, you can search by the root fields and by the nested. If you need to retrieve specific availability entries for an event you can use inner hits
What you are trying to do (multi-index search) will not join your data automatically, it will not work. Elasticsearch doesn't work that way, and the relational model is not suited for this product.
One last thing, it's a good thing to plan ahead, but it's a bad thing to try to optimize early on.
The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.
An interesting read that summarizes the above

Really huge query or optimizing an elasticsearch update

I'm working in documents-visualization for binary classification of a big amount of documents (around 150 000). The challenge is how to present general visual information to end-users, so they can have an idea on the main "concepts" on each category (positive/negative). As each document has an associated set of topics, I thought about asking Elasticsearch through aggregations for the top-20 topics on positive classified documents, and then the same for the negatives.
I created a python script that downloads the data from Elastic and classify the docs, BUT the problem is that the predictions on the dataset are not registered on Elasticsearch, so I can not ask for the top-20 topics on a certain category. First I thought about creating a query in elastic to ask for the aggregations and passing a match
As I have the ids of the positive/negative documents, I can write a query to retrieve the aggregation of topics BUT in the query I should provide a really big amount of documents IDS to indicate, for instance, just the positive documents. That is impossible, since there is a limit on the endpoint and I can not pass 50 000 ids like:
"query": {
"bool": {
"should": [
{"match": {"id_str": "939490553510748161"}},
{"match": {"id_str": "939496983510742348"}}
...
],
"minimum_should_match" : 1
}
},
"aggs" : { ... }
So I tried to register the predicted categories of the classification in the Elastic index, but as the amount of documents is really huge, it takes like half an hour (compared to less than a minute for running the classification)... which is a LOT of time just for storing the predictions.... Then I also need to query the index to et the right data for the visualization. To update the documents, I am using:
for id in docs_ids:
es.update(
index=kwargs["index"],
doc_type=kwargs["doc_type"],
id=id,
body={"doc": {
"prediction": kwargs["category"]
}}
)
Do you know an alternative to update the predictions faster?

You could use bulk query that permits you to serialize your requests and query only one time against elasticsearch executing a lot of searches.
Try:
from elasticsearch import helpers
query_list = []
list_ids = ["1","2","3"]
es = ElasticSearch("myurl")
for id in list_ids:
query_dict ={
'_op_type': 'update',
'_index': kwargs["index"],
'_type': kwargs["doc_type"],
'_id': id,
'doc': {"prediction": kwargs["category"]}
}
query_list.append(query_dict)
helpers.bulk(client=es,actions=query_list)
Please have a read here
Regarding to query the list ids, to get faster response you should't bring with you the match_string value, as you have done in the question, but the _id field. That permits you to use multiget query, a bulk query for the get operation. Here in the python library. Try:
my_ids_list = [<some_ids_here>]
es.mget(index = kwargs["index"],
doc_type = kwargs["index"],
body = {'ids': my_ids_list})

Application-side Joins Elasticsearch

I have two indexes in Elasticsearch, a system index, and a telemetry index. I'd like to perform queries and aggregations on the telemetry index using filters from the systems index. The systems index is relatively small and only receives new documents occasionally, but the telemetry index is much larger and is constantly receiving new documents. This seems like an ideal situation for using an application-side join.
I tried emulating the example query at the pervious link, but it turns out the filtered query is deprecated as of ES 5.0. (Why is this example in the current documentation?!)
Here are my queries:
GET /system/_search
{
"query": {
"match": {
"name": "George's system"
}
}
}
GET /telemetry/_search
{
"query": {
"bool":{
"must": {
"multi_match": {
"operator": "and",
"fields": ["systemId"]
, [1] }
}
}
}
}
}
The second one fails with a json_parse_exception because for some reason it doesn't like the [ ] characters after "fields".
Can anyone provide a simple example of using application-side joins?
Once such a query is defined (perhaps in Kibana's Dev Tools console) is there a way to visualize it in Kibana?

With elastic there is no way to execute two nested queries like in a relational database where the first query uses the response of the second. The example in the application-side join, means that you are actually making two queries (two different requests to elastic) on the application side.
First query you get the list of ids you need to filter on.
Second query you pass the list of ids that you got to the terms filter.
This works when you have no more than 1024 values for systemId. Because terms query has a limit on the number of terms.
Because this query is not feasible, then you can't visualize it in kibana.
In such case you have to sacrifice a little of space and add the systemId to your mapping.
Good Luck!

Elasticsearch filter vs term query for many ids

I have an index of documents connected with some product_id. And I would like to find all documents for specific ids (around 100 000 product_ids to be found and 100 million are in total in index).
Would the filter query be the fastest and best option in that case?
"query": {
"bool": {
"filter": {"terms": {"product_id": product_ids}
}
}
Or is it better to chunkify ids and use just terms query or smth else?
The question is probably kind of a duplicate, but I would be very grateful for the best practice advice (and a bit of reasoning).

After some testing and more reading I found an answer:
Filter query works much much faster as chunks with just terms query.
But making really big filter can slower getting the result a lot.
In my case, using filter query with chunks of 10 000 ids is 10 times faster, than using filter query with all 100 000 ids at once (btw, this number is already restricted in Elasticsearch 6).
Also from official elasticsearch documentation:
Potentially the amount of ids specified in the terms filter can be a lot. In this scenario it makes sense to use the terms filter’s terms lookup mechanism.
The only disadvantage to be taken into account is that filter query is stored in cache. (The cache implements an LRU eviction policy: when a cache becomes full, the least recently used data is evicted to make way for new data.)
P.S. In all cases I always used scroll.

you can use "paging" or "scrolling" feature of elastic search query for very large result sets.
Use "from - to" query : https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-from-size.html
or "scroll" query:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html
I think that "From / To" is a more efficient way to go unless you want to return thousands of results each time (which could be many many MB of data so you probably don't want that)
Edit:
You can make a query like this in bulks:
GET my_index/_search
{
"query": {
"terms": {
"_id": [ "1", "2", "3", .... "10000" ] // tune for the best array length
}
}
}
If your document Id is sequential or some other number form that you could easily order by, and have a field available you can do a "range query"
GET _search
{
"query": {
"range" : {
"document_id_that_is_a_number" : {
"gte" : 0, // bump this on each query by "lte" step factor
"lte" : 10000 // find a good number here
}
}
}
}

Querying large amounts of terms without expanding maxClauseCount

In a data flow of mine, I am trying to retrieve a subset of documents from a previous terms aggregation, but hitting the maxClauseCount limit within my ES cluster. The follow up query is along these lines:
GET dataset/_search
{
"size": 2000,
"query": {
"bool": {
"must": [
(a filter or two)...,
{
"terms":{
"otherid":[
"789e18f2-bacb-4e38-9800-bf8e4c65c206",
"8e6967aa-5b98-483e-b50f-c681c7396a6a",
...
]
}
}
]}
}
}
In my research I've come across a lookup - which sadly we can't use - as well as the ids query.
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-terms-query.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-ids-query.html
From experimentation, it appears that the ids query doesn't share the limit the terms query has (potentially it's not converted into terms clauses). Do any of you know if there's a good way to achieve similar functionality to the ids query without using the ids fields.
My version of ES is 5.0.
Thanks!

instead of using terms use the Terms filter it will solve the issue
OR
index.query.bool.max_clause_count: increase to higher value(*Not Recommended)
http://george-stathis.com/2013/10/18/setting-the-booleanquery-maxclausecount-in-elasticsearch/

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio