How to exclude a large set of ids from an Elasticsearch result? - elasticsearch

I have a lot of Products indexed in elasticsearch. I need to exclude a list of ids (that I am fetching from a SQL database), from a query in elasticsearch.
Suppose Products are stored as,
{
  "id" : "1",
  "name" : "shirt",
  "size" : "xl"
}
We show a list of recommended products to a customer based on some algorithm using Elasticsearch.
If a customer marks a product as 'Not Interested', we must not show them that product again.
We keep such products in a separate SQL table with product_id, customer_id and status 'not_interested'.
Now, while fetching recommendations for a customer at runtime, we get the list of 'not_interested' products from the SQL database and send the array of product_ids in a not filter in Elasticsearch to exclude them from the recommendations.
The problem arises when the product_ids array becomes too large.
How should I store the product_id and customer_id mapping in Elasticsearch to filter out the 'not_interested' products at runtime using Elasticsearch only?
Would it make sense to store them as nested objects or parent/child documents? Or is there some completely different way to store them so that I can exclude ids from the result efficiently?

You can exclude IDs (or any other literal strings) efficiently using a terms query.
Both Elasticsearch and Solr support this, and it is both powerful and efficient.
Elasticsearch offers this through the ids query, which is in fact a terms query on the _uid field (the _id field in recent versions). Make sure you use this query in a must_not clause within a bool query. See: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-ids-query.html
In Solr, you can use the terms query within an fq, like fq=-{!terms f=id}doc334,doc125,doc777,doc321,doc253. Note the minus sign, which indicates negation. See: http://yonik.com/solr-terms-query/

Use the "ids" query:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-ids-query.html
{
  "query": {
    "ids" : {
      "type" : "my_type",
      "values" : ["1", "4", "100"]
    }
  }
}
Wrap it inside the must_not clause of a bool query.
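A minimal sketch of that wrapping, written in Python as a plain dictionary builder so it does not depend on any client library. The function name exclude_ids_query and the example base query are hypothetical, and the optional "type" key is left out on the assumption that the cluster does not need it:

```python
def exclude_ids_query(base_query, excluded_ids):
    """Build a search body that runs base_query but drops documents
    whose _id appears in excluded_ids."""
    return {
        "query": {
            "bool": {
                # The recommendation query itself (hypothetical placeholder).
                "must": [base_query],
                # Documents matching the ids query are removed from the result.
                "must_not": [
                    {"ids": {"values": [str(i) for i in excluded_ids]}}
                ],
            }
        }
    }


body = exclude_ids_query({"match": {"name": "shirt"}}, [1, 4, 100])
```

The resulting body can be sent as the JSON payload of a _search request.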

Add a terms query under the must_not section, like the following:
{
  "must_not": [
    {
      "terms": {
        "id": [
          "1",
          "3",
          "5"
        ]
      }
    }
  ]
}
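Since the question is about very large id lists, here is a hedged Python sketch that de-duplicates the ids and refuses to build a terms clause beyond a limit. The limit of 65536 is assumed to match the default of the index.max_terms_count setting; check the value on your own cluster. The function name exclude_by_terms is hypothetical:

```python
# Assumed default of index.max_terms_count; verify against your cluster.
MAX_TERMS = 65536


def exclude_by_terms(base_query, field, values, max_terms=MAX_TERMS):
    """Build a search body that excludes documents whose `field`
    matches any of `values`."""
    # De-duplicate so repeated ids do not count against the limit.
    unique = sorted({str(v) for v in values})
    if len(unique) > max_terms:
        raise ValueError(
            f"{len(unique)} terms exceed the assumed limit of {max_terms}"
        )
    return {
        "query": {
            "bool": {
                "must": [base_query],
                "must_not": [{"terms": {field: unique}}],
            }
        }
    }


body = exclude_by_terms({"match_all": {}}, "id", ["5", "3", "1", "3"])
```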

Related

Creating histogram in Elasticsearch

I have an index with several documents. Each document has an "id" field. I want to know how many ids have a given number of documents. There can be several documents for each id, just as a store can have many transactions for each customer.
For instance, I want to get something like: "There are 5 ids with 1 document. There are 10 ids with 2 documents", and so on.
How can I write that aggregation in Elasticsearch?
I believe this would be a classic terms aggregation. Something along these lines should work for you:
GET /_search
{
  "aggs" : {
    "ids" : {
      "terms" : { "field" : "id" }
    }
  }
}
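The terms aggregation returns one bucket per id, each with a key and a doc_count. One way to get from there to the "N ids with M documents" histogram is to tally the doc_count values on the client side. The following is only a sketch: the bucket list is hand-written sample data, and in practice it would come from aggregations.ids.buckets in the response. Note that the terms aggregation returns only the top 10 buckets unless "size" is raised.

```python
from collections import Counter


def count_histogram(buckets):
    """Map 'documents per id' to 'number of ids with that many documents'."""
    return Counter(bucket["doc_count"] for bucket in buckets)


# Hand-written sample in the shape of terms aggregation buckets.
buckets = [
    {"key": "a", "doc_count": 1},
    {"key": "b", "doc_count": 2},
    {"key": "c", "doc_count": 1},
]

histogram = count_histogram(buckets)
for docs_per_id, id_count in sorted(histogram.items()):
    print(f"There are {id_count} ids with {docs_per_id} document(s)")
```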

Is it possible to check that specific data matches the query without loading it to the index?

Imagine that I have a specific data string and a specific query. The simple way to check that the query matches the data is to load the data into the Elasticsearch index and run the query online. But can I do it without putting the data into the index?
Maybe there are some open-source libraries that implement the Elasticsearch functionality offline, so I can call something like getScore(data, query)? Or is it possible to implement this using specific API endpoints?
Thanks in advance!
What you can do is to leverage the percolator type.
What this allows you to do is to store the query instead of the document and then test whether a document would match the stored query.
For instance, you first create an index with a field of type percolator that will contain your query (you also need to add in the mapping any field used by the query so ES knows what their types are):
PUT my_index
{
  "mappings": {
    "properties": {
      "query": {
        "type": "percolator"
      },
      "message": {
        "type": "text"
      }
    }
  }
}
Then you can index a real query, like this:
PUT my_index/_doc/match_value
{
  "query" : {
    "match" : {
      "message" : "bonsai tree"
    }
  }
}
Finally, you can use the percolate query to check whether the query you've just stored would match a given document:
GET /my_index/_search
{
  "query" : {
    "percolate" : {
      "field" : "query",
      "document" : {
        "message" : "A new bonsai tree in the office"
      }
    }
  }
}
So all you need to do is to only store the query (not the documents), and then you can use the percolate query to check if the documents would have been selected by the query you stored, without having to store the documents themselves.
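As a rough sketch of how a caller might use this, the Python helpers below build the percolate request body for an arbitrary document and read the ids of the stored queries that matched. The helper names are hypothetical, and the sample response is hand-written and abridged to the fields used here, not real cluster output:

```python
def percolate_request(document, field="query"):
    """Build the body of a percolate search for one candidate document."""
    return {"query": {"percolate": {"field": field, "document": document}}}


def matched_query_ids(response):
    """Return the _id of every stored query that matched the document."""
    return [hit["_id"] for hit in response["hits"]["hits"]]


body = percolate_request({"message": "A new bonsai tree in the office"})

# Hand-written, abridged sample of a search response.
sample_response = {"hits": {"hits": [{"_id": "match_value"}]}}
matches = matched_query_ids(sample_response)
```

An empty hits list would mean that none of the stored queries match the document.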

Elasticsearch - boosting fields for multi match without specifying complete field list in query

I am trying to boost fields using a multi match query without specifying the complete field list, but I cannot find out how to do it. I am searching through multiple indices on all fields, whose names I don't know at run time, but I know which ones are important.
For example, I have index A with the fields 1,2,3,4 and index B with the fields 1,5,6,7,8. I need to search across both indices through all fields, with boosting on field 1.
So far I have:
GET A,B/_search
{
  "query": {
    "multi_match" : {
      "query" : "somethingToSearch"
    }
  }
}
This goes through all fields on both indices, but I would like to have something like this (boosting a match on field 1 over the others):
GET A,B/_search
{
  "query": {
    "multi_match" : {
      "query" : "somethingToSearch",
      "fields" : ["1^5,*"]
    }
  }
}
Is there any way to do it without using bool queries?

How can I get options for filtering by a field directly from elasticsearch?

I want to populate a filtering field based on the data I have indexed inside Elasticsearch. How can I retrieve this data? For example, my documents inside index "test" and type "doc" could be
{"id":1, "tag":"foo", "name":"foothing"}
{"id":2, "tag":"bar", "name":"barthing"}
{"id":3, "tag":"foo", "name":"something"}
{"id":4, "tag":"quux", "name":"quuxthing"}
I'm looking for something like GET /test/doc/_magic?q=tag that would return [foo,bar,quux] from my data. I don't know what this is called or whether it is even possible. I don't want to load all index entries into memory and do this programmatically; I have millions of documents in the index with around a hundred different tags.
Is this possible with ES?
Yes, that's possible, and it is called a terms aggregation.
You can do it like this:
GET /test/doc/_search
{
  "size": 0,
  "aggs" : {
    "tags" : {
      "terms" : {
        "field" : "tag.keyword",
        "size": 100
      }
    }
  }
}
Note that depending on the cardinality of your tag field, you can increase/decrease the size setting (10 by default).
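To turn the aggregation result into the list of filter options, the bucket keys can be read from the response. A small Python sketch follows; the sample response is hand-written to mirror the four example documents and is abridged to the fields used here, and the function name tag_options is hypothetical:

```python
def tag_options(response, agg_name="tags"):
    """Return the distinct values collected by a terms aggregation."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return [bucket["key"] for bucket in buckets]


# Hand-written, abridged sample of an aggregation response.
sample_response = {
    "aggregations": {
        "tags": {
            "buckets": [
                {"key": "foo", "doc_count": 2},
                {"key": "bar", "doc_count": 1},
                {"key": "quux", "doc_count": 1},
            ]
        }
    }
}

options = tag_options(sample_response)
```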

Elasticsearch bulk or search

Background
I am working on an API that allows the user to pass in a list of details about a member (name, email addresses, ...). I want to use this information to match up with account records in my Elasticsearch database and return a list of potential matches.
I thought this would be as simple as doing a bool query on the fields I want, however I seem to be getting no hits.
I'm relatively new to Elasticsearch, my current _search request looks like this.
Example Query
POST /member/account/_search
{
  "query" : {
    "filtered" : {
      "filter" : {
        "bool" : {
          "should" : [{
            "term" : {
              "email": "jon.smith@gmail.com"
            }
          },{
            "term" : {
              "email": "samy@gmail.com"
            }
          },{
            "term" : {
              "email": "bo.blog@gmail.com"
            }
          }]
        }
      }
    }
  }
}
Question
How should I update this query to return records that match any of the email addresses?
Am I able to prioritise records that match email and another field? Example "family_name".
Will this be a problem if I need to do this against a few hundred email addresses?
You need to make the change on the index side rather than the query side.
By default, your email address is broken into tokens while indexing:
jon.smith@gmail.com => [jon, smith, gmail, com]
When you search using a term query, the analyzer is not applied, so the query looks for an exact match of jon.smith@gmail.com, which, as you can see, won't work.
Even if you use a match query, you will end up getting all documents as matches.
Hence you need to change the mapping to index the email address as a single token rather than tokenizing it.
Using not_analyzed would be the best solution here (in Elasticsearch 5.x and later, the equivalent is the keyword field type).
When you define the email field as not_analyzed, the following happens while indexing:
jon.smith@gmail.com => [jon.smith@gmail.com]
After changing the mapping and reindexing all your documents, you can freely run the above query.
I would suggest using a terms query, as follows:
{
  "query": {
    "terms": {
      "email": [
        "jon.smith@gmail.com",
        "samy@gmail.com",
        "bo.blog@gmail.com"
      ]
    }
  }
}
To answer the second part of your question: you are looking for boosting, and I would recommend looking into the function score query.
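As an illustration only, and not the function score approach recommended above, a simpler way to rank records that also match another field is a bool query with the terms query in a filter clause and an optional should clause on family_name. This sketch assumes an Elasticsearch version in which bool accepts a filter clause (2.0 or later) and an email field indexed as a single token. The function name is hypothetical:

```python
def email_match_query(emails, family_name=None):
    """Match any of the given emails. If family_name is given,
    records that also match it score higher."""
    bool_query = {
        # Filter context: a record must match at least one email.
        "filter": [{"terms": {"email": list(emails)}}]
    }
    if family_name:
        # Optional clause: it only raises the score of records that match it.
        bool_query["should"] = [{"match": {"family_name": family_name}}]
    return {"query": {"bool": bool_query}}


body = email_match_query(
    ["jon.smith@gmail.com", "samy@gmail.com", "bo.blog@gmail.com"],
    family_name="smith",
)
```

Because the bool query has a filter clause, the should clause is optional by default, so it affects the ordering of results without excluding any record.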
