Find all ID where ID are not in my blacklist - elasticsearch

After a lot of reading, I still cannot tell whether this kind of query is possible with Elasticsearch. I found the "getting started" section really excellent, but from my point of view the rest of the guide lacks examples.
See my structure below. I need to retrieve all ids that are not in my blacklist. My blacklist is an array of reference ids. In this example I am id 1, with the firstname "me". In the structure you can see that I blacklisted "bob", so bob's id (2) is in my blacklist array because I don't want to find bob in my search results.
Is it possible to retrieve (dynamically, of course) all ids that are not in my blacklist in a single query?
If you come from SQL, the same logic could be:
SELECT id FROM index WHERE id NOT IN (SELECT * FROM blacklist WHERE id = 1)
I would like to avoid a two-step query. If my schema is bad and should be reconsidered, I'm totally open to advice or suggestions.
Here is the structure:
{
"id": 1,
"balance": 16623,
"firstname": "me",
"blacklist" : [2,1982,939,1982,98716,7611,983838, and thousands others ....]
}
{
"id": 2,
"balance": 16623,
"firstname": "bob",
"blacklist" : [18,1982,939,1982,98716,7611,983838, and thousands others ....]
}
{
"id": 3,
"balance": 16623,
"firstname": "john",
"blacklist" : [18,1982,939,1982,98716,7611,983838, and thousands others ....]
}

You can use a terms filter lookup together with a not filter, as follows.
I set up the index with the three docs you have listed:
DELETE /test_index
PUT /test_index
PUT /test_index/doc/1
{
"id": 1,
"balance": 16623,
"firstname": "me",
"blacklist" : [2,1982,939,1982,98716,7611,983838]
}
PUT /test_index/doc/2
{
"id": 2,
"balance": 16623,
"firstname": "bob",
"blacklist" : [18,1982,939,1982,98716,7611,983838]
}
PUT /test_index/doc/3
{
"id": 3,
"balance": 16623,
"firstname": "john",
"blacklist" : [18,1982,939,1982,98716,7611,983838]
}
Then set up a query that filters out docs that are in the blacklist for "me":
POST /test_index/doc/_search
{
"filter": {
"not": {
"filter": {
"terms": {
"id": {
"index": "test_index",
"type": "doc",
"id": "1",
"path": "blacklist"
}
}
}
}
}
}
...
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "test_index",
"_type": "doc",
"_id": "1",
"_score": 1,
"_source": {
"id": 1,
"balance": 16623,
"firstname": "me",
"blacklist": [2,1982,939,1982,98716,7611,983838]
}
},
{
"_index": "test_index",
"_type": "doc",
"_id": "3",
"_score": 1,
"_source": {
"id": 3,
"balance": 16623,
"firstname": "john",
"blacklist": [18,1982,939,1982,98716,7611,983838]
}
}
]
}
}
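To make the semantics of that query concrete, here is a minimal Python sketch that mimics it in memory: look up the blacklist stored on one document, then keep every document whose id is not in that list. The helper name ids_not_blacklisted and the in-memory docs list are illustrative assumptions, not part of Elasticsearch.

```python
# In-memory stand-in for the three documents indexed above.
docs = [
    {"id": 1, "firstname": "me", "blacklist": [2, 1982, 939, 1982, 98716, 7611, 983838]},
    {"id": 2, "firstname": "bob", "blacklist": [18, 1982, 939, 1982, 98716, 7611, 983838]},
    {"id": 3, "firstname": "john", "blacklist": [18, 1982, 939, 1982, 98716, 7611, 983838]},
]


def ids_not_blacklisted(docs, owner_id):
    """Hypothetical helper: mimic a 'not' filter around a terms lookup."""
    # Step 1 (the lookup): fetch the blacklist stored on the owner's document.
    owner = next(d for d in docs if d["id"] == owner_id)
    blocked = set(owner["blacklist"])
    # Step 2 (the 'not'): keep every document whose id is not a looked-up term.
    return [d["id"] for d in docs if d["id"] not in blocked]


print(ids_not_blacklisted(docs, 1))  # [1, 3]
```

As in the response above, the owner (id 1) is still returned, because an id is only excluded when it appears in the looked-up blacklist.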
If you also want to filter out the user whose blacklist is being used, you can set up a slightly more complex filter using or:
POST /test_index/doc/_search
{
"filter": {
"not": {
"filter": {
"or": {
"filters": [
{
"terms": {
"id": {
"index": "test_index",
"type": "doc",
"id": "1",
"path": "blacklist"
}
}
},
{
"term": {
"id": "1"
}
}
]
}
}
}
}
}
...
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "test_index",
"_type": "doc",
"_id": "3",
"_score": 1,
"_source": {
"id": 3,
"balance": 16623,
"firstname": "john",
"blacklist": [18,1982,939,1982,98716,7611,983838]
}
}
]
}
}
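The not and or filters used above belong to older Elasticsearch releases; later versions dropped them in favour of the bool query. Assuming such a newer cluster, the sketch below builds an equivalent request body in Python with bool and must_not. The function name build_blacklist_query is hypothetical, the "type" key in the lookup only applies to versions that still have mapping types, and the body has not been run against the asker's cluster.

```python
import json


def build_blacklist_query(index, owner_id, exclude_owner=True, doc_type=None):
    """Hypothetical builder for a bool/must_not query around a terms lookup."""
    # Terms lookup: read the values to exclude from the owner's "blacklist" field.
    lookup = {"index": index, "id": str(owner_id), "path": "blacklist"}
    if doc_type is not None:
        lookup["type"] = doc_type  # only for versions with mapping types
    must_not = [{"terms": {"id": lookup}}]
    if exclude_owner:
        # Plays the role of the extra term clause inside the "or" filter above.
        must_not.append({"term": {"id": owner_id}})
    return {"query": {"bool": {"must_not": must_not}}}


body = build_blacklist_query("test_index", 1, doc_type="doc")
print(json.dumps(body, indent=2))
```

Every clause listed under must_not excludes documents on its own, so the list behaves like the or inside the not filter.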
Here is the code I used:
http://sense.qbox.io/gist/0b6808414f9447d4f7d23eb4c0d3e937ec2ea4e7

Related

Find duplicates by id in two indices in Elasticsearch

I store my data in several week-based indices that follow the template myindex-2022-weekOfYear.
How can I find all duplicates by id across these indices?
I've tried to use aggregations (based on other questions here):
GET myindex-*/_search
{
"stored_fields": [
"myKey"
],
"size": 100,
"aggs": {
"duplicateNames": {
"terms": {
"field": "myKey",
"min_doc_count": 2
},
"aggs": {
"duplicateDocuments": {
"top_hits": {}
}
}
}
}
}
But it looks like that query is not working properly: searching for a single document by id (taken from the query result) returns a hit in only one index, so I assume min_doc_count is not working as I expect.
EDIT:
I see this in the response:
"genres" : {
"doc_count_error_upper_bound" : 530,
"sum_other_doc_count" : 357290963,
"buckets" : [ ]
}
So shard_size is probably too low (and I can't really increase it, due to ES resources).
TL;DR
I could not find out why this is not working for you, but I made a POC which shows the query working correctly
(on a rather small data set).
POC
POST _bulk
{"index": {"_index": "74473038-0", "_id": "1"}}
{"data": "some dummy data", "id": 1}
{"index": {"_index": "74473038-1", "_id": "1"}}
{"data": "some dummy data", "id": 1}
{"index": {"_index": "74473038-2", "_id": "1"}}
{"data": "some dummy data", "id": 1}
{"index": {"_index": "74473038-3", "_id": "1"}}
{"data": "some dummy data", "id": 1}
{"index": {"_index": "74473038-0", "_id": "2"}}
{"data": "some dummy data", "id": 2}
{"index": {"_index": "74473038-2", "_id": "2"}}
{"data": "some dummy data", "id": 2}
{"index": {"_index": "74473038-0", "_id": "3"}}
{"data": "some dummy data", "id": 3}
{"index": {"_index": "74473038-1", "_id": "3"}}
{"data": "some dummy data", "id": 3}
{"index": {"_index": "74473038-3", "_id": "3"}}
{"data": "some dummy data", "id": 3}
{"index": {"_index": "74473038-0", "_id": "4"}}
{"data": "some dummy data", "id": 4}
{"index": {"_index": "74473038-2", "_id": "4"}}
{"data": "some dummy data", "id": 4}
{"index": {"_index": "74473038-3", "_id": "4"}}
{"data": "some dummy data", "id": 4}
{"index": {"_index": "74473038-0", "_id": "5"}}
{"data": "some dummy data", "id": 5}
GET 74473038-*/_search
{
"size": 0,
"aggs": {
"duplicateNames": {
"terms": {
"field": "id",
"min_doc_count": 2
},
"aggs": {
"duplicateDocuments": {
"top_hits": {}
}
}
}
}
}
As expected, I get the documents with ids 1, 2, 3 and 4; id 5 is omitted.
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 4,
"successful": 4,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 13,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"duplicateNames": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": 1,
"doc_count": 4,
"duplicateDocuments": {
"hits": {
"total": {
"value": 4,
"relation": "eq"
},
"max_score": 1,
"hits": [
{
"_index": "74473038-0",
"_id": "1",
"_score": 1,
"_source": {
"data": "some dummy data",
"id": 1
}
},
{
"_index": "74473038-1",
"_id": "1",
"_score": 1,
"_source": {
"data": "some dummy data",
"id": 1
}
},
{
"_index": "74473038-3",
"_id": "1",
"_score": 1,
"_source": {
"data": "some dummy data",
"id": 1
}
}
]
}
}
},
{
"key": 3,
"doc_count": 3,
"duplicateDocuments": {
"hits": {
"total": {
"value": 3,
"relation": "eq"
},
"max_score": 1,
"hits": [
{
"_index": "74473038-0",
"_id": "3",
"_score": 1,
"_source": {
"data": "some dummy data",
"id": 3
}
},
{
"_index": "74473038-1",
"_id": "3",
"_score": 1,
"_source": {
"data": "some dummy data",
"id": 3
}
},
{
"_index": "74473038-3",
"_id": "3",
"_score": 1,
"_source": {
"data": "some dummy data",
"id": 3
}
}
]
}
}
},
{
"key": 4,
"doc_count": 3,
"duplicateDocuments": {
"hits": {
"total": {
"value": 3,
"relation": "eq"
},
"max_score": 1,
"hits": [
{
"_index": "74473038-3",
"_id": "4",
"_score": 1,
"_source": {
"data": "some dummy data",
"id": 4
}
},
{
"_index": "74473038-2",
"_id": "4",
"_score": 1,
"_source": {
"data": "some dummy data",
"id": 4
}
},
{
"_index": "74473038-0",
"_id": "4",
"_score": 1,
"_source": {
"data": "some dummy data",
"id": 4
}
}
]
}
}
},
{
"key": 2,
"doc_count": 2,
"duplicateDocuments": {
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": 1,
"hits": [
{
"_index": "74473038-2",
"_id": "2",
"_score": 1,
"_source": {
"data": "some dummy data",
"id": 2
}
},
{
"_index": "74473038-0",
"_id": "2",
"_score": 1,
"_source": {
"data": "some dummy data",
"id": 2
}
}
]
}
}
}
]
}
}
}
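For reference, the bucket counts in that response can be reproduced with a small Python sketch that counts ids across the POC indices and keeps those seen at least min_doc_count times. The function name duplicate_ids is an illustrative assumption. The sketch counts exactly over all documents, so it does not model the per-shard approximation that doc_count_error_upper_bound reports on larger data sets.

```python
from collections import Counter

# (index, id) pairs copied from the bulk request above.
indexed = [
    ("74473038-0", 1), ("74473038-1", 1), ("74473038-2", 1), ("74473038-3", 1),
    ("74473038-0", 2), ("74473038-2", 2),
    ("74473038-0", 3), ("74473038-1", 3), ("74473038-3", 3),
    ("74473038-0", 4), ("74473038-2", 4), ("74473038-3", 4),
    ("74473038-0", 5),
]


def duplicate_ids(pairs, min_doc_count=2):
    """Hypothetical helper: exact counterpart of terms + min_doc_count."""
    counts = Counter(doc_id for _, doc_id in pairs)
    buckets = [(key, n) for key, n in counts.items() if n >= min_doc_count]
    # Highest doc_count first; ties broken by ascending key.
    return sorted(buckets, key=lambda bucket: (-bucket[1], bucket[0]))


print(duplicate_ids(indexed))  # [(1, 4), (3, 3), (4, 3), (2, 2)]
```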

Elasticsearch: search score puzzles me. Same score for different match levels

To simplify:
PUT /test/vendors/1
{
"type": "doctor",
"name": "Ron",
"place": "Boston"
}
PUT /test/vendors/2
{
"type": "doctor",
"name": "Tom",
"place": "Boston"
}
PUT /test/vendors/3
{
"type": "doctor",
"name": "Jack",
"place": "San Fran"
}
Then search:
GET /test/_search
{
"query": {
"multi_match" : {
"query": "doctor in Boston",
"fields": [ "type", "place" ]
}
}
}
I understand why I get Jack who works in San Fran -- it's because he's a doctor too. However, I can't figure out why the match score is the SAME for him. The other two were matched on the place too, weren't they? Why aren't Ron and Tom scored higher?
{
"took": 11,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 0.9245277,
"hits": [
{
"_index": "test",
"_type": "vendors",
"_id": "2",
"_score": 0.9245277,
"_source": {
"type": "doctor",
"name": "Tom",
"place": "Boston"
}
},
{
"_index": "test",
"_type": "vendors",
"_id": "1",
"_score": 0.9245277,
"_source": {
"type": "doctor",
"name": "Ron",
"place": "Boston"
}
},
{
"_index": "test",
"_type": "vendors",
"_id": "3",
"_score": 0.9245277,
"_source": {
"type": "doctor",
"name": "Jack",
"place": "San Fran"
}
}
]
}
}
Is there a way to force a lower score when fewer search keywords are found? Also, if I'm going the wrong way about this kind of search and there's a better pattern/way to do it -- I'd appreciate being pointed in the right direction.
Your search structure is incorrect. The search query above ignores the place property, which is why you get the same score for all documents (only the type property is taken into account). The reason is that works_at is a nested mapping, which has to be treated differently when searching.
First, you should define works_at as a nested mapping (read more here). Then you'll have to adjust your query to work with that nested mapping, see an example here.
GET /test/_search
{
"query": {
"multi_match" : {
"query": "doctor in Boston",
"fields": [ "type", "place" ],
"type": "most_fields"
}
}
}
Once I added "type": "most_fields", which I was missing, I got the correct results: the "San Fran" guy is scored lower.
{
"took": 8,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 1.2122098,
"hits": [
{
"_index": "test",
"_type": "vendors",
"_id": "2",
"_score": 1.2122098,
"_source": {
"type": "doctor",
"name": "Tom",
"place": "Boston"
}
},
{
"_index": "test",
"_type": "vendors",
"_id": "1",
"_score": 1.2122098,
"_source": {
"type": "doctor",
"name": "Ron",
"place": "Boston"
}
},
{
"_index": "test",
"_type": "vendors",
"_id": "3",
"_score": 0.9245277,
"_source": {
"type": "doctor",
"name": "Jack",
"place": "San Fran"
}
}
]
}
}
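The difference between the two responses can be sketched in Python. By default multi_match uses the best_fields type, which keeps only the highest-scoring field (plus tie_breaker times the others, with tie_breaker defaulting to 0.0), while most_fields adds the field scores together. The per-field scores below are an assumption inferred from the two responses above (0.2876821 is simply 1.2122098 minus 0.9245277); they are not taken from an explain output.

```python
# Per-field scores inferred from the two responses above (assumption).
TYPE_SCORE = 0.9245277   # "doctor" matches the type field of every document
PLACE_SCORE = 0.2876821  # "Boston" matches the place field of Tom and Ron only

field_scores = {
    "Tom": {"type": TYPE_SCORE, "place": PLACE_SCORE},
    "Ron": {"type": TYPE_SCORE, "place": PLACE_SCORE},
    "Jack": {"type": TYPE_SCORE},  # "San Fran" does not match "Boston"
}


def best_fields(scores, tie_breaker=0.0):
    """Score of the best field, plus tie_breaker times the remaining fields."""
    best = max(scores.values())
    others = sum(scores.values()) - best
    return round(best + tie_breaker * others, 7)


def most_fields(scores):
    """Sum of the scores of all matching fields."""
    return round(sum(scores.values()), 7)


for name, scores in field_scores.items():
    print(name, best_fields(scores), most_fields(scores))
```

With tie_breaker left at 0.0 all three documents share the best_fields score, which is the behaviour the question describes; a non-zero tie_breaker would also separate them.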

Update elastic search data with new key-value pair

{
"took": 0,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 25,
"max_score": 1,
"hits": [
{
"_index": "surtest1",
"_type": "catalog",
"_id": "prod_9876561740",
"_score": 1,
"_source": {
"id": "01",
"type": "product"
}
},
{
"_index": "surtest1",
"_type": "catalog",
"_id": "prod_9876543375",
"_score": 1,
"_source": {
"id": "02",
"type": "product"
}
}
]
}
}
This is a sample JSON response from a search in Elasticsearch.
We need to add one more key-value pair ("spec": "4gb") to all the JSON objects, like this:
"_source": {
"id": "01",
"type": "product" ,
"spec": "4gb"
},
"_source": {
"id": "02",
"type": "product" ,
"spec": "4gb"
}
This update should be done in a single command. Please guide us on how to perform this operation.
Try
POST /surtest1/_update_by_query?refresh
{
"script": {
"source": "ctx._source['spec']='4gb'"
}
}
Take a look at the Update By Query API. You can prepare a query that matches all documents and use scripting to add the property you want.
Example:
POST twitter/_update_by_query
{
"script": {
"source": "ctx._source.likes++",
"lang": "painless"
},
"query": {
"term": {
"user": "kimchy"
}
}
}
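Applied to the question, a request body for _update_by_query could be assembled as in the Python sketch below, which also mimics locally what the script does to each _source. The helper names build_add_field_request and apply_locally are hypothetical, and passing the value through params rather than hard-coding it in the script is my own choice, not something the answers above prescribe.

```python
import copy
import json


def build_add_field_request(value):
    """Hypothetical builder: body for POST /surtest1/_update_by_query."""
    return {
        "script": {
            "source": "ctx._source.spec = params.spec",
            "lang": "painless",
            "params": {"spec": value},
        },
        # match_all targets every document in the index.
        "query": {"match_all": {}},
    }


def apply_locally(sources, value):
    """Mimic the script's effect on in-memory copies of the _source objects."""
    updated = copy.deepcopy(sources)
    for source in updated:
        source["spec"] = value
    return updated


sources = [{"id": "01", "type": "product"}, {"id": "02", "type": "product"}]
print(json.dumps(build_add_field_request("4gb"), indent=2))
print(apply_locally(sources, "4gb"))
```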

Elasticsearch - Search with wildcards

I've managed to populate my index with 4 documents using this bulk request:
POST localhost:9200/titles/movies/_bulk
{"index":{"_id":"1"}}
{"id": "1","level": "first","titles": [{"value": "The Bad and the Beautiful","type": "Catalogue","main": true},{"value": "The Bad and the Beautiful (1945)","type": "International","main": false}]}
{"index":{"_id":"2"}}
{"id": "2","level": "first","titles": [{"value": "Bad Day at Black Rock","type": "Drama","main": true}]}
{"index":{"_id":"3"}}
{"id": "3","level": "second","titles": [{"value": "Baker's Wife","type": "AnotherType","main": true},{"value": "Baker's Wife (1940)","type": "Trasmitted","main": false}]}
{"index":{"_id":"4"}}
{"id": "4","level": "second","titles": [{"value": "Bambi","type": "Educational","main": true},{"value": "The Baby Deer and the hunter (1942)","type": "Fantasy","main": false}]}
Now how can I perform searches with wildcards on all available titles?
Something like
localhost:9200/titles/movies/_search?q=*&sort=level:asc
but providing one or more wildcards. For instance, searching for "The % the %" and parsing the response from Elasticsearch to eventually return something like:
{
"count":2,
"results":[{
"id":"1",
"level":"first",
"foundInTitleTypes":["Catalogue","International"]
},{
"id":"4",
"level":"second",
"foundInTitleTypes":["Fantasy"]
}]
}
Thanks!
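A regular match query on titles.value returns the documents you are after here. Note that match does not treat * as a wildcard: with the standard analyzer the * characters are dropped, so the query effectively matches titles that contain "the".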
GET titles/movies/_search
{
"query": {
"match" : { "titles.value" : "The * the *" }
}
}
Gives you this
{
"took": 4,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1.6406528,
"hits": [
{
"_index": "titles",
"_type": "movies",
"_id": "4",
"_score": 1.6406528,
"_source": {
"id": "4",
"level": "second",
"titles": [
{
"value": "Bambi",
"type": "Educational",
"main": true
},
{
"value": "The Baby Deer and the hunter (1942)",
"type": "Fantasy",
"main": false
}
]
}
},
{
"_index": "titles",
"_type": "movies",
"_id": "1",
"_score": 0.9026783,
"_source": {
"id": "1",
"level": "first",
"titles": [
{
"value": "The Bad and the Beautiful",
"type": "Catalogue",
"main": true
},
{
"value": "The Bad and the Beautiful (1945)",
"type": "International",
"main": false
}
]
}
}
]
}
}
Regarding the URI search in your question: I'm not sure it is possible that way. If you do it with curl, you just pass the query DSL as the request data:
curl localhost:9200/titles/movies/_search -d '{"query":{"match":{"titles.value":"The * the *"}}}'
{"took":46,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":2,"max_score":1.6406528,"hits":[{"_index":"titles","_type":"movies","_id":"4","_score":1.6406528,"_source":{"id": "4","level": "second","titles": [{"value": "Bambi","type": "Educational","main": true},{"value": "The Baby Deer and the hunter (1942)","type": "Fantasy","main": false}]}},{"_index":"titles","_type":"movies","_id":"1","_score":0.9026783,"_source":{"id": "1","level": "first","titles": [{"value": "The Bad and the Beautiful","type": "Catalogue","main": true},{"value": "The Bad and the Beautiful (1945)","type": "International","main": false}]}}]}}
Update to latest question:
Well, if you want to sort by level, you need to provide a mapping for Elasticsearch. What I did:
Delete index
DELETE titles
Add mapping
PUT titles
{
"settings": {
"number_of_shards": 1
},
"mappings": {
"movies": {
"properties": {
"level": {
"type": "keyword"
}
}
}
}
}
Refine Query DSL
GET titles/movies/_search
{
"_source": [
"id",
"level",
"titles.value"
],
"sort": [
{
"level": {
"order": "asc"
}
}
],
"query": {
"match": {
"titles.value": "The * the *"
}
}
}
That gives me
{
"took": 4,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 2,
"max_score": null,
"hits": [
{
"_index": "titles",
"_type": "movies",
"_id": "1",
"_score": null,
"_source": {
"level": "first",
"id": "1",
"titles": [
{
"value": "The Bad and the Beautiful"
},
{
"value": "The Bad and the Beautiful (1945)"
}
]
},
"sort": [
"first"
]
},
{
"_index": "titles",
"_type": "movies",
"_id": "4",
"_score": null,
"_source": {
"level": "second",
"id": "4",
"titles": [
{
"value": "Bambi"
},
{
"value": "The Baby Deer and the hunter (1942)"
}
]
},
"sort": [
"second"
]
}
]
}
}
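The parsing step the question asks about (turning hits into a count plus the title types in which the pattern was found) can be sketched client-side in Python. This is a minimal sketch under assumptions: the helper name summarize is hypothetical, the % wildcard from the question is translated to * and matched with fnmatch on the full title, and the sources list stands in for the _source objects of the four indexed documents.

```python
from fnmatch import fnmatchcase

# _source of the four documents from the bulk request above.
sources = [
    {"id": "1", "level": "first", "titles": [
        {"value": "The Bad and the Beautiful", "type": "Catalogue", "main": True},
        {"value": "The Bad and the Beautiful (1945)", "type": "International", "main": False}]},
    {"id": "2", "level": "first", "titles": [
        {"value": "Bad Day at Black Rock", "type": "Drama", "main": True}]},
    {"id": "3", "level": "second", "titles": [
        {"value": "Baker's Wife", "type": "AnotherType", "main": True},
        {"value": "Baker's Wife (1940)", "type": "Trasmitted", "main": False}]},
    {"id": "4", "level": "second", "titles": [
        {"value": "Bambi", "type": "Educational", "main": True},
        {"value": "The Baby Deer and the hunter (1942)", "type": "Fantasy", "main": False}]},
]


def summarize(sources, pattern):
    """Hypothetical helper: list the title types whose value matches the pattern."""
    glob = pattern.replace("%", "*")  # SQL-style wildcard to glob-style
    results = []
    for src in sources:
        types = [t["type"] for t in src["titles"] if fnmatchcase(t["value"], glob)]
        if types:
            results.append({"id": src["id"], "level": src["level"],
                            "foundInTitleTypes": types})
    results.sort(key=lambda r: r["level"])  # same ordering as sort=level:asc
    return {"count": len(results), "results": results}


print(summarize(sources, "The % the %"))
```

For the four sample documents this yields the shape requested in the question: document 1 with both of its title types and document 4 with "Fantasy" only.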

Elasticsearch: facet or aggregation returning doc counts over multiple fields

I have an Elasticsearch document structure for which I'd like a terms facet (or aggregation) that gives the number of documents containing each term, regardless of the field in which the term appears.
For example, the following result shows both the documents and the faceted search result:
{
"_shards": {
"failed": 0, "successful": 5, "total": 5
},
"hits": {
"hits": [
{
"_id": "003", "_index": "test", "_score": 1.0, "_type": "test",
"_source": {
"root": {
"content": [
"five",
"five",
"five"
],
"title": "four"
}
}
},
{
"_id": "002", "_index": "test", "_score": 1.0, "_type": "test",
"_source": {
"root": {
"content": "two three",
"title": "three"
}
}
},
{
"_id": "001", "_index": "test", "_score": 1.0, "_type": "test",
"_source": {
"root": {
"content": "one two",
"title": "one"
}
}
}
],
"max_score": 1.0, "total": 3
},
"facets": {
"terms": {
"_type": "terms", "missing": 0, "other": 0,
"terms": [
{
"count": 2,
"term": "two"
},
{
"count": 2,
"term": "three"
},
{
"count": 2,
"term": "one"
},
{
"count": 1,
"term": "four"
},
{
"count": 1,
"term": "five"
}
],
"total": 8
}
},
"timed_out": false,
"took": 18,
}
We can see that the terms "one" and "three" have counts of 2 (once for each field of the same doc) where I would like them to have a count of 1. The only term with a count of 2 should be "two".
I looked into aggregation to see if it could help but it doesn't seem to work with multiple fields (or I have missed something).
It would have been nice to build a "terms" facet on "root" rather than the individual fields... but that doesn't seem possible either.
Any ideas on how to work this out?
You can use a script in a terms aggregation to achieve this.
Inside the script, collect the tokens from both fields, take their set union, and return the set.
{
"aggs" : {
"genders" : {
"terms" : {
"script" : "union(doc['content'].values, doc['title'].values) "
}
}
}
}
You need to work out how to express the union operation in whichever scripting language you use.
Alternatively, you could add a new field that holds the unique terms from both the content and title fields, and run the facet or aggregation on it.
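The union idea behind both suggestions can be illustrated with a Python sketch that works on the three sample documents from the question: per document, take the set of tokens from content and title together, then count each term once per document. The helper names tokens and doc_counts are hypothetical, and splitting on whitespace is an assumption standing in for whatever analyzer the index uses.

```python
from collections import Counter

# The "root" objects of the three sample documents from the question.
docs = {
    "001": {"content": "one two", "title": "one"},
    "002": {"content": "two three", "title": "three"},
    "003": {"content": ["five", "five", "five"], "title": "four"},
}


def tokens(value):
    """Lowercased whitespace tokens of a string or of a list of strings."""
    values = value if isinstance(value, list) else [value]
    return {tok for v in values for tok in v.lower().split()}


def doc_counts(docs, fields=("content", "title")):
    """Count, for every term, the documents that contain it in any field."""
    counts = Counter()
    for source in docs.values():
        union = set()
        for field in fields:
            union |= tokens(source.get(field, []))
        counts.update(union)  # each term counted once per document
    return counts


print(sorted(doc_counts(docs).items()))
# [('five', 1), ('four', 1), ('one', 1), ('three', 1), ('two', 2)]
```

Storing that per-document union in its own field at index time corresponds to the second suggestion, after which a plain terms facet or aggregation on the new field gives the same counts.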

Resources